1 thought on “The high-end FPGA of the top three hegemony (1)”

  1. Intel announced earlier that it had begun shipping its first new Agilex FPGAs to early-access customers. This moves the competition between the two largest FPGA suppliers into a head-to-head stage. Xilinx delivered its first "Versal ACAP" FPGAs in June, so after a long and contentious "who can deliver first?" race, it turns out that the two competitors began shipping directly competing FPGA product lines within two months of each other. This means that, unlike earlier rounds in which one vendor pulled ahead by being first to an advanced process node, neither company has had enough time to use a newer, more advanced technology to rack up design wins.
    Meanwhile, the field of competition has expanded: new player Achronix says it will deliver its first Speedster 7T FPGA samples by the end of this year. For development teams, this means that by the end of this year there will be three completely different high-end FPGA families to choose from, all built on similar process technology and each with unique features.
    This article is the first in a multi-part series on the new high-end FPGA families from these three suppliers. We will look at the underlying process technology, the FPGA logic fabric (LUTs) itself, the hardened resources for accelerating processing and networking, memory architecture, chip/package/customization architecture, I/O resources, design tool strategy, the unique and novel features of each product, and marketing strategy. If you enjoy huge numbers of FLOPS, crazy bandwidth, or some of the most interesting and powerful semiconductor devices around, this will be an exciting trip for you.
    Disclosure: Intel and Achronix participated in and provided information for this article. Xilinx did not respond to our request for information.
    This time, the battle for high-end FPGA supremacy is different. In the past, the largest market for high-end FPGAs was networking, and market share shifted mainly according to who could win the most designs from customers deploying the latest round of wired and wireless networks. However, the timing of 5G has changed that dynamic. The 5G rollout began to accelerate before the current wave of FPGA technology arrived, so the first round of 5G backbone networks was built on the previous generation of programmable logic. The new devices will be integrated into an existing 5G ecosystem, so it is not clear that the 5G revolution and the birth of this new generation of FPGAs line up the way earlier cycles did. These FPGAs have, however, been designed with a full understanding of how 5G works. Don't underestimate the importance of FPGAs to 5G, or of 5G to the FPGA market. Today, when you use your mobile phone, chances are about 99% that your call passes through FPGAs. With 5G, the influence of FPGAs will be even greater.
    The rapidly expanding emerging market for data center acceleration (mainly for AI workloads) has also attracted attention. The AI acceleration market is projected to grow rapidly over the next few years, so all three suppliers are competing for a large share of those sockets on the strength of their price-performance and energy efficiency, and each claims that its solution can scale out to the edge/endpoint. Each of these suppliers clearly recognizes the urgency of capturing these AI accelerator card slots, and each has designed its new chips around that idea.
    So, how do the three compare on all these factors?
    In terms of underlying process technology, the Xilinx and Achronix FPGA families are built on TSMC 7nm, while Intel Agilex uses the comparable Intel 10nm process. Don't be confused by the 7/10 naming difference. We pointed out long ago that marketing groups in the semiconductor industry name nodes according to what sounds good in the market, rather than deriving them from any identifiable feature of the transistors themselves. By our estimate, TSMC 7nm and Intel 10nm are roughly equivalent processes, and vendors using either one are on essentially equal footing. This means that Intel's long-held lead in process technology seems to have evaporated, but as we approach the limits of Moore's law, closer competition in silicon processing was inevitable.
    By moving to the latest semiconductor process nodes, all three suppliers have gained a modest boost. However, that boost no longer meets the historical standard of Moore's law, because the gains delivered by each of the past several process nodes have been steadily declining. Everyone got a one-time bump from FinFET technology. Now, as Moore's law nears its economic end, we can expect the diminishing marginal returns to continue.
    In the past, as transistors shrank, each new process node brought a large increase in transistor density along with better performance and lower power consumption. Now, suppliers must trade off among the three, and even on their preferred metric they usually get only a smaller return. At the same time, the non-recurring cost of moving to a new process node continues to grow exponentially. This means that the risk borne by FPGA companies has risen sharply, because to stay competitive they must keep investing to obtain continued gains. It also means that we are entering a new era in which the architecture and features of the FPGAs themselves, the FPGA tools, and the marketing strategies of these three companies will become the key factors in who comes out ahead, rather than who is first to a new process node.
    With the process question settled, let's look at the features and capabilities of each supplier's products, starting with the most basic FPGA element: the LUT structure. We often lament that every company counts LUTs differently, and the game becomes more complicated with each generation. Xilinx and Achronix currently use 6-input LUTs, while Intel's ALM is essentially an 8-input LUT. Vendors have agreed to use 2.2 LUT4s per LUT6 and 2.99 LUT4s per LUT8 to convert different LUTs to 4-input LUT equivalents.
    Calculated this way, the Achronix Speedster 7T family leads the industry with 363K to 2.6M LUT6 (equivalent to 800K to 5.76M LUT4). The Intel Agilex family has 132K to 912K ALMs (equivalent to 395K to 2.7M LUT4), and Xilinx's Versal family has about 246K to 984K CLBs (equivalent to 541K to 2.2M LUT4). Each supplier claims that its own architecture is superior, emphasizing better logic density, performance, or routability in certain applications or configurations. At present, we do not know whether any supplier's LUT is significantly better than any other's.
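The conversion described above can be checked with a few lines of Python. This is only a sketch using the figures quoted in the article; the function and dictionary names are illustrative, and treating the Versal "CLB" figure as a LUT6 count is an assumption based on the 2.2 factor the quoted equivalents imply. Note that 2.6M × 2.2 gives 5.72M rather than the quoted 5.76M, presumably because the 2.6M figure is itself rounded.

```python
# LUT4-equivalents per native cell, as quoted in the article.
LUT4_PER_LUT6 = 2.2
LUT4_PER_LUT8 = 2.99

# (smallest, largest) native logic-cell counts per family, from the article,
# plus the conversion factor assumed for that family.
FAMILIES = {
    "Achronix Speedster 7T": (363_000, 2_600_000, LUT4_PER_LUT6),
    "Intel Agilex": (132_000, 912_000, LUT4_PER_LUT8),
    "Xilinx Versal": (246_000, 984_000, LUT4_PER_LUT6),  # assumed LUT6 count
}


def lut4_equivalent(native_count, factor):
    """Convert a native LUT/ALM count to equivalent 4-input LUTs."""
    return native_count * factor


def equivalent_ranges(families=FAMILIES):
    """Return {family: (low_lut4, high_lut4)} for every family."""
    return {
        name: (lut4_equivalent(low, factor), lut4_equivalent(high, factor))
        for name, (low, high, factor) in families.items()
    }


if __name__ == "__main__":
    for name, (low, high) in equivalent_ranges().items():
        print(f"{name}: {low / 1e3:.0f}K to {high / 1e6:.2f}M equivalent LUT4")
```

The point of normalizing is only to put the three datasheets on one axis; as the article notes, it says nothing about which LUT is actually better.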
    However, an FPGA's usable capacity does not depend on LUT count alone. You must also consider two other factors: the percentage of LUTs that can be used effectively (we will discuss design tools later), and the amount of hardened functionality integrated alongside the logic, which allows functions to be implemented with little or no LUT involvement. Depending on your design, you may find that how much you can fit into one FPGA or another has little to do with LUT count.
    The main reason FPGAs are "good at" AI inference is that they can perform a huge number of arithmetic operations in parallel (mainly multiply-accumulates at various precisions), thanks to large arrays of "DSP blocks" woven into the programmable logic fabric. This lets FPGAs perform matrix operations such as convolutions far more efficiently than a traditional von Neumann architecture.
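To make the multiply-accumulate point concrete, here is a small Python sketch of a sliding dot product of the kind used in convolutional layers. The function names are illustrative, and the code runs sequentially; it only shows the arithmetic. The relevant property is that every output element is independent, so hardware with thousands of multipliers can compute many of them at once.

```python
def mac_dot(weights, activations):
    """Dot product written as a chain of multiply-accumulate (MAC) steps."""
    acc = 0
    for w, a in zip(weights, activations):
        acc += w * a  # one MAC: a multiply followed by an accumulate
    return acc


def conv1d_valid(signal, kernel):
    """Sliding dot product over every position where the kernel fits.

    Each output depends only on its own window of the input, which is
    what makes the workload easy to spread across parallel multipliers.
    """
    k = len(kernel)
    return [mac_dot(kernel, signal[i:i + k]) for i in range(len(signal) - k + 1)]


if __name__ == "__main__":
    print(conv1d_valid([1, 2, 3, 4], [1, 0, -1]))
```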
    Looking at hardened multipliers for AI inference, Achronix's variable-precision multipliers can deliver 41K INT8 multiplies or 82K INT4 multiplies. Intel Agilex has 2K-17K 18×19 multipliers, and Xilinx Versal has 500-3K "DSP engines", presumably "DSP58 slices", which include a 27×24 multiplier and new hardened floating-point capability. This is necessarily an "apples to oranges to mangoes" comparison. Which fruit suits your application best is "left to the designer".
    All three suppliers have now strengthened support for floating-point multiplication. Achronix offers a new architecture for its DSP blocks, which it calls the "machine learning processor" (MLP). Each MLP contains up to 32 multiplier/accumulators (MACs), supports 4- to 24-bit integer modes and a variety of floating-point modes, and handles the BFLOAT16 format and block floating-point formats. Most importantly, the Achronix MLP tightly couples embedded memory blocks with the compute units, so that MAC operations can run at 750 MHz without waiting for memory accesses to fetch data through the FPGA fabric.
    Intel also uses variable-precision DSP blocks with hardened floating point (essentially the feature it has offered for years). Intel's floating-point support is probably the most extensive and mature of the three. With Agilex, it adds two new floating-point modes, half-precision floating point (FP16) and BFLOAT16, and has adjusted the architecture to make DSP operations more efficient.
    Xilinx has upgraded its previous DSP48 slice to the DSP58, presumably so named because it now includes hardened floating point and its multiplier has grown to 27×24. So in this generation, the other two suppliers have joined Intel in offering hardened multipliers that support floating-point operations. This is a reversal for Xilinx, which previously argued that hardened floating-point multipliers in FPGAs were not a good idea, because floating point was mainly used for training while FPGAs mainly targeted inference.
    As for which floating-point formats are available, Versal (up to 2.1K multipliers) and Agilex (up to 8.7K multipliers) support FP32. All three families support half precision (FP16): Versal with up to 2.1K multipliers, Agilex with up to 17.1K, and Speedster with up to 5.1K. Agilex (up to 17.1K multipliers) and Speedster (up to 5.1K) support BFLOAT16. For FP24 multiplication, Versal and Agilex would presumably use their FP32 units, while Speedster has up to 2.6K multipliers. Achronix Speedster also supports up to 81.9K block floating-point multipliers.
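For readers less familiar with the formats listed above, the following Python sketch illustrates two of them in generic form. It assumes the standard bfloat16 layout (the top 16 bits of an IEEE-754 float32) and a simple textbook block floating-point scheme with one shared exponent per block. The article does not describe the vendors' actual hardware formats or rounding behavior, so all names and parameters here are illustrative.

```python
import math
import struct


def to_bfloat16_bits(x):
    """Keep the top 16 bits of a float32: sign, 8 exponent bits, 7 mantissa bits.

    Truncation is used to keep the sketch short; real hardware commonly
    rounds to nearest instead.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16


def from_bfloat16_bits(bits16):
    """Expand 16 bfloat16 bits back to a Python float by zero-padding."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return x


def block_quantize(values, mantissa_bits=8):
    """Block floating point: one shared exponent, one small integer per value."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    _, exp = math.frexp(max_abs)           # max_abs = m * 2**exp, 0.5 <= m < 1
    shared_exp = exp - (mantissa_bits - 1)
    scale = 2.0 ** shared_exp
    limit = 2 ** (mantissa_bits - 1) - 1   # largest signed mantissa
    mantissas = [max(-limit, min(limit, int(round(v / scale)))) for v in values]
    return shared_exp, mantissas


def block_dequantize(shared_exp, mantissas):
    """Recover approximate values from the shared exponent and mantissas."""
    return [m * 2.0 ** shared_exp for m in mantissas]


if __name__ == "__main__":
    print(hex(to_bfloat16_bits(1.0)))
    print(block_quantize([1.0, 0.5, -0.25]))
```

Because the exponent is stored once per block, the per-element arithmetic is plain integer multiplication, which is one reason block floating-point multiplier counts can be so much higher than FP16 or FP32 counts on the same silicon.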
    Xilinx also brings a new software-programmable vector processor: an array of up to 400 1 GHz VLIW-SIMD vector processing cores with enhanced compute and tightly coupled memory. This offers a simpler programming model for parallelizing complex vector operations and exploiting the FPGA's abundant compute resources. Overall, this checks the "GPU/inference engine" box in Xilinx's "kitchen sink" competitive strategy. We will discuss this in detail later.
    Intel's answer to the Achronix MLP and the Xilinx vector processor is an evolution of the old school. Intel points out that the Agilex DSP blocks achieve the same functions as the other vendors' offerings using an FPGA design flow that is already established and well understood, and without requiring customers to partition their design across the device's various architectures. That is a good thing if your team has FPGA/RTL design expertise. But if your application needs to be developed by software engineers, Xilinx's software-programmable approach may have the advantage.
    Beyond simply counting multipliers, we can also compare these capabilities by looking at the suppliers' theoretical performance claims. One caveat: these claims are heavily inflated and hard to pin down. Suppliers typically arrive at a number by multiplying the count of multipliers on the chip by the maximum operating frequency of those multipliers, yielding an "up to XX TOPS or TFLOPS" figure. Obviously, no real-world design will use 100% of the multipliers, no design will reach the maximum theoretical clock rate of those multipliers, and no design can keep feeding input data to those multipliers at the required rate. The precision of the multiply operations also varies from supplier to supplier.
    If pressed, we would say that FPGAs can realistically reach 50-90% of their theoretical maximum in actual designs. That is better than GPUs, which are thought to reach 10-20% of their theoretical maximum in the real world.
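The "multipliers times clock" arithmetic can be sketched as follows. The function names are illustrative, and counting each multiply-accumulate as two operations is an assumption (a common convention, but not stated in the article). With that convention, Achronix's quoted 41K INT8 multipliers at 750 MHz work out to 61.5 TOPS, close to the figure Achronix quotes for its MLP blocks.

```python
def peak_tops(num_multipliers, clock_hz, ops_per_mac=2):
    """Theoretical peak in tera-operations per second.

    ops_per_mac=2 assumes a multiply and an add are counted separately.
    """
    return num_multipliers * clock_hz * ops_per_mac / 1e12


def achievable_range(peak, low_fraction, high_fraction):
    """Scale a peak figure by a (low, high) real-world utilization estimate."""
    return peak * low_fraction, peak * high_fraction


if __name__ == "__main__":
    peak = peak_tops(41_000, 750e6)
    # The article's rough estimates: 50-90% for FPGAs, 10-20% for GPUs.
    fpga_low, fpga_high = achievable_range(peak, 0.5, 0.9)
    print(f"peak {peak:.1f} TOPS, realistic {fpga_low:.1f} to {fpga_high:.1f} TOPS")
```

Applying the utilization range makes clear how wide the uncertainty is: the same datasheet number can mean roughly 31 or 55 TOPS in a real design.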
    In INT8 TOPS, Xilinx Versal tops the list with approximately 171 TOPS if we include 133 TOPS from its vector processors, plus 12 from its DSP blocks and 26 from its logic fabric. Speedster has about 86 TOPS, 61 from its MLP blocks and 25 from its logic fabric. Agilex INT8 tops out at 92 TOPS, with 51 from DSP blocks and 41 from logic fabric. In BFLOAT16 TFLOPS, Agilex leads with 40, followed by Versal with 9 and Speedster with 8. Speedster has a big advantage in block floating-point operations, with 123 TFLOPS, followed by Agilex with 41 and Versal with 15.
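Since the INT8 totals above are sums of quite different resources, it can help to tabulate them and see how the ranking depends on what is counted. The sketch below uses only the figures quoted in the article; the dictionary layout and function names are illustrative.

```python
# INT8 TOPS by source, as quoted from the vendors' own figures.
INT8_TOPS = {
    "Xilinx Versal": {"vector processors": 133, "DSP blocks": 12, "logic fabric": 26},
    "Achronix Speedster": {"MLP blocks": 61, "logic fabric": 25},
    "Intel Agilex": {"DSP blocks": 51, "logic fabric": 41},
}


def total_tops(parts, exclude=()):
    """Sum the contributions, optionally leaving some resource types out."""
    return sum(tops for source, tops in parts.items() if source not in exclude)


def ranking(data, exclude=()):
    """Return [(family, total)] sorted from highest to lowest total."""
    totals = {name: total_tops(parts, exclude) for name, parts in data.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(ranking(INT8_TOPS))
    print(ranking(INT8_TOPS, exclude=("vector processors",)))
```

Leaving out the vector processors drops Versal from 171 to 38 TOPS, which shows how much of its headline number depends on that one software-programmable resource.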
    These numbers come from the companies' own data sheets. As we noted, they are theoretical maximums that cannot be achieved in practice. Achronix's claims about "usable" performance have some merit, because its MLP is a unique design that aims to keep the most common AI inference operations within the block itself, running at the maximum clock rate without data traveling to and from the logic fabric. Similarly, Xilinx's vector processor architecture should be able to keep data flowing steadily through its arithmetic units. That said, we have not yet seen benchmarks or reference designs that substantiate any of these companies' claims in a meaningful way.
    Of course, to use all these LUTs and multipliers, your design actually has to place and route and meet timing on the chosen chip. As FPGAs have grown, this has become an increasingly difficult challenge. Single-bit nets and logic paths spread across enormous chips through limited routing resources, gradually turning traditional timing closure into a nightmare. The conventional techniques used to achieve timing closure in synchronous designs have hit a wall and do not scale. Both Xilinx and Achronix address this problem in their new-generation FPGAs by adding a network-on-chip (NoC) on top of the traditional logic and routing fabric. The NoC fundamentally changes the rules of the game, because the whole chip no longer needs to close timing in one giant, magical convergence. Instead, smaller synchronous blocks can pass data through the NoC, relieving the traditional routing fabric and breaking the enormous problem that design automation tools must solve into smaller, more manageable ones.
    For generations, Intel has taken a different approach to this problem: a large number of tiny registers, called "Hyperflex registers", spread throughout the logic fabric. These registers allow longer, more complex logic paths to be retimed and pipelined, so that the overall design becomes essentially asynchronous. Interestingly, this is also the effect of the NoCs used by Xilinx and Achronix. Each approach has its challenges, since both add considerable complexity to the chip design and to the design tools we use. In Intel's case, the Hyperflex registers reportedly have some negative effect on the overall speed achievable in the logic fabric. Intel says that the Hyperflex architecture in Agilex FPGAs is second generation, improved and enhanced over the previous Hyperflex architecture to raise performance and simplify timing closure. As Agilex rolls out, we will have to wait and see how users respond.
    Of the two suppliers using a NoC for routing, Xilinx and Achronix, Achronix claims the fastest NoC with its two-dimensional cross-chip AXI implementation. Each row or column of this NoC implements two 256-bit unidirectional AXI channels running at 2 GHz, supporting 512 Gbps of data traffic in each direction. The Speedster NoC has 197 nodes in total, for an aggregate bandwidth of 27 Tbps, which relieves the FPGA's traditional bit-level routing resources. As far as we know, Xilinx has not published Versal NoC performance, but it has about 28 nodes. Our guess is that the total bandwidth is 1.5 Tbps.
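The per-channel figure follows directly from channel width times clock rate, as this short Python sketch shows. The function names are illustrative, and the calculation is raw bandwidth only, with no allowance for protocol overhead. The 27 Tbps aggregate is taken from the article as quoted; it cannot be re-derived here because the number of rows and columns is not given.

```python
def axi_channel_gbps(width_bits=256, clock_ghz=2.0):
    """Raw bandwidth of one unidirectional channel in Gbit/s."""
    return width_bits * clock_ghz


def row_or_column_gbps(channels=2, width_bits=256, clock_ghz=2.0):
    """Raw bandwidth of one NoC row or column, summing both directions."""
    return channels * axi_channel_gbps(width_bits, clock_ghz)


# Aggregate figures as quoted; the Versal number is the article's own guess.
SPEEDSTER_AGGREGATE_TBPS = 27.0
VERSAL_AGGREGATE_TBPS_GUESS = 1.5

if __name__ == "__main__":
    print(f"one direction: {axi_channel_gbps():.0f} Gbps")
    print(f"both directions: {row_or_column_gbps():.0f} Gbps")
    ratio = SPEEDSTER_AGGREGATE_TBPS / VERSAL_AGGREGATE_TBPS_GUESS
    print(f"quoted aggregate ratio: {ratio:.0f}x")
```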
    OK, we're out of ink for this week, but next week we will continue with a look at the fascinating and flexible memory architectures these FPGA families bring, each family's unique packaging and customization capabilities, crazy SerDes I/O capabilities, embedded processing subsystems, design tool flows, and more.
    This is issue 2125 of "Semiconductor Industry Observation".
