HT-TMR: An Efficient Netlist-Level TMR Tool for FPGA SEU Mitigation
Presenter: Yaowei Zhang, Beijing Microelectronics Technology Institute
Abstract: This paper describes HT-TMR, an efficient software tool in the BMTI HongTu series, developed to automatically apply TMR mitigation to FPGA designs to protect against SEUs. The tool provides three preset TMR modes and a custom TMR mode for designers. Once a TMR mode is selected, the tool parses the input EDIF file, determines the necessary points at which to insert voters, and generates the newly constructed circuit in EDIF format with the same functionality as the original. Compared with Mentor's Precision Hi-Rel tool, our tool achieves lower resource utilization, better timing performance, and higher compatibility.
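The core element of any TMR scheme is the 2-of-3 majority voter placed after each triplicated block. The following minimal Python sketch is purely illustrative of that concept and is not taken from the HT-TMR tool's actual netlist transformation:

# Illustrative only: behavioral model of triplication with majority voting,
# not the HT-TMR tool's EDIF-level transformation.

def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority voter, the basic TMR voting element."""
    return (a & b) | (b & c) | (a & c)

def tmr_eval(logic_fn, inputs, flip_copy=None, flip_mask=0):
    """Evaluate three redundant copies of logic_fn and vote on their outputs.

    flip_copy/flip_mask optionally inject an upset into one copy to show
    that a single-event upset is masked by the voter.
    """
    outs = []
    for i in range(3):
        out = logic_fn(*inputs)
        if i == flip_copy:
            out ^= flip_mask          # simulate a single-event upset
        outs.append(out)
    return majority(*outs)

# Example: an upset in one of the three copies does not change the voted result.
adder = lambda x, y: (x + y) & 0xFF
assert tmr_eval(adder, (0x3A, 0x05), flip_copy=1, flip_mask=0x10) == 0x3F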
An FPGA-based Multi-Core Overlay Processor for Transformer-based Models
Presenter: Shaoqiang Lu, Shanghai Jiao Tong University
Abstract: Transformer-based models have achieved extensive success with increasingly large numbers of parameters and computations, for which many multi-core accelerators have been developed. Nevertheless, they suffer from limited throughput due to either low operating frequency or high communication overhead between cores. This paper proposes an FPGA-based multi-core overlay processor, MCore-OPU, to optimize intra-core computation and inter-core communication. First, we boost the operating frequency of the processing element (PE) array to twice that of the rest of the processor to improve intra-core throughput. Second, we develop on-chip synchronization routers to reduce expensive off-chip memory traffic, so that only the partial sum and maximum are communicated between cores, rather than entire vectors, for layer normalization and softmax. Meanwhile, we optimize the multi-core model allocation and scheduling to minimize inter-core communication and maximize intra-core computation efficiency. MCore-OPU is implemented with four cores and four DDRs on the Xilinx U200 FPGA, where the PE array runs at 600 MHz and the rest of the processor at 300 MHz. Experimental results show that MCore-OPU outperforms other FPGA-based accelerators by 1.24×–1.39× and the A100 GPU by 5.31×–5.81× in throughput per DSP for BERT, ViT, GPT-2, and LLaMA inference.
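The communication saving can be illustrated with softmax: when a vector is split across cores, each core only needs to exchange its local maximum and its local exponential sum (two scalars), not its full slice. The numpy sketch below is an illustration of that reduction pattern, not the MCore-OPU implementation:

# Illustrative sketch: softmax over a vector split across cores, where cores
# exchange only a local max and a local exp-sum instead of full vector slices.
import numpy as np

def multicore_softmax(x, num_cores=4):
    slices = np.array_split(x, num_cores)                   # each core holds one slice
    g_max = max(s.max() for s in slices)                    # one scalar exchanged per core
    g_sum = sum(np.exp(s - g_max).sum() for s in slices)    # one scalar exchanged per core
    return np.concatenate([np.exp(s - g_max) / g_sum for s in slices])

x = np.random.randn(1024).astype(np.float32)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(multicore_softmax(x), ref, atol=1e-6)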
Toward Efficient Co-Design of CNN Quantization and HW Architecture on FPGA Hybrid-Accelerator
Presenter: Yiran Zhang, Southern University of Science and Technology
Abstract: The field-programmable gate array (FPGA) has emerged as a promising platform for accelerating convolutional neural networks (CNNs). In this paper, we propose a low-latency CNN hybrid-accelerator system and an efficient design space exploration (DSE) method. Specifically, our targeted FPGA platform consists of different types of accelerators, which brings two advantages: high concurrency and full hardware utilization (i.e., of both look-up tables (LUTs) and digital signal processors (DSPs)). In addition, we adopt a bandwidth-aware analytical model for system latency that considers pipeline stalls and computation cycles simultaneously. Furthermore, for the huge design space encompassing layer-wise CNN quantization and FPGA hybrid-accelerator architecture, we propose a DSE method named DiMEGA, a differentiable method embedded within a genetic algorithm, aimed at enhancing search efficiency. The performance of our CNN hybrid-accelerator system is demonstrated on a PYNQ-Z2 FPGA platform. The experimental results show that the system latency can be reduced by 42%–48% without sacrificing accuracy, and the DSE time of DiMEGA is reduced by 23% on ResNet20-CIFAR10 and 63% on ResNet56-CIFAR10, compared with the state of the art (SOTA).
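A bandwidth-aware latency model of the kind described typically bounds each layer by whichever is slower, computation or data movement, so stalls from limited memory bandwidth are captured. The sketch below is a simplified, assumed form of such a model; the parameter names and the roofline-style max() are illustrative and not DiMEGA's exact formulation:

# Simplified, assumed form of a bandwidth-aware latency model for a hybrid
# (DSP + LUT) accelerator. Illustrative only.

def layer_latency(macs, bytes_moved, dsp_macs_per_cycle,
                  lut_macs_per_cycle, bytes_per_cycle):
    compute_cycles = macs / (dsp_macs_per_cycle + lut_macs_per_cycle)  # concurrent DSP and LUT compute
    memory_cycles = bytes_moved / bytes_per_cycle                      # off-chip transfer time
    return max(compute_cycles, memory_cycles)                          # stall-aware bound per layer

def network_latency(layers, hw):
    return sum(layer_latency(**layer, **hw) for layer in layers)

# Toy example: one 3x3 convolution layer on a small hybrid accelerator.
layer = dict(macs=64 * 64 * 9 * 56 * 56, bytes_moved=64 * 64 * 9)
hw = dict(dsp_macs_per_cycle=220, lut_macs_per_cycle=128, bytes_per_cycle=16)
print(network_latency([layer], hw))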
DUET: FPGA-Accelerated Differential Testing Framework for Efficient Processor Verification
Presenter: Shoulin Zhang, Zhengzhou University
Abstract: "With the increasing complexity of modern processors, verification becomes the main bottleneck in the entire processor development cycle, which can occupy up to 70\% of the development time. Primarily this is because multiple iterations of debugging and software simulations are extremely time-consuming. To improve the verification and debugging efficiency, recent studies investigated differential testing techniques such as DiffTest, which dynamically compares the runtime results from the hardware design-under-test (DUT) and its software reference model. However, the slow speed of software simulation constrains the efficiency and effectiveness of these verification techniques. Although FPGAs can speed up the simulation, current methods either offer limited visibility into design details or are costly when dynamically checking against a reference model at the system level. In this paper, we present DUET, an FPGA accelerated differential testing technique that combines the fine-grained debugging capability of DiffTest and the high simulation speed of FPGA prototyping. We evaluate the proposed method with practical RISC-V processors and demonstrate that the proposed approach accelerates verification efficiency up to 20x while preserving full visibility and debugging capabilities."
An FPGA-based Efficient Streaming Vector Processing Engine for Transformer-based Models
Presenter: Siyuan Miao, University of California, Los Angeles
Abstract: Transformer-based models have obtained extensive success. As their linear operations have been significantly accelerated by a wide range of approaches, nonlinear operations tend to have limited hardware efficiency and become the performance bottleneck. Prior works on accelerating nonlinear operations suffer from either low area efficiency due to a poorly suited instruction set architecture (ISA) or limited throughput due to the loop-carried dependency in reduce operations. In this paper, we propose a vector processing engine (VPE) with a new streaming ISA to achieve flexible streaming execution of nonlinear operations and obtain better hardware efficiency. Moreover, we relax the loop-carried dependency in reduce operations with a look-ahead dataflow optimization, improving the throughput of nonlinear operations. Experimental results on the Xilinx U200 FPGA show that VPE outperforms other FPGA vector processing units in throughput by 1.14×–3× for softmax, layer normalization, and GELU across different vector sizes.
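The loop-carried dependency in a reduce operation arises because each accumulation waits for the previous one; it can be relaxed by keeping several interleaved partial accumulators and merging them at the end. The sketch below illustrates that general idea, not VPE's actual look-ahead dataflow:

# Illustrative sketch: a naive reduction chains every add through one
# accumulator (dependency chain of length N); interleaving k accumulators
# shortens the chain to roughly N/k plus a short final merge, which keeps
# a pipelined adder busy.

def reduce_naive(xs):
    acc = 0.0
    for x in xs:          # each iteration depends on the previous acc
        acc += x
    return acc

def reduce_interleaved(xs, k=8):
    accs = [0.0] * k      # k independent accumulators
    for i, x in enumerate(xs):
        accs[i % k] += x  # consecutive elements go to different accumulators
    return sum(accs)      # short final merge

xs = [float(i) for i in range(1000)]
assert abs(reduce_naive(xs) - reduce_interleaved(xs)) < 1e-6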
Efficient Verification Framework for RISC-V Instruction Extensions with FPGA Acceleration
Presenter: Zijian Jiang, Beijing University of Technology
Abstract: "The RISC-V instruction set architecture (ISA) enjoys the flexibility for domain-specific custom instruction extensions. While the basic RISC-V ISA contains common instructions, the extended accelerators provide additional computing power to meet diverse needs, making it well-suited for various emerging fields. High-level synthesis (HLS) provides a way to build hardware accelerators directly using RTL. It allows software engineers to create complex digital circuit designs using high-level languages such as C/C++, further improving development efficiency. However, verifying a design that includes RISC-V cores and custom extensions can be challenging. Traditional approaches for verifying HLS-generated designs use C-RTL co-simulation, which primarily focuses on the unit level, while making impractical assumptions about interactions between HLS-generated IPs and the processor. On the other hand, designs that combine RISC-V cores with custom extensions require system-level verification, which must extensively exercise both components and their interconnections. Furthermore, traditional C-RTL co-simulation performs cycle-accurate software simulation, which can be extremely time-consuming.
To efficiently verify a RISC-V processor design with custom instruction extensions, we propose a novel verification framework that combines the benefits of the high-level abstraction of C/C++ simulation and the cycle-accurate modeling of C-RTL co-simulation. We map the RISC-V core and the HLS-generated custom instruction accelerators, along with their corresponding C/C++ software models, onto the same FPGA with hardened processors, allowing them to run simultaneously. A global monitor and checker compare the results of the hardware and software in real time. If a mismatch is detected, we capture a snapshot of the entire hardware state and reconstruct the simulation in external software simulators for detailed debugging. Through a series of benchmark experiments, results show a significant performance improvement over conventional approaches, ranging from 1419× to 9011×.
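The snapshot-and-replay flow can be pictured as: run the hardware and its C/C++ model side by side, compare their results at checkpoints, and on a mismatch capture the hardware state so that only the failing window needs to be re-simulated offline. The sketch below illustrates that control flow; the checkpoint granularity and toy step functions are assumptions, not the framework's API:

# Self-contained sketch of checkpointed lockstep checking with snapshot-and-replay.

CHECKPOINT = 1000          # compare every N transactions (assumed granularity)

def run_with_checkpoints(hw_step, sw_step, state_hw, state_sw, total):
    for n in range(total):
        state_hw = hw_step(state_hw)           # fast FPGA-side execution
        state_sw = sw_step(state_sw)           # C/C++ reference running alongside
        if n % CHECKPOINT == 0 and state_hw != state_sw:
            snapshot = dict(state_hw)          # capture full hardware state
            window = (n - CHECKPOINT, n)       # only this interval is re-simulated in RTL
            return snapshot, window
    return None, None

# Toy usage: both models increment a counter; the "hardware" drifts once.
hw = lambda s: {"ctr": s["ctr"] + (2 if s["ctr"] == 2500 else 1)}
sw = lambda s: {"ctr": s["ctr"] + 1}
print(run_with_checkpoints(hw, sw, {"ctr": 0}, {"ctr": 0}, 5000))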