Optimizing an OpenCL AI Kernel for the data center using Silexica’s SLX FPGA

In this application note, SLX FPGA accelerates an AI-related face detection design example, leveraging the bottom-up flow of Xilinx's Vitis 2020.2 and the Alveo U280 accelerator card.


 

This application note is written for FPGA application developers using Vitis HLS 2020.2.


Introduction

FPGAs are increasingly employed as co-processors in data centers. A driver behind this transition is AI applications that leverage the parallel nature of FPGAs. The Xilinx Alveo family of accelerator cards, which connect to x86 processors over a PCI Express interface, is very popular in this domain. You can program these accelerator cards in one of two ways. The top-down approach starts from a top-level C/C++ and OpenCL application and works towards lower-level kernels. The bottom-up approach compiles the kernel blocks into Xilinx objects (.xo) that can be linked together into a binary at a later stage.

The bottom-up flow has several advantages over the top-down flow. (1) It allows design, validation, and optimization of kernels separately from the main application. (2) It provides faster iteration cycles for development and optimization of kernels by splitting the design into smaller components. (3) It facilitates reuse; a collection of (.xo) files can be reused like a library.

In this application note, we use a face detection application as a reference design to show how designers can use SLX FPGA to optimize a kernel with the Vitis bottom-up flow. The same methodology also applies when designing a kernel from scratch or importing an existing kernel from Vitis HLS.


Development Flow

The following development tools from Silexica and Xilinx are required to create this application:

  • SLX FPGA version 2020.4-sp1
  • Vitis Libraries version 2020.2
  • Vitis High-Level Synthesis version 2020.2
  • Vitis Unified Software Platform 2020.2

The entire end-to-end flow is shown in figure 1. The flow starts with creating a new SLX project. However, if you have an existing Vitis HLS project, SLX FPGA can import it directly.

 

Figure 1: SLX FPGA workflow for Vitis bottom-up projects

 

Creating & Configuring an SLX FPGA Project

Launch SLX FPGA and start the project creation wizard by clicking the “New SLX project” icon. Create a new SLX FPGA project as shown in figure 2. The next step is to configure this project.

 

Figure 2: Create a new SLX FPGA project


The configuration editor appears automatically when you create a new project, but you can also open it at any time by clicking the orange gear button. Drag and drop your application source files into the spec folder of your project, as shown in figure 3. For this application note, we use the face detection application from the Rosetta benchmarks [1].

Next, specify the FPGA part number and the build options. This application targets an Alveo U280 card: in the FPGA Part field, press select and choose xcu280-fsvh2892-2L-e. To set the build options, enter the clean, build, and run commands as shown in figure 3.

For make-based projects such as this one, verify that the makefile does not hardcode the compiler and instead references the C and C++ compilers through the $(CC) and $(CXX) variables. SLX overrides these variables with its own compilers during different phases of analysis. The run command executes the testbench (also included in the benchmark suite), which checks functional correctness and is also used to analyze the dynamic behavior of the application.
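Because the run command relies on the testbench to report functional correctness, it helps if the testbench checks its own results and reports failure through its exit status. The following is a minimal sketch of that pattern; `kernel_under_test` and `run_testbench` are hypothetical names used only for illustration and are not taken from the benchmark suite.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the function under test (not the benchmark code):
// it doubles every input element.
void kernel_under_test(const int *in, int *out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = 2 * in[i];
    }
}

// Runs the kernel on a fixed input, compares against expected values computed
// independently, and returns a process exit code: 0 on success, 1 on failure.
int run_testbench() {
    const int n = 16;
    std::vector<int> in(n), out(n, 0);
    for (int i = 0; i < n; ++i) {
        in[i] = i;
    }

    kernel_under_test(in.data(), out.data(), n);

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (out[i] != in[i] + in[i]) {
            ++mismatches;
        }
    }
    std::printf("%s (%d mismatches)\n", mismatches == 0 ? "PASS" : "FAIL", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

A `main()` that simply returns `run_testbench()` would let the configured run command detect a failing test from the exit status.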

 

Figure 3 shows how the configuration options will look.

 


Figure 3: Configuring a new SLX FPGA project

 

With the basic configuration done, we can select the top-level hardware function for the application and set its interfaces. Open the function mapping editor by clicking the ‘function mapping’ button. If you know the top-level hardware function, check it for synthesizability issues and map it to the FPGA using the right-click menu in the function mapping editor. Alternatively, run the auto-select FPGA function(s) action to let SLX pick the top-level hardware function(s) automatically. For this application, we select face_detect_sw as our top-level hardware function. Once the top-level hardware function is selected, the function mapping editor looks like figure 4; all functions mapped to the FPGA have a red border.

 

Figure 4: SLX FPGA function mapping editor

 

Now we are ready to select the interfaces for this function. After selecting the top-level hardware function in the function mapping editor, click on the properties tab and open the interface selection from the menu on the left side, as shown in figure 5. Select the axi_m interface for all array and pointer arguments and the s_axilite interface for scalars. This generates the interface pragmas required for using a Xilinx object on an Alveo accelerator card. Additionally, SLX’s optimization engine is now aware of the interface constraints and chooses the optimization pragmas accordingly.
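To illustrate what such interface pragmas can look like in Vitis HLS source code, here is a small sketch. The kernel `scale_kernel`, its arguments, and the bundle names are hypothetical and stand in for the face detection function; in Vitis HLS the AXI master mode is spelled `m_axi`. The exact pragmas SLX FPGA emits for a given design may differ.

```cpp
// Hypothetical kernel (not the face detection code): multiplies each element
// of a vector by a scalar factor.
void scale_kernel(const int *in, int *out, int factor, int n) {
// Pointer arguments are accessed through AXI4 master ports into global memory.
#pragma HLS INTERFACE m_axi port=in offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1
// Base addresses, scalars, and block-level control go through AXI4-Lite.
#pragma HLS INTERFACE s_axilite port=in
#pragma HLS INTERFACE s_axilite port=out
#pragma HLS INTERFACE s_axilite port=factor
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] * factor;
    }
}
```

A regular C++ compiler ignores the HLS pragmas, so the same source can still be built and run as part of a software testbench.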

 


Figure 5: SLX FPGA interface selection

 

With all the interfaces properly selected, we are now set to optimize and generate pragmas with SLX FPGA.

 

Generating HLS pragmas in SLX FPGA

Generating HLS pragmas is a two-step process:

  1. Find and parallelize loops in the FPGA
  2. Generate code with HLS pragmas inserted

 

In the first step, SLX’s optimization engine searches the design space of possible solutions to determine the optimal set of pragmas and parameters to apply. The design space consists of: (1) different parallelization options for loops, i.e., pipelining or unrolling with different unroll factors, (2) multi-dimensional partitioning and reshaping options for arrays (complete or cyclic with different factors), and (3) function hierarchy options (inline or block). For this particular example, this results in approximately 1.32 × 10^19 design points, and SLX’s optimization engine converges to a solution in about 70 seconds.
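The three kinds of design choices listed above correspond to Vitis HLS pragmas such as PIPELINE, UNROLL, ARRAY_PARTITION, and INLINE. The sketch below shows one possible combination on a hypothetical function (`sum_of_squares`, with an assumed array size `N` and assumed factors of 4); it is not output from SLX FPGA and is not tuned for any device.

```cpp
const int N = 64;  // assumed array size, for illustration only

// (3) Function hierarchy: ask HLS to inline this helper into its caller.
static int square(int x) {
#pragma HLS INLINE
    return x * x;
}

// Hypothetical example function: sum of squares over a fixed-size array.
int sum_of_squares(const int data[N]) {
    int local[N];
// (2) Array partitioning: split the buffer cyclically into 4 banks so that
// 4 elements can be read in the same cycle.
#pragma HLS ARRAY_PARTITION variable=local cyclic factor=4 dim=1

    for (int i = 0; i < N; ++i) {
// (1a) Loop pipelining: aim to start a new iteration every clock cycle.
#pragma HLS PIPELINE II=1
        local[i] = data[i];
    }

    int acc = 0;
    for (int i = 0; i < N; ++i) {
// (1b) Partial unrolling: replicate the loop body 4 times.
#pragma HLS UNROLL factor=4
        acc += square(local[i]);
    }
    return acc;
}
```

Every loop, array, and function offers several such options, and the options multiply across the design, which is why the number of design points grows so quickly.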

 

Figure 6: SLX FPGA hints view

 

Figure 6 shows the SLX FPGA hints view. The fourth and fifth columns in the hints view show the CPU total cost and FPGA total cost for the different functions and loops in the application. The FPGA total cost is an estimate of how much a particular function or loop contributes to the critical path latency, which helps developers decide where to focus their optimization efforts. For example, the weakClassifier function on row 33 (figure 6) takes 24.4% of CPU time in a pure software implementation, but its contribution to the critical path latency in an FPGA implementation is only 3.63%. In contrast, the loop on row 4 (figure 6) in the cascadeClassifier function takes 79.9% of the CPU time in a pure software implementation but contributes 97.2% of the FPGA critical path latency.

The hints view also highlights the most critical loop-carried dependencies (LCDs). SLX FPGA does not treat all LCDs as equal: it separates the LCDs that can be ignored (e.g., induction and reduction variables) from the most critical ones. This information can save developers time by letting them focus on the parts of their application that really matter in an FPGA implementation.
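The difference between an ignorable and a critical loop-carried dependency can be seen in a small sketch. Both functions below are hypothetical and unrelated to the face detection code; how SLX FPGA classifies a specific loop depends on its own analysis.

```cpp
// Reduction: acc is carried from one iteration to the next, but integer
// addition is associative, so the sum can be reordered (for example as an
// adder tree). Dependencies of this kind can usually be ignored when
// parallelizing the loop.
int reduce_sum(const int *x, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += x[i];
    }
    return acc;
}

// True recurrence: y[i] cannot be computed before y[i - 1] is known, so
// iterations must run in order. A dependency like this limits how far the
// loop can be pipelined or unrolled.
void running_filter(const int *x, int *y, int n) {
    if (n <= 0) {
        return;
    }
    y[0] = x[0];
    for (int i = 1; i < n; ++i) {
        y[i] = (y[i - 1] >> 1) + x[i];
    }
}
```

In the first loop, the loop counter `i` is an induction variable and `acc` is a reduction variable, the two ignorable cases mentioned above.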

 

Figure 7: SLX FPGA Code Generation Wizard showing automated pragma insertion

 

Clicking the “Generate HLS Code” button opens the code transformation wizard, as shown in figure 7. Here, the user can inspect the generated code side by side with the original version and select or deselect individual pragmas to fine-tune the implementation.

 

Performance Improvement

After synthesizing the design, we compare the performance and resource utilization of the SLX optimized kernel with the original unoptimized one. For this particular design, we allowed SLX FPGA to use all available resources on the selected device; however, additional constraints can be applied if necessary. Table 1 shows a summary of the results. We see a 7.8x reduction in latency for a 3x increase in LUTs, 2.4x increase in FF, and 2.7x increase in DSP blocks. This increase in resource utilization is not a big concern for an Alveo card since all resources are still under 5% utilization level. Should greater performance be required, a host of additional analysis capabilities are available in SLX FPGA to help guide designers to refactor their code more quickly and effectively.

 

Table 1 – Original Vitis Library compared to SLX FPGA instrumented code

 

Importing the Xilinx Object into a Vitis Application Project

The hls folder in an SLX FPGA project contains a Vitis HLS project with the SLX-optimized source code. We open this project with Vitis HLS and export the RTL as a Xilinx object, as shown in figure 8. Before exporting to Vitis, we need to wrap the kernel function in extern "C" to ensure C linkage.
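As a reminder of what the wrapper looks like, the sketch below applies it to a hypothetical kernel, `increment_kernel`, used here only because the actual signature of the face detection function is not reproduced in this note. The wrapper turns off C++ name mangling, so the symbol name matches the function name used when the kernel is referenced later.

```cpp
// extern "C" gives the enclosed function C linkage (no C++ name mangling).
extern "C" {

// Hypothetical kernel used only to show the wrapper.
void increment_kernel(const int *in, int *out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] + 1;
    }
}

}  // extern "C"
```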

 

Figure 8: Exporting the Xilinx Object from Vitis HLS

 

Within a Vitis workspace, create a new application with an Alveo U280 card as the target device, as shown in figure 9.

Figure 9: Creating an application project in Vitis Unified Platform

 

Once the project is created, we import the .xo file into the src folder for kernels, as shown in figure 10. After importing the .xo file, click the “add hardware function” button and select face_detect_sw from the list.

 

Figure 10: Importing the kernel in a Vitis application project

 

The developer is now able to create the wider application, which runs on the x86 host, taking advantage of the accelerated face_detect_sw kernel.

 

 

Figure 11: System diagram in Vitis Analyzer

 

Conclusion

This application note has demonstrated how SLX FPGA can be used to optimize a kernel targeted at PCIe-connected Alveo cards, leveraging the Vitis bottom-up kernel flow. For this example, SLX FPGA reduced the latency of a commonly used AI kernel for face detection by 7.8x. The approach can be applied to most Xilinx-based data-center applications, including Amazon F1 instances, whether you are developing an application from scratch or reusing an existing design and customizing it to your needs.

 


 


 

This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 858051.
