The configuration editor appears automatically when you create a new project, and you can reopen it at any time by clicking the orange gear button. Drag and drop your application source files into the spec folder of your project, as shown in figure 3. For this appnote, we use the face-detection application from the Rosetta benchmarks [1].
Next, specify the FPGA part number and the build options. For this application, we target an Alveo U280 FPGA. In the FPGA Part field, press select and choose xcu280-fsvh2892-2L-e. To set the build options, enter the clean, build and run commands as shown in figure 3. For ‘make’ based projects, such as this one, verify that the makefile does not hardcode the compiler and instead uses the $(CC) and $(CXX) variables to reference the C and C++ compilers respectively. SLX overrides these variables with its own compilers in different phases of the analysis. The run command executes the testbench (also included in the benchmark suite), which checks functional correctness and is also used to analyze the dynamic behavior of the application.
Figure 3 shows how the configuration options will look.
Figure 3: Configuring a new SLX FPGA project
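The run command described above relies on a testbench that checks its own results. The following is a minimal sketch of such a self-checking testbench, not the one shipped with the benchmark suite. The names invert_pixels and run_testbench are hypothetical, and the convention of signalling failure through a non-zero return value is an assumption borrowed from common HLS C simulation practice.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical function under test; a stand-in for the top-level hardware function.
void invert_pixels(const unsigned char* in, unsigned char* out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = static_cast<unsigned char>(255 - in[i]);
    }
}

// Returns 0 when every output matches the golden reference, 1 otherwise.
int run_testbench() {
    const int n = 256;
    std::vector<unsigned char> in(n), out(n), golden(n);

    // Build the stimulus and the expected (golden) result independently
    // of the function under test.
    for (int i = 0; i < n; ++i) {
        in[i] = static_cast<unsigned char>(i);
        golden[i] = static_cast<unsigned char>(255 - i);
    }

    invert_pixels(in.data(), out.data(), n);

    // Count mismatches instead of stopping at the first one,
    // so that a failing run reports how far off the result is.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (out[i] != golden[i]) {
            ++mismatches;
        }
    }

    std::printf("%s: %d mismatches\n", mismatches == 0 ? "PASS" : "FAIL", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

In a real project, main would simply return the value of run_testbench(), so that the run command fails visibly when the outputs diverge from the reference.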
After these basic configurations are done, we can proceed with selecting the top-level hardware function for our application and setting the right interfaces. Open the function mapping editor by clicking the ‘function mapping’ button. If you know the top-level hardware function, check it for synthesizability issues and map it to the FPGA using the right-click menu in the function mapping editor. Alternatively, run the auto-select FPGA function(s) option to let SLX pick the top-level hardware function(s) automatically. For this application, we select face_detect_sw as our top-level hardware function. Once the top-level hardware function is correctly selected, the function mapping editor will look like figure 4. All functions mapped to the FPGA have a red border.
Figure 4: SLX FPGA function mapping editor
Now we are ready to select the interfaces for this function. After selecting the top-level hardware function in the function mapping editor, click on the properties tab and open the interface selection with the menu on the left side, as shown in figure 5. Select the m_axi interface for all array and pointer arguments and the s_axilite interface for scalars. This generates the interface pragmas required for using a Xilinx object on an Alveo accelerator card. Additionally, SLX’s optimization engine is now aware of the interface constraints and chooses the optimization pragmas accordingly.
Figure 5: SLX FPGA interface selection
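To make the interface selection concrete, here is a minimal sketch of the kind of interface pragmas this step corresponds to in Vitis HLS, with m_axi on pointer arguments and s_axilite on scalars and the return port. The function scale_kernel, its arguments, and the bundle names gmem0 and gmem1 are hypothetical and are not taken from the face-detection benchmark or from SLX’s generated output.

```cpp
#include <cstdint>

// Hypothetical kernel used only to illustrate interface pragmas.
// Pointer arguments go through AXI4 master ports (global memory);
// scalars and the block-level control go through an AXI4-Lite slave port.
void scale_kernel(const std::uint8_t* in, std::uint8_t* out, int n, int gain) {
#pragma HLS INTERFACE m_axi port=in offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=gain
#pragma HLS INTERFACE s_axilite port=return

    for (int i = 0; i < n; ++i) {
        // Scale each sample and saturate to the 8-bit range.
        int v = in[i] * gain;
        out[i] = static_cast<std::uint8_t>(v > 255 ? 255 : v);
    }
}
```

A regular C++ compiler ignores the HLS pragmas, so the same source can still be built and run as part of the software testbench.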
With all the interfaces properly selected, we are now set to optimize and generate pragmas with SLX FPGA.
Generating HLS pragmas in SLX FPGA
Generating HLS pragmas is a two-step process:
- Find and parallelize loops in the FPGA
- Generate code with HLS pragmas inserted
In the first step, SLX’s optimization engine searches the design space of possible solutions to determine the best set of pragmas and parameters to apply. The design space consists of: (1) different parallelization options for loops, i.e., pipelining or unrolling with different unroll factors, (2) multi-dimensional partitioning and reshaping options for arrays (complete or cyclic with different factors), and (3) function hierarchy options (inline or block). For this particular example, this results in approximately 1.32 x 10^19 design points, and SLX’s optimization engine converges to a solution in just under 70 seconds.
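The three dimensions of the design space map onto familiar Vitis HLS pragmas. The sketch below shows one point in such a space for a hypothetical dot-product function. The function names, the array size, and the chosen factors are illustrative assumptions and do not reflect what SLX selects for the face-detection kernel.

```cpp
constexpr int N = 64;

// (3) Function hierarchy: this helper can be inlined into its caller
// or kept as a separate block.
static int mac(int acc, int a, int b) {
#pragma HLS INLINE
    return acc + a * b;
}

int dot_product(const int a[N], const int b[N]) {
    int a_buf[N];
    int b_buf[N];
    // (2) Array partitioning: a cyclic split with factor 4 spreads each
    // buffer over 4 memories, so 4 consecutive elements can be read at once.
#pragma HLS ARRAY_PARTITION variable=a_buf cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=b_buf cyclic factor=4 dim=1

    // (1) Loop parallelization, option A: pipeline the copy loop so that
    // a new iteration can start every cycle.
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        a_buf[i] = a[i];
        b_buf[i] = b[i];
    }

    // (1) Loop parallelization, option B: unroll by 4, matching the
    // partition factor so the extra memory ports are actually used.
    int acc = 0;
    for (int i = 0; i < N; ++i) {
#pragma HLS UNROLL factor=4
        acc = mac(acc, a_buf[i], b_buf[i]);
    }
    return acc;
}
```

Even in this small function, each loop, array, and helper contributes several independent choices, which is why the number of combinations grows so quickly for a full application.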
Figure 6: SLX FPGA hints view
Figure 6 shows the SLX FPGA hints view. The fourth and fifth columns in the hints view show the CPU total cost and FPGA total cost for the different functions and loops in the application. The FPGA total cost is an estimate of the latency that a particular function or loop contributes to the critical path. This helps developers decide where to focus their optimization efforts. For example, the weakClassifier function on row 33 (figure 6) takes 24.4% of the CPU time in a pure software implementation, but its contribution to the critical path latency in an FPGA implementation is only 3.63%. In contrast, the loop on row 4 (figure 6) in the cascadeClassifier function takes 79.9% of the CPU time in a pure software implementation but contributes 97.2% of the FPGA critical path latency. The hints view also highlights the most critical loop-carried dependencies (LCDs). Note that SLX FPGA does not treat all LCDs as equal: it separates the LCDs that can be ignored (e.g., induction and reduction variables) from the most critical ones. This information can save developers time by letting them concentrate on the parts of the application that really matter in an FPGA implementation.
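The difference between an ignorable and a critical loop-carried dependency can be illustrated with two small loops. This is a generic sketch with hypothetical function names. It is not taken from the benchmark and does not show how SLX classifies dependencies internally.

```cpp
#include <cstddef>

// Reduction: acc depends on its value from the previous iteration, but
// integer addition is associative, so the loop can be restructured
// (for example into partial sums) without changing the result.
int sum_reduction(const int* x, std::size_t n) {
    int acc = 0;
    for (std::size_t i = 0; i < n; ++i) {  // i is an induction variable
        acc += x[i];
    }
    return acc;
}

// True recurrence: y[i] needs the finished value of y[i-1], so iteration i
// cannot complete before iteration i-1. This kind of dependency limits
// how far the loop can be pipelined or unrolled.
void running_filter(const int* x, int* y, std::size_t n) {
    if (n == 0) {
        return;
    }
    y[0] = x[0];
    for (std::size_t i = 1; i < n; ++i) {
        y[i] = (y[i - 1] >> 1) + x[i];
    }
}
```

Both loops carry a value from one iteration to the next, but only the second one forces the iterations to execute strictly in sequence.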
Figure 7: SLX FPGA Code Generation Wizard showing automated pragma insertion
Clicking the “Generate HLS Code” button opens the code transformation wizard, as shown in figure 7. Here, the user can inspect the generated code side by side with the original version and select or deselect individual pragmas to fine-tune the implementation.
After synthesizing the design, we compare the performance and resource utilization of the SLX-optimized kernel with the original unoptimized one. For this particular design, we allowed SLX FPGA to use all available resources on the selected device; however, additional constraints can be applied if necessary. Table 1 shows a summary of the results. We see a 7.8x reduction in latency at the cost of a 3x increase in LUTs, a 2.4x increase in FFs, and a 2.7x increase in DSP blocks. This increase in resource utilization is not a big concern for an Alveo card, since all resources remain under 5% utilization. Should greater performance be required, SLX FPGA offers additional analysis capabilities that help designers refactor their code more quickly and effectively.
Table 1 – Original Vitis Library compared to SLX FPGA instrumented code
Importing the Xilinx Object in Vitis application project
The hls folder in an SLX FPGA project contains a Vitis HLS project with the SLX-optimized source code. We open this project with Vitis HLS and export the RTL as a Xilinx object (.xo), as shown in figure 8. Before exporting to Vitis, we need to wrap the top-level function in an extern "C" block to ensure C linkage.
Figure 8: Exporting the Xilinx Object from Vitis HLS
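The extern "C" wrapper mentioned above can be sketched as follows. The names scale_impl and scale_top are hypothetical placeholders. In the actual project, the wrapped function would be the top-level hardware function face_detect_sw with its real argument list, which is not reproduced here.

```cpp
// Hypothetical C++ implementation, standing in for the optimized kernel body.
static void scale_impl(const int* in, int* out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = 2 * in[i];
    }
}

// extern "C" switches off C++ name mangling for the enclosed declarations,
// so the exported symbol is exactly "scale_top" and can be referenced by
// that plain name when the kernel is linked into the application.
extern "C" {
void scale_top(const int* in, int* out, int n) {
    scale_impl(in, out, n);
}
}
```

Only the top-level entry point needs C linkage. Helper functions called from it can remain ordinary C++ functions.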
Within a Vitis workspace, create a new application with an Alveo U280 card as the target device, as shown in figure 9.
Figure 9: Creating an application project in Vitis Unified Platform
Once the project is created, we import the .xo file into the src folder for kernels, as shown in figure 10. After importing the .xo file, click the “add hardware function” button and select face_detect_sw from the list.
Figure 10: Importing the kernel in a Vitis application project
The developer is now able to create the wider application, which runs on the x86 host, taking advantage of the accelerated face_detect_sw kernel.
Figure 11: System diagram in Vitis Analyzer
This application note has demonstrated how SLX FPGA can be used to optimize a kernel targeted for PCIe-connected Alveo cards, leveraging the Vitis bottom-up kernel flow. For this example, SLX FPGA was able to reduce the latency of a commonly used AI kernel for face detection. The approach can be applied to most Xilinx-based data-center applications, including Amazon F1 instances, whether you are developing an application from scratch or reusing and customizing an existing design.