MATLAB® is the go-to toolbox for high-level algorithm design in many application domains, ranging from signal processing to control systems and data analysis. MATLAB Coder generates executable C/C++ code from MATLAB implementations. However, the performance requirements of these applications often mandate a hardware implementation. SLX FPGA helps transform the auto-generated C/C++ code into a synthesizable, optimized, and hardware-aware implementation for high-level synthesis (HLS).
To create a hardware implementation from a MATLAB design, we have two options: (1) using MATLAB HDL Coder™ to generate a synthesizable HDL implementation, or (2) using MATLAB Coder™ to generate C/C++ code and passing it on to HLS tools that translate it into an HDL implementation. The first option is comparatively restricted in terms of the MATLAB functions, language constructs, and Simulink blocks supported by HDL Coder. Moreover, it often requires hardware design expertise for tweaking and/or integrating the generated HDL code with components/features designed outside MATLAB. The second option is less restrictive, with a much wider range of MATLAB functions and language constructs supported by the MATLAB C/C++ code generation tools; however, the generated code is not targeted for hardware implementation and often is not synthesizable. SLX FPGA is the perfect tool to bridge that gap.
Figure 1: Overview of methodology
Figure 1 gives an overview of the methodology. We start with a MATLAB design and generate C/C++ code using the MATLAB Coder tools. This C/C++ code is passed to SLX FPGA for synthesizability checking and optimization. Using static and dynamic analysis techniques, SLX FPGA performs automatic code refactoring and generates hints that guide the developer on optimization and synthesizability. After these automated and guided refactoring iterations, we get synthesizable, optimized, and hardware-aware C/C++ code with HLS pragmas inserted that drive further optimizations during synthesis. SLX FPGA then invokes the Xilinx Vivado™ tools to generate a hardware implementation of the application.
● Refactor non-synthesizable C/C++ code for HLS - Helps programmers with automated and guided refactoring of non-synthesizable code. For example, automatically replacing non-synthesizable library function calls with synthesizable ones, and generating detailed hints that guide the developer in transforming non-synthesizable pieces of code into synthesizable ones.
● Parallelism detection - Efficiency of FPGA implementations relies heavily on parallelism. Using a combination of static and dynamic analysis techniques, SLX FPGA detects parallelism and guides the developer on how to exploit the parallelism. SLX FPGA also flags roadblocks for parallelism and helps the user eliminate the roadblocks to drive additional parallelism.
● Hardware optimization - Explores appropriate function pipelining and loop unrolling, array partitioning schemes that keep the hardware supplied with data, and the design space of interfaces available on the target platform.
● Pragma insertion - Automatically inserts HLS pragmas that guide the compiler's optimizations. HLS pragmas include various parameters that require tuning. SLX leverages static and dynamic analysis data and combines it with optimization algorithms during the exploration of these pragmas and their various parameters.
● Integration and synthesis – Generates a Vivado HLS project file and exports it to Vivado HLS for synthesis.
We take the Kalman filter, an example shipped with MATLAB, to investigate the effectiveness of using SLX FPGA for facilitating efficient high-level synthesis from MATLAB-generated C code. The Kalman filter is a prediction algorithm that uses a time series of measurements containing statistical noise to produce estimates of unknown variables. Its numerous applications include control systems, navigation, and financial forecasting. The next section gives an overview of the methodology, followed by a discussion of results.
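For readers unfamiliar with the algorithm, the predict/update recurrence can be sketched in plain C. This is a simplified scalar (one-dimensional) version with hypothetical names such as `kf1d_step`; it only illustrates the recurrence and is not the multi-dimensional tracking code of the shipped MATLAB example:

```c
/* Minimal scalar Kalman filter: estimates a slowly varying value
 * from noisy measurements. x = state estimate, p = estimate
 * uncertainty, q = process noise, r = measurement noise. */
typedef struct {
    float x;
    float p;
    float q;
    float r;
} kf1d;

static float kf1d_step(kf1d *kf, float z)
{
    /* Predict: the state model is constant, so only uncertainty grows. */
    kf->p += kf->q;

    /* Update: blend the prediction with the new measurement z. */
    float k = kf->p / (kf->p + kf->r);   /* Kalman gain */
    kf->x += k * (z - kf->x);
    kf->p *= (1.0f - k);
    return kf->x;
}
```

Feeding a stream of noisy measurements through `kf1d_step` drives the estimate toward the underlying signal while damping the noise.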
MATLAB Coder™ generates portable and reliable C/C++ code from MATLAB designs. It supports most of the MATLAB language and a wide range of built-in functions and toolboxes. By default, MATLAB-generated code uses dynamic memory and recursion, which are not synthesizable. However, support for MISRA compliance allows us to generate code with fewer synthesizability problems. Figure 2 shows a screenshot of the MISRA compliance settings used for generating C code from the Kalman filter example.
Figure 2: MISRA compliance setting in MATLAB Coder.
We can also select the numeric format for code generation, i.e., fixed point, single-precision floating point, or double-precision floating point. For this case study we use single-precision floating point. The generated code is readable, preserves MATLAB comments, and line-by-line relationships between the MATLAB source and the generated code can be visualized for inspection. The next step is creating an SLX FPGA project and importing the generated code into it.
The first step after successful code generation is to make sure the code is synthesizable. SLX provides support for synthesizability checking and helps programmers with automatic and guided refactoring of non-synthesizable code. Figure 3 shows that the synthesizability check for this use case failed due to a non-synthesizable memset call in the ObjTrack function. The third column in the upper part of Figure 3 shows the line profiling information: this function call accounts for 0.01% of the pure software execution time. Since it is not performance critical, we simply replace it with a for loop and rerun the synthesizability tests to confirm that the code is now synthesizable.
Figure 3: Synthesizability hints with SLX FPGA
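This kind of memset replacement can be illustrated with a minimal sketch; the function and array names below are hypothetical, not taken from the generated Kalman filter code:

```c
#include <string.h>

#define BUF_LEN 16

/* Non-synthesizable form typical of MATLAB-generated code:
 * a call into the C standard library. */
static void clear_ref(float buf[BUF_LEN])
{
    memset(buf, 0, BUF_LEN * sizeof(float));
}

/* Synthesizable replacement: an explicit loop that an HLS
 * compiler can analyze, unroll, or pipeline. */
static void clear_hls(float buf[BUF_LEN])
{
    for (int i = 0; i < BUF_LEN; i++)
        buf[i] = 0.0f;
}
```

Both versions have identical behavior in software, which makes this a safe refactoring for a call that is not performance critical.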
FPGA performance relies heavily on exploiting parallelism in the application. SLX FPGA supports two types of parallelization patterns: (1) data-level parallelism (DLP) and (2) pipeline-level parallelism (PLP). Data-level parallelism is extracted for loops in which every iteration is either independent or can be mapped efficiently to a highly parallel hardware architecture (e.g., a multiply-add tree). It is worth noting that the parallelism detection algorithms in SLX assume that the application's computations will be mapped onto reconfigurable hardware, which expands the set of detected parallelism patterns compared to traditional parallelizing compilers that target code running on regular processors. Even in the presence of loop-carried dependencies that hinder DLP, it may be beneficial to pipeline the execution of several iterations of a loop. SLX detects the cases where pipelining the loop body yields benefits, even before automatically inserting the pragmas in the code.
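The two patterns can be contrasted with a minimal C sketch (hypothetical functions, not from the generated code): the first loop has fully independent iterations and exposes DLP, while the second carries a dependency through its accumulator, so it is only a candidate for pipelining (PLP):

```c
#define LEN 64

/* DLP: every iteration is independent of the others, so an HLS
 * compiler can unroll the loop and execute iterations in parallel. */
static void scale(const float in[LEN], float out[LEN], float a)
{
    for (int i = 0; i < LEN; i++)
        out[i] = a * in[i];
}

/* Loop-carried dependency: each iteration needs the previous value
 * of acc, which blocks DLP. Iterations can still be overlapped in a
 * hardware pipeline (PLP), or restructured as a reduction tree. */
static float accumulate(const float in[LEN])
{
    float acc = 0.0f;
    for (int i = 0; i < LEN; i++)
        acc += in[i];
    return acc;
}
```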
Figure 4: SLX parallelization hints.
For this use case, 99.85% of the execution time is spent in a for-loop nest that forms the main kernel of the application. Both data-level and pipeline-level parallelization options are explored; Figure 4 shows the results of this exploration. PLP is detected in the main application loop and reported to the user in the form of hints. It is worth mentioning that although PLP has been detected, there may exist a combination of other patterns that yields more speedup, and this is considered in the HW optimization stage; the parallelism detection stage merely reports parallelization opportunities and blockers together with their predicted impact on application performance. An example of such a blocker can be seen in Figure 4: even though DLP has a potential speed-up of 300x, it is not selected due to loop-carried dependencies on several variables. In some cases the developer could use this information to transform the algorithm and resolve the dependencies; here, however, such a transformation is not trivial.
SLX brings HLS-awareness to generic C/C++ by exploring different array partitioning options and the design space of interfaces available. SLX automatically inserts HLS pragmas for these design choices that would otherwise require several synthesis iterations and a deep understanding of the platform hardware.
Figure 5: Automatic HLS pragma insertion
Figure 5 shows the output after running pragma insertion on the Kalman filter use case. We see three types of HLS pragmas inserted: (1) the pipeline pragma, (2) the trip-count pragma, and (3) the array partitioning pragmas. The pipeline and trip-count pragmas guide the synthesis and performance estimation of the parallel pipeline selected during the exploration phase. Array partitioning manages the bandwidth of the synthesized memories so that they do not become performance bottlenecks, while respecting the resources available on the platform.
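To show what such pragma-annotated code looks like, the sketch below uses a hypothetical dot-product kernel (not the actual SLX output for the Kalman filter); the partitioning factor and initiation interval are assumed values chosen for illustration:

```c
#define VLEN 32

/* Illustrative Vivado HLS pragma usage: PIPELINE overlaps loop
 * iterations, LOOP_TRIPCOUNT informs latency estimation, and
 * ARRAY_PARTITION splits arrays across memories for bandwidth.
 * In software these pragmas are ignored; they only affect synthesis. */
static void dot(const float a[VLEN], const float b[VLEN], float *result)
{
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=4
    float acc = 0.0f;
    for (int i = 0; i < VLEN; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_TRIPCOUNT min=32 max=32
        acc += a[i] * b[i];
    }
    *result = acc;
}
```

Because a regular C compiler simply ignores unknown pragmas, the annotated code still compiles and runs unchanged in software, which keeps one code base for both verification and synthesis.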
After pragma insertion, SLX FPGA automatically repurposes the user project and generates a Vivado HLS project that is synthesized to get the final performance and resource consumption for the accelerated application functions. This performance is fed back to the user, who can then decide whether further application optimization is needed. The generated project can be exported to Vivado HLS for further optimization, RTL co-simulation and verification, or to be packaged for integration in a Vivado IP Integrator system.
Table 1 summarizes the results for our use case. The results show a performance improvement of more than 62x when SLX is used for exploring parallelization and array partitioning options and automatically adding HLS pragmas.
Table 1: Performance results with SLX auto-inserted pragmas
The speed-up comes from better utilization of the available hardware resources. Table 2 summarizes the resource utilization for both cases, i.e., when no HLS pragmas are inserted and the MATLAB-generated code is passed on to the Vivado HLS tools directly after making it synthesizable, and when SLX is used to automatically add HLS pragmas. We see that the resource utilization is significantly higher when the SLX pragmas are used.
Table 2: Utilization summaries
In summary, this use case demonstrates that SLX is highly effective at improving the performance of an HLS implementation of this MATLAB-generated code. With a few simple clicks we can get performance improvements that would otherwise require several design cycles and significant hardware design expertise.