# Advancing OpenCL™ for FPGAs

## Boosting Performance with Intel® FPGA SDK for OpenCL™ Technology

Field programmable gate arrays (FPGAs) are capable of very high performance, especially power-performance. This is perhaps not surprising. After all, FPGA hardware itself is malleable―configurable to match the application rather than the other way around. Also not surprising is that this additional degree of freedom―that the application developer can change the hardware as well as the software―should lead to increased complexity everywhere in the application development workflow.

And indeed, this has been the case. Until recently, most developers of FPGA applications relied on tools and methods that have more in common with those used by hardware designers than by software programmers. The languages used have been hardware description languages (HDLs) such as Verilog* and VHDL*. These describe the nature of the logic rather than a flow of instructions. The compile times (called synthesis) have been very long. And an uncomfortable amount of system knowledge has been required, especially for debug and test.

All of this is now changing. Intel has created Intel® FPGA SDK for OpenCL™ technology^{1}, which provides an alternative to HDL programming. The technology and related tools belong to a class of techniques called high-level synthesis (HLS) that enable designs to be expressed with higher levels of abstraction. Intel FPGA SDK for OpenCL technology is now in widespread use. Amazingly for longtime FPGA application developers, the performance achieved is often close to―or even better than―HDL code. But it also seems apparent that achieving this performance is often limited to developers who already know how the C-to-hardware translation works, and who have an in-house toolkit of optimization methods.

At Boston University, we’ve worked on enumerating, characterizing, and systematizing one such optimization toolkit. A number of best-practices documents for FPGA OpenCL already exist. This work augments them, largely by applying additional methods well known to the high-performance computing (HPC) community^{2}. We believe this methodology is on the right track. It’s taken decades for HPC performance programming to reach its current level of sophistication. We shouldn’t be surprised that users of Intel FPGA SDK for OpenCL technology need to follow a similar path and learning curve.

More details on the work described here are available in references 3 and 4 at the end of the article. The first uses the FFT as a detailed case study; the second describes the empirically guided optimization framework. In related work, references 5 and 6 show how we augmented the toolflow to test and verify design functionality and performance without generating hardware for the entire system. As a result, we can identify design bottlenecks and the impact of optimizations with greater accuracy, and thus achieve rapid turnaround. **Figure 1** shows how these pieces fit together with the existing toolflow.

### Empirically Guided Code Optimizations

We’ve proposed a series of systematic and empirically guided code optimizations for OpenCL that augment current best practices and substantially improve performance. Our work characterizes and measures the impact of all these optimizations. This not only enables programmers to follow a script when optimizing their own kernels, but also opens the way for the development of autotuners to perform optimizations automatically.

We broadly categorize code optimizations in this domain into three sets:

- Intel’s best practices (IBPs)
- Universal code optimizations (UCOs)
- FPGA-specific optimizations (FSOs)

IBPs are design strategies given in the Intel Best Practices Guide^{7}, which show how to express hardware using OpenCL semantics. We separate these from UCOs and FSOs because IBPs are well-known to the FPGA OpenCL community and there have been several studies characterizing their behavior.

UCOs consist of general approaches to optimizing programs that, to a large degree, are independent of the compute platform, e.g.:

- Use 1D arrays
- Records of arrays
- Predication
- Loop merging
- Scalar replacement
- Precomputing constants

Though described (for example, in reference 2), they are largely missing from IBP documentation. FSOs consist of a number of FPGA-specific optimizations that typically augment IBPs. They’re based on:

- **Obtaining** a particular FPGA-specific mapping not found as an IBP
- **Facts** stated in IBPs, but which we have leveraged and converted into optimizations
- **Typically used practices** which (we have found) should actually be avoided
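To make the UCOs listed above concrete, here is a minimal plain-C sketch of two of them, predication and scalar replacement. The function and data are illustrative, not taken from the benchmarks:

```c
#include <stddef.h>

/* Branchy version: the conditional update creates a fork in control flow. */
int sum_if_positive_branchy(const int *a, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0)
            sum += a[i];
    }
    return sum;
}

/* UCO version: predication replaces the branch with a conditional value,
 * and the accumulator is a scalar held in a register (scalar replacement). */
int sum_if_positive_predicated(const int *a, size_t n) {
    int sum = 0;  /* scalar accumulator */
    for (size_t i = 0; i < n; i++) {
        sum += (a[i] > 0) ? a[i] : 0;  /* predicated update, no branch */
    }
    return sum;
}
```

Both versions compute the same result; the predicated form gives the compiler a straight-line data path to pipeline.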

There are seven code versions, discussed in detail in references 4 and 6, which are incrementally developed. Each version contains one or more applied optimizations. **Table 1** summarizes the optimizations and their type (IBP, FSO, and/or UCO).

**Table 1. Summary of code versions and optimizations**

| Version | Optimizations | Type |
|---|---|---|
| 0 | (GPU code for porting to FPGA OpenCL) | ― |
| 1 | Single thread code with cache optimization | IBP, FSO |
| 2 | Implement task parallel computations in separate kernels and connect them using channels | IBP |
|   | Fully unroll all loops with #pragma unroll | IBP, UCO |
|   | Minimize variable declaration outside compute loops (use temps where possible) | IBP, UCO |
|   | Use constants for problem sizes and data values (instead of relying on off-chip memory access) | IBP, FSO, UCO |
|   | Coalesce memory operations | IBP, UCO |
| 3 | Implement the entire computation within a single kernel and avoid using channels | FSO |
| 4 | Reduce array sizes to infer pipeline registers as registers instead of BRAMs | FSO |
| 5 | Perform computations in detail, using temporary variables to store intermediate results | FSO, UCO |
| 6 | Use predication instead of conditional branch statements when defining forks in the data path | FSO, UCO |

### Version 0: Sub-Optimal Baseline Code

A popular starting point (for example, in reference 8) is kernels based on multiple work items (MWI), as is appropriate for GPUs. Advantages of starting here include ease of exploiting data parallelism through SIMD and compute unit replication (CUR), which is exclusive to MWI structures.

**Algorithm 1** shows a V0-type kernel (based on reference 9). The core operation is to populate a matrix using known values of the first row and the first column. Each unknown entry is computed from the values of its left, up, and up-left neighbors. This is achieved using loops which iterate in order over all matrix entries. The max function is implemented using “if-else” statements. In **Algorithm 1**, SIZE represents the dimension of the blocks of matrix entries being processed.

**Algorithm 1. Needleman Wunsch-V0**

```
int tx = get_local_id(0)
__local int *temp
__local int *ref
Initialize temp from global memory
barrier(CLK_LOCAL_MEM_FENCE)
Initialize ref from global memory
barrier(CLK_LOCAL_MEM_FENCE)
for i = 1 : SIZE do
    if tx ≤ i then
        compute t_idx_x and t_idx_y based on tx and i
        temp[t_idx_y][t_idx_x] =
            max(temp[t_idx_y-1][t_idx_x-1] + ref[t_idx_y-1][t_idx_x-1],
                temp[t_idx_y][t_idx_x-1] - penalty,
                temp[t_idx_y-1][t_idx_x] - penalty)
    barrier(CLK_LOCAL_MEM_FENCE)
barrier(CLK_LOCAL_MEM_FENCE)
for i = SIZE-2 : 0 do
    Perform computations similar to above
    barrier(CLK_LOCAL_MEM_FENCE)
Store temp to global memory
```

### Version 1: Preferred Baseline Code (Used for Reference)

A less intuitive, but preferred, alternative is to use (as a baseline) single-threaded CPU code. In particular, initial designs should be implemented as single work item (SWI) kernels as recommended by IBPs. SWI kernels can infer and exploit all forms of parallelism effectively, and do so in a more efficient way than MWI kernels. The CPU-like baseline code should also be optimized for cache performance. This:

- **Helps** the compiler infer connectivity between parallel pipelines (i.e., whether data can potentially be directly transferred between pipelines instead of being stored in memory)
- **Improves** bandwidth for on-chip data access
- **Efficiently uses** the internal cache of load-store units, which are responsible for off-chip memory transactions

**Algorithm 2** shows the preferred baseline kernel. The first row and column of the matrix are Vector A and Vector B, respectively.

**Algorithm 2. Needleman Wunsch-V1**

```
for i = 1 : Vector_B_Size do
    for j = 1 : Vector_A_Size do
        Out[i,j] = max(Out[i-1,j] - penalty,
                       Out[i-1,j-1] + ref[i,j],
                       Out[i,j-1] - penalty)
```
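The same recurrence can be written as a small, self-contained C function, which makes the dependence pattern explicit: each entry needs only its up, left, and up-left neighbors. This is a sketch; the matrix sizes, initialization, and function name are illustrative:

```c
#include <stddef.h>

#define H 3  /* rows: Vector B size + 1, illustrative */
#define W 3  /* cols: Vector A size + 1, illustrative */

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Row-major Needleman-Wunsch fill: Out[0][*] and Out[*][0] hold the
 * known first row and column; ref holds the similarity scores. */
void nw_baseline(int Out[H][W], const int ref[H][W], int penalty) {
    for (size_t i = 1; i < H; i++)
        for (size_t j = 1; j < W; j++)
            Out[i][j] = max3(Out[i-1][j] - penalty,      /* up */
                             Out[i-1][j-1] + ref[i][j],  /* up-left */
                             Out[i][j-1] - penalty);     /* left */
}
```

The row-major traversal keeps memory accesses sequential, which is exactly the cache-friendly pattern the SWI baseline needs.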

### Version 2: IBPs

Given the preferred baseline code, we then apply the following commonly used IBPs:

- Multiple task parallel kernels
- Full loop unrolling
- Minimized state register usage
- Constant arrays
- Memory coalescing

**Algorithm 3** shows the Needleman Wunsch kernel structure after we apply IBPs. Parallelism is exploited using a systolic array, with each processing element (PE) implemented in a separate kernel. Channels are used to connect PEs in a specified sequence. For each inner loop iteration, PEs compute consecutive columns within the same row. This ensures spatial locality for memory transactions. The drawback is data dependencies between kernels, which can’t be reliably broken down by the compiler since it optimizes each kernel as an individual entity. Thus, the overhead of synchronizing data paths can result in performance degradation.

**Algorithm 3. Needleman Wunsch-V2**

```
N ← Size of systolic array
PE_k Kernel Begin
    int up, left, up_left, cached_up, cached_up_left
    Initialize cached_up and cached_up_left
    for j = 1 : 1 : Vector_B_Size do
        left ← read_channel(PE_{k-1})
        up = cached_up
        up_left = cached_up_left
        cached_up = max(up - penalty, left - penalty, up_left + ref[j,i])
        cached_up_left = left
        Out[j,i] = cached_up
        write_channel(PE_{k+1}) ← cached_up
```
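Intel's channels are an FPGA-specific OpenCL extension, so they cannot run on a CPU as-is; their producer-consumer behavior can, however, be emulated in plain C with a small FIFO. The sketch below is illustrative (the `channel_t` type and function names are ours, not the SDK's):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal FIFO standing in for an OpenCL channel. */
#define CHAN_DEPTH 16
typedef struct {
    int buf[CHAN_DEPTH];
    size_t head, tail;  /* head: next read slot, tail: next write slot */
} channel_t;

static void write_channel(channel_t *c, int v) {
    assert(c->tail - c->head < CHAN_DEPTH);  /* full channel would stall */
    c->buf[c->tail++ % CHAN_DEPTH] = v;
}

static int read_channel(channel_t *c) {
    assert(c->head < c->tail);  /* empty channel would stall */
    return c->buf[c->head++ % CHAN_DEPTH];
}

/* Two "PEs": one streams values into the channel, the other applies
 * one term of the NW recurrence to whatever arrives. */
void producer_pe(channel_t *c, const int *in, size_t n) {
    for (size_t i = 0; i < n; i++) write_channel(c, in[i]);
}

void consumer_pe(channel_t *c, int *out, size_t n, int penalty) {
    for (size_t i = 0; i < n; i++) {
        int left = read_channel(c);
        out[i] = left - penalty;  /* e.g., the "left - penalty" term */
    }
}
```

In hardware, the asserts correspond to pipeline stalls: a full or empty channel blocks the PE, which is exactly the synchronization overhead the text describes.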

### Version 3: Single-Kernel Design

In Version 3, we merge the IBP-optimized task parallel kernels and declare all compute loops within the same kernel. This is feasible because the compiler can still automatically infer task parallel pipelines. Having a single kernel carries a number of advantages over the multi-kernel approach, e.g.:

- **Inherent** global synchronization
- **Reduced** resource usage and delays through pipeline merging/reordering
- **Simplified** control logic

**Algorithm 4** shows the kernel structure for implementing the systolic array as a single kernel. The compiler can now optimize the entire computation, as opposed to individual PEs. Synchronization overhead is also reduced, since almost all computation is tied to a single loop variable (j). Nested loops are used because, in this particular case, the cost of initiation intervals is outweighed by the reduction in resource usage: the compiler was unable to infer data access patterns when the loops were coalesced.

**Algorithm 4. Needleman Wunsch-V3**

```
N ← Size of systolic array
int value[N+1], left[Vector_B_Size]
left ← Vector_B
for i = 1 : 1 : Vector_A_Size/N do
    base = f(i)
    value ← Vector_A[base : base+N+1]
    for j = 1 : 1 : Vector_B_Size do
        int up_left[N+1]
        for k = 2 : 1 : N+1 do
            up_left[k] = value[k-1]
        value[1] = left[j]
        #pragma unroll
        for k = 2 : 1 : N+1 do
            value[k] = max(value[k-1] - penalty,
                           up_left[k] + ref[j, base+k],
                           value[k] - penalty)
        left[j] = value[N+1]
    Out ← value[2 : N+1]
```
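The single-kernel structure can be checked on a CPU by emulating one systolic block in plain C, with `value[]` playing the role of the pipeline register chain: before the inner update, `value[k]` holds the "up" entry, `up_left[k]` the diagonal, and the freshly updated `value[k-1]` the "left" entry. This sketch assumes N spans the whole row (a single block); all names are illustrative:

```c
#include <stddef.h>

#define N 2  /* systolic array width, illustrative */
#define H 2  /* number of rows (Vector B size), illustrative */

static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* One block of the single-kernel NW. row0 has N+1 entries (known first
 * row), colv has H entries (known first column); each j iteration
 * computes one full output row in place in value[]. */
void nw_systolic_block(const int *row0, const int *colv,
                       const int ref[H][N], int out[H][N], int penalty) {
    int value[N + 1];
    for (size_t k = 0; k <= N; k++) value[k] = row0[k];
    for (size_t j = 0; j < H; j++) {
        int up_left[N + 1];
        for (size_t k = 1; k <= N; k++) up_left[k] = value[k - 1];
        value[0] = colv[j];
        for (size_t k = 1; k <= N; k++)   /* #pragma unroll in OpenCL */
            value[k] = max3(value[k - 1] - penalty,      /* left */
                            up_left[k] + ref[j][k - 1],  /* up-left */
                            value[k] - penalty);         /* up */
        for (size_t k = 1; k <= N; k++) out[j][k - 1] = value[k];
    }
}
```

Because the inner loop only touches `value[]` and `up_left[]`, fully unrolling it yields one pipeline stage per array element, with no inter-kernel channels required.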

### Version 4: Reduced Array Sizes

Having large variable arrays results in pipeline registers being inferred as BRAMs instead of registers, which can significantly degrade the design. Since BRAMs can’t source and sink data with the same throughput as registers, barrel shifters and memory replication are required, drastically increasing resource usage. Moreover, the compiler is unable to launch stall-free iterations of compute loops due to memory dependencies. The solution is to break large arrays corresponding to intermediate variables into smaller ones.

**Algorithm 5** shows the kernel structure for inferring pipeline registers as registers. All arrays are expressed as individual variables, generated using scripts, with the exception of local storage of Vector B in “left,” which has low throughput requirements.

**Algorithm 5. Needleman Wunsch-V4**

```
N ← Size of systolic array
int value_1, value_2, ..., value_N_plus_1
int left[Vector_B_Size]
left ← Vector_B
for i = 1 : 1 : Vector_A_Size/N do
    base = f(i)
    value_1 ← Vector_A[base]
    ...
    value_N_plus_1 ← Vector_A[base+N+1]
    for j = 1 : 1 : Vector_B_Size do
        int up_left_2, ..., up_left_N_plus_1
        up_left_2 = value_1
        ...
        up_left_N_plus_1 = value_N
        value_1 = left[j]
        value_2 = max(value_1 - penalty,
                      up_left_2 + ref[j, base+2],
                      value_2 - penalty)
        ...
        value_N_plus_1 = max(value_N - penalty,
                             up_left_N_plus_1 + ref[j, base+N+1],
                             value_N_plus_1 - penalty)
        left[j] = value_N_plus_1
    Out ← value_2, ..., value_N_plus_1
```
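The array-to-scalars transformation generalizes beyond NW. Below is a minimal C sketch using a 3-tap shift register (illustrative names; in OpenCL, the scalar form is what lets the compiler infer registers rather than BRAM):

```c
#include <stddef.h>

/* Array form: on an FPGA, the compiler may map 'win' to a BRAM,
 * limiting how many taps can be read and written per cycle. */
int moving_sum_array(const int *in, size_t n) {
    int win[3] = {0, 0, 0};
    int last = 0;
    for (size_t i = 0; i < n; i++) {
        win[0] = win[1]; win[1] = win[2]; win[2] = in[i];
        last = win[0] + win[1] + win[2];
    }
    return last;
}

/* Scalar form: each tap is its own variable, so every tap can be
 * inferred as a register and accessed in the same cycle. */
int moving_sum_scalars(const int *in, size_t n) {
    int w0 = 0, w1 = 0, w2 = 0, last = 0;
    for (size_t i = 0; i < n; i++) {
        w0 = w1; w1 = w2; w2 = in[i];
        last = w0 + w1 + w2;
    }
    return last;
}
```

On a CPU the two are equivalent; the point is the hardware mapping, which is why the paper generates the scalar declarations with scripts rather than by hand.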

### Version 5: Detailed Computations

The OpenCL compiler doesn’t reliably break down large computations being assigned to a single variable into intermediate stages. This reduces the number of possible pipeline stages and can result in larger critical paths and data dependency stalls. Our solution is to do computations in as much detail as possible by using intermediate variables to help the compiler infer pipelines. If the logic is already optimal, these variables will be synthesized away and won’t waste resources.

**Algorithm 6** shows the kernel structure after performing computations in detail with a number of intermediate variables added. The “max” function is also explicitly implemented.

**Algorithm 6. Needleman Wunsch-V5**

```
N ← Size of systolic array
int value_1, value_2, ..., value_N_plus_1
int left[Vector_B_Size]
left ← Vector_B
for i = 1 : 1 : Vector_A_Size/N do
    base = f(i)
    value_1 ← Vector_A[base]
    ...
    value_N_plus_1 ← Vector_A[base+N+1]
    for j = 1 : 1 : Vector_B_Size do
        int a_2 = value_1 + ref[j, base+2]
        value_1 = left[j]

        int b_2 = value_1 - penalty
        int a_3 = value_2 + ref[j, base+3]
        int c_2 = value_2 - penalty

        if (a_2 ≥ b_2) && (a_2 ≥ c_2) then
            value_2 = a_2
        else if (b_2 > a_2) && (b_2 ≥ c_2) then
            value_2 = b_2
        else
            value_2 = c_2

        int b_3 = value_2 - penalty
        int a_4 = value_3 + ref[j, base+4]
        int c_3 = value_3 - penalty
        ...
        left[j] = value_N_plus_1
    Out ← value_2, ..., value_N_plus_1
```
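For a single matrix entry, the contrast between a fused update and the detailed form can be sketched in plain C (illustrative names; the detailed version mirrors the staging in Algorithm 6):

```c
static int max3(int a, int b, int c) {
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Fused form: one expression, leaving staging to the compiler. */
int nw_update_fused(int up, int up_left, int left, int r, int penalty) {
    return max3(up - penalty, up_left + r, left - penalty);
}

/* Detailed form: each operand staged in its own intermediate, and the
 * max expanded into explicit comparisons, giving the compiler obvious
 * pipeline cut points. */
int nw_update_detailed(int up, int up_left, int left, int r, int penalty) {
    int a = up_left + r;     /* diagonal term */
    int b = left - penalty;  /* left term */
    int c = up - penalty;    /* up term */
    int result;
    if (a >= b && a >= c)     result = a;
    else if (b > a && b >= c) result = b;
    else                      result = c;
    return result;
}
```

If the fused form was already optimal, the intermediates are synthesized away at no resource cost, as noted above.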

### Version 6: Predication

We optimize conditional operations by explicitly specifying architecture states which ensure the validity of the computation. Since hardware is persistent and will always exist once synthesized, we avoid using conditional branch statements. Instead, variable values are conditionally assigned such that the output of invalid operations is not committed and hence does not impact the overall result. **Algorithm 7** shows the “if-else” operations replaced with conditional assignments.

**Algorithm 7. Needleman Wunsch-V6**

```
...
int a_2 = value_1 + ref[j, base+2]
value_1 = left[j]

int b_2 = value_1 - penalty
int a_3 = value_2 + ref[j, base+3]
int c_2 = value_2 - penalty

int d_2 = (a_2 > b_2) ? a_2 : b_2
value_2 = (c_2 > d_2) ? c_2 : d_2
...
```
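The same selection written with predication, as a plain-C sketch (illustrative names; in hardware the ternaries map to multiplexers rather than divergent control flow):

```c
/* Branch-free NW update: two cascaded conditional assignments replace
 * the if-else chain, so invalid candidates are simply not selected. */
int nw_update_predicated(int up, int up_left, int left, int r, int penalty) {
    int a = up_left + r;       /* diagonal term */
    int b = left - penalty;    /* left term */
    int c = up - penalty;      /* up term */
    int d = (a > b) ? a : b;   /* first mux stage */
    return (c > d) ? c : d;    /* second mux stage */
}
```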

### Hardware Specifications

The designs are implemented using an Intel® Arria® 10 FPGA (10AX115H3F34I2SG) and Intel® FPGA SDK for OpenCL™ technology 16.0. This FPGA has 427,200 ALMs, 1,506K logic elements, 1,518 DSP blocks, and 53 Mb of on-chip storage. For GPU implementations, we use the NVIDIA Tesla* P100 PCIe 12GB GPU with CUDA* 8.0. It has 3,584 CUDA cores and a peak memory bandwidth of 549 GB/s. CPU codes are run on a 14-core, 2.4 GHz **Intel® Xeon® E5-2680 v4 processor** with Intel® C++ Compiler v16.0.1.

### Optimization Characterization

The optimizations are tested for the full OpenCL compilation flow using these benchmarks:

- Needleman Wunsch (NW)
- Fast Fourier Transform (FFT)
- Range Limited Molecular Dynamics (Range Limited)
- Particle Mesh Ewald (PME)
- Dense Matrix-Matrix Multiplication (MMM)
- Sparse Matrix Dense Vector Multiplication (SpMV) and Cyclic Redundancy Check (CRC)

**Table 2** provides a summary of these benchmarks, their associated dwarfs^{8}, tested problem sizes, and applicable code versions that are developed.

| Benchmark | Dwarf | Problem Size | V1 | V2 | V3 | V4 | V5 | V6 |
|---|---|---|---|---|---|---|---|---|
| NW | Dynamic Programming | 16K x 16K Integer Table | • | • | • | • | • | • |
| FFT | Spectral Methods | 64-point Radix-2 1D FFT, 8,192 Vectors | • | • | • | • | • | • |
| Range Limited | N-Body | 180 Particles per Cell, 15% Pass Rate | • | • | • | • | • | • |
| PME | Structured Grids | 1,000,000 Particles, 32³ Grid, 3D Tri-Cubic | • | • | • | • | | |
| MMM | Dense Linear Algebra | 1K x 1K Matrix, Single Precision | • | • | • | • | | |
| SpMV | Sparse Linear Algebra | 1K x 1K Matrix, Single Precision, 5% Sparsity, NZ=51,122 | • | • | • | • | • | |
| CRC | Combinational Logic | 100 MB CRC32 | • | • | • | • | • | |

**Figure 2** shows the results of individual optimizations. In almost all cases, we can see the same trend where traditional optimizations (V2) only result in a fraction of the speedup possible. By applying the additional optimizations on top of V2, performance is improved by orders of magnitude.

The average impact of individual optimizations is shown in **Figure 3**. Generally, each successive set of optimizations applied results in increasing performance. The exception is V5. This is due to higher execution times of V5 for NW and SpMV. In both cases, performing computations in as much detail as possible results in the use of conditional statements that outweigh the benefits of the optimization. Once these statements are removed in V6, the speedup increases.

To demonstrate the overall effectiveness of the approach, we compare the performance of the optimized kernels against existing CPU, GPU, Verilog, and FPGA-OpenCL implementations. **Table 3** lists the references for these designs. They’re either obtained from the literature or implemented using available source code/libraries (denoted by an asterisk). The Verilog FFT measurement from reference 3 has been extended to include off-chip access overhead.

**Table 3. References for existing implementations**

| Benchmark | CPU | GPU | Verilog* | OpenCL™ |
|---|---|---|---|---|
| NW | Rodinia*^{9} | Rodinia*^{9} | Benkrid*^{10} | Zohouri*^{11} |
| FFT | MKL*^{12} | cuFFT*^{13} | Sanaullah*^{3} | Intel^{14} |
| Range Limited | ― | ― | Yang*^{15} | Yang^{15} |
| PME | Ferit^{16} | Ferit*^{16} | Sanaullah^{17} | ― |
| MMM | MKL*^{12} | cuBLAS*^{18} | Shen*^{19} | Spector*^{20} |
| SpMV | MKL*^{12} | cuSPARSE*^{21} | Zhuo*^{22} | OpenDwarfs*^{8} |
| CRC | Brumme*^{23} | ― | Anand*^{24} | OpenDwarfs^{8} |

**Figure 4** shows the average speedup achieved over the CPU code, while **Figure 5** shows the normalized execution times for all implementations. From the results, we observe that our work outperforms multicore CPU implementations by approximately 1.2x; the margin is modest because the CPU codes are written using Intel® Math Kernel Library (Intel® MKL). We’ve also achieved an average of approximately 5x lower execution time than existing FPGA OpenCL work. The GPU speedup of 2.4x relative to our work is due to the use of a high-end GPU (Tesla* P100) against a midrange FPGA (Intel® Arria® 10). We therefore also provide an estimate of high-end FPGA performance (Intel® Stratix® 10) using a conservative factor of 4x to account only for the increase in resources. Results show that the optimized kernels on Stratix 10 are expected to outperform the GPU designs by 65%, on average.

### Conclusions

Comparison with existing Verilog* implementations shows that the OpenCL kernels are, on average, within 12% of hand-tuned HDL. This demonstrates that the optimizations are able to bridge the performance-programmability gap for FPGAs and deliver HDL-like performance using OpenCL.

### Special Thanks To …

This article is from *The Parallel Universe*, Intel’s quarterly magazine on software development tools, tips, and training.

### References

^{1}Intel® FPGA SDK for OpenCL™ technology

^{2}S. Chellappa, F. Franchetti, and M. Pueschel, “How To Write Fast Numerical Code: A Small Introduction,” in Generative and Transformational Techniques in Software Engineering II, Lecture Notes in Computer Science v5235, 2008, pp. 196 – 259.

^{3}A. Sanaullah and M. Herbordt, “FPGA HPC using OpenCL: Case Study in 3D FFT,” in Proceedings of the International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, 2018.

^{4}A. Sanaullah, R. Patel, and M. Herbordt, “An Empirically Guided Optimization Framework for FPGA OpenCL,” in Proceedings of the IEEE Conference on Field Programmable Technology, 2018.

^{5}A. Sanaullah, C. Yang, D. Crawley, and M. Herbordt, “SimBSP: Enabling RTL Simulation for Intel FPGA OpenCL Kernels,” in Proceedings on Heterogeneous High-Performance Reconfigurable Computing, 2018.

^{6}A. Sanaullah and M. Herbordt, “Unlocking Performance-Programmability by Penetrating the Intel FPGA OpenCL Toolflow,” in IEEE High Performance Extreme Computing Conference, 2018.

^{7}“Intel FPGA SDK for OpenCL Pro Edition Best Practices Guide”

^{8}K. Krommydas, A. E. Helal, A. Verma, and W.C. Feng, “Bridging the Performance Programmability Gap for FPGAs via OpenCL: A Case Study with Opendwarfs,” Department of Computer Science, Virginia Polytechnic Institute and State University, Tech. Rep., 2016.

^{9}S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A Benchmark Suite for Heterogeneous Computing,” in IEEE International Symposium on Workload Characterization, 2009, pp. 44 – 54.

^{10}K. Benkrid, Y. Liu, and A. Benkrid, “A Highly Parameterized and Efficient FPGA-Based Skeleton for Pairwise Biological Sequence Alignment,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 17, No. 4, pp. 561 – 570, 2009.

^{11}H. R. Zohouri, N. Maruyamay, A. Smith, M. Matsuda, and S. Matsuoka, “Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs,” in International Conference for High Performance Computing, Networking, Storage and Analysis, SC16, 2016, pp. 409 – 420.

^{12}Intel® Math Kernel Library

^{13}Nvidia, “cuFFT Library,” 2010.

^{14}FFT (1D) Design Example

^{15}C. Yang, J. Sheng, R. Patel, A. Sanaullah, V. Sachdeva, and M. Herbordt, “OpenCL for HPC with FPGAs: Case Study in Molecular Electrostatics,” in IEEE High Performance Extreme Computing Conference, 2017.

^{16}F. Buyukkececi, O. Awile, and I. F. Sbalzarini, “A Portable OpenCL Implementation of Generic Particle-Mesh and Mesh-Particle Interpolation in 2D and 3D,” Parallel Computing, Vol. 39, No. 2, pp. 94 – 111, 2013.

^{17}A. Sanaullah, A. Khoshparvar, and M. C. Herbordt, “FPGA-Accelerated Particle-Grid Mapping,” in Field-Programmable Custom Computing Machines (FCCM), 2016 IEEE 24th Annual International Symposium on. IEEE, 2016, pp. 192 – 195.

^{18}Nvidia, “CUBLAS Library,” NVIDIA Corporation, Santa Clara, CA, 2008.

^{19}J. Shen, Y. Qiao, Y. Huang, M. Wen, and C. Zhang, “Towards a Multi-Array Architecture for Accelerating Large-Scale Matrix Multiplication on FPGAs,” in Circuits and Systems (ISCAS), 2018 IEEE International Symposium on. IEEE, 2018, pp. 1 – 5.

^{20}Q. Gautier, A. Althoff, P. Meng, and R. Kastner, “Spector: An OpenCL FPGA Benchmark Suite,” in International Conference on Field-Programmable Technology, 2016.

^{21}Nvidia, “CuSparse Library,” NVIDIA Corporation, Santa Clara, CA, 2014.

^{22}L. Zhuo and V. K. Prasanna, “Sparse Matrix-Vector Multiplication on FPGAs,” in Proceedings of the International Symposium on Field-Programmable Gate Arrays, 2005.

^{23}S. Brumme, “Fast CRC32,” http://create.stephan-brumme.com/crc32/, 2018.

^{24}P. A. Anand et al., “Design of High Speed CRC Algorithm for Ethernet on FPGA using Reduced Lookup Table Algorithm,” in India Conference (INDICON), 2016 IEEE Annual. IEEE, 2016, pp. 1 – 6.