Advancing OpenCL™ for FPGAs

Boosting Performance with Intel® FPGA SDK for OpenCL™ Technology

Field programmable gate arrays (FPGAs) are capable of very high performance, especially power-performance. This is perhaps not surprising. After all, FPGA hardware itself is malleable―configurable to match the application rather than the other way around. Also not surprising is that this additional degree of freedom―that the application developer can change the hardware as well as the software―should lead to increased complexity everywhere in the application development workflow.

And indeed, this has been the case. Until recently, most developers of FPGA applications relied on tools and methods that have more in common with those used by hardware designers than by software programmers. The languages used have been hardware description languages (HDLs) such as Verilog* and VHDL*. These describe the nature of the logic rather than a flow of instructions. The compile times (called synthesis) have been very long. And an uncomfortable amount of system knowledge has been required, especially for debug and test.

All of this is now changing. Intel has created Intel® FPGA SDK for OpenCL™ technology1, which provides an alternative to HDL programming. The technology and related tools belong to a class of techniques called high-level synthesis (HLS) that enable designs to be expressed with higher levels of abstraction. Intel FPGA SDK for OpenCL technology is now in widespread use. Amazingly for longtime FPGA application developers, the performance achieved is often close to―or even better than―HDL code. But it also seems apparent that achieving this performance is often limited to developers who already know how the C-to-hardware translation works, and who have an in-house toolkit of optimization methods.

At Boston University, we’ve worked on enumerating, characterizing, and systematizing one such optimization toolkit. A number of best-practices documents for FPGA OpenCL already exist. Our work augments them, largely by applying additional methods well known to the high-performance computing (HPC) community2. We believe this methodology is on the right track. It’s taken decades for HPC performance programming to reach its current level of sophistication. We shouldn’t be surprised that users of Intel FPGA SDK for OpenCL technology need to follow a similar path and learning curve.

Please note that you can see more details on the work described here in references 3 and 4 at the end of the article. The first uses the FFT as a detailed case study. The second describes the empirically guided optimization framework. Also of potential interest, in related work, references 5 and 6 show how we augmented the toolflow, which can be used to test/verify design functionality and performance without generating hardware for the entire system. As a result, we can identify design bottlenecks and the impact of optimizations with greater accuracy, and thus achieve rapid turnaround. Figure 1 shows how these pieces fit together with the existing toolflow.

Empirically Guided Code Optimizations

We’ve proposed a series of systematic and empirically guided code optimizations for OpenCL that augment current best practices and substantially improve performance. Our work characterizes and measures the impact of all these optimizations. This not only enables programmers to follow a script when optimizing their own kernels, but also opens the way for the development of autotuners to perform optimizations automatically.

Figure 1 – The standard Intel® FPGA SDK for OpenCL™ technology toolflow is shown in light blue. Our augmentations to the standard toolflow are shown in yellow, green, and purple. This article describes the systematic optimizations.

We broadly categorize code optimizations in this domain into three sets:

  1. Intel’s best practices (IBPs)
  2. Universal code optimizations (UCOs)
  3. FPGA-specific optimizations (FSOs)

IBPs are design strategies given in the Intel Best Practices Guide7, which show how to express hardware using OpenCL semantics. We separate these from UCOs and FSOs because IBPs are well-known to the FPGA OpenCL community and there have been several studies characterizing their behavior.

UCOs consist of general approaches to optimizing programs that, to a large degree, are independent of the compute platform, e.g.:

  • Use 1D arrays
  • Records of arrays
  • Predication
  • Loop merging
  • Scalar replacement
  • Precomputing constants
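
As a concrete illustration of two entries on this list, here is a minimal C sketch of scalar replacement and precomputing a loop-invariant constant. This is hypothetical illustrative code (the function names are ours), not one of the paper’s kernels:

```c
#include <assert.h>

/* Before: the accumulator lives in an array element, creating a
 * load/store dependence each iteration, and the loop-invariant
 * term (scale * 0.5f) is recomputed on every trip. */
float sum_naive(const float *a, int n, float scale) {
    float acc[1] = {0.0f};
    for (int i = 0; i < n; i++)
        acc[0] += a[i] * (scale * 0.5f);
    return acc[0];
}

/* After: scalar replacement keeps the accumulator in a plain
 * variable (a register), and the constant is precomputed once. */
float sum_opt(const float *a, int n, float scale) {
    const float k = scale * 0.5f;  /* precomputed constant */
    float acc = 0.0f;              /* scalar accumulator   */
    for (int i = 0; i < n; i++)
        acc += a[i] * k;
    return acc;
}
```

Both versions compute the same result; the second simply gives the compiler less work to prove.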

Though described (for example, in reference 2), they are largely missing from IBP documentation. FSOs consist of a number of FPGA-specific optimizations that typically augment IBPs. They’re based on:

  • Obtaining a particular FPGA-specific mapping not found as an IBP
  • Facts stated in IBPs, but which we have leveraged and converted into optimizations
  • Typically used practices which (we have found) should actually be avoided

Seven code versions, developed incrementally and discussed in detail in references 4 and 6, are presented. Each version contains one or more applied optimizations. Table 1 summarizes the optimizations and their type (IBP, FSO, and/or UCO).

Table 1. Summary of code versions and optimizations

Version | Optimizations | Type
0 | GPU code for porting to FPGA OpenCL | (none)
1 | Single-thread code with cache optimization | IBP, FSO
2 | Implement task parallel computations in separate kernels and connect them using channels | IBP
2 | Fully unroll all loops with #pragma unroll | IBP, UCO
2 | Minimize variable declaration outside compute loops (use temps where possible) | IBP, UCO
2 | Use constants for problem sizes and data values (instead of relying on off-chip memory access) | IBP, FSO, UCO
2 | Coalesce memory operations | IBP, UCO
3 | Implement the entire computation within a single kernel and avoid using channels | FSO
4 | Reduce array sizes to infer pipeline registers as registers instead of BRAMs | FSO
5 | Perform computations in detail, using temporary variables to store intermediate results | FSO, UCO
6 | Use predication instead of conditional branch statements when defining forks in the data path | FSO, UCO


Version 0: Sub-Optimal Baseline Code

A popular starting point (for example, in reference 8) is kernels based on multiple work items (MWI), as is appropriate for GPUs. Advantages of starting here include ease of exploiting data parallelism through SIMD, and compute unit replication (CUR), which is exclusive to MWI structures.

Algorithm 1 shows a V0-type kernel (based on reference 9). The core operation is to populate a matrix using known values of the first row and the first column. Each unknown entry is computed based on the values of its left, up, and up-left locations. This is achieved using loops which iterate in order over all matrix entries. The max function is implemented using “if-else” statements. In Algorithm 1, SIZE represents the dimension of blocks of matrix entries being processed.

Algorithm 1. Needleman Wunsch-V0

   1: int tx = get_local_id(0)
   2: __local int* temp
   3: __local int* ref
   4: Initialize temp from global memory
   5: barrier(CLK_LOCAL_MEM_FENCE);
   6: Initialize ref from global memory
   7: barrier(CLK_LOCAL_MEM_FENCE);
   8: for i = 1 : SIZE do
   9:    if tx ≤ i then
   10:      compute t_idx_x and t_idx_y based on tx and i
   11:      temp[t_idx_y][t_idx_x] =
   12:         max(temp[t_idx_y-1][t_idx_x-1] +
   13:            ref[t_idx_y-1][t_idx_x-1],
   14:            temp[t_idx_y][t_idx_x-1] - penalty,
   15:            temp[t_idx_y-1][t_idx_x] - penalty);
   16:   barrier(CLK_LOCAL_MEM_FENCE);
   17: barrier(CLK_LOCAL_MEM_FENCE);
   18: for i = SIZE - 2 : 0 do
   19:    Perform computations similar to above
   20: barrier(CLK_LOCAL_MEM_FENCE);
   21: Store temp to global memory

Version 1: Preferred Baseline Code (Used for Reference)

A less intuitive, but preferred, alternative is to use (as a baseline) single-threaded CPU code. In particular, initial designs should be implemented as single work item (SWI) kernels as recommended by IBPs. SWI kernels can infer and exploit all forms of parallelism effectively, and do so in a more efficient way than MWI kernels. The CPU-like baseline code should also be optimized for cache performance. This:

  • Helps the compiler infer connectivity between parallel pipelines (i.e., whether data can potentially be directly transferred between pipelines instead of being stored in memory)
  • Improves bandwidth for on-chip data access
  • Efficiently uses the internal cache of the load-store units responsible for off-chip memory transactions

Algorithm 2 shows the preferred baseline kernel. The first row and column of the matrix are Vector A and Vector B, respectively.

Algorithm 2. Needleman Wunsch-V1

   1: for i = 1 : Vector_B_Size do
   2:       for j = 1 : Vector_A_Size do
   3:          Out[i,j] = max( Out[i-1, j ] - penalty, 
   4:             Out[i-1,j-1] + ref[i,j ] , Out[i,j -1] - penalty)
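
The V1 recurrence can also be modeled in plain C. The following is a hypothetical host-side sketch (the fixed array sizes, `max3` helper, and zeroed `ref` score matrix are our assumptions), not the actual OpenCL kernel:

```c
#include <assert.h>

#define ROWS 4
#define COLS 4

static int max3(int a, int b, int c) {
    int m = (a > b) ? a : b;
    return (m > c) ? m : c;
}

/* Fills Out[i][j] for i,j >= 1 from an initialized first row and
 * column, mirroring Algorithm 2's single-work-item loop nest.
 * The row-major traversal keeps accesses cache-friendly, as the
 * V1 guidelines require. */
void nw_v1(int Out[ROWS][COLS], int ref[ROWS][COLS], int penalty) {
    for (int i = 1; i < ROWS; i++)
        for (int j = 1; j < COLS; j++)
            Out[i][j] = max3(Out[i-1][j] - penalty,
                             Out[i-1][j-1] + ref[i][j],
                             Out[i][j-1] - penalty);
}
```

Ported directly as an SWI kernel, this loop nest lets the compiler pipeline the whole computation rather than scheduling work items.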

Version 2: IBPs

Given the preferred baseline code, we then apply the following commonly used IBPs:

  • Multiple task parallel kernels
  • Fully unroll all loops
  • Minimizing state register usage
  • Constant arrays
  • Coalescing

Algorithm 3 shows the Needleman Wunsch kernel structure after we apply IBPs. Parallelism is exploited using a systolic array, with each processing element (PE) implemented in a separate kernel. Channels are used to connect PEs in a specified sequence. For each inner loop iteration, PEs compute consecutive columns within the same row. This ensures spatial locality for memory transactions. The drawback is data dependencies between kernels, which can’t be reliably broken down by the compiler since it optimizes each kernel as an individual entity. Thus, the overhead of synchronizing data paths can result in performance degradation.

Algorithm 3. Needleman Wunsch-V2

   1: N ← Size of systolic array
   2: PEk Kernel Begin
   3: int up, left, up_left, cached_up, cached_up_left
   4:    Initialize cached_up & cached_up_left
   5:    for j = 1 : 1 : Vector_B_Size do
   6:       left ← read_channel(PEk-1)
   7:       up = cached_up
   8:       up_left = cached_up_left
   9:       cached_up = max(up - penalty,
   10:         left - penalty, up_left + ref[j,i])
   11:      cached_up_left = left
   12:      Out[j,i] = cached_up
   13:      write_channel(PEk+1) ← cached_up

Version 3: Single-Kernel Design

In Version 3, we merge the IBP-optimized task parallel kernels and declare all compute loops within the same kernel. This is because the compiler is still able to automatically infer task parallel pipelines. Having a single kernel carries a number of advantages over the multi-kernel approach, e.g.:

  • Inherent global synchronization
  • Reduced resource usage and delays through pipeline merging/reordering
  • Simplified control logic

Algorithm 4 shows the kernel structure for implementing the systolic array as a single kernel. The compiler can now optimize the entire computation, as opposed to individual PEs. Synchronization overhead is also reduced, since almost all computation is tied to a single loop variable (j). Nested loops are used because, in this particular case, the cost of initiation intervals is outweighed by the reduction in resource usage. This is because the compiler was unable to infer data access patterns when the loops were coalesced.

Algorithm 4. Needleman Wunsch-V3

   1: N ← Size of systolic array
   2: int value[N+1], left[Vector_B_Size]
   3: left ← Vector_B
   4: for i = 1 : 1 : Vector_A_Size/N do
   5:    base = f(i)
   6:    value ← Vector_A[base : base+N+1]
   7:    for j = 1 : 1 : Vector_B_Size do
   8:       int up_left[N+1]
   9:       for k = 2 : 1 : N+1 do
   10:         up_left[k] = value[k-1]
   11:      value[1] = left[j]
   12:      #pragma unroll
   13:      for k = 2 : 1 : N+1 do
   14:         value[k] = max(value[k-1] - penalty,
   15:            up_left[k] + ref[j, base+k], value[k] - penalty)
   16:      left[j] = value[N+1]
   17:      Out ← value[2 : N+1]

Version 4: Reduced Array Sizes

Having large variable arrays results in pipeline registers being inferred as BRAMs instead of registers, which can have significant drawbacks on the design. Since BRAMs can’t source and sink data with the same throughput as registers, barrel shifters and memory replication are required. This drastically increases resource usage. Moreover, the compiler is also unable to launch stall-free iterations of compute loops due to memory dependencies. The solution is to break large arrays corresponding to intermediate variables into smaller ones.
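
The transformation can be illustrated with a hypothetical C sketch of a 3-stage shift pipeline (our own example, not taken from the benchmarks): the array version risks being mapped to BRAM on an FPGA, while the scalar version makes the individual pipeline registers explicit:

```c
#include <assert.h>

/* Array form: `pipe[]` may be inferred as BRAM, whose limited
 * read/write ports force replication and barrel shifters. */
int shift_sum_array(const int *in, int n) {
    int pipe[3] = {0, 0, 0};
    int sum = 0;
    for (int i = 0; i < n; i++) {
        pipe[2] = pipe[1];
        pipe[1] = pipe[0];
        pipe[0] = in[i];
        sum += pipe[2];
    }
    return sum;
}

/* Scalar form (the V4 transformation): each stage is a named
 * variable, so the compiler infers simple pipeline registers. */
int shift_sum_scalars(const int *in, int n) {
    int s0 = 0, s1 = 0, s2 = 0;
    int sum = 0;
    for (int i = 0; i < n; i++) {
        s2 = s1;      /* stage 2 register */
        s1 = s0;      /* stage 1 register */
        s0 = in[i];   /* stage 0 register */
        sum += s2;
    }
    return sum;
}
```

The two functions are behaviorally identical; only the storage the hardware compiler infers differs.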

Algorithm 5 shows the kernel structure for inferring pipeline registers as registers. All arrays are expressed as individual variables, generated using scripts, with the exception of local storage of Vector B in “left,” which has low throughput requirements.

Algorithm 5. Needleman Wunsch-V4

   1: N ← Size of systolic array
   2: int value_1, value_2 ... value_N_plus_1
   3: int left[Vector_B_Size]
   4: left ← Vector_B
   5: for i = 1 : 1 : Vector_A_Size/N do
   6:    base = f(i)
   7:    value_1 ← Vector_A[base]
   8:       ↓
   9:    value_N_plus_1 ← Vector_A[base+N+1]
   10:   for j = 1 : 1 : Vector_B_Size do
   11:      int up_left_2 ... up_left_N_plus_1
   12:      up_left_2 = value_1
   13:         ↓
   14:      up_left_N_plus_1 = value_N
   15:      value_1 = left[j]
   16:      value_2 = max(value_1 - penalty,
   17:         up_left_2 + ref[j,base+2], value_2 - penalty)
   18:         ↓
   19:      value_N_plus_1 = max(value_N - penalty,
   20:         up_left_N_plus_1 + ref[j,base+N+1],
   21:         value_N_plus_1 - penalty)
   22:      left[j] = value_N_plus_1
   23:      Out ← value_2 ... value_N_plus_1

Version 5: Detailed Computations

The OpenCL compiler doesn’t reliably break down large computations being assigned to a single variable into intermediate stages. This reduces the number of possible pipeline stages and can result in larger critical paths and data dependency stalls. Our solution is to do computations in as much detail as possible by using intermediate variables to help the compiler infer pipelines. If the logic is already optimal, these variables will be synthesized away and won’t waste resources.
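
As an illustration, consider a single Needleman Wunsch cell update written two ways in C (a hypothetical sketch; the function names are ours): the fused form hides the intermediate stages inside one expression, while the detailed form names every candidate value, as V5 prescribes:

```c
#include <assert.h>

/* Fused form: the whole update is folded into one variable,
 * giving the compiler few explicit pipeline stages to work with. */
int cell_fused(int up, int up_left, int left, int ref, int penalty) {
    int m = up - penalty;
    if (up_left + ref > m) m = up_left + ref;
    if (left - penalty > m) m = left - penalty;
    return m;
}

/* Detailed form (V5): each candidate is a named intermediate,
 * exposing every stage to the compiler. If the logic is already
 * optimal, the temporaries are synthesized away at no cost. */
int cell_detailed(int up, int up_left, int left, int ref, int penalty) {
    int a = up_left + ref;   /* diagonal candidate   */
    int b = up - penalty;    /* vertical candidate   */
    int c = left - penalty;  /* horizontal candidate */
    int result;
    if (a >= b && a >= c)      result = a;
    else if (b > a && b >= c)  result = b;
    else                       result = c;
    return result;
}
```

The explicit if-else chain here mirrors Algorithm 6; Version 6 then replaces it with predication.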

Algorithm 6 shows the kernel structure after performing computations in detail with a number of intermediate variables added. The “max” function is also explicitly implemented.

Algorithm 6. Needleman Wunsch-V5

   1: N ← Size of systolic array
   2: int value_1, value_2 ... value_N_plus_1
   3: int left[Vector_B_Size]
   4: left ← Vector_B
   5: for i = 1 : 1 : Vector_A_Size/N do
   6:    base = f(i)
   7:    value_1 ← Vector_A[base]
   8:       ↓
   9:    value_N_plus_1 ← Vector_A[base+N+1]
   10:   for j = 1 : 1 : Vector_B_Size do
   11:      int a_2 = value_1 + ref[j,base+2];
   12:      value_1 = left[j]
   13:      int b_2 = value_1 - penalty
   14:      int a_3 = value_2 + ref[j,base+3];
   15:      int c_2 = value_2 - penalty
   16:      if ((a_2 ≥ b_2) && (a_2 ≥ c_2))
   17:         value_2 = a_2
   18:      else if ((b_2 > a_2) && (b_2 ≥ c_2))
   19:         value_2 = b_2
   20:      else
   21:         value_2 = c_2
   22:      int b_3 = value_2 - penalty
   23:      int a_4 = value_3 + ref[j,base+4];
   24:      int c_3 = value_3 - penalty
   25:         ↓
   26:      left[j] = value_N_plus_1
   27:      Out ← value_2 ... value_N_plus_1

Version 6: Predication

We optimize conditional operations by explicitly specifying architecture states which ensure the validity of the computation. Since hardware is persistent and will always exist once synthesized, we avoid using conditional branch statements. Instead, variable values are conditionally assigned such that the output of invalid operations is not committed and hence does not impact the overall result. Algorithm 7 shows the “if-else” operations replaced with conditional assignments.

Algorithm 7. Needleman Wunsch-V6

   1: ...
   2: int a_2 = value_1 + ref[j,base+2];
   3: value_1 = left[j]
   4: int b_2 = value_1 - penalty
   5: int a_3 = value_2 + ref[j,base+3];
   6: int c_2 = value_2 - penalty
   7: int d_2 = (a_2 > b_2) ? a_2 : b_2
   8: value_2 = (c_2 > d_2) ? c_2 : d_2
   9: ...
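
In plain C, the V5-to-V6 rewrite of the max selection looks like this (a hypothetical standalone sketch with our own function names, not the full kernel):

```c
#include <assert.h>

/* V5 style: an if-else chain creates a fork in the data path. */
int max3_branch(int a, int b, int c) {
    int v;
    if (a >= b && a >= c)      v = a;
    else if (b > a && b >= c)  v = b;
    else                       v = c;
    return v;
}

/* V6 style: predication. Both comparisons always execute; the
 * losing value is simply never committed, so no branch logic is
 * synthesized, matching Algorithm 7's conditional assignments. */
int max3_pred(int a, int b, int c) {
    int d = (a > b) ? a : b;   /* first select stage  */
    return (c > d) ? c : d;    /* second select stage */
}
```

Both forms return the maximum of the three candidates; only the inferred hardware differs.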

Hardware Specifications

The designs are implemented using an Intel® Arria® 10 AX115H3F34I2SG FPGA and Intel® FPGA SDK for OpenCL™ technology 16.0. This FPGA has 427,200 ALMs, 1,506K logic elements, 1,518 DSP blocks, and 53 Mb of on-chip storage. For GPU implementations, we use the NVIDIA Tesla* P100 PCIe 12GB GPU with CUDA* 8.0. It has 3,584 CUDA cores and a peak memory bandwidth of 549 GB/s. CPU codes are implemented on a 14-core, 2.4 GHz Intel® Xeon® processor E5-2680 v4 with Intel® C++ Compiler v16.0.1.

Optimization Characterization

The optimizations are tested for the full OpenCL compilation flow using these benchmarks:

  • Needleman Wunsch (NW)
  • Fast Fourier Transform (FFT)
  • Range Limited Molecular Dynamics (Range Limited)
  • Particle Mesh Ewald (PME)
  • Dense Matrix-Matrix Multiplication (MMM)
  • Sparse Matrix Dense Vector Multiplication (SpMV)
  • Cyclic Redundancy Check (CRC)

Table 2 summarizes these benchmarks, their associated dwarfs8, and the tested problem sizes.

Table 2. Summary of benchmarks

Benchmark | Dwarf | Problem Size
NW | Dynamic Programming | 16K x 16K
FFT | Spectral Methods | 64-point Radix-2 1D FFT, 8,192
Range Limited | N-Body | 180 Particles per Cell, 15%
PME | Structured Grids | 1,000,000, 32³ Grid, 3D
MMM | Dense Linear Algebra | 1K x 1K
SpMV | Sparse Linear Algebra | 1K x 1K
CRC | Combinational Logic | 100 MB


Figure 2 shows the results of individual optimizations. In almost all cases, we can see the same trend where traditional optimizations (V2) only result in a fraction of the speedup possible. By applying the additional optimizations on top of V2, performance is improved by orders of magnitude.

Figure 2 – Impact of systematic application of the proposed optimizations to a cache-optimized CPU baseline code. In almost all cases, every subsequent code version shows increasing performance, with up to orders of magnitude better performance possible for fully-optimized code.

The average impact of individual optimizations is shown in Figure 3. Generally, each successive set of optimizations applied results in increasing performance. The exception is V5. This is due to higher execution times of V5 for NW and SpMV. In both cases, performing computations in as much detail as possible results in the use of conditional statements that outweigh the benefits of the optimization. Once these statements are removed in V6, the speedup increases.

Figure 3 – Performance for different code versions, obtained by averaging the speedup of all applicable benchmarks.

To demonstrate the overall effectiveness of the approach, we compare the performance of the optimized kernels against existing CPU, GPU, Verilog, and FPGA OpenCL implementations. Table 3 lists the references for these designs. They’re either obtained from the literature or implemented using available source code/libraries (denoted by an asterisk). The Verilog FFT measurement from reference 3 has been extended to include off-chip access overhead.

Table 3. References for existing implementations

Benchmark | CPU | GPU | Verilog* | OpenCL™
NW | Rodinia*9 | Rodinia*9 | Benkrid*10 | Zohouri*11
FFT | MKL*12 | cuFFT*13 | Sanaullah*3 | Intel14
Range Limited | Yang*15 | | Yang15 |
PME | Ferit16 | Ferit*16 | Sanaullah17 |
MMM | MKL*12 | cuBLAS*18 | Shen*19 | Spector*20
SpMV | MKL*12 | cuSPARSE*21 | Zhuo*22 | OpenDwarfs*8
CRC | Brumme*23 | | Anand*24 | OpenDwarfs8


Figure 4 shows the average speedup achieved over the CPU code, while Figure 5 shows the normalized execution times for all implementations. From the results, we observe that our work outperforms multicore CPU implementations by approximately 1.2x; the modest margin is due to the CPU codes being written using the Intel® Math Kernel Library (Intel® MKL). We’ve also achieved an average of approximately 5x lower execution time than existing FPGA OpenCL work. The GPU speedup of 2.4x relative to our work is due to the use of a high-end GPU (Tesla* P100) compared to a midrange FPGA (Intel® Arria® 10). We therefore also provide an estimate of high-end FPGA performance (Intel® Stratix® 10) using a conservative factor of 4x to account for the increase in resources only. Results show that the optimized kernels on Intel Stratix 10 are expected to outperform the GPU designs by 65%, on average.


Comparison with existing Verilog* implementations shows that the OpenCL kernels are, on average, within 12% of hand-tuned HDL. This demonstrates that the optimizations are able to bridge the performance-programmability gap for FPGAs and deliver HDL-like performance using OpenCL.

Figure 4 – Average speedup with respect to CPU across all applicable benchmarks


Figure 5 – Performance of our work compared to existing CPU, GPU, Verilog*, and FPGA OpenCL implementations. Our work outperforms CPU and OpenCL for most of the benchmarks. Moreover, we also achieve speedups over GPU (SpMV, PME) and Verilog (SpMV, Range Limited).


Special Thanks To …

This article is from The Parallel Universe, Intel’s quarterly magazine that helps you take your software development into the future with the latest tools, tips, and training to expand your expertise.


1Intel® FPGA SDK for OpenCL™ technology
2S. Chellappa, F. Franchetti, and M. Pueschel, “How To Write Fast Numerical Code: A Small Introduction,” in Generative and Transformational Techniques in Software Engineering II, Lecture Notes in Computer Science v5235, 2008, pp. 196 – 259.
3A. Sanaullah and M. Herbordt, “FPGA HPC using OpenCL: Case Study in 3D FFT,” in Proceedings of the International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, 2018.
4A. Sanaullah, R. Patel, and M. Herbordt, “An Empirically Guided Optimization Framework for FPGA OpenCL,” in Proceedings of the IEEE Conference on Field Programmable Technology, 2018.
5A. Sanaullah, C. Yang, D. Crawley, and M. Herbordt, “SimBSP: Enabling RTL Simulation for Intel FPGA OpenCL Kernels,” in Proceedings on Heterogeneous High-Performance Reconfigurable Computing, 2018.
6A. Sanaullah and M. Herbordt, “Unlocking Performance-Programmability by Penetrating the Intel FPGA OpenCL Toolflow,” in IEEE High Performance Extreme Computing Conference, 2018.
7Intel FPGA SDK for OpenCL Pro Edition Best Practices Guide
8K. Krommydas, A. E. Helal, A. Verma, and W.C. Feng, “Bridging the Performance Programmability Gap for FPGAs via OpenCL: A Case Study with Opendwarfs,” Department of Computer Science, Virginia Polytechnic Institute and State University, Tech. Rep., 2016.
9S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A Benchmark Suite for Heterogeneous Computing,” in IEEE International Symposium on Workload Characterization, 2009, pp. 44 – 54.
10K. Benkrid, Y. Liu, and A. Benkrid, “A Highly Parameterized and Efficient FPGA-Based Skeleton for Pairwise Biological Sequence Alignment,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 17, No. 4, pp. 561 – 570, 2009.
11H. R. Zohouri, N. Maruyamay, A. Smith, M. Matsuda, and S. Matsuoka, “Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs,” in International Conference for High Performance Computing, Networking, Storage and Analysis, SC16, 2016, pp. 409 – 420.
12Intel® Math Kernel Library
13Nvidia, “cuFFT Library,” 2010.
14FFT (1D) Design Example
15C. Yang, J. Sheng, R. Patel, A. Sanaullah, V. Sachdeva, and M. Herbordt, “OpenCL for HPC with FPGAs: Case Study in Molecular Electrostatics,” in IEEE High Performance Extreme Computing Conference, 2017.
16F. Buyukkececi, O. Awile, and I. F. Sbalzarini, “A Portable OpenCL Implementation of Generic Particle―Mesh and Mesh―Particle Interpolation in 2D and 3D,” Parallel Computing, Vol. 39, No. 2, pp. 94 – 111, 2013.
17A. Sanaullah, A. Khoshparvar, and M. C. Herbordt, “FPGA-Accelerated Particle-Grid Mapping,” in Field-Programmable Custom Computing Machines (FCCM), 2016 IEEE 24th Annual International Symposium on. IEEE, 2016, pp. 192 – 195.
18Nvidia, “CUBLAS Library,” NVIDIA Corporation, Santa Clara, CA, 2008.
19J. Shen, Y. Qiao, Y. Huang, M. Wen, and C. Zhang, “Towards a Multi-Array Architecture for
Accelerating Large-Scale Matrix Multiplication on FPGAs,” in Circuits and Systems (ISCAS), 2018
IEEE International Symposium on. IEEE, 2018, pp. 1 – 5.
20Q. Gautier, A. Althoff, P. Meng, and R. Kastner, “Spector: An OpenCL FPGA Benchmark Suite,” in International Conference on Field-Programmable Technology, 2016.
21Nvidia, “cuSPARSE Library,” NVIDIA Corporation, Santa Clara, CA, 2014.
22L. Zhuo and V. K. Prasanna, “Sparse Matrix-Vector Multiplication on FPGAs,” in Proceedings of the International Symposium on Field-Programmable Gate Arrays, 2005.
23S. Brumme, “Fast CRC32,” 2018.
24P. A. Anand et al., “Design of High Speed CRC Algorithm for Ethernet on FPGA using Reduced Lookup Table Algorithm,” in India Conference (INDICON), 2016 IEEE Annual. IEEE, 2016, pp. 1 – 6.

Performance varies by use, configuration, and other factors.