# Boosting the Performance of Graph Analytics Workloads

## Analyzing the Graph Benchmarks on Intel® Xeon® Processors

A graph is an intuitive way of representing a big data set and the relationships between its elements. Each vertex represents an element, and edges connect related elements. Representing data sets as a graph means you can build on the rich history of graph theory. And there are a variety of algorithms to extract useful information from a graph. In this article, we’ll explore the implementation characteristics of basic graph analysis algorithms and how they perform on Intel® Xeon™ processors.

### Graphs and Graph Analytics

A graph is a structured representation of a data set with relationships between elements. Each element is represented as a vertex, and relationships between elements are shown as an edge between two vertices. Both vertices and edges can have attributes representing either the characteristics of the element or the relationship. The vertices to which a vertex is connected through edges are called its neighbors, and the number of neighbors is called the degree of a vertex.

Graph analysis applications extract characteristics from a graph (or multiple graphs) that provide useful information, e.g.:

- Vertices with specific features
- The shortest path between vertices
- Vertex clusters
- Higher-order relationships among vertices

With the growing availability of big data sets and the need to extract useful information from them, graph analytics applications are becoming an important workload, both in the data center and at the edge.

For this study, we’ll use the GAP Benchmark Suite^{1}, a set of common graph algorithms implemented in C++ and optimized by researchers at the University of California at Berkeley for performance on sharedmemory multicore processors (**Table 1**). We evaluate GAP performance on an Intel Xeon processor-based server and investigate opportunities to further improve performance.

**Table 1. Gap Benchmark Suite overview**

Algorithm |
Abbreviation |
What it Does |

PageRank* |
pr | Calculates the popularity of a vertex by aggregating the popularity of its neighbors |

Triangle counting |
tc | Counts the number of triangles (three vertices that are fully connected) |

Connected components |
cc | Splits the graph into subgraphs with no edges between them |

Breadth-first search |
bfs | Walks through the graph in breadth-first order |

Single-source shortest path |
sssp | Calculates the shortest path from one vertex to all others |

Betweenness centrality |
bc | Calculates the centrality of a vertex, determined by how many shortest paths go through it |

### Characteristics of Graph Algorithms

Graph algorithms pose challenging behavior for conventional processor architectures. A typical operation is to fetch the attributes of all neighbors of a vertex. The list of neighbors, determined by the topology of the graph, is usually irregular. This leads to sparse memory accesses―accessing individual elements scattered on a large data structure. Sparse memory accesses have no locality, leading to poor cache utilization.

Fetching the attributes of the neighbors means using indirect accesses. For example, if N is the array of neighbors of a vertex, and A the array containing the attributes, the attribute of neighbor i is accessed by A[N[i]]. This pattern is difficult to predict and to vectorize, which leads to an underutilization of the available compute and memory resources.

On the other hand, graph algorithms generally have a lot of parallelism. The set of algorithms in the GAP suite fall into two categories in terms of parallelism. The first category consists of algorithms that operate on all vertices concurrently (pr, tc, and cc). They have abundant parallelism and can be executed across many threads. Their parallelism is only limited by the size of the graph.

The second category is front-based algorithms where, at each iteration, a subset of vertices is analyzed (the current front) and a new front is defined to be processed in the next iteration (bfs, sssp, and bc). These algorithms usually start with a front containing a single vertex. The next front consists of its neighbors, then the neighbors of these neighbors, and so on. In the first iterations, the size of the front (and thus the parallelism that can be exploited) is limited. Also, each iteration ends with a global barrier, which creates additional synchronization overhead. These algorithms scale worse with increasing thread count, especially on smaller graphs.

### Running Graph Algorithms on Intel Xeon Processors

Despite the challenging behavior of graph algorithms, there are ways to increase the efficiency of running these applications on a multi-core Intel Xeon processor-based server.

#### Vectorization

Using vector memory instructions can increase the performance of a graph algorithm by increasing the number of parallel load operations, which hides part of their latency. Specifically, you can use the vector gather instruction (AVX2* and AVX-512*) to perform multiple indirect loads in one instruction. However, the compiler isn’t always able to detect these indirect access patterns, or it can decide to not vectorize based on its heuristics. Therefore, it might be useful to add #pragma vector always to force the compiler to vectorize and/or to rewrite the code to make the indirect access pattern more apparent to the compiler.

**Figure 1** gives an example for cc. The original code on the left did not generate vector gather instructions, while the code on the right did. This led to a speedup of 5x for cc on Intel Xeon processors.

for (NodeID v : g.out_neigh(u)) { NodeID comp_v = comp[v]; if ((comp_u < comp_v) && (comp_v == comp[comp_v])) { change = true; comp[comp_v] = comp_u; } }

NodeID *a = g.out_neigh(u).begin();#pragma vector alwaysfor (int i=0; i<n; i++){ NodeID comp_v =comp[a[i]]; if ((comp_u < comp_v) && (comp_v == comp[comp_v])) { change = true; comp[comp_v] = comp_u; } }

It might also be useful to look at other vectorization opportunities, and to rewrite the code so that they can be exploited (e.g., using intrinsics). For example, in tc, we need to count the number of matches between the elements of two neighbor lists. Katsov describes an algorithm to speed up the matching algorithm with SSE instructions^{2}. We adapted this algorithm to AVX-512 and included this in the tc benchmark, leading to a performance increase of 2.5x (code not included for brevity).^{3}

### Parallelism

The GAP benchmarks are parallelized using OpenMP*. As discussed before, there are two categories of parallelism: vertex- and front-based. For the vertex-parallel algorithms (pr, cc, and tc), it’s important to use dynamic scheduling in the OpenMP parallel for-loops because the processing time of a vertex depends on its neighbor count, which can differ significantly across vertices. With static scheduling, threads that are assigned vertices with many neighbors execute longer than other threads―leading to load imbalance. To reduce the scheduling overhead while still maintaining enough scheduling flexibility, set chunk size to 16 or 32.

The front-based algorithms (bfs, sssp, and bc) are harder to parallelize, which means they don’t use the full capacity of the processor (fewer threads than cores). The current front contains many fewer vertices than the full graph, and the next front can only be processed when the current front is finished. To fully exploit the increasing core count of Intel Xeon processor-based servers, these algorithms need to be revised to increase their parallelism.

An example of this is already implemented in bfs. Instead of looking for neighbors of the current front and checking whether they’ve already been visited (forward algorithm), all non-visited vertices are considered, and it is checked whether they are a neighbor of a vertex in the current front (backward algorithm). Because there are more non-visited vertices than vertices in the front, there’s more parallelism to exploit. The downside is that the backward algorithm sometimes does unnecessary work when most of the non-visited vertices are not neighbors of the current front.

At each step of the algorithm, we can choose between the two methods. This choice is currently done using the characteristics of the current front and the remaining vertices, but it should also include the available parallelism (core or thread count) to better exploit the capacity of the processor.

### Caches and Input Graphs

Graph workloads generally don’t generate cache-friendly access patterns. The one exception is when the graph, or the most accessed data structure of the graph (e.g., the attributes of vertices), fits in the last-level cache (which is up to 38 MB per socket on high-end Intel Xeon processors). From our experiments, we notice that performance (expressed in GTEPS, or giga traversed edges per second) decreases with increasing graph size, since less and less of the data fits into the cache. Because of the nature of graphs and graph algorithms, methods to improve cache locality either don’t work well or take too much time to reorganize the graph, often more than the algorithm itself.

### Distributed Graph Processing

The GAP benchmarks are designed for single-node execution only (using OpenMP parallelization). However, based on the insights from our study, we briefly discuss the impact on distributed graph processing (i.e., using multiple nodes). For high-performance multinode execution, it’s crucial to minimize the communication and maximize local data and computation. This is challenging for graph applications because of the irregular and non-localized access pattern. Partitioning a graph to minimize the number of edges between partitions is an NP-complete problem in itself, and often leads to more compute time than the algorithm itself. Therefore, when you’re deploying a graph analysis algorithm on multiple nodes, the nodes should be connected by a high-bandwidth, low-latency network such as Intel® Omni-Path Architecture to deal with the unavoidably high amount of communication between the nodes.

### Boosting Performance through Analysis

Graph applications form a challenging workload for current processors because of their memory intensiveness and irregularity. By carefully crafting their implementation and exploiting vector units and thread parallelism, we can increase performance significantly. However, more investigation is needed, including redesigning the algorithms to fully exploit the capabilities of an Intel Xeon processor-based server, especially when moving to distributed processing.

### References

^{1}S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP Benchmark Suite,” 2015. http://gap.cs.berkeley.edu/benchmark.html

^{2}I. Katsov, “Fast Intersection of Sorted Lists Using SSE Instructions,” 2012. https://highlyscalable.wordpress.com/2012/06/05/fast-intersection-sorted-lists-sse/

^{3}S. Eyerman et al., “Many-Core Graph Workload Analysis,” 2018. https://dl.acm.org/citation.cfm?id=3291686.