Measuring Graph Analytics Performance
What Is Graph Analytics? And Why Does It Matter?
A graph is a good way to represent a set of objects and the relations between them (Figure 1). Graph analytics is the set of techniques to extract information from connections between entities.
For example, graph analytics can be used to:
- Get recommendations among friends in a social network
- Find cut points in a communication network or electrical grid
- Determine drug effects on a biochemical pathway
- Detect robocalls in a telecommunications network
- Find optimal routes between locations on a map
Graph analytics has been getting a lot of attention lately, possibly because Gartner listed it among the top 10 data and analytics technology trends for 2019:
“The application of graph processing and graph DBMSs will grow at 100 percent annually through 2022 to continuously accelerate data preparation and enable more complex adaptive data science.” (Source: Gartner Identifies Top 10 Data and Analytics Technology Trends for 2019)
Intel has a long history of leadership in graph analysis. For example, Intel coauthored the GraphBLAS specification to formulate graph problems such as sparse linear algebra. Though the GraphBLAS API was just published in 2017, the initial proposal and manifesto were published over 10 years ago. Today, the same industry and academic partnership is coauthoring the forthcoming LAGraph specification for a library of graph algorithms. Intel was also selected by DARPA to develop a new processor to handle large graph datasets (see “DARPA Taps Intel for Graph Analytics Chip Project”). We’ll continue to innovate and push the graph analytics envelope.
Benchmarking Graph Analytics Performance
Graph analytics is a large and varied landscape. Even the simple examples in Figure 1 show differing characteristics. For example, some networks are highly connected; some are sparser. A network of webpages exhibits different connectivity than a network of Twitter* users, where some users have millions of followers while most have only a few. Consequently, no single combination of graph algorithm, graph topology, or graph size can adequately represent the entire landscape.
Therefore, we use the GAP Benchmark Suite from the University of California, Berkeley, to measure graph analytics performance. GAP specifies six widely-used algorithms (Table 1) and five small to medium-sized graphs (Table 2). Each graph has different characteristics, which is important because optimizations that work well for one graph topology may not work well for others. For example, the Road* graph is relatively small, but its high diameter can cause problems for some algorithms. Consequently, the 30 GAP data points provide good coverage of the graph analytics landscape.
Table 1. GAP measures the performance of six common graph analytics algorithms.
Table 2. GAP uses five graphs of varying size and topology to give a more complete picture of graph analytics performance.
GAP also has the advantages of being a clearly-defined, objective, off-the-shelf benchmark. It doesn’t require special hardware or software configurations, so it’s easy to run.
Here are the steps I took to run the benchmark:
- Download the GAP package.
- Run ‘make’ in the gapbs-master subdirectory to build the reference implementations for the six graph analytics kernels. For now, I’m just using the default GAP build parameters (the GNU* C++ compiler with the -std=c++11 -O3 -Wall -fopenmp options). My system had the GNU v7.4.0 compiler installed.
- In the same directory, run ‘‘make -f benchmark/bench.mk bench graphs’ to download or generate the five benchmark graphs and convert them to more efficient input formats.
- The GAP reference implementations are parallelized with OpenMP*, so it’s important to set the number of threads. Otherwise, GAP will use all available cores, even if this doesn’t give the best parallel efficiency. The ‘export OMP_NUM_THREADS=32’ command sets the number of OpenMP threads to 32, for example.
- Finally, run ‘make -f benchmark/bench.mk bench run’ to launch the benchmarks and generate results (Table 3).
Table 3. Compute times (in seconds, lower is better) for GAP running on a two-socket Intel® Xeon® processor-based system1. GAP was run with 1, 8, 16, 24, 32, 48, 64, and 96 threads. Best performance is shown for each test.
I tried to generate NVIDIA V100* comparative data, but ran into several technical barriers:
- Only one of the GAP graphs (Road) fits in the memory of a single V100.
- Only one of the GAP algorithms (PR) in RAPIDS cuGraph* can use the aggregate memory of multiple V100s.
- The current version of cuGraph does not provide a BC implementation.
- The cuGraph APIs and documentation do not expose certain implementation details or algorithm parameters. For example, the multi-V100 PR implementation in cuGraph does not provide a convergence tolerance parameter, which makes an apples-to-apples comparison to the GAP results difficult.
Consequently, it’s only possible to run nine of the 30 GAP tests (Table 4). As you can see, where GAP can run on both architectures, the Intel® Xeon® processors outperformed the V100 on most tests, even when the graph is small enough to fit in GPU memory (i.e., Road) or when the algorithm can use multiple GPUs (i.e., PageRank). Also, TCO clearly favors Intel Xeon processors for graph analytics (Table 5).
Table 4. Compute times (in seconds, lower is better) for cuGraph running on an AWS EC2 p3.16xlarge instance (eight V100 GPUs connected via NVLink). All Road tests used a single V100. Twitter and Web tests used eight V100s. Note that the multi-V100 PR in cuGraph does not provide a convergence tolerance parameter so the default parameters were used for these tests. Kron and Urand tests failed with a “thrust::system::system_error.” DNR = Does not run because of insufficient memory. NA = Not available in cuGraph v0.9.0.
Table 5. Relative TCO of the Intel Xeon processor and V100 benchmark systems. Note that the AWS price comparison is only an approximation because no EC2 instances exactly match the Intel Xeon processor-based benchmarking system, but the m4.16xlarge instance is similar.
Comprehensive, Objective, and Reproducible
I can probably improve the GAP results on the Intel Xeon processor platform (Table 3) with a few simple changes like using the Intel® compiler and aggressive optimization and vectorization, tweaking the OpenMP OMP_PROC_BIND and OMP_PLACES environment variables, experimenting with the numactl utility, adjusting page sizes, etc.—but that’s not the purpose of this article. My goal is to show how easy it is to get comprehensive, objective, and reproducible graph analytics performance data without obfuscation or resorting to benchmarking tricks.
- Processor: Intel® Xeon® Gold 6252 (2.1 GHz, 24 cores), HyperThreading enabled (48 virtual cores per socket); Memory: 384 GB Micron DDR4-2666; Operating system: Ubuntu Linux* release 4.15.0-29, kernel 31.x86_64; Software: GAP Benchmark Suite (downloaded and run September 2019). Processor: NVIDIA Tesla V100*; Memory: 16 GB; Operating system: Ubuntu Linux release 4.15.0-1047-aws, kernel 49.x86_64; Software: RAPIDS v0.9.0 (downloaded and run September 2019).
- Source: https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/ (accessed September 2019) and https://ark.intel.com/content/www/us/en/ark/products/192447/intel-xeon-gold-6252-processor-
35-75m-cache-2-10-ghz.html (accessed September 2019)
- Source: https://aws.amazon.com/ec2/pricing/on-demand/ (accessed September 2019)
- See “Boosting the Performance of Graph Analytics Workloads” if you’re interested in tuning GAP.