Breadth-first Search

Application Description

This benchmark computes the shortest-path distance from a source node to every node in a directed, unweighted graph.

We have 2 distributed implementations of BFS: a push-style version and a pull-style version. Both are bulk-synchronous parallel (BSP) implementations: execution proceeds in rounds, and data is synchronized among hosts between rounds.

The push-style version checks each node to see if its distance has changed since the last round. If it has, the node updates its neighbors' distances using its new distance. The pull-style version has every node check its in-neighbors: if an in-neighbor's distance yields a shorter path, the node updates its own distance with that neighbor's data. Both versions continue executing until a round updates no nodes.

Pseudocode for the computation step of the 2 implementations is shown below:

for (node n in graph) {
    if (n.distance != n.old_distance) {
        for (neighbor a of node n) {
            a.distance = min(n.distance + 1, a.distance)
        }
    }
}

Figure 1: Pseudocode for BFS Push computation
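
Below is a minimal, runnable C++ sketch of the push-style computation on a single host, using an assumed CSR (compressed sparse row) adjacency layout. The Graph struct, INF constant, and function names are illustrative, not the benchmark's actual API, and the sketch omits the distributed synchronization entirely.

#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical CSR layout: out-edges of node n are
// edge_dst[row_start[n] .. row_start[n + 1]).
struct Graph {
    std::vector<uint32_t> row_start;
    std::vector<uint32_t> edge_dst;
    size_t num_nodes() const { return row_start.size() - 1; }
};

constexpr uint32_t INF = std::numeric_limits<uint32_t>::max();

// One push-style round: a node whose distance changed since the last
// round pushes its new distance to its out-neighbors. Returns whether
// any distance was updated, so the driver knows when to stop.
bool bfs_push_round(const Graph& g, std::vector<uint32_t>& dist,
                    std::vector<uint32_t>& old_dist) {
    bool updated = false;
    for (size_t n = 0; n < g.num_nodes(); ++n) {
        if (dist[n] == old_dist[n]) continue;  // unchanged: nothing to push
        old_dist[n] = dist[n];  // in the distributed code, this reset happens at sync
        for (uint32_t e = g.row_start[n]; e < g.row_start[n + 1]; ++e) {
            uint32_t a = g.edge_dst[e];
            if (dist[n] + 1 < dist[a]) {  // found a shorter path to a
                dist[a] = dist[n] + 1;
                updated = true;
            }
        }
    }
    return updated;
}

// BSP-style driver: execute rounds until a round makes no updates.
std::vector<uint32_t> bfs_push(const Graph& g, uint32_t source) {
    std::vector<uint32_t> dist(g.num_nodes(), INF);
    std::vector<uint32_t> old_dist(g.num_nodes(), INF);
    dist[source] = 0;
    while (bfs_push_round(g, dist, old_dist)) {
    }
    return dist;
}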

for (node n in graph) {
    for (in-neighbor a of node n) {
        if (a.distance + 1 < n.distance) {
            n.distance = a.distance + 1
        }
    }
}

Figure 2: Pseudocode for BFS Pull computation
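
The pull direction needs incoming edges. A corresponding sketch, reusing the INF constant from the push example above and assuming a hypothetical in-edge (CSC-style) layout where in_edge_src[in_row_start[n] .. in_row_start[n + 1]) lists the in-neighbors of node n:

// One pull-style round: every node scans its in-neighbors and adopts
// the best distance it can derive from them. Returns whether any
// distance was updated.
bool bfs_pull_round(const std::vector<uint32_t>& in_row_start,
                    const std::vector<uint32_t>& in_edge_src,
                    std::vector<uint32_t>& dist) {
    bool updated = false;
    size_t num_nodes = in_row_start.size() - 1;
    for (size_t n = 0; n < num_nodes; ++n) {
        for (uint32_t e = in_row_start[n]; e < in_row_start[n + 1]; ++e) {
            uint32_t a = in_edge_src[e];
            if (dist[a] != INF && dist[a] + 1 < dist[n]) {
                dist[n] = dist[a] + 1;  // pull the shorter distance from a
                updated = true;
            }
        }
    }
    return updated;
}

Note that the pull version writes only to the node being processed, while the push version writes to its neighbors; avoiding such conflicting writes is one common motivation for the pull direction on parallel hardware.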

Synchronization of the distance variable occurs between BSP rounds: a node takes the minimum distance value over all proxies that exist for it in the system.
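
A minimal sketch of this reduction's semantics follows, assuming every host keeps a local distance value for its proxy of each node; in the actual system the exchange goes over the network through Gluon, and the proxy_dist layout and function name here are illustrative only.

#include <algorithm>
#include <cstdint>
#include <vector>

// proxy_dist[h][n] holds host h's distance value for its proxy of node n.
// After synchronization, every proxy of a node holds the minimum value
// any proxy computed during the round.
void sync_min_distance(std::vector<std::vector<uint32_t>>& proxy_dist) {
    size_t num_nodes = proxy_dist.front().size();
    for (size_t n = 0; n < num_nodes; ++n) {
        uint32_t best = proxy_dist[0][n];
        for (const auto& host : proxy_dist)
            best = std::min(best, host[n]);
        for (auto& host : proxy_dist)
            host[n] = best;  // write the reduced value back to every proxy
    }
}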

Performance

The graph below shows the strong scaling of bfs-push using both the Bulk-Synchronous Parallel (Gluon-Sync) and Bulk-Asynchronous Parallel (Gluon-Async) execution models, both of which use the Gluon communication substrate. The experiments were conducted on the Stampede2 cluster, whose nodes are connected through the Intel Omni-Path Architecture (peak bandwidth of 100 Gbps). Each node has 2 Intel Xeon Platinum 8160 “Skylake” CPUs with 24 cores per CPU and 192 GB of DDR4 RAM. We use up to 128 CPU machines, each with 48 threads. We run on 4 graphs: clueweb12, uk14, wdc14, and wdc12. All are large real-world web crawls: wdc12 and wdc14 come from the Web Data Commons hyperlink graph, and wdc12 is the largest publicly available dataset.
Figure 3: Strong scaling of BFS on Stampede2 (Skylake).
The graph below shows scaling on up to 64 Tesla P100 GPUs on the Bridges cluster at the Pittsburgh Supercomputing Center. The GPUs are on 32 machines connected with the Intel Omni-Path Architecture, and each machine has 2 Intel Broadwell E5-2683 v4 CPUs. friendster and twitter50 are social network graphs, and uk07 is a web crawl.
Figure 4: Strong scaling of BFS on Bridges (Tesla P100).