Galois
We assume that readers have gone through the Tutorial and are reasonably familiar with the Galois programming model and its related constructs.
This guide has two parts. The first part covers the principles for building a NUMA-aware data structure; the second part shows how to make your data structure work with Galois' conflict detection.
You can hold a fixed number of work items, e.g. the nodes of a 2D torus of fixed width and height, in an array. To allocate the array in a NUMA-aware fashion, use galois::LargeArray and its allocation methods.
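As a minimal sketch (assuming galois::LargeArray's allocateInterleaved and constructAt methods; the element type and function name are illustrative), allocating the node data of a width x height torus could look like this:

    #include "galois/LargeArray.h"

    // Hypothetical node data for a width x height torus.
    galois::LargeArray<unsigned int> nodeData;

    void allocateTorusData(size_t width, size_t height) {
      // Interleave pages across NUMA sockets; this is one of several
      // allocation policies, as discussed below.
      nodeData.allocateInterleaved(width * height);
      for (size_t i = 0, n = width * height; i < n; ++i)
        nodeData.constructAt(i, 0u);   // placement-construct each element
    }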
galois::LargeArray supports five ways of memory allocation (see details at NUMA-Awareness).
Below are rules of thumb for deciding on the allocation policy.
Define local_iterator for your data structure so that the Galois runtime can better exploit the locality of accesses through local ranges formed by local_iterators. Otherwise, Galois falls back to using the global range given by begin() and end(), which are of type iterator.
See galois::graphs::LC_CSR_Graph as an example data structure built with galois::LargeArray.
If work items are generated dynamically by threads, then it is better to have a per-thread data structure backed by a per-thread heap. Threads populate their thread-local copy of the data structure, and the work items are then merged or iterated over in a separate serial or parallel phase. galois::InsertBag can be used for this purpose. This design helps scalability by avoiding a big lock around a global data structure.
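An illustrative sketch of this two-phase pattern with galois::InsertBag (numItems, values, and the filtering/processing logic are placeholders):

    #include "galois/Bag.h"
    #include "galois/Galois.h"

    galois::InsertBag<size_t> bag;

    // Phase 1: threads generate work items; each push goes into the calling
    // thread's local segment of the bag, so no global lock is needed.
    galois::do_all(galois::iterate(size_t{0}, numItems),
                   [&](size_t i) {
                     if (values[i] % 2 == 0)   // placeholder predicate
                       bag.push(i);
                   });

    // Phase 2: iterate over all generated items in another parallel phase.
    galois::do_all(galois::iterate(bag),
                   [&](size_t i) { /* process item i */ });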
You can further exploit the semantics of work items when designing your own per-thread data structure.
Start from galois::substrate::PerThreadStorage or galois::substrate::PerSocketStorage if none of the above fits your needs. See Per-Thread and Per-Socket Storage for more details.
local_iterator for a per-thread data structure can be naturally defined as the iterator of the thread-local containers. The global iterator should be carefully constructed from local_iterators in order to conform to the STL interface. Galois will pick up local iterators to leverage the spatial/temporal locality in work generation.
See galois::InsertBag as an example data structure built with this design in mind.
To manage memory in a NUMA-aware fashion, you may need to work with custom memory allocators. As an example, the allocation-related parts of galois::gdeque illustrate this pattern: elements are stored in fixed-size blocks obtained from a custom allocator.
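A rough sketch of that pattern (not the verbatim gdeque code), assuming galois::FixedSizeAllocator with a standard allocate/deallocate allocator interface; see galois/gdeque.h for the actual implementation:

    #include "galois/Mem.h"   // galois::FixedSizeAllocator (assumed interface)
    #include <new>

    template <typename T, unsigned ChunkSize = 64>
    class ChunkedDeque {                 // simplified stand-in for galois::gdeque
      struct Block {
        T items[ChunkSize];
        Block* next = nullptr;
        unsigned count = 0;
      };

      // Galois' scalable fixed-size pool allocator for Block objects.
      galois::FixedSizeAllocator<Block> heap;

      Block* allocBlock() {
        Block* b = heap.allocate(1);     // raw memory for one block
        return new (b) Block();          // placement-construct it
      }

      void freeBlock(Block* b) {
        b->~Block();                     // destroy, then return memory to the pool
        heap.deallocate(b, 1);
      }
    };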
Suppose we want to write a program to count the number of neighbors for each node in a 2D torus using a push-style operator, where each node increments its neighbors' labels. There are conflicts among node activities, so we need the torus nodes to be conflict-aware if we do not want to handle synchronization in our operator code.
The neighbors of a 2D torus node can be inferred from the width, height and node numbering scheme of the torus. Therefore, it suffices to store only node data in an array, e.g. galois::LargeArray. Using a graph to represent a 2D torus is possible but would be overkill. For simplicity of this presentation, torus nodes are numbered from 0 onwards in row-major order.
To make torus nodes conflict-aware, we can (1) use an internal struct to hold the real data, and (2) let the internal struct inherit from galois::runtime::Lockable. See the code snippet below for an example. T, given as a template parameter, is the type of the actual node data.
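A minimal sketch of this design (the names Torus2D and NodeData are illustrative, not necessarily those in the example file):

    #include "galois/runtime/Context.h"   // galois::runtime::Lockable

    template <typename T>
    class Torus2D {
      // Internal struct wrapping the real node data. Inheriting from
      // galois::runtime::Lockable adds the per-object lock that the Galois
      // runtime uses for conflict detection.
      struct NodeData : public galois::runtime::Lockable {
        T v;
      };

      // ... the remaining members are sketched in the snippets below ...
    };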
To store node data with locks for a torus, we instantiate a galois::LargeArray containing elements of the internal struct:
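Continuing the hypothetical Torus2D class, and again assuming galois::LargeArray's allocateInterleaved and constructAt methods (header galois/LargeArray.h):

    // Members of the hypothetical Torus2D<T>:
    size_t numRows, numCols;
    galois::LargeArray<NodeData> nodeData;   // one NodeData (data + lock) per node

    Torus2D(size_t width, size_t height) : numRows(height), numCols(width) {
      nodeData.allocateInterleaved(width * height);   // NUMA-interleaved pages
      for (size_t i = 0, n = width * height; i < n; ++i)
        nodeData.constructAt(i);                      // default-construct each element
    }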
To detect conflicts, we need to call galois::runtime::acquire on instances of the internal struct. All functions that acquire node ownership need to do so, so it is convenient to implement the call inside a wrapper function, as the following code snippet shows. size() returns the number of nodes in the torus. The parameter mflag defaults to galois::MethodFlag::WRITE. We will see later an example of optimizing away conflict detection using this parameter.
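A sketch of such a wrapper, continuing the hypothetical Torus2D<T> (the name acquireNode is illustrative):

    // Member of the hypothetical Torus2D<T>. Every method that needs ownership
    // of node n calls this wrapper; it asks the runtime to acquire the node's
    // lock, which is how conflicts between concurrent activities are detected.
    // Precondition: n < size().
    NodeData& acquireNode(size_t n,
                          galois::MethodFlag mflag = galois::MethodFlag::WRITE) {
      galois::runtime::acquire(&nodeData[n], mflag);
      return nodeData[n];
    }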
Now all methods that need ownership of a node should call the wrapper function, as the code snippet below shows. Returning a reference to the node data allows users to modify it in place.
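For example, a getData method could be built on the wrapper like this (a sketch under the same assumptions):

    // Member of the hypothetical Torus2D<T>. Acquires ownership of node n and
    // hands back a reference to its data so callers can modify it in place.
    T& getData(size_t n, galois::MethodFlag mflag = galois::MethodFlag::WRITE) {
      return acquireNode(n, mflag).v;
    }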
Recall that galois::do_all and galois::for_each expect galois::iterate, whose parameter can be a pair of integers or a container with well-defined iterators. To make our 2D torus interoperable with Galois, we need to provide size-related functions or iterators, given by begin() and end(), to loop through the nodes. Size-related functions are easy to provide:
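For instance, continuing the sketch:

    // Members of the hypothetical Torus2D<T>.
    size_t height() const { return numRows; }
    size_t width() const { return numCols; }
    size_t size() const { return width() * height(); }   // number of nodes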
Iterators can be defined as follows:
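Since nodes are numbered from 0 onwards in row-major order, iterating over the torus is just counting over node IDs. A sketch using boost::counting_iterator (header boost/iterator/counting_iterator.hpp):

    // Member type of the hypothetical Torus2D<T>: an iterator over node IDs
    // in [0, size()).
    using iterator = boost::counting_iterator<size_t>;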
Users may obtain iterators from begin() and end(), which are implemented as follows:
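Continuing the sketch:

    // Members of the hypothetical Torus2D<T>.
    iterator begin() const { return iterator(0); }
    iterator end() const { return iterator(size()); }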
In our running example, we need to increment each node's immediate neighbors, so we need APIs that return the neighbors of a given node. Since all neighbors must be locked before doing any increments to keep the operator cautious, it is convenient to have a function that locks all neighbors of a given node. The following code snippet shows how to achieve this.
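A sketch of the neighbor computation and the lock-all helper (the names upNeighbor, downNeighbor, leftNeighbor, rightNeighbor, and acquireAllNeighbors are illustrative):

    // Members of the hypothetical Torus2D<T>. With row-major numbering, the
    // neighbors of node n wrap around the torus in both dimensions.
    size_t upNeighbor(size_t n) const {
      return ((n / width() + height() - 1) % height()) * width() + n % width();
    }
    size_t downNeighbor(size_t n) const {
      return ((n / width() + 1) % height()) * width() + n % width();
    }
    size_t leftNeighbor(size_t n) const {
      return (n / width()) * width() + (n % width() + width() - 1) % width();
    }
    size_t rightNeighbor(size_t n) const {
      return (n / width()) * width() + (n % width() + 1) % width();
    }

    // Lock all four neighbors of node n before any of them is modified, so
    // that operators using this helper remain cautious.
    void acquireAllNeighbors(size_t n,
                             galois::MethodFlag mflag = galois::MethodFlag::WRITE) {
      acquireNode(upNeighbor(n), mflag);
      acquireNode(downNeighbor(n), mflag);
      acquireNode(leftNeighbor(n), mflag);
      acquireNode(rightNeighbor(n), mflag);
    }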
Now we can use our 2D torus as follows.
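A sketch of the usage, assuming the hypothetical Torus2D above; the torus size and thread count are chosen to match the statistics shown below, and the loop names match the reported regions:

    #include "galois/Galois.h"

    int main() {
      galois::SharedMemSys G;          // initialize the Galois runtime
      galois::setActiveThreads(24);    // e.g. 24 threads, as in the stats below

      Torus2D<unsigned int> torus(2000, 2000);   // 2000 x 2000 torus

      // Reset every node's counter; these activities are independent, so a
      // do_all without conflict detection suffices.
      galois::do_all(
          galois::iterate(torus),
          [&](size_t n) { torus.getData(n) = 0; },
          galois::loopname("do_all_torus_reset_self"));

      // Push-style operator: each node increments its four neighbors. All
      // neighbor locks are acquired up front; when two activities collide, the
      // runtime detects the conflict, rolls one back, and retries it later.
      galois::for_each(
          galois::iterate(torus),
          [&](size_t n, auto&) {
            torus.acquireAllNeighbors(n);
            torus.getData(torus.upNeighbor(n)) += 1;
            torus.getData(torus.downNeighbor(n)) += 1;
            torus.getData(torus.leftNeighbor(n)) += 1;
            torus.getData(torus.rightNeighbor(n)) += 1;
          },
          galois::loopname("for_each_torus_add_neighbors"));

      return 0;
    }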
Upon termination, this example app will output statistics similar to the following (generated with a 2000*2000 torus and 24 threads).
STAT_TYPE, REGION, CATEGORY, TOTAL_TYPE, TOTAL
STAT, do_all_torus_reset_self, Iterations, TSUM, 4000000
STAT, do_all_torus_reset_self, Time, TMAX, 70
STAT, for_each_torus_add_neighbors, Iterations, TSUM, 4001481
STAT, for_each_torus_add_neighbors, Time, TMAX, 88
STAT, for_each_torus_add_neighbors, Commits, TSUM, 4000000
STAT, for_each_torus_add_neighbors, Pushes, TSUM, 0
STAT, for_each_torus_add_neighbors, Conflicts, TSUM, 1481
Note that all conflicts are detected successfully and reported.
Conflict detection can be turned off when not required by passing galois::MethodFlag::UNPROTECTED to getData(), as shown below in the serial loop for verification.
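A sketch of such a verification loop (each node should end up with the value 4, one increment per neighbor):

    #include <iostream>

    // Serial verification: conflict detection is unnecessary here, so pass
    // galois::MethodFlag::UNPROTECTED to skip acquiring the node locks.
    for (auto n : torus) {
      auto& v = torus.getData(n, galois::MethodFlag::UNPROTECTED);
      if (v != 4)
        std::cerr << "Wrong value at node " << n << ": " << v << "\n";
    }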
The full example is available at lonestar/tutorial_examples/ConflictAwareTorus.cpp