Galois
We assume that readers have gone through the Tutorial and are reasonably familiar with the Galois programming model and its related constructs.
This guide has two parts. The first part covers the principles for building a NUMA-aware data structure; the second part shows how to make your data structure work with Galois' conflict detection.
You can hold a fixed number of work items, e.g. the nodes of a 2D torus of fixed width and height, in an array. To allocate the array in a NUMA-aware fashion, use galois::LargeArray and its allocation methods.
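As a minimal sketch (assuming galois::LargeArray's allocateInterleaved and constructAt methods; the element type and function name are illustrative), allocating the node data of a width x height torus could look like this:

    #include "galois/LargeArray.h"

    // Hypothetical node data for a width x height torus.
    galois::LargeArray<unsigned int> nodeData;

    void allocateTorusData(size_t width, size_t height) {
      // Interleave pages across NUMA sockets; this is one of several
      // allocation policies, as discussed below.
      nodeData.allocateInterleaved(width * height);
      for (size_t i = 0, n = width * height; i < n; ++i)
        nodeData.constructAt(i, 0u);   // placement-construct each element
    }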
galois::LargeArray supports five ways of memory allocation (see details at NUMA-Awareness).
Below are rules of thumb for deciding on the allocation policy.
Define local_iterator for your data structure so that the Galois runtime can better exploit the locality of accesses through local ranges formed by local_iterators. Otherwise, Galois falls back to using the global range given by begin() and end(), which are of type iterator.
See galois::graphs::LC_CSR_Graph as an example data structure built with galois::LargeArray.
If work items are generated dynamically by threads, then it is better to have a per-thread data structure backed by a per-thread heap. Threads populate their thread-local copy of the data structure, and the work items are then merged or iterated over in a separate serial or parallel phase. galois::InsertBag can be used for this purpose. This design helps scalability by avoiding a big lock around a global data structure.
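An illustrative sketch of this two-phase pattern with galois::InsertBag (numItems, values, and the filtering/processing logic are placeholders):

    #include "galois/Bag.h"
    #include "galois/Galois.h"

    galois::InsertBag<size_t> bag;

    // Phase 1: threads generate work items; each push goes into the calling
    // thread's local segment of the bag, so no global lock is needed.
    galois::do_all(galois::iterate(size_t{0}, numItems),
                   [&](size_t i) {
                     if (values[i] % 2 == 0)   // placeholder predicate
                       bag.push(i);
                   });

    // Phase 2: iterate over all generated items in another parallel phase.
    galois::do_all(galois::iterate(bag),
                   [&](size_t i) { /* process item i */ });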
You can further exploit the semantics of work items when designing your own per-thread data structure.
Start from galois::substrate::PerThreadStorage or galois::substrate::PerSocketStorage if none of the above fits your needs. See Per-Thread and Per-Socket Storage for more details.
local_iterator for a per-thread data structure can be naturally defined as the iterator of the thread-local containers. The global iterator should be carefully constructed from local_iterators in order to conform to the STL interface. Galois will pick up local iterators to leverage the spatial/temporal locality in work generation.
See galois::InsertBag as an example data structure built with this design in mind.
To manage memory in a NUMA-aware fashion, you may need to work with custom memory allocators. As an example, the allocation-related parts of galois::gdeque illustrate this pattern: elements are stored in fixed-size blocks obtained from a custom allocator.
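A rough sketch of that pattern (not the verbatim gdeque code), assuming galois::FixedSizeAllocator with a standard allocate/deallocate allocator interface; see galois/gdeque.h for the actual implementation:

    #include "galois/Mem.h"   // galois::FixedSizeAllocator (assumed interface)
    #include <new>

    template <typename T, unsigned ChunkSize = 64>
    class ChunkedDeque {                 // simplified stand-in for galois::gdeque
      struct Block {
        T items[ChunkSize];
        Block* next = nullptr;
        unsigned count = 0;
      };

      // Galois' scalable fixed-size pool allocator for Block objects.
      galois::FixedSizeAllocator<Block> heap;

      Block* allocBlock() {
        Block* b = heap.allocate(1);     // raw memory for one block
        return new (b) Block();          // placement-construct it
      }

      void freeBlock(Block* b) {
        b->~Block();                     // destroy, then return memory to the pool
        heap.deallocate(b, 1);
      }
    };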
Suppose we want to write a program to count the number of neighbors for each node in a 2D torus using a push-style operator, where each node increments its neighbors' labels. There are conflicts among node activities, so we need the torus nodes to be conflict-aware if we do not want to handle synchronization in our operator code.
The neighbors of a 2D torus node can be inferred from the width, height and node numbering scheme of the torus. Therefore, it suffices to store only node data in an array, e.g. galois::LargeArray. Using a graph to represent a 2D torus is possible but would be overkill. For simplicity of this presentation, torus nodes are numbered from 0 onwards in row-major order.
To make torus nodes conflict-aware, we can (1) use an internal struct to hold the real data, and (2) let the internal struct inherit from galois::runtime::Lockable. See the code snippet below for an example. T, given as a template parameter, is the type of the actual node data.
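A minimal sketch of this design (the names Torus2D and NodeData are illustrative, not necessarily those in the example file):

    #include "galois/runtime/Context.h"   // galois::runtime::Lockable

    template <typename T>
    class Torus2D {
      // Internal struct wrapping the real node data. Inheriting from
      // galois::runtime::Lockable adds the per-object lock that the Galois
      // runtime uses for conflict detection.
      struct NodeData : public galois::runtime::Lockable {
        T v;
      };

      // ... the remaining members are sketched in the snippets below ...
    };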
To store node data with locks for a torus, we instantiate a galois::LargeArray containing elements of the internal struct:
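Continuing the hypothetical Torus2D class, and again assuming galois::LargeArray's allocateInterleaved and constructAt methods (header galois/LargeArray.h):

    // Members of the hypothetical Torus2D<T>:
    size_t numRows, numCols;
    galois::LargeArray<NodeData> nodeData;   // one NodeData (data + lock) per node

    Torus2D(size_t width, size_t height) : numRows(height), numCols(width) {
      nodeData.allocateInterleaved(width * height);   // NUMA-interleaved pages
      for (size_t i = 0, n = width * height; i < n; ++i)
        nodeData.constructAt(i);                      // default-construct each element
    }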
To detect conflicts, we need to call galois::runtime::acquire on instances of the internal struct. All functions that acquire node ownership need to do so, so it is convenient to implement the call inside a wrapper function, as the following code snippet shows. size() returns the number of nodes in the torus. The parameter mflag defaults to galois::MethodFlag::WRITE. We will see later an example of optimizing away conflict detection using this parameter.
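A sketch of such a wrapper, continuing the hypothetical Torus2D<T> (the name acquireNode is illustrative):

    // Member of the hypothetical Torus2D<T>. Every method that needs ownership
    // of node n calls this wrapper; it asks the runtime to acquire the node's
    // lock, which is how conflicts between concurrent activities are detected.
    // Precondition: n < size().
    NodeData& acquireNode(size_t n,
                          galois::MethodFlag mflag = galois::MethodFlag::WRITE) {
      galois::runtime::acquire(&nodeData[n], mflag);
      return nodeData[n];
    }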
Now all methods that need ownership of a node should call the wrapper function, as the code snippet below shows. Returning a reference to the node data allows users to modify it in place.
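For example, a getData method could be built on the wrapper like this (a sketch under the same assumptions):

    // Member of the hypothetical Torus2D<T>. Acquires ownership of node n and
    // hands back a reference to its data so callers can modify it in place.
    T& getData(size_t n, galois::MethodFlag mflag = galois::MethodFlag::WRITE) {
      return acquireNode(n, mflag).v;
    }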
Recall that galois::do_all and galois::for_each expect galois::iterate, whose parameter can be a pair of integers or a container with well-defined iterators. To make our 2D torus interoperable with Galois, we need to provide size-related functions or iterators, given by begin() and end(), to loop through the nodes. Size-related functions are easy to provide:
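For instance, continuing the sketch:

    // Members of the hypothetical Torus2D<T>.
    size_t height() const { return numRows; }
    size_t width() const { return numCols; }
    size_t size() const { return width() * height(); }   // number of nodes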
Iterators can be defined as follows:
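Since nodes are numbered from 0 onwards in row-major order, iterating over the torus is just counting over node IDs. A sketch using boost::counting_iterator (header boost/iterator/counting_iterator.hpp):

    // Member type of the hypothetical Torus2D<T>: an iterator over node IDs
    // in [0, size()).
    using iterator = boost::counting_iterator<size_t>;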
Users may obtain iterators from begin() and end(), which are implemented as follows:
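Continuing the sketch:

    // Members of the hypothetical Torus2D<T>.
    iterator begin() const { return iterator(0); }
    iterator end() const { return iterator(size()); }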
In our running example, we need to increment each node's immediate neighbors, so we need APIs that return the neighbors of a given node. Since all neighbors must be locked before doing any increments to keep the operator cautious, it is convenient to have a function that locks all neighbors of a given node. The following code snippet shows how to achieve this.
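A sketch of the neighbor computation and the lock-all helper (the names upNeighbor, downNeighbor, leftNeighbor, rightNeighbor, and acquireAllNeighbors are illustrative):

    // Members of the hypothetical Torus2D<T>. With row-major numbering, the
    // neighbors of node n wrap around the torus in both dimensions.
    size_t upNeighbor(size_t n) const {
      return ((n / width() + height() - 1) % height()) * width() + n % width();
    }
    size_t downNeighbor(size_t n) const {
      return ((n / width() + 1) % height()) * width() + n % width();
    }
    size_t leftNeighbor(size_t n) const {
      return (n / width()) * width() + (n % width() + width() - 1) % width();
    }
    size_t rightNeighbor(size_t n) const {
      return (n / width()) * width() + (n % width() + 1) % width();
    }

    // Lock all four neighbors of node n before any of them is modified, so
    // that operators using this helper remain cautious.
    void acquireAllNeighbors(size_t n,
                             galois::MethodFlag mflag = galois::MethodFlag::WRITE) {
      acquireNode(upNeighbor(n), mflag);
      acquireNode(downNeighbor(n), mflag);
      acquireNode(leftNeighbor(n), mflag);
      acquireNode(rightNeighbor(n), mflag);
    }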
Now we can use our 2D torus as follows.
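A sketch of the usage, assuming the hypothetical Torus2D above; the torus size and thread count are chosen to match the statistics shown below, and the loop names match the reported regions:

    #include "galois/Galois.h"

    int main() {
      galois::SharedMemSys G;          // initialize the Galois runtime
      galois::setActiveThreads(24);    // e.g. 24 threads, as in the stats below

      Torus2D<unsigned int> torus(2000, 2000);   // 2000 x 2000 torus

      // Reset every node's counter; these activities are independent, so a
      // do_all without conflict detection suffices.
      galois::do_all(
          galois::iterate(torus),
          [&](size_t n) { torus.getData(n) = 0; },
          galois::loopname("do_all_torus_reset_self"));

      // Push-style operator: each node increments its four neighbors. All
      // neighbor locks are acquired up front; when two activities collide, the
      // runtime detects the conflict, rolls one back, and retries it later.
      galois::for_each(
          galois::iterate(torus),
          [&](size_t n, auto&) {
            torus.acquireAllNeighbors(n);
            torus.getData(torus.upNeighbor(n)) += 1;
            torus.getData(torus.downNeighbor(n)) += 1;
            torus.getData(torus.leftNeighbor(n)) += 1;
            torus.getData(torus.rightNeighbor(n)) += 1;
          },
          galois::loopname("for_each_torus_add_neighbors"));

      return 0;
    }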
Upon termination, this example app will output statistics similar to the following (generated with a 2000*2000 torus and 24 threads).
STAT_TYPE, REGION, CATEGORY, TOTAL_TYPE, TOTAL
STAT, do_all_torus_reset_self, Iterations, TSUM, 4000000
STAT, do_all_torus_reset_self, Time, TMAX, 70
STAT, for_each_torus_add_neighbors, Iterations, TSUM, 4001481
STAT, for_each_torus_add_neighbors, Time, TMAX, 88
STAT, for_each_torus_add_neighbors, Commits, TSUM, 4000000
STAT, for_each_torus_add_neighbors, Pushes, TSUM, 0
STAT, for_each_torus_add_neighbors, Conflicts, TSUM, 1481
Note that all conflicts are detected successfully and reported.
Conflict detection can be turned off when not required by passing galois::MethodFlag::UNPROTECTED to getData(), as shown below in the serial loop for verification.
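A sketch of such a verification loop (each node should end up with the value 4, one increment per neighbor):

    #include <iostream>

    // Serial verification: conflict detection is unnecessary here, so pass
    // galois::MethodFlag::UNPROTECTED to skip acquiring the node locks.
    for (auto n : torus) {
      auto& v = torus.getData(n, galois::MethodFlag::UNPROTECTED);
      if (v != 4)
        std::cerr << "Wrong value at node " << n << ": " << v << "\n";
    }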
The full example is available at lonestar/tutorial_examples/ConflictAwareTorus.cpp