Per-Thread Storage

Per-Thread storage refers to the storage which is local to each thread in a parallel program. This can be very useful in certain multi-threaded scenarios. For example, consider a multi-threaded program which accumulates information into a global variable. To avoid race conditions, every access to this global variable would have to be protected by a lock (mutex). Alternatively, each thread might accumulate into a thread-local variable on thread-local storage. Since each thread is accessing its own local variable, there will be no race condition. Finally, threads can synchronize to a final accumulation from their thread-local variables to a single shared global variable, which will lead to much better performance and scalability as compared the former approach of locking.

C++11 standard libraries provide the keyword _Thread_local to define thread-local variables. The header <threads.h>, if supported, defines thread_local as a synonym for that keyword. Example of usage:

#include <threads.h>

thread_local int foo = 0;

However, in C++ only static variables can be thread-local variables. Therefore, you can not dynamically create thread-local variables using C++ standard libraries.

Dynamical thread-local allocation/de-allocation can be very useful for parallel program. Therefore, Galois provides dynamic thread-local storage. The source for galois::substrate::PerThreadStorage shows the API for per-thread storage.

The code snippet below shows the declaration for per-thread storage:

  typedef galois::gstl::Vector<unsigned> VecTy;
  typedef galois::substrate::PerThreadStorage<VecTy> ThreadLocalData;
  ThreadLocalData edgesThreadLocal;

The code snippet below shows the usage of per-thread storage inside an operator for galois::for_each:

  // Find the partition n is most connected to
  auto pickPartitionEC = [&](GNode n, auto&) -> unsigned {
    auto& edges = *edgesThreadLocal.getLocal();
    edges.clear();
    edges.resize(parts.size(), 0);
    unsigned P = cg.getData(n).getPart();
    for (auto ii : cg.edges(n)) {
      GNode neigh = cg.getEdgeDst(ii);
      auto& nd    = cg.getData(neigh);
      if (parts[nd.getPart()].partWeight < maxSize || nd.getPart() == P)
        edges[nd.getPart()] += cg.getEdgeData(ii);
    }
    return std::distance(edges.begin(),
                         std::max_element(edges.begin(), edges.end()));
  };

As it can be seen above that unlike C++ Thread_local, galois::substrate::PerThreadStorage variables can be dynamically allocated/de-allocated or resized. A thread can get its own thread-local copy of per-thread storage by calling galois::substrate::PerThreadStorage::getLocal.

PerThreadStorage API also allows threads to access variables on other threads by passing the remote thread's id, for example,

//Thread 0 accessing edgesThreadLocal on Thread 1

auto& edges = *edgesThreadLocal.getRemote(1);

Per-Socket Storage

Similar to Per-Thread storage, Galois also provides Per-Socket (or Per-Package) storage, which is at the level of socket (or package). Each socket can have its own copy of a variable to work on and threads in different sockets can simultaneously access the socket-local variable without any race conditions. Also, in NUMA architecture, accessing a variable on a local socket is faster than accessing a variable on a remote socket (see details at NUMA-Awareness).

API for per-socket storage galois::substrate::PerSocketStorage is similar to per-thread galois::substrate::PerThreadStorage.

The code snippet below shows the usage of galois PerSocketStorage variable inside galois::on_each:

galois::substrate::PerSocketStorage<double*> socket_weights;
galois::on_each([](unsigned tid, unsigned total) {
  //Only one thread in a socket is allowed to access
  if (galois::substrate::getThreadPool().isLeader(tid)) {
    double* p = new double[NUM_VARIABLES];
     socket_weights.getLocal() = p;
    std::fill(p, p + NUM_VARIABLES, 0);
  }
});

Table of Contents

Per-Thread Storage

Per-Socket Storage