Galois
Per-Thread and Per-Socket Storage

Per-Thread Storage

Per-thread storage is storage that is local to each thread in a parallel program. It is useful when many threads update shared state. For example, consider a multi-threaded program that accumulates information into a global variable. To avoid race conditions, every access to this global variable would have to be protected by a lock (mutex). Alternatively, each thread can accumulate into its own variable in thread-local storage. Since each thread accesses only its own variable, there is no race condition. Once the threads have finished, their thread-local values are combined into the single shared global variable in one final accumulation step. This typically performs and scales much better than locking on every access.
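The accumulate-locally-then-combine pattern described above can be sketched in standard C++ without any Galois types. This is only an illustration under simple assumptions: the function name parallelSum and the chunking scheme are hypothetical, and the per-thread slots live in a plain std::vector rather than in true thread-local storage.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` using `numThreads` threads (hypothetical helper for illustration).
// Each thread accumulates into its own local variable, so no lock is needed.
std::uint64_t parallelSum(const std::vector<std::uint32_t>& data,
                          unsigned numThreads) {
  assert(numThreads > 0);
  std::vector<std::uint64_t> partial(numThreads, 0); // one slot per thread
  std::vector<std::thread> workers;
  const std::size_t chunk = (data.size() + numThreads - 1) / numThreads;

  for (unsigned t = 0; t < numThreads; ++t) {
    workers.emplace_back([&, t] {
      const std::size_t begin = std::min<std::size_t>(data.size(), t * chunk);
      const std::size_t end = std::min<std::size_t>(data.size(), begin + chunk);
      std::uint64_t local = 0; // private accumulator of this thread
      for (std::size_t i = begin; i < end; ++i)
        local += data[i];
      partial[t] = local; // single write to this thread's own slot
    });
  }
  for (auto& w : workers)
    w.join();

  // Final accumulation, performed once after all threads have finished.
  return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}
```

The only synchronization point is the join before the final accumulation; the inner loop never touches shared mutable state.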

C++11 provides the keyword thread_local to define thread-local variables. (In C11, the keyword is _Thread_local, and the header <threads.h>, if supported, defines thread_local as a synonym for it.) Example of usage in C++:

thread_local int foo = 0;

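However, in C++ thread_local can only be applied to variables declared at namespace scope or block scope and to static data members. The set of such variables is fixed at compile time, so you cannot create thread-local variables dynamically using the C++ standard alone.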
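A small sketch of what the standard keyword does give you may help here. The names tlsCounter, bumpLocalCounter, and runOnThreads are hypothetical; the point is only that each thread sees its own copy of a namespace-scope thread_local variable.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Namespace-scope thread_local: every thread gets its own copy, initialized to 0.
thread_local int tlsCounter = 0;

// Increment the calling thread's copy `n` times and return its value.
int bumpLocalCounter(int n) {
  for (int i = 0; i < n; ++i)
    ++tlsCounter;
  return tlsCounter;
}

// Run bumpLocalCounter on `numThreads` fresh threads and collect each result.
std::vector<int> runOnThreads(unsigned numThreads, int n) {
  std::vector<int> results(numThreads, -1);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t)
    workers.emplace_back(
        [&results, t, n] { results[t] = bumpLocalCounter(n); });
  for (auto& w : workers)
    w.join();
  return results;
}
```

Every entry of the returned vector equals n, because no thread observes increments made by another thread. The copy belonging to the calling thread is untouched.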

Dynamic allocation and de-allocation of thread-local data can be very useful in parallel programs. Therefore, Galois provides dynamic thread-local storage. The source for galois::substrate::PerThreadStorage shows the API for per-thread storage.

The code snippet below shows the declaration for per-thread storage:

ThreadLocalData edgesThreadLocal;

The code snippet below shows the usage of per-thread storage inside an operator for galois::for_each:

// Find the partition n is most connected to
auto pickPartitionEC = [&](GNode n, auto&) -> unsigned {
  auto& edges = *edgesThreadLocal.getLocal();
  edges.clear();
  edges.resize(parts.size(), 0);
  unsigned P = cg.getData(n).getPart();
  for (auto ii : cg.edges(n)) {
    GNode neigh = cg.getEdgeDst(ii);
    auto& nd    = cg.getData(neigh);
    if (parts[nd.getPart()].partWeight < maxSize || nd.getPart() == P)
      edges[nd.getPart()] += cg.getEdgeData(ii);
  }
  return std::distance(edges.begin(),
                       std::max_element(edges.begin(), edges.end()));
};

As seen above, unlike a C++ thread_local variable, the data held in galois::substrate::PerThreadStorage can be dynamically allocated, de-allocated, or resized. A thread gets its own thread-local copy of per-thread storage by calling galois::substrate::PerThreadStorage::getLocal.

The PerThreadStorage API also allows a thread to access the copies belonging to other threads by passing the remote thread's id, for example:

//Thread 0 accessing edgesThreadLocal on Thread 1
auto& edges = *edgesThreadLocal.getRemote(1);
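To make the local/remote distinction concrete, here is a minimal standard-C++ sketch of a container with one slot per thread. It is a hypothetical stand-in, not the Galois implementation: the class name SimplePerThread is invented, and unlike the snippets above its getLocal takes the thread id explicitly instead of looking it up.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for per-thread storage; NOT the Galois implementation.
// The number of slots is chosen at run time, one slot per thread.
template <typename T>
class SimplePerThread {
  std::vector<T> slots_;

public:
  explicit SimplePerThread(unsigned numThreads) : slots_(numThreads) {}
  unsigned size() const { return static_cast<unsigned>(slots_.size()); }

  // A thread passes its own id to reach its own slot.
  T* getLocal(unsigned tid) {
    assert(tid < slots_.size());
    return &slots_[tid];
  }
  // A thread passes another thread's id to reach that thread's slot.
  T* getRemote(unsigned tid) { return getLocal(tid); }
};

// Each thread resizes and fills its own vector; afterwards the calling thread
// walks over every thread's copy and returns the total number of elements.
std::size_t fillAndMerge(unsigned numThreads, std::size_t itemsPerThread) {
  SimplePerThread<std::vector<int>> storage(numThreads);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t) {
    workers.emplace_back([&storage, t, itemsPerThread] {
      auto& mine = *storage.getLocal(t);
      mine.clear();
      mine.resize(itemsPerThread, static_cast<int>(t)); // sized at run time
    });
  }
  for (auto& w : workers)
    w.join();

  std::size_t total = 0;
  for (unsigned t = 0; t < storage.size(); ++t)
    total += storage.getRemote(t)->size(); // read the other threads' copies
  return total;
}
```

In this sketch the remote reads happen only after all worker threads have been joined. Reading a remote copy while its owner is still modifying it would be a data race, so remote access generally needs some such ordering.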

Per-Socket Storage

Similar to per-thread storage, Galois also provides per-socket (or per-package) storage, which keeps one copy of a variable per socket (package). Threads on the same socket share that socket's copy, and threads on different sockets can access their own socket-local copies simultaneously without racing with one another. Also, on NUMA architectures, accessing a variable on the local socket is faster than accessing a variable on a remote socket (see details at NUMA-Awareness).

The API for per-socket storage, galois::substrate::PerSocketStorage, is similar to that of per-thread storage, galois::substrate::PerThreadStorage.

The code snippet below shows the usage of a galois::substrate::PerSocketStorage variable inside galois::on_each:

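galois::on_each([](unsigned tid, unsigned total) {
  // Only the leader thread of each socket initializes the socket-local copy
  if (galois::substrate::getThreadPool().isLeader(tid)) {
    double* p = new double[NUM_VARIABLES];
    // getLocal() returns a pointer to the socket-local slot, as in the
    // per-thread snippets above, so dereference it to store the array
    *socket_weights.getLocal() = p;
    std::fill(p, p + NUM_VARIABLES, 0);
  }
});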
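The leader-initializes pattern can also be sketched without Galois. Everything in this sketch is an assumption made for illustration: the Topology struct, the rule that consecutive thread ids share a socket, and the function initSocketWeights are hypothetical, and the code neither pins threads nor controls where memory is placed, so it shows only the control flow, not the NUMA benefit.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical topology: consecutive thread ids share a socket. Real code
// would ask the runtime (for example the Galois thread pool) instead.
struct Topology {
  unsigned numThreads;
  unsigned threadsPerSocket;
  unsigned numSockets() const {
    return (numThreads + threadsPerSocket - 1) / threadsPerSocket;
  }
  unsigned socketOf(unsigned tid) const { return tid / threadsPerSocket; }
  bool isLeader(unsigned tid) const { return tid % threadsPerSocket == 0; }
};

// Allocate one zero-filled weight array per socket. Only the leader thread of
// each socket performs the allocation for its socket.
std::vector<std::unique_ptr<double[]>>
initSocketWeights(const Topology& topo, std::size_t numVariables) {
  assert(topo.threadsPerSocket > 0);
  std::vector<std::unique_ptr<double[]>> weights(topo.numSockets());
  std::vector<std::thread> workers;
  for (unsigned tid = 0; tid < topo.numThreads; ++tid) {
    workers.emplace_back([&weights, &topo, tid, numVariables] {
      if (!topo.isLeader(tid))
        return; // non-leaders skip initialization
      // The trailing () value-initializes the array, i.e. fills it with 0.0.
      weights[topo.socketOf(tid)].reset(new double[numVariables]());
    });
  }
  for (auto& w : workers)
    w.join(); // after this point every socket's array is initialized
  return weights;
}
```

Each leader writes to a different element of the outer vector, so the initialization itself needs no lock. Threads that later share a socket's array still have to coordinate their own updates to it.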