Worklists are required to track work items in data-driven algorithms. A scheduling policy determines the order in which work items are extracted from a worklist. Galois provides various scheduling policies in the namespace galois::worklists; a policy is selected by passing it as the template parameter of galois::wl, an optional argument to galois::for_each. Below we cover the Galois schedulers, starting with the most commonly used ones.
Chunked worklists assign work items to threads one chunk at a time. Similarly, each thread accumulates new work in a chunk before putting it on the worklist. Chunking improves scalability because threads amortize the cost of accessing the shared worklist over an entire chunk. The user chooses the chunk size: a large chunk size means less contention on the shared worklist but may lead to load imbalance, while a small chunk size may increase contention on the shared worklist.
The worklist of chunks itself can be organized in different ways, and we have observed that mapping the communication patterns of threads to the hardware topology leads to more scalable implementations: a single queue of chunks shared by all threads (ChunkFIFO/ChunkLIFO), a queue of chunks per socket (PerSocketChunkFIFO/PerSocketChunkLIFO), or a queue of chunks per thread (PerThreadChunkFIFO/PerThreadChunkLIFO).
Below is an example of using chunked worklists from lonestar/tutorial_examples/SSSPPushSimple.cpp:
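The listing below is a condensed sketch in the spirit of that file rather than its verbatim contents; the graph type, the atomic distance update, and the ssspPush wrapper are illustrative assumptions.

```cpp
#include <atomic>

#include "galois/Galois.h"
#include "galois/graphs/LCGraph.h"

// Node data: current shortest distance; edge data: edge weight.
using Graph = galois::graphs::LC_CSR_Graph<std::atomic<unsigned>, unsigned>;
using GNode = Graph::GraphNode;

constexpr unsigned CHUNK_SIZE = 16;
// One queue of chunks per socket; each chunk holds up to CHUNK_SIZE items.
using PSchunk = galois::worklists::PerSocketChunkFIFO<CHUNK_SIZE>;

void ssspPush(Graph& graph, GNode source) {
  galois::for_each(
      galois::iterate({source}),
      [&](GNode node, auto& ctx) {
        unsigned srcDist = graph.getData(node);
        for (auto e : graph.edges(node)) {
          GNode dst        = graph.getEdgeDst(e);
          unsigned newDist = srcDist + graph.getEdgeData(e);
          auto& dstDist    = graph.getData(dst);
          unsigned old     = dstDist;
          // Atomically lower dst's distance; if we improved it, generate
          // new work. ctx.push() adds dst to the current thread's chunk.
          while (newDist < old) {
            if (dstDist.compare_exchange_weak(old, newDist)) {
              ctx.push(dst);
              break;
            }
          }
        }
      },
      galois::wl<PSchunk>(), // select the chunked scheduling policy
      galois::loopname("ssspPush"));
}
```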
galois::worklists::OrderedByIntegerMetric (OBIM) is suitable for implementing soft priorities, where active work items can be prioritized but priority inversion will not lead to wrong answers or deadlock. OBIM expects two template parameters: an indexer, a function object that maps a work item to an integer priority bin (bins are served in ascending order), and the worklist used to store items that fall into the same bin.
Below is an example of defining a galois::worklists::OrderedByIntegerMetric scheduling in lonestar/tutorial_examples/SSSPPushSimple.cpp:
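The snippet below reconstructs the gist of that definition; UpdateRequest, stepShift, and the chunk size are illustrative assumptions in line with the tutorial's conventions, not verbatim code.

```cpp
// A work item: a node together with the distance at which it was pushed.
struct UpdateRequest {
  GNode node;
  unsigned dist;
};

constexpr unsigned stepShift = 3; // bin width of 2^3 = 8

// The indexer maps a work item to its integer priority bin; smaller
// distances land in smaller bins, which OBIM serves first.
auto reqIndexer = [](const UpdateRequest& req) {
  return req.dist >> stepShift;
};

namespace gwl = galois::worklists;
using PSchunk = gwl::PerSocketChunkLIFO<16>; // worklist within each bin
using OBIM    = gwl::OrderedByIntegerMetric<decltype(reqIndexer), PSchunk>;
```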
The reqIndexer object defines the priority binning function. Internally this OBIM uses a PerSocketChunkLIFO to store the items with the same priority. The following code snippet shows how to use OBIM with galois::for_each:
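A sketch of the call, assuming the definitions above; the initial container of requests is a placeholder of this sketch:

```cpp
std::deque<UpdateRequest> initial = {{source, 0}}; // requires <deque>

galois::for_each(
    galois::iterate(initial),
    [&](const UpdateRequest& req, auto& ctx) {
      // Relax the outgoing edges of req.node and ctx.push() an
      // UpdateRequest for every neighbor whose distance improved.
    },
    galois::wl<OBIM>(reqIndexer), // the indexer is forwarded to OBIM's constructor
    galois::loopname("ssspPushOBIM"));
```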
OBIM works well when the algorithm's performance is sensitive to scheduling and the work items can be grouped into a small number of bins ordered by integer priority (typically ~1000 bins). For example, in single-source shortest path, focusing on nodes with smaller distances converges faster, as long as there are enough nodes of similar priority to be processed in parallel.
When parallel execution is organized in rounds separated by barriers, existing work items are processed in the current round, while new items generated in the current round are postponed until the next round. In this case, galois::worklists::BulkSynchronous can be used to avoid maintaining two worklists explicitly in user code. The underlying worklist for each round can be customized by providing template parameters to galois::worklists::BulkSynchronous.
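As an illustration, a round-based BFS could select this policy as follows; the inner PerSocketChunkFIFO and all names here are assumptions of this sketch:

```cpp
// Items pushed during a round are held back until the barrier, then
// processed in the next round; no second worklist in user code.
using BSWL = galois::worklists::BulkSynchronous<
    galois::worklists::PerSocketChunkFIFO<16>>;

galois::for_each(
    galois::iterate({source}),
    [&](GNode n, auto& ctx) {
      // visit n; ctx.push() the next round's frontier
    },
    galois::wl<BSWL>(),
    galois::loopname("bfsByRounds"));
```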
galois::worklists::LocalQueue creates thread-local, non-shared worklists that hold all work generated during concurrent execution, and uses a global worklist for all initial work.
galois::worklists::OwnerComputes is similar to galois::worklists::LocalQueue, with one main difference: newly generated work items are not necessarily kept by the thread that created them; instead, an indexer function object, supplied as a template parameter, maps each item to its owner, and the item is pushed to the owner's worklist. A usage sketch for these policies follows below.
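Continuing the SSSP sketch, such a policy is again chosen via galois::wl; the default template parameters used here are an assumption and may differ across Galois versions:

```cpp
using LQ = galois::worklists::LocalQueue<>; // assumption: default parameters

galois::for_each(
    galois::iterate(initial),
    [&](const UpdateRequest& req, auto& ctx) { /* as in the OBIM sketch */ },
    galois::wl<LQ>(),
    galois::loopname("ssspLocalQueue"));
```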
galois::worklists::StableIterator can be used when the loop iterates over a fixed range of items and the operator does not generate new work items. It differs from galois::do_all in that it supports speculation and allows iterations to abort.
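For instance, a loop over all graph nodes whose operator may abort and retry could be scheduled as follows; the default template parameters are an assumption of this sketch:

```cpp
galois::for_each(
    galois::iterate(graph), // fixed range: all nodes of the graph
    [&](GNode n, auto& ctx) {
      // The operator may acquire locks on neighbors and abort on conflict;
      // it must not ctx.push() new work under this scheduler.
    },
    galois::wl<galois::worklists::StableIterator<>>(),
    galois::loopname("fixedRangeLoop"));
```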
Galois also provides sequential worklists, whose push() and pop() need to acquire locks. As such, sequential schedulers do not scale beyond a small number of threads, so we recommend against using them with parallel loops.