Manual

Command Line

Galois programs can be run just like any other Java application. However, to simplify preparation of certain types of experimental results and to modify certain runtime parameters such as the number of threads to use, we suggest using the galois execution script found in scripts/galois. The script takes four types of arguments: arguments to the Java VM, arguments to the Galois runtime system, the class name of the application to run and arguments to the application. The arguments must appear in this order, but only the class name is required. All other arguments are optional.

scripts/galois [vmargs] [galoisargs] class_name [args]

class_name is the name of the class to run. It must contain a static void main method. args are the command line arguments to pass to the application.

vmargs can be zero or more of the following:

-m HEAPSIZE: set the min and max Java heap size to HEAPSIZE
--vm VMARG: pass VMARG directly to the Java VM
-d: disable assertions (assertions are enabled by default)

galoisargs can be zero or more of the following:

-r NUMRUNS: run the application NUMRUNS times in the same VM instance (the default is 1)

-t NUMTHREADS: run the application using NUMTHREADS threads (the default is 1)

-f FILE: read the application arguments (args) from FILE instead of from the command line

-s: use serial execution, i.e., switch to non-concurrent versions of all Galois data-structures and use a serial worklist and execution strategy

-g: enable additional statistics, including stack profiling and processor utilization tracking

--help: print help for Galois options

Statistics

During execution of an application, the Galois runtime system will collect and print various statistics. These statistics help keep track of how well the application and the Galois system are performing. By default, the system keeps track of runtime, number of iterations committed and aborted, and time threads spend idle in the runtime but not in application code. The following is example output:

==== Summary Statistics ==== Idleness: Total Thread Time (thread*ms): 4880 Rel. Idleness: 0.0535 Iterations: Committed: 243199 Total: 244153 Abort Ratio: 0.0039 Runtimes (ms): Drop first 1: mean: 604.43 min: 579 max: 660 sample stdev: 27.35

Runtime is the time taken between calls to util.Launcher.getLauncher().startTiming() and util.Launcher.getLauncher().stopTiming(). For most applications included in the Galois distribution this excludes time to read input from files. If there are no calls to startTiming(), then the runtime is the entire time to run the application.

The abort ratio is simply 1 - committed / total iterations.

Total thread time is the total time each Galois worker thread is alive. Idleness keeps track of how often alive threads are idle in the Galois runtime code. Relative idleness is the ratio between thread idle time and total thread time. This only keeps track of the following causes of idleness: the time between initially starting a Galois foreach loop and when a thread executes its first iteration, the time a thread sleeps waiting for more work to appear on the worklist, the time a thread sleeps due to excessive aborting iterations, and the time between a thread finishing its last iteration and when a Galois foreach loop actually finishes.

More detailed results can be found in the stats.txt file in the current working directory after running a Galois application. It contains expanded data for these categories.

For applications with multiple foreach loops, these statistics will be summed over all of these loops. When multiple runs are executed within the same VM instance (the -r command line option), all statistics except the runtime will only be from the last run in the instance. However, the runtime statistic is calculated from all runs.

Running a Parallel Foreach Loop

Unordered foreach loops

Running a foreach loop with an unordered worklist in Galois is accomplished using the call

GaloisRuntime.foreach(Iterable<T> initial, Lambda2Void<T, ExecutorContext<T>> body, Rule priority);

with the following three parameters:

initial: This collection contains the initial active nodes the computation starts with. It can be any collection class that implements the Iterable interface. The content of the collection is used to populate the internal worklist of the Galois runtime.
body: The foreach body implements the code that is executed to process an active node. To create a body, a class has to be defined that implements the Lambda2Void<T, ExecutorContext<T>> interface, including the method call(T item, ExecutorContext<T> ctx). The parameter Item in the call method is the active node to process and ctx is the current execution context to which new active nodes can be added.
priority: The priority specifies the order in which elements from the worklist are handed out to the worker threads. In serial or one-thread execution, this order is the same as the order in which the active nodes are processed and committed, whereas in multiple-thread execution, the order is not guaranteed. Users can specify a chain of rules, each of which defines a special order of the active nodes. The specified rules are applied from left to right. Whenever active nodes are equal according to the previous rule(s), the next rule is applied to further order them. The chain of rules is created by the classes Priority and Priority.Rule. The Priority.first method is used to create the first rule and returns a instance of the class Priority.Rule. Calling the method then on such an instance appends additional rules.

There are several predefined classes for representing orders. They can be passed to the Priority or the Rule class to create a rule instance. Here are some examples:

RandomOrder: random order
FIFO: first-in first-out and there are no active nodes with the same priority
LIFO: last-in first-out and there are no active nodes with the same priority
ChunkedFIFO: like FIFO, but the worklist assigns the same priority to a chunk of active nodes. Conceptually, the worklist has a counter and a number indicating the current priority level. When a new active node is added, the counter is incremented and the active node is assigned the current priority. When the counter reaches a user-specified threshold (i.e., the chunk size), it is reset and the current priority is incremented. The default chunk size is 64.
Bucketed: the order is defined by a user-defined function that maps an active node to an integer. The priority increases with ascending value by default. Users can pass in a boolean to specify whether the priority is defined according to ascending or descending order when creating a rule. An example is provided below.
Ordered: the order is defined by a user-defined comparator

Let us look at the spanning tree example to illustrate these concepts:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

final Bag<Edge> spanningTree = Bag.create(); GNode<NodeData> startNode = Graphs.getRandom(graph); startNode.getData().setInSpanningTree(true); GaloisRuntime.foreach(Arrays.asList(startNode), new Lambda2Void<GNode<NodeData>, ForeachContext<GNode<NodeData>>> { @Override public void call(final GNode<NodeData> src, final ForeachContext<GNode<NodeData>> ctx) { src.map(new ActivityBody() { @Override public void call(final GNode<NodeData> dst) { // assume no self-edges NodeData dstData = dst.getData(); if (!dstData.inSpanningTree()) { Edge edge = new Edge(src, dst); spanningTree.add(edge); dstData.setInSpanningTree(true); ctx.add(dst); } } }); } }, Priority.first(ChunkedFIFO.class, 128).then(FIFO.class));

In line 4, an initial worklist containing only the node startNode is generated (Arrays.asList(startNode)) and passed into the GaloisRuntime.foreach method. In line 6, an anonymous class that implements the Lambda2Void<T, ExecutorContext<T>> interface is created that expresses the foreach body, which inserts the edge between the current active node src and each of its neighbors into the spanning tree if the corresponding neighbor is not already in the spanning tree. In line 21, Priority.first() creates the first rule of a rule chain specifying ChunkedFIFO order with a chunk size of 128. Priority.first() returns an instance of the Rule class and the method then() is used to specify the next rule in the rule chain. In this example, FIFO order is specified as the next rule. In line 16, ctx.add(newActiveNode) is called to add the new active node dst to the worklist.

The following example illustrates the use of the Bucketed class:

1 2 3 4 5 6 7

Priority.first(Bucketed.class, 1000, false, indexer); Lambda<GNode<Node>, Integer> indexer = new Lambda<GNode<Node>, Integer>() { @Override public Integer call(GNode<Node> x) { return Math.min(x.getData().dist / 50000, 1000 - 1); } };

Line 1 defines a rule based on the Bucketed class with 1000 buckets, sorted in descending order as indicated by false, and using the index function indexer. The index function maps an active node x to a integer according to its dist field (line 5).

The Galois system allows separate rules to be specified for the newly created active nodes that are added dynamically to the worklist (we refer to this order as "local order" and the orders discussed above as "global order"). The local order can be defined by Priority.Rule.thenLocally(...). Here is an example:

Priority.first(FIFO.class).thenLocally(LIFO.class);

This specification means that the active nodes in the "initial" worklist are ordered according to the global FIFO order, but dynamically added active nodes are ordered according to the local LIFO order and are put into the worklist before the next active node in the global order.

If the initial worklist contains all nodes in the graph, all neighbors of a node, or, in general, all objects a Mappable object can map to, then there is a more efficient way to launch the foreach:

GaloisRuntime.foreach(Mappable<T> mappable, Lambda2Void<T, ExecutorContext<T>> body, Rule priority);

Instead of an Iterable<T> object, the first parameter is a Mappable<T> object. The mappable approach is faster because it avoids the cost of creating a separate worklist, copying all mapped objects into the worklist, and copying the worklist into the Galois internal worklist.

Ordered foreach loops

Running a foreach with an ordered worklist using Galois is accomplished through the following method:

GaloisRuntime.foreachOrdered(Iterable<T> initial, Lambda2Void<T, ExecutorContext<T>> body, Rule priority);

The parameters have similar meaning as in the unordered case except when specifying the order (i.e., the priority) only the Ordered.class can be used. The following code provides an example:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

GaloisRuntime.foreachOrdered(initial, new Lambda2Void<KEdge, ForeachContext<KEdge>>() { @Override public void call(final KEdge e, final ExecutorContext<KEdge> ctx) { GNode<KNode> rep1 = UFHelper.find(e.getFirst()); GNode<KNode> rep2 = UFHelper.find(e.getSecond()); if (rep1 != rep2) { UFHelper.union(rep1, rep2); e.setInMST(); } } }, Priority.first(Ordered.class, new EdgeCmp())); public class EdgeCmp implements Comparator<KEdge> { @Override public int compare(KEdge left, KEdge right) { return left.getWeight() - right.getWeight(); } }

Line 1 invokes an ordered foreach loop using GaloisRuntime.foreachOrdered(...). Line 11 specifies the order, which has to be the Ordered.class. It requires the specification of a comparator, in this case the class EdgeCmp, which is defined in lines 13-18 and has to implement the java.util.Comparator interface.

Suspending and stopping foreach loops

In some application, it is desirable to suspend the parallel execution to do some serial computation and then resume the parallel computation. For example, the preflow-push algorithm suspends the parallel flow pushing execution temporarily to perform global relabeling. Similarly, it may be desirable to stop the parallel computation when some condition is met. These two behaviors are supported through the methods suspendWith() and finish() in the ExecutorContext class. There are two helper classes for the two behaviors: CounterToSuspendWith and CounterToFinish. Both classes maintain a counter and when the count reaches a preset threshold, the parallel execution is suspended or stopped, respectively. The following code provides an example:

1 2 3 4 6 8 9 10 11 12 13 14 15 16 17 18 19

final Counter<GNode<Node>> relabelYet = new CounterToSuspendWith<GNode<Node>>(globalRelabelInterval, false, new LambdaVoid<ExecutorContext<GNode<Node>>>() { @Override public void call(ExecutorContext<GNode<Node>> ctx) { globalRelabelSerial(gapYet, ctx); } }); GaloisRuntime.foreach(initializePreflow(null), new Lambda2Void<GNode<Node>, ExecutorContext<GNode<Node>>>() { @Override public void call(GNode<Node> item, ExecutorContext<GNode<Node>> ctx) { int increment = 1; if (discharge(ctx, null, item)) { increment += BETA; } relabelYet.increment(ctx, increment); } }, Priority.first(ChunkedFIFO.class, 8));

A CounterToSuspend class is defined in Lines 1-9. The constructor of CounterToSuspendWith is the following:

public CounterToSuspendWith(int countTo, boolean exact, LambdaVoid<ExecutorContext<T>> callback);

The parameter countTo is the threshold (in the example it is globalRelabelInterval, which specifies the interval between two global relabelings). Exact means the parallel execution is suspended when the counter is exactly equal to the threshold countTo; otherwise, whenever the counter is greater than the countTo value, the parallel execution is suspended. After the suspension, the counter is reset. In line 17, the counter is increased by the variable increment.

Nested foreach loops

Galois supports nested foreach loops, that is, calls to GaloisRuntime.foreach and GaloisRuntime.foreachOrdered can be statically and dynamically nested.

Code Examples

Initiating parallel execution

Launching a parallel foreach loop is accomplished by calling one of the foreach methods in the Galois system and passing an appropriate Lambda object to it. The runtime then removes one element after another from the worklist and invokes the call method of the Lambda object on each element. The following example shows how a Lambda2Void object is used to process a node in a given runtime context. The first parameter is a graph node (in this case containing a generic Object data item) and the second parameter is an ExecutorContext<GNode<Object>>, which can be used, e.g., to add items to the worklist by calling ctx.add(gNode, ...).

protected void runLoop() throws ExecutionException { GaloisRuntime.foreach(nodes, new Lambda2Void<GNode<Object>, ExecutorContext<GNode<Object>>>() { @Override public void call(GNode<Object> item, ExecutorContext<GNode<Object>> ctx) { runBody(item, ctx); // code for processing the active node goes here } }, Priority.first(RandomOrder.class)); }

Processing the neighbors of a graph node

Processing a node taken from the worklist may require traversing its neighbors. The following example (taken from the Boruvka application) shows a Lambda object for finding the neighbor with the lightest edge connected to a given node. This Lambda object requires two graph nodes to find the weight of the edge connecting them and therefore uses a Lambda2Void object.

private class FindLightestClosure implements Lambda2Void<GNode<Object>, GNode<Object>> { GNode<Object> lightest = null; int min = Integer.MAX_VALUE; @Override public void call(GNode<Object> neighbor, GNode<Object> node) { int curr = graph.getEdgeData(node, neighbor, MethodFlag.NONE); if (curr < min) { lightest = neighbor; min = curr; } } }

Note that this object contains two fields for storing information across different calls. To avoid concurrent data races, a new object is allocated each time a node is processed, as shown by the next code snippet, which invokes the Lambda object via the map method of the GNode type.

FindLightestClosure closure = new FindLightestClosure(); n.map(closure, n, MethodFlag.CHECK_CONFLICT);

Processing all graph nodes

The following example shows how to use a Lambda object to assign new values to the data items stored in all graph nodes.

g.map(new LambdaVoid<GNode<Integer>>() { @Override public void call(GNode<Integer> node) { node.setData(perm[node.getData()]); } });

Tuning Galois Programs

This section provides some Galois-specific tuning hints that we have found useful to improve the performance of our codes.

Graph class selection

Galois programs have to use one of the provided concurrent graph classes for all data that is shared between parallel activities. Several graph implementations are available that are optimized for different uses. The three currently supported implementations are listed in Table 1.

ArrayIndexedTree<NodeDataType>

LocalComputationGraph<NodeDataType, EdgeDataType>

MorphGraph<NodeDataType, EdgeDataType>

Table 1: Supported graph classes

Each of these three graph classes contains a builder to create graphs of this type that are tailored to specific needs. For example, the indexed tree is optimized for tree-based algorithms that do not need edge data and allows the indexing of specific children, such as the first child of a node in a search tree. The user can tailor the graph by specifying the maximum number of children per node (e.g., two in case of a binary tree). To maximize the performance and minimize the memory consumption, this number should always be chosen as small as possible.

The morph graph is the most general graph. It can store any type of object in each node and edge. However, the builder allows the user to specify what type of data, if any, the edges can hold. The supported types are object, int, long, float, double, and void (no edge data). Using objects results in the slowest execution and should therefore be avoided if one of the other types can be used instead. The user has to specify whether the edges are directed or undirected, which is application dependent. The user can choose between an implementation that internally uses a vector or a hashset. The vector-based implementation is preferable for sparse graphs (where all nodes have few neighbors) whereas the hashset-based implementation results in better performance for dense graphs (where at least some nodes have many neighbors at some time during the computation).

The local computation graph does not permit operations that modify the graph in any way, such as adding or removing edges or nodes. Hence, it can only be used for computations that merely modify the values contained in the nodes or edges. Whereas using a local computation graph makes local computation operations much faster, a local computation graph has to first be generated by copying the structure and the values from a morph graph, which takes time. Thus, using a local computation graph is only worthwhile if more than O(E+V) operations will be performed on it. If the edges of the graph have one data field of type int, long, float, or double, the user can request an optimized local computation graph from the graph builder that is tailored towards these edge data types. Using the most restrictive graph version typically results in the best performance and the smallest memory footprint.

Worklist parameters

The parallel foreach loops in Galois take an initial worklist as a parameter. Because any class that implements the List interface can be passed, it is often unnecessary to first populate an explicit worklist. Instead, the original object should be passed directly, or a List-based view can be passed such as Arrays.asList().

The internal worklist is not directly accessible. For example, it cannot be queried to determine whether a specific item is in the list or not. Hence, we sometimes found it useful to add a flag to our worklist objects that is programmatically updated to indicate whether the object is on the worklist or not.

BoundedFIFO<T>

BoundedLIFO<T>

ChunkedFIFO<T>

ChunkedLIFO<T>

ChunkedRandomOrder<T>

FIFO<T>

LIFO<T>

RandomOrder<T>

Table 2: Selection of supported schedules

The Galois system currently supports many different schedules for determining the order in which the worklist elements are handed out to the worker threads. Depending on the application, selecting a good schedule can have a substantial impact on performance. Table 2 provides some examples. For the full list of supported schedules, see the galois.runtime.wl package in the API section.

The FIFO schedule selects elements in First-In First-Out order whereas LIFO uses Last-In First-Out order. Chunked FIFO and LIFO are similar but assign an entire chunk (range of adjacent elements) from the worklist to the worker threads. The bounded versions limit the maximum number of elements the worklist can hold but are a little faster. RandomOrder uses a random number generator to select each element, which can be slow. ChunkedRandomOrder is faster because it randomly assigns chunks to threads but does not randomize the elements within a chunk. Note that it is possible to build a hierarchy of schedules. For example, first specifying ChunkedFIFO and then LIFO means that chunks of elements are assigned to worker threads in FIFO order, but each thread will see the elements from its chunk in LIFO order. More detail can be found in the Foreach Usage section. If the schedule does not matter, ChunkedFIFO should be used. If it is unclear which schedule to use for a given application, we recommend experimenting with the different schedules.

Method invocation flags

Most methods in the Galois graph classes allow an optional flag to be passed as the final parameter. This flag can be used to turn off conflict detection and/or saving of undo information to make the runtime system faster. However, these flags may only be used under certain conditions where their usage does not interfere with program correctness. See the galois.objects.MethodFlag class in the API section.

The implementation of an operator can sometimes be changed so that it still computes the same result but exposes more opportunity to exploit these flags. One very important code transformation in this regard is making operators cautious, possibly using the one-shot optimization [PPoPP'10]. If an operator is guaranteed to first execute read-only methods that touch all nodes of the graph that later method invocations in the same operator will touch, the operator is said to be cautious. As long as all the nodes have been touched and locked before the first update in a cautious operator, any possible conflict will already have been raised or the operator is guaranteed to terminate without a conflict. Therefore, it is not necessary to check for conflicts or to save undo information for any method call after and including the first method that performs a graph update in a cautious operator. The flags can be used to communicate this information to the system, thus making the program execution much faster.

If the operator cannot directly be made cautious, the one-shot optimization may be used to make it so if the nodes an operator will touch are statically known. For example, if the operator will only touch the direct neighbors of the active node (or a subset thereof), one can insert a method call at the beginning of the operator that reads the direct neighbors (and discards the result). As a side effect, this invocation will lock all the neighbors, making the operator cautious and therefore allowing the programmer to invoke all further method calls without conflict detection or saving undo information.

If none of these optimizations apply, it may still be possible to optimize some method calls. For example, when building a tree top down, the tree is modified at the leafs and the rest of the tree is only read. Hence, it is not necessary (and bad for performance) to lock the entire path from the root to the leaf where a new node needs to be inserted. Instead, the leaf should first be found in read-only mode using the flag that disables conflict detection. Then the leaf should be read again, but this time with conflict detection enabled to ensure that it is not currently in use by another activity. Finally, the new node can be inserted without saving undo information since no other activity can insert a node at the same place in the tree. See the binary search tree code in the Getting Started section for an example.

Finally, some foreach loops only read the shared data. For such reader operators, conflict detection and saving undo information can always be turned off.

General performance improvements

Almost all general (i.e., non-Galois specific) code optimizations will also help make Galois code run faster and should therefore be applied. One important example is to reuse objects as much as possible to avoid calls to "new". To get the most from parallelization, as much work as possible should be performed inside the parallel loops. Because there is a fixed overhead associated with each loop iteration, it is important to perform enough work per iteration so that the overhead is small in comparison. To access the neighbors of a node, a map method needs to be called, which takes a Lambda object as a parameter, i.e., a function to be applied to each neighbor (see the Code Example section for an example). Since creating Lambda objects is expensive, they should be reused as much as possible for best performance. Finally, Galois application writers should strive to make their computations as local as possible to minimize conflicts between concurrent activities and to maximize cache performance.