Hello,
I am back. Today I will describe briefly two possible ways to assemble the algebraic system originated by a FEM method.

More precisely I am going to solve the 1D problem

$$ \frac{\mathrm{d}^2 u}{\mathrm{d}x^2}=2,\qquad \text{in }\Omega=[0,1] $$

with initial condition

$$ u(0)=u_0=1,\qquad u(1)=u_1=2. $$

The solution of this problem is obviously \(u=x^2+1\).

The first code which I implemented is available at this link (always the same repository :))

The code constructs a vector of grid points. The mesh is “partitioned” among threads. Each thread assembles a local stiffness matrix and than copy the values in the global stiffness matrix A.

The only race condition which could happen is for the grid points which are the boundaries of the partitions. Therefore if we are using N threads the number “critical” points is N-1. To avoid a race condition I have created a mutex for each “critical” point (which is marked as shared).

If an entry of type shared need to be inserted, assemble() takes the corresponding lock.

It is worth to notice that the mutex is not a plain std::mutex but it is a std::recursive_mutex to avoid a deadlock when the lock is acquired twice (it happens when i=j in the assembly loop).

This choice of lock mechanism works in this case because the number of critical point is very small compared to the size of the vector. In general this approach is very inefficient.

The second code which I implemented

use a completely different approach.

Now the assemble() function fills some local variables (a vector and a matrix). Here local means that each thread fills some variables which are used only by itself. The “critical” part is the update of the global variables with the local ones. These update is performed by the function push().

In order to avoid a race condition, the threads are divided in two sets. The threads in the same set don’t have any shared position to write in. Obviously it is necessary to have a global barrier to avoid different set of threads to be run concurrently. This barrier is performed using the std::join() method.

This second method seems more efficient and is completely lock-free.

The method which I have just described is known as coloring. It is possible to describe the interactions between threads as graph where each node represents a thread; two nodes are connected if they share at least one element. Having a set of connected partitions, you assign each partition a color such that no two partitions of the same color share a connection, and then you can handle all partitions of the same color in parallel.

The coloring algorithm can be generalized for a general codes. More precisely:

  1. the user provides, for each thread, a vector containing the global indices of the elements which the thread uses;
  2. comparing all these vectors is possible to create a graph and extract the connected components (the creation of the graph is not very efficient since the vectors are generally not sorted);
  3. a simple greedy algorithm can find a sub-optimal coloring.

A possible greedy algorithm is the following:

  1. color first vertex with first color;
  2. do this for remaining vertices: consider the currently picked vertex and color it with the lowest numbered color that has not been used on any previously colored vertices adjacent to it. If all previously used colors appear on vertices adjacent, assign a new color to it.

The coloring need to be performed only once and than can be used at every update process. The greedy algorithm never uses more than \(d+1\) colors where \(d\) is the maximum degree of a vertex in the given graph.

Stay tuned!
Marco.