In the previous posts, we implemented the Floyd-Warshall algorithm (four variants) and the route reconstruction algorithm. In those posts, we explored the basic concepts of the shortest path problem: in-memory data representation, parallelism, vectorization, and how to adapt algorithms to data characteristics.
In this post, we will continue our journey and explore a more efficient way to solve the all-pairs shortest path problem. However, this time, besides utilizing the vector and parallel capabilities of the CPU, we will also utilize the L1, L2, and L3 caches.
Does it sound interesting? Then let's get to coding 🙂
Before diving into the theory and code of the algorithm, I suggest taking a few minutes to review the basic information about the "CPU cache".
If you have a solid understanding of the topic, feel free to skip this section. If you decide to read it (because you're just curious or want to review the basics), please note that the following information is a simplified overview.
What is caching?
You've likely heard about CPU caching in advertisements for new Intel or AMD chips, or in hardware discussions with friends or coworkers. Here, we're not going to get into a discussion of what CPU caching actually is, or who is better at implementing it, Intel, AMD, or Apple.
Instead, we will try to build a mental model of caching and learn how to use it to:
- Gain a better understanding of the limits we can or cannot push when writing high-performance code in C# and .NET.
- Better understand how applications perform in high-performance scenarios on different platforms and with different parameters.
- Proudly talk about "caching" in front of your friends.
So, let's start with the basics... What exactly is a cache?
Imagine you're sitting in an empty apartment with your computer in front of you. You're working on a new feature and it's going well: you work fast, without mistakes or errors, and you're in great shape.
You are the CPU.
You'll need to check out the documentation on a regular basis, and all the documentation you need is in the library across the road.
The library is main memory.
You can always access the library to read whatever you need. However, the time it takes to get to the library each time hinders your progress, which doesn't sound efficient. To improve the situation, you start writing snippets of some documents on sticky notes and sticking them on your monitor.
The sticky notes on the monitor are the L1 cache.
Now, when you need to find a document, check the sticky notes first and only go to the library if the information is not available.
If you find the information you need on a sticky note, it's called a cache hit; otherwise, it's a cache miss.
Putting sticky notes on your monitor reduces the amount of time spent at the "library", but it doesn't eliminate it completely, because you don't know in advance what documents you'll need. There is a limit to the amount of useful information you can bring back from the library.
The amount of useful information that can be brought back is called a cache line.
You can't put millions of sticky notes on your monitor, so you'll have to get rid of some of them when space runs out.
Throwing away sticky notes because of space constraints is called cache eviction.
In order to stop throwing away useful information (and to avoid going to the library when you need to), you bought yourself a "little box". Now, when you bring back "new" sticky notes from the library and need to make room on your monitor, instead of throwing away the "old" sticky notes, you put them in the "little box". That way, if you need the information again, you can take it out of the "little box" and put it back on your monitor, saving you a trip to the library.
The "little box" is..second-level cache。
A "small box" can hold more sticky notes than a monitor, but it's not infinite. So when some of the sticky notes don't fit, you still need to throw them away. That's why you decided to buy a "giant box". Now, when you bring back new sticky notes from the library, first copy them and put the copies in the "giant box", and then put the originals on your monitor. If you need to replace the sticky notes on your monitor, put them in the "little box" as before, and if the "little box" is full, you throw them away. However, this time, if you need the information again, you just copy the sticky notes from the "giant box" and put them back on the monitor.
The "giant box" isThree-level cache (aka LLC)。
The "giant box" does not eliminate trips to the "library", however, if the capacity of the "giant box" is quite large, you will be able to store a large number of relevant documents in it, thus enabling you to complete a large amount of work without interruption. This will enable you to accomplish a great deal of work without interruption.
That's basically it.
The above model is not complete or ideal, but at least I hope it clearly articulates the main reason modern chips have multiple levels of cache - to minimize the number of round-trip accesses to main memory and to keep the CPU as busy as possible.
What may not be obvious from the model is that all caches (the L1 "sticky notes on the monitor", the L2 "little box" and the L3 "giant box") have different access times (i.e., latency) and sizes.
Below is an example of an Intel i7-6700 (Skylake) CPU:
Cache | Access time (latency) | Size |
---|---|---|
L1 | 4 cycles | 64 KB |
L2 | 12 cycles | 256 KB |
L3 | 42 cycles | 8 MB |
Table 1. Access times and sizes of the L1, L2, and L3 caches of an Intel i7-6700 (Skylake) processor.
Please note that the above figures are only an example. Different CPUs have different characteristics. However, the overall relationship is roughly the same - L1 is the smallest but also the fastest, L2 is in between, and L3 is the largest but also the slowest.
The key to really understanding these numbers is to know how much time it takes to access main memory. On the same Intel i7-6700 (Skylake) CPU, accessing main memory takes about 42 cycles plus 51 nanoseconds. You will notice that, in addition to the cycles, there is a fixed "51 nanoseconds" component. This is not a mistake, because cycles can be converted (roughly) to nanoseconds using a very simple formula:
\(t=\frac{1}{f}\)
where:
- t is the approximate time of one CPU cycle
- f is the CPU frequency
In our example, the result of the calculation is:
\(t=\frac{1\ \text{cycle}}{4 \times 10^{9}\ \frac{\text{cycles}}{\text{s}}}=0.25 \times 10^{-9}\ \text{s}=0.25\ \text{ns}\)
Converting the access times in the previous table from cycles to nanoseconds yields the following results:
Component | Access time (latency) |
---|---|
L1 | 1 ns |
L2 | 3 ns |
L3 | 10.5 ns |
Main memory (RAM) | 10.5 ns + 51 ns = 60.5 ns |
Table 2. Approximate access times (in nanoseconds) of the L1, L2, and L3 caches and main memory of an Intel i7-6700 (Skylake) processor.
Even considering that this is a very approximate conversion that ignores many details, the results are still surprising - accessing the L1 cache is 60 times faster than accessing main memory.
Of course, in practice, it's not that simple and cache-related optimizations don't directly result in a 60x performance improvement. However, as I hope to show in this post, they can still significantly improve application performance.
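If you want to reproduce the conversion yourself, here is a minimal C# sketch (the frequency and cycle counts are the example values from Tables 1 and 2, not measurements of your machine):

// Convert cache latencies expressed in CPU cycles into approximate
// nanoseconds using t = 1 / f (example values from Tables 1 and 2).
const double frequencyHz = 4_000_000_000; // 4 GHz, as in the example above
double nsPerCycle = 1e9 / frequencyHz;    // 0.25 ns per cycle

Console.WriteLine($"L1:  {4 * nsPerCycle} ns");       // ~1 ns
Console.WriteLine($"L2:  {12 * nsPerCycle} ns");      // ~3 ns
Console.WriteLine($"L3:  {42 * nsPerCycle} ns");      // ~10.5 ns
Console.WriteLine($"RAM: {42 * nsPerCycle + 51} ns"); // ~60.5 ns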
Readers familiar with this topic will notice that I have not mentioned anything about cache modes (exclusive and inclusive), nor have I touched on instruction caches, memory alignment, hardware and software prefetching, and so on. I did this to keep the description simple.
However, if you've just stumbled across this topic and want to know more, you can start with the Wikipedia page, and if you want to dig deeper into low-level performance, I recommend Denis Bakhvalov's book and Ulrich Drepper's paper.
The Blocked Floyd-Warshall algorithm
The Blocked Floyd-Warshall algorithm (also known as Blocked Floyd-Warshall) was introduced in [2]. The algorithm works on a graph G of vertices V, represented by a weight matrix W. It splits the matrix into equal-sized blocks (creating a new block matrix B) and uses these blocks to compute the shortest paths between all pairs of vertices in the graph.
This definition may not sound obvious, so here's an example.
Imagine an 8 × 8 weight matrix W (see Fig. 1a). We divide the matrix into 2 × 2 blocks, creating a new 4 × 4 block matrix B (see Fig. 1b).
Fig. 1. Partitioning an 8 × 8 weight matrix W (a) into a 4 × 4 block matrix B of 2 × 2 blocks (b).
Each block of matrix B contains a part of all the paths of matrix W, for example (see Fig. 2):
- B[0,0] contains the paths from vertices 0 and 1 to vertices 0 and 1
- B[0,3] contains the paths from vertices 0 and 1 to vertices 6 and 7
- B[2,0] contains the paths from vertices 4 and 5 to vertices 0 and 1
- ...
- B[2,3] contains the paths from vertices 4 and 5 to vertices 6 and 7
Fig. 2. Schematic representation of the paths represented by the different blocks in the 4 × 4 block matrix.
All the vertices represented by these four blocks are closely related. Looking only at matrix B, you can see that it is possible to find a path from 4 to 7 through vertex 0 by moving from vertex 4 to 0 and then from 0 to 7. You can then compare it with the existing path from 4 to 7 and update it to the shortest one if necessary (see Figure 3).
Figure 3. Schematic diagram of the path from vertex 4 to vertex 7 through vertex 0 on the block matrix B.
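Written as the familiar Floyd-Warshall relaxation (which we will see again in a moment), the update for this particular path is:

\(W[4,7]=\min(W[4,7],\ W[4,0]+W[0,7])\)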
If you remember the Floyd-Warshall algorithm, then these operations must sound very familiar as they are similar to the operations of that algorithm:
algorithm FloydWarshall(W) do
  for k = 0 to N - 1 do
    for i = 0 to N - 1 do
      for j = 0 to N - 1 do
        // Select the minimum between the existing path from i to j
        // and the new path from i to k to j,
        // and replace the existing path with it
        //
        W[i, j] = min(W[i, j], W[i, k] + W[k, j])
      end for
    end for
  end for
end algorithm

// where W - is a weight matrix of size N x N,
// min() - is a function that returns the smaller of its arguments
However, the Floyd-Warshall algorithm works with the matrix W (not with B). Fortunately, it can easily be rewritten into a procedure that accepts three blocks B1, B2 and B3 instead of the matrix W:
function Procedure(B1, B2, B3) do
  for k = 0 to L - 1 do
    for i = 0 to L - 1 do
      for j = 0 to L - 1 do
        B1[i, j] = min(B1[i, j], B2[i, k] + B3[k, j])
      end for
    end for
  end for
end function

// where B1, B2 and B3 - are blocks of the block matrix B,
// L - is the size of the blocks (B1, B2 and B3 are of size L x L),
// min() - is a function that returns the smaller of its arguments
If we call Procedure with B1 = B[2,3], B2 = B[2,0] and B3 = B[0,3], it will recalculate all paths from vertices 4,5 to vertices 6,7 through vertices 0,1.
The important point here is that while Procedure does recompute the paths from vertices 4,5 to vertices 6,7 (through 0 and 1), these paths are not guaranteed to be the shortest, because the recomputation relies on the existing paths stored in blocks B[2,0] and B[0,3].
Fortunately, before recomputing the paths in block B[2,3], we can recompute the paths in blocks B[2,0] and B[0,3] using... the same Procedure but with different input parameters (see Figure 4):
- To recalculate B[2,0], we call Procedure with B1 = B[2,0], B2 = B[2,0] and B3 = B[0,0]. This recalculates all paths from vertices 4,5 to vertices 0,1 through 0 and 1.
- To recalculate B[0,3], we call Procedure with B1 = B[0,3], B2 = B[0,0] and B3 = B[0,3]. This recalculates all paths from vertices 0,1 to vertices 6,7 through 0 and 1.
Fig. 4. Schematic representation of the 4 → 0 path through vertex 0 (left) and the 0 → 7 path through vertex 1 computed as part of the recomputation of the B[2,0] and B[0,3] blocks.
You may have noticed that the recalculation of blocks B[2,0] and B[0,3] relies on data from block B[0,0], just as the recalculation of block B[2,3] relies on blocks B[2,0] and B[0,3].
Luckily (again), before recalculating the paths in blocks B[2,0] and B[0,3], we can recalculate the paths in B[0,0] using... the same Procedure, but (again) with different input parameters (see Figure 5):
- To recalculate B[0,0], we call Procedure with B1 = B[0,0], B2 = B[0,0] and B3 = B[0,0]. This recalculates all paths from vertices 0,1 to vertices 0,1 through 0 and 1.
Figure 5. Schematic of the 0 → 1 path through vertex 0, computed as part of recalculating block B[0,0].
This may seem a bit surprising (recalculating a block using the block itself), but if you recall, we derived Procedure from the Floyd-Warshall algorithm, which means that when all input parameters are the same block, Procedure is completely analogous to the Floyd-Warshall algorithm and recalculates the paths within that block:
function Procedure(B, B, B) do
  for k = 0 to L - 1 do
    for i = 0 to L - 1 do
      for j = 0 to L - 1 do
        B[i, j] = min(B[i, j], B[i, k] + B[k, j])
      end for
    end for
  end for
end function
Taken together, the process of recomputing the path from vertices 4, 5 to vertices 6, 7 through vertices 0 and 1 is as follows:
- Call Procedure with B1 = B[0,0], B2 = B[0,0] and B3 = B[0,0] to compute all paths from vertices 0,1 to vertices 0,1 (represented by block B[0,0]) through vertices 0 and 1 (see Fig. 6a).
- Call Procedure with B1 = B[0,3], B2 = B[0,0] and B3 = B[0,3] to compute all paths from vertices 0,1 to vertices 6,7 (represented by block B[0,3]) through vertices 0 and 1 (see Fig. 6b).
- Call Procedure with B1 = B[2,0], B2 = B[2,0] and B3 = B[0,0] to compute all paths from vertices 4,5 to vertices 0,1 (represented by block B[2,0]) through vertices 0 and 1 (see Fig. 6c).
- Call Procedure with B1 = B[2,3], B2 = B[2,0] and B3 = B[0,3] to compute all paths from vertices 4,5 to vertices 6,7 (represented by block B[2,3]) through vertices 0 and 1 (see Fig. 6d).
Figure 6. Schematic representation of the 4 → 7 path through vertex 0. First the 0 → 1 path through 0 is computed (a), then the 0 → 7 path through vertex 1 (b), then the 4 → 0 path through 0 (c), and finally the 4 → 7 path through 0 (d).
In code, these steps can be represented by four consecutive calls to Procedure:
Procedure(B[0,0], B[0,0], B[0,0])
Procedure(B[0,3], B[0,0], B[0,3])
Procedure(B[2,0], B[2,0], B[0,0])
Procedure(B[2,3], B[2,0], B[0,3])
Interestingly, by tweaking the code above a bit, we can recalculate the paths from vertices 4,5 to 2,3 through 0 and 1 (all we need to do is replace B[0,3] with B[0,1]):
Procedure(B[0,0], B[0,0], B[0,0])
Procedure(B[0,1], B[0,0], B[0,1])
Procedure(B[2,0], B[2,0], B[0,0])
Procedure(B[2,1], B[2,0], B[0,1])
... and if we then replace B[0,1] with B[0,2]... I think you've got it. We can continue like this and calculate all the paths between all the vertices that pass through vertices 0 and 1 (see Figure 7):
Fig. 7. Schematic of recalculating all paths between all vertices in the matrix through vertices 0 and 1.
In the code, the implementation simply repeats the same steps for different blocks:
//
// Recalculate the "diagonal" block B[0,0]
//
Procedure(B[0,0], B[0,0], B[0,0])
//
// Recalculate the "horizontal" blocks B[0,1], B[0,2] and B[0,3]
//
for i = 1 to 4 do
  Procedure(B[0,i], B[0,0], B[0,i])
end
//
// Recalculate the "vertical" blocks B[1,0], B[2,0] and B[3,0]
//
for i = 1 to 4 do
  Procedure(B[i,0], B[i,0], B[0,0])
end
//
// Recalculate the "peripheral" blocks:
//   B[1,1], B[1,2], B[1,3],
//   B[2,1], B[2,2], B[2,3],
//   B[3,1], B[3,2], B[3,3]
//
for i = 1 to 4 do
  for j = 1 to 4 do
    Procedure(B[i,j], B[i,0], B[0,j])
  end
end
Plain and simple. This code recalculates all paths between all vertices passing through vertices 0 and 1.
To recompute the remaining vertex combinations, we need to... repeat the above algorithm, but starting from B[1,1], B[2,2] and B[3,3] as the "diagonal" blocks (see Figure 8):
Figure 8. Schematic diagram of recalculating all paths between all vertices through vertices (from left to right) 2 and 3, 4 and 5, 6 and 7.
In code, the complete blocking Floyd-Warshall algorithm looks surprisingly simple:
algorithm BlockedFloydWarshall(B) do
  // Iterate over all "diagonal" blocks
  //
  for m = 0 to M do
    //
    // Recalculate the "diagonal" block
    //
    Procedure(B[m,m], B[m,m], B[m,m])
    //
    // Recalculate the "horizontal" blocks
    //
    for i = 0 to M do
      // Here, we skip the "diagonal" block
      //
      if i != m then
        Procedure(B[m,i], B[m,m], B[m,i])
      end
    end
    //
    // Recalculate the "vertical" blocks
    //
    for i = 0 to M do
      // Here, we skip the "diagonal" block
      //
      if i != m then
        Procedure(B[i,m], B[i,m], B[m,m])
      end
    end
    //
    // Recalculate the "peripheral" blocks
    //
    for i = 0 to M do
      // Here, we skip all the "horizontal" blocks
      //
      if i != m then
        for j = 0 to M do
          // Here, we skip all the "vertical" blocks
          //
          if j != m then
            Procedure(B[i,j], B[i,m], B[m,j])
          end
        end
      end
    end
  end
end algorithm

// where B - is a block matrix of size M x M
The algorithm iterates through all the "diagonal" blocks and, in turn, recalculates all paths between all vertices that pass through the vertices of the current "diagonal" block.
Eventually, the algorithm recalculates all paths between all vertices through all vertices.
This is why both algorithms have the same space and time complexity of O(n²) and O(n³).
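A quick way to see why the time complexity is unchanged (my own summary, not a derivation from [2]): on each of the M "diagonal" iterations the algorithm recalculates every one of the M × M blocks exactly once, and each Procedure call performs L³ relaxations on L × L blocks, where n = M × L. The total number of relaxations is therefore

\(M \times M^{2} \times L^{3} = (M \times L)^{3} = n^{3}\)

which matches the classic Floyd-Warshall algorithm.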
Now that we are done with the theoretical part, it's time to implement it in C#.
Experimental environment
All experiments in this post were executed on a randomly generated complete graph of 4800 vertices.
The experimental machine is equipped with 2 Intel Xeon CPUs E5-2620 v4 (2.10GHz, 16 physical cores and 32 logical cores in total) and the operating system is Windows Server 2019.
We used a block size of 120 × 120 because that block size demonstrated the best performance characteristics on the experimental hardware.
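As a rough back-of-the-envelope check (my own estimate, not a claim from the original paper), a single 120 × 120 block of 32-bit integers occupies

\(120 \times 120 \times 4\ \text{bytes} \approx 56\ \text{KB}\)

so the three blocks touched by a single Procedure call take around 169 KB, which is small enough to stay in the lower cache levels while a block is being recalculated.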
Code is compiled and executed using .NET SDK 8.0 and .NET Runtime 8.0 (x64, AVX2). All code can be found in the GitHub repository.
All implementations of the Floyd-Warshall algorithm in this article are the same as those in the previous article (only the variable and method names are slightly different).
Basic implementation (aka "Baseline")
We start with the following C# implementation of the Procedure:
static void Procedure(
Span<int> B1,
Span<int> B2,
Span<int> B3,
int block_size)
{
for (var k = 0; k < block_size; ++k)
{
for (var i = 0; i < block_size; ++i)
{
for (var j = 0; j < block_size; ++j)
{
var distance = B2[i * block_size + k] + B3[k * block_size + j];
if (B1[i * block_size + j] > distance)
{
B1[i * block_size + j] = distance;
}
}
}
}
}
This code is an almost verbatim translation of the previously demonstrated pseudocode into C#. The only significant difference is that every input block is represented by a Span<T> in combination with a block size.
If you're confused about how we access the array using multiplication, don't worry, I'll explain it in a moment.
The algorithm itself is slightly different compared to the pseudo-code (we refer to this implementation as the "baseline"):
private static void Baseline(
int[] B,
int block_count,
int block_size)
{
var lineral_block_size = block_size * block_size;
var lineral_block_row_size = block_count * lineral_block_size;
for (var m = 0; m < block_count; ++m)
{
var offset_mm = m * lineral_block_row_size + m * lineral_block_size;
var mm = new Span<int>(B, offset_mm, lineral_block_size);
//
// Calculate the "diagonal" block
//
Procedure(mm, mm, mm, block_size);
//
// Calculate the "horizontal" and "vertical" blocks in the same loop
//
for (var i = 0; i < block_count; ++i)
{
if (i != m)
{
var offset_im = i * lineral_block_row_size + m * lineral_block_size;
var offset_mi = m * lineral_block_row_size + i * lineral_block_size;
var im = new Span<int>(B, offset_im, lineral_block_size);
var mi = new Span<int>(B, offset_mi, lineral_block_size);
Procedure(im, im, mm, block_size);
Procedure(mi, mm, mi, block_size);
}
}
//
// Calculate the "peripheral" blocks
//
for (var i = 0; i < block_count; ++i)
{
if (i != m)
{
var offset_im = i * lineral_block_row_size + m * lineral_block_size;
var im = new Span<int>(B, offset_im, lineral_block_size);
for (var j = 0; j < block_count; ++j)
{
if (j != m)
{
var offset_ij = i * lineral_block_row_size + j * lineral_block_size;
var offset_mj = m * lineral_block_row_size + j * lineral_block_size;
var ij = new Span<int>(B, offset_ij, lineral_block_size);
var mj = new Span<int>(B, offset_mj, lineral_block_size);
Procedure(ij, im, mj, block_size);
}
}
}
}
}
}
You must have noticed that both Baseline and Procedure compute a large number of offsets, which are then used to create Spans and to access values inside the Spans.
Here is how it works. Suppose we have an 8 × 8 square matrix W (see Figure 9a). In our minds, the matrix is a square, but in memory, we can represent the "square" as an array of 8 × 8 = 64 values (this is called a "row-major" or "linear" representation), where each value can be accessed using a simple formula:
\(i_{row} \times 8 + i_{column}\)
where:
- \(i_{row}\) - index of the row the element is in
- \(i_{column}\) - index of the column the element is in
- 8 - number of elements in a single row
That is exactly what we do in Procedure when accessing block values (with block_size in place of 8).
Figure 9. Linear representation in memory of matrix W (from the previous article) and matrix B from the current article. On the left, the rows of matrix W are placed sequentially in memory, forming a "row-major-order" representation. On the right, each block is placed sequentially in the linear representation of matrix B, forming a "row-major-order" representation of the blocks.
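As a minimal illustration of this row-major indexing (a standalone snippet with made-up values, not part of the algorithm's code):

// Row-major ("linear") representation of an 8 x 8 matrix W:
// the element at (row, column) lives at index row * 8 + column.
const int N = 8;
var W = new int[N * N];

// Write a value into W[4, 7]...
W[4 * N + 7] = 42;

// ...and read it back.
Console.WriteLine(W[4 * N + 7]); // prints 42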
When it comes to block matrices, we can apply exactly the same thinking. Suppose a 4 × 4 block matrix B, where each block is a 2 × 2 square matrix (see Figure 9b). We can represent this matrix as an array of (2 × 2) x (4 × 4) = 4 × 16 = 64 values (again), where each block is placed into the array in turn using the same logic. In this way, we can find the "starting position" of any block using the following formula:
\(i_{row} \times 16 + i_{column} \times 4\)
where:
- \(i_{row}\) - row index of the block
- \(i_{column}\) - column index of the block
- 16 - number of elements in a single row of blocks
- 4 - number of elements in a single block
As you can see, these 16 and 4 are calculated in Baseline as lineral_block_row_size and lineral_block_size.
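Here is a small sketch of the same offset arithmetic as a helper function (a hypothetical helper written for illustration; Baseline inlines these expressions directly):

// Offset of block B[block_row, block_column] in the "block-major" array,
// for a block matrix with block_count x block_count blocks of
// block_size x block_size values.
static int BlockOffset(int block_row, int block_column, int block_count, int block_size)
{
    var lineral_block_size = block_size * block_size;                // values per block
    var lineral_block_row_size = block_count * lineral_block_size;  // values per row of blocks
    return block_row * lineral_block_row_size + block_column * lineral_block_size;
}

// Example from Figure 9b: block_count = 4, block_size = 2,
// so block B[2,3] starts at 2 * 16 + 3 * 4 = 44.
Console.WriteLine(BlockOffset(2, 3, block_count: 4, block_size: 2)); // prints 44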
These operations are critical to the performance of the algorithm. Linear representations perform extremely well in terms of cache friendliness (especially if we access elements sequentially most of the time). They also allow us to access any block of the matrix B (and any value in any block) in constant time, which is also beneficial for performance.
Now let's see how the Baseline implementation performs.
Results
Below are the experimental results for the "Baseline" implementation:
Algorithm (Baseline) | Graph | Mean time (s) | Error (s) | StdDev (s) | LLC Miss / Op |
---|---|---|---|---|---|
Floyd-Warshall | 4800 | 172.881 | 2.0421 | 1.9101 | 6,903,951,087 |
Blocked Floyd-Warshall | 4800(120) | 200.075 | 0.0578 | 0.0540 | 59,423,676 |
Table 3. Experimental results of the "Baseline" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms.
Surprisingly, Blocked Floyd-Warshall is about 15% slower than Floyd-Warshall. This was not the expected result... or was it?
Consider the code for both algorithms, which is actually quite simple - read three values from memory, sum two values, compare two values, and write one value back.
Modern CPUs are heavily optimized for single-threaded execution. They have multi-stage pipelines and are able to process multiple operations simultaneously (in the absence of data dependencies). They can detect linear memory access patterns (which is why we use them in both algorithms) and automatically preload data from main memory into the cache (hardware prefetching).
All of these factors combine to lead to one result: an algorithm with cache-friendly memory access (if you look at the LLC Miss/Op column, you can see that Blocked Floyd-Warshall is much more cache-friendly) shows no performance gain and even behaves slower, because of the extra work it does to arrange the computation in the right way (e.g., Span instantiation, pushing values on the stack, extra loops, and so on).
We can run the same code under a profiler (e.g., Intel VTune) and clearly see that neither of the algorithms is "bound" (i.e., limited) by memory:
Algorithm (Baseline) | Clockticks | Instructions Retired | CPI Rate | Memory Bound |
---|---|---|---|---|
Floyd-Warshall | 364,956,900,000 | 1,771,270,200,000 | 0.206 | 3.0% |
Blocked Floyd-Warshall | 449,603,700,000 | 1,925,290,500,000 | 0.234 | 0.0% |
Table 4. Clockticks, Instructions Retired, CPI Rate, and Memory Bound parameters of the "Baseline" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms, as reported by the Intel VTune profiler.
I extracted four parameters (out of many):
- Clockticks - the number of CPU clock cycles spent executing our algorithm.
- Instructions Retired - the number of instructions processed by the CPU.
- CPI Rate - clockticks divided by instructions retired.
- Memory Bound - a high-level aggregate metric that indicates the impact of memory on execution (i.e., how much the CPU waits for data to be fetched from memory).
As you can see, the Blocked Floyd-Warshall algorithm retires slightly more instructions than Floyd-Warshall, which is expected because it does extra work to schedule the computation. Both have a low CPI rate, which means the CPU is able to process up to five instructions per "tick" (i.e., clock cycle).
These two observations, combined with the low Memory Bound values, basically tell us that the CPU is working at its maximum capacity without any stalls.
Next, let's see if we can make the Blocked Floyd-Warshall algorithm outperform Floyd-Warshall.
Vectorized implementation (aka "Vector")
The first step we can take to improve the performance of this purely CPU-bound algorithm is to make the CPU do more operations simultaneously. That's right, I'm talking about vectorization.
In the previous post, you have already seen a vectorized implementation of the Floyd-Warshall algorithm. If you haven't, don't worry, because the vectorized Procedure below is almost a complete copy of it:
static void Procedure(
  Span<int> B1,
  Span<int> B2,
  Span<int> B3,
  int block_size)
{
  for (var k = 0; k < block_size; ++k)
  {
    for (var i = 0; i < block_size; ++i)
    {
      // Create a vector of values from B2
      // (get the existing path "i -> k")
      var ik_vec = new Vector<int>(B2[i * block_size + k]);

      // Vectorized loop
      var j = 0;
      for (; j < block_size - Vector<int>.Count; j += Vector<int>.Count)
      {
        // Read a vector of values from B1
        // (get the existing paths "i -> j")
        var ij_vec = new Vector<int>(
          B1.Slice(i * block_size + j, Vector<int>.Count));

        // Read a vector of values from B3
        // (get the existing paths "k -> j")
        // and add the value from B2
        // (get the new paths "i -> k -> j")
        var ikj_vec = new Vector<int>(
          B3.Slice(k * block_size + j, Vector<int>.Count))
          + ik_vec;

        // Compare the vector of existing paths from B1
        // with the vector of new paths
        var lt_vec = Vector.LessThan(ij_vec, ikj_vec);
        if (lt_vec == new Vector<int>(-1))
        {
          continue;
        }

        // Copy all "i -> k -> j" paths that are smaller than
        // the existing "i -> j" paths back into block B1
        var r_vec = Vector.ConditionalSelect(lt_vec, ij_vec, ikj_vec);
        r_vec.CopyTo(B1.Slice(i * block_size + j, Vector<int>.Count));
      }

      // Non-vectorized loop
      for (; j < block_size; ++j)
      {
        var distance = B2[i * block_size + k] + B3[k * block_size + j];
        if (B1[i * block_size + j] > distance)
        {
          B1[i * block_size + j] = distance;
        }
      }
    }
  }
}
As before, we use the Vector and Vector<T> types to implement vectorization. The implementation consists of two loops: a vectorized one (for computing most of the paths) and a non-vectorized one (for computing the remaining paths when the size of the block is not divisible by the size of the vector). On the experimental hardware (AVX2), Vector<int> holds eight 32-bit values.
There is nothing to vectorize in the main algorithm itself, and since the interface of Procedure is unchanged, we can simply plug the new code in and run the experiment.
Results
The following are the experimental results of the vectorization implementation:
Algorithm (Vector) | Graph | Mean time (s) | Error (s) | StdDev (s) | LLC Miss / Op |
---|---|---|---|---|---|
Floyd-Warshall | 4800 | 73.251 | 0.4993 | 0.4671 | 6,353,509,854 |
Blocked Floyd-Warshall | 4800(120) | 64.792 | 0.0318 | 0.0282 | 50,703,019 |
Table 5. Experimental results of the "Vector" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms.
This time, Blocked Floyd-Warshall is about 12% faster than the Floyd-Warshall algorithm.
Running the same code under the analyzer also demonstrated different results:
Algorithm (Vector) | Clockticks | Instructions Retired | CPI Rate | Memory Bound |
---|---|---|---|---|
Floyd-Warshall | 161,668,500,000 | 490,026,600,000 | 0.330 | 6.8% |
Blocked Floyd-Warshall | 144,314,100,000 | 488,193,300,000 | 0.296 | 0.3% |
Table 6. Clockticks, Instructions Retired, CPI Rate, and Memory Bound parameters of the "Vector" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms, as reported by the Intel VTune profiler.
We can see that the number of retired instructions is significantly reduced for both algorithms (which is the expected result since vector instructions perform multiple operations at the same time and the CPU executes multiple vector instructions in each clock cycle).
We can also see a change in Memory Bound - it doubled for Floyd-Warshall, while for the Blocked Floyd-Warshall algorithm it remained almost unchanged.
Despite this, the CPI rate is still low, which again suggests that the CPU is barely stalling. Therefore, there is still room for improvement.
Next, let's see if we can apply parallelism here.
Parallel vectorized implementation (aka "Parallel Vector")
Parallelizing the Blocked Floyd-Warshall algorithm is a pure form of "task parallelism", where a "task" is the recomputation of a single block (i.e., a single call to Procedure).
The simplest way to implement it is to use messaging (i.e., custom messages), CountdownEvent and ThreadPool.
Below is the code of the custom message:
private readonly struct ParallelMessage
{
  // Countdown event used to signal when the message has been processed
  public readonly CountdownEvent Event;
  public readonly int[] Matrix;
  public readonly int OffsetB1;
  public readonly int OffsetB2;
  public readonly int OffsetB3;
  public readonly int BlockSize;
  public readonly int SpanLength;

  public ParallelMessage(
    CountdownEvent Event,
    int[] Matrix,
    int OffsetB1,
    int OffsetB2,
    int OffsetB3,
    int BlockSize,
    int SpanLength)
  {
    this.Event = Event;
    this.Matrix = Matrix;
    this.OffsetB1 = OffsetB1;
    this.OffsetB2 = OffsetB2;
    this.OffsetB3 = OffsetB3;
    this.BlockSize = BlockSize;
    this.SpanLength = SpanLength;
  }
}
The message is a read-only structure (to avoid allocations for performance reasons) that contains all the information needed to initialize the B1, B2 and B3 blocks and call Procedure.
ParallelMessage messages are processed by ParallelProcedure, which converts them into a call to Procedure:
static void ParallelProcedure(
  ParallelMessage message)
{
  var B1 = new Span<int>(message.Matrix, message.OffsetB1, message.SpanLength);
  var B2 = new Span<int>(message.Matrix, message.OffsetB2, message.SpanLength);
  var B3 = new Span<int>(message.Matrix, message.OffsetB3, message.SpanLength);

  Procedure(B1, B2, B3, message.BlockSize);

  // Signal that the message has been processed
  message.Event.Signal();
}
Besides the block information, ParallelMessage also includes a reference to a CountdownEvent, which ParallelProcedure signals when the computation is complete.
Here is the code of the main algorithm:
private static void ParallelVectorOptimisations(
int[] matrix,
int block_count,
int block_size)
{
var iteration_sync = new CountdownEvent(0);
var lineral_block_size = block_size * block_size;
var lineral_block_row_size = block_count * lineral_block_size;
for (var m = 0; m < block_count; ++m)
{
var offset_mm = m * lineral_block_row_size + m * lineral_block_size;
var mm = new Span<int>(matrix, offset_mm, lineral_block_size);
Procedure(mm, mm, mm, block_size);
// Calculate the number of signals we expect to receive
// (when all signals have arrived, the event itself becomes signaled).
//
// We expect one row of blocks...
// (the "horizontal" blocks, except the "diagonal" one)
//
// ...and one column of blocks
// (the "vertical" blocks, except the "diagonal" one)
//
// Hence, 2 * block_count - 2 = 2 * (block_count - 1)
//
iteration_sync.Reset(2 * (block_count - 1));
for (var i = 0; i < block_count; ++i)
{
if (i != m)
{
var offset_im = i * lineral_block_row_size + m * lineral_block_size;
var offset_mi = m * lineral_block_row_size + i * lineral_block_size;
var message_x = new ParallelMessage(
iteration_sync, matrix, offset_im, offset_im, offset_mm,
block_size, lineral_block_size);
var message_y = new ParallelMessage(
iteration_sync, matrix, offset_mi, offset_mm, offset_mi,
block_size, lineral_block_size);
ThreadPool.QueueUserWorkItem(ParallelProcedure, message_x, false);
ThreadPool.QueueUserWorkItem(ParallelProcedure, message_y, false);
}
}
// Wait for all "horizontal" and "vertical" blocks to be recalculated
// before moving to the next step
iteration_sync.Wait();
// Now we expect all blocks except one row and one column
// to be recalculated. This means that if we have a 4 x 4 block matrix,
// we expect to recalculate 3 x 3 blocks
//
// Hence, (block_count - 1) * (block_count - 1)
//
iteration_sync.Reset((block_count - 1) * (block_count - 1));
for (var i = 0; i < block_count; ++i)
{
if (i != m)
{
var offset_im = i * lineral_block_row_size + m * lineral_block_size;
for (var j = 0; j < block_count; ++j)
{
if (j != m)
{
var offset_ij = i * lineral_block_row_size + j * lineral_block_size;
var offset_mj = m * lineral_block_row_size + j * lineral_block_size;
var message = new ParallelMessage(
iteration_sync, matrix, offset_ij, offset_im, offset_mj,
block_size, lineral_block_size);
ThreadPool.QueueUserWorkItem(ParallelProcedure, message, false);
}
}
}
}
// Wait for all "peripheral" blocks to be recalculated
iteration_sync.Wait();
}
}
The implementation is straightforward (except, perhaps, the use of CountdownEvent, which is a .NET concurrency primitive).
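If you haven't used CountdownEvent before, here is a minimal sketch of the pattern applied above (a standalone toy example, not part of the algorithm): the event is reset to the number of expected work items, every worker signals it once, and the producer waits until the count reaches zero.

using System.Threading;

// Toy example of the CountdownEvent pattern used above.
var done = new CountdownEvent(0);

// Expect 4 work items in this "iteration".
done.Reset(4);

for (var i = 0; i < 4; ++i)
{
    ThreadPool.QueueUserWorkItem(_ =>
    {
        // ... do some work here ...
        done.Signal(); // one work item finished
    });
}

// Block until all 4 work items have signaled.
done.Wait();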
Let's see how the combination of vectorization and parallelism affects performance.
Results
The following are the results of the implementation:
Algorithm (Parallel Vector) | Graph | Mean time (s) | Error (s) | StdDev (s) | LLC Miss / Op |
---|---|---|---|---|---|
Floyd-Warshall | 4800 | 32.311 | 0.0542 | 0.0480 | 4,277,045,385 |
Blocked Floyd-Warshall | 4800(120) | 3.396 | 0.0014 | 0.0013 | 90,435,311 |
Table 7. Experimental results of the "Parallel Vector" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms.
This is not a mistake. Introducing parallelism into the Blocked Floyd-Warshall algorithm improves its performance by a factor of nearly 20, while the speedup of Floyd-Warshall is much more modest, roughly a factor of two.
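The speedup factors follow directly from Tables 5 and 7: \(64.792 / 3.396 \approx 19\) for Blocked Floyd-Warshall versus \(73.251 / 32.311 \approx 2.3\) for Floyd-Warshall.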
Running the code under a performance analysis tool reveals why:
Algorithm (Parallel Vector) | Clockticks | Instructions Retired | CPI Rate | Memory Bound |
---|---|---|---|---|
Floyd-Warshall | 2,038,598,100,000 | 606,769,800,000 | 3.360 | 78.9% |
Blocked Floyd-Warshall | 254,444,400,000 | 483,594,300,000 | 0.526 | 0.0% |
Table 8. Clockticks, Instructions Retired, CPI Rate, and Memory Bound parameters of the "Parallel Vector" implementations of the Floyd-Warshall and Blocked Floyd-Warshall algorithms, as reported by the Intel VTune profiler.
A Memory Bound of 78.9% basically means that the CPU spends most of its time waiting for data from memory rather than doing any computation.
This may seem unexpected, since the statistics of the two previous algorithms were very close. But don't forget one important part - modern CPUs are excellent at single-threaded execution, but very different when it comes to multi-threading.
Look at the results of the parallel (non-vectorized) version of the Floyd-Warshall algorithm:
Algorithm (Parallel) | Graph | Mean time (s) | Error (s) | StdDev (s) | LLC Miss / Op |
---|---|---|---|---|---|
Floyd-Warshall | 4800 | 27.629 | 0.0101 | 0.0094 | 4,571,752,038 |
Table 9. Experimental results of the "Parallel" implementation of the Floyd-Warshall algorithm.
It is faster than the vectorized version (27.629 seconds vs 32.311 seconds) because it puts less pressure on memory:
Algorithm (Parallel) | Clockticks | Instructions Retired | CPI Rate | Memory Bound |
---|---|---|---|---|
Floyd-Warshall | 1,907,472,000,000 | 2,449,542,900,000 | 0.779 | 30.5% |
Table 10. Clockticks, Instructions Retired, CPI Rate, and Memory Bound parameters of the "Parallel" implementation of the Floyd-Warshall algorithm, as reported by the Intel VTune profiler.
The Floyd-Warshall algorithm is memory bound and has limited scalability (even using more cores would only result in minor speedups). The same does not apply to Blocked Floyd-Warshall, which can actually be improved further, since it is currently neither CPU nor memory bound. But that is a completely different story.
Conclusion
In this post, we implemented another algorithm for solving the all-pairs shortest path problem. Along the way, we learned the basics of CPU caching, revisited linear memory access and vectorization, and touched on the basic concurrency capabilities of .NET.
We also witnessed the efficiency of cache-aware (i.e., cache-friendly) algorithms, especially when executed in parallel.
I hope you enjoyed this article and thank you for reading to the end.
References
- You can find the latency values mentioned above, and more information, on the 7-Zip LZMA Benchmark page.
- Venkataraman, G., Sahni, S., Mukhopadhyaya, S. "A Blocked All-Pairs Shortest Paths Algorithm" // Journal of Experimental Algorithmics (JEA), Vol. 8, 2003, pp. 857–874.
- Karasik, O. N., and A. A. Prihozhy. "Tuning Block-Parallel All-Pairs Shortest Path Algorithm For Efficient Multi-Core Implementation." Systems Analysis and Applied Informatics 3 (2022): 57-65.
Original link
/2024/09/26/implementing-blocked-floyd-warshall-algorithm-for-solving-all-pairs-shortest-path-problem-in-c/