
Reading Papers with Awesome-Graphs: Interpreting PowerGraph


Paper: 《PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs》

In the previous article, "The Paper Graph: Awesome-Graphs Takes a Sample with 200 Graph System Papers", I introduced the paper-graphing project Awesome-Graphs and began unpacking key graph computing system papers starting from Google's Pregel. This time I would like to share PowerGraph, a classic graph computation framework paper published at OSDI 2012. It aims to solve the computational skew caused by the power-law degree distribution of natural graph data through vertex cuts (point slicing), and proposes the edge-centric GAS computation model, in contrast to Pregel's vertex-centric (VC) model.

If you are interested in graph computing technology, feel free to dig deeper, and you are very welcome to follow and contribute to the paper-graph open source projects:

  • Awesome-Graphs:/TuGraph-family/Awesome-Graphs
  • OSGraph:/TuGraph-family/OSGraph

Thanks in advance to those of you who gave the project a Star, and let's get right into the meat of it!

Abstract

Distributed graph computation systems such as Pregel and GraphLab have significant limitations in performance and scalability when processing natural graphs, whose vertex degrees follow a power-law distribution. PowerGraph proposes a new graph partitioning approach (vertex cuts) to address this problem.

1. Introduction

Pregel and GraphLab describe graph computation with point programs (vertex-programs) that run in parallel and interact along edges. Such an abstraction relies on points having few neighbors to obtain high parallelism, and on effective graph partitioning to keep communication low. In the real world, however, graph degrees generally follow a power-law distribution, meaning that a small fraction of points have extremely high degree, which makes traditional graph partitioning much harder.

For computation on natural graphs, PowerGraph proposes a new computation model with higher parallelism, lower communication and storage cost, and more efficient graph partitioning strategies. Key contributions are:

  • An analysis of the challenges and limitations of existing graph-parallel abstractions on power-law graphs.
  • The PowerGraph graph computation abstraction.
  • An incremental caching mechanism (Delta Caching) that allows the graph state to be updated dynamically.
  • Efficient vertex-cut partitioning methods for power-law graphs.
  • Theoretical characterization of network and storage costs.
  • A high-performance open source implementation of PowerGraph.
  • An evaluation of three MLDM algorithms implemented on PowerGraph on EC2.

2. Graph Parallel Abstraction

A graph-parallel abstraction consists of:

  • A sparse graph: G = {V, E}.
  • One point program: Q.
  • Q can execute in parallel on any v ∈ V and can interact with Q(u) on v's neighbor u, where (u, v) ∈ E.
  • GraphLab uses shared state for interaction, Pregel uses message passing for interaction.

2.1 Pregel

A bulk-synchronous message-passing abstraction.

BSP, message passing, supersteps, barriers, termination condition, combiners

2.2 GraphLab

Asynchronous Distributed Shared Memory Abstraction.

Shared-state access, distributed graph storage, serializability to avoid concurrent neighbor updates, elimination of message passing, decoupling of user algorithms from data movement, point data (shared with all neighbors), edge data (shared with a specific neighbor)

2.3 PowerGraph

The GAS computation model:

  • Gather phase: the values of neighboring points and edges are collected and combined with a generalized sum: \(\Sigma \leftarrow \bigoplus_{v \in \mathbf{Nbr}[u]} g(D_u, D_{(u,v)}, D_v)\)
  • Apply phase: the summed result is used to update the value of the central point: \(D_u^{new} \leftarrow a(D_u, \Sigma)\)
  • Scatter phase: the neighboring edge values are updated with the new point value: \(\forall v \in \mathbf{Nbr}[u]: D_{(u,v)} \leftarrow s(D_u^{new}, D_{(u,v)}, D_v)\)

Gather and Scatter correspond to the inputs and outputs of a point program. In PageRank, for example, Gather operates only on in-edges and Scatter only on out-edges, but in many MLDM algorithms Gather and Scatter operate over all neighboring edges.

3. Challenges of Natural Graphs

The degrees of natural graphs follow a power-law distribution: \(\mathbf{P}(d) \propto d^{-\alpha}\), with \(\alpha \approx 2\).

  • The larger α is, the lower the density of the graph (edges per point), and most points are of low degree.
  • The smaller α is, the higher the density of the graph and the more high-degree points there are (the more skewed the graph is).


Challenges that the high-degree hotspots of natural graphs pose to existing graph computing systems:

  • Worker load is not balanced.
  • Partitioning lacks locality.
  • Performance bottlenecks due to asymmetry in communication.
  • The storage required for a point is proportional to its degree and can exceed the memory capacity of a single machine.
  • Computation cannot be parallelized and scalability is limited.

4. PowerGraph abstraction

PowerGraph combines the design advantages of Pregel and GraphLab. From GraphLab it borrows the ideas of the data graph and shared state, freeing users from having to design the movement of information themselves; from Pregel it borrows the concept of commutative, associative combiners (independent of how the graph is partitioned).

4.1 GAS point program

GAS stateless interface design:
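As a rough illustration of what such a stateless interface might look like, here is a Python rendering of the paper's C++-style interface; the names and signatures are assumptions of this sketch, not the actual PowerGraph API:

```python
from abc import ABC, abstractmethod

class GASVertexProgram(ABC):
    """Sketch of a stateless GAS point program: all graph state (D_u, D_(u,v), D_v)
    is passed in as arguments, and results are handed back to the engine."""

    @abstractmethod
    def gather_nbrs(self):
        """Edge set to gather over: 'none', 'in', 'out', or 'all'."""

    @abstractmethod
    def gather(self, d_u, d_uv, d_v):
        """Map step of Gather: one neighbor's contribution."""

    @abstractmethod
    def sum(self, left, right):
        """Reduce step of Gather: must be commutative and associative."""

    @abstractmethod
    def apply(self, d_u, acc):
        """Compute the new central point value D_u from the accumulator a_u."""

    @abstractmethod
    def scatter_nbrs(self):
        """Edge set to scatter over: 'none', 'in', 'out', or 'all'."""

    @abstractmethod
    def scatter(self, d_u, d_uv, d_v):
        """Update one adjacent edge; returns (new_edge_value, delta), where delta
        is an optional correction to the neighbor's cached accumulator a_v."""
```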

The computational flow of GAS:

The gather and sum functions of the Gather phase work like a map + reduce over the neighbors. The gather function is executed in parallel on u's neighboring edges (fan-in; gather_nbrs can be none, in, out, or all). The sum function must be commutative and associative, and the result of the reduction is stored as the accumulator \(a_u\) of point u, which is kept in cached form.

The apply function of the Apply phase computes the new point value \(D_u\) from the accumulator \(a_u\) and automatically writes it back to the graph state. The size of \(a_u\) and the complexity of the apply function determine network and storage efficiency, so both should be sublinear (ideally constant) in the degree of the point.

The scatter function of the Scatter phase executes in parallel on u's neighboring edges (fan-out; scatter_nbrs can be none, in, out, or all) and produces new edge values \(D_{(u,v)}\). The scatter function has an optional return value \(\Delta a\): if \(\Delta a\) is non-empty and the neighboring point's cached accumulator \(a_v\) exists, the sum function is used to update \(a_v\); otherwise the cached \(a_v\) is invalidated.
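Putting the three phases together, here is a minimal single-threaded sketch of how an engine might drive one activated point u through Gather-Apply-Scatter; the `graph`, `acc_cache`, and `edges_of` names are assumptions of this sketch, not PowerGraph internals:

```python
def execute_vertex(engine, prog, u):
    """Run one GAS point program on point u (greatly simplified)."""
    graph = engine.graph

    # Gather: map over u's gather neighbors, reduce with sum() into the accumulator a_u.
    if u in engine.acc_cache:                          # delta caching: skip the Gather phase
        a_u = engine.acc_cache[u]
    else:
        a_u = None
        for v, e in graph.edges_of(u, prog.gather_nbrs()):
            g = prog.gather(graph.vertex[u], graph.edge[e], graph.vertex[v])
            a_u = g if a_u is None else prog.sum(a_u, g)
        engine.acc_cache[u] = a_u                      # keep a_u in cached form

    # Apply: compute the new point value and write it back to the graph state.
    graph.vertex[u] = prog.apply(graph.vertex[u], a_u)

    # Scatter: update adjacent edges; optionally correct neighbors' cached accumulators.
    for v, e in graph.edges_of(u, prog.scatter_nbrs()):
        new_edge, delta = prog.scatter(graph.vertex[u], graph.edge[e], graph.vertex[v])
        graph.edge[e] = new_edge
        if delta is not None and v in engine.acc_cache:
            engine.acc_cache[v] = prog.sum(engine.acc_cache[v], delta)
        else:
            engine.acc_cache.pop(v, None)              # invalidate the stale cache entry
```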

Algorithm examples (a code sketch of the PageRank point program follows this list):

  • PageRank
    • #outNbrs(v) denotes the out-degree of v, which should be understood here as a graph feature that can be aggregated in advance.
    • The 0.15 in apply is not divided by the total number of points |V|, so this is a non-strict variant of PageRank: \(\mathbf{Rank}(u) = \frac{1-q}{|V|} + q \sum_{v \in \mathbf{Nbr}[u]}\frac{\mathbf{Rank}(v)}{\mathbf{\#outNbrs}(v)}, q=0.85\)
    • The #outNbrs(u) term changes the convergence semantics: from "the point's rank no longer changes" to "the rank the point propagates to its out-neighbors no longer changes". This does not affect convergence of the algorithm, but the result is sensitive to the precision threshold \(\epsilon\).
    • The scatter return value is the delta, i.e., the increment to the rank contribution propagated to the out-neighbors, which can be summed directly into \(a_v\).
  • Greedy coloring
    • The min c in apply should be understood as selecting the color c with the smallest id among all candidate colors not in the set S of colors already used by its neighbors.
    • The scatter function simply returns null, since the graph-coloring state cannot be updated incrementally and must instead be overwritten.
  • SSSP
    • changed(Du) in scatter: \(D_u^{new} \neq D_u^{old}\).
    • increased(Du) in scatter: \(D_u^{new} > D_u^{old}\).
    • Activate(v) is only needed when Du becomes smaller, and only then should the delta Du + D(u,v) be returned; the pseudocode is not fully rigorous here.
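To tie the PageRank example above to the GAS interface, here is a hedged sketch reusing the `GASVertexProgram` class from 4.1. The dict-based point layout (`rank`, pre-aggregated `out_degree`, `last_change`) and the `activate` stub are assumptions of this sketch, not the paper's exact code:

```python
def activate(v):
    """Placeholder for the engine's Activate(v) primitive (see 4.3)."""

class PageRank(GASVertexProgram):
    EPS = 1e-4                            # convergence tolerance (epsilon in the text)

    def gather_nbrs(self):
        return 'in'                       # gather over in-edges only

    def gather(self, d_u, d_uv, d_v):
        # Contribution of in-neighbor v: Rank(v) / #outNbrs(v).
        return d_v['rank'] / d_v['out_degree']

    def sum(self, left, right):
        return left + right               # commutative and associative

    def apply(self, d_u, acc):
        new = dict(d_u)
        new['rank'] = 0.15 + 0.85 * (acc or 0.0)  # non-strict PageRank: 0.15 not divided by |V|
        new['last_change'] = new['rank'] - d_u['rank']
        return new

    def scatter_nbrs(self):
        return 'out'                      # scatter over out-edges only

    def scatter(self, d_u, d_uv, d_v):
        if abs(d_u['last_change']) > self.EPS:
            activate(d_v)                 # re-activate the out-neighbor
        # Delta: the increment to v's accumulator a_v caused by u's rank change.
        delta = d_u['last_change'] / d_u['out_degree']
        return d_uv, delta
```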

4.2 Incremental Caching

A GAS point program is usually triggered by updates from only a few neighbors, yet each activation gathers data from all neighbors, wasting computational resources. By maintaining a cache of the accumulator \(a_u\) for each point, the Gather phase can be skipped in subsequent activations.

The incremental value \(\Delta a\) is effectively a correction to the previous round's Gather result. This generally requires the accumulator type of the Gather phase to form an abelian group (a commutative, associative addition with an additive inverse), in which case:
\(\Delta a = g(D_u, D_{(u, v)}^{new}, D_v^{new}) - g(D_u, D_{(u, v)}, D_v)\)
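For instance, with the PageRank gather function \(g(D_u, D_{(u,v)}, D_v) = \mathbf{Rank}(v) / \mathbf{\#outNbrs}(v)\), the correction that a neighbor v's scatter contributes to u's cached accumulator reduces to:
\(\Delta a = \frac{\mathbf{Rank}^{new}(v) - \mathbf{Rank}^{old}(v)}{\mathbf{\#outNbrs}(v)}\)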

4.3 Activating Future Computation

A point program triggers future computation by calling Activate(v) or Activate_all(), and is restricted to activating itself or its neighboring points. This restriction ensures that activation events can be handled efficiently and leaves flexibility for both synchronous and asynchronous execution.

4.3.1 Synchronous Execution

Features:

  • gather, apply, and scatter are executed sequentially, with each stage called a minor-step.
  • All minor-steps of GAS constitute super-steps.
  • Changes made to points and edges by a minor-step are committed at its end and only become visible to subsequent minor-steps.
  • The points activated in the previous super-step will be executed in the next super-step.
  • The Pregel-like approach ensures deterministic execution.
  • Synchronous execution can be inefficient, and algorithm convergence is slow.
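A minimal sketch of this synchronous schedule (not PowerGraph's engine; `gather_all` and `scatter_all` are hypothetical helpers wrapping the per-edge loops sketched in 4.1, and `scatter_all` is assumed to return the set of points it activates):

```python
def run_sync(engine, prog, active):
    """Synchronous GAS execution: each super-step runs the gather, apply and
    scatter minor-steps over all active points, with a barrier between them."""
    superstep = 0
    while active:
        # Gather minor-step: accumulators for every active point (reads committed state).
        acc = {u: engine.gather_all(prog, u) for u in active}

        # Apply minor-step: commit new point values only after all gathers are finished.
        for u in active:
            engine.graph.vertex[u] = prog.apply(engine.graph.vertex[u], acc[u])

        # Scatter minor-step: update edges; points activated here run in the NEXT super-step.
        next_active = set()
        for u in active:
            next_active |= engine.scatter_all(prog, u)

        active = next_active
        superstep += 1
```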

4.3.2 Asynchronous execution

Features:

  • Activated points are executed as soon as processor and network resources are sufficient.
  • Modifications made to point edges by apply/scatter are immediately updated to the graph and are visible to subsequently computed neighbors.
  • Makes better use of resources and can accelerate algorithm convergence.
  • Uncertainty in execution leads to unstable and even divergent algorithmic results.

To address the non-determinism of asynchronous execution, GraphLab uses a serializability mechanism: concurrent execution of neighboring point programs is prevented through a fine-grained locking protocol that requires neighboring points to acquire locks sequentially, which is unfair to high-degree points. PowerGraph therefore proposes a parallel locking protocol to solve this problem.

4.4 Comparison of GraphLab/Pregel

  • Emulating a GraphLab point program: gather and sum concatenate the data on neighboring points and edges, and apply then runs the GraphLab program.
  • Emulating a Pregel point program: gather and sum collect the inbound messages and merge the list of neighbors needed to compute the outbound messages; apply then generates the new set of messages, which scatter sends out (a sketch follows).
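As a rough illustration of the Pregel emulation, here is a hedged sketch reusing the `GASVertexProgram` interface from 4.1; `pregel_compute` is a hypothetical user function, messages are assumed to be stored on edge values, and the outbound message is simplified to a single broadcast value:

```python
def pregel_compute(vertex_value, messages):
    """Placeholder for the user's Pregel compute(); returns (new_value, out_msg)."""
    raise NotImplementedError

class PregelEmulation(GASVertexProgram):
    def gather_nbrs(self):
        return 'in'

    def gather(self, d_u, d_uv, d_v):
        # Each in-edge carries at most one pending message from the previous super-step.
        return [d_uv['msg']] if d_uv.get('msg') is not None else []

    def sum(self, left, right):
        return left + right               # concatenate message lists (order must not matter)

    def apply(self, d_u, messages):
        new_value, out_msg = pregel_compute(d_u, messages or [])
        new_value = dict(new_value)
        new_value['out_msg'] = out_msg    # stash the outbound message for the Scatter phase
        return new_value

    def scatter_nbrs(self):
        return 'out'

    def scatter(self, d_u, d_uv, d_v):
        new_edge = dict(d_uv)
        new_edge['msg'] = d_u['out_msg']  # deliver the message along the out-edge
        return new_edge, None             # no delta: the neighbor's cached a_v is invalidated
```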

5. Distributed Graph Partitioning


Edge cuts incur more storage and network overhead because an extra copy of neighbor information (edges plus ghost points) must be maintained.


Normalization constants for power-law Zipf distributions:
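For a Zipf degree distribution \(\mathbf{P}(d) \propto d^{-\alpha}\) over \(d \in [1, |V|-1]\), the normalizing constant is \(h_{|V|}(\alpha) = \sum_{d=1}^{|V|-1} d^{-\alpha}\). As a quick aside (a standard balls-into-bins fact, not quoted from the paper), randomly hashing a point's d edges across p machines replicates that point to
\(\mathbb{E}\big[\,|A(v)|\;\big|\;D[v]=d\,\big] = p\Big(1 - \big(1 - \tfrac{1}{p}\big)^{d}\Big)\)
machines in expectation, which is how the degree distribution enters the replication-factor analysis of random vertex cuts.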

5.1 Balanced p-way Vertex Cut

The balanced p-way vertex cut is formulated as the following optimization problem (written out after the symbol definitions):

  • p: number of machines.
  • A(v): the set of machine partitions to which point v is assigned, a subset of {1, ..., p}.
  • A(e): the machine partition to which edge e is assigned, taking a value in [1, p].
  • λ: imbalance factor, a small constant not less than 1.
  • Objective (min): minimize the average number of machine partitions each point is replicated to, reducing storage and network overhead.
  • Balance constraint (max): ensure that edges are distributed as evenly as possible across machine partitions.
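Written out, the objective and the balance constraint take roughly the following form (a reconstruction from the symbol definitions above, not a verbatim quotation of the paper):
\(\min_{A} \frac{1}{|V|} \sum_{v \in V} |A(v)| \quad \text{s.t.} \quad \max_{m} \big|\{e \in E \mid A(e) = m\}\big| < \lambda \frac{|E|}{p}\)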

On a power-law graph, the replication factor of a random vertex cut is primarily determined by α:

  • The smaller α and the larger the number of machines, the higher the replication factor.
  • The smaller α and the smaller the number of machines, the more pronounced the advantage of vertex cuts.

Given an edge cut that produces g ghost points, there exists a vertex cut on the same partition boundary whose number of mirror points is strictly less than g.

5.2 Greedy Vertex Cut Algorithm

The greedy vertex cut is formulated as the following optimization problem: when placing a new edge, minimize the increase in the replication factor.

  • Ai: the assignment of the first i edges to machine partitions.
  • A(ei+1): the machine partition to be chosen for the (i+1)-th edge.
  • k: the machine k that minimizes the expected replication factor (|V| is a constant).

From this, the edge placement strategy for an edge e(u, v) is derived (a code sketch follows the list):

  • Case 1: A(u) and A(v) share a machine; e(u, v) is assigned to that machine.
  • Case 2: A(u) and A(v) are both non-empty but share no machine; e(u, v) is assigned to the machine among them with the fewest edges.
  • Case 3: only one of A(u), A(v) has been assigned; e(u, v) is assigned to a machine of the already-assigned point.
  • Case 4: neither A(u) nor A(v) has been assigned; e(u, v) is assigned to the least loaded machine.
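A hedged Python sketch of this placement rule (a simplified reading: ties and the exact load metric are treated more carefully in the paper; `assignments` maps each point to the set of machines it already appears on, and `edge_counts` tracks edges per machine):

```python
from collections import defaultdict

def place_edge(u, v, assignments, edge_counts):
    """Greedy vertex-cut placement of edge (u, v), following the four cases above."""
    A_u, A_v = assignments[u], assignments[v]
    common = A_u & A_v

    if common:                                           # case 1: a shared machine exists
        target = min(common, key=lambda m: edge_counts[m])
    elif A_u and A_v:                                    # case 2: both assigned, no overlap
        target = min(A_u | A_v, key=lambda m: edge_counts[m])
    elif A_u or A_v:                                     # case 3: exactly one endpoint assigned
        target = min(A_u or A_v, key=lambda m: edge_counts[m])
    else:                                                # case 4: neither endpoint assigned
        target = min(edge_counts, key=edge_counts.get)   # least loaded machine

    assignments[u].add(target)
    assignments[v].add(target)
    edge_counts[target] += 1
    return target

# Toy usage with 4 machines and a handful of edges.
assignments = defaultdict(set)
edge_counts = {m: 0 for m in range(4)}
for u, v in [(1, 2), (2, 3), (1, 3), (4, 5)]:
    place_edge(u, v, assignments, edge_counts)
```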

The greedy heuristic is a de-randomization of random placement and therefore requires coordination across machines, for which there are two distributed implementations:

  • Coordinated: maintains a distributed table of Ai(v) values. All machines update this distributed table periodically, maintaining their own caches.
  • Oblivious: each machine runs the greedy heuristic independently, maintaining its own Ai(v) values without sharing data.

  • On different datasets, Coordinated has the lowest replication factor, followed by Oblivious and finally Random.
  • On different graph algorithms, Coordinated has the least running time, followed by Oblivious and finally Random.

  • On the Twitter dataset, the replication factor increases as more machines are added, with Coordinated growing the slowest, followed by Oblivious and finally Random. The replication factor model proposed in the paper closely matches the Random algorithm.
  • On the Twitter dataset, graph construction (ingress) time is highest for Coordinated, followed by Oblivious, and finally Random; construction time decreases as more machines are added.

6. Comparison of Abstractions

Experimental setup:

  • Five synthetic power-law graphs with α from 1.8 to 2.2.
  • The fan-in graph is constructed by Zipf sampling of in-degrees; the graph is then inverted to obtain the fan-out graph.
  • The PageRank algorithm is executed.
  • Systems: GraphLab (v1), Pregel (Piccolo; Giraph ran out of memory), PowerGraph.
  • Hardware: 8 machines, each with an 8-core Intel Xeon E5620 and 32 GB RAM, 1 Gbps network.
  • GraphLab and Piccolo use random edge cuts; PowerGraph uses random vertex cuts.

6.1 Computation Imbalance

  • The computational imbalance is measured by the standard deviation of the iteration time.
  • GraphLab's iterative standard deviation becomes larger when there are more fan-in edges (GraphLab loads more neighbors).
  • Pregel's iterative standard deviation becomes larger when there are more fan-out edges (Pregel has to send more messages).
  • PowerGraph distributes edges uniformly, so it is less affected.

6.2 Communication Imbalance

  • Edge-cut communication volume is related to the ghost count; vertex-cut communication volume is related to the mirror count.
  • Pregel needs to send messages along out-edges, so its communication is high on the fan-out graph.
  • GraphLab and PowerGraph do not consider edge direction when synchronizing data, so the amount of communication is essentially unchanged.
  • PowerGraph has the least amount of traffic thanks to efficient point slicing.

6.3 Runtime Comparison

  • The overall iteration running time correlates well with the communication volume and less with the computation, because PageRank's computation is light.
  • Using a greedy partitioning strategy can boost performance by another 25%-50%.

7. Implementation and Evaluation

Analyzed using three PowerGraph implementations:

  1. Bulk Synchronous (Sync): a synchronous implementation.
  2. Asynchronous (Async): Asynchronous implementation.
  3. Asynchronous Serializable (Async+S): asynchronous + serialized implementation.

7.1 Graph Loading and Partitioning

The experiments load data files from HDFS in a distributed manner and use the Oblivious algorithm for graph partitioning by default. The greedy heuristic partitioning strategy significantly reduces execution time and memory consumption for all algorithms. Running time and replication factor are positively correlated.

7.2 Synchronous Engine

  • PowerGraph is 3-8x the performance of Spark.
  • Despite the loading cost of the greedy heuristic partitioning algorithm, it still brings the task greater performance improvement and communication reduction.
  • Delta Caching: 45% reduction in runtime by avoiding unnecessary gather.
  • Weak scaling (in the spirit of Gustafson's law): the problem size per machine is kept constant as machines are added, and the result is close to ideal (a 6.4B-edge graph processed in about 65s); this measures horizontal scalability, with the ideal curve derived from the law.

7.3 Asynchronous engines

  • PowerGraph uses state machines to manage point states: INACTIVE, GATHER, APPLY, SCATTER.
  • As the number of partitions increases, the task throughput (number of point program operations/second) steadily increases.
  • Delta Caching lets the algorithm converge quickly; with Delta Caching turned off, throughput keeps rising (the compute-to-communication ratio increases as computation concentrates on high-degree points).
  • Running the graph coloring algorithm using a synchronous approach causes the algorithm to fail to converge. (Each iteration is updating the same color min c)

7.4 Asynchronous Serializable Engine

  • Serializability is achieved by prohibiting the simultaneous execution of neighboring point programs.
  • Ensuring serializability of graph-parallel computation is equivalent to solving the dining philosophers problem: points = philosophers; edges = forks. (GraphLab uses Dijkstra's scheme, PowerGraph uses Chandy-Misra's scheme.)
  • Under serializability, parallelism does not grow linearly with the number of points, because on a power-law graph the density (edges per point) grows super-linearly.
  • For the graph coloring problem, the asynchronous engine satisfies the coloring condition quickly, but the last 1% of edges take 34% of the time (a long tail caused by contention on high-degree points); asynchronous serializable execution is comparatively more uniform. The asynchronous engine also performs 2x more operations.

  • For the ALS algorithm (Alternating Least Squares), the asynchronous engine has higher throughput, while the asynchronous serializable engine converges faster.

7.5 Fault Tolerance

  • The synchronous engine saves snapshots between super-steps.
  • The asynchronous engine first suspends execution, then saves a snapshot using GraphLab's snapshot algorithm.

7.6 MLDM Applications

Conclusion

  • Hypergraphs: the vertex cut problem can be converted into a hypergraph cut problem by turning edges into hypergraph vertices and points into hyperedges. But hypergraph partitioning is time-consuming, while the goal favored here is reducing communication volume.
  • Streaming vertex cuts and streaming edge cuts.
  • GraphChi: out-of-core (external memory) computation ideas could be borrowed.
  • Dynamic graph computation: exploring time-evolving graph structures.