Dynamic branch prediction is a mechanism for predicting future branches by recording and analyzing historical information about the branching behavior of a program as it runs. This technique is intended to improve the efficiency of the processor pipeline and reduce pipeline stalls caused by branching instructions. The method you mentioned of determining branching behavior by looking up instruction addresses is an implementation of dynamic branch prediction.
Several key points characterize dynamic branch prediction:
Historical information recording:
- Recording branching behavior: A dynamic branch predictor records the historical behavior of each branch instruction, such as whether the branch was taken and how often. This information is usually stored in a table called the Branch History Table (BHT) or Pattern History Table (PHT).
- Updating prediction information: Each time a branch instruction is executed, the predictor updates this history, making future predictions more accurate.
Prediction algorithms:
- One-bit predictor: The simplest dynamic branch predictor uses a single bit to track whether the branch was last taken. If it was taken, the next occurrence is predicted taken; if not, the next is predicted not taken.
- Two-bit saturating counter: A more sophisticated predictor uses a two-bit state machine that copes better with changes in branching behavior. The counter has four states and updates its state to decide whether to predict taken.
- Global branch prediction: Uses the global branch history, taking the branching patterns of the whole program into account.
- Local branch prediction: Uses local branch histories, considering only the historical behavior of a specific branch instruction.
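Global prediction can be sketched concretely. The following is a minimal, gshare-style illustration; the table size, index function, and class name are assumptions for illustration, not a real hardware design. The branch address is combined with a register of recent global outcomes to index a shared table of one-bit predictions.

```python
# A minimal gshare-style sketch; table size and hashing are illustrative.
TABLE_SIZE = 16

class GlobalPredictor:
    def __init__(self):
        self.history = 0               # recent global branch outcomes, as bits
        self.table = [0] * TABLE_SIZE  # one prediction bit per entry

    def index(self, address):
        # Combine the branch address with the global history to pick an entry
        return ((address >> 2) ^ self.history) % TABLE_SIZE

    def predict(self, address):
        return self.table[self.index(address)] == 1

    def update(self, address, taken):
        self.table[self.index(address)] = 1 if taken else 0
        # Shift the latest outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) % TABLE_SIZE
```

A local predictor would instead keep a separate history per branch address; mixing in the global history lets correlated branches influence each other's predictions.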
Dynamic adjustment:
- The dynamic branching predictor continuously adjusts its prediction model based on runtime branching behavior. Each time a branching instruction is executed, the predictor compares the predicted result with the actual result and adjusts its internal state accordingly to improve the accuracy of future predictions.
Instruction address association:
- Indexing and managing branch-history information by instruction address is one of the core aspects of dynamic branch prediction. The predictor uses the address of a branch instruction to look up its history and makes predictions based on those records.
- If the branch is predicted taken, the processor prefetches instructions from the target address the branch jumped to last time, reducing potential pipeline stalls.
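The address-indexed lookup described above can be sketched as a small table of one-bit entries. The table size and index function here are illustrative assumptions, not a real hardware design:

```python
# A minimal sketch of a branch history table indexed by instruction address.
BHT_SIZE = 16

class BranchHistoryTable:
    def __init__(self):
        self.table = [0] * BHT_SIZE  # one prediction bit per entry

    def index(self, address):
        # Drop the byte offset, then keep only the low-order index bits
        return (address >> 2) % BHT_SIZE

    def predict(self, address):
        return self.table[self.index(address)] == 1

    def update(self, address, taken):
        self.table[self.index(address)] = 1 if taken else 0

bht = BranchHistoryTable()
bht.update(0x100, True)    # branch at 0x100 was taken
print(bht.predict(0x100))  # True: predicted taken next time
```

Note that two branches whose addresses share the same low-order bits alias to the same entry; real predictors accept this tradeoff to keep the table small.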
Typical example
Suppose there is a branch instruction B at address 0x100. The dynamic branch predictor records B's execution history and uses these records when predicting:
- Initial state: No history is recorded for B, or the predictor is initialized to predict not taken.
- First execution of B: B actually jumps to 0x200. The predictor updates the branch history entry for 0x100 to record that the branch was taken.
- Second execution of B: The predictor looks up the history for 0x100, finds that the branch was taken last time, and predicts taken again; the processor prefetches the instructions at 0x200.
- Third execution of B: The predictor again looks up the history for 0x100 and continues to predict taken or not taken based on the behavior of the previous two executions.
Conclusion
Looking up the instruction address to determine branching behavior, and predicting the next execution from the result of the last, is dynamic in nature: the predictor continuously adjusts and optimizes its prediction model according to the program's actual branching behavior at run time, improving both prediction accuracy and processor pipeline efficiency. This is the core feature of dynamic branch prediction.
In the absence of branching instructions, the program does execute sequentially, with the program counter (PC) pointing to the next instruction address in sequence. This means that instruction addresses are usually assigned sequentially, but it is important to note the following:
Reasons for sequential allocation
Compiler behavior:
- The compiler compiles source code into machine code and places the instructions into an executable file. In the absence of branch instructions or other control-flow changes, the compiler lays the instructions out in memory in order, so their addresses are consecutive.
Memory model:
- The computer's memory model is linear: memory addresses form a contiguous space. Therefore, when a program is loaded into memory, its instructions are loaded sequentially into consecutive memory addresses.
Example
Suppose there is a simple piece of code:
int main() {
int a = 5;
int b = 10;
int c = a + b;
return c;
}
The compiler compiles it to assembly code (simplified example):
0x1000: MOV R1, #5 ; load 5 into R1 register
0x1004: MOV R2, #10 ; load 10 into R2 register
0x1008: ADD R3, R1, R2 ; R3 = R1 + R2
0x100C: MOV R0, R3 ; move result to R0 register
0x1010: BX LR ; return
In this example:
- Each instruction occupies a specific memory address, and these addresses are contiguous: 0x1000, 0x1004, 0x1008, 0x100C, 0x1010.
- Without a branch instruction, the PC starts at 0x1000 and then points to each consecutive instruction address in sequence.
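The sequential advance of the PC through this listing can be sketched; a fixed instruction width of 4 bytes is assumed, matching the example:

```python
# With no branches, the PC advances by the instruction size each step.
pc = 0x1000
addresses = []
for _ in range(5):       # five instructions in the listing above
    addresses.append(pc)
    pc += 4              # 4-byte instructions, as in the example

print([hex(a) for a in addresses])  # ['0x1000', '0x1004', '0x1008', '0x100c', '0x1010']
```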
Points to note
Overall structure of the program:
- Although most instruction addresses are contiguous, an entire program may contain multiple code segments (e.g., functions, subroutines). These segments may be separated from each other and are not always consecutive.
Memory alignment:
- Some processor architectures require instruction addresses to follow a specific alignment (e.g., 4-byte alignment). This means that even if instruction lengths differ, addresses may still be allocated according to the alignment requirement.
Optimization:
- Compiler optimizations may reorder certain instructions or insert filler instructions (e.g., NOP); these cases can also affect the sequential allocation of instruction addresses.
Conclusion
In the absence of branching instructions, instruction addresses are usually assigned sequentially because of the way the compiler and memory model work. However, factors such as the layout of the overall program, memory alignment, and compiler optimizations can affect how instruction addresses are assigned. Overall, programs execute sequentially without branching, and instruction addresses are usually sequential.
In dynamic branch prediction, using a single prediction bit is one of the most basic mechanisms, called One-bit Branch Prediction (OBP). The method is simple and easy to implement, but it has a performance drawback, as you mentioned: even when a branch is almost always taken, it can still cause multiple mispredictions in some cases.
How One Prediction Bit Works
The one-bit prediction method relies on a single bit to record whether the branch was last taken (Taken) or not taken (Not Taken):
- Prediction bit = 1: predict the branch will be taken.
- Prediction bit = 0: predict the branch will not be taken.
Performance drawback
Let's analyze the drawback you mentioned:
Even when a branch is almost always taken, two mispredictions occur instead of one each time the branch is not taken.
Typical example
Suppose we have a loop in which branches are almost always taken, but not at the last iteration:
for (int i = 0; i < 10; i++) {
// Loop body
}
For this loop:
- For the first 9 iterations the branch is taken (continuing the loop).
- At the end of the 10th iteration, the branch is not taken (exits the loop).
The flaw of one prediction bit
- Initial state: Assume the prediction bit starts at 0 (not taken).
- First iteration:
  - Prediction: not taken.
  - Actual: taken (misprediction).
  - Prediction bit updated to 1 (taken).
- Second through ninth iterations:
  - Prediction: taken.
  - Actual: taken (correct).
  - Prediction bit stays at 1 (taken).
- Tenth iteration:
  - Prediction: taken.
  - Actual: not taken (misprediction).
  - Prediction bit updated to 0 (not taken).
- Re-entering the loop:
  - Prediction: not taken.
  - Actual: taken (misprediction).
  - Prediction bit updated to 1 (taken).
Summary
- In the first 9 iterations, the prediction bit is updated to 1 and the predictions are correct.
- In the 10th iteration, the prediction bit causes one misprediction.
- On re-entering the loop, another misprediction occurs because the prediction bit has just been updated to 0.
As a result, even if a branch is almost always taken, the one-bit method still incurs two mispredictions: one when the branch is actually not taken, and another when it is taken again. This shortcoming reduces prediction efficiency.
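The two-mispredictions-per-pass behavior can be checked with a small simulation. The loop pattern below is the hypothetical one from the example: taken 9 times, then not taken, with the loop entered twice:

```python
# A minimal one-bit predictor sketch, simulated on the loop pattern above.
class OneBitPredictor:
    def __init__(self):
        self.bit = 0  # 0 = predict not taken, 1 = predict taken

    def predict(self):
        return self.bit == 1

    def update(self, taken):
        # Record only the most recent outcome
        self.bit = 1 if taken else 0

predictor = OneBitPredictor()
outcomes = ([True] * 9 + [False]) * 2  # two passes over a 10-iteration loop
mispredictions = 0
for taken in outcomes:
    if predictor.predict() != taken:
        mispredictions += 1
    predictor.update(taken)

print(mispredictions)  # 4 mispredictions in total: 2 per pass
```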
Solution
To address this shortcoming, more sophisticated branch prediction methods, such as the Two-bit Saturating Counter, can be used:
- Two bits represent four states: strongly not taken, weakly not taken, weakly taken, and strongly taken.
- This method reduces the number of mispredictions because the direction of prediction changes only after two consecutive prediction errors.
Summary
The one-bit prediction approach is simple but has a performance drawback: a branch that is almost always taken still incurs two mispredictions around each not-taken occurrence. More sophisticated mechanisms (e.g., two-bit saturating counters) introduce more states to reduce mispredictions and improve branch prediction accuracy.
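For comparison, running the same hypothetical loop pattern (taken 9 times, then not taken, entered twice) through a two-bit saturating counter sketch shows only one misprediction per pass, because a single not-taken outcome only weakens the taken prediction rather than flipping it:

```python
# A minimal two-bit saturating counter, simulated on the same loop pattern.
class TwoBitCounter:
    def __init__(self):
        self.state = 2  # 0 = SNT, 1 = WNT, 2 = WT, 3 = ST; start weakly taken

    def predict(self):
        return self.state >= 2  # taken in WT or ST

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, 3)  # saturate at ST
        else:
            self.state = max(self.state - 1, 0)  # saturate at SNT

counter = TwoBitCounter()
outcomes = ([True] * 9 + [False]) * 2  # two passes over a 10-iteration loop
mispredictions = 0
for taken in outcomes:
    if counter.predict() != taken:
        mispredictions += 1
    counter.update(taken)

print(mispredictions)  # 2 mispredictions in total: 1 per pass
```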
Dynamic branch prediction with two-bit saturating counters can be related to several concepts in machine learning, especially state transitions and probabilistic prediction. Here are some of the relevant points:
1. Finite State Machine (FSM)
A two-bit saturating counter is essentially a simple finite state machine with four states:
- Strongly Not Taken (SNT)
- Weakly Not Taken (WNT)
- Weakly Taken (WT)
- Strongly Taken (ST)
Such state machines can be analogized in machine learning to state space models such as Markov chains.
2. Markov Chain
The state transitions of the two-bit saturating counter can be viewed as a Markov chain with the following characteristics:
- States: 4 states (SNT, WNT, WT, ST)
- Transitions: driven by the outcome of the current branch (Taken or Not Taken)
Markov chains are tools used in machine learning to characterize a system's states and the transitions between them, and are often used to predict future states.
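This analogy can be made quantitative. If each branch outcome is modeled as an independent event that is taken with probability p (the value p = 0.9 below is an assumption for illustration), the four counter states form a small Markov chain whose long-run distribution concentrates on Strongly Taken:

```python
import numpy as np

p = 0.9  # assumed probability that the branch is taken

# Rows = current state, columns = next state; states 0..3 = SNT, WNT, WT, ST
T = np.zeros((4, 4))
for s in range(4):
    T[s, min(s + 1, 3)] += p      # taken: saturating increment
    T[s, max(s - 1, 0)] += 1 - p  # not taken: saturating decrement

# Long-run distribution: iterate the chain for many steps from state SNT
pi = np.linalg.matrix_power(T, 64)[0]
print(pi.argmax())  # 3: the chain spends most of its time in Strongly Taken
```

Because the chain heavily favors the ST state when p is high, the counter predicts taken almost all the time, which is exactly the behavior that makes it accurate on loop branches.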
3. Probabilistic prediction
Although the two-bit saturation counter is a deterministic model, it is designed with a similar philosophy to probabilistic prediction:
- Strongly taken and strongly not taken states: similar to high-confidence predictions.
- Weakly taken and weakly not taken states: similar to low-confidence predictions.
In machine learning, probabilistic prediction models such as Naive Bayes and Hidden Markov Models (HMMs) make state predictions based on historical data, similar to how a two-bit saturating counter predicts based on historical branching behavior.
4. Reinforcement Learning (RL)
The Q-Learning algorithm in reinforcement learning has some similarities with two-bit saturation counters:
- Q-Learning: Based on the current state and actions taken, the Q value is updated and the optimal policy is gradually learned.
- Two-bit saturating counter: Based on the current branch outcome (Taken or Not Taken), the state is updated and the prediction strategy is gradually refined.
In reinforcement learning, Q-value updates are analogous to state updates in a two-bit saturating counter; both are continuously adjusted to achieve more accurate predictions.
5. Memory Mechanisms
Two-bit saturating counters can, to some extent, be compared to neural networks with memory, such as long short-term memory networks (LSTMs) and gated recurrent units (GRUs):
- Memory: the results of previous branches are saved and used to guide the current prediction.
- LSTM/GRU: gate mechanisms retain important historical information and suppress unimportant information to better predict the future.
Worked example
To connect these ideas to code, the following simulates the two-bit saturating counter directly (rather than a full Markov-chain or reinforcement-learning model):
class TwoBitPredictor:
    def __init__(self):
        # States: 00 = SNT, 01 = WNT, 10 = WT, 11 = ST
        self.state = 0b10  # Start with weakly taken (WT)

    def predict(self):
        # Predict taken if in states 10 (WT) or 11 (ST)
        return self.state >> 1

    def update(self, taken):
        # Update state based on the actual branch result
        if taken:
            if self.state < 0b11:
                self.state += 1
        else:
            if self.state > 0b00:
                self.state -= 1

# Simulate the predictor
predictor = TwoBitPredictor()
branches = [True, True, False, True, True, False, True, False]
for branch in branches:
    prediction = predictor.predict()
    print(f"Prediction: {'Taken' if prediction else 'Not Taken'}, Actual: {'Taken' if branch else 'Not Taken'}")
    predictor.update(branch)
This simple simulation shows how dynamic branch prediction can be performed with a two-bit saturating counter. In this way, some basic concepts from machine learning can be related to dynamic branch prediction to help understand and improve prediction algorithms.
The primary role of a branch predictor is to improve processor performance. Specifically, a branch predictor predicts the execution path of branch instructions in a program (e.g., conditional jumps, loops, and function calls), thereby reducing or eliminating the pipeline stalls and delays those instructions cause. The following are the details of what a branch predictor does:
1. Improving the efficiency of the instruction pipeline
Modern processors use pipelining techniques to increase the parallelism and efficiency of instruction execution. However, branching instructions (such as conditional jumps and loops) interrupt the sequential execution of the instruction pipeline. A branch predictor reduces pipeline stalls and improves pipeline utilization by predicting the direction of a branch, allowing the processor to continue loading and executing subsequent instructions without having to wait for the result of a branch instruction.
2. Reduced waiting time for instruction execution
When the processor encounters a branching instruction, if it cannot predict the result of the branch, the processor will have to wait for the result of the execution of the branching instruction. This causes the processor to enter a waiting state and reduces execution efficiency. The branch predictor reduces the waiting time and increases the instruction execution speed by predicting the result of the branch and allowing the processor to continue executing instructions on the predicted path.
3. Improving Instruction-Level Parallelism (ILP)
The branch predictor enables the processor to load and execute instructions ahead of predicted branch paths, thereby increasing instruction-level parallelism. By increasing the parallel execution of instructions, the processor can utilize its resources more efficiently and improve overall performance.
4. Reducing Performance Loss from Branch Mispredictions
While a branch predictor cannot guarantee that every prediction is correct, advanced prediction algorithms (e.g., two-bit saturating counters, global history registers, and hybrid predictors) can significantly reduce the probability of misprediction. When the predictor is correct, the overhead of flushing and refilling the pipeline after a misprediction is avoided, reducing performance loss.
5. Increase processor throughput
Accurate prediction by the branch predictor allows the processor to maintain an efficient instruction throughput, i.e., the number of instructions processed per clock cycle. By reducing stalls and delays caused by branching instructions, the processor is able to execute instructions more continuously and quickly, increasing overall throughput.
Example: Branch Predictor with Two-Bit Saturation Counter
Suppose a simple two-bit saturating-counter branch predictor. The predictor has four states: strongly not taken (00), weakly not taken (01), weakly taken (10), and strongly taken (11). It transitions between states based on the historical outcomes of the branch instruction and predicts the branch direction accordingly. It works as follows:
- Initial state: Assume the initial state is weakly taken (10).
- Predicting the branch: Prediction is based on the current state. If the state is 10 or 11, the branch is predicted taken; if the state is 00 or 01, it is predicted not taken.
- Executing the branch: The processor executes the branch instruction and updates the state based on the actual outcome.
- State update:
  - If the prediction is correct, the state stays the same or shifts toward the stronger direction.
  - If the prediction is wrong, the state shifts in the opposite direction.
The following is a sample code showing the basic workflow of a two-bit saturation counter:
#include <iostream>
#include <vector>
class TwoBitPredictor {
public:
enum State { SNT, WNT, WT, ST };
TwoBitPredictor() : state(WT) {}
bool predict() const {
return state == WT || state == ST;
}
void update(bool taken) {
if (taken) {
if (state != ST) state = static_cast<State>(state + 1);
} else {
if (state != SNT) state = static_cast<State>(state - 1);
}
}
private:
State state;
};
int main() {
TwoBitPredictor predictor;
std::vector<bool> branches = {true, true, false, true, true, false, true, false};
for (bool branch : branches) {
bool prediction = predictor.predict();
std::cout << "Prediction: " << (prediction ? "Taken" : "Not Taken")
<< ", Actual: " << (branch ? "Taken" : "Not Taken") << std::endl;
predictor.update(branch);
}
return 0;
}
In this example, the TwoBitPredictor class implements a simple two-bit saturating-counter predictor. By predicting and then updating its state based on historical branch outcomes, the predictor performs dynamic branch prediction and helps improve processor performance.