Supporting suitable alignment width: In several designs [8], [19], the quire [33] format is adopted to represent the exact dot product of two posit vectors without rounding or overflow. However, the associated hardware overhead is prohibitive [34], since the intermediate operands are kept as quire values with a large bit width, consuming excessive computing resources in subsequent operations. By contrast, PDPU parameterizes the width of the aligned mantissa, i.e., Wm, which can be determined based on the distribution characteristics of inputs and DNN accuracy requirements. Configured with a suitable alignment width, PDPU minimizes hardware cost while meeting precision requirements.
This text discusses the advantages of PDPU's (Posit Dot-Product Unit) support for a suitable alignment width compared with the traditional quire format, and highlights how PDPU's configurability balances hardware overhead against computational accuracy. A detailed analysis follows:
1. Problem background: Limitations of traditional Quire formats
(1) The role of Quire format
- Purpose: In posit/floating-point vector dot-product operations, the quire is an extended-precision format used to store intermediate results exactly, avoiding repeated rounding errors and overflow.
- Advantage: Long dot products (such as the sum of element-wise products of two vectors) can be computed losslessly.
(2) Quire's hardware overhead problem
- Large bit width:
  - The quire must store the exact values of all intermediate results, so its bit width can be extremely large (e.g., hundreds of bits).
  - Example: when computing the dot product of two 8-dimensional posit vectors, the quire may require hundreds of bits to guarantee no precision loss.
- Resource consumption:
  - Large bit widths sharply increase the area and power of multipliers, adders, and storage cells (the "prohibitive overhead" noted in [34]).
  - Subsequent operations (such as activation functions and normalization) must then process extremely wide data, further degrading performance.
2. PDPU solution: parameterized alignment width (Wm)
(1) Core idea
- Give up the exact quire in favor of a dynamically configured alignment width (Wm):
  - Wm is the mantissa bit width after alignment in the dot-product operation, and is an adjustable parameter.
  - According to the input data distribution and the DNN's accuracy requirements, the smallest sufficient Wm is selected, rather than a fixed large bit width.
(2) Technical implementation
- Mantissa alignment and truncation:
  - When computing the dot product, only Wm bits of each aligned mantissa are retained; the bits beyond this width are truncated or rounded.
  - This is similar to the "align + round" step in floating-point addition, but with a configurable bit width.
- Parameterized design:
  - Wm can be configured through registers, for example set to 8 bits on edge devices and 16 bits on the server side.
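As a concrete illustration, here is a minimal Python sketch of the truncation idea (my own model, not code from the paper; the function name and widths are made up):

```python
def truncate_to_wm(mant: int, full_bits: int, wm: int) -> int:
    """Keep only the wm most-significant bits of a full_bits-wide
    fixed-point mantissa; the shifted-out low bits are simply dropped
    (a real design could round them instead of truncating)."""
    if wm >= full_bits:
        return mant  # nothing to discard
    return mant >> (full_bits - wm)

# A 12-bit product mantissa truncated to Wm = 8 bits:
m = 0b110101101011
print(bin(truncate_to_wm(m, 12, 8)))  # 0b11010110
```

In hardware, `wm` would come from a configuration register rather than a function argument.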
(3) Advantages
- Hardware efficiency:
  - Narrower multipliers and adders reduce area and power consumption (compared with the hundreds of bits of a quire).
- Controllable accuracy:
  - By analyzing the input data distribution (numerical range, sparsity) and the model's error tolerance, the smallest Wm that satisfies the accuracy target is selected.
- Flexibility:
  - The same hardware adapts to different scenarios (e.g., low-power mode Wm=8, high-precision mode Wm=16).
3. Comparative summary: quire vs. PDPU

Characteristic | Quire format | PDPU parameterized Wm |
---|---|---|
Accuracy | Exact (no rounding/overflow) | Approximate but controllable (depends on Wm configuration) |
Hardware overhead | Extremely high (wide computing units) | Low (configurable bit width) |
Applicable scenarios | Scientific computing requiring strict numerical fidelity | DNN inference that tolerates approximate computing |
Dynamic adjustment | Fixed bit width | Runtime-configurable Wm |
4. Worked example
Assume the input distribution of a DNN layer is as follows:
- Input value range: [-1.0, 1.0], with most values concentrated within ±0.1.
- Accuracy requirement: error tolerance ±0.001.
PDPU configuration policy:
- Analyze the mantissa requirement:
  - Small value range → high-order bits can be compressed.
  - Resolving ±0.1 requires about 7 mantissa bits ((2^{-7} \approx 0.008)); resolving ±0.001 requires about 10 bits ((2^{-10} \approx 0.001)).
- Select Wm=10:
  - Satisfies the accuracy target while staying far below the hundreds of bits of a quire.
- Hardware implementation:
  - The accumulation datapath narrows from the hundreds of bits of a quire (e.g., 256 bits) to about 10 bits, roughly a 25× reduction in adder width; the actual area saving depends on the unit.
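The bit counts above can be checked numerically; a small helper (hypothetical, not from the paper) that returns the smallest Wm whose quantization step 2^-Wm is within a given tolerance:

```python
import math

def min_wm(tolerance: float) -> int:
    """Smallest mantissa width whose step size 2**-wm is <= tolerance."""
    return math.ceil(-math.log2(tolerance))

print(min_wm(0.008))  # 7  (2**-7 ~= 0.0078)
print(min_wm(0.001))  # 10 (2**-10 ~= 0.00098)
```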
5. Related research support
- [8][19][33]: point out the accuracy advantages of the quire, but leave its hardware overhead unsolved.
- [34]: confirms that the quire's bit-width expansion makes the resource cost infeasible ("prohibitive").
- PDPU's innovation: data-driven Wm parameterization balances accuracy against hardware efficiency.
6. Summary
PDPU's parameterized alignment width (Wm) optimizes dot-product computation by:
- Avoiding excessive width: abandoning the quire's full exactness in favor of configurable mantissa alignment.
- Hardware friendliness: narrowing the computing units, reducing area and power consumption.
- Hardware/software co-design: dynamically adjusting Wm to the input distribution and model requirements for optimal efficiency.
This design is particularly suitable for DNN inference, where approximate computation and energy efficiency often matter more than exact results.
The origin of the quire format and its relationship to the IEEE 754 standard
1. The origin of the quire format
The quire format does not come from the IEEE 754 floating-point standard. It was introduced by John L. Gustafson in 2017 together with the posit number format (an alternative to IEEE 754 for representing real numbers) as an extended-precision accumulator. Its core purpose is to solve the precision-loss and overflow problems of floating-point/posit vector dot-product operations.
- Background:
  - IEEE 754 floating point introduces errors in chains of multiply-add operations due to repeated rounding and range limits (such as exponent overflow).
  - The posit format itself improves dynamic range through a variable-length exponent field (regime + exponent), but long dot products still require higher-precision intermediate storage.
  - The quire was designed as a companion extension to posit for the exact accumulation of dot-product intermediate results.
2. Comparison between the quire and IEEE 754

Characteristic | IEEE 754 standard | Quire format (posit) |
---|---|---|
System | Traditional floating-point standard (1985) | Extension of the posit format (proposed 2017) |
Design goal | General floating-point computing | Exact dot-product accumulation |
Bit-width flexibility | Fixed (e.g., 32/64 bits) | Very wide (up to hundreds of bits) |
Rounding control | Rounding after each step | Final rounding only |
Typical uses | General scientific computing | DNNs, high-precision vector operations |
3. Technical characteristics of the quire
- Very large bit width:
  - The quire's bit width is on the order of (N \times \text{posit bit width}) (for example, the dot product of two 8-dimensional 32-bit posit vectors requires roughly 256 bits of product storage; the standard quire for 32-bit posits is 512 bits).
  - This covers the exact representation of all intermediate results, avoiding overflow and rounding errors.
- Hardware overhead:
  - Dedicated registers and wide adders are required, causing a surge in area and power consumption (as indicated by [34]).
4. Why does IEEE 754 have no quire-like mechanism?
- Historical limitations:
  IEEE 754 was designed in the 1980s, before the long dot products of modern DNNs were a consideration; its standard operations (such as FMA, fused multiply-add) support only single-step rounding.
- Generality first:
  IEEE 754 targets common scenarios (such as general scientific computing), while the quire is optimized for one specific pattern (dot-product accumulation).
5. Alternatives to the quire
Because of the quire's high hardware cost, subsequent research proposed alternatives:
- PDPU's parameterized alignment width (Wm): dynamically configures the mantissa bit width to trade precision against overhead.
- Block floating point (BFP): shares exponent bits to reduce intermediate bit widths.
- Low-precision approximate computing: e.g., INT8 dot products with a scale factor (cf. GPU Tensor Cores).
6. Summary
- The quire is a dedicated extension of the posit format and does not belong to the IEEE 754 standard.
- It solves the need for exact dot products, but its high hardware overhead has prompted more efficient alternative designs such as PDPU.
- The limitations of IEEE 754 motivated the exploration of new formats such as posit/quire, but the two belong to different number-representation systems.
References:
- Gustafson, J. L., & Yonemoto, I. T. (2017). Beating Floating Point at its Own Game: Posit Arithmetic.
- IEEE 754-2019 Standard for Floating-Point Arithmetic.
Quire Basics
1. What is a quire?
The quire is a high-precision accumulator format designed for the posit number system (an alternative to IEEE 754 floating point for representing real numbers). It is used to compute dot products and long multiply-accumulate chains exactly, avoiding rounding errors and overflow in intermediate results.
2. Background of the quire
- IEEE 754 floating-point defect:
  In traditional floating point (such as FP32/FP64), each step of a chained multiply-add (such as ( \sum a_i b_i )) may introduce a rounding error, making the final result inaccurate.
  - Example: computing ((1.0 + 2^{-25}) - 1.0) in FP32 yields 0, because (2^{-25}) is below half an ulp of 1.0 and is rounded away.
- Posit's optimization:
  Posit improves dynamic range and accuracy through a variable-length exponent field (regime + exponent), but long dot products still require higher-precision intermediate storage.
- The quire's proposal:
  As a companion extension to posit, the quire provides an extra-wide accumulator that guarantees no rounding or overflow throughout the dot-product operation.
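The rounding loss is easy to reproduce in software. Python's built-in float is IEEE 754 double precision (52 fraction bits), so the same effect appears at 2^-53; exact rational arithmetic, playing the role a quire plays, preserves the term:

```python
from fractions import Fraction

# Double precision: 2**-53 is half an ulp of 1.0 and is rounded away.
lost = (1.0 + 2**-53) - 1.0
print(lost)  # 0.0

# Exact accumulation (the role a quire plays) keeps the tiny term.
kept = (Fraction(1) + Fraction(1, 2**53)) - Fraction(1)
print(kept == Fraction(1, 2**53))  # True
```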
3. Core features of the quire
(1) Very large bit width
- The quire's bit width is on the order of ( N \times \text{posit bit width} ).
  - For example: for the dot product of two 8-dimensional 32-bit posit vectors, the quire needs at least (8 \times 32 = 256) bits.
- Why so large?
  To guarantee the exact representation of all intermediate products (( a_i \times b_i )) and of the accumulated result, avoiding:
  - Overflow (exponent overflow)
  - Rounding error
(2) Final rounding only
- Traditional floating point: rounding after each multiply-add step (e.g., the FMA instruction).
- Quire: all intermediate results retain full precision; rounding occurs only when the final result is converted back to posit.
(3) Hardware implementation complexity
- Advantage: mathematically exact.
- Shortcomings:
  - Dedicated wide registers (e.g., 256/512 bits) are required.
  - Adder/multiplier area and power consumption are extremely high (the "prohibitive overhead" indicated by [34]).
4. Mathematical representation of the quire
The quire can be regarded as an extended-precision fixed-point number whose value is the exact sum of all intermediate products:
[
\text{Quire} = \sum_{i=0}^{N-1} a_i \times b_i
]
- No exponent field: only integer and fraction bits are represented, avoiding the complexity of floating-point alignment.
- Sign handling: accumulation of signed values is supported.
5. Hardware architecture of the quire
(1) Basic composition
- Input: products of posit operands ( a_i \times b_i ).
- Accumulation unit: a wide adder (e.g., 256 bits).
- Output: the final result, rounded to a posit or floating-point number.
(2) Workflow
- Product expansion: extend each ( a_i \times b_i ) to the full quire bit width (e.g., 256 bits).
- Exact accumulation: add all expanded values without intermediate rounding.
- Final rounding: convert the quire result to the target format (e.g., Posit32).
(3) Example
Compute the dot product of two 4-dimensional Posit8 vectors:
- Inputs: ( A = [a_0, a_1, a_2, a_3] ), ( B = [b_0, b_1, b_2, b_3] )
- Quire bit width: at least (4 \times 8 = 32) bits (in practice larger, to leave headroom against overflow).
- Accumulation:
[
\text{Quire} = a_0b_0 + a_1b_1 + a_2b_2 + a_3b_3 \quad (\text{full precision retained})
]
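This workflow can be modeled in a few lines of Python, using an arbitrary-precision integer as the quire (an illustrative model only; the FRAC_BITS width is an assumption, not the standard quire size):

```python
FRAC_BITS = 32  # assumed fixed-point position of the binary point

def to_fixed(x: float) -> int:
    """Scale a value onto the quire's fixed-point grid."""
    return round(x * (1 << FRAC_BITS))

def quire_dot(a, b) -> float:
    """Sum exact integer products with no intermediate rounding,
    then perform a single final conversion (the 'final rounding')."""
    acc = sum(to_fixed(x) * to_fixed(y) for x, y in zip(a, b))
    return acc / float(1 << (2 * FRAC_BITS))

print(quire_dot([0.5, 0.25], [0.5, 0.25]))  # 0.3125
```

Python integers grow as needed, which mimics a very wide accumulator register.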
6. Pros and cons of the quire

Advantages | Shortcomings |
---|---|
1. No rounding error (high precision) | 1. Large hardware overhead (wide computing units) |
2. No overflow risk (wide dynamic range) | 2. High power consumption |
3. Suitable for scientific computing / DNN training | 3. Slower (compared with low-precision units) |
7. Quire application scenarios
- High-precision numerical computing:
  - Scientific calculations requiring strict fidelity (such as climate simulation).
- Deep neural network training:
  - Avoiding accumulated error in gradient updates.
- Finance/cryptography:
  - Cases that are sensitive to rounding errors.
8. Alternatives to the quire
Because the quire's hardware cost is too high, modern designs tend toward:
- PDPU's parameterized alignment width (Wm): dynamically configures the mantissa bit width.
- Block floating point (BFP): shares the exponent to reduce bit width.
- Low-precision approximate computing: e.g., INT8 with a scaling factor (cf. GPU Tensor Cores).
9. Code example (pseudocode)
def dot_product_with_quire(A, B, N):
    quire = 0  # initialize the quire (in hardware, a very wide fixed-point register)
    for i in range(N):
        product = A[i] * B[i]  # posit multiplication (exact product)
        quire += product       # accumulate with no intermediate rounding
    return posit_round(quire)  # single final rounding back to posit
10. Summary
- The quire is posit arithmetic's extended-precision accumulator; it does not belong to IEEE 754.
- Through its very large bit width and lack of intermediate rounding it guarantees dot-product accuracy, but the hardware is expensive.
- It suits error-sensitive scenarios, but in fields such as DNN inference it is being replaced by more efficient approximate methods.
Study suggestions:
- First master the posit format (compare it against IEEE 754).
- Understand the error sources of dot-product operations (rounding, overflow).
- Analyze the quire's overhead from a hardware-design perspective (e.g., adder bit width).
Wm (aligned mantissa width), mentioned in the paper, is a key configurable parameter that dynamically controls the alignment precision of mantissas in dot-product computation, thereby balancing hardware overhead and computational accuracy. A precise definition and detailed description of Wm follows:
1. Clear definition of Wm
Wm is the bit width of the mantissa part that participates in the accumulation during a dot-product operation. Specifically:
- Object: the mantissa part of each intermediate product of the dot product (such as (a_i \times b_i)).
- Alignment logic: before accumulation, the mantissas of all intermediate products are aligned by exponent and truncated or extended to Wm bits.
- Dynamic configuration: the value of Wm can be adjusted according to the input data distribution or the DNN's accuracy requirements (for example, Wm = 8/12/16 bits).
Formula representation
Suppose two floating-point/posit numbers are multiplied to give the intermediate product (p_i = a_i \times b_i), expressed as:
[
p_i = (-1)^{s_i} \cdot m_i \cdot 2^{e_i}
]
where (m_i) is the mantissa (usually normalized to (1 \leq m_i < 2)). Before accumulation:
- Exponent alignment: adjust all (p_i) to the maximum exponent (e_{\text{max}}); the mantissas shift to (m_i' = m_i \cdot 2^{e_i - e_{\text{max}}}).
- Mantissa truncation: only Wm significant bits of each aligned mantissa (m_i') are retained; the shifted-out low-order bits are truncated or rounded.
2. Hardware implementation logic of Wm
(1) Mantissa alignment and truncation
- Input: the mantissas (m_i) and exponents (e_i) of the intermediate products (p_i).
- Steps:
  - Find the maximum (e_{\text{max}}) of all (e_i).
  - Shift each (m_i) right by (e_{\text{max}} - e_i) bits to obtain the aligned mantissa (m_i').
  - Keep Wm bits of (m_i'), truncating or rounding the shifted-out low-order bits.
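The three steps map directly onto a few lines of Python (a software model with invented names, not the RTL):

```python
def align_mantissas(products, wm):
    """products: (mantissa, exponent) pairs, mantissa as an unsigned int.
    Returns the shared exponent and the mantissas aligned to it,
    truncated to wm bits."""
    e_max = max(e for _, e in products)              # step 1: find e_max
    aligned = []
    for m, e in products:
        shifted = m >> (e_max - e)                   # step 2: right-shift
        aligned.append(shifted & ((1 << wm) - 1))    # step 3: keep wm bits
    return e_max, aligned

print(align_mantissas([(0b1101, 0), (0b1011, -2)], wm=4))  # (0, [13, 2])
```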
(2) Accumulator design
- Bit width: the accumulator width is (Wm + \log_2 N) (N is the dot-product block size), which guarantees no overflow.
  - For example, Wm = 12 bits, N = 16 → the accumulator needs 12 + 4 = 16 bits.
- Advantage: compared with the quire's fixed large bit width (e.g., 256 bits), Wm greatly reduces hardware resources.
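The sizing rule is a one-liner; a quick numerical check (the helper name is mine):

```python
import math

def acc_width(wm: int, n: int) -> int:
    """Accumulator bits for summing n terms of wm bits each:
    wm plus ceil(log2(n)) carry-growth bits."""
    return wm + math.ceil(math.log2(n))

print(acc_width(12, 16))  # 16, as in the example above
print(acc_width(10, 16))  # 14, matching the worked example later on
```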
(3) Dynamic configuration
- Runtime adjustment: Wm is configured through registers to suit different scenarios:
  - High-precision mode: Wm = 16 bits (suitable for training or sensitive layers).
  - Low-power mode: Wm = 8 bits (suitable for edge-device inference).
3. Comparison between Wm and the quire

Characteristic | Quire | Parameterized Wm design |
---|---|---|
Accuracy guarantee | Fully exact (no rounding) | Controllable approximation (depends on Wm) |
Bit width | Fixed and large (e.g., 256 bits) | Dynamically adjustable (e.g., 8/12/16 bits) |
Hardware overhead | Extremely high (wide adders/registers) | Low (bit width allocated as needed) |
Applicable scenarios | Scientific computing with strict fidelity | DNN inference (tolerates approximation) |
4. Basis for selecting Wm
(1) Data distribution analysis
- If the dynamic range of the input data is small (e.g., activations in [0, 1]), Wm can be smaller (e.g., 8 bits).
- If the data range is large (e.g., gradient updates), Wm must be increased (e.g., 16 bits).
(2) DNN accuracy requirements
- Classification tasks (higher error tolerance): Wm = 8-10 bits.
- Super-resolution/generative models (require high accuracy): Wm = 12-16 bits.
(3) Hardware constraints
- Area/power priority: choose the smallest Wm that meets the accuracy target.
- Throughput priority: increase Wm moderately to reduce the number of iterations.
5. Worked example
Assume the dot-product computation of a convolutional layer:
- Input: 16 pairs of 8-bit posit numbers ((a_i, b_i)), with values in [0.1, 1.0].
- Wm configuration:
  - Compute each (p_i = a_i \times b_i); the mantissas (m_i) lie in [1.0, 2.0).
  - The maximum exponent after alignment is (e_{\text{max}} = 0) (because (p_i \leq 1.0)).
  - Choosing Wm = 10 bits retains 10 bits of each aligned mantissa, bounding the rounding error by about (2^{-10} \approx 0.001).
- Result: the accumulated error is controllable, and the hardware only needs a 10 + 4 = 14-bit accumulator (far below the 128+ bits of a quire).
6. Summary
- The nature of Wm: the effective bit width of the aligned mantissas in a dot-product operation; a tuning knob between accuracy and hardware overhead.
- Core advantage: by dynamically configuring Wm, PDPU significantly reduces hardware resource consumption while preserving DNN accuracy.
- Design key: the optimal Wm must be chosen by combining statistical analysis of the data with the model's error tolerance.
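Putting the pieces together, a toy end-to-end model (my own sketch under simplifying assumptions: positive float inputs, truncation instead of rounding) shows how a modest Wm keeps the dot-product error small:

```python
import math

def wm_dot(a, b, wm):
    """Dot product with wm-bit aligned mantissas (positive inputs assumed)."""
    prods = [x * y for x, y in zip(a, b) if x * y != 0.0]
    if not prods:
        return 0.0
    decomp = [math.frexp(p) for p in prods]       # p = m * 2**e, 0.5 <= m < 1
    e_max = max(e for _, e in decomp)
    acc = 0
    for m, e in decomp:
        acc += int(m * (1 << wm)) >> (e_max - e)  # align and keep wm bits
    return acc * 2.0 ** (e_max - wm)

a, b = [0.9, 0.3, 0.5, 0.7], [0.2, 0.8, 0.4, 0.6]
exact = sum(x * y for x, y in zip(a, b))
print(abs(exact - wm_dot(a, b, wm=10)))  # error on the order of 2**-10
```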
In the PDPU (Posit Dot-Product Unit) paper, the Wm (aligned mantissa width) parameter is introduced mainly in the following places:
1. First explicit mention (Page 3, Section III-C)
In the "Supporting suitable alignment width" passage, the authors compare the hardware overhead of the traditional quire format and introduce Wm for the first time as an alternative:
Original citation:
"By contrast, PDPU parameterizes the width of aligned mantissa, i.e., ( W_m ), which can be determined based on distribution characteristics of inputs and DNN accuracy requirements. Configured with suitable alignment width, PDPU minimizes the hardware cost while meeting precision."
Key information:
- Definition: ( W_m ) is the width of the aligned mantissa, used to dynamically control how many mantissa bits are retained in dot-product operations.
- Effect: balances hardware overhead against computational accuracy by truncating mantissa bits beyond ( W_m ).
- Configuration basis: input data distribution and DNN accuracy requirements.
2. Experimental verification (Page 4, Section IV-A)
In the comparative-experiment section, the authors give a concrete value of (W_m) and its impact on accuracy and hardware efficiency:
Original citation:
"our mixed-precision PDPU with ( W_m=14 ) and ( N=4 ) achieves significant savings up to 43%, 64%, and 70% in area, delay, and power compared with the posit-based PACoGen DPU..."
"Note that inappropriate data formats or alignment width may result in 10% higher computational loss of accuracy..."
Key information:
- Typical value: the experiments use (W_m=14) bits with block size (N=4).
- Accuracy impact: an ill-chosen (W_m) can cause significant accuracy loss (on the order of 10%).
3. Technical background (Page 3, Section III-B)
Although ( W_m ) is not directly defined there, the "Fused and mixed-precision implementation" subsection explains its design motivation:
Original citation:
"PDPU is capable of mixed-precision computation... i.e., low precision for inputs and a slightly higher precision for dot-product results..."
Related interpretation:
- (W_m) is a key parameter for achieving mixed precision, allowing inputs (e.g., 8 bits) and accumulated results (e.g., 16 bits) to use different bit widths.
4. Hardware implementation (Page 2, Section III-A)
In the description of the "S3: Align" pipeline stage, the actual operation governed by (W_m) is implicit:
Original citation:
"The product results from S2 are aligned according to the difference between the respective exponent and ( e_{max} )..."
Related interpretation:
- The alignment stage truncates mantissas based on ( W_m ): ( W_m ) significant bits are retained, and the shifted-out low-order bits are discarded or rounded.
Summary: the complete definition of Wm
- Nature: a dynamically configurable mantissa bit width that controls how many mantissa bits are retained after alignment in a dot-product operation.
- Purpose: replaces the quire's full-precision accumulation, trading a controllable accuracy loss for hardware efficiency.
- Configuration logic:
  - Input analysis: choose the smallest sufficient (W_m) based on the data's dynamic range (e.g., the activation distribution).
  - Accuracy requirements: DNN task type (e.g., classification tasks tolerate larger errors).
  - Hardware mapping: (W_m) drives the bit-width design of the shifters and accumulator (e.g., (W_m=14) requires a 14-bit adder plus carry-growth bits).
Figures as an aid
Figure 4 (PDPU architecture) and Figure 5 (CSA tree) in the paper do not label ( W_m ) directly, but the "Align" stage and the recursive CSA-tree design reflect the constraints ( W_m ) places on mantissa processing.
This architecture is the complete hardware implementation of the posit dot-product unit (PDPU) proposed in the paper "PDPU: An Open-Source Posit Dot-Product Unit for Deep Learning Applications", including a combinational version (pdpu_top.sv) and a pipelined version (pdpu_top_pipelined.sv). An in-depth analysis combined with the paper follows:
1. Overall design objectives
- Core function: efficiently compute the dot product of two posit vectors (out = acc + V_a × V_b), with mixed-precision support (e.g., P(13,2) inputs and P(16,2) outputs).
- Key optimizations:
  - Fused computation: reduces redundant decode/encode operations (a discrete design requires 3N decoders; PDPU requires only 2N+1).
  - Six-stage pipeline: balances the critical paths and improves throughput (the paper reports a frequency of 2.7 GHz).
  - Dynamic range adaptation: posit's regime mechanism (tapered accuracy, Figure 3) matches the non-uniform distribution of DNN data.
2. Module-level analysis
(1) Top-level modules
- pdpu_top.sv: combinational implementation, suitable for low-latency scenarios.
- pdpu_top_pipelined.sv: the six-stage pipelined version, the main design in the paper. The stages are:
  - S1: Decode — posit_decoder.sv extracts the sign, exponent, and mantissa; it relies on a leading-zero counter (LZC) and barrel_shifter.sv (barrel shifter).
  - S2: Multiply — an improved radix4_booth_multiplier.sv (radix-4 Booth multiplier) computes the mantissa products; csa_tree.sv (carry-save adder tree) compresses the partial sums, reducing carry latency.
  - S3: Align — mantissas are aligned to the maximum exponent determined by comp_tree.sv (comparator tree).
  - S4: Accumulate — the recursive csa_tree.sv compresses the intermediate results, and a final addition produces the accumulated sum.
  - S5: Normalize — mantissa_norm.sv adjusts the mantissa and exponent (using the leading-zero counter and shifter).
  - S6: Encode — posit_encoder.sv packs the result into posit format.
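The six-stage dataflow can be sketched behaviorally in plain Python, substituting floats for posits (a loose analogy of the stage ordering, not the SystemVerilog):

```python
import math

def pdpu_like_dot(a, b, acc=0.0):
    """Behavioral analogy of the S1-S6 stages (floats stand in for posits)."""
    da = [math.frexp(x) for x in a]                       # S1 Decode
    db = [math.frexp(y) for y in b]
    prods = [(ma * mb, ea + eb)                           # S2 Multiply
             for (ma, ea), (mb, eb) in zip(da, db)]
    e_max = max(e for _, e in prods)                      # S3 Align
    aligned = [m * 2.0 ** (e - e_max) for m, e in prods]
    total = sum(aligned) + acc * 2.0 ** (-e_max)          # S4 Accumulate
    return math.ldexp(total, e_max)                       # S5/S6 Normalize + Encode

print(pdpu_like_dot([1.0, 2.0], [3.0, 4.0]))  # 11.0
```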
(2) Key submodules
- Posit codec:
  - posit_decoder.sv: dynamically parses the regime field (Equation (1) of the paper).
  - posit_encoder.sv: handles rounding (RNE mode) and dynamic bit-width adjustment.
- Arithmetic units:
  - Radix-4 Booth multiplier: booth_encoder.sv generates the partial products, which csa_tree.sv compresses (the paper reports a 43% area reduction).
  - CSA tree: a recursive structure (Figure 5) supporting variable dot-product sizes (e.g., N=4).
- Dynamic alignment and normalization:
  - comp_tree.sv: quickly determines the maximum exponent (critical-path optimization).
  - mantissa_norm.sv: combines a leading-zero counter and shifter for efficient normalization.
3. Relationship to the paper's experimental results
- Performance data:
  - The six-stage pipeline reduces the critical path from 0.8 ns to 0.37 ns (Figure 6), reaching a frequency of 2.7 GHz.
  - Area and power: 43% (area) and 70% (power) lower than a traditional discrete design (Table I).
- Mixed-precision support:
  - Parameterized posit_decoder/posit_encoder modules implement different input/output bit widths (e.g., P(13,2) → P(16,2)).
- Configurability:
  - pdpu_pkg.sv defines the global parameters (e.g., n, es, N), and the generator adapts automatically (Section III-C of the paper).
4. Innovations and advantages
- Fused architecture: shared decode/encode logic (e.g., S1 and S6 reuse barrel_shifter.sv) reduces hardware redundancy.
- Dynamic precision handling: mantissa_norm.sv together with the configurable MANT_WIDTH balances accuracy against resources (Section III-B of the paper).
- High-throughput design: the pipeline, CSA tree, and Booth multiplier together achieve a 4.6× throughput improvement (Figure 6).
5. Potential improvement directions
- Special-value support: the current architecture does not explicitly handle posit's special value (±∞/NaR); detection logic would need to be added to the decoder.
- Wider bit widths: supporting P(32,2) would require CSA-tree optimization (the paper notes the overhead problem of large bit widths).
- Software co-design: combining with a mixed-precision training framework (such as PositNN) could further improve energy efficiency.
6. Summary
This architecture is the complete implementation of PDPU from the paper; through modular design and pipeline optimization it significantly improves the efficiency (area, power, speed) of posit dot-product operations. Its core value lies in:
- Open-source configurability: supports custom posit formats and dot-product sizes, suiting different DNN models.
- Hardware friendliness: the recursive CSA tree, Booth multiplier, and similar structures suit ASIC/FPGA implementation.
- Academic and industrial potential: provides a reliable building block for deploying posit arithmetic in AI accelerators.
In the posit number system, a format P(n, es) (such as P(13,2)) is defined by the posit standard itself, not invented by the paper's authors. A detailed explanation follows:
1. Definition in the posit standard
The posit number system was proposed by John Gustafson in 2017, and its format specification is publicly defined in the posit standard document. The core rules include:
- General format: P(n, es)
  - n: total number of bits (must be ≥ 2).
  - es: number of exponent bits (may be zero).
- Field layout: sign bit (1 bit) + regime (variable length) + exponent (es bits) + mantissa (remaining bits).
- Dynamic encoding: the length and value of the regime field are determined dynamically by the magnitude of the number.
Therefore, P(13,2) is a legal configuration permitted by the standard, not an invention of the paper.
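The standard's useed rule is easy to tabulate (a quick check, not library code):

```python
def useed(es: int) -> int:
    """useed = 2**(2**es), the regime scaling base of P(n, es)."""
    return 2 ** (2 ** es)

# Legal configurations, including the paper's P(13,2):
for n, es in [(8, 0), (16, 1), (13, 2), (32, 2)]:
    print(f"P({n},{es}): useed = {useed(es)}")
```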
2. Selection basis in the paper
The authors choose P(13,2) and P(16,2) as the input/output formats of the mixed-precision datapath based on the following considerations:
- Hardware efficiency:
  - A 13-bit input saves about 20% of multiplier area relative to 16 bits (Booth-multiplier resources grow roughly with the square of the bit width).
- Accuracy requirements:
  - Experiments (Table I of the paper) show that P(13,2) maintains accuracy similar to FP16 in DNNs, while P(16,2) accumulation results approach FP32.
- Dynamic range matching:
  - Posit's useed = 16 (since es = 2) covers the common distribution of DNN activation values (Figure 3).
3. Comparison with other posit configurations

Configuration | Source | Use |
---|---|---|
P(8,0) | Posit standard examples | Very low-precision embedded scenarios |
P(16,1) | SoftPosit library default | General computing |
P(13,2) | PDPU paper | Deep-learning input optimization |
P(32,2) | High-precision scientific computing | Applications needing a larger dynamic range |

All configurations comply with the posit standard; the paper simply selects the bit width best suited to DNN characteristics.
4. Why can n and es be customized?
One of the posit standard's core advantages is flexibility:
- Choice of n: balances accuracy against resources per application (e.g., edge devices may use P(8,0), servers P(32,2)).
- Choice of es:
  - es=0: simpler hardware (no explicit exponent field, suited to low power).
  - es=2: extended dynamic range (useed = 16, as in this paper).
The paper's P(13,2) is a practical application of this flexibility, not a modification of the standard.
5. Summary
- P(13,2) is a legal format supported by the posit standard; its definition comes from the official specification.
- The paper's innovations lie in:
  - The mixed-precision strategy: P(13,2) inputs with P(16,2) outputs.
  - Hardware optimization: efficient implementation through the six-stage pipeline and CSA trees.
- The design optimizes the performance/precision trade-off for deep learning without violating the standard.
For verification, see the posit standard documentation or open-source implementations (e.g., SoftPosit).
In the posit format P(13,2), the maximum and minimum representable values follow from its dynamic encoding rules. A detailed explanation:
1. General formula for a posit's value
A posit's value is determined by the following formula:
[
\text{Value} = (-1)^{\text{sign}} \times \text{useed}^k \times 2^e \times (1.\text{mantissa})
]
where:
- useed: defined by es as ( \text{useed} = 2^{2^{es}} ) (for es=2, ( \text{useed} = 2^{2^2} = 16 )).
- k: the regime value (the dynamic-range scaling factor).
- e: the value of the exponent field (for es=2, e ranges from 0 to 3).
2. Maximum representable value
(1) Parameter selection (assumed encoding)
- Regime value k:
  - Assume the regime field encodes k=3 (for example, 11110...: a run of four 1s followed by the terminating 0 gives k = 4 - 1 = 3).
  - Note: the largest k is limited by the total width n=13; when the regime fills the word, k reaches n - 2 = 11.
- Exponent e:
  - The exponent field is 11 (binary), i.e., e=3 (the maximum for es=2).
- Mantissa:
  - Set to all 1s (i.e., 1.111...); its contribution (a factor below 2) is ignored in this estimate.
(2) Calculation
[
\text{Max Value} = 16^3 \times 2^3 = 4096 \times 8 = 32768 = 2^{15}
]
Note: under the assumed encoding (k=3, e=3) the exact value is ( 2^{15} ); the ( 2^{20} ) figure sometimes quoted is only a rough estimate. The true maximum of P(13,2) (maxpos, reached when the regime fills the word with k = 11 and no exponent or mantissa bits remain) is ( \text{useed}^{11} = 16^{11} = 2^{44} ).
3. Minimum representable value (approximately \( 2^{-16} \))
(1) Parameter selection
- Regime value k:
  - The minimum assumed here is k=-4 (e.g., encoded as 00001..., four consecutive 0s before the terminating bit, giving k = -4).
- Exponent e:
  - The exponent field is 00, i.e., e=0.
- Mantissa:
  - Set to the minimum normalized value 1.000....
(2) Calculation
\[
\text{Min Value} = 16^{-4} \times 2^0 = \frac{1}{65536} \approx 1.53 \times 10^{-5} \approx 2^{-16}
\]
4. Why these values?
(1) Dynamic-range mechanism
- Regime field:
  - Variable-length runs of 0s or 1s implement exponential dynamic-range scaling (\( \text{useed}^k \)).
  - The larger \( |k| \), the more extreme the scaling (e.g., \( 16^3 \) or \( 16^{-4} \)).
- Exponent field:
  - Applies further linear scaling (\( 2^e \)) on top of \( \text{useed}^k \).
(2) Bit-width limit
- P(13,2) has only 13 bits in total, so:
  - The maximum k is limited by the longest run of 1s the regime field can hold.
  - The minimum k is limited by the run of 0s and the minimum mantissa width.
(3) Comparison with IEEE 754
- Posit's \( \text{useed}^k \) mechanism gives it a dynamic range far beyond IEEE 754 formats of the same bit width.
  - For example, the dynamic range of P(13,2) (\( \sim 2^{-16} \) to \( \sim 2^{15} \)) exceeds that of FP16 (\( \sim 2^{-14} \) to \( \sim 2^{15} \)) at the small end.
5. Summary
- Approximate maximum: dominated by \( \text{useed}^k \times 2^e \); with k=3 and e=3 it reaches \( 16^3 \times 8 \approx 2^{15} \).
- Approximate minimum: dominated by \( \text{useed}^k \); with k=-4 it is \( 16^{-4} \approx 2^{-16} \).
- Core advantage: the dynamic regime field gives Posit a wider dynamic range at a given bit width, which is especially suited to the non-uniform data distributions in deep learning.

An exact computation must account for the specific bit width (e.g., the limits on k when n=13), but the principle above is general.
In the Posit format P(13,2) (total bits n=13, exponent field es=2), the maximum regime value k=3 is determined jointly by the bit-width limit and the encoding rules. A detailed explanation:

1. Regime field encoding rules
The regime field consists of consecutive 0s or 1s, terminated by the first opposite bit. Its length and value k are computed as follows:
- If the regime begins with 1:
  - m consecutive 1s followed by one 0 → k = m - 1
  - Example: 1110... → m=3 → k=2
- If the regime begins with 0:
  - m consecutive 0s followed by one 1 → k = -m
  - Example: 0001... → m=3 → k=-3

Special termination:
- If the regime field fills all remaining bits (no terminating bit is encountered), k takes its maximum possible value.
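These rules can be sketched as a small decoder. The following Python function (illustrative, not from any standard library) takes the bit string after the sign bit and returns the regime value k plus the number of bits the regime consumed:

```python
# Decode the regime field per the rules above: a run of m identical bits,
# optionally followed by one terminating opposite bit.
#   run of 1s: k = m - 1    run of 0s: k = -m

def decode_regime(bits: str) -> tuple[int, int]:
    """Return (k, bits_consumed) for the regime at the start of `bits`."""
    lead = bits[0]
    m = 1
    while m < len(bits) and bits[m] == lead:
        m += 1
    # +1 for the terminating bit, unless the regime fills all remaining bits
    consumed = m + 1 if m < len(bits) else m
    k = (m - 1) if lead == "1" else -m
    return k, consumed

print(decode_regime("1110"))   # (2, 4): three 1s -> k = 3 - 1 = 2
print(decode_regime("0001"))   # (-3, 4): three 0s -> k = -3
print(decode_regime("11110"))  # (3, 5): four 1s -> k = 3
```

The last call shows the k=3 encoding discussed below: four leading 1s plus the terminating 0 consume 5 bits.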
2. Bit allocation in P(13,2)
A P(13,2) number is laid out as follows:
- Sign bit: 1 bit
- Regime field: variable length (at least 2 bits, at most the remaining bits)
- Exponent field: fixed 2 bits (es=2)
- Mantissa field: the remaining bits

Maximum-k scenario:
- The longest possible run of 1s maximizes k.
- For n=13, deducting the sign bit (1 bit) and the exponent field (2 bits) leaves 10 bits for the regime and the mantissa.
  - Minimum mantissa requirement: at least 1 mantissa bit (beyond the implicit 1., for minimum precision).
  - Regime maximum occupancy: 10 - 1 = 9 bits.
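The bit budget can be checked with a back-of-the-envelope helper. This sketch (function name illustrative) assumes a non-negative k encoded per the rule above as k+1 ones plus a terminating zero:

```python
# Mantissa bits left in P(n, es) for a given regime value k >= 0,
# assuming the regime is encoded as (k+1) ones plus one terminating zero.

def mantissa_bits(n: int, es: int, k: int) -> int:
    sign_bits = 1
    regime_bits = k + 2  # (k+1) leading 1s + one terminating 0
    return n - sign_bits - regime_bits - es

print(mantissa_bits(13, 2, 3))  # 5: regime '11110' (5 bits) leaves 5 mantissa bits
print(mantissa_bits(13, 2, 4))  # 4: regime '111110' (6 bits) leaves 4 mantissa bits
```

This makes the trade-off concrete: each additional unit of k costs one mantissa bit.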
Extreme case of the regime field
- Encoding: 1111111110 (nine 1s plus the terminating 0, 10 bits)
  - m = 9 → k = 9 - 1 = 8, leaving no mantissa bits at all.
- But the actual limit:
  - Because the bit width is finite, such extreme k values must be weighed against the useful range of \( \text{useed}^k \) and the mantissa precision they sacrifice.
  - For P(13,2), k=3 is a reasonable design choice (see the calculation below).
3. Why is k=3 a reasonable maximum?
(1) Hardware limits on the numerical range
- useed = 16 (since es=2), so \( \text{useed}^k = 16^k \).
- At k=3:
  - \( 16^3 = 4096 \); combined with exponent e=3 (\( 2^3 = 8 \)), the value is 4096 × 8 = 32768 ≈ \( 2^{15} \).
  - The mantissa then has few bits left, but the dynamic range suffices to cover most DNN requirements (Fig. 3).
- At k=4:
  - \( 16^4 = 65536 \), but the mantissa shrinks further, so precision drops sharply and hardware complexity rises.
(2) Balancing the bit-width allocation
- A larger k requires a longer regime field, squeezing out mantissa bits.
- With n=13:
  - At k=3 the regime field occupies 5 bits (11110), leaving 7 bits for the exponent (2 bits) and the mantissa (5 bits).
  - At k=4 the regime field requires 6 bits (111110), leaving only 4 mantissa bits; the precision loss is significant.
(3) Design choice in the paper
- The authors verified experimentally (Table I of the paper) that P(13,2) with k=3 meets the numerical-range requirements of DNNs while retaining sufficient mantissa precision.
- A higher k yields limited gains in model accuracy but increases hardware overhead.
4. Dynamic range comparison (P(13,2) vs. FP16)

| Format | Maximum positive value | Minimum positive value |
|---|---|---|
| P(13,2) | \( 16^3 \times 2^3 \approx 2^{15} \) | \( 16^{-4} \approx 2^{-16} \) |
| FP16 | \( 2^{15} \) | \( 2^{-14} \) |

- Advantage: Posit's minimum is smaller (\( 2^{-16} \) vs. \( 2^{-14} \)), which suits the near-zero gradients in DNNs.
- Cost: slightly less headroom at the top end, but DNNs rarely need extremely large values.
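A quick numeric check of the approximate bounds quoted in the table (these are the text's rough bounds, not exact maxpos/minpos values):

```python
# Approximate dynamic-range bounds quoted for P(13,2) vs. FP16.
posit_max = 16 ** 3 * 2 ** 3  # P(13,2) approx maximum: 16^3 * 2^3 = 32768 ≈ 2^15
posit_min = 16 ** -4          # P(13,2) approx minimum: 16^-4 = 2^-16
fp16_min = 2 ** -14           # smallest normal FP16 value

print(posit_max)              # 32768
print(posit_min < fp16_min)   # True: Posit reaches closer to zero than FP16
```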
5. Summary
- k=3 is a reasonable maximum for P(13,2), determined by:
  - The bit-width limit: the balance between regime and mantissa at n=13.
  - Hardware efficiency: avoiding an over-long regime field that starves the mantissa of precision.
  - Application needs: covering the typical numerical ranges of DNNs (verified experimentally in the paper).
- This choice complies with the Posit standard while optimizing the dynamic-range/precision trade-off for deep learning.