Supporting suitable alignment width: In several designs [8], [19], the quire [33] format is adopted to represent the exact dot product of two posit vectors without rounding or overflow. However, the associated hardware overhead is prohibitive [34], since the intermediate operands are kept as quire values with a large bit width, consuming excessive computing resources in subsequent operations. By contrast, PDPU parameterizes the width of the aligned mantissa, i.e., Wm, which can be determined based on the distribution characteristics of inputs and DNN accuracy requirements. Configured with a suitable alignment width, PDPU minimizes hardware cost while meeting precision requirements.
This text discusses the advantages of PDPU's (Posit Dot-Product Unit) support for a suitable alignment width compared with the traditional quire format, and highlights how PDPU's configurability balances hardware overhead against computational accuracy. A detailed analysis follows:
1. Problem background: Limitations of traditional Quire formats
(1) The role of Quire format
- Purpose: In posit/floating-point vector dot-product operations, the quire is an extended-precision format used to store intermediate results exactly, avoiding repeated rounding errors and overflow.
- Advantage: Long dot products (such as the sum of element-wise products of two vectors) can be computed losslessly.
(2) Quire's hardware overhead problem
- Large bit width:
  - The quire must store the exact values of all intermediate results, so its bit width can be extremely large (e.g., hundreds of bits).
  - Example: when computing the dot product of two 8-dimensional posit vectors, the quire may require hundreds of bits to guarantee no precision loss.
- Resource consumption:
  - Large bit widths sharply increase the area and power of multipliers, adders, and storage cells (the "prohibitive overhead" noted in [34]).
  - Subsequent operations (such as activation functions and normalization) must then process extremely wide data, further degrading performance.
2. PDPU solution: parameterized alignment width (Wm)
(1) Core idea
- Give up the exact quire in favor of a dynamically configured alignment width (Wm):
  - Wm is the mantissa bit width after alignment in the dot-product operation, and is an adjustable parameter.
  - According to the input data distribution and the DNN's accuracy requirements, the smallest sufficient Wm is selected, rather than a fixed large bit width.
(2) Technical implementation
- Mantissa alignment and truncation:
  - When computing the dot product, only Wm bits of each aligned mantissa are retained; the bits beyond this width are truncated or rounded.
  - This is similar to the "align + round" step in floating-point addition, but with a configurable bit width.
- Parameterized design:
  - Wm can be configured through registers, for example set to 8 bits on edge devices and 16 bits on the server side.
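As a concrete illustration, here is a minimal Python sketch of the truncation idea (my own model, not code from the paper; the function name and widths are made up):

```python
def truncate_to_wm(mant: int, full_bits: int, wm: int) -> int:
    """Keep only the wm most-significant bits of a full_bits-wide
    fixed-point mantissa; the shifted-out low bits are simply dropped
    (a real design could round them instead of truncating)."""
    if wm >= full_bits:
        return mant  # nothing to discard
    return mant >> (full_bits - wm)

# A 12-bit product mantissa truncated to Wm = 8 bits:
m = 0b110101101011
print(bin(truncate_to_wm(m, 12, 8)))  # 0b11010110
```

In hardware, `wm` would come from a configuration register rather than a function argument.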
(3) Advantages
- Hardware efficiency:
  - Narrower multipliers and adders reduce area and power consumption (compared with the hundreds of bits of a quire).
- Controllable accuracy:
  - By analyzing the input data distribution (numerical range, sparsity) and the model's error tolerance, the smallest Wm that satisfies the accuracy target is selected.
- Flexibility:
  - The same hardware adapts to different scenarios (e.g., low-power mode Wm=8, high-precision mode Wm=16).
3. Comparative summary: quire vs. PDPU

Characteristic | Quire format | PDPU parameterized Wm |
---|---|---|
Accuracy | Exact (no rounding/overflow) | Approximate but controllable (depends on Wm configuration) |
Hardware overhead | Extremely high (wide computing units) | Low (configurable bit width) |
Applicable scenarios | Scientific computing requiring strict numerical fidelity | DNN inference that tolerates approximate computing |
Dynamic adjustment | Fixed bit width | Runtime-configurable Wm |
4. Worked example
Assume the input distribution of a DNN layer is as follows:
- Input value range: [-1.0, 1.0], with most values concentrated within ±0.1.
- Accuracy requirement: error tolerance ±0.001.
PDPU configuration policy:
- Analyze the mantissa requirement:
  - Small value range → high-order bits can be compressed.
  - Resolving ±0.1 requires about 7 mantissa bits ((2^{-7} \approx 0.008)); resolving ±0.001 requires about 10 bits ((2^{-10} \approx 0.001)).
- Select Wm=10:
  - Satisfies the accuracy target while staying far below the hundreds of bits of a quire.
- Hardware implementation:
  - The accumulation datapath narrows from the hundreds of bits of a quire (e.g., 256 bits) to about 10 bits, roughly a 25× reduction in adder width; the actual area saving depends on the unit.
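The bit counts above can be checked numerically; a small helper (hypothetical, not from the paper) that returns the smallest Wm whose quantization step 2^-Wm is within a given tolerance:

```python
import math

def min_wm(tolerance: float) -> int:
    """Smallest mantissa width whose step size 2**-wm is <= tolerance."""
    return math.ceil(-math.log2(tolerance))

print(min_wm(0.008))  # 7  (2**-7 ~= 0.0078)
print(min_wm(0.001))  # 10 (2**-10 ~= 0.00098)
```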
5. Related research support
- [8][19][33]: point out the accuracy advantages of the quire, but leave its hardware overhead unsolved.
- [34]: confirms that the quire's bit-width expansion makes the resource cost infeasible ("prohibitive").
- PDPU's innovation: data-driven Wm parameterization balances accuracy against hardware efficiency.
6. Summary
PDPU's parameterized alignment width (Wm) optimizes dot-product computation by:
- Avoiding excessive width: abandoning the quire's full exactness in favor of configurable mantissa alignment.
- Hardware friendliness: narrowing the computing units, reducing area and power consumption.
- Hardware/software co-design: dynamically adjusting Wm to the input distribution and model requirements for optimal efficiency.
This design is particularly suitable for DNN inference, where approximate computation and energy efficiency often matter more than exact results.
The origin of the quire format and its relationship to the IEEE 754 standard
1. The origin of the quire format
The quire format does not come from the IEEE 754 floating-point standard. It was introduced by John L. Gustafson in 2017 together with the posit number format (an alternative to IEEE 754 for representing real numbers) as an extended-precision accumulator. Its core purpose is to solve the precision-loss and overflow problems of floating-point/posit vector dot-product operations.
- Background:
  - IEEE 754 floating point introduces errors in chains of multiply-add operations due to repeated rounding and range limits (such as exponent overflow).
  - The posit format itself improves dynamic range through a variable-length exponent field (regime + exponent), but long dot products still require higher-precision intermediate storage.
  - The quire was designed as a companion extension to posit for the exact accumulation of dot-product intermediate results.
2. Comparison between the quire and IEEE 754

Characteristic | IEEE 754 standard | Quire format (posit) |
---|---|---|
System | Traditional floating-point standard (1985) | Extension of the posit format (proposed 2017) |
Design goal | General floating-point computing | Exact dot-product accumulation |
Bit-width flexibility | Fixed (e.g., 32/64 bits) | Very wide (up to hundreds of bits) |
Rounding control | Rounding after each step | Final rounding only |
Typical uses | General scientific computing | DNNs, high-precision vector operations |
3. Technical characteristics of the quire
- Very large bit width:
  - The quire's bit width is on the order of (N \times \text{posit bit width}) (for example, the dot product of two 8-dimensional 32-bit posit vectors requires roughly 256 bits of product storage; the standard quire for 32-bit posits is 512 bits).
  - This covers the exact representation of all intermediate results, avoiding overflow and rounding errors.
- Hardware overhead:
  - Dedicated registers and wide adders are required, causing a surge in area and power consumption (as indicated by [34]).
4. Why does IEEE 754 have no quire-like mechanism?
- Historical limitations:
  IEEE 754 was designed in the 1980s, before the long dot products of modern DNNs were a consideration; its standard operations (such as FMA, fused multiply-add) support only single-step rounding.
- Generality first:
  IEEE 754 targets common scenarios (such as general scientific computing), while the quire is optimized for one specific pattern (dot-product accumulation).
5. Alternatives to the quire
Because of the quire's high hardware cost, subsequent research proposed alternatives:
- PDPU's parameterized alignment width (Wm): dynamically configures the mantissa bit width to trade precision against overhead.
- Block floating point (BFP): shares exponent bits to reduce intermediate bit widths.
- Low-precision approximate computing: e.g., INT8 dot products with a scale factor (cf. GPU Tensor Cores).
6. Summary
- The quire is a dedicated extension of the posit format and does not belong to the IEEE 754 standard.
- It solves the need for exact dot products, but its high hardware overhead has prompted more efficient alternative designs such as PDPU.
- The limitations of IEEE 754 motivated the exploration of new formats such as posit/quire, but the two belong to different number-representation systems.
References:
- Gustafson, J. L., & Yonemoto, I. T. (2017). Beating Floating Point at its Own Game: Posit Arithmetic.
- IEEE 754-2019 Standard for Floating-Point Arithmetic.
Quire Basics
1. What is a quire?
The quire is a high-precision accumulator format designed for the posit number system (an alternative to IEEE 754 floating point for representing real numbers). It is used to compute dot products and long multiply-accumulate chains exactly, avoiding rounding errors and overflow in intermediate results.
2. Background of the quire
- IEEE 754 floating-point defect:
  In traditional floating point (such as FP32/FP64), each step of a chained multiply-add (such as ( \sum a_i b_i )) may introduce a rounding error, making the final result inaccurate.
  - Example: computing ((1.0 + 2^{-25}) - 1.0) in FP32 yields 0, because (2^{-25}) is below half an ulp of 1.0 and is rounded away.
- Posit's optimization:
  Posit improves dynamic range and accuracy through a variable-length exponent field (regime + exponent), but long dot products still require higher-precision intermediate storage.
- The quire's proposal:
  As a companion extension to posit, the quire provides an extra-wide accumulator that guarantees no rounding or overflow throughout the dot-product operation.
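The rounding loss is easy to reproduce in software. Python's built-in float is IEEE 754 double precision (52 fraction bits), so the same effect appears at 2^-53; exact rational arithmetic, playing the role a quire plays, preserves the term:

```python
from fractions import Fraction

# Double precision: 2**-53 is half an ulp of 1.0 and is rounded away.
lost = (1.0 + 2**-53) - 1.0
print(lost)  # 0.0

# Exact accumulation (the role a quire plays) keeps the tiny term.
kept = (Fraction(1) + Fraction(1, 2**53)) - Fraction(1)
print(kept == Fraction(1, 2**53))  # True
```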
3. Core features of the quire
(1) Very large bit width
- The quire's bit width is on the order of ( N \times \text{posit bit width} ).
  - For example: for the dot product of two 8-dimensional 32-bit posit vectors, the quire needs at least (8 \times 32 = 256) bits.
- Why so large?
  To guarantee the exact representation of all intermediate products (( a_i \times b_i )) and of the accumulated result, avoiding:
  - Overflow (exponent overflow)
  - Rounding error
(2) Final rounding only
- Traditional floating point: rounding after each multiply-add step (e.g., the FMA instruction).
- Quire: all intermediate results retain full precision; rounding occurs only when the final result is converted back to posit.
(3) Hardware implementation complexity
- Advantage: mathematically exact.
- Shortcomings:
  - Dedicated wide registers (e.g., 256/512 bits) are required.
  - Adder/multiplier area and power consumption are extremely high (the "prohibitive overhead" indicated by [34]).
4. Mathematical representation of the quire
The quire can be regarded as an extended-precision fixed-point number whose value is the exact sum of all intermediate products:
[
\text{Quire} = \sum_{i=0}^{N-1} a_i \times b_i
]
- No exponent field: only integer and fraction bits are represented, avoiding the complexity of floating-point alignment.
- Sign handling: accumulation of signed values is supported.
5. Hardware architecture of the quire
(1) Basic composition
- Input: products of posit operands ( a_i \times b_i ).
- Accumulation unit: a wide adder (e.g., 256 bits).
- Output: the final result, rounded to a posit or floating-point number.
(2) Workflow
- Product expansion: extend each ( a_i \times b_i ) to the full quire bit width (e.g., 256 bits).
- Exact accumulation: add all expanded values without intermediate rounding.
- Final rounding: convert the quire result to the target format (e.g., Posit32).
(3) Example
Compute the dot product of two 4-dimensional Posit8 vectors:
- Inputs: ( A = [a_0, a_1, a_2, a_3] ), ( B = [b_0, b_1, b_2, b_3] )
- Quire bit width: at least (4 \times 8 = 32) bits (in practice larger, to leave headroom against overflow).
- Accumulation:
[
\text{Quire} = a_0b_0 + a_1b_1 + a_2b_2 + a_3b_3 \quad (\text{full precision retained})
]
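This workflow can be modeled in a few lines of Python, using an arbitrary-precision integer as the quire (an illustrative model only; the FRAC_BITS width is an assumption, not the standard quire size):

```python
FRAC_BITS = 32  # assumed fixed-point position of the binary point

def to_fixed(x: float) -> int:
    """Scale a value onto the quire's fixed-point grid."""
    return round(x * (1 << FRAC_BITS))

def quire_dot(a, b) -> float:
    """Sum exact integer products with no intermediate rounding,
    then perform a single final conversion (the 'final rounding')."""
    acc = sum(to_fixed(x) * to_fixed(y) for x, y in zip(a, b))
    return acc / float(1 << (2 * FRAC_BITS))

print(quire_dot([0.5, 0.25], [0.5, 0.25]))  # 0.3125
```

Python integers grow as needed, which mimics a very wide accumulator register.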
6. Pros and cons of the quire

Advantages | Shortcomings |
---|---|
1. No rounding error (high precision) | 1. Large hardware overhead (wide computing units) |
2. No overflow risk (wide dynamic range) | 2. High power consumption |
3. Suitable for scientific computing / DNN training | 3. Slower (compared with low-precision units) |
7. Quire application scenarios
- High-precision numerical computing:
  - Scientific calculations requiring strict fidelity (such as climate simulation).
- Deep neural network training:
  - Avoiding accumulated error in gradient updates.
- Finance/cryptography:
  - Cases that are sensitive to rounding errors.
8. Alternatives to the quire
Because the quire's hardware cost is too high, modern designs tend toward:
- PDPU's parameterized alignment width (Wm): dynamically configures the mantissa bit width.
- Block floating point (BFP): shares the exponent to reduce bit width.
- Low-precision approximate computing: e.g., INT8 with a scaling factor (cf. GPU Tensor Cores).
9. Code example (pseudocode)
def dot_product_with_quire(A, B, N):
    quire = 0  # initialize the quire (in hardware, a very wide fixed-point register)
    for i in range(N):
        product = A[i] * B[i]  # posit multiplication (exact product)
        quire += product       # accumulate with no intermediate rounding
    return posit_round(quire)  # single final rounding back to posit
10. Summary
- The quire is posit arithmetic's extended-precision accumulator; it does not belong to IEEE 754.
- Through its very large bit width and lack of intermediate rounding it guarantees dot-product accuracy, but the hardware is expensive.
- It suits error-sensitive scenarios, but in fields such as DNN inference it is being replaced by more efficient approximate methods.
Study suggestions:
- First master the posit format (compare it against IEEE 754).
- Understand the error sources of dot-product operations (rounding, overflow).
- Analyze the quire's overhead from a hardware-design perspective (e.g., adder bit width).
Wm (aligned mantissa width), mentioned in the paper, is a key configurable parameter that dynamically controls the alignment precision of mantissas in dot-product computation, thereby balancing hardware overhead and computational accuracy. A precise definition and detailed description of Wm follows:
1. Clear definition of Wm
Wm is the bit width of the mantissa part that participates in the accumulation during a dot-product operation. Specifically:
- Object: the mantissa part of each intermediate product of the dot product (such as (a_i \times b_i)).
- Alignment logic: before accumulation, the mantissas of all intermediate products are aligned by exponent and truncated or extended to Wm bits.
- Dynamic configuration: the value of Wm can be adjusted according to the input data distribution or the DNN's accuracy requirements (for example, Wm = 8/12/16 bits).
Formula representation
Suppose two floating-point/posit numbers are multiplied to give the intermediate product (p_i = a_i \times b_i), expressed as:
[
p_i = (-1)^{s_i} \cdot m_i \cdot 2^{e_i}
]
where (m_i) is the mantissa (usually normalized to (1 \leq m_i < 2)). Before accumulation:
- Exponent alignment: adjust all (p_i) to the maximum exponent (e_{\text{max}}); the mantissas shift to (m_i' = m_i \cdot 2^{e_i - e_{\text{max}}}).
- Mantissa truncation: only Wm significant bits of each aligned mantissa (m_i') are retained; the shifted-out low-order bits are truncated or rounded.
2. Hardware implementation logic of Wm
(1) Mantissa alignment and truncation
- Input: the mantissas (m_i) and exponents (e_i) of the intermediate products (p_i).
- Steps:
  - Find the maximum (e_{\text{max}}) of all (e_i).
  - Shift each (m_i) right by (e_{\text{max}} - e_i) bits to obtain the aligned mantissa (m_i').
  - Keep Wm bits of (m_i'), truncating or rounding the shifted-out low-order bits.
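The three steps map directly onto a few lines of Python (a software model with invented names, not the RTL):

```python
def align_mantissas(products, wm):
    """products: (mantissa, exponent) pairs, mantissa as an unsigned int.
    Returns the shared exponent and the mantissas aligned to it,
    truncated to wm bits."""
    e_max = max(e for _, e in products)              # step 1: find e_max
    aligned = []
    for m, e in products:
        shifted = m >> (e_max - e)                   # step 2: right-shift
        aligned.append(shifted & ((1 << wm) - 1))    # step 3: keep wm bits
    return e_max, aligned

print(align_mantissas([(0b1101, 0), (0b1011, -2)], wm=4))  # (0, [13, 2])
```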
(2) Accumulator design
- Bit width: the accumulator width is (Wm + \log_2 N) (N is the dot-product block size), which guarantees no overflow.
  - For example, Wm = 12 bits, N = 16 → the accumulator needs 12 + 4 = 16 bits.
- Advantage: compared with the quire's fixed large bit width (e.g., 256 bits), Wm greatly reduces hardware resources.
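The sizing rule is a one-liner; a quick numerical check (the helper name is mine):

```python
import math

def acc_width(wm: int, n: int) -> int:
    """Accumulator bits for summing n terms of wm bits each:
    wm plus ceil(log2(n)) carry-growth bits."""
    return wm + math.ceil(math.log2(n))

print(acc_width(12, 16))  # 16, as in the example above
print(acc_width(10, 16))  # 14, matching the worked example later on
```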
(3) Dynamic configuration
- Runtime adjustment: Wm is configured through registers to suit different scenarios:
  - High-precision mode: Wm = 16 bits (suitable for training or sensitive layers).
  - Low-power mode: Wm = 8 bits (suitable for edge-device inference).
3. Comparison between Wm and the quire

Characteristic | Quire | Parameterized Wm design |
---|---|---|
Accuracy guarantee | Fully exact (no rounding) | Controllable approximation (depends on Wm) |
Bit width | Fixed and large (e.g., 256 bits) | Dynamically adjustable (e.g., 8/12/16 bits) |
Hardware overhead | Extremely high (wide adders/registers) | Low (bit width allocated as needed) |
Applicable scenarios | Scientific computing with strict fidelity | DNN inference (tolerates approximation) |
4. Basis for selecting Wm
(1) Data distribution analysis
- If the dynamic range of the input data is small (e.g., activations in [0, 1]), Wm can be smaller (e.g., 8 bits).
- If the data range is large (e.g., gradient updates), Wm must be increased (e.g., 16 bits).
(2) DNN accuracy requirements
- Classification tasks (higher error tolerance): Wm = 8-10 bits.
- Super-resolution/generative models (require high accuracy): Wm = 12-16 bits.
(3) Hardware constraints
- Area/power priority: choose the smallest Wm that meets the accuracy target.
- Throughput priority: increase Wm moderately to reduce the number of iterations.
5. Worked example
Assume the dot-product computation of a convolutional layer:
- Input: 16 pairs of 8-bit posit numbers ((a_i, b_i)), with values in [0.1, 1.0].
- Wm configuration:
  - Compute each (p_i = a_i \times b_i); the mantissas (m_i) lie in [1.0, 2.0).
  - The maximum exponent after alignment is (e_{\text{max}} = 0) (because (p_i \leq 1.0)).
  - Choosing Wm = 10 bits retains 10 bits of each aligned mantissa, bounding the rounding error by about (2^{-10} \approx 0.001).
- Result: the accumulated error is controllable, and the hardware only needs a 10 + 4 = 14-bit accumulator (far below the 128+ bits of a quire).
6. Summary
- The nature of Wm: the effective bit width of the aligned mantissas in a dot-product operation; a tuning knob between accuracy and hardware overhead.
- Core advantage: by dynamically configuring Wm, PDPU significantly reduces hardware resource consumption while preserving DNN accuracy.
- Design key: the optimal Wm must be chosen by combining statistical analysis of the data with the model's error tolerance.
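Putting the pieces together, a toy end-to-end model (my own sketch under simplifying assumptions: positive float inputs, truncation instead of rounding) shows how a modest Wm keeps the dot-product error small:

```python
import math

def wm_dot(a, b, wm):
    """Dot product with wm-bit aligned mantissas (positive inputs assumed)."""
    prods = [x * y for x, y in zip(a, b) if x * y != 0.0]
    if not prods:
        return 0.0
    decomp = [math.frexp(p) for p in prods]       # p = m * 2**e, 0.5 <= m < 1
    e_max = max(e for _, e in decomp)
    acc = 0
    for m, e in decomp:
        acc += int(m * (1 << wm)) >> (e_max - e)  # align and keep wm bits
    return acc * 2.0 ** (e_max - wm)

a, b = [0.9, 0.3, 0.5, 0.7], [0.2, 0.8, 0.4, 0.6]
exact = sum(x * y for x, y in zip(a, b))
print(abs(exact - wm_dot(a, b, wm=10)))  # error on the order of 2**-10
```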
In the PDPU (Posit Dot-Product Unit) paper, the Wm (aligned mantissa width) parameter is introduced mainly in the following places:
1. First explicit mention (Page 3, Section III-C)
In the "Supporting suitable alignment width" passage, the authors compare the hardware overhead of the traditional quire format and introduce Wm for the first time as an alternative:
Original citation:
"By contrast, PDPU parameterizes the width of aligned mantissa, i.e., ( W_m ), which can be determined based on distribution characteristics of inputs and DNN accuracy requirements. Configured with suitable alignment width, PDPU minimizes the hardware cost while meeting precision."
Key information:
- Definition: ( W_m ) is the width of the aligned mantissa, used to dynamically control how many mantissa bits are retained in dot-product operations.
- Effect: balances hardware overhead against computational accuracy by truncating mantissa bits beyond ( W_m ).
- Configuration basis: input data distribution and DNN accuracy requirements.
2. Experimental verification (Page 4, Section IV-A)
In the comparative-experiment section, the authors give a concrete value of (W_m) and its impact on accuracy and hardware efficiency:
Original citation:
"our mixed-precision PDPU with ( W_m=14 ) and ( N=4 ) achieves significant savings up to 43%, 64%, and 70% in area, delay, and power compared with the posit-based PACoGen DPU..."
"Note that inappropriate data formats or alignment width may result in 10% higher computational loss of accuracy..."
Key information:
- Typical value: the experiments use (W_m=14) bits with block size (N=4).
- Accuracy impact: an ill-chosen (W_m) can cause significant accuracy loss (on the order of 10%).
3. Technical background (Page 3, Section III-B)
Although ( W_m ) is not directly defined there, the "Fused and mixed-precision implementation" subsection explains its design motivation:
Original citation:
"PDPU is capable of mixed-precision computation... i.e., low precision for inputs and a slightly higher precision for dot-product results..."
Related interpretation:
- (W_m) is a key parameter for achieving mixed precision, allowing inputs (e.g., 8 bits) and accumulated results (e.g., 16 bits) to use different bit widths.
4. Hardware implementation (Page 2, Section III-A)
In the description of the "S3: Align" pipeline stage, the actual operation governed by (W_m) is implicit:
Original citation:
"The product results from S2 are aligned according to the difference between the respective exponent and ( e_{max} )..."
Related interpretation:
- The alignment stage truncates mantissas based on ( W_m ): ( W_m ) significant bits are retained, and the shifted-out low-order bits are discarded or rounded.
Summary: the complete definition of Wm
- Nature: a dynamically configurable mantissa bit width that controls how many mantissa bits are retained after alignment in a dot-product operation.
- Purpose: replaces the quire's full-precision accumulation, trading a controllable accuracy loss for hardware efficiency.
- Configuration logic:
  - Input analysis: choose the smallest sufficient (W_m) based on the data's dynamic range (e.g., the activation distribution).
  - Accuracy requirements: DNN task type (e.g., classification tasks tolerate larger errors).
  - Hardware mapping: (W_m) drives the bit-width design of the shifters and accumulator (e.g., (W_m=14) requires a 14-bit adder plus carry-growth bits).
Figures as an aid
Figure 4 (PDPU architecture) and Figure 5 (CSA tree) in the paper do not label ( W_m ) directly, but the "Align" stage and the recursive CSA-tree design reflect the constraints ( W_m ) places on mantissa processing.
This architecture is the complete hardware implementation of the posit dot-product unit (PDPU) proposed in the paper "PDPU: An Open-Source Posit Dot-Product Unit for Deep Learning Applications", including a combinational version (pdpu_top.sv) and a pipelined version (pdpu_top_pipelined.sv). An in-depth analysis combined with the paper follows:
1. Overall design objectives
- Core function: efficiently compute the dot product of two posit vectors (out = acc + V_a × V_b), with mixed-precision support (e.g., P(13,2) inputs and P(16,2) outputs).
- Key optimizations:
  - Fused computation: reduces redundant decode/encode operations (a discrete design requires 3N decoders; PDPU requires only 2N+1).
  - Six-stage pipeline: balances the critical paths and improves throughput (the paper reports a frequency of 2.7 GHz).
  - Dynamic range adaptation: posit's regime mechanism (tapered accuracy, Figure 3) matches the non-uniform distribution of DNN data.
2. Module-level analysis
(1) Top-level modules
- pdpu_top.sv: combinational implementation, suitable for low-latency scenarios.
- pdpu_top_pipelined.sv: the six-stage pipelined version, the main design in the paper. The stages are:
  - S1: Decode — posit_decoder.sv extracts the sign, exponent, and mantissa; it relies on a leading-zero counter (LZC) and barrel_shifter.sv (barrel shifter).
  - S2: Multiply — an improved radix4_booth_multiplier.sv (radix-4 Booth multiplier) computes the mantissa products; csa_tree.sv (carry-save adder tree) compresses the partial sums, reducing carry latency.
  - S3: Align — mantissas are aligned to the maximum exponent determined by comp_tree.sv (comparator tree).
  - S4: Accumulate — the recursive csa_tree.sv compresses the intermediate results, and a final addition produces the accumulated sum.
  - S5: Normalize — mantissa_norm.sv adjusts the mantissa and exponent (using the leading-zero counter and shifter).
  - S6: Encode — posit_encoder.sv packs the result into posit format.
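The six-stage dataflow can be sketched behaviorally in plain Python, substituting floats for posits (a loose analogy of the stage ordering, not the SystemVerilog):

```python
import math

def pdpu_like_dot(a, b, acc=0.0):
    """Behavioral analogy of the S1-S6 stages (floats stand in for posits)."""
    da = [math.frexp(x) for x in a]                       # S1 Decode
    db = [math.frexp(y) for y in b]
    prods = [(ma * mb, ea + eb)                           # S2 Multiply
             for (ma, ea), (mb, eb) in zip(da, db)]
    e_max = max(e for _, e in prods)                      # S3 Align
    aligned = [m * 2.0 ** (e - e_max) for m, e in prods]
    total = sum(aligned) + acc * 2.0 ** (-e_max)          # S4 Accumulate
    return math.ldexp(total, e_max)                       # S5/S6 Normalize + Encode

print(pdpu_like_dot([1.0, 2.0], [3.0, 4.0]))  # 11.0
```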
(2) Key submodules
- Posit codec:
  - posit_decoder.sv: dynamically parses the regime field (Equation (1) of the paper).
  - posit_encoder.sv: handles rounding (RNE mode) and dynamic bit-width adjustment.
- Arithmetic units:
  - Radix-4 Booth multiplier: booth_encoder.sv generates the partial products, which csa_tree.sv compresses (the paper reports a 43% area reduction).
  - CSA tree: a recursive structure (Figure 5) supporting variable dot-product sizes (e.g., N=4).
- Dynamic alignment and normalization:
  - comp_tree.sv: quickly determines the maximum exponent (critical-path optimization).
  - mantissa_norm.sv: combines a leading-zero counter and shifter for efficient normalization.
3. Relationship to the paper's experimental results
- Performance data:
  - The six-stage pipeline reduces the critical path from 0.8 ns to 0.37 ns (Figure 6), reaching a frequency of 2.7 GHz.
  - Area and power: 43% (area) and 70% (power) lower than a traditional discrete design (Table I).
- Mixed-precision support:
  - Parameterized posit_decoder/posit_encoder modules implement different input/output bit widths (e.g., P(13,2) → P(16,2)).
- Configurability:
  - pdpu_pkg.sv defines the global parameters (e.g., n, es, N), and the generator adapts automatically (Section III-C of the paper).
4. Innovations and advantages
- Fused architecture: shared decode/encode logic (e.g., S1 and S6 reuse barrel_shifter.sv) reduces hardware redundancy.
- Dynamic precision handling: mantissa_norm.sv together with the configurable MANT_WIDTH balances accuracy against resources (Section III-B of the paper).
- High-throughput design: the pipeline, CSA tree, and Booth multiplier together achieve a 4.6× throughput improvement (Figure 6).
5. Potential improvement directions
- Special-value support: the current architecture does not explicitly handle posit's special value (±∞/NaR); detection logic would need to be added to the decoder.
- Wider bit widths: supporting P(32,2) would require CSA-tree optimization (the paper notes the overhead problem of large bit widths).
- Software co-design: combining with a mixed-precision training framework (such as PositNN) could further improve energy efficiency.
6. Summary
This architecture is the complete implementation of PDPU from the paper; through modular design and pipeline optimization it significantly improves the efficiency (area, power, speed) of posit dot-product operations. Its core value lies in:
- Open-source configurability: supports custom posit formats and dot-product sizes, suiting different DNN models.
- Hardware friendliness: the recursive CSA tree, Booth multiplier, and similar structures suit ASIC/FPGA implementation.
- Academic and industrial potential: provides a reliable building block for deploying posit arithmetic in AI accelerators.
In the posit number system, a format P(n, es) (such as P(13,2)) is defined by the posit standard itself, not invented by the paper's authors. A detailed explanation follows:
1. Definition in the posit standard
The posit number system was proposed by John Gustafson in 2017, and its format specification is publicly defined in the posit standard document. The core rules include:
- General format: P(n, es)
  - n: total number of bits (must be ≥ 2).
  - es: number of exponent bits (may be zero).
- Field layout: sign bit (1 bit) + regime (variable length) + exponent (es bits) + mantissa (remaining bits).
- Dynamic encoding: the length and value of the regime field are determined dynamically by the magnitude of the number.
Therefore, P(13,2) is a legal configuration permitted by the standard, not an invention of the paper.
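The standard's useed rule is easy to tabulate (a quick check, not library code):

```python
def useed(es: int) -> int:
    """useed = 2**(2**es), the regime scaling base of P(n, es)."""
    return 2 ** (2 ** es)

# Legal configurations, including the paper's P(13,2):
for n, es in [(8, 0), (16, 1), (13, 2), (32, 2)]:
    print(f"P({n},{es}): useed = {useed(es)}")
```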
2. Selection basis in the paper
The authors choose P(13,2) and P(16,2) as the input/output formats of the mixed-precision datapath based on the following considerations:
- Hardware efficiency:
  - A 13-bit input saves about 20% of multiplier area relative to 16 bits (Booth-multiplier resources grow roughly with the square of the bit width).
- Accuracy requirements:
  - Experiments (Table I of the paper) show that P(13,2) maintains accuracy similar to FP16 in DNNs, while P(16,2) accumulation results approach FP32.
- Dynamic range matching:
  - Posit's useed = 16 (since es = 2) covers the common distribution of DNN activation values (Figure 3).
3. Comparison with other posit configurations

Configuration | Source | Use |
---|---|---|
P(8,0) | Posit standard examples | Very low-precision embedded scenarios |
P(16,1) | SoftPosit library default | General computing |
P(13,2) | PDPU paper | Deep-learning input optimization |
P(32,2) | High-precision scientific computing | Applications needing a larger dynamic range |

All configurations comply with the posit standard; the paper simply selects the bit width best suited to DNN characteristics.
4. Why can n and es be customized?
One of the posit standard's core advantages is flexibility:
- Choice of n: balances accuracy against resources per application (e.g., edge devices may use P(8,0), servers P(32,2)).
- Choice of es:
  - es=0: simpler hardware (no explicit exponent field, suited to low power).
  - es=2: extended dynamic range (useed = 16, as in this paper).
The paper's P(13,2) is a practical application of this flexibility, not a modification of the standard.
5. Summary
- P(13,2) is a legal format supported by the posit standard; its definition comes from the official specification.
- The paper's innovations lie in:
  - The mixed-precision strategy: P(13,2) inputs with P(16,2) outputs.
  - Hardware optimization: efficient implementation through the six-stage pipeline and CSA trees.
- The design optimizes the performance/precision trade-off for deep learning without violating the standard.
For verification, see the posit standard documentation or open-source implementations (e.g., SoftPosit).
In the posit format P(13,2), the maximum and minimum representable values follow from its dynamic encoding rules. A detailed explanation:
1. General formula for a posit's value
A posit's value is determined by the following formula:
[
\text{Value} = (-1)^{\text{sign}} \times \text{useed}^k \times 2^e \times (1.\text{mantissa})
]
where:
- useed: defined by es as ( \text{useed} = 2^{2^{es}} ) (for es=2, ( \text{useed} = 2^{2^2} = 16 )).
- k: the regime value (the dynamic-range scaling factor).
- e: the value of the exponent field (for es=2, e ranges from 0 to 3).
2. Maximum representable value
(1) Parameter selection (assumed encoding)
- Regime value k:
  - Assume the regime field encodes k=3 (for example, 11110...: a run of four 1s followed by the terminating 0 gives k = 4 - 1 = 3).
  - Note: the largest k is limited by the total width n=13; when the regime fills the word, k reaches n - 2 = 11.
- Exponent e:
  - The exponent field is 11 (binary), i.e., e=3 (the maximum for es=2).
- Mantissa:
  - Set to all 1s (i.e., 1.111...); its contribution (a factor below 2) is ignored in this estimate.
(2) Calculation
[
\text{Max Value} = 16^3 \times 2^3 = 4096 \times 8 = 32768 = 2^{15}
]
Note: under the assumed encoding (k=3, e=3) the exact value is ( 2^{15} ); the ( 2^{20} ) figure sometimes quoted is only a rough estimate. The true maximum of P(13,2) (maxpos, reached when the regime fills the word with k = 11 and no exponent or mantissa bits remain) is ( \text{useed}^{11} = 16^{11} = 2^{44} ).
3. Minimum representable value (approximately \( 2^{-16} \))
(1) Parameter selection
- Regime value k:
  - The minimum assumed here is k=-4 (e.g., encoded as 00001..., four consecutive 0s before the terminating bit, giving k = -4).
- Exponent e:
  - The exponent field is 00, i.e., e=0.
- Mantissa:
  - Set to the minimum normalized value 1.000....
(2) Calculation
\[
\text{Min Value} = 16^{-4} \times 2^0 = \frac{1}{65536} \approx 1.53 \times 10^{-5} \approx 2^{-16}
\]
4. Why these values?
(1) Dynamic-range mechanism
- Regime field:
  - Variable-length runs of 0s or 1s implement exponential dynamic-range scaling (\( \text{useed}^k \)).
  - The larger \( |k| \), the more extreme the scaling (e.g., \( 16^3 \) or \( 16^{-4} \)).
- Exponent field:
  - Applies further linear scaling (\( 2^e \)) on top of \( \text{useed}^k \).
(2) Bit-width limit
- P(13,2) has only 13 bits in total, so:
  - The maximum k is limited by the longest run of 1s the regime field can hold.
  - The minimum k is limited by the run of 0s and the minimum mantissa width.
(3) Comparison with IEEE 754
- Posit's \( \text{useed}^k \) mechanism gives it a dynamic range far beyond IEEE 754 formats of the same bit width.
  - For example, the dynamic range of P(13,2) (\( \sim 2^{-16} \) to \( \sim 2^{15} \)) exceeds that of FP16 (\( \sim 2^{-14} \) to \( \sim 2^{15} \)) at the small end.
5. Summary
- Approximate maximum: dominated by \( \text{useed}^k \times 2^e \); with k=3 and e=3 it reaches \( 16^3 \times 8 \approx 2^{15} \).
- Approximate minimum: dominated by \( \text{useed}^k \); with k=-4 it is \( 16^{-4} \approx 2^{-16} \).
- Core advantage: the dynamic regime field gives Posit a wider dynamic range at a given bit width, which is especially suited to the non-uniform data distributions in deep learning.

An exact computation must account for the specific bit width (e.g., the limits on k when n=13), but the principle above is general.
In the Posit format P(13,2) (total bits n=13, exponent field es=2), the maximum regime value k=3 is determined jointly by the bit-width limit and the encoding rules. A detailed explanation:

1. Regime field encoding rules
The regime field consists of consecutive 0s or 1s, terminated by the first opposite bit. Its length and value k are computed as follows:
- If the regime begins with 1:
  - m consecutive 1s followed by one 0 → k = m - 1
  - Example: 1110... → m=3 → k=2
- If the regime begins with 0:
  - m consecutive 0s followed by one 1 → k = -m
  - Example: 0001... → m=3 → k=-3

Special termination:
- If the regime field fills all remaining bits (no terminating bit is encountered), k takes its maximum possible value.
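These rules can be sketched as a small decoder. The following Python function (illustrative, not from any standard library) takes the bit string after the sign bit and returns the regime value k plus the number of bits the regime consumed:

```python
# Decode the regime field per the rules above: a run of m identical bits,
# optionally followed by one terminating opposite bit.
#   run of 1s: k = m - 1    run of 0s: k = -m

def decode_regime(bits: str) -> tuple[int, int]:
    """Return (k, bits_consumed) for the regime at the start of `bits`."""
    lead = bits[0]
    m = 1
    while m < len(bits) and bits[m] == lead:
        m += 1
    # +1 for the terminating bit, unless the regime fills all remaining bits
    consumed = m + 1 if m < len(bits) else m
    k = (m - 1) if lead == "1" else -m
    return k, consumed

print(decode_regime("1110"))   # (2, 4): three 1s -> k = 3 - 1 = 2
print(decode_regime("0001"))   # (-3, 4): three 0s -> k = -3
print(decode_regime("11110"))  # (3, 5): four 1s -> k = 3
```

The last call shows the k=3 encoding discussed below: four leading 1s plus the terminating 0 consume 5 bits.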
2. Bit allocation in P(13,2)
A P(13,2) number is laid out as follows:
- Sign bit: 1 bit
- Regime field: variable length (at least 2 bits, at most the remaining bits)
- Exponent field: fixed 2 bits (es=2)
- Mantissa field: the remaining bits

Maximum-k scenario:
- The longest possible run of 1s maximizes k.
- For n=13, deducting the sign bit (1 bit) and the exponent field (2 bits) leaves 10 bits for the regime and the mantissa.
  - Minimum mantissa requirement: at least 1 mantissa bit (beyond the implicit 1., for minimum precision).
  - Regime maximum occupancy: 10 - 1 = 9 bits.
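The bit budget can be checked with a back-of-the-envelope helper. This sketch (function name illustrative) assumes a non-negative k encoded per the rule above as k+1 ones plus a terminating zero:

```python
# Mantissa bits left in P(n, es) for a given regime value k >= 0,
# assuming the regime is encoded as (k+1) ones plus one terminating zero.

def mantissa_bits(n: int, es: int, k: int) -> int:
    sign_bits = 1
    regime_bits = k + 2  # (k+1) leading 1s + one terminating 0
    return n - sign_bits - regime_bits - es

print(mantissa_bits(13, 2, 3))  # 5: regime '11110' (5 bits) leaves 5 mantissa bits
print(mantissa_bits(13, 2, 4))  # 4: regime '111110' (6 bits) leaves 4 mantissa bits
```

This makes the trade-off concrete: each additional unit of k costs one mantissa bit.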
Extreme case of the regime field
- Encoding: 1111111110 (nine 1s plus the terminating 0, 10 bits)
  - m = 9 → k = 9 - 1 = 8, leaving no mantissa bits at all.
- But the actual limit:
  - Because the bit width is finite, such extreme k values must be weighed against the useful range of \( \text{useed}^k \) and the mantissa precision they sacrifice.
  - For P(13,2), k=3 is a reasonable design choice (see the calculation below).
3. Why is k=3 a reasonable maximum?
(1) Hardware limits on the numerical range
- useed = 16 (since es=2), so \( \text{useed}^k = 16^k \).
- At k=3:
  - \( 16^3 = 4096 \); combined with exponent e=3 (\( 2^3 = 8 \)), the value is 4096 × 8 = 32768 ≈ \( 2^{15} \).
  - The mantissa then has few bits left, but the dynamic range suffices to cover most DNN requirements (Fig. 3).
- At k=4:
  - \( 16^4 = 65536 \), but the mantissa shrinks further, so precision drops sharply and hardware complexity rises.
(2) Balancing the bit-width allocation
- A larger k requires a longer regime field, squeezing out mantissa bits.
- With n=13:
  - At k=3 the regime field occupies 5 bits (11110), leaving 7 bits for the exponent (2 bits) and the mantissa (5 bits).
  - At k=4 the regime field requires 6 bits (111110), leaving only 4 mantissa bits; the precision loss is significant.
(3) Design choice in the paper
- The authors verified experimentally (Table I of the paper) that P(13,2) with k=3 meets the numerical-range requirements of DNNs while retaining sufficient mantissa precision.
- A higher k yields limited gains in model accuracy but increases hardware overhead.
4. Dynamic range comparison (P(13,2) vs. FP16)

| Format | Maximum positive value | Minimum positive value |
|---|---|---|
| P(13,2) | \( 16^3 \times 2^3 \approx 2^{15} \) | \( 16^{-4} \approx 2^{-16} \) |
| FP16 | \( 2^{15} \) | \( 2^{-14} \) |

- Advantage: Posit's minimum is smaller (\( 2^{-16} \) vs. \( 2^{-14} \)), which suits the near-zero gradients in DNNs.
- Cost: slightly less headroom at the top end, but DNNs rarely need extremely large values.
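A quick numeric check of the approximate bounds quoted in the table (these are the text's rough bounds, not exact maxpos/minpos values):

```python
# Approximate dynamic-range bounds quoted for P(13,2) vs. FP16.
posit_max = 16 ** 3 * 2 ** 3  # P(13,2) approx maximum: 16^3 * 2^3 = 32768 ≈ 2^15
posit_min = 16 ** -4          # P(13,2) approx minimum: 16^-4 = 2^-16
fp16_min = 2 ** -14           # smallest normal FP16 value

print(posit_max)              # 32768
print(posit_min < fp16_min)   # True: Posit reaches closer to zero than FP16
```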
5. Summary
- k=3 is a reasonable maximum for P(13,2), determined by:
  - The bit-width limit: the balance between regime and mantissa at n=13.
  - Hardware efficiency: avoiding an over-long regime field that starves the mantissa of precision.
  - Application needs: covering the typical numerical ranges of DNNs (verified experimentally in the paper).
- This choice complies with the Posit standard while optimizing the dynamic-range/precision trade-off for deep learning.