๐ป Floating-Point & Fixed-Point
The fundamental problem we were addressing is: How do we represent and perform calculations with real, non-integer numbers (like 3.14159 or 0.000001) using a computer’s memory, which is designed to store only discrete integer values (binary bits)?
Digital systems cannot perfectly store every real number, so the core challenge is balancing the need for
- range (how large/small a number can be) against
- precision (how accurately the number is stored).
We discussed two primary methods for managing this challenge in computer architecture:
๐ฏ Fixed-Point Representation
Fixed-Point representation is a way of encoding real numbers by allocating a fixed, predetermined number of bits for the integer part and the fractional part.
- It is called “fixed-point” because the binary point (the radix point) is implicitly defined to be at a specific location within the bit string.
- It is conceptually similar to how we represent money in dollars and cents, where the decimal point is always fixed two places from the right.
Structure
- A fixed-point number is typically represented as a bit-string: $N_{I}$ bits for the integer part and $N_{F}$ bits for the fractional part.
- Notation: The format is often written as Q(I.F), where $I$ is the number of integer bits and $F$ is the number of fractional bits. A common signed format is Q(1.15) (1 sign bit, 15 fractional bits).
- Value Calculation: The value $V$ of a fixed-point number is calculated as:
Precision and Range
- Range: The largest number that can be represented is determined by the number of integer bits, $N_I$.
- $V_{\max} = 2^{N_I} - 2^{-N_F}$
- Precision (Resolution): The smallest difference between two representable numbers is determined by the number of fractional bits, $N_F$.
- $\text{Resolution} = 2^{-N_F}$
- Key Trait: The hardware for fixed-point math is simpler and faster because there is no need to manage an exponent.
Use Cases
- It is used in Digital Signal Processing (DSP) and embedded systems where speed and predictable precision are more important than a large dynamic range.
- It is common in graphics and game development for representing vectors or coordinates where the scale is known and limited.
๐ Floating-Point Representation
Floating-Point representation is a method of encoding real numbers by using a signed exponent to allow the binary point to “float” (move) to any position.
- It is conceptually similar to scientific notation (e.g., $6.02 \times 10^{23}$), where the exponent shifts the decimal point.
- This provides a massive dynamic range (very large numbers and very small numbers) at the cost of varying precision.
- Most computers use the IEEE 754 Standard for floating-point arithmetic, which defines formats like Single-Precision (32-bit) and Double-Precision (64-bit).
Structure (IEEE 754)
A floating-point number is stored using three fields:
- Sign bit (S): 1 bit, determines if the number is positive (0) or negative (1).
- Exponent (E): $N_E$ bits, determines the magnitude (where the binary point is located). It is stored with a bias to allow for negative exponents.
- Mantissa / Significand (M): $N_M$ bits, stores the precision digits. Since the most significant bit of the mantissa is always 1 for normalized numbers (e.g., $1.xxxxx$), this ‘1’ is often implicitly hidden, giving one extra bit of precision for free.
- Value Calculation: The value $V$ of a floating-point number is calculated as:
Precision and Range
| Standard | Total Bits | Sign (S) | Exponent (E) | Mantissa (M) | Dynamic Range (Approx) |
|---|---|---|---|---|---|
| Single-Precision (float) | 32 | 1 | 8 | 23 (+1 implicit) | $\approx 10^{-38}$ to $10^{38}$ |
| Double-Precision (double) | 64 | 1 | 11 | 52 (+1 implicit) | $\approx 10^{-308}$ to $10^{308}$ |
- Key Trait: The dynamic range is enormous, but precision decreases as the magnitude of the number increases.
Use Cases
- It is the standard for general-purpose scientific and engineering calculations where numbers can vary widely in magnitude (e.g., physics simulations, finance, machine learning).
- Used in languages like Python, Java, and C++ as the default type for non-integer numbers.
โ๏ธ Comparison Summary
| Feature | Fixed-Point | Floating-Point |
|---|---|---|
| Precision | Uniform (same absolute error everywhere) | Relative (better for smaller numbers, worse for larger ones) |
| Dynamic Range | Limited by $N_I$ and $N_F$ | Massive (determined by $N_E$) |
| Hardware | Simple, fast, low power | Complex, slower (requires specialized FPU) |
| Complexity | Simple implementation | Complex implementation (special cases for $\pm\text{Inf}$, NaN, denormalized) |
| Memory | Typically uses fewer bits for the same range as float | Typically requires more bits for the same resolution as fixed-point |