Skip to content
๐Ÿ’ป Floating-Point & Fixed-Point

๐Ÿ’ป Floating-Point & Fixed-Point

The fundamental problem we were addressing is: How do we represent and perform calculations with real, non-integer numbers (like 3.14159 or 0.000001) using a computer’s memory, which is designed to store only discrete integer values (binary bits)?

Digital systems cannot perfectly store every real number, so the core challenge is balancing the need for

  • range (how large/small a number can be) against
  • precision (how accurately the number is stored).

We discussed two primary methods for managing this challenge in computer architecture:


๐ŸŽฏ Fixed-Point Representation

Fixed-Point representation is a way of encoding real numbers by allocating a fixed, predetermined number of bits for the integer part and the fractional part.

  • It is called “fixed-point” because the binary point (the radix point) is implicitly defined to be at a specific location within the bit string.
  • It is conceptually similar to how we represent money in dollars and cents, where the decimal point is always fixed two places from the right.

Structure

  • A fixed-point number is typically represented as a bit-string: $N_{I}$ bits for the integer part and $N_{F}$ bits for the fractional part.
  • Notation: The format is often written as Q(I.F), where $I$ is the number of integer bits and $F$ is the number of fractional bits. A common signed format is Q(1.15) (1 sign bit, 15 fractional bits).
  • Value Calculation: The value $V$ of a fixed-point number is calculated as:
$$ V = \sum_{i=-N_F}^{N_I-1} b_i \cdot 2^i $$

Precision and Range

  • Range: The largest number that can be represented is determined by the number of integer bits, $N_I$.
    • $V_{\max} = 2^{N_I} - 2^{-N_F}$
  • Precision (Resolution): The smallest difference between two representable numbers is determined by the number of fractional bits, $N_F$.
    • $\text{Resolution} = 2^{-N_F}$
  • Key Trait: The hardware for fixed-point math is simpler and faster because there is no need to manage an exponent.

Use Cases

  • It is used in Digital Signal Processing (DSP) and embedded systems where speed and predictable precision are more important than a large dynamic range.
  • It is common in graphics and game development for representing vectors or coordinates where the scale is known and limited.

๐Ÿš€ Floating-Point Representation

Floating-Point representation is a method of encoding real numbers by using a signed exponent to allow the binary point to “float” (move) to any position.

  • It is conceptually similar to scientific notation (e.g., $6.02 \times 10^{23}$), where the exponent shifts the decimal point.
  • This provides a massive dynamic range (very large numbers and very small numbers) at the cost of varying precision.
  • Most computers use the IEEE 754 Standard for floating-point arithmetic, which defines formats like Single-Precision (32-bit) and Double-Precision (64-bit).

Structure (IEEE 754)

A floating-point number is stored using three fields:

  1. Sign bit (S): 1 bit, determines if the number is positive (0) or negative (1).
  2. Exponent (E): $N_E$ bits, determines the magnitude (where the binary point is located). It is stored with a bias to allow for negative exponents.
  3. Mantissa / Significand (M): $N_M$ bits, stores the precision digits. Since the most significant bit of the mantissa is always 1 for normalized numbers (e.g., $1.xxxxx$), this ‘1’ is often implicitly hidden, giving one extra bit of precision for free.
  • Value Calculation: The value $V$ of a floating-point number is calculated as:
$$ V = (-1)^S \times (1.\text{Mantissa}) \times 2^{\text{Exponent} - \text{Bias}} $$

Precision and Range

StandardTotal BitsSign (S)Exponent (E)Mantissa (M)Dynamic Range (Approx)
Single-Precision (float)321823 (+1 implicit)$\approx 10^{-38}$ to $10^{38}$
Double-Precision (double)6411152 (+1 implicit)$\approx 10^{-308}$ to $10^{308}$
  • Key Trait: The dynamic range is enormous, but precision decreases as the magnitude of the number increases.

Use Cases

  • It is the standard for general-purpose scientific and engineering calculations where numbers can vary widely in magnitude (e.g., physics simulations, finance, machine learning).
  • Used in languages like Python, Java, and C++ as the default type for non-integer numbers.

โš–๏ธ Comparison Summary

FeatureFixed-PointFloating-Point
PrecisionUniform (same absolute error everywhere)Relative (better for smaller numbers, worse for larger ones)
Dynamic RangeLimited by $N_I$ and $N_F$Massive (determined by $N_E$)
HardwareSimple, fast, low powerComplex, slower (requires specialized FPU)
ComplexitySimple implementationComplex implementation (special cases for $\pm\text{Inf}$, NaN, denormalized)
MemoryTypically uses fewer bits for the same range as floatTypically requires more bits for the same resolution as fixed-point
Last updated on