💻 Floating-Point & Fixed-Point

The fundamental problem we were addressing is: How do we represent and perform calculations with real, non-integer numbers (like 3.14159 or 0.000001) using a computer’s memory, which is designed to store only discrete integer values (binary bits)?

Digital systems cannot perfectly store every real number, so the core challenge is balancing the need for

range (how large/small a number can be) against
precision (how accurately the number is stored).

We discussed two primary methods for managing this challenge in computer architecture:

🎯 Fixed-Point Representation

Fixed-Point representation is a way of encoding real numbers by allocating a fixed, predetermined number of bits for the integer part and the fractional part.

It is called “fixed-point” because the binary point (the radix point) is implicitly defined to be at a specific location within the bit string.
It is conceptually similar to how we represent money in dollars and cents, where the decimal point is always fixed two places from the right.

Structure

A fixed-point number is typically represented as a bit-string: $N_{I}$ bits for the integer part and $N_{F}$ bits for the fractional part.
Notation: The format is often written as Q(I.F), where $I$ is the number of integer bits and $F$ is the number of fractional bits. A common signed format is Q(1.15) (1 sign bit, 15 fractional bits).
Value Calculation: The value $V$ of a fixed-point number is calculated as:

$$ V = \sum_{i=-N_F}^{N_I-1} b_i \cdot 2^i $$

Precision and Range

Range: The largest number that can be represented is determined by the number of integer bits, $N_I$.
- $V_{\max} = 2^{N_I} - 2^{-N_F}$
Precision (Resolution): The smallest difference between two representable numbers is determined by the number of fractional bits, $N_F$.
- $\text{Resolution} = 2^{-N_F}$
Key Trait: The hardware for fixed-point math is simpler and faster because there is no need to manage an exponent.

Use Cases

It is used in Digital Signal Processing (DSP) and embedded systems where speed and predictable precision are more important than a large dynamic range.
It is common in graphics and game development for representing vectors or coordinates where the scale is known and limited.

🚀 Floating-Point Representation

Floating-Point representation is a method of encoding real numbers by using a signed exponent to allow the binary point to “float” (move) to any position.

It is conceptually similar to scientific notation (e.g., $6.02 \times 10^{23}$), where the exponent shifts the decimal point.
This provides a massive dynamic range (very large numbers and very small numbers) at the cost of varying precision.
Most computers use the IEEE 754 Standard for floating-point arithmetic, which defines formats like Single-Precision (32-bit) and Double-Precision (64-bit).

Structure (IEEE 754)

A floating-point number is stored using three fields:

Sign bit (S): 1 bit, determines if the number is positive (0) or negative (1).
Exponent (E): $N_E$ bits, determines the magnitude (where the binary point is located). It is stored with a bias to allow for negative exponents.
Mantissa / Significand (M): $N_M$ bits, stores the precision digits. Since the most significant bit of the mantissa is always 1 for normalized numbers (e.g., $1.xxxxx$), this ‘1’ is often implicitly hidden, giving one extra bit of precision for free.

Value Calculation: The value $V$ of a floating-point number is calculated as:

$$ V = (-1)^S \times (1.\text{Mantissa}) \times 2^{\text{Exponent} - \text{Bias}} $$

Precision and Range

Standard	Total Bits	Sign (S)	Exponent (E)	Mantissa (M)	Dynamic Range (Approx)
Single-Precision (float)	32	1	8	23 (+1 implicit)	$\approx 10^{-38}$ to $10^{38}$
Double-Precision (double)	64	1	11	52 (+1 implicit)	$\approx 10^{-308}$ to $10^{308}$

Key Trait: The dynamic range is enormous, but precision decreases as the magnitude of the number increases.

Use Cases

It is the standard for general-purpose scientific and engineering calculations where numbers can vary widely in magnitude (e.g., physics simulations, finance, machine learning).
Used in languages like Python, Java, and C++ as the default type for non-integer numbers.

⚖️ Comparison Summary

Feature	Fixed-Point	Floating-Point
Precision	Uniform (same absolute error everywhere)	Relative (better for smaller numbers, worse for larger ones)
Dynamic Range	Limited by $N_I$ and $N_F$	Massive (determined by $N_E$)
Hardware	Simple, fast, low power	Complex, slower (requires specialized FPU)
Complexity	Simple implementation	Complex implementation (special cases for $\pm\text{Inf}$, NaN, denormalized)
Memory	Typically uses fewer bits for the same range as float	Typically requires more bits for the same resolution as fixed-point

Last updated on 2025-11-08

💥 Integer Overflow 📊 Base‑b Recursive Listing Method