Skip to content
๐ŸŒ UTF-8

๐ŸŒ UTF-8

ASCII was elegantโ€”but limited. UTF-8 solves the multilingual bottleneck while preserving backward compatibility. Itโ€™s the default encoding for the web, APIs, and most modern systems. Understanding its byte structure is essential for debugging, protocol design, and semantic integrity.

๐Ÿง  Core Principles

  • Variable-Length Encoding: UTF-8 uses 1โ€“4 bytes per character.
  • ASCII-Compatible: First 128 characters (0โ€“127) are identical to ASCII.
  • Self-Synchronizing: No byte ambiguityโ€”each byte signals its role.
  • Endian-Agnostic: Unlike UTF-16/UTF-32, byte order doesnโ€™t matter.

๐Ÿงฎ Byte Structure

Byte CountRange (Hex)Binary PrefixCode Point RangeExample Char
1 byte00โ€“7F0xxxxxxxU+0000 to U+007FA, 1, !
2 bytesC2โ€“DF + 80โ€“BF110xxxxx 10xxxxxxU+0080 to U+07FFรฉ, รง
3 bytesE0โ€“EF + 2x 80โ€“BF1110xxxx 10xxxxxx 10xxxxxxU+0800 to U+FFFFไธญ, ฯ€
4 bytesF0โ€“F4 + 3x 80โ€“BF11110xxx 10xxxxxx 10xxxxxx 10xxxxxxU+10000 to U+10FFFF๐Ÿง , ๐Ÿ‰

๐Ÿ” Each start byte identifies the byte-length of the character ๐Ÿ” Each continuation byte starts with 10, ensuring no overlap with leading bytes.

๐Ÿงญ Semantic Zones

  • BMP (Basic Multilingual Plane): U+0000 to U+FFFF โ€” most common characters.
  • Supplementary Planes: U+10000+ โ€” emojis, historic scripts, rare symbols.
  • Control & Format Characters: Invisible but critical (e.g., ZWJ, BOM).

๐Ÿงฐ Use Cases

  • Web Pages: HTML defaults to UTF-8 (<meta charset="UTF-8">)
  • APIs & JSON: UTF-8 ensures consistent parsing across platforms.
  • Filesystem & CLI: UTF-8 filenames, logs, and shell output.
  • Multilingual Apps: Enables global character support without encoding chaos.

๐Ÿงฉ UTF-8 vs Other Encodings

FeatureUTF-8UTF-16ASCII
Bit Width8โ€“32 bits16โ€“32 bits7 bits
CompatibilityASCII subsetNot ASCIILegacy only
EndiannessIrrelevantRequired (BOM)N/A
Size EfficiencyHigh for EnglishHigh for Asian scriptsMinimal
Last updated on