๐ UTF-8
ASCII was elegantโbut limited. UTF-8 solves the multilingual bottleneck while preserving backward compatibility. Itโs the default encoding for the web, APIs, and most modern systems. Understanding its byte structure is essential for debugging, protocol design, and semantic integrity.
๐ง Core Principles
- Variable-Length Encoding: UTF-8 uses 1โ4 bytes per character.
- ASCII-Compatible: First 128 characters (0โ127) are identical to ASCII.
- Self-Synchronizing: No byte ambiguityโeach byte signals its role.
- Endian-Agnostic: Unlike UTF-16/UTF-32, byte order doesnโt matter.
๐งฎ Byte Structure
| Byte Count | Range (Hex) | Binary Prefix | Code Point Range | Example Char |
|---|---|---|---|---|
| 1 byte | 00โ7F | 0xxxxxxx | U+0000 to U+007F | A, 1, ! |
| 2 bytes | C2โDF + 80โBF | 110xxxxx 10xxxxxx | U+0080 to U+07FF | รฉ, รง |
| 3 bytes | E0โEF + 2x 80โBF | 1110xxxx 10xxxxxx 10xxxxxx | U+0800 to U+FFFF | ไธญ, ฯ |
| 4 bytes | F0โF4 + 3x 80โBF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | U+10000 to U+10FFFF | ๐ง , ๐ |
๐ Each start byte identifies the byte-length of the character
๐ Each continuation byte starts with 10, ensuring no overlap with leading bytes.
๐งญ Semantic Zones
- BMP (Basic Multilingual Plane): U+0000 to U+FFFF โ most common characters.
- Supplementary Planes: U+10000+ โ emojis, historic scripts, rare symbols.
- Control & Format Characters: Invisible but critical (e.g.,
ZWJ,BOM).
๐งฐ Use Cases
- Web Pages: HTML defaults to UTF-8 (
<meta charset="UTF-8">) - APIs & JSON: UTF-8 ensures consistent parsing across platforms.
- Filesystem & CLI: UTF-8 filenames, logs, and shell output.
- Multilingual Apps: Enables global character support without encoding chaos.
๐งฉ UTF-8 vs Other Encodings
| Feature | UTF-8 | UTF-16 | ASCII |
|---|---|---|---|
| Bit Width | 8โ32 bits | 16โ32 bits | 7 bits |
| Compatibility | ASCII subset | Not ASCII | Legacy only |
| Endianness | Irrelevant | Required (BOM) | N/A |
| Size Efficiency | High for English | High for Asian scripts | Minimal |
Last updated on