Unicode
Unicode solves the chaos of incompatible encodings by defining a single, consistent catalog of characters for every language, symbol, and emoji. It's not an encoding; it's the semantic backbone that encoding schemes like UTF-8, UTF-16, and UTF-32 rely on.
Code Points
- Each character is assigned a unique number, written in hexadecimal as U+XXXX
- Example: 'A' → U+0041, 'π' → U+03C0, '😀' → U+1F600
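The mapping between characters and code points can be checked directly. The following is a minimal sketch using Python's built-in `ord()` and `chr()`; the helper name `to_code_point` is made up for illustration.

```python
def to_code_point(ch: str) -> str:
    """Format a single character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"

# 'A', Greek small letter pi, and the grinning face emoji from the example above
samples = ["A", "\u03c0", "\U0001F600"]
labels = [to_code_point(ch) for ch in samples]
print(labels)  # ['U+0041', 'U+03C0', 'U+1F600']

# chr() is the inverse of ord(): code point number -> character
pi_char = chr(0x03C0)
print(pi_char == "\u03c0")  # True
```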
Planes & Ranges
| Plane Name | Range | Notable Content |
|---|---|---|
| Basic Multilingual Plane | U+0000–U+FFFF | Most common scripts, symbols |
| Supplementary Multilingual Plane | U+10000–U+1FFFF | Historic scripts, emoji, rare symbols |
| Supplementary Ideographic Plane | U+20000–U+2FFFF | Rare CJK ideographs |
| Supplementary Special-purpose Plane | U+E0000–U+EFFFF | Tags, variation selectors |
| Private Use Areas | U+E000–U+F8FF, U+F0000–U+FFFFD, U+100000–U+10FFFD | Custom characters (non-standard) |
As of Unicode 14.0, the standard defines 144,697 characters across 159 scripts.
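Each plane spans 0x10000 code points, so the plane number is the code point shifted right by 16 bits. The sketch below illustrates that arithmetic for the planes listed above; `plane_of`, `plane_name`, and `PLANE_NAMES` are illustrative names, and the lookup table covers only some planes.

```python
# Names for the planes mentioned above (the table is deliberately partial).
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-purpose Plane",
    15: "Supplementary Private Use Area-A",
    16: "Supplementary Private Use Area-B",
}

def plane_of(ch: str) -> int:
    """Return the plane number (0-16) of a single character."""
    return ord(ch) >> 16  # each plane holds 0x10000 code points

def plane_name(ch: str) -> str:
    n = plane_of(ch)
    return PLANE_NAMES.get(n, f"Plane {n}")

print(plane_of("A"))             # 0
print(plane_of("\U0001F600"))    # 1
print(plane_name("\U00020000"))  # Supplementary Ideographic Plane
```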
Character Properties
- General Category: Letter, Number, Symbol, Punctuation, etc.
- Combining Class: Determines how combining marks such as accents and diacritics are ordered
- Bidirectional Class: Controls rendering for RTL scripts
- Script: Latin, Cyrillic, Arabic, Han, etc.
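Several of these properties can be inspected with Python's standard `unicodedata` module. This is a small sketch, with `describe` as a made-up helper name. The standard module does not expose the Script property, so it is omitted here.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Collect a few Unicode properties for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),   # General Category, e.g. 'Lu'
        "combining": unicodedata.combining(ch),  # Canonical Combining Class
        "bidi": unicodedata.bidirectional(ch),   # Bidirectional Class
    }

latin_a = describe("A")
acute = describe("\u0301")  # COMBINING ACUTE ACCENT
alef = describe("\u05d0")   # HEBREW LETTER ALEF

print(latin_a["category"], latin_a["bidi"])  # Lu L
print(acute["category"], acute["combining"])  # Mn 230
print(alef["category"], alef["bidi"])         # Lo R
```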
Normalization Forms
| Form | Description | Use Case |
|---|---|---|
| NFC | Composed form | Web, file systems |
| NFD | Decomposed form | Canonical comparison |
| NFKC | Compatibility composed | Search, indexing |
| NFKD | Compatibility decomposed | Audit, equivalence checking |
⚠️ 'é' can be U+00E9 (composed) or U+0065 U+0301 (decomposed). Normalizing both strings to the same form makes them compare equal.
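The difference between the two spellings of 'é' can be shown with `unicodedata.normalize` from the Python standard library. This is a minimal sketch; the ligature U+FB01 is an extra example chosen here to show what the compatibility forms add.

```python
import unicodedata

composed = "\u00e9"     # single code point U+00E9
decomposed = "e\u0301"  # U+0065 followed by U+0301

# The two strings render the same but are not equal as code point sequences.
raw_equal = composed == decomposed

# After normalizing to a common form they compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

# Compatibility forms also fold characters such as the 'fi' ligature (U+FB01).
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(raw_equal, nfc_equal, nfd_equal)  # False True True
print(nfkc_ligature)                    # fi
```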
Use Cases
- Multilingual Text: Enables consistent rendering across languages
- Emoji Support: Unicode defines emoji sequences and modifiers
- Search & Indexing: Normalization ensures accurate matching
- Security Audits: Character properties help detect spoofing via visually similar characters
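As a rough illustration of the spoofing problem, the sketch below flags strings that mix letters from different scripts. It is a crude heuristic, not a standard algorithm: it uses the first word of each character's Unicode name as a stand-in for the Script property, and `script_hints` and `looks_mixed` are made-up names. It also shows that normalization alone does not fold a Cyrillic 'а' into a Latin 'a'.

```python
import unicodedata

def script_hints(text: str) -> set:
    """Approximate the scripts in text from the first word of each letter's name.

    Assumption: names such as 'LATIN SMALL LETTER A' start with the script
    name. This is a heuristic, not the real Unicode Script property.
    """
    hints = set()
    for ch in text:
        if unicodedata.category(ch).startswith("L"):  # letters only
            hints.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return hints

def looks_mixed(text: str) -> bool:
    """Return True when letters from more than one script hint appear."""
    return len(script_hints(text)) > 1

genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A (U+0430)

print(looks_mixed(genuine))  # False
print(looks_mixed(spoofed))  # True

# NFKC leaves the Cyrillic letter unchanged, so the strings still differ.
still_different = unicodedata.normalize("NFKC", spoofed) != genuine
```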
Unicode vs Encoding
| Layer | Role | Example |
|---|---|---|
| Unicode | Defines characters | 'π' → U+03C0 |
| UTF-8 | Encodes Unicode in 1–4 bytes | U+03C0 → 0xCF 0x80 |
| UTF-16 | Encodes Unicode in 2 or 4 bytes | U+1F600 → 0xD83D 0xDE00 |
| UTF-32 | Encodes Unicode in 4 bytes | U+1F600 → 0x0001F600 |
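The byte sequences in the table can be reproduced with Python's `str.encode`. The sketch below also works through the UTF-16 surrogate pair arithmetic by hand; `surrogate_pair` is an illustrative helper name, and the big-endian codec variants are used so that no byte order mark is emitted.

```python
pi = "\u03c0"
grin = "\U0001F600"

utf8_pi = pi.encode("utf-8")         # 2 bytes
utf8_grin = grin.encode("utf-8")     # 4 bytes
utf16_grin = grin.encode("utf-16-be")  # surrogate pair, 4 bytes
utf32_grin = grin.encode("utf-32-be")  # always 4 bytes

print(utf8_pi.hex(" "))     # cf 80
print(utf16_grin.hex(" "))  # d8 3d de 00
print(utf32_grin.hex(" "))  # 00 01 f6 00

def surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    v = cp - 0x10000               # 20-bit value
    high = 0xD800 + (v >> 10)      # top 10 bits
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```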