UNICODE. What is it?

This document introduces the Unicode character set — what it is, why it exists, and how it’s encoded using formats such as UTF-8 and UTF-16. It concludes with an overview of how Java internally represents Unicode.

🔤 1. What Is Unicode?

Unicode is a universal character set standard designed to consistently represent text from every major language, past and present, as well as symbols, punctuation, and special-purpose characters such as emojis.

Key Concepts:

A Unicode character is identified by a unique code point, written as U+XXXX (hexadecimal).
The range of valid Unicode code points is from U+0000 to U+10FFFF, covering over one million potential characters.
Unicode is not an encoding — it is an abstract mapping between characters and numbers. Encoding is needed to store or transmit these characters in bytes.

🔡 2. How Is Unicode Encoded?

To store Unicode text in memory or transmit it over networks, it must be encoded as a sequence of bytes. Unicode defines several encoding forms, of which the most widely used are UTF-8 and UTF-16.

✅ UTF-8

Variable-width encoding: each character uses between 1 and 4 bytes.
ASCII-compatible: characters in the ASCII range (U+0000–U+007F) use a single byte.
Efficient for English text, and widely adopted in file formats, web protocols, and APIs.

Code Point Range	Bytes	Byte Pattern Example
`U+0000` – `U+007F`	1	`0xxxxxxx`
`U+0080` – `U+07FF`	2	`110xxxxx 10xxxxxx`
`U+0800` – `U+FFFF`	3	`1110xxxx 10xxxxxx 10xxxxxx`
`U+10000` – `U+10FFFF`	4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`

✅ UTF-16

Variable-width encoding: characters use 2 or 4 bytes.
Characters in the Basic Multilingual Plane (BMP) (U+0000–U+FFFF) use a single 16-bit unit.
Characters above U+FFFF (supplementary characters) require two 16-bit code units, known as a surrogate pair.

Surrogate Pairs:

Type	Range
High surrogate	`U+D800`–`U+DBFF`
Low surrogate	`U+DC00`–`U+DFFF`

To encode a supplementary character:

Subtract 0x10000 from the code point.
Split the 20-bit result into two 10-bit halves.
Map to high and low surrogate ranges.

🧪 3. Encoding Examples

Character	Code Point	UTF-8 (hex)	UTF-16 (hex)
A	`U+0041`	`41`	`0041`
€ (Euro)	`U+20AC`	`E2 82 AC`	`20AC`
😀 (Smile)	`U+1F600`	`F0 9F 98 80`	`D83D DE00`

📌 4. Summary Comparison

Feature	UTF-8	UTF-16
Width	1–4 bytes	2 or 4 bytes
ASCII Compatibility	✅ Yes	❌ No
Supplementary Support	✅ Yes	✅ Yes (via surrogates)
Efficiency	Best for ASCII-rich	Best for Asian scripts
Use Cases	Web, APIs, files	OS APIs, in-memory text

☕ 5. Unicode in Java

Java initially used UCS-2, a fixed-width 2-byte encoding that could only represent characters in the Basic Multilingual Plane (BMP). Since Java 5, Java has internally used UTF-16.

Key Java behaviors:

A Java char is a 16-bit unit.
BMP characters are represented by a single char.
Supplementary characters are encoded as two char values (a surrogate pair).
String.length() counts char values, not Unicode code points.

✅ Counting Characters Accurately in Java

To count Unicode characters (code points) correctly — including emoji and other supplementary characters — use codePointCount():

public class UnicodeCountExample {
    public static void main(String[] args) {
        String text = "A😀€Z"; // A, smiley, euro, Z
        int lengthInChars = text.length(); // 6 (due to surrogate pair)
        int actualCharacters = text.codePointCount(0, text.length()); // 4

        System.out.println("Length in char units: " + lengthInChars);
        System.out.println("Actual Unicode characters: " + actualCharacters);
    }
}

Output:

Length in char units: 6
Actual Unicode characters: 4

To iterate over characters safely:

text.codePoints().forEach(cp -> {
    System.out.println("Code point: U+" + Integer.toHexString(cp).toUpperCase());
});

This ensures your logic is Unicode-aware, especially for applications involving multilingual content or emoji support.

Let me know if you’d like a printable version, class diagram, or encoding visual aid.

The Daily Kebab

The ramblings of a technomuse