This document introduces the Unicode character set: what it is, why it exists, and how it is encoded using formats such as UTF-8 and UTF-16. It concludes with an overview of how Java internally represents Unicode.
1. What Is Unicode?
Unicode is a universal character set standard designed to consistently represent text from every major language, past and present, as well as symbols, punctuation, and special-purpose characters such as emojis.
Key Concepts:
- A Unicode character is identified by a unique code point, written as U+XXXX (hexadecimal).
- Valid Unicode code points range from U+0000 to U+10FFFF, giving 1,114,112 possible values.
- Unicode is not an encoding; it is an abstract mapping between characters and numbers. An encoding is needed to store or transmit these characters as bytes.
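As a small illustration of the code point notation, the sketch below formats integers as U+XXXX and checks them against the valid range. The class name CodePointDemo and the helper toUPlus are hypothetical names chosen for this example; Character.isValidCodePoint is part of the standard Java library.

```java
final class CodePointDemo {
    /** Formats a code point in U+XXXX notation, zero-padded to at least four hex digits. */
    static String toUPlus(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(toUPlus('A'));      // U+0041
        System.out.println(toUPlus(0x20AC));   // U+20AC
        System.out.println(toUPlus(0x1F600));  // U+1F600

        // U+10FFFF is the last valid code point; the next integer is out of range.
        System.out.println(Character.isValidCodePoint(0x10FFFF)); // true
        System.out.println(Character.isValidCodePoint(0x110000)); // false
    }
}
```

Note that the number alone says nothing about how the character is stored; that is the job of an encoding.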
2. How Is Unicode Encoded?
To store Unicode text in memory or transmit it over networks, it must be encoded as a sequence of bytes. Unicode defines several encoding forms, of which the most widely used are UTF-8 and UTF-16.
UTF-8
- Variable-width encoding: each character uses between 1 and 4 bytes.
- ASCII-compatible: characters in the ASCII range (U+0000–U+007F) use a single byte.
- Efficient for English text, and widely adopted in file formats, web protocols, and APIs.
Code Point Range | Bytes | Byte Pattern Example |
---|---|---|
U+0000 – U+007F | 1 | 0xxxxxxx |
U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx |
U+0800 – U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
U+10000 – U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
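The byte patterns in the table can be turned into code fairly directly. The following is a minimal sketch of a hand-written encoder, meant only to show how the x bits are filled in; the class Utf8Sketch and its encode method are hypothetical names, and real code would normally rely on String.getBytes(StandardCharsets.UTF_8) instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Sketch {
    /** Encodes one code point into UTF-8 bytes using the bit patterns from the table. */
    static byte[] encode(int cp) {
        // Surrogate code points (U+D800 to U+DFFF) are not valid in UTF-8.
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {            // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {    // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else if (cp <= 0xFFFF) {   // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        int euro = 0x20AC;
        byte[] manual = encode(euro);
        // Cross-check against the standard library encoder.
        byte[] library = new String(Character.toChars(euro)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```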
UTF-16
- Variable-width encoding: each character uses 2 or 4 bytes.
- Characters in the Basic Multilingual Plane (BMP) (U+0000–U+FFFF) use a single 16-bit code unit.
- Characters above U+FFFF (supplementary characters) require two 16-bit code units, known as a surrogate pair.
Surrogate Pairs:
Type | Range |
---|---|
High surrogate | U+D800 – U+DBFF |
Low surrogate | U+DC00 – U+DFFF |
To encode a supplementary character:
- Subtract 0x10000 from the code point.
- Split the 20-bit result into two 10-bit halves.
- Add the upper half to 0xD800 to get the high surrogate, and the lower half to 0xDC00 to get the low surrogate.
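The steps above can be sketched in a few lines of Java. SurrogateSketch and toSurrogates are hypothetical names used for illustration; the standard library already provides Character.toChars, which the example uses only as a cross-check.

```java
import java.util.Arrays;

final class SurrogateSketch {
    /** Splits a supplementary code point into its UTF-16 high and low surrogates. */
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("Not a supplementary code point: " + cp);
        }
        int offset = cp - 0x10000;                       // step 1: 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // steps 2-3: upper 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));  // steps 2-3: lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F600);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // D83D DE00
        // The standard library produces the same pair.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F600))); // true
    }
}
```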
3. Encoding Examples
Character | Code Point | UTF-8 (hex) | UTF-16 (hex) |
---|---|---|---|
A | U+0041 | 41 | 0041 |
€ (Euro) | U+20AC | E2 82 AC | 20AC |
😀 (Smile) | U+1F600 | F0 9F 98 80 | D83D DE00 |
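The rows of this table can be reproduced with the standard library. The sketch below is one way to do it; the class EncodingTable and its helper methods are hypothetical names, while getBytes and Character.toChars are standard Java calls.

```java
import java.nio.charset.StandardCharsets;

final class EncodingTable {
    /** Returns the UTF-8 bytes of a code point as space-separated two-digit hex values. */
    static String utf8Hex(int cp) {
        byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    /** Returns the UTF-16 code units of a code point as space-separated four-digit hex values. */
    static String utf16Hex(int cp) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(cp)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = { 0x41, 0x20AC, 0x1F600 };
        for (int cp : codePoints) {
            System.out.println(String.format("U+%04X | %s | %s", cp, utf8Hex(cp), utf16Hex(cp)));
        }
        // U+0041 | 41 | 0041
        // U+20AC | E2 82 AC | 20AC
        // U+1F600 | F0 9F 98 80 | D83D DE00
    }
}
```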
4. Summary Comparison
Feature | UTF-8 | UTF-16 |
---|---|---|
Width | 1–4 bytes | 2 or 4 bytes |
ASCII Compatibility | Yes | No |
Supplementary Support | Yes | Yes (via surrogates) |
Efficiency | Best for ASCII-heavy text | Often more compact for CJK text (2 bytes per BMP character vs. 3) |
Use Cases | Web, APIs, files | OS APIs, in-memory text |
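The efficiency row is easy to check by measuring encoded sizes. The sketch below compares a short ASCII string with a short Japanese string; EncodedSizeDemo and the sample strings are illustrative choices, not part of the original comparison. UTF_16BE is used because Java's plain UTF-16 charset prepends a 2-byte byte-order mark when encoding, which would skew the count.

```java
import java.nio.charset.StandardCharsets;

final class EncodedSizeDemo {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Uses UTF_16BE so that no byte-order mark is included in the count. */
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "Hello";                   // 5 ASCII characters
        String cjk = "\u65E5\u672C\u8A9E";        // "日本語", 3 BMP characters

        System.out.println("ASCII UTF-8:  " + utf8Bytes(ascii));  // 5
        System.out.println("ASCII UTF-16: " + utf16Bytes(ascii)); // 10
        System.out.println("CJK UTF-8:    " + utf8Bytes(cjk));    // 9
        System.out.println("CJK UTF-16:   " + utf16Bytes(cjk));   // 6
    }
}
```

For markup-heavy content such as HTML, the ASCII tags can still tip the balance toward UTF-8 even when the visible text is CJK, so it is worth measuring real data rather than assuming.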
5. Unicode in Java
Java initially used UCS-2, a fixed-width 2-byte encoding that could only represent characters in the Basic Multilingual Plane (BMP). Since Java 5, Java has internally used UTF-16.
Key Java behaviors:
- A Java char is a 16-bit UTF-16 code unit.
- BMP characters are represented by a single char.
- Supplementary characters are encoded as two char values (a surrogate pair).
- String.length() counts char values, not Unicode code points.
Counting Characters Accurately in Java
To count Unicode characters (code points) correctly, including emoji and other supplementary characters, use codePointCount():
public class UnicodeCountExample {
    public static void main(String[] args) {
        String text = "A😀€Z"; // A, smiley (U+1F600), euro (U+20AC), Z
        int lengthInChars = text.length(); // 5: the smiley takes two chars (a surrogate pair)
        int actualCharacters = text.codePointCount(0, text.length()); // 4
        System.out.println("Length in char units: " + lengthInChars);
        System.out.println("Actual Unicode characters: " + actualCharacters);
    }
}
Output:
Length in char units: 5
Actual Unicode characters: 4
To iterate over characters safely:
text.codePoints().forEach(cp -> {
    // Print each code point in U+XXXX form, zero-padded to at least four hex digits
    System.out.println(String.format("Code point: U+%04X", cp));
});
Iterating by code point rather than by char keeps surrogate pairs intact, which matters for applications that handle multilingual content or emoji.