UNICODE. What is it?

This document introduces the Unicode character set: what it is, why it exists, and how it is encoded using formats such as UTF-8 and UTF-16. It concludes with an overview of how Java internally represents Unicode.


🔤 1. What Is Unicode?

Unicode is a universal character set standard designed to consistently represent text from every major language, past and present, as well as symbols, punctuation, and special-purpose characters such as emojis.

Key Concepts:

  • A Unicode character is identified by a unique code point, written as U+XXXX (hexadecimal).
  • The range of valid Unicode code points is from U+0000 to U+10FFFF, which gives 1,114,112 possible code points.
  • Unicode is not an encoding: it is an abstract mapping between characters and numbers. Encoding is needed to store or transmit these characters in bytes.
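
As a rough illustration of the difference between a code point and its encoded bytes, the sketch below formats a code point in U+XXXX notation and then asks the JDK how many bytes the same character needs in two encodings. The class name CodePointVsEncoding and the helper notation() are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class CodePointVsEncoding {
    // Formats a code point in the conventional U+XXXX notation (at least four hex digits).
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // The euro sign, written as an escape so the example does not depend on source-file encoding.
        String euro = "\u20AC";
        int cp = euro.codePointAt(0);

        System.out.println(notation(cp)); // U+20AC

        // One abstract character, two different byte sequences.
        System.out.println(euro.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(euro.getBytes(StandardCharsets.UTF_16BE).length); // 2
    }
}
```

The code point stays the same in both cases; only the byte representation changes with the encoding.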

🔡 2. How Is Unicode Encoded?

To store Unicode text in memory or transmit it over networks, it must be encoded as a sequence of bytes. Unicode defines several encoding forms, of which the most widely used are UTF-8 and UTF-16.


✅ UTF-8

  • Variable-width encoding: each character uses between 1 and 4 bytes.
  • ASCII-compatible: characters in the ASCII range (U+0000–U+007F) use a single byte.
  • Efficient for English text, and widely adopted in file formats, web protocols, and APIs.

Code Point Range       Bytes   Byte Pattern
U+0000 – U+007F        1       0xxxxxxx
U+0080 – U+07FF        2       110xxxxx 10xxxxxx
U+0800 – U+FFFF        3       1110xxxx 10xxxxxx 10xxxxxx
U+10000 – U+10FFFF     4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
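
The byte patterns in the table above can be turned into code fairly directly. The following is a minimal sketch, not a replacement for the JDK's own encoder: the class Utf8Sketch and its encode() method are hypothetical names, and the main method simply checks the hand-rolled result against String.getBytes(StandardCharsets.UTF_8).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes a single code point by hand, following the bit patterns in the table.
    static byte[] encode(int cp) {
        // Surrogates (U+D800 to U+DFFF) are reserved for UTF-16 and cannot be encoded in UTF-8.
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] { (byte) cp };                      // 0xxxxxxx
        } else if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                        // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))                       // 10xxxxxx
            };
        } else if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                       // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),               // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))                       // 10xxxxxx
            };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),                       // 11110xxx
                (byte) (0x80 | ((cp >> 12) & 0x3F)),              // 10xxxxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),               // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))                       // 10xxxxxx
            };
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            byte[] manual = encode(cp);
            byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(String.format("U+%04X: %d byte(s), matches JDK: %b",
                    cp, manual.length, Arrays.equals(manual, library)));
        }
    }
}
```

In real code you would normally rely on the JDK's charset support; the manual version is only meant to make the table concrete.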

✅ UTF-16

  • Variable-width encoding: characters use 2 or 4 bytes.
  • Characters in the Basic Multilingual Plane (BMP, U+0000–U+FFFF) use a single 16-bit code unit.
  • Characters above U+FFFF (supplementary characters) require two 16-bit code units, known as a surrogate pair.

Surrogate Pairs:

Type             Range
High surrogate   U+D800–U+DBFF
Low surrogate    U+DC00–U+DFFF

To encode a supplementary character:

  1. Subtract 0x10000 from the code point.
  2. Split the 20-bit result into two 10-bit halves.
  3. Add 0xD800 to the upper half to get the high surrogate, and 0xDC00 to the lower half to get the low surrogate.
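
The three steps above can be sketched in a few lines of Java. For U+1F600, for instance, subtracting 0x10000 leaves 0xF600; its upper 10 bits are 0x3D and its lower 10 bits are 0x200, which yields D83D and DE00. The class SurrogateSketch and the method toSurrogates() are illustrative names only; the JDK already provides Character.toChars() for this, and the main method uses it as a cross-check.

```java
import java.util.Arrays;

class SurrogateSketch {
    // Splits a supplementary code point into its high and low surrogate.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("Not a supplementary code point: " + cp);
        }
        int offset = cp - 0x10000;                        // step 1: a 20-bit value
        int upper = offset >> 10;                         // step 2: top 10 bits
        int lower = offset & 0x3FF;                       //         bottom 10 bits
        char high = (char) (0xD800 + upper);              // step 3: map into U+D800..U+DBFF
        char low = (char) (0xDC00 + lower);               //         map into U+DC00..U+DFFF
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F600);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // D83D DE00
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F600)));          // true
    }
}
```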

🧪 3. Encoding Examples

Character    Code Point   UTF-8 (hex)    UTF-16 (hex)
A            U+0041       41             0041
€ (Euro)     U+20AC       E2 82 AC       20AC
😀 (Smile)   U+1F600      F0 9F 98 80    D83D DE00
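
If you want to reproduce these values yourself, a small program along the following lines should do it. This is only a sketch: EncodingTable and hex() are names invented here, and UTF-16BE is chosen so that no byte order mark is prepended. Note that it prints individual bytes, so the smiley's UTF-16 form appears as D8 3D DE 00 rather than as two 16-bit units.

```java
import java.nio.charset.StandardCharsets;

class EncodingTable {
    // Renders bytes as space-separated, two-digit uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = { 0x41, 0x20AC, 0x1F600 };
        for (int cp : codePoints) {
            String s = new String(Character.toChars(cp));
            System.out.println(String.format("U+%04X  UTF-8: %s  UTF-16BE: %s",
                    cp,
                    hex(s.getBytes(StandardCharsets.UTF_8)),
                    hex(s.getBytes(StandardCharsets.UTF_16BE))));
        }
    }
}
```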

📌 4. Summary Comparison

Feature                 UTF-8                      UTF-16
Width                   1–4 bytes                  2 or 4 bytes
ASCII Compatibility     ✅ Yes                     ❌ No
Supplementary Support   ✅ Yes                     ✅ Yes (via surrogates)
Efficiency              Best for ASCII-rich text   Best for Asian scripts
Use Cases               Web, APIs, files           OS APIs, in-memory text
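
The efficiency row is easy to check with a quick measurement. The sketch below compares encoded sizes for one ASCII string and one Japanese string; the class SizeComparison, its helper methods, and the sample strings are all chosen for illustration, and UTF-16BE is used so that the byte order mark does not inflate the count.

```java
import java.nio.charset.StandardCharsets;

class SizeComparison {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        // UTF_16BE writes no byte order mark, so this is the pure payload size.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "Hello, world";                        // 12 ASCII characters
        String japanese = "\u3053\u3093\u306B\u3061\u306F";   // five hiragana characters ("konnichiwa")

        System.out.println("ASCII    UTF-8: " + utf8Bytes(ascii) + ", UTF-16: " + utf16Bytes(ascii));       // 12 vs 24
        System.out.println("Japanese UTF-8: " + utf8Bytes(japanese) + ", UTF-16: " + utf16Bytes(japanese)); // 15 vs 10
    }
}
```

ASCII text doubles in size under UTF-16, while text in the U+0800–U+FFFF range shrinks from three bytes per character to two.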

☕ 5. Unicode in Java

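Java initially used UCS-2, a fixed-width 2-byte encoding that could only represent characters in the Basic Multilingual Plane (BMP). Since Java 5, the String and char APIs have been defined in terms of UTF-16. (Since Java 9, the JVM may store Latin-1-only strings more compactly behind the scenes, but the API still behaves as UTF-16.)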

Key Java behaviors:

  • A Java char is a 16-bit UTF-16 code unit.
  • BMP characters are represented by a single char.
  • Supplementary characters are encoded as two char values (a surrogate pair).
  • String.length() counts char values, not Unicode code points.
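
These behaviors are easy to observe directly. In the sketch below, CharUnitsDemo and describe() are hypothetical names; everything else is standard java.lang API.

```java
class CharUnitsDemo {
    // Summarizes how a string looks in char units versus code points.
    static String describe(String s) {
        boolean startsWithHigh = !s.isEmpty() && Character.isHighSurrogate(s.charAt(0));
        return "chars=" + s.length()
                + ", codePoints=" + s.codePointCount(0, s.length())
                + ", startsWithHighSurrogate=" + startsWithHigh;
    }

    public static void main(String[] args) {
        String letter = "A";
        String smile = "\uD83D\uDE00"; // U+1F600, written as its surrogate pair

        System.out.println(describe(letter)); // chars=1, codePoints=1, startsWithHighSurrogate=false
        System.out.println(describe(smile));  // chars=2, codePoints=1, startsWithHighSurrogate=true

        // charAt() returns a lone surrogate; codePointAt() reassembles the full code point.
        System.out.println(Character.isLowSurrogate(smile.charAt(1)));      // true
        System.out.println(String.format("U+%04X", smile.codePointAt(0)));  // U+1F600
    }
}
```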

✅ Counting Characters Accurately in Java

To count Unicode characters (code points) correctly, including emoji and other supplementary characters, use codePointCount():

public class UnicodeCountExample {
    public static void main(String[] args) {
        String text = "A😀€Z"; // A, smiley, euro, Z
        int lengthInChars = text.length(); // 5 (the smiley is a surrogate pair, so it counts as 2)
        int actualCharacters = text.codePointCount(0, text.length()); // 4

        System.out.println("Length in char units: " + lengthInChars);
        System.out.println("Actual Unicode characters: " + actualCharacters);
    }
}

Output:

Length in char units: 5
Actual Unicode characters: 4

To iterate over characters safely:

text.codePoints().forEach(cp -> {
    // %04X pads to at least four hex digits, matching the U+XXXX notation
    System.out.println("Code point: " + String.format("U+%04X", cp));
});

This ensures your logic is Unicode-aware, especially for applications involving multilingual content or emoji support.
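
One place where this matters in practice is truncation: cutting a string at a char index can split a surrogate pair and leave a broken half behind. The following is a minimal sketch of a code-point-aware alternative, assuming a non-negative limit; SafeTruncate and truncate() are hypothetical names, while offsetByCodePoints() is a standard String method.

```java
class SafeTruncate {
    // Returns at most maxCodePoints code points from the start of s,
    // never cutting between the two halves of a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (maxCodePoints < 0) {
            throw new IllegalArgumentException("maxCodePoints must be non-negative");
        }
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // Translate a code point count into the matching char index.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "A\uD83D\uDE00\u20ACZ"; // A, smiley, euro, Z

        String naive = text.substring(0, 2);  // "A" plus a lone high surrogate
        String safe = truncate(text, 2);      // "A" plus the complete smiley

        System.out.println(Character.isHighSurrogate(naive.charAt(naive.length() - 1))); // true
        System.out.println(safe.codePointCount(0, safe.length()));                       // 2
    }
}
```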

