
Encoding and character sets

As a reminder, we will usually work with the hexadecimal system. We should keep a hex code table nearby for reference.

Hex   Binary
0     0000
1     0001
2     0010
3     0011
4     0100
5     0101
6     0110
7     0111
8     1000
9     1001
A     1010
B     1011
C     1100
D     1101
E     1110
F     1111
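
If no table is at hand, we can rebuild it on the fly. The following is a minimal Python sketch (Python is chosen here only for illustration, and the helper name `hex_to_bits` is hypothetical) that expands each hex digit into its 4-bit group:

```python
def hex_to_bits(hex_string: str) -> str:
    """Return the binary form of a hex string, one 4-bit group per hex digit."""
    # Each hex digit maps to exactly four bits, as in the table above.
    return " ".join(format(int(digit, 16), "04b") for digit in hex_string)


print(hex_to_bits("7F"))    # 0111 1111
print(hex_to_bits("07FF"))  # 0000 0111 1111 1111
```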

UTF-8 encodes characters using 1-byte, 2-byte, 3-byte, or 4-byte sequences depending on the Unicode code point value.

Here is the table which highlights the main structure of the encoding:

First code point   Last code point   Byte 1     Byte 2     Byte 3     Byte 4
U+0000             U+007F            0yyyzzzz
U+0080             U+07FF            110xxxyy   10yyzzzz
U+0800             U+FFFF            1110wwww   10xxxxyy   10yyzzzz
U+010000           U+10FFFF          11110uvv   10vvwwww   10xxxxyy   10yyzzzz
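
To see these byte patterns on real characters, we can print the UTF-8 bytes of one sample character per row. This is only an illustrative sketch: the sample characters and the helper name `utf8_bits` are my own choices, not part of the table.

```python
def utf8_bits(char: str) -> str:
    """Return the UTF-8 bytes of one character as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in char.encode("utf-8"))


# One sample per row of the table: U+0041, U+00E9, U+20AC, U+1F600.
for char in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(char):04X}", utf8_bits(char))

# U+0041 01000001
# U+00E9 11000011 10101001
# U+20AC 11100010 10000010 10101100
# U+1F600 11110000 10011111 10011000 10000000
```

The leading bits of each byte (0, 110, 1110, 11110, and 10 for continuation bytes) match the prefixes in the table.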

Let’s understand this table.

  1. The 1-Byte Range (U+0000 - U+007F)

The first bit must be 0. This leaves 7 bits for data encoding. Now let’s understand the range:

U+0000 - U+007F = 0000 0000 - 0111 1111 = 0-127

This is why the range ends at U+007F: 7 data bits can hold values from 0 to 127 (2^7 - 1).

  2. The 2-Byte Range (U+0080 - U+07FF)

The first byte starts with 110, and the second starts with 10. This leaves 11 bits (16 - 5 = 11). Now let’s understand the range:

U+0080 - U+07FF = 0000 0000 1000 0000 - 0000 0111 1111 1111 = 128 - 2047

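The same principle applies here: 11 data bits can hold values up to 2047 (2^11 - 1), and the range starts at 128 because smaller values already fit in 1 byte.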

  3. The 3-Byte Range (U+0800 - U+FFFF)

The first byte starts with 1110. The second and third bytes start with 10. This leaves 16 bits (24 - 8 = 16). Now let’s understand the range:

U+0800 - U+FFFF = 0000 0000 0000 1000 0000 0000 - 0000 0000 1111 1111 1111 1111 = 2048 - 65535

The same principle applies: 16 data bits can hold values up to 65535 (2^16 - 1).

  4. The 4-Byte Range (U+010000 - U+10FFFF)

The first byte starts with 11110. The second, third, and fourth bytes start with 10. This leaves 21 bits (32 - 11 = 21). Now let’s understand the range:

U+010000 - U+10FFFF = 0000 0000 0000 0001 0000 0000 0000 0000 - 0000 0000 0001 0000 1111 1111 1111 1111 = 65536 - 1,114,111

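The same principle applies, with one difference: 21 data bits could hold values up to 2,097,151 (2^21 - 1), but Unicode ends at U+10FFFF, so the range stops at 1,114,111.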
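
The four ranges above can be turned into a small hand-written encoder. This is a sketch for learning purposes, not a replacement for the built-in codec; the function name `encode_utf8` is hypothetical, and the rejection of surrogate code points (U+D800 - U+DFFF) is a UTF-8 rule that the ranges above do not spell out.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the four ranges."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx (7 data bits)
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx (5 + 6 = 11 data bits)
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 = 16 data bits)
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (3 + 6 + 6 + 6 = 21 data bits)
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(encode_utf8(0x20AC).hex(" "))  # e2 82 ac
```

Comparing the result with Python's own codec, for example `chr(0x20AC).encode("utf-8")`, is an easy way to check the sketch.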

In UTF-16, a character is either 2 bytes or 4 bytes long. When a byte order mark (BOM) is present, the first 2 bytes of a file indicate the UTF-16 byte order (endianness).

‘FE FF’ - UTF-16 Big-Endian - the most significant byte comes first.

‘FF FE’ - UTF-16 Little-Endian - the least significant byte comes first.
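
A minimal sketch of this check in Python could look as follows. The function name `detect_utf16_byte_order` is hypothetical, and the check is deliberately naive: it looks only at the first two bytes, so it cannot tell a UTF-16 Little-Endian file from a UTF-32 Little-Endian one, whose mark also starts with FF FE.

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from the first two bytes of the data."""
    if data[:2] == b"\xfe\xff":
        return "big-endian"
    if data[:2] == b"\xff\xfe":
        return "little-endian"
    return "unknown (no byte order mark)"


# Build two small samples by hand: the mark followed by the text "hi".
print(detect_utf16_byte_order(b"\xfe\xff" + "hi".encode("utf-16-be")))  # big-endian
print(detect_utf16_byte_order(b"\xff\xfe" + "hi".encode("utf-16-le")))  # little-endian
```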

  1. The 2-Byte Characters, the Basic Multilingual Plane (The BMP)

For these, a character is exactly 2 bytes (16 bits) long.

Range: U+0000 - U+FFFF = 0 - 65535

  2. Surrogate pairs (for code points above U+FFFF)

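Code points above U+FFFF do not fit in 16 bits, so UTF-16 represents each of them as a surrogate pair: two 16-bit code units (4 bytes in total), a high surrogate in the range D800 - DBFF followed by a low surrogate in the range DC00 - DFFF.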

Range: U+10000 - U+10FFFF = 65536 - 1,114,111
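
The surrogate pair calculation can be sketched in a few lines of Python. The function name `to_surrogate_pair` is hypothetical; the arithmetic is the standard UTF-16 rule: subtract 0x10000 to get a 20-bit value, then put the top 10 bits into the high surrogate and the bottom 10 bits into the low surrogate.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000 - U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

The result can be checked against Python's codec: `"\U0001F600".encode("utf-16-be")` produces the bytes D8 3D DE 00.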

od man page

hexdump man page

iconv man page

Display content of a file in hex format

$ od -a -A x -t x1 new.txt

$ hexdump -C text.txt


Convert the content of a file from UTF-16 to UTF-8 encoding

$ iconv -f utf-16 -t utf-8 text.txt > new8.txt
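
The iconv command above reads UTF-16 and writes UTF-8. Where iconv is not available, the same conversion can be sketched in Python. The function name `utf16_to_utf8` is hypothetical, and the sketch assumes the whole file fits in memory.

```python
from pathlib import Path


def utf16_to_utf8(source: str, target: str) -> None:
    """Re-encode a UTF-16 text file as UTF-8."""
    # The "utf-16" codec uses the byte order mark, if present, to pick the byte order.
    text = Path(source).read_bytes().decode("utf-16")
    # Work on raw bytes so that line endings are written back unchanged.
    Path(target).write_bytes(text.encode("utf-8"))
```

For the files used in the command above, the call would be `utf16_to_utf8("text.txt", "new8.txt")`.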

Using these tools we can analyse encoding issues in our code 🙂