Unicode and UTF Overview

This post is a brief technical overview of Unicode, a widely used standard for multilingual character representation, and the family of UTF-x encodings. First, a brief introduction to Unicode, quoted from the original 1988 proposal:

Unicode is intended to address the need for a workable, reliable world text encoding.

Unicode could be roughly described as “wide-body ASCII” that has been stretched to 16 bits to encompass the characters of all the world’s living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.

http://www.unicode.org/history/unicode88.pdf

Character Representation: Code Points and Planes

A reference to a specific character is called a code point. ASCII, for example, uses 7 bits per character, which allows for 2^7 = 128 different characters (code points).

Unicode has since outgrown 16 bits: code points range from U+0000 to U+10FFFF, which takes up to 21 bits, and this range is divided into 17 planes. Each plane holds 2^16 = 65,536 code points, for a maximum of 2^16 * 17 = 1,114,112 code points in total.
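The plane arithmetic can be checked with a few lines of Python. This is a minimal sketch written for Python 3 (the interactive sessions in this post use Python 2); the helper name plane_of is made up for illustration and is not part of any library.

```python
# Sketch: which plane does a code point belong to?
# plane_of is a hypothetical helper, not part of any library.

PLANE_SIZE = 2 ** 16          # 65,536 code points per plane
PLANE_COUNT = 17              # planes 0 through 16
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1


def plane_of(char):
    """Return the plane number (0-16) of a single character."""
    return ord(char) // PLANE_SIZE


print(hex(MAX_CODE_POINT))        # 0x10ffff
print(plane_of("a"))              # 0 (U+0061)
print(plane_of("\u20ac"))         # 0 (U+20AC, euro sign)
print(plane_of("\U0001F600"))     # 1 (U+1F600)
```

Integer division by 65,536 is the same as dropping the lowest 16 bits, so the plane number is simply whatever is left above them.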

Six of the 17 planes are listed below; most of the remaining planes are still unassigned:

Plane    Code point range        Description
0        U+0000 … U+FFFF         Basic Multilingual Plane
1        U+10000 … U+1FFFF       Supplementary Multilingual Plane
2        U+20000 … U+2FFFF       Supplementary Ideographic Plane
14       U+E0000 … U+EFFFF       Supplementary Special-purpose Plane
15-16    U+F0000 … U+10FFFF      Private Use Area
  
 

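Code points of the first plane fit into two bytes (four hex digits). Code points of all other planes need a third byte, which holds the plane number (the leading one or two hex digits of the five- and six-digit values in the table above).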

Code points U+0000 to U+00FF (0-255) are identical to the Latin-1 values, so converting between them simply requires converting code points to byte values. In fact, any document containing only the first 128 code points (U+0000 to U+007F, the ASCII character set) is a perfectly valid UTF-8 encoded Unicode document.
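Both compatibility claims are easy to verify. The following is a small sketch for Python 3 (not the Python 2 used elsewhere in this post); the sample strings are arbitrary choices.

```python
# Sketch (Python 3): Latin-1 maps code points 0-255 directly to byte values.
text = "caf\u00e9"                       # 'café', all code points below 256
code_points = [ord(c) for c in text]
latin1_bytes = text.encode("latin-1")
assert list(latin1_bytes) == code_points   # byte values equal code points

# Pure ASCII text produces the same bytes in ASCII and in UTF-8.
ascii_text = "Hello"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Outside ASCII the two encodings differ: U+00E9 is one byte in Latin-1
# but two bytes in UTF-8.
print(latin1_bytes)              # b'caf\xe9'
print(text.encode("utf-8"))      # b'caf\xc3\xa9'
```

Note that the Latin-1 shortcut only holds for the code point values themselves. As soon as the text is stored as UTF-8, every code point from U+0080 upwards takes more than one byte.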

Character Encoding: UTF-8, 16 and 32

 

>>> u = u"€"
>>> u
u'\u20ac'
>>> bytearray(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unicode argument without an encoding
>>>
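The session above is Python 2. As a rough Python 3 counterpart (a sketch, with variable names chosen for illustration), the same conversion also fails until an encoding is named:

```python
# Sketch (Python 3): a str has no byte representation until an encoding is chosen.
euro = "\u20ac"   # the euro sign, U+20AC

try:
    bytearray(euro)            # no encoding given
except TypeError as exc:
    print("TypeError:", exc)

# Naming an encoding makes the conversion well defined.
utf8_bytes = bytearray(euro, "utf-8")
print(list(utf8_bytes))        # [226, 130, 172], i.e. 0xE2 0x82 0xAC
```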

This is where the Unicode Transformation Formats (UTF) come into play. UTF-8, UTF-16 and UTF-32 each turn a sequence of Unicode code points into a sequence of bytes: UTF-8 uses a variable number of 8-bit blocks per code point, UTF-16 uses one or two 16-bit blocks, and UTF-32 uses exactly one 32-bit block.

UTF-8

UTF-8 is a variable-width encoding, with each Unicode code point represented by one to four bytes. A main advantage of UTF-8 is backward compatibility with ASCII: every ASCII text is also valid UTF-8, so the same decoding function handles both.

If the character is encoded into just one byte, the high-order bit is 0 and the other bits represent the code point (in the range 0..127). If the character is encoded into a sequence of more than one byte, the first byte has as many leading '1' bits as the total number of bytes in the sequence, followed by a '0' bit, and the succeeding bytes are all marked by a leading "10" bit pattern. The remaining bits in the byte sequence are concatenated to form the Unicode code point value.
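The bit layout described above can be written out by hand. The following is an illustrative sketch for Python 3; utf8_encode is a hypothetical helper written for this post, and real code should simply call the built-in str.encode("utf-8").

```python
# Sketch (Python 3) of the UTF-8 bit layout.
# utf8_encode is a hypothetical helper; use str.encode("utf-8") in real code.

def utf8_encode(code_point):
    """Encode one code point (0..0x10FFFF) as UTF-8 bytes.

    Surrogate code points (U+D800..U+DFFF) are not rejected here,
    although the built-in codec would reject them.
    """
    if code_point < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for char in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = utf8_encode(ord(char))
    assert encoded == char.encode("utf-8")     # matches the built-in codec
    print("U+%04X" % ord(char), [format(b, "08b") for b in encoded])
```

For the euro sign (U+20AC) this prints the three bytes 11100010, 10000010 and 10101100: three leading 1 bits in the first byte announce a three-byte sequence, and both following bytes start with 10.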

UTF-16

UTF-16 uses a single 16-bit block (two bytes) for each code point of the “Basic Multilingual Plane” (U+0000 to U+FFFF). Code points of the other planes do not fit into 16 bits, so UTF-16 encodes each of them as two 16-bit blocks (four bytes), called a surrogate pair.
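How a code point outside the Basic Multilingual Plane is split into a surrogate pair can be sketched as follows. This is Python 3, and utf16_code_units is a hypothetical helper for illustration only.

```python
# Sketch (Python 3): compute UTF-16 code units by hand.
# utf16_code_units is a hypothetical helper for illustration.

def utf16_code_units(code_point):
    """Return the list of 16-bit code units for one code point."""
    if code_point < 0x10000:
        return [code_point]                    # BMP: one code unit
    offset = code_point - 0x10000              # a 20-bit value
    high = 0xD800 + (offset >> 10)             # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)            # bottom 10 bits
    return [high, low]                         # surrogate pair


print([hex(u) for u in utf16_code_units(0x20AC)])    # ['0x20ac']
print([hex(u) for u in utf16_code_units(0x1F600)])   # ['0xd83d', '0xde00']

# Cross-check against the built-in big-endian codec (no BOM).
assert "\U0001F600".encode("utf-16-be") == b"\xd8\x3d\xde\x00"
```

The ranges U+D800 to U+DBFF and U+DC00 to U+DFFF are reserved for exactly this purpose, which is why a decoder can always tell a surrogate from an ordinary character.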

UTF-32

UTF-32 always uses exactly four bytes for each Unicode code point. If no endianness is specified, the encoder typically prepends a four-byte byte order mark (BOM) to the output.
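Since UTF-32 is just the code point stored as a 32-bit integer, it can be reproduced with the standard struct module. A short sketch for Python 3, with variable names chosen for illustration:

```python
import struct

# Sketch (Python 3): UTF-32 is the code point stored as a 32-bit integer.
code_point = ord("\u20ac")                      # 0x20AC

little = struct.pack("<I", code_point)          # little endian
big = struct.pack(">I", code_point)             # big endian
assert little == "\u20ac".encode("utf-32-le")
assert big == "\u20ac".encode("utf-32-be")

# Without an explicit byte order, Python prepends a 4-byte byte order mark
# (BOM), so the result is 8 bytes long.
with_bom = "\u20ac".encode("utf-32")
print(len(little), len(big), len(with_bom))     # 4 4 8
```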

Summary

    • UTF-8 can encode any code point of any plane and stores lower code points in fewer bytes (e.g. ASCII characters in 1 byte). For the first 128 code points, UTF-8 is byte-for-byte identical to ASCII. Recommended for everything related to text.
    • UTF-16 stores 16-bit blocks. Code points of plane 0 take one block; code points of higher planes take two blocks (a surrogate pair). The euro sign € used in the examples below is U+20AC, which lies in plane 0 and therefore takes a single block.
    • UTF-32 encodes every Unicode code point as one 32-bit block, with no space savings for low code points.
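To make the size trade-offs concrete, here is a small Python 3 sketch that counts the bytes per character in each encoding. The sample characters are arbitrary picks, and the "-le" codecs are used so that no byte order mark is counted.

```python
# Sketch (Python 3): bytes needed per character in each encoding.
# The "-le" codecs are used so that no byte order mark is added.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
encodings = ("utf-8", "utf-16-le", "utf-32-le")

for char in samples:
    sizes = [len(char.encode(enc)) for enc in encodings]
    print("U+%04X" % ord(char), sizes)

# U+0061 [1, 2, 4]
# U+00E9 [2, 2, 4]
# U+20AC [3, 2, 4]
# U+1F600 [4, 4, 4]
```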


Examples

>>> u = u"a"
>>> u
u'a'
>>> repr(u.encode("utf-8"))
"'a'"
>>> repr(u.encode("utf-16"))    # no endianness specified, so a BOM (\xff\xfe) is prepended
"'\\xff\\xfea\\x00'"
>>> repr(u.encode("utf-16-le")) # little endian byte order
"'a\\x00'"
>>> repr(u.encode("utf-16-be")) # big endian byte order
"'\\x00a'"
>>> repr(u.encode("utf-32"))
"'\\xff\\xfe\\x00\\x00a\\x00\\x00\\x00'"
>>> repr(u.encode("utf-32-le"))
"'a\\x00\\x00\\x00'"
>>> repr(u.encode("utf-32-be"))
"'\\x00\\x00\\x00a'"

>>> u = u"€"
>>> u
u'\u20ac'
>>> repr(u.encode("utf-8"))
"'\\xe2\\x82\\xac'"
>>> repr(u.encode("utf-16"))    # the space in the output is the byte 0x20 of U+20AC
"'\\xff\\xfe\\xac '"
>>> repr(u.encode("utf-16-le"))
"'\\xac '"
>>> repr(u.encode("utf-16-be"))
"' \\xac'"
>>> repr(u.encode("utf-32"))
"'\\xff\\xfe\\x00\\x00\\xac \\x00\\x00'"
>>> repr(u.encode("utf-32-le"))
"'\\xac \\x00\\x00'"
>>> repr(u.encode("utf-32-be"))
"'\\x00\\x00 \\xac'"

Further Reading