This post is a brief technical overview of Unicode, a widely used standard for multilingual character representation, and the family of UTF-x encodings. First, a brief introduction to Unicode:
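The original 1988 proposal (“Unicode 88”) stated the goal like this: Unicode is intended to address the need for a workable, reliable world text encoding.
It roughly described Unicode as “wide-body ASCII” that has been stretched to 16 bits to encompass the characters of all the world’s living languages, and claimed that in a properly engineered design, 16 bits per character are more than sufficient for this purpose. That estimate turned out to be too low: the code space has since been extended beyond 16 bits, as the planes described below show.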
Character Representation: Code Points and Planes
A reference to a specific character is called a code point. ASCII, for example, uses 7 bits per character, which allows for 2^7 = 128 different characters (code points); 8-bit extensions such as Latin-1 allow for 2^8 = 256.
Unicode divides its code space into 17 planes. The low 16 bits (2 bytes) of a code point give its position within a plane, and the remaining high bits give the plane number. Therefore Unicode provides 2^16 = 65,536 unique code points per plane, with 2^16 * 17 = 1,114,112 maximum total unique code points (U+0000 to U+10FFFF).
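The arithmetic above is easy to check. The following is a small Python 3 sketch (an assumption on my part, since the sessions in this post were recorded with Python 2); the constant names are made up for illustration.

```python
# Python 3 sketch: the size of the Unicode code space.
PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16

total = PLANES * CODE_POINTS_PER_PLANE
highest = total - 1

print(total)         # 1114112
print(hex(highest))  # 0x10ffff

# chr() accepts the highest code point and rejects the next value.
last_char = chr(highest)
try:
    chr(highest + 1)
    rejected = False
except ValueError:
    rejected = True
print(rejected)      # True
```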
Currently only 6 of the 17 available planes are used:
|Plane||Code point range||Name|
|0||U+0000 … U+FFFF||Basic Multilingual Plane|
|1||U+10000 … U+1FFFF||Supplementary Multilingual Plane|
|2||U+20000 … U+2FFFF||Supplementary Ideographic Plane|
|14||U+E0000 … U+EFFFF||Supplementary Special-purpose Plane|
|15-16||U+F0000 … U+10FFFF||Supplementary Private Use Area-A and -B|
Unicode code points of the first plane fit in two bytes. Code points of all other planes require a third byte to indicate the plane: it holds the plane number, which is visible as the leading hex digits of the ranges above.
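Since the plane number sits above the low 16 bits, it can be read off with a shift. Here is a minimal Python 3 sketch; `plane_of` is a hypothetical helper name, not a standard library function.

```python
def plane_of(char: str) -> int:
    """Return the plane number (0..16) of a single character."""
    # Everything above the low 16 bits of the code point is the plane number.
    return ord(char) >> 16


for char in ("A", "€", "\U0001F600", "\U00020000"):
    print(f"U+{ord(char):04X} is in plane {plane_of(char)}")
```

The first two characters land in plane 0, U+1F600 in plane 1, and U+20000 in plane 2.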
Code points U+0000 to U+00FF (0-255) are identical to the Latin-1 values, so converting between them simply requires converting code points to byte values. In fact, any document containing only the 128 ASCII characters (code points 0 to 127) is a perfectly valid UTF-8 encoded Unicode document.
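Both compatibility claims can be checked with Python's built-in codecs. This is a Python 3 sketch, where `str` is already a Unicode string; the variable names are only illustrative.

```python
# Every byte value 0..255, decoded as Latin-1, yields the code point
# with the same number.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")
same_values = [ord(ch) for ch in latin1_text] == list(all_bytes)
print(same_values)

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "Hello, world!"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Above U+007F the two encodings differ: one byte in Latin-1, two in UTF-8.
print("é".encode("latin-1"), "é".encode("utf-8"))
```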
Character Encoding: UTF-8, 16 and 32
>>> u = u"€"
>>> u
u'\u20ac'
>>> bytearray(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unicode argument without an encoding
This is where Unicode Transformation Formats (UTF) come into play. A UTF encoding maps each Unicode code point to a sequence of fixed-size blocks: UTF-8 uses a variable number of 8-bit blocks (one to four), UTF-16 uses one or two 16-bit blocks, and UTF-32 uses exactly one 32-bit block.
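In Python 3 the same conversion fails in the same way until an encoding is named. The sketch below assumes Python 3 and uses the little-endian codec variants so that no byte order mark is added.

```python
text = "€"  # U+20AC

# Without an encoding, Python cannot know which bytes to produce.
try:
    bytearray(text)
except TypeError as exc:
    print("TypeError:", exc)

# Naming an encoding resolves the ambiguity.
utf8 = bytearray(text, "utf-8")
utf16 = bytearray(text, "utf-16-le")
utf32 = bytearray(text, "utf-32-le")
print(len(utf8), len(utf16), len(utf32))  # 3 2 4
```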
UTF-8 is a variable-width encoding, with each Unicode code point represented by one to four bytes. A main advantage of UTF-8 is backward compatibility with the ASCII charset, allowing us to use the same decoding function for both ASCII text and UTF-8 encoded Unicode text.
If the character is encoded into just one byte, the high-order bit is 0 and the other bits represent the code point (in the range 0..127). If the character is encoded into a sequence of more than one byte, the first byte has as many leading “1” bits as the total number of bytes in the sequence, followed by a “0” bit, and the succeeding bytes are all marked by a leading “10” bit pattern. The remaining bits in the byte sequence are concatenated to form the Unicode code point value.
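The bit layout described above can be written out by hand. The following Python 3 sketch uses a hypothetical `utf8_encode` function and compares its output with the built-in codec; as a simplification it does not reject the surrogate range U+D800 to U+DFFF, which a real encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the UTF-8 bit layout."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x80:        # 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits: 110xxxxx 10xxxxxx
        lead, continuation_bytes = 0xC0, 1
    elif code_point < 0x10000:   # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        lead, continuation_bytes = 0xE0, 2
    else:                        # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        lead, continuation_bytes = 0xF0, 3
    # The first byte carries the length marker plus the highest payload bits.
    first = lead | (code_point >> (6 * continuation_bytes))
    # Each continuation byte carries the next 6 bits behind a "10" prefix.
    rest = [
        0x80 | ((code_point >> (6 * i)) & 0x3F)
        for i in range(continuation_bytes - 1, -1, -1)
    ]
    return bytes([first] + rest)


for char in ("a", "é", "€", "\U0001F600"):
    encoded = utf8_encode(ord(char))
    assert encoded == char.encode("utf-8")
    print(f"U+{ord(char):04X}", " ".join(f"{byte:08b}" for byte in encoded))
```

Printing the bytes in binary makes the leading “1” bits of the first byte and the “10” prefix of the following bytes visible.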
UTF-16 uses two bytes (one 16-bit block) for each code point of the “Basic Multilingual Plane” (U+0000 to U+FFFF). Code points of the other planes do not fit in 16 bits, so UTF-16 encodes each of them as two 16-bit blocks (four bytes), called a surrogate pair.
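A surrogate pair is computed by subtracting 0x10000 from the code point and splitting the remaining 20 bits into two halves of 10 bits. The Python 3 sketch below illustrates this; `to_surrogate_pair` is a hypothetical helper name.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000       # 20 bits remain
    high = 0xD800 + (offset >> 10)      # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# The built-in codec produces the same two 16-bit blocks.
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(manual == "\U0001F600".encode("utf-16-be"))  # True
```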
UTF-32 always uses exactly four bytes for encoding each Unicode code point. If the endianness is not specified, the encoder additionally prepends a four-byte byte order mark (BOM) to the output.
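Because UTF-32 stores the code point as a plain 32-bit integer, it can be reproduced with `int.to_bytes`. This Python 3 sketch also shows the byte order mark that Python's generic "utf-32" codec adds; which of the two BOMs appears depends on the machine, so the check accepts either.

```python
import codecs

char = "€"  # U+20AC

# With an explicit byte order, UTF-32 is the code point as a 4-byte integer.
big_endian = char.encode("utf-32-be")
little_endian = char.encode("utf-32-le")
print(big_endian == ord(char).to_bytes(4, "big"))        # True
print(little_endian == ord(char).to_bytes(4, "little"))  # True

# Without one, Python prepends a 4-byte BOM in the machine's native order.
with_bom = char.encode("utf-32")
print(len(with_bom))  # 8
print(with_bom[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE))  # True
```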
- UTF-8 can encode any code point of any plane, and uses fewer bytes for lower code points (e.g. 1 byte for the ASCII charset). The first 128 code points are encoded exactly as in ASCII. Recommended for everything related to text.
- UTF-16 stores 16-bit blocks. A code point of plane 0 takes one block; a code point of a higher plane takes two blocks, a surrogate pair. The euro sign € (U+20AC) in the example below is in plane 0, so it takes a single 16-bit block.
- UTF-32 encodes all Unicode code points, and always stores one 32-bit block per code point, regardless of its value.
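To make the comparison concrete, here is a small Python 3 sketch that counts the bytes each encoding needs for characters from different ranges. `encoded_sizes` is a hypothetical helper, and the little-endian codec variants are used so that no byte order mark inflates the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(char: str) -> tuple:
    """Return the byte length of char under each encoding in ENCODINGS."""
    return tuple(len(char.encode(name)) for name in ENCODINGS)


for char in ("a", "é", "€", "\U0001F600"):
    print(f"U+{ord(char):04X}", dict(zip(ENCODINGS, encoded_sizes(char))))
```

Only the last character, U+1F600 from plane 1, needs four bytes in all three encodings.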
>>> u = u"a"
>>> u
u'a'
>>> repr(u.encode("utf-8"))
"'a'"
>>> repr(u.encode("utf-16")) # no endianness specified
"'\\xff\\xfea\\x00'"
>>> repr(u.encode("utf-16-le")) # little endian byte order
"'a\\x00'"
>>> repr(u.encode("utf-16-be")) # big endian byte order
"'\\x00a'"
>>> repr(u.encode("utf-32"))
"'\\xff\\xfe\\x00\\x00a\\x00\\x00\\x00'"
>>> repr(u.encode("utf-32-le"))
"'a\\x00\\x00\\x00'"
>>> repr(u.encode("utf-32-be"))
"'\\x00\\x00\\x00a'"
>>> u = u"€"
>>> u
u'\u20ac'
>>> repr(u.encode("utf-8"))
"'\\xe2\\x82\\xac'"
>>> repr(u.encode("utf-16"))
"'\\xff\\xfe\\xac '"
>>> repr(u.encode("utf-16-le"))
"'\\xac '"
>>> repr(u.encode("utf-16-be"))
"' \\xac'"
>>> repr(u.encode("utf-32"))
"'\\xff\\xfe\\x00\\x00\\xac \\x00\\x00'"
>>> repr(u.encode("utf-32-le"))
"'\\xac \\x00\\x00'"
>>> repr(u.encode("utf-32-be"))
"'\\x00\\x00 \\xac'"
Please leave a comment if you have feedback or questions!