Chonk Apps: Unicode, UTF-8 and Egyptian horse guy - An introduction to character encoding

In the 'olden days' there used to be only 128 possible characters. This was known as the ASCII character set, and each character was represented by a number (http://www.asciitable.com/). So the letter A was represented by the number 65, and 65 would be stored in memory in a single byte as 01000001. It was simple in the olden days.

A = 65

B = 66

C = 67
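If you want to check those numbers yourself, here's a minimal Python sketch. The helper name `ascii_byte` is made up for this example; `ord` and `format` are standard built-ins.

```python
def ascii_byte(ch: str) -> str:
    """Return the 8-bit binary string that stores a single ASCII character."""
    code = ord(ch)  # 'A' -> 65
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "08b")  # pad to a full byte, e.g. 01000001


for letter in "ABC":
    print(letter, "=", ord(letter), "=", ascii_byte(letter))
```

Anything outside the 128 ASCII characters (such as Ü) makes the helper raise an error, which is exactly the limitation described next.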

Unfortunately, some people wanted to display more than those 128 characters in their documents and emails, such as the German umlaut or the French accented e.

Ü = ?

So instead of telling these people to belt up and learn to use the characters they'd got, a new standard was invented called Unicode.

Unicode is really nice because it aims to cover the characters of every written language EVER CONCEIVED BY MAN. Yes, that's right, even Egyptian hieroglyphs can be stored using Unicode (if you have a minute, check these out, they're amazing, and SO much more interesting than the boring old ASCII set, anyway I digress....)

Unicode is similar to ASCII in that characters are mapped to unique identifiers. These identifiers are still numbers, but they are called code points and are usually written in hexadecimal. They look like this:

U+1302C

The "U+" means "Unicode" and the "1302C" is a hexadecimal value representing an Egyptian hieroglyph of a man standing with each leg on a horse (no kidding, the guy's probably a circus performer or something).
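To make the notation concrete, here's a small Python sketch that converts between a "U+" label and the character itself. The function names `parse_code_point` and `format_code_point` are invented for this example and aren't part of any library.

```python
def parse_code_point(label: str) -> int:
    """Turn a label such as 'U+1302C' into the integer 0x1302C."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"expected a label like 'U+0041', got {label!r}")
    return int(label[2:], 16)  # the part after 'U+' is hexadecimal


def format_code_point(ch: str) -> str:
    """Turn a single character back into its 'U+XXXX' label."""
    return f"U+{ord(ch):04X}"  # at least four hex digits, by convention


horse_guy = chr(parse_code_point("U+1302C"))
print(format_code_point(horse_guy))
print(format_code_point("A"))
```

Whether `horse_guy` shows up as a glyph when you print it depends on the fonts installed on your machine.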

Code points are a handy way of referring to particular characters. Unfortunately they offer no help at all when it comes to actually storing these characters in computer memory.

This is where encodings come to the rescue. An encoding dictates how a particular character should be stored in memory. Since computers are, underneath it all, simple creatures which only understand 1s and 0s, all character encoding schemes map characters to binary data.

The most popular encoding for Unicode text is UTF-8 (Unicode Transformation Format, 8-bit). The 8 refers to the fact that characters are stored as sequences of 8-bit bytes (also known as octets). UTF-8 uses "variable width encoding", which just means that each character is stored using 1 to 4 bytes, depending on its code point. ASCII, by contrast, only ever uses 1 byte per character, which is where all the problems arose from in the first place!
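Here's a quick way to see the variable width in action, using Python's built-in `str.encode`. The euro sign is my own extra sample; the helper name `utf8_length` is made up for this sketch.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses to store one character."""
    return len(ch.encode("utf-8"))


samples = [
    "A",           # U+0041, plain ASCII
    "\u00dc",      # U+00DC, the German U with umlaut
    "\u20ac",      # U+20AC, the euro sign
    chr(0x1302C),  # U+1302C, horse guy
]

for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")
```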

Anyway, here's an example of UTF-8 encoding using the horse guy character, which uses 4 bytes:

Code point: U+1302C
Byte values (Binary): 11110000 10010011 10000000 10101100
Byte values (Hex): f0 93 80 ac
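You can reproduce those byte values with Python's built-in encoder. This is only a check on the example above; the variable names are arbitrary.

```python
encoded = chr(0x1302C).encode("utf-8")  # horse guy as UTF-8 bytes

# Show each byte as 8 binary digits and as 2 hex digits.
as_binary = " ".join(format(b, "08b") for b in encoded)
as_hex = " ".join(format(b, "02x") for b in encoded)

print("Byte values (Binary):", as_binary)
print("Byte values (Hex):", as_hex)
```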

The UTF-8 encoder takes the Unicode code point 1302C and converts it into 4 bytes of data. Now, the big question: how does it do this magical encoding? Well, it's quite simple once you get your head round it. Here's how:

1. The code point is converted into a binary value

1302C (Hex) = 10011000000101100 (Binary)
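Step 1 in Python looks something like this (a sketch using only built-ins):

```python
code_point = int("1302C", 16)   # read the hex digits as a number
bits = format(code_point, "b")  # binary digits with no leading zeros

print(bits)
print(len(bits), "bits")
```

`code_point.bit_length()` gives the same bit count without building the string.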

2. The encoder decides how many bytes are required to store this code point. This is decided by the number of bits in the binary value. The rules are:

7 bits or fewer: 1 byte (Note: 1 byte characters have exactly the same format as ASCII. 2-4 byte characters follow a COMPLETELY different format)
8-11 bits: 2 bytes
12-16 bits: 3 bytes
17-21 bits: 4 bytes

Since horse guy has 17 bits he needs 4 bytes.
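The rules in step 2 can be sketched as a small function. The name `utf8_bytes_needed` is hypothetical, and the sketch only looks at the bit count; it does not check that the value is a valid Unicode code point.

```python
def utf8_bytes_needed(code_point: int) -> int:
    """Apply the bit-count rules above to pick the UTF-8 sequence length."""
    bits = code_point.bit_length()
    if bits <= 7:
        return 1
    if bits <= 11:
        return 2
    if bits <= 16:
        return 3
    if bits <= 21:
        return 4
    raise ValueError("more than 21 bits will not fit in 4 UTF-8 bytes")


print(utf8_bytes_needed(0x1302C))  # horse guy
```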

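3. Once the number of bytes is determined, the byte sequence is created. For a 4 byte character like horse guy, the binary value is padded with leading zeros to 21 bits (000010011000000101100) and split into groups of 3, 6, 6 and 6 bits (000 010011 000000 101100). The first byte is 11110 followed by the first group, and each of the remaining bytes is 10 followed by the next group, which gives 11110000 10010011 10000000 10101100.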

And that's that. There's a more in-depth explanation here: http://en.wikipedia.org/wiki/UTF-8#Design.
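Putting the three steps together, a hand-rolled encoder might look like the sketch below. The function name `utf8_encode` is my own, and this is an illustration rather than how any real library does it. In particular it doesn't reject the surrogate range U+D800 to U+DFFF, which Python's own encoder refuses.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point < 0x80:  # 7 bits or fewer: same as ASCII
        return bytes([code_point])
    if code_point < 0x800:  # up to 11 bits
        count, lead = 2, 0b11000000
    elif code_point < 0x10000:  # up to 16 bits
        count, lead = 3, 0b11100000
    else:  # up to 21 bits
        count, lead = 4, 0b11110000

    out = []
    # Peel off 6 bits at a time from the right; each becomes a 10xxxxxx byte.
    for _ in range(count - 1):
        out.append(0b10000000 | (code_point & 0b111111))
        code_point >>= 6
    # Whatever is left goes into the first byte after its marker bits.
    out.append(lead | code_point)
    return bytes(reversed(out))


print(" ".join(format(b, "02x") for b in utf8_encode(0x1302C)))
```

Comparing the result with `chr(0x1302C).encode("utf-8")` is an easy sanity check.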

So now you know what Unicode and UTF-8 are, you probably want to have a play around with horse guy yourself (erm, so to speak...). You can do so like this:

1. Download and install the Aegyptus font for displaying Egyptian hieroglyphs from here http://users.teilar.gr/~g1951d/. (Drag the .ttf file into C:\Windows\Fonts)
2. Download BabelPad from here http://www.babelstone.co.uk/Software/BabelPad.html
3. Run BabelPad, go to Options->Single Font, and choose Aegyptus in the font list.
4. Go to Tools->Character Map, choose Single Font and Aegyptus.
5. Type 1302C in the Go to code point field. This should bring up all the Egyptian hieroglyphs. Click horse guy, then Insert. This should insert horse guy into your text file. You may have to increase the font size to see the glyph properly.
6. Go to File->Save As. Under encoding choose UTF-8. Uncheck the Byte Order Mark box (UTF-8 doesn't need one, and the Unicode Standard neither requires nor recommends it).
7. Choose a filename and click Save.
8. Open your text file in a hex editor (I can recommend frhed). You should see the following hex values: f0 93 80 ac.
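If you don't fancy installing anything, the same experiment can be roughed out in Python. This sketch writes horse guy to a temporary file as UTF-8 without a Byte Order Mark, then reads the raw bytes back. The function name `write_and_dump` is made up for this example.

```python
import os
import tempfile


def write_and_dump(text: str) -> str:
    """Save text as UTF-8 (no BOM) and return the file's bytes as hex."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # "utf-8" writes no Byte Order Mark; "utf-8-sig" would add one.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        # Read the file back in binary mode, like a hex editor would.
        with open(path, "rb") as f:
            raw = f.read()
    finally:
        os.remove(path)
    return " ".join(format(b, "02x") for b in raw)


print(write_and_dump(chr(0x1302C)))
```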

I hope this has been of some use to someone. Please let me know if you didn't understand anything and I'll try to make it clearer. I have to confess I read this great article from Joel on Software first, but I still didn't get how UTF-8 worked, hence this article.

I have also written a UTF-8 calculator so you can convert your code points into UTF-8 byte sequences here.

Wednesday, 28 September 2011