What is a Character Encoding?
The first step to understanding character encodings is that we're going to need to talk a little about how computers store character data. I know we would love to believe that when we push the a key on our keyboard, the computer records a little a symbol somewhere, but that's just fantasy.

I imagine most of us know that deep in the heart of computers pretty much everything is eventually in terms of ones and zeros. That means that an a has to be stored as some number. In fact, it is. We can see what number using Ruby 1.8:
$ ruby -ve 'p ?a'
ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
97
The ?a syntax gives us a specific character, instead of a full String. In Ruby 1.8 it does that by returning the code of that encoded character. You can also get this by indexing one character out of a String:

$ ruby -ve 'p "a"[0]'
ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
97
These String behaviors were deemed confusing by the Ruby core team and have been changed in Ruby 1.9. They now return one-character Strings. If you want to see the character codes in Ruby 1.9 you can use getbyte():
$ ruby_dev -ve 'p "a".getbyte(0)'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
97
That shows us how to get the magic number, but it doesn't tell us what the number really is. When it was decided that we would need to store character data as numbers, a simple chart was made mapping some numbers to certain characters. This mapping is known as US-ASCII, or just ASCII.
Now ASCII covers everything you would find on an English keyboard: letters in upper and lower case, numbers, and some common symbols. There was even some room left in the 128 character ASCII mapping for some control character sequences.
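You can poke at the ASCII mapping from both directions in Ruby 1.9 using ord() and chr() (a quick sketch; these methods aren't shown in the transcripts above, but they are part of 1.9's String and Integer):

```ruby
# encoding: UTF-8

# Character to code, and code back to character, via the ASCII chart.
code = "a".ord    # => 97
char = 97.chr     # => "a"

# ASCII only needs seven bits, so every ASCII code is below 128.
puts "a".ord < 128  # prints true
```
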
Life was perfect, right? Uh, no.
This led to two facts that went together beautifully:
- The entire world can't quite get by on just these characters, surprisingly enough
- We had more room in each byte since ASCII was only using seven of the eight bits in a byte (that's how you get 128 characters)
Awesome. We still had a spare bit that could buy us 128 more characters and we needed more characters. It was serendipity! Just about everyone had great ideas for how we should use these extra 128 characters and they all used them in their own way. Character encodings were born.
Because those extra 128 characters could change meaning depending on exactly whose scheme we're using now, we say the character data is encoded in that scheme. You will need to know which encoding was used for that data to read it correctly.
To give one specific example, the character encoding ISO-8859-1 (also known as Latin-1) is a common default in some operating systems, programs, and even programming languages. It fills the extra characters primarily with accented characters useful to many European languages.
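Ruby 1.9's encoding API lets us peek at one of those extra characters. A sketch, using the accented é as the example character:

```ruby
# encoding: UTF-8

# "é" is one of the characters Latin-1 packs into the extra 128 codes:
# it gets the single byte 0xE9 (233), just beyond ASCII's 0-127 range.
latin = "é".encode("ISO-8859-1")
p latin.bytes.to_a  # prints [233]
p latin.bytesize    # prints 1
```
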
Now if it was really just about those extra 128 characters, things still wouldn't be too tricky. Unfortunately, there's one more twist: even 256 characters aren't enough for some languages. Since 256 is all the numbers we can squeeze out of one little byte, these languages need multibyte character encodings, where it can take more than just one byte to represent a single character.
Multibyte encodings are generally trickier to work with. You have to be very careful not to divide data in such a way that a character might be split between the first and second byte (or between other bytes for bigger encodings).
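We can see the danger in Ruby 1.9 by chopping a multibyte character in half at the byte level (a sketch; é happens to take two bytes in the UTF-8 encoding):

```ruby
# encoding: UTF-8

# "é" is a single character, but two bytes in UTF-8.
p "é".bytes.to_a  # prints [195, 169]

# Keep only the first byte and we've split the character between
# its first and second byte -- the result is no longer valid UTF-8.
broken = "é".bytes.to_a.first(1).pack("C*").force_encoding("UTF-8")
p broken.valid_encoding?  # prints false
```
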
Japanese is a great example here. Because they have symbols for most words instead of just the pieces used to make words, their language has a few thousand symbols in common usage. One popular Japanese character encoding is Shift JIS and it needs two bytes to fit some of these characters in.
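Ruby 1.9 can show us those two bytes directly. A sketch using the hiragana character あ as the example:

```ruby
# encoding: UTF-8

# The hiragana character あ needs two bytes in Shift JIS,
# but it's still just one character.
a = "あ".encode("Shift_JIS")
p a.bytesize  # prints 2
p a.length    # prints 1
```

Note that length() counts characters while bytesize() counts bytes; in Ruby 1.9 they can disagree, which is exactly the multibyte wrinkle described above.
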
I've only shared a few specific examples here, but the truth is that there are quite a few encodings in common usage today. You don't necessarily need to support all of these encodings in every program and, in truth, there are some good reasons not to. A good first step is just being aware that different encodings exist and different people store their data in different ways. Modern day programmers can no longer afford to remain ignorant of these issues.
If you think about it, I'm sure you can imagine instances where the encoding was wrong. Ever seen a slew of question marks or funny box shaped characters in your email client or shell? Often this is a sign of the data not being encoded in the scheme the program expected. This led to the program not being able to display the content correctly. That's what we're trying to avoid.
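You can manufacture that kind of garbage on purpose in Ruby 1.9 by lying about a String's encoding. A sketch: force_encoding() changes the label on the data without touching the bytes, which is exactly the mistake a confused program makes.

```ruby
# encoding: UTF-8

# The two UTF-8 bytes of "é", mislabeled as Latin-1, read back
# as two separate Latin-1 characters instead of one.
mislabeled = "é".dup.force_encoding("ISO-8859-1")
p mislabeled.encode("UTF-8")  # prints "Ã©"
```
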
Key take-away points:
- Different people the world over store their data in different ways
- All character data has some encoding scheme that tells you how to interpret the data
- You must know the encoding data is in to correctly process it
- Some encodings are harder to work with than others, especially multibyte encodings
- Junk output, like question marks and box-shaped characters, is often what you see when programs get confused about the character encoding data is in