The Unicode Character Set and Encodings
Since the rise of the various character encodings, there has been a quest to find the one perfect encoding we could all use. It's hard to get everyone to agree about whether or not this has truly been accomplished, but most of us agree that Unicode is as close as it gets.
The goal of Unicode was literally to provide a character set that includes all characters in use today. That's letters and numbers for all languages, all the images needed by pictographic languages, and all symbols. As you can imagine, that's quite a challenging task, but they've done very well. Take a moment to browse all the characters in the current Unicode specification to see for yourself. The Unicode Consortium often reminds us that they still have room for more characters as well, so we will be all set when we start meeting alien races.
Now in order to really understand what Unicode is, I need to clear up a point I've played pretty loose with so far: a character set and a character encoding aren't necessarily the same thing. Unicode is one character set, and has multiple character encodings. Allow me to explain.
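As a rough illustration of that distinction, here is a minimal sketch, assuming Ruby 1.9 or later and a UTF-8 source file. The character é and the three encodings are just my picks for the demonstration. The code point is the character set's number for the character, while each encoding turns that same number into a different series of bytes:

```ruby
# One character (one Unicode code point), several encodings.
char = "\u00E9" # LATIN SMALL LETTER E WITH ACUTE, code point U+00E9

# The character set's number for this character, independent of encoding.
code_point = char.ord

# The same code point serialized under three Unicode encodings.
utf8_bytes  = char.encode("UTF-8").bytes.to_a
utf16_bytes = char.encode("UTF-16BE").bytes.to_a
utf32_bytes = char.encode("UTF-32BE").bytes.to_a

puts format("U+%04X", code_point) # U+00E9
p utf8_bytes                      # [195, 169]
p utf16_bytes                     # [0, 233]
p utf32_bytes                     # [0, 0, 0, 233]
```

One character set entry, three different byte layouts: that is the sense in which Unicode has multiple encodings.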
What is a Character Encoding?
The first step to understanding character encodings is that we're going to need to talk a little about how computers store character data. I know we would love to believe that when we push the a key on our keyboard, the computer records a little a symbol somewhere, but that's just fantasy.
I imagine most of us know that deep in the heart of computers pretty much everything is eventually in terms of ones and zeros. That means that an a has to be stored as some number. In fact, it is. We can see what number using Ruby 1.8:
$ ruby -ve 'p ?a'
ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
97
The ?a syntax gives us a specific character, instead of a full String. In Ruby 1.8 it does that by returning the code of that encoded character. You can also get this by indexing one character out of a String:
$ ruby -ve 'p "a"[0]'
ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
97
These String behaviors were deemed confusing by the Ruby core team and have been changed in Ruby 1.9. They now return one-character Strings. If you want to see the character codes in Ruby 1.9 you can use
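For comparison, here is a small sketch of a few ways to look at character codes, assuming Ruby 1.9 or later. The methods shown (String#ord, String#getbyte, and String#bytes) are my own selection of core methods and not necessarily the ones the original text went on to name:

```ruby
str = "a"

literal    = ?a              # "a": the ?a syntax now yields a one-character String
first_char = str[0]          # "a": indexing also yields a one-character String
code       = str.ord         # 97: code point of the first character
first_byte = str.getbyte(0)  # 97: the first raw byte of the String
all_bytes  = "abc".bytes.to_a # [97, 98, 99]: every byte, in order

p literal, first_char, code, first_byte, all_bytes
```

For a plain ASCII letter the code point and the byte are the same number, which is why ord and getbyte(0) agree here. They differ as soon as a character needs more than one byte.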
Understanding M17n (Multilingualization)
Big changes are coming to Ruby in version 1.9 with regard to character encodings. Ruby is going from a language with some of the weakest character encoding support to arguably some of the best support out there for working with different encodings. We're all grown up now.
The downside is that the new code comes with a good-sized learning curve. I would know because I recently battled through figuring it out so I could add support to the standard CSV library for nearly all of the encodings. It was a battle too. It's brave new territory and there's not a lot of help out there yet for understanding Ruby's new features.
I'm hoping to change that.
This posting will be the start of a new series of blog articles designed to explain the character encoding support in Ruby 1.9. I'm going to assume you know absolutely nothing about character encodings though and begin by explaining in detail what they are and why we have them.
After that, we're going to examine the character encoding support in Ruby 1.8. There's a lot less support there to examine, but it's not well understood and I'm hoping that seeing it in detail will help with understanding how and why Ruby 1.9 is changing.
Getting FasterCSV Ready for Ruby 1.9
The call came down from on high just before the Ruby 1.9 release: replace the standard CSV library with faster_csv.rb. With only hours to make the change it was a little harder than I expected. The FasterCSV code base was pretty vanilla Ruby, but it required more work than I would have guessed to get running on Ruby 1.9. Let me share a few of the tips I learned while doctoring the code in the hope that it will help others get their code ready for Ruby 1.9.
String Class Grows Up
One of the biggest changes in Ruby 1.9 is the addition of m17n (multilingualization). This means that Ruby's Strings are now encoding aware and we must clarify in our code if we are working with bytes, characters, or lines.
This is a good change, but the odds are that most of us have lazily used the old way to our advantage in the past. If you've ever written code like:
lines = str.to_a
you have bad habits to break. I sure did. Under Ruby 1.9 that code would translate to:
lines = str.lines.to_a
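To make the bytes, characters, and lines distinction concrete, here is a minimal sketch, assuming Ruby 1.9 or later and a UTF-8 source file. The sample text is made up for the illustration:

```ruby
# Two lines of text, each containing one non-ASCII character.
str = "caf\u00E9\nna\u00EFve\n"

lines = str.lines.to_a # split on line endings, keeping the "\n"
chars = str.chars.to_a # one entry per character
bytes = str.bytes.to_a # one entry per raw byte

puts lines.length # 2
puts chars.length # 11
puts bytes.length # 13
```

The character and byte counts differ because each accented letter takes two bytes in UTF-8. Asking for lines, chars, or bytes explicitly tells Ruby which of these views the code actually means.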
The Ruby VM: Episode IV
We've talked about threads, so let's talk a little about character encodings. This is another big change planned for Ruby's future. Matz, you have stated that you plan to add m17n (multilingualization) support to Ruby. Can you talk a little about what that change actually means for Ruby users?
Nothing much, except for some incompatibility in string manipulation, for example, ?a will return "a" instead of 97, and string indexing will be based on character instead of byte. I guess the biggest difference is that we can officially declare we support Unicode. ;-)
Unlike Perl or Python, Ruby's M17N is not Unicode based (Universal Character Set or UCS). It's character set independent (CSI). It will handle Unicode, along with other encoding schemes such as ISO8859 or EUC-JP etc. without converting them into Unicode.
Some misunderstand our motivation. We are no Unicode haters. Rather, I'd love to use Unicode if the situation allows. We hate conversion between character sets. For historical reasons, there are many varieties of character sets. For example, the Shift_JIS character set has at least 5 variations, which differ from each other in a few character mappings. Unfortunately, we have no way to distinguish them. Thus conversion may cause information loss. If a language provides Unicode centric text manipulation, there's no way to avoid the problem, as long as we use that language.
On my policy, I escape from this topic :)
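As a rough sketch of what the character set independent approach described in this answer looks like from the user's side, assuming Ruby 1.9 or later: a String can carry EUC-JP data and be handled character by character without ever being converted to Unicode. The sample bytes are my own choice (the EUC-JP encoding of a three-character Japanese word), and conversion happens only when the code asks for it:

```ruby
# Raw EUC-JP bytes for a three-character Japanese word (assumed sample data).
euc = [0xC6, 0xFC, 0xCB, 0xDC, 0xB8, 0xEC].pack("C*")

# Tag the bytes with their real encoding. No bytes are changed.
euc.force_encoding("EUC-JP")

p euc.encoding        # #<Encoding:EUC-JP>
p euc.valid_encoding? # true
p euc.length          # 3 characters
p euc.bytesize        # 6 bytes

# Conversion to a Unicode encoding is an explicit, separate step.
utf8 = euc.encode("UTF-8")
p utf8.encoding       # #<Encoding:UTF-8>
p utf8.bytesize       # 9 bytes
```

Character counting works on the EUC-JP data directly. The call to encode is the only place where a mapping between character sets is applied, which is where the information loss mentioned above could occur for ambiguous variants.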