Character Encodings

My extensive coverage of a complex topic all programmers should study a little.
  • 15 OCT 2008

    What is a Character Encoding?

    To understand character encodings, we first need to talk a little about how computers store character data. I know we would love to believe that when we push the a key on our keyboard, the computer records a little a symbol somewhere, but that's just fantasy.

    I imagine most of us know that deep in the heart of a computer pretty much everything is eventually stored as ones and zeros. That means that an a has to be stored as some number. In fact, it is. We can see what number using Ruby 1.8:

    $ ruby -ve 'p ?a'
    ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
    97
    

    The unusual ?a syntax gives us a specific character, instead of a full String. In Ruby 1.8 it does that by returning the code of that encoded character. You can also get this by indexing one character out of a String:

    $ ruby -ve 'p "a"[0]'
    ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
    97
    

    These String behaviors were deemed confusing by the Ruby core team and have been changed in Ruby 1.9. They now return one-character Strings. If you want to see the character codes in Ruby 1.9 you can use getbyte():
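
    For readers on Ruby 1.9 or later, the change can be sketched roughly as follows. This is an illustrative snippet, not the example from the full article; the variable names are made up here, and String#ord is shown next to getbyte() only for comparison.

```ruby
# Assumes Ruby 1.9 or later; variable names are illustrative only.
char  = ?a              # now a one-character String, not an Integer
first = "a"[0]          # indexing also returns a one-character String
byte  = "a".getbyte(0)  # Integer value of the byte at index 0
code  = "a".ord         # Integer code point of the first character

p char   # "a"
p first  # "a"
p byte   # 97
p code   # 97
```

    For a plain ASCII character like a, the byte and the code point happen to be the same number. That stops being true once a character needs more than one byte.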

    Read more…

  • 14 OCT 2008

    Understanding M17n (Multilingualization)

    Big changes are coming to Ruby in version 1.9 with regard to character encodings. Ruby is going from a language with some of the weakest character encoding support to arguably some of the best support out there for working with different encodings. We're all grown up now.

    The downside is that the new code comes with a good-sized learning curve. I would know, because I recently battled through figuring it out so I could add support for nearly all of the encodings to the standard CSV library. It was a battle too. It's brave new territory and there's not a lot of help out there yet for understanding Ruby's new features.

    I'm hoping to change that.

    This post will be the start of a new series of blog articles designed to explain the character encoding support in Ruby 1.9. I'm going to assume you know absolutely nothing about character encodings, though, and begin by explaining in detail what they are and why we have them.

    After that, we're going to examine the character encoding support in Ruby 1.8. There's a lot less support there to examine, but it's not well understood and I'm hoping that seeing it in detail will help with understanding how and why Ruby 1.9 is changing.

    Read more…