Character Encodings

My extensive coverage of a complex topic all programmers should study a little.

14 OCT 2008

Understanding M17n (Multilingualization)

Big changes are coming to Ruby in version 1.9 with regard to character encodings. Ruby is going from a language with some of the weakest character encoding support to arguably some of the best support out there for working with different encodings. We're all grown up now.

The downside is that the new code comes with a good-sized learning curve. I would know because I recently battled through figuring it out so I could add support to the standard CSV library for nearly all of the encodings. It was a battle too. It's brave new territory and there's not a lot of help out there yet for understanding Ruby's new features.

I'm hoping to change that.

This posting will be the start of a new series of blog articles designed to explain the character encoding support in Ruby 1.9. I'm going to assume you know absolutely nothing about character encodings though and begin by explaining in detail what they are and why we have them.

After that, we're going to examine the character encoding support in Ruby 1.8. There's a lot less support there to examine, but it's not well understood and I'm hoping that seeing it in detail will help with understanding how and why Ruby 1.9 is changing.

Finally, we will examine all the new encoding features of Ruby 1.9 in as much detail as possible. We will literally cover it all. Along the way, I'll talk strategy and give you all the helpful tips I know for successfully managing character encodings, in general as well as with Ruby specifics.

This message will serve as a table of contents for this series of posts, so you may want to bookmark it if this topic is of interest to you. Here are all of the posts, in order:

  1. What is a character encoding?
  2. The Unicode Character Set and Encodings
  3. General Encoding Strategies
  4. Bytes and Characters in Ruby 1.8
  5. The $KCODE Variable and jcode Library
  6. Encoding Conversion With iconv
  7. Ruby 1.8 Character Encoding Flaws
  8. Ruby 1.9's String
  9. Ruby 1.9's Three Default Encodings
  10. Miscellaneous M17n Details
  11. What Ruby 1.9 Gives Us

Comments (16)
  1. LUcas EFe
    LUcas EFe October 14th, 2008 Link

    Thank you, man. This is really appreciated.

    Regards

  2. Csiszár Attila
    Csiszár Attila December 28th, 2008 Link

    Very informative and nice articles!

    Could you give some insights about date localization too? I struggled last time with localizing dates and I'm not happy at all with Ruby's capabilities. This is another weak point in 1.8.x and I'm curious how this will be changing in 1.9?

    Thanks,
    Attila

    1. James Edward Gray II
      James Edward Gray II December 28th, 2008 Link

      Ruby itself doesn't really provide any localization tools. You will need to look into add-on libraries for this.

      The latest versions of Rails have some support for this, but it's pretty young and unrefined, in my opinion. For full featured support you might want to try a library like Ruby-GetText-Package.

  3. roger
    roger July 15th, 2009 Link

    Fascinatingly, googling for m17n returned a link to this in the top 10 :)

  4. James Edward Gray II
    James Edward Gray II August 6th, 2009 Link

    Here's another good resource for m17n documentation by Brian Candler: http://github.com/candlerb/string19/tree/master.

  5. James Edward Gray II
    James Edward Gray II August 6th, 2009 Link

    When you are ready to deal with character encodings from Ruby's C API, Yugui has some tips for that.

  6. Joe Grossberg
    Joe Grossberg January 8th, 2010 Link

    Just a quick suggestion: it would be nice if you stated that M17N == "multilingualization", for those new to the subject and unfamiliar with the jargon.

    1. James Edward Gray II
      James Edward Gray II January 8th, 2010 Link

      I do mention it when I get into coverage of 1.9's system, but you're right that I probably should have said it in this introduction.

  7. Rohan D
    Rohan D February 24th, 2010 Link

    Sexy article

  8. Tin Htay Hlaing
    Tin Htay Hlaing September 16th, 2010 Link

    Thank you so much for your post. It is really supportive for me.

    1. James Edward Gray II
      James Edward Gray II September 16th, 2010 Link

      Glad I could help.

  9. samuelgilman
    samuelgilman January 6th, 2011 Link

    Very painful :)

  10. Mike Owens
    Mike Owens June 15th, 2011 Link

    I got bit by this today doing some logfile processing, and I was too grateful to find your excellent writeup on the topic. It is both comprehensive and concise. Each entry is well thought out and gets right to the point. I really appreciate your taking the time to explain all this and do it so well.

  11. Kang
    Kang May 21st, 2012 Link

    Great series of posts.
    From these, I can get a basic understanding about char encoding in Ruby.
    thanks~

  12. Craig
    Craig May 22nd, 2012 Link

    Ruby's Unicode handling is a misfeature. Carrying around the original serialization format for the life of the string is utterly moronic.

    Also, your profile picture looks like "Doofy" from Scary Movie.

    1. James Edward Gray II
      James Edward Gray II May 22nd, 2012 Link

      Are we talking about the encoding() method? If so, how would you know the format of the bytes without it?
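
The question in that last reply can be illustrated with a minimal sketch. It is not taken from the series itself; the variable names are made up for illustration, and it assumes a Ruby 1.9 or later interpreter. It shows that the same bytes can stand for different characters, so a string needs its encoding label to be interpreted:

```ruby
# The same two bytes, labeled with two different encodings.
# All variable names here are hypothetical, chosen for illustration.
raw = [0xC3, 0xA9].pack("C*")   # pack("C*") yields a binary (ASCII-8BIT) string

# force_encoding only changes the label; the bytes are left untouched.
as_utf8   = raw.dup.force_encoding("UTF-8")
as_latin1 = raw.dup.force_encoding("ISO-8859-1")

# Identical bytes...
same_bytes = as_utf8.bytes == as_latin1.bytes

# ...but a different number of characters, depending on the label.
utf8_chars   = as_utf8.chars.size     # 1 character under UTF-8
latin1_chars = as_latin1.chars.size   # 2 characters under ISO-8859-1

# encode (unlike force_encoding) transcodes: the characters stay the
# same and the bytes change to represent them in the target encoding.
latin1_as_utf8 = as_latin1.encode("UTF-8")

puts "same bytes? #{same_bytes}"
puts "characters as UTF-8: #{utf8_chars}, as ISO-8859-1: #{latin1_chars}"
puts "bytes after transcoding to UTF-8: #{latin1_as_utf8.bytesize}"
```

Without the label there would be no way to tell which of the two readings was intended, which seems to be the point behind asking about encoding().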