Character Encodings

My extensive coverage of a complex topic all programmers should study a little.

14 OCT 2008

Understanding M17n (Multilingualization)

Big changes are coming to Ruby in version 1.9 with regard to character encodings. Ruby is going from a language with some of the weakest character encoding support to arguably some of the best support out there for working with different encodings. We're all grown up now.

The downside is that the new code comes with a good-sized learning curve. I would know because I recently battled through figuring it out so I could add support to the standard CSV library for nearly all of the encodings. It was a battle too. It's brave new territory and there's not a lot of help out there yet for understanding Ruby's new features.

I'm hoping to change that.

This posting will be the start of a new series of blog articles designed to explain the character encoding support in Ruby 1.9. I'm going to assume you know absolutely nothing about character encodings, though, and begin by explaining in detail what they are and why we have them.

After that, we're going to examine the character encoding support in Ruby 1.8. There's a lot less support there to examine, but it's not well understood, and I'm hoping that seeing it in detail will help with understanding how and why Ruby 1.9 is changing.

Finally, we will examine all the new encoding features of Ruby 1.9 in as much detail as possible. We will literally cover it all. Along the way, I'll talk strategy and give you all the helpful tips I know for successfully managing character encodings, both in general and in Ruby specifically.

This message will serve as a table of contents for this series of posts, so you may want to bookmark it if this topic is of interest to you. Here are all of the posts, in order:

  1. What is a character encoding?
  2. The Unicode Character Set and Encodings
  3. General Encoding Strategies
  4. Bytes and Characters in Ruby 1.8
  5. The $KCODE Variable and jcode Library
  6. Encoding Conversion With iconv
  7. Ruby 1.8 Character Encoding Flaws
  8. Ruby 1.9's String
  9. Ruby 1.9's Three Default Encodings
  10. Miscellaneous M17n Details
  11. What Ruby 1.9 Gives Us
Comments (16)
  1. LUcas EFe, October 14th, 2008

    Thank you, man. This is really appreciated.

    Regards

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

  2. Csiszár Attila, December 28th, 2008

    Very informative and nice articles!

    Could you give some insights about date localization too? I struggled last time with localizing dates and I'm not happy at all with Ruby's capabilities. This is another weak point in 1.8.x, and I'm curious how this will be changing in 1.9.

    Thanks,
    Attila

    2. James Edward Gray II, December 28th, 2008

      Ruby itself doesn't really provide any localization tools. You will need to look into add-on libraries for this.

      The latest versions of Rails have some support for this, but it's pretty young and unrefined, in my opinion. For full-featured support you might want to try a library like Ruby-GetText-Package.

  3. roger, July 15th, 2009

    Fascinatingly, googling for m17n returned a link to this in the top 10 :)

  4. James Edward Gray II, August 6th, 2009

    Here's another good resource for m17n documentation by Brian Candler: http://github.com/candlerb/string19/tree/master.

  5. James Edward Gray II, August 6th, 2009

    When you are ready to deal with character encodings from Ruby's C API, Yugui has some tips for that.

  6. Joe Grossberg, January 8th, 2010

    Just a quick suggestion: it would be nice if you stated that M17N == "multilingualization", for those new to the subject and unfamiliar with the jargon.

    2. James Edward Gray II, January 8th, 2010

      I do mention it when I get into coverage of 1.9's system, but you're right that I probably should have said it in this introduction.

  7. Rohan D, February 24th, 2010

    Sexy article

  8. Tin Htay Hlaing, September 16th, 2010

    Thank you so much for your post. It is really supportive for me.

    2. James Edward Gray II, September 16th, 2010

      Glad I could help.

  9. samuelgilman, January 6th, 2011

    Very painful :)

  10. Mike Owens, June 15th, 2011

    I got bit by this today doing some logfile processing, and I was so grateful to find your excellent writeup on the topic. It is both comprehensive and concise. Each entry is well thought out and gets right to the point. I really appreciate your taking the time to explain all this and do it so well.

  11. Kang, May 21st, 2012

    Great series of posts.
    From these, I can get a basic understanding about char encoding in Ruby.
    thanks~

  12. Craig, May 22nd, 2012

    Ruby's Unicode handling is a misfeature. Carrying around the original serialization format for the life of the string is utterly moronic.

    Also, your profile picture looks like "Doofy" from Scary Movie.

    2. James Edward Gray II

      Are we talking about the encoding() method? If so, how would you know the format of the bytes without it?
