14
OCT2008
Understanding M17n (Multilingualization)
Big changes are coming to Ruby in version 1.9 with regard to character encodings. Ruby is going from a language with some of the weakest character encoding support to arguably some of the best support out there for working with different encodings. We're all grown up now.
The downside is that the new code comes with a good-sized learning curve. I would know, because I recently battled through figuring it out so I could add support to the standard CSV library for nearly all of the encodings. It was a battle too. It's brave new territory and there's not a lot of help out there yet for understanding Ruby's new features.
I'm hoping to change that.
This posting will be the start of a new series of blog articles designed to explain the character encoding support in Ruby 1.9. I'm going to assume you know absolutely nothing about character encodings though and begin by explaining in detail what they are and why we have them.
After that, we're going to examine the character encoding support in Ruby 1.8. There's a lot less support there to examine, but it's not well understood and I'm hoping that seeing it in detail will help with understanding how and why Ruby 1.9 is changing.
Finally, we will examine all the new encoding features of Ruby 1.9 in as much detail as possible. We will literally cover it all. Along the way, I'll talk strategy and give you all the helpful tips I know for successfully managing character encodings, both in general and with Ruby specifics.
This message will serve as a table of contents for this series of posts, so you may want to bookmark it if this topic is of interest to you. Here are all of the posts, in order:
- What is a character encoding?
- The Unicode Character Set and Encodings
- General Encoding Strategies
- Bytes and Characters in Ruby 1.8
- The $KCODE Variable and jcode Library
- Encoding Conversion With iconv
- Ruby 1.8 Character Encoding Flaws
- Ruby 1.9's String
- Ruby 1.9's Three Default Encodings
- Miscellaneous M17n Details
- What Ruby 1.9 Gives Us
Comments (16)
-
LUcas EFe October 14th, 2008
Thank you, man. This is really appreciated.
Regards
-
Very informative and nice articles!
Could you give some insights about date localization too? I struggled last time with localizing dates and I'm not happy at all with Ruby's capabilities. This is another weak point in 1.8.x and I'm curious how this will be changing in 1.9.
Thanks,
Attila-
Ruby itself doesn't really provide any localization tools. You will need to look into add-on libraries for this.
The latest versions of Rails have some support for this, but it's pretty young and unrefined, in my opinion. For full-featured support you might want to try a library like Ruby-GetText-Package.
-
-
Fascinatingly, googling for m17n returned a link to this in the top 10 :)
-
Here's another good resource for m17n documentation by Brian Candler: http://github.com/candlerb/string19/tree/master.
-
When you are ready to deal with character encodings from Ruby's C API, Yugui has some tips for that.
-
Just a quick suggestion: it would be nice if you stated that M17N == "multilingualization", for those new to the subject and unfamiliar with the jargon.
-
I do mention it when I get into coverage of 1.9's system, but you're right that I probably should have said it in this introduction.
-
-
Sexy article
-
Thank you so much for your post. It is really helpful for me.
-
Glad I could help.
-
-
Very painful :)
-
I got bit by this today doing some logfile processing, and I was so grateful to find your excellent writeup on the topic. It is both comprehensive and concise. Each entry is well thought out and gets right to the point. I really appreciate your taking the time to explain all this and do it so well.
-
Great series of posts.
From these, I can get a basic understanding about char encoding in Ruby.
thanks~ -
Ruby's Unicode handling is a misfeature. Carrying around the original serialization format for the life of the string is utterly moronic.
Also, your profile picture looks like "Doofy" from Scary Movie.
-
Are we talking about the encoding() method? If so, how would you know the format of the bytes without it?
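To make that question concrete, here is a minimal sketch, assuming Ruby 1.9 or later. It tags the same two bytes with two different encodings and shows that the bytes alone don't say which characters they hold. The variable names are just for illustration.

```ruby
# The same two raw bytes, tagged with two different encodings.
# dup is used so we never mutate a (possibly frozen) literal.
raw = "\xC3\xA9"

as_utf8   = raw.dup.force_encoding("UTF-8")       # one character: é
as_latin1 = raw.dup.force_encoding("ISO-8859-1")  # two characters: Ã and ©

# The underlying bytes are identical...
same_bytes = as_utf8.bytes.to_a == as_latin1.bytes.to_a

# ...but the character counts differ, because the encoding tag
# tells Ruby how to group the bytes into characters.
utf8_length   = as_utf8.length
latin1_length = as_latin1.length

# Transcoding only works because Ruby knows the source encoding.
latin1_as_utf8 = as_latin1.encode("UTF-8")

puts as_utf8.encoding
puts as_latin1.encoding
puts same_bytes
puts utf8_length
puts latin1_length
```

Note that force_encoding() only changes the tag, not the bytes, while encode() actually rewrites the bytes for the target encoding.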
-