Character Encodings

My extensive coverage of a complex topic all programmers should study a little.



Ruby 1.8 Character Encoding Flaws

Now that we have toured the entire landscape of Ruby 1.8's encoding support, we need to discuss the problems the system has. These long-standing issues are what pushed the core team to build the m17n (multilingualization) implementation for Ruby 1.9.

The main problems are:

  • Not enough encodings supported
  • Regexp-only support just isn't comprehensive enough
  • $KCODE is a global setting for all encodings

I imagine most of those are pretty straightforward, but let's talk through them just to make sure we learn from the mistakes of the past. I'm pretty sure this will make it easier to understand why things are the way they are in Ruby 1.9.

The "not enough encodings" complaint should be the most obvious of all. Ruby 1.8 supports four encodings, and one of those is just no encoding. That means you really only get UTF-8 and two Asian encodings (EUC and Shift JIS). The UTF-8 support is how we've managed to make it this far, but there are a ton of common encodings that just aren't covered.

The most important thing to realize here though is that we can't just keep adding encodings to Ruby 1.8. The system wasn't designed with that in mind. We will run out of letters to tack onto the end of a Regexp very fast. It's just not practical.
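To make the letter problem concrete, here is a small sketch of the four encoding flags a Regexp literal accepts. The constant name ENCODING_FLAGS is made up for this illustration, and the sample String uses the \u escape, so the sketch assumes Ruby 1.9 or later, where the same flag letters still work:

```ruby
# ENCODING_FLAGS is a hypothetical name used only for this illustration.
# Each encoding costs one trailing letter, and i, m, x, and o are
# already taken by other Regexp options.
ENCODING_FLAGS = {
  "n" => /./n,  # none: treat the data as raw bytes
  "e" => /./e,  # EUC
  "s" => /./s,  # Shift JIS
  "u" => /./u   # UTF-8
}

resume = "r\u00E9sum\u00E9"  # "résumé", written with escapes (assumes Ruby 1.9+)

# With the u flag, . matches whole characters instead of single bytes.
characters = resume.scan(ENCODING_FLAGS["u"])
```

Every additional encoding would need a spare letter of its own, which is why this approach can't scale much past the four shown here.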

Once we have more encodings, it's time to get serious about wider support. Regexp was probably the one place where the core team was able to reap the biggest rewards from a little encoding hack, but it really just allows us to divide up characters. There's a lot more to encodings than that. What about checking if encoded data is valid, working with character groups, or examining Unicode code points? Regexp alone can't solve all of those problems.
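For contrast, here is roughly what a couple of those operations look like once String itself knows about encodings. This is only a sketch: inspect_text is a hypothetical helper, not part of Ruby, and it assumes Ruby 1.9 or later, since none of these String methods exist in 1.8:

```ruby
# inspect_text is a hypothetical helper, not part of Ruby.
# It assumes Ruby 1.9 or later; none of these String methods exist in 1.8.
def inspect_text(text)
  {
    :valid      => text.valid_encoding?,  # is the data well-formed for its encoding?
    :characters => text.chars.to_a,       # split into characters without a Regexp
    :codepoints => text.codepoints.to_a   # the Unicode code point of each character
  }
end

good = "r\u00E9sum\u00E9"                          # "résumé" as UTF-8
bad  = [0xE9].pack("C*").force_encoding("UTF-8")  # a lone Latin-1 byte, invalid as UTF-8

inspect_text(good)   # all three answers come straight from the String
bad.valid_encoding?  # the broken data can be detected before it causes trouble
```

None of these questions can be answered by tacking a flag onto a Regexp, which is the gap this complaint is about.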

Finally, one giant encoding switch is dangerous. There are several places where you need to know an encoding: the data in a String, the data being read from an IO object, and the encoding your source itself is written in, for example. In Ruby 1.8, you can't differentiate those things. You can just set it in one place. What if I write my source in UTF-8 and set $KCODE accordingly, but then load your library that's written in Shift JIS? One of us isn't going to get our way and that can't be good for the code.
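The alternative to one global switch is to let each piece of data carry its own encoding. The sketch below assumes Ruby 1.9 or later, and describe is a hypothetical helper written just for this illustration:

```ruby
# Assumes Ruby 1.9 or later. describe is a hypothetical helper for illustration.
def describe(text)
  [text.encoding.name, text.length, text.unpack("C*")]
end

utf8 = "\u3042"                  # HIRAGANA LETTER A as a UTF-8 String
sjis = utf8.encode("Shift_JIS")  # the same character transcoded to Shift JIS

describe(utf8)  # one character stored as three UTF-8 bytes
describe(sjis)  # one character stored as two Shift JIS bytes
```

Both Strings live side by side in the same program and each one still counts as a single character, with no global setting to fight over.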

Again, I just wanted to highlight these issues, because I think it will help clarify why things are changing in Ruby 1.9. As we now dig into Ruby 1.9 encodings, keep an eye out for these specific flaws and how they are being addressed…
