Ruby 1.8 Character Encoding Flaws
Now that we have toured the entire landscape of Ruby 1.8's encoding support, we need to discuss the problems the system has. These long standing issues are what pushed the core team to build the m17n (multilingualization) implementation for Ruby 1.9.
The main problems are:
- Not enough encodings supported
Regexp-only support just isn't comprehensive enough
$KCODEis a global setting for all encodings
I imagine most of those are pretty straightforward, but let's talk through them just to make sure we learn from the mistakes of the past. I'm pretty sure this will make it easier to understand why things are the way they are in Ruby 1.9.
The "not enough encodings" complaint should be the most obvious of all. Ruby 1.8 supports four and one is just no encoding. That means you really only get UTF-8 and two Asian encodings. The UTF-8 support is how we've managed to make it this far, but there are a ton of common encodings that just aren't covered.
The most important thing to realize here though is that we can't just keep adding encodings to Ruby 1.8. The system wasn't designed with that in mind. We will run out of letters to tack onto the end of a
Regexpvery fast. It's just not practical.
The Definitive Guide to SQLite
I'm a huge fan of SQLite. Every time I do something with the little database it always manages to impress me in new ways. Here's a pop quiz for you:
- Did you know SQLite is totally free? I mean really free. All the code is in the public domain, protected by affidavits, and you can literally do anything you like with it.
- Did you know SQLite uses "manifest typing" which is similar to Ruby's dynamic typing? The database engine will really allow you to handle field types in whatever way is best for your needs. Of course, you can do type checking in triggers if you prefer to be more strict.
- Everyone knows SQLite shoves an entire database in one file, but did you know that it can work with more than one of those files at once? Yes, SQLite can query across multiple databases.
I could go on and on. Really, I could. SQLite is that cool.
It's almost silly to use flat files these days. If you find yourself needing one, you can load one gem instead, stick a full database in the file, take advantage of transactions and locking (very multiprocessing friendly), gain a full query language for working with the data, and have a prebuilt human interface completely separate from your code (the command-line tool is great for debugging). It's hard to beat that.
XMPP and Metaprogramming Screencasts
I've mentioned some nice screencasts I've found in the past. Well, I've been watching quite a few more lately and I've uncovered some more hits.
First, PeepCode has another excellent screencast on using XMPP with Ruby. This video explains what XMPP is and isn't, why it's important, and shows a good deal of information about how you can work with the protocol to accomplish some real world server to server or human communication tasks. You don't need any prior XMPP knowledge going into this one.
It's hard to overstate exactly how much PeepCode got right with this video. For example, I've seen quite a few screencasts now that byte off more than they can chew for a short video. That's not the case here. XMPP turns out to be perfectly bite sized in that a one hour video can serve as a strong introduction to pretty much all you need to know when using it. This has other advantages too. Since the creator isn't trying to squeeze too much content into too short of time, he can afford to drop some truly stellar related tips. In the case of the XMPP video these are what IM client to use when debugging, because it allows you to see the underlying protocol, and how to easily combine XMPP with DRb for fire-and-forget messaging. These extras really push this screencast over the top.
Encoding Conversion With iconv
There's one last standard library we need to discuss for us to have completely covered Ruby 1.8's support for character encodings. The
iconvlibrary ships with Ruby and it can handle an impressive set of character encoding conversions.
This is an important piece of the puzzle. You may have accepted my advice that it's OK to just work with UTF-8 data whenever you have the choice, but the fact is that there's a lot of non-UTF-8 data in the world. Legacy systems may have produced data before UTF-8 was popular, some services may work in different encodings for any number of reasons, and not quite everyone has embraced Unicode fully yet. If you run into data like this, you will need a way to convert it to UTF-8 as you import it and possibly a way to convert it back when you export it. That's exactly what
Instead of jumping right into Ruby's
iconvlibrary, let's come at it with a slightly different approach.
iconvis actually a C library that performs these conversions and on most systems where it is installed you will have a command-line interface for it.
I'm sure everyone has noticed that my blog posting has dramatically fallen off from the rate I was getting articles out. Unfortunately, I've been spending my blog time fighting the endless war against spam. I've made some progress there and thought I would share some details that others might find useful.
As I've covered previously this blog now requires me to approve all comments. I'm super happy with this decision. I approve posts promptly, so there's pretty much no downside for users and this means you have not seen a single spam message on this site since I made the change. This was literally the perfect solution… on the viewer's side of the fence.
What it didn't fix was the hassle on my side. I don't mind approving messages at all, as long as I have a reasonable pile to go through. However, the spammers really ramped up their efforts against me lately and this blog received 11,134 comment posts in the month of November alone. Six of those were legitimate comments. That exceeds my definition of reasonable.
The $KCODE Variable and jcode Library
All of the Ruby files I create start with the same Shebang line:
#!/usr/bin/env ruby -wKU
It's not really needed for every file since it generally only matters if the file is executed. However, I tend to go ahead and add it to all Ruby files I build for several reasons:
- You never know when a file may be executed (
if __FILE__ == $PROGRAM_NAME; endsections are often added to libraries, for example)
- It makes it obvious the file is Ruby code
- It shows the rules this code expects
The rules I mention here, specified by command-line switches, are the main point of interest.
-wturns on Ruby's warnings which are very handy. I recommend doing that whenever you can. But that doesn't have anything to do with character encodings.
-KUsets a magic Ruby variable:
$KCODE. You can do the same in your code if you aren't in a position to control the command-line arguments:
$KCODE = "U"
You probably recognize the
Uas a name for Ruby 1.8's UTF-8 encoding, from my earlier list of encodings. It can also be set to
S. Modern versions of Rails do set
$KCODE = "U"for you.
- You never know when a file may be executed (
Bytes and Characters in Ruby 1.8
Gregory Brown said, in a training session at the Lone Star Rubyconf, "Ruby 1.8 works in bytes. Ruby 1.9 works in characters." The truth of Ruby 1.9 is maybe a little more complicated and we will discuss all of that eventually, but Greg is dead right about Ruby 1.8.
In Ruby 1.8, a
Stringis always just a collection of bytes.
The important question is, how does that one golden rule relate to all that we've learned about character encodings? Essentially, it puts all the responsibility on you as the developer. Ruby 1.8 leaves it to you to determine what to do with those bytes and it doesn't provide a lot of encoding savvy help. That's why knowing at least the basics of encodings is so important when working with Ruby 1.8.
There are plusses and minuses to every system and this one is no exception. On the side of plusses, Ruby 1.8 can pretty much support any encoding you can imagine. After all, a character encoding is just some bytes that somehow map to a set of characters and all Ruby 1.8
Strings are just some bytes. If you say a
Stringholds Latin-1 data and treat it as such, that's fine by Ruby.
General Encoding Strategies
Before we get into specifics, let's try to distill a few best practices for working with encodings. I'm sure you can tell that there's a lot that needs to be considered with encodings, so let's try to focus in on a few key points that will help us the most.
Use UTF-8 Everywhere You Can
We know UTF-8 isn't perfect, but it's pretty darn close to perfect. There is no other single encoding you could pick that has the potential to satisfy such a wide audience. It's our best bet. For these reasons, UTF-8 is quickly becoming the preferred encoding for the Web, email, and more.
If you have a say over what encoding or encodings your software will accept, support, and deliver, choose UTF-8 whenever you can. This is absolutely the best default.
Get in the Habit of Documenting Your Encodings
We learned that you must know a data's encoding to properly work with it. While there are tools to help you guess an encoding, you really want to try and avoid being in this position. Part of how to make that happen is to be a good citizen and make sure you are documenting your encodings at every step.
The Unicode Character Set and Encodings
Since the rise of the various character encodings, there has been a quest to find the one perfect encoding we could all use. It's hard to get everyone to agree about whether or not this has truly been accomplished, but most of us agree that Unicode is as close as it gets.
The goal of Unicode was literally to provide a character set that includes all characters in use today. That's letters and numbers for all languages, all the images needed by pictographic languages, and all symbols. As you can imagine that's quite a challenging task, but they've done very well. Take a moment to browse all the characters in the current Unicode specification to see for yourself. The Unicode Consortium often reminds us that they still have room for more characters as well, so we will be all set when we start meeting alien races.
Now in order to really understand what Unicode is, I need to clear up a point I've played pretty loose with so far: a character set and a character encoding aren't necessarily the same thing. Unicode is one character set, and has multiple character encodings. Allow me to explain.
What is a Character Encoding?
The first step to understanding character encodings is that we're going to need to talk a little about how computers store character data. I know we would love to believe that when we push the
akey on our keyboard, the computer records a little
asymbol somewhere, but that's just fantasy.
I imagine most of us know that deep in the heart of computers pretty much everything is eventually in terms of ones and zeros. That means that an
ahas to be stored as some number. In fact, it is. We can see what number using Ruby 1.8:
$ ruby -ve 'p ?a' ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0] 97
?asyntax gives us a specific character, instead of a full
String. In Ruby 1.8 it does that by returning the code of that encoded character. You can also get this by indexing one character out of a
$ ruby -ve 'p "a"' ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0] 97
Stringbehaviors were deemed confusing by the Ruby core team and have been changed in Ruby 1.9. They now return one character
Strings. If you want to see the character codes in Ruby 1.9 you can use