Character Encodings

My extensive coverage of a complex topic all programmers should study a little.


    What Ruby 1.9 Gives Us

    In this final post of the series, I want to revisit our earlier discussion on encoding strategies. Ruby 1.9 adds a lot of power to the handling of character encodings as you have now seen. We should talk a little about how that can change the game.

    UTF-8 is Still King

    The most important thing to take note of is what hasn't changed with Ruby 1.9. I said a good while back that the best Encoding for general use is UTF-8. That's still very true.

    I still strongly recommend that we favor UTF-8 as the one-size-almost-fits-all Encoding. I really believe that we can and should use it exclusively inside our code, transcode data to it on the way in, and transcode output when we absolutely must. The more of us that do this, the better things will get.

    As we've discussed earlier in the series, Ruby 1.9 does add some new features that help our UTF-8-only strategies. For example, you could use the Encoding command-line switches (-E and -U) to set up automatic transcoding for all input you read. These shortcuts are great for simple scripting, but I'm going to recommend you just be explicit about your Encodings in any serious code.
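
    To make that concrete, here's a small sketch of the explicit approach (the file is a throwaway created just for the example): rather than relying on -E, you name the external and internal Encodings right on the IO you open.

    ```ruby
    require "tempfile"

    # Fake up a Latin-1 data file just for this example:
    file = Tempfile.new("latin1")
    file.binmode
    file.write("caf\xE9".b)  # "café" in ISO-8859-1 bytes
    file.close

    # "r:ISO-8859-1:UTF-8" is the per-IO equivalent of -E ISO-8859-1:UTF-8:
    # read the bytes as ISO-8859-1, but hand them to us transcoded to UTF-8.
    content = File.open(file.path, "r:ISO-8859-1:UTF-8") { |f| f.read }
    content           # => "café"
    content.encoding  # => #<Encoding:UTF-8>
    ```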

    Read more…



    Miscellaneous M17n Details

    We've now discussed the core of Ruby 1.9's m17n (multilingualization) engine. String and IO are where you will see the big changes. The new m17n system is a big beast though with a lot of little details. Let's talk a little about some side topics that also relate to how we work with character encodings in Ruby 1.9.

    More Features of the Encoding Class

    You've seen me using Encoding objects all over the place in my explanations of m17n, but we haven't talked much about them. They are very simple, mainly just being a named representation of each Encoding inside Ruby. As such, Encoding is a storage place for some tools you may find handy when working with them.

    First, you can receive a list() of all Encoding objects Ruby has loaded in the form of an Array:

    $ ruby -e 'puts Encoding.list.first(3), "..."'

    If you're just interested in a specific Encoding, you can find() it by name:

    $ ruby -e 'p Encoding.find("UTF-8")'
    $ ruby -e 'p Encoding.find("No-Such-Encoding")'
    -e:1:in `find': unknown encoding name - No-Such-Encoding (ArgumentError)
        from -e:1:in `<main>'
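
    find() also understands aliases, and the class can report the whole alias map. A quick sketch:

    ```ruby
    # BINARY is an alias for ASCII-8BIT, Ruby's "no encoding" Encoding:
    Encoding.find("BINARY")     # => #<Encoding:ASCII-8BIT>

    # Encoding.aliases() returns the full alias-to-canonical-name Hash:
    Encoding.aliases["BINARY"]  # => "ASCII-8BIT"

    # Each Encoding object also knows its own names, aliases included:
    Encoding::ASCII_8BIT.names.include?("BINARY")  # => true
    ```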

    Read more…



    Ruby 1.9's Three Default Encodings

    I suspect early contact with the new m17n (multilingualization) engine is going to come to Rubyists in the form of this error message:

    invalid multibyte char (US-ASCII)

    Ruby 1.8 didn't care what you stuck in a random String literal, but 1.9 is a touch pickier. I think you'll see that the change is for the better, but we do need to spend some time learning to play by Ruby's new rules.

    That takes us to the first of Ruby's three default Encodings.

    The Source Encoding

    In Ruby's new grown up world of all encoded data, each and every String needs an Encoding. That means an Encoding must be selected for a String as soon as it is created. One way that a String can be created is for Ruby to execute some code with a String literal in it, like this:

    str = "A new String"

    That's a pretty simple String, but what if I use a literal like the following instead?

    str = "Résumé"

    What Encoding is that in? That fundamental question is probably the main reason we all struggle a bit with character encodings. You can't tell just from looking at that data what Encoding it is in. Now, if I showed you the bytes you may be able to make an educated guess, but the data just isn't wearing an Encoding name tag.
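
    In Ruby 1.9 the answer comes from the source Encoding, which you declare with a magic comment at the top of the file. A small sketch (Ruby 1.9 assumed US-ASCII source when you didn't say otherwise; Ruby 2.0 and later assume UTF-8):

    ```ruby
    # encoding: UTF-8

    # The magic comment above tells Ruby to parse this file's literals
    # as UTF-8, so Strings born here carry that Encoding from the start:
    str = "Résumé"
    str.encoding  # => #<Encoding:UTF-8>
    __ENCODING__  # => #<Encoding:UTF-8>
    ```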

    Read more…



    Ruby 1.9's String

    Ruby 1.9 has an all new encoding engine called m17n (for multilingualization, with 17 letters between the m and n). This new engine may not be what you are used to from many other modern languages.

    It's common to pick one versatile encoding, likely a Unicode encoding, and work with all data in that one format. Ruby 1.9 goes a different way. Instead of favoring one encoding, Ruby 1.9 makes it possible to work with data in over 80 encodings.

    To accomplish this, changes had to be made in several places where Ruby works with character data. You're going to notice those changes the most in Ruby's String though, so let's begin by talking about what's changed there.

    All Strings are now Encoded

    In Ruby 1.8 a String was a collection of bytes. You sometimes treated those bytes as other things, like characters when you hit it with a Regexp or lines when you called each(). At its core though, it was just some bytes. You indexed the data by byte counts, sizes were in bytes, and so on.

    In Ruby 1.9 a String is now a collection of encoded data. That means it is both the raw bytes and the attached Encoding information about how to interpret those bytes.
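
    You can see both halves of that pairing right from 1.9's String API; a quick sketch:

    ```ruby
    # One String, two views: the character data the attached Encoding
    # makes of the raw bytes, and the bytes themselves.
    str = "Résumé"
    str.encoding        # => #<Encoding:UTF-8>
    str.length          # => 6  (characters)
    str.bytesize        # => 8  (each é is two bytes in UTF-8)
    str.bytes.first(3)  # => [82, 195, 169]  ("R", then the two bytes of "é")
    ```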

    Read more…



    Ruby 1.8 Character Encoding Flaws

    Now that we have toured the entire landscape of Ruby 1.8's encoding support, we need to discuss the problems the system has. These long-standing issues are what pushed the core team to build the m17n (multilingualization) implementation for Ruby 1.9.

    The main problems are:

    • Not enough encodings supported
    • Regexp-only support just isn't comprehensive enough
    • $KCODE is a global setting for all encodings

    I imagine most of those are pretty straightforward, but let's talk through them just to make sure we learn from the mistakes of the past. I'm pretty sure this will make it easier to understand why things are the way they are in Ruby 1.9.

    The "not enough encodings" complaint should be the most obvious of all. Ruby 1.8 supports four and one is just no encoding. That means you really only get UTF-8 and two Asian encodings. The UTF-8 support is how we've managed to make it this far, but there are a ton of common encodings that just aren't covered.

    The most important thing to realize here though is that we can't just keep adding encodings to Ruby 1.8. The system wasn't designed with that in mind. We will run out of letters to tack onto the end of a Regexp very fast. It's just not practical.
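
    For reference, these are the trailing letters in question. Modern Rubies still parse the flags (which is what keeps this sketch runnable), though the byte-oriented 1.8 semantics described in this series are what they originally controlled:

    ```ruby
    # The trailing letter picks the Regexp's idea of the data's encoding:
    # n (none), u (UTF-8), e (EUC-JP), s (Shift_JIS).
    pattern = /é/u
    pattern.encoding     # => #<Encoding:UTF-8>
    "Résumé" =~ pattern  # => 1
    ```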

    Read more…



    Encoding Conversion With iconv

    There's one last standard library we need to discuss for us to have completely covered Ruby 1.8's support for character encodings. The iconv library ships with Ruby and it can handle an impressive set of character encoding conversions.

    This is an important piece of the puzzle. You may have accepted my advice that it's OK to just work with UTF-8 data whenever you have the choice, but the fact is that there's a lot of non-UTF-8 data in the world. Legacy systems may have produced data before UTF-8 was popular, some services may work in different encodings for any number of reasons, and not quite everyone has embraced Unicode fully yet. If you run into data like this, you will need a way to convert it to UTF-8 as you import it and possibly a way to convert it back when you export it. That's exactly what iconv does.

    Instead of jumping right into Ruby's iconv library, let's come at it with a slightly different approach. iconv is actually a C library that performs these conversions and on most systems where it is installed you will have a command-line interface for it.
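
    Assuming iconv is installed on your system, that command-line interface looks like this (the sample bytes here are made up just for the example):

    ```shell
    # See which conversions this iconv build supports:
    $ iconv -l

    # Convert Latin-1 bytes to UTF-8 (\351 is 0xE9, é in Latin-1):
    $ printf 'caf\351' | iconv -f ISO-8859-1 -t UTF-8
    café
    ```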

    Read more…



    The $KCODE Variable and jcode Library

    All of the Ruby files I create start with the same shebang line:

    #!/usr/bin/env ruby -wKU

    It's not really needed for every file since it generally only matters if the file is executed. However, I tend to go ahead and add it to all Ruby files I build for several reasons:

    • You never know when a file may be executed (if __FILE__ == $PROGRAM_NAME; end sections are often added to libraries, for example)
    • It makes it obvious the file is Ruby code
    • It shows the rules this code expects: -w and -KU

    The rules I mention here, specified by command-line switches, are the main point of interest. -w turns on Ruby's warnings, which are very handy. I recommend doing that whenever you can, but it doesn't have anything to do with character encodings. -KU does.

    -KU sets a magic Ruby variable: $-K or $KCODE. You can do the same in your code if you aren't in a position to control the command-line arguments:

    $KCODE = "U"

    You probably recognize the U as a name for Ruby 1.8's UTF-8 encoding, from my earlier list of encodings. It can also be set to N (the default), E, or S. Modern versions of Rails do set $KCODE = "U" for you.

    Read more…



    Bytes and Characters in Ruby 1.8

    Gregory Brown said, in a training session at the Lone Star Rubyconf, "Ruby 1.8 works in bytes. Ruby 1.9 works in characters." The truth of Ruby 1.9 is maybe a little more complicated and we will discuss all of that eventually, but Greg is dead right about Ruby 1.8.

    In Ruby 1.8, a String is always just a collection of bytes.

    The important question is, how does that one golden rule relate to all that we've learned about character encodings? Essentially, it puts all the responsibility on you as the developer. Ruby 1.8 leaves it to you to determine what to do with those bytes and it doesn't provide a lot of encoding-savvy help. That's why knowing at least the basics of encodings is so important when working with Ruby 1.8.

    There are pluses and minuses to every system and this one is no exception. On the side of pluses, Ruby 1.8 can pretty much support any encoding you can imagine. After all, a character encoding is just some bytes that somehow map to a set of characters and all Ruby 1.8 Strings are just some bytes. If you say a String holds Latin-1 data and treat it as such, that's fine by Ruby.
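
    Here's that idea sketched in 1.9's terms (force_encoding() didn't exist in 1.8, but the byte-level worldview is the same): take some raw bytes, declare them to be Latin-1, and Ruby goes right along with you.

    ```ruby
    # Start from raw bytes, then simply declare what they mean:
    bytes = "r\xE9sum\xE9".b           # six raw bytes, tagged ASCII-8BIT
    latin = bytes.force_encoding("ISO-8859-1")
    latin.valid_encoding?  # => true
    latin.encode("UTF-8")  # => "résumé"
    ```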

    Read more…



    General Encoding Strategies

    Before we get into specifics, let's try to distill a few best practices for working with encodings. I'm sure you can tell that there's a lot that needs to be considered with encodings, so let's try to focus in on a few key points that will help us the most.

    Use UTF-8 Everywhere You Can

    We know UTF-8 isn't perfect, but it's pretty darn close to perfect. There is no other single encoding you could pick that has the potential to satisfy such a wide audience. It's our best bet. For these reasons, UTF-8 is quickly becoming the preferred encoding for the Web, email, and more.

    If you have a say over what encoding or encodings your software will accept, support, and deliver, choose UTF-8 whenever you can. This is absolutely the best default.

    Get in the Habit of Documenting Your Encodings

    We learned that you must know your data's encoding to properly work with it. While there are tools to help you guess an encoding, you really want to avoid being in that position. Part of how to make that happen is to be a good citizen and make sure you are documenting your encodings at every step.

    Read more…



    The Unicode Character Set and Encodings

    Since the rise of the various character encodings, there has been a quest to find the one perfect encoding we could all use. It's hard to get everyone to agree about whether or not this has truly been accomplished, but most of us agree that Unicode is as close as it gets.

    The goal of Unicode was literally to provide a character set that includes all characters in use today. That's letters and numbers for all languages, all the images needed by pictographic languages, and all symbols. As you can imagine that's quite a challenging task, but they've done very well. Take a moment to browse all the characters in the current Unicode specification to see for yourself. The Unicode Consortium often reminds us that they still have room for more characters as well, so we will be all set when we start meeting alien races.

    Now in order to really understand what Unicode is, I need to clear up a point I've played pretty loose with so far: a character set and a character encoding aren't necessarily the same thing. Unicode is one character set, and has multiple character encodings. Allow me to explain.
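
    A quick sketch of the distinction: one Unicode character, and three different byte serializations depending on which Unicode encoding you pick.

    ```ruby
    # é is the single Unicode code point U+00E9, but its bytes differ
    # by encoding:
    e_acute = "é"
    e_acute.encode("UTF-8").bytes     # => [195, 169]
    e_acute.encode("UTF-16BE").bytes  # => [0, 233]
    e_acute.encode("UTF-32BE").bytes  # => [0, 0, 0, 233]
    ```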

    Read more…