Character Encodings

My extensive coverage of a complex topic all programmers should study a little.

30 OCT 2008

Bytes and Characters in Ruby 1.8

Gregory Brown said, in a training session at the Lone Star Rubyconf, "Ruby 1.8 works in bytes. Ruby 1.9 works in characters." The truth of Ruby 1.9 is maybe a little more complicated and we will discuss all of that eventually, but Greg is dead right about Ruby 1.8.

In Ruby 1.8, a String is always just a collection of bytes.

The important question is, how does that one golden rule relate to all that we've learned about character encodings? Essentially, it puts all the responsibility on you as the developer. Ruby 1.8 leaves it to you to determine what to do with those bytes and it doesn't provide a lot of encoding-savvy help. That's why knowing at least the basics of encodings is so important when working with Ruby 1.8.

There are plusses and minuses to every system and this one is no exception. On the side of plusses, Ruby 1.8 can pretty much support any encoding you can imagine. After all, a character encoding is just some bytes that somehow map to a set of characters and all Ruby 1.8 Strings are just some bytes. If you say a String holds Latin-1 data and treat it as such, that's fine by Ruby.

I won't lie to you though, there are more minuses than plusses to this approach. Latin-1 is a pretty simple case since each byte is a character. With many other encodings though, like the UTF-8 encoding I've recommended we rely on, things get a lot more complicated.

Slicing up a Ruby 1.8 String by index means working in bytes and that means it's possible for us to accidentally break a multi-byte character. Running regular expressions over data faces similar issues. Those are just two examples of things we commonly do, but the truth is that many String operations just aren't encoding safe in Ruby 1.8. You can't even safely call simple things like reverse() on a String because it could break the order of the bytes inside those multi-byte characters. And remember that size() will always count bytes, not characters.
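Here's a minimal sketch of that byte-versus-character mismatch. It relies only on unpack("C*"), which returns one Integer per byte, so the numbers should come out the same on Ruby 1.8 and on later versions. The helper name byte_values is my own invention for illustration, and the sketch assumes the source file is saved as UTF-8.

```ruby
# Hypothetical helper: list the raw byte values of a String.
# unpack("C*") reads unsigned 8-bit integers, one per byte.
def byte_values(str)
  str.unpack("C*")
end

bytes = byte_values("Résumé")
p bytes        # [82, 195, 169, 115, 117, 109, 195, 169]
p bytes.size   # 8 bytes, although we only see 6 characters

# Taking the first two *bytes* grabs "R" plus just the first half of "é".
p bytes[0, 2]  # [82, 195]
```

The stray 195 at the end of that slice is exactly the kind of broken multi-byte character a byte index can leave behind.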

Ruby 1.8 is also never going to police the contents of a String. That means to Ruby 1.8 a String with valid UTF-8 data, a String with broken UTF-8 data, and a String with some bytes in Latin-1 and some in UTF-8 are all just Strings. It doesn't care. It's unlikely that the latter two are going to be of any use to you, so you will need to be the one making sure you don't create such problems. If you got String data from two separate sources in different encodings, you can't just combine them with a simple +.
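Since Ruby won't police the data, you may want a rough sanity check of your own. The sketch below is only one possible approach and not something Ruby 1.8 offers under this name: looks_like_utf8? is a hypothetical helper that leans on the fact that unpack("U*") raises an ArgumentError when it meets a malformed sequence. Treat it as a quick smoke test rather than a full validator.

```ruby
# Hypothetical helper: a rough UTF-8 sanity check.
# unpack("U*") raises ArgumentError on a malformed UTF-8 sequence.
def looks_like_utf8?(str)
  str.unpack("U*")
  true
rescue ArgumentError
  false
end

latin1_e = [233].pack("C*")       # "é" as a single Latin-1 byte
utf8_e   = [195, 169].pack("C*")  # "é" as two UTF-8 bytes

# A simple + just glues the bytes together, whatever they mean.
mixed = latin1_e + utf8_e

p looks_like_utf8?(utf8_e)  # true
p looks_like_utf8?(mixed)   # false
```

The mixed String concatenates without complaint, which is the point: nothing stops you from building it, so any checking has to be yours.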

This may be starting to sound a little bleak and it probably is. However, Ruby 1.8 throws one major exception into the works that can help you in many cases: the regex engine is aware of four character encodings. Often we can use this simple fact to work with characters.

What encodings does Ruby 1.8 know? Here's the full list:

  • None (n or N)
  • EUC (e or E)
  • Shift_JIS (s or S)
  • UTF-8 (u or U)

The None encoding is the default in Ruby 1.8. It's just the golden rule I've already mentioned: treat everything as bytes. If your encoding isn't on this list, you will need to use None and be darn sure you don't do anything to the data that could damage the encoding. That's very hard and the fact is that doing significant work with an encoding not on the above list in Ruby 1.8 will be quite a challenge for you.

Both EUC (Extended Unix Code) and Shift_JIS are primarily Asian character encodings. Shift_JIS is a Japanese encoding and EUC is mainly used for Japanese, Korean, and simplified Chinese. You can tell Ruby comes from Japan, can't you? Obviously these are very helpful if you work with text in those languages, but the rest of us won't need them much.

Now we get to the good news: our champion UTF-8 made the list! Yes, this means Ruby 1.8 has limited support for working with UTF-8 data. It's not comprehensive, but we get some help.

The letters listed after each encoding are used in multiple places inside Ruby 1.8 to tell it which encoding you need to work with. I'll point those places out as we get into the details.

What does it mean to have a character encoding on the above list? It means that the regex engine can recognize characters in that encoding, even if they are multibyte. That assures us that regular expression constructs that target characters, like character classes ([…]) and the match-one-character shortcut (.), will correctly match whatever number of bytes represents one character at that place in the data. It also changes the definition of constructs like \s and \w which can be used to match whitespace and word characters respectively. The definition of a "word" character in Unicode is quite a bit broader than the simple ASCII character class of [A-Za-z0-9_].

Let's look at some examples of this, so you can see how it works. I'll play around with a simple UTF-8 String in Ruby 1.8 and show you the various encoding effects. Remember that the default encoding is None, so that's what we get if we don't ask for anything else.

A common task in working with characters in Ruby 1.8 is to convert a String into an Array of characters. If we can do just that much, we can work around some of the weaknesses of Ruby 1.8's String always working in bytes. Given that, this almost does what we want:

$ ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]

You probably know that scan() just builds an Array of matches for the passed Regexp in the String receiver. The /m option I'm using here puts the regex engine in multi-line mode, where a . matches any character, newlines included (normally it skips newlines).

So what went wrong above? Well, the "é" characters in my String take two bytes in UTF-8. The golden rule tells us Ruby 1.8 works in bytes and that's definitely what we saw. It split up the bytes needed for those characters. This is bad, because if I now change this Array, I have excellent chances of breaking my data.

Again, that used the default None mode, because we didn't tell it to do otherwise. However, if we throw the regex engine into UTF-8 mode, we will get actual characters:

$ ruby -e 'p "Résumé".scan(/./mu)'
["R", "\303\251", "s", "u", "m", "\303\251"]

Notice how the two bytes needed for the "é" stay together now? (I'll show you how to get Ruby to stop escaping the content and show the actual "é" in my next post.) The regex engine saw that it takes both bytes to make a character in UTF-8, the encoding I requested, and thus the ., which matches one character, is forced to grab them both.

I chose UTF-8 mode by adding the /u option to my Regexp literal. You probably recognize the letter from my earlier list of encodings. Similarly, you can use /e for EUC, /s for Shift_JIS, and even /n for None though that's the default. Regexp.new() also accepts a third parameter for these encodings if you are creating expressions that way: Regexp.new(".", Regexp::MULTILINE, "u").
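If the escaped output makes the grouping hard to read, here is a small sketch that prints the byte values behind each matched character instead. The method name bytes_per_character is hypothetical, and the code assumes the String holds valid UTF-8.

```ruby
# Split into characters with the UTF-8 aware regex, then list each
# character's byte values so the grouping is easy to see.
def bytes_per_character(str)
  str.scan(/./mu).map { |char| char.unpack("C*") }
end

p bytes_per_character("Résumé")
# [[82], [195, 169], [115], [117], [109], [195, 169]]
```

Each inner Array is one character: the ASCII letters take a single byte and each "é" keeps its two bytes together.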

Using this one simple trick, we can fix some of the unsafe String methods I mentioned earlier. For example, Ruby 1.8 normally counts bytes with size():

$ ruby -e 'p "Résumé".size'
8

but we can now count characters, if desired:

$ ruby -e 'p "Résumé".scan(/./mu).size'
6

We can also fix the dangerous reverse() method which would normally break our multibyte "é" characters by screwing up the byte order:

$ ruby -e 'p "Résumé".reverse'
"\251\303mus\251\303R"

"\303\251" is a UTF-8 "é", but the "\251\303" we see here is broken UTF-8 data that doesn't mean anything. We can fix that with:

$ ruby -e 'p "Résumé".scan(/./mu).reverse.join'
"\303\251mus\303\251R"

This time we use the regex engine to divide the String into a character Array, then we reverse() that and join() it back into a String. You can see that this kept the "é" bytes in the proper order.
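If you use these tricks a lot, it may be worth wrapping them up. The following is just a sketch with method names I made up (utf8_chars, utf8_length, utf8_reverse, and utf8_slice are not part of Ruby), and it assumes the String really holds valid UTF-8 data.

```ruby
# Hypothetical helpers built on the scan(/./mu) trick.
def utf8_chars(str)
  str.scan(/./mu)
end

# Count characters instead of bytes.
def utf8_length(str)
  utf8_chars(str).size
end

# Reverse character by character, keeping multi-byte sequences intact.
def utf8_reverse(str)
  utf8_chars(str).reverse.join
end

# Character-based slice: start and length count characters, not bytes.
def utf8_slice(str, start, length)
  (utf8_chars(str)[start, length] || []).join
end

p utf8_length("Résumé")                    # 6
p utf8_reverse("Résumé").unpack("U*")      # [233, 109, 117, 115, 233, 82]
p utf8_slice("Résumé", 0, 2).unpack("C*")  # [82, 195, 169]
```

Keep in mind that every call builds a full Array of characters, so this gets expensive on large Strings.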

Really study these examples above until you understand what's going on here. This is all the support Ruby 1.8 provides for working with characters, so you need to understand how to use it.

Here's one last set of examples showing the other regex change I mentioned:

$ ruby -e 'p "Résumé"[/\w+/]'
"R"
$ ruby -e 'p "Résumé"[/\w+/u]'
"R\303\251sum\303\251"

In the default None mode, \w is the same as [A-Za-z0-9_]. That doesn't match the special bytes needed to build the "é" character, so the match ends there. Note that UTF-8 mode changes that though and we get the full word.

Ruby 1.8 doesn't provide a whole lot of additional encoding support outside the regex engine. There is one magic variable and some helpful standard libraries we will discuss in future posts, but the main part of Ruby 1.8's character encoding support is just this.

One other small feature that may be worth a quick mention is that you can get Unicode code points using String's unpack() method:

$ ruby -e 'p "Résumé".unpack("U*")'
[82, 233, 115, 117, 109, 233]

The U code tells unpack() to convert a character into a Unicode code point and the * just repeats it for all characters in the String.
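The trip works in the other direction too, since Array's pack() accepts the same U code. This little sketch (utf8_bytes_for is a hypothetical name) turns a code point back into its UTF-8 bytes and prints them in octal, which is the notation p uses for the escapes in the Ruby 1.8 output above.

```ruby
# Encode one code point as UTF-8 and look at the resulting bytes.
def utf8_bytes_for(code_point)
  [code_point].pack("U").unpack("C*")
end

bytes = utf8_bytes_for(233)          # 233 is the code point of "é"
p bytes                              # [195, 169]
p bytes.map { |byte| "%o" % byte }   # ["303", "251"]
```

So the single code point 233 becomes the two bytes we have been seeing as "\303\251".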

I don't find myself needing to work with code points often, but you can use this for one interesting cheat. The Unicode code points are a superset of the byte values used in Latin-1, so you can actually convert between the two encodings using just unpack() and pack():

utf8 = latin1.unpack("C*").pack("U*")
# ... or ...
latin1 = utf8.unpack("U*").pack("C*")  # more dangerous

However, I'll show you a superior way to handle encoding conversions in a future post.
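To make the "more dangerous" comment concrete: a UTF-8 String can hold code points above 255, and those have no Latin-1 byte, so a blind pack("C*") would mangle them rather than report a problem. Here is a sketch of a guarded version. The method names are hypothetical and raising an ArgumentError is just one way to handle the failure.

```ruby
# Hypothetical helpers wrapping the pack()/unpack() cheat.
def latin1_to_utf8(latin1)
  latin1.unpack("C*").pack("U*")
end

def utf8_to_latin1(utf8)
  code_points = utf8.unpack("U*")
  # Latin-1 only has room for code points 0 through 255.
  too_big = code_points.find { |code_point| code_point > 255 }
  if too_big
    raise ArgumentError, "U+%04X does not exist in Latin-1" % too_big
  end
  code_points.pack("C*")
end

latin1 = [82, 233, 115, 117, 109, 233].pack("C*")  # "Résumé" in Latin-1
utf8   = latin1_to_utf8(latin1)
p utf8.unpack("C*")               # [82, 195, 169, 115, 117, 109, 195, 169]
p utf8_to_latin1(utf8) == latin1  # true
```

Going from Latin-1 to UTF-8 needs no guard, because every Latin-1 byte value is also a valid code point.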

It's important to remember that this is not full character encoding support. For example, there is a long list of rules about how to correctly convert some Unicode characters to upper case, but upcase() doesn't know them and you cannot regex your way out of that mess. If you need these features for a given encoding, you will need to look for an external library that meets your needs or roll your own solution.

Comments (11)
  1. amishera, March 26th, 2010

    Mr. Gray,

    Your articles on character encodings are very informative. I have a question about a problem I am not sure how to deal with.

    I have an html file which is encoded in UTF-8. The file contains the
    following text:

    It&#39;s a wonderful life
    

    Now the character code 39 is for the apostrophe in UTF-8. So suppose I got
    the 39 out of the text using:

        s="It&#39;s a wonderful life"
        s.gsub(/&#(\d+);/, '\1')
    

    The output is

    It39s a wonderful life
    

    So firstly I am having trouble making it

    It\39s a wonderful life
    

    Secondly I manually did this in test_utf8.rb:

    puts "It\39s a wonderful life"
    

    and ran it

    ruby test_utf8.rb > utf8.txt
    

    but by opening it in the open office by setting the encoding to utf-8
    the output is

    It#9s a wonderful life
    

    So how do I correctly parse and convert HTML character references
    to encoded characters in UTF-8 and then save the file?

    James Edward Gray II, March 26th, 2010

      This isn't really a character encoding issue. It's about HTML escapes, as the solution you received on Ruby Talk shows.

      amishera, March 31st, 2010

        Mr. Gray,

        Thanks for your reply.

        I solved the problem using this:

        s.gsub!(/#(\d+);/) { [$1.to_i].pack("U*") }
        

        Which not only converts the HTML part ie converts #39; into 39 but also gets the corresponding Unicode character. But I am not sure whether this is the right approach. Or is there any other better approach?

        amishera, March 31st, 2010

          Sorry, I meant that given a number I wanted to convert it to a Unicode character. So like you know for ASCII we do something like

          39.chr
          

          so for doing the same thing for Unicode instead of ASCII, what is the right approach?

        James Edward Gray II, March 31st, 2010

          Your approach is correct, though I still prefer the solution you got from Ruby Talk. It's more general and probably more correct with regard to Web encodings. For example, the default for the Web is Latin-1, not UTF-8. Those two encodings are codepoint compatible for all of Latin-1's range though, which is why your code above works.

  2. amishera, March 26th, 2010

    Mr. Gray,

    I have a question which I am unable to resolve. The 'é' here is represented by 2 bytes, \303 and \251. But 1 byte has a range of 0-255. So how come one of the bytes is 303, which should be out of range?

    James Edward Gray II, March 26th, 2010

      Those character escapes are octal, so \303 is really 195. Good question.

  3. grillermo, August 21st, 2011

    I have a question regarding the different Ruby versions.
    When I ran this code

    p "Résumé"[/\w+/u]
    

    With Ruby 1.8 I get:

    "R\303\251sum\303\251"
    

    As expected, but if I run it on Ruby 1.9 I only get

    "R"
    

    Why does this regex only return the first match? How can I get all matches?

    James Edward Gray II, August 22nd, 2011

      Try this in 1.9:

      p "Résumé"[/\p{Word}+/]
      
  4. aashish, October 28th, 2011

    I am trying to install the character-encoding gem on Ubuntu 11.10 (64-bit) with Ruby 1.8.7.
    I get the error shown at http://pastie.org/2773649. Can you help in fixing it?
    Can you mail any solution?

    James Edward Gray II, October 28th, 2011

      You are getting errors about something trying to require the psych library that wasn't added to Ruby until 1.9. I suspect you have somehow corrupted your RubyGems install between using different versions, but I'm just guessing.
