Bytes and Characters in Ruby 1.8
Gregory Brown said, in a training session at the Lone Star Rubyconf, "Ruby 1.8 works in bytes. Ruby 1.9 works in characters." The truth of Ruby 1.9 is maybe a little more complicated and we will discuss all of that eventually, but Greg is dead right about Ruby 1.8.
In Ruby 1.8, a
String is always just a collection of bytes.
The important question is, how does that one golden rule relate to all that we've learned about character encodings? Essentially, it puts all the responsibility on you as the developer. Ruby 1.8 leaves it to you to determine what to do with those bytes and it doesn't provide a lot of encoding-savvy help. That's why knowing at least the basics of encodings is so important when working with Ruby 1.8.
There are plusses and minuses to every system and this one is no exception. On the side of plusses, Ruby 1.8 can pretty much support any encoding you can imagine. After all, a character encoding is just some bytes that somehow map to a set of characters and all Ruby 1.8
Strings are just some bytes. If you say a
String holds Latin-1 data and treat it as such, that's fine by Ruby.
I won't lie to you though, there are more minuses than plusses to this approach. Latin-1 is a pretty simple case since each byte is a character. With many other encodings though, like the UTF-8 encoding I've recommended we rely on, things get a lot more complicated.
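To make the difference concrete, here is a small sketch that spells out the same word as raw bytes in both encodings. The variable names are only for illustration, and the octal escapes are used so the result does not depend on how the source file itself is encoded.

```ruby
# Octal escapes spell out the raw bytes of each String.
latin1 = "R\351sum\351"          # "Résumé" in Latin-1: one byte per character
utf8   = "R\303\251sum\303\251"  # "Résumé" in UTF-8: each "é" takes two bytes

# unpack("C*") lists every byte as an unsigned integer.
latin1_bytes = latin1.unpack("C*")
utf8_bytes   = utf8.unpack("C*")

p latin1_bytes        # [82, 233, 115, 117, 109, 233]
p utf8_bytes          # [82, 195, 169, 115, 117, 109, 195, 169]
p latin1_bytes.size   # 6
p utf8_bytes.size     # 8
```

Six characters either way, but the UTF-8 version needs eight bytes to hold them.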
Slicing up a Ruby 1.8
String by index means working in bytes and that means it's possible for us to accidentally break a multi-byte character. Running regular expressions over data faces similar issues. Those are just two examples of things we commonly do, but the truth is that many
String operations just aren't encoding safe in Ruby 1.8. You can't even call simple things like
reverse() on a
String because it could break the order of those multi-byte characters. And remember that
size() will always count bytes, not characters.
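A rough way to see the damage is to work on the byte values directly, which is effectively what a byte-indexed slice or reverse does. This is only a sketch with made-up variable names, not a reproduction of any particular method's internals.

```ruby
utf8  = "R\303\251sum\303\251"   # "Résumé" as raw UTF-8 bytes
bytes = utf8.unpack("C*")

# Taking the first two bytes, as a byte-indexed slice would, keeps the lead
# byte of "é" (195) but drops its continuation byte (169).
broken_slice = bytes[0, 2]
p broken_slice                    # [82, 195]

# Reversing byte by byte puts each continuation byte before its lead byte.
broken_reverse = bytes.reverse
p broken_reverse                  # [169, 195, 109, 117, 115, 169, 195, 82]
```

Neither result is valid UTF-8 any longer: the first ends in half a character and the second has every "é" turned inside out.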
Ruby 1.8 is also never going to police the contents of a
String. That means to Ruby 1.8 a
String with valid UTF-8 data, a
String with broken UTF-8 data, and a
String with some bytes in Latin-1 and some in UTF-8 are all just
Strings. It doesn't care. It's unlikely that the latter two are going to be of any use to you, so you will need to be the one making sure you don't create such problems. If you got
String data from two separate sources in different encodings, you can't just combine them with a simple +.
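Since Ruby won't do the policing, you may want a check of your own. The helper below is a hypothetical sketch, not something Ruby 1.8 ships with: it assumes that unpack("U*") raises ArgumentError when it meets malformed UTF-8, which is the behavior I know of, and treats that as the signal.

```ruby
# Hypothetical helper: unpack("U*") decodes UTF-8 and raises ArgumentError
# when it runs into a malformed byte sequence.
def valid_utf8?(string)
  string.unpack("U*")
  true
rescue ArgumentError
  false
end

good   = "R\303\251sum\303\251"   # valid UTF-8
broken = "\251\303"               # the two bytes of "é" in the wrong order
mixed  = "R\351sum\351" + " " + "R\303\251sum\303\251"   # Latin-1 glued to UTF-8

p valid_utf8?(good)     # true
p valid_utf8?(broken)   # false
p valid_utf8?(mixed)    # false
```

A check like this only tells you the bytes decode as UTF-8. It cannot tell you whether the data was really meant to be UTF-8 in the first place.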
This may be starting to sound a little bleak and it probably is. However, Ruby 1.8 throws one major exception into the works that can help you in many cases: the regex engine is aware of four character encodings. Often we can use this simple fact to work with characters.
What encodings does Ruby 1.8 know? Here's the full list:
- None (n or N)
- EUC (e or E)
- Shift_JIS (s or S)
- UTF-8 (u or U)
The None encoding is the default in Ruby 1.8. It's just the golden rule I've already mentioned: treat everything as bytes. If your encoding isn't on this list, you will need to use None and be darn sure you don't do anything to the data that could damage the encoding. That's very hard and the fact is that doing significant work with an encoding not on the above list in Ruby 1.8 will be quite a challenge for you.
Both EUC (Extended Unix Code) and Shift_JIS are primarily Asian character encodings. Shift_JIS is a Japanese encoding and EUC is mainly used for Japanese, Korean, and simplified Chinese. You can tell Ruby comes from Japan, can't you? Obviously these are very helpful if you work with text in those languages, but the rest of us won't need them much.
Now we get to the good news: our champion UTF-8 made the list! Yes, this means Ruby 1.8 has limited support for working with UTF-8 data. It's not comprehensive, but we get some help.
The letters listed after each encoding are used in multiple places inside Ruby 1.8 to tell it which encoding you need to work with. I'll point those places out as we get into the details.
What does it mean to have a character encoding on the above list? It means that the regex engine can recognize characters in that encoding, even if they are multibyte. That assures us that regular expression constructs that target characters, like character classes (
[…]) and the match-one-character shortcut (
.), will correctly match whatever number of bytes represents one character at that place in the data. It also changes the definition of constructs like
\s and \w which can be used to match whitespace and word characters respectively. The definition of a "word" character in Unicode is quite a bit broader than the simple ASCII character class of [A-Za-z0-9_].
Let's look at some examples of this, so you can see how it works. I'll play around with a simple UTF-8
String in Ruby 1.8 and show you the various encoding effects. Remember that the default encoding is None, so that's what we get if we don't ask for anything else.
A common task in working with characters in Ruby 1.8 is to convert a
String into an
Array of characters. If we can do just that much, we can work around some of the weaknesses of Ruby 1.8's
String always working in bytes. Given that, this almost does what we want:
$ ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]
You probably know that
scan() just builds an
Array of matches for the passed
Regexp in the
String receiver. The
/m option I'm using here puts the regex engine in multi-line mode and in that a
. matches all characters (it usually doesn't match newlines).
So what went wrong above? Well, the
"é" characters in my
String take two bytes in UTF-8. The golden rule tells us Ruby 1.8 works in bytes and that's definitely what we saw. It split up the bytes needed for those characters. This is bad, because if I now change this
Array, I have excellent chances of breaking my data.
Again, that used the default None mode, because we didn't tell it to do otherwise. However, if we throw the regex engine into UTF-8 mode, we will get actual characters:
$ ruby -e 'p "Résumé".scan(/./mu)'
["R", "\303\251", "s", "u", "m", "\303\251"]
Notice how the two bytes needed for the
"é" stay together now? (I'll show you how to get Ruby to stop escaping the content and show the actual
"é" in my next post.) The regex engine saw that it takes both bytes to make a character in UTF-8, the encoding I requested, and thus the
., which matches one character, is forced to grab them both.
I chose UTF-8 mode by adding the
/u option to my
Regexp literal. You probably recognize the letter from my earlier list of encodings. Similarly, you can use
/e for EUC,
/s for Shift_JIS, and even
/n for None though that's the default.
Regexp.new() also accepts a third parameter for these encodings if you are creating expressions that way:
Regexp.new(".", Regexp::MULTILINE, "u").
Using this one simple trick, we can fix some of the unsafe
String methods I mentioned earlier. For example, Ruby 1.8 normally counts bytes with
size():
$ ruby -e 'p "Résumé".size'
8
but we can now count characters, if desired:
$ ruby -e 'p "Résumé".scan(/./mu).size'
6
We can also fix the dangerous
reverse() method which would normally break our multibyte
"é" characters by screwing up the byte order:
$ ruby -e 'p "Résumé".reverse'
"\251\303mus\251\303R"
"\303\251" is a UTF-8
"é", but the
"\251\303" we see here is broken UTF-8 data that doesn't mean anything. We can fix that with:
$ ruby -e 'p "Résumé".scan(/./mu).reverse.join'
"\303\251mus\303\251R"
This time we use the regex engine to divide the
String into a character
Array, then we
reverse() that and
join() it back into a
String. You can see that this kept the
"é" bytes in the proper order.
Really study these examples above until you understand what's going on here. This is all the support Ruby 1.8 provides for working with characters, so you need to understand how to use it.
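If you find yourself repeating these idioms, one option is to wrap them in small helpers. The method names below are invented for this sketch and are not part of Ruby; each one just leans on the /u option to split the data into characters first.

```ruby
# Hypothetical helper names; each one relies on the /u regex option.
def utf8_chars(string)
  string.scan(/./mu)          # one element per character, not per byte
end

def utf8_length(string)
  utf8_chars(string).size
end

def utf8_reverse(string)
  utf8_chars(string).reverse.join
end

def utf8_slice(string, start, length)
  # Array#[] returns nil when start is out of range, so fall back to [].
  (utf8_chars(string)[start, length] || []).join
end

resume = "R\303\251sum\303\251"   # "Résumé" as raw UTF-8 bytes

p utf8_length(resume)                     # 6
p utf8_reverse(resume).unpack("C*")       # [195, 169, 109, 117, 115, 195, 169, 82]
p utf8_slice(resume, 0, 2).unpack("C*")   # [82, 195, 169]
```

The unpack("C*") calls are only there to show that the bytes of each "é" stay together and in order. Building a whole Array for every call is not fast, so this suits short Strings better than large documents.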
Here's one last set of examples showing the other regex change I mentioned:
$ ruby -e 'p "Résumé"[/\w+/]'
"R"
$ ruby -e 'p "Résumé"[/\w+/u]'
"R\303\251sum\303\251"
In the default None mode,
\w is the same as
[A-Za-z0-9_]. That doesn't match the special bytes needed to build the
"é" character, so the match ends there. Note that UTF-8 mode changes that though and we get the full word.
Ruby 1.8 doesn't provide a whole lot of additional encoding support outside the regex engine. There is one magic variable and some helpful standard libraries we will discuss in future posts, but the main part of Ruby 1.8's character encoding support is just this.
One other small feature that may be worth a quick mention is that you can get Unicode code points using
unpack():
$ ruby -e 'p "Résumé".unpack("U*")'
[82, 233, 115, 117, 109, 233]
The U code tells
unpack() to convert a character into a Unicode code point and the
* just repeats it for all characters in the
String.
I don't find myself needing to work with code points often, but you can use this for one interesting cheat. The Unicode code points are a superset of the byte values used in Latin-1, so you can actually convert between the two encodings using just
unpack() and
pack():
utf8 = latin1.unpack("C*").pack("U*")
# ... or ...
latin1 = utf8.unpack("U*").pack("C*") # more dangerous
However, I'll show you a superior way to handle encoding conversions in a future post.
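In the meantime, here is one way the cheat could be wrapped up so the dangerous direction fails loudly. The helper names and the error message are my own invention for this sketch. The danger is that a code point above 255 has no Latin-1 byte, so the helper refuses rather than emit a wrong one.

```ruby
# Hypothetical helpers wrapping the unpack()/pack() trick.
def latin1_to_utf8(latin1)
  latin1.unpack("C*").pack("U*")
end

def utf8_to_latin1(utf8)
  code_points = utf8.unpack("U*")
  # Latin-1 only covers code points 0 through 255; anything above that will
  # not fit in a single byte, so refuse instead of producing wrong data.
  too_big = code_points.find { |cp| cp > 255 }
  raise ArgumentError, "U+%04X has no Latin-1 equivalent" % too_big if too_big
  code_points.pack("C*")
end

utf8 = latin1_to_utf8("R\351sum\351")
p utf8.unpack("C*")                     # [82, 195, 169, 115, 117, 109, 195, 169]
p utf8_to_latin1(utf8).unpack("C*")     # [82, 233, 115, 117, 109, 233]

begin
  utf8_to_latin1("\342\202\254")        # the euro sign, U+20AC
rescue ArgumentError => error
  p error.message                       # "U+20AC has no Latin-1 equivalent"
end
```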
It's important to remember that this is not full character encoding support. For example, there is a long list of rules about how to correctly convert some Unicode characters to upper case, but
upcase() doesn't know them and you cannot regex your way out of that mess. If you need these features for a given encoding, you will need to look for an external library that meets your needs or roll your own solution.
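To give a feel for what rolling your own involves, here is a deliberately tiny sketch of a table-driven upcase. The constant and method names are hypothetical, and the table covers just two characters. Real Unicode case rules involve far more entries plus special cases that a simple one-to-one lookup like this cannot express.

```ruby
# Hypothetical, deliberately tiny mapping table of UTF-8 byte sequences.
UTF8_UPCASE = {
  "\303\251" => "\303\211",   # é => É
  "\303\240" => "\303\200"    # à => À
}

def utf8_upcase(string)
  # Split into characters with /u, then prefer the table over upcase().
  string.scan(/./mu).map { |char| UTF8_UPCASE[char] || char.upcase }.join
end

shouted = utf8_upcase("r\303\251sum\303\251")
p shouted.unpack("C*")   # [82, 195, 137, 83, 85, 77, 195, 137]
```

The bytes 195, 137 are "É" in UTF-8, so both accented characters were upcased along with the ASCII letters around them.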