30
OCT2008
Bytes and Characters in Ruby 1.8
Gregory Brown said, in a training session at the Lone Star Rubyconf, "Ruby 1.8 works in bytes. Ruby 1.9 works in characters." The truth of Ruby 1.9 is maybe a little more complicated and we will discuss all of that eventually, but Greg is dead right about Ruby 1.8.
In Ruby 1.8, a String is always just a collection of bytes.
The important question is, how does that one golden rule relate to all that we've learned about character encodings? Essentially, it puts all the responsibility on you as the developer. Ruby 1.8 leaves it to you to determine what to do with those bytes and it doesn't provide a lot of encoding savvy help. That's why knowing at least the basics of encodings is so important when working with Ruby 1.8.
There are plusses and minuses to every system and this one is no exception. On the side of plusses, Ruby 1.8 can pretty much support any encoding you can imagine. After all, a character encoding is just some bytes that somehow map to a set of characters and all Ruby 1.8 Strings are just some bytes. If you say a String holds Latin-1 data and treat it as such, that's fine by Ruby.
I won't lie to you though, there are more minuses than plusses to this approach. Latin-1 is a pretty simple case since each byte is a character. With many other encodings though, like the UTF-8 encoding I've recommended we rely on, things get a lot more complicated.
Slicing up a Ruby 1.8 String by index means working in bytes and that means it's possible for us to accidentally break a multi-byte character. Running regular expressions over data faces similar issues. Those are just two examples of things we commonly do, but the truth is that many String operations just aren't encoding safe in Ruby 1.8. You can't even safely call simple things like reverse() on a String because it could break the order of those multi-byte characters. And remember that size() will always count bytes, not characters.
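To make the slicing hazard concrete, here is a minimal sketch. It assumes the script is saved as UTF-8 (so each "é" is the two bytes 195 and 169) and uses unpack("C*") only as a convenient way to look at the raw bytes; the variable names are just for illustration.

```ruby
# Assumption: this file is saved as UTF-8, so "é" is the byte pair 195, 169.
word  = "Résumé"
bytes = word.unpack("C*")   # one Integer per byte in the String

p bytes.size                # 8 bytes, although we read 6 characters
p bytes[0, 2]               # [82, 195]: "R" plus only the first half of "é"
p bytes[2, 1]               # [169]: the orphaned second half of "é"
```

Any cut that lands between the 195 and the 169 leaves two fragments that are no longer valid UTF-8 on their own.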
Ruby 1.8 is also never going to police the contents of a String. That means to Ruby 1.8 a String with valid UTF-8 data, a String with broken UTF-8 data, and a String with some bytes in Latin-1 and some in UTF-8 are all just Strings. It doesn't care. It's unlikely that the latter two are going to be of any use to you, so you will need to be the one making sure you don't create such problems. If you got String data from two separate sources in different encodings, you can't just combine them with a simple +.
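As a rough illustration of that last point, the sketch below builds the same "é" in both encodings and glues the bytes together. The variable names are hypothetical, and it works on byte Arrays from unpack("C*") so the result is easy to inspect.

```ruby
latin1_e = [233].pack("C*")   # "é" in Latin-1: the single byte 233
utf8_e   = [233].pack("U*")   # "é" in UTF-8: the two bytes 195, 169

# Gluing the raw bytes together is what a naive + amounts to in Ruby 1.8.
mixed = latin1_e.unpack("C*") + utf8_e.unpack("C*")
p mixed                       # [233, 195, 169]

# Reading the result back as UTF-8 fails: 233 announces a three-byte
# character, but the bytes that follow don't continue it.
begin
  mixed.pack("C*").unpack("U*")
  decoded = true
rescue ArgumentError
  decoded = false
end
p decoded                     # false
```

Nothing complained when the bytes were combined. The damage only shows up later, when something tries to read the data as UTF-8.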
This may be starting to sound a little bleak and it probably is. However, Ruby 1.8 throws one major exception into the works that can help you in many cases: the regex engine is aware of four character encodings. Often we can use this simple fact to work with characters.
What encodings does Ruby 1.8 know? Here's the full list:
- None (n or N)
- EUC (e or E)
- Shift_JIS (s or S)
- UTF-8 (u or U)
The None encoding is the default in Ruby 1.8. It's just the golden rule I've already mentioned: treat everything as bytes. If your encoding isn't on this list, you will need to use None and be darn sure you don't do anything to the data that could damage the encoding. That's very hard and the fact is that doing significant work with an encoding not on the above list in Ruby 1.8 will be quite a challenge for you.
Both EUC (Extended Unix Code) and Shift_JIS are primarily Asian character encodings. Shift_JIS is a Japanese encoding and EUC is mainly used for Japanese, Korean, and simplified Chinese. You can tell Ruby comes from Japan, can't you? Obviously these are very helpful if you work with text in those languages, but the rest of us won't need them much.
Now we get to the good news: our champion UTF-8 made the list! Yes, this means Ruby 1.8 has limited support for working with UTF-8 data. It's not comprehensive, but we get some help.
The letters listed after each encoding are used in multiple places inside Ruby 1.8 to tell it which encoding you need to work with. I'll point those places out as we get into the details.
What does it mean to have a character encoding on the above list? It means that the regex engine can recognize characters in that encoding, even if they are multibyte. That assures us that regular expression constructs that target characters, like character classes ([…]) and the match-one-character shortcut (.), will correctly match whatever number of bytes represents one character at that place in the data. It also changes the definition of constructs like \s and \w which can be used to match whitespace and word characters respectively. The definition of a "word" character in Unicode is quite a bit broader than the simple ASCII character class of [A-Za-z0-9_].
Let's look at some examples of this, so you can see how it works. I'll play around with a simple UTF-8 String in Ruby 1.8 and show you the various encoding effects. Remember that the default encoding is None, so that's what we get if we don't ask for anything else.
A common task in working with characters in Ruby 1.8 is to convert a String into an Array of characters. If we can do just that much, we can work around some of the weaknesses of Ruby 1.8's String always working in bytes. Given that, this almost does what we want:
$ ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]
You probably know that scan() just builds an Array of matches for the passed Regexp in the String receiver. The /m option I'm using here puts the regex engine in multi-line mode and in that mode a . matches all characters (it usually doesn't match newlines).
So what went wrong above? Well, the "é" characters in my String take two bytes in UTF-8. The golden rule tells us Ruby 1.8 works in bytes and that's definitely what we saw. It split up the bytes needed for those characters. This is bad, because if I now change this Array, I have excellent chances of breaking my data.
Again, that used the default None mode, because we didn't tell it to do otherwise. However, if we throw the regex engine into UTF-8 mode, we will get actual characters:
$ ruby -e 'p "Résumé".scan(/./mu)'
["R", "\303\251", "s", "u", "m", "\303\251"]
Notice how the two bytes needed for the "é" stay together now? (I'll show you how to get Ruby to stop escaping the content and show the actual "é" in my next post.) The regex engine saw that it takes both bytes to make a character in UTF-8, the encoding I requested, and thus the ., which matches one character, is forced to grab them both.
I chose UTF-8 mode by adding the /u option to my Regexp literal. You probably recognize the letter from my earlier list of encodings. Similarly, you can use /e for EUC, /s for Shift_JIS, and even /n for None though that's the default. Regexp.new() also accepts a third parameter for these encodings if you are creating expressions that way: Regexp.new(".", Regexp::MULTILINE, "u").
Using this one simple trick, we can fix some of the unsafe String methods I mentioned earlier. For example, Ruby 1.8 normally counts bytes with size():
$ ruby -e 'p "Résumé".size'
8
but we can now count characters, if desired:
$ ruby -e 'p "Résumé".scan(/./mu).size'
6
We can also fix the dangerous reverse() method which would normally break our multibyte "é" characters by screwing up the byte order:
$ ruby -e 'p "Résumé".reverse'
"\251\303mus\251\303R"
"\303\251" is a UTF-8 "é", but the "\251\303" we see here is broken UTF-8 data that doesn't mean anything. We can fix that with:
$ ruby -e 'p "Résumé".scan(/./mu).reverse.join'
"\303\251mus\303\251R"
This time we use the regex engine to divide the String into a character Array, then we reverse() that and join() it back into a String. You can see that this kept the "é" bytes in the proper order.
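If you reach for this trick often, it may be worth wrapping it up. The following is only a sketch: utf8_chars, utf8_length, and utf8_reverse are hypothetical helper names, not methods Ruby provides, and they assume the String really does hold valid UTF-8.

```ruby
# Hypothetical helpers, not part of Ruby: they assume str holds valid UTF-8.

# Split a String into an Array of one-character Strings.
def utf8_chars(str)
  str.scan(/./mu)
end

# Count characters rather than bytes.
def utf8_length(str)
  utf8_chars(str).size
end

# Reverse by character so multibyte sequences stay intact.
def utf8_reverse(str)
  utf8_chars(str).reverse.join
end

p utf8_length("Résumé")                 # 6
p utf8_reverse("Résumé").unpack("U*")   # [233, 109, 117, 115, 233, 82]
```

The last line prints code points rather than the String itself, which makes it easy to check that each "é" (233) came through the reversal in one piece.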
Really study these examples above until you understand what's going on here. This is all the support Ruby 1.8 provides for working with characters, so you need to understand how to use it.
Here's one last set of examples showing the other regex change I mentioned:
$ ruby -e 'p "Résumé"[/\w+/]'
"R"
$ ruby -e 'p "Résumé"[/\w+/u]'
"R\303\251sum\303\251"
In the default None mode, \w is the same as [A-Za-z0-9_]. That doesn't match the special bytes needed to build the "é" character, so the match ends there. Note that UTF-8 mode changes that though and we get the full word.
Ruby 1.8 doesn't provide a whole lot of additional encoding support outside the regex engine. There is one magic variable and some helpful standard libraries we will discuss in future posts, but the main part of Ruby 1.8's character encoding support is just this.
One other small feature that may be worth a quick mention is that you can get Unicode code points using String's unpack() method:
$ ruby -e 'p "Résumé".unpack("U*")'
[82, 233, 115, 117, 109, 233]
The U code tells unpack() to convert a character into a Unicode code point and the * just repeats it for all characters in the String.
I don't find myself needing to work with code points often, but you can use this for one interesting cheat. The Unicode code points are a superset of the byte values used in Latin-1, so you can actually convert between the two encodings using just unpack() and pack():
utf8 = latin1.unpack("C*").pack("U*")
# ... or ...
latin1 = utf8.unpack("U*").pack("C*") # more dangerous
However, I'll show you a superior way to handle encoding conversions in a future post.
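The second direction is "more dangerous" because Latin-1 only has room for code points 0 through 255, so anything larger cannot be represented in a single byte. One way to guard against that, sketched here with hypothetical method names, is to check the code points before packing them as bytes.

```ruby
# Hypothetical helpers built on the pack()/unpack() trick above.

def latin1_to_utf8(latin1)
  latin1.unpack("C*").pack("U*")   # every byte value is a valid code point
end

def utf8_to_latin1(utf8)
  code_points = utf8.unpack("U*")
  too_big     = code_points.find { |cp| cp > 255 }
  if too_big
    raise ArgumentError, "U+%04X does not fit in Latin-1" % too_big
  end
  code_points.pack("C*")
end

p utf8_to_latin1("é").unpack("C*")                # [233]
p latin1_to_utf8([233].pack("C*")).unpack("C*")   # [195, 169]
```

Raising an error is just one choice. Depending on your data, substituting a placeholder such as "?" for the characters that don't fit might be more appropriate.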
It's important to remember that this is not full character encoding support. For example, there is a long list of rules about how to correctly convert some Unicode characters to upper case, but upcase() doesn't know them and you cannot regex your way out of that mess. If you need these features for a given encoding, you will need to look for an external library that meets your needs or roll your own solution.
Comments (11)
-
amishera March 26th, 2010
Mr. Gray,
Your articles on character encodings are very informative. I have a question about this problem I am not sure how to deal with it.
I have an HTML file which is encoded in UTF-8. The file contains the following text:
It&#39;s a wonderful life
Now the character code 39 is for apostrophe in UTF-8, so suppose I got the 39 out of the text using:
s = "It&#39;s a wonderful life"
s.gsub(/&#(\d+);/, '\1')
The output is
It39s a wonderful life
So firstly I am having trouble making it
It\39s a wonderful life
Secondly I manually did this in test_utf8.rb:
puts "It\39s a wonderful life"
and ran it
ruby test_utf8.rb > utf8.txt
but on opening it in OpenOffice with the encoding set to UTF-8, the output is
It#9s a wonderful life
So how do I correctly parse and convert HTML character references to encoded characters in UTF-8 and then save the file?
-
This isn't really a character encoding issue. It's about HTML escapes, as the solution you received on Ruby Talk shows.
-
Mr. Gray,
Thanks for your reply.
I solved the problem using this:
s.gsub!(/&#(\d+);/) { [$1.to_i].pack("U*") }
Which not only converts the HTML part, i.e. converts &#39; into 39, but also gets the corresponding Unicode character. But I am not sure whether this is the right approach. Or is there a better approach?
-
Sorry I wanted to mean that given a number I wanted to convert to Unicode character. So like you know for ASCII we do something like
39.chr
so for doing the same thing for Unicode instead of ASCII, what is the right approach?
-
Your approach is correct, though I still prefer the solution you got from Ruby Talk. It's more general and probably more correct with regard to Web encodings. For example, the default for the Web is Latin-1, not UTF-8. Those two encodings are codepoint compatible for all of Latin-1's range though, which is why your code above works.
-
Mr. Gray,
I have a question which I am unable to resolve. As 'e' here is represented by 2 bytes, \303 and \251. But 1 byte has a maximum range of 0-255. So how come one of the bytes is 303, which should be out of range?
-
Those character escapes are octal, so \303 is really 195. Good question.
-
-
I have a question regarding the different Ruby versions.
When I ran this code:
p "Résumé"[/\w+/u]
With Ruby 1.8 I get:
"R\303\251sum\303\251"
As expected, but if I run it on Ruby 1.9 I only get
"R"
Why does this regex only return the first match? How can I get all matches?
-
Try this in 1.9:
p "Résumé"[/\p{Word}+/]
-
-
I am trying to install the character-encoding gem on Ubuntu 64-bit 11.10, Ruby 1.8.7.
I get an error as follows: http://pastie.org/2773649. Can you help in fixing it?
Can you mail any solution?
-
You are getting errors about something trying to require the psych library that wasn't added to Ruby until 1.9. I suspect you have somehow corrupted your RubyGems install between using different versions, but I'm just guessing.
-