30 MAR 2009

Ruby 1.9's String

This post is part of a series.

Ruby 1.9 has an all new encoding engine called m17n (for multilingualization, with 17 letters between the m and n). This new engine may not be what you are used to from many other modern languages.

It's common to pick one versatile encoding, likely a Unicode encoding, and work with all data in that one format. Ruby 1.9 goes a different way. Instead of favoring one encoding, Ruby 1.9 makes it possible to work with data in over 80 encodings.

To accomplish this, changes had to be made in several places where Ruby works with character data. You're going to notice those changes the most in Ruby's String though, so let's begin by talking about what's changed there.

All Strings are now Encoded

In Ruby 1.8 a String was a collection of bytes. You sometimes treated those bytes as other things, like characters when you hit it with a Regexp or lines when you called each(). At its core though, it was just some bytes. You indexed the data by byte counts, sizes were in bytes, and so on.

In Ruby 1.9 a String is now a collection of encoded data. That means it is both the raw bytes and the attached Encoding information about how to interpret those bytes.

Let me show a simple example of this difference. Don't worry yet about how I got the data into these variables. We're going to talk a lot about that down the road. For now, just focus on how Ruby uses the Encoding to decide how to handle the data:

# the attached encoding
puts utf8_resume.encoding.name    # >> UTF-8
puts latin1_resume.encoding.name  # >> ISO-8859-1

# size() is now encoded data size or characters
puts utf8_resume.size    # >> 6
puts latin1_resume.size  # >> 6

# but we can ask for a raw bytesize() to see they are different
puts utf8_resume.bytesize    # >> 8
puts latin1_resume.bytesize  # >> 6

# we now index by encoded data (again characters)
puts utf8_resume[2..4]    # >> sum
puts latin1_resume[2..4]  # >> sum

These examples may seem very basic, but there's a whole lot we can learn from them. First, notice how all String objects now have this attached Encoding object. As I said before, they are a container of bytes and the attached information about how to interpret those bytes. You'll now always have both pieces in Ruby Strings, even if the rules of interpretation just say to treat them as raw bytes (more on that later).

The next two examples show that when we ask Ruby the size() of the data, it now interprets the bytes as dictated by the attached rules and gives us the encoded size which is generally just a character count. We can explicitly ask for the raw bytesize() if we want, but that's no longer the norm. This is a big change from Ruby 1.8.

The final example shows us that indexing is similarly affected. We are now counting in terms of encoded data or characters, not bytes. So even though Ruby had to skip three bytes in the UTF-8 String but only two in the Latin-1 String, my slices returned the same characters for the same indices.

The important takeaway is this: a String is now some bytes and the rules for interpreting those bytes. Hopefully that is starting to feel a little natural to you since that's really what we decided character encodings are all about.
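If you want to run the examples above yourself, here is one possible way to build the two variables. This is only a sketch and not how the post's data was created: sample_resumes() is a hypothetical helper, and it leans on force_encoding(), which is covered in the next section. It builds each String from explicit byte values so nothing depends on the encoding of the source file:

```ruby
# Hypothetical helper, not from the post: builds the two sample Strings from
# explicit byte values so the result does not depend on the source encoding.
def sample_resumes
  utf8_bytes   = [82, 195, 169, 115, 117, 109, 195, 169]  # accented "Resume" in UTF-8
  latin1_bytes = [82, 233, 115, 117, 109, 233]            # the same text in Latin-1
  [
    utf8_bytes.pack("C*").force_encoding("UTF-8"),
    latin1_bytes.pack("C*").force_encoding("ISO-8859-1")
  ]
end

utf8_resume, latin1_resume = sample_resumes

p utf8_resume.size        # >> 6
p utf8_resume.bytesize    # >> 8
p latin1_resume.size      # >> 6
p latin1_resume.bytesize  # >> 6

# the same two characters occupy a different number of bytes
p utf8_resume[0..1].bytesize    # >> 3
p latin1_resume[0..1].bytesize  # >> 2
```

The last two lines show the byte counts behind the slicing example: the first two characters take three bytes in UTF-8 but only two in Latin-1.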

Changing an Encoding

Again, I don't want to get into how a String gets its initial Encoding just yet. That's a topic all its own that we will discuss down the road. However, there are times when you will want to change an Encoding and that's related to some more new features in String. Let's talk about those.

The first way to change an Encoding is to call force_encoding(). This is your way to tell Ruby you know better what this data is and you need to change the rules for how this data is being treated. For example:

abc = "abc"
puts abc.encoding.name  # >> US-ASCII

abc.force_encoding("UTF-8")
puts abc.encoding.name  # >> UTF-8

As this example shows, when I created this String, Ruby gave it a US-ASCII Encoding. Again, let's not worry about how Ruby made that decision yet. The important thing is that I didn't want US-ASCII, but instead UTF-8. Thus I used force_encoding() to tell Ruby, this is actually UTF-8 data so you need to change the Encoding attached to it.

Now, it's important to note I could get away with this in this case because those bytes mean the same thing in US-ASCII and UTF-8 character encodings. I didn't change the data at all, just the rules for interpreting that data.

That can be dangerous though. The risk is that you may set the rules incorrectly for some data. Let's go back to my earlier Latin-1 String to show you what I mean:

# the correct Encoding for this data
puts latin1_resume.encoding.name    # >> ISO-8859-1
puts latin1_resume.bytesize         # >> 6
puts latin1_resume.valid_encoding?  # >> true

# a mistake, setting the wrong Encoding
latin1_resume.force_encoding("UTF-8")

# the data is unchanged, but now the Encoding doesn't match
puts latin1_resume.encoding.name    # >> UTF-8
puts latin1_resume.bytesize         # >> 6
puts latin1_resume.valid_encoding?  # >> false

# when we later try to use the data
latin1_resume =~ /\AR/  # !> ArgumentError:
                        #    invalid byte sequence in UTF-8

Note how my use of force_encoding() switches the Encoding but not the data. You can tell because the bytesize() didn't change. Well, those bytes aren't a valid chunk of UTF-8 data as valid_encoding?() tells us. Worse, if we try to actually use the broken data, we may get fireworks as we do when I apply a Regexp here.
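One way to catch this kind of mistake early is to relabel a copy of the data and check valid_encoding?() before using it. The following is a minimal sketch of that idea. relabel_as_utf8() is a hypothetical helper I'm introducing for illustration, and returning nil on failure is just one possible design choice:

```ruby
# Hypothetical helper: relabel a copy of the data as UTF-8, but only hand it
# back when the bytes really are valid UTF-8.
def relabel_as_utf8(string)
  candidate = string.dup.force_encoding("UTF-8")
  candidate.valid_encoding? ? candidate : nil
end

ascii  = "abc".encode("US-ASCII")
latin1 = [82, 233, 115, 117, 109, 233].pack("C*").force_encoding("ISO-8859-1")

p relabel_as_utf8(ascii).encoding.name  # >> "UTF-8"
p relabel_as_utf8(latin1)               # >> nil
p latin1.encoding.name                  # >> "ISO-8859-1"
```

Because the helper works on a dup(), the original String keeps its Encoding even when the check fails.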

That leads us to the other way to change an Encoding. If what we have is a valid set of data in some Encoding and what we want is that data translated into a different Encoding, we need to transcode the data. You can do that in Ruby 1.9 with the encode() method (or encode!() to modify the original String instead of building a new one).

Let's try that Latin-1 to UTF-8 conversion one more time using encode():

# valid Latin-1 data
puts latin1_resume.encoding.name    # >> ISO-8859-1
puts latin1_resume.bytesize         # >> 6
puts latin1_resume.valid_encoding?  # >> true

# transcode the data to UTF-8
transcoded_utf8_resume = latin1_resume.encode("UTF-8")

# now correctly changed to UTF-8
puts transcoded_utf8_resume.encoding.name    # >> UTF-8
puts transcoded_utf8_resume.bytesize         # >> 8
puts transcoded_utf8_resume.valid_encoding?  # >> true

As you can see the difference with this approach was that both the Encoding and the data changed. The data was in fact translated from the old Encoding to the new one.

That leaves us with some pretty easy rules for deciding when to use each tactic. If you know what the data is better than Ruby does and you just need to fix the Encoding, use force_encoding(). Just be careful in such cases, because if you are wrong you may set up errors that get triggered the next time the data is used (possibly far away from the Encoding switch). When you want to translate data from one Encoding to another, use encode().
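To see the two tactics side by side, it can help to look at the raw bytes. In this sketch describe() is a hypothetical helper (not part of Ruby) that reports a String's Encoding name and bytes, and the Latin-1 data is built from explicit byte values as an assumption:

```ruby
# Hypothetical helper: summarize a String as [encoding name, raw bytes].
def describe(string)
  [string.encoding.name, string.bytes.to_a]
end

latin1 = [82, 233, 115, 117, 109, 233].pack("C*").force_encoding("ISO-8859-1")

relabeled  = latin1.dup.force_encoding("UTF-8")  # new rules, same bytes
transcoded = latin1.encode("UTF-8")              # new rules, translated bytes

p describe(latin1)      # >> ["ISO-8859-1", [82, 233, 115, 117, 109, 233]]
p describe(relabeled)   # >> ["UTF-8", [82, 233, 115, 117, 109, 233]]
p describe(transcoded)  # >> ["UTF-8", [82, 195, 169, 115, 117, 109, 195, 169]]

# transcoding back restores the original bytes
p describe(transcoded.encode("ISO-8859-1")) == describe(latin1)  # >> true
```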

Be Careful with Comparisons

These changes to how String data is managed have complicated the rules of String comparison a bit, unfortunately. I'm going to go against the grain here and recommend against spending a lot of energy memorizing the new rules.

Instead, I think it's much more useful to come up with one rule that's more likely to serve you better in the long run. For that I suggest: normalize a group of String objects to the same Encoding before working with them together. That goes for comparisons and other shared operations as well.

I just think it's too hard to work with several different kinds of data and reason correctly about what's going to happen as you do.

One thing that may help a little in normalizing your data is Ruby's concept of compatible Encodings. Here's an example of checking and taking advantage of compatible Encodings:

# data in two different Encodings
p ascii_my                      # >> "My "
puts ascii_my.encoding.name     # >> US-ASCII
p utf8_resume                   # >> "Résumé"
puts utf8_resume.encoding.name  # >> UTF-8

# check compatibility
p Encoding.compatible?(ascii_my, utf8_resume)  # >> #<Encoding:UTF-8>

# combine compatible data
my_resume = ascii_my + utf8_resume
p my_resume                   # >> "My Résumé"
puts my_resume.encoding.name  # >> UTF-8

In this example I had data in two different Encodings, US-ASCII and UTF-8. I asked Ruby if the two pieces of data were compatible?(). Ruby can respond to that question in one of two ways. If it returns nil, the data is not compatible and you will probably need to transcode at least one piece of it to work with the other. If an Encoding is returned, the data is compatible and can be concatenated, resulting in data with the returned Encoding. You can see how that played out when I combined these Strings.

This feature is probably most useful for what I've shown right here, combining ASCII with a bigger Encoding. More complicated scenarios are going to require some transcoding.
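Here's a rough sketch of the "normalize first" rule applied to a comparison. normalize_all() is a hypothetical helper, and defaulting to UTF-8 as the shared Encoding is my assumption; any Encoding that can represent all of your data would do:

```ruby
# Hypothetical helper: transcode every String to one shared Encoding.
def normalize_all(strings, target = Encoding::UTF_8)
  strings.map { |string| string.encode(target) }
end

utf8   = "R\u00E9sum\u00E9"
latin1 = utf8.encode("ISO-8859-1")

# same text, but different Encodings and different bytes
p Encoding.compatible?(utf8, latin1)  # >> nil
p utf8 == latin1                      # >> false

from_utf8, from_latin1 = normalize_all([utf8, latin1])
p from_utf8 == from_latin1            # >> true
```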

Explicit Iteration

In Ruby 1.8, String's each() method iterated over lines of data. I imagine that was done because it's a common way to process data, but the question is what made lines the correct choice? What about iterating by bytes or characters? You could iterate by bytes in Ruby 1.8 using each_byte(), but you needed to resort to Regexp tricks to get characters.

In the Ruby 1.9 realm of all encoded data, blessing one type of iteration just doesn't make sense. Instead, each() has been removed from String and it is no longer Enumerable. This is probably one of the biggest changes to the core API that code will need to adapt to.

Take heart though, String iteration is not gone. Instead, you now just need to be explicit about what you want to iterate over and you have several choices:

utf8_resume.each_byte do |byte|
  puts byte
end
# >> 82
# >> 195
# >> 169
# >> 115
# >> 117
# >> 109
# >> 195
# >> 169

utf8_resume.each_char do |char|
  puts char
end
# >> R
# >> é
# >> s
# >> u
# >> m
# >> é

utf8_resume.each_codepoint do |codepoint|
  puts codepoint
end
# >> 82
# >> 233
# >> 115
# >> 117
# >> 109
# >> 233

utf8_resume.each_line do |line|
  puts line
end
# >> Résumé

Similarly, you can ask for an Enumerator for each type when you want to use a different iterator than just each(). The standard method of just not passing a block to get an Enumerator works on the methods above, but there are also methods just for this purpose:

p utf8_resume.bytes.first(3)
# >> [82, 195, 169]
p utf8_resume.chars.find { |char| char.bytesize > 1 }
# >> "é"
p utf8_resume.codepoints.to_a
# >> [82, 233, 115, 117, 109, 233]
p utf8_resume.lines.map { |line| line.reverse }
# >> ["émuséR"]

[Update: bytes(), chars(), etc. were changed to return Arrays instead of Enumerators in Ruby 2.0 and up.]
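If you want an Enumerator no matter which of those Ruby versions you are on, calling the each_*() methods without a block is one option. As a small sketch, multibyte_positions() below is a hypothetical helper that chains each_char() with other iterators to find the characters that need more than one byte:

```ruby
# Hypothetical helper: character positions that need more than one byte.
def multibyte_positions(string)
  string.each_char.with_index
        .select { |char, _index| char.bytesize > 1 }
        .map    { |_char, index| index }
end

p multibyte_positions("R\u00E9sum\u00E9")  # >> [1, 5]
p multibyte_positions("abc")               # >> []

# the block-less form works for the other iterators too
p "R\u00E9sum\u00E9".each_byte.first(3)    # >> [82, 195, 169]
```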

I think we'll find in the long run that this change is definitely for the better. I feel it makes the code more self-documenting. That's always a good thing, in my opinion.

The trickiest part about losing each() is when you need your code to run on both Ruby 1.8 and 1.9. When that's the case, you can either add a method to String in Ruby 1.8:

if RUBY_VERSION < "1.9"
  require "enumerator"
  class String
    def lines
      enum_for(:each)
    end
  end
end

or use a simple trick like:

str.send(str.respond_to?(:lines) ? :lines : :to_s).each do |line|
  # ...
end
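When all you need is plain line iteration, another option is each_line(), which String provides in both Ruby 1.8 and 1.9. A minimal sketch, where line_lengths() is a hypothetical helper used only for illustration:

```ruby
# Hypothetical helper: collect the length of each line without its newline.
# each_line() is called with a block, which works on both 1.8 and 1.9.
def line_lengths(text)
  lengths = []
  text.each_line { |line| lengths << line.chomp.size }
  lengths
end

p line_lengths("one\nthree\n")  # >> [3, 5]
```

This only covers the simple each() case. The other Enumerable iterators still need one of the approaches above.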

In: Character Encodings | Tags: Multilingualization | 15 Comments

Comments (15)

Axel Niedenhoff April 2nd, 2009 Reply Link

I have just checked the docs for the String class in 1.8, because I thought it defines an each_line method (as an equivalent for the each method). And it does! So maybe the trick to make code work on both 1.8 and 1.9 is just to use each_line instead of each?
2. James Edward Gray II April 2nd, 2009 Reply Link
  
  That's a good point. Using each_line() may help a little with simple each() iteration cases. Of course, if you need any of the other iterators Enumerable provides to String in Ruby 1.8 you'll need something like the hacks I mentioned.
James Edward Gray II May 27th, 2009 Reply Link
It's probably worth mentioning that it is possible for a transcoding operation to fail with an error. For example:
```
$ cat transcode.rb 
# encoding: UTF-8
utf8   = "Résumé…"
latin1 = utf8.encode("ISO-8859-1")
$ ruby transcode.rb 
transcode.rb:3:in `encode': "\xE2\x80\xA6" from UTF-8 to ISO-8859-1 
(Encoding::UndefinedConversionError)
  from transcode.rb:3:in `<main>'
```
Naturally this fails because "…" is not a valid character in Latin-1. I've shown ways to handle this using iconv in the past and those still work just fine in Ruby 1.9. The new Ruby does include some simple translation options though and we can use those to do some crude translation:
```
# encoding: UTF-8
utf8   = "Résumé…"
latin1 = utf8.encode("ISO-8859-1", undef: :replace)
puts latin1  # >> Résumé?
```
As you can see, I just asked for undefined characters in the target Encoding to be replaced here, which Ruby used a "?" for. You can set the :replace key to any String you would prefer though. You can also set :invalid to :replace to swap out invalid characters in the original Encoding as the transcoding occurs. Finally there are some utility options like :universal_newline, which will transcode "\r\n" and "\r" to "\n" when set to true.
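As a rough sketch of those options together, here is the strict failure being rescued alongside the :replace variants. to_latin1_lossy() is a hypothetical helper name, and the "..." placeholder is just an illustrative choice:

```ruby
# Hypothetical helper: transcode to Latin-1, substituting a placeholder for
# characters Latin-1 cannot represent.
def to_latin1_lossy(string, placeholder = "?")
  string.encode("ISO-8859-1", undef: :replace, replace: placeholder)
end

utf8 = "R\u00E9sum\u00E9\u2026"  # ends with U+2026, the ellipsis character

begin
  utf8.encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => error
  puts "strict transcoding failed: #{error.class}"
end
# >> strict transcoding failed: Encoding::UndefinedConversionError

p to_latin1_lossy(utf8).bytes.to_a.last            # >> 63
p to_latin1_lossy(utf8, "...").bytes.to_a.last(3)  # >> [46, 46, 46]
```

Byte 63 is "?" and byte 46 is ".", so the ellipsis was swapped for the placeholder in both calls.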
2. Ryan July 29th, 2010 Reply Link
  
  Wow, this was a huge help. I have been beating my head against my desk trying to figure out how to properly encode a string. Kept getting ASCII 8-Bit to UTF-8 errors, but using encode! with undef and replace in the options hash worked like a charm. Thanks!
James Edward Gray II May 27th, 2009 Reply Link
Ruby has another interesting exception when it comes to Encoding incompatibility that may be worth mentioning. You can't typically add Strings with incompatible Encodings as this shows:
```
$ cat incompatible.rb 
# encoding: UTF-8
utf8 = "一"
sjis = "二".encode("Shift_JIS")
puts "#{utf8.encoding} + #{sjis.encoding} ="
utf8 + sjis
$ ruby incompatible.rb 
UTF-8 + Shift_JIS =
incompatible.rb:5:in `<main>': incompatible character encodings:
UTF-8 and Shift_JIS (Encoding::CompatibilityError)
```
However, Ruby does keep an eye on String content (mainly for optimization purposes) and when both Strings contain only 7-bit ASCII, an exception will be made:
```
$ cat ascii.rb 
# encoding: UTF-8

utf8 = "abc"
sjis = "def".encode("Shift_JIS")

print "Given all ASCII data:  " if [utf8, sjis].all?(&:ascii_only?)
print "#{utf8.encoding} + #{sjis.encoding} = "

result = utf8 + sjis
puts result.encoding
$ ruby ascii.rb 
Given all ASCII data:  UTF-8 + Shift_JIS = UTF-8
```
There are a few points of interest in this little example. First, note the ascii_only?() method to check for these special cased Strings. Next, notice that Ruby did do the concatenation, even though these are not compatible Encodings. Finally, the result had an Encoding of UTF-8, simply because that was the Encoding of the first (leftmost) String. It would have been Shift_JIS had I reversed them.

I still don't really recommend relying on these special behaviors though. I believe you will encounter fewer problems if you stick to my advice of normalizing String Encodings before working with mixed data.
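One way to follow that advice when concatenating is to check compatibility first and transcode only when needed. This is a sketch under my own assumptions: safe_concat() is a hypothetical helper, and choosing the left String's Encoding as the target is an arbitrary design choice:

```ruby
# Hypothetical helper: concatenate directly when Ruby considers the data
# compatible, otherwise transcode the right side to the left side's Encoding.
def safe_concat(left, right)
  if Encoding.compatible?(left, right)
    left + right
  else
    left + right.encode(left.encoding)
  end
end

ascii_utf8 = "abc".encode("UTF-8")
ascii_sjis = "def".encode("Shift_JIS")
kanji_utf8 = "\u4E00"
kanji_sjis = "\u4E8C".encode("Shift_JIS")

p Encoding.compatible?(ascii_utf8, ascii_sjis)  # >> #<Encoding:UTF-8>
p Encoding.compatible?(kanji_utf8, kanji_sjis)  # >> nil

combined = safe_concat(kanji_utf8, kanji_sjis)
p combined.encoding.name  # >> "UTF-8"
p combined.size           # >> 2
```

Note that encode() can still raise if the right side contains characters the left side's Encoding cannot represent.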
2. Joe Marty September 7th, 2011 Reply Link
  
  Thank you for this amazingly helpful series!
  I was curious about your comment regarding an "interesting exception when it comes to Encoding incompatibility" when using only 7-bit ASCII. Is this actually an exception? The impression I got was that the .compatible? method would compare string content and find out if one of the encodings could (or if there is an encoding that could) potentially contain all of the characters in both strings. Or does it simply compare the encodings against a table of 100% compatible encodings, and return the result?
  
  In the former case, the fact that both strings contain 7-bit ASCII is just a coincidence, not a special exception, and what really matters is the fact that all the characters in both strings can be encoded in UTF-8, so compatible? returns UTF-8, and adding strings, therefore, uses UTF-8... however I have not tried this or figured out how to setup an experiment to see if other cases work the same way.
  
  Do you know which is the case?
  2. James Edward Gray II September 7th, 2011 Reply Link
    
    I'm pretty sure Ruby does not compare contents. That could get pretty inefficient on large Strings.
    
    I believe the case is that it compares encodings, with the special exception for 7-bit ASCII Strings.
MK June 3rd, 2009 Reply Link

Thanks for this guy -- I didn't read the whole series but boy am I glad it's here...I'm new to ruby and rails and just ran into an encoding error on html pages done in iso-8859-1. You did a good job of demonstrating the important points.

Thanks again, keep up the good work!
massi January 11th, 2010 Reply Link
Hi,

I'm trying to render an image from mysql using send_data and I'm getting this error :
invalid byte sequence in UTF-8
Here is my code :
```
def get_photo
    @image_data = Photo.find(params[:id])
    @image = @image_data.binary_data
    @url  = @image_data.url
    send_data(@image, :type => 'image/jpeg',
                      :filename => "#{params[:id]}.jpg",
                      :disposition => 'inline')
end
```
BTW, I'm using ruby 1.9.1 with rails 2.3.5.
2. James Edward Gray II January 11th, 2010 Reply Link
  
  Without knowing where the error comes from, the only thing I can see that might be an issue is if params[:id] contained non-UTF-8 bytes. This question is probably better asked on the Rails mailing list though.
руби May 11th, 2010 Reply Link

Thank you for this note. The method .each still appears in the core documentation with its alias, .each_line: http://www.ruby-doc.org/core-1.8.6/String.html#method-i-each

This is confusing because .each seems to be what I want when I want "each word delimited by a specific character," however, changing to .each_line did work.
Matthew November 17th, 2010 Reply Link

Hmm I tried transcoding the following to UTF-8 and it still fails:

I like to go to the store…

You can see the M$ fancy ellipsis char at the end. I always get this error even if I try to transcode:

incompatible character encodings: ASCII-8BIT and UTF-8
2. James Edward Gray II November 17th, 2010 Reply Link
  
  I assume the content is already proper UTF-8 bytes and you really just need to force_encoding("UTF-8"). Hope that helps.
noname February 22nd, 2011 Reply Link

Great info, thanx for ur hard work ;)
Tim Segraves November 8th, 2012 Reply Link

Thanks James, this was immensely useful. I'd spent several hours trying to get rid of the <?> I was getting in my markup. Using the force_encoding("UTF-8") was all I needed.