30 MAR 2009
Ruby 1.9's String
Ruby 1.9 has an all-new encoding engine called m17n (for multilingualization, with 17 letters between the m and n). This new engine may not be what you are used to from many other modern languages.
It's common to pick one versatile encoding, likely a Unicode encoding, and work with all data in that one format. Ruby 1.9 goes a different way. Instead of favoring one encoding, Ruby 1.9 makes it possible to work with data in over 80 encodings.
To accomplish this, changes had to be made in several places where Ruby works with character data. You're going to notice those changes the most in Ruby's String though, so let's begin by talking about what's changed there.
All Strings are now Encoded
In Ruby 1.8 a String was a collection of bytes. You sometimes treated those bytes as other things, like characters when you hit it with a Regexp or lines when you called each(). At its core though, it was just some bytes. You indexed the data by byte counts, sizes were in bytes, and so on.
In Ruby 1.9 a String is now a collection of encoded data. That means it is both the raw bytes and the attached Encoding information about how to interpret those bytes.
Let me show a simple example of this difference. Don't worry yet about how I got the data into these variables. We're going to talk a lot about that down the road. For now, just focus on how Ruby uses the Encoding to decide how to handle the data:
# the attached encoding
puts utf8_resume.encoding.name # >> UTF-8
puts latin1_resume.encoding.name # >> ISO-8859-1
# size() is now encoded data size or characters
puts utf8_resume.size # >> 6
puts latin1_resume.size # >> 6
# but we can ask for a raw bytesize() to see they are different
puts utf8_resume.bytesize # >> 8
puts latin1_resume.bytesize # >> 6
# we now index by encoded data (again characters)
puts utf8_resume[2..4] # >> sum
puts latin1_resume[2..4] # >> sum
These examples may seem very basic, but there's a whole lot we can learn from them. First, notice how all String objects now have this attached Encoding object. As I said before, they are a container of bytes and the attached information about how to interpret those bytes. You'll now always have both pieces in Ruby Strings, even if the rules of interpretation just say to treat them as raw bytes (more on that later).
The next two examples show that when we ask Ruby the size() of the data, it now interprets the bytes as dictated by the attached rules and gives us the encoded size which is generally just a character count. We can explicitly ask for the raw bytesize() if we want, but that's no longer the norm. This is a big change from Ruby 1.8.
The final example shows us that indexing is similarly affected. We are now counting in terms of encoded data or characters, not bytes. So even though Ruby had to skip three bytes in the UTF-8 String but only two in the Latin-1 String, my slices returned the same characters for the same indices.
The important takeaway is this: a String is now some bytes and the rules for interpreting those bytes. Hopefully that is starting to feel a little natural to you since that's really what we decided character encodings are all about.
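If you want to poke at these numbers yourself, here is a minimal sketch. Since I haven't covered how the variables above were created, the setup below is only an assumption: it builds equivalent data with \u escapes and encode(), and string_facts() is a helper name I made up for this example.

```ruby
# Hypothetical setup standing in for the article's variables.
utf8_resume   = "R\u00E9sum\u00E9"                 # "Résumé" as UTF-8
latin1_resume = utf8_resume.encode("ISO-8859-1")   # same text, Latin-1 bytes

# string_facts() is a helper invented for this sketch. It gathers the
# Encoding name, the character count, the byte count, and a slice.
def string_facts(string)
  { encoding: string.encoding.name,
    chars:    string.size,
    bytes:    string.bytesize,
    slice:    string[2..4].encode("UTF-8") }
end

p string_facts(utf8_resume)
p string_facts(latin1_resume)
```

Both calls should report 6 characters and the slice "sum", while the byte counts differ (8 for UTF-8, 6 for Latin-1).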
Changing an Encoding
Again, I don't want to get into how a String gets its initial Encoding just yet. That's a topic all its own that we will discuss down the road. However, there are times when you will want to change an Encoding and that's related to some more new features in String. Let's talk about those.
The first way to change an Encoding is to call force_encoding(). This is your way to tell Ruby you know better what this data is and you need to change the rules for how this data is being treated. For example:
abc = "abc"
puts abc.encoding.name # >> US-ASCII
abc.force_encoding("UTF-8")
puts abc.encoding.name # >> UTF-8
As this example shows, when I created this String Ruby gave it a US-ASCII Encoding. Again, let's not worry about how Ruby made that decision yet. The important thing is that I didn't want US-ASCII, but instead UTF-8. Thus I used force_encoding() to tell Ruby, this is actually UTF-8 data so you need to change the Encoding attached to it.
Now, it's important to note I could get away with this in this case because those bytes mean the same thing in the US-ASCII and UTF-8 character encodings. I didn't change the data at all, just the rules for interpreting that data.
That can be dangerous though. The risk is that you may set the rules incorrectly for some data. Let's go back to my earlier Latin-1 String to show you what I mean:
# the correct Encoding for this data
puts latin1_resume.encoding.name # >> ISO-8859-1
puts latin1_resume.bytesize # >> 6
puts latin1_resume.valid_encoding? # >> true
# a mistake, setting the wrong Encoding
latin1_resume.force_encoding("UTF-8")
# the data is unchanged, but now the Encoding doesn't match
puts latin1_resume.encoding.name # >> UTF-8
puts latin1_resume.bytesize # >> 6
puts latin1_resume.valid_encoding? # >> false
# when we later try to use the data
latin1_resume =~ /\AR/ # !> ArgumentError:
# invalid byte sequence in UTF-8
Note how my use of force_encoding() switches the Encoding but not the data. You can tell because the bytesize() didn't change. Well, those bytes aren't a valid chunk of UTF-8 data as valid_encoding?() tells us. Worse, if we try to actually use the broken data, we may get fireworks as we do when I apply a Regexp here.
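One way to catch this kind of mistake early is to check valid_encoding?() right where the Encoding is switched. The following is just a sketch of that idea, not a standard Ruby feature: relabel() is a helper I invented, and the setup line is a hypothetical stand-in for my Latin-1 data.

```ruby
# Hypothetical setup standing in for the article's latin1_resume.
latin1_resume = "R\u00E9sum\u00E9".encode("ISO-8859-1")

# relabel() is a helper invented for this sketch. It relabels a copy of
# the bytes and refuses the new Encoding when the bytes are not valid in it.
def relabel(string, encoding)
  candidate = string.dup.force_encoding(encoding)
  unless candidate.valid_encoding?
    raise ArgumentError, "bytes are not valid #{encoding}"
  end
  candidate
end

begin
  relabel(latin1_resume, "UTF-8")
rescue ArgumentError => error
  puts "Refused: #{error.message}"   # >> Refused: bytes are not valid UTF-8
end

puts latin1_resume.encoding.name     # >> ISO-8859-1
```

Because the helper works on a copy, the original String keeps its correct Encoding when the check fails, and the error shows up at the switch instead of somewhere far away.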
That leads us to the other way to change an Encoding. If what we have is a valid set of data in some Encoding and what we want is that data translated into a different Encoding, we need to transcode the data. You can do that in Ruby 1.9 with the encode() method (or encode!() to modify the original String instead of building a new one).
Let's try that Latin-1 to UTF-8 conversion one more time using encode():
# valid Latin-1 data
puts latin1_resume.encoding.name # >> ISO-8859-1
puts latin1_resume.bytesize # >> 6
puts latin1_resume.valid_encoding? # >> true
# transcode the data to UTF-8
transcoded_utf8_resume = latin1_resume.encode("UTF-8")
# now correctly changed to UTF-8
puts transcoded_utf8_resume.encoding.name # >> UTF-8
puts transcoded_utf8_resume.bytesize # >> 8
puts transcoded_utf8_resume.valid_encoding? # >> true
As you can see, the difference with this approach was that both the Encoding and the data changed. The data was in fact translated from the old Encoding to the new one.
That leaves us with some pretty easy rules for deciding when to use each tactic. If you know what the data is better than Ruby does and you just need to fix the Encoding, use force_encoding(). Just be careful in such cases, because you may set up errors that get triggered the next time the data is used (possibly far away from the Encoding switch) if you are wrong. When you want to translate data from one Encoding to another, use encode().
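To see the two tactics side by side, here is a small sketch that prints the raw bytes after each one. The setup line is an assumed stand-in for my Latin-1 data and byte_report() is a helper name made up for this example.

```ruby
# byte_report() is a helper invented for this sketch.
def byte_report(string)
  "#{string.encoding.name} #{string.bytes.inspect} valid=#{string.valid_encoding?}"
end

# Hypothetical setup standing in for the article's latin1_resume.
latin1_resume = "R\u00E9sum\u00E9".encode("ISO-8859-1")

relabeled  = latin1_resume.dup.force_encoding("UTF-8")  # same bytes, new rules
transcoded = latin1_resume.encode("UTF-8")              # new bytes, new rules

puts byte_report(latin1_resume)
# >> ISO-8859-1 [82, 233, 115, 117, 109, 233] valid=true
puts byte_report(relabeled)
# >> UTF-8 [82, 233, 115, 117, 109, 233] valid=false
puts byte_report(transcoded)
# >> UTF-8 [82, 195, 169, 115, 117, 109, 195, 169] valid=true
```

The relabeled copy carries the same six bytes under rules that don't fit them, while the transcoded copy has each é rewritten as the two-byte UTF-8 sequence 195, 169.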
Be Careful with Comparisons
These changes to how String data is managed have complicated the rules of String comparison a bit, unfortunately. I'm going to go against the grain here and recommend against you spending a lot of energy memorizing the new rules.
Instead, I think it's much more useful to come up with one rule that's more likely to serve you better in the long run. For that I suggest: normalize a group of String objects to the same Encoding before working with them together. That goes for comparisons and other shared operations as well.
I just think it's too hard to work with several different kinds of data and reason correctly about what's going to happen as you do.
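As a rough sketch of that rule, you could transcode everything to one target before combining it. The normalize_all() helper and the sample data here are my own inventions for illustration, and the sketch assumes every input already carries a correct Encoding (encode() will raise otherwise).

```ruby
# normalize_all() is a helper invented for this sketch. It transcodes
# every String to one target Encoding (UTF-8 unless told otherwise).
def normalize_all(strings, target = Encoding::UTF_8)
  strings.map { |string| string.encode(target) }
end

# Hypothetical sample data in two different Encodings.
utf8_resume = "R\u00E9sum\u00E9"
latin1_cafe = "caf\u00E9".encode("ISO-8859-1")

joined = normalize_all([utf8_resume, latin1_cafe]).join(" ")
puts joined                 # >> Résumé café
puts joined.encoding.name   # >> UTF-8
```

Once everything shares an Encoding, comparisons and concatenation behave the way you would expect without having to remember any special cases.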
One thing that may help a little in normalizing your data is Ruby's concept of compatible Encodings. Here's an example of checking and taking advantage of compatible Encodings:
# data in two different Encodings
p ascii_my # >> "My "
puts ascii_my.encoding.name # >> US-ASCII
p utf8_resume # >> "Résumé"
puts utf8_resume.encoding.name # >> UTF-8
# check compatibility
p Encoding.compatible?(ascii_my, utf8_resume) # >> #<Encoding:UTF-8>
# combine compatible data
my_resume = ascii_my + utf8_resume
p my_resume # >> "My Résumé"
puts my_resume.encoding.name # >> UTF-8
In this example I had data in two different Encodings, US-ASCII and UTF-8. I asked Ruby if the two pieces of data were compatible?(). Ruby can respond to that question in one of two ways. If it returns nil, the data is not compatible and you will probably need to transcode at least one piece of it to work with the other. If an Encoding is returned, the data is compatible and can be concatenated resulting in data with the returned Encoding. You can see how that played out when I combined these Strings.
This feature is probably most useful for what I've shown right here, combining ASCII with a bigger Encoding. More complicated scenarios are going to require some transcoding.
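Here is a small sketch of both outcomes of that check. The compatibility() helper and the sample Strings are assumptions made up for this example. The helper treats any falsy result from Encoding.compatible?() as incompatible.

```ruby
# compatibility() is a helper invented for this sketch. It returns the
# name of the shared Encoding, or "incompatible" when Ruby finds none.
def compatibility(a, b)
  encoding = Encoding.compatible?(a, b)
  encoding ? encoding.name : "incompatible"
end

# Hypothetical sample data.
ascii_my      = "My ".encode("US-ASCII")
utf8_resume   = "R\u00E9sum\u00E9"
latin1_resume = utf8_resume.encode("ISO-8859-1")

puts compatibility(ascii_my, utf8_resume)        # >> UTF-8
puts compatibility(ascii_my, latin1_resume)      # >> ISO-8859-1
puts compatibility(utf8_resume, latin1_resume)   # >> incompatible
```

The last pair is the kind of scenario where you would transcode one side with encode() before combining the data.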
Explicit Iteration
In Ruby 1.8, String's each() method iterated over lines of data. I imagine that was done because it's a common way to process data, but the question is what made lines the correct choice? What about iterating by bytes or characters? You could iterate by bytes in Ruby 1.8 using each_byte(), but you needed to resort to Regexp tricks to get characters.
In the Ruby 1.9 realm of all encoded data, blessing one type of iteration just doesn't make sense. Instead, each() has been removed from String and it is no longer Enumerable. This is probably one of the biggest changes to the core API that code will need to adapt to.
Take heart though, String iteration is not gone. Instead, you now just need to be explicit about what you want to iterate over and you have several choices:
utf8_resume.each_byte do |byte|
  puts byte
end
# >> 82
# >> 195
# >> 169
# >> 115
# >> 117
# >> 109
# >> 195
# >> 169
utf8_resume.each_char do |char|
  puts char
end
# >> R
# >> é
# >> s
# >> u
# >> m
# >> é
utf8_resume.each_codepoint do |codepoint|
  puts codepoint
end
# >> 82
# >> 233
# >> 115
# >> 117
# >> 109
# >> 233
utf8_resume.each_line do |line|
  puts line
end
# >> Résumé
Similarly, you can ask for an Enumerator for each type when you want to use a different iterator than just each(). The standard method of just not passing a block to get an Enumerator works on the methods above, but there are also methods just for this purpose:
p utf8_resume.bytes.first(3)
# >> [82, 195, 169]
p utf8_resume.chars.find { |char| char.bytesize > 1 }
# >> "é"
p utf8_resume.codepoints.to_a
# >> [82, 233, 115, 117, 109, 233]
p utf8_resume.lines.map { |line| line.reverse }
# >> ["émuséR"]
[Update: bytes(), chars(), etc. were changed to return Arrays instead of Enumerators in Ruby 2.0 and up.]
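If you want an Enumerator no matter which of those versions you are on, the block-less form of the each_*() methods is one option. Here is a sketch of that, using a helper name (multibyte_positions()) that I made up for the example:

```ruby
# multibyte_positions() is a helper invented for this sketch. Calling
# each_char() without a block gives an Enumerator we can chain onto.
def multibyte_positions(string)
  string.each_char.with_index
        .select { |char, _index| char.bytesize > 1 }
        .map    { |_char, index| index }
end

p multibyte_positions("R\u00E9sum\u00E9")   # >> [1, 5]
```

The result lists the character indices that need more than one byte, which for this data are the two é characters.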
I think we'll find in the long run that this change is definitely for the better. I feel it makes the code more self-documenting. That's always a good thing, in my opinion.
The trickiest part about losing each() is when you need your code to run on both Ruby 1.8 and 1.9. When that's the case, you can either add a method to String in Ruby 1.8:
if RUBY_VERSION < "1.9"
  require "enumerator"
  class String
    def lines
      enum_for(:each)
    end
  end
end
or use a simple trick like:
str.send(str.respond_to?(:lines) ? :lines : :to_s).each do |line|
  # ...
end
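For plain line-by-line iteration, another option may be to call each_line() directly. This sketch assumes each_line() is available on String in both versions (the 1.8 String documentation lists it), and collect_lines() is just a helper name I invented for the example:

```ruby
# collect_lines() is a helper invented for this sketch. It relies only
# on each_line(), which is assumed to exist in both Ruby 1.8 and 1.9.
def collect_lines(str)
  lines = []
  str.each_line { |line| lines << line.chomp }
  lines
end

p collect_lines("one\ntwo\n")   # >> ["one", "two"]
```

This only covers simple each() style loops. If you need the other Enumerable iterators on Ruby 1.8, you would still want one of the approaches above.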
Comments (15)
-
Axel Niedenhoff April 2nd, 2009
I have just checked the docs for the String class in 1.8, because I thought it defines an each_line method (as an equivalent for the each method). And it does! So maybe the trick to make code work on both 1.8 and 1.9 is just to use each_line instead of each?
-
That's a good point. Using each_line() may help a little with simple each() iteration cases. Of course, if you need any of the other iterators Enumerable provides to String in Ruby 1.8 you'll need something like the hacks I mentioned.
-
-
It's probably worth mentioning that it is possible for a transcoding operation to fail with an error. For example:
$ cat transcode.rb
# encoding: UTF-8
utf8 = "Résumé…"
latin1 = utf8.encode("ISO-8859-1")
$ ruby transcode.rb
transcode.rb:3:in `encode': "\xE2\x80\xA6" from UTF-8 to ISO-8859-1 (Encoding::UndefinedConversionError)
    from transcode.rb:3:in `<main>'
Naturally this fails because "…" is not a valid character in Latin-1. I've shown ways to handle this using iconv in the past and those still work just fine in Ruby 1.9. The new Ruby does include some simple translation options though and we can use those to do some crude translation:
# encoding: UTF-8
utf8 = "Résumé…"
latin1 = utf8.encode("ISO-8859-1", undef: :replace)
puts latin1 # >> Résumé?
As you can see, I just asked for undefined characters in the target Encoding to be replaced here, for which Ruby used a "?". You can set the :replace key to any String you would prefer though. You can also set :invalid to :replace to swap out invalid characters in the original Encoding as the transcoding occurs. Finally there are some utility options like :universal_newline, which will transcode "\r\n" and "\r" to "\n" when set to true.
-
Wow, this was a huge help. I have been beating my head against my desk trying to figure out how to properly encode a string. Kept getting ASCII 8-Bit to UTF-8 errors, but using encode! with undef and replace in the options hash worked like a charm. Thanks!
-
-
Ruby has another interesting exception when it comes to Encoding incompatibility that may be worth mentioning. You can't typically add Strings with incompatible Encodings as this shows:
$ cat incompatible.rb
# encoding: UTF-8
utf8 = "一"
sjis = "二".encode("Shift_JIS")
puts "#{utf8.encoding} + #{sjis.encoding} ="
utf8 + sjis
$ ruby incompatible.rb
UTF-8 + Shift_JIS =
incompatible.rb:5:in `<main>': incompatible character encodings: UTF-8 and Shift_JIS (Encoding::CompatibilityError)
However, Ruby does keep an eye on String content (mainly for optimization purposes) and when both Strings contain only 7-bit ASCII, an exception will be made:
$ cat ascii.rb
# encoding: UTF-8
utf8 = "abc"
sjis = "def".encode("Shift_JIS")
print "Given all ASCII data: " if [utf8, sjis].all?(&:ascii_only?)
print "#{utf8.encoding} + #{sjis.encoding} = "
result = utf8 + sjis
puts result.encoding
$ ruby ascii.rb
Given all ASCII data: UTF-8 + Shift_JIS = UTF-8
There are a few points of interest in this little example. First, note the ascii_only?() method to check for these special-cased Strings. Next, notice that Ruby did do the concatenation, even though these are not compatible Encodings. Finally, the result had an Encoding of UTF-8, simply because that was the Encoding of the first (leftmost) String. It would have been Shift_JIS had I reversed them.
I still don't really recommend relying on these special behaviors though. I believe you will encounter fewer problems if you stick to my advice of normalizing String Encodings before working with mixed data.
-
Thank you for this amazingly helpful series!
I was curious about your comment regarding an "interesting exception when it comes to Encoding incompatibility" when using only 7-bit ASCII. Is this actually an exception? The impression I got was that the .compatible? method would compare string content and find out if one of the encodings could (or if there is an encoding that could) potentially contain all of the characters in both strings. Or does it simply compare the encodings against a table of 100% compatible encodings, and return the result?
In the former case, the fact that both strings contain 7-bit ASCII is just a coincidence, not a special exception, and what really matters is the fact that all the characters in both strings can be encoded in UTF-8, so compatible? returns UTF-8, and adding strings, therefore, uses UTF-8... however I have not tried this or figured out how to set up an experiment to see if other cases work the same way.
Do you know which is the case?
-
I'm pretty sure Ruby does not compare contents. That could get pretty inefficient on large Strings.
I believe the case is that it compares encodings, with the special exception for 7-bit ASCII Strings.
-
-
-
Thanks for this guy -- I didn't read the whole series but boy am I glad it's here...I'm new to ruby and rails and just ran into an encoding error on html pages done in iso-8859-1. You did a good job of demonstrating the important points.
Thanks again, keep up the good work!
-
Hi,
I'm trying to render an image from mysql using send_data and I'm getting this error :
invalid byte sequence in UTF-8
Here is my code:
def get_photo
  @image_data = Photo.find(params[:id])
  @image = @image_data.binary_data
  @url = @image_data.url
  send_data(@image, :type => 'image/jpeg', :filename => "#{params[:id]}.jpg", :disposition => 'inline')
end
BTW, I'm using ruby 1.9.1 with rails 2.3.5.
-
Without knowing where the error comes from, the only thing I can see that might be an issue is if params[:id] contained non-UTF-8 bytes. This question is probably better asked on the Rails mailing list though.
-
-
Thank you for this note. The method .each still appears in the core documentation with its alias, .each_line: http://www.ruby-doc.org/core-1.8.6/String.html#method-i-each
This is confusing because .each seems to be what I want when I want "each word delimited by a specific character," however, changing to .each_line did work.
-
Hmm I tried transcoding the following to UTF-8 and it still fails:
I like to go to the store…
You can see the M$ fancy ellipsis char at the end. I always get this error even if I try to transcode:
incompatible character encodings: ASCII-8BIT and UTF-8
-
I assume the content is already proper UTF-8 bytes and you really just need to force_encoding("UTF-8"). Hope that helps.
-
-
Great info, thanx for ur hard work ;)
-
Thanks James, this was immensely useful. I'd spent several hours trying to get rid of the <?> I was getting in my markup. Using the force_encoding("UTF-8") was all I needed.