Ruby 1.9's Three Default Encodings
I suspect early contact with the new m17n (multilingualization) engine is going to come to Rubyists in the form of this error message:
invalid multibyte char (US-ASCII)
Ruby 1.8 didn't care what you stuck in a random
String literal, but 1.9 is a touch pickier. I think you'll see that the change is for the better, but we do need to spend some time learning to play by Ruby's new rules.
That takes us to the first of Ruby's three default Encodings.
The Source Encoding
In Ruby's new grown up world of all encoded data, each and every
String needs an
Encoding. That means an
Encoding must be selected for a
String as soon as it is created. One way that a
String can be created is for Ruby to execute some code with a
String literal in it, like this:
str = "A new String"
That's a pretty simple
String, but what if I use a literal like the following instead?
str = "Résumé"
What Encoding is that in? That fundamental question is probably the main reason we all struggle a bit with character encodings. You can't tell just from looking at that data what
Encoding it is in. Now, if I showed you the bytes you may be able to make an educated guess, but the data just isn't wearing an
Encoding name tag.
That's true of a frightening lot of data we deal with every day. A plain text file doesn't generally say what
Encoding the data inside is in. When you think about that, it's a miracle we can successfully read a lot of things.
When we're talking about program code, the problem gets worse. I may want to write my code in UTF-8, but some Japanese programmer may want to write his code in Shift JIS. Ruby should support that and, in fact, 1.9 does. Let's complicate things a bit more though: imagine that I bundle up that UTF-8 code I wrote in a gem and the Japanese programmer later uses it to help with his Shift JIS code. How do we make that work seamlessly?
The Ruby 1.8 strategy of one global variable won't survive a test like this, so it was time to switch strategies. Ruby 1.9's answer to this problem is the source Encoding.
All Ruby source code now has some
Encoding. When you create a
String literal in your code, it is assigned the
Encoding of your source. That simple rule solves all the problems I just described pretty nicely. As long as my source
Encoding is UTF-8 and the Japanese programmer's source
Encoding is Shift JIS, my literals will work as I expect and his will work as he expects. Obviously if we share any data, we will need to establish some rules about our shared formats using documentation or code that can adapt to different
Encodings, but we should have been doing that all along anyway.
Thus the only question becomes, what's my source
Encoding and how do I change it?
There are a few different ways Ruby can select a source
Encoding. Here are the options:
$ cat no_encoding.rb
p __ENCODING__
$ ruby no_encoding.rb
#<Encoding:US-ASCII>
$ cat magic_comment.rb
# encoding: UTF-8
p __ENCODING__
$ ruby magic_comment.rb
#<Encoding:UTF-8>
$ cat magic_comment2.rb
#!/usr/bin/env ruby -w
# encoding: UTF-8
p __ENCODING__
$ ruby magic_comment2.rb
#<Encoding:UTF-8>
$ echo $LC_CTYPE
en_US.UTF-8
$ ruby -e 'p __ENCODING__'
#<Encoding:UTF-8>
$ ruby -KU no_encoding.rb
#<Encoding:UTF-8>
The first example shows us two important things. The first is the main rule of source
Encodings: source files receive a US-ASCII
Encoding, unless you say otherwise. [Update: this was changed to UTF-8 in Ruby 2.0 and up.] This is where I expect programmers to run into the error I mentioned earlier. If you place any non-ASCII content in a
String literal without changing the source
Encoding, Ruby will die with that error. Thus you need to change the source
Encoding to work with any non-ASCII data. The second thing we see here is the new
__ENCODING__ keyword that can be used to get the source
Encoding that's active where it is executed.
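If you want to see that failure without touching your own files, here is a small sketch. It writes a throwaway file that declares a US-ASCII source Encoding, puts a non-ASCII literal in it, and loads it. The helper and file names are made up for illustration, and the exact wording of the message can differ between Ruby versions, so the code just prints whatever exception comes back.

```ruby
require "tmpdir"

# Hypothetical helper: returns the exception raised while loading a file
# that declares a US-ASCII source Encoding but holds a non-ASCII literal.
def load_non_ascii_literal_as_us_ascii
  Dir.mktmpdir do |dir|
    path = File.join(dir, "resume.rb")
    File.write(path, "# encoding: US-ASCII\nstr = \"Résumé\"\n")
    begin
      load path
      nil                      # no error: the literal was accepted
    rescue SyntaxError => error
      error                    # the parser rejected the literal
    end
  end
end

LOAD_ERROR = load_non_ascii_literal_as_us_ascii
p LOAD_ERROR.class
puts LOAD_ERROR.message if LOAD_ERROR
```

Changing the first line of the generated file to a UTF-8 magic comment should make the load succeed.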
The second example shows the preferred way to set your source
Encoding and it's called a magic comment. If the first line of your code is a comment that includes the word
coding, followed by a colon and space, and then an
Encoding name, the source
Encoding for that file is changed to the indicated
Encoding. If your code has a shebang line, the magic comment must come on the second line, with no spacing between them. Once set, all
String literals you create in that file will have that
Encoding attached to them.
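As a quick check of that rule, the sketch below compares a literal's Encoding with the source Encoding reported by __ENCODING__. The constant name is just for illustration, and the snippet assumes it is saved as its own file so the magic comment really is the first line.

```ruby
# encoding: UTF-8
# (This comment only counts when it is the first line of the file, or the
# second line after a shebang.)

RESUME = "Résumé"

p __ENCODING__                       # the source Encoding in effect here
p RESUME.encoding == __ENCODING__    # the literal picked up that Encoding
p [RESUME.length, RESUME.bytesize]   # characters versus bytes
```

The character and byte counts differ because each accented letter takes more than one byte in UTF-8.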
The third example shows an exception to the rule for your convenience. When you feed Ruby some code on the command-line using the
-e switch, it gets a source
Encoding from your environment. I have UTF-8 set in the
LC_CTYPE environment variable, but some people also use the
LANG variable for this. This makes scripting easier since Ruby will (hopefully) match the
Encoding of any other commands you chain together.
The fourth example is another interesting exception to the rule. Ruby 1.9 still supports the
-K* style switches from Ruby 1.8 including the
-KU switch I've recommended so heavily in this series. These switches have a couple of effects, but of particular note they are the only non-magic comment way to modify the source
Encoding. This is good news for backwards compatibility, because some Ruby 1.8 code may be able to run on Ruby 1.9 without
Encoding problems thanks to this. I must stress that this is just for backwards compatibility though, and magic comments are the future.
With magic comments the code will include its
Encoding data. It will probably seem a little tedious to add them to all your source files at first, but it's really not that big of a change. In the past, I've recommended we stick the following shebang line at the top of our files:
#!/usr/bin/env ruby -wKU
Now, for Ruby 1.9, I'm recommending we switch to something like this:
#!/usr/bin/env ruby -w
# encoding: UTF-8
Note that the magic comment format rules are pretty loose and all of the following examples would work the same:
# encoding: UTF-8
# coding: UTF-8
# -*- coding: UTF-8 -*-
This is nice for support in some text editors that also read such comments.
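Here is a rough way to confirm that the three spellings behave the same. The sketch generates one tiny file per comment style, loads each, and records the __ENCODING__ it reports. I use Shift_JIS instead of UTF-8 only so the result can't be mistaken for a default; the constant and file names are my own inventions.

```ruby
require "tmpdir"

# The three comment styles, all naming the same Encoding.
MAGIC_COMMENTS = [
  "# encoding: Shift_JIS",
  "# coding: Shift_JIS",
  "# -*- coding: Shift_JIS -*-"
]

SEEN = []  # each generated file appends its own source Encoding here

Dir.mktmpdir do |dir|
  MAGIC_COMMENTS.each_with_index do |comment, i|
    path = File.join(dir, "magic_#{i}.rb")
    File.write(path, "#{comment}\nSEEN << __ENCODING__\n")
    load path
  end
end

p SEEN
```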
If we all get into that habit of adding magic comments, our code can work together regardless of the various
Encodings we personally favor. Ruby will know how to handle each separate file. As an added bonus, we programmers also get to see these comments and know more about the code we are working with. That makes it a good habit to get into, I think.
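To make the gem scenario from earlier concrete, this sketch fakes two "libraries" with different source Encodings and loads both into one program. The file names, constant names, and greeting are all hypothetical. The Shift JIS file is assembled from raw bytes so that its literal really is stored in Shift JIS on disk.

```ruby
require "tmpdir"

Dir.mktmpdir do |dir|
  utf8_lib = File.join(dir, "utf8_lib.rb")
  sjis_lib = File.join(dir, "sjis_lib.rb")
  greeting = "こんにちは"

  # One file written in UTF-8...
  File.binwrite(utf8_lib,
                "# encoding: UTF-8\nUTF8_GREETING = \"#{greeting}\"\n")

  # ...and one whose bytes on disk are Shift JIS.
  File.binwrite(sjis_lib,
                "# encoding: Shift_JIS\nSJIS_GREETING = \"".b +
                greeting.encode("Shift_JIS").b +
                "\"\n".b)

  load utf8_lib
  load sjis_lib
end

p UTF8_GREETING.encoding
p SJIS_GREETING.encoding
# Same text, different bytes: transcode before comparing shared data.
p SJIS_GREETING.encode("UTF-8") == UTF8_GREETING
```

Each literal keeps the Encoding of the file it came from, which is why shared data still needs an agreed format.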
The Default External and Internal Encodings
There's another way
Strings are commonly created and that's by reading from some
IO object. It doesn't make sense to give those
Strings the source
Encoding because the external data doesn't have to be related to your source code. Also, you really need to know how data is encoded to read it correctly. Even a simple concept like reading the next line of data changes if you are talking about UTF-8 or UTF-16LE (the LE stands for a Little Endian byte order) data. Thus, it makes sense for
IO objects to have at least one
Encoding attached to them. Ruby 1.9 is generous and gives them two: the external
Encoding and the internal Encoding.
The external Encoding is the
Encoding the data is in inside the
IO object. That affects how data will be read and this is the
Encoding data will be returned in as long as the internal
Encoding isn't set (more on that in a bit). Let's look at an example of how this plays out in practice:
$ cat show_external.rb
open(__FILE__, "r:UTF-8") do |file|
  puts file.external_encoding.name
  p file.internal_encoding
  file.each do |line|
    p [line.encoding.name, line]
  end
end
$ ruby show_external.rb
UTF-8
nil
["UTF-8", "open(__FILE__, \"r:UTF-8\") do |file|\n"]
["UTF-8", "  puts file.external_encoding.name\n"]
["UTF-8", "  p file.internal_encoding\n"]
["UTF-8", "  file.each do |line|\n"]
["UTF-8", "    p [line.encoding.name, line]\n"]
["UTF-8", "  end\n"]
["UTF-8", "end\n"]
There are four things to notice in this example:
- I set the external Encoding by tacking :UTF-8 onto the end of my mode String when I opened the File
- You can use external_encoding() to check the external Encoding as I have here
- internal_encoding() works the same for the internal Encoding, which will be nil unless you explicitly set it
- Note how each String created as I read the data is given the external Encoding
The internal Encoding just adds one more twist. When set, data will still be read in the external
Encoding, but transcoded to the internal
Encoding as the
String is created. It's a convenience for you as the programmer. Watch how that changes things:
$ cat show_internal.rb
open(__FILE__, "r:UTF-8:UTF-16LE") do |file|
  puts file.external_encoding.name
  puts file.internal_encoding.name
  file.each do |line|
    p [line.encoding.name, line[0..3]]
  end
end
$ ruby show_internal.rb
UTF-8
UTF-16LE
["UTF-16LE", "o\x00p\x00e\x00n\x00"]
["UTF-16LE", " \x00 \x00p\x00u\x00"]
["UTF-16LE", " \x00 \x00p\x00u\x00"]
["UTF-16LE", " \x00 \x00f\x00i\x00"]
["UTF-16LE", " \x00 \x00 \x00 \x00"]
["UTF-16LE", " \x00 \x00e\x00n\x00"]
["UTF-16LE", "e\x00n\x00d\x00\n\x00"]
There are a couple of differences here:
- A second added Encoding on the mode String (:UTF-16LE in this example) sets the internal_encoding(), as I show with the second puts
- This little change gets Ruby to translate all of the data for me (I just shortened the output because UTF-16LE is noisy)
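To see why the external Encoding matters so much on reads, here is a sketch that reads the same bytes three ways: as UTF-8, as ISO-8859-1, and as UTF-8 transcoded to UTF-16LE. The read_report() helper, the constants, and the file name are made up for this illustration.

```ruby
require "tmpdir"

SAMPLE = "Résumé"

# Hypothetical helper: reads a whole file with the given mode String and
# reports the Encoding name and character count of the result.
def read_report(path, mode)
  File.open(path, mode) do |file|
    data = file.read
    [data.encoding.name, data.length]
  end
end

REPORTS = Dir.mktmpdir do |dir|
  path = File.join(dir, "resume.txt")
  File.binwrite(path, SAMPLE.encode("UTF-8"))  # the file holds UTF-8 bytes

  {
    utf8:       read_report(path, "r:UTF-8"),           # the right guess
    latin1:     read_report(path, "r:ISO-8859-1"),      # the wrong guess
    transcoded: read_report(path, "r:UTF-8:UTF-16LE")   # read, then convert
  }
end

REPORTS.each { |label, report| p [label, report] }
```

The wrong guess doesn't raise an error. It just counts every byte as a character, which is exactly the kind of quiet damage the external Encoding is there to prevent.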
The external Encoding works the same when writing. It still represents the
Encoding in the
IO object, or the
Encoding data is going to. However, you don't need to specify an internal
Encoding when writing. Ruby will automatically use the
Encoding of a
String you output as the internal
Encoding and transcode as needed to reach the external
Encoding. For example:
$ cat write_internal.rb
# encoding: UTF-8
open("data.txt", "w:UTF-16LE") do |file|
  puts file.external_encoding.name
  p file.internal_encoding
  data = "My data…"
  p [data.encoding.name, data]
  file << data
end
p File.read("data.txt")
$ ruby write_internal.rb
UTF-16LE
nil
["UTF-8", "My data…"]
"M\x00y\x00 \x00d\x00a\x00t\x00a\x00& "
Note how my data was transcoded before it was written even though the internal Encoding was nil. Ruby used the String's Encoding to decide what was needed.
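A round trip makes the same point from the other direction. This sketch writes a UTF-8 String through a UTF-16LE external Encoding, looks at the raw bytes, and then reads the file back with UTF-8 as the internal Encoding. The constant names and the scratch file are assumptions of mine.

```ruby
# encoding: UTF-8
require "tmpdir"

MESSAGE = "My data…"   # a UTF-8 String, thanks to the source Encoding

RAW_BYTES, READ_BACK = Dir.mktmpdir do |dir|
  path = File.join(dir, "data.txt")

  # Transcoded to UTF-16LE on the way out.
  File.open(path, "w:UTF-16LE") { |file| file << MESSAGE }

  raw  = File.binread(path)                                        # untouched bytes
  back = File.open(path, "r:UTF-16LE:UTF-8") { |file| file.read }  # transcoded back
  [raw, back]
end

p [MESSAGE.bytesize, RAW_BYTES.bytesize]
p [READ_BACK.encoding.name, READ_BACK == MESSAGE]
```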
Both of those
Encodings should be pretty straightforward. The only question left about them is: what happens if you don't set them? The answer is that the
IO inherits the default external
Encoding and/or the default internal
Encoding whenever one isn't set. Now we need to know how Ruby chooses those defaults.
The default external
Encoding is pulled from your environment, much like the source
Encoding is for code given on the command-line. Have a look:
$ echo $LC_CTYPE
en_US.UTF-8
$ ruby -e 'puts Encoding.default_external.name'
UTF-8
$ LC_CTYPE=ja_JP.sjis ruby -e 'puts Encoding.default_external.name'
Shift_JIS
The default internal
Encoding is simply
nil. You must actively change it to get anything else.
Both default Encodings have a global setter: Encoding.default_external=() and Encoding.default_internal=(). You can set them to an Encoding object or just the String name of an Encoding.
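Here is a sketch of those setters at work. It swaps both defaults for the length of a block, opens a file without naming any Encodings, and shows that the IO object picked up the defaults. The with_default_encodings() helper is hypothetical, not part of Ruby. It restores the old values afterward and silences the warning Ruby may print in verbose mode when these globals change.

```ruby
require "tmpdir"

# Hypothetical helper: temporarily replaces the global default Encodings.
def with_default_encodings(external, internal)
  old_external = Encoding.default_external
  old_internal = Encoding.default_internal
  old_verbose, $VERBOSE = $VERBOSE, nil   # quiet the verbose-mode warning
  Encoding.default_external = external    # a String name works...
  Encoding.default_internal = internal    # ...and so does an Encoding object
  yield
ensure
  Encoding.default_external = old_external
  Encoding.default_internal = old_internal
  $VERBOSE = old_verbose
end

INHERITED = Dir.mktmpdir do |dir|
  path = File.join(dir, "sjis.txt")
  File.binwrite(path, "日本語".encode("Shift_JIS"))

  with_default_encodings("Shift_JIS", Encoding::UTF_8) do
    File.open(path) do |file|             # no Encodings in the mode String
      [file.external_encoding, file.internal_encoding, file.read]
    end
  end
end

p INHERITED
```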
You can also change these default
Encodings using some command-line switches. The new
-E switch can be used to set one or both of the Encodings:
$ ruby -e 'p [Encoding.default_external, Encoding.default_internal]'
[#<Encoding:UTF-8>, nil]
$ ruby -E Shift_JIS \
> -e 'p [Encoding.default_external, Encoding.default_internal]'
[#<Encoding:Shift_JIS>, nil]
$ ruby -E :UTF-16LE \
> -e 'p [Encoding.default_external, Encoding.default_internal]'
[#<Encoding:UTF-8>, #<Encoding:UTF-16LE>]
$ ruby -E Shift_JIS:UTF-16LE \
> -e 'p [Encoding.default_external, Encoding.default_internal]'
[#<Encoding:Shift_JIS>, #<Encoding:UTF-16LE>]
As you can see, the argument for this switch is just like what you would append to a mode
String in a call to open().
There's one more command-line switch shortcut for those of us who prefer to just use UTF-8 everywhere. The new
-U switch sets
Encoding.default_internal() to UTF-8. Using that, you can just set the external
Encoding for your
IO objects, or let it default from your environment, and all
Strings you read will be transcoded to the preferred UTF-8.
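If you would rather check these switches from a script than from the shell, something like the following sketch can do it. It launches a fresh Ruby for each set of switches and reads back the default Encoding names. The default_encodings_with() helper is my own invention, and the results for anything you leave unset will depend on your environment.

```ruby
require "rbconfig"

# Hypothetical helper: runs a fresh Ruby with the given switches and returns
# the names of its default external and internal Encodings ("nil" if unset).
def default_encodings_with(*switches)
  script = "puts [Encoding.default_external, Encoding.default_internal]" \
           ".map { |e| e ? e.name : 'nil' }"
  IO.popen([RbConfig.ruby, *switches, "-e", script], &:read).split("\n")
end

p default_encodings_with("-E", "Shift_JIS:UTF-8")  # both set explicitly
p default_encodings_with("-E", "Shift_JIS")        # external only
p default_encodings_with("-U")                     # internal forced to UTF-8
```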
Probably the most important thing to note about Encoding.default_external() and
Encoding.default_internal() is that you should really just treat them as shortcuts for your own scripting. Pulling
Encodings from the environment or command-line switches can be handy when you're in control of where the code runs, but you're going to need to be more explicit for code you intend for others to run. When in doubt, set the external and internal
Encodings the way you want them for each
IO object. It's a bit more tedious, but also safer in that it won't mysteriously be changed by some outside force. Also remember that the defaults are global settings affecting all loaded code, including any libraries you
require(). That can be a boon or bane, so just remember to factor it into your thinking when you're wondering, "Where does this
String get its Encoding?"
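For code you plan to share, the explicit form might look like this sketch. Both Encodings are named right in the mode String, so the result should not depend on the environment or on any global default. The file name, contents, and constant are only illustrative.

```ruby
require "tmpdir"

EXPLICIT = Dir.mktmpdir do |dir|
  path = File.join(dir, "notes.txt")
  File.binwrite(path, "café".encode("ISO-8859-1"))   # Latin-1 bytes on disk

  # External and internal Encodings are both spelled out for this one IO.
  File.open(path, "r:ISO-8859-1:UTF-8") do |file|
    text = file.read
    [file.external_encoding.name, file.internal_encoding.name,
     text.encoding.name, text]
  end
end

p EXPLICIT
```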