
What Ruby 1.9 Gives Us

2014-04-18T19:27:32Z

In this final post of the series, I want to revisit our earlier discussion on encoding strategies. Ruby 1.9 adds a lot of power to the handling of character encodings as you have now seen. We should talk a little about how that can change the game.

UTF-8 is Still King

The most important thing to take note of is what hasn't changed with Ruby 1.9. I said a good while back that the best Encoding for general use is UTF-8. That's still very true.

I still strongly recommend that we favor UTF-8 as the one-size-almost-fits-all Encoding. I really believe that we can and should use it exclusively inside our code, transcode data to it on the way in, and transcode output when we absolutely must. The more of us that do this, the better things will get.

As we've discussed earlier in the series, Ruby 1.9 does add some new features that help our UTF-8 only strategies. For example, you could use the Encoding command-line switches (-E and -U) to set up automatic transcoding for all input you read. These shortcuts are great for simple scripting, but I recommend being explicit about your Encodings in any serious code.

New Rules

Ruby 1.9 gives us a whole new world of power to work with data as we see fit. As is usually the case though, our new powers come with new responsibilities. Start building your good Ruby 1.9 habits today:

Add the magic comment to the top of all source files
Explicitly declare the Encodings for each IO object when you open() it

Yes, this adds a little extra work, but the effort is worth it. Be disciplined in your awareness of Encodings and help Ruby know the right way to treat your data.
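Here is a minimal sketch of both habits working together. The read_as_utf8() helper and the temporary file are my own inventions for the demonstration, not part of any library. The mode String names the Encoding the data is stored in first, then the Encoding I want to work with:

```ruby
# encoding: UTF-8
require "tempfile"

# Hypothetical helper: state the external Encoding of the data and ask
# Ruby to transcode it to UTF-8 as it is read.
def read_as_utf8(path, external)
  File.open(path, "r:#{external}:UTF-8") { |f| f.read }
end

Tempfile.create("latin1") do |tmp|
  tmp.close
  # write the sample data out as ISO-8859-1 bytes
  File.open(tmp.path, "w:ISO-8859-1") { |f| f.write("R\u00E9sum\u00E9") }

  str = read_as_utf8(tmp.path, "ISO-8859-1")
  p [str.encoding, str, str.bytesize]
end
```

The data sits on disk as six Latin-1 bytes, but the String handed back to the program is UTF-8, so the rest of the code never has to think about Latin-1.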

New Strategies

While UTF-8 is a great single choice, Ruby 1.9 gives us some exciting new options for character handling. I'll give just one example here, to get you thinking in the right direction, but the sky's the limit now and I'm sure we'll see some neat uses of the new system in the coming years.

When I converted the FasterCSV code to be the standard CSV library in Ruby 1.9, I really sat down and thought out how m17n should be handled. Here are some thoughts that led to my final plan:

We tend to throw pretty big data at CSV parsers. We often use them for database dumps, for example.
I expected to pay a performance penalty for constantly transcoding all incoming data to UTF-8. I'm not sure how big it would have been, but it's certainly more work than Ruby 1.8 does just reading some bytes. Naturally, I wanted the library to stay as quick as possible.
Since the parser has always been able to read directly from any IO object, those who wanted UTF-8 transcoding already had a way to get it.
CSV is a super simple format to parse, requiring only four standard characters that you can count on having in any Encoding Ruby supports.
Finally, I just wanted to take the m17n features for a spin, of course!

All of this combined to form my strategy for the CSV library: don't transcode the data, transcode the parser instead.

If you transcode the data, you pay a penalty at every read. However, transcoding the parser is just a one-time upfront cost. The characters will be available in whatever format the data is in and once the parser is transcoded we can just read and parse the data normally. The fields returned won't have gone through a conversion, unless the user code explicitly sets that up. This seems to give everyone the choice to have their data the way they want it.

This process isn't too tough to realize, though it does get a bit tedious in places. The first step is just to figure out what Encoding the data is actually in. Here's the code from 1.9's CSV library that does that:

@encoding =   if @io.respond_to? :internal_encoding
                @io.internal_encoding || @io.external_encoding
              elsif @io.is_a? StringIO
                @io.string.encoding
              end
@encoding ||= Encoding.default_internal || Encoding.default_external

That code just makes sure I set @encoding to the Encoding I'm actually going to be working with after all reads. If an internal_encoding() is set on an IO, it will be transcoded into that and that's what I will be facing. Otherwise, the external_encoding() is what we will see. The code can also parse from a String directly by wrapping it in a StringIO object. When it does that, we can just ask the underlying String what the Encoding for the data is. If we can't find an Encoding, likely because it hasn't been set, we'll use the defaults because that's what Ruby is going to assume as well.
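To see that lookup order in action outside of CSV, here is a small sketch. The working_encoding() method name is hypothetical. It just wraps the same checks so we can try them on a StringIO and on a File opened with an internal Encoding:

```ruby
require "stringio"
require "tempfile"

# Hypothetical wrapper around the detection order described above.
def working_encoding(io)
  encoding = if io.respond_to?(:internal_encoding)
               io.internal_encoding || io.external_encoding
             elsif io.is_a?(StringIO)
               io.string.encoding
             end
  encoding || Encoding.default_internal || Encoding.default_external
end

# a String wrapped in a StringIO reports the String's own Encoding
p working_encoding(StringIO.new("caf\u00E9".encode("ISO-8859-1")))

Tempfile.create("sample") do |tmp|
  tmp.close
  File.open(tmp.path, "r:ISO-8859-1:UTF-8") do |f|
    # the internal Encoding wins when one is set
    p working_encoding(f)
  end
end
```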

Once we have the Encoding, we need a couple of helper methods that will build String and Regexp objects in that Encoding for us. Here are those simple methods:

def encode_str(*chunks)
  chunks.map { |chunk| chunk.encode(@encoding.name) }.join
end

def encode_re(*chunks)
  Regexp.new(encode_str(*chunks))
end

Those should be super straightforward if you've read my earlier discussion of how transcoding works. You can pass encode_str() one or more String arguments and it will transcode each one, then join() them into a complete whole. The encode_re() just wraps encode_str() since Regexp.new() will correctly set the Encoding by the Encoding of the passed String.
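As a quick illustration, here are the same two helpers dropped into a tiny stand-in class. TinyParser is hypothetical and only exists so @encoding has somewhere to live. I've assumed ISO-8859-1 data for the example:

```ruby
# A hypothetical stand-in for the parser, holding just the two helpers.
class TinyParser
  def initialize(encoding)
    @encoding = encoding
  end

  def encode_str(*chunks)
    chunks.map { |chunk| chunk.encode(@encoding.name) }.join
  end

  def encode_re(*chunks)
    Regexp.new(encode_str(*chunks))
  end
end

parser = TinyParser.new(Encoding::ISO_8859_1)
sep    = parser.encode_str(",")
re     = parser.encode_re("\\A", "caf\u00E9", ",")
data   = "caf\u00E9,au lait".encode("ISO-8859-1")

p sep.encoding  # the separator now lives in the data's Encoding
p re.encoding   # and so does the Regexp, thanks to the non-ASCII content
p data =~ re    # no transcoding of the data was needed to match
```

Note that a Regexp built from pure seven bit ASCII content would still be simplified to US-ASCII by Ruby, which is harmless here since such an expression can be matched against any ASCII compatible?() data.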

Now for the tedious step. You have to completely avoid using bare String or Regexp literals for anything that will eventually interact with the raw data. For example, here is the code CSV uses to prepare the parser before it begins reading:

# Pre-compiles parsers and stores them by name for access during reads.
def init_parsers(options)
  # store the parser behaviors
  @skip_blanks      = options.delete(:skip_blanks)
  @field_size_limit = options.delete(:field_size_limit)

  # prebuild Regexps for faster parsing
  esc_col_sep = escape_re(@col_sep)
  esc_row_sep = escape_re(@row_sep)
  esc_quote   = escape_re(@quote_char)
  @parsers = {
    # for empty leading fields
    leading_fields: encode_re("\\A(?:", esc_col_sep, ")+"),
    # The Primary Parser
    csv_row:        encode_re(
      "\\G(?:\\A|", esc_col_sep, ")",                # anchor the match
      "(?:", esc_quote,                              # find quoted fields
             "((?>[^", esc_quote, "]*)",             # "unrolling the loop"
             "(?>", esc_quote * 2,                   # double for escaping
             "[^", esc_quote, "]*)*)",
             esc_quote,
             "|",                                    # ... or ...
             "([^", esc_quote, esc_col_sep, "]*))",  # unquoted fields
      "(?=", esc_col_sep, "|\\z)"                    # ensure field is ended
    ),
    # a test for unescaped quotes
    bad_field:      encode_re(
      "\\A", esc_col_sep, "?",                   # an optional comma
      "(?:", esc_quote,                          # a quoted field
             "(?>[^", esc_quote, "]*)",          # "unrolling the loop"
             "(?>", esc_quote * 2,               # double for escaping
             "[^", esc_quote, "]*)*",
             esc_quote,                          # the closing quote
             "[^", esc_quote, "]",               # an extra character
             "|",                                # ... or ...
             "[^", esc_quote, esc_col_sep, "]+", # an unquoted field
             esc_quote, ")"                      # an extra quote
    ),
    # safer than chomp!()
    line_end:       encode_re(esc_row_sep, "\\z"),
    # illegal unquoted characters
    return_newline: encode_str("\r\n")
  }
end

Don't worry about breaking down those heavily optimized regular expressions. The point here is just to notice how everything is eventually passed through encode_str() or encode_re().

Those were the major changes needed inside the CSV code to get it to parse natively in the Encoding of the data. I did have to add more code due to some side issues I ran into, but they don't really relate to this strategy too much:

Regexp.escape() didn't work correctly on all the Encodings I tested it with. It's improved a lot since then, but last I checked there were still some oddball Encodings it didn't support. Given that, I had to roll my own. If you want to see how I did that, check inside CSV.initialize() for how @re_esc and @re_chars get set and then have a look at CSV.escape_re().
CSV's line ending detection reads ahead in the data by fixed byte counts. That's tricky to do safely with encoded data since you could always land in the middle of a character. See CSV.read_to_char() for how I work around that issue, if you are interested.
Finally, testing the code with all the Encodings Ruby supports was a bit tricky, due to the concept of "dummy Encodings". See my discussion on those for details on how to filter them out of the mix.

Like anything, this strategy had plusses and minuses. As I've already said, it's a touch tedious to have to avoid normal literals. The added complexity to the code makes it a little harder to read and maintain. That's the price you pay.

Still, I think it shows some of the possibilities of what we can accomplish with Ruby's new features. We can stick to UTF-8 as our one-size-fits-all solution as we've done in the past. That's still a great idea in most cases. However, now we have some new options that were impractically hard with an older Ruby.

Miscellaneous M17n Details

2014-04-18T19:20:43Z

We've now discussed the core of Ruby 1.9's m17n (multilingualization) engine. String and IO are where you will see the big changes. The new m17n system is a big beast though with a lot of little details. Let's talk a little about some side topics that also relate to how we work with character encodings in Ruby 1.9.

More Features of the Encoding Class

You've seen me using Encoding objects all over the place in my explanations of m17n, but we haven't talked much about them. They are very simple, mainly just being a named representation of each Encoding inside Ruby. As such, Encoding is a storage place for some tools you may find handy when working with them.

First, you can receive a list() of all Encoding objects Ruby has loaded in the form of an Array:

$ ruby -e 'puts Encoding.list.first(3), "..."'
ASCII-8BIT
UTF-8
US-ASCII
...

If you're just interested in a specific Encoding, you can find() it by name:

$ ruby -e 'p Encoding.find("UTF-8")'
#<Encoding:UTF-8>
$ ruby -e 'p Encoding.find("No-Such-Encoding")'
-e:1:in `find': unknown encoding name - No-Such-Encoding (ArgumentError)
    from -e:1:in `<main>'

As you can see, Ruby raises an ArgumentError if it doesn't know about a given Encoding.

Some Encoding objects also have more than one name. These aliases() can be used interchangeably to refer to the same Encoding. For example, ASCII is an alias for US-ASCII:

$ ruby -e 'puts Encoding.aliases["ASCII"]'
US-ASCII
$ ruby -e 'p Encoding.find("ASCII") == Encoding.find("US-ASCII")' 
true

The aliases() method returns a Hash keyed with the alternate names Ruby knows about. The values are the actual Encoding name that alias refers to. You can use either a name or an alias when referring to an Encoding by name, like with calls to Encoding::find() or IO::open().

Finally, there's one more gotcha you should be aware of if you're going to write some code that supports a large set of Ruby's Encodings. Ruby ships with a few dummy?() Encodings that don't have character handling completely implemented. These are used for stateful Encodings. You will want to filter them out of the Encodings you try to support to avoid running into problems:

$ ruby -e 'puts "Dummy Encodings:", Encoding.list.select(&:dummy?).map(&:name)'
Dummy Encodings:
ISO-2022-JP
ISO-2022-JP-2
UTF-7

String Escapes

In Ruby 1.8 you would sometimes see byte escapes used to insert raw bytes into a String. For example, you can choose to build the String "…" with the following byte escapes:

$ ruby -v -KU -e 'p "\xe2\x80\xa6"'
ruby 1.8.6 (2009-03-31 patchlevel 368) [i686-darwin9.6.0]
"…"

The same tactic still works on Ruby 1.9, but remember that Encodings are still going to play into this as we've been discussing:

$ cat utf8_escapes.rb 
# encoding: UTF-8
str = "\xe2\x80\xa6"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v utf8_escapes.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…", true]
$ cat invalid_escapes.rb 
# encoding: UTF-8
str = "\xe2\x80"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v invalid_escapes.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "\xE2\x80", false]

Notice that I got the requested bytes in both cases. However, those Strings were assigned the source Encoding as normal. In the first case, that built a valid UTF-8 String. However, the second case is invalid and may later cause me fits as I try to use the String.

There are a couple of exceptions though, where a String escape can actually change the Encoding of the literal. First, you'll likely remember that using a multibyte character is not allowed if you don't change the source Encoding:

$ cat bad_code.rb 
"abc…"
$ ruby -v bad_code.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
bad_code.rb:1: invalid multibyte char (US-ASCII)
bad_code.rb:1: invalid multibyte char (US-ASCII)

However, a special case is made for \x## escapes:

$ cat ascii_escapes.rb 
puts "Source Encoding:  #{__ENCODING__}"
str = "abc\xe2\x80\xa6"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v ascii_escapes.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
Source Encoding:  US-ASCII
[#<Encoding:ASCII-8BIT>, "abc\xE2\x80\xA6", true]

Notice that the Encoding of the String was upgraded to ASCII-8BIT to accommodate the bytes. We'll talk a lot more about that special Encoding later in this post, but for now just make note of the fact that this exception gives you an easy way to work with binary data.

Octal escapes (\###), control escapes (\cx or \C-x), meta escapes (\M-x), and meta-control escapes (\M-\C-x) all follow the same rules as the hex escapes (\x##) we've just been discussing.

The other exception is the \u#### escape that can be used to enter Unicode characters by codepoint. When you use this escape, the String gets a UTF-8 Encoding regardless of the current source Encoding:

$ cat ascii_u_escape.rb 
str = "\u2026"
p [str.encoding, str]
$ ruby -v ascii_u_escape.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…"]
$ cat sjis_u_escape.rb 
# encoding: Shift_JIS
str = "\u2026"
p [str.encoding, str]
$ ruby -v sjis_u_escape.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…"]
$ cat utf8_u_escape.rb 
# encoding: UTF-8
str = "\u2026"
p [str.encoding, str]
$ ruby -v utf8_u_escape.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…"]

Notice how the String received a UTF-8 Encoding in all three cases, regardless of the current source Encoding. This exception gives you an easy way to work with UTF-8 data, no matter what your native Encoding is.

The Unicode escape can be followed by exactly four hex digits as I've shown above, or you can use an alternate form \u{#…} where you place between one and six hex digits between the braces. Both forms have the same effect on the String's Encoding.
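A short sketch of the two forms side by side. The codepoints are just ones I picked for the example:

```ruby
four_digits = "\u2026"     # exactly four hex digits
braces      = "\u{2026}"   # the same codepoint in the brace form
beyond_bmp  = "\u{1F48E}"  # five digits, so only the brace form can express it

p four_digits == braces
p [beyond_bmp.encoding, beyond_bmp.codepoints.map { |c| c.to_s(16) }]
```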

Working with Binary Data

Not all data is textual data. Ruby's String class can also be used to hold raw byte sequences. For example, you may want to work with the raw bytes of a PNG image.

Ruby 1.9 has an Encoding for this which basically just means treat my data as raw bytes. You can think of this Encoding as a way to shut off character handling and just work with bytes:

$ cat raw_bytes.rb 
# encoding: UTF-8
str = "Résumé"
def str.inspect
  { data:     dup,
    encoding: encoding.name,
    chars:    size,
    bytes:    bytesize }.inspect
end
p str
str.force_encoding("BINARY")
p str
$ ruby raw_bytes.rb 
{:data=>"Résumé", :encoding=>"UTF-8", :chars=>6, :bytes=>8}
{:data=>"R\xC3\xA9sum\xC3\xA9", :encoding=>"ASCII-8BIT", :chars=>8, :bytes=>8}

See how switching the Encoding (without changing the data) shut off Ruby's concept of characters? The character count became the same as the byte count and Ruby started giving a more raw version of the inspect() String to show those are just bytes.

If you expected this Encoding to be called BINARY, you are half right. As you can see I could use that name above because it is a valid alias. Ruby switched to the real name in the inspect() message though. Ruby actually refers to the Encoding as ASCII-8BIT, which leads us to another twist.

Obviously, there's not really such a thing as "ASCII-8BIT" outside of Ruby. Even while working with binary data though, it's not uncommon to want to make a check for some simple ASCII pieces. For example, the first few signature bytes of a PNG image do contain the simple ASCII String "PNG":

$ cat png_sig.rb 
sig = "\x89PNG\r\n\C-z\n"
png = /\A.PNG/

p({sig => sig.encoding.name, png => png.encoding.name})

if sig =~ png
  puts "This data looks like a PNG image."
end
$ ruby png_sig.rb 
{"\x89PNG\r\n\x1A\n"=>"ASCII-8BIT", /\A.PNG/=>"US-ASCII"}
This data looks like a PNG image.

Ruby makes this possible by making ASCII-8BIT compatible?() with US-ASCII. That allows tricks like the above where I validated the PNG signature with a simple US-ASCII Regexp. Thus, ASCII-8BIT means ASCII plus some other bytes and you can choose to treat parts of it as ASCII when that helps you work with the data.
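You can ask Ruby about this relationship directly. In this sketch, Encoding.compatible?() reports the Encoding two Strings could be combined in, or nil when they can't be. The sample Strings are my own:

```ruby
sig   = "\x89PNG\r\n\x1A\n".b     # binary data (ASCII-8BIT)
ascii = "PNG".encode("US-ASCII")  # seven bit text
utf8  = "R\u00E9sum\u00E9"        # UTF-8 with multibyte characters

p Encoding.compatible?(sig, ascii)  # the binary Encoding can absorb ASCII
p Encoding.compatible?(sig, utf8)   # nil, these two can't be combined
p sig.include?(ascii)               # so ASCII checks on binary data are fine
```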

It's worth noting that Ruby will now fall back to an ASCII-8BIT Encoding anytime you read() by bytes:

$ cat binary_fallback.rb 
open("ascii.txt", "w+:UTF-8") do |f|
  f.puts "abc"
  f.rewind
  str = f.read(2)
  p [str.encoding.name, str]
end
$ ruby binary_fallback.rb 
["ASCII-8BIT", "ab"]

That makes sense, because you could chop up characters when reading by bytes. If you really need to read() some bytes but keep your Encoding you will need to set and validate it manually. Here's one way you might do something like that:

$ cat read_to_char.rb 
# encoding: UTF-8
open("ascii.txt", "w+:UTF-8") do |f|
  f.puts "Résumé"
  f.rewind
  str = f.read(2)
  until str.dup.force_encoding(f.external_encoding).valid_encoding?
    str << f.read(1)
  end
  str.force_encoding(f.external_encoding)
  p [str.encoding.name, str]
end
$ ruby read_to_char.rb 
["UTF-8", "Ré"]

In that example, I just read() the fixed bytes I wanted and then push forward byte by byte until my data is valid in the desired Encoding. I had to test a dup() of the data and only force_encoding() when I was sure I was done reading, because UTF-8 and ASCII-8BIT are not compatible?() and would have raised Encoding::CompatibilityError as I was adding on bytes.

Working with binary data also requires you to know one more thing about Ruby's IO objects. Ruby has a feature where it translates some data you read on Windows. The translation is super simple: "\r\n" sequences read from an IO object are simplified to a solo "\n". This feature is to help make Unix scripts work well on a platform that has different line endings. It does create a gotcha though: when you're going to read any non-text data, be it binary data or just a non-ASCII compatible Encoding like UTF-16, you need to warn Ruby not to do the translation for your code to be properly cross-platform.

By the way, this isn't new. This was even true in the Ruby 1.8 era.

Telling Ruby to treat the data as binary and not perform any translation (again, only active on Windows) is simple. You can just add a "b" for binary to your mode String in a call to open(). Thus you would read with something like:

open(path, "rb") do |f|
  # ...
end

or write with code like:

open(path, "wb") do |f|
  # ...
end

If you always knew about this quirk and you did a good job of always doing this, give yourself a big pat on the back because you're all set. If you didn't, you've got a bad habit you'll need to break. Don't feel too bad about it though. I've known about this quirk since my Perl days (Perl does the same thing) and I've always tried to follow it. However, about ten different bugs were recently filed against one of my libraries that amounted to me missing this "b" in several places. It's easy to forget.

Ruby 1.9 is much more strict about the binary flag. It's going to complain if you don't add it when it feels it is needed. For example:

$ cat missing_b.rb 
# Ruby 1.9 will let this slide
open("utf_16.txt", "w:UTF-16LE") do |f|
  f.puts "Some data."
end
# but not this
open("utf_16.txt", "r:UTF-16LE") do |f|
  # ...
end
$ ruby missing_b.rb 
missing_b.rb:6:in `initialize': ASCII incompatible encoding needs binmode
                                (ArgumentError)
    from missing_b.rb:6:in `open'
    from missing_b.rb:6:in `<main>'

Of course, this is trivial to fix. You just have to add the missing "b":

$ cat with_b.rb 
open("utf_16.txt", "wb:UTF-16LE") do |f|
  f.puts "Some data."
end
open("utf_16.txt", "rb:UTF-16LE") do |f|
  puts f.external_encoding.name
end
$ ruby with_b.rb 
UTF-16LE

I showed the external_encoding() there to show that it's exactly what I specified. However, as a reward for adding in these "b"'s we've been bad about leaving out in the past, Ruby will now assume you want ASCII-8BIT when you supply the "b" and not an external_encoding():

$ cat b_means_binary.rb 
open("utf_16.txt", "r") do |f|
  puts "Inherited from environment:  #{f.external_encoding.name}"
end
open("utf_16.txt", "rb") do |f|
  puts %Q{Using "rb":  #{f.external_encoding.name}}
end
$ ruby b_means_binary.rb 
Inherited from environment:  UTF-8
Using "rb":  ASCII-8BIT

It's worth noting that Ruby 1.8 accidentally helped train us to leave out the magic "b". For example, you could use IO::read() to slurp some data, but that method didn't provide a way to indicate that the data was binary. In truth, you really needed this monster for a safe cross-platform read of binary data: open(path, "rb") { |f| f.read }. It's no surprise that IO::read() was more common. IO::readlines() and IO::foreach() had the same issue. The core team has acknowledged these problems with some new additions. First, you can now pass a Hash as the final argument to all the methods that open an IO and use that to set options like :mode or separately :external_encoding, :internal_encoding, and :binmode (the name for the magic "b"). Here are some examples:

File.read("utf_16.txt", mode: "rb:UTF-16LE")

File.readlines("utf_16.txt", mode: "rb:UTF-16LE")

File.foreach("utf_16.txt", mode: "rb:UTF-16LE") do |line|

end

File.open("utf_16.txt", mode: "rb:UTF-16LE") do |f|

end

open("utf_16.txt", mode: "rb:UTF-16LE") do |f|

end

As one last shortcut along these lines, the new IO::binread() method is the same as IO.read(…, mode: "rb:ASCII-8BIT").
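A small sketch to confirm that equivalence, using a throwaway file I create just for the test. The optional length and offset arguments to binread() are also shown, since they are handy for peeking at signatures:

```ruby
require "tempfile"

Tempfile.create("bytes") do |tmp|
  tmp.close
  File.binwrite(tmp.path, "\x89PNG\r\n\x1A\n".b)

  slurped = File.binread(tmp.path)
  spelled = File.read(tmp.path, mode: "rb:ASCII-8BIT")

  p [slurped.encoding, slurped == spelled]
  p File.binread(tmp.path, 3, 1)  # read three bytes, skipping the first
end
```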

Regex Encodings

Now that all our data has an Encoding, it only makes sense that our Regexp objects would need to be tagged as well. That is the case, but the rules for how an Encoding is selected differ for Regexp. Let's talk a little about how and why.

First, let's get the big surprise out of the way:

$ cat re_encoding.rb 
# encoding: UTF-8
utf8_str   = "résumé"
latin1_str = utf8_str.encode("ISO-8859-1")
binary_str = utf8_str.dup.force_encoding("ASCII-8BIT")
utf16_str  = utf8_str.encode("UTF-16BE")

re = /\Ar.sum.\z/
puts "Regexp.encoding.name:  #{re.encoding.name}"

[utf8_str, latin1_str, binary_str, utf16_str].each do |str|
  begin
    result = str =~ re ? "Matches" : "Doesn't match"
  rescue Encoding::CompatibilityError
    result = "Can't match non-ASCII compatible?() Encoding"
  end
  puts "#{result}:  #{str.encoding.name}"
end
$ ruby re_encoding.rb 
Regexp.encoding.name:  US-ASCII
Matches:  UTF-8
Matches:  ISO-8859-1
Doesn't match:  ASCII-8BIT
Can't match non-ASCII compatible?() Encoding:  UTF-16BE

After we did all that talking about the source Encoding Ruby goes and ignores it on us. You can see that the Regexp was set to US-ASCII instead of the UTF-8 that was in effect at the time. Surprising though that may be, there is actually a pretty good reason for it.

My Regexp literal only contained seven bit ASCII, so Ruby chose to simplify the Encoding. If it left it at the source Encoding of UTF-8, it would only be useful for checking UTF-8 data. As it is though, it can now be used to check any ASCII compatible?() data. You can see in the output that the expression was tried against three different Strings, because they are all ASCII compatible?(). (It did fail to match one since I changed the rules of how to interpret the data and one character became two bytes, but the attempt was still made.) The fourth match could not be attempted, because UTF-16 is not ASCII compatible?().

Of course, if your Regexp includes eight bit characters, you use the special escapes that change an Encoding, or you apply one of the old Ruby 1.8 style Encoding options, you can get a non-ASCII Encoding:

$ cat encodings.rb 
# encoding: UTF-8
res = [
  /…\z/,       # source Encoding
  /\A\uFEFF/,  # special escape
  /abc/u       # Ruby 1.8 option
]
puts res.map { |re| [re.encoding.name, re.inspect].join(" ") }
$ ruby encodings.rb
UTF-8 /…\z/
UTF-8 /\A\uFEFF/
UTF-8 /abc/

I used /u which you will probably remember as a way to get a UTF-8 Regexp from the old Ruby 1.8 system. The /e (for EUC_JP) and /s (for a Shift_JIS extension called Windows-31J) options still work too. Ruby 1.9 also still supports the old /n option, but it has some warning-tossing exceptions for legacy reasons and I recommend just avoiding it going forward. You can build an ASCII-8BIT Regexp in another way I'll show in just a moment.

As of Ruby 1.9.2, this concept of a lenient Regexp, one that will match any ASCII compatible?() Encoding, has a new name:

$ cat fixed_encoding.rb 
[/a/, /a/u].each do |re|
  puts "%-10s %s" % [ re.encoding, re.fixed_encoding? ? "fixed" :
                                                        "not fixed" ]
end
$ ruby fixed_encoding.rb 
US-ASCII   not fixed
UTF-8      fixed

A fixed_encoding?() Regexp is one that will raise an Encoding::CompatibilityError if matched against any String that contains a different Encoding from the Regexp itself, as long as the String isn't ascii_only?(). If fixed_encoding?() returns false, the Regexp can be used against any ASCII compatible?() Encoding. There's also a new constant with this name that can be used to disable the ASCII downgrading:

$ cat force_re_encoding.rb 
puts Regexp.new("abc".force_encoding("UTF-8")).encoding.name
puts Regexp.new( "abc".force_encoding("UTF-8"),
                 Regexp::FIXEDENCODING ).encoding.name
$ ruby force_re_encoding.rb 
US-ASCII
UTF-8

Note how a Regexp will take the Encoding of the String passed to Regexp::new() when Regexp::FIXEDENCODING is set. You can use this combination to build a Regexp in any Encoding you need, including the ASCII-8BIT I mentioned earlier.
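For instance, here is a sketch of building that ASCII-8BIT Regexp and trying it against the PNG signature from earlier. I use String#b (available in newer Rubies) to get a binary copy of the pattern. Calling force_encoding("ASCII-8BIT") on the String would work as well:

```ruby
# Build the pattern from a binary copy of the source String.
png = Regexp.new("\\A.PNG".b, Regexp::FIXEDENCODING)
p [png.encoding, png.fixed_encoding?]

sig = "\x89PNG\r\n\x1A\n".b
p sig =~ png  # binary data against a binary Regexp

begin
  "R\u00E9sum\u00E9" =~ png  # non-ASCII UTF-8 data is refused
rescue Encoding::CompatibilityError => error
  puts "Refused:  #{error.message}"
end
```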

Once your Regexp is at least compatible with your data's Encoding, pattern matches function as they always have. (Well, in truth, Ruby 1.9 brings us a powerful new regular expression engine called Oniguruma, but that's another topic for another time.) Under average circumstances, Ruby 1.9's Regexp Encoding selection rules mean that your expressions are compatible with a lot of data and everything should just work for you. However, if you end up getting some errors at match time, you may need to abandon the simple /…/ literal and use the new features I've shown to build a Regexp that perfectly matches your data's Encoding.

Handling a BOM

Some multibyte Encodings recommend that data in that Encoding begin with a Byte Order Mark (also known as a BOM) indicating the order of the bytes. UTF-16 is a good example.

Note that Ruby doesn't even support a UTF-16 Encoding. Instead, you must pick between UTF-16BE and UTF-16LE for "Big Endian" or "Little Endian" byte order. This indicates whether the most significant byte comes first or last:

$ ruby -e 'p "a".encode("UTF-16BE")'
"\x00a"
$ ruby -e 'p "a".encode("UTF-16LE")'
"a\x00"

Now, when someone goes to read your UTF-16 data back, they'll need to know which byte order you used to get things right. You could just tell them which order was used the same way you'll probably tell them that the data is UTF-16 encoded. Or you could add a BOM to the data.

A Unicode BOM is just the character U+FEFF at the beginning of your data. There's no such character for the reversed bytes U+FFFE, so this makes it easy to correctly tell the order of the bytes. Another minor advantage is that this BOM probably indicates you are reading Unicode data. A lot of software will check for this special start of the data, use it to set the proper byte order, and then pretend it didn't even exist by removing it from the data they show users.
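If you want to see the byte patterns that software looks for, a quick sketch like this will print them. It just transcodes the one character and dumps its bytes in hex:

```ruby
%w[UTF-16BE UTF-16LE UTF-32BE UTF-32LE].each do |name|
  bom = "\uFEFF".encode(name)
  puts "%-8s  %s" % [name, bom.bytes.map { |b| "%02X" % b }.join(" ")]
end
```

The Big Endian forms end in FE FF while the Little Endian forms begin with FF FE, which is exactly the difference a reader uses to work out the byte order.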

Ruby 1.9 won't automatically add a BOM to your data, so you're going to need to take care of that if you want one. Luckily, it's not too tough. The basic idea is just to print the bytes needed at the beginning of a file. For example, we can add a BOM to a UTF-16LE file as such:

$ cat utf16_bom.rb 
# encoding: UTF-8
File.open("utf16_bom.txt", "w:UTF-16LE") do |f|
  f.puts "\uFEFFThis is UTF-16LE with a BOM."
end
$ ruby utf16_bom.rb 
$ ruby -e 'p File.binread("utf16_bom.txt")[0..9]'
"\xFF\xFET\x00h\x00i\x00s\x00"

Notice that I just used the Unicode escape to add the BOM character to the data. Because my output String was in UTF-8, Ruby had to transcode it to UTF-16LE and that process arranged the bytes correctly for me, as you see in the sample output.

Reading a BOM is a similar process. We will need to pull the relevant bytes and see if they match a Unicode BOM. When they do, we can then start reading again with the Encoding we matched. We might code that up like this:

$ cat read_bom.rb 
class File
  UTFS = [32, 16].map { |b| %w[BE LE].map { |o| "UTF-#{b}#{o}" } }.
                  flatten << "UTF-8"

  def self.open_using_unicode_bom(path, *args, &blk)
    # check the BOM to find the Encoding
    encoding = UTFS[0..-2].find(lambda { UTFS[-1] }) do |utf|
      bom = "\uFEFF".encode(utf)
      binread(path, bom.bytesize).force_encoding(utf) == bom
    end
    # set the Encoding
    if args.first.nil?
      args << "r#{'b' unless encoding == UTFS[-1]}:#{encoding}"
    elsif args.first.is_a? Hash
      args.first.merge!(external_encoding: encoding)
    else
      args.first.sub!(/\A([^:]*)/, "\\1:#{encoding}")
    end
    # hand off to open()
    if blk
      open(path, *args) do |f|
        f.read_unicode_bom
        blk[f]
      end
    else
      f = open(path, *args)
      f.read_unicode_bom
      f
    end
  end

  def read_unicode_bom
    bytes = external_encoding.name[/\AUTF-?(\d+)/i, 1].to_i / 8
    read(bytes) if bytes > 1
  end
end

# example usage with the File we created earlier
File.open_using_unicode_bom("utf16_bom.txt") do |f|
  line = f.gets
  p [line.encoding, line[0..3]]
end
$ ruby read_bom.rb 
[#<Encoding:UTF-16LE>, "T\x00h\x00i\x00s\x00"]

These examples just deal with Unicode BOMs, but you would handle other BOMs in a similar fashion. Find out what bytes are needed for your Encoding, write those out before the data, and later check for them when reading the data back. The String escapes we discussed earlier can be handy when writing the bytes and binread() is equally handy when checking for the BOM.

I do recommend including a BOM in Unicode Encodings like UTF-16 and UTF-32, but please don't add them to UTF-8 data. The UTF-8 byte order is part of its specification and it never varies. Thus you don't need a BOM to read it correctly. If you add one, you damage one of the great UTF-8 advantages in that it can pass for US-ASCII (assuming it's all seven bit characters).
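Here is a tiny sketch of what gets lost, using sample Strings of my own choosing:

```ruby
plain    = "abc"
with_bom = "\u{FEFF}abc"

p plain.bytes == plain.encode("US-ASCII").bytes  # byte for byte the same
p plain.ascii_only?
p with_bom.ascii_only?     # the BOM alone breaks the ASCII disguise
p with_bom.bytes.first(3)  # three extra bytes a plain ASCII reader will trip on
```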

Ruby 1.9's Three Default Encodings

2014-04-18T18:40:50Z

I suspect early contact with the new m17n (multilingualization) engine is going to come to Rubyists in the form of this error message:

invalid multibyte char (US-ASCII)

Ruby 1.8 didn't care what you stuck in a random String literal, but 1.9 is a touch pickier. I think you'll see that the change is for the better, but we do need to spend some time learning to play by Ruby's new rules.

That takes us to the first of Ruby's three default Encodings.

The Source Encoding

In Ruby's new grown up world of all encoded data, each and every String needs an Encoding. That means an Encoding must be selected for a String as soon as it is created. One way that a String can be created is for Ruby to execute some code with a String literal in it, like this:

str = "A new String"

That's a pretty simple String, but what if I use a literal like the following instead?

str = "Résumé"

What Encoding is that in? That fundamental question is probably the main reason we all struggle a bit with character encodings. You can't tell just from looking at that data what Encoding it is in. Now, if I showed you the bytes you may be able to make an educated guess, but the data just isn't wearing an Encoding name tag.

That's true of a frightening lot of data we deal with every day. A plain text file doesn't generally say what Encoding the data inside is in. When you think about that, it's a miracle we can successfully read a lot of things.

When we're talking about program code, the problem gets worse. I may want to write my code in UTF-8, but some Japanese programmer may want to write his code in Shift JIS. Ruby should support that and, in fact, 1.9 does. Let's complicate things a bit more though: imagine that I bundle up that UTF-8 code I wrote in a gem and the Japanese programmer later uses it to help with his Shift JIS code. How do we make that work seamlessly?

The Ruby 1.8 strategy of one global variable won't survive a test like this, so it was time to switch strategies. Ruby 1.9's answer to this problem is the source Encoding.

All Ruby source code now has some Encoding. When you create a String literal in your code, it is assigned the Encoding of your source. That simple rule solves all the problems I just described pretty nicely. As long as my source Encoding is UTF-8 and the Japanese programmer's source Encoding is Shift JIS, my literals will work as I expect and his will work as he expects. Obviously if we share any data, we will need to establish some rules about our shared formats using documentation or code that can adapt to different Encodings, but we should have been doing that all along anyway.

Thus the only question becomes, what's my source Encoding and how do I change it?

There are a few different ways Ruby can select a source Encoding. Here are the options:

$ cat no_encoding.rb 
p __ENCODING__
$ ruby no_encoding.rb 
#<Encoding:US-ASCII>

$ cat magic_comment.rb 
# encoding: UTF-8
p __ENCODING__
$ ruby magic_comment.rb 
#<Encoding:UTF-8>
$ cat magic_comment2.rb 
#!/usr/bin/env ruby -w
# encoding: UTF-8
p __ENCODING__
$ ruby magic_comment2.rb 
#<Encoding:UTF-8>

$ echo $LC_CTYPE
en_US.UTF-8
$ ruby -e 'p __ENCODING__'
#<Encoding:UTF-8>

$ ruby -KU no_encoding.rb 
#<Encoding:UTF-8>

The first example shows us two important things. The first is the main rule of source Encodings: source files receive a US-ASCII Encoding, unless you say otherwise. [Update: this was changed to UTF-8 in Ruby 2.0 and up.] This is where I expect programmers to run into the error I mentioned earlier. If you place any non-ASCII content in a String literal without changing the source Encoding, Ruby will die with that error. Thus you need to change the source Encoding to work with any non-ASCII data. The second thing we see here is the new __ENCODING__ keyword that can be used to get the source Encoding that's active where it is executed.

The second example shows the preferred way to set your source Encoding and it's called a magic comment. If the first line of your code is a comment that includes the word coding, followed by a colon and space, and then an Encoding name, the source Encoding for that file is changed to the indicated Encoding. If your code has a shebang line, the magic comment must come on the second line, with no spacing between them. Once set, all String literals you create in that file will have that Encoding attached to them.
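Here's a small sketch of that last rule, assuming the file is saved as UTF-8 and the magic comment is its first line. The helper literal_encoding() is a hypothetical name used only for this illustration.

```ruby
# encoding: UTF-8

# Hypothetical helper: report the Encoding a literal in this file receives.
def literal_encoding
  "Résumé".encoding
end

p __ENCODING__                      # the source Encoding of this file
p literal_encoding == __ENCODING__  # => true
```

The non-ASCII literal picks up whatever Encoding the source file declared.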

The third example shows an exception to the rule for your convenience. When you feed Ruby some code on the command-line using the -e switch, it gets a source Encoding from your environment. I have UTF-8 set in the LC_CTYPE environment variable, but some people also use the LANG variable for this. This makes scripting easier since Ruby will (hopefully) match the Encoding of any other commands you chain together.

The fourth example is another interesting exception to the rule. Ruby 1.9 still supports the -K* style switches from Ruby 1.8 including the -KU switch I've recommended so heavily in this series. These switches have a couple of effects, but of particular note they are the only non-magic-comment way to modify the source Encoding. This is good news for backwards compatibility, because some Ruby 1.8 code may be able to run on Ruby 1.9 without Encoding problems thanks to this. I must stress that this is just for backwards compatibility though, and magic comments are the future.

With magic comments the code will include its Encoding data. It will probably seem a little tedious to add them to all your source files at first, but it's really not that big of a change. In the past, I've recommended we stick the following shebang line at the top of our files:

#!/usr/bin/env ruby -wKU

Now, for Ruby 1.9, I'm recommending we switch to something like this:

#!/usr/bin/env ruby -w
# encoding: UTF-8

Note that the magic comment format rules are pretty loose and all of the following examples would work the same:

# encoding: UTF-8

# coding: UTF-8

# -*- coding: UTF-8 -*-

This is nice for support in some text editors that also read such comments.

If we all get into that habit of adding magic comments, our code can work together regardless of the various Encodings we personally favor. Ruby will know how to handle each separate file. As an added bonus, we programmers also get to see these comments and know more about the code we are working with. That makes it a good habit to get into, I think.

The Default External and Internal Encodings

There's another way Strings are commonly created and that's by reading from some IO object. It doesn't make sense to give those Strings the source Encoding because the external data doesn't have to be related to your source code. Also, you really need to know how data is encoded to read it correctly. Even a simple concept like reading the next line of data changes if you are talking about UTF-8 or UTF-16LE (the LE stands for a Little Endian byte order) data. Thus, it makes sense for IO objects to have at least one Encoding attached to them. Ruby 1.9 is generous and gives them two: the external Encoding and the internal Encoding.
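To see why even "the next line" depends on the Encoding, here's a minimal sketch, assuming Ruby 1.9 or later. The helper newline_bytes() is a name I invented for this example. It just shows the bytes a reader would have to look for to find the end of a line.

```ruby
# Hypothetical helper: the bytes that make up a newline in a given Encoding.
def newline_bytes(encoding_name)
  "\n".encode(encoding_name).bytes.to_a
end

p newline_bytes("UTF-8")     # => [10]
p newline_bytes("UTF-16LE")  # => [10, 0]
p newline_bytes("UTF-16BE")  # => [0, 10]
```

A reader that splits on a single 10 byte would cut a UTF-16 newline in half.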

The external Encoding is the Encoding of the data as it exists inside the IO object. That affects how data will be read, and it is the Encoding data will be returned in as long as the internal Encoding isn't set (more on that in a bit). Let's look at an example of how this plays out in practice:

$ cat show_external.rb 
open(__FILE__, "r:UTF-8") do |file|
  puts file.external_encoding.name
  p    file.internal_encoding
  file.each do |line|
    p [line.encoding.name, line]
  end
end
$ ruby show_external.rb 
UTF-8
nil
["UTF-8", "open(__FILE__, \"r:UTF-8\") do |file|\n"]
["UTF-8", "  puts file.external_encoding.name\n"]
["UTF-8", "  p    file.internal_encoding\n"]
["UTF-8", "  file.each do |line|\n"]
["UTF-8", "    p [line.encoding.name, line]\n"]
["UTF-8", "  end\n"]
["UTF-8", "end\n"]

There are four things to notice in this example:

I set the external Encoding by tacking :UTF-8 onto the end of my mode String when I opened the File
You can use external_encoding() to check the external Encoding as I have here
internal_encoding() works the same for the internal Encoding, which will be nil unless you explicitly set it
Note how each String created as I read the data is given the external_encoding()

The internal Encoding just adds one more twist. When set, data will still be read in the external Encoding, but transcoded to the internal Encoding as the String is created. It's a convenience for you as the programmer. Watch how that changes things:

$ cat show_internal.rb 
open(__FILE__, "r:UTF-8:UTF-16LE") do |file|
  puts file.external_encoding.name
  puts file.internal_encoding.name
  file.each do |line|
    p [line.encoding.name, line[0..3]]
  end
end
$ ruby show_internal.rb 
UTF-8
UTF-16LE
["UTF-16LE", "o\x00p\x00e\x00n\x00"]
["UTF-16LE", " \x00 \x00p\x00u\x00"]
["UTF-16LE", " \x00 \x00p\x00u\x00"]
["UTF-16LE", " \x00 \x00f\x00i\x00"]
["UTF-16LE", " \x00 \x00 \x00 \x00"]
["UTF-16LE", " \x00 \x00e\x00n\x00"]
["UTF-16LE", "e\x00n\x00d\x00\n\x00"]

There are a couple of differences here:

A second added Encoding on the mode String (my :UTF-16LE in this example) sets the internal_encoding() as I show with the second puts()
This little change gets Ruby to translate all of the data for me (I just shortened the output because UTF-16LE is noisy)
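The more common direction in practice is probably reading legacy data and having it arrive as UTF-8. Here's a sketch of that, assuming Ruby 2.1 or later (for Tempfile.create() and String#b). The helper read_as() and the temporary Latin-1 file are my own inventions for this example.

```ruby
require "tempfile"

# Hypothetical helper: read a whole file, optionally transcoding as we go.
def read_as(path, external, internal = nil)
  mode = internal ? "r:#{external}:#{internal}" : "r:#{external}"
  File.open(path, mode) { |f| f.read }
end

Tempfile.create("latin1") do |tmp|
  # Write the raw Latin-1 bytes for "Résumé\n" (each é is the single byte E9).
  File.binwrite(tmp.path, "R\xE9sum\xE9\n".b)

  raw  = read_as(tmp.path, "ISO-8859-1")
  utf8 = read_as(tmp.path, "ISO-8859-1", "UTF-8")

  p [raw.encoding.name,  raw.bytesize]   # => ["ISO-8859-1", 7]
  p [utf8.encoding.name, utf8.bytesize]  # => ["UTF-8", 9]
end
```

With only an external Encoding the bytes come back untouched. Adding the internal Encoding gets Ruby to transcode them as the String is created.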

The external Encoding works the same when writing. It still represents the Encoding in the IO object, or the Encoding data is going to. However, you don't need to specify an internal Encoding when writing. Ruby will automatically use the Encoding of a String you output as the internal Encoding and transcode as needed to reach the external Encoding. For example:

$ cat write_internal.rb 
# encoding: UTF-8
open("data.txt", "w:UTF-16LE") do |file|
  puts file.external_encoding.name
  p    file.internal_encoding
  data = "My data…"
  p [data.encoding.name, data]
  file << data
end
p File.read("data.txt")
$ ruby write_internal.rb 
UTF-16LE
nil
["UTF-8", "My data…"]
"M\x00y\x00 \x00d\x00a\x00t\x00a\x00& "

Note how my data was transcoded before it was written even though the internal_encoding() was nil. Ruby used the String's Encoding to decide what was needed.
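If you want to check exactly which bytes landed on disk, binread() helps. This is a sketch assuming Ruby 2.1 or later. The helper bytes_written() is a hypothetical name that exists only in this example.

```ruby
require "tempfile"

# Hypothetical helper: write text through an IO with the given external
# Encoding and return the raw bytes that ended up in the file.
def bytes_written(text, external)
  Tempfile.create("out") do |tmp|
    File.open(tmp.path, "w:#{external}") { |f| f << text }
    File.binread(tmp.path).bytes.to_a
  end
end

p bytes_written("…", "UTF-8")     # => [226, 128, 166]
p bytes_written("…", "UTF-16LE")  # => [38, 32]
```

Those last two bytes, 38 and 32, are the ASCII codes for "&" and a space, which is why the ellipsis showed up as "& " in the output above.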

Both of those IO Encodings should be pretty straightforward. The only question left about them is: what happens if you don't set them? The answer is that the IO inherits the default external Encoding and/or the default internal Encoding whenever one isn't set. Now we need to know how Ruby chooses those defaults.

The default external Encoding is pulled from your environment, much like the source Encoding is for code given on the command-line. Have a look:

$ echo $LC_CTYPE
en_US.UTF-8
$ ruby -e 'puts Encoding.default_external.name'
UTF-8
$ LC_CTYPE=ja_JP.sjis ruby -e 'puts Encoding.default_external.name'
Shift_JIS

The default internal Encoding is simply nil. You must actively change it to get anything else.

Both default IO Encodings have a global setter: Encoding.default_external=() and Encoding.default_internal=(). You can set them to an Encoding object or just the String name of an Encoding.
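Here's a sketch of those setters in action, assuming Ruby 2.1 or later. The helper with_default_encodings() is my own invention; it swaps the globals for the length of a block and then restores them, so the rest of the program isn't affected.

```ruby
require "tempfile"

# Hypothetical helper: temporarily change the global defaults, then restore.
def with_default_encodings(external, internal)
  saved = [Encoding.default_external, Encoding.default_internal]
  Encoding.default_external = external  # a String name works here...
  Encoding.default_internal = internal  # ...and so does an Encoding object
  yield
ensure
  Encoding.default_external, Encoding.default_internal = saved
end

Tempfile.create("latin1") do |tmp|
  File.binwrite(tmp.path, "R\xE9sum\xE9".b)  # Latin-1 bytes for "Résumé"

  with_default_encodings("ISO-8859-1", Encoding::UTF_8) do
    File.open(tmp.path) do |f|  # no Encodings given, so the defaults apply
      p [f.external_encoding, f.internal_encoding]
      p f.read == "Résumé"      # => true
    end
  end
end
```

Ruby may print a warning about changing these globals when warnings are enabled, which is a fair hint that they are best left alone in library code.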

You can also change these default Encodings using some command-line switches. The new -E switch can be used to set one or both of the IO Encodings:

$ ruby -e 'p [Encoding.default_external, Encoding.default_internal]'
[#<Encoding:UTF-8>, nil]
$ ruby -E Shift_JIS \
> -e 'p [Encoding.default_external, Encoding.default_internal]'
[#<Encoding:Shift_JIS>, nil]
$ ruby -E :UTF-16LE \
> -e 'p [Encoding.default_external, Encoding.default_internal]'
[#<Encoding:UTF-8>, #<Encoding:UTF-16LE>]
$ ruby -E Shift_JIS:UTF-16LE \
> -e 'p [Encoding.default_external, Encoding.default_internal]'
[#<Encoding:Shift_JIS>, #<Encoding:UTF-16LE>]

As you can see, the argument for this switch is just like what you would append to a mode String in a call to File.open().

There's one more command-line switch shortcut for those of us who prefer to just use UTF-8 everywhere. The new -U switch sets Encoding.default_internal() to UTF-8. Using that, you can just set the external Encoding for your IO objects, or let it default from your environment, and all Strings you read will be transcoded to the preferred UTF-8.

Probably the most important thing to note about Encoding.default_external() and Encoding.default_internal() is that you should really just treat them as shortcuts for your own scripting. Pulling Encodings from the environment or command-line switches can be handy when you're in control of where the code runs, but you're going to need to be more explicit for code you intend for others to run. When in doubt, set the external and internal Encodings the way you want them for each IO object. It's a bit more tedious, but also safer in that it won't mysteriously be changed by some outside force. Also remember that the defaults are global settings affecting all loaded code, including any libraries you require(). That can be a boon or bane, so just remember to factor it into your thinking when you're wondering, "Where does this String get its Encoding from?"

Encoding Conversion With iconv

2014-04-17T19:14:31Z

There's one last standard library we need to discuss for us to have completely covered Ruby 1.8's support for character encodings. The iconv library ships with Ruby and it can handle an impressive set of character encoding conversions.

This is an important piece of the puzzle. You may have accepted my advice that it's OK to just work with UTF-8 data whenever you have the choice, but the fact is that there's a lot of non-UTF-8 data in the world. Legacy systems may have produced data before UTF-8 was popular, some services may work in different encodings for any number of reasons, and not quite everyone has embraced Unicode fully yet. If you run into data like this, you will need a way to convert it to UTF-8 as you import it and possibly a way to convert it back when you export it. That's exactly what iconv does.

Instead of jumping right into Ruby's iconv library, let's come at it with a slightly different approach. iconv is actually a C library that performs these conversions and on most systems where it is installed you will have a command-line interface for it.

It's very easy to use the iconv program. Just always follow these three steps:

Tell iconv the encoding you want it to write data out in, including any special translation instructions
Tell iconv the encoding data will be passed to it in
Send the input into iconv on STDIN (or just list the files as arguments, if you prefer) and redirect iconv's STDOUT to where you want output to be written

For example, let's say I have some UTF-8 data:

$ echo "Résumé" > utf8.txt
$ wc -c utf8.txt 
       9 utf8.txt

My terminal works in UTF-8, so that's the data echo wrote into the file. You can see that it's encoded now because we have nine bytes in the file (one each for "R", "s", "u", "m", and "\n" plus two for each "é").

Here's how we would convert that data to Latin-1 using iconv:

$ iconv -t LATIN1 -f UTF8 < utf8.txt > latin1.txt
$ wc -c latin1.txt 
       7 latin1.txt

You can see the conversion worked, because an "é" is only one byte in Latin-1 and we dropped two bytes.

Note my use of all three steps here:

I used -t LATIN1 to set the to encoding without any special translations
I used -f UTF8 to set the from encoding
I used < utf8.txt to redirect data in and > latin1.txt to redirect data out of the program

Those are always the steps as I said before.
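If you would rather check the byte arithmetic from Ruby, here's a minimal sketch. Note that it assumes Ruby 1.9 or later and uses String#encode() rather than iconv, and that utf8_and_latin1_sizes() is a hypothetical helper name.

```ruby
# Hypothetical helper: byte counts of the same text in UTF-8 and in Latin-1.
def utf8_and_latin1_sizes(text)
  [text.encode("UTF-8").bytesize, text.encode("ISO-8859-1").bytesize]
end

p utf8_and_latin1_sizes("Résumé\n")  # => [9, 7]
```

Each "é" costs two bytes in UTF-8 and one in Latin-1, so the two of them account for the two byte difference wc reported.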

You only need to know two more things about iconv. First, iconv supports a truck load of encodings, including all of the common encodings I've been talking about in this series. They vary some on different platforms though, so you will need to check what is available to you:

$ iconv --list
ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US
  ISO_646.IRV:1991 US US-ASCII CSASCII
UTF-8 UTF8
UTF-8-MAC UTF8-MAC
ISO-10646-UCS-2 UCS-2 CSUNICODE
UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
UCS-2LE UNICODELITTLE
ISO-10646-UCS-4 UCS-4 CSUCS4
UCS-4BE
UCS-4LE
UTF-16
…

Each line of that listing shows a single encoding. The space separated lists on each line are all aliases for that encoding that iconv will accept. Thus that first long line that I had to break into two provides a bunch of aliases for US-ASCII. We can also see by reading down a bit that iconv will accept UTF8 or UTF-8.

The last thing to know about iconv is that it has some special translation modes. To see those in action, let's work with a different piece of data:

$ echo "On and on… and on…" > utf8.txt
$ cat utf8.txt 
On and on… and on…

That last character is an ellipsis or three dots all in one character. Unicode has that character, but Latin-1 does not. Let's see what happens if we try to convert the data now:

$ iconv -f UTF8 -t LATIN1 < utf8.txt > latin1.txt
iconv: (stdin):1:9: cannot convert
$ cat latin1.txt 
On and on

As you can see, I got an error when it reached the first occurrence of the problem character. The cat command also shows that it completely quit working there.

That may be what you need, so you can tell a user you can't work with their data. I often find though that I just need to do the best I can with the data that I have. iconv's translation modes can help with that.

First, you can ask iconv to ignore any characters that cannot be converted to the new encoding:

$ iconv -t LATIN1//IGNORE -f UTF8 < utf8.txt > latin1_wignore.txt
$ cat latin1_wignore.txt 
On and on and on

As you can see, we completed the entire translation that time, only losing the problematic characters. The //IGNORE sequence adds the translation mode. Modes are always specified after the output encoding. That's an improvement for sure, but it's possible to do even better in this case.

iconv has another translation mode where it will try to transliterate characters into an equivalent representation in the target encoding:

$ iconv -t LATIN1//TRANSLIT -f UTF8 < utf8.txt > latin1_wtranslit.txt
$ cat latin1_wtranslit.txt 
On and on... and on...

This time, instead of dropping the ellipsis characters, iconv replaced them with three full stops each. It's not as fancy as the Unicode character, but it gets the job done and we do a good job of keeping the meaning of the data.

//TRANSLIT can't convert absolutely everything you will see in the wild, so it's still possible to get errors when using it. You can combine the modes though by specifying //TRANSLIT//IGNORE. That will ask iconv to transliterate what it can and drop the rest. Note that order does matter there: you need to be sure it tries transliteration before ignoring the character.
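For comparison, here's a rough analog of those three behaviors in Ruby itself. This is only a sketch: it assumes a reasonably modern Ruby and uses String#encode(), not iconv. The three helper names are made up, and the tiny transliteration table is something I supply by hand, where iconv brings its own.

```ruby
# Hypothetical helpers loosely mirroring the iconv runs above.

# Like plain LATIN1: fail on the first character that can't be converted.
def to_latin1_strict(text)
  text.encode("ISO-8859-1")
end

# Like //IGNORE: drop anything Latin-1 can't represent.
def to_latin1_ignoring(text)
  text.encode("ISO-8859-1", undef: :replace, replace: "")
end

# Like //TRANSLIT, but only for the characters listed in the table.
def to_latin1_translit(text, table = { "…" => "..." })
  text.encode("ISO-8859-1", fallback: table)
end

text = "On and on… and on…"

begin
  to_latin1_strict(text)
rescue Encoding::UndefinedConversionError => error
  puts "cannot convert: #{error.message}"
end

puts to_latin1_ignoring(text)  # On and on and on
puts to_latin1_translit(text)  # On and on... and on...
```

A character missing from the table still raises an error in the last helper, which matches the warning above that transliteration can't cover everything.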

You can also give iconv specific translations for bytes it has trouble with. I've never needed that level of control though and find the translation modes help me do more with less effort. Have a quick browse through man iconv, if you are curious.

That's all you need to know about iconv. You are now a character conversion expert. Congratulations.

Of course, it would be nice to talk about how this affects Ruby. Let's do that.

The Ruby standard library is just like the program we've been playing with. It just provides a method interface to the underlying C code. To show that, here's the same conversion we started with:

#!/usr/bin/env ruby -wKU

require "iconv"

utf8 = "Résumé"
utf8.size  # => 8

latin1 = Iconv.conv("LATIN1", "UTF8", utf8)
latin1.size  # => 6

You can see that the steps are exactly the same. The first parameter is your target encoding and the second is the encoding your data is currently in. You pass the data to convert in the last parameter and the return value of the call is the result.

If you are going to do several conversions in a row, it's slightly easier to create an Iconv instance and just reuse that:

#!/usr/bin/env ruby -wKU

require "iconv"

utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")

resume = "Résumé"
utf8_to_latin1.iconv(resume).size  # => 6

on_and_on = "On and on… and on…"
utf8_to_latin1.iconv(on_and_on)  # => "On and on... and on..."

That's all there is to it. The new() method builds an object that remembers the encodings you are converting and then you can call iconv() (instead of the conv() class method we used earlier) to convert data.

When things go wrong, the Ruby interface will raise exceptions like Iconv::InvalidEncoding or Iconv::InvalidCharacter. See the documentation for details.

The Ruby 1.8 library does not provide a way to programmatically list the supported encodings, which is one of the big reasons I started off showing you the command-line program instead. You will need to check them there. However, Ruby 1.9 adds a method for this:

$ ruby_dev -r iconv -r pp -ve 'pp Iconv.list'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
[["ANSI_X3.4-1968",
  "ANSI_X3.4-1986",
  "ASCII",
  "CP367",
  "IBM367",
  "ISO-IR-6",
  "ISO646-US",
  "ISO_646.IRV:1991",
  "US",
  "US-ASCII",
  "CSASCII"],
 ["UTF-8", "UTF8"],
…

This concludes our tour of character encoding tools for Ruby 1.8. In later posts, we will take a step back from all of this and examine what the problems with this system are. That will pave the way for us to discuss the new m17n (multilingualization) code in Ruby 1.9.

The $KCODE Variable and jcode Library

2014-04-17T15:47:22Z

All of the Ruby files I create start with the same shebang line:

#!/usr/bin/env ruby -wKU

It's not really needed for every file since it generally only matters if the file is executed. However, I tend to go ahead and add it to all Ruby files I build for several reasons:

You never know when a file may be executed (if __FILE__ == $PROGRAM_NAME; end sections are often added to libraries, for example)
It makes it obvious the file is Ruby code
It shows the rules this code expects: -w and -KU

The rules I mention here, specified by command-line switches, are the main point of interest. -w turns on Ruby's warnings which are very handy. I recommend doing that whenever you can. But that doesn't have anything to do with character encodings. -KU does.

-KU sets a magic Ruby variable: $-K or $KCODE. You can do the same in your code if you aren't in a position to control the command-line arguments:

$KCODE = "U"

You probably recognize the U as a name for Ruby 1.8's UTF-8 encoding, from my earlier list of encodings. It can also be set to N (the default), E, or S. Modern versions of Rails do set $KCODE = "U" for you.

So what does changing this magic variable do? First, it has the tiny effect of changing what Ruby escapes in inspect() output. Have a look:

$ ruby -e 'p "Résumé"'
"R\303\251sum\303\251"
$ ruby -KUe 'p "Résumé"'
"Résumé"

It's nice to be able to see your data as it actually is, assuming your terminal correctly handles UTF-8. However, that's really just a side-effect of setting $KCODE.

The main purpose of $KCODE is that it changes the default encoding of all regular expressions that do not specify otherwise. Thus we can split up UTF-8 data by characters without adding a /u to the end of our expression:

$ ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]
$ ruby -KUe 'p "Résumé".scan(/./m)'
["R", "é", "s", "u", "m", "é"]
$ ruby -KUe 'p "Résumé".scan(/./mn)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]

Notice that the default encoding for that second example was switched to UTF-8. However, I can still override this with an explicit encoding, as I did in example three by adding the /n option for None.

Now, I tend to prefer $KCODE over $-K because the former seems more common in Ruby literature. In fact, Ruby 1.8 uses the term in another place, providing a method to get the encoding used in a Regexp:

$ ruby -e 'p /./.kcode'
nil
$ ruby -e 'p /./u.kcode'
"utf8"

Beware of that harmless looking kcode() method though as it hides quite a few gotchas. First, you can see that it has its own names for the options that don't really match up with what we've seen elsewhere. It also doesn't seem to be aware of the $KCODE variable, in an ironic twist of naming:

$ ruby -e '$KCODE = "U"; re = /./m; p "Résumé".scan(re); p re.kcode'
["R", "é", "s", "u", "m", "é"]
nil

As you can see, the encoding of the expression was clearly set correctly, but kcode() didn't report the change. If you really want to know the encoding of a Regexp in Ruby 1.8, I suggest using code like the following:

class Regexp
  def encoding
    if kcode
      kcode[0, 1]
    # $KCODE reads back as a full name ("UTF8", "NONE", …), so check its first letter
    elsif %w[n N u U e E s S].include? $KCODE[0, 1]
      $KCODE[0, 1].downcase
    else
      "n"
    end
  end
end

Using just the first letter of kcode() should get us back to a standard set of letters. If kcode() isn't set, we can use $KCODE. However, do note that I make sure it's set to an expected value. You can set $KCODE to any junk value and Ruby will just silently ignore it (defaulting back to N), so it's good to reality check the contents when you rely on it. Finally, we just return the default if neither appears to be set.

That's really all there is to know about $KCODE, but Ruby 1.8 ships with a simple standard library called jcode that combines well with everything we've been discussing in these last two posts.

To use the jcode library, set $KCODE and then require the library. Setting $KCODE first is important, and you will receive a warning if you require jcode without setting $KCODE (as long as you took my advice and turned warnings on with -w):

$ ruby -r jcode -e 'p "Résumé".jsize'
8
$ ruby -w -r jcode -e 'p "Résumé".jsize'
Warning: $KCODE is NONE.
8

See, I told you -w was important.

As long as you do have $KCODE set properly, jcode adds a bunch of methods to String that work in characters. These methods are just simple wrappers over the techniques I showed you in my last post, so you get methods like jsize() which returns a count of characters instead of bytes:

$ ruby -KU -r jcode -e 'p "Résumé".jsize'
6

Probably the most useful method jcode adds is each_char():

$ ruby -KU -r jcode -e '"Résumé".each_char { |c| p c }'
"R"
"é"
"s"
"u"
"m"
"é"

See the documentation for the full method list.

Bytes and Characters in Ruby 1.8

2014-04-12T20:06:06Z

Gregory Brown said, in a training session at the Lone Star Rubyconf, "Ruby 1.8 works in bytes. Ruby 1.9 works in characters." The truth of Ruby 1.9 is maybe a little more complicated and we will discuss all of that eventually, but Greg is dead right about Ruby 1.8.

In Ruby 1.8, a String is always just a collection of bytes.

The important question is, how does that one golden rule relate to all that we've learned about character encodings? Essentially, it puts all the responsibility on you as the developer. Ruby 1.8 leaves it to you to determine what to do with those bytes and it doesn't provide a lot of encoding savvy help. That's why knowing at least the basics of encodings is so important when working with Ruby 1.8.

There are plusses and minuses to every system and this one is no exception. On the side of plusses, Ruby 1.8 can pretty much support any encoding you can imagine. After all, a character encoding is just some bytes that somehow map to a set of characters and all Ruby 1.8 Strings are just some bytes. If you say a String holds Latin-1 data and treat it as such, that's fine by Ruby.

I won't lie to you though, there are more minuses than plusses to this approach. Latin-1 is a pretty simple case since each byte is a character. With many other encodings though, like the UTF-8 encoding I've recommended we rely on, things get a lot more complicated.

Slicing up a Ruby 1.8 String by index means working in bytes and that means it's possible for us to accidentally break a multi-byte character. Running regular expressions over data faces similar issues. Those are just two examples of things we commonly do, but the truth is that many String operations just aren't encoding safe in Ruby 1.8. You can't even call simple things like reverse() on a String because it could break the order of those multi-byte characters. And remember that size() will always count bytes, not characters.

Ruby 1.8 is also never going to police the contents of a String. That means to Ruby 1.8 a String with valid UTF-8 data, a String with broken UTF-8 data, and a String with some bytes in Latin-1 and some in UTF-8 are all just Strings. It doesn't care. It's unlikely that the latter two are going to be of any use to you, so you will need to be the one making sure you don't create such problems. If you got String data from two separate sources in different encodings, you can't just combine them with a simple +.

This may be starting to sound a little bleak and it probably is. However, Ruby 1.8 throws one major exception into the works that can help you in many cases: the regex engine is aware of four character encodings. Often we can use this simple fact to work with characters.

What encodings does Ruby 1.8 know? Here's the full list:

None (n or N)
EUC (e or E)
Shift_JIS (s or S)
UTF-8 (u or U)

The None encoding is the default in Ruby 1.8. It's just the golden rule I've already mentioned: treat everything as bytes. If your encoding isn't on this list, you will need to use None and be darn sure you don't do anything to the data that could damage the encoding. That's very hard and the fact is that doing significant work with an encoding not on the above list in Ruby 1.8 will be quite a challenge for you.

Both EUC (Extended Unix Code) and Shift_JIS are primarily Asian character encodings. Shift_JIS is a Japanese encoding and EUC is mainly used for Japanese, Korean, and simplified Chinese. You can tell Ruby comes from Japan, can't you? Obviously these are very helpful if you work with text in those languages, but the rest of us won't need them much.

Now we get to the good news: our champion UTF-8 made the list! Yes, this means Ruby 1.8 has limited support for working with UTF-8 data. It's not comprehensive, but we get some help.

The letters listed after each encoding are used in multiple places inside Ruby 1.8 to tell it which encoding you need to work with. I'll point those places out as we get into the details.

What does it mean to have a character encoding on the above list? It means that the regex engine can recognize characters in that encoding, even if they are multibyte. That assures us that regular expression constructs that target characters, like character classes ([…]) and the match-one-character shortcut (.), will correctly match whatever number of bytes represents one character at that place in the data. It also changes the definition of constructs like \s and \w which can be used to match whitespace and word characters respectively. The definition of a "word" character in Unicode is quite a bit broader than the simple ASCII character class of [A-Za-z0-9_].

Let's look at some examples of this, so you can see how it works. I'll play around with a simple UTF-8 String in Ruby 1.8 and show you the various encoding effects. Remember that the default encoding is None, so that's what we get if we don't ask for anything else.

A common task in working with characters in Ruby 1.8 is to convert a String into an Array of characters. If we can do just that much, we can work around some of the weaknesses of Ruby 1.8's String always working in bytes. Given that, this almost does what we want:

$ ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]

You probably know that scan() just builds an Array of matches for the passed Regexp in the String receiver. The /m option I'm using here puts the regex engine in multi-line mode and in that a . matches all characters (it usually doesn't match newlines).

So what went wrong above? Well, the "é" characters in my String take two bytes in UTF-8. The golden rule tells us Ruby 1.8 works in bytes and that's definitely what we saw. It split up the bytes needed for those characters. This is bad, because if I now change this Array, I have excellent chances of breaking my data.

Again, that used the default None mode, because we didn't tell it to do otherwise. However, if we throw the regex engine into UTF-8 mode, we will get actual characters:

$ ruby -e 'p "Résumé".scan(/./mu)'
["R", "\303\251", "s", "u", "m", "\303\251"]

Notice how the two bytes needed for the "é" stay together now? (I'll show you how to get Ruby to stop escaping the content and show the actual "é" in my next post.) The regex engine saw that it takes both bytes to make a character in UTF-8, the encoding I requested, and thus the ., which matches one character, is forced to grab them both.

I chose UTF-8 mode by adding the /u option to my Regexp literal. You probably recognize the letter from my earlier list of encodings. Similarly, you can use /e for EUC, /s for Shift_JIS, and even /n for None though that's the default. Regexp.new() also accepts a third parameter for these encodings if you are creating expressions that way: Regexp.new(".", Regexp::MULTILINE, "u").

Using this one simple trick, we can fix some of the unsafe String methods I mentioned earlier. For example, Ruby 1.8 normally counts bytes with size():

$ ruby -e 'p "Résumé".size'
8

but we can now count characters, if desired:

$ ruby -e 'p "Résumé".scan(/./mu).size'
6

We can also fix the dangerous reverse() method which would normally break our multibyte "é" characters by screwing up the byte order:

$ ruby -e 'p "Résumé".reverse'
"\251\303mus\251\303R"

"\303\251" is a UTF-8 "é", but the "\251\303" we see here is broken UTF-8 data that doesn't mean anything. We can fix that with:

$ ruby -e 'p "Résumé".scan(/./mu).reverse.join'
"\303\251mus\303\251R"

This time we use the regex engine to divide the String into a character Array, then we reverse() that and join() it back into a String. You can see that this kept the "é" bytes in the proper order.
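The scan() trick can be wrapped up in a few small helpers. This is only a sketch: the names utf8_chars, char_count, and char_reverse are mine and not part of Ruby, and the \u escapes assume the code is run on Ruby 1.9 or later (on Ruby 1.8 you would type the "é" directly or use the "\303\251" byte escapes instead).

```ruby
# Hypothetical helpers (not part of Ruby) wrapping the scan(/./mu) idiom.

# Split a UTF-8 String into an Array of whole characters.
def utf8_chars(str)
  str.scan(/./mu)
end

# Count characters instead of bytes.
def char_count(str)
  utf8_chars(str).size
end

# Reverse by character so multibyte sequences stay intact.
def char_reverse(str)
  utf8_chars(str).reverse.join
end

resume = "R\u00E9sum\u00E9"  # "Résumé"
p char_count(resume)         # 6
p char_reverse(resume)
```

Keeping the Regexp in one place means there is only one spot to change if you ever need a different encoding option.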

Really study these examples above until you understand what's going on here. This is all the support Ruby 1.8 provides for working with characters, so you need to understand how to use it.

Here's one last set of examples showing the other regex change I mentioned:

$ ruby -e 'p "Résumé"[/\w+/]'
"R"
$ ruby -e 'p "Résumé"[/\w+/u]'
"R\303\251sum\303\251"

In the default None mode, \w is the same as [A-Za-z0-9_]. That doesn't match the special bytes needed to build the "é" character, so the match ends there. Note that UTF-8 mode changes that though and we get the full word.

Ruby 1.8 doesn't provide a whole lot of additional encoding support outside the regex engine. There is one magic variable and some helpful standard libraries we will discuss in future posts, but the main part of Ruby 1.8's character encoding support is just this.

One other small feature that may be worth a quick mention is that you can get Unicode code points using String's unpack() method:

$ ruby -e 'p "Résumé".unpack("U*")'
[82, 233, 115, 117, 109, 233]

The U code tells unpack() to convert a character into a Unicode code point and the * just repeats it for all characters in the String.

I don't find myself needing to work with code points often, but you can use this for one interesting cheat. The Unicode code points are a superset of the byte values used in Latin-1, so you can actually convert between the two encodings using just unpack() and pack():

utf8 = latin1.unpack("C*").pack("U*")
# ... or ...
latin1 = utf8.unpack("U*").pack("C*")  # more dangerous

However, I'll show you a superior way to handle encoding conversions in a future post.
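To make the "more dangerous" comment concrete, here is a minimal sketch of both directions. The helper names are hypothetical, and the guard in utf8_to_latin1 is my own addition: it refuses any code point above 255 rather than letting pack("C*") squeeze it into a single byte.

```ruby
# Hypothetical helpers; the names are illustrative only.

# Latin-1 to UTF-8: every Latin-1 byte value is also a valid code point.
def latin1_to_utf8(latin1)
  latin1.unpack("C*").pack("U*")
end

# UTF-8 to Latin-1: refuse code points that do not fit in one byte.
def utf8_to_latin1(utf8)
  code_points = utf8.unpack("U*")
  too_big     = code_points.find { |cp| cp > 0xFF }
  raise ArgumentError, "U+%04X has no Latin-1 equivalent" % too_big if too_big
  code_points.pack("C*")
end

latin1 = [82, 233, 115, 117, 109, 233].pack("C*")  # "Résumé" as Latin-1 bytes
utf8   = latin1_to_utf8(latin1)

p utf8.unpack("C*")               # [82, 195, 169, 115, 117, 109, 195, 169]
p utf8_to_latin1(utf8) == latin1  # true
```

The first direction is safe because nothing can fall outside the target range. The second is only safe when you know the data sticks to characters Latin-1 actually has.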

It's important to remember that this is not full character encoding support. For example, there is a long list of rules about how to correctly convert some Unicode characters to upper case, but upcase() doesn't know them and you cannot regex your way out of that mess. If you need these features for a given encoding, you will need to look for an external library that meets your needs or roll your own solution.

General Encoding Strategies

2014-04-12T19:30:46Z

Before we get into specifics, let's try to distill a few best practices for working with encodings. I'm sure you can tell that there's a lot that needs to be considered with encodings, so let's try to focus in on a few key points that will help us the most.

Use UTF-8 Everywhere You Can

We know UTF-8 isn't perfect, but it's pretty darn close to perfect. There is no other single encoding you could pick that has the potential to satisfy such a wide audience. It's our best bet. For these reasons, UTF-8 is quickly becoming the preferred encoding for the Web, email, and more.

If you have a say over what encoding or encodings your software will accept, support, and deliver, choose UTF-8 whenever you can. This is absolutely the best default.

Get in the Habit of Documenting Your Encodings

We learned that you must know the encoding of your data to work with it properly. While there are tools to help you guess an encoding, you really want to avoid being in this position. Part of how to make that happen is to be a good citizen and make sure you are documenting your encodings at every step.

If you send an email, make sure it specifies a correct character set. Add a meta tag to Web pages to state the encoding. View the source of this page for an example. Document the encodings accepted and returned by your APIs. This will raise everyone's encoding awareness, which helps us all.

Develop Your Encoding Safe Senses

You need to get into the habit of thinking, "Is this encoding safe?" When you call a method, ask the question. When you hand your data off to some process, reality check some results.

Have you ever done something like str[1..-2] in Ruby 1.8? I sure have and it's not safe. You're cutting bytes there and that may dice a bigger character into pieces. Then your data is junk.
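Ruby 1.9 makes the difference easy to see, because String indexing there counts characters while byteslice() cuts raw bytes the way Ruby 1.8's indexing does. A small sketch, assuming Ruby 1.9.3 or later (byteslice() isn't available before that) and a sample word I picked for the demonstration:

```ruby
word = "\u00E9t\u00E9"  # "été": 3 characters, 5 bytes in UTF-8

by_bytes = word.byteslice(1..-2)  # the cut Ruby 1.8's word[1..-2] would make
by_chars = word[1..-2]            # Ruby 1.9 slices by character

p by_bytes.valid_encoding?  # false: both "é" characters were cut in half
p by_chars                  # "t"
```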

This may sound like paranoia, but it's really not as bad as it seems. There tend to just be a few key points where you need to go out of your way to protect the data and it's asking this question repeatedly that teaches you to spot those.

To give an example, while enhancing the standard CSV library for Ruby 1.9's m17n (multilingualization) implementation, I needed to use some user provided data in a Regexp. That's easy right?

Regexp.escape(data)

Luckily, my instincts were just good enough to wonder, is that safe? I fed some UTF-32 data to Regexp.escape() to find out. Remember, multibyte encodings that will show some seemingly normal data are great for testing edge cases. Ruby broke my data:

p Regexp.escape("+".encode("UTF-32BE"))
"\x00\x00\x00\\+"

Now, this was just a case of Ruby 1.9 still being raw around the edges. It looks like this has been fixed in current builds:

$ ruby_dev -ve 'p Regexp.escape("+".encode("UTF-32BE"))'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
"\x00\x00\x00\\\x00\x00\x00+"

Still, the point stands: you can't even trust Ruby at times. Be cautious.

The natural conclusion of this is that you want to know how encodings are handled all through the pipeline your data will pass through. Does your HTML arrange to receive form data in UTF-8? Is Ruby in UTF-8 mode when it receives that data? Does the MySQL table you store that data in have an encoding set to UTF-8? Modern versions of Rails even handle two of those three steps for you. That's why it's important to look into the tools you use.

These strategies aren't all you will need, but they are a terrific start. This is not too much to remember and it will greatly increase your awareness of the issues. That's the most important thing.

The Unicode Character Set and Encodings

2015-12-17T20:15:16Z

Since the rise of the various character encodings, there has been a quest to find the one perfect encoding we could all use. It's hard to get everyone to agree about whether or not this has truly been accomplished, but most of us agree that Unicode is as close as it gets.

The goal of Unicode was literally to provide a character set that includes all characters in use today. That's letters and numbers for all languages, all the images needed by pictographic languages, and all symbols. As you can imagine that's quite a challenging task, but they've done very well. Take a moment to browse all the characters in the current Unicode specification to see for yourself. The Unicode Consortium often reminds us that they still have room for more characters as well, so we will be all set when we start meeting alien races.

Now in order to really understand what Unicode is, I need to clear up a point I've played pretty loose with so far: a character set and a character encoding aren't necessarily the same thing. Unicode is one character set, and has multiple character encodings. Allow me to explain.

A character set is just the mapping of symbols to their magic number representations inside the computer. Unicode calls these numbers code points and they are usually written in the form U+0061 where the U+ means Unicode and the four digits are the code point in hexadecimal. Thus 0061 is 97. That happens to be the Unicode code point for a and if you remember my previous post well, you will recognize that matches up with US-ASCII. We'll talk more about that in a bit. It is worth noting though that Ruby 1.8 and 1.9 can show you these code points:

$ ruby -vKUe 'p "aé…".unpack("U*")'
ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
[97, 233, 8230]
$ ruby_dev -ve 'p "aé…".unpack("U*")'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
[97, 233, 8230]

The U pattern for unpack() asks for a Unicode code point and the * just repeats it for each character. Note that I used the -KU switch to get Ruby 1.8 in UTF-8 mode. Ruby 1.9 assumed UTF-8 because of how my environment is configured. We will talk a lot more about those details when we get into specific language features.
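If you would rather see those numbers in the usual U+ notation, a format string gets you there. This is just a sketch, with the \u escapes assuming Ruby 1.9 or later:

```ruby
str = "a\u00E9\u2026"  # "aé…"

# Format each code point as four (or more) uppercase hexadecimal digits.
labels = str.unpack("U*").map { |code_point| "U+%04X" % code_point }

p labels  # ["U+0061", "U+00E9", "U+2026"]
```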

Code points aren't what actually gets recorded in a file, they are just abstract numbers for each character. How those characters get written into a data stream is an encoding. There are multiple encodings for Unicode or multiple ways to record those abstract numbers into files.

Different encodings have different strengths. For example, one possible encoding of Unicode is UTF-32, where 32 bits (or four bytes) are reserved for each code point. This has the advantage that you can always count on four bytes being used (unlike variable length encodings, which we will discuss shortly). An obvious downside though is the wasted space. I mean if you have all ASCII data, you only really need one byte each, but UTF-32 will use four without exception.
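The size trade-off is easy to measure in Ruby 1.9. A quick sketch using bytesize(), which counts bytes regardless of the Encoding:

```ruby
ascii_text = "abc"

utf8  = ascii_text.encode("UTF-8")
utf32 = ascii_text.encode("UTF-32BE")

p utf8.bytesize                  # 3
p utf32.bytesize                 # 12
p utf32.bytesize / utf32.length  # 4 bytes for every character
```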

You do need to be very careful how you work with multibyte encodings. UTF-32 is a good example of one that can be pretty tricky, because parts of the data can look normal. For example, look at this simple String as Ruby 1.9 sees it:

$ ruby_dev -ve 'p "abc".encode("UTF-32BE")'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
"\x00\x00\x00a\x00\x00\x00b\x00\x00\x00c"

There are a lot of null bytes in there, but notice how there are also normal "a", "b", and "c" bytes. I'm not going to show how this could happen to avoid encouraging bad habits, but if you replaced just the "a" byte with two bytes like "ab" your encoding is now broken and will eventually cause you problems. You also have to be careful anytime you slice up a String to make sure you don't divide the content mid-character.

Another possible encoding of Unicode is UTF-8. It has become pretty popular for things like email and web pages in recent years for several reasons. First, UTF-8 is 100% compatible with US-ASCII. The lowest 128 code points match their US-ASCII equivalents and UTF-8 encodes these in a single byte. Ruby 1.9 can show us this:

$ cat ascii_and_utf8.rb 
str   = "abc"
ascii = str.encode("US-ASCII")
utf8  = str.encode("UTF-8")

[ascii, utf8].each do |encoded_str|
  p [encoded_str, encoded_str.encoding.name, encoded_str.bytes.to_a]
end
$ ruby_dev -v ascii_and_utf8.rb 
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
["abc", "US-ASCII", [97, 98, 99]]
["abc", "UTF-8", [97, 98, 99]]

I've used several new Ruby 1.9 features here. I don't want to go too deeply into these at this point but briefly: encode() allows me to transcode a String from its current encoding to the one I pass the name for, encoding() gives me the current Encoding object for that String and name() turns that into a simple name, and finally Ruby 1.9 Strings provide Enumerators to walk the content by bytes(), chars(), codepoints(), or lines() and I use that to get the actual bytes here. I promise we will talk a lot more about these when we get to handling encodings in Ruby 1.9.
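As a quick taste of those four views of a String, here is a sketch on a mixed String of my own choosing. The to_a() calls are there because Ruby 1.9 hands back Enumerators (later versions return Arrays, where to_a() is harmless):

```ruby
str = "a\u00E9\u2026"  # "aé…"

bytes       = str.bytes.to_a       # one entry per byte
chars       = str.chars.to_a       # one entry per character
code_points = str.codepoints.to_a  # one entry per Unicode code point
lines       = str.lines.to_a       # one entry per line

p bytes        # [97, 195, 169, 226, 128, 166]
p chars.size   # 3
p code_points  # [97, 233, 8230]
p lines.size   # 1
```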

For now the key point to notice about this example is that US-ASCII and UTF-8 are the same all the way down to the bytes.

Of course, 128 characters isn't enough to contain the super large Unicode character set. Eventually you need more bytes. UTF-8 is a variable length encoding that uses more bytes to represent larger code points as needed. It does this with a simple set of rules:

Single byte characters always have a 0 in the most significant bit: 0xxxxxxx.
The number of leading 1 bits in the first byte shows how many bytes a multibyte code point takes up. Thus the first byte of a two byte character will be 110xxxxx and the first byte of a three byte character will be 1110xxxx.
All other bytes of multibyte sequences begin with 10: 10xxxxxx.

Again, we can ask Ruby 1.9 to show this:

$ cat utf8_bytes.rb 
# encoding:  UTF-8

chars = %w[a é …]
chars.each do |char|
  p char.bytes.map { |b| "%08b" % b }
end
$ ruby_dev utf8_bytes.rb 
["01100001"]
["11000011", "10101001"]
["11100010", "10000000", "10100110"]

Notice how different characters are different lengths and how the byte patterns show what to expect as I just described. This makes UTF-8 a little safer to manipulate, because you won't see a bare "a" byte that isn't really an "a" in the data. You do still have to be careful how you slice up a String though to avoid breaking up multibyte characters.
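Those bit patterns can also be turned into a tiny classifier. This is a sketch under a couple of assumptions: utf8_byte_kind is a name I made up, the four byte case (11110xxx) simply extends the pattern described above, and the method only inspects the leading bits, so it is not a full UTF-8 validator.

```ruby
# Hypothetical helper: classify one byte of UTF-8 data by its leading bits.
def utf8_byte_kind(byte)
  case byte
  when 0x00..0x7F then :single         # 0xxxxxxx
  when 0x80..0xBF then :continuation   # 10xxxxxx
  when 0xC0..0xDF then :lead_of_two    # 110xxxxx
  when 0xE0..0xEF then :lead_of_three  # 1110xxxx
  when 0xF0..0xF7 then :lead_of_four   # 11110xxx
  else                 :invalid
  end
end

["a", "\u00E9", "\u2026"].each do |char|  # "a", "é", "…"
  p char.bytes.map { |byte| utf8_byte_kind(byte) }
end
```

Because continuation bytes can never be mistaken for single byte characters, code can always tell whether it is standing in the middle of a character.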

All of these facts combine to make UTF-8 a very good choice for universal character encodings, in my opinion. The characters you need will be there. Simple ASCII content will be unchanged. Most software has at least some support for UTF-8 now as well.

Is Unicode perfect? No, it's not.

Some characters have multiple representations. For example, the Unicode code points are actually a superset of Latin-1 and thus include single code point versions of accented characters like é. Unicode also has the concept of combining marks though, where the accent would have one code point and the letter another. Those are combined into one character when displayed. This creates some oddities where two Strings could appear to contain the same content but not test equal depending on how they are compared. It also lessens the benefit of an encoding like UTF-32 since four bytes are only guaranteed per code point, but it can take multiple code points to build a character.
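Here is what that oddity looks like in practice. This sketch assumes a much newer Ruby than the ones used elsewhere in this post: unicode_normalize() only arrived in Ruby 2.2, so treat that last line as optional.

```ruby
composed   = "\u00E9"   # é as one code point
decomposed = "e\u0301"  # e followed by a combining acute accent

p composed == decomposed      # false, even though both display as "é"
p composed.codepoints.to_a    # [233]
p decomposed.codepoints.to_a  # [101, 769]

# Normalizing to the composed form (NFC) makes the two compare equal.
p decomposed.unicode_normalize(:nfc) == composed  # true
```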

Asian cultures have also been slow to adopt Unicode for a few reasons. First, Unicode usually makes their data larger. For example, Shift JIS can represent all the Japanese characters in two bytes while most of them will be three bytes in UTF-8. Hard drive space is pretty cheap these days, but a 1.5x multiplier on most of your data can be a factor in some cases.
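You can check that multiplier yourself in Ruby 1.9. A sketch with a three character sample of my own choosing:

```ruby
japanese = "\u65E5\u672C\u8A9E"  # "日本語", the word for the Japanese language

sjis = japanese.encode("Shift_JIS")
utf8 = japanese.encode("UTF-8")

p sjis.bytesize  # 6: two bytes per character
p utf8.bytesize  # 9: three bytes per character
```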

The Unicode Consortium also had to make some hard choices when specifying all of these characters. One such choice, known as Han Unification, was heavily debated for a while. I think many people recognize why the decision was made these days, but the debate definitely slowed Unicode adoption, especially in Japan.

Finally, there's a lot of data out there not in a Unicode encoding. Unfortunately, there are issues that can make it hard to convert this data to Unicode flawlessly. All of these factors combine to make a Unicode-as-a-one-encoding-fits-all philosophy not totally flawless.

Still, it's absolutely your best bet for support of a wide audience in a single encoding.

Key take-away points:

A character set isn't quite the same as an encoding
Unicode is one character set that can be encoded several different ways
Unicode is designed to support all characters used by all people
You won't find a better default encoding for modern day software as Unicode satisfies a much higher percentage of the world's population than any other single encoding
UTF-8 is probably the best Unicode encoding to work with when you have the choice because of how well it fits in with plain US-ASCII and the fact that it's a little safer to work with
Multibyte encodings can be tricky to work with properly, especially encodings like UTF-32 that can contain some normal looking data

What is a Character Encoding?

2014-04-12T19:12:31Z

The first step to understanding character encodings is that we're going to need to talk a little about how computers store character data. I know we would love to believe that when we push the a key on our keyboard, the computer records a little a symbol somewhere, but that's just fantasy.

I imagine most of us know that deep in the heart of computers pretty much everything is eventually in terms of ones and zeros. That means that an a has to be stored as some number. In fact, it is. We can see what number using Ruby 1.8:

$ ruby -ve 'p ?a'
ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
97

The unusual ?a syntax gives us a specific character, instead of a full String. In Ruby 1.8 it does that by returning the code of that encoded character. You can also get this by indexing one character out of a String:

$ ruby -ve 'p "a"[0]'
ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
97

These String behaviors were deemed confusing by the Ruby core team and have been changed in Ruby 1.9. They now return one character Strings. If you want to see the character codes in Ruby 1.9 you can use getbyte():

$ ruby_dev -ve 'p "a".getbyte(0)'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
97
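For completeness, here is a small sketch of the Ruby 1.9 side of that change, along with ord() and chr() for moving between a character and its number:

```ruby
char = ?a  # a one character String in Ruby 1.9

p char             # "a"
p char.ord         # 97
p char.getbyte(0)  # 97
p 97.chr           # "a"
```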

That shows us how to get the magic number, but it doesn't tell us what the number really is. When it was decided that we would need to store character data as numbers, a simple chart was made mapping some numbers to certain characters. This mapping is known as US-ASCII or just ASCII.

Now ASCII covers everything you would find on an English keyboard: letters in upper and lower case, numbers, and some common symbols. There was even some room left in the 128 character ASCII mapping for some control character sequences.

Life was perfect, right? Uh, no.

This led to two facts that went together beautifully:

The entire world can't quite get by on just these characters, surprisingly enough
We had more room in each byte since ASCII was only using seven of the eight bits in a byte (that's how you get 128 characters)

Awesome. We still had a spare bit that could buy us 128 more characters and we needed more characters. It was serendipity! Just about everyone had great ideas for how we should use these extra 128 characters and they all used them in their own way. Character encodings were born.

Because those extra 128 characters could change meaning depending on exactly whose scheme we're using now, we say the character data is encoded in that scheme. You will need to know which encoding is used for that data to read it correctly.

To give one specific example, the character encoding ISO-8859-1 (also known as Latin-1) is a common default in some operating systems, programs, and even programming languages. It fills the extra characters primarily with accented characters useful to many European languages.
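To see how one byte can mean different things, we can ask Ruby 1.9 to interpret the same byte under two different schemes. This is only a sketch: I picked ISO-8859-5 (a Cyrillic encoding) as the second scheme purely for contrast, and force_encoding() just relabels the bytes without changing them.

```ruby
byte = [0xE9].pack("C")  # a single raw byte with the high bit set

# Label the same byte two different ways, then transcode each to UTF-8.
as_latin1   = byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_cyrillic = byte.dup.force_encoding("ISO-8859-5").encode("UTF-8")

p as_latin1.unpack("U*")    # [233], the code point for "é"
p as_cyrillic.unpack("U*")  # a different code point, in the Cyrillic range
p as_latin1 == as_cyrillic  # false
```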

Now if it was really just about those extra 128 characters, things still wouldn't be too tricky. Unfortunately, there's one more twist: even 256 characters aren't enough for some languages. Since 256 is all the numbers we can squeeze out of one little byte, these languages need multibyte character encodings, where it can take more than just one byte to represent a single character.

Multibyte encodings are generally trickier to work with. You have to be very careful not to divide data in such a way that a character might be split between the first and second byte (or between other bytes for bigger encodings).

Japanese is a great example here. Because they have symbols for most words instead of just the pieces used to make words, their language has a few thousand symbols in common usage. One popular Japanese character encoding is Shift JIS and it needs two bytes to fit some of these characters in.

I've only shared a few specific examples here, but the truth is that there are quite a few encodings in common usage today. You don't necessarily need to support all of these encodings in every program and, in truth, there are some good reasons not to. A good first step is just being aware that different encodings exist and different people store their data in different ways. Modern day programmers can no longer afford to remain ignorant of these issues.

If you think about it, I'm sure you can imagine instances where the encoding was wrong. Ever seen a slew of question marks or funny box shaped characters in your email client or shell? Often this is a sign of the data not being encoded in the scheme the program expected. This led to the program not being able to display the content correctly. That's what we're trying to avoid.
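A related kind of junk is easy to reproduce on purpose. In this sketch I take the UTF-8 bytes for "é" and pretend a program wrongly assumed they were Latin-1. It assumes Ruby 1.9 or later for force_encoding() and encode():

```ruby
utf8_bytes = "\u00E9"  # "é" stored as the two UTF-8 bytes C3 A9

# Mislabel the bytes as Latin-1, then transcode back to UTF-8 for display.
misread = utf8_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")

p misread.length        # 2: one character became two
p misread.unpack("U*")  # [195, 169], which display as "Ã©"
```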

Key take-away points:

Different people the world over store their data in different ways
All character data has some encoding scheme that tells you how to interpret the data
You must know the encoding data is in to correctly process it
Some encodings are harder to work with than others, especially multibyte encodings
Junk output, like question marks and box shaped characters, is often what you see when programs get confused about the character encoding data is in