Character Encodings

My extensive coverage of a complex topic all programmers should study a little.

June 18, 2009

What Ruby 1.9 Gives Us

In this final post of the series, I want to revisit our earlier discussion of encoding strategies. As you've now seen, Ruby 1.9 adds a lot of power to the handling of character encodings. We should talk a little about how that can change the game.

UTF-8 is Still King

The most important thing to take note of is what hasn't changed with Ruby 1.9. I said a good while back that the best Encoding for general use is UTF-8. That's still very true.

I still strongly recommend that we favor UTF-8 as the one-size-almost-fits-all Encoding. I really believe that we can and should use it exclusively inside our code, transcode data to it on the way in, and transcode output when we absolutely must. The more of us that do this, the better things will get.

As we've discussed earlier in the series, Ruby 1.9 does add some new features that help a UTF-8-only strategy. For example, you can use the Encoding command-line switches (-E and -U) to set up automatic transcoding for all the input you read. These shortcuts are great for simple scripting, but I'm going to recommend you just be explicit about your Encodings in any serious code.
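
Here's a quick sketch of both approaches, assuming some Shift_JIS input (the script and file names are hypothetical):

ruby -E Shift_JIS:UTF-8 process.rb  # read as Shift_JIS, transcode to UTF-8
ruby -U process.rb                  # shorthand:  default internal is UTF-8

Being explicit just means attaching the same information to each IO instead:

File.open("data.csv", "r:Shift_JIS:UTF-8") do |f|
  line = f.gets
  line.encoding  # => #<Encoding:UTF-8>
end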

New Rules

Ruby 1.9 gives us a whole new world of power to work with data as we see fit. As is usually the case though, our new powers come with new responsibilities. Start building your good Ruby 1.9 habits today: be disciplined in your awareness of Encodings, and help Ruby know the right way to treat your data.

Yes, this adds a little extra work, but the effort is worth it.

New Strategies

While UTF-8 is a great single choice, Ruby 1.9 gives us some exciting new options for character handling. I'll give just one example here, to get you thinking in the right direction, but the sky's the limit now and I'm sure we'll see some neat uses of the new system in the coming years.

When I converted the FasterCSV code to be the standard CSV library in Ruby 1.9, I really sat down and thought out how m17n should be handled. Here are some thoughts that led to my final plan:

  • We tend to throw pretty big data at CSV parsers. We often use them for database dumps, for example.
  • I expected to pay a performance penalty for constantly transcoding all incoming data to UTF-8. I'm not sure how big it would have been, but it's certainly more work than Ruby 1.8 does just reading some bytes. Naturally, I wanted the library to stay as quick as possible.
  • Since the parser has always been able to read directly from any IO object, those who wanted UTF-8 transcoding already had a way to get it.
  • CSV is a super simple format to parse, requiring only four standard characters (by default the comma, the double quote, and the carriage return and line feed of the row separator) that you can count on having in any Encoding Ruby supports.
  • Finally, I just wanted to take the m17n features for a spin, of course!

All of this combined to form my strategy for the CSV library: don't transcode the data, transcode the parser instead.

If you transcode the data, you pay a penalty on every read. Transcoding the parser, however, is a one-time upfront cost. The characters stay in whatever Encoding the data arrived in, and once the parser is transcoded we can read and parse the data normally. The fields returned won't have gone through a conversion, unless user code explicitly sets one up. This seems to give everyone the choice to have their data the way they want it.

This process isn't too tough to realize, though it does get a bit tedious in places. The first step is just to figure out what Encoding the data is actually in. Here's the code from 1.9's CSV library that does that:

@encoding =   if @io.respond_to? :internal_encoding
                # reads will be transcoded to internal_encoding(), when set
                @io.internal_encoding || @io.external_encoding
              elsif @io.is_a? StringIO
                # parsing a String directly:  just ask it
                @io.string.encoding
              end
# when no Encoding was found, assume what Ruby will assume
@encoding ||= Encoding.default_internal || Encoding.default_external

That code just makes sure @encoding gets set to the Encoding I'm actually going to be working with after all reads. If an internal_encoding() is set on the IO, the data will be transcoded into it, so that's what the parser will face. Otherwise, the external_encoding() is what we will see. The code can also parse from a String directly by wrapping it in a StringIO object; when it does that, we can just ask the underlying String what the Encoding of the data is. If we can't find an Encoding, likely because none has been set, we fall back to the defaults, because that's what Ruby is going to assume as well.
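
To see how those pieces line up, consider opening a file with both Encodings spelled out (the file name is hypothetical):

io = File.open("data.csv", "r:Shift_JIS:UTF-8")
io.external_encoding  # => #<Encoding:Shift_JIS>
io.internal_encoding  # => #<Encoding:UTF-8>

The parser would work in UTF-8 here. Drop the ":UTF-8" from the mode and internal_encoding() returns nil, so the parser would work in Shift_JIS instead.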

Once we have the Encoding, we need a couple of helper methods that will build String and Regexp objects in that Encoding for us. Here are those simple methods:

# Transcodes each chunk into the data's Encoding and joins the results.
def encode_str(*chunks)
  chunks.map { |chunk| chunk.encode(@encoding.name) }.join
end

# Builds a Regexp that inherits the Encoding of its transcoded source.
def encode_re(*chunks)
  Regexp.new(encode_str(*chunks))
end

Those should be super straightforward if you've read my earlier discussion of how transcoding works. You can pass encode_str() one or more String arguments, and it will transcode each one, then join() them into a complete whole. encode_re() just wraps encode_str(), since Regexp.new() will correctly set the Encoding based on the Encoding of the String it's passed.
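
For example, here's a quick, hypothetical run of those helpers, assuming they're defined and the detection code above landed on Shift_JIS:

@encoding = Encoding::Shift_JIS
line_end  = encode_re("\\r\\n", "\\z")  # like CSV's :line_end parser
row       = "日本語,データ\r\n".encode("Shift_JIS")
row =~ line_end  # => 7 -- matched natively, without transcoding the data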

Now for the tedious step. You have to completely avoid using bare String or Regexp literals for anything that will eventually interact with the raw data. For example, here is the code CSV uses to prepare the parser before it begins reading:

# Pre-compiles parsers and stores them by name for access during reads.
def init_parsers(options)
  # store the parser behaviors
  @skip_blanks      = options.delete(:skip_blanks)
  @field_size_limit = options.delete(:field_size_limit)

  # prebuild Regexps for faster parsing
  esc_col_sep = escape_re(@col_sep)
  esc_row_sep = escape_re(@row_sep)
  esc_quote   = escape_re(@quote_char)
  @parsers = {
    # for empty leading fields
    leading_fields: encode_re("\\A(?:", esc_col_sep, ")+"),
    # The Primary Parser
    csv_row:        encode_re(
      "\\G(?:\\A|", esc_col_sep, ")",                # anchor the match
      "(?:", esc_quote,                              # find quoted fields
             "((?>[^", esc_quote, "]*)",             # "unrolling the loop"
             "(?>", esc_quote * 2,                   # double for escaping
             "[^", esc_quote, "]*)*)",
             esc_quote,
             "|",                                    # ... or ...
             "([^", esc_quote, esc_col_sep, "]*))",  # unquoted fields
      "(?=", esc_col_sep, "|\\z)"                    # ensure field is ended
    ),
    # a test for unescaped quotes
    bad_field:      encode_re(
      "\\A", esc_col_sep, "?",                   # an optional comma
      "(?:", esc_quote,                          # a quoted field
             "(?>[^", esc_quote, "]*)",          # "unrolling the loop"
             "(?>", esc_quote * 2,               # double for escaping
             "[^", esc_quote, "]*)*",
             esc_quote,                          # the closing quote
             "[^", esc_quote, "]",               # an extra character
             "|",                                # ... or ...
             "[^", esc_quote, esc_col_sep, "]+", # an unquoted field
             esc_quote, ")"                      # an extra quote
    ),
    # safer than chomp!()
    line_end:       encode_re(esc_row_sep, "\\z"),
    # illegal unquoted characters
    return_newline: encode_str("\r\n")
  }
end

Don't worry about breaking down those heavily optimized regular expressions. The point here is just to notice how everything is eventually passed through encode_str() or encode_re().

Those were the major changes needed inside the CSV code to get it to parse natively in the Encoding of the data. I did have to add more code due to some side issues I ran into, but they don't really relate to this strategy too much:

  • Regexp.escape() didn't work correctly on all the Encodings I tested it with. It's improved a lot since then, but last I checked there were still some oddball Encodings it didn't support. Given that, I had to roll my own. If you want to see how I did that, check inside CSV.initialize() for how @re_esc and @re_chars get set and then have a look at CSV.escape_re().
  • CSV's line ending detection reads ahead in the data by fixed byte counts. That's tricky to do safely with encoded data, since you could always land in the middle of a character. See CSV.read_to_char() for how I work around that issue, if you are interested; I sketch the idea just after this list.
  • Finally, testing the code with all the Encodings Ruby supports was a bit tricky, due to the concept of "dummy Encodings". See my discussion on those for details on how to filter them out of the mix; a one-line version of that filter also follows below.
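
Here's a minimal sketch of that read-ahead fix (the method below is a hypothetical stand-in, not the actual CSV.read_to_char() source): read the bytes you wanted, then pull one more byte at a time until the result is valid in the target Encoding.

def read_to_char(io, bytes, encoding)
  data = io.read(bytes) or return nil  # raw bytes come back as ASCII-8BIT
  # if we split the last character, extend the read one byte at a time
  until data.dup.force_encoding(encoding).valid_encoding?
    extra = io.read(1) or break        # hit EOF:  return what we have
    data << extra
  end
  data.force_encoding(encoding)
end

A real implementation would also cap how many extra bytes it is willing to try, so truly invalid data can't drag that loop through the whole stream.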
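
As for those dummy Encodings, Ruby will happily point them out, so filtering them from an exhaustive test loop is a one-liner:

# dummies (like the stateful ISO-2022 family) can label data, but can't
# be processed character by character, so skip them when testing
testable_encodings = Encoding.list.reject(&:dummy?)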

Like anything, this strategy has pluses and minuses. As I've already said, it's a touch tedious to have to avoid normal literals, and the added complexity makes the code a little harder to read and maintain. That's the price you pay.

Still, I think it shows some of the possibilities of what we can accomplish with Ruby's new features. We can stick to UTF-8 as our one-size-fits-all solution as we've done in the past. That's still a great idea in most cases. However, now we have some new options that were impractically hard with an older Ruby.

Comments (2)
  1. Sascha Konietzke, July 16th, 2009

    Thanks so much for those great articles James! I read them all and now have a much better understanding about Strings in Ruby 1.9.

  2. Vincent, February 20th, 2011

    I totally agree with Sascha. Nice write up! It demystifies the whole encoding thing in Ruby...
