18 JUN 2009
What Ruby 1.9 Gives Us
In this final post of the series, I want to revisit our earlier discussion on encoding strategies. Ruby 1.9 adds a lot of power to the handling of character encodings as you have now seen. We should talk a little about how that can change the game.
UTF-8 is Still King
The most important thing to take note of is what hasn't changed with Ruby 1.9. I said a good while back that the best Encoding for general use is UTF-8. That's still very true.
I still strongly recommend that we favor UTF-8 as the one-size-almost-fits-all Encoding. I really believe that we can and should use it exclusively inside our code, transcode data to it on the way in, and transcode output when we absolutely must. The more of us that do this, the better things will get.
As we've discussed earlier in the series, Ruby 1.9 does add some new features that help our UTF-8-only strategies. For example, you could use things like the Encoding command-line switches (-E and -U) to set up automatic transcoding for all the input you read. These shortcuts are great for simple scripting, but I'm going to recommend you just be explicit about your Encodings in any serious code.
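For instance, here's roughly how the two approaches compare side by side (the file name and source Encoding are just placeholders for illustration):
# Leaning on the global switch (run as: ruby -E ISO-8859-1:UTF-8 script.rb):
data = File.read("data.txt")  # already transcoded to UTF-8 for you

# Being explicit in the code itself, which is what I recommend for serious work:
data = File.open("data.txt", "r:ISO-8859-1:UTF-8") { |f| f.read }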
New Rules
Ruby 1.9 literally gives us a whole new world of power to work with data as we see fit. As is usually the case though, our new powers come with new responsibilities. Start building your good Ruby 1.9 habits today:
- Add the magic comment to the top of all source files
- Explicitly declare the Encodings for an IO object when you open() it
Yes, this adds a little extra work, but the effort is worth it. Be disciplined in your awareness of Encodings and help Ruby know the right way to treat your data.
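In practice, those two habits look something like this (the file name and Encodings here are made up for the example):
# encoding: UTF-8
# The magic comment above, on the first line of the file, tells Ruby this
# source file's literals are UTF-8.

# Declare the Encodings explicitly whenever you open an IO:
File.open("data.csv", "r:Shift_JIS:UTF-8") do |io|
  io.each_line do |line|
    line.encoding  # => #<Encoding:UTF-8>, thanks to the declared internal Encoding
  end
end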
New Strategies
While UTF-8 is a great single choice, Ruby 1.9 gives us some exciting new options for character handling. I'll give just one example here, to get you thinking in the right direction, but the sky's the limit now and I'm sure we'll see some neat uses of the new system in the coming years.
When I converted the FasterCSV code to be the standard CSV library in Ruby 1.9, I really sat down and thought out how m17n should be handled. Here are some thoughts that led to my final plan:
- We tend to throw pretty big data at CSV parsers. We often use them for database dumps, for example.
- I expected to pay a performance penalty for constantly transcoding all incoming data to UTF-8. I'm not sure how big it would have been, but it's certainly more work than Ruby 1.8 does just reading some bytes. Naturally, I wanted the library to stay as quick as possible.
- Since the parser has always been able to read directly from any IO object, those who wanted UTF-8 transcoding already had a way to get it.
- CSV is a super simple format to parse, requiring only four standard characters that you can count on having in any Encoding Ruby supports.
- Finally, I just wanted to take the m17n features for a spin, of course!
All of this combined to form my strategy for the CSV library: don't transcode the data, transcode the parser instead.
If you transcode the data, you pay a penalty at every read. However, transcoding the parser is just a one-time upfront cost. The characters will be available in whatever format the data is in, and once the parser is transcoded we can just read and parse the data normally. The fields returned won't have gone through a conversion unless the user code explicitly sets that up. This seems to give everyone the choice to have their data the way they want it.
This process isn't too tough to realize, though it does get a bit tedious in places. The first step is just to figure out what Encoding the data is actually in. Here's the code from 1.9's CSV library that does that:
@encoding = if @io.respond_to? :internal_encoding
              @io.internal_encoding || @io.external_encoding
            elsif @io.is_a? StringIO
              @io.string.encoding
            end
@encoding ||= Encoding.default_internal || Encoding.default_external
That code just makes sure I set @encoding to the Encoding I'm actually going to be working with after all reads. If an internal_encoding() is set on an IO, the data will be transcoded into that and that's what I will be facing. Otherwise, the external_encoding() is what we will see. The code can also parse from a String directly by wrapping it in a StringIO object. When it does that, we can just ask the underlying String what the Encoding for the data is. If we can't find an Encoding, likely because it hasn't been set, we'll use the defaults because that's what Ruby is going to assume as well.
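To make that concrete, here's a quick standalone sketch of what each branch would turn up (the file and its Encodings are invented for the example):
require "stringio"

# An internal Encoding means reads are transcoded, so that's what we'll face:
io = File.open("data.csv", "r:Shift_JIS:UTF-8")
io.internal_encoding || io.external_encoding  # => #<Encoding:UTF-8>

# With no internal Encoding, the external Encoding is what each read returns:
io = File.open("data.csv", "r:Shift_JIS")
io.internal_encoding || io.external_encoding  # => #<Encoding:Shift_JIS>

# A String wrapped in a StringIO just reports its own Encoding:
io = StringIO.new("a,b,c".encode("UTF-16LE"))
io.string.encoding                            # => #<Encoding:UTF-16LE>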
Once we have the Encoding, we need a couple of helper methods that will build String and Regexp objects in that Encoding for us. Here are those simple methods:
def encode_str(*chunks)
  chunks.map { |chunk| chunk.encode(@encoding.name) }.join
end

def encode_re(*chunks)
  Regexp.new(encode_str(*chunks))
end
Those should be super straightforward if you've read my earlier discussion of how transcoding works. You can pass encode_str() one or more String arguments and it will transcode each one, then join() them into a complete whole. The encode_re() method just wraps encode_str(), since Regexp.new() will correctly set the Encoding of the new Regexp based on the Encoding of the passed String.
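To see the effect, here's a tiny sketch (stand-ins for the helpers, not the library code itself) of what they hand back when the data happens to arrive as UTF-16LE:
# Fix the target Encoding and mimic the two helpers with lambdas:
encoding   = Encoding::UTF_16LE
encode_str = ->(*chunks) { chunks.map { |c| c.encode(encoding.name) }.join }
encode_re  = ->(*chunks) { Regexp.new(encode_str[*chunks]) }

encode_str[",", "\r\n"].encoding            # => #<Encoding:UTF-16LE>
encode_re[","].encoding                     # => #<Encoding:UTF-16LE>
encode_re[","] =~ "a,b".encode("UTF-16LE")  # => 1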
Now for the tedious step. You have to completely avoid using bare String or Regexp literals for anything that will eventually interact with the raw data. For example, here is the code CSV uses to prepare the parser before it begins reading:
# Pre-compiles parsers and stores them by name for access during reads.
def init_parsers(options)
  # store the parser behaviors
  @skip_blanks      = options.delete(:skip_blanks)
  @field_size_limit = options.delete(:field_size_limit)

  # prebuild Regexps for faster parsing
  esc_col_sep = escape_re(@col_sep)
  esc_row_sep = escape_re(@row_sep)
  esc_quote   = escape_re(@quote_char)

  @parsers = {
    # for empty leading fields
    leading_fields: encode_re("\\A(?:", esc_col_sep, ")+"),
    # The Primary Parser
    csv_row:        encode_re(
      "\\G(?:\\A|", esc_col_sep, ")",        # anchor the match
      "(?:", esc_quote,                      # find quoted fields
      "((?>[^", esc_quote, "]*)",            # "unrolling the loop"
      "(?>", esc_quote * 2,                  # double for escaping
      "[^", esc_quote, "]*)*)",
      esc_quote,
      "|",                                   # ... or ...
      "([^", esc_quote, esc_col_sep, "]*))", # unquoted fields
      "(?=", esc_col_sep, "|\\z)"            # ensure field is ended
    ),
    # a test for unescaped quotes
    bad_field:      encode_re(
      "\\A", esc_col_sep, "?",               # an optional comma
      "(?:", esc_quote,                      # a quoted field
      "(?>[^", esc_quote, "]*)",             # "unrolling the loop"
      "(?>", esc_quote * 2,                  # double for escaping
      "[^", esc_quote, "]*)*",
      esc_quote,                             # the closing quote
      "[^", esc_quote, "]",                  # an extra character
      "|",                                   # ... or ...
      "[^", esc_quote, esc_col_sep, "]+",    # an unquoted field
      esc_quote, ")"                         # an extra quote
    ),
    # safer than chomp!()
    line_end:       encode_re(esc_row_sep, "\\z"),
    # illegal unquoted characters
    return_newline: encode_str("\r\n")
  }
end
Don't worry about breaking down those heavily optimized regular expressions. The point here is just to notice how everything is eventually passed through encode_str() or encode_re().
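If you're wondering why the bare literals matter so much, remember that a literal like /,/ in your source is a US-ASCII Regexp, and Ruby refuses to match it against data in an incompatible Encoding. A rough illustration, again pretending the data is UTF-16LE:
data = "a,b".encode("UTF-16LE")

# A bare literal blows up with an Encoding::CompatibilityError:
#   data =~ /,/
# A Regexp built in the data's Encoding matches just fine:
data =~ Regexp.new(",".encode("UTF-16LE"))  # => 1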
Those were the major changes needed inside the CSV code to get it to parse natively in the Encoding of the data. I did have to add more code due to some side issues I ran into, but they don't really relate to this strategy too much:
- Regexp.escape() didn't work correctly on all the Encodings I tested it with. It's improved a lot since then, but last I checked there were still some oddball Encodings it didn't support. Given that, I had to roll my own. If you want to see how I did that, check inside CSV.initialize() for how @re_esc and @re_chars get set and then have a look at CSV.escape_re().
- CSV's line ending detection reads ahead in the data by fixed byte counts. That's tricky to do safely with encoded data since you could always land in the middle of a character. See CSV.read_to_char() for how I work around that issue, if you are interested.
- Finally, testing the code with all the Encodings Ruby supports was a bit tricky, due to the concept of "dummy Encodings". See my discussion on those for details on how to filter them out of the mix; there's a tiny sketch of that filter just after this list.
Like anything, this strategy has its pluses and minuses. As I've already said, it's a touch tedious to have to avoid normal literals, and the added complexity makes the code a little harder to read and maintain. That's the price you pay.
Still, I think it shows some of the possibilities of what we can accomplish with Ruby's new features. We can stick to UTF-8 as our one-size-fits-all solution as we've done in the past; that's still a great idea in most cases. However, now we have some new options that were impractical with an older Ruby.
Comments (2)
- Sascha Konietzke, July 16th, 2009
  Thanks so much for those great articles James! I read them all and now have a much better understanding about Strings in Ruby 1.9.
- I totally agree with Sascha. Nice write up! It demystifies the whole encoding thing in Ruby...