January 2, 2008

Getting FasterCSV Ready for Ruby 1.9

The call came down from on high just before the Ruby 1.9 release: replace the standard csv.rb library with faster_csv.rb. With only hours to make the change it was a little harder than I expected. The FasterCSV code base was pretty vanilla Ruby, but it required more work than I would have guessed to get running on Ruby 1.9. Let me share a few of the tips I learned while doctoring the code in the hope that it will help others get their code ready for Ruby 1.9.

Ruby's String Class Grows Up

One of the biggest changes in Ruby 1.9 is the addition of m17n (multilingualization). This means that Ruby's Strings are now encoding aware and we must clarify in our code if we are working with bytes, characters, or lines.
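As a rough illustration of that distinction (a sketch only, assuming Ruby 1.9 or later and UTF-8 text; the sample string is made up), the same String gives different answers depending on the unit you ask for:

```ruby
# encoding: utf-8
# Hypothetical sample text: two lines, two multibyte characters.
text = "résumé\nok\n"

byte_count = text.each_byte.to_a.size  # raw bytes; each "é" takes two in UTF-8
char_count = text.each_char.to_a.size  # encoding-aware characters
line_list  = text.each_line.to_a       # lines, with terminators kept

puts byte_count  # 12
puts char_count  # 10
p line_list      # ["résumé\n", "ok\n"]
```

The explicit each_byte(), each_char(), and each_line() calls make the intended unit obvious to the reader, which is the habit the new String class asks for.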

This is a good change, but the odds are that most of us have lazily used the old way to our advantage in the past. If you've ever written code like:

lines = str.to_a

you have bad habits to break. I sure did. Under Ruby 1.9 that code would translate to:

lines = str.lines.to_a

String#lines() returns an Enumerable::Enumerator by default (more on that shortly), so you need to add the to_a() call unless you are going to follow up with other iteration methods.

Now, if you need the code to run on both 1.8 and 1.9, you will need one more trick. First, if you just need to iterate over the lines you can use String#each_line() which is present in both versions. For less basic iterations, I recommend:

lines = str.send(str.respond_to?(:lines) ? :lines : :to_s).to_a

Here I just call String#lines() if it is available and a no-op String#to_s() when it's not. You can safely follow that with any Enumerable method and it will work in Ruby 1.8 and Ruby 1.9.
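For the simple iteration case mentioned above, a minimal sketch might look like the following. It relies only on String#each_line() called with a block; `collect_lines` is a hypothetical helper name used for illustration, not part of any library:

```ruby
# Collect the lines of a String using only String#each_line with a block.
# `collect_lines` is a made-up helper name for this example.
def collect_lines(str)
  lines = []
  str.each_line { |line| lines << line }
  lines
end

p collect_lines("one\ntwo\nthree")  # ["one\n", "two\n", "three"]
```

Because it never calls to_a() or lines() on the String, this version does not need a respond_to?() check at all.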

Enumerable#zip() Took a Beating

[Update: Both of my complaints about zip() were eventually addressed. The 1.8 behavior has been restored.]

If you were a fan of Enumerable#zip() under Ruby 1.8, odds are good that it's going to surprise you under Ruby 1.9.

First, the standard Enumerable::Enumerator library has been moved into the core, as we already saw with String#lines(). With this move, the core iteration methods have been enhanced to return an Enumerable::Enumerator if called without a block. This is generally a nice iterator chaining feature. For example, making the fictional but oft-requested map_with_index() is now as easy as:

enum.each_with_index.map {  }
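A concrete instance of that chained call, as a small sketch with made-up data (assuming Ruby 1.9 or later, where each_with_index() without a block returns an enumerator):

```ruby
# Made-up sample data for illustration.
letters = %w[a b c]

# each_with_index without a block returns an enumerator, so map can be chained
# and the block receives both the element and its index.
labeled = letters.each_with_index.map { |letter, index| "#{index}: #{letter}" }

p labeled  # ["0: a", "1: b", "2: c"]
```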

Enumerable#zip() may be the exception. It already had a meaningful return value when called without a block, but the new behavior overrides it, so you will now get an Enumerable::Enumerator when you probably expected an Array. I've found that I now need to type the following to get what I usually want:

enum.zip(other_enum).to_a

It's hard to see that as an improvement, but the fact is that it gets worse. For some reason I can't justify, another change was made to Enumerable#zip(). Let's look at what happens with Enumerable objects of different sizes under Ruby 1.8:

>> short = [1, 2]
=> [1, 2]
>> long = %w[one two three four]
=> ["one", "two", "three", "four"]
>> short.zip(long)
=> [[1, "one"], [2, "two"]]
>> long.zip(short)
=> [["one", 1], ["two", 2], ["three", nil], ["four", nil]]

Note that the size of the result set is based on the size of the Enumerable that is used as the receiver for the Enumerable#zip() call. This works out well in practice, because you can always find the longer count if you need to preserve all of the data. If you want the shorter results, you can lead with the smaller set or filter out the nil objects. The choice is in your hands.

Unfortunately, Ruby 1.9 changes the rules:

>> short.zip(long).to_a
=> [[1, "one"], [2, "two"]]
>> long.zip(short).to_a
=> [["one", 1], ["two", 2]]

As you can see, the shortest Enumerable now limits the results no matter where it occurs. The problem with this change is that it discards data and you have to go out of your way to save it. This new behavior is documented though, so I assume it's intentional.

What do you do if you want a safe, data-preserving Enumerable#zip() with the 1.8 semantics that works on both 1.8 and 1.9? About the best I can come up with is:

require "enumerator"
zipped = long.enum_for(:each_with_index).
              map { |e, i| [e, short.to_a[i]] }

Obviously, I'm open to better ideas.
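One possible alternative, offered only as a sketch: walk the receiver with the block form of each_with_index() and look up the matching element by index, which avoids both the enumerator library and zip() itself. The name `zip_keeping_receiver` is hypothetical, chosen just for this example:

```ruby
# Pair each element of `receiver` with the element of `other` at the same
# index, padding with nil when `other` runs out (the 1.8 zip semantics).
# `zip_keeping_receiver` is a made-up helper name, not a library method.
def zip_keeping_receiver(receiver, other)
  other_items = other.to_a
  pairs = []
  receiver.each_with_index { |item, index| pairs << [item, other_items[index]] }
  pairs
end

short = [1, 2]
long  = %w[one two three four]

long_first  = zip_keeping_receiver(long, short)
short_first = zip_keeping_receiver(short, long)

p long_first   # [["one", 1], ["two", 2], ["three", nil], ["four", nil]]
p short_first  # [[1, "one"], [2, "two"]]
```

The result size follows the receiver, so nothing from the leading collection is discarded.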

FasterCSV is the New CSV

I found the above incompatibilities by introducing a new one. FasterCSV has replaced the standard CSV class in the standard library. By replaced, I mean that it is now called CSV. This will cause problems for code that used the old library.

The methods provided on the CSV object are similar, but the old CSV code used positional parameters whereas the new library uses a Hash argument syntax (e.g., row_sep: "\r\n"). That's going to trip up any non-trivial usage.
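To make the Hash argument style concrete, here is a minimal sketch against the new (FasterCSV-derived) CSV class. The sample data is made up, and the snippet assumes the csv library is available to require:

```ruby
require "csv"

# Made-up sample data: ";" separates fields and "\r\n" separates rows.
data = "name;qty\r\nwidget;2\r\n"

# Options are passed by name rather than by position.
rows = CSV.parse(data, col_sep: ";", row_sep: "\r\n")

p rows  # [["name", "qty"], ["widget", "2"]]
```

Named options mean the call site documents itself, but they are also exactly what breaks code written against the old positional signatures.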

The new library is feature rich and fully documented, so I don't expect anyone to have trouble getting their code working under 1.9. The problem will be writing code that works on both versions. For that, I recommend using code like the following to determine which library you are working with:

require "csv"
if CSV.const_defined? :Reader
  # use old CSV code here…
else
  # use FasterCSV style code, but with CSV class, here…
end

Feel free to email me with any other CSV compatibility questions.

This is Just a Start

The above is a short list of issues I've run into a couple of times now. Please feel free to add your own observations about Ruby 1.9 compatibility in the comments below. Let's do our best to make this post a generally useful resource for all.

Comments (33)
  1. Sam Ruby, January 2nd, 2008

    Porting REXML to Ruby 1.9 overlaps slightly and covers some additional ground.

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

  2. Alex Fenton, January 2nd, 2008

    Ruby 1.9 introduces an incompatible syntax change for conditional statements such as if and case/when. Previously a colon could be used as a shorthand for a then statement; this is perhaps most useful with multiple when statements on one line.

    The following is legitimate Ruby in 1.8:

    case x 
    when Regexp  : puts 'a regex'
    when Hash    : puts 'a hash'
    when Numeric : puts 'a number'
    when String  : puts 'a string'
    end
    

    But not in Ruby 1.9; now an explicit then statement must be used:

    case x
    when Regexp then puts 'a regex'
    ...
    
    2. James Edward Gray II, January 2nd, 2008

      Just to be clear the then keyword was also supported in Ruby 1.8 so using it for conditionals is fine for both versions.

    3. hk, January 4th, 2008

      But not in Ruby 1.9; now an explicit then statement must be used

      Or you could just do what everyone else does, and put what happens then on a new line. Then, you won't need then, and your code is more consistent and readable.

      I prefer this implementation. Allowing same-line then with a colon was bad style, IMO - as is the use of colons as meaningful operators in general.

    4. Chris Gaffney, January 4th, 2008

      If you still prefer the single character single line notation you can just substitute the colon with a semi-colon

      1.9:

      case sound
          when /bamf/i; puts 'Nightcrawler'
          when /boff/i; puts 'Batman'
      end
      
      2. TJ Holowaychuk, February 2nd, 2009

        I thought the semicolon functioning as an alias for then was great; it's easier to read, IMO.

        2. James Edward Gray II, February 2nd, 2009

          It's actually the colon, not the semicolon, that used to stand in for then. It was removed because it is being used in other ways, like the new Hash syntax.

    5. Rein Henrichs, May 13th, 2010

      The : for case statements was removed from Ruby syntax because ; works in all versions and does not require special syntax.

      case x
        when Hash ; puts 'a hash'
      

      I prefer it to then.

  3. Frederick Cheung, January 3rd, 2008

    A bunch of methods like instance_variables, constants, etc… that used to return strings now return symbols.

    2. Francisco Laguna, January 3rd, 2008

      I found what Frederick said is especially important for the typical BlankSlate type of class. What was in 1.8:

      class BlankSlate
        instance_methods.each { |meth|
          undef_method(meth) unless meth =~ /\A__/
        }
        ...
      end
      

      becomes in 1.9 (for example):

      class BlankSlate
        instance_methods.each { |meth|
          undef_method(meth) unless meth.to_s =~ /\A__/
        }
        ...
      end
      

      Other than that, I noticed that Thread#critical and Thread#critical= went away, but for those of us who want to explicitly schedule stuff, Fibers are nicer anyway.

      IO#getc will return a String that's one character long instead of the ASCII value of the character itself.

      1.8:

      STDIN.getc
      a
      => 97
      

      1.9:

      irb(main):002:0> STDIN.getc
      a
      => "a"
      

      This also breaks the excellent HighLine lib. hint hint

      2. James Edward Gray II, January 3rd, 2008

        Yes, I do need to get HighLine working under 1.9. I'll try to get to that before too long now. Thanks for reminding me.

      3. Daniel Luz, January 7th, 2008

        Francisco Laguna, since symbols now respond to #=~, I don't think that change is necessary, unless I am missing something. It should be noted, though, that for some reason Ruby 1.9 now emits a warning about undef'ing object_id, so you may want to preserve it too.

  4. Sander Land, January 3rd, 2008

    Using each or map on a result from Enumerable#zip is also extremely slow in 1.9. I discovered this when an application took twice as long in 1.9 as in 1.8. The innermost loop had a zip_with() call (zip -> map) which caused this.

    require 'benchmark'
    a = Array.new(25){rand}
    Benchmark.bmbm{|x|
      x.report("zip"){ 1_000_000.times { a.zip(a).map{} } }
    }
    

    Results in 1.8.6

              user     system      total        real
    zip  16.170000   0.320000  16.490000 ( 16.576643)
    

    Results in 1.9

              user     system      total        real
    zip 192.360000   1.430000 193.790000 (195.467429)
    

    Using to_a gives the same results.

    And this is with the slow 1.8.6 ubuntu/enable-pthread version vs an -O3/no-pthread compiled 1.9. On most code the 1.9 version is about four times as fast as the 1.8.6 version.

  5. gga, January 4th, 2008

    For extension writers, ruby1.9 has, incorrectly in my opinion, deprecated Ruby's version.h file.

    This means it is not possible to know the ruby version easily and you now MUST write a Makefile of some sort to pass the proper defines or to check if your ruby supports some feature through some try-compile checks.

    This probably ranks as one of the worst changes in ruby 1.9.

    This obviously begs the question why this was done (as there's no benefit) and what should extension developers do if some function exists in both ruby1.8 and ruby1.9 but has different functionality (as some of the cases show here).

  6. Robert Dober, January 5th, 2008

    James, I am just working on RQ#151 and I want my solution to be version agnostic. Up to now the following was my idea:

    Write the code in v1.9 and just require a file to upgrade 1.8 ruby just enough to run your code, so that the require can go away one day. Here is a very first shot:

    class String
       unless instance_methods.include?( "to_char" ) then
         require "enumerator"
         def each_char &blk
           return enum_for(:each_byte).map{ |b| b.chr } unless blk
           enum_for(:each_byte).each do |b| blk.call b.chr end
         end 
       end
    end
    

    Of course it would be much better to wrap the whole include file into a version test, but that very test might be a tough one. The following is rather a bad example:

    begin
      "".to_a
      def to_char...
         ...
      end
    rescue
      nil
    end
    

    Going for the Ruby version constant

    if /^1\.8/ === RUBY_VERSION then
       ...
    end
    

    is probably a sound decision after all.

    What do you think?

    Cheers

    2. James Edward Gray II, January 6th, 2008

      I would probably just do:

      require "jcode" unless "".respond_to?(:each_char)
      
      2. Robert Dober, January 6th, 2008

        James

        Now for the idea of saying

        require "jcode" unless "".respond_to?(:each_char)
        

        This is an approach I have seen first in Javascript for Browser Quirks but after some thoughts I believe that it is a bad idea for libraries. What if a require before our require just added each_char to String? And that is not exactly far fetched an idea either.

        For applications however it will work, unless you require third-party libraries carelessly before the code above. This, however, can be debugged easily…

        For libraries there would be no way to debug or even fix it in a general manner.

        One could of course argue that someone could tamper with RUBY_VERSION too, but well we still have to let people kill themselves if they insist, sigh!

        2. James Edward Gray II, January 6th, 2008

          Well, I hope that any each_char() implementation would give me the expected one character at a time.

          My main point though was that I felt safer using the each_char() method that comes with Ruby 1.8 than building my own.

          2. Robert Dober, January 6th, 2008

            I see we are talking about two different things.

            1. I wanted a guard against the ruby version for lots of definitions, not only String#to_char.
            2. I cannot use Ruby's String#to_char because I need the 1.9 functionality of the returned Enumerator in case it is called without a block.

            Maybe the idea to write version agnostic code was not really what you are after here, and your emphasis is on 1.9. In that case I am a little bit OT, as usual...

            2. James Edward Gray II, January 7th, 2008

              Robert: I guess I am still a little confused about our discussion. You've mentioned both String#to_char() and String#each_char() and your code checks for one but creates the other. String#to_char() isn't a method I'm familiar with and I can't locate any documentation on it.

              Just FYI, I believe your code also has a bug in it. Checks like instance_methods.include?("some_str") don't work as expected in Ruby 1.9. Those method names are now returned as Symbol objects so include?() will fail to match the String.

              You can load jcode and enumerator and use enum_for(:each_char) to get an Enumerable::Enumerator in Ruby 1.8 or 1.9. I do now understand that we were discussing many methods instead of a specific example though, so that may not help.

              2. Robert Dober, January 7th, 2008

                James

                Many apologies for so many typos; I was referring to #each_char only. Thanks for the hint with jcode and instance_methods.include?. I missed jcode's functionality.

  7. Sander Land, January 9th, 2008

    The zip problems appear to be fixed with the January 8th version. It's still ~50% slower than 1.8, but that's manageable.

    James, I saw you post about this on the Ruby-CORE mailing list. Is this the way to go for posting bugs? I'm asking this because I discovered what I think is a rather serious bug and posted a bug report on Rubyforge about three weeks ago, but there is no reply to the report or any of my follow-ups, other than the bug getting assigned to Matz.

    2. James Edward Gray II, January 9th, 2008

      Yes, I was able to sway Matz and Enumerable#zip() has been "repaired."

      Opinions seem to differ on whether or not to use the bug tracker on Rubyforge or the Ruby Core mailing list. I believe the core team is trying to get more into the bug tracker habit, so it's probably best to start there for most things. I find I have more success with topics that should be discussed, like the Enumerable#zip() issue, on Ruby Core though. For serious issues, I recommend putting it in the bug tracker then drawing attention to it on Ruby Core.

  8. Gavin Sinclair, January 14th, 2008

    When Ruby 1.8 was on the horizon and 1.6 was the normal version for people to use, someone created a library called "shim", which allowed the use of 1.7/1.8-style features in 1.6 code.

    With compatibility between 1.8 and 1.9 a key issue for many people, such a "shim" library could be very useful.

    2. Benoit Daloze, July 10th, 2010

      When Ruby 1.8 was on the horizon and 1.6 was the normal version for people to use, someone created a library called "shim", which allowed the use of 1.7/1.8-style features in 1.6 code.

      With compatibility between 1.8 and 1.9 a key issue for many people, such a "shim" library could be very useful.

      I would recommend you backports
      http://github.com/marcandre/backports

      As it is pure Ruby, I believe you should still care where speed is important (though it would be better if users upgrade to 1.9 of course)

      I found these "hacks" very bad to read. I would personally be tempted to use something like backports instead (or maybe only a part of it).

  9. Michael Barton, September 22nd, 2009

    Hi James,

    Thanks for sharing. Based on your code I've tried the following for backwards compatibility with Ruby 1.8 where everything uses the CSV class constant.

    require "csv"
    if CSV.const_defined? :Reader
      # Ruby 1.8 compatible
      require 'fastercsv'
      Object.send(:remove_const, :CSV)
      CSV = FasterCSV
    else
      # CSV is now FasterCSV in ruby 1.9
    end
    
    2. William, October 12th, 2009

      Thanks to Michael Barton. His fix is just what I sought.

  10. James Edward Gray II, March 26th, 2010

    This isn't really a character encoding issue. It's about HTML escapes, as the solution you received on Ruby Talk shows.

    2. amishera, March 31st, 2010

      Mr. Gray,

      Thanks for your reply.

      I solved the problem using this:

      s.gsub!(/#(\d+);/) { [$1.to_i].pack("U*") }
      

      This not only converts the HTML part, i.e., converts #39; into 39, but also gets the corresponding Unicode character. But I am not sure whether this is the right approach. Is there a better approach?

      2. amishera, March 31st, 2010

        Sorry, I meant that given a number I wanted to convert it to a Unicode character. So, you know, for ASCII we do something like

        39.chr
        

        so for doing the same thing for Unicode instead of ASCII, what is the right approach?

      3. James Edward Gray II, March 31st, 2010

        Your approach is correct, though I still prefer the solution you got from Ruby Talk. It's more general and probably more correct with regard to Web encodings. For example, the default for the Web is Latin-1, not UTF-8. Those two encodings are codepoint compatible for all of Latin-1's range though, which is why your code above works.

  11. James Edward Gray II, March 26th, 2010

    Those character escapes are octal, so \303 is really 195. Good question.
