16 APR 2007
No Longer the Fastest Game in Town
If your number one concern when working with CSV data in Ruby is raw speed, you might want to know that FasterCSV is no longer the fastest option.
There are a couple of new contenders for Ruby CSV processing, including a C extension called SimpleCSV and a pure Ruby library called LightCsv. I haven't been able to test SimpleCSV locally, because I can't get it to build on my box, but users do tell me it's faster. I have run some trivial benchmarks for LightCsv, though, and it too is pretty quick:
$ rake benchmark
(in /Users/james/Documents/faster_csv)
time ruby -r csv -e '6.times { CSV.foreach("test/test_data.csv") { |row| } }'
real 0m5.481s
user 0m5.468s
sys 0m0.010s
time ruby -r lightcsv -e \
'6.times { LightCsv.foreach("test/test_data.csv") { |row| } }'
real 0m0.358s
user 0m0.349s
sys 0m0.008s
time ruby -r lib/faster_csv -e \
'6.times { FasterCSV.foreach("test/test_data.csv") { |row| } }'
real 0m0.742s
user 0m0.732s
sys 0m0.009s
It's important to note that LightCsv is indeed very "light." FasterCSV has grown up into a feature-rich library that provides many different ways to look at your data. In contrast, LightCsv doesn't yet allow you to set column or row separators. Given that, it's only an option for vanilla CSV that you just need to iterate over. If that's what you have, though, and speed counts, it might just be the right choice.
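To make the separator point concrete, here is a small sketch of the kind of input that needs configurable separators. It assumes the csv library that ships with current versions of Ruby, which descends from FasterCSV and keeps the same col_sep and row_sep option names. The sample data is made up for illustration.

```ruby
require "csv"

# Semicolon-delimited data: not "vanilla" CSV, so the parser has to be
# told which column separator to use.
semicolon_data = "name;score\nalice;10\n"
semicolon_rows = CSV.parse(semicolon_data, col_sep: ";")

# Records separated by a pipe instead of a newline.
pipe_data = "a,b|c,d|"
pipe_rows = CSV.parse(pipe_data, row_sep: "|")
```

Data like this is where a parser without separator options stops being a candidate, no matter how fast it is.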
For the curious, LightCsv achieves its speed advantage in two ways. First, it uses StringScanner to manage the parsing. StringScanner is a C extension, though it is a standard library installed with Ruby.
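As a rough illustration of what StringScanner-driven parsing looks like, here is a minimal sketch. It is not LightCsv's actual code: parse_csv_line is a hypothetical helper, it only handles a single line with a comma separator, and it ignores quoted fields that span lines.

```ruby
require "strscan"

# Hypothetical helper: split one line of comma-separated data into fields.
# The scanner walks the string left to right, consuming one field and then
# one separator at a time.
def parse_csv_line(line)
  scanner = StringScanner.new(line.chomp)
  fields  = []
  loop do
    if scanner.scan(/"((?:[^"]|"")*)"/)
      # Quoted field: drop the outer quotes and unescape doubled quotes.
      fields << scanner[1].gsub('""', '"')
    else
      # Unquoted field: everything up to the next comma (possibly empty).
      fields << scanner.scan(/[^,]*/)
    end
    # Stop when there is no separator left to consume.
    break unless scanner.scan(/,/)
  end
  fields
end

parsed = parse_csv_line(%Q{a,"b,c","say ""hi"""\n})
```

The regular expression matching happens inside the C extension, so the Ruby code does little more than decide which pattern to try next.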
More importantly, I suspect, LightCsv uses an input buffer for reading while FasterCSV works line by line. I suspect this second difference accounts for the majority of the speed increase, since the buffered code will hit the hard drive quite a bit less for the average CSV file. This does require more memory though, of course.
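The buffered style of reading can be sketched roughly like this. Again, this is an illustration under my own assumptions and not code from either library: each_line_buffered and its default chunk size are hypothetical.

```ruby
require "stringio"

# Hypothetical helper: pull large chunks from the IO into a string buffer
# and hand out complete lines from that buffer, instead of asking the IO
# for one line at a time.
def each_line_buffered(io, chunk_size = 64 * 1024)
  buffer = +""
  while (chunk = io.read(chunk_size))
    buffer << chunk
    # Yield every complete line currently sitting in the buffer.
    while (line = buffer.slice!(/\A[^\n]*\n/))
      yield line
    end
  end
  # Whatever is left is a final line with no trailing newline.
  yield buffer unless buffer.empty?
end

lines = []
each_line_buffered(StringIO.new("a,b\nc,d\ne,f"), 4) { |line| lines << line }
```

The memory cost mentioned above shows up in buffer: it holds a whole chunk plus any partial line carried over from the previous read.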
Aside from these differences, FasterCSV and LightCsv have very similar parsers.
Comments (2)
- tommy (April 18th, 2007): LightCsv does not use StringIO. It uses StringScanner.
- Oops. Good catch. I have corrected the article.