The Standard Library

Digging into those helpful libraries that ship with Ruby.



The Secret Shell Helper

Someone pops onto the Ruby Talk mailing list fairly regularly asking how to break up content like:

one "two" "a longer three"

They expect to end up with a three-element Array, where the third item will contain spaces. They generally expect the quotes to have been removed as well.

If your needs are very, very simple you may be able to handle this with a regular expression:

data = 'one "two" "a longer three"'
p data.scan(/"([^"]*)"|(\S+)/).flatten.compact
# >> ["one", "two", "a longer three"]

That just searches for either a set of quotes with some non-quote characters between them or a run of non-whitespace characters. Those are the two possibilities for the fields. Note that the two separate captures here mean scan() will return contents in the form:

[[nil, "one"], ["two", nil], ["a longer three", nil]]

That's why I added a flatten() and compact() to get down to the actual matches.

The regular expression approach can get pretty complex though if any kind of escaping for quotes is involved. When that happens, you may need to step up to a parser.
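To make that concrete, here is a rough sketch of what happens when backslash-escaped quotes show up. The sample data and the variable names (naive, fields) are my own illustration, and the second pattern is just one possible way to extend the expression, not a complete shell grammar:

```ruby
# Hypothetical sample: the backslash-escaped quotes are part of the data.
data = 'say "a \"b\" c"'

# The simple pattern stops at the first escaped quote and splits wrongly.
naive = data.scan(/"([^"]*)"|(\S+)/).flatten.compact
p naive
# >> ["say", "a \\", "b\\\"", "c\""]

# One possible extension: allow backslash pairs inside the quotes,
# then strip the backslashes from each field in a second pass.
fields = data.scan(/"((?:\\.|[^"\\])*)"|(\S+)/).flatten.compact
fields = fields.map { |field| field.gsub(/\\(.)/) { $1 } }
p fields
# >> ["say", "a \"b\" c"]
```

Even this version ignores single quotes and escaped spaces outside of quotes, which is why the expression keeps growing as the rules pile up.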

One choice for that would be to abuse a CSV parser to get it to divide up the data for you. Here's how you would do that with FasterCSV:

require "rubygems"
require "faster_csv"

data = 'one "two" "a longer three"'
p data.parse_csv(:col_sep => " ")
# >> ["one", "two", "a longer three"]

As you see, replacing the column separator (traditionally a comma) with a simple space gets FasterCSV to break down this data correctly.
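FasterCSV became the standard CSV library as of Ruby 1.9, so on a newer Ruby the same trick should work without the gem. This is a minimal sketch, assuming the csv library is available to your Ruby (recent versions ship it as a bundled gem rather than in the core install); the fields variable is just for illustration:

```ruby
require "csv"

data = 'one "two" "a longer three"'

# Same idea as the FasterCSV call: treat a single space as the separator.
fields = data.parse_csv(col_sep: " ")
p fields
# >> ["one", "two", "a longer three"]
```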

This parser will handle escaping, though it's CSV-style escaping. That means that quotes will need to be doubled:

require "rubygems"
require "faster_csv"

data = 'simple "embedded ""quote"" characters"'
p data.parse_csv(:col_sep => " ")
# >> ["simple", "embedded \"quote\" characters"]

I doubt that fits the data very often. I suspect that an escaped quote is more often \" than "". Why is that? Well, data of this type isn't typically CSV data. Which leads us to the natural question: what kind of data are we really working with here?

I'm guessing it's shell data more often than not. Most shells handle quoting like this:

$ ruby -e 'p ARGV' one "two" "a longer three"
["one", "two", "a longer three"]

If that's really the case, we're going to need a shell-oriented parser. It's sadly not well known, but Ruby ships with such a parser. The standard Shellwords library will break these down for you:

require "shellwords"

data = "one 'two' 'a longer three'"
p Shellwords.shellwords(data)
# >> ["one", "two", "a longer three"]

If your data really is shell content, you'll be glad to know that Shellwords will handle all the special cases:

require "shellwords"

s = lambda { |shell| p Shellwords.shellwords(shell) }

s[%Q{"escaped \\"quote\\" characters"}]
s[%Q{escaped\\ spaces}]
s[%Q{'back to'" back quoting"}]
# >> ["escaped \"quote\" characters"]
# >> ["escaped spaces"]
# >> ["back to back quoting"]

Shellwords has some new features in Ruby 1.9 as well:

  • shellsplit() was added as an alias for shellwords()
  • shellescape() was added to escape a String for use on a Bourne shell command line
  • shelljoin() was added to escape and join an Array of arguments
  • shellsplit() and shellescape() were added to String and shelljoin() was added to Array for easier access
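Here is a short sketch of those additions in use, assuming Ruby 1.9 or later. The variable names (args, escaped, line) are just for illustration:

```ruby
require "shellwords"

# String#shellsplit does the same job as Shellwords.shellwords().
args = "one 'two' 'a longer three'".shellsplit
p args
# >> ["one", "two", "a longer three"]

# String#shellescape backslash-escapes characters the shell treats specially.
escaped = "a longer three".shellescape
puts escaped
# >> a\ longer\ three

# Array#shelljoin escapes each element and joins them with spaces.
line = args.shelljoin
puts line
# >> one two a\ longer\ three

# Splitting the joined line gives back the original arguments.
p line.shellsplit == args
# >> true
```

The round trip at the end is the handy part: shelljoin() builds a command line that shellsplit() reads back as the same list of arguments.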

Hopefully this helps you find the right parser for your data.

Comments (1)
  1. Kelsey, May 13th, 2009

    Thanks for posting this, James. I never knew this existed.
