Ruby Voodoo

Deep dives into random corners of my favorite programming language.



Rich Methods

Some APIs provide collections of dirt simple methods that just do one little thing.

This approach is less common in Ruby though, especially in the core and standard library of the language itself. Ruby often gives us rich methods with lots of switches we can toggle and half-hidden behaviors.

Let's look at some examples of what I am talking about.

Get a Line at a Time

I suspect most Rubyists have used gets() to read lines of input from some kind of IO. Here's the basic usage:

>> require "stringio"
=> true
>> f = StringIO.new(<<END_STR)
<xml>
  <tags>Content</tags>
</xml>
END_STR
=> #<StringIO:0x007fd5a264fa08>
>> f.gets
=> "<xml>\n"
>> f.gets
=> "  <tags>Content</tags>\n"

I didn't want to mess with external files for these trivial examples, so I just loaded StringIO from the standard library. It allows us to wrap a simple String (defined in this example using the heredoc syntax) in the IO interface. In other words, I'm calling gets() here for a String just as I could with a File or $stdin.

As the last two calls show, gets() reads until it finds a "\n" and then returns the content read. Actually, that's what it does by default, but you can tell gets() what character to read to, if you prefer:

>> f.rewind
=> 0
>> f.gets(">")
=> "<xml>"
>> f.gets(">")
=> "\n  <tags>"
>> f.gets(">")
=> "Content</tags>"

When you're working with XML documents, newlines don't really mean much. You don't actually care where they are. What you do care about are tags. Reading from tag to tag is like reading one of those great books that skip the boring bits to give you interesting scene after interesting scene.

As you can see above, one tiny change to the gets() call, specifying the character to read to as the tag ending ">", can make this happen.
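
Here's a minimal sketch of how that might look as a loop. The sample document and the variable names are my own stand-ins for illustration, and the cleanup with strip() is just one way to handle the whitespace between tags:

```ruby
require "stringio"

# Hypothetical sample document, similar to the one used above.
xml = StringIO.new("<xml>\n  <tags>Content</tags>\n</xml>\n")

pieces = []
# gets(">") returns nil at the end of the input, which ends the loop.
while (piece = xml.gets(">"))
  piece = piece.strip                   # drop newlines and indentation between tags
  pieces << piece unless piece.empty?   # the trailing "\n" strips down to nothing
end

p pieces
```

Each pass through the loop hands you everything up to and including the next ">", so text content arrives glued to the closing tag that follows it.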

"But wait, there's more!"

>> f = StringIO.new("One\n\nTwo\n\nThree")
=> #<StringIO:0x007fd5a260efa8>
>> f.gets("")
=> "One\n\n"
>> f.gets("")
=> "Two\n\n"

The empty String ("") is a magic value for the character to read to, since it makes no sense as that value. This turns on paragraph mode, and in that mode Ruby will read one paragraph at a time. For this purpose, paragraphs are defined as being separated by two consecutive newlines (or a blank line in word processor terms).
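
As far as I know, each_line() accepts the same separator argument as gets(), so paragraph mode works there too. This is a small sketch under that assumption, with variable names of my own choosing:

```ruby
require "stringio"

text = StringIO.new("One\n\nTwo\n\nThree")

# each_line("") walks the input one paragraph at a time;
# strip() removes the blank-line separator left on each chunk.
paragraphs = text.each_line("").map(&:strip)

p paragraphs
```

The last paragraph has no trailing blank line, so it comes back as whatever is left in the input.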

These aren't even all the features of gets(). It can do more. For example, you can provide an upper limit of bytes to read, to prevent wonky input from forcing your program to allocate tons of memory to hold large Ruby String objects.
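
To make the limit concrete, here's a rough sketch. The oversized input is made up for the example, and 16 is an arbitrary cap I picked to keep the numbers small:

```ruby
require "stringio"

# Hypothetical input: one oversized line followed by a normal one.
io = StringIO.new(("x" * 40) + "\nshort\n")

chunks = []
# Read up to the next "\n", but never more than 16 bytes per call.
while (chunk = io.gets("\n", 16))
  chunks << chunk
end

p chunks.map(&:bytesize)
```

The long line comes back in pieces instead of as one big String, and the short line still arrives whole because it fits under the limit.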

Let's look at another method.

Hash Merging

Many Ruby methods sneak their rich functionality in through the use of blocks. Deferring some decision to the caller by allowing them to provide custom code for handling it makes some methods crazy flexible.

To show what I mean, let's play with good old merge():

>> {a: 1, b: 2}.merge(c: 3, d: 4)
=> {:a=>1, :b=>2, :c=>3, :d=>4}

Most Rubyists run into examples like this pretty early in their studies. The code just returns a fresh Hash containing the keys and values of both the receiver and the Hash passed as an argument to merge().

How are ties handled?

>> {a: 1, b: 2}.merge(b: :two, c: 3)
=> {:a=>1, :b=>:two, :c=>3}

The Hash passed as an argument to merge() wins. Again, I doubt this is much of a surprise to anyone.

However, I don't think everyone knows that you can take control of this merging process. During a merge() any conflict will be passed to a block, if provided, and the block can return what to store in the new Hash:

>> {a: 1, b: 2}.merge(b: :two, c: 3) { |_, old, new| Array(old) + Array(new) }
=> {:a=>1, :b=>[2, :two], :c=>3}

You can throw away either item, log the conflict, combine them as I have done here, or do whatever else you can think of, all because merge() takes a block.
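
As another illustration of the same hook, here's a sketch that adds up conflicting values instead of collecting them. The fruit counts are invented data for the example:

```ruby
# Hypothetical daily tallies; only :apples appears in both.
monday  = {apples: 3, pears: 1}
tuesday = {apples: 2, plums: 4}

# The block only runs for keys present in both Hashes.
totals = monday.merge(tuesday) { |_fruit, first, second| first + second }

p totals
```

Keys that appear in just one Hash never reach the block, so they pass through untouched.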

Can you guess how ActiveSupport implements reverse_merge!() now?
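
If you want to check your guess, here's one way to get that behavior with a block. This is my own sketch, not ActiveSupport's code, and with_defaults!() is a made-up name for the example:

```ruby
# Hypothetical helper: fill in missing keys from defaults without
# overwriting anything the options Hash already has.
def with_defaults!(options, defaults)
  # On a conflict, keep the value that was already in options.
  options.merge!(defaults) { |_key, current, _default| current }
end

options = {color: "red"}
with_defaults!(options, {color: "blue", size: 10})

p options
```

The block simply picks the old value every time, which flips the usual rule that the argument wins.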

Easy Tokenizing

Let's do one last method with a rich interface (even though Ruby has many more):

>> "1,2,3".split(",")
=> ["1", "2", "3"]

This is another very common method. It turns a String into an Array by dividing up the contents everywhere the passed separator is encountered. I used a String separator above but a Regexp is also allowed:

>> "1, 2, 3".split(/\s*,\s*/)
=> ["1", "2", "3"]

This makes it easier to handle complex separators. For example, the Regexp above permits optional whitespace characters on either side of the comma.

But a Regexp can include capture groups. How are they handled?

>> "1, 2, 3".split(/\s*(,)\s*/)
=> ["1", ",", "2", ",", "3"]

Easy enough: the captured value(s) are returned with the separated contents.
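
The flip side is worth a quick sketch too: if you need parentheses for grouping but don't want the separators in the result, a non-capturing group keeps them out. The input String here is just something I made up:

```ruby
list = "1, 2; 3"

# A capturing group: the separators show up in the result.
with_separators = list.split(/\s*(,|;)\s*/)

# A non-capturing group (?:...): same matching, separators dropped.
without_separators = list.split(/\s*(?:,|;)\s*/)

p with_separators
p without_separators
```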

The real question this raises for me is, "What the heck is this feature good for?" Well, one thing I have found over the years is that this usage of split() can make dividing some input into tokens pretty darn easy:

>> "<xml><tags>Content</tags></xml>".split(/(<[^>]+>)/)
=> ["", "<xml>", "", "<tags>", "Content", "</tags>", "", "</xml>"]

You can use this one feature as a backbone for a moderately complex parser. Mote does just that.
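
To give a feel for the idea, here's a tiny tokenizer built on that split() call. It's only a sketch of the general technique, not how Mote does it, and tokenize() along with the :tag and :text labels are names I invented:

```ruby
# Hypothetical tokenizer: split on tags, keep them via the capture group,
# then label each piece as a tag or as plain text.
def tokenize(markup)
  markup.split(/(<[^>]+>)/).reject(&:empty?).map do |piece|
    piece.start_with?("<") ? [:tag, piece] : [:text, piece]
  end
end

tokens = tokenize("<xml><tags>Content</tags></xml>")
tokens.each { |type, value| puts "#{type}: #{value}" }
```

Dropping the empty Strings up front means every remaining piece is either a whole tag or a run of text between tags.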

I made a video explaining in detail how this parsing trick (and more) is accomplished. You can use the coupon BLOGREADER for $3 off if you want to check it out.
