Rubies in the Rough

This is where I try to teach how I think about programming.

1

NOV
2011

The Wrong Tool for the Job

I want to start our exploration of how to think about Ruby programming with a miniseries called Breaking All of the Rules. As I'm sure you know, programmers have a lot of rules. You can barely speak to a programmer for a few minutes without them quoting some axiom. We have a huge collection of advice to hand out.

Fortunately, I'm not just a programmer. I'm also a tournament chess player. Getting good at chess has really helped my programming. That's because chess players also have a ton of rules.

Perhaps you've seen our books of opening chess moves? They are literally hundred of pages that just list the various moves that you can "start" a chess game with. I use the word start very loosely there because some combinations can go 20 moves into the game or more. Chess games generally only average about 60 moves, so the first third of what we do is often straight out of a book. In fact, while you are playing a chess opening, we say that you are "in book." That means you are following the rules.

But here's the kicker: it's very common for a player to "break book" at some point. Yes, they choose to stop following the rules, on purpose.

Don't get me wrong, the rules are great. They exist to help us, literally. In chess they are there to steer you away from traps players can fall into. That's why we suffer through the rote memorization of so many openings.

Why then do we break out of the pattern at some point? Well, once we do we won't know exactly what comes next, but neither will our opponent. From that point forward, it's just your ideas against theirs. Given that, if you break book, you send the impression we're going to play the game your way. That can give you a slight advantage with regard to initiative, assuming you manage to avoid the dangers the book openings protect your from.

This pattern repeats over and over again in chess. Learning the rules will make you good. They protect you. However, learning the right time to break them will make you great.

Eventually, I realized that programming is much the same. All those rules we drill into your head when you are learning to program, well, they are designed to protect you. That's why we teach them, of course. But, if you can learn the right times to break them, there are advantages to be had.

Let's look at how and why we might choose to break a rule.

A First Rule

It's unlikely you can be a part of the programming world very long without hearing this rule: use the right tool for the job. The real question is, why do we say this?

I assume the answers are fairly obvious in this case. The right tool for the job is probably going to be easier to use. It should also be more efficient at doing a job it is intended to accomplish.

But what if those just aren't your main goals?

For example, let's say you've set a goal for yourself to really learn some Ruby. Maybe you came to the language via Ruby on Rails or another similar source and now you are ready to really get down to understanding the language underneath the surface. Let's say that's your primary goal: learn more Ruby.

In a series of fantasy books I read, called the The Wheel of Time, there are two groups of magic users. The women have been at it for a very long time now. They find girls with the potential to use magic, take them to their school, and give them excellent, if very strict, training about how and even when they are allowed to use magic. The men are a different story. Magic was outlawed for them until very recently, so they are new to the game. In order to catch up, they use their magic for everything they possibly can. They make their beds and warm their tea using their magic. Of course, their way is more dangerous and some of them do get hurt. Those that avoid the dangers though learn very fast due to the copious practice.

The same could work for Ruby. If you really want to raise your skill, using it more will help. Plus, you are less likely to burn your house down when you just want to make some tea.

The Command-line Frontier

Where could you use more Ruby? One place that is often overlooked is at the command-line. Programmers are pretty much always needing to do some task at the command-line.

Let me give an example. Just last week I received a bug report for a Rails application where the user got it into a bad state after some complex interactions. I needed to know this user's exact path through the application to be able to reproduce what happened.

That's a job for scanning through log files. There's so much data in a log file though and I needed to get the right data. Luckily, the user told me what they entered into one of the forms. I used grep (with a small assist from zcat) to search for one of those exact parameters, like this:

$ zcat log/production.log.4.gz | grep 'EXACT_EMAIL'
  Parameters: {…, "user"=>{"email"=>"EXACT_EMAIL", …}, "commit"=>"Sign Up"}

That matched line is part of one of their requests. If I can see how that request started, I can find the user. I asked grep for a little context:

$ zcat log/production.log.4.gz | grep -C 2 'EXACT_EMAIL'
Started POST "/connect" for 0.0.0.0 at 2011-10-25 10:09:30 +0000
  Processing by SessionsController#create as HTML
  Parameters: {…, "user"=>{"email"=>"EXACT_EMAIL", …}, "commit"=>"Sign Up"}
Rendered user_mailer/validation.text.erb (0.6ms)
…

Bingo, that got me an IP address, which we will say was 0.0.0.0 just for example purposes. Now it's trivial to look at just the requests of that user:

$ zcat log/production.log.4.gz | grep 'for 0\.0\.0\.0'
Started GET "/rubies-in-the-rough" for 0.0.0.0 at 2011-10-25 08:40:01 +0000
Started GET "/connect?…" for 0.0.0.0 at 2011-10-25 10:08:53 +0000
Started GET "/auth/github" for 0.0.0.0 at 2011-10-25 10:08:58 +0000
Started GET "/auth/twitter" for 0.0.0.0 at 2011-10-25 10:09:04 +0000
Started GET "/auth/twitter/callback?…" for 0.0.0.0 at …
Started GET "/connect" for 0.0.0.0 at 2011-10-25 10:09:15 +0000
Started POST "/connect" for 0.0.0.0 at 2011-10-25 10:09:30 +0000
Started GET "/" for 0.0.0.0 at 2011-10-25 10:09:33 +0000
Started GET "/rubies-in-the-rough" for 0.0.0.0 at 2011-10-25 10:14:02 +0000
Started GET "/connect?…" for 0.0.0.0 at 2011-10-25 10:14:38 +0000
Started GET "/auth/twitter" for 0.0.0.0 at 2011-10-25 10:14:46 +0000
Started GET "/auth/twitter/callback?…" for 0.0.0.0 at …
Started GET "/connect" for 0.0.0.0 at 2011-10-25 10:14:49 +0000
Started GET "/validate/…" for 0.0.0.0 at 2011-10-25 10:15:02 +0000
Started GET "/rubies-in-the-rough" for 0.0.0.0 at 2011-10-25 10:15:04 +0000
Started GET "/" for 0.0.0.0 at 2011-10-25 10:15:08 +0000
Started GET "/rubies-in-the-rough" for 0.0.0.0 at 2011-10-25 10:19:14 +0000

That tells me where they went, but it doesn't tell me how those requests turned out. If I need to know that too, I can ask grep to add after context:

$ zcat log/production.log.4.gz | grep -A 10 'for 0\.0\.0\.0'
Started GET "/rubies-in-the-rough" for 0.0.0.0 at 2011-10-25 08:40:01 +0000
  Processing by InterestsController#show as HTML
  Parameters: {"param"=>"rubies-in-the-rough"}
Rendered interests/_interest.html.erb (5.6ms)
Rendered shared/_nav_links.html.erb (1.4ms)
Rendered shared/_nav_links.html.erb (2.3ms)
Rendered interests/show.html.erb within layouts/application (15.5ms)
Completed 200 OK in 20ms (Views: 16.0ms | ActiveRecord: 1.5ms)


Started GET "/rubies-in-the-rough" for 1.1.1.1 at 2011-10-25 08:41:45 +0000
--
Started GET "/connect?…" for 0.0.0.0 at 2011-10-25 10:08:53 +0000
  Processing by SessionsController#create as HTML
…

This approach isn't perfect. I just guessed at the context number and you can see that I picked up some noise from other requests (1.1.1.1 in this example). Still, it's fairly easy to follow where the requests I care about start, thanks to grep adding -- dividing lines. As it turns out, this was enough information to understand what had happened.

Now for a twist: what if I hadn't known grep well enough to do this?

Command-line Ruby

I call easily tell you what I would have done if I couldn't use grep for my log spelunking task, because I do it all the time. I fallback on the one thing I know like crazy: Ruby.

Ruby has a surprising amount of command-line magic built into it that can turn it into a deadly weapon for those who know how to use it. Let me walk you through some personal favorites.

First, you have to learn -e. It's the key to command-line Ruby. It means, there's no script this time so just execute the following code. To show a trivial example, let's generate some data that we can use in upcoming examples:

$ ruby -e 'puts((1..10).to_a)'
1
2
3
4
5
6
7
8
9
10

Did you know puts() will accept an Array and print it one-item-per-line? I love that trick.

OK, but let's use output redirection to toss that data into a file and make a little more:

$ ruby -e 'puts((1..10).to_a)' > one_to_ten.txt
$ ruby -e 'puts((11..20).to_a)' > eleven_to_twenty.txt
$ ls
eleven_to_twenty.txt one_to_ten.txt

Perfect.

Now, it's worth discussing just how well behaved Unix programs like grep read data. I bet you already know that you can pass a file argument on the command-line and grep will search inside of that file (shown here using a pattern that always matches every line):

$ grep '^' one_to_ten.txt
1
2
3
4
5
6
7
8
9
10

What you may not know is that you aren't just limited to one file. If you provide more, grep will just treat them as one long data stream, searching through them in the order they are listed:

$ grep '^' one_to_ten.txt eleven_to_twenty.txt 
one_to_ten.txt:1
one_to_ten.txt:2
one_to_ten.txt:3
one_to_ten.txt:4
one_to_ten.txt:5
one_to_ten.txt:6
one_to_ten.txt:7
one_to_ten.txt:8
one_to_ten.txt:9
one_to_ten.txt:10
eleven_to_twenty.txt:11
eleven_to_twenty.txt:12
eleven_to_twenty.txt:13
eleven_to_twenty.txt:14
eleven_to_twenty.txt:15
eleven_to_twenty.txt:16
eleven_to_twenty.txt:17
eleven_to_twenty.txt:18
eleven_to_twenty.txt:19
eleven_to_twenty.txt:20

Finally, we can list no files at all and grep will read from STDIN. Let's pipe some data there to show that:

$ cat one_to_ten.txt eleven_to_twenty.txt | grep '^'
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

You can tell this is a pretty standard means of treating input for Unix programs, because I slipped in an example of cat reading from multiple files the same way grep did.

Ruby also supports this input pattern. It has a special constant called ARGF that supports IO-like operations. ARGF does the same trick of reading from the passed files names or STDIN.

Given that, we can make a cat-like Ruby command with this code:

$ ruby -e 'ARGF.each do |line| puts line end' one_to_ten.txt
1
2
3
4
5
6
7
8
9
10

We don't really do that though. That model is so common that Ruby provides many shortcuts for it. First, we can just order Ruby into a print loop with no code to execute:

$ ruby -pe '' one_to_ten.txt eleven_to_twenty.txt 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

Obviously, that's not too helpful, but we can transform the data on the way through, say to line up the numbers:

$ ruby -pe '$_ = "%04d\n" % $_' one_to_ten.txt eleven_to_twenty.txt 
0001
0002
0003
0004
0005
0006
0007
0008
0009
0010
0011
0012
0013
0014
0015
0016
0017
0018
0019
0020

OK, that got pretty cryptic, so we better walk through it. First, -p basically wraps your code in this loop:

while gets
  # any code (say from -e) goes here
  print
end

Even that is still a bit cryptic. These bare Kernel methods like gets() and print() have some powerful defaults. For example, gets() reads from ARGF by default and stores the result in $_ (the current line variable). Similarly, print will spit out the contents of $_ when no arguments are passed. In other words, you can think of the above loop in this expanded form, if it helps:

while ($_ = ARGF.gets)
  # any code (say from -e) goes here
  print $_
end

That leaves only what I passed to -e: $_ = "%04d\n" % $_. The String is a format saying to display a decimal number on a line by itself, make it at least 4 digits wide, and lead with zeros instead of spaces. It's applied to the current contents of $_ (from gets()) using the format operator (%). The result is stuffed back into $_ (so print() will output the modified content).

Now, we don't have to print each line. Ruby also supports -n, which is the same loop with no print(). Using that, we can decide what to print:

$ ruby -ne 'print if $_.to_i.even?' one_to_ten.txt eleven_to_twenty.txt 
2
4
6
8
10
12
14
16
18
20

That means we've reached the point where we can write our own grep:

$ ruby -ne 'print if $_ =~ /6/' one_to_ten.txt eleven_to_twenty.txt 
6
16

Oh, but Ruby has another shortcut for that too. Testing $_ against a Regular Expression is common in these command-line programs, so Ruby allows you to drop the variable reference and the operator:

$ ruby -ne 'print if /6/' one_to_ten.txt eleven_to_twenty.txt 
6
16

Cryptic, but powerful, eh?

Tracking Context

I've shown you some simple command-line tricks, but how would we handle the surrounding context like grep does?

The following lines are pretty easy, now that we know how what the loop surrounding our code looks like. Here are our matches, with up to two lines of context below them:

$ cat one_to_ten.txt eleven_to_twenty.txt > ns.txt
$ ruby -ne 'if /9/; print; 2.times do gets and print end end' ns.txt
9
10
11
19
20

Note that I condensed our number files before I started there, to keep these commands shorter. Also note that it's fine to pass multiple lines to -e as long as you split them with semicolons (;).

In the example above, I just fetch and print some extra lines when I find a match. I use gets() and print() for the extra lines, just like Ruby is doing with our loop.

That covers after context, but how would we handle before context? That's trickier. We need to track proceeding lines somewhere. But assigning a variable inside of our loop could mean that it would be reassigned each time. Does it surprise you to hear that Ruby has an answer for that too?

You probably learned at some point that interpreters, like Ruby, start executing code on your first line and finish at the last line (assuming no flow control or calls to exit()). That's not really the full story though. Run this Ruby program to prove it to yourself:

        puts "Two"
END   { puts "Five" }
        puts "Three"
BEGIN { puts "One"  }
        puts "Four"

Does it surprise you that those numbers print in order? Well, surprising or not, they do. Ruby's actual execution goes something like this:

  1. Run all BEGIN { … } blocks from first encountered to last
  2. Run all code outside of BEGIN { … } and END { … } first to last
  3. Run all END { … } blocks from first encountered to last

Given that, we could recreate wc -l (word count lines) like this:

$ ruby -ne 'BEGIN { $count = 0 }; $count += 1; END { p $count }' ns.txt 
20

Even though all of that code is technically inside of our loop, the BEGIN { … } block is run before the loop and END { … } is run after.

Now that we know another dirty trick, we are ready to take on grep's before context lines:

$ ruby -ne 'BEGIN { $p = [] };
            puts $p, $_ if /2$/; $p = ($p + [$_]).last(2)' ns.txt 
1
2
10
11
12

I finally had to break that example up over two lines to keep it readable, but I really entered it all on one long line in my terminal.

You can see that I just prepared an Array before we ever started and I'm adding new lines onto it as they come in. However, I'm keeping it truncated down to just the context I care about. Then, when it's time to print(), I place those lines before the match.

We can combine those two techniques to reproduce grep's -C switch. In order to do that though, I'll need to use the context size in multiple places. That means it's time to get it into a variable. Wouldn't you know it, Ruby has a switch for that too? Watch this:

$ ruby -se 'p [$var, $n]' -- -var='some value' -n=42
["some value", "42"]

When you pass -s Ruby parses arguments after the script name (or the argument ender --). If they take the form of -VARIABLE=VALUE, Ruby will strip them and set a global variable equal to the indicated value. This can make it a little easier to configure simple scripts.

Given that, here's the final grep-with-context copycat:

$ ruby -nse 'BEGIN { $p = [] };
             if /6/; puts $p, $_;
             $c.to_i.times do gets and print end end;
             $p = ($p + [$_]).last($c.to_i)' -- -c=2 ns.txt
4
5
6
7
8
14
15
16
17
18

Again that's just one line.

Ruby's Advantage

OK, let's face it, grep pretty handily whips my attempts to replace it. However, don't count our friend Ruby out just yet!

It's important to remember that grep is a hyper-specialized tool that's very good at what it does, but has a limited scope. Ruby, on the other hand, is a general purpose programming language. It has a lot of tricks up its sleeve and a big reach when it comes to problem domains.

Let me prove it to you by turning the tables on grep.

Remember the original problem? I was trying to find the specific requests from one user in a log file. I used grep and it worked, but I had to guess how much context I would need and put up with some noise from unrelated requests.

Ruby can do better, with one last shortcut. Meet the "flip flop" operator:

$ ruby -ne 'print if /4/../7/' ns.txt 
4
5
6
7
14
15
16
17

When you use a Range of Regexp objects in a command-line conditional like this, Ruby takes it as starting and ending conditions. When the first Regexp matches once, a switch will flip on and the condition will return true until the second Regexp matches a line. The condition then goes back to being false until the first Regexp matches again.

In the example above, the first expression matched on line 4, so we started printing. The second expression matched line 7 so that's where we stopped printing. The cycle repeated on lines 14 and 17.

Let's take this back to my original problem:

$ zcat log/production.log.4.gz |
  ruby -ne 'if /^Started.+for 0\.0\.0\.0/../^Completed/;
            print; puts "--" if /^Completed/ end'
Started GET "/rubies-in-the-rough" for 0.0.0.0 at 2011-10-25 08:40:01 +0000
  Processing by InterestsController#show as HTML
  Parameters: {"param"=>"rubies-in-the-rough"}
Rendered interests/_interest.html.erb (5.6ms)
Rendered shared/_nav_links.html.erb (1.4ms)
Rendered shared/_nav_links.html.erb (2.3ms)
Rendered interests/show.html.erb within layouts/application (15.5ms)
Completed 200 OK in 20ms (Views: 16.0ms | ActiveRecord: 1.5ms)
--
Started GET "/connect?…" for 0.0.0.0 at 2011-10-25 10:08:53 +0000
  Processing by SessionsController#create as HTML
  Parameters: {…}
Rendered users/_form.html.erb (73.2ms)
Rendered shared/_connection_links.html.erb (3.4ms)
Rendered shared/_nav_links.html.erb (0.9ms)
Rendered shared/_nav_links.html.erb (1.6ms)
Rendered sessions/new.html.erb within layouts/application (85.6ms)
Completed 200 OK in 87ms (Views: 85.3ms | ActiveRecord: 1.0ms)
--
Started GET "/auth/github" for 0.0.0.0 at 2011-10-25 10:08:58 +0000
  Processing by ConnectionsController#auth as HTML
  Parameters: {"provider"=>"github"}
Rendered public/404.html (0.1ms)
Completed 404 Not Found in 1ms (Views: 0.8ms | ActiveRecord: 0.0ms)
…

The Ruby version is a little longer, but it grabs exactly what I want without guessing or noise. (Note: that might not be the case if requests had been interleaved in the log, but grep would fair no better in that case.) I say we can call this a win.

Remember, There are Traps to Avoid

I have purposely taken you into some very dark corners of Ruby in this article. I did that to hopefully stretch your brain a bit with something you may not be familiar with.

I'm also serious about using more command-line Ruby. I promise you will learn new things, good and bad. The bad lessons are helpful too.

Remember though, we're breaking book here. Ruby prides itself on being a clean language. However, a lot of the features I've shown today are dangerously cryptic. If you write a full program using these features, you're going to step right into those traps the rules try to protect your from.

Play your own game, but watch for traps.

Comments (2)
  1. Ryan LeCompte
    Ryan LeCompte January 14th, 2012 Reply Link

    I really enjoyed this article! Here's another way to extract full requests from a rails log using Enumerable#slice_before:

    ruby -e 'ARGF.slice_before(/^Started (GET|POST|PUT|DELETE)/).each { |slice| puts slice }' development.log
    
    1. Reply (using GitHub Flavored Markdown)

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

      Ajax loader
  2. Lucas Prim
    Lucas Prim February 17th, 2012 Reply Link

    Wow! I would never have guessed that ruby could overthrow grep so beautifully! Great article!

    1. Reply (using GitHub Flavored Markdown)

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

      Ajax loader
Leave a Comment (using GitHub Flavored Markdown)

Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

Ajax loader