
11 JAN 2012

Experimenting With DATA

This post is part of a series.

In the last article, I talked about the importance of a culture that encourages experimentation. It's hard to fiddle with something and not gain a better understanding of how it works. That knowledge is valuable to us programmers. I mentioned, though, that the way Perl programmers experiment is not the same way we Rubyists do it. Let me show you some actual Ruby experimentation I've witnessed over the years…

Executing Your Email

Some of Ruby's features are fairly obscure. Even worse, some of us who use those obscure features try to bend them to even stranger purposes. This is one way Rubyists like to experiment. Ironically, the features I'm going to talk about in this article are inherited from Perl.

Ruby can literally use your email as an executable program. Assume I have the following saved in a file called email.txt:

Dear Nuby:

I just thought you would like to know what the Hello World program looks
like in Ruby.  Here's the code:

#!/usr/bin/env ruby -w

puts "Hello world!"

__END__

I hope the simplicity of that inspires you to learn more.

May Ruby Be With You,
Ruby Jedi

If we ask nicely, Ruby will happily execute the code in that email:

$ ruby -x email.txt 
Hello world!

Now, there are two features that make this possible. The less interesting of the two is the -x switch I fed to the Ruby interpreter. It throws away all contents of the passed program up to the "shebang line" (#!) that mentions ruby. That's why Ruby ignored the top of the email message.

__END__, the reason Ruby ignored the rest of the message, is a totally different story.

Getting Your DATA

If your Ruby program contains a line that is just __END__, a couple of things happen. First, Ruby stops executing code just before that line, so anything that follows the special marker is ignored. The other effect is that Ruby opens a special IO-like object (it's usually a File object, but it may be more generic if Ruby is reading the program from stdin), positions it just after the special marker, and places it in the DATA constant.

There's kind of a lot going on there, so let's walk through an example:

$ cat data.rb 
p DATA.read

__END__
Some data.
$ ruby data.rb 
"Some data.\n"

Here's what happened:

Ruby ignored what came after __END__ (the Some data. line)
Ruby opened a File object and positioned it just before the S
The code accessed that object via the DATA constant
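If you want to poke at those steps yourself, here is a minimal sketch. It writes a throwaway script to a temporary file, runs it in a child interpreter, and prints the class of DATA, its starting position, and its contents. The temporary file and the use of RbConfig.ruby to locate the interpreter are scaffolding I've assumed for the demonstration, not part of the feature itself.

```ruby
require "rbconfig"
require "tempfile"

# The script is indented inside the heredoc so this file never contains a
# bare end marker of its own; <<~ strips the indentation when it is written.
script = <<~SCRIPT
  p DATA.class
  p DATA.pos
  p DATA.read
  __END__
  Some data.
SCRIPT

output = Tempfile.create(["data_demo", ".rb"]) do |file|
  file.write(script)
  file.flush
  # Run the temporary script with the same Ruby that is running this sketch.
  IO.popen([RbConfig.ruby, file.path], &:read)
end

puts output
```

The position it reports should be the byte offset of the first character after the marker line, which is the detail the later tricks depend on.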

Hopefully that adequately explains this slightly odd feature. I believe the intended usage is for form-letter type content, like this:

$ cat generate_thanks.rb 
require "erb"

unless ARGV.size >= 2
  abort "USAGE:  #{$PROGRAM_NAME} NAME PURCHASE1 [PURCHASE2 ...]"
end
name      = ARGV.shift
purchases = ARGV

# Ruby 2.6+ takes the trim mode as a keyword; older Rubies used ERB.new(DATA.read, nil, "%")
letter = ERB.new(DATA.read, trim_mode: "%")
letter.run

__END__
Dear <%= name %>:

Thank you for the recent purchase of:

% purchases.each do |purchase|
* <%= purchase %>
% end

We hope these products don't steal too much of your work time.

The Distraction Team
$ ruby generate_thanks.rb James Skyrim Catherine
Dear James:

Thank you for the recent purchase of:

* Skyrim
* Catherine

We hope these products don't steal too much of your work time.

The Distraction Team

As you can see, it's nice not to have the code cluttered up with the huge letter String. Using __END__ we can keep the two separate, but still let them interact via DATA.

If you don't recognize the ERB template I used above, it's still the same template engine Rails uses. I just turned on a "trim mode" to allow full lines of code starting with a percent sign (%). You can do the same for templates in Rails, if you like:

config.action_view.erb_trim_mode = "%"

[Update: the above was true in old versions of Rails. Changes to the templating system in newer versions dropped this feature and the Rails core team elected not to restore it.]
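Outside of Rails, the percent trim mode is easy to try on its own. This is a small sketch that assumes Ruby 2.6 or later (for the trim_mode keyword and result_with_hash); the template text and variable values are made up for the illustration.

```ruby
require "erb"

# Quoted heredoc so Ruby leaves the template text untouched.
template = <<~'TEMPLATE'
  Dear <%= name %>:
  % purchases.each do |purchase|
  * <%= purchase %>
  % end
TEMPLATE

# trim_mode: "%" treats lines that begin with a percent sign as Ruby code.
letter = ERB.new(template, trim_mode: "%")
output = letter.result_with_hash(name: "James", purchases: %w[Skyrim Catherine])

puts output
```

Passing the variables in with result_with_hash avoids relying on top-level local variables, which is a design choice of this sketch rather than something the trim mode requires.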

A Cheat

This feature is often used to cheat an implementation of a quine, a program that outputs its own source. Since DATA is pointed at the source, we can shift it back to the beginning and read away. Observe:

$ cat quine.rb 
print DATA.tap(&:rewind).read
__END__
DO NOT DELETE:  needed for DATA
$ ruby quine.rb 
print DATA.tap(&:rewind).read
__END__
DO NOT DELETE:  needed for DATA

The process is simple. First, we need to backtrack DATA to the beginning of the file. We use rewind() for that, but it has a rather unhelpful return value. We discard that with the help of tap(). Then we can just read() the source and print() it back out.
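The rewind() and tap() interplay is easier to see in isolation. In this sketch a StringIO stands in for DATA (an assumption made so the example can run anywhere); the same methods exist on File.

```ruby
require "stringio"

io = StringIO.new("print the source\n")
io.read                      # move to the end, as if the content were already consumed

rewind_result = io.rewind    # rewind returns 0, not the IO object
io.read                      # consume again so the next call has to rewind

tapped   = io.tap(&:rewind)  # tap runs rewind, then hands back io itself
contents = tapped.read

p rewind_result
p tapped.equal?(io)
p contents
```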

It's worth noting that this really is a cheat. Most definitions of a quine forbid IO operations for what are now likely very obvious reasons. Still, it's interesting just how easily DATA cuts through this challenge.

A Built-in Lock

For some reason, this silly language feature seems to inspire us programmers. I have seen multiple Rubyists twist it to strange purposes. My favorite example comes from Daniel Berger (who stole it from those Perl guys).

Say you have a program that you only wish to run one copy of at any given time. There are many reasons you might need this, but a common one is that the script is run as a Cron job. It may do something like read records from a database and send emails to your users. If there are a ton to send and your system is bogged down, this could take a while. You don't want Cron to kick in another copy before the job finishes, because it might cause users to be emailed twice.

This is usually handled with a complex dance of having the process write out a PID file when it starts up and remove it as it finishes. When a new process starts, it can check for the existence of that PID file and exit without doing any work if it is still there. Of course, the original process may die without properly cleaning up the file, so processes that find the file should probably search the process table to make sure a job with that process ID is still running. If it isn't, they should ignore the PID file and start anyway.
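To make the comparison concrete, here is a rough sketch of that PID file dance. The helper names (process_alive? and claim_pid_file) and the file name are hypothetical, and the sketch deliberately leaves one of the edge cases in place: nothing stops two processes from passing the check at the same moment.

```ruby
require "tmpdir"

# Hypothetical helper: true if a process with this ID currently exists.
def process_alive?(pid)
  Process.kill(0, pid)   # signal 0 only checks for existence
  true
rescue Errno::ESRCH
  false                  # no such process
rescue Errno::EPERM
  true                   # it exists, but belongs to another user
end

# Hypothetical helper: claim the PID file, returning false if a live process
# already owns it. A stale file left by a dead process is overwritten.
# Note the gap between the check and the write: it is not atomic.
def claim_pid_file(path)
  if File.exist?(path)
    old_pid = File.read(path).to_i
    return false if old_pid > 0 && process_alive?(old_pid)
  end
  File.write(path, Process.pid.to_s)
  true
end

Dir.mktmpdir do |dir|
  pid_file = File.join(dir, "job.pid")
  p claim_pid_file(pid_file)   # no file yet, so we may run
  p claim_pid_file(pid_file)   # the file names a live process, so we may not
  File.delete(pid_file)        # the cleanup a well-behaved process must remember
end
```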

That's the tried and true system because it works, but it's also a pain to code up and get all of the edge cases right. Look at this trivial recreation:

$ cat exclusive.rb 
DATA.flock(File::LOCK_EX | File::LOCK_NB) or abort "Already running."

trap("INT", "EXIT")

puts "Running..."
loop do
  sleep
end

__END__
DO NOT DELETE:  used for locking
$ ruby exclusive.rb 
Running...
^Z
[1]+  Stopped                 ruby exclusive.rb
$ ruby exclusive.rb 
Already running.
$ fg
ruby exclusive.rb
^C$ ruby exclusive.rb 
Running...

This is essentially a one-liner that handles all of the scenarios above. We add a meaningless __END__ section so Ruby opens the File object for us. Then we grab an exclusive file lock on that object. We tell flock() we don't want to block waiting on that lock, so it will toss a false if we can't have it right now and hand off to our abort() call.
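The return values of flock() can be observed without starting two programs. In this sketch, two separate handles on one temporary file stand in for two copies of the script. That stand-in is an assumption that holds on systems with a native flock(2), such as Linux and macOS, where locks belong to the open handle; elsewhere the results may differ.

```ruby
require "tempfile"

results = Tempfile.create("lock_demo") do |first|
  second = File.open(first.path)     # a separate handle on the same file
  flags  = File::LOCK_EX | File::LOCK_NB

  got_first  = first.flock(flags)    # 0 when the lock is acquired
  got_second = second.flock(flags)   # false when someone else holds it
  first.flock(File::LOCK_UN)         # release, as the OS would when a process exits
  got_retry  = second.flock(flags)   # the lock is free again

  second.close
  [got_first, got_second, got_retry]
end

p results
```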

The rest of the code is just to make the example cleaner. The trap() call makes the interrupt signal from ⌃C exit quietly. The loop around sleep() simply keeps the process alive.

Now focus on the examples. I run the program and suspend it with ⌃Z. Note how it won't let me start a second copy after that, until I pull the original process back to the foreground and halt it.

The beauty of this system is that we don't have to do any cleanup. The operating system will remove the file lock as our process exits, even if it's because we crashed. That's ideal.

Never do the work you can push off on others.

That pattern comes up over and over again in programming. A lot of people complain that Ruby leaks memory (the truth of that is complex and for another post). They claim Ruby is not useable for a long running process due to this leaking. Even if the complaint were true, the conclusion doesn't follow. I use this pattern when I want a long running Ruby process:

Write the simplest event loop I can that just pulls jobs and assigns workers
Fork a process for each worker, do the work, then exit

Ruby can run forever like this. Why? Because exit() is the ultimate garbage collector. Properly cleaning up after yourself is hard. That's what operating systems are for. Leave that job to the pro.
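A bare-bones sketch of that pattern might look like the following. The run_in_worker helper, the pipe used to report results, and the toy job list are all assumptions made for the illustration, and fork() limits it to Unix-like systems.

```ruby
# Hypothetical helper: run one job in a child process and return what the
# child reported back over a pipe. Memory the job used dies with the child.
def run_in_worker(job)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.puts yield(job)   # do the work and report the result
    writer.close
    exit!(0)                 # leave at once, skipping at_exit hooks copied from the parent
  end
  writer.close
  result = reader.read.chomp
  reader.close
  Process.wait(pid)          # reap the child so it doesn't linger as a zombie
  result
end

jobs    = [1, 2, 3]          # stand-in for a real job queue
results = jobs.map { |job| run_in_worker(job) { |n| n * n } }

p results
```

Results come back as strings because they travel over a pipe; a real system would pick its own way of recording what each worker did.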

Carrying Your DATA With You

I have done my own experiments with __END__ and DATA. My efforts have been about actually storing content in DATA. Yes, I mean both reading and writing to it.

For example, let's say that I have some program that works on a Git repository. It runs through the various commits and does something expensive with them. We will say that it calculates some metrics and perhaps checks out the code for each SHA to do that. Plus, it stores the results somewhere else that we're not going to worry about for the sake of this example. But we don't want it to store duplicates.

If we pretend that DATA is read and write (it's not really intended for that), we can just toss the last SHA we worked with there. Each time we work forward from that SHA and update it to the latest.

Here's the code to do something like that:

# remember position of DATA
pos = DATA.pos

# read the last SHA processed
last = DATA.read.to_s.strip

# work with the SHA after the last (all on first run)
Dir.chdir(ARGV.first || Dir.pwd) do
  command  = "git rev-list --reverse HEAD"
  command << " ^#{last}" if last.size == 40
  shas     = `#{command}`.lines.map(&:strip)
  shas.each do |sha|
    puts "Checking out #{sha[/\A.{7}/]}, calculating metrics, " +
         "and storing results..."
  end
  last = shas.last
end

# write out the last SHA we processed
if last
  DATA.reopen(__FILE__, "r+")
  DATA.truncate(pos)
  DATA.seek(pos)
  DATA.puts last
end

__END__
NONE

The biggest chunk of work above is in the middle and you can safely ignore that section since it's just me talking to Git and faking some work. The interesting bits are at the beginning and the end.

The first trick is to memorize where DATA starts out (right after __END__), because that's the point we need to go back to when we want to update the SHA. After we've memorized that key position, we load the previous SHA (if any), and do the work.

At the end, I have to handle the fact that DATA is really just for reading. To do that, I reopen() it for reading and writing. Then it's a simple matter of replacement, since I memorized the magic position number:

Lop off the end of the file after the position
Move the write head to the new end (the position again)
Append the new SHA
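Those three steps work on any file, so they can be tried safely on a scratch copy. In this sketch the temporary file, its contents, and the short fake SHA are all made up; the position is computed by hand where the real script asks DATA for it.

```ruby
require "tempfile"

code   = "puts 'working...'\n__END__\n"
before = code + "NONE\n"

after = Tempfile.create("rewrite_demo") do |file|
  file.write(before)
  file.flush
  pos = code.bytesize           # where DATA would start: just past the marker

  File.open(file.path, "r+") do |io|
    io.truncate(pos)            # lop off everything after the marker
    io.seek(pos)                # move to the new end of the file
    io.puts "50cb651"           # append the new value
  end

  File.read(file.path)
end

puts after
```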

After I run that code on the repository for this site, the SHA is updated:

$ grep -A 1 __END__ walk_git_commits.rb 
__END__
NONE
$ ruby walk_git_commits.rb ../Documents/subinterest
Checking out 9490ee4, calculating metrics, and storing results...
Checking out f2ce511, calculating metrics, and storing results...
Checking out dfaded9, calculating metrics, and storing results...
...
$ grep -A 1 __END__ walk_git_commits.rb 
__END__
50cb651fa11de417c0db7978127ba8d06aa67f06

If I run it again immediately, it doesn't do any work (because the last recorded SHA is the latest). I can also manually update the SHA, if I want to start from some arbitrary point.

I view the current SHA as metadata for the code in this case, so this approach allows it to live with the code. If I email this script to a coworker, it will pick up at the right place in our shared repository.

You can use this trick in other areas. For example, S3 objects can have custom headers associated with them. You can squirrel away some metadata in these fields, keeping it with the object it relates to. I like this better in many cases than needing to match an S3 object to a separate database record in order to have the full picture of what I'm looking at.

No Magic Here

One important thing to realize about these tricks is how unmagical they really are. Can you do the source locking trick without DATA? Sure. You just change this code:

DATA.flock(File::LOCK_EX | File::LOCK_NB) or abort "Already running."

into this:

LOCK = open(__FILE__)  # keep a reference so the handle isn't garbage collected
LOCK.flock(File::LOCK_EX | File::LOCK_NB) or abort "Already running."

We can open the file ourselves and lock it. If we do, we don't even need the special __END__ marker.

It's the same with my rewriting example. I could just work with the file manually, though I would need to find the __END__ token myself and that's a bit of work if I want to do it as well as Ruby does.
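For a sense of what finding the token by hand involves, here is a deliberately naive sketch. The data_section helper is hypothetical. It knows nothing about Ruby syntax, so it is cruder than what the parser does.

```ruby
require "tempfile"

# Hypothetical helper: return the text after the first line consisting only
# of the end marker, or nil if the file has no such line.
def data_section(path)
  _code, data = File.read(path).split(/^__END__\r?\n/, 2)
  data
end

text = Tempfile.create(["manual_data", ".rb"]) do |file|
  file.write("p :code\n__END__\nSome data.\n")
  file.flush
  data_section(file.path)
end

p text
```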

The point isn't that DATA makes these hacks possible. It's that the feature inspired programmers to find these hacks. Inspiration is powerful. That's why it's so key that many programmers say they enjoy working with Ruby. A design goal of the language was to be friendly to humans. That matters. It helps us think and play.

I've shown you some historical experimentation in Ruby. This may or may not appeal to you. That's fine. We all enjoy different things and find inspiration in different places. The important thing is that you do find yours and exercise it.

Pop Quiz: One Last DATA Trick

When I first showed the __END__ marker, did you think you had seen it before in Sinatra? It uses the same feature, right?

Yes and no.

The problem is that a Sinatra application may not be the file executed. For example, the server may get kicked off with rackup. If it does, rackup was the executed file and DATA could only be set using the __END__ marker in that. There can only be one DATA after all. Other __END__ tokens still work for ignoring content and Sinatra definitely counts on that, but it cannot use DATA.

Given that, can you puzzle out how it seems to do the same thing? Take your best guess, then check to see if you are right.
