Rubies in the Rough

This is where I try to teach how I think about programming.

11 NOV 2011

Doing it Wrong

Continuing with my Breaking All of the Rules series, I want to peek into several little areas where I've been caught doing the wrong thing. I'm a rule breaker and I'm determined to take someone down with me!

My Forbidden Parser

In one application, I work with an API that hands me very simple data like this:

<emails>
  <email>user1@example.com</email>
  <email>user2@example.com</email>
  <email>user3@example.com</email></emails>

Now I need to make a dirty confession: I parsed this with a Regular Expression.

I know, I know. We should never parse HTML or XML with a Regular Expression. If you don't believe me, just take a moment to actually read the famous Stack Overflow answer on the subject. Yikes!

Oh, and you shouldn't validate emails with a Regular Expression either. Oops. We're talking about at least two violations here.

But it gets worse.

You may be thinking I rolled a little parser based on Regular Expressions. That might look like this:

#!/usr/bin/env ruby -w

require "strscan"

class EmailParser
  def initialize(data)
    @scanner = StringScanner.new(data)
  end

  def parse(&block)
    parse_emails(&block)
  end

  private

  def parse_emails(&block)
    @scanner.scan(%r{\s*<emails>\s*}) or fail "Failed to match list start"
    loop do
      parse_email(&block) or break
    end
    @scanner.scan(%r{\s*</emails>}) or fail "Failed to match list end"
  end

  def parse_email(&block)
    if @scanner.scan(%r{<email>\s*})
      if email = @scanner.scan_until(%r{</email>\s*})
        block[email.strip[0..-9].strip]  # trim space, drop the trailing "</email>" (8 chars), trim again
        return true
      else
        fail "Failed to match email end"
      end
    end
    false
  end
end

EmailParser.new(ARGF.read).parse do |email|
  puts email
end

If you aren't familiar with StringScanner, it's a standard Ruby library that just wraps a String and tracks your current position in that String. You can then throw Regular Expressions at the wrapped String, like I did here with scan() and scan_until(). Whenever an expression matches, your position is advanced past it. Future expressions will then be tested at the new position. A failed match has no effect on the position.
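A tiny session shows that position tracking in action (the sample String here is mine, not from the application):

```ruby
require "strscan"

scanner = StringScanner.new("<email>user@example.com</email>")

scanner.scan(/<email>/)                  # matches; position advances past the tag
scanner.scan(/<nope>/)                   # no match; returns nil, position unchanged
rest = scanner.scan_until(%r{</email>})  # everything from the current position
                                         # up to and including the close tag
rest  # => "user@example.com</email>"
```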

With that understanding, the example above should be pretty easy to follow. It just looks for the start of the emails list, works through each email in the list, and then ensures the list is closed properly. parse() takes a block and calls it each time a matching email is found, to facilitate iterating over the emails.

That's great to know, but it's not what I did. I said I parsed it with a Regular Expression. My real solution looked more like this:

#!/usr/bin/env ruby -wKU

ARGF.read.scan(/[^>\s@]+@[^\s@]+\.[^\s@<]+/) do |email|
  puts email
end

If you have any sense, you should be scared blind by this point. Fair enough, but let's talk it through. Then you can lynch me.

My Regular Expression just hunts through the data for something that roughly looks like an email address. It's super low tech.

There are also new violations in this example, for those keeping score:

  • My email-like expression doesn't match some valid emails
  • With two minor character exceptions, I'm ignoring the XML completely

Alright, it's time to actually address all of these violations that are piling up. Let's start with: don't validate emails with a Regular Expression.

I'm not.

Man, that was easy. One down.

No, seriously, I'm not. I'm just hunting for emails in some data. Perhaps the API validates them before it sends them down. Perhaps we will go on to validate the addresses after we have them, most likely by actually sending a message to the address. None of this is relevant to what I'm doing here. I'm just working with what I was given.

As always though, I would rather ask why this rule exists. What is it trying to protect me from? In this case, it's two things. First, it's difficult to match a well-formed email address. The specification is quite complex. I bet you could do it with a Regular Expression (I've never tried), but it's not going to be pretty.

Honestly, I think that's an argument in my favor. I don't worry about validating at all, because you shouldn't. I look for an email-ish thing and leave validation to the code properly suited to the task.

On a related note, my email-like expression isn't perfect. It doesn't match the entire email specification. For example, james@localhost is a valid email address that I don't match. However, that address would be for a specific machine I couldn't send email to anyway. Given that, I'm inclined to call this a feature instead of a bug.
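A quick check confirms both behaviors (the constant name is mine, just to avoid retyping the pattern):

```ruby
EMAILISH = /[^>\s@]+@[^\s@]+\.[^\s@<]+/

"james@localhost".scan(EMAILISH)    # => [] -- no dot in the domain, so no match
"james@example.com".scan(EMAILISH)  # => ["james@example.com"]
```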

I'm not sure what the specification says about emails containing < or >, to be painfully honest. But those characters should be escaped in XML anyway, so I get a bye on that issue.

Parsing XML with a Regular Expression? Again, I'm going to argue that it never happened. We already decided that I ignored the XML, right? So I'm off the hook there.

Still, why is it that we are expected to use a real parser over anything else? Well, XML can be complicated, right? I mean, this XML is still a valid representation of the same list:

<emails>
  <email>
    user1@example.com
  </email>
  <email>
    user2@example.com
  </email>
  <email>
    user3@example.com
  </email></emails>

A parser just handles differences like that for you. But you know what? So does my solution. Try it!
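For example, running the whitespace-heavy version above through the same expression (a quick sketch, with the data inlined):

```ruby
data = <<~XML
  <emails>
    <email>
      user1@example.com
    </email>
  </emails>
XML

# whitespace characters terminate both sides of the match,
# so the indentation and newlines fall away on their own
data.scan(/[^>\s@]+@[^\s@]+\.[^\s@<]+/)  # => ["user1@example.com"]
```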

OK, but the XML could really change. Let's say it became this:

<users>
  <user>
    <first>User</first>
    <last>One</last>
    <email>user1@example.com</email>
  </user>
  <user>
    <first>User</first>
    <last>Two</last>
    <email>user2@example.com</email>
  </user>
  <user>
    <first>User</first>
    <last>Three</last>
    <email>user3@example.com</email>
  </user></users>

Now that's an interesting change, because a parser-based solution would almost certainly need updates to handle it. My solution, however, still works just the same.

In fact, since I am ignoring the XML, I can handle non-XML data such as this boring text list:

user1@example.com
user2@example.com
user3@example.com
…

Not that my solution is perfect, of course. It wouldn't handle a CSV list without changes, for example. Still, it does seem surprisingly flexible.
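To see that flexibility, here's the same expression run over both a condensed version of the nested <users> XML and the plain text list (sample data trimmed down from the examples above):

```ruby
EMAILISH = /[^>\s@]+@[^\s@]+\.[^\s@<]+/

xml  = "<users><user><first>User</first>" \
       "<email>user1@example.com</email></user></users>"
text = "user1@example.com\nuser2@example.com\n"

# the surrounding markup, or lack of it, makes no difference
xml.scan(EMAILISH)   # => ["user1@example.com"]
text.scan(EMAILISH)  # => ["user1@example.com", "user2@example.com"]
```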

In the end though, it comes down to one point for me. If this thing breaks, I'm out a few minutes of work. It was very little effort to set up and it worked fine for our needs at the time. If that changes, I can always replace it.

This just didn't seem like a huge issue. Given that, I didn't want to spend a lot of energy on it unless I was sure it was needed. A solution should generally be in the same scale as the problem itself, in my opinion.

There's a Wrong Time for Everything

Let's move on to a separate issue.

You know how you should schedule all of that pesky background work for off-peak hours so you don't hassle the users? Well, I'm not very good at that either. Have a look at these two /etc/crontab entries from this application's server:

00 17 * * *  deploy  … rake payments:schedule_due …
30 20 * * *  deploy  … rake reports:email_admin_statistics …

These times are UTC. Adjusted for that, the first one lands right around lunchtime where I live, in the middle of Central Time (11 AM or 12 PM, depending on Daylight Saving Time). That means I'm doing the busy work right in the middle of the day. In fact, I chose that time because I am in the middle of the U.S., so adjusting for the time zones around me should still land it mid-morning or mid-afternoon. I wanted to make sure it's in the middle of most users' days.
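The arithmetic is easy to double-check in Ruby; getlocal() with an explicit offset sidesteps whatever time zone your machine happens to be in (the dates here are arbitrary examples):

```ruby
# 17:00 UTC on a January day (CST, UTC-06:00) and a July day (CDT, UTC-05:00)
winter = Time.utc(2011, 1, 15, 17, 0)
summer = Time.utc(2011, 7, 15, 17, 0)

winter.getlocal("-06:00").hour  # => 11
summer.getlocal("-05:00").hour  # => 12
```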

Why? Well, that first job triggers the submission of due payments. There are a few steps to that process, but it will generally be resolved within a couple of hours. That means anything in the process requiring user attention, like a failed payment, should surface while users have a decent chance of being available to do something about it.

To further that goal, the second job kicks in mid-afternoon and sends me a report of how everything went today. If something does need addressing, I'm probably still at work to handle it. My users are probably still around too, should I need to discuss anything with them.

You get the idea. We do work on off-times so it doesn't bother our users. However, if that work might need to involve users, doing it on their time can totally make sense.

Serial Storage Violations

If I'm willing to violate parsing and processing laws, you just know my database usage has to be suspect. You're right.

One of the rules for relational data storage is that we shouldn't serialize a bunch of data into a single field. One way we sometimes see that done in Rails is with code like this:

class Transaction < ActiveRecord::Base
  serialize :data
end

By default, that will convert whatever Ruby object is assigned to data into YAML for storage. Then, when you retrieve the record, the data field will be restored to its (likely non-String) form. You could even arrange to have the serialization done in a binary format or whatever you like; YAML is just the default.
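Under the hood, that default behavior is just a YAML round trip. A minimal sketch of the mechanism, outside of ActiveRecord, with made-up field names:

```ruby
require "yaml"

fields   = { "amount" => "19.99", "status" => "settled" }
stored   = YAML.dump(fields)  # the String written to the database column
restored = YAML.load(stored)  # rebuilt when the record is read back

restored == fields  # => true
```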

Why don't we want to store data like that? Mainly because it can be difficult, or even impossible, to query. That's painful because it gives up one of the primary reasons to use a relational database in the first place.

Of course, there's always an assumption hidden in rules like this. This time it is: you will need to query by the serialized data.

In the example shown above, it's not going to happen. A Transaction object in that system records fields passed to the application in response to API calls. An API typically hands back quite a bit of data, and we usually care about just a couple of fields. Of course, we may want that data someday. You never know.

To handle that, a Transaction is created for each API response. A Hash of all provided fields is stuffed into data and serialized into the database. Then the system goes on to access just the fields it cares about and ignore everything else, knowing it was tucked away for safe keeping.

If we ever do need to pull those Transaction objects, we sure as heck won't be doing it by data we ignored; we wouldn't even know what's in there to put in a query. Instead, each Transaction is tied to the relevant related record with a separate foreign key field. We would have to query on that to have any kind of context.

Plus, we get a big bonus for using this approach. The API can change the fields they are giving us and that likely wouldn't affect anything. Going one step further, we could switch to a totally different API and this setup wouldn't need to change at all. In this case, not tying ourselves to specific fields gives us flexibility and only costs us what we don't need. It's a double win.

Yes, There's a Point

Rules are great. I love them and I hand them out as much as any other programmer. Most of the time, they will help you.

Programmers need to get into the habit of dissecting rules though. You always need to be asking, "Why are they telling me this?"

Once you understand the reasoning, you can follow the rule when it makes sense, but put it aside when it's not needed. That will open up all sorts of new options for you.
