Rails

Stories from the trenches of my day job building Web applications.

10

APR
2008

Five ActiveRecord Tips

This article was written for the Railcasts 100th Episode Contest. I think the idea is great and I look forward to reading great tips from all who decide to participate.

1. create_or_find_by_…

I imagine most of you know that ActiveRecord can handle finders like:

MyARClass.find_or_create_by_name(some_name)

This will attempt to find the object that has some_name in its name field or, if the find fails, a new object will be created with that name. It's important to note that the order is exactly as I just listed it: find then create. Here are the relevant lines from the current Rails source showing the process:

record = find_initial(options)

if record.nil?
  record = self.new { |r| r.send(:attributes=, attributes, guard_protected_attributes) }
  #{'yield(record) if block_given?'}
  #{'record.save' if instantiator == :create}
  record
else
  record
end

The above code is inside a String literal fed to class_eval(), which is why you see interpolation being used.

Unfortunately, this process is subject to race conditions because the object could be created by another process (or Thread) between the find and the creation. If that happens, you are likely to run into another hardship in that calls to create() fail quietly (returning the unsaved object). These are some pretty rare happenings for sure, but they can be avoided under certain conditions.

I find myself typically using find_or_create_by_… for things like category groupings or geocoded locations. In my case, there always seems to be two unique properties I can count on:

  • I don't allow duplicate listings
  • I don't have to worry about entries being deleted from the database

When both of those conditions are true, you can rewrite the Rails helper method to be multiprocessing safe. The steps are:

  1. Build validations so that attempts to save the duplicate fail
  2. Be sure to use create!() instead of create()
  3. Reverse the order to create!() then find()

Here's the practical example of how this plays out:

class Category < ActiveRecord::Base
  validates_uniqueness_of :name

  def self.create_or_find_by_name(name)
    create!(:name => name) rescue find_by_name(name)
  end
end

The idea here is that we try the create!() first and just let it fail if it doesn't validate (because it is a duplicate). I use create!() because it doesn't fail silently and the truth is that I generally prefer it for that very reason. You could also use create() and check the returned object with new_record?(), if you prefer. When the create!() fails, we know it's safe to call the finder because there is definitely a matching entry in the database.

2. find_all_in_chunks

I often want to scan whole tables in a Rails application, usually to generate reports about the data. Of course, for very large tables, a call to find(:all) is not generally a great idea as it uses up too much memory to slurp all of that data into Ruby objects.

I have a tiny little library I drop in most Rails projects now to help with this. It requires will_paginate, so be sure to install that plugin first. Then just drop this code in lib/find_all_in_chunks.rb:

module FindAllInChunks
  def find_all_in_chunks(query = Hash.new, &iterator)
    1.upto(1.0/0.0) do |i|
      records = paginate({:page => i, :per_page => 50}.merge(query))
      records.each(&iterator)
      break unless records.next_page
    end
  end
end

ActiveRecord::Base.extend(FindAllInChunks)

and add this line to config/environment.rb:

require "find_all_in_chunks"

The code is simple. It just fetches the data in chunks by paginating the results and walking through the pages until it reaches the last one. Each entry in the page is then yielded to the block you pass this finder. This makes it seem like you are working with a simple each() iterator, though it will eventually fetch all of the entries. You can pass will_paginate's :per_page parameter to adjust how many entries are retrieved at one time.

3. FasterCSV as an Import Tool

If you need to jump start the database by importing some content, FasterCSV can be a big help. It can easily feed CSV data into an ActiveRecord subclass using code like:

FCSV.foreach( csv_path, :headers           => true,
                        :header_converters => :symbol ) do |row|
  begin
    MyARClass.create!(row.to_hash)
  rescue ActiveRecord::RecordNotSaved # in current Rails
    # handle save failures here…
  end
end

This uses the first line of your CSV data as the field names, so be sure they match up to the names in your database. It's also a good idea to make sure the CSV data is in UTF-8 and set $KCODE = "U" when using an older version of Rails that doesn't do this for you.

I tend to just vendor the FasterCSV source and add a require for it in config/environment.rb, so I don't have to worry about installing the gem everywhere.

4. Learn to Love ActiveRecord's :select Parameter

I think the most under used feature of ActiveRecord's finders is the :select parameter. You really owe it to yourself to play with this sucker until you get the "Ah ha!" moment.

The concept is simple: :select tells ActiveRecord which columns to fetch from the database.

By default, everything is fetched, but there are great reasons to exclude some fields. For example, if you are grabbing a bunch of objects from a sizable table (meaning that it has many fields) but you only need certain fields, throw the rest out! It will take ActiveRecord less time to fetch the data and convert it to Ruby and use less memory, possibly making it practical for you to fetch more entries at once. That's a lot of great gains!

Another thing to know is that ActiveRecord takes the field names it uses from the names provided here so use SQL's AS to get what you want. You can even add aggregate fields and alias them so they are just like normal fields for this query.

This has a million different applications, but, to give an example, I pull some email campaign statistics by type using a query like:

statistics = Profile.find( :all
  :select => "profiles.type, " +
             "COUNT(DISTINCT users.id) AS emailed, " +
             "COUNT(IF(profiles.viewed_offer = 1, 1, NULL)) AS viewed, " +
             "COUNT(IF(profiles.signed_up = 1, 1, NULL)) AS accepted, " +
             "COUNT(IF(users.email_status = 'Opted Out', 1, NULL)) AS unsubscribed",
  :joins  => "INNER JOIN users ON users.id = profiles.user_id " +
             "INNER JOIN emailings ON emailings.user_id = users.id",
  :group  => "profiles.type"
)

In this database a User has_many Profiles and Emailings are associated with User objects. I use :joins to link all of that together here. (The :include option seems to override your custom :select so I just use :joins instead.) Note that I INNER JOIN these tables to cut down on the data fetched. It's handy to remember that there are more options to SQL than just what ActiveRecord uses.

The :select parameter is the juicy part here. I setup a group of aggregate fields to track the statistics we care about, aliasing each of these to nice method names for the returned objects.

This single query combs through a ton of data very quickly and returns all the vital details we care about. Now it's important to note that what is returned here aren't really Profile objects as we typically think of them. These are more summary objects that have a new set of methods we defined in this query (emailed(), viewed(), accepted(), unsubscribed()).

5. Sneaking in a HAVING Clause

ActiveRecord doesn't have a parameter for including a HAVING clause in your database queries and I sometimes find myself needing it. Luckily, you can sneak it in on the end of your :group parameter without needing to resort to a find_by_sql() call. (I don't recall ever using HAVING without GROUP BY, though there probably are some cases where it would make sense to do so.)

As an example, here's a query from one of my Rails applications that finds all duplicate email addresses in the database:

duplicates = User.find( :all,
  :select     => "email, COUNT(email) AS duplicate_count",
  :conditions => "email IS NOT NULL AND email != ''",
  :group      => "email HAVING duplicate_count > 1"
)
Comments (11)
  1. Semin
    Semin April 11th, 2008 Reply Link

    I have a strong feeling that you are going to be a winner!

    Thank you so much for the tips (especially for #2).

    1. Reply (using GitHub Flavored Markdown)

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

      Ajax loader
  2. Andy Stewart
    Andy Stewart April 14th, 2008 Reply Link

    Useful knowledge. Thanks for writing up these tips.

    1. Reply (using GitHub Flavored Markdown)

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

      Ajax loader
  3. Duncan Beevers
    Duncan Beevers May 13th, 2008 Reply Link

    will_paginate actually includes a paginated_each method which does operate on batched groups of AR models.

    Very important when your set is large.

    Nice tip on the HAVING clause.

    Also, I haven't played with it much, but you could probably add db-computed attributes to an ActiveRecord model through options specified in a named_scope.

    1. Reply (using GitHub Flavored Markdown)

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

      Ajax loader
    2. James Edward Gray II

      I was using an older copy of will_paginate that didn't yet contain paginated_each. Thanks for pointing it out to me.

      1. Reply (using GitHub Flavored Markdown)

        Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

        Ajax loader
  4. Paul Grout
    Paul Grout July 27th, 2008 Reply Link

    Tips 4 & 5 were a big help. BTW

    I don't recall ever using HAVING without GROUP BY, though there probably are some cases where it would make sense to do so.

    I suppose that would just be a where clause.

    1. Reply (using GitHub Flavored Markdown)

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

      Ajax loader
  5. Alex
    Alex August 21st, 2008 Reply Link

    Any chance you would mind sharing a syntactically correct example of usage for find_all_in_chunks? I'm kind of a noob and can't figure out how to use it to replace find(:all).each

    1. Reply (using GitHub Flavored Markdown)

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

      Ajax loader
    2. James Edward Gray II
      James Edward Gray II August 21st, 2008 Reply Link

      Sure, you use it like this:

      People.find_all_in_chunks(:per_page => 200) do |person|
        # … use person here …
      end
      

      I wasn't aware of paginated_each() when I wrote it though. You can just use that with a recent version of will_paginate:

      People.paginated_each(:per_page => 200) do |person|
        # … use person here …
      end
      

      Hope that helps.

      1. Reply (using GitHub Flavored Markdown)

        Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

        Ajax loader
      2. Alex
        Alex August 22nd, 2008 Reply Link

        Thanks so much for that!

        1. Reply (using GitHub Flavored Markdown)

          Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

          Ajax loader
  6. rick
    rick September 19th, 2008 Reply Link

    Technically, your fourth tip is wrong. #find is for finding records, but you're running an aggregated query. That's what #count was made for. However, you can't perform multiple aggregations (if someone can suggest a clean API for this, I'd be more then happy to implement).

    However, #5 can be shortened to: User.count(:id, :group => :email, :having => 'count(id) > 1'). As a bonus, you can use a belongs_to attribute. Order.count(:id, :group => :product). For cases where you can't use #count, you can also do something like: User.connection.select_all(User.sanitize_sql([..., *args])). It's definitely not pretty, so perhaps creating pseudo models with your custom #find call isn't a big deal.

    Just FYI: surround source code terms like foo_bar with the tilde so that markdown doesn't treat it like italics.

    1. Reply (using GitHub Flavored Markdown)

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

      Ajax loader
  7. eno
    eno February 9th, 2009 Reply Link

    Sadly apparently no one in the Rails community is aware of your first topic. But even more sadly your solution actually features a race condition similar to that in the original implementation; namely two competing processes might run in the situation where the entry is not yet there. The only way I found to work around that and any those issues is by using some kind of DB level locking, see http://1rad.wordpress.com/2008/09/29/0x04-atomic-science/ I would like to see a MT safe implementation in rails, but apparently no one cares.

    1. Reply (using GitHub Flavored Markdown)

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

      Ajax loader
  8. roger
    roger August 24th, 2011 Reply Link

    Thanks for the duplicate tip.

    1. Reply (using GitHub Flavored Markdown)

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

      Ajax loader
Leave a Comment (using GitHub Flavored Markdown)

Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

Ajax loader