10
APR2008
Five ActiveRecord Tips
This article was written for the Railcasts 100th Episode Contest. I think the idea is great and I look forward to reading great tips from all who decide to participate.
1. create_or_find_by_…
I imagine most of you know that ActiveRecord
can handle finders like:
MyARClass.find_or_create_by_name(some_name)
This will attempt to find the object that has some_name
in its name
field or, if the find fails, a new object will be created with that name
. It's important to note that the order is exactly as I just listed it: find then create. Here are the relevant lines from the current Rails source showing the process:
record = find_initial(options)
if record.nil?
record = self.new { |r| r.send(:attributes=, attributes, guard_protected_attributes) }
#{'yield(record) if block_given?'}
#{'record.save' if instantiator == :create}
record
else
record
end
The above code is inside a String
literal fed to class_eval()
, which is why you see interpolation being used.
Unfortunately, this process is subject to race conditions because the object could be created by another process (or Thread
) between the find and the creation. If that happens, you are likely to run into another hardship in that calls to create()
fail quietly (returning the unsaved object). These are some pretty rare happenings for sure, but they can be avoided under certain conditions.
I find myself typically using find_or_create_by_…
for things like category groupings or geocoded locations. In my case, there always seems to be two unique properties I can count on:
- I don't allow duplicate listings
- I don't have to worry about entries being deleted from the database
When both of those conditions are true, you can rewrite the Rails helper method to be multiprocessing safe. The steps are:
- Build validations so that attempts to save the duplicate fail
- Be sure to use
create!()
instead ofcreate()
- Reverse the order to
create!()
thenfind()
Here's the practical example of how this plays out:
class Category < ActiveRecord::Base
validates_uniqueness_of :name
def self.create_or_find_by_name(name)
create!(:name => name) rescue find_by_name(name)
end
end
The idea here is that we try the create!()
first and just let it fail if it doesn't validate (because it is a duplicate). I use create!()
because it doesn't fail silently and the truth is that I generally prefer it for that very reason. You could also use create()
and check the returned object with new_record?()
, if you prefer. When the create!()
fails, we know it's safe to call the finder because there is definitely a matching entry in the database.
2. find_all_in_chunks
I often want to scan whole tables in a Rails application, usually to generate reports about the data. Of course, for very large tables, a call to find(:all)
is not generally a great idea as it uses up too much memory to slurp all of that data into Ruby objects.
I have a tiny little library I drop in most Rails projects now to help with this. It requires will_paginate, so be sure to install that plugin first. Then just drop this code in lib/find_all_in_chunks.rb
:
module FindAllInChunks
def find_all_in_chunks(query = Hash.new, &iterator)
1.upto(1.0/0.0) do |i|
records = paginate({:page => i, :per_page => 50}.merge(query))
records.each(&iterator)
break unless records.next_page
end
end
end
ActiveRecord::Base.extend(FindAllInChunks)
and add this line to config/environment.rb
:
require "find_all_in_chunks"
The code is simple. It just fetches the data in chunks by paginating the results and walking through the pages until it reaches the last one. Each entry in the page is then yield
ed to the block you pass this finder. This makes it seem like you are working with a simple each()
iterator, though it will eventually fetch all of the entries. You can pass will_paginate
's :per_page
parameter to adjust how many entries are retrieved at one time.
3. FasterCSV
as an Import Tool
If you need to jump start the database by importing some content, FasterCSV can be a big help. It can easily feed CSV data into an ActiveRecord
subclass using code like:
FCSV.foreach( csv_path, :headers => true,
:header_converters => :symbol ) do |row|
begin
MyARClass.create!(row.to_hash)
rescue ActiveRecord::RecordNotSaved # in current Rails
# handle save failures here…
end
end
This uses the first line of your CSV data as the field names, so be sure they match up to the names in your database. It's also a good idea to make sure the CSV data is in UTF-8 and set $KCODE = "U"
when using an older version of Rails that doesn't do this for you.
I tend to just vendor the FasterCSV
source and add a require for it in config/environment.rb
, so I don't have to worry about installing the gem everywhere.
4. Learn to Love ActiveRecord
's :select
Parameter
I think the most under used feature of ActiveRecord
's finders is the :select
parameter. You really owe it to yourself to play with this sucker until you get the "Ah ha!" moment.
The concept is simple: :select
tells ActiveRecord
which columns to fetch from the database.
By default, everything is fetched, but there are great reasons to exclude some fields. For example, if you are grabbing a bunch of objects from a sizable table (meaning that it has many fields) but you only need certain fields, throw the rest out! It will take ActiveRecord
less time to fetch the data and convert it to Ruby and use less memory, possibly making it practical for you to fetch more entries at once. That's a lot of great gains!
Another thing to know is that ActiveRecord
takes the field names it uses from the names provided here so use SQL's AS
to get what you want. You can even add aggregate fields and alias them so they are just like normal fields for this query.
This has a million different applications, but, to give an example, I pull some email campaign statistics by type using a query like:
statistics = Profile.find( :all
:select => "profiles.type, " +
"COUNT(DISTINCT users.id) AS emailed, " +
"COUNT(IF(profiles.viewed_offer = 1, 1, NULL)) AS viewed, " +
"COUNT(IF(profiles.signed_up = 1, 1, NULL)) AS accepted, " +
"COUNT(IF(users.email_status = 'Opted Out', 1, NULL)) AS unsubscribed",
:joins => "INNER JOIN users ON users.id = profiles.user_id " +
"INNER JOIN emailings ON emailings.user_id = users.id",
:group => "profiles.type"
)
In this database a User
has_many
Profile
s and Emailing
s are associated with User
objects. I use :joins
to link all of that together here. (The :include
option seems to override your custom :select
so I just use :joins
instead.) Note that I INNER JOIN
these tables to cut down on the data fetched. It's handy to remember that there are more options to SQL than just what ActiveRecord
uses.
The :select
parameter is the juicy part here. I setup a group of aggregate fields to track the statistics we care about, aliasing each of these to nice method names for the returned objects.
This single query combs through a ton of data very quickly and returns all the vital details we care about. Now it's important to note that what is returned here aren't really Profile
objects as we typically think of them. These are more summary objects that have a new set of methods we defined in this query (emailed()
, viewed()
, accepted()
, unsubscribed()
).
5. Sneaking in a HAVING
Clause
ActiveRecord
doesn't have a parameter for including a HAVING
clause in your database queries and I sometimes find myself needing it. Luckily, you can sneak it in on the end of your :group
parameter without needing to resort to a find_by_sql()
call. (I don't recall ever using HAVING
without GROUP BY
, though there probably are some cases where it would make sense to do so.)
As an example, here's a query from one of my Rails applications that finds all duplicate email addresses in the database:
duplicates = User.find( :all,
:select => "email, COUNT(email) AS duplicate_count",
:conditions => "email IS NOT NULL AND email != ''",
:group => "email HAVING duplicate_count > 1"
)
Comments (11)
-
Semin April 11th, 2008 Reply Link
I have a strong feeling that you are going to be a winner!
Thank you so much for the tips (especially for #2).
-
Useful knowledge. Thanks for writing up these tips.
-
will_paginate
actually includes a paginated_each method which does operate on batched groups of AR models.Very important when your set is large.
Nice tip on the
HAVING
clause.Also, I haven't played with it much, but you could probably add db-computed attributes to an
ActiveRecord
model through options specified in anamed_scope
.-
I was using an older copy of
will_paginate
that didn't yet containpaginated_each
. Thanks for pointing it out to me.
-
-
Tips 4 & 5 were a big help. BTW
I don't recall ever using
HAVING
withoutGROUP BY
, though there probably are some cases where it would make sense to do so.I suppose that would just be a
where
clause. -
Any chance you would mind sharing a syntactically correct example of usage for
find_all_in_chunks
? I'm kind of a noob and can't figure out how to use it to replacefind(:all).each
…-
Sure, you use it like this:
People.find_all_in_chunks(:per_page => 200) do |person| # … use person here … end
I wasn't aware of
paginated_each()
when I wrote it though. You can just use that with a recent version ofwill_paginate
:People.paginated_each(:per_page => 200) do |person| # … use person here … end
Hope that helps.
-
Thanks so much for that!
-
-
-
Technically, your fourth tip is wrong.
#find
is for finding records, but you're running an aggregated query. That's what#count
was made for. However, you can't perform multiple aggregations (if someone can suggest a clean API for this, I'd be more then happy to implement).However, #5 can be shortened to:
User.count(:id, :group => :email, :having => 'count(id) > 1')
. As a bonus, you can use abelongs_to
attribute.Order.count(:id, :group => :product)
. For cases where you can't use#count
, you can also do something like:User.connection.select_all(User.sanitize_sql([..., *args]))
. It's definitely not pretty, so perhaps creating pseudo models with your custom #find call isn't a big deal.Just FYI: surround source code terms like
foo_bar
with the tilde so that markdown doesn't treat it like italics. -
Sadly apparently no one in the Rails community is aware of your first topic. But even more sadly your solution actually features a race condition similar to that in the original implementation; namely two competing processes might run in the situation where the entry is not yet there. The only way I found to work around that and any those issues is by using some kind of DB level locking, see http://1rad.wordpress.com/2008/09/29/0x04-atomic-science/ I would like to see a MT safe implementation in rails, but apparently no one cares.
-
Thanks for the duplicate tip.