Gray Soft / Key-Value Stores

Tokyo Cabinet's Key-Value Database Types

2014-04-19T02:04:47Z

We've taken a good look at Tokyo Cabinet's Hash Database, but there's a lot more to the library than just that. Tokyo Cabinet supports three other kinds of databases. In addition, each database type accepts various tuning parameters that can be used to change its behavior. Each database type and setting involves different tradeoffs so you really have a lot of options for turning Tokyo Cabinet into exactly what you need. Let's look into some of those options now.

The B+Tree Database

Tokyo Cabinet's B+Tree Database is a little slower than the Hash Database we looked at before. That's its downside. However, giving up a little speed gains you several extra features that may just allow you to work smarter instead of faster.

The B+Tree Database is a more advanced form of the Hash Database. What that means is that all of the stuff I showed you in the last article still applies. You can set, read, and remove values by keys, iteration is supported, and you still have access to the neat options like adding to counters. With a B+Tree Database you get all of that and more.

The first major addition is that a B+Tree Database is ordered. You don't really need to do anything to turn this on, it's just the way it is. As you add pairs to the database, they will be ordered by the keys you use. The default ordering is lexical:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("ordered.tcb") do |db|
  db[:c] = 3
  db[:a] = 1
  db[:b] = 2
  db.to_a  # => [["a", "1"], ["b", "2"], ["c", "3"]]
end

This simple example shows us a couple of things. First, creating a B+Tree Database is as simple as changing the file extension. Remember when I said the .tch stood for Tokyo Cabinet Hash Database? Well, it shouldn't be too surprising that .tcb stands for Tokyo Cabinet B+Tree Database. Oklahoma Mixer will notice which extension you use and load the right features for that database type.

The other thing to notice here is the ordering. I purposely added the keys out of order, but you can see that to_a() shows them all lined up correctly. Now to_a() is really just an iterator the database object inherits from Enumerable, so we now know that iteration will be in database order. Methods like keys() and even values() will also return their listings in order as well.

As I said, the default ordering is lexical, so numeric keys are a little strange:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("lexical.tcb") do |db|
  db[1]  = :first
  db[2]  = :middle
  db[11] = :last
  db.to_a  # => [["1", "first"], ["11", "last"], ["2", "middle"]]
end

Notice they don't come out in the order we would probably think is most natural, as I described in the values. To fix that we need to change the default ordering and you can do that using a tuning parameter of the B+Tree Database. We are allowed to set a comparison function when we open() the database that will order the keys however we desire. This function is just like a block you would pass to sort() in Ruby: it will be handed two keys at a time to compare and it is expected to return negative, zero, or positive for the first argument being less than, equal to, or greater than the second. The good news is, you can usually cheat your way out of remembering these comparison rules by leaning on Ruby's spaceship operator:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open( "numerical.tcb",
                    :cmpfunc => lambda { |a, b| a.to_i <=> b.to_i } ) do |db|
  db[1]  = :first
  db[2]  = :middle
  db[11] = :last
  db.to_a  # => [["1", "first"], ["2", "middle"], ["11", "last"]]
end

This example shows how tuning parameters get set with Oklahoma Mixer. Just pass some keyword arguments to open() for each parameter you need to adjust. This allows Oklahoma Mixer to perform the needed setup before connecting to your database. That's critical for things like a B+Tree comparison function that have to be set before the database is accepting data.

It's worth noting that the comparison function is not stored in the database file and needs to be reset (to the same function if you want to avoid unpredictable results) each time you open() that database.
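
If it helps to see the comparison contract outside of a database, here's a rough plain-Ruby sketch. It only assumes that keys arrive as Strings, as they do in the examples above. The constant names are mine and aren't part of Oklahoma Mixer.

```ruby
# Keys as the database would hand them to a comparison function: Strings.
NUMBER_KEYS = %w[1 2 11]

# Lexical ordering compares character by character.
LEXICAL = lambda { |a, b| a <=> b }

# Numeric ordering converts each key first, like the :cmpfunc shown above.
NUMERICAL = lambda { |a, b| a.to_i <=> b.to_i }

p NUMBER_KEYS.sort(&LEXICAL)    # => ["1", "11", "2"]
p NUMBER_KEYS.sort(&NUMERICAL)  # => ["1", "2", "11"]

# The result is negative, zero, or positive, just as sort() expects.
p LEXICAL.call("2", "11")    # => 1
p NUMERICAL.call("2", "11")  # => -1
```

Swapping one lambda for the other on the same keys gives a different order, which is why reopening a database with a different function is asking for trouble.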

OK, enough about ordering. What else do we get with the B+Tree Database?

You also get key ranges. Since the database has an inherent order, we're no longer limited to :prefix searches of the keys() and we can now ask for all of the keys between two endpoints:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("ranges.tcb") do |db|
  db.update(:a => 1, :b => 2, :c => 3, :d => 4, :e => 5)
  db.keys(:range => "ab".."d")   # => ["b", "c", "d"]
  db.keys(:range => "ab"..."d")  # => ["b", "c"]
end

Note that I used "ab" in my Range queries which is really between the actual "a" and "b" keys in the database. That works just fine.

You can also pass the :limit option I've shown before with :range, but you can't pass :prefix. It's one or the other: :prefix or :range.
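
The endpoint rules are the same ones Ruby's own Range uses for Strings. Here's a small sketch of that idea in plain Ruby. The keys_in_range() helper is hypothetical, and it says nothing about how Tokyo Cabinet actually performs the lookup.

```ruby
# The keys from ranges.tcb, already in lexical order.
RANGE_KEYS = %w[a b c d e]

# Hypothetical helper: keep the keys a Range of Strings covers.
def keys_in_range(keys, range)
  keys.select { |key| range.cover?(key) }
end

p keys_in_range(RANGE_KEYS, "ab".."d")   # => ["b", "c", "d"]
p keys_in_range(RANGE_KEYS, "ab"..."d")  # => ["b", "c"]
```

"a" sorts before "ab", so it falls outside both ranges, while the three-dot Range leaves off the "d" endpoint.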

This ability to work with a Range of keys is even extended to the iterators. You've always had the ability to stop iterating whenever you like using Ruby's break keyword, but now you can tell the iterators where to start, making it possible to iterate over a subset of the pairs in the database:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("ranges.tcb") do |db|
  db.update(:a => 1, :b => 2, :c => 3, :d => 4, :e => 5)
  db.each("ab") do |key, value|
    puts "%p => %p" % [key, value]
    break if key >= "d"
  end
  # >> "b" => "2"
  # >> "c" => "3"
  # >> "d" => "4"
end

Again, I used "ab" and it jumped to the first key after that. The only place that might get a little confusing is if you try that same trick with the (also added to B+Tree Databases) reverse_each() iterator:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("ranges.tcb") do |db|
  db.update(:a => 1, :b => 2, :c => 3, :d => 4, :e => 5)
  db.reverse_each("ddd") do |key, value|
    puts "%p => %p" % [key, value]
    break if key <= "b"
  end
  # >> "e" => "5"
  # >> "d" => "4"
  # >> "c" => "3"
  # >> "b" => "2"
end

See how it started with "e"? It always jumps to the first key equal to or after the one you provide, even if you are planning to iterate backwards. Since "ddd" is between "d" and "e", that means we start on the key after "ddd" ("e").
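
One way to picture that jump is with a sorted Array. This is only a model of the behavior described above. The method names are made up, and returning nothing when no key qualifies is my assumption, not something I've checked against the library.

```ruby
CURSOR_KEYS = %w[a b c d e]

# Index of the first key equal to or after +start+ (nil if none).
def jump_index(keys, start)
  keys.index { |key| key >= start }
end

# Walk forward from the jump point.
def forward_from(keys, start)
  i = jump_index(keys, start)
  i ? keys[i..-1] : []
end

# Walk backward from the very same jump point.
def backward_from(keys, start)
  i = jump_index(keys, start)
  i ? keys[0..i].reverse : []
end

p forward_from(CURSOR_KEYS, "ab")    # => ["b", "c", "d", "e"]
p backward_from(CURSOR_KEYS, "ddd")  # => ["e", "d", "c", "b", "a"]
```

Both directions share jump_index(), which is why the backward walk begins at "e" instead of "d".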

B+Tree Databases have one more feature and it's a wild one. These databases support an additional storage mode that allows duplicate values to be stored under the same key:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("dupes.tcb") do |db|
  %w[James Dana Baby].each do |name|
    db.store("Gray", name, :dup)
  end
  db.to_a  # => [["Gray", "James"], ["Gray", "Dana"], ["Gray", "Baby"]]
end

As you can see, the :dup storage mode shuts off the normal value replacing behavior and instead inserts the duplicate value after what was already stored for that key.

Several methods in Oklahoma Mixer have been expanded to support these duplicate values. For example, with a B+Tree Database you can scope values() or size() to a specific key, retrieving just the values() stored under that key or getting a count of how many values there are for that key:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("names.tcb") do |db|
  db["Matsumoto"] = "Yukihiro"
  %w[James Dana].each do |name|
    db.store("Gray", name, :dup)
  end
  db.values         # => ["James", "Dana", "Yukihiro"]
  db.values("Gray") # => ["James", "Dana"]
  db.size           # => 3
  db.size("Gray")   # => 2
end

You will need to use these methods to work with duplicate values because normal indexing, fetch(), and delete() still just work with the first value stored under a key. That behavior can be valuable too though:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("todo.tcb") do |db|
  if db.size.zero?
    %w[B+tree Fixed-length tuning].each do |topic|
      db.store(:blog, topic, :dup)
    end
  end

  puts "Write about #{db.delete(:blog)}."
end

If I run that program three times, this is what I see:

$ ruby tc_example.rb 
Write about B+tree.
$ ruby tc_example.rb 
Write about Fixed-length.
$ ruby tc_example.rb 
Write about tuning.

See how delete() just kept pulling the first value that was left? That allowed us to use it as a simple queue in this case.

The delete() method can be passed the :dup storage mode as a second argument. When you do, all values under the passed key will be removed.

When working with duplicates, be aware that keys() and each_key() (or any iterator) behave differently. keys() returns a unique list, so keys with duplicate values under them will only be listed once. Iteration walks each pair in the database though, so a key will come up once for each value stored under it. Put another way, iteration does show duplicates while keys() won't.
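
Here's that difference modeled with a plain Array of pairs. It's an illustration only. The constants are mine and no database is involved.

```ruby
# Pairs in the order iteration would walk them.
NAME_PAIRS = [%w[Gray James], %w[Gray Dana], %w[Matsumoto Yukihiro]]

# Iteration sees every pair, so "Gray" comes up twice.
ITERATED_KEYS = NAME_PAIRS.map { |key, _| key }

# keys() reports each key once.
UNIQUE_KEYS = ITERATED_KEYS.uniq

p ITERATED_KEYS  # => ["Gray", "Gray", "Matsumoto"]
p UNIQUE_KEYS    # => ["Gray", "Matsumoto"]

# Scoping to one key, in the spirit of values("Gray") and size("Gray").
GRAY_VALUES = NAME_PAIRS.select { |key, _| key == "Gray" }.map { |_, v| v }
p GRAY_VALUES       # => ["James", "Dana"]
p GRAY_VALUES.size  # => 2
```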

Let me show one last, slightly bigger example to bring together all of the features discussed above. Here's a little more involved queuing system:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

GROUPS = {nil => 0, "critical" => 1, "normal" => 2, "low" => 3}

order = lambda { |a, b| 
  a_group, a_priority = a.split(":")
  b_group, b_priority = b.split(":")
  [GROUPS[a_group], -a_priority.to_i] <=> [GROUPS[b_group], -b_priority.to_i]
}
OklahomaMixer.open("queue.tcb", :cmpfunc => order) do |db|
  case ARGV.shift
  when "add"
    group = "normal"
    ARGV.delete_if { |o| o =~ /\A--(critical|low)\z/ and group = $1 }
    priority = 10
    ARGV.delete_if { |o| o =~ /\A-(\d+)\z/ and priority = $1.to_i }
    db.store("#{group}:#{priority}", ARGV.join("; "), :dup)
  when "list"
    db.each do |key, value|
      puts key
      puts "  #{value}"
    end
  when "do_one"
    if key = db.keys(:limit => 1).first and job = db.delete(key)
      eval(job)
    end
  when "do_all"
    loop do
      if key = db.keys(:limit => 1).first and job = db.delete(key)
        eval(job)
      else
        break
      end
    end
  else
    abort "Usage:  #{$PROGRAM_NAME} add|list|do_one|do_all [OPTIONS]"
  end
end

Most of that should be pretty straightforward code after all we've talked about, but let me point out one tricky spot. I had to add the nil => 0 entry to my GROUPS because fetching a full, or in this case :limited, set of keys() is really just a :prefix query with an empty :prefix. Because of that, you want to make sure your ordering functions always order an empty String key before anything else. Calling split() on the empty String gives me a nil group, which is converted to a 0 so it will come out first.
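
You can check the empty String case without a database at all. This sketch reuses the comparison logic from the queue program under different constant names and sorts an ordinary Array with it.

```ruby
QUEUE_GROUPS = {nil => 0, "critical" => 1, "normal" => 2, "low" => 3}

QUEUE_ORDER = lambda { |a, b|
  a_group, a_priority = a.split(":")
  b_group, b_priority = b.split(":")
  [QUEUE_GROUPS[a_group], -a_priority.to_i] <=>
    [QUEUE_GROUPS[b_group], -b_priority.to_i]
}

# Splitting the empty String yields no fields, so group and priority are nil.
p "".split(":")  # => []

# nil maps to group 0 and nil.to_i is 0, so "" sorts before every real key.
mixed = ["low:10", "critical:10", "", "critical:100", "normal:10"]
p mixed.sort(&QUEUE_ORDER)
# => ["", "critical:100", "critical:10", "normal:10", "low:10"]
```

Without the nil entry, GROUPS[a_group] would be nil for the empty key, and comparing the two Arrays would return nil instead of a number.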

It's probably also worth pointing out that I could have just used each() with the do_all command. However, always fetching the first key and using that is a little better in a multiprocessing environment where other processes might be adding to the queue. If I'm iterating through the list, I won't see new critical jobs if they are added above where I am at. Using keys() though, I'll always grab the most important job next. This code isn't really built for multiprocessing to tell the truth, but let's save that discussion for a later article.

Anyway, here's an example of me playing around with the program above, so you can see how it works in practice:

$ ruby queue.rb add 'puts "An average job."'
$ ruby queue.rb add --low 'puts "This can wait..."'
$ ruby queue.rb add --critical 'puts "Very important."'
$ ruby queue.rb add --critical -100 'puts "Most important!"'
$ ruby queue.rb list
critical:100
  puts "Most important!"
critical:10
  puts "Very important."
normal:10
  puts "An average job."
low:10
  puts "This can wait..."
$ ruby queue.rb do_one
Most important!
$ ruby queue.rb do_one
Very important.
$ ruby queue.rb do_all
An average job.
This can wait...

To summarize, the B+Tree Database gives you ordering, key ranges and cursor based iteration (the ability to skip to a specific key), and duplicate storage. You pay a speed penalty for these added features though. That's the tradeoff.

The Fixed-length Database

Another type of database supported by Tokyo Cabinet is the Fixed-length Database. It too is an extension of the Hash Database, supporting most of the features I showed you in that article. However, I'm not going to lie to you, this database type comes with three significant restrictions.

First, all keys are Integers greater than 0. You can't use arbitrary Strings as you do with the Hash and B+Tree Databases. As such, you lose the ability to do :prefix queries on keys(). The database is ordered though, similar to the B+Tree Database. You can't change this ordering, but it is done numerically (instead of lexically) since all keys are just Integers anyway. Given that, :range queries on keys() are supported. Methods like keys() and the iterators will pass you Integer keys in Ruby, instead of the String keys you get with the other database types.

The second downside is that all values stored have a fixed length, which is what gives the database its name. This length defaults to 255, but you can tune it to anything you like with the :width tuning parameter:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("four.tcf", :width => 4) do |db|
  db.update( 1 => :one,
             2 => :two,
             3 => :three,
             4 => :four,
             5 => :five,
             6 => :six,
             7 => :seven,
             8 => :eight )
  db.each_value do |num|
    puts num
  end
  # >> one
  # >> two
  # >> thre
  # >> four
  # >> five
  # >> six
  # >> seve
  # >> eigh
end

Notice how everything beyond my selected :width of 4 was just silently discarded. That's the fixed-length at work.

Also note that you create a Tokyo Cabinet Fixed-length Database as you probably expect by now, with the file extension .tcf.

Finally, the Fixed-length Database has one last size limit. The overall file size of the database is limited to 268435456 bytes, by default. This too can be tuned using the :limsiz tuning parameter and you are free to make the limit quite large. Just remember that values are fixed length, so setting :width => 1024, :limsiz => 4 * 1024 will mean your database only holds four keys. Trying to add data beyond this limit will raise an OklahomaMixer::Error::CabinetError.

That's a lot of minuses and you are probably wondering why anyone would be willing to accept all of these limits when we've already seen that there are more powerful options. The answer is performance. The Fixed-length Database is Tokyo Cabinet's fastest weapon. It treats the database file as a raw array of bytes and it can jump straight to any value with simple math. (Due to this, defrag() isn't supported on a Fixed-length Database, though Oklahoma Mixer does provide a no-op just to match the interface of the other database types.) That makes it wicked quick. If your data storage needs are simple enough to fit within these limitations, you can take advantage of this added speed boost.
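
To give a feel for the "simple math" involved, here's a toy, in-memory stand-in written in plain Ruby. The FixedSlots class is hypothetical. Real Tokyo Cabinet files also carry a header and per-record bookkeeping that this sketch ignores, so treat the capacity figure as the same back-of-the-envelope estimate used above.

```ruby
# A toy fixed-length store backed by one String of bytes.
class FixedSlots
  attr_reader :slots

  def initialize(width, limit)
    @width = width
    @slots = limit / width              # how many values fit under the limit
    @bytes = "\0".b * (@slots * width)  # one flat run of bytes
  end

  def []=(key, value)
    raise IndexError, "key out of range" unless key.between?(1, @slots)
    # Truncate to the width, then pad so every slot is the same size.
    @bytes[offset(key), @width] = value.to_s.b[0, @width].ljust(@width, "\0")
  end

  def [](key)
    raise IndexError, "key out of range" unless key.between?(1, @slots)
    @bytes[offset(key), @width].delete("\0")
  end

  private

  # Keys start at 1, so slot N begins (N - 1) * width bytes in.
  def offset(key)
    (key - 1) * @width
  end
end

db = FixedSlots.new(4, 32)
db[3] = :three
p db[3]                                # => "thre"
p FixedSlots.new(1024, 4 * 1024).slots  # => 4
```

There's no searching anywhere in that class. Reading or writing a value is one multiplication and one slice, which is the whole appeal.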

The Fixed-length Database interface does have one other neat feature I should mention. It supports four special key names: :min, :max, :prev, and :next. You can use these values in many of the methods that take keys. For example:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("special.tcf") do |db|
  db.update( 1  => :first,
             2  => :middle,
             42 => :last )
  db[:min]  # => "first"
  db[:max]  # => "last"
  db[:next] = :added
  db.keys  # => [1, 2, 42, 43]
  db[43]   # => "added"
end

Be careful when using these. :min and :max will raise an OklahomaMixer::Error::CabinetError if there are no keys in the database. :prev (not shown above) is even pickier, requiring a :min key that is above 1, so it can safely add below it without hitting 0. I find :next pretty useful though, as it makes it possible to queue up values. Here's the simple queue example I showed in the B+Tree code rewritten to use a Fixed-length Database instead:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("todo.tcf") do |db|
  # load the data
  %w[B+tree Fixed-length tuning].each do |topic|
    db[:next] = "Write about #{topic}."
  end

  # read it back
  loop do
    begin
      puts db.delete(:min)
    rescue OklahomaMixer::Error::CabinetError  # no keys for :min
      break
    end
  end
  # >> Write about B+tree.
  # >> Write about Fixed-length.
  # >> Write about tuning.
end

To summarize, the B+Tree Database may slow you down a little, but the Fixed-length speeds you up, as long as you can accept certain restrictions. Select the database type that best fits the needs of your data.

Tuning Parameters

We've seen how to set tuning parameters in the examples above and already learned what some do. I'm going to save some parameters for discussions in later articles, when we talk about their specific functions. For now though, here are most of the tuning parameters available to the three databases we've covered so far.

:bnum (for Hash or B+Tree—can be used with optimize()): Specifies the number of elements to use in the bucket array. The default is 131071 for Hash Databases and 32749 for B+Tree Databases. The suggested size is from 0.5 to 4 times the total number of records stored for Hash Databases or 1 to 4 times the total for B+Tree Databases.
:apow (for Hash or B+Tree—can be used with optimize()): Specifies the size of record alignment as a power of 2. The default is 4 for Hash Databases and 8 for B+Tree Databases, meaning 2 ** 4 = 16 and 2 ** 8 = 256 respectively.
:fpow (for Hash or B+Tree—can be used with optimize()): Specifies the maximum number of elements in the free block pool as a power of 2. The default is 10, meaning 2 ** 10 = 1024.
:opts (for Hash or B+Tree—can be used with optimize()): Specifies the options for the database in a String of recognized character codes. There are no options by default, but this is commonly set to "ld" or "lb" for bigger databases. The options are:
- "l" allows the database file to grow large (over 2 GB) by using a 64-bit bucket array.
- "d" compresses each record in a Hash Database or page in a B+Tree Database with Deflate compression.
- "b" compresses each record in a Hash Database or page in a B+Tree Database with BZIP2 compression.
- "t" compresses each record in a Hash Database or page in a B+Tree Database with TCBS compression.
:rcnum (for Hash): Specifies the maximum number of records to be cached. It is 0 or disabled by default.
:xmsiz (for Hash or B+Tree): Specifies the size of extra mapped memory. The default is 67108864 for Hash Databases or 0 (disabled) for B+Tree Databases.
:dfunit (for Hash or B+Tree): Specifies the auto defragmentation unit step number. It is 0 or disabled by default.
:cmpfunc (for B+Tree): Specifies the comparison function used to order B+Tree Databases. See the detailed examples above for an explanation.
:lmemb (for B+Tree—can be used with optimize()): Specifies the number of members in each leaf page. The default is 128.
:nmemb (for B+Tree—can be used with optimize()): Specifies the number of members in each non-leaf page. The default is 256.
:lcnum (for B+Tree): Specifies the maximum number of leaf nodes to be cached. The default is 1024.
:ncnum (for B+Tree): Specifies the maximum number of non-leaf nodes to be cached. The default is 512.
:width (for Fixed-length—can be used with optimize()): Specifies the width of values in Fixed-length Databases. See the detailed examples above for an explanation.
:limsiz (for Fixed-length—can be used with optimize()): Specifies the limit on database file size in Fixed-length Databases. See the detailed examples above for an explanation.

I apologize for keeping the cryptic names in Oklahoma Mixer, but I felt it was better to stick with what Tokyo Cabinet uses so users could read about them in documentation and other resources for that library. Tokyo Tyrant also uses these names to configure a database by command-line, so you will find them in several different contexts.

Database objects have an optimize() method that can be used to modify the tuning parameters of an open() database. The parameters that can be used as such are noted above. There are sometimes additional restrictions though. For example, the :limsiz of a Fixed-length Database usually has to be increased when changed through optimize().

That covers the various key-value database types in Tokyo Cabinet. The fourth type is quite a bit different from what we've looked at so far, so I'll do a full article on it next.

Tokyo Cabinet as a Key-Value Store

2014-06-05T18:57:10Z

Like most key-value stores, Tokyo Cabinet has a very Hash-like interface from Ruby (assuming you use Oklahoma Mixer). You can almost think of a Tokyo Cabinet database as a Hash that just happens to be stored in a file instead of memory. The advantage of that is that your data doesn't have to fit into memory. Luckily, you don't have to pay a big speed penalty to get this disk-backed storage. Tokyo Cabinet is pretty darn fast.

Getting and Setting Keys

Let's have a look at the normal Hash-like methods as well as the file storage aspect:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  if db.size.zero?
    puts "Loading the database.  Rerun to read back the data."
    db[:one] = 1
    db[:two] = 2
    db.update(:three => 3, :four => 4)
    db["users:1"] = "James"
    db["users:2"] = "Ruby"
  else
    puts "Reading data."
    %w[ db[:one]
        db["users:2"]
        -
        db.keys
        db.keys(:prefix\ =>\ "users:")
        db.keys(:limit\ =>\ 2)
        db.values
        -
        db.values_at(:one,\ :two) ].each do |command|
      puts(command == "-" ? "" : "#{command} = %p" % [eval(command)])
    end
  end
end

If I run that code twice, I see:

$ ruby tc_example.rb 
Loading the database.  Rerun to read back the data.
$ ruby tc_example.rb 
Reading data.
db[:one] = "1"
db["users:2"] = "Ruby"

db.keys = ["one", "two", "three", "four", "users:1", "users:2"]
db.keys(:prefix => "users:") = ["users:1", "users:2"]
db.keys(:limit => 2) = ["one", "two"]
db.values = ["1", "2", "3", "4", "James", "Ruby"]

db.values_at(:one, :two) = ["1", "2"]

The file storage should be pretty obvious here. The first run of the program populated the data file and the second run read the data back. Obviously the data exists outside the process. It's actually stored in the file I named in my call to open(): "data.tch". We will dig a lot more into the meaning of the file extensions later, but for now it's enough to know that .tch stands for Tokyo Cabinet Hash database. It's also worth pointing out that you don't have to pass a block to open(). When not passed a block open() will return the database reference and expect you to call close() manually when you are done, just as you could with any IO object from Ruby. Tokyo Cabinet can buffer output just like Ruby's IO streams can, so know that your data isn't guaranteed to have hit the disk until after a close(). You can flush() the data to disk before that though, if needed.

The getting and setting methods shouldn't be much of a surprise. I started off by calling size() to count the pairs already in the database. I then used []= to set a few keys. Note that I also used update() to add multiple keys at once. (The merge()/merge!() methods of Hash don't really make sense for the database so you do need to use the update() alias.) Later I read the data back with []. It's all very Hash-like. I was even able to ask for all of the keys() as you can with a Hash, but the Oklahoma Mixer version of that method supports some extra filters like the :prefix and :limit shown above. There's also the matching values() call, though it doesn't have any filters. You can see that Oklahoma Mixer also allows us to fetch multiple keys at once with values_at().

The last thing to get out of this example is the usual truth of key-value storage: keys and values are generally considered Strings. Notice how db[:one] = 1 actually stored a value of "1" under the key "one". Make sure you remember to convert it back when you read it if you really need the number.

Another cool Hash-like feature you can make use of is defaults. You can set a static object to be used as the default value for keys not in the database or provide code to run to generate the default. Here is some code showing the possibilities in action:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  # no default set
  db[:missing]  # => nil

  # an Object default
  db.default = 0
  db[:missing]  # => 0
end

# another way to set an Object default
OklahomaMixer.open("data.tch", :default => 42) do |db|
  db[:missing]  # => 42
end

# a Proc default
proc = lambda { |key|
  type, id = key.to_s.split(":")
  "New #{type} with id #{id}"
}
OklahomaMixer.open("data.tch", :default => proc) do |db|
  db["user:3"]  # => "New user with id 3"
end

Proc defaults are always executed, so if you want a default that returns a Proc, just pass a Proc that creates the desired Proc. All other objects are returned when indexing a missing value.

The important thing to remember about the defaults is that they are not stored in the file. They are just a convenience from the Ruby interface and you will need to set them again anytime you make a new connection to the database.
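
If you want to play with the idea without a database file, Ruby's own Hash defaults are a close analogy. Be aware that this is only an analogy. A Hash default block receives the Hash and then the key, while the Oklahoma Mixer lambda above receives just the key.

```ruby
# A static default: returned for any missing key, but never stored.
COUNTS = Hash.new(0)
p COUNTS[:missing]       # => 0
p COUNTS.key?(:missing)  # => false

# A computed default, built from the key that was asked for.
LABELS = Hash.new { |_hash, key|
  type, id = key.to_s.split(":")
  "New #{type} with id #{id}"
}
p LABELS["user:3"]       # => "New user with id 3"
p LABELS.key?("user:3")  # => false
```

In both cases the default is handed back without being saved, which matches the spirit of defaults living in the Ruby interface instead of the data.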

You can also walk the key-value pairs of a Tokyo Cabinet database using the standard iterators you expect in Ruby:

#!/usr/bin/env ruby -KU

require "pp"

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  # a Hash-like each()
  db.each do |key, value|
    puts "db[%p] = %p" % [key, value]
  end

  # other iterators from Enumerable are supported
  puts
  pp db.select { |key, _|     key   =~ /\Ausers:/ }
  pp db.find   { |_,   value| value =~ /\A\D/     }
end

Running that gives us:

db["one"] = "1"
db["two"] = "2"
db["three"] = "3"
db["four"] = "4"
db["users:1"] = "James"
db["users:2"] = "Ruby"

[["users:1", "James"], ["users:2", "Ruby"]]
["users:1", "James"]

You can see that we get an each() that walks key-value pairs, just as a Hash would. We also get all of the other standard Enumerable iterators. This gives us several different ways to comb the data for specific keys.

When you are done playing around with data, you have multiple options for getting rid of it. You can just clear() all keys if you are sure that's safe. Of course, just deleting the file has pretty much the same effect. If you need to selectively remove data, you can delete() a single key-value pair or use the delete_if() iterator to programmatically remove pairs.

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db.delete(:one)  # => "1"
  db.delete_if { |key, _| key =~ /\Ausers:/ }

  db.keys  # => ["two", "three", "four"]

  db.clear
  db.keys  # => []
end

The delete() method does return the value for the removed key, or nil if that key wasn't in the database. That feature isn't really safe if multiple processes are manipulating the data at once, unless you take the right precautions. We will talk a lot more about that later though.

That covers the basic Hash style interface to Tokyo Cabinet. Let's move into some other aspects of the library now.

Counters and Appended Values

We've already seen the standard Hash-like method of storing data with db[:key] = :value. The less common store() method from Hash is also supported (as is fetch() for retrieving values), so you can do things like db.store(:key, :value). The advantage of using store() is that it supports modes. You can use these modes to manipulate values in different ways. Let's look at some of the options.

Most key-value stores provide an action for atomically incrementing a counter and Tokyo Cabinet is no exception. This is important because it allows you to track unique IDs. Have a look at the various ways you can use the store() method to manage counters with the :add mode:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db["globals:user_id"]                    # => nil
  db["globals:float"]                      # => nil

  db.store("globals:user_id", 1, :add)     # => 1
  db.store("globals:user_id", 1, :add)     # => 2

  db.store("globals:float",    2.1, :add)  # => 2.1
  db.store("globals:user_id", -1,   :add)  # => 1
end

While all of that should be pretty obvious, this mode has a few gotchas you want to stay aware of. It's OK to start :adding to a nil field as I've shown above, but don't try to use a field already set to a non-:added value or you will likely get an OklahomaMixer::Error::CabinetError. This is true even if you have what you think is a number in the value. Tokyo Cabinet's numbers are a C-ish chunk of bytes so it won't recognize digits in String form. This also means you don't generally want to read an :added value with a normal call to []. It probably won't look like anything you are expecting. Tokyo Cabinet also uses different formats for Integer and Float values, so you will get the same error if you try to switch. Always add the same type of number to a given field.
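
To see why digits and a "C-ish chunk of bytes" aren't interchangeable, compare the two forms in plain Ruby. I'm assuming a 32-bit little-endian integer here purely for illustration. The exact layout Tokyo Cabinet uses is not something this sketch claims to reproduce.

```ruby
# What a normal db[:key] = 42 would store: the digits as a String.
AS_DIGITS = 42.to_s

# A C-style integer (assumed 32-bit little-endian for this sketch).
AS_BYTES = [42].pack("l<")

p AS_DIGITS.bytes  # => [52, 50]
p AS_BYTES.bytes   # => [42, 0, 0, 0]

# Only the packed form reads back as the number it started as.
p AS_BYTES.unpack("l<").first  # => 42
```

The two Strings don't even have the same length, so code expecting one form can't make sense of the other.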

Another unusual type of value management can be done in Tokyo Cabinet by appending to a value with :cat mode. For example:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db.store(:friend_ids, " 1", :cat)
  db.store(:friend_ids, " 3", :cat)
  db.store(:friend_ids, " 5", :cat)
  db.store(:friend_ids, " 3", :cat)

  db[:friend_ids]                        # => " 1 3 5 3"
  db[:friend_ids].to_s.scan(/\S+/).uniq  # => ["1", "3", "5"]
end

As you can see, this method will create a value if it didn't exist and then continue appending to the value after it does. If you need the opposite behavior, to avoid messing with a key that already exists, try :keep mode instead:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db[:exists] = "Can't touch this!"

  db.store(:exists, "Lost.", :keep)  # => false
  db[:exists]                        # => "Can't touch this!"
end

Similarly, you can just pass a block to store() that will be called if a key already exists. That block is expected to return the value that should be saved to the database:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  adder = lambda { |key, old_value, new_value| old_value.to_i + new_value }
  db[:num]  # => nil
  db.store(:num, 41, &adder)
  db[:num]  # => "41"
  db.store(:num,  1, &adder)
  db[:num]  # => "42"
end
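
Ruby's Hash has a similar trick that may make the block form feel familiar. Hash#update() only calls its block for keys that are already present, passing the key, the old value, and the new value. The add_to() helper below is hypothetical, and unlike the database, a Hash keeps the values as Integers instead of converting them to Strings.

```ruby
# Hypothetical helper: add +amount+ to whatever is stored under +key+.
def add_to(totals, key, amount)
  # The block runs only when +key+ already exists in +totals+.
  totals.update(key => amount) { |_key, old_value, new_value|
    old_value + new_value
  }
end

totals = {}
add_to(totals, :num, 41)  # key is new, so 41 is stored as-is
p totals[:num]            # => 41
add_to(totals, :num, 1)   # key exists, so the block picks the value
p totals[:num]            # => 42
```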

These modes give you some powerful ways to build up values over time, even with different processes working on the same data. Their effects are atomic and that's important in any multiprocessing environment.

Transactions

Transactions are a big part of what makes Tokyo Cabinet great to work with. With them you can define a set of actions that must succeed or fail as a whole. Let's start with the classic example of transferring money between accounts:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db["accounts:1:balance"] = 100
  db["accounts:2:balance"] = 100

  db.transaction do
    db["accounts:1:balance"] = db["accounts:1:balance"].to_i - 10
    db["accounts:2:balance"] = db["accounts:2:balance"].to_i + 10
  end

  db["accounts:1:balance"]  # => "90"
  db["accounts:2:balance"]  # => "110"
end

That code should be easy to understand. I just removed an amount from one account and added that same amount to the other. I've done this transfer inside of a transaction(), but it doesn't really have any effect when things go right as they did here. Let's break something and see what happens:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db["accounts:1:balance"] = 100
  db["accounts:2:balance"] = 100

  begin
    db.transaction do
      db["accounts:1:balance"] = db["accounts:1:balance"].to_i - 10
      db["accounts:2:balance"] = db["accounts:2:balance"].to_i + 10
      fail "Oops!"
    end
  rescue
    # do nothing:  just continue on to checking the balances
  end

  db["accounts:1:balance"]  # => "100"
  db["accounts:2:balance"]  # => "100"
end

This time we see the difference. Both of my actions against the database had already been processed. However, my fail() call was part of the same transaction() and the Exception meant everything had to be undone. Notice that the account balances were restored to their previous values.

It is possible for you to cancel a transaction() without triggering an Exception. That's what the abort() method is for:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db.store("globals:user_id", 41, :add)  # pretend we have a few users
  db["users:42:last_name"] = "Nobody"    # and some bad data

  user = {:first_name => "James", :last_name => "Gray"}
  db.transaction do
    user[:id] = db.store("globals:user_id", 1, :add)
    if user.all? { |k, v| db.store("users:#{user[:id]}:#{k}", v, :keep) }
      user[:saved] = true
    else
      db.abort
    end
  end

  unless user[:saved]
    puts "Unable to save user.  Problem field(s):"
    user.each_key do |key|
      if value = db["users:#{user[:id]}:#{key}"]
        puts %Q{db["users:#{user[:id]}:#{key}"] = #{value.inspect}}
      end
    end
    # >> Unable to save user.  Problem field(s):
    # >> db["users:42:last_name"] = "Nobody"
  end
end

As you can see, abort() didn't toss an Exception but it rolled back my transaction() all the same. None of the new user fields were added to the database because they couldn't all be safely added. I knew that because one of the :keep mode calls to store() returned false when it tried to set an already existing key.

That's the magic of transactions. They are an all-or-nothing thing. Only if your block completes with no Exception thrown and no call to abort() will all of the changes be made.
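
If it helps to see the all-or-nothing idea without a database involved, here's a rough sketch over a plain Hash. TinyTransactionalStore is a hypothetical class of my own, and the snapshot-and-restore trick it uses is just the simplest thing that shows the behavior. I'm not claiming Tokyo Cabinet works this way internally:

```ruby
# A toy all-or-nothing transaction over a plain Hash.
# TinyTransactionalStore is hypothetical; it only illustrates the idea.
class TinyTransactionalStore
  def initialize
    @data = { }
  end

  def [](key)
    @data[key]
  end

  def []=(key, value)
    @data[key] = value.to_s
  end

  # Run the block; restore the snapshot on abort() or any Exception.
  def transaction
    snapshot  = @data.dup
    committed = false
    catch(:abort_transaction) do
      yield
      committed = true
    end
    committed
  ensure
    @data = snapshot unless committed
  end

  # Quietly cancel the current transaction (no Exception reaches the caller).
  def abort
    throw :abort_transaction
  end
end

db = TinyTransactionalStore.new
db["accounts:1:balance"] = 100
begin
  db.transaction do
    db["accounts:1:balance"] = db["accounts:1:balance"].to_i - 10
    fail "Oops!"
  end
rescue
  # the Exception still propagates after the rollback
end
db["accounts:1:balance"]  # => "100"
```

Both exits are covered by the same ensure clause: an Exception passes through it on the way out, while abort() unwinds to the catch() and returns normally.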

Database File Maintenance

There are a lot of advantages that come with a database that's just one file in the file system. You can build symlinks to it, set permissions on it, and check its size with the normal tools your OS provides (though Oklahoma Mixer does have a file_size() method that returns the file size in bytes, if you need it). Of course, there are also tradeoffs you should stay aware of.

First, the file can get a little bloated over time. The reason is normal fragmentation: Tokyo Cabinet may clear a key, freeing up some space, and later fill it with a not-quite-as-big item. It may not find a good use for the even smaller remaining space for a long time. This creates small pockets of unused space that grow the file over time.

The easiest way to deal with this is to call defrag() periodically at a slow time. This will lock up the database for a few seconds while Tokyo Cabinet cleans it up. This will take care of the wasted space and shrink the file size back down (assuming it was fragmented).

Another issue to stay aware of is how you make backup copies of the database file. You need to be careful about using standard tools like cp or rsync on a Tokyo Cabinet file. It's fine if you know all connections to it are currently closed, but it's not safe when a connection might be changing the data inside of it mid-copy. If you try that, you will likely get a corrupt copy of the database.

The solution is to call copy() and pass in the path where you would like to create a copy of the database. It will synchronize the data, lock out changes, and then make a full duplicate. This process is quite snappy, even with bigger data sets. If desired, you can ask Oklahoma Mixer for the path() of the original database, edit it in some small way, and use that as the path for the duplicate database.
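
Editing the path "in some small way" could look something like the following. The backup_path_for() helper and the "backup" tag are my own inventions for this sketch. I keep the file extension intact since, as mentioned earlier, the extension is how the database type gets selected:

```ruby
# Build a sibling path for a backup copy, e.g. "data.tch" -> "data.backup.tch".
# backup_path_for() is a hypothetical helper; the "backup" tag is arbitrary.
def backup_path_for(path, tag = "backup")
  ext  = File.extname(path)        # ".tch"
  base = File.basename(path, ext)  # "data"
  File.join(File.dirname(path), "#{base}.#{tag}#{ext}")
end

backup_path_for("/var/db/data.tch")  # => "/var/db/data.backup.tch"
```

With an open database, the result could then be handed to copy(), along the lines of db.copy(backup_path_for(db.path)), assuming copy() and path() behave as described above.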

Just make sure you keep these issues in mind as you plan out your storage.

Those are the basics of using Tokyo Cabinet as a key-value store, but there's really a lot more to what Tokyo Cabinet can do. I'll show you what all is built onto this simple foundation in upcoming articles.

Where Redis is a Good Fit

2014-04-18T22:46:41Z

Like any system, Redis has strengths and weaknesses. Some of the biggest positives with Redis are:

It's wicked fast. In fact, it may just be the fastest key-value store.
The collection types and the atomic operations that work on them allow you to model some moderately complex data scenarios. This makes Redis fit some higher order problems where a simple key-value store wouldn't quite be enough.
The snapshot data dumping model can be an asset. You get persistence with Redis, but you pay a minimal penalty for it.

Of course, there are always some minuses. These are the two I consider the most important:

Redis is an in-memory data store, first and foremost. That means your entire dataset must fit completely in RAM and leave enough breathing room for anything else the server must do.
Snapshot backups are not perfect. If Redis fails between snapshots, you can lose data. You need to make sure that's acceptable for any application you use it in.

It may seem weird to call snapshots both a pro and a con, but it does work for you in some ways and against you in others. You have to decide where the trade-off is worth it.

Given the above breakdown, I will list three places where I think Redis can be the right tool for the job. This is not meant to be an exhaustive list. I'm sure there are many other places Redis could be well used. Instead, it's better to look at why I've chosen these areas and try Redis with problems that have similar needs.

Redis makes an excellent cache or, more specifically, a memcached replacement. The reasoning for this is simple: you pretty much get memcached's features, plus lists and sets. You can use those collections to answer some queries, saving you trips to more expensive databases. That means you can use your cache more and increase the benefits of having it. You also get some persistence, which isn't critical in this scenario but is a nice value add.
Redis is an ideal realtime statistics tracker. If you are tracking stats in realtime, there are three things you really care about: speed, speed, and speed. Some nice atomic operations don't hurt either. From simple counters, to audit logs, to sets of unique IP addresses, and much more, Redis really rocks this kind of problem domain.
Redis can be the primary database for some Web applications. This one isn't as much of a given as the other two, obviously, and you can see that I've switched to using words like can and some. However, when your database needs are simple, Redis may be enough tool for the job.

If you are going to try Redis in this role, you need to stay very aware of its minuses. For example, will it be OK if you lose some keys here and there? As you consider that though, do remember that you can force a save when needed. Perhaps it would be enough to force saves after a new user account is created and play things a little looser with the rest of the data, for example. It's also worth noting that Redis supports master-slave replication which can help reduce this risk.

When considering if the entire database can be in memory at once, consider how fast the data will grow as well. Are you going to have time to monitor usage and react long before you run into a nasty limit?

That said, there are plenty of applications that fit into those criteria. Take this blog for example. I'm sure the contents of it fit into a reasonable chunk of memory, it changes so little I could afford a hard save after every single write, and the rest of the time Redis would pay me back with crazy great reading speed.

There's even a nice object mapping library for Redis: Ohm. I do encourage you to play with Redis at a lower level before resorting to such shortcuts though.

Hopefully this series has given you some ideas of how you might use Redis. It's not right for every application, but it can really be a big win where it fits. You should now know enough to watch for such opportunities and take advantage of them.

Lists and Sets in Redis

2014-04-18T22:28:37Z

[Update: though all of the techniques I show here still apply, many methods of the Redis gem have changed names to match the actual Redis commands they call. There are also easier and more powerful ways to do some of what I show in here, thanks to additions to Redis.]

Redis adds one huge twist to traditional key-value storage: collections. Supporting both lists and sets through some very powerful atomic operations allows for advanced key-value usage.

Lists

Redis allows a single key to hold a list of values. This is your typical ordered list with the operations you would expect: appending, indexed access, and access to a range of values.

This has many potential uses. I'll cover two that I think will be very common. First, if you are going to use Redis as a full database, you store things that are naturally a list of items, like comments, in a real list. Let's look at some code:

#!/usr/bin/env ruby -wKU

require "redis"

CLEAR = `clear`

# create an article to comment on
db                = Redis.new
article_id        = db.incr("global:next_article_id")
article           = "article:#{article_id}"
class << article
  def method_missing(field, *args, &blk)
    return super unless field.to_s !~ /[!?=]\z/ && args.empty? && blk.nil?
    "#{self}:#{field}"
  end
end
db[article.title] = "My Favorite Language"
db[article.body]  = "I love Ruby!"
# initialize some session details
comments_per_page = 2
comment_page      = 1
login             = ARGV.shift || "JEG2"

loop do
  # show article
  print CLEAR
  puts "#{db[article.title]}:"
  puts "  #{db[article.body]}"

  # paginate comments
  start      =  comments_per_page * (comment_page - 1)
  finish     =  start + comments_per_page - 1
  comments   =  db.list_range(article.comments, start, finish)
  pagination =  Array(start.zero? ? nil : "(p)revious")
  pagination << "(n)ext" if db.list_length(article.comments) - 1 > finish
  # show comments
  comments.each do |comment|
    posted, user, body = comment.split("|", 3)
    puts "----"
    puts "  #{body}"
    puts "  posted by #{user} on #{posted}"
  end

  # handle commands
  puts
  print "Command? [#{(%w[(c)omment (q)uit] + pagination).join(', ')}]  "
  case (command = gets)
  when /\Ac(?:omment)?\Z/i  # add a comment
    print "Your comment?  "
    comment = gets or break
    posted  = Time.now.strftime('%m/%d/%Y at %H:%M:%S')
    db.push_tail(article.comments, "#{posted}|#{login}|#{comment.strip}")
  when /\Ap(?:revious)?\Z/i  # view previous page of comments
    if pagination.first =~ /\A\(p\)/
      comment_page -= 1
    else
      puts "You are on the first page of comments."
      gets or break
    end
  when /\An(?:ext)?\Z/i  # view next page of comments
    if pagination.last =~ /\A\(n\)/
      comment_page += 1
    else
      puts "You are on the last page of comments."
      gets or break
    end
  when /\Aq(?:uit)?\Z/i, nil  # exit program
    break
  end
end

I know that looks like a lot of code, but it's mostly interface. It also shows many of the common list interactions in just three methods.

You can see that adding to the list was a simple matter of calling push_tail(). Similar to how incr()/decr() initialize counters to 0, list actions will default to an empty list if the key is undefined when they are triggered. You will get an error if you use the operations on keys that have already been set to non-list values though, so be careful with that.

When I was ready to read the list back, I paginated through the results using list_length() and list_range(). You pass a starting and ending index to list_range() and negative indices can be used to count backwards from the end just as Ruby allows. Redis doesn't allow you to read a full list with a simple key lookup ([] or get()), so use list_range(…, 0, -1) instead.
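
The index math behind that pagination is easy to get wrong by one, so here it is pulled out on its own. This sketch uses a plain Ruby Array in place of a Redis list, and page_bounds() and next_page?() are hypothetical helper names, not part of the Redis gem:

```ruby
# Translate a 1-based page number into the inclusive start/finish
# indices that a range lookup expects. page_bounds() is a hypothetical helper.
def page_bounds(page, per_page)
  start  = per_page * (page - 1)
  finish = start + per_page - 1
  [start, finish]
end

# Is there another page after the one ending at finish?
def next_page?(length, finish)
  length - 1 > finish
end

comments = %w[first second third fourth fifth]
start, finish = page_bounds(2, 2)  # => [2, 3]
comments[start..finish]            # => ["third", "fourth"]
next_page?(comments.size, finish)  # => true
comments[0..-1] == comments        # => true (the whole list, like 0 and -1)
```

Both ends of the range are inclusive, which is why finish has that - 1 in it.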

Here is what this example looks like in practice after I've entered a few comments:

My Favorite Language:
  I love Ruby!
----
  First!
  posted by JEG2 on 09/07/2009 at 15:35:11
----
  Yeah, we know.
  posted by JEG2 on 09/07/2009 at 15:35:29

Command? [(c)omment, (q)uit, (n)ext]

Then if I type an n followed by a return:

My Favorite Language:
  I love Ruby!
----
  ruby { |love| love + 1 }
  posted by JEG2 on 09/07/2009 at 15:35:52
----
  Who doesn't?
  posted by JEG2 on 09/07/2009 at 15:36:20

Command? [(c)omment, (q)uit, (p)revious, (n)ext]

You get the idea.

Let's go back to how I added items to the list for just a moment though.

As the name push_tail() might lead you to guess, there are three related methods: pop_tail(), push_head(), and pop_head(). The push_head() method will add items to the beginning of the list, instead of the end as I did above. Then pop_head() and pop_tail() can be used to remove entries from either end. This means that lists can function as stacks and queues, just as Ruby's Array does.

I believe using Redis lists as queues will be another popular setup:

#!/usr/bin/env ruby -wKU

require "redis"

WORKER_COUNT = 3

# spawn some workers
WORKER_COUNT.times do
  fork do
    db = Redis.new
    loop do
      next  unless work = db.pop_head(:work)
      break if     work == "QUIT"
      puts "Work from #{Process.pid}: #{work} = #{eval work}"
    end
  end
end

# generate some work
db = Redis.new
10.times do
  db.push_tail( :work,
                "#{rand(10)} #{%w[+ - * / %][rand(5)]} #{rand(9) + 1}" )
end

# finish off all processes
WORKER_COUNT.times do
  db.push_tail(:work, "QUIT")
end
Process.waitall

When I run that, we can see the workers pulling off their assignments and taking care of the work:

Work from 1478: 9 + 9 = 18
Work from 1479: 0 / 7 = 0
Work from 1478: 1 % 3 = 1
Work from 1479: 2 % 4 = 2
Work from 1479: 4 * 7 = 28
Work from 1478: 1 / 9 = 0
Work from 1479: 1 + 1 = 2
Work from 1479: 3 * 4 = 12
Work from 1478: 6 - 9 = -3
Work from 1480: 0 / 1 = 0

Adding and removing work are both atomic actions. Once a worker has a job, it's guaranteed other workers won't receive it.

You aren't limited to a single queue either, of course. Each key can hold a queue, so you can divide jobs up by priority, time required, or anything else that makes sense.
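
Here's a rough idea of how a worker might choose between several queues. This is an in-memory sketch with a Hash of Arrays standing in for Redis lists. TinyQueues, next_job(), and the queue names are hypothetical. Against a real server each pop would be atomic on its own, but checking several keys in a row like this is not one atomic step:

```ruby
# Toy in-memory queues: a Hash of Arrays standing in for Redis lists.
# TinyQueues and next_job() are hypothetical names for illustration.
class TinyQueues
  def initialize
    @lists = Hash.new { |lists, key| lists[key] = [ ] }
  end

  def push_tail(key, value)
    @lists[key].push(value)
  end

  def pop_head(key)
    @lists[key].shift  # nil when the queue is empty
  end

  # Check queues in priority order and take the first available job.
  def next_job(*keys)
    keys.each do |key|
      job = pop_head(key)
      return job if job
    end
    nil
  end
end

queues = TinyQueues.new
queues.push_tail("work:low",  "rebuild search index")
queues.push_tail("work:high", "send password reset")
queues.next_job("work:high", "work:low")  # => "send password reset"
queues.next_job("work:high", "work:low")  # => "rebuild search index"
queues.next_job("work:high", "work:low")  # => nil
```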

I know of at least one big site using this kind of setup to process their background jobs. They switched to this approach because just saving the jobs to a relational database was too much of a time penalty for their needs.

You can edit a list by index using list_set(), but unlike in Ruby, you will get an error if you try to set an index that doesn't already exist. Use list_index() to read a single value at an index.

If you want to use lists as a circular buffer, say for storing recent log entries or notices without allowing the data to grow infinitely, the list_trim() method is what you need. It takes identical arguments to list_range(), but instead of returning the indicated entries it shaves the list down to include only those entries. Thus, if you want to keep a list at no more than 100 entries, you can use code like:

db.push_tail(:list, "whatever")
db.list_trim(:list, -100, -1)
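
If the negative indices make that hard to picture, the same push-then-trim pattern on a plain Ruby Array looks like this. capped_push() is a hypothetical helper of mine, and the Array is only standing in for the Redis list:

```ruby
# Append a value, then keep only the newest max entries.
# This mirrors trimming to the range -max..-1 after each push.
# capped_push() is a hypothetical helper for illustration.
def capped_push(list, value, max = 100)
  list.push(value)
  list.replace(list.last(max))
end

log = [ ]
105.times { |i| capped_push(log, "entry #{i}") }
log.size   # => 100
log.first  # => "entry 5"
log.last   # => "entry 104"
```

The oldest entries fall off the head while the newest stay at the tail.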

If you need to delete items out of the middle of a list, use the list_rm() method. It takes the key, a count of how many matching values to remove, and the value to match against. A count of 0 will remove all matching values from within the list and negative counts will start removing at the tail of the list, moving toward the head.
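
The meaning of that count argument is easier to see in action. Here's a sketch of the same rules applied to a plain Ruby Array. remove_matches() is a hypothetical function written just to illustrate the three cases, and it returns a new Array instead of changing anything on a server:

```ruby
# Imitate the count rules described above on a plain Array:
#   count > 0 -> remove up to count matches, starting from the head
#   count < 0 -> remove up to count.abs matches, starting from the tail
#   count = 0 -> remove every match
# remove_matches() is a hypothetical helper for illustration.
def remove_matches(list, count, value)
  limit   = count.zero? ? list.size : count.abs
  indices = list.each_index.select { |i| list[i] == value }
  indices = indices.reverse if count < 0
  doomed  = indices.first(limit)
  list.reject.with_index { |_, i| doomed.include?(i) }
end

list = %w[a b a c a]
remove_matches(list,  1, "a")  # => ["b", "a", "c", "a"]
remove_matches(list, -1, "a")  # => ["a", "b", "a", "c"]
remove_matches(list,  0, "a")  # => ["b", "c"]
```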

Finally, you may want to know about the rename() method, in case you need to atomically replace an entire list. You can build up the new list under a separate key and then just use rename() to replace the old key. Be careful with rename() though because it expects the old key name followed by the new key name, which is backwards from how we usually do things in Ruby. If you need a rename() that won't destroy old data, there's also a rename_unless_exists() method.

Sets

Redis has one more type of collection: sets. Sets are probably even simpler to work with operation-wise, but are a large source of the power of Redis over normal key-value stores.

To Redis a set is an unordered collection with no duplicate members. If you add an item to a set more than once, it will still just be listed one time.

Let's begin by looking at some basic set operations:

#!/usr/bin/env ruby -wKU

require "redis"

db = Redis.new

db.set_member?(:nums, "one")  # => false
db.set_add(:nums, "one")
db.set_member?(:nums, "one")  # => 1
db.set_delete(:nums, "one")
db.set_member?(:nums, "one")  # => false

50.times do
  db.set_add(:nums, "one")
end
db.set_add(:nums, "two")
db.set_count(:nums)    # => 2
db.set_members(:nums)  # => ["two", "one"]

db.spop(:nums)  # => "two"
db.spop(:nums)  # => "one"
db.spop(:nums)  # => nil

db.set_add(:work, "write about sets")
db.set_add(:work, "write about sorting")
db.set_members(:work)  # => ["write about sorting", "write about sets"]
db.set_move(:work, :finished, "write about sets")
db.set_members(:work)      # => ["write about sorting"]
db.set_members(:finished)  # => ["write about sets"]

I assume each of those examples is pretty easy to follow. The first chunk of code shows adding to and deleting from a set, plus testing membership. After that we see the rules I mentioned earlier: no duplicates and unordered. Next we see the spop() method which is just a random member remover function. The final chunk of code shows how you can atomically move members from one set to another. Those are the basics of sets.

The real power of sets comes from how you can use them to build simple queries in Redis. That's possible through the support of some atomic set operations: intersect(ion), union, and diff(erence). Here are some examples:

#!/usr/bin/env ruby -wKU

require "redis"

db = Redis.new

db.set_add(:odd,            "one")
db.set_add(:even,           "two")
db.set_add(:odd,            "three")
db.set_add(:divisible_by_3, "three")
db.set_add(:even,           "four")
db.set_add(:divisible_by_4, "four")
db.set_add(:odd,            "five")
db.set_add(:even,           "six")
db.set_add(:divisible_by_3, "six")
db.set_add(:odd,            "seven")
db.set_add(:even,           "eight")
db.set_add(:divisible_by_4, "eight")
db.set_add(:odd,            "nine")
db.set_add(:divisible_by_3, "nine")
db.set_add(:even,           "ten")
db.set_add(:odd,            "eleven")
db.set_add(:even,           "twelve")
db.set_add(:divisible_by_3, "twelve")
db.set_add(:divisible_by_4, "twelve")

db.set_intersect(:even, :divisible_by_3)                   # => ["six",
                                                           #     "twelve"]
db.set_intersect(:even, :divisible_by_3, :divisible_by_4)  # => ["twelve"]

db.set_diff(:divisible_by_3, :even)                   # => ["three", "nine"]
db.set_diff(:even, :divisible_by_3)                   # => ["four", "ten",
                                                      #     "eight", "two"]
db.set_diff(:even, :divisible_by_3, :divisible_by_4)  # => ["ten", "two"]

db.set_union(:divisible_by_3, :divisible_by_4)  # => ["four", "six",
                                                #     "twelve", "three",
                                                #     "eight", "nine"]

db.set_union_store(:all, :even, :odd)
db.set_members(:all)  # => ["four", "eleven", "seven", "eight", "one",
                      #     "six", "ten", "twelve", "three", "five", "nine",
                      #     "two"]

Again, I assume the results of these examples are pretty straightforward. set_intersect() returns the members in all listed sets, set_diff() subtracts the members of each successive set listed from the first set, and set_union() returns all members present in any of the listed sets. Note that order is significant in set_diff() due to the way it is defined. Finally, I showed a variant that stores the results instead of returning the entire set. Though I only showed set_union_store(), set_intersect_store() and set_diff_store() do exist.
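
If you want to experiment with these definitions without a server running, Ruby's own Set class follows the same rules. The intersect(), diff(), and union() helpers below are hypothetical names I made up to accept multiple sets the way the Redis methods do. I sort the results only because, as with Redis, a set makes no promise about order:

```ruby
require "set"

# Hypothetical multi-set helpers built on Ruby's Set operators.
def intersect(*sets)
  sets.inject(:&)  # members present in every set
end

def diff(first, *rest)
  rest.inject(first) { |result, set| result - set }  # order matters here
end

def union(*sets)
  sets.inject(:|)  # members present in any set
end

even           = Set.new(%w[two four six eight ten twelve])
divisible_by_3 = Set.new(%w[three six nine twelve])
divisible_by_4 = Set.new(%w[four eight twelve])

intersect(even, divisible_by_3).to_a.sort             # => ["six", "twelve"]
diff(divisible_by_3, even).to_a.sort                  # => ["nine", "three"]
diff(even, divisible_by_3).to_a.sort                  # => ["eight", "four",
                                                      #     "ten", "two"]
diff(even, divisible_by_3, divisible_by_4).to_a.sort  # => ["ten", "two"]
```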

A typical usage for these methods is to store unique identifiers of records in the system by various categorical breakdowns. You can then use the set operations to find simple query results of those present in more than one category, in one category but not others, or those present in any category. This may be able to save you a trip to a more powerful but expensive tool, like SQL, for some queries.

To give an example of building queries with sets, let's look at a simple program. This code will load some first and last names from the U.S. Census Bureau into a Redis database. We will then allow the user to make queries by name position (first or last), gender, popularity, and any leading prefix.

To keep the code simple, we're going to adopt a pretty crude definition of popularity. The files are already in order of rank, so we will just call the first third of the file popular, the second third average, and the rest uncommon.

We're also going to need to be a bit clever to handle prefix searches with just sets. Redis doesn't really have text search features, so we will need to build an index we can use. The idea is simple: given the name "James" we will add it to the sets "prefix:j", "prefix:ja", "prefix:jam", "prefix:jame", and "prefix:james". Later, when we are answering queries, we can just add the set for the input the user gave us into our set intersection.

Here's the code:

#!/usr/bin/env ruby -wKU

require "abbrev"

require "redis"

# prepare the database
db = Redis.new
if db.dbsize.zero?
  puts "Loading names into the database..."
  started, count = Time.now, 0
  # load names
  %w[all.last female.first male.first].each do |group|
    gender, position = group.split(".")
    category         = ([position, gender] - %w[all]).join(":")
    top_third        = File.size("dist.#{group}.txt") / 3
    File.open("dist.#{group}.txt") do |names|
      names.each do |given|
        name = given[/\w+/]
        db.pipelined do |commands|
          commands.set_add("position:#{position}", name)
          commands.set_add("gender:#{gender}",     name) \
            unless gender == "all"
          Abbrev.abbrev(name.downcase).keys.each do |prefix|
            commands.set_add("prefix:#{prefix}", name)
          end
          popularity = if    names.pos < top_third     then "popular"
                       elsif names.pos < top_third * 2 then "average"
                       else                                 "uncommon"
                       end
          commands.set_add("popularity:#{category}:#{popularity}", name)
        end
        count += 1
      end
    end
  end
  # save and report
  db.save
  puts "Loaded #{count} names in #{Time.now - started} seconds."
  puts
end

# perform queries
position   = %w[first last]
gender     = %w[male female]
popularity = %w[popular average uncommon]
# UI for entering queries
def ask(choices)
  print [ choices[0..-2].join(", "),
          choices[-1] ].join(", or ").capitalize + "?  "
  choice = gets.to_s.strip
  choice.empty? ? nil : Abbrev.abbrev(choices)[choice]
end
query = [ ]
if choice = ask(position)
  query << "position:#{choice}"
end
unless choice == "last"
  if choice = ask(gender)
    query << "gender:#{choice}"
  end
end
if choice = ask(popularity)
  query << "popularity:#{query.map { |f| f[/\w+\z/] }.join(':')}:#{choice}"
end
print "Prefix?  "
prefix = gets.to_s.strip
unless prefix.empty?
  query << "prefix:#{prefix.downcase}"
end
puts
# execute query and show stats
puts "Running query..."
width = query.map { |set| set.size }.max
query.each do |set|
  puts "  %#{width}s:  #{db.set_count(set)} names" % set
end
started = Time.now
puts
puts( query.empty? ? db.set_union("position:first", "position:last") :
                     db.set_intersect(*query) )
puts
puts "#{Time.now - started} seconds."

Using that code, we can load the database and search for names like mine:

$ ruby names.rb
Loading names into the database...
Loaded 94293 names in 58.25652 seconds.

First, or last?  f
Male, or female?  m
Popular, average, or uncommon?  p
Prefix?  Ja

Running query...
                 position:first:  5163 names
                    gender:male:  1219 names
  popularity:first:male:popular:  406 names
                      prefix:ja:  501 names

JARED
JACKIE
JAVIER
JAY
JAIME
JAMES
JACK
JAMIE
JACOB
JASON

0.003345 seconds.

As you can see, a little setup work really pays us back. Redis can chew through those multi-set queries in no time at all.
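
The prefix index is the least obvious part of that program, so here is the same idea in isolation. This is an in-memory sketch with a Hash of Ruby Sets standing in for the Redis sets. The prefixes(), index_name(), and query() helpers are hypothetical names, and the tiny data set is made up for the example:

```ruby
require "set"

# Every leading prefix of a name: "James" -> j, ja, jam, jame, james.
def prefixes(name)
  word = name.downcase
  (1..word.size).map { |length| word[0, length] }
end

# Add a name to one set per prefix (a Hash of Sets stands in for Redis).
def index_name(index, name)
  prefixes(name).each do |prefix|
    (index["prefix:#{prefix}"] ||= Set.new) << name
  end
  index
end

# Intersect the named sets; a missing set counts as empty.
def query(index, *keys)
  keys.map { |key| index.fetch(key, Set.new) }.inject(:&)
end

index = { }
%w[JAMES JACK MARY].each { |name| index_name(index, name) }
index["gender:male"] = Set.new(%w[JAMES JACK])

prefixes("James")                                  # => ["j", "ja", "jam",
                                                   #     "jame", "james"]
query(index, "prefix:ja", "gender:male").to_a.sort # => ["JACK", "JAMES"]
query(index, "prefix:ma", "gender:male").to_a      # => []
```

The cost is paid at load time, one set insertion per prefix, which is a big part of why loading took so much longer than querying in the run above.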

Since we've now seen all of the Redis value types, it's worth a quick mention that you can ask the database for the type() of a given key. The response will be one of: "none" (key not set), "string", "list", or "set".

Sorting

As soon as you have collections of data, you will want tools for controlling the order of those collections. This is especially important with things like sets that don't have an order until you define one.

Redis has a single method for this, called sort(), but it's by far the most complex and powerful method in the entire API. Let's examine what it can do piece by piece. First, here is a basic sort:

#!/usr/bin/env ruby -wKU

require "redis"

db = Redis.new

db.set_add(:nums, "1")
db.set_add(:nums, "2")
db.set_add(:nums, "10")
db.set_add(:nums, "11")

db.sort(:nums)  # => ["1", "2", "10", "11"]

That's easy enough to understand. I built a set (note that sort() works with lists too) and asked Redis to sort the members. It did.

Here's a question though: did the order surprise you? Values are generally considered Strings in Redis, but those numbers weren't sorted as Strings. This is another one of those sometimes numeric meaning exceptions I've mentioned before. Redis goes one step further with sort() though and even recognizes Floats:

#!/usr/bin/env ruby -wKU

require "redis"

db = Redis.new

db.set_add(:floats, "3")
db.set_add(:floats, "3.02")
db.set_add(:floats, "3.14")
db.set_add(:floats, "3.2")
db.set_add(:floats, "4")

db.sort(:floats)  # => ["3", "3.02", "3.14", "3.2", "4"]

Now, if you want a String ordering, you can ask for it. You can also reverse either order. Here's how those options work in our original example:

#!/usr/bin/env ruby -wKU

require "redis"

db = Redis.new

db.set_add(:nums, "1")
db.set_add(:nums, "2")
db.set_add(:nums, "10")
db.set_add(:nums, "11")

db.sort(:nums, :order => "ALPHA")       # => ["1", "10", "11", "2"]
db.sort(:nums, :order => "ALPHA DESC")  # => ["2", "11", "10", "1"]

db.sort(:nums, :order => "DESC")  # => ["11", "10", "2", "1"]
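
For comparison, here are the same four orderings done in plain Ruby, where String is the default and numeric ordering is what you have to ask for. numeric_sort() and alpha_sort() are hypothetical helpers. Note that to_f quietly turns non-numeric Strings into 0.0, so this sketch is more forgiving about bad data than I would expect a real server to be:

```ruby
# Order String values by their numeric meaning (what sort() did by default).
# numeric_sort() and alpha_sort() are hypothetical helpers for illustration.
def numeric_sort(values)
  values.sort_by { |value| value.to_f }
end

# Order String values as Strings (like asking for "ALPHA").
def alpha_sort(values)
  values.sort
end

nums = %w[10 2 11 1]
numeric_sort(nums)          # => ["1", "2", "10", "11"]
numeric_sort(nums).reverse  # => ["11", "10", "2", "1"]
alpha_sort(nums)            # => ["1", "10", "11", "2"]
alpha_sort(nums).reverse    # => ["2", "11", "10", "1"]

numeric_sort(%w[3.2 4 3 3.14 3.02])  # => ["3", "3.02", "3.14", "3.2", "4"]
```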

The sort() method also supports a limit with offset that can be used to fetch a subset of entries and also handle pagination:

#!/usr/bin/env ruby -wKU

require "redis"

db = Redis.new

db.set_add(:nums, "1")
db.set_add(:nums, "2")
db.set_add(:nums, "10")
db.set_add(:nums, "11")

db.sort(:nums, :limit => [0, 3])  # => ["1", "2", "10"]
db.sort(:nums, :limit => [3, 3])  # => ["11"]
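
The two numbers in :limit are an offset and a count, not a start and end index, which is easy to mix up after working with list_range(). Ruby's two-argument Array slice takes the same pair, so it makes a handy mental model. limited() is a hypothetical helper and the Array stands in for the sorted result:

```ruby
# Take count entries starting at offset from an already sorted Array.
# limited() is a hypothetical helper, not part of the Redis gem.
def limited(sorted, offset, count)
  sorted[offset, count] || [ ]  # slicing past the end returns nil
end

sorted = %w[1 2 10 11]
limited(sorted, 0, 3)  # => ["1", "2", "10"]
limited(sorted, 3, 3)  # => ["11"]
limited(sorted, 6, 3)  # => []
```

For pagination, the offset for a given page is just the page size times the number of pages before it.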

There are two more options you can set with sort(). The first option allows you to use a key lookup to find the actual value to order by. Thus, if you had a set of ID's, but actually wanted to sort() by an attribute associated with those ID's, you could look up the attribute. Let me show you what I mean:

#!/usr/bin/env ruby -wKU

require "redis"

db = Redis.new

%w[one two three four].each do |num|
  id = db.incr("global:num_id")
  db["num:#{id}:word"]      = num
  db["num:#{id}:word_size"] = num.size
  db.set_add(:nums, id)
end

db.sort(:nums, :by => "num:*:word_size")  # => ["1", "2", "4", "3"]

I made the ID's match up with the numbers here just to make the example easy to follow, but notice how those ID's were actually ordered by their "num:ID:word_size" key. As you can see, sort() replaced the * in my key name with the actual member of the set, which was the ID in this case.

The final feature lets us take that one step further. Just as we can fetch a key for the order, we can also fetch a key for the result. That allows us to return not just the ID, but the actual data. For example:

#!/usr/bin/env ruby -wKU

require "redis"

db = Redis.new

%w[one two three four].each do |num|
  id = db.incr("global:num_id")
  db["num:#{id}:word"]      = num
  db["num:#{id}:word_size"] = num.size
  db.set_add(:nums, id)
end

db.sort( :nums, :by  => "num:*:word_size",
                :get => "num:*:word" )  # => ["one", "two", "four", "three"]

So the sort() fetched the word_size keys to build the ordering and then fetched the word keys as the actual result. If you want, you can even fetch multiple keys for each result:

#!/usr/bin/env ruby -wKU

require "redis"

db = Redis.new

%w[one two three four].each do |num|
  id = db.incr("global:num_id")
  db["num:#{id}:word"]      = num
  db["num:#{id}:word_size"] = num.size
  db.set_add(:nums, id)
end

db.sort( :nums, :by  => "num:*:word_size",
                :get => %w[ num:*:word
                            num:*:word_size ] )  # => ["one", "3",
                                                 #     "two", "3",
                                                 #     "four", "4",
                                                 #     "three", "5"]

sort() is the power tool of collection ordering and fetching. I've shown most of the options separately to make things easier to understand, but you can combine them as needed to order, limit, and even look up your data.
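
To take some of the mystery out of the :by and :get lookups, here's a sketch of the same substitution done by hand against a plain Hash. sort_by_pattern() is a hypothetical function, the Hash stands in for the database, and breaking ties by member is a choice I made to keep the output predictable. I'm not claiming that matches how Redis orders equal values:

```ruby
# Order members by the value found at the "by" key pattern, then
# optionally swap each member for the values at the "get" pattern(s).
# The * in a pattern is replaced with the member, as described above.
# sort_by_pattern() is a hypothetical helper for illustration.
def sort_by_pattern(store, members, by, get = nil)
  ordered = members.sort_by do |member|
    [store[by.sub("*", member)].to_f, member]  # ties broken by member
  end
  return ordered unless get
  ordered.flat_map do |member|
    Array(get).map { |pattern| store[pattern.sub("*", member)] }
  end
end

store = { }
%w[one two three four].each_with_index do |num, i|
  store["num:#{i + 1}:word"]      = num
  store["num:#{i + 1}:word_size"] = num.size.to_s
end
ids = %w[1 2 3 4]

sort_by_pattern(store, ids, "num:*:word_size")
# => ["1", "2", "4", "3"]
sort_by_pattern(store, ids, "num:*:word_size", "num:*:word")
# => ["one", "two", "four", "three"]
sort_by_pattern(store, ids, "num:*:word_size", %w[num:*:word num:*:word_size])
# => ["one", "3", "two", "3", "four", "4", "three", "5"]
```

Seen this way, the flat result of a multi-key :get makes sense: each member simply expands into one value per pattern, in order.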

The collection types add quite a bit of flexibility to simple key-value storage. You can use these tools to group keys and process them by various criteria. This makes Redis useable in a wider range of scenarios than some simple key-value stores.

Using Redis as a Key-Value Store

2014-04-18T22:27:14Z

[Update: though all of the techniques I show here still apply, many methods of the Redis gem have changed names to match the actual Redis commands they call.]

Redis is first and foremost a server providing key-value storage. As such, the primary features of any client library are for connecting to the server and manipulating those key-value pairs.

Connecting to the Server

Connecting to the Redis server can be as simple as Redis.new, thanks to some defaults in both the server and Ezra's Ruby client library for talking to that server. I won't pass any options to the constructor calls below, but you can use any of the following as needed:

:host if you need to connect to an external host instead of the default 127.0.0.1
:port if you need to use something other than the default port of 6379
:password if you configured Redis to require a password on connection
:db if you want to select one of the multiple configured databases, other than the default of 0 (databases are identified by a zero-based index)
:timeout if you want a different timeout for Redis communication than the default of 5 seconds
:logger if you want the library to log activity as it works

Getting and Setting Keys

Once you are connected, Redis can be used as an in-memory key-value store, much like memcached. The client library exposes this key getting and setting functionality just like a Ruby Hash:

#!/usr/bin/env ruby -wKU

require "redis"

db = Redis.new
db[:my_key]  # => nil
db[:my_key] = "my_value"
db[:my_key]  # => "my_value"
db.delete(:my_key)
db[:my_key]  # => nil

Notice that we can read, write, and delete key-value pairs just as we could with a Hash, using [], []=, and delete() respectively. If we look for a key that isn't in the database, we get nil just as Ruby would give us for a Hash.

There are two other methods for slightly more advanced setting operations. First, getset() can be used to update a value while also retrieving its previous value. There's also a set_unless_exists() method that will not override an existing value. Here are those operations in action:

#!/usr/bin/env ruby -wKU

require "redis"

db       = Redis.new
db[:adv] = "old"

db.getset(:adv, "new")              # => "old"
db[:adv]                            # => "new"
db.set_unless_exists(:adv, "lost")  # => false
db[:adv]                            # => "new"

Other Hash-like operations are supported. For example, you can check for the existence of a key, count the number of keys in the database, fetch a list of keys matching a pattern, or even get random keys:

#!/usr/bin/env ruby -wKU

require "redis"

db             = Redis.new
db[:key1]      = 1
db[:key2]      = 2
db[:key3]      = 3
db[:other_key] = "other"

db.key?(:key3)   # => true
db.key?(:key4)   # => false
db.dbsize        # => 4
db.keys("key*")  # => ["key2", "key3", "key1"]
db.randkey       # => "key2"
db.randkey       # => "other_key"

Note that the pattern passed to keys() is similar to a file glob. You can use a ? in the pattern to mean any one character and * to match any run of characters. You can also use \ to escape these special characters and match them literally.
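
If you want to get a feel for these patterns without a server, Ruby's own File.fnmatch() follows similar glob rules. This is only a local stand-in I'm using for illustration. Redis does its matching on the server and the two matchers may differ in edge cases. The key names are made up:

```ruby
keys = %w[key1 key2 key3 key10 other_key key*]

# File.fnmatch is Ruby's local glob matcher, used here only as a stand-in
# for the matching Redis performs on the server.
star     = keys.select { |k| File.fnmatch("key*",   k) }
question = keys.select { |k| File.fnmatch("key?",   k) }
escaped  = keys.select { |k| File.fnmatch("key\\*", k) }

star      # => ["key1", "key2", "key3", "key10", "key*"]
question  # => ["key1", "key2", "key3", "key*"]
escaped   # => ["key*"]
```

Note that the backslash has to be doubled inside a double-quoted Ruby String to get a single literal backslash into the pattern.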

It's worth noting that all Redis keys and values are pretty much Strings:

#!/usr/bin/env ruby -wKU

require "redis"

db             = Redis.new
db[Object.new] = Object.new

k = db.keys("#*").first  # => "#<Object:0x301894>"
db[k]                    # => "#<Object:0x301858>"

There are some minor exceptions where values can sometimes be treated as numbers. You can also have collections in values, but the collections hold the typical Strings, with sometimes numeric meaning. I'll talk more about these cases later.

Key Expiration

Redis supports setting expiration times on stored keys. When that time expires, the key will be purged. This is very useful when using Redis as a cache. You can set an expiration time by calling the expire() method, or you can just use a different version of the key setting that includes the timeout:

#!/usr/bin/env ruby -wKU

require "redis"

db = Redis.new
db.set(:cached, "short lived", 3)

4.times do
  sleep 1
  puts "db[:cached] is #{db[:cached].inspect} at #{Time.now}"
end
# >> db[:cached] is "short lived" at Sat Sep 05 14:01:07 -0500 2009
# >> db[:cached] is "short lived" at Sat Sep 05 14:01:08 -0500 2009
# >> db[:cached] is "short lived" at Sat Sep 05 14:01:09 -0500 2009
# >> db[:cached] is nil at Sat Sep 05 14:01:10 -0500 2009

Note that a write operation against a key with an expiration timeout set, a volatile key in Redis parlance, clears the timeout. You can use the ttl() method if you need to examine the time to live for a key. There's also a matching get() method to go with the set() I used above, though it's just an alias for [] and has nothing to do with timeouts.
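
To make the rules above concrete, here's a toy, in-memory model of volatile keys in plain Ruby. It is not how Redis is implemented and it doesn't talk to a server. The ToyExpiringStore and FakeClock names are hypothetical, and I chose to have ttl() return nil for a key without a timeout, which is my own simplification:

```ruby
# A toy, in-memory model of volatile keys. This is NOT how Redis is
# implemented; it only mirrors the behavior described above.
class ToyExpiringStore
  def initialize(clock = Time)
    @clock   = clock  # anything that responds to now()
    @values  = {}
    @expires = {}     # key => Time at which the key should be purged
  end

  # Store a value, optionally with a timeout in seconds. A plain write
  # clears any earlier timeout on the key.
  def set(key, value, seconds = nil)
    @values[key.to_s] = value.to_s
    if seconds
      @expires[key.to_s] = @clock.now + seconds
    else
      @expires.delete(key.to_s)
    end
    value
  end

  def [](key)
    purge(key.to_s)
    @values[key.to_s]
  end

  # Seconds left before the key is purged, or nil if no timeout is set.
  def ttl(key)
    purge(key.to_s)
    expires_at = @expires[key.to_s]
    expires_at && (expires_at - @clock.now)
  end

  private

  # Drop the key if its timeout has passed.
  def purge(key)
    expires_at = @expires[key]
    return unless expires_at && @clock.now >= expires_at
    @values.delete(key)
    @expires.delete(key)
  end
end

FakeClock = Struct.new(:now)  # hypothetical clock we can advance by hand
clock     = FakeClock.new(Time.at(0))
cache     = ToyExpiringStore.new(clock)

cache.set(:cached, "short lived", 3)
cache.ttl(:cached)  # => 3.0
clock.now += 3
cache[:cached]      # => nil
```

Passing in the clock is just a convenience so the example doesn't need to sleep().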

Counters

Redis supports some other interesting operations on simple keys. For example, you can use the atomic incr() operation to manage globally unique IDs:

#!/usr/bin/env ruby -wKU

require "redis"

3.times do
  fork do
    db  = Redis.new
    ids = Array.new(10) { db.incr("global:next_user_id") }
    puts "#{Process.pid}: #{ids.join(', ')}"
  end
end

Process.waitall
# >> 1148: 1, 3, 6, 9, 12, 15, 17, 22, 25, 27
# >> 1147: 4, 8, 11, 13, 16, 18, 20, 23, 26, 29
# >> 1149: 2, 5, 7, 10, 14, 19, 21, 24, 28, 30

This is one of the exceptions I mentioned earlier where Redis will try to treat a value as a number. In this case an Integer is expected and a Float will be truncated. If the key doesn't exist or holds non-numeric content, it is set to "0" and then modified as requested. That's why you can start with a key that doesn't exist, as I did above.

There is a matching decr() operation. You can also choose to pass an Integer as the second argument to these methods to raise or lower the count by that amount.
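
Here's a rough sketch of those counter rules using a plain Hash in place of Redis. The toy_incr() and toy_decr() helpers are hypothetical names of my own, and Ruby's String#to_i is only a convenient analogy for how the stored value gets read as a number. It says nothing about how Redis parses values internally:

```ruby
# A toy model of the counter rules, using a plain Hash in place of Redis.
# String#to_i is only an analogy for how Redis reads the stored value.
def toy_incr(store, key, amount = 1)
  current    = store[key].to_s.to_i  # nil and non-numeric content become 0
  store[key] = (current + amount).to_s
  current + amount
end

def toy_decr(store, key, amount = 1)
  toy_incr(store, key, -amount)
end

store = { "words" => "not a number", "float" => "3.7" }

toy_incr(store, "missing")      # => 1
toy_incr(store, "words")        # => 1
toy_incr(store, "float")        # => 4
toy_incr(store, "missing", 10)  # => 11
toy_decr(store, "missing", 5)   # => 6
```

The real operations are atomic on the server, which this single-process sketch makes no attempt to show.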

Getting and Setting Multiple Keys at Once

Another interesting feature is the ability to fetch more than one key at a time:

#!/usr/bin/env ruby -wKU

require "redis"

db                    = Redis.new
db["user:1:username"] = "JEG2"
db["user:1:password"] = "secret"

db.mget("user:1:username", "user:1:password")  # => ["JEG2", "secret"]

We can tie the counter and multiple get features together to do some basic object storage inside Redis:

#!/usr/bin/env ruby -wKU

require "redis"

DB = Redis.new

class User
  def initialize(id = nil)
    @id     = id
    @fields = Hash.new
    load if @id
  end

  attr_reader :id

  def method_missing(meth, *args, &blk)
    if meth.to_s =~ /\A(\w+)=/
      @fields[$1] = args.first
    else
      @fields[meth.to_s]  # fields are stored under String keys
    end
  end

  def load
    keys    = DB.keys("user:#{@id}:*")
    values  = DB.mget(*keys)
    @fields = Hash[*keys.map { |k| k[/\w+\z/] }.zip(values).flatten]
  end

  def save
    @id ||= DB.incr("global:next_user_id")
    DB.pipelined do |commands|
      @fields.each do |k, v|
        commands["user:#{@id}:#{k}"] = v
      end
    end
  end

  def inspect
    "<#User:#{@id} #{@fields.map { |k, v| "#{k}:#{v.inspect}" }.join(' ')}>"
  end
end

User.new(1)  # => <#User:1 username:"JEG2" password:"secret">

new_guy = User.new
new_guy.username = "New Guy"
new_guy.password = "123"
new_guy.save

User.new(new_guy.id)  # => <#User:31 username:"New Guy" password:"123">

I snuck in another feature in my implementation of save() for that example: pipelined commands. If you're going to issue a bunch of commands real quick, as I did with the field saves in this case, you can pipeline them. This queues them up locally and then fires them all at the Redis server as your block exits. This can make those batch operations a little more efficient.
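
The key handling inside load() can be tried on its own, without a server. In this sketch the two Arrays are sample data I made up to stand in for what keys() and mget() would hand back:

```ruby
# Sample data standing in for what DB.keys() and DB.mget() would return.
keys   = ["user:1:username", "user:1:password"]
values = ["JEG2", "secret"]

# The trailing run of word characters in each key is the field name.
names  = keys.map { |k| k[/\w+\z/] }  # => ["username", "password"]
fields = Hash[names.zip(values)]

fields["username"]  # => "JEG2"
fields["password"]  # => "secret"
```

The pairing works because mget() returns values in the same order as the keys it was given, whatever order keys() produced them in.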

Saving and Shutting Down

Redis does write periodic snapshots of its data to disk, unlike memcached. I've already talked about how you can configure exactly when these backups happen on the server side, but you can also request a snapshot from the client side. Just call bgsave() to trigger the usual asynchronous save or save() if you would prefer a synchronous backup.

When you are done playing around with a Redis session, you can call the shutdown() method to close all connections, dump the database to disk, and exit the server. If you don't wish to keep the data, you can call flush_db() to ditch the data in the database you are connected to. You may also wish to examine the statistics from a call to info() before you shutdown() to see what work the server has done.

That covers basic key-value store usage. However, Redis has some unique features that really set it apart from other key-value stores. We will look into those next.

Setting up the Redis Server

2014-04-18T21:16:36Z

Before we can play with Redis, you will need to get the server running locally. Luckily, that's very easy.

Installing Redis

Building Redis is a simple matter of grabbing the code and compiling it. Once built, you can place the executables in a convenient location in your PATH. On my box, I can do all of that with these commands:

curl -O http://redis.googlecode.com/files/redis-1.0.tar.gz
tar xzvf redis-1.0.tar.gz 
cd redis-1.0
make
sudo cp redis-server redis-cli redis-benchmark /usr/local/bin

Those commands build version 1.0 of the server, which is the current stable release as of this writing. You may need to adjust the version numbers down the road to get the latest releases though.

I also copied the executables to where I prefer to have them: /usr/local/bin. Feel free to change that directory in the last command to whatever you prefer.

If you will be talking to Redis from Ruby, as I will show in all of my examples, you are going to need a client library. I recommend Ezra Zygmuntowicz's redis-rb. You can install that gem with:

gem install redis

Running and Configuring Redis

That's it for the install. Launching the server is even easier. The pattern is just:

redis-server path/to/redis.conf

The argument is the path to the configuration file that tells Redis how you want it to behave. There's a sample configuration file in the Redis source code that shows the options.

I'm not going to discuss all of the configuration options. They are already well commented in the sample file. However, I do want to mention a few things.

First, if you will only be connecting to a local Redis instance, uncomment the bind configuration in the sample file:

bind 127.0.0.1

That tells Redis not to listen for external connections.

If you do need to accept external connections, you may want to set a limit for the number of simultaneous connections to avoid exhausting the file descriptors on your server. You can also adjust the timeout for inactive connections to reclaim those resources:

maxclients 128
timeout 60

There are some other non-network limits you may wish to fiddle with as well.

By default, Redis supports multiple databases. It even has a move() command that allows you to transfer keys between databases. I find I usually only want one though. I'm more likely to launch multiple servers if I want more. This would allow me to control resources, like memory consumption, on a per database basis, at the cost of losing the atomic move(). Even if I didn't just want one database though, I think it would be rare to need the 16 that are configured by default. You can easily turn that down:

databases 1

Another limit you may want to consider setting is the maximum memory limit. If you plan to use Redis as a memcached replacement, you will likely wish to control how much memory it can consume. You can set the maximum number of bytes Redis can allocate, after which it will start purging volatile keys. If it cannot reclaim any more memory it will start refusing write commands. Here's a sample setting for a 100MB limit:

maxmemory 104857600
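
Since the limit is given in bytes, a quick bit of Ruby can work out the number for whatever size you have in mind. The 100 here just matches the sample above:

```ruby
megabytes = 100
bytes     = megabytes * 1024 * 1024  # => 104857600

puts "maxmemory #{bytes}"
# >> maxmemory 104857600
```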

Note that the above setting is really only a good idea when using Redis as a cache. If you are using it as a general database, you will need to monitor its memory consumption and take action before too many resources are consumed.

The last setting I want to talk about is probably the most important for using Redis.

The server will periodically fork and asynchronously dump the current contents of the database to disk. The dump is actually made to a temporary file and then moved to replace any older dump, so the operation is atomic and won't leave you with a partially dumped database. If Redis is eventually shut down and reloaded, it will restore from this dump file.

How often it dumps the keys is configurable by the amount of time that passes and the number of changes that have been made to the data. For example, the following settings tell Redis to dump the database after 60 seconds if 100 changes have been made or after five minutes if there has been at least 1 change:

save 300 1
save 60 100

As you can see, you can set several different conditions. As soon as any one line of conditions matches, meaning both the time and the changes must match, the database is dumped and both counts restart.

Note that the time condition can be met before the changes. This means that, using the settings above, I can launch a Redis server, let it sit for five or more minutes, and then change a single key to trigger an immediate dump. The time will have already passed and as soon as I make the changes condition true as well, that is enough. In other words, I don't have to wait five minutes after I make the change.
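
The logic described above can be sketched in a few lines of Ruby. This is a toy model for thinking about the rules, not Redis's actual implementation, and dump_due?() is a hypothetical helper of my own:

```ruby
# A toy model of the snapshot rules, not Redis's actual implementation.
# Each rule is [seconds, changes], mirroring a "save" line in redis.conf.
RULES = [[300, 1], [60, 100]]

# True if any single rule has both of its conditions met.
def dump_due?(seconds_since_dump, changes_since_dump, rules = RULES)
  rules.any? do |seconds, changes|
    seconds_since_dump >= seconds && changes_since_dump >= changes
  end
end

dump_due?(30,  500)  # => false
dump_due?(60,  100)  # => true
dump_due?(400, 0)    # => false
dump_due?(400, 1)    # => true
```

The last two calls show the case from the previous paragraph: after 400 idle seconds nothing is due, but the very first change tips the five minute rule over.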

That covers plenty about installing and running the Redis server. You are now all set to play with it.

Using Key-Value Stores From Ruby

2014-04-18T21:01:21Z

I've been playing with a few different key-value stores recently. My choices are pretty popular and you can find documentation for them. However, it can still be a bit of work to relate everything to Ruby specific usage, which is what I care about. Given that, here are my notes on the systems I've used.

Redis

Tokyo Cabinet, Tokyo Tyrant, and Tokyo Dystopia

Installing the Tokyo Software
Tokyo Cabinet as a Key-Value Store
Tokyo Cabinet's Key-Value Database Types
Tokyo Cabinet's Tables
Threads and Multiprocessing With Tokyo Cabinet
Tokyo Tyrant as a Network Interface
The Strengths of Tokyo Cabinet