Key-Value Stores

Notes from my learning about simple NoSQL storage solutions.

January 10, 2010

Tokyo Cabinet's Key-Value Database Types

We've taken a good look at Tokyo Cabinet's Hash Database, but there's a lot more to the library than just that. Tokyo Cabinet supports three other kinds of databases. In addition, each database type accepts various tuning parameters that can be used to change its behavior. Each database type and setting involves different tradeoffs, so you really have a lot of options for turning Tokyo Cabinet into exactly what you need. Let's look into some of those options now.

The B+Tree Database

Tokyo Cabinet's B+Tree Database is a little slower than the Hash Database we looked at before. That's its downside. However, giving up a little speed gains you several extra features that may just allow you to work smarter instead of faster.

The B+Tree Database is a more advanced form of the Hash Database. What that means is that all of the stuff I showed you in the last article still applies. You can set, read, and remove values by keys, iteration is supported, and you still have access to the neat options like adding to counters. With a B+Tree Database you get all of that and more.
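
For example, here's a minimal sketch of those carried-over basics (the file name is just an example, and I'm assuming the same Hash-like nil return for a missing key that we saw with the Hash Database):

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("carryover.tcb") do |db|
  db[:language] = :Ruby          # set a value by key, just as before
  db[:language]                  # => "Ruby"
  db.delete(:language)           # => "Ruby"
  db[:language]                  # => nil
end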

The first major addition is that a B+Tree Database is ordered. You don't really need to do anything to turn this on; it's just the way it is. As you add pairs to the database, they will be ordered by the keys you use. The default ordering is lexical:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("ordered.tcb") do |db|
  db[:c] = 3
  db[:a] = 1
  db[:b] = 2
  db.to_a  # => [["a", "1"], ["b", "2"], ["c", "3"]]
end

This simple example shows us a couple of things. First, creating a B+Tree Database is as simple as changing the file extension. Remember when I said the .tch stood for Tokyo Cabinet Hash Database? Well, it shouldn't be too surprising that .tcb stands for Tokyo Cabinet B+Tree Database. Oklahoma Mixer will notice which extension you use and load the right features for that database type.

The other thing to notice here is the ordering. I purposely added the keys out of order, but you can see that to_a() shows them all lined up correctly. Now to_a() is really just an iterator the database object inherits from Enumerable, so we now know that iteration will be in database order. Methods like keys() and even values() will return their listings in order as well.
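
For example, reading the same little database back through those methods (just a quick sketch reusing the data from above):

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("ordered.tcb") do |db|
  db[:c] = 3
  db[:a] = 1
  db[:b] = 2
  db.keys    # => ["a", "b", "c"]
  db.values  # => ["1", "2", "3"]
end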

As I said, the default ordering is lexical, so numeric keys are a little strange:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("lexical.tcb") do |db|
  db[1]  = :first
  db[2]  = :middle
  db[11] = :last
  db.to_a  # => [["1", "first"], ["11", "last"], ["2", "middle"]]
end

Notice they don't come out in the order we would probably think is most natural, as the values I assigned suggest. To fix that we need to change the default ordering, and you can do that using a tuning parameter of the B+Tree Database. We are allowed to set a comparison function when we open() the database that will order the keys however we desire. This function is just like a block you would pass to sort() in Ruby: it will be handed two keys at a time to compare and it is expected to return negative, zero, or positive for the first argument being less than, equal to, or greater than the second. The good news is, you can usually cheat your way out of remembering these comparison rules by leaning on Ruby's spaceship operator:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open( "numerical.tcb",
                    :cmpfunc => lambda { |a, b| a.to_i <=> b.to_i } ) do |db|
  db[1]  = :first
  db[2]  = :middle
  db[11] = :last
  db.to_a  # => [["1", "first"], ["2", "middle"], ["11", "last"]]
end

This example shows how tuning parameters get set with Oklahoma Mixer. Just pass some keyword arguments to open() for each parameter you need to adjust. This allows Oklahoma Mixer to perform the needed setup before connecting to your database. That's critical for things like a B+Tree comparison function that have to be set before the database starts accepting data.

It's worth noting that the comparison function is not stored in the database file and needs to be reset (to the same function if you want to avoid unpredictable results) each time you open() that database.
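
For example, here's a sketch of two separate sessions against the same file, passing the same function both times (the file name and helper lambda are mine, not part of the library):

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

by_number = lambda { |a, b| a.to_i <=> b.to_i }

# first session:  build the database with a numeric ordering
OklahomaMixer.open("reopened.tcb", :cmpfunc => by_number) do |db|
  db[1]  = :first
  db[11] = :last
end

# second session:  pass the same function again to keep the ordering consistent
OklahomaMixer.open("reopened.tcb", :cmpfunc => by_number) do |db|
  db[2] = :middle
  db.to_a  # => [["1", "first"], ["2", "middle"], ["11", "last"]]
end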

OK, enough about ordering. What else do we get with the B+Tree Database?

You also get key ranges. Since the database has an inherent order, we're no longer limited to :prefix searches of the keys(); we can now ask for all of the keys between two endpoints:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("ranges.tcb") do |db|
  db.update(:a => 1, :b => 2, :c => 3, :d => 4, :e => 5)
  db.keys(:range => "ab".."d")   # => ["b", "c", "d"]
  db.keys(:range => "ab"..."d")  # => ["b", "c"]
end

Note that I used "ab" in my Range queries, which falls between the actual "a" and "b" keys in the database. That works just fine.

You can also pass the :limit option I've shown before with :range, but you can't pass :prefix. It's one or the other: :prefix or :range.
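
For example, something like this should work (a sketch; I'm assuming :limit simply caps the returned list at the first matching keys, as it did with :prefix):

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("ranges.tcb") do |db|
  db.update(:a => 1, :b => 2, :c => 3, :d => 4, :e => 5)
  db.keys(:range => "b".."e", :limit => 2)  # => ["b", "c"]
end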

This ability to work with a Range of keys is even extended to the iterators. You've always had the ability to stop iterating whenever you like using Ruby's break keyword, but now you can tell the iterators where to start, making it possible to iterate over a subset of the pairs in the database:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("ranges.tcb") do |db|
  db.update(:a => 1, :b => 2, :c => 3, :d => 4, :e => 5)
  db.each("ab") do |key, value|
    puts "%p => %p" % [key, value]
    break if key >= "d"
  end
  # >> "b" => "2"
  # >> "c" => "3"
  # >> "d" => "4"
end

Again, I used "ab" and it jumped to the first key after that. The only place that might get a little confusing is if you try that same trick with the (also added to B+Tree Databases) reverse_each() iterator:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("ranges.tcb") do |db|
  db.update(:a => 1, :b => 2, :c => 3, :d => 4, :e => 5)
  db.reverse_each("ddd") do |key, value|
    puts "%p => %p" % [key, value]
    break if key <= "b"
  end
  # >> "e" => "5"
  # >> "d" => "4"
  # >> "c" => "3"
  # >> "b" => "2"
end

See how it started with "e"? It always jumps to the first key equal to or after the one you provide, even if you are planning to iterate backwards. Since "ddd" is between "d" and "e", that means we start on the key after "ddd" ("e").

B+Tree Databases have one more feature and it's a wild one. These databases support an additional storage mode that allows duplicate values to be stored under the same key:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("dupes.tcb") do |db|
  %w[James Dana Baby].each do |name|
    db.store("Gray", name, :dup)
  end
  db.to_a  # => [["Gray", "James"], ["Gray", "Dana"], ["Gray", "Baby"]]
end

As you can see, the :dup storage mode shuts off the normal value replacing behavior and instead inserts the duplicate value after what was already stored for that key.

Several methods in Oklahoma Mixer have been expanded to support these duplicate values. For example, with a B+Tree Database you can scope values() or size() to a specific key, retrieving just the values() stored under that key or getting a count of how many values there are for that key:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("names.tcb") do |db|
  db["Matsumoto"] = "Yukihiro"
  %w[James Dana].each do |name|
    db.store("Gray", name, :dup)
  end
  db.values         # => ["James", "Dana", "Yukihiro"]
  db.values("Gray") # => ["James", "Dana"]
  db.size           # => 3
  db.size("Gray")   # => 2
end

You will need to use these methods to work with duplicate values because normal indexing, fetch(), and delete() still just work with the first value stored under a key. That behavior can be valuable too though:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("todo.tcb") do |db|
  if db.size.zero?
    %w[B+tree Fixed-length tuning].each do |topic|
      db.store(:blog, topic, :dup)
    end
  end

  puts "Write about #{db.delete(:blog)}."
end

If I run that program three times, this is what I see:

$ ruby tc_example.rb 
Write about B+tree.
$ ruby tc_example.rb 
Write about Fixed-length.
$ ruby tc_example.rb 
Write about tuning.

See how delete() just kept pulling the first value that was left? That allowed us to use it as a simple queue in this case.

The delete() method can be passed the :dup storage mode as a second argument. When you do, all values under the passed key will be removed.
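
Here's a sketch of that, using a fresh file name so we don't collide with the earlier examples:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("cleanup.tcb") do |db|
  %w[James Dana Baby].each do |name|
    db.store("Gray", name, :dup)
  end
  db.delete("Gray")        # => "James" (just the first value is removed)
  db.delete("Gray", :dup)  # removes all remaining values under the key
  db.size                  # => 0
end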

When working with duplicates, be aware that keys() and each_key() (or any iterator) behave differently. keys() returns a unique list, so keys with duplicate values under them will only be listed once. Iteration walks each pair in the database though, so a key will come up once for each value stored under it. Put another way, iteration does show duplicates while keys() won't.
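
A quick sketch shows the difference (I'm assuming each_key() yields one key per stored pair, as just described; the file name is just an example):

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("dupe_keys.tcb") do |db|
  %w[James Dana].each do |name|
    db.store("Gray", name, :dup)
  end
  db["Matsumoto"] = "Yukihiro"

  db.keys                            # => ["Gray", "Matsumoto"]
  listed = [ ]
  db.each_key { |key| listed << key }
  listed                             # => ["Gray", "Gray", "Matsumoto"]
end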

Let me show one last, slightly bigger example to bring together all of the features discussed above. Here's a little more involved queuing system:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

GROUPS = {nil => 0, "critical" => 1, "normal" => 2, "low" => 3}

order = lambda { |a, b| 
  a_group, a_priority = a.split(":")
  b_group, b_priority = b.split(":")
  [GROUPS[a_group], -a_priority.to_i] <=> [GROUPS[b_group], -b_priority.to_i]
}
OklahomaMixer.open("queue.tcb", :cmpfunc => order) do |db|
  case ARGV.shift
  when "add"
    group = "normal"
    ARGV.delete_if { |o| o =~ /\A--(critical|low)\z/ and group = $1 }
    priority = 10
    ARGV.delete_if { |o| o =~ /\A-(\d+)\z/ and priority = $1.to_i }
    db.store("#{group}:#{priority}", ARGV.join("; "), :dup)
  when "list"
    db.each do |key, value|
      puts key
      puts "  #{value}"
    end
  when "do_one"
    if key = db.keys(:limit => 1).first and job = db.delete(key)
      eval(job)
    end
  when "do_all"
    loop do
      if key = db.keys(:limit => 1).first and job = db.delete(key)
        eval(job)
      else
        break
      end
    end
  else
    abort "Usage:  #{$PROGRAM_NAME} add|list|do_one|do_all [OPTIONS]"
  end
end

Most of that should be pretty straightforward code after all we've talked about, but let me point out one tricky spot. I had to add the nil => 0 entry to my GROUPS because fetching a full, or in this case :limited, set of keys() is really just a :prefix query with an empty :prefix. Because of that, you want to make sure your ordering functions always order an empty String key before anything else. Calling split() on the empty String gives me a nil group, which is converted to a 0 so it will come out first.
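
If that's not clear, here's the comparison by itself in plain Ruby, outside of any database, showing how the empty String key ends up first:

#!/usr/bin/env ruby -wKU

GROUPS = {nil => 0, "critical" => 1, "normal" => 2, "low" => 3}

order = lambda { |a, b|
  a_group, a_priority = a.split(":")
  b_group, b_priority = b.split(":")
  [GROUPS[a_group], -a_priority.to_i] <=> [GROUPS[b_group], -b_priority.to_i]
}

"".split(":")                         # => []  (so the group is nil)
order.call("", "critical:100")        # => -1  (the empty String sorts first)
order.call("critical:100", "low:10")  # => -1  (critical jobs before low ones)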

It's probably also worth pointing out that I could have just used each() with the do_all command. However, always fetching the first key and using that is a little better in a multiprocessing environment where other processes might be adding to the queue. If I'm iterating through the list, I won't see new critical jobs if they are added above where I am. Using keys() though, I'll always grab the most important job next. This code isn't really built for multiprocessing, to tell the truth, but let's save that discussion for a later article.

Anyway, here's an example of me playing around with the program above, so you can see how it works in practice:

$ ruby queue.rb add 'puts "An average job."'
$ ruby queue.rb add --low 'puts "This can wait..."'
$ ruby queue.rb add --critical 'puts "Very important."'
$ ruby queue.rb add --critical -100 'puts "Most important!"'
$ ruby queue.rb list
critical:100
  puts "Most important!"
critical:10
  puts "Very important."
normal:10
  puts "An average job."
low:10
  puts "This can wait..."
$ ruby queue.rb do_one
Most important!
$ ruby queue.rb do_one
Very important.
$ ruby queue.rb do_all
An average job.
This can wait...

To summarize, the B+Tree Database gives you ordering, key ranges and cursor-based iteration (the ability to skip to a specific key), and duplicate storage. You pay a speed penalty for these added features though. That's the tradeoff.

The Fixed-length Database

Another type of database supported by Tokyo Cabinet is the Fixed-length Database. It too is an extension of the Hash Database, supporting most of the features I showed you in that article. However, I'm not going to lie to you: this database type comes with three significant restrictions.

First, all keys are Integers greater than 0. You can't use arbitrary Strings as you do with the Hash and B+Tree Databases. As such, you lose the ability to do :prefix queries on keys(). The database is ordered though, similar to the B+Tree Database. You can't change this ordering, but it is done numerically (instead of lexically) since all keys are just Integers anyway. Given that, :range queries on keys() are supported. Methods like keys() and the iterators will pass you Integer keys in Ruby, instead of the String keys you get with the other database types.
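
Here's a quick sketch of those points (the file name is just an example, and I'm assuming a Range of Integers works for :range since that's what the keys are):

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("integer_keys.tcf") do |db|
  db[2]  = :two
  db[10] = :ten
  db[1]  = :one
  db.keys                  # => [1, 2, 10]  (numeric order, Integer keys)
  db.keys(:range => 1..2)  # => [1, 2]
end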

The second downside is that all values stored have a fixed length, which is what gives the database its name. This length defaults to 255, but you can tune it to anything you like with the :width tuning parameter:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("four.tcf", :width => 4) do |db|
  db.update( 1 => :one,
             2 => :two,
             3 => :three,
             4 => :four,
             5 => :five,
             6 => :six,
             7 => :seven,
             8 => :eight )
  db.each_value do |num|
    puts num
  end
  # >> one
  # >> two
  # >> thre
  # >> four
  # >> five
  # >> six
  # >> seve
  # >> eigh
end

Notice how everything beyond my selected :width of 4 was just silently discarded. That's the fixed-length at work.

Also note that you create a Tokyo Cabinet Fixed-length Database as you probably expect by now, with the file extension .tcf.

Finally, the Fixed-length Database has one last size limit. The overall file size of the database is limited to 268435456 bytes (256 MB) by default. This too can be tuned using the :limsiz tuning parameter, and you are free to make the limit quite large. Just remember that values are fixed length, so setting :width => 1024, :limsiz => 4 * 1024 will mean your database only holds four keys. Trying to add data beyond this limit will raise an OklahomaMixer::Error::CabinetError.
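
For example, a sketch of bumping into that limit with the tiny numbers from above might look like this (treat the exact count of keys that fit as illustrative, since part of the file is overhead):

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("tiny.tcf", :width => 1024, :limsiz => 4 * 1024) do |db|
  begin
    1.upto(10) do |i|
      db[i] = "value #{i}"
    end
  rescue OklahomaMixer::Error::CabinetError => error
    puts "The database filled up:  #{error.message}"
  end
end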

That's a lot of minuses and you are probably wondering why anyone would be willing to accept all of these limits when we've already seen that there are more powerful options. The answer is performance. The Fixed-length Database is Tokyo Cabinet's fastest weapon. It treats the database file as a raw array of bytes and it can jump straight to any value with simple math. (Due to this, defrag() isn't supported on a Fixed-length Database, though Oklahoma Mixer does provide a no-op just to match the interface of the other database types.) That makes it wicked quick. If your data storage needs are simple enough to fit within these limitations, you can take advantage of this added speed boost.

The Fixed-length Database interface does have one other neat feature I should mention. It supports four special key names: :min, :max, :prev, and :next. You can use these values in many of the methods that take keys. For example:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("special.tcf") do |db|
  db.update( 1  => :first,
             2  => :middle,
             42 => :last )
  db[:min]  # => "first"
  db[:max]  # => "last"
  db[:next] = :added
  db.keys  # => [1, 2, 42, 43]
  db[43]   # => "added"
end

Be careful when using these. :min and :max will raise an OklahomaMixer::Error::CabinetError if there are no keys in the database. :prev (not shown above) is even pickier, requiring a :min key that is above 1, so it can safely add below it without hitting 0. I find :next pretty useful though, as it makes it possible to queue up values. Here's the simple queue example I showed in the B+Tree code rewritten to use a Fixed-length Database instead:

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

OklahomaMixer.open("todo.tcf") do |db|
  # load the data
  %w[B+tree Fixed-length tuning].each do |topic|
    db[:next] = "Write about #{topic}."
  end

  # read it back
  loop do
    begin
      puts db.delete(:min)
    rescue OklahomaMixer::Error::CabinetError  # no keys for :min
      break
    end
  end
  # >> Write about B+tree.
  # >> Write about Fixed-length.
  # >> Write about tuning.
end

To summarize, the B+Tree Database may slow you down a little, but the Fixed-length Database speeds you up, as long as you can accept certain restrictions. Select the database type that best fits the needs of your data.

Tuning Parameters

We've seen how to set tuning parameters in the examples above and already learned what some do. I'm going to save some parameters for discussions in later articles, when we talk about their specific functions. For now though, here are most of the tuning parameters available to the three databases we've covered so far.

  • :bnum (for Hash or B+Tree—can be used with optimize()): Specifies the number of elements to use in the bucket array. The default is 131071 for Hash Databases and 32749 for B+Tree Databases. The suggested size is from 0.5 to 4 times the total number of records stored for Hash Databases or 1 to 4 times the total for B+Tree Databases.
  • :apow (for Hash or B+Tree—can be used with optimize()): Specifies the size of record alignment as a power of 2. The default is 4 for Hash Databases and 8 for B+Tree Databases, meaning 2 ** 4 = 16 and 2 ** 8 = 256 respectively.
  • :fpow (for Hash or B+Tree—can be used with optimize()): Specifies the maximum number of elements in the free block pool as a power of 2. The default is 10, meaning 2 ** 10 = 1024.
  • :opts (for Hash or B+Tree—can be used with optimize()): Specifies the options for the database in a String of recognized character codes. There are no options by default, but this is commonly set to "ld" or "lb" for bigger databases. The options are:
    • "l" allows the database file to grow large (over 2 GB) by using a 64-bit bucket array.
    • "d" compresses each record in a Hash Database or page in a B+Tree Database with Deflate compression.
    • "b" compresses each record in a Hash Database or page in a B+Tree Database with BZIP2 compression.
    • "t" compresses each record in a Hash Database or page in a B+Tree Database with TCBS compression.
  • :rcnum (for Hash): Specifies the maximum number of records to be cached. It is 0 (disabled) by default.
  • :xmsiz (for Hash or B+Tree): Specifies the size of extra mapped memory. The default is 67108864 for Hash Databases or 0 (disabled) for B+Tree Databases.
  • :dfunit (for Hash or B+Tree): Specifies the auto defragmentation unit step number. It is 0 (disabled) by default.
  • :cmpfunc (for B+Tree): Specifies the comparison function used to order B+Tree Databases. See the detailed examples above for an explanation.
  • :lmemb (for B+Tree—can be used with optimize()): Specifies the number of members in each leaf page. The default is 128.
  • :nmemb (for B+Tree—can be used with optimize()): Specifies the number of members in each non-leaf page. The default is 256.
  • :lcnum (for B+Tree): Specifies the maximum number of leaf nodes to be cached. The default is 1024.
  • :ncnum (for B+Tree): Specifies the maximum number of non-leaf nodes to be cached. The default is 512.
  • :width (for Fixed-length—can be used with optimize()): Specifies the width of values in Fixed-length Databases. See the detailed examples above for an explanation.
  • :limsiz (for Fixed-length—can be used with optimize()): Specifies the limit on database file size in Fixed-length Databases. See the detailed examples above for an explanation.

I apologize for keeping the cryptic names in Oklahoma Mixer, but I felt it was better to stick with what Tokyo Cabinet uses so users could read about them in the documentation and other resources for that library. Tokyo Tyrant also uses these names to configure a database from the command line, so you will find them in several different contexts.

Database objects have an optimize() method that can be used to modify the tuning parameters of an open() database. The parameters that can be changed this way are noted above. There are sometimes additional restrictions though. For example, the :limsiz of a Fixed-length Database usually has to be increased when changed through optimize().
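
As a sketch, that might look something like the following for a Fixed-length Database (I'm assuming optimize() accepts the same keyword style arguments open() does, which is not something the examples above actually show):

#!/usr/bin/env ruby -wKU

require "oklahoma_mixer"

# open with some starting parameters...
OklahomaMixer.open("resizable.tcf", :width => 8, :limsiz => 1024 ** 2) do |db|
  db[1] = :small
  # ...then raise the file size limit later, without recreating the file
  db.optimize(:limsiz => 10 * 1024 ** 2)
end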

That covers the various key-value database types in Tokyo Cabinet. The fourth type is quite a bit different from what we've looked at so far, so I'll do a full article on it next.

Comments (4)
  1. Ilya Grigorik, January 16th, 2010

    Awesome stuff James. Great to see a good explanation on all the tuning parameters - though to be honest, I think those deserve an entire post by themselves! Lots of people get confused and lost in them.

  2. James Edward Gray II, January 16th, 2010

    I'm told there is some good documentation for them in Japanese, but I haven't seen a lot of great stuff in English, no.
  3. Mick Bisignani, March 22nd, 2011

    James, I found your two posts on Tokyo Cabinet to be priceless. Any thoughts on building out a Kyoto Cabinet equivalent?

  4. James Edward Gray II, March 22nd, 2011

    I may do that some day. I have yet to play with it myself, but I want to.