Gray Soft / Key-Value Stores / Tokyo Cabinet as a Key-Value Store

1

JAN
2010

Tokyo Cabinet as a Key-Value Store

This post is part of a series.

Like most key-value stores, Tokyo Cabinet has a very Hash-like interface from Ruby (assuming you use Oklahoma Mixer). You can almost think of a Tokyo Cabinet database as a Hash that just happens to be stored in a file instead of memory. The advantage of that is that your data doesn't have to fit into memory. Luckily, you don't have to pay a big speed penalty to get this disk-backed storage. Tokyo Cabinet is pretty darn fast.

Getting and Setting Keys

Let's have a look at the normal Hash-like methods as well as the file storage aspect:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  if db.size.zero?
    puts "Loading the database.  Rerun to read back the data."
    db[:one] = 1
    db[:two] = 2
    db.update(:three => 3, :four => 4)
    db["users:1"] = "James"
    db["users:2"] = "Ruby"
  else
    puts "Reading data."
    %w[ db[:one]
        db["users:2"]
        -
        db.keys
        db.keys(:prefix\ =>\ "users:")
        db.keys(:limit\ =>\ 2)
        db.values
        -
        db.values_at(:one,\ :two) ].each do |command|
      puts(command == "-" ? "" : "#{command} = %p" % [eval(command)])
    end
  end
end

If I run that code twice, I see:

$ ruby tc_example.rb 
Loading the database.  Rerun to read back the data.
$ ruby tc_example.rb 
Reading data.
db[:one] = "1"
db["users:2"] = "Ruby"

db.keys = ["one", "two", "three", "four", "users:1", "users:2"]
db.keys(:prefix => "users:") = ["users:1", "users:2"]
db.keys(:limit => 2) = ["one", "two"]
db.values = ["1", "2", "3", "4", "James", "Ruby"]

db.values_at(:one, :two) = ["1", "2"]

The file storage should be pretty obvious here. The first run of the program populated the data file and the second run read the data back. Obviously the data exists outside the process. It's actually stored in the file I named in my call to open(): "data.tch". We will dig a lot more into the meaning of the file extensions later, but for now it's enough to know that .tch stands for Tokyo Cabinet Hash database. It's also worth pointing out that you don't have to pass a block to open(). When not passed a block open() will return the database reference and expect you to call close() manually when you are done, just as you could with any IO object from Ruby. Tokyo Cabinet can buffer output just like Ruby's IO streams can, so know that your data isn't guaranteed to have hit the disk until after a close(). You can flush() the data to disk before that though, if needed.

The getting and setting methods shouldn't be much of a surprise. I started off by using calling size() to count the pairs already in the database. I then used []= to set a few keys. Note that I also used update() to add multiple keys at once. (The merge()/merge!() methods of Hash don't really make sense for the database so you do need to use the update() alias.) Later I read the data back with []. It's all very Hash-like. I was even able to ask for all of the keys() as you can with a Hash, but the Oklahoma Mixer version of that method supports some extra filters like the :prefix and :limit shown above. There's also the matching values() call, though it doesn't have any filters. You can see that Oklahoma Mixer also allows us to fetch multiple keys at once with values_at().

The last thing to get out of this example is the usual truth of key-value storage: keys and values are generally considered Strings. Notice how db[:one] = 1 actually stored a value of "1" under the key "one". Make sure you remember to convert it back when you read it if you really need the number.

Another cool Hash-like feature you can make use of are defaults. You can set a static object to be used as the default value for keys not in the database or provide code to run to generate the default. Here is some code showing the possibilities in action:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  # no default set
  db[:missing]  # => nil

  # an Object default
  db.default = 0
  db[:missing]  # => 0
end

# another way to set an Object default
OklahomaMixer.open("data.tch", :default => 42) do |db|
  db[:missing]  # => 42
end

# a Proc default
proc = lambda { |key|
  type, id = key.to_s.split(":")
  "New #{type} with id #{id}"
}
OklahomaMixer.open("data.tch", :default => proc) do |db|
  db["user:3"]  # => nil
end

Proc defaults are always executed, so if you want a default that returns a Proc, just pass a Proc that creates the desired Proc. All other objects are returned when indexing a missing value.

The important thing to remember about the defaults is that they are not stored in the file. They are just a convenience from the Ruby interface and you will need to set them again anytime you make a new connection to the database.

You can also walk the key-value pairs of a Tokyo Cabinet database using the standard iterators you expect in Ruby:

#!/usr/bin/env ruby -KU

require "pp"

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  # a Hash-like each()
  db.each do |key, value|
    puts "db[%p] = %p" % [key, value]
  end

  # other iterators from Enumerable are supported
  puts
  pp db.select { |key, _|     key   =~ /\Ausers:/ }
  pp db.find   { |_,   value| value =~ /\A\D/     }
end

Running that gives us:

db["one"] = "1"
db["two"] = "2"
db["three"] = "3"
db["four"] = "4"
db["users:1"] = "James"
db["users:2"] = "Ruby"

[["users:1", "James"], ["users:2", "Ruby"]]
["users:1", "James"]

You can see that we get an each() that walks key-value pairs, just as a Hash would. We also get all of the other standard Enumerable iterators. This gives us several different ways to comb the data for specific keys.

When you are done playing around with data, you have multiple options for getting rid of it. You can just clear() all keys if you are sure that's safe. Of course, just deleting the file has pretty much the same effect. If you need to selectively remove data, you can delete() a single key-value pair or use the delete_if() iterator to programmatically remove pairs.

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db.delete(:one)  # => "1"
  db.delete_if { |key, _| key =~ /\Ausers:/ }

  db.keys  # => ["two", "three", "four"]

  db.clear
  db.keys  # => []
end

The delete() method does return the value for the removed key, or nil if that key wasn't in the database. That feature isn't really safe if multiple processes are manipulating the data at once, unless you take the right precautions. We will talk a lot more about that later though.

That covers the basic Hash style interface to Tokyo Cabinet. Let's move into some other aspects of the library now.

Counters and Appended Values

We've already seen the standard Hash-like method of storing data with db[:key] = :value. The less common store() method from Hash is also supported (as is fetch() for retrieving values), so you can do things like db.store(:key, :value). The advantage of using store() is that it supports modes. You can use these modes to manipulate values in different ways. Let's look at some of the options.

Most key-values stores provide an action for atomically incrementing a counter and Tokyo Cabinet is no exception. This is important because it allows you to track unique ID's. Have a look at the various ways you can use the store() method to manage counters with the :add mode:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db["globals:user_id"]                    # => nil
  db["globals:float"]                      # => nil

  db.store("globals:user_id", 1, :add)     # => 1
  db.store("globals:user_id", 1, :add)     # => 2

  db.store("globals:float",    2.1, :add)  # => 2.1
  db.store("globals:user_id", -1,   :add)  # => 1
end

While all of that should be pretty obvious, this mode has a few gotchas you want to stay aware of. It's OK to start :adding to a nil field as I've shown above, but don't try to use a field already set to a non-:added value or you will likely get a OklahomaMixer::Error::CabinetError. This is true even if you have what you think is a number in the value. Tokyo Cabinet's numbers are a C-ish chunk of bytes so it won't recognize digits in String form. This also means you don't generally want to read an :added value with a normal call to []. It probably won't look like anything you are expecting. Tokyo Cabinet also uses different formats for Integer and Float values, so you will get the same error if you try to switch. Always add the same type of number to a given field.

Another unusual type of value management can be done in Tokyo Cabinet by appending to a value with :cat mode. For example:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db.store(:friend_ids, " 1", :cat)
  db.store(:friend_ids, " 3", :cat)
  db.store(:friend_ids, " 5", :cat)
  db.store(:friend_ids, " 3", :cat)

  db[:friend_ids]                        # => " 1 3 5 3"
  db[:friend_ids].to_s.scan(/\S+/).uniq  # => ["1", "3", "5"]
end

As you can see, this method will create a value if it didn't exist and then continue appending to the value after it does. If you need the opposite behavior, to avoid messing with a key that already exists, try :keep mode instead:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db[:exists] = "Can't touch this!"

  db.store(:exists, "Lost.", :keep)  # => false
  db[:exists]                        # => "Can't touch this!"
end

Similarly, you can just pass a block to store() that will be called if a key already exists. That block is expected to return the value that should be saved to the database:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  adder = lambda { |key, old_value, new_value| old_value.to_i + new_value }
  db[:num]  # => nil
  db.store(:num, 41, &adder)
  db[:num]  # => "41"
  db.store(:num,  1, &adder)
  db[:num]  # => "42"
end

These modes give you some powerful ways to build up values over time, even with different processes working on the same data. Their effects are atomic and that's important in any multiprocessing environment.

Transactions

Transactions are a big part of what makes Tokyo Cabinet great to work with. With them you can define a set of actions that must succeed or fail as a whole. Let's start by considering this from the classical transferring money between accounts example:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db["accounts:1:balance"] = 100
  db["accounts:2:balance"] = 100

  db.transaction do
    db["accounts:1:balance"] = db["accounts:1:balance"].to_i - 10
    db["accounts:2:balance"] = db["accounts:2:balance"].to_i + 10
  end

  db["accounts:1:balance"]  # => "90"
  db["accounts:2:balance"]  # => "110"
end

That code should be easy to understand. I just removed an amount from one account and added that same amount to the other. I've done this transfer inside of a transaction(), but it doesn't really have any effect when things go right as they did here. Let's break something and see what happens:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db["accounts:1:balance"] = 100
  db["accounts:2:balance"] = 100

  begin
    db.transaction do
      db["accounts:1:balance"] = db["accounts:1:balance"].to_i - 10
      db["accounts:2:balance"] = db["accounts:2:balance"].to_i + 10
      fail "Oops!"
    end
  rescue
    # do nothing:  just continue on to checking the balances
  end

  db["accounts:1:balance"]  # => "100"
  db["accounts:2:balance"]  # => "100"
end

This time we see the difference. Both of my actions against the database had already been processed. However, my fail() call was part of the same transaction() and the Exception meant everything had to be undone. Notice that the account balances were restored to their previous values.

It is possible for you to cancel a transaction() without triggering an Exception. That's what the abort() method is for:

#!/usr/bin/env ruby -KU

require "oklahoma_mixer"

OklahomaMixer.open("data.tch") do |db|
  db.store("globals:user_id", 41, :add)  # pretend we have a few users
  db["users:42:last_name"] = "Nobody"    # and some bad data

  user = {:first_name => "James", :last_name => "Gray"}
  db.transaction do
    user[:id] = db.store("globals:user_id", 1, :add)
    if user.all? { |k, v| db.store("users:#{user[:id]}:#{k}", v, :keep) }
      user[:saved] = true
    else
      db.abort
    end
  end

  unless user[:saved]
    puts "Unable to save user.  Problem field(s):"
    user.each_key do |key|
      if value = db["users:#{user[:id]}:#{key}"]
        puts %Q{db["users:#{user[:id]}:#{key}"] = #{value.inspect}}
      end
    end
    # >> Unable to save user.  Problem field(s):
    # >> db["users:42:last_name"] = "Nobody"
  end
end

As you can see, abort() didn't toss an Exception but it rolled back my transaction() all the same. None of the new user fields were added to the database because they couldn't all be safely added. I knew that because one of the :keep mode calls to store() returned false when it tried to set an already existing key.

That's the magic of transactions. They are an all-or-nothing thing. Only if your block completes with no Exception thrown and no call to abort() will all of the changes be made.

Database File Maintenance

There are a lot of advantages that come with a database that's just one file in the file system. You can build symlinks to it, set permissions on it, and check its size with the normal tools your OS provides (though Oklahoma Mixer does have a file_size() method that returns the file size in bytes, if you need it). Of course, there are also tradeoffs you should stay aware of.

First, The file can get a little bloated over time. The reason is normal fragmentation: Tokyo Cabinet may clear a key freeing up some space and later fill it with a not-quite-as-big item. It may not find a good use for the even smaller remaining space for a long time. This creates small pockets of unused space that grow the file over time.

The easiest way to deal with this is to call defrag() periodically at a slow time. This will lock up the database for a few seconds while Tokyo Cabinet cleans it up. This will take care of the wasted space and shrink the file size back down (assuming it was fragmented).

Another issue to stay aware of is how you make backup copies of the database file. You need to be careful about using standard tools like cp or rsync on a Tokyo Cabinet file. It's fine if you know all connections to it are currently closed, but it's not safe when a connection might be changing the data inside of it mid-copy. If you try that, you will likely get a corrupt copy of the database.

The solution is to call copy() and pass in the path where you would like to create a copy of the database. It will synchronize the data, lock out changes, and then make a full duplicate. This process is quite snappy, even with bigger data sets. If desired, you can ask Oklahoma Mixer for the path() of the original database, edit it in some small way, and use that as the path for the duplicate database.

Just make sure you keep these issues in mind as you plan out your storage.

Those are the basics of using Tokyo Cabinet as a key-value store, but there's really a lot more to what Tokyo Cabinet can do. I'll show you what all is built onto this simple foundation in upcoming articles.

This post is part of a series.

← Previous Post

↑ Table of Contents

→ Next Post

In: Key-Value Stores | Tags: Tokyo Cabinet | 7 Comments

Comments (7)

george January 2nd, 2010 Reply Link

Whow, thanks for your detailed writeup. I'am evaluating & watching several noSQL stores for about a year and must say that tokoy is the most compelling to me, due to its ease of use.
I have not read about the oklahoma_mixer so far and will give it a shot.
1. Reply (using GitHub Flavored Markdown)
  
  Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.
  
  Or login with:
  Name * Email URL Comment *
2. James Edward Gray II January 2nd, 2010 Reply Link
  
  I just made the first release of Oklahoma Mixer yesterday when I published this article, so it's still very new. I'm working hard on it though and really trying to get it useable as quickly as possible. Expect new releases with more features very soon.
  1. Reply (using GitHub Flavored Markdown)
    
    Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.
    
    Or login with:
    Name * Email URL Comment *
mickb January 9th, 2011 Reply Link
Thank you Thank you ! Great tutorial. May be the reason why I will pick TokyoCabinet over NoSQL options. Questions
1. Are you still maintaining the library given that KyotoCabinet is out ?
2. Do you know of any sites where I can find benchmarks for TK ? Looking to see how far the library can be pushed.
1. Reply (using GitHub Flavored Markdown)
  
  Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.
  
  Or login with:
  Name * Email URL Comment *
2. James Edward Gray II January 9th, 2011 Reply Link
  
  I haven't done a lot with Oklahoma Mixer lately, but that's mostly just because I'm busy. I really want to get back to it and finish it up. I confess that I got a little bummed out that KyotoCabinet was released right while I was going great guns on Oklahoma Mixer though.
  
  There's a little bit of timing information in the demos of Jeremy Hidegardner's LSRC 2009 Presentation. If choosing Tokyo though, I would encourage you to worry more about data size than speed. Tokyo is great for moderate data sizes, but it starts to choke with bigger data sets.
  1. Reply (using GitHub Flavored Markdown)
    
    Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.
    
    Or login with:
    Name * Email URL Comment *
zhando June 5th, 2014 Reply Link

I've always had a soft spot for TC and have had a ball using the rufus-tokyo library for various apps over the years.

Recently I've kind of rediscovered tc and how fun it is to use with shell scripts. I've been able to do quite a bit with just the tc utilities and some bash glue.

Oklahoma_Mixer came in REALLY handy when I hacked together an app (again, mostly with bash scripts) to weed out duplicate media files. Just pipe tcbmgr, uniq and a little ruby script w/ OK_mixer and the values method for b-trees made it a breeze to process the dupes.

Too bad the ruby community just didn't take to tc and kc in a big way.. Maybe this sweet library you started could be much further along. Of course OK_Mixer is a lot of fun as it is. Thanks for writing it.
1. Reply (using GitHub Flavored Markdown)
  
  Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.
  
  Or login with:
  Name * Email URL Comment *
2. James Edward Gray II June 5th, 2014 Reply Link
  
  Just FYI, I explain why I lost interest in this project in this article.
  1. Reply (using GitHub Flavored Markdown)
    
    Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.
    
    Or login with:
    Name * Email URL Comment *
  2. zhando June 5th, 2014 Reply Link
    
    Fair enough.. And it appears that Kyoto Cabinet is pretty much over as well. Since at least Nov 2012.. Too bad.
    1. Reply (using GitHub Flavored Markdown)
      
      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.
      
      Or login with:
      Name * Email URL Comment *