Character Encodings

My extensive coverage of a complex topic all programmers should study a little.

5 NOV 2008

The $KCODE Variable and jcode Library

All of the Ruby files I create start with the same shebang line:

#!/usr/bin/env ruby -wKU

It's not really needed for every file since it generally only matters if the file is executed. However, I tend to go ahead and add it to all Ruby files I build for several reasons:

  • You never know when a file may be executed (if __FILE__ == $PROGRAM_NAME; end sections are often added to libraries, for example)
  • It makes it obvious the file is Ruby code
  • It shows the rules this code expects: -w and -KU

The rules I mention here, specified by command-line switches, are the main point of interest. -w turns on Ruby's warnings, which are very handy. I recommend doing that whenever you can. But that doesn't have anything to do with character encodings. -KU does.

-KU sets a magic Ruby variable: $-K or $KCODE. You can do the same in your code if you aren't in a position to control the command-line arguments:

$KCODE = "U"

You probably recognize the U as a name for Ruby 1.8's UTF-8 encoding, from my earlier list of encodings. $KCODE can also be set to N (None, the default), E (EUC), or S (Shift JIS). Modern versions of Rails do set $KCODE = "U" for you.

So what does changing this magic variable do? First, it has the tiny effect of changing what Ruby escapes in inspect() output. Have a look:

$ ruby -e 'p "Résumé"'
"R\303\251sum\303\251"
$ ruby -KUe 'p "Résumé"'
"Résumé"

It's nice to be able to see your data as it actually is, assuming your terminal correctly handles UTF-8. However, that's really just a side-effect of setting $KCODE.
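
To see where those escapes come from, here is a rough sketch (not Ruby's actual inspect() implementation) that prints every byte above 127 as a three-digit octal escape. The octal_escape() name is made up for this illustration, and the string is built with pack("U*") so the example does not depend on the encoding of the source file:

```ruby
# Hypothetical helper, not part of Ruby: mimics how inspect() shows
# non-ASCII bytes when $KCODE is not set.
def octal_escape(str)
  str.unpack("C*").map { |byte|
    byte < 128 ? byte.chr : "\\%03o" % byte
  }.join
end

# "Résumé" built from Unicode code points and encoded as UTF-8
resume = [0x52, 0xE9, 0x73, 0x75, 0x6D, 0xE9].pack("U*")

puts octal_escape(resume)  # R\303\251sum\303\251
```

Each é is stored as the two bytes 0xC3 and 0xA9, which are 303 and 251 in octal.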

The main purpose of $KCODE is that it changes the default encoding of all regular expressions that do not specify otherwise. Thus we can split up UTF-8 data by characters without adding a /u to the end of our expression:

$ ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]
$ ruby -KUe 'p "Résumé".scan(/./m)'
["R", "é", "s", "u", "m", "é"]
$ ruby -KUe 'p "Résumé".scan(/./mn)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]

Notice that the default encoding for that second example was switched to UTF-8. However, I can still override this with an explicit encoding, as I did in example three by adding the /n option for None.
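
The difference between the two views can be sketched without touching $KCODE at all. This is only an illustration, assuming UTF-8 data: an explicit /u splits the string into characters, while unpack("C*") exposes the raw bytes. The string is again built with pack("U*") so the example does not depend on the source file's encoding:

```ruby
# "Résumé" built from Unicode code points and encoded as UTF-8
resume = [0x52, 0xE9, 0x73, 0x75, 0x6D, 0xE9].pack("U*")

characters = resume.scan(/./mu)   # one element per character
bytes      = resume.unpack("C*")  # one element per byte

p characters.size  # 6
p bytes.size       # 8
```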

Now, I tend to prefer $KCODE over $-K because the former seems more common in Ruby literature. In fact, Ruby 1.8 uses the term in another place, providing a method to get the encoding used in a Regexp:

$ ruby -e 'p /./.kcode'
nil
$ ruby -e 'p /./u.kcode'
"utf8"

Beware of that harmless-looking kcode() method, though, as it hides quite a few gotchas. First, you can see that it has its own names for the options that don't really match up with what we've seen elsewhere. It also doesn't seem to be aware of the $KCODE variable, in an ironic twist of naming:

$ ruby -e '$KCODE = "U"; re = /./m; p "Résumé".scan(re); p re.kcode'
["R", "é", "s", "u", "m", "é"]
nil

As you can see, the encoding of the expression was clearly set correctly, but kcode() didn't report the change. If you really want to know the encoding of a Regexp in Ruby 1.8, I suggest using code like the following:

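class Regexp
  # Report this expression's encoding as a single letter: n, e, s, or u.
  def encoding
    if kcode                      # an explicit /n, /e, /s, or /u option
      kcode[0, 1]
    elsif $KCODE =~ /\A[nesu]/i   # $KCODE may read back as "U" or "UTF8"
      $KCODE[0, 1].downcase
    else                          # nothing usable is set
      "n"
    end
  end
end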

Using just the first letter of kcode() should get us back to a standard set of letters. If kcode() returns nil, we can fall back to $KCODE. However, do note that I make sure it's set to an expected value. You can set $KCODE to any junk value and Ruby will just silently ignore it (defaulting back to N), so it's good to reality-check the contents when you rely on it. Finally, we just return the default if neither appears to be set.
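
That reality check can also be pulled out into a small helper of its own. This is a sketch with a made-up name (normalize_kcode() is not part of Ruby), assuming that only the first letter of the setting matters and that anything unrecognized should fall back to "n":

```ruby
# Hypothetical helper: reduce a $KCODE-style setting to n, e, s, or u.
def normalize_kcode(setting)
  letter = setting.to_s[0, 1].downcase
  %w[n e s u].include?(letter) ? letter : "n"
end

p normalize_kcode("U")     # "u"
p normalize_kcode("UTF8")  # "u"
p normalize_kcode("junk")  # "n"
p normalize_kcode(nil)     # "n"
```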

That's really all there is to know about $KCODE, but Ruby 1.8 ships with a simple standard library called jcode that combines well with everything we've been discussing in these last two posts.

To use the jcode library, set $KCODE and then require the library. Setting $KCODE first is important, and you will receive a warning if you require jcode without setting $KCODE (as long as you took my advice and turned warnings on with -w):

$ ruby -r jcode -e 'p "Résumé".jsize'
8
$ ruby -w -r jcode -e 'p "Résumé".jsize'
Warning: $KCODE is NONE.
8

See, I told you -w was important.

As long as you do have $KCODE set properly, jcode adds a bunch of methods to String that work in characters. These methods are just simple wrappers over the techniques I showed you in my last post, so you get methods like jsize() which returns a count of characters instead of bytes:

$ ruby -KU -r jcode -e 'p "Résumé".jsize'
6
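
One way to picture such a wrapper is a hypothetical character_count() helper built on unpack("U*"), which decodes UTF-8 data into code points. This is not jcode's actual source, just a sketch that assumes valid UTF-8 input:

```ruby
# Hypothetical stand-in for jsize(): count characters, not bytes.
def character_count(str)
  str.unpack("U*").size
end

# "Résumé" built from Unicode code points and encoded as UTF-8
resume = [0x52, 0xE9, 0x73, 0x75, 0x6D, 0xE9].pack("U*")

p character_count(resume)   # 6
p resume.unpack("C*").size  # 8
```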

Probably the most useful method jcode adds is each_char():

$ ruby -KU -r jcode -e '"Résumé".each_char { |c| p c }'
"R"
"é"
"s"
"u"
"m"
"é"

See the documentation for the full method list: http://www.ruby-doc.org/stdlib-1.8.7/libdoc/jcode/rdoc/classes/String.html

Comments (8)
  1. Tim Morgan, November 6th, 2008

    This is the best post yet. I was afraid the only way to work with Unicode Strings properly in 1.8 was with Regexps. I'll be taking a peek at jcode.

    BTW, thanks for this series of posts. I'm not sure there is anything this comprehensive anywhere else. If there is, I haven't found it.

    1. James Edward Gray II, November 6th, 2008

      jcode is far from comprehensive, but it can save you a few trips to regular expression for some simple cases, yes. For real character savvy manipulations, see Ruby 1.9.

  2. Sam, June 12th, 2009

    A nit: I think you meant inspect() instead of inpect().

    1. James Edward Gray II, June 12th, 2009

      Good catch. Fixed.

  3. Александар, September 7th, 2009

    Hello,

    running your example, where the shebang line is:

    #!/usr/bin/env ruby -wKU
    

    my ruby 1.8.7 on Ubuntu barfs with:

    /usr/bin/env: ruby -wKU: No such file or directory
    

    Then I've tried the following:

    #!/usr/bin/env ruby
    $KCODE = 'u' 
    

    and

    #!/usr/bin/env ruby
    $KCODE = 'U' 
    

    Finally just settling on:

    #!/usr/local/bin/ruby -wKU
    
    1. James Edward Gray II, September 7th, 2009

      It's true that some versions of the env command do not properly support passing arguments to the referenced executable. When faced with such a platform, you have two choices. First, you can specify the actual path, as you decided on. Another option would be to set the flags manually, as you hinted at:

      #!/usr/bin/env ruby
      
      $VERBOSE = true # -w
      $KCODE   = "U"  # -KU
      
  4. Me, March 16th, 2011

    You say "See the documentation for the full method list." and you link to http://www.ruby-doc.org/stdlib/libdoc/jcode/rdoc/classes/String.html but this URL doesn't exist anymore. Could you please fix the link?

    1. James Edward Gray II, March 16th, 2011

      jcode has been removed in 1.9 in favor of m17n (multilingualization).

      That's why the link is gone. Here's a 1.8.7 specific link:

      http://www.ruby-doc.org/stdlib-1.8.7/libdoc/jcode/rdoc/classes/String.html

      I don't control the ruby-doc.org site, so you'll need to email that site's maintainer about any changes you would like to see there.
