5 NOV 2008
The $KCODE Variable and jcode Library
All of the Ruby files I create start with the same shebang line:
#!/usr/bin/env ruby -wKU
It's not really needed for every file since it generally only matters if the file is executed. However, I tend to go ahead and add it to all Ruby files I build for several reasons:
- You never know when a file may be executed (if __FILE__ == $PROGRAM_NAME; end sections are often added to libraries, for example)
- It makes it obvious the file is Ruby code
- It shows the rules this code expects: -w and -KU
The rules I mention here, specified by command-line switches, are the main point of interest. -w turns on Ruby's warnings, which are very handy. I recommend doing that whenever you can. But that doesn't have anything to do with character encodings. -KU does.
-KU sets a magic Ruby variable: $-K or $KCODE. You can do the same in your code if you aren't in a position to control the command-line arguments:
$KCODE = "U"
You probably recognize the U as a name for Ruby 1.8's UTF-8 encoding, from my earlier list of encodings. It can also be set to N (None, the default), E (EUC), or S (Shift JIS). Modern versions of Rails do set $KCODE = "U" for you.
So what does changing this magic variable do? First, it has the tiny effect of changing what Ruby escapes in inspect() output. Have a look:
$ ruby -e 'p "Résumé"'
"R\303\251sum\303\251"
$ ruby -KUe 'p "Résumé"'
"Résumé"
It's nice to be able to see your data as it actually is, assuming your terminal correctly handles UTF-8. However, that's really just a side effect of setting $KCODE.
The main purpose of $KCODE is that it changes the default encoding of all regular expressions that do not specify otherwise. Thus we can split up UTF-8 data by characters without adding a /u to the end of our expression:
$ ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]
$ ruby -KUe 'p "Résumé".scan(/./m)'
["R", "é", "s", "u", "m", "é"]
$ ruby -KUe 'p "Résumé".scan(/./mn)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]
Notice that the default encoding for that second example was switched to UTF-8. However, I can still override this with an explicit encoding, as I did in example three by adding the /n option for None.
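If you can't count on $KCODE being set, an explicit /u gives the same split on its own. Here's a minimal sketch of that idea. The helper name utf8_chars is made up for illustration and is not part of Ruby or jcode. The respond_to? guard is my assumption about how to keep the sketch runnable on Ruby 1.9 and later, where strings carry their own encodings. On Ruby 1.8 it does nothing.

```ruby
# Hypothetical helper (not part of Ruby or jcode): split UTF-8 data into
# characters with an explicit /u, so the result does not depend on $KCODE.
def utf8_chars(str)
  # Ruby 1.9+ tags strings with an encoding, so label the bytes as UTF-8 there.
  # Ruby 1.8 has no force_encoding(), and the guard skips this step.
  str = str.dup.force_encoding("UTF-8") if str.respond_to?(:force_encoding)
  str.scan(/./mu)
end

# "Résumé" written as escaped UTF-8 bytes, so the source file's own
# encoding doesn't matter.
p utf8_chars("R\303\251sum\303\251").size  # prints 6
```

Spelling out the /u costs a character per expression, but it keeps library code working even when the caller never touched $KCODE.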
Now, I tend to prefer $KCODE over $-K because the former seems more common in Ruby literature. In fact, Ruby 1.8 uses the term in another place, providing a method to get the encoding used in a Regexp:
$ ruby -e 'p /./.kcode'
nil
$ ruby -e 'p /./u.kcode'
"utf8"
Beware of that harmless-looking kcode() method though, as it hides quite a few gotchas. First, you can see that it has its own names for the options that don't really match up with what we've seen elsewhere. It also doesn't seem to be aware of the $KCODE variable, in an ironic twist of naming:
$ ruby -e '$KCODE = "U"; re = /./m; p "Résumé".scan(re); p re.kcode'
["R", "é", "s", "u", "m", "é"]
nil
As you can see, the encoding of the expression was clearly set correctly, but kcode() didn't report the change. If you really want to know the encoding of a Regexp in Ruby 1.8, I suggest using code like the following:
class Regexp
  def encoding
    if kcode
      kcode[0, 1]
    # $KCODE reads back as a full name ("NONE", "UTF8", "EUC", or "SJIS")
    elsif %w[n u e s].include? $KCODE[0, 1].downcase
      $KCODE[0, 1].downcase
    else
      "n"
    end
  end
end
Using just the first letter of kcode() should get us back to a standard set of letters. If kcode() isn't set, we can use $KCODE. However, do note that I make sure it's set to an expected value. You can set $KCODE to any junk value and Ruby will just silently ignore it (defaulting back to N), so it's good to reality check the contents when you rely on it. Finally, we just return the default if neither appears to be set.
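The reality check itself can be pulled out into a standalone helper, which makes it easy to try by hand. This is only a sketch under my own assumptions. The name kcode_letter is hypothetical, and it treats anything that doesn't start with n, u, e, or s as the default.

```ruby
# Hypothetical helper (not part of Ruby 1.8): reduce a kcode-style name,
# such as "utf8", "UTF8", or "U", to one of the letters n, u, e, or s.
def kcode_letter(name)
  letter = name.to_s[0, 1].downcase  # nil becomes "", which falls through
  %w[n u e s].include?(letter) ? letter : "n"
end

p kcode_letter("utf8")  # prints "u"
p kcode_letter("junk")  # prints "n"
p kcode_letter(nil)     # prints "n"
```

Because it only looks at the first letter, the same helper copes with both the lowercase names kcode() reports and the single letters used on the command line.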
That's really all there is to know about $KCODE, but Ruby 1.8 ships with a simple standard library called jcode that combines well with everything we've been discussing in these last two posts.
To use the jcode library, set $KCODE and then require the library. Setting $KCODE first is important, and you will receive a warning if you require jcode without setting $KCODE (as long as you took my advice and turned warnings on with -w):
$ ruby -r jcode -e 'p "Résumé".jsize'
8
$ ruby -w -r jcode -e 'p "Résumé".jsize'
Warning: $KCODE is NONE.
8
See, I told you -w was important.
As long as you do have $KCODE set properly, jcode adds a bunch of methods to String that work in characters. These methods are just simple wrappers over the techniques I showed you in my last post, so you get methods like jsize(), which returns a count of characters instead of bytes:
$ ruby -KU -r jcode -e 'p "Résumé".jsize'
6
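For comparison, here's one way to get the same count without jcode or $KCODE. It's a sketch rather than jcode's actual source. The helper name utf8_length is hypothetical, and it assumes the data is well-formed UTF-8, since unpack("U*") raises an ArgumentError on malformed input.

```ruby
# Hypothetical helper, not jcode's actual source: count UTF-8 characters by
# decoding the bytes into Unicode code points.
def utf8_length(str)
  str.unpack("U*").size
end

resume = "R\303\251sum\303\251"  # "Résumé" written as escaped UTF-8 bytes
p resume.unpack("C*").size       # prints 8, the byte count
p utf8_length(resume)            # prints 6, the character count
```

Note that this counts code points, so a letter followed by a separate combining accent would count as two.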
Probably the most useful method jcode adds is each_char():
$ ruby -KU -r jcode -e '"Résumé".each_char { |c| p c }'
"R"
"é"
"s"
"u"
"m"
"é"
See the documentation for the full method list.
Comments (8)
-
Tim Morgan November 6th, 2008
This is the best post yet. I was afraid the only way to work with Unicode Strings properly in 1.8 was with Regexps. I'll be taking a peek at jcode.
BTW, thanks for this series of posts. I'm not sure there is anything this comprehensive anywhere else. If there is, I haven't found it.
-
jcode is far from comprehensive, but it can save you a few trips to regular expressions for some simple cases, yes. For real character-savvy manipulations, see Ruby 1.9.
-
-
A nit: I think you meant inspect() instead of inpect().
-
Good catch. Fixed.
-
-
Hello,
running your example, where the shebang line is:
#!/usr/bin/env ruby -wKU
my ruby 1.8.7 on Ubuntu barfs with:
/usr/bin/env: ruby -wKU: No such file or directory
Then I've tried the following:
#!/usr/bin/env ruby
$KCODE = 'u'
and
#!/usr/bin/env ruby
$KCODE = 'U'
Finally just settling on:
#!/usr/local/bin/ruby -wKU
-
It's true that some versions of the env command do not properly support passing arguments to the referenced executable. When faced with such a platform, you have two choices. First, you can specify the actual path, as you decided on. Another option would be to set the flags manually, as you hinted at:
#!/usr/bin/env ruby
$VERBOSE = true # -w
$KCODE = "U" # -KU
-
-
You say "See the documentation for the full method list." and you link to http://www.ruby-doc.org/stdlib/libdoc/jcode/rdoc/classes/String.html but this URL doesn't exist anymore. Could you please fix the link?
-
jcode has been removed in 1.9 in favor of m17n (multilingualization). That's why the link is gone. Here's a 1.8.7 specific link:
http://www.ruby-doc.org/stdlib-1.8.7/libdoc/jcode/rdoc/classes/String.html
I don't control the ruby-doc.org site, so you'll need to email that site's maintainer about any changes you would like to see there.
-