15 APR 2009
Miscellaneous M17n Details
We've now discussed the core of Ruby 1.9's m17n (multilingualization) engine. String and IO are where you will see the big changes. The new m17n system is a big beast though, with a lot of little details. Let's talk a little about some side topics that also relate to how we work with character encodings in Ruby 1.9.
More Features of the Encoding Class
You've seen me using Encoding objects all over the place in my explanations of m17n, but we haven't talked much about them. They are very simple, mainly just being a named representation of each encoding inside Ruby. As such, Encoding is a storage place for some tools you may find handy when working with them.
First, you can receive a list() of all Encoding objects Ruby has loaded in the form of an Array:
$ ruby -e 'puts Encoding.list.first(3), "..."'
ASCII-8BIT
UTF-8
US-ASCII
...
If you're just interested in a specific Encoding, you can find() it by name:
$ ruby -e 'p Encoding.find("UTF-8")'
#<Encoding:UTF-8>
$ ruby -e 'p Encoding.find("No-Such-Encoding")'
-e:1:in `find': unknown encoding name - No-Such-Encoding (ArgumentError)
from -e:1:in `<main>'
As you can see, Ruby raises an ArgumentError if it doesn't know about a given Encoding.
Some Encoding objects also have more than one name. These aliases() can be used interchangeably to refer to the same Encoding. For example, ASCII is an alias for US-ASCII:
$ ruby -e 'puts Encoding.aliases["ASCII"]'
US-ASCII
$ ruby -e 'p Encoding.find("ASCII") == Encoding.find("US-ASCII")'
true
The aliases() method returns a Hash keyed with the alternate names Ruby knows about. The values are the actual Encoding name each alias refers to. You can use either a name or an alias when referring to an Encoding by name, like with calls to Encoding::find() or IO::open().
Finally, there's one more gotcha you should be aware of if you're going to write some code that supports a large set of Ruby's Encodings. Ruby ships with a few dummy?() Encodings that don't have character handling completely implemented. These are used for stateful Encodings. You will want to filter them out of the Encodings you try to support to avoid running into problems:
$ ruby -e 'puts "Dummy Encodings:", Encoding.list.select(&:dummy?).map(&:name)'
Dummy Encodings:
ISO-2022-JP
ISO-2022-JP-2
UTF-7
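Filtering them out is a one-liner with reject(). A quick sketch (the exact dummy list varies a bit between Ruby versions):

```ruby
# drop the stateful placeholder Encodings that lack full character handling
usable = Encoding.list.reject(&:dummy?)

p usable.none?(&:dummy?)         # => true
p Encoding.find("UTF-7").dummy?  # => true, so UTF-7 won't be in usable
```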
String Escapes
In Ruby 1.8 you would sometimes see byte escapes used to insert raw bytes into a String. For example, you can choose to build the String "…" with the following byte escapes:
$ ruby -v -KU -e 'p "\xe2\x80\xa6"'
ruby 1.8.6 (2009-03-31 patchlevel 368) [i686-darwin9.6.0]
"…"
The same tactic still works on Ruby 1.9, but remember that Encodings are still going to play into this, as we've been discussing:
$ cat utf8_escapes.rb
# encoding: UTF-8
str = "\xe2\x80\xa6"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v utf8_escapes.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…", true]
$ cat invalid_escapes.rb
# encoding: UTF-8
str = "\xe2\x80"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v invalid_escapes.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "\xE2\x80", false]
Notice that I got the requested bytes in both cases. However, those Strings were assigned the source Encoding as normal. In the first case, that built a valid UTF-8 String. However, the second case is invalid and may later cause me fits as I try to use the String.
There are a couple of exceptions though, where a String escape can actually change the Encoding of the literal. First, you'll likely remember that using a multibyte character is not allowed if you don't change the source Encoding:
$ cat bad_code.rb
"abc…"
$ ruby -v bad_code.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
bad_code.rb:1: invalid multibyte char (US-ASCII)
bad_code.rb:1: invalid multibyte char (US-ASCII)
However, a special case is made for \x## escapes:
$ cat ascii_escapes.rb
puts "Source Encoding: #{__ENCODING__}"
str = "abc\xe2\x80\xa6"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v ascii_escapes.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
Source Encoding: US-ASCII
[#<Encoding:ASCII-8BIT>, "abc\xE2\x80\xA6", true]
Notice that the Encoding of the String was upgraded to ASCII-8BIT to accommodate the bytes. We'll talk a lot more about that special Encoding later in this post, but for now just make note of the fact that this exception gives you an easy way to work with binary data.
Octal escapes (\###), control escapes (\cx or \C-x), meta escapes (\M-x), and meta-control escapes (\M-\C-x) all follow the same rules as the hex escapes (\x##) we've just been discussing.
The other exception is the \u#### escape that can be used to enter Unicode characters by codepoint. When you use this escape, the String gets a UTF-8 Encoding regardless of the current source Encoding:
$ cat ascii_u_escape.rb
str = "\u2026"
p [str.encoding, str]
$ ruby -v ascii_u_escape.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…"]
$ cat sjis_u_escape.rb
# encoding: Shift_JIS
str = "\u2026"
p [str.encoding, str]
$ ruby -v sjis_u_escape.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…"]
$ cat utf8_u_escape.rb
# encoding: UTF-8
str = "\u2026"
p [str.encoding, str]
$ ruby -v utf8_u_escape.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…"]
Notice how the String received a UTF-8 Encoding in all three cases, regardless of the current source Encoding. This exception gives you an easy way to work with UTF-8 data, no matter what your native Encoding is.
The Unicode escape can be followed by exactly four hex digits as I've shown above, or you can use an alternate form, \u{#…}, where you place between one and six hex digits between the braces. Both forms have the same effect on the String's Encoding.
Working with Binary Data
Not all data is textual data. Ruby's String class can also be used to hold raw byte sequences. For example, you may want to work with the raw bytes of a PNG image.
Ruby 1.9 has an Encoding for this which basically just means "treat my data as raw bytes." You can think of this Encoding as a way to shut off character handling and just work with bytes:
$ cat raw_bytes.rb
# encoding: UTF-8
str = "Résumé"
def str.inspect
  { data:     dup,
    encoding: encoding.name,
    chars:    size,
    bytes:    bytesize }.inspect
end
p str
str.force_encoding("BINARY")
p str
$ ruby raw_bytes.rb
{:data=>"Résumé", :encoding=>"UTF-8", :chars=>6, :bytes=>8}
{:data=>"R\xC3\xA9sum\xC3\xA9", :encoding=>"ASCII-8BIT", :chars=>8, :bytes=>8}
See how switching the Encoding (without changing the data) shut off Ruby's concept of characters? The character count became the same as the byte count, and Ruby started giving a more raw version of the inspect() String to show those are just bytes.
If you expected this Encoding to be called BINARY, you are half right. As you can see, I could use that name above because it is a valid alias. Ruby switched to the real name in the inspect() message though. Ruby actually refers to the Encoding as ASCII-8BIT, which leads us to another twist.
Obviously, there's not really such a thing as "ASCII-8BIT" outside of Ruby. Even while working with binary data though, it's not uncommon to want to make a check for some simple ASCII pieces. For example, the first few signature bytes of a PNG image do contain the simple ASCII String "PNG":
$ cat png_sig.rb
sig = "\x89PNG\r\n\C-z\n"
png = /\A.PNG/
p({sig => sig.encoding.name, png => png.encoding.name})
if sig =~ png
  puts "This data looks like a PNG image."
end
$ ruby png_sig.rb
{"\x89PNG\r\n\x1A\n"=>"ASCII-8BIT", /\A.PNG/=>"US-ASCII"}
This data looks like a PNG image.
Ruby makes this possible by making ASCII-8BIT compatible?() with US-ASCII. That allows tricks like the above, where I validated the PNG signature with a simple US-ASCII Regexp. Thus, ASCII-8BIT means ASCII plus some other bytes, and you can choose to treat parts of it as ASCII when that helps you work with the data.
It's worth noting that Ruby will now fall back to an ASCII-8BIT Encoding anytime you read() by bytes:
$ cat binary_fallback.rb
open("ascii.txt", "w+:UTF-8") do |f|
  f.puts "abc"
  f.rewind
  str = f.read(2)
  p [str.encoding.name, str]
end
$ ruby binary_fallback.rb
["ASCII-8BIT", "ab"]
That makes sense, because you could chop up characters when reading by bytes. If you really need to read() some bytes but keep your Encoding, you will need to set and validate it manually. Here's one way you might do something like that:
$ cat read_to_char.rb
# encoding: UTF-8
open("ascii.txt", "w+:UTF-8") do |f|
  f.puts "Résumé"
  f.rewind
  str = f.read(2)
  until str.dup.force_encoding(f.external_encoding).valid_encoding?
    str << f.read(1)
  end
  str.force_encoding(f.external_encoding)
  p [str.encoding.name, str]
end
$ ruby read_to_char.rb
["UTF-8", "Ré"]
In that example, I just read() the fixed bytes I wanted and then pushed forward byte by byte until my data was valid in the desired Encoding. I had to test a dup() of the data and only force_encoding() when I was sure I was done reading, because UTF-8 and ASCII-8BIT are not compatible?() and Ruby would have raised an Encoding::CompatibilityError as I was adding on bytes.
Working with binary data also requires you to know one more thing about Ruby's IO objects. Ruby has a feature where it translates some data you read on Windows. The translation is super simple: "\r\n" sequences read from an IO object are simplified to a solo "\n". This feature is to help make Unix scripts work well on a platform that has different line endings. It does create a gotcha though: when you're going to read any non-text data, be it binary data or just a non-ASCII-compatible Encoding like UTF-16, you need to warn Ruby not to do the translation for your code to be properly cross-platform.
By the way, this isn't new. This was even true in the Ruby 1.8 era.
Telling Ruby to treat the data as binary and not perform any translation (again, only active on Windows) is simple. You just add a "b" for binary to your mode String in a call to open(). Thus you would read with something like:
open(path, "rb") do |f|
  # ...
end
or write with code like:
open(path, "wb") do |f|
  # ...
end
If you always knew about this quirk and you did a good job of always doing this, give yourself a big pat on the back because you're all set. If you didn't, you've got a bad habit you'll need to break. Don't feel too bad about it though. I've known about this quirk since my Perl days (Perl does the same thing) and I've always tried to follow it. However, about ten different bugs were recently filed against one of my libraries that amounted to me missing this "b" in several places. It's easy to forget.
Ruby 1.9 is much more strict about the binary flag. It's going to complain if you don't add it when it feels it is needed. For example:
$ cat missing_b.rb
# Ruby 1.9 will let this slide
open("utf_16.txt", "w:UTF-16LE") do |f|
  f.puts "Some data."
end
# but not this
open("utf_16.txt", "r:UTF-16LE") do |f|
  # ...
end
$ ruby missing_b.rb
missing_b.rb:6:in `initialize': ASCII incompatible encoding needs binmode (ArgumentError)
	from missing_b.rb:6:in `open'
	from missing_b.rb:6:in `<main>'
Of course, this is trivial to fix. You just have to add the missing "b":
$ cat with_b.rb
open("utf_16.txt", "wb:UTF-16LE") do |f|
  f.puts "Some data."
end
open("utf_16.txt", "rb:UTF-16LE") do |f|
  puts f.external_encoding.name
end
$ ruby with_b.rb
UTF-16LE
I printed the external_encoding() there to show that it's exactly what I specified. However, as a reward for adding in these "b"s we've been bad about leaving out in the past, Ruby will now assume you want ASCII-8BIT when you supply the "b" and no external_encoding():
$ cat b_means_binary.rb
open("utf_16.txt", "r") do |f|
  puts "Inherited from environment: #{f.external_encoding.name}"
end
open("utf_16.txt", "rb") do |f|
  puts %Q{Using "rb": #{f.external_encoding.name}}
end
$ ruby b_means_binary.rb
Inherited from environment: UTF-8
Using "rb": ASCII-8BIT
It's worth noting that Ruby 1.8 accidentally helped train us to leave out the magic "b". For example, you could use IO::read() to slurp some data, but that method didn't provide a way to indicate that the data was binary. In truth, you really needed this monster for a safe cross-platform read of binary data: open(path, "rb") { |f| f.read }. It's no surprise that IO::read() was more common. IO::readlines() and IO::foreach() had the same issue. The core team has acknowledged these problems with some new additions. First, you can now pass a Hash as the final argument to all the methods that open an IO and use that to set options like :mode or, separately, :external_encoding, :internal_encoding, and :binmode (the name for the magic "b"). Here are some examples:
File.read("utf_16.txt", mode: "rb:UTF-16LE")
File.readlines("utf_16.txt", mode: "rb:UTF-16LE")
File.foreach("utf_16.txt", mode: "rb:UTF-16LE") do |line|
end
File.open("utf_16.txt", mode: "rb:UTF-16LE") do |f|
end
open("utf_16.txt", mode: "rb:UTF-16LE") do |f|
end
As one last shortcut along these lines, the new IO::binread() method is the same as IO.read(…, mode: "rb:ASCII-8BIT").
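Here's a sketch of binread() in action (I'm using a Tempfile so the example cleans up after itself; the file contents are just an arbitrary PNG-like prefix):

```ruby
require "tempfile"

Tempfile.create("binread_demo") do |f|
  f.binmode                 # write raw bytes, no newline translation
  f.write("\x89PNG\r\n")
  f.flush

  data = IO.binread(f.path, 4)  # grab just the first four raw bytes
  p [data.encoding.name, data]  # => ["ASCII-8BIT", "\x89PNG"]
end
```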
Regex Encodings
Now that all our data has an Encoding, it only makes sense that our Regexp objects would need to be tagged as well. That is the case, but the rules for how an Encoding is selected differ for Regexp. Let's talk a little about how and why.
First, let's get the big surprise out of the way:
$ cat re_encoding.rb
# encoding: UTF-8
utf8_str   = "résumé"
latin1_str = utf8_str.encode("ISO-8859-1")
binary_str = utf8_str.dup.force_encoding("ASCII-8BIT")
utf16_str  = utf8_str.encode("UTF-16BE")
re = /\Ar.sum.\z/
puts "Regexp.encoding.name: #{re.encoding.name}"
[utf8_str, latin1_str, binary_str, utf16_str].each do |str|
  begin
    result = str =~ re ? "Matches" : "Doesn't match"
  rescue Encoding::CompatibilityError
    result = "Can't match non-ASCII compatible?() Encoding"
  end
  puts "#{result}: #{str.encoding.name}"
end
$ ruby re_encoding.rb
Regexp.encoding.name: US-ASCII
Matches: UTF-8
Matches: ISO-8859-1
Doesn't match: ASCII-8BIT
Can't match non-ASCII compatible?() Encoding: UTF-16BE
After we did all that talking about the source Encoding, Ruby goes and ignores it on us. You can see that the Regexp was set to US-ASCII instead of the UTF-8 that was in effect at the time. Surprising though that may be, there is actually a pretty good reason for it.
My Regexp literal only contained seven-bit ASCII, so Ruby chose to simplify the Encoding. If it had left it at the source Encoding of UTF-8, it would only be useful for checking UTF-8 data. As it is though, it can now be used to check any ASCII compatible?() data. You can see in the output that the expression was tried against three different Strings, because they are all ASCII compatible?(). (It did fail to match one, since I changed the rules of how to interpret the data and one character became two bytes, but the attempt was still made.) The fourth match could not be attempted, because UTF-16 is not ASCII compatible?().
Of course, if your Regexp includes eight-bit characters, if you use the special escapes that change an Encoding, or if you apply one of the old Ruby 1.8 style Encoding options, you can get a non-ASCII Encoding:
$ cat encodings.rb
# encoding: UTF-8
res = [
  /…\z/,       # source Encoding
  /\A\uFEFF/,  # special escape
  /abc/u       # Ruby 1.8 option
]
puts res.map { |re| [re.encoding.name, re.inspect].join(" ") }
$ ruby encodings.rb
UTF-8 /…\z/
UTF-8 /\A\uFEFF/
UTF-8 /abc/
I used /u, which you will probably remember as the way to get a UTF-8 Regexp in the old Ruby 1.8 system. The /e (for EUC-JP) and /s (for a Shift_JIS extension called Windows-31J) options still work too. Ruby 1.9 also still supports the old /n option, but for legacy reasons it comes with some warning- and exception-tossing behavior, so I recommend just avoiding it going forward. You can build an ASCII-8BIT Regexp in another way I'll show in just a moment.
As of Ruby 1.9.2, this concept of a lenient Regexp, one that will match any ASCII compatible?() Encoding, has a new name:
$ cat fixed_encoding.rb
[/a/, /a/u].each do |re|
  puts "%-10s %s" % [re.encoding, re.fixed_encoding? ? "fixed" : "not fixed"]
end
$ ruby fixed_encoding.rb
US-ASCII not fixed
UTF-8 fixed
A fixed_encoding?() Regexp is one that will raise an Encoding::CompatibilityError if matched against any String that has a different Encoding from the Regexp itself, as long as the String isn't ascii_only?(). If fixed_encoding?() returns false, the Regexp can be used against any ASCII compatible?() Encoding. There's also a new constant with this name that can be used to disable the ASCII downgrading:
$ cat force_re_encoding.rb
puts Regexp.new("abc".force_encoding("UTF-8")).encoding.name
puts Regexp.new("abc".force_encoding("UTF-8"),
                Regexp::FIXEDENCODING).encoding.name
$ ruby force_re_encoding.rb
US-ASCII
UTF-8
Note how a Regexp will take the Encoding of the String passed to Regexp::new() when Regexp::FIXEDENCODING is set. You can use this combination to build a Regexp in any Encoding you need, including the ASCII-8BIT I mentioned earlier.
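For example, here's one way (a sketch) to build an ASCII-8BIT Regexp, suitable for matching raw bytes like that PNG signature from before:

```ruby
# a binary source String plus FIXEDENCODING gives a binary Regexp
sig_re = Regexp.new("\x89PNG".force_encoding("ASCII-8BIT"),
                    Regexp::FIXEDENCODING)
p [sig_re.encoding.name, sig_re.fixed_encoding?]  # => ["ASCII-8BIT", true]

data = "\x89PNG\r\n\x1A\n".force_encoding("ASCII-8BIT")
p data =~ sig_re  # matches at the start of the data => 0
```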
Once your Regexp is at least compatible with your data's Encoding, pattern matches function as they always have. (Well, in truth, Ruby 1.9 brings us a powerful new regular expression engine called Oniguruma, but that's another topic for another time.) Under average circumstances, Ruby 1.9's Regexp Encoding selection rules mean that expressions are compatible with a lot of data and everything should just work for you. However, if you end up getting some errors at match time, you may need to abandon the simple /…/ literal and use the new features I've shown to build a Regexp that perfectly matches your data's Encoding.
Handling a BOM
Some multibyte Encodings recommend that data in that Encoding begin with a Byte Order Mark (also known as a BOM) indicating the order of the bytes. UTF-16 is a good example.
Note that Ruby doesn't even support a plain UTF-16 Encoding. Instead, you must pick between UTF-16BE and UTF-16LE, for "Big Endian" or "Little Endian" byte order. This indicates whether the most significant byte comes first or last:
$ ruby -e 'p "a".encode("UTF-16BE")'
"\x00a"
$ ruby -e 'p "a".encode("UTF-16LE")'
"a\x00"
Now, when someone goes to read your UTF-16 data back, they'll need to know which byte order you used to get things right. You could just tell them which order was used the same way you'll probably tell them that the data is UTF-16 encoded. Or you could add a BOM to the data.
A Unicode BOM is just the character U+FEFF at the beginning of your data. There's no such character for the reversed bytes U+FFFE, so this makes it easy to correctly tell the order of the bytes. Another minor advantage is that this BOM probably indicates you are reading Unicode data. A lot of software will check for this special start of the data, use it to set the proper byte order, and then pretend it didn't even exist by removing it from the data they show users.
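You can watch that mechanism at the byte level: encoding U+FEFF into each byte order produces mirrored byte pairs (UTF-8 has a BOM form too, though as I argue below you shouldn't use it):

```ruby
p "\uFEFF".encode("UTF-16BE").bytes  # => [254, 255]
p "\uFEFF".encode("UTF-16LE").bytes  # => [255, 254]
p "\uFEFF".bytes                     # the UTF-8 form => [239, 187, 191]
```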
Ruby 1.9 won't automatically add a BOM to your data, so you're going to need to take care of that if you want one. Luckily, it's not too tough. The basic idea is just to print the bytes needed at the beginning of a file. For example, we can add a BOM to a UTF-16LE file like this:
$ cat utf16_bom.rb
# encoding: UTF-8
File.open("utf16_bom.txt", "w:UTF-16LE") do |f|
  f.puts "\uFEFFThis is UTF-16LE with a BOM."
end
$ ruby utf16_bom.rb
$ ruby -e 'p File.binread("utf16_bom.txt")[0..9]'
"\xFF\xFET\x00h\x00i\x00s\x00"
Notice that I just used the Unicode escape to add the BOM character to the data. Because my output String was in UTF-8, Ruby had to transcode it to UTF-16LE, and that process arranged the bytes correctly for me, as you can see in the sample output.
Reading a BOM is a similar process. We will need to pull the relevant bytes and see if they match a Unicode BOM. When they do, we can then start reading again with the Encoding we matched. We might code that up like this:
$ cat read_bom.rb
class File
  UTFS = [32, 16].map { |b| %w[BE LE].map { |o| "UTF-#{b}#{o}" } }.
                  flatten << "UTF-8"

  def self.open_using_unicode_bom(path, *args, &blk)
    # check the BOM to find the Encoding
    encoding = UTFS[0..-2].find(lambda { UTFS[-1] }) do |utf|
      bom = "\uFEFF".encode(utf)
      binread(path, bom.bytesize).force_encoding(utf) == bom
    end
    # set the Encoding
    if args.first.nil?
      args << "r#{'b' unless encoding == UTFS[-1]}:#{encoding}"
    elsif args.first.is_a? Hash
      args.first.merge!(external_encoding: encoding)
    else
      args.first.sub!(/\A([^:]*)/, "\\1:#{encoding}")
    end
    # hand off to open()
    if blk
      open(path, *args) do |f|
        f.read_unicode_bom
        blk[f]
      end
    else
      f = open(path, *args)
      f.read_unicode_bom
      f
    end
  end

  def read_unicode_bom
    bytes = external_encoding.name[/\AUTF-?(\d+)/i, 1].to_i / 8
    read(bytes) if bytes > 1
  end
end

# example usage with the File we created earlier
File.open_using_unicode_bom("utf16_bom.txt") do |f|
  line = f.gets
  p [line.encoding, line[0..3]]
end
$ ruby read_bom.rb
[#<Encoding:UTF-16LE>, "T\x00h\x00i\x00s\x00"]
These examples just deal with Unicode BOMs, but you would handle other BOMs in a similar fashion. Find out what bytes are needed for your Encoding, write those out before the data, and later check for them when reading the data back. The String escapes we discussed earlier can be handy when writing the bytes, and binread() is equally handy when checking for the BOM.
I do recommend including a BOM in Unicode Encodings like UTF-16 and UTF-32, but please don't add one to UTF-8 data. The UTF-8 byte order is part of its specification and never varies, so you don't need a BOM to read it correctly. If you add one, you damage one of UTF-8's great advantages: it can pass for US-ASCII (assuming it's all seven-bit characters).
Comments (4)
Axel Niedenhoff · May 11th, 2009

The b option for opening files is even present in C. Maybe that's where it originated, and all platforms building upon the standard C library have inherited it from there.
Another method got a neat upgrade in Ruby 1.9: Integer#chr(). You can use this method in Ruby 1.8 to convert simple byte values into single-character Strings. However, the method is limited to single byte values. This example shows both how it works and the limit:

$ cat chr.rb
p 97.chr
p 256.chr
$ ruby -v chr.rb
ruby 1.8.6 (2009-03-31 patchlevel 368) [i686-darwin9.6.0]
"a"
chr.rb:2:in `chr': 256 out of char range (RangeError)
	from chr.rb:2

That much is unchanged in Ruby 1.9:

$ ruby_dev -v chr.rb
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-darwin9.6.0]
"a"
chr.rb:2:in `chr': 256 out of char range (RangeError)
	from chr.rb:2:in `<main>'

However, Ruby 1.9 adds a new twist. The method now takes an optional Encoding argument, or the String name of an Encoding. If you provide an Encoding, the method will convert a codepoint (which you can get with ord() or codepoints()) into a String:

$ cat codepoint_chr.rb
# encoding: UTF-8
p "é".ord
p "é".codepoints.first
p 233.chr("UTF-8")
$ ruby_dev -v codepoint_chr.rb
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-darwin9.6.0]
233
233
"é"

That turns out to be a pretty easy way to spot check some codepoint mappings in IRb.
I've read all your articles about Encoding in detail, but I can't find a solution to the following problem. My app receives a string from another app such as "%e47%e14%e1a". When I unescape the HTML I end up with something like "\xE47\xE14\xE1a", which corresponds to the codepoints 3655, 3604, and 3610 from the Thai alphabet. Using Ruby 1.9.2, "\xE47\xE14\xE1a".valid_encoding? returns false, because these aren't byte sequences but rather Unicode codepoints. How can I use Ruby here to convert these codepoints to the proper string, ็ดบ?
I'm not sure those URL escapes are valid, which is why I believe you are having trouble. I think the escapes are supposed to be handled by bytes, not codepoints.
That said, this code seems to expand it correctly:
>> "%e47%e14%e1a".scan(/%\w+/).map { |cp| Integer("0x#{cp[1..-1]}") }.pack("U*") => "็ดบ"
Hope that helps.