15 APR 2009
Miscellaneous M17n Details
We've now discussed the core of Ruby 1.9's m17n (multilingualization) engine. String and IO are where you will see the big changes. The new m17n system is a big beast though, with a lot of little details. Let's talk a little about some side topics that also relate to how we work with character encodings in Ruby 1.9.
More Features of the Encoding Class
You've seen me using Encoding objects all over the place in my explanations of m17n, but we haven't talked much about them. They are very simple, mainly just being a named representation of each encoding inside Ruby. As such, Encoding is a storage place for some tools you may find handy when working with them.
First, you can receive a list() of all Encoding objects Ruby has loaded in the form of an Array:
$ ruby -e 'puts Encoding.list.first(3), "..."'
ASCII-8BIT
UTF-8
US-ASCII
...
If you're just interested in a specific Encoding, you can find() it by name:
$ ruby -e 'p Encoding.find("UTF-8")'
#<Encoding:UTF-8>
$ ruby -e 'p Encoding.find("No-Such-Encoding")'
-e:1:in `find': unknown encoding name - No-Such-Encoding (ArgumentError)
from -e:1:in `<main>'
As you can see, Ruby raises an ArgumentError if it doesn't know about a given Encoding.
Some Encoding objects also have more than one name. These aliases() can be used interchangeably to refer to the same Encoding. For example, ASCII is an alias for US-ASCII:
$ ruby -e 'puts Encoding.aliases["ASCII"]'
US-ASCII
$ ruby -e 'p Encoding.find("ASCII") == Encoding.find("US-ASCII")'
true
The aliases() method returns a Hash keyed with the alternate names Ruby knows about. The values are the actual Encoding name each alias refers to. You can use either a name or an alias when referring to an Encoding by name, like with calls to Encoding::find() or IO::open().
Finally, there's one more gotcha you should be aware of if you're going to write some code that supports a large set of Ruby's Encodings. Ruby ships with a few dummy?() Encodings that don't have character handling completely implemented. These are used for stateful Encodings. You will want to filter them out of the Encodings you try to support to avoid running into problems:
$ ruby -e 'puts "Dummy Encodings:", Encoding.list.select(&:dummy?).map(&:name)'
Dummy Encodings:
ISO-2022-JP
ISO-2022-JP-2
UTF-7
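Filtering them out is a one-liner with reject(). A quick sketch (the exact dummy list varies a bit between Ruby versions):

```ruby
# drop the stateful placeholder Encodings that lack full character handling
usable = Encoding.list.reject(&:dummy?)

p usable.none?(&:dummy?)         # => true
p Encoding.find("UTF-7").dummy?  # => true, so UTF-7 won't be in usable
```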
String Escapes
In Ruby 1.8 you would sometimes see byte escapes used to insert raw bytes into a String. For example, you can choose to build the String "…" with the following byte escapes:
$ ruby -v -KU -e 'p "\xe2\x80\xa6"'
ruby 1.8.6 (2009-03-31 patchlevel 368) [i686-darwin9.6.0]
"…"
The same tactic still works on Ruby 1.9, but remember that Encodings are still going to play into this, as we've been discussing:
$ cat utf8_escapes.rb
# encoding: UTF-8
str = "\xe2\x80\xa6"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v utf8_escapes.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…", true]
$ cat invalid_escapes.rb
# encoding: UTF-8
str = "\xe2\x80"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v invalid_escapes.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "\xE2\x80", false]
Notice that I got the requested bytes in both cases. However, those Strings were assigned the source Encoding as normal. In the first case, that built a valid UTF-8 String. However, the second case is invalid and may later cause me fits as I try to use the String.
There are a couple of exceptions though, where a String escape can actually change the Encoding of the literal. First, you'll likely remember that using a multibyte character is not allowed if you don't change the source Encoding:
$ cat bad_code.rb
"abc…"
$ ruby -v bad_code.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
bad_code.rb:1: invalid multibyte char (US-ASCII)
bad_code.rb:1: invalid multibyte char (US-ASCII)
However, a special case is made for \x## escapes:
$ cat ascii_escapes.rb
puts "Source Encoding: #{__ENCODING__}"
str = "abc\xe2\x80\xa6"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v ascii_escapes.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
Source Encoding: US-ASCII
[#<Encoding:ASCII-8BIT>, "abc\xE2\x80\xA6", true]
Notice that the Encoding of the String was upgraded to ASCII-8BIT to accommodate the bytes. We'll talk a lot more about that special Encoding later in this post, but for now just make note of the fact that this exception gives you an easy way to work with binary data.
Octal escapes (\###), control escapes (\cx or \C-x), meta escapes (\M-x), and meta-control escapes (\M-\C-x) all follow the same rules as the hex escapes (\x##) we've just been discussing.
The other exception is the \u#### escape that can be used to enter Unicode characters by codepoint. When you use this escape, the String gets a UTF-8 Encoding regardless of the current source Encoding:
$ cat ascii_u_escape.rb
str = "\u2026"
p [str.encoding, str]
$ ruby -v ascii_u_escape.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…"]
$ cat sjis_u_escape.rb
# encoding: Shift_JIS
str = "\u2026"
p [str.encoding, str]
$ ruby -v sjis_u_escape.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…"]
$ cat utf8_u_escape.rb
# encoding: UTF-8
str = "\u2026"
p [str.encoding, str]
$ ruby -v utf8_u_escape.rb
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#<Encoding:UTF-8>, "…"]
Notice how the String received a UTF-8 Encoding in all three cases, regardless of the current source Encoding. This exception gives you an easy way to work with UTF-8 data, no matter what your native Encoding is.
The Unicode escape can be followed by exactly four hex digits as I've shown above, or you can use an alternate form, \u{#…}, where you place between one and six hex digits between the braces. Both forms have the same effect on the String's Encoding.
Working with Binary Data
Not all data is textual data. Ruby's String class can also be used to hold raw byte sequences. For example, you may want to work with the raw bytes of a PNG image.
Ruby 1.9 has an Encoding for this which basically just means "treat my data as raw bytes." You can think of this Encoding as a way to shut off character handling and just work with bytes:
$ cat raw_bytes.rb
# encoding: UTF-8
str = "Résumé"
def str.inspect
  { data:     dup,
    encoding: encoding.name,
    chars:    size,
    bytes:    bytesize }.inspect
end
p str
str.force_encoding("BINARY")
p str
$ ruby raw_bytes.rb
{:data=>"Résumé", :encoding=>"UTF-8", :chars=>6, :bytes=>8}
{:data=>"R\xC3\xA9sum\xC3\xA9", :encoding=>"ASCII-8BIT", :chars=>8, :bytes=>8}
See how switching the Encoding (without changing the data) shut off Ruby's concept of characters? The character count became the same as the byte count, and Ruby started giving a more raw version of the inspect() String to show those are just bytes.
If you expected this Encoding to be called BINARY, you are half right. As you can see, I could use that name above because it is a valid alias. Ruby switched to the real name in the inspect() message though. Ruby actually refers to the Encoding as ASCII-8BIT, which leads us to another twist.
Obviously, there's not really such a thing as "ASCII-8BIT" outside of Ruby. Even while working with binary data though, it's not uncommon to want to make a check for some simple ASCII pieces. For example, the first few signature bytes of a PNG image do contain the simple ASCII String "PNG":
$ cat png_sig.rb
sig = "\x89PNG\r\n\C-z\n"
png = /\A.PNG/
p({sig => sig.encoding.name, png => png.encoding.name})
if sig =~ png
  puts "This data looks like a PNG image."
end
$ ruby png_sig.rb
{"\x89PNG\r\n\x1A\n"=>"ASCII-8BIT", /\A.PNG/=>"US-ASCII"}
This data looks like a PNG image.
Ruby makes this possible by making ASCII-8BIT compatible?() with US-ASCII. That allows tricks like the above, where I validated the PNG signature with a simple US-ASCII Regexp. Thus, ASCII-8BIT means ASCII plus some other bytes, and you can choose to treat parts of it as ASCII when that helps you work with the data.
It's worth noting that Ruby will now fall back to an ASCII-8BIT Encoding anytime you read() by bytes:
$ cat binary_fallback.rb
open("ascii.txt", "w+:UTF-8") do |f|
  f.puts "abc"
  f.rewind
  str = f.read(2)
  p [str.encoding.name, str]
end
$ ruby binary_fallback.rb
["ASCII-8BIT", "ab"]
That makes sense, because you could chop up characters when reading by bytes. If you really need to read() some bytes but keep your Encoding, you will need to set and validate it manually. Here's one way you might do something like that:
$ cat read_to_char.rb
# encoding: UTF-8
open("ascii.txt", "w+:UTF-8") do |f|
  f.puts "Résumé"
  f.rewind
  str = f.read(2)
  until str.dup.force_encoding(f.external_encoding).valid_encoding?
    str << f.read(1)
  end
  str.force_encoding(f.external_encoding)
  p [str.encoding.name, str]
end
$ ruby read_to_char.rb
["UTF-8", "Ré"]
In that example, I just read() the fixed bytes I wanted and then pushed forward byte by byte until my data was valid in the desired Encoding. I had to test a dup() of the data and only force_encoding() when I was sure I was done reading, because UTF-8 and ASCII-8BIT are not compatible?() and Ruby would have raised an Encoding::CompatibilityError as I was adding on bytes.
Working with binary data also requires you to know one more thing about Ruby's IO objects. Ruby has a feature where it translates some data you read on Windows. The translation is super simple: "\r\n" sequences read from an IO object are simplified to a solo "\n". This feature is to help make Unix scripts work well on a platform that has different line endings. It does create a gotcha though: when you're going to read any non-text data, be it binary data or just a non-ASCII-compatible Encoding like UTF-16, you need to warn Ruby not to do the translation for your code to be properly cross-platform.
By the way, this isn't new. This was even true in the Ruby 1.8 era.
Telling Ruby to treat the data as binary and not perform any translation (again, only active on Windows) is simple. You just add a "b" for binary to your mode String in a call to open(). Thus you would read with something like:
open(path, "rb") do |f|
  # ...
end
or write with code like:
open(path, "wb") do |f|
  # ...
end
If you always knew about this quirk and you did a good job of always doing this, give yourself a big pat on the back because you're all set. If you didn't, you've got a bad habit you'll need to break. Don't feel too bad about it though. I've known about this quirk since my Perl days (Perl does the same thing) and I've always tried to follow it. However, about ten different bugs were recently filed against one of my libraries that amounted to me missing this "b" in several places. It's easy to forget.
Ruby 1.9 is much more strict about the binary flag. It's going to complain if you don't add it when it feels it is needed. For example:
$ cat missing_b.rb
# Ruby 1.9 will let this slide
open("utf_16.txt", "w:UTF-16LE") do |f|
  f.puts "Some data."
end
# but not this
open("utf_16.txt", "r:UTF-16LE") do |f|
  # ...
end
$ ruby missing_b.rb
missing_b.rb:6:in `initialize': ASCII incompatible encoding needs binmode (ArgumentError)
	from missing_b.rb:6:in `open'
	from missing_b.rb:6:in `<main>'
Of course, this is trivial to fix. You just have to add the missing "b":
$ cat with_b.rb
open("utf_16.txt", "wb:UTF-16LE") do |f|
  f.puts "Some data."
end
open("utf_16.txt", "rb:UTF-16LE") do |f|
  puts f.external_encoding.name
end
$ ruby with_b.rb
UTF-16LE
I printed the external_encoding() there to show that it's exactly what I specified. However, as a reward for adding in these "b"s we've been bad about leaving out in the past, Ruby will now assume you want ASCII-8BIT when you supply the "b" and no external_encoding():
$ cat b_means_binary.rb
open("utf_16.txt", "r") do |f|
  puts "Inherited from environment: #{f.external_encoding.name}"
end
open("utf_16.txt", "rb") do |f|
  puts %Q{Using "rb": #{f.external_encoding.name}}
end
$ ruby b_means_binary.rb
Inherited from environment: UTF-8
Using "rb": ASCII-8BIT
It's worth noting that Ruby 1.8 accidentally helped train us to leave out the magic "b". For example, you could use IO::read() to slurp some data, but that method didn't provide a way to indicate that the data was binary. In truth, you really needed this monster for a safe cross-platform read of binary data: open(path, "rb") { |f| f.read }. It's no surprise that IO::read() was more common. IO::readlines() and IO::foreach() had the same issue. The core team has acknowledged these problems with some new additions. First, you can now pass a Hash as the final argument to all the methods that open an IO and use that to set options like :mode or, separately, :external_encoding, :internal_encoding, and :binmode (the name for the magic "b"). Here are some examples:
File.read("utf_16.txt", mode: "rb:UTF-16LE")
File.readlines("utf_16.txt", mode: "rb:UTF-16LE")
File.foreach("utf_16.txt", mode: "rb:UTF-16LE") do |line|
end
File.open("utf_16.txt", mode: "rb:UTF-16LE") do |f|
end
open("utf_16.txt", mode: "rb:UTF-16LE") do |f|
end
As one last shortcut along these lines, the new IO::binread() method is the same as IO.read(…, mode: "rb:ASCII-8BIT").
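Here's a sketch of binread() in action (I'm using a Tempfile so the example cleans up after itself; the file contents are just an arbitrary PNG-like prefix):

```ruby
require "tempfile"

Tempfile.create("binread_demo") do |f|
  f.binmode                 # write raw bytes, no newline translation
  f.write("\x89PNG\r\n")
  f.flush

  data = IO.binread(f.path, 4)  # grab just the first four raw bytes
  p [data.encoding.name, data]  # => ["ASCII-8BIT", "\x89PNG"]
end
```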
Regex Encodings
Now that all our data has an Encoding, it only makes sense that our Regexp objects would need to be tagged as well. That is the case, but the rules for how an Encoding is selected differ for Regexp. Let's talk a little about how and why.
First, let's get the big surprise out of the way:
$ cat re_encoding.rb
# encoding: UTF-8
utf8_str   = "résumé"
latin1_str = utf8_str.encode("ISO-8859-1")
binary_str = utf8_str.dup.force_encoding("ASCII-8BIT")
utf16_str  = utf8_str.encode("UTF-16BE")
re = /\Ar.sum.\z/
puts "Regexp.encoding.name: #{re.encoding.name}"
[utf8_str, latin1_str, binary_str, utf16_str].each do |str|
  begin
    result = str =~ re ? "Matches" : "Doesn't match"
  rescue Encoding::CompatibilityError
    result = "Can't match non-ASCII compatible?() Encoding"
  end
  puts "#{result}: #{str.encoding.name}"
end
$ ruby re_encoding.rb
Regexp.encoding.name: US-ASCII
Matches: UTF-8
Matches: ISO-8859-1
Doesn't match: ASCII-8BIT
Can't match non-ASCII compatible?() Encoding: UTF-16BE
After we did all that talking about the source Encoding, Ruby goes and ignores it on us. You can see that the Regexp was set to US-ASCII instead of the UTF-8 that was in effect at the time. Surprising though that may be, there is actually a pretty good reason for it.
My Regexp literal only contained seven-bit ASCII, so Ruby chose to simplify the Encoding. If it had left it at the source Encoding of UTF-8, it would only be useful for checking UTF-8 data. As it is though, it can now be used to check any ASCII compatible?() data. You can see in the output that the expression was tried against three different Strings, because they are all ASCII compatible?(). (It did fail to match one, since I changed the rules of how to interpret the data and one character became two bytes, but the attempt was still made.) The fourth match could not be attempted, because UTF-16 is not ASCII compatible?().
Of course, if your Regexp includes eight-bit characters, if you use the special escapes that change an Encoding, or if you apply one of the old Ruby 1.8 style Encoding options, you can get a non-ASCII Encoding:
$ cat encodings.rb
# encoding: UTF-8
res = [
  /…\z/,       # source Encoding
  /\A\uFEFF/,  # special escape
  /abc/u       # Ruby 1.8 option
]
puts res.map { |re| [re.encoding.name, re.inspect].join(" ") }
$ ruby encodings.rb
UTF-8 /…\z/
UTF-8 /\A\uFEFF/
UTF-8 /abc/
I used /u, which you will probably remember as the way to get a UTF-8 Regexp in the old Ruby 1.8 system. The /e (for EUC-JP) and /s (for a Shift_JIS extension called Windows-31J) options still work too. Ruby 1.9 also still supports the old /n option, but for legacy reasons it comes with some warning- and exception-tossing behavior, so I recommend just avoiding it going forward. You can build an ASCII-8BIT Regexp in another way I'll show in just a moment.
As of Ruby 1.9.2, this concept of a lenient Regexp, one that will match any ASCII compatible?() Encoding, has a new name:
$ cat fixed_encoding.rb
[/a/, /a/u].each do |re|
  puts "%-10s %s" % [re.encoding, re.fixed_encoding? ? "fixed" : "not fixed"]
end
$ ruby fixed_encoding.rb
US-ASCII not fixed
UTF-8 fixed
A fixed_encoding?() Regexp is one that will raise an Encoding::CompatibilityError if matched against any String that has a different Encoding from the Regexp itself, as long as the String isn't ascii_only?(). If fixed_encoding?() returns false, the Regexp can be used against any ASCII compatible?() Encoding. There's also a new constant with this name that can be used to disable the ASCII downgrading:
$ cat force_re_encoding.rb
puts Regexp.new("abc".force_encoding("UTF-8")).encoding.name
puts Regexp.new("abc".force_encoding("UTF-8"),
                Regexp::FIXEDENCODING).encoding.name
$ ruby force_re_encoding.rb
US-ASCII
UTF-8
Note how a Regexp will take the Encoding of the String passed to Regexp::new() when Regexp::FIXEDENCODING is set. You can use this combination to build a Regexp in any Encoding you need, including the ASCII-8BIT I mentioned earlier.
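For example, here's one way (a sketch) to build an ASCII-8BIT Regexp, suitable for matching raw bytes like that PNG signature from before:

```ruby
# a binary source String plus FIXEDENCODING gives a binary Regexp
sig_re = Regexp.new("\x89PNG".force_encoding("ASCII-8BIT"),
                    Regexp::FIXEDENCODING)
p [sig_re.encoding.name, sig_re.fixed_encoding?]  # => ["ASCII-8BIT", true]

data = "\x89PNG\r\n\x1A\n".force_encoding("ASCII-8BIT")
p data =~ sig_re  # matches at the start of the data => 0
```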
Once your Regexp is at least compatible with your data's Encoding, pattern matches function as they always have. (Well, in truth, Ruby 1.9 brings us a powerful new regular expression engine called Oniguruma, but that's another topic for another time.) Under average circumstances, Ruby 1.9's Regexp Encoding selection rules mean that expressions are compatible with a lot of data and everything should just work for you. However, if you end up getting some errors at match time, you may need to abandon the simple /…/ literal and use the new features I've shown to build a Regexp that perfectly matches your data's Encoding.
Handling a BOM
Some multibyte Encodings recommend that data in that Encoding begin with a Byte Order Mark (also known as a BOM) indicating the order of the bytes. UTF-16 is a good example.
Note that Ruby doesn't even support a plain UTF-16 Encoding. Instead, you must pick between UTF-16BE and UTF-16LE, for "Big Endian" or "Little Endian" byte order. This indicates whether the most significant byte comes first or last:
$ ruby -e 'p "a".encode("UTF-16BE")'
"\x00a"
$ ruby -e 'p "a".encode("UTF-16LE")'
"a\x00"
Now, when someone goes to read your UTF-16 data back, they'll need to know which byte order you used to get things right. You could just tell them which order was used the same way you'll probably tell them that the data is UTF-16 encoded. Or you could add a BOM to the data.
A Unicode BOM is just the character U+FEFF at the beginning of your data. There's no such character for the reversed bytes U+FFFE, so this makes it easy to correctly tell the order of the bytes. Another minor advantage is that this BOM probably indicates you are reading Unicode data. A lot of software will check for this special start of the data, use it to set the proper byte order, and then pretend it didn't even exist by removing it from the data they show users.
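You can watch that mechanism at the byte level: encoding U+FEFF into each byte order produces mirrored byte pairs (UTF-8 has a BOM form too, though as I argue below you shouldn't use it):

```ruby
p "\uFEFF".encode("UTF-16BE").bytes  # => [254, 255]
p "\uFEFF".encode("UTF-16LE").bytes  # => [255, 254]
p "\uFEFF".bytes                     # the UTF-8 form => [239, 187, 191]
```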
Ruby 1.9 won't automatically add a BOM to your data, so you're going to need to take care of that if you want one. Luckily, it's not too tough. The basic idea is just to print the bytes needed at the beginning of a file. For example, we can add a BOM to a UTF-16LE file like this:
$ cat utf16_bom.rb
# encoding: UTF-8
File.open("utf16_bom.txt", "w:UTF-16LE") do |f|
  f.puts "\uFEFFThis is UTF-16LE with a BOM."
end
$ ruby utf16_bom.rb
$ ruby -e 'p File.binread("utf16_bom.txt")[0..9]'
"\xFF\xFET\x00h\x00i\x00s\x00"
Notice that I just used the Unicode escape to add the BOM character to the data. Because my output String was in UTF-8, Ruby had to transcode it to UTF-16LE, and that process arranged the bytes correctly for me, as you can see in the sample output.
Reading a BOM is a similar process. We will need to pull the relevant bytes and see if they match a Unicode BOM. When they do, we can then start reading again with the Encoding we matched. We might code that up like this:
$ cat read_bom.rb
class File
  UTFS = [32, 16].map { |b| %w[BE LE].map { |o| "UTF-#{b}#{o}" } }.
                  flatten << "UTF-8"

  def self.open_using_unicode_bom(path, *args, &blk)
    # check the BOM to find the Encoding
    encoding = UTFS[0..-2].find(lambda { UTFS[-1] }) do |utf|
      bom = "\uFEFF".encode(utf)
      binread(path, bom.bytesize).force_encoding(utf) == bom
    end
    # set the Encoding
    if args.first.nil?
      args << "r#{'b' unless encoding == UTFS[-1]}:#{encoding}"
    elsif args.first.is_a? Hash
      args.first.merge!(external_encoding: encoding)
    else
      args.first.sub!(/\A([^:]*)/, "\\1:#{encoding}")
    end
    # hand off to open()
    if blk
      open(path, *args) do |f|
        f.read_unicode_bom
        blk[f]
      end
    else
      f = open(path, *args)
      f.read_unicode_bom
      f
    end
  end

  def read_unicode_bom
    bytes = external_encoding.name[/\AUTF-?(\d+)/i, 1].to_i / 8
    read(bytes) if bytes > 1
  end
end

# example usage with the File we created earlier
File.open_using_unicode_bom("utf16_bom.txt") do |f|
  line = f.gets
  p [line.encoding, line[0..3]]
end
$ ruby read_bom.rb
[#<Encoding:UTF-16LE>, "T\x00h\x00i\x00s\x00"]
These examples just deal with Unicode BOMs, but you would handle other BOMs in a similar fashion. Find out what bytes are needed for your Encoding, write those out before the data, and later check for them when reading the data back. The String escapes we discussed earlier can be handy when writing the bytes, and binread() is equally handy when checking for the BOM.
I do recommend including a BOM in Unicode Encodings like UTF-16 and UTF-32, but please don't add one to UTF-8 data. The UTF-8 byte order is part of its specification and never varies, so you don't need a BOM to read it correctly. If you add one, you damage one of UTF-8's great advantages: it can pass for US-ASCII (assuming it's all seven-bit characters).
Comments (4)
Axel Niedenhoff · May 11th, 2009

The b option for opening files is even present in C. Maybe that's where it originated, and all platforms building upon the standard C library have inherited it from there.
Another method got a neat upgrade in Ruby 1.9: Integer#chr(). You can use this method in Ruby 1.8 to convert simple byte values into single-character Strings. However, the method is limited to single byte values. This example shows both how it works and the limit:

$ cat chr.rb
p 97.chr
p 256.chr
$ ruby -v chr.rb
ruby 1.8.6 (2009-03-31 patchlevel 368) [i686-darwin9.6.0]
"a"
chr.rb:2:in `chr': 256 out of char range (RangeError)
	from chr.rb:2

That much is unchanged in Ruby 1.9:

$ ruby_dev -v chr.rb
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-darwin9.6.0]
"a"
chr.rb:2:in `chr': 256 out of char range (RangeError)
	from chr.rb:2:in `<main>'

However, Ruby 1.9 adds a new twist. The method now takes an optional Encoding argument, or the String name of an Encoding. If you provide an Encoding, the method will convert a codepoint (which you can get with ord() or codepoints()) into a String:

$ cat codepoint_chr.rb
# encoding: UTF-8
p "é".ord
p "é".codepoints.first
p 233.chr("UTF-8")
$ ruby_dev -v codepoint_chr.rb
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-darwin9.6.0]
233
233
"é"

That turns out to be a pretty easy way to spot check some codepoint mappings in IRb.
I've read all your articles about Encoding in detail, but I can't find a solution to the following problem. My app receives a string from another app such as "%e47%e14%e1a". When I unescape the HTML I end up with something like "\xE47\xE14\xE1a", which corresponds to the codepoints 3655, 3604, and 3610 from the Thai alphabet. Using Ruby 1.9.2, "\xE47\xE14\xE1a".valid_encoding? returns false, because these aren't byte sequences but rather Unicode codepoints. How can I use Ruby here to convert these codepoints to the proper string, ็ดบ?
I'm not sure those URL escapes are valid, which is why I believe you are having trouble. I think the escapes are supposed to be handled by bytes, not codepoints.
That said, this code seems to expand it correctly:
>> "%e47%e14%e1a".scan(/%\w+/).map { |cp| Integer("0x#{cp[1..-1]}") }.pack("U*") => "็ดบ"
Hope that helps.