Encoding Conversion With iconv
There's one last standard library we need to discuss for us to have completely covered Ruby 1.8's support for character encodings. The iconv library ships with Ruby, and it can handle an impressive set of character encoding conversions.
This is an important piece of the puzzle. You may have accepted my advice that it's OK to just work with UTF-8 data whenever you have the choice, but the fact is that there's a lot of non-UTF-8 data in the world. Legacy systems may have produced data before UTF-8 was popular, some services may work in different encodings for any number of reasons, and not quite everyone has embraced Unicode fully yet. If you run into data like this, you will need a way to convert it to UTF-8 as you import it and possibly a way to convert it back when you export it. That's exactly what the iconv library does.
Instead of jumping right into Ruby's iconv library, let's come at it from a slightly different angle. iconv is actually a C library that performs these conversions, and on most systems where it is installed you will have a command-line interface for it.
It's very easy to use the iconv program. Just always follow these three steps:

- Tell iconv the encoding you want it to write data out in, including any special translation instructions
- Tell iconv the encoding data will be passed to it in
- Send the input into STDIN (or just list the files as arguments, if you prefer) and redirect STDOUT to where you want output to be written
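The same three pieces of information drive a conversion in Ruby as well. As a point of reference, here is a sketch using modern Ruby's String#encode (the Iconv class this series covers no longer ships with current Rubies, so this stand-in is my own illustration of the idea, not the 1.8 API):

```ruby
# The three steps, mapped onto modern Ruby's String#encode:
to_encoding   = "ISO-8859-1"  # 1. the encoding to write data out in (Latin-1)
from_encoding = "UTF-8"       # 2. the encoding the data comes in as
input         = "Résumé"      # 3. the data itself

output = input.encode(to_encoding, from_encoding)
output.bytesize  # => 6
```

The order of the arguments even mirrors the command-line flags: destination first, source second.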
For example, let's say I have some UTF-8 data:
$ echo "Résumé" > utf8.txt
$ wc -c utf8.txt
       9 utf8.txt
My terminal works in UTF-8, so that's the data echo wrote into the file. You can see that it's encoded now, because we have nine bytes in the file (one each for the plain ASCII characters and the trailing "\n", plus two for each "é").
Here's how we would convert that data to Latin-1 using iconv:
$ iconv -t LATIN1 -f UTF8 < utf8.txt > latin1.txt
$ wc -c latin1.txt
       7 latin1.txt
You can see the conversion worked, because an "é" is only one byte in Latin-1 and we dropped two bytes.
Note my use of all three steps here:
- I used -t LATIN1 to set the to encoding, without any special translations
- I used -f UTF8 to set the from encoding
- I used < utf8.txt to pipe data in and > latin1.txt to pipe data out of the program
Those are always the steps, as I said before.
You only need to know two more things about iconv. First, iconv supports a truckload of encodings, including all of the common encodings I've been talking about in this series. They vary some on different platforms though, so you will need to check what is available to you:
$ iconv --list
ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US
ISO_646.IRV:1991 US US-ASCII CSASCII
UTF-8 UTF8
UTF-8-MAC UTF8-MAC
ISO-10646-UCS-2 UCS-2 CSUNICODE
UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
UCS-2LE UNICODELITTLE
ISO-10646-UCS-4 UCS-4 CSUCS4
UCS-4BE
UCS-4LE
UTF-16
…
Each line of that listing shows a single encoding. The space-separated lists on each line are all aliases for that encoding that iconv will accept. Thus that first long line, which I had to break into two, provides a bunch of aliases for US-ASCII. We can also see, by reading down a bit, that iconv will accept UTF8 or UTF-8.
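Modern Ruby keeps a comparable alias table for its own Encoding objects. It is a different list from iconv's, but it illustrates the same idea that several names resolve to one encoding:

```ruby
# Ruby's own encoding alias table (not iconv's list, but the same concept):
Encoding.aliases["ASCII"]   # => "US-ASCII"
Encoding.aliases["BINARY"]  # => "ASCII-8BIT"

# Any alias can be used to look up the canonical encoding:
Encoding.find("ASCII") == Encoding::US_ASCII  # => true
```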
The last thing to know about iconv is that it has some special translation modes. To see those in action, let's work with a different piece of data:
$ echo "On and on… and on…" > utf8.txt
$ cat utf8.txt
On and on… and on…
That last character is an ellipsis, or three dots all in one character. Unicode has that character, but Latin-1 does not. Let's see what happens if we try to convert the data now:
$ iconv -f UTF8 -t LATIN1 < utf8.txt > latin1.txt
iconv: (stdin):1:9: cannot convert
$ cat latin1.txt
On and on
As you can see, I got an error when it reached the first occurrence of the problem character. The cat command also shows that it completely quit working there.
That may be what you need, so you can tell a user you can't work with their data. I often find, though, that I just need to do the best I can with the data that I have. iconv's translation modes can help with that.
First, you can ask iconv to ignore any characters that cannot be converted to the new encoding:
$ iconv -t LATIN1//IGNORE -f UTF8 < utf8.txt > latin1_wignore.txt
$ cat latin1_wignore.txt
On and on and on
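For comparison, modern Ruby can approximate //IGNORE with String#encode's :undef and :replace options. This is a sketch of the same idea, since the Iconv class no longer ships with current Rubies:

```ruby
data = "On and on… and on…"

# Without options the conversion raises, just like plain iconv:
#   data.encode("ISO-8859-1")  # => Encoding::UndefinedConversionError

# //IGNORE-style: replace any unconvertible character with nothing
stripped = data.encode("ISO-8859-1", undef: :replace, replace: "")
stripped  # => "On and on and on" (now in Latin-1)
```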
As you can see, we completed the entire translation that time, only losing the problematic characters. The //IGNORE sequence adds the translation mode. Modes are always specified after the output encoding. That's an improvement for sure, but it's possible to do even better in this case.

iconv has another translation mode where it will try to transliterate characters into an equivalent representation in the target encoding:
$ iconv -t LATIN1//TRANSLIT -f UTF8 < utf8.txt > latin1_wtranslit.txt
$ cat latin1_wtranslit.txt
On and on... and on...
This time, instead of dropping the ellipsis characters, iconv replaced them with three full stops each. It's not as fancy as the Unicode character, but it gets the job done, and we do a good job of keeping the meaning of the data.
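Modern Ruby's String#encode has no //TRANSLIT mode, but its :fallback option lets you supply substitutions yourself. Here is a hand-rolled sketch of the ellipsis translation iconv just performed; the mapping table is my own, not something built into the library:

```ruby
data = "On and on… and on…"

# Hand-rolled transliteration: map the ellipsis to three full stops
translit = data.encode("ISO-8859-1", fallback: { "…" => "..." })
translit  # => "On and on... and on..." (in Latin-1)
```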
//TRANSLIT can't convert absolutely everything you will see in the wild, so it's still possible to get errors when using it. You can combine the modes, though, by specifying //TRANSLIT//IGNORE. That will ask iconv to transliterate what it can and drop the rest. Note that order does matter there: you need to be sure it tries transliteration before ignoring the character.
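The //TRANSLIT//IGNORE combination can be mimicked in modern Ruby with a :fallback proc: transliterate the characters you have a mapping for and return an empty string for the rest. Again, the mapping table here is my own illustration:

```ruby
data     = "On and on… and on… €10"
translit = { "…" => "..." }  # hand-rolled transliteration table

# Transliterate what we can, drop what we can't:
result = data.encode("ISO-8859-1",
                     fallback: ->(char) { translit.fetch(char, "") })
result  # => "On and on... and on... 10" -- "€" had no mapping, so it was dropped
```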
You can also give iconv specific translations for bytes it has trouble with. I've never needed that level of control, though, and find the translation modes help me do more with less effort. Have a quick browse through man iconv, if you are curious.
That's all you need to know about iconv. You are now a character conversion expert. Congratulations.
Of course, it would be nice to talk about how this affects Ruby. Let's do that.
The Ruby standard library works just like the program we've been playing with; it merely provides a method interface to the underlying C code. To show that, here's the same conversion we started with:
#!/usr/bin/env ruby -wKU

require "iconv"

utf8 = "Résumé"
utf8.size                                    # => 8

latin1 = Iconv.conv("LATIN1", "UTF8", utf8)
latin1.size                                  # => 6
You can see that the steps are exactly the same. The first parameter is your target encoding and the second is the encoding your data is currently in. You pass the data to convert in the last parameter and the return value of the call is the result.
If you are going to do several conversions in a row, it's slightly easier to create an Iconv instance and just reuse that:
#!/usr/bin/env ruby -wKU

require "iconv"

utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")

resume = "Résumé"
utf8_to_latin1.iconv(resume).size  # => 6

on_and_on = "On and on… and on…"
utf8_to_latin1.iconv(on_and_on)    # => "On and on... and on..."
That's all there is to it. The new() method builds an object that remembers the encodings you are converting, and then you can call iconv() (instead of the conv() class method we used earlier) to convert data.
When things go wrong, the Ruby interface will raise exceptions like Iconv::InvalidCharacter. See the documentation for details.
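Since the Iconv class no longer ships with current Rubies, here is the equivalent rescue pattern with the modern exception classes, as an illustration of what handling a failed conversion looks like:

```ruby
begin
  "On and on…".encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => e
  e.error_char  # => "…" -- the character that could not be converted
end
```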
The Ruby 1.8 library does not provide a way to programmatically list the supported encodings, which is one of the big reasons I started off showing you the command-line program instead. You will need to check them there. However, Ruby 1.9 adds a method for this:
$ ruby_dev -r iconv -r pp -ve 'pp Iconv.list'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
[["ANSI_X3.4-1968", "ANSI_X3.4-1986", "ASCII", "CP367", "IBM367",
  "ISO-IR-6", "ISO646-US", "ISO_646.IRV:1991", "US", "US-ASCII", "CSASCII"],
 ["UTF-8", "UTF8"],
 …
This concludes our tour of character encoding tools for Ruby 1.8. In later posts, we will take a step back from all of this and examine what the problems with this system are. That will pave the way for us to discuss the new m17n (multilingualization) code in Ruby 1.9.