Character Encodings

My extensive coverage of a complex topic all programmers should study a little.

8 DEC 2008
Encoding Conversion With iconv

There's one last standard library we need to discuss for us to have completely covered Ruby 1.8's support for character encodings. The iconv library ships with Ruby and it can handle an impressive set of character encoding conversions.

This is an important piece of the puzzle. You may have accepted my advice that it's OK to just work with UTF-8 data whenever you have the choice, but the fact is that there's a lot of non-UTF-8 data in the world. Legacy systems may have produced data before UTF-8 was popular, some services may work in different encodings for any number of reasons, and not quite everyone has embraced Unicode fully yet. If you run into data like this, you will need a way to convert it to UTF-8 as you import it and possibly a way to convert it back when you export it. That's exactly what iconv does.

Instead of jumping right into Ruby's iconv library, let's come at it with a slightly different approach. iconv is actually a C library that performs these conversions and on most systems where it is installed you will have a command-line interface for it.

It's very easy to use the iconv program. Just always follow these three steps:

  1. Tell iconv the encoding you want it to write data out in, including any special translation instructions
  2. Tell iconv the encoding data will be passed to it in
  3. Send the input into iconv on STDIN (or just list the files as arguments, if you prefer) and redirect iconv's STDOUT to where you want output to be written

For example, let's say I have some UTF-8 data:

$ echo "Résumé" > utf8.txt
$ wc -c utf8.txt 
       9 utf8.txt

My terminal works in UTF-8, so that's the data echo wrote into the file. You can see that it's encoded now because we have nine bytes in the file (one each for "R", "s", "u", "m", and "\n" plus two for each "é").
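The same byte arithmetic can be checked from Ruby. This is only a minimal sketch, and it assumes Ruby 1.9 or later (where strings carry an encoding and `String#bytesize` exists) rather than the Ruby 1.8 this post targets. The `\u00E9` escapes stand in for "é" so the example does not depend on how the source file itself is encoded.

```ruby
# The contents echo wrote to utf8.txt: "Résumé" plus a newline.
text = "R\u00E9sum\u00E9\n"

chars = text.length    # counts characters
bytes = text.bytesize  # counts bytes in the string's encoding (UTF-8 here)

# Each "é" is a two-byte sequence in UTF-8.
e_acute_bytes = "\u00E9".bytes.to_a

puts "#{chars} characters, #{bytes} bytes"
puts "bytes for e-acute: #{e_acute_bytes.inspect}"
```

Seven characters take nine bytes, which matches what wc -c reported for the file.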

Here's how we would convert that data to Latin-1 using iconv:

$ iconv -t LATIN1 -f UTF8 < utf8.txt > latin1.txt
$ wc -c latin1.txt 
       7 latin1.txt

You can see the conversion worked, because an "é" is only one byte in Latin-1 and we dropped two bytes.

Note my use of all three steps here:

  1. I used -t LATIN1 to set the to encoding without any special translations
  2. I used -f UTF8 to set the from encoding
  3. I used < utf8.txt to redirect data into the program and > latin1.txt to redirect its output to a file

Those are always the steps as I said before.

You only need to know two more things about iconv. First, iconv supports a truckload of encodings, including all of the common encodings I've been talking about in this series. They vary somewhat between platforms though, so you will need to check what is available to you:

$ iconv --list
ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US
  ISO_646.IRV:1991 US US-ASCII CSASCII
UTF-8 UTF8
UTF-8-MAC UTF8-MAC
ISO-10646-UCS-2 UCS-2 CSUNICODE
UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
UCS-2LE UNICODELITTLE
ISO-10646-UCS-4 UCS-4 CSUCS4
UCS-4BE
UCS-4LE
UTF-16
…

Each line of that listing shows a single encoding. The space-separated names on each line are all aliases for that encoding that iconv will accept. Thus the first long line, which I had to break in two, provides a bunch of aliases for US-ASCII. We can also see by reading down a bit that iconv will accept UTF8 or UTF-8.
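If you want to check aliases from a script, the listing is easy to pick apart. The following is a rough sketch that works on a few lines copied from the output above; `alias_groups` and `aliases_for` are hypothetical helper names, the case-insensitive lookup is just a convenience I chose, and the handling of indented continuation lines only exists to cope with my hand-wrapped first line. The format of iconv --list differs between platforms, so treat the parsing as an assumption about this particular output.

```ruby
# A few lines copied from the `iconv --list` output shown above.
# The indented second line is the manual line break in the listing.
LISTING = <<END_LISTING
ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US
  ISO_646.IRV:1991 US US-ASCII CSASCII
UTF-8 UTF8
UTF-8-MAC UTF8-MAC
UCS-2LE UNICODELITTLE
END_LISTING

# Build one array of aliases per encoding (hypothetical helper).
def alias_groups(listing)
  groups = []
  listing.each_line do |line|
    names = line.split
    next if names.empty?
    if line =~ /\A\s/ && !groups.empty?
      groups.last.concat(names)  # continuation of the previous encoding
    else
      groups << names            # a new encoding starts here
    end
  end
  groups
end

# Find the alias group containing a name, ignoring case (hypothetical helper).
def aliases_for(groups, name)
  groups.find { |names| names.any? { |n| n.casecmp(name).zero? } }
end

groups = alias_groups(LISTING)
p aliases_for(groups, "utf8")
p aliases_for(groups, "US-ASCII")
p aliases_for(groups, "KLINGON")
```

A name that is not in the listing comes back as nil, which is a cheap way to catch an unsupported encoding before attempting a conversion.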

The last thing to know about iconv is that it has some special translation modes. To see those in action, let's work with a different piece of data:

$ echo "On and on… and on…" > utf8.txt
$ cat utf8.txt 
On and on… and on…

That last character is an ellipsis or three dots all in one character. Unicode has that character, but Latin-1 does not. Let's see what happens if we try to convert the data now:

$ iconv -f UTF8 -t LATIN1 < utf8.txt > latin1.txt

iconv: (stdin):1:9: cannot convert
$ cat latin1.txt 
On and on

As you can see, I got an error when it reached the first occurrence of the problem character. The cat command also shows that it completely quit working there.

That may be what you need, so you can tell a user you can't work with their data. I often find though that I just need to do the best I can with the data that I have. iconv's translation modes can help with that.

First, you can ask iconv to ignore any characters that cannot be converted to the new encoding:

$ iconv -t LATIN1//IGNORE -f UTF8 < utf8.txt > latin1_wignore.txt
$ cat latin1_wignore.txt 
On and on and on

As you can see, we completed the entire translation that time, only losing the problematic characters. The //IGNORE sequence adds the translation mode. Modes are always specified after the output encoding. That's an improvement for sure, but it's possible to do even better in this case.

iconv has another translation mode where it will try to transliterate characters into an equivalent representation in the target encoding:

$ iconv -t LATIN1//TRANSLIT -f UTF8 < utf8.txt > latin1_wtranslit.txt
$ cat latin1_wtranslit.txt 
On and on... and on...

This time, instead of dropping the ellipsis characters, iconv replaced them with three full stops each. It's not as fancy as the Unicode character, but it gets the job done and we do a good job of keeping the meaning of the data.

//TRANSLIT can't convert absolutely everything you will see in the wild, so it's still possible to get errors when using it. You can combine the modes though by specifying //TRANSLIT//IGNORE. That will ask iconv to transliterate what it can and drop the rest. Note that order does matter there: transliteration has to come first, so iconv tries it before ignoring the character.

You can also give iconv specific translations for bytes it has trouble with. I've never needed that level of control though and find the translation modes help me do more with less effort. Have a quick browse through man iconv, if you are curious.

That's all you need to know about iconv. You are now a character conversion expert. Congratulations.

Of course, it would be nice to talk about how this affects Ruby. Let's do that.

The Ruby standard library is just like the program we've been playing with. It just provides a method interface to the underlying C code. To show that, here's the same conversion we started with:

#!/usr/bin/env ruby -wKU

require "iconv"

utf8 = "Résumé"
utf8.size  # => 8

latin1 = Iconv.conv("LATIN1", "UTF8", utf8)
latin1.size  # => 6

You can see that the steps are exactly the same. The first parameter is your target encoding and the second is the encoding your data is currently in. You pass the data to convert in the last parameter and the return value of the call is the result.
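For comparison, here is how the same conversion might look without the iconv library. This sketch assumes Ruby 1.9 or later and uses the built-in `String#encode`, which is not available in Ruby 1.8; it is offered as a rough counterpart rather than a replacement for the code above. Note that on those Rubies size() counts characters, so the example uses bytesize() to see the byte counts.

```ruby
# "Résumé" written with escapes so the source file's encoding doesn't matter.
utf8 = "R\u00E9sum\u00E9"
utf8_bytes = utf8.bytesize

# Same argument order as Iconv.conv: target encoding first, then the source.
latin1 = utf8.encode("ISO-8859-1", "UTF-8")
latin1_bytes = latin1.bytesize

puts "UTF-8: #{utf8_bytes} bytes, Latin-1: #{latin1_bytes} bytes"
```

The byte counts should match the 8 and 6 from the iconv version, since each "é" shrinks from two bytes to one.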

If you are going to do several conversions in a row, it's slightly easier to create an Iconv instance and just reuse that:

#!/usr/bin/env ruby -wKU

require "iconv"

utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")

resume = "Résumé"
utf8_to_latin1.iconv(resume).size  # => 6

on_and_on = "On and on… and on…"
utf8_to_latin1.iconv(on_and_on)  # => "On and on... and on..."

That's all there is to it. The new() method builds an object that remembers the encodings you are converting and then you can call iconv() (instead of the conv() class method we used earlier) to convert data.
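If you are on a newer Ruby where iconv is deprecated, `String#encode` can approximate the translation modes. This is a sketch under assumptions: it needs a Ruby recent enough to support the `:undef`, `:replace`, and `:fallback` options, and `FALLBACKS` is a hypothetical table I made up for the example. There is no built-in equivalent of //TRANSLIT, so you have to supply the replacements yourself.

```ruby
# "On and on… and on…" with the ellipsis (U+2026) written as an escape.
on_and_on = "On and on\u2026 and on\u2026"

# Roughly like //IGNORE: drop characters that have no Latin-1 equivalent.
ignored = on_and_on.encode("ISO-8859-1", :undef => :replace, :replace => "")

# A hand-written stand-in for //TRANSLIT: list the replacements explicitly.
FALLBACKS = { "\u2026" => "..." }
translit = on_and_on.encode("ISO-8859-1", :fallback => FALLBACKS)

puts ignored
puts translit
```

Any character missing from the fallback table still raises an error, so this covers only the cases you anticipate, whereas iconv's //TRANSLIT brings its own table.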

When things go wrong, the Ruby interface will raise exceptions like Iconv::InvalidEncoding or Iconv::InvalidCharacter. See the documentation for details.
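Error handling follows the usual begin/rescue pattern. Because the iconv library may not be installed on newer Rubies, this sketch uses `String#encode` (Ruby 1.9 or later) and its exception classes instead; they are only rough counterparts to the Iconv exceptions named above, not the same classes.

```ruby
text = "On and on\u2026"   # ends with an ellipsis, which Latin-1 lacks

# A character the target encoding cannot represent.
begin
  text.encode("ISO-8859-1")
  outcome = "converted"
rescue Encoding::UndefinedConversionError => error
  outcome = "cannot convert #{error.error_char.inspect}"
end

# An encoding name Ruby does not know ("NO-SUCH-ENCODING" is made up).
begin
  text.encode("NO-SUCH-ENCODING")
  lookup = "found"
rescue Encoding::ConverterNotFoundError => error
  lookup = "unknown encoding: #{error.message}"
end

puts outcome
puts lookup
```

Rescuing the specific classes lets you tell a bad encoding name apart from bad data, which is the same distinction the Iconv exceptions make.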

The Ruby 1.8 library does not provide a way to programmatically list the supported encodings, which is one of the big reasons I started off showing you the command-line program instead. You will need to check them there. However, Ruby 1.9 adds a method for this:

$ ruby_dev -r iconv -r pp -ve 'pp Iconv.list'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
[["ANSI_X3.4-1968",
  "ANSI_X3.4-1986",
  "ASCII",
  "CP367",
  "IBM367",
  "ISO-IR-6",
  "ISO646-US",
  "ISO_646.IRV:1991",
  "US",
  "US-ASCII",
  "CSASCII"],
 ["UTF-8", "UTF8"],
…

This concludes our tour of character encoding tools for Ruby 1.8. In later posts, we will take a step back from all of this and examine what the problems with this system are. That will pave the way for us to discuss the new m17n (multilingualization) code in Ruby 1.9.

Comments (22)
  1. Tim Morgan
    December 11th, 2008

    Thanks for this. I have seen iconv on Unixy systems for a long time and never knew what it was all about. I really appreciate this series. Keep up the good work.

    1. Reply (using GitHub Flavored Markdown)

      Comments on this blog are moderated. Spam is removed, formatting is fixed, and there's a zero tolerance policy on intolerance.

  2. Xabriel J. Collazo-Mojica

    These posts are the most comprehensive treatment on Strings for Ruby. Thank you very much James; you've just saved my Capstone project from doom. :)

  3. Jim Tran
    April 12th, 2009

    Couldn't agree more - thanks for a superb series so far. I can't believe I'm up on a Saturday night at 2:50am reading this entire series. This is surprisingly engrossing stuff, really helps to connect the dots. Thanks!

  4. Rob Miller
    April 12th, 2010

    cool site.

  5. SunnySan
    April 15th, 2010

    Do I need to install iconv onto my PC to work ?
    how to do it??

    I added this:

    class String #in enviroment.rb (last lines)
      require 'iconv' #this line is not needed in rails !
      def to_utf8(encoding) 
        Iconv.conv( 'UTF-8',"#{encoding}", self)
      end
    end
    

    but when I use it with encoding=ISO8859-7 it return blanks!!!
    any idea??

    Thanks

    2. James Edward Gray II
      April 15th, 2010

      The iconv library is required yes. You will want to Google some instructions for installing it on your platform since I don't know what that is and I'm not an expert for all of them.

      However, I don't think that's the issue you are seeing here. Different encodings are supported on different platforms. Try listing the supported encodings as shown in my post and make sure the one you want is in the list.

  6. Todd
    July 27th, 2010

    I believe in your examples you mean UTF-8 instead of UTF8

    utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")
    

    should be

    utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF-8")
    

    It might be that more recent versions of Ruby are okay with UTF8 but I believe 1.8.x only accepted UTF-8 form

    2. James Edward Gray II
      July 27th, 2010

      Iconv is smart and will accept either:

      $ iconv --list | grep UTF8
      UTF-8 UTF8
      UTF-8-MAC UTF8-MAC
      
  7. Carlos
    July 30th, 2010

    Hi.

    I have the following code:

    file = File.new("paymul1", "w")
    data = Iconv.iconv("utf-8", "us-ascii", ic.to_s).join
    file.print data
    

    I expected that the file charset was utf-8. But:

    $ file -i paymul1 
    paymul1: text/plain charset=us-ascii
    

    the file charset was us-ascii.

    2. James Edward Gray II
      July 30th, 2010

      US-ASCII is a valid subset of UTF-8, so I'm guessing your data just doesn't have any characters with higher-order bits set. If that's true, the file is valid US-ASCII, Latin-1, UTF-8, and more. The file program just went with the simplest answer.

  8. Zeno Davatzu
    January 6th, 2011

    Interesting thank you for the post.

    How would I use iconv on Windows with Ruby 1.8.6 and RubyGems on Windows Vista?

    Best

    2. James Edward Gray II
      January 6th, 2011

      I'm sorry, but I'm not a Windows guy and thus not the right person to answer that question.

      2. Zeno Davatz
        January 6th, 2011

        Thanks for the reply. I found the Windows installer:

        http://gnuwin32.sourceforge.net/packages/libiconv.htm

        I am also not a Windows guy ;) but I want our Ruby Software to run on Windows as well. Ruby on Windows is actually better supported than Ruby on OS X as far as I can tell. I find that fact interesting as I have been working on Linux (and Mac) for 10 years now.

        2. James Edward Gray II
          January 6th, 2011

          Thanks for filling me in on the Windows solution.

          I don't think many people would agree with you on the current level of Windows support, but I am glad to hear that it is improving.

  9. steve s
    April 2nd, 2011

    Thank you for this well-written post. I had been using iconv for a project of mine, but didn't know about the //TRANSLIT and //IGNORE options and needed to deal with a small handful of cases where the conversion I wanted was failing. Your post very quickly taught me what I had been looking for. Kind regards and thank you.

  10. Patryk J
    July 17th, 2011

    Just wanted to say "Thank you!" for writing this piece and going through iconv in such depth. I'm doing some conversions, and this was exactly what I needed. Thanks!!

  11. Roger Paul
    December 14th, 2011

    I tried the UTF8 on the iconv1.14 and it did not work. Only UTF-8 worked. And even then it ignored the e'.

    iconv -t LATIN1//TRANSLIT -f UTF-8 < utf8.txt > latin1_wtranslit.txt
    iconv: (stdin):1:1: cannot convert
    

    Is there something wrong with my iconv (compiled on SUN x86 using gcc)?

    2. James Edward Gray II
      December 14th, 2011

      I'm not sure exactly why you are seeing these oddities. My suggestion is to try listing the available encodings with:

      iconv -l
      

      Make sure both of the encodings you are using are in that list.

  12. Sean
    June 16th, 2012

    Thanks for this very informative post. Best summary of iconv I could find!

  13. Andreas Fischlin
    September 4th, 2012

    Indeed very helpful and was also exactly what I needed. Thanks a lot.

  14. Paul Emico
    November 14th, 2012

    Help:

    I am using Iconv.iconv('ascii//TRANSLIT', 'utf8', value) and I get the warning message:

    /usr/lib/ruby/gems/1.9.1/gems/activesupport-3.2.3/lib/active_support/dependencies.rb:251:in `block in require': iconv will be deprecated in the future, use String#encode instead
    

    How do I have to use the method String#encode to do the same?

    2. James Edward Gray II
      November 14th, 2012

      This is a good question.

      String#encode() doesn't really provide an equivalent option to iconv's //TRANSLIT. You can specify fallbacks, but you can't ask Ruby to intelligently handle the replacements for you.

      In this instance, iconv still feels superior to me. It will be interesting to see how that is addressed as this feature is retired.

      Of course, they can't really take iconv away from us. We can always just shell out to the command-line program. So, if you want to stick with iconv, you can.
