
The 22nd Comment on "Encoding Conversion With iconv"

2014-04-17T16:04:36Z

This is a good question.

String#encode() doesn't really provide an equivalent option to iconv's //TRANSLIT. You can specify fallbacks, but you can't ask Ruby to intelligently handle the replacements for you.

In this instance, iconv still feels superior to me. It will be interesting to see how that is addressed as this feature is retired.

Of course, they can't really take iconv away from us. We can always just shell out to the command-line program. So, if you want to stick with iconv, you can.
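For what it's worth, here is a minimal sketch of the fallback approach. It assumes a Ruby new enough that String#encode accepts the :fallback, :undef, and :replace options, and the replacement table is something I made up for illustration. You have to list every mapping yourself, which is the work //TRANSLIT would otherwise do for you:

```ruby
# Assumes a Ruby whose String#encode supports :fallback, :undef, and :replace.
utf8 = "On and on\u2026 and on\u2026"

# Hypothetical replacement table: every mapping must be supplied by hand.
replacements = { "\u2026" => "..." }

# Roughly what //TRANSLIT did for this one character.
translit = utf8.encode("ISO-8859-1", fallback: replacements)

# Roughly what //IGNORE did: drop anything Latin-1 cannot represent.
ignored = utf8.encode("ISO-8859-1", undef: :replace, replace: "")

puts translit.bytesize  # 22
puts ignored.bytesize   # 16
```

Neither call is a true replacement for //TRANSLIT, since nothing here knows a sensible substitute for a character you did not anticipate.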

The 21st Comment on "Encoding Conversion With iconv"

2014-04-17T16:04:36Z

Help:

I am using Iconv.iconv('ascii//TRANSLIT', 'utf8', value) and I get the warning message:

/usr/lib/ruby/gems/1.9.1/gems/activesupport-3.2.3/lib/active_support/dependencies.rb:251:in `block in require': iconv will be deprecated in the future, use String#encode instead

How do I have to use the method String#encode to do the same?

The 20th Comment on "Encoding Conversion With iconv"

2014-03-27T01:38:28Z

Indeed very helpful and was also exactly what I needed. Thanks a lot.

The 19th Comment on "Encoding Conversion With iconv"

2014-04-17T16:02:38Z

Thanks for this very informative post. Best summary of iconv I could find!

The 18th Comment on "Encoding Conversion With iconv"

2014-04-17T16:02:21Z

I'm not sure exactly why you are seeing these oddities. My suggestion is to try listing the available encodings with:

iconv -l

Make sure both of the encodings you are using are in that list.

The 17th Comment on "Encoding Conversion With iconv"

2014-04-17T16:02:21Z

I tried UTF8 with iconv 1.14 and it did not work. Only UTF-8 worked. And even then it ignored the e'.

iconv -t LATIN1//TRANSLIT -f UTF-8 < utf8.txt > latin1_wtranslit.txt
iconv: (stdin):1:1: cannot convert

Is there something wrong with my iconv (compiled on SUN x86 using gcc)?

The 16th Comment on "Encoding Conversion With iconv"

2014-04-17T16:00:21Z

Just wanted to say "Thank you!" for writing this piece and going through iconv in such depth. I'm doing some conversions, and this was exactly what I needed. Thanks!!

The 15th Comment on "Encoding Conversion With iconv"

2014-04-17T15:59:58Z

Thank you for this well-written post. I had been using iconv for a project of mine, but didn't know about the //TRANSLIT and //IGNORE options and needed to deal with a small handful of cases where the conversion I wanted was failing. Your post very quickly taught me what I had been looking for. Kind regards and thank you.

The 14th Comment on "Encoding Conversion With iconv"

2014-04-17T15:59:23Z

Thanks for filling me in on the Windows solution.

I don't think many people would agree with you on the current level of Windows support, but I am glad to hear that it is improving.

The 13th Comment on "Encoding Conversion With iconv"

2014-04-17T15:59:23Z

Thanks for the reply. I found the Windows installer:

http://gnuwin32.sourceforge.net/packages/libiconv.htm

I am also not a Windows guy ;) but I want our Ruby software to run on Windows as well. Ruby on Windows is actually better supported than Ruby on OS X as far as I can tell. I find that fact interesting, as I have been working on Linux (and Mac) for 10 years now.

The 12th Comment on "Encoding Conversion With iconv"

2014-04-17T15:59:23Z

I'm sorry, but I'm not a Windows guy and thus not the right person to answer that question.

The 11th Comment on "Encoding Conversion With iconv"

2014-04-17T15:59:23Z

Interesting, thank you for the post.

How would I use iconv on Windows with Ruby 1.8.6 and RubyGems on Windows Vista?

Best

The 10th Comment on "Encoding Conversion With iconv"

2014-04-17T15:55:50Z

US-ASCII is a valid subset of UTF-8, so I'm guessing your data just doesn't have any characters with higher-order bits set. If that's true, the file is valid US-ASCII, Latin-1, UTF-8, and more. The file program just went with the simplest answer.
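If you want to check that guess yourself, here is a small sketch. The seven_bit? helper is a name I made up for illustration; it just tests whether any byte has its high bit set:

```ruby
# Hypothetical helper: true when no byte has its high-order bit set,
# meaning the data is valid US-ASCII (and therefore valid UTF-8 too).
def seven_bit?(data)
  data.unpack("C*").all? { |byte| byte < 0x80 }
end

seven_bit?("plain old text")  # => true
seven_bit?("Résumé")          # => false, assuming this file is saved as UTF-8
```

When it returns true, converting from US-ASCII to UTF-8 leaves every byte unchanged, so file has nothing to distinguish the two.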

The 9th Comment on "Encoding Conversion With iconv"

2014-04-17T15:55:50Z

Hi.

I have the following code:

file = File.new("paymul1", "w")
data = Iconv.iconv("utf-8", "us-ascii", ic.to_s).join
file.print data

I expected that the file charset was utf-8. But:

$ file -i paymul1 
paymul1: text/plain charset=us-ascii

the file charset was us-ascii.

The 8th Comment on "Encoding Conversion With iconv"

2014-04-17T15:53:25Z

Iconv is smart and will accept either:

$ iconv --list | grep UTF8
UTF-8 UTF8
UTF-8-MAC UTF8-MAC

The 7th Comment on "Encoding Conversion With iconv"

2014-04-17T15:53:25Z

I believe in your examples you mean UTF-8 instead of UTF8

utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")

should be

utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF-8")

It might be that more recent versions of Ruby are okay with UTF8, but I believe 1.8.x only accepted the UTF-8 form.

The 6th Comment on "Encoding Conversion With iconv"

2014-04-17T15:50:48Z

The iconv library is required, yes. You will want to Google some instructions for installing it on your platform, since I don't know what that is and I'm not an expert on all of them.

However, I don't think that's the issue you are seeing here. Different encodings are supported on different platforms. Try listing the supported encodings as shown in my post and make sure the one you want is in the list.

The 5th Comment on "Encoding Conversion With iconv"

2014-04-17T15:50:48Z

Do I need to install iconv on my PC for this to work?
How do I do that?

I added this:

class String # in environment.rb (last lines)
  require 'iconv' # this line is not needed in Rails
  def to_utf8(encoding) 
    Iconv.conv( 'UTF-8',"#{encoding}", self)
  end
end

but when I use it with encoding=ISO8859-7 it returns blanks!
Any idea?

Thanks

The 4th Comment on "Encoding Conversion With iconv"

2014-03-27T01:38:27Z

cool site.

The 3rd Comment on "Encoding Conversion With iconv"

2014-03-27T01:38:26Z

Couldn't agree more - thanks for a superb series so far. I can't believe I'm up on a Saturday night at 2:50am reading this entire series. This is surprisingly engrossing stuff, really helps to connect the dots. Thanks!

The 2nd Comment on "Encoding Conversion With iconv"

2014-04-17T15:48:22Z

These posts are the most comprehensive treatment on Strings for Ruby. Thank you very much James; you've just saved my Capstone project from doom. :)

The 1st Comment on "Encoding Conversion With iconv"

2014-04-17T15:48:03Z

Thanks for this. I have seen iconv on Unixy systems for a long time and never knew what it was all about. I really appreciate this series. Keep up the good work.

Encoding Conversion With iconv

2014-04-17T19:14:31Z

There's one last standard library we need to discuss for us to have completely covered Ruby 1.8's support for character encodings. The iconv library ships with Ruby and it can handle an impressive set of character encoding conversions.

This is an important piece of the puzzle. You may have accepted my advice that it's OK to just work with UTF-8 data whenever you have the choice, but the fact is that there's a lot of non-UTF-8 data in the world. Legacy systems may have produced data before UTF-8 was popular, some services may work in different encodings for any number of reasons, and not quite everyone has embraced Unicode fully yet. If you run into data like this, you will need a way to convert it to UTF-8 as you import it and possibly a way to convert it back when you export it. That's exactly what iconv does.

Instead of jumping right into Ruby's iconv library, let's come at it with a slightly different approach. iconv is actually a C library that performs these conversions and on most systems where it is installed you will have a command-line interface for it.

It's very easy to use the iconv program. Just always follow these three steps:

Tell iconv the encoding you want it to write data out in, including any special translation instructions
Tell iconv the encoding data will be passed to it in
Send the input into iconv on STDIN (or just list the files as arguments, if you prefer) and redirect iconv's STDOUT to where you want output to be written

For example, let's say I have some UTF-8 data:

$ echo "Résumé" > utf8.txt
$ wc -c utf8.txt 
       9 utf8.txt

My terminal works in UTF-8, so that's the data echo wrote into the file. You can see that it's encoded now because we have nine bytes in the file (one each for "R", "s", "u", "m", and "\n" plus two for each "é").
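If you would rather check that arithmetic from Ruby, here is a quick sketch. It only assumes the script itself is saved as UTF-8, and it uses unpack() so that it doesn't depend on how your Ruby counts characters:

```ruby
utf8 = "Résumé\n"  # the same data echo wrote, assuming a UTF-8 source file

bytes       = utf8.unpack("C*")  # one integer per byte
code_points = utf8.unpack("U*")  # one integer per UTF-8 character

bytes.size        # => 9
code_points.size  # => 7

# Code points above 0x7F take more than one byte in UTF-8.
code_points.select { |cp| cp > 0x7F }  # => [233, 233], the two "é"s
```

Seven characters, two of which need an extra byte each, gives the nine bytes wc reported.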

Here's how we would convert that data to Latin-1 using iconv:

$ iconv -t LATIN1 -f UTF8 < utf8.txt > latin1.txt
$ wc -c latin1.txt 
       7 latin1.txt

You can see the conversion worked, because an "é" is only one byte in Latin-1 and we dropped two bytes.

Note my use of all three steps here:

I used -t LATIN1 to set the to encoding without any special translations
I used -f UTF8 to set the from encoding
I used < utf8.txt to redirect data into the program and > latin1.txt to redirect its output to a file

Those are always the steps as I said before.

You only need to know two more things about iconv. First, iconv supports a truckload of encodings, including all of the common encodings I've been talking about in this series. They vary some on different platforms though, so you will need to check what is available to you:

$ iconv --list
ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US
  ISO_646.IRV:1991 US US-ASCII CSASCII
UTF-8 UTF8
UTF-8-MAC UTF8-MAC
ISO-10646-UCS-2 UCS-2 CSUNICODE
UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
UCS-2LE UNICODELITTLE
ISO-10646-UCS-4 UCS-4 CSUCS4
UCS-4BE
UCS-4LE
UTF-16
…

Each line of that listing shows a single encoding. The space-separated lists on each line are all aliases for that encoding that iconv will accept. Thus that first long line that I had to break into two provides a bunch of aliases for US-ASCII. We can also see by reading down a bit that iconv will accept UTF8 or UTF-8.
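Since the format is that regular, it's easy to pick apart in Ruby. This is only a sketch: the listing is pasted in as a string, aliases_for() is a helper name I invented, and other iconv builds may format their listing differently:

```ruby
# A few lines copied from the listing above, with the first entry
# rejoined onto a single line.
listing = <<END_LISTING
ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US ISO_646.IRV:1991 US US-ASCII CSASCII
UTF-8 UTF8
UTF-8-MAC UTF8-MAC
END_LISTING

# One array of aliases per encoding.
alias_groups = listing.split("\n").map { |line| line.split(" ") }

# Hypothetical helper: find the group of aliases that contains a name.
# The name is upcased only because this listing is uppercase.
def aliases_for(alias_groups, name)
  alias_groups.find { |aliases| aliases.include?(name.upcase) }
end

aliases_for(alias_groups, "utf8")           # => ["UTF-8", "UTF8"]
aliases_for(alias_groups, "US-ASCII").size  # => 11
aliases_for(alias_groups, "KLINGON")        # => nil
```

To work with your own platform's list, you could swap the pasted string for the output of the command itself.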

The last thing to know about iconv is that it has some special translation modes. To see those in action, let's work with a different piece of data:

$ echo "On and on… and on…" > utf8.txt
$ cat utf8.txt 
On and on… and on…

That last character is an ellipsis, or three dots all in one character. Unicode has that character, but Latin-1 does not. Let's see what happens if we try to convert the data now:

$ iconv -f UTF8 -t LATIN1 < utf8.txt > latin1.txt

iconv: (stdin):1:9: cannot convert
$ cat latin1.txt 
On and on

As you can see, I got an error when it reached the first occurrence of the problem character. The cat command also shows that it completely quit working there.

That may be what you need, so you can tell a user you can't work with their data. I often find though that I just need to do the best I can with the data that I have. iconv's translation modes can help with that.

First, you can ask iconv to ignore any characters that cannot be converted to the new encoding:

$ iconv -t LATIN1//IGNORE -f UTF8 < utf8.txt > latin1_wignore.txt
$ cat latin1_wignore.txt 
On and on and on

As you can see, we completed the entire translation that time, only losing the problematic characters. The //IGNORE sequence adds the translation mode. Modes are always specified after the output encoding. That's an improvement for sure, but it's possible to do even better in this case.

iconv has another translation mode where it will try to transliterate characters into an equivalent representation in the target encoding:

$ iconv -t LATIN1//TRANSLIT -f UTF8 < utf8.txt > latin1_wtranslit.txt
$ cat latin1_wtranslit.txt 
On and on... and on...

This time, instead of dropping the ellipsis characters, iconv replaced them with three full stops each. It's not as fancy as the Unicode character, but it gets the job done and we do a good job of keeping the meaning of the data.

//TRANSLIT can't convert absolutely everything you will see in the wild, so it's still possible to get errors when using it. You can combine the modes though by specifying //TRANSLIT//IGNORE. That will ask iconv to transliterate what it can and drop the rest. Note that order does matter there: you need to be sure it tries transliteration before ignoring the character.
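To make the three behaviors concrete, here is a toy model in plain Ruby. It is not how iconv is implemented. The method name and the one-entry transliteration table are invented for illustration, and it leans on the fact that Latin-1 byte values match the first 256 Unicode code points:

```ruby
# Toy model only. translit_table maps a Unicode code point to an ASCII
# stand-in (pass {} to disable transliteration); ignore drops whatever
# is left over instead of raising.
def toy_utf8_to_latin1(utf8, translit_table, ignore)
  bytes = []
  utf8.unpack("U*").each do |code_point|
    if code_point <= 0xFF
      bytes << code_point  # fits in Latin-1 as a single byte
    elsif translit_table.key?(code_point)
      bytes.concat(translit_table[code_point].unpack("C*"))  # transliterate first
    elsif ignore
      next  # then drop what could not be transliterated
    else
      raise ArgumentError, "cannot convert U+%04X" % code_point
    end
  end
  bytes.pack("C*")
end

table = { 0x2026 => "..." }  # hypothetical table: U+2026 is the ellipsis
text  = "On and on… and on…" # assumes this file is saved as UTF-8

toy_utf8_to_latin1(text, table, true)  # => "On and on... and on..."
toy_utf8_to_latin1(text, {}, true)     # => "On and on and on"

begin
  toy_utf8_to_latin1(text, {}, false)
rescue ArgumentError => error
  error.message  # => "cannot convert U+2026"
end
```

The real library knows far more transliterations than my single entry, but the decision order is the point: convert directly, then transliterate, then ignore.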

You can also give iconv specific translations for bytes it has trouble with. I've never needed that level of control though and find the translation modes help me do more with less effort. Have a quick browse through man iconv, if you are curious.

That's all you need to know about iconv. You are now a character conversion expert. Congratulations.

Of course, it would be nice to talk about how this affects Ruby. Let's do that.

The Ruby standard library is just like the program we've been playing with. It just provides a method interface to the underlying C code. To show that, here's the same conversion we started with:

#!/usr/bin/env ruby -wKU

require "iconv"

utf8 = "Résumé"
utf8.size  # => 8

latin1 = Iconv.conv("LATIN1", "UTF8", utf8)
latin1.size  # => 6

You can see that the steps are exactly the same. The first parameter is your target encoding and the second is the encoding your data is currently in. You pass the data to convert in the last parameter and the return value of the call is the result.

If you are going to do several conversions in a row, it's slightly easier to create an Iconv instance and just reuse that:

#!/usr/bin/env ruby -wKU

require "iconv"

utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")

resume = "Résumé"
utf8_to_latin1.iconv(resume).size  # => 6

on_and_on = "On and on… and on…"
utf8_to_latin1.iconv(on_and_on)  # => "On and on... and on..."

That's all there is to it. The new() method builds an object that remembers the encodings you are converting and then you can call iconv() (instead of the conv() class method we used earlier) to convert data.

When things go wrong, the Ruby interface will raise exceptions like Iconv::InvalidEncoding or Iconv::InvalidCharacter. See the documentation for details.

The Ruby 1.8 library does not provide a way to programmatically list the supported encodings, which is one of the big reasons I started off showing you the command-line program instead. You will need to check them there. However, Ruby 1.9 adds a method for this:

$ ruby_dev -r iconv -r pp -ve 'pp Iconv.list'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
[["ANSI_X3.4-1968",
  "ANSI_X3.4-1986",
  "ASCII",
  "CP367",
  "IBM367",
  "ISO-IR-6",
  "ISO646-US",
  "ISO_646.IRV:1991",
  "US",
  "US-ASCII",
  "CSASCII"],
 ["UTF-8", "UTF8"],
…

This concludes our tour of character encoding tools for Ruby 1.8. In later posts, we will take a step back from all of this and examine what the problems with this system are. That will pave the way for us to discuss the new m17n (multilingualization) code in Ruby 1.9.