8 DEC 2008
Encoding Conversion With iconv
There's one last standard library we need to discuss for us to have completely covered Ruby 1.8's support for character encodings. The iconv library ships with Ruby and it can handle an impressive set of character encoding conversions.
This is an important piece of the puzzle. You may have accepted my advice that it's OK to just work with UTF-8 data whenever you have the choice, but the fact is that there's a lot of non-UTF-8 data in the world. Legacy systems may have produced data before UTF-8 was popular, some services may work in different encodings for any number of reasons, and not quite everyone has embraced Unicode fully yet. If you run into data like this, you will need a way to convert it to UTF-8 as you import it and possibly a way to convert it back when you export it. That's exactly what iconv does.
Instead of jumping right into Ruby's iconv library, let's come at it with a slightly different approach. iconv is actually a C library that performs these conversions and on most systems where it is installed you will have a command-line interface for it.
It's very easy to use the iconv program. Just always follow these three steps:
- Tell iconv the encoding you want it to write data out in, including any special translation instructions
- Tell iconv the encoding data will be passed to it in
- Send the input into iconv on STDIN (or just list the files as arguments, if you prefer) and redirect iconv's STDOUT to where you want output to be written
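The three steps can be sketched in Ruby as a small helper that assembles the arguments for the command-line program. The name iconv_argv and its keyword parameters are hypothetical, made up for this illustration. Only the -t and -f flags and the // suffix for translation instructions come from the iconv program itself. The keyword arguments need Ruby 2.1 or later, which is much newer than the Ruby 1.8 this post targets.

```ruby
# Hypothetical helper (not part of any library): builds the argument
# list for the iconv program by following the three steps.
def iconv_argv(to:, from:, mode: nil, files: [])
  # Step 1: the target encoding, plus an optional translation mode suffix.
  target = mode ? "#{to}//#{mode}" : to
  # Step 2: the encoding the input data is in.
  # Step 3: any input files; with none listed, iconv reads STDIN.
  ["iconv", "-t", target, "-f", from, *files]
end

iconv_argv(to: "LATIN1", from: "UTF8", files: ["utf8.txt"])
# => ["iconv", "-t", "LATIN1", "-f", "UTF8", "utf8.txt"]
```

An array like this could be handed to something like Open3.capture2 from Ruby's standard library, assuming the iconv program is installed on the machine.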
For example, let's say I have some UTF-8 data:
$ echo "Résumé" > utf8.txt
$ wc -c utf8.txt
9 utf8.txt
My terminal works in UTF-8, so that's the data echo wrote into the file. You can see that it's encoded now because we have nine bytes in the file (one each for "R", "s", "u", "m", and "\n" plus two for each "é").
Here's how we would convert that data to Latin-1 using iconv:
$ iconv -t LATIN1 -f UTF8 < utf8.txt > latin1.txt
$ wc -c latin1.txt
7 latin1.txt
You can see the conversion worked, because an "é" is only one byte in Latin-1 and we dropped two bytes.
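The byte counts can be double-checked from Ruby. This is a minimal sketch that assumes Ruby 1.9 or later, where String#encode and String#bytesize exist. It does not use the iconv library this post covers, and it uses ISO-8859-1 as the formal name for Latin-1.

```ruby
# "R\u00E9sum\u00E9\n" is "Résumé" plus a newline, written with escapes
# so the string is UTF-8 no matter how this file itself is encoded.
utf8   = "R\u00E9sum\u00E9\n"
latin1 = utf8.encode("ISO-8859-1")

utf8.bytesize    # => 9 (two bytes for each "é")
latin1.bytesize  # => 7 (one byte for each "é")
```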
Note my use of all three steps here:
- I used -t LATIN1 to set the to encoding without any special translations
- I used -f UTF8 to set the from encoding
- I used < utf8.txt to redirect data in and > latin1.txt to redirect data out of the program
Those are always the steps as I said before.
You only need to know two more things about iconv. First, iconv supports a truckload of encodings, including all of the common encodings I've been talking about in this series. They vary some on different platforms though, so you will need to check what is available to you:
$ iconv --list
ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US
ISO_646.IRV:1991 US US-ASCII CSASCII
UTF-8 UTF8
UTF-8-MAC UTF8-MAC
ISO-10646-UCS-2 UCS-2 CSUNICODE
UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
UCS-2LE UNICODELITTLE
ISO-10646-UCS-4 UCS-4 CSUCS4
UCS-4BE
UCS-4LE
UTF-16
…
Each line of that listing shows a single encoding. The space-separated names on each line are all aliases for that encoding that iconv will accept. Thus that first long line that I had to break into two provides a bunch of aliases for US-ASCII. We can also see by reading down a bit that iconv will accept UTF8 or UTF-8.
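To make the alias idea concrete, here is a small Ruby sketch that turns a listing in this format into a lookup table. The alias_table method is a hypothetical name, and treating the first name on each line as the "main" one is my assumption, not something iconv promises. Other iconv builds may also format their list differently, so treat this as an illustration only. The heredoc syntax used here needs Ruby 2.3 or later.

```ruby
# Sample text copied from the listing, with the first encoding's
# aliases rejoined onto a single line.
LISTING = <<~TEXT
  ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US ISO_646.IRV:1991 US US-ASCII CSASCII
  UTF-8 UTF8
  UTF-8-MAC UTF8-MAC
TEXT

# Hypothetical helper: maps every alias to the first name on its line.
def alias_table(listing)
  table = {}
  listing.each_line do |line|
    names = line.split
    next if names.empty?
    names.each { |name| table[name] = names.first }
  end
  table
end

aliases = alias_table(LISTING)
aliases["UTF8"]      # => "UTF-8"
aliases["US-ASCII"]  # => "ANSI_X3.4-1968"
```

On a real system you would feed the method the actual output of iconv --list instead of the sample text.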
The last thing to know about iconv is that it has some special translation modes. To see those in action, let's work with a different piece of data:
$ echo "On and on… and on…" > utf8.txt
$ cat utf8.txt
On and on… and on…
That last character is an ellipsis or three dots all in one character. Unicode has that character, but Latin-1 does not. Let's see what happens if we try to convert the data now:
$ iconv -f UTF8 -t LATIN1 < utf8.txt > latin1.txt
iconv: (stdin):1:9: cannot convert
$ cat latin1.txt
On and on
As you can see, I got an error when it reached the first occurrence of the problem character. The cat command also shows that it completely quit working there.
That may be what you need, so you can tell a user you can't work with their data. I often find though that I just need to do the best I can with the data that I have. iconv's translation modes can help with that.
First, you can ask iconv to ignore any characters that cannot be converted to the new encoding:
$ iconv -t LATIN1//IGNORE -f UTF8 < utf8.txt > latin1_wignore.txt
$ cat latin1_wignore.txt
On and on and on
As you can see, we completed the entire translation that time, only losing the problematic characters. The //IGNORE sequence adds the translation mode. Modes are always specified after the output encoding. That's an improvement for sure, but it's possible to do even better in this case.
iconv has another translation mode where it will try to transliterate characters into an equivalent representation in the target encoding:
$ iconv -t LATIN1//TRANSLIT -f UTF8 < utf8.txt > latin1_wtranslit.txt
$ cat latin1_wtranslit.txt
On and on... and on...
This time, instead of dropping the ellipsis characters, iconv replaced them with three full stops each. It's not as fancy as the Unicode character, but it gets the job done and we do a good job of keeping the meaning of the data.
//TRANSLIT can't convert absolutely everything you will see in the wild, so it's still possible to get errors when using it. You can combine the modes though by specifying //TRANSLIT//IGNORE. That will ask iconv to transliterate what it can and drop the rest. Note that order does matter there: you need to be sure it tries transliteration before ignoring the character.
You can also give iconv specific translations for bytes it has trouble with. I've never needed that level of control though and find the translation modes help me do more with less effort. Have a quick browse through man iconv, if you are curious.
That's all you need to know about iconv. You are now a character conversion expert. Congratulations.
Of course, it would be nice to talk about how this affects Ruby. Let's do that.
Ruby's standard iconv library is just like the program we've been playing with. It just provides a method interface to the underlying C code. To show that, here's the same conversion we started with:
#!/usr/bin/env ruby -wKU
require "iconv"
utf8 = "Résumé"
utf8.size # => 8
latin1 = Iconv.conv("LATIN1", "UTF8", utf8)
latin1.size # => 6
You can see that the steps are exactly the same. The first parameter is your target encoding and the second is the encoding your data is currently in. You pass the data to convert in the last parameter and the return value of the call is the result.
If you are going to do several conversions in a row, it's slightly easier to create an Iconv instance and just reuse that:
#!/usr/bin/env ruby -wKU
require "iconv"
utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")
resume = "Résumé"
utf8_to_latin1.iconv(resume).size # => 6
on_and_on = "On and on… and on…"
utf8_to_latin1.iconv(on_and_on) # => "On and on... and on..."
That's all there is to it. The new() method builds an object that remembers the encodings you are converting and then you can call iconv() (instead of the conv() class method we used earlier) to convert data.
When things go wrong, the Ruby interface will raise exceptions like Iconv::InvalidEncoding or Iconv::InvalidCharacter. See the documentation for details.
The Ruby 1.8 library does not provide a way to programmatically list the supported encodings, which is one of the big reasons I started off showing you the command-line program instead. You will need to check them there. However, Ruby 1.9 adds a method for this:
$ ruby_dev -r iconv -r pp -ve 'pp Iconv.list'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
[["ANSI_X3.4-1968",
"ANSI_X3.4-1986",
"ASCII",
"CP367",
"IBM367",
"ISO-IR-6",
"ISO646-US",
"ISO_646.IRV:1991",
"US",
"US-ASCII",
"CSASCII"],
["UTF-8", "UTF8"],
…
This concludes our tour of character encoding tools for Ruby 1.8. In later posts, we will take a step back from all of this and examine what the problems with this system are. That will pave the way for us to discuss the new m17n (multilingualization) code in Ruby 1.9.
Comments (22)
-
Tim Morgan December 11th, 2008
Thanks for this. I have seen iconv on Unixy systems for a long time and never knew what it was all about. I really appreciate this series. Keep up the good work.
-
These posts are the most comprehensive treatment on Strings for Ruby. Thank you very much James; you've just saved my Capstone project from doom. :)
-
Couldn't agree more - thanks for a superb series so far. I can't believe I'm up on a Saturday night at 2:50am reading this entire series. This is surprisingly engrossing stuff, really helps to connect the dots. Thanks!
-
cool site.
-
Do I need to install iconv onto my PC for this to work? How do I do it??
I added this:
class String # in environment.rb (last lines)
  require 'iconv' # this line is not needed in rails!
  def to_utf8(encoding)
    Iconv.conv('UTF-8', "#{encoding}", self)
  end
end
but when I use it with encoding=ISO8859-7 it returns blanks!!!
any idea??
Thanks
-
The iconv library is required yes. You will want to Google some instructions for installing it on your platform since I don't know what that is and I'm not an expert for all of them.
However, I don't think that's the issue you are seeing here. Different encodings are supported on different platforms. Try listing the supported encodings as shown in my post and make sure the one you want is in the list.
-
-
I believe in your examples you mean UTF-8 instead of UTF8
utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF8")
should be
utf8_to_latin1 = Iconv.new("LATIN1//TRANSLIT//IGNORE", "UTF-8")
It might be that more recent versions of Ruby are okay with UTF8 but I believe 1.8.x only accepted UTF-8 form
-
Iconv is smart and will accept either:
$ iconv --list | grep UTF8
UTF-8 UTF8
UTF-8-MAC UTF8-MAC
-
-
Hi.
I have the following code:
file = File.new("paymul1", "w")
data = Iconv.iconv("utf-8", "us-ascii", ic.to_s).join
file.print data
I expected that the file charset was utf-8. But:
$ file -i paymul1
paymul1: text/plain charset=us-ascii
the file charset was us-ascii.
-
US-ASCII is a valid subset of UTF-8, so I'm guessing your data just doesn't have any characters with higher-order bits set. If that's true, the file is valid US-ASCII, Latin-1, UTF-8, and more. The file program just went with the simplest answer.
-
-
Interesting thank you for the post.
How would I use iconv on Windows with Ruby 1.8.6 and RubyGems on Windows Vista?
Best
-
I'm sorry, but I'm not a Windows guy and thus not the right person to answer that question.
-
Thanks for the reply. I found the Windows installer:
http://gnuwin32.sourceforge.net/packages/libiconv.htm
I am also not a Windows guy ;) but I want our Ruby Software to run on Windows as well. Ruby on Windows is actually better supported than Ruby on OS X as far as I can tell. I find that fact interesting as I have been working on Linux (and Mac) for 10 years now.
-
Thanks for filling me in on the Windows solution.
I don't think many people would agree with you on the current level of Windows support, but I am glad to hear that it is improving.
-
-
-
-
Thank you for this well-written post. I had been using iconv for a project of mine, but didn't know about the //TRANSLIT and //IGNORE options and needed to deal with a small handful of cases where the conversion I wanted was failing. Your post very quickly taught me what I had been looking for. Kind regards and thank you.
-
Just wanted to say "Thank you!" for writing this piece and going through iconv in such depth. I'm doing some conversions, and this was exactly what I needed. Thanks!!
-
I tried the UTF8 on the iconv 1.14 and it did not work. Only UTF-8 worked. And even then it ignored the e'.
iconv -t LATIN1//TRANSLIT -f UTF-8 < utf8.txt > latin1_wtranslit.txt
iconv: (stdin):1:1: cannot convert
Is there something wrong with my iconv (compiled on SUN x86 using gcc)?
-
I'm not sure exactly why you are seeing these oddities. My suggestion is to try listing the available encodings with:
iconv -l
Make sure both of the encodings you are using are in that list.
-
-
Thanks for this very informative post. Best summary of iconv I could find!
-
Indeed very helpful and was also exactly what I needed. Thanks a lot.
-
Help:
I am using Iconv.iconv('ascii//TRANSLIT', 'utf8', value) and I get the warning message:
/usr/lib/ruby/gems/1.9.1/gems/activesupport-3.2.3/lib/active_support/dependencies.rb:251:in `block in require': iconv will be deprecated in the future, use String#encode instead
How do I have to use the method String#encode to do the same?
-
This is a good question. String#encode() doesn't really provide an equivalent option to iconv's //TRANSLIT. You can specify fallbacks, but you can't ask Ruby to intelligently handle the replacements for you.
In this instance, iconv still feels superior to me. It will be interesting to see how that is addressed as this feature is retired.
Of course, they can't really take iconv away from us. We can always just shell out to the command-line program. So, if you want to stick with iconv, you can.
-