<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Gray Soft / Tags / Unicode</title>
  <id>tag:graysoftinc.com,2014-03-20:/tags/Unicode</id>
  <updated>2014-04-18T19:27:32Z</updated>
  <link rel="self" href="http://graysoftinc.com/tags/Unicode/feed.xml"/>
  <link rel="alternate" href="http://graysoftinc.com/tags/Unicode"/>
  <author>
    <name>James Edward Gray II</name>
  </author>
  <entry>
    <title>What Ruby 1.9 Gives Us</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/what-ruby-19-gives-us"/>
    <id>tag:graysoftinc.com,2009-06-18:/posts/83</id>
    <updated>2014-04-18T19:27:32Z</updated>
    <summary>A look at some new possibilities we have thanks to the character encoding savvy nature of Ruby 1.9.</summary>
    <content type="html">&lt;p&gt;In this final post of the series, I want to revisit our earlier discussion on encoding strategies.  Ruby 1.9 adds a lot of power to the handling of character encodings as you have now seen.  We should talk a little about how that can change the game.&lt;/p&gt;

&lt;h4&gt;UTF-8 is Still King&lt;/h4&gt;

&lt;p&gt;The most important thing to take note of is what hasn't changed with Ruby 1.9.  I said a good while back that &lt;a href="/character-encodings/general-encoding-strategies"&gt;the best &lt;code&gt;Encoding&lt;/code&gt; for general use is UTF-8&lt;/a&gt;.  That's still very true.&lt;/p&gt;

&lt;p&gt;I still strongly recommend that we favor UTF-8 as the one-size-almost-fits-all &lt;code&gt;Encoding&lt;/code&gt;.  I really believe that we can and should use it exclusively inside our code, transcode data to it on the way in, and transcode output when we absolutely must.  The more of us who do this, the better things will get.&lt;/p&gt;

&lt;p&gt;As we've discussed earlier in the series, Ruby 1.9 does add some new features that help our UTF-8-only strategies.  For example, you could use things like the &lt;code&gt;Encoding&lt;/code&gt; command-line switches (&lt;code&gt;-E&lt;/code&gt; and &lt;code&gt;-U&lt;/code&gt;) to set up automatic transcoding of all input you read.  These shortcuts are great for simple scripting, but I recommend being explicit about your &lt;code&gt;Encoding&lt;/code&gt;s in any serious code.&lt;/p&gt;

&lt;h4&gt;New Rules&lt;/h4&gt;

&lt;p&gt;Ruby 1.9 gives us a whole new world of power to work with data as we see fit.  As is usually the case though, our new powers come with new responsibilities.  Start building your good Ruby 1.9 habits today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add &lt;a href="/character-encodings/ruby-19s-three-default-encodings"&gt;the magic comment&lt;/a&gt; to the top of all source files&lt;/li&gt;
&lt;li&gt;Explicitly &lt;a href="/character-encodings/ruby-19s-three-default-encodings"&gt;declare the &lt;code&gt;Encoding&lt;/code&gt;s for an &lt;code&gt;IO&lt;/code&gt; object&lt;/a&gt; when you &lt;code&gt;open()&lt;/code&gt; it&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Yes, this adds a little extra work, but the effort is worth it.  Be disciplined in your awareness of &lt;code&gt;Encoding&lt;/code&gt;s and help Ruby know the right way to treat your data.&lt;/p&gt;
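
&lt;p&gt;As a rough illustration of both habits, here is a minimal sketch.  The temporary file and its Latin-1 contents are made up for the example.  The mode &lt;code&gt;"r:ISO-8859-1:UTF-8"&lt;/code&gt; names the external &lt;code&gt;Encoding&lt;/code&gt; (what is on disk) first and the internal one (what the code wants to see) second.&lt;/p&gt;

```ruby
# encoding: UTF-8
require "tempfile"

# Build a throwaway Latin-1 file (hypothetical data, just for this example).
latin1_bytes = "R\xE9sum\xE9\n".b
file = Tempfile.new("latin1_example")
file.close
File.binwrite(file.path, latin1_bytes)

# Name both Encodings on open(): what is on disk, then what we want in memory.
line = File.open(file.path, "r:ISO-8859-1:UTF-8") { |io| io.gets }
file.unlink

line_encoding = line.encoding         # the internal Encoding we asked for
line_text     = line.chomp            # transcoded to UTF-8 on the way in
line_valid    = line.valid_encoding?
```

&lt;p&gt;With both &lt;code&gt;Encoding&lt;/code&gt;s spelled out, the result should not depend on command-line switches or on the environment the script happens to run in.&lt;/p&gt;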

&lt;h4&gt;New Strategies&lt;/h4&gt;

&lt;p&gt;While UTF-8 is a great single choice, Ruby 1.9 gives us some exciting new options for character handling.  I'll give just one example here, to get you thinking in the right direction, but the sky's the limit now and I'm sure we'll see some neat uses of the new system in the coming years.&lt;/p&gt;

&lt;p&gt;When I converted the &lt;code&gt;FasterCSV&lt;/code&gt; code to be the standard &lt;code&gt;CSV&lt;/code&gt; library in Ruby 1.9, I really sat down and thought out how m17n should be handled.  Here are some thoughts that led to my final plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We tend to throw pretty big data at CSV parsers.  We often use them for database dumps, for example.&lt;/li&gt;
&lt;li&gt;I expected to pay a performance penalty for constantly transcoding all incoming data to UTF-8.  I'm not sure how big it would have been, but it's certainly more work than Ruby 1.8 does just reading some bytes.  Naturally, I wanted the library to stay as quick as possible.&lt;/li&gt;
&lt;li&gt;Since the parser has always been able to read directly from any &lt;code&gt;IO&lt;/code&gt; object, those who wanted UTF-8 transcoding already had a way to get it.&lt;/li&gt;
&lt;li&gt;CSV is a super simple format to parse, requiring only four standard characters that you can count on having in any &lt;code&gt;Encoding&lt;/code&gt; Ruby supports.&lt;/li&gt;
&lt;li&gt;Finally, I just wanted to take the m17n features for a spin, of course!&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;All of this combined to form my strategy for the &lt;code&gt;CSV&lt;/code&gt; library:  don't transcode the data, transcode the parser instead.&lt;/p&gt;

&lt;p&gt;If you transcode the data, you pay a penalty at every read.  However, transcoding the parser is just a one-time upfront cost.  The characters will be available in whatever format the data is in and once the parser is transcoded we can just read and parse the data normally.  The fields returned won't have gone through a conversion, unless the user code explicitly sets that up.  This seems to give everyone the choice to have their data the way they want it.&lt;/p&gt;

&lt;p&gt;This process isn't too tough to realize, though it does get a bit tedious in places.  The first step is just to figure out what &lt;code&gt;Encoding&lt;/code&gt; the data is actually in.  Here's the code from 1.9's &lt;code&gt;CSV&lt;/code&gt; library that does that:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="vi"&gt;@encoding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vi"&gt;@io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;respond_to?&lt;/span&gt; &lt;span class="ss"&gt;:internal_encoding&lt;/span&gt;
                &lt;span class="vi"&gt;@io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;internal_encoding&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="vi"&gt;@io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;external_encoding&lt;/span&gt;
              &lt;span class="k"&gt;elsif&lt;/span&gt; &lt;span class="vi"&gt;@io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_a?&lt;/span&gt; &lt;span class="no"&gt;StringIO&lt;/span&gt;
                &lt;span class="vi"&gt;@io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;
              &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="vi"&gt;@encoding&lt;/span&gt; &lt;span class="o"&gt;||=&lt;/span&gt; &lt;span class="no"&gt;Encoding&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_internal&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="no"&gt;Encoding&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_external&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That code just makes sure I set &lt;code&gt;@encoding&lt;/code&gt; to the &lt;code&gt;Encoding&lt;/code&gt; I'm actually going to be working with after all reads.  If an &lt;code&gt;internal_encoding()&lt;/code&gt; is set on an &lt;code&gt;IO&lt;/code&gt;, it will be transcoded into that and that's what I will be facing.  Otherwise, the &lt;code&gt;external_encoding()&lt;/code&gt; is what we will see.  The code can also parse from a &lt;code&gt;String&lt;/code&gt; directly by wrapping it in a &lt;code&gt;StringIO&lt;/code&gt; object.  When it does that, we can just ask the underlying &lt;code&gt;String&lt;/code&gt; what the &lt;code&gt;Encoding&lt;/code&gt; for the data is.  If we can't find an &lt;code&gt;Encoding&lt;/code&gt;, likely because it hasn't been set, we'll use the defaults because that's what Ruby is going to assume as well.&lt;/p&gt;
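
&lt;p&gt;To see that logic outside of &lt;code&gt;CSV&lt;/code&gt;, here is a small sketch.  The &lt;code&gt;detect_encoding()&lt;/code&gt; helper and the temporary file are hypothetical names for this example.  The helper just repeats the steps described above for whatever &lt;code&gt;IO&lt;/code&gt;-like object it is handed.&lt;/p&gt;

```ruby
require "stringio"
require "tempfile"

# Hypothetical helper that mirrors the detection steps described above.
def detect_encoding(io)
  found = if io.respond_to?(:internal_encoding)
            io.internal_encoding || io.external_encoding
          elsif io.is_a?(StringIO)
            io.string.encoding
          end
  found || Encoding.default_internal || Encoding.default_external
end

file = Tempfile.new("detect_example")
file.close

# With an internal Encoding declared, reads are transcoded into it.
transcoded = File.open(file.path, "r:ISO-8859-1:UTF-8") { |io| detect_encoding(io) }

# A String wrapped in StringIO reports the Encoding of the String itself.
wrapped = detect_encoding(StringIO.new("a,b".encode("UTF-16LE")))
file.unlink
```

&lt;p&gt;In both cases the answer is the &lt;code&gt;Encoding&lt;/code&gt; the parser will actually see in the data it reads, which is the one the parser itself needs to be built in.&lt;/p&gt;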

&lt;p&gt;Once we have the &lt;code&gt;Encoding&lt;/code&gt;, we need a couple of helper methods that will build &lt;code&gt;String&lt;/code&gt; and &lt;code&gt;Regexp&lt;/code&gt; objects in that &lt;code&gt;Encoding&lt;/code&gt; for us.  Here are those simple methods:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encode_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vi"&gt;@encoding&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encode_re&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="no"&gt;Regexp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encode_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Those should be super straightforward if you've read &lt;a href="/character-encodings/ruby-19s-string"&gt;my earlier discussion of how transcoding works&lt;/a&gt;.  You can pass &lt;code&gt;encode_str()&lt;/code&gt; one or more &lt;code&gt;String&lt;/code&gt; arguments and it will transcode each one, then &lt;code&gt;join()&lt;/code&gt; them into a complete whole.  The &lt;code&gt;encode_re()&lt;/code&gt; method just wraps &lt;code&gt;encode_str()&lt;/code&gt;, since &lt;code&gt;Regexp.new()&lt;/code&gt; will correctly set the &lt;code&gt;Encoding&lt;/code&gt; based on the &lt;code&gt;Encoding&lt;/code&gt; of the passed &lt;code&gt;String&lt;/code&gt;.&lt;/p&gt;
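
&lt;p&gt;Here is a quick way to try the helpers on their own.  This sketch assumes standalone variants that take the target &lt;code&gt;Encoding&lt;/code&gt; as an argument instead of reading &lt;code&gt;@encoding&lt;/code&gt;.  The Latin-1 sample data is invented for the example.&lt;/p&gt;

```ruby
# Standalone variants of the helpers that take the target Encoding as an
# argument (an assumption for this sketch; CSV reads it from @encoding).
def encode_str(encoding, *chunks)
  chunks.map { |chunk| chunk.encode(encoding.name) }.join
end

def encode_re(encoding, *chunks)
  Regexp.new(encode_str(encoding, *chunks))
end

latin1  = Encoding::ISO_8859_1
pattern = encode_re(latin1, "\\A", "\u00E9", "+")   # one or more leading "é"
data    = "\u00E9\u00E9z".encode(latin1)

pattern_encoding = pattern.encoding   # follows the transcoded String
matched_bytes    = data[pattern].bytes
```

&lt;p&gt;The pattern is built from ordinary literals, but by the time it becomes a &lt;code&gt;Regexp&lt;/code&gt; it is in the same &lt;code&gt;Encoding&lt;/code&gt; as the data, so it can match the Latin-1 bytes directly.&lt;/p&gt;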

&lt;p&gt;Now for the tedious step.  You have to completely avoid using bare &lt;code&gt;String&lt;/code&gt; or &lt;code&gt;Regexp&lt;/code&gt; literals for anything that will eventually interact with the raw data.  For example, here is the code &lt;code&gt;CSV&lt;/code&gt; uses to prepare the parser before it begins reading:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="c1"&gt;# Pre-compiles parsers and stores them by name for access during reads.&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init_parsers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# store the parser behaviors&lt;/span&gt;
  &lt;span class="vi"&gt;@skip_blanks&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:skip_blanks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="vi"&gt;@field_size_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:field_size_limit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# prebuild Regexps for faster parsing&lt;/span&gt;
  &lt;span class="n"&gt;esc_col_sep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;escape_re&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vi"&gt;@col_sep&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;esc_row_sep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;escape_re&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vi"&gt;@row_sep&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;esc_quote&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;escape_re&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vi"&gt;@quote_char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="vi"&gt;@parsers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;# for empty leading fields&lt;/span&gt;
    &lt;span class="n"&gt;leading_fields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;encode_re&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;A(?:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_col_sep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;")+"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;# The Primary Parser&lt;/span&gt;
    &lt;span class="n"&gt;csv_row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;encode_re&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;G(?:&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;A|"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_col_sep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;")"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# anchor the match&lt;/span&gt;
      &lt;span class="s2"&gt;"(?:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                              &lt;span class="c1"&gt;# find quoted fields&lt;/span&gt;
             &lt;span class="s2"&gt;"((?&amp;gt;[^"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"]*)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# "unrolling the loop"&lt;/span&gt;
             &lt;span class="s2"&gt;"(?&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_quote&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                   &lt;span class="c1"&gt;# double for escaping&lt;/span&gt;
             &lt;span class="s2"&gt;"[^"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"]*)*)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="s2"&gt;"|"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                    &lt;span class="c1"&gt;# ... or ...&lt;/span&gt;
             &lt;span class="s2"&gt;"([^"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_col_sep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"]*))"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# unquoted fields&lt;/span&gt;
      &lt;span class="s2"&gt;"(?="&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_col_sep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"|&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;z)"&lt;/span&gt;                    &lt;span class="c1"&gt;# ensure field is ended&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;# a test for unescaped quotes&lt;/span&gt;
    &lt;span class="n"&gt;bad_field&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;encode_re&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;A"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_col_sep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                   &lt;span class="c1"&gt;# an optional comma&lt;/span&gt;
      &lt;span class="s2"&gt;"(?:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                          &lt;span class="c1"&gt;# a quoted field&lt;/span&gt;
             &lt;span class="s2"&gt;"(?&amp;gt;[^"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"]*)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# "unrolling the loop"&lt;/span&gt;
             &lt;span class="s2"&gt;"(?&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_quote&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;# double for escaping&lt;/span&gt;
             &lt;span class="s2"&gt;"[^"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"]*)*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                          &lt;span class="c1"&gt;# the closing quote&lt;/span&gt;
             &lt;span class="s2"&gt;"[^"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;# an extra character&lt;/span&gt;
             &lt;span class="s2"&gt;"|"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                &lt;span class="c1"&gt;# ... or ...&lt;/span&gt;
             &lt;span class="s2"&gt;"[^"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;esc_col_sep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"]+"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# an unquoted field&lt;/span&gt;
             &lt;span class="n"&gt;esc_quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;")"&lt;/span&gt;                      &lt;span class="c1"&gt;# an extra quote&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;# safer than chomp!()&lt;/span&gt;
    &lt;span class="n"&gt;line_end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="n"&gt;encode_re&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;esc_row_sep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s2"&gt;z"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;# illegal unquoted characters&lt;/span&gt;
    &lt;span class="n"&gt;return_newline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;encode_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Don't worry about breaking down those heavily optimized regular expressions.  The point here is just to notice how everything is eventually passed through &lt;code&gt;encode_str()&lt;/code&gt; or &lt;code&gt;encode_re()&lt;/code&gt;.&lt;/p&gt;
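
&lt;p&gt;If you are wondering what goes wrong with a bare literal, this small sketch shows the kind of failure being avoided.  It uses made-up UTF-16LE data, since that &lt;code&gt;Encoding&lt;/code&gt; isn't ASCII-compatible, and compares a plain &lt;code&gt;","&lt;/code&gt; with a transcoded one.&lt;/p&gt;

```ruby
# CSV-like data in an Encoding that is not ASCII-compatible.
data = "a,b".encode("UTF-16LE")

# A bare literal is in the source Encoding, so Ruby refuses to mix it with the data.
bare_literal = begin
  data.include?(",")
  :compatible
rescue Encoding::CompatibilityError
  :incompatible
end

# The same character transcoded to the data's Encoding works as expected.
comma       = ",".encode(data.encoding)
comma_index = data.index(comma)
```

&lt;p&gt;That is all the helper methods really buy us:  every piece of the parser ends up in an &lt;code&gt;Encoding&lt;/code&gt; that is compatible with the data it has to touch.&lt;/p&gt;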

&lt;p&gt;Those were the major changes needed inside the &lt;code&gt;CSV&lt;/code&gt; code to get it to parse natively in the &lt;code&gt;Encoding&lt;/code&gt; of the data.  I did have to add more code due to some side issues I ran into, but they don't really relate to this strategy too much:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Regexp.escape()&lt;/code&gt; didn't work correctly on all the &lt;code&gt;Encoding&lt;/code&gt;s I tested it with.  It's improved a lot since then, but last I checked there were still some oddball &lt;code&gt;Encoding&lt;/code&gt;s it didn't support.  Given that, I had to roll my own.  If you want to see how I did that, &lt;a href="http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/lib/csv.rb?revision=22788&amp;amp;view=markup"&gt;check inside &lt;code&gt;CSV.initialize()&lt;/code&gt;&lt;/a&gt; for how &lt;code&gt;@re_esc&lt;/code&gt; and &lt;code&gt;@re_chars&lt;/code&gt; get set and then have a &lt;a href="http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/lib/csv.rb?revision=22788&amp;amp;view=markup"&gt;look at &lt;code&gt;CSV.escape_re()&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CSV&lt;/code&gt;'s line ending detection reads ahead in the data by fixed byte counts.  That's tricky to do safely with encoded data since you could always land in the middle of a character.  &lt;a href="http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/lib/csv.rb?revision=22788&amp;amp;view=markup"&gt;See &lt;code&gt;CSV.read_to_char()&lt;/code&gt;&lt;/a&gt; for how I work around that issue, if you are interested.&lt;/li&gt;
&lt;li&gt;Finally, testing the code with all the &lt;code&gt;Encoding&lt;/code&gt;s Ruby supports was a bit tricky, due to the concept of "dummy &lt;code&gt;Encoding&lt;/code&gt;s".  &lt;a href="/character-encodings/miscellaneous-m17n-details"&gt;See my discussion on those&lt;/a&gt; for details on how to filter them out of the mix.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Like anything, this strategy had plusses and minuses.  As I've already said, it's a touch tedious to have to avoid normal literals.  The added complexity to the code makes it a little harder to read and maintain.  That's the price you pay.&lt;/p&gt;
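
&lt;p&gt;Going back to the read-ahead issue from the list above, the general idea can be sketched in a few lines.  This is not the code from the library.  The &lt;code&gt;read_whole_chars()&lt;/code&gt; name and its arguments are hypothetical, and it simply keeps reading single bytes until the data is valid in the expected &lt;code&gt;Encoding&lt;/code&gt;.&lt;/p&gt;

```ruby
require "stringio"

# Hypothetical sketch, not the library's code: read a fixed number of bytes,
# then pull in one more byte at a time until the data ends on a character
# boundary (or the input runs out).
def read_whole_chars(io, byte_count, encoding)
  data = io.read(byte_count)
  return nil if data.nil?
  data.force_encoding(encoding)
  until data.valid_encoding?
    extra = io.read(1)
    break if extra.nil?
    data = (data.b + extra.b).force_encoding(encoding)
  end
  data
end

io    = StringIO.new("R\u00E9sum\u00E9".b)   # UTF-8 bytes: 52 C3 A9 73 ...
chunk = read_whole_chars(io, 2, Encoding::UTF_8)
```

&lt;p&gt;Asking for two bytes would normally split the &lt;code&gt;"é"&lt;/code&gt; in half.  Here the extra read finishes the character, so the chunk handed back is safe to treat as text.&lt;/p&gt;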

&lt;p&gt;Still, I think it shows some of the possibilities of what we can accomplish with Ruby's new features.  We can stick to UTF-8 as our one-size-fits-all solution as we've done in the past.  That's still a great idea in most cases.  However, now we have some new options that were impractically hard with an older Ruby.&lt;/p&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
  <entry>
    <title>Miscellaneous M17n Details</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/miscellaneous-m17n-details"/>
    <id>tag:graysoftinc.com,2009-04-15:/posts/82</id>
    <updated>2014-04-18T19:20:43Z</updated>
    <summary>This post covers various side details of the new encoding engine.  It's kind of a grab bag of topics that you should also know about when writing character encoding savvy code.</summary>
    <content type="html">&lt;p&gt;We've now discussed the core of Ruby 1.9's m17n (multilingualization) engine.  &lt;code&gt;String&lt;/code&gt; and &lt;code&gt;IO&lt;/code&gt; are where you will see the big changes.  The new m17n system is a big beast though with a lot of little details.  Let's talk a little about some side topics that also relate to how we work with character encodings in Ruby 1.9.&lt;/p&gt;

&lt;h4&gt;More Features of the Encoding Class&lt;/h4&gt;

&lt;p&gt;You've seen me using &lt;code&gt;Encoding&lt;/code&gt; objects all over the place in my explanations of m17n, but we haven't talked much about them.  They are very simple, mainly just being a named representation of each &lt;code&gt;Encoding&lt;/code&gt; inside Ruby.  As such, &lt;code&gt;Encoding&lt;/code&gt; is a storage place for some tools you may find handy when working with them.&lt;/p&gt;

&lt;p&gt;First, you can receive a &lt;code&gt;list()&lt;/code&gt; of all &lt;code&gt;Encoding&lt;/code&gt; objects Ruby has loaded in the form of an &lt;code&gt;Array&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'puts Encoding.list.first(3), "..."'
ASCII-8BIT
UTF-8
US-ASCII
...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you're just interested in a specific &lt;code&gt;Encoding&lt;/code&gt;, you can &lt;code&gt;find()&lt;/code&gt; it by name:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p Encoding.find("UTF-8")'
#&amp;lt;Encoding:UTF-8&amp;gt;
$ ruby -e 'p Encoding.find("No-Such-Encoding")'
-e:1:in `find': unknown encoding name - No-Such-Encoding (ArgumentError)
    from -e:1:in `&amp;lt;main&amp;gt;'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, Ruby raises an &lt;code&gt;ArgumentError&lt;/code&gt; if it doesn't know about a given &lt;code&gt;Encoding&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Some &lt;code&gt;Encoding&lt;/code&gt; objects also have more than one name.  These &lt;code&gt;aliases()&lt;/code&gt; can be used interchangeably to refer to the same &lt;code&gt;Encoding&lt;/code&gt;.  For example, &lt;code&gt;ASCII&lt;/code&gt; is an alias for &lt;code&gt;US-ASCII&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'puts Encoding.aliases["ASCII"]'
US-ASCII
$ ruby -e 'p Encoding.find("ASCII") == Encoding.find("US-ASCII")' 
true
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;aliases()&lt;/code&gt; method returns a &lt;code&gt;Hash&lt;/code&gt; keyed with the alternate names Ruby knows about.  Each value is the actual &lt;code&gt;Encoding&lt;/code&gt; name that the alias refers to.  You can use either a name or an alias when referring to an &lt;code&gt;Encoding&lt;/code&gt; by name, like with calls to &lt;code&gt;Encoding::find()&lt;/code&gt; or &lt;code&gt;IO::open()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Finally, there's one more gotcha you should be aware of if you're going to write some code that supports a large set of Ruby's &lt;code&gt;Encoding&lt;/code&gt;s.  Ruby ships with a few &lt;code&gt;dummy?()&lt;/code&gt; &lt;code&gt;Encoding&lt;/code&gt;s that don't have character handling completely implemented.  These are used for stateful &lt;code&gt;Encoding&lt;/code&gt;s.  You will want to filter them out of &lt;code&gt;Encoding&lt;/code&gt;s you try to support to avoid running into problems:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'puts "Dummy Encodings:", Encoding.list.select(&amp;amp;:dummy?).map(&amp;amp;:name)'
Dummy Encodings:
ISO-2022-JP
ISO-2022-JP-2
UTF-7
&lt;/code&gt;&lt;/pre&gt;
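
&lt;p&gt;In code that accepts &lt;code&gt;Encoding&lt;/code&gt; names from the outside, the filtering might look something like this sketch.  The &lt;code&gt;usable_encoding()&lt;/code&gt; helper is a hypothetical name.  It combines &lt;code&gt;find()&lt;/code&gt;, the &lt;code&gt;ArgumentError&lt;/code&gt; shown earlier, and &lt;code&gt;dummy?()&lt;/code&gt;.&lt;/p&gt;

```ruby
# Keep only the Encodings with full character handling.
supported = Encoding.list.reject { |encoding| encoding.dummy? }

# Hypothetical helper: look up a name or alias, returning nil for unknown
# names and for dummy Encodings.
def usable_encoding(name)
  encoding = Encoding.find(name)
  encoding.dummy? ? nil : encoding
rescue ArgumentError
  nil
end

utf8     = usable_encoding("UTF-8")
stateful = usable_encoding("ISO-2022-JP")
unknown  = usable_encoding("No-Such-Encoding")
```

&lt;p&gt;Only the first lookup returns an &lt;code&gt;Encoding&lt;/code&gt;.  The other two come back as &lt;code&gt;nil&lt;/code&gt;, one because it is a dummy and one because Ruby has never heard of it.&lt;/p&gt;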

&lt;h4&gt;String Escapes&lt;/h4&gt;

&lt;p&gt;In Ruby 1.8 you would sometimes see byte escapes used to insert raw bytes into a &lt;code&gt;String&lt;/code&gt;.  For example, you can choose to build the &lt;code&gt;String&lt;/code&gt; &lt;code&gt;"…"&lt;/code&gt; with the following byte escapes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -v -KU -e 'p "\xe2\x80\xa6"'
ruby 1.8.6 (2009-03-31 patchlevel 368) [i686-darwin9.6.0]
"…"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The same tactic still works on Ruby 1.9, but remember that &lt;code&gt;Encoding&lt;/code&gt;s are still going to play into this as we've been discussing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat utf8_escapes.rb 
# encoding: UTF-8
str = "\xe2\x80\xa6"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v utf8_escapes.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#&amp;lt;Encoding:UTF-8&amp;gt;, "…", true]
$ cat invalid_escapes.rb 
# encoding: UTF-8
str = "\xe2\x80"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v invalid_escapes.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#&amp;lt;Encoding:UTF-8&amp;gt;, "\xE2\x80", false]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Notice that I got the requested bytes in both cases.  However, those &lt;code&gt;String&lt;/code&gt;s were assigned the source &lt;code&gt;Encoding&lt;/code&gt; as normal.  In the first case, that built a valid UTF-8 &lt;code&gt;String&lt;/code&gt;.  However, the second case is invalid and may later cause me fits as I try to use the &lt;code&gt;String&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There are a couple of exceptions though, where a &lt;code&gt;String&lt;/code&gt; escape can actually change the &lt;code&gt;Encoding&lt;/code&gt; of the literal.  First, you'll likely remember that using a multibyte character is not allowed if you don't change the source &lt;code&gt;Encoding&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat bad_code.rb 
"abc…"
$ ruby -v bad_code.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
bad_code.rb:1: invalid multibyte char (US-ASCII)
bad_code.rb:1: invalid multibyte char (US-ASCII)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;However, a special case is made for &lt;code&gt;\x##&lt;/code&gt; escapes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat ascii_escapes.rb 
puts "Source Encoding:  #{__ENCODING__}"
str = "abc\xe2\x80\xa6"
p [str.encoding, str, str.valid_encoding?]
$ ruby -v ascii_escapes.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
Source Encoding:  US-ASCII
[#&amp;lt;Encoding:ASCII-8BIT&amp;gt;, "abc\xE2\x80\xA6", true]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Notice that the &lt;code&gt;Encoding&lt;/code&gt; of the &lt;code&gt;String&lt;/code&gt; was upgraded to ASCII-8BIT to accommodate the bytes.  We'll talk a lot more about that special &lt;code&gt;Encoding&lt;/code&gt; later in this post, but for now just make note of the fact that this exception gives you an easy way to work with binary data.&lt;/p&gt;

&lt;p&gt;Octal escapes (&lt;code&gt;\###&lt;/code&gt;), control escapes (&lt;code&gt;\cx&lt;/code&gt; or &lt;code&gt;\C-x&lt;/code&gt;), meta escapes (&lt;code&gt;\M-x&lt;/code&gt;), and meta-control escapes (&lt;code&gt;\M-\C-x&lt;/code&gt;) all follow the same rules as the hex escapes (&lt;code&gt;\x##&lt;/code&gt;) we've just been discussing.&lt;/p&gt;
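
&lt;p&gt;As a quick check of that claim for octal escapes, this sketch writes the same three bytes both ways in a UTF-8 source file.  The variable names are just for the example.&lt;/p&gt;

```ruby
# encoding: UTF-8
hex   = "\xe2\x80\xa6"
octal = "\342\200\246"   # the same three bytes, written in octal

same_bytes     = hex.bytes == octal.bytes
octal_encoding = octal.encoding          # still the source Encoding
octal_valid    = octal.valid_encoding?
```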

&lt;p&gt;The other exception is the &lt;code&gt;\u####&lt;/code&gt; escape that can be used to enter Unicode characters by codepoint.  When you use this escape, the &lt;code&gt;String&lt;/code&gt; gets a UTF-8 &lt;code&gt;Encoding&lt;/code&gt; regardless of the current source &lt;code&gt;Encoding&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat ascii_u_escape.rb 
str = "\u2026"
p [str.encoding, str]
$ ruby -v ascii_u_escape.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#&amp;lt;Encoding:UTF-8&amp;gt;, "…"]
$ cat sjis_u_escape.rb 
# encoding: Shift_JIS
str = "\u2026"
p [str.encoding, str]
$ ruby -v sjis_u_escape.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#&amp;lt;Encoding:UTF-8&amp;gt;, "…"]
$ cat utf8_u_escape.rb 
# encoding: UTF-8
str = "\u2026"
p [str.encoding, str]
$ ruby -v utf8_u_escape.rb 
ruby 1.9.1p0 (2009-01-30 revision 21907) [i386-darwin9.6.0]
[#&amp;lt;Encoding:UTF-8&amp;gt;, "…"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Notice how the &lt;code&gt;String&lt;/code&gt; received a UTF-8 &lt;code&gt;Encoding&lt;/code&gt; in all three cases, regardless of the current source &lt;code&gt;Encoding&lt;/code&gt;.  This exception gives you an easy way to work with UTF-8 data, no matter what your native &lt;code&gt;Encoding&lt;/code&gt; is.&lt;/p&gt;

&lt;p&gt;The Unicode escape can be followed by exactly four hex digits as I've shown above, or you can use an alternate form &lt;code&gt;\u{#…}&lt;/code&gt; where you place between one and six hex digits between the braces.  Both forms have the same effect on the &lt;code&gt;String&lt;/code&gt;'s &lt;code&gt;Encoding&lt;/code&gt;.&lt;/p&gt;
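
&lt;p&gt;Here is a small sketch of the two forms side by side.  The codepoints are arbitrary picks for the example.&lt;/p&gt;

```ruby
four_digits = "\u2026"
braced      = "\u{2026}"     # same codepoint, brace form
short       = "\u{e9}"       # fewer than four digits is fine inside braces
astral      = "\u{1F600}"    # more than four digits needs the brace form

same_string     = four_digits == braced
braced_encoding = braced.encoding
short_bytes     = short.bytes
astral_bytesize = astral.bytesize
```

&lt;p&gt;The brace form is the only way to reach codepoints that need more than four hex digits, and it saves you from padding short ones with zeros.&lt;/p&gt;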

&lt;h4&gt;Working with Binary Data&lt;/h4&gt;

&lt;p&gt;Not all data is textual data.  Ruby's &lt;code&gt;String&lt;/code&gt; class can also be used to hold raw byte sequences.  For example, you may want to work with the raw bytes of a PNG image.&lt;/p&gt;

&lt;p&gt;Ruby 1.9 has an &lt;code&gt;Encoding&lt;/code&gt; for this that basically just means "treat my data as raw bytes."  You can think of this &lt;code&gt;Encoding&lt;/code&gt; as a way to shut off character handling and just work with bytes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat raw_bytes.rb 
# encoding: UTF-8
str = "Résumé"
def str.inspect
  { data:     dup,
    encoding: encoding.name,
    chars:    size,
    bytes:    bytesize }.inspect
end
p str
str.force_encoding("BINARY")
p str
$ ruby raw_bytes.rb 
{:data=&amp;gt;"Résumé", :encoding=&amp;gt;"UTF-8", :chars=&amp;gt;6, :bytes=&amp;gt;8}
{:data=&amp;gt;"R\xC3\xA9sum\xC3\xA9", :encoding=&amp;gt;"ASCII-8BIT", :chars=&amp;gt;8, :bytes=&amp;gt;8}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;See how switching the &lt;code&gt;Encoding&lt;/code&gt; (without changing the data) shut off Ruby's concept of characters?  The character count became the same as the byte count and Ruby started giving a more raw version of the &lt;code&gt;inspect()&lt;/code&gt; &lt;code&gt;String&lt;/code&gt; to show those are just bytes.&lt;/p&gt;

&lt;p&gt;If you expected this &lt;code&gt;Encoding&lt;/code&gt; to be called BINARY, you are half right.  As you can see, I could use that name above because it is a valid alias.  Ruby switched to the real name in the &lt;code&gt;inspect()&lt;/code&gt; message though.  Ruby actually refers to the &lt;code&gt;Encoding&lt;/code&gt; as ASCII-8BIT, which leads us to another twist.&lt;/p&gt;

&lt;p&gt;Obviously, there's not really such a thing as "ASCII-8BIT" outside of Ruby.  Even while working with binary data though, it's not uncommon to want to make a check for some simple ASCII pieces.  For example, the first few signature bytes of a PNG image do contain the simple ASCII &lt;code&gt;String&lt;/code&gt; &lt;code&gt;"PNG"&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat png_sig.rb 
sig = "\x89PNG\r\n\C-z\n"
png = /\A.PNG/

p({sig =&amp;gt; sig.encoding.name, png =&amp;gt; png.encoding.name})

if sig =~ png
  puts "This data looks like a PNG image."
end
$ ruby png_sig.rb 
{"\x89PNG\r\n\x1A\n"=&amp;gt;"ASCII-8BIT", /\A.PNG/=&amp;gt;"US-ASCII"}
This data looks like a PNG image.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Ruby makes this possible by making ASCII-8BIT &lt;code&gt;compatible?()&lt;/code&gt; with US-ASCII.  That allows tricks like the above where I validated the PNG signature with a simple US-ASCII &lt;code&gt;Regexp&lt;/code&gt;.  Thus, ASCII-8BIT means ASCII plus some other bytes and you can choose to treat parts of it as ASCII when that helps you work with the data.&lt;/p&gt;
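
&lt;p&gt;The same idea works without a &lt;code&gt;Regexp&lt;/code&gt;.  This is only a sketch:  &lt;code&gt;png_signature?()&lt;/code&gt; is a hypothetical helper of my own, and it leans on &lt;code&gt;String#b&lt;/code&gt;, which I'm assuming is available (it arrived in Ruby 2.0; on 1.9 you would &lt;code&gt;dup()&lt;/code&gt; and &lt;code&gt;force_encoding("BINARY")&lt;/code&gt; instead):&lt;/p&gt;

```ruby
# BINARY is an alias:  both constants refer to the very same Encoding object.
same_object = Encoding::BINARY.equal?(Encoding::ASCII_8BIT)

# Hypothetical helper:  check the ASCII part of the PNG signature in raw bytes.
def png_signature?(bytes)
  raw = bytes.b  # String#b returns a copy tagged as binary
  raw.getbyte(0) == 0x89 and raw.byteslice(1, 3) == "PNG"
end

p same_object
p png_signature?("\x89PNG\r\n\x1A\n")
p png_signature?("GIF89a")
```

&lt;p&gt;Comparing the binary slice to a plain &lt;code&gt;"PNG"&lt;/code&gt; literal works because both sides contain only seven bit ASCII.&lt;/p&gt;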

&lt;p&gt;It's worth noting that Ruby will now fall back to an ASCII-8BIT &lt;code&gt;Encoding&lt;/code&gt; anytime you &lt;code&gt;read()&lt;/code&gt; by bytes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat binary_fallback.rb 
open("ascii.txt", "w+:UTF-8") do |f|
  f.puts "abc"
  f.rewind
  str = f.read(2)
  p [str.encoding.name, str]
end
$ ruby binary_fallback.rb 
["ASCII-8BIT", "ab"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That makes sense, because you could chop up characters when reading by bytes.  If you really need to &lt;code&gt;read()&lt;/code&gt; some bytes but keep your &lt;code&gt;Encoding&lt;/code&gt; you will need to set and validate it manually.  Here's one way you might do something like that:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat read_to_char.rb 
# encoding: UTF-8
open("ascii.txt", "w+:UTF-8") do |f|
  f.puts "Résumé"
  f.rewind
  str = f.read(2)
  until str.dup.force_encoding(f.external_encoding).valid_encoding?
    str &amp;lt;&amp;lt; f.read(1)
  end
  str.force_encoding(f.external_encoding)
  p [str.encoding.name, str]
end
$ ruby read_to_char.rb 
["UTF-8", "Ré"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In that example, I just &lt;code&gt;read()&lt;/code&gt; the fixed bytes I wanted and then push forward byte by byte until my data is valid in the desired &lt;code&gt;Encoding&lt;/code&gt;.  I had to test a &lt;code&gt;dup()&lt;/code&gt; of the data and only &lt;code&gt;force_encoding()&lt;/code&gt; when I was sure I was done reading, because UTF-8 and ASCII-8BIT are not &lt;code&gt;compatible?()&lt;/code&gt; and would have raised &lt;code&gt;Encoding::CompatibilityError&lt;/code&gt; as I was adding on bytes.&lt;/p&gt;
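
&lt;p&gt;If you want to reuse that trick, it could be wrapped up roughly like this.  Treat it as a sketch under assumptions:  &lt;code&gt;read_whole_chars()&lt;/code&gt; is a hypothetical name, I'm using a &lt;code&gt;StringIO&lt;/code&gt; to stand in for a real file, and I've added a guard so the loop stops if the data runs out in the middle of a character:&lt;/p&gt;

```ruby
require "stringio"

# Hypothetical helper:  read byte_count bytes, then keep reading one byte at
# a time until the data is valid in the given Encoding.
def read_whole_chars(io, byte_count, encoding)
  bytes = io.read(byte_count)
  return nil if bytes.nil?  # already at the end of the data

  until bytes.dup.force_encoding(encoding).valid_encoding?
    extra = io.read(1)
    break if extra.nil?     # ran out of data in the middle of a character
    bytes.concat(extra)
  end
  bytes.force_encoding(encoding)
end

io  = StringIO.new("R\u00E9sum\u00E9".b)
str = read_whole_chars(io, 2, Encoding::UTF_8)
p [str.encoding.name, str]
```

&lt;p&gt;When the data ends early, this version hands back whatever bytes it managed to read, so you may still want to check &lt;code&gt;valid_encoding?()&lt;/code&gt; on the result.&lt;/p&gt;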

&lt;p&gt;Working with binary data also requires you to know one more thing about Ruby's &lt;code&gt;IO&lt;/code&gt; objects.  Ruby has a feature where it translates some data you read on Windows.  The translation is super simple:  &lt;code&gt;"\r\n"&lt;/code&gt; sequences read from an &lt;code&gt;IO&lt;/code&gt; object are simplified to a solo &lt;code&gt;"\n"&lt;/code&gt;.  This feature helps make Unix scripts work well on a platform that has different line endings.  It does create a gotcha though:  when you're going to read any non-text data, be it binary data or just a non-ASCII compatible &lt;code&gt;Encoding&lt;/code&gt; like UTF-16, you need to warn Ruby not to do the translation for your code to be properly cross-platform.&lt;/p&gt;

&lt;p&gt;By the way, this isn't new.  This was even true in the Ruby 1.8 era.&lt;/p&gt;

&lt;p&gt;Telling Ruby to treat the data as binary and not perform any translation (again, only active on Windows) is simple.  You can just add a &lt;code&gt;"b"&lt;/code&gt; for binary to your mode &lt;code&gt;String&lt;/code&gt; in a call to &lt;code&gt;open()&lt;/code&gt;.  Thus you would read with something like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;open(path, "rb") do |f|
  # ...
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;or write with code like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;open(path, "wb") do |f|
  # ...
end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you always knew about this quirk and you did a good job of always doing this, give yourself a big pat on the back because you're all set.  If you didn't, you've got a bad habit you'll need to break.  Don't feel too bad about it though.  I've known about this quirk since my Perl (which does the same thing) days and I've always tried to follow it.  However, about ten different bugs were recently filed against one of my libraries that amounted to me missing this &lt;code&gt;"b"&lt;/code&gt; in several places.  It's easy to forget.&lt;/p&gt;

&lt;p&gt;Ruby 1.9 is much more strict about the binary flag.  It's going to complain if you don't add it when it feels it is needed.  For example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat missing_b.rb 
# Ruby 1.9 will let this slide
open("utf_16.txt", "w:UTF-16LE") do |f|
  f.puts "Some data."
end
# but not this
open("utf_16.txt", "r:UTF-16LE") do |f|
  # ...
end
$ ruby missing_b.rb 
missing_b.rb:6:in `initialize': ASCII incompatible encoding needs binmode
                                (ArgumentError)
    from missing_b.rb:6:in `open'
    from missing_b.rb:6:in `&amp;lt;main&amp;gt;'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Of course, this is trivial to fix.  You just have to add the missing &lt;code&gt;"b"&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat with_b.rb 
open("utf_16.txt", "wb:UTF-16LE") do |f|
  f.puts "Some data."
end
open("utf_16.txt", "rb:UTF-16LE") do |f|
  puts f.external_encoding.name
end
$ ruby with_b.rb 
UTF-16LE
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I showed the &lt;code&gt;external_encoding()&lt;/code&gt; there to show that it's exactly what I specified.  However, as a reward for adding in these &lt;code&gt;"b"&lt;/code&gt;'s we've been bad about leaving out in the past, Ruby will now assume you want ASCII-8BIT when you supply the &lt;code&gt;"b"&lt;/code&gt; and not an &lt;code&gt;external_encoding()&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat b_means_binary.rb 
open("utf_16.txt", "r") do |f|
  puts "Inherited from environment:  #{f.external_encoding.name}"
end
open("utf_16.txt", "rb") do |f|
  puts %Q{Using "rb":  #{f.external_encoding.name}}
end
$ ruby b_means_binary.rb 
Inherited from environment:  UTF-8
Using "rb":  ASCII-8BIT
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It's worth noting that Ruby 1.8 accidentally helped train us to leave out the magic &lt;code&gt;"b"&lt;/code&gt;.  For example, you could use &lt;code&gt;IO::read()&lt;/code&gt; to slurp some data, but that method didn't provide a way to indicate that the data was binary.  In truth, you really needed this monster for a safe cross-platform read of binary data:  &lt;code&gt;open(path, "rb") { |f| f.read }&lt;/code&gt;.  It's no surprise that &lt;code&gt;IO::read()&lt;/code&gt; was more common.  &lt;code&gt;IO::readlines()&lt;/code&gt; and &lt;code&gt;IO::foreach()&lt;/code&gt; had the same issue.  The core team has acknowledged these problems with some new additions.  First, you can now pass a &lt;code&gt;Hash&lt;/code&gt; as the final argument to all the methods that open an &lt;code&gt;IO&lt;/code&gt; and use that to set options like &lt;code&gt;:mode&lt;/code&gt; or separately &lt;code&gt;:external_encoding&lt;/code&gt;, &lt;code&gt;:internal_encoding&lt;/code&gt;, and &lt;code&gt;:binmode&lt;/code&gt; (the name for the magic &lt;code&gt;"b"&lt;/code&gt;).  Here are some examples:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;File.read("utf_16.txt", mode: "rb:UTF-16LE")

File.readlines("utf_16.txt", mode: "rb:UTF-16LE")

File.foreach("utf_16.txt", mode: "rb:UTF-16LE") do |line|

end

File.open("utf_16.txt", mode: "rb:UTF-16LE") do |f|

end

open("utf_16.txt", mode: "rb:UTF-16LE") do |f|

end
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As one last shortcut along these lines, the new &lt;code&gt;IO::binread()&lt;/code&gt; method is the same as &lt;code&gt;IO.read(…, mode: "rb:ASCII-8BIT")&lt;/code&gt;.&lt;/p&gt;
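
&lt;p&gt;To see how these reads tag the same bytes, you might try something like the following sketch.  The &lt;code&gt;read_encodings()&lt;/code&gt; helper and the temporary file are my own inventions for illustration, and I'm assuming &lt;code&gt;File.binwrite()&lt;/code&gt; is available (it was added after 1.9.1):&lt;/p&gt;

```ruby
require "tmpdir"

# Hypothetical helper:  report the Encoding each style of read hands back.
def read_encodings(path)
  { binread:  File.binread(path).encoding,
    mode_rb:  File.read(path, mode: "rb").encoding,
    explicit: File.read(path, mode: "rb:UTF-16LE").encoding }
end

# Write known UTF-16LE bytes to a scratch file.
path = File.join(Dir.mktmpdir, "utf_16.txt")
File.binwrite(path, "Some data.".encode("UTF-16LE"))

read_encodings(path).each do |how, encoding|
  puts "#{how}:  #{encoding.name}"
end
```

&lt;p&gt;The first two reads should come back as raw bytes, while the third keeps the &lt;code&gt;Encoding&lt;/code&gt; named in the mode.&lt;/p&gt;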

&lt;h4&gt;Regex Encodings&lt;/h4&gt;

&lt;p&gt;Now that all our data has an &lt;code&gt;Encoding&lt;/code&gt;, it only makes sense that our &lt;code&gt;Regexp&lt;/code&gt; objects would need to be tagged as well.  That is the case, but the rules for how an &lt;code&gt;Encoding&lt;/code&gt; is selected differ for &lt;code&gt;Regexp&lt;/code&gt;.  Let's talk a little about how and why.&lt;/p&gt;

&lt;p&gt;First, let's get the big surprise out of the way:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat re_encoding.rb 
# encoding: UTF-8
utf8_str   = "résumé"
latin1_str = utf8_str.encode("ISO-8859-1")
binary_str = utf8_str.dup.force_encoding("ASCII-8BIT")
utf16_str  = utf8_str.encode("UTF-16BE")

re = /\Ar.sum.\z/
puts "Regexp.encoding.name:  #{re.encoding.name}"

[utf8_str, latin1_str, binary_str, utf16_str].each do |str|
  begin
    result = str =~ re ? "Matches" : "Doesn't match"
  rescue Encoding::CompatibilityError
    result = "Can't match non-ASCII compatible?() Encoding"
  end
  puts "#{result}:  #{str.encoding.name}"
end
$ ruby re_encoding.rb 
Regexp.encoding.name:  US-ASCII
Matches:  UTF-8
Matches:  ISO-8859-1
Doesn't match:  ASCII-8BIT
Can't match non-ASCII compatible?() Encoding:  UTF-16BE
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After we did all that talking about the source &lt;code&gt;Encoding&lt;/code&gt; Ruby goes and ignores it on us.  You can see that the &lt;code&gt;Regexp&lt;/code&gt; was set to US-ASCII instead of the UTF-8 that was in effect at the time.  Surprising though that may be, there is actually a pretty good reason for it.&lt;/p&gt;

&lt;p&gt;My &lt;code&gt;Regexp&lt;/code&gt; literal only contained seven bit ASCII, so Ruby chose to simplify the &lt;code&gt;Encoding&lt;/code&gt;.  If Ruby had left it at the source &lt;code&gt;Encoding&lt;/code&gt; of UTF-8, it would only be useful for checking UTF-8 data.  As it is though, it can now be used to check any ASCII &lt;code&gt;compatible?()&lt;/code&gt; data.  You can see in the output that the expression was tried against three different &lt;code&gt;String&lt;/code&gt;s, because they are all ASCII &lt;code&gt;compatible?()&lt;/code&gt;.  (It did fail to match one since I changed the rules of how to interpret the data and one character became two bytes, but the attempt was still made.)  The fourth match could not be attempted, because UTF-16 is not ASCII &lt;code&gt;compatible?()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Of course, if your &lt;code&gt;Regexp&lt;/code&gt; includes eight bit characters, you use the special escapes that change an &lt;code&gt;Encoding&lt;/code&gt;, or you apply one of the old Ruby 1.8 style &lt;code&gt;Encoding&lt;/code&gt; options, you can get a non-ASCII &lt;code&gt;Encoding&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat encodings.rb 
# encoding: UTF-8
res = [
  /…\z/,       # source Encoding
  /\A\uFEFF/,  # special escape
  /abc/u       # Ruby 1.8 option
]
puts res.map { |re| [re.encoding.name, re.inspect].join(" ") }
$ ruby encodings.rb
UTF-8 /…\z/
UTF-8 /\A\uFEFF/
UTF-8 /abc/
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I used &lt;code&gt;/u&lt;/code&gt; which you will probably remember as a way to get a UTF-8 &lt;code&gt;Regexp&lt;/code&gt; &lt;a href="http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18"&gt;from the old Ruby 1.8 system&lt;/a&gt;.  The &lt;code&gt;/e&lt;/code&gt; (for EUC_JP) and &lt;code&gt;/s&lt;/code&gt; (for a Shift_JIS extension called Windows-31J) options still work too.  Ruby 1.9 also still supports the old &lt;code&gt;/n&lt;/code&gt; option, but &lt;a href="http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/23204"&gt;it has some warning tossing exceptions for legacy reasons&lt;/a&gt; and I recommend just avoiding it going forward.  You can build an ASCII-8BIT &lt;code&gt;Regexp&lt;/code&gt; in another way I'll show in just a moment.&lt;/p&gt;

&lt;p&gt;As of Ruby 1.9.2, this concept of a lenient &lt;code&gt;Regexp&lt;/code&gt;, one that will match any ASCII &lt;code&gt;compatible?()&lt;/code&gt; &lt;code&gt;Encoding&lt;/code&gt;, has a new name:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat fixed_encoding.rb 
[/a/, /a/u].each do |re|
  puts "%-10s %s" % [ re.encoding, re.fixed_encoding? ? "fixed" :
                                                        "not fixed" ]
end
$ ruby fixed_encoding.rb 
US-ASCII   not fixed
UTF-8      fixed
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A &lt;code&gt;fixed_encoding?()&lt;/code&gt; &lt;code&gt;Regexp&lt;/code&gt; is one that will raise an &lt;code&gt;Encoding::CompatibilityError&lt;/code&gt; if matched against any &lt;code&gt;String&lt;/code&gt; that has a different &lt;code&gt;Encoding&lt;/code&gt; from the &lt;code&gt;Regexp&lt;/code&gt; itself, as long as the &lt;code&gt;String&lt;/code&gt; isn't &lt;code&gt;ascii_only?()&lt;/code&gt;.  If &lt;code&gt;fixed_encoding?()&lt;/code&gt; returns &lt;code&gt;false&lt;/code&gt;, the &lt;code&gt;Regexp&lt;/code&gt; can be used against any ASCII &lt;code&gt;compatible?()&lt;/code&gt; &lt;code&gt;Encoding&lt;/code&gt;.  There's also a new constant with this name that can be used to disable the ASCII downgrading:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat force_re_encoding.rb 
puts Regexp.new("abc".force_encoding("UTF-8")).encoding.name
puts Regexp.new( "abc".force_encoding("UTF-8"),
                 Regexp::FIXEDENCODING ).encoding.name
$ ruby force_re_encoding.rb 
US-ASCII
UTF-8
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note how a &lt;code&gt;Regexp&lt;/code&gt; will take the &lt;code&gt;Encoding&lt;/code&gt; of the &lt;code&gt;String&lt;/code&gt; passed to &lt;code&gt;Regexp::new()&lt;/code&gt; when &lt;code&gt;Regexp::FIXEDENCODING&lt;/code&gt; is set.  You can use this combination to build a &lt;code&gt;Regexp&lt;/code&gt; in any &lt;code&gt;Encoding&lt;/code&gt; you need, including the ASCII-8BIT I mentioned earlier.&lt;/p&gt;
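
&lt;p&gt;Here's a small sketch of how the lenient and fixed flavors behave at match time.  The &lt;code&gt;try_match()&lt;/code&gt; helper is hypothetical, something I wrote just to turn the possible outcomes into symbols:&lt;/p&gt;

```ruby
# Hypothetical helper:  report what happens when a Regexp meets a String.
def try_match(re, str)
  (re =~ str) ? :match : :no_match
rescue Encoding::CompatibilityError
  :incompatible
end

latin1 = "r\u00E9sum\u00E9".encode("ISO-8859-1")  # contains eight bit bytes
ascii  = "sum".encode("ISO-8859-1")               # seven bit ASCII only

lenient = /sum/   # plain ASCII, so not fixed_encoding?()
fixed   = /sum/u  # fixed to UTF-8

p try_match(lenient, latin1)
p try_match(fixed, latin1)
p try_match(fixed, ascii)
```

&lt;p&gt;The lenient expression can be tried against the Latin-1 data, while the fixed one can only be tried when the &lt;code&gt;String&lt;/code&gt; is &lt;code&gt;ascii_only?()&lt;/code&gt;.&lt;/p&gt;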

&lt;p&gt;Once your &lt;code&gt;Regexp&lt;/code&gt; is at least compatible with your data's &lt;code&gt;Encoding&lt;/code&gt;, pattern matches function as they always have.  (Well, in truth, Ruby 1.9 brings us a powerful new regular expression engine called Oniguruma, but that's another topic for another time.)  Under average circumstances, Ruby 1.9's &lt;code&gt;Regexp&lt;/code&gt; &lt;code&gt;Encoding&lt;/code&gt; selection rules mean that your expressions are compatible with a lot of data and everything should just work for you.  However, if you end up getting some errors at match time, you may need to abandon the simple &lt;code&gt;/…/&lt;/code&gt; literal and use the new features I've shown to build a &lt;code&gt;Regexp&lt;/code&gt; that perfectly matches your data's &lt;code&gt;Encoding&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;Handling a BOM&lt;/h4&gt;

&lt;p&gt;Some multibyte &lt;code&gt;Encoding&lt;/code&gt;s recommend that data in that &lt;code&gt;Encoding&lt;/code&gt; begin with a &lt;a href="http://en.wikipedia.org/wiki/Byte_order_mark"&gt;Byte Order Mark (also known as a BOM)&lt;/a&gt; indicating the order of the bytes.  UTF-16 is a good example.&lt;/p&gt;

&lt;p&gt;Note that Ruby doesn't even support a UTF-16 &lt;code&gt;Encoding&lt;/code&gt;.  Instead, you must pick between UTF-16BE and UTF-16LE for "Big Endian" or "Little Endian" byte order.  This indicates whether the most significant byte comes first or last:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p "a".encode("UTF-16BE")'
"\x00a"
$ ruby -e 'p "a".encode("UTF-16LE")'
"a\x00"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, when someone goes to read your UTF-16 data back, they'll need to know which byte order you used to get things right.  You could just tell them which order was used the same way you'll probably tell them that the data is UTF-16 encoded.  Or you could add a BOM to the data.&lt;/p&gt;

&lt;p&gt;A Unicode BOM is just the character U+FEFF at the beginning of your data.  There's no such character for the reversed bytes U+FFFE, so this makes it easy to correctly tell the order of the bytes.  Another minor advantage is that this BOM probably indicates you are reading Unicode data.  A lot of software will check for this special start of the data, use it to set the proper byte order, and then pretend it didn't even exist by removing it from the data they show users.&lt;/p&gt;

&lt;p&gt;Ruby 1.9 won't automatically add a BOM to your data, so you're going to need to take care of that if you want one.  Luckily, it's not too tough.  The basic idea is just to print the bytes needed at the beginning of a file.  For example, we can add a BOM to a UTF-16LE file as such:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat utf16_bom.rb 
# encoding: UTF-8
File.open("utf16_bom.txt", "w:UTF-16LE") do |f|
  f.puts "\uFEFFThis is UTF-16LE with a BOM."
end
$ ruby utf16_bom.rb 
$ ruby -e 'p File.binread("utf16_bom.txt")[0..9]'
"\xFF\xFET\x00h\x00i\x00s\x00"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Notice that I just used the Unicode escape to add the BOM character to the data.  Because my output &lt;code&gt;String&lt;/code&gt; was in UTF-8, Ruby had to transcode it to UTF-16LE and that process arranged the bytes correctly for me, as you see in the sample output.&lt;/p&gt;

&lt;p&gt;Reading a BOM is a similar process.  We will need to pull the relevant bytes and see if they match a Unicode BOM.  When they do, we can then start reading again with the &lt;code&gt;Encoding&lt;/code&gt; we matched.  We might code that up like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat read_bom.rb 
class File
  UTFS = [32, 16].map { |b| %w[BE LE].map { |o| "UTF-#{b}#{o}" } }.
                  flatten &amp;lt;&amp;lt; "UTF-8"

  def self.open_using_unicode_bom(path, *args, &amp;amp;blk)
    # check the BOM to find the Encoding
    encoding = UTFS[0..-2].find(lambda { UTFS[-1] }) do |utf|
      bom = "\uFEFF".encode(utf)
      binread(path, bom.bytesize).force_encoding(utf) == bom
    end
    # set the Encoding
    if args.first.nil?
      args &amp;lt;&amp;lt; "r#{'b' unless encoding == UTFS[-1]}:#{encoding}"
    elsif args.first.is_a? Hash
      args.first.merge!(external_encoding: encoding)
    else
      args.first.sub!(/\A([^:]*)/, "\\1:#{encoding}")
    end
    # hand off to open()
    if blk
      open(path, *args) do |f|
        f.read_unicode_bom
        blk[f]
      end
    else
      f = open(path, *args)
      f.read_unicode_bom
      f
    end
  end

  def read_unicode_bom
    bytes = external_encoding.name[/\AUTF-?(\d+)/i, 1].to_i / 8
    read(bytes) if bytes &amp;gt; 1
  end
end

# example usage with the File we created earlier
File.open_using_unicode_bom("utf16_bom.txt") do |f|
  line = f.gets
  p [line.encoding, line[0..3]]
end
$ ruby read_bom.rb 
[#&amp;lt;Encoding:UTF-16LE&amp;gt;, "T\x00h\x00i\x00s\x00"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These examples just deal with Unicode BOMs, but you would handle other BOMs in a similar fashion.  Find out what bytes are needed for your &lt;code&gt;Encoding&lt;/code&gt;, write those out before the data, and later check for them when reading the data back.  The &lt;code&gt;String&lt;/code&gt; escapes we discussed earlier can be handy when writing the bytes and &lt;code&gt;binread()&lt;/code&gt; is equally handy when checking for the BOM.&lt;/p&gt;
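
&lt;p&gt;If all you need is the check itself, without reopening the file, the core of it can be sketched in a few lines.  The &lt;code&gt;BOMS&lt;/code&gt; table and &lt;code&gt;sniff_bom()&lt;/code&gt; are hypothetical names of mine, and I'm again assuming &lt;code&gt;String#b&lt;/code&gt; (Ruby 2.0 and up) is available:&lt;/p&gt;

```ruby
# Longest marks first:  the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
BOMS = %w[UTF-32BE UTF-32LE UTF-16BE UTF-16LE].map do |name|
  [name, "\uFEFF".encode(name).b]
end

# Hypothetical helper:  name the Encoding announced by a BOM, or nil if none.
def sniff_bom(bytes)
  head  = bytes.b  # compare as raw bytes
  found = BOMS.find { |_name, bom| head.start_with?(bom) }
  found ? found.first : nil
end

p sniff_bom("\uFEFFThis".encode("UTF-16LE"))
p sniff_bom("This")
```

&lt;p&gt;One limitation of this sketch:  UTF-16LE data that starts with a BOM followed by a U+0000 character would be mistaken for UTF-32LE.&lt;/p&gt;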

&lt;p&gt;I do recommend including a BOM in Unicode &lt;code&gt;Encoding&lt;/code&gt;s like UTF-16 and UTF-32, but please don't add them to UTF-8 data.  The UTF-8 byte order is part of its specification and it never varies.  Thus you don't need a BOM to read it correctly.  If you add one, you damage one of the great UTF-8 advantages in that it can pass for US-ASCII (assuming it's all seven bit characters).&lt;/p&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
  <entry>
    <title>Ruby 1.9's Three Default Encodings</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/ruby-19s-three-default-encodings"/>
    <id>tag:graysoftinc.com,2009-04-05:/posts/81</id>
    <updated>2014-04-18T18:40:50Z</updated>
    <summary>Now that we've covered String, we need to talk about how String's get their initial Encoding.</summary>
    <content type="html">&lt;p&gt;I suspect early contact with the new m17n (multilingualization) engine is going to come to Rubyists in the form of this error message:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;invalid multibyte char (US-ASCII)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Ruby 1.8 didn't care what you stuck in a random &lt;code&gt;String&lt;/code&gt; literal, but 1.9 is a touch pickier.  I think you'll see that the change is for the better, but we do need to spend some time learning to play by Ruby's new rules.&lt;/p&gt;

&lt;p&gt;That takes us to the first of Ruby's three default &lt;code&gt;Encoding&lt;/code&gt;s.&lt;/p&gt;

&lt;h4&gt;The Source Encoding&lt;/h4&gt;

&lt;p&gt;In Ruby's new grown up world of all encoded data, each and every &lt;code&gt;String&lt;/code&gt; needs an &lt;code&gt;Encoding&lt;/code&gt;.  That means an &lt;code&gt;Encoding&lt;/code&gt; must be selected for a &lt;code&gt;String&lt;/code&gt; as soon as it is created.  One way that a &lt;code&gt;String&lt;/code&gt; can be created is for Ruby to execute some code with a &lt;code&gt;String&lt;/code&gt; literal in it, like this:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="n"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"A new String"&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That's a pretty simple &lt;code&gt;String&lt;/code&gt;, but what if I use a literal like the following instead?&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="n"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Résumé"&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;What &lt;code&gt;Encoding&lt;/code&gt; is that in?  That fundamental question is probably the main reason we all struggle a bit with character encodings.  You can't tell just from looking at that data what &lt;code&gt;Encoding&lt;/code&gt; it is in.  Now, if I showed you the bytes you may be able to make an educated guess, but the data just isn't wearing an &lt;code&gt;Encoding&lt;/code&gt; name tag.&lt;/p&gt;

&lt;p&gt;That's true of a frightening lot of data we deal with every day.  A plain text file doesn't generally say what &lt;code&gt;Encoding&lt;/code&gt; the data inside is in.  When you think about that, it's a miracle we can successfully read a lot of things.&lt;/p&gt;

&lt;p&gt;When we're talking about program code, the problem gets worse.  I may want to write my code in UTF-8, but some Japanese programmer may want to write his code in Shift JIS.  Ruby should support that and, in fact, 1.9 does.  Let's complicate things a bit more though:  imagine that I bundle up that UTF-8 code I wrote in a gem and the Japanese programmer later uses it to help with his Shift JIS code.  How do we make that work seamlessly?&lt;/p&gt;

&lt;p&gt;The Ruby 1.8 strategy of one global variable won't survive a test like this, so it was time to switch strategies.  Ruby 1.9's answer to this problem is the source &lt;code&gt;Encoding&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;All Ruby source code now has some &lt;code&gt;Encoding&lt;/code&gt;.  When you create a &lt;code&gt;String&lt;/code&gt; literal in your code, it is assigned the &lt;code&gt;Encoding&lt;/code&gt; of your source.  That simple rule solves all the problems I just described pretty nicely.  As long as my source &lt;code&gt;Encoding&lt;/code&gt; is UTF-8 and the Japanese programmer's source &lt;code&gt;Encoding&lt;/code&gt; is Shift JIS, my literals will work as I expect and his will work as he expects.  Obviously if we share any data, we will need to establish some rules about our shared formats using documentation or code that can adapt to different &lt;code&gt;Encoding&lt;/code&gt;s, but we should have been doing that all along anyway.&lt;/p&gt;

&lt;p&gt;Thus the only question becomes, what's my source &lt;code&gt;Encoding&lt;/code&gt; and how do I change it?&lt;/p&gt;

&lt;p&gt;There are a few different ways Ruby can select a source &lt;code&gt;Encoding&lt;/code&gt;.  Here are the options:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat no_encoding.rb 
p __ENCODING__
$ ruby no_encoding.rb 
#&amp;lt;Encoding:US-ASCII&amp;gt;

$ cat magic_comment.rb 
# encoding: UTF-8
p __ENCODING__
$ ruby magic_comment.rb 
#&amp;lt;Encoding:UTF-8&amp;gt;
$ cat magic_comment2.rb 
#!/usr/bin/env ruby -w
# encoding: UTF-8
p __ENCODING__
$ ruby magic_comment2.rb 
#&amp;lt;Encoding:UTF-8&amp;gt;

$ echo $LC_CTYPE
en_US.UTF-8
$ ruby -e 'p __ENCODING__'
#&amp;lt;Encoding:UTF-8&amp;gt;

$ ruby -KU no_encoding.rb 
#&amp;lt;Encoding:UTF-8&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first example shows us two important things.  The first is the main rule of source &lt;code&gt;Encoding&lt;/code&gt;s:  source files receive a US-ASCII &lt;code&gt;Encoding&lt;/code&gt;, unless you say otherwise.  &lt;em&gt;[&lt;strong&gt;Update&lt;/strong&gt;:  this was changed to UTF-8 in Ruby 2.0 and up.]&lt;/em&gt;  This is where I expect programmers to run into the error I mentioned earlier.  If you place any non-ASCII content in a &lt;code&gt;String&lt;/code&gt; literal without changing the source &lt;code&gt;Encoding&lt;/code&gt;, Ruby will die with that error.  Thus you need to change the source &lt;code&gt;Encoding&lt;/code&gt; to work with any non-ASCII data.  The second thing we see here is the new &lt;code&gt;__ENCODING__&lt;/code&gt; keyword that can be used to get the source &lt;code&gt;Encoding&lt;/code&gt; that's active where it is executed.&lt;/p&gt;

&lt;p&gt;The second example shows the preferred way to set your source &lt;code&gt;Encoding&lt;/code&gt; and it's called a magic comment.  If the first line of your code is a comment that includes the word &lt;code&gt;coding&lt;/code&gt;, followed by a colon and space, and then an &lt;code&gt;Encoding&lt;/code&gt; name, the source &lt;code&gt;Encoding&lt;/code&gt; for that file is changed to the indicated &lt;code&gt;Encoding&lt;/code&gt;.  If your code has a shebang line, the magic comment must come on the second line, with no spacing between them.  Once set, all &lt;code&gt;String&lt;/code&gt; literals you create in that file will have that &lt;code&gt;Encoding&lt;/code&gt; attached to them.&lt;/p&gt;

&lt;p&gt;The third example shows an exception to the rule for your convenience.  When you feed Ruby some code on the command-line using the &lt;code&gt;-e&lt;/code&gt; switch, it gets a source &lt;code&gt;Encoding&lt;/code&gt; from your environment.  I have UTF-8 set in the &lt;code&gt;LC_CTYPE&lt;/code&gt; environment variable, but some people also use the &lt;code&gt;LANG&lt;/code&gt; variable for this.  This makes scripting easier since Ruby will (hopefully) match the &lt;code&gt;Encoding&lt;/code&gt; of any other commands you chain together.&lt;/p&gt;

&lt;p&gt;The fourth example is another interesting exception to the rule.  Ruby 1.9 still supports the &lt;code&gt;-K*&lt;/code&gt; style switches from Ruby 1.8 including the &lt;a href="http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/19552"&gt;&lt;code&gt;-KU&lt;/code&gt; switch&lt;/a&gt; I've recommended so heavily in this series.  These switches have a couple of effects, but of particular note they are the only non-magic comment way to modify the source &lt;code&gt;Encoding&lt;/code&gt;.  This is good news for backwards compatibility, because some Ruby 1.8 code may be able to run on Ruby 1.9 without &lt;code&gt;Encoding&lt;/code&gt; problems thanks to this.  I must stress that this is just for backwards compatibility though, and magic comments are the future.&lt;/p&gt;

&lt;p&gt;With magic comments the code will include its &lt;code&gt;Encoding&lt;/code&gt; data.  It will probably seem a little tedious to add them to all your source files at first, but it's really not that big of a change.  In the past, I've recommended we stick the following shebang line at the top of our files:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="c1"&gt;#!/usr/bin/env ruby -wKU&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, for Ruby 1.9, I'm recommending we switch to something like this:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="c1"&gt;#!/usr/bin/env ruby -w&lt;/span&gt;
&lt;span class="c1"&gt;# encoding: UTF-8&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that the magic comment format rules are pretty loose and all of the following examples would work the same:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="c1"&gt;# encoding: UTF-8&lt;/span&gt;

&lt;span class="c1"&gt;# coding: UTF-8&lt;/span&gt;

&lt;span class="c1"&gt;# -*- coding: UTF-8 -*-&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is nice for support in some text editors that also read such comments.&lt;/p&gt;

&lt;p&gt;If we all get into that habit of adding magic comments, our code can work together regardless of the various &lt;code&gt;Encoding&lt;/code&gt;s we personally favor.  Ruby will know how to handle each separate file.  As an added bonus, we programmers also get to see these comments and know more about the code we are working with.  That makes it a good habit to get into, I think.&lt;/p&gt;
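
&lt;p&gt;If you want to check a pile of files for magic comments, a rough approximation of the rule might look like this.  This is only a sketch:  &lt;code&gt;magic_comment_encoding()&lt;/code&gt; is a hypothetical helper, and its pattern is my simplification of the formats shown above, not a copy of what Ruby's parser really does:&lt;/p&gt;

```ruby
# Hypothetical helper:  find the Encoding named in a magic comment, if any.
# Only the first line is checked, or the second when the first is a shebang.
def magic_comment_encoding(source)
  lines = source.lines
  line  = lines[0].to_s.start_with?("#!") ? lines[1] : lines[0]
  return nil unless line.to_s.lstrip.start_with?("#")
  line[/coding[:=]\s*([\w.-]+)/, 1]
end

p magic_comment_encoding("# encoding: UTF-8\np __ENCODING__\n")
p magic_comment_encoding("#!/usr/bin/env ruby -w\n# coding: Shift_JIS\n")
p magic_comment_encoding("p __ENCODING__\n")
```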

&lt;h4&gt;The Default External and Internal Encodings&lt;/h4&gt;

&lt;p&gt;There's another way &lt;code&gt;String&lt;/code&gt;s are commonly created and that's by reading from some &lt;code&gt;IO&lt;/code&gt; object.  It doesn't make sense to give those &lt;code&gt;String&lt;/code&gt;s the source &lt;code&gt;Encoding&lt;/code&gt; because the external data doesn't have to be related to your source code.  Also, you really need to know how data is encoded to read it correctly.  Even a simple concept like reading the next line of data changes if you are talking about UTF-8 or UTF-16LE (the LE stands for a &lt;a href="http://en.wikipedia.org/wiki/Endianness"&gt;Little Endian byte order&lt;/a&gt;) data.  Thus, it makes sense for &lt;code&gt;IO&lt;/code&gt; objects to have at least one &lt;code&gt;Encoding&lt;/code&gt; attached to them.  Ruby 1.9 is generous and gives them two:  the external &lt;code&gt;Encoding&lt;/code&gt; and the internal &lt;code&gt;Encoding&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The external &lt;code&gt;Encoding&lt;/code&gt; is the &lt;code&gt;Encoding&lt;/code&gt; of the data as it sits inside the &lt;code&gt;IO&lt;/code&gt; object.  It affects how the data is read, and it is also the &lt;code&gt;Encoding&lt;/code&gt; the data will be returned in as long as the internal &lt;code&gt;Encoding&lt;/code&gt; isn't set (more on that in a bit).  Let's look at an example of how this plays out in practice:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat show_external.rb 
open(__FILE__, "r:UTF-8") do |file|
  puts file.external_encoding.name
  p    file.internal_encoding
  file.each do |line|
    p [line.encoding.name, line]
  end
end
$ ruby show_external.rb 
UTF-8
nil
["UTF-8", "open(__FILE__, \"r:UTF-8\") do |file|\n"]
["UTF-8", "  puts file.external_encoding.name\n"]
["UTF-8", "  p    file.internal_encoding\n"]
["UTF-8", "  file.each do |line|\n"]
["UTF-8", "    p [line.encoding.name, line]\n"]
["UTF-8", "  end\n"]
["UTF-8", "end\n"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are four things to notice in this example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I set the external &lt;code&gt;Encoding&lt;/code&gt; by tacking &lt;code&gt;:UTF-8&lt;/code&gt; onto the end of my mode &lt;code&gt;String&lt;/code&gt; when I opened the &lt;code&gt;File&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You can use &lt;code&gt;external_encoding()&lt;/code&gt; to check the external &lt;code&gt;Encoding&lt;/code&gt; as I have here&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;internal_encoding()&lt;/code&gt; works the same for the internal &lt;code&gt;Encoding&lt;/code&gt;, which will be &lt;code&gt;nil&lt;/code&gt; unless you explicitly set it&lt;/li&gt;
&lt;li&gt;Note how each &lt;code&gt;String&lt;/code&gt; created as I read the data is given the &lt;code&gt;external_encoding()&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;The internal &lt;code&gt;Encoding&lt;/code&gt; just adds one more twist.  When set, data will still be read in the external &lt;code&gt;Encoding&lt;/code&gt;, but transcoded to the internal &lt;code&gt;Encoding&lt;/code&gt; as the &lt;code&gt;String&lt;/code&gt; is created.  It's a convenience for you as the programmer.  Watch how that changes things:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat show_internal.rb 
open(__FILE__, "r:UTF-8:UTF-16LE") do |file|
  puts file.external_encoding.name
  puts file.internal_encoding.name
  file.each do |line|
    p [line.encoding.name, line[0..3]]
  end
end
$ ruby show_internal.rb 
UTF-8
UTF-16LE
["UTF-16LE", "o\x00p\x00e\x00n\x00"]
["UTF-16LE", " \x00 \x00p\x00u\x00"]
["UTF-16LE", " \x00 \x00p\x00u\x00"]
["UTF-16LE", " \x00 \x00f\x00i\x00"]
["UTF-16LE", " \x00 \x00 \x00 \x00"]
["UTF-16LE", " \x00 \x00e\x00n\x00"]
["UTF-16LE", "e\x00n\x00d\x00\n\x00"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are a couple of differences here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A second added &lt;code&gt;Encoding&lt;/code&gt; on the mode &lt;code&gt;String&lt;/code&gt; (my &lt;code&gt;:UTF-16LE&lt;/code&gt; in this example) sets the &lt;code&gt;internal_encoding()&lt;/code&gt; as I show with the second &lt;code&gt;puts()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;This little change gets Ruby to translate all of the data for me (I just shortened the output because UTF-16LE is noisy)&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;The external &lt;code&gt;Encoding&lt;/code&gt; works the same when writing.  It still represents the &lt;code&gt;Encoding&lt;/code&gt; in the &lt;code&gt;IO&lt;/code&gt; object, or the &lt;code&gt;Encoding&lt;/code&gt; data is going to.  However, you don't need to specify an internal &lt;code&gt;Encoding&lt;/code&gt; when writing.  Ruby will automatically use the &lt;code&gt;Encoding&lt;/code&gt; of a &lt;code&gt;String&lt;/code&gt; you output as the internal &lt;code&gt;Encoding&lt;/code&gt; and transcode as needed to reach the external &lt;code&gt;Encoding&lt;/code&gt;.  For example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat write_internal.rb 
# encoding: UTF-8
open("data.txt", "w:UTF-16LE") do |file|
  puts file.external_encoding.name
  p    file.internal_encoding
  data = "My data…"
  p [data.encoding.name, data]
  file &amp;lt;&amp;lt; data
end
p File.read("data.txt")
$ ruby write_internal.rb 
UTF-16LE
nil
["UTF-8", "My data…"]
"M\x00y\x00 \x00d\x00a\x00t\x00a\x00&amp;amp; "
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note how my data was transcoded before it was written even though the &lt;code&gt;internal_encoding()&lt;/code&gt; was &lt;code&gt;nil&lt;/code&gt;.  Ruby used the &lt;code&gt;String&lt;/code&gt;'s &lt;code&gt;Encoding&lt;/code&gt; to decide what was needed.&lt;/p&gt;

&lt;p&gt;Both of those &lt;code&gt;IO&lt;/code&gt; &lt;code&gt;Encoding&lt;/code&gt;s should be pretty straightforward.  The only question left about them is:  what happens if you don't set them?  The answer is that the &lt;code&gt;IO&lt;/code&gt; inherits the default external &lt;code&gt;Encoding&lt;/code&gt; and/or the default internal &lt;code&gt;Encoding&lt;/code&gt; whenever one isn't set.  Now we need to know how Ruby chooses those defaults.&lt;/p&gt;

&lt;p&gt;The default external &lt;code&gt;Encoding&lt;/code&gt; is pulled from your environment, much like the source &lt;code&gt;Encoding&lt;/code&gt; is for code given on the command-line.  Have a look:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ echo $LC_CTYPE
en_US.UTF-8
$ ruby -e 'puts Encoding.default_external.name'
UTF-8
$ LC_CTYPE=ja_JP.sjis ruby -e 'puts Encoding.default_external.name'
Shift_JIS
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The default internal &lt;code&gt;Encoding&lt;/code&gt; is simply &lt;code&gt;nil&lt;/code&gt;.  You must actively change it to get anything else.&lt;/p&gt;

&lt;p&gt;Both default &lt;code&gt;IO&lt;/code&gt; &lt;code&gt;Encoding&lt;/code&gt;s have a global setter:  &lt;code&gt;Encoding.default_external=()&lt;/code&gt; and &lt;code&gt;Encoding.default_internal=()&lt;/code&gt;.  You can set them to an &lt;code&gt;Encoding&lt;/code&gt; object or just the &lt;code&gt;String&lt;/code&gt; name of an &lt;code&gt;Encoding&lt;/code&gt;.&lt;/p&gt;
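
&lt;p&gt;Here's a small sketch of both argument styles.  The variable names are my own, and I put the original value back at the end because the setting is global:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;original = Encoding.default_external

Encoding.default_external = "Shift_JIS"           # by String name
by_name = Encoding.default_external

Encoding.default_external = Encoding::ISO_8859_1  # by Encoding object
by_object = Encoding.default_external

Encoding.default_external = original              # undo the global change

p [by_name.name, by_object.name]  # prints ["Shift_JIS", "ISO-8859-1"]
&lt;/code&gt;&lt;/pre&gt;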

&lt;p&gt;You can also change these default &lt;code&gt;Encoding&lt;/code&gt;s using some command-line switches.  The new &lt;code&gt;-E&lt;/code&gt; switch can be used to set one or both of the &lt;code&gt;IO&lt;/code&gt; &lt;code&gt;Encoding&lt;/code&gt;s:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p [Encoding.default_external, Encoding.default_internal]'
[#&amp;lt;Encoding:UTF-8&amp;gt;, nil]
$ ruby -E Shift_JIS \
&amp;gt; -e 'p [Encoding.default_external, Encoding.default_internal]'
[#&amp;lt;Encoding:Shift_JIS&amp;gt;, nil]
$ ruby -E :UTF-16LE \
&amp;gt; -e 'p [Encoding.default_external, Encoding.default_internal]'
[#&amp;lt;Encoding:UTF-8&amp;gt;, #&amp;lt;Encoding:UTF-16LE&amp;gt;]
$ ruby -E Shift_JIS:UTF-16LE \
&amp;gt; -e 'p [Encoding.default_external, Encoding.default_internal]'
[#&amp;lt;Encoding:Shift_JIS&amp;gt;, #&amp;lt;Encoding:UTF-16LE&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, the argument for this switch is just like what you would append to a mode &lt;code&gt;String&lt;/code&gt; in a call to &lt;code&gt;File.open()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There's one more command-line switch shortcut for those of us who prefer to just use UTF-8 everywhere.  The new &lt;code&gt;-U&lt;/code&gt; switch sets &lt;code&gt;Encoding.default_internal()&lt;/code&gt; to UTF-8.  Using that, you can just set the external &lt;code&gt;Encoding&lt;/code&gt; for your &lt;code&gt;IO&lt;/code&gt; objects, or let it default from your environment, and all &lt;code&gt;String&lt;/code&gt;s you read will be transcoded to the preferred UTF-8.&lt;/p&gt;
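
&lt;p&gt;The sketch below imitates that setup from inside a script by assigning &lt;code&gt;Encoding.default_internal&lt;/code&gt; directly, which is only my stand-in for passing &lt;code&gt;-U&lt;/code&gt;.  It assumes Ruby 2.1 or later for &lt;code&gt;Tempfile.create()&lt;/code&gt;, and the file contents and variable names are invented for the example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require "tempfile"

saved_internal = Encoding.default_internal
line_info      = nil

Tempfile.create("latin1") do |file|
  file.close
  File.binwrite(file.path, "R\xE9sum\xE9\n".b)  # raw Latin-1 bytes

  Encoding.default_internal = "UTF-8"           # stand-in for the -U switch
  File.open(file.path, "r:ISO-8859-1") do |io|  # only the external is given
    line      = io.gets
    line_info = [ io.external_encoding.name,
                  io.internal_encoding.name,
                  line.encoding.name,
                  line.size ]
  end
end

Encoding.default_internal = saved_internal      # undo the global change
p line_info  # prints ["ISO-8859-1", "UTF-8", "UTF-8", 7]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;IO&lt;/code&gt; object picked up UTF-8 as its internal &lt;code&gt;Encoding&lt;/code&gt; without my asking, so the Latin-1 line arrived as seven UTF-8 characters.&lt;/p&gt;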

&lt;p&gt;Probably the most important thing to note about &lt;code&gt;Encoding.default_external()&lt;/code&gt; and &lt;code&gt;Encoding.default_internal()&lt;/code&gt; is that you should really just treat them as shortcuts for your own scripting.  Pulling &lt;code&gt;Encoding&lt;/code&gt;s from the environment or command-line switches can be handy when you're in control of where the code runs, but you're going to need to be more explicit for code you intend for others to run.  When in doubt, set the external and internal &lt;code&gt;Encoding&lt;/code&gt;s the way you want them for each &lt;code&gt;IO&lt;/code&gt; object.  It's a bit more tedious, but also safer in that it won't mysteriously be changed by some outside force.  Also remember that the defaults are global settings affecting all loaded code, including any libraries you &lt;code&gt;require()&lt;/code&gt;.  That can be a boon or bane, so just remember to factor it into your thinking when you're wondering, "Where does this &lt;code&gt;String&lt;/code&gt; get its &lt;code&gt;Encoding&lt;/code&gt; from?"&lt;/p&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
  <entry>
    <title>Encoding Conversion With iconv</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/encoding-conversion-with-iconv"/>
    <id>tag:graysoftinc.com,2008-12-08:/posts/72</id>
    <updated>2014-04-17T19:14:31Z</updated>
    <summary>This article covers the Ruby 1.8 system of converting between character encodings.</summary>
    <content type="html">&lt;p&gt;There's one last standard library we need to discuss for us to have completely covered Ruby 1.8's support for character encodings.  The &lt;code&gt;iconv&lt;/code&gt; library ships with Ruby and it can handle an impressive set of character encoding conversions.&lt;/p&gt;

&lt;p&gt;This is an important piece of the puzzle.  You may have accepted my advice that it's OK to just work with UTF-8 data whenever you have the choice, but the fact is that there's a lot of non-UTF-8 data in the world.  Legacy systems may have produced data before UTF-8 was popular, some services may work in different encodings for any number of reasons, and not quite everyone has embraced Unicode fully yet.  If you run into data like this, you will need a way to convert it to UTF-8 as you import it and possibly a way to convert it back when you export it. That's exactly what &lt;code&gt;iconv&lt;/code&gt; does.&lt;/p&gt;

&lt;p&gt;Instead of jumping right into Ruby's &lt;code&gt;iconv&lt;/code&gt; library, let's come at it with a slightly different approach.  &lt;code&gt;iconv&lt;/code&gt; is actually a C library that performs these conversions and on most systems where it is installed you will have a command-line interface for it.&lt;/p&gt;

&lt;p&gt;It's very easy to use the &lt;code&gt;iconv&lt;/code&gt; program.  Just always follow these three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tell &lt;code&gt;iconv&lt;/code&gt; the encoding you want it to write data out in, including any special translation instructions&lt;/li&gt;
&lt;li&gt;Tell &lt;code&gt;iconv&lt;/code&gt; the encoding data will be passed to it in&lt;/li&gt;
&lt;li&gt;Send the input into &lt;code&gt;iconv&lt;/code&gt; on &lt;code&gt;STDIN&lt;/code&gt; (or just list the files as arguments, if you prefer) and redirect &lt;code&gt;iconv&lt;/code&gt;'s &lt;code&gt;STDOUT&lt;/code&gt; to where you want output to be written&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;For example, let's say I have some UTF-8 data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ echo "Résumé" &amp;gt; utf8.txt
$ wc -c utf8.txt 
       9 utf8.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;My terminal works in UTF-8, so that's the data &lt;code&gt;echo&lt;/code&gt; wrote into the file.  You can see that it's encoded now because we have nine bytes in the file (one each for &lt;code&gt;"R"&lt;/code&gt;, &lt;code&gt;"s"&lt;/code&gt;, &lt;code&gt;"u"&lt;/code&gt;, &lt;code&gt;"m"&lt;/code&gt;, and &lt;code&gt;"\n"&lt;/code&gt; plus two for each &lt;code&gt;"é"&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Here's how we would convert that data to Latin-1 using &lt;code&gt;iconv&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ iconv -t LATIN1 -f UTF8 &amp;lt; utf8.txt &amp;gt; latin1.txt
$ wc -c latin1.txt 
       7 latin1.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can see the conversion worked, because an &lt;code&gt;"é"&lt;/code&gt; is only one byte in Latin-1 and we dropped two bytes.&lt;/p&gt;

&lt;p&gt;Note my use of all three steps here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I used &lt;code&gt;-t LATIN1&lt;/code&gt; to set the &lt;em&gt;to&lt;/em&gt; encoding without any special translations&lt;/li&gt;
&lt;li&gt;I used &lt;code&gt;-f UTF8&lt;/code&gt; to set the &lt;em&gt;from&lt;/em&gt; encoding&lt;/li&gt;
&lt;li&gt;I used &lt;code&gt;&amp;lt; utf8.txt&lt;/code&gt; to redirect data into the program and &lt;code&gt;&amp;gt; latin1.txt&lt;/code&gt; to redirect its output to a file&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;Those are always the steps as I said before.&lt;/p&gt;

&lt;p&gt;You only need to know two more things about &lt;code&gt;iconv&lt;/code&gt;.  First, &lt;code&gt;iconv&lt;/code&gt; supports a truckload of encodings, including all of the common encodings I've been talking about in this series.  They vary some on different platforms though, so you will need to check what is available to you:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ iconv --list
ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US
  ISO_646.IRV:1991 US US-ASCII CSASCII
UTF-8 UTF8
UTF-8-MAC UTF8-MAC
ISO-10646-UCS-2 UCS-2 CSUNICODE
UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
UCS-2LE UNICODELITTLE
ISO-10646-UCS-4 UCS-4 CSUCS4
UCS-4BE
UCS-4LE
UTF-16
…
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each line of that listing shows a single encoding.  The space separated lists on each line are all aliases for that encoding that &lt;code&gt;iconv&lt;/code&gt; will accept.  Thus that first long line that I had to break into two provides a bunch of aliases for US-ASCII.  We can also see by reading down a bit that &lt;code&gt;iconv&lt;/code&gt; will accept UTF8 or UTF-8.&lt;/p&gt;

&lt;p&gt;The last thing to know about &lt;code&gt;iconv&lt;/code&gt; is that it has some special translation modes.  To see those in action, let's work with a different piece of data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ echo "On and on… and on…" &amp;gt; utf8.txt
$ cat utf8.txt 
On and on… and on…
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That last character is an ellipsis or three dots all in one character.  Unicode has that character, but Latin-1 does not.  Let's see what happens if we try to convert the data now:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ iconv -f UTF8 -t LATIN1 &amp;lt; utf8.txt &amp;gt; latin1.txt

iconv: (stdin):1:9: cannot convert
$ cat latin1.txt 
On and on
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, I got an error when it reached the first occurrence of the problem character.  The &lt;code&gt;cat&lt;/code&gt; command also shows that it completely quit working there.&lt;/p&gt;

&lt;p&gt;That may be what you need, so you can tell a user you can't work with their data.  I often find though that I just need to do the best I can with the data that I have.  &lt;code&gt;iconv&lt;/code&gt;'s translation modes can help with that.&lt;/p&gt;

&lt;p&gt;First, you can ask &lt;code&gt;iconv&lt;/code&gt; to &lt;em&gt;ignore&lt;/em&gt; any characters that cannot be converted to the new encoding:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ iconv -t LATIN1//IGNORE -f UTF8 &amp;lt; utf8.txt &amp;gt; latin1_wignore.txt
$ cat latin1_wignore.txt 
On and on and on
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, we completed the entire translation that time, only losing the problematic characters.  The &lt;code&gt;//IGNORE&lt;/code&gt; sequence adds the translation mode.  Modes are always specified after the output encoding.  That's an improvement for sure, but it's possible to do even better in this case.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;iconv&lt;/code&gt; has another translation mode where it will try to &lt;em&gt;transliterate&lt;/em&gt; characters into an equivalent representation in the target encoding:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ iconv -t LATIN1//TRANSLIT -f UTF8 &amp;lt; utf8.txt &amp;gt; latin1_wtranslit.txt
$ cat latin1_wtranslit.txt 
On and on... and on...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This time, instead of dropping the ellipsis characters, &lt;code&gt;iconv&lt;/code&gt; replaced them with three full stops each.  It's not as fancy as the Unicode character, but it gets the job done and we do a good job of keeping the meaning of the data.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;//TRANSLIT&lt;/code&gt; can't convert absolutely everything you will see in the wild, so it's still possible to get errors when using it.  You can combine the modes though by specifying &lt;code&gt;//TRANSLIT//IGNORE&lt;/code&gt;.  That will ask &lt;code&gt;iconv&lt;/code&gt; to transliterate what it can and drop the rest.  Note that order does matter there:  you need to be sure it tries transliteration before ignoring the character.&lt;/p&gt;

&lt;p&gt;You can also give &lt;code&gt;iconv&lt;/code&gt; specific translations for bytes it has trouble with.  I've never needed that level of control though and find the translation modes help me do more with less effort.  Have a quick browse through &lt;code&gt;man iconv&lt;/code&gt;, if you are curious.&lt;/p&gt;

&lt;p&gt;That's all you need to know about &lt;code&gt;iconv&lt;/code&gt;.  You are now a character conversion expert.  Congratulations.&lt;/p&gt;

&lt;p&gt;Of course, it would be nice to talk about how this affects Ruby.  Let's do that.&lt;/p&gt;

&lt;p&gt;The Ruby standard library is just like the program we've been playing with.  It just provides a method interface to the underlying C code.  To show that, here's the same conversion we started with:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="c1"&gt;#!/usr/bin/env ruby -wKU&lt;/span&gt;

&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s2"&gt;"iconv"&lt;/span&gt;

&lt;span class="n"&gt;utf8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Résumé"&lt;/span&gt;
&lt;span class="n"&gt;utf8&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;  &lt;span class="c1"&gt;# =&amp;gt; 8&lt;/span&gt;

&lt;span class="n"&gt;latin1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Iconv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"LATIN1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"UTF8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;utf8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;latin1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;  &lt;span class="c1"&gt;# =&amp;gt; 6&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You can see that the steps are exactly the same.  The first parameter is your target encoding and the second is the encoding your data is currently in.  You pass the data to convert in the last parameter and the return value of the call is the result.&lt;/p&gt;

&lt;p&gt;If you are going to do several conversions in a row, it's slightly easier to create an &lt;code&gt;Iconv&lt;/code&gt; instance and just reuse that:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="c1"&gt;#!/usr/bin/env ruby -wKU&lt;/span&gt;

&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s2"&gt;"iconv"&lt;/span&gt;

&lt;span class="n"&gt;utf8_to_latin1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Iconv&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"LATIN1//TRANSLIT//IGNORE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"UTF8"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;resume&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Résumé"&lt;/span&gt;
&lt;span class="n"&gt;utf8_to_latin1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iconv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;  &lt;span class="c1"&gt;# =&amp;gt; 6&lt;/span&gt;

&lt;span class="n"&gt;on_and_on&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"On and on… and on…"&lt;/span&gt;
&lt;span class="n"&gt;utf8_to_latin1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iconv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;on_and_on&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# =&amp;gt; "On and on... and on..."&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That's all there is to it.  The &lt;code&gt;new()&lt;/code&gt; method builds an object that remembers the encodings you are converting and then you can call &lt;code&gt;iconv()&lt;/code&gt; (instead of the &lt;code&gt;conv()&lt;/code&gt; class method we used earlier) to convert data.&lt;/p&gt;

&lt;p&gt;When things go wrong, the Ruby interface will raise exceptions like &lt;code&gt;Iconv::InvalidEncoding&lt;/code&gt; or &lt;code&gt;Iconv::InvalidCharacter&lt;/code&gt;.  See &lt;a href="http://www.ruby-doc.org/stdlib-1.8.6/libdoc/iconv/rdoc/Iconv.html"&gt;the documentation&lt;/a&gt; for details.&lt;/p&gt;
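
&lt;p&gt;The &lt;code&gt;iconv&lt;/code&gt; library stopped shipping with Ruby as of 2.0, so you may not be able to try those exceptions directly.  As a rough stand-in, the sketch below swaps in &lt;code&gt;String#encode()&lt;/code&gt; from Ruby 1.9 and later.  To be clear, this is not &lt;code&gt;Iconv&lt;/code&gt;'s API and the exception class is different, but it shows the same choice between failing on a character and dropping it.  The variable names are mine:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;text = "On and on\u2026"  # ends with the ellipsis character

# Strict conversion:  the ellipsis has no Latin-1 equivalent, so this raises.
begin
  text.encode("ISO-8859-1")
  outcome = :converted
rescue Encoding::UndefinedConversionError
  outcome = :failed
end

# Lenient conversion:  replace anything unconvertible with nothing at all,
# which is roughly the idea behind //IGNORE.
dropped = text.encode("ISO-8859-1", undef: :replace, replace: "")

p outcome           # prints :failed
p dropped.bytesize  # prints 9
&lt;/code&gt;&lt;/pre&gt;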

&lt;p&gt;The Ruby 1.8 library does not provide a way to programmatically list the supported encodings, which is one of the big reasons I started off showing you the command-line program instead.  You will need to check them there.  However, Ruby 1.9 adds a method for this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby_dev -r iconv -r pp -ve 'pp Iconv.list'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
[["ANSI_X3.4-1968",
  "ANSI_X3.4-1986",
  "ASCII",
  "CP367",
  "IBM367",
  "ISO-IR-6",
  "ISO646-US",
  "ISO_646.IRV:1991",
  "US",
  "US-ASCII",
  "CSASCII"],
 ["UTF-8", "UTF8"],
…
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This concludes our tour of character encoding tools for Ruby 1.8.  In later posts, we will take a step back from all of this and examine what the problems with this system are.  That will pave the way for us to discuss the new m17n (multilingualization) code in Ruby 1.9.&lt;/p&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
  <entry>
    <title>The $KCODE Variable and jcode Library</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/the-kcode-variable-and-jcode-library"/>
    <id>tag:graysoftinc.com,2008-11-05:/posts/70</id>
    <updated>2014-04-17T15:47:22Z</updated>
    <summary>Details on how to globally change the encoding for Ruby 1.8 as well as coverage for a simple character encoding helper library.</summary>
    <content type="html">&lt;p&gt;All of the Ruby files I create start with the same &lt;a href="http://en.wikipedia.org/wiki/Shebang_(Unix)"&gt;Shebang line&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="c1"&gt;#!/usr/bin/env ruby -wKU&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It's not really needed for every file since it generally only matters if the file is executed.  However, I tend to go ahead and add it to all Ruby files I build for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You never know when a file may be executed (&lt;code&gt;if __FILE__ == $PROGRAM_NAME; end&lt;/code&gt; sections are often added to libraries, for example) &lt;/li&gt;
&lt;li&gt;It makes it obvious the file is Ruby code&lt;/li&gt;
&lt;li&gt;It shows the rules this code expects:  &lt;code&gt;-w&lt;/code&gt; and &lt;code&gt;-KU&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;The rules I mention here, specified by command-line switches, are the main point of interest.  &lt;code&gt;-w&lt;/code&gt; turns on Ruby's warnings which are very handy.  I recommend doing that whenever you can.  But that doesn't have anything to do with character encodings.  &lt;code&gt;-KU&lt;/code&gt; does.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-KU&lt;/code&gt; sets a magic Ruby variable:  &lt;code&gt;$-K&lt;/code&gt; or &lt;code&gt;$KCODE&lt;/code&gt;.  You can do the same in your code if you aren't in a position to control the command-line arguments:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="vg"&gt;$KCODE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"U"&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You probably recognize the &lt;code&gt;U&lt;/code&gt; as a name for Ruby 1.8's UTF-8 encoding, from my earlier &lt;a href="/character-encodings/bytes-and-characters-in-ruby-18"&gt;list of encodings&lt;/a&gt;.  It can also be set to &lt;code&gt;N&lt;/code&gt; (the default), &lt;code&gt;E&lt;/code&gt;, or &lt;code&gt;S&lt;/code&gt;.  Modern versions of Rails do set &lt;code&gt;$KCODE = "U"&lt;/code&gt; for you.&lt;/p&gt;

&lt;p&gt;So what does changing this magic variable do?  First, it has the tiny effect of changing what Ruby escapes in &lt;code&gt;inspect()&lt;/code&gt; output.  Have a look:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p "Résumé"'
"R\303\251sum\303\251"
$ ruby -KUe 'p "Résumé"'
"Résumé"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It's nice to be able to see your data as it actually is, assuming your terminal correctly handles UTF-8.  However, that's really just a side-effect of setting &lt;code&gt;$KCODE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The main purpose of &lt;code&gt;$KCODE&lt;/code&gt; is that it changes the default encoding of all regular expressions that do not specify otherwise.  Thus we can split up UTF-8 data by characters without adding a &lt;code&gt;/u&lt;/code&gt; to the end of our expression:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]
$ ruby -KUe 'p "Résumé".scan(/./m)'
["R", "é", "s", "u", "m", "é"]
$ ruby -KUe 'p "Résumé".scan(/./mn)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Notice that the default encoding for that second example was switched to &lt;em&gt;UTF-8&lt;/em&gt;.  However, I can still override this with an explicit encoding, as I did in example three by adding the &lt;code&gt;/n&lt;/code&gt; option for &lt;em&gt;None&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Now, I tend to prefer &lt;code&gt;$KCODE&lt;/code&gt; over &lt;code&gt;$-K&lt;/code&gt; because the former seems more common in Ruby literature.  In fact, Ruby 1.8 uses the term in another place, providing a method to get the encoding used in a &lt;code&gt;Regexp&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p /./.kcode'
nil
$ ruby -e 'p /./u.kcode'
"utf8"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Beware of that harmless-looking &lt;code&gt;kcode()&lt;/code&gt; method though, as it hides quite a few gotchas.  First, you can see that it has its own names for the options that don't really match up with what we've seen elsewhere.  It also doesn't seem to be aware of the &lt;code&gt;$KCODE&lt;/code&gt; variable, in an ironic twist of naming:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e '$KCODE = "U"; re = /./m; p "Résumé".scan(re); p re.kcode'
["R", "é", "s", "u", "m", "é"]
nil
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see, the encoding of the expression was clearly set correctly, but &lt;code&gt;kcode()&lt;/code&gt; didn't report the change.  If you really want to know the encoding of a &lt;code&gt;Regexp&lt;/code&gt; in Ruby 1.8, I suggest using code like the following:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Regexp&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encoding&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;kcode&lt;/span&gt;
      &lt;span class="n"&gt;kcode&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
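    &lt;span class="k"&gt;elsif&lt;/span&gt; &lt;span class="vg"&gt;$KCODE&lt;/span&gt; &lt;span class="o"&gt;=~&lt;/span&gt; &lt;span class="sr"&gt;/\A[nues]/i&lt;/span&gt;  &lt;span class="c1"&gt;# $KCODE may read back as a full name like "UTF8"&lt;/span&gt;
      &lt;span class="vg"&gt;$KCODE&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;downcase&lt;/span&gt;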
    &lt;span class="k"&gt;else&lt;/span&gt;
      &lt;span class="s2"&gt;"n"&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Using just the first letter of &lt;code&gt;kcode()&lt;/code&gt; should get us back to a standard set of letters.  If &lt;code&gt;kcode()&lt;/code&gt; isn't set, we can use &lt;code&gt;$KCODE&lt;/code&gt;.  However, do note that I make sure it's set to an expected value.  You can set &lt;code&gt;$KCODE&lt;/code&gt; to any junk value and Ruby will just silently ignore it (defaulting back to &lt;code&gt;N&lt;/code&gt;), so it's good to reality check the contents when you rely on it.  Finally, we just return the default if neither appear to be set.&lt;/p&gt;

&lt;p&gt;That's really all there is to know about &lt;code&gt;$KCODE&lt;/code&gt;, but Ruby 1.8 ships with a simple standard library called &lt;code&gt;jcode&lt;/code&gt; that combines well with everything we've been discussing in these last two posts.&lt;/p&gt;

&lt;p&gt;To use the &lt;code&gt;jcode&lt;/code&gt; library, set &lt;code&gt;$KCODE&lt;/code&gt; and then require the library.  Setting &lt;code&gt;$KCODE&lt;/code&gt; first is important, and you will receive a warning if you require &lt;code&gt;jcode&lt;/code&gt; without setting &lt;code&gt;$KCODE&lt;/code&gt; (as long as you took my advice and turned warnings on with &lt;code&gt;-w&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -r jcode -e 'p "Résumé".jsize'
8
$ ruby -w -r jcode -e 'p "Résumé".jsize'
Warning: $KCODE is NONE.
8
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;See, I told you &lt;code&gt;-w&lt;/code&gt; was important.&lt;/p&gt;

&lt;p&gt;As long as you do have &lt;code&gt;$KCODE&lt;/code&gt; set properly, &lt;code&gt;jcode&lt;/code&gt; adds a bunch of methods to &lt;code&gt;String&lt;/code&gt; that work in characters.  These methods are just simple wrappers over the techniques I showed you in &lt;a href="/character-encodings/bytes-and-characters-in-ruby-18"&gt;my last post&lt;/a&gt;, so you get methods like &lt;code&gt;jsize()&lt;/code&gt; which returns a count of characters instead of bytes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -KU -r jcode -e 'p "Résumé".jsize'
6
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Probably the most useful method &lt;code&gt;jcode&lt;/code&gt; adds is &lt;code&gt;each_char()&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -KU -r jcode -e '"Résumé".each_char { |c| p c }'
"R"
"é"
"s"
"u"
"m"
"é"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;See &lt;a href="http://www.ruby-doc.org/stdlib-1.8.6/libdoc/jcode/rdoc/index.html"&gt;the documentation&lt;/a&gt; for the full method list.&lt;/p&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
  <entry>
    <title>Bytes and Characters in Ruby 1.8</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/bytes-and-characters-in-ruby-18"/>
    <id>tag:graysoftinc.com,2008-10-30:/posts/69</id>
    <updated>2014-04-12T20:06:06Z</updated>
    <summary>This is a look at the key features of Ruby 1.8's character encoding support.</summary>
    <content type="html">&lt;p&gt;Gregory Brown said, in a training session at the Lone Star Rubyconf, "Ruby 1.8 works in bytes.  Ruby 1.9 works in characters."  The truth of Ruby 1.9 is maybe a little more complicated and we will discuss all of that eventually, but Greg is dead right about Ruby 1.8.&lt;/p&gt;

&lt;p&gt;In Ruby 1.8, a &lt;code&gt;String&lt;/code&gt; is always just a collection of bytes.&lt;/p&gt;

&lt;p&gt;The important question is, how does that one golden rule relate to all that we've learned about character encodings?  Essentially, it puts all the responsibility on you as the developer.  Ruby 1.8 leaves it to you to determine what to do with those bytes and it doesn't provide a lot of encoding-savvy help.  That's why knowing at least the basics of encodings is so important when working with Ruby 1.8.&lt;/p&gt;

&lt;p&gt;There are plusses and minuses to every system and this one is no exception.  On the side of plusses, Ruby 1.8 can pretty much support any encoding you can imagine.  After all, a character encoding is just some bytes that somehow map to a set of characters and all Ruby 1.8 &lt;code&gt;String&lt;/code&gt;s are just some bytes.  If you say a &lt;code&gt;String&lt;/code&gt; holds Latin-1 data and treat it as such, that's fine by Ruby.&lt;/p&gt;

&lt;p&gt;I won't lie to you though, there are more minuses than plusses to this approach.  Latin-1 is a pretty simple case since each byte is a character.  With many other encodings though, like the UTF-8 encoding I've recommended we rely on, things get a lot more complicated.&lt;/p&gt;

&lt;p&gt;Slicing up a Ruby 1.8 &lt;code&gt;String&lt;/code&gt; by index means working in bytes and that means it's possible for us to accidentally break a multi-byte character.  Running regular expressions over data faces similar issues.  Those are just two examples of things we commonly do, but the truth is that many &lt;code&gt;String&lt;/code&gt; operations just aren't encoding safe in Ruby 1.8.  You can't even safely call simple things like &lt;code&gt;reverse()&lt;/code&gt; on a &lt;code&gt;String&lt;/code&gt; because it could break the order of those multi-byte characters.  And remember that &lt;code&gt;size()&lt;/code&gt; will always count bytes, not characters.&lt;/p&gt;

&lt;p&gt;Ruby 1.8 is also never going to police the contents of a &lt;code&gt;String&lt;/code&gt;.  That means to Ruby 1.8 a &lt;code&gt;String&lt;/code&gt; with valid UTF-8 data, a &lt;code&gt;String&lt;/code&gt; with broken UTF-8 data, and a &lt;code&gt;String&lt;/code&gt; with some bytes in Latin-1 and some in UTF-8 are all just &lt;code&gt;String&lt;/code&gt;s.  It doesn't care.  It's unlikely that the latter two are going to be of any use to you, so you will need to be the one making sure you don't create such problems.  If you got &lt;code&gt;String&lt;/code&gt; data from two separate sources in different encodings, you can't just combine them with a simple &lt;code&gt;+&lt;/code&gt;.&lt;/p&gt;
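
&lt;p&gt;To make the mixed-encoding problem concrete, here is a minimal sketch.  It assumes a modern Ruby (2.0 or later) rather than 1.8, because it leans on &lt;code&gt;String#b&lt;/code&gt;, &lt;code&gt;force_encoding()&lt;/code&gt;, and &lt;code&gt;encode()&lt;/code&gt;, and the &lt;code&gt;valid_utf8?()&lt;/code&gt; helper is a made-up name for illustration.  The word being joined is "café", with the é escaped so the source stays ASCII:&lt;/p&gt;

```ruby
# True when the bytes of str form valid UTF-8 (hypothetical helper).
def valid_utf8?(str)
  str.dup.force_encoding("UTF-8").valid_encoding?
end

utf8_part   = "caf\u00E9"    # UTF-8: the accented e takes two bytes
latin1_part = "caf\xE9".b    # Latin-1: the accented e is the single byte 0xE9

# Gluing raw bytes together is all a bytes-only String can do:
mixed = utf8_part.b + latin1_part
p valid_utf8?(mixed)    # false

# Transcoding the Latin-1 half to UTF-8 first keeps the result valid:
fixed = utf8_part + latin1_part.encode("UTF-8", "ISO-8859-1")
p valid_utf8?(fixed)    # true
```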

&lt;p&gt;This may be starting to sound a little bleak and it probably is.  However, Ruby 1.8 throws one major exception into the works that can help you in many cases:  the regex engine is aware of four character encodings.  Often we can use this simple fact to work with characters.&lt;/p&gt;

&lt;p&gt;What encodings does Ruby 1.8 know?  Here's the full list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;None (n or N)&lt;/li&gt;
&lt;li&gt;EUC (e or E)&lt;/li&gt;
&lt;li&gt;Shift_JIS (s or S)&lt;/li&gt;
&lt;li&gt;UTF-8 (u or U)&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;The &lt;em&gt;None&lt;/em&gt; encoding is the default in Ruby 1.8.  It's just the golden rule I've already mentioned:  treat everything as bytes.  If your encoding isn't on this list, you will need to use &lt;em&gt;None&lt;/em&gt; and be darn sure you don't do anything to the data that could damage the encoding.  That's very hard and the fact is that doing significant work with an encoding not on the above list in Ruby 1.8 will be quite a challenge for you.&lt;/p&gt;

&lt;p&gt;Both &lt;em&gt;EUC&lt;/em&gt; (Extended Unix Code) and &lt;em&gt;Shift_JIS&lt;/em&gt; are primarily Asian character encodings.  &lt;em&gt;Shift_JIS&lt;/em&gt; is a Japanese encoding and &lt;em&gt;EUC&lt;/em&gt; is mainly used for Japanese, Korean, and simplified Chinese.  You can tell Ruby comes from Japan, can't you?  Obviously these are very helpful if you work with text in those languages, but the rest of us won't need them much.&lt;/p&gt;

&lt;p&gt;Now we get to the good news:  our champion &lt;em&gt;UTF-8&lt;/em&gt; made the list!  Yes, this means Ruby 1.8 has limited support for working with &lt;em&gt;UTF-8&lt;/em&gt; data.  It's not comprehensive, but we get some help.&lt;/p&gt;

&lt;p&gt;The letters listed after each encoding are used in multiple places inside Ruby 1.8 to tell it which encoding you need to work with.  I'll point those places out as we get into the details.&lt;/p&gt;

&lt;p&gt;What does it mean to have a character encoding on the above list?  It means that the regex engine can recognize characters in that encoding, even if they are multibyte.  That assures us that regular expression constructs that target characters, like character classes (&lt;code&gt;[…]&lt;/code&gt;) and the match-one-character shortcut (&lt;code&gt;.&lt;/code&gt;), will correctly match whatever number of bytes represents one character at that place in the data.  It also changes the definition of constructs like &lt;code&gt;\s&lt;/code&gt; and &lt;code&gt;\w&lt;/code&gt; which can be used to match whitespace and word characters respectively.  The definition of a "word" character in Unicode is quite a bit broader than the simple ASCII character class of &lt;code&gt;[A-Za-z0-9_]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's look at some examples of this, so you can see how it works.  I'll play around with a simple UTF-8 &lt;code&gt;String&lt;/code&gt; in Ruby 1.8 and show you the various encoding effects.  Remember that the default encoding is &lt;em&gt;None&lt;/em&gt;, so that's what we get if we don't ask for anything else.&lt;/p&gt;

&lt;p&gt;A common task in working with characters in Ruby 1.8 is to convert a &lt;code&gt;String&lt;/code&gt; into an &lt;code&gt;Array&lt;/code&gt; of characters.  If we can do just that much, we can work around some of the weaknesses of Ruby 1.8's &lt;code&gt;String&lt;/code&gt; always working in bytes.  Given that, this almost does what we want:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You probably know that &lt;code&gt;scan()&lt;/code&gt; just builds an &lt;code&gt;Array&lt;/code&gt; of matches for the passed &lt;code&gt;Regexp&lt;/code&gt; in the &lt;code&gt;String&lt;/code&gt; receiver.  The &lt;code&gt;/m&lt;/code&gt; option I'm using here puts the regex engine in &lt;em&gt;multi-line&lt;/em&gt; mode and in that a &lt;code&gt;.&lt;/code&gt; matches all characters (it usually doesn't match newlines).&lt;/p&gt;

&lt;p&gt;So what went wrong above?  Well, the &lt;code&gt;"é"&lt;/code&gt; characters in my &lt;code&gt;String&lt;/code&gt; take two bytes in UTF-8.  The golden rule tells us Ruby 1.8 works in bytes and that's definitely what we saw.  It split up the bytes needed for those characters.  This is bad, because if I now change this &lt;code&gt;Array&lt;/code&gt;, I have excellent chances of breaking my data.&lt;/p&gt;

&lt;p&gt;Again, that used the default &lt;em&gt;None&lt;/em&gt; mode, because we didn't tell it to do otherwise.  However, if we throw the regex engine into &lt;em&gt;UTF-8&lt;/em&gt; mode, we will get actual characters:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p "Résumé".scan(/./mu)'
["R", "\303\251", "s", "u", "m", "\303\251"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Notice how the two bytes needed for the &lt;code&gt;"é"&lt;/code&gt; stay together now?  (I'll show you how to get Ruby to stop escaping the content and show the actual &lt;code&gt;"é"&lt;/code&gt; in &lt;a href="/character-encodings/the-kcode-variable-and-jcode-library"&gt;my next post&lt;/a&gt;.)  The regex engine saw that it takes both bytes to make a character in &lt;em&gt;UTF-8&lt;/em&gt;, the encoding I requested, and thus the &lt;code&gt;.&lt;/code&gt;, which matches one character, is forced to grab them both.&lt;/p&gt;

&lt;p&gt;I chose &lt;em&gt;UTF-8&lt;/em&gt; mode by adding the &lt;code&gt;/u&lt;/code&gt; option to my &lt;code&gt;Regexp&lt;/code&gt; literal.  You probably recognize the letter from my earlier list of encodings.  Similarly, you can use &lt;code&gt;/e&lt;/code&gt; for &lt;em&gt;EUC&lt;/em&gt;, &lt;code&gt;/s&lt;/code&gt; for &lt;em&gt;Shift_JIS&lt;/em&gt;, and even &lt;code&gt;/n&lt;/code&gt; for &lt;em&gt;None&lt;/em&gt; though that's the default.  &lt;code&gt;Regexp.new()&lt;/code&gt; also accepts a third parameter for these encodings if you are creating expressions that way:  &lt;code&gt;Regexp.new(".", Regexp::MULTILINE, "u")&lt;/code&gt;.&lt;/p&gt;
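
&lt;p&gt;If you only have a modern Ruby handy, you can still see both views side by side.  This is a sketch under a few assumptions:  it targets Ruby 2.0 or later, where each &lt;code&gt;String&lt;/code&gt; carries its own encoding, so the bytes-only view has to be requested explicitly with &lt;code&gt;String#b&lt;/code&gt;.  The helper names are invented for the example and the word is "Résumé" written with escapes:&lt;/p&gt;

```ruby
# Bytes-only view, imitating the None mode: on a binary copy of the String
# the dot matches one byte at a time.
def byte_pieces(str)
  str.b.scan(/./m)
end

# UTF-8 view: the same scan() call used above, /u option included.
def utf8_pieces(str)
  str.scan(/./mu)
end

word = "R\u00E9sum\u00E9"
p byte_pieces(word).size   # 8
p utf8_pieces(word).size   # 6
```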

&lt;p&gt;Using this one simple trick, we can fix some of the unsafe &lt;code&gt;String&lt;/code&gt; methods I mentioned earlier.  For example, Ruby 1.8 normally counts bytes with &lt;code&gt;size()&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p "Résumé".size'
8
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;but we can now count characters, if desired:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p "Résumé".scan(/./mu).size'
6
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can also fix the dangerous &lt;code&gt;reverse()&lt;/code&gt; method which would normally break our multibyte &lt;code&gt;"é"&lt;/code&gt; characters by screwing up the byte order:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p "Résumé".reverse'
"\251\303mus\251\303R"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;"\303\251"&lt;/code&gt; is a UTF-8 &lt;code&gt;"é"&lt;/code&gt;, but the &lt;code&gt;"\251\303"&lt;/code&gt; we see here is broken UTF-8 data that doesn't mean anything.  We can fix that with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p "Résumé".scan(/./mu).reverse.join'
"\303\251mus\303\251R"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This time we use the regex engine to divide the &lt;code&gt;String&lt;/code&gt; into a character &lt;code&gt;Array&lt;/code&gt;, then we &lt;code&gt;reverse()&lt;/code&gt; that and &lt;code&gt;join()&lt;/code&gt; it back into a &lt;code&gt;String&lt;/code&gt;.  You can see that this kept the &lt;code&gt;"é"&lt;/code&gt; bytes in the proper order.&lt;/p&gt;
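
&lt;p&gt;If you use these tricks a lot, it may help to wrap them up.  The following is only a sketch:  the method names are hypothetical, and it is written for a modern Ruby because of the &lt;code&gt;\u&lt;/code&gt; escapes used to spell "Résumé", although the &lt;code&gt;scan()&lt;/code&gt; calls are the same ones shown above:&lt;/p&gt;

```ruby
# Split a UTF-8 String into an Array of whole characters.
def utf8_chars(str)
  str.scan(/./mu)
end

# Count characters instead of bytes.
def char_size(str)
  utf8_chars(str).size
end

# Reverse by character, so multibyte sequences keep their byte order.
def char_reverse(str)
  utf8_chars(str).reverse.join
end

word = "R\u00E9sum\u00E9"
p char_size(word)                             # 6
p char_reverse(word) == "\u00E9mus\u00E9R"    # true
```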

&lt;p&gt;Really study these examples above until you understand what's going on here.  This is all the support Ruby 1.8 provides for working with characters, so you need to understand how to use it.&lt;/p&gt;

&lt;p&gt;Here's one last set of examples showing the other regex change I mentioned:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p "Résumé"[/\w+/]'
"R"
$ ruby -e 'p "Résumé"[/\w+/u]'
"R\303\251sum\303\251"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the default &lt;em&gt;None&lt;/em&gt; mode, &lt;code&gt;\w&lt;/code&gt; is the same as &lt;code&gt;[A-Za-z0-9_]&lt;/code&gt;.  That doesn't match the special bytes needed to build the &lt;code&gt;"é"&lt;/code&gt; character, so the match ends there.  Note that &lt;em&gt;UTF-8&lt;/em&gt; mode changes that though and we get the full word.&lt;/p&gt;

&lt;p&gt;Ruby 1.8 doesn't provide a whole lot of additional encoding support outside the regex engine.  There is one magic variable and some helpful standard libraries we will discuss in future posts, but the main part of Ruby 1.8's character encoding support is just this.&lt;/p&gt;

&lt;p&gt;One other small feature that may be worth a quick mention is that you can get Unicode code points using &lt;code&gt;String&lt;/code&gt;'s &lt;code&gt;unpack()&lt;/code&gt; method:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -e 'p "Résumé".unpack("U*")'
[82, 233, 115, 117, 109, 233]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;U&lt;/code&gt; code tells &lt;code&gt;unpack()&lt;/code&gt; to convert a character into a Unicode code point and the &lt;code&gt;*&lt;/code&gt; just repeats it for all characters in the &lt;code&gt;String&lt;/code&gt;.&lt;/p&gt;
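
&lt;p&gt;The reverse trip works too, using &lt;code&gt;Array#pack()&lt;/code&gt; with the same &lt;code&gt;U*&lt;/code&gt; pattern.  Here is a small sketch with made-up helper names, written for a modern Ruby so the sample word can be spelled with &lt;code&gt;\u&lt;/code&gt; escapes:&lt;/p&gt;

```ruby
# Turn a UTF-8 String into its Unicode code points.
def code_points(str)
  str.unpack("U*")
end

# Build a UTF-8 String back out of a list of code points.
def from_code_points(points)
  points.pack("U*")
end

points = code_points("R\u00E9sum\u00E9")
p points                                            # [82, 233, 115, 117, 109, 233]
p from_code_points(points) == "R\u00E9sum\u00E9"    # true
```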

&lt;p&gt;I don't find myself needing to work with code points often, but you can use this for one interesting cheat.  The Unicode code points are a superset of the byte values used in Latin-1, so you can actually convert between the two encodings using just &lt;code&gt;unpack()&lt;/code&gt; and &lt;code&gt;pack()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="n"&gt;utf8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;latin1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unpack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"C*"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"U*"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ... or ...&lt;/span&gt;
&lt;span class="n"&gt;latin1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;utf8&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unpack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"U*"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"C*"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# more dangerous&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;However, I'll show you a superior way to handle encoding conversions in a future post.&lt;/p&gt;
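
&lt;p&gt;Until then, here is one way the &lt;code&gt;pack()&lt;/code&gt; and &lt;code&gt;unpack()&lt;/code&gt; cheat could be wrapped up.  Treat it as a sketch:  the method names are hypothetical, it is written for a modern Ruby, and the range check is my own addition to show why the second direction is the dangerous one, since code points above 255 have no Latin-1 byte to land in:&lt;/p&gt;

```ruby
# Latin-1 to UTF-8: every Latin-1 byte value is also its Unicode code point.
def latin1_to_utf8(latin1)
  latin1.unpack("C*").pack("U*")
end

# UTF-8 to Latin-1: only possible when every code point is 255 or lower,
# so refuse anything larger instead of packing it.
def utf8_to_latin1(utf8)
  points  = utf8.unpack("U*")
  too_big = points.find { |point| point > 255 }
  raise ArgumentError, "U+%04X does not fit in Latin-1" % too_big if too_big
  points.pack("C*")
end

latin1 = "R\xE9sum\xE9".b              # "Résumé" as raw Latin-1 bytes
utf8   = latin1_to_utf8(latin1)
p utf8 == "R\u00E9sum\u00E9"           # true
p utf8_to_latin1(utf8) == latin1       # true
```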

&lt;p&gt;It's important to remember that this is not full character encoding support.  For example, there is a long list of rules about how to correctly convert some Unicode characters to upper case, but &lt;code&gt;upcase()&lt;/code&gt; doesn't know them and you cannot regex your way out of that mess.  If you need these features for a given encoding, you will need to look for an external library that meets your needs or roll your own solution.&lt;/p&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
  <entry>
    <title>General Encoding Strategies</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/general-encoding-strategies"/>
    <id>tag:graysoftinc.com,2008-10-21:/posts/68</id>
    <updated>2014-04-12T19:30:46Z</updated>
    <summary>This is an attempt to establish general encoding strategies.</summary>
    <content type="html">&lt;p&gt;Before we get into specifics, let's try to distill a few best practices for working with encodings.  I'm sure you can tell that there's a lot that needs to be considered with encodings, so let's try to focus in on a few key points that will help us the most.&lt;/p&gt;

&lt;h4&gt;Use UTF-8 Everywhere You Can&lt;/h4&gt;

&lt;p&gt;We know UTF-8 isn't perfect, but it's pretty darn close to perfect.  There is no other single encoding you could pick that has the potential to satisfy such a wide audience.  It's our best bet.  For these reasons, &lt;a href="ftp://ftp.isi.edu/in-notes/rfc2277.txt"&gt;UTF-8 is quickly becoming the preferred encoding for the Web, email, and more&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have a say over what encoding or encodings your software will accept, support, and deliver, choose UTF-8 whenever you can.  This is absolutely the best default.&lt;/p&gt;

&lt;h4&gt;Get in the Habit of Documenting Your Encodings&lt;/h4&gt;

&lt;p&gt;We learned that you must know your data's encoding to properly work with it.  While there are tools to help you guess an encoding, you really want to try to avoid being in this position.  Part of how to make that happen is to be a good citizen and make sure you are documenting your encodings at every step.&lt;/p&gt;

&lt;p&gt;If you send an email, make sure it specifies a correct character set.  Add a meta tag to Web pages to state the encoding.  View the source of this page for an example.  Document encodings accepted and returned from your APIs.  This will raise everyone's encoding awareness, which helps us all.&lt;/p&gt;

&lt;h4&gt;Develop Your Encoding Safe Senses&lt;/h4&gt;

&lt;p&gt;You need to get into the habit of thinking, "Is this encoding safe?"  When you call a method, ask the question.  When you hand your data off to some process, reality check some results.&lt;/p&gt;

&lt;p&gt;Have you ever done something like &lt;code&gt;str[1..-2]&lt;/code&gt; in Ruby 1.8?  I sure have and it's not safe.  You're cutting bytes there and that may dice a bigger character into pieces.  Then your data is junk.&lt;/p&gt;
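
&lt;p&gt;Here's a quick sketch of that failure and one character-safe alternative.  It assumes a modern Ruby, where &lt;code&gt;byteslice()&lt;/code&gt; stands in for the byte-based indexing Ruby 1.8 does by default, and &lt;code&gt;trim_ends()&lt;/code&gt; is just a name I made up.  The sample word is "été", which is five bytes but three characters in UTF-8:&lt;/p&gt;

```ruby
# Drop the first and last character, working in characters rather than bytes.
def trim_ends(str)
  chars = str.scan(/./mu)
  Array(chars[1..-2]).join
end

word = "\u00E9t\u00E9"
p word.byteslice(1..-2).valid_encoding?   # false
p trim_ends(word)                         # "t"
```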

&lt;p&gt;This may sound like paranoia, but it's really not as bad as it seems.  There tend to just be a few key points where you need to go out of your way to protect the data and it's asking this question repeatedly that teaches you to spot those.&lt;/p&gt;

&lt;p&gt;To give an example, while enhancing the standard &lt;code&gt;CSV&lt;/code&gt; library for Ruby 1.9's m17n (multilingualization) implementation, I needed to use some user provided data in a &lt;code&gt;Regexp&lt;/code&gt;.  That's easy right?&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="no"&gt;Regexp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Luckily, my instincts were just good enough to wonder, is that safe?  I fed some UTF-32 data to &lt;code&gt;Regexp.escape()&lt;/code&gt; to find out.  Remember, multibyte encodings that will show some seemingly normal data are great for testing edge cases.  Ruby broke my data:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="nb"&gt;p&lt;/span&gt; &lt;span class="no"&gt;Regexp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"+"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"UTF-32BE"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\x00\x00\x00\\&lt;/span&gt;&lt;span class="s2"&gt;+"&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, this was just a case of Ruby 1.9 still being raw around the edges.  It looks like this has been fixed in current builds:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby_dev -ve 'p Regexp.escape("+".encode("UTF-32BE"))'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
"\x00\x00\x00\\\x00\x00\x00+"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Still, the point stands:  you can't always trust even Ruby.  Be cautious.&lt;/p&gt;

&lt;p&gt;The natural conclusion of this is that you want to know how encodings are handled all through the pipeline your data will pass through.  Does your HTML arrange to receive form data in UTF-8?  Is Ruby in UTF-8 mode when it receives that data?  Does the MySQL table you store that data in have an encoding set to UTF-8?  Modern versions of Rails even handle two of those three steps for you.  That's why it's important to look into the tools you use.&lt;/p&gt;

&lt;p&gt;These strategies aren't all you will need, but they are a terrific start.  This is not too much to remember and it will greatly increase your awareness of the issues.  That's the most important thing.&lt;/p&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
  <entry>
    <title>The Unicode Character Set and Encodings</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/the-unicode-character-set-and-encodings"/>
    <id>tag:graysoftinc.com,2008-10-16:/posts/67</id>
    <updated>2015-12-17T20:15:16Z</updated>
    <summary>An in depth study of the largest character set in modern usage and how it can meet your needs.</summary>
    <content type="html">&lt;p&gt;Since the rise of the various character encodings, there has been a quest to find the one perfect encoding we could all use.  It's hard to get everyone to agree about whether or not this has truly been accomplished, but most of us agree that &lt;a href="http://unicode.org/"&gt;Unicode&lt;/a&gt; is as close as it gets.&lt;/p&gt;

&lt;p&gt;The goal of Unicode was literally to provide a character set that includes all characters in use today.  That's letters and numbers for all languages, all the images needed by pictographic languages, and all symbols.  As you can imagine that's quite a challenging task, but they've done very well.  Take a moment to &lt;a href="http://www.unicode.org/charts/"&gt;browse all the characters in the current Unicode specification&lt;/a&gt; to see for yourself.  The Unicode Consortium often reminds us that they still have room for more characters as well, so we will be all set when we start meeting alien races.&lt;/p&gt;

&lt;p&gt;Now in order to really understand what Unicode is, I need to clear up a point I've played pretty loose with so far:  a character set and a character encoding aren't necessarily the same thing.  Unicode is one character set, and has multiple character encodings.  Allow me to explain.&lt;/p&gt;

&lt;p&gt;A character set is just the mapping of symbols to their magic number representations inside the computer.  Unicode calls these numbers code points and they are usually written in the form &lt;code&gt;U+0061&lt;/code&gt; where the &lt;code&gt;U+&lt;/code&gt; means Unicode and the four digit number is hexadecimal for a code point.  Thus hexadecimal &lt;code&gt;0061&lt;/code&gt; is &lt;code&gt;97&lt;/code&gt; in decimal.  That happens to be the Unicode code point for &lt;code&gt;a&lt;/code&gt; and if you remember my previous post well, you will recognize that matches up with US-ASCII.  We'll talk more about that in a bit.  It is worth noting though that Ruby 1.8 and 1.9 can show you these code points:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -vKUe 'p "aé…".unpack("U*")'
ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
[97, 233, 8230]
$ ruby_dev -ve 'p "aé…".unpack("U*")'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
[97, 233, 8230]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;U&lt;/code&gt; pattern for &lt;code&gt;unpack()&lt;/code&gt; asks for a Unicode code point and the &lt;code&gt;*&lt;/code&gt; just repeats it for each character.  Note that I used the &lt;code&gt;-KU&lt;/code&gt; switch to get Ruby 1.8 in UTF-8 mode.  Ruby 1.9 assumed UTF-8 because of how my environment is configured.  We will talk a lot more about those details when we get into specific language features.&lt;/p&gt;
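
&lt;p&gt;If you want to move between those decimal numbers and the &lt;code&gt;U+&lt;/code&gt; notation, a little formatting does the job.  This is only a sketch with hypothetical helper names, written for a modern Ruby:&lt;/p&gt;

```ruby
# Format a code point the way Unicode writes it, so 97 becomes "U+0061".
def u_notation(code_point)
  "U+%04X" % code_point
end

# Parse "U+0061" back into the Integer 97.
def parse_u_notation(text)
  text.sub(/\AU\+/, "").hex
end

p "0061".hex                                                  # 97
p "a\u00E9\u2026".unpack("U*").map { |cp| u_notation(cp) }    # ["U+0061", "U+00E9", "U+2026"]
p parse_u_notation("U+2026")                                  # 8230
```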

&lt;p&gt;Code points aren't what actually gets recorded in a file, they are just abstract numbers for each character.  How those characters get written into a data stream is an encoding.  There are multiple encodings for Unicode or multiple ways to record those abstract numbers into files.&lt;/p&gt;

&lt;p&gt;Different encodings have different strengths.  For example, one possible encoding of Unicode is UTF-32, where 32 bits (or four bytes) are reserved for each code point.  This has the advantage that you can always count on four bytes being used (unlike variable length encodings, which we will discuss shortly).  An obvious downside though is the wasted space.  I mean if you have all ASCII data, you only really need one byte each, but UTF-32 will use four without exception.&lt;/p&gt;
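
&lt;p&gt;You can measure that trade-off directly.  The sketch below assumes Ruby 1.9 or later for &lt;code&gt;encode()&lt;/code&gt; and &lt;code&gt;bytesize()&lt;/code&gt;, and &lt;code&gt;encoded_bytesize()&lt;/code&gt; is a made-up helper name:&lt;/p&gt;

```ruby
# Bytes needed to store str in the named encoding.
def encoded_bytesize(str, encoding_name)
  str.encode(encoding_name).bytesize
end

p encoded_bytesize("abc", "UTF-8")               # 3
p encoded_bytesize("abc", "UTF-32BE")            # 12
p encoded_bytesize("a\u00E9\u2026", "UTF-8")     # 6
p encoded_bytesize("a\u00E9\u2026", "UTF-32BE")  # 12
```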

&lt;p&gt;You do need to be very careful how you work with multibyte encodings.  UTF-32 is a good example of one that can be pretty tricky, because parts of the data can look normal.  For example, look at this simple &lt;code&gt;String&lt;/code&gt; as Ruby 1.9 sees it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby_dev -ve 'p "abc".encode("UTF-32BE")'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
"\x00\x00\x00a\x00\x00\x00b\x00\x00\x00c"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are a lot of null bytes in there, but notice how there are also normal &lt;code&gt;"a"&lt;/code&gt;, &lt;code&gt;"b"&lt;/code&gt;, and &lt;code&gt;"c"&lt;/code&gt; bytes.  I'm not going to show how this could happen to avoid encouraging bad habits, but if you replaced just the &lt;code&gt;"a"&lt;/code&gt; byte with two bytes like &lt;code&gt;"ab"&lt;/code&gt; your encoding would now be broken and would eventually cause you problems.  You also have to be careful anytime you slice up a &lt;code&gt;String&lt;/code&gt; to make sure you don't divide the content mid-character.&lt;/p&gt;

&lt;p&gt;Another possible encoding of Unicode is UTF-8.  It has become pretty popular for things like email and web pages in recent years for several reasons.  First, UTF-8 is 100% compatible with US-ASCII.  The lowest 128 code points match their US-ASCII equivalents and UTF-8 encodes these in a single byte.  Ruby 1.9 can show us this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat ascii_and_utf8.rb 
str   = "abc"
ascii = str.encode("US-ASCII")
utf8  = str.encode("UTF-8")

[ascii, utf8].each do |encoded_str|
  p [encoded_str, encoded_str.encoding.name, encoded_str.bytes.to_a]
end
$ ruby_dev -v ascii_and_utf8.rb 
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
["abc", "US-ASCII", [97, 98, 99]]
["abc", "UTF-8", [97, 98, 99]]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I've used several new Ruby 1.9 features here.  I don't want to go too deeply into these at this point but briefly:  &lt;code&gt;encode()&lt;/code&gt; allows me to transcode a &lt;code&gt;String&lt;/code&gt; from its current encoding to the one I pass the name for, &lt;code&gt;encoding()&lt;/code&gt; gives me the current &lt;code&gt;Encoding&lt;/code&gt; object for that &lt;code&gt;String&lt;/code&gt; and &lt;code&gt;name()&lt;/code&gt; turns that into a simple name, and finally Ruby 1.9 &lt;code&gt;String&lt;/code&gt;s provide &lt;code&gt;Enumerator&lt;/code&gt;s to walk the content by &lt;code&gt;bytes()&lt;/code&gt;, &lt;code&gt;chars()&lt;/code&gt;, &lt;code&gt;codepoints()&lt;/code&gt;, or &lt;code&gt;lines()&lt;/code&gt; and I use that to get the actual bytes here.  I promise we will talk a lot more about these when we get to handling encodings in Ruby 1.9.&lt;/p&gt;

&lt;p&gt;For now the key point to notice about this example is that US-ASCII and UTF-8 are the same all the way down to the bytes.&lt;/p&gt;

&lt;p&gt;Of course, 128 characters isn't enough to contain the super large Unicode character set.  Eventually you need more bytes.  UTF-8 is a variable length encoding that uses more bytes to represent larger code points as needed.  It does this with a simple set of rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Single byte characters always have a &lt;code&gt;0&lt;/code&gt; in the most significant bit:  &lt;code&gt;0xxxxxxx&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; The number of leading &lt;code&gt;1&lt;/code&gt; bits in the first byte shows how many bytes the code point takes up for multibyte code points.  Thus the first byte of a two byte character will be &lt;code&gt;110xxxxx&lt;/code&gt; and it will be &lt;code&gt;1110xxxx&lt;/code&gt; for a three byte character.&lt;/li&gt;
&lt;li&gt; All other bytes of multibyte sequences begin with &lt;code&gt;10&lt;/code&gt;:  &lt;code&gt;10xxxxxx&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;Again, we can ask Ruby 1.9 to show this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ cat utf8_bytes.rb 
# encoding:  UTF-8

chars = %w[a é …]
chars.each do |char|
  p char.bytes.map { |b| "%08b" % b }
end
$ ruby_dev utf8_bytes.rb 
["01100001"]
["11000011", "10101001"]
["11100010", "10000000", "10100110"]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Notice how different characters are different lengths and how the byte patterns show what to expect as I just described.  This makes UTF-8 a little safer to manipulate, because you won't see a bare &lt;code&gt;"a"&lt;/code&gt; byte that isn't really an &lt;code&gt;"a"&lt;/code&gt; in the data.  You do still have to be careful how you slice up a &lt;code&gt;String&lt;/code&gt; though to avoid breaking up multibyte characters.&lt;/p&gt;
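
&lt;p&gt;Those three rules are enough to work out a character's length from its first byte alone.  Here is a minimal sketch of that idea.  The method name is hypothetical, it reads the bits as text to stay close to the output above rather than using bit operations, and it caps sequences at four bytes, which is the limit in UTF-8 as it is specified today:&lt;/p&gt;

```ruby
# Given the first byte of a UTF-8 sequence, return how many bytes the whole
# character occupies, or nil for a continuation byte or an invalid lead byte.
def utf8_sequence_length(first_byte)
  bits = "%08b" % first_byte
  return 1 if bits.start_with?("0")    # rule 1: 0xxxxxxx
  leading_ones = bits.index("0")       # rule 2: count the leading 1 bits
  return nil if leading_ones.nil?      # 11111111 is never valid
  (2..4).cover?(leading_ones) ? leading_ones : nil   # rule 3: 10xxxxxx gives nil
end

"a\u00E9\u2026".each_char do |char|
  p utf8_sequence_length(char.getbyte(0)) == char.bytesize   # true each time
end
```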

&lt;p&gt;All of these facts combine to make UTF-8 a very good choice for universal character encodings, in my opinion.  The characters you need will be there.  Simple ASCII content will be unchanged.  Most software has at least some support for UTF-8 now as well.&lt;/p&gt;

&lt;p&gt;Is Unicode perfect?  No, it's not.&lt;/p&gt;

&lt;p&gt;Some characters have multiple representations.  For example, the Unicode code points are actually a superset of Latin-1 and thus include single code point versions of accented characters like &lt;code&gt;é&lt;/code&gt;.  Unicode also has the concept of combining marks though, where the accent would have one code point and the letter another.  Those are combined into one character when displayed.  This creates some oddities where two &lt;code&gt;String&lt;/code&gt;s could appear to contain the same content but not test equal depending on how they are compared.  It also lessens the benefit of an encoding like UTF-32 since four bytes are just guaranteed for a code point, but it can take multiple code points to build a character.&lt;/p&gt;
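
&lt;p&gt;Here's a sketch of that oddity with the two spellings of &lt;code&gt;é&lt;/code&gt;.  It relies on &lt;code&gt;String#unicode_normalize()&lt;/code&gt;, which arrived in Ruby 2.2, long after the Rubies used elsewhere in this post, and &lt;code&gt;same_text?()&lt;/code&gt; is a made-up helper name:&lt;/p&gt;

```ruby
# Compare two Strings after normalizing both to composed form (NFC).
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

precomposed = "\u00E9"     # the single code point U+00E9
combined    = "e\u0301"    # plain e followed by U+0301, the combining acute accent

p precomposed.unpack("U*")               # [233]
p combined.unpack("U*")                  # [101, 769]
p precomposed == combined                # false
p same_text?(precomposed, combined)      # true
```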

&lt;p&gt;Asian cultures have also been slow to adopt Unicode for a few reasons.  First, Unicode usually makes their data larger.  For example, Shift JIS can represent all the Japanese characters in two bytes while most of them will be three bytes in UTF-8.  Hard drive space is pretty cheap these days, but a 1.5x multiplier on most of your data can be a factor in some cases.&lt;/p&gt;
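
&lt;p&gt;A quick measurement shows the size difference.  This sketch assumes Ruby 1.9 or later, the helper name is invented, and the sample text is the three character Japanese word for "Japanese language" (code points U+65E5, U+672C, and U+8A9E), chosen only as an illustration:&lt;/p&gt;

```ruby
# Bytes needed to store str in the named encoding.
def bytes_in(str, encoding_name)
  str.encode(encoding_name).bytesize
end

japanese = "\u65E5\u672C\u8A9E"
p bytes_in(japanese, "Shift_JIS")   # 6
p bytes_in(japanese, "UTF-8")       # 9
```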

&lt;p&gt;The Unicode Consortium also had to make some hard choices when specifying all of these characters.  One such choice, known as &lt;a href="http://en.wikipedia.org/wiki/Han_unification"&gt;Han Unification&lt;/a&gt;, was heavily debated for a while.  I think many people recognize why the decision was made these days, but the debate definitely slowed Unicode adoption, especially in Japan.&lt;/p&gt;

&lt;p&gt;Finally, there's a lot of data out there not in a Unicode encoding.  Unfortunately, there are issues that can make it hard to convert this data to Unicode flawlessly.  All of these factors combine to make a Unicode-as-a-one-encoding-fits-all philosophy not totally flawless.&lt;/p&gt;

&lt;p&gt;Still, it's absolutely your best bet for support of a wide audience in a single encoding.&lt;/p&gt;

&lt;p&gt;Key take-away points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A character set isn't quite the same as an encoding&lt;/li&gt;
&lt;li&gt;Unicode is one character set that can be encoded several different ways&lt;/li&gt;
&lt;li&gt;Unicode is designed to support all characters used by all people&lt;/li&gt;
&lt;li&gt;You won't find a better default encoding for modern day software as Unicode satisfies a much higher percentage of the world's population than any other single encoding&lt;/li&gt;
&lt;li&gt;UTF-8 is probably the best Unicode encoding to work with when you have the choice because of how well it fits in with plain US-ASCII and the fact that it's a little safer to work with&lt;/li&gt;
&lt;li&gt;Multibyte encodings can be tricky to work with properly, especially encodings like UTF-32 that can contain some normal looking data&lt;/li&gt;
&lt;/ul&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
  <entry>
    <title>What is a Character Encoding?</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/what-is-a-character-encoding"/>
    <id>tag:graysoftinc.com,2008-10-15:/posts/66</id>
    <updated>2014-04-12T19:12:31Z</updated>
    <summary>Building a workable definition of character encodings and trying to give practical examples of how they affect us.</summary>
    <content type="html">&lt;p&gt;The first step to understanding character encodings is that we're going to need to talk a little about how computers store character data.  I know we would love to believe that when we push the &lt;code&gt;a&lt;/code&gt; key on our keyboard, the computer records a little &lt;code&gt;a&lt;/code&gt; symbol somewhere, but that's just fantasy.&lt;/p&gt;

&lt;p&gt;I imagine most of us know that deep in the heart of computers pretty much everything is eventually in terms of ones and zeros.  That means that an &lt;code&gt;a&lt;/code&gt; has to be stored as some number.  In fact, it is.  We can see what number using Ruby 1.8:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -ve 'p ?a'
ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
97
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The unusual &lt;code&gt;?a&lt;/code&gt; syntax gives us a specific character, instead of a full &lt;code&gt;String&lt;/code&gt;.  In Ruby 1.8 it does that by returning the code of that encoded character.  You can also get this by indexing one character out of a &lt;code&gt;String&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby -ve 'p "a"[0]'
ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
97
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These &lt;code&gt;String&lt;/code&gt; behaviors were deemed confusing by the Ruby core team and have been changed in Ruby 1.9.  They now return one-character &lt;code&gt;String&lt;/code&gt;s.  If you want to see the character codes in Ruby 1.9, you can use &lt;code&gt;getbyte()&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby_dev -ve 'p "a".getbyte(0)'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
97
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That shows us how to get the magic number, but it doesn't tell us what the number really is.  When it was decided that we would need to store character data as numbers, &lt;a href="http://en.wikipedia.org/wiki/Image:ASCII_Code_Chart-Quick_ref_card.jpg"&gt;a simple chart was made mapping some numbers to certain characters&lt;/a&gt;.  This mapping is known as US-ASCII or just ASCII.&lt;/p&gt;

&lt;p&gt;Now ASCII covers everything you would find on an English keyboard:  letters in upper and lower case, numbers, and some common symbols.  There was even some room left in the 128-character ASCII mapping for some control characters.&lt;/p&gt;
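
&lt;p&gt;As a quick illustration of that mapping, here is a minimal sketch that assumes Ruby 1.9 or later.  The sample &lt;code&gt;String&lt;/code&gt; and the variable names are just my choices for the example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Each character of a plain English String maps to one ASCII number.
sample = "Hi!"
codes  = sample.bytes.to_a   # to_a keeps this working on Ruby 1.9 too
p codes

# Every one of those numbers fits in the 128 slots (0 through 127) of ASCII.
all_ascii = codes.all? { |code| code.between?(0, 127) }
p all_ascii
&lt;/code&gt;&lt;/pre&gt;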

&lt;p&gt;Life was perfect, right?  Uh, no.&lt;/p&gt;

&lt;p&gt;This led to two facts that went together beautifully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The entire world can't quite get by on just these characters, surprisingly enough&lt;/li&gt;
&lt;li&gt;We had more room in each byte since ASCII was only using seven of the eight bits in a byte (that's how you get 128 characters)&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Awesome.  We still had a spare bit that could buy us 128 more characters and we needed more characters.  It was serendipity!  Just about everyone had great ideas for how we should use these extra 128 characters and they all used them in their own way.  Character encodings were born.&lt;/p&gt;

&lt;p&gt;Because those extra 128 characters could change meaning depending on exactly whose scheme we're using, we say the character data is encoded in that scheme.  You will need to know which encoding is used for that data to read it correctly.&lt;/p&gt;

&lt;p&gt;To give one specific example, the character encoding &lt;a href="http://en.wikipedia.org/wiki/Latin-1"&gt;ISO-8859-1 (also known as Latin-1)&lt;/a&gt; is a common default in some operating systems, programs, and even programming languages.  It fills the extra characters primarily with accented characters useful to many European languages.&lt;/p&gt;
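
&lt;p&gt;To make that concrete, here is a small sketch, assuming Ruby 1.9 or later, that reads one and the same byte under two different encodings.  ISO-8859-5 (a Cyrillic encoding) is only my pick for the contrast, and the variable names are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# One byte with its eighth bit set, so plain ASCII says nothing about it.
raw = "\xE9"

# The same byte read under two different single byte encodings.
as_latin1   = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_cyrillic = raw.dup.force_encoding("ISO-8859-5").encode("UTF-8")

puts as_latin1     # an accented Latin e
puts as_cyrillic   # the Cyrillic letter shcha
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The byte never changes.  Only the chart we look it up in does.&lt;/p&gt;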

&lt;p&gt;Now if it were really just about those extra 128 characters, things still wouldn't be too tricky.  Unfortunately, there's one more twist:  even 256 characters aren't enough for some languages.  Since 256 is all the values we can squeeze out of one little byte, these languages need multibyte character encodings, where it can take more than just one byte to represent a single character.&lt;/p&gt;

&lt;p&gt;Multibyte encodings are generally trickier to work with.  You have to be very careful not to divide data in such a way that a character might be split between the first and second byte (or between other bytes for bigger encodings).&lt;/p&gt;
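
&lt;p&gt;Here is a rough sketch of that danger using UTF-8 as the multibyte encoding.  It assumes Ruby 1.9.3 or later (for &lt;code&gt;byteslice()&lt;/code&gt;), and the sample word is just an illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;word = "caf\u00E9"          # four characters, the last one accented
p word.length               # counts characters
p word.bytesize             # counts bytes, and UTF-8 needs two for the last one

# Cutting by bytes can land in the middle of that final character.
broken = word.byteslice(0, 4)
p broken.valid_encoding?

# Cutting by characters keeps it whole.
intact = word[0, 4]
p intact.valid_encoding?
&lt;/code&gt;&lt;/pre&gt;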

&lt;p&gt;Japanese is a great example here.  Because they have symbols for most words instead of just the pieces used to make words, their language has a few thousand symbols in common usage.  One popular Japanese character encoding is &lt;a href="http://en.wikipedia.org/wiki/Shift_JIS"&gt;Shift JIS&lt;/a&gt; and it needs two bytes to fit some of these characters in.&lt;/p&gt;
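
&lt;p&gt;We can check the byte counts with a short sketch, assuming Ruby 1.9 or later with its bundled Shift JIS transcoder.  The sample word is my own choice:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;japanese = "\u65E5\u672C\u8A9E"   # the word for the Japanese language
sjis     = japanese.encode("Shift_JIS")
p sjis.length      # three characters
p sjis.bytesize    # stored in twice as many bytes

# Plain ASCII letters stay at one byte each in Shift JIS.
ascii_size = "abc".encode("Shift_JIS").bytesize
p ascii_size
&lt;/code&gt;&lt;/pre&gt;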

&lt;p&gt;I've only shared a few specific examples here, but the truth is that there are &lt;a href="http://en.wikipedia.org/wiki/Character_encoding#Popular_character_encodings"&gt;quite a few encodings in common usage today&lt;/a&gt;.  You don't necessarily need to support all of these encodings in every program and, in truth, there are some good reasons not to.  A good first step is just being aware that different encodings exist and different people store their data in different ways.  Modern day programmers can no longer afford to remain ignorant of these issues.&lt;/p&gt;

&lt;p&gt;If you think about it, I'm sure you can imagine instances where the encoding was wrong.  Ever seen a slew of question marks or funny box shaped characters in your email client or shell?  Often this is a sign of the data not being encoded in the scheme the program expected, which leaves the program unable to display the content correctly.  That's what we're trying to avoid.&lt;/p&gt;
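
&lt;p&gt;One way this kind of junk can come about is sketched below.  It assumes Ruby 2.1 or later (for &lt;code&gt;scrub()&lt;/code&gt;), and the question mark placeholder is my stand-in for whatever a given program chooses to show:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Bytes written out as ISO-8859-1, where the single byte 0xE9 is an accented e.
latin1_bytes = "caf\xE9"

# A program that wrongly assumes UTF-8 finds that byte invalid...
misread = latin1_bytes.dup.force_encoding("UTF-8")
p misread.valid_encoding?

# ...and often swaps in a placeholder when it has to display the text.
shown = misread.scrub("?")
puts shown

# Knowing the real encoding lets us recover the intended text.
fixed = latin1_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts fixed
&lt;/code&gt;&lt;/pre&gt;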

&lt;p&gt;Key take-away points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different people the world over store their data in different ways&lt;/li&gt;
&lt;li&gt;All character data has some encoding scheme that tells you how to interpret the data&lt;/li&gt;
&lt;li&gt;You must know the encoding data is in to correctly process it&lt;/li&gt;
&lt;li&gt;Some encodings are harder to work with than others, especially multibyte encodings&lt;/li&gt;
&lt;li&gt;Junk output, like question marks and box shaped characters, is often what you see when programs get confused about the character encoding data is in&lt;/li&gt;
&lt;/ul&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
</feed>
