<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Gray Soft / Character Encodings / General Encoding Strategies</title>
  <id>tag:graysoftinc.com,2014-03-20:/posts/68</id>
  <updated>2014-04-12T19:25:05Z</updated>
  <link rel="self" href="http://graysoftinc.com/character-encodings/general-encoding-strategies/feed.xml"/>
  <link rel="alternate" href="http://graysoftinc.com/character-encodings/general-encoding-strategies"/>
  <author>
    <name>James Edward Gray II</name>
  </author>
  <entry>
    <title>The 4th Comment on "General Encoding Strategies"</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/general-encoding-strategies#comment_469"/>
    <id>tag:graysoftinc.com,2012-03-07:/comments/469</id>
    <updated>2014-04-12T19:25:05Z</updated>
    <summary>I stand by my recommendation.  UTF-8 is still the best choice.

If your program needs to work with other encodings, transcode to UTF-8 on the way in, work with that one encoding internally, and transcode as needed on the way back out.  Handling ...</summary>
    <content type="html">&lt;p&gt;I stand by my recommendation.  UTF-8 is still the best choice.&lt;/p&gt;

&lt;p&gt;If your program needs to work with other encodings, transcode to UTF-8 on the way in, work with that one encoding internally, and transcode as needed on the way back out.  Handling multiple encodings internally is extremely complex.&lt;/p&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
  <entry>
    <title>The 3rd Comment on "General Encoding Strategies"</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/general-encoding-strategies#comment_468"/>
    <id>tag:graysoftinc.com,2012-03-07:/comments/468</id>
    <updated>2014-04-12T19:25:05Z</updated>
    <summary>UTF-8 everywhere?
Nay.
Operating systems do not tend to use these internally. 
OS X uses UTF-8, UTF-16, and UTF-32 where appropriate and handles conversion invisibly most of the time. ICU library under the hood. 

I would say it is better advic...</summary>
    <content type="html">&lt;p&gt;UTF-8 everywhere?&lt;br&gt;
Nay.&lt;br&gt;
Operating systems do not tend to use these internally. &lt;br&gt;
OS X uses UTF-8, UTF-16, and UTF-32 where appropriate and handles conversion invisibly most of the time. ICU library under the hood. &lt;/p&gt;

&lt;p&gt;I would say it is better advice to not pretend UTF-8 is the new ASCII and just try to learn to do one thing. &lt;br&gt;
It is a much better idea to encourage all coders to do their homework and know that there will be heterogeneous environments. &lt;br&gt;
Files can contain anything while file systems tend to use some specific encoding for file names plus some kind of limitations. &lt;/p&gt;

&lt;p&gt;It IS a good idea for software to be prepared to accept data in multiple encodings and internally convert all of it to a common encoding for use within the app. &lt;/p&gt;</content>
    <author>
      <name>JJ</name>
    </author>
  </entry>
  <entry>
    <title>The 2nd Comment on "General Encoding Strategies"</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/general-encoding-strategies#comment_350"/>
    <id>tag:graysoftinc.com,2010-03-18:/comments/350</id>
    <updated>2014-03-27T01:38:27Z</updated>
    <summary>Thanks for all of this great info, buddy! I really learned about my problems with Ruby strings here. I'll spread the word =D</summary>
    <content type="html">&lt;p&gt;Thanks for all of this great info, buddy! I really learned about my problems with Ruby strings here. I'll spread the word =D&lt;/p&gt;</content>
    <author>
      <name>Ignacio De La Madrid</name>
    </author>
  </entry>
  <entry>
    <title>The 1st Comment on "General Encoding Strategies"</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/general-encoding-strategies#comment_278"/>
    <id>tag:graysoftinc.com,2009-04-24:/comments/278</id>
    <updated>2014-03-27T01:38:26Z</updated>
    <summary>If you do find yourself in a situation where you don't know a character encoding and you are forced to guess it (again try to avoid this whenever possible), Andrew S. Townley posted a message to Ruby Talk showing [how to use the rchardet gem to gu...</summary>
    <content type="html">&lt;p&gt;If you do find yourself in a situation where you don't know a character encoding and you are forced to guess it (again try to avoid this whenever possible), Andrew S. Townley posted a message to Ruby Talk showing &lt;a href="http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/334884"&gt;how to use the rchardet gem to guess an encoding&lt;/a&gt;.&lt;/p&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
  <entry>
    <title>General Encoding Strategies</title>
    <link rel="alternate" href="http://graysoftinc.com/character-encodings/general-encoding-strategies"/>
    <id>tag:graysoftinc.com,2008-10-21:/posts/68</id>
    <updated>2014-04-12T19:30:46Z</updated>
    <summary>This is an attempt to establish general encoding strategies.</summary>
    <content type="html">&lt;p&gt;Before we get into specifics, let's try to distill a few best practices for working with encodings.  I'm sure you can tell that there's a lot that needs to be considered with encodings, so let's try to focus in on a few key points that will help us the most.&lt;/p&gt;

&lt;h4&gt;Use UTF-8 Everywhere You Can&lt;/h4&gt;

&lt;p&gt;We know UTF-8 isn't perfect, but it's pretty darn close to perfect.  There is no other single encoding you could pick that has the potential to satisfy such a wide audience.  It's our best bet.  For these reasons, &lt;a href="ftp://ftp.isi.edu/in-notes/rfc2277.txt"&gt;UTF-8 is quickly becoming the preferred encoding for the Web, email, and more&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have a say over what encoding or encodings your software will accept, support, and deliver, choose UTF-8 whenever you can.  This is absolutely the best default.&lt;/p&gt;

&lt;h4&gt;Get in the Habit of Documenting Your Encodings&lt;/h4&gt;

&lt;p&gt;We learned that you must know the encoding of your data to work with it properly.  While there are tools to help you guess an encoding, you really want to avoid being in that position.  Part of making that happen is being a good citizen and documenting your encodings at every step.&lt;/p&gt;

&lt;p&gt;If you send an email, make sure it specifies a correct character set.  Add a meta tag to Web pages to state the encoding.  View the source of this page for an example.  Document the encodings your APIs accept and return.  This will raise everyone's encoding awareness, which helps us all.&lt;/p&gt;

&lt;h4&gt;Develop Your Encoding Safe Senses&lt;/h4&gt;

&lt;p&gt;You need to get into the habit of thinking, "Is this encoding safe?"  When you call a method, ask the question.  When you hand your data off to some process, reality-check some of the results.&lt;/p&gt;

&lt;p&gt;Have you ever done something like &lt;code&gt;str[1..-2]&lt;/code&gt; in Ruby 1.8?  I sure have and it's not safe.  You're cutting bytes there and that may dice a bigger character into pieces.  Then your data is junk.&lt;/p&gt;
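
&lt;p&gt;As a minimal sketch of the difference, assuming Ruby 1.9 or later (where &lt;code&gt;str[1..-2]&lt;/code&gt; counts characters rather than bytes), the following uses &lt;code&gt;String#byteslice&lt;/code&gt; to imitate the byte-oriented 1.8 behavior.  The helper names &lt;code&gt;byte_trim&lt;/code&gt; and &lt;code&gt;char_trim&lt;/code&gt; are made up for illustration.&lt;/p&gt;

```ruby
# Hypothetical helper: drop the first and last BYTE, as Ruby 1.8 indexing did.
def byte_trim(str)
  str.byteslice(1, str.bytesize - 2)
end

# Hypothetical helper: drop the first and last CHARACTER.
def char_trim(str)
  str[1..-2]
end

# "resume" with accented letters, wrapped in guillemets, as UTF-8.
quoted = "\u00ABr\u00E9sum\u00E9\u00BB"

p byte_trim(quoted).valid_encoding?         # => false (outer characters cut in half)
p char_trim(quoted).valid_encoding?         # => true
p char_trim(quoted) == "r\u00E9sum\u00E9"   # => true
```

&lt;p&gt;With plain ASCII input both helpers return the same thing, which is exactly why this kind of bug tends to hide until multibyte data shows up.&lt;/p&gt;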

&lt;p&gt;This may sound like paranoia, but it's really not as bad as it seems.  There tend to just be a few key points where you need to go out of your way to protect the data and it's asking this question repeatedly that teaches you to spot those.&lt;/p&gt;

&lt;p&gt;To give an example, while enhancing the standard &lt;code&gt;CSV&lt;/code&gt; library for Ruby 1.9's m17n (multilingualization) implementation, I needed to use some user-provided data in a &lt;code&gt;Regexp&lt;/code&gt;.  That's easy, right?&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="no"&gt;Regexp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Luckily, my instincts were just good enough to wonder, is that safe?  I fed some UTF-32 data to &lt;code&gt;Regexp.escape()&lt;/code&gt; to find out.  Remember, seemingly normal data in a wide multibyte encoding is great for testing edge cases.  Ruby broke my data:&lt;/p&gt;

&lt;div class="highlight highlight-ruby"&gt;&lt;pre&gt;&lt;span class="nb"&gt;p&lt;/span&gt; &lt;span class="no"&gt;Regexp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"+"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"UTF-32BE"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\x00\x00\x00\\&lt;/span&gt;&lt;span class="s2"&gt;+"&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Now, this was just a case of Ruby 1.9 still being rough around the edges.  It looks like this has been fixed in current builds:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby_dev -ve 'p Regexp.escape("+".encode("UTF-32BE"))'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
"\x00\x00\x00\\\x00\x00\x00+"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Still, the point stands: you can't even trust Ruby at times.  Be cautious.&lt;/p&gt;
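
&lt;p&gt;One way to act on that caution is to keep a small helper around for building this kind of test data.  This is only a sketch: &lt;code&gt;wide_fixture&lt;/code&gt; is a hypothetical name, and it assumes Ruby 1.9 or later for &lt;code&gt;String#encode&lt;/code&gt;.&lt;/p&gt;

```ruby
# Hypothetical helper: re-encode ordinary text into a wide encoding for tests.
def wide_fixture(text, encoding = "UTF-32BE")
  text.encode(encoding)
end

plus = wide_fixture("+")

p plus.bytes                    # => [0, 0, 0, 43]
p plus.length                   # => 1
p plus.encode("UTF-8") == "+"   # => true
```

&lt;p&gt;Handing the result of &lt;code&gt;wide_fixture&lt;/code&gt; to a method and inspecting the bytes that come back is a quick way to see whether that method worked on characters or on bytes.&lt;/p&gt;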

&lt;p&gt;The natural conclusion of this is that you want to know how encodings are handled throughout the pipeline your data will pass through.  Does your HTML arrange to receive form data in UTF-8?  Is Ruby in UTF-8 mode when it receives that data?  Does the MySQL table you store that data in have an encoding set to UTF-8?  Modern versions of Rails even handle two of those three steps for you.  That's why it's important to look into the tools you use.&lt;/p&gt;
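
&lt;p&gt;Wherever data enters your Ruby code in that pipeline, a cheap reality check is possible.  The sketch below assumes Ruby 2.0 or later (for &lt;code&gt;String#b&lt;/code&gt;) and that the incoming bytes are supposed to be UTF-8.  The &lt;code&gt;ensure_utf8&lt;/code&gt; helper is hypothetical, not something Ruby or Rails provides.&lt;/p&gt;

```ruby
# Hypothetical boundary check: tag incoming bytes as UTF-8 and refuse
# anything that is not actually valid in that encoding.
def ensure_utf8(data)
  text = data.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "input is not valid UTF-8" unless text.valid_encoding?
  text
end

good = ensure_utf8("caf\xC3\xA9".b)  # raw bytes that happen to be UTF-8
p good.encoding

begin
  ensure_utf8("caf\xE9".b)           # the same word as ISO-8859-1 bytes
rescue ArgumentError => error
  puts error.message
end
```

&lt;p&gt;Failing loudly at the boundary keeps bad bytes from travelling further down the pipeline, where they are much harder to trace back to their source.&lt;/p&gt;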

&lt;p&gt;These strategies aren't all you will need, but they are a terrific start.  This is not too much to remember and it will greatly increase your awareness of the issues.  That's the most important thing.&lt;/p&gt;</content>
    <author>
      <name>James Edward Gray II</name>
    </author>
  </entry>
</feed>
