16 OCT 2008
The Unicode Character Set and Encodings
Since the rise of the various character encodings, there has been a quest to find the one perfect encoding we could all use. It's hard to get everyone to agree about whether or not this has truly been accomplished, but most of us agree that Unicode is as close as it gets.
The goal of Unicode was literally to provide a character set that includes all characters in use today. That's letters and numbers for all languages, all the images needed by pictographic languages, and all symbols. As you can imagine that's quite a challenging task, but they've done very well. Take a moment to browse all the characters in the current Unicode specification to see for yourself. The Unicode Consortium often reminds us that they still have room for more characters as well, so we will be all set when we start meeting alien races.
Now in order to really understand what Unicode is, I need to clear up a point I've played pretty loose with so far: a character set and a character encoding aren't necessarily the same thing. Unicode is one character set, and has multiple character encodings. Allow me to explain.
A character set is just the mapping of symbols to their magic number representations inside the computer. Unicode calls these numbers code points, and they are usually written in the form U+0061, where the U+ means Unicode and the four digit number is the code point in hexadecimal. Thus 0061 is 97 in decimal. That happens to be the Unicode code point for a, and if you remember my previous post well, you will recognize that it matches up with US-ASCII. We'll talk more about that in a bit. It is worth noting though that Ruby 1.8 and 1.9 can show you these code points:
$ ruby -vKUe 'p "aé…".unpack("U*")'
ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
[97, 233, 8230]
$ ruby_dev -ve 'p "aé…".unpack("U*")'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
[97, 233, 8230]
The U pattern for unpack() asks for a Unicode code point and the * just repeats it for each character. Note that I used the -KU switch to get Ruby 1.8 in UTF-8 mode. Ruby 1.9 assumed UTF-8 because of how my environment is configured. We will talk a lot more about those details when we get into specific language features.
Code points aren't what actually gets recorded in a file; they are just abstract numbers for each character. How those characters get written into a data stream is an encoding. There are multiple encodings for Unicode, or multiple ways to record those abstract numbers into files.
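To make that distinction concrete, here's a quick sketch that records the same code point under a few different encodings. It leans on the Ruby 1.9 encode() method I use later in this article, so treat it as a preview and assume your 1.9 build has converters for the encodings named here:
# encoding: UTF-8
str = "é"
p str.unpack("U*")  # => [233] -- the abstract code point, U+00E9
%w[UTF-8 UTF-16BE UTF-32BE].each do |name|
  p [name, str.encode(name).bytes.to_a]
end
# => ["UTF-8", [195, 169]]
# => ["UTF-16BE", [0, 233]]
# => ["UTF-32BE", [0, 0, 0, 233]]
One code point, three different byte sequences on disk. That's the whole difference between a character set and a character encoding.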
Different encodings have different strengths. For example, one possible encoding of Unicode is UTF-32, where 32 bits (or four bytes) are reserved for each code point. This has the advantage that you can always count on four bytes being used (unlike variable length encodings, which we will discuss shortly). An obvious downside though is the wasted space. I mean if you have all ASCII data, you only really need one byte each, but UTF-32 will use four without exception.
You do need to be very careful how you work with multibyte encodings. UTF-32 is a good example of one that can be pretty tricky, because parts of the data can look normal. For example, look at this simple String as Ruby 1.9 sees it:
$ ruby_dev -ve 'p "abc".encode("UTF-32BE")'
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
"\x00\x00\x00a\x00\x00\x00b\x00\x00\x00c"
There are a lot of null bytes in there, but notice how there are also normal "a", "b", and "c" bytes. I'm not going to show how this could happen to avoid encouraging bad habits, but if you replaced just the "a" byte with two bytes like "ab", your encoding is now broken and will eventually cause you problems. You also have to be careful anytime you slice up a String to make sure you don't divide the content mid-character.
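If you just want to see the gap between bytes and characters without poking at the data itself, a quick check like this (in the same Ruby 1.9 build used above) shows it:
utf32 = "abc".encode("UTF-32BE")
p utf32.length    # => 3   characters
p utf32.bytesize  # => 12  bytes, four per character
p utf32[0]        # => "\x00\x00\x00a" -- indexing hands back a whole character, never a lone byte
Slicing by characters like that is safe; slicing by raw byte offsets is how UTF-32 data gets broken.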
Another possible encoding of Unicode is UTF-8. It has become pretty popular for things like email and web pages in recent years for several reasons. First, UTF-8 is 100% compatible with US-ASCII. The lowest 128 code points match their US-ASCII equivalents and UTF-8 encodes these in a single byte. Ruby 1.9 can show us this:
$ cat ascii_and_utf8.rb
str = "abc"
ascii = str.encode("US-ASCII")
utf8 = str.encode("UTF-8")
[ascii, utf8].each do |encoded_str|
  p [encoded_str, encoded_str.encoding.name, encoded_str.bytes.to_a]
end
$ ruby_dev -v ascii_and_utf8.rb
ruby 1.9.0 (2008-10-10 revision 0) [i386-darwin9.5.0]
["abc", "US-ASCII", [97, 98, 99]]
["abc", "UTF-8", [97, 98, 99]]
I've used several new Ruby 1.9 features here. I don't want to go too deeply into these at this point, but briefly: encode() allows me to transcode a String from its current encoding to the one I pass the name for, encoding() gives me the current Encoding object for that String and name() turns that into a simple name, and finally Ruby 1.9 Strings provide Enumerators to walk the content by bytes(), chars(), codepoints(), or lines(), and I use that to get the actual bytes here. I promise we will talk a lot more about these when we get to handling encodings in Ruby 1.9.
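In the meantime, here's a quick sketch of what each of those Enumerators yields, assuming the same Ruby 1.9 build as above:
# encoding: UTF-8
str = "aé"
p str.bytes.to_a       # => [97, 195, 169] -- the raw UTF-8 bytes
p str.codepoints.to_a  # => [97, 233] -- the abstract Unicode code points
p str.chars.to_a       # the two characters, "a" and "é"
p str.lines.to_a       # the whole String as its single line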
For now the key point to notice about the ascii_and_utf8.rb example is that US-ASCII and UTF-8 are the same all the way down to the bytes.
Of course, 128 characters isn't enough to contain the super large Unicode character set. Eventually you need more bytes. UTF-8 is a variable length encoding that uses more bytes to represent larger code points as needed. It does this with a simple set of rules:
- Single byte characters always have a 0 in the most significant bit: 0xxxxxxx.
- For multibyte code points, the number of leading 1 bits in the first byte shows how many bytes the code point takes up. Thus the most significant bits of a two byte character will be 110xxxxx, and they will be 1110xxxx for a three byte character.
- All other bytes of a multibyte sequence begin with 10: 10xxxxxx.
Again, we can ask Ruby 1.9 to show this:
$ cat utf8_bytes.rb
# encoding: UTF-8
chars = %w[a é …]
chars.each do |char|
  p char.bytes.map { |b| "%08b" % b }
end
$ ruby_dev utf8_bytes.rb
["01100001"]
["11000011", "10101001"]
["11100010", "10000000", "10100110"]
Notice how different characters are different lengths and how the byte patterns show what to expect, as I just described. This makes UTF-8 a little safer to manipulate, because you won't see a bare "a" byte that isn't really an "a" in the data. You do still have to be careful how you slice up a String though to avoid breaking up multibyte characters.
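In Ruby 1.9 terms, that mostly means slicing by characters instead of by bytes. A small sketch, again assuming the build used above:
# encoding: UTF-8
str = "aé…"
p str[0, 2].bytes.to_a                # => [97, 195, 169] -- [] counts characters in 1.9, so "é" keeps both bytes
p str.chars.first(2).join.bytes.to_a  # => [97, 195, 169] -- the chars() Enumerator is just as safe
p str.bytes.to_a[0, 2]                # => [97, 195] -- slicing raw bytes leaves only half of "é" behind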
All of these facts combine to make UTF-8 a very good choice for a universal character encoding, in my opinion. The characters you need will be there. Simple ASCII content will be unchanged. Most software has at least some support for UTF-8 now as well.
Is Unicode perfect? No, it's not.
Some characters have multiple representations. For example, the Unicode code points are actually a superset of Latin-1 and thus include precomposed, single code point versions of accented characters like é. Unicode also has the concept of combining marks though, where the accent would have one code point and the letter another. Those are combined into one character when displayed. This creates some oddities where two Strings could appear to contain the same content but not test equal, depending on how they are compared. It also lessens the benefit of an encoding like UTF-32, since four bytes are only guaranteed per code point, but it can take multiple code points to build a character.
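To see that oddity in action, here's a small sketch that builds both forms of é straight from code points with pack(), which works in Ruby 1.8's -KU mode as well as 1.9:
# encoding: UTF-8
precomposed = [0x00E9].pack("U*")          # é as one precomposed code point (U+00E9)
combining   = [0x0065, 0x0301].pack("U*")  # e followed by a combining acute accent (U+0301)
p precomposed == combining  # => false, even though both display as "é"
p precomposed.unpack("U*")  # => [233]
p combining.unpack("U*")    # => [101, 769]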
Asian cultures have also been slow to adopt Unicode for a few reasons. First, Unicode usually makes their data larger. For example, Shift JIS can represent all the Japanese characters in two bytes while most of them will be three bytes in UTF-8. Hard drive space is pretty cheap these days, but a 1.5x multiplier on most of your data can be a factor in some cases.
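You can check that size difference with Ruby 1.9's encode(), assuming your build includes the Shift_JIS converters (treat this as a sketch):
# encoding: UTF-8
japanese = "日本語"
p japanese.encode("Shift_JIS").bytesize  # => 6 -- two bytes per character
p japanese.encode("UTF-8").bytesize      # => 9 -- three bytes per character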
The Unicode Consortium also had to make some hard choices when specifying all of these characters. One such choice, known as Han Unification, was heavily debated for a while. I think many people recognize why the decision was made these days, but the debate definitely slowed Unicode adoption, especially in Japan.
Finally, there's a lot of data out there that isn't in a Unicode encoding. Unfortunately, there are issues that can make it hard to convert this data to Unicode flawlessly. All of these factors combine to keep the Unicode-as-a-one-encoding-fits-all philosophy from being a perfect solution.
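One reason the conversion is tricky: the same legacy byte can stand for different characters depending on which encoding wrote it, and plain text files don't record which one that was. A sketch, assuming a Ruby 1.9 build that knows both of the legacy encodings named here:
# encoding: UTF-8
byte = "\xE9"
p byte.force_encoding("ISO-8859-1").encode("UTF-8").unpack("U*")  # => [233], é, if the data was Latin-1
p byte.force_encoding("ISO-8859-7").encode("UTF-8").unpack("U*")  # => [953], Greek ι, if it was ISO-8859-7
Unless something outside the file tells you the original encoding, the best you can do is guess.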
Still, it's absolutely your best bet for support of a wide audience in a single encoding.
Key take-away points:
- A character set isn't quite the same as an encoding
- Unicode is one character set that can be encoded several different ways
- Unicode is designed to support all characters used by all people
- You won't find a better default encoding for modern day software as Unicode satisfies a much higher percentage of the world's population than any other single encoding
- UTF-8 is probably the best Unicode encoding to work with when you have the choice because of how well it fits in with plain US-ASCII and the fact that it's a little safer to work with
- Multibyte encodings can be tricky to work with properly, especially encodings like UTF-32 that can contain some normal looking data
Comments (8)
-
Allan Odgaard October 17th, 2008
While you indirectly say so, I think it is worth putting emphasis on the fact that UTF-8 data implicitly carry a checksum in the multi-byte sequences.
This is nice because plain text files are normally not tagged with an encoding (as they have no natural place for such a tag), but the checksum can be used instead.
For example, a user who has been using CP-1252 for all of his text files can in practice move to UTF-8 file-by-file by performing a UTF-8 validity check when loading each file: should the sequence fail to be valid UTF-8, then it is one of his old CP-1252 files.
-
Allan has a great point there. You can use code like the following in Ruby 1.8 to validate UTF-8:
#!/usr/bin/env ruby -wKU

module UTF8Checksum
  def is_utf8?
    where_we_were = pos
    begin
      loop do
        break if eof?
        first_byte = "%08b" % read(1)[0]
        unless first_byte[0] == ?0
          bytes_left  = first_byte[/\A1+/].size - 1
          extra_bytes = read(bytes_left)
          unless extra_bytes and extra_bytes.size == bytes_left and
                 extra_bytes.split("").
                             all? { |b| ("%08b" % b[0]) =~ /\A10/ }
            return false
          end
        end
      end
      return true
    ensure
      seek(where_we_were)
    end
  end
end

class IO
  include UTF8Checksum
end
ARGF.extend(UTF8Checksum)

class String
  def is_utf8?
    require "stringio"
    StringIO.new(self).extend(UTF8Checksum).is_utf8?
  end
end

if __FILE__ == $PROGRAM_NAME
  answer = ARGF.is_utf8?
  p answer
  exit(answer ? 0 : 1)
end
In Ruby 1.9, you could check that a String is UTF-8 with the simple code:
str.force_encoding("UTF-8").valid_encoding?
-
Thanks James for the excellent tutorial! Thanks Alan for giving the straight solution to my problem! :-)
-
Great article! It made the difference between charset and encoding clear to me.
-
I think it would be great to have two links per page which take the user to the next and previous topics. Instead of showing the link text as 'next' and 'previous', it should show the title of the target directly, such as 'ruby 1.8 encoding', which is, say, the title of the next topic. It would be even better if you had the table of contents on each page.
-
I've been working on a rewrite of this blog which I will get finished with eventually. It will handle series much better, I promise. I write a lot of them so the blog needs to be more tuned to that.
-
Hi James,
First off, thanks for a great article. In fact thanks for the whole series. I'm not through yet but I'm sure I'll love the rest as much as the first ones.
On to my question: It's about the part where you show the binary representation of UTF-8 encoded bytes. Specifically this part:
a = ["1100001"]
Should this mentally be read as
a = ["01100001"] ?
Otherwise it somehow confuses me, since per the rules above it is a single byte and "1100001" is the binary representation for "a".
-
Correct. Ruby left out the most significant bit since it's unset. It has to be a 0 though by the rules. You've got it.