Gray Soft

The programming blog of James Edward Gray II (JEG2).
  • 5

    NOV
    2008

    The $KCODE Variable and jcode Library

    All of the Ruby files I create start with the same Shebang line:

    #!/usr/bin/env ruby -wKU
    

    It's not really needed for every file since it generally only matters if the file is executed. However, I tend to go ahead and add it to all Ruby files I build for several reasons:

    • You never know when a file may be executed (if __FILE__ == $PROGRAM_NAME; end sections are often added to libraries, for example)
    • It makes it obvious the file is Ruby code
    • It shows the rules this code expects: -w and -KU
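
    The idiom from the first bullet can be sketched as follows. This is a minimal illustration, not code from this post: the Greeting module and its hello() method are hypothetical names.

```ruby
# A tiny library file that can also be run directly.
# Greeting and hello are hypothetical names used only for illustration.
module Greeting
  def self.hello(name)
    "Hello, #{name}!"
  end
end

# This section runs only when the file is executed as a script,
# not when it is loaded with require.
if __FILE__ == $PROGRAM_NAME
  puts Greeting.hello(ARGV.first || "world")
end
```

    Loading the file with require defines Greeting and nothing more, while running it directly also prints a greeting.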

    The rules I mention here, specified by command-line switches, are the main point of interest. -w turns on Ruby's warnings, which are very handy. I recommend doing that whenever you can. But that doesn't have anything to do with character encodings. -KU does.

    -KU sets a magic Ruby variable: $-K or $KCODE. You can do the same in your code if you aren't in a position to control the command-line arguments:

    $KCODE = "U"
    

    You probably recognize the U as a name for Ruby 1.8's UTF-8 encoding, from my earlier list of encodings. It can also be set to N (the default), E, or S. Modern versions of Rails do set $KCODE = "U" for you.

    Read more…

  • 30

    OCT
    2008

    Bytes and Characters in Ruby 1.8

    Gregory Brown said, in a training session at the Lone Star Rubyconf, "Ruby 1.8 works in bytes. Ruby 1.9 works in characters." The truth of Ruby 1.9 is maybe a little more complicated and we will discuss all of that eventually, but Greg is dead right about Ruby 1.8.

    In Ruby 1.8, a String is always just a collection of bytes.

    The important question is, how does that one golden rule relate to all that we've learned about character encodings? Essentially, it puts all the responsibility on you as the developer. Ruby 1.8 leaves it to you to determine what to do with those bytes and it doesn't provide a lot of encoding savvy help. That's why knowing at least the basics of encodings is so important when working with Ruby 1.8.

    There are plusses and minuses to every system and this one is no exception. On the side of plusses, Ruby 1.8 can pretty much support any encoding you can imagine. After all, a character encoding is just some bytes that somehow map to a set of characters and all Ruby 1.8 Strings are just some bytes. If you say a String holds Latin-1 data and treat it as such, that's fine by Ruby.
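
    To make the byte-oriented view concrete, here is a minimal sketch. It assumes Ruby 1.9 or later, where a String can report both counts. In Ruby 1.8, length() gave only the byte count.

```ruby
# Assumes Ruby 1.9 or later. The \u escapes build a UTF-8 String
# without depending on this file's source encoding.
word = "r\u00E9sum\u00E9"  # "résumé"

p word.bytesize        # => 8, the number of bytes
p word.length          # => 6, the number of characters
p word.bytes.first(3)  # => [114, 195, 169], "r" followed by the two bytes of "é"
```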

    Read more…

  • 21

    OCT
    2008

    General Encoding Strategies

    Before we get into specifics, let's try to distill a few best practices for working with encodings. I'm sure you can tell that there's a lot that needs to be considered with encodings, so let's try to focus in on a few key points that will help us the most.

    Use UTF-8 Everywhere You Can

    We know UTF-8 isn't perfect, but it's pretty darn close to perfect. There is no other single encoding you could pick that has the potential to satisfy such a wide audience. It's our best bet. For these reasons, UTF-8 is quickly becoming the preferred encoding for the Web, email, and more.

    If you have a say over what encoding or encodings your software will accept, support, and deliver, choose UTF-8 whenever you can. This is absolutely the best default.

    Get in the Habit of Documenting Your Encodings

    We learned that you must know the encoding of your data to work with it properly. While there are tools to help you guess an encoding, you really want to avoid being in that position. Part of how to make that happen is to be a good citizen and document your encodings at every step.

    Read more…

  • 16

    OCT
    2008

    The Unicode Character Set and Encodings

    Since the rise of the various character encodings, there has been a quest to find the one perfect encoding we could all use. It's hard to get everyone to agree about whether or not this has truly been accomplished, but most of us agree that Unicode is as close as it gets.

    The goal of Unicode was literally to provide a character set that includes all characters in use today. That's letters and numbers for all languages, all the images needed by pictographic languages, and all symbols. As you can imagine that's quite a challenging task, but they've done very well. Take a moment to browse all the characters in the current Unicode specification to see for yourself. The Unicode Consortium often reminds us that they still have room for more characters as well, so we will be all set when we start meeting alien races.

    Now in order to really understand what Unicode is, I need to clear up a point I've played pretty loose with so far: a character set and a character encoding aren't necessarily the same thing. Unicode is one character set, and has multiple character encodings. Allow me to explain.
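
    As a rough illustration of the distinction, the sketch below takes one character from the Unicode character set and shows its bytes under three Unicode encodings. It assumes Ruby 1.9 or later, which is not the Ruby 1.8 environment most of this series has used so far.

```ruby
# Assumes Ruby 1.9 or later.
e_acute = "\u00E9"  # "é", Unicode code point U+00E9

p e_acute.codepoints.to_a                 # => [233]
p e_acute.bytes.to_a                      # => [195, 169] (UTF-8)
p e_acute.encode("UTF-16BE").bytes.to_a   # => [0, 233]
p e_acute.encode("UTF-32BE").bytes.to_a   # => [0, 0, 0, 233]
```

    The code point stays the same in every case. Only the bytes used to store it change.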

    Read more…

  • 15

    OCT
    2008

    What is a Character Encoding?

    The first step to understanding character encodings is that we're going to need to talk a little about how computers store character data. I know we would love to believe that when we push the a key on our keyboard, the computer records a little a symbol somewhere, but that's just fantasy.

    I imagine most of us know that deep in the heart of computers pretty much everything is eventually in terms of ones and zeros. That means that an a has to be stored as some number. In fact, it is. We can see what number using Ruby 1.8:

    $ ruby -ve 'p ?a'
    ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
    97
    

    The unusual ?a syntax gives us a specific character, instead of a full String. In Ruby 1.8 it does that by returning the code of that encoded character. You can also get this by indexing one character out of a String:

    $ ruby -ve 'p "a"[0]'
    ruby 1.8.6 (2008-08-11 patchlevel 287) [i686-darwin9.4.0]
    97
    

    These String behaviors were deemed confusing by the Ruby core team and have been changed in Ruby 1.9. They now return one-character Strings. If you want to see the character codes in Ruby 1.9 you can use getbyte():
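
    A minimal sketch of that change, assuming Ruby 1.9 or later. The ord() call is included as a related method, not one this excerpt mentions.

```ruby
# Assumes Ruby 1.9 or later.
char  = ?a               # a one-character String now
first = "a"[0]           # also a one-character String
code  = "a".getbyte(0)   # the byte value Ruby 1.8 used to return

p char        # => "a"
p first       # => "a"
p code        # => 97
p "a".ord     # => 97
```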

    Read more…

  • 14

    OCT
    2008

    Understanding M17n (Multilingualization)

    Big changes are coming to Ruby in version 1.9 with regard to character encodings. Ruby is going from a language with some of the weakest character encoding support to arguably some of the best support out there for working with different encodings. We're all grown up now.

    The downside is that the new code comes with a good-sized learning curve. I would know because I recently battled through figuring it out so I could add support to the standard CSV library for nearly all of the encodings. It was a battle too. It's brave new territory and there's not a lot of help out there yet for understanding Ruby's new features.

    I'm hoping to change that.

    This posting will be the start of a new series of blog articles designed to explain the character encoding support in Ruby 1.9. I'm going to assume you know absolutely nothing about character encodings though and begin by explaining in detail what they are and why we have them.

    After that, we're going to examine the character encoding support in Ruby 1.8. There's a lot less support there to examine, but it's not well understood and I'm hoping that seeing it in detail will help with understanding how and why Ruby 1.9 is changing.

    Read more…

  • 13

    OCT
    2008

    The Secret Shell Helper

    Someone pops onto the Ruby Talk mailing list fairly regularly asking how to break up content like:

    one "two" "a longer three"
    

    They expect to end up with a three-element Array, where the third item will contain spaces. They generally expect the quotes will have been removed as well.

    If your needs are very, very simple you may be able to handle this with a regular expression:

    data = 'one "two" "a longer three"'
    p data.scan(/"([^"]*)"|(\S+)/).flatten.compact
    # >> ["one", "two", "a longer three"]
    

    That just searches for either a set of quotes with some non-quote characters between them or a run of non-whitespace characters. Those are the two possibilities for the fields. Note that the two separate captures here mean scan() will return contents in the form:

    [[nil, "one"], ["two", nil], ["a longer three", nil]]
    

    That's why I added a flatten() and compact() to get down to the actual matches.

    The regular expression approach can get pretty complex though if any kind of escaping for quotes is involved. When that happens, you may need to step up to a parser.
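
    One parser that ships with Ruby is the standard library's Shellwords, which splits text using Bourne shell quoting rules. Whether that is the helper this post's title refers to is an assumption here. The sketch below only shows that it handles the example data, including escaped quotes.

```ruby
require "shellwords"

data = 'one "two" "a longer three"'
fields = Shellwords.split(data)
p fields
# >> ["one", "two", "a longer three"]

# Backslash-escaped quotes inside a quoted field are handled too.
escaped = 'say "a \"quoted\" word"'
escaped_fields = Shellwords.split(escaped)
p escaped_fields
# >> ["say", "a \"quoted\" word"]
```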

    Read more…

  • 10

    OCT
    2008

    All About Struct

    I build small little data classes all the time and there's a reason for that: Ruby makes it trivial to do so. That's a big win because we all know that what is a trivial data class today will be tomorrow's super object, right? If I start out using a simple Array or Hash, I'll probably end up redoing most of the logic at both ends eventually. Or I can start with the trivial class and grow it naturally.

    The key to all this though is that I don't write those classes myself! That's what Ruby is for. More specifically, you need to learn to love Struct. Allow me to show you what I mean.

    Imagine I need a basic class to represent a Contact. Ruby gives us so many shortcuts that the class could be very small even without Struct:

    class Contact
      def initialize(first, last, email)
        @first = first
        @last  = last
        @email = email
      end
    
      attr_accessor :first, :last, :email
    end
    

    You could shorten that up more with some multiple assignment if you like, but that's the basics. Now using Struct is even easier:
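
    A sketch of what the Struct version might look like. The sample names and addresses are made up for illustration.

```ruby
# Struct.new builds a class with the initializer and accessors for us.
Contact = Struct.new(:first, :last, :email)

james = Contact.new("James", "Gray", "james@example.com")
james.first                       # => "James"
james.email = "jeg2@example.com"  # the accessors are writable too
james.to_a                        # => ["James", "Gray", "jeg2@example.com"]
```

    Because Contact is still an ordinary class, it can be reopened later to add methods as the data class grows.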

    Read more…

  • 9

    OCT
    2008

    Dual Interface Modules

    I'm guessing we've all seen Ruby's Math Module. I'm sure you know that you can call methods in it as "module (or class) methods:"

    Math.sqrt(4)  # => 2.0
    

    That's just one way to use the Math Module though. Another is to treat it as a mixin and call the same methods as instance methods:

    module MyMathyThing
      extend Math
    
      def self.my_sqrt(*args)
        sqrt(*args)
      end
    end
    
    MyMathyThing.my_sqrt(4)  # => 2.0
    

    Ruby ships with a few Modules that work like this, including the mighty Kernel.

    How is this dual interface accomplished? With the seldom seen module_function() method. You use this much like you would private(), to affect all following method definitions:

    module Greeter
      module_function
    
      def hello
        "Hello!"
      end
    end
    
    module MyGreeter
      extend Greeter
    
      def self.my_hello
        hello
      end
    end
    
    Greeter.hello       # => "Hello!"
    MyGreeter.my_hello  # => "Hello!"
    

    As you can see, it magically gives us the dual interface for the methods beneath it. You can also affect specific methods by name, just as you could with private(). This is equivalent to my definition above:
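
    A minimal sketch of the by-name form, reusing the Greeter and MyGreeter modules from the example above. The full post's own listing is not part of this excerpt.

```ruby
module Greeter
  def hello
    "Hello!"
  end

  # Name the method explicitly instead of affecting all later definitions.
  module_function :hello
end

module MyGreeter
  extend Greeter

  def self.my_hello
    hello
  end
end

Greeter.hello       # => "Hello!"
MyGreeter.my_hello  # => "Hello!"
```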

    Read more…

    In: Ruby Voodoo | Tags: APIs | 2 Comments
  • 8

    OCT
    2008

    Readable Booleans

    There's a great little trick you can do to improve the readability of your code. A common problem is dealing with methods that take boolean flag arguments. Here's an example I ran into just today in a Rails application:

    def rating_stars(..., clickable = false)
      # ...
    end
    

    The problem with this is that you typically see calls like this scattered around the application:

    <%= rating_stars(..., true) %>
    

    Would you know what true did there if I hadn't shown you the name of the variable first? I didn't. I had to go hunting for that method definition.

    Ironically, the opposite problem, a magical dangling false, is much rarer in my experience. That's typically the default for these kinds of arguments and it just makes more sense and reads better to leave it out.

    Anyway, the point is that we can typically improve the ease of understanding the common case. Remember that in Ruby false and nil are false while everything else is true. That means that truth is very loosely defined and we can pass a lot of things for our boolean flag value. For example, after looking up the method and understanding what was needed, I chose to call it like this:
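
    The call itself is not shown in this excerpt. One way to take advantage of that loose definition of truth, offered here as an assumption, is to pass a descriptive Symbol in place of true. The rating_stars() below is a hypothetical stand-in for the real helper.

```ruby
# rating_stars here is a hypothetical stand-in for the Rails helper above.
# Its first parameter and return value are invented for illustration.
def rating_stars(rating, clickable = false)
  stars = "*" * rating
  clickable ? "<a href=\"#\">#{stars}</a>" : stars
end

rating_stars(3)              # => "***"
rating_stars(3, true)        # works, but what does true mean here?
rating_stars(3, :clickable)  # any Symbol is true, and this one documents itself
```

    The method body does not change at all. Only the call site becomes easier to read.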

    Read more…

    In: Ruby Voodoo | Tags: APIs & Style | 2 Comments