The Ruby VM Interview

Interviews with Matz and ko1 about the next generation Ruby VM.

13 JUL 2007

The Ruby VM: Episode IV

We've talked about threads, so let's talk a little about character encodings. This is another big change planned for Ruby's future. Matz, you have stated that you plan to add m17n (multilingualization) support to Ruby. Can you talk a little about what that change actually means for Ruby users?

Matz:

Nothing much, except for some incompatibility in string manipulation. For example, "abc"[0] will give "a" instead of 97, and string indexing will be based on characters instead of bytes. I guess the biggest difference is that we can officially declare we support Unicode. ;-)
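
A minimal sketch of the indexing change described above, assuming the behavior that later shipped in Ruby 1.9 and newer. The sample strings and variable names are illustrative only.

```ruby
# Character-based indexing, assuming Ruby 1.9 or later.
ascii = "abc"
first_char = ascii[0]          # a one-character String, no longer an Integer
first_byte = ascii.getbyte(0)  # the numeric byte value is still available

# With multibyte text, character and byte counts differ.
word = "caf\u00E9"             # "café"; \u escapes produce a UTF-8 string
char_count = word.length       # counts characters
byte_count = word.bytesize     # counts bytes; "é" takes two bytes in UTF-8
last_char = word[3]            # indexes by character, not by byte

puts first_char, first_byte, char_count, byte_count, last_char
```

Code that relied on the old integer result can switch to getbyte (or ord on the one-character string).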

Unlike Perl or Python, Ruby's M17N is not Unicode based (Universal Character Set, or UCS). It's character set independent (CSI). It will handle Unicode along with other encoding schemes, such as ISO-8859 or EUC-JP, without converting them into Unicode.

Some misunderstand our motivation. We are not Unicode haters. Rather, I'd love to use Unicode if the situation allows. What we hate is conversion between character sets. For historical reasons, there are many varieties of character sets. For example, the Shift_JIS character set has at least 5 variations, which differ from each other in a few character mappings. Unfortunately, we have no way to distinguish them, so conversion may cause information loss. If a language provides Unicode-centric text manipulation, there's no way to avoid the problem as long as we use that language.
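
To make the CSI idea concrete, here is a small sketch, assuming the String encoding API as it later shipped in Ruby 1.9 and newer. The byte values are the EUC-JP and UTF-8 forms of the hiragana "あ"; the variable names are made up for illustration.

```ruby
# Build the two EUC-JP bytes for "あ" (0xA4 0xA2) and label them as EUC-JP.
# force_encoding only changes the label; the bytes are not converted.
euc = [0xA4, 0xA2].pack("C*").force_encoding("EUC-JP")

euc_encoding = euc.encoding   # the string carries its own encoding
euc_chars = euc.length        # counted as one EUC-JP character
euc_bytes = euc.bytesize      # still the original two bytes

# The same character held as UTF-8 uses a different byte sequence.
utf = "\u3042"
utf_bytes = utf.bytesize

puts euc_encoding, euc_chars, euc_bytes, utf_bytes
```

Both strings can be indexed and measured by character in their native encoding, with no round trip through Unicode.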

ko1:

As a matter of policy, I'll stay out of this topic :)

With String being enhanced to be encoding aware, some worry that we will need to specify an encoding for every String we make. Can you talk a little about how this will work in practice? Is there a default encoding? Can we set an encoding for the entire program?

Matz:

You can specify the encoding for Ruby scripts with the coding pragma at the head of the script. For example, if your script is in UTF-8, specify

# coding: utf-8

That marks all string and regex literals in the script as UTF-8. You can also specify the encoding for strings read from IO via open, e.g.

open(path, "r:utf-8") do |f|
  line = f.gets
end

or by binmode (à la Perl), e.g.

f = open(path, "r")
f.binmode(":utf-8")
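
A runnable sketch of tagging IO data with an encoding, assuming the API as it later shipped in Ruby 1.9 and newer. The API was still in flux at the time of this interview: in released Ruby, binmode takes no encoding argument, and IO#set_encoding is used here as the closest equivalent. The temp file and its contents are hypothetical sample data.

```ruby
require "tempfile"

# Write a short UTF-8 line to a throwaway file.
tmp = Tempfile.new("m17n_sample")
tmp.binmode
tmp.write("caf\u00E9\n")
tmp.close

# "r:utf-8" tags every string read from the file as UTF-8.
tagged_line = File.open(tmp.path, "r:utf-8") { |f| f.gets }

# The encoding can also be set after opening.
late_tagged = File.open(tmp.path, "r") do |f|
  f.set_encoding("utf-8")
  f.gets
end

# "rb" reads raw bytes; strings come back tagged as binary (ASCII-8BIT).
raw_line = File.open(tmp.path, "rb") { |f| f.gets }
tmp.unlink

puts tagged_line.encoding, late_tagged.encoding, raw_line.encoding
```

The bytes read are identical in all three cases; only the encoding label, and therefore how characters are counted, differs.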

The default encoding is binary for ordinary IO, and the locale-specified encoding for STDIN. Encoding conversion at read time should also be allowed, but the API is not fixed yet. Maybe

open(path, "r:utf-8<euc-jp")

which should read EUC-JP data, convert it into UTF-8, and return the converted string.
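
The syntax above was tentative. As a sketch of the same idea, assuming the form that later shipped in Ruby 1.9 and newer, the mode string names the external (on-disk) encoding first and the internal (in-program) encoding second: "r:euc-jp:utf-8". The temp file and sample bytes are illustrative only.

```ruby
require "tempfile"

# "あ" in EUC-JP is the two bytes 0xA4 0xA2 (sample data for this sketch).
tmp = Tempfile.new("m17n_eucjp")
tmp.binmode
tmp.write([0xA4, 0xA2].pack("C*"))
tmp.close

# External encoding EUC-JP, internal encoding UTF-8:
# the bytes are transcoded to UTF-8 as they are read.
converted = File.open(tmp.path, "r:euc-jp:utf-8") { |f| f.read }
tmp.unlink

puts converted.encoding
puts converted.bytesize
```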

Can you tell us how far along the m17n code is and how much still needs to be done? Is this change expected to be in the 1.9.1 release?

Matz:

You will see M17N in 1.9.1 coming out next Christmas, unless something bad happens. I have done almost everything for character handling, but the things related to code conversion (the String#encode method and code conversion for IO) are still left undone.
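
For reference, a minimal sketch of String#encode, assuming the method as it eventually shipped in Ruby 1.9 and newer (it was unfinished at the time of this interview). The sample characters are illustrative.

```ruby
utf = "\u3042"                     # "あ" as UTF-8
euc = utf.encode("EUC-JP")         # a transcoded copy; utf is unchanged
euc_bytes = euc.bytes.to_a         # the EUC-JP byte sequence
round_trip = euc.encode("UTF-8")   # back to UTF-8

# A character missing from the target character set raises by default,
# so information loss is never silent. The euro sign is not in ISO-8859-1.
error_class = begin
  "\u20AC".encode("ISO-8859-1")
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

# Lossy conversion has to be requested explicitly.
lossy = "\u20AC".encode("ISO-8859-1", undef: :replace, replace: "?")

puts euc_bytes.inspect, round_trip == utf, error_class, lossy
```

Conversion is always an explicit step here, which matches the stated goal of avoiding implicit conversion between character sets.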

Comments (1)
  1. hramrach, September 21st, 2007

    Some are concerned that strings that are not automatically converted will be hard to use. I have heard that this is the situation in Python. They have strings with encoding information, and two strings in different encodings are not compatible. They both look like strings, but you cannot perform binary operations on them unless you manually recode them to a common encoding. This makes working with multiple encodings tedious and leads to many errors.

    I usually work only with 8-bit single-byte encodings and UTF-8. When you get a string in an 8-bit encoding, you cannot tell which encoding it is. However, many interfaces carry the encoding information (like database access or HTTP), and if you trust this information you can recode the string "safely".

    In the end I have to do something with the string, and I prefer a program that works and gives garbage output when the input is also garbage over not having a program at all. Unfortunately, there are also interfaces where the encoding information is not mandatory, which might lead to another class of errors.

    I could override the string methods to do what I want, but it would be nicer if there were support for automatic recoding in the standard library. I think that marking a particular IO or string as "safe", or setting a global "safe" flag, would be useful for prototyping and probably for many finished applications that run in a controlled environment.

    Michal
