The Ruby VM: Episode IV
We've talked about threads, so let's talk a little about character encodings. This is another big change planned for Ruby's future. Matz, you have stated that you plan to add m17n (multilingualization) support to Ruby. Can you talk a little about what that change actually means for Ruby users?
Nothing much, except for some incompatibility in string manipulation: for example, "abc"[0] will return "a" instead of 97, and string indexing will be based on characters instead of bytes. I guess the biggest difference is that we can officially declare we support Unicode. ;-)
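The indexing change above can be seen directly. A minimal sketch, runnable only on Ruby 1.9 or later; the 1.8 behavior is shown as a comment:

```ruby
s = "abc"
# Ruby 1.8: s[0] returned the Integer byte value 97.
# Ruby 1.9: indexing is character-based and returns a one-character String.
p s[0]          # => "a"

utf8 = "日本語"   # three characters, nine bytes in UTF-8
p utf8.length    # => 3  (counted in characters, not bytes)
p utf8.bytesize  # => 9
```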
Unlike Perl or Python, Ruby's M17N is not Unicode-based (UCS, the Universal Character Set); it is character set independent (CSI). It will handle Unicode along with other encoding schemes, such as ISO-8859 or EUC-JP, without converting them into Unicode.
Some misunderstand our motivation. We are not Unicode haters; rather, I'd love to use Unicode if the situation allows. What we hate is conversion between character sets. For historical reasons, there is a wide variety of character sets. For example, the Shift_JIS character set has at least 5 variations, which differ from each other in the mapping of a few characters. Unfortunately, we have no way to distinguish them, so conversion may cause information loss. If a language provides only Unicode-centric text manipulation, there's no way to avoid the problem as long as we use that language.
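The CSI approach means strings with different encodings coexist in one program, each tagged with its own encoding, and none forced through Unicode. A small sketch using the Encoding API that eventually shipped in 1.9 (the method names here are the released API, not something stated in this interview):

```ruby
a = "résumé"                       # UTF-8 source string
b = "résumé".encode("ISO-8859-1")  # same characters, different byte representation

p a.encoding.name  # => "UTF-8"
p b.encoding.name  # => "ISO-8859-1"
p a.bytesize       # => 8  (each "é" is two bytes in UTF-8)
p b.bytesize       # => 6  (one byte per character in Latin-1)
```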
As a matter of policy, I'll escape from this topic :)
With String being enhanced to be encoding aware, some worry that we will need to specify an encoding for every String we make. Can you talk a little about how this will work in practice? Is there a default encoding? Can we set an encoding for the entire program?
You can specify the encoding of a Ruby script with a coding pragma at the head of the script. For example, if your script is in UTF-8, specify
# coding: utf-8
That makes all string and regexp literals in the script UTF-8. You can also specify the encoding of strings read from IO via open, e.g.
open(path, "r:utf-8") do |f|
  line = f.gets
end
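A self-contained sketch of reading with a declared external encoding; the temp-file setup and path are hypothetical scaffolding for the demo, while the "r:utf-8" mode string is the part described above:

```ruby
require "tmpdir"

path = File.join(Dir.tmpdir, "m17n_demo_utf8.txt")  # hypothetical demo path
File.binwrite(path, "こんにちは\n")                  # UTF-8 bytes on disk

open(path, "r:utf-8") do |f|
  line = f.gets
  p line.encoding.name    # => "UTF-8": tagged with the declared encoding
  p line.valid_encoding?  # => true
end

File.unlink(path)
```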
or via binmode (à la Perl), e.g.
f = open(path, "r")
f.binmode(":utf-8")
The default encoding is binary for ordinary IO, and the locale-specified encoding for STDIN. Encoding conversion at the time of IO reading should also be allowed, but the API is not fixed yet. Maybe

open(path, "r:euc-jp:utf-8")

which should read EUC-JP data, then convert it into UTF-8 and return the converted string.
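That read-time conversion is what landed in the released 1.9: the mode string takes external and internal encodings in that order. A runnable sketch, with a hypothetical file path as scaffolding:

```ruby
require "tmpdir"

path = File.join(Dir.tmpdir, "m17n_demo_eucjp.txt")  # hypothetical demo path
File.binwrite(path, "日本語\n".encode("EUC-JP"))      # EUC-JP bytes on disk

# External encoding euc-jp, internal utf-8: convert while reading.
open(path, "r:euc-jp:utf-8") do |f|
  line = f.gets
  p line.encoding.name  # => "UTF-8"
  p line                # => "日本語\n", already converted
end

File.unlink(path)
```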
Can you tell us how far along the m17n code is and how much still needs to be done? Is this change expected to be in the 1.9.1 release?
You will see M17N in 1.9.1, coming out next Christmas, unless something bad happens. I have done almost everything for character handling, but things related to code conversion (the String#encode method and code conversion for IO) are still left undone.
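For reference, String#encode as it eventually shipped in 1.9 (a sketch of the released API, not code from the interview):

```ruby
s = "café"                      # UTF-8 source string
latin = s.encode("ISO-8859-1")  # explicit transcoding via String#encode

p latin.encoding.name  # => "ISO-8859-1"
p s.bytesize           # => 5  ("é" is two bytes in UTF-8)
p latin.bytesize       # => 4  (one byte per character in Latin-1)
```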