Ruby 1.9’s m17n (multilingualization) engine has eased some of the pains of supporting different encodings. Unfortunately, it makes things a little more difficult in the simplest of cases. I was introduced to this problem for the first time with the error message
invalid multibyte char (US-ASCII). This is a mini crash course for understanding the new behavior of Ruby strings and different encodings when used on an HFS+ file system.
What I was trying to do
I was attempting to build a Ruby application that would access files on my Time Machine backup. A typical path might look something like
/Volumes/Time Machine Backups/Backups.backupdb/Jon Stacey’s iMac. I could not get any file-related commands to work properly, though. The most common symptom was being told that the file path did not exist.
The crux of the problem
After quite a bit of head scratching, I discovered the problem. It’s that small apostrophe. You see, it’s not actually an apostrophe. It’s a closing quotation mark! To be specific, it’s the right single quotation mark,
\u2019, or Unicode code point U+2019, rather than the plain ASCII apostrophe at U+0027.
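To make the difference concrete, here is a minimal sketch that compares the two characters. The `describe` helper is a hypothetical name introduced only for illustration; it is not part of Ruby or of the original application.

```ruby
# Compare the ASCII apostrophe with the right single quotation mark.
apostrophe  = "'"       # U+0027 APOSTROPHE
right_quote = "\u2019"  # U+2019 RIGHT SINGLE QUOTATION MARK

# Hypothetical helper: report the code point and the UTF-8 byte count
# of a single-character string.
def describe(char)
  format("U+%04X, %d byte(s) in UTF-8", char.ord, char.bytesize)
end

puts describe(apostrophe)
puts describe(right_quote)
puts apostrophe == right_quote
```

The two characters look nearly identical on screen, but they are different code points, so a path typed with one will not match a directory named with the other.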
The HFS+ file system used by Mac OS X stores file names as Unicode (UTF-16), and apparently Apple decided to take advantage of this in some areas of Mac OS X. Why did they choose a technically incorrect symbol? That’s anybody’s guess, but it is causing a small headache.
You can check that this is the problem from your Mac Terminal. Try changing directories into your Time Machine drive without using tab completion. If you can’t get past the directory with that closing quotation mark, then you have this problem.
Perhaps this was carried over from a Snow Leopard upgrade. I noticed that my iPod also uses the same closing quotation mark instead of an apostrophe. Let us know if you’re on a clean Lion install and this is not a problem.
How to access these directories from Ruby 1.9
Simply copying and pasting this special symbol from Terminal would have worked just fine in Ruby 1.8, because a string in Ruby 1.8 was just a collection of bytes. Strings in Ruby 1.9 are a collection of encoded data, and Ruby 1.9 assumes that a source file is US-ASCII unless told otherwise. If you simply paste that special closing quote into a Ruby string, you’re very likely to get the error message
invalid multibyte char (US-ASCII).
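One commonly used alternative, which is not the route I took, is to declare the source file's encoding with a magic comment so that the pasted character becomes legal in a literal. A minimal sketch, assuming the file itself is saved as UTF-8:

```ruby
# encoding: utf-8
# The magic comment above must be the first line of the file
# (or the second, directly after a shebang line).

name = "Jon’s iMac"  # the pasted U+2019 character, not an ASCII apostrophe

puts name.encoding
puts name.valid_encoding?
```

This only helps if the editor really writes the file as UTF-8; if the declared and actual encodings disagree, the string will contain the wrong bytes.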
The way I got around this was to explicitly place the Unicode code point within the string. For example,
string = "Jon\u2019s iMac".
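As a rough sketch of how this fits together with the Time Machine path from earlier, the snippet below builds the full path with the escaped code point and asks Ruby whether it exists. The `backup_path` helper is a hypothetical name used only for illustration, and the volume and machine names are simply the ones from my example path; yours will differ.

```ruby
# Hypothetical helper: join a machine name onto the Time Machine
# backup database directory used in the example path.
def backup_path(machine_name)
  File.join("/Volumes/Time Machine Backups/Backups.backupdb", machine_name)
end

# \u2019 inserts the right single quotation mark without pasting it
# into the source file, so the file itself can stay pure ASCII.
path = backup_path("Jon Stacey\u2019s iMac")

puts path
puts File.exist?(path)  # true only if that backup volume is mounted
```

Because the escape is plain ASCII in the source, this works regardless of the source file's declared encoding.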
Tools and more detailed information
Grant McLean runs a very nice Unicode Character Finder. It allows you to paste your characters into the preview window and get more information about the character.
It took me a while to figure this little problem out, and I only managed it thanks to the excellent series of articles on Character Encodings by James Gray. I highly suggest reading his series.
The key to solving the problem, once I knew which Unicode character I needed, was understanding that there are multiple ways to place an encoded character within a Ruby string. For example, the character can be inserted by code point with
\u####, or byte by byte in hex with
\x##, in octal, etc.
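As an illustrative sketch, here are several of those forms producing the same character. The byte values assume UTF-8, where U+2019 is encoded as the three bytes E2 80 99; the variable names are mine.

```ruby
# Four ways to produce U+2019 (right single quotation mark) as UTF-8.
from_code_point = "\u2019"

# Hex and octal escapes spell out the raw UTF-8 bytes. The dup plus
# force_encoding makes sure the result is tagged as UTF-8 whatever
# the source file's encoding is.
from_hex   = "\xE2\x80\x99".dup.force_encoding(Encoding::UTF_8)
from_octal = "\342\200\231".dup.force_encoding(Encoding::UTF_8)

# Building the character from its integer code point.
from_integer = 0x2019.chr(Encoding::UTF_8)

puts from_code_point == from_hex
puts from_code_point == from_octal
puts from_code_point == from_integer
```

The `\u` form is usually the easiest to read because it names the code point directly, while the hex and octal forms depend on knowing the byte sequence for one particular encoding.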