Ruby 1.9’s m17n (multilingualization) engine has eased some of the pains of supporting different encodings. Unfortunately, it makes things a little more difficult in the simplest of cases. I was introduced to this problem for the first time with the error message `invalid multibyte char (US-ASCII)`. This is a mini crash course for understanding the new behavior of Ruby strings and different encodings when used on an HFS+ file system.
## What I was trying to do
I was attempting to build a Ruby application that would access files on my Time Machine backup. A typical path might look something like `/Volumes/Time Machine Backups/Backups.backupdb/Jon Stacey’s iMac`. I could not get any file-related commands to work properly, though. The most common problem was being told that the file path did not exist.
## The crux of the problem
After quite a bit of head scratching, I discovered the problem. It’s that small little apostrophe. You see, it’s not actually an apostrophe. It’s a closing quotation mark! To be specific, it’s `\u2019`, or Unicode code point U+2019.
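To see just how different these two look-alike characters are, here’s a short Ruby sketch. The variable names are mine; the point is that U+0027 and U+2019 occupy different code points, and U+2019 takes three bytes in UTF-8, so paths containing one will never match paths containing the other.

```ruby
# Two visually similar apostrophes, very different code points.
typewriter  = "'"        # U+0027 APOSTROPHE
typographic = "\u2019"   # U+2019 RIGHT SINGLE QUOTATION MARK

puts typewriter.ord.to_s(16)    # => "27"
puts typographic.ord.to_s(16)   # => "2019"

# In UTF-8, U+2019 occupies three bytes, so "Jon Stacey's iMac"
# and "Jon Stacey\u2019s iMac" can never compare equal.
puts typographic.bytes.to_a.inspect   # => [226, 128, 153]
```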
The HFS+ file system used by Mac OS X supports UTF-16, and apparently, Apple decided to take advantage of this functionality in some areas of Mac OS X. Why did they choose a technically incorrect symbol? That’s anybody’s guess, but it is causing a small headache.
You can check that this is the problem from your Mac Terminal. Try changing directories into your Time Machine drive without using tab completion. If you can’t get past the directory with that closing quotation mark, then you have this problem.
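You can also audit the actual file names from Ruby itself. Here’s a hypothetical little helper (the name `codepoints_of` is mine) that prints every code point in a name, which makes a stray U+2019 easy to spot; you could run it over `Dir.entries` on your backup volume.

```ruby
# Hypothetical helper: report the Unicode code points in a file name,
# so a U+2019 hiding where you expect U+0027 stands out immediately.
def codepoints_of(name)
  name.unpack("U*").map { |cp| "U+%04X" % cp }
end

puts codepoints_of("Jon\u2019s iMac").join(" ")
# Run this over Dir.entries("/Volumes/Time Machine Backups/...")
# to inspect the real directory names on your machine.
```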
Perhaps this was carried over from the Snow Leopard upgrade. I noticed that my iPod also uses the same closing quotation mark instead of an apostrophe. Let us know if you’re on a clean Lion install and this is not a problem.
## How to access these directories from Ruby 1.9
Simply copy/pasting this special symbol from Terminal would have worked just fine in Ruby 1.8. This was because a string in Ruby 1.8 was a collection of bytes. Strings in Ruby 1.9 are now a collection of encoded data. If you simply paste that special closing quote into a Ruby string, you’re very likely to get the error message `invalid multibyte char (US-ASCII)`.
The way I got around this was to explicitly place the Unicode code point within the string. For example, `string = "Jon\u2019s iMac"`.
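Here’s what that looks like in context (a sketch; the path is the one from my machine and won’t exist on yours). Because the `\u2019` escape is written in plain ASCII, the source file compiles cleanly while the resulting string is still UTF-8:

```ruby
# The \u escape keeps the source file plain ASCII, so Ruby 1.9
# parses it without the "invalid multibyte char (US-ASCII)" error.
path = "/Volumes/Time Machine Backups/Backups.backupdb/Jon Stacey\u2019s iMac"

puts path.encoding      # => UTF-8
puts File.exist?(path)  # true only on a machine with this backup volume mounted
```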
## Tools and more detailed information
Grant McLean runs a very nice [Unicode Character Finder](http://www.mclean.net.nz/ucf/). It allows you to paste your characters into the preview window and get more information about the character.
It took me awhile to figure this little problem out, mostly thanks to the excellent [series of articles on Character Encodings by James Gray](http://blog.grayproductions.net/categories/character_encodings). I highly suggest reading his series.
The key to solving the problem, once I knew the Unicode character I needed, was understanding that there are multiple ways to place an encoded character within a Ruby string. For example, the encoded character could be inserted using the code point with `\u####`, hex (`\x##`), octal, and so on.
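As a sketch, here are three equivalent ways to build the same U+2019 character in Ruby 1.9 (I’m using `Array#pack` as the third route; raw hex bytes need an explicit `force_encoding` because a `\x` literal with high bytes defaults to binary):

```ruby
a = "\u2019"                                 # Unicode code point escape
b = "\xE2\x80\x99".force_encoding("UTF-8")   # raw UTF-8 bytes, retagged
c = [0x2019].pack("U")                       # Array#pack with the U directive

puts a == b && b == c  # => true
```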
When you state that Apple’s choice is technically incorrect, you don’t seem to be distinguishing between a typewriter apostrophe and a typographic apostrophe. I would suggest reviewing http://en.wikipedia.org/wiki/Apostrophe#Computing and moving on to Robin Williams’ “The Mac is Not a Typewriter” for background. In Unicode, the typographic apostrophe and the single closing quote mark are the same mark. What you are calling an apostrophe (U+0027) is a holdover from the era of typewriters, when a single key needed to serve as the apostrophe, the single closing quote, a foot mark, and the prime. The typewriter apostrophe key also had to serve as the single opening quote.
Jack,
That’s a good point: U+0027 is a holdover from ASCII. Perhaps I did not articulate the reason why I think Apple made a mistake with this, and I’ll try to clear it up. Perhaps I should even expand that and say that the computing community as a whole is also guilty of this mistake, whether intentionally or not. I’m just picking on Apple since it’s the only system that I’ve had this issue on so far. Keep in mind that my comments below are with the English language in mind, since that’s my primary language and I’m selfish. The majority of widely used programming languages are also written with the English/ASCII charset in mind.
The problem is that there is confusion between usage of the old ASCII [typewriter] apostrophe, U+0027, and the preferred Unicode apostrophe, U+2019. We [as in Apple, and the computing community] do not appear to have been consistent in our choice of preferred apostrophe character. When I think of an apostrophe, there is only one, and I posit that this is also true for most other general computing users. A typewriter apostrophe and a typographic apostrophe convey an identical concept. The meaning of a word or sentence does not change on the usage. Reducing the problem to a very simplistic level, we could say that the appearance of the apostrophe, whether it be straight or curly, is a function of style.
Nonetheless, users such as myself are now facing cognitive dissonance. When I type in iWork Pages, U+2019 will be used if it’s not the leading character immediately following a whitespace. If I type in another editor though, such as TextMate, U+0027 will be used. If type in TextEdit, U+0027 will be used. As I’m typing this response, WordPress displays and stores the apostrophe characters as U+0027, but then renders it to readers as U+2019 automatically. Now we are representing an identical concept with multiple characters. This is not a problem to a human as the appearance of an apostrophe character does not matter much to a certain degree. However, there is a problem when the character does matter, such as when dealing with computer code.
The conundrum: If I try to insert U+2019 as an apostrophe in the Ruby programming language, I will get a syntax error. I’m not sure how other languages [Python, PHP, C++, etc.] would handle this, but I assume there’s going to be similar errors thrown about syntax. I argue that in the general computing environment we need to decide whether we’re going to use U+2019 or U+0027 entirely. In other words, we need a single character to represent a single concept at the most basic level, or the core operating system libraries need to handle this contingency [which they do not despite supporting Unicode]. We can’t easily mix and match outside of a typographic environment. For example, iWork Pages is smart enough to know that these two characters are equivalent, so when searching for a word with an apostrophe, Pages will match both characters. However, other applications such as TextEdit make a distinction and will not match both representations, despite the meaning remaining the same.
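The kind of transparent handling I’m asking for could be as simple as folding one character onto the other before comparing. Here’s a minimal sketch (the helper name is hypothetical) of what Pages is effectively doing and TextEdit is not:

```ruby
# Minimal sketch of apostrophe normalization: fold U+2019 down to
# U+0027 before comparing, so both spellings of the same word match.
def fold_apostrophes(s)
  s.tr("\u2019", "'")
end

a = "Jon\u2019s iMac"
b = "Jon's iMac"

puts a == b                                      # => false
puts fold_apostrophes(a) == fold_apostrophes(b)  # => true
```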
I’m not a typographer, and I have no arguments as to the proper use of characters when we’re in the document realm, mostly because that’s application and domain specific. It also enters an area of philosophical debate about which I know little. I can certainly understand the purist argument for having a separate character to represent each distinct mark. On the other hand, these multiple characters map to the same written-language concept and lead to confusion and unexpected problems.
Up until I ran into this issue, the world hummed along quite nicely under the following assumption, which I paraphrase from the Wikipedia article you linked to: an apostrophe entered via computer keyboard is a typewriter apostrophe by default and will be dynamically converted to a fancier typographic character when appropriate. Otherwise, the user must override the character in some other fashion, such as through a character palette, a shortcut, and so on.
This finally brings me to the crux of why I say that Apple’s use of a typographic apostrophe here is technically wrong. At this level of the operating system [a filename] the assumption is that the apostrophe is a typewriter apostrophe. This is because (1) if a document is saved from any application in OS X and an apostrophe is entered from the keyboard, it is U+0027, (2) if an apostrophe is entered in Terminal.app, it’s U+0027, (3) programming language syntax uses U+0027, and (4) the user expects it to just work. As such, Apple’s use of a typographic apostrophe here as part of a filename is incorrect. The expected default apostrophe would have been U+0027, but they either intentionally or unintentionally [i.e. copy/pasted] used the U+2019 character. While the underlying file system may support these characters, it’s one thing to make use of that functionality in the pursuit of m17n, but it’s another thing to use unexpected characters to represent a common concept. Remember, I don’t have experience with other languages, so this is heavily English focused. The long-term solution is to incorporate transparent handling of situations like these throughout the entire operating system, but I have heard no noise of this happening. The solution today is to default to U+0027 except within those situations where it actually makes sense [e.g. document editing].
What do you think?