Phantom Currency Signs in Japan and Korea

If you're not from Japan or Korea, you might be surprised that when you reboot your Windows OS with Japanese as the language for non-Unicode programs (the system locale), your backslashes are no longer backslashes; they are yen signs. Well, don't worry, they are still backslashes; they are just displayed and printed differently by many of the fonts in the Japanese locale. But there is a more troubling internationalization issue: Unicode text coming from Japanese and Korean sources may have a backslash where you would expect a yen sign ¥ or won sign ₩. This whole subject has been discussed elsewhere (references below), but here I talk about the need to repair the Unicode text.

Where the backslash issue gets interesting is in the encoding conversion between the locale code pages and Unicode. While 0x5c is clearly the yen sign in the Japanese code page 932 (Shift-JIS), it is converted to the Unicode U+005c REVERSE SOLIDUS (backslash) rather than the U+00a5 YEN SIGN. Similarly, in the Korean code page 949, 0x5c is clearly the won sign, but it is converted to the Unicode backslash rather than the U+20a9 WON SIGN.
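To see the conversion for yourself, here is a minimal sketch in Python. It assumes that Python's built-in cp932 and cp949 codecs are an acceptable stand-in for the Windows conversion; for the 0x5c byte they map the same way. The `decode_legacy` helper name is mine, purely for illustration.

```python
# The byte 0x5c is drawn as a yen sign (code page 932) or a won sign
# (code page 949) by many locale fonts, but decoding it to Unicode
# yields U+005C REVERSE SOLIDUS in both cases.
RAW = b"\x5c100"  # what a Japanese or Korean user sees as a price of 100


def decode_legacy(data: bytes, codec: str) -> str:
    """Decode bytes from a legacy code page (hypothetical helper)."""
    return data.decode(codec)


japanese = decode_legacy(RAW, "cp932")
korean = decode_legacy(RAW, "cp949")

# Print the code points so the result does not depend on the display font.
for label, text in (("cp932", japanese), ("cp949", korean)):
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

Neither decoded string contains U+00a5 or U+20a9, which is exactly the information loss discussed in this post.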

Ten months ago, Michael Kaplan got into a long argument with Norman Diamond in Whats up with the Korean (Unicode) sort?. Part of what Norman was arguing was that since 0x5c is a Yen sign in the Japanese code page, it should be converted to the Yen sign in Unicode. Makes sense, right? Well, Michael explains that its more important role in the operating system is as "the path separator" and therefore it must be converted to the backslash in Unicode. It is still displayed as a Yen sign in Japan whether in Unicode or not.

As it stands, the situation in the Japanese locale appears to have some almost intolerable problems:

  1. In the Japanese locale, U+00a5 and U+005c are visually indistinguishable
  2. In the Japanese locale, there is no way to print or display a backslash (even with Unicode), unless you are able to specify a font that you know will not display it as a Yen sign
  3. When users of your Unicode program are typing a yen sign on the Japanese keyboard, the yen sign probably gets inserted as the backslash U+005c character (see post)
  4. If Korean text is converted to Unicode and sent to Japan, the Won signs will turn up as Yen signs!
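Since you cannot trust your eyes here, it helps to look at code points instead of glyphs when debugging. A small sketch, using only the standard `unicodedata` module (the `describe` helper is just an illustrative name):

```python
import unicodedata

# The three characters that can end up looking alike, depending on the font.
LOOKALIKES = "\\\u00a5\u20a9"


def describe(ch: str) -> str:
    """Return the code point and Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in LOOKALIKES:
    print(describe(ch))
```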

Okay, the first two aren't so bad, because who needs to see a backslash anyway, as long as you can use the yen sign as a path separator? But the third and fourth point to the need for the program to intervene and repair the Unicode text.

The Unicode values need to be repaired based on knowledge of the meaning of the text. Pathnames should be left as they are, but a backslash that stands for the yen sign should be converted to the U+00a5 yen sign, especially if the Unicode text will be shared internationally. If not, you may see text referring to a monetary amount that looks like \90, and you will have to guess whether the backslash was originally a yen sign or a won sign.
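As a rough illustration of what such a repair might look like, here is a sketch in Python. The rule it uses is my own assumption, not an established algorithm: a backslash is treated as a currency sign only when it is immediately followed by a digit and sits at the start of the text or right after whitespace or an opening parenthesis. The caller has to say which sign is meant, because the text alone cannot tell yen from won. The name `repair_currency` is hypothetical.

```python
import re

YEN_SIGN = "\u00a5"
WON_SIGN = "\u20a9"

# Assumed heuristic: backslash followed by a digit, and not preceded by
# anything other than whitespace or "(" (so "C:\2024" is left alone).
_AMOUNT_BACKSLASH = re.compile(r"(?<![^\s(])\\(?=\d)")


def repair_currency(text: str, sign: str) -> str:
    """Replace backslashes that look like currency signs with `sign`."""
    # A function is used as the replacement so `sign` is inserted literally.
    return _AMOUNT_BACKSLASH.sub(lambda match: sign, text)


print(repair_currency("Lunch was \\90 today", YEN_SIGN))
print(repair_currency("C:\\2024\\report.txt", YEN_SIGN))
```

This leaves typical absolute pathnames untouched, but it would still misfire on something like a relative path that starts with a backslash and a digit, so a real repair needs more knowledge of the field than this sketch has.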

This is important as programs and databases are upgraded from the non-Unicode code page to Unicode. It is simple with small fields when you know the string value might contain the currency symbol or at least does not represent a pathname. It is much more difficult when repairing documents that might contain both pathnames and money.

Regarding Microsoft's choice to convert code page 932 yen sign to Unicode backslash, Michael points out that there is no 'right' answer here but converting it to Unicode yen sign would break all of the Windows software that looks for a path separator, so there really is no choice. The alternative would have been to make all Windows Unicode software recognize the yen and won signs as path separators too.

If there is any fault here, it probably goes back to the seventies and the decision to do this switcheroo on IBM PCs. Also there is the fact that Japan understandably wanted to be able to see their own currency symbol from within the ASCII range. The need to share text across the globe was a secondary consideration back then.

There is also the unrelated question of why the backslash was used as the path separator in the first place. Larry Osterman addresses this in Why is the DOS path character "\"?. Who knows, if the DOS path separator had not been the backslash, maybe we would not be in this predicament today. On the other hand, the backslash might have been switched like this because it was the path separator; I don't know.

Last week TheOldNewThing pointed to Michael Kaplan as the resident expert on The history of the path separator in Japanese and Korean Windows. Michael's article When is a backslash not a backslash? is a solid explanation, but it does not point out the Unicode repair issue, and it does not actually delve into the "history". I'm not complaining; it would be tough to research the history, especially in English (in a quick Google search I didn't find any concise relevant information).

Fullwidth Yen and Won

Michael Kaplan mentions this post in I WON to talk about the YEN. In the discussion there, Mihai brought up the fullwidth Yen and Won signs, which I did not address. The fullwidth or "wide" Yen U+FFE5 and Won U+FFE6 are distinct from the halfwidth ones (U+00A5 and U+20A9) discussed above. Depending on how widely the fullwidth signs were used in text in Japan and Korea, they could alleviate some of the need to repair text that I described, because there is no OS-sponsored confusion between them and the path separator. But there are still many concerns, as described by Michael in his post.
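If you decide you want a single representation for each currency sign, the fullwidth forms can be folded to U+00A5 and U+20A9. Here is a sketch that does only that, using a translation table; the `fold_fullwidth_currency` name is mine. Unicode NFKC normalization produces the same mapping for these two characters, but it also changes many other characters (fullwidth digits and letters, for example), which you may not want.

```python
import unicodedata

# Map only the two fullwidth currency signs to their narrow counterparts.
_FULLWIDTH_CURRENCY = str.maketrans({
    "\uffe5": "\u00a5",  # FULLWIDTH YEN SIGN -> YEN SIGN
    "\uffe6": "\u20a9",  # FULLWIDTH WON SIGN -> WON SIGN
})


def fold_fullwidth_currency(text: str) -> str:
    """Fold fullwidth yen and won signs, leaving everything else as is."""
    return text.translate(_FULLWIDTH_CURRENCY)


sample = "\uffe5\uff11\uff10\uff10"  # fullwidth yen sign and fullwidth "100"
print(fold_fullwidth_currency(sample))       # only the sign is folded
print(unicodedata.normalize("NFKC", sample))  # sign and digits are folded
```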