Other

Explorations into Lisp

After a long break from writing I’m ready to get back to work! Over the next few months I’ll be studying algorithms, AI, and Lisp. Given that this will be my focus, I’ve decided to write about the experience of learning Lisp as I progress. This is the first in a multipart series about Common Lisp and programming languages in general. I’ll be looking at different programming idioms and comparing specific constructs in Lisp, Ruby, C++, C# and whatever I feel like pulling out at the moment.

Catching up with Macromedia

Just after I started out on my own, I had a very simple contract to produce an animated GIF for a website. It wasn’t the kind of work I would ever have put myself forward for: I’m not especially artistic, and it never feels right taking on things I don’t have the appropriate skills or experience for. But the job came via a friend, the client needed something done, and I certainly wasn’t financially secure enough to refuse a solid offer of work.

The feedback from something I managed to bash together with a trial copy of Macromedia Fireworks was good, so it all went ahead. Surprisingly, the client was happy with the end result, and I used the proceeds to buy a full copy of Fireworks and Dreamweaver 4.

That would have been the end of the story, if I hadn’t recently decided to upgrade to Studio 8.

Phantom Currency Signs in Japan and Korea

If you're not from Japan or Korea, you might be surprised that when you set your Windows "Language for non-Unicode programs" (the system locale) to Japanese and reboot, your backslashes are no longer backslashes; they are yen signs. Well, don't worry, they are still backslashes; they are just displayed and printed differently by many of the fonts in the Japanese locale. But there is a more troubling internationalization issue: Unicode text coming from Japanese and Korean sources may have a backslash where you would expect a yen sign ¥ or won sign ₩. This whole subject has been discussed elsewhere (references below), but here I talk about the need to repair the Unicode text.

Where the backslash issue gets interesting is in the encoding conversion between the locale code pages and Unicode. While 0x5c is clearly the yen sign in the Japanese code page 932 (Shift-JIS), it is converted to Unicode U+005c REVERSE SOLIDUS (backslash) rather than U+00a5 YEN SIGN. Similarly, in the Korean code page 949, 0x5c is clearly the won sign, but it is converted to the Unicode backslash rather than U+20a9 WON SIGN.
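
Repairing such text after the fact is conceptually just a substitution of U+005c with the intended currency sign. Here is a minimal sketch (the function name is my own, not from any library); note that a blanket replacement like this is only safe for text you know contains no genuine backslashes, so it would mangle file paths:

#include <string>

// Hypothetical helper: restore the currency sign in Unicode text that
// round-tripped through a CP932 or CP949 conversion. Every U+005c is
// replaced, so only use this on text with no genuine backslashes.
std::wstring RestoreCurrencySign(std::wstring text, wchar_t sign)
{
    for (wchar_t& ch : text)
        if (ch == L'\\')      // U+005c REVERSE SOLIDUS
            ch = sign;
    return text;
}

// For a Japanese source: RestoreCurrencySign(text, L'\u00A5')  // yen sign
// For a Korean source:   RestoreCurrencySign(text, L'\u20A9')  // won sign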

The Power of the Lambda

In one of the (non-Ruby) applications I maintain, there is a function that is responsible for handling unit conversions. It looks something like this:

double UnitConvert(double value, string from_unit, string to_unit)

So that I can do this:

double value = UnitConvert(5.0, "feet", "inches");

The underlying part of this code has to figure out exactly how to convert between the two units. In a nutshell, there's a big hash of known unit conversions that gets loaded when the program starts up, and it can interpolate, trace paths, and fill in any gaps that may exist. It's actually a pretty smart piece of code.
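
I can't show the real code here, but a minimal sketch of the idea, with the class and method names invented for illustration, might store the known conversions as a graph of factors and trace a path through it to fill in the gaps:

#include <queue>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Sketch only: known conversions form a graph; unknown pairs are
// derived by breadth-first search, multiplying factors along the path.
class UnitTable {
public:
    void Add(const std::string& from, const std::string& to, double factor) {
        edges_[from].push_back({to, factor});
        edges_[to].push_back({from, 1.0 / factor});
    }

    double Convert(double value, const std::string& from, const std::string& to) const {
        if (from == to) return value;
        std::queue<std::pair<std::string, double>> pending;
        std::unordered_set<std::string> seen{from};
        pending.push({from, 1.0});
        while (!pending.empty()) {
            auto [unit, factor] = pending.front();
            pending.pop();
            auto it = edges_.find(unit);
            if (it == edges_.end()) continue;
            for (const auto& [next, f] : it->second) {
                if (next == to) return value * factor * f;
                if (seen.insert(next).second)
                    pending.push({next, factor * f});
            }
        }
        throw std::runtime_error("no conversion path: " + from + " -> " + to);
    }

private:
    std::unordered_map<std::string,
                       std::vector<std::pair<std::string, double>>> edges_;
};

// UnitTable t;
// t.Add("feet", "inches", 12.0);
// t.Add("inches", "cm", 2.54);
// double cm = t.Convert(5.0, "feet", "cm");  // traces feet -> inches -> cm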

UTF-8 Versus Windows UNICODE

The term "Windows UNICODE" refers to the UCS-2 (and later UTF-16) encoding chosen by Microsoft for their standard Unicode encoding. For years, one of the recurrent thoughts in my head has been "what if Microsoft went with UTF-8 instead of UCS-2 as their Unicode encoding?" Many times I feel I would have preferred UTF-8, but it is by no means a simple issue. People often point to speed as the defining issue but even speed is not a slam dunk. Here are some points for comparison between UTF-8 and UTF-16.

The secret family split in Windows code page functions

My earlier post "Strange case of two system locale ANSI charsets" discussed the confusion between the default system locale (GetACP, Language for non-Unicode Programs) and the default user locale (setlocale, Standards and Formats). There I mentioned a problem with setting the system code page in C/C++ using setlocale, but that is only the first clue to a secret split in the family of locale-based charset functions.

Strange case of two system locale ANSI charsets

Are you familiar with the technical details of how system locale ANSI charsets work on Windows? I thought I knew enough to get by... until recently. You may know that there are two primary ANSI locale settings, one of which requires a reboot. But do you know that, when it comes to the distinction between these two settings, the MSDN docs get it wrong, the Delphi 7 core ANSI functions get it wrong, and you cannot set the system code page in C/C++ using setlocale?

In Windows XP Regional and Language Options, you can set "Standards and formats" on the first tab and "Language for non-Unicode programs" on the third tab; the latter requires a reboot (previous Windows versions are similar). The weird thing is that both of these mess with the ANSI locale code page.

Windows APIs are built around two types of text strings, ANSI and UNICODE. The UNICODE charset (Wide Char) is pretty straightforward because it is not affected by the locale and language settings. The ANSI charset always supports the 128 ASCII values but can use the high bit of the byte in different ways to support additional characters. In single-byte charsets, the upper 128 values are assigned to additional characters like European accented characters or Cyrillic or Greek letters. In East Asian double-byte charsets, special lead byte values in the upper 128 are followed by a second byte to complete the character. "Double-byte" is actually a misleading term because characters in double-byte strings can use either one or two bytes. An ANSI charset is implemented as a "code page," which specifies the encoding system for that charset.
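
This is exactly why byte-at-a-time string code breaks on double-byte charsets. A minimal sketch of walking an ANSI string correctly, using the Win32 IsDBCSLeadByteEx function with the Japanese code page 932:

#include <windows.h>
#include <stdio.h>

// Advance one *character* at a time: a lead byte and its trail byte
// together form a single character, so they must be skipped as a pair.
void PrintCharSizes(const char* s, UINT codePage)
{
    while (*s) {
        int size = IsDBCSLeadByteEx(codePage, (BYTE)*s) ? 2 : 1;
        printf("character of %d byte(s)\n", size);
        s += size;
    }
}

// PrintCharSizes("\x95\x5C" "A", 932) reports a 2-byte character,
// then a 1-byte character.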

The ANSI charset used by the computer changes according to the locale the computer is configured for, but what is the computer's locale? Well, no one is terribly clear on that! The Windows API GetLocaleInfo allows you to get information about either the "default system locale" or the "default user locale." The MSDN article then goes on to refer to the "current system default locale" and the "default ANSI code page for the LCID," as opposed to the "system default–ANSI code page." I have yet to discover how the User/System differentiation works, although presumably user logons retain certain aspects of the Regional and Language Options. Anyway, I would say it is anything but clear.
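
For what it's worth, you can query both of these "default" locales directly. On many machines the two values match, which is part of why the distinction goes unnoticed; here is a small sketch:

#include <windows.h>
#include <stdio.h>

int main()
{
    char sysName[64], userName[64];
    GetLocaleInfoA(LOCALE_SYSTEM_DEFAULT, LOCALE_SENGLANGUAGE, sysName, (int)sizeof(sysName));
    GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_SENGLANGUAGE, userName, (int)sizeof(userName));
    printf("default system locale: %s\n", sysName);
    printf("default user locale:   %s\n", userName);
}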

According to MSDN for Microsoft C++, a C program can use setlocale( LC_ALL, "" ) to set itself to use the "system-default ANSI code page obtained from the operating system" rather than plain ASCII, and then all multi-byte string functions will also operate with that code page. However, it turns out that this code page is actually the one from "Standards and formats" in the computer's Regional and Language Options. I call this the "setlocale" charset.

Meanwhile, all ANSI Windows APIs and messages operate according to the ANSI code page from the "Language for non-Unicode programs" setting. This setting governs the real ANSI system locale code page which you can find out with the GetACP Win32 function. This is the default code page used in MultiByteToWideChar when you specify CP_ACP. I call this the "GetACP" charset.
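
A tiny program makes the split visible. This is just a sketch, but on a machine where "Standards and formats" and "Language for non-Unicode programs" are configured differently, the two lines it prints will disagree:

#include <windows.h>
#include <locale.h>
#include <stdio.h>

int main()
{
    // setlocale picks up the "Standards and formats" code page...
    printf("setlocale charset: %s\n", setlocale(LC_ALL, ""));
    // ...while GetACP reports "Language for non-Unicode programs."
    printf("GetACP charset:    %u\n", GetACP());
}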

When these two code pages differ, such as a U.S. English setlocale charset with a Japanese GetACP charset, many programs used internationally exhibit bugs you won't see otherwise. For example, the Delphi core source file SysUtils.pas uses a SysLocale based on the setlocale charset in many of its Ansi string functions like AnsiPos and CharLength, while implicit ANSI/WideString conversions and other Ansi functions like AnsiUpperCase operate according to the GetACP charset.

Even WinZip prior to release 8.1 did not parse pathnames correctly when the setlocale and GetACP charsets differed. WinZip couldn't zip Japanese filenames that contained an ASCII backslash as the second byte of a Shift-JIS character, because its string functions were treating the double-byte ANSI strings as single-byte ANSI (Windows-1252) strings.
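
The classic trigger is a Shift-JIS character whose trail byte is 0x5c, the ASCII backslash, such as U+8868 (bytes 0x95 0x5c). The sketch below, my own illustration rather than WinZip's code, shows the failure and the lead-byte-aware fix:

#include <windows.h>
#include <string.h>
#include <stdio.h>

int main()
{
    // "dir\" followed by the Shift-JIS pair 0x95 0x5c and ".txt".
    const char path[] = "dir\\" "\x95\x5C" ".txt";

    // Wrong: strrchr treats every 0x5c byte as a path separator and
    // finds the trail byte, chopping the character in half.
    const char* naive = strrchr(path, '\\');
    printf("naive basename: %s\n", naive + 1);   // prints ".txt"

    // Correct: skip lead/trail byte pairs while scanning (code page 932).
    const char* last = NULL;
    for (const char* p = path; *p; ) {
        if (IsDBCSLeadByteEx(932, (BYTE)*p)) { p += 2; continue; }
        if (*p == '\\') last = p;
        ++p;
    }
    printf("aware basename: %s\n", last + 1);    // the full Japanese filename
}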

There is a reason that Windows has these two system charsets. The locale info for a locale's "Standards and formats" is provided via the ANSI API in a particular charset (is this what MSDN vaguely referred to as the "default ANSI code page for the LCID"?). The OS cannot provide Japanese standards and formats, such as weekday names, in a Western European ANSI charset. So a programmer is supposed to interpret that locale info's text strings according to the locale info's own charset, even if the machine locale charset is different. But I have not found this documented properly, and I don't think many people know about it. Delphi got it wrong, and the Microsoft C++ documentation is not clear on it. I think at this point Microsoft developers are inclined to forget about these issues and focus on Unicode.
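
In other words, the interpretation rule seems to be: ask the locale itself for its ANSI code page, then decode the returned bytes with that code page, never with GetACP's. A sketch of what I mean:

#include <windows.h>
#include <stdio.h>

int main()
{
    LCID japanese = MAKELCID(MAKELANGID(LANG_JAPANESE, SUBLANG_JAPANESE_JAPAN),
                             SORT_DEFAULT);

    char codepage[8], weekday[64];
    GetLocaleInfoA(japanese, LOCALE_IDEFAULTANSICODEPAGE, codepage, (int)sizeof(codepage));
    GetLocaleInfoA(japanese, LOCALE_SDAYNAME1, weekday, (int)sizeof(weekday));

    // 'weekday' now holds Monday's name as code page 932 bytes, even if
    // the machine's GetACP charset is Western European.
    printf("interpret the weekday bytes using code page %s\n", codepage);
}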

Experiences and clarifications are welcome!