Strange case of two system locale ANSI charsets

Are you technically familiar with how system locale ANSI charsets work on Windows? I thought I knew enough to get by... until recently. You may know that there are two primary ANSI locale settings, one of which requires a reboot. But did you know that, when it comes to the distinction between these two settings, the MSDN docs get it wrong, the Delphi 7 core ANSI functions get it wrong, and you cannot set the system code page in C/C++ using setlocale?

In the Windows XP Regional and Language Options, you can set "Standards and formats" on the first tab and "Language for non-Unicode programs" on the third tab; the latter requires a reboot (earlier versions of Windows are similar). The weird thing is that both of these settings affect the ANSI locale code page.

Windows APIs are built around two types of text strings, ANSI and UNICODE. The UNICODE charset (Wide Char) is pretty straightforward because it is not affected by the locale and language settings. The ANSI charset always supports the 128 ASCII values, but can use the high bit of the byte in different ways to support additional characters. In single-byte charsets, the upper 128 values are assigned to additional characters such as European accented characters, Cyrillic or Greek characters. In East Asian double-byte charsets, special lead byte values in the upper 128 are followed by a second byte to complete the character. "Double-byte" is actually a misleading term because characters in double-byte strings can use either one or two bytes. An ANSI charset is implemented as a "code page" which specifies the encoding system for that charset.
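To make the lead byte idea concrete, here is a minimal sketch in C that counts characters rather than bytes in a double-byte string. It assumes the lead byte ranges of Shift-JIS as used by Windows code page 932 (0x81-0x9F and 0xE0-0xFC); the function names are mine, not from any Windows header.

```c
#include <assert.h>
#include <stddef.h>

/* Lead byte ranges assumed here are those of code page 932 (Shift-JIS). */
static int is_cp932_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count characters (not bytes) in a NUL-terminated CP932 string. */
size_t cp932_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != '\0') {
        /* A lead byte consumes the following trail byte as well,
           unless the string was cut off right after the lead byte. */
        if (is_cp932_lead_byte((unsigned char)*s) && s[1] != '\0')
            s += 2;
        else
            s += 1;
        count++;
    }
    return count;
}
```

For the five bytes "A", 0x93 0xFA, 0x96 0x7B, this counts three characters: one single-byte and two double-byte. A real Windows program would normally ask the OS (for example with IsDBCSLeadByteEx) instead of hardcoding ranges.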

The ANSI charset used by the computer changes according to the locale the computer is configured to, but what is the computer's locale? Well, no one is terribly clear on that! The Windows API GetLocaleInfo allows you to get information about either the "default system locale" or the "default user locale." The MSDN article then goes on to refer to the "current system default locale," and the "default ANSI code page for the LCID," as opposed to the "system default–ANSI code page." I have yet to discover how the User/System differentiation works although presumably user logons retain certain aspects of the Regional and Language Options. Anyway, I would say it is anything but clear.

According to MSDN for Microsoft C++, a C program can use setlocale( LC_ALL, "" ) to set itself to use the "system-default ANSI code page obtained from the operating system" rather than plain ASCII, and then all multi-byte string functions will also operate with that code page. However, it turns out that this code page is actually the one from "Standards and formats" in the computer's Regional and Language Options. I call this the "setlocale" charset.
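As a rough way to see which code page setlocale picked, the sketch below calls setlocale( LC_ALL, "" ) and pulls the numeric suffix out of the returned locale name. It assumes the Microsoft C runtime naming style such as "English_United States.1252"; other runtimes use other formats, in which case the helper simply returns 0. Both function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Extract the numeric code page from a locale name like "Japanese_Japan.932".
   Returns 0 when there is no purely numeric suffix (e.g. "C"). */
unsigned locale_name_code_page(const char *locale_name)
{
    const char *dot;
    char *end;
    unsigned long cp;

    assert(locale_name != NULL);
    dot = strrchr(locale_name, '.');
    if (dot == NULL || !isdigit((unsigned char)dot[1]))
        return 0;
    cp = strtoul(dot + 1, &end, 10);
    return (*end == '\0') ? (unsigned)cp : 0;
}

/* Switch the C runtime from the startup "C" locale to the environment's
   locale and report the code page named in the result, if any. */
unsigned adopt_environment_locale(void)
{
    const char *name = setlocale(LC_ALL, "");
    return (name != NULL) ? locale_name_code_page(name) : 0;
}
```

Printing the result of adopt_environment_locale() next to the value of GetACP is a quick way to check whether the two settings agree on a given machine.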

Meanwhile, all ANSI Windows APIs and messages operate according to the ANSI code page from the "Language for non-Unicode programs" setting. This setting governs the real ANSI system locale code page which you can find out with the GetACP Win32 function. This is the default code page used in MultiByteToWideChar when you specify CP_ACP. I call this the "GetACP" charset.
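To inspect these values on a particular machine, something like the following sketch can help. It prints GetACP alongside the default ANSI code page that GetLocaleInfo reports (via LOCALE_IDEFAULTANSICODEPAGE) for the system and user default locales. The wrapper function names are my own, and the non-Windows branches are only placeholders so the file still compiles elsewhere.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* The system ANSI code page (the "GetACP" charset); 0 on non-Windows builds. */
unsigned system_ansi_code_page(void)
{
#ifdef _WIN32
    return GetACP();
#else
    return 0;
#endif
}

/* Default ANSI code page of a locale; 0 on failure or on non-Windows builds. */
unsigned locale_ansi_code_page(unsigned long lcid)
{
#ifdef _WIN32
    char text[8]; /* the value is returned as a short decimal string */

    if (GetLocaleInfoA((LCID)lcid, LOCALE_IDEFAULTANSICODEPAGE,
                       text, (int)sizeof text) == 0)
        return 0;
    return (unsigned)strtoul(text, NULL, 10);
#else
    (void)lcid;
    return 0;
#endif
}

/* Print the code pages side by side so a mismatch is easy to spot. */
void report_code_pages(FILE *out)
{
    assert(out != NULL);
#ifdef _WIN32
    fprintf(out, "GetACP:                %u\n", system_ansi_code_page());
    fprintf(out, "system default locale: %u\n",
            locale_ansi_code_page(LOCALE_SYSTEM_DEFAULT));
    fprintf(out, "user default locale:   %u\n",
            locale_ansi_code_page(LOCALE_USER_DEFAULT));
#else
    fputs("Windows-only APIs; nothing to report on this platform.\n", out);
#endif
}
```

Calling report_code_pages(stdout) before and after changing either setting in Regional and Language Options should show which value each setting moves.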

When these two code pages are different, such as a U.S. English setlocale charset and a Japanese GetACP charset, many programs used internationally exhibit bugs you won't see otherwise. For example, the Delphi core source file SysUtils.pas uses a SysLocale based on the setlocale charset in many of its Ansi string functions like AnsiPos and CharLength, while implicit ANSI/WideString conversions and other Ansi functions like AnsiUpperCase operate according to the GetACP charset.

Even WinZip prior to release 8.1 did not parse pathnames correctly when the setlocale and GetACP charsets differed. WinZip couldn't zip Japanese filenames that contained an ASCII backslash value as the second byte of a Shift-JIS character, because the string functions were treating the double-byte ANSI strings as single-byte ANSI (Windows-1252) strings.
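This class of bug is easy to reproduce. The sketch below is not WinZip's code, just an illustration of the failure: it finds the last path separator once by scanning bytes and once by skipping trail bytes. It assumes code page 932 lead byte ranges, and the function names are hypothetical. The test character is 0x95 0x5C, a Shift-JIS character whose second byte equals the ASCII backslash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Lead byte ranges assumed here are those of code page 932 (Shift-JIS). */
static int is_cp932_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Single-byte view: any 0x5C byte counts as a separator.
   Returns the byte index of the last one, or -1 if none. */
int last_backslash_naive(const char *path)
{
    const char *p;

    assert(path != NULL);
    p = strrchr(path, '\\');
    return (p != NULL) ? (int)(p - path) : -1;
}

/* Double-byte view: a 0x5C that is the trail byte of a two-byte
   character is part of that character, not a separator. */
int last_backslash_cp932(const char *path)
{
    int last = -1;
    int i = 0;

    assert(path != NULL);
    while (path[i] != '\0') {
        if (is_cp932_lead_byte((unsigned char)path[i]) && path[i + 1] != '\0') {
            i += 2; /* skip lead and trail byte together */
        } else {
            if (path[i] == '\\')
                last = i;
            i += 1;
        }
    }
    return last;
}
```

For the bytes "dir\" followed by 0x95 0x5C and ".txt", the naive scan reports index 5 (inside the Japanese character) while the lead-byte-aware scan reports index 3, the real separator.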

There is a reason that Windows has these two system charsets. The locale info for a locale's "Standards and formats" is provided via the ANSI API in a particular charset (is this what MSDN vaguely referred to as the "default ANSI code page for the LCID"?). The OS cannot provide Japanese standards and formats such as weekday names in a Western European ANSI charset. So, a programmer is supposed to interpret that locale info's text strings according to the locale info's charset, even if the machine locale charset is different. But I have not found this documented properly, and I don't think many people know about this. Delphi got it wrong and the Microsoft C++ documentation is not clear on it. I think at this point, Microsoft developers are inclined to want to forget about these issues and focus on Unicode.
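Here is a hedged sketch of what "interpret according to the locale info's charset" could look like in C. It reads the user locale's name for Monday through the ANSI API, then converts it to wide characters using the code page from LOCALE_IDEFAULTANSICODEPAGE instead of CP_ACP. This assumes that GetLocaleInfoA returns its strings in the locale's own code page unless LOCALE_USE_CP_ACP is passed, which is my reading of the documentation for that flag. The function name is hypothetical, and the non-Windows branch is a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Writes the user locale's long name for Monday into out as wide characters.
   Returns the number of characters written, excluding the terminator,
   or 0 on failure and on non-Windows builds. */
int monday_name_wide(wchar_t *out, int out_chars)
{
    assert(out != NULL);
#ifdef _WIN32
    {
        char cp_text[8];
        char name[128];
        unsigned cp;
        int n;

        /* Code page that the locale's ANSI strings are expressed in.
           A value of 0 (Unicode-only locale) falls back to CP_ACP below. */
        if (GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE,
                           cp_text, (int)sizeof cp_text) == 0)
            return 0;
        cp = (unsigned)strtoul(cp_text, NULL, 10);

        /* LOCALE_SDAYNAME1 is the native long name for Monday. */
        if (GetLocaleInfoA(LOCALE_USER_DEFAULT, LOCALE_SDAYNAME1,
                           name, (int)sizeof name) == 0)
            return 0;

        /* Decode with the locale's code page, not the system ANSI code page. */
        n = MultiByteToWideChar(cp, 0, name, -1, out, out_chars);
        return (n > 0) ? n - 1 : 0;
    }
#else
    (void)out_chars;
    return 0;
#endif
}
```

Calling the wide API (GetLocaleInfoW) avoids the question entirely, which is probably the better choice in new code.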

Experiences and clarifications are welcome!