Whether Double-Byte Is ANSI

I'm telling this story because a senior manager flippantly dismissed my careful analysis of an issue, and it got me a little riled up. For about a year as a side project I have been working with people in Japan and Taiwan on converting a Delphi-written Windows tool to Unicode as well as localizing it to Japanese (I am a C++ programmer but picked up Delphi during this project). It is a project that requires clarity on the character set issues involved.

We had an independent expert who audited the Delphi program and cautioned us, among other things, that we "cannot assume that the non-Unicode character string is always ANSI. When ANSI code page is specified but the non-Unicode character string contains double-byte characters, the two bytes of each double-byte character will be separated." This person was implying that double-byte was different from ANSI in that context.

The glaring problem with his point is that, in Windows terminology, double-byte is ANSI! Plain and simple. Well, read on...

On Windows, ANSI character sets are those that can act as the default non-Unicode system locale character set, including double-byte character sets (DBCS). Win32 APIs involving strings generally have an A and a W version (e.g. SetWindowTextA and SetWindowTextW), where the A stands for ANSI and the W stands for Wide Character Unicode. The GetACP Win32 API returns the ANSI code page that the A APIs operate in, which is a double-byte code page (e.g. 932) on PCs configured for Far Eastern locales.

Thinking ANSI does not include double-byte is understandable because the popular usage of the term "ANSI" assumes that ANSI character sets are single-byte (SBCS). This is due to the origin of the Microsoft usage of the term "ANSI" in the default Western code page 1252 based on a single-byte character set drafted by the American National Standards Institute (ANSI).

It was a misnomer from the beginning because Windows-1252 was not approved by ANSI and it turned out different than the ISO Standard 8859-1. But the term ANSI on the Microsoft platform went on to encompass all of the Windows single-byte character sets in which the lower 128 values are ASCII and the upper 128 vary according to different international sets like Cyrillic.

The Microsoft usage of the term ANSI made a further leap to encompass double-byte character sets. Why is not perfectly clear, but presumably for practical reasons because it was the only term that was handy. DBCS characters can be either 1 or 2 bytes long (don't be fooled by the name "double-byte", they are actually multi-byte -- MBCS), and the 1-byte characters in the lower 128 are ASCII. An ASCII string is exactly the same when it is represented in any Windows (dare I say ANSI) SBCS or DBCS, but Wide Char Unicode is always different. So it is very convenient for the A and W Windows APIs mentioned above to carry the double-byte character sets under the A label. Another practical reason might be that the ANSI character sets were distinguished from OEM character sets having to do with DOS and hardware, and since OEM included the double-byte sets it was convenient for ANSI to include them too.
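A quick way to see both points is to encode a few strings and look at the bytes. The following is only a sketch in Python (not the Delphi or Win32 code from the project). It assumes Python's built-in cp1252, cp1251, cp932, cp936, cp949 and cp950 codecs are close enough stand-ins for the corresponding Windows code pages, and the helper name is made up for illustration.

```python
# Sketch: ASCII text is byte-identical in Windows SBCS and DBCS code pages,
# while UTF-16 ("wide char") is not. Python's codecs stand in for the
# Windows code pages here; that is an assumption of this example.

ANSI_CODE_PAGES = ["cp1252", "cp1251", "cp932", "cp936", "cp949", "cp950"]


def ascii_bytes_per_code_page(text):
    """Hypothetical helper: encode text in each code page listed above."""
    return {cp: text.encode(cp) for cp in ANSI_CODE_PAGES}


ascii_text = "Hello"
encoded = ascii_bytes_per_code_page(ascii_text)

# Check that every "ANSI" code page yields the same bytes as plain ASCII.
all_same_as_ascii = all(b == ascii_text.encode("ascii") for b in encoded.values())

# The wide-character form interleaves zero bytes, so it differs from all of them.
wide = ascii_text.encode("utf-16-le")

# In a DBCS code page a character takes 1 or 2 bytes: "A" is one byte and
# the kanji is two, so this 2-character string is 3 bytes in code page 932.
mixed = "A\u65e5".encode("cp932")

print(all_same_as_ascii, len(wide), len(mixed))
```

The mixed string is the reason "double-byte" is really multi-byte: code that assumes one byte per character, or two bytes per character, miscounts it either way.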

There is no dispute about whether Microsoft includes the double-byte character sets among its ANSI character sets. The Microsoft list of Code-Page Identifiers marks the Far Eastern DBCS code pages (932, 936, 949, 950) as ANSI. The use of "ANSI/OEM" merely indicates that these are also OEM code pages, not that "ANSI/OEM" is some kind of special ANSI (these code pages are listed more clearly as ANSI & OEM on this page).

But there seems to always be fuzziness about this issue. One MSDN article holds back by saying "DBCS can be thought of as the ANSI character set for some Asian versions of Microsoft Windows". Another Microsoft article actually describes "Windows ANSI" as different from double-byte sets due to the lingering association of the ANSI misnomer with the single-byte sets. But the fact is that across Win32 programming, ANSI is always the name for the alternative to UNICODE, and ANSI functions always support the double-byte character sets.

Now, to explain the experience that prompted this article, we have to go back to the independent expert's assertion that you cannot assume ANSI for non-Unicode strings. I responded that "double-byte character sets such as Shift-JIS are ANSI and since the ANSI strings used in [the program] are based on machine locale you can assume ANSI locale encoding."

But I was chastised by a senior manager (the President of the Japanese subsidiary) who said that "I am a business guy but even I know that what Ben wrote [is wrong]."

Taking sides with the independent expert, the senior manager (showing an instinctive grasp of the subject despite having picked the wrong battle) went on to state: "ANSI does not always mean multi-byte safe. By specifying an ANSI code page the code can't know whether it is pure ANSI (single-byte) or ANSI/OEM (code page 932 - Japanese). ANSI encoding is single-byte by default which may cause double-byte problems. For conversion when the source is in Japanese Shift-JIS, if you just specify "ANSI" (not code page 932 - Shift-JIS), the code will separate the two bytes and do byte-based conversion to Unicode (2 bytes to 4 bytes instead of the correct way - 2 bytes to 2 bytes)."

His statement shows "ANSI confusion" that is understandable considering what I explained above, and it points to what could be a legitimate concern if the wrong ANSI code page were applied. However, ultimately this particular disagreement does not depend on the definition of ANSI; the expert made an incorrect assertion that double-byte strings needed to be handled differently in the Delphi program in question. There may be other development situations (none that I know of) where something like this is a concern, but I tested it out and the functions in question worked correctly with a Far Eastern double-byte code page as the default ANSI code page.
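Both the concern and its resolution can be reproduced outside Delphi. The sketch below uses Python's codecs as stand-ins for the Windows code pages, which is an assumption, since the real conversion went through the system's ANSI code page. The function name is hypothetical. Decoding Shift-JIS bytes with code page 1252 splits each double-byte character into two, while decoding with the matching code page 932 keeps each pair together.

```python
def ansi_to_utf16(data, code_page):
    """Hypothetical helper: decode 'ANSI' bytes, return the text and its UTF-16 bytes."""
    text = data.decode(code_page)
    return text, text.encode("utf-16-le")


# Two kanji ("Nihon", Japan) as they would be stored in a code page 932 string.
sjis_bytes = "\u65e5\u672c".encode("cp932")   # 4 bytes: 93 FA 96 7B

# Wrong code page: each byte becomes its own character (2 bytes -> 4 bytes each).
wrong_text, wrong_utf16 = ansi_to_utf16(sjis_bytes, "cp1252")

# Matching code page: each byte pair becomes one character (2 bytes -> 2 bytes).
right_text, right_utf16 = ansi_to_utf16(sjis_bytes, "cp932")

print(len(sjis_bytes), len(wrong_text), len(wrong_utf16))   # byte-wise split
print(len(sjis_bytes), len(right_text), len(right_utf16))   # correct conversion
```

So the failure the manager described is real, but it comes from applying a code page that does not match the data, not from the data being double-byte under an ANSI label.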

So do you want the final answer to the question of whether double-byte is ANSI? The answer is definitely yes, at least in terms of Windows programming. But in discussions with senior management you must allow for variations in the use of the terminology, remembering that ANSI is a misnomer in this case anyway.