The secret family split in Windows code page functions

My earlier post "Strange case of two system locale ANSI charsets" discussed the confusion between the default system locale (GetACP, Language for non-Unicode Programs) and the default user locale (setlocale, Standards and Formats). There I mentioned a problem with setting the system code page in C/C++ using setlocale, but that turns out to be only the first clue to a secret split in the family of locale-based charset functions.

Windows locale terminology is clarified by Michael Kaplan in the article "What is my locale? Well, which locale do you mean."

I'll start out with a simple tip. If you want the "locale dependent" multi-byte functions in Visual C++ (such as mblen, _mbstrlen, mbstowcs, mbtowc, wcstombs, and wctomb) to operate in your default system locale code page, do the following at the beginning of your program:

char szCP[10];
sprintf( szCP, ".%u", GetACP() ); // GetACP returns an unsigned UINT
setlocale( LC_ALL, szCP );

First of all, setlocale is necessary because otherwise the "locale dependent" (read: "setlocale dependent") functions operate in the "C" locale, which assumes a single-byte character set (SBCS). Secondly, you need to use GetACP to determine the default system locale code page.
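As a slightly more defensive variant of the tip above, the locale string can be built with a bounded write. This is only a sketch: format_codepage_locale is a hypothetical helper name (not a CRT function), it assumes a C99 snprintf is available, and it takes the code page as a parameter so that the formatting can be exercised off Windows as well.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of the CRT): writes ".<codepage>" into buf.
   Returns the number of characters written, or -1 if buf is too small. */
int format_codepage_locale(unsigned int codepage, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    int n = snprintf(buf, size, ".%u", codepage);
    if (n < 0 || (size_t)n >= size) {
        buf[0] = '\0';   /* leave an empty string rather than a truncated one */
        return -1;
    }
    return n;
}

/* Intended use on Windows (assumes <windows.h> and <locale.h>):
       char szCP[16];
       if (format_codepage_locale(GetACP(), szCP, sizeof szCP) > 0)
           setlocale(LC_ALL, szCP);
*/
```

Checking the return value means a truncated string is never handed to setlocale, which would otherwise quietly select the wrong code page or fail.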

Older MSDN setlocale documentation erroneously says that you can call setlocale(LC_ALL,"") or setlocale(LC_ALL,".ACP") to use the "system-default ANSI code page"; however, these actually select the code page of the default user locale, i.e. the calendar/currency setting. Programs and (more frustratingly) third-party components that do not know this will corrupt text whenever the default user and system locales are different.
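One way to see which code page setlocale actually picked is to look at the locale name it returns, which on Windows typically has the form "Language_Country.codepage". The following is a rough sketch; codepage_from_locale_name is a hypothetical helper, and it assumes the code page appears as a purely numeric suffix after the last dot.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns the numeric code page at the end of a
   locale name such as "Japanese_Japan.932", or 0 if there is none. */
unsigned int codepage_from_locale_name(const char *name)
{
    assert(name != NULL);
    const char *dot = strrchr(name, '.');
    if (dot == NULL || dot[1] == '\0')
        return 0;                       /* no suffix, e.g. the "C" locale */
    char *end = NULL;
    unsigned long cp = strtoul(dot + 1, &end, 10);
    if (*end != '\0' || cp > 65535UL)
        return 0;                       /* suffix is not a plain number */
    return (unsigned int)cp;
}

/* Intended use on Windows (assumes <windows.h> and <locale.h>):
       setlocale(LC_ALL, "");
       unsigned int cp = codepage_from_locale_name(setlocale(LC_CTYPE, NULL));
       if (cp != 0 && cp != GetACP())
           ... the user and system locale code pages differ ...
*/
```

A mismatch between this value and GetACP is exactly the situation in which components that rely on setlocale(LC_ALL,"") are likely to garble text.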

Ready for another side swipe? Most of the multi-byte character set (MBCS) functions are not "locale dependent" at all; they are instead based on the default system locale code page! You may have noticed that when you use functions like _tclen in an _MBCS build, you do not need to call setlocale to get the correct multi-byte character length. This is because _tclen resolves to _mbclen rather than mblen.

What! That's right, there are actually 2 sets of functions for dealing with multi-byte character sets in Microsoft Visual C++. The code pages that these functions operate with are different. The so-called "locale dependent" functions (e.g. mblen) operate with the setlocale code page. The others (e.g. _mbclen) use the _getmbcp code page which is based on the default system locale code page (GetACP).

Still not sure what I am saying? Take a plain Visual C++ program with _MBCS defined (it is the default for non-Unicode programs in App Wizard), and try the following code on a machine that has Japanese as the Language for non-Unicode Programs (this is the 3rd tab in Regional Settings on Windows XP).

const char* pszTestString = "\x8a\x5c";
int n = _mbclen((const unsigned char*)pszTestString); // 2
n = mblen(pszTestString,2); // -1
setlocale( LC_ALL, ".932" );
n = _mbclen((const unsigned char*)pszTestString); // 2
n = mblen(pszTestString,2); // 2
_setmbcp( 1252 );
n = _mbclen((const unsigned char*)pszTestString); // 1
n = mblen(pszTestString,2); // 2

The example double-byte character in pszTestString yields three different results (2, -1, and 1) from these 2 functions, both of which report character length (the correct answer in code page 932 is 2). The _mbclen function returns the correct answer at first, is unaffected by setlocale, but returns 1 after _setmbcp(1252). Meanwhile, the mblen function reports an invalid character at first, returns the correct answer after setlocale(LC_ALL,".932"), and is unaffected by _setmbcp.
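To see why the answer depends on the code page at all, here is a small portable sketch of a lead-byte check. It is not how the CRT implements _mbclen or mblen; char_len_in_codepage and is_cp932_lead_byte are hypothetical names, and only code page 932 is treated as double-byte (lead bytes 0x81-0x9F and 0xE0-0xFC), with every other code page assumed to be single-byte.

```c
#include <assert.h>
#include <stddef.h>

/* Lead-byte ranges of code page 932 (Shift-JIS): 0x81-0x9F and 0xE0-0xFC. */
int is_cp932_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical stand-in for a character-length query. Only 932 is treated
   as double-byte here; any other code page is treated as single-byte. */
size_t char_len_in_codepage(unsigned int codepage, const unsigned char *s)
{
    assert(s != NULL);
    if (s[0] == '\0')
        return 0;                       /* empty string */
    if (codepage == 932 && is_cp932_lead_byte(s[0]) && s[1] != '\0')
        return 2;                       /* lead byte plus trail byte */
    return 1;
}
```

With the test bytes 0x8A 0x5C, this sketch gives 2 under code page 932 and 1 under code page 1252, mirroring what _mbclen reports before and after _setmbcp(1252). The same bytes mean different things depending on which code page a function consults, which is why it matters that the two families consult different ones.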

These distinct functions are poorly delineated in the MSDN documentation; in fact, they are all mixed together, usually with no indication whatsoever that they operate with different code pages. One clue is that the header for _mbclen is mbstring.h, while the header for mblen is stdlib.h. The following paragraphs summarize the 2 different types of functions.

"locale dependent" functions: start out in "C" locale; are controlled by setlocale; use header stdlib.h; examples are mblen, isleadbyte, _mbstrlen, mbtowc, wctomb.

default system locale functions: start out in the GetACP code page; are controlled by _setmbcp; use header mbstring.h; examples are _mbclen, _ismbblead, _mbslen.

A rule of thumb is to stick to the _t functions such as _tclen and _istleadbyte for strings because they will always resolve to the default system locale code page in an _MBCS build. Only use setlocale and "locale dependent" functions in relation to Standards and Formats (i.e. calendar/currency stuff).