Double-Byte Safety Primer

A lot of incorrect string processing goes unnoticed when your software only deals with single-byte character sets. Once your software is used in a Far Eastern locale, numerous new problems can show up, although they may still surface only rarely. Understanding the underlying multi-byte string function issues goes a long way toward getting your software shipshape.

I'm not talking about localization; I am talking about simply having your English program work on Japanese or Chinese Windows. In other words, even though all of your menus, dialogs and message boxes are still in English, you might have bugs because of Far Eastern pathnames and other text that affects your program.

You can switch locales without switching single-byte character sets (e.g. U.S. to France). In this case, locale issues such as converting strings to numbers and handling dates can expose bugs, but in this article I am focusing on double-byte related issues.

For starters, a Visual C++ non-Unicode program that is built to work on Far Eastern code pages needs to have _MBCS defined, and the _tcs versions of the string functions should be used, since with _MBCS defined they map to the _mbs functions, which handle the default system locale code page correctly. Here are some of the common string functions that matter:

  • for strchr use _tcschr which maps to _mbschr
  • for strstr use _tcsstr which maps to _mbsstr
  • for strcmp use _tcscmp which maps to _mbscmp

Some, like _tcslen and _tcscpy, don't really matter because they work in terms of bytes either way, but they are still recommended for consistency and in case of an eventual move to Unicode.
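To see why the mappings matter, here is a rough sketch that contrasts byte-wise strchr with a character-aware search. It is portable C++ rather than the Visual C++ runtime: isLeadByte932 and findChar932 are hypothetical stand-ins for _ismbblead and _mbschr, and the lead-byte ranges are hard-coded on the assumption of code page 932 (Shift-JIS).

```cpp
#include <cassert>
#include <cstring>

// Assumption: Shift-JIS (code page 932) lead bytes are 0x81-0x9F and 0xE0-0xFC.
// Hypothetical stand-in for the CRT's _ismbblead.
bool isLeadByte932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical stand-in for _mbschr: returns the first whole character equal
// to the single-byte character ch, or nullptr if there is none.
const char *findChar932(const char *s, char ch)
{
    assert(s != nullptr);
    while (*s)
    {
        if (isLeadByte932(static_cast<unsigned char>(*s)) && s[1] != '\0')
            s += 2;      // double-byte character: skip lead and trail byte
        else if (*s == ch)
            return s;    // genuine single-byte match
        else
            ++s;
    }
    return nullptr;
}

// Shift-JIS bytes of the two-byte character 0x83 0x5C (katakana SO),
// followed by a real backslash and the letter a.
const char kSample[] = { '\x83', '\x5C', '\\', 'a', '\0' };
```

With kSample, strchr(kSample, '\\') stops at offset 1, in the middle of the double-byte character, while findChar932(kSample, '\\') returns offset 2, the real separator.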

When you loop through the characters of a string, you need to use the character length function _tclen. The following code looks for the first backslash in string s, but it steps one byte at a time, so it can stop on the second byte of a two-byte character:

int n=0;
while ( s[n] && s[n]!='\\' )
  ++n; // unsafe

Instead, you need to increment by the character size:

int n=0;
while ( s[n] && s[n]!='\\' )
  n += _tclen(&s[n]);
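The same stepping pattern works for any walk over a string, not just a search. As a minimal sketch, assuming code page 932 and using charLen932 as a hypothetical portable stand-in for _tclen, the following collects the byte offset at which each character begins:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: Shift-JIS (code page 932) lead bytes are 0x81-0x9F and 0xE0-0xFC.
bool isLeadByte932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical stand-in for _tclen: bytes occupied by the character at p.
std::size_t charLen932(const char *p)
{
    assert(p != nullptr && *p != '\0');
    const bool twoBytes =
        isLeadByte932(static_cast<unsigned char>(p[0])) && p[1] != '\0';
    return twoBytes ? 2 : 1;
}

// Collects the byte offset at which each character of s begins,
// advancing by the character size rather than by one byte.
std::vector<std::size_t> charOffsets932(const char *s)
{
    assert(s != nullptr);
    std::vector<std::size_t> offsets;
    for (std::size_t n = 0; s[n] != '\0'; n += charLen932(&s[n]))
        offsets.push_back(n);
    return offsets;
}

// Shift-JIS bytes: the letter a, the two-byte character 0x83 0x5C, the letter b.
const char kSample[] = { 'a', '\x83', '\x5C', 'b', '\0' };
```

For kSample the offsets are 0, 1 and 3: offset 2 is the trail byte of the double-byte character and is never visited.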

Looking at the last character in a string is not trivial. You cannot just test s[nLen-1] because the last byte might be the second byte of a two-byte character. Depending on the problem, it may be simplest to find the last character by looping through the whole string and remembering where the most recent character started.

int n=0, nLast=0;
while ( s[n] )
{
  nLast = n; // start of the most recent character
  n += _tclen(&s[n]);
}

Other languages have functions like EndsWith, but C++ is skimpy here. You can also use _mbsdec, the function for finding the previous character in a multi-byte string (though it is implemented with a loop through preceding characters anyway).

// _tcsdec maps to _mbsdec under _MBCS; it returns NULL if s is empty
char *pLast = _tcsdec(s,&s[_tcslen(s)]);
int nLast = pLast ? (int)(pLast - s) : 0;
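To make the behavior of a "previous character" function concrete, here is a portable sketch. prevChar932 is a hypothetical stand-in for _mbsdec, not the runtime's actual implementation: it assumes code page 932 and simply scans forward from the start of the string, which is one straightforward way to guarantee the result lands on a character boundary.

```cpp
#include <cassert>
#include <cstring>

// Assumption: Shift-JIS (code page 932) lead bytes are 0x81-0x9F and 0xE0-0xFC.
bool isLeadByte932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical stand-in for _mbsdec: returns the start of the character that
// precedes current, or nullptr when current is not past start.
// Assumes current lies on a character boundary of the string at start.
const char *prevChar932(const char *start, const char *current)
{
    assert(start != nullptr && current != nullptr);
    if (current <= start)
        return nullptr;
    const char *prev = start;
    const char *p = start;
    while (p < current)
    {
        prev = p;   // remember where this character began
        const bool twoBytes =
            isLeadByte932(static_cast<unsigned char>(*p)) && p + 1 < current;
        p += twoBytes ? 2 : 1;
    }
    return prev;
}

// Shift-JIS bytes of a path whose last character is the two-byte 0x83 0x5C.
const char kSample[] = { 'C', ':', '\\', '\x83', '\x5C', '\0' };
```

Calling prevChar932(kSample, kSample + strlen(kSample)) yields offset 3, the lead byte of the final character, rather than offset 4, which is where s[nLen-1] would look.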

Pathnames are probably the single most common place that double-byte related ANSI string handling bugs are exposed. It is very common when dealing with pathnames to have a little ad hoc parsing here and there in a program, and it is generally not multi-byte safe.

If you want to test whether the last character is a backslash, use the _tcsrchr function (_mbsrchr) to scan for the last occurrence and check whether it is the character just before the null terminator:

char *pSlash = _tcsrchr(s,'\\');
if ( pSlash && ! pSlash[1] )
{
  // ends with backslash
}
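The difference between the byte-wise test and a character-aware one shows up as soon as a path ends in a double-byte character whose trail byte is 0x5C. The sketch below is portable C++ under the assumption of code page 932; both function names are hypothetical, and the safe version walks the string by character instead of calling _mbsrchr.

```cpp
#include <cassert>
#include <cstring>

// Assumption: Shift-JIS (code page 932) lead bytes are 0x81-0x9F and 0xE0-0xFC.
bool isLeadByte932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Byte-wise check: wrong when the last byte is a trail byte equal to 0x5C.
bool endsWithBackslashNaive(const char *s)
{
    assert(s != nullptr);
    const std::size_t len = std::strlen(s);
    return len > 0 && s[len - 1] == '\\';
}

// Character-aware check: step forward by whole characters and remember
// whether the most recent one was a single-byte backslash.
bool endsWithBackslash932(const char *s)
{
    assert(s != nullptr);
    bool lastIsBackslash = false;
    while (*s)
    {
        if (isLeadByte932(static_cast<unsigned char>(*s)) && s[1] != '\0')
        {
            lastIsBackslash = false;   // a double-byte character is never a backslash
            s += 2;
        }
        else
        {
            lastIsBackslash = (*s == '\\');
            ++s;
        }
    }
    return lastIsBackslash;
}

// Path ending in the two-byte character 0x83 0x5C (no trailing separator).
const char kEndsWithSo[] = { 'C', ':', '\\', '\x83', '\x5C', '\0' };
// The same path with a real trailing separator appended.
const char kEndsWithSlash[] = { 'C', ':', '\\', '\x83', '\x5C', '\\', '\0' };
```

For kEndsWithSo the naive check reports a trailing backslash that is not there, while endsWithBackslash932 correctly reports none; both agree on kEndsWithSlash.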

Double-byte characters whose second byte is the ASCII value for backslash are great for testing. Use a character set table for your code page (such as one for Shift-JIS) to find valid characters ending in backslash (hex 5C happens to be displayed as the Yen sign on Japanese machines, but it is still the directory name separator in paths). Here are two randomly chosen characters:

ソ 0x835C U+30BD # KATAKANA LETTER SO
浬 0x8A5C U+6D6C # cjk (nautical mile)

You can use these characters in the names of directories that are used by the program, and enter them into any edit fields in the program that accept filenames. I was able to find a bug in WinZip 8.0 using this trick simply by trying to zip a folder containing one of these characters with the machine locale set to Shift-JIS (I also observed that it was fixed in WinZip 8.1).

When you use one of these characters in a folder or file name in the Windows file system it is stored internally as Unicode. However, the non-Unicode program only deals in ANSI strings. When the program calls Windows APIs like GetCurrentDirectory, it gets an ANSI string representing the pathname. If it tries to check if a pathname ends with backslash or otherwise parses out the subdirectory names, it may be double-byte unsafe.
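As an illustration of parsing out subdirectory names safely, here is a minimal sketch of a path splitter that treats double-byte characters as units. It assumes code page 932, splitPath932 is a hypothetical helper rather than a Windows or runtime API, and the sample bytes stand for the kind of ANSI string an API such as GetCurrentDirectory might hand back on a Japanese system.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Assumption: Shift-JIS (code page 932) lead bytes are 0x81-0x9F and 0xE0-0xFC.
bool isLeadByte932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Splits an ANSI path on backslash separators without splitting inside a
// double-byte character. Empty components (from leading, trailing or
// doubled separators) are dropped.
std::vector<std::string> splitPath932(const char *path)
{
    assert(path != nullptr);
    std::vector<std::string> parts;
    std::string current;
    for (const char *p = path; *p != '\0'; )
    {
        if (isLeadByte932(static_cast<unsigned char>(*p)) && p[1] != '\0')
        {
            current.append(p, 2);   // copy lead and trail byte together
            p += 2;
        }
        else if (*p == '\\')
        {
            if (!current.empty())
                parts.push_back(current);
            current.clear();
            ++p;
        }
        else
        {
            current.push_back(*p);
            ++p;
        }
    }
    if (!current.empty())
        parts.push_back(current);
    return parts;
}

// Drive, then a folder named with the two-byte character 0x83 0x5C, then a.
const char kSample[] = { 'C', ':', '\\', '\x83', '\x5C', '\\', 'a', '\0' };
```

splitPath932(kSample) yields three components: the drive, the two-byte folder name kept intact, and a. A byte-wise split on 0x5C would instead cut the folder name in half and leave a stray lead byte as its own component.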