UTF-8 Versus Windows UNICODE

The term "Windows UNICODE" refers to the UCS-2 (and later UTF-16) encoding chosen by Microsoft for their standard Unicode encoding. For years, one of the recurrent thoughts in my head has been "what if Microsoft went with UTF-8 instead of UCS-2 as their Unicode encoding?" Many times I feel I would have preferred UTF-8, but it is by no means a simple issue. People often point to speed as the defining issue but even speed is not a slam dunk. Here are some points for comparison between UTF-8 and UTF-16.

Compatibility: most would say UTF-8 wins this point

The beauty of UTF-8 is that it is compatible with ASCII, even more so than the Far Eastern multibyte character sets, whose second byte can be mistaken for an ASCII character. Much of the ASCII parsing done in almost every piece of software can work on UTF-8 strings without modification: for example, scanning a string for a period or a slash, matching an English keyword, or concatenating. Only code that assumes one character per byte needs modification, such as code that divides a string to display characters individually or truncates it to a certain length.

Existing ASCII files are already UTF-8.
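
As a minimal sketch of that point (the path and the ".txt" suffix are invented purely for illustration), here is ordinary byte-oriented C handling a UTF-8 string without knowing it is UTF-8. Every byte of a multibyte UTF-8 character is 0x80 or above, so none of them can be mistaken for '/' or any other ASCII byte:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* A UTF-8 string: ASCII path separators plus a non-ASCII name. */
        char path[64] = "docs/r\xC3\xA9sum\xC3\xA9";   /* "docs/résumé" */

        char *slash = strchr(path, '/');   /* ASCII scan works unchanged */
        strcat(path, ".txt");              /* concatenation is byte-oriented */

        printf("separator at offset %d, full path: %s\n",
               (int)(slash - path), path);
        return 0;
    }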

But compatibility is not always the goal. In some ways the wide-character incompatibility with ASCII helped people clearly distinguish Unicode strings from non-Unicode ones. Win32 comes with separate APIs for UNICODE and ANSI strings, indicated by the A and W suffixes in names such as SetWindowTextA and SetWindowTextW. Because the Unicode functions take a different string type, many potential bugs are caught at compile time by the compiler's type checking.
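
For instance, a sketch assuming hwnd is some valid window handle: the A and W variants of a Win32 call take different string types, so passing the wrong kind of string is a compile-time error rather than a runtime surprise.

    #include <windows.h>

    void set_title(HWND hwnd)
    {
        SetWindowTextA(hwnd, "Report");    /* char string -> A function */
        SetWindowTextW(hwnd, L"Report");   /* wide string -> W function */

        /* SetWindowTextW(hwnd, "Report");   error: char* is not wchar_t* */
        /* SetWindowTextA(hwnd, L"Report");  error: wchar_t* is not char* */
    }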

Complexity: I think UTF-8 wins this point

To be fair to the choice Microsoft made, you have to go back to the early 90s, when Win32 was being developed and many people assumed that the 65,536 values available in two bytes would be enough to cover the Unicode character set. Wide character strings became the way of doing Unicode in the 90s, not just in Windows but in the C++ standard across platforms (although the size of the wchar_t type can vary by C++ implementation).

Microsoft saw the benefit of standardizing on a simple fixed 2-byte encoding, despite the need for a new string type and new APIs for Unicode strings. Unfortunately, by the time they were committed to UCS-2, it had become apparent that two bytes would not be enough to cover the whole Unicode character set. Oops!

If the desire was to avoid a variable-length encoding, they ended up with one anyway: UTF-16. UTF-16 is the same as UCS-2 except that characters beyond the original 2-byte range are supported through surrogate pairs (taking 4 bytes instead of 2). The migration from UCS-2 to UTF-16 is going slowly because almost no one cares! The characters that are not available in UCS-2 are not supported by most fonts and simply aren't on the radar yet for most applications. There is no support for surrogate pairs in Visual C++ 6.0; surrogate handling has only really become a part of Windows programming with .Net.
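
For the curious, this is roughly what surrogate pair decoding involves; the utf16_next helper below is purely illustrative (no handling of unpaired surrogates) and not how Windows itself does it:

    #include <stdio.h>

    /* Return the code point starting at s[*i] and advance *i by 1 or 2
       UTF-16 units. */
    unsigned long utf16_next(const unsigned short *s, int *i)
    {
        unsigned short u = s[(*i)++];
        if (u >= 0xD800 && u <= 0xDBFF) {        /* high surrogate */
            unsigned short lo = s[(*i)++];       /* low surrogate follows */
            return 0x10000 + (((unsigned long)(u - 0xD800) << 10) | (lo - 0xDC00));
        }
        return u;                                /* plain UCS-2 character */
    }

    int main(void)
    {
        /* U+1D11E (a musical clef) is the surrogate pair D834 DD1E. */
        unsigned short text[] = { 0x0041, 0xD834, 0xDD1E, 0 };
        int i = 0;
        while (text[i])
            printf("U+%04lX\n", utf16_next(text, &i));
        return 0;
    }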

Multibyte-aware programming for UTF-8 can be done much the same way it is done for Far Eastern code pages, and in hindsight it would have been easier to institute in Windows 95 than it is to retrofit surrogate pair handling now. In C we would have avoided all the L"string" and wchar_t business and kept a lot of string manipulation unchanged, although there would have been some flags or modal issues to distinguish ANSI strings from UTF-8 strings.
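
A sketch of that multibyte-aware style for UTF-8; the utf8_next helper is hypothetical, playing the role a CharNext-style call plays for the double-byte code pages:

    #include <stdio.h>
    #include <string.h>

    /* Step to the start of the next character.  UTF-8 continuation bytes
       always have the form 10xxxxxx, so they are easy to skip over. */
    const char *utf8_next(const char *p)
    {
        if (*p)
            p++;
        while (((unsigned char)*p & 0xC0) == 0x80)
            p++;
        return p;
    }

    int main(void)
    {
        const char *s = "a\xC3\xA9\xE4\xB8\xAD";   /* "a", "é", "中" */
        const char *p;
        int chars = 0;
        for (p = s; *p; p = utf8_next(p))
            chars++;
        printf("%d characters in %d bytes\n", chars, (int)strlen(s));
        return 0;
    }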

Size: UTF-8 usually wins this point

In the West, UTF-8 is generally seen as less wasteful than UTF-16 because the vast majority of text is in the ASCII range, taking 1 byte per character in UTF-8 and 2 in UTF-16. The non-ASCII characters of most Western character sets take 2 bytes in UTF-8, the same as in UTF-16, so for those characters it is a tie.

In Far Eastern locales, UTF-8 is seen as wasteful because it uses 3 bytes per character compared to 2 for UCS-2 or for the double-byte code pages. However, the apparent waste is not as bad as it sounds. ASCII characters often find their way into Far Eastern computing solutions (e.g. SQL keywords, numbers, and math), and these take only 1 byte per character in UTF-8 and double-byte while they take 2 bytes in UCS-2, bringing the average ratio back down. Double-byte is always the most efficient in terms of memory, but it is not Unicode and cannot handle neighboring languages or even international currency symbols.
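
As a rough back-of-the-envelope illustration (the fragment is invented), take a short SQL snippet containing two kanji:

    SELECT 価格        (7 ASCII characters including the space, plus 2 kanji)

    UTF-8:         7 x 1 + 2 x 3 = 13 bytes
    UCS-2/UTF-16:  9 x 2         = 18 bytes
    Double-byte:   7 x 1 + 2 x 2 = 11 bytes

The mixed content pulls UTF-8 well below UTF-16, while double-byte remains the smallest of the three.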

Because ASCII is so overwhelmingly prevalent, at least at this point in computing history, UTF-8 is often the most efficient choice among the Unicode encodings.

Speed: UTF-16 wins this one

When people talk about speed, they often talk about the assumptions that can be made with a fixed-width encoding like UCS-2 versus a multibyte encoding like UTF-8. But even with properly implemented UTF-16, surrogate pairs are rare and will not cause much speed degradation. For UTF-8, multi-byte characters are the norm outside of English.

Speed differences are a result of the kinds of operations being done on the text. Character-by-character processing requires extra logic to determine the character boundaries in a multibyte character set. But some string operations, such as scanning for an ASCII delimiter or getting the string length, do not need to concern themselves with character boundaries in UTF-8 (or UTF-16).

Computing string length is generally faster for a UTF-16 string. A string length here is not the number of characters but simply the length in terms of memory (a character count is rarely needed in string manipulation). The speed of a string length function is essentially a function of the number of times you compare to zero. So the speed depends on the number of byte or word (2-byte) units needed to hold the string, which is the same for the two encodings when all characters are in the ASCII range, but smaller for a UTF-16 string if there are characters from outside that range.
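
A small sketch of the point, assuming the Windows convention of a 2-byte wchar_t: strlen and wcslen both just scan for the zero terminator, and what differs is how many units they have to walk.

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        /* "café": the é takes 2 bytes in UTF-8 but only 1 unit in UTF-16. */
        const char    *u8  = "caf\xC3\xA9";
        const wchar_t *u16 = L"caf\xE9";

        printf("UTF-8:  %d byte units\n", (int)strlen(u8));    /* 5 */
        printf("UTF-16: %d word units\n", (int)wcslen(u16));   /* 4 */
        return 0;
    }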

Speed can also be affected by how much data is being moved around, both in memory and in file I/O. So this goes back to the issue of size.

Who wins overall?

I think UTF-8 wins (barely), just because of a hunch that has grown over the years. But I have not made a convincing argument here; these were just some points of comparison on compatibility, complexity, size and speed. There are so many pros and cons that everyone can find a strong argument to support either UTF-8 or UTF-16.

On the horizon is the question of broad usage of UTF-32, which may be the ultimate fixed-width encoding. Also, don't think for a minute that the problems end with encoding schemes, because Unicode itself is quite a complex beast (for example, some characters can be represented in two different ways in Unicode due to combining characters).
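
For instance, "é" can be stored either as the precomposed character U+00E9 or as "e" followed by the combining acute accent U+0301; the two spellings display identically but are different byte sequences, so a naive comparison treats them as different strings. A tiny sketch using the UTF-8 bytes:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *precomposed = "\xC3\xA9";     /* U+00E9          */
        const char *combining   = "e\xCC\x81";    /* U+0065 + U+0301 */

        /* Same character to a reader, different bytes to the machine,
           so a normalization step is needed before comparing. */
        printf("%s\n", strcmp(precomposed, combining) == 0 ? "equal" : "not equal");
        return 0;
    }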