Splitting Surrogate Pairs

Microsoft chose UCS-2 as its Unicode encoding back when it seemed like a nice, simple fixed size per character; then Unicode promptly outgrew UCS-2. As I said in UTF-8 Versus Windows UNICODE, that early impression of simplicity, in contrast to UTF-8's multibyte encoding, backfired.

What are surrogate pairs?

Surrogate pairs are UTF-16's answer to multibyte encoding. Basically, in UTF-16 a Unicode character is encoded as either one or two 16-bit values; when it takes two, those two values form a "surrogate pair". Surrogate pairs are simple, yet they inevitably lead to a great deal of confusion.

The complexity stems from the fact that in the beginning of Unicode nobody anticipated needing more than the 65536 code points that fit in 16 bits. UCS-2 (one 16-bit value supporting values from 0 to 65535) was considered adequate until they realized they were going to need to go beyond it. That's when they reserved a special, as-yet-unused range within those 65536 values for surrogates and developed UTF-16 as the evolution of UCS-2 (Unicode 2.0 in 1996). It opened up the potential for just over a million more code points (up to 1114111, hex 10FFFF).
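
A quick back-of-the-envelope check of that arithmetic, as a small Python sketch:

    # Each half of a surrogate pair carries 10 bits of payload, so pairs
    # add 2**20 = 1048576 code points on top of the original 65536:
    print(1 << 20)                  # 1048576
    print(0x10000 + (1 << 20) - 1)  # 1114111, i.e. hex 10FFFF, the Unicode ceiling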

For Unicode code points up to U+FFFF (outside the reserved surrogate range), the UTF-16 encoding is a plain 16-bit value (word) holding the binary representation of the code point number. But for Unicode code points above U+FFFF, a surrogate pair is used.
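
To make the mechanics concrete, here is a minimal Python sketch of the arithmetic UTF-16 uses (the function names are mine, purely for illustration):

    def to_surrogate_pair(cp):
        # Split a supplementary code point (U+10000..U+10FFFF) into a
        # leading (high) and trailing (low) surrogate.
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000                      # 20 bits of payload
        return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

    def from_surrogate_pair(high, low):
        # Recombine a surrogate pair into the original code point.
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    # U+1F600, a supplementary character, round-trips through D83D/DE00:
    print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
    print(hex(from_surrogate_pair(0xD83D, 0xDE00)))      # 0x1f600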

"Surrogate" vs "supplementary"

Where the confusion rears its ugly head is in the terminology. Michael Kaplan wrote about it in There is no such thing as a surrogate character (dammit!), and the Unicode FAQ sets the record straight. Simply put:

A "surrogate code point" is a value in the range U+D800 to U+DFFF reserved for the sake of UTF-16 encoding. Pairs of surrogate code points are used in UTF-16 to represent "supplementary code points" in the range U+10000 to U+10FFFF. A "supplementary character" is the character corresponding to the supplementary code point. The term "surrogate character" is a misnomer because there are no characters corresponding to surrogate code points.

In any other encoding, such as UTF-8, surrogate code points are never supposed to appear at all, but read on...
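
Python's own UTF-8 codec illustrates the rule; a lone surrogate is flatly rejected:

    # A Python str can hold a lone surrogate, but the UTF-8 codec refuses it:
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError as e:
        print(e)  # 'utf-8' codec can't encode character '\ud800' ...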

Where splitting pairs comes in

If you decode a UTF-16 string as if it were UCS-2 (i.e. ignoring the possibility of surrogates) and then, say, convert it to UTF-8, each supplementary character comes out as two separate code points, both invalid in UTF-8. This is precisely what happened with Oracle's "UTF8" charset, and it is the reason Oracle created the "AL32UTF8" charset to implement UTF-8 properly. I wrote about this in The Enigma of Encoding Versions.
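
To see the damage concretely, here is a small Python sketch that mimics such a naive converter (using U+1F600 as a sample supplementary character): it chops the UTF-16 stream into 16-bit units and encodes each one as UTF-8 on its own, producing six invalid bytes instead of the correct four:

    ch = "\U0001F600"                # a supplementary character
    utf16 = ch.encode("utf-16-be")   # b'\xd8\x3d\xde\x00' -- one surrogate pair

    # Naive "UCS-2" view: treat each 16-bit unit as a code point of its own
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]

    # Encode each unit independently ("surrogatepass" forces Python to comply)
    broken = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)
    print(broken.hex(" "))              # ed a0 bd ed b8 80 -- invalid UTF-8
    print(ch.encode("utf-8").hex(" "))  # f0 9f 98 80       -- correct UTF-8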

So the simple point I am making is that when dealing with UTF-16, if you don't pay attention to surrogate pairs, you run the risk of "splitting pairs." A pair consists of a leading surrogate in the range U+D800 to U+DBFF followed by a trailing surrogate in the range U+DC00 to U+DFFF. Either of these by itself is an error.
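
If you need to check for split pairs yourself, here is a minimal Python validation sketch (find_split_pairs is my own name, purely for illustration); it scans raw 16-bit code units and reports any surrogate missing its partner:

    def find_split_pairs(units):
        # Yield the index of every unpaired surrogate in a sequence of
        # 16-bit UTF-16 code units.
        i = 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:          # leading surrogate
                if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                    i += 1                     # well-formed pair; skip the trailer
                else:
                    yield i                    # leading surrogate left hanging
            elif 0xDC00 <= u <= 0xDFFF:        # trailing surrogate with no leader
                yield i
            i += 1

    print(list(find_split_pairs([0x0041, 0xD83D, 0xDE00])))  # [] -- valid
    print(list(find_split_pairs([0x0041, 0xD83D, 0x0042])))  # [1] -- split pair

Happy Unicoding!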