The Enigma of Encoding Versions

The Enigma machine was used to encrypt wireless messages by the German military before and during the Second World War. It was significant to the Allied victory that they were able not only to decipher the German Enigma traffic, but also to largely keep that success a secret. Soldiers and ships were sometimes sacrificed to protect the secret, because the Allies did not want to act on their knowledge unless there was an alternate source that the Germans could ascribe the leak to.

Why didn't they want the Germans to know that they knew the system? Because the Germans would have changed it! And breaking each new code was always a huge challenge. The Enigma actually changed many times after the first cipher machine, Enigma A, came on the market in 1923. The early cryptanalysis work (to "break the code" of the Enigma) was done in Poland, and during the war it was centered at Bletchley Park, a very secret English organization that employed 7,000 people at its peak.

The problem with changing these things

In 1942 the Germans changed the machine, and for 10 months the Allied forces suffered extra losses to the German U-boats while they worked to break the new code. One of the most fascinating stories is how the codebreakers capitalized on a mistake in which a German operator retransmitted a nearly identical message using both the new and the old version of the Enigma, and how the slight differences in the message text itself also helped solve the puzzle (this detail I saw on TV and can't find a reference online).

As with encryption, a lot of confusion can result from a change in an encoding system. Versioning is one of the nebulous and poorly documented areas of text encoding schemes. Encodings and character sets change over time and differ by implementation. There is no nice trick or resource for solving these issues, but being aware of some example pitfalls can help you be better prepared. This article is sort of a catch-all for some of the character set and encoding version issues I've run across.

The Oracle UTF-8 debacle

Oracle has been at the forefront of delivering character set technology for a long time, and they have had to adjust to changes in the standards. For example, UTF-8 has been represented as "AL24UTFFSS," "UTF8," and "AL32UTF8." The first version, "AL24UTFFSS," was made obsolete by a change in Unicode 2.1 that moved some characters around, and "UTF8" was affected by Unicode 3.1, which no longer allowed Oracle's idiosyncratic handling of surrogate pairs.

It took me a while to figure out exactly what the difference was between "UTF8" and "AL32UTF8" because the precise terminology is hard to find. Oracle's "UTF8" scheme allowed a UTF-16 surrogate pair to be stored as two separate three-byte UTF-8 sequences, which the standard forbids. Knowing the conversion algorithm, this is understandable: treat UTF-16 as if it were UCS-2, without sensitivity to surrogates, and blindly convert each code unit to UTF-8. "AL32UTF8" was introduced to be compliant with the official UTF-8 definition (see D36 in Unicode 3.1), while "UTF8" was kept to designate the older encoding scheme. Luckily, the supplementary characters that require surrogate pairs are still exceedingly rare, so this issue has had little impact.
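The difference is easy to see in code. Here is a minimal sketch (the `cesu8_encode` helper is my own illustration, not Oracle code) that encodes a supplementary character the correct UTF-8 way and then the surrogate-per-sequence way that Oracle's old "UTF8" stored it, a scheme Unicode later named CESU-8:

```python
# Real UTF-8 encodes a supplementary character as one 4-byte sequence.
# The CESU-8-style scheme instead encodes each UTF-16 surrogate code
# unit separately, producing two 3-byte sequences for the same character.

def cesu8_encode(text: str) -> bytes:
    """Illustrative helper: run each UTF-16 code unit (including
    surrogates) through the UTF-8 algorithm individually."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")       # BMP character: same as UTF-8
        else:
            # Split into a UTF-16 surrogate pair, then encode each half
            # as if it were an ordinary 3-byte BMP character.
            cp -= 0x10000
            hi = 0xD800 + (cp >> 10)
            lo = 0xDC00 + (cp & 0x3FF)
            for s in (hi, lo):
                out += bytes([0xE0 | (s >> 12),
                              0x80 | ((s >> 6) & 0x3F),
                              0x80 | (s & 0x3F)])
    return bytes(out)

char = "\U00010400"                # DESERET CAPITAL LETTER LONG I
print(char.encode("utf-8").hex()) # f0909080     -- 4 bytes, real UTF-8
print(cesu8_encode(char).hex())   # eda081edb080 -- 6 bytes, CESU-8 style
```

The six-byte form is what a naive UCS-2-to-UTF-8 converter produces, and a strict UTF-8 decoder will reject those `ED A0`/`ED B0` lead bytes as invalid.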

Oracle character sets

Less well known is that Oracle also updates its character sets to keep up with changes in non-Unicode standards. The Japanese Shift-JIS scheme was augmented with many new characters around 2000, and I believe I once read that these were incorporated into Oracle's "JA16SJIS" as well. This is done without much notice and without changing the encoding name, but that is acceptable: adding characters under the same encoding name has upgrade benefits without leading to serious compatibility problems.

This brings up another important issue. Oracle's implementation of Shift-JIS, "JA16SJIS," is different from Microsoft's code page 932 and even Java's "SJIS," which can lead to round-trip conversion problems with specific problem characters such as U+00A2 CENT SIGN and U+301C WAVE DASH. In general, you cannot depend on vendors' implementations being perfectly compatible. This warning even includes Unicode, because of examples like Oracle's UTF8 surrogate issue mentioned above, as well as the general vagueness between UCS-2 and UTF-16 on Microsoft Windows.

The Euro Sign

It is interesting to see how the introduction of an important new symbol like the Euro sign caused a ripple across numerous character sets and was handled in many different ways.

From answers.com on ISO 8859-1: "ISO/IEC 8859-15 has been developed as an update of ISO/IEC 8859-1 to add the euro sign and other required additional characters. (This required however the removal of some less used characters from ISO/IEC 8859-1, including fraction symbols and letter-free diacritics...)"
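The swap the quote describes is easy to demonstrate with Python's built-in codecs: the byte 0xA4 is the old CURRENCY SIGN in ISO 8859-1 but was reassigned to the euro in ISO 8859-15.

```python
# One byte, two meanings: ISO 8859-15 replaced the generic currency
# sign at 0xA4 with the euro sign.
print(hex(ord(b"\xa4".decode("iso8859-1"))))   # 0xa4   CURRENCY SIGN
print(hex(ord(b"\xa4".decode("iso8859-15"))))  # 0x20ac EURO SIGN
```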

Well, at least they changed the encoding name when they changed the encoding! Which is more than I can say for Microsoft. Somewhere near the end of the 1990s (with Windows 98, I've heard), Microsoft introduced the euro sign into most of their single-byte code pages. It was added at 0x80 in CP1252 (Western Europe), CP1250 (Eastern Europe), CP1253 (Greek), CP1254 (Turkish), and CP1255 (Hebrew), and at 0x88 in CP1251 (Cyrillic). I talked about this more in The Euro Sign Predicament.
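Python's codec library reflects the post-euro versions of these code pages, so a quick sketch confirms where the euro landed, and why the raw byte alone doesn't tell you whether you have a euro sign:

```python
# The euro sign U+20AC sits at 0x80 in most windows-125x code pages,
# but at 0x88 in the Cyrillic one.
for codec, byte in [("cp1252", b"\x80"),   # Western Europe
                    ("cp1250", b"\x80"),   # Eastern Europe
                    ("cp1253", b"\x80"),   # Greek
                    ("cp1251", b"\x88")]:  # Cyrillic
    assert byte.decode(codec) == "\u20ac"
    print(codec, byte.hex(), "-> EURO SIGN")
```

Note that modern codec tables only describe the updated code pages; text produced by a pre-euro version of Windows would have had those byte values undefined or mapped differently.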

The PalmOS character set, based on Windows-1252, added the euro sign at 0x80 in PalmOS version 3.1 (late 1990s) and moved the numeric space from there to 0x19. So the euro caused a lot of squirming in a lot of encodings.

Don't get me started on Cyrillic

Actually, I don't know much about Cyrillic encodings, but what I hear is pretty troubling, and it is no wonder mojibake is commonly encountered in languages written in Cyrillic. The Cyrillic Charset Soup will blow your mind with its list of KOI variants.

As Wesha wrote on Raymond's blog: "I just wonder who was that... ummm... insightful person in Windows development team who decided to invent YET ANOTHER codepage for Cyrillic when developing windows? We ALREADY had three -- national standard KOI8-R, MS-DOS standard CP-866 and Apple's MacCyrillic."

Encoding version issues also crop up when decoding EBCDIC text. It is no wonder that most people just roll their eyes when it comes to character sets and try to avoid dealing with them. Although Unicode alleviates most of the problems, we've seen in the Oracle case that even Unicode is susceptible to encoding version problems.