Oh No! Mojibake!

Say "moajee bockay!" (MOH-JEE-BAH-KAY) when you've got a string of characters displaying incorrectly, all scrambled and corrupted. It is a great exclamation word, like Eureka! and Geronimo! (though it is unlikely to ever gain broad usage outside the programmer community, of course!). The Wikipedia entry for Mojibake 文字化け gives a good definition:

Mojibake is a Japanese loanword which refers to the incorrect, unreadable characters shown when a piece of computer software fails to render a text correctly according to its character encoding.

Though the word comes from Japanese, the Wikipedia article suggests it is established enough to describe similar problems well beyond East Asia, including those involving Russian Cyrillic encodings. Mojibake even applies to mix-ups like generating a file in the DOS OEM character set and then viewing it in Windows ANSI (see Michael Kaplan's More on OEMCP vs. ACP).

Pay attention to how corrupted text looks

When they observe mojibake in their software, people often report corrupted text without keeping a record of it. But it is useful to look at the actual corrupted text as a clue to what went wrong. Programmers, testers, and users should make a copy or screen capture of the bad text. With enough examples, you will start getting familiar with the different types of mojibake, where one particular encoding is viewed through the lens of another.

Another reason to pay attention to how the corrupted text looks is that the problem may be a character set support issue rather than corruption or mojibake per se. The two common character support issues involve fonts and conversion.

Not mojibake #1:  font character set support

When most of the text is replaced by little squares, it might simply be due to a font that doesn't support the characters you are trying to display. In this case it is not corruption, and the text will likely survive being copied into an editor and viewed with a font that does support the characters. On Macs, a little apple symbol is used instead of a little square.

However, when you see only occasional little squares in your text where characters should be, it still means those characters are not supported by the font, but it is likely mojibake: bytes misinterpreted as rare or invalid characters that happen not to be supported by the font.

Not mojibake #2: ??? conversion character set support

When most of the text has question marks where other characters should be, it is probably due to conversion from one encoding into another in which the characters in question (pun not intended!) do not exist. This is correct behavior and therefore strictly speaking is not mojibake. The question mark is the normal replacement character for unsupported characters, but another character could be used as the replacement character too.

However, question marks may be a sign of mojibake if the wrong encoding is assumed during conversion. Some bytes may be interpreted as wrong but valid characters that survive the conversion to a different encoding, while others are interpreted as wrong but valid characters that do not exist in the destination encoding and so get replaced by the replacement character. Seemingly random interspersed question marks may therefore be part of the mojibake.
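A quick sketch of the lossy-but-correct case (Python is my choice here, not something from the original discussion): encoding Japanese text into an encoding that cannot represent it gives you question marks, which is correct behavior, not mojibake.

```python
# Characters that don't exist in the target encoding are replaced
# by '?', the usual replacement character. Lossy, but not corruption.
s = "ステータス"  # "status" in katakana
converted = s.encode("windows-1252", errors="replace")
print(converted)  # b'?????' -- every katakana character replaced by '?'
```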

Mojibake! one encoding viewed through the lens of another

Sometimes it helps to remember that text is just a series of bytes that must be interpreted in a particular encoding to be made sense of. Take the example "status" ステータス in three different encodings:

83 58 83 65 81 5b 83 5e 83 58  (Shift-JIS)
b9 30 c6 30 fc 30 bf 30 b9 30  (UTF-16LE)
e3 82 b9 e3 83 86 e3 83 bc e3 82 bf e3 82 b9  (UTF-8)

Looking at these three different byte sets you can imagine why computers can easily try to interpret one as another.
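The three byte sequences above can be reproduced with a short sketch (Python here is an assumption on my part; any language with codec support would do):

```python
# Encode the same string in the three encodings discussed above
# and print the raw bytes in hex.
s = "ステータス"  # "status"
for enc in ("shift_jis", "utf-16-le", "utf-8"):
    print(f"{s.encode(enc).hex(' ')}  ({enc})")
```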

When the Shift-JIS bytes are assumed to be normal U.S. Windows-1252, they are displayed like this: ƒXƒe[ƒ^ƒX. Why? Well, the first byte is 83, which is ƒ in Windows-1252, and so on. The decoder splits up the Shift-JIS double-byte characters without having any reason to know that those pairs of bytes were meant to be interpreted together. (The byte 81 is unassigned in Windows-1252, so it may simply be dropped or shown as a replacement character.)
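Sketching this misinterpretation (Python again, as an assumption; note that because 0x81 is unassigned in Windows-1252, a strict decoder rejects the bytes outright and a lenient one substitutes U+FFFD):

```python
# Shift-JIS bytes for "status", misread as Windows-1252.
data = "ステータス".encode("shift_jis")
print(data.decode("cp1252", errors="replace"))  # ƒXƒe�[ƒ^ƒX (0x81 -> U+FFFD)
```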

When the UTF-16 bytes are assumed to be Shift-JIS, they might be displayed as ケ0ニ0・ソ0ケ. Shift-JIS is a double-byte encoding scheme, which means it uses a "lead byte" mechanism: in its case, bytes 81 to 9f and e0 to fc initiate a two-byte sequence. The first byte, b9, is therefore treated as a single-byte character and rendered as HALFWIDTH KATAKANA LETTER KE. The only two-byte sequence the decoder finds is fc 30, which is not a valid character because trail bytes are always 40 or greater in Shift-JIS. Because of the invalid sequence, it is undefined how the text will be displayed from that point forward in the string; it might even be truncated. In my locale, mbstowcs happens to generate the replacement character U+30FB KATAKANA MIDDLE DOT (Shift-JIS 8145).
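This case can be sketched too (Python, as an assumption; since the handling of the invalid fc 30 sequence varies by decoder, strict decoding is shown raising an error and lenient decoding substitutes its own replacement):

```python
# UTF-16LE bytes for "status", misread as Shift-JIS.
data = "ステータス".encode("utf-16-le")
try:
    data.decode("shift_jis")          # strict: fc 30 is not a valid sequence
except UnicodeDecodeError as e:
    print("invalid sequence at byte offset", e.start)
# A lenient decoder substitutes something for the bad sequence instead:
print(data.decode("shift_jis", errors="replace"))
```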

When the UTF-8 bytes are assumed to be Shift-JIS, they might be displayed as 繧ケ繝・・繧ソ繧ケ. The decoder takes the first two bytes, e3 82, and interprets them as a Shift-JIS character because e3 is a lead byte. The third byte, b9, is a single-byte character in Shift-JIS (which, by coincidence, we encountered above in the UTF-16 bytes), and so on.
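The same kind of sketch for this case (Python, assumed; lenient decoding is shown because a couple of the accidental byte pairs land on unassigned Shift-JIS code points, so the exact rendering varies by decoder):

```python
# UTF-8 bytes for "status", misread as Shift-JIS.
data = "ステータス".encode("utf-8")
# e3 82 pairs up as 繧, then b9 alone is the single-byte ｹ, and so on.
print(data.decode("shift_jis", errors="replace"))
```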

You can see that different mojibake scenarios have different visual signatures. Some of them can be recognized pretty quickly even when you don't know how to read the language.

Mojibake! I didn't see it!

If you are working on a Japanese localization project and you do not know Japanese, you may not even realize that the text you are looking at is mojibake! A good example of this is the above case where UTF-8 is processed as Shift-JIS. There are more-difficult-to-detect cases, though. 一意性 is a sensible word meaning something like "uniqueness", while 荳諢乗 is meaningless gibberish. Just as in English you know that "rtusqs" is a garbled word, even if you might be able to sound it out or guess what is meant.

The Cyrillic alphabet is plagued by having too many encoding systems, and here mojibake can easily escape the non-Cyrillic reader's notice, because confusing two single-byte Cyrillic encodings simply gives a different mix of the same Cyrillic characters. For example, великих ("great") is:

d7 c5 cc c9 cb c9 c8  (KOI8-R)
e2 e5 eb e8 ea e8 f5  (Windows-1251)

If the KOI8-R bytes are interpreted as Windows-1251 you see ЧЕМЙЛЙИ. If the Windows-1251 bytes are interpreted as KOI8-R you see БЕКХЙХУ. Is it just me, or do these mojibake samples look about as legit as the original?
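The Cyrillic mix-up is easy to sketch as well (Python, as an assumption; both encodings are standard codecs):

```python
# The same word, encoded one way and decoded another.
s = "великих"
print(s.encode("koi8_r").decode("windows-1251"))  # ЧЕМЙЛЙИ
print(s.encode("windows-1251").decode("koi8_r"))  # БЕКХЙХУ
```

Note that both misreadings decode without error: every byte is a valid character in both encodings, which is exactly why this flavor of mojibake is so easy to miss.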

There are even more subtle cases of mojibake where only one character in a long sequence is garbled. This can happen when text passes through an intermediate conversion in which every character except one is supported by the intermediate encoding, resulting in a single changed character in the output.
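A sketch of the single-character case (Python; KOI8-R as the intermediate encoding and the trademark sign as the one unsupported character are both hypothetical choices of mine):

```python
# Round-trip through KOI8-R as a hypothetical intermediate encoding.
# Every character survives except '™', which KOI8-R cannot represent.
s = "статус™"
lossy = s.encode("koi8_r", errors="replace").decode("koi8_r")
print(lossy)  # статус? -- only the one unsupported character changed
```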

Thanks mojibake!