MySQL Encoding and Mojibake

As a follow-up to my last post about mojibake character encoding corruption, I want to single out a particular variety of it: "intermediate encoding corruption."

In a post on the JoelOnSoftware Discussion forum, someone asked why roughly half of the characters in the UTF-8 strings he inserted into a MySQL database were getting corrupted (and half weren't). This is classic intermediate encoding corruption, where some characters are corrupted while others survive intact.

Normal mojibake is total gibberish: one encoding is seen through the lens of another, so every single character is changed. The exception is when the two encodings being confused coincide over some range, such as ASCII, in which case the corruption shows up in and around the non-ASCII characters. We see this all the time in an almost-ASCII document where the one copyright symbol or Euro sign gets corrupted.
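Here is a minimal Python sketch of that effect (the sample string is my own; any text with a few non-ASCII characters will do):

    # Correct UTF-8 bytes, viewed through the wrong lens (Latin-1):
    text = "naïve café © 2006"
    utf8_bytes = text.encode("utf-8")
    garbled = utf8_bytes.decode("latin-1")
    print(garbled)  # naÃ¯ve cafÃ© Â© 2006
    # Every ASCII character survives; the corruption clusters
    # in and around each non-ASCII character.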

However, "intermediate encoding corruption" is a kind of mojibake typified by occasional corruption of characters that suggests a translation of the text has occured in the meantime between two encodings or from one to another and back.

The JoelOnSoftware poster gave the example of inserting the bytes e3 81 af (the UTF-8 encoding of the hiragana character は) into the database and getting e3 3f af back in a query result. The answer turned out to be that "besides setting the database character set to UTF-8, you must remember to send SET NAMES 'utf8' to MySQL prior to querying or inserting."
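In code, the fix looks roughly like this; a sketch using the mysql-connector-python driver, with placeholder credentials and a hypothetical posts table:

    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="app", password="secret",
        database="example",  # placeholder connection details
    )
    cur = conn.cursor()
    # Tell the server this connection speaks UTF-8; without this
    # it assumes latin1 and converts incoming bytes accordingly.
    cur.execute("SET NAMES 'utf8'")
    cur.execute("INSERT INTO posts (body) VALUES (%s)", ("は",))
    conn.commit()

(Modern drivers usually let you pass a charset option when connecting, which amounts to the same thing.)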

But what I found interesting is that the original poster had already done the legwork of looking at the bytes of the text, and it was clear that the middle byte 81 of the UTF-8 character had been replaced by a question mark, 3f, which is a common "replacement character" for characters that do not survive an encoding conversion. Obviously something in the database was treating the text as if it were in a different encoding than UTF-8. A good candidate for that intermediate encoding (assuming the poster was from the West) is Latin-1, or rather Windows-1252, which has no character assigned to the byte 81! (MySQL's latin1 character set is in fact Windows-1252.)
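Python's codecs make the distinction visible, if you want to check it yourself:

    # Strict ISO-8859-1 assigns every byte a code point; 0x81
    # becomes the invisible C1 control character U+0081:
    print(repr(b"\x81".decode("latin-1")))  # '\x81'

    # Windows-1252 leaves 0x81 unassigned, so a strict decode fails:
    b"\x81".decode("cp1252")  # raises UnicodeDecodeError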

Here is the almost certain explanation of what is going on behind the scenes. If you do not SET NAMES 'utf8', the server assumes the text coming from the client is Latin-1, even though the program is passing UTF-8 encoded text. The table is UTF-8, so the server treats the incoming bytes as Latin-1 and converts them to UTF-8 for storage. Most of the bytes are successfully translated to UTF-8, but what gets stored is the mojibake gibberish ã?¯ instead of the intended character.

e3 81 af is treated as three Latin-1 characters. The first byte, e3, is ã, encoded in UTF-8 as c3 a3. The second byte, 81, is "lost in translation" and replaced with a question mark, 3f. The third byte, af, is ¯, encoded in UTF-8 as c2 af. So the result stored internally in the UTF-8 database table is c3 a3 3f c2 af.
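The whole inbound leg can be simulated in a few lines of Python (a sketch: Python's decoder substitutes U+FFFD rather than a question mark, so I swap in MySQL's ? by hand):

    client_bytes = "は".encode("utf-8")  # e3 81 af
    # The server misreads the client's bytes as Windows-1252;
    # 0x81 has no mapping there and becomes a replacement character.
    misread = client_bytes.decode("cp1252", errors="replace")
    misread = misread.replace("\ufffd", "?")  # MySQL stores '?'
    stored = misread.encode("utf-8")  # what lands in the table
    print(stored.hex())  # c3a33fc2af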

When the text is queried from the database, the server assumes it should be converted back to Latin-1 for the client. This is the beauty of intermediate encoding corruption: it almost makes it! The first UTF-8 character, c3 a3, is translated back to e3 in Latin-1. The second, 3f, is the same in Latin-1 (it is ASCII). The third, c2 af, becomes af in Latin-1. The result handed back to the client is e3 3f af.
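Continuing the sketch above, the outbound leg reproduces exactly the bytes the poster saw:

    stored = bytes.fromhex("c3a33fc2af")
    # The server decodes its UTF-8 storage, then re-encodes for a
    # client it believes wants Latin-1 (Windows-1252).
    returned = stored.decode("utf-8").encode("cp1252")
    print(returned.hex())  # e33faf, i.e. e3 3f af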

However, the client program assumes the byte stream is UTF-8, and it almost is! It is just that in the meantime it was treated as Latin-1, and some of the bytes were corrupted during the intermediate conversion. Many characters survive because their individual UTF-8 bytes happen to be valid characters in the wrongly assumed encoding. The more test characters you have, the more precisely you can identify what that intermediate encoding was.
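That suggests a brute-force diagnostic, sketched below under the assumption that you have both the original bytes and the corrupted bytes for at least one sample: replay the suspected round trip through each candidate encoding and see which one reproduces the damage.

    CANDIDATES = ["cp1252", "latin-1", "cp850", "iso8859-15"]

    def guess_intermediate(original: bytes, corrupted: bytes) -> list[str]:
        """Return the candidate encodings whose UTF-8 round trip
        reproduces the corruption actually observed."""
        hits = []
        for enc in CANDIDATES:
            mangled = (original.decode(enc, errors="replace")
                               .replace("\ufffd", "?")  # mimic MySQL's '?'
                               .encode("utf-8")          # stored in the table
                               .decode("utf-8")          # read back out
                               .encode(enc, errors="replace"))
            if mangled == corrupted:
                hits.append(enc)
        return hits

    print(guess_intermediate(bytes.fromhex("e381af"),
                             bytes.fromhex("e33faf")))  # ['cp1252']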

The MySQL bug database has a posting, Bug #11212, with a rant about a perceived bug in MySQL's encoding conversion. The author points to a problem where Latin-1 81 is converted to 3f, not seeming to realize that 3f is a question mark, the common replacement character for an invalid code in the text. For some reason he thinks it should convert to "0xC280 like Perl says". He actually gives the answer to his own dilemma at the very end, where he says:

And please, for the love of god, don't come back with "well, latin1 chars with values between 0x80 and 0xA0 aren't REALLY chars, so that's why the server doesn't encode them properly" 'cause that's also the case for latin1 chars with values between 0x00 and 0x1f, and those don't get to take a trip to the twilight zone when entered into the db.

Sorry, but that is precisely the reason those bytes are lost! And he is wrong about those lower values not being real characters: they are valid in Latin-1, and in Unicode and ASCII for that matter.
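A quick sketch shows the asymmetry he is missing: the low control bytes are assigned characters and round-trip through UTF-8 untouched, while five bytes in the 0x80-0x9f range simply have no assignment in Windows-1252.

    # Bytes 0x00-0x1f are the C0 control characters; a
    # Latin-1 -> UTF-8 -> Latin-1 trip leaves them unchanged:
    low = bytes(range(0x20))
    assert low.decode("latin-1").encode("utf-8") == low

    # In Windows-1252, five bytes in 0x80-0x9f are unassigned:
    for b in range(0x80, 0xA0):
        try:
            bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            print(f"{b:#04x} is undefined in cp1252")
    # prints 0x81, 0x8d, 0x8f, 0x90, 0x9d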

Anyway, this kind of "intermediate encoding corruption" mojibake makes for a fascinating puzzle that can be fun to investigate, if you're into that sort of thing.