Ben Bryant's blog

Emptiness in COleDateTime Format VAR_TIMEVALUEONLY

Funny how you can have trouble using a function, and search the web looking for an answer to no avail, and finally figure out how to do it only to discover that if you had read between the lines in the documentation you would have gotten it in the first place. Well, it is half true here. There turned out to be two big underlying issues, one of which was hinted at in the documentation and the other goes back to a problem I've talked about before. What's amazing about this one is that I thought I wouldn't have much to write about but I kept peeling back layers to find worse stuff.

The MFC COleDateTime class has a Format method which returns a string with the formatted date/time value. As explained in the MSDN documentation this method is overloaded so you can pass a strftime style format string or you can use:

Glossary of Text Encoding

I had an opportunity to talk to Addison Phillips, a Unicode expert (see Inter-Locale), and he tipped me off to a subtle distinction in the term "character set." I usually think of a character set as being synonymous with a particular encoding. For example, when you ask which character set is being used, the answer might be Latin-1, Windows-1252, or Shift-JIS. The HTML meta Content-Type tag uses "charset" to denote the encoding of the page. But to be more precise, these are "encodings"; a "character set" is a more abstract notion of the set of characters supported (not how they are encoded but simply which characters are included).

This distinction makes great sense although it only really came about with Unicode. Unicode is a single character set that can be represented by numerous different encoding systems such as UTF-8 and UTF-16. Addison also said there are multiple encodings for certain Cyrillic character sets.
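To make the distinction concrete, here is a small sketch in Python (my choice of language for illustration, not something from the original discussion). It takes one character from the Unicode character set and encodes it several ways. The codec names are Python's own; the three Cyrillic code pages are ones I picked as examples.

```python
# One character, several encodings: the character set says *which* character,
# the encoding says *which bytes* represent it.
char = "Я"  # U+042F CYRILLIC CAPITAL LETTER YA

# Codec names as spelled by Python's standard library.
encodings = ["utf-8", "utf-16-le", "cp1251", "koi8_r", "iso8859_5"]
encoded = {name: char.encode(name) for name in encodings}

for name, data in encoded.items():
    # Every encoding yields different bytes, yet each decodes to the same character.
    print(f"{name:10} {data.hex()}")
    assert data.decode(name) == char
```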

When I started writing for CodeSnipers I didn't realize I would be writing entirely on text encoding issues; that's just the way it has turned out so far. At the time I already had some scrap notes on this topic written and I have just gone on from there. Terminology subtleties seem to keep cropping up in this subject, especially because so many programmers can get by with only a vague understanding of the issues, and their use of these terms shows it.

Splitting Surrogate Pairs

Microsoft chose UCS-2 for its Unicode encoding system when it seemed like a nice and simple fixed size per character; then Unicode promptly outgrew UCS-2. As I said in UTF-8 Versus Windows UNICODE, the early impression of simplicity, in comparison to UTF-8 multibyte encoding, backfired.

What are surrogate pairs?

Surrogate pairs are UTF-16's answer to multibyte encoding. Basically, in the UTF-16 encoding system a Unicode character can be encoded in either one or two 16-bit values; if it is two 16-bit values it is utilizing a "surrogate pair". Surrogate pairs are simple yet they inevitably lead to a great deal of confusion.
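The arithmetic behind "one or two 16-bit values" is short enough to sketch. The function name below is mine, purely for illustration; the constants are the standard UTF-16 surrogate ranges.

```python
def to_utf16_units(code_point: int) -> list:
    """Return the UTF-16 code units (one or two 16-bit values) for a code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Fits in a single 16-bit value, exactly as it would in UCS-2.
        return [code_point]
    # Otherwise spread the remaining 20 bits over a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


print([hex(u) for u in to_utf16_units(0x41)])     # one unit
print([hex(u) for u in to_utf16_units(0x1F600)])  # a surrogate pair
```

Cutting a string between the high and the low unit leaves two halves that are each invalid on their own, which is where the confusion usually starts.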

The Enigma of Encoding Versions

The Enigma Machine was used to encrypt wireless messages by the German regime before and during the Second World War. It was very significant in the Allied victory that they were able not only to decipher the German Enigma's encryption, but to mostly keep it a secret that they had deciphered it. Many soldiers and ships were sacrificed to keep it a secret because the Allies did not want to act on their knowledge unless there was an alternate source that the Germans could ascribe the leak to.

Why didn't they want the Germans to know that they knew the system? Because the Germans would have changed it! And it was always a huge challenge to break the new code. The Enigma actually changed many times from when the first cipher machine, Enigma A, came on the market in 1923. The early work at cryptanalysis (to "break the code" of the Enigma) was done in Poland, and then during the war was centered in a very secret English organization that employed 7000 people at its peak.

CDATA Section Delimitosis

Delimitosis = disease pertaining to delimiter

I don't know if it is just because I am a parser-minded person, but the first time I learned about CDATA Sections a warning buzzer went off in my head and has been ringing ever since. It is saying: What if ]]> happens to be in the data you put into a CDATA Section?

Well, obviously it is not allowed. Hmmm. But that is not very helpful, is it? Does that mean I am supposed to check to see if my text contains ]]> every time I want to use a CDATA Section? And what should I do if it does?

I want to settle some of the unsettling issues about CDATA Sections here.
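One widely used workaround, sketched here in Python, is to end the section in the middle of every ]]> and start a fresh one, so the delimiter never appears whole inside a section. The helper name to_cdata is hypothetical, and this is only one way to handle it, not necessarily the approach the full article settles on.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any ']]>' across two adjacent sections."""
    # "]]" closes out one section and ">" opens the next one.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


payload = "if (a[b[0]]>c) done();"
xml = "<code>" + to_cdata(payload) + "</code>"
print(xml)

# A conforming parser joins the adjacent sections back into the original text.
assert ET.fromstring(xml).text == payload
```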

How I Invented Base64

Base64 is a way of storing any data as plain ASCII text. It looks like this:

LZPVtzlndhYFJQIDAQABMA0GCSqGSIb3DQEBAgUAA1kACKr0PqphJYw1j+YPtcIq
iWlFPuN5jJ79Khfg7ASFxskYkEMjRNZV/HZDZQEhtVaU7Jxfzs2wfX5byMp2X3U/
5XUXGx7qusDgHQGs7Jk9W8CW1fuSWUgN4w==

Look familiar? You'll see it in your e-mail source when your e-mail has attachments. How did I invent it? Well I didn't really, but before I knew base64 I came up with an encoding system I called "6-bit rollover" that turned out to be nearly identical to base64. It turns out that was not a momentous achievement because the beauty of base64 is how natural and simple it is. Here I am going to show how sensible base64 is by describing my discovery process, and giving you the quick round-up of everything you need to know to use base64.
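To show how little there is to it, here is a minimal sketch of the 6-bit idea in Python, checked against the standard library's own encoder. The function name encode_base64 is mine; real code should simply call base64.b64encode.

```python
import base64
import string

# The standard base64 alphabet: 64 printable ASCII characters.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_base64(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Pack up to 3 bytes into a 24-bit number, padding with zero bytes.
        n = int.from_bytes(chunk + b"\x00" * (3 - len(chunk)), "big")
        # Slice the 24 bits into four 6-bit indexes into the alphabet.
        sextets = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        # Only len(chunk) + 1 sextets carry data; the rest become "=" padding.
        keep = len(chunk) + 1
        out.append("".join(ALPHABET[s] for s in sextets[:keep]) + "=" * (4 - keep))
    return "".join(out)


for sample in (b"", b"M", b"Ma", b"Man", b"any data at all"):
    assert encode_base64(sample) == base64.b64encode(sample).decode("ascii")
```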

EBCDIC to ASCII (and SBCS) Conversion

The first task I had when I got a C programming job in 1991 straight out of college was a small two-week project writing a program to convert EBCDIC to ASCII. The software company I joined had about 10 employees and a consultant named Sam. The owner of our company wanted to do this cheaply as a favor to the customer, hoping for a bigger contract down the road, so I think mostly only my hours were charged on the contract even though it was really just Sam mentoring me. Sam took me over to meet the customers at their site and ask some more questions about the data we were converting. My memory is that they were very nice but they could not give us any more information or sample data!

It can be hard to figure out the encoding (and the variant of the encoding) but once you get the mapping right implementing the conversion efficiently is easy for single byte character sets. Here I take the EBCDIC to ASCII example through these stages and finish by trying to emphasize that it is a crying shame when charset conversion is not extremely fast.
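For a single byte character set the efficient implementation is a 256-entry lookup table. Here is a sketch in Python rather than the C of the original project. It assumes the EBCDIC variant is code page 037 (Python's cp037 codec), which is exactly the kind of guess you have to confirm against real data, and it maps into Latin-1 so that every input byte has a single output byte.

```python
# ASSUMPTION: the data is EBCDIC code page 037. Swap in another variant
# (for example "cp500") if the mapping turns out to be different.
EBCDIC_VARIANT = "cp037"

# Build the table once: decode all 256 byte values with the assumed variant,
# then re-encode as Latin-1, a single-byte superset of ASCII.
TABLE = bytes(range(256)).decode(EBCDIC_VARIANT).encode("latin-1", errors="replace")


def ebcdic_to_ascii(data: bytes) -> bytes:
    """Convert with one table lookup per byte."""
    return data.translate(TABLE)


print(ebcdic_to_ascii(bytes.fromhex("c8c5d3d3d6")))
```

After the table is built, the conversion itself is a single pass with no per-character decisions, which is why there is little excuse for it being slow.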

That Ol' OEM Code Page

If you have a regular U.S. or Western European (Windows-1252) system locale code page, try this:

  1. copy and paste these 4 characters ÂÄÒÙ into notepad
  2. save it as oem.txt
  3. open a DOS window and cd to the directory where you saved it
  4. enter: type oem.txt

Do you see ┬─╥┘ instead of ÂÄÒÙ? Why? That is because your system locale "ANSI" code page is different from your DOS "OEM" code page.
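You can reproduce the effect without a DOS window. This Python sketch assumes the ANSI code page is Windows-1252 and the OEM code page is 437 ("OEM United States"); other Western European systems may use a different OEM code page, such as 850, and show slightly different characters.

```python
text = "ÂÄÒÙ"

# Notepad saves the characters as bytes in the ANSI code page (assumed cp1252).
saved = text.encode("cp1252")

# The console interprets those same bytes in the OEM code page (assumed cp437).
shown = saved.decode("cp437")

print(saved.hex(" "))
print(shown)
```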

Fonts To Simulate Charsets

I want to know as little about fonts as I can get away with, but I recently saw a modified font being used to obtain DOS-style box drawing characters like ┬──┘. The text was in an old PC Code Page, probably IBM437 also called "OEM United States." So rather than convert it to Unicode, it was left in the IBM437 single byte encoding and viewed with a font that replaced certain characters with box drawing characters.

Let's take the character ┘ and follow it through this process. The byte value d9 (217) represents this character in IBM437. When the browser or the OS treats this byte as if it were in Windows-1252, it converts it to Unicode U+00d9 LATIN CAPITAL LETTER U WITH GRAVE and it would normally be displayed as Ù. But in our special hacked font, the character mapped to $00d9 actually looks like the box drawing character ┘.
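Here is the same byte traced in Python, with the two interpretations side by side. The second one is what a real conversion to Unicode would have produced, so no font trickery would be needed. The codec names cp1252 and cp437 are Python's names for Windows-1252 and IBM437.

```python
import unicodedata

raw = bytes([0xD9])  # the byte as stored in the IBM437 text

# What the browser or OS does when it assumes Windows-1252.
misread = raw.decode("cp1252")
# What converting from the actual encoding, IBM437, gives instead.
converted = raw.decode("cp437")

for label, ch in (("as Windows-1252", misread), ("as IBM437", converted)):
    print(f"{label:16} U+{ord(ch):04X} {unicodedata.name(ch)}")
```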

Phantom Currency Signs in Japan and Korea

If you're not from Japan or Korea, you might be surprised that when you reboot your Windows OS in the Japanese language for non-Unicode programs (system locale) your backslashes are no longer backslashes; they are yen signs. Well, don't worry, they are still backslashes; they are just displayed and printed differently by many of the fonts in the Japanese locale. But there is a more troubling internationalization issue: Unicode text coming out of Japanese and Korean sources may have a backslash where you would expect a yen sign ¥ or won sign ₩. This whole subject has been discussed elsewhere (references below), but here I talk about the need to repair the Unicode text.

Where the backslash issue gets interesting is in the encoding conversion between the locale code pages and Unicode. While 0x5c is clearly the yen sign in the Japanese code page 932 (Shift-JIS), it is converted to the Unicode U+005c REVERSE SOLIDUS (backslash) rather than the U+00a5 YEN SIGN. Similarly in the Korean code page 949, 0x5c is clearly the won sign but it is converted to the Unicode backslash rather than the U+20a9 WON SIGN.
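Python's codecs follow the same convention as the Windows conversion tables, so the behavior is easy to see in a quick sketch. The codec names cp932 and cp949 are Python's names for the Windows Japanese and Korean code pages.

```python
raw = b"\x5c"  # displayed as a yen or won sign by fonts in these locales

from_japanese = raw.decode("cp932")
from_korean = raw.decode("cp949")

# Both conversions produce the backslash, not the currency sign.
print(f"cp932: U+{ord(from_japanese):04X}")
print(f"cp949: U+{ord(from_korean):04X}")

YEN_SIGN = "\u00a5"
WON_SIGN = "\u20a9"
assert from_japanese != YEN_SIGN and from_korean != WON_SIGN
```

So the currency sign exists only in the font, and anything that reads the converted Unicode text sees a plain backslash.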