EBCDIC to ASCII (and SBCS) Conversion

The first task I had when I got a C programming job in 1991, straight out of college, was a small two-week project writing a program to convert EBCDIC to ASCII. The software company I joined had about 10 employees and a consultant named Sam. The owner of our company wanted to do this cheaply as a favor to the customer, hoping for a bigger contract down the road, so I think only my hours were charged on the contract even though it was really just Sam mentoring me. Sam took me over to meet the customers at their site and ask some more questions about the data we were converting. My memory is that they were very nice, but they could not give us any more information or sample data!

It can be hard to figure out the encoding (and the variant of the encoding), but once you get the mapping right, implementing the conversion efficiently is easy for single-byte character sets. Here I take the EBCDIC to ASCII example through these stages and finish by emphasizing that it is a crying shame when charset conversion is not extremely fast.

So what encoding is it?

The main problem encountered with encodings is not knowing exactly what encoding the data is in. Data is just ones and zeros, although if you have examples of the data to analyze, you can sometimes recognize or deduce what the encoding is.
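
For example (a rough sketch of my own, not anything from that project; the file name is made up), you can dump a byte-value histogram of a sample file. If most of the text bytes cluster in 0x81-0xa9, 0xc1-0xe9 and 0xf0-0xf9 (EBCDIC letters and digits) rather than 0x20-0x7e (printable ASCII), that alone is a strong hint.

/* Sketch: print a byte-value histogram of a sample file to help guess the encoding */
#include <stdio.h>

int main( int argc, char *argv[] )
{
    unsigned long aCounts[256] = { 0 };
    int c;
    FILE *pFile = fopen( argc > 1 ? argv[1] : "sample.dat", "rb" );
    if ( !pFile )
        return 1;
    while ( ( c = fgetc( pFile ) ) != EOF )
        ++aCounts[(unsigned char)c];
    fclose( pFile );
    for ( int n = 0; n < 256; ++n )
        if ( aCounts[n] > 0 )
            printf( "%02x: %lu\n", (unsigned)n, aCounts[n] );
    return 0;
}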

Okay but what variant of that encoding is it?

The trouble is that before the days of ubiquitous text sharing on the Internet, and especially when encodings were burned into the hardware or read-only memory, companies like IBM often used variants of encoding systems whenever it suited their needs. Nowadays, companies are still creating problems by inventing new confusing aliases for existing code pages, and not plainly documenting versions of encodings.

As I mentioned in The Euro Sign Predicament, my pet peeve is that MySQL calls Windows-1252 "latin1". Anyone who knows a little of the convoluted history of the encodings similar to Latin-1, and of those which are not, such as "Latin I", would avoid adding new complexity to this arena. Microsoft changed their Windows-125x code pages at the end of the 90s, yet very little documentation indicates whether an encoding refers to the standard before or after the change. Similarly, all of the Oracle code pages are updated periodically without clear documentation.

With EBCDIC the problem is magnified because it is a proprietary (even "secret") IBM encoding. If you google it, you will find many sites providing an EBCDIC table without any indication of which flavor of EBCDIC they are showing. Back in 1991 I remember we had a book that listed several different EBCDIC code pages. Another complication is the fuzzy use of terms like EBCDIC and ASCII. Many "ASCII" tables you will find have 256 values, without mentioning that ASCII officially covers only the first 128 values and that the other values are part of some other character set. A similar confusion happens with the variants of EBCDIC, although EBCDIC is not limited to the first 128 values of a byte.

Documentation for an encoding should list its names, aliases, and origin, along with the computer models and operating systems it is associated with, to help identify exactly which encoding is being referred to. But this is seldom done.

Anyway, Sam did the detective work on this project using his experience and knowledge of the IBM mainframes likely to be involved to deduce the EBCDIC variant well enough.

The conversion mapping

The conversion itself is the easy part, and that is all I am really going to tackle here. Once you have addressed the investigative issues, you'll come up with a list of character codes for mapping the source encoding (EBCDIC) to the destination encoding (ASCII). For example:

SOURCE --> DEST   
00 --> 00   null
01 --> 01   SOH (control code)
02 --> 02   STX (control code)
03 --> 03   ETX (control code)
04 --> 3f   PF (control code, not in ASCII)
05 --> 09   tab (control code)
...    
6f --> 3f   ? (question mark)
...    
cf --> 3f   (error, not in EBCDIC)
d0 --> 7d   } (right curly bracket)
d1 --> 4a   J (capital letter J)
...    
fe --> 3f   (error, not in EBCDIC)
ff --> 3f   (error, not in EBCDIC)

The values on the left are in the range 0 to 255 (i.e. 00 to ff hex), all the possible values of a byte. Because of this you can create a simple array indexed from 0 to 255 containing the DEST values. Simple byte-to-byte conversions like EBCDIC to ASCII, and variations of this among single-byte encodings, are implemented using an array like this, which you could also call a table.

For the values that do not exist in your DEST or SOURCE encoding you can put a replacement character such as a question mark (3f) as the value in the array, but error handling depends on your circumstances. Encountering a code value that is meaningless in the SOURCE encoding, such as cf, fe, and ff shown above, may mean the data is bad, that you do not know the exact encoding variant, or that you are not aware of a specific encoding oddity that applies to the data. Encountering a code value that has no corresponding value in the destination encoding, such as 04 shown above, is a common problem when the destination encoding is not Unicode.

I included the actual question mark mapping, EBCDIC 6f to ASCII 3f, to illustrate that 3f is also a legitimate destination character code. The replacement character does not have to be a question mark; it could be any character in the destination encoding that you want to use as the placeholder in your destination data.

Implementing the conversion

The conversion program is very simple. Given a source buffer and knowing the buffer size, just declare your conversion value array and loop through all of the bytes in the buffer assigning the converted value. The following example changes the buffer in place; the alternative would be to have separate source and destination buffers.

/* 256-entry lookup table: index = SOURCE (EBCDIC) byte, value = DEST (ASCII) byte */
unsigned char aConvVals[256] = { 0x00,0x01,0x02,0x03,0x3f,0x09, ...,0x3f, ...,0x3f,0x7d,0x4a, ...,0x3f,0x3f };
/* convert the buffer in place, one table lookup per byte */
for ( int nOffset=0; nOffset<nBufferSize; ++nOffset )
    pBuffer[nOffset] = aConvVals[pBuffer[nOffset]];

The important part of an efficient conversion is minimizing the instructions inside the loop. Here we are assigning a value based on an offset into an array; this is ideal. If you add error handling, try to reduce the logic that gets repeated for every byte to a single if statement. One way would be to test for a special DEST value, such as 80, which none of the SOURCE characters convert to.

/* same table, but bytes with no mapping get the sentinel 0x80,
   a value none of the SOURCE characters convert to */
unsigned char aConvVals[256] = { 0x00,0x01,0x02,0x03,0x80,0x09, ...,0x3f, ...,0x80,0x7d,0x4a, ...,0x80,0x80 };
unsigned char cVal;
int nProblemChars = 0;
for ( int nOffset=0; nOffset<nBufferSize; ++nOffset )
{
    cVal = aConvVals[pBuffer[nOffset]];
    if ( cVal == 0x80 )
    {
        /* no mapping: count it and substitute the replacement character '?' */
        ++nProblemChars;
        pBuffer[nOffset] = 0x3f;
    }
    else
        pBuffer[nOffset] = cVal;
}
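
If you want to package that up, one option (purely a sketch of mine; the function name is made up) is a small routine that converts the buffer in place and returns how many problem characters it found, so the caller can decide whether to warn, abort, or carry on:

/* Hypothetical wrapper around the loop above: converts pBuffer in place and
   returns the number of bytes that had no mapping (replaced with '?', 0x3f) */
int ConvertEbcdicToAscii( unsigned char *pBuffer, int nBufferSize,
                          const unsigned char aConvVals[256] )
{
    int nProblemChars = 0;
    for ( int nOffset = 0; nOffset < nBufferSize; ++nOffset )
    {
        unsigned char cVal = aConvVals[pBuffer[nOffset]];
        if ( cVal == 0x80 )          /* sentinel: no mapping in the table */
        {
            ++nProblemChars;
            pBuffer[nOffset] = 0x3f; /* replacement character '?' */
        }
        else
            pBuffer[nOffset] = cVal;
    }
    return nProblemChars;
}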

Slow conversions are dumb

Conversion can be extremely fast, and it should never impact performance if implemented properly. Hearing stories like the following post on the Joel on Software Discussion board makes me really upset about what users have to deal with sometimes.

Never Fast Enough: "I have this Java program that reads every byte of a 3GB file, calls a static method that maps the byte from its original form (EBCDIC) to something useful (ASCII) in a large 'switch', then writes it to a new file. It is disappointingly slow, 3-5 hours on a HPUX 9000 server. I theorize it's because of the method call overhead (well, its being called 3 billion times for Pete's sake!), because the read/write operation itself only takes 5-10 minutes."

If the read/write operation takes 5-10 minutes, the whole thing should only take 5-10 minutes, not 3-5 hours! If you read the comments in that JoS post, most of them were not helpful, except for the one from "Secure", who gave a simple and complete answer. The switch statement (instead of an array) and anything else inefficient in the Java subroutine were the prime culprits. For a large file you would need to buffer it, say 64k at a time, but the size of the buffer doesn't have a large influence. The original poster mentioned that removing the switch statement and subroutine cut the time in half, but then vaguely referred to some other field translation that was slowing things down.
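
Here is a rough C sketch of what that buffered approach might look like; the 64k buffer size is arbitrary, the file names are made up, and the table is the same kind of 256-entry array shown earlier (entries elided). I am not claiming this is what the original poster should have written in Java, just illustrating that the conversion is nothing more than a table lookup inside the read/write loop.

#include <stdio.h>

/* 256-entry EBCDIC-to-ASCII table as shown earlier (remaining entries elided) */
static const unsigned char aConvVals[256] = { 0x00,0x01,0x02,0x03,0x3f,0x09 /* , ... */ };

int main( void )
{
    unsigned char aBuffer[65536];                 /* convert 64k at a time */
    size_t nRead;
    FILE *pIn  = fopen( "input.ebcdic", "rb" );   /* hypothetical file names */
    FILE *pOut = fopen( "output.ascii", "wb" );
    if ( !pIn || !pOut )
        return 1;
    while ( ( nRead = fread( aBuffer, 1, sizeof(aBuffer), pIn ) ) > 0 )
    {
        for ( size_t nOffset = 0; nOffset < nRead; ++nOffset )
            aBuffer[nOffset] = aConvVals[aBuffer[nOffset]];
        fwrite( aBuffer, 1, nRead, pOut );
    }
    fclose( pIn );
    fclose( pOut );
    return 0;
}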

In general, for any conversion or linear decoding/encoding operation, the file I/O time should be the dominant part of the time it takes. The character set conversion itself should be insignificant: something like a second per 100 megabytes (at least in C/C++). This is true even for more complex multibyte conversions, which I will cover in later posts.
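
If you want to sanity-check that kind of number on your own machine, a crude benchmark only takes a few lines of C. This sketch (my own; the buffer size and dummy data are arbitrary) times just the lookup loop over 100 megabytes already in memory, so it measures the conversion alone without any file I/O.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned char aConvVals[256];   /* fill in your real conversion table */

int main( void )
{
    size_t nSize = 100 * 1024 * 1024;  /* 100 MB */
    unsigned char *pBuffer = malloc( nSize );
    if ( !pBuffer )
        return 1;
    for ( size_t n = 0; n < nSize; ++n )   /* dummy data */
        pBuffer[n] = (unsigned char)n;
    clock_t tStart = clock();
    for ( size_t n = 0; n < nSize; ++n )   /* the conversion itself */
        pBuffer[n] = aConvVals[pBuffer[n]];
    double dSeconds = (double)( clock() - tStart ) / CLOCKS_PER_SEC;
    printf( "converted %zu bytes in %.2f seconds (last byte %02x)\n",
            nSize, dSeconds, pBuffer[nSize-1] );
    free( pBuffer );
    return 0;
}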