How I Invented Base64

Base64 is a way of storing any data as plain ASCII text. It looks like this:

LZPVtzlndhYFJQIDAQABMA0GCSqGSIb3DQEBAgUAA1kACKr0PqphJYw1j+YPtcIq
iWlFPuN5jJ79Khfg7ASFxskYkEMjRNZV/HZDZQEhtVaU7Jxfzs2wfX5byMp2X3U/
5XUXGx7qusDgHQGs7Jk9W8CW1fuSWUgN4w==

Look familiar? You'll see it in your e-mail source when your e-mail has attachments. How did I invent it? Well I didn't really, but before I knew base64 I came up with an encoding system I called "6-bit rollover" that turned out to be nearly identical to base64. It turns out that was not a momentous achievement because the beauty of base64 is how natural and simple it is. Here I am going to show how sensible base64 is by describing my discovery process, and giving you the quick round-up of everything you need to know to use base64.

Hex

Back in 1999, I needed to store a binary image in an XML document. There was no readily available function in Visual C++ or MSXML to do this. One common trick was to implement hex strings to hold binary values in strings with a few lines of code. Here is an example of 5 bytes represented in a Hex string:

48656C6C6F

Hex is pretty inefficient, taking two bytes of ASCII text for every byte of data, i.e. a 100% overhead in storage space. I figured there should be a way of using a larger variety of ASCII characters to result in using less storage space.

Stream of bits

The key is to picture the data as a stream of ones and zeros without byte boundaries, then you won't feel compelled to keep the chunks of bits from crossing those byte boundaries.

010101010101010101010101

Hex has one digit for every 4 bits in the data. The 4 bits represent a value from 0 to 15, so the characters 0123456789ABCDEF are used such that each character represents one of the values 0 to 15. If one character is used to represent every 4 bits, what if one character was used to represent every 5, 6 or 7 bits?

every 4 bits:

0101 0101 0101 0101 0101 0101...

every 5 bits:

01010 10101 01010 10101 0101...

every 6 bits:

010101 010101 010101 010101...

I felt it was important to stay within ASCII (values 0 to 127) because these only take 1 byte per character in UTF-8, and are generally immune to encoding corruption. The lowest ASCII codes are control characters including carriage returns and line feeds which I knew I didn't want to use as significant characters in my system. So 7 bits was out of the question since that would require all the values 0 to 127. However it was definitely possible to represent 6 bits as a single ASCII character, and this would yield a relatively efficient 33% overhead in storage space.

The base 64 alphabet

The next question was which characters to use. Hex uses 0-9 and A-F to represent the 16 different values in 4 bits. The problem now was to represent the 64 different values possible in 6 bits. Obviously you would want to use 0-9 and A-Z which is a total of 36 values. Then add the lower case letters for another 26, totaling 62 values.

Still 2 values short, and this is where a tough decision needs to be made because all of the other characters are punctuation signs which often have special meanings in various contexts. I happened to choose * (asterisk) and + (plus), but base64 uses + (plus) and / (slash), although there is a strain of base64 for URL and filename safety that uses - (minus) and _ (underscore). Base64 standardizes on an ordering as follows: uppercase, lowercase, numeric digits, and then the two additional characters.

Encode base64

To encode your data in base64, you preload a 64 digit character string and loop through your data 6 bits at a time, appending the appropriate digit to your result string by using the offset into your 64 digit character string. If you had a function called Get6Bits to retrieve a value for the next 6 bits in the buffer, your code would be this simple:

const char * pDigits = "A...Za...z0...9+/";
int nNextBits;
while ( Get6Bits(&nNextBits) )
  strResult += (char)pDigits[nNextBits];

If the number of bytes in the data is not divisible by 3, the 6 bits for the last base64 digit will be calculated as if there was an additional zero byte. Later when it is decoded, you will discard these extra 2 or 4 zero bits and (don't worry) you will always know the exact number of bytes that were in the original data.

One other thing about encoding base64 is that you are generally supposed to pad the result with equal signs to the edge of a 4 digit boundary (see the end of the top example of base64). There is no good reason for this that I can think of; just one of those artifacts of the old stodgy days of encoding when they invented little extra chores because they were so immersed in their own implementation scenario. In base64 there are 4 characters per 3 bytes of data and the way it turns out is there can only be 1 or 2 pad characters at the end since the first two characters are needed to complete the first of the 3 bytes. If you kept a count of base64 digits

nDigitCount

then your padding code might look like this:

int nRemainder = 4 - (nDigitCount % 4);
while ( nRemainder-- )
  strResult += '=';

Decode base64

To decode base64, preload an array to provide values or an error code for the corresponding digits at ASCII offsets, and tack 6 bits at a time onto your output buffer. The following pseudo-code relies on a function called Append6Bits which will append 6 bits at a time to your buffer unless the value is -1.

int aAsciiToBits[] = {-1...-1,62,-1,-1,-1,63,52...61,-1...-1,0,...,25,-1...-1,26...51,-1...-1};
for ( int n=0; strBase64[n]; ++n )
    Append6Bits( aAsciiToBits[strBase64[n]] );

The resulting output buffer length is rounded back to the the last byte for which all of the bits were supplied. In other words, if your data length is divisible by 3 bytes the bits will end up on a byte boundary. But if there is another byte, you will have 4 unused trailing bits, and if there are two other bytes, you will have 2 unused trailing bits.

References

The problem with base64 is that surprisingly it has no concise authoritative reference. RFC 3548 ("The Base16, Base32, and Base64 Data Encodings," 2003) is the best reference now and it summarizes the current status of the base64 standard. The two main references before that were buried in PEM (RFC 1421 "Privacy Enhancement for Internet Electronic Mail," 1993) and MIME (RFC 2045 "Multipurpose Internet Mail Extensions," 1996). Ignore the modified version of base64 for UTF-7 (RFC 2152, 1997), unless you are trafficking in UTF-7.

Line length

PEM and MIME differ on the number of characters per line: PEM says 64 and MIME says 76. But the line length doesn't matter since base64 stipulates to ignore non-significant characters like whitespace. It doesn't even need to be divided into lines at all; it can be just one continuous array of characters all on one line.

Other notes

Hex is case insensitive while base64 must be case sensitive. Base32 (5 bits per digit) which could theoretically be case insensitive, generally expects upper case, and has about 60% overhead. The overhead calculations I have given here do not include optional newline characters which add overhead too.

If I had started off by noting that the Hex encoding I have always used is also called "base16" encoding in the same way that "base64" uses a base 64 number system, I might have grasped what base64 was earlier than I did!

Good Article

I've used library functions to take advantage of base64 encoding on a few occasions but never bothered to look into the details. It's always nice to find out what's happening behind the abstractions...