Glossary of Text Encoding
Submitted by Ben Bryant on Tue, 2005-12-20 16:05.
I had an opportunity to talk to Addison Phillips, a Unicode expert (see Inter-Locale), and he tipped me off to a subtle distinction in the term "character set." I usually think of a character set as being synonymous with a particular encoding. For example, when you ask which character set is being used, the answer might be Latin-1, Windows-1252, or Shift-JIS. The HTML meta Content-Type tag uses "charset" to denote the encoding of the page. But to be more precise, these are "encodings"; a "character set" is a more abstract notion of the set of characters supported (not how they are encoded but simply which characters are included).
This distinction makes great sense although it only really came about with Unicode. Unicode is a single character set that can be represented by numerous different encoding systems such as UTF-8 and UTF-16. Addison also said there are multiple encodings for certain Cyrillic character sets.
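To make the distinction concrete, here is a minimal Python sketch. The language and the sample characters are my own choices for illustration, and picking KOI8-R and Windows-1251 as the Cyrillic pair is an assumption, not something Addison named. It shows one code point producing different bytes depending on the encoding chosen.

```python
# One character set, several encodings: the code point stays fixed,
# while the bytes depend on which encoding is used.
euro = "\u20ac"  # EURO SIGN, code point U+20AC

code_point = ord(euro)                    # 0x20AC, independent of encoding
utf8_bytes = euro.encode("utf-8")         # three bytes: e2 82 ac
utf16_bytes = euro.encode("utf-16-be")    # two bytes, big endian, no BOM: 20 ac

# CYRILLIC CAPITAL LETTER A (U+0410) in two single-byte Cyrillic encodings.
# (Assumed example pair: KOI8-R and Windows-1251.)
cyr_a = "\u0410"
koi8r_bytes = cyr_a.encode("koi8_r")      # e1
cp1251_bytes = cyr_a.encode("cp1251")     # c0

print(hex(code_point), utf8_bytes.hex(), utf16_bytes.hex())
print(koi8r_bytes.hex(), cp1251_bytes.hex())
```

Decoding the KOI8-R byte as Windows-1251, or the reverse, still yields a Cyrillic letter, just the wrong one. That is why the encoding has to be known separately from the bytes.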
When I started writing for CodeSnipers I didn't realize I would be writing entirely on text encoding issues; that's just the way it has turned out so far. At the time I already had some scrap notes written on this topic, and I have just gone on from there. Terminology subtleties keep cropping up in this subject, especially because so many programmers can get by with only a vague understanding of the issues, and their use of these terms shows it.
I appreciate precision in terminology, so here is a glossary of terms with succinct definitions that I came up with. One purpose is to help distinguish between interrelated and sometimes confused words, so these definitions are meant to be read in the context and order in which they are given here. I have also provided links back to my previous posts on these subjects. You can find out more about all of these terms on Wikipedia.
- encoding: a means of representing text in memory or on disk
- code page: a numeric identifier for an encoding; often used synonymously with "encoding"
- character set: a collection of characters available in an encoding
- code point: the lookup number of a character
- character: the unit of text with a name and appearance
- ASCII: the most common encoding, uses 7 bits per character, the starting point for most other character sets
- EBCDIC: IBM's 8-bit encoding mostly relegated to their mainframes and minis (EBCDIC to ASCII (and SBCS) Conversion)
- single-byte: (SBCS) refers to encodings with one byte per character including ASCII and character sets other than Far Eastern
- double-byte: (DBCS) Far Eastern (Japan, Korea, China and Taiwan) encodings with one or two bytes per character (Double-Byte Safety Primer)
- multi-byte: (MBCS) refers to encodings with one to many bytes per character including double-byte and UTF-8
- lead byte: the first byte in a two-byte character in a double-byte encoding
- trailing byte: the second byte in a two-byte character in a double-byte encoding
- ANSI character set: Microsoft Windows term for non-Unicode encodings (single and double-byte) available as the system locale code page (Whether Double-Byte Is ANSI)
- OEM character set: Microsoft Windows term for encodings available as the system DOS console code page (That Ol' OEM Code Page)
- Unicode: the biggest character set; it attempts to encompass the characters of all other character sets
- UTF-8: a Unicode multi-byte encoding that is backward compatible with ASCII
- UCS-2: the original "Unicode" encoding on Microsoft Windows, limited to code points up to U+FFFF
- UTF-16: the evolution of UCS-2 employing surrogate pairs to overcome the U+FFFF limit
- little endian: (LE) the byte order with least significant byte first, especially in 16-bit encodings UCS-2LE and UTF-16LE
- big endian: (BE) the byte order with most significant byte first, especially in 16-bit encodings UCS-2BE and UTF-16BE
- Byte Order Mark: (BOM, preamble, signature) two to four bytes at the beginning of the file indicating which Unicode encoding is in use (How to Determine Text File Encoding)
- Windows Unicode: beware that Microsoft Windows uses the term "Unicode" or "UNICODE" to refer to UCS-2LE and UTF-16LE encodings (UTF-8 Versus Windows UNICODE)
- surrogate code point: a code point in the range U+D800 to U+DFFF, reserved for UTF-16 surrogate pairs
- surrogate character: tricked you, this is a misnomer because there are no characters corresponding to surrogate code points
- surrogate pair: a leading and trailing surrogate code point in UTF-16 to represent code points from U+10000 to U+10FFFF (Splitting Surrogate Pairs)
- leading surrogate: U+D800 to U+DBFF, also called high surrogate
- trailing surrogate: U+DC00 to U+DFFF, also called low surrogate
- UTF-32: fixed-length Unicode encoding (but remember there is still the issue of combining characters)
- UCS-4: same as UTF-32 (UCS-4 is not restricted to code points up to U+10FFFF but who cares)
- combining characters: characters such as accents and other diacritical marks that modify other characters
- mojibake: characters displaying incorrectly, appearing corrupted (Oh No! Mojibake!)
- locale: computer configuration affecting system language, code page, standards and formats
- default system locale: Microsoft Windows system ANSI code page (Strange case of two system locale ANSI charsets)
- default user locale: Microsoft Windows system standards and formats (The secret family split in Windows code page functions)
- internationalization: (I18N) getting your program to work in multiple locales using Unicode and allowing for variations in number and date formatting
- localization: (L10N) translating and refitting your program for a particular culture
- Base64: most common encoding for transporting/storing binary in a text stream (How I Invented Base64)
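Several of the Unicode entries above (byte order, Byte Order Mark, surrogate pair) are easier to see in code. What follows is a minimal Python sketch, not a production detector. The function names `to_surrogate_pair` and `sniff_bom` are hypothetical helpers I made up for illustration, and the sample character is my own choice.

```python
import codecs

def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (leading, trailing) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the range that needs surrogates")
    offset = code_point - 0x10000            # leaves a 20-bit value
    leading = 0xD800 + (offset >> 10)        # top 10 bits -> U+D800..U+DBFF
    trailing = 0xDC00 + (offset & 0x3FF)     # low 10 bits -> U+DC00..U+DFFF
    return leading, trailing

def sniff_bom(data):
    """Return an encoding name if data starts with a known BOM, else None."""
    # UTF-32 is checked first because the UTF-32LE BOM (ff fe 00 00)
    # begins with the UTF-16LE BOM (ff fe).
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None

clef = "\U0001D11E"                          # MUSICAL SYMBOL G CLEF, U+1D11E
pair = to_surrogate_pair(ord(clef))          # (0xD834, 0xDD1E)

# The same surrogate pair in both byte orders.
big_endian = clef.encode("utf-16-be")        # d8 34 dd 1e
little_endian = clef.encode("utf-16-le")     # 34 d8 1e dd

print([hex(unit) for unit in pair], big_endian.hex(), little_endian.hex())
print(sniff_bom(codecs.BOM_UTF16_LE + little_endian))
```

A file without a BOM returns `None` here, in which case the encoding has to be determined some other way. The UTF-32 check also means that a UTF-16LE file whose first character is U+0000 would be misreported, which is one reason to treat this as a sketch only.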
Inevitably I will need to update this. Please leave comments with any suggestions and corrections.