Character Encodings

Historically, few MultiValue developers have needed to immerse themselves in the murky waters of character encodings. But in today's world, with ever greater internationalization, a basic grasp of encoding is becoming essential — whether working with email, web pages, client languages such as those in the .NET platform, or for data storage and transfer.

Unicode and internationalization are far too large a subject to cover in a single article, so this will give a high level understanding of how modern character encoding works, how it came about — and why you need to know about it.

The ASCII Era

MultiValue databases typically use an eight bit encoding scheme, which dates back to the days when all computers spoke English and IBM was still clinging on to EBCDIC whilst the rest of the industry was standardising around ASCII. Within the ASCII model all necessary characters in the English speaking world could be fitted neatly into seven bits along with certain control characters and punctuation. This even left a whole bit to spare on the then-prevalent eight bit architectures, and different technology groups decided to use this for their own devious purposes, giving rise to our own marker characters (the field, value, subvalue and text marks) sitting at the top of the table.

ANSI

ASCII was fine and dandy so long as you spoke English, but inevitably those pesky foreigners wanted in on the act and all those extra characters from 128-255 proved an irresistible lure. The ensuing free-for-all created complete chaos until the ANSI standard came along to put things right. Under this standard, the first 128 characters (0 to 127) were kept for good ol' ASCII (required for programmers, since all major programming languages and operating systems are still derived from English today) and those from 128 up were assigned to various regional languages through the application of different code pages. So by setting your Windows PC to use page 874 you could render Thai characters, or page 1251 if you wanted Cyrillic. Similarly if you look through a VT100 emulation manual, for example, you will find selection commands between US English, Finnish, and French code pages, amongst others.

All of which was good news, so long as you only ever worked within one region, and you didn't need anything too complex like many of the far eastern character sets. But it also set an important precedent.

Below 128, all characters were predictable, and so a character and its encoding went hand in hand. If you are an English speaker, you can treat the two as synonymous: CHAR(65) is a capital A is a capital A is always a capital A. Above 127, and a character only makes sense if you understand its context, which in the case of the ANSI standard meant knowing the code page for which it was intended.

This presents some problems. Receive character 138 as part of a document from an international partner and you may need to know whether to display a Cyrillic capital letter LJE (Љ), a capital S with a caron (Š) or any of a dozen different variants. In short, knowing a character value alone is not enough. The link between the character value and its intended meaning had been broken.
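To make the ambiguity concrete, here is a small Python sketch that decodes the same byte under three legacy code pages. The codec names are Python's own, and the choice of pages is purely illustrative.

```python
# Byte value 138 (0x8A) decoded under three legacy code pages.
RAW = bytes([138])

# Python codec names; the selection of pages is illustrative only.
CODE_PAGES = ["cp1251", "cp1252", "cp437"]

# Map each code page to the character it assigns to byte 138.
decoded = {page: RAW.decode(page) for page in CODE_PAGES}

for page, char in decoded.items():
    print(f"{page}: {char!r} (U+{ord(char):04X})")
```

The byte never changes; only the table used to interpret it does.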

Unicode

The ANSI code page approach had three huge drawbacks: it was only useful if your character set could be crammed into the extra 128 characters available in a code page, which ruled out a lot of the far eastern regions; it only really worked if your language was composed of discrete characters, again ruling out languages that use combining characters to form a single symbol; and there was no standard way to communicate which code page to use. Oh and naturally enough, DOS, Windows, and various types of UNIX all tended to use different code pages.

Clearly what was needed was one single standard to unite them all into one series, one that could define every possible regional character, symbol, combining character and, for good measure, various mathematical and engineering symbols, while not forgetting historical languages like Anglo-Saxon, Ogham, hieroglyphs, and mythical languages like Tolkien's Tengwar and Cirth (fig. 1).

Fig. 1

So the computer industry did what it does best and bravely set up two competing standards: ISO 10646 defining the UCS (Universal Character Set) and Unicode. Fortunately for all, reason ultimately prevailed and the two standards harmonised, so for developers today Unicode is the standard of choice.

But that is far from the end of the story.

Unicode and Character Encodings

A common misconception is that Unicode operates similarly to ASCII but with a lot more characters, and that these are represented as double byte (16 bit) giving a maximum of 65,536 possible entries.

One reason for this misconception is that Windows originally offered double byte characters as a way of encoding far eastern sets, and development tools such as Delphi and C# currently define a 16 bit char data type. If you are an English or European language speaker, save a Notepad document as Unicode and it will indeed write the content as a series of double byte characters with the unused byte set to zero.

True double byte has the advantage of consistency. It is just as easy to move forward and back through a regularly double byte string as it is through a single byte string, so long as you can ignore brain-dead C routines that expect 0x0 as a string terminator and you know the order in which the two bytes are combined. The question of big-endian or little-endian is something over which manufacturers cannot agree, so your Notepad document also includes a lead-in character known as a Byte Order Mark or BOM which shows as 0xFEFF or 0xFFFE depending on the architecture and which can be used to identify which way round these are coded. Character handling routines are expected to byte swap as required. But Unicode is not a 16 bit encoding standard.
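As a rough illustration of the byte swapping described above, the following Python sketch inspects the leading BOM and picks the matching byte order. The function name is made up for this example; the BOM constants come from the standard `codecs` module.

```python
import codecs

def decode_utf16_with_bom(data: bytes) -> str:
    """Decode double byte text, using the leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE: low byte first
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF: high byte first
        return data[2:].decode("utf-16-be")
    raise ValueError("no byte order mark found")

# The same two characters written in each byte order.
little = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "Hi".encode("utf-16-be")

print(little.hex(" "), "->", decode_utf16_with_bom(little))
print(big.hex(" "), "->", decode_utf16_with_bom(big))
```

In practice Python's plain `"utf-16"` codec performs this check itself; the explicit version is shown only to expose the mechanism.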

In reality, Unicode and UCS are nothing more than classifications. Their purpose is simply to define a global code table that assigns a positional value to each character (known as a "code point"). What they specifically do not define is how those characters should be stored or encoded. And the code space they define is wider than 16 bits: Unicode code points run up to U+10FFFF, well past the 2^16 (65,536) values a double byte can hold, so double-byte encoding just isn't enough to hold all possible entries.
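The distinction between a code point and its stored form can be sketched in a few lines of Python. The sample characters are arbitrary choices for illustration.

```python
# A code point is just a number; how it is stored depends on the encoding.
euro = "\u20ac"            # EURO SIGN
code_point = ord(euro)     # the position in the Unicode table

# The same code point serialised under three different encodings.
encodings = {name: euro.encode(name)
             for name in ("utf-8", "utf-16-le", "utf-32-le")}

print(f"U+{code_point:04X}")
for name, raw in encodings.items():
    print(f"{name:10} {raw.hex(' ')}")

# A code point that does not fit in 16 bits: MUSICAL SYMBOL G CLEF.
clef = chr(0x1D11E)
print(f"U+{ord(clef):X} needs more than 16 bits: {ord(clef) > 0xFFFF}")
```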

Now if you're an English speaker still working with the 7 ASCII bits, the thought of all those extra bits might be scary and you may feel aggrieved at having to store them all. And then there are all the legacy data and documents that predate Unicode: what should you do with them?

In fact, there are many different ways in which Unicode characters can be encoded: UCS-2, UCS-4, UTF-7, UTF-8, UTF-16 and UTF-32 amongst others, along with the old code pages that can show some, but not all, characters. But out of this bewildering series, the chances are you will be introduced to either UTF-16 or UTF-8.

UTF-16 is, for most text, the regularly spaced double byte encoding we just met: every character in the first 65,536 code points takes one 16 bit unit, and anything beyond that takes a pair of units known as a surrogate pair. A single unit is good enough for most international text unless you really have a burning need to display Elvish or some mathematical symbols. It is not compatible with ASCII but it is relatively easy to navigate and if your data is in ASCII you can easily translate between the two by either adding or stripping off the zero byte. But even so, that's extra work and who needs that?
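The "adding or stripping off the zero byte" translation might look like the sketch below. It assumes little-endian byte order with no BOM, and both function names are invented for this example.

```python
def ascii_to_utf16le(data: bytes) -> bytes:
    """Widen 7-bit ASCII to little-endian double byte by appending a zero byte."""
    if any(b > 0x7F for b in data):
        raise ValueError("input is not 7-bit ASCII")
    return b"".join(bytes([b, 0]) for b in data)

def utf16le_to_ascii(data: bytes) -> bytes:
    """Narrow little-endian double byte text to ASCII by dropping the zero byte."""
    low, high = data[0::2], data[1::2]
    # Refuse anything that would lose information when the high byte is dropped.
    if len(data) % 2 or any(high) or any(b > 0x7F for b in low):
        raise ValueError("text contains characters outside ASCII")
    return low

wide = ascii_to_utf16le(b"ABC")
print(wide.hex(" "))               # each letter followed by a zero byte
print(utf16le_to_ascii(wide))
```

The guard clauses matter: blindly stripping every second byte silently corrupts any text that strays outside ASCII.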

So the popular alternative in the Western world is UTF-8, the brainchild of Ken Thompson who famously devised it on a New Jersey diner placemat.

UTF-8 uses a variable number of bytes to encode a character: anything between one and four bytes under the current standard (the original design allowed up to six). This makes it expensive for documents that use characters at the far end of the Unicode table, but has the huge advantage that the first 128 characters can fit into a single ASCII-compatible byte. So you can take your ASCII document, mark it as UTF-8 and sit back knowing that you are now in the modern world.
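A quick way to see the variable width is to count the bytes Python's built-in UTF-8 codec produces for characters from different parts of the table. The sample characters and the sample text are arbitrary.

```python
# Characters from progressively higher regions of the Unicode table.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

# Number of bytes each one occupies once encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} takes {n} byte(s)")

# Pure ASCII text is byte-for-byte identical under both encodings.
ascii_text = "CUSTOMER*1001"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print("ASCII and UTF-8 bytes match:", same_bytes)
```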

Specifying the Content Type

Handling text is no longer just a matter of scanning bytes. In today's world knowing the encoding scheme matters. For email, this is specified using the Content-Type header. For a web page, this uses the meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Note that this is actually inside the HTML document and the browser needs to parse past the <html> and <head> tags to get that far.

Which just proves that text encoding isn't something to be scared of but it is something that needs to be understood.
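As a hedged sketch of how a receiving program might discover the declared encoding, the Python below reads the charset from an email Content-Type header and from the meta tag form shown above. It uses the standard `email` and `html.parser` modules; the class and function names are invented here, and real pages may declare their encoding in other ways that this sketch ignores.

```python
from email import message_from_string
from html.parser import HTMLParser

def charset_from_email(raw: str):
    """Return the charset parameter of the message's Content-Type header."""
    return message_from_string(raw).get_content_charset()

class MetaCharsetFinder(HTMLParser):
    """Pick the charset out of a <meta http-equiv="Content-Type"> tag."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag != "meta" or self.charset is not None:
            return
        if (attr.get("http-equiv") or "").lower() != "content-type":
            return
        # content looks like "text/html; charset=utf-8"
        for part in (attr.get("content") or "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.lower() == "charset":
                self.charset = value.strip().lower()

mail = "Content-Type: text/plain; charset=UTF-8\n\nHello"
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"></head><body></body></html>')

finder = MetaCharsetFinder()
finder.feed(page)
print(charset_from_email(mail), finder.charset)
```

The page has to be scanned as bytes or as a provisional ASCII-compatible decoding before the tag is found, which is why the declaration only works for encodings that leave the ASCII range intact.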


How UTF-8 Encoding Works

UTF-8 uses a progressive scheme to encode characters, in which the most popular (to the American audience) characters are held in a single byte and the more obscure (to the same audience) expand into several bytes: up to six in the original design described here, and at most four under the current standard.
UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. 
All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character and all possible 2^31 UCS codes can be encoded.
The first byte of a multi byte sequence is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. 
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding, which is good news for field marks. 
The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character: 
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
(etc)
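The bit patterns above can be turned into a small hand-rolled encoder. This is a sketch for illustration only, not a replacement for a library codec: it covers the one to four byte forms used by current UTF-8 (the four byte form being 11110xxx followed by three 10xxxxxx bytes), and the function name is made up for this example.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x bits of the UTF-8 patterns."""
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x110000:    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")

# Sample the table (skipping surrogates, which Python's codec rejects) and
# compare against the built-in codec; also confirm 0xFE and 0xFF never appear.
sample = [cp for cp in range(0, 0x110000, 257) if not 0xD800 <= cp <= 0xDFFF]
matches_builtin = all(utf8_encode(cp) == chr(cp).encode("utf-8") for cp in sample)
marks_free = all(0xFE not in utf8_encode(cp) and 0xFF not in utf8_encode(cp)
                 for cp in sample)
print(matches_builtin, marks_free)
```

Because the lead byte of a four byte sequence never exceeds 0xF4 and continuation bytes never exceed 0xBF, the values 0xFE and 0xFF cannot occur anywhere in the output.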



 
