HEBCI: HTML Entity-Based Codepage Inference

HEBCI is a technique for inferring the encoding used to submit a web form. It uses entity-encoded characters in hidden fields, allowing individual encodings to be "fingerprinted."

There's a reference implementation of HEBCI available through CPAN: Unix-heads can install it the usual way ('cpan -i Encode::HEBCI'). You can also get copies via the HEBCI CPAN page. If all else fails, here's a local copy: Encode-HEBCI-0.02.tar.gz.

What is HEBCI?

HEBCI is a technique that allows the program handling a web form to transparently detect the encoding used. By using carefully-chosen character references, we "fingerprint" the most common encodings. Thus, it is possible to guarantee that input is in a standard encoding without relying on (often unreliable) webserver/browser encoding interactions.

What's an "Encoding"?

Computers only work with numbers, not letters. So, in order to convert from letters to numbers, we have to define a "code." For instance, we might decide that an 'a' gets the number 1, 'b' gets a 2, and so on. Then, we would encode "hello" as {8,5,12,12,15}.

There are many different encodings out there. Because computers and the internet started out in the United States, at first they mostly had to handle only English. English has no diacritics (ü, í, etc.), so nobody gave much thought to how to encode these characters.

After a while, as computers spread around the world, people started defining encodings for their local language. Unfortunately, these encodings would often overlap. This means that a person in Germany might have typed in "schön", but in Sweden it shows up as "schån".

This overlap causes innumerable problems, especially as one moves from Latin/"Western" alphabets into Cyrillic alphabets or the writing systems of Chinese, Japanese, and Korean.
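To make the overlap concrete, here is a minimal sketch that decodes one byte under two single-byte encodings. It assumes Perl's core Encode module and its encoding names "ISO-8859-1" and "MacRoman"; the byte 0xBA is just an illustrative choice.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One raw byte, as it might arrive from a file or a form submission.
my $byte = "\xBA";

# Decode the same byte under two single-byte encodings and record
# which Unicode code point each one maps it to.
my %seen;
for my $encoding ('ISO-8859-1', 'MacRoman') {
    my $char = decode($encoding, "$byte");    # pass a copy to decode
    $seen{$encoding} = sprintf 'U+%04X', ord $char;
    print "$encoding reads 0xBA as $seen{$encoding}\n";
}
# ISO-8859-1 reads 0xBA as U+00BA  (masculine ordinal indicator)
# MacRoman reads 0xBA as U+222B    (integral sign)
```

The byte never changes; only the table used to interpret it does.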

Why should I care about encodings?

There are a few reasons web programmers should care about character encodings. The first and foremost, though, is one we've all seen: "Smart" Quotes gone awry. When's the last time you went to a site, only to see "We?re happy to serve you!"? These goofs look unprofessional and are entirely preventable. Closely related to this is the "Windows¤ is a registered trademark of Microsoft" phenomenon, a dead giveaway that there's one Apple diehard working in a Windows shop, or vice-versa.

Market share: Internationalization is key. The world is getting to be a smaller place, with more and more people getting on the internet all the time. If you think your program will only be used by English-speakers, think again. Many sites are finding out how great it is to be "big in Japan" (or, more commonly, Russia). These international users will never use your site if it mangles their input, though.

Standards compliance: In the XML-happy world we live in, encoding is the difference between interoperable compliance and not working. Many XML parsers will refuse malformed input, including RSS aggregators. So, if you're outputting an RSS feed, it's important that you normalize all input, converting all your users' encodings into UTF-8. Otherwise you simply can't publish on some of the popular aggregator sites.

How can we solve this problem?

Ideally, a standard agreed on by the browser people and the webserver people would specify how a form's encoding gets reported. No such standard exists, and the HTML spec doesn't define any mechanism for this, so we're stuck solving it for ourselves. The ideal solution will be entirely browser-neutral and passive. We need to find some other (sneakier) way to figure out the current character encoding.

Luckily for us, there is a trick we can use for this: entity codes. Entity codes are strings like &amp;, which were (are) used to encode specific characters without using Unicode. When the browser displays a page, it replaces these with the appropriate character from the current encoding. Thus, &amp; becomes the number 38 in most encodings. By itself, this is merely implementation trivia. However, this translation process also occurs whenever a user submits a form. That is, the browser converts any entities in the form to numbers, then submits that information when the user clicks submit. Thus, any entity codes within the form fields are passed along as character values in the browser's current encoding.
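As a rough illustration of that translation step, the sketch below shows the bytes one character turns into under three encodings. It assumes Perl's core Encode module; `submitted_hex` is a hypothetical helper name, and the character used is the degree sign (the entity `&deg;`, U+00B0).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the bytes a browser would submit for one character
# when the form is sent in the given encoding, shown as lowercase hex.
sub submitted_hex {
    my ($codepoint, $encoding) = @_;
    return unpack 'H*', encode($encoding, chr $codepoint);
}

# &deg; names U+00B0 (DEGREE SIGN).
for my $encoding ('ISO-8859-1', 'MacRoman', 'UTF-8') {
    printf "%-10s %s\n", $encoding, submitted_hex(0xB0, $encoding);
}
# ISO-8859-1 b0
# MacRoman   a1
# UTF-8      c2b0
```

An ASCII character such as the ampersand (code 38, hex 26) comes out the same in all three, which is why it is useless for detection.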

So, all we have to do is find an entity that is encoded differently in two different encodings. We slip that into a form field, then look at its value after the user submits the form. This allows us to differentiate between those two encodings. In fact, we could look at all entities in many encodings, and find the ones that allowed us to disambiguate between many encodings. This is what I've done.
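The search for useful entities can be sketched as a small comparison loop. This is only an illustration under stated assumptions, not the procedure used to build the real tables: the four-entry entity list and the helper names `hex_in` and `distinguishing_entities` are made up for the example, and Perl's core Encode module stands in for the browser.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A few HTML entity names and the code points they stand for.
my %entity = (amp => 0x26, deg => 0xB0, divide => 0xF7, ordm => 0xBA);

# Hex form of one character in one encoding.
sub hex_in {
    my ($codepoint, $encoding) = @_;
    return unpack 'H*', encode($encoding, chr $codepoint);
}

# Hypothetical helper: entities whose bytes differ between two encodings.
sub distinguishing_entities {
    my ($enc_a, $enc_b) = @_;
    return grep { hex_in($entity{$_}, $enc_a) ne hex_in($entity{$_}, $enc_b) }
           sort keys %entity;
}

my @useful = distinguishing_entities('ISO-8859-1', 'MacRoman');
print "@useful\n";    # deg divide ordm
```

Any one of the entities it reports is enough to tell that particular pair apart; covering many encodings means repeating the comparison across more pairs and more entities.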

The technique

We add hidden form elements with values containing various entity codes, such as &deg;, &divide;, and &mdash;. Then, when the user submits the form, we take each of those and compare them against a list of what character has what value in what encoding. That is, each encoding has a unique fingerprint for the values of &deg;, &divide;, &mdash;. For MacRoman, it's a1,d6,d1; for UTF-8, c2b0,c3b7,e28094. Thus, we only have to go through our table of encoding-to-fingerprint mappings, and see which fingerprint matches.
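On the HTML side, the hidden elements could be generated along these lines. This is a sketch only: the `fp_` field-name prefix and the `hidden_fields` helper are invented for the example and are not taken from the reference implementation.

```perl
use strict;
use warnings;

# Entity names to embed; the field-name prefix is made up for this sketch.
my @fp_ents = qw/deg divide mdash/;

# Build one hidden input per entity, with the entity reference as its value.
sub hidden_fields {
    my ($prefix, @entities) = @_;
    return join "\n",
        map { qq{<input type="hidden" name="$prefix$_" value="&$_;">} } @entities;
}

print hidden_fields('fp_', @fp_ents), "\n";
# <input type="hidden" name="fp_deg" value="&deg;">
# <input type="hidden" name="fp_divide" value="&divide;">
# <input type="hidden" name="fp_mdash" value="&mdash;">
```

The page source contains only ASCII entity references; the browser supplies the encoding-specific bytes at submission time.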

Note that, once this table is discovered, the cost of fingerprinting a given form submission is very low. And, in the case of misses, you can assume whatever your page's default encoding is. This fallthrough case is equivalent to what the code would have done before adding this detection layer.
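A minimal sketch of the lookup with its fallthrough case follows. It uses only the two fingerprints quoted above; `guess_encoding` and its argument order are assumptions made for the example.

```perl
use strict;
use warnings;

# Fingerprints for &deg;, &divide;, &mdash; as quoted in the text.
my %fingerprints = (
    'UTF-8'    => ['c2b0', 'c3b7', 'e28094'],
    'MacRoman' => ['a1',   'd6',   'd1'],
);

# Hypothetical helper: raw submitted field values in, encoding name out.
sub guess_encoding {
    my ($default, @raw_values) = @_;
    # Turn each submitted value into lowercase hex and join into one string.
    my $observed = join ',', map { unpack 'H*', $_ } @raw_values;
    for my $encoding (sort keys %fingerprints) {
        my $expected = join(',', @{ $fingerprints{$encoding} });
        return $encoding if $observed eq $expected;
    }
    return $default;    # a miss: behave as the code did before detection
}

print guess_encoding('ISO-8859-1', "\xA1", "\xD6", "\xD1"), "\n";    # MacRoman
print guess_encoding('ISO-8859-1', "\x00", "\x00", "\x00"), "\n";    # ISO-8859-1
```

Each submission costs a handful of string comparisons, so the work is dominated by building the table, not by using it.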

Implementations

Surprisingly, one can distinguish between the Big Three (ISO-8859-1/Windows-1252, MacRoman, and UTF-8) with a single entity: &ordm;.

Codepage      &ordm;
UTF-8         c2ba
ISO-8859-1    ba
MacRoman      bc
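The table above amounts to a three-entry lookup. A sketch, with `big_three` as a hypothetical helper name:

```perl
use strict;
use warnings;

# Byte patterns for the masculine ordinal entity, copied from the table.
my %ordm_to_encoding = (
    'c2ba' => 'UTF-8',
    'ba'   => 'ISO-8859-1',
    'bc'   => 'MacRoman',
);

# Hypothetical helper: returns undef when the value matches none of the three.
sub big_three {
    my ($raw_value) = @_;
    my $hex = unpack 'H*', $raw_value;
    return $ordm_to_encoding{$hex};
}

print big_three("\xC2\xBA"), "\n";    # UTF-8
print big_three("\xBC"), "\n";        # MacRoman
```

An undefined result is the cue to fall back to the page's default encoding.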

Differentiating a larger set of encodings requires more entities. A slightly larger implementation works great with only five codepoints:

# Entities embedded in the form, in fingerprint column order.
my @fp_ents = qw/deg divide mdash bdquo euro/;

# Encoding name => hex bytes submitted for each entity in @fp_ents.
# An empty string means no byte value is listed for that entity.
my %fingerprints = (
    "UTF-8"        => ['c2b0','c3b7','e28094','e2809e','e282ac'],
    "WINDOWS-1252" => ['b0','f7','97','84','80'],
    "MAC"          => ['a1','d6','d1','e3','db'],
    "MS-HEBR"      => ['b0','ba','97','84','80'],
    "MAC-CYRILLIC" => ['a1','d6','d1','d7',''],
    "MS-GREEK"     => ['b0','','97','84','80'],
    "MAC-IS"       => ['a1','d6','d0','e3',''],
    "MS-CYRL"      => ['b0','','97','84','88'],
    "MS932"        => ['818b','8180','815c','',''],
    "WINDOWS-31J"  => ['818b','8180','815c','',''],
    "WINDOWS-936"  => ['a1e3','a1c2','a1aa','',''],
    "MS_KANJI"     => ['818b','8180','','',''],
    "ISO-8859-15"  => ['b0','f7','','','a4'],
    "ISO-8859-1"   => ['b0','f7','','',''],
    "CSIBM864"     => ['80','dd','','',''],
);
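Two rows in this table (MS932 and WINDOWS-31J) list the same five values, so a match on that fingerprint narrows the answer to a pair of names, not a single one. A quick way to spot such rows is to group names by fingerprint. The sketch below copies just three rows to stay short; the grouping code is an illustration, not part of the reference implementation.

```perl
use strict;
use warnings;

# Three rows copied from the table above; the others are omitted for brevity.
my %fingerprints = (
    "MS932"       => ['818b','8180','815c','',''],
    "WINDOWS-31J" => ['818b','8180','815c','',''],
    "MS_KANJI"    => ['818b','8180','','',''],
);

# Group encoding names by their joined fingerprint.
my %by_print;
for my $encoding (sort keys %fingerprints) {
    my $key = join ',', @{ $fingerprints{$encoding} };
    push @{ $by_print{$key} }, $encoding;
}

# Report any fingerprint shared by more than one name.
my @ambiguous = grep { @$_ > 1 } values %by_print;
print join(' = ', @$_), "\n" for @ambiguous;    # MS932 = WINDOWS-31J
```

Whether a shared fingerprint matters depends on whether the names involved refer to encodings that actually differ for your data.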

The current release uses up to 19 entities to differentiate among up to 58 encodings.

A Demonstration Application

A demo application is available. Note that you'll need to click the button to submit an initial fingerprint. After submitting the form with your default encoding, change to something else in the list above, and try it again. The result should update to reflect the new encoding.

It's also worthwhile to view the source, to see just how simple this is from the HTML side. With minor additions like these to forms, it is now possible to check the correct encoding of data, allowing web developers to guarantee normalization and smooth interoperability with other, more picky, protocols.


Thanks to Tercent, Inc. for their financial support to create the complete data tables. Without this, HEBCI wouldn't be nearly as useful as it is today.


Copyright © 2005,2006 Josh Myer
The author's homepage.

Last modified: Thu Mar 30 22:16:44 EST 2006