HEBCI is a technique for inferring the encoding used to submit a web form. It uses entity-encoded characters in hidden fields, allowing individual encodings to be "fingerprinted."
There's a reference implementation of HEBCI available through CPAN: Unix-heads can install it the usual way ('cpan -i Encode::HEBCI'). You can also get copies via the HEBCI CPAN page. If all else fails, here's a local copy: Encode-HEBCI-0.02.tar.gz.
HEBCI is a technique that allows the program handling a web form to transparently detect the encoding used. By using carefully-chosen character references, we "fingerprint" the most common encodings. Thus, it is possible to guarantee that input is in a standard encoding without relying on (often unreliable) webserver/browser encoding interactions.
Computers only work with numbers, not letters. So, in order to convert from letters to numbers, we have to define a "code." For instance, we might decide that an 'a' gets the number 1, 'b' gets a 2, and so on. Then, we would encode "hello" as {8,5,12,12,15}.
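As a minimal sketch, the toy code above can be written in a few lines of Perl. The name `toy_encode` is invented for this illustration and is not part of any library.

```perl
use strict;
use warnings;

# Toy code from the text: 'a' => 1, 'b' => 2, ... 'z' => 26.
# toy_encode is a hypothetical helper, not part of any library.
sub toy_encode {
    my ($word) = @_;
    return map { ord($_) - ord('a') + 1 } split //, lc $word;
}

print join(',', toy_encode('hello')), "\n";   # prints 8,5,12,12,15
```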
There are many different encodings out there: since computers and the internet started out in the United States, they mostly only had to work with English. Since English doesn't have any diacritics (ü, í, etc), nobody gave much thought to how to encode these characters at first.
After a while, as computers spread around the world, people started defining encodings for their local language. Unfortunately, these encodings would often overlap. This means that a person in Germany might have typed in "schön", but in Sweden it shows up as "schån".
This overlap causes innumerable problems, especially as one moves from Latin/"Western" alphabets into Cyrillic alphabets or the writing systems of Chinese, Japanese, and Korean.
There are a few reasons web programmers should care about character encodings. The first and foremost, though, is one we've all seen: "Smart" Quotes gone awry. When's the last time you went to a site, only to see "We?re happy to serve you!" ? These goofs look unprofessional and are entirely preventable. Closely related to this is the "Windows¤ is a registered trademark of Microsoft" phenomenon, a dead giveaway that there's one Apple diehard working in a Windows shop, or vice-versa.
Market share: Internationalization is key. The world is getting to be a smaller place, with more and more people getting on the internet all the time. If you think your program will only be used by English-speakers, think again. Many sites are finding out how great it is to be "big in Japan" (or, more commonly, Russia). These international users will never use your site if it mangles their input, though.
Standards compliance: In the XML-happy world we live in, encoding is the difference between interoperable compliance and not working. Many XML parsers, including those inside RSS aggregators, will refuse malformed input. So, if you're outputting an RSS feed, it's important that you normalize all input, converting all your users' encodings into UTF-8. Otherwise you simply can't publish on some of the popular aggregator sites.
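Once the source encoding is known, the normalization step itself is short. The following is only a sketch, assuming Perl's core `Encode` module; `to_utf8` is a hypothetical helper name, and the sample string is illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert raw bytes in a known source encoding to UTF-8 bytes.
# to_utf8 is a hypothetical helper; decode/encode come from the core Encode module.
sub to_utf8 {
    my ($bytes, $source_encoding) = @_;
    my $chars = decode($source_encoding, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);                 # characters -> UTF-8 bytes
}

my $latin1 = "sch\xF6n";                        # "schön" as ISO-8859-1 bytes
my $utf8   = to_utf8($latin1, 'ISO-8859-1');
print unpack('H*', $utf8), "\n";                # prints 736368c3b66e
```

The hard part, which the rest of this page addresses, is knowing what to pass as the source encoding.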
Ideally, the encoding of a form submission would be specified in a standard, something the browser people and the webserver people agreed on. Unfortunately, no such agreement exists, and the HTML spec doesn't define any mechanism for this, so we're stuck solving it for ourselves. The ideal solution will be entirely browser-neutral and passive. We need to find some other (sneakier) way to figure out the current character encoding.
Luckily for us, there is a trick we can use for this: entity codes. Entity codes are strings like `&amp;`, which were (are) used to encode specific characters without using Unicode. When the browser displays a page, it replaces these with the appropriate character from the current encoding. Thus, `&amp;` becomes the number 38 in most encodings. By itself, this is merely implementation trivia. However, this translation process occurs whenever a user submits a form. That is, the browser converts any entities in the form to numbers, then submits that information when the user clicks submit. Thus, any entity codes within the form fields are passed along as character values in the browser's current encoding.
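To make the mechanism concrete, here is a small sketch that generates hidden form fields whose values are entity codes. The `hebci_` field-name prefix and the `hidden_fields` helper are assumptions made for this example, not a convention taken from the reference implementation.

```perl
use strict;
use warnings;

# Entity names to embed. The hebci_ field-name prefix is an assumption
# made for this sketch, not a documented convention.
my @entities = qw(deg divide mdash);

# hidden_fields is a hypothetical helper: one hidden <input> per entity.
sub hidden_fields {
    my @names = @_;
    return join "\n",
        map { qq{<input type="hidden" name="hebci_$_" value="&$_;">} } @names;
}

print hidden_fields(@entities), "\n";
```

When the form is submitted, each field arrives as whatever bytes the browser's current encoding uses for that character.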
So, all we have to do is find an entity that is encoded differently in two different encodings. We slip that into a form field, then look at its value after the user submits the form. This allows us to differentiate between those two encodings. In fact, we could look at all entities in many encodings, and find the ones that allowed us to disambiguate between many encodings. This is what I've done.
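One way to explore such a table is to ask an encoding library how each entity's character is encoded. The sketch below assumes Perl's core `Encode` module and its encoding names (`cp1252`, `MacRoman`); `entity_bytes` and `fingerprint_for` are hypothetical helpers. It is not how the published tables were built, and a real browser may treat unrepresentable characters differently from `Encode`.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Unicode code points for three of the entities discussed in the text.
my %codepoint = (deg => 0x00B0, divide => 0x00F7, mdash => 0x2014);

# Hex bytes of one entity in one encoding, or '' if Encode cannot map it.
# entity_bytes and fingerprint_for are hypothetical names for this sketch.
sub entity_bytes {
    my ($encoding, $entity) = @_;
    my $char  = chr $codepoint{$entity};
    my $bytes = eval {
        encode($encoding, $char, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $bytes ? unpack('H*', $bytes) : '';
}

# Comma-separated fingerprint of several entities in one encoding.
sub fingerprint_for {
    my ($encoding, @entities) = @_;
    return join ',', map { entity_bytes($encoding, $_) } @entities;
}

for my $encoding (qw(UTF-8 cp1252 MacRoman)) {
    printf "%-10s %s\n", $encoding, fingerprint_for($encoding, qw(deg divide mdash));
}
```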
We add hidden form elements with values containing various entity codes, such as `&deg;`, `&divide;`, and `&mdash;`. Then, when the user submits the form, we take each of those and compare them against a list of what character has what value in what encoding. That is, each encoding has a unique fingerprint for the values of `&deg;`, `&divide;`, `&mdash;`. For MacRoman, it's a1,d6,d1; for UTF-8, c2b0,c3b7,e28094. Thus, we only have to go through our table of encoding-to-fingerprint mappings, and see which fingerprint matches.
Note that, once this table is discovered, the cost of fingerprinting a given form submission is very low. And, in the case of misses, you can assume whatever your page's default encoding is. This fallthrough case is equivalent to what the code would have done before adding this detection layer.
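The lookup and the fall-through case can be sketched as follows. The two fingerprints are the MacRoman and UTF-8 values quoted above; `detect_encoding` is a hypothetical helper, and the default encoding passed in is only an example.

```perl
use strict;
use warnings;

# Fingerprints for deg, divide, mdash, taken from the text above.
my %encoding_for = (
    'a1,d6,d1'         => 'MacRoman',
    'c2b0,c3b7,e28094' => 'UTF-8',
);

# detect_encoding is a hypothetical helper. It takes a default encoding and
# the raw bytes of the three hidden fields, and returns the matching
# encoding name, or the default when no fingerprint matches.
sub detect_encoding {
    my ($default, @raw_values) = @_;
    my $fingerprint = join ',', map { unpack('H*', $_) } @raw_values;
    return exists $encoding_for{$fingerprint}
        ? $encoding_for{$fingerprint}
        : $default;
}

print detect_encoding('ISO-8859-1', "\xA1", "\xD6", "\xD1"), "\n";   # prints MacRoman
print detect_encoding('ISO-8859-1', "?", "?", "?"), "\n";            # prints ISO-8859-1
```

Because the table is a hash keyed on the joined fingerprint, each submission costs one hex conversion and one lookup.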
Surprisingly, one can distinguish between the Big Three (ISO-8859-1/Windows-1252, MacRoman, and UTF-8) with a single entity: `&ordm;`.
| Codepage | `&ordm;` |
|---|---|
| UTF-8 | c2ba |
| ISO-8859-1 | ba |
| MacRoman | bc |
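A sketch of the single-entity check, using only the three values from the table above. `big_three` is a hypothetical helper name; note that the `ba` row cannot tell ISO-8859-1 from Windows-1252, which is why the text groups them.

```perl
use strict;
use warnings;

# Hex bytes submitted for the ordm entity in each encoding (from the table above).
my %by_ordm = (
    'c2ba' => 'UTF-8',
    'ba'   => 'ISO-8859-1',
    'bc'   => 'MacRoman',
);

# big_three is a hypothetical helper; it returns undef when nothing matches.
sub big_three {
    my ($raw_value) = @_;
    return $by_ordm{ unpack('H*', $raw_value) };
}

print big_three("\xC2\xBA"), "\n";   # prints UTF-8
print big_three("\xBC"), "\n";       # prints MacRoman
```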
Differentiating a larger set of encodings requires more entities. A slightly larger implementation works great with only five codepoints:
    # Entities placed in the hidden fields, in fingerprint order.
    my @fp_ents = qw/deg divide mdash bdquo euro/;

    # Each list holds the hex bytes for the entities in @fp_ents, in the same
    # order. An empty string marks an entity with no recorded value.
    my %fingerprints = (
        "UTF-8"        => ['c2b0','c3b7','e28094','e2809e','e282ac'],
        "WINDOWS-1252" => ['b0','f7','97','84','80'],
        "MAC"          => ['a1','d6','d1','e3','db'],
        "MS-HEBR"      => ['b0','ba','97','84','80'],
        "MAC-CYRILLIC" => ['a1','d6','d1','d7',''],
        "MS-GREEK"     => ['b0','','97','84','80'],
        "MAC-IS"       => ['a1','d6','d0','e3',''],
        "MS-CYRL"      => ['b0','','97','84','88'],
        "MS932"        => ['818b','8180','815c','',''],
        "WINDOWS-31J"  => ['818b','8180','815c','',''],
        "WINDOWS-936"  => ['a1e3','a1c2','a1aa','',''],
        "MS_KANJI"     => ['818b','8180','','',''],
        "ISO-8859-15"  => ['b0','f7','','','a4'],
        "ISO-8859-1"   => ['b0','f7','','',''],
        "CSIBM864"     => ['80','dd','','',''],
    );
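A table in this shape can be searched with a short loop. The sketch below copies a subset of the rows so that it runs on its own; `candidates` is a hypothetical helper, not the reference implementation's API. It returns every matching name, because some rows (MS932 and WINDOWS-31J here) share a fingerprint.

```perl
use strict;
use warnings;

# A subset of the table above, copied so that the sketch runs on its own.
my %fingerprints = (
    "UTF-8"        => ['c2b0','c3b7','e28094','e2809e','e282ac'],
    "WINDOWS-1252" => ['b0','f7','97','84','80'],
    "MAC"          => ['a1','d6','d1','e3','db'],
    "MS932"        => ['818b','8180','815c','',''],
    "WINDOWS-31J"  => ['818b','8180','815c','',''],
);

# candidates is a hypothetical helper: it returns every encoding name whose
# fingerprint equals the submitted hex values, sorted for stable output.
sub candidates {
    my @submitted = @_;
    my $key = join ',', @submitted;
    my @matches = grep { join(',', @{ $fingerprints{$_} }) eq $key }
                  keys %fingerprints;
    return sort @matches;
}

print join(' ', candidates('b0','f7','97','84','80')), "\n";     # prints WINDOWS-1252
print join(' ', candidates('818b','8180','815c','','')), "\n";   # prints MS932 WINDOWS-31J
```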
The current release uses up to 19 entities to differentiate among up to 58 encodings.
A demo application is available. Note that you'll need to click the button to submit an initial fingerprint. After submitting the form with your default encoding, change to something else in the list above, and try it again. It should be updated to reflect the new encoding.
It's also worthwhile to view the source, to see just how simple this is from the HTML side. With minor additions like these to forms, it is now possible to check the correct encoding of data, allowing web developers to guarantee normalization and smooth interoperability with other, more picky, protocols.
Thanks to Tercent, Inc. for their financial support to create the complete data tables. Without this, HEBCI wouldn't be nearly as useful as it is today.