The Japanese have a word, gaiji (外字), literally “external characters”, which in practice means roughly “characters that Unicode doesn't support”. The original exemplars of gaiji are kanji used in place names and personal names: localised variants of the standard kanji that have, over time, come to be accepted. It's as though you were to spell London with a funny squiggle by the second “o”, and not use it in any other word. There are actual western examples too, of course: Adobe cite the new character that emerged in 2004 when the “Ukranian National Bank announced a symbol for the Hryvni, the Ukranian currency”.

Some of these characters ought to just be standardised and included in Unicode, but others aren't really common enough to warrant it. The Private Use Area doesn't seem right for them either: these aren't private characters, they're in public use but just rare. You want to be able to identify them context-free, unambiguously, in Unicode documents.

One possibility is to extend Unicode so that it can include new characters that have a URI. So for example, if http://example.org/charname were to be a character that you've invented, you should in some way be able to encode that in a document. There are a myriad possible ways of doing so, and the mechanism itself isn't really all that important, but the general principle is. Writing systems are complex and evolutionary, whereas Unicode is pretty strict and formulaic.
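To make the idea concrete, here is a minimal sketch of one of those myriad possible mechanisms. The `{{char:URI}}` escape syntax and the `encode` and `decode` names are invented purely for illustration; nothing like them exists in Unicode, and any other delimiter would do just as well.

```python
import re

# Hypothetical escape syntax: {{char:URI}} marks a character identified by a URI.
# Neither the syntax nor these function names come from Unicode; they are assumptions.
ESCAPE = re.compile(r"\{\{char:([^\s{}]+)\}\}")

def encode(parts):
    """Join plain strings and ("uri", URI) tuples into one escaped string."""
    out = []
    for part in parts:
        if isinstance(part, tuple):
            out.append("{{char:" + part[1] + "}}")
        else:
            out.append(part)
    return "".join(out)

def decode(text):
    """Split an escaped string back into plain strings and ("uri", URI) tuples."""
    parts, pos = [], 0
    for m in ESCAPE.finditer(text):
        if m.start() > pos:
            parts.append(text[pos:m.start()])
        parts.append(("uri", m.group(1)))
        pos = m.end()
    if pos < len(text):
        parts.append(text[pos:])
    return parts

# "London" spelled with an invented character in place of the second "o".
doc = encode(["Lond", ("uri", "http://example.org/charname"), "n"])
```

Decoding `doc` gives back the same three parts, so the URI-identified character survives a round trip through ordinary text.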

Since I'm the delegated authority for all URIs starting with http://purl.org/ns/, I figured I might use it to create a small characters ontology so that people could describe which URIs identify characters using RDF. It would at least be a nice philosophical protest against the ad hoc evolution of overly simple computer systems intended to encode real world information. But it'd also be ironic, since RDF is itself a very good example of that kind of problem, as are things like OWL that it's engendered. As Dan Brickley said, “A rule of thumb to promote here might well be: if you find your thinking on some topic can almost be fully captured in OWL statements about categories, hierarchies and logical membership rules, ... you're not thinking hard enough.”
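As a rough sketch of what describing a character in RDF might look like, the following emits two N-Triples statements about a URI. The vocabulary namespace `http://example.org/ns/chars#`, its `Character` class, and the `describe_character` helper are all placeholders made up for this example; no such ontology has actually been published here.

```python
# Hypothetical vocabulary namespace and class name; both are assumptions.
CHARS = "http://example.org/ns/chars#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

def describe_character(uri, label):
    """Return N-Triples lines stating that `uri` identifies a character."""
    # Escape backslashes and double quotes so the label is a valid literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"<{uri}> <{RDF_TYPE}> <{CHARS}Character> .",
        f'<{uri}> <{RDFS_LABEL}> "{escaped}" .',
    ]

triples = describe_character("http://example.org/charname", "London squiggle o")
print("\n".join(triples))
```

The first statement types the URI as a character and the second attaches a human-readable label to it.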

When I was thinking about how I would define a “Character” class, I figured I'd of course do it in the nice loose FOAF-style way of being anecdotal and consensus-based, but also perhaps by linking to exemplars; that follows the latest cognitive science approaches a bit better. Natural language is a far more interesting encoding of knowledge than RDF, one that we can all contribute to extensibly in a much neater way than RDF's URI-based extensibility. Of course we have problems creating simple formal systems with natural language, and RDF proponents argue that RDF can fill that kind of niche, but that's not to say that natural language won't manage it eventually. The kinds of things hinted at in Joe Geldart's Approaching Human Knowledge and by Signiform's Thought Treasure are possible directions to bet on.

This kind of knowledge representation issue is the same old map-territory problem, though, and it does happen in non-Web media too. Handwritten manuscripts, for example, often have complex structures, annotations, and lacunae that aren't captured precisely in subsequent publications. The transcriptions might just be wrong, or even a photographic reproduction might not be able to give you the full information that you require from the original document; you might have to subject it to ultraviolet light analysis or what-have-you. This isn't unique to the Web, or even to computer science.

It occurs to me that personal name gaiji, housemarks and the like are a kind of more Romantic form of blazonry, clan tartans, and so on. The trademark is now supplemented by the favicon. We don't tend to think of the devising of favicons and housemarks, however, as being high art in the same way that composing a blazon is. The main difference between high and low art is their breadth of consensus and societal class connotations. It would be nice if there were a high art form of identificational symbology that extended as a chassignite interest to any public individual.

Sean B. Palmer, 1st March 2008