Strange Strands

23 Mar 2006

General Category Values

I started work on a script to extract words from a text. It uses the General Category values in Unicode to work out which characters are letters, and then returns every sequence of one or more contiguous letters. I tried three different approaches: a) checking the category returned by unicodedata.category, b) checking whether the character in question belongs to a huge set of all the Unicode letters 'twixt U+0000 and U+FFFF, and c) using a huge regexp of those same characters.
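As a rough illustration of the three approaches, here is a minimal sketch in Python 3. It is not the original script: the function names are made up for the example, and it assumes that "letter" means any General Category beginning with L (Lu, Ll, Lt, Lm, Lo).

```python
import re
import unicodedata


def is_letter(ch):
    """True if ch has a General Category starting with 'L'."""
    return unicodedata.category(ch).startswith('L')


# Every letter from U+0000 to U+FFFF. Surrogates are category Cs, so they drop out.
LETTERS = frozenset(ch for ch in map(chr, range(0x10000)) if is_letter(ch))

# One big character class. Letters are never special inside [...], so no escaping.
LETTER_RUN = re.compile('[' + ''.join(sorted(LETTERS)) + ']+')


def split_on(text, test):
    """Collect maximal runs of characters for which test(ch) is true."""
    words, run = [], []
    for ch in text:
        if test(ch):
            run.append(ch)
        elif run:
            words.append(''.join(run))
            run = []
    if run:
        words.append(''.join(run))
    return words


def words_by_category(text):
    """Method a): ask unicodedata.category about every character."""
    return split_on(text, is_letter)


def words_by_set(text):
    """Method b): test membership in the precomputed set of letters."""
    return split_on(text, LETTERS.__contains__)


def words_by_regexp(text):
    """Method c): let the regexp engine find the runs of letters."""
    return LETTER_RUN.findall(text)
```

All three functions should return the same list for the same input; only the speed differs. Listing every letter individually keeps the regexp close to the "huge regexp" described above, though collapsing it into ranges would make it far smaller.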

I then fed the script the Complete Works of William Shakespeare from Project Gutenberg, which is about 9MB of text and 926,286 words. Method a), with unicodedata.category, took roughly 13 seconds to complete; method b), with the set, took roughly 8 seconds; and method c), with the regexp, took roughly 4 seconds, even though the regexp is nearly a quarter of a megabyte in size.
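For anyone wanting to repeat the comparison, a timing harness might look something like the sketch below. The name time_methods is hypothetical, and it takes a single wall-clock reading per method, so treat the numbers it gives as rough.

```python
import time


def time_methods(text, methods):
    """Run each word extractor once over text.

    methods maps a label to a function taking a string and returning a list
    of words. Returns {label: (seconds, word_count)}.
    """
    results = {}
    for name, extract in methods.items():
        start = time.perf_counter()
        words = extract(text)
        results[name] = (time.perf_counter() - start, len(words))
    return results
```

Passing it the full text and a dictionary of the three extraction functions would give one (seconds, word count) pair per method.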

Unfortunately, this approach thinks that the word "can't" is actually can and t, and that hyphenated words are really two or more separate words. But at least it doesn't have the problem of thinking that an ASCII em-dash ("--") is part of the middle of a word, which can happen if you allow unlimited punctuation inside a word. One approach that I have taken on that before is to simply replace all "--" with " - ", but that leaves the door open for further problems of the same nature; so something that checks that there's only a single punctuation mark between letters would be better.
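One way to allow only a single punctuation mark between runs of letters is sketched below. It is an assumption-laden example rather than a finished tokeniser: it treats only the apostrophe and the hyphen as word-internal marks, and it uses [^\W\d_] as a standard-library approximation of "letter", which also lets through a few numeric characters that are not decimal digits.

```python
import re

# [^\W\d_] approximates "a letter": a word character that is neither a
# decimal digit nor an underscore.
LETTERS = r"[^\W\d_]+"

# A word is a run of letters, optionally followed by groups of exactly one
# apostrophe or hyphen plus another run of letters.
WORD = re.compile(LETTERS + r"(?:['\-]" + LETTERS + r")*")


def find_words(text):
    """Return words, keeping a single internal apostrophe or hyphen."""
    return WORD.findall(text)
```

Because each internal mark must be followed directly by a letter, "can't" and "well-known" stay whole, while "word--another" still splits into two words at the double hyphen.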

Of course, were one to take this up with the regexp method then the regexp could grow to be dozens of megabytes long. It would be preferable to use Perl 6's grammars, perhaps. But then, whilst we're doing regexp reform, we might as well throw in tentative matching too, i.e. asking whether a string could still grow into a match. That would be handy for tokenisers: a hypothetical re.compile(r'abc').tentative('ab') would be True.
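Python's standard re module has no such method, but for the special case where every token is a literal string the idea can be sketched with plain prefix tests. The names tentative and could_be_token are hypothetical, and this does not generalise to arbitrary regexps.

```python
def tentative(literal, text):
    """True if text is a prefix of literal, so it could still grow into a match."""
    return literal.startswith(text)


def could_be_token(tokens, text):
    """True if text could still become at least one of the literal tokens."""
    return any(tentative(token, text) for token in tokens)
```

A tokeniser could keep reading characters for as long as could_be_token stays true, and fall back to the longest complete token once it turns false.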

I'm also working on a public domain ephemerides script, which I might merge into the information service that I'm hacking on on the side. That service uses the structure of the input to dictate what service to use, sorta like a Do The Right Thing service with improved Do What I Mean. Also along those lines, I thought of chaining a language guesser with a language translation service to use as a translation tool in phenny. I'm not sure how effective that'll be though, especially on short texts.

Strange Strands, General Category Values, by Sean B. Palmer
Archival URI: http://inamidst.com/strands/categories
