Gallimaufry of Whits (2007-11)

These are quick notes taken by Sean B. Palmer on the Semantic Web, Python and Javascript programming, history and antiquarianism, linguistics and conlanging, typography, and other related matters. To receive these bits of dreck regularly, subscribe to the feed. To browse other months, check the contents.

2007-11-01 13:33 UTC:

I've written a new RDF/XML parser. The previous one that I wrote, back in 2003, was released before the RDF/XML Syntax Specification went to REC, and was not fully compliant. This new one, though I'm still testing it, not only passes all of the positive parser cases in the RDF Test Cases suite, but is also faster and requires much less memory.

You should be able to use the new module as a standalone RDF/XML parser in much the same way as the previous one, and you should also be able to use it to convert files to N-Triples on the command line (give it a URI as the only argument). There's no documentation yet because it's still in testing, but the functions you'll want to use are defined at the bottom of the file.

It's a very interesting bit of kit under the covers. I'd wanted to use the xml.dom.pulldom module, which gives a stream of DOM nodes which you can then expand, but the problem is that when you expand the children of an element, you get a list. I wanted a generator, which only loads into memory one child at a time. This proves to be an extremely tricky thing to do, but I managed to make it work just fine on top of expat.
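To make the pulldom limitation concrete, here is a minimal sketch in modern Python 3, using a made-up three-item document. It only illustrates the problem described above: expanding an element pulls every child into memory at once. It is not how the new parser works, and the function name `expanded_children` is invented for the illustration.

```python
from xml.dom import pulldom

# Hypothetical sample document, just for the demonstration.
DOC = "<root><item>a</item><item>b</item><item>c</item></root>"

def expanded_children(xml_text, tag):
    """Expand the first <tag> element and return its element children."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # expandNode reads ahead to the matching end tag, so the
            # whole subtree is built in memory before we see any child.
            events.expandNode(node)
            return [c for c in node.childNodes
                    if c.nodeType == c.ELEMENT_NODE]
    return []

children = expanded_children(DOC, "root")
print(type(children).__name__, len(children))
```

For a three-item document that is harmless, but for a 20MB file whose root element holds everything, expanding the root amounts to loading the whole document.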

One of the nice side-effects of making an XML parser that works in this way is that you can then just follow the processing model in the RDF/XML specification very closely. In doing that, I found two bugs (a missing parent accessor and incorrect resolve(...) calls) and a typo ("tiple"), which Ivan Herman, the Semantic Web Lead, very quickly added to the RDF Errata. Thanks Ivan!

I also found a more worrying bug in N-Triples, to do with grammar ambiguity, which is solvable by forcing the escaping of ">" in the absoluteURI production. That's also been added to the errata, and my new RDF/XML parser emits N-Triples using that further escaping.
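As a rough illustration of that extra escaping, here is a small Python 3 sketch of a URI serialiser for N-Triples. It assumes the escape takes the usual N-Triples `\uXXXX` form, and the function name `ntriples_uriref` is made up for the example; it is not taken from the new rdfxml.py.

```python
def ntriples_uriref(uri):
    """Serialise a URI as an N-Triples uriref, escaping '>' so that it
    can never be mistaken for the closing delimiter."""
    out = []
    for ch in uri:
        code = ord(ch)
        # Assumption: '>' and backslash are escaped along with anything
        # outside printable ASCII.
        if ch in ('>', '\\') or code < 0x20 or code > 0x7E:
            if code > 0xFFFF:
                out.append('\\U%08X' % code)
            else:
                out.append('\\u%04X' % code)
        else:
            out.append(ch)
    return '<' + ''.join(out) + '>'

print(ntriples_uriref('http://example.org/a>b'))
# prints <http://example.org/a\u003Eb>
```

With the escape in place, the only unescaped ">" in the output is the one that ends the URI, which is what removes the ambiguity.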

XML Canonicalization was kinda tricky, and I'm not yet entirely sure that I've implemented it correctly. I know that it's better at comments than rapper is, at least, but I'm not sure about attribute ordering and such. There's only one case for XML c14n in the RDF Test Cases suite, and it's pretty trivial.

As for speed, parsing my FOAF file about 250 times in a row takes about 1.9x as long as rapper, but only 0.6x as long as rdflib. This is pretty encouraging. On small files it's a bit faster than my previous parser, but on large files there's a huge difference: the 20MB wordnet synset takes about 32 seconds to parse in rapper, 164 seconds in the new rdfxml.py with N-Triples turned off, 180 seconds with N-Triples turned on, and 291 seconds in rdflib. In the old rdfxml.py it ran for over two hours, and then I had to kill the process because it was taking up too much memory.
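For comparison with the small-file ratios, the large-file timings can be put on the same "times as long as rapper" footing. The snippet below is just arithmetic on the figures quoted above (Python 3); nothing in it is newly measured.

```python
# Timings in seconds for the 20MB file, as quoted above.
timings = {
    "rapper": 32,
    "rdfxml.py (N-Triples off)": 164,
    "rdfxml.py (N-Triples on)": 180,
    "rdflib": 291,
}

baseline = timings["rapper"]
ratios = {name: round(secs / baseline, 1) for name, secs in timings.items()}
for name, ratio in ratios.items():
    print("%-28s %5.1fx rapper" % (name, ratio))
```

On the large file the new parser comes out at between five and six times rapper's time, against 1.9x on the small file, while rdflib is at about nine times.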

So that's why a rewrite was necessary, a question which Joe Presbrey (hey, I can spell his name now!) asked when I showed him a pre-release version of the rewrite for use with his walk.py. He's now adapted walk.py to use the new script, so it's rather nice to have it be tested out in the wilds of the Breadcrumbs Social Network. If you pick it up and use it for some RDF project, please do let me know. Thanks!

2007-11-02 16:36 UTC:

Today I wrote a Turtle parser, turtle.py. It passes all of the positive test cases except for test-28.ttl, which is a datatype canonicalisation case that seems to be broken. Dave says that the canonicalisation requirement is going to be removed soon anyway, so no problem there.

One thing that I'm planning to implement with the parser is a service to convert Turtle embedded in HTML documents into an XSLT stylesheet for use in GRDDL, per my little GRDDL hack idea the other day (also previously covered on Whits). I'm still pondering what kind of syntax I should use for it.

2007-11-02 17:00 UTC:

Whits is one year old today! That's 68,906 words of Gallimaufrous fun. Gallimaufral? Gallimaufric? Ga... well anyway the first post was apparently about debugging Tav, who still can't bust out a regular grammar for things. Nowadays the posts are a lot less stream-of-consciousness than they were, which mightn't actually be a good thing given that I'm writing a bit less as a result. Only a bit though, and it's probably balanced by the fact that I have about 100-150 readers to consider now. Which I won't, because you're all bonkers.

2007-11-02 17:22 UTC:

Oak galls. I ordered 10g of them yesterday from Saith Ffynnon, at a cost of £9 including postage, and they came today; you get nine of them for your £9, just in case you're wondering. Wikipedia has a brief article on them, with a much prettier picture than I'd be able to take of the ones that I was sent. I'd asked the proprietress of Saith Ffynnon ("seven wells" yn Gymraeg) about half a year ago whether she'd get any back in stock, and apparently she had to wait until now to pick them; I suppose they're seasonal.

I also asked in my rather hackneyed barely bilingual Welsh and English what the Welsh term for oak galls is, but I didn't get a response yet so apparently she doesn't speak bad bilingual Welsh/English soup. Back to the drawing board for me, then, or whatever it is that you learn language from. For what it's worth, I'd guessed "derwen afalau" with a little assistance from Geiriadur, but there are no results for that on Google. Dych chi'n gwybod? Yna ebost i fi os gwelwch yn dda!

2007-11-04 10:37 UTC:

I've just announced a new microformat: hTurtle. It lets you embed Turtle in HTML and XHTML documents, so you don't have to use class-based junk to shoehorn RDF in there. Here's a demo of how it works:

<head profile="http://www.w3.org/2003/g/data-view">
<link rel="transformation" href="http://inamidst.com/sw/hturtle/" /> [...]
<h1>The hTurtle Microformat</h1>
<!--{ <> dc:title "The hTurtle Microformat" . }-->

As you can see, it's compatible with GRDDL, but if you just use the hTurtle profile URI as the profile you can skip the whole GRDDL mechanism with the same kind of result, which is handy if you're using HTML rather than XHTML.

It uses the Turtle parser that I wrote the other day, and also the little GRDDL non-XSLT hack, as planned.
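As a rough idea of what a consumer has to do with markup like the demo above, here is a minimal Python 3 sketch that pulls the Turtle out of comments delimited by `<!--{` and `}-->`. The delimiters come from the demo; the regular expression and the function name `extract_hturtle` are assumptions for illustration and not the actual hTurtle service, which would also have to parse the Turtle and resolve prefixes.

```python
import re

# Matches comments of the form <!--{ ... }--> and captures the contents.
HTURTLE = re.compile(r'<!--\{(.*?)\}-->', re.DOTALL)

def extract_hturtle(html):
    """Return the Turtle snippets embedded in hTurtle-style comments."""
    return [snippet.strip() for snippet in HTURTLE.findall(html)]

html = ('<h1>The hTurtle Microformat</h1>\n'
        '<!--{ <> dc:title "The hTurtle Microformat" . }-->')
print(extract_hturtle(html))
# prints ['<> dc:title "The hTurtle Microformat" .']
```

Ordinary comments without the braces are left alone, which is what keeps the embedding invisible to browsers and harmless to other tools.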

2007-11-09 18:39 UTC:

The weather forecast project I was working on has moved forward. I've now got daily RDF/XML dumps of NOAA's GRIB forecasts for lots of large English cities, and a couple of places in America for thelsdj. This morning I was working on the ontology, and happened to ask in #swig about DanC's units work. That led to Danny Ayers asking the grand question: "is percent a unit?".

It was much harder to answer than I expected, but eventually Wikipedia's article on dimensionless quantity set us right. A percentage of cloud cover, for example, can be expressed as the unit cancellation of (say) steradians / steradians. You'll have to read the #swig logs for the rather drawn out details, but it was a fun bit of learning.
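The unit cancellation can be shown in a couple of lines of Python 3. The numbers here are hypothetical: the sketch assumes the visible sky is a hemisphere of 2π steradians and that cloud covers π steradians of it.

```python
import math

def cloud_cover_percent(covered_sr, sky_sr=2 * math.pi):
    """Ratio of two solid angles (sr / sr), scaled to a percentage.

    The steradians cancel, so the result is a dimensionless number.
    """
    return 100.0 * (covered_sr / sky_sr)

# Hypothetical case: cloud covering half of the hemisphere.
half_sky = cloud_cover_percent(math.pi)
print(half_sky)
```

Whatever unit the two solid angles are measured in, it drops out of the ratio, which is why the percentage behaves as a pure number rather than as a unit in its own right.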

I'm still drafting up the ontology, and am currently a bit annoyed about it having to be OWL Full since I'm using TimBL's units ontology which doesn't itself use OWL. I suppose I'll just have to maintain it as OWL Full, but I was wondering about publishing a subset which is OWL Lite and then having the OWL Full declarations in a separate file. Not sure if that sort of thing is frowned on or not, or how best to achieve it even if I were to do it.

Anyway, I shouldn't let it hold me back on publishing the data. The main thing that I need to do now is set it up on cron, but the problem is that for some reason the data is stale—it didn't update properly this morning—so I'm currently trying to fix that.

2007-11-20 09:00 UTC:

I've collected all the little RDF utilities that I'm working on together into a bundle that I call Trio, and it's coming on apace! There's now a SPARQL parser and a GRDDL parser (of which two the GRDDL parser is much more complete and production usable), and a decent web module and all kinds of coolness. Here's the example from the homepage showing just how easy it makes things:

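from trio import Graph, n3

# Load the FOAF file, then build a query pattern from N3
G = Graph('http://inamidst.com/sbp/foaf.rdf')
Q = n3('[ foaf:knows [ foaf:name ?name ] ]')
# Print each matched name, ordered by name
for b in G.select(Q, order='name'):
    print(b.name)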

Anyway, the biggest saga in all of this has been getting the revision control working properly. Joe Presbrey requested that I make it easy for people to keep up to date, so clearly something like darcs or Mercurial was on the cards. I investigated the options and checked out darcs and Mercurial more thoroughly and found it hard to distinguish between them, but eventually got sucked into fiddling more and more with Mercurial which meant I took more of a shine to it.

Until yesterday morning, that is. The problem is a bit complicated, but here's an edit of how I put it when I went into #darcs and rambled about it:

☆ ☄ ☂ ☃ ☆

I want to export my repo via http, and Mercurial generally requires that you use a special CGI to do this, so I tried setting it up on my server but it has an old version of Mercurial so the CGI wouldn't work. I can't update the version on the server and I thought I was screwed because of that, but thankfully you can do hg pull static-http without the CGI. So the first annoyance was that when you do an "hg pull http" and there's only a static-http repo there, it doesn't fall back and try static-http, it just exits with a Not Found error.

The main problem, however, is that static-http is really, really, really slow: on my tiny repo it takes 40s to check out using hg pull static-http. So I found a way around it:

$ mkdir trio && cd trio && hg init
$ hg unbundle http://inamidst.com/sw/trio/trio.hg
$ hg pull static-http://inamidst.com/sw/trio/
$ hg update

If you do that, it only takes 5s, and if you take out the generally unnecessary pull, it only takes 2s! Twenty times faster if you compile a bundle (the trio.hg file). Now it seems to me that it might be a good idea to have hg automatically cache a bundle, or rather, to have a command for doing so; so it might make it in .hg/bundle.hg and then static-http (or rather, http) should check to see if it's there and slurp from it if it finds it.

But rather than suggest that, I just decided to write it off as broken and use darcs instead. Heh. Because, as you [some dude in the channel] say, darcs works out of the box, and it's got darcs get --partial.

☆ ☄ ☂ ☃ ☆

Except that I didn't end up using darcs instead. The darcs get --partial thing lets you download a repo with only a certain set of the most recent changes, but I thought this used some magic to work out how much to send, when in fact it requires you to do a special checkin. Moreover, when someone wants to upgrade from a --partial repo to a full one, there's no command to do that; you have to delete your partial repo and download the whole thing from scratch. Suboptimal. If you don't upgrade like that, you'll find it hard to do anything with the earlier part of the tree.

Moreover, I found a simpler way to express that long unbundle process that was annoying me mainly for its verbosity:

To get the full Trio repository using Mercurial, do the 
following in a new directory:

$ hg init
$ hg unbun -u http://inamidst.com/sw/trio/trio.hg

And since all of the Trio history was in Mercurial already and Tailor, a program for converting from one type of repo to another, wouldn't work... it made sense to stick with Mercurial. I was very close to changing, though, as my ConvertingFromMercurial writeup on the darcs wiki shows. I also did suggest "Fallback to static-http if http not available" and "Clone from a Bundle" on the Mercurial wiki, since fixing those two things would make it a lot better in my opinion.

So yeah, I use Mercurial.

2007-11-20 09:28 UTC:

And here's how a friend explains my takeup of Mercurial in much more graphic terms: "Well alritey then, a half happy embrace of Mercurial. It's like, convenience sex. 'Well, she did live next door...'"

2007-11-21 11:23 UTC:

I filed a Python bug last night entitled, rather catchily, "UnicodeDecodeError that cannot be caught in narrow unicode builds". Okay, it's not as good as the Firefox bug that someone filed because of me, titled "Gross abuse of unicode combining characters fails to render properly", but it's a Python bug, and those don't come so easily. It's only the second Python bug I've filed, after my encoding bug from December 2005.

2007-11-27 11:18 UTC:

To the girl who balanced on the cobbles last night just like a mattress balances on a bottle of wine: ♡

2007-11-28 21:19 UTC:

So a problem that I've been having is that I want to use Trio for editing RDF, such as my FOAF file, but the natural way to make a rich editing interface which works anywhere these days is to use Javascript in the browser. This is the approach that Tabulator takes. With Trio, on the other hand, since it's Python, the only options open to me are line-mode in the terminal or a curses interface to make it like nano or emacs or what-have-you.

And then I realised: why not have Trio run an HTTP server that a Javascript application can communicate with? So obvious really, when you think about it, and though it does make editing files a little more complicated than it should be, it's still probably not even as bad as installing a Firefox extension and having to worry about Javascript permissions and so on. You do get the cost of worrying about HTTP security, but the benefit is that you get to use Python to do most of the heavy lifting, which is useful when you have a decent Python RDF API that you want to be using.
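The shape of that idea can be sketched with nothing but the Python 3 standard library. Everything specific here is an assumption for illustration: the `/triples` path, the JSON format, and the placeholder data are made up, and none of it uses Trio's actual API. The server binds to the loopback address only, which is one simple way to narrow the HTTP security worry.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder data standing in for a parsed graph; not Trio's data model.
TRIPLES = [
    ["http://example.org/#me", "http://xmlns.com/foaf/0.1/name", "Example Person"],
]

class GraphHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical endpoint that a Javascript editor could fetch.
        if self.path != "/triples":
            self.send_error(404)
            return
        body = json.dumps(TRIPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet

def start_server():
    # Bind to loopback only; port 0 asks the OS for a free port.
    server = HTTPServer(("127.0.0.1", 0), GraphHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

server = start_server()

# Stand-in for the browser side: fetch the triples over HTTP.
conn = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/triples")
fetched = json.loads(conn.getresponse().read().decode("utf-8"))
conn.close()

server.shutdown()
server.server_close()
print(fetched)
```

A real editor would also need write operations and some way of making sure that only the local user's browser can reach them, but the division of labour is the same: Python holds the graph and does the heavy lifting, and the page talks to it over HTTP.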

This server approach is, incidentally, what Seth Russell took with his Sailor application many many years ago, which was very much like the Tabulator only very few people heard about it when he released it.

2007-11-28 21:25 UTC:

I've started more RDF threads recently than I really have the resources to manage right now, but I thought I should probably note them down here so that I can keep track of them:

RDFa RFE: No Mandated DOCTYPE, 2007-11-22
Semantic Web User Agent Conformance, 2007-11-22
RDF Stylesheets, 2007-11-13
Datatype Defaulting in OWL, 2007-11-22
RDF Validator Bug: No nodeElement Validation, 2007-11-28
RDF Validator RFE: Machine Readable Output, 2007-11-28
Re: CWM Bug: Don't Canonicalise Lists, 2007-11-23

Lots of good stuff, some of which isn't presented all that well but I don't mind on that front particularly. As long as the right people get to see the applicable things, that's fine.

The datatype defaulting thread really did not go well—I followed up with Bijan privately about it and it doesn't seem like there's an easy way to do datatype coercion in RDF that plays nicely with OWL. So it might be best to either mandate a datatype, or to say "if you don't use a datatype, IFPs and so on will break".

The more I look into GRDDL and Microformats and RDFa and so on, the more galling all of that becomes. It did spin off the Semantic Web UA Conformance stuff which I think might be productive, but... ugh.

The validator bugs are connected with bolting an HTTP server onto Trio and using it to edit RDF; it'd be nice to have a push-button method of validating RDF in the resulting editor interface. Interestingly enough, I've only got a really sketchy implementation of a default RDF document view at the moment and even that was enough to spot an error in my FOAF file that I hadn't noticed before in emacs. So it's been handy already!