Gallimaufry of Whits
Being for the Month of 2007-11
These are quick notes taken by Sean B. Palmer on the Semantic Web, Python
and Javascript programming, history and antiquarianism, linguistics and
conlanging, typography, and other related matters. To receive these bits of
dreck regularly, subscribe to the feed. To browse other
months, check the contents.
I've written a new RDF/XML
parser. The previous one that I wrote, back in 2003, was
released before the RDF/XML Syntax Specification went to REC, and was not fully
compliant. This new one, though I'm still testing it, not only passes all of
the positive parser cases in the RDF Test Cases suite, but is also faster and
requires much less memory.
You should be able to use the new module as a standalone RDF/XML parser in
much the same way as the previous one, and you should also be able to use it to
convert files to N-Triples on the command line (give it a URI as the only
argument). There's no documentation yet because it's still in testing, but the
functions you'll want to use are defined at the bottom of the file.
It's a very interesting bit of kit under the covers. I'd wanted to use the
xml.dom.pulldom module, which gives a stream of DOM nodes which you can then
expand, but the problem is that when you expand the children of an element, you
get a list. I wanted a generator, which only loads into memory one child at a
time. This proves to be an extremely tricky thing to do, but I managed to make
it work just fine on top of expat.
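To make the list-versus-generator difference concrete, here is a minimal sketch rather than the parser's actual code. The function names children_as_list and iter_children are made up for illustration, and the generator yields only element names, which is much simpler than yielding real nodes.

```python
import xml.parsers.expat
from xml.dom import pulldom


def children_as_list(source, parent="root"):
    """pulldom style: expandNode() builds the whole subtree in memory."""
    events = pulldom.parseString(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == parent:
            events.expandNode(node)  # every child is loaded at this point
            return [c.tagName for c in node.childNodes
                    if c.nodeType == c.ELEMENT_NODE]
    return []


def iter_children(chunks):
    """expat style: yield the root's direct children as chunks are fed in."""
    pending = []  # child names seen but not yet handed to the caller
    depth = 0

    def start(name, attrs):
        nonlocal depth
        depth += 1
        if depth == 2:  # a direct child of the root element
            pending.append(name)

    def end(name):
        nonlocal depth
        depth -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    for chunk in chunks:
        parser.Parse(chunk, False)
        while pending:
            yield pending.pop(0)
    parser.Parse("", True)  # finish the document; raises if it is malformed
    while pending:
        yield pending.pop(0)
```

With the list version everything under the parent is in memory before you see the first child. With the generator, only the children found in the chunks fed so far are held at any one time.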
One of the nice side-effects of making an XML parser that works in this way
is that you can then just follow the processing model in the RDF/XML
specification very closely. In doing that, I found two bugs (missing
parent accessor, incorrect
resolve(...) calls) and a typo ("tiple")
which Ivan Herman, the Semantic Web Lead, very quickly added to the RDF Errata. Thanks
Ivan!
I also found a more worrying bug in N-Triples, to do with grammar
ambiguity, which is solvable by forcing the escaping of ">" in the
absoluteURI production. That's also been added to the errata, and my new
RDF/XML parser emits N-Triples using that further escaping.
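A rough sketch of what that extra escaping might look like when writing out a URI. The function name ntriples_uriref is hypothetical, and the choice of the \u003E form for ">" is an assumption based on the usual N-Triples \uXXXX escapes, not something taken from the parser or the errata text.

```python
def ntriples_uriref(uri):
    # Serialise a URI as an N-Triples <uriref>. Assumption: ">" is written
    # as the six characters backslash-u-0-0-3-E, so that the closing ">" of
    # the uriref is never ambiguous.
    out = []
    for ch in uri:
        code = ord(ch)
        if ch == ">":
            out.append("\\u003E")
        elif ch == "\\":
            out.append("\\\\")
        elif 0x20 <= code <= 0x7E:
            out.append(ch)  # printable ASCII passes through untouched
        elif code <= 0xFFFF:
            out.append("\\u%04X" % code)
        else:
            out.append("\\U%08X" % code)
    return "<" + "".join(out) + ">"
```

So a URI containing ">" comes out with exactly one unescaped ">" at the very end, which is what removes the ambiguity.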
XML Canonicalization was kinda tricky, and I'm not yet entirely sure that
I've implemented it correctly. I know that it's better at comments than rapper
is, at least, but I'm not sure about attribute ordering and such. There's only
one case for XML c14n in the RDF Test Case suite, and it's pretty trivial.
As for speed, parsing my FOAF file about 250 times over takes about 1.9x as
long as rapper, but only 0.6x as long as rdflib. This is pretty encouraging. On
small files it's a bit faster than my previous parser, but on large files
there's a huge difference: the 20MB wordnet synset takes about 32 seconds to
parse in rapper, 164 seconds in the new rdfxml.py with N-Triples output turned
off, 180 seconds with it turned on, 291 seconds in rdflib, and in the old
rdfxml.py it ran for over two hours before I had to kill the process because it
was taking up too much memory.
So that's why a rewrite was necessary, a question which Joe Presbrey (hey, I
can spell his name now!) asked when I showed him a pre-release version of the
rewrite for use with his walk.py. He's now adapted walk.py to use the new
script, so it's rather nice to have it be tested out in the wilds of the
Breadcrumbs Social Network. If you pick it up and use it for some RDF project,
please do let me know. Thanks!
Today I wrote a Turtle
parser, turtle.py. It
passes all of the positive test cases except for
test-28.ttl, which is a datatype canonicalisation case that seems to be broken.
Dave says
that the canonicalisation requirement is going to be removed soon anyway, so no
problem there.
One thing that I'm planning to implement with the parser is a service to
convert Turtle embedded in HTML documents into an XSLT stylesheet for use in
GRDDL, per my little GRDDL
hack idea the other day (also previously covered on Whits). I'm
still pondering what kind of syntax I should use for it.
Whits is one year old today! That's 68,906 words of Gallimaufrous fun.
Gallimaufral? Gallimaufric? Ga... well anyway the first post was apparently
about debugging Tav, who still can't bust out a regular grammar for things.
Nowadays the posts are a lot less stream-of-consciousness than they were, which
mightn't actually be a good thing given that I'm writing a bit less as a
result. Only a bit though, and it's probably balanced by the fact that I have
about 100-150 readers to consider now. Which I won't, because you're all
bonkers.
Oak galls. I ordered 10g of them yesterday from Saith Ffynnon, at a cost
of £9 including postage, and they came today—you get nine of them for your
£9, just in case you're wondering. Wikipedia has a brief article on them with a
much prettier picture already than what I'd be able to take of the ones that I
was sent. I'd asked the proprietress of Saith Ffynnon ("seven wells" yn Gymraeg)
about half a year ago whether she'd get any back in stock, and apparently she
had to wait until now to pick them; I suppose they're seasonal.
I also asked in my rather hackneyed barely bilingual Welsh and English what
the Welsh term for oak galls is, but I didn't get a response yet so apparently
she doesn't speak bad bilingual Welsh/English soup. Back to the drawing board
for me, then, or whatever it is that you learn language from. For what it's
worth, I'd guessed "derwen afalau" with a little assistance from Geiriadur, but there are no
results for that on Google. Dych chi'n gwybod? Yna ebost i fi os gwelwch yn dda! ("Do you know? Then email me, please!")
I've just announced
a new microformat: hTurtle. It
lets you embed Turtle in HTML and XHTML documents, so you don't have to use
class-based junk to shoehorn RDF in there. Here's a demo of how it works:
<head profile="http://www.w3.org/2003/g/data-view">
<link rel="transformation" href="inamidst.com/sw/hturtle/" /> [...]
<h1>The hTurtle Microformat</h1>
<!--{ <> dc:title "The hTurtle Microformat" . }-->
As you can see, it's compatible with GRDDL, but if you just use the hTurtle
profile URI as the profile you can skip the whole GRDDL mechanism with the same
kind of result, which is handy if you're using HTML rather than XHTML.
It uses the Turtle
parser that I wrote the other day, and also the little GRDDL non-XSLT hack,
as planned.
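For a feel of the extraction step, here is a minimal sketch that pulls Turtle out of comments shaped like the one in the demo. It assumes the `<!--{ ... }-->` delimiters are the whole of the embedding syntax, which the demo suggests but doesn't prove. The name extract_turtle is invented for this example and this isn't the real hTurtle code.

```python
import re

# Assumption: an hTurtle block is an HTML comment whose body is wrapped in
# curly braces, exactly as in the demo above.
HTURTLE_COMMENT = re.compile(r"<!--\{(.*?)\}-->", re.DOTALL)


def extract_turtle(html):
    """Return the Turtle fragments embedded in an HTML string, in order."""
    return [body.strip() for body in HTURTLE_COMMENT.findall(html)]
```

Each fragment would then be handed to a Turtle parser. Ordinary comments without the braces are left alone.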
The weather forecast
project I was working on has moved forward. I've now got daily RDF/XML dumps of
NOAA's GRIB forecasts for lots of large English cities, and a couple of places
in America for thelsdj. This morning I
was working on the ontology, and happened to ask in
#swig about DanC's units work. That led
to Danny Ayers asking the grand question: "is percent a unit?".
It was much harder to answer than I expected, but eventually Wikipedia's
article on dimensionless
quantity set us right. A percentage of cloud cover, for example, can be
expressed as the unit cancellation of (say) steradians / steradians. You'll
have to read the #swig logs for the rather drawn out details, but it was a fun
bit of learning.
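The cancellation can be sketched in a few lines by treating a unit as a map from base units to exponents. This representation and the divide helper are illustrative assumptions, not part of the ontology or of DanC's units work.

```python
def divide(numerator, denominator):
    """Divide two units given as {base unit: exponent} maps."""
    result = dict(numerator)
    for unit, exponent in denominator.items():
        result[unit] = result.get(unit, 0) - exponent
    # Base units whose exponents cancel to zero drop out entirely.
    return {unit: exp for unit, exp in result.items() if exp != 0}


STERADIAN = {"sr": 1}

# Covered solid angle over total solid angle: nothing is left over,
# so cloud cover is a dimensionless quantity.
CLOUD_COVER = divide(STERADIAN, STERADIAN)
```

An empty map means dimensionless, and under this view percent is just a scale factor of 1/100 applied to such a quantity.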
I'm still drafting up the ontology, and am currently a bit annoyed about it
having to be OWL Full since I'm using TimBL's units ontology which doesn't
itself use OWL. I suppose I'll just have to maintain it as OWL Full, but I was
wondering about publishing a subset which is OWL Lite and then having the OWL
Full declarations in a separate file. Not sure if that sort of thing is frowned
on or not, or how best to achieve it even if I were to do it.
Anyway, I shouldn't let it hold me back on publishing the data. The main
thing that I need to do now is set it up on cron, but the problem is that for
some reason the data is stale—it didn't update properly this morning—so I'm
currently trying to fix that.
I've collected all the little RDF utilities that I'm working on together
into a bundle that I call Trio, and
it's coming on apace! There's now a SPARQL parser and a GRDDL parser (of the
two, the GRDDL parser is much more complete and production-usable), a decent web
module, and all kinds of coolness. Here's the example from the homepage showing
just how easy it makes
things:
from trio import Graph, n3
G = Graph('http://inamidst.com/sbp/foaf.rdf')
Q = n3('[ foaf:knows [ foaf:name ?name ] ]')
for b in G.select(Q, order='name'):
    print b.name
Anyway, the biggest saga in all of this has been getting the revision
control working properly. Joe Presbrey requested that I make it easy for people
to keep up to date, so clearly something like darcs or Mercurial was on the
cards. I investigated
the options and checked out darcs and Mercurial more thoroughly and found it
hard to distinguish between them, but eventually got sucked into fiddling
more and more with Mercurial which meant I took more of a shine to it.
Until yesterday morning, that is. The problem is a bit complicated, but
here's an edit of how I put it when I went into #darcs and rambled about
it:
☆ ☄ ☂ ☃ ☆
I want to export my repo via http, and Mercurial generally requires that you
use a special CGI to do this, so I tried setting it up on my server but it has
an old version of Mercurial so the CGI wouldn't work. I can't update the
version on the server and I thought I was screwed because of that, but
thankfully you can do hg pull static-http without the CGI. So the first
annoyance was that when you do an "hg pull http" and there's only a static-http
repo there, it doesn't fall back and try static-http, it just exits with a Not
Found error.
The main problem, however, is that static-http is really, really, really
slow: on my tiny repo it takes 40s to check out using hg pull static-http. So I
found a way around it:
$ mkdir trio && cd trio && hg init
$ hg unbundle http://inamidst.com/sw/trio/trio.hg
$ hg pull static-http://inamidst.com/sw/trio/
$ hg update
If you do that, it only takes 5s, and if you take out the generally
unnecessary pull, it only takes 2s! Twenty times faster if you compile a bundle
(the trio.hg file). Now it seems to me that it might be a good idea to have hg
automatically cache a bundle, or rather, to have a command for doing so; so it
might make it in .hg/bundle.hg and then static-http (or rather, http) should
check to see if it's there and slurp from it if it finds it.
But rather than suggest that I just decided to write it off as broken and
use darcs instead. Heh. Because, as you [some dude in the channel] say, darcs
works out of the box, and it's got darcs get --partial.
☆ ☄ ☂ ☃ ☆
Except that I didn't end up using darcs instead. The darcs get --partial
thing lets you download a repo with only a certain set of the most recent
changes, but I thought this used some magic to work out how much to send when
in fact it requires you to do a special checkin. Moreover, when someone wants
to upgrade from a --partial repo to a full one, you can't just issue some
command to do that other than deleting your partial repo and downloading the
whole thing from scratch. Suboptimal. If you don't upgrade like that, you'll
find it hard to do anything with the earlier part of the tree.
Moreover, I found a simpler way to express that long unbundle process that
was annoying me mainly for its verbosity:
To get the full Trio repository using Mercurial, do the
following in a new directory:
$ hg init
$ hg unbun -u http://inamidst.com/sw/trio/trio.hg
And since all of the Trio history was in Mercurial already and Tailor, a
program for converting from one type of repo to another, wouldn't work... it
made sense to stick with Mercurial. I was very close to changing, though, as my
ConvertingFromMercurial
writeup on the darcs wiki shows. I also did suggest
"Fallback to static-http if http not available" and "Clone from a Bundle" on
the Mercurial wiki, since fixing those two things would make it a lot better in
my opinion.
So yeah, I use Mercurial.
And here's how a friend explains my takeup of Mercurial in much more graphic
terms: "Well alritey then, a half happy embrace of Mercurial. It's like,
convenience sex. 'Well, she did live next door...'"
I filed a Python bug last
night entitled, rather catchily, "UnicodeDecodeError that cannot be caught in
narrow unicode builds". Okay, it's not as good as the Firefox bug that
someone filed because of me, titled "Gross abuse of unicode combining
characters fails to render properly", but it's a Python bug, and those don't
come so easily. It's only the second Python bug I've filed, after my encoding
bug from December 2005.
To the girl who balanced on the cobbles last night just like a mattress
balances on a bottle of wine: ♡
So a problem that I've been having is that I want to use Trio for editing
RDF, such as my FOAF file, but the natural way to make a rich editing interface
which works anywhere these days is to use Javascript in the browser. This is
the approach that Tabulator takes. With Trio on the other hand, since it's
Python the only options open to me are line-mode in the term or using a curses
interface to make it like nano or emacs or what-have-you.
And then I realised: why not have Trio run an HTTP server that a Javascript
application can communicate with? So obvious really, when you think about it,
and though it does make editing files a little more complicated than it should
be, it's still probably not even as bad as installing a Firefox extension and
having to worry about Javascript permissions and so on. You do get the cost of
worrying about HTTP security, but the benefit is that you get to use Python to
do most of the heavy lifting, which is useful when you have a decent Python RDF
API that you want to be using.
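A bare-bones sketch of the shape this could take, using only the standard library and written for a current Python 3. The /triples path, the TRIPLES stand-in data and the helper names are all assumptions made for the example, and none of it comes from Trio itself.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a graph held on the Python side; a real editor would ask an
# RDF API for this instead of using a hard-coded list.
TRIPLES = [["<#me>", "foaf:name", '"Example Person"']]


class GraphHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/triples":
            self.send_error(404)
            return
        body = json.dumps(TRIPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_server(port=0):
    """Serve on localhost only; port 0 asks the OS for any free port."""
    server = HTTPServer(("127.0.0.1", port), GraphHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_triples(server):
    """What the browser-side script would do, written in Python for testing."""
    host, port = server.server_address[:2]
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/triples")
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()
```

Binding to 127.0.0.1 is the cheapest answer to the HTTP security worry, since nothing outside the machine can reach the server. The Javascript side would fetch the same JSON and post edits back.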
This server approach is, incidentally, what Seth Russell took with his
Sailor application many many years ago, which was very much like the Tabulator,
only very few people heard about it when he released it.
I've started more RDF threads recently than I really have the resources to
manage right now, but I thought I should probably note them down here so that I
can keep track of them:
Lots of good stuff, some of which isn't presented all that well but I don't
mind on that front particularly. As long as the right people get to see the
applicable things, that's fine.
The datatype defaulting thread really did not go well—I followed up with
Bijan privately about it and it doesn't seem like there's an easy way to do
datatype coercion in RDF that plays nicely with OWL. So it might be best to
either mandate a datatype, or to say "if you don't use a datatype, IFPs and so
on will break".
The more I look into GRDDL and Microformats and RDFa and so on, the more
galling all of that becomes. It did spin off the Semantic Web UA Conformance
stuff which I think might be productive, but... ugh.
The validator bugs are connected with bolting an HTTP server onto Trio and
using it to edit RDF; it'd be nice to have a push-button method of validating
RDF in the resulting editor interface. Interestingly enough, I've only got a
really sketchy implementation of a default RDF document view at the moment and
even that was enough to spot an error in my FOAF file that I hadn't noticed
before in emacs. So it's been handy already!
In October 2006 I submitted an antedating of "purdonium", a decorative coal
scoop, to the OED. I've submitted many more antedatings, but this was my first
so it's very pleasing to see that it's just turned up as the oldest quote under
purdonium, n. in the OED DRAFT REVISION Sept. 2007:
"1847 Times 2 Dec. 3/3 (advt.) Messrs. Bell, Massey and
Co.'s new Purdonium, or Coal Scoop, ornamented with flowers."
Sean B. Palmer, inamidst.com