Gallimaufry of Whits
Being for the Month of 2007-10
These are quick notes taken by Sean B. Palmer on the Semantic Web, Python
and Javascript programming, history and antiquarianism, linguistics and
conlanging, typography, and other related matters. To receive these bits of
dreck regularly, subscribe to the feed. To browse other
months, check the contents.
Yesterday Adam and I, inspired by "rancid",
Googled for cool words and awesome words using "* is a cool / awesome word" as
input. Here's what we got:
Cool: shenanigans, fuck, monster, gnarly, cahoots, vagina,
confabulation, shopomore, peruse, snazzy, asshat, engma, lurch, qwerty, pestle,
spiffy, buff, cinchy, science, nahbubuhay, hemogloben, embrollment, and
gwrthwynebwyr.
Awesome: dingo, tofurkey, whizgiggle, squircle, assfuckwitards,
fuckton, awesome, Tulonic, jazztastic, quadroon, homosinuality, poppycock,
compossible, smackies, floozy, sleuthlikedly, nawkish, slacktimony,
incrimidating, shitique, omnichronic, dissimulate, codswallop, Potterotica,
humptitude, doiley, bagarap, neathage, jobber, and gnarly.
Funnily enough, I feature already in both results, having nominated pestle
as a cool word and bagarap (right here on Gallimaufry of
Whits) as an awesome word. I'd also like to nominate jsled's "shitrude" as an
awesome word, since we're trying to proselytise it somewhat. Hmm, proselytise
is good too. We'll call that one cool, shall we?
I have this strange relationship with Amaya where I don't really like it but all
the same it's too handy to avoid all the time, so I end up trying it out every
year or two for some project. I decided to start taking some notes about some
early history of Christianity stuff that's sucking me in at the moment, and
Amaya was the obvious choice to just-take-notes without having to worry about
the syntax too much.
Last time I did the whole Amaya thing I ended up with On Using Amaya, which is basically a
page-after-page rant about how much Amaya sucks, down to some of the specific
bugs. With the latest Amaya I'm already encountering some of the old friends,
but overall I'm a bit more optimistic than before; it seems a bit more solid on
OS X now than it did on Windows a couple or so years ago.
Still, using my stripped-down xhtml.rnc in nxml-mode in Emacs is so good that
there's little point in heavy Amaya use. Just gotta make sure that the source
is perfect.
I'm tempted to write to Caitlin Moran saying "I
wanna have your babies!" But I think she'd just reply "Fine! Take them!"
Yesterday I saw a
sun pillar, and took a photo of it. This was
well after the sun had set: the column of yellow light that you can see, the
sun pillar, is caused by the reflection of sunlight off ice crystals in the
atmosphere, not directly by the sun itself. It actually got brighter and longer
for a while after the sun had set, and then gradually disappeared.
Sun pillars are apparently not all that rare, but this is the first time
that I recall seeing one. Venus pillars, on the
other hand, are exceedingly rare—make sure you get a snap if you ever see
one!
Fade teil thee zo lournagh, co Joane, zo knaggee?
Th' weithest all curcagh, wafur, an cornee.
Lidge w'ouse an a milagh, tis gaay an louthee:
Huck nigher; y'art scuddeen; fartoo zo hachee?
—"An Old Song", in Yola
Yola is a derivative of Middle English that survived until comparatively
modern times. I traced this song to the Annual Register of 1789
(2nd ed., printed 1802). The Bod has the volumes up to 1778 online,
in lovely high quality scans, but you'll have to rummage in Google Books for
others.
Word of the day: "mismathsed".
So, RDF Templating. Simon Rozet was asking
about a TV quotes ontology for RDF last night, and suggested that I revive my
old RDF quotes project from
2001. The workflow for that is essentially input and screenscrape data, merge
data using RDF rules, and output using some kind of templating.
When I did this originally, I used XSLT for the templating language, and
basically did a scrape of the RDF/XML. But that doesn't scale well, so I poked
about on Google this morning for existing RDF templating solutions. My aim was
to find some off-the-shelf components to use, so that I could document the
process and have others who want to do a similar thing say "oh, that looks
easy; I can do that!".
The best resource that I found on the problem is the ESW Wiki's page on RdfPath: "If
you want to transform RDF to XML/HTML/Text, read on!". The solutions break
down into two categories: those based on XSLT, and Fresnel. Beyond these
two language-agnostic approaches, the RdfPath page doesn't mention the many
homespun attempts at RDF templating, which of course I'm not as interested in
given that I want to use off-the-shelf components.
Whilst I investigated some of the options listed on RdfPath, I documented
the paper trail on #swig, and others pitched in to help me, especially m94mni,
dorian, kwijibo, iand, chimezie, and AndyS (thanks guys!). We generated an
enormous amount of discussion and blogged quite a
few things along the way too.
The general conclusion, for my own requirements, is that I found a decent
off-the-shelf templating workflow, and we started to converge on a more general
processing model that might help others with slightly more complex requirements
than the simple presentation of a quotes database. Essentially, I'm thinking
about using SPARQL (however much I
dislike it) to produce XML
bindings that I can then import, process, and output using XSLT. The
general processing model is the decoupled one of Query -> Merge? -> Process? ->
Output. You basically want to use the best quality (simplest, easiest to use,
works on your system) components that'll play nicely with one another.
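To make the shape of that concrete, here's a minimal sketch of the pipeline in
Python, assuming roqet (from Rasqal) and xsltproc are installed; the query,
data, and stylesheet filenames are just placeholders:

import subprocess

def sparql_to_output(query, data, stylesheet):
    # Query: ask roqet for its bindings in the SPARQL Query Results XML Format.
    roqet = subprocess.Popen(['roqet', '-r', 'xml', '-D', data, query],
                             stdout=subprocess.PIPE)
    bindings = roqet.communicate()[0]
    # Process and Output: hand the XML bindings to an XSLT stylesheet;
    # "-" tells xsltproc to read the document to transform from stdin.
    xslt = subprocess.Popen(['xsltproc', stylesheet, '-'],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    return xslt.communicate(bindings)[0]

print(sparql_to_output('quotes.rq', 'quotes.rdf', 'quotes.xsl'))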
Once settled on the use of SPARQL and XSLT to do the templating, I figured
that I'd need to model the TV quotes; in other words to solve the question that
Simon had last night: is there an existing quotes ontology, and if not, what
should a quotes ontology look like?
I thought that someone, possibly Danny Ayers, had done such an ontology
before but I couldn't find any evidence of that on the web. I gave Simon some
advice
about modelling based around the OnlyModelExploitableData and
Don'tWorryBeCrappy design patterns, but ironically since then I've been
starting to worry about being too crappy.
For example, my original ontology just threw the quote in a simple block of
text, such as the following:
AA Lady: And we have sugar cookies and marshmallows
Homer: These sugar cookies you speak of... are they symbolic?
AA Lady: They're on that table, over there [points]
Homer: Aw... all the way over there? I don't want to walk all the way over
there... Anything that takes 12 steps isn't worth doing! Get it? Heh? Steps?
[Cut to a scene of Homer waking up in some bushes rubbing his head]
Now, that's fine, but when you come to render it in HTML, how can you tell
where the line breaks should go? Really, instead of bunging it into a big block
of text, you should put it in a list:
("AA Lady: And we have sugar cookies and marshmallows"
"Homer: These sugar cookies you speak of... are they symbolic?"
"AA Lady: They're on that table, over there [points]"
"""Homer: Aw... all the way over there? I don't want to walk all the way over
there... Anything that takes 12 steps isn't worth doing! Get it? Heh? Steps?"""
"[Cut to a scene of Homer waking up in some bushes rubbing his head]")
And, indeed, it's not really a quote of a single character but a subclass of
quote: dialogue.
But now, what about querying this out? SPARQL has, rather notoriously, no
facilities that I'm aware of for the special treatment of lists. It does,
however, allow multiple OPTIONAL constructs, so you can do something like
this:
WHERE {
    OPTIONAL { ?list ?p :Test }
    OPTIONAL { ?list rdf:first ?a }
    OPTIONAL { ?list rdf:rest ?r1 }
    OPTIONAL { ?r1 rdf:first ?b }
    OPTIONAL { ?r1 rdf:rest ?r2 }
    OPTIONAL { ?r2 rdf:first ?c }
}
Except that when I actually tried that in roqet (having tried CWM's SPARQL
stuff, which doesn't even output the XML bindings), it didn't work because of a
bug
in roqet which I subsequently reported to dajobe. So anyway, it should work,
but it goes to show that list munging still isn't done all that often
(otherwise someone would've spotted this before, no?), which means that it's
fragile territory, best avoided wherever that's easily possible.
So I guess actually I'm not avoiding the Don'tWorryBeCrappy pattern; in
fact, quite the contrary, I'm trying to be as crappy as I can be, but without
it breaking entirely somewhere along the road. What's the simplest thing
that'll work in this model?
On how to represent extended dialogue in RDF, I realised that something like
the following will probably work sufficiently:
[ :play :Hamlet; :dialogue (
    [ :by _:Pol; :quote "Doe you knowe me my Lord?"; :tln "1210" ]
    [ :by _:Ham; :quote "Excellent well, you are a Fishmonger."; :tln "1211" ]
)] .
But then a sideproblem to this is that to identify each line unambiguously,
you need a combination of the :tln (the Norton Through Line Number, a standard
way of referring to lines in Shakespearean plays) and the :play. To that end I
wrote some rules that can propagate the play to each of the lines in the
dialogue:
{ [ :dialogue ?d; :play ?play ] }
    => { ?d :membersFromPlay ?play } .

{ [ :membersFromPlay ?play;
      rdf:first ?member;
      rdf:rest ?subd ] }
    => { ?member :play ?play .
         ?subd :membersFromPlay ?play } .
But then say someone comes along and adds an annotation about one of the
lines from the play with this uniquely identifying information:
[ :tln "1211"; :play :Hamlet; :note
"Davies says (C18) this means 'You are a fisherman, and angle for me'" ] .
How can we merge the two?
This is a known, if not fully investigated, problem in RDF which has
culminated in the idea of a CIFP, a
Composite Inverse Functional Property. The page just linked to has some details
on the current state of the art, but when I tried Jos de Roo's implementation
out, I found that it only works in his Euler; there isn't a generic solution
that works in CWM.
So I wrote
one: CIFP Rules for CWM. Lots
of people pitched in and helped on #swig again, which is great, especially
Henry Story who has been pushing this problem to a resolution for years now.
Sandro Hawke was the first person I recall raising it.
The trouble is that the core problem still remains: I don't know how to
specify that I have a CIFP, except by using the mechanism that we made up. We
don't know what ontological ramifications it has, or how nicely it plays with
OWL. Bijan Parsia pitched in to say that it's being worked on, however, so at
least there is the possibility of a resolution at some point. The question is
what to do meanwhile.
All of this was after I slammed
SPARQL, quite rightly I hope, for its accessingCollections issue and the fact
that it prevents me from using SPARQL usefully on anything with an rdf:List in
it, which of course includes my dialogue model. Of course it's possible to use
CWM, but why shouldn't there be a lightweight solution for this too? And a
standardised lightweight solution, moreover.
Postboxes in the UK all have the name of the current monarch stamped on
them, and whenever I go by them I have a look to see how old they are. We have
some Victorian ones just down the road, but whilst inspecting one box the other
day I wondered how many Edward VIII ones there might be, given that he only
reigned for a handful of months. Are there any at all?
Indeed there are. Long story short, I spent three hours this afternoon
making a List of Edward VIII
Postboxes. I found 57, including many photos of them, but there are said to
be around 150 still out there. So there you go.
One of the biggest problems I've had in creating a Quotes Ontology is that
it's difficult to find prior art, especially when Swoogle is down. This got me wondering
about the more general question of what RDF vocabularies are being used—what
kind of cool information is out there? So I've been working on that for several
days now, but I haven't forgotten the Quotes Ontology.
Danny Ayers wrote about my
templating workflow, and has been trying to convert the rules part of the
process into SPARQL. In doing so, he's come up against the same
SPARQL bug that so thoroughly annoys me as well. Some
SPARQL folk are on the case trying to help out, but I'm not aware of Danny
having made any progress yet.
In any case, I think it doesn't matter too much because my workflow is
Process -> Query -> Template, and he's trying to change more of the Process
part (which I'm doing with cwm) into the Query part (which I'm doing with
SPARQL). I got the workflow producing the kind of output I wanted it to, with
just the following classes and properties in the Quotes Ontology:
:Quote, :Dialogue, :DialogueList, :quote, :dialogue, :prefix, :by,
:from.
In general, a :Quote is a conceptual quote from some work (a book, a play, a
TV show) and you link it to the textual representation of that quote using
:quote. The :from property links to the work that the quote is from, and the
:by gives the person or character who uttered it. If you have two or more
people chatting back and forwards you can use :Dialogue, which links to a
:DialogueList (a list of Quote instances) via the :dialogue property.
It all sounds a bit "wha?", but it makes perfect sense once you see a few
examples, and this is the best structure that I could come up with which
models, and only models, the information that I want to exploit for
presentation. I've been careful to make sure that the properties and classes
are reusable by using OWL restrictions instead of (ironically) stricter domain
and range constraints, and I of course allow for extension.
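Since the ontology isn't published yet, here's a purely illustrative
consumer's-eye sketch of the model, with a placeholder namespace, rdflib
assumed, and :by abbreviated to plain literals to keep it short: parse a
:Dialogue, then walk its list of :Quote instances via rdf:first and rdf:rest.

from rdflib import Graph, Namespace, RDF

Q = Namespace('http://example.org/quotes#')    # placeholder namespace

doc = """
@prefix : <http://example.org/quotes#> .
:d a :Dialogue; :from :Hamlet; :dialogue (
    [ a :Quote; :by "Polonius"; :quote "Doe you knowe me my Lord?" ]
    [ a :Quote; :by "Hamlet"; :quote "Excellent well, you are a Fishmonger." ]
) .
"""

graph = Graph()
graph.parse(data=doc, format='turtle')

for dialogue in graph.subjects(RDF.type, Q['Dialogue']):
    node = graph.value(dialogue, Q['dialogue'])     # head of the :DialogueList
    while node and node != RDF.nil:
        quote = graph.value(node, RDF.first)
        print('%s: %s' % (graph.value(quote, Q['by']),
                          graph.value(quote, Q['quote'])))
        node = graph.value(node, RDF.rest)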
One slight annoyance is that my awesome list-typing recipe pushes the
ontology into the OWL Full species, though I asked
Bijan Parsia "if people ask if this is an OWL Lite ontology, what do I tell
them in three words or less?", and he replied "I would tell them they shouldn't
care" and that "I would also say that it's nominally owl full, but in a
relatively harmless way".
Once Swoogle came back online, I found that Kevin Reid
had previously worked on his own quotes ontology
back in October 2003, around the time I went to Bristol for the FOAF meeting,
and he'd forgotten to tell me though he had been meaning to. It's very
interesting to compare its approach to mine: it uses rdf:Seq to hold lines,
uses domain and range, and doesn't seem to have a particularly need-driven
structure though I may be wrong.
One of the biggest things I learned is that my modelling advice is really
good, but really
difficult to follow: the advice being a blend of the "Don't Worry, Be
Crappy" and "Only Model What You Want To Exploit" design patterns. It's so easy
to get into thinking about the model so much that you get into a semantic
metawank and soon enough you're doing philosophy rather than computer science.
You need to avoid that if you want to actually produce something.
I still haven't published the quotes ontology yet, because I'm still not
really finished with it; and I'm ignoring Release Early, Release Often in
favour of my Let It Mature In The Cellar. I did get /ns/ at purl.org though, so
I have some nice namespace territory available to me.
Getting stats readouts of my Semantic Web Survey is starting to take a long
time, but the most recent full set that I got last night was:
- sites: 2768
- docs: 46262
- triples: 7743021
- subtriples: 14602
- cache size: 255M
That's quite impressive for so few days' spidering. It's getting hard to
manage that much data, and since I have five Python processes gently hammering
the web I can't really do much else with my account on bia anyway. Several
times now I've gone into impolite amounts of CPU or memory, but I think I've
fixed the problems which gave rise to that and now I'm able to have them
crawling overnight without failure. Christopher
Schmidt, who runs the server, has been very patient—thanks crschmidt!
So I'm almost definitely at 10m triples now. What to do with them? Well,
sadly, the first step when I get to a level of completion that I'm happy with
will be to crawl them all over again. I've learned a lot about real crawling,
and one of the main pieces of advice is: make sure you get all the data that
you want to use first time around. Oddly enough, I actually realised this
before I did the first set of crawling and tried to make sure that I saved all
data that I could think of, but it turned out that I missed some.
At any rate, I want to recrawl because I want to tweak the parsers a bit.
For a start, the version of the rapper RDF parser that's on bia
seems to be quite ancient and gives some output which is malformed, so I want
to try using a more recent locally installed version. I'm also thinking about
allowing @this in n3, because that's cacking out a lot of the n3 that I could
potentially crawl.
I already had to recrawl all the Notation3 documents once to allow for the
garnering of subtriples; which is to say, triples that exist in formulae not
asserted in the document.
With all that aside though, what really am I going to do with the data? I'm
looking for something I've been calling Semantic Diversity. The whole point of
this exercise is that I want to find what kind of interesting data is on the
Semantic Web already, if there is any, so I'm going for something that's very
different to the extremely linear approaches to Semantic Web Surveying at the
moment. For example, you tend to get reports like percentage of documents which
are RDF/XML, the most widely used properties, and so on. I'm not looking for
which properties are most widely used, though I'm sure I'll report on that too.
I'm looking for interesting ways that people are mixing data. I want
to find out what ontologies play well together, how people are using ontologies
differently to the way they've been specified, and so on. Taking a much more
descriptive and relativist approach to the whole shaboodle, if you want to hear
it in those kind of terms; but it's not even just that, because I'm interested
in how it plays with the prescriptive and objectivist approaches
too.
I also want to be able to search through the data for interesting things
quite fast, so I'm leaning towards some kind of search engine-like application,
only much more geared towards the Semantic Web. Semantic Web search engines at
the moment are... really kinda strange. It's not that I think they're bad, or
even that they're not useful, just that they're playing awfully close to the
old web memes when the Semantic Web is very different in the way that you
interface it. I have quite a few sketches of ideas for how to make search more
interesting, driven by some use-case questions that I have about the data I'm
crawling.
The Decentralized Information Group at MIT (a bit of a W3C spin-off with
folk like Danny Weitzner, DanC, and TimBL) has a weblog called Breadcrumbs that I follow.
Recently they wrote about a new commenting policy that
they've put in place to prevent spammers: using a FOAF whitelist. The idea has
been around for a while, for example Dan Brickley wrote
about it in 2002 and got a prototype working, but this was more geared towards
email than weblog comments.
So Dan Connolly lightbulbed on the comment filtering, and Joe Presbrey et al.
at DIG carried the lightbulb on a fair few yards towards the touchline after a
suggestion that they outsource to Akismet
instead. As Dan notes,
"the idea reached critical mass in breadcrumbs after somebody suggested
outsourcing; timbl and I and Danny pushed back, saying this is what DIG is all
about; it's for us to research, not for us to outsource."
The whitelisting works such that the only people allowed to comment on the
weblog are those that are within three foaf:knows relationships of the DIG
members, and have a foaf:openid property in their profile. I added
foaf:openid to my FOAF file back on the 4th October and TimBL added me to his
FOAF file so that I'd hopefully be whitelisted on the next crawl. Apparently
something went wrong because I kept getting "OpenID authentication failed: not
a valid URL" until yesterday, when it was fixed.
Joe Presbrey wrote the crawler, a 7KB bit of Python called walk.py which uses my
own somewhat old rdfxml.py to
parse RDF/XML (which I was rather surprised about). It's multi-threaded, and
runs very quickly indeed: it takes just 32 seconds for it to spit out all 12
OpenIDs currently within three hops of the DIG members:
$ time python walk.py
[ Iteration: 0 | URIs: 13 ]
rdfxml.py:83: DeprecationWarning: raising a string exception is deprecated
if lang and dtype: raise "ParseError", "Can't have both"
[ Iteration: 1 | URIs: 53 ]
[ Iteration: 2 | URIs: 46 ]
[ Iteration: 3 | URIs: 8 ]
http://auth.mit.edu/syosi
http://auth.mit.edu/oshani
http://www.w3.org/People/Weitzner.html
http://www.w3.org/People/Connolly/
http://inamidst.com/
http://presbrey.mit.edu/
http://getopenid.com/amyvdh
http://bblfish.videntity.org/
http://openid.sun.com/bblfish
http://danbri.org/
http://lalana.pip.verisignlabs.com/
http://www.w3.org/People/Berners-Lee/
real 32.126s
user 2.776s
sys 1.231s
cpu 12%
According to Joe, it took six or seven hours to do the same when it was
single-threaded! At the moment I think only Ryan Lee is able to run the crawl
on the DIG servers, but obviously it would be handy if someone could be added
as soon as is feasible after buying into the OpenID-in-FOAF solution. There
have been behind the curtain discussions about having a big "DO THE CRAWL"
button that all DIG members would be authorised to push, and it was noted that
it wouldn't be such a bad idea to let the public do it either as it only costs
a few CPU cycles (which can be controlled with nice) and a bit of
bandwidth. I'm all for it being as open as is sensible.
Yesterday I interviewed
DanC a bit about the whole process, and put the most cynical question that I
could think of to him, since cynicism sells:
sbp: If every site had a custom method of anti-spamming that has as much
of a take-up barrier as the FOAF whitelist does, wouldn't that be too
burdensome on the post-a-day commenter? Put another way: are you killing off
the ability to comment because of fear of the spammer?
DanC: The barrier for openid is falling, and with support in sites like
advogato and livejournal, likewise. As links in the social network are
increasingly part of the semantic web (think: co-citation data from conference
web sites...) the barrier should go down... It seems only reasonable that the
burden should be on the commentor to prove why they deserve space on my web
site before I publish their comment. Given the value of publication to
spammers, clearly "anyone can publish" will lead to tragedy of the commons.
Of course, the most cynical question of all is simply: will it be valuable
as it scales? But only time will tell there, and it depends upon your value
function. It's valuable already in that I can now comment on Breadcrumbs
whereas before I couldn't because they had to shut it down in the face of the
spammers. It's valuable in that it got me to add my OpenID information to my
FOAF file, which others can now exploit in a similar way. It's valuable in that
it shows a nice practical use for the Semantic Web, and such uses are arguably
still thin on the ground after several years' work. It's true, though, that a
lot of this value would be wiped out if there were scaling issues.
So, if you want to try this system out, what should you do? The first step
would be to get an OpenID if
you don't have one already. Then you need to have a FOAF file, and add the foaf:openid
arc (which is not yet in the HTML documentation, but is in the RDF/XML) from
yourself to your OpenID. Those steps are a bit laborious, but not particularly
difficult. More difficult is the final step of getting meshed into being three
steps from a DIG member. Since it's currently a good example of a Chassignite
Interest, feel free to email me or ask on #swig about getting
connected.
As a spin-off of the FOAF whitelist chat last night, I helped R. Steven Rainwater to set up
cert-level export on Advogato. DanC had
been musing
about it, so I sent a short note to
the Advogato feedback address, and eventually chatted with Steven about the
whole thing on IRC before sending
him a schema which is now on the
Advogato site.
That means that Advogato FOAF files are returning information such as:
trust:Master a foaf:Group;
    foaf:member <http://www.advogato.org/person/raph/foaf.rdf#me> .
From Raph
Levien's FOAF file.
Today I wrote a programming language called Plan3. It's imperative, and it
uses N3 for its syntax, so it looks a bit like lisp whereas in fact it's based
very closely on Pluvo.
Pluvo and n3 =
Plan3. It's implemented as a new cwm mode, and I have a patched
local version of the cvs cwm with this new mode that can run the following
code:
<> doc "dbslurp.n3 - DBPedia Slurp";
   dc:author [ foaf:homepage <http://inamidst.com/sbp/> ] .

data def (()
   (var resources (list))
   (select ?s ?p ?o where { ?s ?p ?o }
      (for t in (list ?s ?p ?o)
         (if (startswith t "http://dbpedia.org/resource/")
             (push t resources))))
   (return resources)
) .

main script (
   (for resource in (data)
      (store (semantics resource)))
   (output)
) .
What you do to run the above is something like:
$ echo '@prefix db: <http://dbpedia.org/resource/> .
db:The_IT_Crowd a :Test .' | cwm --plan3=dbslurp.n3
And it'll slurp the description of the IT Crowd on DBPedia into the working
context and then pretty print it out. The semantics function is a lot like
log:semantics, but some of the other verbs have no counterparts in the log:
namespace, which acts rather like a declarative programming language in N3,
using the --think mode as the interpreter. The log: namespace and other --think
mode builtins have grown rather haphazardly, however, and the system is not
extensible to the general user—you can't add new builtins. With Plan3, I'm
thinking about being able to import signed scripts from online and all that
kind of groovy stuff, as well as giving people a much more powerful standard
library to program with.
Another way of thinking about Plan3 is that it's like the cwm command line,
but formalised into a programming language. The cwm command line is strange in
that it really works like a mini-language, with the order of the flags being
significant, flags able to come after arguments, and so on. So cwm --n3 input.n3 --rdf
will load input.n3 as n3 (actually the --n3 flag is redundant here) and then
the --rdf will convert it to RDF/XML. cwm --rdf input.n3 --n3 on the other hand
will break as the input will be expected to be RDF/XML.
There are some differences though. With cwm, there is an implicit --print at
the end which pretty prints the default context, and I was thinking about
having Plan3 do the same, but then I figured that the default should be no
output and you can do an (output) call instead. That way you don't have to turn
the functionality off when you're getting it to do templating and so on.
It's only taken a day to get the script above, and some other small test
scripts like it, working, mainly because a lot of the underpinnings of the
language are a straight port from Pluvo, but it was still a very good bit of
work indeed, and is already starting to approach the log:/--think language for
power, and indeed exceed it in some places.
I should note that the example above was deliberately redundant, to show off
the language features; though ironically, by explicitly setting up the
resources list it also misses showing off a planned feature: I'd like to make
push work like it does in Perl, creating the list automatically if it doesn't
exist.
Anyway, here's the more compact version:
main script (
   (select ?s ?p ?o where { ?s ?p ?o }
      (for t in (list ?s ?p ?o)
         (if (startswith t "http://dbpedia.org/resource/")
             (store (semantics t)))))
   (output)
) .
This is pretty compact compared to the equivalent log:/--think
concoction.
The Great Semantic Web Survey is coming on quite well: I now have nearly 25
million triples from over 80,000 documents across over 2800 domains. Managing
all of this data is proving to be quite difficult, but I've been offered help
from two different directions. Danny Ayers hooked me up with Talis, and Ian Davis is going to set me up an
account; and Kingsley Idehen offered me a Virtuoso
account in the same environment that DBPedia is hosted in. I'm probably going
to try to set up Virtuoso on manxome, Aaron Swartz's server, first.
An interesting thing about Talis and Virtuoso coming to the rescue here is
that they're both corporate entities who have very kindly offered to donate
their resources to me. Generally this makes sense because I have friendly links
in, they're nice people anyway, they'll be getting exposure and testing in
return, and corporate entities are more likely to be able to manage the kind of
insane amounts of data that I'm dealing with. But what's interesting to me is
that there isn't more entirely grassroots activity in this area.
Meanwhile I've already been doing some simple things with the data I
collected; I've been wanting to find out a) what the most widely deployed
ontologies are, and b) what the most semantically rich websites are. Usually
when people want to find out what terms are widely deployed on the Semantic
Web, they do a rude frequency count, but I'm more sophisticated than that... So
not only am I doing a rude frequency count, but I'm also blending it
with distribution data—in other words, not just "how many times is this
property used" but "how many sites is this used on", and other statistical data
like that. Blending all of the available statistics is a very subjective thing,
but I've got an output of the Semantic Web's favourite ontologies that I'm
quite happy with.
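To give a flavour of the kind of blend I mean, here's an illustrative sketch
rather than the actual metric: weight each predicate by the number of distinct
sites it appears on as well as by its raw count.

import math
from collections import defaultdict

def blended_ranking(occurrences):
    """occurrences: an iterable of (predicate, site) pairs, one per use."""
    count, sites = defaultdict(int), defaultdict(set)
    for predicate, site in occurrences:
        count[predicate] += 1
        sites[predicate].add(site)
    # Damp the raw count with a log so that one huge single-site dump
    # can't swamp a predicate that's used modestly but everywhere.
    score = lambda p: len(sites[p]) * math.log(count[p] + 1)
    return sorted(count, key=score, reverse=True)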
So, some of the raw stats just in case anybody's interested. In my crawl
data there are 14,957 distinct predicates, of which 9878 are HTTP URIs. There
are 1203 ontologies, of which 1140 are in HTTP URI space. Obviously I used my
own whimful definition of ontology; again it's possible to use various metrics
here, so I just picked the one that felt like it had the best measure of
accuracy and ease.
Once I'd found the ontology information out, I decided to figure out what
the most semantically rich websites are. As I reported
on #swig this morning, "this is a blend of diversity of ontologies used +
amount of triples on the site, and the top twenty sites based on this metric,
in descending order, are: www.w3.org, www.mindswap.org, simile.mit.edu,
inamidst.com, www.wasab.dk, b4mad.net, www.ninebynine.org, www.daml.org,
www.ivan-herman.net, dev.w3.org, demo.openlinksw.com, norman.walsh.name,
semspace.mindswap.org, lists.w3.org, dannyayers.com, myopenlink.net,
research.talis.com, www.holygoat.co.uk, www.kanzaki.com, redfoot.net".
The next step is probably to get some proper data processing going, and so
that means turning to Talis and Virtuoso. On the other hand there are some
other interesting Semantic Web things that I'm also working on, Plan3 being
perhaps foremost amongst them. I'm trying to make sure that, at some level, I'm
doing things which are useful, so I'm going to try to weave in more use cases
rather than just be doing theoretical stuff all the flipping time. Indeed, as
you may recall if you've been reading Whits for a week or two at least, all of
this recent Semantic Web activity started from a kind of "fake use case" when
Simon Rozet suggested that I update my 2001 Quotes Ontology/Workflow. I learned
quite quickly from revisiting it that the whole easy-data-remixing promise of
the Semantic Web vision hasn't really been fulfilled yet, even though all these
interesting avenues for making things easier have popped up in the meantime.
I'm therefore working on a document called the "Semantic Web Guidepost"
which is a set of high level notes coördinating my recent Semantic Web
activities, in the form of a personal level tutorial. In other words, when I
started to try to create the new Quotes Ontology, instead of just working
through the problems that arose I documented them in very brief form along the
way, linking only to the most helpful tools and services and patterns and
advice and so on. That's since grown, in just a few days, to incorporate all
kinds of Semantic Webbery but with the same idea of being a very practical
guide and not having any waffly nonsense. It's kinda like the Ultimate
Skimmers' Guide to the Semantic Web.
The Guidepost document is also where I'm collecting my ideas for use cases
and stuff to use Plan3 for and so on. I am planning on publishing all this
stuff, but let me know if you want to take a sneak peek at specific bits
because meanwhile I'm following my "let it mature in the cellar" pattern
again.
Art deco is a lot like Flickr. I used to hate Flickr like I hate all Web 2.0
junk, but to quote
myself from back in April: "Okay, I like Flickr. There, I said it." And
now... well... I don't think I can quite bring myself to say it outright yet,
but I'm starting to think about warming to art deco.
Grand Designs today covered an
art deco house that a couple built in Surrey, and it came out pretty well.
I've been thinking about art deco for months now, though, and this brought
together some of the elements that I'd been thinking about. My main dislike of
it comes from the fact that it's so close to completely functional modernism,
and then with a garish topping. It's like the two worst ends of architecture
combined into a single aesthetic. But you can also flip that around, of course,
and say that each of the extremes tempers the other.
Another problem that I had with it is that it boasts of its contemporary
nature. It's like modernism being such a misnomer now—it's no longer anywhere
near modern. The whole streamlining that came from the industrial 1920s, it
just sorta makes me feel as though I'm going to barf; but again you can flip
that around and say that it's such a ridiculous thing that it's now a trivial
conceit of history. If one were to be surrounded by art deco, with art deco
buildings going up every day, that'd be too much. But as a curious relic which
reminds us of our heritage... why not?
It's such a simple style, too, that it has a very centrally definable
aesthetic; it's very easy to replicate, and there are lots of near-prototypical
instances around. There's a bingo hall not too far from where I'm
writing this that's a large hulk of an art deco building, and you can tell from
quite some distance that it's art deco—art deco buildings scream their motifs
at you, but they're so playschool that you don't feel too overwhelmed by them,
as you know you could design a similar thing quite easily. It's not like art
nouveau where nobody understood it at the time and nobody's really realised its
full potential since either. Art deco peaked so thoroughly that only the Second
World War could really have stopped it.
Connotations, as usual, help too. Whereas before I associated it with disuse
and industry, now my major associations are with travel posters and seafront
promenades. None of the seaside towns in England have been cared for since the
'20s and '30s, so everything's still art deco on the coasts, all the lidos
coming back into use and so on. It's like you can use it to escape the past
which never left, rejuvenating without going too over the top. That's surely a
positive thing.
I'm still suspicious of it. I'm still coming to understand how it can be
used, how it was used originally, and so on. But I'm definitely starting to
think about warming to art deco. Bizarre.
I'd like to be able to evaluate JSON, say of SPARQL results, securely in
Python, but all the existing solutions are way too big for the job, so instead
I just devised the
following bit of embeddable joy:
import re

r_json = re.compile(r'^[,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]+$')
r_string = re.compile(r'"(\\.|[^"\\])*"')

def json(text):
    """Evaluate JSON text safely (we hope)."""
    if r_json.match(r_string.sub('', text)):
        return eval(text, {'__builtins__': None}, {})
    raise ValueError('Input must be serialised JSON.')
No guarantees, but I'll be using it. Thanks as usual to Björn Höhrmann,
DAS ÜMLÄÜTÜNGËHËÜËR, for pointing out that RFC 4627 (the JSON RFC) has a
security regexp. Funnily enough I had been asking about a ten-line-solution for
it, and now I just realised that what I came up with is exactly ten lines. Good
stuff.
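For instance, feeding it a made-up SPARQL SELECT result in the JSON
serialisation works as you'd expect; note that if bare true, false, or null
ever turn up outside strings, as in ASK results, they'd need mapping to
Python's True, False, and None before the eval would be happy.

results = json('{"head": {"vars": ["title"]}, "results": {"bindings": '
               '[{"title": {"type": "literal", "value": "Docbook FAQ"}}]}}')
for binding in results['results']['bindings']:
    print(binding['title']['value'])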
Simon Rozet asked how I've been so
productive this month. What's my secret? Well, ladies and gentlemen, I give you
the most proactive and encouragingly paradigmatic time-management stratagem
that I know of: Patrick Hall's FTDB™.
Three more Semantic Web things to report.
1) Yesterday I figured
out a way of allowing arbitrary transformations to work in an XSLT-only
GRDDL client: simply create a CGI which takes in the source doc URI as a
parameter and outputs a trivial XSLT document which really just encapsulates
some RDF/XML.
So for example, let's say we have a microformat for embedding Turtle in HTML
documents. You can't really parse Turtle in XSLT, but you can parse it in
Python, so you set up a service at http://example.org/hturtle which takes in
?uri=yourdocument as the QUERY_STRING, loads yourdocument, converts it to
RDF/XML, and then outputs a trivial XSLT document:
<xsl:stylesheet ...>
  <xsl:template match="/">
    <rdf:RDF ...>...</rdf:RDF>
  </xsl:template>
</xsl:stylesheet>
And then to link it from your GRDDL document, you simply do:
<link rel="transform" href="http://example.org/hturtle?uri=yourdocument" />
I had hoped that GRDDL clients would be mandated to send a Referer header to
transformation documents should they request them, but they're definitely not;
the protocol trace in the example in the GRDDL specification clearly omits the
Referer header, and there's no RFC-keywordsy documentation about Referer in the
spec at all.
This allows you to have transformations which are ostensibly XSLT, but which
behind the scenes can be anything that you can make a CGI out of—Python,
Perl, Javascript, Befunge, Wang Tiles, whatever. Benjamin Nowack's already noted
that this idiom may help him to make his PHP scripts available to basic GRDDL
clients.
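For the record, here's roughly what such an hturtle CGI could look like; it's
only a sketch, the Turtle-in-a-script-element extraction rule is invented for
illustration, and it assumes a recent rdflib whose serialize returns a string.

#!/usr/bin/env python
import os, re
from urllib.parse import parse_qs
from urllib.request import urlopen
from rdflib import Graph

def extract_turtle(html):
    # Invented for the sketch: pretend the microformat puts its Turtle
    # inside a <script type="text/turtle"> element.
    match = re.search(r'<script type="text/turtle">(.*?)</script>', html, re.S)
    return match.group(1) if match else ''

uri = parse_qs(os.environ.get('QUERY_STRING', ''))['uri'][0]
graph = Graph()
graph.parse(data=extract_turtle(urlopen(uri).read().decode('utf-8')),
            format='turtle', publicID=uri)
rdfxml = re.sub(r'^<\?xml[^>]*\?>\s*', '', graph.serialize(format='xml'))

print('Content-Type: application/xml')
print('')
print('<xsl:stylesheet version="1.0"')
print('    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">')
print('  <xsl:template match="/">')
print(rdfxml)
print('  </xsl:template>')
print('</xsl:stylesheet>')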
2) This morning I've been working with Dave Pawson on a simple bookmarks format
in RDF. The idea is that he captures bookmarks using the following model:
:docbook rdfs:label "docbook";
    foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/DocBook> .

<http://example.org/docbook> dc:title "Docbook FAQ";
    :topic :docbook .
And then you can easily query the data out of that using SPARQL JSON results
(hence the Python JSON
parser that I scribbled above), to get something like:
<li>
  <a href="http://example.org/docbook">Docbook FAQ</a>
  (<a href="http://en.wikipedia.org/wiki/DocBook">docbook</a>)
</li>
In demonstrating this to Dave, I made a little Python script called
bookmarks.py which does the above transformation using the SPARQLer service on sparql.org,
which as far as I can tell uses Jena/ARQ as its backend. At first I figured
that it probably only accepted RDF/XML documents, so I put the N3 source URI
through triplr first, only to find that it was
then complaining because it was getting what it thought was an N3 document
(it's only looking at the extension not the MIME type, presumably!) and
actually finding that it's RDF/XML. So I was annoyed that it was using broken
heuristics, but delighted that it didn't matter anyway because it accepts
N3.
Anyway, it works so now Dave is busy getting into SPARQL and Jena/ARQ. I
also mentioned that this might be possible with cwm, especially if you want to
do slightly more advanced stuff.
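For the curious, the JSON-results-to-HTML step is tiny. Here's a sketch of its
shape; it isn't the actual bookmarks.py, the variable names ?uri, ?title,
?wiki, and ?label are just whatever your SELECT binds, and it reuses the little
json() evaluator from above under a made-up module name.

import sys
from jsonsafe import json   # the ten-line evaluator above, saved as jsonsafe.py

def bindings_to_html(text):
    # text is a raw SPARQL SELECT result in the JSON serialisation, from a
    # query that bound ?uri, ?title, ?wiki and ?label for each bookmark.
    for b in json(text)['results']['bindings']:
        yield ('<li>\n  <a href="%s">%s</a>\n  (<a href="%s">%s</a>)\n</li>'
               % (b['uri']['value'], b['title']['value'],
                  b['wiki']['value'], b['label']['value']))

print('\n'.join(bindings_to_html(sys.stdin.read())))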
3) I've been trying to get forecast data as RDF/XML. I'm thinking about
setting up a service for it using the NOAA GFS data, but GRIB files, the binary
format that the World Meteorological Organisation invented to shunt
meteorological data around, are really difficult to parse, and even when you
use this awesome script that I found which extracts data using HTTP range
requests based on the GRIB inventory files, it still takes somewhat of an age.
And that's just for one geographical location... If people were requesting lots
of arbitrary geographical locations then it'd be too much.
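The range-request trick itself is simple enough to sketch; this isn't the
script I found, the .idx inventory suffix and the field label are assumptions,
and it leans on the wgrib-style convention of one inventory line per record
with the starting byte offset in the second colon-separated field.

from urllib.request import Request, urlopen

def fetch_record(grib_url, field=':TMP:1000 mb:'):
    # One inventory line per GRIB record; the second colon-separated field
    # is the record's starting byte offset within the GRIB file.
    inventory = urlopen(grib_url + '.idx').read().decode('ascii').splitlines()
    offsets = [int(line.split(':')[1]) for line in inventory]
    for i, line in enumerate(inventory):
        if field in line:
            end = offsets[i + 1] - 1 if i + 1 < len(offsets) else ''
            request = Request(grib_url, headers={
                'Range': 'bytes=%d-%s' % (offsets[i], end)})
            return urlopen(request).read()   # just this record's bytes
    raise ValueError('field not found in inventory: %r' % field)

Even then, you still need a GRIB decoder for the bytes that come back, which
is where most of the pain lives.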
So at the moment I have a little webservice which gets the GRIB file,
parsing out only the 1000 mb TMP (temperature) data, but that's only really up
because for some reason I apparently can't connect to any NOAA site from here.
Not sure why; traceroute just barfs out immediately, though it resolves the IP
address okay, so it's not a DNS problem I presume. Anyway, then I'm able to parse the results to
get, for example, a list of the forecasted temperatures in London, and then I
fed that into SIMILE Timeplot to
get a pretty graph. No
RDF/XML involved in that process yet, though, and even the pretty Timeplot
graph is hardly the world's most informative meteogram. Parsing GRIB files is
just so tricky; it's a shame that the NDFD data doesn't cover Europe, because
that'd probably be much easier to use.
Sean B. Palmer, inamidst.com