Gallimaufry of Whits
Being for the Month of 2007-10
These are quick notes taken by Sean B. Palmer on the Semantic Web, Python
and Javascript programming, history and antiquarianism, linguistics and
conlanging, typography, and other related matters. To receive these bits of
dreck regularly, subscribe to the feed. To browse other
months, check the contents.
Yesterday Adam and I, inspired by "rancid",
Googled for cool words and awesome words using "* is a cool / awesome word" as
input. Here's what we got:
Cool: shenanigans, fuck, monster, gnarly, cahoots, vagina,
confabulation, shopomore, peruse, snazzy, asshat, engma, lurch, qwerty, pestle,
spiffy, buff, cinchy, science, nahbubuhay, hemogloben, embrollment, and
gwrthwynebwyr.
Awesome: dingo, tofurkey, whizgiggle, squircle, assfuckwitards,
fuckton, awesome, Tulonic, jazztastic, quadroon, homosinuality, poppycock,
compossible, smackies, floozy, sleuthlikedly, nawkish, slacktimony,
incrimidating, shitique, omnichronic, dissimulate, codswallop, Potterotica,
humptitude, doiley, bagarap, neathage, jobber, and gnarly.
Funnily enough, I feature already in both results, having nominated pestle
as a cool word and bagarap (right here on Gallimaufry of
Whits) as an awesome word. I'd also like to nominate jsled's "shitrude" as an
awesome word, since we're trying to proselytise it somewhat. Hmm, proselytise
is good too. We'll call that one cool, shall we?
I have this strange relationship with Amaya where I don't really like it but all
the same it's too handy to avoid all the time, so I end up trying it out every
year or two for some project. I decided to start taking some notes about some
early history of Christianity stuff that's sucking me in at the moment, and
Amaya was the obvious choice to just-take-notes without having to worry about
the syntax too much.
Last time I did the whole Amaya thing I ended up with On Using Amaya, which is basically a
page-after-page rant about how much Amaya sucks, down to some of the specific
bugs. With the latest Amaya I'm already encountering some of the old friends,
but overall I'm a bit more optimistic than before; it seems a bit more solid on
OS X now than it did on Windows a couple or so years ago.
Still, using my stripped-down xhtml.rnc in nxml-mode in Emacs is so good that
there's little point in heavy Amaya use. Just gotta make sure that the source
is perfect.
I'm tempted to write to Caitlin Moran saying "I
wanna have your babies!" But I think she'd just reply "Fine! Take them!"
Yesterday I saw a
sun pillar, and took a photo of it. This was
well after the sun had set: the column of yellow light that you can see, the
sun pillar, is caused by the reflection of sunlight off ice crystals in the
atmosphere, not directly by the sun itself. It actually got brighter and longer
for a while after the sun had set, and then gradually disappeared.
Sun pillars are apparently not all that rare, but this is the first time
that I recall seeing one. Venus pillars, on the
other hand, are exceedingly rare—make sure you get a snap if you ever see
one!
Fade teil thee zo lournagh, co Joane, zo knaggee?
Th' weithest all curcagh, wafur, an cornee.
Lidge w'ouse an a milagh, tis gaay an louthee:
Huck nigher; y'art scuddeen; fartoo zo hachee?
—"An Old Song", in Yola
Yola is a derivative of Middle English that survived until comparatively
modern times. I traced this song to the Annual Register of 1789
(2nd ed., printed 1802). The Bod has the volumes up to 1778 online,
in lovely high quality scans, but you'll have to rummage in Google Books for
others.
Word of the day: "mismathsed".
So, RDF Templating. Simon Rozet was asking
about a TV quotes ontology for RDF last night, and suggested that I revive my
old RDF quotes project from
2001. The workflow for that is essentially input and screenscrape data, merge
data using RDF rules, and output using some kind of templating.
When I did this originally, I used XSLT for the templating language, and
basically did a scrape of the RDF/XML. But that doesn't scale well, so I poked
about on Google this morning for existing RDF templating solutions. My aim was
to find some off-the-shelf components to use, so that I could document the
process and have others who want to do a similar thing say "oh, that looks
easy; I can do that!".
The best resource that I found on the problem is the ESW Wiki's page on RdfPath: "If
you want to transform RDF to XML/HTML/Text, read on!". The solutions break
down into two categories: those based on XSLT, and Fresnel. Beyond these
two language-agnostic approaches, the RdfPath page doesn't mention the many
homespun attempts at RDF templating, which of course I'm not as interested in
given that I want to use off-the-shelf components.
Whilst I investigated some of the options listed on RdfPath, I documented
the paper trail on #swig, and others pitched in to help me, especially m94mni,
dorian, kwijibo, iand, chimezie, and AndyS (thanks guys!). We generated an
enormous amount of discussion and blogged quite a
few things along the way too.
The general conclusion, for my own requirements, is that I found a decent
off-the-shelf templating workflow, and we started to converge on a more general
processing model that might help others with slightly more complex requirements
than the simple presentation of a quotes database. Essentially, I'm thinking
about using SPARQL (however much I
dislike it) to produce XML
bindings that I can then import, process, and output using XSLT. The
general processing model is the decoupled one of Query -> Merge? -> Process? ->
Output. You basically want to use the best quality (simplest, easiest to use,
works on your system) components that'll play nicely with one another.
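To make the shape of that concrete, here's a minimal sketch of the pipeline in
Python, assuming roqet (from Rasqal) and xsltproc are installed; the query,
data, and stylesheet filenames are just placeholders:

import subprocess

def sparql_to_output(query, data, stylesheet):
    # Query: ask roqet for its bindings in the SPARQL Query Results XML Format.
    roqet = subprocess.Popen(['roqet', '-r', 'xml', '-D', data, query],
                             stdout=subprocess.PIPE)
    bindings = roqet.communicate()[0]
    # Process and Output: hand the XML bindings to an XSLT stylesheet;
    # "-" tells xsltproc to read the document to transform from stdin.
    xslt = subprocess.Popen(['xsltproc', stylesheet, '-'],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    return xslt.communicate(bindings)[0]

print(sparql_to_output('quotes.rq', 'quotes.rdf', 'quotes.xsl'))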
Once settled on the use of SPARQL and XSLT to do the templating, I figured
that I'd need to model the TV quotes; in other words to solve the question that
Simon had last night: is there an existing quotes ontology, and if not, what
should a quotes ontology look like?
I thought that someone, possibly Danny Ayers, had done such an ontology
before but I couldn't find any evidence of that on the web. I gave Simon some
advice
about modelling based around the OnlyModelExploitableData and
Don'tWorryBeCrappy design patterns, but ironically since then I've been
starting to worry about being too crappy.
For example, my original ontology just threw the quote in a simple block of
text, such as the following:
AA Lady: And we have sugar cookies and marshmallows
Homer: These sugar cookies you speak of... are they symbolic?
AA Lady: They're on that table, over there [points]
Homer: Aw... all the way over there? I don't want to walk all the way over
there... Anything that takes 12 steps isn't worth doing! Get it? Heh? Steps?
[Cut to a scene of Homer waking up in some bushes rubbing his head]
Now, that's fine, but when you come to render it in HTML, how can you tell
where the line breaks should go? Really, instead of bunging it into a big block
of text, you should put it in a list:
("AA Lady: And we have sugar cookies and marshmallows"
"Homer: These sugar cookies you speak of... are they symbolic?"
"AA Lady: They're on that table, over there [points]"
"""Homer: Aw... all the way over there? I don't want to walk all the way over
there... Anything that takes 12 steps isn't worth doing! Get it? Heh? Steps?"""
"[Cut to a scene of Homer waking up in some bushes rubbing his head]")
And, indeed, it's not really a quote of a single character but a subclass of
quote: dialogue.
But now, what about querying this out? SPARQL has, rather notoriously, no
facilities that I'm aware of for the special treatment of lists. It does,
however, allow multiple OPTIONAL constructs, so you can do something like
this:
WHERE {
    OPTIONAL { ?list ?p :Test }
    OPTIONAL { ?list rdf:first ?a }
    OPTIONAL { ?list rdf:rest ?r1 }
    OPTIONAL { ?r1 rdf:first ?b }
    OPTIONAL { ?r1 rdf:rest ?r2 }
    OPTIONAL { ?r2 rdf:first ?c }
}
Except that when I actually tried that in roqet (having tried CWM's SPARQL
stuff, which doesn't even output the XML bindings), it didn't work because of a
bug
in roqet which I subsequently reported to dajobe. So anyway, it should work,
but it goes to show that list munging still isn't done all that often
(otherwise someone would've spotted this before, no?), which means that it's
fragile territory, best avoided wherever that's easily possible.
So I guess actually I'm not avoiding the Don'tWorryBeCrappy pattern; in
fact, quite the contrary, I'm trying to be as crappy as I can be, but without
it breaking entirely somewhere along the road. What's the simplest thing
that'll work in this model?
On how to represent extended dialogue in RDF, I realised that something like
the following will probably work sufficiently:
[ :play :Hamlet; :dialogue (
    [ :by _:Pol; :quote "Doe you knowe me my Lord?"; :tln "1210" ]
    [ :by _:Ham; :quote "Excellent well, you are a Fishmonger."; :tln "1211" ]
)] .
But then a sideproblem to this is that to identify each line unambiguously,
you need a combination of the :tln (the Norton Through Line Number, a standard
way of referring to lines in Shakespearean plays) and the :play. To that end I
wrote some rules that can propagate the play to each of the lines in the
dialogue:
{ [ :dialogue ?d; :play ?play ] }
    => { ?d :membersFromPlay ?play } .

{ [ :membersFromPlay ?play;
      rdf:first ?member;
      rdf:rest ?subd ] }
    => { ?member :play ?play .
         ?subd :membersFromPlay ?play } .
But then say someone comes along and adds an annotation about one of the
lines from the play with this uniquely identifying information:
[ :tln "1211"; :play :Hamlet; :note
"Davies says (C18) this means 'You are a fisherman, and angle for me'" ] .
How can we merge the two?
This is a known, if not fully investigated, problem in RDF which has
culminated in the idea of a CIFP, a
Composite Inverse Functional Property. The page just linked to has some details
on the current state of the art, but when I tried Jos de Roo's implementation
out, I found that it only works in his Euler; there isn't a generic solution
that works in CWM.
So I wrote
one: CIFP Rules for CWM. Lots
of people pitched in and helped on #swig again, which is great, especially
Henry Story who has been pushing this problem to a resolution for years now.
Sandro Hawke was the first person I recall raising it.
The trouble is that the core problem still remains: I don't know how to
specify that I have a CIFP, except by using the mechanism that we made up. We
don't know what ontological ramifications it has, or how nicely it plays with
OWL. Bijan Parsia pitched in to say that it's being worked on, however, so at
least there is the possibility of a resolution at some point. The question is
what to do meanwhile.
All of this was after I slammed
SPARQL, quite rightly I hope, for its accessingCollections issue and the fact
that it prevents me from using SPARQL usefully on anything with an rdf:List in
it, which of course includes my dialogue model. Of course it's possible to use
CWM, but why shouldn't there be a lightweight solution for this too? And a
standardised lightweight solution, moreover.
Postboxes in the UK all have the name of the current monarch stamped on
them, and whenever I go by them I have a look to see how old they are. We have
some Victorian ones just down the road, but whilst inspecting one box the other
day I wondered how many Edward VIII ones there might be, given that he only
reigned for a handful of months. Are there any at all?
Indeed there are. Long story short, I spent three hours this afternoon
making a List of Edward VIII
Postboxes. I found 57, including many photos of them, but there are said to
be around 150 still out there. So there you go.
One of the biggest problems I've had in creating a Quotes Ontology is that
it's difficult to find prior art, especially when Swoogle is down. This got me wondering
about the more general question of what RDF vocabularies are being used—what
kind of cool information is out there? So I've been working on that for several
days now, but I haven't forgotten the Quotes Ontology.
Danny Ayers wrote about my
templating workflow, and has been trying to convert the rules part of the
process into SPARQL. In doing so, he's come up against the same
SPARQL bug that so thoroughly annoys me as well. Some
SPARQL folk are on the case trying to help out, but I'm not aware of Danny
having made any progress yet.
In any case, I think it doesn't matter too much because my workflow is
Process -> Query -> Template, and he's trying to change more of the Process
part (which I'm doing with cwm) into the Query part (which I'm doing with
SPARQL). I got the workflow producing the kind of output I wanted it to, with
just the following classes and properties in the Quotes Ontology:
:Quote, :Dialogue, :DialogueList, :quote, :dialogue, :prefix, :by,
:from.
In general, a :Quote is a conceptual quote from some work (a book, a play, a
TV show) and you link it to the textual representation of that quote using
:quote. The :from property links to the work that the quote is from, and the
:by gives the person or character who uttered it. If you have two or more
people chatting back and forwards you can use :Dialogue, which links to a
:DialogueList (a list of Quote instances) via the :dialogue property.
It all sounds a bit "wha?", but it makes perfect sense once you see a few
examples, and this is the best structure that I could come up with which
models, and only models, the information that I want to exploit for
presentation. I've been careful to make sure that the properties and classes
are reusable by using OWL restrictions instead of (ironically) stricter domain
and range constraints, and I of course allow for extension.
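Since the ontology isn't published yet, here's a purely illustrative
consumer's-eye sketch of the model, with a placeholder namespace, rdflib
assumed, and :by abbreviated to plain literals to keep it short: parse a
:Dialogue, then walk its list of :Quote instances via rdf:first and rdf:rest.

from rdflib import Graph, Namespace, RDF

Q = Namespace('http://example.org/quotes#')    # placeholder namespace

doc = """
@prefix : <http://example.org/quotes#> .
:d a :Dialogue; :from :Hamlet; :dialogue (
    [ a :Quote; :by "Polonius"; :quote "Doe you knowe me my Lord?" ]
    [ a :Quote; :by "Hamlet"; :quote "Excellent well, you are a Fishmonger." ]
) .
"""

graph = Graph()
graph.parse(data=doc, format='turtle')

for dialogue in graph.subjects(RDF.type, Q['Dialogue']):
    node = graph.value(dialogue, Q['dialogue'])     # head of the :DialogueList
    while node and node != RDF.nil:
        quote = graph.value(node, RDF.first)
        print('%s: %s' % (graph.value(quote, Q['by']),
                          graph.value(quote, Q['quote'])))
        node = graph.value(node, RDF.rest)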
One slight annoyance is that my awesome list-typing recipe pushes the
ontology into the OWL Full species, though I asked
Bijan Parsia "if people ask if this is an OWL Lite ontology, what do I tell
them in three words or less?", and he replied "I would tell them they shouldn't
care" and that "I would also say that it's nominally owl full, but in a
relatively harmless way".
Once Swoogle came back online, I found that Kevin Reid
had previously worked on his own quotes ontology
back in October 2003, around the time I went to Bristol for the FOAF meeting,
and he'd forgotten to tell me though he had been meaning to. It's very
interesting to compare its approach to mine: it uses rdf:Seq to hold lines,
uses domain and range, and doesn't seem to have a particularly need-driven
structure though I may be wrong.
One of the biggest things I learned is that my modelling advice is really
good, but really
difficult to follow: the advice being a blend of the "Don't Worry, Be
Crappy" and "Only Model What You Want To Exploit" design patterns. It's so easy
to get into thinking about the model so much that you get into a semantic
metawank and soon enough you're doing philosophy rather than computer science.
You need to avoid that if you want to actually produce something.
I still haven't published the quotes ontology yet, because I'm still not
really finished with it; and I'm ignoring Release Early, Release Often in
favour of my Let It Mature In The Cellar. I did get /ns/ at purl.org though, so
I have some nice namespace territory available to me.
Getting stats readouts of my Semantic Web Survey is starting to take a long
time, but the most recent full set that I got last night was:
- sites: 2768
- docs: 46262
- triples: 7743021
- subtriples: 14602
- cache size: 255M
That's quite impressive for so few days' spidering. It's getting hard to
manage that much data, and since I have five Python processes gently hammering
the web I can't really do much else with my account on bia anyway. Several
times now I've gone into impolite amounts of CPU or memory, but I think I've
fixed the problems which gave rise to that and now I'm able to have them
crawling overnight without failure. Christopher
Schmidt, who runs the server, has been very patient—thanks crschmidt!
So I'm almost definitely at 10m triples now. What to do with them? Well,
sadly, the first step when I get to a level of completion that I'm happy with
will be to crawl them all over again. I've learned a lot about real crawling,
and one of the main pieces of advice is: make sure you get all the data that
you want to use first time around. Oddly enough, I actually realised this
before I did the first set of crawling and tried to make sure that I saved all
data that I could think of, but it turned out that I missed some.
At any rate, I want to recrawl because I want to tweak the parsers a bit.
For a start, the version of the rapper RDF parser that's on bia
seems to be quite ancient and gives some output which is malformed, so I want
to try using a more recent locally installed version. I'm also thinking about
allowing @this in n3, because that's cacking out a lot of the n3 that I could
potentially crawl.
I already had to recrawl all the Notation3 documents once to allow for the
garnering of subtriples; which is to say, triples that exist in formulae not
asserted in the document.
With all that aside though, what really am I going to do with the data? I'm
looking for something I've been calling Semantic Diversity. The whole point of
this exercise is that I want to find what kind of interesting data is on the
Semantic Web already, if there is any, so I'm going for something that's very
different to the extremely linear approaches to Semantic Web Surveying at the
moment. For example, you tend to get reports like percentage of documents which
are RDF/XML, the most widely used properties, and so on. I'm not looking for
which properties are most widely used, though I'm sure I'll report on that too.
I'm looking for interesting ways that people are mixing data. I want
to find out what ontologies play well together, how people are using ontologies
differently to the way they've been specified, and so on. Taking a much more
descriptive and relativist approach to the whole shaboodle, if you want to hear
it in those kind of terms; but it's not even just that, because I'm interested
in how it plays with the prescriptive and objectivist approaches
too.
I also want to be able to search through the data for interesting things
quite fast, so I'm leaning towards some kind of search engine-like application,
only much more geared towards the Semantic Web. Semantic Web search engines at
the moment are... really kinda strange. It's not that I think they're bad, or
even that they're not useful, just that they're playing awfully close to the
old web memes when the Semantic Web is very different in the way that you
interface it. I have quite a few sketches of ideas for how to make search more
interesting, driven by some use-case questions that I have about the data I'm
crawling.
The Decentralized Information Group at MIT (a bit of a W3C spin-off with
folk like Danny Weitzner, DanC, and TimBL) has a weblog called Breadcrumbs that I follow.
Recently they wrote about a new commenting policy that
they've put in place to prevent spammers: using a FOAF whitelist. The idea has
been around for a while, for example Dan Brickley wrote
about it in 2002 and got a prototype working, but this was more geared towards
email than weblog comments.
So Dan Connolly lightbulbed on the comment filtering, and Joe Presbrey et al.
at DIG carried the lightbulb on a fair few yards towards the touchline after a
suggestion that they outsource to Akismet
instead. As Dan notes,
"the idea reached critical mass in breadcrumbs after somebody suggested
outsourcing; timbl and I and Danny pushed back, saying this is what DIG is all
about; it's for us to research, not for us to outsource."
The whitelisting works such that the only people allowed to comment on the
weblog are those that are within three foaf:knows relationships of the DIG
members, and have a foaf:openid property in their profile. I added
foaf:openid to my FOAF file back on the 4th October and TimBL added me to his
FOAF file so that I'd hopefully be whitelisted on the next crawl. Apparently
something went wrong because I kept getting "OpenID authentication failed: not
a valid URL" until yesterday, when it was fixed.
Joe Presbrey wrote the crawler, a 7KB bit of Python called walk.py which uses my
own somewhat old rdfxml.py to
parse RDF/XML (which I was rather surprised about). It's multi-threaded, and
runs very quickly indeed: it takes just 32 seconds for it to spit out all 12
OpenIDs currently within three hops of the DIG members:
$ time python walk.py
[ Iteration: 0 | URIs: 13 ]
rdfxml.py:83: DeprecationWarning: raising a string exception is deprecated
if lang and dtype: raise "ParseError", "Can't have both"
[ Iteration: 1 | URIs: 53 ]
[ Iteration: 2 | URIs: 46 ]
[ Iteration: 3 | URIs: 8 ]
http://auth.mit.edu/syosi
http://auth.mit.edu/oshani
http://www.w3.org/People/Weitzner.html
http://www.w3.org/People/Connolly/
http://inamidst.com/
http://presbrey.mit.edu/
http://getopenid.com/amyvdh
http://bblfish.videntity.org/
http://openid.sun.com/bblfish
http://danbri.org/
http://lalana.pip.verisignlabs.com/
http://www.w3.org/People/Berners-Lee/
real 32.126s
user 2.776s
sys 1.231s
cpu 12%
According to Joe, it took six or seven hours to do the same when it was
single-threaded! At the moment I think only Ryan Lee is able to run the crawl
on the DIG servers, but obviously it would be handy if someone could be added
as soon as is feasible after buying into the OpenID-in-FOAF solution. There
have been behind the curtain discussions about having a big "DO THE CRAWL"
button that all DIG members would be authorised to push, and it was noted that
it wouldn't be such a bad idea to let the public do it either as it only costs
a few CPU cycles (which can be controlled with nice) and a bit of
bandwidth. I'm all for it being as open as is sensible.
Yesterday I interviewed
DanC a bit about the whole process, and put the most cynical question that I
could think of to him, since cynicism sells:
sbp: If every site had a custom method of anti-spamming that has as much
of a take-up barrier as the FOAF whitelist does, wouldn't that be too
burdensome on the post-a-day commenter? Put another way: are you killing off
the ability to comment because of fear of the spammer?
DanC: The barrier for openid is falling, and with support in sites like
advogato and livejournal, likewise. As links in the social network are
increasingly part of the semantic web (think: co-citation data from conference
web sites...) the barrier should go down... It seems only reasonable that the
burden should be on the commentor to prove why they deserve space on my web
site before I publish their comment. Given the value of publication to
spammers, clearly "anyone can publish" will lead to tragedy of the commons.
Of course, the most cynical question of all is simply: will it be valuable
as it scales? But only time will tell there, and it depends upon your value
function. It's valuable already in that I can now comment on Breadcrumbs
whereas before I couldn't because they had to shut it down in the face of the
spammers. It's valuable in that it got me to add my OpenID information to my
FOAF file, which others can now exploit in a similar way. It's valuable in that
it shows a nice practical use for the Semantic Web, and such uses are arguably
still thin on the ground after several years' work. It's true, though, that a
lot of this value would be wiped out if there were scaling issues.
So, if you want to try this system out, what should you do? The first step
would be to get an OpenID if
you don't have one already. Then you need to have a FOAF file, and add the foaf:openid
arc (which is not yet in the HTML documentation, but is in the RDF/XML) from
yourself to your OpenID. Those steps are a bit laborious, but not particularly
difficult. More difficult is the final step of getting meshed into being three
steps from a DIG member. Since it's currently a good example of a Chassignite
Interest, feel free to email me or ask on #swig about getting
connected.
As a spin-off of the FOAF whitelist chat last night, I helped R. Steven Rainwater to set up
cert-level export on Advogato. DanC had
been musing
about it, so I sent a short note to
the Advogato feedback address, and eventually chatted with Steven about the
whole thing on IRC before sending
him a schema which is now on the
Advogato site.
That means that Advogato FOAF files are returning information such as:
trust:Master a foaf:Group;
    foaf:member <http://www.advogato.org/person/raph/foaf.rdf#me> .
From Raph
Levien's FOAF file.
Today I wrote a programming language called Plan3. It's imperative, and it
uses N3 for its syntax, so it looks a bit like lisp whereas in fact it's based
very closely on Pluvo.
Pluvo and n3 =
Plan3. It's implemented as a new cwm mode, and I have a patched
local version of the cvs cwm with this new mode that can run the following
code:
<> doc "dbslurp.n3 - DBPedia Slurp";
   dc:author [ foaf:homepage <http://inamidst.com/sbp/> ] .

data def (()
   (var resources (list))
   (select ?s ?p ?o where { ?s ?p ?o }
      (for t in (list ?s ?p ?o)
         (if (startswith t "http://dbpedia.org/resource/")
             (push t resources))))
   (return resources)
) .

main script (
   (for resource in (data)
      (store (semantics resource)))
   (output)
) .
What you do to run the above is something like:
$ echo '@prefix db: <http://dbpedia.org/resource/> .
db:The_IT_Crowd a :Test .' | cwm --plan3=dbslurp.n3
And it'll slurp the description of the IT Crowd on DBPedia into the working
context and then pretty print it out. The semantics function is a lot like
log:semantics, but some of the other verbs have no counterparts in the log:
namespace, which acts rather like a declarative programming language in N3,
using the --think mode as the interpreter. The log: namespace and other --think
mode builtins have grown rather haphazardly, however, and the system is not
extensible to the general user—you can't add new builtins. With Plan3, I'm
thinking about being able to import signed scripts from online and all that
kind of groovy stuff, as well as giving people a much more powerful standard
library to program with.
Another way of thinking about Plan3 is that it's like the cwm command line,
but formalised into a programming language. The cwm command line is strange in
that it really works like a mini-language, with the order of the flags being
significant, flags able to come after arguments, and so on. So cwm --n3 input.n3 --rdf
will load input.n3 as n3 (actually the --n3 flag is redundant here) and then
the --rdf will convert it to RDF/XML. cwm --rdf input.n3 --n3 on the other hand
will break as the input will be expected to be RDF/XML.
There are some differences though. With cwm, there is an implicit --print at
the end which pretty prints the default context, and I was thinking about
having Plan3 do the same, but then I figured that the default should be no
output and you can do an (output) call instead. That way you don't have to turn
the functionality off when you're getting it to do templating and so on.
It's only taken a day to get the script above, and some other small test
scripts like it, working, mainly because a lot of the underpinnings of the
language are a straight port from Pluvo, but it was still a very good bit of
work indeed, and is already starting to approach the log:/--think language for
power, and indeed exceed it in some places.
I should note that the example above was deliberately redundant, to show off
the language features; though ironically, by explicitly setting up the
resources list it also misses showing off a planned feature: I'd like to make
push work like it does in Perl, creating the list automatically if it doesn't
exist.
Anyway, here's the more compact version:
main script (
   (select ?s ?p ?o where { ?s ?p ?o }
      (for t in (list ?s ?p ?o)
         (if (startswith t "http://dbpedia.org/resource/")
             (store (semantics t)))))
   (output)
) .
This is pretty compact compared to the equivalent log:/--think
concoction.
The Great Semantic Web Survey is coming on quite well: I now have nearly 25
million triples from over 80,000 documents across over 2800 domains. Managing
all of this data is proving to be quite difficult, but I've been offered help
from two different directions. Danny Ayers hooked me up with Talis, and Ian Davis is going to set me up an
account; and Kingsley Idehen offered me a Virtuoso
account in the same environment that DBPedia is hosted in. I'm probably going
to try to set up Virtuoso on manxome, Aaron Swartz's server, first.
An interesting thing about Talis and Virtuoso coming to the rescue here is
that they're both corporate entities who have very kindly offered to donate
their resources to me. Generally this makes sense because I have friendly links
in, they're nice people anyway, they'll be getting exposure and testing in
return, and corporate entities are more likely to be able to manage the kind of
insane amounts of data that I'm dealing with. But what's interesting to me is
that there isn't more entirely grassroots activity in this area.
Meanwhile I've already been doing some simple things with the data I
collected; I've been wanting to find out a) what the most widely deployed
ontologies are, and b) what the most semantically rich websites are. Usually
when people want to find out what terms are widely deployed on the Semantic
Web, they do a rude frequency count, but I'm more sophisticated than that... So
not only am I doing a rude frequency count, but I'm also blending it
with distribution data—in other words, not just "how many times is this
property used" but "how many sites is this used on", and other statistical data
like that. Blending all of the available statistics is a very subjective thing,
but I've got an output of the Semantic Web's favourite ontologies that I'm
quite happy with.
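To give a flavour of the kind of blend I mean, here's an illustrative sketch
rather than the actual metric: weight each predicate by the number of distinct
sites it appears on as well as by its raw count.

import math
from collections import defaultdict

def blended_ranking(occurrences):
    """occurrences: an iterable of (predicate, site) pairs, one per use."""
    count, sites = defaultdict(int), defaultdict(set)
    for predicate, site in occurrences:
        count[predicate] += 1
        sites[predicate].add(site)
    # Damp the raw count with a log so that one huge single-site dump
    # can't swamp a predicate that's used modestly but everywhere.
    score = lambda p: len(sites[p]) * math.log(count[p] + 1)
    return sorted(count, key=score, reverse=True)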
So, some of the raw stats just in case anybody's interested. In my crawl
data there are 14,957 distinct predicates, of which 9878 are HTTP URIs. There
are 1203 ontologies, of which 1140 are in HTTP URI space. Obviously I used my
own whimful definition of ontology; again it's possible to use various metrics
here, so I just picked the one that felt like it had the best measure of
accuracy and ease.
Once I'd found the ontology information out, I decided to figure out what
the most semantically rich websites are. As I reported
on #swig this morning, "this is a blend of diversity of ontologies used +
amount of triples on the site, and the top twenty sites based on this metric,
in descending order, are: www.w3.org, www.mindswap.org, simile.mit.edu,
inamidst.com, www.wasab.dk, b4mad.net, www.ninebynine.org, www.daml.org,
www.ivan-herman.net, dev.w3.org, demo.openlinksw.com, norman.walsh.name,
semspace.mindswap.org, lists.w3.org, dannyayers.com, myopenlink.net,
research.talis.com, www.holygoat.co.uk, www.kanzaki.com, redfoot.net".
The next step is probably to get some proper data processing going, and so
that means turning to Talis and Virtuoso. On the other hand there are some
other interesting Semantic Web things that I'm also working on, Plan3 being
perhaps foremost amongst them. I'm trying to make sure that, at some level, I'm
doing things which are useful, so I'm going to try to weave in more use cases
rather than just be doing theoretical stuff all the flipping time. Indeed, as
you may recall if you've been reading Whits for a week or two at least, all of
this recent Semantic Web activity started from a kind of "fake use case" when
Simon Rozet suggested that I update my 2001 Quotes Ontology/Workflow. I learned
quite quickly from revisiting it that the whole easy-data-remixing promise of
the Semantic Web vision hasn't really been fulfilled yet, even though all these
interesting avenues for making things easier have popped up in the meantime.
I'm therefore working on a document called the "Semantic Web Guidepost"
which is a set of high level notes coördinating my recent Semantic Web
activities, in the form of a personal level tutorial. In other words, when I
started to try to create the new Quotes Ontology, instead of just working
through the problems that arose I documented them in very brief form along the
way, linking only to the most helpful tools and services and patterns and
advice and so on. That's since grown, in just a few days, to incorporate all
kinds of Semantic Webbery but with the same idea of being a very practical
guide and not having any waffly nonsense. It's kinda like the Ultimate
Skimmers' Guide to the Semantic Web.
The Guidepost document is also where I'm collecting my ideas for use cases
and stuff to use Plan3 for and so on. I am planning on publishing all this
stuff, but let me know if you want to take a sneak peek at specific bits
because meanwhile I'm following my "let it mature in the cellar" pattern
again.
Art deco is a lot like Flickr. I used to hate Flickr like I hate all Web 2.0
junk, but to quote
myself from back in April: "Okay, I like Flickr. There, I said it." And
now... well... I don't think I can quite bring myself to say it outright yet,
but I'm starting to think about warming to art deco.
Grand Designs today covered an
art deco house that a couple built in Surrey, and it came out pretty well.
I've been thinking about art deco for months now, though, and this brought
together some of the elements that I'd been thinking about. My main dislike of
it comes from the fact that it's so close to completely functional modernism,
and then with a garish topping. It's like the two worst ends of architecture
combined into a single aesthetic. But you can also flip that around, of course,
and say that each of the extremes tempers the other.
Another problem that I had with it is that it boasts of its contemporary
nature. It's like modernism being such a misnomer now—it's no longer anywhere
near modern. The whole streamlining that came from the industrial 1920s, it
just sorta makes me feel as though I'm going to barf; but again you can flip
that around and say that it's such a ridiculous thing that it's now a trivial
conceit of history. If one were to be surrounded by art deco, with art deco
buildings going up every day, that'd be too much. But as a curious relic which
reminds us of our heritage... why not?
It's such a simple style, too, that it has a very centrally definable
aesthetic; it's very easy to replicate, and there are lots of near-prototypical
instances around. There's a bingo hall not too far from where I'm
writing this that's a large hulk of an art deco building, and you can tell from
quite some distance that it's art deco—art deco buildings scream their motifs
at you, but they're so playschool that you don't feel too overwhelmed by them,
as you know you could design a similar thing quite easily. It's not like art
nouveau where nobody understood it at the time and nobody's really realised its
full potential since either. Art deco peaked so thoroughly that only the Second
World War could really have stopped it.
Connotations, as usual, help too. Whereas before I associated it with disuse
and industry, now my major associations are with travel posters and seafront
promenades. None of the seaside towns in England have been cared for since the
'20s and '30s, so everything's still art deco on the coasts, all the lidos
coming back into use and so on. It's like you can use it to escape the past
which never left, rejuvenating without going too over the top. That's surely a
positive thing.
I'm still suspicious of it. I'm still coming to understand how it can be
used, how it was used originally, and so on. But I'm definitely starting to
think about warming to art deco. Bizarre.
I'd like to be able to evaluate JSON, say of SPARQL results, securely in
Python, but all the existing solutions are way too big for the job, so instead
I just devised the
following bit of embeddable joy:
import re

r_json = re.compile(r'^[,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]+$')
r_string = re.compile(r'"(\\.|[^"\\])*"')

def json(text):
    """Evaluate JSON text safely (we hope)."""
    if r_json.match(r_string.sub('', text)):
        return eval(text, {'__builtins__': None}, {})
    raise ValueError('Input must be serialised JSON.')
No guarantees, but I'll be using it. Thanks as usual to Björn Höhrmann,
DAS ÜMLÄÜTÜNGËHËÜËR, for pointing out that RFC 4627 (the JSON RFC) has a
security regexp. Funnily enough I had been asking about a ten-line-solution for
it, and now I just realised that what I came up with is exactly ten lines. Good
stuff.
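For instance, feeding it a made-up SPARQL SELECT result in the JSON
serialisation works as you'd expect; note that if bare true, false, or null
ever turn up outside strings, as in ASK results, they'd need mapping to
Python's True, False, and None before the eval would be happy.

results = json('{"head": {"vars": ["title"]}, "results": {"bindings": '
               '[{"title": {"type": "literal", "value": "Docbook FAQ"}}]}}')
for binding in results['results']['bindings']:
    print(binding['title']['value'])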
Simon Rozet asked how I've been so
productive this month. What's my secret? Well, ladies and gentlemen, I give you
the most proactive and encouragingly paradigmatic time-management stratagem
that I know of: Patrick Hall's FTDB™.
Three more Semantic Web things to report.
1) Yesterday I figured
out a way of allowing arbitrary transformations to work in an XSLT-only
GRDDL client: simply create a CGI which takes in the source doc URI as a
parameter and outputs a trivial XSLT document which really just encapsulates
some RDF/XML.
So for example, let's say we have a microformat for embedding Turtle in HTML
documents. You can't really parse Turtle in XSLT, but you can parse it in
Python, so you set up a service at http://example.org/hturtle which takes in
?uri=yourdocument as the QUERY_STRING, loads yourdocument, converts it to
RDF/XML, and then outputs a trivial XSLT document:
<xsl:stylesheet ...>
  <xsl:template match="/">
    <rdf:RDF ...>...</rdf:RDF>
  </xsl:template>
</xsl:stylesheet>
And then to link it from your GRDDL document, you simply do:
<link rel="transform" href="http://example.org/hturtle?uri=yourdocument" />
I had hoped that GRDDL clients would be mandated to send a Referer header to
transformation documents should they request them, but they're definitely not;
the protocol trace in the example in the GRDDL specification clearly omits the
Referer header, and there's no RFC-keywordsy documentation about Referer in the
spec at all.
This allows you to have transformations which are ostensibly XSLT, but which
behind the scenes can be anything that you can make a CGI out of—Python,
Perl, Javascript, Befunge, Wang Tiles, whatever. Benjamin Nowack's already noted
that this idiom may help him to make his PHP scripts available to basic GRDDL
clients.
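For the record, here's roughly what such an hturtle CGI could look like; it's
only a sketch, the Turtle-in-a-script-element extraction rule is invented for
illustration, and it assumes a recent rdflib whose serialize returns a string.

#!/usr/bin/env python
import os, re
from urllib.parse import parse_qs
from urllib.request import urlopen
from rdflib import Graph

def extract_turtle(html):
    # Invented for the sketch: pretend the microformat puts its Turtle
    # inside a <script type="text/turtle"> element.
    match = re.search(r'<script type="text/turtle">(.*?)</script>', html, re.S)
    return match.group(1) if match else ''

uri = parse_qs(os.environ.get('QUERY_STRING', ''))['uri'][0]
graph = Graph()
graph.parse(data=extract_turtle(urlopen(uri).read().decode('utf-8')),
            format='turtle', publicID=uri)
rdfxml = re.sub(r'^<\?xml[^>]*\?>\s*', '', graph.serialize(format='xml'))

print('Content-Type: application/xml')
print('')
print('<xsl:stylesheet version="1.0"')
print('    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">')
print('  <xsl:template match="/">')
print(rdfxml)
print('  </xsl:template>')
print('</xsl:stylesheet>')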
2) This morning I've been working with Dave Pawson on a simple bookmarks format
in RDF. The idea is that he captures bookmarks using the following model:
:docbook rdfs:label "docbook";
    foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/DocBook> .

<http://example.org/docbook> dc:title "Docbook FAQ";
    :topic :docbook .
And then you can easily query the data out of that using SPARQL JSON results
(hence the Python JSON
parser that I scribbled above), to get something like:
<li>
  <a href="http://example.org/docbook">Docbook FAQ</a>
  (<a href="http://en.wikipedia.org/wiki/DocBook">docbook</a>)
</li>
In demonstrating this to Dave, I made a little Python script called
bookmarks.py which does the above transformation using the SPARQLer service on sparql.org,
which as far as I can tell uses Jena/ARQ as its backend. At first I figured
that it probably only accepted RDF/XML documents, so I put the N3 source URI
through triplr first, only to find that it was
then complaining because it was getting what it thought was an N3 document
(it's only looking at the extension not the MIME type, presumably!) and
actually finding that it's RDF/XML. So I was annoyed that it was using broken
heuristics, but delighted that it didn't matter anyway because it accepts
N3.
Anyway, it works so now Dave is busy getting into SPARQL and Jena/ARQ. I
also mentioned that this might be possible with cwm, especially if you want to
do slightly more advanced stuff.
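For the curious, the JSON-results-to-HTML step is tiny. Here's a sketch of its
shape; it isn't the actual bookmarks.py, the variable names ?uri, ?title,
?wiki, and ?label are just whatever your SELECT binds, and it reuses the little
json() evaluator from above under a made-up module name.

import sys
from jsonsafe import json   # the ten-line evaluator above, saved as jsonsafe.py

def bindings_to_html(text):
    # text is a raw SPARQL SELECT result in the JSON serialisation, from a
    # query that bound ?uri, ?title, ?wiki and ?label for each bookmark.
    for b in json(text)['results']['bindings']:
        yield ('<li>\n  <a href="%s">%s</a>\n  (<a href="%s">%s</a>)\n</li>'
               % (b['uri']['value'], b['title']['value'],
                  b['wiki']['value'], b['label']['value']))

print('\n'.join(bindings_to_html(sys.stdin.read())))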
3) I've been trying to get forecast data as RDF/XML. I'm thinking about
setting up a service for it using the NOAA GFS data, but GRIB files, the binary
format that the World Meteorological Organisation invented to shunt
meteorological data around, are really difficult to parse, and even when you
use this awesome script that I found which extracts data using HTTP range
requests based on the GRIB inventory files, it still takes somewhat of an age.
And that's just for one geographical location... If people were requesting lots
of arbitrary geographical locations then it'd be too much.
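The range-request trick itself is simple enough to sketch; this isn't the
script I found, the .idx inventory suffix and the field label are assumptions,
and it leans on the wgrib-style convention of one inventory line per record
with the starting byte offset in the second colon-separated field.

from urllib.request import Request, urlopen

def fetch_record(grib_url, field=':TMP:1000 mb:'):
    # One inventory line per GRIB record; the second colon-separated field
    # is the record's starting byte offset within the GRIB file.
    inventory = urlopen(grib_url + '.idx').read().decode('ascii').splitlines()
    offsets = [int(line.split(':')[1]) for line in inventory]
    for i, line in enumerate(inventory):
        if field in line:
            end = offsets[i + 1] - 1 if i + 1 < len(offsets) else ''
            request = Request(grib_url, headers={
                'Range': 'bytes=%d-%s' % (offsets[i], end)})
            return urlopen(request).read()   # just this record's bytes
    raise ValueError('field not found in inventory: %r' % field)

Even then, you still need a GRIB decoder for the bytes that come back, which
is where most of the pain lives.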
So at the moment I have a little webservice which gets the GRIB file,
parsing out only the 1000 mb TMP (temperature) data, but that's only really up
because for some reason I apparently can't connect to any NOAA site from here.
Not sure why; traceroute just barfs out immediately, though it resolves the IP
address okay, so it's not a DNS problem I presume. Anyway, then I'm able to parse the results to
get, for example, a list of the forecasted temperatures in London, and then I
fed that into SIMILE Timeplot to
get a pretty graph. No
RDF/XML involved in that process yet, though, and even the pretty Timeplot
graph is hardly the world's most informative meteogram. Parsing GRIB files is
just so tricky; it's a shame that the NDFD data doesn't cover Europe, because
that'd probably be much easier to use.
Sean B. Palmer, inamidst.com