Site Metadata

This directory contains the remnants of a project to add simple metadata to the documents on inamidst.com, initiated in May 2005 and eventually abandoned in January 2006. The idea was that I wanted to be able to add titles, subjects, and descriptions to random pages on the site that I thought merited the most attention by site visitors, and that this information could be reused in various ways: for example, the sitemap on the inamidst homepage was once generated from this metadata.

The Code and Paraphernalia
How It Works
Footnotes and Feedback

The Code and Paraphernalia

All of the code that I developed for this project is listed below, most of it Python, along with some notes and the dataset itself. Though I abandoned the project as a whole, the code and some of the approaches still have value, and there are some reusable parts such as the Accept header parser. The original use of each file is explained below.

Filename	Last-Modified (UTC)	Size
/proj/meta/	2006-01-24 08:32:42	13,255 bytes
/proj/meta/combined.py	2005-06-05 13:19:22	1,714 bytes
/proj/meta/data.py	2005-05-15 15:52:10	11,594 bytes
/proj/meta/meta.py	2005-05-15 00:31:16	1,355 bytes
/proj/meta/metadata.tar.gz	2006-01-24 05:46:55	6,162 bytes
/proj/meta/metagen.py	2005-06-05 13:19:13	6,798 bytes
/proj/meta/metakey.py	2005-05-16 06:01:36	3,387 bytes
/proj/meta/metashell.py	2005-05-26 21:55:22	5,330 bytes
/proj/meta/notes	2006-01-23 08:57:54	769 bytes
/proj/meta/sitemap.py	2005-05-25 12:10:12	2,713 bytes
/proj/meta/tag.py	2005-12-17 15:04:15	676 bytes
/proj/meta/tagsh	2005-06-02 21:19:15	400 bytes


Annotating Pages

I wondered how best to add metadata to items in the site, and after considering several approaches I decided it'd be best to have a shadow directory. So to annotate /code/example.py, say, I'd put the annotations in the file /something/code/example.py, where "something" is the name of the shadow directory. I used a simple format based on RFC 822 headers for the annotations themselves. Here's a simple example:

title: Useful Code
description: in Python, Bash, and Javascript
keywords: doc

This format is easy to write and easy to parse. The collection of metadata files that I built up is available as metadata.tar.gz, and may be useful if you want to try out some of the scripts.
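The original parser isn't reproduced on this page, but as a minimal sketch of how such a file might be read, Python's standard email parser already understands RFC 822 style headers. The function name parse_annotation is hypothetical, and splitting keywords on whitespace is an assumption based on the examples shown here.

```python
from email.parser import Parser


def parse_annotation(text):
    """Parse a header-style annotation into a dict with lowercase keys.

    Hypothetical helper, not taken from the project's own code.
    """
    message = Parser().parsestr(text, headersonly=True)
    fields = {name.lower(): value.strip() for name, value in message.items()}
    # Assumption: keywords are separated by whitespace, as in the examples.
    fields["keywords"] = fields.get("keywords", "").split()
    return fields


sample = (
    "title: Useful Code\n"
    "description: in Python, Bash, and Javascript\n"
    "keywords: doc\n"
)
annotation = parse_annotation(sample)
print(annotation["title"], annotation["keywords"])
```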

Serving the Data

One of the main reasons for setting up the metadata was so that I could simply serve it where it was. To use the /code/example.py example again, I wanted people to be able to go to /something/code/example.py and get XHTML, RDF/XML, and plain text representations of the data. This is what data.py does. It used to run as a CGI script at /something/meta/data.cgi, and Apache's mod_rewrite routed requests to it using the following rules in the /something/.htaccess file:

RewriteEngine on
RewriteCond %{REQUEST_URI} !^/something/meta/
RewriteRule ^.+$ meta/data.cgi

Depending upon the Accept header that your browser sends to the script, or the QUERY_STRING (the part after the "?" in a URI) if you wanted to change it manually, you'd get back one of the three formats. The meta.py script itself contains some further details in its docstring. Features of interest in the code include a bi-directional dictionary, a cached environment getter, an HTTP Accept header parser plus test suite, and obviously the ability to parse and repurpose the metadata files.
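The kind of negotiation described above might look roughly like the following sketch. It is not the parser from data.py: it only ranks media types by their q values, ignores wildcards and q=0 exclusions, and the format names "xhtml", "rdf", and "text" used for the query string override are assumptions made for illustration.

```python
def parse_accept(header):
    """Return the media types in an Accept header, most preferred first."""
    choices = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        quality = 1.0  # the default when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Sort by descending quality, then by original position.
        choices.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(choices)]


# Hypothetical mapping from media types to the three representations.
FORMATS = {
    "application/xhtml+xml": "xhtml",
    "text/html": "xhtml",
    "application/rdf+xml": "rdf",
    "text/plain": "text",
}


def choose_format(accept, query_string=""):
    """Pick a representation; an explicit query string wins over Accept."""
    if query_string in FORMATS.values():
        return query_string
    for media_type in parse_accept(accept):
        if media_type in FORMATS:
            return FORMATS[media_type]
    return "xhtml"  # assumed fallback when nothing matches
```

Under these assumptions, choose_format("text/plain;q=0.5, application/rdf+xml") selects the RDF/XML representation, while passing "text" as the query string forces plain text whatever the browser sent.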

Making Sitemaps

The next thing I wanted to do with all the data was to make a sitemap out of it, which is what metagen.py does. This was the file that I was using to generate the inamidst homepage sitemap at one point, though it's also possible to use it in a number of ways, as the once-CGI sitemap.py demonstrates.

The metagen.py module recursively looks through the metadata files, and then spits them out as a long HTML list, using the canonical paths, titles, descriptions, and adding classes for the keywords. My usual practice for the keywords was to associate a different snazzy icon, each of which was designed by Cody Woodard, with each keyword. Here are the icons that I used for my particular keyword set:

- all (all items)
- act (action)
- code (programming project)
- doc (documentation)
- pub (publication)
- svc (service)
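A rough sketch of the walk-and-emit approach, not the actual metagen.py: it assumes one "name: value" field per line, treats a file named index as describing its directory (the convention explained further down this page), and uses the keywords directly as CSS class names. The function names are hypothetical.

```python
import html
import os


def read_fields(path):
    """Read 'name: value' lines from an annotation file into a dict."""
    fields = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            name, sep, value = line.partition(":")
            if sep:
                fields[name.strip().lower()] = value.strip()
    return fields


def sitemap_html(shadow_root):
    """Walk the shadow directory and return the annotations as an HTML list."""
    items = []
    for dirpath, dirnames, filenames in os.walk(shadow_root):
        dirnames.sort()  # keep the traversal order deterministic
        for filename in sorted(filenames):
            full = os.path.join(dirpath, filename)
            fields = read_fields(full)
            uri = "/" + os.path.relpath(full, shadow_root).replace(os.sep, "/")
            if uri.endswith("/index"):
                # A trailing /index annotates the directory itself.
                uri = uri[:-len("index")]
            items.append('<li class="%s"><a href="%s">%s</a>: %s</li>' % (
                html.escape(fields.get("keywords", "")),
                html.escape(uri),
                html.escape(fields.get("title", uri)),
                html.escape(fields.get("description", "")),
            ))
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

With the keywords as classes, a stylesheet can then attach a different icon to each class of list item.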

One great advantage of the metadata files is that they could be used to provide a canonical URI for some of my scripts. With CGIs, you can often add a slash, "/", to the end of them and it'll get passed to the script in the PATH_INFO environment variable and possibly ignored. So if you have a CGI file called example.cgi, you could access it at either /example or /example/. Depending upon various things, the phase of the moon and so on, I would choose whether I preferred the script to have a slash on the end or not. Since the logic that I used was generally beyond computation (or perhaps not, but I couldn't be bothered to explicitly code my intuition on this matter), it was great that I could make the choice and record it in the metadata database.

So, if I wanted to have /example be canonical, I'd put the annotations at /something/example; if I wanted it to be /example/, I'd put the annotations at /something/example/index instead. Note that up until this point, and in fact still since I've abandoned this project, I used the heuristic of sniffing for "PATH_INFO" in the source of the CGI to determine whether it should have a trailing slash or not.
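The mapping between canonical URIs and annotation files, and the PATH_INFO heuristic, can be sketched as below. The function names are hypothetical, and "/something" stands in for the shadow directory as it does in the text.

```python
def shadow_path(uri, shadow="/something"):
    """Map a canonical site URI to the annotation file that describes it."""
    if uri.endswith("/"):
        uri += "index"  # /example/ is annotated at /something/example/index
    return shadow + uri


def canonical_uri(annotation_path, shadow="/something"):
    """Recover the canonical URI from an annotation file's path."""
    uri = annotation_path[len(shadow):]
    if uri.endswith("/index"):
        uri = uri[:-len("index")]  # keep the trailing slash
    return uri


def wants_trailing_slash(cgi_source):
    """The fallback heuristic: does the CGI's source mention PATH_INFO?"""
    return "PATH_INFO" in cgi_source


print(shadow_path("/example"))
print(shadow_path("/example/"))
print(canonical_uri("/something/example/index"))
```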

Tools for Authoring

Next I found that writing metadata files all the time got a bit laborious in nano, so I decided to write a couple of scripts to automate the process. The first script was simply called meta.py, and took a single argument: the path of the file that I wanted to annotate. This allowed me not to have to worry about, for example, the fact that to annotate /dirname/ I'd have to edit the file at /something/dirname/index. All that would be handled by meta.py.

The script prompts for field values and then writes out the resulting file. I had planned to make it easy to add and remove keywords from the command line, and to edit files that already exist, but I had a better idea, so I didn't get around to updating meta.py itself, which still has only its original simple capability.

Shells and Virtual Directories

The idea that I had was metashell.py, essentially a shell for viewing and editing the metadata hierarchy. Instead of using just the path hierarchy that the site uses, however, I decided to allow navigation by keywords too. Here's a transcript of a metashell.py session that better illustrates the principles involved:

$ ./metashell.py 
sbp@metashell:~$ ls
~/act
~/doc
~/highlight
~/inamidst
~/pub

sbp@metashell:~$ cd inamidst

sbp@metashell:~/inamidst$ ls
/list/ - Directory Browser
/misc/updates-rss - Recent Updates: RSS 1.0 Feed
/trove/ - Decortrove

sbp@metashell:~/inamidst$ ls /swhack/
/swhack/index

The ~/ hierarchy was for the keywords, and the / hierarchy matched the actual paths on the site. Everything that's in the / hierarchy also appears in the ~/ hierarchy, but the organisation of the ~/ hierarchy was in some ways much easier to navigate than the / hierarchy. In a sense, it was using keywords to reinvent the classification scheme that the site uses. The metashell also enabled the viewing and editing of annotation files:

sbp@metashell:~/inamidst$ make /about/index
Title: About inamidst.com
Description: All about the site
Keywords: inamidst
Created /about/index successfully

sbp@metashell:~/inamidst$ cat /swhack/index
title: Swhack Archive Mirror
description: the Swhack cultural forum's IRC logs mirror
keywords: archive doc highlight

sbp@metashell:~/inamidst$ edit /swhack/index
Current: title: Swhack Archive Mirror
Replacement Value: 
Current: description: the Swhack cultural forum's IRC logs mirror
Replacement Value: mirror of swhack.com's IRC logs
Current: keywords: archive doc highlight
Replacement Value:

Errors are handled gracefully, most of the time, and Ctrl+D exits the session, as with most shells and programs:

sbp@metashell:~/inamidst$ ls /nosuchpath
Error: [Errno 2] No such file or directory: './meta/nosuchpath'

sbp@metashell:~/inamidst$ ^D

The commands are fairly similar to those in Unix, though make creates a new annotation file rather than running GNU make, of course.
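The ~/ keyword hierarchy amounts to an inverted index from keywords to annotated paths. The following is a minimal sketch of that idea rather than metashell.py's own code; the function names and the in-memory annotations dict are assumptions, with sample values borrowed from the transcript above.

```python
from collections import defaultdict


def build_keyword_index(annotations):
    """Group annotated paths, with their titles, under each keyword."""
    index = defaultdict(list)
    for path, fields in sorted(annotations.items()):
        for keyword in fields.get("keywords", "").split():
            index[keyword].append((path, fields.get("title", "")))
    return index


def ls(index, location):
    """List the keyword directories at '~', or the items under '~/keyword'."""
    if location == "~":
        return ["~/" + keyword for keyword in sorted(index)]
    keyword = location[2:] if location.startswith("~/") else location
    return ["%s - %s" % item for item in index.get(keyword, [])]


# Illustrative data only, not the real metadata set.
annotations = {
    "/list/": {"title": "Directory Browser", "keywords": "inamidst"},
    "/swhack/": {"title": "Swhack Archive Mirror",
                 "keywords": "archive doc highlight"},
}
index = build_keyword_index(annotations)
print(ls(index, "~"))
print(ls(index, "~/inamidst"))
```

An item with several keywords shows up under each of them, which is what lets the ~/ hierarchy cut across the site's own path hierarchy.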

Where It Went Wrong

So, why abandon a project that had so much code behind it? One of the main reasons was that I wasn't bothering to update the metadata database anymore, even with tools such as metashell.py, because it was difficult to track which files were already in it and which weren't. I wrote combined.py (actually a CGI: combined.cgi) to try to get around this problem. It combined a list of the entire site with a list of all the annotated pages, so I could see at a glance which ones were missing. But as the site grew to thousands of files, even combined.py became impractical to use.
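The comparison at the heart of that idea is a set difference between two file listings. Here is a simplified sketch, not combined.py itself: it assumes the site and the shadow directory are separate trees with matching relative paths, and it ignores the index convention for directories. Both function names are hypothetical.

```python
import os


def list_files(root):
    """Return every file under root as a site-style path such as /code/x.py."""
    found = set()
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            relative = os.path.relpath(full, root).replace(os.sep, "/")
            found.add("/" + relative)
    return found


def unannotated(site_root, shadow_root):
    """List the site files that have no matching annotation file."""
    return sorted(list_files(site_root) - list_files(shadow_root))
```

The output grows with the number of unannotated files, so on a site with thousands of files the list itself becomes the problem.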

Another idea I came up with was to use a Greasemonkey script to add a small link to the top right of each page on inamidst which, when followed, would pop up an editing window that would allow me to make changes to the page's annotations, or to create it if it didn't yet exist. But this approach wouldn't work for anything but HTML, and would take a fair amount more implementation. Since inamidst is a staunchly static site, too, I'd've had to find some way of syncing the metadata files from another site. It was all too much bother.

A secondary reason was that I don't like keyword based organisation systems. I don't like folksonomies, and I don't like tagging. It's just not very useful, and moreover it's been around forever (under many different names: subjects, keywords, virtual folders...) and doesn't do much until someone gives it yet another new name. My concern on inamidst was that I was creating a completely different hierarchy to the one that the site was using. I decided that it was probably better to make the changes to the site itself rather than have to maintain a set of keywords, which is what I would have had to do if I had kept up my plans to, for example, expose metashell.py as a kind of webservice. The tagsh and tag.py scripts were the beginnings of that.

The system turned out to be fragile and unwieldy despite my best efforts to make it user friendly, and though the approach is interesting and the code is possibly to a large extent reusable, the project as a whole is not something that I need on inamidst.

But what about the canonical URIs, and what about the sitemaps? For the canonical URIs, I'll probably be moving that facility to my site inventory script, which is what takes care of most of this stuff anyway. As for sitemaps, I've decided that a viable alternative is to use, for example, my server logs to see which pages are the most popular. And perhaps in the not so distant future there will be filesystems where annotating data is as natural as creating the file in the first place. That remains a dream for now, even though a few experimental systems of that kind already exist; none has yet gained traction.

Footnotes and Feedback

For feedback, general comments, or requests for help, try the Swhack IRC channel on Freenode. If you can't find help there, you may contact the author by email.

Sean B. Palmer, inamidst.com