Organise Hypertexts

by Sean B. Palmer

When redesigning my homepage I did a quick survey of my best web articles to see what I should link to. I found out that all of my best stuff was a disorganised mess, so I set out to figure out why it happened and how to put it right.

Good content beats design and branding

Focus on enthusiasms, not sites

The main problem with my articles is that I used to keep coming up with new sites, but the sites didn't last as long as the content. Instead of thinking about sites, it would be better to think about the enthusiasms themselves.

Design for a sticky slope

We don't tend to do what we planned. I often start writing a grand page, and then it turns out to be much smaller than I'd wanted. So I might start writing a page about Coleridge, and it all turns out to be about the Ancient Mariner. All of the cruft around the page will dress it up to be about Coleridge, so I should have planned for a sticky slope at the beginning.

Sites become autobiographies

You tend to work on what you like, so websites become a bit like autobiographies. They may include a lot of rubbish as a result. William Loughborough is fond of saying that the clutter is inherent to the organism. Sometimes you wouldn't create the great trees if it wasn't for the mucky soil that they grew out of.

Cholarchies beat hierarchies

Things on a website have to be organised somehow. After trying lots of different techniques, I decided that one flat directory was best. Then you can use inherent metadata to form collections, which are much more flexible than traditional sites.

Reject depth first hierarchies

People who are obsessed with organisation often want to start over again. When you want to clear all the clutter away and start again, that breaks links. You also start to feel bad about your organisation skills.

Usually the problem arises when you try out an unstable classification scheme using lots of things at the root of a hierarchy. Then when you want a new root hierarchy, you have to clean it out. So the answer might be to start from the principle of having already trashed something. There are two ways of trashing something, broadly speaking:

Move everything to an archive folder, start again
Delete most stuff, but keep and reorganise some

In the first case, you end up with a subdirectory which contains your old file tree. So why not begin with what is effectively an old file tree? Start with:

http://example.org/01/essays/photography

And when you fill /01/ with tons of rubbish, and you despair of ever really fixing it without breaking all the links, simply move on to /02/.

The problem is that the very idea of starting depth first is itself a classification scheme. Certainly the deeper you start, the less likely you are to end up needing a complete root redesign, but the root is such fertile land that it may be difficult to resist.

Reject date based directories

Some people use essay URIs like this:

http://example.org/2009/photography

There are two major problems with using a date in a URI. First, people will always think that the article is old content from that particular year, even if it's being constantly updated. Second, dates in URIs are difficult to remember.

Dates also provincialise an article, making it feel like it's not generic, and that you shouldn't bother updating it in place because it's “archived” content in some way. So the user problems transfer to the author.

The problem with dates in general is that they're hard and fast, and yet they're also very inflexible. Consider the very meaning of a date like 2009. It's supposed to be 2009 years since the birth of Jesus, and yet as far as scholarly opinion can tell, Jesus was born either several years BC or a few years AD, but certainly not in 1 AD.

But there's another interesting point that can be derived from this. In the early creed of the Christian church, they don't say that Jesus was born in 1 AD, because obviously that wouldn't be much help if you didn't already know what AD meant, and that had yet to be established. So they date him by the local ruler instead, and say that he suffered under Pontius Pilate.

It's a bit like using landmarks instead of Ordnance Survey references. If someone asks you where the Millennium Bridge is, you might say it's at PQ 9187235 or what-have-you, but in most circumstances it would be better to say something like: it's just to the south of St. Paul's Cathedral.

Unfortunately, though this system is a good one in principle, there does not seem to be any adequate mapping onto URI design.

Use one big directory and labels

The only way to get around using some sort of a hierarchy, and hence a classification scheme, is to not have one at all. You can't be constantly moving things around when there is nowhere to move them to.

When I used Azimuth, some old weblog software, I used to just choose short names for all my posts and put them in one directory. This worked very well, because it was extremely rare that I'd ever want to use a name I'd already selected. The trick was to be somewhat specific.

In other words, if a page is about The Genome of the Common Pea, then you shouldn't use something like “genome” or “pea”, but rather “peagenome”, or even “peanome” if you want to be cute about it. Unfortunately, though, I must admit that recently I've been preferring URIs which don't use a smush of elements.

This has admittedly been making it much harder to name things. One of the pages that I've had the most trouble with is my article containing A Proof of Pythagoras' Theorem. To call it just “proof” or “pythagoras” or “theorem” wouldn't be a good idea, because those names would more accurately describe a page about proof, or about Pythagoras, or about theorems.

Even this was eventually solvable though, by simply renaming the page A Proof of the Pythagorean Theorem, which leads to the suitable pythagorean.html. The one big directory approach isn't foolproof, but it gives you a minimum of elements, of classificational components, to have to worry about.
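A minimal sketch, in Python, of checking whether a short name is still free in one big directory. The function names and the sample filenames are assumptions made for illustration; the only idea taken from the text is that a name counts as taken whatever its extension.

```python
from pathlib import PurePosixPath


def taken_names(filenames):
    """Return the set of base names already used, ignoring extensions."""
    return {PurePosixPath(filename).stem for filename in filenames}


def is_free(name, filenames):
    """True if no file in the flat directory already uses this base name."""
    return name not in taken_names(filenames)


# Hypothetical contents of the one big directory.
existing = ["pythagorean.html", "peagenome.html", "dickinson.html"]

print(is_free("pythagorean", existing))
print(is_free("coleridge", existing))
```

A check like this only catches exact clashes. Choosing a name specific enough to describe the page is still a job for the author.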

Collections beat sites

Paste on organisation as annotation, don't bake it in

What's the difference between a site and a collection? A site is something which is organised to be a collection of strongly related material. I have sites, for example, about Strange Lights, Samuel Taylor Coleridge, and Emily Dickinson. A collection on the other hand is a heterogeneous assortment of materials, such as a weblog.

When you grow a site, you might start out with a single page. For example, I might start by writing about Emily Dickinson at /dickinson. Then I might realise that the biography section is quite big and ought to be split out, so I have /dickinson and /emilybio. Then I might realise that I want a site, so I have to convert /dickinson to /dickinson/, i.e. make it a folder, and then move /emilybio into /dickinson/biography.

What's the answer to this problem? To always start with a directory? Once I did that on infomesh.net, and found that I had dozens of annoying directories with only index.html in them, so then I converted most of them back from essay/index.html to essay.html, and added in an Apache redirect for them:

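RewriteEngine on
# Leave real directories alone
RewriteCond %{REQUEST_FILENAME} !-d
# Only act when a matching .html or .txt file exists
RewriteCond %{DOCUMENT_ROOT}/$1.html -f [OR]
RewriteCond %{DOCUMENT_ROOT}/$1.txt -f
# Redirect a path ending in .html/ or .txt/ to the bare name
RewriteRule ^(.+)\.(html|txt)/$ /$1 [R]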

So the problem can work in reverse too! This happened most notably in my Strange Lights site. I created wisp.html about the Will-o'-the-Wisp, and then moved that to wisp/index.html because I wanted to have wisp/NameYear files describing particular sightings. Then later I decided that actually I'd rather group by place than by name, and that they should be ungrouped by sighting type. So now I want to go back to wisp.html.

Use meta attributes to categorise

For example:

<meta collection="Essays">
<meta created="2010-02-01">
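A minimal sketch, in Python, of how labels like these could be read back out of a flat directory to form collections. The class and function names and the sample pages are assumptions made for illustration. The reader accepts the shorthand attributes shown above and, as an extra, the standard name and content pair.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaReader(HTMLParser):
    """Collect key/value pairs from every <meta> element in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        pairs = dict(attrs)
        if "name" in pairs and "content" in pairs:
            # Standard form: <meta name="collection" content="Essays">
            self.fields[pairs["name"]] = pairs["content"]
        else:
            # Shorthand form: <meta collection="Essays">
            self.fields.update(pairs)


def read_meta(html_text):
    """Return a dict of the metadata found in one page."""
    reader = MetaReader()
    reader.feed(html_text)
    return reader.fields


def build_collections(pages):
    """Map each collection label to a sorted list of page names."""
    collections = defaultdict(list)
    for name, html_text in pages.items():
        label = read_meta(html_text).get("collection")
        if label:
            collections[label].append(name)
    return {label: sorted(names) for label, names in collections.items()}


# Hypothetical pages living side by side in one directory.
pages = {
    "pythagorean.html": '<meta collection="Essays"><meta created="2010-02-01">',
    "peagenome.html": '<meta collection="Essays">',
    "wisp.html": '<meta collection="Strange Lights">',
}
print(build_collections(pages))
```

Because the label lives inside the page, moving a page from one collection to another means editing one attribute, not renaming a file or breaking a link.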

Choosing the size of pages

One of the biggest site design problems for me is whether to group things together onto a single page, or split them into individually addressable components. I call this The Citrus Problem because taxonomists are unable to decide whether the citruses are a large family with lots of small subgenera, or a small family with relatively few groups.

This very page is a good example of the problem. There are numerous bits of content in this page, i.e. sections, and it would be nice to address them individually. You could use automatically generated element IDs, but sometimes it's nice to have individual pages.

One idea that I had was to split them all up, and then create an index page using JavaScript. This avoids the problem of having to choose between server side frying vs. baking: it's client side frying, done in such a way that it degrades to the server side baking. But as with all solutions of this kind, it turned out to be too much of a cognitive burden, and to feel too technologically fragile.

Generally I now feel that larger pages are better than smaller pages. This problem is as old as the hills though.

Handy numbering technique

When you collect lots of files together in a directory, it's very easy to lose track of which file is used in which other file, especially styles and images. To get around this problem, I tend to name images that I know will only belong to one particular page using a numbering scheme. If the article is called article.html, then the images that it uses will be called article01.jpg, article02.png, article03.jpg, and so on.

The only problem comes when you want to extend this to HTML pages too as a kind of sub collection. So for example there are numbered works by Emily Dickinson which I wanted to call dickinson01.html, dickinson02.html and so on. But then what do I call the images that are used by dickinson.html? I decided to use three digits for those, such as dickinson001.jpg.
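A minimal sketch, in Python, of using the numbering scheme to recover which files belong to a page. The function name and the directory listing are assumptions made for illustration. The match is deliberately loose: any file named after the page plus two or three digits is claimed, with HTML files treated as sub pages and everything else as assets.

```python
import re


def group_files(page, filenames):
    """Split a flat listing into the numbered sub pages and assets of one page."""
    pattern = re.compile(re.escape(page) + r"(\d{2,3})\.(\w+)")
    groups = {"subpages": [], "assets": []}
    for filename in sorted(filenames):
        match = pattern.fullmatch(filename)
        if not match:
            continue  # the page itself, or a file belonging to another page
        kind = "subpages" if match.group(2) == "html" else "assets"
        groups[kind].append(filename)
    return groups


# Hypothetical listing of the one big directory.
listing = [
    "article.html", "article01.jpg", "article02.png",
    "dickinson.html", "dickinson01.html", "dickinson02.html",
    "dickinson001.jpg",
]
print(group_files("dickinson", listing))
print(group_files("article", listing))
```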

Social navigation beats structural navigation

Does it matter if you move things?

Once I was chatting to a guy online and he was very particular about site arrangement and had read some of my essays on the subject. But then he came out with a startling opinion: it didn't matter if he moved his pages around.

He explained that it didn't matter if he moved pages, because people tended to find them again with search engines anyway. I thought about how I manage broken links. I tend to try the Web Archive, Google for the page name if the whole site is gone, or check the new site index if it may have moved. And he's right — this sort of link breakage happens a lot, so I've become quite practised in the art of finding moved pages.

This made me realise not only that it was probably not such a bad thing to move pages around, as long as it wasn't done too indiscriminately, but also that our traditional models of how we structure a site for navigation were probably not very good.

Reject catchment areas

This reminded me of something that I did for my Strange Lights site. Once I cleared away all the old URIs and started anew. But because I knew that some people had linked to the old pages, I made a suitable Not Found page which linked to:

The Web Archive version of the specified page
The home page of the current site

In other words, it's a kind of smart Not Found page, making the whole site act as a kind of catchment area for inbound links. If the inbound links have been moved about, then the organisation may have changed too, so the inbound links will in a sense actually be redundant. If they're redundant, then an informative Not Found page may actually be the most appropriate thing to return.
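A minimal sketch, in Python, of generating such a Not Found page. It assumes the Web Archive's address pattern of https://web.archive.org/web/ followed by the original URL. The function name, the wording of the page and the example.org addresses are also assumptions made for illustration.

```python
from html import escape

# Assumed Web Archive prefix; the original URL is appended to it.
ARCHIVE_PREFIX = "https://web.archive.org/web/"


def not_found_page(site_root, requested_path):
    """Return HTML for a Not Found page offering two places to look."""
    home_url = site_root.rstrip("/") + "/"
    missing_url = home_url + requested_path.lstrip("/")
    archive_url = ARCHIVE_PREFIX + missing_url
    return (
        "<title>Not Found</title>\n"
        f"<p>{escape(missing_url)} is no longer here. Try:</p>\n"
        "<ul>\n"
        f'<li><a href="{escape(archive_url)}">The Web Archive version of this page</a></li>\n'
        f'<li><a href="{escape(home_url)}">The home page of the current site</a></li>\n'
        "</ul>\n"
    )


print(not_found_page("http://example.org", "/wisp/old"))
```

The server would still need to send this body with a 404 status, so that search engines and link checkers learn that the old address is gone.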

Unfortunately this only works for sites, not for heterogeneous collections. In other words, you have to have a kind of funnel to catch inbound links, and that funnel is contingent on the site growth and citrus problems.

Social navigation and discovery

How do we find information these days?

Search engines
Links from friends
Links from websites

When you find an interesting weblog entry, do you often browse the directory index? In practice, when I find something it's usually because I'm interested in that specific thing, and only very rarely do I want to see other things written by the same author.

But, if I do want to see other things by the same author, then I'll browse indiscriminately. This generally involves a tab flood. Find a place where lots of things are linked from, and then keep opening the more interesting looking ones.

And because this is the web, I don't bother reading summaries, only titles and link texts, so you don't have to maintain any essay summaries separate from essays themselves.

Rather than indices, then, one should strive if possible to follow the Wikipedia method of indexing. We should simply weave together the content as closely as possible. If you only have five independent random essays, then why bother contextualising them together?

Counteract hierarchies with an about page

One problem encountered with having a One Big Directory collection is that whatever goes at the “home page”, at /, is considered in some sense to be the master of the whole collection. So either you can put very little there or, better yet, you can redirect it to /about.html which puts the home page on the same level as the subpages.

Stamping beats frying

Generating topological views
Adding boilerplate to existing HTML files in an intuitive way
8vo for creating a feed from disparate pages

Keep your pages simple

When you're editing pages manually, rather than relying on templating or script-based site generation, you'll want to keep things simple. This will make things easier on users anyway.

Use simple structures:

Take out most branding
Take out administrative debris
Give overviews early
Give pages some structure

Use simple formats:

Don't bother with HTML validation
Eschew plain text HTML replacements

Use simple styles:

Avoid per-document stylesheets
Have a range of very simple styles

There are still some choices:

Where should the address go?
When should displayed quotations be used?
Where are figures to be placed?
Are figure captions acceptable?