Avocet: Structured Text

Avocet is a structured text language that can be easily converted to XHTML, a la Markdown, reST, and wiki language. See the Avocet homepage for more details, or just go straight ahead and download avocet.py to use it.

This document is an old design specification for the language, a kind of essay on how it came to be. The idea behind Avocet was for it to be a Poor Man's Hypertext. The requirements are a) Minimal feature set. No shiny gimmicks! b) A natural syntax. Use inherent metadata! and c) Easy conversion into HTML or XHTML. In other words, the aim is to keep it simple, whilst enabling the most common and useful subset of tasks that one needs to perform with HTML.

How it Works

The popular structured text language Markdown has a lot of redundant syntax. For example, to make a heading:

  This is a Heading
  =================

The redundancy is that headings already have an inherent syntax. People think that programs can't distinguish between headings and sentences, but with some very simple heuristics they can. Capitalisation and punctuation are the most essential elements:

   This is a Heading
   This is a sentence.

With this in mind, I set to work on a regular expression which encapsulates the implicit syntax of a heading, as distinct from a one line sentence. The main constraint is that this works best with standard English title capitalisation styles, where, as Prof. Jack Lynch says, in “most house styles, all the major words in an English title are capitalized - 'major' meaning the first word, the last word, and everything in between except articles, conjunctions, and prepositions”.

Having said that, the regular expression that I settled on after testing was one which is more punctuation than case oriented:

   r'^(?!.*\b[a-z]{5,}\b)([A-Z][A-Za-z0-9 ,:;!?-]{,50}[A-Za-z0-9!?])$'

This came as a result of testing on a huge amount of plain text input from various sources and inspecting the output, so it's roughly descriptive. Using this same principle of descriptiveness, I found that it's possible to obtain similar results for other constructs, for example preformatted blocks. But let's outline the required elements of syntax first.
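As an aside, the heading expression above can be tried out directly. The following is only a minimal sketch: the expression is copied verbatim from this document, but the names HEADING and is_heading are made up for illustration and are not taken from avocet.py.

```python
import re

# The heading expression quoted above, unchanged.
HEADING = re.compile(
    r'^(?!.*\b[a-z]{5,}\b)([A-Z][A-Za-z0-9 ,:;!?-]{,50}[A-Za-z0-9!?])$'
)

def is_heading(line):
    """Return True if a single line of text matches the heading expression.

    is_heading is a hypothetical helper name, used here for illustration.
    """
    return HEADING.match(line) is not None

print(is_heading("This is a Heading"))    # True
print(is_heading("This is a sentence."))  # False
```

The sentence is rejected twice over: it contains an all-lowercase word of five or more letters ("sentence"), and it ends in a full stop, which the final character class does not allow.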

Syntactic Requirements

What kinds of features do I require? At a minimum:

Here's a summary of the state of the art of how I want to implement each:

Of these, preformatted, hypertext, and the undescribed other deserve further mention.

Preformatted

Preformatted sections are those in which whitespace is to be preserved; thus it follows that a section containing a certain supernormal configuration of whitespace is likely to be a preformatted section. Consider, for example, the following code:

def hello(*args): 
   for name in args: 
      print "Hello %s!" % name

From this, it's quite clear that it has to be displayed in a preformatted way in order for it to be legible. The main clue is the leading whitespace in the second and third lines. Indeed, it may generally be the case that where more than two or three spaces are used in a row in any bit of text, it's likely that it ought to be preformatted.
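The whitespace clue described above could be sketched roughly as follows. This is an illustration under assumptions, not the rule used by the Avocet implementation: the threshold of three spaces in a row and the name looks_preformatted are both invented here.

```python
import re

# Assumption: a run of three or more spaces marks "supernormal" whitespace.
WIDE_GAP = re.compile(r' {3,}')

def looks_preformatted(block):
    """Guess whether a block of text should be preformatted.

    A block qualifies when any of its lines contains a wide run of spaces.
    """
    return any(WIDE_GAP.search(line) for line in block.splitlines())

code = 'def hello(*args):\n   for name in args:\n      print "Hello %s!" % name'
prose = "Hello there. Please see my website."

print(looks_preformatted(code))   # True
print(looks_preformatted(prose))  # False
```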

As mentioned, though, the actual rules for determining a preformatted section could get very complicated indeed. One interesting example is poetry: consider the following haiku:

Leaves falling
Lie on one another
The rain beats the rain

It's obvious again from its structure that it has to be preformatted to some extent, but this time the configuration of whitespace is the line endings after such short runs of text. This in fact could be represented as a paragraph with line breaks in it, and indeed it seems prudent for abnormally short lines in paragraphs to have line breaks after them. What the limit for this should be is uncertain; in the Avocet test implementation at the moment, lines of length 60 and under are treated in this way.
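A rough sketch of the short line rule, assuming XHTML output: the limit of 60 comes from the text above, but the function name join_lines, the use of "<br />", and the decision not to add a break after the final line are assumptions made for this example.

```python
SHORT_LINE = 60  # lines of this length or under get a line break after them

def join_lines(paragraph):
    """Join the lines of a paragraph, keeping breaks after short lines."""
    lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if i < len(lines) - 1:
            # Short lines keep their break; long lines are treated as wrapped.
            out.append('<br />\n' if len(line) <= SHORT_LINE else ' ')
    return ''.join(out)

haiku = "Leaves falling\nLie on one another\nThe rain beats the rain"
print(join_lines(haiku))
```

With the haiku as input, every line is under the limit, so each of the first two lines is followed by a line break and the three lines stay separate.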

Hypertext

The age old problem. In Your Face URIs are ugly in HTML, but in plain text they're necessary, so it's therefore necessary to tart them up as much as possible. In email, a common convention is to use footnotes:

   Hello there. Please see my website [1]. Thanks.

   [1] http://example.org/

Unfortunately, this convention has two major drawbacks: a) you have to keep referring to the end of the email to follow links, and b) if you want to insert a new footnote, you have to renumber all the footnotes that occur after it.

So most wiki syntaxes avoid this method of hypertext. Instead, they normally use inline references. For example, in Wikipedia, you link "[http://example.org/ Like This]". In the Semantic Web Interest Group's chumpbot, you link in a similar format, only using a pipe to separate link text from URI.

But what is the underlying problem? Hypertext is the association of some run of text with an out of band URI. So the syntactic problem is twofold: one must identify the text in question, and the associated URI.

The most natural way of doing this would be to use the URI to delimit one of the ends of the text, and then use a special marker to delimit the other end. In other words, you end up with something like:

   Please view <DELIMITER> my website http://example.org/ because it rocks.

This has a few problems:

The first and second problems can be fixed in one fell swoop: by placing the URI in parentheses, perhaps optionally, adding punctuation is no problem and it suddenly looks a lot more aesthetically pleasing:

   Please view <DELIMITER> my website (http://example.org); it rocks.

Whilst parens are valid URI characters, it's easy again to use heuristics to determine whether a paren is part of the URI or not: compare (http://example.org/path) and http://example.org/#xptr(). The Colloquy IRC client for OS X does this, or tries to, I believe. But unfortunately the paren syntax does introduce another concrete problem anyway, as Javier Candeira (with whom this hypertext syntax system was developed) noticed: what if you want to use an In Your Face URI, perhaps as an example, in parentheses? His example of this was:

   Amazon's new URIs (http://amazon.com/path/example) are great!

Of course, this contains no delimiter: it'd only be a problem were one to include another rather obvious innovation, which is that when the delimiter is not present, the single word to the left of the URI is used as the link text; so that in the example above, "URIs" would be the unfortunate recipient of the linking. One possible solution to this is to simply force people to rewrite in such situations:

   Amazon's new URIs ("http://amazon.com/path/example") are great!
   Amazon's new URIs (e.g. http://amazon.com/path/example) are great!

Both of these could be rendered as-is. Javier notes that if there is to be a mechanism for inline code examples, it could be deployed as the solution here.
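One possible form of the paren heuristic mentioned earlier is to drop a trailing close paren only when it has no matching open paren inside the URI itself. The sketch below assumes exactly that, and additionally assumes that trailing sentence punctuation is never part of a URI; trim_uri and find_uris are hypothetical names, and this is not a description of what Colloquy or Avocet actually do.

```python
import re

# Assumption: a URI candidate is an http(s) scheme followed by anything
# other than whitespace, angle brackets, or double quotes.
URI_CANDIDATE = re.compile(r'https?://[^\s<>"]+')
TRAILING = '.,;:!?'  # assumed never to end a URI

def trim_uri(candidate):
    """Strip trailing punctuation and any unmatched closing parens."""
    candidate = candidate.rstrip(TRAILING)
    while candidate.endswith(')') and candidate.count(')') > candidate.count('('):
        candidate = candidate[:-1].rstrip(TRAILING)
    return candidate

def find_uris(text):
    """Return the URIs found in text, after trimming."""
    return [trim_uri(match.group()) for match in URI_CANDIDATE.finditer(text)]

print(find_uris("compare (http://example.org/path) and http://example.org/#xptr()."))
# ['http://example.org/path', 'http://example.org/#xptr()']
```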

The use of a delimiter is harder to avoid. It's very, very difficult to obtain the run of text heuristically. As proof of this, Javier and I noticed that in the following example we'd both place the delimiter (here represented by "@") in different places:

   Javier: I've been @thinking about learning Pluvo http://inamidst.com/pluvo/
   Sean: I've been thinking about learning @Pluvo http://inamidst.com/pluvo/

In other words, it's a thing of personal preference; a matter of style. A program should not be required to infer style, so it's necessary that some sort of delimiter be used.

So far, the best choice appears to be "=>\w". Things I've rejected so far include " " and "\.\.\w".
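To make the pieces concrete, here is a small sketch of how the "=>" delimiter and the one word fallback might be turned into XHTML links. It is not hypertext.py: the expression, the name link_up, and the restriction to parenthesised URIs that contain no parens themselves are all simplifying assumptions for this example.

```python
import re
from html import escape

LINK = re.compile(
    r'(?:=>(?P<run>\w[^()\n]*?)'            # "=>" then the run of link text
    r'|(?P<word>\w+))'                      # or, with no delimiter, one word
    r'\s+\((?P<uri>https?://[^\s()]+)\)'    # then the URI in parentheses
)

def link_up(text):
    """Replace 'text (URI)' patterns with XHTML anchors.

    Only the matched link text and URI are escaped; the rest of the
    input is passed through untouched.
    """
    def render(match):
        label = match.group('run') or match.group('word')
        uri = escape(match.group('uri'), quote=True)
        return '<a href="%s">%s</a>' % (uri, escape(label))
    return LINK.sub(render, text)

print(link_up("Please view =>my website (http://example.org); it rocks."))
# Please view <a href="http://example.org">my website</a>; it rocks.
```

Run on "Amazon's new URIs (http://amazon.com/path/example) are great!", the same function links the single word "URIs", which is exactly the unfortunate case described earlier; the quoted rewrite of that sentence is left as-is.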

It might be desirable to allow hosts, a la "(example.org)", too; and to allow the footnote style so rampant in email, perhaps with the option of having the footnotes appear at any point in the text so that renumbering is made easier (in other words, you can have a paragraph with "[1]" and "[2]" in it, define those footnotes' URIs after that paragraph, and then start again numbering from 1 in the following paragraph).
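The locally numbered footnote idea could work along these lines. This is a speculative sketch, assuming that paragraphs are separated by blank lines and that a paragraph consisting only of "[n] URI" lines is a footnote block which resolves the references in the paragraphs since the previous such block; resolve_footnotes is a hypothetical name, and for brevity the output is not escaped.

```python
import re

FOOTNOTE = re.compile(r'^\s*\[(\d+)\]\s+(\S+)\s*$')  # "[1] http://example.org/"
REFERENCE = re.compile(r'\[(\d+)\]')                 # "[1]" inside prose

def resolve_footnotes(text):
    """Return the prose paragraphs with footnote references linked."""
    resolved = []
    pending = []  # paragraphs still waiting for their footnote block
    for para in re.split(r'\n\s*\n', text.strip()):
        notes = [FOOTNOTE.match(line) for line in para.splitlines()]
        if all(notes):
            uris = {m.group(1): m.group(2) for m in notes}

            def link(match):
                uri = uris.get(match.group(1))
                if uri is None:
                    return match.group(0)  # no such footnote: leave as-is
                return '<a href="%s">%s</a>' % (uri, match.group(0))

            resolved.extend(REFERENCE.sub(link, p) for p in pending)
            pending = []
        else:
            pending.append(para)
    return resolved + pending

text = """See my website [1] and my weblog [2].

[1] http://example.org/
[2] http://example.org/blog

Numbering restarts here [1].

[1] http://example.net/
"""
for paragraph in resolve_footnotes(text):
    print(paragraph)
```

Because each footnote block only applies to the paragraphs before it, inserting a new footnote means renumbering at most one paragraph's worth of references.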

Other

Implementation

I wrote hypertext.py (not yet published) as a quick and naive implementation.

Questions, Challenges, and Todo Items

The main issues so far are:

There's also the larger issue as to whether the entire language itself is useful, whether its requirements are valid, and so on down to the level of whether there should be more features such as smaller level headings, images, definition lists, and so on.

Colophon

The name is from a thread on the Swhack Mailing List.

The format itself was started with the idea of being entirely descriptive: "is it possible to make a parser which understands the kinds of inconsistent conventions that one uses in plain text when no constraints are applied?" But problems such as how hypertext should be implemented syntactically were quickly apparent, and so Avocet is as descriptive as it can be, but no further than that.

Sean B. Palmer, 2006-11