
<title>Avocet - A Structured Text Language</title>
<link rel="stylesheet" type="text/css" href="style.css" />
<h1>Avocet: Structured Text</h1>

<p>Avocet is a structured text language that can be easily converted to XHTML,
a la Markdown, reST, and wiki languages. See the <a
href="/proj/avocet/">Avocet</a> homepage for more details, or just go straight
ahead and download <a href="/proj/avocet/avocet.py">avocet.py</a> to use
it.</p>

<p>This document is an old design specification for the language, a kind of
essay on how it came to be. The idea behind Avocet was for it to be a Poor
Man's Hypertext. The requirements are a) Minimal feature set. No shiny
gimmicks! b) A natural syntax. Use inherent metadata! and c) Easy conversion
into HTML or XHTML. In other words, the aim is to keep it simple, whilst
enabling the most common and useful subset of tasks that one needs to perform
with HTML.</p>

<h2>How it Works</h2>

<p>The popular structured text language <a
href="http://daringfireball.net/projects/markdown/">Markdown</a> has a lot of
redundant syntax. For example, to make a heading:</p>

<pre>
  This is a Heading
  =================
</pre>

<p>The redundancy is that headings already have an inherent syntax. People
think that programs can't distinguish between headings and sentences, but with
some very simple heuristics they can. Capitalisation and punctuation are the
most essential elements:</p>

<pre>
   This is a Heading
   This is a sentence.
</pre>

<p>With this in mind, I set to work on a regular expression which encapsulates
the implicit syntax of a heading, as distinct from a one line sentence. The
main constraint is that this works best with standard English title
capitalisation styles, where, as <a
href="http://andromeda.rutgers.edu/~jlynch/Writing/t.html">Prof. Jack Lynch</a>
says, in &#x201C;most house styles, all the major words in an English title are
capitalized - 'major' meaning the first word, the last word, and everything in
between except articles, conjunctions, and prepositions&#x201D;.</p>

<p>Having said that, the regular expression that I settled on after testing is
more punctuation-oriented than case-oriented:</p>

<pre>
   r'^(?!.*\b[a-z]{5,}\b)([A-Z][A-Za-z0-9 ,:;!?-]{,50}[A-Za-z0-9!?])$'
</pre>

<p>This came as a result of testing on a huge amount of plain text input from
various sources and inspecting the output, so it's roughly descriptive. Using
this same principle of descriptiveness, I found that it's possible to obtain
similar results from, for example, preformatted blocks. But let's outline the
required elements of syntax first.</p>
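<p>As a minimal sketch of how the heading expression quoted above might be
applied, assuming Python's <code>re</code> module and a hypothetical helper
named <code>is_heading</code> (the name is illustrative and is not taken from
avocet.py):</p>

```python
import re

# The heading pattern quoted above, unchanged.
HEADING = re.compile(
    r'^(?!.*\b[a-z]{5,}\b)([A-Z][A-Za-z0-9 ,:;!?-]{,50}[A-Za-z0-9!?])$'
)


def is_heading(line):
    """Return True if a single line of text looks like a heading."""
    # Surrounding whitespace is stripped first; that is an assumption
    # of this sketch, not something the pattern itself requires.
    return HEADING.match(line.strip()) is not None


if __name__ == "__main__":
    for line in ["This is a Heading", "This is a sentence."]:
        print(repr(line), is_heading(line))
```

<p>The negative lookahead rejects any line containing a standalone lowercase
word of five or more letters, and the final character class rejects a trailing
full stop, so the example sentence fails on both counts.</p>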

<h2>Syntactic Requirements</h2>

<p>What kinds of features do I require? At a minimum:</p>

<ul>
<li>Paragraphs</li>
<li>Headings</li>
<li>Lists</li>
<li>Preformatted</li>
<li>Blockquotes</li>
<li>Hypertext</li>
<li>Unicode</li>
<li>Other</li>
</ul>

<p>Here's a summary of how I currently want to implement each:</p>

<ul>
<li>Paragraphs can be merely blank-line separated blocks of text that don't
conform to any of the other types of block level "element".</li>
<li>Headings, as already discussed, will be differentiated by their internal
syntax using a complex regular expression and perhaps, in future, even more
heuristics. Early tests indicate that this is a more than viable approach to
headings, and it's very natural to author as long as you have faith in the
heuristics being used. The first heading in a document can be reused as its
title; but this may or may not be a good idea (it's derived from Pwyky).</li>
<li>Lists are to be bulleted. I use "*" as bullets, but some people use "-",
and others prefer "#" which may indicate a numbered list. It would probably be
feasible to allow several of these, along with Markdown style ordered lists a
la 1. 2. 3. or 1) 2) 3).</li>
<li>Preformatted sections will be based on inherent metadata: a preformatted
section is one which has multiple runs of whitespace. " {2,}" would be a naive
test of a preformatted block; in practice, the rules are likely to be much more
complex.</li>
<li>Blockquotes will either be [[[ and ]]] delimited, indented with spaces, or
possibly even follow an ^"(.*\n){2,}" - http:\S+$ like format.</li>
<li>Hypertext. Possibly the trickiest one. The solutions for this are numerous,
and all fairly terrible. The best solution I've come up with so far is using
"=>\w" to open a link, and using a URI, optionally in parentheses, to close it;
or perhaps using "[1]"-like footnote style.</li>
<li>Unicode. As usual, something like "{U+203D}" to insert an interrobang would
be wonderful. Patrick Hall suggested allowing an algorithm for searching for
codepoints by name, so for example "{U+LAT SMA A RING}" would insert a U+00E5,
LATIN SMALL LETTER A WITH RING ABOVE. I implemented this as .unicode in phenny
as a proof of concept, and it seems to work quite well. The regexp should be
something like "\{U\+([0-9A-F]{2}|[0-9A-F]{4}|[0-9A-F]{6}|[A-Z ]+)\}".</li>
</ul>

<p>Of these, preformatted, hypertext, and the undescribed other deserve further
mention.</p>
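<p>Before moving on, the Unicode escape from the list above could be sketched
roughly as follows. This assumes Python's <code>unicodedata</code> module. The
functions <code>find_by_name</code> and <code>expand</code> are hypothetical
names, and the matching rule (each query word must be a prefix of a word in the
character name, in order) is only a guess at the kind of search Patrick Hall
suggested, not the algorithm used in phenny:</p>

```python
import re
import unicodedata

# Assumed escape syntax: {U+HEX} or {U+ABBREVIATED NAME}.
ESCAPE = re.compile(r'\{U\+([0-9A-F]{2}|[0-9A-F]{4}|[0-9A-F]{6}|[A-Z ]+)\}')
HEX = re.compile(r'[0-9A-F]+')


def find_by_name(query):
    """Return the first character whose name matches the abbreviated query."""
    tokens = query.split()
    if not tokens:
        return None
    for codepoint in range(0x110000):
        char = chr(codepoint)
        # Unnamed characters yield an empty word list and never match.
        words = iter(unicodedata.name(char, '').split())
        # Sharing one iterator forces the tokens to match words in order.
        if all(any(word.startswith(token) for word in words)
               for token in tokens):
            return char
    return None


def expand(text):
    """Replace every recognised escape in text with its character."""
    def replace(match):
        body = match.group(1)
        if HEX.fullmatch(body):
            # An all-hex body is read as a codepoint, even if it could
            # also be an abbreviated name such as BEAD.
            value = int(body, 16)
            if value > 0x10FFFF:
                return match.group(0)
            return chr(value)
        found = find_by_name(body)
        return found if found is not None else match.group(0)
    return ESCAPE.sub(replace, text)
```

<p>An escape that matches nothing is left in the text as written, which seems
the least surprising behaviour for a plain text format.</p>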

<h2>Preformatted</h2>

<p>Preformatted sections are those in which whitespace is to be preserved; thus
it follows that a section containing a certain supernormal configuration of
whitespace is likely to be a preformatted section. Consider, for example, the
following code:</p>

<pre>
def hello(*args):
   for name in args:
      print("Hello %s!" % name)
</pre>

<p>From this, it's quite clear that it has to be displayed in a preformatted
way in order for it to be legible. The main clue is the leading whitespace in
the second and third lines. Indeed, it may generally be the case that where
more than two or three spaces are used in a row in any bit of text, it's likely
that it ought to be preformatted.</p>
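<p>The naive " {2,}" test can be sketched in a few lines. The function name
<code>looks_preformatted</code> is hypothetical, and this is only the naive
rule, not the more complex rules that a real implementation would likely
need:</p>

```python
import re

# Naive test from the text: two or more spaces in a row.
SPACE_RUN = re.compile(r' {2,}')


def looks_preformatted(block):
    """Return True if any line of the block contains a run of spaces."""
    return any(SPACE_RUN.search(line) for line in block.splitlines())


if __name__ == "__main__":
    code = 'def hello(*args):\n   for name in args:\n      print(name)'
    print(looks_preformatted(code))
    print(looks_preformatted("Avocet is a structured text language."))
```

<p>One obvious weakness is that a paragraph typed with two spaces after each
full stop would also pass this test, which is one reason the real rules are
likely to be more complicated.</p>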

<p>As mentioned, though, the actual rules for determining a preformatted
section could get very complicated indeed. One interesting example is poetry:
consider the following haiku:</p>

<pre>
Leaves falling
Lie on one another
The rain beats the rain
</pre>

<p>It's obvious again from its structure that it has to be preformatted to some
extent, but this time the configuration of whitespace is the line endings after
such short runs of text. This in fact could be represented as a paragraph with
line breaks in it, and indeed it seems prudent for abnormally short lines in
paragraphs to have line breaks after them. What the limit for this should be is
uncertain; in the Avocet test implementation at the moment, lines of length 60
and under are treated in this way.</p>
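<p>A sketch of the short-line rule, assuming the limit of 60 mentioned above.
The function name <code>mark_line_breaks</code> is hypothetical, and the choice
never to put a break after the final line of a block is an assumption of this
sketch rather than a documented behaviour of the test implementation:</p>

```python
def mark_line_breaks(paragraph, limit=60):
    """Pair each line with a flag saying whether a line break follows it."""
    lines = paragraph.splitlines()
    marked = []
    for index, line in enumerate(lines):
        is_last = index == len(lines) - 1
        # Lines of length `limit` and under get a break, except the last.
        marked.append((line, limit >= len(line) and not is_last))
    return marked


if __name__ == "__main__":
    haiku = "Leaves falling\nLie on one another\nThe rain beats the rain"
    for line, has_break in mark_line_breaks(haiku):
        print(has_break, line)
```

<p>A converter would then emit a line break element after each flagged
line and leave the longer lines to wrap as ordinary paragraph text.</p>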

<h2>Hypertext</h2>

<p>The age-old problem. In Your Face URIs are ugly in HTML, but in plain text
they're necessary, so they need to be tarted up as much as possible. In email,
a common convention is to use footnotes:</p>

<pre>
   Hello there. Please see my website [1]. Thanks.

   [1] http://example.org/
</pre>

<p>Unfortunately, this convention has two major drawbacks: a) you have to keep
referring to the end of the email to follow links, and b) if you want to insert
a new footnote, you have to renumber all the footnotes that occur after it.</p>

<p>So most wiki syntaxes avoid this method of hypertext. Instead, they normally
use inline references. For example, in Wikipedia, you link "[http://example.org/
Like This]". In the Semantic Web Interest Group's chumpbot, you link in a
similar format, only using a pipe to separate link text from URI.</p>

<p>But what is the underlying problem? Hypertext is the association of some run
of text with an out of band URI. So the syntactic problem is twofold: one must
identify the text in question, and the associated URI.</p>

<p>The most natural way of doing this would be to use the URI to delimit one of
the ends of the text, and then use a special marker to delimit the other end.
In other words, you end up with something like:</p>

<pre>
   Please view &lt;DELIMITER&gt; my website http://example.org/ because it rocks.
</pre>

<p>This has a few problems:</p>

<ul>
<li>It's ugly.</li>
<li>What if one wants to put punctuation after the URI?</li>
<li>The delimiter will necessarily be special syntax, not inherent.</li>
</ul>

<p>The first and second problems can be fixed in one fell swoop: by placing the
URI in parentheses, perhaps optionally, adding punctuation is no problem and it
suddenly looks a lot more aesthetically pleasing:</p>

<pre>
   Please view &lt;DELIMITER&gt; my website (http://example.org); it rocks.
</pre>

<p>Whilst parens are valid URI characters, it's easy again to use heuristics to
determine whether it's part of the URI or not: compare
(http://example.org/path) and http://example.org/#xptr(). The Colloquy IRC
client for OS X does this, or tries to, I believe. But unfortunately the paren
syntax does introduce another concrete problem anyway, as Javier Candeira (with
whom this hypertext syntax system was developed) noticed: what if you want to
use an In Your Face URI, perhaps as an example, in parentheses? His example of
this was:</p>

<pre>
   Amazon's new URIs (http://amazon.com/path/example) are great!
</pre>

<p>Of course, this contains no delimiter: it'd only be a problem were one to
include another rather obvious innovation, which is that when the delimiter is
not present, the single word to the left of the URI is used as the link text;
so that in the example above, "URIs" would be the unfortunate recipient of the
linking. One possible solution to this is to simply force people to rewrite in
such situations:</p>

<pre>
   Amazon's new URIs ("http://amazon.com/path/example") are great!
   Amazon's new URIs (e.g. http://amazon.com/path/example) are great!
</pre>

<p>Both of these could be rendered as-is. Javier notes that if there is to be a
mechanism for inline code examples, it could be deployed as the solution
here.</p>

<p>The use of a delimiter is harder to avoid. It's very, very difficult to
obtain the run of text heuristically. As proof of this, Javier and I noticed
that in the following example we'd both place the delimiter (here represented
by "@") in different places:</p>

<pre>
   Javier: I've been @thinking about learning Pluvo http://inamidst.com/pluvo/
   Sean: I've been thinking about learning @Pluvo http://inamidst.com/pluvo/
</pre>

<p>In other words, it's a thing of personal preference; a matter of style. A
program should not be required to infer style, so it's necessary that some sort
of delimiter be used.</p>

<p>So far, the best choice appears to be "=>\w". Things I've rejected so far
include " " and "\.\.\w".</p>
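<p>To make the proposal concrete, here is a rough sketch of the link syntax
discussed in this section: "=>" opens the link text, a URI (optionally in
parentheses) closes it, and when there is no delimiter the single word to the
left of the URI is used. The function name <code>find_links</code> and the exact
pattern are assumptions of this sketch, which only handles http and https
URIs:</p>

```python
import re

# Either "=>" followed by link text, or a single word, then a URI
# that may be wrapped in parentheses.
LINK = re.compile(r'(?:=>(\w.*?)|(\w+))\s+\(?(https?://[^\s()]+)\)?')


def find_links(text):
    """Return a list of (link text, URI) pairs found in text."""
    return [
        (match.group(1) or match.group(2), match.group(3))
        for match in LINK.finditer(text)
    ]


if __name__ == "__main__":
    print(find_links("Please view =>my website (http://example.org); it rocks."))
    print(find_links("Amazon's new URIs (http://amazon.com/path/example) are great!"))
```

<p>The second call reproduces the problem that Javier noticed: "URIs" becomes
the link text. The sketch also excludes parentheses from URIs entirely and does
nothing about punctuation directly after a bare URI, so both would need the
kind of heuristics described earlier.</p>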

<p>It might be desirable to allow hosts, a la "(example.org)", too; and to
allow the footnote style so rampant in email, perhaps with the option of having
the footnotes appear at any point in the text so that renumbering is made
easier (in other words, you can have a paragraph with "[1]" and "[2]" in it,
define those footnotes' URIs after that paragraph, and then start again
numbering from 1 in the following paragraph).</p>
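<p>The restartable footnote idea might work along these lines. This is a sketch
under the assumption that footnote definitions form their own
blank-line-separated block, with one "[n] URI" per line; the function name
<code>attach_footnotes</code> is hypothetical:</p>

```python
import re

# One footnote definition per line, e.g. "[1] http://example.org/".
FOOTNOTE = re.compile(r'^\s*\[(\d+)\]\s+(\S+)\s*$')


def attach_footnotes(text):
    """Pair each paragraph with the footnote block that follows it."""
    results = []  # list of (paragraph, {number: uri})
    pending = []  # paragraphs still waiting for their footnotes
    for block in re.split(r'\n\s*\n', text.strip()):
        matches = [FOOTNOTE.match(line) for line in block.splitlines()]
        if all(matches):
            # A block made only of definitions closes the current scope,
            # so numbering can start again from 1 afterwards.
            notes = {int(m.group(1)): m.group(2) for m in matches}
            results.extend((paragraph, notes) for paragraph in pending)
            pending = []
        else:
            pending.append(block)
    # Paragraphs with no following definitions get an empty mapping.
    results.extend((paragraph, {}) for paragraph in pending)
    return results
```

<p>Because each block of definitions applies only to the paragraphs since the
previous block, inserting a new footnote means renumbering at most one small
group rather than the whole document.</p>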

<h2>Other</h2>

<ul>
<li>Use of HTML. Problem: examples vs. [sic]!</li>
<li>Code, e.g. with double spaces a la deltab</li>
<li>Line breaks in paragraphs with heuristics</li>
<li>Automatically generated tables of contents</li>
<li>Examples of Avocet illustrating itself</li>
</ul>

<h2>Implementation</h2>

<p>I wrote hypertext.py (not yet published) as a quick and naive
implementation.</p>

<h2>Questions, Challenges, and Todo Items</h2>

<p>The main issues so far are:</p>

<ul>
<li>What heuristics can one use for detecting preformatted sections?</li>
<li>Should the first heading in a document be reused as the title?</li>
<li>Is a poem to be preformatted, or a paragraph with line breaks?</li>
<li>If a short line appears in a paragraph, should a line break appear
after?</li>
<li>Should an algorithm for searching for Unicode codepoint names be
allowed?</li>
<li>Should there be many blockquote syntaxes? Should one be preferred?</li>
<li>What should the hypertext link delimiter be?</li>
<li>How do you refer to example URIs to prevent them being linked?</li>
<li>Should hostnames be automatically linked when appearing in URI
contexts?</li>
<li>Should email style numbered footnotes be allowed?</li>
<li>How does one provide examples of Avocet within itself?</li>
<li>Should table of contents generation and file includes be allowed?</li>
<li>Should HTML be allowed? How do you differentiate example and real
HTML?</li>
</ul>

<p>There's also the larger issue of whether the entire language itself is
useful, whether its requirements are valid, and so on down to the level of
whether there should be more features such as smaller level headings, images,
definition lists, and so on.</p>

<h2>Colophon</h2>

<p>The name is from a <a href="http://lists.swhack.com/swhack/2006-May/000075.html">thread</a> on the Swhack Mailing List.</p>

<p>The format itself was started with the idea of being entirely descriptive:
"is it possible to make a parser which understands the kinds of inconsistent
conventions that one uses in plain text when no constraints are applied?" But
problems such as how hypertext should be implemented syntactically were quickly
apparent, and so Avocet is as descriptive as it can be, but no further than
that.</p>

<address>
<a href="http://inamidst.com/sbp/">Sean B. Palmer</a>, 2006-11
</address>