Serex

NOTE: This is a draft proposal, and may change.

Introduction

Serex is a regular grammar for s-expressions (sexps). In other words, it allows you to make a regular expression that matches a sexp rather than a string. This is not to be confused with building a sexp that matches a string. This has been done before by Olin Shivers as SRE, and also as CL-PPCRE. This is how the proposals compare:
PatternInputImplementations
StringStringPCRE
SexpStringSRE, CL-PPCRE
SexpSexpSerex

So Serex allows you to construct sexps that match other sexps. You can think of this as a kind of powerful version of macro pattern matching. The intended use case is of course the same, for use in constructing macros. Serex takes its concepts liberally from predecessors such as PCRE, TREX, RELAX NG and RELAX NG Compact, XPath, XQuery, and Perl 6 grammars, and weaves them together into something appropriate and suited for sexps.

Atoms

We'll start with some simple examples, matching atoms. All non-symbol input atoms just match themselves, so for example the number 1 matches 1, and the string "abc" matches "abc". But there are also some symbolic patterns which match a range of objects based on datatype:
PatternExample matchesExample non-matches
symabc, test, function-name123, "string", (abc)
num1, 50, 12345"1", one, (1)
str"a", "abc", "hello world!"1, sym, ("a" "b")
list(), (a b), ("hello")567, "hello"
atom123, "pqr", symbol(1), (a b c)
any50, "a", sym, (), ("a" "b")

Note that number is just provided as a generic datatype here. Implementations that allow for a full numeric tower or similar rich numeric datatypes can define a separate pattern for each numeric type.

In practice, num, str, and sym are just symbols. This means that all symbols are treated like metachars in string regexes. In other words, they are metasymbols or pattern symbols. They are analogous to tokens and productions in standard grammars. This means that we need a way to escape literal symbols to match, like "\." or "[.]" in string regexes, or `quasiquote in macros. We choose to use ' as the prefix escape character. So to match the literal str symbol, you can use 'str. Numbers, strings, and lists, however, all match themselves and do not need escaping, but as a convenience lists can be escaped to mean that all the members of the list are escaped.

Operators

Sometimes you might want to match atoms that you can't match using just the built in datatypes. For example, you might want to match anything which is a symbol or a number. In that case, you can use the or operator, which is spelled |:

(| sym num)

This would match example, function-name, *global*, 0, 1000, and 10000, but would not match "hello world!" or (some list). This is equivalent to all:

(| atom list)

Lists

Lists are, obviously, special because they provide the nested structure to sexps. If you want to match a sexp, then you probably want to match something inside lists, rather than just match any list with the list pattern. Note that, just like with atoms, if a list is specified without any metasymbols, then it matches itself. For example, (1 "2") matches (1 "2"). If you want to match symbols, then you can match the input (a b c) with either '(a b c) or ('a 'b 'c). If you do the former, you can't use metasymbols anywhere within the list. (TODO: Should we introduce , as unquote?). You can also use patterns in place of any of the elements of the list:
PatternExample matchesExample non-matches
('abc num)(abc 1)
(abc 50)
(abc 12345)
(abc "1")
(pqr 1)
(abc 1 2)

Note how cardinality is a constraint. If you only specify a symbol and a pattern using a metasymbol, then the cardinality of the input list list has to be two to match. You may, however, want to be more flexible about how you specify the cardinality of arguments. For example, what if you want to match both (abc) and (abc 1), making the number argument optional? You can do this using the option or "zero or one" operator:

('abc (? num))

Or you could use the or operator which was introduced in the section on operators:

(| ('abc) ('abc num))

Note that the (? ...) and (| ...) expressions do not form a part of the tree. It's as though they're being removed from the tree when we match, much like how in TREX or RELAX NG if you do this:

<outer>
   <zeroOrOne>
      <inner/>
   </zeroOrOne>
</outer>

The <zeroOrOne> element doesn't actually form part of the tree. It is again analogous to the metachar, so (? ...) may be thought of as a metalist.

If we want to ignore elements of a list entirely, we can do so using the splat or "zero or more" operator:

((* any))

This is equivalent to list, but this is not:

('abc (* any))

This now matches (abc), (abc 1), (abc "1" "2), and so on—any list that has abc as the first symbol inside it. As well as ?, zero or one, and *, zero or more, there is also +, which means one or more. Because of the usefulness of the forms (? ...), (* ...), and (+ ...) where ... is any of the atomic datatypes, there are the following shorthands:
RangeLonghandShorthandExample
0, 1(? ...)...?any?
0, inf(* ...)...*num*
1, inf(+ ...)...+atom+

You can also specify a specific cardinality or cardinality range using the , operator, so that (, num 5) means exactly five numbers, and (, atom 5 10) means between 5 and 10 atoms inclusive. There is no shorthand for ranges with cardinality constraints expressed in this way.

Sometimes you may want to just ignore all elements before and after a particular element that you're looking for. For example, if you have a list which only contains one number, such as (a b c 100 d e f), and you want to match that list, then you would have to write something like:

(sym* num sym*)

If you used any* instead of sym*, this wouldn't match because any* is greedy. To understand what greed means in the context of grammars, read the greedy algorithm article on Wikipedia. To get around this problem, we can use non-greedy forms of * and +, called *? and +?:

(all*? num all*)

?? does not make the zero or one operator non-greedy because it is already non-greedy. Greed only applies to operators with no upper limit on cardinality. Instead, we use ?? to mean that a token is preceded and followed by all*?, so that this:

((?? num))

Means the same as:

(all*? num all*?)

Assertions

Sometimes we want to select an element in a list based not on its datatype but on some other characteristic such as its position. To do this we use assertions. The position of an element is indicated with []:

('a 'b 'c 'd ([] 4))

Matches (a b c d ...) with any element filling the place of the ellipsis, and is therefore equivalent to ('a 'b 'c 'd any). Note that the indexing is 0-based. (TODO: Should it be 1-based instead? Pros? Cons? XPath is 1-based)

In case we want to only match a fifth element, we can use the ?? operator again:

((?? ([] 5)))

This would only match lists with at least five members. Custom assertions can be made by passing a function to the = operator:

(= (lambda (a) (> a 10)))

Matches only numbers that are bigger than ten. This might throw an error if you tried to match a symbol in this way, so = also takes a parameter to give the datatype:

(= (lambda (a) (> a 10)) num)

Where, in fact, the second argument can be any assertion including a new = assertion. (TODO: Perhaps just allow any number of arguments to = and check their type.)

Captures

When we match a serex pattern against a sexp, sometimes we do so with the intention of extracting only part of the results. To achieve this, we use a construct called a capture. Captures are denoted % and there are shorthands for most operators. For example, to capture a number that appears after a symbol in a list:

(sym (% num))

This works as follows:

> (serex
     (sym (% num))
     (abc 50))
(50)

TODO: Should this return ((50)) instead?

And you can capture multiple elements in the same way:

> (serex
     (sym (% num) (% sym+))
     (abc 50 p q r))
((50) (p q r))

There are shorthands for all datatypes so that e.g. %num is shorthand for (% num) and %sym+ is shorthand for (% sym+). There are also shorthands for the operators, so for example (%| ...) is shorthand for (% (| ...)).

Sometimes it can be helpful to name captures so that they can be referred to with greater ease, without having to resort to indexing. The results could be an association list or a hash or any other kind of mapping. The operator to create a named capture is %%, and it takes a symbol for a name as its first argument:

> (serex
     (sym (%% first num) (%% second sym+))
     (abc 50 p q r))
((first 50) (second p q r))

TODO: Should names perhaps be strings instead?

Productions

To reuse patterns inside other patterns, we need a facility for naming patterns and referring to them. We call these productions, and the operator for them is @. Productions must be defined at the beginning of the serex pattern, and can then be used anywhere inside. Productions can also refer to previous productions.

(serex
   (@ sym-or-string (| sym str))
   (@ 3-element-list ((, any 3)))
   (:abc sym-or-string 3-element-list))

This matches (abc "abc" (p q r)), for example. Productions are non capturing by default, but can be made capturing by using % inside the definition:

(@ capture-sym-or-str (%| sym str))

Note that (%| sym str) is equivalent to (% (| sym str)) but not to (| %sym %str) which would give two separate captures. The main difference between a production and a named capture is that productions aren't active patterns until they're referred to. In a sense, productions allow you to extend the range of serex types.

Traversals

The concept of traversing is probably best introduced by considering how to match an item at any depth of nesting. Consider, for example, the following input sexp:

(a (b (c (d (e 10)))))

What if we only want to capture ('e %num), ignoring its parent lists? To do so, we can use the // operator.

(// ('e %num))

This would correctly capture 10 from the example that immediately precedes it. We call this a traversal, because it traverses the tree depth-first until it finds a match. You can think of it as a tree-oriented version of the non-greedy operator *?. Another way to think about it is that it does what search does in string regular expression terminology, with match being the default. In other words:

(serex-match (// ...) ...)

Is equivalent to:

(serex-search ... ...)

Of course, the advantage of // is that it can be used in the middle of an expression:

('a (// ('e %num)))

To traverse a single layer, the . operator can be used.

Other

If the order of elements is unimportant, the interleave or & operator can be used. Imagine, for example, that an overloaded function over-load can work with both numbers and strings, so that num str and str num are valid argspecs, but neither num num nor str str. The former could be specified using:

(| ('over-load num str) ('over-load str num))

But this would quickly become verbose with many arguments. Instead, the following may be used:

('over-load (& num str))

Summary

? pattern — zero or one
* pattern — zero or more
+ pattern — one or more
*? pattern — non-greedy zero or more
+? pattern — non-greedy one or more
, pattern upper [ lower ] — custom cardinality

| patterns — union
& patterns — interleave

?? pattern — expands to: all*? pattern all*?
= function - assertion
[] number — indexing

// pattern — depth-first multi step traversal
. pattern — depth-first single step traversal

% pattern — capture
%% symbol-name pattern — named capture
@ symbol-name pattern — production

TODO: - and perhaps some intersection facility. Is there a use for intersection except for multiple assertions?


by Sean B. Palmer