This is a copy of an email sent to the VMS List, with a couple of
corrections made inline that I hadn't noticed when sending the
message.
I've derived a small grammar for the relationships between characters
within VMS words, using statistical Natural Language Processing
techniques. This can be thought of as a refinement of Stolfi's 2000
grammar.[1]
To cut to the chase, the grammar follows this paragraph, with lower
case letters representing EVA graphemes, and upper case letters
representing productions within the grammar. The syntax is that of
PCRE-style regular expressions, which will be very familiar to any
programmer; I tested it in Python.
^
(q | y | [ktfp])*
(C | T | D | A | O)*
(y | m | g)?
$
C = [cs][ktfp]*h*e*
T = [ktfp]+e*
D = [dslr]
A = ai*n*
O = o
This grammar may look a little clumsy in PCRE syntax, but with a
little explanation it can be shown to be remarkably elegant. It covers
over 95% of all VMS words, i.e. over 19/20.
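For anyone who wants to try the grammar, here is a minimal sketch of
how it might be assembled into a single Python regular expression and
used to measure coverage. The names GG, PRODUCTIONS, fits_grammar and
coverage are made up for this illustration, and the sample list is just
a handful of EVA-style words, not a transcription.

```python
import re

GG = "[ktfp]"  # the gallows character class used in the grammar

# The five productions, in the precedence order given in the grammar.
PRODUCTIONS = {
    "C": rf"[cs]{GG}*h*e*",
    "T": rf"{GG}+e*",
    "D": r"[dslr]",
    "A": r"ai*n*",
    "O": r"o",
}

# (C | T | D | A | O) as one alternation of non-capturing groups.
body = "|".join(f"(?:{pattern})" for pattern in PRODUCTIONS.values())
WORD_RE = re.compile(rf"^(?:q|y|{GG})*(?:{body})*(?:y|m|g)?$")


def fits_grammar(word):
    """Return True if the whole EVA word is generated by the grammar."""
    return WORD_RE.match(word) is not None


def coverage(words):
    """Fraction of word tokens that fit the grammar (0.0 for no words)."""
    words = list(words)
    if not words:
        return 0.0
    return sum(fits_grammar(w) for w in words) / len(words)


# Illustrative words only; a real run would read a full transcription.
sample = ["daiin", "qokeedy", "chedy", "shol", "otedy", "ee"]
print(coverage(sample))
```

Because the prefix and the T production both start with a gallows
character, the pattern is ambiguous and can backtrack heavily on long
non-matching strings; for word-length input this should not matter.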
A few notes. First, the D production corresponds to Stolfi's Dealer
class. The [ktfp] character class corresponds to the Grove Gallows
characters. There is a very strong statistical basis for believing k/t
to be scribal variants or some such of the same grapheme, and the same
for f/p. If we call these t' and p', it should also be noted that
there is a strong statistical basis for believing that t' and p' are
*not* scribal variants or some such of one another. I shall refer to
t' and p' as GG, i.e. the Grove Gallows characters.
This work started out as an attempt to answer a much simpler question.
I was comparing the label of the star coming from the mouth of one of
the fish in Pisces in f70r with the labels around the twelve moons in
f67r. I figured that if the twelve moons correspond to the signs of
the zodiac which follow, then perhaps this label will be repeated. I
didn't find the label repeated, but there was a similar word with only
one character different; d/l was the swap.
Subsequently, I noticed that Wikipedia said that such swaps are
common. I wondered which swaps were the most common, so I wrote a
script to find out. This script had some interesting features, which I
will omit here for brevity. The conclusions were, to give an ASCII
version of my beautiful Google Charts API graphs:
Most Swappable = d, s, t, k, l, r
Most Unswappable = h, c, i, a, e
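The script itself is not reproduced here, so the following is only a
guess at one way such a count could be made, not the method actually
used. It assumes that a "swap" means two distinct words that differ in
exactly one character position, and it counts raw totals without any
normalisation for how common each character is. The function names
swap_counts and swappability are hypothetical.

```python
from collections import Counter, defaultdict


def swap_counts(vocabulary):
    """Count character pairs that distinguish words differing at one position."""
    words = set(vocabulary)
    buckets = defaultdict(set)
    # Mask each position in turn; words that share a masked form can
    # only differ in the character at that position.
    for word in words:
        for i in range(len(word)):
            buckets[(word[:i], word[i + 1:])].add(word[i])

    pairs = Counter()
    for chars in buckets.values():
        ordered = sorted(chars)
        for index, a in enumerate(ordered):
            for b in ordered[index + 1:]:
                pairs[(a, b)] += 1
    return pairs


def swappability(pairs):
    """Total number of one-character swaps each character takes part in."""
    totals = Counter()
    for (a, b), n in pairs.items():
        totals[a] += n
        totals[b] += n
    return totals
```

Sorting the result of swappability() by count would give a ranking of
the same kind as the two lists above, though the figures need not agree
with them.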
From this, I wondered why the results came about, and I performed a
few further statistical experiments. It became clear that t/k were
swappable most likely because they were in some sense the same
character. It also became clear that h was not swappable because it
was almost always dependent on there being a "c" or "s" either
prefixing it or nearly prefixing it.
As I investigated more and more, I came up with the grammar published above.
Several further points became clear during this investigation. From
Stolfi I found out, for example, that q is more likely to be a
grammatical particle or at least a letter which can only appear as a
prefix. Similarly I accept y- and -y as occasional prefixes and
suffixes only, never infix, and -m and -g as suffixes only. I do not
speculate here on the relationship between -m and -g, though I tend to
agree with Stolfi on this matter too.
This leaves two features: the occasional prefix of GG, and the (C | T
| D | A | O)* main section of the grammar. The GG prefixing is a well
known phenomenon which gives the GG characters their names. It also
helps to establish the GG character class as a token to use within the
C and T productions.
I will omit here the laborious experimentation through which I arrived
at the productions, and skip to describing some of their interesting
statistical features. Though in the PCRE version the productions are
given in the order (C | T | D | A | O)* for proper precedence based
matching, in terms of frequency within words themselves they have the
following number of instances:
D = 14468 instances
O = 13111 instances
C = 10753 instances
T = 7993 instances
A = 6182 instances
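The exact counting procedure is not spelled out, so the sketch below is
an assumption: it scans each word independently for each production
and tallies the strings that match. Under that assumption a bare "s"
is counted under both C and D, and a gallows inside a C match is also
seen by T, so totals from it should not be expected to match the
figures quoted here exactly. The name variant_counts is hypothetical.

```python
import re
from collections import Counter

PRODUCTIONS = {
    "C": r"[cs][ktfp]*h*e*",
    "T": r"[ktfp]+e*",
    "D": r"[dslr]",
    "A": r"ai*n*",
    "O": r"o",
}


def variant_counts(words):
    """For each production, count how often each matched string occurs."""
    words = list(words)
    tallies = {}
    for name, pattern in PRODUCTIONS.items():
        regex = re.compile(pattern)
        counts = Counter()
        for word in words:
            # findall returns the non-overlapping matches, left to right.
            counts.update(regex.findall(word))
        tallies[name] = counts
    return tallies


tallies = variant_counts(["shol", "chedy", "daiin"])
for name, counts in tallies.items():
    print(name, sum(counts.values()), counts.most_common(3))
```

The per-variant tables in the sections that follow are the kind of
output tallies[name].most_common() would give for each production.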
I will now describe some characteristics of each production in that order.
Production D
Consists almost always of the single characters d, l, r, or s, in that
order of frequency. The specific frequencies are:
6154 d
5177 l
3969 r
3812 s
The bigraph ld follows, with 249 instances, which is probably
insignificant enough to be considered a result of various corruptions
and transcription errors, though of course one cannot be entirely
sure. The only other D bigraphs with more than 100 instances are ls,
with 199, and ds, with 107.
These characters come out as being the most swappable single
characters, according to my initial study, within EVA excluding the
accidental splitting of the GG characters.
Production O
This is a very simple production, consisting of the single letter
"o", which outnumbers "oo" to an amazing degree, even without
considering corruptions and transcription errors.
Production C
This is a complex production. For the purpose of this section, T will
represent t' (i.e. k or t in EVA), and P will represent p' (i.e. f or
p in EVA).
The frequencies are:
3826 ch
1939 che
1245 sh
974 she
914 cTh
347 chee
230 cThe
224 shee
131 cPh
45 cT
42 cPhe
29 cThh
22 cThee
This excludes s, c, and se, which were included in the original
results due to the lax pattern. Note that a very interesting pattern
is followed here:
ch > sh > cTh > cPh
And the variants with -e appended are interleaved...
ch > che > sh > she > cTh > chee (not cThe!) > etc.
In other words, the shorter sequences are more common, and c > s > T >
P is the general sort order, with e being a kind of modifier creating
values between these levels, though the ordering gets complex and
overlapping at the smaller frequencies.
Production T
Using the same notation T for t' and P for p', we have:
5777 T
1252 Te
1081 Tee
707 P
80 Teee
This raises an interesting question: is P merely a scribal variant for
Teee? This can probably be answered, surprisingly, in the negative. At
first it looks promising, since there are 9456 instances of T,
compared to 904 instances of P. But the theory falls down when we
consider that there are 182 instances of cPh, which is quite common.
There are, however, no instances of cTeh, or cTeeh, which we would
expect to be more common than cPh going by the orders described in
Production C.
To put it another way, it appears that T can be, and commonly is,
followed by "e", but P never is and yet it is not evident why. Just in
case anybody wonders whether there are instances of cPhe, I will note
that there are indeed 44 such instances, 2 of which are subsequences
of cPhee.
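The comparison above can be repeated with a few plain substring
patterns. This is only a sketch: the pattern names and the function
pattern_counts are invented for the illustration, and the four test
words are made-up strings that exercise the patterns, not words taken
from the manuscript.

```python
import re

PATTERNS = {
    "cPh": re.compile(r"c[fp]h"),
    "cTh": re.compile(r"c[kt]h"),
    # Expected to occur if P were merely a way of writing T plus e's.
    "cTe+h": re.compile(r"c[kt]e+h"),
    "Pe": re.compile(r"[fp]e"),
    "Te": re.compile(r"[kt]e"),
}


def pattern_counts(words):
    """Count non-overlapping occurrences of each pattern across the words."""
    words = list(words)
    return {
        name: sum(len(regex.findall(word)) for word in words)
        for name, regex in PATTERNS.items()
    }


# Made-up strings, just to show what each pattern picks up.
print(pattern_counts(["cphy", "cthey", "okeey", "ckeh"]))
```

If the scribal-variant idea were right, one would expect the "cTe+h"
count on a real transcription to be comparable to or larger than the
"cPh" count.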
Production A
This production is very interesting:
3739 a
2123 aiin
780 ai
355 ain
118 aii
81 an
46 aiiin
7 aiii
Almost all of the instances of "n" are as a suffix, so ai* only occurs
in the medial position and ai*n only occurs as a suffix. In other
words, it appears very likely that "n" is the spelling for "i" at the
end of a word, much like long-s in early modern Latin script. But note
that the pattern here is very interesting. If we split the medial and
suffix versions, we get:
a > ai > aii [ > aiii ]
aiin > ain > an [ > aiiin ]
So as an infix, the shorter versions are more frequent. As a suffix,
the longer versions are more frequent. The exception is that with
-iii- both infix and suffix are very rare.
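A split of this kind could be made as follows. The sketch assumes that
"suffix" simply means the A match ends at the end of the word, and it
lumps everything else together as non-final; the name a_by_position is
hypothetical.

```python
import re
from collections import Counter

A_RE = re.compile(r"ai*n*")


def a_by_position(words):
    """Tally A-production matches separately for word-final and other positions."""
    non_final, final = Counter(), Counter()
    for word in words:
        for match in A_RE.finditer(word):
            # A match counts as a suffix only if it reaches the end of the word.
            target = final if match.end() == len(word) else non_final
            target[match.group()] += 1
    return non_final, final


non_final, final = a_by_position(["daiin", "dal", "okaiin", "aiidy", "dar", "kain"])
print(non_final.most_common())
print(final.most_common())
```

A word ending in -y, -m or -g after the A match would count as
non-final here, which may or may not be what is wanted.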
I think that with the exceptions such as q, y, and the GG characters
noted above, the combinations of the productions above cover such a
striking majority of VMS words in such a simple and elegant manner
that they may serve as a valuable introduction to anybody wanting to
learn about the internal structure of VMS words in general.
Despite isolating these productions with some ease, finding rules for
combining them has been more difficult. There are a few broad
observations that one can make. The first is that, with the exception
of C, the productions are almost never repeated within a word, by an
extreme margin in frequency, and even with C the repetitions are rare.
The difference is that the C repetitions are, I think, statistically
significant enough to be noticed. Compare the top two run lengths for
each production X, matched as X+:
14441 D
12 DD
13020 O
44 OO
9446 C
629 CC
7973 T
10 TT
6172 A
5 AA
A difference of 14441 to 12 is quite phenomenal, but the 9446 to 629
ratio certainly deserves further investigation. The top five CC
repetitions are:
168 chcTh
52 shcTh
37 checTh
25 shecTh
25 chcThe
In other words, it seems to follow the same kind of frequency
characteristics as the C production in general, only with cTh (and
then a cThe!) appended.
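One way to get run-length figures of this kind is to find each maximal
run of a production with X+ and then count the instances inside the
run. The sketch below does that under the assumption that every
production is scanned on its own, so a gallows inside cTh is also seen
by the T pattern; run_lengths is a made-up name.

```python
import re
from collections import Counter

PRODUCTIONS = {
    "C": r"[cs][ktfp]*h*e*",
    "T": r"[ktfp]+e*",
    "D": r"[dslr]",
    "A": r"ai*n*",
    "O": r"o",
}


def run_lengths(words):
    """Per production, count maximal runs containing 1, 2, ... instances."""
    words = list(words)
    result = {}
    for name, pattern in PRODUCTIONS.items():
        run_re = re.compile(rf"(?:{pattern})+")  # a maximal run, i.e. X+
        item_re = re.compile(pattern)            # one instance of X
        counts = Counter()
        for word in words:
            for run in run_re.findall(word):
                counts[len(item_re.findall(run))] += 1
        result[name] = counts
    return result


print(run_lengths(["chckhy", "chedy", "daiin"])["C"])
```

Note that under this scheme two gallows in a row form a single T (the
pattern is [ktfp]+), so TT only arises for sequences such as "kek".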
Of the relationships between different productions, I have discovered
little of note. The most interesting discovery is probably that when A
is followed by another production, that production is almost always D.
There are 3100 instances of this, compared to 77 of the next
most common combination, AC, and I haven't checked to see how many of
those are A production instances ending with n, probably signalling
mistranscribed spaces.
The only other significant point is that T seems to very strongly
favour being preceded by O, though this is not so clear as A being
followed by D. The statistics are:
4881 OT
466 CT
411 DT
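Counts of adjacent productions such as these could be gathered with a
small tokenizer. The sketch below is an assumption about the method,
not a record of it: it tries the productions in the order C, T, D, A,
O at each position, so a bare "s" is always taken as C rather than D,
and it only counts two productions as a pair when nothing lies between
them. The name production_bigrams is hypothetical.

```python
import re
from collections import Counter

# One named group per production; alternation order gives the precedence.
TOKEN_RE = re.compile(
    r"(?P<C>[cs][ktfp]*h*e*)"
    r"|(?P<T>[ktfp]+e*)"
    r"|(?P<D>[dslr])"
    r"|(?P<A>ai*n*)"
    r"|(?P<O>o)"
)


def production_bigrams(words):
    """Count ordered pairs of productions that are directly adjacent in a word."""
    pairs = Counter()
    for word in words:
        previous = None
        for match in TOKEN_RE.finditer(word):
            # Characters outside the productions (q, y, m, g, ...) break adjacency.
            if previous is not None and previous.end() == match.start():
                pairs[previous.lastgroup + match.lastgroup] += 1
            previous = match
    return pairs


print(production_bigrams(["daiin", "okal", "otedy", "chol", "qokaiin"]).most_common())
```

How the ambiguity of "s" between C and D is resolved will affect pairs
such as AD and AC, so that choice is worth checking before comparing
against the figures above.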
This has been a long email. I have not kept up with the state of the
art of VMS studies, and I apologise if I am treading a lot of ground
which has been more properly and thoroughly considered by others
unbeknownst to me. I also apologise for my tendency to dash off long
emails like these in order to get them completed, with minimal
checking, meaning that I may have introduced some glaring errors.
Please point them out kindly if this is so!
I would of course be interested to hear any comments that people have,
especially from Stolfi whose work was of monumental help to me during
these investigations.
[1] http://www.dcc.unicamp.br/~stolfi/voynich/00-06-07-word-grammar/