Genealogy and Computers

One area in which computer science and literary history are really overlapping for me at the moment is the idea of genealogy. This is a very interesting topic because it combines thoughts about the Semantic Web with thoughts about how to be a good scholar. The overall question is: can computer assistance help us to be better scholars?

This follows on from what I said about Keywords and Cadences yesterday, because being a scholar in literary history means taking a complex mass of data and coming up with an interpretation of it that is consistent. It's basically a science; this whole idea that “literary history is English, which is humanities, which is an art” is taxonomically misleading. In literary history you have to prove your assertions about something by showing the literary record for it. The question of the chronology of Shakespeare's plays, for example, usually depends on internal evidence from the plays themselves (do they mention current events?) and external evidence that needs to be discovered (who wrote about this play? does it have a publication date? can the publication date be trusted?). This is scholarship.

Now, genealogy comes up quite often in literary history because we like to know about the home environments of people, where they come from, what kind of things affected their burgeoning minds as children. This is especially true of Shakespeare, who has received perhaps more attention than any other author, but it's also quite difficult studying Shakespeare because the pieces of literary evidence, all the little facts from his times, are complex and contradictory. It's not like a mathematical proof where you just go “Q.E.D.” at the bottom and it gets peer reviewed and then it's accepted. You have to use circumstances: did these people live close to one another? Are they mentioned in connection with any lawsuits? What kind of anecdotes survive about these people? Can they be trusted? Genealogy of common country folk in Elizabethan times is really hard.

To be quite honest, though, it's not that much easier for more recent family trees either. I've been researching my own family history, and you look at the records and you see that ages are given inconsistently across census records, names are transcribed wrong, and so one moment you can think that your great great great grandfather is “John Douglas” or whatever, and then you find two documents that give his name as “Jim Douglas”. So what was his name? Well, it was probably Jim, but you don't know for sure.

So part of being a good scholar is summarising the evidence in such a way as to be clear about history, but without being misleading about history!

Can a computer help with this task? After all, one of the things that clouds human judgement is the fact that we are not simply formal machines that carry out grunt work. We make some very silly mistakes in fields that require formal analysis; computers don't forget to carry the one when doing addition. But on the other hand we're also very good, actually, at weighing up evidence that is conflicting and complex, because we're used to living in an environment which is conflicting and complex. Computers just add numbers.

That's the philosophy of it. To be more concrete, I've been thinking about making a little RDF schema of a genealogy language, so that I can write down my genealogy findings in a machine-readable way. The question then becomes: what kind of programs might I be able to write that can assist me in interpreting all of this data? Whenever I tell people about my genealogy work, this is always what they propose. They say, “you're a computer scientist, so you've got the wonderful advantage of being able to write it down for computers and have a computer do all the heavy lifting for you!” But then they don't tell me what that heavy lifting involves.
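To make "machine-readable findings" a little more concrete, here is a minimal sketch in plain Python. It is not RDF and not a real schema: the `Claim` record, the `conflicts` helper, and the `person:` and `doc:` identifiers are all hypothetical, made up for illustration. It only shows the sort of thing such a notation might let a program do, namely record each assertion together with the document it came from and point out where the documents disagree.

```python
from collections import defaultdict
from typing import NamedTuple


class Claim(NamedTuple):
    """One assertion read from one document (hypothetical structure)."""
    subject: str    # identifier for a person
    predicate: str  # the property being asserted
    value: str      # what the document says
    source: str     # identifier for the document


# Illustrative data only, modelled on the John/Jim example above.
claims = [
    Claim("person:ggg-grandfather", "name", "John Douglas", "doc:record-a"),
    Claim("person:ggg-grandfather", "name", "Jim Douglas", "doc:record-b"),
    Claim("person:ggg-grandfather", "name", "Jim Douglas", "doc:record-c"),
]


def conflicts(claims):
    """Return the (subject, predicate) pairs whose sources disagree."""
    seen = defaultdict(lambda: defaultdict(list))
    for c in claims:
        seen[(c.subject, c.predicate)][c.value].append(c.source)
    # Keep only properties that have more than one distinct value.
    return {key: dict(values) for key, values in seen.items() if len(values) > 1}


for (subject, predicate), values in conflicts(claims).items():
    print(subject, predicate, values)
```

Note that this only reports the disagreement. It says nothing about which document is right, which is exactly the part that is hard.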

To be quite honest, I can't see much heavy advantage. There are certainly light advantages that you always get with computers. If you transcribe deeds and birth certificates and so on and you have hundreds of them on the computer, you can then do free text searches through them very quickly. The online census records have done this for the whole population of Britain for various snapshots of the 19th century and, just recently, 1901. This is very helpful indeed. And yeah, it's great to be able to graph all of the connections between people so that you can do nice printouts and things like that. This is what computers are really good at at the moment. Very simple enhancements to the paper workflow. Searching; presentation; storing large amounts of data; making things accessible to lots of people.

The problem comes when you think that computers are substitutes for parts of human intelligence that they can't yet come close to. There are some fields that have proven much trickier than we thought they would be, but eventually we kinda cracked them. Translation is a good example; we thought translation would be really easy, just swapping words about, but actually it requires an immense understanding of grammar and idioms and connotations and so on. But we're getting there now, and the online translators are pretty usable, though not perfect.

With literary history, we may be in the equivalent of the very early days of translation, or it might be something that computers just can't really help us with until they become really “intelligent” in some way that we might not even fully understand yet. There's no way I can write a genealogy program that takes in all the evidence about, say, who Shakespeare's grandfather was and then spits out an objective definitive summary of the evidence that fulfills the maxim that I noted above, that it must be a good summary but not misleading. There's no way you can get a computer to fill in for that task. You can certainly assign weights to various pieces of evidence and get a computer to add all the weights, and then assign some arbitrary threshold and get it to only report the facts over that threshold. That's the sort of “free text search”, “accessible to all over the internet” level of task that computers are good at right now. But, at the moment, computers don't make good historians.
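The weights-and-threshold approach really is as simple as it sounds. A minimal sketch, assuming each piece of evidence has already been reduced by hand to a claim and a number; the `weigh` function, the weights, and the threshold are all invented for illustration and carry no historical meaning:

```python
from collections import defaultdict


def weigh(evidence, threshold):
    """Add up the weights per claim and keep the claims over the threshold."""
    totals = defaultdict(int)
    for claim, weight in evidence:
        totals[claim] += weight
    return {claim: total for claim, total in totals.items() if total > threshold}


# Hypothetical weights: one document says John, two say Jim.
evidence = [
    ("name is John Douglas", 2),
    ("name is Jim Douglas", 2),
    ("name is Jim Douglas", 2),
]

reported = weigh(evidence, threshold=3)
print(reported)  # {'name is Jim Douglas': 4}
```

All of the judgement is in choosing the weights and the threshold, and a person did that. The computer, as above, just adds numbers.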