ADB Mining
I’d like to propose a session that aims to produce (or, failing that, define) a tool that takes an Australian historical text and locates within it every mention of a person named in the Australian Dictionary of Biography (ADB).
For each person referenced, I would like to create a metadata record for the corresponding ADB entry in my own database and link it to an annotation record pointing to the mention in the text. If a given person is mentioned multiple times, multiple annotation records should be created, all pointing to a single ADB metadata record.
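To make the record structure concrete, here is a minimal sketch in Python, assuming exact full-name matching against plain text. The class names, field names and the `adb-1` style identifiers are all invented for illustration; they are not Heurist's or the ADB's actual schema, and a real tool would also need to cope with name variants and ambiguous names.

```python
import re
from dataclasses import dataclass


@dataclass
class AdbRecord:
    """Local metadata record standing in for one ADB entry (hypothetical shape)."""
    adb_id: str  # made-up identifier, not a real ADB key
    name: str


@dataclass
class Annotation:
    """One mention of a person in the text, pointing at a single AdbRecord."""
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    text: str    # the matched string
    adb_id: str  # pointer to the shared metadata record


def find_mentions(text, records):
    """Return one Annotation per exact occurrence of each record's name."""
    annotations = []
    for record in records:
        # Word boundaries stop a name matching inside a longer word.
        pattern = re.compile(r"\b" + re.escape(record.name) + r"\b")
        for match in pattern.finditer(text):
            annotations.append(
                Annotation(match.start(), match.end(), match.group(0), record.adb_id)
            )
    # Present the mentions in reading order.
    annotations.sort(key=lambda a: a.start)
    return annotations


sample_text = (
    "Lachlan Macquarie appointed Francis Greenway as civil architect. "
    "Lachlan Macquarie later approved his designs."
)
records = [
    AdbRecord("adb-1", "Lachlan Macquarie"),
    AdbRecord("adb-2", "Francis Greenway"),
]
annotations = find_mentions(sample_text, records)
for a in annotations:
    print(a.start, a.end, a.text, "->", a.adb_id)
```

Both mentions of Macquarie yield separate annotation records that share one `adb_id`, which is the many-to-one relationship described above.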
Ideally I’d like to be able to scan the text and have it OCRed into TEI first, but I think that would be a bit ambitious. I mention it at least as a reminder that ultimately the process should be input-agnostic: one should be able to load a newly created Word document as easily as a scan of a historical manuscript.
I’ve hacked around over the years with offset markup using TEI source files in Heurist. We render TEI documents as XHTML using Cocoon and let the user select a passage of text, then create an annotation record that stores the annotated string, the source text, the location within the text and an optional pointer to another record, which could be an ADB entry metadata record. The Dictionary of Sydney project in particular makes extensive use of this data structure, with mentions of famous Sydneysiders within text entries pointing, via annotations, to person entity records. As often as not, these then reference the ADB entry.
This tool would greatly simplify the current workflow of the Dictionary of Sydney, and many of the other projects I am involved with would also benefit from it. The Dictionary of Sydney, however, relies on a large team of volunteers and a pretty active editorial team to create annotations “manually”. It would be great to have a semi-automated procedure, possibly based around a bunch of clever XSL transforms and an XML feed from the ADB, that would enable me to connect a given research database with the ADB and indeed beyond (to paraphrase Buzz Lightyear).
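As a rough illustration of the XML feed idea, the sketch below builds a name-to-identifier index from a feed using Python's standard library rather than XSL. I don't know what an ADB feed would actually look like, so the `people`, `person` and `name` elements and the `id` attribute are pure assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; the real ADB feed format (if any) is unknown here.
SAMPLE_FEED = """\
<people>
  <person id="adb-1">
    <name>Lachlan Macquarie</name>
  </person>
  <person id="adb-2">
    <name>Francis Greenway</name>
  </person>
</people>
"""


def load_name_index(feed_xml):
    """Map each person's name to their (hypothetical) feed identifier."""
    root = ET.fromstring(feed_xml)
    index = {}
    for person in root.iter("person"):
        name = person.findtext("name", default="").strip()
        if name:  # skip entries with no usable name
            index[name] = person.get("id")
    return index


name_index = load_name_index(SAMPLE_FEED)
print(name_index)
```

An index like this could supply the list of names to search for in a text, with a human editor confirming each suggested match to keep the process semi-automated.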