Adventures with automated text analysis

Data storage and retrieval technologies give us unprecedented access to volumes of data that are impractical to analyse manually. The quantity of data can be extreme: a collection of just one month of the world's online media and social media contains some 386 million posts, articles etc. – about 3 TB of text data! Even more modest collections are often far too large to read in their entirety.

This has led to many techniques and advances in automated text analysis. I would like to propose a session in which those who use, or wish to use, automated text analysis techniques come together to exchange notes, discuss effective approaches, and identify stumbling blocks and potential sources of error.

I myself come from a machine learning background and have only very recently begun work in a humanities context. What I can offer is some more technical knowledge of what can be done, such as algorithms to detect sentiment or discussion topics running through a corpus. What I hope to gain is an understanding of how these techniques, or ones like them, are and can be used in the humanities.