Cambridge Digital Humanities / The National Archives Digital Methods Workshop series 2017-18
Beyond words: challenges in reading historical document collections at scale
Cambridge Digital Humanities is collaborating with the National Archives to run a series of workshops aimed at developing a funding proposal for a project exploring ways of extracting and visualising elements of document layout in large-scale digital corpora. We are particularly interested in the methodological and technical challenges involved in developing automated methods for detecting annotations in large document corpora. These exploratory workshops aim to bring together humanities scholars, computer scientists and archivists to exchange ideas, explore methods, concepts and techniques, and generate research questions for future collaborations. The workshops will feature presentations mapping the terrain of current scholarship and the state of the art in archival and technical approaches to this topic, as well as opportunities for discussion and problem-solving in small groups.
The first two workshops will be held in Cambridge on 15 December 2017 and 6 February 2018. To register for the first workshop, please book online here
Workshop 1 speakers and participants include:
- Dr Anna Sexton (Head of Research, The National Archives)
- Mark Bell (Digital Researcher, The National Archives)
- Hal Blackburn (Cambridge University Library / Arthur Schnitzler Digital)
- Dr Eirini Goudarouli (Digital Research Lead, The National Archives)
- Dr Val Johnson (Director of Research and Collections, The National Archives)
- Dr Annja Neumann (University of Cambridge / Arthur Schnitzler Digital)
- Dr Barbara McGillivray (University of Cambridge / Alan Turing Institute)
- Dr Jason Scott-Warren (University of Cambridge / Centre for Material Texts)
- Ruth Selman (Principal Records Specialist (Early Modern), The National Archives)
Libraries and archives now contain millions of pages of digitised documents spanning hundreds of years of human history and culture. The scale of these collections combined with advances in optical character recognition (OCR) has posed the question of using computers to read their contents for us. Whether this is motivated by the need to search within the corpus, or driven by the desire to quantify and analyse the content across the entire collection (or a sub-section of it), the trajectory of computational methods in this area has often been based on the assumption that understanding a document depends on its decomposition into the smallest possible constituent parts, the dissolution of its features into a ‘bag of words’ from which new structures can be reassembled. This approach has made great strides in rendering large corpora more tractable for research and inquiry, but it can lead to a fixation on the meanings which can be inferred from the words of a text, and conceal other ways of grasping the meaning of a document.
In particular, it could be argued that methods in natural language processing (NLP) have tended to obscure the spatial and temporal dimensions of the document, focusing on reading texts as words and not as images or processes. Yet scholarship in many different humanities fields, from literature to theology to history, has long been concerned with ‘reading’ texts both spatially and temporally. Aspects of document layout have encoded meanings for almost as long as people have been writing, and when we have learnt those codes we can derive meanings from the page at a distance without needing to read the words. Likewise, historians and literary critics have long been conscious of documents as processes, whose significance lies not necessarily in what they say, but in how they were composed and who composed them.
These workshops will explore the conceptual and practical challenges of going beyond words in attempts to automate the reading of documents at scale. They will focus on the discovery and analysis of annotations as a way to open up a broader set of interdisciplinary discussions about how to automate spatial and temporal ‘readings’ of documents, and ask what knowledge might be created in the process. As the process of annotation can be defined both spatially (by the location of the annotation in relation to word, line or margin) and temporally (as an addition which takes place after the composition of the original text, or as a middle stage in the composition of the final text), it represents a fruitful starting point for further exploration. Moreover, the intensity of annotation activity on a document is often of interest to historians or editors, potentially opening a window into debates, conflicts and discussions in the drafting of a government policy, or illuminating aspects of the creative process behind the poet’s