6 Feb 2018 11:30am - 3:30pm Raleigh Seminar Room, Maxwell Centre, Cavendish Laboratory, West Cambridge Site


Cambridge Digital Humanities / The National Archives
Digital Methods Workshop series 2017-18
Beyond words: challenges in reading historical document collections at scale
Workshop 2 (Technical challenges)

Organised with the support and collaboration of Cambridge Big Data


John Sheridan, Digital Director, The National Archives
Daniel Bruder, Cambridge Computer Laboratory

Cambridge Digital Humanities is collaborating with the National Archives to run a series of workshops aimed at developing a funding proposal for a project exploring ways of extracting and visualising elements of document layout in large-scale digital corpora. We are particularly interested in the methodological and technical challenges involved in developing automated methods for detecting annotations in large document corpora. These exploratory workshops aim to bring together humanities scholars with computer scientists and archivists to exchange ideas, explore methods, concepts and techniques and generate research questions for future collaborations. The workshops will feature presentations mapping the terrain of current scholarship and the state-of-the-art in archival and technical approaches to this topic, as well as opportunities for discussion and problem-solving in small groups.

The second workshop in our collaborative series explores the technical challenges in mapping historical document collections at scale.

A sandwich lunch will be provided – please email Michelle Maciejewska by 31 January if you have any specific dietary requirements. If you book a ticket and find you can no longer attend, please cancel through Eventbrite so that we can cater accurately.


John Sheridan, Digital Director, The National Archives


As Digital Director, John is responsible for digital services, enabling The National Archives to fulfil its ambitions to become a digital archive by instinct and design. His role is to provide strategic direction, transform our digital offer, and to shape and drive forward our web-based services.

Prior to this role, John was Head of Legislation Services at The National Archives, where he led the team responsible for creating the legislation.gov.uk website, as well as overseeing the operation of the official Gazette.

A former co-chair of the W3C e-Government Interest Group, John has a strong interest in web and data standards. He serves on the UK Government’s Open Standards Board which sets data standards for use across government. John was an early pioneer of open data and remains active in that community.

John’s academic background is in mathematics and information technology, with a degree in Mathematics and Computer Science from the University of Southampton and a Master’s Degree in Information Technology from the University of Liverpool. John recently led, as Principal Investigator, an Arts and Humanities Research Council funded project, ‘big data for law’, exploring the application of data analytics to the statute book, winning the Halsbury Legal Award for Innovation.

Daniel Bruder, Computer Laboratory, University of Cambridge

Necessary provisions for the reading of historical document collections at scale

The nested structure of the XML tree clashes with both the logical structure of the text and the physical structure of the document, which makes workarounds to the tree model necessary. Applying tree data models together with such workaround mechanisms produces data repositories in a paradoxical in-between state: compliant with the XML/TEI schema on the one hand, yet idiosyncratic and ambiguous on the other. Such repositories are unsuitable both for long-term archiving and for sustainable interdisciplinary study and cross-use in indexing or distant reading tasks. In my talk I will discuss the sources of ambiguity in digital archives as well as the potential pitfalls in reading such document collections at scale.

My goal is to devise a data model for multiple hierarchies over potentially non-linear text for long-term preservation and interoperable cross-use of complex textual data repositories.
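
The clash described in the abstract can be made concrete with a small example: a sentence (logical structure) may run across a line break (physical structure), so neither unit can be nested inside the other in a single XML tree. The sketch below is illustrative only and is not the data model proposed in the talk. It assumes stand-off annotation, one common workaround in which each hierarchy is stored as character spans over the same text. The names `Span`, `overlaps_without_nesting` and `find_conflicts`, the layer labels and the sample text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A stand-off annotation: a labelled character range over a shared text."""
    layer: str  # hypothetical layer name, e.g. "physical" or "logical"
    label: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def overlaps_without_nesting(a: Span, b: Span) -> bool:
    """Return True if the spans share characters but neither contains the other.

    Such a pair cannot be expressed as parent and child in one tree.
    """
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


def find_conflicts(physical, logical):
    """List (physical label, logical label) pairs that break strict nesting."""
    return [
        (p.label, l.label)
        for p in physical
        for l in logical
        if overlaps_without_nesting(p, l)
    ]


# Sample text, invented for illustration.
text = "The clerk wrote this. Then he signed."

# Physical hierarchy: three manuscript lines, broken mid-sentence.
physical = [
    Span("physical", "line1", 0, 15),   # "The clerk wrote"
    Span("physical", "line2", 16, 29),  # "this. Then he"
    Span("physical", "line3", 30, 37),  # "signed."
]

# Logical hierarchy: two sentences.
logical = [
    Span("logical", "s1", 0, 21),   # "The clerk wrote this."
    Span("logical", "s2", 22, 37),  # "Then he signed."
]

conflicts = find_conflicts(physical, logical)
print(conflicts)  # [('line2', 's1'), ('line2', 's2')]
```

Here the second line straddles both sentences, so a single tree would have to split either the line or the sentences into fragments. Keeping each hierarchy as an independent set of spans avoids the split, at the cost of moving structure out of the markup itself.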


Daniel Bruder is a PhD student at the Computer Laboratory, University of Cambridge. His research focuses on developing mathematical structures to record and archive written cultural inheritance in a form which would be applicable to any language or script independent of content. Read more about this project here: gtr.rcuk.ac.uk/projects?ref=studentship-1778276

Tel: +44 1223 766886
Email: enquiries@crassh.cam.ac.uk