Discovering History: Text Mining the Cairo Genizah

21 November 2013, 13:30 - 15:00

SG2, Alison Richard Building

Speakers: Dr Christopher Stokoe and Dr Ben Outhwaite

The widespread digitisation of historical manuscripts has brought about a new age of unprecedented access to the special collections of Cambridge University Library. One such collection is the Taylor-Schechter Cairo Genizah Collection, whose 193,000 manuscript fragments form a rich source of information about medieval Judaism and the history of the Mediterranean and Near East. Whilst the digital library has significantly improved access to the collection, the lack of a rich content-based catalogue still presents a substantial barrier to discovery. Typically we struggle to direct researchers to the parts of the collection most likely to address their information needs, and navigating the collection requires an extensive knowledge of the secondary literature.

In this seminar we outline a novel approach to discovery that exploits over 100 years of scholarship to derive a content-based catalogue automatically through text mining. We combine techniques from information retrieval and natural language processing to extract catalogue data from the literature and link this information back to the source fragments. The resulting catalogue data is evaluated for its potential to enhance navigation and discovery of the collection.

