Professor Caroline Bassett, Cambridge Digital Humanities and Dr Anne Alexander, Cambridge Digital Humanities
Computer programs that predict the likely next words in a sentence are a familiar part of everyday life for billions of people, who encounter them in the auto-complete tools of search engines and in the predictive keyboards of mobile phones and word processing software. These tools rely on 'language models', developed by researchers in fields such as natural language processing (NLP) and information retrieval, which assign probabilities to words in a sequence based on a specific set of 'training data' (in this case, a collection of texts in which the frequencies of word pairings or three-word phrases have been calculated in advance).
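For readers who would like to see the idea in miniature, the following is a rough sketch of a word-pair ('bigram') model of the kind described above. It is an illustration only, not the method used by any particular product: the toy corpus and the function names (`train_bigram_model`, `next_word_probabilities`, `predict_next`) are invented for this example, and real predictive keyboards are trained on far larger collections of text with more careful handling of punctuation and unseen words.

```python
from collections import Counter, defaultdict


def train_bigram_model(texts):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for text in texts:
        # Assumption: lowercasing and splitting on whitespace is enough
        # for a toy example; real systems use proper tokenisers.
        words = text.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follower counts for `word` into probabilities."""
    followers = counts.get(word.lower())
    if not followers:
        return {}  # the word never appeared (or was never followed by anything)
    total = sum(followers.values())
    return {follower: n / total for follower, n in followers.items()}


def predict_next(counts, word):
    """Return the most probable next word, or None if nothing is known."""
    probabilities = next_word_probabilities(counts, word)
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)


# Hypothetical 'training data': three short sentences.
toy_corpus = [
    "the horse has two eyes",
    "the horse has four legs",
    "the horse eats grass",
]

model = train_bigram_model(toy_corpus)
print(next_word_probabilities(model, "horse"))
print(predict_next(model, "horse"))
```

In this toy corpus 'horse' is followed by 'has' twice and 'eats' once, so the model gives 'has' a probability of two thirds and suggests it as the next word. A word the model has never seen yields no suggestion at all, which hints at why the size and composition of the training data matter so much.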
Recent developments in machine learning have led to the creation of general language models trained on extremely large datasets which can now produce 'synthetic' texts, answer questions, and summarise information without the need for lengthy or costly processes of training for each new task. The difficulty of distinguishing the outputs of these language models from texts written by humans has provoked widespread interest in the media. Researchers have experimented with prompting GPT-3, a language model developed by OpenAI, to write short stories, answer philosophical questions and apparently propose potential medical treatments, although GPT-3 did have some difficulty with the question "how many eyes does a horse have?". Meanwhile, The Guardian 'commissioned' an op-ed from GPT-3.
This Methods Workshop will explore the generation of 'synthetic' texts through presentations, discussion and demonstrations of text generation techniques which participants will be encouraged to try out for themselves during the sessions. We will also report back from the Ghost Fictions Guided Project, organised by Cambridge Digital Humanities in October and November this year. The project looks at how ideas about the distinction between 'fact', 'fiction' and 'nonfiction' are shaping the reception of text generation methods and aims to stimulate deeper critical engagement with machine learning by humanities researchers.