14 Jun 2021 All day Online



Marcus Tomalin (University of Cambridge)

Sophia Diamantopoulou (UCL)


Artificial Intelligence and Multimodality is a collaboration between the UCL Centre for Multimodal Research/Visual and Multimodal Research Forum and the Giving Voice to Digital Democracies project based at the University of Cambridge. The workshop brings together the fields of Artificial Intelligence (AI) and Multimodality, with the aim of identifying common ground that connects these two research domains. The speakers, who are prominent figures with an interest in AI and Multimodality, will explore possibilities and potentialities for interdisciplinary interactions.

Discussion will centre on two specific applications:

  • AI and Multimodality: Learning and Knowledge
  • AI and Multimodality: Communication and Agency

The workshop will provide a valuable opportunity for AI researchers and multimodality theorists to connect, and exchange ideas and approaches.


Queries: Marcus Tomalin, Convenor


12:30 - 15:00

AI and Multimodality: Learning and Knowledge

12:30 – 13:30  Discussion 1: Kay O’Halloran (University of Liverpool), Lucia Specia (Imperial College London), Elisabetta Adami (University of Leeds) [discussant]

14:00 – 15:00  Discussion 2: Victor Lim Fei (Nanyang Technological University), Nadia Berthouze (University College London)

16:30 - 19:00

AI and Multimodality: Communication and Agency

16:30 – 17:30  Discussion 3: Theo van Leeuwen (University of Southern Denmark), Douwe Kiela (Facebook AI Research), Verena Rieser (Heriot-Watt University)

18:00 – 19:00  Discussion 4: Selena Nemorin (University College London), Carl Smith (Ravensbourne University London), Rodney Jones (University of Reading) [discussant]



The workshop brings together the fields of Artificial Intelligence (AI) and Multimodality, with the aim of identifying common ground that connects these two research domains. The speakers, who are all prominent figures with an interest in AI and/or Multimodality, will explore the possibilities and potentialities for interdisciplinary interactions. Such considerations are timely since we live in a world that is increasingly multimodal. The internet, social media, mobile phones, and other digital technologies have given prominence to forms of discourse in which texts, images, and sounds combine to convey complex messages. Animated emojis, internet memes, the automatic captioning of live shows, and e-literature are all illustrative examples of this trend.

Since its emergence in the 1980s, the academic discipline of Multimodality has offered a set of robust analytical frameworks for examining multimodal phenomena. In social semiotic theories, for example, ‘modes’ such as writing, speech, image, music, and gesture are viewed as socially and culturally determined semiotic resources which can be deployed to create meanings. They all have different affordances, that is, particular potentialities and constraints that shape the making of signs in specific representations. Each set of modal affordances is determined by the ways in which meaning is created using the particular resources available, and those resources are inevitably characterised by their material, cultural, social, and historical development. Careful consideration of these affordances enables the form and function of multimodal communication in human social interactions to be analysed in greater detail, and the work of Gunther Kress, Theo van Leeuwen, Carey Jewitt, Jeff Bezemer, and others has been hugely influential.

In a parallel development, over the last decade a new generation of Artificial Intelligence (AI) systems has attained unprecedented levels of sophistication when performing generative and analytical tasks which, traditionally, only humans had accomplished convincingly. These advances have been triggered by the resurgence of neural machine-learning techniques, which coincided with the greater availability of more effective hardware, such as Graphics Processing Units (GPUs), that facilitates parallel computation. The task of building state-of-the-art systems has benefited greatly from the release of free open-source software packages such as PyTorch (2016-present) and TensorFlow (2015-present), and these advances have made it possible to create intelligent and autonomous systems that are increasingly multimodal. For instance, systems have been developed that can identify racist or sexist internet memes by processing texts as well as images (e.g., the Facebook Hateful Memes Challenge), while researchers such as Lucia Specia have created Multimodal Machine Translation systems that perform image-guided translation of written or spoken texts. These multimodal tendencies are becoming increasingly prevalent in many modern AI applications.

However, despite their many obvious overlapping areas of interest, the two research domains of AI and Multimodality remain largely disjunct. The computer scientists and information engineers who develop multimodal machine learning techniques tend to know very little about dominant theories of Multimodality, while, conversely, the semioticians, linguists, philosophers, sociologists, and social scientists who elaborate theories of Multimodality tend to know comparatively little about the architecture and application of multimodal AI systems. This persistent disconnection is regrettable since more extensive interactions involving researchers from both groups are likely to be mutually beneficial. Specifically, the design of multimodal AI systems could benefit greatly from a deeper awareness of existing theoretical frameworks for assessing human-based multimodal communication, while autonomous and intelligent systems that can create multimodal ensembles as outputs pose fundamental questions for any theories of Multimodality that presuppose human agency in the social and cultural contexts of meaning making.

Responding to this disconnection, the workshop will provide an opportunity for AI researchers and multimodality theorists alike to come together to exchange ideas and approaches, and to address questions such as the following: How compatible are these two domains? What are the epistemological underpinnings of these two research traditions? Do they share any common ground? How do proponents of one area view those involved with the other? What would be the benefits of facilitating greater interactions between the two groups?



Tel: +44 1223 766886
Email: enquiries@crassh.cam.ac.uk