Professor in Natural Language Processing, Department of Information and Computing Sciences, Utrecht University.

Associate Professor (on leave), Institute of Linguistics and Language Technology, University of Malta.

About me

My research mostly focusses on the automatic generation of language from non-linguistic information (a.k.a. Natural Language Generation). One important aspect of this is how systems -- artificial or human -- learn meaningful relationships between language and the non-linguistic, especially the perceptual, world.

In my research, I rely on machine learning methods such as neural networks, but also on experimental psycholinguistic methods. Here are some of the topics I explore:

  • Data-to-text generation, that is, the automatic summarisation of non-linguistic information, in forms that are understandable by people;
  • The vision-language interface, especially image captioning and grounded inference in multimodal, neural models;
  • The production and generation of referring expressions, especially the question of what to include in object descriptions in visual scenes;
  • Evaluation, that is, how we can determine the quality of the outputs of NLP models (especially generation models).

Apart from these, another long-standing interest is the development of tools and resources for under-resourced languages. In this connection, I've worked quite a bit on NLP for Maltese.

Current and recent projects

Here are some recent and ongoing projects I'm working on.

NL4XAI

Interactive Natural Language Technology for Explainable AI

MUFINS

Multilingual Financial Information News Summarisation

Multi3Generation

Multi3Generation: Multi-Task, Multilingual, Multimodal Language Generation

MASRI

Maltese Automatic Speech Recognition

MLRS

Maltese Language Resource Server

Publications

This is a more or less up-to-date list of publications. You can also find info from the external repos below:



Resources

Code and data created by me and my collaborators.

Code and models

The BERTu Model and training data: Pretrained and finetuned BERT models and a 500-million-word text corpus. This is the first large-scale initiative to train deep transformer networks for NLP in Maltese, an under-resourced language. You can read more about it in Micallef et al. (2022).

Grounded Textual Entailment models and data: a version of the SNLI dataset by Bowman et al. (2015), linked to the images which ground the premises. The data and experiments are described in Vu et al. (2018).

Maltese speech recognition data: Various datasets for speech-to-text in Maltese, collected as part of the MASRI project.

Tools for Maltese NLP: Various tools for Maltese, including a POS tagger, a tokeniser and a phonetic transcriber. Mostly written in Python. Hosted on the Maltese Language Resource Server.

SimpleNLG: a Java library for morphological generation and syntactic realisation. This used to be hosted on Google Code, but is now on GitHub.

Older stuff

The GenChal Repository: an online repository of datasets related to the Generation Challenges, a series of Shared Task challenges organised between 2007 and 2013.

The TUNA Corpus of Referring Expressions: a semantically transparent, annotated corpus of references to objects in visual domains. This corpus has been used in three Shared Task Evaluations since its development.

Team members

Current and former research staff and students.

Current

Hugh-Mee Wong (PhD student, Utrecht University, 2023-present).

Yingjin Song (PhD student, Utrecht University, 2022-present).

Michele Cafagna (PhD student, University of Malta and Utrecht University, 2020-present).

Amanda Muscat (PhD student, University of Malta, 2019-present).

Ettore Mariotti (co-supervised with Jose Maria Alonso, PhD candidate, University of Santiago de Compostela, 2020-present).

Eduardo Calò (co-supervised with Kees van Deemter, PhD student, Utrecht University, 2022-present).

Burak Kilic (co-supervised with Floris Bex, PhD student, Utrecht University, 2022-present).

Former

Juliette Faille (co-supervised with Claire Gardent, PhD candidate, CNRS/LORIA, 2020-2023).

Marc Tanti (post-doc, University of Malta, 2020-2021; PhD student, 2015-2019).

Carlos Hernandez Mena (post-doc, University of Malta, 2019-2021).

Claudia Borg (PhD student, University of Malta, 2012-2016).

Luke Galea (PhD student, co-supervised with Martine Grice and Adam Ussishkin, Institute of Linguistics, University of Cologne, 2012-2016).

Media

Here are some non-technical articles I (co-)wrote, as well as some media coverage of our work.

Writings for the general public

On the automatic generation of language. Newspoint, June 2020.

Contracts at the crossroads (with Gordon J Pace), Sunday Times of Malta, 13th May, 2018.

Drittijiet diġitali u ugwaljanza lingwistika [Digital rights and language equality], Encore, volume 13, 2018.

Is a face worth a thousand words?, Sunday Times of Malta, 23rd July, 2017.

Laboratorju tal-Lingwi: Il-Malti, it-teknoloġija u l-mezzi soċjali [A language lab: Maltese, technology and social media], Leħen il-Malti, volume 36, 2017.

Decoding language (with Gordon J Pace and Mike Rosner), Think Magazine, University of Malta, volume 8, 2014.

Media coverage

I shared some thoughts about large language models with journalists at Pointer, who investigated the potential for misusing ChatGPT in criminal activity (article in Dutch).

MaltaToday Interviews 4 AI Experts on the possibilities and limits of ChatGPT

The Times of Malta reports on our ongoing efforts to crowdsource voices for Maltese ASR, as part of MASRI - Maltese Automatic Speech RecognItion.

MASRI (Maltese Automatic Speech RecognItion) was featured at the Houses of Parliament during Science in the House. News coverage of the event appeared on TVM - Public Broadcasting Services.

Lab to Life: Smart Search for Maltese Legal Professionals. Interview with Cassi Camilleri, Think magazine, 2017. This interview focusses on ongoing work with Gordon J Pace on the intelligent analysis of legal texts.

Computational Linguistics. Interview with Velislava Hillman, Times of Malta, September 11, 2015. This interview formed part of the 'Zone' series on the online Times of Malta portal.

Language and the bigger picture (documentary in Maltese). A documentary on research at the interface between linguistics, cognition, computation, social and health sciences. It was aired on national television on the occasion of the Science in the City festival, 30th September, 2015.

Contact

Drop me a line.

Rm 5.04 Buys Ballotgebouw
Princetonplein 5
3584 CC Utrecht

a {dot} gatt {at} uu.nl