About me
My research mostly focusses on the automatic generation of language from non-linguistic information (a.k.a. Natural Language Generation). One important aspect of this is how systems -- artificial or human -- learn meaningful relationships between language and the non-linguistic, especially the perceptual, world.
In my research, I rely on machine learning methods such as neural networks, but also on experimental psycholinguistic methods. Here are some of the topics I explore:
- Data-to-text generation, that is, the automatic summarisation of non-linguistic information, in forms that are understandable by people;
- The vision-language interface, especially image captioning and grounded inference in multimodal, neural models;
- The production and generation of referring expressions, especially the question of what to include in object descriptions in visual scenes;
- Evaluation, that is, how we can determine the quality of the outputs of NLP models (especially generation models).
Apart from these, another long-standing interest is the development of tools and resources for under-resourced languages. In this connection, I've worked quite a bit on NLP for Maltese.
Current and recent projects
Here are some of my recent and ongoing projects.
Publications
This is a more or less up-to-date list of publications. You can also find them in the external repositories below:
Resources
Code and data created by me and my collaborators.
Code and models
The BERTu Model and training data: Pretrained and fine-tuned BERT models and a 500-million-word text corpus. This is the first large-scale initiative to train deep transformer networks for NLP in Maltese, an under-resourced language. You can read more about it in Micallef et al (2022).
Grounded Textual Entailment models and data: a version of the SNLI dataset by Bowman et al (2015), linked to the images which ground the premises. The data and experiments are described in Vutrong et al (2018).
Maltese speech recognition data: Various datasets for speech-to-text in Maltese, collected as part of the MASRI project.
Tools for Maltese NLP: Various tools (including a POS tagger, tokeniser and phonetic transcriber) for Maltese. Mostly written in Python. Hosted on the Maltese Language Resource Server.
SimpleNLG: a Java library for morphological generation and syntactic realisation. This used to be hosted on Google Code, but is now on GitHub.
Older stuff
The GenChal Repository: an online repository of datasets related to the Generation Challenges, a series of Shared Task challenges organised between 2007 and 2013.
The TUNA Corpus of Referring Expressions, a semantically transparent, annotated corpus of references to objects in visual domains. This corpus has been used in three Shared Task Evaluations since its development.
Team members
Current and former research staff and students.
Current
Ece Takmaz (post-doc, Utrecht University, 2024).
Hugh-Mee Wong (PhD student, Utrecht University, 2023-present).
Yingjin Song (PhD student, Utrecht University, 2022-present).
Eduardo Calò (co-supervised with Kees van Deemter, PhD student, Utrecht University, 2022-2024).
Burak Kilic (co-supervised with Floris Bex, PhD student, Utrecht University, 2022-present).
Amanda Muscat (PhD student, University of Malta, 2019-present).
Former
Michele Cafagna (PhD student, University of Malta and Utrecht University, 2020-2024).
Ettore Mariotti (co-supervised with Jose Maria Alonso, PhD candidate, University of Santiago de Compostela, 2020-2024).
Juliette Faille (co-supervised with Claire Gardent, PhD candidate, CNRS/LORIA, 2020-2023).
Marc Tanti (post-doc, University of Malta, 2020-2021; PhD student, 2015-2019).
Carlos Hernandez Mena (post-doc, University of Malta, 2019-2021).
Claudia Borg (PhD student, University of Malta, 2012-2016).
Luke Galea (PhD student, co-supervised with Martine Grice and Adam Ussishkin, Institute of Linguistics, University of Cologne, 2012-2016).
Media
Here are some non-technical articles I (co-)wrote, as well as some media coverage of our work.
Writings for the general public
On the automatic generation of language. Newspoint, June 2020
Contracts at the crossroads (with Gordon J Pace), Sunday Times of Malta, 13th May, 2018.
Drittijiet diġitali u ugwaljanza lingwistika [Digital rights and language equality], Encore, volume 13, 2018
Is a face worth a thousand words?, Sunday Times of Malta, 23rd July, 2017.
Laboratorju tal-Lingwi: Il-Malti, it-teknoloġija u l-mezzi soċjali [A language lab: Maltese, technology and social media], Leħen il-Malti, volume 36, 2017
Decoding language (with Gordon J Pace and Mike Rosner), Think Magazine, University of Malta, volume 8, 2014.
Media coverage
MaltaToday Interviews 4 AI Experts on the possibilities and limits of ChatGPT
The Times of Malta reports on our ongoing efforts to crowdsource voices for Maltese ASR, as part of MASRI - Maltese Automatic Speech RecognItion.
MASRI (Maltese Automatic Speech RecognItion) was featured at the Houses of Parliament during Science in the House. News coverage of the event appeared on TVM - Public Broadcasting Services.
Lab to Life: Smart Search for Maltese Legal Professionals. Interview with Cassi Camilleri, Think magazine, 2017. This interview focusses on ongoing work with Gordon J Pace on the intelligent analysis of legal texts.
Computational Linguistics. Interview with Velislava Hillman, Times of Malta, September 11, 2015. This interview formed part of the 'Zone' series on the online Times of Malta portal.
Language and the bigger picture (documentary in Maltese). A documentary on research at the interface between linguistics, cognition, computation, and the social and health sciences. It was aired on national television on the occasion of the Science in the City festival, 30th September, 2015.
Contact
Drop me a line.
Rm 5.04 Buys Ballotgebouw
Princetonplein 5
3584 CC Utrecht
a {dot} gatt {at} uu.nl