#270 AI Translation State of the Art with Tom Kocmi and Alon Lavie
Update: 2025-11-21

Description

Tom Kocmi, Researcher at Cohere, and Alon Lavie, Distinguished Career Professor at Carnegie Mellon University, join Florian and Slator Language AI Research Analyst Maria Stasimioti on SlatorPod to talk about the state of the art in AI translation and what the latest WMT25 results reveal about progress and remaining challenges.

Tom outlines how the WMT conference has become a crucial annual benchmark for assessing AI translation quality and ensuring systems are tested on fresh, demanding datasets. He notes that systems now face literary text, social-media language, noisy speech transcripts from automatic speech recognition (ASR), and data selected through a difficulty-sampling algorithm. He stresses that these harder inputs expose far more system weaknesses than in previous years.
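
To make the idea of difficulty sampling concrete, here is a minimal sketch of one plausible approach: rank candidate documents by how poorly a baseline system handles them, then sample from the hard end. This is an illustration only, not WMT25's actual algorithm; `sample_hard_documents`, `baseline_score`, and the 30% cutoff are all hypothetical.

```python
# Illustrative sketch of difficulty-based test-set sampling (NOT WMT25's
# actual method). `baseline_score` is a hypothetical callable returning an
# automatic quality estimate for a baseline system's translation of a doc.
import random

def sample_hard_documents(docs, baseline_score, k, hard_fraction=0.3, seed=0):
    """Keep the hardest `hard_fraction` of candidate documents (lowest
    baseline quality scores), then draw k of them at random."""
    rng = random.Random(seed)
    ranked = sorted(docs, key=baseline_score)  # lowest score = hardest, first
    pool = ranked[: max(k, int(len(ranked) * hard_fraction))]
    return rng.sample(pool, k)
```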

He adds that human translators also struggle as they face fatigue, time pressure, and constraints such as not being allowed to post-edit. He emphasizes that human parity claims are unreliable and highlights the need for improved human evaluation design.

Alon underscores that harder test data also challenges evaluators. He explains that segment-level scoring is now more difficult, and even human evaluators miss different subsets of errors. He highlights that automated metrics trained on earlier-era data, notably COMET, underperformed because they absorbed the biases of the systems and data they were trained on.

He reports that the strongest performers in the evaluation task were reasoning-capable large language models (LLMs), either lightly prompted or submitted with elaborate evaluation-specific prompting. He notes that while these LLM-as-judge setups outperformed traditional neural metrics overall, their segment-level performance varied.
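
For readers unfamiliar with the setup, here is a minimal sketch of what "LLM-as-judge" evaluation looks like in practice. It is illustrative only, not any WMT25 submission's actual prompt: `call_llm` is a hypothetical stand-in for whatever chat-completion API is used, and the 0-100 rubric and prompt wording are assumptions.

```python
# Illustrative LLM-as-judge scoring of one translation segment. `call_llm`
# is a hypothetical function taking a prompt string and returning the
# model's reply as a string; the rubric below is an assumed example.

JUDGE_PROMPT = """You are evaluating a translation from {src_lang} to {tgt_lang}.
Source: {source}
Translation: {hypothesis}
List any accuracy or fluency errors, then output a final quality score
from 0 (unusable) to 100 (perfect) on its own line as: SCORE: <number>"""

def judge_segment(call_llm, source, hypothesis,
                  src_lang="English", tgt_lang="German"):
    """Ask the judge model for an error analysis plus a scalar score."""
    reply = call_llm(JUDGE_PROMPT.format(
        src_lang=src_lang, tgt_lang=tgt_lang,
        source=source, hypothesis=hypothesis))
    # Parse the last "SCORE:" line; return None if the model omitted it.
    for line in reversed(reply.splitlines()):
        if line.strip().upper().startswith("SCORE:"):
            return float(line.split(":", 1)[1])
    return None
```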

Tom points out that the translation task also revealed notable progress from smaller academic models of around 9B parameters, some of which ranked near trillion-parameter frontier models. He sees this as a sign that competitive research is still widely accessible.

The duo concludes that the field must choose evaluation methods carefully, avoid assessing models with the same metric used during training, and adopt LLM-based judging for more reliable assessments.
