
Translation Quality Metrics Compared: BLEU, COMET, BERTScore, MQM

An in-depth comparison of translation quality metrics — BLEU, COMET, BERTScore, and MQM — with pros, cons, and guidance on which to use when.

TL;DR — Key Takeaways

  1. BLEU measures n-gram overlap between a translation and reference — fast and cheap but poorly correlated with human judgment for individual segments.
  2. COMET uses neural models trained on human quality judgments, achieving much higher correlation with human evaluation than BLEU.
  3. BERTScore leverages contextual embeddings to compare semantic similarity, handling paraphrases better than BLEU but still reference-dependent.
  4. MQM is a human annotation framework that identifies specific errors — the gold standard for quality assessment but expensive and time-consuming.
  5. No single metric captures all aspects of translation quality. The best practice is to use automated metrics (COMET) for development and MQM for final quality verification.

BLEU: The Pioneer Metric

BLEU (Bilingual Evaluation Understudy), introduced by Kishore Papineni et al. in 2002, was the first widely adopted automated metric for MT evaluation. It measures the overlap of n-grams (sequences of 1-4 words) between a candidate translation and one or more reference translations, with a brevity penalty for translations that are too short.
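
To make the mechanics concrete, here is a minimal pure-Python sketch of sentence-level BLEU — clipped n-gram precisions combined by geometric mean, with a brevity penalty. The smoothing of zero counts is an illustrative simplification; real evaluations should use a maintained implementation such as sacreBLEU.

```python
from collections import Counter
import math

def sentence_bleu(candidate, reference, max_n=4):
    """Illustrative sentence BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) with a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Modified precision: clip each candidate n-gram count by its reference count
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        # Smooth zero counts so the geometric mean stays defined for short sentences
        log_precisions.append(math.log(max(overlap, 0.5) / total))
    # Brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0, while a valid paraphrase with little word overlap scores near zero — exactly the weakness discussed below.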

BLEU's strengths: it's fast, deterministic, language-agnostic, and requires no training data beyond reference translations. It's reproducible — the same translations always get the same score — making it useful for tracking system improvements over time.

BLEU's weaknesses are well documented. It measures only surface-level lexical overlap, so it penalizes valid paraphrases and rewards accidental n-gram matches. It correlates poorly with human judgment at the segment level (individual sentences), and it cannot distinguish critical errors from minor variations. For a single segment, a BLEU score of 30 versus 35 tells you almost nothing about which translation is actually better.

Despite its limitations, BLEU remains ubiquitous in MT research as a baseline comparison metric. Its simplicity and long history make it a common denominator for tracking progress across systems and language pairs.

COMET: Neural Quality Estimation

COMET (Crosslingual Optimized Metric for Evaluation of Translation) uses pre-trained multilingual language models fine-tuned on human quality judgments. Developed at Unbabel, it takes three inputs: source text, candidate translation, and reference translation, producing a quality score that correlates much more strongly with human evaluation than BLEU.

COMET's key advantage is that it understands semantic similarity rather than just surface-level word overlap. 'The cat sat on the mat' and 'A feline was resting on the rug' would score poorly on BLEU but well on COMET, because COMET's neural backbone understands they convey the same meaning.
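
The contrast is easy to quantify with a crude surface-overlap check on those two example sentences (pure Python, for illustration only — this is the kind of matching BLEU relies on, not how COMET scores):

```python
def unigram_overlap(candidate, reference):
    """Fraction of candidate tokens that also appear in the reference
    (case-insensitive) -- a crude stand-in for surface-level matching."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return sum(tok in ref for tok in cand) / len(cand)

ratio = unigram_overlap("A feline was resting on the rug",
                        "The cat sat on the mat")
```

Only two of seven tokens overlap ("on" and "the"), so surface metrics score this pair poorly even though the meaning is identical — the gap COMET's embeddings are designed to close.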

COMET comes in multiple variants: COMET-DA (trained on direct assessment scores), COMET-MQM (trained on MQM human annotations), and reference-free COMET-QE (quality estimation without reference translations). COMET-MQM shows the highest correlation with expert human evaluation.

Limitations: COMET requires GPU computation for efficient scoring, its behavior can be opaque (neural black box), and it may not perform well for language pairs or domains far from its training distribution. It also cannot pinpoint specific errors — it gives an overall quality estimate without explaining what's wrong.

BERTScore: Contextual Embedding Similarity

BERTScore computes similarity between candidate and reference translations using contextual embeddings from BERT or similar transformer models. Instead of matching exact words (like BLEU), it matches words based on their meaning in context, then aggregates these similarity scores.

BERTScore handles synonyms and paraphrases better than BLEU while being more interpretable than COMET. It provides precision, recall, and F1 variants, giving insight into whether errors are omissions (low recall) or additions (low precision).
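
The matching scheme can be sketched with toy vectors standing in for contextual embeddings — illustrative only; the real metric uses BERT token embeddings and optional IDF weighting:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bertscore_sketch(cand_vecs, ref_vecs):
    """BERTScore-style greedy matching over token embeddings:
    precision matches each candidate token to its most similar
    reference token, recall does the reverse, F1 combines them."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because each token matches its nearest neighbor in embedding space rather than requiring an exact string match, synonyms contribute partial credit instead of zero.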

However, BERTScore shares some BLEU limitations: it requires reference translations, doesn't consider source text, and cannot identify specific error types. It performs best for languages well-represented in the underlying BERT model's training data.

In practice, BERTScore occupies a middle ground — better than BLEU for research evaluation, but increasingly superseded by COMET for production MT quality assessment. Its main use case is as an additional signal in multi-metric evaluation pipelines.

MQM: The Human Gold Standard

MQM is fundamentally different from the other metrics discussed here: it's a human annotation framework, not an automated score. Trained evaluators read translations, identify errors, classify them by type and severity, and compute a penalty-based score. This makes MQM the most informative and actionable quality measure available.

MQM's advantages: it provides specific, actionable error feedback; it can evaluate without reference translations (comparing against the source only); it captures quality dimensions that automated metrics miss (cultural appropriateness, register, real-world accuracy); and it's the basis for ISO 5060.

MQM's disadvantages: it's expensive (requiring trained human evaluators), slow (hours per evaluation vs. seconds for automated metrics), and subject to inter-annotator variation. For these reasons, MQM is typically used for final quality checkpoints and periodic audits rather than continuous evaluation.
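
The penalty arithmetic behind an MQM score can be sketched as follows. The severity weights and per-100-words normalization here are illustrative assumptions — real MQM deployments define their own weights, categories, and normalization window:

```python
# Illustrative severity weights; concrete MQM schemes set their own.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_score(error_severities, word_count, per_words=100):
    """Penalty-based quality score: sum severity-weighted penalties,
    normalize per `per_words` words, and subtract from 100."""
    penalty = sum(SEVERITY_WEIGHTS[sev] for sev in error_severities)
    return 100 - penalty * per_words / word_count

# Two minor errors and one major error in a 350-word text
score = mqm_score(["minor", "minor", "major"], word_count=350)
```

The value of MQM lies less in the final number than in the annotated error list behind it, which tells translators exactly what to fix.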

leapCAT uses MQM as one evidence layer, not the headline promise. The point is to show teams what changed, what still needs review, and why a file is ready for sign-off without pushing the whole workflow back outside the team.

Metric Comparison Summary

  • Speed: BLEU (milliseconds) > BERTScore (seconds) > COMET (seconds) > MQM (hours).
  • Cost: BLEU (free) = BERTScore (free) < COMET (GPU cost) < MQM (human evaluator cost).
  • Correlation with human judgment: MQM (gold standard) > COMET (high) > BERTScore (moderate) > BLEU (low for segments).
  • Reference requirement: BLEU (required), BERTScore (required), COMET (optional with the QE variant), MQM (not required).
  • Actionability: MQM (specific error feedback) > COMET (quality estimate) > BERTScore (precision/recall breakdown) > BLEU (single number).

Recommended use: BLEU for historical comparison and research baselines. COMET for development cycle evaluation and system comparison. BERTScore as a supplementary signal. MQM for final quality verification, client deliverables, and quality audits. Use multiple metrics together for a comprehensive picture.
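
One way to encode these recommendations in an evaluation pipeline — the stage names and mapping are hypothetical illustrations, not a library API:

```python
# Hypothetical stage-to-metric mapping reflecting the guidance above.
RECOMMENDED_METRICS = {
    "research_baseline": ["BLEU"],
    "development": ["COMET", "BERTScore"],
    "final_verification": ["MQM"],
}

def metrics_for(stage):
    """Return the metrics suggested for a given evaluation stage,
    defaulting to COMET for unrecognized stages."""
    return RECOMMENDED_METRICS.get(stage, ["COMET"])
```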
