MQM Framework: The Complete Guide to Translation Quality Evaluation

Learn about the Multidimensional Quality Metrics framework — its dimensions, severity levels, scoring system, and how it relates to ISO 17100 and ASTM F2575.

TL;DR — Key Takeaways

  1. MQM (Multidimensional Quality Metrics) is the translation industry's most comprehensive quality evaluation framework, developed by DFKI through the EU's QTLaunchPad and QT21 research projects.
  2. It defines seven core dimensions — accuracy, fluency, terminology, style, design, locale convention, and verity — each with granular error subcategories.
  3. Errors are classified by severity: critical (meaning is dangerously wrong), major (meaning is impaired), and minor (noticeable but not harmful).
  4. MQM scoring follows a penalty-based model: start at 100 and subtract weighted error points to arrive at a final quality score.
  5. leapCAT uses MQM as its quality evaluation framework, achieving an average score of 4.2+ out of 5.0 across production translations.

What Is MQM?

Multidimensional Quality Metrics (MQM) is a framework for evaluating translation quality developed through the European Union's QTLaunchPad and QT21 research projects, led by the German Research Center for Artificial Intelligence (DFKI). Unlike single-score evaluation methods, MQM provides a structured taxonomy of error types that enables granular, reproducible quality assessment.

The framework emerged from a recognition that existing quality evaluation methods — including LISA QA, SAE J2450, and ad hoc rubrics — were either too narrow in scope or too inconsistent in application. MQM unified these approaches into a single, extensible hierarchy that can be customized for specific use cases while maintaining comparability across projects.

At its core, MQM operates on a simple principle: quality is measured by the absence of errors. Rather than asking evaluators to assign subjective quality scores, MQM asks them to identify, classify, and weight specific issues in a translation. This approach reduces inter-annotator disagreement and makes quality scores actionable — you know exactly what needs to be fixed.

The framework has been adopted by major language service providers, technology companies, and standards bodies. It forms the quality metric backbone for the TAUS Dynamic Quality Framework and has influenced the development of ISO 5060 (Translation Quality Evaluation).

The Seven MQM Dimensions

Accuracy measures whether the translation faithfully conveys the meaning of the source text. Subcategories include mistranslation, omission, addition, and untranslated text. Accuracy errors are frequently the most impactful: a mistranslated dosage instruction or legal clause can have severe real-world consequences.

Fluency evaluates whether the translation reads naturally in the target language, regardless of the source. Grammar errors, awkward phrasing, spelling mistakes, and punctuation issues fall under this dimension. A translation can be perfectly accurate yet score poorly on fluency if it reads like machine output.

Terminology assesses the correct and consistent use of domain-specific terms. Errors include using the wrong term for a concept, inconsistent terminology across a document, and failure to follow a provided termbase. In technical and medical translation, terminology errors can change meaning entirely.

Style evaluates whether the translation matches the required register, tone, and stylistic conventions. A legal contract translated in casual language or a marketing brochure written in academic prose would both constitute style errors. This dimension is heavily influenced by the project's skopos (purpose).

Design covers formatting, layout, and markup issues — truncated strings in UI elements, broken HTML tags, incorrect number formats, or misaligned text direction in RTL languages. These errors may not affect meaning but significantly impact usability.

Locale Convention addresses culturally specific adaptations: date/time formats, measurement units, currency symbols, address formats, and culturally inappropriate content. Failing to convert miles to kilometers for a European audience is a locale convention error.

Verity (sometimes called Truthfulness) checks whether the translation contains factually correct information. If a source text contains an error and the translator propagates it without flagging it, or if the translator introduces a factual error, the issue falls under verity.
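
To make the taxonomy concrete, here is a minimal sketch of the seven dimensions with a few of the subcategories named above, written as a Python mapping. The subcategory lists are illustrative examples, not the full MQM hierarchy.

    # A minimal sketch of the MQM error taxonomy described above.
    # Subcategory lists are illustrative examples, not the full hierarchy.
    MQM_DIMENSIONS = {
        "accuracy": ["mistranslation", "omission", "addition", "untranslated"],
        "fluency": ["grammar", "spelling", "punctuation", "awkward_phrasing"],
        "terminology": ["wrong_term", "inconsistent_term", "termbase_violation"],
        "style": ["register", "tone", "convention_violation"],
        "design": ["truncation", "broken_markup", "text_direction"],
        "locale_convention": ["date_time", "measurement", "currency", "address"],
        "verity": ["propagated_source_error", "introduced_factual_error"],
    }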

Severity Levels and Scoring

MQM uses three severity levels to weight errors. Critical errors (typically 25 penalty points per instance) represent issues that could cause harm, legal liability, or severe misunderstanding. Examples include mistranslated safety warnings, wrong drug dosages, or offensive content introduced by translation. A single critical error usually results in a failing score regardless of overall quality.

Major errors (typically 5 penalty points) impair the meaning or usability of the translation but don't cause harm. An omitted sentence in a user manual, a consistently wrong technical term, or grammar errors that change meaning fall into this category. Multiple major errors indicate a translation that needs significant revision.

Minor errors (typically 1 penalty point) are noticeable but don't impair understanding. Inconsistent capitalization, minor style deviations, or awkward but comprehensible phrasing are minor errors. While individually insignificant, a high density of minor errors indicates poor overall craftsmanship.

The standard scoring formula is: Score = 100 - (Sum of Weighted Penalties / Word Count * Normalization Factor). This produces a score from 0 to 100, where 95+ is generally considered passing quality for professional publication. The normalization factor adjusts for text length — a 50-word text with one major error should not receive the same score as a 5,000-word text with one major error.
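
As a concrete illustration, the following Python sketch implements this formula with the typical severity weights. The per-1,000-word normalization factor and the function name are assumptions for illustration; MQM does not mandate a specific factor.

    # A minimal sketch of the penalty-based scoring model described above.
    # The severity weights are the typical defaults from the text; the
    # per-1,000-word normalization factor is an assumption, since MQM
    # does not mandate a specific factor.
    DEFAULT_WEIGHTS = {"minor": 1, "major": 5, "critical": 25}

    def mqm_score(errors, word_count, weights=DEFAULT_WEIGHTS, normalization=1000):
        """Score a list of (dimension, severity) errors on a 0-100 scale."""
        penalty_total = sum(weights[severity] for _, severity in errors)
        # Normalizing by text length keeps a short and a long text with the
        # same error density at the same score.
        score = 100 - (penalty_total / word_count) * normalization
        return max(score, 0.0)  # clamp so heavy penalties cannot go negative

    # Example: a 2,000-word text with two minor errors and one major error.
    errors = [("fluency", "minor"), ("style", "minor"), ("accuracy", "major")]
    print(mqm_score(errors, word_count=2000))  # 100 - (7 / 2000) * 1000 = 96.5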

Organizations can customize penalty weights to match their quality requirements. A pharmaceutical company might assign 50 points to terminology errors (double the default) while a creative marketing agency might weight style errors more heavily. The framework's flexibility is one of its greatest strengths.
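
Continuing the hypothetical sketch above, dimension-specific weighting can be layered on as a multiplier table. The pharma profile below follows the example in the text and is not a standard configuration.

    # Hypothetical dimension-aware variant: a pharma profile that doubles
    # terminology penalties, per the example above.
    DIMENSION_MULTIPLIERS = {"terminology": 2.0}

    def mqm_score_weighted(errors, word_count, weights=DEFAULT_WEIGHTS,
                           multipliers=DIMENSION_MULTIPLIERS, normalization=1000):
        penalty_total = sum(
            weights[severity] * multipliers.get(dimension, 1.0)
            for dimension, severity in errors
        )
        return max(100 - (penalty_total / word_count) * normalization, 0.0)

    # A critical terminology error now costs 50 points instead of 25.
    print(mqm_score_weighted([("terminology", "critical")], word_count=1000))  # 50.0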

MQM, ISO 17100, and ASTM F2575

ISO 17100 is a process standard that specifies requirements for the translation process itself — translator qualifications, revision steps, project management procedures. It tells you how translation should be done but doesn't define how to measure the quality of the output. A translation produced under ISO 17100 processes can still contain errors.

ASTM F2575 (Standard Guide for Quality Assurance in Translation) bridges the gap between process and product by providing guidelines for establishing quality requirements. It recommends defining quality parameters before a project begins but doesn't prescribe a specific error taxonomy or scoring method.

MQM fills the product quality measurement gap. It provides the specific error taxonomy, severity weights, and scoring methodology that ISO 17100 and ASTM F2575 reference but don't define. Think of it this way: ISO 17100 ensures the kitchen follows health codes, ASTM F2575 ensures the menu specifies quality standards, and MQM measures how the food actually tastes.

These standards are complementary, not competing. An enterprise translation program might require ISO 17100 certified vendors (process assurance), define quality expectations per ASTM F2575 (requirements specification), and measure output quality using MQM (product verification). This layered approach provides the most robust quality framework.

A significant recent development is ISO 5060:2024 (Translation and interpreting — Evaluation of translation output), which formally harmonizes with MQM's first-level error typology. ISO 5060 covers human translation, machine translation, and post-edited MT output evaluation, bringing MQM-compatible methodology into the ISO standards family for the first time.

MQM Limitations and Challenges

Inter-annotator agreement remains MQM's biggest challenge. Studies show that even trained annotators disagree on error classification 20-40% of the time, particularly for fluency and style dimensions where judgment is inherently subjective. Two qualified reviewers can assign different severity levels to the same error, leading to score variation.
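
One common way to quantify such disagreement is an agreement statistic like Cohen's kappa over the severity labels two annotators assign to the same error spans. A minimal sketch, assuming scikit-learn is installed; the annotator data is hypothetical:

    from sklearn.metrics import cohen_kappa_score

    # Severity labels two annotators assigned to the same ten error spans
    # (hypothetical data for illustration).
    annotator_a = ["minor", "major", "minor", "minor", "critical",
                   "major", "minor", "major", "minor", "minor"]
    annotator_b = ["minor", "major", "major", "minor", "critical",
                   "minor", "minor", "major", "minor", "major"]

    # Kappa corrects raw agreement for chance; values below roughly 0.6
    # are usually read as only moderate agreement.
    print(cohen_kappa_score(annotator_a, annotator_b))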

MQM requires trained evaluators. The framework's granularity is both a strength and a barrier to adoption — classifying errors into the correct subcategory requires familiarity with the taxonomy and domain expertise. Untrained evaluators tend to over-flag minor issues or miss subtle accuracy errors.

Domain-specific weighting is essential but not standardized. The default penalty weights may not reflect actual quality priorities for all content types. Medical translation should weight accuracy errors far more heavily than marketing translation, but MQM doesn't prescribe domain-specific configurations out of the box.

MQM measures translation errors, not translation excellence. A score of 100 means zero errors were found — it doesn't mean the translation is elegant, creative, or optimally adapted. For transcreation, literary translation, or marketing copy where creative quality matters, MQM should be supplemented with other evaluation methods.

Despite these limitations, MQM remains the most rigorous and widely adopted framework for translation quality evaluation. Its structured approach makes quality measurable, comparable, and improvable — which is why it has become the industry standard for professional translation assessment.

MQM in Practice

Organizations implementing MQM typically start by selecting which dimensions and error types are relevant to their content. A software localization team might focus on accuracy, terminology, and design (UI-specific issues), while a publishing house might emphasize accuracy, fluency, and style.

Sample-based evaluation is the most common approach: rather than reviewing every word of a translation, evaluators assess representative samples (typically 2,000-3,000 words per project) and extrapolate quality scores. Full-text evaluation, while more accurate, is cost-prohibitive for large volumes.
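
In code terms, sample selection might look like the hypothetical sketch below: draw shuffled segments until the sample reaches roughly 2,500 words, annotate and score only those, and treat the result as the project-level estimate.

    import random

    def draw_sample(segments, target_words=2500, seed=42):
        """Draw shuffled segments until the sample reaches ~target_words words.

        Hypothetical helper for illustration; MQM does not prescribe a
        sampling procedure.
        """
        rng = random.Random(seed)
        pool = list(segments)
        rng.shuffle(pool)
        sample, words = [], 0
        for seg in pool:
            sample.append(seg)
            words += len(seg.split())
            if words >= target_words:
                break
        return sample, words

    # The drawn sample is annotated and scored (e.g., with mqm_score above),
    # and the resulting score is extrapolated to the whole project.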

In leapCAT, MQM evaluation comes after the workflow is organized. Teams keep the brief, approved wording, reviewer focus areas, and sign-off history in one place, then use MQM findings to decide what can move forward and what still needs a human decision.
