Recent progress in applying machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite increasing interest in multilingual models, a benchmark that enables comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing.
To encourage more research on multilingual transfer learning, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark. XTREME covers 40 typologically diverse languages spanning 12 language families and includes 9 tasks that require reasoning about different levels of syntax or semantics.
The languages in XTREME are selected to maximize language diversity, coverage in existing tasks, and availability of training data. Among these are many under-studied languages, such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka, and Singapore), Telugu and Malayalam (spoken mainly in southern India), and the Niger-Congo languages Swahili and Yoruba, spoken in Africa.
For a full description of the benchmark, languages and tasks, please see *XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization*.
| Model | Participant | Affiliation | Submission Date | Score | Sentence-pair Classification | Structured Prediction | Question Answering | Sentence Retrieval | Number of parameters (in millions) |
|---|---|---|---|---|---|---|---|---|---|
| Human | - | - | - | 93.3 | 95.1 | 97.0 | 87.8 | - | - |
| VECO 2.0 | AliceMind | Alibaba | Mar 17, 2023 | 85.8 | 90.8 | 84.6 | 77.2 | 94.9 | - |
| Turing ULR v6 | Alexander v-team | Microsoft | Sep 6, 2022 | 85.5 | 91.0 | 83.8 | 77.1 | 94.4 | - |
| ShenNonG | Cloud Xiaowei AI | Tencent | May 22, 2022 | 85.0 | 90.4 | 83.1 | 76.3 | 94.4 | - |
| Turing ULR v5 | Alexander v-team | Microsoft | Nov 24, 2021 | 84.5 | 90.3 | 81.7 | 76.3 | 93.7 | - |
| CoFe | HFL | iFLYTEK | Oct 26, 2021 | 84.1 | 90.1 | 81.4 | 75.0 | 94.2 | - |
| InfoXLM-XFT | Noah's Ark Lab | Huawei | Oct 5, 2021 | 82.2 | 89.3 | 75.5 | 75.2 | 92.4 | - |
| VECO + HICTL | AliceMind + MT | Alibaba | Sep 21, 2021 | 82.0 | 89.0 | 76.7 | 73.4 | 93.3 | 559 |
| Ensemble-Distil-XFT (ED-XFT) | Huawei Ireland Research Center | Huawei | May 5, 2022 | 82.0 | 89.2 | 74.6 | 75.2 | 92.4 | - |
| Polyglot | MLNLC | ByteDance | Apr 29, 2021 | 81.7 | 88.3 | 80.6 | 71.9 | 90.8 | - |
| Unicoder + ZCode | MSRA + Cognition | Microsoft | Apr 26, 2021 | 81.6 | 88.4 | 76.2 | 72.5 | 93.7 | - |
| ERNIE-M | ERNIE Team | Baidu | Jan 1, 2021 | 80.9 | 87.9 | 75.6 | 72.3 | 91.9 | 559 |
| HiCTL | DAMO MT Team | Alibaba | Mar 21, 2021 | 80.8 | 89.0 | 74.4 | 71.9 | 92.6 | - |
| T-ULRv2 + StableTune | Turing | Microsoft | Oct 7, 2020 | 80.7 | 88.8 | 75.4 | 72.9 | 89.3 | 559 |
| Anonymous3 | Anonymous3 | Anonymous3 | Jan 3, 2021 | 79.9 | 88.2 | 74.6 | 71.7 | 89.0 | - |
| FILTER | Dynamics 365 AI Research | Microsoft | Sep 8, 2020 | 77.0 | 87.5 | 71.9 | 68.5 | 84.4 | 559 |
| Creative | Creative | Microsoft | Sep 8, 2021 | 76.5 | 86.3 | 90.8 | 59.7 | 77.5 | - |
| X-STILTs | Phang et al. | New York University | Jun 17, 2020 | 73.5 | 83.9 | 69.4 | 67.2 | 76.5 | 559 |
| xlm-roberta-large-enhanced | GTNLP | N/A | Dec 25, 2022 | 68.7 | 82.2 | 67.2 | 55.9 | 75.6 | - |
| XLM-R (large) | XTREME Team | Alphabet, CMU | - | 68.2 | 82.8 | 69.0 | 62.3 | 61.6 | 559 |
| mBERT | XTREME Team | Alphabet, CMU | - | 59.6 | 73.7 | 66.3 | 53.8 | 47.7 | 178 |
| MMTE | XTREME Team | Alphabet, CMU | - | 59.3 | 74.3 | 65.3 | 52.3 | 48.9 | 190 |
| RemBERT | RemBERT Team | Alphabet | Oct 2, 2020 | 56.1 | 84.1 | 73.3 | 68.6 | NA | 575 |
| XLM | XTREME Team | Alphabet, CMU | - | 55.8 | 75.0 | 65.6 | 43.9 | 44.7 | - |
| Anonymous5 | Anonymous5 | Anonymous5 | Mar 4, 2021 | 53.1 | 75.3 | 66.9 | 52.5 | 18.0 | - |
| mT5 | mT5-Team | Google Research | Jan 13, 2021 | 40.9 | 89.8 | NA | 73.6 | NA | 13000 |
| Anonymous6 | Anonymous6 | Anonymous6 | Dec 20, 2022 | 39.3 | 44.2 | 0.0 | 65.5 | 34.5 | - |
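As a rough sanity check on the leaderboard (the dictionary keys below are illustrative names, not official identifiers), one can macro-average the four category scores; note that the official overall score averages over the nine individual tasks rather than the four categories, so the two numbers differ slightly:

```python
def category_mean(scores):
    """Unweighted mean over the task-category scores a system submitted,
    skipping missing categories (marked None). This only approximates the
    official XTREME score, which averages over individual tasks."""
    vals = [v for v in scores.values() if v is not None]
    return sum(vals) / len(vals)

# VECO 2.0's category scores from the leaderboard above.
veco2 = {
    "sentence_pair_classification": 90.8,
    "structured_prediction": 84.6,
    "question_answering": 77.2,
    "sentence_retrieval": 94.9,
}
print(category_mean(veco2))
```

For VECO 2.0 this macro-average comes out near 86.9, slightly above the official 85.8, consistent with the official score averaging over tasks rather than categories.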
The tasks included in XTREME cover a range of paradigms, including sentence classification, structured prediction, sentence retrieval and cross-lingual question answering. Consequently, to be successful on the XTREME benchmark, models must learn representations that generalize across many standard cross-lingual transfer settings.
Each task covers a subset of the 40 languages. To obtain additional data in low-resource languages for analysis, we automatically translate the test sets of a natural language inference dataset and a question answering dataset into the remaining languages. We show that these translated test sets serve as a reasonable proxy for gold-standard test sets, with the caveat that they overestimate the performance of models that were themselves trained on translations.
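The gold-versus-translated comparison boils down to scoring the same model's predictions against the same labels on both versions of a test set. A toy sketch (all labels and predictions below are made-up placeholders, not XTREME data):

```python
def accuracy(predictions, references):
    """Fraction of examples where the predicted label matches the reference."""
    assert len(predictions) == len(references)
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

# Hypothetical NLI-style labels for the same four premise/hypothesis pairs,
# with model predictions on the gold (human-translated) inputs and on the
# machine-translated inputs.
gold_refs        = ["entailment", "neutral", "contradiction", "neutral"]
preds_on_gold    = ["entailment", "neutral", "neutral",       "neutral"]
preds_on_mt      = ["entailment", "neutral", "contradiction", "neutral"]

print(accuracy(preds_on_gold, gold_refs))  # 0.75
print(accuracy(preds_on_mt, gold_refs))    # 1.0
```

A model fine-tuned on machine-translated data may score higher on the machine-translated test set than on the gold one, which is exactly the overestimation caveat noted above.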
| Family | Languages |
|---|---|
| Afro-Asiatic | Arabic, Hebrew |
| Austro-Asiatic | Vietnamese |
| Austronesian | Indonesian, Javanese, Malay, Tagalog |
| Basque | Basque |
| Dravidian | Malayalam, Tamil, Telugu |
| Indo-European (Indo-Aryan) | Bengali, Marathi, Hindi, Urdu |
| Indo-European (Germanic) | Afrikaans, Dutch, English, German |
| Indo-European (Romance) | French, Italian, Portuguese, Spanish |
| Indo-European (Greek) | Greek |
| Indo-European (Iranian) | Persian |
| Indo-European (Slavic) | Bulgarian, Russian |
| Japonic | Japanese |
| Kartvelian | Georgian |
| Koreanic | Korean |
| Kra-Dai | Thai |
| Niger-Congo | Swahili, Yoruba |
| Sino-Tibetan | Burmese, Mandarin |
| Turkic | Kazakh, Turkish |
| Uralic | Estonian, Finnish, Hungarian |