Run inference with multilingual models
Last updated
🤗 Transformers includes several multilingual models, and some of them are used differently from monolingual models at inference time. Not all of them are, though: some models, like bert-base-multilingual-uncased, can be used just like a monolingual model. This guide shows how to use the multilingual models whose inference usage differs.
XLM has ten different checkpoints, only one of which is monolingual. The nine remaining model checkpoints can be split into two categories: the checkpoints that use language embeddings and those that don't.
The following XLM models use language embeddings to specify the language used at inference:
xlm-mlm-ende-1024 (Masked language modeling, English-German)
xlm-mlm-enfr-1024 (Masked language modeling, English-French)
xlm-mlm-enro-1024 (Masked language modeling, English-Romanian)
xlm-mlm-xnli15-1024 (Masked language modeling, XNLI languages)
xlm-mlm-tlm-xnli15-1024 (Masked language modeling + translation, XNLI languages)
xlm-clm-enfr-1024 (Causal language modeling, English-French)
xlm-clm-ende-1024 (Causal language modeling, English-German)
Language embeddings are represented as a tensor of the same shape as the input_ids passed to the model. The values in these tensors depend on the language used and are identified by the tokenizer's lang2id and id2lang attributes.
This example uses the xlm-clm-enfr-1024 checkpoint (Causal language modeling, English-French). Start by loading its tokenizer and model.
The lang2id attribute of the tokenizer lists this model's languages and their ids.
Next, create an example input by encoding a short text with the tokenizer.
Look up the language id for "en" and use it to define the language embedding. Because the language id for English is 0, the language embedding is a tensor filled with 0. This tensor must have the same shape as input_ids.
Now you can pass the input_ids and language embedding to the model.
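The steps above can be sketched as follows. This is a minimal sketch, assuming PyTorch and the XLMTokenizer and XLMWithLMHeadModel classes from 🤗 Transformers (XLMTokenizer may also need the sacremoses package). The function names build_langs and run_xlm_example, and the example prompt, are illustrative choices rather than part of the library.

```python
def build_langs(lang2id, language, batch_size, sequence_length):
    """Return a (batch_size, sequence_length) nested list filled with one language id."""
    language_id = lang2id[language]
    return [[language_id] * sequence_length for _ in range(batch_size)]


def run_xlm_example(text="Wikipedia was used to", language="en"):
    """Hypothetical helper: run xlm-clm-enfr-1024 with a language embedding."""
    # Imported here so build_langs can be used without the heavy dependencies.
    import torch
    from transformers import XLMTokenizer, XLMWithLMHeadModel

    checkpoint = "xlm-clm-enfr-1024"
    tokenizer = XLMTokenizer.from_pretrained(checkpoint)
    model = XLMWithLMHeadModel.from_pretrained(checkpoint)

    # Languages known to this checkpoint and their ids.
    print(tokenizer.lang2id)

    # Encode the example text; the outer list adds a batch dimension of 1.
    input_ids = torch.tensor([tokenizer.encode(text)])

    # Language embedding: same shape as input_ids, filled with the language id.
    batch_size, sequence_length = input_ids.shape
    langs = torch.tensor(
        build_langs(tokenizer.lang2id, language, batch_size, sequence_length)
    )

    # Pass both tensors to the model; no gradients are needed for inference.
    with torch.no_grad():
        outputs = model(input_ids, langs=langs)
    return outputs.logits
```

Calling run_xlm_example() downloads the checkpoint on first use and returns the language modeling logits for the prompt. Passing language="fr" fills the language embedding with the French id instead.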
The run_generation.py script can generate text with language embeddings using the xlm-clm checkpoints.
The following XLM models do not require language embeddings during inference:
xlm-mlm-17-1280 (Masked language modeling, 17 languages)
xlm-mlm-100-1280 (Masked language modeling, 100 languages)
These models are used for generic sentence representations, unlike the previous XLM checkpoints.
The following BERT models can be used for multilingual tasks:
bert-base-multilingual-uncased (Masked language modeling + Next sentence prediction, 102 languages)
bert-base-multilingual-cased (Masked language modeling + Next sentence prediction, 104 languages)
These models do not require language embeddings during inference. They should identify the language from the context and infer accordingly.
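To illustrate that no language information has to be supplied, here is a minimal sketch using the fill-mask pipeline from 🤗 Transformers. The helper names top_tokens and load_fill_mask and the sample sentences are hypothetical; the sketch assumes each sentence contains exactly one [MASK] token, so the pipeline returns one list of candidate dictionaries per sentence.

```python
def top_tokens(fill_mask, sentences, k=1):
    """Map each sentence to the k highest-ranked fillers for its mask token."""
    return {s: [p["token_str"] for p in fill_mask(s)[:k]] for s in sentences}


def load_fill_mask(checkpoint="bert-base-multilingual-cased"):
    """Hypothetical helper: build a fill-mask pipeline for a multilingual checkpoint."""
    # Imported here so top_tokens can be used without the heavy dependencies.
    from transformers import pipeline

    return pipeline("fill-mask", model=checkpoint)


# Example usage (downloads the checkpoint on first use):
# fill_mask = load_fill_mask()
# sentences = [
#     "Paris is the capital of [MASK].",
#     "Paris est la capitale de la [MASK].",
# ]
# print(top_tokens(fill_mask, sentences, k=3))
```

The same call handles the English and the French sentence; the language is never passed explicitly.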
The following XLM-RoBERTa models can be used for multilingual tasks:
xlm-roberta-base (Masked language modeling, 100 languages)
xlm-roberta-large (Masked language modeling, 100 languages)
XLM-RoBERTa was trained on 2.5TB of newly created and cleaned CommonCrawl data in 100 languages. It provides strong gains over previously released multilingual models like mBERT or XLM on downstream tasks like classification, sequence labeling, and question answering.
The following M2M100 models can be used for multilingual translation:
facebook/m2m100_418M (Translation)
facebook/m2m100_1.2B (Translation)
In this example, load the facebook/m2m100_418M checkpoint to translate from Chinese to English. You can set the source language in the tokenizer.
Then tokenize the text.
M2M100 forces the target language id as the first generated token to translate to the target language. Set forced_bos_token_id to the id of the en language token in the generate method to translate to English.
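The M2M100 steps above can be sketched as follows. This is a minimal sketch, assuming the M2M100Tokenizer and M2M100ForConditionalGeneration classes from 🤗 Transformers (the tokenizer also needs the sentencepiece package). The helper names load_m2m100 and translate_m2m100 and the sample sentence are illustrative, not part of the library.

```python
def load_m2m100(checkpoint="facebook/m2m100_418M", src_lang="zh"):
    """Hypothetical helper: load the tokenizer (with its source language) and the model."""
    # Imported here so translate_m2m100 can be used without the heavy dependencies.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(checkpoint, src_lang=src_lang)
    model = M2M100ForConditionalGeneration.from_pretrained(checkpoint)
    return tokenizer, model


def translate_m2m100(text, tokenizer, model, target_lang="en"):
    """Translate text by forcing the target language id as the first generated token."""
    # Tokenize the source text into PyTorch tensors.
    encoded = tokenizer(text, return_tensors="pt")

    # get_lang_id maps a language code such as "en" to its language token id.
    generated_tokens = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(target_lang)
    )
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)


# Example usage (downloads the checkpoint on first use; the sentence is an
# arbitrary sample):
# tokenizer, model = load_m2m100()
# print(translate_m2m100("生活就像一盒巧克力。", tokenizer, model))
```

Keeping loading and translation separate means the model is loaded once and reused for several inputs or target languages.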
The following MBart models can be used for multilingual translation:
facebook/mbart-large-50-one-to-many-mmt (One-to-many multilingual machine translation, 50 languages)
facebook/mbart-large-50-many-to-many-mmt (Many-to-many multilingual machine translation, 50 languages)
facebook/mbart-large-50-many-to-one-mmt (Many-to-one multilingual machine translation, 50 languages)
facebook/mbart-large-50 (Multilingual translation, 50 languages)
facebook/mbart-large-cc25
In this example, load the facebook/mbart-large-50-many-to-many-mmt checkpoint to translate Finnish to English. You can set the source language in the tokenizer.
Then tokenize the text.
MBart forces the target language id as the first generated token to translate to the target language. Set forced_bos_token_id to the id of the English language token (en_XX in this checkpoint's language codes) in the generate method to translate to English.
If you are using the facebook/mbart-large-50-many-to-one-mmt checkpoint, you don't need to force the target language id as the first generated token; otherwise, the usage is the same.
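The MBart steps above can be sketched as follows. This is a minimal sketch, assuming the AutoTokenizer and AutoModelForSeq2SeqLM classes from 🤗 Transformers and the mBART-50 language codes fi_FI and en_XX. The helper names load_mbart and translate_mbart and the sample sentence are illustrative, not part of the library. Passing target_lang_code=None skips the forced token, which matches the many-to-one case.

```python
def load_mbart(checkpoint="facebook/mbart-large-50-many-to-many-mmt", src_lang="fi_FI"):
    """Hypothetical helper: load the tokenizer (with its source language) and the model."""
    # Imported here so translate_mbart can be used without the heavy dependencies.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    return tokenizer, model


def translate_mbart(text, tokenizer, model, target_lang_code="en_XX"):
    """Translate text, optionally forcing the target language as the first token."""
    # Tokenize the source text into PyTorch tensors.
    encoded = tokenizer(text, return_tensors="pt")

    generate_kwargs = {}
    if target_lang_code is not None:
        # Language codes are ordinary tokens, so their id can be looked up directly.
        generate_kwargs["forced_bos_token_id"] = tokenizer.convert_tokens_to_ids(
            target_lang_code
        )

    generated_tokens = model.generate(**encoded, **generate_kwargs)
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)


# Example usage (downloads the checkpoint on first use; the sentence is an
# arbitrary sample):
# tokenizer, model = load_mbart()
# fi_text = "Älä sekaannu velhojen asioihin, sillä ne ovat hienovaraisia ja nopeasti vihaisia."
# print(translate_mbart(fi_text, tokenizer, model))
```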