BARTpho

Overview

The BARTpho model was proposed in BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.

The abstract from the paper is the following:

We present BARTpho with two versions — BARTpho_word and BARTpho_syllable — the first public large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. Our BARTpho uses the “large” architecture and pre-training scheme of the sequence-to-sequence denoising model BART, thus especially suitable for generative NLP tasks. Experiments on a downstream task of Vietnamese text summarization show that in both automatic and human evaluations, our BARTpho outperforms the strong baseline mBART and improves the state-of-the-art. We release BARTpho to facilitate future research and applications of generative Vietnamese NLP tasks.

Example of use:

>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable")

>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

>>> line = "Chúng tôi là những nghiên cứu viên."

>>> input_ids = tokenizer(line, return_tensors="pt")

>>> with torch.no_grad():
...     features = bartpho(**input_ids)  # Model outputs are now tuples

>>> # With TensorFlow 2.0+:
>>> from transformers import TFAutoModel

>>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable")
>>> input_ids = tokenizer(line, return_tensors="tf")
>>> features = bartpho(**input_ids)

Tips:

  • Following mBART, BARTpho uses the “large” architecture of BART with an additional layer-normalization layer on top of both the encoder and decoder. Thus, when adapting usage examples from the BART documentation to BARTpho, replace the BART-specialized classes with their mBART-specialized counterparts. For example:

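One possible sketch of this, reusing the tokenizer loaded in the example above to fill in a masked token with the mBART-specialized class (the example sentence is arbitrary, and this is only one way to use the model):

>>> from transformers import MBartForConditionalGeneration

>>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
>>> TXT = "Chúng tôi là <mask> nghiên cứu viên."
>>> input_ids = tokenizer([TXT], return_tensors="pt")["input_ids"]
>>> logits = bartpho(input_ids).logits
>>> # locate the <mask> position and inspect the top-5 predictions for it
>>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
>>> probs = logits[0, masked_index].softmax(dim=0)
>>> values, predictions = probs.topk(5)
>>> print(tokenizer.decode(predictions).split())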

  • This implementation is only for tokenization: “monolingual_vocab_file” consists of Vietnamese-specialized types extracted from the pre-trained SentencePiece model “vocab_file”, which is available from the multilingual XLM-RoBERTa. Other languages that employ this pre-trained multilingual SentencePiece model “vocab_file” for subword segmentation can reuse BartphoTokenizer with their own language-specialized “monolingual_vocab_file”, as sketched below.
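
An illustrative sketch of that reuse; both file names below are hypothetical placeholders for the multilingual SentencePiece model and a language-specialized monolingual vocabulary you have prepared:

>>> from transformers import BartphoTokenizer

>>> # hypothetical local files: the XLM-RoBERTa SentencePiece model and your own monolingual vocabulary
>>> tokenizer = BartphoTokenizer(
...     vocab_file="sentencepiece.bpe.model",
...     monolingual_vocab_file="monolingual_vocab.txt",
... )
>>> tokenizer("A sample sentence.", return_tensors="pt")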

This model was contributed by dqnguyen. The original code can be found here.

BartphoTokenizer

class transformers.BartphoTokenizer

( vocab_file, monolingual_vocab_file, bos_token = '<s>', eos_token = '</s>', sep_token = '</s>', cls_token = '<s>', unk_token = '<unk>', pad_token = '<pad>', mask_token = '<mask>', sp_model_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, **kwargs )

Parameters

  • vocab_file (str) — Path to the vocabulary file. This vocabulary is the pre-trained SentencePiece model available from the multilingual XLM-RoBERTa, also used in mBART, consisting of 250K types.

  • monolingual_vocab_file (str) — Path to the monolingual vocabulary file. This monolingual vocabulary consists of Vietnamese-specialized types extracted from the multilingual vocabulary vocab_file of 250K types.

  • bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.

    When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.

  • eos_token (str, optional, defaults to "</s>") — The end of sequence token.

    When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

  • sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

  • cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

  • unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

  • pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.

  • mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

  • additional_special_tokens (List[str], optional, defaults to ["<s>NOTUSED", "</s>NOTUSED"]) — Additional special tokens used by the tokenizer.

  • sp_model_kwargs (dict, optional) — Will be passed to the SentencePieceProcessor.__init__() method. The Python wrapper for SentencePiece can be used, among other things, to set the following (see the sketch after this parameter list):

    • enable_sampling: Enable subword regularization.

    • nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.

      • nbest_size = {0,1}: No sampling is performed.

      • nbest_size > 1: samples from the nbest_size results.

      • nbest_size < 0: assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.

    • alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.

  • sp_model (SentencePieceProcessor) — The SentencePiece processor that is used for every conversion (string, tokens and IDs).
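
As referenced above, an illustrative sketch of passing sp_model_kwargs when loading the pre-trained tokenizer; the sampling values below are arbitrary examples that enable subword regularization, so repeated tokenizations of the same text can differ:

>>> from transformers import BartphoTokenizer

>>> tokenizer = BartphoTokenizer.from_pretrained(
...     "vinai/bartpho-syllable",
...     sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
... )
>>> tokenizer.tokenize("Chúng tôi là những nghiên cứu viên.")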

Adapted from XLMRobertaTokenizer. Based on SentencePiece.

This tokenizer inherits from PreTrainedTokenizer, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

build_inputs_with_special_tokens

( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

Returns

List[int]

List of input IDs with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A BARTpho sequence has the following format:

  • single sequence: <s> X </s>

  • pair of sequences: <s> A </s></s> B </s>
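
A short sketch of both cases, assuming a BartphoTokenizer loaded as tokenizer (for example via BartphoTokenizer.from_pretrained("vinai/bartpho-syllable")):

>>> # single sequence: <s> X </s>
>>> ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Chúng tôi là những nghiên cứu viên."))
>>> single = tokenizer.build_inputs_with_special_tokens(ids_a)
>>> # pair of sequences: <s> A </s></s> B </s>
>>> ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Xin chào."))
>>> pair = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)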

convert_tokens_to_string

( tokens )

Converts a sequence of tokens (strings for sub-words) into a single string.

create_token_type_ids_from_sequences

( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs.

  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

Returns

List[int]

List of zeros.

Create a mask from the two sequences passed to be used in a sequence-pair classification task. BARTpho does not make use of token type ids, therefore a list of zeros is returned.
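
A one-line sketch, assuming tokenizer and the id lists from the earlier example:

>>> tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)  # a list of zeros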

get_special_tokens_mask

( token_ids_0: typing.List[int], token_ids_1: typing.Optional[typing.List[int]] = None, already_has_special_tokens: bool = False ) → List[int]

Parameters

  • token_ids_0 (List[int]) — List of IDs.

  • token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs.

  • already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.

Returns

List[int]

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.
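
An illustrative sketch, again assuming tokenizer and ids_a from the earlier example:

>>> # 1 marks the positions of the added <s> and </s>, 0 marks the sequence tokens
>>> tokenizer.get_special_tokens_mask(ids_a)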
