VITS
The VITS model was proposed in Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech by Jaehyeon Kim, Jungil Kong, and Juhee Son.
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior.
A set of spectrogram-based acoustic features are predicted by the flow-based module, which is formed of a Transformer-based text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers, much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to synthesise speech with different rhythms from the same input text.
The model is trained end-to-end with a combination of losses derived from the variational lower bound and adversarial training. To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the waveform using a cascade of the flow module and HiFi-GAN decoder. Due to the stochastic nature of the duration predictor, the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform.
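As a toy illustration of duration-based up-sampling, the sketch below repeats each text encoding for as many frames as its predicted duration. It is not the library's implementation: the helper name upsample_by_duration, rounding up with ceil, and dividing the durations by the speaking rate are all assumptions made for the example.

```python
import math


def upsample_by_duration(encodings, durations, speaking_rate=1.0):
    """Repeat each text encoding for as many frames as its predicted duration.

    `durations` holds one (float) frame count per encoding. Dividing by
    `speaking_rate` and rounding up are illustrative assumptions.
    """
    if len(encodings) != len(durations):
        raise ValueError("need exactly one duration per encoding")
    frames = []
    for encoding, duration in zip(encodings, durations):
        repeats = math.ceil(duration / speaking_rate)
        frames.extend([encoding] * repeats)
    return frames


tokens = ["h", "i", "!"]
predicted = [1.2, 0.4, 2.0]
print(upsample_by_duration(tokens, predicted))                     # ['h', 'h', 'i', '!', '!']
print(upsample_by_duration(tokens, predicted, speaking_rate=2.0))  # ['h', 'i', '!']
```

Because the real durations are sampled by a stochastic predictor, the number of repeats (and so the rhythm of the speech) can change from run to run unless the seed is fixed.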
The abstract from the paper is the following:
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
This model can also be used with TTS checkpoints from Massively Multilingual Speech (MMS), as these checkpoints use the same architecture and a slightly modified tokenizer.
This model was contributed by and . The original code can be found .
Both the VITS and MMS-TTS checkpoints can be used with the same API. Since the flow-based model is non-deterministic, it is good practice to set a seed to ensure reproducibility of the outputs. For languages with a Roman alphabet, such as English or French, the tokenizer can be used directly to pre-process the text inputs. The following code example runs a forward pass using the MMS-TTS English checkpoint:
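A minimal sketch of such a forward pass is given here. The checkpoint id facebook/mms-tts-eng, the seed value, and the helper name synthesize are assumptions made for illustration; calling the helper requires torch and transformers to be installed and downloads the checkpoint on first use.

```python
def synthesize(text, checkpoint="facebook/mms-tts-eng", seed=555):
    """Return (waveform, sampling_rate) for `text` from a VITS / MMS-TTS checkpoint."""
    # Imported inside the function so that defining it does not require the
    # heavy dependencies; calling it needs `torch` and `transformers`.
    import torch
    from transformers import VitsModel, VitsTokenizer, set_seed

    tokenizer = VitsTokenizer.from_pretrained(checkpoint)
    model = VitsModel.from_pretrained(checkpoint)

    inputs = tokenizer(text=text, return_tensors="pt")

    set_seed(seed)  # fix the seed so repeated calls give the same waveform
    with torch.no_grad():
        outputs = model(**inputs)

    # `outputs.waveform` has shape (batch_size, num_samples); take the only item.
    return outputs.waveform[0], model.config.sampling_rate
```

For example, synthesize("Hello - my dog is cute") would return a one-dimensional tensor of audio samples together with the sampling rate stored in the model configuration.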
The resulting waveform can be saved as a .wav file:
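One possible way to do this with only the Python standard library is sketched below. It assumes the waveform has already been converted to a plain sequence of floats in the range [-1, 1] (for a PyTorch tensor, tolist() would do that); the helper name save_wav and the synthetic test tone are illustrative. A function such as scipy.io.wavfile.write is a common alternative when SciPy is available.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] to `path` as 16-bit PCM."""
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
    pcm = [int(round(s * 32767)) for s in clipped]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Stand-in for model output: one second of a 440 Hz tone at 16 kHz.
rate = 16000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
out_path = os.path.join(tempfile.gettempdir(), "vits_example.wav")
save_wav(out_path, tone, rate)
```

The sampling rate passed to the helper should be the one stored in the model configuration (sampling_rate), otherwise the audio plays back at the wrong speed.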
Or displayed in a Jupyter Notebook / Google Colab:
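A small sketch of the notebook case, assuming IPython is available in the session; the helper name play_in_notebook is illustrative, and the samples are expected as a one-dimensional list or array.

```python
def play_in_notebook(samples, sampling_rate):
    """Return an IPython audio player for a 1-D sequence of samples."""
    # Imported lazily: IPython is only needed inside a notebook session.
    from IPython.display import Audio

    return Audio(samples, rate=sampling_rate)
```

In a notebook cell the returned object needs to be the last expression of the cell (or be passed to IPython's display function) for the player to appear.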
You can check whether you require the uroman package for your language by inspecting the is_uroman attribute of the pre-trained tokenizer:
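A minimal sketch of that check, assuming transformers is installed; the helper name needs_uroman is illustrative and the checkpoint id is supplied by the caller.

```python
def needs_uroman(checkpoint):
    """Return True if the checkpoint's tokenizer expects uroman-romanized text."""
    # Imported lazily so the helper can be defined without `transformers` installed.
    from transformers import VitsTokenizer

    tokenizer = VitsTokenizer.from_pretrained(checkpoint)
    return bool(tokenizer.is_uroman)
```

Calling the helper downloads the tokenizer files for the given checkpoint, so it needs network access the first time it runs.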
If required, you should apply the uroman package to your text inputs prior to passing them to the VitsTokenizer, since currently the tokenizer does not support performing the pre-processing itself.
To do this, first clone the uroman repository to your local machine and set the bash variable UROMAN to the local path of the clone (for example, by running export UROMAN=$(pwd) from inside the cloned directory).
You can then pre-process the text input using the following code snippet. You can either rely on the bash variable UROMAN to point to the uroman repository, or pass the uroman directory as an argument to the uromanize function:
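A sketch of such a helper is shown below. It assumes that perl is on the PATH and that the script lives at bin/uroman.pl inside the cloned repository; the helper resolve_uroman_script is a hypothetical name introduced for the example.

```python
import os
import subprocess


def resolve_uroman_script(uroman_path=None):
    """Locate uroman.pl from an explicit directory or the UROMAN variable."""
    root = uroman_path or os.environ.get("UROMAN")
    if not root:
        raise ValueError("pass uroman_path or set the UROMAN environment variable")
    # Assumption: the perl entry point lives at bin/uroman.pl in the clone.
    return os.path.join(root, "bin", "uroman.pl")


def uromanize(input_string, uroman_path=None):
    """Romanize `input_string` by piping it through the uroman perl script."""
    command = ["perl", resolve_uroman_script(uroman_path)]
    result = subprocess.run(
        command, input=input_string.encode("utf-8"), capture_output=True
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"uroman failed ({result.returncode}): {result.stderr.decode()}"
        )
    # Drop the trailing newline that the script appends to its output.
    return result.stdout.decode("utf-8").rstrip("\n")
```

The romanized string returned by uromanize can then be passed to the tokenizer in place of the original text.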
VitsConfig( vocab_size = 38, hidden_size = 192, num_hidden_layers = 6, num_attention_heads = 2, window_size = 4, use_bias = True, ffn_dim = 768, layerdrop = 0.1, ffn_kernel_size = 3, flow_size = 192, spectrogram_bins = 513, hidden_act = 'relu', hidden_dropout = 0.1, attention_dropout = 0.1, activation_dropout = 0.1, initializer_range = 0.02, layer_norm_eps = 1e-05, use_stochastic_duration_prediction = True, num_speakers = 1, speaker_embedding_size = 0, upsample_initial_channel = 512, upsample_rates = [8, 8, 2, 2], upsample_kernel_sizes = [16, 16, 4, 4], resblock_kernel_sizes = [3, 7, 11], resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], leaky_relu_slope = 0.1, depth_separable_channels = 2, depth_separable_num_layers = 3, duration_predictor_flow_bins = 10, duration_predictor_tail_bound = 5.0, duration_predictor_kernel_size = 3, duration_predictor_dropout = 0.5, duration_predictor_num_flows = 4, duration_predictor_filter_channels = 256, prior_encoder_num_flows = 4, prior_encoder_num_wavenet_layers = 4, posterior_encoder_num_wavenet_layers = 16, wavenet_kernel_size = 5, wavenet_dilation_rate = 1, wavenet_dropout = 0.0, speaking_rate = 1.0, noise_scale = 0.667, noise_scale_duration = 0.8, sampling_rate = 16000, **kwargs )
Parameters
hidden_size (int, optional, defaults to 192): Dimensionality of the text encoder layers.
num_hidden_layers (int, optional, defaults to 6): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 2): Number of attention heads for each attention layer in the Transformer encoder.
window_size (int, optional, defaults to 4): Window size for the relative positional embeddings in the attention layers of the Transformer encoder.
use_bias (bool, optional, defaults to True): Whether to use bias in the key, query, value projection layers in the Transformer encoder.
ffn_dim (int, optional, defaults to 768): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
ffn_kernel_size (int, optional, defaults to 3): Kernel size of the 1D convolution layers used by the feed-forward network in the Transformer encoder.
flow_size (int, optional, defaults to 192): Dimensionality of the flow layers.
spectrogram_bins (int, optional, defaults to 513): Number of frequency bins in the target spectrogram.
hidden_act (str or function, optional, defaults to "relu"): The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout (float, optional, defaults to 0.1): The dropout probability for all fully connected layers in the embeddings and encoder.
attention_dropout (float, optional, defaults to 0.1): The dropout ratio for the attention probabilities.
activation_dropout (float, optional, defaults to 0.1): The dropout ratio for activations inside the fully connected layer.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-5): The epsilon used by the layer normalization layers.
use_stochastic_duration_prediction (bool, optional, defaults to True): Whether to use the stochastic duration prediction module or the regular duration predictor.
num_speakers (int, optional, defaults to 1): Number of speakers if this is a multi-speaker model.
speaker_embedding_size (int, optional, defaults to 0): Number of channels used by the speaker embeddings. Is zero for single-speaker models.
upsample_initial_channel (int, optional, defaults to 512): The number of input channels into the HiFi-GAN upsampling network.
upsample_rates (Tuple[int] or List[int], optional, defaults to [8, 8, 2, 2]): A tuple of integers defining the stride of each 1D convolutional layer in the HiFi-GAN upsampling network. The length of upsample_rates defines the number of convolutional layers and has to match the length of upsample_kernel_sizes.
upsample_kernel_sizes (Tuple[int] or List[int], optional, defaults to [16, 16, 4, 4]): A tuple of integers defining the kernel size of each 1D convolutional layer in the HiFi-GAN upsampling network. The length of upsample_kernel_sizes defines the number of convolutional layers and has to match the length of upsample_rates.
resblock_kernel_sizes (Tuple[int] or List[int], optional, defaults to [3, 7, 11]): A tuple of integers defining the kernel sizes of the 1D convolutional layers in the HiFi-GAN multi-receptive field fusion (MRF) module.
resblock_dilation_sizes (Tuple[Tuple[int]] or List[List[int]], optional, defaults to [[1, 3, 5], [1, 3, 5], [1, 3, 5]]): A nested tuple of integers defining the dilation rates of the dilated 1D convolutional layers in the HiFi-GAN multi-receptive field fusion (MRF) module.
leaky_relu_slope (float, optional, defaults to 0.1): The angle of the negative slope used by the leaky ReLU activation.
depth_separable_channels (int, optional, defaults to 2): Number of channels to use in each depth-separable block.
depth_separable_num_layers (int, optional, defaults to 3): Number of convolutional layers to use in each depth-separable block.
duration_predictor_flow_bins (int, optional, defaults to 10): Number of channels to map using the unconstrained rational spline in the duration predictor model.
duration_predictor_tail_bound (float, optional, defaults to 5.0): Value of the tail bin boundary when computing the unconstrained rational spline in the duration predictor model.
duration_predictor_kernel_size (int, optional, defaults to 3): Kernel size of the 1D convolution layers used in the duration predictor model.
duration_predictor_dropout (float, optional, defaults to 0.5): The dropout ratio for the duration predictor model.
duration_predictor_num_flows (int, optional, defaults to 4): Number of flow stages used by the duration predictor model.
duration_predictor_filter_channels (int, optional, defaults to 256): Number of channels for the convolution layers used in the duration predictor model.
prior_encoder_num_flows (int, optional, defaults to 4): Number of flow stages used by the prior encoder flow model.
prior_encoder_num_wavenet_layers (int, optional, defaults to 4): Number of WaveNet layers used by the prior encoder flow model.
posterior_encoder_num_wavenet_layers (int, optional, defaults to 16): Number of WaveNet layers used by the posterior encoder model.
wavenet_kernel_size (int, optional, defaults to 5): Kernel size of the 1D convolution layers used in the WaveNet model.
wavenet_dilation_rate (int, optional, defaults to 1): Dilation rates of the dilated 1D convolutional layers used in the WaveNet model.
wavenet_dropout (float, optional, defaults to 0.0): The dropout ratio for the WaveNet layers.
speaking_rate (float, optional, defaults to 1.0): Speaking rate. Larger values give faster synthesised speech.
noise_scale (float, optional, defaults to 0.667): How random the speech prediction is. Larger values create more variation in the predicted speech.
noise_scale_duration (float, optional, defaults to 0.8): How random the duration prediction is. Larger values create more variation in the predicted durations.
sampling_rate (int, optional, defaults to 16000): The sampling rate at which the output audio waveform is digitized, expressed in hertz (Hz).
Example:
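As a small worked example of how some of these defaults relate, the sketch below checks the documented length constraint between upsample_rates and upsample_kernel_sizes and computes how many audio samples the decoder produces per input frame. It assumes each transposed convolution up-samples by exactly its stride; the helper name samples_per_frame is illustrative and not part of the library.

```python
import math


def samples_per_frame(upsample_rates, upsample_kernel_sizes):
    """Total up-sampling factor of the decoder: the product of all strides."""
    if len(upsample_rates) != len(upsample_kernel_sizes):
        raise ValueError(
            "upsample_rates and upsample_kernel_sizes must have equal length"
        )
    return math.prod(upsample_rates)


# Default values from the parameter descriptions.
factor = samples_per_frame([8, 8, 2, 2], [16, 16, 4, 4])
frame_ms = 1000 * factor / 16000  # duration of one frame at 16 kHz
print(factor, frame_ms)  # 256 16.0
```

Under that assumption, one frame entering the decoder corresponds to 256 output samples, or 16 ms of audio at the default sampling rate.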
VitsTokenizer( vocab_file, pad_token = '<pad>', unk_token = '<unk>', language = None, add_blank = True, normalize = True, phonemize = True, is_uroman = False, **kwargs )
Parameters
vocab_file (str): Path to the vocabulary file.
language (str, optional): Language identifier.
add_blank (bool, optional, defaults to True): Whether to insert token id 0 in between the other tokens.
normalize (bool, optional, defaults to True): Whether to normalize the input text by removing all casing and punctuation.
phonemize (bool, optional, defaults to True): Whether to convert the input text into phonemes.
is_uroman (bool, optional, defaults to False): Whether the uroman Romanizer needs to be applied to the input text prior to tokenizing.
Construct a VITS tokenizer. Also supports MMS-TTS.
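To make the add_blank option concrete, the toy function below intersperses a blank id between token ids. Placing a blank at both ends as well is an assumption modelled on the intersperse helper of the original VITS code; the function name add_blank_tokens is illustrative and this is not the tokenizer's own implementation.

```python
def add_blank_tokens(token_ids, blank_id=0):
    """Place `blank_id` before, between, and after the given token ids."""
    interspersed = [blank_id] * (2 * len(token_ids) + 1)
    interspersed[1::2] = token_ids  # original tokens land on the odd positions
    return interspersed


print(add_blank_tokens([5, 6, 7]))  # [0, 5, 0, 6, 0, 7, 0]
```

A consequence worth keeping in mind is that the encoded sequence is roughly twice as long as the number of characters or phonemes in the input.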
__call__
Parameters
text (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_target (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_pair_target (str, List[str], List[List[str]], optional): The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
add_special_tokens (bool, optional, defaults to True): Whether or not to add special tokens when encoding the sequences. This will use the underlying PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
Values accepted by padding:
    True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
    'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
    False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
Values accepted by truncation:
    True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
    'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
    False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
max_length (int, optional): Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.
stride (int, optional, defaults to 0): If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
is_split_into_words (bool, optional, defaults to False): Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.
pad_to_multiple_of (int, optional): If set, will pad the sequence to a multiple of the provided value. Requires padding to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).
Values accepted by return_tensors:
    'tf': Return TensorFlow tf.constant objects.
    'pt': Return PyTorch torch.Tensor objects.
    'np': Return Numpy np.ndarray objects.
return_token_type_ids (bool, optional): Whether to return token type IDs. If left to the default, will return the token type IDs according to the specific tokenizer's default, defined by the return_outputs attribute.
return_attention_mask (bool, optional): Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer's default, defined by the return_outputs attribute.
return_overflowing_tokens (bool, optional, defaults to False): Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first or True, an error is raised instead of returning overflowing tokens.
return_special_tokens_mask (bool, optional, defaults to False): Whether or not to return special tokens mask information.
return_offsets_mapping (bool, optional, defaults to False): Whether or not to return (char_start, char_end) for each token.
return_length (bool, optional, defaults to False): Whether or not to return the lengths of the encoded inputs.
verbose (bool, optional, defaults to True): Whether or not to print more information and warnings.
**kwargs: Passed to the self.tokenize() method.
Returns
input_ids: List of token ids to be fed to a model.
token_type_ids: List of token type ids to be fed to a model (when return_token_type_ids=True or if "token_type_ids" is in self.model_input_names).
attention_mask: List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names).
overflowing_tokens: List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).
num_truncated_tokens: Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).
special_tokens_mask: List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
length: The length of the inputs (when return_length=True).
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.
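The interaction between padding strategies, pad_to_multiple_of, and the returned attention_mask can be sketched on plain lists. The function below is a simplified stand-in for illustration only, not the tokenizer's code: it assumes right-side padding with pad id 0, supports only the 'longest' and 'max_length' strategies, and does no truncation.

```python
def pad_batch(batch, pad_id=0, padding="longest", max_length=None,
              pad_to_multiple_of=None):
    """Right-pad lists of token ids and build the matching attention masks."""
    if padding == "longest":
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("padding='max_length' needs max_length in this sketch")
        target = max_length
    else:
        raise ValueError(f"unsupported padding strategy: {padding!r}")
    if pad_to_multiple_of:
        # Round the target length up to the next multiple.
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of
    input_ids, attention_mask = [], []
    for ids in batch:
        if len(ids) > target:
            raise ValueError("sequence longer than target; truncation is not sketched")
        missing = target - len(ids)
        input_ids.append(list(ids) + [pad_id] * missing)
        attention_mask.append([1] * len(ids) + [0] * missing)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_batch([[5, 6, 7], [8]], padding="longest")
print(encoded["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

With pad_to_multiple_of=4 the same batch would be padded to length 4 instead of 3, which is the behaviour the parameter description refers to.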
save_vocabulary
( save_directory: str, filename_prefix: typing.Optional[str] = None )
VitsModel( config: VitsConfig )
Parameters
forward
( input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, speaker_id: typing.Optional[int] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, labels: typing.Optional[torch.FloatTensor] = None ) → transformers.models.vits.modeling_vits.VitsModelOutput or tuple(torch.FloatTensor)
Parameters
input_ids (torch.LongTensor of shape (batch_size, sequence_length)): Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing convolution and attention on padding token indices. Mask values selected in [0, 1]:
    1 for tokens that are not masked,
    0 for tokens that are masked.
speaker_id (int, optional): Which speaker embedding to use. Only used for multispeaker models.
output_attentions (bool, optional): Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
labels (torch.FloatTensor of shape (batch_size, config.spectrogram_bins, sequence_length), optional): Float values of target spectrogram. Timesteps set to -100.0 are ignored (masked) for the loss computation.
Returns
transformers.models.vits.modeling_vits.VitsModelOutput or tuple(torch.FloatTensor)
waveform (torch.FloatTensor of shape (batch_size, sequence_length)): The final audio waveform predicted by the model.
sequence_lengths (torch.FloatTensor of shape (batch_size,)): The length in samples of each element in the waveform batch.
spectrogram (torch.FloatTensor of shape (batch_size, sequence_length, num_bins)): The log-mel spectrogram predicted at the output of the flow model. This spectrogram is passed to the HiFi-GAN decoder model to obtain the final audio waveform.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example:
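One way the waveform and sequence_lengths outputs might be combined after a batched forward pass is sketched below on plain Python lists, which stand in for the tensors converted with tolist(). It assumes that shorter items in the batch are padded at the end of the waveform; the helper name trim_waveforms is illustrative.

```python
def trim_waveforms(waveforms, sequence_lengths, sampling_rate):
    """Cut each padded waveform to its true length and report its duration."""
    trimmed = []
    for samples, length in zip(waveforms, sequence_lengths):
        length = int(length)  # lengths may arrive as floats
        trimmed.append((samples[:length], length / sampling_rate))
    return trimmed


# Stand-in for a batch of two model outputs padded to 4 samples each.
batch = [[0.1, 0.2, 0.3, 0.0], [0.5, -0.5, 0.0, 0.0]]
for samples, seconds in trim_waveforms(batch, [3.0, 2.0], sampling_rate=16000):
    print(len(samples), seconds)
```

Dividing the length in samples by the configured sampling rate gives the duration of each utterance in seconds.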
For certain languages with a non-Roman alphabet, such as Arabic, Mandarin or Hindi, the uroman perl package is required to pre-process the text inputs to the Roman alphabet.
vocab_size (int, optional, defaults to 38): Vocabulary size of the VITS model. Defines the number of different tokens that can be represented by the input_ids passed to the forward method of VitsModel.
layerdrop (float, optional, defaults to 0.1): The LayerDrop probability for the encoder. See the LayerDrop paper for more details.
This is the configuration class to store the configuration of a VitsModel. It is used to instantiate a VITS model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the VITS architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
This tokenizer inherits from PreTrainedTokenizer, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
__call__( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None, text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None, text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None, text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None, add_special_tokens: bool = True, padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None, max_length: typing.Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: typing.Optional[int] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, return_token_type_ids: typing.Optional[bool] = None, return_attention_mask: typing.Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs ) → BatchEncoding
padding (bool, str or PaddingStrategy, optional, defaults to False): Activates and controls padding. The accepted values are True or 'longest', 'max_length', and False or 'do_not_pad' (default), as described in the __call__ parameter list.
truncation (bool, str or TruncationStrategy, optional, defaults to False): Activates and controls truncation. The accepted values are True or 'longest_first', 'only_first', 'only_second', and False or 'do_not_truncate' (default), as described in the __call__ parameter list.
return_tensors (str or TensorType, optional): If set, will return tensors instead of list of python integers. The acceptable values are 'tf', 'pt' and 'np'.
return_offsets_mapping is only available on fast tokenizers inheriting from PreTrainedTokenizerFast; if using Python's tokenizer, this method will raise NotImplementedError.
__call__ returns a BatchEncoding with the fields listed under Returns.
config (VitsConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The complete VITS model, for text-to-speech synthesis. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
Indices for input_ids can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
forward returns a transformers.models.vits.modeling_vits.VitsModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VitsConfig) and inputs.
The VitsModel forward method overrides the __call__ special method.