Bark

Overview

Bark is a transformer-based text-to-speech model proposed by Suno AI in suno-ai/bark.

Bark is made of 4 main models:

  • BarkSemanticModel (also referred to as the ‘text’ model): a causal auto-regressive transformer model that takes as input tokenized text, and predicts semantic text tokens that capture the meaning of the text.

  • BarkCoarseModel (also referred to as the ‘coarse acoustics’ model): a causal autoregressive transformer that takes as input the results of the BarkSemanticModel. It aims at predicting the first two audio codebooks necessary for EnCodec.

  • BarkFineModel (the ‘fine acoustics’ model), this time a non-causal autoencoder transformer, which iteratively predicts the last codebooks based on the sum of the previous codebooks’ embeddings.

  • having predicted all the codebook channels from the EncodecModel, Bark uses it to decode the output audio array.

It should be noted that each of the first three modules can support conditional speaker embeddings to condition the output sound according to a specific predefined voice.

This model was contributed by Yoach Lacombe (ylacombe) and Sanchit Gandhi (sanchit-gandhi). The original code can be found at suno-ai/bark.

Optimizing Bark

Bark can be optimized with just a few extra lines of code, which significantly reduces its memory footprint and accelerates inference.

Using half-precision

You can speed up inference and reduce memory footprint by 50% simply by loading the model in half-precision.


from transformers import BarkModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)

Using 🌍 Better Transformer

Better Transformer is an 🌍 Optimum feature that performs kernel fusion under the hood. You can gain 20% to 30% in speed with zero performance degradation. It only requires one line of code to export the model to 🌍 Better Transformer:


model = model.to_bettertransformer()

Note that 🌍 Optimum must be installed before using this feature.

Using CPU offload

As mentioned above, Bark is made up of 4 sub-models, which are called up sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.

If you’re using a CUDA device, a simple solution to benefit from an 80% reduction in memory footprint is to offload the sub-models from the GPU to the CPU when they’re idle. This operation is called CPU offloading. You can use it with one line of code:


model.enable_cpu_offload()

Note that 🌍 Accelerate must be installed before using this feature.

Combining optimization techniques

You can combine optimization techniques, and use CPU offload, half-precision and 🌍 Better Transformer all at once.


from transformers import BarkModel
from optimum.bettertransformer import BetterTransformer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# load in fp16
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)

# convert to bettertransformer
model = BetterTransformer.transform(model, keep_original_model=False)

# enable CPU offload
model.enable_cpu_offload()

You can find out more about inference optimization techniques in the performance and scalability section of the documentation.

Tips

Suno offers a library of voice presets in a number of languages. These presets are also uploaded on the Hub.

>>> from transformers import AutoProcessor, BarkModel

>>> processor = AutoProcessor.from_pretrained("suno/bark")
>>> model = BarkModel.from_pretrained("suno/bark")

>>> voice_preset = "v2/en_speaker_6"

>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()

Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects.


>>> # Multilingual speech - simplified Chinese
>>> inputs = processor("惊人的!我会说中文")

>>> # Multilingual speech - French - let's use a voice_preset as well
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")

>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
>>> inputs = processor("♪ Hello, my dog is cute ♪")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
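
Bark can also process several prompts in one call. Below is a minimal sketch, assuming the processor pads the batch and that each generation runs for as long as the longest prompt in the batch:

>>> # hypothetical batched example: pass a list of prompts in a single call
>>> inputs = processor(
...     ["Hello, my dog is cute", "Incroyable! Je peux générer du son."],
...     voice_preset="v2/en_speaker_6",
... )

>>> audio_arrays = model.generate(**inputs)
>>> audio_arrays = audio_arrays.cpu().numpy()  # one waveform per prompt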

The model can also produce nonverbal communications like laughing, sighing and crying.


>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()

To save the audio, simply take the sample rate from the model config and use a scipy utility:


>>> from scipy.io.wavfile import write as write_wav

>>> # save audio to disk, but first take the sample rate from the model config
>>> sample_rate = model.generation_config.sample_rate
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
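
If you are working in a notebook, you can also listen to the generated audio directly instead of writing it to disk. A minimal sketch, assuming IPython is available in your environment:

>>> from IPython.display import Audio

>>> # play the generated waveform inline at the model's sample rate
>>> Audio(audio_array, rate=sample_rate)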

BarkConfig

class transformers.BarkConfig

( semantic_config: typing.Dict = None, coarse_acoustics_config: typing.Dict = None, fine_acoustics_config: typing.Dict = None, codec_config: typing.Dict = None, initializer_range = 0.02, **kwargs )

Parameters

  • semantic_config (BarkSemanticConfig, optional) — Configuration of the underlying semantic sub-model.

  • coarse_acoustics_config (BarkCoarseConfig, optional) — Configuration of the underlying coarse acoustics sub-model.

  • fine_acoustics_config (BarkFineConfig, optional) — Configuration of the underlying fine acoustics sub-model.

  • codec_config (AutoConfig, optional) — Configuration of the underlying codec sub-model.

This is the configuration class to store the configuration of a BarkModel. It is used to instantiate a Bark model according to the specified sub-model configurations, defining the model architecture.

Instantiating a configuration with the defaults will yield a configuration similar to that of the Bark suno/bark architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

from_sub_model_configs

( semantic_config: BarkSemanticConfig, coarse_acoustics_config: BarkCoarseConfig, fine_acoustics_config: BarkFineConfig, codec_config: PretrainedConfig, **kwargs ) → BarkConfig

Returns

BarkConfig

An instance of a configuration object
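
Instantiate a BarkConfig (or a derived class) from Bark sub-model configurations.

As an illustration, here is a short sketch of assembling a BarkConfig from default sub-model configurations; it assumes EncodecConfig is used as the codec configuration:

>>> from transformers import (
...     BarkConfig,
...     BarkSemanticConfig,
...     BarkCoarseConfig,
...     BarkFineConfig,
...     EncodecConfig,
... )

>>> # default sub-model configurations
>>> semantic_config = BarkSemanticConfig()
>>> coarse_config = BarkCoarseConfig()
>>> fine_config = BarkFineConfig()
>>> codec_config = EncodecConfig()

>>> # combine them into a full Bark configuration
>>> configuration = BarkConfig.from_sub_model_configs(
...     semantic_config, coarse_config, fine_config, codec_config
... )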

BarkProcessor

class transformers.BarkProcessor

( tokenizer, speaker_embeddings = None )

Parameters

  • tokenizer (PreTrainedTokenizer) — An instance of PreTrainedTokenizer.

  • speaker_embeddings (Dict[Dict[str]], optional, defaults to None) — Optional nested speaker embeddings dictionary. The first level contains voice preset names (e.g "en_speaker_4"). The second level contains "semantic_prompt", "coarse_prompt" and "fine_prompt" embeddings. The values correspond to the path of the corresponding np.ndarray.

Constructs a Bark processor which wraps a text tokenizer and optional Bark voice presets into a single processor.

__call__

( text = None, voice_preset = None, return_tensors = 'pt', max_length = 256, add_special_tokens = False, return_attention_mask = True, return_token_type_ids = False, **kwargs ) → Tuple(BatchEncoding, BatchFeature)

Parameters

  • text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

  • voice_preset (str, Dict[np.ndarray]) — The voice preset, i.e. the speaker embeddings. It can either be a valid voice_preset name, e.g "en_speaker_1", or directly a dictionary of np.ndarray embeddings for each submodel of Bark. Or it can be a valid file name of a local .npz single voice preset.

  • return_tensors (str or TensorType, optional) — If set, will return tensors of a particular framework. Acceptable values are:

    • 'pt': Return PyTorch torch.Tensor objects.

    • 'np': Return NumPy np.ndarray objects.

Returns

Tuple(BatchEncoding, BatchFeature)

A tuple composed of a BatchEncoding, i.e. the output of the tokenizer, and a BatchFeature, i.e. the voice preset with the right tensors type.

Main method to prepare for the model one or several sequence(s). This method forwards the text and kwargs arguments to the AutoTokenizer’s __call__() to encode the text. The method also proposes a voice preset which is a dictionary of arrays that conditions Bark’s output. kwargs arguments are forwarded to the tokenizer and to the cached_file method if voice_preset is a valid filename.

from_pretrained

( pretrained_processor_name_or_path, speaker_embeddings_dict_path = 'speaker_embeddings_path.json', **kwargs )

Parameters

  • pretrained_model_name_or_path (str or os.PathLike) — This can be either:

    • a string, the model id of a pretrained BarkProcessor hosted inside a model repo on boincai.com. Valid model ids can be located at the root-level, like bert-base-uncased, or namespaced under a user or organization name, like dbmdz/bert-base-german-cased.

    • a path to a directory containing a processor saved using the save_pretrained() method, e.g., ./my_model_directory/.

  • speaker_embeddings_dict_path (str, optional, defaults to "speaker_embeddings_path.json") — The name of the .json file containing the speaker_embeddings dictionary located in pretrained_model_name_or_path. If None, no speaker_embeddings are loaded. **kwargs — Additional keyword arguments passed along to ~tokenization_utils_base.PreTrainedTokenizer.from_pretrained.

Instantiate a Bark processor associated with a pretrained model.

save_pretrained

( save_directory, speaker_embeddings_dict_path = 'speaker_embeddings_path.json', speaker_embeddings_directory = 'speaker_embeddings', push_to_hub: bool = False, **kwargs )

Parameters

  • save_directory (str or os.PathLike) — Directory where the tokenizer files and the speaker embeddings will be saved (directory will be created if it does not exist).

  • speaker_embeddings_dict_path (str, optional, defaults to "speaker_embeddings_path.json") — The name of the .json file that will contain the speaker_embeddings nested path dictionary, if it exists, and that will be located in pretrained_model_name_or_path/speaker_embeddings_directory.

  • speaker_embeddings_directory (str, optional, defaults to "speaker_embeddings/") — The name of the folder in which the speaker_embeddings arrays will be saved.

  • push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the BOINC AI model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace). kwargs — Additional keyword arguments passed along to the push_to_hub() method.

Saves the attributes of this processor (tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.
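
For illustration, a minimal sketch of saving a processor locally and reloading it, assuming the default speaker-embeddings file names are kept:

>>> from transformers import BarkProcessor

>>> # download the processor (tokenizer + voice presets) from the Hub
>>> processor = BarkProcessor.from_pretrained("suno/bark-small")

>>> # write it to a local directory, then reload it from that directory
>>> processor.save_pretrained("bark_processor")
>>> processor = BarkProcessor.from_pretrained("bark_processor")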

BarkModel

class transformers.BarkModel

( config )

Parameters

  • config (BarkConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The full Bark model, a text-to-speech model composed of 4 sub-models:

  • BarkSemanticModel (also referred to as the ‘text’ model): a causal auto-regressive transformer model that takes as input tokenized text, and predicts semantic text tokens that capture the meaning of the text.

  • BarkCoarseModel (also referred to as the ‘coarse acoustics’ model): a causal autoregressive transformer that takes as input the results of the BarkSemanticModel. It aims at predicting the first two audio codebooks necessary for EnCodec.

  • BarkFineModel (the ‘fine acoustics’ model), this time a non-causal autoencoder transformer, which iteratively predicts the last codebooks based on the sum of the previous codebooks’ embeddings.

  • having predicted all the codebook channels from the EncodecModel, Bark uses it to decode the output audio array.

It should be noted that each of the first three modules can support conditional speaker embeddings to condition the output sound according to a specific predefined voice.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

generate

( input_ids: typing.Optional[torch.Tensor] = None, history_prompt: typing.Union[typing.Dict[str, torch.Tensor], NoneType] = None, **kwargs ) → torch.LongTensor

Parameters

  • input_ids (Optional[torch.Tensor] of shape (batch_size, seq_len), optional) — Input ids. Will be truncated up to 256 tokens. Note that the output audios will be as long as the longest generation among the batch.

  • history_prompt (Optional[Dict[str,torch.Tensor]], optional) — Optional Bark speaker prompt. Note that for now, this model takes only one speaker prompt per batch.

  • kwargs (optional) — Remaining dictionary of keyword arguments. Keyword arguments are of two types:

    • Without a prefix, they will be entered as **kwargs for the generate method of each sub-model.

    • With a semantic_, coarse_, or fine_ prefix, they will be passed to the generate method of the semantic, coarse, and fine sub-models respectively. They take priority over the keywords without a prefix.

    This means you can, for example, specify a generation strategy for all sub-models except one.

Returns

torch.LongTensor

Output generated audio.

Generates audio from an input prompt and an additional optional Bark speaker prompt.

Example:

Copied

>>> from transformers import AutoProcessor, BarkModel

>>> processor = AutoProcessor.from_pretrained("suno/bark-small")
>>> model = BarkModel.from_pretrained("suno/bark-small")

>>> # To add a voice preset, you can pass `voice_preset` to `BarkProcessor.__call__(...)`
>>> voice_preset = "v2/en_speaker_6"

>>> inputs = processor("Hello, my dog is cute, I need him in my life", voice_preset=voice_preset)

>>> audio_array = model.generate(**inputs, semantic_max_new_tokens=100)
>>> audio_array = audio_array.cpu().numpy().squeeze()
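
Building on the prefixed keyword arguments described above, this hypothetical sketch combines a global generation setting with a sub-model-specific override; it assumes temperature is accepted by the sub-models' generation configs:

>>> # `temperature` is forwarded to every sub-model's generate method,
>>> # while `coarse_temperature` takes priority for the coarse sub-model only
>>> audio_array = model.generate(**inputs, temperature=0.7, coarse_temperature=0.9)
>>> audio_array = audio_array.cpu().numpy().squeeze()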

enable_cpu_offload

( gpu_id: typing.Optional[int] = 0 )

Parameters

  • gpu_id (int, optional, defaults to 0) — GPU id on which the sub-models will be loaded and offloaded.

Offloads all sub-models to CPU using accelerate, reducing memory usage with a low impact on performance. This method moves one whole sub-model at a time to the GPU when it is used, and the sub-model remains on the GPU until the next sub-model runs.

BarkSemanticModel

class transformers.BarkSemanticModel

( config )

Parameters

  • config (BarkSemanticConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Bark semantic (or text) model. It shares the same architecture as the coarse model. It is a GPT-2 like autoregressive model with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: typing.Optional[torch.Tensor] = None, past_key_values: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.LongTensor] = None, input_embeds: typing.Optional[torch.Tensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

  • head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • input_embeds (torch.FloatTensor of shape (batch_size, input_sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. Here, due to Bark particularities, if past_key_values is used, input_embeds will be ignored and you have to use input_ids. If past_key_values is not used and use_cache is set to True, input_embeds is used in priority instead of input_ids.

  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

The BarkCausalModel forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

BarkCoarseModel

class transformers.BarkCoarseModel

( config )

Parameters

  • config (BarkCoarseConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Bark coarse acoustics model. It shares the same architecture as the semantic (or text) model. It is a GPT-2 like autoregressive model with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: typing.Optional[torch.Tensor] = None, past_key_values: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.LongTensor] = None, input_embeds: typing.Optional[torch.Tensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

  • head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • input_embeds (torch.FloatTensor of shape (batch_size, input_sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. Here, due to Bark particularities, if past_key_values is used, input_embeds will be ignored and you have to use input_ids. If past_key_values is not used and use_cache is set to True, input_embeds is used in priority instead of input_ids.

  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

The BarkCausalModel forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

BarkFineModel

class transformers.BarkFineModel

( config )

Parameters

  • config (BarkFineConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Bark fine acoustics model. It is a non-causal GPT-like model with config.n_codes_total embedding layers and language modeling heads, one for each codebook. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( codebook_idx: int, input_ids: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.LongTensor] = None, input_embeds: typing.Optional[torch.Tensor] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None )

Parameters

  • codebook_idx (int) — Index of the codebook that will be predicted.

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length, number_of_codebooks)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Initially, indices of the first two codebooks are obtained from the coarse sub-model. The rest is predicted recursively by attending the previously predicted channels. The model predicts on windows of length 1024.

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

  • head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — NOT IMPLEMENTED YET.

  • input_embeds (torch.FloatTensor of shape (batch_size, input_sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last input_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

The BarkFineModel forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

BarkCausalModel

class transformers.BarkCausalModel

( config )

forward

( input_ids: typing.Optional[torch.Tensor] = None, past_key_values: typing.Optional[typing.Tuple[torch.FloatTensor]] = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.Tensor] = None, head_mask: typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.LongTensor] = None, input_embeds: typing.Optional[torch.Tensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None )

Parameters

  • input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

    If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).

  • attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

  • position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

  • head_mask (torch.Tensor of shape (encoder_layers, encoder_attention_heads), optional) — Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • input_embeds (torch.FloatTensor of shape (batch_size, input_sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. Here, due to Bark particularities, if past_key_values is used, input_embeds will be ignored and you have to use input_ids. If past_key_values is not used and use_cache is set to True, input_embeds is used in priority instead of input_ids.

  • use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

The BarkCausalModel forward method overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

BarkCoarseConfig

class transformers.BarkCoarseConfig

( block_size = 1024, input_vocab_size = 10048, output_vocab_size = 10048, num_layers = 12, num_heads = 12, hidden_size = 768, dropout = 0.0, bias = True, initializer_range = 0.02, use_cache = True, **kwargs )

Parameters

  • block_size (int, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

  • input_vocab_size (int, optional, defaults to 10_048) — Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BarkCoarseModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.

  • output_vocab_size (int, optional, defaults to 10_048) — Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the output_ids when passing forward a BarkCoarseModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.

  • num_layers (int, optional, defaults to 12) — Number of hidden layers in the given sub-model.

  • num_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer architecture.

  • hidden_size (int, optional, defaults to 768) — Dimensionality of the “intermediate” (often named feed-forward) layer in the architecture.

  • dropout (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • bias (bool, optional, defaults to True) — Whether or not to use bias in the linear layers and layer norm layers.

  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).

This is the configuration class to store the configuration of a BarkCoarseModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:


>>> from transformers import BarkCoarseConfig, BarkCoarseModel

>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkCoarseConfig()

>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkCoarseModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

BarkFineConfig

class transformers.BarkFineConfig

( tie_word_embeddings = True, n_codes_total = 8, n_codes_given = 1, **kwargs )

Parameters

  • block_size (int, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

  • input_vocab_size (int, optional, defaults to 10_048) — Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BarkFineModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.

  • output_vocab_size (int, optional, defaults to 10_048) — Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the output_ids when passing forward a BarkFineModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.

  • num_layers (int, optional, defaults to 12) — Number of hidden layers in the given sub-model.

  • num_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer architecture.

  • hidden_size (int, optional, defaults to 768) — Dimensionality of the “intermediate” (often named feed-forward) layer in the architecture.

  • dropout (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • bias (bool, optional, defaults to True) — Whether or not to use bias in the linear layers and layer norm layers.

  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).

  • n_codes_total (int, optional, defaults to 8) — The total number of audio codebooks predicted. Used in the fine acoustics sub-model.

  • n_codes_given (int, optional, defaults to 1) — The number of audio codebooks predicted in the coarse acoustics sub-model. Used in the acoustics sub-models.

This is the configuration class to store the configuration of a BarkFineModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:


>>> from transformers import BarkFineConfig, BarkFineModel

>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkFineConfig()

>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkFineModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

BarkSemanticConfig

class transformers.BarkSemanticConfig

( block_size = 1024, input_vocab_size = 10048, output_vocab_size = 10048, num_layers = 12, num_heads = 12, hidden_size = 768, dropout = 0.0, bias = True, initializer_range = 0.02, use_cache = True, **kwargs )

Parameters

  • block_size (int, optional, defaults to 1024) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

  • input_vocab_size (int, optional, defaults to 10_048) — Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BarkSemanticModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.

  • output_vocab_size (int, optional, defaults to 10_048) — Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the output_ids when passing forward a BarkSemanticModel. Defaults to 10_048 but should be chosen carefully with regard to the chosen sub-model.

  • num_layers (int, optional, defaults to 12) — Number of hidden layers in the given sub-model.

  • num_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer architecture.

  • hidden_size (int, optional, defaults to 768) — Dimensionality of the “intermediate” (often named feed-forward) layer in the architecture.

  • dropout (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

  • bias (bool, optional, defaults to True) — Whether or not to use bias in the linear layers and layer norm layers.

  • initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

  • use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).

This is the configuration class to store the configuration of a BarkSemanticModel. It is used to instantiate the model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark suno/bark architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:


>>> from transformers import BarkSemanticConfig, BarkSemanticModel

>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkSemanticConfig()

>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkSemanticModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

