EnCodec

Overview

The EnCodec neural codec model was proposed in High Fidelity Neural Audio Compression by Alexandre Défossez, Jade Copet, Gabriel Synnaeve and Yossi Adi.

The abstract from the paper is the following:

We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio.

This model was contributed by Matthijs, Patrick Von Platen and Arthur Zucker. The original code can be found here. Here is a quick example of how to encode and decode audio using this model:


>>> from datasets import load_dataset, Audio
>>> from transformers import EncodecModel, AutoProcessor
>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

>>> model = EncodecModel.from_pretrained("facebook/encodec_24khz")
>>> processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
>>> librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
>>> audio_sample = librispeech_dummy[-1]["audio"]["array"]
>>> inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")

>>> encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
>>> audio_values = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])[0]
>>> # or the equivalent with a forward pass
>>> audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values

EncodecConfig

class transformers.EncodecConfig

( target_bandwidths = [1.5, 3.0, 6.0, 12.0, 24.0], sampling_rate = 24000, audio_channels = 1, normalize = False, chunk_length_s = None, overlap = None, hidden_size = 128, num_filters = 32, num_residual_layers = 1, upsampling_ratios = [8, 5, 4, 2], norm_type = 'weight_norm', kernel_size = 7, last_kernel_size = 7, residual_kernel_size = 3, dilation_growth_rate = 2, use_causal_conv = True, pad_mode = 'reflect', compress = 2, num_lstm_layers = 2, trim_right_ratio = 1.0, codebook_size = 1024, codebook_dim = None, use_conv_shortcut = True, **kwargs )

Parameters

  • target_bandwidths (List[float], optional, defaults to [1.5, 3.0, 6.0, 12.0, 24.0]) — The range of different bandwidths the model can encode audio with.

  • sampling_rate (int, optional, defaults to 24000) — The sampling rate at which the audio waveform should be digitized, expressed in hertz (Hz).

  • audio_channels (int, optional, defaults to 1) — Number of channels in the audio data. Either 1 for mono or 2 for stereo.

  • normalize (bool, optional, defaults to False) — Whether the audio should be normalized when passed.

  • chunk_length_s (float, optional) — If defined, the audio is pre-processed into chunks of length chunk_length_s and then encoded.

  • overlap (float, optional) — Defines the overlap between consecutive chunks. It is used to compute the chunk_stride with the following formula: int((1.0 - self.overlap) * self.chunk_length) (see the sketch after the example below).

  • hidden_size (int, optional, defaults to 128) — Intermediate representation dimension.

  • num_filters (int, optional, defaults to 32) — Number of convolution kernels in the first EncodecConv1d downsampling layer.

  • num_residual_layers (int, optional, defaults to 1) — Number of residual layers.

  • upsampling_ratios (Sequence[int], optional, defaults to [8, 5, 4, 2]) — Kernel size and stride ratios. The encoder uses downsampling ratios instead of upsampling ratios, hence it will use the ratios in the reverse order to the ones specified here, which must match the decoder order.

  • norm_type (str, optional, defaults to "weight_norm") — Normalization method. Should be one of ["weight_norm", "time_group_norm"].

  • kernel_size (int, optional, defaults to 7) — Kernel size for the initial convolution.

  • last_kernel_size (int, optional, defaults to 7) — Kernel size for the last convolution layer.

  • residual_kernel_size (int, optional, defaults to 3) — Kernel size for the residual layers.

  • dilation_growth_rate (int, optional, defaults to 2) — How much to increase the dilation with each layer.

  • use_causal_conv (bool, optional, defaults to True) — Whether to use fully causal convolution.

  • pad_mode (str, optional, defaults to "reflect") — Padding mode for the convolutions.

  • compress (int, optional, defaults to 2) — Reduced dimensionality in residual branches (from Demucs v3).

  • num_lstm_layers (int, optional, defaults to 2) — Number of LSTM layers at the end of the encoder.

  • trim_right_ratio (float, optional, defaults to 1.0) — Ratio for trimming at the right of the transposed convolution under the use_causal_conv = True setup. If equal to 1.0, it means that all the trimming is done at the right.

  • codebook_size (int, optional, defaults to 1024) — Number of discrete codes that make up the VQVAE.

  • codebook_dim (int, optional) — Dimension of the codebook vectors. If not defined, uses hidden_size.

  • use_conv_shortcut (bool, optional, defaults to True) — Whether to use a convolutional layer as the 'skip' connection in the EncodecResnetBlock block. If False, an identity function will be used, giving a generic residual connection.

This is the configuration class to store the configuration of an EncodecModel. It is used to instantiate an Encodec model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the facebook/encodec_24khz architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:


>>> from transformers import EncodecModel, EncodecConfig

>>> # Initializing a "facebook/encodec_24khz" style configuration
>>> configuration = EncodecConfig()

>>> # Initializing a model (with random weights) from the "facebook/encodec_24khz" style configuration
>>> model = EncodecModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
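
The chunking behaviour implied by chunk_length_s and overlap can be worked out with plain arithmetic. Below is a minimal sketch using illustrative values (the defaults leave chunking disabled, i.e. chunk_length_s=None); only documented configuration attributes are used:

>>> from transformers import EncodecConfig

>>> # hypothetical chunking settings, only for illustration
>>> config = EncodecConfig(chunk_length_s=1.0, overlap=0.01, sampling_rate=24000)

>>> # chunk length in samples and the stride between consecutive chunks,
>>> # following the formula given for the `overlap` parameter above
>>> chunk_length = int(config.chunk_length_s * config.sampling_rate)
>>> chunk_stride = int((1.0 - config.overlap) * chunk_length)
>>> chunk_length, chunk_stride
(24000, 23760)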

EncodecFeatureExtractor

class transformers.EncodecFeatureExtractor

( feature_size: int = 1, sampling_rate: int = 24000, padding_value: float = 0.0, chunk_length_s: float = None, overlap: float = None, **kwargs )

Parameters

  • feature_size (int, optional, defaults to 1) — The feature dimension of the extracted features. Use 1 for mono, 2 for stereo.

  • sampling_rate (int, optional, defaults to 24000) — The sampling rate at which the audio waveform should be digitized, expressed in hertz (Hz).

  • padding_value (float, optional, defaults to 0.0) — The value that is used to fill the padding values.

  • chunk_length_s (float, optional) — If defined, the audio is pre-processed into chunks of length chunk_length_s and then encoded.

  • overlap (float, optional) — Defines the overlap between consecutive chunks. It is used to compute the chunk_stride with the following formula: int((1.0 - self.overlap) * self.chunk_length).

Constructs an EnCodec feature extractor.

This feature extractor inherits from SequenceFeatureExtractor, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Instantiating a feature extractor with the defaults will yield a similar configuration to that of the facebook/encodec_24khz architecture.

__call__

( raw_audio: typing.Union[numpy.ndarray, typing.List[float], typing.List[numpy.ndarray], typing.List[typing.List[float]]], padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy, NoneType] = None, truncation: typing.Optional[bool] = False, max_length: typing.Optional[int] = None, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, sampling_rate: typing.Optional[int] = None )

Parameters

  • raw_audio (np.ndarray, List[float], List[np.ndarray], List[List[float]]) — The sequence or batch of sequences to be processed. Each sequence can be a numpy array, a list of float values, a list of numpy arrays or a list of lists of float values. The numpy array must be of shape (num_samples,) for mono audio (feature_size = 1), or (2, num_samples) for stereo audio (feature_size = 2).

  • padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).

    • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.

    • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).

  • truncation (bool, optional, defaults to False) — Activates truncation to cut input sequences longer than max_length to max_length.

  • max_length (int, optional) — Maximum length of the returned list and optionally padding length (see above).

  • return_tensors (str or TensorType, optional) — If set, will return tensors instead of a list of Python integers. Acceptable values are:

    • 'tf': Return TensorFlow tf.constant objects.

    • 'pt': Return PyTorch torch.Tensor objects.

    • 'np': Return Numpy np.ndarray objects.

  • sampling_rate (int, optional) — The sampling rate at which the audio input was sampled. It is strongly recommended to pass sampling_rate at the forward call to prevent silent errors.

Main method to featurize and prepare for the model one or several sequence(s).
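
As a rough sketch of a typical call (the waveform below is a silent placeholder; the checkpoint name matches the Overview example):

>>> import numpy as np
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/encodec_24khz")

>>> # one second of silence as a stand-in for real mono audio
>>> raw_audio = np.zeros(feature_extractor.sampling_rate, dtype=np.float32)

>>> inputs = feature_extractor(
...     raw_audio=raw_audio,
...     sampling_rate=feature_extractor.sampling_rate,
...     return_tensors="pt",
... )
>>> # `input_values` is shaped (batch_size, channels, num_samples); `padding_mask`
>>> # marks which positions are real audio rather than padding
>>> inputs["input_values"].shape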

EncodecModel

class transformers.EncodecModel

( config: EncodecConfig )

Parameters

  • config (EncodecConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The EnCodec neural audio codec model. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

decode

( audio_codes: Tensor, audio_scales: Tensor, padding_mask: typing.Optional[torch.Tensor] = None, return_dict: typing.Optional[bool] = None )

Parameters

  • audio_codes (torch.FloatTensor of shape (batch_size, nb_chunks, chunk_length), optional) — Discrete code embeddings computed using model.encode.

  • audio_scales (torch.Tensor of shape (batch_size, nb_chunks), optional) — Scaling factor for each audio_codes input.

  • padding_mask (torch.Tensor of shape (batch_size, channels, sequence_length)) — Padding mask used to pad the input_values.

  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Decodes the given frames into an output audio waveform.

Note that the output might be a bit bigger than the input. In that case, any extra steps at the end can be trimmed.
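
For example, reusing model, inputs and encoder_outputs from the quick example in the Overview section, a sketch of trimming the decoded waveform back to the original input length:

>>> audio_values = model.decode(
...     encoder_outputs.audio_codes,
...     encoder_outputs.audio_scales,
...     inputs["padding_mask"],
... )[0]

>>> # the decoder can return a few extra samples at the end; trim to the input length
>>> original_length = inputs["input_values"].shape[-1]
>>> audio_values = audio_values[..., :original_length]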

encode

( input_values: Tensor, padding_mask: Tensor = None, bandwidth: typing.Optional[float] = None, return_dict: typing.Optional[bool] = None )

Parameters

  • input_values (torch.Tensor of shape (batch_size, channels, sequence_length)) — Float values of the input audio waveform.

  • padding_mask (torch.Tensor of shape (batch_size, channels, sequence_length)) — Padding mask used to pad the input_values.

  • bandwidth (float, optional) — The target bandwidth. Must be one of config.target_bandwidths. If None, uses the smallest possible bandwidth. The bandwidth is expressed in kbps, e.g. a 6 kbps bandwidth is passed as bandwidth == 6.0.

  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Encodes the input audio waveform into discrete codes.
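
Continuing with model and inputs from the Overview example, a sketch of requesting a specific bandwidth (the value must appear in config.target_bandwidths; 6.0 here stands for 6 kbps):

>>> encoder_outputs = model.encode(
...     inputs["input_values"],
...     inputs["padding_mask"],
...     bandwidth=6.0,
... )

>>> # discrete codes and the per-chunk scaling factors used for decoding
>>> audio_codes = encoder_outputs.audio_codes
>>> audio_scales = encoder_outputs.audio_scales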

forward

( input_values: Tensor, padding_mask: typing.Optional[torch.Tensor] = None, bandwidth: typing.Optional[float] = None, audio_codes: typing.Optional[torch.Tensor] = None, audio_scales: typing.Optional[torch.Tensor] = None, return_dict: typing.Optional[bool] = None ) → transformers.models.encodec.modeling_encodec.EncodecOutput or tuple(torch.FloatTensor)

Parameters

  • input_values (torch.FloatTensor of shape (batch_size, channels, sequence_length), optional) — Raw audio input converted to float and padded to the appropriate length in order to be encoded using chunks of length self.chunk_length and a stride of config.chunk_stride.

  • padding_mask (torch.BoolTensor of shape (batch_size, channels, sequence_length), optional) — Mask to avoid computing scaling factors on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

    • 0 for tokens that are masked.

    padding_mask should always be passed, unless the input was truncated or not padded. This is because in order to process tensors effectively, the input audio should be padded so that input_length % stride = step with step = chunk_length - stride. This ensures that all chunks are of the same shape.

  • bandwidth (float, optional) — The target bandwidth. Must be one of config.target_bandwidths. If None, uses the smallest possible bandwidth. The bandwidth is expressed in kbps, e.g. a 6 kbps bandwidth is passed as bandwidth == 6.0.

  • audio_codes (torch.FloatTensor of shape (batch_size, nb_chunks, chunk_length), optional) — Discrete code embeddings computed using model.encode.

  • audio_scales (torch.Tensor of shape (batch_size, nb_chunks), optional) — Scaling factor for each audio_codes input.

Returns

transformers.models.encodec.modeling_encodec.EncodecOutput or tuple(torch.FloatTensor)

A transformers.models.encodec.modeling_encodec.EncodecOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (EncodecConfig) and inputs.

  • audio_codes (torch.FloatTensor of shape (batch_size, nb_chunks, chunk_length), optional) — Discrete code embeddings computed using model.encode.

  • audio_values (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Decoded audio values, obtained using the decoder part of Encodec.

The EncodecModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Examples:


>>> from datasets import load_dataset
>>> from transformers import AutoProcessor, EncodecModel

>>> dataset = load_dataset("ashraq/esc50")
>>> audio_sample = dataset["train"]["audio"][0]["array"]

>>> model_id = "facebook/encodec_24khz"
>>> model = EncodecModel.from_pretrained(model_id)
>>> processor = AutoProcessor.from_pretrained(model_id)

>>> inputs = processor(raw_audio=audio_sample, return_tensors="pt")

>>> outputs = model(**inputs)
>>> audio_codes = outputs.audio_codes
>>> audio_values = outputs.audio_values

