Transformers
  • 🌍GET STARTED
    • Transformers
    • Quick tour
    • Installation
  • 🌍TUTORIALS
    • Run inference with pipelines
    • Write portable code with AutoClass
    • Preprocess data
    • Fine-tune a pretrained model
    • Train with a script
    • Set up distributed training with BOINC AI Accelerate
    • Load and train adapters with BOINC AI PEFT
    • Share your model
    • Agents
    • Generation with LLMs
  • 🌍TASK GUIDES
    • 🌍NATURAL LANGUAGE PROCESSING
      • Text classification
      • Token classification
      • Question answering
      • Causal language modeling
      • Masked language modeling
      • Translation
      • Summarization
      • Multiple choice
    • 🌍AUDIO
      • Audio classification
      • Automatic speech recognition
    • 🌍COMPUTER VISION
      • Image classification
      • Semantic segmentation
      • Video classification
      • Object detection
      • Zero-shot object detection
      • Zero-shot image classification
      • Depth estimation
    • 🌍MULTIMODAL
      • Image captioning
      • Document Question Answering
      • Visual Question Answering
      • Text to speech
    • 🌍GENERATION
      • Customize the generation strategy
    • 🌍PROMPTING
      • Image tasks with IDEFICS
  • 🌍DEVELOPER GUIDES
    • Use fast tokenizers from BOINC AI Tokenizers
    • Run inference with multilingual models
    • Use model-specific APIs
    • Share a custom model
    • Templates for chat models
    • Run training on Amazon SageMaker
    • Export to ONNX
    • Export to TFLite
    • Export to TorchScript
    • Benchmarks
    • Notebooks with examples
    • Community resources
    • Custom Tools and Prompts
    • Troubleshoot
  • 🌍PERFORMANCE AND SCALABILITY
    • Overview
    • 🌍EFFICIENT TRAINING TECHNIQUES
      • Methods and tools for efficient training on a single GPU
      • Multiple GPUs and parallelism
      • Efficient training on CPU
      • Distributed CPU training
      • Training on TPUs
      • Training on TPU with TensorFlow
      • Training on Specialized Hardware
      • Custom hardware for training
      • Hyperparameter Search using Trainer API
    • 🌍OPTIMIZING INFERENCE
      • Inference on CPU
      • Inference on one GPU
      • Inference on many GPUs
      • Inference on Specialized Hardware
    • Instantiating a big model
    • Troubleshooting
    • XLA Integration for TensorFlow Models
    • Optimize inference using `torch.compile()`
  • 🌍CONTRIBUTE
    • How to contribute to transformers?
    • How to add a model to BOINC AI Transformers?
    • How to convert a BOINC AI Transformers model to TensorFlow?
    • How to add a pipeline to BOINC AI Transformers?
    • Testing
    • Checks on a Pull Request
  • 🌍CONCEPTUAL GUIDES
    • Philosophy
    • Glossary
    • What BOINC AI Transformers can do
    • How BOINC AI Transformers solve tasks
    • The Transformer model family
    • Summary of the tokenizers
    • Attention mechanisms
    • Padding and truncation
    • BERTology
    • Perplexity of fixed-length models
    • Pipelines for webserver inference
    • Model training anatomy
  • 🌍API
    • 🌍MAIN CLASSES
      • Agents and Tools
      • 🌍Auto Classes
        • Extending the Auto Classes
        • AutoConfig
        • AutoTokenizer
        • AutoFeatureExtractor
        • AutoImageProcessor
        • AutoProcessor
        • Generic model classes
          • AutoModel
          • TFAutoModel
          • FlaxAutoModel
        • Generic pretraining classes
          • AutoModelForPreTraining
          • TFAutoModelForPreTraining
          • FlaxAutoModelForPreTraining
        • Natural Language Processing
          • AutoModelForCausalLM
          • TFAutoModelForCausalLM
          • FlaxAutoModelForCausalLM
          • AutoModelForMaskedLM
          • TFAutoModelForMaskedLM
          • FlaxAutoModelForMaskedLM
          • AutoModelForMaskGenerationge
          • TFAutoModelForMaskGeneration
          • AutoModelForSeq2SeqLM
          • TFAutoModelForSeq2SeqLM
          • FlaxAutoModelForSeq2SeqLM
          • AutoModelForSequenceClassification
          • TFAutoModelForSequenceClassification
          • FlaxAutoModelForSequenceClassification
          • AutoModelForMultipleChoice
          • TFAutoModelForMultipleChoice
          • FlaxAutoModelForMultipleChoice
          • AutoModelForNextSentencePrediction
          • TFAutoModelForNextSentencePrediction
          • FlaxAutoModelForNextSentencePrediction
          • AutoModelForTokenClassification
          • TFAutoModelForTokenClassification
          • FlaxAutoModelForTokenClassification
          • AutoModelForQuestionAnswering
          • TFAutoModelForQuestionAnswering
          • FlaxAutoModelForQuestionAnswering
          • AutoModelForTextEncoding
          • TFAutoModelForTextEncoding
        • Computer vision
          • AutoModelForDepthEstimation
          • AutoModelForImageClassification
          • TFAutoModelForImageClassification
          • FlaxAutoModelForImageClassification
          • AutoModelForVideoClassification
          • AutoModelForMaskedImageModeling
          • TFAutoModelForMaskedImageModeling
          • AutoModelForObjectDetection
          • AutoModelForImageSegmentation
          • AutoModelForImageToImage
          • AutoModelForSemanticSegmentation
          • TFAutoModelForSemanticSegmentation
          • AutoModelForInstanceSegmentation
          • AutoModelForUniversalSegmentation
          • AutoModelForZeroShotImageClassification
          • TFAutoModelForZeroShotImageClassification
          • AutoModelForZeroShotObjectDetection
        • Audio
          • AutoModelForAudioClassification
          • AutoModelForAudioFrameClassification
          • TFAutoModelForAudioFrameClassification
          • AutoModelForCTC
          • AutoModelForSpeechSeq2Seq
          • TFAutoModelForSpeechSeq2Seq
          • FlaxAutoModelForSpeechSeq2Seq
          • AutoModelForAudioXVector
          • AutoModelForTextToSpectrogram
          • AutoModelForTextToWaveform
        • Multimodal
          • AutoModelForTableQuestionAnswering
          • TFAutoModelForTableQuestionAnswering
          • AutoModelForDocumentQuestionAnswering
          • TFAutoModelForDocumentQuestionAnswering
          • AutoModelForVisualQuestionAnswering
          • AutoModelForVision2Seq
          • TFAutoModelForVision2Seq
          • FlaxAutoModelForVision2Seq
      • Callbacks
      • Configuration
      • Data Collator
      • Keras callbacks
      • Logging
      • Models
      • Text Generation
      • ONNX
      • Optimization
      • Model outputs
      • Pipelines
      • Processors
      • Quantization
      • Tokenizer
      • Trainer
      • DeepSpeed Integration
      • Feature Extractor
      • Image Processor
    • 🌍MODELS
      • 🌍TEXT MODELS
        • ALBERT
        • BART
        • BARThez
        • BARTpho
        • BERT
        • BertGeneration
        • BertJapanese
        • Bertweet
        • BigBird
        • BigBirdPegasus
        • BioGpt
        • Blenderbot
        • Blenderbot Small
        • BLOOM
        • BORT
        • ByT5
        • CamemBERT
        • CANINE
        • CodeGen
        • CodeLlama
        • ConvBERT
        • CPM
        • CPMANT
        • CTRL
        • DeBERTa
        • DeBERTa-v2
        • DialoGPT
        • DistilBERT
        • DPR
        • ELECTRA
        • Encoder Decoder Models
        • ERNIE
        • ErnieM
        • ESM
        • Falcon
        • FLAN-T5
        • FLAN-UL2
        • FlauBERT
        • FNet
        • FSMT
        • Funnel Transformer
        • GPT
        • GPT Neo
        • GPT NeoX
        • GPT NeoX Japanese
        • GPT-J
        • GPT2
        • GPTBigCode
        • GPTSAN Japanese
        • GPTSw3
        • HerBERT
        • I-BERT
        • Jukebox
        • LED
        • LLaMA
        • LLama2
        • Longformer
        • LongT5
        • LUKE
        • M2M100
        • MarianMT
        • MarkupLM
        • MBart and MBart-50
        • MEGA
        • MegatronBERT
        • MegatronGPT2
        • Mistral
        • mLUKE
        • MobileBERT
        • MPNet
        • MPT
        • MRA
        • MT5
        • MVP
        • NEZHA
        • NLLB
        • NLLB-MoE
        • NystrΓΆmformer
        • Open-Llama
        • OPT
        • Pegasus
        • PEGASUS-X
        • Persimmon
        • PhoBERT
        • PLBart
        • ProphetNet
        • QDQBert
        • RAG
        • REALM
        • Reformer
        • RemBERT
        • RetriBERT
        • RoBERTa
        • RoBERTa-PreLayerNorm
        • RoCBert
        • RoFormer
        • RWKV
        • Splinter
        • SqueezeBERT
        • SwitchTransformers
        • T5
        • T5v1.1
        • TAPEX
        • Transformer XL
        • UL2
        • UMT5
        • X-MOD
        • XGLM
        • XLM
        • XLM-ProphetNet
        • XLM-RoBERTa
        • XLM-RoBERTa-XL
        • XLM-V
        • XLNet
        • YOSO
      • 🌍VISION MODELS
        • BEiT
        • BiT
        • Conditional DETR
        • ConvNeXT
        • ConvNeXTV2
        • CvT
        • Deformable DETR
        • DeiT
        • DETA
        • DETR
        • DiNAT
        • DINO V2
        • DiT
        • DPT
        • EfficientFormer
        • EfficientNet
        • FocalNet
        • GLPN
        • ImageGPT
        • LeViT
        • Mask2Former
        • MaskFormer
        • MobileNetV1
        • MobileNetV2
        • MobileViT
        • MobileViTV2
        • NAT
        • PoolFormer
        • Pyramid Vision Transformer (PVT)
        • RegNet
        • ResNet
        • SegFormer
        • SwiftFormer
        • Swin Transformer
        • Swin Transformer V2
        • Swin2SR
        • Table Transformer
        • TimeSformer
        • UperNet
        • VAN
        • VideoMAE
        • Vision Transformer (ViT)
        • ViT Hybrid
        • ViTDet
        • ViTMAE
        • ViTMatte
        • ViTMSN
        • ViViT
        • YOLOS
      • 🌍AUDIO MODELS
        • Audio Spectrogram Transformer
        • Bark
        • CLAP
        • EnCodec
        • Hubert
        • MCTCT
        • MMS
        • MusicGen
        • Pop2Piano
        • SEW
        • SEW-D
        • Speech2Text
        • Speech2Text2
        • SpeechT5
        • UniSpeech
        • UniSpeech-SAT
        • VITS
        • Wav2Vec2
        • Wav2Vec2-Conformer
        • Wav2Vec2Phoneme
        • WavLM
        • Whisper
        • XLS-R
        • XLSR-Wav2Vec2
      • 🌍MULTIMODAL MODELS
        • ALIGN
        • AltCLIP
        • BLIP
        • BLIP-2
        • BridgeTower
        • BROS
        • Chinese-CLIP
        • CLIP
        • CLIPSeg
        • Data2Vec
        • DePlot
        • Donut
        • FLAVA
        • GIT
        • GroupViT
        • IDEFICS
        • InstructBLIP
        • LayoutLM
        • LayoutLMV2
        • LayoutLMV3
        • LayoutXLM
        • LiLT
        • LXMERT
        • MatCha
        • MGP-STR
        • Nougat
        • OneFormer
        • OWL-ViT
        • Perceiver
        • Pix2Struct
        • Segment Anything
        • Speech Encoder Decoder Models
        • TAPAS
        • TrOCR
        • TVLT
        • ViLT
        • Vision Encoder Decoder Models
        • Vision Text Dual Encoder
        • VisualBERT
        • X-CLIP
      • 🌍REINFORCEMENT LEARNING MODELS
        • Decision Transformer
        • Trajectory Transformer
      • 🌍TIME SERIES MODELS
        • Autoformer
        • Informer
        • Time Series Transformer
      • 🌍GRAPH MODELS
        • Graphormer
  • 🌍INTERNAL HELPERS
    • Custom Layers and Utilities
    • Utilities for pipelines
    • Utilities for Tokenizers
    • Utilities for Trainer
    • Utilities for Generation
    • Utilities for Image Processors
    • Utilities for Audio processing
    • General Utilities
    • Utilities for Time Series
Powered by GitBook
On this page
  • Pop2Pian
  • Overview
  • Examples
  • Pop2PianoConfig
  • Pop2PianoFeatureExtractor
  • Pop2PianoForConditionalGeneration
  • Pop2PianoTokenizer
  • Pop2PianoProcessor
  1. API
  2. MODELS
  3. AUDIO MODELS

Pop2Piano

PreviousMusicGenNextSEW

Last updated 1 year ago

Pop2Pian

Overview

The Pop2Piano model was proposed in by Jongho Choi and Kyogu Lee.

Piano covers of pop music are widely enjoyed, but generating them from music is not a trivial task. It requires great expertise with playing piano as well as knowing different characteristics and melodies of a song. With Pop2Piano you can directly generate a cover from a song’s audio waveform. It is the first model to directly generate a piano cover from pop audio without melody and chord extraction modules.

Pop2Piano is an encoder-decoder Transformer model based on . The input audio is transformed to its waveform and passed to the encoder, which transforms it to a latent representation. The decoder uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four different token types: time, velocity, note and β€˜special’. The token ids are then decoded to their equivalent MIDI file.

The abstract from the paper is the following:

Piano covers of pop music are enjoyed by many people. However, the task of automatically generating piano covers of pop music is still understudied. This is partly due to the lack of synchronized {Pop, Piano Cover} data pairs, which made it challenging to apply the latest data-intensive deep learning-based methods. To leverage the power of the data-driven approach, we make a large amount of paired and synchronized {Pop, Piano Cover} data using an automated pipeline. In this paper, we present Pop2Piano, a Transformer network that generates piano covers given waveforms of pop music. To the best of our knowledge, this is the first model to generate a piano cover directly from pop audio without using melody and chord extraction modules. We show that Pop2Piano, trained with our dataset, is capable of producing plausible piano covers.

Tips:

  1. To use Pop2Piano, you will need to install the 🌍 Transformers library, as well as the following third party modules:

Copied

pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy

Please note that you may need to restart your runtime after installation. 2. Pop2Piano is an Encoder-Decoder based model like T5. 3. Pop2Piano can be used to generate midi-audio files for a given audio sequence. 4. Choosing different composers in Pop2PianoForConditionalGeneration.generate() can lead to variety of different results. 5. Setting the sampling rate to 44.1 kHz when loading the audio file can give good performance. 6. Though Pop2Piano was mainly trained on Korean Pop music, it also does pretty well on other Western Pop or Hip Hop songs.

This model was contributed by . The original code can be found .

Examples

  • Example using BOINC AI Dataset:

Copied

>>> from datasets import load_dataset
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
>>> ds = load_dataset("sweetcocoa/pop2piano_ci", split="test")

>>> inputs = processor(
...     audio=ds["audio"][0]["array"], sampling_rate=ds["audio"][0]["sampling_rate"], return_tensors="pt"
... )
>>> model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
>>> tokenizer_output = processor.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"][0]
>>> tokenizer_output.write("./Outputs/midi_output.mid")
  • Example using your own audio file:

Copied

>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

>>> audio, sr = librosa.load("<your_audio_file_here>", sr=44100)  # feel free to change the sr to a suitable value.
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

>>> inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")
>>> model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
>>> tokenizer_output = processor.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"][0]
>>> tokenizer_output.write("./Outputs/midi_output.mid")
  • Example of processing multiple audio files in batch:

Copied

>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

>>> # feel free to change the sr to a suitable value.
>>> audio1, sr1 = librosa.load("<your_first_audio_file_here>", sr=44100)  
>>> audio2, sr2 = librosa.load("<your_second_audio_file_here>", sr=44100)
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

>>> inputs = processor(audio=[audio1, audio2], sampling_rate=[sr1, sr2], return_attention_mask=True, return_tensors="pt")
>>> # Since we now generating in batch(2 audios) we must pass the attention_mask
>>> model_output = model.generate(
...     input_features=inputs["input_features"],
...     attention_mask=inputs["attention_mask"],
...     composer="composer1",
... )
>>> tokenizer_output = processor.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"]

>>> # Since we now have 2 generated MIDI files
>>> tokenizer_output[0].write("./Outputs/midi_output1.mid")
>>> tokenizer_output[1].write("./Outputs/midi_output2.mid")
  • Example of processing multiple audio files in batch (Using Pop2PianoFeatureExtractor and Pop2PianoTokenizer):

Copied

>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoFeatureExtractor, Pop2PianoTokenizer

>>> # feel free to change the sr to a suitable value.
>>> audio1, sr1 = librosa.load("<your_first_audio_file_here>", sr=44100)  
>>> audio2, sr2 = librosa.load("<your_second_audio_file_here>", sr=44100)
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("sweetcocoa/pop2piano")
>>> tokenizer = Pop2PianoTokenizer.from_pretrained("sweetcocoa/pop2piano")

>>> inputs = feature_extractor(
...     audio=[audio1, audio2], 
...     sampling_rate=[sr1, sr2], 
...     return_attention_mask=True, 
...     return_tensors="pt",
... )
>>> # Since we now generating in batch(2 audios) we must pass the attention_mask
>>> model_output = model.generate(
...     input_features=inputs["input_features"],
...     attention_mask=inputs["attention_mask"],
...     composer="composer1",
... )
>>> tokenizer_output = tokenizer.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"]

>>> # Since we now have 2 generated MIDI files
>>> tokenizer_output[0].write("./Outputs/midi_output1.mid")
>>> tokenizer_output[1].write("./Outputs/midi_output2.mid")

Pop2PianoConfig

class transformers.Pop2PianoConfig

( vocab_size = 2400composer_vocab_size = 21d_model = 512d_kv = 64d_ff = 2048num_layers = 6num_decoder_layers = Nonenum_heads = 8relative_attention_num_buckets = 32relative_attention_max_distance = 128dropout_rate = 0.1layer_norm_epsilon = 1e-06initializer_factor = 1.0feed_forward_proj = 'gated-gelu'is_encoder_decoder = Trueuse_cache = Truepad_token_id = 0eos_token_id = 1dense_act_fn = 'relu'**kwargs )

Parameters

  • composer_vocab_size (int, optional, defaults to 21) β€” Denotes the number of composers.

  • d_model (int, optional, defaults to 512) β€” Size of the encoder layers and the pooler layer.

  • d_kv (int, optional, defaults to 64) β€” Size of the key, query, value projections per attention head. The inner_dim of the projection layer will be defined as num_heads * d_kv.

  • d_ff (int, optional, defaults to 2048) β€” Size of the intermediate feed forward layer in each Pop2PianoBlock.

  • num_layers (int, optional, defaults to 6) β€” Number of hidden layers in the Transformer encoder.

  • num_decoder_layers (int, optional) β€” Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not set.

  • num_heads (int, optional, defaults to 8) β€” Number of attention heads for each attention layer in the Transformer encoder.

  • relative_attention_num_buckets (int, optional, defaults to 32) β€” The number of buckets to use for each attention layer.

  • relative_attention_max_distance (int, optional, defaults to 128) β€” The maximum distance of the longer sequences for the bucket separation.

  • dropout_rate (float, optional, defaults to 0.1) β€” The ratio for all dropout layers.

  • layer_norm_epsilon (float, optional, defaults to 1e-6) β€” The epsilon used by the layer normalization layers.

  • initializer_factor (float, optional, defaults to 1.0) β€” A factor for initializing all weight matrices (should be kept to 1.0, used internally for initialization testing).

  • feed_forward_proj (string, optional, defaults to "gated-gelu") β€” Type of feed forward layer to be used. Should be one of "relu" or "gated-gelu".

  • use_cache (bool, optional, defaults to True) β€” Whether or not the model should return the last key/values attentions (not used by all models).

  • dense_act_fn (string, optional, defaults to "relu") β€” Type of Activation Function to be used in Pop2PianoDenseActDense and in Pop2PianoDenseGatedActDense.

Pop2PianoFeatureExtractor

class transformers.Pop2PianoFeatureExtractor

( *args**kwargs )

__call__

( *args**kwargs )

Call self as a function.

Pop2PianoForConditionalGeneration

class transformers.Pop2PianoForConditionalGeneration

( config: Pop2PianoConfig )

Parameters

forward

Parameters

  • attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) β€” Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

    • 1 for tokens that are not masked,

  • decoder_attention_mask (torch.BoolTensor of shape (batch_size, target_sequence_length), optional) β€” Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.

  • head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) β€” Mask to nullify selected heads of the self-attention modules in the encoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • decoder_head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) β€” Mask to nullify selected heads of the self-attention modules in the decoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • cross_attn_head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) β€” Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in [0, 1]:

    • 1 indicates the head is not masked,

    • 0 indicates the head is masked.

  • encoder_outputs (tuple(tuple(torch.FloatTensor), optional) β€” Tuple consists of (last_hidden_state, optional: hidden_states, optional: attentions) last_hidden_state of shape (batch_size, sequence_length, hidden_size) is a sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.

  • past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) β€” Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).

  • inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) β€” Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.

  • input_features (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) β€” Does the same task as inputs_embeds. If inputs_embeds is not present but input_features is present then input_features will be considered as inputs_embeds.

  • decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) β€” Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.

  • use_cache (bool, optional) β€” If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

  • output_attentions (bool, optional) β€” Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

  • output_hidden_states (bool, optional) β€” Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

  • labels (torch.LongTensor of shape (batch_size,), optional) β€” Labels for computing the sequence classification/regression loss. Indices should be in [-100, 0, ..., config.vocab_size - 1]. All labels set to -100 are ignored (masked), the loss is only computed for labels in [0, ..., config.vocab_size]

Returns

  • loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) β€” Language modeling loss.

  • logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) β€” Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

  • past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) β€” Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.

  • decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) β€” Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.

  • decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) β€” Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

  • cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) β€” Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) β€” Sequence of hidden-states at the output of the last layer of the encoder of the model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) β€” Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

    Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.

  • encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) β€” Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

generate

Parameters

  • input_features (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) β€” This is the featurized version of audio generated by Pop2PianoFeatureExtractor. attention_mask β€” For batched generation input_features are padded to have the same shape across all examples. attention_mask helps to determine which areas were padded and which were not.

    • 1 for tokens that are not padded,

    • 0 for tokens that are padded.

Returns

Generates token ids for midi outputs.

Pop2PianoTokenizer

class transformers.Pop2PianoTokenizer

( *args**kwargs )

__call__

( *args**kwargs )

Call self as a function.

Pop2PianoProcessor

class transformers.Pop2PianoProcessor

( *args**kwargs )

__call__

( *args**kwargs )

Call self as a function.

vocab_size (int, optional, defaults to 2400) β€” Vocabulary size of the Pop2PianoForConditionalGeneration model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling .

This is the configuration class to store the configuration of a . It is used to instantiate a Pop2PianoForConditionalGeneration model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Pop2Piano architecture.

Configuration objects inherit from and can be used to control the model outputs. Read the documentation from for more information.

config () β€” Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the method to load the model weights.

Pop2Piano Model with a language modeling head on top. This model inherits from . Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.

( input_ids: typing.Optional[torch.LongTensor] = Noneattention_mask: typing.Optional[torch.FloatTensor] = Nonedecoder_input_ids: typing.Optional[torch.LongTensor] = Nonedecoder_attention_mask: typing.Optional[torch.BoolTensor] = Nonehead_mask: typing.Optional[torch.FloatTensor] = Nonedecoder_head_mask: typing.Optional[torch.FloatTensor] = Nonecross_attn_head_mask: typing.Optional[torch.Tensor] = Noneencoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = Nonepast_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = Noneinputs_embeds: typing.Optional[torch.FloatTensor] = Noneinput_features: typing.Optional[torch.FloatTensor] = Nonedecoder_inputs_embeds: typing.Optional[torch.FloatTensor] = Nonelabels: typing.Optional[torch.LongTensor] = Noneuse_cache: typing.Optional[bool] = Noneoutput_attentions: typing.Optional[bool] = Noneoutput_hidden_states: typing.Optional[bool] = Nonereturn_dict: typing.Optional[bool] = None ) β†’ or tuple(torch.FloatTensor)

input_ids (torch.LongTensor of shape (batch_size, sequence_length)) β€” Indices of input sequence tokens in the vocabulary. Pop2Piano is a model with relative position embeddings so you should be able to pad the inputs on both the right and the left. Indices can be obtained using . See and for detail. To know more on how to prepare input_ids for pretraining take a look a .

0 for tokens that are masked.

decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) β€” Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using . See and for details. Pop2Piano uses the pad_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). To know more on how to prepare

return_dict (bool, optional) β€” Whether or not to return a instead of a plain tuple.

or tuple(torch.FloatTensor)

A or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration () and inputs.

The forward method, overrides the __call__ special method.

( input_featuresattention_mask = Nonecomposer = 'composer1'generation_config = None**kwargs ) β†’ or torch.LongTensor

composer (str, optional, defaults to "composer1") β€” This value is passed to Pop2PianoConcatEmbeddingToMel to generate different embeddings for each "composer". Please make sure that the composet value is present in composer_to_feature_token in generation_config. For an example please see .

generation_config (~generation.GenerationConfig, optional) β€” The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which had the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit ’s default values, whose documentation should be checked to parameterize generation. kwargs β€” Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with decoder_.

or torch.LongTensor

A (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.FloatTensor. Since Pop2Piano is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible types are:

,

,

,

Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True). For an overview of generation strategies and code examples, check out the .

🌍
🌍
🌍
Pop2Piano : Pop Audio-based Piano Cover Generation
T5
Susnato Dhar
here
<source>
Pop2PianoForConditionalGeneration
Pop2PianoForConditionalGeneration
sweetcocoa/pop2piano
PretrainedConfig
PretrainedConfig
<source>
<source>
Pop2PianoConfig
from_pretrained()
PreTrainedModel
torch.nn.Module
<source>
transformers.modeling_outputs.Seq2SeqLMOutput
AutoTokenizer
PreTrainedTokenizer.encode()
PreTrainedTokenizer.call()
What are input IDs?
Pop2Pianp Training
What are attention masks?
AutoTokenizer
PreTrainedTokenizer.encode()
PreTrainedTokenizer.call()
What are decoder input IDs?
ModelOutput
transformers.modeling_outputs.Seq2SeqLMOutput
transformers.modeling_outputs.Seq2SeqLMOutput
Pop2PianoConfig
Pop2PianoForConditionalGeneration
<source>
ModelOutput
https://boincai.com/sweetcocoa/pop2piano/blob/main/generation_config.json
GenerationConfig
ModelOutput
ModelOutput
ModelOutput
GreedySearchEncoderDecoderOutput
SampleEncoderDecoderOutput
BeamSearchEncoderDecoderOutput
BeamSampleEncoderDecoderOutput
following guide
<source>
<source>