Pop2Piano
The Pop2Piano model was proposed in Pop2Piano: Pop Audio-based Piano Cover Generation by Jongho Choi and Kyogu Lee.
Piano covers of pop music are widely enjoyed, but generating them from pop music is not a trivial task. It requires great expertise in playing the piano as well as knowledge of a song's different characteristics and melodies. With Pop2Piano you can directly generate a cover from a song's audio waveform. It is the first model to directly generate a piano cover from pop audio without melody and chord extraction modules.
Pop2Piano is an encoder-decoder Transformer model based on T5. The input audio is transformed to its waveform and passed to the encoder, which transforms it to a latent representation. The decoder uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four different token types: time, velocity, note and "special". The token ids are then decoded to their equivalent MIDI file.
The abstract from the paper is the following:
Piano covers of pop music are enjoyed by many people. However, the task of automatically generating piano covers of pop music is still understudied. This is partly due to the lack of synchronized {Pop, Piano Cover} data pairs, which made it challenging to apply the latest data-intensive deep learning-based methods. To leverage the power of the data-driven approach, we make a large amount of paired and synchronized {Pop, Piano Cover} data using an automated pipeline. In this paper, we present Pop2Piano, a Transformer network that generates piano covers given waveforms of pop music. To the best of our knowledge, this is the first model to generate a piano cover directly from pop audio without using melody and chord extraction modules. We show that Pop2Piano, trained with our dataset, is capable of producing plausible piano covers.
Tips:
1. To use Pop2Piano, you will need to install the BOINC AI Transformers library, as well as the following third-party modules:
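The original install command was not preserved here; based on the modules Pop2Piano's feature extractor and tokenizer rely on, a typical command might look like the following (package names and pinned versions are indicative only):

```
pip install pretty-midi==0.2.10 essentia==2.1b6.dev1034 librosa scipy resampy
```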
   Please note that you may need to restart your runtime after installation.
2. Pop2Piano is an encoder-decoder based model, like T5.
3. Pop2Piano can be used to generate midi-audio files for a given audio sequence.
4. Choosing different composers in Pop2PianoForConditionalGeneration.generate() can lead to a variety of different results.
5. Setting the sampling rate to 44.1 kHz when loading the audio file can give good performance.
6. Though Pop2Piano was mainly trained on Korean pop music, it also does pretty well on other Western pop or hip hop songs.
This model was contributed by Susnato Dhar. The original code can be found at https://github.com/sweetcocoa/pop2piano.
Example using BOINC AI Dataset:
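The original code block was not preserved; the sketch below assumes the sweetcocoa/pop2piano checkpoint and the sweetcocoa/pop2piano_ci test dataset, and uses Pop2PianoProcessor to featurize the audio and decode the generated token ids back to MIDI:

```python
from datasets import load_dataset
from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
ds = load_dataset("sweetcocoa/pop2piano_ci", split="test")

# Featurize the raw waveform; the processor wraps Pop2PianoFeatureExtractor.
inputs = processor(
    audio=ds["audio"][0]["array"],
    sampling_rate=ds["audio"][0]["sampling_rate"],
    return_tensors="pt",
)
# Generate token ids conditioned on a composer embedding.
model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
# Decode the token ids into a pretty_midi object and write it to disk.
tokenizer_output = processor.batch_decode(
    token_ids=model_output, feature_extractor_output=inputs
)["pretty_midi_objects"][0]
tokenizer_output.write("./midi_output.mid")
```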
Example using your own audio file:
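The code for this example was likewise not preserved; a sketch assuming librosa for loading and the same sweetcocoa/pop2piano checkpoint (the audio path is a placeholder):

```python
import librosa
from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

# Loading at 44.1 kHz tends to give good results (see the tips above).
audio, sr = librosa.load("<your_audio_file.wav>", sr=44100)

model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")
model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
tokenizer_output = processor.batch_decode(
    token_ids=model_output, feature_extractor_output=inputs
)["pretty_midi_objects"][0]
tokenizer_output.write("./midi_output.mid")
```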
Example of processing multiple audio files in batch:
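A sketch of batched generation with the processor, under the same assumptions as above; when several audio files are featurized together they are padded, so the attention mask returned by the processor is passed to generate():

```python
import librosa
from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

audio1, sr1 = librosa.load("<your_first_audio_file.wav>", sr=44100)
audio2, sr2 = librosa.load("<your_second_audio_file.wav>", sr=44100)

model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

# Featurize both files at once; the attention mask marks the padded regions.
inputs = processor(
    audio=[audio1, audio2],
    sampling_rate=[sr1, sr2],
    return_attention_mask=True,
    return_tensors="pt",
)
model_output = model.generate(
    input_features=inputs["input_features"],
    attention_mask=inputs["attention_mask"],
    composer="composer1",
)
tokenizer_output = processor.batch_decode(
    token_ids=model_output, feature_extractor_output=inputs
)["pretty_midi_objects"]

# One MIDI object per input audio file.
tokenizer_output[0].write("./midi_output1.mid")
tokenizer_output[1].write("./midi_output2.mid")
```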
Example of processing multiple audio files in batch (using Pop2PianoFeatureExtractor and Pop2PianoTokenizer):
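The same batched flow, but using Pop2PianoFeatureExtractor and Pop2PianoTokenizer directly instead of the combined processor (a sketch under the same assumptions as above):

```python
import librosa
from transformers import (
    Pop2PianoFeatureExtractor,
    Pop2PianoForConditionalGeneration,
    Pop2PianoTokenizer,
)

audio1, sr1 = librosa.load("<your_first_audio_file.wav>", sr=44100)
audio2, sr2 = librosa.load("<your_second_audio_file.wav>", sr=44100)

model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("sweetcocoa/pop2piano")
tokenizer = Pop2PianoTokenizer.from_pretrained("sweetcocoa/pop2piano")

inputs = feature_extractor(
    audio=[audio1, audio2],
    sampling_rate=[sr1, sr2],
    return_attention_mask=True,
    return_tensors="pt",
)
model_output = model.generate(
    input_features=inputs["input_features"],
    attention_mask=inputs["attention_mask"],
    composer="composer1",
)
tokenizer_output = tokenizer.batch_decode(
    token_ids=model_output, feature_extractor_output=inputs
)["pretty_midi_objects"]

tokenizer_output[0].write("./midi_output1.mid")
tokenizer_output[1].write("./midi_output2.mid")
```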
Pop2PianoConfig
( vocab_size = 2400, composer_vocab_size = 21, d_model = 512, d_kv = 64, d_ff = 2048, num_layers = 6, num_decoder_layers = None, num_heads = 8, relative_attention_num_buckets = 32, relative_attention_max_distance = 128, dropout_rate = 0.1, layer_norm_epsilon = 1e-06, initializer_factor = 1.0, feed_forward_proj = 'gated-gelu', is_encoder_decoder = True, use_cache = True, pad_token_id = 0, eos_token_id = 1, dense_act_fn = 'relu', **kwargs )
Parameters
composer_vocab_size (int, optional, defaults to 21) — Denotes the number of composers.
d_model (int, optional, defaults to 512) — Size of the encoder layers and the pooler layer.
d_kv (int, optional, defaults to 64) — Size of the key, query, value projections per attention head. The inner_dim of the projection layer will be defined as num_heads * d_kv.
d_ff (int, optional, defaults to 2048) — Size of the intermediate feed forward layer in each Pop2PianoBlock.
num_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer encoder.
num_decoder_layers (int, optional) — Number of hidden layers in the Transformer decoder. Will use the same value as num_layers if not set.
num_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer encoder.
relative_attention_num_buckets (int, optional, defaults to 32) — The number of buckets to use for each attention layer.
relative_attention_max_distance (int, optional, defaults to 128) — The maximum distance of the longer sequences for the bucket separation.
dropout_rate (float, optional, defaults to 0.1) — The ratio for all dropout layers.
layer_norm_epsilon (float, optional, defaults to 1e-6) — The epsilon used by the layer normalization layers.
initializer_factor (float, optional, defaults to 1.0) — A factor for initializing all weight matrices (should be kept to 1.0, used internally for initialization testing).
feed_forward_proj (string, optional, defaults to "gated-gelu") — Type of feed forward layer to be used. Should be one of "relu" or "gated-gelu".
use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
dense_act_fn (string, optional, defaults to "relu") — Type of Activation Function to be used in Pop2PianoDenseActDense and in Pop2PianoDenseGatedActDense.
( *args, **kwargs )
__call__
( *args, **kwargs )
Call self as a function.
( config: Pop2PianoConfig )
Parameters
forward
Parameters
attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked.
decoder_attention_mask (torch.BoolTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules in the encoder. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
decoder_head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules in the decoder. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
cross_attn_head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the cross-attention modules in the decoder. Mask values selected in [0, 1]:
1 indicates the head is not masked,
0 indicates the head is masked.
encoder_outputs (tuple(tuple(torch.FloatTensor)), optional) — Tuple consists of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state of shape (batch_size, sequence_length, hidden_size) is a sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
past_key_values (tuple(tuple(torch.FloatTensor)) of length config.n_layers with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) — Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If past_key_values are used, the user can optionally input only the last decoder_input_ids (those that don't have their past key value states given to this model) of shape (batch_size, 1) instead of all decoder_input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
input_features (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Does the same task as inputs_embeds. If inputs_embeds is not present but input_features is present then input_features will be considered as inputs_embeds.
decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model's internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.
use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [-100, 0, ..., config.vocab_size - 1]. All labels set to -100 are ignored (masked), the loss is only computed for labels in [0, ..., config.vocab_size].
Returns
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
generate
Parameters
input_features (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — This is the featurized version of audio generated by Pop2PianoFeatureExtractor.
attention_mask — For batched generation input_features are padded to have the same shape across all examples. attention_mask helps to determine which areas were padded and which were not:
1 for tokens that are not padded,
0 for tokens that are padded.
Returns
Generates token ids for midi outputs.
( *args, **kwargs )
__call__
( *args, **kwargs )
Call self as a function.
( *args, **kwargs )
__call__
( *args, **kwargs )
Call self as a function.
vocab_size (int, optional, defaults to 2400) — Vocabulary size of the Pop2PianoForConditionalGeneration model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling Pop2PianoForConditionalGeneration.
This is the configuration class to store the configuration of a Pop2PianoForConditionalGeneration. It is used to instantiate a Pop2PianoForConditionalGeneration model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Pop2Piano architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
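For illustration, a minimal sketch of instantiating the configuration and a model from it (default values as listed above):

```python
from transformers import Pop2PianoConfig, Pop2PianoForConditionalGeneration

# Initialize a configuration with the default values described above.
configuration = Pop2PianoConfig()

# Initialize a (randomly weighted) model from that configuration.
model = Pop2PianoForConditionalGeneration(configuration)

# Access the configuration back from the model.
configuration = model.config
```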
config (Pop2PianoConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Pop2Piano Model with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
( input_ids: typing.Optional[torch.LongTensor] = None, attention_mask: typing.Optional[torch.FloatTensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.BoolTensor] = None, head_mask: typing.Optional[torch.FloatTensor] = None, decoder_head_mask: typing.Optional[torch.FloatTensor] = None, cross_attn_head_mask: typing.Optional[torch.Tensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, input_features: typing.Optional[torch.FloatTensor] = None, decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None ) → Seq2SeqLMOutput or tuple(torch.FloatTensor)
input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Pop2Piano is a model with relative position embeddings so you should be able to pad the inputs on both the right and the left. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for detail. To know more on how to prepare input_ids for pretraining take a look at T5 Training.
decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. Pop2Piano uses the pad_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). To know more on how to prepare decoder_input_ids for pretraining take a look at T5 Training.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
A Seq2SeqLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Pop2PianoConfig) and inputs.
The Pop2PianoForConditionalGeneration forward method overrides the __call__ special method.
( input_features, attention_mask = None, composer = 'composer1', generation_config = None, **kwargs ) → ModelOutput or torch.LongTensor
composer (str, optional, defaults to "composer1") — This value is passed to Pop2PianoConcatEmbeddingToMel to generate different embeddings for each "composer". Please make sure that the composer value is present in composer_to_feature_token in generation_config. For an example please see the usage examples above.
generation_config (~generation.GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig's default values, whose documentation should be checked to parameterize generation.
kwargs — Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with decoder_.
Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.FloatTensor. Since Pop2Piano is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible ModelOutput types are the encoder-decoder generation output classes (greedy search, sampling, beam search and beam sample outputs).
Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model's default generation configuration. You can override any generation_config parameter by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True). For an overview of generation strategies and code examples, check out the generation strategies guide.
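For example, building on the earlier usage sketches (the `model` and `inputs` objects are assumed from those examples), ad hoc parameters can be passed directly to generate() to override the default generation configuration:

```python
# Override the default generation configuration on the fly:
# beam sample decoding with 4 beams instead of the defaults.
model_output = model.generate(
    input_features=inputs["input_features"],
    composer="composer1",
    num_beams=4,
    do_sample=True,
)
```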