Here you will find demos accompanying our paper.

Our inpainting model is a BART-like language model: a Transformer encoder connected to an autoregressive Transformer decoder. To effectively encode long music sequences, the model regards the observed music data as a sequence of mixed high-level representations. The encoder is first pre-trained and later jointly trained with the decoder.

Figure: Pipelines.
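For concreteness, below is a minimal PyTorch sketch of such a BART-like encoder-decoder over token sequences. All names and hyperparameters are illustrative placeholders rather than the exact configuration used in the paper, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class InpaintingModel(nn.Module):
    """Sketch of a BART-like music inpainting model.

    Placeholder configuration; the real model encodes mixed
    high-level representations rather than raw tokens.
    """

    def __init__(self, vocab_size=512, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, context_tokens, target_tokens):
        # The encoder reads the (partially masked) context; the decoder
        # predicts the missing span autoregressively under a causal mask.
        causal_mask = self.transformer.generate_square_subsequent_mask(
            target_tokens.size(1)
        )
        hidden = self.transformer(
            self.embed(context_tokens),
            self.embed(target_tokens),
            tgt_mask=causal_mask,
        )
        return self.proj(hidden)  # next-token logits over the vocabulary
```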

In our paper, we demonstrate how the proposed model performs inpainting through several inference experiments, as follows:

Time scope:

  1. Variable-position inpainting.

  2. Variable-length inpainting.

Semantic features:

  1. Rhythm specification.

We show some demos in the inference section. The time ranges of the generated parts are indicated in seconds (s).

Time scope: Variable-position inpainting

An example of a 16-bar melody. Our model inpaints the 1st-4th, 5th-8th, 9th-12th, or 13th-16th bars separately.

Inpainting the 1st-4th bars (from 0s to 8s).

Inpainting the 5th-8th bars (from 8s to 16s).

Inpainting the 9th-12th bars (from 16s to 24s).

Inpainting the 13th-16th bars (from 24s to 32s).

Time scope: Variable-length inpainting

These examples compare our model with the baseline on infilling missing spans of different lengths. Within each case, the context (the first 6 bars and the last 6 bars) stays the same, while the models inpaint either the middle 2 bars (12s-16s) or the middle 4 bars (12s-20s).

Case 1

Inpainting 2 bars (from 12s to 16s).

Our proposed model.

Baseline.

Our proposed model without contrastive loss.

Inpainting 4 bars (from 12s to 20s).

Our proposed model.

Baseline.

Our proposed model without contrastive loss.

Case 2

Inpainting 2 bars (from 12s to 16s).

Our proposed model.

Baseline.

Our proposed model without contrastive loss.

Inpainting 4 bars (from 12s to 20s).

Our proposed model.

Baseline.

Our proposed model without contrastive loss.

Semantic features: Rhythm specification

Here, we change the conditioning rhythm representation of the melody inpainted by our proposed model to a different one by applying the EC2VAE model once again; this operation varies the number of inpainted notes within one bar.
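Conceptually, the swap looks like the sketch below. The `ec2vae` object, its `encode`/`decode` methods, and the variable names are hypothetical stand-ins for the actual EC2VAE interface; we rely only on the fact that EC2VAE disentangles a melody into a pitch latent and a rhythm latent.

```python
# Hypothetical sketch of the rhythm-swapping step; the real EC2VAE
# interface may differ. EC2VAE disentangles a melody into a pitch
# latent z_pitch and a rhythm latent z_rhythm.
def swap_rhythm(ec2vae, inpainted_melody, rhythm_source):
    z_pitch, _ = ec2vae.encode(inpainted_melody)   # keep the pitch content
    _, z_rhythm = ec2vae.encode(rhythm_source)     # borrow the new rhythm
    # Decoding with the borrowed rhythm latent changes the number of
    # notes within each bar while preserving the pitch content.
    return ec2vae.decode(z_pitch, z_rhythm)
```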

Case 1

The melody generated by our model. The inpainted part is from 16s to 24s.

Target rhythm.

Target number of notes in the inpainted part.

Case 2

The melody generated by our model. The inpainted part is from 24s to 32s.

Target rhythm.

Target number of notes in the inpainted part.

More experiments: Attending to irregularly corrupted contexts

Our proposed model attends to masked contexts and infills missing segments of different total lengths. The positions of the generated parts are randomly sampled.
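As an illustration of how corruption positions might be sampled, here is a small Python sketch; the function name and the exact scheme (non-overlapping fixed-length bar spans) are assumptions for illustration, not necessarily the sampling procedure used in the paper.

```python
import random

def sample_masked_spans(num_bars, num_spans=2, span_len=2, max_tries=1000):
    """Randomly choose non-overlapping bar spans to corrupt and inpaint.

    Illustrative helper only; the paper's sampling scheme may differ.
    """
    masked = set()
    for _ in range(max_tries):
        if len(masked) >= num_spans * span_len:
            break
        start = random.randrange(num_bars - span_len + 1)
        span = set(range(start, start + span_len))
        if span & masked:
            continue  # overlap with an existing span: resample
        masked |= span
    return sorted(masked)

# Example: pick two 2-bar spans to mask in a 10-bar melody.
print(sample_masked_spans(10))
```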

Total 10 bars.

Total 12 bars.

More experiments: Mixed-level representation sampling

Our model generates diversified versions of 'Danny Boy' under mixed-level sampling. The inpainted part is from 16s to 24s.

Version 1.

Version 2.

Version 3.

Version 4.

Version 5.

Version 6.

Version 7.

Structured generation

Our model can generate structured music.

Version 1. The melody is in AB?B structure and our model generates the third phrase, from 16s to 24s.

Version 2. The melody is in AB?B structure and our model generates the third phrase, from 16s to 24s.

Version 3. The melody is in AA?A structure and our model generates the third phrase, from 16s to 24s.

Below, we show the music scores of some cases.