Symbolic Music Generation Using Deep Neural Networks
Last Updated on June 4, 2024 by Editorial Team
Author(s): Xavier Yuhan Liu
Originally published on Towards AI.
How is music encoded in deep learning models? What datasets are available? What are the well-known models in this area? In this story, I will walk you through all of these.
What's the point of generating music?
You've probably heard of ChatGPT and MidJourney; they're widely known AI systems. However, most of the popular models generate text and images.
Models that generate music, on the other hand, are still in development and haven't achieved the same level of success.
As the American poet Henry Wadsworth Longfellow once said, "Music is the universal language of mankind." Music is vital to our lives.
So, using AI to help with music creation would be beneficial. This story dives into how AI is used in symbolic music creation.
What's symbolic music?
To begin with, let's define symbolic music.
Unlike nonsymbolic music, which is just a sound wave, symbolic music represents music with musical notation. Pitch, duration, key, and so on can be represented by specific notations in a score.
Nonsymbolic music contains none of these. Your MP3 and WAV files hold only sound waves, with no musical notation attached.
Famous music models such as Suno are trained on nonsymbolic (audio) datasets, partly due to the lack of notated (symbolic) music datasets. One well-known model trained on nonsymbolic music data is Jukebox from OpenAI.
What are the main methods used currently?
The three main methods for generating symbolic music are rule-based systems, neural networks, and Bayesian networks.
In this story, we'll mainly focus on neural networks because all the current widespread AI models (e.g., Stable Diffusion and GPT-4) are built with deep neural networks. Deep learning is very popular, and researchers have already used it to train music generation models.
Before diving into neural networks, let's briefly look at the other two methods to understand symbolic music generation better.
Rule-based
The rule-based method involves writing a lot of if-then statements. Developers first define a set of patterns and let the software compose a song by combining those patterns according to specific rules.
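As a toy illustration (a minimal Python sketch, not taken from any real system), a rule-based generator might look like this:

import random

SCALE = [60, 62, 64, 65, 67, 69, 71, 72]  # C major scale as MIDI pitch numbers

def next_pitch(prev_pitch):
    # Pick the next pitch using simple hand-written rules.
    if prev_pitch is None:
        return 60  # rule: always start on the tonic (middle C)
    if prev_pitch == 71:
        return 72  # rule: the leading tone resolves up to the tonic
    candidates = [p for p in SCALE if abs(p - prev_pitch) <= 4]  # rule: avoid large leaps
    return random.choice(candidates)

def compose(n_notes=16):
    melody, prev = [], None
    for _ in range(n_notes):
        prev = next_pitch(prev)
        melody.append(prev)
    melody[-1] = 60  # rule: end on the tonic
    return melody

print(compose())

Every musical decision here is hard-coded; there is no learning involved, which is both the strength (full control) and the weakness (limited variety) of this approach.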
Bayesian networks
Bayesian networks can also be used to model the probability distributions of musical elements and the transition probabilities between them. By learning from a large amount of music data, a Bayesian network can capture patterns and correlations between musical elements.
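The simplest member of this family is a first-order Markov chain over pitches. The sketch below (a toy Python example, not a full Bayesian network) learns transition probabilities from a tiny corpus and samples a new melody from them:

import random
from collections import defaultdict

def learn_transitions(melodies):
    # Count how often each pitch follows another, then normalize to probabilities.
    counts = defaultdict(lambda: defaultdict(int))
    for melody in melodies:
        for a, b in zip(melody, melody[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def sample(transitions, start, length=8):
    # Generate a melody by repeatedly sampling the next pitch.
    melody = [start]
    for _ in range(length - 1):
        nxt = transitions.get(melody[-1])
        if not nxt:
            break
        pitches, probs = zip(*nxt.items())
        melody.append(random.choices(pitches, weights=probs)[0])
    return melody

corpus = [[60, 62, 64, 62, 60], [60, 64, 67, 64, 60]]  # toy training data
model = learn_transitions(corpus)
print(sample(model, start=60))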
Now, let's discuss deep neural networks in symbolic music generation.
Deep neural networks
In fact, unlike the image and text generation fields, where new architectures such as GANs and Transformers were born, the music generation field mostly reuses backbone models from elsewhere. You can find music generation models built on LSTMs, GANs, and Transformers.
So the difference between music generation models isn't mainly in the backbone architecture. The real difference lies in encoding: how can we represent music so that AI systems can understand it well?
Representation/Encoding
In everyday practice, we have three common ways to represent music on a computer: MIDI, MusicXML, and piano roll. After the Transformer came out, researchers also proposed many token-based encodings for music.
MIDI
MIDI, the format used most often in the modern music industry, is the standard way to store symbolic music. It includes almost everything: the pitch, the duration, the instrument, and many other detailed parameters used in making music.
It does store a lot of information, but that is also the problem: raw MIDI is hard to use directly for training a model because there is too much low-level detail.
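For example, with the third-party pretty_midi library (one common choice among several), the note-level information can be pulled out of a MIDI file roughly like this; the file name is a placeholder:

import pretty_midi

midi = pretty_midi.PrettyMIDI("example.mid")  # load a MIDI file (placeholder path)

for instrument in midi.instruments:
    name = pretty_midi.program_to_name(instrument.program)
    for note in instrument.notes:
        # Each note carries pitch, onset/offset times (seconds), and velocity (loudness).
        print(name, note.pitch, note.start, note.end, note.velocity)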
MusicXML
Here is an example from Wikipedia.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE score-partwise PUBLIC
    "-//Recordare//DTD MusicXML 4.0 Partwise//EN"
    "http://www.musicxml.org/dtds/partwise.dtd">
<score-partwise version="4.0">
  <part-list>
    <score-part id="P1">
      <part-name>Music</part-name>
    </score-part>
  </part-list>
  <part id="P1">
    <measure number="1">
      <attributes>
        <divisions>1</divisions>
        <key>
          <fifths>0</fifths>
        </key>
        <time>
          <beats>4</beats>
          <beat-type>4</beat-type>
        </time>
        <clef>
          <sign>G</sign>
          <line>2</line>
        </clef>
      </attributes>
      <note>
        <pitch>
          <step>C</step>
          <octave>4</octave>
        </pitch>
        <duration>4</duration>
        <type>whole</type>
      </note>
    </measure>
  </part>
</score-partwise>
Here, the music score is broken down into XML format. As we can see, the time signature is 4/4, and we have a whole note C.
The format is also very detailed and popular for storing music on modern computers.
But there is also one problem: there are too many tags. So it's also unsuitable to train a model directly on MusicXML text.
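Still, because MusicXML is plain XML, the example above can be read with Python's standard library. This sketch (assuming the snippet is saved as score.xml) pulls out the pitch and duration of each note:

import xml.etree.ElementTree as ET

tree = ET.parse("score.xml")  # the MusicXML snippet shown above
for note in tree.getroot().iter("note"):
    step = note.findtext("pitch/step")      # e.g., "C"
    octave = note.findtext("pitch/octave")  # e.g., "4"
    duration = note.findtext("duration")    # in <divisions> units
    note_type = note.findtext("type")       # e.g., "whole"
    print(f"{step}{octave}: duration={duration}, type={note_type}")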
Piano roll
The piano roll is used very often for editing scores in modern music production.
As we can see in a typical piano-roll view, each grid cell represents a time step, perhaps 1/64 of a bar. Each green line is a note, and the notes have different start times, pitches, and durations.
The piano roll can easily be converted into matrices. If we have multiple instruments (tracks), we will have a separate score for each, and each score can be seen as a single piano roll.
The limitation of the piano-roll representation is that it is too sparse for training models. The best-known work using piano rolls is MuseGAN.
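A minimal sketch of the conversion (using NumPy, with a made-up resolution of 16 time steps per bar) shows why: almost every cell of the resulting matrix stays zero.

import numpy as np

# Each note: (pitch, start_step, duration_in_steps) -- toy data for one bar.
notes = [(60, 0, 4), (64, 4, 4), (67, 8, 8)]

n_pitches, n_steps = 128, 16
roll = np.zeros((n_pitches, n_steps), dtype=np.uint8)

for pitch, start, duration in notes:
    roll[pitch, start:start + duration] = 1  # mark the cells the note occupies

print(roll.shape)              # (128, 16)
print(roll.sum() / roll.size)  # fraction of non-zero cells: tiny, i.e., very sparse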
Tokens
Tokens are the representation used most often these days. In NLP tasks, we convert words into tokens, so when we want to apply NLP architectures such as LSTMs and Transformers to music, converting music scores into tokens is natural and suitable.
Here, we convert music notations such as bars, beats, chords, tempos, and pitches into tokens; we can then handle them like regular text tokens, as sketched after the comparison below.
Compared with natural text tokenization, music tokenization differs in these ways:
- One note can span more than one position, while one word takes only one position.
- At a single position, there may be multiple notes (e.g., a chord), while in text there is only one word per position.
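As a minimal illustration (in the spirit of REMI-style event tokens; the event names and vocabulary here are invented for the sketch, not taken from a specific paper), notes can be flattened into a token sequence like this:

def tokenize(notes, steps_per_bar=16):
    # Convert (pitch, start_step, duration) notes into a flat token sequence.
    tokens = []
    current_bar = -1
    for pitch, start, duration in sorted(notes, key=lambda n: n[1]):
        bar, position = divmod(start, steps_per_bar)
        if bar != current_bar:
            tokens.append("Bar")                 # a new bar begins
            current_bar = bar
        tokens.append(f"Position_{position}")    # where in the bar the note starts
        tokens.append(f"Pitch_{pitch}")          # which key is played
        tokens.append(f"Duration_{duration}")    # how long it lasts
    return tokens

notes = [(60, 0, 4), (64, 4, 4), (67, 16, 8)]
print(tokenize(notes))
# ['Bar', 'Position_0', 'Pitch_60', 'Duration_4', 'Position_4', 'Pitch_64',
#  'Duration_4', 'Bar', 'Position_0', 'Pitch_67', 'Duration_8']

Notice how a chord would simply produce several Pitch tokens at the same Position, and a long note spans many time steps while still costing only a few tokens.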
Advanced Models
MuseGAN
MuseGAN is a GAN-based network for generating music. It takes piano rolls as input. The authors propose several generator configurations:
- Composer model: one "z" and one "G" before the bar generator, as if a single composer is in charge of the whole band.
- Jamming model: multiple "z"s and multiple "G"s, one per track, as if several players are jamming together.
- Hybrid model: it combines the advantages of the previous two. A shared "z" is in charge of all the tracks, while each track also gets its own "z".
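To make the difference concrete, here is a heavily simplified PyTorch sketch of how the latent vectors could be wired in the three settings. The layer sizes and dense layers are invented for illustration; the real MuseGAN generators are convolutional and far more elaborate.

import torch
import torch.nn as nn

N_TRACKS, Z_DIM, BAR_DIM = 4, 32, 128  # made-up sizes for illustration

def make_generator(in_dim):
    # Stand-in for a per-track bar generator "G".
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, BAR_DIM))

# Jamming: each track has its own z and its own generator.
jam_generators = [make_generator(Z_DIM) for _ in range(N_TRACKS)]
jam_bars = [g(torch.randn(1, Z_DIM)) for g in jam_generators]

# Composer: a single shared z drives one generator that outputs all tracks at once.
composer = nn.Sequential(nn.Linear(Z_DIM, 256), nn.ReLU(),
                         nn.Linear(256, N_TRACKS * BAR_DIM))
composer_bars = composer(torch.randn(1, Z_DIM)).view(1, N_TRACKS, BAR_DIM)

# Hybrid: each track's generator sees both a shared z and a track-specific z.
hybrid_generators = [make_generator(2 * Z_DIM) for _ in range(N_TRACKS)]
z_shared = torch.randn(1, Z_DIM)
hybrid_bars = [g(torch.cat([z_shared, torch.randn(1, Z_DIM)], dim=1))
               for g in hybrid_generators]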
The results can be found on its official website.
SymphonyNet
SymphonyNet is a Transformer-based network trained on tokens. Researchers from the Central Conservatory of Music proposed a new type of tokenization for it.
According to the authors:
We propose a novel Multi-track Multi-instrument Repeatable (MMR) representation for symphonic music and model the music sequence using a Transformer-based auto-regressive language model with specific 3-D positional embedding.
The 3D space has three dimensions: measure, track, and note. Compressing them into 1D enables Transformer models to learn from the music data.
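A rough sketch of the idea in PyTorch (the embedding sizes and the choice to sum the three embeddings are assumptions for illustration, not the paper's exact implementation):

import torch
import torch.nn as nn

D_MODEL = 256  # embedding width, chosen arbitrarily for this sketch

# One learnable embedding table per axis of the 3-D position.
measure_emb = nn.Embedding(512, D_MODEL)   # which measure the token belongs to
track_emb = nn.Embedding(32, D_MODEL)      # which track (instrument) it belongs to
note_emb = nn.Embedding(1024, D_MODEL)     # index of the note inside the measure

def positional_embedding(measure_idx, track_idx, note_idx):
    # Sum the three per-axis embeddings into one vector per token.
    return measure_emb(measure_idx) + track_emb(track_idx) + note_emb(note_idx)

# Example: a batch of one sequence with three tokens.
measures = torch.tensor([[0, 0, 1]])
tracks = torch.tensor([[0, 1, 0]])
notes = torch.tensor([[0, 0, 0]])
print(positional_embedding(measures, tracks, notes).shape)  # torch.Size([1, 3, 256])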
They also proposed Music BPE to tokenize symbolic music with its extra dimensions; it builds on the BPE (byte-pair encoding) algorithm from NLP.
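The sketch below shows the core loop of standard BPE on a toy pitch-token sequence. This is the generic algorithm only; the paper's Music BPE adapts it to sets of simultaneous notes, which is not reproduced here.

from collections import Counter

def bpe_merges(tokens, n_merges=2):
    # Repeatedly merge the most frequent adjacent pair into a new token.
    tokens = list(tokens)
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + "+" + b)  # the new, merged symbol
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

seq = ["C4", "E4", "G4", "C4", "E4", "G4", "B3"]
print(bpe_merges(seq))  # the frequent C-major triad collapses into one symbol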
The examples can be found on its official website.