Symbolic Music Generation Using Deep Neural Networks
Last Updated on June 4, 2024 by Editorial Team
Author(s): Xavier Yuhan Liu
Originally published on Towards AI.
How is music encoded in deep learning models? What datasets are available? What are the well-known models in this area? In this story, I will walk you through all of these.
What's the point of generating music?
You've probably heard of ChatGPT and MidJourney; they're widely known AI systems. However, most of the popular models generate text and images.
Models that generate music, on the other hand, are still in development and haven't achieved the same level of success.
As the American poet Henry Wadsworth Longfellow once said, "Music is the universal language of mankind." Music is vital to our lives.
So, using AI to help with music creation would be beneficial. This story dives into how AI is used in symbolic music creation.
What's symbolic music?
To begin with, let's define symbolic music.
Unlike nonsymbolic music, which is just a sound wave, symbolic music represents music with musical notation. Pitch, duration, key, and so on can be represented by specific notations in a score.
Nonsymbolic music contains none of these. Your MP3 and WAV files hold only sound waves, with no musical notation attached.
Famous music models such as Suno are trained on nonsymbolic (audio) datasets, partly due to the lack of notated (symbolic) music datasets. One well-known model trained on nonsymbolic music data is Jukebox from OpenAI.
What are the main methods used currently?
The three main methods for generating symbolic music are rule-based systems, neural networks, and Bayesian networks.
In this story, we'll mainly focus on neural networks because all the current widespread AI models (e.g., Stable Diffusion and GPT-4) are built with deep neural networks. Deep learning is very popular, and researchers have already used it to train music generation models.
Before diving into neural networks, let's briefly look at the other two methods to understand symbolic music generation better.
Rule-based
The rule-based method involves writing a lot of if-then statements. Developers first define a set of patterns and let the software compose a song by combining those patterns according to specific rules.
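As a toy illustration (a minimal Python sketch, not taken from any real system), a rule-based generator might look like this:

import random

SCALE = [60, 62, 64, 65, 67, 69, 71, 72]  # C major scale as MIDI pitch numbers

def next_pitch(prev_pitch):
    # Pick the next pitch using simple hand-written rules.
    if prev_pitch is None:
        return 60  # rule: always start on the tonic (middle C)
    if prev_pitch == 71:
        return 72  # rule: the leading tone resolves up to the tonic
    candidates = [p for p in SCALE if abs(p - prev_pitch) <= 4]  # rule: avoid large leaps
    return random.choice(candidates)

def compose(n_notes=16):
    melody, prev = [], None
    for _ in range(n_notes):
        prev = next_pitch(prev)
        melody.append(prev)
    melody[-1] = 60  # rule: end on the tonic
    return melody

print(compose())

Every musical decision here is hard-coded; there is no learning involved, which is both the strength (full control) and the weakness (limited variety) of this approach.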
Bayesian networks
Bayesian networks can also be used to model the probability distributions of musical elements and the transition probabilities between them. By learning from a large amount of music data, a Bayesian network can capture patterns and correlations between musical elements.
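The simplest member of this family is a first-order Markov chain over pitches. The sketch below (a toy Python example, not a full Bayesian network) learns transition probabilities from a tiny corpus and samples a new melody from them:

import random
from collections import defaultdict

def learn_transitions(melodies):
    # Count how often each pitch follows another, then normalize to probabilities.
    counts = defaultdict(lambda: defaultdict(int))
    for melody in melodies:
        for a, b in zip(melody, melody[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def sample(transitions, start, length=8):
    # Generate a melody by repeatedly sampling the next pitch.
    melody = [start]
    for _ in range(length - 1):
        nxt = transitions.get(melody[-1])
        if not nxt:
            break
        pitches, probs = zip(*nxt.items())
        melody.append(random.choices(pitches, weights=probs)[0])
    return melody

corpus = [[60, 62, 64, 62, 60], [60, 64, 67, 64, 60]]  # toy training data
model = learn_transitions(corpus)
print(sample(model, start=60))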
Now, let's discuss deep neural networks in symbolic music generation.
Deep neural networks
In fact, unlike the image and text generation fields, where new architectures such as GANs and Transformers were born, the music generation field mostly reuses backbone models from elsewhere. You can find music generation models built on LSTMs, GANs, and Transformers.
So the difference between music generation models isn't mainly in the backbone architecture. The real difference lies in encoding: how can we represent music so that AI systems can understand it well?
Representation/Encoding
In everyday practice, we have three common ways to represent music on a computer: MIDI, MusicXML, and piano roll. After the Transformer came out, researchers also proposed many token-based encodings for music.
MIDI
MIDI, the format used most often in the modern music industry, is the standard way to store symbolic music. It includes almost everything: the pitch, the duration, the instrument, and many other detailed parameters used in making music.
It does store a lot of information, but that is also the problem: raw MIDI is hard to use directly for training a model because there is too much low-level detail.
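For example, with the third-party pretty_midi library (one common choice among several), the note-level information can be pulled out of a MIDI file roughly like this; the file name is a placeholder:

import pretty_midi

midi = pretty_midi.PrettyMIDI("example.mid")  # load a MIDI file (placeholder path)

for instrument in midi.instruments:
    name = pretty_midi.program_to_name(instrument.program)
    for note in instrument.notes:
        # Each note carries pitch, onset/offset times (seconds), and velocity (loudness).
        print(name, note.pitch, note.start, note.end, note.velocity)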
MusicXML
Here is an example from Wikipedia.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE score-partwise PUBLIC
    "-//Recordare//DTD MusicXML 4.0 Partwise//EN"
    "http://www.musicxml.org/dtds/partwise.dtd">
<score-partwise version="4.0">
  <part-list>
    <score-part id="P1">
      <part-name>Music</part-name>
    </score-part>
  </part-list>
  <part id="P1">
    <measure number="1">
      <attributes>
        <divisions>1</divisions>
        <key>
          <fifths>0</fifths>
        </key>
        <time>
          <beats>4</beats>
          <beat-type>4</beat-type>
        </time>
        <clef>
          <sign>G</sign>
          <line>2</line>
        </clef>
      </attributes>
      <note>
        <pitch>
          <step>C</step>
          <octave>4</octave>
        </pitch>
        <duration>4</duration>
        <type>whole</type>
      </note>
    </measure>
  </part>
</score-partwise>
Here, the music score is broken down into XML format. As we can see, the time signature is 4/4, and we have a whole note C.
The format is also very detailed and popular for storing music on modern computers.
But there is also one problem: there are too many tags. So it's also unsuitable to train a model directly on MusicXML text.
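Still, because MusicXML is plain XML, the example above can be read with Python's standard library. This sketch (assuming the snippet is saved as score.xml) pulls out the pitch and duration of each note:

import xml.etree.ElementTree as ET

tree = ET.parse("score.xml")  # the MusicXML snippet shown above
for note in tree.getroot().iter("note"):
    step = note.findtext("pitch/step")      # e.g., "C"
    octave = note.findtext("pitch/octave")  # e.g., "4"
    duration = note.findtext("duration")    # in <divisions> units
    note_type = note.findtext("type")       # e.g., "whole"
    print(f"{step}{octave}: duration={duration}, type={note_type}")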
Piano roll
The piano roll is used very often for editing scores in modern music production.
As we can see in a typical piano-roll view, each grid cell represents a time step, perhaps 1/64 of a bar. Each green line is a note, and the notes have different start times, pitches, and durations.
The piano roll can easily be converted into matrices. If we have multiple instruments (tracks), we will have a separate score for each, and each score can be seen as a single piano roll.
The limitation of the piano-roll representation is that it is too sparse for training models. The best-known work using piano rolls is MuseGAN.
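A minimal sketch of the conversion (using NumPy, with a made-up resolution of 16 time steps per bar) shows why: almost every cell of the resulting matrix stays zero.

import numpy as np

# Each note: (pitch, start_step, duration_in_steps) -- toy data for one bar.
notes = [(60, 0, 4), (64, 4, 4), (67, 8, 8)]

n_pitches, n_steps = 128, 16
roll = np.zeros((n_pitches, n_steps), dtype=np.uint8)

for pitch, start, duration in notes:
    roll[pitch, start:start + duration] = 1  # mark the cells the note occupies

print(roll.shape)              # (128, 16)
print(roll.sum() / roll.size)  # fraction of non-zero cells: tiny, i.e., very sparse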
Tokens
Tokens are the representation used most often these days. In NLP tasks, we convert words into tokens, so when we want to apply NLP architectures such as LSTMs and Transformers to music, converting music scores into tokens is natural and suitable.
Here, we convert music notations such as bars, beats, chords, tempos, and pitches into tokens; we can then handle them like regular text tokens, as sketched after the comparison below.
Compared with natural text tokenization, music tokenization differs in these ways:
- One note can span more than one position, while one word takes only one position.
- At a single position, there may be multiple notes (e.g., a chord), while in text there is only one word per position.
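As a minimal illustration (in the spirit of REMI-style event tokens; the event names and vocabulary here are invented for the sketch, not taken from a specific paper), notes can be flattened into a token sequence like this:

def tokenize(notes, steps_per_bar=16):
    # Convert (pitch, start_step, duration) notes into a flat token sequence.
    tokens = []
    current_bar = -1
    for pitch, start, duration in sorted(notes, key=lambda n: n[1]):
        bar, position = divmod(start, steps_per_bar)
        if bar != current_bar:
            tokens.append("Bar")                 # a new bar begins
            current_bar = bar
        tokens.append(f"Position_{position}")    # where in the bar the note starts
        tokens.append(f"Pitch_{pitch}")          # which key is played
        tokens.append(f"Duration_{duration}")    # how long it lasts
    return tokens

notes = [(60, 0, 4), (64, 4, 4), (67, 16, 8)]
print(tokenize(notes))
# ['Bar', 'Position_0', 'Pitch_60', 'Duration_4', 'Position_4', 'Pitch_64',
#  'Duration_4', 'Bar', 'Position_0', 'Pitch_67', 'Duration_8']

Notice how a chord would simply produce several Pitch tokens at the same Position, and a long note spans many time steps while still costing only a few tokens.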
Advanced Models
MuseGAN
MuseGAN is a GAN-based network for generating music. It takes piano rolls as input. The authors propose several generator configurations:
- Composer model: one "z" and one "G" before the bar generator, as if a single composer is in charge of the whole band.
- Jamming model: multiple "z"s and multiple "G"s, one per track, as if several players are jamming together.
- Hybrid model: it combines the advantages of the previous two. A shared "z" is in charge of all the tracks, while each track also gets its own "z".
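To make the difference concrete, here is a heavily simplified PyTorch sketch of how the latent vectors could be wired in the three settings. The layer sizes and dense layers are invented for illustration; the real MuseGAN generators are convolutional and far more elaborate.

import torch
import torch.nn as nn

N_TRACKS, Z_DIM, BAR_DIM = 4, 32, 128  # made-up sizes for illustration

def make_generator(in_dim):
    # Stand-in for a per-track bar generator "G".
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, BAR_DIM))

# Jamming: each track has its own z and its own generator.
jam_generators = [make_generator(Z_DIM) for _ in range(N_TRACKS)]
jam_bars = [g(torch.randn(1, Z_DIM)) for g in jam_generators]

# Composer: a single shared z drives one generator that outputs all tracks at once.
composer = nn.Sequential(nn.Linear(Z_DIM, 256), nn.ReLU(),
                         nn.Linear(256, N_TRACKS * BAR_DIM))
composer_bars = composer(torch.randn(1, Z_DIM)).view(1, N_TRACKS, BAR_DIM)

# Hybrid: each track's generator sees both a shared z and a track-specific z.
hybrid_generators = [make_generator(2 * Z_DIM) for _ in range(N_TRACKS)]
z_shared = torch.randn(1, Z_DIM)
hybrid_bars = [g(torch.cat([z_shared, torch.randn(1, Z_DIM)], dim=1))
               for g in hybrid_generators]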
The results can be found on its official website.
SymphonyNet
SymphonyNet is a Transformer-based network trained on tokens. Researchers from the Central Conservatory of Music proposed a new type of tokenization for it.
According to the authors:
We propose a novel Multi-track Multi-instrument Repeatable (MMR) representation for symphonic music and model the music sequence using a Transformer-based auto-regressive language model with specific 3-D positional embedding.
The 3D space has three dimensions: measure, track, and note. Compressing them into 1D enables Transformer models to learn from the music data.
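A rough sketch of the idea in PyTorch (the embedding sizes and the choice to sum the three embeddings are assumptions for illustration, not the paper's exact implementation):

import torch
import torch.nn as nn

D_MODEL = 256  # embedding width, chosen arbitrarily for this sketch

# One learnable embedding table per axis of the 3-D position.
measure_emb = nn.Embedding(512, D_MODEL)   # which measure the token belongs to
track_emb = nn.Embedding(32, D_MODEL)      # which track (instrument) it belongs to
note_emb = nn.Embedding(1024, D_MODEL)     # index of the note inside the measure

def positional_embedding(measure_idx, track_idx, note_idx):
    # Sum the three per-axis embeddings into one vector per token.
    return measure_emb(measure_idx) + track_emb(track_idx) + note_emb(note_idx)

# Example: a batch of one sequence with three tokens.
measures = torch.tensor([[0, 0, 1]])
tracks = torch.tensor([[0, 1, 0]])
notes = torch.tensor([[0, 0, 0]])
print(positional_embedding(measures, tracks, notes).shape)  # torch.Size([1, 3, 256])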
They also proposed Music BPE to tokenize symbolic music with its extra dimensions; it builds on the BPE (byte-pair encoding) algorithm from NLP.
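The sketch below shows the core loop of standard BPE on a toy pitch-token sequence. This is the generic algorithm only; the paper's Music BPE adapts it to sets of simultaneous notes, which is not reproduced here.

from collections import Counter

def bpe_merges(tokens, n_merges=2):
    # Repeatedly merge the most frequent adjacent pair into a new token.
    tokens = list(tokens)
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + "+" + b)  # the new, merged symbol
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

seq = ["C4", "E4", "G4", "C4", "E4", "G4", "B3"]
print(bpe_merges(seq))  # the frequent C-major triad collapses into one symbol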
The examples can be found on its official website.