Symbolic Music Generation Using Deep Neural Networks

Last Updated on June 4, 2024 by Editorial Team

Author(s): Xavier Yuhan Liu

Originally published on Towards AI.

How is music encoded for deep learning models? What datasets are available? What are the well-known models in this area? In this story, I will walk you through all of these questions.

Photo by Possessed Photography on Unsplash

What's the point of generating music?

You've probably heard of ChatGPT and Midjourney. They're widely known AI systems. However, most of the popular models generate text and images.

On the other hand, models that generate music are still in development and haven't achieved the same level of success.

As the American poet Henry Wadsworth Longfellow once said, "Music is the universal language of mankind." Music is vital in our lives.

So, using AI to help with music creation would be beneficial. This story dives into how AI is used in symbolic music generation.

What's symbolic music?

To begin with, let's define symbolic music.

Unlike non-symbolic music, which is just a sound wave, symbolic music represents music with notation: pitch, duration, key, and so on can all be written down as specific symbols in a score.

Non-symbolic music contains none of these. Only sound waves exist in your MP3 and WAV files; there is no musical notation there.

Famous music models such as Suno are trained on non-symbolic (audio) datasets, partly due to the lack of large notated (symbolic) music datasets. Another well-known model using non-symbolic music data is Jukebox from OpenAI.

What are the main methods used currently?

The three main methods for generating symbolic music are rule-based systems, neural networks, and Bayesian networks.

In this story, we'll mainly focus on neural networks, because the currently widespread AI models (e.g., Stable Diffusion and GPT-4) are built with deep neural networks. Deep learning is very popular, and researchers have already used it to train music generation models.

Before diving into this method, I want to discuss the other two methods to understand symbolic music generation better.

Rule-based

The rule-based method involves writing many if-then statements. Developers first define musical patterns and then let the software assemble a song from those patterns according to specific rules.
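As a rough illustration of the if-then style (the scale and rules below are invented for demonstration, not taken from any real system):

import random

# A toy rule-based melody generator; the rules are deliberately simplistic.
C_MAJOR = ["C", "D", "E", "F", "G", "A", "B"]

def next_note(prev):
    """Pick the next note using simple hand-written if-then rules."""
    i = C_MAJOR.index(prev)
    if prev == "B":                # rule: the leading tone resolves to C
        return "C"
    if prev in ("F", "A"):         # rule: step down after F or A
        return C_MAJOR[i - 1]
    return random.choice([C_MAJOR[(i + 1) % 7], C_MAJOR[i - 2]])  # step up or skip down

melody = ["C"]
for _ in range(15):
    melody.append(next_note(melody[-1]))
print(" ".join(melody))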

Bayesian networks

Bayesian networks can also model the probability distributions of musical elements and the transition probabilities between them. By learning from large amounts of music data, they can capture patterns and correlations among musical elements.
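To make the transition-probability idea concrete, here is a deliberately simplified sketch: a first-order Markov chain rather than a full Bayesian network, trained on an invented note sequence:

import random
from collections import Counter, defaultdict

# Toy training data: an invented melody, just for illustration.
corpus = "C D E C C D E C E F G E F G".split()

# Count transitions between consecutive notes.
transitions = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    transitions[prev][cur] += 1

def sample_next(note):
    """Sample the next note in proportion to the learned transition counts."""
    options = transitions[note]
    return random.choices(list(options), weights=list(options.values()))[0]

melody = ["C"]
for _ in range(11):
    melody.append(sample_next(melody[-1]))
print(" ".join(melody))

A real Bayesian network would condition on richer variables (key, chord, meter), but the core idea of learning probabilities from data is the same.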

Now, let's discuss deep neural networks in symbolic music generation.

Deep neural networks

In fact, unlike the image and text generation fields, where new model architectures such as GANs and Transformers were born, the music generation field rides on the backbones of other domains. You can find music generation models built on LSTMs, GANs, and Transformers.

So, the backbone architectures of music generation models do not differ much. The real difference lies in the encoding: how can we represent music properly so that AI systems can understand it well?

Representation/Encoding

In everyday practice, there are three ways to represent music on a computer: MIDI, MusicXML, and piano roll. After the Transformer came out, researchers also proposed many token-based encodings for music.

MIDI

Photo by Caught In Joy on Unsplash

MIDI, the dominant format for storing symbolic music, is the one used most often in the modern music industry. It includes almost everything: pitch, duration, the expressive parameters of each instrument, and many other details used in making music.

It does store a lot of information. But that is also why raw MIDI is hard to train models on directly: it carries too much low-level detail.
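For a concrete feel of what a MIDI file contains, here is a minimal sketch using the pretty_midi library (the file name is a placeholder):

import pretty_midi

# Load a MIDI file; "song.mid" is a placeholder path.
pm = pretty_midi.PrettyMIDI("song.mid")

for instrument in pm.instruments:
    print(instrument.name, "program:", instrument.program)
    for note in instrument.notes[:5]:  # first few notes of each track
        # Each note carries pitch, velocity (loudness), and start/end times.
        print(f"  pitch={note.pitch} velocity={note.velocity} "
              f"start={note.start:.2f}s end={note.end:.2f}s")

Even this small dump shows how much performance detail MIDI carries beyond the notes themselves.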

MusicXML

Here is an example from Wikipedia.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE score-partwise PUBLIC
    "-//Recordare//DTD MusicXML 4.0 Partwise//EN"
    "http://www.musicxml.org/dtds/partwise.dtd">
<score-partwise version="4.0">
  <part-list>
    <score-part id="P1">
      <part-name>Music</part-name>
    </score-part>
  </part-list>
  <part id="P1">
    <measure number="1">
      <attributes>
        <divisions>1</divisions>
        <key>
          <fifths>0</fifths>
        </key>
        <time>
          <beats>4</beats>
          <beat-type>4</beat-type>
        </time>
        <clef>
          <sign>G</sign>
          <line>2</line>
        </clef>
      </attributes>
      <note>
        <pitch>
          <step>C</step>
          <octave>4</octave>
        </pitch>
        <duration>4</duration>
        <type>whole</type>
      </note>
    </measure>
  </part>
</score-partwise>
The representation of middle C on the treble clef, created using MusicXML code. | Image from Wikipedia

Here, we break a music score down into XML format. As we can see, the time signature is 4/4, and we have a whole note on C.

The format is also very detailed and popular for storing music on modern computers.

But there is one problem: the markup is verbose, with too many tags. So it's also unsuitable to train a model directly on MusicXML text.
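As a small sketch, the example above can be parsed with Python's standard library to recover the note (the file name is a placeholder):

import xml.etree.ElementTree as ET

# Parse the MusicXML example shown above; "middle_c.musicxml" is a placeholder path.
tree = ET.parse("middle_c.musicxml")

for note in tree.getroot().iter("note"):
    step = note.findtext("pitch/step")      # "C"
    octave = note.findtext("pitch/octave")  # "4"
    kind = note.findtext("type")            # "whole"
    print(f"{kind} note: {step}{octave}")

Notice how many tags it takes to express a single whole note, which is exactly the verbosity problem described above.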

Piano roll

Piano Roll UI of FL Studio

It's used very often in editing scores in modern music production.

As we can see, each grid cell represents a time step, perhaps 1/64 of a bar. Each green bar is a note, with its own start time, pitch, and duration.

Slides from MuseGAN

The piano roll can easily be converted into matrices. If we have several instruments (tracks), we have one score per track, and each score can be seen as a single piano roll.
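Here is a minimal sketch of that conversion (the note list is invented for illustration; pitches are MIDI numbers and times are in grid steps):

import numpy as np

# Invented toy notes: (midi_pitch, start_step, duration_steps).
notes = [(60, 0, 4), (64, 4, 2), (67, 6, 2)]  # C4, E4, G4

# One binary matrix per track: rows = 128 MIDI pitches, columns = time steps.
n_steps = 8
roll = np.zeros((128, n_steps), dtype=np.uint8)
for pitch, start, dur in notes:
    roll[pitch, start:start + dur] = 1

print(roll.shape)  # (128, 8)
print(roll[60])    # the C4 row: [1 1 1 1 0 0 0 0]

Note how almost every entry of the 128 x 8 matrix is zero, which previews the sparsity problem discussed next.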

The limitation of the piano roll representation is that it is too sparse to train models on efficiently. The best-known work using piano rolls is MuseGAN.

Tokens

This is the representation used most often these days. In NLP tasks, we convert words into tokens; so when we want to apply NLP architectures such as LSTMs and Transformers to music, converting scores into tokens is the natural choice.

REMI Representation from the paper

Here, we convert musical notations (bars, beats, chords, tempos, pitches, etc.) into tokens. We can then treat them like regular text tokens.

Compared with natural text tokenization, music tokenization differs in these ways (a short sketch follows the list):

  • One note can take more than one position, while one word only takes one position.
  • In one position, there may be multiple notes, while there is only one word in one position.
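A short REMI-style sequence makes both points concrete. The token names below are simplified for illustration, loosely following the REMI figure above, and are not the paper's exact vocabulary:

# A simplified, REMI-style token sequence (token names are illustrative).
tokens = [
    "Bar",
    "Position_1/16", "Tempo_120",
    "Position_1/16", "Pitch_60", "Velocity_20", "Duration_4",  # one note spans several tokens
    "Position_5/16", "Pitch_64", "Velocity_20", "Duration_2",
    "Position_5/16", "Pitch_67", "Velocity_20", "Duration_2",  # two notes share one position
    "Bar",
]

# Map tokens to integer IDs, exactly as in text NLP pipelines.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]
print(ids)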

Advanced Models

MuseGAN

MuseGAN is a GAN that generates music. It takes piano rolls as input and comes in several different variants.

Three models of MuseGAN | Image from the paper
  • Composer Model: one 'z' and one 'G' before the bar generator. This is like a single composer being in charge of the whole band.
  • Jamming Model: multiple 'z's and multiple 'G's. It's like several players jamming together.
  • Hybrid Model: it combines the advantages of the previous two models, with one shared 'z' coordinating all the tracks.

The results can be found on its official website.

SymphonyNet

SymphonyNet is a Transformer-based network trained on tokens. Researchers from the Central Conservatory of Music proposed a new type of tokenization for it.

According to the authors:

We propose a novel Multi-track Multi-instrument Repeatable (MMR) representation for symphonic music and model the music sequence using a Transformer-based auto-regressive language model with specific 3-D positional embedding.

Tokenization | Image from the Official GitHub Repo

The 3D space has three dimensions: measure, track, and note. Compressing them into 1D enables Transformer models to learn from the music data.
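A rough sketch of what such a 3-D positional embedding could look like follows. This is my illustrative reconstruction, not the official SymphonyNet code; the vocabulary sizes and model dimension are arbitrary:

import torch
import torch.nn as nn

# Each token has (measure, track, note) coordinates; the three embeddings are summed.
d_model = 64
measure_emb = nn.Embedding(256, d_model)  # which measure in the piece
track_emb = nn.Embedding(16, d_model)     # which instrument/track
note_emb = nn.Embedding(128, d_model)     # note index inside the measure

def positional_embedding(measure, track, note):
    """Sum the three positional embeddings for a flattened 1-D token sequence."""
    return measure_emb(measure) + track_emb(track) + note_emb(note)

# Three tokens whose 3-D coordinates have been flattened into parallel 1-D arrays.
measure = torch.tensor([0, 0, 1])
track = torch.tensor([0, 1, 0])
note = torch.tensor([0, 0, 0])
print(positional_embedding(measure, track, note).shape)  # torch.Size([3, 64])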

They also proposed Music BPE to tokenize symbolic music. It adapts the BPE (byte-pair encoding) algorithm to music.
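For intuition, here is a bare-bones sketch of the classic BPE merge step that Music BPE builds on, applied to note names instead of characters (this is not the actual Music BPE implementation):

from collections import Counter

# Toy corpus of note sequences; BPE repeatedly merges the most frequent pair.
sequences = [["C", "E", "G"], ["C", "E", "G"], ["C", "E", "A"], ["D", "F", "A"]]

def most_frequent_pair(seqs):
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seqs, pair):
    merged = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + "+" + seq[i + 1])  # new merged token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(sequences)  # ('C', 'E') appears three times
print(merge_pair(sequences, pair))

In Music BPE, the merged units are multi-note events rather than character pairs, which helps handle the simultaneous notes that make music higher-dimensional than text.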

The examples can be found on its official website.


Published via Towards AI
