This AI will extract vocals and instruments from any audio

Artificial intelligence and machine learning are the core of LALAL.AI, a new tool able to extract and separate vocals and instruments from any audio file

LALAL.AI extracts vocals and instruments from any audio

Artificial intelligence and machine learning are becoming more and more common in music production, helping creatives streamline and speed up everyday mechanical and tedious processes. With the help of a unique neural network trained on 20 TB of data, LALAL.AI is able to extract and separate vocals and instruments from audio files.

But let’s take a step back to understand what artificial intelligence and machine learning are and how they can help the music industry.

What are artificial intelligence and machine learning?

Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. Because human intelligence can be applied to a practically unlimited number of fields, artificial intelligence is nowadays developed for specific tasks such as language recognition, image recognition, audio recognition, and more.

In this ocean of different AIs, machine learning (ML) is a type of artificial intelligence that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values.

These two things put together are the core of the vocal remover by LALAL.AI. In 2020, the team developed a unique neural network called Rocknet, using 20 TB of training data to extract instrumentals and voice tracks from songs. In 2021, they created Cassiopeia, a next-generation solution superior to Rocknet that delivers improved splitting results with significantly fewer audio artifacts.

Throughout 2021, LALAL.AI was enhanced with the ability to extract various musical instruments from audio and video sources.

So, what’s LALAL.AI?

LALAL.AI is a next-generation vocal remover and music source separation service that allows for quick, simple, and accurate stem extraction. It can separate vocal, instrumental, drums, bass, piano, electric guitar, acoustic guitar, and synthesizer tracks without sacrificing quality.

The service is pay-as-you-go. There is a free tier with 10 minutes of conversion and a maximum file size of 50 MB in MP3, WAV, or OGG. If you want faster and longer conversions, LALAL.AI offers two paid plans. The better value at the moment is the Plus Pack, which costs €30 instead of €50, includes 300 minutes of audio conversion, and supports MP3, OGG, WAV, FLAC, AVI, MP4, MKV, AIFF, and AAC. The less expensive option is the Lite Pack, with 90 minutes of conversion for €15. Both packs include Fast Processing Queue and Batch Upload.
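To compare the two paid packs, it helps to work out the price per minute of conversion. The short Python sketch below uses only the prices and minutes quoted above; the names `packs`, `cost_per_minute`, and `rates` are ours and purely illustrative.

```python
# Per-minute cost of the two paid packs, using the figures quoted in this article.
# The dictionary layout and function name are illustrative, not part of LALAL.AI.
packs = {
    "Plus Pack": {"price_eur": 30, "minutes": 300},
    "Lite Pack": {"price_eur": 15, "minutes": 90},
}


def cost_per_minute(price_eur, minutes):
    """Return the price in euros for one minute of conversion."""
    return price_eur / minutes


rates = {
    name: cost_per_minute(pack["price_eur"], pack["minutes"])
    for name, pack in packs.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.3f} EUR per minute")
```

At the discounted price, the larger pack works out to roughly 0.10 EUR per minute, against about 0.17 EUR per minute for the smaller one.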

How does LALAL.AI work?

The usage of the LALAL.AI tool is as simple as uploading a file on the web. You head to the website, register an account, and hit SELECT FILE, choosing the audio or video you want to use.

We tested the tool with the brand new Swedish House Mafia album, Paradise Again. For our first test, we chose Another Minute in 24-bit hi-res FLAC. The file is 110 MB, the best quality you can purchase online. With our 1 Gbps internet connection, the upload took less than 10 seconds, and the tool took another 10 seconds to generate the preview. We chose the “vocal and instrumental” separation with the brand new Phoenix algorithm. If you like the preview, you can click the “Processing” button and wait for the algorithm to do its job.

Already in the preview, the vocal splitting was absolutely insane. It separated the vocal from the instrumental almost perfectly. The instrumental-only file has some artifacts, but they are really nothing compared to the overall quality of the conversion.

It's not only Vocal and Instrumental: LALAL.AI also lets you extract Drums, Bass, Electric Guitar, Acoustic Guitar, Piano, and even Synthesizer. At the moment, only Vocal and Instrumental and Drums use the new, improved Phoenix algorithm.

LALAL.AI for businesses and API

The tool can also be used without the web interface, through an API. LALAL.AI lets you integrate its services into your app or website, either using its network directly or deploying on your own infrastructure.

The Phoenix algorithm, an evolution of Audio Source Separation

From Rocknet to Cassiopeia and then Phoenix, the new stage of audio separation. The team at LALAL.AI built the new algorithm on three pillars:

  1. Input signal processing method.
  2. Architectural improvements.
  3. Separation quality evaluation methods.

When the neural network processes an audio file, it divides it into segments and “observes” each one individually. The main difference in the first group is an increase in the amount of data that the network “observes” at one time in order to figure out the composition of the instruments and isolate the necessary ones, such as the voice or drums. The segment length for Cassiopeia is one second, while it is eight seconds for Phoenix. Phoenix can recognize the instruments that make up the composition and the characteristics of the sought-after source better because it “observes” more data.
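To make the difference between one-second and eight-second segments concrete, here is a minimal Python sketch that cuts a clip into fixed-length chunks. It is only an illustration: the function name `split_into_segments`, the 44.1 kHz sample rate, and the 16-second clip are our assumptions, and the real networks may well use overlapping windows or a different signal representation.

```python
def split_into_segments(samples, sample_rate, segment_seconds):
    """Split a list of samples into consecutive fixed-length segments.

    The last segment may be shorter if the clip length is not a multiple
    of the segment length. No overlap is used in this simplified sketch.
    """
    segment_len = int(sample_rate * segment_seconds)
    if segment_len <= 0:
        raise ValueError("segment length must be at least one sample")
    return [
        samples[start:start + segment_len]
        for start in range(0, len(samples), segment_len)
    ]


sample_rate = 44_100   # assumed CD-quality sample rate
clip_seconds = 16      # hypothetical clip length
samples = [0.0] * (sample_rate * clip_seconds)  # silent placeholder audio

one_second = split_into_segments(samples, sample_rate, 1)    # Cassiopeia-style
eight_second = split_into_segments(samples, sample_rate, 8)  # Phoenix-style

print(len(one_second), "segments of", len(one_second[0]), "samples")
print(len(eight_second), "segments of", len(eight_second[0]), "samples")
```

With eight-second segments, each chunk the network looks at holds eight times as many samples, which is the extra context the paragraph above describes.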

The larger the array of observed data, the better the theoretically achievable separation quality. In practice, however, expanding the data segment increases network complexity, the time the network needs to perform a separation, and the time required to train it.

For Cassiopeia, for example, increasing the segment from 1 second to 8 seconds would make it impossible to train the network in a reasonable amount of time, and even if trained, the network would be so slow that users would have to wait dozens of minutes to get separated stems.

Phoenix’s architectural improvements alone enabled the team at LALAL.AI to increase the amount of observed data while also cutting the network’s runtime in half! This means that song processing takes half as long for users.

There are numerous improvements in the second group. For example, the team took new activation functions for neurons from computer vision and adapted them for audio processing. They also used more advanced normalization methods, which allowed them to better balance the network and make it more trainable.

The third group is most likely the most significant. Criteria are required to evaluate any solution, and the LALAL.AI team needed criteria to evaluate the quality of stem separation. The criteria are also required during neural network training, because the network needs to understand not only when it separates well and when it separates poorly, but also what it needs to do to separate better.
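To give an idea of what a separation quality criterion can look like, the Python sketch below computes a basic signal-to-distortion ratio (SDR), a metric commonly used in source separation research. We do not know which criteria LALAL.AI actually uses, so treat this as a generic example: the function `sdr_db` and the test signals are our own.

```python
import math


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in decibels.

    Compares the energy of the true stem with the energy of the error
    between the true stem and the separated estimate. Higher is better.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf    # perfect reconstruction
    if signal_energy == 0:
        return -math.inf   # silent reference, noisy estimate
    return 10 * math.log10(signal_energy / error_energy)


# Hypothetical "true" vocal stem: one second of a 440 Hz tone at 8 kHz.
reference = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]

good_estimate = [0.99 * x for x in reference]  # very close to the reference
poor_estimate = [0.5 * x for x in reference]   # half of the signal is missing

print(f"good estimate: {sdr_db(reference, good_estimate):.1f} dB")
print(f"poor estimate: {sdr_db(reference, poor_estimate):.1f} dB")
```

A score like this tells you whether one separation is better than another, which is the kind of signal a network needs during training.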

If you want to know more about the algorithm and its development, there’s a dedicated article on the LALAL.AI blog.

LALAL.AI Phoenix in numbers