Google researchers have created an AI that can generate minutes-long pieces of music from text prompts, and can even convert a whistled or hummed melody into other instruments, similar to how systems like DALL-E generate images from written prompts (via TechCrunch). The model is called MusicLM, and while you can’t play with it yourself, the company has uploaded a number of samples that it produced using the model.
The examples are impressive. There are 30-second snippets of what sound like real songs created from paragraph-long descriptions that prescribe a genre, mood, and even specific instruments, as well as five-minute pieces generated from a word or two like “melodic techno.” Perhaps my favorite is a demo of the ‘story mode,’ where the model is basically given a script to change between prompts. For example, this prompt:
electronic track played in a video game (0:00-0:15)
meditation song played next to a river (0:15-0:30)
Resulted in the audio you can listen to here.
It may not be for everyone, but I could totally see this being composed by a human (I’ve also listened to it dozens of times while writing this article). Also on the demo site are examples of what the model produces when asked to generate 10-second clips of instruments such as the cello or maracas (the latter example is one where the system does relatively poorly), 8-second clips of a certain genre, music that would suit a prison break, and even what a novice pianist would sound like versus an advanced one. The site also includes interpretations of phrases such as ‘futuristic club’ and ‘accordion death metal.’
MusicLM can even simulate human vocals, and while it seems to get the tone and overall sound of voices right, there’s a quality to them that’s definitely off. The best way I can describe it is that they sound grainy or staticky. That quality is not so clear in the example above, but I think this one illustrates it quite well.
That, by the way, is the result of asking it to make music that would play at a gym. You might also have noticed that the lyrics are nonsense, but in a way that you might not catch if you’re not paying attention, something like listening to someone sing in Simlish or that one song that’s meant to sound like English but isn’t.
I won’t pretend to know how Google got these results, but it has released a research paper explaining it in detail, if you are the type of person who would understand this figure:
AI-generated music has a long history stretching back decades; there are systems that have been credited with composing pop songs, copying Bach better than a human could in the ’90s, and accompanying live performances. One recent version uses the AI image generation engine Stable Diffusion to turn text prompts into spectrograms, which are then turned into music. The paper says MusicLM can outperform other systems in terms of quality and how closely it adheres to the caption, as well as in the fact that it can take in audio and copy the melody.
That last part is arguably one of the coolest demos the researchers have released. The site lets you play the input audio, where someone hums or whistles a tune, then shows you how the model reproduces it as an electronic synth lead, string quartet, guitar solo, etc. From the samples I listened to, it manages the task very well.
As with other forays into this type of AI, Google is being significantly more cautious with MusicLM than some of its peers have been with similar technology. “We have no plans to release models at this time,” the paper concludes, citing the risks of “potential misappropriation of creative content” (read: plagiarism) and possible cultural appropriation or misrepresentation.
It’s always possible that the technology will show up at some point in one of Google’s fun musical experiments, but for now the only people who can take advantage of the research are other people building musical AI systems. Google says it is publicly releasing a dataset containing about 5,500 music-text pairs, which could help train and evaluate other musical AIs.