Sixty-seven years ago, the mathematical genius Turing, convicted for his homosexuality, bit into an apple laced with cyanide and left this world. His fate was much like the artificial intelligence he described in his papers: not understood, yet far ahead of its time.

Turing was a tragic genius out of step with his time, and Li Yundi is another. He possessed the musical gifts of Apollo, yet was struck by Eros's arrow of desire. Like an old deer suddenly glowing with life, he burst into the rented room of the neighborhood's Daphne and, heedless of everything, composed a Dionysian hymn that was never his to write, letting the melody coil around the headboard, drift past the clouds, and vanish between the stave-like black iron gates of the Chaoyang District police station. In the end he finished a movement utterly different in style, yet with an endless aftertaste.

Turing and Li Yundi, artificial intelligence and composition. Two geniuses born into the wrong era, and one fresh and fascinating field: "AI composition". All of the melodies below were created by AI.

  • AI-Music-02

  • AI-Music-03

  • AI-Music-Long-01


## A Childhood Game

As a child, I was obsessed with a kind of number game: finding the pattern hidden in a pile of numbers.

For example: 1, 1, 2, 3, 5, 8, 13, __? The answer is 21. Every number satisfies the rule: "each number equals the sum of the two before it."
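The rule can be written as a one-line sketch (plain Python, purely illustrative):

```python
def next_term(seq):
    """Guess the next term under the rule: each term is the sum of the previous two."""
    return seq[-1] + seq[-2]

print(next_term([1, 1, 2, 3, 5, 8, 13]))  # 21
```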

"AI composition" is another pattern-finding game: look for regularities in the relationships between a series of notes, then predict the note that fits best next.

The melody an AI finally generates is determined jointly by the data, the data Tokenize scheme, the model architecture, the training, and the sampling.

  • Data determines style: someone who has only ever heard the nomos music of ancient Greece can hardly write a Dionysian hymn.

  • Data Tokenize: if notes are Lego bricks, then data Tokenize is the design of the most basic bricks.

  • A good model architecture determines the AI's capacity to learn: a gifted musician writes soul-stirring music with ease, while ordinary people like me need a hundred times the exposure (data) and effort (compute).

  • Training: the process by which the model learns and grows.

  • Sampling: the AI's prediction is a pile of probabilities (numbers); you can bluntly pick the most probable note, or treat the whole probability distribution as the basis for the choice.

## MIDI Music Data

Music is the art of sound flowing through time, forming a structure along the way.

Physically, sound is the propagation of a series of vibrations through the particles of a transmitting medium, and pleasing music is the combined effect of those particle vibrations.

Humans effortlessly sense those vibrations through eardrums and "musical cells", but AI has no human ears. The AI's world is virtual, and the smallest "particles" that make it up are numbers.

So a bridge is needed, spanning vibration and number, connecting the real and the virtual. That bridge was completed as early as 1983: MIDI.

MIDI stands for "Musical Instrument Digital Interface".

MIDI manages music in meticulous detail. It can independently and precisely control the four elements of a musical tone: pitch, duration, intensity, and timbre. It is also extremely storage-efficient: 200 hours of music takes only about 80 MB. What MIDI actually stores is just a set of instructions telling a keyboard, bass, drum kit and so on when and how to sound. Put simply, MIDI stores the mapping between points in time and notes — which dovetails with the note-guessing game above.
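A toy illustration of the idea that MIDI is just timed instructions — the dictionaries below mimic note-on/note-off messages and are not the real MIDI binary format:

```python
# A toy "MIDI track": each entry says WHEN to act and WHAT to do.
# Real MIDI encodes this as compact binary messages; these dicts are illustrative.
track = [
    {"time": 0.0, "event": "note_on",  "pitch": 60, "velocity": 80},  # middle C starts
    {"time": 0.5, "event": "note_off", "pitch": 60},                  # middle C ends
    {"time": 0.5, "event": "note_on",  "pitch": 64, "velocity": 72},  # E starts
    {"time": 1.0, "event": "note_off", "pitch": 64},
]

def notes_with_durations(track):
    """Pair note_on/note_off events into (pitch, start, duration) tuples."""
    starts, notes = {}, []
    for ev in track:
        if ev["event"] == "note_on":
            starts[ev["pitch"]] = ev["time"]
        else:
            start = starts.pop(ev["pitch"])
            notes.append((ev["pitch"], start, ev["time"] - start))
    return notes

print(notes_with_durations(track))  # [(60, 0.0, 0.5), (64, 0.5, 0.5)]
```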

Since MIDI fits so well, converting the vast amount of music on the internet into MIDI files should yield an ample data source. True — but converting MP3, WAV and other audio formats into MIDI is itself a major challenge, requiring transcription, synchronization, and melody and chord extraction.

Transcription technology has advanced rapidly in recent years, supplying digital music with plentiful, high-quality MIDI material. In 2018, transcription could only extract relatively simple parts from clean audio signals; just last month, the Google Magenta team's latest paper, [MT3: MULTI-TASK MULTITRACK][1], transcribed complex multi-part ensemble audio. (Transcription is not the focus of this article; interested readers are referred to [1] at the end.)

*The data used in this article comes from the Pop1K7 dataset by the aiLabs.tw team.*

## MIDI Tokenize

If notes are Lego bricks, then MIDI Tokenize is the design of the most basic bricks.

In music theory, a piece is built from elements such as bars, tempo, pitch, and duration. MIDI Tokenize defines the most basic "little blocks" for these elements, and every piece is an ordered combination of them.

One MIDI Tokenize scheme, called REMI[2], breaks MIDI apart along the following dimensions:

  • Bar/Position: divide the length of a bar evenly into 32 slots, each representing one position in the bar. These "little blocks" are numbered [1~32].

  • Tempo: from slow to fast, split evenly into 32 tempos. Range [33~64].

  • Pitch: each of the piano's 88 keys is its own pitch. Range [65~152].

  • Duration: from short to long, split evenly into 64 durations. Range [153~216].

  • Velocity: from light to heavy, split evenly into 64 intensities. Range [217~280].

  • Rest: split into 10 rest types. Range [281~291].

In this way, a melody can be represented as a sequence of numbers.
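Using the ranges listed above, a much-simplified REMI-style tokenizer can be sketched as a few offset calculations (the real REMI vocabulary is richer than this):

```python
# Map each musical element into the integer ranges listed above (simplified REMI sketch).
def tok_position(p):  # p in 1..32  -> tokens 1..32
    return p

def tok_tempo(t):     # t in 1..32  -> tokens 33..64
    return 32 + t

def tok_pitch(key):   # piano key 1..88 -> tokens 65..152
    return 64 + key

def tok_duration(d):  # d in 1..64  -> tokens 153..216
    return 152 + d

def tok_velocity(v):  # v in 1..64  -> tokens 217..280
    return 216 + v

# One note = an ordered run of "little blocks":
note = [tok_position(1), tok_tempo(10), tok_pitch(40), tok_duration(8), tok_velocity(30)]
print(note)  # [1, 42, 104, 160, 246]
```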

MIDI Tokenize maps the "disorderly" vibrations of the real world into an unbroken stream of numbers — the only "music" an AI can hear.

## The "AI Composition" Model

An AI model is a box full of neurons. Unlike human neurons, AI neurons are computer simulations. A human neuron passes a stimulus as current across synapses, while an AI neuron passes numbers through a simple formula: w·x + b.
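That formula can be sketched as a single simulated neuron (plain Python, illustrative only):

```python
def neuron(weights, inputs, bias):
    """One simulated neuron: the weighted sum w·x + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

print(neuron([0.5, 0.25], [2.0, 4.0], 1.0))  # 3.0
```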

A melody is turned into numbers by MIDI Tokenize. Thrown into the box, the numbers are received by neurons, passed in order through hundreds of millions of them, and new numbers come out. These new numbers can be converted back into a MIDI file: music composed by AI.

Much like human thinking, the AI looks for relationships between numbers across many dimensions. Take Li Yundi and a certain Mr. Wu: measured along the dimension of piano performance, they are oceans apart; but measured from the big iron gate of the Chaoyang District police station, they are next-door neighbors.


relation(Wu, Li) = x_Wu × x_Li + y_Wu × y_Li

Computing the relationship between two things means considering every dimension. The AI is like a cool-headed hunter with the hundred eyes of Argus, able to capture, across hundreds or thousands of dimensions, the subtle relationships flowing between note-numbers.

This ability rests on an important mechanism, from Google's famous paper Attention Is All You Need[3].

The core of the paper is the Attention mechanism, which imitates biological observation: an algorithmic model of how organisms, driven by external stimulus and internal experience, sharpen their perception of a local region. It is like how, when a person fixates on one concrete object, the irrelevant parts of the scene blur out.

Attention has since spread into AI algorithms across every field; even the unbeatable AIs in real-time strategy games — AlphaStar in StarCraft II, OpenAI Five in Dota 2 — are built on it.

The first core of the "AI composition" model is the Attention mechanism.

Suppose the AI captures relationships between numbers in 256 dimensions. Then every number has to be expanded into 256 dimensions.

For example, "Pitch 36" corresponds to the number 100, and the number 100 is expanded into 256 high-precision floating-point values, each representing the value of 100 in that dimension.

Likewise, throwing a group "Bar, Tempo, Pitch, Duration, Velocity" into the box yields 5 groups of 256 floats.

A key step of the Attention mechanism is to take each group of numbers and compute a linear-algebra dot product with every other group in turn, which effectively takes all 256 dimensions into account.

For example, to analyze the relationship between "Bar 1" and "Pitch 36", take the dot product of their expanded floats:


Bar · Pitch = 0.1 × 2.1 + … + 1.3 × 1.3

The resulting dot product captures, to a degree, the relationship between the two. This is a basic picture of how the AI captures pairwise relationships; Attention is precisely an organized combination of such relationship-capturing operations. Interested readers can consult the references.
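The whole dot-product story can be sketched as a miniature attention step — tiny 4-float vectors instead of 256, and none of a real Transformer's learned projections:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """For each query token, weight every value by how strongly
    the query·key dot product says the two tokens are related."""
    out = []
    d = len(keys[0])
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]  # relation strengths
        weights = softmax(scores)                          # normalize to probabilities
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Two tokens ("Bar 1", "Pitch 36"), each expanded into 4 floats instead of 256:
toks = [[0.1, 2.1, 0.0, 1.3], [2.1, 0.1, 1.3, 0.0]]
mixed = attention(toks, toks, toks)
print(len(mixed), len(mixed[0]))  # 2 4
```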

The second core of the model: Relative Attention.

It is a step up from the Attention mechanism. Attention can find relationships between notes, but music also demands attention to the periodicity and regularity of melody, so the relative positions of notes must be taken into account as well. This is a key step described in the paper [Music-Transformer][4].

Take continuing a composition from a given opening: with plain Attention, the generated melody easily "burns out" after about a minute, whereas Relative Attention can output melodies of almost unlimited length and great variation.

A 36-minute melody – AI-Music-Long-03

*For the Relative Attention implementation, see the source code link at the end of this article.*
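One common way to make attention position-aware, in the spirit of (but far simpler than) Music Transformer's actual skewing algorithm, is to add a bias that depends only on the distance between two tokens. The sketch below is a hypothetical simplification, not the paper's method:

```python
def relative_scores(scores, rel_bias):
    """Add a distance-dependent bias to an attention score matrix.
    rel_bias[d] is the bonus for tokens d steps apart (clipped to the
    available range) -- illustrative only, not the paper's exact method."""
    n = len(scores)
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            d = min(abs(j - i), len(rel_bias) - 1)
            row.append(scores[i][j] + rel_bias[d])
        out.append(row)
    return out

scores = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
bias = [0.5, 0.2, -0.1]  # same-position bonus, neighbor bonus, far penalty
print(relative_scores(scores, bias))
```

Because the bias is indexed by distance rather than absolute position, the same periodic preference applies everywhere in the sequence — which is what lets the melody keep its regularity however long it runs.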

The third core of the model: Compound Word Transformer.

The Compound Word Transformer[5] paper was published in January 2021. The model it proposes trains far more efficiently — 5 to 10 times faster than 2018's Music Transformer — and generates better-sounding melodies as well. (The music at the beginning of this article was generated with a Compound Word Transformer-based model.)

Its excellent performance comes from one key improvement: the serial input of REMI-style MIDI Tokenize becomes the parallel input of Compound Word-style MIDI Tokenize.

This parallel input approach has several significant benefits:

  • The sequence length after MIDI Tokenize is greatly reduced, which helps the model fit during training and generate longer melodies.

  • Different musical elements — pitch, duration, velocity, rest — can be given parameters of different sizes; for example, pitch can expand into 1024 dimensions while rest uses 256.

  • During training, the fit of each musical element can be monitored independently.

  • Once results are produced, different sampling strategies can be applied to pitch, duration, velocity, rest, and so on — for example, letting pitch and duration vary more while velocity varies less.

A quick explanation of how Compound Word Transformer supports parallel input: it uses a series of linear transformations to merge the parallel structure of the Tokenize result into a serial structure, hands that to Attention to capture relationships, and then applies linear transformations again to turn the serial result back into a parallel structure for output.
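The merge-attend-split idea can be sketched as follows; the `embed` and `linear` functions here are deterministic stand-ins for the learned transformations in the real model:

```python
# Toy sketch of Compound Word-style parallel input: each field
# (bar, tempo, pitch, duration, velocity) gets its own small embedding,
# the embeddings are concatenated and linearly mixed into ONE vector
# per time step, so five serial tokens collapse into a single parallel one.

def embed(token_id, dim):
    """Stand-in embedding: a deterministic toy vector (real models learn these)."""
    return [((token_id * (i + 1)) % 7) / 7.0 for i in range(dim)]

def linear(vec, out_dim):
    """Stand-in linear projection (real models learn the weights)."""
    return [sum(vec) / len(vec)] * out_dim  # crude mixing, illustrative only

def compound_step(fields, field_dim=4, model_dim=8):
    parts = [embed(tok, field_dim) for tok in fields]  # one embedding per field
    merged = [x for part in parts for x in part]       # concatenate all fields
    return linear(merged, model_dim)                   # project to model size

# One time step = (bar, tempo, pitch, duration, velocity) entering together:
step = compound_step((1, 42, 104, 160, 246))
print(len(step))  # 8
```

The payoff is visible in the shapes: five tokens that REMI would feed one at a time become a single `model_dim` vector, cutting sequence length by a factor of five.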

Compound Word Transformer is the model I eventually used. Since the paper is quite new and resources are scarce, I stumbled along the way; my implementation (a TensorFlow version) is open-sourced on GitHub.

## Training

To capture the golden fleece of Chrysomallus, the Argonauts had to keep defeating the dragon-tooth warriors sprung from the soil of Boeotia.

Training an AI model is an RPG of defeating dragon-tooth warriors over and over: every point of experience gained is an optimization of the connection weights among billions of neurons.

To train the "AI composition" model, beautiful melodies are used as the target output, and the AI adjusts its own parameters in reverse — back-propagation, in deep-learning terms.

There is always a gap between the music the AI outputs and the music in the training data; it is called LOSS. One goal of training is to make LOSS smaller — another dragon-tooth warrior to cut down.

LOSS is determined by billions of variables, so how does the AI know which direction reduces it?

In fact, it is simple: first establish the functional relationship between LOSS and those billions of variables.


LOSS = f(x_1, x_2, …, x_10000)

Take the partial derivative of LOSS with respect to each variable to learn the direction in which each parameter influences LOSS, then move every parameter a small step in the direction that decreases LOSS to obtain new parameters. This process, called a STEP in deep learning, is the model back-propagating through itself.
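One STEP can be sketched with numeric partial derivatives, using a toy two-parameter LOSS in place of billions:

```python
def loss(params):
    """Toy LOSS with its minimum at params == (3, -1)."""
    x, y = params
    return (x - 3) ** 2 + (y + 1) ** 2

def step(params, lr=0.1, eps=1e-6):
    """One STEP: estimate each partial derivative numerically,
    then nudge every parameter a little against the gradient."""
    grads = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grads.append((loss(bumped) - loss(params)) / eps)  # ~ dLOSS/dparam_i
    return [p - lr * g for p, g in zip(params, grads)]

p = [0.0, 0.0]
for _ in range(50):          # 50 STEPs of "defeating the dragon-tooth warrior"
    p = step(p)
print(loss(p) < 1e-3)        # LOSS has shrunk toward the minimum
```

Real frameworks compute these derivatives analytically via back-propagation rather than by bumping each parameter, but the update rule is the same idea.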

As training proceeds and STEP increases, the model's accuracy on the training set rises steadily, while accuracy on the validation set rises first and then falls — the model is slowly overfitting.

So a STEP checkpoint with the best effect should be chosen. In my experiments, the best-sounding music usually came from a checkpoint after the validation-set optimum. For example, my best model was at STEP 28500: melodies from an earlier checkpoint (STEP 10000) sound too thin, or "out of tune", while those from a later one (STEP 80000) sound "lacking in variation".

For example, three melodies continued from the same opening:

  • STEP 10000 (too thin)

  • STEP 28500 (just right)

  • STEP 80000 (lacking variation)

Of course, the accuracy marked by these final numbers only expresses how well the model fits the training set; it cannot measure how good the music feels, which is a subjective experience that differs from person to person.

The above is personal training experience, for reference only. Training details are explained in the source code.

## Sampling

What the AI produces is a probability distribution. Sampling is the process of selecting samples from a given probability distribution; for "AI composition", the sampling algorithm is the rule for picking notes out of the note probability distribution. The natural first thought is Greedy Search: always pick the most probable note.

In actual tests, Greedy Search led to a lack of variation in the music, even a never-ending monotone — the "stuck in loops" phenomenon.

Temperature Sampling is a great way out of the loop. The algorithm: divide the model's raw scores by a temperature T (a user-chosen parameter), then apply a softmax (compressing every value into (0, 1) with the sum equal to 1) to obtain a new probability distribution P_new, and finally sample a value from P_new.

A simple way to picture it is to imagine the probability distribution as an iceberg. As the temperature rises, the iceberg melts and the probabilities draw closer together; at extreme temperatures, all probabilities become identical. So the higher the temperature, the flatter the distribution; the lower the temperature, the more concentrated it is; at temperature 1, the distribution is unchanged.
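The iceberg analogy in code — a sketch of temperature scaling over a toy set of scores:

```python
import math

def temperature_softmax(logits, T):
    """Divide scores by T, then softmax: high T flattens, low T sharpens."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # subtract max for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
print([round(p, 3) for p in temperature_softmax(logits, 1.0)])   # the plain softmax
print([round(p, 3) for p in temperature_softmax(logits, 10.0)])  # iceberg melted: nearly uniform
print([round(p, 3) for p in temperature_softmax(logits, 0.1)])   # frozen solid: nearly one-hot
```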

With temperature, melody variability gains some room for adjustment. But it is still not enough: a long tail of very-low-probability tokens can, as a group, end up being selected, affecting the final result.

A further upgrade combines Top-p Sampling with Temperature Sampling — the sampling method I finally adopted.

The idea of Top-p is to discard the low-probability results at the tail — just as not every American citizen is a candidate in an American election. Setting the value of p determines how much of the tail is ignored: the higher p is, the less is ignored; when p is 1, all probabilities are kept and it is no different from plain Temperature Sampling. In practice, I set different T (temperature) and p values for different musical elements — for example, a higher T and p for Duration to give the melody more variation, and a lower T and p for Bar to give it stability.
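The combined method can be sketched as follows (a hypothetical minimal version; the real implementation lives in the source repository linked at the end):

```python
import math, random

def top_p_sample(logits, T=1.0, p=0.9, rng=random):
    """Temperature-scale and softmax the scores, keep only the smallest set of
    tokens whose cumulative probability reaches p, renormalize, then sample."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:          # the long tail beyond here is discarded
            break
    norm = sum(probs[i] for i in kept)
    r, acc = rng.random() * norm, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

random.seed(0)
picks = {top_p_sample([3.0, 1.0, -2.0, -5.0], T=1.0, p=0.9) for _ in range(200)}
print(sorted(picks))  # the low-probability tail tokens (2, 3) never appear
```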

The following two clips continue the same opening, comparing melodies generated with different T_pitch values.

The first (T_pitch = 0.01) maintains good pitch consistency; the second (T_pitch = 1) feels more flexible.


Beauty cannot be quantified, but it leaves traces. AI composition is one proof.

The ancients said, "Only music cannot be faked," regarding music as the expression of pure feeling: even a lying musician is bound to put genuine feeling into his work. "There are so many things you don't know" — and in the world of music, at least, he was telling the truth.

AI composition, by contrast, is a challenge from the rational world to the world of feeling, an attempt to unify the world under monism. If the genuine feelings of living creatures can be simulated by AI, we may one day build virtual worlds optimized beyond reality — a form of immortality. We would shed the burden of the body; those avalanche-like thoughts, those ordinary memories accumulated second by second, would be neatly written onto a card that slots into a mechanical body, as easily as plugging a backup SIM card into a new phone.

But the world is not necessarily monist, and emotion and consciousness are not necessarily determined by matter. At present, nobody knows the right answer; this touches the very essence of physics, and fundamental physics has seen no great breakthrough in a hundred years.

When the dividends of economic growth are drained by involution, shallow entertainment stands ready to take over our leisure. How much personal sacrifice the progress of the times is worth is a question that deserves thought. But without doubt, this age still needs an apple to strike the next Newton, a brain that can fold general relativity into a model of the universe by thought alone, and more luck than Athena granted her Gorgon-slaying hero.

The fusion of AI and art still depends on breakthroughs in basic science, but beauty does leave traces to follow: the artworks AI creates are already rich and colorful. This is a superb arena for merging technology with art, and for exploring the unification of numbers and emotion.

Outside the window, the plane tree is gradually showing its bare branches, and the little cat on the sill has worked out the pattern of the canned food. These eight months of learning have left me fulfilled, and out of them grew a wonderful feeling, familiar yet distant. I puzzled over it for a long time, and only when I began writing this article did I trace the memory: the feeling comes from the carefree number games of my childhood — the simplest, most unadorned joy of guessing numbers.


  • [0] Implementation process source code:netpi/compound-word-transformer-tensorflow

  • [1]MT3: Arxiv.org/abs/2111.03…

  • [2]REMI: Arxiv.org/abs/2002.00…

  • [3]Attention: Arxiv.org/abs/1706.03…

  • [4]Music-Transformer: Arxiv.org/abs/1809.04…

  • [5]CP-Word-Transformer: Arxiv.org/abs/1809.04…