The goal of this project is a lightweight AI system for audio source separation and music transcription, where the separation does not depend on the training set and generalizes to timbres never seen during training.
- Lightweight: designed for real-world deployment. The final model is half the size of the baseline and has been successfully deployed in noteDigger. Many excellent commercial separation-and-transcription products already exist, and token-based large-model approaches (notably MT3) are trending toward one-model-for-everything solutions, so going lightweight is also a way of sidestepping direct competition. The constraint, however, greatly limits the range of applicable techniques.
- Source separation: separation by timbre, more precisely termed blind source separation, since the goal is independence from the training set. Unlike conventional source separation, which reconstructs the separated spectrograms, this task performs separation directly at the note level.
- Music transcription: combined with source separation, this amounts to transcribing a complete multi-instrument recording: given a polyphonic audio mixture containing multiple timbres, the system outputs multiple note tracks, each corresponding to a distinct timbre.
Existing timbre-aware transcription methods generalize poorly, in two respects:
- Heavy reliance on the dataset: such models do not recognize timbres outside the training set. They behave more like classifiers than separators: the set of timbre classes must be fixed in advance, and the model "memorizes" each timbre from the data it is fed. The upside is high accuracy, especially for single-instrument transcription models, but the applicability is very narrow; only large models on the scale of MT3 achieve a real breakthrough.
- Separation capacity limited by the architecture: as with the unknown-speaker-count problem in speech separation, a classification-style model must fix the number of classes, so a model built for two-source separation cannot handle three sources, which restricts its usefulness.
The goal of this research is to address both issues while also prioritizing usability: the system should run directly in a web browser. Human timbre tracking works nothing like the approaches above. We can tell timbres apart without knowing what the instruments are called; in other words, the task is not timbre classification but timbre discrimination. Here is my hypothesis: when we hear a note of some timbre, we first compare it against the timbres in our memory; if it is close to one of them we group it into that class, and if it differs we treat it as a new timbre. Part of that memory comes from prior experience (analogous to a training set), but the more important part is built up while listening to the earlier portion of the same audio, which amounts to learning and clustering timbres on the fly. This dynamic quality is the key to achieving the research goal.
This study decomposes the task into two stages:
- Timbre-agnostic transcription: first transcribe all notes without any timbre-based classification, yielding the notes of every timbre. The model is based on BasicPitch with several improvements. (magnitude encoding)
- Timbre-separated transcription: then cluster timbre embeddings and assign a timbre label to each note produced by stage 1. (directional encoding) A minimal sketch of the full pipeline follows this list.
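The sketch below illustrates how the two stages fit together and what "clustering at the note level" means in practice. All data, shapes, and the use of scikit-learn's KMeans are placeholder assumptions for illustration; the real models live in the basicamt and septimbre folders, and the number of clusters may be estimated rather than fixed.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stage 1 (timbre-agnostic transcription) yields notes as (onset_frame, offset_frame, pitch).
# Stage 2 (timbre encoding) yields a frame-level embedding map.
rng = np.random.default_rng(0)
frame_emb = rng.normal(size=(1000, 16))                 # (frames, emb_dim), dummy stage-2 output
notes = [(0, 40, 60), (50, 90, 64), (120, 160, 67)]     # dummy stage-1 output

# Note-level clustering: average the frame embeddings over each note's span,
# keep only the direction, then cluster the per-note vectors into tracks.
note_emb = np.stack([frame_emb[on:off].mean(axis=0) for on, off, _ in notes])
note_emb /= np.linalg.norm(note_emb, axis=1, keepdims=True)

labels = KMeans(n_clusters=2, n_init=10).fit_predict(note_emb)  # cluster count assumed known here
tracks = {k: [n for n, lab in zip(notes, labels) if lab == k] for k in set(labels)}
print(tracks)
```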
Contributions of this research:
- A lightweight timbre-agnostic transcription model that halves both the parameter count and the runtime cost relative to the baseline while matching its performance; it has only 18,978 trainable parameters yet shows good generalization and accuracy.
- A timbre-encoding branch built on top of the timbre-agnostic model, able to correctly separate two to three instruments with over 70% accuracy.
- A deep-clustering post-processing method designed specifically for music transcription: clustering at the note level improves robustness and reduces computation.
- Optimized loss functions: (a) a corrected weighting strategy for BasicPitch; (b) the conventional deep-clustering loss (MSE on the affinity matrix) replaced with the contrastive InfoNCE loss, yielding better embeddings (a sketch follows this list).
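The following is one common formulation of an InfoNCE-style contrastive loss over per-note embeddings, where notes from the same instrument are treated as positives. It is a generic illustration, not necessarily the exact loss implemented in septimbre; names, shapes, and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(emb, labels, temperature=0.1):
    """Contrastive loss: pull together embeddings of notes sharing an instrument
    label, push apart all others. emb: (N, D) note embeddings, labels: (N,)."""
    emb = F.normalize(emb, dim=1)                       # work in cosine-similarity space
    sim = emb @ emb.t() / temperature                   # (N, N) similarity logits
    n = emb.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=emb.device)
    sim = sim.masked_fill(eye, float("-inf"))           # exclude self-similarity
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    log_prob = F.log_softmax(sim, dim=1)
    pos_counts = pos.sum(1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos, 0).sum(1) / pos_counts)
    valid = pos.sum(1) > 0                              # anchors with at least one positive
    return loss[valid].mean()

# toy usage
emb = torch.randn(8, 16, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
info_nce_loss(emb, labels).backward()
```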
Although "memory" was mentioned above, this study did not actually implement such a mechanism. The original idea was to let the model make one pass over the input audio to build a "memory", then make a second pass while querying that memory to produce the final result; within the memory, similar timbres would already be grouped together. I experimented with Hopfield networks and extended them to various attention mechanisms, but found that when the timbre-encoding network is not strong enough, introducing memory actually causes classes to merge, a kind of "memory blurring". Theoretically, a Hopfield network's maximum capacity is roughly 0.14 times the encoding dimension, so separating three timbres would require an encoding dimension of about 22, which is rather large for a model meant to run in the browser. I therefore suspect the mechanism may only become useful once the model is scaled up. The architectures we explored are preserved in ./model/attention.py; see ./model/memory.md for the detailed ideas. A generic illustration of this kind of retrieval follows.
I also designed a synthetic data generation method, but it performs far worse than real datasets. The thesis discusses its shortcomings; one major issue is the lack of pitch-range constraints: timbre can be treated as roughly invariant between adjacent pitches, but it changes considerably over wide intervals. A feasible fix is to generate notes within separate frequency bands, restricting the pitch range of each part (a sketch follows). Of course, many algorithms now imitate human composition, and adopting one of them directly would probably work much better.
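A hypothetical sketch of band-restricted generation: each synthetic "instrument" is confined to its own pitch band so that its timbre only needs to stay consistent over a narrow range. The band boundaries, note counts, and the helper function are all assumptions, not the generator used in this project.

```python
import random

PITCH_BANDS = [(36, 52), (52, 68), (68, 84)]   # assumed low / mid / high MIDI bands

def generate_track(band, n_notes=32, max_onset=30.0, min_dur=0.1, max_dur=1.0):
    """Generate random (onset_sec, duration_sec, midi_pitch) triples inside one band."""
    low, high = band
    notes = []
    for _ in range(n_notes):
        onset = random.uniform(0.0, max_onset)
        dur = random.uniform(min_dur, max_dur)
        pitch = random.randint(low, high)
        notes.append((round(onset, 3), round(dur, 3), pitch))
    return sorted(notes)

# one band per timbre: mixing these tracks avoids asking any single "instrument"
# to cover an unrealistically wide pitch range
tracks = [generate_track(band) for band in PITCH_BANDS]
```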
Although the ultimate goal is independence from the training set, a large and diverse training set is still better for learning generalizable encodings; a model that generalizes well does not mean a small training set suffices.
The two branches of this study effectively encode magnitude and direction separately. I believe a promising direction is to learn a single feature whose direction represents timbre and whose magnitude represents intensity, much like Hinton's capsules. The small network in this project does not have that encoding capacity (my attempt failed), but I think it becomes feasible once the model is large enough. A toy decomposition is sketched below.
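Purely to make the capsule-like idea concrete, here is a toy decomposition of a feature vector into a unit direction (candidate timbre code) and a magnitude (intensity). This is not implemented in the project; the feature dimension is an arbitrary assumption.

```python
import torch

def split_capsule(feature, eps=1e-8):
    """Split a feature vector into intensity (norm) and timbre direction (unit vector)."""
    intensity = feature.norm(dim=-1, keepdim=True)   # how strongly the note is present
    direction = feature / (intensity + eps)          # which timbre it is
    return direction, intensity.squeeze(-1)

feat = torch.randn(5, 16)                # 5 notes, 16-dim features (assumed shape)
direction, intensity = split_capsule(feat)
similarity = direction @ direction.t()   # cosine similarity between timbre codes
```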
I highly recommend the paper "Harmonic Frequency-Separable Transformer for Instrument-Agnostic Music Transcription". Although it does not separate timbres, it uses several techniques that I consider promising but did not manage to make work in this study, such as harmonic convolution and attention applied to timbre.
Regarding phase: phase plays only a minor role in musical signals (consider additive synthesizers). Since this task does not require reconstructing audio, the phase can simply be discarded; even if the music had to be reconstructed, I suspect losing the phase would hardly matter. A small numerical check follows.
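A small illustration of the claim, assuming arbitrary parameters: an additively synthesized tone with randomized partial phases has an essentially identical magnitude spectrum even though its waveform is very different, which is why discarding phase is harmless for this task.

```python
import numpy as np

sr, dur, f0 = 22050, 1.0, 220.0          # sample rate, duration, fundamental (assumed)
t = np.arange(int(sr * dur)) / sr
amps = [1.0, 0.5, 0.33, 0.25]            # amplitudes of the first four harmonics

def additive(phases):
    """Additive synthesis of four harmonics with the given starting phases."""
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t + p)
               for k, (a, p) in enumerate(zip(amps, phases)))

x_zero = additive([0.0] * 4)
x_rand = additive(np.random.uniform(0, 2 * np.pi, 4))

mag_zero = np.abs(np.fft.rfft(x_zero))
mag_rand = np.abs(np.fft.rfft(x_rand))
print(np.max(np.abs(x_zero - x_rand)))                          # waveforms differ noticeably
print(np.max(np.abs(mag_zero - mag_rand)) / np.max(mag_zero))   # spectra nearly identical
```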
├─basicamt       our timbre-agnostic transcription model
├─basicpitch     baseline for timbre-agnostic transcription, used for comparison
├─onsets_frames  baseline for timbre-agnostic transcription, used for comparison
|
├─septimbre      our timbre-separated transcription model
├─Tanaka         baseline for timbre-separated transcription, used for comparison
|
├─evaluate       model evaluation
|
├─data           data-related material, e.g. training sets and visualization
├─model          shared torch.nn.Module components
└─utils          shared utility functions
This project has been integrated into noteDigger, which makes manual post-processing convenient and can also assist manual transcription. Usage instructions are given below:
The main trained models have been exported to ONNX and can be used as soon as an ONNX runtime is configured. For each model's inputs and outputs, see the ONNX-export section of use_model.ipynb in the corresponding folder. A minimal loading example is shown below.
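A minimal sketch of running an exported model with onnxruntime in Python. The file name and input shape below are placeholders; check the ONNX-export cells in each folder's use_model.ipynb for the real names, shapes, and preprocessing.

```python
import numpy as np
import onnxruntime as ort

# "basicamt.onnx" is a placeholder file name, not necessarily the exported artifact's name.
sess = ort.InferenceSession("basicamt.onnx", providers=["CPUExecutionProvider"])

inp = sess.get_inputs()[0]
print(inp.name, inp.shape)                      # inspect the expected input

# build a dummy input, treating dynamic dimensions as size 1
dummy = np.zeros([d if isinstance(d, int) else 1 for d in inp.shape], dtype=np.float32)
outputs = sess.run(None, {inp.name: dummy})     # list of output arrays
for out, meta in zip(outputs, sess.get_outputs()):
    print(meta.name, out.shape)
```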
This project uses uv to manage its environment; make sure uv is installed first. Then, in the project root, run:
uv sync
After that you can run the .ipynb notebooks. The first step is preparing the data; follow the instructions in the data folder.
In addition, the project depends on ffmpeg, which must be installed separately and callable from the command line.
This is actually my graduation project, with a self-chosen topic. Everything from my four years of university, and even high school, came together to shape it.
I started playing instruments and getting into anime culture in middle school, and picked up the chromatic harmonica in my first year of high school (to play anime songs). But there were never enough scores, so I had to transcribe them myself. Transcribing by ear was clearly not for a beginner like me, which is how I discovered WaveTone and was won over by its approach: transcribing from frequency-domain information is just so convenient! So I built an app for transcribing hummed melodies. At the time, though, I did not know to take the logarithm of the spectrum, and I relied on a plugin for fundamental-frequency extraction. Pitch extraction alone was far from enough; to reproduce what WaveTone did I learned of the FFT, but back then I could not understand it at all.
Then came the college entrance exam. Transcription was half the reason for my choice of school: I settled on the Electronic Information major purely for the word "signal". The other half was that hardware seemed cool; I really wanted to become someone like Dimsmary, and after messaging him I learned he was also an Electronic Information major. I scored in the top 1,000 in Jiangsu and ended up locked into SEU through the comprehensive-evaluation admission. People around me thought it a pity, but I was actually quite satisfied: it is, after all, a strong school for electronic information. As expected, I picked up plenty of hardware and signal-processing knowledge, and finally learned the FFT in the junior-year Digital Signal Processing course. That led to noteDigger, which fulfilled one of my goals.
However, by the end of the fifth semester I had already learned the knowledge and techniques I wanted. The next step along the information track was communications, which I had no interest in whatsoever. Only then did I realize that my goal, however firm, had been short-sighted. In the second semester of junior year I earned a postgraduate recommendation, ranked fourth in my school. Was I really going to study communications? Or choose signal processing? But I did not want to process radar signals; audio was what suited me... and is that not exactly what AI does? So I decided to switch to computer science for my recommendation. I had taken AI electives before, so after brushing up on the material of the 408 computer-science exam, I started reaching out to potential advisors.
Contacting advisors is draining in itself, and switching fields puts you at the bottom of the pecking order, but in the end I was admitted to the Intelligent Science and Technology program: free from communications at last! My advisor is from another institution and my future research direction is not communications either, so there was no need for my graduation project to follow my advisor's agenda. Why not do something I love? So I pushed this topic forward. I started reading papers in September 2024, formally settled on it as my thesis topic in mid-November, finished the thesis in the early hours of April 18 (quite possibly the first finished in the whole school), and defended it on June 6. The literature got easier the more I read, while the hands-on work got harder the further I went; I hit wall after wall early on, but never regretted it. The results are far from perfect, but I am still quite proud of them.
Here is an excerpt from the thesis acknowledgments:
Writing these words, I have finally laid to rest an obsession of four years. Funnily enough, the origin of this topic was extremely simple and pure: I just wanted to play anime songs whose scores I could not find. Unexpectedly, this obsession with "scores" guided me through my entire undergraduate years: to learn signal processing, I chose Electronic Information at Southeast University; to read scores more easily, I launched a WeChat mini-program for a score library; to produce scores, I wrote a series of score-processing tools; to make transcription easier, I built a platform; to transpose numbered-notation score images, I dipped into deep learning for the first time and even earned a little money; and now I can study transcription academically and contribute something new. Although the results of this project are still far from what I wanted, I am deeply gratified.
Why do scores matter so much to me? Back then I loved a certain piece of music so much that I wanted to hold onto it forever, and I came to believe that "to be able to play it is to own it." Thinking about it now, it is probably respect for music and a love of performing. I want to mention two communities in particular. The first is the justice_eternal Tieba forum (je for short), where I obtained a large number of ACG scores and first started processing scores and music with code: the place where the dream began. The second is the Voice of the Wind Harmonica Club (风之声口琴社), my spiritual home in university; every day spent in the practice room was accompanied by music, and by a group of lovely people.

