Description
First of all, thank you very much for this excellent and inspiring work. I believe SAM Audio is a highly influential contribution to audio-visual separation, and I really appreciate you making the code and benchmark publicly available.
I would like to ask two questions that came up while I was experimenting with SAM Audio Large.
1. Performance of SAM Audio Large on the VGGClean test set
I evaluated SAM Audio Large on a source separation task using the VGGClean dataset, which, like `sam_audio_bench`, is derived from real-world data but uses relatively 'coarse' audio mixing strategies.
Surprisingly, I found that the separation results are very poor; in fact, most metrics are worse than the input mixture itself, which was not what I initially expected. I used text-only inference with the following call: `result = model.separate(inputs, predict_spans=True, reranking_candidates=1)`
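To be concrete about "worse than the input mixture": I compared the SI-SDR of the separated output against the SI-SDR of the unprocessed mixture, roughly as below (a minimal sketch; the plain-NumPy SI-SDR implementation, the toy signals, and the variable names are my own, not from the SAM Audio codebase):

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant SDR in dB (standard definition, zero-mean signals)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target**2) / np.sum(noise**2))

# Toy check with synthetic waveforms (stand-ins for the real evaluation data):
# a lightly corrupted copy of the reference should yield positive SI-SDRi.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
mixture = reference + rng.standard_normal(16000)           # stand-in for the input mixture
separated = reference + 0.1 * rng.standard_normal(16000)   # stand-in for the model output
print(f"SI-SDRi = {si_sdr(separated, reference) - si_sdr(mixture, reference):.2f} dB")
# A negative SI-SDRi means the output is worse than just returning the mixture.
```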
This makes me wonder whether:
- using only textual prompts is insufficient for this scenario,
- the model is more sensitive to the mixing strategy used in VGGClean, or
- this behavior is simply expected for SAM Audio Large in such settings.
I also wanted to ask whether you have tested SAM Audio on VGGClean.
Additionally, I noticed that the paper does not seem to include comparisons with more recent separation-focused models (e.g., MMAudioSep). I was curious whether such comparisons were considered or explored internally.
2. Availability of video IDs in `sam_audio_bench`
While downloading the videos listed in `sam_audio_bench`, I noticed that a subset of video IDs (around 90) are no longer accessible. This makes it difficult to fully reproduce the benchmark results. I was wondering whether you have considered releasing a processed version of the dataset.
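For reference, this is roughly how I identified the dead IDs (a sketch; it assumes the benchmark lists YouTube video IDs and uses yt-dlp's Python API, with `ids_from_manifest` standing in for however the benchmark IDs are loaded):

```python
import yt_dlp

def find_missing(video_ids):
    """Return the subset of YouTube video IDs that can no longer be resolved."""
    opts = {"quiet": True, "skip_download": True}
    missing = []
    with yt_dlp.YoutubeDL(opts) as ydl:
        for vid in video_ids:
            try:
                # Metadata-only lookup; raises DownloadError for removed/private videos.
                ydl.extract_info(f"https://www.youtube.com/watch?v={vid}", download=False)
            except yt_dlp.utils.DownloadError:
                missing.append(vid)
    return missing

# e.g. missing = find_missing(ids_from_manifest); for me, len(missing) came out around 90.
```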
Again, thank you very much for your outstanding work, and apologies for the long message.