Add pipeline tag and improve model card

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +56 -112
README.md CHANGED
@@ -1,20 +1,24 @@
  ---
  language:
- - zh
  license: apache-2.0
  tags:
- - audio
- - music
- - singing-voice-transcription
- - automatic-singing-transcription
- - qwen3-asr
- - asr
- base_model: Qwen/Qwen3-ASR-1.7B
  ---

  # VocalParse-1.7B

- VocalParse is a singing voice transcription model fine-tuned from [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B). It transcribes singing audio into a structured autoregressive token sequence that jointly encodes lyrics, pitch, note values, and global tempo (BPM).

  ```text
  Singing Audio (16kHz) → Whisper Encoder → Qwen LLM Decoder → AST Token Sequence
@@ -22,47 +26,11 @@ Singing Audio (16kHz) → Whisper Encoder → Qwen LLM Decoder → AST Token Seq
  感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>
  ```

- Code and documentation: [github.com/pymaster17/VocalParse](https://github.com/pymaster17/VocalParse)
-
- ## Model Details
-
- | Property | Value |
- |---|---|
- | Base model | Qwen3-ASR-1.7B (Whisper encoder + Qwen LLM decoder) |
- | Fine-tuning task | Automatic Singing Transcription (AST) |
- | Training mode | CoT (`asr_cot=true`, `bpm_position=last`) |
- | New vocabulary tokens | ~400 AST tokens (pitch, note value, BPM) |
- | Input | Mono 16 kHz singing audio |
- | Output | Interleaved lyric + pitch + note sequence with global BPM |
-
- ### AST Token Vocabulary Extension
-
- The base Qwen3-ASR vocabulary (151,936 tokens) is extended with:
-
- | Token type | Count | Examples |
- |---|---|---|
- | Pitch | 128 | `<P_0>` – `<P_127>` (MIDI) |
- | Note value | 12 | `<NOTE_4>`, `<NOTE_8>`, `<NOTE_DOT_8>`, … |
- | Tempo | 256 | `<BPM_0>` – `<BPM_255>` |
- | Special | few | Reserved for future use |
-
- ### Output Format
-
- Standard interleaved format (`bpm_position=last`):
-
- ```
- 感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>
- ```
-
- CoT format produced during generation (`asr_cot=true`): the model first outputs plain lyrics, then the full interleaved score, separated by `<|file_sep|>`:
-
- ```
- 感受到<|file_sep|>感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>
- ```
-
  ## Usage

- Install VocalParse and its dependencies with [uv](https://docs.astral.sh/uv/):

  ```bash
  uv venv --python 3.10
@@ -73,94 +41,70 @@ uv pip install git+https://github.com/pymaster17/VocalParse.git

  ### Quick Inference

- Download this checkpoint:
-
  ```python
- from huggingface_hub import snapshot_download
- snapshot_download("pymaster/VocalParse", local_dir="./vocalparse-weights")
- ```
-
- Write an inference config `inference.yaml`:
-
- ```yaml
- checkpoint: ./vocalparse-weights
- audio_json: /path/to/audio_list.json # ["/path/a.wav", "/path/b.flac"]
- mode: test_weak # test_weak | test_full | annotation
- inference_mode: audio-only # audio-only | audio-lyric
- bpm_position: "last"
- asr_cot: true
- ```
-
- Run:
-
- ```bash
- python -m vocalparse.inference --config inference.yaml
- # Multi-GPU:
- torchrun --nproc_per_node=4 -m vocalparse.inference --config inference.yaml
  ```

- ### Inference Modes
-
- | `inference_mode` | Prompt | Output |
- |---|---|---|
- | `audio-only` | Audio only | Lyrics + pitch + note + BPM |
- | `audio-lyric` | Audio + ground-truth lyrics | Pitch + note + BPM only |
-
- `audio-lyric` is the score-transcription mode for CoT-trained checkpoints: provide known lyrics and the model predicts the musical score.
-
- ### Output Modes
-
- | `mode` | Requires | Produces |
- |---|---|---|
- | `test_weak` | Audio or preprocessed | Lyric CER |
- | `test_full` | Preprocessed data only | Full AST metrics |
- | `annotation` | Audio or preprocessed | Opencpop-style JSON |

- ## Training

- Fine-tuned with [VocalParse](https://github.com/pymaster17/VocalParse) on Opencpop and internal singing datasets.

- Key training settings:

- | Parameter | Value |
- |---|---|
- | Base model | `Qwen/Qwen3-ASR-1.7B` |
- | `bpm_position` | `last` |
- | `asr_cot` | `true` |
- | Learning rate | 2e-5 |
- | LR scheduler | inverse_sqrt |
- | Batch size | 64 (dynamic, mel-frame budget) |
- | Epochs | 10 |

- `bpm_position=last` places the global BPM token at the end of the sequence rather than the beginning. Experiments show this consistently outperforms `first` — the model commits to a tempo estimate only after processing the full note sequence.

  ## Evaluation Metrics

  Metrics are computed with two-stage Needleman-Wunsch alignment: word-level alignment for lyrics, then pair-level alignment inside each matched word for pitch and note.

- | Metric | Description |
- |---|---|
- | CER | Character error rate on lyrics (silence tokens excluded) |
- | CER (singing) | Character error rate including silence tokens (AP/SP) |
- | Pitch MAE | Mean absolute pitch error in MIDI semitones |
- | Note MAE | Mean absolute error in log₂ note-value space |
- | BPM MAE | Mean absolute tempo error |

  ## Limitations

- - Primarily trained on Mandarin Chinese singing. Performance on other languages is not evaluated.
  - Physical note durations are not predicted by this checkpoint.
- - Long audio segments (> ~30 s) should be pre-segmented before inference.

  ## Citation

  ```bibtex
  @article{vocalparse2026,
- title={VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models},
- year={2026}
  }
  ```

  ## License

- Apache 2.0
 
  ---
+ base_model: Qwen/Qwen3-ASR-1.7B
  language:
+ - zh
  license: apache-2.0
+ pipeline_tag: automatic-speech-recognition
  tags:
+ - audio
+ - music
+ - singing-voice-transcription
+ - automatic-singing-transcription
+ - qwen3-asr
+ - asr
  ---

  # VocalParse-1.7B

+ VocalParse is a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Fine-tuned from [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B), it transcribes singing audio into a structured autoregressive token sequence that jointly encodes lyrics, pitch, note values, and global tempo (BPM).
+
+ - **Paper:** [VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models](https://huggingface.co/papers/2605.04613)
+ - **Repository:** [github.com/pymaster17/VocalParse](https://github.com/pymaster17/VocalParse)

  ```text
  Singing Audio (16kHz) → Whisper Encoder → Qwen LLM Decoder → AST Token Sequence
 
  感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>
  ```

  ## Usage

+ ### Installation
+
+ It is recommended to use [uv](https://docs.astral.sh/uv/) for setup:

  ```bash
  uv venv --python 3.10
 

  ### Quick Inference

  ```python
+ from vocalparse import transcribe_one
+
+ text = transcribe_one(
+     audio="path/to/song.wav",
+     checkpoint="pymaster/VocalParse",
+ )
+ print(text)
+ # Example output: 感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> ... <BPM_89>
  ```

+ ## Model Details

+ | Property | Value |
+ |---|---|
+ | **Base model** | Qwen3-ASR-1.7B (Whisper encoder + Qwen LLM decoder) |
+ | **Fine-tuning task** | Automatic Singing Transcription (AST) |
+ | **Training mode** | CoT (`asr_cot=true`, `bpm_position=last`) |
+ | **New vocabulary tokens** | ~400 AST tokens (pitch, note value, BPM) |
+ | **Input** | Mono 16 kHz singing audio |
+ | **Output** | Interleaved lyric + pitch + note sequence with global BPM |

+ ### AST Token Vocabulary Extension

+ The base Qwen3-ASR vocabulary is extended with:
+ - **Pitch:** 128 tokens (`<P_0>` – `<P_127>`) representing MIDI notes.
+ - **Note value:** 12 tokens (e.g., `<NOTE_4>`, `<NOTE_8>`, `<NOTE_DOT_8>`).
+ - **Tempo:** 256 tokens (`<BPM_0>` – `<BPM_255>`).

+ ### Output Format

+ - **Standard interleaved format** (`bpm_position=last`):
+ `感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>`
+
+ - **CoT format** produced during generation (`asr_cot=true`): the model first outputs plain lyrics, then the full interleaved score, separated by `<|file_sep|>`:
+ `感受到<|file_sep|>感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>`
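
Anyone consuming the two output formats above will likely need to split the generated string back into lyrics, notes, and tempo. The following is a minimal sketch of such a parser and is not part of VocalParse. The names `parse_ast_output` and `Note` and both regular expressions are hypothetical, inferred only from the examples in this README. The sketch assumes each lyric unit is followed by exactly one `<P_n>` token and one `<NOTE_...>` token.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

FILE_SEP = "<|file_sep|>"  # separator between plain lyrics and score in CoT output
# One lyric unit followed by its pitch token and its note-value token.
NOTE_RE = re.compile(r"(\S+?)\s*<P_(\d+)>\s*<(NOTE_\w+)>")
BPM_RE = re.compile(r"<BPM_(\d+)>")


class Note(NamedTuple):
    lyric: str       # lyric unit, e.g. one Chinese character
    pitch: int       # MIDI pitch number taken from <P_n>
    note_value: str  # note-value token name, e.g. "NOTE_4"


def parse_ast_output(text: str) -> Tuple[Optional[str], List[Note], Optional[int]]:
    """Split a generated string into (plain lyrics, notes, bpm).

    Plain lyrics are only available in the CoT format; otherwise None.
    """
    lyrics = None
    if FILE_SEP in text:
        lyrics, text = text.split(FILE_SEP, 1)
        lyrics = lyrics.strip()
    notes = [
        Note(m.group(1), int(m.group(2)), m.group(3))
        for m in NOTE_RE.finditer(text)
    ]
    bpm_match = BPM_RE.search(text)
    bpm = int(bpm_match.group(1)) if bpm_match else None
    return lyrics, notes, bpm
```

Applied to the CoT example with the `...` elision removed, this returns the plain lyrics `感受到`, three `Note` entries, and a BPM of 89. For the standard interleaved format the first element is `None`.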

  ## Evaluation Metrics

  Metrics are computed with two-stage Needleman-Wunsch alignment: word-level alignment for lyrics, then pair-level alignment inside each matched word for pitch and note.

+ - **CER:** Character error rate on lyrics (silence tokens excluded).
+ - **Pitch MAE:** Mean absolute pitch error in MIDI semitones.
+ - **Note MAE:** Mean absolute error in log₂ note-value space.
+ - **BPM MAE:** Mean absolute tempo error.
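
To make the metric definitions above concrete, here is a rough sketch of the first alignment stage and of the error measures. It is not the VocalParse evaluation code. The alignment costs used by the project are not stated here, so the sketch assumes unit costs for substitution, insertion, and deletion; under that assumption the Needleman-Wunsch cost equals the edit distance, and CER is that cost divided by the reference length. The Note MAE helper further assumes that note values are represented by their denominators (4 for a quarter note, 8 for an eighth note). All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def needleman_wunsch(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, List[Pair]]:
    """Global alignment with unit costs; returns (cost, aligned pairs).

    A None entry in a pair marks a gap (deletion or insertion).
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimal cost of aligning ref[:i] with hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], None))  # deletion
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # insertion
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


def cer(ref: str, hyp: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    cost, _ = needleman_wunsch(list(ref), list(hyp))
    return cost / len(ref)


def pitch_mae(ref: Sequence[int], hyp: Sequence[int]) -> float:
    """Mean absolute error in MIDI semitones over already-paired pitches."""
    return sum(abs(r - h) for r, h in zip(ref, hyp)) / len(ref)


def note_mae(ref: Sequence[int], hyp: Sequence[int]) -> float:
    """Mean absolute error in log2 space over already-paired note denominators."""
    return sum(abs(math.log2(r) - math.log2(h)) for r, h in zip(ref, hyp)) / len(ref)
```

In log₂ space, confusing a quarter note with an eighth note costs exactly 1, as does confusing an eighth with a sixteenth, so errors are measured in "halvings" rather than in raw denominators. The second stage (pair-level alignment inside each matched word) is not shown; the two MAE helpers take pairs that are already aligned.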

  ## Limitations

+ - Primarily trained on Mandarin Chinese singing.
  - Physical note durations are not predicted by this checkpoint.
+ - Long audio segments (> 30s) should be pre-segmented before inference.
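
One simple way to satisfy the pre-segmentation limitation above is to cut the waveform into consecutive windows of at most 30 seconds. The helper below is a sketch only and is not provided by VocalParse; `segment_bounds` and its parameters are hypothetical. It assumes mono audio at the model's 16 kHz input rate. Fixed-length cuts can split a note or a word, so cutting at silences may work better in practice.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000      # the model card specifies mono 16 kHz input
MAX_SEGMENT_SECONDS = 30  # upper bound taken from the limitation above


def segment_bounds(
    num_samples: int,
    sample_rate: int = SAMPLE_RATE,
    max_seconds: float = MAX_SEGMENT_SECONDS,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive chunks.

    Every chunk spans at most max_seconds; the last one may be shorter.
    """
    step = int(sample_rate * max_seconds)
    if num_samples < 0 or step < 1:
        raise ValueError("num_samples must be >= 0 and the window at least one sample")
    return [
        (start, min(start + step, num_samples))
        for start in range(0, num_samples, step)
    ]
```

Each `(start, end)` pair can be used to slice the waveform array (`audio[start:end]`) before passing the chunk to inference. A 70-second clip yields two 30-second chunks and one 10-second chunk.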

  ## Citation

  ```bibtex
  @article{vocalparse2026,
+ title = {VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models},
+ author = {Yukun Chen and Tianrui Wang and Zhaoxi Mu and Xinyu Yang and EngSiong Chng},
+ journal = {arXiv preprint arXiv:2605.04613},
+ year = {2026},
+ url = {http://arxiv.org/abs/2605.04613}
  }
  ```

  ## License

+ This model is licensed under Apache 2.0.