@@ -1,20 +1,24 @@
 ---
 language:
-  - zh
 license: apache-2.0
 tags:
-  - audio
-  - music
-  - singing-voice-transcription
-  - automatic-singing-transcription
-  - qwen3-asr
-  - asr
-base_model: Qwen/Qwen3-ASR-1.7B
 ---
 # VocalParse-1.7B
-VocalParse is a singing voice transcription model fine-tuned from [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B). It transcribes singing audio into a structured autoregressive token sequence that jointly encodes lyrics, pitch, note values, and global tempo (BPM).
 ```text
 Singing Audio (16kHz) → Whisper Encoder → Qwen LLM Decoder → AST Token Sequence
@@ -22,47 +26,11 @@ Singing Audio (16kHz) → Whisper Encoder → Qwen LLM Decoder → AST Token Seq
 感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>
 ```
-Code and documentation: [github.com/pymaster17/VocalParse](https://github.com/pymaster17/VocalParse)
-## Model Details
-| Property | Value |
-|---|---|
-| Base model | Qwen3-ASR-1.7B (Whisper encoder + Qwen LLM decoder) |
-| Fine-tuning task | Automatic Singing Transcription (AST) |
-| Training mode | CoT (`asr_cot=true`, `bpm_position=last`) |
-| New vocabulary tokens | ~400 AST tokens (pitch, note value, BPM) |
-| Input | Mono 16 kHz singing audio |
-| Output | Interleaved lyric + pitch + note sequence with global BPM |
-### AST Token Vocabulary Extension
-The base Qwen3-ASR vocabulary (151,936 tokens) is extended with:
-| Token type | Count | Examples |
-|---|---|---|
-| Pitch | 128 | `<P_0>` – `<P_127>` (MIDI) |
-| Note value | 12 | `<NOTE_4>`, `<NOTE_8>`, `<NOTE_DOT_8>`, … |
-| Tempo | 256 | `<BPM_0>` – `<BPM_255>` |
-| Special | few | Reserved for future use |
-### Output Format
-Standard interleaved format (`bpm_position=last`):
-```
-感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>
-```
-CoT format produced during generation (`asr_cot=true`): the model first outputs plain lyrics, then the full interleaved score, separated by `<|file_sep|>`:
-```
-感受到<|file_sep|>感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>
-```
 ## Usage
-Install VocalParse and its dependencies with [uv](https://docs.astral.sh/uv/):
 ```bash
 uv venv --python 3.10
@@ -73,94 +41,70 @@ uv pip install git+https://github.com/pymaster17/VocalParse.git
 ### Quick Inference
-Download this checkpoint:
 ```python
-from huggingface_hub import snapshot_download
-snapshot_download("pymaster/VocalParse", local_dir="./vocalparse-weights")
-```
-Write an inference config `inference.yaml`:
-```yaml
-checkpoint: ./vocalparse-weights
-audio_json: /path/to/audio_list.json   # ["/path/a.wav", "/path/b.flac"]
-mode: test_weak                        # test_weak | test_full | annotation
-inference_mode: audio-only             # audio-only | audio-lyric
-bpm_position: "last"
-asr_cot: true
-```
-Run:
-```bash
-python -m vocalparse.inference --config inference.yaml
-# Multi-GPU:
-torchrun --nproc_per_node=4 -m vocalparse.inference --config inference.yaml
 ```
-### Inference Modes
-| `inference_mode` | Prompt | Output |
-|---|---|---|
-| `audio-only` | Audio only | Lyrics + pitch + note + BPM |
-| `audio-lyric` | Audio + ground-truth lyrics | Pitch + note + BPM only |
-`audio-lyric` is the score-transcription mode for CoT-trained checkpoints: provide known lyrics and the model predicts the musical score.
-### Output Modes
-| `mode` | Requires | Produces |
-|---|---|---|
-| `test_weak` | Audio or preprocessed | Lyric CER |
-| `test_full` | Preprocessed data only | Full AST metrics |
-| `annotation` | Audio or preprocessed | Opencpop-style JSON |
-## Training
-Fine-tuned with [VocalParse](https://github.com/pymaster17/VocalParse) on Opencpop and internal singing datasets.
-Key training settings:
-| Parameter | Value |
-|---|---|
-| Base model | `Qwen/Qwen3-ASR-1.7B` |
-| `bpm_position` | `last` |
-| `asr_cot` | `true` |
-| Learning rate | 2e-5 |
-| LR scheduler | inverse_sqrt |
-| Batch size | 64 (dynamic, mel-frame budget) |
-| Epochs | 10 |
-`bpm_position=last` places the global BPM token at the end of the sequence rather than the beginning. Experiments show this consistently outperforms `first` — the model commits to a tempo estimate only after processing the full note sequence.
 ## Evaluation Metrics
 Metrics are computed with two-stage Needleman-Wunsch alignment: word-level alignment for lyrics, then pair-level alignment inside each matched word for pitch and note.
-| Metric | Description |
-|---|---|
-| CER | Character error rate on lyrics (silence tokens excluded) |
-| CER (singing) | Character error rate including silence tokens (AP/SP) |
-| Pitch MAE | Mean absolute pitch error in MIDI semitones |
-| Note MAE | Mean absolute error in log₂ note-value space |
-| BPM MAE | Mean absolute tempo error |
 ## Limitations
-- Primarily trained on Mandarin Chinese singing. Performance on other languages is not evaluated.
 - Physical note durations are not predicted by this checkpoint.
-- Long audio segments (> ~30 s) should be pre-segmented before inference.
 ## Citation
 ```bibtex
 @article{vocalparse2026,
-  title={VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models},
-  year={2026}
 }
 ```
 ## License
-Apache 2.0

 ---
+base_model: Qwen/Qwen3-ASR-1.7B
 language:
+- zh
 license: apache-2.0
+pipeline_tag: automatic-speech-recognition
 tags:
+- audio
+- music
+- singing-voice-transcription
+- automatic-singing-transcription
+- qwen3-asr
+- asr
 ---
 # VocalParse-1.7B
+VocalParse is a unified singing voice transcription (SVT) model built upon a Large Audio Language Model (LALM). Fine-tuned from [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B), it transcribes singing audio into a structured autoregressive token sequence that jointly encodes lyrics, pitch, note values, and global tempo (BPM).
+- **Paper:** [VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models](https://huggingface.co/papers/2605.04613)
+- **Repository:** [github.com/pymaster17/VocalParse](https://github.com/pymaster17/VocalParse)
 ```text
 Singing Audio (16kHz) → Whisper Encoder → Qwen LLM Decoder → AST Token Sequence
 感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>
 ```
 ## Usage
+### Installation
+We recommend [uv](https://docs.astral.sh/uv/) for setup:
 ```bash
 uv venv --python 3.10
 uv pip install git+https://github.com/pymaster17/VocalParse.git
 ```
 ### Quick Inference
 ```python
+from vocalparse import transcribe_one
+text = transcribe_one(
+    audio="path/to/song.wav",
+    checkpoint="pymaster/VocalParse",
+)
+print(text)
+# Example output: 感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> ... <BPM_89>
 ```
+## Model Details
+| Property | Value |
+|---|---|
+| **Base model** | Qwen3-ASR-1.7B (Whisper encoder + Qwen LLM decoder) |
+| **Fine-tuning task** | Automatic Singing Transcription (AST) |
+| **Training mode** | CoT (`asr_cot=true`, `bpm_position=last`) |
+| **New vocabulary tokens** | ~400 AST tokens (pitch, note value, BPM) |
+| **Input** | Mono 16 kHz singing audio |
+| **Output** | Interleaved lyric + pitch + note sequence with global BPM |
+### AST Token Vocabulary Extension
+The base Qwen3-ASR vocabulary is extended with:
+- **Pitch:** 128 tokens (`<P_0>` – `<P_127>`) representing MIDI notes.
+- **Note value:** 12 tokens (e.g., `<NOTE_4>`, `<NOTE_8>`, `<NOTE_DOT_8>`).
+- **Tempo:** 256 tokens (`<BPM_0>` – `<BPM_255>`).
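Because pitch tokens carry MIDI note numbers, they can be turned into note names and frequencies with standard MIDI arithmetic. The sketch below is illustrative only and is not part of the VocalParse package: the helper names are hypothetical, and it assumes 12-tone equal temperament with A4 = 440 Hz and the common convention that MIDI 60 is C4.

```python
import re

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_token_to_midi(token: str) -> int:
    """Extract the MIDI note number from a token such as '<P_68>'."""
    match = re.fullmatch(r"<P_(\d+)>", token)
    if match is None or not 0 <= int(match.group(1)) <= 127:
        raise ValueError(f"not a pitch token: {token!r}")
    return int(match.group(1))


def midi_to_name(midi: int) -> str:
    """Name a MIDI note, assuming MIDI 60 is written as C4."""
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


def midi_to_hz(midi: int, a4_hz: float = 440.0) -> float:
    """Equal-temperament frequency, assuming MIDI 69 (A4) is a4_hz."""
    return a4_hz * 2 ** ((midi - 69) / 12)


midi = pitch_token_to_midi("<P_68>")
print(midi, midi_to_name(midi), round(midi_to_hz(midi), 1))
```

Whether any pitch value is reserved for silence is not stated in this card, so the sketch treats every value from 0 to 127 as a sounding note.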
+### Output Format
+- **Standard interleaved format** (`bpm_position=last`):
+  `感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>`
+- **CoT format** produced during generation (`asr_cot=true`): the model first outputs plain lyrics, then the full interleaved score, separated by `<|file_sep|>`:
+  `感受到<|file_sep|>感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> ... <BPM_89>`
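The formats above can be turned into structured data with a small parser. This is a minimal sketch, not the repository's own code: `parse_ast`, `Word`, and `Transcription` are hypothetical names, and it assumes tokens are separated by whitespace exactly as in the examples and that every `<P_n> <NOTE_x>` pair belongs to the most recent lyric token.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

FILE_SEP = "<|file_sep|>"  # CoT separator named in the model card


@dataclass
class Word:
    lyric: str
    # (MIDI pitch, note-value name) pairs attached to this lyric token
    notes: List[Tuple[int, str]] = field(default_factory=list)


@dataclass
class Transcription:
    lyrics: Optional[str]  # plain lyrics from the CoT prefix, if present
    words: List[Word]
    bpm: Optional[int]


def parse_ast(text: str) -> Transcription:
    lyrics = None
    if FILE_SEP in text:
        # CoT format: plain lyrics first, interleaved score after the separator
        lyrics, text = text.split(FILE_SEP, 1)
        lyrics = lyrics.strip()

    words: List[Word] = []
    bpm: Optional[int] = None
    pitch: Optional[int] = None
    for tok in text.split():
        if m := re.fullmatch(r"<BPM_(\d+)>", tok):
            bpm = int(m.group(1))
        elif m := re.fullmatch(r"<P_(\d+)>", tok):
            pitch = int(m.group(1))
        elif m := re.fullmatch(r"<(NOTE_\w+)>", tok):
            # Attach the pending pitch and this note value to the last lyric
            if words and pitch is not None:
                words[-1].notes.append((pitch, m.group(1)))
            pitch = None
        else:
            # Anything that is not a known tag is treated as a lyric token
            words.append(Word(tok))
    return Transcription(lyrics, words, bpm)


result = parse_ast("感受到<|file_sep|>感 <P_68> <NOTE_4> 受 <P_60> <NOTE_8> 到 <P_65> <NOTE_8> <BPM_89>")
print(result.lyrics, result.bpm, [(w.lyric, w.notes) for w in result.words])
```

The same function handles the standard interleaved format, in which case `lyrics` is `None`.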
 ## Evaluation Metrics
 Metrics are computed with two-stage Needleman-Wunsch alignment: word-level alignment for lyrics, then pair-level alignment inside each matched word for pitch and note.
+- **CER:** Character error rate on lyrics (silence tokens excluded).
+- **Pitch MAE:** Mean absolute pitch error in MIDI semitones.
+- **Note MAE:** Mean absolute error in log₂ note-value space.
+- **BPM MAE:** Mean absolute tempo error.
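To make two of these definitions concrete, here is a simplified sketch. It is not the repository's evaluation code: it replaces the two-stage Needleman-Wunsch alignment with a plain Levenshtein distance for CER, and it assumes the note values passed to `log2_note_mae` are already aligned one-to-one and given as plain denominators (4 for a quarter note, 8 for an eighth). Both function names are hypothetical.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def log2_note_mae(ref: Sequence[float], hyp: Sequence[float]) -> float:
    """Mean absolute error between note values in log2 space (pre-aligned)."""
    if len(ref) != len(hyp):
        raise ValueError("inputs must be aligned to the same length")
    return sum(abs(math.log2(r) - math.log2(h)) for r, h in zip(ref, hyp)) / len(ref)


print(cer("感受到", "感到"))                     # one deleted character out of three
print(log2_note_mae([4, 8, 8], [4, 4, 8]))   # one note off by a factor of two
```

In log₂ space, confusing a quarter note with an eighth note costs exactly 1, the same as confusing an eighth with a sixteenth, which is why the error is measured on that scale rather than on raw note values.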
 ## Limitations
+- Primarily trained on Mandarin Chinese singing.
 - Physical note durations are not predicted by this checkpoint.
+- Audio longer than about 30 s should be split into shorter segments before inference.
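One simple way to respect the length limit is to cut the waveform into fixed windows before transcription. The sketch below only computes sample boundaries for 16 kHz audio; `chunk_bounds` is a hypothetical helper, not a VocalParse function, and fixed-length cuts can split a sung note, so cutting at silences is likely a better choice in practice.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # input rate stated in the model card
MAX_SECONDS = 30       # approximate limit from the Limitations list


def chunk_bounds(num_samples: int,
                 sample_rate: int = SAMPLE_RATE,
                 max_seconds: float = MAX_SECONDS) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices covering the audio in windows
    of at most max_seconds each; the last window may be shorter."""
    step = int(sample_rate * max_seconds)
    return [(start, min(start + step, num_samples))
            for start in range(0, num_samples, step)]


# A 65-second clip becomes two 30-second windows and one 5-second remainder.
print(chunk_bounds(65 * SAMPLE_RATE))
```

Each `(start, end)` pair can then be used to slice the mono waveform and write a separate file for inference.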
 ## Citation
 ```bibtex
 @article{vocalparse2026,
+  title   = {VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models},
+  author  = {Yukun Chen and Tianrui Wang and Zhaoxi Mu and Xinyu Yang and EngSiong Chng},
+  journal = {arXiv preprint arXiv:2605.04613},
+  year    = {2026},
+  url     = {http://arxiv.org/abs/2605.04613}
 }
 ```
 ## License
+This model is licensed under Apache 2.0.