Finance Unigram Tokenizer (Fine-tuned on Finance-Instruct-500k)
Model Overview
This repository contains a Unigram tokenizer fine-tuned on the Finance-Instruct-500k dataset, starting from a base English Unigram tokenizer.
It is tailored for financial domain text processing, capturing domain-specific terminology and patterns while maintaining efficient subword segmentation.
Key Features:
- Custom <cls> and <sep> special tokens.
- Unigram subword segmentation optimized for financial vocabulary.
- Template-based post-processing for both single and paired sequences.
- Configured decoding using the Unigram decoder for accurate reconstruction of financial text.
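As a rough illustration of what Unigram subword segmentation means, the sketch below picks the split of a word that maximizes the sum of piece log-probabilities, using a Viterbi-style search. The vocabulary and its probabilities are made up for the example and are not taken from this tokenizer. The real implementation in the tokenizers library also handles unknown characters and whitespace markers, which this sketch ignores.

```python
import math
from typing import Dict, List

# Hypothetical toy vocabulary with invented probabilities (not this tokenizer's).
TOY_VOCAB: Dict[str, float] = {
    "dividend": math.log(0.02),
    "div": math.log(0.05),
    "idend": math.log(0.04),
    "s": math.log(0.10),
}


def segment(text: str, vocab: Dict[str, float]) -> List[str]:
    """Return the highest-scoring split of `text` into vocabulary pieces."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start] > -math.inf:
                score = best[start] + vocab[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        # Simplification: no valid split, so emit a single unknown token.
        return ["<unk>"]
    pieces: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


print(segment("dividends", TOY_VOCAB))  # ['dividend', 's']
```

Here "dividend" + "s" wins over "div" + "idend" + "s" because the product of its piece probabilities is higher, which is the kind of behavior a domain-tuned vocabulary is meant to encourage for frequent financial terms.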
Training Details
Dataset
- Name: Finance-Instruct-500k
- Source: Financial domain prompts, completions, and instructions.
- Split Used: train
- Size: 500,000 instruction-based samples
- Loading Method: Streaming mode for efficient processing.
Tokenizer Configuration
- Model Type: Unigram
- Vocabulary Size: 30,000 (optimized for finance-specific tasks)
- Lowercasing: Enabled
- Special Tokens:
  - <cls>: Classification token
  - <sep>: Separator token
  - <unk>: Unknown token
  - <pad>: Padding token
  - <mask>: Masking token (MLM tasks)
- Post-Processing Template:
  - Single Sequence: $A:0 <sep>:0 <cls>:2
  - Paired Sequences: $A:0 <sep>:0 $B:1 <sep>:1 <cls>:2
- Decoder: Unigram decoder for reconstructing original text.
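To make the post-processing templates concrete, here is a minimal pure-Python sketch of how they expand into tokens and type IDs. It only mimics the template semantics (the tokenizers library provides this through `TemplateProcessing`); the function name `apply_template` and the example tokens are hypothetical, and the actual subword pieces produced by this tokenizer will differ.

```python
from typing import List, Optional, Tuple

# Templates copied from the configuration above; each piece is "name:type_id".
SINGLE = "$A:0 <sep>:0 <cls>:2"
PAIR = "$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2"


def apply_template(
    tokens_a: List[str], tokens_b: Optional[List[str]] = None
) -> Tuple[List[str], List[int]]:
    """Expand the single or paired template into tokens and type IDs."""
    template = SINGLE if tokens_b is None else PAIR
    sequences = {"$A": tokens_a, "$B": tokens_b or []}
    tokens: List[str] = []
    type_ids: List[int] = []
    for piece in template.split():
        name, type_id = piece.rsplit(":", 1)
        # "$A" and "$B" stand for whole input sequences; any other name is a
        # literal special token that is inserted once.
        expanded = sequences.get(name, [name])
        tokens.extend(expanded)
        type_ids.extend([int(type_id)] * len(expanded))
    return tokens, type_ids


# Hypothetical, already-segmented inputs.
tokens, type_ids = apply_template(["net", "income"], ["rose", "5%"])
print(tokens)    # ['net', 'income', '<sep>', 'rose', '5%', '<sep>', '<cls>']
print(type_ids)  # [0, 0, 0, 1, 1, 1, 2]
```

Note that <cls> is placed at the end of the sequence with its own type ID of 2, rather than at the start.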
Training Method
- Base Model: yakul259/english-unigram-tokenizer-60k
- Corpus Source: Finance-Instruct-500k
- Batch Size: 1000 lines per batch
- Trainer: UnigramTrainer from the Hugging Face tokenizers library
- Special Tokens Added: <cls>, <sep>, <unk>, <pad>, <mask>
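The batching step can be sketched as a small generator that groups a streamed corpus into lists of 1000 lines. This is an illustrative sketch, not the exact training script: `batch_lines` and `fake_stream` are hypothetical names, and the stand-in stream replaces the real dataset. A generator of this shape is the kind of iterator that could be handed to `Tokenizer.train_from_iterator` together with a `UnigramTrainer`.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_lines(lines: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` lines without loading the full stream."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stand-in for the streamed dataset: 2,500 short text lines.
fake_stream = (f"sample {i}" for i in range(2500))
sizes = [len(batch) for batch in batch_lines(fake_stream)]
print(sizes)  # [1000, 1000, 500]
```

Because the generator pulls lines lazily, memory use stays bounded by one batch, which fits the streaming loading mode described in the Dataset section.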
Intended Uses & Limitations
Intended Uses
- Tokenizing input text for financial LLMs.
- Downstream financial NLP tasks:
  - Financial question answering
  - Document parsing
  - Financial news summarization
  - Risk assessment chatbots
Limitations
- Optimized for English financial text; performance may drop outside the finance domain.
- May reflect biases present in the financial data used for training.
License
This tokenizer is released under the MIT License.
Citation
If you use this tokenizer, please cite:
title = Finance Unigram Tokenizer Fine-tuned on Finance-Instruct-500k
author = yakul259
year = 2025
publisher = Hugging Face