A FastText classifier that detects the communicative register (text type) of any English text at ~500k predictions/sec on CPU.
| Code | Register | Description | Example |
|---|---|---|---|
| IN | Informational | Factual, encyclopedic, descriptive | Wikipedia articles, reports |
| NA | Narrative | Story-like, temporal sequence of events | News stories, fiction, blog posts |
| OP | Opinion | Subjective evaluation, personal views | Reviews, editorials, comments |
| IP | Persuasion | Attempts to convince or sell | Marketing copy, ads, fundraising |
| HI | HowTo | Instructions, procedures, recipes | Tutorials, manuals, FAQs |
| ID | Discussion | Interactive, forum-style dialogue | Forum threads, Q&A, comments |
| SP | Spoken | Transcribed or spoken-style text | Interviews, podcasts, speeches |
| LY | Lyrical | Poetic, artistic, song-like | Poetry, song lyrics, creative prose |
Based on the Biber & Egbert (2018) register taxonomy. Multi-label supported (a text can be both Informational and Narrative).
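fastText returns label strings with a `__label__` prefix; a small helper (a sketch that simply mirrors the table above) decodes them into readable register names:

```python
# Map fastText label codes to the register names from the table above.
REGISTERS = {
    "IN": "Informational",
    "NA": "Narrative",
    "OP": "Opinion",
    "IP": "Persuasion",
    "HI": "HowTo",
    "ID": "Discussion",
    "SP": "Spoken",
    "LY": "Lyrical",
}

def decode_label(label: str) -> str:
    """Strip fastText's '__label__' prefix and look up the readable name."""
    return REGISTERS[label.removeprefix("__label__")]

print(decode_label("__label__IP"))  # -> Persuasion
```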
```python
import fasttext
from huggingface_hub import hf_hub_download

# Download the quantized model (151 MB)
model_path = hf_hub_download(
    "oneryalcin/text-register-fasttext-classifier",
    "register_fasttext_q.bin"
)
model = fasttext.load_model(model_path)

# Predict the top 3 registers
labels, probs = model.predict("Buy now and save 50%! Limited time offer!", k=3)
# labels -> ('__label__IP', ...), probs -> array([1.0, ...])  # IP = Persuasion
```
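Because the model is multi-label, a common pattern is to request all labels (`k=-1`) and keep those above a probability threshold. The helper below is a hypothetical sketch, not part of the shipped scripts, shown here with mock `predict()` output:

```python
def multilabel(labels, probs, threshold=0.5):
    """Keep every (label, prob) pair whose probability clears the threshold.

    `labels` and `probs` are the two parallel sequences returned by
    fastText's model.predict(text, k=-1).
    """
    return [(l, p) for l, p in zip(labels, probs) if p >= threshold]

# Example with mock predict() output:
labels = ("__label__IN", "__label__NA", "__label__OP")
probs = [0.97, 0.81, 0.04]
print(multilabel(labels, probs))
# -> [('__label__IN', 0.97), ('__label__NA', 0.81)]
```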
Note: if you get a NumPy error, pin `numpy<2`: `pip install "numpy<2"`
Trained on 10 English shards of TurkuNLP/register_oscar (~1.9M documents), with classes balanced to the median class size via oversampling and undersampling.
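As an illustration of the balancing step, the hypothetical helper below over/undersamples each class to the median class size (the real logic lives in `scripts/prepare_data.py` and may differ):

```python
import random

def balance_to_median(examples_by_class, seed=0):
    """Resample each class to the median class size.

    examples_by_class: dict mapping class label -> list of examples.
    Classes larger than the median are downsampled without replacement;
    smaller classes are upsampled with replacement.
    """
    rng = random.Random(seed)
    sizes = sorted(len(v) for v in examples_by_class.values())
    target = sizes[len(sizes) // 2]  # median class size
    balanced = {}
    for label, items in examples_by_class.items():
        if len(items) >= target:
            balanced[label] = rng.sample(items, target)
        else:
            balanced[label] = items + rng.choices(items, k=target - len(items))
    return balanced
```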
| Metric | Full Model | Quantized |
|---|---|---|
| Precision@1 | 0.831 | 0.796 |
| Recall@1 | 0.759 | 0.727 |
| Precision@2 | 0.491 | — |
| Recall@2 | 0.898 | — |
| Speed | ~500k pred/s | ~500k pred/s |
| Size | 1.1 GB | 151 MB |
| Register | Precision | Recall | F1 | Test Support |
|---|---|---|---|---|
| Informational | 0.910 | 0.666 | 0.769 | 108,672 |
| Narrative | 0.764 | 0.766 | 0.765 | 44,238 |
| Discussion | 0.640 | 0.774 | 0.701 | 7,420 |
| Persuasion | 0.553 | 0.794 | 0.652 | 19,193 |
| Opinion | 0.567 | 0.736 | 0.640 | 20,014 |
| HowTo | 0.515 | 0.766 | 0.616 | 7,281 |
| Spoken | 0.551 | 0.513 | 0.531 | 831 |
| Lyrical | 0.657 | 0.442 | 0.529 | 251 |
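The F1 column is the harmonic mean of precision and recall; for example, for the Informational row:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.910, 0.666), 3))  # -> 0.769
```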
"The company reported revenue of $4.2 billion..." -> Informational (1.00), Narrative (0.99)
"Once upon a time in a small village..." -> Narrative
"I honestly think this movie is terrible..." -> Opinion (1.00)
"To install the package, first run pip install..." -> HowTo (1.00)
"Buy now and save 50%! Limited time offer..." -> Persuasion (1.00)
"So like, I was telling her yesterday..." -> Spoken (1.00)
"I've been walking these streets alone..." -> Lyrical (1.00)
"Hey everyone! What do you think about..." -> Discussion (1.00)
"Introducing the revolutionary SkinGlow Pro..." -> Persuasion (1.00)
```bash
pip install huggingface_hub

# Download 10 English shards (~4 GB)
for i in $(seq 0 9); do
  hf download TurkuNLP/register_oscar \
    $(printf "en/en_%05d.jsonl.gz" $i) \
    --repo-type dataset --local-dir ./data
done
```
```bash
python scripts/prepare_data.py --data-dir ./data/en --output-dir ./prepared
```
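For context, fastText expects one training example per line, with labels first; a formatter along these lines (hypothetical, the real conversion lives in `scripts/prepare_data.py`) produces that format:

```python
def to_fasttext_line(text, labels):
    """Format a document as a fastText training line:
    '__label__IN __label__NA the text ...' (single line, labels first)."""
    tags = " ".join(f"__label__{l}" for l in labels)
    flat = " ".join(text.split())  # fastText requires one example per line
    return f"{tags} {flat}"

print(to_fasttext_line("Buy now!\nSave 50%.", ["IP"]))
# -> __label__IP Buy now! Save 50%.
```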
```bash
pip install fasttext-wheel "numpy<2"
python scripts/train.py --train ./prepared/train.txt --test ./prepared/test.txt --output ./model
```
```bash
# Interactive
python scripts/predict.py --model ./model/register_fasttext_q.bin

# Single text
python scripts/predict.py --model ./model/register_fasttext_q.bin --text "Buy now! 50% off!"

# Batch
python scripts/predict.py --model ./model/register_fasttext_q.bin --input texts.txt --output out.jsonl
```
If you use this model, please cite the source dataset:
```bibtex
@inproceedings{register_oscar,
  title={Multilingual register classification on the full OSCAR data},
  author={R{\"o}nnqvist, Samuel and others},
  year={2023},
  note={TurkuNLP, University of Turku}
}

@article{biber2018register,
  title={Register as a predictor of linguistic variation},
  author={Biber, Douglas and Egbert, Jesse},
  journal={Corpus Linguistics and Linguistic Theory},
  year={2018}
}
```
The model weights inherit the license of the source dataset (TurkuNLP/register_oscar). Scripts are released under MIT.