A FastText classifier that detects the communicative register (text type) of any English text at ~500k predictions/sec on CPU.
| Code | Register | Description | Example |
|---|---|---|---|
| IN | Informational | Factual, encyclopedic, descriptive | Wikipedia articles, reports |
| NA | Narrative | Story-like, temporal sequence of events | News stories, fiction, blog posts |
| OP | Opinion | Subjective evaluation, personal views | Reviews, editorials, comments |
| IP | Persuasion | Attempts to convince or sell | Marketing copy, ads, fundraising |
| HI | HowTo | Instructions, procedures, recipes | Tutorials, manuals, FAQs |
| ID | Discussion | Interactive, forum-style dialogue | Forum threads, Q&A, comments |
| SP | Spoken | Transcribed or spoken-style text | Interviews, podcasts, speeches |
| LY | Lyrical | Poetic, artistic, song-like | Poetry, song lyrics, creative prose |
Based on the Biber & Egbert (2018) register taxonomy. Multi-label supported (a text can be both Informational and Narrative).
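fastText returns label strings with a `__label__` prefix; a small helper (a sketch that simply mirrors the table above) decodes them into readable register names:

```python
# Map fastText label codes to the register names from the table above.
REGISTERS = {
    "IN": "Informational",
    "NA": "Narrative",
    "OP": "Opinion",
    "IP": "Persuasion",
    "HI": "HowTo",
    "ID": "Discussion",
    "SP": "Spoken",
    "LY": "Lyrical",
}

def decode_label(label: str) -> str:
    """Strip fastText's '__label__' prefix and look up the readable name."""
    return REGISTERS[label.removeprefix("__label__")]

print(decode_label("__label__IP"))  # -> Persuasion
```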
```python
import fasttext
from huggingface_hub import hf_hub_download

# Download the quantized model (151 MB)
model_path = hf_hub_download(
    "oneryalcin/text-register-fasttext-classifier",
    "register_fasttext_q.bin"
)
model = fasttext.load_model(model_path)

# Predict the top 3 registers
labels, probs = model.predict("Buy now and save 50%! Limited time offer!", k=3)
# labels -> ('__label__IP', ...), probs -> array([1.0, ...])  # IP = Persuasion
```
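Because the model is multi-label, a common pattern is to request all labels (`k=-1`) and keep those above a probability threshold. The helper below is a hypothetical sketch, not part of the shipped scripts, shown here with mock `predict()` output:

```python
def multilabel(labels, probs, threshold=0.5):
    """Keep every (label, prob) pair whose probability clears the threshold.

    `labels` and `probs` are the two parallel sequences returned by
    fastText's model.predict(text, k=-1).
    """
    return [(l, p) for l, p in zip(labels, probs) if p >= threshold]

# Example with mock predict() output:
labels = ("__label__IN", "__label__NA", "__label__OP")
probs = [0.97, 0.81, 0.04]
print(multilabel(labels, probs))
# -> [('__label__IN', 0.97), ('__label__NA', 0.81)]
```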
Note: if you get a NumPy error, pin `numpy<2`: `pip install "numpy<2"`
Trained on 10 English shards of TurkuNLP/register_oscar (~1.9M documents), with classes balanced to the median class size via oversampling and undersampling.
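As an illustration of the balancing step, the hypothetical helper below over/undersamples each class to the median class size (the real logic lives in `scripts/prepare_data.py` and may differ):

```python
import random

def balance_to_median(examples_by_class, seed=0):
    """Resample each class to the median class size.

    examples_by_class: dict mapping class label -> list of examples.
    Classes larger than the median are downsampled without replacement;
    smaller classes are upsampled with replacement.
    """
    rng = random.Random(seed)
    sizes = sorted(len(v) for v in examples_by_class.values())
    target = sizes[len(sizes) // 2]  # median class size
    balanced = {}
    for label, items in examples_by_class.items():
        if len(items) >= target:
            balanced[label] = rng.sample(items, target)
        else:
            balanced[label] = items + rng.choices(items, k=target - len(items))
    return balanced
```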
| Metric | Full Model | Quantized |
|---|---|---|
| Precision@1 | 0.831 | 0.796 |
| Recall@1 | 0.759 | 0.727 |
| Precision@2 | 0.491 | — |
| Recall@2 | 0.898 | — |
| Speed | ~500k pred/s | ~500k pred/s |
| Size | 1.1 GB | 151 MB |
| Register | Precision | Recall | F1 | Test Support |
|---|---|---|---|---|
| Informational | 0.910 | 0.666 | 0.769 | 108,672 |
| Narrative | 0.764 | 0.766 | 0.765 | 44,238 |
| Discussion | 0.640 | 0.774 | 0.701 | 7,420 |
| Persuasion | 0.553 | 0.794 | 0.652 | 19,193 |
| Opinion | 0.567 | 0.736 | 0.640 | 20,014 |
| HowTo | 0.515 | 0.766 | 0.616 | 7,281 |
| Spoken | 0.551 | 0.513 | 0.531 | 831 |
| Lyrical | 0.657 | 0.442 | 0.529 | 251 |
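The F1 column is the harmonic mean of precision and recall; for example, for the Informational row:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.910, 0.666), 3))  # -> 0.769
```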
"The company reported revenue of $4.2 billion..." -> Informational (1.00), Narrative (0.99)
"Once upon a time in a small village..." -> Narrative
"I honestly think this movie is terrible..." -> Opinion (1.00)
"To install the package, first run pip install..." -> HowTo (1.00)
"Buy now and save 50%! Limited time offer..." -> Persuasion (1.00)
"So like, I was telling her yesterday..." -> Spoken (1.00)
"I've been walking these streets alone..." -> Lyrical (1.00)
"Hey everyone! What do you think about..." -> Discussion (1.00)
"Introducing the revolutionary SkinGlow Pro..." -> Persuasion (1.00)
```bash
pip install huggingface_hub

# Download 10 English shards (~4 GB)
for i in $(seq 0 9); do
  hf download TurkuNLP/register_oscar \
    $(printf "en/en_%05d.jsonl.gz" $i) \
    --repo-type dataset --local-dir ./data
done
```
```bash
python scripts/prepare_data.py --data-dir ./data/en --output-dir ./prepared
```
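For context, fastText expects one training example per line, with labels first; a formatter along these lines (hypothetical, the real conversion lives in `scripts/prepare_data.py`) produces that format:

```python
def to_fasttext_line(text, labels):
    """Format a document as a fastText training line:
    '__label__IN __label__NA the text ...' (single line, labels first)."""
    tags = " ".join(f"__label__{l}" for l in labels)
    flat = " ".join(text.split())  # fastText requires one example per line
    return f"{tags} {flat}"

print(to_fasttext_line("Buy now!\nSave 50%.", ["IP"]))
# -> __label__IP Buy now! Save 50%.
```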
```bash
pip install fasttext-wheel "numpy<2"
python scripts/train.py --train ./prepared/train.txt --test ./prepared/test.txt --output ./model
```
```bash
# Interactive
python scripts/predict.py --model ./model/register_fasttext_q.bin

# Single text
python scripts/predict.py --model ./model/register_fasttext_q.bin --text "Buy now! 50% off!"

# Batch
python scripts/predict.py --model ./model/register_fasttext_q.bin --input texts.txt --output out.jsonl
```
If you use this model, please cite the source dataset:
```bibtex
@inproceedings{register_oscar,
  title={Multilingual register classification on the full OSCAR data},
  author={R{\"o}nnqvist, Samuel and others},
  year={2023},
  note={TurkuNLP, University of Turku}
}

@article{biber2018register,
  title={Register as a predictor of linguistic variation},
  author={Biber, Douglas and Egbert, Jesse},
  journal={Corpus Linguistics and Linguistic Theory},
  year={2018}
}
```
The model weights inherit the license of the source dataset (TurkuNLP/register_oscar). Scripts are released under MIT.