Hebrew Manuscript Joint NER v2

This repository contains the MHM Pipeline person NER model. The current checkpoint is the role-aware v3 model, which replaces the earlier custom two-head checkpoint. The repository and bundle name are unchanged for compatibility, so the repository ID still says v2.

The model is a DictaBERT token-classification checkpoint that predicts BIO labels with the person role encoded directly in the tag:

  • AUTHOR
  • TRANSCRIBER
  • OWNER
  • CENSOR
  • TRANSLATOR
  • COMMENTATOR
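
To make the tag scheme concrete, the sketch below builds a BIO label set from the six roles and decodes a tag sequence into role-labelled token spans. The label strings (B-AUTHOR, I-AUTHOR, O) and the helper name decode_bio are assumptions for illustration; the checkpoint's own mapping should be read from model.config.id2label.

```python
ROLES = ["AUTHOR", "TRANSCRIBER", "OWNER", "CENSOR", "TRANSLATOR", "COMMENTATOR"]

# Assumed label naming: one B-/I- pair per role plus the outside tag "O".
LABELS = ["O"] + [f"{prefix}-{role}" for role in ROLES for prefix in ("B", "I")]


def decode_bio(tags):
    """Turn a list of BIO tags into (start, end, role) token spans, end exclusive."""
    spans = []
    start, role = None, None
    for i, tag in enumerate(tags):
        prefix, _, tag_role = tag.partition("-")
        # An I- tag continues the open span only if the role matches.
        if prefix == "I" and role == tag_role:
            continue
        # Anything else closes the open span, if there is one.
        if role is not None:
            spans.append((start, i, role))
            start, role = None, None
        # B- opens a new span; a stray I- is treated as a span start as well.
        if prefix in ("B", "I"):
            start, role = i, tag_role
    if role is not None:
        spans.append((start, len(tags), role))
    return spans
```

For example, decode_bio(["O", "B-OWNER", "I-OWNER"]) returns [(1, 3, "OWNER")]. Because the role is part of the tag, no separate role classifier is needed after span extraction.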

Evaluation

Held-out v3 test split, 904 items:

Metric                             Score
strict span + role F1              0.8031
strict precision                   0.7888
strict recall                      0.8180
name-only F1                       0.8665
role accuracy when name matched    0.9269
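
As a quick consistency check, the strict F1 can be recomputed from the reported strict precision and recall using the standard harmonic-mean definition. The snippet uses only the numbers reported above; the variable names are illustrative.

```python
precision = 0.7888
recall = 0.8180

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.8031
```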

Per-role strict span+role F1:

Role           F1
AUTHOR         0.8678
CENSOR         0.8830
COMMENTATOR    0.5185
OWNER          0.7330
TRANSCRIBER    0.8112
TRANSLATOR     0.9072
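
The gap between the strict and name-only scores can be illustrated with a small matching sketch. It assumes that a strict match requires identical start, end, and role, while a name-only match ignores the role. The actual evaluation script is not shown in this card, so the function span_f1 and the toy data are hypothetical.

```python
def span_f1(gold, predicted, use_role=True):
    """Micro F1 over spans given as (start, end, role) tuples."""
    # Strict matching compares the full tuple; name-only drops the role.
    key = (lambda s: s) if use_role else (lambda s: s[:2])
    gold_keys = {key(s) for s in gold}
    pred_keys = {key(s) for s in predicted}
    true_positives = len(gold_keys & pred_keys)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_keys)
    recall = true_positives / len(gold_keys)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): the second prediction finds the name but not the role.
gold = [(17, 28, "TRANSCRIBER"), (40, 52, "OWNER")]
predicted = [(17, 28, "TRANSCRIBER"), (40, 52, "AUTHOR")]

strict = span_f1(gold, predicted, use_role=True)      # 0.5
name_only = span_f1(gold, predicted, use_role=False)  # 1.0
```

In this toy case the wrong role halves the strict score while the name-only score is unaffected, which is the kind of error the role accuracy figure isolates.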

Usage

from transformers import AutoModelForTokenClassification, AutoTokenizer

repo_id = "alexgoldberg/hebrew-manuscript-joint-ner-v2"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
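
After loading, per-token role tags can be read from the logits. This is a minimal sketch of plain transformers inference, assuming the checkpoint has a populated id2label mapping. The helper names best_labels and tag_tokens are hypothetical, and this is not the MHM Pipeline's own decoding logic.

```python
def best_labels(logit_rows, id2label):
    """Pick the highest-scoring label for each row of per-token logits."""
    return [id2label[max(range(len(row)), key=row.__getitem__)] for row in logit_rows]


def tag_tokens(text, tokenizer, model):
    """Return (token, label) pairs for one input string, special tokens included."""
    import torch  # imported lazily so best_labels stays dependency-free

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]  # shape: (num_tokens, num_labels)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    labels = best_labels(logits.tolist(), model.config.id2label)
    return list(zip(tokens, labels))
```

Calling tag_tokens(text, tokenizer, model) with the objects loaded above yields one pair per subword token. Merging subwords into whole names and character offsets is left to the pipeline class described next.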

In MHM Pipeline, use ner.inference_pipeline.JointNERPipeline. It preserves the legacy output schema:

from ner.inference_pipeline import JointNERPipeline

pipeline = JointNERPipeline("alexgoldberg/hebrew-manuscript-joint-ner-v2")
entities = pipeline.process_text("הספר נכתב על ידי משה בן יעקב.")

Example output:

[
  {
    "person": "משה בן יעקב",
    "role": "TRANSCRIBER",
    "confidence": 0.9918,
    "model_confidence": 0.9918,
    "start": 17,
    "end": 28
  }
]
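
Downstream code can sanity-check and filter the returned entities. The sketch below assumes that start and end are character offsets into the input string with an exclusive end, which is consistent with the example above. The keep_confident helper and the 0.9 threshold are illustrative choices, not part of the pipeline.

```python
text = "הספר נכתב על ידי משה בן יעקב."

entities = [
    {
        "person": "משה בן יעקב",
        "role": "TRANSCRIBER",
        "confidence": 0.9918,
        "model_confidence": 0.9918,
        "start": 17,
        "end": 28,
    }
]


def keep_confident(entities, text, threshold=0.9):
    """Keep entities above the threshold whose offsets reproduce the name."""
    kept = []
    for entity in entities:
        # Slice the input with the reported offsets (end assumed exclusive).
        surface = text[entity["start"]:entity["end"]]
        if entity["confidence"] >= threshold and surface == entity["person"]:
            kept.append(entity)
    return kept


for entity in keep_confident(entities, text):
    print(entity["role"], entity["person"])
```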

Notes

The previous custom checkpoint can be recovered from the Hub commit history. This version intentionally replaces keyword-based role classification with neural role-aware BIO labels.
