Hebrew Manuscript Joint NER v2

This repository contains the person NER model used by MHM Pipeline. The current checkpoint is the role-aware v3 model, which replaces the earlier custom two-head checkpoint. The repository and bundle name are unchanged (they still say "v2") so that existing references keep working.

The model is a DictaBERT token-classification checkpoint that predicts BIO labels with the person role encoded directly in the tag:

AUTHOR
TRANSCRIBER
OWNER
CENSOR
TRANSLATOR
COMMENTATOR
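
As a minimal sketch of what role-encoded BIO tags look like, the snippet below builds a label set from the six roles and groups a tagged token sequence into person spans. The `B-ROLE` / `I-ROLE` spelling, the label order, and the helper name `bio_to_spans` are assumptions for illustration; the checkpoint's actual label names are whatever `model.config.id2label` reports.

```python
ROLES = ["AUTHOR", "TRANSCRIBER", "OWNER", "CENSOR", "TRANSLATOR", "COMMENTATOR"]

# Assumed tag spelling: "O" plus a B- and an I- variant of each role (13 labels).
LABELS = ["O"] + [f"{prefix}-{role}" for role in ROLES for prefix in ("B", "I")]


def bio_to_spans(tags):
    """Group BIO tags into (start_token, end_token_exclusive, role) spans."""
    spans = []
    start, role = None, None
    for i, tag in enumerate(tags):
        prefix, _, tag_role = tag.partition("-")
        # An I- tag continues the open span only if the role matches.
        if prefix == "I" and start is not None and role == tag_role:
            continue
        # Anything else closes the open span, if there is one.
        if start is not None:
            spans.append((start, i, role))
            start, role = None, None
        # B- opens a new span; a stray I- is treated as a span start as well.
        if prefix in ("B", "I"):
            start, role = i, tag_role
    if start is not None:
        spans.append((start, len(tags), role))
    return spans


tags = ["O", "O", "O", "O", "B-TRANSCRIBER", "I-TRANSCRIBER", "I-TRANSCRIBER", "O"]
print(bio_to_spans(tags))  # [(4, 7, 'TRANSCRIBER')]
```

Because the role is part of the tag, a single decoding pass yields both the name boundaries and the role, with no separate role classifier.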

Evaluation

Held-out v3 test split, 904 items:

Metric	Score
strict span + role F1	0.8031
strict precision	0.7888
strict recall	0.8180
name-only F1	0.8665
role accuracy when name matched	0.9269
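
The relationship between these metrics can be sketched as follows. This is not the evaluation script used for the numbers above; it assumes that "strict" means an exact match on span boundaries and role, and that "name-only" ignores the role. The `score` helper and the toy spans are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score(gold, pred, with_role=True):
    """Exact-match precision, recall and F1 over (start, end, role) spans."""
    # Dropping the role turns the strict score into a name-only score.
    key = (lambda s: s) if with_role else (lambda s: s[:2])
    gold_set = {key(s) for s in gold}
    pred_set = {key(s) for s in pred}
    tp = len(gold_set & pred_set)
    p = tp / len(pred_set) if pred_set else 0.0
    r = tp / len(gold_set) if gold_set else 0.0
    return p, r, f1(p, r)


# The reported strict precision and recall reproduce the reported strict F1.
print(round(f1(0.7888, 0.8180), 4))  # 0.8031

# Toy data (not from the test split): one name is found with the wrong role.
gold = [(17, 28, "TRANSCRIBER"), (40, 50, "OWNER")]
pred = [(17, 28, "AUTHOR"), (40, 50, "OWNER")]
print(score(gold, pred, with_role=True))   # strict span + role: (0.5, 0.5, 0.5)
print(score(gold, pred, with_role=False))  # name-only: (1.0, 1.0, 1.0)
```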

Per-role strict span+role F1:

Role	F1
AUTHOR	0.8678
CENSOR	0.8830
COMMENTATOR	0.5185
OWNER	0.7330
TRANSCRIBER	0.8112
TRANSLATOR	0.9072
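
For a quick view of how uneven the per-role scores are, the unweighted mean of the six F1 values can be computed directly. This is a derived figure, not one reported above, and it weights every role equally regardless of how many examples each has in the test split.

```python
per_role_f1 = {
    "AUTHOR": 0.8678,
    "CENSOR": 0.8830,
    "COMMENTATOR": 0.5185,
    "OWNER": 0.7330,
    "TRANSCRIBER": 0.8112,
    "TRANSLATOR": 0.9072,
}

# Unweighted (macro) mean over roles and the role with the lowest score.
macro_f1 = sum(per_role_f1.values()) / len(per_role_f1)
weakest = min(per_role_f1, key=per_role_f1.get)
print(f"macro F1 = {macro_f1:.4f}, weakest role = {weakest}")
# macro F1 = 0.7868, weakest role = COMMENTATOR
```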

Usage

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the tokenizer and the token-classification model from the Hub.
repo_id = "alexgoldberg/hebrew-manuscript-joint-ner-v2"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
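
Loading the checkpoint is only the first step. The sketch below shows one way to get per-token labels out of it with plain `transformers`, assuming the repository ships a fast tokenizer (needed for `return_offsets_mapping`) and that the raw labels are usable without the post-processing that `JointNERPipeline` may apply. The helper names `labels_for_tokens` and `tag_text` are hypothetical.

```python
def labels_for_tokens(offsets, label_ids, id2label):
    """Pair each real token's character offsets with its predicted label."""
    tagged = []
    for (start, end), label_id in zip(offsets, label_ids):
        # Special tokens such as [CLS] and [SEP] come back with empty offsets.
        if start == end:
            continue
        tagged.append(((start, end), id2label[label_id]))
    return tagged


def tag_text(text, repo_id="alexgoldberg/hebrew-manuscript-joint-ner-v2"):
    """Run the checkpoint on one string and return per-token labels."""
    # Imported lazily so the helper above can be used without torch installed.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForTokenClassification.from_pretrained(repo_id)
    model.eval()

    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt", truncation=True)
    # The model does not accept offsets, so remove them before the forward pass.
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        logits = model(**enc).logits
    label_ids = logits.argmax(dim=-1)[0].tolist()
    return labels_for_tokens(offsets, label_ids, model.config.id2label)
```

Calling `tag_text("הספר נכתב על ידי משה בן יעקב.")` downloads the checkpoint on first use and returns a list of `((start, end), label)` pairs, one per subword token.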

In MHM Pipeline, use ner.inference_pipeline.JointNERPipeline. It preserves the legacy output schema:

from ner.inference_pipeline import JointNERPipeline

# process_text returns a list of entity dicts in the legacy schema.
pipeline = JointNERPipeline("alexgoldberg/hebrew-manuscript-joint-ner-v2")
entities = pipeline.process_text("הספר נכתב על ידי משה בן יעקב.")

Example output:

[
  {
    "person": "משה בן יעקב",
    "role": "TRANSCRIBER",
    "confidence": 0.9918,
    "model_confidence": 0.9918,
    "start": 17,
    "end": 28
  }
]
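
In this example, `start` and `end` behave like Python string indices into the input text, with `end` exclusive. That reading is inferred from this single example, so it is worth checking against your own data. A minimal sketch of consuming the output, with a hypothetical confidence threshold of 0.9:

```python
text = "הספר נכתב על ידי משה בן יעקב."
entities = [
    {
        "person": "משה בן יעקב",
        "role": "TRANSCRIBER",
        "confidence": 0.9918,
        "model_confidence": 0.9918,
        "start": 17,
        "end": 28,
    }
]

for ent in entities:
    # Slicing the input with the offsets should give back the person string.
    surface = text[ent["start"]:ent["end"]]
    assert surface == ent["person"]
    print(ent["role"], surface, ent["confidence"])

# Hypothetical threshold: keep only predictions the model is confident about.
MIN_CONFIDENCE = 0.9
confident = [e for e in entities if e["confidence"] >= MIN_CONFIDENCE]
```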

Notes

The previous custom checkpoint can be recovered from the Hub commit history. This version intentionally replaces keyword-based role classification with neural role-aware BIO labels.
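
One way to retrieve an older revision is to download a snapshot of the repository pinned to a specific commit. The sketch below uses `huggingface_hub.snapshot_download`; the `revision` value must be a commit hash taken from the repository's commit history and is deliberately left for the caller to supply. The function name is hypothetical. Since the earlier checkpoint used a custom two-head architecture, it may not load with `AutoModelForTokenClassification`, so the sketch only fetches the files.

```python
def fetch_checkpoint_at(revision, repo_id="alexgoldberg/hebrew-manuscript-joint-ner-v2"):
    """Download the repository files as they were at a given commit."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    # Returns the local directory that holds the downloaded snapshot.
    return snapshot_download(repo_id=repo_id, revision=revision)
```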

Model size: 0.2B params (Safetensors, F32)
Base model: dicta-il/dictabert (this model is a fine-tune)