# Retraining Stanza to optimize dependency parsing on a diachronic Swedish corpus
This repository contains Stanza BiLSTM models retrained on different combinations of UD treebanks relevant to historical Swedish. Models prefixed `conll17_` are trained with static embeddings; models prefixed `transformer_` are trained with dynamic embeddings from the transformer encoder `KBLab/bert-base-swedish-cased`.
## LAS Scores
LAS scores for the models are computed against a gold set of 109 manually annotated sentences divided into five periods. For the models trained on static vectors, only the overall test score is given:
### Models with static vector embeddings (conll17.pt)
| Languages | LAS |
|---|---|
| Swedish (with diachronic) | 61.95 |
| Icelandic (PUD) | 61.49 |
| German (LIT) | 61.43 |
| Icelandic (GC) | 61.43 |
| Bokmaal, Danish | 60.13 |
| Nynorsk | 50.46 |
| Swedish (without diachronic) | 50.34 |
| Icelandic (Modern) | 46.47 |
| Bokmaal | 45.96 |
| Icelandic (IcePaHC) | 44.60 |
For the transformer-fed models, more fine-grained scores for each period are given as a histogram. The model transformer_seen_gold_no_silver.pt saw the gold set during training, so no score is reported for it, but it is intuitively the best model. As a benchmark, an "out-of-the-box" Stanza model trained only on Talbanken is included.
| Checkpoint | Embedding Type | Training Mix | Silver Data | Eval Set | LAS | Notes |
|---|---|---|---|---|---|---|
| transformer_seen_gold_no_silver.pt | transformer (KBLab/bert-base-swedish-cased) | seen gold | no | digphil_gold_109 | n/a | trained on gold set; score not directly comparable |
| transformer_not_seen_gold.pt | transformer (KBLab/bert-base-swedish-cased) | not-seen gold | yes | digphil_gold_109 | 71.2 | |
| transformer_not_seen_gold_no_silver.pt | transformer (KBLab/bert-base-swedish-cased) | not-seen gold | no | digphil_gold_109 | 75.0 | |
| conll17_baseline_sv_only.pt | static vectors | sv only (no diachronic) | no | digphil_gold_109 | 50.34 | |
| conll17_bm.pt | static vectors | sv + diachronic + bm | no | digphil_gold_109 | 45.96 | |
| conll17_sv_diachron.pt | static vectors | sv + diachronic | no | digphil_gold_109 | 61.95 | top static model |
| conll17_icepahc.pt | static vectors | sv + diachronic + icepahc | no | digphil_gold_109 | 44.60 | |
| conll17_is-modern.pt | static vectors | sv + diachronic + is-modern | no | digphil_gold_109 | 46.47 | |
| conll17_isPUD-pahc-gc.pt | static vectors | sv + diachronic + isPUD-pahc-gc | no | digphil_gold_109 | 61.43 | |
| conll17_isPUD.pt | static vectors | sv + diachronic + isPUD | no | digphil_gold_109 | 61.49 | |
| conll17_nn.pt | static vectors | sv + diachronic + nn | no | digphil_gold_109 | 50.46 | |
| conll17_de_lit.pt | static vectors | sv + diachronic + de_lit | no | digphil_gold_109 | 61.43 | |
| conll17_bm_dk.pt | static vectors | sv + diachronic + bm + dk | no | digphil_gold_109 | 60.13 | |
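LAS here is the labeled attachment score: the percentage of tokens whose predicted head and dependency label both match the gold annotation. As an illustration, a minimal sketch of how it can be computed from two aligned CoNLL-U strings (this is a hypothetical helper, not necessarily the scorer used for the numbers above):

```python
def las(gold_conllu: str, pred_conllu: str) -> float:
    """Labeled attachment score: the percentage of tokens whose
    predicted HEAD and DEPREL both match the gold annotation."""
    def arcs(text):
        out = []
        for line in text.splitlines():
            if not line.strip() or line.startswith("#"):
                continue
            cols = line.split("\t")
            # Skip multiword-token ranges (IDs like "1-2") and empty
            # nodes ("1.1"), which carry no HEAD/DEPREL of their own.
            if "-" in cols[0] or "." in cols[0]:
                continue
            out.append((cols[6], cols[7]))  # (HEAD, DEPREL) columns
        return out

    gold, pred = arcs(gold_conllu), arcs(pred_conllu)
    if len(gold) != len(pred):
        raise ValueError("gold and predicted files must align token-for-token")
    correct = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * correct / len(gold)
```

To score whole files, pass `Path("gold.conllu").read_text()` and the corresponding prediction text.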
## Inference
Example of how the models can be run:
```python
import gc
import os
import time
from pathlib import Path

import torch
from tqdm import tqdm

import stanza
from stanza.utils.conll import CoNLL

# Pick the best available accelerator.
if torch.xpu.is_available():
    device = torch.device("xpu")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

###################################################################################################
##### SETTINGS ####################################################################################
###################################################################################################
vanilla = False
conllu_in_dir = Path("YOUR/CONLLUS/")
conllu_out_dir = Path("OUT/DIR")
depparse_model_path = "transformer_seen_gold_no_silver.pt"

os.makedirs(conllu_out_dir, exist_ok=True)

##### PREPARE #####################################################################################
if vanilla:
    print("Using vanilla Swedish Stanza pipeline with default models.")
    nlp = stanza.Pipeline(
        "sv",
        processors="tokenize,pos,lemma,depparse",
        tokenize_pretokenized=True,  # to keep original tokens
        use_gpu=True,
        pos_batch_size=3000,
        package=None,
        device=device,
    )
else:
    nlp = stanza.Pipeline(
        "sv",
        processors="tokenize,pos,lemma,depparse",
        tokenize_pretokenized=True,  # to keep original tokens
        use_gpu=True,
        pos_batch_size=3000,
        package=None,
        download_method=None,
        # depparse_model_path becomes model_path inside the depparse processor
        # config; passing it explicitly controls which parser model is loaded.
        depparse_model_path=depparse_model_path,
        device=device,
    )

##### INFERENCE ###################################################################################
t0 = time.time()
total_sentences = 0

for fname in tqdm(sorted(os.listdir(conllu_in_dir)), desc="Files"):
    if not fname.endswith(".conllu"):
        continue
    in_path = conllu_in_dir / fname
    out_path = conllu_out_dir / fname
    if out_path.exists():
        print(f"File already parsed, skipping: {out_path}")
        continue

    doc = CoNLL.conll2doc(in_path)
    total_sentences += len(doc.sentences)

    with torch.inference_mode():
        parsed_doc = nlp(doc)

    with open(out_path, "w", encoding="utf-8") as f:
        CoNLL.write_doc2conll(parsed_doc, f)

    # Release per-file objects and ask both Python and the torch allocator to
    # reclaim memory that is no longer needed.
    del doc
    del parsed_doc
    gc.collect()
    if device.type == "cuda":
        torch.cuda.empty_cache()
        memory_max = torch.cuda.max_memory_allocated()
        print(f"Max memory allocated so far: {memory_max / (1024 ** 3):.2f} GB")
    elif device.type == "xpu":
        torch.xpu.empty_cache()
        memory_max = torch.xpu.max_memory_allocated()
        print(f"Max memory allocated so far: {memory_max / (1024 ** 3):.2f} GB")

t = time.time() - t0

###################################################################################################
if device.type == "cuda":
    memory_max = torch.cuda.max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()
elif device.type == "xpu":
    memory_max = torch.xpu.max_memory_allocated()
    torch.xpu.reset_peak_memory_stats()
else:
    memory_max = 0

print(
    "\nFinished parsing.\n"
    f"Total sentences parsed: {total_sentences}\n"
    f"Total time: {t / 60:.2f} minutes\n"
    f"Max memory allocated: {memory_max / (1024 ** 3):.2f} GB\n"
)
```
## Training args
Full list of training args:
```
batch_size: 32
bert_finetune: False
bert_finetune_layers: None
bert_hidden_layers: 4
bert_learning_rate: 1.0
bert_model: KBLab/bert-base-swedish-cased
bert_start_finetuning: 200
bert_warmup_steps: 200
bert_weight_decay: 0.0
beta2: 0.999
char: True
char_emb_dim: 100
char_hidden_dim: 400
char_num_layers: 1
char_rec_dropout: 0
charlm: True
charlm_backward_file: /home/urdatorn/stanza_resources/sv/backward_charlm/conll17.pt
charlm_forward_file: /home/urdatorn/stanza_resources/sv/forward_charlm/conll17.pt
charlm_save_dir: saved_models/charlm
charlm_shorthand: sv_conll17
checkpoint: True
checkpoint_interval: 500
checkpoint_save_name: None
continue_from: None
data_dir: data/depparse
deep_biaff_hidden_dim: 400
deep_biaff_output_dim: 160
device: xpu:0
distance: True
dropout: 0.33
eval_file: /home/urdatorn/git/stanza-digphil/data/depparse/sv_diachronic.dev.in.conllu
eval_interval: 100
gold_labels: True
hidden_dim: 400
lang: sv
linearization: True
log_norms: False
log_step: 20
lora_alpha: 128
lora_dropout: 0.1
lora_modules_to_save: []
lora_rank: 64
lora_target_modules: ['query', 'value', 'output.dense', 'intermediate.dense']
lr: 2.0
max_grad_norm: 1.0
max_steps: 50000
max_steps_before_stop: 2000
mode: train
model_type: graph
num_layers: 3
optim: adadelta
output_file: None
output_latex: False
pretrain: True
pretrain_max_vocab: 250000
rec_dropout: 0
reversed: False
sample_train: 1.0
save_dir: saved_models/depparse
save_name: {shorthand}_{embedding}_parser.pt
second_batch_size: None
second_bert_learning_rate: 0.001
second_lr: 0.0002
second_optim: adam
second_optim_start_step: 10000
second_warmup_steps: 200
seed: 1234
shorthand: sv_diachronic
silver_file: None
silver_weight: 0.5
tag_emb_dim: 50
train_file: /home/urdatorn/git/stanza-digphil/data/depparse/sv_diachronic.train.in.conllu
train_size: None
transformed_dim: 125
transition_embedding_dim: 20
transition_hidden_dim: 20
transition_merge_hidden_dim: 200
transition_subtree_combination: SubtreeCombination.NONE
transition_subtree_nonlinearity: none
use_arc_embedding: False
use_peft: False
use_ufeats: True
use_upos: True
use_xpos: True
wandb: False
wandb_name: None
weight_decay: 1e-05
word_cutoff: 7
word_dropout: 0.33
word_emb_dim: 75
wordvec_dir: /home/urdatorn/stanza_resources/sv/pretrain
wordvec_file: None
wordvec_pretrain_file: /home/urdatorn/stanza_resources/sv/pretrain/conll17.pt
```
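These arguments correspond to Stanza's dependency parser trainer (`stanza.models.parser`). As a rough sketch, a training run with a subset of the settings above might look like the command below; the paths are placeholders and the exact flag set depends on your Stanza version:

```shell
python -m stanza.models.parser \
    --mode train \
    --lang sv \
    --shorthand sv_diachronic \
    --train_file data/depparse/sv_diachronic.train.in.conllu \
    --eval_file data/depparse/sv_diachronic.dev.in.conllu \
    --wordvec_pretrain_file ~/stanza_resources/sv/pretrain/conll17.pt \
    --bert_model KBLab/bert-base-swedish-cased \
    --batch_size 32 \
    --max_steps 50000 \
    --save_dir saved_models/depparse
```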
## Model tree for al1808th/stanza-digphil

Base model: KBLab/bert-base-swedish-cased

## Evaluation results

- LAS on DigPhil Gold (109 sentences): 75.000 (self-reported)