Retraining Stanza to optimize dependency parsing on a diachronic Swedish corpus

This repository contains Stanza BiLSTM models retrained on different combinations of UD treebanks relevant to historical Swedish. The models prefixed conll17_ are trained with static embeddings, and the models prefixed transformer_ are trained with dynamic embeddings from the transformer encoder "KBLab/bert-base-swedish-cased".

LAS Scores

LAS scores for the models are computed against a gold set of 109 manually annotated sentences divided into five periods. For the models trained on static vectors, only the overall test score is given:

Models with static vector embeddings (conll17.pt)

| Languages | LAS |
|---|---|
| Swedish (with diachronic) | 61.95 |
| Icelandic (PUD) | 61.49 |
| German (LIT) | 61.43 |
| Icelandic (GC) | 61.43 |
| Bokmaal, Danish | 60.13 |
| Nynorsk | 50.46 |
| Swedish (without diachronic) | 50.34 |
| Icelandic (Modern) | 46.47 |
| Bokmaal | 45.96 |
| Icelandic (IcePaHC) | 44.60 |

For the transformer-fed models, more fine-grained per-period scores are given as a histogram. The model transformer_seen_gold_no_silver.pt was given the gold set during training and hence has no comparable score, but is expected to be the strongest model. As a benchmark, an "out-of-the-box" Stanza pipeline trained only on Talbanken is included.

| Checkpoint | Embedding Type | Training Mix | Silver Data | Eval Set | LAS | Notes |
|---|---|---|---|---|---|---|
| transformer_seen_gold_no_silver.pt | transformer (KBLab/bert-base-swedish-cased) | seen gold | no | digphil_gold_109 | n/a | trained on gold set; score not directly comparable |
| transformer_not_seen_gold.pt | transformer (KBLab/bert-base-swedish-cased) | not-seen gold | yes | digphil_gold_109 | 71.2 | |
| transformer_not_seen_gold_no_silver.pt | transformer (KBLab/bert-base-swedish-cased) | not-seen gold | no | digphil_gold_109 | 75.0 | |
| conll17_baseline_sv_only.pt | static vectors | sv only (no diachronic) | no | digphil_gold_109 | 50.34 | |
| conll17_bm.pt | static vectors | sv + diachronic + bm | no | digphil_gold_109 | 45.96 | |
| conll17_sv_diachron.pt | static vectors | sv + diachronic | no | digphil_gold_109 | 61.95 | top static model |
| conll17_icepahc.pt | static vectors | sv + diachronic + icepahc | no | digphil_gold_109 | 44.60 | |
| conll17_is-modern.pt | static vectors | sv + diachronic + is-modern | no | digphil_gold_109 | 46.47 | |
| conll17_isPUD-pahc-gc.pt | static vectors | sv + diachronic + isPUD-pahc-gc | no | digphil_gold_109 | 61.43 | |
| conll17_isPUD.pt | static vectors | sv + diachronic + isPUD | no | digphil_gold_109 | 61.49 | |
| conll17_nn.pt | static vectors | sv + diachronic + nn | no | digphil_gold_109 | 50.46 | |
| conll17_de_lit.pt | static vectors | sv + diachronic + de_lit | no | digphil_gold_109 | 61.43 | |
| conll17_bm_dk.pt | static vectors | sv + diachronic + bm + dk | no | digphil_gold_109 | 60.13 | |

Inference

Example of how the models can be run:

import os
from pathlib import Path
import stanza
from stanza.utils.conll import CoNLL
import time
import gc
import torch
from tqdm import tqdm

# torch.xpu only exists in recent torch builds with Intel GPU support
if hasattr(torch, "xpu") and torch.xpu.is_available():
    device = torch.device("xpu")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

###################################################################################################
##### SETTINGS ####################################################################################
###################################################################################################

vanilla = False

conllu_in_dir = Path("YOUR/CONLLUS/")
conllu_out_dir = Path("OUT/DIR")

depparse_model_path = "transformer_seen_gold_no_silver.pt"

os.makedirs(conllu_out_dir, exist_ok=True)

##### PREPARE #####################################################################################

if vanilla:
    print("Using vanilla Swedish Stanza pipeline with default models.")
    nlp = stanza.Pipeline(
        "sv",
        processors="tokenize,pos,lemma,depparse",
        tokenize_pretokenized=True, # to keep original tokens 
        use_gpu=True,
        pos_batch_size=3000,
        package=None,
        device=device,
    )

else:
    nlp = stanza.Pipeline(
        "sv",
        processors="tokenize,pos,lemma,depparse",
        tokenize_pretokenized=True, # to keep original tokens 
        use_gpu=True,
        pos_batch_size=3000,
        package=None,
        download_method=None,
        # depparse_model_path becomes model_path inside the depparse processor
        # config; passing it explicitly loads the retrained parser weights
        # instead of the default Stanza model.
        depparse_model_path=depparse_model_path,
        device=device,
    )

##### INFERENCE ###################################################################################

t0 = time.time()
total_sentences = 0

for fname in tqdm(sorted(os.listdir(conllu_in_dir)), desc="Files"):
    if not fname.endswith(".conllu"):
        continue

    in_path = conllu_in_dir / fname
    out_path = conllu_out_dir / fname
    
    if out_path.exists():
        print(f"File already parsed, skipping: {out_path}")
        continue

    doc = CoNLL.conll2doc(in_path)

    total_sentences += len(doc.sentences)

    with torch.inference_mode():
        parsed_doc = nlp(doc)

    with open(out_path, "w", encoding="utf-8") as f:
        CoNLL.write_doc2conll(parsed_doc, f)

    # Release per-file objects and ask both Python and the torch allocator to
    # reclaim memory that is no longer needed.
    del doc
    del parsed_doc
    gc.collect()
    if device.type == "cuda":
        torch.cuda.empty_cache()
        memory_max = torch.cuda.max_memory_allocated()
        print(f"Max memory allocated so far: {memory_max / (1024 ** 3):.2f} GB")
    elif device.type == "xpu":
        torch.xpu.empty_cache()
        memory_max = torch.xpu.max_memory_allocated()
        print(f"Max memory allocated so far: {memory_max / (1024 ** 3):.2f} GB")

t = time.time() - t0

###################################################################################################

if device.type == "cuda":
    memory_max = torch.cuda.max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()
elif device.type == "xpu":
    memory_max = torch.xpu.max_memory_allocated()
    torch.xpu.reset_peak_memory_stats()
else:
    memory_max = 0

print(
    "\nFinished parsing.\n"
    f"Total sentences parsed: {total_sentences}\n"
    f"Total time: {t / 60:.2f} minutes\n"
    f"Max memory allocated: {memory_max / (1024 ** 3):.2f} GB\n"
)

Training args

Full list of training args:

batch_size: 32
bert_finetune: False
bert_finetune_layers: None
bert_hidden_layers: 4
bert_learning_rate: 1.0
bert_model: KBLab/bert-base-swedish-cased
bert_start_finetuning: 200
bert_warmup_steps: 200
bert_weight_decay: 0.0
beta2: 0.999
char: True
char_emb_dim: 100
char_hidden_dim: 400
char_num_layers: 1
char_rec_dropout: 0
charlm: True
charlm_backward_file: /home/urdatorn/stanza_resources/sv/backward_charlm/conll17.pt
charlm_forward_file: /home/urdatorn/stanza_resources/sv/forward_charlm/conll17.pt
charlm_save_dir: saved_models/charlm
charlm_shorthand: sv_conll17
checkpoint: True
checkpoint_interval: 500
checkpoint_save_name: None
continue_from: None
data_dir: data/depparse
deep_biaff_hidden_dim: 400
deep_biaff_output_dim: 160
device: xpu:0
distance: True
dropout: 0.33
eval_file: /home/urdatorn/git/stanza-digphil/data/depparse/sv_diachronic.dev.in.conllu
eval_interval: 100
gold_labels: True
hidden_dim: 400
lang: sv
linearization: True
log_norms: False
log_step: 20
lora_alpha: 128
lora_dropout: 0.1
lora_modules_to_save: []
lora_rank: 64
lora_target_modules: ['query', 'value', 'output.dense', 'intermediate.dense']
lr: 2.0
max_grad_norm: 1.0
max_steps: 50000
max_steps_before_stop: 2000
mode: train
model_type: graph
num_layers: 3
optim: adadelta
output_file: None
output_latex: False
pretrain: True
pretrain_max_vocab: 250000
rec_dropout: 0
reversed: False
sample_train: 1.0
save_dir: saved_models/depparse
save_name: {shorthand}_{embedding}_parser.pt
second_batch_size: None
second_bert_learning_rate: 0.001
second_lr: 0.0002
second_optim: adam
second_optim_start_step: 10000
second_warmup_steps: 200
seed: 1234
shorthand: sv_diachronic
silver_file: None
silver_weight: 0.5
tag_emb_dim: 50
train_file: /home/urdatorn/git/stanza-digphil/data/depparse/sv_diachronic.train.in.conllu
train_size: None
transformed_dim: 125
transition_embedding_dim: 20
transition_hidden_dim: 20
transition_merge_hidden_dim: 200
transition_subtree_combination: SubtreeCombination.NONE
transition_subtree_nonlinearity: none
use_arc_embedding: False
use_peft: False
use_ufeats: True
use_upos: True
use_xpos: True
wandb: False
wandb_name: None
weight_decay: 1e-05
word_cutoff: 7
word_dropout: 0.33
word_emb_dim: 75
wordvec_dir: /home/urdatorn/stanza_resources/sv/pretrain
wordvec_file: None
wordvec_pretrain_file: /home/urdatorn/stanza_resources/sv/pretrain/conll17.pt
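
These args correspond to Stanza's parser training entry point. A hedged sketch of the invocation (only the key flags shown; the paths are the author's and must be adjusted):

```shell
python -m stanza.models.parser \
    --mode train \
    --lang sv \
    --shorthand sv_diachronic \
    --train_file data/depparse/sv_diachronic.train.in.conllu \
    --eval_file data/depparse/sv_diachronic.dev.in.conllu \
    --wordvec_pretrain_file ~/stanza_resources/sv/pretrain/conll17.pt \
    --bert_model KBLab/bert-base-swedish-cased \
    --batch_size 32 \
    --max_steps 50000 \
    --save_dir saved_models/depparse
```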