# Retraining Stanza to optimize dependency parsing on a diachronic Swedish corpus
This repository contains Stanza BiLSTM models retrained on different combinations of UD treebanks relevant to historical Swedish. Models prefixed `conll17_` are trained with static embeddings; models prefixed `transformer_` are trained with dynamic embeddings from the transformer encoder `KBLab/bert-base-swedish-cased`.
## LAS Scores
LAS scores for the models are computed against a gold set of 109 manually annotated sentences divided into five periods. For the models trained on static vectors, only the overall test score is given:
### Models with static vector embeddings (conll17.pt)
| Languages | LAS |
|---|---|
| Swedish (with diachronic) | 61.95 |
| Icelandic (PUD) | 61.49 |
| German (LIT) | 61.43 |
| Icelandic (GC) | 61.43 |
| Bokmaal, Danish | 60.13 |
| Nynorsk | 50.46 |
| Swedish (without diachronic) | 50.34 |
| Icelandic (Modern) | 46.47 |
| Bokmaal | 45.96 |
| Icelandic (IcePaHC) | 44.60 |
For the transformer-fed models, more fine-grained scores for each period are given as a histogram. The model transformer_seen_gold_no_silver.pt saw the gold set during training, so no score is reported for it, but it is intuitively the best model. As a benchmark, an "out-of-the-box" Stanza model trained only on Talbanken is included.
| Checkpoint | Embedding Type | Training Mix | Silver Data | Eval Set | LAS | Notes |
|---|---|---|---|---|---|---|
| transformer_seen_gold_no_silver.pt | transformer (KBLab/bert-base-swedish-cased) | seen gold | no | digphil_gold_109 | n/a | trained on gold set; score not directly comparable |
| transformer_not_seen_gold.pt | transformer (KBLab/bert-base-swedish-cased) | not-seen gold | yes | digphil_gold_109 | 71.2 | |
| transformer_not_seen_gold_no_silver.pt | transformer (KBLab/bert-base-swedish-cased) | not-seen gold | no | digphil_gold_109 | 75.0 | |
| conll17_baseline_sv_only.pt | static vectors | sv only (no diachronic) | no | digphil_gold_109 | 50.34 | |
| conll17_bm.pt | static vectors | sv + diachronic + bm | no | digphil_gold_109 | 45.96 | |
| conll17_sv_diachron.pt | static vectors | sv + diachronic | no | digphil_gold_109 | 61.95 | top static model |
| conll17_icepahc.pt | static vectors | sv + diachronic + icepahc | no | digphil_gold_109 | 44.60 | |
| conll17_is-modern.pt | static vectors | sv + diachronic + is-modern | no | digphil_gold_109 | 46.47 | |
| conll17_isPUD-pahc-gc.pt | static vectors | sv + diachronic + isPUD-pahc-gc | no | digphil_gold_109 | 61.43 | |
| conll17_isPUD.pt | static vectors | sv + diachronic + isPUD | no | digphil_gold_109 | 61.49 | |
| conll17_nn.pt | static vectors | sv + diachronic + nn | no | digphil_gold_109 | 50.46 | |
| conll17_de_lit.pt | static vectors | sv + diachronic + de_lit | no | digphil_gold_109 | 61.43 | |
| conll17_bm_dk.pt | static vectors | sv + diachronic + bm + dk | no | digphil_gold_109 | 60.13 | |
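LAS here is the labeled attachment score: the percentage of tokens whose predicted head and dependency label both match the gold annotation. As an illustration, a minimal sketch of how it can be computed from two aligned CoNLL-U strings (this is a hypothetical helper, not necessarily the scorer used for the numbers above):

```python
def las(gold_conllu: str, pred_conllu: str) -> float:
    """Labeled attachment score: the percentage of tokens whose
    predicted HEAD and DEPREL both match the gold annotation."""
    def arcs(text):
        out = []
        for line in text.splitlines():
            if not line.strip() or line.startswith("#"):
                continue
            cols = line.split("\t")
            # Skip multiword-token ranges (IDs like "1-2") and empty
            # nodes ("1.1"), which carry no HEAD/DEPREL of their own.
            if "-" in cols[0] or "." in cols[0]:
                continue
            out.append((cols[6], cols[7]))  # (HEAD, DEPREL) columns
        return out

    gold, pred = arcs(gold_conllu), arcs(pred_conllu)
    if len(gold) != len(pred):
        raise ValueError("gold and predicted files must align token-for-token")
    correct = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * correct / len(gold)
```

To score whole files, pass `Path("gold.conllu").read_text()` and the corresponding prediction text.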
## Inference
Example of how the models can be run:
```python
import gc
import os
import time
from pathlib import Path

import torch
from tqdm import tqdm

import stanza
from stanza.utils.conll import CoNLL

# Pick the best available accelerator.
if torch.xpu.is_available():
    device = torch.device("xpu")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

###################################################################################################
##### SETTINGS ####################################################################################
###################################################################################################
vanilla = False
conllu_in_dir = Path("YOUR/CONLLUS/")
conllu_out_dir = Path("OUT/DIR")
depparse_model_path = "transformer_seen_gold_no_silver.pt"

os.makedirs(conllu_out_dir, exist_ok=True)

##### PREPARE #####################################################################################
if vanilla:
    print("Using vanilla Swedish Stanza pipeline with default models.")
    nlp = stanza.Pipeline(
        "sv",
        processors="tokenize,pos,lemma,depparse",
        tokenize_pretokenized=True,  # to keep original tokens
        use_gpu=True,
        pos_batch_size=3000,
        package=None,
        device=device,
    )
else:
    nlp = stanza.Pipeline(
        "sv",
        processors="tokenize,pos,lemma,depparse",
        tokenize_pretokenized=True,  # to keep original tokens
        use_gpu=True,
        pos_batch_size=3000,
        package=None,
        download_method=None,
        # depparse_model_path becomes model_path inside the depparse processor
        # config; passing it explicitly controls which parser model is loaded.
        depparse_model_path=depparse_model_path,
        device=device,
    )

##### INFERENCE ###################################################################################
t0 = time.time()
total_sentences = 0

for fname in tqdm(sorted(os.listdir(conllu_in_dir)), desc="Files"):
    if not fname.endswith(".conllu"):
        continue
    in_path = conllu_in_dir / fname
    out_path = conllu_out_dir / fname
    if out_path.exists():
        print(f"File already parsed, skipping: {out_path}")
        continue

    doc = CoNLL.conll2doc(in_path)
    total_sentences += len(doc.sentences)

    with torch.inference_mode():
        parsed_doc = nlp(doc)

    with open(out_path, "w", encoding="utf-8") as f:
        CoNLL.write_doc2conll(parsed_doc, f)

    # Release per-file objects and ask both Python and the torch allocator to
    # reclaim memory that is no longer needed.
    del doc
    del parsed_doc
    gc.collect()
    if device.type == "cuda":
        torch.cuda.empty_cache()
        memory_max = torch.cuda.max_memory_allocated()
        print(f"Max memory allocated so far: {memory_max / (1024 ** 3):.2f} GB")
    elif device.type == "xpu":
        torch.xpu.empty_cache()
        memory_max = torch.xpu.max_memory_allocated()
        print(f"Max memory allocated so far: {memory_max / (1024 ** 3):.2f} GB")

t = time.time() - t0

###################################################################################################
if device.type == "cuda":
    memory_max = torch.cuda.max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()
elif device.type == "xpu":
    memory_max = torch.xpu.max_memory_allocated()
    torch.xpu.reset_peak_memory_stats()
else:
    memory_max = 0

print(
    "\nFinished parsing.\n"
    f"Total sentences parsed: {total_sentences}\n"
    f"Total time: {t / 60:.2f} minutes\n"
    f"Max memory allocated: {memory_max / (1024 ** 3):.2f} GB\n"
)
```
## Training args
Full list of training args:
```
batch_size: 32
bert_finetune: False
bert_finetune_layers: None
bert_hidden_layers: 4
bert_learning_rate: 1.0
bert_model: KBLab/bert-base-swedish-cased
bert_start_finetuning: 200
bert_warmup_steps: 200
bert_weight_decay: 0.0
beta2: 0.999
char: True
char_emb_dim: 100
char_hidden_dim: 400
char_num_layers: 1
char_rec_dropout: 0
charlm: True
charlm_backward_file: /home/urdatorn/stanza_resources/sv/backward_charlm/conll17.pt
charlm_forward_file: /home/urdatorn/stanza_resources/sv/forward_charlm/conll17.pt
charlm_save_dir: saved_models/charlm
charlm_shorthand: sv_conll17
checkpoint: True
checkpoint_interval: 500
checkpoint_save_name: None
continue_from: None
data_dir: data/depparse
deep_biaff_hidden_dim: 400
deep_biaff_output_dim: 160
device: xpu:0
distance: True
dropout: 0.33
eval_file: /home/urdatorn/git/stanza-digphil/data/depparse/sv_diachronic.dev.in.conllu
eval_interval: 100
gold_labels: True
hidden_dim: 400
lang: sv
linearization: True
log_norms: False
log_step: 20
lora_alpha: 128
lora_dropout: 0.1
lora_modules_to_save: []
lora_rank: 64
lora_target_modules: ['query', 'value', 'output.dense', 'intermediate.dense']
lr: 2.0
max_grad_norm: 1.0
max_steps: 50000
max_steps_before_stop: 2000
mode: train
model_type: graph
num_layers: 3
optim: adadelta
output_file: None
output_latex: False
pretrain: True
pretrain_max_vocab: 250000
rec_dropout: 0
reversed: False
sample_train: 1.0
save_dir: saved_models/depparse
save_name: {shorthand}_{embedding}_parser.pt
second_batch_size: None
second_bert_learning_rate: 0.001
second_lr: 0.0002
second_optim: adam
second_optim_start_step: 10000
second_warmup_steps: 200
seed: 1234
shorthand: sv_diachronic
silver_file: None
silver_weight: 0.5
tag_emb_dim: 50
train_file: /home/urdatorn/git/stanza-digphil/data/depparse/sv_diachronic.train.in.conllu
train_size: None
transformed_dim: 125
transition_embedding_dim: 20
transition_hidden_dim: 20
transition_merge_hidden_dim: 200
transition_subtree_combination: SubtreeCombination.NONE
transition_subtree_nonlinearity: none
use_arc_embedding: False
use_peft: False
use_ufeats: True
use_upos: True
use_xpos: True
wandb: False
wandb_name: None
weight_decay: 1e-05
word_cutoff: 7
word_dropout: 0.33
word_emb_dim: 75
wordvec_dir: /home/urdatorn/stanza_resources/sv/pretrain
wordvec_file: None
wordvec_pretrain_file: /home/urdatorn/stanza_resources/sv/pretrain/conll17.pt
```
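These arguments correspond to Stanza's dependency parser trainer (`stanza.models.parser`). As a rough sketch, a training run with a subset of the settings above might look like the command below; the paths are placeholders and the exact flag set depends on your Stanza version:

```shell
python -m stanza.models.parser \
    --mode train \
    --lang sv \
    --shorthand sv_diachronic \
    --train_file data/depparse/sv_diachronic.train.in.conllu \
    --eval_file data/depparse/sv_diachronic.dev.in.conllu \
    --wordvec_pretrain_file ~/stanza_resources/sv/pretrain/conll17.pt \
    --bert_model KBLab/bert-base-swedish-cased \
    --batch_size 32 \
    --max_steps 50000 \
    --save_dir saved_models/depparse
```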
## Model tree for al1808th/stanza-digphil

Base model: KBLab/bert-base-swedish-cased

## Evaluation results

- LAS on DigPhil Gold (109 sentences): 75.000 (self-reported)