How to use omshrivastava/omnibimol-variant-priority with Scikit-learn:
from huggingface_hub import hf_hub_download
import joblib
model = joblib.load(
hf_hub_download("omshrivastava/omnibimol-variant-priority", "sklearn_model.joblib")
)
# only load pickle files from sources you trust
# read more about it here https://skops.readthedocs.io/en/stable/persistence.html

Lightweight, free-tier variant pathogenicity prediction models for the OmniBiMol bioinformatics platform.
Predicts whether a human missense variant is pathogenic or benign using:
| Model | Dataset | Features | AUROC | F1 | Size |
|---|---|---|---|---|---|
| `xgb_precomputed.json` | songlab/clinvar | 8 precomputed scores | 0.982 | 0.941 | ~391 KB |
| `rf_protein.pkl` | Rain021217/clinvar-pathogenicity | 20 protein features | 0.888 | 0.509 | ~19 MB |
| `lr_protein.pkl` | Rain021217/clinvar-pathogenicity | 20 protein features | 0.880 | 0.356 | ~1 KB |
import pickle
import numpy as np
import xgboost as xgb
# Load XGBoost model (best AUROC)
model = xgb.XGBClassifier()
model.load_model("xgb_precomputed.json")
# Load preprocessing artifacts
with open("xgb_precomputed_imp.pkl", "rb") as f:
    imputer = pickle.load(f)
with open("xgb_precomputed_scaler.pkl", "rb") as f:
    scaler = pickle.load(f)
# Example input: [GPN-MSA, CADD, phyloP-100v, phyloP-241m, phastCons-100v, ESM-1b, NT, HyenaDNA]
# (obtain from VEP / precomputed tracks / songlab/clinvar)
sample = np.array([[1.5, 0.3, 0.7, -1.2, 0.1, -2.5, -0.8, 0.0]])
sample_imp = imputer.transform(sample)
sample_scaled = scaler.transform(sample_imp)
proba = model.predict_proba(sample_scaled)[0, 1]
print(f"Pathogenic probability: {proba:.3f}")
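
The example input above depends on a fixed column order, and in practice some scores may be missing for a given variant. The following is a minimal sketch of one way to build the input row from named scores. The helper `to_feature_row` and the dictionary keys are hypothetical (they reuse the labels from the comment above), and marking absent scores as NaN assumes the pickled imputer treats NaN as its missing-value marker, which is not confirmed here.

```python
import math

# Column order copied from the example input comment above.
FEATURE_ORDER = [
    "GPN-MSA", "CADD", "phyloP-100v", "phyloP-241m",
    "phastCons-100v", "ESM-1b", "NT", "HyenaDNA",
]

def to_feature_row(scores):
    """Hypothetical helper: return a 1 x 8 nested list in FEATURE_ORDER.

    Scores that are absent from `scores` become NaN (assumption: the
    imputer fills NaN values).
    """
    unknown = set(scores) - set(FEATURE_ORDER)
    if unknown:
        # Fail loudly on misspelled names instead of silently dropping them.
        raise KeyError(f"Unrecognized score names: {sorted(unknown)}")
    return [[float(scores.get(name, math.nan)) for name in FEATURE_ORDER]]

row = to_feature_row({"CADD": 0.3, "ESM-1b": -2.5})
```

If the NaN assumption holds, `np.array(row)` can be passed to `imputer.transform` in place of `sample`.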
python inference.py --model xgb --input my_variants.csv --output scored_variants.csv
python app.py
Built from opentargets/clinical_evidence:
0.4*(max_phase/4) + 0.4*min(n_trials/10,1) + 0.2*min(n_datasources/5,1)

therapy_score = 0.5*pathogenicity_proba + 0.3*has_known_trait + 0.2*consequence_severity
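
The two weighted sums above can be sketched in Python as follows. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, the first expression is unnamed in the text and is assumed here to be the gene-disease evidence score, and `has_known_trait` is assumed to be a 0/1 flag with `consequence_severity` already scaled to [0, 1].

```python
def evidence_score(max_phase, n_trials, n_datasources):
    """Assumed name for the first (unnamed) formula above."""
    return (
        0.4 * (max_phase / 4)
        + 0.4 * min(n_trials / 10, 1)       # trial count saturates at 10
        + 0.2 * min(n_datasources / 5, 1)   # datasource count saturates at 5
    )

def therapy_score(pathogenicity_proba, has_known_trait, consequence_severity):
    """Weighted sum from the second formula above.

    Assumptions: has_known_trait is a 0/1 flag and
    consequence_severity lies in [0, 1].
    """
    return (
        0.5 * pathogenicity_proba
        + 0.3 * float(has_known_trait)
        + 0.2 * consequence_severity
    )

# Worked example: phase 2, 5 trials, no extra datasources.
example_evidence = evidence_score(max_phase=2, n_trials=5, n_datasources=0)
example_therapy = therapy_score(0.9, True, 0.5)
```

With these inputs the evidence terms are 0.2 + 0.2 + 0.0 and the therapy terms are 0.45 + 0.3 + 0.1. Both scores stay within [0, 1] as long as `max_phase` does not exceed 4 and the other inputs respect the assumed ranges.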
Tier assignment:
| File | Description |
|---|---|
| `xgb_precomputed.json` | Best model: XGBoost on precomputed scores (AUROC 0.982) |
| `rf_protein.pkl` / `lr_protein.pkl` | Fallback models on protein features (no precompute needed) |
| `inference.py` | Production inference script (`--model xgb` or `rf`) |
| `app.py` | Gradio demo (single variant + batch CSV scoring) |
| `gene_disease_evidence.csv` | 71,419 gene-disease pairs with evidence scores |
| `plots/*.png` | ROC, PR, confusion matrix, feature importance, distributions, precision@k |
| `metrics_summary.json` / `extended_metrics.json` | Full experiment results |
| `research_memo.md` | Literature review + integration recommendation |
See plots/ for all 6 evaluation figures.
class_weight="balanced" for RF/LR; dataset is naturally ~55:45 for exp2

Patient VCF
│
▼
[VEP Annotation]
│
├─► Precomputed scores available? ──► XGBoost (AUROC 0.982)
│
├─► Only protein sequence? ──► Random Forest (AUROC 0.888)
│
▼
[Gene-Disease Lookup] ──► OpenTargets evidence score
│
▼
[Tier Assignment] ──► Tier 1/2/3/4
│
▼
[OmniBiMol Backend API] ──► Top-k + confidence + evidence
│
▼
[UI Report] ──► Sortable table → drug repurposing → wet-lab handoff
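
The model-selection branch in the pipeline above can be sketched as a small routing function. This is a hypothetical illustration: `choose_model` and its arguments are not part of the repository, and the return values simply reuse the `--model xgb` and `rf` options that `inference.py` is described as accepting.

```python
def choose_model(has_precomputed_scores, has_protein_features):
    """Pick a model key following the branch in the pipeline diagram.

    Prefers XGBoost when precomputed scores exist (higher reported AUROC),
    otherwise falls back to the Random Forest on protein features.
    """
    if has_precomputed_scores:
        return "xgb"
    if has_protein_features:
        return "rf"
    raise ValueError("No usable features: need precomputed scores or protein features")

model_key = choose_model(has_precomputed_scores=False, has_protein_features=True)
```

Checking the precomputed-score branch first means a variant with both kinds of features is always scored by the stronger model.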
@misc{omnibimol_variant_priority,
title = {OmniBiMol Variant Priority Pipeline},
author = {OmniBiMol Team},
year = {2025},
howpublished = {\url{https://huggingface.co/omshrivastava/omnibimol-variant-priority}}
}
MIT