Twin Peaks Protein-Protein Binding Affinity Model

This repository contains the final Twin Peaks dual-head checkpoint for sequence-based protein-protein binding inference.

What It Predicts

dG: absolute protein-protein binding affinity estimate in kcal/mol-like training units.
ddG: mutation effect estimate; negative values indicate predicted stabilizing mutations and positive values indicate predicted destabilizing mutations.

Primary Checkpoint

File: best_model_checkpoint.pt
Internal label: v15_fix2_selected
Final held-out test dG PCC: 0.5191
Final held-out test ddG PCC: 0.5994
Final held-out test balance hmean (harmonic mean of the dG and ddG PCCs above): 0.5564
PDBbind v2025 OOD full dG PCC: 0.4157
PDBbind v2025 OOD strict novel-cluster dG PCC: 0.3754
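
As a quick sanity check, the balance score can be recomputed from the two held-out PCC values. This is a minimal sketch that assumes "hmean" means the plain two-value harmonic mean; the function name is illustrative and not part of the repository.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two positive scores."""
    return 2.0 * a * b / (a + b)


dg_pcc = 0.5191   # held-out test dG PCC listed above
ddg_pcc = 0.5994  # held-out test ddG PCC listed above

balance = harmonic_mean(dg_pcc, ddg_pcc)
print(round(balance, 4))  # 0.5564
```

The harmonic mean is pulled toward the lower of the two scores, so it rewards checkpoints that do well on both heads rather than on one alone.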

Important Provenance Note

This checkpoint is the strongest deployment candidate from the final evaluation. The experiment log tracks it separately from the clean V16 protocol-control run because its original training-time checkpoint selection leaned toward dG. The clean V16 comparator scored test dG PCC 0.4959 and test ddG PCC 0.5739. Keep this distinction when reporting the training/search history in a paper or supplement.

Quick Start

Live sequence inference uses ESM-C/ESM embeddings. Users need a Hugging Face token with access to the ESM model for --live-esm. If this Twin Peaks model repository is private, users also need access to this repository.

Batch/runtime note: one predict_binding.py run loads the ESM model once, not once per CSV row. Embeddings are cached in embeddings_cache by sequence hash and reused when the same WT or mutant sequence appears again. Large mutation scans can still be slow because each unique mutant sequence needs its own embedding the first time.
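
The caching behaviour described in the note above can be pictured with a small sketch. This is not the code in predict_binding.py: the hash function (SHA-256), the JSON file format, and the names sequence_key, get_embedding, and toy_embed are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sequence_key(sequence: str) -> str:
    """Stable cache key derived from the sequence text (SHA-256 is an assumption)."""
    return hashlib.sha256(sequence.encode("utf-8")).hexdigest()


def get_embedding(sequence, embed_fn, cache_dir):
    """Return a cached embedding if present, otherwise compute and store it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{sequence_key(sequence)}.json"
    if path.exists():
        return json.loads(path.read_text())
    embedding = embed_fn(sequence)  # the expensive step (the ESM forward pass in practice)
    path.write_text(json.dumps(embedding))
    return embedding


calls = []  # records which sequences actually reached the embedder


def toy_embed(sequence):
    """Stand-in for the real embedder; returns a one-number 'embedding'."""
    calls.append(sequence)
    return [float(len(sequence))]


with tempfile.TemporaryDirectory() as tmp:
    wt = get_embedding("ACDE", toy_embed, tmp)
    wt_again = get_embedding("ACDE", toy_embed, tmp)  # served from the cache
    mut = get_embedding("ACAE", toy_embed, tmp)       # new sequence, new embedding

print(calls)  # ['ACDE', 'ACAE']
```

The repeated WT sequence is embedded once, while the mutant still costs one embedding, which is why large scans are dominated by the number of unique mutant sequences.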

pip install -r requirements.txt
export HF_TOKEN=your_huggingface_token
python predict_binding.py \
  --checkpoint best_model_checkpoint.pt \
  --seq1 "ACDE..." \
  --seq2 "FGHI..." \
  --source-type protein_complex \
  --live-esm \
  --output predictions.csv

For mutation effects:

python predict_binding.py \
  --checkpoint best_model_checkpoint.pt \
  --task ddg \
  --seq1-wt "ACDE..." \
  --seq2 "FGHI..." \
  --positions "42" \
  --indexing 1-indexed \
  --chain 1 \
  --mutate-to A \
  --source-type mutation \
  --live-esm \
  --output ddg_predictions.csv

The ddG output includes seq1_mut_effective and seq2_mut_effective, which report the exact mutant sequences evaluated by the model.

Position indexing: --positions is 1-indexed by default, so position 1 means the first amino acid. Use --indexing 0-indexed only if your input positions start at 0.
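
To make the two indexing conventions concrete, here is a minimal sketch of how a position could be translated into a string index and applied as a point mutation. The helper names to_zero_based and apply_point_mutation are hypothetical and the toy sequence is made up; the actual script may handle this differently.

```python
def to_zero_based(position: int, indexing: str = "1-indexed") -> int:
    """Convert a user-facing position into a Python string index."""
    if indexing == "1-indexed":
        return position - 1
    if indexing == "0-indexed":
        return position
    raise ValueError(f"unknown indexing: {indexing!r}")


def apply_point_mutation(sequence: str, position: int, mutate_to: str,
                         indexing: str = "1-indexed") -> str:
    """Return the sequence with one residue replaced at the given position."""
    idx = to_zero_based(position, indexing)
    if not 0 <= idx < len(sequence):
        raise IndexError(f"position {position} is outside the sequence")
    return sequence[:idx] + mutate_to + sequence[idx + 1:]


wt = "ACDEFGHIK"  # toy sequence, not a real protein

# Position 3 (1-indexed) and position 2 (0-indexed) both address the D.
print(apply_point_mutation(wt, 3, "A"))                        # ACAEFGHIK
print(apply_point_mutation(wt, 2, "A", indexing="0-indexed"))  # ACAEFGHIK
```

Comparing the result against seq1_mut_effective or seq2_mut_effective in the output CSV is a simple way to confirm that the intended residue was mutated.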

This single-mutation ddG command writes one CSV row by design. For a multi-row alanine scan and plots, use --task scan as shown below.

For whole-protein alanine scanning across both chains:

python predict_binding.py \
  --checkpoint best_model_checkpoint.pt \
  --task scan \
  --seq1-wt "ACDE..." \
  --seq2 "FGHI..." \
  --scan-chain both \
  --scan-mode alanine \
  --source-type mutation \
  --live-esm \
  --output scan_predictions.csv \
  --plot-stabilizing-output scan_top10_stabilizing.png \
  --plot-destabilizing-output scan_top10_destabilizing.png \
  --plot-max-rows 10

The scan CSV keeps all mutation rows by default. The two plot outputs show the top 10 predicted stabilizing and destabilizing mutations. WT alanines are skipped because A-to-A is not a mutation.
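
The row set of an alanine scan can be sketched as follows. This is an illustration under assumptions, not the script's implementation: the function name alanine_scan, the dictionary keys, and the toy sequences are hypothetical.

```python
def alanine_scan(sequence: str, chain: int):
    """List every single X-to-A substitution, skipping residues that are already A."""
    rows = []
    for idx, wt_residue in enumerate(sequence):
        if wt_residue == "A":
            continue  # A-to-A is not a mutation
        rows.append({
            "chain": chain,
            "position": idx + 1,  # 1-indexed, matching the CLI default
            "wt": wt_residue,
            "mut": "A",
            "mutant_sequence": sequence[:idx] + "A" + sequence[idx + 1:],
        })
    return rows


# Scanning both chains of a toy pair.
scan = alanine_scan("ACDA", chain=1) + alanine_scan("GA", chain=2)
for row in scan:
    print(f"Chain {row['chain']} {row['wt']}{row['position']}{row['mut']}",
          row["mutant_sequence"])
# Chain 1 C2A AADA
# Chain 1 D3A ACAA
# Chain 2 G1A AA
```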

For batch alanine scanning, provide one protein pair per row:

name,seq1,seq2
pair_1,ACDE...,FGHI...
pair_2,KLMN...,QRST...

The Colab notebook includes a "Create example batch scan CSV" cell that writes and downloads this template. The same template is also included at examples/batch_scan_input.csv.

Then run:

python predict_binding.py \
  --checkpoint best_model_checkpoint.pt \
  --task scan \
  --input batch_scan_input.csv \
  --scan-chain both \
  --scan-mode alanine \
  --source-type mutation \
  --live-esm \
  --output batch_scan_predictions.csv \
  --plot-stabilizing-output batch_scan_top10_stabilizing.png \
  --plot-destabilizing-output batch_scan_top10_destabilizing.png \
  --plot-max-rows 10

For all single amino-acid substitutions across a chain, advanced users can use --scan-mode all. --scan-top-k is optional and truncates the scan CSV, so omit it when you want the full scan output.
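
A rough size estimate helps when choosing between scan modes, since every unique mutant sequence needs its own embedding. The sketch below assumes the 20 canonical amino acids and sequences made only of those letters; the alphabet actually used by --scan-mode all is not stated here, and count_scan_mutants is a hypothetical helper.

```python
CANONICAL_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet


def count_scan_mutants(sequence: str, mode: str) -> int:
    """Number of single-substitution mutants a scan of this sequence would produce."""
    if mode == "alanine":
        # One mutant per residue that is not already alanine.
        return sum(1 for residue in sequence if residue != "A")
    if mode == "all":
        # Every residue can change to each of the other 19 amino acids.
        return (len(CANONICAL_AMINO_ACIDS) - 1) * len(sequence)
    raise ValueError(f"unknown scan mode: {mode!r}")


toy = "ACDA"  # toy sequence
print(count_scan_mutants(toy, "alanine"))  # 2
print(count_scan_mutants(toy, "all"))      # 76
```

Under these assumptions a 300-residue chain yields 5,700 mutants in all mode, compared with at most 300 for an alanine scan.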

For hotspot-style residue reporting, add --residue-summary-output and optionally --residue-plot-output. The residue summary groups scan rows by chain and residue position, then reports, for each residue, the largest absolute predicted ddG, the strongest destabilizing mutation, and the strongest stabilizing mutation. Residues get readable labels such as Chain 1 Y42, and the plot includes the top mutation driving each residue score. The default, --residue-rank impact, ranks residues by largest absolute predicted ddG; use --residue-rank destabilizing when alanine-scan hotspot interpretation should prioritize mutations that weaken binding.
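
The grouping logic behind the residue summary can be sketched in a few lines. Everything here is an assumption for illustration: the row keys (chain, position, wt, mut, ddg), the output keys, the ddG values, and the function summarize_residues are hypothetical and may not match the script's CSV columns.

```python
from collections import defaultdict


def summarize_residues(rows, rank="impact"):
    """Group scan rows by (chain, position) and summarize each residue."""
    if rank not in ("impact", "destabilizing"):
        raise ValueError(f"unknown rank: {rank!r}")

    groups = defaultdict(list)
    for row in rows:
        groups[(row["chain"], row["position"])].append(row)

    def name(r):
        return f"{r['wt']}{r['position']}{r['mut']}"

    summary = []
    for (chain, position), group in groups.items():
        top = max(group, key=lambda r: abs(r["ddg"]))  # largest absolute effect
        worst = max(group, key=lambda r: r["ddg"])     # most positive ddG
        best = min(group, key=lambda r: r["ddg"])      # most negative ddG
        summary.append({
            "label": f"Chain {chain} {group[0]['wt']}{position}",
            "max_abs_ddg": abs(top["ddg"]),
            "top_mutation": name(top),
            # Positive ddG = destabilizing, negative = stabilizing.
            "strongest_destabilizing": name(worst) if worst["ddg"] > 0 else None,
            "strongest_stabilizing": name(best) if best["ddg"] < 0 else None,
            "max_ddg": worst["ddg"],
        })

    sort_key = "max_abs_ddg" if rank == "impact" else "max_ddg"
    return sorted(summary, key=lambda s: s[sort_key], reverse=True)


# Made-up ddG values for illustration only.
rows = [
    {"chain": 1, "position": 42, "wt": "Y", "mut": "A", "ddg": 1.8},
    {"chain": 1, "position": 42, "wt": "Y", "mut": "F", "ddg": -0.3},
    {"chain": 1, "position": 10, "wt": "K", "mut": "A", "ddg": -2.1},
    {"chain": 1, "position": 10, "wt": "K", "mut": "R", "ddg": 0.4},
    {"chain": 2, "position": 5, "wt": "D", "mut": "A", "ddg": 0.9},
]

by_impact = [s["label"] for s in summarize_residues(rows)]
by_destab = [s["label"] for s in summarize_residues(rows, rank="destabilizing")]
print(by_impact)  # ['Chain 1 K10', 'Chain 1 Y42', 'Chain 2 D5']
print(by_destab)  # ['Chain 1 Y42', 'Chain 2 D5', 'Chain 1 K10']
```

The two rankings diverge for K10: its largest effect is stabilizing, so it leads the impact ranking but drops to last when only destabilizing effects count.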

Batch CSV Formats

dG batch input:

name,seq1,seq2
example_pair,ACDE...,FGHI...

ddG batch input compatible with the Colab-style workflow:

name,seq1_wt,seq2,positions,indexing,chain,mutate_to
example_mut,ACDE...,FGHI...,42,1-indexed,1,A

In batch ddG rows, the positions column is also 1-indexed by default unless the row's indexing column says 0-indexed.
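
A batch file in this format can be generated programmatically. The sketch below uses only the column names from the header shown above; the helper build_ddg_batch_csv and the toy sequences are illustrative assumptions.

```python
import csv
import io

# Column order copied from the ddG batch header above.
FIELDS = ["name", "seq1_wt", "seq2", "positions", "indexing", "chain", "mutate_to"]


def build_ddg_batch_csv(rows) -> str:
    """Serialize ddG batch rows to CSV text with the expected header."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_ddg_batch_csv([
    {"name": "toy_mut", "seq1_wt": "ACDEFGHIK", "seq2": "LMNPQRSTV",
     "positions": 3, "indexing": "1-indexed", "chain": 1, "mutate_to": "A"},
])
print(csv_text)
# name,seq1_wt,seq2,positions,indexing,chain,mutate_to
# toy_mut,ACDEFGHIK,LMNPQRSTV,3,1-indexed,1,A
```

Writing csv_text to a file gives an input that can be passed with --input. DictWriter raises an error on unexpected keys, which catches misspelled column names early.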

Explicit ddG input is also supported:

name,seq1_wt,seq2_wt,seq1_mut,seq2_mut,block1_mut_positions,block2_mut_positions
example_mut,ACDE...,FGHI...,ACAE...,FGHI...,"[3]","[]"

If seq1_mut / seq2_mut are provided without mutation-position columns, the script infers changed positions from the WT and mutant sequences. WT and mutant sequences must have the same length for automatic position inference.
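
Position inference from a WT/mutant pair amounts to a residue-by-residue comparison. This is a minimal sketch, assuming substitution-only mutants; infer_mut_positions is a hypothetical name, and the convention the script uses when reporting inferred positions is not confirmed here.

```python
def infer_mut_positions(wt: str, mut: str, indexing: str = "1-indexed"):
    """Return the positions where the mutant differs from the wild type."""
    if len(wt) != len(mut):
        # Insertions and deletions shift residues, so a direct comparison is undefined.
        raise ValueError("WT and mutant sequences must have the same length")
    if indexing not in ("1-indexed", "0-indexed"):
        raise ValueError(f"unknown indexing: {indexing!r}")
    offset = 1 if indexing == "1-indexed" else 0
    return [i + offset for i, (a, b) in enumerate(zip(wt, mut)) if a != b]


wt = "ACDEFG"   # toy wild type
mut = "ACAEFG"  # D-to-A at the third residue

print(infer_mut_positions(wt, mut))                        # [3]
print(infer_mut_positions(wt, mut, indexing="0-indexed"))  # [2]
print(infer_mut_positions(wt, wt))                         # []
```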

If explicit block1_mut_positions / block2_mut_positions are provided, they follow the row indexing value or the CLI --indexing default.

Source Type

This release uses one shared dual-head checkpoint, not three separate source-specific models. --source-type is an internal learned conditioning flag, not a model selector.

For normal public use:

Use protein_complex for absolute dG prediction.
Use mutation for ddG prediction and whole-protein mutation scans.

Advanced options are:

protein_complex: general protein-protein complexes.
mutation: SKEMPI/BindingGym-like mutation effects.
antibody_cdr: antibody CDR-style variants.

Hugging Face Access Notes

You need a Hugging Face token for live ESM-C/ESM embedding generation.
If this model repository is private, your Hugging Face account must also have access to the repository.
For public repositories, the same token is still needed for ESM/live embedding generation.

Files

best_model_checkpoint.pt: model weights plus embedded training config.
predict_binding.py: strict inference CLI.
architectures.py, esm3bedding.py, base.py, utils.py: loader/runtime support.
examples/: small example CSV schemas.
notebooks/twin_peaks_final_inference_colab.ipynb: Colab starter notebook.
release_manifest.json: release manifest and provenance details.

Limitations

This is a research model. It should not be used as the only basis for clinical, therapeutic, or safety-critical design decisions. Predictions depend on sequence-only PLM features and do not replace structure-aware validation or experimental binding assays.
