PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion
Sophia Tang*, Yinuo Zhang* and Pranam Chatterjee
This is the repository for PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion, published at ICML 2025. It is partially built on the MDLM repository (Sahoo et al., 2024).
PepTune leverages Monte-Carlo Tree Search (MCTS) to guide a masked discrete diffusion model, iteratively refining a set of Pareto non-dominated sequences optimized across multiple therapeutic properties: binding affinity, cell-membrane permeability, solubility, non-fouling, and non-hemolysis.
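To illustrate the Pareto non-domination criterion used to rank candidate sequences, here is a minimal sketch (not the repository's implementation; the function names are hypothetical, and all objectives are assumed to be maximized):

```python
def dominates(a, b):
    """Return True if score vector `a` Pareto-dominates `b`
    (at least as good on every objective, strictly better on one)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scored):
    """Keep only the sequences whose score vectors are non-dominated."""
    return [
        (seq, s) for seq, s in scored
        if not any(dominates(t, s) for _, t in scored if t != s)
    ]

# Toy example with three objectives (e.g., binding, solubility, permeability)
candidates = [("pep1", (0.9, 0.2, 0.5)),
              ("pep2", (0.8, 0.1, 0.4)),   # dominated by pep1, so filtered out
              ("pep3", (0.3, 0.9, 0.6))]
front = pareto_front(candidates)  # keeps pep1 and pep3
```

Each MCTS iteration scores new candidates and retains only this non-dominated set, so no single objective is traded away entirely.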
Environment Installation
conda env create -f src/environment.yml
conda activate peptune
Model Pretrained Weights Download
Follow the steps below to download the model weights required for this experiment.
- Download the PepTune pre-trained MDLM checkpoint and place it in `checkpoints/`: https://drive.google.com/file/d/1oXGDpKLNF0KX0ZdOcl1NZj5Czk2lSFUn/view?usp=sharing
- Download the pre-trained binding affinity Transformer model and place it in `src/scoring/functions/classifiers/`: https://drive.google.com/file/d/128shlEP_-rYAxPgZRCk_n0HBWVbOYSva/view?usp=sharing
Training Data Download
Download the peptide training dataset from https://drive.google.com/file/d/1yCDr641WVjCtECg3nbG0nsMNu8j7d7gp/view?usp=drive_link and unzip it into the data/ directory:
# Download peptide_data.zip into the data/ directory
cd data/
# Unzip the training data
unzip peptide_data.zip
cd ..
After unzipping, the data should be located at data/peptide_data/.
Repository Structure
PepTune/
├── src/
│   ├── train_peptune.py            # Main training script
│   ├── generate_mcts.py            # MCTS-guided peptide generation
│   ├── generate_unconditional.py   # Unconditional generation
│   ├── diffusion.py                # Core masked discrete diffusion model
│   ├── pareto_mcts.py              # Pareto-front MCTS implementation
│   ├── roformer.py                 # RoFormer backbone
│   ├── noise_schedule.py           # Noise scheduling (loglinear, logpoly)
│   ├── config.yaml                 # Hydra configuration
│   ├── config.py                   # Argparse configuration
│   ├── environment.yml             # Conda environment
│   ├── scoring/                    # Therapeutic property scoring
│   │   ├── scoring_functions.py    # Unified scoring interface
│   │   └── functions/              # Individual property predictors
│   │       ├── binding.py
│   │       ├── hemolysis.py
│   │       ├── nonfouling.py
│   │       ├── permeability.py
│   │       ├── solubility.py
│   │       └── classifiers/        # Pre-trained scoring model weights
│   ├── tokenizer/                  # SMILES SPE tokenizer
│   │   ├── my_tokenizers.py
│   │   ├── new_vocab.txt
│   │   └── new_splits.txt
│   └── utils/                      # Utilities & PeptideAnalyzer
│       ├── app.py
│       ├── generate_utils.py
│       └── utils.py
├── scripts/                        # Shell scripts for running experiments
│   ├── train.sh                    # Pre-training
│   ├── generate_mcts.sh            # MCTS-guided generation
│   └── generate_unconditional.sh   # Unconditional generation
├── data/                           # Training data
│   ├── dataloading_for_dynamic_batching.py
│   └── dataset.py
├── checkpoints/                    # Model checkpoints
└── assets/                         # Figures
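As background for `noise_schedule.py`, the log-linear schedule used in the MDLM line of work can be sketched as follows. This is a minimal illustration under the standard MDLM parameterization; the repository's exact implementation (and its `logpoly` variant) may differ:

```python
import math

def loglinear_sigma(t, eps=1e-3):
    """Total noise at time t in [0, 1]: sigma(t) = -log(1 - (1 - eps) * t)."""
    return -math.log1p(-(1 - eps) * t)

def mask_prob(t, eps=1e-3):
    """Per-token masking probability 1 - exp(-sigma(t)),
    which grows linearly from 0 at t=0 to (1 - eps) at t=1."""
    return 1.0 - math.exp(-loglinear_sigma(t, eps))
```

The log-linear form is convenient because the fraction of masked tokens increases linearly in diffusion time, so denoising steps are evenly loaded.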
Pre-training
Before running, fill in HOME_LOC and ENV_LOC in scripts/train.sh and base_path in src/config.yaml to match your paths.
chmod +x scripts/train.sh
nohup ./scripts/train.sh > train.log 2>&1 &
Training uses Hydra configuration from src/config.yaml. Key settings:
- Backbone: RoFormer (768 hidden, 8 layers, 12 heads)
- Optimizer: AdamW (lr=3e-4, weight_decay=0.075)
- Data: 11M SMILES peptide dataset with dynamic batching by length
- Precision: fp64
- Checkpoints saved to `checkpoints/` (monitors `val/nll`, saves top 10)
MCTS-Guided Peptide Generation
Generate therapeutic peptides optimized across multiple objectives using Monte-Carlo Tree Search.
- Fill in `base_path` in `src/config.yaml` and `src/scoring/scoring_functions.py`.
- Fill in `HOME_LOC` in `scripts/generate_mcts.sh`.
- Create output directories:
mkdir -p results logs
chmod +x scripts/generate_mcts.sh
# Usage: ./scripts/generate_mcts.sh [PROT_NAME] [PROT_NAME2] [MODE] [MODEL] [LENGTH] [EPOCH]
# Example: Generate peptides targeting GFAP with length 100
nohup ./scripts/generate_mcts.sh gfap "" 2 mcts 100 7 > generate.log 2>&1 &
Available Target Proteins
| Name | Target |
|---|---|
| `amhr` | AMH Receptor |
| `tfr` | Transferrin Receptor |
| `gfap` | Glial Fibrillary Acidic Protein |
| `glp1` | GLP-1 Receptor |
| `glast` | Excitatory Amino Acid Transporter |
| `ncam` | Neural Cell Adhesion Molecule |
| `cereblon` | Cereblon (CRBN) |
| `ligase` | E3 Ubiquitin Ligase |
| `skp2` | S-Phase Kinase-Associated Protein 2 |
| `p53` | Tumor Suppressor p53 |
| `egfp` | Enhanced Green Fluorescent Protein |
To specify a custom target protein, override `+prot_seq=<amino acid sequence>` and `+prot_name=<name>` as Hydra arguments in the generation script.
Scoring Objectives
PepTune jointly optimizes across five therapeutic properties via the integrated scoring suite:
| Objective | Property | Model |
|---|---|---|
| `binding_affinity1` | Binding affinity to target protein | Cross-attention Transformer |
| `solubility` | Aqueous solubility | XGBoost on SMILES CNN embeddings |
| `hemolysis` | Non-hemolytic | SMILES binary classifier |
| `nonfouling` | Non-fouling | SMILES binary classifier |
| `permeability` | Cell membrane permeability | PAMPA CNN |
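Conceptually, the unified scoring suite maps each generated SMILES string to one score vector with one entry per objective. The sketch below is hypothetical (the actual interface lives in `src/scoring/scoring_functions.py` and uses the trained predictors above; the dummy scorers here are stand-ins):

```python
from typing import Callable, Dict, List

def score_batch(smiles_batch: List[str],
                objectives: Dict[str, Callable[[str], float]]) -> List[List[float]]:
    """Score every sequence against every objective, returning one
    vector of scores per sequence (order follows the objectives dict)."""
    return [[fn(s) for fn in objectives.values()] for s in smiles_batch]

# Dummy stand-ins for the five trained predictors, each returning a score in [0, 1]
objectives = {
    "binding_affinity1": lambda s: min(1.0, len(s) / 100),
    "solubility":        lambda s: 0.5,
    "hemolysis":         lambda s: 0.5,
    "nonfouling":        lambda s: 0.5,
    "permeability":      lambda s: 0.5,
}
vectors = score_batch(["CC(C)C", "CC(=O)N"], objectives)  # two 5-entry vectors
```

These per-sequence vectors are what the Pareto-front MCTS compares when deciding which candidates to keep.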
Default MCTS Hyperparameters
These can be overridden via Hydra config overrides:
| Parameter | Default | Description |
|---|---|---|
| `mcts.num_children` | 50 | Branching factor per MCTS node |
| `mcts.num_iter` | 128 | Number of MCTS iterations |
| `mcts.num_objectives` | 5 | Number of optimization objectives |
| `sampling.steps` | 128 | Diffusion denoising steps |
| `sampling.seq_length` | 200 | Generated peptide length |
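Hydra overrides use dotted keys on the command line (e.g. `mcts.num_iter=256`). As a rough illustration of how such dotted keys update nested defaults, here is a minimal sketch; in the actual scripts, Hydra/OmegaConf handle this parsing, and this toy version only supports integer and string values:

```python
def apply_overrides(config, overrides):
    """Apply Hydra-style dotted-key overrides (e.g. 'mcts.num_iter=256')
    to a nested dict of defaults, in place."""
    for item in overrides:
        key, value = item.split("=", 1)
        *parents, leaf = key.split(".")
        node = config
        for p in parents:
            node = node.setdefault(p, {})
        node[leaf] = int(value) if value.isdigit() else value
    return config

defaults = {
    "mcts": {"num_children": 50, "num_iter": 128, "num_objectives": 5},
    "sampling": {"steps": 128, "seq_length": 200},
}
cfg = apply_overrides(defaults, ["mcts.num_iter=256", "sampling.seq_length=100"])
```

Untouched keys keep their defaults, so overriding `mcts.num_iter` leaves `mcts.num_children` at 50.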
Unconditional Generation
Generate peptides without property guidance:
chmod +x scripts/generate_unconditional.sh
nohup ./scripts/generate_unconditional.sh > generate_unconditional.log 2>&1 &
Evaluation
To summarize metrics after generation, fill in `path` and `prot_name` in `src/metrics.py` and run:
python src/metrics.py
Citation
If you find this repository helpful for your work, please consider citing our paper:
@inproceedings{tang2025peptune,
  title={{PepTune}: De novo generation of therapeutic peptides with multi-objective-guided discrete diffusion},
  author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
  booktitle={Proceedings of the 42nd International Conference on Machine Learning},
  year={2025}
}
License
This repository is released under the Apache 2.0 License; by using it, you agree to its terms.
