PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion

Sophia Tang*, Yinuo Zhang* and Pranam Chatterjee


This is the repository for PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion, published at ICML 2025. It is partially built on the MDLM repository (Sahoo et al., 2024).

PepTune leverages Monte-Carlo Tree Search (MCTS) to guide a generative masked discrete diffusion model that iteratively refines a set of Pareto non-dominated sequences optimized across multiple therapeutic properties: binding affinity, cell-membrane permeability, solubility, non-fouling, and non-hemolysis.
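As a minimal concept sketch (not the repository's implementation), Pareto non-domination over a set of score vectors can be computed as follows, assuming every objective is maximized:

```python
def dominates(a, b):
    """True when score vector `a` Pareto-dominates `b` (all objectives
    maximized): a >= b on every objective and a > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the non-dominated subset of a list of score vectors."""
    return [s for s in scores
            if not any(dominates(t, s) for t in scores if t is not s)]
```

For example, `pareto_front([(0.9, 0.1), (0.5, 0.5), (0.2, 0.2)])` drops only `(0.2, 0.2)`, which is dominated by `(0.5, 0.5)`; the first two vectors each win on a different objective, so neither dominates the other.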

Environment Installation

conda env create -f src/environment.yml

conda activate peptune

Model Pretrained Weights Download

Follow the steps below to download the model weights required for this experiment.

  1. Download the PepTune pre-trained MDLM checkpoint and place it in checkpoints/: https://drive.google.com/file/d/1oXGDpKLNF0KX0ZdOcl1NZj5Czk2lSFUn/view?usp=sharing
  2. Download the pre-trained binding affinity Transformer model and place it in src/scoring/functions/classifiers/: https://drive.google.com/file/d/128shlEP_-rYAxPgZRCk_n0HBWVbOYSva/view?usp=sharing

Training Data Download

Download the peptide training dataset from https://drive.google.com/file/d/1yCDr641WVjCtECg3nbG0nsMNu8j7d7gp/view?usp=drive_link and unzip it into the data/ directory:

# Download peptide_data.zip into the data/ directory
cd data/

# Unzip the training data
unzip peptide_data.zip

cd ..

After unzipping, the data should be located at data/peptide_data/.

Repository Structure

PepTune/
├── src/
│   ├── train_peptune.py          # Main training script
│   ├── generate_mcts.py          # MCTS-guided peptide generation
│   ├── generate_unconditional.py # Unconditional generation
│   ├── diffusion.py              # Core masked discrete diffusion model
│   ├── pareto_mcts.py            # Pareto-front MCTS implementation
│   ├── roformer.py               # RoFormer backbone
│   ├── noise_schedule.py         # Noise scheduling (loglinear, logpoly)
│   ├── config.yaml               # Hydra configuration
│   ├── config.py                 # Argparse configuration
│   ├── environment.yml           # Conda environment
│   ├── scoring/                  # Therapeutic property scoring
│   │   ├── scoring_functions.py  # Unified scoring interface
│   │   └── functions/            # Individual property predictors
│   │       ├── binding.py
│   │       ├── hemolysis.py
│   │       ├── nonfouling.py
│   │       ├── permeability.py
│   │       ├── solubility.py
│   │       └── classifiers/      # Pre-trained scoring model weights
│   ├── tokenizer/                # SMILES SPE tokenizer
│   │   ├── my_tokenizers.py
│   │   ├── new_vocab.txt
│   │   └── new_splits.txt
│   └── utils/                    # Utilities & PeptideAnalyzer
│       ├── app.py
│       ├── generate_utils.py
│       └── utils.py
├── scripts/                      # Shell scripts for running experiments
│   ├── train.sh                  # Pre-training
│   ├── generate_mcts.sh          # MCTS-guided generation
│   └── generate_unconditional.sh # Unconditional generation
├── data/                         # Training data
│   ├── dataloading_for_dynamic_batching.py
│   └── dataset.py
├── checkpoints/                  # Model checkpoints
└── assets/                       # Figures

Pre-training

Before running, fill in HOME_LOC and ENV_LOC in scripts/train.sh and base_path in src/config.yaml to match your paths.

chmod +x scripts/train.sh

nohup ./scripts/train.sh > train.log 2>&1 &

Training uses Hydra configuration from src/config.yaml. Key settings:

  • Backbone: RoFormer (768 hidden, 8 layers, 12 heads)
  • Optimizer: AdamW (lr=3e-4, weight_decay=0.075)
  • Data: 11M SMILES peptide dataset with dynamic batching by length
  • Precision: fp64
  • Checkpoints saved to checkpoints/ (monitors val/nll, saves top 10)
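For intuition about the schedules named in src/noise_schedule.py, a log-linear schedule in the standard masked-diffusion (MDLM) convention can be sketched as below. This is a generic illustration, not the repository's code, and the EPS floor is an assumed value:

```python
import math

EPS = 1e-3  # assumed numerical floor; the repo may use a different value


def total_noise(t):
    """Cumulative log-linear noise sigma(t) = -log(1 - (1 - EPS) * t), t in [0, 1]."""
    return -math.log1p(-(1 - EPS) * t)


def mask_prob(t):
    """Probability a token is masked at time t:
    1 - exp(-sigma(t)) = (1 - EPS) * t, i.e. linear in t."""
    return 1.0 - math.exp(-total_noise(t))
```

The "log-linear" name comes from the cumulative noise being logarithmic in t while the resulting per-token masking probability is linear in t.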

MCTS-Guided Peptide Generation

Generate therapeutic peptides optimized across multiple objectives using Monte-Carlo Tree Search.

  1. Fill in base_path in src/config.yaml and src/scoring/scoring_functions.py.
  2. Fill in HOME_LOC in scripts/generate_mcts.sh.
  3. Create output directories: mkdir -p results logs
  4. Make the script executable: chmod +x scripts/generate_mcts.sh

# Usage: ./scripts/generate_mcts.sh [PROT_NAME] [PROT_NAME2] [MODE] [MODEL] [LENGTH] [EPOCH]
# Example: Generate peptides targeting GFAP with length 100
nohup ./scripts/generate_mcts.sh gfap "" 2 mcts 100 7 > generate.log 2>&1 &

Available Target Proteins

| Name | Target |
|------|--------|
| amhr | AMH Receptor |
| tfr | Transferrin Receptor |
| gfap | Glial Fibrillary Acidic Protein |
| glp1 | GLP-1 Receptor |
| glast | Excitatory Amino Acid Transporter |
| ncam | Neural Cell Adhesion Molecule |
| cereblon | Cereblon (CRBN) |
| ligase | E3 Ubiquitin Ligase |
| skp2 | S-Phase Kinase-Associated Protein 2 |
| p53 | Tumor Suppressor p53 |
| egfp | Enhanced Green Fluorescent Protein |

To specify a custom target protein, override +prot_seq=<amino acid sequence> and +prot_name=<name> as Hydra arguments in the generation script.

Scoring Objectives

PepTune jointly optimizes across five therapeutic properties via the integrated scoring suite:

| Objective | Property | Model |
|-----------|----------|-------|
| binding_affinity1 | Binding affinity to target protein | Cross-attention Transformer |
| solubility | Aqueous solubility | XGBoost on SMILES CNN embeddings |
| hemolysis | Non-hemolytic | SMILES binary classifier |
| nonfouling | Non-fouling | SMILES binary classifier |
| permeability | Cell membrane permeability | PAMPA CNN |
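As a hedged sketch of how such a suite can be consumed downstream (the names below are hypothetical stand-ins; the actual interface lives in src/scoring/scoring_functions.py), the five predictions can be packed into one fixed-order score vector per sequence, ready for Pareto comparison:

```python
# Fixed objective order, matching the table above.
OBJECTIVES = ["binding_affinity1", "solubility", "hemolysis",
              "nonfouling", "permeability"]


def score_peptide(smiles, predictors):
    """Evaluate one SMILES string with each property predictor (a dict of
    callables keyed by objective name) and return a score vector in which
    higher is better for every objective."""
    return tuple(predictors[name](smiles) for name in OBJECTIVES)
```

Fixing the objective order matters: Pareto comparisons between score vectors are only meaningful when every vector lists the same objectives in the same positions.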

Default MCTS Hyperparameters

These defaults can be overridden with Hydra arguments:

| Parameter | Default | Description |
|-----------|---------|-------------|
| mcts.num_children | 50 | Branching factor per MCTS node |
| mcts.num_iter | 128 | Number of MCTS iterations |
| mcts.num_objectives | 5 | Number of optimization objectives |
| sampling.steps | 128 | Diffusion denoising steps |
| sampling.seq_length | 200 | Generated peptide length |
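For orientation on how a branching factor and iteration budget drive tree search, the textbook UCT selection score that standard MCTS uses to trade off exploitation against exploration is shown below; the Pareto-front variant in src/pareto_mcts.py may use a different selection rule:

```python
import math


def uct(value_sum, visits, parent_visits, c=1.0):
    """Textbook UCT score: mean value plus an exploration bonus that
    shrinks as a child node accumulates visits."""
    if visits == 0:
        return float("inf")  # unvisited children are expanded first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)
```

Each iteration selects the child with the highest score, so unvisited children (score infinity) are tried before any child is revisited, and a larger exploration constant c keeps the search broader for longer.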

Unconditional Generation

Generate peptides without property guidance:

chmod +x scripts/generate_unconditional.sh

nohup ./scripts/generate_unconditional.sh > generate_unconditional.log 2>&1 &

Evaluation

To summarize metrics after generation, fill in path and prot_name in src/metrics.py and run:

python src/metrics.py

Citation

If you find this repository helpful for your research, please consider citing our paper:

@inproceedings{tang2025peptune,
  title={PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion},
  author={Tang, Sophia and Zhang, Yinuo and Chatterjee, Pranam},
  booktitle={Proceedings of the 42nd International Conference on Machine Learning},
  year={2025}
}

License

This repository is released under the Apache 2.0 License; by using it, you agree to abide by the license terms.
