Inference-Time Toxicity Mitigation in Protein Language Models

Protein language models (PLMs) are becoming practical tools for de novo protein design, yet their dual-use potential raises safety concerns.

arXiv cs.LG · Mar 04, 2026 13:28 UTC · Paper: ~15 min

Research

Academic or research source. Check the methodology, sample size, and whether it's been replicated.

Key Takeaways

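Domain adaptation of PLMs to specific taxonomic groups can elicit toxic protein generation, even when toxicity is not the training objective.
Logit Diff Amplification (LDA) is adapted as an inference-time control: it amplifies the logit difference between a baseline model and a toxicity-finetuned model and requires no retraining.
Across four taxonomic groups, LDA reduces the predicted toxicity rate (measured via ToxDL2) below the taxon-finetuned baseline while maintaining Fréchet ESM Distance and predicted foldability (pLDDT).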

What It Means

Context

Protein language models (PLMs) are becoming practical tools for de novo protein design, yet their dual-use potential raises safety concerns. We show that domain adaptation to specific taxonomic groups can elicit toxic protein generation, even when toxicity is not the training objective. To address this, we adapt Logit Diff Amplification (LDA) as an inference-time control mechanism for PLMs. LDA modifies token probabilities by amplifying the logit difference between a baseline model and a toxicity-finetuned model, requiring no retraining. Across four taxonomic groups, LDA consistently reduces predicted toxicity rate (measured via ToxDL2) below the taxon-finetuned baseline while preserving biological plausibility. We evaluate quality using Fréchet ESM Distance and predicted foldability (pLDDT), finding that LDA maintains distributional similarity to natural proteins and structural viability (unlike activation-based steering methods that tend to degrade sequence properties). Our results demonstrate that LDA provides a practical safety knob for protein generators that mitigates elicited toxicity while retaining generative quality.
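The logit arithmetic described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the steered logits take the form base + alpha * (base - toxic), and the sign convention, the function name lda_logits, the strength parameter alpha, and the toy logit values are all hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lda_logits(base_logits, toxic_logits, alpha=1.0):
    """Steer next-token logits away from a toxicity-finetuned model.

    Assumed form (not confirmed by the source): base + alpha * (base - toxic).
    alpha = 0 returns the baseline logits unchanged; larger alpha pushes
    harder against tokens the toxicity-finetuned model prefers.
    """
    if len(base_logits) != len(toxic_logits):
        raise ValueError("both models must share the same vocabulary")
    return [b + alpha * (b - t) for b, t in zip(base_logits, toxic_logits)]


# Toy 3-token vocabulary with made-up logits for a single decoding step.
base = [1.0, 1.0, 0.0]   # baseline model
toxic = [1.0, 3.0, 0.0]  # toxicity-finetuned model favors token 1

steered = softmax(lda_logits(base, toxic, alpha=0.5))
print(steered)  # token 1 receives less probability than under the baseline
```

In a real generator the same adjustment would be applied to the full vocabulary logits at every decoding step before sampling, which is why no retraining is needed: both models are only run forward.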

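For Builders

Domain adaptation to specific taxonomic groups can elicit toxic protein generation even when toxicity is not the training objective. LDA is applied at inference time and requires no retraining, and the paper reports that it maintains distributional similarity to natural proteins and predicted foldability.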
