Consistent Synthetic Sequences Unlock Structural Diversity in Fully Atomistic De Novo Protein Design
Abstract
High-quality training datasets are crucial for the development of effective protein design models, but existing synthetic datasets often include unfavorable sequence-structure pairs, impairing generative model performance. We leverage ProteinMPNN, whose sequences are experimentally favorable as well as amenable to folding, together with structure prediction models to align high-quality synthetic structures with recoverable synthetic sequences. In that way, we create a new dataset designed specifically for training expressive, fully atomistic protein generators. By retraining La-Proteina, which models discrete residue type and side chain structure in a continuous latent space, on this dataset, we achieve new state-of-the-art results, with improvements of +54% in structural diversity and +27% in co-designability. To validate the broad utility of our approach, we further introduce Proteina Atomistica, a unified flow-based framework that jointly learns the distribution of protein backbone structure, discrete sequences, and atomistic side chains without latent variables. We again find that training on our new sequence-structure data dramatically boosts benchmark performance, improving Proteina Atomistica's structural diversity by +73% and co-designability by +5%. Our work highlights the critical importance of aligned sequence-structure data for training high-performance de novo protein design models. Our new dataset, the Consistency Distilled Synthetic Protein Database (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/resources/proteina-atomistica_data/files?version=release), is made available as an open-source resource.
Summary
This paper addresses data quality in de novo protein design. It argues that existing synthetic datasets, particularly those derived from the AlphaFold Database (AFDB), pair sequences with structures that are inconsistent with each other, which hinders generative models. The authors show that AFDB sequences often do not fold back into their paired structures when refolded with a common structure prediction tool such as ESMFold. To address this, they introduce the Consistency Distilled Synthetic Protein Database (CDDB), built by generating sequences with ProteinMPNN for AFDB cluster representative structures and then refolding these sequences with ESMFold, which yields more consistent sequence-structure pairs. The authors then train two fully atomistic protein generative models on CDDB: La-Proteina and a newly developed model, Proteina Atomistica. La-Proteina uses a latent space approach, while Proteina Atomistica is a unified flow-based framework that explicitly models the joint distribution of protein backbone structure, discrete sequences, and atomistic side chains. Training on CDDB substantially improves both models and leads to state-of-the-art structural diversity and co-designability: La-Proteina improves by +54% in structural diversity and +27% in co-designability, and Proteina Atomistica by +73% and +5%, respectively. These results highlight the importance of aligned sequence-structure data for training high-performance de novo protein design models.
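The dataset construction described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `design_fn`, `fold_fn`, and `rmsd_fn` are hypothetical callables standing in for ProteinMPNN, ESMFold, and a structural distance. The number of sequences per structure, the 2 Å cutoff, and the choice to store the refolded structure are assumptions that the summary does not specify.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional


@dataclass
class ConsistentPair:
    sequence: str    # designed sequence
    structure: Any   # structure predicted from that sequence (assumed choice)
    rmsd: float      # distance between the prediction and the target


def distill_consistent_pairs(
    targets: Iterable[Any],
    design_fn: Callable[[Any, int], List[str]],  # hypothetical inverse-folding call
    fold_fn: Callable[[str], Any],               # hypothetical structure predictor
    rmsd_fn: Callable[[Any, Any], float],        # hypothetical structural distance
    n_seqs: int = 8,                             # assumed, not from the paper
    threshold: float = 2.0,                      # assumed cutoff in Angstrom
) -> List[ConsistentPair]:
    """Keep, per target, the best designed sequence if it refolds close enough."""
    kept: List[ConsistentPair] = []
    for target in targets:
        best: Optional[ConsistentPair] = None
        for seq in design_fn(target, n_seqs):
            predicted = fold_fn(seq)
            distance = rmsd_fn(predicted, target)
            if best is None or distance < best.rmsd:
                best = ConsistentPair(seq, predicted, distance)
        # Discard targets for which no designed sequence refolds consistently.
        if best is not None and best.rmsd < threshold:
            kept.append(best)
    return kept
```

Passing the three models in as callables keeps the sketch independent of any particular library API; real wrappers around ProteinMPNN and ESMFold would need to be supplied by the user.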
Key Insights
- The paper demonstrates that a large share (73.4% based on Cα RMSD and 80.9% based on all-atom RMSD) of the (real sequence, synthetic structure) pairs in the Foldseek-clustered AFDB dataset (𝒟_AFDB-clstr) are *not* co-designable by ESMFold, meaning the sequences are unlikely to fold into their paired structures. This calls into question the suitability of AFDB-derived datasets for training joint sequence-structure models.
- The authors introduce the Consistency Distilled Synthetic Protein Database (CDDB), a dataset created by generating sequences with ProteinMPNN for existing structures and then refolding them with ESMFold, so that sequence and structure are consistent. This addresses the limitation of AFDB-derived datasets.
- Training La-Proteina on the CDDB dataset leads to a +54% improvement in structural diversity and a +27% improvement in co-designability compared to training on the AFDB-derived dataset, demonstrating the impact of data quality.
- The authors introduce Proteina Atomistica, a new multi-modal flow-based framework for joint generation of protein structure and sequence that explicitly models backbone, sequence, and side chains without latent variables. This provides an alternative to latent space approaches like La-Proteina.
- Training Proteina Atomistica on the CDDB dataset improves its structural diversity by +73% and co-designability by +5% compared to training on the AFDB-derived dataset, further validating the importance of consistent sequence-structure data.
- Counterintuitively, simply using ESMFold-predicted structures (𝒟_ESMFold) or filtering the AFDB for designable structures (𝒟_Des) does *not* improve model performance, highlighting the need for *both* high-quality structures *and* consistent, recoverable sequences.
- The paper finds that models trained on the CDDB dataset generate sequences that fold better into their co-generated structures than ProteinMPNN sequences do, potentially eliminating the need for ProteinMPNN-based redesign in protein design pipelines.
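The co-designability statistics in the first insight rest on comparing a predicted structure with a reference after optimal superposition. The sketch below shows one common way to compute such a number: a Kabsch-aligned RMSD over matched atoms (for example Cα coordinates), followed by the fraction of pairs at or above a cutoff. The function names, the 2 Å cutoff, and the input format are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
import numpy as np


def kabsch_rmsd(pred, ref):
    """RMSD between two (N, 3) coordinate arrays after optimal rigid superposition."""
    P = np.asarray(pred, dtype=float)
    Q = np.asarray(ref, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (N, 3)")
    # Remove translation by centering both point sets.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD give the optimal rotation (Kabsch algorithm).
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc @ R.T - Qc
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))


def fraction_not_codesignable(pairs, threshold=2.0):
    """Share of (predicted, reference) pairs whose aligned RMSD is >= threshold.

    threshold is in the same units as the coordinates (assumed Angstrom);
    the default of 2.0 is an assumed cutoff, not taken from the paper.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    failures = sum(kabsch_rmsd(p, q) >= threshold for p, q in pairs)
    return failures / len(pairs)
```

Running the same function on Cα-only and on all-atom coordinate arrays would give the two kinds of percentages quoted above, provided atoms are matched one-to-one between prediction and reference.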
Practical Implications
- The CDDB dataset, released as an open-source resource, can be used by researchers and engineers to train and improve de novo protein design models, potentially leading to more functional and diverse protein designs.
- The Proteina Atomistica framework offers a new approach to joint protein structure and sequence generation, providing a baseline and a potential alternative to latent space models for researchers in the field.
- The findings emphasize the importance of data curation and validation in machine learning for protein design, suggesting that future datasets should prioritize consistency between sequence and structure.
- The improved designability and diversity achieved with the CDDB dataset can lead to the development of novel proteins with desired properties for various applications, including drug discovery, materials science, and synthetic biology.
- Future research directions include extending the CDDB dataset to longer protein sequences, exploring conditional protein design tasks such as motif scaffolding and binder design, and further refining the Proteina Atomistica framework.
Cite This Paper
Reidenbach, D., Cao, Z., Zhang, Z., Didi, K., Geffner, T., Zhou, G., Tang, J., Dallago, C., Vahdat, A., Kucukbenli, E., Kreis, K. (2025). Consistent Synthetic Sequences Unlock Structural Diversity in Fully Atomistic De Novo Protein Design. arXiv preprint arXiv:2512.01976.
Danny Reidenbach, Zhonglin Cao, Zuobai Zhang, Kieran Didi, Tomas Geffner, Guoqing Zhou, Jian Tang, Christian Dallago, Arash Vahdat, Emine Kucukbenli, and Karsten Kreis. "Consistent Synthetic Sequences Unlock Structural Diversity in Fully Atomistic De Novo Protein Design." arXiv preprint arXiv:2512.01976 (2025).