### General information

Author: Patrick Bryant
Contact e-mail: patrick.bryant@scilifelab.se
DOI: 10.17044/scilifelab.19375172
License: CC BY 4.0
This readme file was last updated: 21-03-2022

Please cite as: Dataset for "Predicting the structure of large protein complexes using AlphaFold and sequential assembly", 10.17044/scilifelab.19375172

### Dataset description

AlphaFold and AlphaFold-multimer can predict the structure of single- and multiple-chain proteins with very high accuracy. However, predicting protein complexes with more than a handful of chains is still infeasible, as the accuracy rapidly decreases with the number of chains and the protein size is limited by the memory on a GPU. Nevertheless, it might be possible to predict the structure of large complexes starting from predictions of subcomponents. Here, we take a graph traversal approach to assemble 175 protein complexes with 10-30 chains using predictions of subcomponents. We compute paths through a complex graph constructed of subcomponents using Monte Carlo Tree Search and assemble these in a stepwise fashion. Using subcomponents predicted from all possible trimeric interactions, 91 complexes (52%) are assembled to completion. We create a scoring function, mpDockQ, that can distinguish whether assemblies are complete and predict their accuracy. Selecting complete complexes with TM-score ≥0.9 at FPR 10% using mpDockQ results in 20 complete complexes with a median TM-score of 0.92.

The complete assembly protocol, starting from the sequences, is freely available at: https://gitlab.com/patrickbryant1/molpc

The repository here contains MSAs and predicted subcomponents to reproduce the assembly for the "all-trimer" approach.

hhblits_msas.tar.zst - MSAs for each unique single chain
assembly.tar.zst - the assembled structures and predicted subcomponents for the "all-trimer" approach

The files are compressed with zstd.
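
Since the archives are zstd-compressed tarballs, unpacking them needs a zstd-aware tool. The following is a minimal sketch, not part of the dataset or of the molpc repository. It assumes that `tar` and the `zstd` command-line tool are both installed and on the PATH, and that the `tar` in use accepts `--use-compress-program`. The function names and the output directory are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_extract_command(archive, dest):
    """Return the tar command that unpacks a zstd-compressed tarball into dest."""
    # Assumption: tar supports --use-compress-program and calls the named
    # program with -d when extracting, so plain "zstd" is enough here.
    return [
        "tar",
        "--use-compress-program=zstd",
        "-xf", str(archive),
        "-C", str(dest),
    ]


def extract(archive, dest="."):
    """Unpack one .tar.zst archive into dest, creating dest if needed."""
    # Fail early with a readable message if a required tool is missing.
    for tool in ("tar", "zstd"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"'{tool}' was not found on PATH")
    Path(dest).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if tar exits with an error.
    subprocess.run(build_extract_command(archive, dest), check=True)
```

With both archives downloaded to the working directory, `extract("hhblits_msas.tar.zst", "data")` followed by `extract("assembly.tar.zst", "data")` would unpack them into a hypothetical `data` directory.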