Title

Protein sequence design based on conditional Transformer network

Original title
基于条件Transformer网络的蛋白质序列设计
Author
DENG Xiaomei (邓小梅)
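Student ID
12132731
Degree type
Master's
Degree major
07 Science
Discipline category / professional degree category
07 Science
Supervisor
YU Peiyuan (余沛源)
Supervisor's department
Department of Chemistry
Thesis defense date
2024-05-14
Thesis submission date
2024-06-24
Degree-granting institution
Southern University of Science and Technology
Place of degree conferral
Shenzhen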
Abstract

In protein engineering and synthetic biology, precisely designing protein sequences with specific functions is key to biological innovation and application development. Traditional protein design methods rely heavily on extensive experimentation and bioinformatics analysis, and they often face significant challenges with complex systems because of limitations in data and methodology. To address these challenges, this study designed a conditional Transformer-based neural network model to generate new enzyme sequences with specific functions. The model learns the evolutionary relationships among natural enzyme sequences and generates new sequences with similar functions. The results show that the generated protein sequences have three-dimensional structures and sequence motif distributions similar to those of the original sequences, suggesting that they may retain similar biological characteristics. To screen the generated sequences efficiently for valuable candidates, this study also developed models based on Transformer and graph neural networks (GNN) to predict enzyme catalytic constants (kcat) and Michaelis constants (Km). By combining the strengths of these two networks in handling sequence and molecular data, and by learning from a large body of enzyme sequences, substrate structures, and kinetic parameters, the prediction model can predict the kinetic properties of enzyme-catalyzed reactions with high accuracy.

To validate the reliability and practical value of these models, this study used as a case study the binding of the F-box protein TIR1 to the auxin indole-3-acetic acid (IAA) and to Aux/IAA transcriptional repressors. The binding strengths of different IAA(X)−TIR1 co-receptor complexes with IAA were predicted using both traditional molecular dynamics methods and the Km prediction model developed in this study, and the predictions were compared with experimental data to verify the reliability of the models. In addition, new Aux/IAA protein sequences were generated with the sequence generation model, and their binding strengths with IAA were predicted using the Km prediction model. The results showed that the Km prediction model outperformed traditional molecular dynamics methods, providing theoretical guidance for regulating the activation or inhibition of the auxin signaling pathway.

Language
Chinese
Training category
Independent training
Year of enrollment
2021
Degree conferral year
2024-06

Degree evaluation subcommittee
Chemistry
Chinese Library Classification number
O629.8
Source database
Manual submission
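Output type
Degree thesis
Item identifier
http://sustech.caswiz.com/handle/2SGJ60CL/766005
Collection
Southern University of Science and Technology
College of Science_Department of Chemistry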
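Recommended citation
GB/T 7714
DENG Xiaomei. Protein sequence design based on conditional Transformer network[D]. Shenzhen: Southern University of Science and Technology, 2024.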
Files in this item
File name/size
12132731-邓小梅-化学系.pdf (4674 KB)
Access type
Restricted access