Title

GENEFISHING TO UNCOVER HIDDEN PATTERNS IN OMICS DATA

Alternative title
Uncovering Hidden Patterns in Biological Omics Data with the GeneFishing Method
Name
Name (pinyin)
GUO Diqing
Student ID
12232874
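Degree type
Master's
Degree discipline
0701 Mathematics
Discipline category / professional degree category
07 Science
Supervisor
张卓松
Supervisor's affiliation
Department of Statistics and Data Science
Thesis defense date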
2024-05-19
Thesis submission date
2024-07-06
Degree-granting institution
Southern University of Science and Technology
Place of degree conferral
Shenzhen
Abstract
The rapid development of omics technologies has led to large-scale omics data. These data often exhibit characteristics such as high dimensionality, heterogeneity, and low signal-to-noise ratio, making it challenging to extract useful information. Traditional statistical methods may not adequately address these issues.
GeneFishing, a semi-supervised learning approach based on resampling and clustering, was introduced in 2019. It effectively captures signals from biological sequencing data and was initially applied to RNA sequencing data for gene prioritization. It uses a set of known genes as "baits" and combines them with unknown candidate genes for resampling and clustering, thereby identifying genes with functions similar to those of the baits. While GeneFishing's effectiveness has been demonstrated on RNA data, its application to other omics data, such as proteomics, remains unexplored. In addition, how to define baits and whether they transfer across different omics data are key open issues. This study aims to address these challenges.
We apply GeneFishing to three multi-omics data sets: single-cell RNA-seq data from mouse embryos, protein abundance data from colon cancer, and protein and RNA-seq data from glioblastoma patients. We conduct exploratory data analysis, determine fishing targets, select appropriate baits using multivariate statistics and bioinformatics methods, and explore bait transferability between omics data. We also tune GeneFishing's parameters and discuss the final findings.
In the study of the embryo data set, we find that, besides the traditional gene level, GeneFishing can also be applied at the cell level to effectively identify cells of the same type as the baits. Moreover, adjusting the clustering parameter k enables the identification of more refined cell sub-types. In the analysis of the colon cancer protein data, we handle the missing values characteristic of protein data and then apply GeneFishing, rediscovering important genes and proteins related to cancer regulation and predicting valuable genes for further research. In the protein and RNA data from glioblastoma patients, we focus on transferring baits from the proteomics data to the RNA data and identify significant results in the RNA data that were not prominent in the protein data, indicating the potential for more comprehensive biological discoveries when both omics data sets are integrated.
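The bait-and-candidate procedure summarized in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: the function names (`rank_rows`, `split_in_two`, `capture_frequency`), the use of absolute Spearman correlation as the affinity, the fixed two-way split, and the synthetic data are all assumptions made for illustration. The clustering step here is a sign split on the Fiedler vector of the normalized graph Laplacian, a simplified stand-in for a full spectral clustering routine. Each round shuffles the candidates into pools, adds the baits to every pool, splits the pool in two, and counts how often each candidate lands on the same side as most baits.

```python
import numpy as np

def rank_rows(x):
    """Rank the values in each row, so Pearson on ranks gives Spearman."""
    return np.argsort(np.argsort(x, axis=1), axis=1).astype(float)

def split_in_two(expr):
    """Two-way spectral split of the rows of expr (genes x samples)."""
    # Assumption: absolute Spearman correlation is used as the affinity.
    affinity = np.abs(np.corrcoef(rank_rows(expr)))
    np.fill_diagonal(affinity, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    return vecs[:, 1] > 0                # sign of the Fiedler vector

def capture_frequency(expr, bait_idx, cand_idx, n_rounds=20, n_pools=5, seed=0):
    """Fraction of rounds in which each candidate clusters with the baits."""
    rng = np.random.default_rng(seed)
    bait_idx, cand_idx = np.asarray(bait_idx), np.asarray(cand_idx)
    caught = np.zeros(len(cand_idx))
    for _ in range(n_rounds):
        order = rng.permutation(len(cand_idx))      # resample the pools
        for pool in np.array_split(order, n_pools):
            rows = np.concatenate([bait_idx, cand_idx[pool]])
            labels = split_in_two(expr[rows])
            # The bait side is the cluster holding the majority of baits.
            bait_side = labels[:len(bait_idx)].mean() > 0.5
            caught[pool] += labels[len(bait_idx):] == bait_side
    return caught / n_rounds

# Synthetic data (hypothetical): 8 baits and 5 planted candidates share
# one signal; the remaining 95 candidates are pure noise.
rng = np.random.default_rng(1)
n_samples = 100
signal = rng.normal(size=n_samples)
baits = signal + 0.3 * rng.normal(size=(8, n_samples))
planted = signal + 0.3 * rng.normal(size=(5, n_samples))
noise = rng.normal(size=(95, n_samples))
expr = np.vstack([baits, planted, noise])

cfr = capture_frequency(expr, bait_idx=np.arange(8), cand_idx=np.arange(8, 108))
print("mean capture frequency, planted:", cfr[:5].mean())
print("mean capture frequency, noise:  ", cfr[5:].mean())
```

In this toy setting, candidates that co-vary with the baits should be captured in most rounds, while unrelated candidates should rarely be captured. Applying the same idea at the cell level, as the abstract describes, would amount to passing a cells-by-features matrix instead of a genes-by-samples matrix.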
Keywords
Language
English
Training category
Independent training
Year of enrollment
2022
Year of degree conferral
2024-06
Reference list

[1] VEENSTRA T D. Omics in systems biology: current progress and future outlook[J]. Proteomics, 2021, 21(3-4): 2000235.
[2] LIU K, THEUSCH E, ZHOU Y, et al. GeneFishing to reconstruct context specific portraits of biological processes[J]. Proceedings of the National Academy of Sciences, 2019, 116(38): 18943-18950.
[3] MOREAU Y, TRANCHEVENT L C. Computational tools for prioritizing candidate genes: boosting disease gene discovery[J]. Nature Reviews Genetics, 2012, 13(8): 523-536.
[4] YU B. Stability[J]. Bernoulli, 2013, 19(4): 1484-1500.
[5] YU W, WULF A, LIU T, et al. Gene Prospector: an evidence gateway for evaluating potential susceptibility genes and interacting risk factors for human diseases[J]. BMC Bioinformatics, 2008, 9(1): 1-8.
[6] AERTS S, LAMBRECHTS D, MAITY S, et al. Gene prioritization through genomic data fusion[J]. Nature Biotechnology, 2006, 24(5): 537-544.
[7] TRANCHEVENT L C, BARRIOT R, YU S, et al. Endeavour update: a web resource for gene prioritization in multiple species[J]. Nucleic Acids Research, 2008, 36(suppl_2): W377-W384.
[8] GREENE C S, KRISHNAN A, WONG A K, et al. Understanding multicellular function and disease with human tissue-specific networks[J]. Nature Genetics, 2015, 47(6): 569-576.
[9] CRICK F. Central dogma of molecular biology[J]. Nature, 1970, 227(5258): 561-563.
[10] GIBNEY E, NOLAN C. Epigenetics and gene expression[J]. Heredity, 2010, 105(1): 4-13.
[11] BUCCITELLI C, SELBACH M. mRNAs, proteins and the emerging principles of gene expression control[J]. Nature Reviews Genetics, 2020, 21(10): 630-644.
[12] KIM M S, PINTO S M, GETNET D, et al. A draft map of the human proteome[J]. Nature, 2014, 509(7502): 575-581.
[13] NIE L, WU G, CULLEY D E, et al. Integrative analysis of transcriptomic and proteomic data: challenges, solutions and applications[J]. Critical Reviews in Biotechnology, 2007, 27(2): 63-75.
[14] VAN DAM S, VOSA U, VAN DER GRAAF A, et al. Gene co-expression analysis for functional classification and gene–disease predictions[J]. Briefings in Bioinformatics, 2018, 19(4): 575-592.
[15] GILLIS J, PAVLIDIS P. “Guilt by association” is the exception rather than the rule in gene networks[J]. PLoS Computational Biology, 2012, 8(3): e1002444.
[16] ZHANG B, HORVATH S. A general framework for weighted gene co-expression network analysis[J]. Statistical Applications in Genetics and Molecular Biology, 2005, 4(1).
[17] D’HAESELEER P. How does gene expression clustering work?[J]. Nature Biotechnology, 2005, 23(12): 1499-1501.
[18] HEYER L J, KRUGLYAK S, YOOSEPH S. Exploring expression data: identification and analysis of coexpressed genes[J]. Genome Research, 1999, 9(11): 1106-1115.
[19] KUMARI S, NIE J, CHEN H S, et al. Evaluation of gene association methods for coexpression network construction and biological knowledge discovery[J]. PloS One, 2012, 7(11): e50411.
[20] MUKAKA M M. A guide to appropriate use of correlation coefficient in medical research[J]. Malawi Medical Journal, 2012, 24(3): 69-71.
[21] FUJITA A, SATO J R, DEMASI M A A, et al. Comparing Pearson, Spearman and Hoeffding’s D measure for gene expression association analysis[J]. Journal of Bioinformatics and Computational Biology, 2009, 7(04): 663-684.
[22] HOU J, YE X, FENG W, et al. Distance correlation application to gene co-expression network analysis[J]. BMC Bioinformatics, 2022, 23(1): 1-24.
[23] HAUKE J, KOSSOWSKI T. Comparison of values of Pearson’s and Spearman’s correlation coefficients on the same sets of data[J]. Quaestiones Geographicae, 2011, 30(2): 87-93.
[24] XIAO C, YE J, ESTEVES R M, et al. Using Spearman’s correlation coefficients for exploratory data analysis on big dataset[J]. Concurrency and Computation: Practice and Experience, 2016, 28(14): 3866-3878.
[25] WANG H, SUN Q, ZHAO W, et al. Individual-level analysis of differential expression of genes and pathways for personalized medicine[J]. Bioinformatics, 2015, 31(1): 62-68.
[26] BALDI P, LONG A D. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes[J]. Bioinformatics, 2001, 17(6): 509-519.
[27] JEANMOUGIN M, DE REYNIES A, MARISA L, et al. Should we abandon the t-test in the analysis of gene expression microarray data: a comparison of variance modeling strategies[J]. PloS One, 2010, 5(9): e12336.
[28] CUI X, CHURCHILL G A. Statistical tests for differential expression in cDNA microarray experiments[J]. Genome Biology, 2003, 4: 1-10.
[29] CONSORTIUM G O. The Gene Ontology (GO) database and informatics resource[J]. Nucleic Acids Research, 2004, 32(suppl_1): D258-D261.
[30] CONSORTIUM G O. The gene ontology resource: 20 years and still GOing strong[J]. Nucleic Acids Research, 2019, 47(D1): D330-D338.
[31] SUBRAMANIAN A, TAMAYO P, MOOTHA V K, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles[J]. Proceedings of the National Academy of Sciences, 2005, 102(43): 15545-15550.
[32] HUNG J H, YANG T H, HU Z, et al. Gene set enrichment analysis: performance evaluation and usage guidelines[J]. Briefings in Bioinformatics, 2012, 13(3): 281-291.
[33] KOROTKEVICH G, SUKHOV V, BUDIN N, et al. Fast gene set enrichment analysis[J]. bioRxiv, 2016: 060012.
[34] KISELEV V Y, ANDREWS T S, HEMBERG M. Challenges in unsupervised clustering of single-cell RNA-seq data[J]. Nature Reviews Genetics, 2019, 20(5): 273-282.
[35] MACQUEEN J, et al. Some methods for classification and analysis of multivariate observations[C]//Proceedings of the fifth Berkeley symposium on mathematical statistics and probability: Vol. 1. Oakland, CA, USA, 1967: 281-297.
[36] VON LUXBURG U. A tutorial on spectral clustering[J]. Statistics and Computing, 2007, 17: 395-416.
[37] NG A, JORDAN M, WEISS Y. On spectral clustering: Analysis and an algorithm[J]. Advances in Neural Information Processing Systems, 2001, 14.
[38] 张宪超. 数据聚类 (Data Clustering)[M]. 科学出版社 (Science Press), 2017.
[39] SALZBERG S L. Open questions: How many genes do we have?[J]. BMC Biology, 2018, 16(1): 1-3.
[40] AMARAL P, CARBONELL-SALA S, DE LA VEGA F M, et al. The status of the human gene catalogue[J]. Nature, 2023, 622(7981): 41-47.
[41] GUO G, HUSS M, TONG G Q, et al. Resolution of cell fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst[J]. Developmental Cell, 2010, 18(4): 675-685.
[42] VASAIKAR S, HUANG C, WANG X, et al. Proteogenomic analysis of human colon cancer reveals new therapeutic opportunities[J]. Cell, 2019, 177(4): 1035-1049.
[43] JIN L, BI Y, HU C, et al. A comparative study of evaluating missing value imputation methods in label-free proteomics[J]. Scientific Reports, 2021, 11(1): 1760.
[44] MA W, KIM S, CHOWDHURY S, et al. DreamAI: algorithm for the imputation of proteomics data[J]. bioRxiv, 2020: 2020-07.
[45] HICKS S C, IRIZARRY R A. Quantro: a data-driven approach to guide the choice of an appropriate normalization method[J]. Genome Biology, 2015, 16: 1-8.
[46] GAGNON-BARTSCH J A, SPEED T P. Using control genes to correct for unwanted variation in microarray data[J]. Biostatistics, 2012, 13(3): 539-552.
[47] BOLSTAD B M, IRIZARRY R A, ÅSTRAND M, et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias[J]. Bioinformatics, 2003, 19(2): 185-193.
[48] MCINNES L, HEALY J, MELVILLE J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction[A]. 2020. arXiv: 1802.03426.
[49] NIU L, GAO C, LI Y. Identification of potential core genes in colorectal carcinoma and key genes in colorectal cancer liver metastasis using bioinformatics analysis[J]. Scientific Reports, 2021, 11(1): 23938.
[50] YUAN S, WANG P, ZHOU X, et al. Differential proteomics mass spectrometry of melanosis coli[J]. American Journal of Translational Research, 2020, 12(7): 3133.
[51] YUZHALIN A, GORDON-WEEKS A, TOGNOLI M, et al. Colorectal cancer liver metastatic growth depends on PAD4-driven citrullination of the extracellular matrix[J]. Nature Communications, 2018, 9(1): 4783.
[52] XING S, WANG Y, HU K, et al. WGCNA reveals key gene modules regulated by the combined treatment of colon cancer with PHY906 and CPT11[J]. Bioscience Reports, 2020, 40(9): BSR20200935.
[53] BUTTACAVOLI M, DI CARA G, ROZ E, et al. Integrated multi-omics investigations of metalloproteinases in colon cancer: Focus on MMP2 and MMP9[J]. International Journal of Molecular Sciences, 2021, 22(22): 12389.
[54] DAVIS M E. Glioblastoma: overview of disease and treatment[J]. Clinical Journal of Oncology Nursing, 2016, 20(5): S2.
[55] WANG L B, KARPOVA A, GRITSENKO M A, et al. Proteogenomic and metabolomic characterization of human glioblastoma[J]. Cancer Cell, 2021, 39(4): 509-528.
[56] TANG F, ISHWARAN H. Random forest missing data algorithms[J]. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2017, 10(6): 363-377.
[57] STEKHOVEN D J, BÜHLMANN P. MissForest—non-parametric missing value imputation for mixed-type data[J]. Bioinformatics, 2012, 28(1): 112-118.
[58] WALJEE A K, MUKHERJEE A, SINGAL A G, et al. Comparison of imputation methods for missing laboratory data in medicine[J]. BMJ Open, 2013, 3(8): e002847.
[59] YANG A, WANG X, HU Y, et al. Identification of hub gene GRIN1 correlated with histological grade and prognosis of glioma by weighted gene coexpression network analysis[J]. BioMed Research International, 2021, 2021.
[60] QI C, LEI L, HU J, et al. Identification of a five-gene signature deriving from the vacuolar ATPase (V-ATPase) sub-classifies gliomas and decides prognoses and immune microenvironment alterations[J]. Cell Cycle, 2022, 21(12): 1294-1315.
[61] DAUBON T, GUYON J, RAYMOND A A, et al. The invasive proteome of glioblastoma revealed by laser-capture microdissection[J]. Neuro-Oncology Advances, 2019, 1(1): vdz029.
[62] NELSON J S, BURCHFIEL C M, FEKEDULEGN D, et al. Potential risk factors for incident glioblastoma multiforme: the Honolulu Heart Program and Honolulu-Asia Aging Study[J]. Journal of Neuro-oncology, 2012, 109: 315-321.
[63] TAN A C, ASHLEY D M, LÓPEZ G Y, et al. Management of glioblastoma: State of the art and future directions[J]. CA: A Cancer Journal for Clinicians, 2020, 70(4): 299-312.
[64] HASAN T, CARAGHER S P, SHIREMAN J M, et al. Interleukin-8/CXCR2 signaling regulates therapy-induced plasticity and enhances tumorigenicity in glioblastoma[J]. Cell Death & Disease, 2019, 10(4): 292.
[65] GENG H, AN Q, ZHANG Y, et al. Role of Peptidylarginine Deiminase 4 in Central Nervous System Diseases[J]. Molecular Neurobiology, 2023, 60(11): 6748-6756.
[66] ARAUJO-ABAD S, FUENTES-BAILE M, RIZZUTI B, et al. The intrinsically disordered, epigenetic factor RYBP binds to the citrullinating enzyme PADI4 in cancer cells[J]. International Journal of Biological Macromolecules, 2023, 246: 125632.
[67] MANOU D, BOURIS P, KLETSAS D, et al. Serglycin activates pro-tumorigenic signaling and controls glioblastoma cell stemness, differentiation and invasive potential[J]. Matrix Biology Plus, 2020, 6: 100033.
[68] DONG W, LI L, TENG X, et al. End processing factor APLF promotes NHEJ efficiency and contributes to TMZ- and ionizing radiation-resistance in glioblastoma cells[J]. OncoTargets and Therapy, 2020: 10593-10605.
[69] SCHMITT C, LUCIUS R, SYNOWITZ M, et al. APOBEC3B is expressed in human glioma, and influences cell proliferation and temozolomide resistance[J]. Oncology Reports, 2018, 40(5): 2742-2749.

Degree evaluation subcommittee
Mathematics
Chinese Library Classification number
O212.6
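Source
Manual submission
Output type
Thesis
Item identifier
http://sustech.caswiz.com/handle/2SGJ60CL/779037
Collection
College of Science_Department of Statistics and Data Science
Recommended citation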
GB/T 7714
Guo DQ. GENEFISHING TO UNCOVER HIDDEN PATTERNS IN OMICS DATA[D]. Shenzhen: Southern University of Science and Technology, 2024.
Files in this item
File name/size | Document type | Version type | Access type | License
12232874-郭迪清-统计与数据科学 (2187KB) | -- | -- | Restricted access | --