SUSTech KC (南方科技大学知识苑): Design and Implementation of a Distributed Hybrid Index Structure Based on Spark

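Title	一种基于Spark的分布式混合索引结构的设计与实现
Alternative title	Design and Implementation of a Distributed Hybrid Index Structure Based on Spark
Author	Zhan Shu (战庶)
Student ID	11849062
Degree type	Master's
Degree discipline	Computer Science and Technology
Supervisor	Elvis
Thesis defense date	2020-05-30
Thesis submission date	2020-07-08
Degree-granting institution	Harbin Institute of Technology
Degree-granting location	Shenzhen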
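Abstract	With the rapid growth of the Internet, user activity generates massive amounts of data, and the technologies for storing and retrieving that data are becoming increasingly important. Traditional single-machine storage is simple to work with, but it is costly and its capacity is limited, so a distributed solution is urgently needed to make data processing practical. When the data is complex and very large, a suitable index can substantially speed up retrieval, which calls for a distributed indexing method that can serve big data processing. This thesis therefore proposes a distributed hybrid index structure as one solution for improving big data retrieval performance. The index structure is guided by distributed systems theory and is built on the widely used Spark framework, which performs the distributed computation behind the system's functions. Spark distributes large data sets across a number of nodes for parallel computation and, combined with the in-memory computing of RDDs, greatly speeds up processing. The index structure is designed around this property and is offered as a service: the main control module exposes index operation interfaces to external applications, and users call these interfaces to build indexes and query data. The system also includes a web presentation layer that wraps the complex system interfaces, consisting of a web front end and a back-end index data management system that supports managing users and index data. It can additionally draw heat maps for analyzing data popularity and tree diagrams of data relationships for exploring how data items relate, making the data distribution easier to observe. Testing shows that the distributed hybrid index system lets users build distributed indexes and retrieve indexed data with reasonable convenience, and that it can provide indexing for large data warehouses. The system is also reasonably robust and extensible, so further development and functional extension can build on it.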
Alternative abstract	With the rapid development of the Internet, users have generated massive amounts of data, and the technology for storing and retrieving this data is becoming more and more important. Although traditional single-machine storage makes data processing simple, its storage cost is high and its storage space is limited. There is an urgent need for a distributed solution that makes data processing more convenient. When dealing with complex and huge data, appropriate indexes can effectively speed up data retrieval, so a distributed index method needs to be designed to serve big data processing. This paper therefore proposes a distributed hybrid index structure as a solution for improving the performance of big data retrieval. The distributed index structure is guided by distributed theory and uses the popular Spark framework as the basis for distributed computing to implement the system functions. With its distributed parallel computing capabilities, the Spark framework distributes big data to several nodes for parallel computing and, combined with the in-memory computing of RDDs, greatly speeds up computation. The index structure is designed around this feature and is provided as a service: the main control module exposes an interface for index operations to external applications, and users can call this interface to build indexes and query data. A web display function is also added to the system; it calls the complex system interfaces and consists of a web front end and a back-end index data management system that supports the management of users and index data. The system can also draw a heat map for data heat analysis and a data relationship tree diagram for exploring the relationships between data, so that the distribution of the data is easier to observe. Testing showed that the system can conveniently provide users with a distributed index building service, complete the retrieval of index data, and provide index functions for large data warehouses. The system is also fairly robust and extensible, and can be further developed and extended on this basis.
Keywords	distributed; index; big data; parallel computing; Spark
Alternative keywords	distributed system; index; big data; parallel computing; Spark
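Language	Chinese
Training category	Joint training
Output type	Thesis
Item identifier	http://sustech.caswiz.com/handle/2SGJ60CL/143019
Collection	College of Engineering_Department of Computer Science and Engineering
Author affiliation	Southern University of Science and Technology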
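Recommended citation (GB/T 7714)	Zhan Shu. Design and Implementation of a Distributed Hybrid Index Structure Based on Spark[D]. Shenzhen: Harbin Institute of Technology, 2020.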

Files in this item
File name/size	Document type	Version type	Access type	License	Action
一种基于Spark的分布式混合索引结构的（3404KB）	--	--	Restricted access	--	Request full text