南方科技大学知识苑(SUSTech KC): 基于 CPU-GPU 系统加速精准内积检索

题名	基于 CPU-GPU 系统加速精准内积检索
其他题名	ACCELERATING EXACT INNER PRODUCT RETRIEVAL BY CPU-GPUSYSTEM
姓名	向隆
学号	11749127
学位类型	硕士
学位专业	计算机科学与技术
导师	唐博
论文答辩日期	2019-05-30
论文提交日期	2019-07-15
学位授予单位	哈尔滨工业大学
学位授予地点	深圳
摘要	内积检索问题在推荐系统、信息检索和数据挖掘等领域有着极为广泛的应用。内积检索问题是指给定一个查询向量q和一个矩阵P，在矩阵中P中找到k个向量，使得这k个向量与查询向量的内积值在查询向量与矩阵的乘积qT · P中排前 k大。内积检索问题的求解可以分为两步（1）计算查询向量与矩阵的乘积；（2）在查询向量与矩阵的乘积中检索出排前k大的内积值和矩阵中对应的向量。基于 CPU计算平台内积检索算法的性能瓶颈是计算向量与矩阵乘积的部分，该部分的计算时间占总计算时间的96%以上。与CPU 相比，GPU有着更多的计算核心和更大的显存带宽特别适合用于求解运算密集型的计算任务。然而由于内积检索问题本身的复杂性，简单的将内积检索问题的求解迁移至GPU进行计算并不能取得较好的加速效果。本文针对如何高效的利用CPU-GPU系统加速精准内积检索问题展开研究。本文主要的研究内容和贡献如下：（1）对当前基于CPU的内积检索算法进行了细致的性能分析测定了算法每个阶段的时间开销，确定基于 CPU 内积检索算法的性能瓶颈为求解向量与矩阵的乘积。针对CPU无法高效计算矩阵乘法的问题，本文提出使用CPU-GPU系统加速内积检索中矩阵乘法的计算。为了充分利用CPU-GPU系统的硬件资源对CPU-GPU系统的软硬件体系结构和计算特性进行了系统的研究，并调研了当前CPU、 GPU矩阵乘法的计算和优化方式，确定了本文所使用的矩阵乘法计算方式。（2）提出了基于GPU矩阵乘法的内积检索算法。针对GPU显存容量不足无法使用GPU计算矩阵乘法的问题，本文提出了矩阵分块计算方案，在不超过显存容量的前提下充分的利用GPU的计算资源高效的进行矩阵乘法。在合理矩阵分块计算的基础上提出基于GPU矩阵乘法的内积检索算法，解决了基于CPU内积检索算法的性能瓶颈。（3）提出了两个基于GPU的内积检索优化算法。在基于GPU矩阵乘法算法的基础上提出了基于GPU排序的内积检索算法，通过在GPU端对相乘得到的结果矩阵进行排序，提升了从乘积矩阵中检索前k大元素的效率并且减少显存与主存间传输数据的开销。并在基于GPU排序的内积检索算法的基础上提出了基于柯西-施瓦兹不等式的剪枝策略，保证了不损失计算精度的前提下大幅减少了所需的计算量。本文使用四个来自真实推荐系统的数据集对所提出的基于CPU-GPU系统的内积检索加速算法进行了性能测试。测试结果显示本文所提出的算法可以有效的解决基于CPU的内积检索算法的性能瓶颈。与目前效率最高的CPU内积检索算法相比本文所提出的基于 CPU-GPU 系统的内积检索算法获得了15.88至 138.78倍的性能提升。
其他摘要	The inner product retrieval (IPR) is widely used in many applications, e.g., recommendation system, information retrieval and data mining. Inner product retrieval is given a query vector q and a matrix P, find the k vectors in P, for which the inner product with query vector q are the largest in qP. The IPR problem can be decomposed into two fundamental operations: (i) calculate the product of the query vector and matrix; (ii) retrieve the results to find the largest inner product value and corresponding vectors in matrix P. In the traditional CPU-based solutions, the performance bottleneck is the product computation phase, which takes up more than 96% of the total computing time. Compared with CPU, GPU has more computing cores and larger memory bandwidth, which is especially suitable for solving computationally intensive computing tasks. Exploiting Graphics Processing Units (GPUs) to accelerate the computation-intensive workloads is the gold standard in data mining and machine learning communities. However, it is not trivial to apply CPU-GPU systems to boost the performance of IPR solutions due to the natural complex of the IPR problem. This paper has been studied for how to using CPU-GPU systems accelerating exact IPR problem. The main research content and contribution of this paper are as follows:(1) Analyze the time cost of each phase in CPU-based IPR solutions. In order to solve the problem that the CPU cannot calculate matrix multiplication efficiently, this paper proposes to use CPU-GPU system to accelerate the calculation process of matrix multiplication in IPR. In order to make full use of the hardware resources of the CPU-GPU system, this paper systematically studies the characteristics of CPU-GPU system. And investigated the current calculation methods of CPU and GPU matrix multiplication.(2) This paper proposed an IPR algorithm using GPU to calculation matrix multiplication. Due to the limited size of video memory, this paper proposed matrix partition strategy fully utilizing GPU to calculation matrix multiplication. Based on matrix partition strategy proposed GPU-based IPR algorithm, which solves the performance bottleneck of the CPU-based algorithm.(3) This paper proposes two IPR optimization algorithms based on GPU matrix multiplication. Firstly, proposed an IPR optimization algorithm based on GPU sorting. It improves the efficiency of retrieving the k largest element in result matrix and reduce the overhead of transferring data between video memory and main memory. Secondly, improve the efficiency of by avoiding unnecessary computation cost with early termination technique. The core idea is exploiting Cauchy-Schwarz inequality for early termination.This paper demonstrates the efficiency of our proposal on four standard real datasets. The experiment results show that the algorithm proposed in this paper can effectively solve the performance bottleneck of CPU-based IPR algorithm. Compared with the current CPU-based IPR algorithm with the highest efficiency, the IPR algorithm proposed in this paper based on CPU-GPU system achieves a performance improvement from 15.88 to 138.78 times.
关键词	内积检索 CPU-GPU 系统 Top-k 检索推荐系统
其他关键词	Inner Product Retrieval CPU-GPU System Top-k Retrieval RecommenderSystems
语种	中文
培养类别	联合培养
成果类型	学位论文
条目标识符	http://sustech.caswiz.com/handle/2SGJ60CL/38816
专题	工学院_计算机科学与工程系
作者单位	南方科技大学
推荐引用方式 GB/T 7714	向隆. 基于 CPU-GPU 系统加速精准内积检索[D]. 深圳. 哈尔滨工业大学,2019.

条目包含的文件
文件名称/大小	文献类型	版本类型	开放类型	使用许可	操作
基于 CPU-GPU 系统加速精准内积检（4396KB）	学位论文	--	限制开放	CC BY-NC-SA	请求全文