SUSTech KC (南方科技大学知识苑): Research on Chinese Information Extraction Algorithm Based on Deep Learning

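Title	基于深度学习的中文信息抽取算法研究
Alternative title	RESEARCH ON CHINESE INFORMATION EXTRACTION ALGORITHM BASED ON DEEP LEARNING
Name	梁家熙
Student ID	11849322
Degree type	Master's
Degree program	Engineering (Computer Technology)
Supervisor	夏志宏
Thesis defense date	2020-05-30
Thesis submission date	2020-07-20
Degree-granting institution	Harbin Institute of Technology
Degree-granting location	Shenzhen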
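Abstract	With the development of the information age, large amounts of information exist on the Internet as text. This textual knowledge is usually stored in web pages in unstructured form, and conventional rule-based extraction methods cannot extract it well. How to extract key information from text automatically has therefore become a pressing need in industry. The main purpose of information extraction is to extract structured information from unstructured natural-language text accurately, quickly, and efficiently, and to save it in a preset format for later use. Traditional approaches to triple extraction include rule-based, machine learning, and deep learning methods. Deep learning methods model the task considerably better than the earlier approaches. Among them, pipeline and joint-learning methods have problems with the direction and matching of entity pairs, while hierarchical binary labeling models entity pairs effectively but suffers from error propagation across its multiple modeling steps. To address the multi-stage prediction problem, this thesis designs and implements a one-stage model built on a directed graph structure. The model uses the adjacency matrix of a directed graph to express both the positions of entity pairs and the direction of the relation between entity words. The thesis also designs several models for constructing the adjacency matrix; among them, the bilinear matrix attention model can effectively use its attention matrix to construct the adjacency matrix. Building on the hierarchical binary labeling model, the thesis explores how well different span extraction models extract entity-word features. Among them, the endpoint-vector mixing method improves on the original method, using simple feature engineering to further strengthen the triple extraction ability of the hierarchical binary labeling model. In addition, following the idea of the hierarchical binary labeling model, the thesis subdivides the triple extraction structure further and designs and implements a three-stage model. The research focus of this model is the classification performance of different entity-pair relation classifiers. Several relation classification models were tested; the convolutional neural network model classifies slightly better than models such as long short-term memory networks. The combination of the pretrained model based on the directed graph structure and the bilinear matrix attention model reaches an F1 score of 0.807, and the proposed three-stage model based on hierarchical binary labeling reaches an F1 score of 0.778. Both are clear improvements over the F1 score of 0.697 of the two-stage hierarchical binary labeling model proposed in the literature.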
Other abstract	With the development of the information age, a large amount of information exists in the form of text on the Internet. The text knowledge of the Internet is usually stored in web pages in an unstructured form, and conventional rule-based extraction methods cannot extract this knowledge well. How to use automated methods to extract key information from text has therefore become an urgent need in the industry. The main purpose of information extraction is to extract structured information from unstructured natural language text accurately, quickly, and efficiently, and to save it in a preset format for subsequent use. The traditional research approaches to triple information extraction include rule-based methods, machine learning methods, and deep learning methods. Compared with previous research methods, deep learning-based methods have great advantages in modeling. Among deep learning methods, the pipeline method and the joint learning method have problems with the pointing and matching of entity pairs, and the hierarchical binary labeling method, although it models entity pairs effectively, suffers from error propagation caused by its multiple modeling steps. In order to solve the multi-stage prediction problem, this paper designs and implements a one-stage model with a directed graph structure. This model uses the adjacency matrix of a directed graph to express both the positions of entity pairs and the pointing relationship between entity words. This paper also designs a variety of models for constructing the adjacency matrix of the directed graph; among them, the bilinear matrix attention model can effectively use its attention matrix to construct the adjacency matrix. Based on the hierarchical binary labeling model, this paper explores the ability of different span extraction models to extract the features of entity words. Among them, the endpoint vector mixing method improves on the original method, using simple feature engineering to further enhance the triple information extraction ability of the hierarchical binary labeling model. In addition, following the idea of the hierarchical binary labeling model, this paper further subdivides the structure of triple information extraction and designs and implements a three-stage model. The research focus of this model is the classification performance of different entity-pair relation classifiers. This paper experimented with several relation classification models; among them, the CNN model classifies slightly better than LSTM and other models. The combination of the directed graph based BERT model and the bilinear matrix attention model designed and implemented in this paper achieves an F1 score of 0.807, and the proposed three-stage model based on hierarchical binary labeling achieves an F1 score of 0.778. Both clearly improve on the F1 score of 0.697 of the two-stage hierarchical binary labeling model from the literature.
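The one-stage idea described in the abstract, scoring every ordered token pair with a bilinear form and reading the thresholded scores as the adjacency matrix of a directed graph, can be illustrated with a small sketch. This is not the thesis implementation: the names `bilinear_adjacency`, `decode_edges`, `H`, `W`, and `threshold` are hypothetical, the token vectors and weight matrix are hand-set toy values rather than outputs of a trained encoder, and a single matrix standing for a single relation type is an assumption.

```python
import numpy as np


def bilinear_adjacency(H, W, threshold=0.5):
    """Build a directed adjacency matrix from token vectors.

    H: (n, d) array of token vectors (hypothetical encoder output).
    W: (d, d) bilinear weight matrix (hypothetical, normally learned).
    Entry [i, j] is 1 when sigmoid(h_i^T W h_j) exceeds the threshold,
    meaning an edge pointing from token i to token j.
    """
    scores = H @ W @ H.T                    # scores[i, j] = h_i^T W h_j
    probs = 1.0 / (1.0 + np.exp(-scores))   # element-wise sigmoid
    return (probs > threshold).astype(int)


def decode_edges(adj, tokens):
    """List the directed edges (source token, target token) in adj."""
    rows, cols = np.nonzero(adj)
    return [(tokens[i], tokens[j]) for i, j in zip(rows, cols)]


# Toy input: one-hot token vectors, so the score matrix equals W itself.
tokens = ["A", "B", "C"]
H = np.eye(3)
W = np.full((3, 3), -5.0)
W[0, 2] = 5.0                               # only the pair A -> C scores high

adj = bilinear_adjacency(H, W)
print(decode_edges(adj, tokens))            # prints [('A', 'C')]
```

Because `W` is not symmetric, the score for (A, C) differs from the score for (C, A), which is what lets a single matrix carry both the position of an entity pair and the direction of the relation between its members.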
Keywords	information extraction; deep learning; triple
Other keywords	information extraction; deep learning; triad
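Language	Chinese
Training category	Joint training
Output type	Degree thesis
Item identifier	http://sustech.caswiz.com/handle/2SGJ60CL/142747
Collection	School of Innovation and Entrepreneurship
Author affiliation	Southern University of Science and Technology
Recommended citation (GB/T 7714)	梁家熙. 基于深度学习的中文信息抽取算法研究[D]. 深圳. 哈尔滨工业大学,2020.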

Files in this item
File name/size	Document type	Version type	Access type	License	Actions
基于深度学习的中文信息抽取算法研究.pd (3069 KB)	--	--	Restricted access	--	Request full text