Title

基于大数据的污染源普查清查方法学研究

Alternative title
Methodological Study on List Screening of Pollution Sources Survey Based on Big Data
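Name
Student ID
11749133
Degree type
Master's
Degree discipline
Environmental Science and Engineering
Supervisor
胡清
Thesis defense date
2019-05-29
Thesis submission date
2019-07-14
Degree-granting institution
Harbin Institute of Technology
Degree-granting location
Shenzhen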
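Abstract
To strengthen the supervision and management of environmental pollution, and to promptly understand and record basic information on the potential environmental pollution of enterprises and institutions, China carried out its first national survey of pollution sources in 2008. Owing to the limited understanding, technical means, and data analysis capabilities of the time, that first survey had many shortcomings. In the list screening stage, government departments screened enterprises solely by their industry classification codes to produce a list of basic units that served as the basis for on-site visits. However, the departmental data was incomplete and the industry classification codes used for screening contained many errors, so a considerable number of enterprises were missing from the list of basic units and the list of industrial pollution sources was inaccurate. China's second survey of pollution sources began in 2018. This study therefore applies big data and related technologies: it uses the business scope recorded in business registration data to identify and correct industry categories, and uses Internet big data to supplement the list of basic units, with the aim of optimizing the data processing workflow of the list screening stage and improving the efficiency and accuracy with which the list of basic units is built.
First, this study compares the available methods, evaluates and filters the data provided by government departments, and, given the need to process massive amounts of data, builds a machine learning classification model. Following the basic approach of applying machine learning to practical problems, it first constructs a standard data set and verifies its accuracy and usability, then compares several classification algorithms and selects the best. The resulting calibrated data set is then used as the training set to classify the national, provincial, and municipal business registration data provided by government departments. To ensure reliability, accuracy is checked against actual on-site feedback from the list screening and against other supplementary experiments, which ultimately verifies the usability of the machine learning model built in this study. A comparison of several algorithms shows that naive Bayes is the best classification algorithm. Checked against actual list screening feedback and measured by the F1 value (the harmonic mean of precision and recall; a higher F1 indicates a better classification result), the F1 values of the three data sets show relative improvements of 32.92%, 21.42%, and 14.91%, respectively. In the supplementary experiments, the F1 values show relative improvements of 151.06%, 213.45%, and 132.13%, respectively, over the original government department data sets, which is a marked improvement. This verifies the accuracy of the calibrated data set and the usability of the machine learning model for identifying and correcting wrong industry categories from enterprises' business scope.
Second, to make the second survey more accurate, this study examines the feasibility of supplementing the list of basic units with Internet big data. Multi-source Internet big data is evaluated and filtered according to the general principles for analyzing big data usability. The validated machine learning classification model described above is then used to classify the filtered Internet data. Accuracy is checked against actual on-site feedback and other supplementary experiments, and data quality is analyzed. The final accuracy of the Internet supplementary data is 17.26%, similar to that of the municipal business registration data. Taking actual working conditions into account, supplementary experimental analysis indicates that Internet supplementary data should contribute 4.54%-16.85% to the basic list of enterprises. Horizontal and vertical comparisons of the classification results show that Internet data is noticeably more homogeneous than departmental data, and that homogenization is more pronounced in the low proportion Internet data, because Internet data describes enterprises' business scope in relatively uniform terms. In terms of accuracy for specific industry classifications, departmental data is higher overall. High proportion Internet data is more accurate than low proportion data, and low proportion data falls well short of departmental data and is less usable. Given the specific objectives of the list screening stage, Internet supplementary data can play an important role in finding missing enterprises and can effectively broaden the channels for acquiring data.
Finally, based on the results of correcting industry classifications from enterprises' business scope and of supplementing missing enterprise information with multi-source Internet big data, and taking into account departmental requirements in actual pollution source survey work, this study proposes a big-data-based method for optimizing the compilation of the list of basic units in the list screening stage, which provides a methodological reference for China's second national survey of pollution sources and for other future environmental statistics work.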
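As a minimal illustration of the evaluation metric named in the abstract, the sketch below computes F1 as the harmonic mean of precision and recall, and expresses the change between a baseline and a corrected data set as a relative improvement. The function names and the precision and recall figures are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Hypothetical figures, for illustration only (not results from the thesis).
baseline_f1 = f1_score(precision=0.20, recall=0.50)
corrected_f1 = f1_score(precision=0.30, recall=0.60)
print(f"relative F1 improvement: "
      f"{relative_improvement(baseline_f1, corrected_f1):.2f}%")
```

Because the improvement is relative, a figure above 100% means the corrected F1 is more than double the baseline F1, not that F1 itself exceeds 1.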
Other abstract
To strengthen environmental supervision and management and to record basic environmental information on enterprises and institutions, China conducted its first national survey of pollution sources in 2008. That survey was limited by the understanding, technical means, and data analysis capabilities of the time, and it had many deficiencies. In the list screening stage, government departments selected the list of basic units solely by the enterprises' industry classification codes, and this list served as the basic inventory for physical inspection. However, the data provided by government departments was incomplete, and the industry classification codes used for screening contained a large number of errors. As a result, the inventory of basic units contained many enterprises outside the target industries. At the same time, for various practical reasons, information on a large number of target enterprises was missing from the government data, which made the basic list of industrial pollution sources inaccurate. The second national survey of pollution sources began in 2018. This study therefore uses big data and related technologies to identify and correct industry categories from business registration data, and uses Internet big data to supplement the basic unit list, with the aim of optimizing the data processing flow of the list screening stage and improving the efficiency and accuracy with which the basic unit list is built.
First, the study evaluates the data provided by government departments and builds a machine learning classification model. Following the usual machine learning workflow for practical problems, it constructs a standard data set and verifies its accuracy, then compares several classification algorithms and selects the best one. The constructed calibration data set is then used as the training set to classify the national, provincial, and municipal business registration data provided by government departments, and the accuracy of the machine learning model is verified against actual physical inspection feedback and other supplementary experiments. The results show that the naive Bayes classification algorithm performs well. With the F1 value as the evaluation index, the actual feedback test shows relative F1 increases of 32.92%, 21.42%, and 14.91% for the three data sets, respectively. In the supplementary experiment, F1 increased by 151.06%, 213.45%, and 132.13%, respectively, compared with the original government department data sets. These improvements are substantial and verify the accuracy of the calibration data set and of the machine learning model.
Secondly, Internet multi-source big data acquired by a third-party team is evaluated and filtered according to the general principles for analyzing big data availability. The classification model validated above is used to classify the filtered Internet data. Physical inspection feedback and other supplementary experiments are then used to verify accuracy, analyze data quality, and assess the feasibility of using Internet big data to supplement the basic unit list. The final accuracy of the supplementary data is 17.26%. Taking the actual work situation into account, supplementary experimental analysis indicates that Internet supplementary data should contribute 4.54%-16.85% to the basic list of enterprises. Horizontal and vertical comparisons of the classification results show that Internet data is more homogeneous than departmental data, and that this homogenization is more pronounced in the low proportion Internet data than in the high proportion data. In terms of accuracy for specific industry classifications, the high proportion data is more accurate, while the low proportion data falls well short of the departmental data and is less usable. Given the specific objectives of the list screening stage, Internet supplementary data can play an important role in finding missing enterprises and can effectively broaden the channels for acquiring data.
Finally, drawing on the results of correcting industry classifications from enterprises' business scope and of supplementing missing enterprise information with Internet big data, and taking into account the requirements of actual pollution source survey work, we propose an optimized method for compiling the basic unit list in the list screening stage. The method provides a reference for the second national pollution source survey and for other future environmental statistics work.
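The abstract reports that a naive Bayes classifier over enterprises' business scope text performed well for industry classification. The following is a minimal sketch of that general technique, not the thesis's implementation: a multinomial naive Bayes with add-one smoothing written against the Python standard library only. The class name, the two labels, the English sample texts, and the whitespace tokenization are all assumptions made for illustration; real business scope text in Chinese would first need a word segmenter, and the label set would follow the national industry classification.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.token_counts = defaultdict(Counter)   # token counts per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = text.lower().split()          # assumption: whitespace tokens
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in text.lower().split() if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / self.total_docs)      # log prior
            total_tokens = sum(self.token_counts[label].values())
            for token in tokens:
                count = self.token_counts[label][token]
                # Smoothed log likelihood of the token given the class.
                score += math.log((count + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled business scope descriptions (illustration only).
train_scopes = [
    "manufacture of chemical fertilizers and pesticides",
    "production and sale of printed circuit boards electroplating",
    "software development and information technology consulting",
    "retail sale of clothing and footwear",
]
train_labels = ["industrial", "industrial", "non_industrial", "non_industrial"]

clf = NaiveBayesTextClassifier().fit(train_scopes, train_labels)
print(clf.predict("electroplating and circuit boards production"))
print(clf.predict("software consulting"))
```

Working in log space avoids numeric underflow when many token probabilities are multiplied, and the smoothing keeps a single unseen token from zeroing out a class.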
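Keywords
Other keywords
Language
Chinese
Training category
Joint training
Output type
Thesis
Item identifier
http://sustech.caswiz.com/handle/2SGJ60CL/38805
Collection
College of Engineering_School of Environmental Science and Engineering
Author affiliation
Southern University of Science and Technology
Recommended citation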
GB/T 7714
鹿明. 基于大数据的污染源普查清查方法学研究[D]. 深圳. 哈尔滨工业大学,2019.
Files in this item
File name/size | Document type | Version type | Access type | License
基于大数据的污染源普查清查方法学研究.p (4550KB) | -- | -- | Restricted access | --