Title | Comparisons of Count Data for Independent and Correlated Groups
Alternative Title | 多组独立与相关的计数数据的比较
Author | 周瑞伟 (Zhou RW)
Student ID | 11749022
Degree Type | Master's
Major | Probability Theory and Mathematical Statistics
Supervisor |
Defense Date | 2019-05-15
Submission Date | 2019-07-10
Degree-Granting Institution | Harbin Institute of Technology
Degree-Granting Location | Shenzhen
Abstract | Count data are very common in daily life and appear widely in medical experiments, transportation and economics: for example, the number of heartbeats of arrhythmia patients, the number of calls received by a call center within a given period, the number of customers entering a mall in one day, or the number of traffic accidents at an intersection over a period of time. This thesis starts from ventricular contraction counts recorded for the same arrhythmia patients before and after the use of a new drug, which raises the question of how to compare groups of correlated or independent count data. Comparing groups of count data has long been of interest. In this example the goal is to test whether the new drug has a significant effect, so we need to compare the two groups of count data. Without loss of generality, we are interested in comparing two or more groups of count data. The groups can be correlated; such data mostly appear as paired data, that is, records of the same individuals or of pre-selected matched subjects under different experimental conditions, so that every group has the same number of observations. The groups can also be independent, in which case the group sizes may differ.

Research on groups of count data helps people make judgments and even decisions in real life on a sound theoretical basis, for example, testing whether a new drug has the expected effect, whether a monitoring device significantly reduces the number of traffic accidents at an intersection, or whether a promotion significantly increases the number of customers entering a mall. Statistical significance provides sufficient evidence to avoid misjudgment, rather than relying solely on subjective experience or intuition.

Count data are usually modeled with a Poisson distribution, but the Poisson distribution requires the mean of the variable to equal its variance, a condition that is often violated in practice. Many count datasets show a variance greater than the mean, a phenomenon called overdispersion. The negative binomial distribution handles overdispersion well because it contains an extra parameter that models the relationship between the variance and the mean. In addition, there are other distributions for specific kinds of data. For example, in data on the number of dentist visits, most people do not see a dentist every year unless necessary, because it is expensive and inconvenient, which leads to an excess of zeros. Neither the Poisson nor the negative binomial distribution can model such data well, so we consider zero-inflated distributions such as the zero-inflated Poisson and the zero-inflated negative binomial distributions. These modified distributions tend to fit such data better than the original ones.

For a comparison between specific groups of count data, one could build a bespoke model for that particular problem, but such an approach is time-consuming and laborious. This thesis therefore mainly uses regression analysis, which is general, produces results that are intuitive and easy to interpret, and makes statistical inference and hypothesis testing convenient. In the regression analysis, the group acts as an important factor. For count data, Poisson regression and negative binomial regression are very common models; they can qualitatively analyze the influence of the group factor on the response variable.
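As a minimal illustration of this modeling step (a sketch on simulated data, not code or data from the thesis; the gamma-mixed data-generating process and the statsmodels calls are assumptions of the example), the following Python snippet fits Poisson and negative binomial regressions with a group indicator and checks for overdispersion.

```python
# Sketch: Poisson vs. negative binomial regression with a group factor.
# Assumed setup, not the thesis data: the counts are overdispersed because a
# gamma-distributed multiplicative effect is mixed into the Poisson mean.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
group = np.repeat([0.0, 1.0], n // 2)                      # two independent groups
lam = np.exp(1.0 + 0.3 * group) * rng.gamma(2.0, 0.5, n)   # gamma mixing -> overdispersion
y = rng.poisson(lam)

X = sm.add_constant(group)
pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
# The GLM keeps the NB dispersion alpha fixed; the discrete NegativeBinomial
# model (sm.NegativeBinomial) would estimate it by maximum likelihood instead.
negb = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0)).fit()

print("overdispersion ratio:", pois.pearson_chi2 / pois.df_resid)
print("group effect (Poisson):", pois.params[1], "p =", pois.pvalues[1])
print("group effect (NegBin): ", negb.params[1], "p =", negb.pvalues[1])
```

Under the gamma mixing, the Pearson chi-square divided by the residual degrees of freedom is typically well above 1, the usual signal that the Poisson equal-mean-variance assumption fails and a negative binomial model is worth considering.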
For zero-inflated data, we also build zero-inflated Poisson regression and zero-inflated negative binomial regression models and compare them with the plain regressions; we find that the zero-inflated models work better. The thesis also describes how Poisson regression, negative binomial regression and the corresponding zero-inflated models estimate their parameters with the EM algorithm or the Newton-Raphson algorithm.

The generalized linear models above fit count data from independent groups well. For paired count data, however, each subject may show subtle fluctuations across repeated measurements for reasons of its own; for example, heartbeat counts may fluctuate because of the individual's own condition. When modeling paired count data, we must therefore account for within-individual fluctuations across repeated measurements, otherwise the regression results may contain large errors. Correlated groups of count data usually arise as records of the same individuals under different experimental conditions, for example, heartbeat counts before and after taking a drug. We introduce random effects into the regression model to explain each individual's fluctuation. The random effects are assumed to be independently drawn from a normal distribution with mean zero and unknown variance. Given the random effects, the responses are assumed to be conditionally independent and to follow a Poisson or negative binomial distribution whose mean is linked to the covariates and the random effect through a link function. In this way the possible fluctuation of each individual is fully taken into account. A generalized linear model (GLM) with random effects is called a generalized linear mixed model (GLMM).

For parameter estimation in the GLMM, the random effects cannot be observed and the likelihood function has no closed-form expression, so the parameters cannot be estimated directly by the EM algorithm or the Newton-Raphson algorithm. This thesis therefore adopts a Monte Carlo method combined with the EM and Newton-Raphson algorithms. Because the random effects are unobserved, we treat them as missing data in the EM algorithm: in the E step we generate samples from their posterior distribution and evaluate the likelihood numerically, and in the M step we update the parameters. The iteration continues until a predetermined convergence criterion is met, which yields the maximum likelihood estimates (MLE) of the parameters. The posterior distribution of the random effects is too complicated to sample from directly, so an auxiliary sampling method is needed. The acceptance-rejection method is a commonly used technique: it samples indirectly through another density that is easier to draw from, and it only needs the target density up to a normalizing constant. As used in this thesis, it is feasible in practice, although sometimes time-consuming.
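As a minimal sketch of the E-step sampling described above (the details are assumptions of this example, not the thesis code: a Poisson GLMM with one random intercept per subject, the N(0, sigma^2) prior used as the proposal density, and the rejection bound obtained by maximizing the Poisson likelihood factor over the random effect), the following Python function draws from the posterior of a single subject's random effect by acceptance-rejection.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_random_effect(y, eta, sigma, n_draws=500):
    """Draw from p(b | y), proportional to N(b; 0, sigma^2) * prod_j Poisson(y_j; exp(eta_j + b)),
    by acceptance-rejection, using the N(0, sigma^2) prior as the proposal.

    y     -- counts for one subject (length m)
    eta   -- fixed-effect linear predictors x_j' beta for the same subject
    sigma -- current estimate of the random-effect standard deviation
    """
    y, eta = np.asarray(y, float), np.asarray(eta, float)

    # Log of the Poisson likelihood factor in b, up to the constant -sum(log y_j!).
    def loglik(b):
        return np.sum(y * (eta + b)) - np.sum(np.exp(eta + b))

    # The likelihood factor is bounded in b: its maximiser is
    # b* = log(sum(y) / sum(exp(eta))) when sum(y) > 0; when all counts are zero
    # its supremum is 1 (log bound 0), approached as b -> -inf.
    if y.sum() > 0:
        log_bound = loglik(np.log(y.sum() / np.exp(eta).sum()))
    else:
        log_bound = 0.0

    draws = []
    while len(draws) < n_draws:
        b = rng.normal(0.0, sigma)                     # propose from the prior
        if np.log(rng.uniform()) < loglik(b) - log_bound:
            draws.append(b)                            # accept with prob L(b)/bound
    return np.array(draws)

# Example: counts for one subject before/after treatment, with eta = x'beta taken
# from the current parameter estimates of the EM iteration (made-up numbers).
posterior_draws = sample_random_effect(y=[5, 2], eta=[1.6, 1.2], sigma=0.4)
print(posterior_draws.mean(), posterior_draws.std())
```

Inside a Monte Carlo EM iteration, draws like these replace the intractable E-step expectations with Monte Carlo averages, and the M step then updates the fixed effects and the random-effect standard deviation; only the target density up to a constant is needed, so the normalizing constant of the posterior never has to be computed.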
We first carry out numerical experiments. The null hypothesis is that there is no significant difference between the groups, and the alternative hypothesis is that there is a difference. The data are generated by drawing samples from a Poisson distribution and adding normally distributed random effects. Both the GLM without random effects and the GLMM are used to model the simulated data, and we compare their type I and type II errors, the latter measured through power. Models with random effects perform better on both types of error. In addition, the models agree with each other on the significance of the group factor. This shows that adding random effects to the model is very reasonable and necessary when modeling paired count data.

On several real datasets, we compare traditional generalized linear models, including Poisson regression and negative binomial regression, with the corresponding models that include random effects. The different models agree on the significance of the group factor, and the GLMM also estimates the variance of the random effects. This shows that in real situations the random effects caused by within-individual fluctuations do exist, and that for correlated groups of count data, especially paired data, random effects should be taken into account. Some of these real datasets are zero-inflated, so we fit Poisson regression, negative binomial regression, zero-inflated Poisson regression and zero-inflated negative binomial regression and compare their results. For data with excess zeros, the zero-inflated models are much better than the plain regression models. For comparisons between models we mainly use information criteria, which are based on the value of the likelihood function with a penalty involving the number of parameters and the sample size.

Finally, we summarize the thesis and give some discussion, mainly about the treatment of the random effects. First, we assume in this thesis that the random effects of different individuals are independent and identically distributed from a normal distribution, with only one parameter to estimate, its standard deviation. One could instead let the random effects of different individuals come from normal distributions with different variances and estimate those variances, or assume another distribution for the random effects, such as the gamma distribution. Second, we assume that different individuals are independent of each other, that is, their random effects are independent. In real situations there may be correlations between individuals, so the random effects could be given a joint distribution, for example a multivariate normal distribution with non-zero covariances or a similar multivariate distribution. By introducing correlated random effects, we can model the relationship between subjects and obtain better and more reasonable results.
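Returning to the model comparison on the zero-inflated data described above (a sketch on simulated counts, not the thesis datasets; the data-generating mechanism and the statsmodels zero-inflated model classes used here are assumptions of the example), the following Python snippet fits the four regressions and ranks them by AIC and BIC.

```python
# Sketch: compare Poisson, negative binomial, zero-inflated Poisson and
# zero-inflated negative binomial fits on counts with excess zeros.
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import (
    ZeroInflatedPoisson, ZeroInflatedNegativeBinomialP)

rng = np.random.default_rng(1)
n = 400
group = rng.integers(0, 2, n).astype(float)      # 0/1 group indicator
y = rng.poisson(np.exp(0.8 + 0.4 * group))       # group raises the mean count
y[rng.uniform(size=n) < 0.3] = 0                 # add roughly 30% structural zeros

X = sm.add_constant(group)
fits = {
    "Poisson": sm.Poisson(y, X).fit(disp=0),
    "NegBin":  sm.NegativeBinomial(y, X).fit(disp=0),
    "ZIP":     ZeroInflatedPoisson(y, X).fit(method="bfgs", maxiter=500, disp=0),
    "ZINB":    ZeroInflatedNegativeBinomialP(y, X).fit(method="bfgs", maxiter=500, disp=0),
}
for name, res in fits.items():
    print(f"{name:8s} AIC = {res.aic:9.1f}   BIC = {res.bic:9.1f}")
```

With this fraction of structural zeros, the zero-inflated fits typically attain clearly lower AIC and BIC than the plain Poisson and negative binomial fits, which mirrors the comparison reported above; the same criteria also penalize the extra parameters, so a zero-inflated model is preferred only when the excess zeros are real.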
Abstract (Chinese) | 计数数据在日常生活中十分常见,这些数据广泛出现在医疗实验,交通部门和经济部门之中。例如心律失常病人的心跳次数,某段特定的时间内某呼叫中心所接到的电话数量,一天内进入某商场的顾客人数,一段时间内某特定的十字路口交通事故的发生次数等等。这些取值为非负整数的计数数据来源很广。本文从实际医疗背景中的某新药物使用前后同一批心律失常患者心室收缩数据的对比出发,引出如何对比多组相关或者独立的计数数据这个问题。人们对比较多组计数数据兴趣一直很深厚。在这个例子中,我们的目标是检验这个新药物是否有效果,因而我们需要比较的是两组的计数数据。不失一般性,我们的兴趣在于比较两组或者多组的计数数据。这些数据可以是相关的,大多数表现为配对数据,即,同一批个体或者预先挑选的一批配对个体在不同实验方法之下的记录数值,这样的数据中每组的数据量是相同的;这些数据也可以是不相关的,也就是说,对于不同的组别,每个个体是彼此独立的,并且数量也可以不等。对于多组计数数据的研究可以帮助人们在真实生活中做出判断甚至制定决策并且有足够的理论依据。例如上述的检验一种新药是否有预期的效果,检验监控装置是否能显著减少十字路口的交通事故数量以及促销活动是否显著地增加了进入商场的顾客人数等等。统计意义上的显著可以提供足够的依据而不是仅仅靠主观经验或者直觉来判断,能够避免误判。通常来说,计数数据可以用泊松分布来进行拟合,但是泊松分布要求随机变量的期望和方差需要相同,而这一条件在实践与生活中往往无法满足。很多数据都会呈现方差大于期望的现象,这被称为过度离散。负二项分布可以很好地处理这个问题,因为它本身含有一个参数用来对方差与期望间的关系进行建模。除了过度离散的问题外,还存在很多其他的分布,用来解决一些特定的数据中出现的问题。例如看牙医次数的数据,大多数人如果非必要,不会每年都去看牙医,因为价格十分昂贵也比较麻烦,这样就导致了数据中含有过多的零。泊松分布以及负二项分布都没办法对这样的数据进行建模。因而我们需要考虑零膨胀的分布例如零膨胀泊松分布以及零膨胀负二项分布等等。相比原始的分布,这些改动后的分布往往能够更好地对数据进行建模,达到一个更好的效果。对于特定的多组计数数据之间的比较,我们可以有针对性地建立模型,专门解决这个问题。然而这样的方法太过于费时费力。因此我们在这篇文章中主要采用比较通用的回归分析,因为其结果直观易懂且可以进行后续的统计推断与假设检验。在这篇文章中,组别作为一个重要因素被纳入回归分析之中。对于生活中常见的计数数据来说,泊松回归与负二项回归都是被广泛使用的模型,它们可以定性地分析自变量对响应变量的影响,其中也包括我们感兴趣的组别这个因素。对于零膨胀的数据,我们还采用了零膨胀泊松回归,考虑到过度离散的问题,我们也建立了零膨胀负二项回归,并且将零膨胀模型的结果与原始的回归相比。我们发现零膨胀模型的效果更好。在本文中,我们还简要地对每一个回归模型进行了公式推导,描述了泊松回归和负二项回归以及对应的零膨胀模型如何用EM算法或者牛顿算法进行参数估计以及假设检验。上述的广义线性模型能够很好地拟合各组之间彼此独立的计数数据。但是,对于配对数据,我们注意到,同一个个体在多次测量中可能会由于自身原因产生一些细微的波动,例如心跳次数的测量可能由于个体自身的原因导致波动较大,因此在建模的时候,我们必须考虑个体在多次测量下的波动问题,否则可能会导致回归结果误差较大。对于相关的多组计数数据,其通常表现为对同一批多个个体在不同实验方法下的数据的记录,例如上述的同一批患者在服用药物前后的心跳次数。我们在回归模型中引入随机效应用来解释单个个体的波动性。我们假设模型中为了衡量个体波动而引入的随机效应来自正态分布,并且期望为零、方差未知。在给定随机效应的前提下,我们假设响应变量相互之间是独立的泊松变量或者来自负二项分布的变量,并且其期望通过一个链接函数与协变量以及随机效应联系起来。这样一来,我们充分考虑到了每个个体自身可能存在的波动性并对其进行建模。在广义线性模型之中引入随机效应来进行建模的模型一般称为广义线性混合模型。广义线性混合模型的设定十分简单易懂且符合实际情况,但是,其中的参数估计这一部分较难。由于随机效应无法被观测,无法得到似然函数的解析式,所以无法通过EM算法或牛顿算法直接对其进行参数估计。因此,在本文中,蒙特卡罗方法被采用,并且与EM算法以及牛顿算法相结合来对参数进行估计。由于随机效应是观测不到的,因此我们将其视作EM算法中的缺失数据。在E步,有了样本的观测值后,我们就能获得随机效应后验分布的表达式,然后我们从这个比较复杂的概率密度表达式中抽取样本,并对似然函数进行数值计算;在M步,我们对参数进行迭代更新。如此不断迭代一直到满足事先给定的收敛条件进而得到参数的极大似然估计。在这个过程中,因为随机效应的后验分布表达式十分复杂,导致从后验分布中进行直接抽样无法实现,所以我们需要使用一些辅助方法来进行抽样。接受-拒绝方法是一种广泛使用的抽样方法,它可以在直接抽样较难的情况下通过另外一个较容易抽样的概率密度来进行间接抽取,并且它不要求知道完整的概率密度表达式,可以忽略与变量无关的常数。本文使用此方法来进行抽样,在实践中证明其是可行的,缺点在于有时抽样速度较慢。我们先进行了数值实验,原假设是各组之间无显著差别,备择假设是各组之间有差别。数据的产生机制是从泊松分布中抽取样本,并加上随机效应,数值实验中随机效应来自于正态分布。没有随机效应的简单的广义线性模型以及加入了随机效应的广义线性混合模型都被我们用来对模拟出来的数据进行建模。然后我们比较了不同模型的第一、二类错误的大小,其中第二类错误是以功效的大小来衡量的。我们发现,对于这两类错误,带随机效应的模型均有良好的表现。另外,这些模型对于组别这个变量的显著性检验结果都十分一致。这也说明了在模型中加入随机效应是十分合理以及有必要的。在若干个真实数据上,我们比较了简单的广义线性模型,包括泊松回归和加入了随机效应的泊松回归,以及负二项回归和加入了随机效应的负二项回归。我们发现不同模型对于组间差距的识别表现十分一致,广义线性混合模型还能够估计出随机效应的方差。由此说明,在真实情况下,由于同一个体自身原因所导致的随机效应确实存在。在多组相关计数数据中尤其是配对数据,随机效应的影响在建模时应该加以考虑。在这些数据集上,有些数据是零膨胀的,因此我们建立了简单的泊松和负二项回归;对于明显带有超量零的数据,我们建立了适用于拥有超量零的数据的零膨胀泊松和负二项回归。接着我们比较了各种不同的回归模型,发现对于含有超量的零的数据,零膨胀模型比简单的回归模型效果要好很多。对于不同模型之间的比较,我们主要是利用信息准则来判断。信息准则是基于似然函数来比较不同模型的,它对参数数量以及样本大小进行了一定的惩罚,从而选出最优的模型。最后,我们对文章进行总结以及提出展望。展望主要集中在对随机效应的处理上。首先,本文中对个体的随机效应的处理是假设不同个体随机效应是独立同分布的正态效应,只含有一个待估参数也就是其标准差。我们还可以假设不同个体的随机效应相互独立,都来自于正态分布,但是我们可以假设它们的方差不同。然后可以对不同个体的方差进行迭代估计,得到极大似然估计。同样的,我们还能够假定随机效应来自于其他的分布,简单的例子包括伽玛分布等等。其次,本文还假设不同个体之间是无关的,也就是说,它们的随机效应是不相关的。但是,在真实情况下,不同个体也可能是相关的。因此我们在假设不同随机效应时,不仅可以设定为相互独立,还可以设定为互相之间存在关系。例如多元正态分布且协方差不为零或者其他类似的多元分布。通过引入多元的随机效应,我们能够对不同个体之间的相关性进行建模,从而得到更好更合理的结果。
Keywords |
Other Keywords |
Language | English
Training Category | Joint training program
Output Type | Degree thesis
Identifier | http://sustech.caswiz.com/handle/2SGJ60CL/38937
Collection | Department of Mathematics, College of Science
Affiliation | Southern University of Science and Technology
Recommended Citation (GB/T 7714) | Zhou RW. Comparisons of Count Data for Independent and Correlated Groups[D]. Shenzhen: Harbin Institute of Technology, 2019.
Files in This Item | Comparisons of Count (2120KB), restricted access