A Statistical Framework for Modeling Asthma and COPD Biological Heterogeneity, and a Novel Variable Selection Method for Model-based Clustering
Thesis posted on 14.11.2016, 09:47 by Michael Abrha Ghebre
This thesis has two main parts. The first part is an application that focuses on identifying a statistical framework to model the biological heterogeneity of asthma and COPD using sputum cytokines. Clustering subjects on the actual cytokine measurements may not be straightforward, as these mediators are strongly correlated and standard clustering techniques currently ignore these correlations. Artificial data, with patterns similar to those of the cytokines but with known class membership, were simulated. Several approaches, such as data reduction using factor analysis, were applied to the simulated data to identify suitable representatives of the variables for use as input to a clustering algorithm. In the simulation study, using "factor-scores" (derived from factor analysis) as input variables for clustering outperformed the alternative approaches. This approach was therefore applied to model the biological heterogeneity of asthma and COPD, and it identified three stable and three exacerbation clusters, with different proportions of overlap between the diseases. The second part is a statistical methodology, in which a new method for variable selection in model-based clustering is proposed. This method generalizes the approach of Raftery and Dean (2006, JASA 101, 168-178). It relaxes the global prior assumption of linear relationships between clustering-relevant and irrelevant variables by searching for latent structures among the variables, and accounts for nonlinear relationships between these variables by splitting the data into sub-samples. A Gaussian mixture model (with unconstrained variance-covariance matrices, fitted using the EM algorithm) is applied to identify the optimal clusters. The new method performed considerably better than the Raftery and Dean technique when applied to simulated and real datasets, and demonstrates that variable selection within clustering can substantially improve the identification of optimal clusters.
However, at present it may not perform adequately in uncovering the optimal clusters in datasets whose variables are strongly correlated, such as sputum mediators.
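
As a rough illustration of the model-based clustering step described above, the following is a minimal sketch of a Gaussian mixture fitted with the EM algorithm. It is not the thesis's implementation: it is deliberately reduced to one dimension and two components, with each component keeping its own variance as a univariate analogue of unconstrained variance-covariance matrices. The function names (`normal_pdf`, `fit_gmm_em`), the initialisation scheme, the fixed iteration count, and the simulated data values are all assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_em(data, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Each component keeps its own variance (no equality constraint).
    Hypothetical helper written for illustration only.
    """
    # Crude initialisation: means at the extremes, shared starting variance.
    means = [min(data), max(data)]
    overall_mean = sum(data) / len(data)
    start_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    variances = [start_var, start_var]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])

        # M-step: responsibility-weighted updates of weights, means, variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(variances[k], 1e-6)  # guard against variance collapse

    # Hard cluster labels taken from the responsibilities of the last E-step.
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return weights, means, variances, labels


# Simulated data with known class membership (illustrative values only).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.5) for _ in range(200)]
weights, means, variances, labels = fit_gmm_em(data)
```

Because the class membership of the simulated points is known, the fitted `labels` can be compared with the true grouping, which mirrors in miniature the simulation-based evaluation mentioned in the abstract. A multivariate version would replace each variance with a full covariance matrix per component.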