Database federation, resource interoperability and digital identity, for management and exploitation of contemporary biological data
thesisposted on 11.01.2011, 13:12 by Gudmundur A. Thorisson
Modern research into the genetic basis of human health and disease is increasingly dominated by high-throughput experimentation and routine generation of large volumes of complex genotype to phenotype (G2P) information. Efforts to effectively manage, integrate, analyse and interpret this wealth of data face substantial challenges. This thesis discusses informatics approaches to addressing some of these challenges, primarily in the context of disease genetics. The genome-wide association study (GWAS) is widely used in the field, but translation of findings into scientific knowledge is hampered by heterogeneous and incomplete reporting, restrictions on sharing of primary data, publication bias and other factors. The central focus of the work was design and implementation of a core informatics infrastructure for centralised gathering and presentation of GWAS results. The resulting open-access HGVbaseG2P genetic association database and web-based tools for search, retrieval and graphical genome viewing increase overall usefulness of published GWAS findings. HGVbaseG2P conceptual modelling activities were also merged into a collaborative standardisation effort with international partners. A key outcome of this joint work is a minimal model for phenotype data which, together with ontologies and other standards, lays the foundation for a federated network of semantically and syntactically interoperable, distributed G2P databases. Attempts to gather complete aggregate representations of primary GWAS data into HGVbaseG2P were largely unsuccessful, chiefly due to concerns over re-identification of study participants. This led to a separate line of inquiry which explored - via in-depth field analysis, workshop organisation and other community outreach activities – potential applications of federated identity technologies for unambiguously identifying researchers online. Results suggest two broad use cases for user-centric researcher identities - i) practical, streamlined data access management and ii) tracking digital contributions for the purpose of attribution - which are critical to facilitating and incentivising sharing of GWAS (and other) research data.