Automatic identification of variables in epidemiological datasets using logic regression

Lorenz, Matthias W.; Abdi, Negin Ashtiani; Scheckenbach, Frank; Pflug, Anja; Bülbül, Alpaslan; Catapano, Alberico L.; Agewall, Stefan; Ezhov, Marat; Bots, Michiel L.; Kiechl, Stefan; Orth, Andreas; PROG-IMT study group

journal contribution

posted on 2018-08-07, 15:27 authored by Matthias W. Lorenz, Negin Ashtiani Abdi, Frank Scheckenbach, Anja Pflug, Alpaslan Bülbül, Alberico L. Catapano, Stefan Agewall, Marat Ezhov, Michiel L. Bots, Stefan Kiechl, Andreas Orth, PROG-IMT study group

BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in recognising matching variables is particularly important: it allows software to present, for each target variable, a choice of candidate source variables from which a user selects the matching one, with only a low risk that the correct source variable has been missed.

METHODS: For each variable in a set of target variables, a number of simple rules were created manually. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated.

RESULTS: In the construction sample, the 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, the average PPV was 33% and the average NPV 94%. PPV was 50% or less in 63% of all variables in the construction sample, and in 71% of all variables in the validation sample.

CONCLUSIONS: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
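To illustrate the quantities reported in the abstract, the following minimal Python sketch evaluates one Boolean combination of simple rules for a single target variable and computes its PPV and NPV. It is not the authors' implementation and does not perform logic regression (the search for the optimal combination); the variable names, labels, rules, and the combination itself are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical source variables: (name, label, truly matches target "systolic BP").
SOURCE_VARS: List[Tuple[str, str, bool]] = [
    ("sbp1", "systolic blood pressure, visit 1", True),
    ("sysbp", "systolic bp", True),
    ("syst", "systolic", True),
    ("dbp1", "diastolic blood pressure, visit 1", False),
    ("bpmed", "blood pressure medication", False),
    ("age", "age at baseline", False),
    ("sex", "sex of participant", False),
]

Rule = Callable[[str, str], bool]

# Hypothetical simple rules, each looking at the variable name or its label.
RULES: Dict[str, Rule] = {
    "name_starts_s": lambda name, label: name.startswith("s"),
    "label_has_sys": lambda name, label: "sys" in label,
    "label_has_med": lambda name, label: "med" in label,
}


def combined(name: str, label: str) -> bool:
    """One hand-picked Boolean combination of the rules (an assumption;
    logic regression would search for such a combination automatically)."""
    r = {key: rule(name, label) for key, rule in RULES.items()}
    return (r["label_has_sys"] or r["name_starts_s"]) and not r["label_has_med"]


def ppv_npv(predict: Rule, data: List[Tuple[str, str, bool]]) -> Tuple[float, float]:
    """Return (PPV, NPV) of a predictor against known true matches."""
    tp = fp = tn = fn = 0
    for name, label, truth in data:
        pred = predict(name, label)
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and truth:
            fn += 1
        else:
            tn += 1
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


if __name__ == "__main__":
    ppv, npv = ppv_npv(combined, SOURCE_VARS)
    print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

In this toy data the combination flags every true match plus one false candidate ("sex"), so no correct variable is missed while the user still has to discard one wrong suggestion. This mirrors the semi-automated use case described in the background, where a low PPV is tolerable as long as correct source variables are rarely missed.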

Funding

We thank Ingo Ruczinski, Charles Kooperberg, and Michael LeBlanc at the Fred Hutchinson Cancer Research Center in Seattle for providing the public license CRAN software package, and the related documentation. This manuscript was prepared using a limited access dataset of the Atherosclerosis Risk In Communities (ARIC) study, obtained from the National Heart, Lung and Blood Institute (NHLBI). The ARIC study is conducted and supported by NHLBI in collaboration with the ARIC Study investigators. This manuscript does not necessarily reflect the opinions or views of the ARIC study or the NHLBI. The Bruneck study was supported by the Pustertaler Verein zur Praevention von Herz- und Hirngefaesserkrankungen, Gesundheitsbezirk Bruneck, and the Assessorat fuer Gesundheit, Province of Bolzano, Italy. The Carotid Atherosclerosis Progression Study (CAPS) was supported by the Stiftung Deutsche Schlaganfall-Hilfe. The PLIC Study is supported by a grant from SISA Sezione Regionale Lombarda. This manuscript was prepared using data from the Cardiovascular Health Study (CHS). The research reported in this article was supported by contracts N01-HC-85079 through N01-HC-85086, N01-HC-35129, N01 HC-15103, N01 HC-55222, and U01 HL080295 from the National Heart, Lung, and Blood Institute, with additional contribution from the National Institute of Neurological Disorders and Stroke. A full list of participating CHS investigators and institutions can be found at http://www.chs-nhlbi.org. The EVA Study was organized under an agreement between INSERM and the Merck, Sharp, and Dohme-Chibret Company. The Edinburgh Artery Study (EAS) was funded by the British Heart Foundation. The IMPROVE study was supported by the European Commission (Contract number: QLG1- CT- 2002- 00896), Ministero della Salute Ricerca Corrente, Italy, the Swedish Heart-Lung Foundation, the Swedish Research Council (projects 8691 and 0593), the Foundation for Strategic Research, the Stockholm County Council (project 562183),

Citation

BMC Medical Informatics and Decision Making, 2017, 17 (1), pp. 40-?

Author affiliation

/Organisation/COLLEGE OF LIFE SCIENCES/School of Medicine/Department of Health Sciences

Version

VoR (Version of Record)

Published in

BMC Medical Informatics and Decision Making

Publisher

BioMed Central

eissn

1472-6947

Acceptance date

2017-03-23

Copyright date

2017

Available date

2018-08-07

Publisher DOI

https://doi.org/10.1186/s12911-017-0429-1

Publisher version

https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-017-0429-1

Language

en

Keywords

Data management; Epidemiology; Logic regression; Meta-analysis; Algorithms; Carotid Artery Diseases; Carotid Intima-Media Thickness; Data Mining; Databases, Factual; Epidemiologic Factors; Humans; Logistic Models; Medical Informatics Applications; Meta-Analysis as Topic; Predictive Value of Tests; Prognosis

Licence

CC BY 4.0
