%0 Journal Article
%A Lorenz, Matthias W.
%A Abdi, Negin Ashtiani
%A Scheckenbach, Frank
%A Pflug, Anja
%A Bülbül, Alpaslan
%A Catapano, Alberico L.
%A Agewall, Stefan
%A Ezhov, Marat
%A Bots, Michiel L.
%A Kiechl, Stefan
%A Orth, Andreas
%A group, PROG-IMT study
%D 2018
%T Automatic identification of variables in epidemiological datasets using logic regression
%U https://figshare.le.ac.uk/articles/journal_contribution/Automatic_identification_of_variables_in_epidemiological_datasets_using_logic_regression/10196336
%2 https://figshare.le.ac.uk/ndownloader/files/18373994
%K Data management
%K Epidemiology
%K Logic regression
%K Meta-analysis
%K Algorithms
%K Carotid Artery Diseases
%K Carotid Intima-Media Thickness
%K Data Mining
%K Databases, Factual
%K Epidemiologic Factors
%K Humans
%K Logistic Models
%K Medical Informatics Applications
%K Meta-Analysis as Topic
%K Predictive Value of Tests
%K Prognosis
%X BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows the creation of software that, for a target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of missing a correct source variable. METHODS: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated. RESULTS: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. In the construction sample, PPV was 50% or less in 63% of all variables; in the validation sample, this was the case in 71% of all variables. CONCLUSIONS: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
%I University of Leicester