LSC (Leicester Scientific Corpus)

2020-04-15T07:36:07Z (GMT) by Neslihan Suzen

April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk)

Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

[Version 2] Further cleaning has been applied to the LSC abstracts of Version 1* as part of Data Processing. Details of the cleaning procedure are explained in Step 6.

* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1.

Getting Started

This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for use in future work on the quantification of the meaning of research texts, and is made available for use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of_Categories.txt’.
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
6. Total Times Cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times Cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

The corpus was collected online in July 2018 and contains the number of citations from the publication date to July 2018. We define a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.
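As an illustration of the document structure described above, one record could be represented as follows. This is a minimal sketch in Python (the corpus itself was processed in R); the class name, the field names and the placeholder values are assumptions made for this example and are not the column names used in the LSC files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    """One LSC record. Field names are illustrative, not the corpus's own."""
    authors: List[str]
    title: str
    abstract: str
    categories: List[str]        # one or more WoS categories
    research_areas: List[str]    # one or more WoS research areas
    total_times_cited: int       # citations from all WoS platform databases
    times_cited_core: int        # citations within the WoS Core Collection


# Placeholder values only; this is not a real record from the corpus.
example = Document(
    authors=["A. Author", "B. Author"],
    title="A placeholder title",
    abstract="A placeholder abstract.",
    categories=["Placeholder Category"],
    research_areas=["Placeholder Research Area"],
    total_times_cited=3,
    times_cited_core=2,
)
```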

Data Processing

Step 1: Downloading of the Data Online

The dataset was collected manually by exporting documents online as tab-delimited files. All documents are available online.

Step 2: Importing the Dataset to R

The LSC was collected as TXT files. All documents were imported into R.

Step 3: Cleaning the Data from Documents with Empty Abstract or without Category

As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories are removed.
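The filter in this step can be sketched as follows. This is an illustrative Python version (the original processing was done in R); the dictionary keys `abstract` and `categories` and the sample records are hypothetical, and treating whitespace-only values as empty is an assumption.

```python
def has_abstract_and_category(doc):
    """Return True if the record has a non-empty abstract and at least one category."""
    abstract = (doc.get("abstract") or "").strip()
    # Ignore missing or whitespace-only category entries.
    categories = [c for c in (doc.get("categories") or []) if c and c.strip()]
    return bool(abstract) and bool(categories)


# Hypothetical records, for illustration only.
docs = [
    {"abstract": "Some text.", "categories": ["Category A"]},
    {"abstract": "", "categories": ["Category A"]},   # empty abstract
    {"abstract": "Some text.", "categories": []},     # no category
]
kept = [d for d in docs if has_abstract_and_category(d)]
```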

Step 4: Identification and Correction of Concatenated Words in Abstracts

‘Structured abstracts’ are used especially in medicine-related publications. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result and conclusion. The tool used for extracting abstracts concatenates each section heading with the first word of the section. For instance, we observe words such as ‘ConclusionHigher’ and ‘ConclusionsRT’. Such words were detected and identified by sampling medicine-related publications with human intervention. Detected concatenated words were split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.

The section headings in such abstracts are listed below:

Background        Method(s)         Design
Theoretical       Measurement(s)    Location
Aim(s)            Methodology       Process
Abstract          Population        Approach
Objective(s)      Purpose(s)        Subject(s)
Introduction      Implication(s)    Patient(s)
Procedure(s)      Hypothesis        Measure(s)
Setting(s)        Limitation(s)     Discussion
Conclusion(s)     Result(s)         Finding(s)
Material(s)       Rationale(s)
Implications for health and nursing policy
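The splitting of a section heading from the word fused to it can be sketched with a regular expression. This Python example is an assumption-laden illustration, not the procedure used for LSC, where concatenated words were identified by sampling with human intervention. The list `HEADINGS` covers only part of the headings above, and the rule "a heading at the start of a word, immediately followed by an upper-case letter" is a simplification chosen for the example.

```python
import re

# Subset of the section headings, written as regex fragments ("s?" = optional plural).
HEADINGS = [
    "Background", "Methodology", "Methods?", "Design", "Aims?",
    "Objectives?", "Purposes?", "Introduction", "Procedures?",
    "Settings?", "Limitations?", "Discussion", "Conclusions?",
    "Results?", "Findings?",
]

# A heading at a word start, directly followed by an upper-case letter.
FUSED_HEADING = re.compile(r"\b(" + "|".join(HEADINGS) + r")(?=[A-Z])")


def split_headings(text):
    """Insert a space between a section heading and the word fused to it."""
    return FUSED_HEADING.sub(r"\1 ", text)


fixed = split_headings("ConclusionHigher doses were effective.")
```

Here `fixed` is ‘Conclusion Higher doses were effective.’, and ‘ConclusionsRT’ becomes ‘Conclusions RT’. Ordinary words that merely start with a heading, such as ‘Methodology’ followed by a space, are left unchanged.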

Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts

After correction, the lengths of abstracts were calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as the Microsoft Word ‘word count’ [5].

According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words, in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
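The length filter can be sketched as follows. In this Python illustration, counting whitespace-separated tokens is used as an approximation of the Microsoft Word rule, and both limits are treated as inclusive; both choices are assumptions, as are the function names.

```python
MIN_WORDS = 30
MAX_WORDS = 500


def word_count(text):
    """Count whitespace-separated tokens (an approximation of Word's count)."""
    return len(text.split())


def within_length_limits(text):
    """Return True if the abstract has between 30 and 500 words, inclusive."""
    return MIN_WORDS <= word_count(text) <= MAX_WORDS


# Synthetic abstracts of known length, for illustration only.
short_abstract = " ".join(["word"] * 29)
typical_abstract = " ".join(["word"] * 200)
flags = [within_length_limits(short_abstract), within_length_limits(typical_abstract)]
```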

Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1

Conferences and journals can add a footer below the text of the abstract containing a copyright notice, permission policy, journal name, licence, authors’ rights or conference name. The tool used for extracting and processing abstracts in the WoS database attaches such footers to the text. For example, our casual observation shows that copyright notices such as ‘Published by Elsevier ltd.’ appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculation, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors’ rights, licences and permission policies identified by sampling of abstracts.
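Removing a trailing footer of this kind can be sketched with a small list of patterns. This Python example is only an illustration: the two patterns in `FOOTER_PATTERNS` are hypothetical and do not reproduce the list of notices identified by sampling for LSC, and the sketch assumes that the notice sits at the end of the abstract so that everything from the match onwards can be dropped.

```python
import re

# Hypothetical footer patterns; each matches from the notice to the end of the text.
FOOTER_PATTERNS = [
    re.compile(r"\s*\(C\)\s*\d{4}\b.*$", re.IGNORECASE | re.DOTALL),
    re.compile(r"\s*Published by Elsevier Ltd\.?.*$", re.IGNORECASE | re.DOTALL),
]


def strip_footer(abstract):
    """Remove a trailing copyright or publisher notice from an abstract."""
    for pattern in FOOTER_PATTERNS:
        abstract = pattern.sub("", abstract)
    return abstract.strip()


cleaned = strip_footer("We study a model. (C) 2014 Published by Elsevier Ltd.")
```

After this call, `cleaned` holds ‘We study a model.’. A real cleaning pass would need a much longer pattern list, checked against sampled abstracts to avoid deleting genuine text.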

Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts

The cleaning procedure described in the previous step left some abstracts with fewer words than our minimum length criterion (30 words). These 474 texts were removed.

Step 8: Saving the Dataset into CSV Format

Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields.

To access the LSC for research purposes, please email ns433@le.ac.uk.
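Reading records back from a set of CSV files can be sketched as follows. This Python example assumes that each file is UTF-8 encoded and starts with a header row, and that all files sit in one folder; none of these details is stated above, and the function name `read_corpus` is hypothetical.

```python
import csv
from pathlib import Path


def read_corpus(folder, pattern="*.csv"):
    """Yield one dict per record from every CSV file in the folder."""
    for path in sorted(Path(folder).glob(pattern)):
        # newline="" lets the csv module handle line breaks inside quoted fields.
        with open(path, newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)
```

Iterating over `read_corpus(folder)` then yields the records of all files in turn, with each record keyed by whatever column names the header row provides.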

References

[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] American Psychological Association, Publication Manual. Washington, DC: American Psychological Association, 1983.