LSC (Leicester Scientific Corpus)

2019-08-12T13:00:09Z (GMT) by Neslihan Suzen
The LSC (Leicester Scientific Corpus)

August 2019 by Neslihan Suzen, PhD student at the University of Leicester. Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

The data is extracted from the Web of Science® [1]. You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.

Getting Started
This text provides background information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for use in future work on the quantification of the sense of research texts. One of the goals of publishing the data is to make it available for further analysis and use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English.
The corpus was collected online in July 2018 and contains the number of citations from the publication date to July 2018.
Each document in the corpus contains the following parts:

1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of _Categories.txt’.
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
6. Total Times cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,824.

All documents in LSC have a nonempty abstract, title, categories, research areas and times cited in WoS databases. There are 119 documents with an empty author list; we did not exclude these documents.

Data Processing

This section describes the steps taken to collect and clean the LSC and to make it available to researchers. Processing the data consists of six main steps:

Step 1: Downloading of the Data Online

In this step the dataset is collected online. This is done manually by exporting documents as tab-delimited files. All downloaded documents are available online.

Step 2: Importing the Dataset to R

This is the process of converting the collection to RData format for processing the data. The LSC was collected as TXT files. All documents are extracted to R.
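The authors performed this step in R. Purely as an illustration of what importing a tab-delimited export involves, the following minimal Python sketch parses such a file into records. The function name and the two column names in the sample are hypothetical; the actual field names of the WoS export are not given in this text.

```python
import csv
import io

def read_tab_delimited(handle):
    """Parse a tab-delimited export into a list of dicts keyed by the header row."""
    # QUOTE_NONE: treat quote characters inside titles or abstracts as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(row) for row in reader]

# Hypothetical two-column sample; a real export would be opened with open(path, newline="").
sample = io.StringIO("title\tabstract\nA paper\tSome text\nAnother paper\t\n")
records = read_tab_delimited(sample)
print(len(records), "records read")
```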

Step 3: Cleaning the Data from Documents with Empty Abstract or without Category

Not all papers in the collection have an abstract and categories. As our research is based on the analysis of abstracts and categories, such incomplete documents were first detected and then removed. All documents with empty abstracts and all documents without categories are removed.
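The filtering rule in this step can be sketched as follows. This is a minimal Python illustration, not the authors' R code; the dictionary keys `abstract` and `categories` and the function name are assumptions made for the example.

```python
def has_abstract_and_category(doc):
    """Return True when the document has a nonempty abstract and at least one category."""
    abstract = (doc.get("abstract") or "").strip()
    # Ignore category entries that are empty or whitespace only.
    categories = [c for c in (doc.get("categories") or []) if c.strip()]
    return bool(abstract) and bool(categories)

# Hypothetical documents: B has a blank abstract, C has no category.
docs = [
    {"title": "A", "abstract": "Some text.", "categories": ["Physics"]},
    {"title": "B", "abstract": "   ", "categories": ["Physics"]},
    {"title": "C", "abstract": "Some text.", "categories": []},
]
kept = [d for d in docs if has_abstract_and_category(d)]
print([d["title"] for d in kept])
```

Only document A survives the filter in this example.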

Step 4: Identification and Correction of Concatenated Words in Abstracts

Traditionally, abstracts are written as an executive summary in one paragraph of continuous text, which is known as an ‘unstructured abstract’. However, some publications, especially medicine-related ones, use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc.

The tool used for extracting abstracts concatenates section headings with the first word of the section. As a result, some structured abstracts in the LSC require an additional correction step to split such concatenated words. For instance, we observe words such as ConclusionHigher and ConclusionsRT in the corpus. The detection and identification of concatenated words cannot be fully automated; human intervention is needed to identify possible section headings. We note that we only consider concatenated words involving section headings, as it is not possible to detect all concatenated words without deep knowledge of research areas. Such words are identified by sampling medicine-related publications. The section headings found in such abstracts are listed in List 1.

List 1 Headings of sections identified in structured abstracts

Background Method(s) Design
Theoretical Measurement(s) Location
Aim(s) Methodology Process
Abstract Population Approach
Objective(s) Purpose(s) Subject(s)
Introduction Implication(s) Patient(s)
Procedure(s) Hypothesis Measure(s)
Setting(s) Limitation(s) Discussion
Conclusion(s) Result(s) Finding(s)
Material(s) Rationale(s)
Implications for health and nursing policy

All words that include headings from List 1 are detected in the entire corpus and are then split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.
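One way to picture the splitting rule is the regular-expression sketch below. It is an assumption-laden Python illustration rather than the authors' procedure: it uses only a subset of List 1, treats a heading as concatenated when it is immediately followed by an uppercase letter, and does not handle headings written entirely in capitals. The names `HEADINGS`, `expand` and `split_headings` are hypothetical.

```python
import re

# A subset of List 1; "(s)" entries are expanded into singular and plural forms.
HEADINGS = ["Background", "Aim(s)", "Objective(s)", "Method(s)", "Methodology",
            "Result(s)", "Conclusion(s)", "Discussion"]

def expand(heading):
    """Turn 'Method(s)' into ['Method', 'Methods']; leave other headings unchanged."""
    if heading.endswith("(s)"):
        stem = heading[:-3]
        return [stem, stem + "s"]
    return [heading]

# Longest forms first, so plural forms are tried before their singular stems.
FORMS = sorted({f for h in HEADINGS for f in expand(h)}, key=len, reverse=True)

# A heading at a word boundary that is directly followed by an uppercase letter.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, FORMS)) + r")(?=[A-Z])")

def split_headings(text):
    """Insert a space between a section heading and the word glued to it."""
    return PATTERN.sub(r"\1 ", text)

print(split_headings("ConclusionHigher doses were effective."))
print(split_headings("ConclusionsRT improved outcomes."))
```

Ordinary uses such as "Methods were compared" are left unchanged because the heading is not followed directly by an uppercase letter.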

Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts

After the correction of concatenated words is completed, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as for Microsoft Word ‘word count’ [5].

According to the APA style manual [6], an abstract should contain between 150 and 250 words. However, word limits vary from journal to journal. For instance, the Journal of Vascular Surgery recommends that ‘Clinical and basic research studies must include a structured abstract of 400 words or less’ [7].
In LSC, the length of abstracts varies from 1 to 3805 words. We decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis. Documents with fewer than 30 or more than 500 words in the abstract are removed.
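The length filter can be sketched in Python as below. The word count here is approximated by splitting on whitespace, which is an assumption and may differ from the Microsoft Word rule [5] in edge cases; the constant and function names are hypothetical.

```python
MIN_WORDS, MAX_WORDS = 30, 500  # limits stated in the text, both inclusive

def word_count(text):
    """Approximate the word count by splitting on runs of whitespace."""
    return len(text.split())

def within_length_limits(abstract):
    """Keep abstracts with between 30 and 500 words."""
    return MIN_WORDS <= word_count(abstract) <= MAX_WORDS

# Hypothetical abstracts of 29, 30, 500 and 501 words.
samples = {n: " ".join(["word"] * n) for n in (29, 30, 500, 501)}
for n, text in samples.items():
    print(n, within_length_limits(text))
```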

Step 6: Saving the Dataset into CSV Format

Corrected and extracted documents are saved into 36 CSV files. The structure of the files is described in the following section.

The Structure of Fields in CSV Files

In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields.
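As a rough illustration of a one-record-per-line layout with separate fields, the Python sketch below writes a record to CSV and reads it back. The column names, their order, and the "; " separator used for list-valued fields are all assumptions; the actual layout of the LSC files is not specified in this text.

```python
import csv
import io

# Hypothetical column names and order.
FIELDS = ["authors", "title", "abstract", "categories", "research_areas",
          "total_times_cited", "times_cited_core"]
LIST_FIELDS = ("authors", "categories", "research_areas")

def write_records(records, handle, list_sep="; "):
    """Write one record per line, joining list-valued fields with list_sep."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for rec in records:
        row = dict(rec)
        for key in LIST_FIELDS:
            row[key] = list_sep.join(rec[key])
        writer.writerow(row)

record = {
    "authors": ["Author One", "Author Two"],
    "title": "A paper, with a comma",
    "abstract": "Some abstract text.",
    "categories": ["Category A"],
    "research_areas": ["Area A", "Area B"],
    "total_times_cited": 3,
    "times_cited_core": 2,
}

buffer = io.StringIO()
write_records([record], buffer)
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["title"], "|", rows[0]["authors"].split("; "))
```

Fields that contain commas are quoted automatically by the `csv` module, so each record still occupies its own fields when read back.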

To access the LSC for research purposes, please email to


[1] Web of Science. (15 July). Available:
[2] WoS Subject Categories. Available:
[3] Research Areas in WoS. Available:
[4] Times Cited in WoS Core Collection. (15 July). Available:
[5] Word Count. Available:
[6] American Psychological Association, Publication Manual. Washington, DC: American Psychological Association, 1983.
[7] P. Gloviczki and P. F. Lawrence, "Information for authors," Journal of Vascular Surgery, vol. 65, no. 1, pp. A16-A22, 2017.