LSC (Leicester Scientific Corpus)
Neslihan Suzen
10.25392/leicester.data.9449639.v2
https://figshare.le.ac.uk/articles/dataset/LSC_Leicester_Scientific_Corpus_/9449639
<p>The LSC (Leicester Scientific Corpus)</p>
<p>April 2020 by Neslihan Suzen, PhD student at the University
of Leicester (ns433@leicester.ac.uk) </p><p>Supervised by Prof Alexander Gorban and Dr
Evgeny Mirkes</p><p>The data are extracted from the Web of Science [1]. You may
not copy or distribute these data in whole or in part without the written
consent of Clarivate Analytics.</p><p>[Version 2] Further
cleaning has been applied to the LSC abstracts of Version 1*. Details
of the cleaning procedure are given in Step 6.</p><p>* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus).
figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1.</p><p>Getting Started</p><p>This text provides information on the LSC (Leicester
Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the
structure of the files that organise the corpus. The corpus was created for use in
future work on quantifying the meaning of research texts, and to be made
available for Natural Language Processing projects.</p><p>LSC is a collection of abstracts of articles and proceedings
papers published in 2014 and indexed by the Web of Science (WoS) database [1].
The corpus contains only documents in English. Each document in the corpus
contains the following parts:</p><p>1. Authors: The list of authors of the paper</p><p>2. Title: The title of the paper</p><p>
3. Abstract: The abstract of the paper<br>
4. Categories: One or more categories from the list of categories [2]. The full list
of categories is presented in the file ‘List_of _Categories.txt’.<br>
5. Research Areas: One or more research areas from the list of research areas
[3]. The full list of research areas is presented in the file
‘List_of_Research_Areas.txt’.<br>
6. Total Times cited: The number of times the paper was cited by other items
from all databases within the Web of Science platform [4]<br>
7. Times cited in Core Collection: The total number of times the paper was
cited by other papers within the WoS Core Collection [4]</p><p>The corpus was collected online in July 2018 and contains
the number of citations from the publication date to July 2018. We define a
document as the collection of information (about a paper) listed above. The
total number of documents in LSC is 1,673,350.</p><p>Data Processing</p><p>Step 1: Downloading the Data Online</p>
<p>The dataset was collected manually by exporting documents online as
tab-delimited files. All documents are available online.</p><p>Step 2: Importing the Dataset to R</p>
<p>The LSC was collected as TXT files. All documents were
imported into R.</p><p>Step 3: Removing Documents with an Empty Abstract
or without a Category</p><p>As our research is based on the analysis of abstracts and
categories, all documents with empty abstracts and all documents without categories
were removed.</p><p>Step 4: Identification and Correction of Concatenated Words
in Abstracts</p><p>Medicine-related publications in particular use ‘structured
abstracts’. Such abstracts are divided into sections with distinct
headings such as introduction, aim, objective, method, result and conclusion. The
tool used for extracting abstracts concatenates each section heading with
the first word of its section. For instance, we observe words such as
‘ConclusionHigher’ and ‘ConclusionsRT’. Such words were detected and
identified by sampling medicine-related publications with human
intervention. Each detected concatenated word
is split into two words. For instance, the word ‘ConclusionHigher’ is split
into ‘Conclusion’ and ‘Higher’.</p><p>The section headings in such abstracts are listed below:</p>
<p>Background Method(s) Design<br>
Theoretical Measurement(s) Location<br>
Aim(s) Methodology Process<br>
Abstract Population Approach<br>
Objective(s) Purpose(s) Subject(s)<br>
Introduction Implication(s) Patient(s)<br>
Procedure(s) Hypothesis Measure(s)<br>
Setting(s) Limitation(s) Discussion<br>
Conclusion(s) Result(s) Finding(s)<br>
Material(s) Rationale(s)<br>
Implications for health and nursing policy</p><p>Step 5: Extracting (Sub-setting) the Data Based on Lengths
of Abstracts</p><p>After correction, the lengths of the abstracts are calculated.
‘Length’ indicates the total number of words in the text, calculated by the
same rule as the Microsoft Word ‘word count’ [5].</p><p>According to the APA style manual [6], an abstract should
contain between 150 and 250 words. In LSC, we decided to limit the length of
abstracts to between 30 and 500 words in order to study documents with abstracts of
typical length and to avoid the effect of length on the analysis.</p>
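<p>As an illustration only, Steps 4 and 5 can be sketched in Python. The corpus itself was processed in R, and this is not the authors' code. The names HEADINGS, split_heading_words, word_count and within_length_limits are hypothetical. The heading patterns are a subset of the list in Step 4. Word counting is approximated by whitespace-separated tokens, which may differ from the Microsoft Word rule [5] in edge cases, and the 30 and 500 word bounds are assumed to be inclusive.</p>

```python
import re

# Subset of the section headings listed in Step 4 (illustrative, not complete).
# The optional "s" covers singular and plural forms, e.g. Conclusion(s).
HEADINGS = [
    "Background", "Methodology", "Methods?", "Design", "Aims?", "Objectives?",
    "Purposes?", "Introduction", "Settings?", "Discussion", "Conclusions?",
    "Results?", "Findings?",
]

# A heading at the start of a word, immediately followed by an uppercase
# letter, is treated as a heading fused with the first word of its section.
HEADING_RE = re.compile(r"\b(" + "|".join(HEADINGS) + r")(?=[A-Z])")


def split_heading_words(text: str) -> str:
    """Insert a space between a section heading and the word fused to it."""
    return HEADING_RE.sub(r"\1 ", text)


def word_count(text: str) -> int:
    """Approximate word count: number of whitespace-separated tokens."""
    return len(text.split())


def within_length_limits(text: str, min_words: int = 30, max_words: int = 500) -> bool:
    """True if the abstract length lies within the assumed inclusive bounds."""
    return min_words <= word_count(text) <= max_words


abstract = "BackgroundThe aim was clear. ConclusionHigher doses helped."
cleaned = split_heading_words(abstract)
print(cleaned)  # Background The aim was clear. Conclusion Higher doses helped.
print(within_length_limits(cleaned))  # False: only 9 words
```

<p>A pattern-based split of this kind can also break legitimate tokens that begin with a heading word followed by a capital letter, which is one reason manual inspection of samples, as described in Step 4, remains useful.</p>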
<p>Step 6: [Version 2] Cleaning Copyright Notices, Permission
Policies, Journal Names and Conference Names from LSC Abstracts in Version 1</p><p>Conferences and journals can add a footer containing a copyright notice,
permission policy, journal name, licence, authors’ rights or conference name
below the text of the abstract. The tool used for
extracting and processing abstracts in the WoS database attaches such
footers to the text. For example, our casual observation shows that copyright
notices such as ‘Published by Elsevier ltd.’ appear in many texts. To avoid
abnormal occurrences of words in further analysis, such as bias in
frequency calculations, we performed a cleaning procedure on such sentences and
phrases in the abstracts of LSC Version 1. We removed the copyright notices, names of
conferences, names of journals, authors’ rights, licences and permission
policies identified by sampling abstracts.</p><p>Step 7: [Version 2] Re-extracting (Sub-setting) the Data
Based on Lengths of Abstracts</p><p>The cleaning procedure described in the previous step left
some abstracts shorter than our minimum length criterion (30 words). As a result, 474
texts were removed.</p><p>Step 8: Saving the Dataset into CSV Format</p><p>Documents are saved in 34 CSV files. In the CSV files, the information is organised
with one record on each line, and the abstract, title, list of authors,
list of categories, list of research areas and times cited are recorded in fields.</p><p>To access the LSC for research purposes, please email
ns433@le.ac.uk.</p><p>References</p><p>[1] Web of Science. (15 July). Available:
https://apps.webofknowledge.com/</p><p>
[2] WoS Subject Categories. Available: <a href="https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html">https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html</a><br>
[3] Research Areas in WoS. Available: <a href="https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html">https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html</a><br>
[4] Times Cited in WoS Core Collection. (15 July). Available: <a href="https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US">https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US</a><br>
[5] Word Count. Available: <a href="https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3">https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3</a><br>
[6] American Psychological Association, Publication Manual. Washington, DC:
American Psychological Association, 1983.</p>
2020-04-15 07:36:07
Natural language processing
Text mining
Corpus
Data mining
Information extraction
Artificial intelligence
Machine Learning
R programming
text data