LScDC (Leicester Scientific Dictionary-Core)

Version 3 2020-04-15, 15:07

Version 2 2019-10-17, 12:33

Version 1 2019-09-25, 10:26

dataset

posted on 2020-04-15, 15:07, authored by Neslihan Suzen

The LScDC (Leicester Scientific Dictionary-Core)

April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)

Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

[Version 3] The third version of the LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary) - Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below; the explanation is not repeated here. The files provided with this description are also the same as those described for LScDC Version 2. The numbers of words in the third versions of the LScD and LScDC are summarized below.

Dictionary      Number of words
LScD (v3)       972,060
LScDC (v3)      103,998

* Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3

** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2

[Version 2]

Getting Started

This file describes a sorted and cleaned list of words from the LScD (Leicester Scientific Dictionary) and explains the steps for sub-setting the LScD, together with basic statistics of words in the LSC (Leicester Scientific Corpus). The LSC and the LScD can be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing the words, and is available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary is created to be used in future work on the quantification of the sense of research texts.

The objective of sub-setting the LScD is to discard words that appear too rarely in the corpus. In text mining, a very large volume of text data poses a challenge to both the performance and the accuracy of algorithms. Both depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Rare words are not useful for discriminating between texts in large corpora, as they are likely to be non-informative signals (or noise) and redundant in the collection of texts. Selecting relevant words also opens the possibility of more effective and faster operation of text mining algorithms.

To build the LScDC, we applied the following process to the LScD: removing words that appear in no more than 10 documents (<=10). Such words contribute little to the discrimination of texts, as they appear in less than 0.01% of documents. Ignoring these words has the advantage of reducing the number of words for applications of text mining algorithms.
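The filtering rule above can be sketched in a few lines. This is a minimal Python illustration, not the author's R script; the toy words, their counts, and the names `MIN_DOCS` and `build_core` are hypothetical and chosen only for this example.

```python
# Toy stand-in for the LScD: word -> number of documents containing it.
# The words and counts are invented for illustration only.
doc_counts = {
    "model": 1250,
    "cell": 830,
    "quasiperiod": 11,
    "xylograph": 10,
    "zymurg": 3,
}

MIN_DOCS = 10  # words appearing in <= MIN_DOCS documents are removed


def build_core(counts, min_docs=MIN_DOCS):
    """Keep only words that appear in more than `min_docs` documents."""
    return {word: n for word, n in counts.items() if n > min_docs}


core = build_core(doc_counts)
removed = sorted(set(doc_counts) - set(core))
print(sorted(core))  # words kept in the core dictionary
print(removed)       # words discarded as too rare
```

Note that the boundary case matters: a word found in exactly 10 documents is removed, while a word found in 11 documents is kept.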

There are 974,238 words in the LScD and 104,223 words in the LScDC. That is, 870,015 words, around 89% of the LScD, are removed. After removing these words, we re-check the number of words in each document to confirm that all abstracts have at least 3 words. Note that at this stage “the number of words in an abstract” does not indicate the length of the abstract but the number of unique content words from the LScDC. After removing the 870,015 words from the pre-processed abstracts, all documents have at least 3 unique words. No documents are removed at this stage.
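The figures quoted in this paragraph can be checked with a short calculation. The Python snippet below uses only the word counts stated in this description; the variable names are illustrative.

```python
# Word counts reported in this description (Version 2).
lscd_words = 974_238   # words in the LScD
lscdc_words = 104_223  # words in the LScDC

removed = lscd_words - lscdc_words
removed_share = removed / lscd_words

print(removed)                 # number of words removed
print(f"{removed_share:.1%}")  # share of the LScD that was removed
```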

Instructions for the R code that builds the core dictionary from the LScD can be found in later sections of this text and in [4]. The code also produces the new DTM (Document Term Matrix) of the corpus with the cleaned dictionary.

Processing the LScD and Creating the LScDC

This section describes the steps for creating the LScDC from the LScD.

Step 1: Downloading the LScD and DTM online

The LScD and DTM (Document Term Matrix) are freely available in [2], and can be downloaded to any working directory for processing.

Access to the LSC is granted on request: the link is provided by email once the request is accepted. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

Step 2: Importing the LScD to R

The LScD.RData (or LScD.csv) and DTM.RData files are loaded into R.

Step 3: Removing words that appear in no more than 10 (<=10) documents

In this step, words that appear in no more than 10 (<=10) documents in the LSC are removed from the LScD and the DTM. After removal, a re-check is needed to make sure that each document has at least 1 word. Each text in the LSC contains at least 3 unique words from the LScDC.
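To illustrate how this step acts on the DTM, the following Python sketch drops rare columns from a toy matrix and then re-checks every document. It is not the published R code: the matrix values are invented, plain lists stand in for the real matrix, and the threshold is lowered from 10 to 1 so that the example stays small.

```python
# Toy document-term matrix: rows are documents, columns are terms.
# Values are invented for illustration only.
terms = ["model", "cell", "data", "rareword"]
dtm = [
    [2, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 3, 1, 0],
]
MIN_DOCS = 1  # the LScDC uses 10; kept terms appear in more than MIN_DOCS documents

# Document frequency of each term: number of rows with a non-zero entry.
doc_freq = [sum(1 for row in dtm if row[j] > 0) for j in range(len(terms))]

# Indices of the columns (terms) that survive the filter.
keep = [j for j, df in enumerate(doc_freq) if df > MIN_DOCS]
core_terms = [terms[j] for j in keep]
core_dtm = [[row[j] for j in keep] for row in dtm]

# Re-check: number of unique core words remaining in each document.
unique_per_doc = [sum(1 for value in row if value > 0) for row in core_dtm]
empty_docs = [i for i, n in enumerate(unique_per_doc) if n == 0]

print(core_terms)      # terms kept in the cleaned matrix
print(unique_per_doc)  # unique core words per document
print(empty_docs)      # documents left without any word (should be empty)
```

The re-check uses the number of non-zero entries per row, so it counts unique words in a document rather than the document's length.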

Step 4: Writing the LScDC into CSV format

The sub-list of the LScD and the cleaned DTM are written in RData format to “LScDC.RData” and “DTM_LScDC.RData”. The sub-list is also written as CSV to the “LScDC.csv” file. The detailed structure of the files is described in the following section.

The Organisation of the LScDC

After removing words from the LScD, there are 104,223 unique words written in the file “LScDC.csv”. In the CSV file (and the RData file), the number of documents containing the word and the number of occurrences of the word in the corpus are recorded on each line in separate fields, in the same way as for “LScD.csv” [2].
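As a rough picture of this layout, the Python sketch below writes three toy records with the three fields described above and reads them back. The header names (`word`, `n_documents`, `n_occurrences`) and the descending sort order are assumptions of this example; the actual column names and ordering direction in “LScDC.csv” should be checked against the file itself.

```python
import csv
import io

# Toy rows: (word, number of documents containing it, occurrences in corpus).
rows = [("cell", 830, 2100), ("model", 1250, 3400), ("quasiperiod", 11, 15)]

# Order by document count; descending order is an assumption of this sketch.
rows.sort(key=lambda r: r[1], reverse=True)

buffer = io.StringIO()  # in-memory stand-in for a file such as LScDC.csv
writer = csv.writer(buffer)
writer.writerow(["word", "n_documents", "n_occurrences"])  # hypothetical header
writer.writerows(rows)

# Read the data back as dictionaries keyed by the header names.
buffer.seek(0)
records = list(csv.DictReader(buffer))
print([record["word"] for record in records])
```

Values read back from CSV are strings, so the counts need converting with `int()` before any numerical use.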

Instructions for R code: Building the LScDC from the LScD

This section presents the usage of “LScDC_Creation.R” to build a sub-set of the LScD by removing words appearing in no more than 10 (<=10) documents. The code can also be used for lists of words from other sources; in this case, amendments to the code may be required. The code and detailed instructions for it are published in [4].

LScDC_Creation.R is an R script for sub-setting the LScD to create an ordered list of words that appear in more than 10 documents. All outputs of the code are saved as RData files. The list of words, with the number of documents containing each word, is also saved in CSV format. The outputs of the code are:

DTM_LScDC: DTM_LScDC is the Document Term Matrix constructed from the LSC with the cleaned list of words. In the DTM, rows correspond to documents in the collection and columns correspond to terms (words). Each entry of the matrix is the number of times the word occurs in the corresponding document.

LScDC: This file contains an ordered sub-list of words from the LScD. It also contains the number of documents containing each word and the number of occurrences of the word in the corpus, in separate fields.
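The relationship between the two outputs can be seen on a small example. The Python sketch below builds a document-term matrix from three invented token lists and derives the two per-word statistics from its columns. It illustrates the data structure only and makes no claim about how the R script constructs the matrix.

```python
from collections import Counter

# Toy pre-processed documents (lists of words); invented for illustration.
docs = [
    ["cell", "model", "cell"],
    ["model", "data"],
    ["data", "data", "cell"],
]

# Columns: the vocabulary in a fixed (here alphabetical) order.
vocab = sorted({word for doc in docs for word in doc})


def count_row(doc):
    """One matrix row: occurrences of each vocabulary word in `doc`."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]


# Rows: one per document.
dtm = [count_row(doc) for doc in docs]

# The two statistics stored in the dictionary can be read off the columns.
n_documents = {w: sum(1 for row in dtm if row[j] > 0) for j, w in enumerate(vocab)}
n_occurrences = {w: sum(row[j] for row in dtm) for j, w in enumerate(vocab)}

print(vocab)
print(dtm)
print(n_documents)
print(n_occurrences)
```

Here "cell" occurs 3 times in total but in only 2 documents, which shows why the two fields are recorded separately.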

The published archive contains the following files:

1. LScDC.csv (dictionary with columns: stemmed word, the number of documents containing the word and the number of occurrences of the word in the corpus; words are ordered by the number of documents containing the word)

2. LScDC.RData (the same as LScDC.csv but in RData format)

3. README (LScDC).docx (description of LScDC and forming procedures)

4. README (LScDC).pdf (description of LScDC and forming procedures)

5. DTM_LScDC.RData (Document-Term Matrix for LScDC)

The code can be used in the following way:

1. Download the files “LScD.RData” and “DTM.RData”

2. Open the “LScDC_Creation.R” script

3. Change the parameters in the script: replace the path with the full path of the directory containing the source files

4. Run the full code.

References

[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1

[2] N. Suzen. (2019). LScD (Leicester Scientific Dictionary) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9746900.v1

[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/

[4] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION