LScDC (Leicester Scientific Dictionary-Core)

2019-09-25T10:26:51Z (GMT) by Neslihan Suzen
The LScDC (Leicester Scientific Dictionary-Core)

September 2019 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

Getting Started

This file describes a sorted and cleaned list of words from the LScD (Leicester Scientific Dictionary) and explains the steps for sub-setting the LScD, together with basic statistics of words in the LSC (Leicester Scientific Corpus). The LSC and the LScD are available in [1] and [2], respectively. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing each word, and is available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. The dictionary was created for use by Neslihan Suzen in her PhD project on text mining.

The objective of sub-setting the LScD is to discard words that appear too rarely in the corpus. In text mining, a very large volume of text data poses challenges for both the performance and the accuracy of algorithms. The performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and on the number of words in the corpus. Rare words are of little use for discriminating texts in large corpora, as they are likely to be non-informative signals (noise) and redundant in the collection of texts. Selecting relevant words also makes it possible for text mining algorithms to operate faster and more effectively.
To build the LScDC, we applied the following rule to the LScD: remove words that appear in no more than 10 documents (≤10). Such words contribute little to the discrimination of texts, as they appear in less than 0.01% of documents. Ignoring them has the advantage of reducing the number of words passed to text mining algorithms.
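The rule above can be sketched in base R on a toy dictionary. This is a minimal illustration, not the published script: the column names (word, n_docs, n_occurrences), the toy counts and the descending sort order are assumptions made for the example.

```r
# Toy dictionary: one row per word with its document count.
# Column names and values are illustrative assumptions, not the real LScD.
dict <- data.frame(
  word = c("cell", "model", "zygomorphic", "data", "xylotomy"),
  n_docs = c(120000L, 95000L, 7L, 210000L, 10L),
  n_occurrences = c(340000L, 180000L, 9L, 520000L, 14L),
  stringsAsFactors = FALSE
)

min_docs <- 10L  # words appearing in <= 10 documents are dropped

# Keep only words found in more than min_docs documents
core <- dict[dict$n_docs > min_docs, ]

# Order by the number of documents containing the word
# (descending order is assumed here)
core <- core[order(core$n_docs, decreasing = TRUE), ]
rownames(core) <- NULL

print(core)
```

Note that a word appearing in exactly 10 documents is removed, since the threshold is inclusive.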

There are 974,238 words in the LScD and 104,223 words in the LScDC; 870,015 words, around 89% of the LScD, are removed. After removing these words, we re-check the number of words in each document to confirm that all abstracts have at least 3 words. Note that at this stage “the number of words in an abstract” means the number of unique content words from the LScDC, not the length of the abstract. After removing the 870,015 words from the pre-processed abstracts, all documents still have at least 3 unique words, so no documents are removed at this stage.
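The counts quoted above can be checked with a few lines of R. The variable names are chosen for this illustration only.

```r
n_lscd  <- 974238   # words in the LScD
n_lscdc <- 104223   # words in the LScDC

n_removed <- n_lscd - n_lscdc              # words discarded
share_removed <- 100 * n_removed / n_lscd  # percentage of the LScD

cat("Words removed:", n_removed, "\n")
cat("Share removed (%):", round(share_removed, 1), "\n")
```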

Instructions for the R code that builds the core dictionary from the LScD can be found in later sections of this text and in [4]. The code also produces the new DTM (Document Term Matrix) of the corpus with the cleaned dictionary.

Processing the LScD and Creating the LScDC

This section describes the steps for creating the LScDC from the LScD.

Step 1: Downloading the LScD and DTM online

The LScD and DTM (Document Term Matrix) are freely available from [2] and can be downloaded to any working directory for processing.

Use of the LSC is subject to approval of a request for the link, made by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data is extracted from Web of Science [3]. You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.

Step 2: Importing the LScD to R

The files LScD.RData (or LScD.csv) and DTM.RData are loaded into R.
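A minimal sketch of loading an RData file follows. The names of the objects stored inside LScD.RData and DTM.RData are not stated here, so the example loads into a separate environment and lists what the file contains. A small stand-in file is created first so that the sketch runs without the real data; the toy object and file name are assumptions.

```r
# Stand-in for a downloaded file: save a toy dictionary to a temporary
# .RData file so the example runs without the real data.
toy_dict <- data.frame(word = c("cell", "model"), n_docs = c(5L, 3L))
path <- file.path(tempdir(), "LScD_toy.RData")
save(toy_dict, file = path)

# Load into a separate environment to see which objects the file holds
# without overwriting anything in the current workspace.
env <- new.env()
loaded <- load(path, envir = env)   # load() returns the object names
print(loaded)

# Retrieve the first stored object by name and inspect its structure
dict <- get(loaded[1], envir = env)
str(dict)
```

For the real files, `path` would point to the downloaded LScD.RData or DTM.RData; the CSV version can be read with `read.csv()` instead.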

Step 3: Removing words that appear in no more than 10 (<=10) documents

In this step, words that appear in no more than 10 (<=10) documents of the LSC are removed from both the LScD and the DTM. After removal, the documents are re-checked to make sure that each still has at least 1 word. Each text in the LSC contains at least 3 unique words from the LScDC.
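The removal and the re-check can be sketched on a small dense matrix. This is an illustration under assumptions: the real DTM may be stored in a sparse format, the toy threshold is 1 rather than 10 so that the small example has something to remove, and all names are hypothetical.

```r
# Toy DTM: 4 documents (rows) x 5 words (columns); entries are counts.
dtm <- matrix(
  c(2, 0, 1, 0, 1,
    1, 1, 0, 0, 3,
    0, 0, 1, 1, 1,
    1, 0, 2, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("cell", "rare1", "model", "rare2", "data"))
)

min_docs <- 1L  # toy threshold; the LScDC uses 10

# Document frequency: number of documents in which each word occurs
doc_freq <- colSums(dtm > 0)

# Drop the columns of words that occur in <= min_docs documents
keep <- doc_freq > min_docs
dtm_core <- dtm[, keep, drop = FALSE]

# Re-check: number of unique remaining words in each document
words_per_doc <- rowSums(dtm_core > 0)
stopifnot(all(words_per_doc >= 1))
print(words_per_doc)
```

Counting `dtm > 0` rather than summing the entries gives the number of documents containing a word, which is the quantity the threshold applies to.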

Step 4: Writing the LScDC into CSV format

The sub-list of the LScD and the cleaned DTM are written in RData format to “LScDC.RData” and “DTM_LScDC.RData”. The sub-list is also written as CSV to the file “LScDC.csv”. The detailed structure of the files is described in the following section.
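Writing the outputs can be sketched as follows. The data frame contents and column names are assumptions for illustration, and a temporary directory is used in place of a real working directory.

```r
# Toy core dictionary; column names and values are illustrative assumptions.
LScDC <- data.frame(
  word = c("data", "cell", "model"),
  n_docs = c(210000L, 120000L, 95000L),
  n_occurrences = c(520000L, 340000L, 180000L),
  stringsAsFactors = FALSE
)

out_dir <- tempdir()  # replace with the working directory of choice
csv_path <- file.path(out_dir, "LScDC.csv")
rdata_path <- file.path(out_dir, "LScDC.RData")

# Write the dictionary as CSV (without row names) and as RData
write.csv(LScDC, csv_path, row.names = FALSE)
save(LScDC, file = rdata_path)

# Read the CSV back to confirm the round trip
check <- read.csv(csv_path, stringsAsFactors = FALSE)
print(check)
```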

The Organisation of the LScDC

After removing words from the LScD, 104,223 unique words remain and are written to the file “LScDC.csv”. In the CSV file (and the RData file), the number of documents containing the word and the number of appearances of the word in the corpus are recorded on each line in separate fields, in the same way as for “LScD.csv” [2].

Instructions for R code: Building the LScDC from the LScD

This section presents the usage of “LScDC_Creation.R” to build a sub-set of the LScD by removing words that appear in no more than 10 (<=10) documents. The code can also be used for lists of words from other sources, although amendments to the code may then be required. The code and detailed instructions for it are published in [4].

LScDC_Creation.R is an R script for sub-setting the LScD to create an ordered list of words, each of which appears in more than 10 documents. All outputs of the code are saved as RData files. The list of words, with the corresponding number of documents containing each word, is also saved in CSV format. The outputs of the code are:

DTM_LScDC: The Document Term Matrix constructed from the LSC with the cleaned list of words. In the DTM, rows correspond to documents in the collection and columns correspond to terms (words). Each entry of the matrix is the number of times the word occurs in the corresponding document.

LScDC: This file contains an ordered sub-list of words from the LScD. It also contains, in separate fields, the number of documents containing each word and the number of appearances of the word in the corpus.
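The DTM layout described above (documents in rows, terms in columns, counts as entries) can be illustrated with a toy example in base R. The documents are assumed to be already tokenised, and the dense matrix is used only for readability; the published DTM may use a different storage format.

```r
# Toy, already tokenised documents (illustrative data only)
docs <- list(
  doc1 = c("cell", "model", "cell"),
  doc2 = c("data", "model"),
  doc3 = c("data", "data", "cell")
)

# Vocabulary: all distinct words, in alphabetical order
vocab <- sort(unique(unlist(docs)))

# Count each vocabulary word in each document; vapply returns a
# terms x documents matrix, so transpose to documents x terms.
dtm <- t(vapply(docs, function(tokens) {
  as.integer(table(factor(tokens, levels = vocab)))
}, integer(length(vocab))))
colnames(dtm) <- vocab

print(dtm)
```

Here the entry in row doc1 and column cell is the number of times "cell" occurs in the first document.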

The published archive contains the following files:
1. LScDC.csv (dictionary with columns: stemmed word, the number of documents containing the word and the number of appearances of the word in the corpus; words are ordered by the number of documents containing the word)
2. LScDC.RData (the same as LScDC.csv but as an R data file)
3. README (LScDC).docx (description of LScDC and forming procedures)
4. README (LScDC).pdf (description of LScDC and forming procedures)
5. DTM_LScDC.RData (Document-Term Matrix for LScDC)

The code can be used in the following way:

1. Download the files “LScD.RData” and “DTM.RData”

2. Open the “LScDC_Creation.R” script

3. Change the parameters in the script: set the directory path to the full path of the directory containing the source files

4. Run the full code

References

[1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1

[2] N. Suzen. (2019). LScD (Leicester Scientific Dictionary). Available: https://doi.org/10.25392/leicester.data.9746900.v1

[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/

[4] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION