Dr.Manish Kumar Jain: Text Mining

Step 1 : Install and load the required packages

Run the R code below to install and load the required packages:

# Install

install.packages("tm") # for text mining

install.packages("SnowballC") # for text stemming

install.packages("wordcloud") # word-cloud generator

install.packages("RColorBrewer") # color palettes

# Load

library("tm")

library("SnowballC")

library("wordcloud")

library("RColorBrewer")

Step 2 : Text mining

1. Load the text

# Read the text file from internet

filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"

text <- readLines(filePath)

2. Load the data as a corpus

# Load the data as a corpus

docs <- Corpus(VectorSource(text))

3. Inspect the content of the document

inspect(docs)

Text transformation

Transformations are performed with the tm_map() function, for example to replace special characters in the text.

Replace “/”, “@” and “|” with a space:

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

docs <- tm_map(docs, toSpace, "/")

docs <- tm_map(docs, toSpace, "@")

docs <- tm_map(docs, toSpace, "\\|")
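To see what this replacement does without building a corpus, the same gsub() call can be tried on a plain character string. This is only a sketch: sample_line is a made-up string (not part of the speech) and to_space is a hypothetical stand-in for the function wrapped by content_transformer() above.

```r
# Hypothetical sample line, not taken from the speech
sample_line <- "contact: king@example.org / press|media"

# Same replacement that toSpace performs, applied to a plain character vector
to_space <- function(x, pattern) gsub(pattern, " ", x)

step1 <- to_space(sample_line, "/")
step2 <- to_space(step1, "@")
# "|" means "or" in a regular expression, so it is escaped as "\\|"
step3 <- to_space(step2, "\\|")

print(step3)
```

Each matched character is replaced by a single space, so runs of spaces can appear; they are removed later by stripWhitespace.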

Cleaning the text

The tm_map() function is also used to convert the text to lower case, to remove numbers, punctuation and common stopwords such as “the” and “we”, and to strip unnecessary white space.

Stopwords carry almost no information because they are so common in a language, so it is useful to remove them before further analysis. For stopwords, the supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish. Language names are case sensitive.

# Convert the text to lower case

docs <- tm_map(docs, content_transformer(tolower))

# Remove numbers

docs <- tm_map(docs, removeNumbers)

# Remove english common stopwords

docs <- tm_map(docs, removeWords, stopwords("english"))

# Remove your own stop word

# specify your stopwords as a character vector

docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))

# Remove punctuations

docs <- tm_map(docs, removePunctuation)

# Eliminate extra white spaces

docs <- tm_map(docs, stripWhitespace)

# Text stemming

# docs <- tm_map(docs, stemDocument)
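The effect of these cleaning steps can be imitated on a single string with base R. This is a rough sketch, not how the tm package implements its transformations: sample_line and the short my_stopwords list are invented for illustration, and the final trimws() call is an extra step added here to tidy the ends of the string.

```r
# Hypothetical input and a hypothetical, very short stopword list
sample_line  <- "In 1963, We had a DREAM; the dream of the nation!"
my_stopwords <- c("the", "we", "in", "a", "of", "had")

x <- tolower(sample_line)                       # lower case
x <- gsub("[[:digit:]]+", "", x)                # remove numbers

# Remove stopwords as whole words only (\\b marks a word boundary)
stop_pattern <- paste0("\\b(", paste(my_stopwords, collapse = "|"), ")\\b")
x <- gsub(stop_pattern, "", x, perl = TRUE)

x <- gsub("[[:punct:]]+", "", x)                # remove punctuation
x <- trimws(gsub("\\s+", " ", x))               # collapse extra white space

print(x)
```

Only the content words are left, which is the form the term-document matrix is built from.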

Step 4 : Build a term-document matrix

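A term-document matrix is a table containing the frequency of the words. With TermDocumentMatrix(), row names are words (terms) and column names are documents, which is why rowSums() is used below to get the total frequency of each word. The function TermDocumentMatrix() from the tm text mining package can be used as follows: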

dtm <- TermDocumentMatrix(docs)

m <- as.matrix(dtm)

v <- sort(rowSums(m),decreasing=TRUE)

d <- data.frame(word = names(v),freq=v)

head(d, 10)

             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7
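To make the structure of the matrix concrete, here is a small base R sketch that builds a term-by-document count matrix by hand and then totals each row, as the code above does with rowSums(). The three short "documents" in docs_small are invented, and the splitting on spaces is a simplification of what TermDocumentMatrix() does.

```r
# Three hypothetical, already cleaned documents
docs_small <- c("let freedom ring", "freedom will ring", "let freedom")

# Split each document into words and collect the vocabulary
tokens <- strsplit(docs_small, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# One column per document, one row per term, cells are counts
m <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
dimnames(m) <- list(Terms = terms, Docs = as.character(seq_along(docs_small)))

# Total frequency of each term across all documents
v <- sort(rowSums(m), decreasing = TRUE)
d_small <- data.frame(word = names(v), freq = v)

print(m)
print(d_small)
```

Because terms are in the rows, rowSums() adds up the counts of one word over all documents.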

Step 5 : Generate the Word cloud

The importance of words can be illustrated with a word cloud as follows:

set.seed(1234)

wordcloud(words = d$word, freq = d$freq, min.freq = 1,

          max.words=200, random.order=FALSE, rot.per=0.35,

          colors=brewer.pal(8, "Dark2"))

Explore frequent terms and their associations

You can look at the frequent terms in the term-document matrix as follows. In the example below, we find the words that occur at least four times:

findFreqTerms(dtm, lowfreq = 4)

[1] "able" "day" "dream" "every" "faith" "free" "freedom" "let" "mountain" "nation"

[11] "one" "ring" "shall" "together" "will"
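The same kind of filter can be written directly on a named frequency vector. The sketch below reuses the ten frequencies printed by head(d, 10) earlier; since only those ten values are available here, the threshold is set to 10 rather than 4, and the variable names are illustrative.

```r
# Frequencies copied from the head(d, 10) output shown earlier
freq <- c(will = 17, freedom = 13, ring = 12, day = 11, dream = 11,
          let = 11, every = 9, able = 8, one = 8, together = 7)

# Keep the terms that occur at least `lowfreq` times, in alphabetical order
lowfreq  <- 10
frequent <- sort(names(freq)[freq >= lowfreq])

print(frequent)
```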

You can analyze the association between frequent terms (i.e., terms which correlate) using the findAssocs() function. The R code below identifies which words are associated with “freedom” in the I Have a Dream speech:

findAssocs(dtm, terms = "freedom", corlimit = 0.3)

$freedom
         let         ring  mississippi mountainside        stone        every     mountain        state
        0.89         0.86         0.34         0.34         0.34         0.32         0.32         0.32
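The idea of association as correlation can be sketched in base R. The example assumes that the association score is the Pearson correlation between the counts of two terms across documents; the small matrix m_small is invented, with terms in rows and five hypothetical documents in columns.

```r
# Hypothetical term counts: rows are terms, columns are five documents
m_small <- rbind(
  freedom = c(1, 0, 2, 0, 1),
  ring    = c(1, 0, 2, 0, 0),
  dream   = c(0, 1, 0, 1, 0)
)

# cor() works column-wise, so transpose to correlate terms with each other
assoc <- cor(t(m_small))["freedom", ]
assoc <- assoc[names(assoc) != "freedom"]   # drop the term itself

# Keep terms whose correlation reaches the limit, as corlimit does
corlimit <- 0.3
kept <- round(assoc[assoc >= corlimit], 2)

print(kept)
```

Terms that tend to appear in the same documents as “freedom” get a score near 1, while terms that appear in different documents get a low or negative score and are dropped.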

The frequency table of words

head(d, 10)

             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7

Plot word frequencies

The frequencies of the 10 most frequent words are plotted:

barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,

col ="lightblue", main ="Most frequent words",

ylab = "Word frequencies")

Dr.Manish Kumar Jain

Friday, 10 May 2019
