Step 1 : Install and load the required packages
Type the R code below to install and load the required packages:
# Install
install.packages("tm")            # for text mining
install.packages("SnowballC")     # for text stemming
install.packages("wordcloud")     # word-cloud generator
install.packages("RColorBrewer")  # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
Step 2 : Text mining
1. Load the text
# Read the text file from internet
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
text <- readLines(filePath)
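If the URL is not reachable from your machine, readLines() works the same way on a local copy of the speech (the file name below is only an assumption):
# Hypothetical local copy of the text file
# text <- readLines("martin-luther-king-i-have-a-dream-speech.txt")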
2. Load the data as a corpus
docs <- Corpus(VectorSource(text))
3. Inspect the content of the document
inspect(docs)
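inspect() prints a summary together with the content of each document. To view a single document as plain text, a small sketch:
# Show the text of the first document in the corpus
writeLines(as.character(docs[[1]]))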
Text transformation
Transformations are performed using the tm_map() function, for example to replace special characters in the text.
Replacing "/", "@" and "|" with a space:
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
Cleaning the text
The tm_map() function is used to remove unnecessary white space, convert the text to lower case, and remove common stopwords like "the" and "we". Stopwords carry almost no information value because they are so common in a language, so removing them is useful before further analysis. For stopwords, the supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish and swedish. Language names are case sensitive.
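To see exactly which words will be removed, you can print the built-in list:
# First entries of the english stopword list
head(stopwords("english"))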
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
# docs <- tm_map(docs, stemDocument)
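Stemming reduces words to their root form, which is why the stemDocument line above is commented out: it changes the words that appear in the cloud (for example, "dreaming" would be shown as "dream"). You can preview its effect with wordStem() from SnowballC:
# Preview what stemming would do to a few words
wordStem(c("dream", "dreams", "dreaming"), language = "english")
# [1] "dream" "dream" "dream"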
Step 4 : Build a term-document matrix
A term-document matrix is a table containing the frequency of the words: row names are words and column names are documents. The function TermDocumentMatrix() from the text mining package can be used as follows:
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7
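As a quick sanity check, the dimensions of the matrix show how many distinct terms and documents (here, lines of the speech) were found:
dim(m)  # number of terms x number of documents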
Step 5 : Generate the Word cloud
The importance of words can be illustrated as a word cloud as follows:
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Explore frequent terms and their associations
You can have a look at the frequent terms in the term-document matrix as follows. In the example below, we want to find words that occur at least four times:
findFreqTerms(dtm, lowfreq = 4)
[1] "able" "day" "dream" "every" "faith" "free" "freedom" "let" "mountain"
"nation"
[11]
"one"
"ring"
"shall"
"together" "will"
You can analyze the association between frequent terms (i.e., terms which correlate) using the findAssocs() function. The R code below identifies which words are associated with "freedom" in the "I Have a Dream" speech:
findAssocs(dtm, terms = "freedom", corlimit = 0.3)
$freedom
         let         ring  mississippi mountainside        stone        every     mountain        state
        0.89         0.86         0.34         0.34         0.34         0.32         0.32         0.32
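findAssocs() also accepts several terms at once; the sketch below queries two words with a lower (arbitrary) correlation threshold:
# Associations for two terms at once
findAssocs(dtm, terms = c("freedom", "dream"), corlimit = 0.25)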
The frequency table of words
head(d, 10)
             word freq
will         will   17
freedom   freedom   13
ring         ring   12
day           day   11
dream       dream   11
let           let   11
every       every    9
able         able    8
one           one    8
together together    7
Plot word frequencies
The frequencies of the 10 most frequent words are plotted:
barplot(d[1:10, ]$freq, las = 2, names.arg = d[1:10, ]$word,
        col = "lightblue", main = "Most frequent words",
        ylab = "Word frequencies")
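If you prefer ggplot2, the same chart can be sketched as follows (ggplot2 is not used elsewhere in this tutorial, so treat this as an optional extra):
library(ggplot2)
ggplot(d[1:10, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_col(fill = "lightblue") +
  labs(x = NULL, y = "Word frequencies", title = "Most frequent words") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))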