Original link: tecdat.cn/?p=3994

Original source: Tuoduan Data Tribe WeChat public account

 

For the unstructured Chinese comments found on websites, R's Chinese word-segmentation and word-frequency packages are a good tool for mining the latent information they contain. The most common way to analyze text content is to extract the words in the text and count how often they occur: frequency reflects a word's importance in the text, and in general the more important a word is, the more often it appears. Once the words have been extracted, a word cloud can also be drawn to visualize the frequencies and make them more intuitive and easier to read.
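As a quick illustration of this idea, here is a minimal sketch on a few made-up toy comments (not the data analysed below), using the same Rwordseg segmenter that appears later in the article:

library(Rwordseg)                                  # Chinese word segmentation (same package used below)
toy   = c("手机质量很好", "物流很快", "手机屏幕不错")   # three made-up comments
words = unlist(segmentCN(toy))                     # split each comment into words
freq  = sort(table(words), decreasing = TRUE)      # count how often each word occurs
head(freq)                                         # the most frequent words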

For example, the comments on one such website look like this:



After a series of text-processing steps, extraction of the high-frequency words, and finally clustering, we obtain the visualizations below.

The first class of customers:

The second class of customers:

The third class of customers:

These word clouds are built from the transaction comments of one website. The word segmentation, word-frequency counting, and word clouds were all done in R, and a final clustering step groups the users into three classes. The charts give an intuitive view of what characterizes each class of customer. The words in these pictures could still be optimized, because some terms or phrases are broken into smaller words by the segmenter and therefore do not show up; one way to keep such terms intact is sketched below. For demonstration purposes I did not spend more time tuning the dictionary, since the goal here is mainly to introduce the process and method of analysis.
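As a hedged aside on the dictionary issue (not part of the original analysis): Rwordseg lets user-defined terms be added to its segmentation dictionary so that multi-character product terms are kept whole. The terms below are hypothetical examples.

library(Rwordseg)
insertWords(c("屏幕分辨率", "待机时间"))     # hypothetical domain terms to keep intact
segmentCN("屏幕分辨率很高，待机时间也长")     # the added terms are no longer split apart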

##### read in and clean the comments #####
pinglun = readLines("E:\\mobile_phone_review1.txt")
write.table(pinglun, "E:\\mobile_phone_comments.txt")
pinglun1 = read.table("E:\\mobile_phone_comments.txt", sep = "|")
res = pinglun1[pinglun1 != " "]                      # drop empty rows
res = gsub(pattern = "[【】]|boutique|not unpacked", "", res)   # strip promotional tags such as "【boutique】", "【not unpacked】" (Chinese in the original)
res = gsub(pattern = "I|you|is", "", res)            # strip very common words ("I", "you", "is" in the original Chinese)
res = gsub("\n", "", res)                            # remove line breaks, otherwise each one is treated as a separate text

##### word segmentation and word frequencies #####
library(rJava)
library(Rwordseg)
words = unlist(lapply(X = res, FUN = segmentCN))
word  = lapply(X = words, FUN = strsplit, " ")
v = table(unlist(word))
v = rev(sort(v))
d = data.frame(word = names(v), freq = as.integer(v))
d = subset(d, nchar(as.character(d$word)) > 1 & d$freq >= 100)   # keep words of 2+ characters appearing at least 100 times
write.table(d, file = "E:\\worldcup_keyword.txt", row.names = FALSE)

##### word cloud of the top 30 words #####
library("wordcloud")
mycolors <- brewer.pal(8, "Dark2")                   # set up a colour palette
wordcloud(d[1:30, ]$word, d[1:30, ]$freq, random.order = FALSE, random.color = FALSE,
          colors = mycolors, family = "myFont3")
write.csv(d[1:30, ], file = "E:\\30_keyword.csv", row.names = FALSE)

##### kmeans clustering #####
res1 = res[1:10000]                                  # take a sample of comments for testing
words = unlist(lapply(X = res1, FUN = segmentCN))
word  = lapply(X = words, FUN = strsplit, " ")
v = table(unlist(word))
v = rev(sort(v))
d = data.frame(word = names(v), freq = as.integer(v))
d = subset(d, nchar(as.character(d$word)) > 0 & d$freq >= 100)

# build a comment-by-keyword frequency matrix over the top 30 words
rating = matrix(0, nrow = length(res1), ncol = 30)
colnames(rating) = as.character(d[1:30, 1])
for (i in 1:length(res1)) {
  words = unlist(lapply(X = res1[i], FUN = segmentCN))
  word  = lapply(X = words, FUN = strsplit, " ")
  v = table(unlist(word))
  v = rev(sort(v))
  dd = data.frame(word = names(v), freq = as.integer(v))
  index = intersect(dd[, 1], colnames(rating))
  if (length(index) == 0) next
  for (j in 1:length(index)) {
    jj = which(dd[, 1] == index[j])
    rating[i, colnames(rating) == index[j]] = dd[jj, 2]
  }
}
write.table(rating, file = "E:\\...")                # file name truncated in the source

# cluster the comments into 3 groups and append the cluster label as column 31
cl = kmeans(rating, 3)
result = cbind(rating, cl$cluster)
colnames(result) = c(as.character(d[1:30, 1]), "class")

c1 = result[result[, 31] == 1, ]
c2 = result[result[, 31] == 2, ]
c3 = result[result[, 31] == 3, ]
freq1 = apply(c1, 2, sum)[-31]
freq2 = apply(c2, 2, sum)[-31]
freq3 = apply(c3, 2, sum)[-31]

# one word cloud per cluster (the 17th term is dropped as uninformative)
library("wordcloud")
mycolors <- brewer.pal(8, "Dark2")
wordcloud(colnames(rating)[-17], freq1[-17], random.order = FALSE, random.color = FALSE, colors = mycolors, family = "myFont3")
wordcloud(colnames(rating)[-17], freq2[-17], random.order = FALSE, random.color = FALSE, colors = mycolors, family = "myFont3")
wordcloud(colnames(rating)[-17], freq3[-17], random.order = FALSE, random.color = FALSE, colors = mycolors, family = "myFont3")

##### illustrate the kmeans algorithm on simulated data #####
y = rbind(matrix(rnorm(10000, mean = 2, sd = 0.3), ncol = 10),
          matrix(rnorm(10000, mean = 1, sd = 0.7), ncol = 10))   # generate two groups of random numbers and merge them
colnames(y) = paste("y", 1:10)
cl = kmeans(y, 2)
plot(y, col = cl$cluster, pch = c(rep("1", 1000), rep("2", 1000)),
     main = "kmeans clustering of the simulated samples")        # samples of each class
points(cl$centers, col = 3, pch = "*", cex = 3)                  # cluster centres
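The choice of three clusters follows the original analysis. As a small hedged addition (not in the original post), one common way to sanity-check the number of clusters is the elbow method on the rating matrix built above:

# total within-cluster sum of squares for k = 1..8 clusters
wss = sapply(1:8, function(k) kmeans(rating, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster SS")
# look for the "elbow" where adding more clusters stops reducing wss much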

Finally, we obtain an intuitive clustering profile of the users that can serve as a basis for further research.