Saturday, August 27, 2011

Text Mining with the tm Package in R

tm is an R package that provides a comprehensive framework for text mining. Load it before you start; the vignette command opens the accompanying documentation.

library(tm)
vignette("tm")

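First, read in the text. This walkthrough uses the 20 XML documents that ship with the tm package, stored in the library\tm\texts\crude folder. The Corpus command reads them and builds a corpus object.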


reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- Corpus(DirSource(reut21578), readerControl = list(reader = readReut21578XML))

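Next, preprocess the corpus with tm_map: convert the documents to plain text, strip extra whitespace, convert to lowercase, remove stop words, and stem the words so that inflected variants of a word collapse to a common form.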

reuters <- tm_map(reuters, as.PlainTextDocument)
reuters <- tm_map(reuters, stripWhitespace)
reuters <- tm_map(reuters, content_transformer(tolower))  # tm >= 0.6: wrap base functions in content_transformer()
reuters <- tm_map(reuters, removeWords, stopwords("english"))
reuters <- tm_map(reuters, stemDocument)  # assign the result; otherwise the stemmed corpus is discarded
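To make the effect of these steps concrete, here is a minimal base-R sketch of lowercasing, whitespace stripping, and stop-word removal on two invented sentences. It does not use tm and is not how tm implements them; the `preprocess` helper and the short `stop_words` list are assumptions made for illustration, and stemming is left out.

```r
# Toy documents; the text is invented for illustration.
docs <- c("Crude   oil prices  ROSE today",
          "The company cut its contract prices for crude oil")

# A tiny hypothetical stop-word list (tm's stopwords("english") is much longer).
stop_words <- c("the", "its", "for", "a", "of")

# Hypothetical helper: clean a single document string.
preprocess <- function(x, stop_words) {
  x <- tolower(x)                             # fold case
  words <- strsplit(x, "[[:space:]]+")[[1]]   # split on runs of whitespace
  words <- words[nzchar(words)]               # drop empty tokens
  words <- words[!words %in% stop_words]      # remove stop words
  paste(words, collapse = " ")                # rejoin with single spaces
}

clean <- vapply(docs, preprocess, character(1),
                stop_words = stop_words, USE.NAMES = FALSE)
print(clean)
```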

Use DocumentTermMatrix to tokenize the processed corpus and build a document-term matrix weighted by term frequency.

dtm <- DocumentTermMatrix(reuters)
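As a rough picture of what a document-term matrix holds, the following base-R sketch counts terms in three invented documents: one row per document, one column per distinct term. It only illustrates the data structure; tm stores the matrix in a sparse format and tokenizes differently, and the names `docs`, `tokens`, `vocab`, and `m` are assumptions.

```r
# Three toy documents (already lowercased, single-space separated).
docs <- c(d1 = "crude oil prices fell",
          d2 = "oil prices rose",
          d3 = "opec cut crude output")

tokens <- strsplit(docs, " ", fixed = TRUE)   # one token vector per document
vocab  <- sort(unique(unlist(tokens)))        # columns: all distinct terms

# Count each vocabulary term in each document, then put documents in rows.
m <- t(vapply(tokens,
              function(tk) as.integer(table(factor(tk, levels = vocab))),
              integer(length(vocab))))
colnames(m) <- vocab
print(m)
```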

Use inspect to look at part of the matrix.

inspect(dtm[1:5, 100:105])

Docs  abdul-aziz ability able abroad, abu accept
  127          0       0    0       0   0      0
  144          0       2    0       0   0      0
  191          0       0    0       0   0      0
  194          0       0    0       0   0      0
  211          0       0    0       0   0      0

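To examine how often particular terms occur across the documents, build a dictionary by hand and pass it as a control parameter when generating the matrix.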

d <- c("prices", "crude", "oil")  # recent tm versions take a plain character vector; older ones used Dictionary()
inspect(DocumentTermMatrix(reuters, list(dictionary = d)))

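Because the resulting matrix is sparse, reduce its dimensionality by removing sparse terms (with sparse = 0.95, a term is dropped if it is absent from at least 95% of the documents), then convert it to a standard data frame.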

dtm2 <- removeSparseTerms(dtm, sparse=0.95)
data <- as.data.frame(as.matrix(dtm2))  # as.matrix() returns the dense counts; inspect() is meant for printing
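The sparsity rule can be sketched in base R on a small dense matrix: compute, for each term, the share of documents in which it does not occur, and keep only the terms whose share is below the threshold. The counts and the 0.6 threshold are invented for this toy example; this shows the idea, not tm's own code.

```r
# Toy document-term counts (rows = documents, columns = terms); values invented.
m <- rbind(d1 = c(oil = 2, crude = 1, opec = 0, tanker = 0),
           d2 = c(oil = 1, crude = 0, opec = 0, tanker = 0),
           d3 = c(oil = 3, crude = 2, opec = 1, tanker = 0),
           d4 = c(oil = 1, crude = 1, opec = 0, tanker = 1))

sparsity  <- colMeans(m == 0)   # share of documents in which each term is absent
threshold <- 0.6                # hypothetical value chosen for this toy data

# Keep only terms whose sparsity is below the threshold.
kept <- m[, sparsity < threshold, drop = FALSE]
print(sparsity)
print(colnames(kept))
```

With only four documents a threshold of 0.95 would keep every term that occurs at all, which is why a lower value is used here.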

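From here you can analyze the data with any tool R offers. Let's try hierarchical clustering.
First standardize the data, then compute a distance matrix, then run the hierarchical clustering.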

data.scale <- scale(data)
d <- dist(data.scale, method = "euclidean")
fit <- hclust(d, method = "ward.D")  # "ward" is called "ward.D" in R versions after 3.0.3

Plot the dendrogram
plot(fit)


The dendrogram shows that, of the 20 documents, 489 and 502 form one cluster that differs markedly from the other documents.
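Cluster membership can also be read off programmatically rather than from the plot. The sketch below repeats the scale, dist, and hclust steps on a small invented matrix and then uses cutree to cut the tree into two groups. The data and the choice of k = 2 are assumptions for illustration; it also assumes an R version that accepts "ward.D".

```r
# Toy term counts for four documents; values invented so that a/b and c/d pair up.
x <- rbind(a = c(5, 0, 1),
           b = c(4, 1, 0),
           c = c(0, 6, 5),
           d = c(1, 5, 6))
colnames(x) <- c("oil", "opec", "price")

# Center and scale each column. A column with zero variance would turn into
# NaN here, so such columns should be dropped beforehand if they occur.
x.scale <- scale(x)

dd  <- dist(x.scale, method = "euclidean")   # pairwise document distances
fit <- hclust(dd, method = "ward.D")         # same linkage as in the post
groups <- cutree(fit, k = 2)                 # cut the tree into two clusters
print(groups)
```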


1 comment:

  1. I'd like to ask: why is the tm_map(reuters, stemDocument) step needed? Comparing stemDocument(crude[[1]]) with crude[[1]], it looks to me as though many words come out wrong after stemming.

    > data("crude")
    > crude[[1]]

    Diamond Shamrock Corp said that
    effective today it had cut its contract prices for crude oil by
    1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
    Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
    oil product prices and a weak crude oil market," a company
    spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
    have cut its contract, or posted, prices over the last two days
    citing weak oil markets.
    Reuter


    > stemDocument(crude[[1]])

    Diamond Shamrock Corp said that
    effect today it had cut it contract price for crude oil by
    1.50 dlrs a barrel.
    The reduct bring it post price for West Texas
    Intermedi to 16.00 dlrs a barrel, the copani said.
    "The price reduct today was made in the light of falling
    oil product price and a weak crude oil market," a company
    spokeswoman said.
    Diamond is the latest in a line of U.S. oil compani that
    hav cut it contract, or posted, price over the last two days
    cit weak oil markets.
    Reuter
