nlp - How to do large-scale replacement/tokenization in R tm_map gsub from a list?
Has anyone managed to create a massive find/replace function or working code snippet that exchanges out known bigrams in a dataframe?
Here's an example. I'm able to do onesie-twosie replacements, but I want to leverage a known lexicon of about 800 terms and find-replace them into single word units prior to DTM generation. For example, I want to turn "google analytics" into "google-analytics".
I know it's theoretically possible; essentially, a custom stopwords list does functionally the same thing, except without the replacement. And it just seems stupid to have 800 gsubs.
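Conceptually, what I'm after is something like the following (a rough, untested sketch; "lexicon.csv" and its "phrase"/"replacement" columns are a hypothetical file holding the ~800 known bigrams):

# Untested sketch: read the bigram lexicon and apply one fixed-string
# gsub per row, instead of writing 800 separate gsub calls.
# "lexicon.csv" and its "phrase"/"replacement" columns are hypothetical.
lexicon <- read.csv("lexicon.csv", stringsAsFactors = FALSE)

join_phrases <- function(text, lexicon) {
  for (i in seq_len(nrow(lexicon))) {
    text <- gsub(lexicon$phrase[i], lexicon$replacement[i], text, fixed = TRUE)
  }
  text
}

# join_phrases("we report google analytics numbers weekly", lexicon)
# should give "we report google-analytics numbers weekly"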
Here's my current code. Any help, pointers, URLs, or RTFMs would be appreciated.
library(tm)  # needed for Corpus, tm_map, stopwords, etc.

mystopwords <- read.csv(stopwords.file, header = FALSE)
mystopwords <- as.character(mystopwords$V1)
mystopwords <- c(mystopwords, stopwords())

# Load the file
df <- readLines(file.name)

# Transform into a corpus
doc.vec <- VectorSource(df)
doc.corpus <- Corpus(doc.vec)
# summary(doc.corpus)

## Hit the known phrases (assigning back to doc.corpus so the replacement carries through)
doc.corpus <- tm_map(doc.corpus, content_transformer(gsub),
                     pattern = "google analytics", replacement = "google-analytics")

## Clean and fix the text - note, no stemming
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
doc.corpus <- tm_map(doc.corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
doc.corpus <- tm_map(doc.corpus, removeNumbers)
doc.corpus <- tm_map(doc.corpus, removeWords, c(stopwords("english"), mystopwords))
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
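Ideally, the single hardcoded gsub above would be driven by the full lexicon instead, something along these lines (again only a sketch, reusing the hypothetical lexicon.csv from above), before finally building the DTM:

# Sketch only: run the same transformer once per lexicon row in place of
# the hardcoded "google analytics" call (lexicon.csv is hypothetical).
for (i in seq_len(nrow(lexicon))) {
  doc.corpus <- tm_map(doc.corpus, content_transformer(gsub),
                       pattern = lexicon$phrase[i],
                       replacement = lexicon$replacement[i],
                       fixed = TRUE)
}

# DTM generation from the cleaned corpus
dtm <- DocumentTermMatrix(doc.corpus)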