nlp - How to do large-scale replacement/tokenization in R tm_map gsub from a list? -


Has anyone managed to create a massive find/replace function, or a working code snippet, that exchanges out known bigrams in a dataframe?

Here's an example. I'm able to do onesie-twosie replacements, but I want to leverage a known lexicon of about 800 terms: I want to find and replace them so that they become single word units prior to DTM generation. For example, I want to turn "google analytics" into "google-analytics".

I know it's theoretically possible; essentially, a custom stopwords list is functionally the same thing, except without the replacement. And it seems stupid to just have 800 gsubs.

Here's my current code. Any help/pointers/URLs/RTFMs would be appreciated.

    library(tm)

    # custom stopwords: one term per row, combined with tm's default list
    mystopwords <- read.csv(stopwords.file, header = FALSE)
    mystopwords <- as.character(mystopwords$V1)
    mystopwords <- c(mystopwords, stopwords())

    # load file
    df <- readLines(file.name)

    # transform to corpus
    doc.vec <- VectorSource(df)
    doc.corpus <- Corpus(doc.vec)
    # summary(doc.corpus)

    ## hit known phrases
    ## (assigned back to doc.corpus so the later steps see the replacement;
    ## this runs before tolower(), so the pattern is case-sensitive here)
    doc.corpus <- tm_map(doc.corpus, content_transformer(gsub), pattern = "google analytics", replacement = "google-analytics")

    ## clean and fix text - note, no stemming
    doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
    doc.corpus <- tm_map(doc.corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
    doc.corpus <- tm_map(doc.corpus, removeNumbers)
    doc.corpus <- tm_map(doc.corpus, removeWords, c(stopwords("english"), mystopwords))
    doc.corpus <- tm_map(doc.corpus, stripWhitespace)
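One way to avoid writing 800 separate gsub calls is to keep the phrases in a character vector and loop over it inside a single helper, then hand that helper to tm_map once. The following is only a minimal sketch under stated assumptions: the function name `hyphenate_phrases`, the three-entry `lexicon`, and the sample `txt` are hypothetical stand-ins (the real lexicon would be read from a file), and the phrases are assumed to be plain words with no regex metacharacters.

```r
# Hypothetical lexicon; in practice this would hold the ~800 known phrases.
lexicon <- c("google analytics", "search engine optimization", "big data")

# Replace each known phrase with its hyphenated form.
# Assumes the phrases contain no regex metacharacters.
hyphenate_phrases <- function(x, phrases) {
  # Longest phrases first, so a long phrase is handled before any
  # shorter phrase that happens to be contained in it.
  phrases <- phrases[order(nchar(phrases), decreasing = TRUE)]
  for (p in phrases) {
    pattern <- paste0("\\b", p, "\\b")             # whole-phrase match only
    replacement <- gsub(" ", "-", p, fixed = TRUE) # "big data" -> "big-data"
    x <- gsub(pattern, replacement, x, ignore.case = TRUE, perl = TRUE)
  }
  x
}

txt <- c("We use Google Analytics daily.", "Big data and google analytics.")
result <- hyphenate_phrases(txt, lexicon)
print(result)
# "We use google-analytics daily." "big-data and google-analytics."

# Hooking the same helper into a tm pipeline (only runs if tm is installed).
if (requireNamespace("tm", quietly = TRUE)) {
  corp <- tm::VCorpus(tm::VectorSource(txt))
  corp <- tm::tm_map(corp, tm::content_transformer(hyphenate_phrases),
                     phrases = lexicon)
}
```

Because the match is case-insensitive and the replacement is built from the lexicon entry, matched phrases come out in the lexicon's own (lowercase) spelling. For the hyphens to survive the later cleaning steps, removePunctuation still needs preserve_intra_word_dashes = TRUE, as in the code above.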

