Text mining: lemmatization using a txt file with lemmas in R
I would like to use an external txt file with Polish lemmas, structured as follows (source of lemmas for many other languages: http://www.lexiconista.com/datasets/lemmatization/):

    abadan abadanem
    abadan abadanie
    abadan abadanowi
    abadan abadanu
    abadańczyk abadańczycy
    abadańczyk abadańczyka
    abadańczyk abadańczykach
    abadańczyk abadańczykami
    abadańczyk abadańczyki
    abadańczyk abadańczykiem
    abadańczyk abadańczykom
    abadańczyk abadańczyków
    abadańczyk abadańczykowi
    abadańczyk abadańczyku
    abadanka abadance
    abadanka abadanek
    abadanka abadanką
    abadanka abadankach
    abadanka abadankami

Which packages, and with what syntax, would allow me to use such a txt database to lemmatize my bag of words? I realize that for English there is WordNet, but there is no such luck for those who would like to use this functionality for rare languages.
If not, can this database be converted so that it is useful with any package that provides lemmatization? Perhaps by converting it to a wide form? For instance, the form used by the free AntConc concordancer (http://www.laurenceanthony.net/software/antconc/):

    abadan -> abadanem, abadanie, abadanowi, abadanu
    abadańczyk -> abadańczycy, abadańczyka, abadańczykach
    etc.

In brief: how can lemmatization with lemmas in a txt file be done in any of the known CRAN R text mining packages? If it can, how should such a txt file be formatted?
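The long-to-wide conversion asked about here can be sketched in base R. This is a minimal sketch, assuming the pairs are already loaded into a data frame; the names `long`, `forms_by_lemma`, and `wide` are made up for illustration, and the output only mimics the `lemma -> form, form` layout shown in the question rather than following any verified AntConc specification.

```r
# Long format: one (lemma, word) pair per row, as in the source txt file.
long <- data.frame(
  lemma = c("abadan", "abadan", "abadan", "abadan",
            "abadańczyk", "abadańczyk", "abadańczyk"),
  word  = c("abadanem", "abadanie", "abadanowi", "abadanu",
            "abadańczycy", "abadańczyka", "abadańczykach"),
  stringsAsFactors = FALSE
)

# Group the inflected forms by lemma, keeping the order in which lemmas
# first appear (explicit factor levels avoid locale-dependent sorting).
lemma_levels <- unique(long$lemma)
forms_by_lemma <- split(long$word, factor(long$lemma, levels = lemma_levels))

# Build one "lemma -> form1, form2, ..." line per lemma.
wide <- paste0(
  names(forms_by_lemma), " -> ",
  vapply(forms_by_lemma, paste, character(1), collapse = ", ")
)

writeLines(wide)
```

The resulting lines could be written to a file with `writeLines(wide, "lemmas_wide.txt")`, where the file name is again only a placeholder.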
Update: Dear @DmitriySelivanov, I got rid of all the diacritical marks, and now I would like to apply it to the tm corpus "docs":

    docs <- tm_map(docs, function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm"))

Then I tried it as a tokenizer:

    lemmatokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")
    docstdm <- DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25), tokenize=lemmatokenizer))

It throws this error at me:

    Error in lemma_hashmap[[tokens]] : attempt to select more than one element in vectorIndex

The function works like a charm with a vector of texts, though.
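One possible cause of this error, offered as an assumption rather than a confirmed diagnosis: both calls pass the quoted string `"lemma_hm"` as `lemma_hashmap` instead of the object `lemma_hm`, so `[[` ends up indexing a length-one character vector with several tokens at once. The base R sketch below reproduces that situation without tm or hashmap; `not_a_map` and `lookup` are illustrative names, and a named character vector stands in for the real hashmap.

```r
tokens <- c("abadanowi", "abadanu")

# A quoted name is just a character vector of length one, not a lookup table.
not_a_map <- "lemma_hm"

# Indexing it with `[[` and more than one token fails; capture the condition.
failure <- tryCatch(not_a_map[[tokens]], error = function(e) e)
print(conditionMessage(failure))

# Stand-in for the real hashmap: a named character vector (word -> lemma).
lookup <- c(abadanowi = "abadan", abadanu = "abadan")

# Passing the object itself allows a vectorised lookup over all tokens.
lemmas <- unname(lookup[tokens])
print(lemmas)
```

If this is indeed the cause, writing `lemma_hashmap = lemma_hm` (without quotes) would be the first thing to try; other adjustments may still be needed for tm, which this sketch does not cover.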
My guess is that there is nothing in the text-mining packages for this task. You just need to replace each word in the second column with the word in the first column. You can do this by creating a hashmap (for example, https://github.com/nathan-russell/hashmap).
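The replacement itself can also be sketched in base R with `match()`, in case no hashmap package is available. This is a swapped-in illustration, not the hashmap-based approach of this answer; the vectors `words`, `lemmas`, and `tokens` are small made-up samples taken from the question's data.

```r
# Second column of the txt file (inflected forms) and first column (lemmas).
words  <- c("abadanem", "abadanie", "abadanowi", "abadanu")
lemmas <- c("abadan",   "abadan",   "abadan",    "abadan")

tokens <- c("abadanowi", "abadanu", "outofvocabulary")

# Position of each token in the word column; NA when the token is unknown.
idx <- match(tokens, words)

# Replace known tokens with their lemma and leave unknown tokens unchanged.
lemmatized <- ifelse(is.na(idx), tokens, lemmas[idx])

print(lemmatized)
```

For a lemma table with millions of rows, a dedicated hashmap or a keyed `data.table` join may be faster; that trade-off is not measured here.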
Below is an example of how you can create a "lemmatizing" tokenizer that you can use in text2vec (and, I guess, in quanteda as well).
Contributions in order to create such a "lemmatizing" package are very welcome - it would be very useful.
    library(hashmap)
    library(data.table)

    # Two whitespace-separated columns: lemma first, inflected form second
    txt = "abadan abadanem
    abadan abadanie
    abadan abadanowi
    abadan abadanu
    abadańczyk abadańczycy
    abadańczyk abadańczykach
    abadańczyk abadańczykami
    "
    dt = fread(txt, header = FALSE, col.names = c("lemma", "word"))

    # Keys are the inflected forms, values are the lemmas
    lemma_hm = hashmap(dt$word, dt$lemma)
    lemma_hm[["abadanu"]]
    #"abadan"

    lemma_tokenizer = function(x, lemma_hashmap,
                               tokenizer = text2vec::word_tokenizer) {
      tokens_list = tokenizer(x)
      for(i in seq_along(tokens_list)) {
        tokens = tokens_list[[i]]
        # Vectorised lookup; tokens missing from the map come back as NA
        replacements = lemma_hashmap[[tokens]]
        ind = !is.na(replacements)
        # Replace only the tokens that have a lemma
        tokens_list[[i]][ind] = replacements[ind]
      }
      tokens_list
    }

    texts = c("abadanowi abadańczykach outofvocabulary",
              "abadańczyk abadan outofvocabulary")
    lemma_tokenizer(texts, lemma_hm)
    #[[1]]
    #[1] "abadan" "abadańczyk" "outofvocabulary"
    #[[2]]
    #[1] "abadańczyk" "abadan" "outofvocabulary"