text mining - Lemmatization using a txt file with lemmas in R


I would like to use an external txt file with Polish lemmas, structured as follows (source of lemmas for many other languages: http://www.lexiconista.com/datasets/lemmatization/):

    abadan  abadanem
    abadan  abadanie
    abadan  abadanowi
    abadan  abadanu
    abadańczyk  abadańczycy
    abadańczyk  abadańczyka
    abadańczyk  abadańczykach
    abadańczyk  abadańczykami
    abadańczyk  abadańczyki
    abadańczyk  abadańczykiem
    abadańczyk  abadańczykom
    abadańczyk  abadańczyków
    abadańczyk  abadańczykowi
    abadańczyk  abadańczyku
    abadanka    abadance
    abadanka    abadanek
    abadanka    abadanką
    abadanka    abadankach
    abadanka    abadankami

What packages, and with what syntax, would allow me to use such a txt database to lemmatize a bag of words? I realize that for English there is WordNet, but there is no such luck for those who would like to use this functionality for rare languages.

If not, can this database be converted so that it is useful with any package that provides lemmatization? Perhaps by converting it to a wide form? For instance, the form used by the free AntConc concordancer (http://www.laurenceanthony.net/software/antconc/):

    abadan -> abadanem, abadanie, abadanowi, abadanu
    abadańczyk -> abadańczycy, abadańczyka, abadańczykach
    etc.
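For reference, converting the two-column file into such a wide form could be sketched in base R as follows. This is only an illustration under assumptions: the sample rows are ASCII-only to keep it locale-independent, the exact separators AntConc expects are taken from the example above rather than checked against its documentation, and the output file name in the final comment is hypothetical.

```r
# Long lemma table: one (lemma, word) pair per row, as in the txt file
long <- data.frame(
  lemma = c("abadan", "abadan", "abadan", "abadanka", "abadanka"),
  word  = c("abadanem", "abadanie", "abadanu", "abadance", "abadanek"),
  stringsAsFactors = FALSE
)

# Group the inflected forms by lemma, keeping lemmas in file order
forms_by_lemma <- split(
  long$word,
  factor(long$lemma, levels = unique(long$lemma))
)

# Build one "lemma -> form1, form2, ..." line per lemma
wide <- paste0(
  names(forms_by_lemma), " -> ",
  vapply(forms_by_lemma, paste, character(1), collapse = ", ")
)

print(wide)
# To save the result: writeLines(wide, "lemmas_wide.txt")  # hypothetical file name
```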

In brief: how can lemmatization with lemmas in a txt file be done in any of the known CRAN R text mining packages? If it can, how should such a txt file be formatted?

UPDATE: Dear @DmitriySelivanov, I got rid of all diacritical marks, and now I would like to apply it on the tm corpus "docs":

    docs <- tm_map(docs, function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm"))

and I tried it as a tokenizer:

    lemmatokenizer <- function(x) lemma_tokenizer(x, lemma_hashmap="lemma_hm")
    docstdm <- DocumentTermMatrix(docs, control = list(wordLengths = c(4, 25), tokenize = lemmatokenizer))

It throws an error at me:

    Error in lemma_hashmap[[tokens]] :
      attempt to select more than one element in vectorIndex

The function works like a charm with a vector of texts, though.

My guess is that there is nothing in the text-mining packages for this task. You just need to replace each word in the second column with the word in the first column. You can do that by creating a hashmap (for example https://github.com/nathan-russell/hashmap).

Below is an example of how you can create a "lemmatizing" tokenizer which you can use in text2vec (and, I guess, quanteda as well).

Contributions in order to create such a "lemmatizing" package are very welcome - it would be very useful.
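The replacement idea itself can be sketched without any extra package, using a named character vector as the lookup table. This is a base R stand-in for a hashmap, not the API of the hashmap package, and `lemmatize_tokens` is a hypothetical helper name chosen for illustration.

```r
# Lookup table: names are inflected forms (second column),
# values are lemmas (first column)
lemma_lookup <- c(
  abadanem = "abadan",
  abadanie = "abadan",
  abadanu  = "abadan",
  abadance = "abadanka"
)

# Hypothetical helper: replace each token with its lemma and
# leave tokens that are not in the table unchanged
lemmatize_tokens <- function(tokens, lookup) {
  replacements <- unname(lookup[tokens])  # NA where a token is unknown
  found <- !is.na(replacements)
  tokens[found] <- replacements[found]
  tokens
}

result <- lemmatize_tokens(c("abadanu", "abadance", "outofvocabulary"), lemma_lookup)
print(result)
```

If an inflected form occurs more than once in the table, indexing by name picks the first entry, so ambiguous forms would need separate handling.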

    library(hashmap)
    library(data.table)

    # Sample of the lemma table: lemma first, inflected form second
    txt =
      "abadan  abadanem
      abadan  abadanie
      abadan  abadanowi
      abadan  abadanu
      abadańczyk  abadańczycy
      abadańczyk  abadańczykach
      abadańczyk  abadańczykami
      "
    dt = fread(txt, header = FALSE, col.names = c("lemma", "word"))

    # Map each inflected form (key) to its lemma (value)
    lemma_hm = hashmap(dt$word, dt$lemma)

    lemma_hm[["abadanu"]]
    # "abadan"

    lemma_tokenizer = function(x, lemma_hashmap,
                               tokenizer = text2vec::word_tokenizer) {
      tokens_list = tokenizer(x)
      for (i in seq_along(tokens_list)) {
        tokens = tokens_list[[i]]
        # Look up all tokens of the document at once; unknown tokens give NA
        replacements = lemma_hashmap[[tokens]]
        ind = !is.na(replacements)
        tokens_list[[i]][ind] = replacements[ind]
      }
      tokens_list
    }

    texts = c("abadanowi abadańczykach outofvocabulary",
              "abadańczyk abadan outofvocabulary")
    lemma_tokenizer(texts, lemma_hm)

    # [[1]]
    # [1] "abadan"          "abadańczyk"      "outofvocabulary"
    # [[2]]
    # [1] "abadańczyk"      "abadan"          "outofvocabulary"
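The example above reads the table from an inline string. With the real file, the table would be read from disk. A minimal sketch in base R follows, assuming the file is whitespace-separated, has the lemma first and the inflected form second, and is encoded in UTF-8. These are assumptions about the file, and the temporary file only stands in for the real path.

```r
# Stand-in for the real lemma file: a few sample rows in a temporary file
path <- tempfile(fileext = ".txt")
writeLines(c("abadan\tabadanem", "abadan\tabadanie", "abadanka\tabadance"), path)

# Read two whitespace-separated columns. Quote and comment handling are
# switched off so that apostrophes or '#' inside words are not misread.
# If lemmas may contain spaces, sep = "\t" would be needed instead.
lemmas <- read.table(
  path,
  header = FALSE,
  col.names = c("lemma", "word"),
  quote = "",
  comment.char = "",
  stringsAsFactors = FALSE,
  fileEncoding = "UTF-8"
)

# Named vector: inflected form -> lemma
lemma_lookup <- setNames(lemmas$lemma, lemmas$word)
print(lemma_lookup[["abadanie"]])
```

The two columns, `lemmas$word` and `lemmas$lemma`, can be passed to `hashmap()` in the same way as `dt$word` and `dt$lemma` in the example above.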
