python - Get tokens with CountVectorizer from sklearn using multiple separators


I use these separators to split a sentence into tokens (whenever Python sees one of these characters, I want it to split the sentence there):

{""/%»…l¦>|=!—\+([„:<#•}‘°_–·˘“›;^$®&”’){€*?.`@«ľ]~}

Here is an example of a sentence that I want to split into tokens, counting the occurrences of each one:

@itkutak (pitanje za intesu: radi li ?neka)

The tokens I want to get are: itkutak, pitanje, za, intesu, radi, li, neka. How can I use CountVectorizer for this?

This is how my code looks right now:

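    from pandas import DataFrame
    from sklearn.feature_extraction.text import CountVectorizer

    # min_df=1 keeps every term; recent scikit-learn releases reject an integer min_df of 0
    cv = CountVectorizer(min_df=1, max_df=1.0)
    post_textCV = cv.fit_transform(post_text)
    # get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases
    df = DataFrame(post_textCV.toarray(), columns=cv.get_feature_names_out())
    print(df.head())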

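I assume you are talking about sklearn's CountVectorizer. According to the documentation, you can either:

  1. Define the token_pattern parameter. If you know that all of your tokens are alphanumeric, you can do this:

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
  2. Overwrite the tokenizer by writing a function that takes a string and does the tokenization yourself. This will be slower than the first method, though.

    def tokenizer(document):
        # Split the document into a list of token strings here
        pass

    vectorizer = CountVectorizer(tokenizer=tokenizer)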
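
As a rough sketch of the second option, a tokenizer could split on a character class built from the separators. The names `SEPARATORS`, `SPLIT_RE` and `split_on_separators` are made up for this illustration, and `SEPARATORS` holds only a small subset of the list from the question (whitespace is added as an assumption), so you would need to extend it with the characters you actually want:

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Illustrative subset of the separators from the question (an assumption);
# extend this string with the remaining characters as needed.
SEPARATORS = "@():?!.,;"

# Match any run of separator characters or whitespace.
SPLIT_RE = re.compile("[" + re.escape(SEPARATORS) + r"\s]+")


def split_on_separators(document):
    """Split the document on the separators and drop empty strings."""
    return [token for token in SPLIT_RE.split(document) if token]


sentence = "@itkutak (pitanje za intesu: radi li ?neka)"

# CountVectorizer lowercases the text first, then calls the custom tokenizer.
vectorizer = CountVectorizer(tokenizer=split_on_separators)
counts = vectorizer.fit_transform([sentence])

# The learned vocabulary, in alphabetical order.
tokens = sorted(vectorizer.vocabulary_)
print(tokens)
```

scikit-learn may warn that token_pattern is not used when a tokenizer is supplied; that warning is harmless here. Also note that `\w` in the first option matches the underscore, so if `_` should act as a separator, the custom tokenizer is probably the safer choice.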
