python - Get tokens with CountVectorizer from sklearn using multiple separators
I use these separators to split sentences into tokens (whenever Python sees one of these characters, I want the sentence split there):
{""/%»…l¦>|=!—\+([„:<#•}‘°_–·˘“›;^$®&”’){€*?.`@«ľ]~}
Here is an example of a sentence I want to split into tokens, counting the occurrences of each one:
@itkutak (pitanje za intesu: radi li ?neka)
The tokens I should get: itkutak, pitanje, za, intesu, radi, li, neka
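To make the expected behavior concrete, this is the same split done with plain re.split (a sketch that uses only the separators occurring in the example sentence; the real pattern would be built from the full set above):

import re

# Only the separators that occur in the example sentence; re.escape
# makes metacharacters like ? ( ) literal inside the character class.
separators = '@(:?)'
pattern = '[' + re.escape(separators) + r'\s]+'

sentence = '@itkutak (pitanje za intesu: radi li ?neka)'
tokens = [t for t in re.split(pattern, sentence) if t]
print(tokens)  # ['itkutak', 'pitanje', 'za', 'intesu', 'radi', 'li', 'neka']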
How can I use CountVectorizer to do this?
This is how my code looks right now:
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0, max_df=1.0)
post_textcv = cv.fit_transform(post_text)
# .A converts the sparse result to a dense array.
df = DataFrame(post_textcv.A, columns=cv.get_feature_names())
print(df.head())
I assume you are talking about sklearn's CountVectorizer. According to the documentation, you can do either of the following:
Define the token_pattern parameter. If you know that all of your tokens are alphanumeric, you can do this:

vectorizer = CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
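Applied to the example sentence from the question, that gives the following (a quick sketch; this pattern treats every non-alphanumeric character as a separator, which covers the set above as long as the tokens themselves are alphanumeric):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
counts = vectorizer.fit_transform(['@itkutak (pitanje za intesu: radi li ?neka)'])

# Feature names come back sorted alphabetically;
# on scikit-learn >= 1.0 use get_feature_names_out() instead.
print(vectorizer.get_feature_names())
# ['intesu', 'itkutak', 'li', 'neka', 'pitanje', 'radi', 'za']
print(counts.toarray())  # [[1 1 1 1 1 1 1]]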
Alternatively, overwrite the tokenizer by passing a function that takes a string and does the tokenization yourself. This is slower than the first method, though:

def tokenizer(document):
    pass  # do the tokenization here

vectorizer = CountVectorizer(tokenizer=tokenizer)
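A concrete version of that tokenizer could look like this (my sketch, not from the original answer; the separators string is shortened here and would hold the full set from the question):

import re
from sklearn.feature_extraction.text import CountVectorizer

separators = '@(:?)/%»¦>|=!'  # shortened; paste the full set here
split_pattern = re.compile('[' + re.escape(separators) + r'\s]+')

def tokenizer(document):
    # Every non-empty piece between separators/whitespace is a token.
    return [token for token in split_pattern.split(document) if token]

vectorizer = CountVectorizer(tokenizer=tokenizer)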