python - Get tokens with CountVectorizer from sklearn using multiple separators
I use these separators to split sentences into tokens (whenever Python sees one of these characters, I want the sentence split there):
{""/%»…l¦>|=!—\+([„:<#•}‘°_–·˘“›;^$®&”’){€*?.`@«ľ]~}
Here is an example of a sentence I want to split into tokens, counting the occurrences of each one:
@itkutak (pitanje za intesu: radi li ?neka)
The tokens I should get: itkutak, pitanje, za, intesu, radi, li, neka
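To make the expected behavior concrete, this is the same split done with plain re.split (a sketch that uses only the separators occurring in the example sentence; the real pattern would be built from the full set above):

import re

# Only the separators that occur in the example sentence; re.escape
# makes metacharacters like ? ( ) literal inside the character class.
separators = '@(:?)'
pattern = '[' + re.escape(separators) + r'\s]+'

sentence = '@itkutak (pitanje za intesu: radi li ?neka)'
tokens = [t for t in re.split(pattern, sentence) if t]
print(tokens)  # ['itkutak', 'pitanje', 'za', 'intesu', 'radi', 'li', 'neka']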
How can I use CountVectorizer to do this?
This is how my code looks right now:
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0, max_df=1.0)
post_textcv = cv.fit_transform(post_text)
# .A converts the sparse result to a dense array.
df = DataFrame(post_textcv.A, columns=cv.get_feature_names())
print(df.head())
I assume you are talking about sklearn's CountVectorizer. According to the documentation, you can do either of the following:
Define the token_pattern parameter. If you know that all of your tokens are alphanumeric, you can do this:

vectorizer = CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
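Applied to the example sentence from the question, that gives the following (a quick sketch; this pattern treats every non-alphanumeric character as a separator, which covers the set above as long as the tokens themselves are alphanumeric):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
counts = vectorizer.fit_transform(['@itkutak (pitanje za intesu: radi li ?neka)'])

# Feature names come back sorted alphabetically;
# on scikit-learn >= 1.0 use get_feature_names_out() instead.
print(vectorizer.get_feature_names())
# ['intesu', 'itkutak', 'li', 'neka', 'pitanje', 'radi', 'za']
print(counts.toarray())  # [[1 1 1 1 1 1 1]]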
Alternatively, overwrite the tokenizer by passing a function that takes a string and does the tokenization yourself. This is slower than the first method, though:

def tokenizer(document):
    pass  # do the tokenization here

vectorizer = CountVectorizer(tokenizer=tokenizer)
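A concrete version of that tokenizer could look like this (my sketch, not from the original answer; the separators string is shortened here and would hold the full set from the question):

import re
from sklearn.feature_extraction.text import CountVectorizer

separators = '@(:?)/%»¦>|=!'  # shortened; paste the full set here
split_pattern = re.compile('[' + re.escape(separators) + r'\s]+')

def tokenizer(document):
    # Every non-empty piece between separators/whitespace is a token.
    return [token for token in split_pattern.split(document) if token]

vectorizer = CountVectorizer(tokenizer=tokenizer)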