Python scikit-learn and pandas: categorical data
I'm working on a multivariable regression from a CSV, predicting crop performance based on multiple factors. Some of the columns are numerical and meaningful. Others are numerical but categorical, or strings and categorical (for instance, crop variety, or plot code, or whatever). How do I teach Python to use them? I've found the one-hot encoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.onehotencoder.html#sklearn.preprocessing.onehotencoder) but don't understand how to apply it here.
My code so far:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('filepath.csv')
# Drop rows that have no value in the label column
df.drop(df[df['labeleddatacolumn'].isnull()].index.tolist(), inplace=True)
scale = StandardScaler()
pd.options.mode.chained_assignment = None  # default='warn'
x = df[['inputcolumn1', 'inputcolumn2', ..., 'inputcolumn20']]
y = df['labeleddatacolumn']
# Standardize the input columns
x[['inputcolumn1', 'inputcolumn2', ..., 'inputcolumn20']] = scale.fit_transform(x[['inputcolumn1', 'inputcolumn2', ..., 'inputcolumn20']].values)
#print (x)
est = sm.OLS(y, x).fit()
est.summary()
You can use the get_dummies function that pandas provides to convert the categorical values.
Something like this:
predictor = pd.concat([data.get(['numerical_column_1', 'numerical_column_2', 'label']),
                       pd.get_dummies(data['categorical_column1'], prefix='categorical_col1'),
                       pd.get_dummies(data['categorical_column2'], prefix='categorical_col2')],
                      axis=1)
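To make it concrete what get_dummies produces, here is a minimal sketch on toy data. The DataFrame and its column names (rainfall, variety, label) are made up for illustration and are not from the question's CSV. Passing dtype=float is an assumption on my part, to keep the indicator columns numeric rather than boolean.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
data = pd.DataFrame({
    "rainfall": [10.0, 12.5, 9.0, 11.0],
    "variety": ["A", "B", "A", "C"],
    "label": [3.1, 4.0, 2.9, 3.7],
})

# One indicator column per distinct value of the categorical column.
dummies = pd.get_dummies(data["variety"], prefix="variety", dtype=float)

# Put the numerical columns and the indicator columns side by side.
predictor = pd.concat([data[["rainfall", "label"]], dummies], axis=1)
print(predictor.columns.tolist())
```

Each row gets a 1.0 in the column matching its variety and 0.0 in the others, so the string column becomes three numeric columns (variety_A, variety_B, variety_C) the regression can use.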
Then get the outcome/label column by doing:
outcome = predictor['label']
del predictor['label']
Then call the model on the data by doing:
est = sm.OLS(outcome, predictor).fit()
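One thing to watch for with this approach: sm.OLS does not add an intercept on its own, and a full set of indicator columns for one categorical variable sums to a column of ones. So two full sets (or one full set plus a constant) are linearly dependent. Below is a minimal sketch of one common workaround, assuming toy data with hypothetical column names: drop one level per category with drop_first=True and add an explicit constant column. To stay self-contained, it fits with NumPy's least squares instead of statsmodels; that is a stand-in, not the method from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
data = pd.DataFrame({
    "rainfall": [10.0, 12.0, 9.0, 11.0, 13.0, 8.0],
    "variety": ["A", "B", "C", "A", "B", "C"],
    "plot": ["p1", "p1", "p1", "p2", "p2", "p2"],
    "label": [3.0, 4.1, 2.8, 3.6, 4.9, 2.9],
})

# Encode both categorical columns, dropping the first level of each
# so the indicators are not collinear with the constant term.
encoded = pd.get_dummies(data, columns=["variety", "plot"],
                         drop_first=True, dtype=float)

# Separate the outcome from the predictors and add an intercept column.
outcome = encoded.pop("label")
encoded.insert(0, "const", 1.0)

# Full column rank means every coefficient is identifiable.
rank = np.linalg.matrix_rank(encoded.to_numpy())

# Ordinary least squares via NumPy (stand-in for sm.OLS in this sketch).
coef, *_ = np.linalg.lstsq(encoded.to_numpy(), outcome.to_numpy(), rcond=None)
print(dict(zip(encoded.columns, coef.round(3))))
```

With statsmodels installed, the same encoded frame and outcome could presumably be passed to sm.OLS(outcome, encoded).fit() in place of the lstsq call. The coefficients on the remaining indicator columns are then read relative to the dropped level (variety A and plot p1 here).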