pandas - Python sklearn: IndexError: 'too many indices for array'


I am new to machine learning. I ran into a problem lately. I have already searched Stack Overflow for the same topic, but I still can't figure it out. Could you take a look? Thanks a lot!

    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    data_train = pd.read_excel('py_train.xlsx', index_col=0)
    test_data = pd.read_excel('py_test.xlsx', index_col=0)

    from sklearn import preprocessing

    x = data_train.iloc[:, 1:].as_matrix()    # feature columns
    y = data_train.iloc[:, 0:1].as_matrix()   # label column

    sx = preprocessing.scale(x)               # standardize the features

    from sklearn import linear_model
    clf = linear_model.LogisticRegression()
    clf.fit(sx, y)

    clf

The code above runs fine, and the data is cleaned. The data I fit looks like this:

    id  rep  a   b   c   d
    1   0    1   2   3   4
    2   0    2   3   4   5
    3   0    3   4   5   6
    4   1    4   5   6   7
    5   1    5   6   7   8
    6   1    6   7   8   9
    7   1    7   8   9   10
    8   1    8   9   10  11
    9   1    9   10  11  12
    10  1    10  11  12  13

But the code below raises an IndexError. Why, and how can I fix it?

Thanks!

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.learning_curve import learning_curve


    def plot_learning_curve(estimator, title, x, y, ylim=None, cv=None, n_jobs=1,
                            train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):

        train_sizes, train_scores, test_scores = learning_curve(
            estimator, x, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)

        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)

        if plot:
            plt.figure()
            plt.title(title)
            if ylim is not None:
                plt.ylim(*ylim)   # ylim = limits of the y axis
            plt.xlabel(u"train set size")
            plt.ylabel(u"score")
            plt.gca().invert_yaxis()
            plt.grid()    # grid

            # shaded bands: mean +/- one standard deviation
            plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std,
                             alpha=0.1, color="b")
            plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std,
                             alpha=0.1, color="r")
            plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u"train set score")
            plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u"cv score")

            plt.legend(loc="best")

            plt.draw()
            plt.gca().invert_yaxis()
            plt.show()

        midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
        diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
        return midpoint, diff

    plot_learning_curve(clf, u"learning_curve", x, y)

The full traceback:

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-18-0dc3d0934602> in <module>()
         42     return midpoint, diff
         43
    ---> 44 plot_learning_curve(clf, u"learning_curve", x, y)

    <ipython-input-18-0dc3d0934602> in plot_learning_curve(estimator, title, x, y, ylim, cv, n_jobs, train_sizes, verbose, plot)
          8
          9     train_sizes, train_scores, test_scores = learning_curve(
    ---> 10         estimator, x, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
         11
         12     train_scores_mean = np.mean(train_scores, axis=1)

    d:\anaconda3\lib\site-packages\sklearn\learning_curve.py in learning_curve(estimator, X, y, train_sizes, cv, scoring, exploit_incremental_learning, n_jobs, pre_dispatch, verbose, error_score)
        138     X, y = indexable(X, y)
        139     # Make a list since we will be iterating multiple times over the folds
    --> 140     cv = list(check_cv(cv, X, y, classifier=is_classifier(estimator)))
        141     scorer = check_scoring(estimator, scoring=scoring)
        142

    d:\anaconda3\lib\site-packages\sklearn\cross_validation.py in check_cv(cv, X, y, classifier)
       1821         if classifier:
       1822             if type_of_target(y) in ['binary', 'multiclass']:
    -> 1823                 cv = StratifiedKFold(y, cv)
       1824             else:
       1825                 cv = KFold(_num_samples(y), cv)

    d:\anaconda3\lib\site-packages\sklearn\cross_validation.py in __init__(self, y, n_folds, shuffle, random_state)
        567         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
        568             for label, (_, test_split) in zip(unique_labels, per_label_splits):
    --> 569                 label_test_folds = test_folds[y == label]
        570                 # the test split can be too big because we used
        571                 # KFold(max(c, self.n_folds), self.n_folds) instead of

    IndexError: too many indices for array

The logistic regression accepts this, but the cross-validators seem to accept only 1-D arrays as y values. You seem to be passing a matrix (a 2-D column).
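To see why a 2-D y trips the line `test_folds[y == label]` from the traceback, here is a minimal NumPy-only sketch. The arrays `test_folds`, `y_column`, and `y_flat` and the helper `select` are hypothetical stand-ins invented for illustration, not scikit-learn's actual internals: a boolean mask with two dimensions cannot index a one-dimensional array.

```python
import numpy as np

# Hypothetical stand-ins for the arrays used inside the cross-validator.
test_folds = np.zeros(5, dtype=int)               # 1-D, one entry per sample
y_column = np.array([[0], [0], [1], [1], [1]])    # shape (5, 1), like iloc[:, 0:1]
y_flat = y_column.ravel()                         # shape (5,), like iloc[:, 0]


def select(labels, label):
    """Mimic the failing line: index a 1-D array with the mask labels == label."""
    return test_folds[labels == label]


try:
    select(y_column, 1)          # the mask has shape (5, 1): two dimensions
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)             # message starts with "too many indices for array"
print(select(y_flat, 1).shape)   # the 1-D mask works and picks the three samples labelled 1
```

The exact wording of the message varies between NumPy versions, but the exception type is the same `IndexError` shown in the traceback.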

Check the difference:

You are passing this:

    df.iloc[:,0:1].as_matrix()
    array([[0],
           [1],
           [2]], dtype=int64)

But it might be better to use:

    df.iloc[:,0].as_matrix()
    array([0, 1, 2], dtype=int64)

Could you try it?
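As a rough check of the two selections, the sketch below builds a small hypothetical frame shaped like the one in the question (the name `df` and the values are made up for illustration) and compares the resulting shapes. It uses `to_numpy()` because `as_matrix()` was removed in pandas 1.0; on the older version used in the question, `as_matrix()` or `.values` should behave the same way for this purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the question's layout: 'rep' is the label column.
df = pd.DataFrame({
    "rep": [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "a": range(1, 11),
    "b": range(2, 12),
})

y_2d = df.iloc[:, 0:1].to_numpy()   # slice selection keeps a column: 2-D
y_1d = df.iloc[:, 0].to_numpy()     # integer selection returns a Series: 1-D
y_fixed = y_2d.ravel()              # flattening an existing 2-D column also works

print(y_2d.shape)                       # (10, 1)
print(y_1d.shape)                       # (10,)
print(np.array_equal(y_fixed, y_1d))    # True
```

Either `y_1d` or `y_fixed` has the 1-D shape that the cross-validation code indexes with, so passing one of them as y should avoid the IndexError.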

