pandas - Python sklearn: IndexError: 'too many indices for array'
I'm new to machine learning. I ran into a problem lately and have already searched Stack Overflow for the same topic, but I still can't figure it out. Could you take a look? Thanks a lot!
# -*- coding:utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data_train = pd.read_excel('py_train.xlsx', index_col=0)
test_data = pd.read_excel('py_test.xlsx', index_col=0)

from sklearn import preprocessing
x = data_train.iloc[:, 1:].as_matrix()    # feature columns
y = data_train.iloc[:, 0:1].as_matrix()   # label column, kept as a 2-D column
sx = preprocessing.scale(x)

from sklearn import linear_model
clf = linear_model.LogisticRegression()
clf.fit(sx, y)
clf
The code above runs fine, and the data has been cleaned. The data I fit looks like this:
id  rep  a   b   c   d
1   0    1   2   3   4
2   0    2   3   4   5
3   0    3   4   5   6
4   1    4   5   6   7
5   1    5   6   7   8
6   1    6   7   8   9
7   1    7   8   9   10
8   1    8   9   10  11
9   1    9   10  11  12
10  1    10  11  12  13
But the code below raises an IndexError. Why, and how do I fix it?
Thanks!
import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve

def plot_learning_curve(estimator, title, x, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(.05, 1., 20), verbose=0, plot=True):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, x, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    if plot:
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)  # ylim = limits of the y axis
        plt.xlabel(u"train set size")
        plt.ylabel(u"score")
        plt.gca().invert_yaxis()
        plt.grid()  # draw grid lines
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1, color="b")  # generates shaded region
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1, color="r")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="b", label=u"train set score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="r", label=u"cv score")
        plt.legend(loc="best")
        plt.draw()
        plt.gca().invert_yaxis()
        plt.show()

    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff

plot_learning_curve(clf, u"learning_curve", x, y)
The full traceback:
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-18-0dc3d0934602> in <module>()
     42     return midpoint, diff
     43
---> 44 plot_learning_curve(clf, u"learning_curve", x, y)

<ipython-input-18-0dc3d0934602> in plot_learning_curve(estimator, title, x, y, ylim, cv, n_jobs, train_sizes, verbose, plot)
      8
      9     train_sizes, train_scores, test_scores = learning_curve(
---> 10         estimator, x, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
     11
     12     train_scores_mean = np.mean(train_scores, axis=1)

d:\anaconda3\lib\site-packages\sklearn\learning_curve.py in learning_curve(estimator, X, y, train_sizes, cv, scoring, exploit_incremental_learning, n_jobs, pre_dispatch, verbose, error_score)
    138     X, y = indexable(X, y)
    139     # Make a list since we will be iterating multiple times over the folds
--> 140     cv = list(check_cv(cv, X, y, classifier=is_classifier(estimator)))
    141     scorer = check_scoring(estimator, scoring=scoring)
    142

d:\anaconda3\lib\site-packages\sklearn\cross_validation.py in check_cv(cv, X, y, classifier)
   1821         if classifier:
   1822             if type_of_target(y) in ['binary', 'multiclass']:
-> 1823                 cv = StratifiedKFold(y, cv)
   1824             else:
   1825                 cv = KFold(_num_samples(y), cv)

d:\anaconda3\lib\site-packages\sklearn\cross_validation.py in __init__(self, y, n_folds, shuffle, random_state)
    567         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
    568             for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 569                 label_test_folds = test_folds[y == label]
    570                 # the test split can be too big because we used
    571                 # KFold(max(c, self.n_folds), self.n_folds) instead of

IndexError: too many indices for array
The logistic regression accepts it, but the cross-validators seem to accept only 1-D arrays as y values. You seem to be passing a 2-D matrix (a single column).
Check the difference:
You are passing this:
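To see why a column-shaped y trips the cross-validator, here is a minimal NumPy-only sketch that mimics the failing line `test_folds[y == label]` from the traceback. The label values and the helper name `select_fold_entries` are made up for illustration; this is not scikit-learn's actual code, only the same kind of boolean indexing.

```python
import numpy as np

# Made-up labels shaped like the question's y: ten rows, one column.
y_column = np.array([[0], [0], [0], [1], [1], [1], [1], [1], [1], [1]])

# One fold index per sample, stored as a 1-D array.
test_folds = np.zeros(len(y_column), dtype=int)


def select_fold_entries(test_folds, y, label):
    """Hypothetical helper that mimics `test_folds[y == label]`."""
    return test_folds[y == label]


# With a (10, 1) y, the mask is 2-D and cannot index a 1-D array.
try:
    select_fold_entries(test_folds, y_column, 1)
    error_message = None
except IndexError as exc:
    error_message = str(exc)

# Flattening y to shape (10,) makes the mask line up with test_folds.
y_flat = y_column.ravel()
selected = select_fold_entries(test_folds, y_flat, 1)

print(error_message)
print(selected.shape)
```

The 2-D mask counts as two indices, which is one more than the 1-D `test_folds` array has, hence the error message.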
df.iloc[:,0:1].as_matrix()
array([[0],
       [1],
       [2]], dtype=int64)
But it might be better to use
df.iloc[:,0].as_matrix()
array([0, 1, 2], dtype=int64)
Could you try it?
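For readers on a recent pandas, where `as_matrix()` has been removed (pandas 1.0 and later), the same comparison can be sketched with `to_numpy()`. The small frame below is a hypothetical stand-in for the question's data, with `rep` as the label column.

```python
import pandas as pd

# Hypothetical frame: 'rep' is the label, 'a' and 'b' are features.
df = pd.DataFrame({"rep": [0, 0, 1, 1], "a": [1, 2, 3, 4], "b": [2, 3, 4, 5]})

# A slice (0:1) keeps the column axis, so the result is 2-D.
y_2d = df.iloc[:, 0:1].to_numpy()

# A single integer position drops the column axis, so the result is 1-D.
y_1d = df.iloc[:, 0].to_numpy()

# The features stay 2-D, one row per sample.
x = df.iloc[:, 1:].to_numpy()

print(y_2d.shape, y_1d.shape, x.shape)
```

Passing the 1-D version as y is the shape the answer suggests; calling `.ravel()` on an existing column array would be another way to get there.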