python - subselection of columns in dask (from pandas) by a computed boolean indexer
I'm new to dask (imported as dd) and am trying to convert pandas (imported as pd) code.
The goal of the following lines is to slice the data down to those columns whose values fulfill a calculated requirement, in dask.
There is a given table in CSV. The former code reads
inputdata=pd.read_csv("inputfile.csv")
pseudoa=inputdata.quantile([.035,.965])
pseudob=pseudoa.diff().loc[.965]
inputdata=inputdata.loc[:,inputdata.columns[pseudob.values>0]]
inputdata.describe()
and is working fine. The simple idea for the conversion was to substitute the first line with
inputdata=dd.read_csv("inputfile.csv");
but that resulted in the strange error message IndexError: too many indices for array
. Even when switching to already computed data for inputdata
and pseudob
the error remains.
Maybe the question comes down to the idea of slicing dask columns by a calculated boolean indexer.
I found a (maybe suboptimal) way (not a solution) to do that. Changing line 4 to the following
inputdata=inputdata.loc[:,inputdata.columns[(pseudob.values>0).compute()[0]]]
seems to work.
Yes, dask.dataframe's .loc
accessor only works if it gets concrete indexing values; otherwise it doesn't know which partitions to ask for the data. Computing the lazy dask result into a concrete pandas result is one sensible solution to the problem, provided the indices fit in memory.
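For illustration, here is a minimal sketch of that approach, assuming the same "inputfile.csv" and the same quantile-based mask as in the question: the lazy boolean result is materialised with .compute() before any indexing, so .loc only ever sees concrete column labels.

import dask.dataframe as dd

# Build the mask lazily, mirroring the original pandas code
inputdata = dd.read_csv("inputfile.csv")
pseudoa = inputdata.quantile([.035, .965])
pseudob = pseudoa.diff().loc[.965]

# Turn the lazy boolean result into a concrete pandas Series,
# then derive the list of column labels to keep
mask = (pseudob > 0).compute()
keep = mask[mask].index.tolist()

# .loc now receives plain labels and can slice the columns
inputdata = inputdata.loc[:, keep]
print(inputdata.describe().compute())

Whether you select the columns via .loc[:, keep] or plain bracket indexing is a matter of taste; the essential step is that the mask is computed to pandas before it is used.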