python - subselection of columns in dask (from pandas) by computed boolean indexer -


I'm new to Dask (imported as dd) and trying to convert pandas (imported as pd) code.

The goal of the following lines is to slice the data to those columns whose values fulfill a calculated requirement, in Dask.

There is a given table in a CSV file. The former code reads

inputdata = pd.read_csv("inputfile.csv")
pseudoa = inputdata.quantile([.035, .965])
pseudob = pseudoa.diff().loc[.965]
inputdata = inputdata.loc[:, inputdata.columns[pseudob.values > 0]]
inputdata.describe()

and it works fine. The simple idea for the conversion was to substitute the first line with

inputdata = dd.read_csv("inputfile.csv")

but this resulted in a strange error message: IndexError: too many indices for array. Even when inputdata and pseudob are replaced with already computed data, the error remains.
Maybe the question comes down to the idea of calculated boolean slicing of Dask columns.

I found a (maybe suboptimal) workaround (not a solution) for that. Changing line 4 to the following

inputdata = inputdata.loc[:, inputdata.columns[(pseudob.values > 0).compute()[0]]]

seems to work.

Yes, dask.dataframe's .loc accessor only works if it is given concrete indexing values; otherwise it doesn't know which partitions to ask for the data. Computing the lazy Dask result down to a concrete pandas result is one sensible solution to the problem, as long as the indices fit in memory.

