python - Pandas Efficiency with small data


I'm curious! Is there a lower limit below which you shouldn't use pandas?

Using pandas for large data is good, considering both efficiency and readability.

But is there a lower limit below which it's better to use traditional looping (Python 3) instead of pandas?

When should I consider using pandas or numpy?

As far as I know, pandas uses numpy (vector operations) under the hood quite extensively. Numpy is faster than Python because it works at a low level and has more memory-friendly behaviour than Python (in many cases). It depends on what you are doing, of course. Numpy-based operations in pandas should therefore have the same performance as numpy itself.

  • For general vector operations (e.g. column-wise apply), it is faster to use numpy / pandas.
  • "for" loops in Python, e.g. over pandas DataFrame rows, are slow.
  • If you need to apply non-vectorized, key-based lookups, it's better to go with dictionaries than pandas (see the sketch after this list).
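To make these bullet points concrete, here is a minimal micro-benchmark sketch; the sizes, repetition counts, and names like `row_loop` are my own illustrative assumptions, not from the original answer:

    import timeit
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": np.random.rand(100_000)})

    # 1. Vectorized column-wise operation: one call into numpy's C loop.
    t_vec = timeit.timeit(lambda: df["x"] * 2, number=100)

    # 2. Python "for" loop over DataFrame rows via iterrows(): slow due to
    #    per-row object creation overhead.
    def row_loop():
        out = []
        for _, row in df.iterrows():
            out.append(row["x"] * 2)
        return out

    t_loop = timeit.timeit(row_loop, number=1)

    # 3. Key-based lookups: plain dict vs. label-indexed pandas Series.
    keys = [f"k{i}" for i in range(10_000)]
    d = dict(zip(keys, range(10_000)))
    s = pd.Series(range(10_000), index=keys)
    t_dict = timeit.timeit(lambda: [d[k] for k in keys], number=100)
    t_series = timeit.timeit(lambda: [s[k] for k in keys], number=100)

    print(f"vectorized x100: {t_vec:.3f}s | row loop x1: {t_loop:.3f}s")
    print(f"dict lookups x100: {t_dict:.3f}s | Series lookups x100: {t_series:.3f}s")

You should expect the vectorized operation and the dict lookups to win by a wide margin, which is exactly the pattern the list above describes.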

Use pandas when you need time series or data frame structures. Use numpy if you can organise your data in matrices / vectors (arithmetic).
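As a quick illustration of that split, here is a hedged sketch (the frequencies and array shapes are arbitrary choices of mine): pandas for a label-aware time-series resample, numpy for plain matrix arithmetic.

    import numpy as np
    import pandas as pd

    # pandas: labeled, time-indexed data, e.g. resampling a minute-level series.
    idx = pd.date_range("2020-01-01", periods=1_000, freq="min")
    ts = pd.Series(np.random.rand(1_000), index=idx)
    hourly = ts.resample("h").mean()  # label-aware aggregation in one line

    # numpy: plain matrix / vector arithmetic with no labels attached.
    A = np.random.rand(100, 100)
    x = np.random.rand(100)
    y = A @ x  # matrix-vector product
    print(hourly.head())
    print(y[:3])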

Edit: for small Python objects, native Python might be faster, because the low-level libraries introduce a small overhead!

numpy example:

    In [20]: import numpy as np   # added so the snippet is self-contained

    In [21]: a = np.random.rand(10)

    In [22]: a
    Out[22]:
    array([ 0.60555782,  0.14585568,  0.94783553,  0.59123449,  0.07151141,
            0.6480999 ,  0.28743679,  0.19951774,  0.08312469,  0.16396394])

    In [23]: %timeit a.mean()
    5.16 µs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

for loop example:

    In [24]: b = a.tolist()

    In [25]: b
    Out[25]:
    [0.6055578242263301,
     0.14585568245745317,
     0.9478355284829876,
     0.5912344944487721,
     0.07151141037216913,
     0.6480999041895205,
     0.2874367896457555,
     0.19951773879879775,
     0.0831246913880146,
     0.16396394311100215]

    In [26]: def mean(x):
        ...:     s = 0
        ...:     for i in x:
        ...:         s += i
        ...:     return s / len(x)
        ...:

    In [27]: mean(b)
    Out[27]: 0.37441380071208025

    In [28]: a.mean()
    Out[28]: 0.37441380071208025

    In [29]: %timeit mean(b)
    608 ns ± 2.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Oops, the Python loop is faster here. It seems numpy creates a small overhead (maybe from interfacing with C) at each timeit iteration. Let's try longer arrays.

    In [34]: a = np.random.rand(int(1e6))

    In [35]: %timeit a.mean()
    599 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    In [36]: b = a.tolist()

    In [37]: %timeit mean(b)
    31.8 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

OK, the conclusion: there is a minimum object size above which the use of low-level libs like numpy and pandas pays off. If anyone likes, please feel free to repeat the experiment with pandas; a rough sketch is below.
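For anyone who wants a starting point, here is a rough sketch of that follow-up experiment (function and variable names such as `py_mean` are my own; the sizes mirror the ones used above). It times `Series.mean`, `ndarray.mean`, and the pure-Python loop for a tiny and a large input:

    import timeit
    import numpy as np
    import pandas as pd

    def py_mean(x):
        # Same plain-Python loop as the mean() function above.
        s = 0
        for v in x:
            s += v
        return s / len(x)

    for n in (10, int(1e6)):
        a = np.random.rand(n)
        s_pd = pd.Series(a)
        b = a.tolist()
        reps = 10_000 if n == 10 else 10  # fewer reps for the big input
        for label, fn in (("pandas", s_pd.mean),
                          ("numpy", a.mean),
                          ("python", lambda: py_mean(b))):
            t = timeit.timeit(fn, number=reps)
            print(f"n={n:>7} | {label:>6}: {t / reps * 1e6:8.2f} µs per call")

Based on the experiments above, the expectation is that the pure-Python loop wins at n=10 and loses badly at n=1e6, with pandas adding a bit more per-call overhead than raw numpy.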

