python - Pandas efficiency with small data
I'm curious: is there a lower limit below which we shouldn't use pandas?
Using pandas for large data is good, considering efficiency and readability.
But is there a lower limit below which we should use traditional looping (Python 3) instead of pandas?
When should I consider using pandas or NumPy?
As far as I know, pandas uses NumPy (vectorized operations) under the hood quite extensively. NumPy is faster than pure Python because it is implemented at a low level and, in many cases, uses memory more efficiently than Python objects. It depends on what you are doing, of course. NumPy-based operations in pandas should perform about the same as NumPy itself.
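To illustrate the point that pandas sits on top of NumPy, here is a minimal sketch. The sample values are made up, and the snippet only shows that a numeric Series can be exposed as a NumPy array and that both libraries agree on the mean; it says nothing about speed.

```python
import numpy as np
import pandas as pd

values = np.array([0.5, 1.5, 2.5, 3.5])  # made-up sample data
series = pd.Series(values)

# A numeric Series can be handed back as a plain NumPy array.
backing = series.to_numpy()
print(type(backing).__name__, backing.dtype)

# The same reduction, once through NumPy and once through pandas.
numpy_mean = values.mean()
pandas_mean = series.mean()
print(numpy_mean, pandas_mean)
```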
- For general vector operations (e.g. a column-wise apply), it is faster to use NumPy / pandas.
- "for" loops in Python, e.g. over the rows of a pandas DataFrame, are slow.
- If you need non-vectorized, key-based lookups, skip pandas and use dictionaries instead.
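The three points above can be sketched roughly as follows. The DataFrame, its column names (`key`, `x`) and its values are hypothetical, chosen only for illustration; the snippet shows the three styles side by side and does not measure them.

```python
import pandas as pd

# Hypothetical example data: column names and values are made up.
df = pd.DataFrame({"key": ["a", "b", "c", "a"], "x": [1.0, 2.0, 3.0, 4.0]})

# Vectorized column-wise operation: a single call, the loop runs in compiled code.
doubled_vectorized = (df["x"] * 2).tolist()

# Equivalent explicit row loop: same result, but every row is handled in Python.
doubled_loop = [row.x * 2 for row in df.itertuples(index=False)]

# Key-based lookup through a plain dict instead of going through pandas each time.
# Note: when a key occurs more than once, the last value wins.
lookup = dict(zip(df["key"], df["x"]))
value_for_b = lookup["b"]
```

Both `doubled_vectorized` and `doubled_loop` hold the same values; the difference only shows up in run time as the frame grows.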
Use pandas when you need time series or data frame structures. Use NumPy if you can organise your data in matrices / vectors (arithmetic).
Edit: for very small Python objects, native Python might be faster, because the low-level libraries introduce a small overhead!
NumPy example:

    In [21]: a = np.random.rand(10)

    In [22]: a
    Out[22]:
    array([ 0.60555782,  0.14585568,  0.94783553,  0.59123449,  0.07151141,
            0.6480999 ,  0.28743679,  0.19951774,  0.08312469,  0.16396394])

    In [23]: %timeit a.mean()
    5.16 µs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For loop example:

    In [24]: b = a.tolist()

    In [25]: b
    Out[25]:
    [0.6055578242263301,
     0.14585568245745317,
     0.9478355284829876,
     0.5912344944487721,
     0.07151141037216913,
     0.6480999041895205,
     0.2874367896457555,
     0.19951773879879775,
     0.0831246913880146,
     0.16396394311100215]

    In [26]: def mean(x):
        ...:     s = 0
        ...:     for i in x:
        ...:         s += i
        ...:     return s / len(x)
        ...:

    In [27]: mean(b)
    Out[27]: 0.37441380071208025

    In [28]: a.mean()
    Out[28]: 0.37441380071208025

    In [29]: %timeit mean(b)
    608 ns ± 2.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Oops, the Python loop is faster here. It seems that NumPy adds a small overhead (maybe from interfacing with C) at each timeit iteration. Let's try longer arrays.
    In [34]: a = np.random.rand(int(1e6))

    In [35]: %timeit a.mean()
    599 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    In [36]: b = a.tolist()

    In [37]: %timeit mean(b)
    31.8 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
OK, my conclusion is that there is a minimum object size above which using low-level libraries like NumPy and pandas pays off. If anyone likes, please feel free to repeat the experiment with pandas.
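For anyone who wants to repeat the experiment outside IPython, here is one possible sketch using the standard `timeit` module. The helper names (`py_mean`, `time_means`), the array sizes and the repeat counts are my own choices, not part of the measurements above, and the break-even size will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd


def py_mean(values):
    """Plain-Python mean, the same kind of loop as in the example above."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def time_means(n, repeats=5, number=100):
    """Return the best per-call time in seconds for each approach at size n."""
    arr = np.random.rand(n)
    lst = arr.tolist()
    ser = pd.Series(arr)
    candidates = {
        "python loop": lambda: py_mean(lst),
        "numpy": arr.mean,
        "pandas": ser.mean,
    }
    return {
        name: min(timeit.repeat(fn, repeat=repeats, number=number)) / number
        for name, fn in candidates.items()
    }


if __name__ == "__main__":
    # Arbitrary sizes; extend the tuple to look for the crossover point.
    for n in (10, 100, 1_000, 10_000):
        timings = time_means(n)
        print(n, {name: f"{t * 1e6:.2f} µs" for name, t in timings.items()})
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; it reports the best case rather than the mean that `%timeit` prints.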