python 3.x - Create new column and fill with added strings from same df with underscore as delimiter
I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'place': [1, 2, 3], 'date': ['6/7/2021', '1/1/2031', '1/4/2011'], 'city': ['moscow', 'new york', 'sidney'], 'kids': [5, 3, 1]})
That looks like this:
    city      date  kids  place
  moscow  6/7/2021     5      1
new york  1/1/2031     3      2
  sidney  1/4/2011     1      3
I need to add a new column 'key' to the DataFrame, whose values are the string union of the values in each row of the columns 'city', 'date' and 'place', with an underscore ('_') as the delimiter.
The eventual goal:
    city      date  kids  place                  key
  moscow  6/7/2021     5      1    moscow_6/7/2021_1
new york  1/1/2031     3      2  new york_1/1/2031_2
  sidney  1/4/2011     1      3    sidney_1/4/2011_3
First of all, I've converted the integers in the 'place' column to strings:
df['place'].apply(str)
Secondly, I've removed the columns I don't need (but that column should remain in the resulting df):
col_list = list(df)
col_list.remove('kids')
Then I've tried to create the new column and fill it with the appended strings:
df['key'] = df[col_list].apply(lambda x: x.sum())
But this returns NaN values in the 'key' column, and the values are still appended without the underscore delimiter.
You can add the columns together, similar to building a str; you just need to cast the last column to str:
In [87]: df['key'] = df['city'] + '_' + df['date'] + '_' + df['place'].astype(str)
df
Out[87]:
       city      date  kids  place                  key
0    moscow  6/7/2021     5      1    moscow_6/7/2021_1
1  new york  1/1/2031     3      2  new york_1/1/2031_2
2    sidney  1/4/2011     1      3    sidney_1/4/2011_3
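For anyone who wants to paste and run this, here is a minimal self-contained sketch of the same idea, assuming the sample data from the question. It also shows one likely contributor to the original problem: apply(str) returns a new Series and leaves 'place' unchanged unless the result is assigned back.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'place': [1, 2, 3],
    'date': ['6/7/2021', '1/1/2031', '1/4/2011'],
    'city': ['moscow', 'new york', 'sidney'],
    'kids': [5, 3, 1],
})

# apply(str) returns a new Series; without assignment the column stays integer.
df['place'].apply(str)
place_is_still_int = pd.api.types.is_integer_dtype(df['place'])

# Vectorised string addition; only the integer column needs an explicit cast.
df['key'] = df['city'] + '_' + df['date'] + '_' + df['place'].astype(str)

print(df)
```

The 'kids' column is never touched, so it stays in the resulting df without any need to build a separate column list.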
Timings on a 30k-row df:
%timeit df['city'] + '_' + df['date'] + '_' + df['place'].astype(str)
%timeit df[['city', 'date', 'place']].astype(str).apply('_'.join, 1)
%timeit df[['city', 'date', 'place']].astype(str).transform('_'.join, 1)
100 loops, best of 3: 9 ms per loop
10 loops, best of 3: 84 ms per loop
10 loops, best of 3: 83 ms per loop
This method seems to be roughly 10x faster, as it's straight vectorised addition. apply is a cythonised for loop, and it looks like transform needs some form of grouping, sorting, and creating a new object that is index-aligned to the original df, which is where I guess the time is being taken.
The other method, from @JohnGalt:
%timeit pd.Series(df[['city', 'date', 'place']].astype(str).values.tolist()).str.join('_')
100 loops, best of 3: 10.5 ms per loop
So it is marginally slower than my answer.
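To reproduce this kind of comparison outside IPython, a rough sketch along the following lines should work. The 30k-row frame here is an assumption: it is built by repeating the three sample rows, and the function names are made up for illustration. Absolute numbers will differ by machine and pandas version, so the only thing checked is that the three approaches return the same strings.

```python
import timeit
import pandas as pd

base = pd.DataFrame({
    'place': [1, 2, 3],
    'date': ['6/7/2021', '1/1/2031', '1/4/2011'],
    'city': ['moscow', 'new york', 'sidney'],
    'kids': [5, 3, 1],
})
# Assumed test data: the sample rows repeated to give 30,000 rows.
big = pd.concat([base] * 10_000, ignore_index=True)

cols = ['city', 'date', 'place']

def vectorised(df):
    # Column-wise string addition.
    return df['city'] + '_' + df['date'] + '_' + df['place'].astype(str)

def row_apply(df):
    # Row-wise join: one Python-level call per row.
    return df[cols].astype(str).apply('_'.join, axis=1)

def list_join(df):
    # Build a Series of lists, then join each list via the .str accessor.
    return pd.Series(df[cols].astype(str).values.tolist()).str.join('_')

results = {f.__name__: f(big).tolist() for f in (vectorised, row_apply, list_join)}

for f in (vectorised, row_apply, list_join):
    seconds = timeit.timeit(lambda: f(big), number=3) / 3
    print(f"{f.__name__}: {seconds * 1000:.1f} ms per loop")
```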