Python 2.7 - pandas shape issues when applying a function returning multiple new columns
I need to return multiple calculated columns for each row of a pandas DataFrame.
This error: ValueError: Shape of passed values is (4, 2), indices imply (4, 3)
is raised when the apply function is executed in the following code snippet:
    import pandas as pd

    my_df = pd.DataFrame({
        'datetime_stuff': ['2012-01-20', '2012-02-16', '2012-06-19', '2012-12-15'],
        'url': ['http://www.something', 'http://www.somethingelse', 'http://www.foo', 'http://www.bar'],
        'categories': [['foo', 'bar'], ['x', 'y', 'z'], ['xxx'], ['a123', 'a456']],
    })
    my_df['datetime_stuff'] = pd.to_datetime(my_df['datetime_stuff'])
    my_df.sort_values(['datetime_stuff'], inplace=True)
    print(my_df.head())

    def calculate_stuff(row):
        if row['url'].startswith('http'):
            categories = row['categories'] if type(row['categories']) == list else []
            calculated_column_x = row['url'] + '_other_stuff_'
        else:
            calculated_column_x = None
        another_column = 'deduction_from_fields'
        # Returns a plain tuple of two values per row
        return calculated_column_x, another_column

    print(my_df.shape)
    # This is the line that raises the ValueError
    my_df['calculated_column_x'], my_df['another_column'] = zip(*my_df.apply(calculate_stuff, axis=1))
The function is applied to each row of the DataFrame. I am working on a more complicated example than the one above, and the function calculate_stuff
that I am applying uses many different columns of each row and returns multiple new columns.
However, the previous example still raises the ValueError
related to the shape
of the DataFrame, and I am not able to understand how to fix it.
How can I create multiple new columns (for each row) that are calculated starting from the existing columns?
When you return a list or tuple from the function being applied, pandas
attempts to shoehorn it back into the DataFrame you ran apply over. Instead, return a Series.
Reconfigured code:
    import pandas as pd

    my_df = pd.DataFrame({
        'datetime_stuff': ['2012-01-20', '2012-02-16', '2012-06-19', '2012-12-15'],
        'url': ['http://www.something', 'http://www.somethingelse', 'http://www.foo', 'http://www.bar'],
        'categories': [['foo', 'bar'], ['x', 'y', 'z'], ['xxx'], ['a123', 'a456']],
    })
    my_df['datetime_stuff'] = pd.to_datetime(my_df['datetime_stuff'])
    my_df.sort_values(['datetime_stuff'], inplace=True)

    def calculate_stuff(row):
        if row['url'].startswith('http'):
            categories = row['categories'] if type(row['categories']) == list else []
            calculated_column_x = row['url'] + '_other_stuff_'
        else:
            calculated_column_x = None
        another_column = 'deduction_from_fields'
        # changed vvvv: return a Series whose index labels become the new column names
        return pd.Series((calculated_column_x, another_column),
                         ['calculated_column_x', 'another_column'])

    my_df.join(my_df.apply(calculate_stuff, axis=1))

Output:

         categories  datetime_stuff                       url                    calculated_column_x         another_column
    0    [foo, bar]      2012-01-20      http://www.something      http://www.something_other_stuff_  deduction_from_fields
    1     [x, y, z]      2012-02-16  http://www.somethingelse  http://www.somethingelse_other_stuff_  deduction_from_fields
    2         [xxx]      2012-06-19            http://www.foo            http://www.foo_other_stuff_  deduction_from_fields
    3  [a123, a456]      2012-12-15            http://www.bar            http://www.bar_other_stuff_  deduction_from_fields
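One detail worth noting: join returns a new DataFrame and leaves my_df untouched, so the result has to be assigned if the new columns should persist. The following is a minimal sketch of that step, not the original poster's code; the toy frame df and the function add_columns are hypothetical names chosen for illustration, and the second URL is made up to exercise the else branch.

```python
import pandas as pd

# Hypothetical toy frame, smaller than the one in the question.
df = pd.DataFrame({'url': ['http://www.foo', 'ftp://www.bar']})

def add_columns(row):
    # Returning a Series lets apply build one output column per index label.
    if row['url'].startswith('http'):
        calculated_column_x = row['url'] + '_other_stuff_'
    else:
        calculated_column_x = None
    return pd.Series([calculated_column_x, 'deduction_from_fields'],
                     index=['calculated_column_x', 'another_column'])

# apply with axis=1 now yields a DataFrame with one row per input row
# and the two named columns.
new_columns = df.apply(add_columns, axis=1)

# join does not modify df in place, so reassign to keep the new columns.
df = df.join(new_columns)
print(df.columns.tolist())
```

Because the join is done on the index, the new columns line up with the original rows even after the frame has been sorted.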