Python 2.7 - pandas shape issues when applying a function returning multiple new columns


I need to return multiple calculated columns for each row of a pandas DataFrame.

This error is raised when the apply call in the following code snippet is executed: ValueError: Shape of passed values is (4, 2), indices imply (4, 3)

    import pandas as pd

    my_df = pd.DataFrame({
        'datetime_stuff': ['2012-01-20', '2012-02-16', '2012-06-19', '2012-12-15'],
        'url': ['http://www.something', 'http://www.somethingelse',
                'http://www.foo', 'http://www.bar'],
        'categories': [['foo', 'bar'], ['x', 'y', 'z'], ['xxx'], ['a123', 'a456']],
    })

    my_df['datetime_stuff'] = pd.to_datetime(my_df['datetime_stuff'])
    my_df.sort_values(['datetime_stuff'], inplace=True)

    print(my_df.head())

    def calculate_stuff(row):
        if row['url'].startswith('http'):
            categories = row['categories'] if type(row['categories']) == list else []
            calculated_column_x = row['url'] + '_other_stuff_'
        else:
            calculated_column_x = None
        another_column = 'deduction_from_fields'
        # Returns a plain tuple for each row
        return calculated_column_x, another_column

    print(my_df.shape)

    # This is the line that raises the ValueError
    my_df['calculated_column_x'], my_df['another_column'] = zip(*my_df.apply(calculate_stuff, axis=1))

The DataFrame I am working on is more complicated than the example above, and the function calculate_stuff that I am applying uses many different columns of each row and returns multiple new columns.

However, the example above still raises the ValueError related to the shape of the DataFrame, and I am not able to understand how to fix it.

How can I create multiple new columns (for each row) that are calculated from the existing columns?

When you return a list or tuple from the function being applied, pandas attempts to shoehorn it back into the DataFrame you ran apply over. Instead, return a Series.


Reconfigured code

    import pandas as pd

    my_df = pd.DataFrame({
        'datetime_stuff': ['2012-01-20', '2012-02-16', '2012-06-19', '2012-12-15'],
        'url': ['http://www.something', 'http://www.somethingelse',
                'http://www.foo', 'http://www.bar'],
        'categories': [['foo', 'bar'], ['x', 'y', 'z'], ['xxx'], ['a123', 'a456']],
    })

    my_df['datetime_stuff'] = pd.to_datetime(my_df['datetime_stuff'])
    my_df.sort_values(['datetime_stuff'], inplace=True)

    def calculate_stuff(row):
        if row['url'].startswith('http'):
            categories = row['categories'] if type(row['categories']) == list else []
            calculated_column_x = row['url'] + '_other_stuff_'
        else:
            calculated_column_x = None
        another_column = 'deduction_from_fields'

        # changed vvvv
        # Return a Series whose index labels become the new column names
        return pd.Series((calculated_column_x, another_column),
                         ['calculated_column_x', 'another_column'])

    my_df.join(my_df.apply(calculate_stuff, axis=1))

         categories datetime_stuff                       url                    calculated_column_x         another_column
    0    [foo, bar]     2012-01-20      http://www.something      http://www.something_other_stuff_  deduction_from_fields
    1     [x, y, z]     2012-02-16  http://www.somethingelse  http://www.somethingelse_other_stuff_  deduction_from_fields
    2         [xxx]     2012-06-19            http://www.foo            http://www.foo_other_stuff_  deduction_from_fields
    3  [a123, a456]     2012-12-15            http://www.bar            http://www.bar_other_stuff_  deduction_from_fields
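Note that join returns a new DataFrame and leaves my_df unchanged, so the result has to be bound to a name if the new columns are needed later. A minimal sketch of that step, assuming a cut-down two-row frame (the name df, the 'ftp://www.bar' URL, and the shortened function body are made up for illustration and are not part of the original data):

```python
import pandas as pd

# Hypothetical two-row frame; the second URL is chosen to exercise the else branch.
df = pd.DataFrame({
    'url': ['http://www.foo', 'ftp://www.bar'],
    'categories': [['xxx'], ['a123', 'a456']],
})


def calculate_stuff(row):
    # Simplified stand-in for the real per-row computation.
    if row['url'].startswith('http'):
        calculated_column_x = row['url'] + '_other_stuff_'
    else:
        calculated_column_x = None
    another_column = 'deduction_from_fields'
    # A labelled Series per row lets apply build one column per label.
    return pd.Series(
        [calculated_column_x, another_column],
        index=['calculated_column_x', 'another_column'],
    )


# apply returns a DataFrame holding only the two new columns.
new_columns = df.apply(calculate_stuff, axis=1)

# join does not modify df in place, so rebind the name to keep the result.
df = df.join(new_columns)

print(df.columns.tolist())
```

Because the join aligns on the index, the new columns line up with the original rows even after the frame has been sorted.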
