Python 2.7 - pandas shape issues when applying a function returning multiple new columns

June 15, 2011

I need to return multiple calculated columns for each row of a pandas DataFrame.

This error: ValueError: Shape of passed values is (4, 2), indices imply (4, 3) is raised when the apply function is executed in the following code snippet:

    import pandas as pd

    my_df = pd.DataFrame({
        'datetime_stuff': ['2012-01-20', '2012-02-16', '2012-06-19', '2012-12-15'],
        'url': ['http://www.something', 'http://www.somethingelse', 'http://www.foo', 'http://www.bar'],
        'categories': [['foo', 'bar'], ['x', 'y', 'z'], ['xxx'], ['a123', 'a456']],
    })

    my_df['datetime_stuff'] = pd.to_datetime(my_df['datetime_stuff'])
    my_df.sort_values(['datetime_stuff'], inplace=True)

    print(my_df.head())

    def calculate_stuff(row):
        if row['url'].startswith('http'):
            categories = row['categories'] if type(row['categories']) == list else []
            calculated_column_x = row['url'] + '_other_stuff_'
        else:
            calculated_column_x = None
        another_column = 'deduction_from_fields'
        # Returns a plain tuple; this is what triggers the shape error
        return calculated_column_x, another_column

    print(my_df.shape)

    my_df['calculated_column_x'], my_df['another_column'] = zip(*my_df.apply(calculate_stuff, axis=1))

Each row of the DataFrame I am working on is more complicated than the example above, and the function calculate_stuff that I am applying uses many different columns of each row and returns multiple new columns.

However, the previous example still raises a ValueError related to the shape of the DataFrame, and I cannot work out how to fix it.

How can I create multiple new columns (for each row) that are calculated from existing columns?

When you return a list or tuple from the function being applied, pandas attempts to shoehorn it back into the DataFrame you ran apply over. Instead, return a Series.

Reconfigured code:
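To make the point above concrete, here is a minimal sketch on a toy frame. The frame `df` and the function `total_and_diff` are hypothetical names made up for illustration, not part of the question. It shows that when the applied function returns a labelled Series, `apply` with `axis=1` assembles a DataFrame with one column per label.

```python
import pandas as pd

# Hypothetical toy frame, not the data from the question.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def total_and_diff(row):
    # Returning a labelled Series gives apply one output column per label.
    return pd.Series(
        [row["a"] + row["b"], row["b"] - row["a"]],
        index=["total", "diff"],
    )

result = df.apply(total_and_diff, axis=1)

print(type(result).__name__)  # DataFrame
print(list(result.columns))   # ['total', 'diff']
print(result.shape)           # (3, 2)
```

Because the result shares the original index, it can be joined back onto the source frame row by row.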

    import pandas as pd

    my_df = pd.DataFrame({
        'datetime_stuff': ['2012-01-20', '2012-02-16', '2012-06-19', '2012-12-15'],
        'url': ['http://www.something', 'http://www.somethingelse', 'http://www.foo', 'http://www.bar'],
        'categories': [['foo', 'bar'], ['x', 'y', 'z'], ['xxx'], ['a123', 'a456']],
    })

    my_df['datetime_stuff'] = pd.to_datetime(my_df['datetime_stuff'])
    my_df.sort_values(['datetime_stuff'], inplace=True)

    def calculate_stuff(row):
        if row['url'].startswith('http'):
            categories = row['categories'] if type(row['categories']) == list else []
            calculated_column_x = row['url'] + '_other_stuff_'
        else:
            calculated_column_x = None
        another_column = 'deduction_from_fields'

        # changed vvvv
        return pd.Series((calculated_column_x, another_column), ['calculated_column_x', 'another_column'])

    my_df.join(my_df.apply(calculate_stuff, axis=1))

Output:

         categories datetime_stuff                       url                    calculated_column_x         another_column
    0    [foo, bar]     2012-01-20      http://www.something      http://www.something_other_stuff_  deduction_from_fields
    1     [x, y, z]     2012-02-16  http://www.somethingelse  http://www.somethingelse_other_stuff_  deduction_from_fields
    2         [xxx]     2012-06-19            http://www.foo            http://www.foo_other_stuff_  deduction_from_fields
    3  [a123, a456]     2012-12-15            http://www.bar            http://www.bar_other_stuff_  deduction_from_fields
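A possible alternative, not part of the original answer: assuming pandas 0.23 or later, `DataFrame.apply` accepts `result_type="expand"`, which turns list-like return values into columns. This lets the function keep returning a plain tuple. The frame `urls` and the function `calculate_pair` below are hypothetical stand-ins for the question's data and `calculate_stuff`.

```python
import pandas as pd

# Hypothetical two-row frame; the second URL exercises the else branch.
urls = pd.DataFrame({"url": ["http://www.foo", "ftp://www.bar"]})

def calculate_pair(row):
    # Stand-in for calculate_stuff that still returns a plain tuple.
    if row["url"].startswith("http"):
        calculated_column_x = row["url"] + "_other_stuff_"
    else:
        calculated_column_x = None
    return calculated_column_x, "deduction_from_fields"

# result_type="expand" asks apply to spread each tuple across columns 0 and 1.
expanded = urls.apply(calculate_pair, axis=1, result_type="expand")
expanded.columns = ["calculated_column_x", "another_column"]

out = urls.join(expanded)
print(out)
```

Returning a Series with named labels, as in the answer, keeps the column names next to the calculation. The `expand` route needs the names assigned afterwards but avoids building a Series object for every row.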
