python - Accessing dtype parsing behaviour of read_csv when creating a DataFrame from a nested list -


This follows the discussion with piRSquared here. Basically, I'm curious: why is read_csv better (debatable?) at inferring datatypes, and more fault-tolerant, than creating a DataFrame from, say, a nested list?

There are a lot of cases where the inferred datatypes are perfectly acceptable for my work, but this functionality doesn't seem to be exposed in DataFrame(), which means I have to deal with dtypes manually and unnecessarily, and that can get tedious if you have hundreds of columns. The closest I can find is convert_objects(), but it doesn't handle the bools in my case.

Example:

import numpy as np
import pandas as pd
import csv

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['false', 23, 50],
        ['true', 19, 12],
        ['false', 4.8, '']]

with open('my_csv.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(data)

# Reading the csv file
df = pd.read_csv('my_csv.csv')
print df.string_boolean.dtype  # automatically converted to bool
print df.numeric.dtype  # float, as expected
print df.numeric_missing.dtype  # float, doesn't care about the empty string

# Creating the DataFrame directly from the list, without supplying datatypes
df2 = pd.DataFrame(data[1:], columns=data[0])
df2.string_boolean = df2.string_boolean.astype(bool)  # doesn't work
df2.numeric_missing = df2.numeric_missing.astype(np.float64)  # doesn't work

# Creating while forcing the dtype doesn't work either
df3 = pd.DataFrame(data[1:], columns=data[0],
                   dtype=[bool, np.float64, np.float64])

# The working method
df4 = pd.DataFrame(data[1:], columns=data[0])
df4.string_boolean = df4.string_boolean.map({'true': True, 'false': False})
df4.numeric_missing = pd.to_numeric(df4.numeric_missing)

One solution is to use a StringIO object. The only real difference is that it keeps the data in memory, instead of writing it to disk and reading it back in.

The code is as follows (note: Python 3!):

import numpy as np
import pandas as pd
import csv
from io import StringIO

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['false', 23, 50],
        ['true', 19, 12],
        ['false', 4.8, '']]

with StringIO() as fobj:
    writer = csv.writer(fobj)
    writer.writerows(data)
    fobj.seek(0)
    df = pd.read_csv(fobj)

print(df.head(3))
print(df.string_boolean.dtype)  # automatically converted to bool
print(df.numeric.dtype)  # float, as expected
print(df.numeric_missing.dtype)  # float, doesn't care about the empty string

The with StringIO() as fobj part isn't really necessary: fobj = StringIO() would work just as well. But since the context manager closes the StringIO() object outside its scope, the df = pd.read_csv(fobj) has to be inside it.
Note also the fobj.seek(0), which is a necessity, since your solution closes and reopens the file, which automatically sets the file pointer to the start of the file.
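To illustrate the point about the context manager, here is a minimal sketch of the same round trip without the with statement; the only assumption beyond the answer's code is that we now close the object ourselves:

```python
import csv
from io import StringIO

import pandas as pd

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['false', 23, 50],
        ['true', 19, 12],
        ['false', 4.8, '']]

# No context manager: create the StringIO, write, rewind, read.
fobj = StringIO()
writer = csv.writer(fobj)
writer.writerows(data)
fobj.seek(0)  # rewind: read_csv starts from the current file position
df = pd.read_csv(fobj)
fobj.close()  # now closing is our responsibility

print(df.string_boolean.dtype)   # bool
print(df.numeric_missing.dtype)  # float64
```

The seek(0) is still essential: without it, read_csv would start reading at the end of what was just written and see no data.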

A note on Python 2 vs Python 3

I tried to make the above code Python 2/3 compatible. That became a mess, because of the following: Python 2 has an io module, just like Python 3, and its StringIO class makes everything unicode (also in Python 2; in Python 3 unicode is, of course, the default).
That would be great, except that the csv writer module in Python 2 is not unicode compatible.
Thus, the alternative is to use the (older) Python 2 (c)StringIO module, for example as follows:

try:
    from cStringIO import StringIO
except ModuleNotFoundError:  # Python 3
    from io import StringIO

and things will be plain text in Python 2, and unicode in Python 3.
Except that now, cStringIO.StringIO does not have a context manager, so the with statement will fail up front. As mentioned, the context manager is not really necessary, but I was trying to keep things as close as possible to the original code.
In other words, I could not find a nice way to stay close to the original code without ridiculous hacks.

I've also looked at avoiding the csv writer completely, which leads to:

text = '\n'.join(','.join(str(item).strip("'") for item in items)
                 for items in data)

with StringIO(text) as fobj:
    df = pd.read_csv(fobj)

which is perhaps neater (though a bit less clear), and Python 2/3 compatible. (I don't expect it to work for everything the csv module can handle, but here it works fine.)
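One concrete case where the plain join falls short, as a small sketch with made-up data: a field that itself contains a comma. The csv writer quotes such a field; the naive join does not, so the field splits in two on the way back in.

```python
import csv
from io import StringIO

rows = [['text', 'number'], ['hello, world', 1]]

# Naive join: the embedded comma produces three fields instead of two.
naive = '\n'.join(','.join(str(item) for item in items) for items in rows)
print(naive.splitlines()[1])  # hello, world,1

# csv.writer quotes the field, so it survives the round trip intact.
fobj = StringIO()
csv.writer(fobj).writerows(rows)
print(fobj.getvalue().splitlines()[1])  # "hello, world",1
```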


Why can't pd.DataFrame(...) do the conversion?

Here, I can only speculate.

I think the reasoning is that when the input consists of Python objects (dicts, lists), the input is known, and in the hands of the programmer. Therefore, it is unlikely, perhaps even illogical, for that input to contain strings such as 'false' or ''. Instead, it would contain the objects False and np.nan (or math.nan), since the programmer would already have taken care of the (string) translation.
Whereas for a file (csv or otherwise), the input can be anything: a colleague might send an Excel csv file, or someone else sends you a Gnumeric csv file. I don't know how standardised csv files really are, but you'd probably need some code to allow for exceptions, and overall for the conversion of strings to Python (NumPy) format.

So in that sense, it is actually illogical to expect pd.DataFrame(...) to accept just anything: instead, it should accept something that is properly formatted.

You might argue for a convenience method that takes a list like yours, but a list is not a csv file (which is just a bunch of characters, including newlines). Plus, I expect pd.read_csv has the option to read files in chunks (perhaps even line by line), which becomes harder if you'd feed it a string with newlines instead (you can't really read that line by line, as you'd have to split it on newlines and keep all the lines in memory. And you'd have the full string in memory somewhere anyway, instead of on disk. But I digress).
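On that chunked-reading point: read_csv does have a chunksize parameter, which makes it return an iterator of DataFrames instead of one DataFrame, and it works on a StringIO just as well as on a file. A minimal sketch with made-up data:

```python
from io import StringIO

import pandas as pd

text = 'a,b\n1,2\n3,4\n5,6\n7,8\n'

# chunksize=2 yields DataFrames of (at most) two rows each
with StringIO(text) as fobj:
    for chunk in pd.read_csv(fobj, chunksize=2):
        print(len(chunk))  # 2, then 2
```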

Besides, the StringIO trick above is just a few lines that precisely perform this trick.

