python - Accessing dtype parsing behaviour of read_csv when creating a DataFrame from a nested list
This follows a discussion with piRSquared here. Basically, I'm curious why read_csv is better (debatable?) at inferring datatypes, and more fault-tolerant, than creating a DataFrame from, say, a nested list.

There are a lot of cases where the inferred datatypes are perfectly acceptable for my work, but this functionality doesn't seem to be exposed in DataFrame(), meaning I have to deal with the dtypes manually and unnecessarily, which can get tedious when you have hundreds of columns. The closest I can find is convert_objects(), but it doesn't handle the bools in my case.

Example:
```python
import numpy as np
import pandas as pd
import csv

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['false', 23, 50],
        ['true', 19, 12],
        ['false', 4.8, '']]

with open('my_csv.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(data)

# Reading the csv file
df = pd.read_csv('my_csv.csv')
print df.string_boolean.dtype   # automatically converted to bool
print df.numeric.dtype          # float, as expected
print df.numeric_missing.dtype  # float, doesn't care about the empty string

# Creating the DataFrame directly from the list, without supplying datatypes
df2 = pd.DataFrame(data[1:], columns=data[0])
df2.string_boolean = df2.string_boolean.astype(bool)          # doesn't work
df2.numeric_missing = df2.numeric_missing.astype(np.float64)  # doesn't work

# Creating it while forcing the dtype doesn't work either
df3 = pd.DataFrame(data[1:], columns=data[0], dtype=[bool, np.float64, np.float64])

# The working method
df4 = pd.DataFrame(data[1:], columns=data[0])
df4.string_boolean = df4.string_boolean.map({'true': True, 'false': False})
df4.numeric_missing = pd.to_numeric(df4.numeric_missing)
```
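For what it's worth, the "working method" above generalizes into a small helper that mimics read_csv's per-column inference. This is only a sketch, not pandas API: the name infer_like_read_csv and its true/false handling are my own, and it assumes numbers, empty strings, and lowercase 'true'/'false' are the only cases to cover:

```python
import numpy as np
import pandas as pd

def infer_like_read_csv(df):
    """Rough per-column dtype inference (a sketch, not pandas API):
    numbers via to_numeric, 'true'/'false' strings to bool,
    empty strings treated as missing."""
    out = df.copy()
    for col in out.columns:
        ser = out[col].replace('', np.nan)   # '' behaves like NaN, as in read_csv
        try:
            out[col] = pd.to_numeric(ser)    # handles ints, floats, and NaN
            continue
        except (ValueError, TypeError):
            pass
        low = ser.astype(str).str.lower()
        if low.isin(['true', 'false']).all():
            out[col] = low == 'true'
        else:
            out[col] = ser
    return out

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['false', 23, 50],
        ['true', 19, 12],
        ['false', 4.8, '']]
df2 = infer_like_read_csv(pd.DataFrame(data[1:], columns=data[0]))
print(df2.dtypes)  # bool, float64, float64
```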
One solution is to use a StringIO object. The only difference is that it keeps the data in memory, instead of writing it to disk and reading it back in.

The code is as follows (note: Python 3!):
```python
import numpy as np
import pandas as pd
import csv
from io import StringIO

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['false', 23, 50],
        ['true', 19, 12],
        ['false', 4.8, '']]

with StringIO() as fobj:
    writer = csv.writer(fobj)
    writer.writerows(data)
    fobj.seek(0)
    df = pd.read_csv(fobj)

print(df.head(3))
print(df.string_boolean.dtype)   # automatically converted to bool
print(df.numeric.dtype)          # float, as expected
print(df.numeric_missing.dtype)  # float, doesn't care about the empty string
```
The with StringIO() as fobj part isn't really necessary: fobj = StringIO() would work just as well. But since a context manager will close the StringIO() object outside its scope, the df = pd.read_csv(fobj) has to be inside it.
Note also the fobj.seek(0), which is another necessity: the disk-based solution simply closes and reopens the file, which automatically sets the file pointer back to the start of the file.
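To illustrate both points, here is the same flow without the context manager; a minimal sketch of the variant described above (Python 3):

```python
import csv
import pandas as pd
from io import StringIO

data = [['string_boolean', 'numeric'],
        ['false', 23],
        ['true', 19]]

fobj = StringIO()                # no `with` statement needed
csv.writer(fobj).writerows(data)
fobj.seek(0)                     # rewind manually: read_csv reads from the current position
df = pd.read_csv(fobj)
fobj.close()

print(df.string_boolean.dtype)   # bool, just as in the `with` version
```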
A note on Python 2 vs Python 3
I tried to make the above code Python 2/3 compatible. That became a mess, because of the following: Python 2 has an io module, like Python 3, whose StringIO class makes everything unicode (also in Python 2; in Python 3 that is, of course, the default). That would be great, except that the csv writer module in Python 2 is not unicode compatible.
Thus, the alternative is to use the (older) Python 2 (c)StringIO module, for example as follows:
```python
try:
    from cStringIO import StringIO
except ModuleNotFoundError:  # Python 3
    from io import StringIO
```
and things will be plain text in Python 2, and unicode in Python 3.

Except that now, cStringIO.StringIO does not have a context manager, and the with statement will fail. As mentioned, it's not really necessary, but I was trying to keep things as close as possible to the original code.

In other words, I could not find a nice way to stay close to the original code without ridiculous hacks.
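For completeness, if you accept dropping the with statement, one pattern does run on both versions; this is a sketch of that compromise (it catches ImportError, which also covers Python 3 releases before 3.6, and closes the buffer via try/finally):

```python
try:
    from cStringIO import StringIO   # Python 2 (bytes-based)
except ImportError:                  # Python 3
    from io import StringIO

import csv
import pandas as pd

data = [['a', 'b'], [1, 2], [3, 4]]

fobj = StringIO()
try:
    csv.writer(fobj).writerows(data)
    fobj.seek(0)
    df = pd.read_csv(fobj)
finally:
    fobj.close()                     # explicit close instead of a context manager

print(df.dtypes)
```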
I've also looked at avoiding the csv writer completely, which leads to:
```python
text = '\n'.join(','.join(str(item).strip("'") for item in items)
                 for items in data)

with StringIO(text) as fobj:
    df = pd.read_csv(fobj)
```
which is perhaps neater (though a bit less clear), and Python 2/3 compatible. (I don't expect it to work with everything the csv module can handle, but here it works fine.)
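The limitation mentioned in that parenthesis is easy to see: the csv module quotes fields that contain the delimiter, while the naive join does not. A minimal illustration:

```python
import csv
from io import StringIO

row = ['hello, world', 42]

# csv.writer protects the embedded comma by quoting the field
buf = StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())                        # "hello, world",42

# the naive join produces three fields instead of two
print(','.join(str(item) for item in row))   # hello, world,42
```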
Why can't pd.DataFrame(...) do the conversion?

Here, I can only speculate.
I think the reasoning is that when the input consists of Python objects (dicts, lists), the input is known, and in the hands of the programmer. Therefore, it is unlikely, perhaps even illogical, for that input to contain strings such as 'false' or ''. Instead, it would normally contain the objects False and np.nan (or math.nan), since the programmer would already have taken care of the (string) translation.
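This is easy to verify: when the input already consists of proper Python objects, pd.DataFrame infers the dtypes without any help. A quick sketch, using the question's table with real objects instead of strings:

```python
import numpy as np
import pandas as pd

# Same table as in the question, but with real objects instead of strings
data = [[False, 23.0, 50.0],
        [True, 19.0, 12.0],
        [False, 4.8, np.nan]]
df = pd.DataFrame(data, columns=['string_boolean', 'numeric', 'numeric_missing'])
print(df.dtypes)   # bool, float64, float64
```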
Whereas for a file (csv or otherwise), the input can be anything: a colleague might send an Excel csv file, or someone else sends you a Gnumeric csv file. I don't know how standardised csv files are, but you'd probably need some code to allow for exceptions, and overall for the conversion of the strings to Python (NumPy) format.
So in that sense, it is actually illogical to expect pd.DataFrame(...) to accept just anything: instead, it should accept something that is properly formatted.
You might argue for a convenience method that takes a list like yours, but a list is not a csv file (which is just a bunch of characters, including newlines). Plus, I expect pd.read_csv has the option to read files in chunks (perhaps even line by line), which becomes harder if you'd feed it a string with newlines instead (you can't really read that line by line, as you would have to split it on newlines and keep all the lines in memory. And you would have the full string in memory somewhere anyway, instead of on disk. But I digress).
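That chunked reading is indeed real pandas API: read_csv accepts a chunksize parameter and then yields partial DataFrames, something a plain string argument could not support as naturally:

```python
import pandas as pd
from io import StringIO

csv_text = 'a,b\n1,x\n2,y\n3,z\n'

# chunksize turns read_csv into an iterator of DataFrames
chunks = list(pd.read_csv(StringIO(csv_text), chunksize=2))
print([len(chunk) for chunk in chunks])   # [2, 1]
```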
Besides, the StringIO trick is just a few lines that precisely perform this trick.