python - Dask read_csv fails where pandas doesn't
I'm trying to use dask's read_csv on a file that pandas's read_csv reads without problems, like this:
dd.read_csv('data/ecommerce-new.csv')
It fails with the following error:
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
The file is a CSV file of data scraped with Scrapy. It has two columns, one with the URL and the other with the HTML (which is stored multiline, using " as the quote character). The fact that pandas parses it means it should be well-formatted.
html,url
https://google.com,"<a href=""link""> </a>"
Making the sample argument big enough to load the entire file in memory seems to work, which makes me believe it fails when trying to infer the datatypes (there's also this issue, which should have been solved: https://github.com/dask/dask/issues/1284).
Has anyone encountered this problem before? Is there a fix or workaround?
EDIT: Apparently this is a known problem with dask's read_csv when the file contains a newline character between quotes. The solution I found was to read the whole file into memory:
dd.from_pandas(pd.read_csv(input_file), chunksize=25)
This works, but at the cost of parallelism. Is there any other solution?
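For anyone trying to reproduce this, here is a minimal sketch of why a newline between quotes is a problem for anything that cuts the input at line boundaries. It uses only the standard-library csv module, not dask or pandas, and the SAMPLE string is a made-up stand-in for my file (the newline inside the html field is an assumption about how the real data looks):

```python
import csv
import io

# Hypothetical sample shaped like the scraped file: the html field holds a
# newline inside double quotes, and inner quotes are escaped by doubling.
SAMPLE = 'url,html\nhttps://google.com,"<a href=""link"">\n</a>"\n'


def parse_rows(text):
    """Parse CSV text with a quote-aware reader (standard-library csv)."""
    return list(csv.reader(io.StringIO(text)))


# Parsing the text as a whole keeps the quoted newline inside one field,
# so the reader sees a header row plus a single data row.
whole = parse_rows(SAMPLE)

# Cutting on raw newlines first splits the quoted field in two: the second
# piece opens a quote it never closes, the third closes one it never opened.
naive_lines = SAMPLE.splitlines()

print(len(whole))        # rows seen by the quote-aware parser
print(len(naive_lines))  # pieces seen by a plain newline splitter
print(repr(whole[1][1])) # the html field, newline intact
```

If a chunk boundary lands on the newline inside the quotes, the parser working on that chunk starts or ends in the middle of a string, which would be consistent with the "EOF inside string" error above.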