python - Dask read_csv fails where pandas doesn't


I am trying to use Dask's read_csv on a file that pandas's read_csv reads without problems. Calling it like this:

dd.read_csv('data/ecommerce-new.csv') 

fails with the following error:

pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2

The file is a CSV file of data scraped with Scrapy. It has 2 columns: one with the URL and the other with the HTML (which is stored as multiline text, using " as the quoting character). Since pandas parses it, it should be well-formatted.

html,url
https://google.com,"<a href=""link"">
</a>"

Making the sample argument big enough to load the entire file in memory seems to work, which makes me believe it actually fails when trying to infer the data types (there is also this issue, which should have been solved: https://github.com/dask/dask/issues/1284).
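To illustrate that suspicion, here is a minimal sketch that uses only Python's standard csv module, not pandas or Dask. It assumes the sample is a block of bytes trimmed back to the last newline. The helper sample_up_to_newline and the inline data are hypothetical and are not Dask's actual sampling code. If the trimmed sample ends at a newline that sits inside a quoted field, a strict parser sees the end of input while still inside the string:

```python
import csv
import io

# Hypothetical data shaped like the file above: the HTML field spans two lines.
raw = (
    b'html,url\n'
    b'https://google.com,"<a href=""link"">\n'
    b'</a>"\n'
    b'https://example.com,"<p>ok</p>"\n'
)


def sample_up_to_newline(data: bytes, size: int) -> bytes:
    """Take about `size` bytes, trimmed back to the last newline byte.

    This is an assumed model of byte-based sampling, not Dask's real code.
    """
    head = data[:size]
    cut = head.rfind(b"\n")
    return head[: cut + 1] if cut != -1 else head


def parse(data: bytes) -> list:
    """Parse CSV bytes strictly, so an unterminated quoted field raises csv.Error."""
    text = io.StringIO(data.decode("utf-8"), newline="")
    return list(csv.reader(text, strict=True))


# Cut in the middle of the closing tag; trimming then lands on the newline
# that is inside the quoted HTML field.
sample = sample_up_to_newline(raw, raw.index(b"</a>") + 2)

try:
    parse(sample)
except csv.Error as exc:
    print("sample fails to parse:", exc)

print("full data parses into", len(parse(raw)), "rows")
```

Under this assumption, a sample large enough to cover the whole file never ends inside a quoted field, which would match what I see.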

Has anyone encountered this problem before? Is there a fix/workaround?

Edit: Apparently this is a known problem with Dask's read_csv if the file contains a newline character between quotes. A solution I found was to simply read the whole file in memory:

dd.from_pandas(pd.read_csv(input_file), chunksize=25) 

This works, but at the cost of parallelism. Is there any other solution?
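One direction that might keep some parallelism is to split the file once, cutting only between complete records, so that every piece can later be read whole. The following is only a sketch using the standard csv module. The function name split_csv_by_records, the part-file naming, and the rows_per_part default are all made up for illustration, and I have not measured it on the real data:

```python
import csv
from pathlib import Path

# Scraped HTML fields can exceed the csv module's default field size limit,
# so raise it (note that this is a process-wide setting).
csv.field_size_limit(2**31 - 1)


def split_csv_by_records(src, out_dir, rows_per_part=10_000):
    """Split a CSV into smaller files, cutting only between parsed records.

    Quoted fields that contain newlines stay intact because the split happens
    after parsing, not on raw newline bytes. Each part repeats the header.
    Returns the list of part file paths (hypothetical naming scheme).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    writer = None
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:  # empty input file: nothing to split
            return parts
        for i, row in enumerate(reader):
            if i % rows_per_part == 0:
                # Start a new part file every `rows_per_part` records.
                if out is not None:
                    out.close()
                path = out_dir / f"part-{len(parts):05d}.csv"
                out = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(out, lineterminator="\n")
                writer.writerow(header)
                parts.append(path)
            writer.writerow(row)
    if out is not None:
        out.close()
    return parts
```

The parts could then be loaded with one partition per file, for example dd.read_csv('parts/part-*.csv', blocksize=None), assuming blocksize=None makes Dask read each file as a single block instead of splitting it on newlines. The split itself is still a single sequential pass over the file.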

