python - Dask read_csv fails where pandas doesn't


I'm trying to use Dask's read_csv on a file that pandas's read_csv handles fine. Calling it like this:

dd.read_csv('data/ecommerce-new.csv') 

fails with the following error:

pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2

The file is a CSV file of data scraped using Scrapy, with 2 columns: one holds the URL and the other holds the HTML (which is stored multiline, using " as the delimiter char). Since pandas parses it, the file should be well-formatted.

html,url
https://google.com,"<a href=""link"">
</a>"

Making the sample argument big enough to load the entire file in memory seems to work, which makes me believe it actually fails when trying to infer the datatypes (there is also this issue, which should have been solved: https://github.com/dask/dask/issues/1284).
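To illustrate that suspicion: if a reader parses only the first chunk of the file, and the chunk ends at a newline that sits inside a quoted field, the parser sees an unterminated string. The sketch below reproduces that with the standard-library csv module only. It is not Dask's actual code; the cut point and the helper name parse are assumptions made for illustration.

```python
import csv
import io

# Sample data shaped like the file above: the quoted HTML field spans two lines.
raw = 'html,url\nhttps://google.com,"<a href=""link"">\n</a>"\n'


def parse(text):
    """Parse CSV text strictly, so an unterminated quote raises csv.Error."""
    return list(csv.reader(io.StringIO(text), strict=True))


# Parsing the whole text works: the embedded newline stays inside the field.
rows = parse(raw)

# Hypothetical "sample": cut right after the second newline,
# which falls inside the quoted HTML field.
cut = raw.index("\n", raw.index("\n") + 1) + 1
sample = raw[:cut]

try:
    parse(sample)
    sample_ok = True
except csv.Error as exc:
    # The truncated sample ends inside an open quoted field.
    sample_ok = False
    print("sample failed:", exc)
```

A sample that covers the whole file never ends inside a quoted field, which would be consistent with the larger sample value making the error go away.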

Has anyone encountered this problem before? Is there a fix/workaround?

EDIT: Apparently this is a known problem with Dask's read_csv if the file contains a newline character between quotes. A solution I found was to simply read it all into memory:

dd.from_pandas(pd.read_csv(input_file), chunksize=25) 

This works, but at the cost of parallelism. Is there any other solution?

