amazon web services - Exception with table identified via AWS Glue Crawler and stored in Data Catalog
I'm working on building a new data lake for my company and trying to find the best and most recent option to work with here. So far, I found a pretty nice solution working with EMR + S3 + Athena + Glue.
The process I followed was:
1 - Ran an Apache Spark script to generate 30 million rows, partitioned by date, stored on S3 as ORC.
2 - Ran an Athena query to create the external table.
3 - Checked the table on EMR connected to the Glue Data Catalog, and it worked perfectly. Both Spark and Hive were able to access it.
4 - Generated another 30 million rows in a different folder, partitioned by date, in ORC format.
5 - Ran the Glue crawler to identify the new table. It was added to the Data Catalog, and Athena was able to query it, but Spark and Hive aren't able to. See the exceptions below:
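For reference, step 2 can be done with a DDL along these lines (a sketch only: the database, table, column names, and S3 path here are placeholders, not the real ones; the point is the `STORED AS ORC` clause, which is what the crawler-created table ends up missing):

```sql
-- Hypothetical Athena DDL for step 2; names and location are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.events (
  audit_id string,
  payload  string
)
PARTITIONED BY (dt string)
STORED AS ORC
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```

After creating the table, `MSCK REPAIR TABLE my_db.events` (or `ALTER TABLE ... ADD PARTITION`) loads the date partitions into the catalog.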
Spark: Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcStruct
Hive: Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating audit_id (state=,code=0)
I checked whether it was a serialization problem, and found this:
Table created manually (configuration):
Input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
Output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Serde serialization lib: org.apache.hadoop.hive.ql.io.orc.OrcSerde
orc.compress: SNAPPY
Table created by the Glue crawler:
Input format: org.apache.hadoop.mapred.TextInputFormat
Output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib: org.apache.hadoop.hive.ql.io.orc.OrcSerde
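The mismatch above is exactly what the Spark ClassCastException points at: the crawler registered a text input format together with an ORC SerDe, so records are read as `Text` and then handed to ORC code. A quick way to spot this programmatically (a sketch: the dicts below mirror the configurations listed above, but in practice they would come from `boto3`'s `glue.get_table(...)["Table"]["StorageDescriptor"]`):

```python
# Sketch: compare two Glue StorageDescriptor-shaped dicts against the
# class names an ORC table is expected to declare.
ORC_INPUT = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
ORC_OUTPUT = "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"
ORC_SERDE = "org.apache.hadoop.hive.ql.io.orc.OrcSerde"

manual_table = {
    "InputFormat": ORC_INPUT,
    "OutputFormat": ORC_OUTPUT,
    "SerdeInfo": {"SerializationLibrary": ORC_SERDE},
}

crawler_table = {
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "SerdeInfo": {"SerializationLibrary": ORC_SERDE},
}

def orc_mismatches(sd):
    """Return the storage-descriptor keys that disagree with ORC."""
    expected = {"InputFormat": ORC_INPUT, "OutputFormat": ORC_OUTPUT}
    return [k for k, v in expected.items() if sd.get(k) != v]

print(orc_mismatches(manual_table))   # []
print(orc_mismatches(crawler_table))  # ['InputFormat', 'OutputFormat']
```

Athena's own ORC reader keys off the SerDe, which both tables declare correctly, which would explain why Athena works while Hive and Spark (which use the input format class) fail.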
So, it's not working to read from Hive or Spark; it only works with Athena. I changed the configurations, but it had no effect on Hive or Spark.
Has anyone faced this problem?
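One likely reason changing the table configuration had no effect: in the Glue Data Catalog, as in the Hive metastore, each partition carries its own storage descriptor, so partitions the crawler already created keep the text input format even after the table itself is fixed. A sketch of patching both, assuming placeholder names (`my_db`/`my_table`) and with the `boto3` calls left as comments so only the patching logic runs here:

```python
# Sketch: rewrite a Glue StorageDescriptor-shaped dict to declare ORC.
ORC_INPUT = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
ORC_OUTPUT = "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"
ORC_SERDE = "org.apache.hadoop.hive.ql.io.orc.OrcSerde"

def patch_to_orc(storage_descriptor):
    """Return a copy of the descriptor with ORC input/output formats and SerDe."""
    sd = dict(storage_descriptor)
    sd["InputFormat"] = ORC_INPUT
    sd["OutputFormat"] = ORC_OUTPUT
    serde = dict(sd.get("SerdeInfo", {}))
    serde["SerializationLibrary"] = ORC_SERDE
    sd["SerdeInfo"] = serde
    return sd

# With boto3 (not executed here), the same patch would be applied to the
# table and then to every existing partition, roughly:
#
#   glue = boto3.client("glue")
#   table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]
#   glue.update_table(DatabaseName="my_db", TableInput={
#       "Name": table["Name"],
#       "StorageDescriptor": patch_to_orc(table["StorageDescriptor"]),
#       "PartitionKeys": table["PartitionKeys"],
#   })
#   ...and likewise glue.update_partition(...) for each partition returned
#   by the glue.get_paginator("get_partitions") paginator.

crawled = {
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "SerdeInfo": {"SerializationLibrary": ORC_SERDE},
    "Location": "s3://my-bucket/new-folder/",
}
print(patch_to_orc(crawled)["InputFormat"])
```

Alternatively, dropping the crawler's table and partitions and recreating the table manually (as in step 2) avoids the patching entirely.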