amazon web services - Exception with table identified via AWS Glue Crawler and stored in Data Catalog
I'm working on building a new data lake for my company and trying to find the best and most recent option to work with here. So far, I found a pretty nice solution working with EMR + S3 + Athena + Glue.
The process I followed was:
1 - Ran an Apache Spark script to generate 30 million rows, partitioned by date, stored on S3 as ORC.
2 - Ran an Athena query to create the external table.
3 - Checked the table on EMR connected to the Glue Data Catalog, and it worked perfectly. Both Spark and Hive were able to access it.
4 - Generated another 30 million rows in a different folder, partitioned by date, in ORC format.
5 - Ran the Glue crawler to identify the new table. It was added to the Data Catalog, and Athena was able to query it, but Spark and Hive aren't able to. See the exceptions below:
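For reference, step 2 can be done with a DDL along these lines (a sketch only: the database, table, column names, and S3 path here are placeholders, not the real ones; the point is the `STORED AS ORC` clause, which is what the crawler-created table ends up missing):

```sql
-- Hypothetical Athena DDL for step 2; names and location are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.events (
  audit_id string,
  payload  string
)
PARTITIONED BY (dt string)
STORED AS ORC
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```

After creating the table, `MSCK REPAIR TABLE my_db.events` (or `ALTER TABLE ... ADD PARTITION`) loads the date partitions into the catalog.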
Spark: Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcStruct
Hive: Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating audit_id (state=,code=0)
I checked whether it was a serialization problem, and found this:
Table created manually (configuration):
Input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
Output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Serde serialization lib: org.apache.hadoop.hive.ql.io.orc.OrcSerde
orc.compress: SNAPPY
Table created by the Glue crawler:
Input format: org.apache.hadoop.mapred.TextInputFormat
Output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Serde serialization lib: org.apache.hadoop.hive.ql.io.orc.OrcSerde
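The mismatch above is exactly what the Spark ClassCastException points at: the crawler registered a text input format together with an ORC SerDe, so records are read as `Text` and then handed to ORC code. A quick way to spot this programmatically (a sketch: the dicts below mirror the configurations listed above, but in practice they would come from `boto3`'s `glue.get_table(...)["Table"]["StorageDescriptor"]`):

```python
# Sketch: compare two Glue StorageDescriptor-shaped dicts against the
# class names an ORC table is expected to declare.
ORC_INPUT = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
ORC_OUTPUT = "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"
ORC_SERDE = "org.apache.hadoop.hive.ql.io.orc.OrcSerde"

manual_table = {
    "InputFormat": ORC_INPUT,
    "OutputFormat": ORC_OUTPUT,
    "SerdeInfo": {"SerializationLibrary": ORC_SERDE},
}

crawler_table = {
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "SerdeInfo": {"SerializationLibrary": ORC_SERDE},
}

def orc_mismatches(sd):
    """Return the storage-descriptor keys that disagree with ORC."""
    expected = {"InputFormat": ORC_INPUT, "OutputFormat": ORC_OUTPUT}
    return [k for k, v in expected.items() if sd.get(k) != v]

print(orc_mismatches(manual_table))   # []
print(orc_mismatches(crawler_table))  # ['InputFormat', 'OutputFormat']
```

Athena's own ORC reader keys off the SerDe, which both tables declare correctly, which would explain why Athena works while Hive and Spark (which use the input format class) fail.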
So, it's not working to read from Hive or Spark; it only works with Athena. I changed the configurations, but it had no effect on Hive or Spark.
Has anyone faced this problem?
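One likely reason changing the table configuration had no effect: in the Glue Data Catalog, as in the Hive metastore, each partition carries its own storage descriptor, so partitions the crawler already created keep the text input format even after the table itself is fixed. A sketch of patching both, assuming placeholder names (`my_db`/`my_table`) and with the `boto3` calls left as comments so only the patching logic runs here:

```python
# Sketch: rewrite a Glue StorageDescriptor-shaped dict to declare ORC.
ORC_INPUT = "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat"
ORC_OUTPUT = "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"
ORC_SERDE = "org.apache.hadoop.hive.ql.io.orc.OrcSerde"

def patch_to_orc(storage_descriptor):
    """Return a copy of the descriptor with ORC input/output formats and SerDe."""
    sd = dict(storage_descriptor)
    sd["InputFormat"] = ORC_INPUT
    sd["OutputFormat"] = ORC_OUTPUT
    serde = dict(sd.get("SerdeInfo", {}))
    serde["SerializationLibrary"] = ORC_SERDE
    sd["SerdeInfo"] = serde
    return sd

# With boto3 (not executed here), the same patch would be applied to the
# table and then to every existing partition, roughly:
#
#   glue = boto3.client("glue")
#   table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]
#   glue.update_table(DatabaseName="my_db", TableInput={
#       "Name": table["Name"],
#       "StorageDescriptor": patch_to_orc(table["StorageDescriptor"]),
#       "PartitionKeys": table["PartitionKeys"],
#   })
#   ...and likewise glue.update_partition(...) for each partition returned
#   by the glue.get_paginator("get_partitions") paginator.

crawled = {
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "SerdeInfo": {"SerializationLibrary": ORC_SERDE},
    "Location": "s3://my-bucket/new-folder/",
}
print(patch_to_orc(crawled)["InputFormat"])
```

Alternatively, dropping the crawler's table and partitions and recreating the table manually (as in step 2) avoids the patching entirely.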