databricks - Read XML File in Spark with multiple RowTags -
I want to read a huge XML file with 3 different RowTags into Apache Spark DataFrames.
A RowTag is the XML element that Spark interprets as a row.
The tags
- contain different data structures
- are not overlapping
spark-xml (https://github.com/databricks/spark-xml) only offers reading one RowTag at a time, so I would need to read the same file 3 times (not efficient).
Is there a way to read the file in a single pass?
Details:
I have a huge XML file (24 GB) containing 3 lists:
<myfile>
  <containedresourcelist>
    <soundrecording><title>a</title></soundrecording>
    ... several million records ...
    <soundrecording><title>z</title></soundrecording>
  </containedresourcelist>
  <containedreleaselist>
    <release><releasetype>single</releasetype></release>
    ... several million records ...
    <release><releasetype>lp</releasetype></release>
  </containedreleaselist>
  <containedtransactionlist>
    <transaction><sales>1</sales></transaction>
    ... several million records ...
    <transaction><sales>999</sales></transaction>
  </containedtransactionlist>
</myfile>
The XML file is valid. I want to read the RowTags soundrecording, release & transaction.
I would prefer Scala libraries, but I would be happy with any library that enables this read.
PS: What should the output and schema look like?
- best option: an array of 3 DataFrames, one for each RowTag
- ugly option: 1 DataFrame containing the possible elements of all 3 data structures
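For reference, the straightforward but inefficient approach is to scan the file once per RowTag. A minimal sketch, assuming spark-xml is on the classpath and that the file path `myfile.xml` stands in for the real location:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("three-pass-read").getOrCreate()

// One full pass over the 24 GB file per RowTag -- three scans in total.
val tags = Seq("soundrecording", "release", "transaction")
val dfs = tags.map { tag =>
  spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", tag)
    .load("myfile.xml")
}
```

This yields the "best option" shape (one DataFrame per RowTag), at the cost of reading the input three times.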
One simple way is to use the explode function. You can read the full XML with the RowTag set to containedresourcelist, and then explode the resulting DataFrame into a new column:
df.withColumn("soundrec", explode($"soundrecording"))
You can add multiple columns, one for each tag you want to explode.
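Putting the explode approach together, a sketch might look like this. It assumes spark-xml is available and that the column names match the lowercase tags shown in the question; in practice the wrapper elements would each be read with their own RowTag, or the outermost myfile element used as the single RowTag so that all three lists arrive in one scan:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("explode-rowtags").getOrCreate()
import spark.implicits._

// Single pass: use the outer element as the RowTag so all three
// contained lists land in one (wide) DataFrame.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "myfile")
  .load("myfile.xml")

// Explode each repeated element into its own DataFrame,
// one row per record.
val soundRecs = df.select(explode($"containedresourcelist.soundrecording").as("soundrec"))
val releases  = df.select(explode($"containedreleaselist.release").as("release"))
val txns      = df.select(explode($"containedtransactionlist.transaction").as("txn"))
```

Note that with rowTag set to the outermost element, Spark must materialize the entire file contents as a single row before exploding, which may be memory-intensive for a 24 GB file; exploding from the three intermediate list tags is a middle ground.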