databricks - Read XML File in Spark with multiple RowTags


I want to read a huge XML file containing 3 different rowTags into Apache Spark DataFrames.

A rowTag is the XML element that is interpreted as a row in Spark.

The tags

  • contain different data structures
  • do not overlap

spark-xml (https://github.com/databricks/spark-xml) only offers reading one rowTag at a time, so I would need to read the same file 3 times (not efficient).

Is there a way to read the file in a single pass?
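For reference, the triple-read approach being asked about can be sketched as follows. This is a minimal sketch, assuming Spark 2.x+ and the external spark-xml package; the file path and package version are assumptions, not from the question:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .master("local[1]")               // local mode for illustration only
  .appName("multi-rowtag-read")
  .getOrCreate()

// One full pass over the file per rowTag. Requires the spark-xml package
// on the classpath, e.g.:
//   spark-shell --packages com.databricks:spark-xml_2.12:0.12.0
def readRowTag(tag: String): DataFrame =
  spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", tag)
    .load("/data/myfile.xml")       // hypothetical path

// Three separate scans of the 24 GB file -- the inefficiency in question:
// val soundRecordings = readRowTag("soundrecording")
// val releases        = readRowTag("release")
// val transactions    = readRowTag("transaction")
```

Each call to `readRowTag` scans the whole file, which is why three rowTags means three reads of 24 GB.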

Details:

I have a huge XML file (24 GB) containing 3 lists:

<myfile>
    <containedresourcelist>
        <soundrecording><title>a</title></soundrecording>
        ... several million records ...
        <soundrecording><title>z</title></soundrecording>
    </containedresourcelist>
    <containedreleaselist>
        <release><releasetype>single</releasetype></release>
        ... several million records ...
        <release><releasetype>lp</releasetype></release>
    </containedreleaselist>
    <containedtransactionlist>
        <transaction><sales>1</sales></transaction>
        ... several million records ...
        <transaction><sales>999</sales></transaction>
    </containedtransactionlist>
</myfile>

The XML file is valid. I want to read the rowTags soundrecording, release & transaction.

I would prefer Scala libs, but I'm happy with any lib enabling the read.

PS: How should the output & schema look?

  • Best option: an array of 3 DataFrames, one for each rowTag
  • Ugly option: one DataFrame containing the possible elements of all 3 data structures

One simple way is to use the explode function. You can read the full XML with rowTag set to containedresourcelist, and then explode the resulting DataFrame into a new column:

df.withColumn("soundRec", explode($"soundrecording")) 

You can add multiple columns, one for each tag you want to explode.
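A runnable sketch of that explode step, using a toy in-memory DataFrame in place of the spark-xml read (so the column name soundrecording mirrors the question, but the data is made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("explode-demo")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for the single row spark-xml would produce with
// rowTag = "containedresourcelist": one array column of child records.
val df = Seq(Seq("a", "b", "c")).toDF("soundrecording")

// explode turns the array column into one output row per array element.
val exploded = df.withColumn("soundRec", explode($"soundrecording"))
```

With real data, each exploded column would hold the struct of one soundrecording, release, or transaction element, which you can then select into separate DataFrames.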

