databricks - Read XML File in Spark with multiple RowTags -
I want to read a huge XML file with 3 different RowTags into Apache Spark DataFrames.
A RowTag is the XML element that Spark interprets as a row.
The tags
- contain different data structures
- are not overlapping
spark-xml (https://github.com/databricks/spark-xml) only offers reading one RowTag at a time, so I would need to read the same file 3 times (not efficient).
Is there a way to read the file in a single pass?
Details:
I have a huge XML file (24 GB) containing 3 lists:
<myfile>
  <containedresourcelist>
    <soundrecording><title>a</title></soundrecording>
    ... several million records ...
    <soundrecording><title>z</title></soundrecording>
  </containedresourcelist>
  <containedreleaselist>
    <release><releasetype>single</releasetype></release>
    ... several million records ...
    <release><releasetype>lp</releasetype></release>
  </containedreleaselist>
  <containedtransactionlist>
    <transaction><sales>1</sales></transaction>
    ... several million records ...
    <transaction><sales>999</sales></transaction>
  </containedtransactionlist>
</myfile>
The XML file is valid. I want to read the RowTags soundrecording, release & transaction.
I would prefer Scala libraries, but I would be happy with any library that enables this read.
PS: What should the output and schema look like?
- best option: an array of 3 DataFrames, one for each RowTag
- ugly option: 1 DataFrame containing the possible elements of all 3 data structures
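For reference, the straightforward but inefficient approach is to scan the file once per RowTag. A minimal sketch, assuming spark-xml is on the classpath and that the file path `myfile.xml` stands in for the real location:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("three-pass-read").getOrCreate()

// One full pass over the 24 GB file per RowTag -- three scans in total.
val tags = Seq("soundrecording", "release", "transaction")
val dfs = tags.map { tag =>
  spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", tag)
    .load("myfile.xml")
}
```

This yields the "best option" shape (one DataFrame per RowTag), at the cost of reading the input three times.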
One simple way is to use the explode function. You can read the full XML with the RowTag set to containedresourcelist, and then explode the resulting DataFrame into a new column:
df.withColumn("soundrec", explode($"soundrecording"))
You can add multiple columns, one for each tag you want to explode.
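Putting the explode approach together, a sketch might look like this. It assumes spark-xml is available and that the column names match the lowercase tags shown in the question; in practice the wrapper elements would each be read with their own RowTag, or the outermost myfile element used as the single RowTag so that all three lists arrive in one scan:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("explode-rowtags").getOrCreate()
import spark.implicits._

// Single pass: use the outer element as the RowTag so all three
// contained lists land in one (wide) DataFrame.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "myfile")
  .load("myfile.xml")

// Explode each repeated element into its own DataFrame,
// one row per record.
val soundRecs = df.select(explode($"containedresourcelist.soundrecording").as("soundrec"))
val releases  = df.select(explode($"containedreleaselist.release").as("release"))
val txns      = df.select(explode($"containedtransactionlist.transaction").as("txn"))
```

Note that with rowTag set to the outermost element, Spark must materialize the entire file contents as a single row before exploding, which may be memory-intensive for a 24 GB file; exploding from the three intermediate list tags is a middle ground.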