Apache Spark - Unbagging a dataset in PySpark
I have a dataset that looks like this:
(34521658, 0001-01-01, 2500-01-01, 2 , a, y, 15, p, a, 4776, 4776, 4776, {(11, p, a, 4776,4766, 4776), (12, p, a, 4776,4766, 4776), (13, p, a, 4776,4766, 4776)})
and I want to un-bag it to produce:
(34521658, 0001-01-01, 2500-01-01, 2, a, y, 15, p, a, 4776, 4776, 4776, 11, p, a, 4776, 4766, 4776)
(34521658, 0001-01-01, 2500-01-01, 2, a, y, 15, p, a, 4776, 4776, 4776, 12, p, a, 4776, 4766, 4776)
(34521658, 0001-01-01, 2500-01-01, 2, a, y, 15, p, a, 4776, 4776, 4776, 13, p, a, 4776, 4766, 4776)
How can I do this in PySpark?
As suggested in the comments, either flatMap or explode can be used. Here is how it can be done with the explode SQL function (explode, as its name says, expands an array or map column into multiple rows). I keep only the meaningful columns for the sake of simplifying the approach. Assuming the first column is id and the column you want to explode is named bag, this is the initial dataset:
+--------+--------------------+
|      id|                 bag|
+--------+--------------------+
|34521658|[[11,p,a,4776,476...|
+--------+--------------------+
The schema of the dataset is:
scala> df.printSchema
root
 |-- id: integer (nullable = true)
 |-- bag: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: integer (nullable = true)
 |    |    |-- _2: string (nullable = true)
 |    |    |-- _3: string (nullable = true)
 |    |    |-- _4: integer (nullable = true)
 |    |    |-- _5: integer (nullable = true)
 |    |    |-- _6: integer (nullable = true)
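Since the question is about PySpark, a DataFrame with the same shape (an id column plus a bag array of structs) could be built there roughly as follows. This is a minimal sketch; the field names (_1 to _6) and the sample values simply mirror the schema above:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# One row: an id plus an array of structs, matching the schema shown above.
df = spark.createDataFrame([
    Row(id=34521658,
        bag=[Row(_1=11, _2="p", _3="a", _4=4776, _5=4766, _6=4776),
             Row(_1=12, _2="p", _3="a", _4=4776, _5=4766, _6=4776),
             Row(_1=13, _2="p", _3="a", _4=4776, _5=4766, _6=4776)])
])
df.printSchema()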
Note that the bag column is an array of elements. On this column you can apply the explode function like this:
df.withColumn("bag", explode($"bag"))
The resulting Dataset/DataFrame is:
+--------+--------------------+
|      id|                 bag|
+--------+--------------------+
|34521658|[11,p,a,4776,4766...|
|34521658|[12,p,a,4776,4766...|
|34521658|[13,p,a,4776,4766...|
+--------+--------------------+
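In PySpark the equivalent call uses explode from pyspark.sql.functions. A sketch, assuming the df built above:

from pyspark.sql.functions import explode

# One output row per element of the bag array; the id value is repeated.
exploded = df.withColumn("bag", explode(df["bag"]))
exploded.show()

# Optionally flatten the struct fields into top-level columns,
# which gives rows like (34521658, 11, p, a, 4776, 4766, 4776):
exploded.select("id", "bag.*").show()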
Hope this helps.