apache spark - PySpark getting distinct values over a wide range of columns


I have data with a large number of custom columns, the content of which I poorly understand. The columns are named evar1 to evar250. I'd like to get a single table with all distinct values, a count of how often these occur, and the name of the column.

    ------------------------------------------------
    | columnname | value                 | count   |
    |------------|-----------------------|---------|
    | evar1      | en-gb                 | 7654321 |
    | evar1      | en-us                 | 1234567 |
    | evar2      | www.myclient.com      |     123 |
    | evar2      | app.myclient.com      |     456 |
    | ...

The best way I can think of doing this feels terrible, as I believe I would have to read the data once per column (there are 400 such columns).

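    from pyspark.sql import functions as fn

    i = 1
    df_evars = None
    while i <= 30:
        colname = "evar" + str(i)
        # one aggregation per column, tagged with the source column name
        df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows"))\
            .withColumn("colname", fn.lit(colname))
        if df_evars is not None:
            # union is positional, so the differing value-column names are fine
            df_evars = df_evars.union(df_temp)
        else:
            df_evars = df_temp
        i += 1
    display(df_evars)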

Am I missing a better solution?

Update

This has been marked as a duplicate, but the 2 responses IMO only solve part of my question.

I am looking at potentially very wide tables with a potentially large number of values. I need a simple result (i.e. 3 columns that show the source column, the value, and the count of that value in the source column).

The first of the responses only gives me an approximation of the number of distinct values, which is pretty useless to me.

The second response seems less relevant than the first. To clarify, source data like this:

    -----------------------
    | evar1 | evar2 | ... |
    |-------|-------|-----|
    | a     | a     | ... |
    | b     | a     | ... |
    | b     | b     | ... |
    | b     | b     | ... |
    | ...

should result in this output:

    --------------------------------
    | columnname | value | count   |
    |------------|-------|---------|
    | evar1      | a     | 1       |
    | evar1      | b     | 3       |
    | evar2      | a     | 2       |
    | evar2      | b     | 2       |
    | ...
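To pin down what that output means, the four sample rows can be counted in plain Python. This is only a reference sketch of the intended semantics, not something to run on real data: the two sample values are written as the placeholders `a` and `b` (an assumption about the sample), and `count_values` is a hypothetical helper name.

```python
from collections import Counter

# Sample rows from the question, using placeholder values "a" and "b".
columns = ["evar1", "evar2"]
rows = [
    ("a", "a"),
    ("b", "a"),
    ("b", "b"),
    ("b", "b"),
]


def count_values(columns, rows):
    """Count how often each value occurs in each column."""
    counts = Counter()
    for row in rows:
        for name, value in zip(columns, row):
            counts[(name, value)] += 1
    # Long format: one (columnname, value, count) triple per distinct pair.
    return sorted((name, value, n) for (name, value), n in counts.items())


result = count_values(columns, rows)
for name, value, n in result:
    print(f"{name:<10} | {value:<5} | {n}")
```

Each input row contributes one (column, value) pair per column, and the pairs are then counted. A Spark solution would need to do the same reshaping before grouping.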

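Using melt borrowed from here:

    from pyspark.sql.functions import col

    # melt is the helper from the linked answer; it is not defined here.
    # Every column is cast to string so all values share one type.
    melt(
        df.select([col(c).cast("string") for c in df.columns]),
        id_vars=[], value_vars=df.columns
    ).groupBy("variable", "value").count()

Adapted from an answer by user6910411.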
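Since the `melt` helper itself is not shown here, the following is a rough sketch of the same wide-to-long idea using Spark SQL's built-in `stack` generator instead. It is not the helper from that answer. The function names `stack_expr` and `value_counts_long` are hypothetical, `df` is assumed to be a PySpark DataFrame, and the column names are assumed to be simple identifiers such as `evar1` (no quotes or backticks in them).

```python
def stack_expr(columns, var_name="variable", value_name="value"):
    """Build a Spark SQL stack() expression that unpivots `columns`.

    Every value is cast to string because all rows produced by stack()
    must share one type per output column.
    """
    pairs = ", ".join(f"'{c}', cast(`{c}` as string)" for c in columns)
    return f"stack({len(columns)}, {pairs}) as ({var_name}, {value_name})"


def value_counts_long(df, columns):
    """Return one row per (variable, value) pair together with its count.

    Only DataFrame methods are used, so this block needs no pyspark import;
    `df` is assumed to be a pyspark.sql.DataFrame.
    """
    return (
        df.selectExpr(stack_expr(columns))
        .groupBy("variable", "value")
        .count()
    )


# The generated expression for two columns:
print(stack_expr(["evar1", "evar2"]))
```

Calling `value_counts_long(df, df.columns)` should give the three-column shape asked for in the question, with a single aggregation rather than one per column. Whether that is faster in practice depends on the data and the cluster.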

