apache spark - When should I repartition an RDD? -

April 15, 2014

i know can repartition rdd increase it's partitions , use coalesce decrease it's partitions. have 2 questions regarding cannot understand after reading @ different resources.

spark use sensible default (1 partition per block 64mb in first versions , 128mb) when generating rdd. read recommended use 2 or 3 times number of cores running jobs. here comes question:

1- how many partitions should use given file. example, suppose have 10gb .parquet file, 3 executors 2 cores , 3gb memory each. should repartition? how many partitions should use? better way make choice?

2- data types (ie .txt, .parquet, etc..) repartitioned default if no partitioning provided?

spark can run single concurrent task every partition of rdd, total number of cores in cluster.

for example :

val rdd= sc.textfile ("file.txt", 5)

the above line of code create rdd named textfile 5 partitions.

suppose have cluster 4 cores , assume each partition needs process 5 minutes. in case of above rdd 5 partitions, 4 partition processes run in parallel there 4 cores , 5th partition process process after 5 minutes when 1 of 4 cores, free.

the entire processing completed in 10 minutes , during 5th partition process, resources (remaining 3 cores) remain idle.

the best way decide on number of partitions in rdd make number of partitions equal number of cores in cluster partitions process in parallel , resources utilized in optimal way.

question : data types (ie .txt, .parquet, etc..) repartitioned default if no partitioning provided?

there default no of partitions every rdd. check can use rdd.partitions.length right after rdd created.

to use existing cluster resources in optimal way , speed up, have consider re-partitioning ensure cores utilized , partitions have enough number of records uniformly distributed.

for better understanding, have @ https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html

note : there no fixed formula this. general convention of them follow is

(numof executors * no of cores) * replicationfactor (which may 2 or 3 times more )

Search This Blog

How Y

apache spark - When should I repartition an RDD? -

note : there no fixed formula this. general convention of them follow is

Comments

Post a Comment

Popular posts from this blog

Is there a better way to structure post methods in Class Based Views -

reflection - How to access the object-members of an object declaration in kotlin -

php - Doctrine Query Builder Error on Join: [Syntax Error] line 0, col 87: Error: Expected Literal, got 'JOIN' -