hadoop - What is Lineage in Spark?


How does lineage help to recompute data?

For example, I have several nodes, each computing data for 30 minutes. If one fails after 15 minutes, can we recompute the data processed in those 15 minutes using lineage, without spending the 15 minutes again?

Everything you need to understand about lineage is in the definition of RDDs.

So let's review it:

RDDs are an immutable distributed collection of elements of your data that can be stored in memory or on disk across a cluster of machines. The data is partitioned across the machines in the cluster and can be operated on in parallel through a low-level API that offers transformations and actions. RDDs are fault tolerant because they track data lineage information, which lets them rebuild lost data automatically on failure.
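To make the definition above concrete, here is a minimal sketch in plain Python. It is not Spark's API: `ToyRDD`, `compute_partition` and `lineage` are hypothetical names invented for illustration. The idea it shows is that each dataset remembers only its parent and the function that derives it, so a lost partition can be rebuilt by replaying that chain from the source.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class)."""

    def __init__(self, source=None, parent=None, fn=None, name="source"):
        self.source = source  # list of partitions, only set on the root
        self.parent = parent  # the dataset this one was derived from
        self.fn = fn          # per-element function applied to the parent
        self.name = name

    def map(self, fn, name="map"):
        # A transformation returns a new dataset and records where it came from.
        return ToyRDD(parent=self, fn=fn, name=name)

    def lineage(self):
        # Walk back through the parents to list the derivation steps.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def compute_partition(self, i):
        # Rebuild partition i by replaying the lineage from the source.
        if self.parent is None:
            return list(self.source[i])
        return [self.fn(x) for x in self.parent.compute_partition(i)]


base = ToyRDD(source=[[1, 2], [3, 4]])
result = base.map(lambda x: x * 10, name="times10").map(lambda x: x + 1, name="plus1")

# Materialise both partitions, then simulate losing one of them.
partitions = {i: result.compute_partition(i) for i in range(2)}
del partitions[1]

# Only the lost partition is recomputed; partition 0 is untouched.
partitions[1] = result.compute_partition(1)
```

In Spark itself, the recorded lineage of an RDD can be inspected with `toDebugString`.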

So there are two things to understand: how lineage is passed down between RDDs, and how Spark uses it to achieve fault tolerance.

Unfortunately, these topics are quite long to discuss in a single answer. I recommend you take some time reading about them, along with the following article on data lineage.

And now, to answer your question and doubts:

If an executor fails while computing your data, after 15 minutes, Spark goes back to the last checkpoint, whether that is the source or a cache in memory and/or on disk.

So lineage will not save you those 15 minutes you have mentioned!
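The effect of "going back to the last checkpoint" can be sketched with a toy model in plain Python. This is an illustration under simplifying assumptions (a single partition, and a hypothetical `Stage` class that is not part of Spark): a retry replays every stage back to the nearest cached ancestor, or to the source if nothing was cached.

```python
class Stage:
    """Toy computation stage (hypothetical, not Spark's API)."""

    def __init__(self, name, fn, parent=None):
        self.name, self.fn, self.parent = name, fn, parent
        self.cached = None  # materialised output, if this stage was cached
        self.runs = 0       # how many times this stage actually executed

    def compute(self, data):
        # Restart from the nearest cached ancestor, else from the source.
        if self.cached is not None:
            return self.cached
        upstream = data if self.parent is None else self.parent.compute(data)
        self.runs += 1
        return [self.fn(x) for x in upstream]

    def cache(self, data):
        self.cached = self.compute(data)


source = [1, 2, 3]

# Without a cache: a retry replays the whole chain from the source.
clean = Stage("clean", lambda x: x * 2)
final = Stage("final", lambda x: x + 1, parent=clean)
first = final.compute(source)
retry = final.compute(source)  # simulates recomputation after a failure
runs_without_cache = (clean.runs, final.runs)

# With the intermediate stage cached: only the last stage is replayed.
clean_cached = Stage("clean", lambda x: x * 2)
clean_cached.cache(source)
final_cached = Stage("final", lambda x: x + 1, parent=clean_cached)
final_cached.compute(source)
final_cached.compute(source)   # simulated retry
runs_with_cache = (clean_cached.runs, final_cached.runs)
```

In both cases the failed stage itself runs again in full, which is why the work already done inside the failed task is not recovered. In Spark, `cache`, `persist` and `checkpoint` on an RDD are what mark such restart points.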

