hadoop - What is Lineage In Spark? -
how lineage helps recompute data ?
for example, i'm having several nodes computing data 30 minutes each. if 1 fails after 15 minutes, can recompute data processed in 15 minutes again using lineage without giving 15 minutes again ?
everthing understand lineage in definition of it's rdds.
so let's review :
rdds immutable distributed collection of elements of data can stored in memory or disk across cluster of machines. data partitioned across machines in cluster can operated in parallel low-level api offers transformations , actions. rdds fault tolerant track data lineage information rebuild lost data automatically on failure
so there 2 things understand :
unfortunately, these topics quite long discuss in single answer. recommend take time reading them along following article data lineage.
and answer question , doubts :
if executor fails computing data, after 15 minutes, go last checkpoint, whether it's source or cache in memory and/or on disk.
that not save 15 minutes have mentioned !
Comments
Post a Comment