scala - Spark-Running Batch Job with 15 minutes interval -


I'm using Scala and tried Spark Streaming, but if the streaming job happens to be down for more than 15 minutes, data is lost.

So I want to know: how can I manually keep checkpoints in a batch job?

The input data directories look like the following:

data --> 20170818 --> (timestamp) --> (many .json files)

The data is uploaded every 5 minutes.

Thanks!

You can use the readStream feature in Structured Streaming to monitor the directory and pick up new files. Spark automatically handles checkpointing and file tracking for you.

val ds = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", 1)
  .load(logDirectory)
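The snippet above only defines the source. Progress is recorded once the stream is written out with a checkpoint location. Below is a minimal sketch of a complete job under some assumptions: the paths, the Parquet sink, and the object name are hypothetical, and the glob in logDirectory is a guess at how to reach the nested date/timestamp directories from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object DirectoryIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("directory-ingest").getOrCreate()

    // Hypothetical paths, not from the original post; adjust to your layout.
    val logDirectory  = "data/*/*"            // <date>/<timestamp> directories
    val outputDir     = "output/events"       // assumed sink location
    val checkpointDir = "checkpoints/events"  // where Spark records progress

    val ds = spark.readStream
      .format("text")
      .option("maxFilesPerTrigger", 1)
      .load(logDirectory)

    // The checkpoint location is what lets a restarted job resume from the
    // files it has not processed yet, however long it was down.
    val query = ds.writeStream
      .format("parquet")
      .option("path", outputDir)
      .option("checkpointLocation", checkpointDir)
      .trigger(Trigger.ProcessingTime("15 minutes"))
      .start()

    query.awaitTermination()
  }
}
```

If you would rather launch the job from a scheduler every 15 minutes, Trigger.Once() (available since Spark 2.2) should process whatever is new and then stop, while still using the same checkpoint directory.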

Here is a link to additional material on the topic: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-filestreamsource.html

I used format("text"), but you should be able to change it to format("json"). Here are more details on the JSON format: https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
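If you prefer to stay with a plain batch job and track progress yourself, as the question asks, one possible approach is to record each finished timestamp directory in a small checkpoint file and skip recorded directories on the next run. The sketch below is only an illustration using the JDK file API. The object name ManualCheckpoint, the one-line-per-directory file format, and the runOnce helper are all assumptions, not part of Spark.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, StandardOpenOption}
import scala.io.Source

// Hypothetical helper: tracks processed <date>/<timestamp> directories in a text file.
object ManualCheckpoint {
  // Key stored in the checkpoint file, e.g. "20170818/1000".
  def key(timestampDir: File): String =
    s"${timestampDir.getParentFile.getName}/${timestampDir.getName}"

  // Keys of directories already processed, one per line in the checkpoint file.
  def loadProcessed(checkpointFile: File): Set[String] =
    if (!checkpointFile.exists()) Set.empty
    else {
      val src = Source.fromFile(checkpointFile, "UTF-8")
      try src.getLines().map(_.trim).filter(_.nonEmpty).toSet
      finally src.close()
    }

  private def subDirs(dir: File): Seq[File] =
    Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty).filter(_.isDirectory)

  // All <date>/<timestamp> directories not yet recorded, oldest first.
  def pendingDirs(dataRoot: File, processed: Set[String]): Seq[File] =
    subDirs(dataRoot)
      .flatMap(subDirs)
      .filterNot(d => processed.contains(key(d)))
      .sortBy(d => key(d))

  def markProcessed(checkpointFile: File, timestampDir: File): Unit =
    Files.write(
      checkpointFile.toPath,
      (key(timestampDir) + "\n").getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)

  // One batch run: process every pending directory and return the keys handled.
  def runOnce(dataRoot: File, checkpointFile: File)(process: File => Unit): Seq[String] =
    pendingDirs(dataRoot, loadProcessed(checkpointFile)).map { dir =>
      process(dir)                       // e.g. spark.read.json(dir.getPath)
      markProcessed(checkpointFile, dir) // recorded only after process succeeds
      key(dir)
    }
}
```

Because a directory is recorded only after it has been processed, a crash between the two steps means that directory is processed again on the next run, so the output step should tolerate duplicates. A directory that is still being uploaded could also be picked up half-written, so you may want to skip the newest one.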

