In a spark streaming application batches of data should be processed as they are being generated. Ideally, batch processing time should be close to the batch interval time, but consistently below it. This is one of the major requirement of a production-ready spark streaming application.
If the batch processing time is more than the batch interval, data is going to pile up in the system and eventually will run out of resources. If the batch interval is way less than batch processing time, it is a waste cluster resources.
According to spark documentation:
"A good approach to figuring out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate".
However, this has its own drawback. For long-running spark application that runs in production for months, things change over time. Sometimes the message characteristics such as message size change over time, causing the processing time of the same number of messages varies. Sometimes a multi-tenant cluster becomes busy during the daytime when other big data applications such as Impala/Hive/MR jobs compete for shared system resources such as CPU/Memory/Network/Disk IO.
The backpressure comes to the rescue! Backpressure was a highly demanded feature that allows the ingestion rate to be set dynamically and automatically, basing on previous micro-batch processing time. Such feedback loop makes it possible to adapt to the fluctuation nature of the streaming application.
This backpressure can be enabled by setting the configuration parameter :
sparkConf.set("spark.streaming.backpressure.enabled",”true”)
What about the first micro-batch rate? Since there is no previous micro-batch processing time available, there is no basis to estimate what is the optimal rate the application should use. According to this ticket,(SPARK-18580), this has been resolved and Currently, the `spark.streaming.kafka.maxRatePerPartition` is used as the initial rate when the backpressure is enabled.
If the batch processing time is more than the batch interval, data is going to pile up in the system and eventually will run out of resources. If the batch interval is way less than batch processing time, it is a waste cluster resources.
According to spark documentation:
"A good approach to figuring out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate".
However, this has its own drawback. For long-running spark application that runs in production for months, things change over time. Sometimes the message characteristics such as message size change over time, causing the processing time of the same number of messages varies. Sometimes a multi-tenant cluster becomes busy during the daytime when other big data applications such as Impala/Hive/MR jobs compete for shared system resources such as CPU/Memory/Network/Disk IO.
The backpressure comes to the rescue! Backpressure was a highly demanded feature that allows the ingestion rate to be set dynamically and automatically, basing on previous micro-batch processing time. Such feedback loop makes it possible to adapt to the fluctuation nature of the streaming application.
This backpressure can be enabled by setting the configuration parameter :
sparkConf.set("spark.streaming.backpressure.enabled",”true”)
What about the first micro-batch rate? Since there is no previous micro-batch processing time available, there is no basis to estimate what is the optimal rate the application should use. According to this ticket,(SPARK-18580), this has been resolved and Currently, the `spark.streaming.kafka.maxRatePerPartition` is used as the initial rate when the backpressure is enabled.