一、
Spark Streaming 構建在Spark core API之上,具備可伸縮,高吞吐,可容錯的流處理模塊。
1)支持多種數據源,如Kafka,Flume,Socket,文件等;
- Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
- Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies.
2)處理完成數據可寫入Kafka,Hdfs,本地文件等多種地方;
?
DStream:
Spark Streaming對持續流入的數據有個高層的抽像:
It represents a continuous stream of data
a DStream is represented by a continuous series of RDDs,Each RDD in a DStream contains data from a certain interval
Any operation applied on a DStream translates to operations on the underlying RDDs.
?
什么是RDD?
RDD是Resilient Distributed Dataset的縮寫,中文譯為彈性分布式數據集,是Spark中最重要的概念。
RDD是只讀的、分區的,可容錯的數據集合。
?
何為彈性?
RDD可在內存、磁盤之間任意切換
RDD可以轉換成其它RDD,可由其它RDD生成
RDD可存儲任意類型數據
?
二、基本概念
1)add dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.3.1</version>
</dependency>
其它想關依賴查詢:
https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:2.2.0
?
2)文件作為DStream源,是如何被監控的?
1)文件格式須一致
2)根據modify time開成流,而非create time
3)處理時,當前文件變更不會在此window處理,即不會reread
4)可以調用 FileSystem.setTimes()來修改文件時間,使其在下個window被處理,即使文件內容未被修改過
?
三、Transform operation
window operation
?
Spark Streaming also provides?windowed computations, which allow you to apply transformations over a sliding window of data.
every time the window?slides?over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream.?
在一個時間窗口內的RDD被合并為一個RDD來處理。
Any window operation needs to specify two parameters:
window length: The duration of the window
sliding interval: The interval at which the window operation if performed
?
四、Output operation
使用foreachRDD
dstream.foreachRDD
?is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently.?
?
CheckPoint概念
?
Performance Tuning
?
Fault-tolerance Semantics
?