SeaTunnel Local Mode Quick Test

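Preface

SeaTunnel (formerly known as Waterdrop) is a distributed, high-performance, easily extensible data integration platform designed for synchronizing and transforming massive volumes of data. It supports multiple processing engines, including Apache Spark and Apache Flink, and a later release introduced its own in-house Zeta engine. SeaTunnel handles not only offline (batch) synchronization but also real-time CDC (Change Data Capture) synchronization, which makes it well suited to a wide range of data integration scenarios.

This section is a hands-on supplement to the official quick start. Let's get started.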

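I. What is Apache SeaTunnel?

The official website describes it as:
Next-generation high-performance, distributed, massive data integration tool.
These keywords show how it positions itself: a next-generation, high-performance, distributed tool for large-scale data integration.

So how well does it actually work?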

II. Installation

Download

https://seatunnel.apache.org/download
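
For a scripted download, something like the sketch below may help. It assumes version 2.3.5 (the version that shows up in the install path later in this post) and assumes the binary package follows the usual Apache archive URL layout. The helper name `seatunnel_download_url` is made up for this example; check the download page above if the URL does not resolve.

```shell
# Assumption: version 2.3.5, matching the install path used later in this post.
version="${SEATUNNEL_VERSION:-2.3.5}"

# Build the archive URL for a release (URL pattern assumed from the Apache archive layout).
seatunnel_download_url() {
  printf 'https://archive.apache.org/dist/seatunnel/%s/apache-seatunnel-%s-bin.tar.gz\n' "$1" "$1"
}

url="$(seatunnel_download_url "$version")"
echo "Download URL: $url"

# To fetch and unpack, run the following by hand:
#   curl -fLO "$url"
#   tar -xzf "apache-seatunnel-${version}-bin.tar.gz"
#   cd "apache-seatunnel-${version}"
```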

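III. Testing

1. Running the example in local mode

Tweak the template example, then run the following command:

bin/seatunnel.sh --config ./config/v2.batch.config -e local
The job configuration is simple:
It uses FakeSource to simulate two columns (name and age). With row.num = 16 and parallelism = 2, each of the two subtasks generates 16 rows, so 32 rows are printed in total, which matches the Total Read Count in the job statistics.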
2024-07-01 21:56:06,617 INFO  [o.a.s.c.s.u.ConfigBuilder     ] [main] - Parsed config file:
{"env" : {"parallelism" : 2,"job.mode" : "BATCH","checkpoint.interval" : 10000},"source" : [{"schema" : {"fields" : {"name" : "string","age" : "int"}},"row.num" : 16,"parallelism" : 2,"result_table_name" : "fake","plugin_name" : "FakeSource"}],"sink" : [{"plugin_name" : "Console"}]
}
Job output. The sink here is Console, so the rows are printed to the console:
2024-07-01 21:56:07,559 INFO  [o.a.s.c.s.f.s.FakeSourceReader] [BlockingWorker-TaskGroupLocation{jobId=860156818549112833, pipelineId=1, taskGroupId=30000}] - Closed the bounded fake source
2024-07-01 21:56:07,561 INFO  [.a.s.c.s.c.s.ConsoleSinkWriter] [st-multi-table-sink-writer-1] - subtaskIndex=0  rowIndex=1:  SeaTunnelRow#tableId=fake SeaTunnelRow#kind=INSERT : hECbG, 520364021
2024-07-01 21:56:07,561 INFO  [.a.s.c.s.c.s.ConsoleSinkWriter] [st-multi-table-sink-writer-1] - subtaskIndex=1  rowIndex=1:  SeaTunnelRow#tableId=fake SeaTunnelRow#kind=INSERT : LnGDW, 105727523
2024-07-01 21:56:07,561 INFO  [.a.s.c.s.c.s.ConsoleSinkWriter] [st-multi-table-sink-writer-1] - subtaskIndex=0  rowIndex=2:  SeaTunnelRow#tableId=fake SeaTunnelRow#kind=INSERT : UYXBT, 1212484110
2024-07-01 21:56:07,561 INFO  [.a.s.c.s.c.s.ConsoleSinkWriter] [st-multi-table-sink-writer-1] - subtaskIndex=1  rowIndex=2:  SeaTunnelRow#tableId=fake SeaTunnelRow#kind=INSERT : NYiCn, 1208734703
2024-07-01 21:56:07,561 INFO  [.a.s.c.s.c.s.ConsoleSinkWriter] [st-multi-table-sink-writer-1] - subtaskIndex=0  rowIndex=3:  SeaTunnelRow#tableId=fake SeaTunnelRow#kind=INSERT : cSZan, 151817804
Job statistics:
***********************************************
           Job Statistic Information
***********************************************
Start Time                : 2024-07-01 21:56:06
End Time                  : 2024-07-01 21:56:08
Total Time(s)             :                   2
Total Read Count          :                  32
Total Write Count         :                  32
Total Failed Count        :                   0
***********************************************
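
The parsed configuration in the log above corresponds to a HOCON job file roughly like the one below. This is a sketch reconstructed from that log, not a copy of the shipped template, and the file name `fake_to_console.conf` is a placeholder chosen for this example. The last few lines work out why the statistics report 32 rows rather than 16.

```shell
# Hypothetical file name; the post itself passes ./config/v2.batch.config.
CONFIG_FILE="${CONFIG_FILE:-./fake_to_console.conf}"

# Write a job config equivalent to the parsed JSON shown in the log.
cat > "$CONFIG_FILE" <<'EOF'
env {
  parallelism = 2
  job.mode = "BATCH"
  checkpoint.interval = 10000
}

source {
  FakeSource {
    parallelism = 2
    result_table_name = "fake"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}

sink {
  Console {
  }
}
EOF

# Each of the 2 parallel FakeSource readers generates row.num rows.
parallelism=2
row_num=16
expected_rows=$((parallelism * row_num))
echo "Expected total rows: $expected_rows"
```

The file can then be passed to the same command as before, for example `bin/seatunnel.sh --config ./fake_to_console.conf -e local`.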

2. Using the Flink engine

The test run above produces the following log line:

 Discovery plugin jar for: PluginIdentifier{engineType='seatunnel', pluginType='source', pluginName='FakeSource'

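This shows that by default the job runs on the SeaTunnel engine, which the project calls Zeta. We will set that aside for now and first look at how the job runs on the Flink engine.

Download and install Flink 1.17
https://nightlies.apache.org/flink/flink-docs-stable/docs/try-flink/local_installation/

Start a local cluster

➜  flink bin/start-cluster.sh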
Starting cluster.
Starting standalonesession daemon on host MacBook-Pro-2.local.
Starting taskexecutor daemon on host MacBook-Pro-2.local.

Configure environment variables

➜  config cat seatunnel-env.sh
# Home directory of spark distribution.
SPARK_HOME=${SPARK_HOME:-/Users/mac/apps/spark}
# Home directory of flink distribution.
FLINK_HOME=${FLINK_HOME:-/Users/mac/apps/flink}
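
The `${VAR:-default}` form in seatunnel-env.sh means "use the exported variable if it is set and non-empty, otherwise fall back to this path". The sketch below illustrates that behavior and checks whether the resolved directories exist. The helper `resolve_home` is made up for this example, and the fallback paths are the ones from my machine.

```shell
# ${VAR:-default} expands to $VAR when it is set and non-empty, otherwise to the default.
resolve_home() {
  # $1: current value (may be empty), $2: fallback path
  printf '%s\n' "${1:-$2}"
}

# The fallback paths below are the ones from this post; adjust them to your machine.
FLINK_HOME="$(resolve_home "${FLINK_HOME:-}" /Users/mac/apps/flink)"
SPARK_HOME="$(resolve_home "${SPARK_HOME:-}" /Users/mac/apps/spark)"

# Report whether each resolved directory actually exists.
for home in "$FLINK_HOME" "$SPARK_HOME"; do
  if [ -d "$home" ]; then
    echo "found:   $home"
  else
    echo "missing: $home (export the variable or edit seatunnel-env.sh)"
  fi
done
```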

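Set the number of task slots to at least 2
Why? The default job configuration sets a parallelism of 2, but a locally started Flink cluster offers only one task slot by default (taskmanager.numberOfTaskSlots: 1), so the job cannot run.

Task slots after a default start:

After submitting the job, the source never gets split into tasks:

This is because the job's parallelism is 2, as shown below:

So the slot count has to be raised before the job can run. The official documentation does not make this clear, so keep it in mind.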
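
One way to raise the slot count is to edit `taskmanager.numberOfTaskSlots` in the Flink configuration file. The sketch below assumes Flink 1.17, where that file is `conf/flink-conf.yaml` under `FLINK_HOME`. The helper `set_task_slots` is a made-up name for this example, and the cluster has to be restarted for the change to take effect.

```shell
# Set taskmanager.numberOfTaskSlots in a Flink config file, appending the key if absent.
set_task_slots() {
  conf_file="$1"
  slots="$2"
  tmp_file="$(mktemp)"
  awk -v n="$slots" '
    /^taskmanager\.numberOfTaskSlots:/ { print "taskmanager.numberOfTaskSlots: " n; found = 1; next }
    { print }
    END { if (!found) print "taskmanager.numberOfTaskSlots: " n }
  ' "$conf_file" > "$tmp_file" || return 1
  # Copy the content back so the original file keeps its permissions.
  cat "$tmp_file" > "$conf_file"
  rm -f "$tmp_file"
}

# Apply the change only when a Flink installation is actually present.
flink_conf="${FLINK_HOME:-}/conf/flink-conf.yaml"
if [ -f "$flink_conf" ]; then
  set_task_slots "$flink_conf" 2
  echo "Updated $flink_conf; restart the cluster (bin/stop-cluster.sh, then bin/start-cluster.sh)."
fi
```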

Run the example

➜  seatunnel bin/start-seatunnel-flink-15-connector-v2.sh --config ./config/v2.streaming.conf.template
Execute SeaTunnel Flink Job: ${FLINK_HOME}/bin/flink run -c org.apache.seatunnel.core.starter.flink.SeaTunnelFlink /Users/mac/server/apache-seatunnel-2.3.5/starter/seatunnel-flink-15-starter.jar --config ./config/v2.streaming.conf.template --name SeaTunnel
Job has been submitted with JobID 9a949409a6f218d50b66ca22cc49b9c4

Now, with the number of slots set to 2, the test goes as follows:
Open: http://localhost:8081/#/overview
The TaskManager log output is as follows:

3. Using the Spark engine

Submit command

➜  seatunnel bin/start-seatunnel-spark-3-connector-v2.sh \
--master 'local[4]' \
--deploy-mode client \
--config ./config/v2.streaming.conf.template
Execute SeaTunnel Spark Job: ${SPARK_HOME}/bin/spark-submit --class "org.apache.seatunnel.core.starter.spark.SeaTunnelSpark" --name "SeaTunnel" --master "local[4]" --deploy-mode "client" --jars "/Users/mac/server/seatunnel/lib/seatunnel-transforms-v2.jar,/Users/mac/server/seatunnel/lib/seatunnel-hadoop3-3.1.4-uber.jar,/Users/mac/server/seatunnel/connectors/connector-fake-2.3.5.jar,/Users/mac/server/seatunnel/connectors/connector-console-2.3.5.jar" --conf "job.mode=STREAMING" --conf "parallelism=2" --conf "checkpoint.interval=2000" /Users/mac/server/apache-seatunnel-2.3.5/starter/seatunnel-spark-3-starter.jar --config "./config/v2.streaming.conf.template" --master "local[4]" --deploy-mode "client" --name "SeaTunnel"

This produced an error:

2024-07-01 23:25:04,610 INFO v2.V2ScanRelationPushDown:
Pushing operators to SeaTunnelSourceTable
Pushed Filters:
Post-Scan Filters:
Output: name#0, age#1
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/connector/write/Write

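This looks like a missing class. According to the discussion in issue https://github.com/apache/seatunnel/issues/4879, Spark >= 3.2 appears to be required: the org.apache.spark.sql.connector.write.Write interface was only added in Spark 3.2.0, and my installation was 3.1.1, so there is no fix for this on that version.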

Since spark 3.2.0, buildForBatch and buildForStreaming have been deprecated in org.apache.spark.sql.connector.write.WriteBuilder. So you should keep spark version >= 3.2.0.
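
The version requirement can be checked before submitting a job. The sketch below compares dotted version strings with a small helper. `version_ge` is a made-up name for this example, and the Spark versions are typed in by hand rather than detected from the installation.

```shell
# Return success (0) when dotted version $1 is greater than or equal to $2.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, "."); m = split(b, y, ".")
    len = (n > m) ? n : m
    # Compare component by component, treating missing components as 0.
    for (i = 1; i <= len; i++) {
      xi = (i <= n) ? x[i] + 0 : 0
      yi = (i <= m) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Minimum taken from the issue comment quoted above.
required="3.2.0"
for spark_version in 3.1.1 3.2.4; do
  if version_ge "$spark_version" "$required"; then
    echo "Spark $spark_version: ok (>= $required)"
  else
    echo "Spark $spark_version: too old (< $required)"
  fi
done
```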

So I downloaded 3.2.4 (spark -> spark-3.2.4-bin-without-hadoop), and testing it surfaced a new problem.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:684)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:666)
Caused by: java.lang.ClassNotFoundException: org.apache.log4j.spi.Filter
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:359)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 7 more

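This indicates that the log4j jar is not on the classpath. Because this Spark build ships without the Hadoop dependencies, the relevant settings have to be added to spark-env.sh, as follows: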

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-1.8.jdk/Contents/Home
export HADOOP_HOME=/Users/mac/apps/hadoop
export HADOOP_CONF_DIR=/Users/mac/apps/hadoop/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/Users/mac/apps/hadoop/bin/hadoop classpath)
export SPARK_MASTER_HOST=localhost
export SPARK_MASTER_PORT=7077
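
To confirm that the Hadoop classpath really provides the missing logging classes, a rough check like the one below can help. It is only a sketch: `find_jars_on_classpath` is a made-up helper, it matches jars by file name rather than inspecting their contents, and the jar may be named differently depending on the Hadoop version.

```shell
# Print every jar reachable from a Java-style classpath whose file name contains $2.
# Handles explicit "x.jar" entries and "dir/*" wildcard entries.
find_jars_on_classpath() {
  (
    set -f        # keep "dir/*" entries literal while splitting on ':'
    IFS=':'
    for entry in $1; do
      case "$entry" in
        */\*)
          # Wildcard entry: look for matching jars directly inside the directory.
          find "${entry%/\*}" -maxdepth 1 -name "*$2*.jar" 2>/dev/null
          ;;
        *.jar)
          case "${entry##*/}" in
            *"$2"*) if [ -f "$entry" ]; then printf '%s\n' "$entry"; fi ;;
          esac
          ;;
      esac
    done
    true
  )
  return 0
}

# Run the check only where the hadoop command is available.
if command -v hadoop >/dev/null 2>&1; then
  find_jars_on_classpath "$(hadoop classpath)" log4j
else
  echo "hadoop is not on PATH; run this on the machine where spark-env.sh is used"
fi
```

If nothing is printed on a machine with Hadoop installed, the `SPARK_DIST_CLASSPATH` setting above is unlikely to fix the `NoClassDefFoundError` on its own.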

After submitting again, the result is as follows:

24/07/02 13:40:19 INFO ConfigBuilder: Parsed config file:
{"env" : {"parallelism" : 2,"job.mode" : "STREAMING","checkpoint.interval" : 2000},"source" : [{"schema" : {"fields" : {"name" : "string","age" : "int"}},"row.num" : 16,"parallelism" : 2,"result_table_name" : "fake","plugin_name" : "FakeSource"}],"sink" : [{"plugin_name" : "Console"}]
}
24/07/02 13:40:19 INFO SparkContext: Running Spark version 3.2.4
24/07/02 13:40:25 INFO FakeSourceReader: wait split!
24/07/02 13:40:25 INFO FakeSourceReader: wait split!
24/07/02 13:40:25 INFO FakeSourceSplitEnumerator: Calculated splits for table fake successfully, the size of splits is 2.
24/07/02 13:40:25 INFO FakeSourceSplitEnumerator: Calculated splits for table fake successfully, the size of splits is 2.
24/07/02 13:40:25 INFO FakeSourceSplitEnumerator: Assigned [FakeSourceSplit(tableId=fake, splitId=1, rowNum=16), FakeSourceSplit(tableId=fake, splitId=0, rowNum=16)] to 2 readers.
24/07/02 13:40:25 INFO FakeSourceSplitEnumerator: Calculated splits successfully, the size of splits is 2.
24/07/02 13:40:25 INFO FakeSourceSplitEnumerator: Assigned [FakeSourceSplit(tableId=fake, splitId=1, rowNum=16), FakeSourceSplit(tableId=fake, splitId=0, rowNum=16)] to 2 readers.
24/07/02 13:40:25 INFO FakeSourceSplitEnumerator: Assigning splits to readers 1 [FakeSourceSplit(tableId=fake, splitId=1, rowNum=16)]
24/07/02 13:40:25 INFO FakeSourceSplitEnumerator: Calculated splits successfully, the size of splits is 2.
24/07/02 13:40:25 INFO FakeSourceSplitEnumerator: Assigning splits to readers 0 [FakeSourceSplit(tableId=fake, splitId=0, rowNum=16)]
24/07/02 13:40:26 INFO FakeSourceReader: 16 rows of data have been generated in split(fake_1) for table fake. Generation time: 1719898826259
24/07/02 13:40:26 INFO FakeSourceReader: 16 rows of data have been generated in split(fake_0) for table fake. Generation time: 1719898826259
24/07/02 13:40:26 INFO ConsoleSinkWriter: subtaskIndex=1  rowIndex=1:  SeaTunnelRow#tableId= SeaTunnelRow#kind=INSERT : eMaly, 2131476727
24/07/02 13:40:26 INFO ConsoleSinkWriter: subtaskIndex=0  rowIndex=1:  SeaTunnelRow#tableId= SeaTunnelRow#kind=INSERT : Osfqi, 257240275
24/07/02 13:40:26 INFO ConsoleSinkWriter: subtaskIndex=1  rowIndex=2:  SeaTunnelRow#tableId= SeaTunnelRow#kind=INSERT : BYVKb, 730735331

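The result is as expected: a SeaTunnel streaming job submitted through Spark, with FakeSource simulating two columns and generating 16 rows per split. It does look like Spark 3.2.x is needed for this to work.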

References

https://www.modb.pro/db/605827

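Summary

This section walked through running the official example in standalone mode to get a first feel for SeaTunnel. It is quite easy to use, and its model closely mirrors DataX, which I covered earlier. A welcome bonus is that it has its own web UI for configuration,
so next I will show how to configure synchronization jobs from that UI and, time permitting, analyze the design ideas behind its well-crafted source code. A journey of a thousand miles begins with a single step: keep learning, keep growing, keep sharing. See you next time.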