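Contents
I. Software Introduction
1. Software Overview
2. Development History
3. Glossary
4. Module Overview
II. Software Deployment
1. Download the Release Package
2. Upload and Extract
3. Configure the Metadata Database
4. Start
5. Browser Verification
III. Common Tables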
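I. Software Introduction
1. Software Overview
Apache DolphinScheduler is a distributed, easily extensible, open-source system for scheduling visual DAG workflow tasks. It targets enterprise scenarios and provides a solution for visually operating tasks, workflows, and the full life cycle of data processing.
Apache DolphinScheduler aims to untangle complex dependencies between big data tasks and to trigger the relationships between data and the various OPS orchestrations of applications. It addresses two common problems in data development: ETL dependencies that are intricate and hard to follow, and the inability to monitor the health of tasks. DolphinScheduler assembles tasks as a DAG (Directed Acyclic Graph) in a streaming fashion, monitors task execution status in a timely manner, and supports operations such as retry, recovery from a specified failed node, pause, resume, and kill.
Official website: https://dolphinscheduler.apache.org/zh-cn
Official documentation: https://dolphinscheduler.apache.org/zh-cn/docs/3.2.2
The main features of DolphinScheduler are:
- Easy to deploy: four deployment modes are provided, namely Standalone, Cluster, Docker, and Kubernetes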
- Easy to use: workflows can be created and managed in four ways, namely Web UI, Python SDK, YAML file, and Open API
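- Highly reliable and highly available: a decentralized multi-master, multi-worker architecture with native support for horizontal scaling
- High performance: the project describes itself as N times faster than other orchestration platforms (no specific factor is given) and as able to support tens of millions of tasks per day
- Cloud native: DolphinScheduler can orchestrate workflows across multiple clouds and data centers, and supports custom task types
- Version control for workflows and workflow instances (including tasks)
- Multiple state controls for workflows and tasks, which can be paused, stopped, or resumed at any time
- Multi-tenancy support
- Other features such as backfill support (native in the Web UI) and permission control for projects and data sources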
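2. Development History
In March 2019, Analysys (易觀) decided to open source the DolphinScheduler project and released the first open-source version (1.0.0) on GitHub. After the release, the project attracted the attention of many developers and a community gradually formed, laying the foundation for its later development.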
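In August 2019, DolphinScheduler entered incubation at the Apache Software Foundation, a sign that the project had gained broader recognition and support. During incubation the project followed the Apache open-source governance model and attracted contributors from many different companies and organizations, which further accelerated its development.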
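In April 2021, the Apache Software Foundation announced that DolphinScheduler had graduated from the Apache Incubator and become an Apache Top-Level Project (TLP). This was an important milestone: it meant that DolphinScheduler had reached a high standard in technology, community, and governance, and had been fully recognized by the Apache Software Foundation.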
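3. Glossary
DAG: Short for Directed Acyclic Graph. The tasks in a workflow are assembled as a directed acyclic graph, which is traversed topologically starting from the nodes with zero in-degree until no successor nodes remain. An example is shown in the figure below:
[Figure: example DAG]
Process definition: the visual DAG formed by dragging task nodes onto the canvas and connecting them
Process instance: an instantiation of a process definition, created either by a manual start or by scheduled triggering. Each run of a process definition produces one process instance
Task instance: an instantiation of a task node in a process definition; it identifies one specific task execution
Task type: currently SHELL, SQL, SUB_WORKFLOW (sub-workflow), PROCEDURE, MR, SPARK, PYTHON, and DEPENDENT (dependency) are supported, and dynamic plugin extension is planned. Note: a SUB_WORKFLOW task must be associated with another process definition, and that associated process definition can also be started and run on its own
Scheduling: the system supports timed scheduling based on cron expressions as well as manual scheduling. Supported command types are: start workflow, start execution from the current node, recover a fault-tolerant workflow, resume a paused process, start execution from the failed node, backfill, timed scheduling, rerun, pause, stop, and resume a waiting thread. The two command types "recover a fault-tolerant workflow" and "resume a waiting thread" are used internally by the scheduler and cannot be called externally
Timed scheduling: the system uses the Quartz distributed scheduler and also supports generating cron expressions visually
Dependency: beyond the simple predecessor and successor dependencies within a DAG, the system also provides task dependency nodes, which support custom task dependencies across processes
Priority: priorities are supported for both process instances and task instances. If no priority is set, the default is first in, first out
Email alert: supports emailing the query results of SQL tasks, alerting on process instance run results, and sending fault-tolerance alert notifications
Failure strategy: for tasks running in parallel, two strategies are available when a task fails. "Continue" means the remaining parallel tasks keep running regardless of the failure, and the process ends as failed once they finish. "End" means that as soon as a failed task is detected, the parallel tasks still running are killed and the process ends as failed
Backfill: fills in historical data. Both parallel and serial backfill over a range are supported, and dates can be selected either as a date range or as an enumerated list of dates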
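The topological traversal described in the DAG entry can be tried out with the coreutils `tsort` command. This is only a sketch: the four task names (extract, clean, aggregate, report) are hypothetical, and nothing here suggests DolphinScheduler itself uses `tsort`. It simply shows that a node is visited only after all of its upstream nodes.

```shell
#!/usr/bin/env bash
# Hypothetical workflow: each line is "upstream downstream".
edges='extract clean
extract aggregate
clean report
aggregate report'

# tsort prints one valid topological order of the graph: a node is
# printed only after every node it depends on has been printed.
topo_order() {
  printf '%s\n' "$1" | tsort
}

topo_order "$edges"
```

`extract` is the only node with zero in-degree, so it is printed first, and `report` is printed last. The relative order of `clean` and `aggregate` is not fixed, because both become ready at the same time.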
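4. Module Overview
- dolphinscheduler-master: the master module, which provides workflow management and orchestration services
- dolphinscheduler-worker: the worker module, which provides task execution management services
- dolphinscheduler-alert: the alert module, which provides the AlertServer service
- dolphinscheduler-api: the web application module, which provides the ApiServer service
- dolphinscheduler-common: shared constants, enums, utility classes, data structures, and base classes
- dolphinscheduler-dao: database access and related operations
- dolphinscheduler-extract: the extract module, which contains the SDKs for master, worker, and alert
- dolphinscheduler-service: the service module, which contains the Quartz, Zookeeper, and log client access services used by the server and api modules
- dolphinscheduler-ui: the front-end module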
II. Software Deployment
Installation reference: https://dolphinscheduler.apache.org/en-us/docs/3.2.2/guide/installation/standalone
1. Download the Release Package
Download page: https://dolphinscheduler.apache.org/zh-cn/download/3.2.2
2. Upload and Extract
cd /usr/local/soft/
tar -zxvf apache-dolphinscheduler-3.2.2-bin.tar.gz
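A slightly more defensive variant of the two commands above, as a sketch. The function name `extract_release` is hypothetical; it only checks that the archive exists before unpacking it into the target directory.

```shell
#!/usr/bin/env bash
# extract_release ARCHIVE DEST
# Unpacks a .tar.gz archive into DEST, failing early if the archive is missing.
extract_release() {
  local archive="$1" dest="$2"
  [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
  mkdir -p "$dest" && tar -zxf "$archive" -C "$dest"
}

# Usage with the paths from this guide:
# extract_release /usr/local/soft/apache-dolphinscheduler-3.2.2-bin.tar.gz /usr/local/soft
```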
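3. Configure the Metadata Database
In Standalone mode, DolphinScheduler uses an embedded database by default; see /usr/local/soft/apache-dolphinscheduler-3.2.2-bin/standalone-server/conf/application.yaml. This walkthrough uses an external MySQL database instead, so that the metadata can be queried and modified directly later on
a. Add the MySQL driver dependency
Place mysql-connector-java-8.0.15.jar in the /usr/local/soft/apache-dolphinscheduler-3.2.2-bin/standalone-server/libs/standalone-server/ directory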
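The copy step can be sketched as a small helper. The function name `install_jdbc_driver` is hypothetical, and the usage line assumes the driver jar sits in the current directory, which the original text does not state.

```shell
#!/usr/bin/env bash
# install_jdbc_driver JAR LIBS_DIR
# Copies the JDBC driver jar into the libs directory, checking both paths first.
install_jdbc_driver() {
  local jar="$1" libs_dir="$2"
  [ -f "$jar" ] || { echo "driver jar not found: $jar" >&2; return 1; }
  [ -d "$libs_dir" ] || { echo "libs directory not found: $libs_dir" >&2; return 1; }
  cp "$jar" "$libs_dir"/
}

# Usage (jar location is an assumption):
# install_jdbc_driver ./mysql-connector-java-8.0.15.jar \
#   /usr/local/soft/apache-dolphinscheduler-3.2.2-bin/standalone-server/libs/standalone-server
```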
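b. Modify the database parameters
Edit the database parameters in /usr/local/soft/apache-dolphinscheduler-3.2.2-bin/standalone-server/conf/application.yaml. The relevant keys are spring.profiles.active, spring.sql.init.schema-locations, and spring.datasource.*, both in the main section and in the mysql profile at the end of the file. For reference: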
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

spring:
  profiles:
    active: mysql
  jackson:
    time-zone: UTC
    date-format: "yyyy-MM-dd HH:mm:ss"
  banner:
    charset: UTF-8
  sql:
    init:
      schema-locations: classpath:sql/dolphinscheduler_mysql.sql
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://node11:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8
    username: root
    password: root123
  quartz:
    job-store-type: jdbc
    jdbc:
      initialize-schema: never
    properties:
      org.quartz.threadPool.threadPriority: 5
      org.quartz.jobStore.isClustered: true
      org.quartz.jobStore.class: org.springframework.scheduling.quartz.LocalDataSourceJobStore
      org.quartz.scheduler.instanceId: AUTO
      org.quartz.jobStore.tablePrefix: QRTZ_
      org.quartz.jobStore.acquireTriggersWithinLock: true
      org.quartz.scheduler.instanceName: DolphinScheduler
      org.quartz.threadPool.class: org.quartz.simpl.SimpleThreadPool
      org.quartz.jobStore.useProperties: false
      org.quartz.threadPool.makeThreadsDaemons: true
      org.quartz.threadPool.threadCount: 25
      org.quartz.jobStore.misfireThreshold: 60000
      org.quartz.scheduler.makeSchedulerThreadDaemon: true
      org.quartz.jobStore.driverDelegateClass: org.quartz.impl.jdbcjobstore.StdJDBCDelegate
      org.quartz.jobStore.clusterCheckinInterval: 5000
      org.quartz.scheduler.batchTriggerAcquisitionMaxCount: 1
  servlet:
    multipart:
      max-file-size: 1024MB
      max-request-size: 1024MB
  messages:
    basename: i18n/messages
  jpa:
    hibernate:
      ddl-auto: none
  mvc:
    pathmatch:
      matching-strategy: ANT_PATH_MATCHER
  cloud.discovery.client.composite-indicator.enabled: false

mybatis-plus:
  mapper-locations: classpath:org/apache/dolphinscheduler/dao/mapper/*Mapper.xml
  type-aliases-package: org.apache.dolphinscheduler.dao.entity
  configuration:
    cache-enabled: false
    call-setters-on-nulls: true
    map-underscore-to-camel-case: true
    jdbc-type-for-null: NULL
  global-config:
    db-config:
      id-type: auto
    banner: false

registry:
  type: zookeeper
  zookeeper:
    namespace: dolphinscheduler
    connect-string: localhost:2181
    retry-policy:
      base-sleep-time: 60ms
      max-sleep: 300ms
      max-retries: 5
    session-timeout: 30s
    connection-timeout: 9s
    block-until-connected: 600ms
    digest: ~

security:
  authentication:
    # Authentication types (supported types: PASSWORD,LDAP,CASDOOR_SSO)
    type: PASSWORD
    # IF you set type `LDAP`, below config will be effective
    ldap:
      # ldap server config
      urls: ldap://ldap.forumsys.com:389/
      base-dn: dc=example,dc=com
      username: cn=read-only-admin,dc=example,dc=com
      password: password
      user:
        # admin userId when you use LDAP login
        admin: read-only-admin
        identity-attribute: uid
        email-attribute: mail
        # action when ldap user is not exist (supported types: CREATE,DENY)
        not-exist-action: CREATE
      ssl:
        enable: false
        # jks file absolute path && password
        trust-store: "/ldapkeystore.jks"
        trust-store-password: ""
    casdoor:
      user:
        admin: admin
    oauth2:
      enable: false
      provider:
        github:
          authorizationUri: "https://github.com/login/oauth/authorize"
          redirectUri: "http://localhost:12345/dolphinscheduler/redirect/login/oauth2"
          clientId: ""
          clientSecret: ""
          tokenUri: "https://github.com/login/oauth/access_token"
          userInfoUri: "https://api.github.com/user"
          callbackUrl: "http://localhost:5173/login"
          iconUri: ""
          provider: github
        gitee:
          authorizationUri: "https://gitee.com/oauth/authorize"
          redirectUri: "http://127.0.0.1:12345/dolphinscheduler/redirect/login/oauth2"
          clientId: ""
          clientSecret: ""
          tokenUri: "https://gitee.com/oauth/token?grant_type=authorization_code"
          userInfoUri: "https://gitee.com/api/v5/user"
          callbackUrl: "http://127.0.0.1:5173/login"
          iconUri: ""
          provider: gitee

casdoor:
  # Your Casdoor server url
  endpoint: http://localhost:8000
  client-id: ""
  client-secret: ""
  # The certificate may be multi-line, you can use `|-` for ease
  certificate: ""
  # Your organization name added in Casdoor
  organization-name: built-in
  # Your application name added in Casdoor
  application-name: dolphinscheduler
  # Doplhinscheduler login url
  redirect-url: http://localhost:5173/login

master:
  listen-port: 5678
  # master prepare execute thread number to limit handle commands in parallel
  pre-exec-threads: 10
  # master execute thread number to limit process instances in parallel
  exec-threads: 10
  # master dispatch task number per batch
  dispatch-task-number: 3
  # master host selector to select a suitable worker, default value: LowerWeight. Optional values include random, round_robin, lower_weight
  host-selector: lower_weight
  # master heartbeat interval
  max-heartbeat-interval: 10s
  # master commit task retry times
  task-commit-retry-times: 5
  # master commit task interval
  task-commit-interval: 1s
  state-wheel-interval: 5s
  server-load-protection:
    enabled: true
    # Master max system cpu usage, when the master's system cpu usage is smaller then this value, master server can execute workflow.
    max-system-cpu-usage-percentage-thresholds: 0.9
    # Master max jvm cpu usage, when the master's jvm cpu usage is smaller then this value, master server can execute workflow.
    max-jvm-cpu-usage-percentage-thresholds: 0.9
    # Master max System memory usage , when the master's system memory usage is smaller then this value, master server can execute workflow.
    max-system-memory-usage-percentage-thresholds: 0.9
    # Master max disk usage , when the master's disk usage is smaller then this value, master server can execute workflow.
    max-disk-usage-percentage-thresholds: 0.9
  # failover interval
  failover-interval: 10m
  # kill yarn/k8s application when failover taskInstance, default true
  kill-application-when-task-failover: true
  worker-group-refresh-interval: 10s
  command-fetch-strategy:
    type: ID_SLOT_BASED
    config:
      # The incremental id step
      id-step: 1
      # master fetch command num
      fetch-size: 10

worker:
  # worker listener port
  listen-port: 1234
  # worker execute thread number to limit task instances in parallel
  exec-threads: 10
  # worker heartbeat interval
  max-heartbeat-interval: 10s
  # worker host weight to dispatch tasks, default value 100
  host-weight: 100
  server-load-protection:
    enabled: true
    # Worker max system cpu usage, when the worker's system cpu usage is smaller then this value, worker server can be dispatched tasks.
    max-system-cpu-usage-percentage-thresholds: 0.9
    # Worker max jvm cpu usage, when the worker's jvm cpu usage is smaller then this value, worker server can be dispatched tasks.
    max-jvm-cpu-usage-percentage-thresholds: 0.9
    # Worker max System memory usage , when the worker's system memory usage is smaller then this value, worker server can be dispatched tasks.
    max-system-memory-usage-percentage-thresholds: 0.9
    # Worker max disk usage , when the worker's disk usage is smaller then this value, worker server can be dispatched tasks.
    max-disk-usage-percentage-thresholds: 0.9
  task-execute-threads-full-policy: REJECT
  tenant-config:
    # tenant corresponds to the user of the system, which is used by the worker to submit the job. If system does not have this user, it will be automatically created after the parameter worker.tenant.auto.create is true.
    auto-create-tenant-enabled: true
    # Scenes to be used for distributed users. For example, users created by FreeIpa are stored in LDAP. This parameter only applies to Linux, When this parameter is true, worker.tenant.auto.create has no effect and will not automatically create tenants.
    distributed-tenant: false
    # If set true, will use worker bootstrap user as the tenant to execute task when the tenant is `default`;
    default-tenant-enabled: true

alert:
  port: 50052
  # Mark each alert of alert server if late after x milliseconds as failed.
  # Define value is (0 = infinite), and alert server would be waiting alert result.
  wait-timeout: 0
  max-heartbeat-interval: 60s
  # The maximum number of alerts that can be processed in parallel
  sender-parallelism: 5

api:
  audit-enable: false
  # Traffic control, if you turn on this config, the maximum number of request/s will be limited.
  # global max request number per second
  # default tenant-level max request number
  traffic-control:
    global-switch: false
    max-global-qps-rate: 300
    tenant-switch: false
    default-tenant-qps-rate: 10
    #customize-tenant-qps-rate:
    # eg.
    #tenant1: 11
    #tenant2: 20
  python-gateway:
    # Weather enable python gateway server or not. The default value is true.
    enabled: true
    # Authentication token for connection from python api to python gateway server. Should be changed the default value
    # when you deploy in public network.
    auth-token: jwUDzpLsNKEFER4*a8gruBH_GsAurNxU7A@Xc
    # The address of Python gateway server start. Set its value to `0.0.0.0` if your Python API run in different
    # between Python gateway server. It could be be specific to other address like `127.0.0.1` or `localhost`
    gateway-server-address: 0.0.0.0
    # The port of Python gateway server start. Define which port you could connect to Python gateway server from
    # Python API side.
    gateway-server-port: 25333
    # The address of Python callback client.
    python-address: 127.0.0.1
    # The port of Python callback client.
    python-port: 25334
    # Close connection of socket server if no other request accept after x milliseconds. Define value is (0 = infinite),
    # and socket server would never close even though no requests accept
    connect-timeout: 0
    # Close each active connection of socket server if python program not active after x milliseconds. Define value is
    # (0 = infinite), and socket server would never close even though no requests accept
    read-timeout: 0

server:
  port: 12345
  servlet:
    session:
      timeout: 120m
    context-path: /dolphinscheduler/
  compression:
    enabled: true
    mime-types: text/html,text/xml,text/plain,text/css,text/javascript,application/javascript,application/json,application/xml
  jetty:
    max-http-form-post-size: 5000000
    accesslog:
      enabled: true
      custom-format: '%{client}a - %u %t "%r" %s %O %{ms}Tms'

management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus
  endpoint:
    health:
      enabled: true
      show-details: always
  health:
    db:
      enabled: true
    defaults:
      enabled: false
  metrics:
    tags:
      application: ${spring.application.name}

metrics:
  enabled: true

# Override by profile
---
spring:
  config:
    activate:
      on-profile: postgresql
  quartz:
    properties:
      org.quartz.jobStore.driverDelegateClass: org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
  datasource:
    driver-class-name: org.postgresql.Driver
    url: jdbc:postgresql://127.0.0.1:5432/dolphinscheduler
    username: root
    password: root
---
spring:
  config:
    activate:
      on-profile: mysql
  sql:
    init:
      schema-locations: classpath:sql/dolphinscheduler_mysql.sql
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://node11:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8
    username: root
    password: root123
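As an alternative sketch, Spring Boot's relaxed binding lets environment variables override the spring.profiles.active and spring.datasource.* keys, and DolphinScheduler's bin/env/dolphinscheduler_env.sh is the usual place to export them. The values below simply mirror the MySQL settings above. This has not been verified against this exact installation, so treat it as an assumption to test rather than a required step.

```shell
# Mirrors the mysql settings shown above (host node11, user root).
export DATABASE=mysql
export SPRING_PROFILES_ACTIVE="${DATABASE}"
export SPRING_DATASOURCE_URL="jdbc:mysql://node11:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8"
export SPRING_DATASOURCE_USERNAME=root
export SPRING_DATASOURCE_PASSWORD=root123
```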
c. Create the database and tables
Log in to MySQL, create a database named dolphinscheduler, and then import the table definitions from /usr/local/soft/apache-dolphinscheduler-3.2.2-bin/standalone-server/conf/sql/dolphinscheduler_mysql.sql into it
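The step above can be sketched with the `mysql` command-line client. The helper names `create_db_sql` and `init_metadata_db` are hypothetical, the host and user are taken from the JDBC settings in this guide, and the character set and collation are assumptions not stated in the original text.

```shell
#!/usr/bin/env bash
# Values taken from this guide; override through the environment if needed.
MYSQL_HOST="${MYSQL_HOST:-node11}"
MYSQL_USER="${MYSQL_USER:-root}"
DS_HOME="${DS_HOME:-/usr/local/soft/apache-dolphinscheduler-3.2.2-bin}"
SQL_FILE="$DS_HOME/standalone-server/conf/sql/dolphinscheduler_mysql.sql"

# Print the statement that creates the metadata database.
# Character set and collation are assumptions.
create_db_sql() {
  printf 'CREATE DATABASE IF NOT EXISTS %s DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;\n' "$1"
}

# Create the database, then import the table definitions.
# The client prompts for the password once per command.
init_metadata_db() {
  local db="$1"
  create_db_sql "$db" | mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -p || return 1
  mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -p "$db" < "$SQL_FILE"
}

# Usage:
# init_metadata_db dolphinscheduler
```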
4. Start
cd /usr/local/soft/apache-dolphinscheduler-3.2.2-bin
./bin/dolphinscheduler-daemon.sh start standalone-server
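The same daemon script also accepts `stop` and `status`. A thin wrapper, sketched below, keeps the three actions in one place. The function name `ds_daemon` is hypothetical, and the install path is the one used in this guide.

```shell
#!/usr/bin/env bash
DS_HOME="${DS_HOME:-/usr/local/soft/apache-dolphinscheduler-3.2.2-bin}"

# ds_daemon start|stop|status
# Runs the standalone server's daemon script from the install directory.
ds_daemon() {
  local action="$1"
  case "$action" in
    start|stop|status) ;;
    *) echo "usage: ds_daemon start|stop|status" >&2; return 2 ;;
  esac
  (cd "$DS_HOME" && bash ./bin/dolphinscheduler-daemon.sh "$action" standalone-server)
}

# Usage:
# ds_daemon status
```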
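5. Browser Verification
Open the following address in a browser: http://node11:12345/dolphinscheduler/ui/login
Username: admin, password: dolphinscheduler123
Click Log in and check that the interface loads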
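Before opening a browser, the login page can be probed from the command line. This is a sketch: the helper names `ui_login_url` and `check_ui` are hypothetical, and the host and port are the ones used in this guide.

```shell
#!/usr/bin/env bash
# Build the login URL from a host and an optional port (default 12345).
ui_login_url() {
  printf 'http://%s:%s/dolphinscheduler/ui/login\n' "$1" "${2:-12345}"
}

# Print only the HTTP status code returned by the login page.
check_ui() {
  curl -s -o /dev/null -w '%{http_code}\n' "$(ui_login_url "$@")"
}

# Usage:
# check_ui node11
```

A 200 status suggests the API server and UI are up; any other result is a cue to look at the standalone server's logs.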
III. Common Tables
Delete instances (this clears every record in the task instance table)
SET FOREIGN_KEY_CHECKS = 0;
TRUNCATE TABLE dolphinscheduler.t_ds_task_instance;
SET FOREIGN_KEY_CHECKS = 1;
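Because TRUNCATE cannot be undone, it can help to generate the statements first, review them, and only then pipe them to the client. The helper name `truncate_sql` is hypothetical, and the connection options in the usage line are assumptions based on the settings earlier in this guide.

```shell
#!/usr/bin/env bash
# truncate_sql DB TABLE
# Prints the three statements shown above for the given database and table.
truncate_sql() {
  local db="$1" table="$2"
  printf 'SET FOREIGN_KEY_CHECKS = 0;\nTRUNCATE TABLE %s.%s;\nSET FOREIGN_KEY_CHECKS = 1;\n' "$db" "$table"
}

# Review the output, then run it:
# truncate_sql dolphinscheduler t_ds_task_instance | mysql -h node11 -u root -p
```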