DataX introduction
Official docs: https://github.com/alibaba/DataX/blob/master/introduction.md
DataX is the open-source version of Alibaba Cloud's DataWorks Data Integration, an offline data synchronization tool/platform widely used inside Alibaba Group. DataX implements efficient data synchronization between heterogeneous data sources, including MySQL, Oracle, OceanBase, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), Hologres, DRDS, Databend, and more.
DataX architecture
DataX uses a Framework + plugin architecture: a Reader plugin collects data from the source into the Framework, and a Writer plugin pulls data from the Framework and writes it to the target; the Framework handles buffering, flow control, concurrency, and data conversion.
DataX data processing flow
A Job is split into multiple Tasks, which are grouped into TaskGroups; each Task runs as a Reader -> Channel -> Writer pipeline.
DataX usage
Using DataX mainly means writing a JSON job file, in which you define a reader (how data is read) and a writer (how data is written).
Format: adapt the sample config from the official docs.
parameter:
    username: MySQL user name
    password: password
    jdbcUrl: jdbc:mysql://host:port/database
table: name of the table to read
{
  "job": {
    "setting": {
      "speed": {"channel": 3},
      "errorLimit": {"record": 0, "percentage": 0.02}
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123456",
            "column": ["id", "name"],
            "splitPk": "db_id",
            "connection": [
              {
                "table": ["table"],
                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/database"]
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {"print": true}
        }
      }
    ]
  }
}
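The same job structure can also be assembled programmatically, which helps avoid brace-matching mistakes in hand-written configs. A minimal Python sketch (the credentials, URL, and table name are the placeholder values from the template, and the helper function is hypothetical, not a DataX API):

```python
import json

def mysql_to_stream_job(user, password, jdbc_url, table, columns, channels=3):
    """Build a minimal DataX job config: mysqlreader -> streamwriter."""
    return {
        "job": {
            "setting": {
                "speed": {"channel": channels},
                "errorLimit": {"record": 0, "percentage": 0.02},
            },
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": user,
                        "password": password,
                        "column": columns,
                        "connection": [{"table": [table], "jdbcUrl": [jdbc_url]}],
                    },
                },
                "writer": {"name": "streamwriter", "parameter": {"print": True}},
            }],
        }
    }

job = mysql_to_stream_job("root", "123456",
                          "jdbc:mysql://127.0.0.1:3306/database",
                          "table", ["id", "name"])
# Serialize to the JSON file that datax.py consumes
print(json.dumps(job, indent=2))
```

Dumping the dict with `json.dumps` guarantees the output is valid JSON.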
Basic usage
Read MySQL data and print the result to the terminal.
-- create the database and table in MySQL
create database itcast charset=utf8;
use itcast;
create table student(
    id int,
    name varchar(20),
    age int,
    gender varchar(20)
);
insert into student values(1,'張三',20,'男'),(2,'李四',21,'男'),(3,'王五',19,'男'),(4,'趙六',22,'男');
Write the DataX JSON job file
{
  "job": {
    "setting": {
      "speed": {"channel": 3},
      "errorLimit": {"record": 0, "percentage": 0.02}
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123456",
            "column": ["id", "name", "age", "gender"],
            "splitPk": "id",
            "connection": [
              {
                "table": ["student"],
                "jdbcUrl": ["jdbc:mysql://192.168.88.80:3306/itcast"]
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {"print": true}
        }
      }
    ]
  }
}
Create the JSON file under DataX's job directory
cd /export/server/datax/job/
Run the job described in the JSON file
cd /export/server/datax/bin
python datax.py ../job/mysql_data.json
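If you launch jobs from a script rather than an interactive shell, a thin wrapper can keep the paths in one place. A sketch assuming the install location used in these notes (`/export/server/datax`); the helper name is hypothetical:

```python
import os

def datax_command(job_json, datax_home="/export/server/datax"):
    """Assemble the command line that launches a DataX job.

    Assumes the datax.py entry point lives under <datax_home>/bin,
    as in the install layout used in these notes.
    """
    return ["python", os.path.join(datax_home, "bin", "datax.py"), job_json]

cmd = datax_command("/export/server/datax/job/mysql_data.json")
# e.g. subprocess.run(cmd, check=True) to raise on a non-zero exit code
```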
Reading MySQL data with a SQL query
A SQL statement can filter and select the data to be read.
querySql: write the SELECT statement with its filter conditions here; it replaces the table/column settings.
{
  "job": {
    "setting": {"speed": {"channel": 1}},
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123456",
            "connection": [
              {
                "querySql": ["select * from student where id>=3;"],
                "jdbcUrl": ["jdbc:mysql://192.168.88.80:3306/itcast"]
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {"print": true, "encoding": "UTF-8"}
        }
      }
    ]
  }
}
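In querySql mode the SELECT statement carries the column list and filter itself, so the reader section no longer needs `column`, `table`, or `splitPk`. A small Python sketch of such a reader section (the credentials and host are the sample values from above; the helper function is hypothetical):

```python
def query_reader(user, password, jdbc_url, query):
    """mysqlreader section in querySql mode: the SELECT statement
    replaces the table/column/splitPk settings."""
    return {
        "name": "mysqlreader",
        "parameter": {
            "username": user,
            "password": password,
            "connection": [{"querySql": [query], "jdbcUrl": [jdbc_url]}],
        },
    }

reader = query_reader("root", "123456",
                      "jdbc:mysql://192.168.88.80:3306/itcast",
                      "select * from student where id >= 3;")
```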
Importing MySQL data into HDFS
Read from MySQL and write the data to HDFS.
{
  "job": {
    "setting": {"speed": {"channel": 1}},
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123456",
            "column": ["id", "name", "age", "gender"],
            "splitPk": "id",
            "connection": [
              {
                "table": ["student"],
                "jdbcUrl": ["jdbc:mysql://192.168.88.80:3306/itcast"]
              }
            ]
          }
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "defaultFS": "hdfs://192.168.88.80:8020",
            "fileType": "text",
            "path": "/data",
            "fileName": "student",
            "column": [
              {"name": "id", "type": "int"},
              {"name": "name", "type": "string"},
              {"name": "age", "type": "int"},
              {"name": "gender", "type": "string"}
            ],
            "writeMode": "append",
            "fieldDelimiter": "\t"
          }
        }
      }
    ]
  }
}
Importing with a SQL query requires specifying JDBC connection parameters.
When the data contains Chinese characters, add these parameters to the URL:
jdbc:mysql://192.168.88.80:3306/itcast?useSSL=false&characterEncoding=utf8
{
  "job": {
    "setting": {"speed": {"channel": 1}},
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123456",
            "connection": [
              {
                "querySql": ["select * from student where gender='男';"],
                "jdbcUrl": ["jdbc:mysql://192.168.88.80:3306/itcast?useSSL=false&characterEncoding=utf8"]
              }
            ]
          }
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "defaultFS": "hdfs://192.168.88.80:8020",
            "fileType": "text",
            "path": "/data",
            "fileName": "student",
            "column": [
              {"name": "id", "type": "int"},
              {"name": "name", "type": "string"},
              {"name": "age", "type": "int"},
              {"name": "gender", "type": "string"}
            ],
            "writeMode": "append",
            "fieldDelimiter": "\t"
          }
        }
      }
    ]
  }
}
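The driver parameters are an ordinary query string appended to the JDBC URL, so they can be built with standard URL encoding. A Python sketch (the helper name is hypothetical):

```python
from urllib.parse import urlencode

def mysql_jdbc_url(host, port, database, **params):
    """Build a MySQL JDBC URL, appending driver parameters
    (e.g. useSSL, characterEncoding) as a query string."""
    url = f"jdbc:mysql://{host}:{port}/{database}"
    if params:
        url += "?" + urlencode(params)
    return url

url = mysql_jdbc_url("192.168.88.80", 3306, "itcast",
                     useSSL="false", characterEncoding="utf8")
```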
Importing MySQL data into a Hive table
A Hive table consists of two parts:
the table metadata, managed by Hive's metastore
the table row data, stored as files on HDFS
So importing into a Hive table is essentially importing the MySQL data into HDFS, under the Hive table's directory.
1 - Start the Hive services: metastore and hiveserver2
2 - Configure a DataGrip connection
the driver can be downloaded online, or use one downloaded in advance
3 - Create the Hive table
show databases;
create database itcast;
use itcast;
create table stu(
    id int,
    name string,
    age int,
    gender string
) row format delimited fields terminated by ',';
select * from stu;
4 - Import the data: in essence, write it into the table's directory on HDFS
Write the JSON job file
{
  "job": {
    "setting": {"speed": {"channel": 1}},
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123456",
            "column": ["id", "name", "age", "gender"],
            "splitPk": "id",
            "connection": [
              {
                "table": ["student"],
                "jdbcUrl": ["jdbc:mysql://192.168.88.80:3306/itcast"]
              }
            ]
          }
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "defaultFS": "hdfs://192.168.88.80:8020",
            "fileType": "text",
            "path": "/user/hive/warehouse/itcast.db/stu",
            "fileName": "stu",
            "column": [
              {"name": "id", "type": "int"},
              {"name": "name", "type": "string"},
              {"name": "age", "type": "int"},
              {"name": "gender", "type": "string"}
            ],
            "writeMode": "append",
            "fieldDelimiter": ","
          }
        }
      }
    ]
  }
}
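The `path` above follows Hive's default warehouse layout: `<warehouse>/<database>.db/<table>`. A small sketch that derives it (assumes the default warehouse directory `/user/hive/warehouse`; a non-default `hive.metastore.warehouse.dir` would change the prefix):

```python
def hive_table_path(database, table, warehouse="/user/hive/warehouse"):
    """Default HDFS directory of a Hive managed table:
    <warehouse>/<database>.db/<table>."""
    return f"{warehouse}/{database}.db/{table}"

path = hive_table_path("itcast", "stu")
# use as the hdfswriter "path" value in the job config
```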
datax-web introduction
https://github.com/WeiYe-Jing/datax-web
datax-web is a secondary development built on top of DataX. It provides a visual web UI that makes it easy for developers to define DataX jobs, and it generates the JSON job files automatically.
Using datax-web
Start the services
/export/server/datax-web-2.1.2/bin/start-all.sh
Open the web page
http://hadoop01:9527/index.html
Usage
Create a project
Create the data source connections
Generate a task from the task-management template
scheduled execution can be configured here
Generate the DataX task
Run the task
Scheduled execution