clickhouse數據庫表和doris數據庫表遷移starrocks數據庫時建表注意事項總結

- 零、前言
- 一、clickhouse數據庫表在starrocks數據庫建表時問題總結
- - 1.1 數據類型類問題：
  - 1.2 數據導出階段：
- 二、doris 數據庫表在starrocks數據庫建表時問題總結
- - 2.1 properties不支持的屬性（直接刪除）：
  - 2.2 properties需修改屬性
  - 2.3 properties：doris建表語句分區明細，starrocks數據不需要明確設定，會自動更新
  - 2.4 分桶設置問題
  - 2.5 索引設置問題
  - 2.6 python程序針對上述問題解決程序
- 三、小結

零、前言

這可能是個技術選型的問題，前期各自為營，所以整個企業內部，數據庫類型五花八門，相互之間數據也很難對齊引用。隨著業務等等大勢所趨，需要整合全公司資源，因此由來了這個數據資源遷移。針對遷移過程中的建表方面，作一總結，以備后續問題原因排查和進一步整改，讓過程有跡可循。

一、clickhouse數據庫表在starrocks數據庫建表時問題總結

1.1 數據類型類問題：

字符類型varchar需要指定位數，默認位數為1，會導致導入數據為空；
其它的類型映射遇到的映射關系，遇到的類型處理映射關系整理如下，其中key為clickhouse中的數據類型，value為對應的starrocks數據庫類型：

{
‘Date’: ‘DATE’,
‘Nullable(Date)’: ‘DATE NULL’,
‘DateTime’: ‘DATETIME’,
‘Nullable(DateTime)’: ‘DATETIME NULL’,
‘Float64’: ‘FLOAT’,
‘Float32’: ‘FLOAT’,
‘Nullable(Float32)’: ‘FLOAT NULL’,
‘Nullable(Float64)’: ‘FLOAT NULL’,
‘Int32’: ‘INT’, ‘UInt32’: ‘INT’,
‘UInt16’: ‘INT’,
‘Int64’: ‘INT’,
‘Nullable(Int32)’: ‘INT NULL’,
‘Nullable(Int8)’: ‘INT NULL’,
‘Nullable(UInt32)’: ‘INT NULL’,
‘Nullable(Int64)’: ‘INT NULL’,
‘String’: ‘varchar(65535)’,
‘Nullable(String)’: ‘varchar(65535) NULL’
}

1.2 數據導出階段：

默認通過【select * from table_name limit num offset offnum 】的形式進行大文件分拆，不指定排序會導致數據導出重復。這應該是大部分數據庫的通病，需要注意先對數據按指定字段排序，再移動分塊，就可以避免，當然就需要對該表多少有一些熟悉度，諸如主鍵等進行了解；
clickhouse數據庫通過【clickhouse-client】導出parquet文件，對于日期時間類型，不明確通過 cast(字段名稱 as datetime) 進行設定，會將日期時間轉換為時間戳；

二、doris 數據庫表在starrocks數據庫建表時問題總結

doris數據庫表DDL在starrocks中存在最主要的問題就是 properties屬性修改適配，字段類型目前所用類型全部可以在starrocks數據庫中適配使用，所以無需更改。問題總結集中在properties方面。

2.1 properties不支持的屬性（直接刪除）：

	"is_being_synced" = "false","disable_auto_compaction" = "false","enable_single_replica_compaction" = "false""light_schema_change" = "true" "dynamic_partition.create_history_partition" = "false","dynamic_partition.storage_medium" = "HDD","dynamic_partition.storage_policy" = "","dynamic_partition.hot_partition_num" = "0","dynamic_partition.reserved_history_periods" = "NULL","enable_mow_light_delete" = "false""light_schema_change" = "true"

2.2 properties需修改屬性

	"replication_allocation" = "tag.location.default: 3"     				改為 "replication_num" = "3","dynamic_partition.replication_allocation" = "tag.location.default: 3"  改為 "dynamic_partition.replication_num" = "3" "dynamic_partition.history_partition_num" = "1200" starrocks數據庫默認動態分區上限500，目前直接刪除該屬性

2.3 properties：doris建表語句分區明細，starrocks數據不需要明確設定，會自動更新

	a.PARTITION BY RANGE(`xxx`) 之后的明細分區，全部刪除，不需指定，但括號需要保留。如：PARTITION BY RANGE(`xxx`)()b."dynamic_partition.start" = "-2147483648"  刪除，不需指定

2.4 分桶設置問題

	DISTRIBUTED BY HASH(`vin`) BUCKETS AUTO ，starrocks不支持auto屬性，直接設定具體數字，eg:10

2.5 索引設置問題

	索引類型：INDEX idx_vin (vin) USING INVERTED COMMENT ''，starrocks不支持 INVERTED ,將索引類型修改為： USING BITMAP 即可。

2.6 python程序針對上述問題解決程序

# 刪除分區明細
# 匹配分區定義部分，從(PARTITION開始到三個連續的右括號結束
# re.IGNORECASE:使正則表達式匹配時忽略大小寫。 
# re.DOTALL:使 . 匹配任意字符，包括換行符。
pattern = r"\(PARTITION.*\){3}"
# step_one_sql = re.sub(pattern, "", sql, flags=re.IGNORECASE | re.DOTALL) # 這個會將分區明細全部剔除，且沒有保留RANGE之后的明細需要的括號
step_one_sql = re.sub(pattern, "()", sql, flags=re.IGNORECASE | re.DOTALL) # 對上述sql的修正# 正則表達式匹配并替換 BUCKETS AUTO 為 BUCKETS 10
step_one_sql = re.sub(r" BUCKETS AUTO", r" BUCKETS 10", step_one_sql) # 正則表達式匹配索引類型：并替換 
step_one_sql = re.sub(r" USING INVERTED ", r" USING BITMAP ", step_one_sql) # 正則表達式匹配 PROPERTIES 部分
pattern = r"(PROPERTIES\s*\([\s\S]*?\);)"
match = re.search(pattern, step_one_sql)    
if match:# 提取 PROPERTIES 部分properties_content = match.group(1)# 刪除不支持的屬性properties_content = re.sub(r'"is_being_synced"\s*=\s*"\w+"\s*,?', '', properties_content)properties_content = re.sub(r'"disable_auto_compaction"\s*=\s*"\w+"\s*,?', '', properties_content)properties_content = re.sub(r'"enable_single_replica_compaction"\s*=\s*"\w+"\s*,?', '', properties_content)properties_content = re.sub(r'"light_schema_change"\s*=\s*"\w+"\s*,?', '', properties_content)properties_content = re.sub(r'"dynamic_partition\.create_history_partition"\s*=\s*"\w+"\s*,?', '', properties_content)properties_content = re.sub(r'"dynamic_partition\.storage_medium"\s*=\s*"\w+"\s*,?', '', properties_content)properties_content = re.sub(r'"dynamic_partition\.storage_policy"\s*=\s*""\s*,?', '', properties_content)properties_content = re.sub(r'"dynamic_partition\.hot_partition_num"\s*=\s*"\w+"\s*,?', '', properties_content)properties_content = re.sub(r'"dynamic_partition\.reserved_history_periods"\s*=\s*"\w+"\s*,?', '', properties_content)properties_content = re.sub(r'"dynamic_partition\.start"\s*=\s*"[\w\d-]*"\s*,?', '', properties_content)properties_content = re.sub(r'"dynamic_partition\.history_partition_num"\s*=\s*"[\w\d-]*"\s*,?', '', properties_content)properties_content = re.sub(r'"enable_mow_light_delete"\s*=\s*"\w+"\s*,?', '', properties_content)# 修改屬性properties_content = re.sub(r'"replication_allocation"\s*=\s*"tag\.location\.default:\s*\d+"','"replication_num" = "3"',properties_content)# properties_content = re.sub(#     r'"light_schema_change"\s*=\s*"\w+"',#     '"enable_light_schema_change" = "true"',#     properties_content# )properties_content = re.sub(r'"dynamic_partition.replication_allocation"\s*=\s*"tag\.location\.default:\s*\d+"','"dynamic_partition.replication_num" = "3"',properties_content)# properties_content = re.sub(#     r'"dynamic_partition.history_partition_num" = "-1"',#     '"dynamic_partition.history_partition_num" = "30"',#     properties_content# )# 清理多余的逗號和空白行properties_content = re.sub(r",\s*,", ",", properties_content)  # 清理多余的逗號properties_content = re.sub(r"\n\s*\n", "\n", properties_content)  # 清理空白行"""最后一個屬性后 會有逗號，進行替換"""# 正則表達式：匹配 PROPERTIES 塊中的最后一個逗號pattern = r'(?s)(PROPERTIES\s*\(.*?)(,\s*\);)'  # (?s) 表示單行模式，匹配換行符# 使用 re.sub 進行替換def remove_last_comma(match):# 如果匹配到逗號，則刪除return match.group(1) + match.group(2).replace(',', '')properties_content = re.sub(pattern, remove_last_comma, properties_content)# 更新回 SQL 語句updated_sql = step_one_sql.replace(match.group(1), properties_content)

三、小結

初步測試上述問題解決方案運行可行，后續遇到新問題再進行補充更新。現有的豆包之類的工具，著實提高了問題解決的效率，從而讓我們有了更多的思考時間。擁抱新工具，喜迎新未來。

本文來自互聯網用戶投稿，該文觀點僅代表作者本人，不代表本站立場。本站僅提供信息存儲空間服務，不擁有所有權，不承擔相關法律責任。
如若轉載，請注明出處：http://www.pswp.cn/pingmian/88019.shtml
繁體地址，請注明出處：http://hk.pswp.cn/pingmian/88019.shtml
英文地址，請注明出處：http://en.pswp.cn/pingmian/88019.shtml

如若內容造成侵權/違法違規/事實不符，請聯系多彩編程網進行投訴反饋email:809451989@qq.com，一經查實，立即刪除！