數據庫數據清洗、預處理與質量監控、 數據質量的核心概念
準確性 (Accuracy)
準確性指數據正確反映其所描述的實體或事件真實狀況的程度。準確的數據應當與現實世界中的實際情況一致。
一致性 (Consistency)
一致性指數據在不同表、系統或時間點之間保持邏輯上一致的程度。一致性確保數據遵循業務規則和關系約束。
數據清洗與預處理技術
1. 處理缺失值
-- 識別缺失值
SELECT COUNT(*) FROM customers WHERE phone IS NULL;-- 處理缺失值
UPDATE customers
SET phone = 'Unknown'
WHERE phone IS NULL;-- 或者刪除包含關鍵缺失值的記錄
DELETE FROM orders
WHERE customer_id IS NULL;
2. 處理重復數據
-- 識別重復記錄
SELECT email, COUNT(*)
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;-- 刪除重復記錄 (保留一條)
WITH duplicates AS (SELECT email, ROW_NUMBER() OVER(PARTITION BY email ORDER BY id) AS rnFROM customers
)
DELETE FROM duplicates WHERE rn > 1;
3. 標準化數據格式
-- 統一日期格式
UPDATE transactions
SET transaction_date = TO_DATE(transaction_date, 'YYYY-MM-DD')
WHERE transaction_date ~ '^\d{4}-\d{2}-\d{2}$';-- 統一電話號碼格式
UPDATE customers
SET phone = REGEXP_REPLACE(phone, '[^0-9]', '');
4. 處理異常值
-- 識別異常交易金額
SELECT * FROM transactions
WHERE amount < 0 OR amount > 100000;-- 修正異常值 (根據業務規則)
UPDATE transactions
SET amount = 0
WHERE amount < 0;
數據質量監控
1. 創建數據質量規則表
CREATE TABLE data_quality_rules (rule_id INT PRIMARY KEY,rule_name VARCHAR(100),rule_description VARCHAR(500),check_query TEXT,threshold INT,severity VARCHAR(20)
);-- 示例規則
INSERT INTO data_quality_rules VALUES
(1, 'Null Customer Names', 'Customer names should not be null', 'SELECT COUNT(*) FROM customers WHERE customer_name IS NULL', 0, 'High'),
(2, 'Negative Order Quantities', 'Order quantities should not be negative', 'SELECT COUNT(*) FROM order_items WHERE quantity < 0', 0, 'High'),
(3, 'Invalid Email Formats', 'Emails should follow standard format', 'SELECT COUNT(*) FROM customers WHERE email !~ ''^[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+$''', 5, 'Medium');
2. 自動化質量檢查
-- 創建數據質量檢查存儲過程
CREATE OR REPLACE PROCEDURE run_data_quality_checks()
LANGUAGE plpgsql
AS $$
DECLARErule_record RECORD;violation_count INT;
BEGINFOR rule_record IN SELECT * FROM data_quality_rules LOOPEXECUTE rule_record.check_query INTO violation_count;IF violation_count > rule_record.threshold THENINSERT INTO data_quality_violations (rule_id, violation_count, check_date)VALUES (rule_record.rule_id, violation_count, CURRENT_DATE);-- 可以添加通知邏輯RAISE NOTICE 'Data quality violation: % - % violations found', rule_record.rule_name, violation_count;END IF;END LOOP;
END;
$$;
案例分析
案例: 電商平臺數據清洗
問題描述:
- 客戶表中有重復的電子郵件
- 訂單表中的某些價格與產品表中的價格不一致
- 客戶地址格式不統一
解決方案:
-- 1. 處理重復客戶
WITH ranked_customers AS (SELECT *,ROW_NUMBER() OVER(PARTITION BY email ORDER BY created_at) AS rnFROM customers
)
DELETE FROM customers
WHERE id IN (SELECT id FROM ranked_customers WHERE rn > 1);-- 2. 修復價格不一致問題
UPDATE order_items oi
SET unit_price = p.price
FROM products p
WHERE oi.product_id = p.id AND oi.unit_price != p.price;-- 3. 標準化地址
UPDATE customers
SET address = INITCAP(TRIM(address)),postal_code = REGEXP_REPLACE(postal_code, '[^0-9]', '');
一致性驗證示例
-- 檢查訂單總價是否等于各項目總和
SELECT o.order_id, o.total_amount, SUM(oi.quantity * oi.unit_price) AS calculated_amount,o.total_amount - SUM(oi.quantity * oi.unit_price) AS discrepancy
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.total_amount
HAVING ABS(o.total_amount - SUM(oi.quantity * oi.unit_price)) > 0.01;
準確性驗證示例
-- 檢查產品價格是否在合理范圍內
SELECT product_id, product_name, price
FROM products
WHERE price < 0 OR price > 10000; -- 假設10000是合理上限-- 檢查客戶年齡是否合理
SELECT customer_id, birth_date, EXTRACT(YEAR FROM AGE(birth_date)) AS age
FROM customers
WHERE EXTRACT(YEAR FROM AGE(birth_date)) < 18 OR EXTRACT(YEAR FROM AGE(birth_date)) > 120;
持續改進機制
- 定期執行數據質量檢查: 設置定時任務每天/每周運行質量檢查
- 建立數據質量儀表板: 可視化展示數據質量趨勢
- 源頭治理: 在應用層添加驗證邏輯,防止低質量數據進入數據庫
- 文檔化數據標準: 明確各字段的數據格式、取值范圍和業務規則
通過以上方法,可以系統地提高和維持數據庫中的數據質量,確保數據的準確性和一致性,為業務決策提供可靠的數據基礎。