第6篇、Kafka 高級實戰：生產者路由與消費者管理

Kafka 高級實戰：生產者路由與消費者管理（Python 版）

從基礎到進階：深入理解 Kafka 的生產者消息路由、消費者 Offset 管理，以及 Exactly-Once 語義實現

實戰導向：提供完整的可運行代碼示例，涵蓋自定義分區器、消息頭、大消息處理、Offset 控制、事務性消費等核心場景

🚀 快速開始

環境準備

# 基礎客戶端（純 Python，適合大多數場景）
pip install kafka-python==2.0.2# 高級客戶端（依賴 librdkafka，支持事務和更高性能）
pip install confluent-kafka==2.6.0# 可選：壓縮算法支持
pip install lz4 snappy zstandard

前置條件

確保 Kafka 集群已啟動：

# 使用 Docker 快速啟動 Kafka
docker run -d --name kafka \-p 9092:9092 \-e KAFKA_ZOOKEEPER_CONNECT=localhost:2181 \-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \confluentinc/cp-kafka:latest

📚 核心概念深度解析

1. 生產者核心原理

🎯 分區與消息順序

Kafka 的核心設計原則：只保證同一分區內的消息順序。這意味著：

? 同一分區內：消息嚴格按發送順序存儲和消費
? 跨分區：無法保證全局順序
🎯 業務順序：通過相同 key 路由到同一分區實現業務層面的順序保證

🔧 自定義分區器設計

關鍵挑戰：Python 內置 hash() 函數使用隨機種子，重啟后結果不一致！

# ? 錯誤做法：Python hash() 不穩定
def bad_partitioner(key_bytes, all_partitions, available_partitions):return all_partitions[hash(key_bytes) % len(all_partitions)]# ? 正確做法：使用穩定哈希算法
import hashlib
def stable_partitioner(key_bytes, all_partitions, available_partitions):if key_bytes is None:return available_partitions[0]  # 無 key 時的策略h = int(hashlib.md5(key_bytes).hexdigest(), 16)return all_partitions[h % len(all_partitions)]

📦 消息頭（Headers）的威力

Headers 是消息的元數據載體，典型應用場景：

鏈路追蹤：trace_id、span_id
協議版本：schema_version、api_version
業務標簽：event_type、source_system
灰度控制：canary_flag、experiment_id

? 性能優化策略

優化維度	參數配置	效果
批量發送	`batch_size=64KB`, `linger_ms=50ms`	提升吞吐 3-5x
壓縮算法	`compression_type=lz4`	減少網絡傳輸 50-70%
消息大小	超過 1MB 使用外部存儲	避免 Broker 壓力

🚨 大消息處理最佳實踐

問題：Kafka 不適合處理大消息（>1MB）
解決方案：外部存儲 + 消息引用模式

# 大消息處理示例
def send_large_message(user_id, large_data):# 1. 上傳到對象存儲s3_key = f"reports/{user_id}/{timestamp}.json"s3_client.put_object(Bucket="data-lake", Key=s3_key, Body=large_data)# 2. 發送引用消息message = {"type": "report_ready","user_id": user_id,"s3_location": f"s3://data-lake/{s3_key}","size_bytes": len(large_data),"checksum": hashlib.sha256(large_data).hexdigest()}producer.send("reports", key=user_id, value=message)

🛠? 實戰案例：生產者消息路由

案例一：用戶事件按 ID 路由到固定分區

🎯 業務場景

電商系統中，需要保證同一用戶的所有操作事件按時間順序處理：

用戶下單 → 支付 → 發貨 → 確認收貨
必須嚴格按順序，避免狀態不一致

💡 解決方案

通過自定義分區器，將相同 user_id 的消息路由到同一分區，實現業務層面的順序保證。

📝 完整實現（kafka-python 版本）

# producer_user_routing.py
import hashlib
import json
import time
import uuid
from datetime import datetime
from kafka import KafkaProducerdef user_id_partitioner(key_bytes, all_partitions, available_partitions):"""基于穩定哈希的用戶分區器核心特性：- 使用 MD5 確保跨進程/重啟后的穩定性- 相同 user_id 始終路由到同一分區- 無 key 時使用輪詢策略"""if key_bytes is None:# 無 key 時的策略：輪詢到可用分區return available_partitions[0] if available_partitions else all_partitions[0]# 使用 MD5 確保穩定性（Python hash() 不穩定）h = int(hashlib.md5(key_bytes).hexdigest(), 16)idx = h % len(all_partitions)return all_partitions[idx]# 創建高性能生產者
producer = KafkaProducer(bootstrap_servers=["localhost:9092"],acks="all",                    # 強一致性：等待所有副本確認retries=5,                     # 重試次數linger_ms=10,                  # 批量發送等待時間batch_size=32 * 1024,          # 32KB 批量大小compression_type="lz4",        # LZ4 壓縮（高性能）value_serializer=lambda v: json.dumps(v, ensure_ascii=False).encode("utf-8"),key_serializer=lambda k: str(k).encode("utf-8") if k is not None else None,partitioner=user_id_partitioner,
)topic = "user-events"def send_user_event(user_id: int, event_type: str, data: dict):"""發送用戶事件，自動路由到固定分區Args:user_id: 用戶ID（用作分區key）event_type: 事件類型data: 事件數據"""# 構建消息頭（元數據）headers = [("trace_id", str(uuid.uuid4()).encode()),("schema_ver", b"v1.0"),("event_type", event_type.encode()),("timestamp", str(int(time.time() * 1000)).encode()),("source", b"order-service"),]# 構建事件消息event = {"user_id": user_id,"event_type": event_type,"timestamp": datetime.now().isoformat(),"data": data}# 發送消息（異步）future = producer.send(topic,key=user_id,               # 關鍵：user_id 作為分區 keyvalue=event,headers=headers,)# 等待發送結果try:metadata = future.get(timeout=10)print(f"? 發送成功: {metadata.topic}-{metadata.partition}@{metadata.offset}")print(f"   用戶: {user_id}, 事件: {event_type}")return metadataexcept Exception as e:print(f"? 發送失敗: {e}")raisedef simulate_user_journey():"""模擬用戶完整的購物流程"""user_id = 1001# 用戶購物流程：必須按順序處理events = [("view_product", {"product_id": "P001", "price": 99.9}),("add_to_cart", {"product_id": "P001", "quantity": 2}),("checkout", {"cart_total": 199.8, "payment_method": "credit_card"}),("payment_success", {"transaction_id": "TXN123", "amount": 199.8}),("order_created", {"order_id": "ORD456", "status": "confirmed"}),("order_shipped", {"tracking_number": "TRK789", "carrier": "FedEx"}),("order_delivered", {"delivery_time": "2024-01-15T10:30:00Z"}),]print(f"🛒 開始模擬用戶 {user_id} 的購物流程...")for event_type, data in events:send_user_event(user_id, event_type, data)time.sleep(0.5)  # 模擬事件間隔print(f"? 用戶 {user_id} 購物流程完成！")if __name__ == "__main__":try:# 模擬多個用戶的并發事件simulate_user_journey()# 模擬其他用戶的事件send_user_event(1002, "view_product", {"product_id": "P002", "price": 149.9})send_user_event(1003, "add_to_cart", {"product_id": "P003", "quantity": 1})# 確保所有消息發送完成producer.flush(timeout=30)print("🎉 所有消息發送完成！")except KeyboardInterrupt:print("?? 用戶中斷")finally:producer.close()print("🔒 生產者已關閉")

🚀 運行效果

python producer_user_routing.py

預期輸出：

🛒 開始模擬用戶 1001 的購物流程...
? 發送成功: user-events-0@1234用戶: 1001, 事件: view_product
? 發送成功: user-events-0@1235用戶: 1001, 事件: add_to_cart
? 發送成功: user-events-0@1236用戶: 1001, 事件: checkout
...
🎉 所有消息發送完成！

🔍 關鍵特性驗證

分區一致性：相同 user_id 的消息都發送到同一分區
順序保證：同一分區內的消息嚴格按發送順序存儲
元數據豐富：Headers 包含完整的追蹤信息
高性能：批量發送 + 壓縮提升吞吐量

🔒 順序保證的進階策略

冪等生產者（confluent-kafka 版本）

對于嚴格順序要求的場景，推薦使用 confluent-kafka 的冪等生產者：

# producer_idempotent.py
import hashlib
import json
from confluent_kafka import Producer, KafkaException# 冪等生產者配置
conf = {"bootstrap.servers": "localhost:9092","enable.idempotence": True,                    # 啟用冪等性"acks": "all",                                 # 必須配合 acks=all"compression.type": "lz4","linger.ms": 10,"batch.size": 32768,"max.in.flight.requests.per.connection": 5,     # 限制并發，避免亂序"retries": 2147483647,                         # 無限重試（冪等保證不重復）
}producer = Producer(conf)
topic = "user-events"def delivery_callback(err, msg):"""消息發送回調"""if err is not None:print(f"? 發送失敗: {err}")else:print(f"? 發送成功: {msg.topic()}-{msg.partition()}@{msg.offset()}")def send_with_idempotence(user_id: int, event: dict):"""使用冪等生產者發送消息"""key_bytes = str(user_id).encode()# 計算穩定分區h = int(hashlib.md5(key_bytes).hexdigest(), 16)partition = h % 3  # 假設有3個分區headers = [("schema_ver", "v1.0"),("event_type", event.get("type", "unknown")),("user_id", str(user_id)),]producer.produce(topic=topic,partition=partition,key=str(user_id),value=json.dumps(event, ensure_ascii=False),headers=headers,on_delivery=delivery_callback,)# 使用示例
if __name__ == "__main__":events = [{"type": "login", "timestamp": "2024-01-15T10:00:00Z"},{"type": "view_product", "product_id": "P001"},{"type": "purchase", "amount": 99.9},]for event in events:send_with_idempotence(1001, event)producer.flush()

關鍵優勢

特性	kafka-python	confluent-kafka
冪等性	? 不支持	? 原生支持
事務	? 不支持	? 完整支持
性能	🟡 中等	? 更高
穩定性	🟡 良好	? 企業級

📖 消費者核心原理

2. Offset 管理深度解析

🎯 Offset 的本質

Offset 是 Kafka 中消息在分區內的唯一標識，類似于數據庫中的主鍵：

遞增性：同一分區內嚴格遞增
持久性：存儲在 __consumer_offsets 主題中
可控制性：支持手動設置和自動管理

🔄 消費語義對比

語義類型	處理順序	優點	缺點	適用場景
At-most-once	先提交，后處理	不重復	可能丟失	日志收集
At-least-once	先處理，后提交	不丟失	可能重復	業務處理
Exactly-once	事務性處理	不丟失不重復	復雜度高	金融交易

🛠? 手動 Offset 控制

# consumer_offset_control.py
import json
import time
from kafka import KafkaConsumer, TopicPartitionclass SmartRebalanceListener:"""智能再均衡監聽器"""def __init__(self, consumer):self.consumer = consumerself.processed_offsets = {}  # 記錄已處理的 offsetdef on_partitions_revoked(self, revoked_partitions):"""分區被收回前：安全提交進度"""print(f"🔄 分區被收回: {revoked_partitions}")try:# 提交當前所有已處理的 offsetself.consumer.commit()print("? 進度已安全提交")except Exception as e:print(f"? 提交失敗: {e}")def on_partitions_assigned(self, assigned_partitions):"""分區分配后：恢復消費位置"""print(f"📋 分配到分區: {assigned_partitions}")# 可選：從自定義存儲恢復 checkpointfor partition in assigned_partitions:checkpoint = self.load_checkpoint(partition)if checkpoint is not None:self.consumer.seek(partition, checkpoint)print(f"📍 恢復分區 {partition} 到 offset {checkpoint}")def load_checkpoint(self, partition):"""從外部存儲加載 checkpoint（示例）"""# 實際項目中可以從 Redis/數據庫 讀取return None  # 使用 Kafka 默認的 committed offsetdef create_advanced_consumer():"""創建高級消費者"""
consumer = KafkaConsumer(bootstrap_servers=["localhost:9092"],group_id="advanced-consumer-group",enable_auto_commit=False,        # 手動提交，精確控制auto_offset_reset="earliest",    # 無 offset 時從最早開始value_deserializer=lambda v: json.loads(v.decode("utf-8")),key_deserializer=lambda v: int(v.decode()) if v else None,max_poll_records=10,             # 批量拉取session_timeout_ms=30000,        # 會話超時heartbeat_interval_ms=10000,      # 心跳間隔)# 設置再均衡監聽器listener = SmartRebalanceListener(consumer)consumer.subscribe(["user-events"], listener=listener)return consumer, listenerdef process_message_batch(consumer, listener):"""批量處理消息"""
try:while True:# 批量拉取消息records = consumer.poll(timeout_ms=1000)if not records:continue# 按分區處理消息for partition, messages in records.items():print(f"\n📦 處理分區 {partition} 的 {len(messages)} 條消息")for message in messages:# 業務處理邏輯process_business_logic(message)# 記錄已處理的 offsetlistener.processed_offsets[partition] = message.offset + 1# 批量提交該分區的 offsetconsumer.commit({partition: message.offset + 1})print(f"? 分區 {partition} 處理完成，已提交到 offset {message.offset + 1}")except KeyboardInterrupt:print("\n?? 用戶中斷消費")except Exception as e:print(f"? 消費異常: {e}")raisedef process_business_logic(message):"""業務處理邏輯示例"""print(f"  🔍 處理消息: {message.topic}-{message.partition}@{message.offset}")print(f"     用戶: {message.key}, 事件: {message.value.get('event_type', 'unknown')}")# 模擬業務處理time.sleep(0.1)if __name__ == "__main__":consumer, listener = create_advanced_consumer()try:print("🚀 開始消費消息...")process_message_batch(consumer, listener)
finally:consumer.close()print("🔒 消費者已關閉")

🎯 實戰案例：消費者高級管理

案例二：按時間戳重放消息

🎯 業務場景

數據修復場景：發現某個時間點后的數據處理有誤，需要重新處理歷史數據：

從指定時間點開始重放
不重復處理已正確的數據
支持多次重放和調試

💡 解決方案

使用 Kafka 的時間戳定位功能，精確找到指定時間點對應的 offset，然后從該位置開始消費。

📝 完整實現

# consumer_replay.py
import time
import json
from datetime import datetime, timedelta
from kafka import KafkaConsumer, TopicPartitionclass MessageReplayer:"""消息重放器"""def __init__(self, topic: str, group_id: str):self.topic = topicself.group_id = group_idself.consumer = Nonedef create_replay_consumer(self):"""創建重放專用消費者"""self.consumer = KafkaConsumer(bootstrap_servers=["localhost:9092"],group_id=self.group_id,enable_auto_commit=False,        # 不自動提交，支持多次重放auto_offset_reset="earliest",value_deserializer=lambda v: json.loads(v.decode("utf-8")),key_deserializer=lambda v: int(v.decode()) if v else None,)return self.consumerdef seek_to_timestamp(self, timestamp_ms: int):"""定位到指定時間戳"""print(f"🔍 定位到時間戳: {datetime.fromtimestamp(timestamp_ms/1000)}")# 1. 訂閱主題并等待分區分配self.consumer.subscribe([self.topic])time.sleep(2)  # 等待分區分配# 2. 獲取分配的分區assignment = self.consumer.assignment()
while not assignment:self.consumer.poll(100)assignment = self.consumer.assignment()print(f"📋 分配到分區: {assignment}")# 3. 為每個分區創建時間戳查詢timestamp_queries = [TopicPartition(tp.topic, tp.partition, timestamp_ms) for tp in assignment]# 4. 查詢時間戳對應的 offsetoffsets_for_times = self.consumer.offsets_for_times(timestamp_queries)# 5. 定位到查詢到的 offsetfor tp, offset_info in offsets_for_times.items():if offset_info is not None and offset_info.offset is not None:self.consumer.seek(tp, offset_info.offset)print(f"📍 分區 {tp.partition} 定位到 offset {offset_info.offset}")else:# 該時間戳之后沒有消息，定位到末尾self.consumer.seek_to_end(tp)print(f"📍 分區 {tp.partition} 定位到末尾（時間戳后無消息）")def replay_messages(self, duration_minutes: int = 10):"""重放指定時長的消息"""print(f"🎬 開始重放最近 {duration_minutes} 分鐘的消息...")message_count = 0start_time = time.time()try:
while True:records = self.consumer.poll(timeout_ms=1000)if not records:print("? 沒有更多消息，重放完成")breakfor partition, messages in records.items():print(f"\n📦 分區 {partition.partition} 收到 {len(messages)} 條消息")for message in messages:message_count += 1# 解析消息時間戳msg_timestamp = datetime.fromtimestamp(message.timestamp / 1000)print(f"  🔍 消息 {message_count}: {message.topic}-{message.partition}@{message.offset}")print(f"     時間: {msg_timestamp}")print(f"     用戶: {message.key}, 事件: {message.value.get('event_type', 'unknown')}")# 業務處理邏輯self.process_replay_message(message)# 檢查是否超過重放時長if time.time() - start_time > duration_minutes * 60:print(f"? 重放時長達到 {duration_minutes} 分鐘，停止重放")return message_countexcept KeyboardInterrupt:print("\n?? 用戶中斷重放")except Exception as e:print(f"? 重放異常: {e}")raisereturn message_countdef process_replay_message(self, message):"""處理重放消息的業務邏輯"""# 這里實現具體的業務處理邏輯# 例如：重新計算用戶積分、更新訂單狀態等event_type = message.value.get('event_type', 'unknown')user_id = message.keyif event_type == 'purchase':print(f"    💰 重新處理用戶 {user_id} 的購買事件")# 重新計算積分、更新庫存等elif event_type == 'refund':print(f"    💸 重新處理用戶 {user_id} 的退款事件")# 重新計算退款金額、更新財務記錄等# 模擬處理時間time.sleep(0.05)def close(self):"""關閉消費者"""if self.consumer:self.consumer.close()print("🔒 重放器已關閉")def replay_from_hours_ago(hours: int = 1):"""從指定小時前開始重放"""# 計算目標時間戳target_time = datetime.now() - timedelta(hours=hours)timestamp_ms = int(target_time.timestamp() * 1000)print(f"🎯 重放目標: {target_time}")# 創建重放器replayer = MessageReplayer("user-events", f"replay-group-{int(time.time())}")try:# 創建消費者replayer.create_replay_consumer()# 定位到目標時間replayer.seek_to_timestamp(timestamp_ms)# 開始重放message_count = replayer.replay_messages(duration_minutes=5)print(f"🎉 重放完成！共處理 {message_count} 條消息")finally:replayer.close()if __name__ == "__main__":# 從1小時前開始重放replay_from_hours_ago(hours=1)

🚀 運行效果

python consumer_replay.py

預期輸出：

🎯 重放目標: 2024-01-15 09:00:00
🔍 定位到時間戳: 2024-01-15 09:00:00
📋 分配到分區: {TopicPartition(topic='user-events', partition=0)}
📍 分區 0 定位到 offset 1234
🎬 開始重放最近 5 分鐘的消息...📦 分區 0 收到 3 條消息🔍 消息 1: user-events-0@1234時間: 2024-01-15 09:00:15用戶: 1001, 事件: purchase💰 重新處理用戶 1001 的購買事件
...
🎉 重放完成！共處理 15 條消息

🔒 Exactly-Once 語義實現

案例三：事務性消費處理

🎯 業務場景

金融交易系統：需要保證消費消息 → 處理業務 → 發送結果的原子性：

要么全部成功，要么全部失敗
不能出現部分成功的情況
需要嚴格的事務保證

💡 解決方案

使用 confluent-kafka 的事務性生產者，將消費 offset 和業務處理結果打包在一個事務中。

📝 完整實現

# consumer_transactional.py
import json
import time
from confluent_kafka import Producer, Consumer, KafkaException, TopicPartitionclass TransactionalProcessor:"""事務性消息處理器"""def __init__(self, bootstrap_servers: str):self.bootstrap_servers = bootstrap_serversself.producer = Noneself.consumer = Noneself.transaction_id = f"txn-processor-{int(time.time())}"def setup_transactional_producer(self):"""設置事務性生產者"""conf = {"bootstrap.servers": self.bootstrap_servers,"enable.idempotence": True,           # 啟用冪等性"transactional.id": self.transaction_id,"acks": "all",                        # 必須配合 acks=all"compression.type": "lz4","linger.ms": 10,"batch.size": 32768,"max.in.flight.requests.per.connection": 5,}self.producer = Producer(conf)self.producer.init_transactions()print(f"? 事務性生產者已初始化: {self.transaction_id}")def setup_transactional_consumer(self, input_topic: str, group_id: str):"""設置事務性消費者"""conf = {"bootstrap.servers": self.bootstrap_servers,"group.id": group_id,"enable.auto.commit": False,          # 關鍵：手動提交 offset"auto.offset.reset": "earliest","isolation.level": "read_committed",  # 只讀取已提交的事務}self.consumer = Consumer(conf)self.consumer.subscribe([input_topic])print(f"? 事務性消費者已初始化: {group_id}")def process_with_transaction(self, input_topic: str, output_topic: str, group_id: str):"""事務性處理消息"""print(f"🚀 開始事務性處理: {input_topic} → {output_topic}")processed_count = 0try:while True:# 1. 消費消息msg = self.consumer.poll(timeout=1.0)if msg is None:continueif msg.error():if msg.error().code() == KafkaException._PARTITION_EOF:continueelse:raise KafkaException(msg.error())print(f"\n📨 收到消息: {msg.topic()}-{msg.partition()}@{msg.offset()}")# 2. 開始事務self.producer.begin_transaction()print("🔄 開始事務")try:# 3. 處理業務邏輯processed_data = self.process_business_logic(msg)# 4. 發送處理結果self.producer.produce(output_topic,key=msg.key(),value=json.dumps(processed_data, ensure_ascii=False),headers=[("processed_at", str(int(time.time() * 1000)))])print(f"📤 發送處理結果到: {output_topic}")# 5. 將消費 offset 納入事務partitions = [TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1)]self.producer.send_offsets_to_transaction(partitions, self.consumer.consumer_group_metadata())print(f"📝 提交消費 offset: {msg.offset() + 1}")# 6. 提交事務self.producer.commit_transaction()print("? 事務提交成功")processed_count += 1except Exception as e:# 7. 回滾事務self.producer.abort_transaction()print(f"? 事務回滾: {e}")raiseexcept KeyboardInterrupt:print("\n?? 用戶中斷處理")except Exception as e:print(f"? 處理異常: {e}")raiseprint(f"🎉 事務性處理完成！共處理 {processed_count} 條消息")def process_business_logic(self, msg):"""業務處理邏輯示例"""# 解析輸入消息input_data = json.loads(msg.value())user_id = msg.key().decode() if msg.key() else "unknown"print(f"  🔍 處理用戶 {user_id} 的事件: {input_data.get('event_type', 'unknown')}")# 模擬業務處理（例如：計算積分、更新余額等）processed_data = {"user_id": user_id,"original_event": input_data,"processed_at": time.time(),"result": "success","points_earned": 10,  # 示例：獲得積分"new_balance": 1000,  # 示例：新余額}# 模擬處理時間time.sleep(0.1)return processed_datadef close(self):"""關閉生產者和消費者"""if self.consumer:self.consumer.close()if self.producer:self.producer.flush()print("🔒 事務處理器已關閉")def run_transactional_example():"""運行事務性處理示例"""processor = TransactionalProcessor("localhost:9092")try:# 設置事務性生產者和消費者processor.setup_transactional_producer()processor.setup_transactional_consumer("user-events", "transactional-group")# 開始事務性處理processor.process_with_transaction(input_topic="user-events",output_topic="processed-events",group_id="transactional-group")finally:processor.close()if __name__ == "__main__":run_transactional_example()

🔍 事務性處理的關鍵特性

原子性：消費、處理、發送要么全部成功，要么全部失敗
一致性：不會出現部分處理的情況
隔離性：其他消費者只能看到已提交的事務結果
持久性：事務結果持久化到 Kafka

🚀 運行效果

python consumer_transactional.py

預期輸出：

? 事務性生產者已初始化: txn-processor-1705123456
? 事務性消費者已初始化: transactional-group
🚀 開始事務性處理: user-events → processed-events📨 收到消息: user-events-0@1234
🔄 開始事務🔍 處理用戶 1001 的事件: purchase
📤 發送處理結果到: processed-events
📝 提交消費 offset: 1235
? 事務提交成功
...
🎉 事務性處理完成！共處理 5 條消息

📋 最佳實踐與故障排查

🚨 常見問題診斷

問題 1：消息順序混亂

癥狀：相同用戶的消息處理順序不一致

原因分析：

? 使用了不穩定的哈希算法（如 Python hash()）
? 多線程并發發送相同 key 的消息
? 生產者重試導致亂序

解決方案：

# ? 使用穩定哈希
import hashlib
def stable_partitioner(key_bytes, all_partitions, available_partitions):h = int(hashlib.md5(key_bytes).hexdigest(), 16)return all_partitions[h % len(all_partitions)]# ? 啟用冪等生產者
conf = {"enable.idempotence": True,"max.in.flight.requests.per.connection": 5,  # 限制并發
}

問題 2：消費重復或丟失

癥狀：消息被重復處理或丟失

原因分析：

? 自動提交 + 處理失敗 = 消息丟失
? 手動提交 + 處理失敗 = 消息重復

解決方案：

# ? At-least-once 語義
def process_with_at_least_once(consumer):while True:records = consumer.poll(timeout_ms=1000)for partition, messages in records.items():for message in messages:try:# 先處理業務邏輯process_business_logic(message)# 處理成功后提交 offsetconsumer.commit({partition: message.offset + 1})except Exception as e:# 處理失敗，不提交 offset，消息會重新消費print(f"處理失敗: {e}")raise

問題 3：大消息傳輸失敗

癥狀：超過 1MB 的消息發送失敗

解決方案：

# ? 外部存儲 + 消息引用模式
def send_large_message(user_id, large_data):# 1. 上傳到對象存儲s3_key = f"data/{user_id}/{int(time.time())}.json"s3_client.put_object(Bucket="data-lake", Key=s3_key, Body=large_data)# 2. 發送引用消息message = {"type": "large_data_ready","user_id": user_id,"s3_location": f"s3://data-lake/{s3_key}","size_bytes": len(large_data),"checksum": hashlib.sha256(large_data).hexdigest()}producer.send("large-messages", key=user_id, value=message)

? 性能優化清單

生產者優化

參數	推薦值	說明
`batch_size`	64KB	批量發送提升吞吐
`linger_ms`	50ms	等待更多消息聚合
`compression_type`	lz4	平衡壓縮率和性能
`acks`	all	強一致性場景
`retries`	3-5	避免無限重試

消費者優化

參數	推薦值	說明
`max_poll_records`	100-500	批量拉取消息
`session_timeout_ms`	30000	會話超時時間
`heartbeat_interval_ms`	10000	心跳間隔
`fetch_min_bytes`	1	最小拉取字節數
`fetch_max_wait_ms`	500	最大等待時間

🏗? 項目結構建議

kafka-advanced-python/
├── config/
│   ├── kafka_config.py          # Kafka 配置管理
│   └── topics.py               # Topic 定義
├── producers/
│   ├── user_event_producer.py  # 用戶事件生產者
│   ├── large_message_producer.py # 大消息生產者
│   └── transactional_producer.py # 事務生產者
├── consumers/
│   ├── user_event_consumer.py  # 用戶事件消費者
│   ├── replay_consumer.py      # 重放消費者
│   └── transactional_consumer.py # 事務消費者
├── utils/
│   ├── partitioners.py         # 自定義分區器
│   ├── serializers.py          # 序列化器
│   └── monitoring.py          # 監控工具
├── tests/
│   ├── test_producers.py      # 生產者測試
│   └── test_consumers.py       # 消費者測試
└── examples/├── user_routing_demo.py   # 用戶路由演示├── replay_demo.py         # 重放演示└── transaction_demo.py    # 事務演示

📊 監控與運維

關鍵指標監控

# monitoring.py
from kafka import KafkaProducer, KafkaConsumer
import timeclass KafkaMonitor:"""Kafka 監控工具"""def __init__(self, bootstrap_servers):self.bootstrap_servers = bootstrap_serversdef get_producer_metrics(self, producer):"""獲取生產者指標"""metrics = producer.metrics()return {"batch_size_avg": metrics.get("batch-size-avg", 0),"record_send_rate": metrics.get("record-send-rate", 0),"record_error_rate": metrics.get("record-error-rate", 0),}def get_consumer_metrics(self, consumer):"""獲取消費者指標"""metrics = consumer.metrics()return {"records_consumed_rate": metrics.get("records-consumed-rate", 0),"fetch_latency_avg": metrics.get("fetch-latency-avg", 0),"commit_rate": metrics.get("commit-rate", 0),}def check_topic_health(self, topic_name):"""檢查 Topic 健康狀態"""producer = KafkaProducer(bootstrap_servers=self.bootstrap_servers)consumer = KafkaConsumer(topic_name, bootstrap_servers=self.bootstrap_servers)try:# 檢查分區數量partitions = producer.partitions_for(topic_name)print(f"Topic {topic_name} 有 {len(partitions)} 個分區")# 檢查消費延遲consumer.poll(timeout_ms=1000)assignment = consumer.assignment()for partition in assignment:committed = consumer.committed(partition)end_offset = consumer.end_offsets([partition])[partition]lag = end_offset - committed if committed else 0print(f"分區 {partition.partition} 消費延遲: {lag}")finally:producer.close()consumer.close()