Business background
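We operate a PC FPS game. Cheat authors often disguise DDoS attacks as "player disconnects" to fool customer support. Investigating used to mean pulling CDN logs, aligning timestamps, and comparing them by hand, which took 2 hours on average to pinpoint the cause. Now edge-node logs are streamed into Kafka in real time and a roughly 30-line Python script replays them into Grafana, so within 5 minutes we can reconstruct who was knocked offline, at what time, and by which burst of traffic.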
1. Data flow
Edge node (Nginx) → filebeat → Kafka → Python replay script → Grafana (Loki)
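- The edge nodes run the Nginx stream module with a custom log format:
  $time_iso8601|$remote_addr|$bytes_sent|$bytes_received|$proxy_host
- filebeat reads /var/log/nginx/stream.log directly and publishes to the Kafka topic game_traffic;
- The Python script consumes the topic with confluent-kafka, computes "anomaly windows" in real time, and pushes them to Loki;
- In Grafana, ops click "replay" to drag the timeline back 30 min and watch the incident like a recording.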
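To make the pipe-delimited log format concrete, here is a minimal parsing sketch. The names TrafficRecord and parse_line are hypothetical helpers introduced for illustration, and the sample line is made up (it uses a documentation-range IP); the field order is taken from the log format above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TrafficRecord:
    """One parsed line of the stream log (hypothetical helper type)."""
    ts: datetime
    src: str
    bytes_sent: int
    bytes_received: int
    dest: str


def parse_line(line: str) -> TrafficRecord:
    """Split a '|'-delimited log line into typed fields."""
    parts = line.strip().split("|")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    ts, src, sent, received, dest = parts
    # $time_iso8601 carries a UTC offset, which fromisoformat understands
    return TrafficRecord(datetime.fromisoformat(ts), src,
                         int(sent), int(received), dest)


# Made-up sample line for illustration only
record = parse_line("2024-05-01T20:15:03+08:00|203.0.113.7|1200|987654|10.0.0.5:7777")
print(record.src, record.bytes_sent + record.bytes_received)
```

Rejecting lines with the wrong field count up front keeps one malformed log entry from crashing a long-running consumer with an unpacking error.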
2. Replay script (replay.py)
#!/usr/bin/env python3
# pip install confluent-kafka requests
import collections
import datetime
import json
import time

import requests
from confluent_kafka import Consumer

BOOTSTRAP = 'kafka.example.com:9092'
LOKI_URL = 'https://loki.example.com/loki/api/v1/push'

consumer = Consumer({
    'bootstrap.servers': BOOTSTRAP,
    'group.id': 'replay',
    'auto.offset.reset': 'latest',
})
consumer.subscribe(['game_traffic'])

window = collections.deque(maxlen=1000)  # sliding window, capped at 1000 records
ALERT_THRESHOLD = 100_000_000  # alert when up + down traffic exceeds 100 MB (in bytes) within 10 s

def push_loki(stream, labels):
    """Push one JSON log line to Loki under the given stream labels."""
    payload = {"streams": [{
        "stream": labels,
        "values": [[str(time.time_ns()), json.dumps(stream)]],
    }]}
    requests.post(LOKI_URL, json=payload, timeout=3)

while True:
    msg = consumer.poll(1)
    if msg is None:
        continue
    if msg.error():
        print(msg.error())
        continue
    # filebeat's Kafka output wraps each log line in a JSON event by default;
    # the raw Nginx line is in the "message" field
    line = json.loads(msg.value())['message']
    ts, src, up, down, dest = line.split('|')
    now = datetime.datetime.fromisoformat(ts)
    window.append((now, int(up) + int(down)))
    # Sliding-window statistics: drop records older than 10 s
    cutoff = now - datetime.timedelta(seconds=10)
    while window and window[0][0] < cutoff:
        window.popleft()
    total = sum(b for _, b in window)
    if total > ALERT_THRESHOLD:
        # src and dest are those of the record that pushed the window over the threshold
        push_loki({"src": src, "dest": dest, "bytes": total},
                  {"job": "game_traffic", "alert": "ddos"})
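The 10-second sliding window is the core of the script, so it can help to exercise it offline without Kafka or Loki. The sketch below is an illustration under stated assumptions, not the production code: SlidingByteWindow is a hypothetical class, it keeps a running total instead of re-summing the deque, it has no record cap, and the threshold is the 100 MB from the script's comment expressed in bytes. The timestamps and byte counts are synthetic.

```python
import collections
from datetime import datetime, timedelta


class SlidingByteWindow:
    """Time-based sliding window over (timestamp, bytes) records (hypothetical helper)."""

    def __init__(self, seconds=10):
        self.span = timedelta(seconds=seconds)
        self.items = collections.deque()
        self.total = 0

    def add(self, ts, nbytes):
        """Add a record, evict records older than the span, return the window total."""
        self.items.append((ts, nbytes))
        self.total += nbytes
        cutoff = ts - self.span
        while self.items and self.items[0][0] < cutoff:
            _, old = self.items.popleft()
            self.total -= old
        return self.total


ALERT_THRESHOLD = 100_000_000  # assumption: 100 MB expressed in bytes

# Synthetic traffic: 40 MB records at t = 0 s, 4 s, 8 s, then one at t = 20 s
t0 = datetime(2024, 5, 1, 20, 0, 0)
w = SlidingByteWindow(seconds=10)
totals = [w.add(t0 + timedelta(seconds=s), 40_000_000) for s in (0, 4, 8, 20)]
alerts = [t > ALERT_THRESHOLD for t in totals]
print(totals, alerts)
```

Only the third record trips the alert: by then three records fall inside 10 seconds, while the record at 20 s arrives after the earlier ones have aged out.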
3. Rollout steps
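- On the edge nodes, add to the Nginx stream configuration:
  log_format stream '$time_iso8601|$remote_addr|$bytes_sent|$bytes_received|$proxy_host';
  access_log /var/log/nginx/stream.log stream;
- Add to filebeat.yml:
  filebeat.inputs:
    - type: log
      paths: ["/var/log/nginx/stream.log"]
      fields_under_root: true
      fields:
        topic: game_traffic
  output.kafka:
    hosts: ["kafka.example.com:9092"]
    topic: '%{[topic]}'
- Start the script with python3 replay.py &, or hand it to supervisor or systemd;
- In Grafana, create a Loki data source and query
  {job="game_traffic"} | json | alert="ddos"
  to see the attack curve in real time;
- When replaying, drag the time picker to 30 s before the anomaly; you can then step through frame by frame to see which traffic peak lines up with which batch of player disconnects.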
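The same replay window can also be requested outside Grafana. The following is a rough sketch that assumes Loki's HTTP range-query endpoint (/loki/api/v1/query_range) with start and end given as nanosecond Unix timestamps. The function replay_query_url and the 5-minute "after" margin are hypothetical; only the 30 s lead-in, the host, and the LogQL query come from the steps above. It builds the URL without sending any request.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlencode, urlparse

LOKI_BASE = "https://loki.example.com"  # host taken from the script's push URL
QUERY = '{job="game_traffic"} | json | alert="ddos"'


def replay_query_url(incident, before=timedelta(seconds=30),
                     after=timedelta(minutes=5)):
    """Build a range-query URL covering [incident - before, incident + after]."""
    start = incident - before
    end = incident + after
    params = {
        "query": QUERY,
        # whole seconds scaled to nanoseconds, avoiding float rounding
        "start": int(start.timestamp()) * 1_000_000_000,
        "end": int(end.timestamp()) * 1_000_000_000,
        "direction": "forward",  # oldest entries first, like a recording
    }
    return f"{LOKI_BASE}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical incident time, for illustration only
incident = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
url = replay_query_url(incident)
params = parse_qs(urlparse(url).query)
print(url)
```

Passing the resulting URL to an HTTP client would return the alert entries in chronological order, which is convenient for attaching evidence to a support ticket.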
4. Results
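- Two weeks after launch, customer-support ticket volume dropped 60%, and cheat authors found that the "disconnect" trick no longer works;
- Ops went from "2 h pulling logs" to "two clicks in Grafana, 5 min", and can finally enjoy gaming on weekends in peace.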