Docker Deployment - Crawl4AI Documentation (v0.5.x)

Quick Start 🚀

Pull and run the basic version:

# Basic run without security
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

# Run with API security enabled
docker run -p 11235:11235 -e CRAWL4AI_API_TOKEN=your_secret_token unclecode/crawl4ai:basic

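Running with Docker Compose 🐳

Using Docker Compose from a Local Dockerfile or Docker Hub

Crawl4AI provides flexible Docker Compose options for managing your containerized service. You can build the image locally from the provided Dockerfile or use the prebuilt image from Docker Hub.

Option 1: Build Locally with Docker Compose

If you want to build the image locally, use the provided docker-compose.local.yml file.

docker-compose -f docker-compose.local.yml up -d

This will:
1. Build the Docker image from the provided Dockerfile.
2. Start the container and expose it at http://localhost:11235.

Option 2: Use the Prebuilt Image from Docker Hub

If you prefer the prebuilt image on Docker Hub, use the docker-compose.hub.yml file.

docker-compose -f docker-compose.hub.yml up -d

This will:
1. Pull the prebuilt image unclecode/crawl4ai:basic (or all, depending on your configuration).
2. Start the container and expose it at http://localhost:11235.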
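
Because `up -d` returns before the service is necessarily ready, it can help to wait for the health endpoint before sending work. The following is a minimal sketch, not part of Crawl4AI: the function names `http_health_check` and `wait_until_healthy` are hypothetical, and treating HTTP 200 from /health as "ready" is an assumption.

```python
import time
import urllib.error
import urllib.request


def http_health_check(url: str = "http://localhost:11235/health") -> bool:
    """Return True if the health endpoint answers with HTTP 200 (assumed to mean ready)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: the service is not up yet.
        return False


def wait_until_healthy(check=http_health_check, timeout=60.0, interval=2.0,
                       sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Calling `wait_until_healthy()` right after `docker-compose ... up -d` returns True once the container answers, or False after the timeout. The `check`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a running container.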

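Stopping the Running Service

To stop a service started with Docker Compose, use:

docker-compose -f docker-compose.local.yml down
# or
docker-compose -f docker-compose.hub.yml down

If the container does not stop and the application is still running, list the running containers:

docker ps

Find the CONTAINER ID of the running service and stop it directly:

docker stop <CONTAINER_ID>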

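Debugging with Docker Compose

View logs: to follow the container logs:

docker-compose -f docker-compose.local.yml logs -f

Remove orphaned containers: if the service is still running unexpectedly:

docker-compose -f docker-compose.local.yml down --remove-orphans

Remove the network manually: if the network is still in use:

docker network ls
docker network rm crawl4ai_default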

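Why Use Docker Compose?

Docker Compose is the recommended way to deploy Crawl4AI because:
1. It simplifies multi-container setups.
2. It lets you define environment variables, resources, and ports in a single file.
3. It makes switching between local development and production images easier.

For example, your docker-compose.yml can hold API keys, token settings, and memory limits, which keeps deployments fast and consistent.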

API Security 🔒

Understanding CRAWL4AI_API_TOKEN

CRAWL4AI_API_TOKEN provides optional security for your Crawl4AI instance:

If CRAWL4AI_API_TOKEN is set: all API endpoints (except /health) require authentication.
If CRAWL4AI_API_TOKEN is not set: the API is publicly accessible.

# Secured instance
docker run -p 11235:11235 -e CRAWL4AI_API_TOKEN=your_secret_token unclecode/crawl4ai:all

# Unsecured instance
docker run -p 11235:11235 unclecode/crawl4ai:all

Making API Calls

For a secured instance, include the token in every request:

import requests

# Set the headers (only needed if a token is configured)
api_token = "your_secret_token"  # Same token as set in CRAWL4AI_API_TOKEN
headers = {"Authorization": f"Bearer {api_token}"} if api_token else {}

# Make an authenticated request
response = requests.post(
    "http://localhost:11235/crawl",
    headers=headers,
    json={
        "urls": "https://example.com",
        "priority": 10
    }
)

# Check the task status
task_id = response.json()["task_id"]
status = requests.get(
    f"http://localhost:11235/task/{task_id}",
    headers=headers
)

Using It with Docker Compose

In your docker-compose.yml:

services:
  crawl4ai:
    image: unclecode/crawl4ai:all
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}  # Optional
    # ... other configuration

Then either:
1. Set it in a .env file:

CRAWL4AI_API_TOKEN=your_secret_token

2. Or set it on the command line:

CRAWL4AI_API_TOKEN=your_secret_token docker-compose up

Security note: if you enable the API token, keep it secret and never commit it to version control. Every API endpoint except the health check (/health) requires the token.
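
One way to keep the token out of source files on the client side is to read it from the same environment variable the container uses. This is a minimal sketch; the helper name `auth_headers` is hypothetical, and reusing the CRAWL4AI_API_TOKEN variable name on the client is a convenience assumed here, not something the service requires.

```python
import os


def auth_headers(env=os.environ) -> dict:
    """Return a Bearer header if CRAWL4AI_API_TOKEN is set, otherwise no headers."""
    token = env.get("CRAWL4AI_API_TOKEN", "").strip()
    # An empty or whitespace-only value is treated the same as "not set".
    return {"Authorization": f"Bearer {token}"} if token else {}
```

The result can be passed as `headers=auth_headers()` to `requests.post` or `requests.get`; against an unsecured instance it is simply an empty dict.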

Configuration Options 🔧

Environment Variables

You can configure the service with environment variables:

# Basic configuration
docker run -p 11235:11235 \
    -e MAX_CONCURRENT_TASKS=5 \
    unclecode/crawl4ai:all

# With security and LLM support enabled
docker run -p 11235:11235 \
    -e CRAWL4AI_API_TOKEN=your_secret_token \
    -e OPENAI_API_KEY=sk-... \
    -e ANTHROPIC_API_KEY=sk-ant-... \
    unclecode/crawl4ai:all
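
When the set of environment variables changes between runs, the command line above can be assembled programmatically. The sketch below only builds the argument list; the helper name `docker_run_command` is hypothetical, and the flags used (`-p`, `-e`) are the ones already shown in the commands above.

```python
import shlex


def docker_run_command(image, env=None, host_port=11235, container_port=11235):
    """Build the argv list for `docker run` with one -p flag and one -e flag per variable."""
    argv = ["docker", "run", "-p", f"{host_port}:{container_port}"]
    for name, value in (env or {}).items():
        argv += ["-e", f"{name}={value}"]
    argv.append(image)
    return argv


cmd = docker_run_command(
    "unclecode/crawl4ai:all",
    env={"MAX_CONCURRENT_TASKS": 5, "CRAWL4AI_API_TOKEN": "your_secret_token"},
)
print(shlex.join(cmd))  # shell-quoted form, suitable for copy and paste
```

Passing the list to `subprocess.run(cmd)` avoids shell quoting issues entirely; printing it with `shlex.join` is useful for logging what would be executed.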

Using Docker Compose (Recommended) 🐳

Create a docker-compose.yml file:

version: '3.8'

services:
  crawl4ai:
    image: unclecode/crawl4ai:all
    ports:
      - "11235:11235"
    environment:
      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}  # Optional API security
      - MAX_CONCURRENT_TASKS=5
      # LLM provider keys
      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
    volumes:
      - /dev/shm:/dev/shm
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 1G

You can run it in two ways:

Passing environment variables directly:

CRAWL4AI_API_TOKEN=secret123 OPENAI_API_KEY=sk-... docker-compose up

Using a .env file (recommended):
Create a .env file in the same directory:

# API security (optional)
CRAWL4AI_API_TOKEN=your_secret_token

# LLM provider keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Other configuration
MAX_CONCURRENT_TASKS=5

Then simply run:

docker-compose up

Testing the Deployment 🧪

import requests

# For an unsecured instance
def test_unsecured():
    # Health check
    health = requests.get("http://localhost:11235/health")
    print("Health check:", health.json())

    # Basic crawl
    response = requests.post(
        "http://localhost:11235/crawl",
        json={
            "urls": "https://www.nbcnews.com/business",
            "priority": 10
        }
    )
    task_id = response.json()["task_id"]
    print("Task ID:", task_id)

# For a secured instance
def test_secured(api_token):
    headers = {"Authorization": f"Bearer {api_token}"}

    # Basic crawl with authentication
    response = requests.post(
        "http://localhost:11235/crawl",
        headers=headers,
        json={
            "urls": "https://www.nbcnews.com/business",
            "priority": 10
        }
    )
    task_id = response.json()["task_id"]
    print("Task ID:", task_id)

Once LLM provider keys are configured (through environment variables or the .env file), you can use LLM extraction:

request = {
    "urls": "https://example.com",
    "extraction_config": {
        "type": "llm",
        "params": {
            "provider": "openai/gpt-4",
            "instruction": "Extract the main topics from the page"
        }
    }
}

# Make the request (add headers if API security is enabled)
response = requests.post("http://localhost:11235/crawl", json=request)

Tip: remember to add .env to .gitignore to keep your API keys safe!
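
The tip above can be checked mechanically before committing. This is a rough sketch with a hypothetical helper name, `env_file_is_ignored`; it only looks for a literal `.env` or `/.env` line and does not implement full gitignore pattern matching (for an authoritative answer, `git check-ignore .env` is the tool to use).

```python
from pathlib import Path


def env_file_is_ignored(repo_dir=".") -> bool:
    """Return True if .gitignore in repo_dir contains a line that is exactly .env or /.env."""
    gitignore = Path(repo_dir) / ".gitignore"
    if not gitignore.is_file():
        return False
    for line in gitignore.read_text(encoding="utf-8").splitlines():
        # Simple textual comparison; wildcard patterns are deliberately not handled.
        if line.strip() in (".env", "/.env"):
            return True
    return False
```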

Usage Examples 📝

Basic Crawl

request = {
    "urls": "https://www.nbcnews.com/business",
    "priority": 10
}

response = requests.post("http://localhost:11235/crawl", json=request)
task_id = response.json()["task_id"]

# Fetch the result
result = requests.get(f"http://localhost:11235/task/{task_id}")

# Structured extraction with a JSON/CSS schema
schema = {
    "name": "Crypto Prices",
    "baseSelector": ".cds-tableRow-t45thuk",
    "fields": [
        {
            "name": "crypto",
            "selector": "td:nth-child(1) h2",
            "type": "text",
        },
        {
            "name": "price",
            "selector": "td:nth-child(2)",
            "type": "text",
        }
    ],
}

request = {
    "urls": "https://www.coinbase.com/explore",
    "extraction_config": {
        "type": "json_css",
        "params": {"schema": schema}
    }
}
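
A malformed schema is easier to catch before it is sent than after the task fails. The sketch below checks only the keys that appear in the example above (`name`, `baseSelector`, `fields`, and per-field `name`, `selector`, `type`); treating all of them as required is an assumption, and `schema_problems` is a hypothetical helper, not part of Crawl4AI.

```python
def schema_problems(schema: dict) -> list:
    """Return a list of human-readable problems; an empty list means the shape looks fine."""
    problems = []
    for key in ("name", "baseSelector", "fields"):
        if key not in schema:
            problems.append(f"missing top-level key: {key}")
    for i, field in enumerate(schema.get("fields", [])):
        for key in ("name", "selector", "type"):
            if key not in field:
                problems.append(f"fields[{i}] missing key: {key}")
    return problems
```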

Handling Dynamic Content

request = {
    "urls": "https://www.nbcnews.com/business",
    "js_code": [
        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
    ],
    "wait_for": "article.tease-card:nth-child(10)"
}

# Cosine-similarity extraction
request = {
    "urls": "https://www.nbcnews.com/business",
    "extraction_config": {
        "type": "cosine",
        "params": {
            "semantic_filter": "business finance economy",
            "word_count_threshold": 10,
            "max_dist": 0.2,
            "top_k": 3
        }
    }
}

Platform-Specific Instructions 💻

macOS

docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

Ubuntu

# Basic version
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

# With GPU support
docker pull unclecode/crawl4ai:gpu
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu

Windows (PowerShell)

docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic

Testing 🧪

Save the following as test_docker.py:

import requests
import json
import time
import sys

class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235"):
        self.base_url = base_url

    def submit_and_wait(self, request_data: dict, timeout: int = 300) -> dict:
        # Submit the crawl task
        response = requests.post(f"{self.base_url}/crawl", json=request_data)
        task_id = response.json()["task_id"]
        print(f"Task ID: {task_id}")

        # Poll for the result
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
                raise TimeoutError(f"Task {task_id} timed out")

            result = requests.get(f"{self.base_url}/task/{task_id}")
            status = result.json()

            if status["status"] == "completed":
                return status

            time.sleep(2)

def test_deployment():
    tester = Crawl4AiTester()

    # Test a basic crawl
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10
    }

    result = tester.submit_and_wait(request)
    print("Basic crawl succeeded!")
    print(f"Content length: {len(result['result']['markdown'])}")

if __name__ == "__main__":
    test_deployment()

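Advanced Configuration

Crawler Parameters

The crawler_params field lets you configure the browser instance and crawling behavior. These are the key parameters you can use:

request = {
    "urls": "https://example.com",
    "crawler_params": {
        # Browser configuration
        "headless": True,                    # Run in headless mode
        "browser_type": "chromium",          # chromium/firefox/webkit
        "user_agent": "custom-agent",        # Custom user agent
        "proxy": "http://proxy:8080",        # Proxy configuration

        # Performance and behavior
        "page_timeout": 30000,               # Page load timeout (ms)
        "verbose": True,                     # Enable verbose logging
        "semaphore_count": 5,                # Concurrent request limit

        # Anti-detection features
        "simulate_user": True,               # Simulate human behavior
        "magic": True,                       # Advanced anti-detection
        "override_navigator": True,          # Override navigator properties

        # Session management
        "user_data_dir": "./browser-data",   # Browser profile location
        "use_managed_browser": True,         # Use a persistent browser
    }
}

The extra field passes additional parameters directly to the crawler's arun function:

request = {
    "urls": "https://example.com",
    "extra": {
        "word_count_threshold": 10,          # Minimum words per block
        "only_text": True,                   # Extract text only
        "bypass_cache": True,                # Force a fresh crawl
        "process_iframes": True,             # Include iframe content
    }
}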
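
Since a request mixes three levels (top-level fields, `crawler_params`, and `extra`), a small builder can keep them apart and apply shared defaults. This is a sketch only: `build_crawl_request` and `DEFAULT_CRAWLER_PARAMS` are hypothetical names, and the default values are illustrative choices, not defaults of the service.

```python
# Illustrative defaults chosen for this sketch, not service defaults.
DEFAULT_CRAWLER_PARAMS = {"headless": True, "page_timeout": 30000}


def build_crawl_request(urls, crawler_params=None, extra=None, **top_level):
    """Assemble a /crawl payload, keeping the three parameter levels separate."""
    request = {"urls": urls, **top_level}
    # Browser-level settings: shared defaults first, caller overrides win.
    request["crawler_params"] = {**DEFAULT_CRAWLER_PARAMS, **(crawler_params or {})}
    if extra:
        # Forwarded to the crawler's arun call, as described above.
        request["extra"] = dict(extra)
    return request
```

For example, `build_crawl_request("https://example.com", crawler_params={"page_timeout": 60000}, css_selector=".article-body")` overrides only the timeout and places `css_selector` at the top level.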

Complete Examples

Advanced News Crawl

request = {
    "urls": "https://www.nbcnews.com/business",
    "crawler_params": {
        "headless": True,
        "page_timeout": 30000,
        "remove_overlay_elements": True      # Remove pop-ups
    },
    "extra": {
        "word_count_threshold": 50,          # Longer content blocks
        "bypass_cache": True                 # Fresh content
    },
    "css_selector": ".article-body"
}

Anti-Detection Configuration

request = {
    "urls": "https://example.com",
    "crawler_params": {
        "simulate_user": True,
        "magic": True,
        "override_navigator": True,
        "user_agent": "Mozilla/5.0 ...",
        "headers": {
            "Accept-Language": "en-US,en;q=0.9"
        }
    }
}

LLM Extraction with Custom Parameters

# pricing_schema is a schema dict that you define yourself; it is not shown here
request = {
    "urls": "https://openai.com/pricing",
    "extraction_config": {
        "type": "llm",
        "params": {
            "provider": "openai/gpt-4",
            "schema": pricing_schema
        }
    },
    "crawler_params": {
        "verbose": True,
        "page_timeout": 60000
    },
    "extra": {
        "word_count_threshold": 1,
        "only_text": True
    }
}

Session-Based Dynamic Content

request = {
    "urls": "https://example.com",
    "crawler_params": {
        "session_id": "dynamic_session",
        "headless": False,
        "page_timeout": 60000
    },
    "js_code": ["window.scrollTo(0, document.body.scrollHeight);"],
    "wait_for": "js:() => document.querySelectorAll('.item').length > 10",
    "extra": {
        "delay_before_return_html": 2.0
    }
}

Screenshot with Custom Timing

request = {
    "urls": "https://example.com",
    "screenshot": True,
    "crawler_params": {
        "headless": True,
        "screenshot_wait_for": ".main-content"
    },
    "extra": {
        "delay_before_return_html": 3.0
    }
}

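Parameter Reference Table

Category	Parameter	Type	Description
Browser	headless	boolean	Run the browser in headless mode
Browser	browser_type	string	Browser engine selection
Browser	user_agent	string	Custom user agent string
Network	proxy	string	Proxy server URL
Network	headers	dict	Custom HTTP headers
Timing	page_timeout	integer	Page load timeout (ms)
Timing	delay_before_return_html	float	Wait time before capture
Anti-detection	simulate_user	boolean	Simulate human behavior
Anti-detection	magic	boolean	Advanced protection
Session	session_id	string	Browser session ID
Session	user_data_dir	string	Profile directory
Content	word_count_threshold	integer	Minimum words per block
Content	only_text	boolean	Extract text only
Content	process_iframes	boolean	Include iframe content
Debug	verbose	boolean	Verbose logging
Debug	log_console	boolean	Browser console logs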
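
The types in the table can be turned into a quick client-side check before a request is sent. This is a sketch with hypothetical names (`PARAM_TYPES`, `type_errors`); it encodes only the table above, accepts integers where a float is expected, and ignores parameters the table does not list.

```python
# Expected Python types, transcribed from the parameter reference table.
PARAM_TYPES = {
    "headless": bool, "browser_type": str, "user_agent": str,
    "proxy": str, "headers": dict,
    "page_timeout": int, "delay_before_return_html": float,
    "simulate_user": bool, "magic": bool,
    "session_id": str, "user_data_dir": str,
    "word_count_threshold": int, "only_text": bool, "process_iframes": bool,
    "verbose": bool, "log_console": bool,
}


def type_errors(params: dict) -> list:
    """Return one message per parameter whose value does not match the table."""
    errors = []
    for name, value in params.items():
        expected = PARAM_TYPES.get(name)
        if expected is None:
            continue  # Not in the table; this sketch does not judge it.
        if expected in (int, float) and isinstance(value, bool):
            ok = False  # bool is a subclass of int, so reject it for numbers.
        elif expected is float:
            ok = isinstance(value, (int, float))
        else:
            ok = isinstance(value, expected)
        if not ok:
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors
```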

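Troubleshooting 🔍

Common Issues

Connection Refused

Error: Connection refused at localhost:11235

Solution: make sure the container is running and the port mapping is correct.

Resource Limits

Error: No available slots

Solution: increase MAX_CONCURRENT_TASKS or the container's resources.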
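
On the client side, a busy server can also be handled by retrying with exponential backoff instead of failing immediately. The sketch below is generic: `retry_with_backoff` is a hypothetical helper, and because this document does not specify how the "no available slots" condition appears in a response, the caller supplies the `is_busy` predicate.

```python
import time


def retry_with_backoff(submit, is_busy, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call submit() until is_busy(result) is False or the attempts run out.

    Delays between attempts double each time: base_delay, 2x, 4x, ...
    """
    for attempt in range(attempts):
        result = submit()
        if not is_busy(result):
            return result
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"server still busy after {attempts} attempts")
```

Here `submit` would typically be a lambda wrapping `requests.post(...)`, and `is_busy` would inspect the response in whatever way your deployment reports the condition.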

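GPU Access

Solution: make sure the correct NVIDIA drivers are installed and use the --gpus all flag.

Debug Mode

Open a shell in the container for debugging:

docker run -it --entrypoint /bin/bash unclecode/crawl4ai:all

View the container logs:

docker logs [container_id]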

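Best Practices 🌟

Resource Management
- Set appropriate memory and CPU limits
- Monitor resource usage through the health endpoint
- Use the basic version for simple crawling tasks
Scaling
- Use multiple containers for high load
- Implement proper load balancing
- Monitor performance metrics
Security
- Store sensitive data in environment variables
- Implement proper network isolation
- Apply security updates regularly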
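
As a simple illustration of spreading load across several containers, a client can rotate through their base URLs in round-robin order. This is a sketch under stated assumptions: the class name `RoundRobinEndpoints` and the example ports are hypothetical, and a production setup would more likely put a real load balancer in front of the containers.

```python
import itertools


class RoundRobinEndpoints:
    """Hand out /crawl URLs from a fixed list of base URLs in rotating order."""

    def __init__(self, base_urls):
        urls = list(base_urls)
        if not urls:
            raise ValueError("at least one base URL is required")
        self._cycle = itertools.cycle(urls)

    def next_crawl_url(self) -> str:
        # Strip any trailing slash so the path is joined cleanly.
        return next(self._cycle).rstrip("/") + "/crawl"


# Hypothetical example: two containers published on different host ports.
endpoints = RoundRobinEndpoints(["http://localhost:11235", "http://localhost:11236/"])
```

Each call to `endpoints.next_crawl_url()` returns the next target, which can then be passed to `requests.post`.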

API Reference 📚

Health Check

GET /health

Submit a Crawl Task

POST /crawl
Content-Type: application/json

{
    "urls": "string or array",
    "extraction_config": {
        "type": "basic|llm|cosine|json_css",
        "params": {}
    },
    "priority": 1-10,
    "ttl": 3600
}
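
The request shape above can be checked locally before it is posted. The sketch below encodes only what the reference states (`urls` as a string or array, `priority` from 1 to 10, and the four extraction types); `payload_problems` is a hypothetical helper, and whether the server enforces these rules in the same way is an assumption.

```python
ALLOWED_EXTRACTION_TYPES = {"basic", "llm", "cosine", "json_css"}


def payload_problems(payload: dict) -> list:
    """Return a list of problems found in a /crawl payload; empty means none found."""
    problems = []

    urls = payload.get("urls")
    if isinstance(urls, str):
        if not urls:
            problems.append("urls must not be empty")
    elif isinstance(urls, list):
        if not urls or not all(isinstance(u, str) and u for u in urls):
            problems.append("urls must be a non-empty list of non-empty strings")
    else:
        problems.append("urls must be a string or a list of strings")

    priority = payload.get("priority")
    if priority is not None:
        is_int = isinstance(priority, int) and not isinstance(priority, bool)
        if not is_int or not 1 <= priority <= 10:
            problems.append("priority must be an integer from 1 to 10")

    config = payload.get("extraction_config")
    if config is not None and config.get("type") not in ALLOWED_EXTRACTION_TYPES:
        problems.append("extraction_config.type must be one of: basic, cosine, json_css, llm")

    return problems
```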