https://www.firecrawl.dev/

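What is Firecrawl
Firecrawl is a crawler that converts websites into Markdown that is easy for AI to process. It is offered mainly as an API service. It does not need a sitemap: given a single URL, it crawls the site and every accessible subpage beneath it.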
Deploying Firecrawl locally
https://github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md
For a simpler setup, you can use Docker Compose to run all services:
- Prerequisites: Make sure you have Docker and Docker Compose installed
- Copy the .env.example file to .env in the /apps/api/ directory and configure as needed
- From the root directory, run: docker compose up
This will start Redis, the API server, and workers automatically in the correct configuration.
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
Create the .env file
cp apps/api/.env.example apps/api/.env
If you need the LLM features, set OPENAI_API_KEY and OPENAI_BASE_URL in apps/api/.env:
OPENAI_API_KEY=xxx
OPENAI_BASE_URL=xxx
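Before building, it can help to confirm that both LLM settings actually made it into the file. The following is a minimal sketch, not part of Firecrawl: parse_env and missing_llm_settings are hypothetical helper names, and the parser assumes plain KEY=VALUE lines only.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_llm_settings(env: dict[str, str]) -> list[str]:
    """Return the LLM-related keys that are absent or empty."""
    required = ("OPENAI_API_KEY", "OPENAI_BASE_URL")
    return [key for key in required if not env.get(key)]
```

To use it, read the file with pathlib.Path("apps/api/.env").read_text(), pass the text to parse_env, and check that missing_llm_settings returns an empty list.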
Build and start
docker compose build
docker compose up -d
In mainland China the Playwright download can be very slow. To use mirrors instead, edit apps/playwright-service-ts/Dockerfile:
RUN echo "deb http://mirrors.aliyun.com/debian/ bookworm main non-free contrib\n\
deb http://mirrors.aliyun.com/debian/ bookworm-updates main non-free contrib\n\
deb http://mirrors.aliyun.com/debian-security bookworm-security main non-free contrib" > /etc/apt/sources.list
# Install Playwright dependencies
ENV PLAYWRIGHT_DOWNLOAD_HOST=https://npmmirror.com/mirrors/playwright/
RUN npx playwright install --with-deps
Test it
curl -X GET http://localhost:3002/test
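The same check can be run from Python, which is handy when a script should wait for the containers to come up. This is a rough sketch using only the standard library; is_api_up is a hypothetical helper name, and treating HTTP 200 as "healthy" is an assumption about the /test endpoint.

```python
import urllib.request


def is_api_up(base_url: str = "http://localhost:3002", timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/test answers with HTTP 200."""
    url = base_url.rstrip("/") + "/test"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, refused connections, and socket timeouts
        # all derive from OSError, so any of them counts as "not up".
        return False
```

Calling is_api_up() with no arguments probes the local deployment on port 3002.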
Calling it from Python
pip install firecrawl-py
import logging

from firecrawl import FirecrawlApp

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def main():
    try:
        # A self-hosted instance needs no API key; point the client at the local API.
        app = FirecrawlApp(api_key=None, api_url="http://localhost:3002")
        params = {
            'formats': ['markdown'],
        }
        logger.info("Starting to scrape the page...")
        scrape_status = app.scrape_url('https://www.kujiale.com/', params=params)
        logger.info("Scrape result:")
        print(scrape_status)
    except Exception as e:
        logger.error(f"Error while scraping: {str(e)}")
        raise


if __name__ == "__main__":
    main()
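The value returned by scrape_url contains more than the Markdown text, and its shape appears to vary between firecrawl-py versions. The helper below is a defensive sketch under that assumption: extract_markdown is a hypothetical name, and the "markdown" key, the optional "data" wrapper, and the markdown attribute are guesses about the result shape to verify against the installed SDK.

```python
from typing import Any, Optional


def extract_markdown(result: Any) -> Optional[str]:
    """Pull the Markdown text out of a scrape result, or return None."""
    if isinstance(result, dict):
        # Assumption: the payload may be nested under "data"; otherwise use the dict itself.
        payload = result.get("data", result)
        if isinstance(payload, dict):
            return payload.get("markdown")
        return None
    # Assumption: object-style results expose the text as a "markdown" attribute.
    return getattr(result, "markdown", None)
```

For example, extract_markdown(scrape_status) can replace the print call above when only the page text is needed.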


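The result shows that Firecrawl extracts the page content, so the data can be passed straight to an AI model or inserted into a RAG pipeline for further processing.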
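As one illustration of the RAG step, the Markdown usually has to be cut into chunks before it is embedded. The following is a simple sketch and not something Firecrawl provides: split_markdown_by_heading is a hypothetical helper, and the 1000-character cap is an arbitrary assumption.

```python
import re


def split_markdown_by_heading(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown into chunks at heading lines, capping each chunk's length."""
    # Split in front of every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", markdown)
    chunks = []
    for section in sections:
        text = section.strip()
        # Cut sections longer than max_chars into fixed-size pieces.
        for start in range(0, len(text), max_chars):
            chunks.append(text[start:start + max_chars])
    return chunks
```

Note that this naive split also breaks on lines starting with "# " inside fenced code blocks, so a real pipeline would likely want a Markdown-aware splitter.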
