To generate a sitemap.xml file, you need a crawler that collects all valid links on the site. Here is a complete solution:
Step 1: Install the required Python libraries
pip install requests beautifulsoup4 lxml
Step 2: Create the crawler script (sitemap_generator.py)
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET
from datetime import datetime

def get_all_links(base_url):
    # Track visited URLs and the crawl frontier
    visited = set()
    queue = [base_url]
    all_links = set()
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue
        # Mark as visited up front so failed URLs are not re-queued forever
        visited.add(url)
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code != 200:
                continue
            # Skip non-HTML resources (images, PDFs, etc.)
            if 'text/html' not in response.headers.get('Content-Type', ''):
                continue
            all_links.add(url)
            print(f"Crawled: {url}")
            # Parse the HTML and collect new links
            soup = BeautifulSoup(response.text, 'lxml')
            for link in soup.find_all('a', href=True):
                href = link['href'].strip()
                full_url = urljoin(url, href)
                # Drop any #fragment before filtering and deduplication
                full_url = full_url.split('#')[0]
                parsed = urlparse(full_url)
                if parsed.scheme not in ('http', 'https'):
                    continue
                if not parsed.netloc.endswith('91kaiye.cn'):  # same-site links only
                    continue
                if full_url not in visited:
                    queue.append(full_url)
        except Exception as e:
            print(f"Error crawling {url}: {e}")
    return all_links

def create_sitemap(links, filename='sitemap.xml'):
    root = ET.Element('urlset', xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')

    for link in sorted(links):
        url_elem = ET.SubElement(root, 'url')
        ET.SubElement(url_elem, 'loc').text = link
        ET.SubElement(url_elem, 'lastmod').text = datetime.now().strftime('%Y-%m-%d')
        ET.SubElement(url_elem, 'changefreq').text = 'daily'
        ET.SubElement(url_elem, 'priority').text = '0.8'

    tree = ET.ElementTree(root)
    tree.write(filename, encoding='utf-8', xml_declaration=True)
    print(f"\nSitemap generated: {filename} with {len(links)} URLs")

if __name__ == '__main__':
    base_url = 'https://www.91kaiye.cn/'
    print("Starting crawl...")
    links = get_all_links(base_url)
    create_sitemap(links)
Step 3: Run the script
python sitemap_generator.py
How it works:
- Crawler logic:
  - Breadth-first search starting from the home page https://www.91kaiye.cn/
  - Automatically filters out external links, anchors, and invalid URLs
  - Records a last-modified date for each page (defaults to today)
  - Sets the change frequency to daily and the priority to 0.8
- Output file:
  - The generated sitemap.xml looks like this:
<?xml version='1.0' encoding='utf-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.91kaiye.cn/page1</loc>
    <lastmod>2023-10-05</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  ...
</urlset>
Caveats:
- Anti-crawling measures: if the site blocks crawlers, you may need to:
  - Add a time.sleep(1) delay between requests
  - Use proxy IPs
  - Set more realistic request headers
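The simplest delay is a `time.sleep(1)` before each `requests.get`. The `RateLimiter` class below is a hypothetical helper sketching a slightly nicer variant: it enforces a minimum interval between requests without sleeping longer than necessary.

```python
import time

class RateLimiter:
    """Enforce a minimum interval between successive requests."""
    def __init__(self, delay=1.0):
        self.delay = delay
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval, if any
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last = time.monotonic()
```

In the crawl loop, call `limiter.wait()` immediately before each `requests.get(...)`.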
- Dynamic content: for JavaScript-rendered pages (e.g. Vue/React), switch to Selenium or Playwright
- Optimization suggestions:
  - Run the script periodically on the server (e.g. once a week)
  - Submit the sitemap to Google Search Console
  - Add this line to robots.txt:
    Sitemap: https://www.91kaiye.cn/sitemap.xml
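The periodic run can be scheduled with cron. This is a sketch only; the interpreter, script, and web-root paths below are assumptions and must be adjusted to your server's layout:

```shell
# crontab -e: run the generator every Monday at 03:00 and publish the result.
# /opt/scripts and /var/www/html are example paths, not taken from the article.
0 3 * * 1 /usr/bin/python3 /opt/scripts/sitemap_generator.py && cp /opt/scripts/sitemap.xml /var/www/html/sitemap.xml
```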
Alternative: online tools
If you prefer not to run code, you can generate a sitemap with an online service:
- XML-Sitemaps.com
- Screaming Frog SEO Spider (desktop tool)
After generating it, upload sitemap.xml to the site's root directory and submit it via the Baidu/Google webmaster tools.