Scraping Data with Python (Part 2)

I. Code in the example2 package

1. Using the re module's compile function

import re
pattern=re.compile(r'\d+')
print(pattern)
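
As a small sketch of what compile gives you: the compiled object keeps its source pattern and can be reused across many calls, and optional flags change how it matches. The sample word and sentence below are made up for illustration.

```python
import re

# Compile once, reuse many times.
pattern = re.compile(r'\d+')
source = pattern.pattern          # the original pattern text

# A flag passed as the second argument changes matching behaviour;
# re.IGNORECASE makes letters match regardless of case.
word = re.compile(r'python', re.IGNORECASE)
hits = word.findall('Python, PYTHON and python')

print(source)
print(hits)   # ['Python', 'PYTHON', 'python']
```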

2. Using the match method

import re
pattern=re.compile(r'\d+')
# m1=pattern.match('one123twothree345four')
# Argument 2: start position (inclusive); argument 3: end position (exclusive). match() stops after the first successful match.
m1=pattern.match('one123twothree345four',3,7)
print(m1.group())
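
The following sketch, using the same sample string, illustrates two details that are easy to miss: match() only succeeds when the pattern matches right at the start position, and the end position is not included in the searched range. Checking for None before calling group() avoids an AttributeError.

```python
import re

pattern = re.compile(r'\d+')
text = 'one123twothree345four'

# Without a start position, match() tries index 0, where 'o' is not a digit.
no_match = pattern.match(text)

# pos=3 starts at '1'; endpos=5 stops before index 5, so only '12' is visible.
m = pattern.match(text, 3, 5)
matched = m.group() if m else None

print(no_match)   # None
print(matched)    # 12
```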

3. Using the search method

import re
pattern=re.compile(r'\d+')
m1=pattern.search('one123twothree345four')
print(m1.group())
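
Unlike match(), search() scans forward until it finds the first match. The sketch below shows how the returned match object also reports where the match sits, and how a start position can be used to continue searching after it. The 'no digits here' string is just an illustrative input.

```python
import re

pattern = re.compile(r'\d+')
text = 'one123twothree345four'

m = pattern.search(text)
first = m.group()        # first run of digits
position = m.span()      # (start, end) indices, end exclusive

# Continue searching from where the first match ended.
second = pattern.search(text, m.end())
second_value = second.group()

# search() returns None when nothing matches.
missing = pattern.search('no digits here')

print(first, position)   # 123 (3, 6)
print(second_value)      # 345
print(missing)           # None
```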

4. Using the findall method

import re
pattern=re.compile(r'\d+')
result=pattern.findall('hello 123 world 456')
print(result)
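
One thing worth knowing about findall is that its return value changes when the pattern contains capture groups. A minimal sketch, reusing the sample string above with patterns chosen only for illustration:

```python
import re

text = 'hello 123 world 456'

# One capture group: findall returns the group's text, not the whole match.
words = re.findall(r'(\w+) \d+', text)

# Two capture groups: findall returns a list of tuples.
pairs = re.findall(r'([a-z]+) (\d+)', text)

# finditer yields match objects, which also carry positions.
located = [(m.group(), m.start()) for m in re.finditer(r'\d+', text)]

print(words)    # ['hello', 'world']
print(pairs)    # [('hello', '123'), ('world', '456')]
print(located)  # [('123', 6), ('456', 16)]
```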

5. Using the split method

import re
str1='a,b,c'
print(str1.split(','))
str2='a,b;;c,d'
pattern=re.compile(r"[\s\,\;]+")
print(pattern.split(str2))
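
A few variations on the same idea, as a sketch: limiting the number of splits, keeping the delimiters by wrapping them in a capture group, and the empty string that appears when the input starts with a delimiter. The inputs are illustrative.

```python
import re

pattern = re.compile(r'[\s,;]+')

# maxsplit limits how many splits are made; the rest stays intact.
limited = pattern.split('a,b;;c,d', maxsplit=2)

# A capture group around the delimiter keeps the delimiters in the result.
with_delims = re.split(r'([,;]+)', 'a,b;;c')

# A leading delimiter produces an empty first element.
leading = pattern.split(',a,b')

print(limited)      # ['a', 'b', 'c,d']
print(with_delims)  # ['a', ',', 'b', ';;', 'c']
print(leading)      # ['', 'a', 'b']
```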

6. Using the sub method

import re
string='<h1 class="test1">HelloWorld</h1>'
pattern=re.compile(r'\d')
print(pattern.sub('2',string))
print(pattern.sub('2',string,1))
pattern=re.compile('<(.\\d)\\sclass="(?P<classname>.*?)">.*?</(\\1)>')
print(pattern.search(string).group(3))
def fun(m):
    return 'after sub'+m.group('classname')
print(pattern.sub(fun, string))
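
Besides a function, the replacement can be a template string that refers back to captured groups. The sketch below uses a slightly different pattern from the one above (the named group body is an addition made for this illustration), and also shows subn, which reports how many replacements were made.

```python
import re

string = '<h1 class="test1">HelloWorld</h1>'

# \1 refers to the first group; \g<name> refers to a named group.
pattern = re.compile(r'<(h\d)\sclass="(?P<classname>.*?)">(?P<body>.*?)</\1>')
rewritten = pattern.sub(r'<\1 class="\g<classname>-new">\g<body></\1>', string)

# subn returns a (new_string, number_of_replacements) tuple.
replaced, count = re.subn(r'\d', '2', string)

print(rewritten)  # <h1 class="test1-new">HelloWorld</h1>
print(replaced)   # <h2 class="test2">HelloWorld</h2>
print(count)      # 3
```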

7. Greedy matching

import re
string='<h1 class="test1">HelloWorld</h1>'
# Greedy matching: .* consumes as much as possible
pattern=re.compile(r'<.\d\sclass=.*>')
print(pattern.search(string).group())
# Non-greedy matching: .*? stops at the first possible '>'
pattern=re.compile(r'<.\d\sclass=.*?>')
print(pattern.search(string).group())
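
The difference matters most when a string contains several tags of the same kind. A minimal sketch with a made-up two-tag string:

```python
import re

html = '<b>one</b><b>two</b>'

# Greedy: .* runs to the last </b>, swallowing the tags in between.
greedy = re.findall(r'<b>(.*)</b>', html)

# Non-greedy: .*? stops at the first </b> each time.
lazy = re.findall(r'<b>(.*?)</b>', html)

# A negated character class is another common way to get the same result here.
negated = re.findall(r'<b>([^<]*)</b>', html)

print(greedy)   # ['one</b><b>two']
print(lazy)     # ['one', 'two']
print(negated)  # ['one', 'two']
```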

8. Comprehensive example

import re

import requests


def handle_detail_re(content):
    # re.S lets '.' match newlines too, so a pattern can span multiple lines
    # item_search=re.compile('ts_solgcont_title">.*?</div>\r\n\t</div>',re.S)
    item_search = re.compile('ts_solgcont_title">.*?<div class="ts_solgcont_bot">.*?</div>', re.S)
    # Get the HTML block for each book
    all_item = item_search.findall(content)
    # Match the book title
    title_search = re.compile('target="_blank">(.*?)</a>')
    # Match the author ('作者' is the "author" label on the page)
    author_search = re.compile('<p>作者(.*?)</p>')
    for item in all_item:
        print({
            "title": title_search.search(item).group(1),
            "author": author_search.search(item).group(1),
        })


def main():
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36 CrKey/1.54.250320 Edg/135.0.0.0"
    }
    booktype = ['java', 'python', 'c']
    for key in booktype:
        url = 'http://www.cmpedu.com/so.htm?&KEY={}'.format(key)
        response = requests.get(url, headers=header)
        handle_detail_re(response.text)


if __name__ == '__main__':
    main()

III. Code in example3

Command to install beautifulsoup4: pip3 install beautifulsoup4
beautifulsoup4: Beautiful Soup (bs4) is a Python library for extracting data from HTML or XML files.

1. Getting nodes

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<!-- Elsie -->,and
;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
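# Argument 1: the HTML snippet
# Argument 2: the parser ('lxml' requires the lxml package to be installed)
soup=BeautifulSoup(html,'lxml')
# Get the title element
print(soup.title)
# Get the head element
print(soup.head)
# Get the body element
print(soup.body)
# Get the text content of the title element
print(soup.title.string)
# Get the tag name
print(soup.title.name)
# Attribute-style access returns only the first matching node; later ones are ignored
print(soup.p)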
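
Since attribute access such as soup.p stops at the first node, find_all is the usual way to collect every match. A minimal sketch with a shortened version of the sample document; it uses the built-in 'html.parser' so it runs without lxml, which is a substitution made for this example only.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
</body></html>
"""

# 'html.parser' ships with Python, so no extra install is needed for this sketch.
soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag, not just the first.
all_p = soup.find_all('p')

# class_ filters by CSS class (the underscore avoids the Python keyword).
story_p = soup.find_all('p', class_='story')

# Attributes are read like dictionary keys; class is returned as a list.
first_class = soup.p['class']
first_text = soup.p.get_text()

print(len(all_p), len(story_p))   # 3 2
print(first_class, first_text)
```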
