Table of contents
- Text processing in Python 3
- Using the jieba library
- Counting high-frequency words in hamlet.txt
- Counting high-frequency words in Romance of the Three Kingdoms
- Web scraping
- Scraping the Baidu homepage
- Scraping a JD.com phone product page
- BeautifulSoup
- Scraping with requests, then processing the result with BeautifulSoup for better formatting
- Scraping the Baidu homepage with BeautifulSoup
The original notes ran too long, so this is an excerpt, organized by topic.
Text processing in Python 3
Using the jieba library
pip3 install jieba
Counting high-frequency words in hamlet.txt
Video walkthrough
kou@ubuntu:~/python$ cat ClaHamlet.py
#!/usr/bin/env python
# coding=utf-8
# e10.1CalHamlet.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")  # replace punctuation with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
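The tally loop above is the classic dictionary idiom. The standard library's collections.Counter does the same tally and top-N selection more compactly; a sketch on an inline sample string rather than hamlet.txt:

```python
from collections import Counter

text = "to be or not to be that is the question"
counts = Counter(text.split())               # tally every word in one call
for word, count in counts.most_common(3):    # top 3, highest count first
    print("{0:<10}{1:>5}".format(word, count))
```

most_common sorts by count descending and, since Python 3.7, breaks ties by first insertion order, so it replaces the manual sort-and-slice step.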
Counting high-frequency words in Romance of the Three Kingdoms
#!/usr/bin/env python
# coding=utf-8
# e10.2CalThreeKingdoms.py
# Chinese text has no spaces between words, so split() cannot be used here;
# jieba does the segmentation instead.
# Assumes the novel's text is saved as threekingdoms.txt.
import jieba

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)  # segment the full text into a word list
counts = {}
for word in words:
    if len(word) == 1:   # skip single characters (mostly punctuation and particles)
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
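Raw segmentation counts mix narrative words in with names, and the novel refers to the same character by several names (for example 孔明 and 诸葛亮 both mean Zhuge Liang). A common refinement is an exclude set plus an alias table; the word list, exclude set, and alias table below are small hypothetical samples, not the real course data:

```python
from collections import Counter

# hypothetical sample; in the real script this list comes from jieba.lcut
words = ["曹操", "孔明", "却说", "诸葛亮", "玄德", "孔明", "将军"]

excludes = {"却说", "将军"}                   # frequent words that are not names
aliases = {"孔明": "诸葛亮", "玄德": "刘备"}  # alternate name -> canonical name

counts = Counter(aliases.get(w, w) for w in words if w not in excludes)
print(counts.most_common(3))
```

Growing the exclude set iteratively, by inspecting the top of each run, is how the noise words are usually found.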
Web scraping
The learning resource is the web-scraping course by Prof. Song Tian on the China University MOOC platform.
Below are a few simple examples; once you are comfortable writing them, they cover most basic scraping needs.
Scraping the Baidu homepage
import requests

r = requests.get("https://www.baidu.com")
r.encoding = 'utf-8'
fo = open("baidu.txt", "w+")
fo.write(r.text)
fo.close()
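The snippet above has no error handling: a timeout or a non-200 response would crash it or silently save garbage. The course wraps fetching in a small reusable function; a sketch of that pattern, where the function name and the 30-second timeout follow the course's convention and the error string is a placeholder:

```python
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)   # give up after 30 seconds
        r.raise_for_status()                # raise if the status code is not 200
        r.encoding = r.apparent_encoding    # decode using the encoding guessed from the body
        return r.text
    except Exception:
        return "request failed"

print(getHTMLText("https://www.baidu.com")[:100])
```

Every later example is this same get / check / re-encode sequence with different post-processing.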
Scraping a JD.com phone product page
import requests

url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()              # raises an exception if the status code is not 200
    r.encoding = r.apparent_encoding  # switch to the encoding guessed from the content
    print(r.text[:1000])              # only the first 1000 characters
except:
    print("False")
BeautifulSoup
Scrape with requests, then process the result with BeautifulSoup for better formatting.
import requests
from bs4 import BeautifulSoup

fo = open("jingdong.md", "w")
url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.encoding = r.apparent_encoding
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    fo.write(soup.prettify())
except:
    print("False")
fo.close()
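prettify() re-indents the whole parse tree, one tag per line, which is what gives the saved file its better layout. A minimal offline sketch on a hardcoded HTML snippet, so it runs without any network access:

```python
from bs4 import BeautifulSoup

demo = '<html><head><title>Demo</title></head><body><p class="intro">Hello</p></body></html>'
soup = BeautifulSoup(demo, "html.parser")

print(soup.title.string)   # the text inside <title>
print(soup.p["class"])     # attribute access; class is returned as a list
print(soup.prettify())     # the same document, nested and indented
```

The parse tree also supports navigation (soup.title, soup.p) and attribute lookup, so the same object used for pretty-printing can be queried for specific fields.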
Scraping the Baidu homepage with BeautifulSoup
import requests
from bs4 import BeautifulSoup

fo = open("baidu.md", "w")
try:
    r = requests.get("https://www.baidu.com")
    r.encoding = r.apparent_encoding
    soup = BeautifulSoup(r.text, "html.parser")
    fo.write(soup.prettify())
except:
    print("False")
fo.close()
Bonus
Open-source links to scraping and Python examples