Table of contents
- Text processing in Python 3
- Using the jieba library
- Counting high-frequency words in hamlet.txt
- Counting high-frequency words in Romance of the Three Kingdoms
- Web scraping
- Scraping the Baidu homepage
- Scraping a JD.com phone product page
- BeautifulSoup
- Scraping with requests, then processing the result with BeautifulSoup for better formatting
- Scraping the Baidu homepage with BeautifulSoup
The original notes ran too long, so this is an excerpt, organized by topic.
Text processing in Python 3
Using the jieba library
pip3 install jieba
Counting high-frequency words in hamlet.txt
Video walkthrough
kou@ubuntu:~/python$ cat ClaHamlet.py
#!/usr/bin/env python
# coding=utf-8
# e10.1CalHamlet.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")  # replace punctuation with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
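The tally loop above is the classic dictionary idiom. The standard library's collections.Counter does the same tally and top-N selection more compactly; a sketch on an inline sample string rather than hamlet.txt:

```python
from collections import Counter

text = "to be or not to be that is the question"
counts = Counter(text.split())               # tally every word in one call
for word, count in counts.most_common(3):    # top 3, highest count first
    print("{0:<10}{1:>5}".format(word, count))
```

most_common sorts by count descending and, since Python 3.7, breaks ties by first insertion order, so it replaces the manual sort-and-slice step.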
Counting high-frequency words in Romance of the Three Kingdoms
#!/usr/bin/env python
# coding=utf-8
# e10.2CalThreeKingdoms.py
# Chinese text has no spaces between words, so split() cannot be used here;
# jieba does the segmentation instead.
# Assumes the novel's text is saved as threekingdoms.txt.
import jieba

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)  # segment the full text into a word list
counts = {}
for word in words:
    if len(word) == 1:   # skip single characters (mostly punctuation and particles)
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
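Raw segmentation counts mix narrative words in with names, and the novel refers to the same character by several names (for example 孔明 and 诸葛亮 both mean Zhuge Liang). A common refinement is an exclude set plus an alias table; the word list, exclude set, and alias table below are small hypothetical samples, not the real course data:

```python
from collections import Counter

# hypothetical sample; in the real script this list comes from jieba.lcut
words = ["曹操", "孔明", "却说", "诸葛亮", "玄德", "孔明", "将军"]

excludes = {"却说", "将军"}                   # frequent words that are not names
aliases = {"孔明": "诸葛亮", "玄德": "刘备"}  # alternate name -> canonical name

counts = Counter(aliases.get(w, w) for w in words if w not in excludes)
print(counts.most_common(3))
```

Growing the exclude set iteratively, by inspecting the top of each run, is how the noise words are usually found.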
Web scraping
The learning resource is the web-scraping course by Prof. Song Tian on the China University MOOC platform.
Below are a few simple examples; once you are comfortable writing them, they cover most basic scraping needs.
Scraping the Baidu homepage
import requests

r = requests.get("https://www.baidu.com")
r.encoding = 'utf-8'
fo = open("baidu.txt", "w+")
fo.write(r.text)
fo.close()
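The snippet above has no error handling: a timeout or a non-200 response would crash it or silently save garbage. The course wraps fetching in a small reusable function; a sketch of that pattern, where the function name and the 30-second timeout follow the course's convention and the error string is a placeholder:

```python
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)   # give up after 30 seconds
        r.raise_for_status()                # raise if the status code is not 200
        r.encoding = r.apparent_encoding    # decode using the encoding guessed from the body
        return r.text
    except Exception:
        return "request failed"

print(getHTMLText("https://www.baidu.com")[:100])
```

Every later example is this same get / check / re-encode sequence with different post-processing.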
Scraping a JD.com phone product page
import requests

url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()              # raises an exception if the status code is not 200
    r.encoding = r.apparent_encoding  # switch to the encoding guessed from the content
    print(r.text[:1000])              # only the first 1000 characters
except:
    print("False")
BeautifulSoup
Scrape with requests, then process the result with BeautifulSoup for better formatting.
import requests
from bs4 import BeautifulSoup

fo = open("jingdong.md", "w")
url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.encoding = r.apparent_encoding
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")
    fo.write(soup.prettify())
except:
    print("False")
fo.close()
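prettify() re-indents the whole parse tree, one tag per line, which is what gives the saved file its better layout. A minimal offline sketch on a hardcoded HTML snippet, so it runs without any network access:

```python
from bs4 import BeautifulSoup

demo = '<html><head><title>Demo</title></head><body><p class="intro">Hello</p></body></html>'
soup = BeautifulSoup(demo, "html.parser")

print(soup.title.string)   # the text inside <title>
print(soup.p["class"])     # attribute access; class is returned as a list
print(soup.prettify())     # the same document, nested and indented
```

The parse tree also supports navigation (soup.title, soup.p) and attribute lookup, so the same object used for pretty-printing can be queried for specific fields.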
Scraping the Baidu homepage with BeautifulSoup
import requests
from bs4 import BeautifulSoup

fo = open("baidu.md", "w")
try:
    r = requests.get("https://www.baidu.com")
    r.encoding = r.apparent_encoding
    soup = BeautifulSoup(r.text, "html.parser")
    fo.write(soup.prettify())
except:
    print("False")
fo.close()
Bonus
Open-source links to scraping and Python examples