Scraping music singles with a Python crawler: a worked example of scraping QQ Music with Python

I. Preface

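QQ Music has a large catalogue, and sometimes I want to download a song I like, but downloading from the web page means going through the annoying login every time. So I wrote a QQ Music crawler. In my view, the core of a for-loop crawler is finding the URL that holds the elements to be scraped.

II. Scraping QQ Music singles with Python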

Crawler steps

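The overall workflow is: define the target -> analyze the target (strategy: URL format (range), data format, page encoding) -> write the code -> run the crawler.

1. Define the target

First we need to be clear about the target: this time we scrape the singles of the QQ Music singer Andy Lau (劉德華; Baidu Baike).

2. Analyze the target

Song links: the screenshot on the left shows that the singles are listed with pagination, 30 per page and 30 pages in total. Clicking a page number or the ">" at the far right jumps to the next page, and the browser sends an asynchronous ajax request to the server. The request link carries the parameters begin and num, which stand for the index of the first song (the screenshot shows page 2, so the starting index is 30) and the number of songs returned per page (30). The server responds with the song information in JSON format (MusicJsonCallbacksinger_track({"code":0,"data":{"list":[{"Flisten_count1":......]})). If all you want is the song information, you can build the request link directly and parse the JSON that comes back.

I do not parse that data format directly here. Instead I use Python Selenium: after fetching and parsing one page of singles, the crawler clicks ">" to jump to the next page and keeps parsing until every single has been parsed and recorded. Finally it requests the link of each single to get the detailed single information.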

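The screenshot on the right shows the page source. All song information sits inside the div with class mod_songlist, under the unordered list ul with class songlist__list. Each child li shows one single, and the a tag under the element with class songlist__album contains the single's link, name, duration, and so on.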
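The direct-request alternative mentioned in step 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample payload and its "name" field are made up, the real endpoint URL is not reproduced, and only the begin/num parameters and the callback-wrapped JSON shape come from the description above.

```python
import json
import re

PAGE_SIZE = 30  # the article reports 30 songs per page


def page_params(page_index, page_size=PAGE_SIZE):
    """Return the begin/num query parameters for a zero-based page index."""
    return {"begin": page_index * page_size, "num": page_size}


def strip_jsonp(text):
    """Remove a JSONP wrapper such as callback({...}) and parse the JSON inside."""
    match = re.fullmatch(r"\s*[\w.$]+\s*\((.*)\)\s*;?\s*", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Made-up payload with the same outer shape as the response quoted above;
# the "name" field is hypothetical.
sample = 'MusicJsonCallbacksinger_track({"code":0,"data":{"list":[{"name":"demo song"}]}})'
payload = strip_jsonp(sample)

print(page_params(1))                # parameters for the second page
print(len(payload["data"]["list"]))  # number of songs in the sample payload
```

The parameters from page_params() would be appended to whatever request link the browser's network panel shows for the ajax call.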

3. Write the code

1) Download the page content. This uses Python's urllib standard library, wrapped in a download method of my own:

import urllib.error
import urllib.request


def download(url, user_agent='wswp', num_retries=2):
    if url is None:
        return None
    print('Downloading:', url)
    # A browser-like User-Agent is sent; the user_agent argument
    # ('wswp', Web Scraping with Python) is kept in the signature but not used
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read().decode('utf-8')
    except urllib.error.URLError as e:
        print('Downloading Error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Retry on 5xx HTTP errors; by default a failed request is retried twice
                return download(url, user_agent, num_retries - 1)
    return html

2) Parse the page content. This uses the third-party package BeautifulSoup; see the BeautifulSoup API documentation for details.

from bs4 import BeautifulSoup
from selenium.common.exceptions import TimeoutException

# urls (a UrlManager instance) and next_page() are defined elsewhere in the crawler


def music_scrapter(html, page_num=0):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        mod_songlist_div = soup.find_all('div', class_='mod_songlist')
        songlist_ul = mod_songlist_div[1].find('ul', class_='songlist__list')
        # Start parsing the song information in each li
        lis = songlist_ul.find_all('li')
        for li in lis:
            a = li.find('div', class_='songlist__album').find('a')
            music_url = a['href']  # link for the single
            urls.add_new_url(music_url)  # save the link
            # print('music_url:{0} '.format(music_url))
        print('total music link num:%s' % len(urls.new_urls))
        next_page(page_num + 1)
    except TimeoutException as err:
        print('Error while parsing the page:', err.args)
        return next_page(page_num + 1)
    return None

def get_music():
    try:
        while urls.has_new_url():
            # print('urls count:%s' % len(urls.new_urls))
            # Go to the song link and fetch the song details
            new_music_url = urls.get_new_url()
            print('url leave count:%s' % str(len(urls.new_urls) - 1))
            html_data_info = download(new_music_url)
            # If the download failed, go straight to the next iteration so the program does not abort
            if html_data_info is None:
                continue
            soup_data_info = BeautifulSoup(html_data_info, 'html.parser')
            if soup_data_info.find('div', class_='none_txt') is not None:
                print(new_music_url, ' Sorry, this album cannot be viewed for copyright reasons!')
                continue
            mod_songlist_div = soup_data_info.find('div', class_='mod_songlist')
            songlist_ul = mod_songlist_div.find('ul', class_='songlist__list')
            lis = songlist_ul.find_all('li')
            del lis[0]  # drop the first li
            # print('len(lis):%s' % len(lis))
            for li in lis:
                a_songname_txt = li.find('div', class_='songlist__songname').find('span', class_='songlist__songname_txt').find('a')
                song_url = a_songname_txt['href']
                if 'https' not in song_url:  # add the scheme if the link lacks one
                    song_url = 'https:' + song_url
                song_name = a_songname_txt['title']
                singer_name = li.find('div', class_='songlist__artist').find('a').get_text()
                song_time = li.find('div', class_='songlist__time').get_text()
                music_info = {}
                music_info['song_name'] = song_name
                music_info['song_url'] = song_url
                music_info['singer_name'] = singer_name
                music_info['song_time'] = song_time
                collect_data(music_info)
    except Exception as err:
        # If downloading or parsing fails, report the error instead of crashing
        print('Downloading or parse music information error continue:', err.args)
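The scheme check on the song link can also be handled with urljoin from the standard library, which resolves protocol-relative and relative links against the page they were found on. This is a small sketch rather than the crawler's own code; BASE_URL and the sample paths are hypothetical placeholders.

```python
from urllib.parse import urljoin

BASE_URL = "https://example.com/"  # hypothetical page on which the href was found


def absolute_url(href, base=BASE_URL):
    """Resolve relative or protocol-relative hrefs against the page URL."""
    return urljoin(base, href)


print(absolute_url("//example.com/song/1.html"))        # protocol-relative link
print(absolute_url("https://example.com/song/2.html"))  # already absolute
print(absolute_url("/song/3.html"))                     # path-only link
```

Unlike a substring test for 'https', this also covers links that start with http:// or with a bare path.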

4. Run the crawler

The crawler is up and running. It goes through the pages one by one, collecting the album links into a set, and finally the get_music() method fetches each single's name, link, singer name, and duration and saves them to an Excel file.

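III. Summary of scraping QQ Music singles with Python

1. The singles are paginated. Switching to the next page fetches JSON data from the server through an asynchronous ajax request and renders it into the page, while the link in the browser's address bar stays the same, so the pages cannot be requested by varying the page URL. At first I considered simulating the ajax requests with Python's urllib library, but in the end I settled on Selenium. Selenium simulates real browser actions well and makes locating page elements easy: it clicks "next page" to keep switching between the pages of singles, and BeautifulSoup then parses the page source to extract the single information.

2. The URL manager uses a set to store the single links. Why a set? Because several singles may come from the same album (and so share the same album URL), which lets us cut down the number of requests.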

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # a set filters out duplicate elements
        self.old_urls = set()  # a set filters out duplicate elements

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
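To see why the set cuts down on requests, here is a stripped-down sketch of the same idea using two plain sets. The album URLs are hypothetical, and the helper add_url is only a stand-in for add_new_url above.

```python
pending = set()  # links waiting to be fetched
visited = set()  # links that have already been handed out


def add_url(url):
    """Queue a link unless it is empty or has been seen before."""
    if url and url not in pending and url not in visited:
        pending.add(url)


# Three singles, two of which point to the same (hypothetical) album page
for link in [
    "https://example.com/album/1",
    "https://example.com/album/1",
    "https://example.com/album/2",
]:
    add_url(link)

fetched = []
while pending:
    url = pending.pop()
    visited.add(url)
    fetched.append(url)
    add_url(url)  # re-adding a visited link is ignored

print(len(fetched))  # number of requests that would actually be made
```

Three singles lead to only two album requests here, and a link that turns up again after being fetched is not queued a second time.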

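3. The third-party Python package openpyxl makes reading and writing Excel files very convenient, and an Excel file is a good place to keep the single information.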

# Method of the class that owns workBook, workSheet and excelName (the rest of the class is not shown)
def write_to_excel(self, content):
    try:
        for row in content:
            self.workSheet.append([row['song_name'], row['song_url'], row['singer_name'], row['song_time']])
        self.workBook.save(self.excelName)  # save the single information to the Excel file
    except Exception as arr:
        print('write to excel error', arr.args)

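IV. Afterword

To finish, a small celebration is in order: the single information from QQ Music was scraped successfully. Selenium deserves much of the credit. This time I only used some of its simple features, and I plan to study Selenium in more depth, not only for crawling but also for UI automation. Points that still need optimizing:

1. There are many links to download, and fetching them one at a time is slow. I plan to download concurrently with multiple threads.

2. Downloading too fast risks having the IP banned by the server, so requests to the same domain need a waiting mechanism, with an interval between consecutive requests.

3. Parsing the page is an important step. Regular expressions, BeautifulSoup, and lxml can all do it. I currently use BeautifulSoup, which is less efficient than lxml, so I will try lxml later.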
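The first two points could be combined along these lines. This is a hedged sketch under stated assumptions, not the author's implementation: the Throttle class, fetch_all, fake_fetch, and the example URLs are all hypothetical, and in the real crawler the download() function would be passed in place of fake_fetch.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


class Throttle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = {}  # domain -> earliest time of the next request
        self.lock = threading.Lock()

    def wait(self, url):
        """Reserve the next slot for this URL's domain and sleep until it starts."""
        domain = urlparse(url).netloc
        with self.lock:
            now = self.clock()
            start = max(now, self.next_allowed.get(domain, now))
            self.next_allowed[domain] = start + self.delay
        pause = start - now
        if pause > 0:
            self.sleep(pause)
        return pause


def fetch_all(urls, fetch, throttle, workers=4):
    """Fetch urls in a thread pool; fetch is any callable taking a URL."""
    def task(url):
        throttle.wait(url)
        return url, fetch(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, urls))


def fake_fetch(url):
    """Stand-in for download(); returns a dummy page instead of hitting the network."""
    return "<html>%s</html>" % url


demo_urls = ["https://example.com/a", "https://example.com/b"]
results = fetch_all(demo_urls, fake_fetch, Throttle(delay=0.05))
print(sorted(results))
```

Each worker reserves a time slot under the lock and sleeps outside it, so requests to one domain are spaced out while requests to other domains are not held up.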
