Learn Web Scraping in 15 Minutes

What is Web Scraping?

Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website. The collected information is then exported into a format that is more useful for the user, such as a spreadsheet or an API. Although web scraping can be done manually, automated tools are usually preferred for scraping web data because they are less costly and work faster.

Is Web Scraping Legal?

The simplest way is to check the robots.txt file of the website. You can find this file by appending “/robots.txt” to the domain of the site you want to scrape. If all bots (indicated by ‘User-agent: *’) are disallowed in the robots.txt file, then you’re not allowed to scrape. For this article, I am scraping the Flipkart website, so the URL to check is www.flipkart.com/robots.txt.

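If you want to automate that check, Python’s built-in urllib.robotparser module can do it for you; a minimal sketch using the same Flipkart URL:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.flipkart.com/robots.txt")
rp.read()  # Download and parse the robots.txt file

# can_fetch returns True if the given user agent may crawl that URL
print(rp.can_fetch("*", "https://www.flipkart.com/search?q=laptops"))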

Libraries used for Web Scraping

BeautifulSoup: BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

Pandas: Pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language.

Why BeautifulSoup?

It is an incredible tool for pulling information out of a webpage. You can use it to extract tables, lists, and paragraphs, and you can also apply filters to extract specific information from web pages. For more details, you can refer to the BeautifulSoup documentation.

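As a quick taste of those filters, here is a minimal sketch on a made-up HTML snippet (the tags and class names below are invented purely for illustration):

from bs4 import BeautifulSoup

html = '<div class="name">HP 14s</div><div class="price">52,990</div>'
soup = BeautifulSoup(html, 'html.parser')

# Filter by tag name, or by tag name plus attribute value
print(soup.find_all('div'))                   # both divs
print(soup.find('div', class_='price').text)  # 52,990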

Scraping Flipkart Website

from bs4 import BeautifulSoup 
import requests
import csv
import pandas as pd

First, we import BeautifulSoup and the requests library; these are the essential libraries for web scraping.

requests: requests is one of the packages that made Python pleasant to work with for HTTP. It provides a far simpler, friendlier API than Python’s built-in urllib modules.

req = requests.get("https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=1")  # URL of the website which you want to scrape
content = req.content # Get the content

To get the contents of the specified URL, we submit a GET request using the requests library. This is the URL of a Flipkart search results page for laptops.

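Before parsing, it is worth confirming that the request actually succeeded; a small sketch:

# 200 means the request succeeded; raise_for_status() raises on any 4xx/5xx response
print(req.status_code)
req.raise_for_status()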

This Flipkart page lists different laptops and contains the details of 24 of them. From it, we try to extract the different features of each laptop: the description (model name along with the specification), Processor (Intel/AMD, i3/i5/i7/Ryzen 3/Ryzen 5/Ryzen 7), RAM (4/8/16 GB), Operating System (Windows/Mac), Disk Drive Storage (SSD/HDD, 256 GB/512 GB/1 TB), Display (13.3/14/15.6 inches), Warranty (Onsite/Limited Hardware/International), Rating (4.1–5), and Price (rupees).

soup = BeautifulSoup(content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
<head>
<link href="https://rukminim1.flixcart.com" rel="dns-prefetch"/>
<link href="https://img1a.flixcart.com" rel="dns-prefetch"/>
<link href="//img1a.flixcart.com/www/linchpin/fk-cp-zion/css/app.chunk.21be2e.css" rel="stylesheet"/>
<link as="image" href="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/fk-logo_9fddff.png" rel="preload"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="102988293558" property="fb:page_id"/>
<meta content="658873552,624500995,100000233612389" property="fb:admins"/>
<meta content="noodp" name="robots"/>
<link href="https://img1a.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico" rel="shortcut icon">
....
....
</script>
<script async="" defer="" id="omni_script" nonce="7596241618870897262" src="//img1a.flixcart.com/www/linchpin/batman-returns/omni/omni16.js">
</script>
</body>
</html>

Here we need to pass the content variable and the parser to use, which is the HTML parser. soup is now a BeautifulSoup object representing our parsed HTML, and soup.prettify() displays the entire code of the webpage in indented form.

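html.parser is Python’s built-in parser, so it needs no extra installation. As a side note, if the lxml package is installed, it can be used as a faster drop-in replacement (a minor variation, not what this article uses):

soup = BeautifulSoup(content, 'lxml')  # requires: pip install lxml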

Extracting the Descriptions

When you click on the “Inspect” tab, a browser inspector box opens. We observe that the class name of the descriptions is ‘_3wU53n’, so we use the find_all method to extract the descriptions of the laptops.

desc = soup.find_all('div', class_='_3wU53n')
desc

[<div class="_3wU53n">HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home) 14s-cs3010TU Laptop</div>,
<div class="_3wU53n">HP 14q Core i3 8th Gen - (8 GB/256 GB SSD/Windows 10 Home) 14q-cs0029TU Thin and Light Laptop</div>,
<div class="_3wU53n">Asus VivoBook 15 Ryzen 3 Dual Core - (4 GB/1 TB HDD/Windows 10 Home) M509DA-EJ741T Laptop</div>,
<div class="_3wU53n">Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB SSD/Windows 10 Home/4 GB Graphics/NVIDIA Geforce GTX 1650...</div>,
....
....
<div class="_3wU53n">MSI GP65 Leopard Core i7 10th Gen - (32 GB/1 TB HDD/512 GB SSD/Windows 10 Home/8 GB Graphics/NVIDIA Ge...</div>,
<div class="_3wU53n">Asus Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home/2 GB Graphics) X509JB-EJ591T Laptop</div>]

We extract the descriptions from the website using the find_all method, grabbing every div tag that has the class name ‘_3wU53n’. This returns all the div tags with that class name. As class is a reserved keyword in Python, we have to use the class_ keyword argument and pass the class name there.

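The same query can also be written as a CSS selector with the select method, which some readers find more compact (an equivalent alternative, not a change to the approach used here):

# 'div._3wU53n' means: div tags whose class list contains _3wU53n
desc = soup.select('div._3wU53n')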

descriptions = []  # Create a list to store the descriptions
for i in range(len(desc)):
    descriptions.append(desc[i].text)

len(descriptions)

24  # Number of laptops

descriptions

['HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home) 14s-cs3010TU Laptop',
'HP 14q Core i3 8th Gen - (8 GB/256 GB SSD/Windows 10 Home) 14q-cs0029TU Thin and Light Laptop',
'Asus VivoBook 15 Ryzen 3 Dual Core - (4 GB/1 TB HDD/Windows 10 Home) M509DA-EJ741T Laptop',
'Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB SSD/Windows 10 Home/4 GB Graphics/NVIDIA Geforce GTX 1650...',
....
....
'MSI GP65 Leopard Core i7 10th Gen - (32 GB/1 TB HDD/512 GB SSD/Windows 10 Home/8 GB Graphics/NVIDIA Ge...',
'Asus Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home/2 GB Graphics) X509JB-EJ591T Laptop']

Create an empty list to store the descriptions of all the laptops. We can even access child tags with dot access. Now iterate through all the tags and use the .text attribute to extract only the text content from each tag, appending it to the descriptions list on every iteration. After iterating through all the tags, the descriptions list holds the text content for every laptop (the description is the laptop model name along with its specifications).

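A tiny sketch of that dot access (the printed value depends on the live page):

# soup.title reaches the first <title> tag; .text gives its text content
print(soup.title)
print(soup.title.text)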

Similarly, we apply the same approach to extract all the other features.

Extracting the specifications

We observe that the various specifications are under the same div and that the class names are identical for all five features (Processor, RAM, Disk Drive, Display, Warranty).

All the features are inside ‘li’ tags, and the class name, ‘tVe95H’, is the same for all of them, so we need a small trick to separate the distinct features.

# All five specs are inside li tags sharing the class name 'tVe95H'
commonclass = soup.find_all('li', class_='tVe95H')

# Create empty lists for the features
processors = []
ram = []
os = []
storage = []
inches = []
warranty = []

for i in range(0, len(commonclass)):
    p = commonclass[i].text  # Extract the text from the tag
    if "Core" in p:
        processors.append(p)
    elif "RAM" in p:
        # If "RAM" is present in the text, append it to the ram list;
        # the remaining features are sorted the same way below
        ram.append(p)
    elif "HDD" in p or "SSD" in p:
        storage.append(p)
    elif "Operating" in p:
        os.append(p)
    elif "Display" in p:
        inches.append(p)
    elif "Warranty" in p:
        warranty.append(p)

The .text attribute extracts the text from each tag, giving us the values for Processor, RAM, Disk Drive, Display, and Warranty; the keyword checks above then sort each value into the correct list.

print(len(processors))
print(len(warranty))
print(len(os))
print(len(ram))
print(len(inches))

24
24
24
24
24

Extracting the price

price = soup.find_all('div', class_='_1vC4OE _2rQ-NK')
# Extracting the price of each laptop from the website
prices = []
for i in range(len(price)):
    prices.append(price[i].text)

len(prices)

24

prices

['₹52,990',
 '₹34,990',
 '₹29,990',
 '₹56,990',
 '₹54,990',
....
....
 '₹78,990',
 '₹1,59,990',
 '₹52,990']

In the same manner, we extract the price of each laptop and add all the prices to the prices list.

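Since these prices are strings like ‘₹52,990’, here is a small sketch of turning them into numbers for later analysis (assuming every entry follows that format):

# Strip the currency symbol and thousands separators, then convert to int
numeric_prices = [int(p.replace('₹', '').replace(',', '')) for p in prices]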

rating = soup.find_all('div', class_='hGSR34')
# Extracting the ratings of each laptop from the website
ratings = []
for i in range(len(rating)):
    ratings.append(rating[i].text)

len(ratings)

37

ratings

['4.4',
 '4.5',
 '4.4',
 '4.4',
 '4.2',
 '4.5',
 '4.4',
 '4.5',
 '4.4',
 '4.2',
....
....]

Here the length of the ratings list comes out to 37. But what’s the reason behind it?

We observe that the class name for the recommended laptops is the same as for the featured laptops, so the ratings of the recommended laptops are extracted as well. This inflates the number of ratings: it should be 24, but it is 37!

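One common workaround is to search for the rating inside the container of each search result rather than across the whole page, so recommended items outside the result grid are never touched. The sketch below assumes each result sits in its own wrapper div; the wrapper class name '_1UoZlX' is hypothetical and would have to be read off the inspector just like the other class names:

ratings = []
for card in soup.find_all('div', class_='_1UoZlX'):  # hypothetical per-result wrapper
    r = card.find('div', class_='hGSR34')
    ratings.append(r.text if r else None)  # None when a result has no rating yet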

Last but not least, merge all the features into a single data frame and store the data in the required format!

d = {'Description':descriptions,'Processor':processors,'RAM':ram,'Operating System':os,'Storage':storage,'Display':inches,'Warranty':warranty,'Price':prices}
dataset = pd.DataFrame(data=d)

The final dataset

Saving the dataset to a CSV file

dataset.to_csv('laptops.csv')

Now we get the whole dataset into a CSV file.

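Note that to_csv also writes the DataFrame index as an extra unnamed column by default, which is why the re-read file below shows 9 columns for 8 features; passing index=False would drop it:

dataset.to_csv('laptops.csv', index=False)  # omit the index column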

To verify it, we read the saved CSV file back in a Jupyter Notebook.

df = pd.read_csv('laptops.csv')
df.shape

(24, 9)

As this is a dynamic website, the content keeps on changing!

You can always refer to my GitHub Repository for the entire code.

Connect with me on LinkedIn here

“For every $20 you spend on web analytics tools, you should spend $80 on the brains to make sense of the data.” — Jeff Sauer

I hope you found the article insightful. I would love to hear feedback so I can improve it and come back with better content.

Thank you so much for reading!

Translated from: https://towardsdatascience.com/learn-web-scraping-in-15-minutes-27e5ebb1c28e
