A Guide to Migrating and Automating Chrome Web Scrapers Within Azure

Web scraping is becoming an increasingly desirable skill for many data-science jobs as more companies migrate their processes to the cloud.

I first became interested in data science after scraping my university’s course evaluation catalogue, and this skill single-handedly allowed me to land two internships during my undergraduate program.

Although the approach is disputed, many people use a Chrome WebDriver and the Selenium module to scrape data from websites. While this tool can be very helpful locally, it is difficult to turn these scripts into recurring tasks that can be deployed on large-scale infrastructure. In this article, I am going to guide you through porting your Selenium web scraper to Azure using a virtual machine, and show you how to set up the scraping as a daily recurring task.

Step 1: Setting up the Azure Virtual Machine (VM)

After you have logged into Azure, you’re going to want to make your way over to the Virtual Machines directory. While I won’t walk through every detail behind creating the VM, I will note some specifications that are important to set in order to enable appropriate access between services.

Since I am most familiar with Windows, I used a Windows 10 Pro image for my virtual machine; however, I would imagine that this process could be repeated for other images as well.

For the “Select inbound ports” setting, make sure to include the HTTPS (443) option to allow the automation task access. If you miss this step, we will cover it in more detail in Step 4 of this guide.

Step 2: Install Python, Chrome, Chromedriver & Required Dependencies

Next, we are going to want to load up the VM. If you are using a Windows image, you can use RDP (Remote Desktop Protocol) to get access, or you can use software like PuTTY to SSH into the machine as well.

We are going to set up our working environment here so that Python and Chrome can get up and running. Make sure to install your required version of Python as well as the latest version of Chrome and Chromedriver. Make a note of where these files are saved, as you will need them later on.

If you want less maintenance down the road, make sure to rename the Chrome Update folder so that Chrome doesn't update automatically, which would require you to download a newer version of Chromedriver. Instructions for doing so can be found here.
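
The reason an automatic Chrome update creates work is that Chromedriver has to match the installed Chrome's major version. A minimal sketch of that comparison is below; the helper names `major_version` and `versions_match` and the sample version strings are illustrative assumptions, not part of the original setup.

```python
import re


def major_version(version_text):
    """Return the leading number of the first dotted version in the text, or None."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_text)
    return int(match.group(1)) if match else None


def versions_match(chrome_version, chromedriver_version):
    """True when both strings contain a version with the same major number."""
    chrome_major = major_version(chrome_version)
    driver_major = major_version(chromedriver_version)
    return chrome_major is not None and chrome_major == driver_major


# Hypothetical example strings: a Chrome version and a Chromedriver version banner.
print(versions_match("85.0.4183.83", "ChromeDriver 85.0.4183.87"))
```

Running a check like this before the scheduled scrape makes a version mismatch visible as a clear message instead of a Selenium startup error.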

Step 3: Python Script

For the sake of simplicity, we are going to use just a basic Python script that loads Stack Overflow. Obviously, this could easily be done with the requests library; however, as many scrapers require JavaScript interactivity with the web page, I’ll assume that your script is longer and more complex.

Let's call the following script scrape.py:

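from selenium import webdriver

DRIVER_PATH = "/path/to/chromedriver.exe"

def scrape():
    # Selenium 4 deprecates executable_path in favor of a Service object.
    driver = webdriver.Chrome(executable_path=DRIVER_PATH)
    try:
        driver.get('https://stackoverflow.com/')
    finally:
        # Close the browser so repeated runs do not leave Chrome processes behind.
        driver.quit()

if __name__ == "__main__":
    scrape()
    print("Script Executed Correctly.")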

You’re going to want to first make sure that the correct output is printed so you know your script works locally within the VM.
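
When the script is later triggered remotely rather than from a logged-in desktop session, Chrome may need to run without a visible window. The sketch below shows one way to do that, assuming Selenium 4 (the `Service` and `options` arguments); the function names `build_chrome_arguments` and `make_driver` are hypothetical and the window size is an arbitrary choice.

```python
def build_chrome_arguments(headless=True, window_size=(1920, 1080)):
    """Return Chrome command-line switches for an unattended run."""
    arguments = ["--disable-gpu", "--window-size={},{}".format(*window_size)]
    if headless:
        arguments.insert(0, "--headless")
    return arguments


def make_driver(driver_path, headless=True):
    """Create a Chrome driver configured with the switches above."""
    # Imported here so build_chrome_arguments works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments(headless=headless):
        options.add_argument(argument)
    # Selenium 4 style; Selenium 3 took executable_path=driver_path instead.
    return webdriver.Chrome(service=Service(driver_path), options=options)
```

Inside scrape(), the driver could then be created with make_driver(DRIVER_PATH) instead of calling webdriver.Chrome directly.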

It is also important to note that storage within these virtual machines is expensive, so try to store your data outside of the VM, in a database or equivalent, if at all possible. For the majority of my uses, Python’s pyodbc module works incredibly well for getting the data I want stored outside of the VM. However, this will likely change on a case-by-case basis.
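
As a rough illustration of writing results to a database instead of the VM's disk, the sketch below inserts scraped rows through a standard Python database connection. The table name `scraped_pages`, its columns, and the `save_rows` helper are assumptions made for the example. An in-memory SQLite database stands in for the real target so the snippet runs anywhere; a connection returned by pyodbc.connect(...) exposes the same cursor, executemany, and commit calls and the same `?` placeholders.

```python
import sqlite3


def save_rows(connection, rows):
    """Insert (url, title) tuples and commit; returns the number of rows passed in."""
    cursor = connection.cursor()
    cursor.executemany(
        "INSERT INTO scraped_pages (url, title) VALUES (?, ?)", rows
    )
    connection.commit()
    return len(rows)


def demo():
    """Run save_rows against a throwaway in-memory database and read the rows back."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE scraped_pages (url TEXT, title TEXT)")
    save_rows(connection, [("https://stackoverflow.com/", "Stack Overflow")])
    return connection.execute("SELECT url, title FROM scraped_pages").fetchall()


print(demo())
```

Passing the connection in as a parameter keeps the scraping code independent of which database driver is used.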

Step 3: Powershell Script

Next, you’re going to want to set up a PowerShell script that runs your Python code. This script is how Azure will communicate with any internal scripts you have within your VM. Again, for simplicity, my PowerShell script here uses only basic functionality, just enough to outline the structure.

Let's call this script ps-scrape.ps1:

Write-Output "Script Started."
# The call operator (&) runs the Python interpreter with the script as its argument.
& "\path\to\python.exe" "\path\to\scrape.py"
Write-Output "Script Ending."

Now, give this a test run by running it locally on your VM. It should print out the following results:

Script Started.
Script Executed Correctly.
Script Ending.
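
The middle line of that output is the only sign that the scrape itself succeeded. A small sketch of how scrape.py could make failures explicit is below: it prints the success message only when the scrape finishes and returns a non-zero exit code otherwise. The `run` helper and the failure message text are assumptions, not part of the original script.

```python
def run(scrape_function):
    """Call the scraper; return 0 on success and 1 if it raised an exception."""
    try:
        scrape_function()
    except Exception as error:
        print("Script Failed: {}".format(error))
        return 1
    print("Script Executed Correctly.")
    return 0


# In scrape.py, the __main__ block could then become:
#
#     import sys
#     if __name__ == "__main__":
#         sys.exit(run(scrape))
```

With this in place, the PowerShell wrapper could inspect $LASTEXITCODE after the Python call to tell a failed run from a successful one.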

Step 4: Azure Powershell Runbook

Now that your Powershell script runs locally on your VM, it is time to do the same thing from outside your VM.

Within Azure, open up the Automation Account Resource. Under Process Automation, click on Runbooks and Create a Runbook. The Runbook type should be PowerShell (not PowerShell workflow or Graphical PowerShell Workflow).

Keep in mind that you will likely need to import the required modules into your Automation Account for the following to run correctly. To do this, go to the Automation Account you created; under Shared Resources, you should see Modules. Make sure to add the AzureRM.Compute module and any other modules you may need.

Let's call the following Runbook RunbookScrape:

$connectionName = "AzureRunAsConnection"
try
{
    # Get the connection "AzureRunAsConnection"
    $servicePrincipalConnection = Get-AutomationConnection -Name $connectionName

    "Logging in to Azure..."
    Add-AzureRmAccount `
        -ServicePrincipal `
        -TenantId $servicePrincipalConnection.TenantId `
        -ApplicationId $servicePrincipalConnection.ApplicationId `
        -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint
}
catch {
    if (!$servicePrincipalConnection)
    {
        $ErrorMessage = "Connection $connectionName not found."
        throw $ErrorMessage
    } else {
        Write-Error -Message $_.Exception
        throw $_.Exception
    }
}

# Replace these placeholder values with your own.
$rgname = "YourResourceGroupName"
$vmname = "YourVirtualMachineName"
$ScriptToRun = "vm\path\to\script\ps-scrape.ps1"

# Write the script path to a temporary file, run that file on the VM, then clean up.
Out-File -InputObject $ScriptToRun -FilePath ScriptToRun.ps1
$run = Invoke-AzureRmVMRunCommand -ResourceGroupName $rgname -Name $vmname -CommandId 'RunPowerShellScript' -ScriptPath ScriptToRun.ps1
Write-Output $run.Value[0]
Remove-Item -Path ScriptToRun.ps1

The placeholder values (YourResourceGroupName, YourVirtualMachineName, and the script path in $ScriptToRun) indicate where you will need to change the code to work for your system.
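
For readers who would rather trigger the same Run Command from Python than from a Runbook, a rough equivalent is sketched below. This is an alternative, not the method this guide uses: it assumes the azure-identity and azure-mgmt-compute packages are installed and that DefaultAzureCredential can authenticate in your environment. The function names and the subscription ID parameter are hypothetical.

```python
def build_run_command(script_path):
    """Build a Run Command payload whose script is the single line to execute on the VM."""
    return {"command_id": "RunPowerShellScript", "script": [script_path]}


def run_on_vm(subscription_id, resource_group, vm_name, script_path):
    """Execute the PowerShell script on the VM and return the first status message."""
    # Imported lazily so build_run_command works without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)
    poller = client.virtual_machines.begin_run_command(
        resource_group, vm_name, build_run_command(script_path)
    )
    result = poller.result()  # Blocks until the command has finished on the VM.
    return result.value[0].message
```

As in the Runbook, the payload contains only the path to ps-scrape.ps1, so the VM runs that script as a one-line PowerShell command.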

If the script ran correctly but you don’t see any output, don't worry. It just means you need to update the VM network settings to allow outbound traffic through port 443. To do this, go to the Virtual Machine and, under Settings, click the Networking button. There you should see several tabs under the Network Interface. Click on Outbound port rules and set up a new rule that looks like this.

[Image: outbound port rule settings]

Try running the Runbook again and you should see the same output as you saw from within the VM!
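
If the output still does not appear, one quick check from inside the VM is whether an outbound TCP connection on port 443 succeeds at all. The sketch below uses only the Python standard library; the `can_connect` helper is hypothetical, and the host in the usage comment is just the site used earlier in this guide.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example, run on the VM:
#     print(can_connect("stackoverflow.com", 443))
```

A False result suggests the outbound rule, or another firewall on the path, is still blocking HTTPS traffic.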

Step 5: Runbook Automation

Now comes the task of automating your Runbook. Within Azure, open up the Logic App resource. Under Development Tools, you should see the Logic app designer. All that is required is that you link the blocks together to make Azure start up the VM, run the Runbook, and then shut down the VM. You can see what this looks like in the following image.

[Image: Logic app designer flow that starts the VM, runs the Runbook, and shuts down the VM]
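
The order of those three blocks can be sketched in a few lines of Python. This only illustrates the sequence and is not how the Logic App is implemented; the three callables are placeholders for the start, run, and shut-down actions. The try/finally is a design choice added here so that the VM is stopped, and stops costing money, even when the Runbook step fails.

```python
def run_scheduled_job(start_vm, run_runbook, stop_vm):
    """Start the VM, run the Runbook, and always shut the VM down afterwards."""
    start_vm()
    try:
        return run_runbook()
    finally:
        # Runs whether the Runbook succeeded or raised an error.
        stop_vm()


# Placeholder actions that only record the order in which they were called.
steps = []
run_scheduled_job(
    lambda: steps.append("start VM"),
    lambda: steps.append("run Runbook"),
    lambda: steps.append("stop VM"),
)
print(steps)
```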

Boom! You’re done. Your Python Selenium Webscraper will now run within the Azure Virtual Machine on a scheduled recurring basis.

Translated from: https://medium.com/swlh/guide-to-migrating-automating-chrome-web-scrapers-within-azure-909a4203476a
