by Praveen Dubey
How to use the browser console to scrape and save data in a file with JavaScript
A while back I had to crawl a site for links, and then use those page links to crawl data with Selenium or Puppeteer. The site's content was set up in a bit of an unusual way, so I couldn't start directly with Selenium and Node. Also, unfortunately, there was a huge amount of data on the site. I had to quickly come up with an approach to first crawl all the links, then pass those on for detailed crawling of each page.
That’s where I learned this cool stuff with the browser Console API. You can use this on any website without much setup, as it’s just JavaScript.
Let’s jump into the technical details.
High Level Overview
For crawling all the links on a page, I wrote a small piece of JS in the console. This JavaScript crawls all the links (it takes 1–2 hours, as it also handles pagination) and dumps a JSON file with all the crawled data. The thing to keep in mind is that the website needs to work like a single-page application: it must not reload the page as you navigate, if you want to crawl more than one page. If it does reload, your console code will be gone.
Medium does not refresh the page in some scenarios. For now, let's crawl a story and automatically save the scraped data in a file from the console after scraping.
But before we do that, here's a quick demo of the final execution.
1. Get the console object instance from the browser
// Console API to clear console before logging new data
console.API;

if (typeof console._commandLineAPI !== 'undefined') {
  console.API = console._commandLineAPI;            // Chrome
} else if (typeof console._inspectorCommandLineAPI !== 'undefined') {
  console.API = console._inspectorCommandLineAPI;   // Safari
} else if (typeof console.clear !== 'undefined') {
  console.API = console;
}
The code is simply trying to get the console object instance that matches the user's current browser. You can skip the detection and directly assign the instance for your browser.
For example, if you are using Chrome, the code below should be sufficient.
if (typeof console._commandLineAPI !== 'undefined') {
  console.API = console._commandLineAPI; // Chrome
}
2. Defining the Junior helper function
I'll assume that you have a Medium story open in your browser by now. Lines 6 to 12 define the DOM element attributes that can be used to extract the story title, clap count, user name, profile image URL, profile description, and read time of the story, respectively.
These are the basic things which I want to show for this story. You can add a few more elements like extracting links from the story, all images, or embed links.
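The original gist is not reproduced here, but a minimal sketch of such a helper could look like the following. Every CSS selector below is an assumption for illustration: Medium's DOM changes frequently, so inspect the live page and substitute the real selectors before use.

```javascript
// Sketch of the "junior" helper. All selectors are hypothetical --
// inspect the page in DevTools and replace them with the real ones.
function scrapeStory() {
  // Return trimmed text for a selector, or null if the element is missing.
  const text = (selector) => {
    const el = document.querySelector(selector);
    return el ? el.innerText.trim() : null;
  };

  return {
    title: text('h1'),                  // story title
    claps: text('.js-clapCount'),       // clap count (hypothetical class)
    author: text('.js-authorName'),     // user name (hypothetical class)
    avatarUrl: (document.querySelector('.js-authorAvatar img') || {}).src || null,
    bio: text('.js-authorBio'),         // profile description
    readTime: text('.js-readTime'),     // read time of the story
  };
}
```

Calling `scrapeStory()` in the console then returns one plain object per story, ready to be pushed into a collection.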
3. Defining our Senior helper function: the beast
As we are crawling the page for different elements, we will save them in a collection. This collection will be passed to one of the main functions.
We have defined a function named console.save. The task of this function is to dump a CSV/JSON file with the data passed to it.
It creates a Blob object with our data. A Blob object represents a file-like object of immutable, raw data. Blobs represent data that isn't necessarily in a JavaScript-native format.
The created blob is attached to a link tag (<a>) on which a click event is triggered.
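Putting those pieces together, a console.save along these lines works in modern browsers. This is a sketch rather than the article's exact gist: it uses a programmatic a.click() instead of the older initMouseEvent approach, and defaults the filename to console.json.

```javascript
// Sketch of console.save: serialize the data, wrap it in a Blob, and
// trigger a download through a temporary <a> element.
console.save = function (data, filename) {
  if (!data) {
    console.error('Console.save: no data');
    return;
  }
  filename = filename || 'console.json';

  // Objects and arrays are pretty-printed as JSON; strings pass through as-is.
  if (typeof data === 'object') {
    data = JSON.stringify(data, null, 4);
  }

  const blob = new Blob([data], { type: 'text/json' });
  const a = document.createElement('a');
  a.download = filename;
  a.href = window.URL.createObjectURL(blob);
  a.click();                          // fires the download in the browser
  window.URL.revokeObjectURL(a.href); // release the object URL
};
```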
Here is a quick demo of console.save with a small array passed as data.
Putting together all the pieces of the code, this is what we have:
- Console API Instance
- Helper function to extract elements
- Console Save function to create a file
Let’s execute our console.save() in the browser to save the data in a file. For this, you can go to a story on Medium and execute this code in the browser console.
I have shown a demo of extracting data from a single page, but the same code can be tweaked to crawl multiple stories from a publisher's home page. Take freeCodeCamp as an example: you can navigate from one story to another and come back (using the browser's back button) to the publisher's home page without the page being refreshed.
Below is the bare minimum code you need to extract multiple stories from a publisher’s home page.
Let’s see the code in action for getting the profile description from multiple stories.
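That crawler is not reproduced here either, but its first step can be sketched as below. Again, the selector is a placeholder, not Medium's real markup: find the attribute that actually marks story links on the page you are crawling.

```javascript
// Sketch: gather unique story links from a publisher's home page.
// 'a[data-post-id]' is a hypothetical selector -- replace it with
// whatever marks story links in the live DOM.
function collectStoryLinks() {
  const anchors = document.querySelectorAll('a[data-post-id]');
  const links = Array.from(anchors, (a) => a.href);
  return [...new Set(links)]; // de-duplicate while preserving order
}

// Then, for each link: navigate SPA-style (no page reload), run your
// extraction helper, push each result into an array, and finally pass
// that array to console.save.
```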
For any application of this type, once you have scraped the data, you can pass it to our console.save function and store it in a file.
The console save function can be quickly attached to your console code and can help you dump the data into a file. I am not saying you have to use the console for scraping data, but sometimes this is a much quicker approach, since we are all very familiar with working with the DOM using CSS selectors.
You can download the code from GitHub.
Thank you for reading this article! I hope it gave you a cool idea for scraping some data quickly without much setup. Hit the clap button if you enjoyed it! If you have any questions, send me an email (praveend806 [at] gmail [dot] com).
Resources to learn more about the Console:
- Using the Console | Tools for Web Developers | Google Developers: Learn how to navigate the Chrome DevTools JavaScript Console. (developers.google.com)
- Browser Console: The Browser Console is like the Web Console, but applied to the whole browser rather than a single content tab. (developer.mozilla.org)
- Blob: A Blob object represents a file-like object of immutable, raw data. Blobs represent data that isn't necessarily in a JavaScript-native format. (developer.mozilla.org)
Translated from: https://www.freecodecamp.org/news/how-to-use-the-browser-console-to-scrape-and-save-data-in-a-file-with-javascript-b40f4ded87ef/