The New Beginnings of AI-Powered Web Data Gathering Solutions

Data gathering consists of many time-consuming and complex activities: proxy management, data parsing, infrastructure management, overcoming anti-fingerprinting measures, rendering JavaScript-heavy websites at scale, and much more. Is there a way to automate these processes? Absolutely.

Finding a more manageable solution for large-scale data gathering has been on the minds of many in the web scraping community. Specialists saw a lot of potential in applying AI (Artificial Intelligence) and ML (Machine Learning) to web scraping. Only recently, however, have concrete steps been taken toward automating data gathering with AI. This is no wonder: AI and ML algorithms became robust at large scale only in recent years, alongside advances in computing.

By applying AI-powered solutions to data gathering, we can automate tedious manual work and ensure much better quality of the collected data. To better grasp the struggles of web scraping, let's look into the process of data gathering, its biggest challenges, and possible future solutions that might ease, and potentially solve, those challenges.

Data collection: step by step

To better understand the web scraping process, it's best to visualize it in a value chain:

As you can see, web scraping consists of four distinct actions:

  1. Crawling path building and URL collection.
  2. Scraper development and its support.
  3. Proxy acquisition and management.
  4. Data fetching and parsing.

Anything that goes beyond those steps is considered data engineering or part of data analysis.
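The four steps above can be sketched as a minimal, deliberately stubbed pipeline. This is only an illustrative sketch: the URLs are placeholders and `fetch` returns canned HTML instead of making a real HTTP request.

```python
import re
from urllib.parse import urljoin

# A toy crawling path: seed URLs we plan to scrape (hypothetical examples).
CRAWLING_PATH = ["https://example.com/products?page=1"]

def fetch(url, proxy=None):
    """Steps 3-4: fetch a page, optionally through a proxy (stubbed here)."""
    # A real scraper would issue an HTTP request; we return canned HTML.
    return "<html><a href='/item/1'>Item 1</a></html>"

def parse_links(html, base_url):
    """Step 1: extract candidate URLs to extend the crawling path."""
    return [urljoin(base_url, href) for href in re.findall(r"href='([^']+)'", html)]

html = fetch(CRAWLING_PATH[0])
print(parse_links(html, "https://example.com"))  # → ['https://example.com/item/1']
```

In a real project each stub grows into its own subsystem, which is exactly where the maintenance burden discussed below comes from.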

By pinpointing which actions belong to the web scraping category, it becomes easier to identify the most common data gathering challenges. It also allows us to see which parts can be automated and improved with the help of AI- and ML-powered solutions.

Large-scale scraping challenges

Traditional data gathering from the web requires a lot of governance and quality assurance. Of course, the difficulties grow together with the scale of the scraping project. Let's dig a little deeper into these challenges by going through our value chain's actions and analyzing potential issues.

Building a crawling path and collecting URLs

Building a crawling path is the first and essential part of data gathering. Put simply, a crawling path is a library of URLs from which data will be extracted. The biggest challenge here is not collecting the website URLs you want to scrape, but obtaining all the necessary URLs of the initial targets. That could mean dozens, if not hundreds, of URLs that will need to be scraped, parsed, and identified as important for your case.
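One common way to bootstrap a crawling path is to read a target's sitemap. A minimal sketch using only the standard library, with a hypothetical two-entry sitemap inlined for illustration:

```python
import xml.etree.ElementTree as ET

# A toy sitemap snippet; real sitemaps can list thousands of URLs.
SITEMAP = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/item/1</loc></url>
  <url><loc>https://example.com/item/2</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def collect_urls(sitemap_xml):
    """Build a crawling path: a deduplicated, ordered list of target URLs."""
    root = ET.fromstring(sitemap_xml)
    seen, path = set(), []
    for loc in root.findall(".//sm:loc", NS):
        url = loc.text.strip()
        if url not in seen:
            seen.add(url)
            path.append(url)
    return path

print(collect_urls(SITEMAP))
# → ['https://example.com/item/1', 'https://example.com/item/2']
```

The hard part in practice is not this parsing step but deciding which of the collected URLs actually matter for your use case.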

Scraper development and its maintenance

Building a scraper comes with a whole new set of issues. There are a lot of factors to look out for when doing so:

  • Choosing the language, APIs, frameworks, etc.
  • Testing out what you've built.
  • Infrastructure management and maintenance.
  • Overcoming anti-fingerprinting measures.
  • Rendering JavaScript-heavy websites at scale.

These issues are just the tip of the iceberg you will encounter when building a web scraper. There are plenty more small, time-consuming tasks that accumulate into larger problems.

Proxy acquisition and management

Proxy management will be a challenge, especially for those new to scraping. There are many small mistakes that can get whole batches of proxies blocked before a site is successfully scraped. Proxy rotation is a good practice, but it doesn't eliminate all the issues and still requires constant management and upkeep of the infrastructure. So if you rely on a proxy vendor, good and frequent communication will be necessary.
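The core of proxy rotation is simple round-robin selection; the real work is blacklisting dead proxies, pacing requests, and replenishing the pool. A minimal sketch, with entirely hypothetical proxy endpoints and credentials:

```python
from itertools import cycle

# Hypothetical proxy endpoints; a real pool would come from a provider.
PROXY_POOL = [
    "http://user:pass@proxy1.example:8080",
    "http://user:pass@proxy2.example:8080",
    "http://user:pass@proxy3.example:8080",
]

proxy_cycle = cycle(PROXY_POOL)

def next_proxy():
    """Rotate proxies round-robin so no single IP carries every request."""
    return next(proxy_cycle)

# Each outgoing request takes the next proxy in the pool.
assignments = [next_proxy() for _ in range(4)]
print(assignments[0] == assignments[3])  # → True (a pool of 3 wraps around)
```

Everything this sketch leaves out (health checks, retry-on-block, pool refresh) is precisely the ongoing upkeep described above.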

Data fetching and parsing

Data parsing is the process of making the acquired data understandable and usable. While creating a parser might sound easy, its further maintenance will cause big problems. Adapting to different page formats and website changes is a constant struggle and will require your development team's attention more often than you might expect.
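To see why parser maintenance is fragile, consider extracting a price from product HTML. The sketch below uses the standard-library `html.parser` and assumes a hypothetical `<span class="price">` markup pattern; the moment a target site renames that class, the parser silently breaks.

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect text inside <span class="price"> elements (a common pattern)."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

# Hypothetical product page snippet; real markup varies per site,
# which is exactly why parser maintenance is costly.
page = '<div><span class="price">$19.99</span></div>'
parser = PriceParser()
parser.feed(page)
print(parser.prices)  # → ['$19.99']
```

The ML-based parsing discussed later aims to learn such patterns from examples instead of hard-coding them per site.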

As you can see, traditional web scraping comes with many challenges and requires a lot of manual labour, time, and resources. The bright side of computing, however, is that almost everything can be automated. And as AI- and ML-powered web scraping emerges, future-proof large-scale data gathering becomes a more realistic prospect.

Making web scraping future-proof

In what ways can AI and ML innovate and improve web scraping? According to Oxylabs Next-Gen Residential Proxy AI & ML advisory board member Jonas Kubilius, an AI researcher, a Marie Skłodowska-Curie alumnus, and co-founder of Three Thirds:

“There are recurring patterns in web content that are typically scraped, such as how prices are encoded and displayed, so in principle, ML should be able to learn to spot these patterns and extract the relevant information. The research challenge here is to learn models that generalize well across various websites or that can learn from a few human-provided examples. The engineering challenge is to scale up these solutions to realistic web scraping loads and pipelines.”

Instead of manually developing and managing scraper code for each new website and URL, an AI- and ML-powered solution simplifies the data gathering pipeline, taking care of proxy pool management, data parsing maintenance, and other tedious work.

Not only do AI- and ML-powered solutions enable developers to build highly scalable data extraction tools, they also enable data science teams to prototype rapidly. Such a solution can also stand in as a backup for your existing custom-built code if it ever breaks.

What the future holds for web scraping

As we have already established, fast data processing pipelines combined with cutting-edge ML techniques can offer an unparalleled competitive advantage in the web scraping community. And looking at today's market, the implementation of AI and ML in data gathering has already begun.

For this reason, Oxylabs is introducing Next-Gen Residential Proxies, which are powered by the latest AI applications.

Next-Gen Residential Proxies were built with heavy-duty data retrieval operations in mind. They enable web data extraction without delays or errors. The product is as customizable as a regular proxy, but at the same time it guarantees a much higher success rate and requires less maintenance. Custom headers and IP stickiness are both supported, alongside reusable cookies and POST requests. Its main benefits are:

  • 100% success rate
  • AI-Powered Dynamic Fingerprinting (CAPTCHA, block, and website change handling)
  • Machine Learning based HTML parsing
  • Easy integration (like any other proxy)
  • Auto-Retry system
  • JavaScript rendering
  • Patented proxy rotation system
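Because the product integrates "like any other proxy", using it from code looks like ordinary proxy configuration. The sketch below is a generic standard-library illustration only: the entry-point host, port, credentials, and header values are hypothetical placeholders, not Oxylabs' actual interface, and no request is actually sent.

```python
import urllib.request

# Hypothetical proxy entry point and credentials; the real endpoint,
# port, and authentication scheme depend on the provider's documentation.
PROXY = "http://customer:password@proxy.example:60000"

# Route both HTTP and HTTPS traffic through the proxy.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
)

# Custom headers and a POST body are set per request; here we only
# build the request object without sending it.
req = urllib.request.Request(
    "https://example.com/target",
    data=b"query=shoes",  # presence of a body makes this a POST
    headers={"User-Agent": "my-scraper/1.0"},
)
print(req.get_method())  # → 'POST'
```

Sending the request would then be `opener.open(req)`; features such as IP stickiness and reusable cookies would be configured per the provider's own parameters.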

Going back to our earlier web scraping value chain, you can see which parts of web scraping can be automated and improved with AI- and ML-powered Next-Gen Residential Proxies.

[Image: web scraping value chain] Source: Oxylabs' design team

The Next-Gen Residential Proxy solution automates almost the entire scraping process, making it a truly strong contender for future-proof web scraping.

This project will be continuously developed and improved by Oxylabs' in-house ML engineering team and a board of advisors, Jonas Kubilius, Adi Andrei, Pujaa Rajan, and Ali Chaudhry, who specialize in the fields of Artificial Intelligence and ML engineering.

Wrapping up

As the scale of web scraping projects increases, automating data gathering becomes a high priority for businesses that want to stay ahead of the competition. The improvement of AI algorithms in recent years, along with the increase in computing power and the growth of the talent pool, has made AI implementations possible in a number of industries, web scraping included.

Establishing AI- and ML-powered data gathering techniques offers a great competitive advantage in the industry and saves copious amounts of time and resources. It is the new future of large-scale web scraping, and a good head start in the development of future-proof solutions.

Translated from: https://towardsdatascience.com/the-new-beginnings-of-ai-powered-web-data-gathering-solutions-a8e95f5e1d3f


