The Impact of Platform APIs on Data Collection
A data analytics cycle starts with gathering and extraction. I hope my previous blog gave you an idea of how data from common file formats is gathered using Python. In this blog, I'll focus on extracting data from files that are not so common but have the most real-world applications.
Whether you are a data professional or not, you might have come across the term API by now. Most people have a rather ambiguous or incorrect idea about this fairly common term.
Extracting data using an API in Python
API is the acronym for Application Programming Interface: a software intermediary (a middleman) that allows two applications to talk to each other.
Each time you use an app like Tinder, send a WhatsApp message, or check the weather on your phone, you're using an API. APIs allow us to share important data and expose practical business functionality between devices, applications, and individuals. And although we may not notice them, APIs are everywhere, powering our lives from behind the scenes.
We can think of an API as a bank's ATM (Automated Teller Machine). Banks make their ATMs accessible for us to check our balance, make deposits, or withdraw cash. So here the ATM is the middleman helping the bank as well as us, the customers.
Similarly, Web applications use APIs to connect user-facing front ends with all-important back end functionality and data. Streaming services like Spotify and Netflix use APIs to distribute content. Automotive companies like Tesla send software updates via APIs. For further instances, you can check out the article 5 Examples of APIs We Use in Our Everyday Lives.
In data analysis, APIs are most commonly used to retrieve data, and that will be the focus of this blog.
When we want to retrieve data from an API, we need to make a request. Requests are used all over the web. For instance, when you visited this blog post, your web browser made a request to the Towards Data Science web server, which responded with the content of this web page.
API requests work in the same way: you request data from an API server, and it responds to your request. There are primarily two ways to use APIs:
- Through the command terminal using URL endpoints, or
- Through programming language-specific wrappers.
For example, Tweepy is a well-known Python wrapper for the Twitter API, whereas twurl is a command-line interface (CLI) tool, but both can achieve the same outcomes.
Here we focus on the latter approach and will use a Python library (a wrapper) called wptools, built around the original MediaWiki API. The MediaWiki action API is Wikipedia's API, which allows access to wiki features like authentication, page operations, and search.
wptools provides read-only access to the MediaWiki APIs. You can get info about wiki sites, categories, and pages from any Wikimedia project in any language via the good old MediaWiki API. You can extract unstructured data from page infoboxes, get structured, linked open data about a page via the Wikidata API, and get page contents from the high-performance RESTBase API.
In the code below, I have used Python's wptools library to access the Mahatma Gandhi Wikipedia page and extract an image file from that page. For a Wikipedia URL such as 'https://en.wikipedia.org/wiki/Mahatma_Gandhi', we only need to pass the last bit of the URL.
The get function fetches everything present on that page, including extracts, images, infobox data, wikidata, etc. By using the .data() function we can extract all the required information. The response we get for our request to the API is most likely to be in JSON format.
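A minimal sketch of that fetch, assuming the wptools library is installed (pip install wptools); the helper name and the function wrapper are my own, and the network call is kept inside a function so nothing is fetched until you invoke it:

```python
def page_title_from_url(url):
    """The API wants only the last segment of a Wikipedia URL."""
    return url.rstrip('/').rsplit('/', 1)[-1]

def fetch_first_image(title):
    """Fetch a Wikipedia page via wptools and return its first image record."""
    import wptools  # third-party wrapper: pip install wptools
    wiki_page = wptools.page(title)
    wiki_page.get()                    # extracts, images, infobox, wikidata...
    return wiki_page.data['image'][0]  # 'image' is a JSON array (Python list)

title = page_title_from_url('https://en.wikipedia.org/wiki/Mahatma_Gandhi')
# first_image = fetch_first_image(title)  # requires network access
```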
Reading data from JSON in Python
JSON is an acronym for JavaScript Object Notation. It is a lightweight data-interchange format. It is as easy for humans to read and write as it is for machines to parse and generate. JSON has quickly become the de facto standard for information exchange.
When exchanging data between a browser and a server, the data can only be text. JSON is text, and we can convert any JavaScript object into JSON and send that JSON to the server.
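That object-to-text round trip looks like this with Python's built-in json module (Python standing in for the JavaScript side; the sample dictionary is made up for illustration):

```python
import json

profile = {"name": "Gandhi", "born": 1869, "movements": ["Satyagraha"]}

as_text = json.dumps(profile)  # object -> JSON text, ready to send over the wire
back = json.loads(as_text)     # JSON text -> object, on the receiving end

print(as_text)  # {"name": "Gandhi", "born": 1869, "movements": ["Satyagraha"]}
```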
For example, you can access GitHub's API directly with your browser without even needing an access token. Here's the JSON response you get when you visit a GitHub user's API route, https://api.github.com/users/divyanitin, in your browser:
{
  "login": "divyanitin",
  "url": "https://api.github.com/users/divyanitin",
  "html_url": "https://github.com/divyanitin",
  "gists_url": "https://api.github.com/users/divyanitin/gists{/gist_id}",
  "type": "User",
  "name": "DivyaNitin",
  "location": "United States"
}
The browser seems to have done just fine displaying a JSON response. A JSON response like this is ready for use in your code. It’s easy to extract data from this text. Then you can do whatever you want with the data.
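Since the response is just text, extracting fields is one json.loads call away. Here an abbreviated copy of the response above is embedded as a string so the example is self-contained:

```python
import json

# Abbreviated version of the GitHub API response shown above.
raw = '''{
    "login": "divyanitin",
    "url": "https://api.github.com/users/divyanitin",
    "type": "User",
    "location": "United States"
}'''

user = json.loads(raw)   # JSON object -> Python dict
print(user['login'])     # divyanitin
print(user['location'])  # United States
```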
Python supports JSON natively. It comes with a built-in package, json, for encoding and decoding JSON data. JSON objects store data within braces ({}), similar to how a dictionary stores it in Python. Similarly, JSON arrays are translated into Python lists.
In my code above, wiki_page.data['image'][0] accesses the first image in the image attribute, i.e., a JSON array. With Python's json module, you can read JSON files just like simple text files.
The read function json.load() returns a JSON dictionary, which can easily be converted into a Pandas DataFrame using the pandas.DataFrame() function. You can even load a JSON file directly into a DataFrame using the pandas.read_json() function.
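A sketch of the file-reading route; the filename and records are made up, and the pandas lines are left commented out so the snippet runs even without pandas installed:

```python
import json

# Write a small JSON file first, so the example is self-contained.
records = [{"name": "Gandhi", "born": 1869}, {"name": "Nehru", "born": 1889}]
with open('people.json', 'w') as f:
    json.dump(records, f)

# json.load reads the file back just like a text file, returning Python objects.
with open('people.json') as f:
    data = json.load(f)

# import pandas as pd
# df = pd.DataFrame(data)           # list of dicts -> DataFrame
# df = pd.read_json('people.json')  # or load the file directly

print(data[0]['name'])  # Gandhi
```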
Reading files from the Internet (HTTPS)
HTTPS stands for HyperText Transfer Protocol Secure. It is the protocol that web browsers and web servers use to talk to each other. A web browser may be the client, and an application on a computer that hosts a website may be the server.
We write code that works with remote APIs all the time: your maps app fetches the locations of nearby Indian restaurants, or the OneDrive app starts up cloud storage. All of this happens just by making an HTTPS request.
Requests is a versatile HTTP library in Python with various applications. It implements the request-response protocol between a client and a server, providing methods for accessing web resources over HTTPS. One of its applications is downloading or opening a file from the web using the file's URL.
To make a 'GET' request, we'll use the requests.get() function, which requires one argument: the URL we want to request.
In the script below, the open function is used to write binary data to a local file. We create a folder and save the extracted web data on the system using Python's os library.
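A sketch of such a script; the URL and folder name are placeholders, and the download is kept inside a function so the network call only happens when you invoke it:

```python
import os

def filename_from_url(url):
    """Derive a local filename from the last segment of the URL."""
    return url.rstrip('/').rsplit('/', 1)[-1]

def download(url, folder='web_data'):
    """Fetch a file over HTTPS and write its binary content to disk."""
    import requests  # third-party: pip install requests
    os.makedirs(folder, exist_ok=True)  # create the folder if it doesn't exist
    response = requests.get(url)
    response.raise_for_status()         # stop on HTTP errors
    path = os.path.join(folder, filename_from_url(url))
    with open(path, 'wb') as f:         # 'wb': write binary data
        f.write(response.content)
    return path

# download('https://example.com/some_image.jpg')  # requires network access
```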
The json and requests import statements load Python code that allows us to work with the JSON data format and the HTTPS protocol. We use these libraries because we're not interested in the details of how to send HTTPS requests or how to parse and create valid JSON; we just want to use them to accomplish these tasks.
A popular web architecture style called REST (Representational State Transfer) allows users to interact with web services via GET and POST calls (the two most commonly used).
GET is generally used to get information about some object or record that already exists. In contrast, POST is typically used when you want to create something.
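With the requests library, that distinction looks like this; the endpoint paths and payload are hypothetical, and the calls are left commented out since they would need a real server:

```python
def get_record(api_base, record_id):
    """GET: read an existing record (hypothetical endpoint)."""
    import requests  # third-party: pip install requests
    return requests.get(f'{api_base}/records/{record_id}')

def create_record(api_base, payload):
    """POST: create a new record by sending a JSON body (hypothetical endpoint)."""
    import requests
    return requests.post(f'{api_base}/records', json=payload)

# get_record('https://api.example.com', 42)               # read record 42
# create_record('https://api.example.com', {'name': 'x'}) # create a record
```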
REST is essentially a set of useful conventions for structuring a web API. By “web API,” I mean an API that you interact with over HTTP, making requests to specific URLs, and often getting relevant data back in the response.
For example, Twitter's REST API allows developers to access core Twitter data, and the Search API provides methods for developers to interact with Twitter search and trends data.
This blog focused on the so-called internet files. I introduced the extraction of data using simple APIs. Most APIs require authentication, just as the ATM in my earlier analogy required us to enter a PIN to authenticate our access to the bank.
You can check out my other article, Twitter Analytics: "WeRateDogs", which focused on data wrangling and analysis with the Twitter API. I used all of the above-mentioned scripts in that project; you can find the code on my GitHub.
As is well known, one of the most common internet file formats is HTML. Extracting data from the internet is commonly known as web scraping. In this approach, we access website data directly from the HTML. I will cover web scraping, from the basics to the must-knows, in my next blog.
If you enjoyed this blog post, leave a comment below, and share it with a friend!
Translated from: https://towardsdatascience.com/gather-your-data-the-not-so-spooky-apis-da0da1a5992c