Scraping Berlin Hostels and Building a Tableau Viz with It

One of the coolest things about working on a personal project is that we get to explore topics we care about. In my case, I had the chance to backpack around the world for more than a year in 2016–2017, and it was one of the best experiences of my life.

During my travels I stayed in A LOT of hostels. From Hanoi to the Iguassu Falls, passing through Tokyo, Delhi and many other places, one always needs a place to rest after a long day exploring a city. Funnily enough, it was in some of those hostels that I became interested in learning how to code, which started me on the path to becoming the data analyst I am today.


So I’m very interested in understanding what makes one hostel better than another, how to compare them, and so on, and thinking about that led me to the idea for this tutorial. Today we are going to do two things:

  • Scrape data from Hostel World, using Berlin as our case study, and save it into a data frame.
  • Use that data to build a Tableau dashboard that lets us select hostels based on different criteria.

Why Berlin hostels? Because Berlin is an amazing city, and it has plenty of hostels for us to explore. There are many websites for finding hostels; we will use my favorite, Hostel World, which I have used many times and which I trust for the accuracy of its information.

Photo by Ricardo Gomez Angel on Unsplash.

My goal is to show you that the whole process of collecting, transforming, and visualizing data can be done in a simple yet effective way, so you can start doing your own projects. To get the most out of this tutorial, you should be familiar with Python and pandas, and comfortable with basic HTML and Tableau concepts.

You can follow along with the notebook containing the code here, and access the Tableau Dashboard here.


Always Explore the Website First!

I highly recommend taking some time to explore the structure of the website before you start coding. If you’re using Chrome, right-click anywhere on the page and select “Inspect”. This is what you get:

Image: The HTML structure of our target page. Source: author.

Think of the HTML structure as a tree, with its branches holding the information on the page. Try to find which class holds the hostel name, the ratings, and so on. More importantly, notice how the information for each hostel sits in its own “branch”, or container. That means that once we figure out how to access one, we can apply the same logic to all the other hostels/containers.
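To make the “one container per hostel” idea concrete, here is a minimal sketch on a tiny hand-written HTML snippet. The markup, the hostel names, and the `ContainerCounter` class are all hypothetical, and it uses Python's built-in `html.parser` instead of Beautiful Soup so that it runs without any extra installs. Only the class name `hwta-property` is borrowed from the tutorial's own code.

```python
from html.parser import HTMLParser

# Hypothetical, hand-written markup: two "containers", one per hostel.
SAMPLE_HTML = """
<div class="results">
  <div class="fabresult hwta-property"><h2><a href="/a">Hostel A</a></h2></div>
  <div class="fabresult hwta-property"><h2><a href="/b">Hostel B</a></h2></div>
</div>
"""


class ContainerCounter(HTMLParser):
    """Collects the link text found inside each hostel container."""

    def __init__(self, container_class):
        super().__init__()
        self.container_class = container_class
        self.depth = 0        # greater than 0 while we are inside a container
        self.in_link = False  # True while we are inside an <a> tag
        self.names = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1   # a nested tag inside the current container
        elif self.container_class in classes:
            self.depth = 1    # we just entered a new container
        if self.depth and tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        if self.depth:
            self.depth -= 1   # reaches 0 when the container closes

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.names.append(data.strip())


parser = ContainerCounter("hwta-property")
parser.feed(SAMPLE_HTML)
hostel_names_found = parser.names
print(hostel_names_found)
```

Each container yields exactly one name, which is the property the scraping loop relies on. The depth counter assumes every opened tag is also closed, which holds for this sample but not for arbitrary pages; Beautiful Soup handles that for you.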

The code below shows how to get the raw information, then how to find out how many pages of hostels there are (we will need that to iterate later), and finally how to isolate the information about the first hostel in order to explore it. Take your time reading the code and comments; I wrote them especially for you:

# importing the libraries used for scraping
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import re

# getting the HTML to be parsed
url = 'https://www.hostelworld.com/hostels/Berlin'
response = get(url)

# create the soup
soup = BeautifulSoup(response.text, 'html.parser')

# creating individual containers; each one holds the information about one hostel
holstel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')

# figuring out how many pages of hostels are available; we need this to iterate over the pages later
total_pages = soup.find_all(class_='pagination-page-number')
final_page = pd.to_numeric(total_pages[-1].text)
print(final_page)

# checking how many hostels we have on the first page
print(len(holstel_containers))

# isolating the first hostel to explore its HTML
first_hostel = holstel_containers[0]
print(first_hostel.prettify())

The output of this code is first a “3”, the number of pages with hostel information, then a “30”, the number of hostels per page, and finally a long block of HTML, which is the information about the first hostel on the list. The information we will extract today is the following:

  • Name
  • Link
  • Distance from centre (km)
  • Average rating
  • Number of reviews
  • Average price in USD

Using our super HTML skills, we figured out that the code below extracts those fields. If you have already used Beautiful Soup, could you get the same information in a different way? If so, I would love to see it in the comments.

# hostel name
first_hostel.h2.a.text

# hostel link
first_hostel.h2.a.get('href')

# distance from city centre in km
first_hostel.find(class_='addressline').text[12:18].replace('k', '').replace('m', '').strip()

# average rating
first_hostel.find(class_='hwta-rating-score').text.replace('\n', '').strip()

# number of reviews
first_hostel.find(class_='hwta-rating-counter').text.replace('\n', '').strip()

# average price per night in USD
first_hostel.find(class_='price').text.replace('\n', '').strip()[3:]

Note that we need some Python string essentials, like slicing, replace and strip, along with a few methods from the Beautiful Soup package, mostly find, find_all and get. Knowing how to combine them takes some practice, but I can guarantee that once you understand the idea, it is pretty simple.
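Fixed slice positions such as `[12:18]` and `[3:]` depend on the exact layout of the text, so they can break if the site changes its wording slightly. As one possible alternative, the sketch below pulls the first number out of a string with a regular expression. The `first_number` helper and the raw strings are hypothetical examples, not text copied from Hostel World, and treating commas as thousands separators is an assumption.

```python
import re


def first_number(raw_text):
    """Return the first number found in raw_text as a float, or None."""
    # Assumption: commas are thousands separators, so they are dropped first.
    match = re.search(r"\d+(?:\.\d+)?", raw_text.replace(",", ""))
    return float(match.group()) if match else None


# Hypothetical raw strings, shaped like the text a scraped tag might hold.
raw_distance = "\n  Hostel - 2.5km from city centre\n"
raw_price = "\n US$18.40 \n"
raw_reviews = "\n 1,234 \n"

distance_km = first_number(raw_distance)
price_usd = first_number(raw_price)
reviews = first_number(raw_reviews)
print(distance_km, price_usd, reviews)
```

The helper returns `None` when no digits are present, which makes missing values explicit instead of producing an empty string.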

Now that we know how to access the information we need in the first container, we will apply the same logic to all the hostels on the first page, and then to all the pages with hostel information. How do we do that? First with our well-known for loop, then by saving the information into lists that start out empty, and finally by using those lists to create a data frame:

# first, create the empty lists
hostel_names = []
hostel_links = []
hostel_distance = []
hostel_ratings = []
hostel_reviews = []
hostel_prices = []

# iterate over the pages and create the containers, using the final_page value we got at the beginning
for page in np.arange(1, final_page + 1):
    url = 'https://www.hostelworld.com/hostels/Berlin?page=' + str(page)
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    holstel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')

    # iterate over the results on each page
    for item in range(len(holstel_containers)):
        hostel_names.append(holstel_containers[item].h2.a.text)
        hostel_links.append(holstel_containers[item].h2.a.get('href'))
        hostel_distance.append(holstel_containers[item].find(class_='addressline').text[12:18].replace('k', '').replace('m', '').strip())
        hostel_ratings.append(holstel_containers[item].find(class_='hwta-rating-score').text.replace('\n', '').strip())
        hostel_reviews.append(holstel_containers[item].find(class_='hwta-rating-counter').text.replace('\n', '').strip())
        hostel_prices.append(holstel_containers[item].find(class_='price').text.replace('\n', '').strip()[3:])

    time.sleep(2)  # pause between pages so we do not push too hard on the website

# using the lists to create a brand new data frame
hw_berlin = pd.DataFrame({
    'hostel_name': hostel_names,
    'distance_centre_km': hostel_distance,
    'average_rating': hostel_ratings,
    'number_reviews': hostel_reviews,
    'average_price_usd': hostel_prices,
    'hw_link': hostel_links
})
hw_berlin.head()
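Six parallel lists work, but they can drift out of step if one append fails halfway through a hostel. A possible variation, sketched below, collects one dictionary per hostel instead. The `scrape_all` and `fake_fetch_page` functions are hypothetical: the fake fetcher stands in for the request and parsing step so the sketch runs offline, and the hostel names it produces are made up.

```python
import time


def scrape_all(fetch_page, final_page, delay_seconds=0.0):
    """Collect one dict per hostel from pages 1..final_page.

    fetch_page is any callable that takes a page number and returns a
    list of dicts (one per hostel); it stands in for the request and
    parsing step of the tutorial.
    """
    rows = []
    for page in range(1, final_page + 1):
        for hostel in fetch_page(page):
            rows.append(hostel)
        time.sleep(delay_seconds)  # pause between pages to be polite
    return rows


# Hypothetical stand-in for the real request/parse step (made-up data).
def fake_fetch_page(page):
    return [
        {"hostel_name": f"Hostel {page}-{i}", "average_price_usd": "10.0"}
        for i in range(2)
    ]


rows = scrape_all(fake_fetch_page, final_page=3)
print(len(rows), rows[0])
# pandas accepts this shape directly: pd.DataFrame(rows)
```

Because every hostel is a single record, a failure while parsing one field cannot leave the columns with different lengths.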

And now we can appreciate the beauty of what we have just created:


Image: First lines of the Berlin Hostels data frame. Source: author.

After that we just need to clean the data up a little, removing non-numerical characters and converting the values, initially stored as strings (dtype object), to numbers. Finally, we save the result to a .csv file.

# removing non-numerical characters from the distance_centre_km column
hw_berlin.distance_centre_km = [re.sub('[^0-9.]', '', x) for x in hw_berlin.distance_centre_km]

# converting the numerical columns to a proper numeric dtype
list_to_convert = ['distance_centre_km', 'average_rating', 'number_reviews', 'average_price_usd']
for column in list_to_convert:
    hw_berlin[column] = pd.to_numeric(hw_berlin[column], errors='coerce')

# saving the final version to a .csv file
hw_berlin.to_csv('hw_berlin_basic_info.csv')
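To see what the cleaning step does to individual values, here is a rough standard-library sketch of the same idea: strip everything that is not a digit or a dot, then convert, falling back to NaN when nothing usable is left (which is what `errors='coerce'` does in pandas). It deliberately uses `csv` and `math` rather than pandas, and the `to_number` helper and sample rows are hypothetical.

```python
import csv
import io
import math
import re


def to_number(raw_value):
    """Strip non-numeric characters, then convert; NaN when nothing usable is left."""
    cleaned = re.sub(r"[^0-9.]", "", raw_value)
    try:
        return float(cleaned)
    except ValueError:
        return math.nan


# Hypothetical rows, shaped like the scraped strings before cleaning.
raw_rows = [
    {"hostel_name": "Hostel A", "distance_centre_km": "2.5k", "average_rating": "9.1"},
    {"hostel_name": "Hostel B", "distance_centre_km": "0.8", "average_rating": "No rating"},
]

clean_rows = [
    {
        "hostel_name": row["hostel_name"],
        "distance_centre_km": to_number(row["distance_centre_km"]),
        "average_rating": to_number(row["average_rating"]),
    }
    for row in raw_rows
]

# Write to an in-memory buffer instead of a file so nothing is left on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(clean_rows[0]))
writer.writeheader()
writer.writerows(clean_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

A value such as “No rating” becomes NaN rather than raising an error, so one unrated hostel does not stop the whole conversion.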

Tableau Fun Time!

Tableau is one of the most powerful BI tools available today, and it offers a free version, Tableau Public, that lets you do A LOT of cool stuff. However, it can get complex very quickly, even for basic charts. I cannot cover every step I took here, because it involved a lot of clicking and dragging. That is different from code, which you can simply type out to reproduce everything.

So, if you are new to Tableau and want to understand how I built my visualization, download the .twb file, which is available here, open it on your computer, and do what we call “reverse engineering”: inspect and play with the file I created. Trust me, this is the most effective way to learn Tableau, although even when you can see the engineering behind a visualization, it can be hard to reproduce. Shall we give it a try?

Image: Tableau offers different filters that help you slice and visualize our recently scraped data. Source: author.

As data or business analysts, our job is basically to make data readable and easy to manipulate. The visualization I’ve built for this tutorial offers exactly that: you can slice and explore the hostels based on the criteria we have available, filtering the options to find the ones you are interested in, just as a stakeholder would. Besides the filters, I’ve also included a scatter plot where we can check the relationship between price and reviews.
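If you want to prototype the dashboard's filters before opening Tableau, the same slicing can be sketched in plain Python. The column names below match the tutorial's data frame, but the three sample rows and the `filter_hostels` helper are hypothetical and only illustrate the logic.

```python
# Hypothetical sample rows using the column names of the tutorial's data frame.
hostels = [
    {"hostel_name": "Hostel A", "distance_centre_km": 1.2, "average_rating": 9.1, "average_price_usd": 22.0},
    {"hostel_name": "Hostel B", "distance_centre_km": 4.8, "average_rating": 8.0, "average_price_usd": 12.5},
    {"hostel_name": "Hostel C", "distance_centre_km": 0.6, "average_rating": 7.4, "average_price_usd": 30.0},
]


def filter_hostels(rows, max_price=None, min_rating=None, max_distance=None):
    """Keep the rows that satisfy every criterion that was provided."""
    selected = []
    for row in rows:
        if max_price is not None and row["average_price_usd"] > max_price:
            continue
        if min_rating is not None and row["average_rating"] < min_rating:
            continue
        if max_distance is not None and row["distance_centre_km"] > max_distance:
            continue
        selected.append(row)
    return selected


picks = filter_hostels(hostels, max_price=25, min_rating=8.5)
pick_names = [row["hostel_name"] for row in picks]
print(pick_names)
```

Criteria left as `None` are ignored, which mirrors a dashboard filter that has not been touched.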

The dashboard is pretty simple, and I made it that way on purpose: I would like to see you build one yourself and share a link to your results in the comments. What other information can you get from the data we’ve scraped? Could you do the same analysis for hostels in Paris, New York or Rio de Janeiro? I’ll leave those questions for you to answer with your own code and dashboard.
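As a starting point for other cities, the page URLs can be generated by a small helper. The Berlin pattern below is the one used in the scraping loop; whether other cities follow the same pattern, and how multi-word names such as New York are written in the URL, is an assumption you should verify in the browser first. The `build_page_urls` function is hypothetical, and `final_page` has to be read from each city's own pagination.

```python
def build_page_urls(city, final_page):
    """Return the listing URL for every page from 1 to final_page."""
    base = "https://www.hostelworld.com/hostels/" + city + "?page="
    return [base + str(page) for page in range(1, final_page + 1)]


berlin_urls = build_page_urls("Berlin", 3)
print(berlin_urls)
```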

That’s all for today! I hope this tutorial helps you learn more about data scraping and Tableau. Feel free to connect with me on LinkedIn and to check out my other articles and code on my Medium and GitHub profiles.


Translated from: https://towardsdatascience.com/scraping-berlin-hostels-and-building-a-tableau-viz-with-it-a73ce5b88e22
