Scraping Berlin Hostels and Building a Tableau Viz with It
One of the coolest things about working on personal projects is the fact that we can explore topics of our own interest. In my case, I had the chance to backpack around the world for more than a year between 2016 and 2017, and it was one of the best experiences of my life.
During my travels, I stayed in A LOT OF HOSTELS. From Hanoi to the Iguassu Falls, passing through Tokyo, Delhi and many other places, one always needs a place to rest after a long day of exploring the city. Funnily enough, it was in some of those hostels that I got interested in learning how to code, which started me on the path to becoming the data analyst I am today.

So, I’m very interested in understanding what makes one hostel better than another, how to compare them, and so on, and after thinking about that I came up with the idea for this tutorial. Today we are going to do two things:
- Scrape data from Hostel World, using Berlin as our case study, and save it into a data frame.
- Use that data to build a Tableau dashboard that will allow us to select a hostel based on different criteria.
Why Berlin hostels? Because Berlin is an amazing city, and there are a lot of hostel options there for us to explore. There are many different websites for finding hostels; we will use my favorite, Hostel World, which I have used many times myself and which I trust for the accuracy of the information it provides.
My goal is to show you that we can go through the whole collect/transform/visualize process in a simple yet effective way so you can start doing your own projects. To fully enjoy this tutorial, it’s important that you are familiar with Python and pandas, and also comfortable with basic HTML and Tableau concepts.
You can follow along with the notebook containing the code here, and access the Tableau Dashboard here.
Always Explore the Website First!
I highly recommend that you take some time to explore the structure of the website before you start coding. If you’re using Chrome, just right-click on the page and select “Inspect”. This is what you get:

Think of the HTML structure as a tree, with all its branches holding the information on the page. Try to find which class holds information about the hostel name, ratings, and so on. More importantly, notice how the information for each hostel sits in its own “branch”, or container. That means that once we figure out how to access one container, we can apply the same logic to all the other hostels/containers.
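To make the container idea concrete, here is a minimal sketch; the HTML snippet and class name below are invented for illustration, but it shows how Beautiful Soup lets you run the same queries against each container/branch independently:

```python
from bs4 import BeautifulSoup

# a made-up, simplified page with two hostel "containers"
html = """
<div class="fabresult"><h2><a href="/hostel-a">Hostel A</a></h2><span class="price">$20</span></div>
<div class="fabresult"><h2><a href="/hostel-b">Hostel B</a></h2><span class="price">$25</span></div>
"""

soup = BeautifulSoup(html, 'html.parser')
for container in soup.find_all(class_='fabresult'):
    # each container is its own branch, so the same queries work on all of them
    print(container.h2.a.text, container.find(class_='price').text)
```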
In the code below I show you how to get the raw information, then how to figure out how many pages of hostels we have, as we will need that to iterate later, and finally how to separate out the information about the first hostel in order to explore it. Take your time to read the code and comments; I wrote them specially for you:
```python
# importing the libraries used in the scraping
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import re

# getting the raw HTML and parsing it into a soup
url = 'https://www.hostelworld.com/hostels/Berlin'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# creating individual containers; each one holds the information about one hostel
hostel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')

# figuring out how many pages of hostels are available; we need this to iterate later
total_pages = soup.find_all(class_='pagination-page-number')
final_page = pd.to_numeric(total_pages[-1].text)
print(final_page)

# checking how many hostels we have on the first page, then isolating the first one to explore it
print(len(hostel_containers))
first_hostel = hostel_containers[0]
print(first_hostel.prettify())
```
The output of this code will be, first, a “3” (the number of pages with hostel info), then a “30” (the number of hostels per page), and finally a long chunk of HTML, which is the information about the first hostel on the list. The information we will extract today is the following:
- Name
- Link
- Distance from the centre (km)
- Average rating
- Number of reviews
- Average price in USD
Using our super HTML skills, we figured out that the code to extract all of that is the one below. If you have already used Beautiful Soup, could you get the same information in a different way? If so, I would love to see that in the comments.
```python
# hostel name
first_hostel.h2.a.text
# hostel link
first_hostel.h2.a.get('href')
# distance from the city centre in km
first_hostel.find(class_="addressline").text[12:18].replace('k', '').replace('m', '').strip()
# average rating
first_hostel.find(class_='hwta-rating-score').text.replace('\n', '').strip()
# number of reviews
first_hostel.find(class_="hwta-rating-counter").text.replace('\n', '').strip()
# average price per night in USD
first_hostel.find(class_="price").text.replace('\n', '').strip()[3:]
```
Note that we will need some string essentials, like replace and strip, along with a few methods from the Beautiful Soup package, mostly find, find_all and get. Knowing how to combine them is something that requires a bit of practice, but I can guarantee that, once you understand the idea, it is pretty simple.
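As a quick illustration of that chaining, here is a tiny, self-contained sketch; the snippet and its leading “US$” prefix are invented for the example, so the exact slicing may differ on the real page:

```python
from bs4 import BeautifulSoup

# invented fragment, just to show how find, .text and string methods chain together
fragment = BeautifulSoup('<span class="price">\n  US$19 \n</span>', 'html.parser')
price = fragment.find(class_='price').text.replace('\n', '').strip()[3:]
print(price)  # -> '19'
```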
Now that we know how to access the information we need in the first container, we will expand the same logic across all the hostels on the first page, and also across all the pages with hostel information. How do we do that? First by using our well-known for loop, then saving the information into empty lists, and finally using those lists to create a data frame:
```python
# first, create the empty lists
hostel_names = []
hostel_links = []
hostel_distance = []
hostel_ratings = []
hostel_reviews = []
hostel_prices = []

# iterate over the pages, using the final_page value we got at the beginning
for page in np.arange(1, final_page + 1):
    url = 'https://www.hostelworld.com/hostels/Berlin?page=' + str(page)
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    hostel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')
    # iterate over the results on each page
    for item in range(len(hostel_containers)):
        hostel_names.append(hostel_containers[item].h2.a.text)
        hostel_links.append(hostel_containers[item].h2.a.get('href'))
        hostel_distance.append(hostel_containers[item].find(class_="addressline").text[12:18].replace('k', '').replace('m', '').strip())
        hostel_ratings.append(hostel_containers[item].find(class_='hwta-rating-score').text.replace('\n', '').strip())
        hostel_reviews.append(hostel_containers[item].find(class_="hwta-rating-counter").text.replace('\n', '').strip())
        hostel_prices.append(hostel_containers[item].find(class_="price").text.replace('\n', '').strip()[3:])
    time.sleep(2)  # so we don't push too hard on the website

# using the lists to create a brand new data frame
hw_berlin = pd.DataFrame({'hostel_name': hostel_names,
                          'distance_centre_km': hostel_distance,
                          'average_rating': hostel_ratings,
                          'number_reviews': hostel_reviews,
                          'average_price_usd': hostel_prices,
                          'hw_link': hostel_links})
hw_berlin.head()
```
And now we can appreciate the beauty of what we have just created:

After that, we just need to clean up the data a little bit, removing non-numerical characters and converting the strings, initially saved as object dtype, to numbers. Finally, we will save our results into a .csv file.
```python
# removing non-numerical characters from the column distance_centre_km
hw_berlin.distance_centre_km = [re.sub('[^0-9.]', '', x) for x in hw_berlin.distance_centre_km]

# converting numerical columns to the proper format
list_to_convert = ['distance_centre_km', 'average_rating', 'number_reviews', 'average_price_usd']
for column in list_to_convert:
    hw_berlin[column] = pd.to_numeric(hw_berlin[column], errors='coerce')

# saving the final version into a .csv file
hw_berlin.to_csv('hw_berlin_basic_info.csv')
```
Tableau Fun Time!
Tableau is one of the most powerful BI tools available today, and it offers a free version, Tableau Public, that allows you to do A LOT of cool stuff. However, it can become pretty complex very quickly, even for some basic graphs. I cannot cover all the steps I took here, as they involved a lot of click-and-drag actions. It’s different from code, where you can just type it out and reproduce it all.
So, if you are new to Tableau and want to understand how I built my visualization, the way to do that is to download the .twb file, which is available here, open it on your computer, and do what we call “reverse engineering”, which basically means checking and playing with the file I’ve created yourself. Trust me, this is the most effective way to learn Tableau, and even when you can see the engineering behind it, it can be hard to reproduce the same visualization. Want to give it a try?

As data or business analysts, we basically need to make data readable and easy to manipulate. The visualization I’ve built for this tutorial offers you exactly that: you can slice and play with the hostels based on the different criteria we have available, filtering the options and finding the ones you are interested in, just as a stakeholder would. Besides the filters, I’ve also included a scatter plot where we can check the relationship between price and reviews.
The dashboard is pretty simple, and I’ve kept it that way on purpose; I would like to see you building it yourself and sharing the link to your results in the comments. What other kinds of information can you get from the data we’ve scraped? Could you do the same analysis with hostels in Paris, New York or Rio de Janeiro? I’ll leave those questions for you to answer with your own code and dashboard.
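If you do want to try another city, one possible starting point is to wrap the page loop from earlier in a small function parameterized by the city name. This is just a sketch: the scrape_city helper is my own invention, not part of the original notebook, and it assumes other cities follow the same https://www.hostelworld.com/hostels/<City> URL pattern and CSS classes as the Berlin pages, which you should verify in the browser first.

```python
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import time

def scrape_city(city, max_pages=1):
    """Hypothetical helper reusing the Berlin logic for another city.

    Assumes the same URL pattern and CSS classes as the Berlin pages;
    inspect the target pages first, since either may differ or change.
    """
    names, prices = [], []
    for page in range(1, max_pages + 1):
        url = 'https://www.hostelworld.com/hostels/' + city + '?page=' + str(page)
        soup = BeautifulSoup(get(url).text, 'html.parser')
        for container in soup.find_all(class_='fabresult rounded clearfix hwta-property'):
            names.append(container.h2.a.text)
            prices.append(container.find(class_='price').text.replace('\n', '').strip()[3:])
        time.sleep(2)  # same polite delay as before
    return pd.DataFrame({'hostel_name': names, 'average_price_usd': prices})

# e.g. paris_hostels = scrape_city('Paris', max_pages=2)
```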
That’s all for today! I hope this tutorial helps you learn more about data scraping and Tableau. Feel free to connect with me on LinkedIn and to check out my other texts and code on my Medium and GitHub profiles.

Originally published at: https://towardsdatascience.com/scraping-berlin-hostels-and-building-a-tableau-viz-with-it-a73ce5b88e22