Exploring Data with Python
Between data blends, joins, and wrestling with the resulting levels of detail in Tableau, managing relationships between data can be tricky.
Stepping into your shiny new Python shoes, the world opens up a bit. Instead of trying to squeeze everything into a single data source to rule them all, you can choose your battles.
In our previous articles, we already saw:
- Pandas groupby expressions and basic visualizations
- Calculations and coloring sales by profitability
- Data exploration with advanced visuals
Setting the stage
In this article, we’ll focus on one of the most important aspects of working with data in any ecosystem: joins.
Lucky for us, the Tableau data we have been playing around with comes with batteries included! Within the context of the fictitious store we have been analyzing, we have a bit of data detailing various items that were returned.
Let’s use this data on orders and returns to get comfortable joining data in Python using the Pandas library. Along the way, we’ll see how different types of joins can help us accomplish different goals.
Step 1: taking a quick look at the data we’d like to join
Remember that orders data we’ve been working with? Let’s get a good look at it real quick.

Ah, that’s familiar. Now let’s see what’s behind door #2. But first, in the spirit of getting a feel for how Pandas works, let’s highlight how we are getting our hands on this data.
In the first article, we used the Pandas read_excel() function to get our orders data from an Excel file. It turns out this file has multiple sheets:
- Orders
- Returns
- People
Let’s ignore the ‘People’ sheet for now (nobody likes them anyway), and simply acknowledge the fact that when we used the read_excel() function without specifying a sheet name, the function grabbed the first sheet available in the Excel workbook.
Here’s how we can specify that we want to fetch data from the ‘Returns’ sheet:
returns_df = pd.read_excel('Global Superstore.xls', sheet_name='Returns')
And here’s what that data looks like:

Step 2: avoid blind dates for your data
In the data joining world, an arranged marriage is better than a blind date. That is, we want to be the overbearing parents who inspect the columns we are about to join our data on, looking for any inconsistencies that would make for a poor match. Blindly joining our data together just because we assume the columns match up could lead to some undesired results.
To prove that point, let’s take a stroll into a blind date and see what happens.
For starters, let’s observe which columns our two data sets have in common.


It looks to me like we have ‘Order ID’ and ‘Market’ in common. Of course, in many real world scenarios the same information might be stored under different names, but this isn’t the real world. It’s an example!
By the way, here’s a shortcut you could use to find values that exist in both column lists:
common_cols = [col for col in store_df.columns if col in returns_df.columns]

What you just saw is called a list comprehension. If you end up falling in love with Python, these will become second nature to you. They’re very useful. In this case, our list comprehension returns every value in list X that also appears in list Y.
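To make the idea concrete, here is a minimal sketch using two tiny stand-in dataframes. The names orders_demo and returns_demo and their columns are made up for illustration; they are not the real Superstore sheets. It also shows a set intersection, which finds the same names but does not promise any particular order.

```python
import pandas as pd

# Hypothetical stand-ins for store_df and returns_df; only the column names matter here.
orders_demo = pd.DataFrame(columns=['Order ID', 'Market', 'Sales'])
returns_demo = pd.DataFrame(columns=['Returned', 'Order ID', 'Market'])

# List comprehension: keep each orders column that also appears in the returns columns.
common_cols = [col for col in orders_demo.columns if col in returns_demo.columns]

# A set intersection finds the same names, but without a guaranteed order.
common_set = set(orders_demo.columns) & set(returns_demo.columns)

print(common_cols)
```

The list comprehension keeps the columns in the order they appear in the first dataframe, which is handy if you plan to pass the result straight to a join.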
Now that we know we have two columns in common, what are we waiting for!? Initiate the blind date!
Step 3: alright, let’s give the blind date a try…
So we have ‘Order ID’ and ‘Market’ in common between our orders and our returns. Let’s join them.
In Pandas, you can use the join() method and you can also use the merge() method. Different strokes for different folks. I prefer the merge() method in pretty much any scenario, so that’s what we’ll use here. Team merge, for life.
blind_date_df = store_df.merge(returns_df, on=['Order ID', 'Market'], how='left')
And let’s see what the results look like:

First of all, we ran .shape on the resulting dataframe to get values for the number of rows and columns (in that order). So our resulting dataframe has 51,290 rows and 25 columns, where the original orders dataframe has 51,290 rows and 24 columns.
This join has effectively sprinkled in new data for each of our rows, providing one additional column named ‘Returned’, which takes on the value of ‘Yes’ if an order was returned. Rows with no matching return are left empty (NaN) in that column.
Note that in our join we specified the columns to join on as well as how to perform the join. What is this ‘left’ join? It simply means that the table that was there first (in this example that is our store_df) will remain as-is, and the new table’s data will be sprinkled onto it wherever relevant.
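Here is a small sketch of a left join on toy data, in case seeing it at a tiny scale helps. The dataframes orders_demo and returns_demo and their values are invented for illustration and only mimic the shape of the real sheets.

```python
import pandas as pd

# Hypothetical toy data: three orders, one of which was returned.
orders_demo = pd.DataFrame({
    'Order ID': ['A-1', 'A-2', 'A-3'],
    'Market': ['EU', 'EU', 'APAC'],
    'Sales': [100.0, 250.0, 75.0],
})
returns_demo = pd.DataFrame({
    'Returned': ['Yes'],
    'Order ID': ['A-2'],
    'Market': ['EU'],
})

# Left join: every order row is kept; 'Returned' is filled in only where a match exists.
left_demo = orders_demo.merge(returns_demo, on=['Order ID', 'Market'], how='left')

print(left_demo.shape)                      # same row count as orders_demo, one extra column
print(left_demo['Returned'].isna().sum())   # orders with no matching return
```

The row count matches the orders table, and the unmatched orders simply carry a missing value in the new column.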
Let’s compare this to an inner join:

This inner join behaves differently from our left join in the sense that it only keeps the intersection between the two tables. This type of join would be useful if we only cared about analyzing orders that were returned, as it filters out any orders that were not returned.
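As a rough illustration of that filtering effect, the sketch below runs an inner join on invented toy data. The names orders_demo and returns_demo are placeholders, not the article’s actual dataframes.

```python
import pandas as pd

# Hypothetical toy data: three orders, one of which was returned.
orders_demo = pd.DataFrame({
    'Order ID': ['A-1', 'A-2', 'A-3'],
    'Market': ['EU', 'EU', 'APAC'],
    'Sales': [100.0, 250.0, 75.0],
})
returns_demo = pd.DataFrame({
    'Returned': ['Yes'],
    'Order ID': ['A-2'],
    'Market': ['EU'],
})

# Inner join: only rows whose keys appear in BOTH tables survive.
inner_demo = orders_demo.merge(returns_demo, on=['Order ID', 'Market'], how='inner')

print(inner_demo.shape)
print(inner_demo['Order ID'].tolist())
```

Only the returned order makes it through; the other two orders are dropped because nothing on the returns side matches them.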
Step 4: so what’s wrong with the blind date?
Sometimes you think you know everything, and that’s when it bites you the hardest. In this example, we think we know that we have two matching columns: ‘Order ID’ and ‘Market’. But do our two data sets agree on what a market is?
Let’s stir up some drama. Orders, how do you define your markets?
store_df['Market'].unique()

This line of code takes a look at the entire ‘Market’ column and outputs the unique values found within it.
Okay. Returns, how do you define your markets?

It looks like our orders and returns teams both need to get on the same page in terms of whether they use acronyms for markets or spell them out.
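Rather than eyeballing two lists of labels, one option is to let Python spot the disagreement with a set difference. This is only a sketch: the two Series below are invented stand-ins for the ‘Market’ columns, and their values are illustrative.

```python
import pandas as pd

# Hypothetical market labels standing in for store_df['Market'] and returns_df['Market'].
orders_markets = pd.Series(['US', 'EU', 'Canada', 'EU'], name='Market')
returns_markets = pd.Series(['United States', 'EU', 'Canada'], name='Market')

# Labels that one side uses and the other side has never heard of.
only_in_orders = set(orders_markets.unique()) - set(returns_markets.unique())
only_in_returns = set(returns_markets.unique()) - set(orders_markets.unique())

print(only_in_orders)
print(only_in_returns)
```

Any label that shows up in either result is a value that can never match in a join on that column.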
On the orders side, to avoid future issues we should probably switch ‘Canada’ to an acronym value like ‘CA’.
On the returns side, to avoid future issues we should probably switch ‘United States’ to ‘US’.
But wait, that only fixes the future issues… what kinds of problems is this causing right now?
Step 5: cleaning up after a blind join
To see the mess we’re in, let’s look at how many returns we have per market (using the inner join from earlier):

Hurray, it looks like our US market is perfect and has no returns! Or wait, is it the United States market… ah, oops.
Because the data containing our orders calls the market ‘US’ and the data containing our returns calls the market ‘United States’, the join will never match the two.
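One way to catch this kind of silent mismatch is the indicator option of merge(), which adds a ‘_merge’ column saying where each row found its match. The sketch below uses invented toy data (orders_demo and returns_demo are hypothetical) and starts from the returns side, so that any return without a matching order stands out.

```python
import pandas as pd

# Hypothetical toy data with the same labeling mismatch as in the article.
orders_demo = pd.DataFrame({
    'Order ID': ['A-1', 'A-2'],
    'Market': ['US', 'EU'],
    'Sales': [100.0, 250.0],
})
returns_demo = pd.DataFrame({
    'Returned': ['Yes', 'Yes'],
    'Order ID': ['A-1', 'A-2'],
    'Market': ['United States', 'EU'],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
check = returns_demo.merge(orders_demo, on=['Order ID', 'Market'], how='left', indicator=True)

# Returns that failed to find their order are labeled 'left_only'.
unmatched = check[check['_merge'] == 'left_only']

print(unmatched[['Order ID', 'Market']])
```

A return that cannot find its own order is a strong hint that the join keys disagree somewhere.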
Luckily, it’s really easy to rename our markets. Here’s a quick way to do it in this situation, where there’s really just one mismatch that’s causing a problem. This introduces the concept of a lambda function, which you can simply ignore for now if it makes no sense to you.
returns_df['Market'] = returns_df['Market'].apply(lambda market: market.replace('United States', 'US'))
In essence, this creates a small function on the fly and applies it to every value in the column, all in a single line of code. After running the line above, every occurrence of the ‘United States’ market has been renamed to ‘US’.
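If the lambda feels like a lot, an alternative is the replace() method on a Pandas Series with a dictionary of old and new labels, which swaps exact matches only. The sketch below uses invented toy data (orders_demo and returns_demo are hypothetical) and then counts returns per market after the fix.

```python
import pandas as pd

# Hypothetical toy data with the same labeling mismatch as in the article.
orders_demo = pd.DataFrame({
    'Order ID': ['A-1', 'A-2'],
    'Market': ['US', 'EU'],
    'Sales': [100.0, 250.0],
})
returns_demo = pd.DataFrame({
    'Returned': ['Yes', 'Yes'],
    'Order ID': ['A-1', 'A-2'],
    'Market': ['United States', 'EU'],
})

# Swap whole values that exactly equal 'United States'; other labels are untouched.
returns_demo['Market'] = returns_demo['Market'].replace({'United States': 'US'})

# Re-run the inner join and count returns per market.
fixed = orders_demo.merge(returns_demo, on=['Order ID', 'Market'], how='inner')
returns_per_market = fixed.groupby('Market')['Returned'].count()

print(returns_per_market)
```

With the labels aligned, the US market finally shows up in the per-market counts.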
Now, if we run that inner join between store_df and returns_df, the results will look a bit different:

And if we check how many returns there are per market, we get this:

Step 6: a little challenge for the data addicts out there
Now that we know how to join our orders and our returns, can you figure out how to stitch together a table like the one shown below?

Looks like the ‘Tables’ sub-category is crying out for attention again! We can’t escape it.
Applying what we’ve learned so far in this series, can you recreate this table on your own? If you’re a real over-achiever, go ahead and build it in Tableau as well and compare the process. How do you handle the market mismatch in Tableau vs Python? There are multiple ways to crack the case — go try it out!
Wrapping it up
Joining data is an absolutely crucial skill if you’re working with data at scale. Understanding what you’re joining is as important as knowing the technical details of how to execute the joins. Keep that in mind!
If you send your data on a blind date to be joined with another table, be aware of the risks. Scrub your data sets clean before sending them on dates with other data. Dirty data tends to leave a mess, and you’ll be the one troubleshooting it.
Hope to see you next time as we dive into crafting reusable code using functions!
Translated from: https://towardsdatascience.com/a-gentle-introduction-to-python-for-tableau-developers-part-4-a6fd6b2f46b1