Basics of Power BI Transformations with Python

I’ve been having a great time playing around with Power BI. One of the most impressive things about the tool is the array of possibilities it offers for transforming your data.


You can perform your transformations directly in your SQL query, use Power Query, DAX, R, or Python, or just use the built-in buttons and drop-downs.


PBI gives us a lot of choices, but as much as you can load your entire database and figure your way out with DAX alone, knowing a little bit of SQL can make things much easier. Understanding the possibilities, where each of them excels, and where we feel comfortable is essential to mastering the tool.


In this article, I’ll go through the basics of using Python to transform your data for building visualizations in Power BI.


Exploration

For the following example, I’ll use Jupyter Lab for exploring the dataset and designing the transformations.


The dataset I’ll use is the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.


import pandas as pd

git = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
dataset = pd.read_csv(git)
dataset
(Image: Data frame)

OK, so we loaded the dataset into a Pandas data frame, the same format we’ll receive when performing the transformation in PBI.

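As context for the later steps: in Power BI’s ‘Run Python script’ step, the selected table is handed to you as a pandas DataFrame named `dataset`, and any DataFrame left in scope afterwards can be picked as a new table. A minimal sketch with stand-in data (hypothetical sample rows, not the real repository):

```python
import pandas as pd

# Stand-in for the DataFrame that Power BI injects as `dataset`
# inside a 'Run Python script' step (hypothetical sample rows).
dataset = pd.DataFrame({'Country/Region': ['France', 'China'],
                        'Lat': [46.2, 35.9]})

# Any DataFrame left in scope becomes selectable as a new table in PBI.
location = dataset[['Country/Region', 'Lat']]
print(location.shape)  # (2, 2)
```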

The first thing that caught my attention in this dataset was its arrangement. The dates are spread across the columns, and that’s not a very friendly format for building visualizations in PBI.


Another noticeable thing is the number of NaNs in the Province/State column. Let’s get a better look at the missing values with MissingNo.


import missingno as msno

msno.matrix(dataset)
(Image: Missing values matrix)

Alright, our dataset is mostly complete, but the Province/State column does have lots of missing values.

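As a quick numeric complement to the MissingNo matrix, `isna().mean()` reports the share of missing values per column. A minimal sketch on made-up sample rows, not the real repository:

```python
import pandas as pd

# Hypothetical mini frame mimicking the CSV layout.
dataset = pd.DataFrame({
    'Province/State': [None, 'Hubei', None, None],
    'Country/Region': ['Afghanistan', 'China', 'Albania', 'Algeria'],
    '1/22/20': [0, 444, 0, 0],
})

# Fraction of missing values per column.
missing_share = dataset.isna().mean()
print(missing_share['Province/State'])  # 0.75
```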

While exploring, we can also check for typos and mismatching fields. There are lots of methods for doing so. I’ll use Difflib for illustrating.


from difflib import SequenceMatcher

# empty lists for assembling the data frame
diff_labels = []
diff_vals = []

# for every country name check every other country name
for i in dataset['Country/Region'].unique():
    for j in dataset['Country/Region'].unique():
        if i != j:
            diff_labels.append(i + ' - ' + j)
            diff_vals.append(SequenceMatcher(None, i, j).ratio())

# assemble the data frame
diff_df = pd.DataFrame(diff_labels)
diff_df.columns = ['labels']
diff_df['vals'] = diff_vals

# sort values by similarity ratio
diff_df.sort_values('vals', ascending=False)[:50]
(Image: Country names similarity)

From what I can see, most of them are just similar, so this field is already clean.


We could check Province/State the same way, but while I can spot typos in country names, I’m not familiar enough with provinces and states to do the same there.

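If the similarity check had surfaced a genuine mismatch, an explicit mapping is usually enough to standardize it. A hedged sketch; the ‘Viet Nam’/‘Vietnam’ pair is an invented example, not a finding from this dataset:

```python
import pandas as pd

# Invented sample rows with one inconsistent spelling.
df = pd.DataFrame({'Country/Region': ['Vietnam', 'Viet Nam', 'France']})

# Mapping assembled by eyeballing the Difflib output.
fixes = {'Viet Nam': 'Vietnam'}
df['Country/Region'] = df['Country/Region'].replace(fixes)

print(df['Country/Region'].unique().tolist())  # ['Vietnam', 'France']
```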

Goal

Whatever your exploratory analysis looks like, you’ll probably come up with a new design for the data you want to visualize.


Something that will make your life easier when building the charts. My idea here is to separate this dataset into three tables, like so:


One table will hold Location, with Province/State, Country/Region, Latitude, and Longitude.


One will hold the data for countries, with the date, the number of confirmed cases, and the number of new cases.


And the last one will hold the data for the provinces, also with the date, the number of confirmed cases, and the number of new cases.


Here’s what I’m looking for as the final result:


(Image: Tables and Relationships)

Are there better ways of arranging this dataset? Most definitely. But I think this is a good way of illustrating a goal for the dataset we want to build.


Python Scripts

Cool, we did a little exploration and came up with an idea of what we want to build. Now we can design the transformations.


Location is the easiest. We only need to select the columns we want.


cols = ['Province/State', 'Country/Region', 'Lat', 'Long']
location = dataset[cols]
location

To get this into Power BI, we’ll need a new data source, and since we’re bringing it in from a GitHub raw CSV, we can choose ‘Web.’


(Image: PBI Get Data)

Now we can add the URL for the CSV and click through until we have our new source.


https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv
(Image: PBI Get Data -> From Web)

After you finish loading your dataset, you can go to ‘Transform data’, select the table we just imported, and go to the ‘Transform’ tab.


First, we’ll promote the first row to Headers.


(Image: PBI -> Transform Data -> Use first row as Headers)

Then on the same tab, we can select ‘Run Python script’.


(Image: PBI -> Transform Data -> Run Python Script)

Here we’ll paste the script we just wrote in Jupyter and press OK. Then we can choose the location table we just made.


(Image: New tables)
(Image: Location table)

Excellent. Admittedly, this particular step is easier to do with PBI alone, but now we know how to use this transformation, and we can add some complexity.


Let’s make the Province Time-Series transformations in Jupyter.


Add Complexity

We’ll drop the columns we don’t need, set the new index, and stack the dates in a single column.


# drop lat and long
Time_Series_P = dataset.drop(['Lat', 'Long'], axis=1)

# set country and province as index
Time_Series_P.set_index(['Province/State', 'Country/Region'], inplace=True)

# stack date columns
Time_Series_P = Time_Series_P.stack()
Time_Series_P
(Image: Stacked data frame)

Next, we can convert that series back to a data frame, reset the index, and rename the columns.


Time_Series_P = Time_Series_P.to_frame(name='Confirmed')
Time_Series_P.reset_index(inplace=True)

col_names = ['Province/State', 'Country/Region', 'Date', 'Confirmed']
Time_Series_P.columns = col_names

Time_Series_P
(Image: Transformed data frame)

Cool, we already have the rows and columns figured out. But I still want to add a column with new cases.


For that, we’ll need to sort our values by province and date. Then we’ll go through each row, checking whether it has the same province name as the one before it. If it does, we calculate the difference between the two confirmed counts. If not, we use the amount in that row.

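As an aside, the same per-group difference can be computed without an explicit loop using `groupby` and `diff`. This is an alternative to the row-by-row approach, not the author’s code, sketched on made-up numbers:

```python
import pandas as pd

# Hypothetical mini version of Time_Series_P after the reshape.
ts = pd.DataFrame({
    'Province/State': ['A', 'A', 'A', 'B', 'B'],
    'Date': ['2020/01/01', '2020/01/02', '2020/01/03',
             '2020/01/01', '2020/01/02'],
    'Confirmed': [1, 3, 6, 2, 5],
})
ts.sort_values(['Province/State', 'Date'], inplace=True)

# diff() within each province; the first row of each group has no
# predecessor, so fill it with the confirmed count itself.
ts['new_cases'] = (
    ts.groupby('Province/State')['Confirmed']
      .diff()
      .fillna(ts['Confirmed'])
      .astype(int)
)
print(ts['new_cases'].tolist())  # [1, 2, 3, 2, 3]
```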

Time_Series_P['Date'] = pd.to_datetime(Time_Series_P['Date'])
Time_Series_P['Date'] = Time_Series_P['Date'].dt.strftime('%Y/%m/%d')

Time_Series_P.sort_values(['Province/State', 'Date'], inplace=True)

c = ''
new_cases = []
for index, value in Time_Series_P.iterrows():
    if c != value['Province/State']:
        c = value['Province/State']
        val = value['Confirmed']
        new_cases.append(val)
    else:
        new_cases.append(value['Confirmed'] - val)
        val = value['Confirmed']

Time_Series_P['new_cases'] = new_cases
(Image: The transformed data frame with the new cases column)

I guess that’s enough. We transformed the dataset, and we have it exactly how we wanted it. Now we can pack all this code into a single script and try it.


Time_Series_P = dataset.drop(['Lat', 'Long'], axis=1).set_index(['Province/State', 'Country/Region']).stack()
Time_Series_P = Time_Series_P.to_frame(name='Confirmed').reset_index()
Time_Series_P.columns = ['Province/State', 'Country/Region', 'Date', 'Confirmed']
Time_Series_P.dropna(inplace=True)

Time_Series_P['Date'] = pd.to_datetime(Time_Series_P['Date'])
Time_Series_P['Date'] = Time_Series_P['Date'].dt.strftime('%Y/%m/%d')

Time_Series_P.sort_values(['Province/State', 'Date'], inplace=True)

c = ''
new_cases = []
for index, value in Time_Series_P.iterrows():
    if c != value['Province/State']:
        c = value['Province/State']
        val = value['Confirmed']
        new_cases.append(val)
    else:
        new_cases.append(value['Confirmed'] - val)
        val = value['Confirmed']

Time_Series_P['new_cases'] = new_cases
Time_Series_P[155:170]
(Image: Final data frame)

We already know how to get this into PBI. Let’s duplicate our last source and change the Python script in it, like so:


(Image: PBI, run Python script)

I don’t know how to create relationships in PBI with composite keys, so for connecting Location to Time_Series_P, I’ve used DAX to build a calculated column concatenating province and country.


loc_id = CONCATENATE(Time_Series_Province[Province/State], Time_Series_Province[Country/Region])
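Alternatively, the same surrogate key could be built inside the Python script itself, so the relationship column ships with the table. A sketch, not the author’s code; the `fillna('')` for missing provinces is my assumption:

```python
import pandas as pd

# Hypothetical mini frame standing in for Time_Series_P.
ts = pd.DataFrame({
    'Province/State': [None, 'Hubei'],
    'Country/Region': ['France', 'China'],
})

# Concatenate province and country, mirroring the DAX CONCATENATE above;
# missing provinces become an empty string so the key is just the country.
ts['loc_id'] = ts['Province/State'].fillna('') + ts['Country/Region']
print(ts['loc_id'].tolist())  # ['France', 'HubeiChina']
```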

That’s it! You can also use similar logic to create the country table.


Time_Series_C = dataset.drop(['Lat', 'Long', 'Province/State'], axis=1).set_index(['Country/Region']).stack()
Time_Series_C = Time_Series_C.to_frame(name='Confirmed').reset_index()
Time_Series_C.columns = ['Country/Region', 'Date', 'Confirmed']
Time_Series_C = Time_Series_C.groupby(['Country/Region', 'Date']).sum().reset_index()

Time_Series_C['Date'] = pd.to_datetime(Time_Series_C['Date'])
Time_Series_C['Date'] = Time_Series_C['Date'].dt.strftime('%Y/%m/%d')

Time_Series_C.sort_values(['Country/Region', 'Date'], inplace=True)

c = ''
new_cases = []
for index, value in Time_Series_C.iterrows():
    if c != value['Country/Region']:
        c = value['Country/Region']
        val = value['Confirmed']
        new_cases.append(val)
    else:
        new_cases.append(value['Confirmed'] - val)
        val = value['Confirmed']

Time_Series_C['new_cases'] = new_cases
Time_Series_C

I guess that gives us a good idea of how to use Python transformations in PBI.


Conclusion

Having options and knowing how to use them is always a good thing; all of these transformations could have been done with PBI alone. For example, it’s much easier to turn all those date columns into rows by selecting them and clicking ‘Unpivot columns’ in the Transform tab.

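For reference, PBI’s ‘Unpivot columns’ corresponds to `melt` in pandas, which would replace the `set_index`/`stack` steps with a single call. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical sample mimicking the wide layout of the CSV.
wide = pd.DataFrame({
    'Country/Region': ['France', 'China'],
    '1/22/20': [0, 444],
    '1/23/20': [2, 500],
})

# melt() turns the date columns into rows, like 'Unpivot columns'.
long_df = wide.melt(id_vars=['Country/Region'],
                    var_name='Date', value_name='Confirmed')
print(long_df.shape)  # (4, 3): 2 countries x 2 dates
```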

But there may be times when you find yourself lost in the tool, or you need more control over an operation, and in many cases Python will have just the library you need to implement the solution you’re looking for.


All said and done, it’s time to design your visualization.


(Image: Example of a visualization built with the transformed data)

Thanks for reading my article. I hope you enjoyed it.


Translated from: https://medium.com/python-in-plain-english/basics-of-power-bi-transformations-with-python-c6df52cb21d7
