I’ve been having a great time playing around with Power BI. One of the most incredible things about the tool is the array of possibilities you have for transforming your data.
You can perform your transformations directly in your SQL query, use Power Query, DAX, R, or Python, or just use the built-in buttons and drop-down boxes.
PBI gives us a lot of choices, but as much as you can load your entire database and figure your way out with DAX alone, knowing a little bit of SQL can make things so much easier. Understanding the possibilities, where each of them excels, and where we feel comfortable is essential to mastering the tool.
In this article, I’ll go through the basics of using Python to transform your data for building visualizations in Power BI.
Exploration
For the following example, I’ll use Jupyter Lab for exploring the dataset and designing the transformations.
The dataset I’ll use is the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
import pandas as pd

git = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
dataset = pd.read_csv(git)
dataset

OK, so we loaded the dataset into a Pandas data frame, the same format we’ll receive when performing the transformation in PBI.
The first thing that caught my attention in this dataset was its arrangement. The dates are spread across the columns, and that’s not a very friendly format for building visualizations in PBI.
Another noticeable thing is the number of NaNs in the Province/State column. Let’s get a better look at the missing values with MissingNo.
import missingno as msno

msno.matrix(dataset)

Alright, our dataset is mostly complete, but the Province/State column does have lots of missing values.
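To put a number on it before deciding what to do with that column, here’s a quick pandas check (a minimal sketch, not part of the original walkthrough) showing the share of missing values per column:

# fraction of missing values in each column, highest first
dataset.isna().mean().sort_values(ascending=False)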
While exploring, we can also check for typos and mismatched fields. There are lots of methods for doing so; I’ll use difflib to illustrate.
from difflib import SequenceMatcher

# empty lists for assembling the data frame
diff_labels = []
diff_vals = []

# for every country name, check every other country name
for i in dataset['Country/Region'].unique():
    for j in dataset['Country/Region'].unique():
        if i != j:
            diff_labels.append(i + ' - ' + j)
            diff_vals.append(SequenceMatcher(None, i, j).ratio())

# assemble the data frame
diff_df = pd.DataFrame(diff_labels)
diff_df.columns = ['labels']
diff_df['vals'] = diff_vals

# sort values by similarity ratio
diff_df.sort_values('vals', ascending=False)[:50]

From what I can see, most of these pairs are just similar names rather than typos, so this field is already clean. As much as we could also check Province/State the same way, I can spot typos in the names of countries, but not in provinces or states.
Goal
Whatever your exploratory analysis looks like, you’ll probably come up with a new design for the data you want to visualize, something that’ll make your life easier when building the charts. My idea here is to separate this dataset into three tables, like so:
One table will hold Location, with Province/State, Country/Region, Latitude, and Longitude.
One will hold the data for countries, with the date, number of confirmed cases, and number of new cases.
And the last one will hold data for the provinces, also with the date, number of confirmed cases, and number of new cases.
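Sketched out as schemas, with the column names taken from the transformations below, that means roughly:

# location:       Province/State | Country/Region | Lat | Long
# Time_Series_P:  Province/State | Country/Region | Date | Confirmed | new_cases
# Time_Series_C:  Country/Region | Date | Confirmed | new_cases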
Here’s what I’m looking for as the final result:

Are there better ways of arranging this dataset? Most definitely, yes. But I think this is a good way of illustrating a goal for the dataset we want to achieve.
Python Scripts
Cool, we did a little exploration and came up with an idea of what we want to build. Now we can design the transformations.
Location is the easiest. We only need to select the columns we want.
cols = ['Province/State', 'Country/Region', 'Lat', 'Long']
location = dataset[cols]

location
To get this to Power BI, we’ll need a new data source, and since we’re bringing it from a GitHub raw CSV, we can choose ‘Web’.

Now we can add the URL for the CSV and click through until we have our new source.
https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv

After you finish loading your dataset, you can go to ‘Transform data’, select the table we just imported, and go to the ‘Transform’ tab.
First, we’ll promote the first row to Headers.

Then on the same tab, we can select ‘Run Python script’.

Here we’ll use the script we just wrote in Jupyter and press OK. Then we can choose the location table we just made.
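For reference, here’s roughly what goes into the script editor at this point. Power BI exposes the selected table to the script as a pandas DataFrame named dataset, and any data frames left in scope afterwards are offered back as output tables to choose from:

# 'dataset' is provided by Power BI and holds the promoted table
cols = ['Province/State', 'Country/Region', 'Lat', 'Long']
location = dataset[cols]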


Excellent. It’s arguably way more comfortable to do that with PBI alone, but now we know how to use this transformation, and we can add some complexity.
Let’s make the Province Time-Series transformations in Jupyter.
Add Complexity
We’ll drop the columns we don’t need, set the new index, and stack the dates in a single column.
# drop lat and long
Time_Series_P = dataset.drop(['Lat', 'Long'], axis=1)

# set country and province as index
Time_Series_P.set_index(['Province/State', 'Country/Region'], inplace=True)

# stack date columns
Time_Series_P = Time_Series_P.stack()
Time_Series_P

Next, we can convert that series back to a data frame, reset the index, and rename the columns.
Time_Series_P = Time_Series_P.to_frame(name='Confirmed')
Time_Series_P.reset_index(inplace=True)

col_names = ['Province/State', 'Country/Region', 'Date', 'Confirmed']
Time_Series_P.columns = col_names

Time_Series_P

Cool, we already have the rows and columns figured out. But I still want to add a column with ‘new cases’.
For that, we’ll need to sort our values by province and date. Then we’ll go through each row, checking whether it has the same province name as the one before it. If it does, we calculate the difference between those confirmed values. If not, we use the amount in that row.
Time_Series_P['Date'] = pd.to_datetime(Time_Series_P['Date'])
Time_Series_P['Date'] = Time_Series_P['Date'].dt.strftime('%Y/%m/%d')

Time_Series_P.sort_values(['Province/State', 'Date'], inplace=True)

c = ''
new_cases = []
for index, value in Time_Series_P.iterrows():
    if c != value['Province/State']:
        c = value['Province/State']
        val = value['Confirmed']
        new_cases.append(val)
    else:
        new_cases.append(value['Confirmed'] - val)
        val = value['Confirmed']

Time_Series_P['new_cases'] = new_cases

I guess that’s enough. We transformed the dataset, and we have it exactly how we wanted. Now we can pack all this code into a single script and try it.
Time_Series_P = dataset.drop(['Lat', 'Long'], axis=1).set_index(['Province/State', 'Country/Region']).stack()
Time_Series_P = Time_Series_P.to_frame(name='Confirmed').reset_index()
Time_Series_P.columns = ['Province/State', 'Country/Region', 'Date', 'Confirmed']
Time_Series_P.dropna(inplace=True)

Time_Series_P['Date'] = pd.to_datetime(Time_Series_P['Date'])
Time_Series_P['Date'] = Time_Series_P['Date'].dt.strftime('%Y/%m/%d')

Time_Series_P.sort_values(['Province/State', 'Date'], inplace=True)

c = ''
new_cases = []
for index, value in Time_Series_P.iterrows():
    if c != value['Province/State']:
        c = value['Province/State']
        val = value['Confirmed']
        new_cases.append(val)
    else:
        new_cases.append(value['Confirmed'] - val)
        val = value['Confirmed']

Time_Series_P['new_cases'] = new_cases
Time_Series_P[155:170]
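As an aside, the same ‘new cases’ logic can be written without the explicit loop. Here’s a minimal vectorized sketch using groupby().diff(); the fillna makes the first entry of each province fall back to its confirmed count, matching the loop above:

# per-province difference; the first row of each group has no predecessor
Time_Series_P['new_cases'] = (
    Time_Series_P.groupby('Province/State')['Confirmed']
    .diff()
    .fillna(Time_Series_P['Confirmed'])
)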

We already know how to get this to PBI. Let’s duplicate our last source and change the Python script in it, like so:

I don’t know how to create relationships in PBI with composite keys, so for connecting Location to Time_Series_P, I’ve used DAX to build a calculated column concatenating province and country.
loc_id = CONCATENATE(Time_Series_Province[Province/State], Time_Series_Province[Country/Region])
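Presumably the Location table needs the matching key on its side as well. Assuming that table came into the model named location, the corresponding calculated column would be a mirror of the line above (a sketch, not shown in the original):

loc_id = CONCATENATE(location[Province/State], location[Country/Region])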
That’s it! You can also use similar logic to create the country table.
Time_Series_C = dataset.drop(['Lat', 'Long', 'Province/State'], axis=1).set_index(['Country/Region']).stack()
Time_Series_C = Time_Series_C.to_frame(name='Confirmed').reset_index()
Time_Series_C.columns = ['Country/Region', 'Date', 'Confirmed']
Time_Series_C = Time_Series_C.groupby(['Country/Region', 'Date']).sum().reset_index()

Time_Series_C['Date'] = pd.to_datetime(Time_Series_C['Date'])
Time_Series_C['Date'] = Time_Series_C['Date'].dt.strftime('%Y/%m/%d')

Time_Series_C.sort_values(['Country/Region', 'Date'], inplace=True)

c = ''
new_cases = []
for index, value in Time_Series_C.iterrows():
    if c != value['Country/Region']:
        c = value['Country/Region']
        val = value['Confirmed']
        new_cases.append(val)
    else:
        new_cases.append(value['Confirmed'] - val)
        val = value['Confirmed']

Time_Series_C['new_cases'] = new_cases
Time_Series_C
I guess that gives us an excellent idea of how to use Python transformations in PBI.
Conclusion
Having options and knowing how to use them is always a good thing; all of those transformations could have been done with PBI alone. For example, it’s way easier to turn all those date columns into rows by selecting them and clicking ‘Unpivot columns’ in the Transform tab.
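That said, even within Python the reshaping step has a shorter route. Here’s a minimal sketch of the same unpivot with pandas melt instead of the set_index/stack combination used above:

# unpivot the date columns into rows; id_vars are the fixed columns
long_df = dataset.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    var_name='Date',
    value_name='Confirmed',
)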
But there may be times when you find yourself lost in the tool, or you need more control over the operation, and there are many cases where Python has just the library to implement the solution you’re seeking.
All said and done, it’s time to design your visualization.

Thanks for reading my article. I hope you enjoyed it.
Originally published at: https://medium.com/python-in-plain-english/basics-of-power-bi-transformations-with-python-c6df52cb21d7