web數據交互
Most good data projects start with the analyst doing something to get a feel for the data that they are dealing with.
大多數好的數據項目都是從分析師開始做一些事情,以便對他們正在處理的數據有所了解。
They might hack together a Jupyter notebook to look at data summaries, first few rows of data and matplotlib charts. Some might look through the data as an Excel sheet and fidget with pivot tables. The ones truly one with the data might even prefer to stare directly at the raw table of data.
他們可能會聚在一起使用Jupyter筆記本來查看數據摘要,數據的前幾行和matplotlib圖表。 有些人可能會將數據視為Excel工作表和帶有數據透視表的小工具。 真正擁有數據的人甚至可能更喜歡直接盯著原始數據表。
None of these are ideal solutions. Some of these solutions might be only suitable for the masochistic among us. So what is a person to do?
這些都不是理想的解決方案。 其中一些解決方案可能僅適用于我們中間的受虐狂。 那么一個人該怎么辦?
For me, I prefer to build a web app for data exploration.
對我來說,我更喜歡構建一個用于數據探索的Web應用程序。
There’s something about the ability to slice, group, filter, and most importantly — see the data, that helps me to understand it and help me to formulate questions and hypotheses that I want answered in the .
關于切片,分組,過濾的功能,最重要的是- 查看數據,這有助于我理解數據,并幫助我提出要在回答的問題和假設。
It allows me to interact with the data visually.
它使我可以直觀地與數據進行交互 。
My preferred toolkit of choice for this task these days is Plotly and Streamlit. I’ve written enough about Plotly over the last while — I think it’s the best data visualisation package out there for Python. But Streamlit has really changed the way I work. Because it is so terse, it takes almost no extra effort to turn my plots and comments in a python script into a web app with interactivity as I tinker. (FYI — I wrote a comparison between Dash and Streamlit here)
這些天,我首選的首選工具包是Plotly和Streamlit 。 上一陣子我已經寫了足夠多的有關Plotly的文章-我認為這是Python最好的數據可視化軟件包。 但是Streamlit確實改變了我的工作方式。 因為它是如此的簡潔,所以我幾乎不需要花費額外的精力就可以將python腳本中的繪圖和注釋轉換為具有交互性的Web應用程序。 (僅供參考-我在這里寫了Dash和Streamlit之間的比較 )
I prefer to build a web app for data exploration
我更喜歡構建用于數據探索的Web應用程序
So in this article, I’d like to share with a simple example building a data exploration app with these tools.
因此,在本文中,我想與一個使用這些工具構建數據探索應用程序的簡單示例分享。
Now, for a data project – we need data, and here I will be using stats from the NBA. Learning programming can be dry, so using something relatable like sports data helps me to stay engaged; and hopefully it will for you too.
現在,對于數據項目,我們需要數據,在這里,我將使用NBA的統計數據。 學習編程可能很枯燥,因此使用諸如體育數據之類的相關內容有助于我保持專注。 希望它也對您有用。
(Don’t worry if you don’t follow the NBA as the focus is on the data science and programming!)
(如果您不關注NBA,請不要擔心,因為重點是數據科學和編程!)
在開始之前 (Before we get started)
To follow along, install a few packages — plotly
, streamlit
and pandas
. Install each (in your virtual environment) with a simple pip install [PACKAGE_NAME]
.
plotly
,請安裝一些軟件包plotly
, streamlit
和pandas
。 通過簡單的pip install [PACKAGE_NAME]
安裝每個組件(在您的虛擬環境中)。
The code for this article is on my GitHub repo here, so you can download/copy/fork away to your heart’s content.
本文的代碼位于我的GitHub存儲庫中 ,因此您可以下載/復制/分叉到您的內心。
The script is called data_explorer_app.py
— so you can run it from the shell with:
該腳本名為data_explorer_app.py
,因此您可以使用以下命令從Shell運行該腳本:
streamlit run data_explorer_app.py
Oh, this is the first in a set of data science / data analysis articles that I plan to write about using NBA data. It’ll all go to that repo, so keep your eyes peeled!
哦,這是我計劃就使用NBA數據撰寫的一組數據科學/數據分析文章中的第一篇。 一切都會去那個倉庫,所以要睜大眼睛!
If you are following along, import the key libraries with:
如果您遵循以下步驟,請使用以下命令導入密鑰庫:
import pandas as pd
import plotly.express as px
import streamlit as st
And we are ready to go.
我們已經準備好出發了。
數據深度潛水 (Data Deep Diving)
流式照明 (Streamlit-ing)
We use Streamlit here, as it is designed to help us build data apps quickly. So what we are going to build is a Streamlit app that will then run locally. (For more information — you can check out my Dash v Streamlit article here.)
我們在這里使用Streamlit,因為它旨在幫助我們快速構建數據應用程序。 因此,我們要構建的是Streamlit應用程序,該應用程序然后將在本地運行。 (有關更多信息,您可以在此處查看我的Dash v Streamlit文章 。)
If you’ve never used Streamlit, this is all you need to build bare-bones app:
如果您從未使用過Streamlit,這就是構建準系統應用程序所需要的:
import streamlit as st
st.write("Hello, world!")
Save this as app.py
, and then execute it with a shell command streamlit run app.py
:
將其另存為app.py
,然后使用shell命令streamlit run app.py
執行它:

And you have a functioning web app! Building a streamlit app is that easy. Even more amazingly, though, building a useful app isn’t much harder.
而且您有一個運行良好的Web應用程序! 構建流式應用很容易。 但是,更令人驚訝的是,構建有用的應用程序并不難。
Oh, by the way, you don’t need to stop and restart the server every time the script is changed. Whenever the underlying script file is updated, you will see a button pop-up on the top right corner like so:
哦,順便說一句,您不必在每次更改腳本時都停止并重新啟動服務器。 每當基礎腳本文件更新時,您都會在右上角看到一個按鈕彈出,如下所示:

Just keep the script running, and hit Rerun here every time you want to see the latest version at work.
只需保持腳本運行,然后在每次要查看最新版本時都單擊“重新運行”即可。
Ready? Okay, let’s go!
準備? 好吧,走吧!
原始數據探索 (Raw data exploration)
What I like to do initially it to look at the entire raw dataset. As a first step, we load the data from a CSV file:
我最初想要做的是查看整個原始數據集。 第一步,我們從CSV文件加載數據:
df = pd.read_csv("data/player_per_game.csv", index_col=0).reset_index(drop=True)
Once the data has been loaded, simply typing st.write(df)
creates a dynamic, interactive table of the entire dataframe.
加載數據后,只需鍵入st.write(df)
創建整個數據幀的動態交互式表。

And the various statistics for columns can be similarly plotted with st.write(df.describe())
.
并且可以使用st.write(df.describe())
類似地繪制列的各種統計信息。

I know you can plot a table in Jupyter notebooks, but the difference is in the interactivity. For one, tables rendered with Streamlit are sortable by columns. And as you will see later, you can incorporate filters and other dynamic elements that aren’t as easy to incorporate in notebooks — which is where the real power comes in.
我知道您可以在Jupyter筆記本中繪制表格,但區別在于交互性。 首先,使用Streamlit渲染的表可以按列排序。 就像您稍后將看到的那樣,您可以合并過濾器和其他動態元素,而這些元素和合并到筆記本中并不那么容易-這才是真正的動力所在。
Now we are ready to start adding a few charts to our app.
現在,我們準備開始向我們的應用程序添加一些圖表。
分布可視化 (Distribution visualisations)
Statistical visualisation of individual variables are extremely useful, to an extent that I think it’s an indispensable tool above and beyond looking at the raw data.
單個變量的統計可視化非常有用,在某種程度上,我認為它是查看原始數據之外不可或缺的工具。
We will begin the analysis by visualising the data by one variable, with an interactive histogram. A histogram can be constructed with Plotly like so:
我們將通過交互式變量直方圖通過一個變量將數據可視化來開始分析。 可以使用Plotly構造直方圖,如下所示:
hist_fig = px.histogram(df, x=hist_x, nbins=hist_bins)
Traditionally, we would have to manually adjust the x
and nbins
variables to see what happens, or create a huge wall of histograms from various permutations of these variables. Instead, let’s see how they can be taken in as inputs to interactively investigate the data.
傳統上,我們將不得不手動調整x
和nbins
變量以查看會發生什么,或者從這些變量的各種排列中創建巨大的直方圖墻。 相反,讓我們看看如何將它們作為交互研究數據的輸入。
The histogram will analyse data from one column of the pandas dataframe. Let’s render it as a drop-down box by calling the st.selectbox()
module. We can just grab a list of the columns as df.columns
, and additionally we provide a default choice, which we get the column number of using df.columns.get_loc()
method. Putting it together, we get:
直方圖將分析來自熊貓數據框一列的數據。 通過調用st.selectbox()
模塊,將其呈現為下拉框。 我們可以僅以df.columns
獲取列的列表,此外,我們還提供了一個默認選擇,即使用df.columns.get_loc()
方法獲得列號。 放在一起,我們得到:
hist_x = st.selectbox("Histogram variable", options=df.columns, index=df.columns.get_loc("mp_per_g"))
Then, a slider can be called with the st.slider()
module for the user to select the number of bins in the histogram. The module can be customised a minimum/maximum/default and increment parameters as you see below.
然后,可以使用st.slider()
模塊調用滑塊,以供用戶選擇直方圖中的bin數量。 您可以自定義模塊的最小/最大/默認和增量參數,如下所示。
hist_bins = st.slider(label="Histogram bins", min_value=5, max_value=50, value=25, step=1)
These parameters can then be combined to produce the figure:
然后可以將這些參數組合以產生圖形:
hist_fig = px.histogram(df, x=hist_x, nbins=hist_bins, title="Histogram of " + hist_x,
template="plotly_white")
st.write(hist_fig)
Putting it together with a little heading st.header(“Histogram”)
, we get:
將其與一個小標題st.header(“Histogram”)
放在一起,我們得到:

I recommend taking a second here to explore the data. For example, take a look at different stats like rebounds per game:
我建議在這里花點時間瀏覽數據。 例如,看一下不同的統計數據,例如每場比賽的籃板數:

Or positions:
或職位:

The interactivity makes for easier, dynamic, active exploration of the data.
交互性使對數據的更輕松,動態, 主動的 探索成為可能。
You might have noticed in this last graph that the histogram categories are not in any sort of sensible order. This is due to the fact that this is a categorical variable. So without a provided order, Plotly is (I think) plotting these categories based on the order that it starts to encounter each category for the first time. So, let’s make one last change to would fix that.
您可能已經在最后一張圖中注意到,直方圖類別沒有任何合理的順序。 這是由于這是一個類別變量。 因此,如果沒有提供的順序,Plotly(我認為)將根據第一次遇到每個類別的順序來繪制這些類別。 因此,讓我們做最后一個更改來解決該問題。
Since Plotly allows for a category_orders
parameter, we could pass a sorted order of positions. But then it wouldn’t be relevant for any of the other parameters. Instead, what we can do is to isolate the column based on the chosen input value, and pass them on by sorting them alphabetically like so:
由于Plotly允許有category_orders
參數,因此我們可以傳遞排名的排序順序。 但這與其他任何參數都不相關。 相反,我們可以做的是根據所選的輸入值隔離列,然后按字母順序對它們進行傳遞,如下所示:
df[hist_x].sort_values().unique()
All together, we get:
總之,我們得到:
hist_cats = df[hist_x].sort_values().values
hist_fig = px.histogram(df, x=hist_x, nbins=hist_bins, title="Histogram of " + hist_x,
template="plotly_white", category_orders={hist_x: hist_cats})

This way, any categorical (or ordinal) variables would be presented in order
這樣,任何分類(或有序)變量將按順序顯示
Now we can go another step and categorise our data with boxplots. Boxplots do a similar job as histograms in that they show distributions, but they are really best at showing how those distributions changed according to another variable.
現在,我們可以再進行一步,用箱線圖對數據進行分類。 箱線圖的功能與直方圖類似,因為它們可以顯示分布,但實際上最能顯示出這些分布如何根據另一個變量而變化。
So, the boxplot portion of our app is going to include two pulldown menus like below.
因此,我們應用程序的箱線圖部分將包括兩個下拉菜單,如下所示。
box_x = st.selectbox("Boxplot variable", options=df.columns, index=df.columns.get_loc("pts_per_g"))
box_cat = st.selectbox("Categorical variable", ["pos_simple", "age", "season"], 0)
And it’s just a matter of passing those two inputs to Plotly to build a figure:
只需將這兩個輸入傳遞給Plotly即可構建圖形:
box_fig = px.box(df, x=box_cat, y=box_x, title="Box plot of " + box_x, template="plotly_white", category_orders={"pos_simple": ["PG", "SG", "SF", "PF", "C"]})
st.write(box_fig)
Then… voila! You have an interactive box plot!
然后……瞧! 您有一個交互式箱形圖!

You will notice here that I manually passed an order for my simplified positions column. The reason is that this order is a relatively arbitrary, basketball-specific order (from PG to C), not an alphabetical order. As much as I would like everything to be parametric, sometimes you do have to resort to manual specifications!
您會在這里注意到,我為簡化倉位欄手動傳遞了一個訂單。 原因是該順序是相對任意的,籃球特定的順序(從PG到C),而不是字母順序。 盡管我希望所有參數都是參數化的,但有時您還是不得不求助于手動規格!
相關性和過濾器 (Correlations & filters)
Another big thing to do in data visualisation, or exploratory data analysis is to understand correlations.
數據可視化或探索性數據分析中的另一大工作是了解相關性。
It can be for example handy for some manual feature engineering in data science, and it might actually point you towards an investigative direction that you may not have considered.
例如,對于數據科學中的某些手動要素工程而言,它可能非常方便,并且實際上可能將您引向您可能沒有考慮的調查方向。
Let’s just stick to three dimensions in our scatter plot for now.
現在,讓我們在散點圖中堅持三個維度。
No, not in x, y and z directions. I am not a monster. I’ve got an example of one below — can you make sense of what’s going on?
不,不在x,y和z方向上。 我不是怪物。 我在下面有一個例子-您能理解發生了什么嗎?

Not for me, thanks.
不適合我,謝謝。
Colour will be the third dimension here to represent data. I’ve left all columns available for the the first two columns, and just a limited selection for colours — but you can really do whatever you want.
顏色將是表示數據的第三維。 我在前兩列中保留了所有列,只對顏色進行了有限的選擇-但是您確實可以做任何您想做的事情。
corr_x = st.selectbox("Correlation - X variable", options=df.columns, index=df.columns.get_loc("fg3a_per_g"))
corr_y = st.selectbox("Correlation - Y variable", options=df.columns, index=df.columns.get_loc("efg_pct"))
corr_col = st.radio("Correlation - color variable", options=["age", "season", "pos_simple"], index=1)

And the chart can be constructed as follows:
該圖表可以構造如下:
fig = px.scatter(df, x=corr_x, y=corr_y, template="plotly_white", color=corr_col, hover_data=['name', 'pos', 'age', 'season'], color_continuous_scale=px.colors.sequential.OrRd)

But this chart is not ideal. For one because the data is dominated by outliers. See the lonely dots on the top left? Those folks with effective FG% of 1.5 are not some gods of basketball, but it’s a side effect of extremely small sample sizes.
但是此圖表并不理想。 原因之一是數據受異常值支配。 看到左上方的孤獨點嗎? 那些FG有效值為1.5的人不是籃球神靈,但這是極小的樣本量的副作用。
So what can we do? Let’s put a filter into the data.
所以,我們能做些什么? 讓我們將過濾器放入數據中。
I’m going to put in two interactive portions here, one to choose the filter parameter, and the other to put the value in. As I don’t know what the parameter is here, I will simply take an empty text box that will take numbers as inputs.
我將在此處放置兩個交互式部分,一個用于選擇過濾器參數,另一個用于將值放入。由于我不知道這里的參數,我將簡單地使用一個空文本框以數字作為輸入。
corr_filt = st.selectbox("Filter variable", options=df.columns, index=df.columns.get_loc("fg3a_per_g"))
min_filt = st.number_input("Minimum value", value=6, min_value=0)
Using these values, I can filter the dataframe like so:
使用這些值,我可以像這樣過濾數據框:
tmp_df = df[df[corr_filt] > min_filt]
And then pass the temporary dataframe tmp_df
into the figure instead of the original dataframe, we get:
然后將臨時數據幀tmp_df
到圖中而不是原始數據幀中,我們得到:

This chart could be used to take a look at correlations between various stats. For example, to see that great 3 pt shooters are also typically great free throw shooters:
該圖表可用于查看各種統計數據之間的相關性。 例如,要看到出色的3分射手通常也是出色的罰球射手:

Or that great rebounders tend to be shot blockers as well. It’s also interesting that the game has changed so that no modern players average many blocks per game.
或者說,出色的籃板手也會成為蓋帽手。 有趣的是,游戲發生了變化,因此沒有現代玩家可以平均每場游戲獲得很多積木。

Plotting rebounds and assists, they show something of an inverse correlation, and are quite nicely stratified according to position here.
繪制籃板和助攻,它們顯示出反比關系,并且根據此處的位置進行了很好的分層。

Already we can see quite a lot of trends and correlations from our app. Lastly, let’s create some heatmaps to view general correlations between sets of columns of data.
我們已經可以從我們的應用程序中看到很多趨勢和相關性。 最后,讓我們創建一些熱圖以查看數據列集之間的一般相關性。
熱圖的廣義相關 (Generalised correlations with heatmaps)
Scatter plots are useful for seeing individual data points, but sometimes it’s good to just visualise datasets such that we can immediately see which columns might be well correlated, not correlated, or inversely correlated.
散點圖對于查看單個數據點很有用,但是有時最好只是可視化數據集,這樣我們就可以立即查看哪些列可能具有良好的相關性,不相關性或反相關性。
Heatmaps are perfect for this job, by setting it up to visualise what are called correlation matrices.
通過將熱圖設置為可視化所謂的相關矩陣,熱圖非常適合此工作。
Since a heatmap is best at visualising correlations between sets of input categories, let’s use an input that will take multiple categories. As a result, st.multiselect()
is the module of choice here, and df.corr()
is all we need to create the correlation matrix.
由于熱圖最適合可視化輸入類別集之間的相關性,因此讓我們使用將采用多個類別的輸入。 結果, st.multiselect()
是這里選擇的模塊,而df.corr()
是創建相關矩陣所需的全部。
The combined code is:
組合的代碼為:
hmap_params = st.multiselect("Select parameters to include on heatmap", options=list(df.columns), default=[p for p in df.columns if "fg" in p])
hmap_fig = px.imshow(df[hmap_params].corr())
st.write(hmap_fig)
And we get:
我們得到:

It’s so clear which of these columns are positively correlated or not correlated. and I also suggest playing with different colour scales / swatches for extra fun!
很明顯,這些列中的哪一列是正相關的或不相關的。 并且我還建議您使用不同的色標/色板來獲得更多的樂趣!
That’s it for today — I hope that was interesting. For my money, it’s hard to beat interactive apps like this for exploration, and the power of Plotly and Streamlit make it so easy to build these customised apps for my purpose.
今天就這樣-我希望這很有趣。 為了我的錢,很難擊敗像這樣的交互式應用程序進行探索,而Plotly和Streamlit的強大功能使為我的目的構建這些定制的應用程序變得如此容易。
And keep in mind that what I have suggested here are just basic suggestions, and what I am sure that you could build something far more useful for your own purpose and to your preference. I look forward to seeing them all!
并且請記住,我在這里提出的只是基本建議,并且我確信您可以針對自己的目的和自己的喜好構建一些更有用的東西。 我期待看到他們全部!
翻譯自: https://towardsdatascience.com/explore-any-data-with-a-custom-interactive-web-app-data-science-with-sports-410644ac742
web數據交互
本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。 如若轉載,請注明出處:http://www.pswp.cn/news/389309.shtml 繁體地址,請注明出處:http://hk.pswp.cn/news/389309.shtml 英文地址,請注明出處:http://en.pswp.cn/news/389309.shtml
如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!