Using GeoPandas for Spatial Visualization
Recently, I was working on a project where I was trying to build a model that could predict housing prices in King County, Washington — the area that surrounds Seattle. After looking at the features, I wanted a way to determine the houses’ worth based on location.
The dataset included latitude and longitude, and it was easy to google them to take a look at the houses, their neighborhoods, their distance from the water, etc. But with over 17,000 observations, that was a fool's errand. I had to find an easier way.
I had used Geographic Information Systems (GIS) only once before, and not in Python. So I did what I do best: I googled, and ran into this amazing package called GeoPandas. I am going to let the GeoPandas team sum up what they do because they can say it much better than I can.
GeoPandas is an open source project to make working with geospatial data in python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. GeoPandas further depends on fiona for file access and descartes and matplotlib for plotting. — Description from GeoPandas Website (2020)
This blew my mind, and what I wanted was really just the most basic of the features. I am going to show you how to run this code and do what I did — plotting accurate points on a map.
You are going to need several packages and some files in addition to the basic pandas and matplotlib. They include:
- geopandas: the package that makes all of this possible.
- shapely: a package for manipulation and analysis of planar geometric objects.
- descartes: provides a nicer integration of Shapely geometry objects with Matplotlib. It's not needed every time, but I import it just to be safe.
- Any .shp file: this is going to be the backdrop of the plot. Mine is going to have King County, but you should be able to find one from any city's data department. Don't delete any files from the .zip file it comes in. Something always breaks.
More information about shapefiles can be found here, but the long and short of it is that these aren’t normal images. They are a vector data storage format that has information linking to locations — coordinates and the rest.
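Because a shapefile is really a small family of files that share one base name, it can help to check that the companions are still sitting next to the .shp before trying to load it. The sketch below uses only the standard library; the function name `missing_sidecars` and the file name `districts.shp` are made up for illustration.

```python
from pathlib import Path

# The shapefile format treats these three as mandatory:
# .shp holds the geometry, .shx the index, .dbf the attribute table.
# A .prj file (the coordinate system) is optional but usually shipped too.
REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf")


def missing_sidecars(shp_path):
    """Return the names of mandatory companion files absent next to shp_path."""
    base = Path(shp_path)
    return [
        base.with_suffix(suffix).name
        for suffix in REQUIRED_SUFFIXES
        if not base.with_suffix(suffix).exists()
    ]


# Hypothetical file name; an empty list means nothing is missing.
print(missing_sidecars("districts.shp"))
```

If the list is not empty, re-extracting the original .zip is usually quicker than hunting for the missing piece.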
First I imported the basic packages that I needed and then the new packages:
import pandas as pd  # needed for pd.read_csv below
import matplotlib.pyplot as plt
import numpy as np
from shapely.geometry import Point, Polygon
import geopandas as gpd
import descartes
The Point and Polygon features are what help me match my data to the map I make.
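To give a feel for what matching a point to a polygon involves, here is a dependency-free sketch of the classic ray-casting test. This is only an illustration of the idea, not how shapely implements it, and the rectangular "district" and the function name are invented for the example.

```python
def point_in_polygon(x, y, vertices):
    """Ray casting: cast a ray to the right of (x, y) and count edge crossings.

    An odd number of crossings means the point is inside the polygon.
    """
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular district, vertices in (longitude, latitude) order
district = [(-122.4, 47.5), (-122.2, 47.5), (-122.2, 47.7), (-122.4, 47.7)]
print(point_in_polygon(-122.3, 47.6, district))  # True: inside the rectangle
print(point_in_polygon(-121.9, 47.6, district))  # False: east of it
```

With shapely, the same question is normally asked through the geometry objects themselves, for example `Polygon(district).contains(Point(-122.3, 47.6))`.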
Next, I load in my data. This is basic pandas, but for those who are new to it, the string in quotation marks is the name of the file that holds the housing records.
df = pd.read_csv('kc_house_data_train.csv')
With all of the packages imported and the data ready to go, I wanted to take a look at the map I was going to be plotting. I did this by finding a shape file made by the King County government website. They have done all the hard work of surveying and cataloging the land — it would be rude to not use their freely offered services. Loading in the shape file is easy and comparable to loading in a csv file with pandas.
kings_county = gpd.read_file('*file_path_here*/School_Districts_in_King_County___schdst_area.shp')
You can open this up if you want to take a look at the data. The King County shape file was just a dataframe of locations matched with their school districts, geometry coordinates, and area. But the best part is when we plot it and yes, we have to plot it. This isn’t an image you can just call — it will have the coordinates built in so our data can be placed down like a point on a 5th grade (x,y) graph.
I plotted it with the code below (notice how I edited it the same way I would edit a graph):
fig, ax = plt.subplots(figsize = (15,15))
kings_county.plot(ax=ax)
ax.set_title('King County',fontdict = {'fontsize': 30})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})
My output looked like this:

Before we start adding our housing data we should look at utilizing the shape file to the fullest. Let’s take a look at the file.
   OID   D#  NAME           geometry
0    1    1  Seattle        MULTIPOLYGON (((-122.40324 47.66637...
1    2  210  Federal Way    POLYGON ((-122.29057 47.39374...
2    3  216  Enumclaw       POLYGON ((-121.84898 47.34708...
3    4  400  Mercer Island  POLYGON ((-122.24475 47.59601...
4    5  401  Highline       POLYGON ((-122.35853 47.51553...
(Truncated for clarity)
As you can see, the county is divided into school districts, each with a shape that marks its boundaries. We will now try to plot the shape file and annotate the districts using the data provided, like so:
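# Alignment buckets for the labels. Names must match the NAME column exactly,
# so spaces are restored here to match entries such as 'Federal Way'.
left = ['Riverview', 'Snoqualmie Valley']
center = ['Skykomish', 'Kent', 'Auburn', 'Tahoma', 'Vashon Island', 'Northshore', 'Shoreline', 'Renton', 'Highline', 'Issaquah', 'Enumclaw', 'Seattle', 'Federal Way', 'Bellevue', 'Mercer Island', 'Lake Washington', 'Tukwila']
right = ['Fife']
# Assumed step (the 'coords' column is used below but never built in the text):
# take a point guaranteed to lie inside each polygon as the label anchor.
kings_county['coords'] = kings_county['geometry'].apply(lambda g: g.representative_point().coords[0])
kings_county.plot(figsize = (15,15), cmap = 'gist_earth')
# The label text is passed positionally because newer matplotlib versions no longer accept s=
for idx, row in kings_county.iterrows():
    if row['NAME'] in left:
        plt.annotate(row['NAME'], xy=row['coords'], ha='left', color = 'red')
    elif row['NAME'] in center:
        plt.annotate(row['NAME'], xy=row['coords'], ha='center', color = 'red')
    elif row['NAME'] in right:
        plt.annotate(row['NAME'], xy=row['coords'], ha='right', color = 'red')
plt.title('School Districts in King County, WA', fontdict = {'fontsize': 20})
plt.ylabel('Latitude',fontdict = {'fontsize': 20})
plt.xlabel('Longitude',fontdict = {'fontsize': 20})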
The lists — left, right, center — are from trial and error with the placement of the district names. Some overlapped or needed to be manipulated so that they did not stray too far from their actual district.
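A note on where each label is anchored: a polygon's centroid is not guaranteed to lie inside the polygon, which is why a point that is guaranteed to be inside is a safer anchor. The sketch below computes the centroid of a made-up U-shaped polygon with the standard shoelace formula and shows that it lands in the notch, outside the shape. The polygon and function name are hypothetical.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        twice_area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    # Centroid = sum / (6 * area), and area = twice_area / 2
    return cx / (3 * twice_area), cy / (3 * twice_area)


# Made-up U shape: a 3x3 square with a notch cut out of the top,
# where the notch covers 1 < x < 2 and 1 < y < 3.
u_shape = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]
cx, cy = polygon_centroid(u_shape)
in_notch = 1 < cx < 2 and 1 < cy < 3
print(round(cx, 3), round(cy, 3), in_notch)  # 1.5 1.357 True
```

In shapely, `representative_point()` returns a point that is guaranteed to fall within the geometry, which avoids this problem for oddly shaped districts.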
I've changed the color map to gist_earth for clarity. Next, I iterated through each row, took the entry in the NAME series, and placed it as a label at a point that was definitely inside the polygon. I aligned the names based on the lists I had made earlier. And this was our output:

Each of the regions signifies a school district in King County. This matches the data I found about the twenty school districts in the county. I never really thought about the size and shape of a county, so I googled it just to be sure.

It seemed like the Google Maps image was the perfect hole for my puzzle piece. From here, it was just a matter of formatting my data to fit the shape file. I did that by initiating my coordinate system and creating applicable points using the latitude and longitude of my houses.
crs = 'EPSG:4326' # initiating my coordinate system (the older {'init': 'epsg:4326'} dict form is deprecated)
geometry = [Point(x,y) for x,y in zip(df.long,df.lat)] # creating points, x = longitude, y = latitude
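One easy mistake at this step is swapping the order, since a point is built as (x, y), meaning (longitude, latitude). A quick range check can catch a swap for locations like Seattle, where the longitude is outside the valid latitude range. This is a minimal sketch with made-up coordinates and a hypothetical helper name; it will not catch a swap where both values fall between -90 and 90.

```python
def looks_like_lon_lat(pairs):
    """True if every (x, y) pair is a plausible (longitude, latitude)."""
    return all(-180 <= x <= 180 and -90 <= y <= 90 for x, y in pairs)


# Hypothetical coordinates near Seattle, for illustration only
longs = [-122.257, -122.319]
lats = [47.5112, 47.7210]

print(looks_like_lon_lat(zip(longs, lats)))  # True: correct order
print(looks_like_lon_lat(zip(lats, longs)))  # False: a latitude of -122 is impossible
```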
If you were to look at an entry in geometry, all you would get back is that it is a shapely object. The points need to be attached to our original dataframe. Below, I make a brand new GeoDataFrame from three things: the old dataframe, the coordinate system, and the points created from the latitude and longitude of each house.
geo_df = gpd.GeoDataFrame(df,                   # the dataframe
                          crs = crs,            # coordinate system
                          geometry = geometry)  # geometric points
That was the last step before we can plot the houses. Now, we put it all together.
fig, ax = plt.subplots(figsize = (15,16))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df.plot(ax = ax , markersize = 2, color = 'blue',marker ='o',label = 'House', aspect = 1)
plt.legend(prop = {'size':10} )
ax.set_title('Houses in King County, WA', fontdict = {'fontsize':20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})
In the code above, the steps include:
- Creating a figure and axes to plot on.
- Plotting the King County shape file.
- Plotting the data I made that includes the geometry points. This includes making markers, choosing the aspect, and adding the label for the legend.
- Adding a legend, title, and axis labels.
These steps were done for each of the graphs.
Our output:

This is a great product, but our goal is to learn something from this visualization. While it gives some information, like the outliers far out in the eastern part of the county, it doesn't give much else. We have to play with the parameters. Let's try splitting the data by price. These are the houses that are listed for less than $750,000.
fig, ax = plt.subplots(figsize = (15,25))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df[geo_df['price'] < 750000].plot(ax = ax , markersize = 2,color = 'red',marker = 's',label = 'Price < 750k',aspect = 1.5)
plt.legend(prop = {'size':15} )
ax.set_title('Houses by Price in King County, WA', fontdict ={'fontsize': 20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})

Now we graph the houses priced at or above $750,000.
fig, ax = plt.subplots(figsize = (15,25))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df[geo_df['price'] >= 750000].plot(ax = ax , markersize = 2,color = 'yellow',marker = 'v',label = 'Price >=750k', aspect = 1.5)
plt.legend(prop = {'size':15})
ax.set_title('Houses by Price in King County, WA', fontdict ={'fontsize': 20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})
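Before comparing the two maps by eye, it can help to put a number on the split. The sketch below uses plain Python and a handful of made-up prices (not taken from the King County data) to show how the two groups and the share of expensive houses could be computed; the function name is hypothetical.

```python
THRESHOLD = 750_000  # same cut-off used for the two maps


def split_by_price(prices, threshold=THRESHOLD):
    """Split prices into (below threshold, at or above threshold)."""
    below = [p for p in prices if p < threshold]
    at_or_above = [p for p in prices if p >= threshold]
    return below, at_or_above


# Made-up prices for illustration only
sample = [320_000, 455_000, 749_999, 750_000, 1_200_000]
below, at_or_above = split_by_price(sample)
share = len(at_or_above) / len(sample)
print(len(below), len(at_or_above), f"{share:.0%}")  # 3 2 40%
```

On the real dataframe, the equivalent share would be `(df['price'] >= 750000).mean()`, since the mean of a boolean pandas Series is the fraction of True values.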

There is a big difference in terms of both location and quantity. But that is not the end: we can also layer them one on top of the other. We will plot the expensive houses on top of the cheap ones because they are scarcer.
fig, ax = plt.subplots(figsize = (15,25))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df[geo_df['price'] < 750000].plot(ax = ax , markersize = 1,color = 'red',marker = 's',label = 'Price <750k = Red', aspect = 1.5)
geo_df[geo_df['price'] >= 750000].plot(ax = ax , markersize = 1,color = 'yellow',marker = 'v',label = 'Price>= 750k = Yellow',aspect = 1.5)
plt.legend(prop = {'size':12})
ax.set_title('Houses by Price in King County, WA', fontdict ={'fontsize': 20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})

The picture painted by this map is interesting. There is a plethora of housing in King County that falls below the bar we've set. Most of the houses on the lower end of the price scale sit farther inland than the more expensive ones.
If you zoom in, the more expensive houses dot the waterside. They also are more centrally located around the Seattle city center. There are several physical outliers but the trend is clear.
Overall, the visualization has done its job. We have made several observations from the houses on the map. Pricier houses are clustered around the downtown area and spread along Puget Sound. They are also a minority in the data, which could be telling when predicting housing prices. The cheaper houses are much more numerous and more varied in location. This will be useful for further EDA.
If you want to connect to talk more about this technique, you can find me on LinkedIn. If you would like to check out the code, take a look at my Github.
Sources
- King County Dataset
- King County Shape File
- GeoPandas
- Shapely
- Descartes
- Fiona
Translated from: https://towardsdatascience.com/using-geopandas-for-spatial-visualization-21e78984dc37