Investing in a New Eatery in California: A Data-Driven Approach

“It is difficult to make predictions, especially about the future.”

~Niels Bohr

Almost everything is easier to interpret through data, and data-driven decision making is crucial for success in any industry.

This has been true since time immemorial. The difference now is that we have developed a healthier outlook on data, we have far more data available than ever before, and we have at our disposal computing power that was previously unimaginable.

In this situation, that computing power and data should be leveraged to make better decisions and solve business problems.

In my project, I chose to provide recommendations for opening new eateries in California. The result is a concrete list of recommendations to invest in: eatery types (such as Japanese restaurant, dessert shop, etc.) and the respective counties.

In this post, I will go over the full process of a Data Science project.

Data Sources

To solve this problem, data from four sources were used:

  1. Geographical location data titled “California Counties” from the California Open Data Portal, provided by the Government of California.

  2. The Foursquare API, for information about established restaurants and other relevant details about them.

  3. County-wise population data from the US Government Census site.

  4. County-wise Real GDP data provided by the Bureau of Economic Analysis, U.S. Department of Commerce.

Exploratory Data Analysis

After cleaning the data (which often feels like more than 90% of a Data Scientist’s job), meaningful insights were gained from it.

Figure: City Centers of California’s Counties, source: Author

It was also found that the GDPs of the counties are strongly correlated with their populations, which makes counties with high GDP and high population attractive destinations for investment.
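A correlation check of this kind can be sketched with pandas. The snippet below is only an illustration: the column names and the five rows of values are hypothetical placeholders, not the Census or Bureau of Economic Analysis figures used in the project.

```python
import pandas as pd

# Hypothetical county-level values for illustration only; these are NOT the
# figures from the Census or BEA datasets used in the project.
counties = pd.DataFrame(
    {
        "county": ["A", "B", "C", "D", "E"],
        "population": [10_000, 50_000, 200_000, 1_000_000, 3_000_000],
        "real_gdp_billion": [0.4, 2.1, 9.5, 60.0, 170.0],
    }
)

# Pearson correlation coefficient between population and real GDP.
pearson_r = counties["population"].corr(counties["real_gdp_billion"])
print(f"Pearson r between population and GDP: {pearson_r:.3f}")
```

A coefficient close to 1 indicates that the two features carry largely the same signal, so clustering on both mostly separates counties by overall size.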

Figure: Strong Correlation Between GDP and Population of Californian Counties, source: Author

Figure: Number of Eateries in Each County (capped at 50 by Foursquare), source: Author

With the information provided by the Foursquare API, a list of the ten most common venues was obtained for each county. This list is used later in decision making.
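One way to build such a ranking, assuming the Foursquare results have already been flattened into a table with one row per venue, is sketched below. The function name `most_common_venues`, the column names `county` and `category`, and the sample rows are all hypothetical; the original notebook may organise the data differently.

```python
import pandas as pd


def most_common_venues(venues: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Return one row per county with its top_n venue categories by count."""
    counts = venues.groupby(["county", "category"]).size().reset_index(name="count")
    # Most frequent first; ties are broken alphabetically for determinism.
    counts = counts.sort_values(
        ["county", "count", "category"], ascending=[True, False, True]
    )
    top = counts.groupby("county").head(top_n).copy()
    top["rank"] = top.groupby("county").cumcount() + 1
    return top.pivot(index="county", columns="rank", values="category")


# Hypothetical venue rows, for illustration only.
sample = pd.DataFrame(
    {
        "county": ["X"] * 6 + ["Y"] * 3,
        "category": ["Pizza Place"] * 3 + ["Cafe"] * 2 + ["Bakery"]
        + ["Diner"] * 2 + ["Cafe"],
    }
)
table = most_common_venues(sample, top_n=2)
print(table)
```

In the resulting table, each column holds the category at that rank, so column 1 is the most common venue type in each county.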

Figure: Five Row

Applying Machine Learning Model

Choosing Algorithm

The business problem is to find eatery types and locations to invest in. The data is not labeled, which makes this a classic application of unsupervised learning.

The aim is not to predict a value or a class, nor to give a single investment recommendation. The goal is to give stakeholders a list of likely venues.

This can be achieved by clustering the counties based on GDP and population. KMeans clustering is a natural fit for this: it groups unlabeled numeric data into a chosen number of clusters.

The scikit-learn library’s implementation of the KMeans clustering algorithm was used.
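A minimal sketch of this step with scikit-learn might look as follows. The feature values are hypothetical placeholders, and standardising the features with `StandardScaler` is an assumption on my part; the post does not say whether the original analysis scaled population and GDP before clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical [population in millions, real GDP in billions] rows,
# for illustration only. The last row plays the role of a single very large county.
features = np.array(
    [
        [0.02, 1.0], [0.10, 4.0], [0.05, 2.0],        # small counties
        [1.5, 90.0], [1.7, 100.0],                    # mid-range counties
        [3.0, 280.0], [3.2, 300.0], [3.1, 290.0],     # large counties
        [10.0, 710.0],                                # one outlier county
    ]
)

# Assumption: put both features on a comparable scale before clustering.
scaled = StandardScaler().fit_transform(features)

model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(scaled)
print(labels)
```

The numeric cluster labels returned by KMeans are arbitrary, so they need to be inspected and mapped to meaningful groups (for example "small" or "large" counties) after fitting.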

Choosing k

To choose the best k for clustering, the elbow method was employed: inertia is plotted against k, and the k at which the curve bends and flattens is chosen.

Figure: Inertia vs. Values of k Plot, source: Author

As evident from the graph, the best k is 4. Hence, the clustering algorithm was applied with k = 4, forming 4 clusters of counties based on their population and GDP.
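The inertia values behind an elbow plot can be computed roughly as shown here. This is a sketch under stated assumptions: the helper name `inertia_by_k` is hypothetical, the data is synthetic (four artificial blobs rather than county data), and the scaling step is an assumption, not something the post confirms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def inertia_by_k(features, k_values, random_state=0):
    """Fit KMeans once per k and return {k: inertia}."""
    scaled = StandardScaler().fit_transform(features)
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=random_state)
        .fit(scaled)
        .inertia_
        for k in k_values
    }


# Synthetic data with four well-separated groups, for illustration only.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [0.0, 10.0], [10.0, 0.0], [10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

inertias = inertia_by_k(points, range(1, 8))
for k, value in inertias.items():
    print(f"k={k}: inertia={value:.2f}")
```

Inertia keeps falling as k grows, so the elbow is read as the point where the drop becomes small; on this synthetic data the drop from k = 3 to k = 4 is large and the drop after k = 4 is marginal.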

Results

Four clusters of counties were formed. Upon examination, Los Angeles County forms a cluster by itself (cluster 2) because its GDP and population are far higher than those of any other county. Counties in another cluster (cluster 3) have high GDP and high population, but nowhere near those of Los Angeles County; Orange, Santa Clara, and San Diego are the three counties in this cluster. Counties with low GDP and low population, such as Plumas, Nevada, and Sierra, form one cluster (cluster 1), and counties with mid-range GDP and population, such as Sacramento and Riverside, form another (cluster 4).

Figure: Resulting Clusters on a Map, source: Author

Clusters 2 and 3 contain counties with high population and high GDP. In these counties, investing in any eatery is likely to be profitable, though it is advisable to invest in an eatery type that is not among the county’s top 3 most common venues.

In cluster 4, the population and GDP of the counties are higher than those of the counties in cluster 1 but lower than those of the counties in clusters 2 and 3. For investment, these counties rank after the counties in cluster 2 and cluster 3, in that order. Investment should go to uncommon eatery types so that they face less competition.

Cluster 1 is dominated by counties with lower populations. These counties should be considered only after counties in clusters 2, 3, and 4, which makes them the least advisable for investment. Investing in the most common eatery types there is not advised at all.

After the investment options were suggested, a table was built for each cluster listing, for every county, the eatery types that are not among its three most common.
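The selection rule described above can be sketched in plain Python. The cluster assignments of the four counties come from the text; the function name `recommend`, the data layout, and the ranked venue lists are hypothetical placeholders, not results from the project.

```python
# Cluster preference order taken from the text: cluster 2, then 3, then 4, then 1.
CLUSTER_PRIORITY = [2, 3, 4, 1]


def recommend(county_clusters, ranked_venues, skip_top=3):
    """Order counties by cluster preference and drop each county's top venues."""
    rows = []
    for county, cluster in county_clusters.items():
        # Keep only eatery types ranked below the skip_top most common ones.
        candidates = ranked_venues[county][skip_top:]
        rows.append((CLUSTER_PRIORITY.index(cluster), county, cluster, candidates))
    rows.sort(key=lambda row: (row[0], row[1]))
    return [(county, cluster, candidates) for _, county, cluster, candidates in rows]


# Cluster labels follow the post; the venue rankings are made up for illustration.
county_clusters = {"Plumas": 1, "Sacramento": 4, "Orange": 3, "Los Angeles": 2}
ranked_venues = {
    "Los Angeles": ["Mexican", "Pizza", "Cafe", "Sushi", "Bakery"],
    "Orange": ["Cafe", "Mexican", "Burger", "Dessert", "Thai"],
    "Sacramento": ["Pizza", "Cafe", "Mexican", "Ramen", "Deli"],
    "Plumas": ["Diner", "Pizza", "Cafe", "Bakery", "BBQ"],
}

recommendations = recommend(county_clusters, ranked_venues)
for county, cluster, candidates in recommendations:
    print(f"cluster {cluster} | {county}: {', '.join(candidates)}")
```

Keeping the priority order in a single list makes it easy to adjust if stakeholders weigh the clusters differently.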

Figure: Table for Counties and Investment Recommendations in Cluster 3

Full Report Link: PDF in GitHub Repository

Notebook with Full Code: NB Viewer

Feel free to comment, provide feedback, or criticize.

Connect with me on LinkedIn or Twitter.

This blog post is related to the Applied Data Science Capstone Project offered by IBM through Coursera.

Translated from: https://medium.com/beginning-data-science/investing-in-a-new-eastery-in-california-a-data-driven-approach-e91229e0289e
