“It is difficult to make predictions, especially about the future.”

~Niels Bohr

Almost everything is better understood through data, and data-driven decision making is crucial for success in any industry.

This has been true since time immemorial. The difference now is that we have developed a healthier attitude toward data, we have far more data available than ever before, and we have at our disposal computing power that was previously unimaginable.

In this situation, that computing power and data should be leveraged to make better decisions and solve business problems.

For my project, I chose to provide recommendations for opening new eateries in California. The result is a concrete list of investment recommendations: eatery types (such as Japanese restaurant, dessert shop, etc.) together with the counties in which to open them.

In this post, I will walk through the full process of this data science project.

Data Sources

To solve this problem, data from four sources were used:

The “California Counties” location dataset from the California Open Data Portal, provided by the Government of California, for geographical location data.
The Foursquare API, for information about established restaurants and other relevant details about them.
County-wise population data from the US Government Census site.
County-wise Real GDP data provided by the Bureau of Economic Analysis, U.S. Department of Commerce.
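
Combining the population and GDP sources into one table per county might look roughly like the sketch below. The county names, column names, and figures are made up for illustration; the real Census and BEA files use their own layouts and would need cleaning first.

```python
import pandas as pd

# Hypothetical sample rows; real values would come from the Census and BEA files.
population = pd.DataFrame({
    "county": ["Alpha", "Beta", "Gamma"],
    "population": [1_000_000, 250_000, 20_000],
})
gdp = pd.DataFrame({
    "county": ["Alpha", "Beta", "Gamma"],
    "real_gdp": [90_000_000, 18_000_000, 900_000],
})

# Inner join on the county name; validate raises if a county appears twice
# in either table, which would silently duplicate rows otherwise.
counties = population.merge(gdp, on="county", how="inner", validate="one_to_one")
print(counties)
```

An inner join keeps only counties present in both tables, so a drop in row count after the merge is a quick hint that county names are spelled differently across sources.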

Exploratory Data Analysis

After cleaning the data (which often feels like more than 90% of a data scientist’s job), meaningful insights were gained from it.

Figure: City centers of California’s counties. Source: author.

It was also found that the GDPs of the counties are strongly correlated with their populations. This makes counties with both high GDP and high population attractive destinations for investment.
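
A correlation like this can be checked with a single pandas call. The sketch below uses invented numbers and hypothetical column names, so it only illustrates the computation, not the project's actual result.

```python
import pandas as pd

# Invented figures (population in thousands, GDP in arbitrary units).
counties = pd.DataFrame({
    "population": [10, 20, 30, 40, 100],
    "real_gdp": [12, 18, 33, 41, 110],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = counties["population"].corr(counties["real_gdp"])
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the two features carry much of the same information, which is worth keeping in mind when both are fed to a clustering algorithm.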

With the information provided by the Foursquare API, a list of the ten most common venues was obtained for each county. This list is used later in the decision making.
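
One way to derive such a ranking, assuming the API results have already been flattened into one row per venue, is sketched below. The `top_categories` helper, the column names, and the sample rows are hypothetical and not taken from the original notebook.

```python
import pandas as pd

# Hypothetical venue rows; in the project these would come from the Foursquare API.
venues = pd.DataFrame({
    "county": ["Alpha"] * 5 + ["Beta"] * 3,
    "category": ["Pizza Place", "Pizza Place", "Pizza Place", "Cafe", "Cafe",
                 "Bakery", "Bakery", "Diner"],
})

def top_categories(df, n=10):
    """Return {county: [up to n most frequent venue categories]}."""
    counts = df.groupby(["county", "category"]).size().reset_index(name="n_venues")
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    counts = counts.sort_values(["county", "n_venues", "category"],
                                ascending=[True, False, True])
    return {county: grp["category"].head(n).tolist()
            for county, grp in counts.groupby("county")}

top = top_categories(venues, n=2)
print(top)
```

With `n=10` the same function would give the ten most common venue categories per county.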

Applying a Machine Learning Model

Choosing the Algorithm

The business problem is to find eatery types and locations to invest in. The data is not labeled, which makes this a classic application of unsupervised learning.

The aim is not to predict a value or a class, nor to give someone a single investment recommendation. The goal is to offer the stakeholders a list of promising venues.

This can be achieved by clustering the counties based on GDP and population. KMeans clustering, a simple and widely used algorithm for grouping numeric data, was chosen for this task.

The scikit-learn library’s implementation of the KMeans clustering algorithm was used.
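
A minimal sketch of clustering counties on population and GDP with scikit-learn's `KMeans` follows. The four rows are invented, and standardizing the features first is an assumption of this sketch; the post does not say whether the original notebook scaled the data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented (population, real GDP) rows: two small counties and two large ones.
X = np.array([
    [20_000, 1.0e6],
    [25_000, 1.2e6],
    [3_000_000, 2.0e8],
    [3_200_000, 2.2e8],
])

# Assumption: put both features on a comparable scale so that the one with
# the larger raw magnitude does not dominate the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# n_init restarts the algorithm from several initializations and keeps the
# best run; random_state makes the result reproducible.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)
print(labels)
```

The numeric cluster labels are arbitrary, so they should be interpreted by inspecting which counties fall in each cluster rather than by the label values themselves.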

Choosing k

To choose the best k for clustering, the elbow method was employed.

The elbow plot indicated that the best k is 4, so the clustering algorithm was applied with k = 4. This formed four clusters of counties based on their population and GDP.
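
The elbow method fits the model for a range of k values and records the within-cluster sum of squares (exposed by scikit-learn as `inertia_`); the "elbow" is the k after which further increases bring only small improvements. The sketch below uses synthetic data with four well-separated groups as a stand-in for the county data, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the county features: four well-separated blobs in 2D.
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fit KMeans for k = 1..8 and record the inertia of each fit.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting these values against k gives the elbow graph. On this synthetic data the inertia falls sharply up to k = 4 and flattens afterwards, which is the pattern the method looks for.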

Results

Four clusters of counties were formed. On examination, Los Angeles County formed a cluster by itself (cluster 2) because of its exceptionally high GDP and population. The counties in another cluster (cluster 3) have high GDP and high population, though nowhere close to those of Los Angeles County; these are Orange, Santa Clara, and San Diego. Counties with low GDP and low population, such as Plumas, Nevada, and Sierra, fall in cluster 1, and counties with mid-range GDP and population, such as Sacramento and Riverside, fall in cluster 4.

Clusters 2 and 3 contain counties with high population and high GDP. In these counties, investing in any eatery is likely to be profitable, but it is advisable to invest in an eatery type that is not among the three most common venues.

In cluster 4, the population and GDP of the counties are higher than in cluster 1 but lower than in clusters 2 and 3. Investment in these counties is preferred only after counties in cluster 2 and cluster 3, in that order. Investment should go to uncommon eatery types so that they face less competition.

Cluster 1 is dominated by low-population counties. Investment in these counties should be considered only after investments in counties in clusters 2, 3, and 4, which makes it the least advised option. Investment in the most common eatery types is not advised at all.

After suggesting the investment options, a table was compiled for each cluster listing the eatery types that are not among the three most common venue types.
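
The filtering behind such a table can be sketched as follows. The `uncommon_eateries` helper, the venue ranking, and the set of categories treated as eateries are all hypothetical; the original notebook may have built its tables differently.

```python
def uncommon_eateries(ranked_venues, eatery_types, skip=3):
    """Keep eatery types from a ranked venue list, ignoring the top `skip` entries.

    ranked_venues: venue categories ordered from most to least common.
    eatery_types: categories that count as eateries (parks, hotels, etc. do not).
    """
    return [venue for venue in ranked_venues[skip:] if venue in eatery_types]

# Hypothetical ranking for one county, most common venue first.
ranked = ["Pizza Place", "Coffee Shop", "Park",
          "Japanese Restaurant", "Hotel", "Dessert Shop"]
eateries = {"Pizza Place", "Coffee Shop", "Japanese Restaurant", "Dessert Shop"}

suggestions = uncommon_eateries(ranked, eateries)
print(suggestions)
```

Here the three most common venues are skipped and the non-eatery "Hotel" is dropped, leaving the less crowded eatery types as candidates.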

Full Report Link: PDF in GitHub Repository
Notebook with Full Code: NB Viewer

Feel free to comment, provide feedback, or criticize.


Connect with me on LinkedIn or Twitter.


This blog post is related to the Applied Data Science Capstone Project offered by IBM through Coursera.

Translated from: https://medium.com/beginning-data-science/investing-in-a-new-eastery-in-california-a-data-driven-approach-e91229e0289e
