Data Science Process: Summary
Once you begin studying data science, you will hear about something called the ‘data science process’. The expression refers to a five-stage process that data scientists usually follow when working on a project. In this post I will walk through each stage and describe what is involved and which technologies are normally used.
1. Data Acquisition
When you are just starting to study data science, your data may be given to you by your instructors. You can also find plenty of good datasets on Kaggle.com or Google Dataset Search. In this case data acquisition is pretty simple: just download the dataset and you’re all set.
In real life it is a little trickier. To obtain data in the format you need, you will probably be using APIs or web scraping, along with some basic knowledge of HTML. In one of my earlier posts I described how I obtained data about beauty products from Sephora.com using Selenium and BeautifulSoup.
Technologies used: HTML, SQL, Selenium, BeautifulSoup.
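To make the scraping idea concrete, here is a minimal sketch that pulls product names and prices out of a small HTML snippet. It is not the Sephora scraper mentioned above: the HTML, the class names `product`, `name` and `price`, and the helper `scrape_products` are all made up for illustration, and it uses Python's built-in `html.parser` as a dependency-free stand-in for BeautifulSoup, which offers a friendlier interface for the same job.

```python
from html.parser import HTMLParser

# Hypothetical product listing; a real page would be fetched over HTTP
# (or rendered with Selenium) before parsing.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Lip Balm</span><span class="price">$12.00</span></li>
  <li class="product"><span class="name">Face Cream</span><span class="price">$38.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self.products = []   # finished rows: {"name": ..., "price": ...}
        self._field = None   # field currently being read, if any
        self._row = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field:
            self._row[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span" and self._field:
            self._field = None
        # A closing </li> marks the end of one product row.
        if tag == "li" and self._row:
            self.products.append(self._row)
            self._row = {}


def scrape_products(html):
    parser = ProductParser()
    parser.feed(html)
    # Convert "$12.00" into a float so the column is numeric from the start.
    return [
        {"name": p["name"], "price": float(p["price"].lstrip("$"))}
        for p in parser.products
    ]


products = scrape_products(SAMPLE_HTML)
print(products)
```

The result is a list of dictionaries, which can be passed straight to `pandas.DataFrame` for the cleaning stage.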
2. Data Cleaning
Again, if the dataset was given to you by your instructors, or you got it from one of the websites mentioned above, there’s a good chance that your data is already clean. In most other cases, however, some cleaning will be required. You need to handle the missing values (and be smart about it), make sure that all the columns have the correct data types (date-time, integers, floats, strings, etc.), and check that no column names contain spaces (especially important if you’re using NLP to perform analysis and modeling). Check out my post Beginner’s guide to data cleaning for more information.
Technologies used: Pandas, NumPy
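The three cleaning tasks named above (missing values, data types, column names) might look like this in Pandas. This is a sketch on a made-up table, not a recipe: the column names and values are hypothetical, and filling gaps with the median is just one assumption about what "being smart about it" means for this particular data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with the three problems named above.
raw = pd.DataFrame({
    "Product Name": ["Lip Balm", "Face Cream", "Serum", "Mascara"],
    "Launch Date": ["2020-01-15", "2020-03-02", "2020-03-20", "2020-06-11"],
    "Price": ["12.0", "38.5", None, "24.0"],
    "Rating": [4.5, np.nan, 4.1, 3.9],
})

df = raw.copy()

# 1. Column names: lower-case, spaces replaced with underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# 2. Data types: parse dates and turn price strings into floats.
df["launch_date"] = pd.to_datetime(df["launch_date"])
df["price"] = pd.to_numeric(df["price"])

# 3. Missing values: fill with the column median here. This is only one
#    option; dropping rows or a model-based fill may suit other data better.
for col in ["price", "rating"]:
    df[col] = df[col].fillna(df[col].median())

print(df.dtypes)
print(df)
```

Working on a copy keeps the raw table intact, so you can always compare the cleaned result against what you started with.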
3. EDA
EDA stands for Exploratory Data Analysis. At this stage of the process you need to get to know your data. What is the shape of the table? How many rows and columns are there? What are the data types (to make sure you cleaned properly)? How are the numeric values distributed? Is there any correlation or multicollinearity? Is there class imbalance, if you want to perform classification? You need to answer all these questions and more before you get to the next stage. I would just write down all the questions I have and try to answer them one by one. This stage is also very important if you are about to present the results to a non-technical audience. While exploring your data in a meaningful way, you will create beautiful visualizations, and someone with no background in math or coding will respond better to an interactive 3D map than to you saying “My adjusted R² is 0.92!”.

Technologies used: Pandas, NumPy, Matplotlib, Seaborn, Plotly (GO and Express)
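Most of the EDA questions listed above map onto one-line Pandas calls. The sketch below runs them on a synthetic table; the column names (`price`, `price_with_tax`, `rating`, `returned`) and the way the data is generated are assumptions made purely so that the example has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: two related numeric features and an imbalanced label.
rng = np.random.default_rng(seed=0)
n = 200
price = rng.normal(loc=30, scale=8, size=n)
df = pd.DataFrame({
    "price": price,
    "price_with_tax": price * 1.08 + rng.normal(scale=0.5, size=n),
    "rating": rng.uniform(1, 5, size=n),
    "returned": rng.choice([0, 1], size=n, p=[0.9, 0.1]),
})

# Shape of the table: (rows, columns).
print(df.shape)

# Data types, to confirm the cleaning step worked.
print(df.dtypes)

# Distribution of the numeric values (count, mean, std, quartiles).
print(df.describe())

# Pairwise correlation; values near +/-1 between features hint at
# multicollinearity.
corr = df[["price", "price_with_tax", "rating"]].corr()
print(corr.round(2))

# Class balance for a classification target, as proportions.
class_share = df["returned"].value_counts(normalize=True)
print(class_share)
```

Each printed table is also a natural candidate for a chart: a histogram for the distributions, a heatmap for the correlation matrix, a bar chart for the class shares.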
4. Modeling
This is the most fun part (IMO). After all the preparation, you get to create a machine learning or deep learning model that will make some sort of prediction. This can be a simple linear regression, multiple regression, classification, time series, NLP analysis, or a huge computer vision project with image recognition. Describing how each and every one of these works is beyond the scope of this post, but check out my earlier post about how to talk about regression with babies and I’m-really-bad-at-math people.
Technologies used: Scikit-Learn, SciPy, NumPy, Keras, TensorFlow, PyTorch, XGBoost, and many, many more (it really depends on what you’re trying to model).
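As a taste of the simplest model on that list, here is a sketch of a simple linear regression fitted with NumPy alone. The data is synthetic (a line with slope 3 and intercept 5 plus noise, chosen arbitrarily), and the 80/20 split by position is an assumption that only works because the points are already in random order. Scikit-Learn's `LinearRegression` would give the same fit with more conveniences around it.

```python
import numpy as np

# Hypothetical data: y depends linearly on x, plus noise.
rng = np.random.default_rng(seed=42)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 5.0 + rng.normal(scale=1.0, size=100)

# Hold out the last 20 points to check the model on unseen data.
x_train, x_test = x[:80], x[80:]
y_train, y_test = y[:80], y[80:]

# Fit y = slope * x + intercept by ordinary least squares.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Predict on the held-out points and compute R-squared.
y_pred = slope * x_test + intercept
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.2f}, test R^2={r2:.3f}")
```

Scoring on held-out points rather than on the training data is what makes the R-squared figure meaningful when you report it later.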
5. Model Interpretation and Applications
The results of your model are probably going to look something like this:

What the heck does this all mean? You can’t just go to the investors and the marketing department and say something like ‘my validation accuracy reached 93% after I handled the class imbalance’ or ‘the proportion of the variance in the dependent variable y explained by the independent variables X gives an R-squared of 0.75’. You will immediately hear back “English, please!”.
The goal of the final stage of the data science process is to learn how to translate back from Math to English. It doesn’t matter how high or low your adjusted R² or validation accuracy is if you can’t explain what it means in real life.
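One way to practice that translation is to write it down as code. The sketch below computes adjusted R-squared from the standard formula and wraps both metrics from the examples above in plain sentences. The helper names, the wording of the sentences, and the sample figures (500 rows, 4 features, "monthly sales", customer churn) are all hypothetical.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: R-squared penalized for the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def explain_r2(r2, target, n_samples, n_features):
    """Turn a regression score into a sentence for a non-technical audience."""
    adj = adjusted_r2(r2, n_samples, n_features)
    return (
        f"The model accounts for about {adj:.0%} of the differences "
        f"we see in {target}; the rest comes from factors it does not capture."
    )


def explain_accuracy(accuracy, target):
    """Turn a validation accuracy into a sentence for a non-technical audience."""
    correct_out_of_100 = round(accuracy * 100)
    return (
        f"On data it has never seen, the model predicts {target} "
        f"correctly about {correct_out_of_100} times out of 100."
    )


# Hypothetical figures matching the examples in the text.
print(explain_r2(0.75, "monthly sales", n_samples=500, n_features=4))
print(explain_accuracy(0.93, "whether a customer will churn"))
```

The numbers stay the same; only the framing changes, from a statistic to a statement about the thing the audience cares about.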
The results of this whole data science process can be wrapped up in a presentation or they can be used to build a useful web application or some other sort of software. You will need basic knowledge of web development to make it happen, but if I built an app in four days, you certainly can too! Here’s a post about how I did it.
Technologies used: your knowledge of math for data interpretation; Flask and Dash for creating a front end.
That is the data science process in a nutshell. Of course, there’s more to it in real life, but if you’re just learning, it’s a nice structure to stick to. Enjoy your data!
Translated from: https://medium.com/the-innovation/data-science-process-summary-865abd16183d