Start with Data
My data science journey began with a student job in the Advanced Analytics department of one of the biggest automotive manufacturers in Germany. I was naïve and still doing my master's.
I was excited about this job because my specialization at the time was Digitalization, and I wanted to get the hang of how it really works. I had studied programming too, but not Python. My colleagues were all really smart: PhDs, mathematicians, and physicists. Their understanding of analytics was way beyond what I could gain by merely reading books!
For the first few days, the variety of projects, tasks, and analyses bewildered me. But you know what was more bewildering? Questions like: What is analytics? Why do it? What are all these files with so much data? What do all those numbers in the results say? What does an analytics project look like? What do they mean when they say they are analyzing data?
Overwhelming!
I spent days understanding analytics and the job itself. I gorged on various books and online courses that taught Python, statistics, data science, etc. Gradually, I developed an understanding of the subjects and successfully completed my thesis in the same department too.
I have explained the data analytics recipe for you below. I hope you can use it as a guide even if your ingredients change with the application.
The most important step in making any project successful is a clear start. No matter how big or small your project is, if you do not have the ingredients in the required form and the right tools, even a master chef's recipe will not guarantee a delicious meal in the end.
Let's start with the ingredients before moving on to the preparation.
Ingredients:
1. The Problem
Did you ever get irrelevant results after searching for something on Google? What do you do then? You rephrase and refine the keywords and search again. Similarly, having the ‘why’ of your analysis clear from the beginning helps you interpret your results better.
After you get all the data that you need, the next step is to understand and define the problem statement. This is where the pain points of the business case need to be addressed. Your aim must align with the business strategy of your company so that the analysis proves fruitful to the stakeholders.
Consider the store location example described under "The Data" below. As a result of your analysis, each prospective location gets a score. Suppose your management's strategy is to finance the project only when the new location yields more than $100,000 in profit and lies in a new city with a population of at least 5,000. You then have clear criteria for narrowing down the analysis results in line with the vision of your company.
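The two criteria above translate directly into a filter. Here is a minimal sketch, assuming each candidate location has already been scored and carries a projected profit and a city population; the field names and all figures are made up for illustration.

```python
# Hypothetical candidate locations; names and numbers are illustrative only.
candidates = [
    {"city": "A", "score": 0.82, "projected_profit": 140_000, "population": 12_000},
    {"city": "B", "score": 0.91, "projected_profit": 95_000, "population": 30_000},
    {"city": "C", "score": 0.77, "projected_profit": 120_000, "population": 4_000},
]

MIN_PROFIT = 100_000     # management funds only projects above this profit
MIN_POPULATION = 5_000   # and only in cities with at least this many people


def shortlist(locations):
    """Keep locations meeting both business criteria, best score first."""
    eligible = [
        loc for loc in locations
        if loc["projected_profit"] > MIN_PROFIT
        and loc["population"] >= MIN_POPULATION
    ]
    return sorted(eligible, key=lambda loc: loc["score"], reverse=True)


print([loc["city"] for loc in shortlist(candidates)])
```

City B has the best score but misses the profit threshold, and city C is too small, which is exactly why it helps to pin the criteria down before looking at scores.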
2. The Data
For any kind of data analysis, getting the data is indispensable. Data can be acquired from various relevant sources, so it may come in diverse types and formats. Your job is to cut and crush it according to its type so that it is usable for your recipe.
In a tabular representation of data, each column is a data field and each row is a record. Each record may be labelled uniquely with an ID.
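To make the terms concrete, here is a small sketch using Python's built-in csv module. The table contents and the column name store_id are assumptions chosen for illustration.

```python
import csv
import io

# A tiny in-memory table; in practice this would be a file on disk.
raw = io.StringIO(
    "store_id,city,yearly_sales\n"
    "S001,Munich,250000\n"
    "S002,Berlin,310000\n"
)

reader = csv.DictReader(raw)
fields = reader.fieldnames                           # the columns (data fields)
records = {row["store_id"]: row for row in reader}   # one record per row, keyed by ID

print(fields)
print(records["S002"]["city"])
```

Keying the records by their ID makes it easy to look up a single row later, which is the practical point of having a unique label per record.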
For example, to predict the next location for opening a new store, you may have to use yearly sales data, sales data for existing store locations, population density of the locations, total number of households, census data, and land area. If your company sells pet products, you need the number of households with pets. If it sells children's products, you need the number of households with children under 15.
The most common types of input files are .csv (comma-separated values), .xlsx (Excel workbook), and .txt (plain text). Excel files generally consume more memory while importing data, whereas CSV files are typically faster to load and consume less memory.
Regardless of the file type, you have to clean each file and then blend them all into one file to do the analysis. You can check out more about this here:
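As a rough sketch of the clean-then-blend idea, the snippet below joins two sources on a shared key after normalizing it. The two sources, their column names, and the numbers are hypothetical; a real project would read files from disk and would likely need more cleaning than this.

```python
import csv
import io

# Two hypothetical sources that share a city key, spelled inconsistently.
sales_csv = io.StringIO("city,yearly_sales\n munich ,250000\nBerlin,310000\n")
census_csv = io.StringIO(
    "City,population\nMunich,1500000\nBerlin,3600000\nHamburg,1800000\n"
)


def clean_key(value):
    """Normalize the join key: trim whitespace and unify capitalization."""
    return value.strip().title()


sales = {clean_key(r["city"]): r for r in csv.DictReader(sales_csv)}
census = {clean_key(r["City"]): r for r in csv.DictReader(census_csv)}

# Inner join: keep only cities present in both sources.
blended = [
    {
        "city": city,
        "yearly_sales": int(sales[city]["yearly_sales"]),
        "population": int(census[city]["population"]),
    }
    for city in sales
    if city in census
]

# Write the combined table to a single CSV (here, an in-memory buffer).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["city", "yearly_sales", "population"])
writer.writeheader()
writer.writerows(blended)
print(out.getvalue())
```

Hamburg drops out because it has no sales record. Whether to keep such unmatched rows is a decision to make deliberately for each analysis.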
3. The Software
The software for the analysis can be selected depending on the kind of results you want and on your knowledge of programming languages like Python or R. Those who prefer not to program can simply use a modular analytics tool. In such a tool, you just drag and drop the required functions and you are good to go, with beautifully structured results and presentations.
Popular ‘No Code’ analytics software includes:
Tableau: Data Visualization and Reporting
DataRobot: Automated Machine Learning Platform
RapidMiner: Useful for the entire life cycle from prediction to deployment
Alteryx: Advanced Analytics Platform
MLBase: Open Source
Trifacta: Free
For these, you simply need to go to their site, create an account, and download the software (some may only offer trial versions for a limited period).
After that, just upload your data file for analysis and run it. You could have your first results by the time you finish reading this article.
Popular IDEs for statistical computing:
PyCharm (Python)
Spyder (Anaconda Python distribution)
RStudio (R)
You can also start your data analytics projects directly online, without downloading or installing anything!
Google Colab
Microsoft Azure Notebooks
Preparation:
Different types of data come in different formats. Data usually comes from disparate sources and requires cleansing, enriching, and proper consolidation into one usable form for downstream processing. The technical terms generally used are data cleaning, feature selection, data transforms, feature engineering, and dimensionality reduction.
Data cleaning and preparation is the most time-consuming task in the entire analysis process.
The first thing to do with any file is to check that the given path is correct and that the file opens without errors. Load the data in the software of your choice. Now, look inside.
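In Python, that first check might look like the sketch below. The helper name peek is made up for this example, and it assumes a UTF-8 encoded CSV file.

```python
import csv
from pathlib import Path


def peek(path, n=5):
    """Return the header and the first n rows of a CSV file, failing early if it is missing or empty."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No file at {path.resolve()}")
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            raise ValueError(f"{path} is empty")
        # zip stops after n rows, so large files are not read in full.
        rows = [row for _, row in zip(range(n), reader)]
    return header, rows
```

Calling peek("sales.csv") (a hypothetical file name) returns the column names and a handful of rows, which is usually enough to spot a wrong delimiter or a shifted header before doing anything else.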
An example of looking at the data is the Field Summary tool in Alteryx, which provides a summary of the data for all fields. The summary is shown below:

Analyze and interpret the data using statistical tools (e.g. finding correlations, trends, and outliers). However, the data might have missing values, typing errors, or heterogeneous date formats; these must first be identified and fixed for better results.
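If you work in code instead of a drag-and-drop tool, a basic per-field summary is easy to build. This is only a sketch of the idea, not a reproduction of any tool's output; the records are hypothetical, None marks a missing value, and the function assumes each field has at least one observed value.

```python
import statistics

# Hypothetical records; None marks a missing value.
rows = [
    {"sales": 250.0, "population": 1500},
    {"sales": 310.0, "population": None},
    {"sales": 180.0, "population": 900},
]


def summarize(rows, field):
    """Basic per-field summary: count, missing share, range, mean, and median."""
    values = [r[field] for r in rows]
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing_pct": 100 * (len(values) - len(present)) / len(values),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


for field in ("sales", "population"):
    print(field, summarize(rows, field))
```

A summary like this surfaces missing values and implausible minimums or maximums before any modelling starts.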

· Variables
Categorical variables take values or labels belonging to a fixed number of categories. Gender recorded as male or female is a nominal categorical variable: its categories have no intrinsic ordering. An ordinal variable has a clear ordering. Temperature recorded as low, medium, or high is an ordinal categorical variable with three ordered categories. Such variables are encoded using different techniques for easier analysis.
Quantitative variables represent measurements and counts. They are of two types: continuous (may take any value within an interval) and discrete (countable).

The link below gives an overview of the methods to encode the variables.
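Two common encodings can be sketched in a few lines: one-hot encoding for nominal variables and rank mapping for ordinal ones. These are standard techniques picked here for illustration, not necessarily the ones the linked overview covers, and the sample values are invented.

```python
def one_hot(values):
    """Nominal encoding: one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def ordinal(values, order):
    """Ordinal encoding: map each category to its rank in the given order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


colors = ["red", "green", "red"]           # nominal: no natural order
temperature = ["low", "high", "medium"]    # ordinal: low < medium < high

print(one_hot(colors))
print(ordinal(temperature, ["low", "medium", "high"]))
```

The order passed to ordinal has to come from domain knowledge; sorting the labels alphabetically would put "high" before "low".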
You may have to deal with the following challenges while preparing the data:
Null Values / Missing data
Null values often show up in the data as NaN (“Not-a-Number”). NaN is technically a floating-point value, but it does not represent a valid number. In Python, use math.isnan() (or numpy.isnan() and pandas.isna() when working with those libraries) to check whether a value is NaN. In Alteryx, these values are shown as [Null] after running the workflow. They can be filtered using the IsNull function in the Formula tool. Additionally, the Summarize tool helps you count nulls.
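A short sketch of NaN handling with only the standard library follows. The helper is_missing and the sample readings are made up for this example; it treats both None and NaN as missing, which is a common but not universal convention.

```python
import math

readings = [21.5, float("nan"), None, 19.0]


def is_missing(value):
    """Treat both None and floating-point NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


missing_count = sum(is_missing(v) for v in readings)
clean = [v for v in readings if not is_missing(v)]

nan = float("nan")
print(nan == nan)             # False: NaN never equals itself, so == cannot detect it
print(missing_count, clean)   # 2 [21.5, 19.0]
```

The first print shows why a dedicated check is needed: comparing a value against NaN with == always fails.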
When no data value is stored for an observation in the dataset, it is termed missing data or a missing value in statistics. Rubin described three mechanisms for the occurrence of missing data: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR).
The process of assigning substitute values to missing values is called imputation. If a small portion (up to 5%) of the data is missing, the values can be imputed using a simple method like the mean, median, or mode, which uses the other values in the same column. More principled methods include multiple imputation (MI), full information maximum likelihood (FIML), and expectation-maximization (EM).
As a rule of thumb, it is advisable to delete the field if more than 10% of its data is missing, as it may otherwise add statistical bias to the results.
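Simple imputation combined with the 10% rule of thumb can be sketched as follows. The function name impute and the choice to return None for fields that should be dropped are assumptions made for this example, and None stands in for a missing value.

```python
import statistics

DROP_THRESHOLD = 0.10  # the 10% rule of thumb from the text


def impute(values, strategy="median"):
    """Fill None entries using the mean, median, or mode of the observed values.

    Returns None when too much is missing, signalling that the field
    is a candidate for removal instead.
    """
    observed = [v for v in values if v is not None]
    missing_share = (len(values) - len(observed)) / len(values)
    if missing_share > DROP_THRESHOLD:
        return None
    fill = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy](observed)
    return [fill if v is None else v for v in values]


values = list(range(1, 20)) + [None]   # 20 entries, 5% missing
print(impute(values)[-1])              # the filled-in value
print(impute([1, 2, 3, None]))         # 25% missing: None, drop the field
```

The median is the default because it is less sensitive to extreme values than the mean; the mode is the natural choice for categorical fields.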
In Alteryx, the Field Summary tool shows the percentage of missing records for each data field.
Follow this link for steps and formulae to deal with missing data in Excel.
Heterogeneous Data
Fields like age, currency, and date have a huge potential for errors due to non-uniform formats. For example, age can be written as 30 years, 30.2 Years, or simply 30. The thousands separator and decimal separator for currency vary by country and region. Make sure that each of these columns has a uniform format.
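Both cases can be normalized with a few lines of Python. This is a sketch under two assumptions: every age entry contains exactly one number, and you know which separator convention each currency column uses. The helper names are invented for the example.

```python
import re


def parse_age(text):
    """Extract the numeric part of entries like '30 years', '30.2 Years', or '30'."""
    match = re.search(r"\d+(?:\.\d+)?", str(text))
    return float(match.group()) if match else None


def parse_amount(text, thousands=",", decimal="."):
    """Convert a formatted amount to float, given that column's separator convention."""
    cleaned = str(text).strip().replace(thousands, "").replace(decimal, ".")
    return float(cleaned)


print([parse_age(a) for a in ["30 years", "30.2 Years", "30"]])
print(parse_amount("1,234.56"))                               # comma thousands, dot decimal
print(parse_amount("1.234,56", thousands=".", decimal=","))   # dot thousands, comma decimal
```

Guessing the convention from the data alone is risky, since "1,234" is valid under both; it is safer to record the convention per source and pass it in explicitly.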

Outliers
Once the dataset is cleaned, it is time to run another preprocessing pass over it. Outliers are unusual values in the dataset that may cause statistical errors in your calculations. They lie abnormally far from the other values in a dataset and can severely distort your output values.

Scatterplots help immensely when you need to quickly identify outliers in your data. Simply visualize the relationship between each predictor variable and the target variable using plots.
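Alongside the visual check, a numeric rule can flag candidates automatically. The sketch below uses the interquartile range (IQR) rule, a common convention that this article does not itself prescribe; the multiplier 1.5 and the sample numbers are assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]


daily_sales = [10, 11, 12, 12, 13, 11, 12, 10, 13, 95]   # illustrative numbers
print(iqr_outliers(daily_sales))   # [95]
```

A flagged value is only a candidate: whether to remove, cap, or keep it depends on whether it is an error or a genuine extreme.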
Here are some links to help you with the terms and methods to deal with outliers:
That was all for part 1. Check out part 2 for the analysis and presentation phases of a data science project. Stay tuned!
Translated from: https://towardsdatascience.com/how-to-get-started-with-any-kind-of-data-part-1-c1746c66bc2d