Data science with Python
by Zhen Liu
Up first: data preprocessing
Do you feel frustrated when searching for syntax breaks your data analytics flow? Why do you still not remember it after looking it up for the third time? It’s because you haven’t practiced it enough to build muscle memory for it yet.
Now, imagine that when you are coding, the Python syntax and functions just fly out from your fingertips, following your analytical thoughts. How great is that! This tutorial is here to help you get there.
I recommend practicing this script every morning for 10 minutes, and repeating it for a week. It’s like doing a few small crunches a day, not for your abs, but for your data science muscles. Gradually, you’ll notice your data analytics programming getting more efficient after this repeated training.
To begin my ‘data science workout’, in this tutorial we’ll practice the most common syntax for data preprocessing as a warm-up session ;)
Contents:
0. Read, View and Save data
1. Table’s Dimension and Data Types
2. Basic Column Manipulation
3. Null Values: View, Delete and Impute
4. Data Deduplication
0. Read, View and Save data
First, load the libraries for our exercise:
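A minimal sketch of the library loading step, assuming the exercise only needs pandas and NumPy (the original import cell is not reproduced here, so the exact list is an assumption):

```python
import numpy as np   # numerical arrays and NaN handling
import pandas as pd  # DataFrames for tabular data

# Optional sanity check that both libraries loaded
print(pd.__version__, np.__version__)
```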
Now we’ll read data from my GitHub repository. I downloaded the data from Zillow.
And the results look like this:
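The read-and-view step can be sketched roughly as follows. The repository URL is not given here, so this sketch reads a tiny in-memory CSV instead; the column names and every value are made up for illustration and only loosely mimic a Zillow home-value table.

```python
import io
import pandas as pd

# Hypothetical stand-in for the Zillow CSV (made-up rows and values)
csv_text = """RegionName,City,State,Metro,CountyName,SizeRank,2018-06,2018-07
60657,Chicago,IL,Chicago,Cook,1,1000000.0,1001000.0
77494,Katy,TX,Houston,Harris,2,330000.0,331000.0
10023,New York,NY,New York,New York,3,,
"""

# read_csv accepts a local path, a URL string, or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))  # first two rows
print(df.tail(1))  # last row
```

With the real data you would pass the file's URL or path to pd.read_csv() in place of the StringIO object.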
To save a DataFrame to a file, use dataframe.to_csv(), passing the output path as the first argument. If you don’t want the index to be saved, add index=False, as in dataframe.to_csv(path, index=False).
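Here is a small sketch of saving without the index and reading the file back to check it. The file name zillow_sample.csv and the two rows are hypothetical.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"City": ["Chicago", "Katy"], "SizeRank": [1, 2]})  # made-up rows

out_path = os.path.join(tempfile.mkdtemp(), "zillow_sample.csv")  # hypothetical file name
df.to_csv(out_path, index=False)  # index=False leaves the row index out of the file

# Reading the file back shows only the two original columns
round_trip = pd.read_csv(out_path)
print(round_trip.columns.tolist())
```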
1. Table’s Dimension and Data Types
1.1 Dimension
How many rows and columns are in this data?
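A quick sketch of checking the dimensions with the shape attribute, on a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame({"City": ["Chicago", "Katy", "New York"], "SizeRank": [1, 2, 3]})  # made-up rows

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows, {n_cols} columns")
```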
1.2 Data Types
What are the data types of your data, and how many columns are numeric?
Output of the first few columns’ data types:
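One way to inspect the data types and count the numeric columns might look like this. The frame is a made-up miniature, and counting via select_dtypes(include="number") is my choice for the sketch, not necessarily what the full script does.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Chicago", "Katy"],       # text column
    "SizeRank": [1, 2],                # integer column
    "2018-07": [1001000.0, 331000.0],  # float column (made-up values)
})

print(df.dtypes)                 # data type of each column
print(df.dtypes.value_counts())  # how many columns share each data type

# Count the numeric columns (integers and floats)
n_numeric = df.select_dtypes(include="number").shape[1]
print(n_numeric)
```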
If you want to be more specific about your data, use select_dtypes() to include or exclude a data type. Question: if I only want to look at 2018’s data, how do I get that?
2. Basic Column Manipulation
2.1 Subset data by columns
Select columns by data types:
For example, if you only want float and integer columns:
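A sketch of selecting by data type with select_dtypes(), again on a made-up frame. I spell the types out as "float64" and "int64", which is what pandas assigns to these columns by default; your real data may use other widths.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Chicago", "Katy"],
    "SizeRank": [1, 2],
    "2018-07": [1001000.0, 331000.0],  # made-up values
})

# Keep only float and integer columns
numeric_df = df.select_dtypes(include=["float64", "int64"])

# The opposite: leave out every numeric column
non_numeric_df = df.select_dtypes(exclude="number")

print(numeric_df.columns.tolist())
print(non_numeric_df.columns.tolist())
```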
Select and drop columns by names:
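Selecting and dropping by name could be sketched like this. The last two lines show one possible answer to the earlier 2018 question, assuming the date columns are named like 'YYYY-MM'; the frame and its values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Chicago", "Katy"],
    "State": ["IL", "TX"],
    "2017-12": [990000.0, 329000.0],  # made-up values
    "2018-01": [995000.0, 330000.0],
})

city_state = df[["City", "State"]]     # select columns by name
no_state = df.drop(columns=["State"])  # drop a column by name (returns a new DataFrame)

# Pick every column whose name starts with '2018'
cols_2018 = [c for c in df.columns if c.startswith("2018")]
df_2018 = df[cols_2018]

print(no_state.columns.tolist())
print(df_2018.columns.tolist())
```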
2.2 Rename Columns
How do I rename the columns if I don’t like them? For example, change ‘State’ to ‘state_’; ‘City’ to ‘city_’:
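A minimal sketch of that rename with DataFrame.rename(), using a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({"State": ["IL", "TX"], "City": ["Chicago", "Katy"]})  # made-up rows

# Map old names to new names; columns not listed keep their names
renamed = df.rename(columns={"State": "state_", "City": "city_"})

print(renamed.columns.tolist())
```

rename() returns a new DataFrame, so assign the result back if you want to keep the new names.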
3. Null Values: View, Delete and Impute
3.1 How many rows and columns have null values?
The outputs of isnull().any() versus isnull().sum():
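To make the difference concrete, here is a sketch on a made-up frame with a few missing values. The row count at the end is an extra I added for the "how many rows" part of the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["Chicago", "Katy", "New York"],
    "Metro": ["Chicago", None, None],
    "2018-07": [1001000.0, np.nan, 331000.0],  # made-up values
})

cols_with_null = df.isnull().any()  # per column: True if any value is missing
null_per_col = df.isnull().sum()    # per column: number of missing values

# Number of rows that have at least one missing value
rows_with_null = df.isnull().any(axis=1).sum()

print(cols_with_null)
print(null_per_col)
print(rows_with_null)
```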
Select rows that aren’t null in one column, for example, rows where ‘Metro’ isn’t null.
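A sketch of that filter using a boolean mask from notnull(), on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Chicago", "Katy", "Houston"],
    "Metro": ["Chicago", None, "Houston"],  # made-up values
})

# Keep only the rows where 'Metro' has a value
metro_known = df[df["Metro"].notnull()]

print(metro_known)
```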
3.2 Select rows that are not null for a fixed set of columns
Select the subset of rows that have no null values in the columns dated after 2000:
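One way this could look, assuming the date columns are named 'YYYY-MM' and reading "after 2000" as "from 2000-01 onward" (both are my assumptions; the values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["Chicago", "Katy", "New York"],
    "1999-12": [np.nan, 150000.0, 400000.0],
    "2000-01": [300000.0, np.nan, 410000.0],
    "2018-07": [1001000.0, 331000.0, 900000.0],
})

# Collect the date columns whose year is 2000 or later
cols_after_2000 = [c for c in df.columns if c[:4].isdigit() and int(c[:4]) >= 2000]

# Drop rows with a missing value in any of those columns; other columns are ignored
complete_after_2000 = df.dropna(subset=cols_after_2000)

print(cols_after_2000)
print(complete_after_2000["City"].tolist())
```

Chicago stays even though its 1999-12 value is missing, because that column is outside the subset.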
If you want to select the data in July, you need to find the columns containing ‘-07’. To see if a string contains a substring, you can use Python’s in operator (substring in string), which evaluates to True or False.
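A sketch of the July selection, assuming 'YYYY-MM' column names and made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Chicago", "Katy"],
    "2017-07": [980000.0, 320000.0],
    "2017-08": [981000.0, 321000.0],
    "2018-07": [1001000.0, 331000.0],
})

print("-07" in "2018-07")  # the in operator on plain strings

# Keep the columns whose name contains '-07', plus 'City' for context
july_cols = [c for c in df.columns if "-07" in c]
july_df = df[["City"] + july_cols]

print(july_df.columns.tolist())
```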
3.3 Subset Rows by Null Values
Select rows that have at least 50 non-NA values, without being specific about which columns:
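This can be sketched with the thresh argument of dropna(). The 60-column frame and its column names (m0 to m59) are made up so the threshold of 50 has something to bite on.

```python
import numpy as np
import pandas as pd

# Made-up frame: 3 rows, 60 columns, all ones to start with
values = np.ones((3, 60))
values[1, :20] = np.nan  # row 1 keeps only 40 non-NA values
values[2, :5] = np.nan   # row 2 keeps 55 non-NA values
df = pd.DataFrame(values, columns=[f"m{i}" for i in range(60)])  # hypothetical column names

# Keep rows with 50 or more non-NA values, whichever columns they are in
at_least_50 = df.dropna(thresh=50)

print(at_least_50.index.tolist())
```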
3.4 Drop and Impute Missing Values
Fill NA or impute NA:
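A few common options, sketched on made-up data. The fill values ('Unknown', 0, the column mean) are illustrative choices, not recommendations for the real Zillow table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Metro": ["Chicago", None, "Houston"],
    "2018-07": [1001000.0, np.nan, 331000.0],  # made-up values
})

dropped = df.dropna()  # drop rows containing any missing value

# Fill each column with its own constant
filled_const = df.fillna({"Metro": "Unknown", "2018-07": 0})

# Impute a numeric column with its mean
imputed_mean = df["2018-07"].fillna(df["2018-07"].mean())

print(filled_const)
print(imputed_mean.tolist())
```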
Fill using your own condition with the where function:
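As a sketch, I am assuming "the where function" means np.where here (pandas also has Series.where, which works differently). The rule below, falling back to June when July is missing, is only an illustration on made-up values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "2018-06": [1000000.0, 330000.0, 900000.0],
    "2018-07": [1001000.0, np.nan, np.nan],
})

# np.where(condition, value_if_true, value_if_false), evaluated element by element:
# where July is missing take June, otherwise keep July
df["2018-07_filled"] = np.where(df["2018-07"].isnull(), df["2018-06"], df["2018-07"])

print(df["2018-07_filled"].tolist())
```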
4. Data Deduplication
We need to make sure there are no duplicated rows before we aggregate data or join it.
We want to see whether there are any duplicated cities/regions. We need to decide what unique ID (city, region) we want to use in the analysis.
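Checking for duplicates on a candidate ID could be sketched with duplicated(). I use 'City' and 'State' as the candidate key purely for illustration, with made-up rows; pick whatever ID fits your analysis.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Springfield", "Springfield", "Chicago", "Chicago"],
    "State": ["IL", "MO", "IL", "IL"],
})

# duplicated() marks every repeat after the first occurrence
dup_city = df.duplicated(subset=["City"]).sum()
dup_city_state = df.duplicated(subset=["City", "State"]).sum()

# True only if the (City, State) pair identifies each row uniquely
is_unique = not df.duplicated(subset=["City", "State"]).any()

print(dup_city, dup_city_state, is_unique)
```

City alone is not a safe ID in this toy frame, since two different Springfields share the name.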
Drop duplicated values.
The ‘CountyName’ and ‘SizeRank’ combination is unique already, so we just use these columns to demonstrate the syntax of drop_duplicates().
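A sketch of the drop_duplicates() syntax. Unlike the real data, this made-up frame deliberately contains one repeated ('CountyName', 'SizeRank') pair so the effect is visible.

```python
import pandas as pd

df = pd.DataFrame({
    "CountyName": ["Cook", "Cook", "Harris"],
    "SizeRank": [1, 1, 2],
    "City": ["Chicago", "Chicago", "Katy"],
})

# Keep the first row for each (CountyName, SizeRank) pair and drop later repeats
deduped = df.drop_duplicates(subset=["CountyName", "SizeRank"], keep="first")

print(deduped.index.tolist())
```

Setting keep="last" keeps the last occurrence instead, and keep=False drops every row that has a duplicate.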
That’s it for the first part of my series on building muscle memory for data science in Python. The full script can be found here.
Stay tuned! My next tutorial will show you how to ‘curl the data science muscles’ for slicing and dicing data.
Follow me and give me a few claps if you find this helpful :)
While you are working on Python, maybe you’ll be interested in my previous article:
Learn Spark for Big Data Analytics in 15 mins! I guarantee you that this short tutorial will save you a TON of time from reading the long documentations. Ready to… (towardsdatascience.com)
Translated from: https://www.freecodecamp.org/news/how-to-build-up-your-muscle-memory-for-data-science-with-python-5960df1c930e/