Data Analysis with Python (Part 3, First Half)

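In the previous article I recorded some of the basics I picked up while getting started with Python, along with a few problems I ran into when actually writing code.

In this article I record how I learned and applied Python for data analysis:

An introduction to the NumPy and pandas packages
Analyzing one- and two-dimensional data with NumPy and pandas
The basic process of data analysis
Hands-on project: analyzing the 2018 quarterly drug sales data of Chaoyang Hospital with Python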

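I. A brief introduction to the NumPy and pandas packages

NumPy and pandas are two widely used Python packages for scientific computing. Both provide array objects that are higher-level than Python lists and more efficient to compute with. They are commonly used to process large amounts of data and to extract and analyze useful metrics from it.

NumPy is short for Numerical Python, and it is currently the most important foundational package for numerical computing in Python. Most computing packages provide scientific functions built on NumPy and use NumPy's array object as the common currency for exchanging data. At the core of NumPy is the ndarray object, which encapsulates N-dimensional arrays of homogeneous data types. A NumPy array has a fixed size when it is created, its elements must all have the same data type, and it can be sliced just like a Python list. Vectorization and broadcasting are the defining features of NumPy.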
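The single-dtype rule, vectorization, and broadcasting mentioned above can be seen in a small sketch. The array names and values here are illustrative assumptions, not data from the article.

```python
import numpy as np

# Mixing ints and a float: NumPy upcasts every element to one common dtype
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# Vectorization: the operation is applied to every element without a Python loop
prices = np.array([10.0, 20.0, 30.0])
doubled = prices * 2
print(doubled)

# Broadcasting: a (3, 1) column and a (4,) row are stretched into a (3, 4) grid
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3, 4])
grid = col + row
print(grid.shape)
```

Broadcasting only works when the shapes are compatible, that is, when each pair of trailing dimensions is either equal or one of them is 1.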

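The data structures and data manipulation tools in pandas are designed to make data cleaning and analysis in Python fast and convenient. pandas is often used together with other numerical computing tools such as NumPy and SciPy, and with data visualization tools such as matplotlib. pandas supports most of NumPy's idiomatic style of array computing. It represents one- and two-dimensional data with the Series and DataFrame objects respectively, which are intuitive and easy to understand. pandas can handle many different data types, can handle missing data, can group and aggregate, and also supports slicing.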

II. Analyzing one- and two-dimensional data with NumPy and pandas

First install the two packages with conda. The installation command is:

conda install numpy pandas

'''
Install the two packages with conda. Installation command:
conda install numpy pandas
'''
# import numpy package
import numpy as np
# import pandas package
import pandas as pd

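1. Analyzing one-dimensional data with NumPy

1.1 Define a one-dimensional array:

Define a one-dimensional array with np.array; the argument passed in is the list [2,3,4,5]

'''
Definition:
a one-dimensional array; the argument passed in is the list [2,3,4,5]
'''
a = np.array([2,3,4,5])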

1.2 Look up an element:

# look up an element by position
a[0]

2

1.3 Slicing: get the elements in a given range of positions

# Slice access: get the elements in a given range of positions
# a[1:3] returns the elements at positions 1 and 2 (the end position 3 is excluded)
a[1:3]

array([3, 4])

1.4 Check the data type:

'''
Detailed reference on dtype:
https://docs.scipy.org/doc/numpy-1.10.1/reference/arrays.dtypes.html
'''
# Check the data type
a.dtype

dtype('int32')

1.5 Statistics: mean

# Statistical calculation
# mean
a.mean()

3.5

1.6 Statistics: standard deviation

# standard deviation
a.std()

1.118033988749895
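The value above is the population standard deviation, because NumPy's std divides by N by default (ddof=0). The sketch below contrasts it with the sample standard deviation (ddof=1), which is what pandas uses by default; the variable names are illustrative.

```python
import numpy as np

a = np.array([2, 3, 4, 5])

# Population standard deviation: squared deviations are divided by N
pop_std = a.std()

# Sample standard deviation: squared deviations are divided by N - 1
sample_std = a.std(ddof=1)

print(pop_std, sample_std)
```

This difference in defaults is worth keeping in mind when comparing a NumPy result with the std reported by a pandas Series.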

1.7 Vectorized operations: multiplying by a scalar

# vectorization: multiply by a scalar
b = np.array([1,2,3])
c = b * 4
c

array([ 4, 8, 12])
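To see why this counts as vectorization, it may help to compare it with what the same operator does to a plain Python list. This is a minimal sketch with illustrative names.

```python
import numpy as np

plain_list = [1, 2, 3]
arr = np.array(plain_list)

# On a list, * repeats the list: the result has 12 elements
repeated = plain_list * 4

# On an array, * multiplies every element: the result still has 3 elements
scaled = arr * 4

print(repeated)
print(scaled)
```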

2. Analyzing two-dimensional data with NumPy

2.1 Define a two-dimensional array:

'''
NumPy two-dimensional data structure:
array
'''
# Define a two-dimensional array
a = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])

2.2 Get an element:

Get the element at row index 0, column index 2

# Get the element at row index 0 and column index 2
a[0,2]

3

2.3 Get a row:

Get the first row

# Get the first row
a[0,:]

array([1, 2, 3, 4])

2.4 Get a column:

Get the first column

# Get the first column
a[:,0]

array([1, 5, 9])

2.5 The NumPy axis parameter: axis

1) If no axis is specified, the mean of the entire array is calculated

'''
If the axis parameter is not specified,
the mean of the entire array is calculated
'''
a.mean()

6.5

2) Calculating along an axis: axis=1 computes the mean of each row

# calculate along an axis: axis=1 computes the mean of each row
a.mean(axis = 1)

array([ 2.5, 6.5, 10.5])

3) Calculating along an axis: axis=0 computes the mean of each column

a.mean(axis = 0)

array([5., 6., 7., 8.])
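One way to remember the axis parameter is that the axis you name is the one that gets collapsed. The sketch below checks this against the shape of the results, using the same 3 x 4 array; the result variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])  # shape (3, 4)

# axis=1 collapses the columns: one value per row, shape (3,)
row_means = a.mean(axis=1)

# axis=0 collapses the rows: one value per column, shape (4,)
col_means = a.mean(axis=0)

# The same rule applies to other reductions such as sum
col_sums = a.sum(axis=0)

print(row_means.shape, col_means.shape, col_sums)
```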

3. Analyzing one-dimensional data with pandas

3.1 Define the pandas one-dimensional data structure:

Define the pandas one-dimensional data structure, the Series

'''
Definition:
the pandas one-dimensional data structure: Series
'''
'''
Stock prices of 6 companies on one day (USD).
Tencent's 427 HKD is equal to 54.74 USD.
'''
stockS = pd.Series([54.74, 190.9, 173.14, 1050.3, 181.86, 1139.49],
                   index = ['tencent','alibaba','apple','google','facebook','amazon'])

3.2 Look up

Display stockS

stockS

tencent 54.74
alibaba 190.90
apple 173.14
google 1050.30
facebook 181.86
amazon 1139.49
dtype: float64

3.3 Get descriptive statistics:

# Get descriptive statistics
stockS.describe()

count 6.000000
mean 465.071667
std 491.183757
min 54.740000
25% 175.320000
50% 186.380000
75% 835.450000
max 1139.490000
dtype: float64
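The 25% figure above is not one of the six prices, because the quartiles are linearly interpolated between neighbouring sorted values. The following sketch reproduces the 25% value by hand; the helper variable names are illustrative, and it assumes the default interpolation of quantile.

```python
import pandas as pd

stockS = pd.Series(
    [54.74, 190.9, 173.14, 1050.3, 181.86, 1139.49],
    index=['tencent', 'alibaba', 'apple', 'google', 'facebook', 'amazon'],
)

# What pandas reports for the first quartile
q25 = stockS.quantile(0.25)

# The same value by hand: find the fractional position in the sorted data
ordered = sorted(stockS)
pos = 0.25 * (len(ordered) - 1)          # 1.25
lower = ordered[int(pos)]                # value at position 1
upper = ordered[int(pos) + 1]            # value at position 2
by_hand = lower + (pos - int(pos)) * (upper - lower)

print(q25, by_hand)
```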

3.4 The iloc attribute gets a value by position

stockS.iloc[0]

54.74

3.5 The loc attribute gets a value by index label

# loc attribute: gets a value by index label
stockS.loc['tencent']

54.74
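With string labels the two attributes are hard to confuse. The difference matters most when the index labels are themselves integers, as in this small sketch (the Series here is a made-up example, not data from the article).

```python
import pandas as pd

# Integer labels that do not match the positions
s = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

# iloc counts positions: position 0 is the first element
by_position = s.iloc[0]

# loc looks up labels: the label 0 belongs to the second element
by_label = s.loc[0]

print(by_position, by_label)
```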

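3.6 Vectorized operations: adding two Series

# vectorization: adding two Series
# values are aligned by index label; a label present in only one Series yields NaN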
s1 = pd.Series([1,2,3,4], index = ['a','b','c','d'])
s2 = pd.Series([10,20,30,40], index = ['a','b','e','f'])
s3 = s1 + s2
s3

a 11.0
b 22.0
c NaN
d NaN
e NaN
f NaN
dtype: float64

3.7 Drop missing values

# Method 1: drop the missing values
s3.dropna()

a 11.0
b 22.0
dtype: float64

3.8 Fill missing values

# Method 2: fill the missing values with 0 before adding
s3 = s2.add(s1, fill_value = 0)
s3

a 11.0
b 22.0
c 3.0
d 4.0
e 30.0
f 40.0
dtype: float64
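Filling before the addition is not the same as filling after it. The sketch below compares add with fill_value against calling fillna on the finished sum; the result variable names are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'e', 'f'])

# Fill before adding: a label missing from one Series counts as 0,
# so the value from the other Series survives
filled_before = s1.add(s2, fill_value=0)

# Fill after adding: the sum is already NaN, so the original values are lost
filled_after = (s1 + s2).fillna(0)

print(filled_before)
print(filled_after)
```

Which of the two is appropriate depends on whether a missing label should mean "zero" or "unknown" in the data at hand.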

4. Analyzing two-dimensional data with pandas

The pandas two-dimensional structure: the DataFrame

4.1 Define a DataFrame

'''
pandas two-dimensional structure: DataFrame
'''
# Step 1: define a dict that maps column names to their values
salesDict = {
    'medecine purchased date': ['01-01-2018 FRI','02-01-2018 SAT','06-01-2018 WED'],
    'social security card number': ['001616528','001616528','0012602828'],
    'commodity code': [236701,236701,236701],
    'commodity name': ['strong yinqiao VC tablets', 'hot detoxify clearing oral liquid','GanKang compound paracetamol and amantadine hydrochloride tablets'],
    'quantity sold': [6,1,2],
    'amount receivable': [82.8,28,16.8],
    'amount received': [69,24.64,15]}
# import OrderedDict
from collections import OrderedDict
# Define an OrderedDict
salesOrderDict = OrderedDict(salesDict)
# Define the DataFrame: pass in the ordered dict
salesDf = pd.DataFrame(salesOrderDict)
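On Python 3.7 and later, a plain dict already preserves insertion order, so the OrderedDict step is optional there. A minimal sketch, using a shortened version of the sales data (the two-row subset is an illustrative assumption):

```python
import pandas as pd

# A plain dict keeps its keys in the order they were written (Python 3.7+)
sales = {
    'commodity name': ['strong yinqiao VC tablets', 'hot detoxify clearing oral liquid'],
    'quantity sold': [6, 1],
    'amount received': [69.0, 24.64],
}

df = pd.DataFrame(sales)

# The DataFrame columns follow the dict's key order
columns = list(df.columns)
print(columns)
```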

4.2 View

salesDf

4.3 Mean

The mean is computed for each column

# mean: calculated column by column (numeric columns only)
salesDf.mean(numeric_only=True)

commodity code 236701.000000
quantity sold 3.000000
amount receivable 42.533333
amount received 36.213333
dtype: float64
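Only the numeric columns appear in the result above. In recent pandas releases (2.0 and later, as far as I am aware) calling mean() on a frame that also holds text columns raises a TypeError unless the text columns are excluded. The sketch below shows two ways to restrict the calculation, on a reduced copy of the sales data with illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({
    'commodity name': ['strong yinqiao VC tablets',
                       'hot detoxify clearing oral liquid',
                       'GanKang compound paracetamol and amantadine hydrochloride tablets'],
    'quantity sold': [6, 1, 2],
    'amount receivable': [82.8, 28, 16.8],
    'amount received': [69, 24.64, 15],
})

# Option 1: ask mean() to skip non-numeric columns
col_means = df.mean(numeric_only=True)

# Option 2: select the numeric columns first, then average them
numeric_means = df.select_dtypes(include='number').mean()

print(col_means)
```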

4.4 Querying data: the iloc attribute gets values by position

1) Get the element in the first row, second column

'''
The iloc attribute gets values by position
'''
# get the element in the 1st row and 2nd column
salesDf.iloc[0,1]

'001616528'

2) Get the first row; the colon stands for all columns

# Get the first row: the colon selects every column
salesDf.iloc[0,:]

medecine purchased date 01-01-2018 FRI
social security card number 001616528
commodity code 236701
commodity name strong yinqiao VC tablets
quantity sold 6
amount receivable 82.8
amount received 69
Name: 0, dtype: object

3) Get the first column; the colon stands for all rows

# Get the first column: the colon selects every row
salesDf.iloc[:,0]

0 01-01-2018 FRI
1 02-01-2018 SAT
2 06-01-2018 WED
Name: medecine purchased date, dtype: object

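4.5 Querying data: the loc attribute gets values by index label

1) Get the element in the first row of the 'medecine purchased date' column

'''
The loc attribute gets values by index label
'''
# Get the element in the first row, first column
salesDf.loc[0,'medecine purchased date']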

'01-01-2018 FRI'

2) Get the 'commodity code' column

# Get the whole 'commodity code' column
# Method 1:
salesDf.loc[:,'commodity code']

0 236701
1 236701
2 236701
Name: commodity code, dtype: int64

3) Shortcut: get the 'commodity code' column

# Get the whole 'commodity code' column
# Method 2: shortcut
salesDf['commodity code']

0 236701
1 236701
2 236701
Name: commodity code, dtype: int64

4.6 Complex DataFrame queries: slicing

1) Select several columns with a list

# Select several columns with a list
salesDf[['commodity name','quantity sold']]

2) Get a range of columns with a slice

# Get a range of columns with a label slice (both endpoints are included)
salesDf.loc[:,'medecine purchased date':'quantity sold']
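Label slices with loc include the end label, while position slices with iloc exclude the end position, as ordinary Python slices do. A small sketch on a made-up four-column frame (the column names a to d are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]})

# loc slices by label and includes the end label 'c'
by_label = df.loc[:, 'a':'c']

# iloc slices by position and excludes the end position 2
by_position = df.iloc[:, 0:2]

print(list(by_label.columns))
print(list(by_position.columns))
```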

4.7 Complex DataFrame queries: conditions

1) Filtering by condition, step 1: build the query condition

# Filter by condition
# Step 1: build the query condition
querySer = salesDf.loc[:,'quantity sold'] > 1
type(querySer)

pandas.core.series.Series

querySer

0 True
1 False
2 True
Name: quantity sold, dtype: bool

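2) Filtering by condition, step 2: apply the condition to select the matching rows

# Step 2: keep the rows where the condition is True
salesDf.loc[querySer,:]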
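Several conditions can be combined into one boolean Series. The sketch below assumes a reduced two-column version of the sales data; note that each condition needs its own parentheses and that the element-wise operators are & and |, not the Python keywords and / or.

```python
import pandas as pd

df = pd.DataFrame({
    'quantity sold': [6, 1, 2],
    'amount received': [69, 24.64, 15],
})

# Both conditions must hold for a row to be kept
mask = (df['quantity sold'] > 1) & (df['amount received'] > 20)

# Use the combined boolean Series as the row selector
selected = df.loc[mask, :]
print(selected)
```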

4.8 View descriptive statistics of a dataset

1) Read the Excel data

# Read data from Excel
fileNameStr = 'C:UsersUSERDesktop#3Python3_The basic process of data analysisSales data of Chaoyang Hospital in 2018  - Copy.xlsx'
xls = pd.ExcelFile(fileNameStr)
salesDf = xls.parse('Sheet1')

2) Print the first 3 rows to make sure the data loaded correctly

# Print the first three rows to make sure the data loaded correctly
salesDf.head(3)

3) Get the number of rows and columns

salesDf.shape

(6578, 7)

4) Check the data type of one column

# Check the data type of one column
salesDf.loc[:,'quantity sold'].dtype

dtype('float64')
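A quantity column reported as float64 rather than an integer type is often a sign that it contains missing values, since NaN is itself a float. Whether that is the cause in this particular file is an assumption; the sketch below only demonstrates the general behaviour on made-up Series.

```python
import numpy as np
import pandas as pd

# A column with no gaps keeps an integer dtype
complete = pd.Series([6, 1, 2])

# A single missing value forces the whole column to float64
with_gap = pd.Series([6, np.nan, 2])

print(complete.dtype, with_gap.dtype)
print(with_gap.isna().sum())
```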

5) View the summary statistics of each column

# Check the statistics for each column
salesDf.describe()

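In the next article I will continue with the second half:

The basic process of data analysis
Hands-on project: analyzing the 2018 quarterly drug sales data of Chaoyang Hospital with Python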
