Python from Scratch, Chapter 2 (1): Chi-Square Test (Python)

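What if we want to determine whether the association between two categorical variables is statistically significant? This is where the chi-square test of independence is useful.

Chi-Square Test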

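We will look at census data from 1994. Specifically, we are interested in the relationship between 'sex' and 'hours-per-week'. In our case, each person can have only one 'sex' and can fall into only one hours-worked category. For this example, we use pandas to convert the numeric column 'hours-per-week' into a categorical column. We then assign 'sex' and 'hours_per_week_categories' to a new DataFrame.

# -*- coding: utf-8 -*-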

"""
Created on Sun Feb 3 19:24:55 2019

@author: czh
"""

# In[*]
import matplotlib.pyplot as plt
import numpy as np
import math
import seaborn as sns
import pandas as pd
%matplotlib inline
import os
os.chdir('D:\\train')

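# In[*]
cols = ['age', 'workclass', 'fnlwg', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'sex',
        'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
# Fields are separated by a comma followed by a space; a multi-character
# separator needs the Python parsing engine.
data = pd.read_csv('census.csv', names=cols, sep=', ', engine='python')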

# In[*]
# Create a column for work hour categories.
def process_hours(df):
    cut_points = [0, 9, 19, 29, 39, 49, 1000]
    label_names = ["0-9", "10-19", "20-29", "30-39", "40-49", "50+"]
    df["hours_per_week_categories"] = pd.cut(df["hours-per-week"],
                                             cut_points, labels=label_names)
    return df

# In[*]
data = process_hours(data)
workhour_by_sex = data[['sex', 'hours_per_week_categories']]
workhour_by_sex.head()

      sex hours_per_week_categories
0    Male                     40-49
1    Male                     10-19
2    Male                     40-49
3    Male                     40-49
4  Female                     40-49
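To see how the cut points map hours onto labels, here is a minimal sketch on a handful of made-up values (the `hours` series is hypothetical and not taken from the census file). It assumes the default `right=True` behaviour of `pd.cut`, where each bin includes its right edge but not its left one.

```python
import pandas as pd

# Toy values (hypothetical, not from the census file) chosen to sit on bin edges.
hours = pd.Series([1, 9, 10, 39, 40, 49, 50, 99])

cut_points = [0, 9, 19, 29, 39, 49, 1000]
label_names = ["0-9", "10-19", "20-29", "30-39", "40-49", "50+"]

# With the default right=True, each bin is (left, right]:
# 9 falls in "0-9", 10 in "10-19", 49 in "40-49", 50 in "50+".
binned = pd.cut(hours, cut_points, labels=label_names)
print(pd.DataFrame({"hours": hours, "category": binned}))
```

Note that a value of exactly 0 would fall outside the first bin and become NaN unless `include_lowest=True` is passed.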

Check the counts in each column:

workhour_by_sex['sex'].value_counts()

Out[31]:
Male      21790
Female    10771
Name: sex, dtype: int64

workhour_by_sex['hours_per_week_categories'].value_counts()

Out[33]:
40-49    18336
50+       6462
30-39     3667
20-29     2392
10-19     1246
0-9        458
Name: hours_per_week_categories, dtype: int64

Null Hypothesis

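Recall that we want to know whether there is a relationship between 'sex' and 'hours_per_week_categories'. To find out, we use the chi-square test. First, let us state the null and alternative hypotheses.

H0: There is no statistically significant relationship between sex and hours worked per week.

H1: There is a statistically significant relationship between sex and hours worked per week.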

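The next step is to format the data as a table of frequency counts. This is called a contingency table, and we can build one with the pd.crosstab() function in pandas.

contingency_table = pd.crosstab(
    workhour_by_sex['sex'],
    workhour_by_sex['hours_per_week_categories'],
    margins=True
)
contingency_table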

Out[34]:
hours_per_week_categories  0-9  10-19  20-29  30-39  40-49   50+    All
sex
Female                     235    671   1287   1914   5636  1028  10771
Male                       223    575   1105   1753  12700  5434  21790
All                        458   1246   2392   3667  18336  6462  32561

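Each cell in this table is a frequency count. For example, the intersection of the 'Male' row and the '10-19' column is the number of men in our sample who work 10-19 hours per week. The intersection of the 'All' row and the '50+' column is the total number of people who work 50 or more hours per week.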
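As a small illustration of reading such a table, the sketch below builds a contingency table from six made-up records (the `toy` DataFrame is hypothetical) and looks up one cell and one margin by label.

```python
import pandas as pd

# Toy records (hypothetical) in the same shape as workhour_by_sex.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "hours_per_week_categories": ["40-49", "10-19", "10-19", "50+", "40-49", "40-49"],
})

table = pd.crosstab(toy["sex"], toy["hours_per_week_categories"], margins=True)
print(table)

# A single cell: men working 10-19 hours.
print(table.loc["Male", "10-19"])
# A margin: everyone working 40-49 hours, regardless of sex.
print(table.loc["All", "40-49"])
```

Looking cells up by label with `.loc` avoids depending on the row order, which `pd.crosstab` sorts alphabetically.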

# In[*]
# Assign the frequency values. pd.crosstab sorts the index alphabetically,
# so the rows are selected by label rather than by position.
femalecount = contingency_table.loc['Female'].iloc[0:6].values
malecount = contingency_table.loc['Male'].iloc[0:6].values
# Plot the stacked bar chart: female counts at the bottom, male counts on top.
fig = plt.figure(figsize=(10, 5))
sns.set(font_scale=1.8)
categories = ["0-9", "10-19", "20-29", "30-39", "40-49", "50+"]
p1 = plt.bar(categories, femalecount, 0.55, color='#d62728')
p2 = plt.bar(categories, malecount, 0.55, bottom=femalecount)
plt.legend((p2[0], p1[0]), ('Male', 'Female'))
plt.xlabel('Hours per Week Worked')
plt.ylabel('Count')
plt.show()

[Figure: stacked bar chart of counts per hours-per-week category, split by sex]

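The chart above shows the sample data from the census. If there were truly no relationship between sex and hours worked per week, the data would show the same ratio of men to women in every hours category. For example, if 5% of women worked 50+ hours, we would expect the same percentage of men to work 50+ hours.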
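The "same ratio in every category" idea can be turned into numbers. The sketch below is one way to compute the expected counts under independence and the Pearson chi-square statistic by hand with NumPy; the observed counts are copied from the Female and Male rows of the contingency table, and the variable names are illustrative choices rather than anything from the original code.

```python
import numpy as np

# Observed counts copied from the contingency table (rows: Female, Male).
observed = np.array([
    [235, 671, 1287, 1914, 5636, 1028],
    [223, 575, 1105, 1753, 12700, 5434],
])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 6)
total = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected = row_totals * col_totals / total

# Pearson chi-square statistic and degrees of freedom.
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(np.round(expected, 1))
print(chi2, dof)
```

The statistic sums, over all twelve cells, the squared gap between observed and expected counts scaled by the expected count, so large gaps in well-populated cells dominate the result.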

Chi-Square Test with SciPy

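Computing the expected counts and the test statistic by hand is tedious, so it is time for a shortcut: SciPy's chi2_contingency function does the work for us.

# Observed counts: the Female and Male rows, without the 'All' margins.
f_obs = contingency_table.iloc[0:2, 0:6].values
f_obs
from scipy import stats
stats.chi2_contingency(f_obs)[0:3]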

Out[38]: (2287.190943926107, 0.0, 5)

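The test statistic is about 2287.19, the p-value is approximately 0 (it is printed as 0.0 because it is too small to represent as a float), and there are 5 degrees of freedom.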
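For readability, the result can also be unpacked into named values. The following is a minimal sketch that re-enters the observed counts directly so it runs on its own; `alpha` and `p_from_sf` are names introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Same observed counts as the contingency table (rows: Female, Male).
f_obs = np.array([
    [235, 671, 1287, 1914, 5636, 1028],
    [223, 575, 1105, 1753, 12700, 5434],
])

chi2, p_value, dof, expected = stats.chi2_contingency(f_obs)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
assert dof == (f_obs.shape[0] - 1) * (f_obs.shape[1] - 1)

# The p-value is the upper-tail probability of the chi-square distribution.
p_from_sf = stats.chi2.sf(chi2, dof)

alpha = 0.05  # conventional significance level
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3g}, reject H0: {p_value < alpha}")
```

The fourth returned value, `expected`, holds the counts the test would expect if sex and hours worked were independent.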

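Conclusion

Because the p-value is below 0.05, we can reject the null hypothesis: there is a statistically significant association between 'sex' and 'hours-per-week'. The test does not tell us what that relationship is, only that the two variables are not independent of each other.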
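One possible way to get a first hint about where the dependence comes from is to look at the per-cell Pearson residuals. This is an illustrative follow-up and not part of the original analysis; the counts are again copied from the contingency table.

```python
import numpy as np

categories = ["0-9", "10-19", "20-29", "30-39", "40-49", "50+"]
observed = np.array([
    [235, 671, 1287, 1914, 5636, 1028],     # Female
    [223, 575, 1105, 1753, 12700, 5434],    # Male
])

expected = (observed.sum(axis=1, keepdims=True)
            * observed.sum(axis=0, keepdims=True) / observed.sum())

# Pearson residual per cell: positive means more people than independence predicts.
residuals = (observed - expected) / np.sqrt(expected)

for sex, row in zip(["Female", "Male"], residuals):
    print(sex, dict(zip(categories, np.round(row, 1))))
```

The squared residuals add up to the chi-square statistic, so the cells with the largest absolute residuals contribute most to rejecting the null hypothesis.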
