對于有多個特征的數據,我們一般的處理方式是構建特征函數,計算每個特征向量的系數,從而將其影響納入到研究量中,但對于簡單的問題,也這樣做的話未免有點小題大做。這時我們可以考慮用CMH來分析變量在每個特征下的影響,這個方法可以通過分層控制不同的無關特征和變量,凸顯變量真實的關聯關系。
以下是一個例子:
set.seed(123)
n <- 500 # 增大樣本量
Age <- sample(c("Young", "Middle", "Old"), n, replace = TRUE, prob = c(0.3, 0.4, 0.3))
Drug <- sample(c("A", "B"), n, replace = TRUE)# 改進:藥物B在Old組更可能有效,但允許例外
Effect <- ifelse((Drug == "B" & Age == "Old" & runif(n) > 0.2) | # 80% 有效(Drug == "A" & Age == "Young" & runif(n) > 0.3) | # 70% 有效(Age == "Middle" & Drug == "B" & runif(n) > 0.6) | # Middle組B藥40%有效(runif(n) > 0.9), # 10% 的全局隨機有效"Improved", "Not Improved"
)
df <- data.frame(Age, Drug, Effect)
head(df)# 三維列聯表(Age × Drug × Effect)
table_array <- table(df$Drug, df$Effect, df$Age)
table_array# 使用mantelhaen.test()
result <- mantelhaen.test(table_array)
resultlibrary(ggplot2)
ggplot(df, aes(x = Drug, fill = Effect)) +geom_bar(position = "fill") +facet_wrap(~ Age) +labs(y = "Proportion") +theme_minimal()
輸出:
, , = MiddleImproved Not ImprovedA 15 84B 51 59, , = OldImproved Not ImprovedA 7 63B 60 18, , = YoungImproved Not ImprovedA 52 24B 5 62 Mantel-Haenszel chi-squared test with continuity correctiondata: table_array
Mantel-Haenszel X-squared = 12.072, df = 1, p-value = 0.000512
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:0.4232736 0.8208328
sample estimates:
common odds ratio 0.5894378
從輸出可以看到,在Middle和Old組藥物B更有效,Young組則是藥物A,而檢驗的結果p為0.0005,說明在調查年齡分組后,藥物與療效的關系十分顯著,而公共比值則意味著使用藥物B的患者獲得改善的幾率是藥物A的0.59倍。