pandas銷售數據分析
數據保存在data目錄
- 消費者數據:customers.csv
- 商品數據:products.csv
- 交易數據:transactions.csv
customers.csv數據結構:
字段 | 描述 |
---|---|
customer_id | 客戶ID |
gender | 性別 |
age | 年齡 |
region | 地區 |
membership_date | 會員日期 |
products.csv數據結構:
字段 | 描述 |
---|---|
product_id | 產品ID |
category | 產品類別 |
brand | 品牌 |
price | 價格 |
transactions.csv數據結構:
字段 | 描述 |
---|---|
transaction_id | 交易ID |
customer_id | 客戶ID |
product_id | 產品ID |
quantity | 購買數量 |
transaction_date | 交易日期 |
price | 交易價格 |
amount | 交易金額 |
加載CSV數據,編寫代碼完成以下需求:
- 計算每個客戶的總消費金額
- 計算每個客戶的平均訂單金額
- 按產品類別統計銷售總額和銷售量
- 按性別統計客戶數量
- 創建年齡分布直方圖數據
- 計算每個月的銷售總額(時間序列分析)
- 找出最暢銷的10種產品
- 找出消費最高的10個客戶
- 計算不同品牌產品的平均價格
- 創建產品類別和性別之間的交叉表
- 創建產品類別和年齡組之間的交叉表
- 創建區域和產品類別之間的交叉表
- 創建性別和區域之間的交叉表
- 計算每個客戶的首次購買日期和最近購買日期
- 計算客戶生命周期價值(CLV)假設為一年
- 創建一個透視表,顯示每個區域、每個類別的銷售總額
- 創建一個透視表,顯示每個月、每個類別的銷售總額
- 創建一個透視表,顯示每個區域、每個性別在各個類別上的平均消費
- 計算每個客戶的購買頻率(每年購買次數)
- 分析會員時長與消費金額之間的關系
導包
import pandas as pd # 導入pandas庫,用于數據處理和分析
import numpy as np # 導入numpy庫,用于數值計算
from datetime import datetime, timedelta # 導入datetime和timedelta模塊,用于處理日期和時間
import os # 導入os庫,用于操作系統相關功能,如文件和目錄操作# 創建results目錄(如果不存在)
os.makedirs('results', exist_ok=True)
加載數據
# 加載數據
customers = pd.read_csv('data/customers.csv') # 加載客戶數據
products = pd.read_csv('data/products.csv') # 加載產品數據
transactions = pd.read_csv('data/transactions.csv') # 加載交易數據# 將日期列轉換為datetime類型
customers['membership_date'] = pd.to_datetime(customers['membership_date']) # 將會員日期列轉換為datetime類型
transactions['transaction_date'] = pd.to_datetime(transactions['transaction_date']) # 將交易日期列轉換為datetime類型
1. 計算每個客戶的總消費金額
# 1. 計算每個客戶的總消費金額
customer_spending = transactions.groupby('customer_id')['amount'].sum().reset_index() # 按客戶ID分組,計算每個客戶的總消費金額
customer_spending.columns = ['customer_id', 'total_spending'] # 重命名列名
customer_spending.to_csv('results/customer_spending.csv', index=False) # 將結果保存為CSV文件,不保存索引
print(customer_spending)
customer_id total_spending
0 1 1124.02
1 2 1710.33
2 3 1595.10
3 4 3881.27
4 5 3871.16
.. ... ...
995 996 1880.70
996 997 2683.27
997 998 1631.01
998 999 2473.05
999 1000 2608.60[1000 rows x 2 columns]
2. 計算每個客戶的平均訂單金額
# 2. 計算每個客戶的平均訂單金額
avg_order_amount = transactions.groupby('customer_id').agg(total_spending=('amount', 'sum'), # 計算每個客戶的總消費金額num_transactions=('transaction_id', 'count') # 計算每個客戶的交易次數
).reset_index()
avg_order_amount['avg_order_amount'] = avg_order_amount['total_spending'] / avg_order_amount['num_transactions'] # 計算平均訂單金額
avg_order_amount.to_csv('results/avg_order_amount.csv', index=False) # 將結果保存為CSV文件,不保存索引
print(avg_order_amount)
customer_id total_spending num_transactions avg_order_amount
0 1 1124.02 10 112.402000
1 2 1710.33 9 190.036667
2 3 1595.10 11 145.009091
3 4 3881.27 14 277.233571
4 5 3871.16 13 297.781538
.. ... ... ... ...
995 996 1880.70 10 188.070000
996 997 2683.27 11 243.933636
997 998 1631.01 9 181.223333
998 999 2473.05 10 247.305000
999 1000 2608.60 8 326.075000[1000 rows x 4 columns]
3. 按產品類別統計銷售總額和銷售量
# 3. 按產品類別統計銷售總額和銷售量
category_sales = pd.merge(transactions, products, on='product_id', how='left') # 將交易數據和產品數據按產品ID合并
category_sales = category_sales.groupby('category').agg(total_sales=('amount', 'sum'), # 按產品類別分組,計算銷售總額total_quantity=('quantity', 'sum') # 按產品類別分組,計算銷售量
).reset_index()
category_sales.to_csv('results/category_sales.csv', index=False) # 將結果保存為CSV文件,不保存索引
print(category_sales)
category total_sales total_quantity
0 Books 279650.98 5903
1 Clothing 453744.32 4887
2 Electronics 552730.76 7566
3 Food 306883.36 5883
4 Home 382538.62 5863
4. 按性別統計客戶數量
# 4. 按性別統計客戶數量
gender_distribution = customers['gender'].value_counts().reset_index() # 統計不同性別的客戶數量
gender_distribution.columns = ['gender', 'count'] # 重命名列名
gender_distribution.to_csv('results/gender_distribution.csv', index=False) # 將結果保存為CSV文件,不保存索引
5. 創建年齡分布直方圖數據
# 5. 創建年齡分布直方圖數據
age_bins = [18, 25, 35, 45, 55, 65, 80] # 定義年齡分組區間
age_labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '66+'] # 定義年齡分組標簽
customers['age_group'] = pd.cut(customers['age'], bins=age_bins, labels=age_labels) # 將客戶年齡分組
age_distribution = customers['age_group'].value_counts().sort_index().reset_index() # 統計每個年齡組的客戶數量
age_distribution.columns = ['age_group', 'count'] # 重命名列名
age_distribution.to_csv('results/age_distribution.csv', index=False) # 將結果保存為CSV文件,不保存索引
print(age_distribution)
age_group count
0 18-25 132
1 26-35 349
2 36-45 325
3 46-55 126
4 56-65 21
5 66+ 0
6. 計算每個月的銷售總額(時間序列分析)
# 6. 計算每個月的銷售總額(時間序列分析)
transactions['month'] = transactions['transaction_date'].dt.to_period('M') # 提取交易日期的月份
monthly_sales = transactions.groupby('month')['amount'].sum().reset_index() # 按月份分組,計算每個月的銷售總額
monthly_sales['month'] = monthly_sales['month'].astype(str) # 將月份轉換為字符串類型
monthly_sales.to_csv('results/monthly_sales.csv', index=False) # 將結果保存為CSV文件,不保存索引
print(monthly_sales)
month amount
0 2024-07 120410.93
1 2024-08 163879.27
2 2024-09 147267.93
3 2024-10 168343.63
4 2024-11 159979.17
5 2024-12 180396.79
6 2025-01 161902.65
7 2025-02 147212.28
8 2025-03 170734.04
9 2025-04 168111.73
10 2025-05 177653.05
11 2025-06 163866.35
12 2025-07 45790.22
7. 找出最暢銷的10種產品
# 7. 找出最暢銷的10種產品
top_products = transactions.groupby('product_id').agg(total_quantity=('quantity', 'sum'), # 按產品ID分組,計算每種產品的銷售總量total_sales=('amount', 'sum') # 按產品ID分組,計算每種產品的銷售總額
).reset_index()
top_products = pd.merge(top_products, products[['product_id', 'category', 'brand']], on='product_id', how='left') # 將產品信息合并到統計結果中
top_products = top_products.sort_values('total_sales', ascending=False).head(10) # 按銷售總額降序排序,取前10種產品
top_products.to_csv('results/top_products.csv', index=False) # 將結果保存為CSV文件,不保存索引
print(top_products)
product_id total_quantity total_sales category brand
38 39 627 132033.66 Clothing BrandB
36 37 623 103498.99 Electronics BrandE
15 16 605 92038.65 Clothing BrandE
26 27 588 87329.76 Home BrandE
40 41 645 82269.75 Electronics BrandA
48 49 518 69924.82 Food BrandD
23 24 659 69419.06 Electronics BrandA
4 5 684 54473.76 Electronics BrandD
25 26 606 54412.74 Electronics BrandE
10 11 654 53850.36 Electronics BrandD
8. 找出消費最高的10個客戶
# 8. 找出消費最高的10個客戶
top_customers = customer_spending.sort_values('total_spending', ascending=False).head(10) # 按總消費金額降序排序,取前10個客戶
top_customers = pd.merge(top_customers, customers[['customer_id', 'gender', 'age', 'region']], on='customer_id', how='left') # 將客戶信息合并到統計結果中
top_customers.to_csv('results/top_customers.csv', index=False) # 將結果保存為CSV文件,不保存索引
print(top_customers)
customer_id total_spending gender age region
0 903 4568.08 Female 23 East
1 763 4421.28 Male 50 West
2 708 4409.81 Female 35 North
3 18 4405.35 Female 35 West
4 841 4353.57 Male 35 East
5 421 4266.48 Female 37 West
6 694 4037.89 Female 43 East
7 870 3987.35 Female 22 North
8 791 3925.87 Female 18 East
9 741 3888.52 Male 38 West
9. 計算不同品牌產品的平均價格
# 9. 計算不同品牌產品的平均價格
brand_prices = products.groupby('brand')['price'].agg(['mean', 'min', 'max', 'std']).reset_index() # 按品牌分組,計算每種品牌產品的平均價格、最低價格、最高價格和標準差
brand_prices.columns = ['brand', 'avg_price', 'min_price', 'max_price', 'price_std'] # 重命名列名
brand_prices.to_csv('results/brand_prices.csv', index=False) # 將結果保存為CSV文件,不保存索引
print(brand_prices)
brand avg_price min_price max_price price_std
0 BrandA 62.249000 24.06 127.55 35.000554
1 BrandB 75.945000 17.22 210.58 70.256401
2 BrandC 48.707143 24.97 71.86 17.941207
3 BrandD 59.621875 14.78 134.99 28.791657
4 BrandE 80.880909 22.71 166.13 51.561905
10. 創建產品類別和性別之間的交叉表
# 10. 創建產品類別和性別之間的交叉表
merged_data = pd.merge(transactions, products, on='product_id', how='left') # 將交易數據和產品數據按產品ID合并
merged_data = pd.merge(merged_data, customers, on='customer_id', how='left') # 將合并后的數據和客戶數據按客戶ID合并
category_gender_crosstab = pd.crosstab(merged_data['category'], merged_data['gender']) # 創建產品類別和性別之間的交叉表
category_gender_crosstab.to_csv('results/category_gender_crosstab.csv') # 將結果保存為CSV文件
print(category_gender_crosstab)
gender Female Male
category
Books 998 964
Clothing 835 803
Electronics 1273 1207
Food 1050 918
Home 976 976
11. 創建產品類別和年齡組之間的交叉表
category_age_crosstab = pd.crosstab(merged_data['category'], merged_data['age_group']) # 創建產品類別和年齡組之間的交叉表
category_age_crosstab.to_csv('results/category_age_crosstab.csv') # 將結果保存為CSV文件
print(category_age_crosstab)
age_group 18-25 26-35 36-45 46-55 56-65
category
Books 267 670 623 273 44
Clothing 231 581 499 213 38
Electronics 337 884 796 308 50
Food 269 662 664 254 32
Home 251 709 616 232 39
12. 創建區域和產品類別之間的交叉表
# 12. 創建區域和產品類別之間的交叉表
region_category_crosstab = pd.crosstab(merged_data['region'], merged_data['category']) # 創建區域和產品類別之間的交叉表
region_category_crosstab.to_csv('results/region_category_crosstab.csv') # 將結果保存為CSV文件
print(region_category_crosstab)
category Books Clothing Electronics Food Home
region
East 474 391 592 467 467
North 448 390 588 463 468
South 472 419 601 485 465
West 568 438 699 553 552
13. 創建性別和區域之間的交叉表
# 13. 創建性別和區域之間的交叉表
gender_region_crosstab = pd.crosstab(customers['gender'], customers['region']) # 創建性別和區域之間的交叉表
gender_region_crosstab.to_csv('results/gender_region_crosstab.csv') # 將結果保存為CSV文件
print(gender_region_crosstab)
region East North South West
gender
Female 117 119 133 141
Male 123 121 110 136
14. 計算每個客戶的首次購買日期和最近購買日期
# 14. 計算每個客戶的首次購買日期和最近購買日期
customer_dates = transactions.groupby('customer_id').agg(first_purchase_date=('transaction_date', 'min'), # 按客戶ID分組,計算每個客戶的首次購買日期last_purchase_date=('transaction_date', 'max') # 按客戶ID分組,計算每個客戶的最近購買日期
).reset_index()
customer_dates.to_csv('results/customer_dates.csv', index=False) # 將結果保存為CSV文件,不保存索引
print(customer_dates)
customer_id first_purchase_date last_purchase_date
0 1 2024-08-06 16:32:05.579393 2025-03-07 16:32:05.579393
1 2 2024-11-02 16:32:05.579393 2025-07-04 16:32:05.579393
2 3 2024-07-11 16:32:05.579393 2025-05-11 16:32:05.579393
3 4 2024-07-15 16:32:05.579393 2025-07-03 16:32:05.579393
4 5 2024-08-16 16:32:05.579393 2025-07-08 16:32:05.579393
.. ... ... ...
995 996 2024-07-11 16:32:05.579393 2025-06-24 16:32:05.579393
996 997 2024-08-19 16:32:05.579393 2025-07-06 16:32:05.579393
997 998 2024-08-26 16:32:05.579393 2025-07-05 16:32:05.579393
998 999 2024-08-11 16:32:05.579393 2025-06-23 16:32:05.579393
999 1000 2024-08-01 16:32:05.579393 2025-04-09 16:32:05.579393[1000 rows x 3 columns]
15. 計算客戶生命周期價值(CLV)假設為一年
# 15. 計算客戶生命周期價值(CLV)假設為一年
clv_data = pd.merge(customer_spending, customer_dates, on='customer_id', how='left') # 將客戶消費數據和購買日期數據按客戶ID合并
clv_data['customer_lifetime'] = (clv_data['last_purchase_date'] - clv_data['first_purchase_date']).dt.days / 365 # 計算客戶生命周期(年)
clv_data['clv'] = clv_data['total_spending'] / (clv_data['customer_lifetime'] + 0.001) # 計算客戶生命周期價值,避免除零錯誤
clv_data.to_csv('results/customer_clv.csv', index=False) # 將結果保存為CSV文件,不保存索引
print(clv_data)
customer_id total_spending first_purchase_date \
0 1 1124.02 2024-08-06 16:32:05.579393
1 2 1710.33 2024-11-02 16:32:05.579393
2 3 1595.10 2024-07-11 16:32:05.579393
3 4 3881.27 2024-07-15 16:32:05.579393
4 5 3871.16 2024-08-16 16:32:05.579393
.. ... ... ...
995 996 1880.70 2024-07-11 16:32:05.579393
996 997 2683.27 2024-08-19 16:32:05.579393
997 998 1631.01 2024-08-26 16:32:05.579393
998 999 2473.05 2024-08-11 16:32:05.579393
999 1000 2608.60 2024-08-01 16:32:05.579393 last_purchase_date customer_lifetime clv
0 2025-03-07 16:32:05.579393 0.583562 1922.842547
1 2025-07-04 16:32:05.579393 0.668493 2554.663925
2 2025-05-11 16:32:05.579393 0.832877 1912.872702
3 2025-07-03 16:32:05.579393 0.967123 4009.065838
4 2025-07-08 16:32:05.579393 0.893151 4329.426869
.. ... ... ...
995 2025-06-24 16:32:05.579393 0.953425 1970.506509
996 2025-07-06 16:32:05.579393 0.879452 3047.604904
997 2025-07-05 16:32:05.579393 0.857534 1899.761141
998 2025-06-23 16:32:05.579393 0.865753 2853.233607
999 2025-04-09 16:32:05.579393 0.687671 3787.874207 [1000 rows x 6 columns]
16. 創建一個透視表,顯示每個區域、每個類別的銷售總額
# 16. 創建一個透視表,顯示每個區域、每個類別的銷售總額
region_category_pivot = pd.pivot_table(merged_data,values='amount', # 透視表的值為交易金額index='region', # 透視表的行索引為區域columns='category', # 透視表的列索引為產品類別aggfunc='sum', # 聚合函數為求和fill_value=0 # 缺失值填充為0
)
region_category_pivot.to_csv('results/region_category_pivot.csv') # 將結果保存為CSV文件
print(region_category_pivot)
category Books Clothing Electronics Food Home
region
East 69158.99 108802.65 128664.17 73067.98 88262.81
North 62821.45 108363.43 130980.59 74564.24 90982.75
South 68916.92 110948.88 136255.84 73532.48 93670.87
West 78753.62 125629.36 156830.16 85718.66 109622.19
17. 創建一個透視表,顯示每個月、每個類別的銷售總額
# 17. 創建一個透視表,顯示每個月、每個類別的銷售總額
month_category_pivot = pd.pivot_table(merged_data,values='amount', # 透視表的值為交易金額index='month', # 透視表的行索引為月份columns='category', # 透視表的列索引為產品類別aggfunc='sum', # 聚合函數為求和fill_value=0 # 缺失值填充為0
)
month_category_pivot.index = month_category_pivot.index.astype(str) # 將月份索引轉換為字符串類型
month_category_pivot.to_csv('results/month_category_pivot.csv') # 將結果保存為CSV文件
print(month_category_pivot)
category Books Clothing Electronics Food Home
month
2024-07 19049.16 28345.99 32982.40 15546.22 24487.16
2024-08 22482.08 39866.22 44932.19 25444.10 31154.68
2024-09 23453.31 33326.84 42242.67 20403.07 27842.04
2024-10 24327.48 36415.63 44971.30 26200.94 36428.28
2024-11 20299.31 34942.33 46316.72 24822.69 33598.12
2024-12 24097.06 47046.58 50235.43 26123.97 32893.75
2025-01 26473.88 34748.91 42122.97 27490.10 31066.79
2025-02 20271.88 31493.03 40852.26 26059.20 28535.91
2025-03 20973.53 37191.20 48798.25 29035.07 34735.99
2025-04 22417.78 41892.14 45848.84 24966.29 32986.68
2025-05 24946.51 43113.17 58419.63 23434.67 27739.07
2025-06 25288.91 37229.08 41895.60 29361.03 30091.73
2025-07 5570.09 8133.20 13112.50 7996.01 10978.42
18. 創建一個透視表,顯示每個區域、每個性別在各個類別上的平均消費
# 18. 創建一個透視表,顯示每個區域、每個性別在各個類別上的平均消費
region_gender_category_pivot = pd.pivot_table(merged_data,values='amount', # 透視表的值為交易金額index=['region', 'gender'], # 透視表的行索引為區域和性別columns='category', # 透視表的列索引為產品類別aggfunc='mean', # 聚合函數為求平均值fill_value=0 # 缺失值填充為0
)
region_gender_category_pivot.to_csv('results/region_gender_category_pivot.csv') # 將結果保存為CSV文件
print(region_gender_category_pivot)
category Books Clothing Electronics Food Home
region gender
East Female 140.329458 270.118182 234.613919 159.747362 185.777671Male 151.623590 286.628238 200.062331 152.545305 191.844758
North Female 139.379052 257.678474 229.761571 151.541577 200.558000Male 140.980886 297.022600 214.836884 171.363604 187.933465
South Female 143.690119 257.582944 232.432237 152.286984 199.536932Male 148.668227 273.655426 220.863434 150.884807 203.678037
West Female 132.917729 272.563935 229.317895 149.335281 191.643158Male 144.845751 300.700676 219.072189 161.880280 205.052832
19. 計算每個客戶的購買頻率(每年購買次數)
# 19. 計算每個客戶的購買頻率(每年購買次數)
purchase_frequency = transactions.groupby('customer_id').agg(num_transactions=('transaction_id', 'count'), # 按客戶ID分組,計算每個客戶的交易次數first_purchase_date=('transaction_date', 'min'), # 按客戶ID分組,計算每個客戶的首次購買日期last_purchase_date=('transaction_date', 'max') # 按客戶ID分組,計算每個客戶的最近購買日期
).reset_index()
purchase_frequency['customer_lifetime'] = (purchase_frequency['last_purchase_date'] - purchase_frequency['first_purchase_date']).dt.days / 365 # 計算客戶生命周期(年)
purchase_frequency['frequency_per_year'] = purchase_frequency['num_transactions'] / (purchase_frequency['customer_lifetime'] + 0.001) # 計算購買頻率(每年購買次數),避免除零錯誤
purchase_frequency.to_csv('results/purchase_frequency.csv', index=False) # 將結果保存為CSV文件,不保存索引
print(purchase_frequency)
customer_id num_transactions first_purchase_date \
0 1 10 2024-08-06 16:32:05.579393
1 2 9 2024-11-02 16:32:05.579393
2 3 11 2024-07-11 16:32:05.579393
3 4 14 2024-07-15 16:32:05.579393
4 5 13 2024-08-16 16:32:05.579393
.. ... ... ...
995 996 10 2024-07-11 16:32:05.579393
996 997 11 2024-08-19 16:32:05.579393
997 998 9 2024-08-26 16:32:05.579393
998 999 10 2024-08-11 16:32:05.579393
999 1000 8 2024-08-01 16:32:05.579393 last_purchase_date customer_lifetime frequency_per_year
0 2025-03-07 16:32:05.579393 0.583562 17.106836
1 2025-07-04 16:32:05.579393 0.668493 13.443005
2 2025-05-11 16:32:05.579393 0.832877 13.191398
3 2025-07-03 16:32:05.579393 0.967123 14.460968
4 2025-07-08 16:32:05.579393 0.893151 14.538936
.. ... ... ...
995 2025-06-24 16:32:05.579393 0.953425 10.477516
996 2025-07-06 16:32:05.579393 0.879452 12.493582
997 2025-07-05 16:32:05.579393 0.857534 10.482983
998 2025-06-23 16:32:05.579393 0.865753 11.537307
999 2025-04-09 16:32:05.579393 0.687671 11.616574 [1000 rows x 6 columns]
20. 分析會員時長與消費金額之間的關系
# 20. 分析會員時長與消費金額之間的關系
customers['membership_days'] = (datetime.now() - customers['membership_date']).dt.days # 計算會員時長(天)
customer_membership = pd.merge(customers, customer_spending, on='customer_id', how='left') # 將客戶數據和消費數據按客戶ID合并
membership_spending_correlation = customer_membership[['membership_days', 'total_spending']].corr() # 計算會員時長和消費金額之間的相關性
customer_membership.to_csv('results/customer_membership.csv', index=False) # 將合并后的數據保存為CSV文件,不保存索引
membership_spending_correlation.to_csv('results/membership_spending_correlation.csv') # 將相關性結果保存為CSV文件
print(membership_spending_correlation)
membership_days total_spending
membership_days 1.000000 -0.008689
total_spending -0.008689 1.000000
print("數據分析完成,結果已保存到results目錄下的CSV文件中。")
數據分析完成,結果已保存到results目錄下的CSV文件中。