Batch-parsing Word content into Excel with Python

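# Batch Extraction of Word Document Content and Automated Excel Storage with Python

## Introduction

In everyday office work, structured data often has to be extracted from large numbers of Word documents and organized into Excel spreadsheets. Doing this by hand is slow. This article shows how to automate the batch process in Python, using python-docx and openpyxl (with pandas for the export step) to:

1. Batch-read the Word documents in a given directory
2. Parse the text, tables, and other content of each document
3. Store the results in an Excel file according to fixed rules
4. Migrate the data efficiently and accurately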

---

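## 1. Environment Setup

### 1.1 Installing dependencies

```bash
pip install python-docx openpyxl pandas
```

### 1.2 Library overview

- **python-docx**: reads and writes Word (.docx) documents
- **openpyxl**: reads and writes Excel (.xlsx) files
- **pandas**: organizes the data and exports it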
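Before running the scripts, it can help to confirm that the three packages are actually installed in the active environment. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical and not part of the original article.

```python
from importlib import metadata


def check_packages(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment
            versions[name] = None
    return versions


if __name__ == "__main__":
    required = ["python-docx", "openpyxl", "pandas"]
    for name, version in check_packages(required).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Note that the names passed in are the pip distribution names (`python-docx`), not the import names (`docx`).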

---

## 2. Implementation Steps

### 2.1 Building the basic framework

```python
import os

import pandas as pd
from docx import Document


def process_word_files(input_dir, output_file):
    """Parse every .docx file in input_dir and save the results to output_file."""
    data = []
    for filename in os.listdir(input_dir):
        if filename.endswith('.docx'):
            filepath = os.path.join(input_dir, filename)
            doc_data = parse_word(filepath)
            data.append(doc_data)
    save_to_excel(data, output_file)


def parse_word(filepath):
    # Parsing logic (implemented in section 2.2)
    pass


def save_to_excel(data, output_file):
    # Storage logic (implemented in section 2.3)
    pass
```
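The directory loop in the framework accepts any name ending in `.docx`. That also matches the temporary `~$name.docx` lock files Word creates while a document is open, and it misses files with an uppercase extension. A possible variant of the listing step, sketched with `pathlib`, is shown here; `list_docx_files` is a hypothetical helper name, not part of the original code.

```python
from pathlib import Path


def list_docx_files(input_dir):
    """Return sorted paths of .docx files in input_dir, skipping Word lock files."""
    paths = []
    for path in Path(input_dir).iterdir():
        if not path.is_file():
            continue
        if path.suffix.lower() != ".docx":
            continue
        # Word creates "~$name.docx" lock files while a document is open;
        # they are not valid documents and would make parsing fail.
        if path.name.startswith("~$"):
            continue
        paths.append(path)
    # Sorting makes the row order in the output reproducible between runs
    return sorted(paths)
```

The returned paths could be passed to `parse_word` one by one in place of the `os.listdir` loop.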

### 2.2 Implementing the document parsing function

```python
def parse_word(filepath):
    """Return the file name, non-empty paragraphs, and tables of one document."""
    doc = Document(filepath)
    result = {
        'filename': os.path.basename(filepath),
        'paragraphs': [],
        'tables': []
    }

    # Extract paragraph text, skipping blank paragraphs
    for para in doc.paragraphs:
        if para.text.strip():
            result['paragraphs'].append(para.text)

    # Extract table data: each table is a list of rows, each row a list of cell texts
    for table in doc.tables:
        table_data = []
        for row in table.rows:
            row_data = [cell.text for cell in row.cells]
            table_data.append(row_data)
        result['tables'].append(table_data)

    return result
```
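One caveat with `row.cells`: in python-docx a horizontally merged cell is typically reported once for every grid column it spans, so its text can appear several times in `row_data`. The sketch below collapses consecutive duplicates in a row of cell texts. It is a text-based heuristic under that assumption, and `collapse_repeats` is a hypothetical helper name; it will also merge neighbouring cells that legitimately hold the same text, so it suits header rows better than data rows.

```python
from itertools import groupby


def collapse_repeats(row_texts):
    """Drop consecutive duplicate strings from one row of cell texts."""
    # groupby() groups adjacent equal items, so one key is kept per run
    return [text for text, _ in groupby(row_texts)]
```

For example, a row read as `['Name', 'Name', 'Age']` would be reduced to `['Name', 'Age']`, while non-adjacent repeats are left alone.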

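### 2.3 Refining the Excel storage function

```python
def save_to_excel(data, output_file):
    """Flatten each parsed document into one row and write all rows to Excel."""
    excel_data = []
    for item in data:
        # Paragraphs: join into one cell, one paragraph per line
        para_str = '\n'.join(item['paragraphs'])

        # Tables: render as text, cells joined by ' | ' and rows by newlines
        table_str = ''
        for i, table in enumerate(item['tables'], 1):
            table_str += f'Table {i}:\n'
            table_str += '\n'.join([' | '.join(row) for row in table])
            table_str += '\n\n'

        excel_data.append({
            'Filename': item['filename'],
            'Body text': para_str,
            'Tables': table_str.strip()
        })

    # pandas writes .xlsx files through openpyxl
    df = pd.DataFrame(excel_data)
    df.to_excel(output_file, index=False)
```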
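Because `save_to_excel` puts a whole document body into a single cell, two Excel constraints become relevant: a cell holds at most 32,767 characters, and openpyxl rejects most ASCII control characters when writing. The following is a minimal sketch of a clean-up step; `sanitize_cell_text` is a hypothetical helper, and silently truncating is only one possible policy (splitting long text across rows would be another).

```python
import re

# Excel's limit on the number of characters in one cell
EXCEL_CELL_CHAR_LIMIT = 32767

# ASCII control characters other than tab, newline and carriage return
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitize_cell_text(text, limit=EXCEL_CELL_CHAR_LIMIT):
    """Remove control characters and truncate text so it fits in one cell."""
    cleaned = _CONTROL_CHARS.sub("", text)
    return cleaned[:limit]
```

Inside `save_to_excel`, the helper could be applied to `para_str` and `table_str` before they are appended to `excel_data`.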

---

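## 3. Advanced Techniques

### 3.1 Extracting structured data

```python
# Example: extract text that carries a particular style
def extract_special_paragraphs(doc):
    """Return the style name and text of every heading paragraph."""
    special_texts = []
    for para in doc.paragraphs:
        if para.style.name.startswith('Heading'):
            special_texts.append({
                'style': para.style.name,
                'text': para.text
            })
    return special_texts
```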
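Style names are one way to find structure. Another is the text itself: many templates contain lines of the form `label: value`, which can be picked out with a regular expression. The sketch below assumes that layout; `FIELD_PATTERN` and `extract_fields` are hypothetical names, and the pattern is deliberately simple, so lines such as URLs that happen to contain a colon would also match and may need filtering.

```python
import re

# Matches lines such as "Name: Alice" or "Date：2024-01-05"
# (either an ASCII colon or a full-width colon separates label and value)
FIELD_PATTERN = re.compile(r"^\s*([^:：]{1,30}?)\s*[:：]\s*(.+?)\s*$")


def extract_fields(paragraphs):
    """Collect label/value pairs from paragraphs written as 'label: value'."""
    fields = {}
    for text in paragraphs:
        match = FIELD_PATTERN.match(text)
        if match:
            label, value = match.groups()
            # Keep the first occurrence if a label appears more than once
            fields.setdefault(label, value)
    return fields
```

Applied to the `paragraphs` list returned by `parse_word`, the resulting dictionary could become one Excel row per document, with the labels as column names.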

### 3.2 Targeting a specific table

```python
def extract_specific_table(doc, table_index=0):
    """Return the rows of the table at table_index, or [] if it does not exist."""
    try:
        table = doc.tables[table_index]
        return [[cell.text for cell in row.cells] for row in table.rows]
    except IndexError:
        return []
```
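The nested list returned by `extract_specific_table` is often easier to work with as a list of records keyed by the header row. A minimal sketch, assuming the first row of the table is the header; `table_to_records` is a hypothetical helper name.

```python
def table_to_records(rows):
    """Convert [[header...], [values...], ...] into a list of dicts."""
    if not rows:
        return []
    header = [name.strip() for name in rows[0]]
    records = []
    for row in rows[1:]:
        # zip() stops at the shorter sequence, so ragged rows do not raise
        records.append(dict(zip(header, (value.strip() for value in row))))
    return records
```

The result can be handed directly to `pd.DataFrame(...)`, which turns the header names into column names.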

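### 3.3 Faster batch processing

```python
# Multi-threaded processing for speed
# Note: threads mainly help with file I/O; the parsing itself is largely
# CPU-bound, so the speed-up may be limited by the GIL.
from concurrent.futures import ThreadPoolExecutor


def batch_process(files):
    """Parse all files in a thread pool; results keep the order of files."""
    with ThreadPoolExecutor() as executor:
        results = list(executor.map(parse_word, files))
    return results
```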
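With `executor.map`, an exception raised for one document is re-raised while the results are collected, which ends the whole batch. A possible alternative, sketched below, submits each file separately and records failures instead of propagating them. The function name `batch_process_safe` and the `parser` and `max_workers` parameters are assumptions made for illustration; `parser` would be `parse_word` in this article's setting.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def batch_process_safe(files, parser, max_workers=4):
    """Run parser on each file in a thread pool, separating results from errors."""
    results = {}
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which file each future belongs to
        future_to_file = {executor.submit(parser, f): f for f in files}
        for future in as_completed(future_to_file):
            path = future_to_file[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # A broken document should not stop the rest of the batch
                errors[path] = exc
    # as_completed yields in completion order, so restore the input order
    ordered = [results[f] for f in files if f in results]
    return ordered, errors
```

The `errors` dictionary maps each failed path to its exception, which makes it straightforward to log or retry the problem files afterwards.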

---

## 4. Running and Testing

```python
if __name__ == '__main__':
    input_folder = './documents'   # folder containing the .docx files
    output_file = './output.xlsx'  # Excel file to create
    process_word_files(input_folder, output_file)
```
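The entry point above hard-codes both paths. If the script is run against different folders, the paths could be taken from the command line instead. The sketch below uses the standard `argparse` module; the option names `--input` and `--output` and the helper `build_arg_parser` are illustrative choices, with defaults matching the values used above.

```python
import argparse


def build_arg_parser():
    """Create a parser with --input and --output options (names are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Extract Word document content into an Excel file."
    )
    parser.add_argument("--input", default="./documents",
                        help="folder containing .docx files")
    parser.add_argument("--output", default="./output.xlsx",
                        help="path of the Excel file to write")
    return parser
```

In the main block, `args = build_arg_parser().parse_args()` followed by `process_word_files(args.input, args.output)` would then replace the two hard-coded variables.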

---

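## 5. Notes

1. Save files consistently in UTF-8 encoding.
2. Add boundary checks when processing complex tables.
3. Use try-except blocks to handle malformed documents.
4. For large volumes of data, write to Excel in batches.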
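The last note can be sketched with openpyxl's write-only mode, which streams rows to the file rather than building the whole sheet in memory. In the code below, `chunked` and `write_in_batches` are hypothetical helpers, and `parse_rows` stands for any callable that turns one file path into a list of rows; none of these names come from the original article.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_in_batches(files, parse_rows, output_file, header, batch_size=50):
    """Parse files batch by batch and stream the resulting rows to Excel."""
    # Imported here so that chunked() stays usable without openpyxl installed
    from openpyxl import Workbook

    workbook = Workbook(write_only=True)  # write-only workbooks start with no sheet
    sheet = workbook.create_sheet(title="data")
    sheet.append(header)
    for number, batch in enumerate(chunked(files, batch_size), 1):
        for path in batch:
            for row in parse_rows(path):
                sheet.append(row)
        print(f"batch {number}: {len(batch)} file(s) written")
    workbook.save(output_file)
```

Only one batch of parsed documents is held at a time, and the batch boundary is a convenient place for progress output or error reporting.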

---

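## Conclusion

This approach automates data migration from Word to Excel and can handle batch operations over hundreds of documents. By extending the parsing logic, it can be adapted to many kinds of document templates, and combined with tools such as regular expressions it can perform more complex content extraction. The final code has been open-sourced at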
