Recognizing text in images and PDF documents with easyocr and PyPDF2

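I. Overview

This Python script processes the image and PDF files in the current directory and its subdirectories. It uses the easyocr library to recognize text in images through optical character recognition (OCR) and the PyPDF2 library to extract the text stored in PDF files, then saves each result as a text file. It also writes a detailed processing log so that users can follow progress and troubleshoot problems.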

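II. Requirements

Python version: Python 3.6 or later is recommended.
Dependencies:
- easyocr: OCR for images.
- PyPDF2: reads PDF files and extracts their text.
- Pillow (PIL): the script never calls it directly, but easyocr may rely on it when handling images.

You can install these dependencies with the following command: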

bash

pip install easyocr PyPDF2 Pillow
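
After installing, it can help to confirm that the three packages are importable before running the script. The sketch below is one way to do that with the standard library; the helper name `missing_packages` is made up for this example and is not part of the script.

```python
import importlib.util


def missing_packages(module_names):
    """Return the names in module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Use import names, not pip names: Pillow is imported as "PIL".
missing = missing_packages(["easyocr", "PyPDF2", "PIL"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```

The check only looks for top-level modules and does not import them, so it stays fast even though easyocr itself is slow to load.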

III. Script structure and functional modules

1. Import the required libraries

python

import os
import time
import easyocr
from PyPDF2 import PdfReader
from PIL import Image

These imports cover file-system access, timing, OCR, PDF reading, and image handling.

2. Set the model download path

python

model_storage_directory = './easyocr_models'
os.makedirs(model_storage_directory, exist_ok=True)

This defines the directory where the easyocr models are stored and makes sure it exists.

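3. Check the network connection

python

def check_network():
    """Return True if the test URL can be opened within 5 seconds."""
    try:
        import urllib.request
        urllib.request.urlopen('https://www.baidu.com', timeout=5)
        return True
    except Exception:
        return False

This function tries to open the Baidu home page to check whether the network connection works. It returns True if the request succeeds and False otherwise.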
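
A single test site can be unreachable for reasons that have nothing to do with the local connection. As a hedged variation, the sketch below tries several URLs in turn and succeeds as soon as one responds. The function name `check_any_reachable` and the `opener` parameter are assumptions introduced here so the logic can be exercised without real network access; they do not appear in the script.

```python
import urllib.request


def check_any_reachable(urls, timeout=5, opener=urllib.request.urlopen):
    """Return True as soon as one URL in urls can be opened."""
    for url in urls:
        try:
            # The with-block closes the response as soon as it is opened
            with opener(url, timeout=timeout):
                return True
        except Exception:
            continue  # this candidate failed, try the next one
    return False
```

For example, `check_any_reachable(['https://www.baidu.com', 'https://www.example.com'])` would fall back to the second address if the first one times out; the second URL is only a placeholder.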

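4. Initialize the EasyOCR reader

python

try:
    print("Initializing EasyOCR...")
    print(f"Model storage directory: {os.path.abspath(model_storage_directory)}")
    if not check_network():
        print("Network connection failed. Please check your internet connection.")
        raise SystemExit(1)
    print("Downloading models (this may take several minutes)...")
    # Load the Simplified Chinese and English recognition models
    reader = easyocr.Reader(
        ['ch_sim', 'en'],
        model_storage_directory=model_storage_directory,
        download_enabled=True,
        verbose=True
    )
    print("EasyOCR initialized successfully")
except Exception as e:
    print(f"Failed to initialize EasyOCR: {str(e)}")
    raise SystemExit(1)

- Prints a start-up message and the absolute path of the model storage directory.
- Checks the network connection; if it is unavailable, prints an error message and exits. Because this check runs first, the script exits when offline even if the models were downloaded on an earlier run.
- Creates the easyocr reader for Simplified Chinese and English, downloading the required models into the storage directory.
- Prints a success message if initialization succeeds; if an exception occurs, prints the error and exits.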
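
Since the network check runs on every start, an offline machine cannot use models that are already on disk. One possible way around this is to test for existing model files first. The sketch below assumes that EasyOCR stores its weights as `.pth` files directly inside the model storage directory; verify this against your own installation. The helper `models_present` is hypothetical.

```python
import os


def models_present(directory, suffix=".pth"):
    """Return True if directory holds at least one file ending in suffix.

    Assumption: downloaded model weights use the given suffix.
    """
    if not os.path.isdir(directory):
        return False
    return any(name.endswith(suffix) for name in os.listdir(directory))
```

With such a helper, the script could call `check_network()` only when `models_present(model_storage_directory)` is False. Note that this checks for any model file, not for the specific models the two languages need.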

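5. Process image files

python

def process_image(image_path):
    """Run OCR on an image file and return the recognized text."""
    try:
        # Each result item is (bounding box, text, confidence)
        result = reader.readtext(image_path)
        # Keep only the text of each item, one item per line
        text = '\n'.join([item[1] for item in result])
        return text
    except Exception as e:
        print(f"Error processing image {image_path}: {str(e)}")
        return ""

- Takes the path of an image file as its argument.
- Runs easyocr on the image, pulls the text out of each recognition result, and returns the pieces joined into one string.
- If an exception occurs, prints an error message and returns an empty string.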
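
Each item returned by `readtext` also carries a confidence score, which the function above ignores. As an illustration, the sketch below drops low-confidence items before joining the text. The function name, the 0.5 threshold, and the sample data are all invented for this example; the sample only mimics the shape of a real result.

```python
def join_confident_text(result, min_confidence=0.5):
    """Join the text of OCR items whose confidence is at least min_confidence.

    Each item is expected to look like (bounding_box, text, confidence).
    """
    lines = [text for _box, text, confidence in result
             if confidence >= min_confidence]
    return "\n".join(lines)


# Hand-written sample in the shape of a readtext result (not real OCR output)
sample = [
    ([[0, 0], [80, 0], [80, 20], [0, 20]], "Invoice", 0.98),
    ([[0, 30], [80, 30], [80, 50], [0, 50]], "l1l|", 0.12),
    ([[0, 60], [80, 60], [80, 80], [0, 80]], "Total: 42", 0.87),
]
filtered = join_confident_text(sample)
print(filtered)
```

A suitable threshold depends on the images; a value that is too high silently discards correct text.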

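6. Process PDF files

python

def process_pdf(pdf_path):
    """Extract the text stored in every page of a PDF file."""
    try:
        text = ""
        # Named pdf_reader so it does not shadow the global EasyOCR reader
        pdf_reader = PdfReader(pdf_path)
        for page in pdf_reader.pages:
            # Fall back to "" in case a page yields no text
            text += page.extract_text() or ""
        return text
    except Exception as e:
        print(f"Error processing PDF {pdf_path}: {str(e)}")
        return ""

- Takes the path of a PDF file as its argument.
- Uses PyPDF2 to read each page of the PDF, extracts its text, and returns the pieces joined into one string.
- If an exception occurs, prints an error message and returns an empty string.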
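
Scanned PDFs usually contain page images with no stored text, so extraction returns nothing for those pages. The sketch below shows one way to join per-page text while recording which pages came back empty. It works on a plain list of strings so it can be tried without a PDF; the helper name `join_page_texts` is an assumption of this example.

```python
def join_page_texts(page_texts):
    """Join per-page text and report which pages had no extractable text.

    page_texts: list of str or None, one entry per page.
    Returns (joined_text, empty_pages) with 1-based page numbers.
    """
    parts = []
    empty_pages = []
    for number, page_text in enumerate(page_texts, start=1):
        cleaned = (page_text or "").strip()
        if cleaned:
            parts.append(cleaned)
        else:
            empty_pages.append(number)
    return "\n".join(parts), empty_pages
```

In the script this could be fed with `[page.extract_text() for page in PdfReader(pdf_path).pages]`, and a non-empty `empty_pages` list would suggest that those pages need image-based OCR instead.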

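7. Save the extracted text

python

def save_text(text, output_path):
    """Write the extracted text to output_path as UTF-8."""
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(text)

- Takes the text content and the output file path as arguments.
- Writes the text to the given output file using UTF-8 encoding.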
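
The explicit `encoding='utf-8'` matters because the recognized text often contains Chinese characters, and the platform default encoding may not be able to represent them. The small round trip below repeats the helper so that it runs on its own and writes into a temporary directory; the sample string is made up.

```python
import os
import tempfile


def save_text(text, output_path):
    """Write text to output_path as UTF-8 (same behavior as the helper above)."""
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    save_text("識別結果 OCR result", path)
    # Read back with the same encoding to confirm nothing was lost
    with open(path, encoding='utf-8') as f:
        round_trip = f.read()

print(round_trip == "識別結果 OCR result")
```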

8. Main function `main`

python

def main():
    # Try several candidate locations for the output directory
    output_folders = [
        './output_text',  # under the current directory
        os.path.expanduser('~/ocr_output'),  # under the user's home directory
        os.path.join(os.getcwd(), 'ocr_output')  # under the current working directory
    ]
    output_folder = None
    for folder in output_folders:
        try:
            os.makedirs(folder, exist_ok=True)
            output_folder = folder
            print(f"Using output directory: {os.path.abspath(output_folder)}")
            break
        except Exception as e:
            print(f"Failed to create output directory {folder}: {str(e)}")
    if output_folder is None:
        print("Error: Could not create any output directory")
        raise SystemExit(1)

    # Set up the log file
    log_file = os.path.join(output_folder, 'ocr_log.txt')

    # Mirror standard output to the log file
    import sys

    class Logger(object):
        def __init__(self, filename):
            self.terminal = sys.stdout
            self.log = open(filename, "a", encoding='utf-8')

        def write(self, message):
            self.terminal.write(message)
            self.log.write(message)

        def flush(self):
            self.terminal.flush()
            self.log.flush()

    sys.stdout = Logger(log_file)
    print("OCR Processing Log\n")
    print(f"Starting OCR processing at {time.strftime('%Y-%m-%d %H:%M:%S')}")

    # Supported image formats
    image_extensions = ['.bmp', '.jpg', '.jpeg', '.png', '.tiff', '.gif']

    # Walk the current directory and its subdirectories.
    # Every print() below also lands in the log file through Logger.
    for root, dirs, files in os.walk('.'):
        for file in files:
            file_path = os.path.join(root, file)
            base_name, ext = os.path.splitext(file)
            try:
                # Image files
                if ext.lower() in image_extensions:
                    print(f"Processing image: {file_path}")
                    text = process_image(file_path)
                    output_path = os.path.join(output_folder, f"{base_name}.txt")
                    save_text(text, output_path)
                    print(f"Successfully processed image: {file_path} -> {output_path}")
                # PDF files
                elif ext.lower() == '.pdf':
                    print(f"Processing PDF: {file_path}")
                    text = process_pdf(file_path)
                    output_path = os.path.join(output_folder, f"{base_name}.txt")
                    save_text(text, output_path)
                    print(f"Successfully processed PDF: {file_path} -> {output_path}")
            except Exception as e:
                print(f"Error processing {file_path}: {str(e)}")

- Output directory: tries to create the output directory in several preset locations and uses the first one that succeeds. If every attempt fails, it prints an error message and exits.
- Logging: creates the log file ocr_log.txt in the output directory and redirects standard output through a Logger object that writes each message to both the terminal and the log file. It then writes the log header and the start time.
- File traversal and processing: walks all files in the current directory and its subdirectories, calling process_image for image files and process_pdf for PDF files. Each result is saved as a text file named after the source file's base name, and success or failure is recorded in the log.
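
The logging step works because `print` only needs an object with `write` and `flush` methods. The sketch below isolates that idea in a small class, called `Tee` here for illustration, and uses an in-memory buffer in place of the log file so it can be run without touching the disk. Restoring `sys.stdout` in a `finally` block is an addition of this example, not something the script does.

```python
import io
import sys


class Tee:
    """File-like object that forwards every write to several streams."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, message):
        for stream in self.streams:
            stream.write(message)
        return len(message)

    def flush(self):
        for stream in self.streams:
            stream.flush()


log_buffer = io.StringIO()  # stands in for the open log file
original_stdout = sys.stdout
sys.stdout = Tee(original_stdout, log_buffer)
try:
    print("Processing image: ./example.png")
finally:
    sys.stdout = original_stdout  # always restore the real stdout

captured = log_buffer.getvalue()
```

After this runs, the message has appeared on the terminal and `captured` holds the same line, which is what ends up in ocr_log.txt when a real file is used.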

9. Program entry point

python

if __name__ == "__main__":
    main()

When the script is run as the main program, it calls main to start processing.

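IV. Usage

- Save the script as a Python file (for example, ocr_process.py).
- Make sure the required dependencies are installed.
- Open a terminal or command prompt and change to the directory that contains the script.
- Run the script: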

bash

python ocr_process.py

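The script automatically processes the images and PDF files in the current directory and its subdirectories, saves the results to the output directory, and writes a processing log.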
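
Because every result is named after the source file's base name and all results go into one folder, two inputs such as a.png and a.pdf, or files with the same name in different subfolders, end up writing the same a.txt. The sketch below shows one possible naming scheme that folds the relative path and the extension into the output name. The function `output_name_for` is hypothetical and only reduces the chance of collisions.

```python
import os


def output_name_for(file_path, root="."):
    """Build a .txt file name that encodes the source's relative path.

    './scans/a.png' becomes 'scans_a_png.txt', so files that share a base
    name in different folders, or differ only by extension, are less
    likely to overwrite each other.
    """
    relative = os.path.relpath(file_path, root)
    stem, ext = os.path.splitext(relative)
    flat = stem.replace(os.sep, "_")
    return f"{flat}_{ext.lstrip('.').lower()}.txt"


print(output_name_for(os.path.join(".", "scans", "a.png")))
print(output_name_for(os.path.join(".", "a.pdf")))
```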

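V. Notes

- Downloading the easyocr models can take a while. On the first run, make sure the network connection is stable and wait for the download to finish.
- PyPDF2 only extracts text that is already stored in the PDF. For scanned or encrypted PDFs, text extraction may fail or return nothing.
- If errors occur during processing, check the log file ocr_log.txt for details.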

Complete code

import os
import time
import easyocr
from PyPDF2 import PdfReader
from PIL import Image

# Set the model download path
model_storage_directory = './easyocr_models'
os.makedirs(model_storage_directory, exist_ok=True)


# Check the network connection
def check_network():
    """Return True if the test URL can be opened within 5 seconds."""
    try:
        import urllib.request
        urllib.request.urlopen('https://www.baidu.com', timeout=5)
        return True
    except Exception:
        return False


# Initialize the EasyOCR reader
try:
    print("Initializing EasyOCR...")
    print(f"Model storage directory: {os.path.abspath(model_storage_directory)}")
    if not check_network():
        print("Network connection failed. Please check your internet connection.")
        raise SystemExit(1)
    print("Downloading models (this may take several minutes)...")
    reader = easyocr.Reader(
        ['ch_sim', 'en'],
        model_storage_directory=model_storage_directory,
        download_enabled=True,
        verbose=True
    )
    print("EasyOCR initialized successfully")
except Exception as e:
    print(f"Failed to initialize EasyOCR: {str(e)}")
    raise SystemExit(1)


def process_image(image_path):
    """Run OCR on an image file and return the recognized text."""
    try:
        # Extract text with EasyOCR
        result = reader.readtext(image_path)
        # Merge all recognition results, one item per line
        text = '\n'.join([item[1] for item in result])
        return text
    except Exception as e:
        print(f"Error processing image {image_path}: {str(e)}")
        return ""


def process_pdf(pdf_path):
    """Extract the text stored in every page of a PDF file."""
    try:
        text = ""
        # Named pdf_reader so it does not shadow the global EasyOCR reader
        pdf_reader = PdfReader(pdf_path)
        for page in pdf_reader.pages:
            # Fall back to "" in case a page yields no text
            text += page.extract_text() or ""
        return text
    except Exception as e:
        print(f"Error processing PDF {pdf_path}: {str(e)}")
        return ""


def save_text(text, output_path):
    """Write the extracted text to output_path as UTF-8."""
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(text)


def main():
    # Try several candidate locations for the output directory
    output_folders = [
        './output_text',  # under the current directory
        os.path.expanduser('~/ocr_output'),  # under the user's home directory
        os.path.join(os.getcwd(), 'ocr_output')  # under the current working directory
    ]
    output_folder = None
    for folder in output_folders:
        try:
            os.makedirs(folder, exist_ok=True)
            output_folder = folder
            print(f"Using output directory: {os.path.abspath(output_folder)}")
            break
        except Exception as e:
            print(f"Failed to create output directory {folder}: {str(e)}")
    if output_folder is None:
        print("Error: Could not create any output directory")
        raise SystemExit(1)

    # Set up the log file
    log_file = os.path.join(output_folder, 'ocr_log.txt')

    # Mirror standard output to the log file
    import sys

    class Logger(object):
        def __init__(self, filename):
            self.terminal = sys.stdout
            self.log = open(filename, "a", encoding='utf-8')

        def write(self, message):
            self.terminal.write(message)
            self.log.write(message)

        def flush(self):
            self.terminal.flush()
            self.log.flush()

    sys.stdout = Logger(log_file)
    print("OCR Processing Log\n")
    print(f"Starting OCR processing at {time.strftime('%Y-%m-%d %H:%M:%S')}")

    # Supported image formats
    image_extensions = ['.bmp', '.jpg', '.jpeg', '.png', '.tiff', '.gif']

    # Walk the current directory and its subdirectories.
    # Every print() below also lands in the log file through Logger.
    for root, dirs, files in os.walk('.'):
        for file in files:
            file_path = os.path.join(root, file)
            base_name, ext = os.path.splitext(file)
            try:
                # Image files
                if ext.lower() in image_extensions:
                    print(f"Processing image: {file_path}")
                    text = process_image(file_path)
                    output_path = os.path.join(output_folder, f"{base_name}.txt")
                    save_text(text, output_path)
                    print(f"Successfully processed image: {file_path} -> {output_path}")
                # PDF files
                elif ext.lower() == '.pdf':
                    print(f"Processing PDF: {file_path}")
                    text = process_pdf(file_path)
                    output_path = os.path.join(output_folder, f"{base_name}.txt")
                    save_text(text, output_path)
                    print(f"Successfully processed PDF: {file_path} -> {output_path}")
            except Exception as e:
                print(f"Error processing {file_path}: {str(e)}")


if __name__ == "__main__":
    main()