stepfile:
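step 00 Create the dataset
Directory structure
yourproject
-data
--myset
---images #training images
---dev.tsv #test (dev) labels, TSV format: image filename\ttext content
---train.tsv #training labels, TSV format: image filename\ttext content
-train_config.json
-train_config_gpu.json
-fix_cnocr_encoding.py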
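Each line of the label files above holds an image filename and its text, separated by a tab. Below is a minimal Python sketch for checking that layout before training. The helper name `check_tsv` and the specific checks are illustrative assumptions, not part of cnocr.

```python
import os


def check_tsv(tsv_path, image_dir):
    """Return a list of problems found in a 'filename<TAB>text' label file."""
    problems = []
    with open(tsv_path, encoding="utf-8") as f:
        for lineno, raw in enumerate(f, start=1):
            line = raw.rstrip("\r\n")
            if not line:
                continue  # ignore blank lines
            parts = line.split("\t")
            if len(parts) != 2:
                problems.append(
                    f"{tsv_path}:{lineno}: expected 2 tab-separated fields, got {len(parts)}"
                )
                continue
            fname, text = parts
            if not text:
                problems.append(f"{tsv_path}:{lineno}: empty label for {fname}")
            if not os.path.isfile(os.path.join(image_dir, fname)):
                problems.append(f"{tsv_path}:{lineno}: image not found: {fname}")
    return problems


if __name__ == "__main__":
    # Paths follow the directory structure listed above.
    image_dir = os.path.join("data", "myset", "images")
    for split in ("train.tsv", "dev.tsv"):
        path = os.path.join("data", "myset", split)
        if os.path.exists(path):
            for problem in check_tsv(path, image_dir):
                print(problem)
```

Running it from the project root prints one line per malformed entry or missing image, and nothing when both files are clean.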
step 01 Create the virtual environment
venv\Scripts\activate
step 02 Install the dev extras for model training
pip install cnocr[dev]
step 03 Update pyarrow to version 18.0.0 to resolve the error reporting that the installed version has no PyExtensionType attribute
pip install --upgrade pyarrow==18.0.0
step 04 Install albumentations==1.3.1 to resolve the error stating that the compression_type parameter must be the string 'jpeg' or 'webp'
pip install albumentations==1.3.1
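Steps 03 and 04 pin two dependency versions. Below is a small sketch for confirming that the active environment actually holds those pins. It uses only `importlib.metadata` from the standard library, and the helper name `check_pins` is made up for illustration.

```python
from importlib import metadata

# Versions taken from steps 03 and 04.
PINS = {"pyarrow": "18.0.0", "albumentations": "1.3.1"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that does not match.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    bad = check_pins(PINS)
    for name, (expected, installed) in bad.items():
        print(f"{name}: expected {expected}, found {installed}")
    if not bad:
        print("All pinned versions match.")
```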
step 05 Fix the GBK encoding issue
python -m fix_cnocr_encoding
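The GBK problem arises because `open()` without an `encoding` argument falls back to the platform's locale encoding, which on a Chinese-language Windows system is typically GBK rather than UTF-8. The sketch below reproduces the effect with a temporary file. It passes `encoding="gbk"` explicitly so that it behaves the same on any platform, and the sample label line is invented.

```python
import os
import tempfile

# An invented label line in the 'filename<TAB>text' layout.
sample = "0001.jpg\t訓練標簽\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.tsv")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Reading with the encoding the file was written in round-trips cleanly.
    with open(path, encoding="utf-8") as f:
        utf8_text = f.read()

    # Reading the same bytes as GBK either raises UnicodeDecodeError
    # or silently produces different characters (mojibake).
    try:
        with open(path, encoding="gbk") as f:
            gbk_text = f.read()
    except UnicodeDecodeError as exc:
        gbk_text = None
        print(f"GBK decode failed: {exc}")

print("UTF-8 read matches the original:", utf8_text == sample)
print("GBK read matches the original:", gbk_text == sample)
```

This is why the patch script adds `encoding="utf-8"` to the `open` call instead of relying on the default.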
Start training:
cnocr train -m densenet_lite_136-gru --index-dir data/myset --train-config-fp train_config.json
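step 06 Visualize the training results
wandb API key: dca541a51e980eea9bb52866363926f5ea6617edwt (use your own API key; in testing, PowerShell froze and the key could not be typed in, so result visualization was unavailable)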
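Since the interactive key prompt could not be used in PowerShell, one possible workaround is to pass the key to wandb through the `WANDB_API_KEY` environment variable, or to set `WANDB_MODE=offline` so that no login is requested. Both variables are documented by wandb; whether they avoid the freeze described above has not been verified here. The helpers `build_env` and `run_training` are illustrative names, and the sketch assumes the `cnocr` command is on the PATH of the activated environment.

```python
import os
import subprocess

# The training command from step 05.
TRAIN_CMD = [
    "cnocr", "train",
    "-m", "densenet_lite_136-gru",
    "--index-dir", "data/myset",
    "--train-config-fp", "train_config.json",
]


def build_env(api_key=None, base=None):
    """Copy the environment and configure wandb without an interactive prompt."""
    env = dict(os.environ if base is None else base)
    if api_key:
        env["WANDB_API_KEY"] = api_key   # wandb reads the key from this variable
    else:
        env["WANDB_MODE"] = "offline"    # log locally, no login required
    return env


def run_training(api_key=None):
    """Launch the training command with the prepared environment."""
    return subprocess.run(TRAIN_CMD, env=build_env(api_key), check=True)
```

Calling `run_training("<your key>")` starts training with the key supplied up front, while `run_training()` runs with wandb in offline mode.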
train_config.json
{
  "vocab_fp": ".venv/label_cn.txt",
  "img_folder": "data/myset/images",
  "devices": 1,
  "accelerator": "cpu",
  "epochs": 20,
  "batch_size": 4,
  "num_workers": 0,
  "pin_memory": false,
  "optimizer": "adam",
  "learning_rate": 1e-3,
  "weight_decay": 0,
  "metrics": {
    "complete_match": {},
    "cer": {}
  },
  "lr_scheduler": {
    "name": "cos_warmup",
    "min_lr_mult_factor": 0.01,
    "warmup_epochs": 0.2
  },
  "precision": 32,
  "limit_train_batches": 1.0,
  "limit_val_batches": 1.0,
  "pl_checkpoint_monitor": "val-complete_match-epoch",
  "pl_checkpoint_mode": "max"
}
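Before a long run it can help to confirm that the paths named in the config exist. The sketch below checks only a few keys that appear in train_config.json; `check_config` is a hypothetical helper, not a cnocr function, and it does not validate the config the way cnocr itself would.

```python
import json
import os


def check_config(cfg, root="."):
    """Return a list of problems found in a training config dict."""
    problems = []
    for key in ("vocab_fp", "img_folder"):
        value = cfg.get(key)
        if not value:
            problems.append(f"missing key: {key}")
        elif not os.path.exists(os.path.join(root, value)):
            problems.append(f"{key} not found: {value}")
    if not isinstance(cfg.get("batch_size"), int) or cfg["batch_size"] < 1:
        problems.append("batch_size must be a positive integer")
    return problems


if __name__ == "__main__":
    if os.path.exists("train_config.json"):
        with open("train_config.json", encoding="utf-8") as f:
            config = json.load(f)
        for problem in check_config(config):
            print(problem)
```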
train_config_gpu.json
{
  "vocab_fp": "cnocr/label_cn.txt",
  "img_folder": "/data/jinlong/ocr_data",
  "devices": 1,
  "accelerator": "gpu",
  "epochs": 30,
  "batch_size": 32,
  "num_workers": 8,
  "pin_memory": true,
  "optimizer": "adam",
  "learning_rate": 3e-4,
  "weight_decay": 0,
  "train_bucket_size": null,
  "metrics": {
    "complete_match": {},
    "cer": {}
  },
  "lr_scheduler": {
    "name": "cos_warmup",
    "min_lr_mult_factor": 0.01,
    "warmup_epochs": 0.2,
    "milestones": [5, 10, 16, 22, 30],
    "gamma": 0.5
  },
  "precision": 16,
  "log_every_n_steps": 200,
  "limit_train_batches": 1.0,
  "limit_val_batches": 1.0,
  "pl_checkpoint_monitor": "val-complete_match-epoch",
  "pl_checkpoint_mode": "max"
}
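The two configs share most keys, so listing only the differences makes it easier to see what changes between the CPU and GPU setups. Below is a minimal sketch; `diff_configs` and the `ABSENT` marker are names introduced here for illustration.

```python
import json
import os

ABSENT = "<absent>"  # marker for a key that exists in only one config


def diff_configs(a, b, prefix=""):
    """Return {dotted_key: (value_in_a, value_in_b)} for every differing entry."""
    diffs = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        va, vb = a.get(key, ABSENT), b.get(key, ABSENT)
        if isinstance(va, dict) and isinstance(vb, dict):
            # Recurse into nested sections such as lr_scheduler.
            diffs.update(diff_configs(va, vb, prefix=path + "."))
        elif va != vb:
            diffs[path] = (va, vb)
    return diffs


if __name__ == "__main__":
    names = ("train_config.json", "train_config_gpu.json")
    if all(os.path.exists(n) for n in names):
        loaded = []
        for name in names:
            with open(name, encoding="utf-8") as f:
                loaded.append(json.load(f))
        for key, (cpu_value, gpu_value) in diff_configs(*loaded).items():
            print(f"{key}: {cpu_value!r} -> {gpu_value!r}")
```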
fix_cnocr_encoding.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Fix the encoding problem in the CN-OCR library.
Adds the UTF-8 encoding argument to the read_tsv_file function in utils.py.
"""
import os
import sys
import fileinput
import shutil

# Path to cnocr's utils.py (assumes a Windows virtual environment named .venv next to this script)
cnocr_utils_path = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),
    '.venv', 'Lib', 'site-packages', 'cnocr', 'utils.py'
)

# Check that the file exists
if not os.path.exists(cnocr_utils_path):
    print(f"Error: cannot find cnocr's utils.py: {cnocr_utils_path}")
    print("Check that cnocr is installed correctly in the virtual environment")
    sys.exit(1)

print(f"Found cnocr's utils.py: {cnocr_utils_path}")

# Create a backup file
backup_path = cnocr_utils_path + '.backup'
if not os.path.exists(backup_path):
    shutil.copy2(cnocr_utils_path, backup_path)
    print(f"Created backup file: {backup_path}")
else:
    print(f"Backup file already exists: {backup_path}")

# Read the file content
with open(cnocr_utils_path, 'r', encoding='utf-8') as f:
    content = f.read()

# Check whether the fix has already been applied
if 'with open(fp, encoding="utf-8")' in content:
    print("The cnocr encoding problem has already been fixed!")
    sys.exit(0)

# Look for the read_tsv_file function
if 'def read_tsv_file(' not in content:
    print("Error: cannot find the read_tsv_file function in utils.py")
    print("The cnocr version may differ from the expected one")
    sys.exit(1)

print("Fixing the encoding problem in read_tsv_file...")

# Modify the file in place with the fileinput module
# (the encoding argument of fileinput.input requires Python 3.10+)
for line in fileinput.input(cnocr_utils_path, inplace=True, encoding='utf-8'):
    # Find the open statement and add encoding="utf-8"
    if 'with open(fp)' in line and 'encoding=' not in line:
        line = line.replace('with open(fp)', 'with open(fp, encoding="utf-8")')
    print(line, end='')

print("\nFix complete!")
print('Added encoding="utf-8" to the read_tsv_file function')
print(f"The original file is backed up at: {backup_path}")
print("You can now try the cnocr train command.")