Problem Overview
While using MindIE 1.0 RC2 for large language model inference, I ran into quite a few problems; this post records how I worked through them.
My hardware is a 910B4.
Problems and Solutions
Problem 1
When starting MindIE inside Docker, the terminal reports:
Fatal Python error: PyThreadState_Get: the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)
Checking logs/pythonlog.log.xxxx shows:
File "/usr/local/Ascend/atb-models/atb_llm/utils/file_utils.py", line 110, in check_owner
raise argparse.ArgumentTypeError("The path is not owned by current user or root")
argparse.ArgumentTypeError: The path is not owned by current user or root
Analysis: the model directory was mapped in from the host and is owned by a user named guest, while the user inside the Docker container is root, so MindIE's ownership check on the path fails.
Solution: change the owner and group of the mapped model directory to root:

chown -R root:root /path/to/directory
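For reference, the failing check in file_utils.py boils down to comparing the path's owner uid against the current uid and root. A minimal sketch reconstructed from the traceback (the real implementation may differ):

import argparse
import os

def check_owner(path):
    # The path must be owned by the current effective user or by root (uid 0)
    owner_uid = os.stat(path).st_uid
    if owner_uid not in (os.geteuid(), 0):
        raise argparse.ArgumentTypeError("The path is not owned by current user or root")

check_owner("/path/to/directory")  # raises if the directory is still owned by guest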
Problem 2
When starting MindIE inside Docker, the terminal reports:
Fatal Python error: PyThreadState_Get: the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)
Python runtime state: finalizing (tstate=0x0000ffff8401d570)
Checking logs/pythonlog.log.xxxx shows:
File "/root/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Base/tokenization_baichuan.py", line 7, in <module>
import sentencepiece as spm
ModuleNotFoundError: No module named 'sentencepiece'
Analysis: I was loading the Baichuan2-13B model, whose custom tokenizer depends on the sentencepiece package, which was not installed in the container.
Solution:
pip install sentencepiece
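After installing, a quick sanity check before relaunching MindIE (the model path below is a placeholder for your local directory):

from transformers import AutoTokenizer

# Constructing the tokenizer imports tokenization_baichuan.py, which needs sentencepiece
tokenizer = AutoTokenizer.from_pretrained(
    "/path/to/Baichuan2-13B-Base",  # placeholder
    use_fast=False,
    trust_remote_code=True)
print(tokenizer.tokenize("hello world"))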
Problem 3
When starting MindIE inside Docker, the terminal reports:
Exception:unsupported type: torch.bfloat16
Analysis: the model I was loading is stored in bfloat16, which MindIE apparently does not support; only fp16 works. The model's dtype is recorded in the torch_dtype field of config.json in the model directory.
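A quick way to check the dtype without opening the file by hand (the path is a placeholder):

import json

with open("/path/to/model/config.json") as f:
    config = json.load(f)
print(config.get("torch_dtype"))  # prints "bfloat16" here, which MindIE rejects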
Solution: convert the model to fp16:
import torch

def convert_bin2st_from_pretrained(model_path, out_path):
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_path,
        revision="v2.0",
        use_fast=False,
        trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_path,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        torch_dtype=torch.float16)  # force float16 here
    print(f"Saving the target model to {out_path}")
    model.save_pretrained(out_path, safe_serialization=True)  # writes safetensors
    print(f"Saving the tokenizer to {out_path}")
    tokenizer.save_pretrained(out_path)

if __name__ == '__main__':
    print("converting model to safetensors")
    convert_bin2st_from_pretrained("./Qwen2-72B-Instruct", "./Qwen2-72B-Instruct_fp16")
After the conversion, manually copy ./Qwen2-72B-Instruct/tokenizer.json to ./Qwen2-72B-Instruct_fp16; all the other files are already there.
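The copy can also be scripted:

import shutil

# The conversion above does not emit tokenizer.json, so carry it over from the source model
shutil.copy("./Qwen2-72B-Instruct/tokenizer.json",
            "./Qwen2-72B-Instruct_fp16/tokenizer.json")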
Problem 4
When starting MindIE inside Docker, the terminal reports:
Fatal Python error: PyThreadState_Get: the function must be called with the GIL held, but the GIL is released (the current Python thread state is NULL)
Python runtime state: finalizing (tstate=0x0000ffffac01d570)
Checking logs/pythonlog.log.xxxx shows:
File "/usr/local/Ascend/atb-models/atb_llm/models/qwen2/router_qwen2.py", line 39, in checkout_config_qwen
if value < min_val or value > max_val:
TypeError: '<' not supported between instances of 'NoneType' and 'int'
Tracing through shows that the sliding_window value read in router_qwen2.py is None; this was caused by converting the model with the script from the previous step.
Solution: in the converted model directory, set the sliding_window field in config.json to 131072.
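This can be patched in place with a few lines of Python (131072 as prescribed above):

import json

path = "./Qwen2-72B-Instruct_fp16/config.json"
with open(path) as f:
    config = json.load(f)
config["sliding_window"] = 131072  # was null after the conversion
with open(path, "w") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)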
Summary
Many of these problems surface as GIL-related fatal errors, but the real failure is in the worker process; the true cause is almost always recorded in logs/pythonlog.log.xxxx.
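So when MindIE dies with one of these fatal errors, check the newest python log first, e.g.:

import glob
import os

# Print the tail of the most recently modified pythonlog.log.* file
logs = sorted(glob.glob("logs/pythonlog.log.*"), key=os.path.getmtime)
if logs:
    with open(logs[-1]) as f:
        print("".join(f.readlines()[-40:]))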