1.2 Kaggle in Plain Language: Eedi Competition Transformer Framework Solution 02 - Generating Missing Training-Set Data with GPT_4o

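- 0. Competition summary table for this column
- 1. Purpose of this article
- 2. AI engineering architecture
- 3. Data preprocessing module
  - 3.1 Configure the data path and processing parameters
  - 3.2 Configure the API parameters
  - 3.3 Configure the output paths
- 4. AI parallel processing module
  - 4.1 Define the LLM client class
  - 4.2 Define the row-processing function
  - 4.3 Define the JSON save function
  - 4.4 Define the range-slicing function
  - 4.5 Define the slice-pairing function
  - 4.6 Define the filename sort function
- 5. Data integration module
  - 5.1 Load the data and generate slices
  - 5.2 Initialize the LLM client and test it
  - 5.3 Generate data in parallel
  - 5.4 Merge the results
  - 5.5 Save the final result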

0. Competition summary table for this column

Kaggle competition summary

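1. Purpose of this article

In plain language: the data exploration in the previous article showed that some training rows are missing their misconception explanations. This article fills those gaps directly by calling GPT_4o with a persona-based system prompt.
Skills covered in this article: calling an AI model through an API, a worked example of persona prompt engineering, and multi-step data processing with on-disk caching.
Previous article: Eedi LLM Distillation Solution 01 - Competition Overview and Data Understanding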

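2. AI engineering architecture

3. Data preprocessing module

3.1 Configure the data path and processing parameters

import pandas as pd

data_path = "~/work/eedi_synthetic_data/MalAlgoQA_format.csv"
df = pd.read_csv(data_path)  # load first: index_end below depends on len(df)
index_start = 0              # first row to process
index_end = len(df)          # one past the last row to process
step = 100                   # rows per cached batch
max_workers = 2              # number of parallel worker processes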

3.2 Configure the API parameters

model_config = dict(
    openai_api_base="https://testshellapi.kimi.asia/v1",
    api_key="****",
    model="gpt-4o",
    # system prompt
    default_system_prompt="""##Task
You are a Mathematics teacher. Your task is to reason and identify the ConstructName and SubjectName and then the misconception behind the user input Incorrect Answers with the Question.
ConstructName is Most granular level of knowledge related to question, appears to describe the specific mathematical method or procedure used to solve the question. It explains the technique or approach needed to reach the answer.
SubjectName is More general context than the construct, represents the broader mathematical topic or category that the question belongs to.
Misconceptions are a mistake in conceptual understanding and they have relations with all the applications of those concepts. For example, a single misconception on the connections among proportional relationships (part/whole, part/part, whole/part) can cause problems in identifying those patterns in drawings and can be the cause of failing to realize all parts must be of equal size, therefore associating the denominator of the fraction with the total number of parts regardless their size.
Answer concisely what misconception it is to lead to getting the incorrect answer.
Do not use "The misconception is" to start your answers.
Do not mention the concrete details of the question or answers.

##User input
Question: The question text
A: multiple choice answer A text
B: multiple choice answer B text
C: multiple choice answer C text
D: multiple choice answer D text
Correct Answer: The correct answer text

##You should answer in the following JSON format
{
"ConstructName": "here writes the constructName",
"SubjectName": "here writes the SubjectName",
"MisconceptionAName": "here writes the answer A's misconception.",
"MisconceptionBName": "here writes the answer B's misconception.",
"MisconceptionCName": "here writes the answer C's misconception.",
"MisconceptionDName": "here writes the answer D's misconception."
}""",
    default_temperature=0.5,
    max_tokens=256,  # upper bound on reply length; raise it if replies get cut off mid-JSON
)
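The system prompt asks the model for a JSON object with six keys. A minimal sketch of checking a reply against that shape before using it follows; the helper name `parse_reply` and the sample reply are hypothetical and are not part of the original pipeline.

```python
import json

# Keys the system prompt asks the model to return.
EXPECTED_KEYS = [
    "ConstructName",
    "SubjectName",
    "MisconceptionAName",
    "MisconceptionBName",
    "MisconceptionCName",
    "MisconceptionDName",
]


def parse_reply(raw):
    """Hypothetical helper: parse a raw reply and return (data, missing_keys).

    Returns (None, all expected keys) when the reply is not a JSON object,
    for example because the token limit cut it off.
    """
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None, list(EXPECTED_KEYS)
    if not isinstance(data, dict):
        return None, list(EXPECTED_KEYS)
    missing = [k for k in EXPECTED_KEYS if k not in data]
    return data, missing


# Made-up reply, used only to exercise the helper.
sample = json.dumps({
    "ConstructName": "Add fractions with different denominators",
    "SubjectName": "Adding and Subtracting Fractions",
    "MisconceptionAName": "Adds numerators and denominators separately",
    "MisconceptionBName": "",
    "MisconceptionCName": "Forgets to find a common denominator",
    "MisconceptionDName": "Multiplies instead of adding",
})
data, missing = parse_reply(sample)
truncated_data, truncated_missing = parse_reply(sample[:40])
print(missing)          # []
print(truncated_data)   # None
```

A check like this makes truncated replies visible early, which matters when `max_tokens` is small relative to six free-text fields.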

3.3 Configure the output paths

import os

cache_folder = f"./cache_{model_config['model']}_model_misconceptions_result"
os.makedirs(cache_folder, exist_ok=True)  # per-batch JSON files are cached here
output_data_path = f"misconception_data_{os.path.splitext(os.path.basename(data_path))[0]}_{model_config['model']}.csv"
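To see what these expressions produce, the short sketch below evaluates them for the data path and model name configured above, without creating any folder. The variable names `model_name` and `stem` are introduced here only for illustration.

```python
import os

# Same values as in the configuration above.
data_path = "~/work/eedi_synthetic_data/MalAlgoQA_format.csv"
model_name = "gpt-4o"  # illustrative stand-in for model_config['model']

# basename drops the directory, splitext drops the ".csv" extension.
stem = os.path.splitext(os.path.basename(data_path))[0]
cache_folder = f"./cache_{model_name}_model_misconceptions_result"
output_data_path = f"misconception_data_{stem}_{model_name}.csv"

print(cache_folder)      # ./cache_gpt-4o_model_misconceptions_result
print(output_data_path)  # misconception_data_MalAlgoQA_format_gpt-4o.csv
```

Because both names embed the model name, results from different models end up in separate cache folders and output files.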

4. AI parallel processing module

4.1 Define the LLM client class

from openai import OpenAI


class LLMChat:
    """Thin wrapper around an OpenAI-compatible chat completions endpoint."""

    def __init__(self, openai_api_base, api_key, model, default_temperature, default_system_prompt, max_tokens=512):
        self.client = OpenAI(
            api_key=api_key,
            base_url=openai_api_base,
        )
        self.model = model
        self.default_temperature = default_temperature
        self.default_system_prompt = default_system_prompt
        self.max_tokens = max_tokens

    def chat(self, user_prompt, system_prompt=None, temperature=None):
        # Fall back to the defaults only when no value was passed
        if not system_prompt:
            system_prompt = self.default_system_prompt
        if temperature is None:  # "is None" keeps an explicit temperature of 0
            temperature = self.default_temperature
        chat_response = self.client.chat.completions.create(
            model=self.model,
            temperature=temperature,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            max_tokens=self.max_tokens,
            # JSON mode: ask the endpoint to return a single JSON object
            response_format={"type": "json_object"},
        )
        return chat_response.choices[0].message.content
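Every call to `chat` goes to the paid API, so it can help to exercise the rest of the pipeline offline first. The sketch below is a hypothetical test double, `FakeLLMChat`, that exposes the same `chat` signature but returns a fixed JSON string; it is an illustration and not part of the original solution. It uses `is None` checks so that an explicit temperature of 0 is kept.

```python
import json


class FakeLLMChat:
    """Hypothetical offline stand-in with the same chat() signature as LLMChat.

    It never touches the network and always returns a fixed JSON string.
    """

    def __init__(self, default_system_prompt="", default_temperature=0.5):
        self.default_system_prompt = default_system_prompt
        self.default_temperature = default_temperature
        self.calls = []  # records (system_prompt, user_prompt, temperature)

    def chat(self, user_prompt, system_prompt=None, temperature=None):
        # Use the defaults only when the argument was omitted.
        if system_prompt is None:
            system_prompt = self.default_system_prompt
        if temperature is None:
            temperature = self.default_temperature
        self.calls.append((system_prompt, user_prompt, temperature))
        return json.dumps({"ConstructName": "stub", "SubjectName": "stub"})


fake = FakeLLMChat(default_system_prompt="You are a Mathematics teacher.")
reply = fake.chat("Question: 1 + 1 = ?", temperature=0)
print(json.loads(reply)["ConstructName"])  # stub
print(fake.calls[0][2])                    # 0
```

Assigning such an object to the client variable lets the batching, caching, and merging steps be tested without spending API quota.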

4.2 Define the row-processing function

import json


def process_row(args, debug=False):
    """Build the user prompt for one (index, row) pair and query the model.

    Always returns a JSON string; a failed call is returned as {"error": ...}.
    Relies on the module-level client `vc` created in section 5.2.
    """
    user_prompt = """Question: {question}
A: {answer_a}
B: {answer_b}
C: {answer_c}
D: {answer_d}
Correct Answer: {correct_answer}"""
    index, row = args
    ca = row["CorrectAnswer"]               # letter of the correct option, e.g. "B"
    correctanswer = row[f"Answer{ca}Text"]  # text of the correct option
    input_user_prompt = user_prompt.format(
        question=row['QuestionText'],
        answer_a=row['AnswerAText'],
        answer_b=row['AnswerBText'],
        answer_c=row['AnswerCText'],
        answer_d=row['AnswerDText'],
        correct_answer=correctanswer,
    )
    try:
        ret_data = vc.chat(input_user_prompt)
    except Exception as e:
        print(f'An exception occurred: {e}')
        # Keep the return type a JSON string so the merge step can parse it
        ret_data = json.dumps({'error': str(e)})
    if debug:
        print('system: ', model_config['default_system_prompt'])
        print('>' * 50)
        print('user_input: ', input_user_prompt)
        print('>' * 50)
        print('assistant: ', ret_data)
    return ret_data
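The prompt-building step can be tried on its own, without any API call. The sketch below isolates it in a hypothetical helper, `build_user_prompt`, and runs it on a made-up row that uses the same column names as the function above.

```python
USER_PROMPT = """Question: {question}
A: {answer_a}
B: {answer_b}
C: {answer_c}
D: {answer_d}
Correct Answer: {correct_answer}"""


def build_user_prompt(row):
    """Hypothetical helper: fill the template from one row (dict or Series)."""
    letter = row["CorrectAnswer"]              # e.g. "B"
    correct_text = row[f"Answer{letter}Text"]  # look up that option's text
    return USER_PROMPT.format(
        question=row["QuestionText"],
        answer_a=row["AnswerAText"],
        answer_b=row["AnswerBText"],
        answer_c=row["AnswerCText"],
        answer_d=row["AnswerDText"],
        correct_answer=correct_text,
    )


# Made-up row for illustration only.
row = {
    "QuestionText": "What is 1/2 + 1/4?",
    "AnswerAText": "2/6",
    "AnswerBText": "3/4",
    "AnswerCText": "1/8",
    "AnswerDText": "2/4",
    "CorrectAnswer": "B",
}
prompt = build_user_prompt(row)
print(prompt)
```

Note that the correct answer is passed as text rather than as a letter, so the model sees the actual value it should treat as correct.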

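4.3 Define the JSON save function

import json


def save_json(fn, obj):
    """Write obj to fn as indented JSON, keeping non-ASCII characters readable."""
    with open(fn, 'w') as f:
        json.dump(obj, f, ensure_ascii=False, indent=4)
    print(f"save file to {fn}")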

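4.4 Define the range-slicing function

def slice_range(start, end, step):
    """Return boundaries start, start+step, ... and always finish with end."""
    if step <= 0:
        raise ValueError("step must be greater than 0")
    result = []
    while start <= end:
        result.append(start)
        start += step
    # Add the final boundary when end is not a multiple of step;
    # the "result and" guard avoids an IndexError when start > end
    if result and result[-1] < end:
        result.append(end)
    return result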

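4.5 Define the slice-pairing function

def process_pairs(sliced_range):
    """Turn boundaries [b0, b1, b2, ...] into pairs [[b0, b1], [b1, b2], ...]."""
    slices = []
    for first, second in zip(sliced_range, sliced_range[1:]):
        slices.append([first, second])
    return slices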
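A worked example shows how the two helpers cooperate. Both functions are restated so the block runs on its own; the 250-row size is made up for illustration.

```python
def slice_range(start, end, step):
    if step <= 0:
        raise ValueError("step must be greater than 0")
    result = []
    while start <= end:
        result.append(start)
        start += step
    if result and result[-1] < end:
        result.append(end)
    return result


def process_pairs(sliced_range):
    return [[a, b] for a, b in zip(sliced_range, sliced_range[1:])]


bounds = slice_range(0, 250, 100)
pairs = process_pairs(bounds)
print(bounds)  # [0, 100, 200, 250]
print(pairs)   # [[0, 100], [100, 200], [200, 250]]

# Each pair is later used as a half-open slice [first:second], so the
# batches cover every row exactly once with no overlap.
rows = list(range(250))
covered = [r for first, second in pairs for r in rows[first:second]]
print(covered == rows)  # True
```

The last batch is simply shorter (50 rows here), which is why the extra end boundary is appended.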

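4.6 Define the filename sort function

import re


def natural_sort_key(filename):
    """Sort key made of every integer in the name, so 200 sorts before 1000."""
    parts = re.findall(r'\d+', filename)
    return tuple(map(int, parts))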
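The reason for a numeric key becomes clear when it is compared with plain string sorting. The file names below follow the cache naming pattern used in this article, but the specific batch numbers are made up.

```python
import re


def natural_sort_key(filename):
    return tuple(map(int, re.findall(r"\d+", filename)))


folder = "./cache_gpt-4o_model_misconceptions_result"
names = [
    f"{folder}/cache_res_1000.json",
    f"{folder}/cache_res_200.json",
    f"{folder}/cache_res_0.json",
]

plain = sorted(names)                         # string order: 0, 1000, 200
natural = sorted(names, key=natural_sort_key)  # numeric order: 0, 200, 1000

print([n.rsplit("_", 1)[1] for n in plain])    # ['0.json', '1000.json', '200.json']
print([n.rsplit("_", 1)[1] for n in natural])  # ['0.json', '200.json', '1000.json']
```

The digit 4 from "gpt-4o" in the folder name also ends up in the key, but it is the same for every file and therefore does not affect the order.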

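5. Data integration module

5.1 Load the data and generate slices

import pandas as pd

df = pd.read_csv(data_path)  # load the source data (safe to re-run)
df.head()
# List of [first, second] row ranges, one per cached batch
sliced_range = process_pairs(slice_range(index_start, index_end, step))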

Checking the df data: [image not preserved]

5.2 Initialize the LLM client and test it

vc = LLMChat(**model_config)
# Smoke test on a single row, printing the prompts and the reply
r = process_row((7, df.iloc[7]), debug=True)

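5.3 Generate data in parallel

import os
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm

for slices in tqdm(sliced_range, total=len(sliced_range)):
    output_filepath = f'{cache_folder}/cache_res_{slices[0]}.json'
    # Resume support: a batch whose cache file exists is not sent again
    if os.path.exists(output_filepath):
        print(f'cache file exists, skip {output_filepath}')
        continue
    df_tasks = df.iloc[slices[0]:slices[1]]
    # process_row reads the module-level client `vc`, so worker processes
    # must be able to see it (true with the fork start method)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        results = list(tqdm(executor.map(process_row, df_tasks.iterrows()), total=len(df_tasks)))
    save_json(output_filepath, results)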
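The skip-if-cached pattern in this loop is what makes an interrupted run resumable. The sketch below reproduces it in isolation under stated assumptions: `fake_process_row` and `run_batches` are hypothetical names, the rows are made up, a temporary folder replaces the real cache folder, and `ThreadPoolExecutor` is swapped in for `ProcessPoolExecutor` to keep the demo free of pickling and start-method concerns.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fake_process_row(item):
    """Hypothetical stand-in for process_row: no API call, returns a JSON string."""
    index, row = item
    return json.dumps({"index": index, "echo": row})


def run_batches(rows, pairs, cache_folder, max_workers=2):
    """Process each [first, second) slice once and skip slices already cached."""
    processed = []
    for first, second in pairs:
        path = os.path.join(cache_folder, f"cache_res_{first}.json")
        if os.path.exists(path):
            continue  # an earlier run already finished this batch
        tasks = list(enumerate(rows[first:second], start=first))
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            results = list(executor.map(fake_process_row, tasks))
        with open(path, "w", encoding="utf-8") as f:
            json.dump(results, f, ensure_ascii=False)
        processed.append(first)
    return processed


rows = [f"question {i}" for i in range(5)]
pairs = [[0, 2], [2, 4], [4, 5]]
with tempfile.TemporaryDirectory() as folder:
    first_run = run_batches(rows, pairs, folder)
    os.remove(os.path.join(folder, "cache_res_2.json"))  # simulate a lost batch
    second_run = run_batches(rows, pairs, folder)

print(first_run)   # [0, 2, 4]
print(second_run)  # [2]
```

On the second run only the missing batch is recomputed, so a crash or a rate limit costs at most one batch of API calls.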

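5.4 Merge the results

import glob
import json

# Read the cache files in batch order so results line up with the rows of df
f_names = sorted(glob.glob(f'{cache_folder}/*.json'), key=natural_sort_key)
results = []
for fn in f_names:
    with open(fn, 'r') as f:
        batch_results = json.load(f)
    results.extend(batch_results)
l = len(results)  # number of rows that have generated output
# Entries are normally JSON strings from the model; dict entries are kept as-is
results = [r if isinstance(r, dict) else json.loads(r) for r in results]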
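A single malformed reply makes `json.loads` raise and stops the whole merge. One possible way to tolerate that, sketched here with a hypothetical helper `parse_results` and made-up entries, is to replace unparseable or failed entries with an empty dict and remember their positions.

```python
import json


def parse_results(raw_results):
    """Hypothetical helper: return (parsed dicts, indices of failed entries)."""
    parsed, failed = [], []
    for i, raw in enumerate(raw_results):
        try:
            item = raw if isinstance(raw, dict) else json.loads(raw)
        except (TypeError, ValueError):
            item = None
        if not isinstance(item, dict) or "error" in item:
            failed.append(i)
            parsed.append({})  # placeholder keeps positions aligned with rows
        else:
            parsed.append(item)
    return parsed, failed


# Made-up cache entries for illustration only.
raw_results = [
    '{"ConstructName": "a", "SubjectName": "b"}',
    '{"ConstructName": "c", "Subj',  # reply cut off before the JSON closed
    {"error": "timeout"},            # failure recorded as a dict
]
parsed, failed = parse_results(raw_results)
print(failed)       # [1, 2]
print(len(parsed))  # 3
```

Keeping one entry per input row matters because the next step joins the results to `df` by position; the failed indices can then be retried separately.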

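5.5 Save the final result

# Keep only the rows that have generated output and give them a 0..n-1 index
df = df.iloc[:len(results)].reset_index(drop=True)
gen_df = pd.DataFrame(results)
# Column-wise join: generated fields are appended to the original columns
df = pd.concat([df, gen_df], axis=1)
df.to_csv(output_data_path, index=False)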

(To be continued)
