如何利用dify 生成Fine?tune 需要的Alpaca 格式數據

如果你選擇llamafactory 格式進行微調，它只是格式是Alpaca格式，dify 的agent dsl 如下，你可以導入本地的dify 或者導入cloud 版本的；測試版本是0.1.5

app:description: '上傳文件，基于文件內容，使用 SiliconCloud 128K 上下文的 Qwen2.5 模型，生成日常問答內容，JSONL 格式的語料數據?? 注：- 由于 Dify 限制，超過 80000 字符的文件內容會被截斷- 生成內容僅供參考，可能存在幻覺或內容錯漏、格式錯誤，請注意甄別'icon: 🤖icon_background: '#FFEAD5'mode: workflowname: 'Fine-tune語料構造器Alpaca格式 'use_icon_as_answer_icon: false
kind: app
version: 0.1.5
workflow:conversation_variables: []environment_variables: []features:file_upload:allowed_file_extensions:- .JPG- .JPEG- .PNG- .GIF- .WEBP- .SVGallowed_file_types:- imageallowed_file_upload_methods:- local_file- remote_urlenabled: falsefileUploadConfig:audio_file_size_limit: 50batch_count_limit: 5file_size_limit: 15image_file_size_limit: 10video_file_size_limit: 100workflow_file_upload_limit: 10image:enabled: falsenumber_limits: 3transfer_methods:- local_file- remote_urlnumber_limits: 3opening_statement: ''retriever_resource:enabled: truesensitive_word_avoidance:enabled: falsespeech_to_text:enabled: falsesuggested_questions: []suggested_questions_after_answer:enabled: falsetext_to_speech:enabled: falselanguage: ''voice: ''graph:edges:- data:isInIteration: falsesourceType: starttargetType: document-extractorid: 1735807686274-source-1735807758092-targetsource: '1735807686274'sourceHandle: sourcetarget: '1735807758092'targetHandle: targettype: customzIndex: 0- data:isInIteration: falsesourceType: document-extractortargetType: codeid: 1735807758092-source-1735807761855-targetsource: '1735807758092'sourceHandle: sourcetarget: '1735807761855'targetHandle: targettype: customzIndex: 0- data:isInIteration: falsesourceType: codetargetType: llmid: 1735807761855-source-1735807764975-targetsource: '1735807761855'sourceHandle: sourcetarget: '1735807764975'targetHandle: targettype: customzIndex: 0- data:isInIteration: falsesourceType: llmtargetType: endid: 1735807764975-source-1735807769820-targetsource: '1735807764975'sourceHandle: sourcetarget: '1735807769820'targetHandle: targettype: customzIndex: 0nodes:- data:desc: ''selected: falsetitle: 開始type: startvariables:- allowed_file_extensions: []allowed_file_types:- documentallowed_file_upload_methods:- local_file- remote_urllabel: 語料文件max_length: 10options: []required: truetype: file-listvariable: attachments- allowed_file_extensions: []allowed_file_types:- imageallowed_file_upload_methods:- local_file- remote_urllabel: 觸發詞（訓練中的 system prompt）max_length: 48options: []required: truetype: text-inputvariable: triggerheight: 116id: '1735807686274'position:x: 30y: 258positionAbsolute:x: 30y: 258selected: falsesourcePosition: righttargetPosition: lefttype: customwidth: 244- data:desc: ''is_array_file: trueselected: falsetitle: 文檔提取器type: document-extractorvariable_selector:- '1735807686274'- attachmentsheight: 92id: '1735807758092'position:x: 334y: 258positionAbsolute:x: 334y: 258selected: falsesourcePosition: righttargetPosition: lefttype: customwidth: 244- data:code: "def main(articleSections: list) -> dict:\n    try:\n        # 將列表項合并為字符串\n\\        combined_text = \"\\n\".join(articleSections)\n        \n     \\   # 截取前80000個字符\n        truncated_text = combined_text[:80000]\n    \\    \n        return {\n            \"result\": truncated_text\n      \\  }\n    except Exception as e:\n        # 錯誤處理\n        return {\n   \\         \"result\": \"\"\n        }"code_language: python3desc: ''outputs:result:children: nulltype: stringselected: falsetitle: 代碼執行type: codevariables:- value_selector:- '1735807758092'- textvariable: articleSectionsheight: 54id: '1735807761855'position:x: 638y: 258positionAbsolute:x: 638y: 258selected: falsesourcePosition: righttargetPosition: lefttype: customwidth: 244- data:context:enabled: falsevariable_selector: []desc: ''model:completion_params:frequency_penalty: 0.5max_tokens: 4096temperature: 0.3mode: chatname: Qwen/Qwen2.5-72B-Instruct-128Kprovider: siliconflowprompt_template:- id: b6913d40-d173-45d8-b012-98240d42a196role: systemtext: "【角色】  \n你是一位 LLM 大語言模型科學家，參考用戶提供的「內容」，幫助用戶構造符合規范的 Fine?tune（微調）數據。\\  \n\n【任務】  \n- 針對每次給定的「內容」，生成通俗易懂、貼近現實的「問題」（instruction）；  \n- 針對每個「問題」，引用「內容」原文并結合合理解釋，給出忠實于原文主旨的「解答」（output）；\\  \n- 最終所有條目以 Alpaca 格式輸出，每條一行 JSON，組成合法的 JSONL 文件。  \n\n【Alpaca 格式說明】\\  \n每條數據必須包含三個字段：  \n```json\n{\n  \"instruction\": \"問題（貼近現實、通俗白話）\"\,\n  \"input\": \"使用用戶指定的「觸發詞」\",\n  \"output\": \"解答（忠于原文、合理演繹）\"\n}\n\```\n\n【要求】\n1.“instruction” 中的問題不要直接照搬「內容」原句，需貼近當代生活場景；\n2.問題用語通俗，避免“假、大、空”；\n\3.“output” 必須忠于原文主旨，不得曲解；可在原文基礎上合理演繹；\n\n【輸出規范】\n1.輸出為標準 JSONL 文本，每行一個\\ JSON 對象；\n2.不要在輸出中添加多余注釋或說明文字；\n3.每行對應一條訓練樣本；\n4.保證整體文件格式合法，可直接用于微調。\n\【示例】\n```json\n{\"instruction\": \"為什么我們在家里養的綠植會在有陽光的房間里長得更好？\", \"input\"\: \"光合作用是植物將光能轉化為化學能的過程……\", \"output\": \"因為光合"- id: 61530521-14cf-4eaf-8f06-a4bc89db3cb1role: usertext: '「內容」{{#1735807761855.result#}}「觸發詞」{{#1735807686274.trigger#}}'selected: falsetitle: LLMtype: llmvariables: []vision:enabled: falseheight: 98id: '1735807764975'position:x: 937.9650491140262y: 258positionAbsolute:x: 937.9650491140262y: 258selected: truesourcePosition: righttargetPosition: lefttype: customwidth: 244- data:desc: ''outputs:- value_selector:- '1735807764975'- textvariable: textselected: falsetitle: 結束type: endheight: 90id: '1735807769820'position:x: 1246y: 258positionAbsolute:x: 1246y: 258selected: falsesourcePosition: righttargetPosition: lefttype: customwidth: 244- data:author: Difydesc: ''height: 88selected: falseshowAuthor: truetext: '{"root":{"children":[{"children":[{"detail":0,"format":0,"mode":"normal","style":"","text":"設置較低的Temperature，提高輸出格式的穩定性","type":"text","version":1}],"direction":"ltr","format":"","indent":0,"type":"paragraph","version":1,"textFormat":0}],"direction":"ltr","format":"","indent":0,"type":"root","version":1}}'theme: bluetitle: ''type: ''width: 240height: 88id: '1735808753316'position:x: 951.4285714285714y: 375.7142857142857positionAbsolute:x: 951.4285714285714y: 375.7142857142857selected: falsesourcePosition: righttargetPosition: lefttype: custom-notewidth: 240- data:author: Difydesc: ''height: 88selected: falseshowAuthor: truetext: '{"root":{"children":[{"children":[{"detail":0,"format":0,"mode":"normal","style":"","text":"合并多個文檔內容，并截取前8W 字符","type":"text","version":1}],"direction":"ltr","format":"","indent":0,"type":"paragraph","version":1,"textFormat":0}],"direction":"ltr","format":"","indent":0,"type":"root","version":1}}'theme: bluetitle: ''type: ''width: 240height: 88id: '1735808799815'position:x: 640y: 338.5714285714286positionAbsolute:x: 640y: 338.5714285714286selected: falsesourcePosition: righttargetPosition: lefttype: custom-notewidth: 240viewport:x: 16.889594857123143y: 9.872527989539648zoom: 0.7632446373312666

如果你想生產openai 支持的JSONL格式，只需要稍微調整下其中LLM 的提示詞

【角色】
你是一位 LLM 大語言模型科學家，參考用戶提供的內容，幫助用戶構造符合規范的 Fine-tune（微調）數據【任務】
- 對于給定的「內容」，你每次回列出盡可能多的通俗「問題」；
- 針對每個「問題」，引用「內容」原文及對內容的合理解釋和演繹，做出「解答」；
- 并將「問題」「解答」整理為規范的 JSONL 格式【要求】
1. 問題 **不要** 直接引用「內容」，應該貼近當代現實生活；
2. 問題應該是通俗白話，避免“假、大、空“；
3. 答案應忠于原文，對于原文的解釋不能脫離原文的主旨、思想；【輸出規范】
* 輸出規范的 JSONL，每行一條數據
* 每條數據應包含一個 message 數組，每個數組都應該包含 role 分別為 system、user 和 assistant 的三條記錄
* 其中 role 為 system 的數據，作為訓練中的 system prompt 格外重要，其 content 使用用戶指定的「觸發詞」
* role 為 user 的數據對應列出的「問題」
* role 為 assistant 的數據則對應針對「問題」的「解答」
* 示例如下：
```
{"messages": [{"role": "system", "content": "你是當代大儒"}, {"role": "user", "content": "應該怎么學習？"}, {"role": "assistant", "content": "賢賢易色；事父母，能竭其力；事君，能致其身；與朋友交，言而有信。雖曰未學，吾必謂之學矣。"}]}
```

Java 碼農轉型AI 關于微調的更多內容可以查看我的github :?https://github.com/caicongyang/ML2LLM/tree/main/LLM/lora