Controllable-detail long-document summarization with ModelScope

The idea for this article comes from a recent OpenAI cookbook recipe, summarizing_long_documents, whose goal is to demonstrate how to summarize large documents with a controllable level of detail.

If we ask a large language model to summarize a long document (say, 10k tokens or more), feeding it in directly usually yields a relatively short summary that does not scale with the document's length. For example, the summary of a 20k-token document will not be twice as long as the summary of a 10k-token document. This article addresses the problem by splitting the document into parts and summarizing each part separately. After several queries to the LLM, the full summary can be reconstructed. By controlling the number and size of the text chunks, we can ultimately control the level of detail in the output.
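The split-summarize-join idea above can be sketched in a few lines. This is a toy illustration with stand-in functions (the names `summarize_long_document`, `split`, and `summarize_chunk` are hypothetical); the real token-aware implementation is developed over the rest of the article:

```python
# Toy sketch of the overall pipeline: split the document, summarize each
# chunk (here faked with str.upper instead of an LLM call), and join the
# partial summaries back together.
def summarize_long_document(text, split, summarize_chunk):
    chunks = split(text)
    return "\n\n".join(summarize_chunk(c) for c in chunks)

result = summarize_long_document(
    "para one\n\npara two",
    split=lambda t: t.split("\n\n"),
    summarize_chunk=lambda c: c.upper(),  # stand-in for an LLM call
)
print(result)
```

More chunks mean more LLM calls and a longer combined summary, which is exactly the knob that the `detail` parameter exposes later on.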

The tools and models used in this article are as follows:

LLM: Qwen2 in GGUF format

Tool 1: Ollama, used to serve the GGUF model behind an OpenAI-compatible API

Tool 2: transformers, using its new support for loading a tokenizer directly from a GGUF file, for measuring document length and splitting the text.

Best practices

Run the Qwen2 model (for details, see the ModelScope community article《魔搭社區GGUF模型怎么玩!看這篇就夠了》).

Copy the model path and create a meta file named "ModelFile" with the following content:

```
FROM /mnt/workspace/qwen2-7b-instruct-q5_k_m.gguf
# set the temperature to 0.7 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER repeat_penalty 1.05
TEMPLATE """{{ if and .First .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}"""
# set the system message
SYSTEM """
You are a helpful assistant.
"""
```

使用ollama create命令創建自定義模型并運行

```shell
ollama create myqwen2 --file ./ModelFile
ollama run myqwen2
```

Install dependencies & read the document to be summarized

```python
import os
from typing import List, Tuple, Optional
from openai import OpenAI
from transformers import AutoTokenizer
from tqdm import tqdm

# load doc
with open("data/artificial_intelligence_wikipedia.txt", "r") as file:
    artificial_intelligence_wikipedia_text = file.read()
```

加載encoding并檢查文檔長度

HuggingFace's transformers supports loading models from a single GGUF file, which makes it possible to further train or fine-tune GGUF models and then convert them back to GGUF for use in the ggml ecosystem. A GGUF file typically contains the configuration attributes, the tokenizer, and all the tensors to be loaded into the model. Reference: https://huggingface.co/docs/transformers/gguf

The currently supported model architectures are: llama, mistral, and qwen2.

```python
# load encoding and check the length of dataset
encoding = AutoTokenizer.from_pretrained("/mnt/workspace/cherry/", gguf_file="qwen2-7b-instruct-q5_k_m.gguf")
len(encoding.encode(artificial_intelligence_wikipedia_text))
```

Call the LLM through its OpenAI-compatible API

```python
client = OpenAI(
    base_url='http://127.0.0.1:11434/v1',
    api_key='ollama',  # required, but unused
)

def get_chat_completion(messages, model='myqwen2'):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content
```

Document chunking

We define a few helper functions to split a large document into smaller pieces.

```python
def tokenize(text: str) -> List[int]:
    return encoding.encode(text)


# This function chunks a text into smaller pieces based on a maximum token count and a delimiter.
def chunk_on_delimiter(input_string: str,
                       max_tokens: int, delimiter: str) -> List[str]:
    chunks = input_string.split(delimiter)
    combined_chunks, _, dropped_chunk_count = combine_chunks_with_no_minimum(
        chunks, max_tokens, chunk_delimiter=delimiter, add_ellipsis_for_overflow=True
    )
    if dropped_chunk_count > 0:
        print(f"warning: {dropped_chunk_count} chunks were dropped due to overflow")
    combined_chunks = [f"{chunk}{delimiter}" for chunk in combined_chunks]
    return combined_chunks


# This function combines text chunks into larger blocks without exceeding a specified token count.
# It returns the combined text blocks, their original indices, and the count of chunks dropped due to overflow.
def combine_chunks_with_no_minimum(
        chunks: List[str],
        max_tokens: int,
        chunk_delimiter="\n\n",
        header: Optional[str] = None,
        add_ellipsis_for_overflow=False,
) -> Tuple[List[str], List[List[int]], int]:
    dropped_chunk_count = 0
    output = []  # list to hold the final combined chunks
    output_indices = []  # list to hold the indices of the final combined chunks
    candidate = (
        [] if header is None else [header]
    )  # list to hold the current combined chunk candidate
    candidate_indices = []
    for chunk_i, chunk in enumerate(chunks):
        chunk_with_header = [chunk] if header is None else [header, chunk]
        if len(tokenize(chunk_delimiter.join(chunk_with_header))) > max_tokens:
            print(f"warning: chunk overflow")
            if (
                    add_ellipsis_for_overflow
                    and len(tokenize(chunk_delimiter.join(candidate + ["..."]))) <= max_tokens
            ):
                candidate.append("...")
                dropped_chunk_count += 1
            continue  # this case would break downstream assumptions
        # estimate token count with the current chunk added
        extended_candidate_token_count = len(tokenize(chunk_delimiter.join(candidate + [chunk])))
        # If the token count exceeds max_tokens, add the current candidate to output and start a new candidate
        if extended_candidate_token_count > max_tokens:
            output.append(chunk_delimiter.join(candidate))
            output_indices.append(candidate_indices)
            candidate = chunk_with_header  # re-initialize candidate
            candidate_indices = [chunk_i]
        # otherwise keep extending the candidate
        else:
            candidate.append(chunk)
            candidate_indices.append(chunk_i)
    # add the remaining candidate to output if it's not empty
    if (header is not None and len(candidate) > 1) or (header is None and len(candidate) > 0):
        output.append(chunk_delimiter.join(candidate))
        output_indices.append(candidate_indices)
    return output, output_indices, dropped_chunk_count
```
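To see the combining logic in isolation, here is a self-contained toy version (the helper names `toy_tokenize` and `toy_chunk_on_delimiter` are hypothetical, and whitespace splitting stands in for the Qwen2 tokenizer so it runs without the model files):

```python
from typing import List

def toy_tokenize(text: str) -> List[str]:
    # Whitespace "tokenizer" stand-in for encoding.encode
    return text.split()

def toy_chunk_on_delimiter(text: str, max_tokens: int, delimiter: str) -> List[str]:
    # Greedily pack delimiter-separated pieces into chunks of at most max_tokens
    pieces = text.split(delimiter)
    chunks, candidate = [], []
    for piece in pieces:
        extended = delimiter.join(candidate + [piece])
        if len(toy_tokenize(extended)) > max_tokens and candidate:
            chunks.append(delimiter.join(candidate))
            candidate = [piece]
        else:
            candidate.append(piece)
    if candidate:
        chunks.append(delimiter.join(candidate))
    return chunks

text = "one two three\nfour five\nsix seven eight\nnine"
print(toy_chunk_on_delimiter(text, max_tokens=5, delimiter="\n"))
# -> ['one two three\nfour five', 'six seven eight\nnine']
```

The real `chunk_on_delimiter` above additionally tracks the original chunk indices, drops oversized pieces (optionally marking them with "..."), and appends the delimiter back onto each combined chunk.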

The summarization function

Now we can define a utility to summarize text with a controllable level of detail (note the `detail` parameter).

The function first determines the number of chunks by interpolating between a minimum and a maximum chunk count based on the `detail` parameter. It then splits the text into chunks and summarizes each one.

Now we can use this utility to produce summaries with different levels of detail. Increasing `detail` from 0 to 1 gradually yields longer summaries of the underlying document. Higher values of `detail` produce a more detailed summary because the utility first splits the document into more chunks; each chunk is then summarized, and the final summary is the concatenation of all the chunk summaries.
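The interpolation between the minimum and maximum chunk counts can be isolated into a small sketch (the function name `num_chunks_for_detail` is hypothetical, but the formula mirrors the one inside `summarize`):

```python
# detail=0 collapses everything into a single chunk (one coarse summary);
# detail=1 uses the maximum number of chunks allowed by minimum_chunk_size.
def num_chunks_for_detail(detail: float, max_chunks: int, min_chunks: int = 1) -> int:
    assert 0 <= detail <= 1
    return int(min_chunks + detail * (max_chunks - min_chunks))

for d in (0.0, 0.25, 0.5, 1.0):
    print(d, num_chunks_for_detail(d, max_chunks=40))
# -> 1, 10, 20, 40 chunks respectively
```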

```python
def summarize(text: str,
              detail: float = 0,
              model: str = 'myqwen2',
              additional_instructions: Optional[str] = None,
              minimum_chunk_size: Optional[int] = 500,
              chunk_delimiter: str = "\n",
              summarize_recursively=False,
              verbose=False):
    """
    Summarizes a given text by splitting it into chunks, each of which is summarized individually.
    The level of detail in the summary can be adjusted, and the process can optionally be made recursive.

    Parameters:
    - text (str): The text to be summarized.
    - detail (float, optional): A value between 0 and 1 indicating the desired level of detail in the summary.
      0 leads to a higher level summary, and 1 results in a more detailed summary. Defaults to 0.
    - model (str, optional): The model to use for generating summaries. Defaults to 'myqwen2'.
    - additional_instructions (Optional[str], optional): Additional instructions to provide to the model for customizing summaries.
    - minimum_chunk_size (Optional[int], optional): The minimum size for text chunks. Defaults to 500.
    - chunk_delimiter (str, optional): The delimiter used to split the text into chunks. Defaults to a newline.
    - summarize_recursively (bool, optional): If True, summaries are generated recursively, using previous summaries for context.
    - verbose (bool, optional): If True, prints detailed information about the chunking process.

    Returns:
    - str: The final compiled summary of the text.

    The function first determines the number of chunks by interpolating between a minimum and a maximum
    chunk count based on the `detail` parameter. It then splits the text into chunks and summarizes each
    chunk. If `summarize_recursively` is True, each summary is based on the previous summaries, adding more
    context to the summarization process. The function returns a compiled summary of all chunks.
    """

    # check detail is set correctly
    assert 0 <= detail <= 1

    # interpolate the number of chunks based on the detail parameter
    max_chunks = len(chunk_on_delimiter(text, minimum_chunk_size, chunk_delimiter))
    min_chunks = 1
    num_chunks = int(min_chunks + detail * (max_chunks - min_chunks))

    # adjust chunk_size based on interpolated number of chunks
    document_length = len(tokenize(text))
    chunk_size = max(minimum_chunk_size, document_length // num_chunks)
    text_chunks = chunk_on_delimiter(text, chunk_size, chunk_delimiter)
    if verbose:
        print(f"Splitting the text into {len(text_chunks)} chunks to be summarized.")
        print(f"Chunk lengths are {[len(tokenize(x)) for x in text_chunks]}")

    # set system message
    system_message_content = "Rewrite this text in summarized form."
    if additional_instructions is not None:
        system_message_content += f"\n\n{additional_instructions}"

    accumulated_summaries = []
    for chunk in tqdm(text_chunks):
        if summarize_recursively and accumulated_summaries:
            # Creating a structured prompt for recursive summarization
            accumulated_summaries_string = '\n\n'.join(accumulated_summaries)
            user_message_content = f"Previous summaries:\n\n{accumulated_summaries_string}\n\nText to summarize next:\n\n{chunk}"
        else:
            # Directly passing the chunk for summarization without recursive context
            user_message_content = chunk

        # Constructing messages based on whether recursive summarization is applied
        messages = [
            {"role": "system", "content": system_message_content},
            {"role": "user", "content": user_message_content}
        ]

        # Assuming this function gets the completion and works as expected
        response = get_chat_completion(messages, model=model)
        accumulated_summaries.append(response)

    # Compile final summary from partial summaries
    final_summary = '\n\n'.join(accumulated_summaries)

    return final_summary


summary_with_detail_0 = summarize(artificial_intelligence_wikipedia_text, detail=0, verbose=True)
```

```python
summary_with_detail_pt25 = summarize(artificial_intelligence_wikipedia_text, detail=0.25, verbose=True)
```

This utility also allows passing additional instructions.

```python
summary_with_additional_instructions = summarize(
    artificial_intelligence_wikipedia_text,
    detail=0.1,
    additional_instructions="Write in point form and focus on numerical data.",
)
print(summary_with_additional_instructions)
```

Finally, note that the utility supports recursive summarization, where each summary is conditioned on the previous summaries, adding more context to the summarization process. This is enabled by setting the `summarize_recursively` parameter to True. It is more computationally expensive, but can improve the consistency and coherence of the combined summary.
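The recursive prompt layout can be shown on its own (the helper name `build_recursive_prompt` is hypothetical; the string format matches the one used inside `summarize`):

```python
# When earlier summaries exist, they are prepended as context;
# the first chunk is passed through unchanged.
def build_recursive_prompt(accumulated_summaries, chunk):
    if accumulated_summaries:
        previous = "\n\n".join(accumulated_summaries)
        return f"Previous summaries:\n\n{previous}\n\nText to summarize next:\n\n{chunk}"
    return chunk

print(build_recursive_prompt(["Summary of part 1."], "Part 2 text..."))
```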

```python
recursive_summary = summarize(artificial_intelligence_wikipedia_text, detail=0.1, summarize_recursively=True)
print(recursive_summary)
```
