Entity Extraction from Resume using Mistral-7b for Knowledge Graphs

Translated from: “Entity Extraction from Resume using Mistral-7b for Knowledge Graphs” | by Tejpal Kumawat | Feb, 2024 | Medium[1]

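In the fast-moving field of natural language processing (NLP), the ability to accurately extract and analyze information from unstructured text is becoming increasingly important. One of the most challenging and relevant applications of this ability is processing resumes to build knowledge graphs. Resumes are dense, complex documents packed with information about a candidate's work history, skills, and qualifications. Extracting that information accurately and efficiently, however, requires advanced NLP techniques.

This is where "Entity Extraction from Resume using Mistral-7B-Instruct-v0.2 for Knowledge Graphs" comes in. Mistral-7B-Instruct-v0.2 is an instruction-tuned language model, and it offers a new approach to resume parsing by identifying and classifying key entities such as names, organizations, job titles, skills, and education details. By using the instruction-following ability of Mistral-7b, we can not only extract these entities with high accuracy but also structure them in a way that lends itself to building a comprehensive knowledge graph.

Knowledge graphs organize and visualize the relationships between entities and give a holistic view of the data, which is valuable for applications such as recruiting, talent management, and job matching. In this blog, we look at how Mistral-7b-instruct can change the resume analysis process, the technical basis of entity extraction, and the steps for building a knowledge graph from the extracted data. We also consider the potential benefits and implications of this technology for the future of HR and recruitment analytics.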

This is what a typical knowledge graph looks like:

[Image: example knowledge graph]

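For data-privacy reasons, we may not be able to use OpenAI or other hosted APIs. So the question is: how can we do this task accurately with an offline model?

We will use https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 as the model for our use case.

We will extract the relevant entities from the resume step by step.

Step 1: Extract text from the PDF or image.

I don't show the code for this step, but you can use PyMuPDF if you have a PDF file, or pytesseract if you have an image of the resume.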
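Since the code for this step is not shown, here is a minimal sketch of what it might look like. The function names (`extract_text_from_pdf`, `extract_text_from_image`, `normalize_whitespace`) are hypothetical, and the sketch assumes PyMuPDF (imported as `fitz`), pytesseract, Pillow, and the Tesseract binary are installed. The third-party imports are deferred so that the whitespace helper works on its own.

```python
import re


def extract_text_from_pdf(path: str) -> str:
    """Concatenate the plain text of every page (assumes PyMuPDF is installed)."""
    import fitz  # PyMuPDF; imported lazily so the helper below has no dependencies

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_text_from_image(path: str) -> str:
    """OCR a resume image (assumes pytesseract, Pillow and the Tesseract binary)."""
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path))


def normalize_whitespace(raw: str) -> str:
    """Collapse runs of whitespace (including non-breaking spaces) into single spaces."""
    return re.sub(r"\s+", " ", raw.replace("\u00a0", " ")).strip()


# Extracted resume text often contains stray non-breaking spaces and newlines.
print(normalize_whitespace("Developer \u00a0 TATA\n\nBatavia,  OH "))
```

Normalizing whitespace before prompting is optional, but it shortens the prompt and removes extraction noise.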

The resume text for our use case:

text="Developer <span class=\"hl\">Developer</span> Developer - TATA CONSULTANTCY SERVICE Batavia, OH Relevant course work  Database Systems, Database Administration, Database Security & Auditing, Computer Security,Computer Networks, Programming & Software Development, IT, Information Security Concept & Admin,  IT System Acquisition & Integration, Advanced Web Development, and Ethical Hacking: Network Security & Pen Testing. Work Experience Developer TATA CONSULTANTCY SERVICE June 2016 to Present MRM (Government of ME, RI, MS) Developer     Working with various technologies such as Java, JSP, JSF, DB2(SQL), LDAP, BIRT report, Jazz version control, Squirrel SQL client, Hibernate, CSS, Linux, and Windows. Work as part of a team that provide support to enterprise applications. Perform miscellaneous support activities as requested by Management. Perform in-depth research and identify sources of production issues.   SPLUNK Developer  Supporting the Splunk Operational environment for Busine...OF COMMERCE - BANGKOK, TH June 1997 to May 2001 Skills Db2 (2 years), front end (2 years), Java (2 years), Linux (2 years), Splunk (2 years), SQL (3 years) Certifications/Licenses Splunk Certified Power User V6.3 August 2016 to Present CERT-112626 Splunk Certified Power User V6.x May 2017 to Present CERT-168138 Splunk Certified User V6.x May 2017 to Present CERT -181476 Driver's License Additional Information Skills  ∑    SQL, PL/SQL, Knowledge of Data Modeling, Experience on Oracle database/RDBMS.  ∑        Database experience on Oracle, DB2, SQL Sever, MongoDB, and MySQL.  ∑        Knowledge of tools including Splunk, tableau, and wireshark.  ∑        Knowledge of SCRUM/AGILE and WATERFALL methodologies.  ∑        Web technology included: HTML5, CSS3, XML, JSON, JavaScript, node.js, NPM, GIT, express.js, jQuery, Angular, Bootstrap, and Restful API.  ∑        Working Knowledge in JAVA, J2EE, and PHP.  Operating system Experience included: Windows, Mac OS, Linux (Ubuntu, Mint, Kali)"
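Step 2: Extract the entities.

You can run the code below on Google Colab; I used an AWS instance.

To extract according to our schema, I will chain a series of prompts, each focused on a single task: extracting one specific entity type. This keeps each request within the token limit, and it also improves extraction quality.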
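The prompt chain described above can be sketched as a loop that runs one prompt per entity type and collects the raw answers. `run_prompt_chain`, the short templates, and the stand-in `fake_llm` are hypothetical names for illustration only. In the real pipeline, the callable would wrap the Mistral model and the templates would be the full prompts.

```python
from typing import Callable, Dict


def run_prompt_chain(
    text: str,
    templates: Dict[str, str],
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Run one prompt per entity type and return the raw answer for each."""
    answers = {}
    for entity_type, template in templates.items():
        # Each call sees only one task, so the prompt stays short and focused.
        prompt = template.format(text=text)
        answers[entity_type] = llm(prompt)
    return answers


# Hypothetical templates; the real ones are much longer.
templates = {
    "Person": "Extract the Person entity from: {text}",
    "Education": "Extract Education entities from: {text}",
    "Skill": "Extract Skill entities from: {text}",
}


def fake_llm(prompt: str) -> str:
    """Stand-in for the model: echoes the instruction part of the prompt."""
    return prompt.split(" from:")[0]


answers = run_prompt_chain("Developer at TATA ...", templates, fake_llm)
print(answers)
```

Keeping the entity types in a dictionary makes it easy to add another prompt (for example for companies or positions) without touching the loop.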

Required libraries:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

Download the model:

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    # Optional quantization to reduce memory use:
    # load_in_8bit=True,
    # load_in_4bit=True,
    device_map="auto",
    use_cache=True,
)

Set up LangChain and configure the model:

from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.prompts.prompt import PromptTemplate
from transformers import AutoTokenizer, TextStreamer, pipeline, LlamaForCausalLM, AutoModelForCausalLM
from langchain import HuggingFacePipeline, PromptTemplate

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

# Stream generated tokens to stdout as they are produced.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=5000,
    do_sample=False,  # greedy decoding
    repetition_penalty=1.15,
    streamer=streamer,
)
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0.1})

Now we can use the LLM.

1. Person information

Prompt

person_prompt_tpl="""From the Resume text for a job aspirant below, extract Entities strictly as instructed below
1. First, look for the Person Entity type in the text and extract the needed information defined below:
   `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. NEVER create new entity types that aren't mentioned below. Document must be summarized and stored inside Person entity under `description` property
   Entity Types:
   label:'Person',id:string,role:string,description:string //Person Node
2. Description property should be a crisp text summary and MUST NOT be more than 100 characters
3. If you cannot find any information on the entities & relationships above, it is okay to return empty value. DO NOT create fictious data
4. Do NOT create duplicate entities
5. Restrict yourself to extract only Person information. No Position, Company, Education or Skill information should be focussed.
6. NEVER Impute missing values
Example Output JSON:
{{"entities": [{{"label":"Person","id":"person1","role":"Prompt Developer","description":"Prompt Developer with more than 30 years of LLM experience"}}]}}

Question: Now, extract the Person for the text below -
{text}

Answer:
"""

This gets us the person's information in JSON format, following the instructions given in the prompt.

import time
from langchain.chains import LLMChain

prompttemplate = PromptTemplate(template=person_prompt_tpl, input_variables=['text'])
chain = LLMChain(llm=llm, prompt=prompttemplate)

t1 = time.time()
result = chain(text)  # dict; the generated answer is under result["text"]
t2 = time.time()
print(t2 - t1)  # elapsed time in seconds

Output → Person information

{
  "entities": [
    {
      "label": "Person",
      "id": "developer1",
      "role": "Developer",
      "description": "Experienced developer with expertise in Java, JSP, JSF, DB2(SQL), LDAP, BIRT report, Jazz version control, Squirrel SQL client, Hibernate, CSS, Linux, and Windows. Has worked as a Splunk Developer supporting the Splunk Operational environment for Business Solutions Unit."
    }
  ]
}

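This is great: we get the person's label, id, role, and description in the format we asked for. Note that the description runs well past the 100-character limit set in the prompt, so a length limit like that is best enforced in post-processing.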
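The chain returns text rather than a parsed object, and the generated text may contain extra words around the JSON. Below is a minimal sketch of pulling out the first JSON object, assuming the answer contains one. `extract_json` is a hypothetical helper, not part of LangChain, and the sample answer is made up for illustration.

```python
import json
from typing import Any, Dict, Optional


def extract_json(answer: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in the model's answer, or None."""
    decoder = json.JSONDecoder()
    start = answer.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(answer, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = answer.find("{", start + 1)
    return None


# Made-up answer with chatter before and after the JSON payload.
answer = (
    'Sure, here it is: {"entities": [{"label": "Person", '
    '"id": "developer1", "role": "Developer"}]} Hope this helps.'
)
parsed = extract_json(answer)
print(parsed["entities"][0]["role"])
```

Returning `None` instead of raising lets the caller decide whether to retry the prompt or skip the entity type.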

2. Education information

Prompt

edu_prompt_tpl="""From the Resume text for a job aspirant below, extract Entities strictly as instructed below
1. Look for Education entity type and generate the information defined below:
   `id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring this property to define the relationship between entities. NEVER create other entity types that aren't mentioned below. You will have to generate as many entities as needed as per the types below:
   Entity Definition:
   label:'Education',id:string,degree:string,university:string,graduationDate:string,score:string,url:string //Education Node
2. If you cannot find any information on the entities above, it is okay to return empty value. DO NOT create fictious data
3. Do NOT create duplicate entities or properties
4. Strictly extract only Education. No Skill or other Entities should be extracted
5. DO NOT MISS out any Education related entity
6. NEVER Impute missing values
Output JSON (Strict):
{{"entities": [{{"label":"Education","id":"education1","degree":"Bachelor of Science","graduationDate":"May 2022","score":"0.0"}}]}}

Question: Now, extract Education information as mentioned above for the text below -
{text}

Answer:
"""

import time
from langchain.chains import LLMChain

prompttemplate = PromptTemplate(template=edu_prompt_tpl, input_variables=['text'])
chain = LLMChain(llm=llm, prompt=prompttemplate)

t1 = time.time()
result = chain(text)  # dict; the generated answer is under result["text"]
t2 = time.time()
print(t2 - t1)  # elapsed time in seconds

Output → Education information:

{
  "entities": [
    {"label": "Education", "id": "education1", "degree": "Master of Science in Information Technology", "university": "KENNESAW STATE UNIVERSITY", "graduationDate": "May 2015"},
    {"label": "Education", "id": "education2", "degree": "Master of Business Administration in International Business", "university": "AMERICAN INTER CONTINENTAL UNIVERSITY ATLANTA", "graduationDate": "December 2005"},
    {"label": "Education", "id": "education3", "degree": "Bachelor of Arts in Public Relations", "university": "THE UNIVERSITY OF THAI CHAMBER OF COMMERCE", "graduationDate": "May 2001"}
  ]
}

This works very well: we get a JSON object containing all of the person's education entries.
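The prompts require every `id` to be alphanumeric and unique, and they forbid entity types other than the one requested. Those rules can be checked after the fact. Here is a small sketch, where `validate_entities` is a hypothetical helper and the sample data is abridged from the education output above.

```python
from typing import Any, Dict, List


def validate_entities(entities: List[Dict[str, Any]], expected_label: str) -> List[str]:
    """Return a list of rule violations; an empty list means the output is clean."""
    problems = []
    seen_ids = set()
    for entity in entities:
        entity_id = str(entity.get("id", ""))
        if entity.get("label") != expected_label:
            problems.append(f"{entity_id}: unexpected label {entity.get('label')!r}")
        if not entity_id.isalnum():
            problems.append(f"{entity_id!r}: id is not alphanumeric")
        if entity_id in seen_ids:
            problems.append(f"{entity_id}: duplicate id")
        seen_ids.add(entity_id)
    return problems


education = [
    {"label": "Education", "id": "education1",
     "degree": "Master of Science in Information Technology"},
    {"label": "Education", "id": "education2",
     "degree": "Master of Business Administration in International Business"},
]
print(validate_entities(education, "Education"))
```

An empty list means the entities pass all three checks; anything else is a reason to re-run the prompt or fix the output by hand.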

3. Skill information

Prompt

skill_prompt_tpl="""From the Resume text below, extract Entities strictly as instructed below
1. Look for prominent Skill Entities in the text. The `id` property of each entity must be alphanumeric and must be unique among the entities. NEVER create new entity types that aren't mentioned below:
   Entity Definition:
   label:'Skill',id:string,name:string,level:string //Skill Node
2. NEVER Impute missing values
3. If you do not find any level information: assume it as `expert` if the experience in that skill is more than 5 years, `intermediate` for 2-5 years and `beginner` otherwise.
Example Output Format:
{{"entities": [{{"label":"Skill","id":"skill1","name":"Neo4j","level":"expert"}},{{"label":"Skill","id":"skill2","name":"Pytorch","level":"expert"}}]}}

Question: Now, extract entities as mentioned above for the text below -
{text}

Answer:
"""

import time
from langchain.chains import LLMChain

prompttemplate = PromptTemplate(template=skill_prompt_tpl, input_variables=['text'])
chain = LLMChain(llm=llm, prompt=prompttemplate)

t1 = time.time()
result = chain(text)  # dict; the generated answer is under result["text"]
t2 = time.time()
print(t2 - t1)  # elapsed time in seconds

Output → Skill information:

{
  "entities": [
    {"label": "Skill", "id": "skill1", "name": "Java", "level": "expert"},
    {"label": "Skill", "id": "skill2", "name": "JSP", "level": "expert"},
    {"label": "Skill", "id": "skill3", "name": "JSF", "level": "expert"},
    {"label": "Skill", "id": "skill4", "name": "DB2", "level": "intermediate"},
    {"label": "Skill", "id": "skill5", "name": "Linux", "level": "expert"},
    {"label": "Skill", "id": "skill6", "name": "Windows", "level": "intermediate"},
    {"label": "Skill", "id": "skill7", "name": "SQL", "level": "expert"},
    {"label": "Skill", "id": "skill8", "name": "Oracle", "level": "intermediate"},
    {"label": "Skill", "id": "skill9", "name": "MySQL", "level": "intermediate"},
    {"label": "Skill", "id": "skill10", "name": "MongoDB", "level": "beginner"},
    {"label": "Skill", "id": "skill11", "name": "HTML5", "level": "expert"},
    {"label": "Skill", "id": "skill12", "name": "CSS3", "level": "expert"},
    ...id": "skill15", "name": "JavaScript", "level": "expert"},
    {"label": "Skill", "id": "skill16", "name": "Node.js", "level": "expert"},
    {"label": "Skill", "id": "skill17", "name": "NPM", "level": "expert"},
    {"label": "Skill", "id": "skill18", "name": "GIT", "level": "expert"},
    {"label": "Skill", "id": "skill19", "name": "express.js", "level": "expert"},
    {"label": "Skill", "id": "skill20", "name": "jQuery", "level": "expert"},
    {"label": "Skill", "id": "skill21", "name": "Angular", "level": "expert"},
    {"label": "Skill", "id": "skill22", "name": "Bootstrap", "level": "expert"},
    {"label": "Skill", "id": "skill23", "name": "Restful API", "level": "expert"},
    {"label": "Skill", "id": "skill24", "name": "PHP", "level": "intermediate"},
    {"label": "Skill", "id": "skill25", "name": "SCRUM/AGILE", "level": "expert"},
    {"label": "Skill", "id": "skill26", "name": "WATERFALL methodologies", "level": "expert"}
  ]
}

And with that, we have all of the person's skills as JSON.
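Rule 3 of the skill prompt maps years of experience to a level. Because the resume's Skills section states the years explicitly (for example `Java (2 years)`), the same rule can be applied deterministically to cross-check the model's labels. This is a sketch under the assumption that years appear as `Name (N years)`. `level_for_years` and `skills_with_years` are hypothetical names, and the rule's "2-5 years" is read as inclusive at both ends.

```python
import re
from typing import Dict


def level_for_years(years: float) -> str:
    """Apply the prompt's rule: >5 years expert, 2-5 intermediate, else beginner."""
    if years > 5:
        return "expert"
    if years >= 2:
        return "intermediate"
    return "beginner"


def skills_with_years(skills_section: str) -> Dict[str, str]:
    """Map each 'Name (N years)' mention to a level."""
    pattern = re.compile(r"([A-Za-z0-9+#. ]+?)\s*\((\d+)\s+years?\)")
    return {
        name.strip(): level_for_years(int(years))
        for name, years in pattern.findall(skills_section)
    }


# Copied from the Skills section of the resume text.
section = (
    "Db2 (2 years), front end (2 years), Java (2 years), "
    "Linux (2 years), Splunk (2 years), SQL (3 years)"
)
levels = skills_with_years(section)
print(levels)
```

Every skill listed with years comes out as `intermediate` under this rule, whereas the model output above labels Java, Linux, and SQL as `expert`. That suggests the model's levels should be reviewed rather than trusted as-is.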

From here, we can draw a knowledge graph like the one shown above.
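One way to get from the three JSON outputs to a graph is to treat each entity as a node and connect the Person node to every Education and Skill node. Below is a minimal sketch using plain Python data structures. The relationship names `HAS_EDUCATION` and `HAS_SKILL` and the function `build_graph` are assumptions for illustration; the extraction outputs themselves do not define any relationships.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]
Edge = Tuple[str, str, str]  # (source id, relationship, target id)

# Assumed relationship names, keyed by the label of the target node.
RELATIONSHIPS = {"Education": "HAS_EDUCATION", "Skill": "HAS_SKILL"}


def build_graph(person: Node, others: List[Node]) -> Tuple[Dict[str, Node], List[Edge]]:
    """Return (nodes keyed by id, edges from the person to every other entity)."""
    nodes = {person["id"]: person}
    edges = []
    for entity in others:
        relationship = RELATIONSHIPS.get(entity["label"])
        if relationship is None:
            continue  # skip entity types we have no relationship for
        nodes[entity["id"]] = entity
        edges.append((person["id"], relationship, entity["id"]))
    return nodes, edges


# Entities abridged from the outputs above.
person = {"label": "Person", "id": "developer1", "role": "Developer"}
others = [
    {"label": "Education", "id": "education1",
     "degree": "Master of Science in Information Technology"},
    {"label": "Skill", "id": "skill1", "name": "Java", "level": "expert"},
]
nodes, edges = build_graph(person, others)
for source, relationship, target in edges:
    print(f"({source})-[:{relationship}]->({target})")
```

The resulting node and edge lists can then be loaded into a graph database or a visualization library of your choice.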

I hope you find this useful.
