Token Disambiguation

Torture the data, and it will confess to anything

Disambiguation, as defined in the vocabulary.com dictionary, is the removal of ambiguity by making something clear and narrowing down its meaning. Data disambiguation is not an easy task, but it is essential for all language processing and correlates directly with perceived data quality.

In data integration, the goal is to consolidate data from disparate sources into a single homogenized set and, ultimately, to provide users with consistent access to and delivery of data. However, real-world data is messy, inconsistent, and ambiguous. As a result, it needs to be processed and massaged to maximize its effectiveness. Disambiguation provides a framework in which integrating data and transforming it into a consistent format is scalable. This transformation generally involves creating a common vocabulary and a framework to extract valuable information from the noise and added complexity.

Data in the Recruitment World

Data in the recruitment world consists of a mixture of structured and unstructured information of varying lengths, such as job descriptions, curricula vitae, and cover letters. At Beamery, we have implemented various mechanisms to extract and structure information from these textual sources, for example role titles, skills, experience descriptions, and company names. Among those, role titles stand out as the key that draws the lines around the skills and the knowledge the individual possesses.

Taking advantage of the rich background information around a role title is useful; however, role titles are also among the harder pieces of information to deal with in the recruitment space. They are ever-changing, prone to typos, and can carry additional knowledge, such as seniority or location, on top of the main phrase defining the work. Moreover, you are likely to encounter synonyms expressing the same set of skills and experience, pointing at the same semantic object. For example, “software engineer”, “software developer” and “software ninja” can be used interchangeably, all representing the same underlying experience. In this work, we make a distinction between disambiguation and similarity. The following process does not make judgments about the similarities between role titles. The aim is to disambiguate the textual information without losing vital details that shape the expectations from the role title.

The Problem

Before we go into detail, it is important to paint a clear picture of the problem. We sampled around 10 million distinct anonymous contacts from Beamery’s database. These “contacts” are individuals who have been in contact with our client companies about a job position. Role titles are among the data stored about these individuals. When we created the list of role titles from this sample, we were shocked by the staggering count of 7 million distinct role titles. Our clients hire across different nationalities and countries, so we would expect the inclusion of different languages to inflate the final number. However, the count still indicates a clear data cleanliness problem that may be caused by typos or other technical errors in CV parsing or third-party integration systems.

When thinking about reducing the complexity of the role title space, an intuitive approach would be to map the raw role titles to a curated version, preferably from a taxonomy of role titles. This way we would reduce the diversity in 7 million role titles and translate them to their counterparts in a known space. There are public taxonomies available, such as the efforts by ESCO and O*Net. It seems enticing at first to take advantage of work that is incredibly costly in both experts’ time and money. Yet mapping to a taxonomy poses a non-trivial search problem, and it isn’t really the first step in dealing with role titles. It became clear to us that we needed to deconstruct, clean, and understand the building blocks before moving on to mapping or other downstream efforts.

The Solution

Our response is a multi-step disambiguation framework built around a role title vocabulary. There are two main parts to the process: cleaning and feature extraction. Cleaning includes basic preprocessing, spelling correction, and token removal by discarding out-of-vocabulary words. The outcome of the cleaning step is the “disambiguated title”. Feature extraction focuses on transforming a string into a set of features, such as seniority levels and a set of phrases.

Vocabulary

At the core of the disambiguation process lies the role title vocabulary. Vocabularies allow us to define the boundaries of an entity by offering a set of acceptable building blocks. In this case, words are the building blocks for role titles. Common words such as manager, director, senior, and specialist are the usual suspects. We also have words that represent expertise. For example, “scientist” is a broad term that defines a set of skills. “Research scientist” indicates that the role is likely to belong in academia, whereas a “data scientist” is likely to work in a commercial company. All of these words heavily modify the meaning of the role title. But how do we decide whether a word belongs in a role title and whether its presence enriches the role title’s meaning? We have chosen commonality as the acceptance criterion for the vocabulary: a word is accepted if it occurs often enough. Selecting the count threshold for accepting a word is a balancing act. A lower threshold means a bigger vocabulary; the resulting disambiguation will be less aggressive but will have higher coverage. A higher threshold means a smaller vocabulary; coverage will suffer but disambiguation can be more robust, although this more aggressive setting carries the risk of inviting many false positives.
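The effect of the count threshold can be illustrated with a minimal sketch. The function name `build_vocabulary`, the whitespace tokenization, and the toy titles are assumptions made for illustration, not the implementation described in this post.

```python
from collections import Counter


def build_vocabulary(titles, min_count):
    """Return the set of words that occur at least `min_count` times.

    Assumption: titles are tokenized by lowercasing and splitting on whitespace.
    """
    counts = Counter(word for title in titles for word in title.lower().split())
    return {word for word, n in counts.items() if n >= min_count}


# Toy data, purely illustrative.
titles = [
    "senior data scientist",
    "data scientist",
    "data engineer",
    "senior software engineer",
    "software engineer london",
]

# A higher threshold gives a smaller vocabulary: "london" occurs once and is dropped.
small_vocab = build_vocabulary(titles, min_count=2)
# A lower threshold gives a bigger vocabulary with higher coverage.
large_vocab = build_vocabulary(titles, min_count=1)
```

With the higher threshold the rare word is treated as noise; with the lower one it is kept, so fewer titles lose tokens during cleaning.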

We have created a role title vocabulary of 8,330 words with a threshold of 100. It covers 96% of the role titles that occur at least 5 times among 28 million instances. The main assumption here is that any word that is not part of the vocabulary is either noise, a typo (if we failed to correct it), or so obscure that downstream models and processes cannot make sense of it.
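One way to measure this kind of coverage is sketched below. It assumes that a title counts as "covered" when every one of its words is in the vocabulary; the post does not spell out the exact definition, so treat this as one plausible reading. The name `coverage` and the sample data are hypothetical.

```python
from collections import Counter


def coverage(title_counts, vocabulary, min_title_count):
    """Share of sufficiently frequent titles whose words are all in the vocabulary.

    Assumption: "covered" means no out-of-vocabulary word remains in the title.
    """
    frequent = [t for t, n in title_counts.items() if n >= min_title_count]
    if not frequent:
        return 0.0
    covered = [t for t in frequent if all(w in vocabulary for w in t.split())]
    return len(covered) / len(frequent)


# Toy data, purely illustrative.
title_counts = Counter({
    "data scientist": 6,
    "software engineer": 5,
    "data wizzard": 5,          # frequent, but contains an out-of-vocabulary word
    "chief happiness guru": 1,  # too rare to be considered
})
vocabulary = {"data", "scientist", "software", "engineer"}

share = coverage(title_counts, vocabulary, min_title_count=5)
```

Here two of the three frequent titles are fully covered, so `share` is two thirds.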

Disambiguated Title

An important step in the cleaning process is creating a “fingerprint” of the role title. After preprocessing, spelling correction, and token removal, we create an ID for the role title from the remaining tokens. We were influenced by the simple but efficient approach taken for clustering in OpenRefine, which orders the words alphabetically and keeps only the unique ones. Using the “fingerprint”, we can group role titles sharing the same ID and assign a “disambiguated title” by taking the most common version.

[Figure: role titles sharing “Senior Data Scientist” as their “disambiguated title”]
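The fingerprinting and grouping idea can be sketched in a few lines. The helper names `fingerprint` and `disambiguated_titles` are hypothetical, and the input is assumed to be titles that have already been cleaned.

```python
from collections import Counter, defaultdict


def fingerprint(tokens):
    """Sort the unique tokens alphabetically and join them into an ID."""
    return " ".join(sorted(set(tokens)))


def disambiguated_titles(cleaned_titles):
    """Map each fingerprint to the most common title that produced it."""
    groups = defaultdict(Counter)
    for title in cleaned_titles:
        groups[fingerprint(title.split())][title] += 1
    return {key: variants.most_common(1)[0][0] for key, variants in groups.items()}


# Toy data: titles after preprocessing, spelling correction, and token removal.
cleaned = [
    "senior data scientist",
    "senior data scientist",
    "data scientist senior",
    "scientist data senior",
    "data engineer",
]
result = disambiguated_titles(cleaned)
```

All three orderings of the first title share the fingerprint "data scientist senior", and the most frequent variant, "senior data scientist", becomes the disambiguated title for the group.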

Phrase Detection

Role titles have different components and can include several roles listed together, as in “Founder and Chief Executive Officer”. If we were to aim for a one-to-one mapping to a curated list of role titles, we would always lose vital information or be forced to make an arbitrary decision about which role title to map to.

Instead of mapping raw role titles to a curated set of role titles, we have chosen to deconstruct the role title into a list of words and phrases. Understanding the vocabulary enables us to work with any role title spanned by these entities. This is very similar to the way a human understands written text. Instead of mapping every possible sentence to an instance in our memory, we learn the words and the grammar. This is the type of understanding that our process enables.

To capture the phrases, we have trained an n-gram language model to qualify phrase candidates found in the role titles. The example role title below holds pockets of information that are valuable in assessing the expertise of the individual. We could map this role to a “Vice President”, but in the process we would lose most of the context. Instead, we qualify the phrases and keep them as a set of tags. The features extracted from this role title (seniority, phrases, and the disambiguated title) together capture the complete context, but we could use any of the features individually depending on the nature of the downstream solution.
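As a rough illustration of qualifying phrase candidates with an n-gram model, the sketch below uses a bigram model with maximum-likelihood estimates and a fixed probability threshold. The choice of bigrams, the threshold value, the function names, and the training titles are all assumptions; the post does not describe the actual model at this level of detail.

```python
from collections import Counter


def train_bigram_model(titles):
    """Count unigrams and adjacent word pairs over a list of role titles."""
    unigrams, bigrams = Counter(), Counter()
    for title in titles:
        words = title.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def phrase_probability(phrase, unigrams, bigrams):
    """Product of estimated conditional probabilities P(next word | word)."""
    words = phrase.lower().split()
    prob = 1.0
    for first, second in zip(words, words[1:]):
        if unigrams[first] == 0:
            return 0.0
        prob *= bigrams[(first, second)] / unigrams[first]
    return prob


def detect_phrases(title, unigrams, bigrams, threshold):
    """Keep the two-word candidates whose probability reaches the threshold."""
    words = title.lower().split()
    candidates = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    return [c for c in candidates
            if phrase_probability(c, unigrams, bigrams) >= threshold]


# Toy training data, purely illustrative.
training_titles = [
    "vice president sales", "vice president marketing", "vice president",
    "president and founder", "machine learning engineer",
    "machine learning scientist", "sales engineer",
]
unigrams, bigrams = train_bigram_model(training_titles)
phrases = detect_phrases("Vice President Machine Learning", unigrams, bigrams, 0.8)
```

In this toy corpus "vice" is always followed by "president" and "machine" by "learning", so both pairs qualify, while "president machine" never occurs and is rejected.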

Role Title Disambiguation Process

The diagram below shows the different steps in the disambiguation process.

Start with the raw role title

Detect the language of the role title

Every other step in the process depends on the language of the role title. The spelling correction, the vocabulary of words and phrases, and the seniority dictionary all change with the language. Therefore, detecting the language at the beginning is vital.

Preprocessing includes dealing with non-Latin characters, expanding acronyms, removing punctuation, and removing extra whitespace.

The spelling correction step allows us to catch any spelling errors before we look for the vocabulary words in the role title.

Token removal works on the assumption that words that are left out of the vocabulary are irrelevant to the granularity we are aiming for.

Fingerprinting creates a unique representation of the role title that serves as an ID.

The disambiguated title is a clean version of the role title, shared by all the role titles that have the same fingerprint.

Seniority detection is the process of looking for seniority terms inside the role title. If found, seniority is extracted as a new feature from the role title.

The phrase detection step uses an n-gram language model to assign probabilities to word groups and to qualify them by their ability to represent the role title.
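The cleaning steps listed above (preprocessing, spelling correction, and token removal) can be chained as in the following sketch. The acronym table, the tiny vocabulary, the similarity cutoff, and the use of Python's `difflib` for spelling correction are assumptions chosen to keep the example self-contained; handling of non-Latin characters and language detection are left out.

```python
import difflib
import string

# Hypothetical lookup tables, for illustration only.
ACRONYMS = {"vp": "vice president", "sr": "senior", "ceo": "chief executive officer"}
VOCABULARY = {"vice", "president", "senior", "software", "engineer",
              "chief", "executive", "officer", "data", "scientist", "sales"}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(title):
    """Lowercase, strip punctuation, collapse whitespace, and expand acronyms."""
    words = title.lower().translate(PUNCT_TO_SPACE).split()
    tokens = []
    for word in words:
        tokens.extend(ACRONYMS.get(word, word).split())
    return tokens


def correct_spelling(tokens, vocabulary, cutoff=0.8):
    """Replace each unknown token with its closest vocabulary word, if any."""
    candidates = sorted(vocabulary)
    corrected = []
    for token in tokens:
        if token in vocabulary:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected


def remove_tokens(tokens, vocabulary):
    """Discard tokens that are still out of vocabulary after correction."""
    return [token for token in tokens if token in vocabulary]


def clean(title):
    """Run the three cleaning steps in order."""
    tokens = preprocess(title)
    tokens = correct_spelling(tokens, VOCABULARY)
    return remove_tokens(tokens, VOCABULARY)


cleaned = clean("Sr. Sofware Enginer (London)!!")
```

In this example the acronym is expanded, the two misspellings are corrected against the vocabulary, and "london" is dropped because it is not a vocabulary word, leaving the tokens that go on to fingerprinting.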

After completing the steps given above, we end up with a list of features for a role title. Instead of establishing a one-to-one mapping, we have created a structure that captures the information available in the role title. Using this clean data, we can move to a structure where extracted entities are points in a vector space in which we can infer relationships between them, getting us closer to achieving the “understanding” that we are looking for.
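A possible shape for the resulting feature structure is sketched below, together with a dictionary-based seniority lookup. The class name `RoleTitleFeatures`, its fields, and the list of seniority terms are hypothetical; they only illustrate how the features named in this post could be kept side by side.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical seniority dictionary for English titles.
SENIORITY_TERMS = {"junior", "senior", "lead", "principal", "head", "chief"}


@dataclass
class RoleTitleFeatures:
    """Features extracted from one raw role title."""
    raw_title: str
    disambiguated_title: str
    seniority: List[str] = field(default_factory=list)
    phrases: List[str] = field(default_factory=list)


def detect_seniority(tokens):
    """Return the tokens that appear in the seniority dictionary."""
    return [token for token in tokens if token in SENIORITY_TERMS]


def extract_features(raw_title, tokens, disambiguated_title, phrases):
    """Bundle the outputs of the individual steps into one record."""
    return RoleTitleFeatures(
        raw_title=raw_title,
        disambiguated_title=disambiguated_title,
        seniority=detect_seniority(tokens),
        phrases=list(phrases),
    )


features = extract_features(
    raw_title="Sr. Data Scientist - NLP",
    tokens=["senior", "data", "scientist"],
    disambiguated_title="senior data scientist",
    phrases=["data scientist"],
)
```

Keeping the features as separate fields lets a downstream consumer use the seniority, the phrases, or the disambiguated title on their own.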

Evaluation

Conceiving and creating such a process is valuable; however, adoption and consistent value creation depend on proving value and improvement. A process with many rules of varying complexity requires a lot of care, and identifying edge cases and failures is very important. The stakeholders need to acknowledge that this is an iterative process: the failed edge cases feed into what we learn, and over time the output quality increases.

For this reason, we have created an internal evaluation UI. This allowed us to recruit a group of testers and ask them to go through a set of role titles and check the outcome of the modules at each step. The feedback exposes the potential shortcomings of the process but, equally importantly, gives us a ground truth set for quantitative testing. This way, we can measure the performance of individual modules every time we release a new version.
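Measuring a module against such a ground truth set can be as simple as the sketch below. The `evaluate` helper, the exact-match criterion, and the stand-in module are assumptions for illustration; the actual metrics used internally are not described in this post.

```python
def evaluate(module, ground_truth):
    """Share of ground-truth examples for which the module output matches exactly.

    `ground_truth` is a list of (raw_input, expected_output) pairs, for example
    collected from tester feedback.
    """
    if not ground_truth:
        raise ValueError("ground truth set is empty")
    hits = sum(1 for raw, expected in ground_truth if module(raw) == expected)
    return hits / len(ground_truth)


def toy_preprocess(title):
    """Stand-in module: lowercases and splits, but does not expand acronyms."""
    return title.lower().split()


# Toy ground truth, purely illustrative.
ground_truth = [
    ("Data Scientist", ["data", "scientist"]),
    ("Senior  Engineer", ["senior", "engineer"]),
    ("VP Sales", ["vice", "president", "sales"]),  # the stand-in module fails here
]
accuracy = evaluate(toy_preprocess, ground_truth)
```

Re-running the same check on every release gives a per-module number to compare across versions, and the failing examples point at the edge cases to work on next.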

Further Work

We are aware that rule-based modules cannot always capture the complexity of human-level tasks. However, the current performance gives us a competent baseline to beat. Depending on the criticality of the tasks and the performance expectations we will prioritize the improvement efforts.

A possible addition to the process is the detection of different entities such as locations, company names, and software or technology names. Currently, we choose to discard location and company names from the role title vocabulary. However, with enough training data, we should be able to train a performant named entity recognition model that can recognize seniority terms as well.

The spelling correction module depends on a select dictionary of words and phrases. However, we could get better results by leveraging a multilingual dataset of spelling mistakes. If we fail to assign the correct language to a role title, we end up wrongly “correcting” foreign-language words, because they are not present in the vocabulary.

Another important improvement area is phrase detection. We have started with a baseline model to score the phrases; however, language modeling is one of the most active research areas, so there is room to improve. As long as we have a large enough dataset of role titles, we can let deep networks learn the grammar dictating the structure of the role title and the semantic world behind it. Yet the context and progression of role titles can be harder to capture.

Conclusion

This is our response to the diversity and noise in the role title space. Iterative improvement is at the heart of this process, and such an effort needs time to mature. This is one of the earlier steps in the data journey toward building a platform on which to base future efforts. It certainly helps to contextualize the data problem within the problems of the business in order to keep it prioritized and supported. In our experience, the business strongly supports disambiguation efforts as long as the context and the nature of the solution are well communicated.

We capture the gist of the solution under the umbrella term “disambiguation”; however, each module responds to a different problem. It is important that every module gets enough attention in evaluation and improvement. In the end, a chain is only as strong as its weakest link.

We hope that our story in creating a disambiguation process can inspire you to address similar problems. In the series that follows we will continue with posts regarding our progress and provide in-depth information on the individual modules.

Translated from: https://medium.com/hacking-talent/role-titles-standardization-an-overview-160306db32d0
