如何建立搜索引擎_如何建立搜尋引擎

如何建立搜索引擎

This article outlines one of the most important search algorithms used today and demonstrates how to implement it in Python in just a few lines of code.

本文概述了當今使用的最重要的搜索算法之一,并演示了如何僅用幾行代碼就可以在Python中實現它。

搜索的價值 (The value of search)

The ability to search data is something we take for granted. Modern search engines are now so sophisticated that most of our searches ‘just work’. In fact, we often only notice a search on a website or app when it does not perform. Our expectations in this space have never been higher.

搜索數據的能力是我們理所當然的。 現在,現代搜索引擎非常復雜,以至于我們的大多數搜索“都行得通”。 實際上,我們通常只會在網站或應用無法執行搜索時才會注意到該搜索。 我們對這個領域的期望從未如此高。

The intelligence of search engines has been increasing for a very simple reason, the value that an effective search tool can bring to a business is enormous; a key piece of intellectual property. Often a search bar is the main interface between customers and the business. A good search engine can, therefore, create a competitive advantage by delivering an improved user experience.

搜索引擎的智能一直在增加,原因很簡單,有效的搜索工具可以為企業帶來巨大的價值。 關鍵的知識產權。 搜索欄通常是客戶和企業之間的主要界面。 因此,好的搜索引擎可以通過提供改進的用戶體驗來創造競爭優勢。

MckKinsey estimated that this value, aggregated globally, amounted to $780Bn a year in 2009. This would put the value of each search performed at $0.50[1]. Of course, this value has no doubt increased substantially since 2009…

麥肯錫(MckKinsey)估計,2009年,全球每年的價值總計為7800億美元。這使得每次搜索的價值為0.50美元[1]。 當然,自2009年以來,這一價值無疑已大幅提高。

With this in mind, you would be forgiven that creating a modern search engine would be out of reach of most development teams, requiring huge resources and complex algorithms. However, somewhat surprisingly, a large number of enterprise business search engines are actually powered by very simple and intuitive rules which can be easily implemented using open source software.

考慮到這一點,您會原諒創建現代搜索引擎對于大多數開發團隊來說是遙不可及的,這需要大量的資源和復雜的算法。 但是,令人驚訝的是,實際上,許多企業業務搜索引擎由非常簡單直觀的規則提供支持,可以使用開源軟件輕松實現這些規則。

For example, Uber, Udemy Slack and Shopify (along with 3,000 other business and organisations [2]) all use Elasticsearch. This search engine was powered by incredibly simple term-frequency, inverse document frequency (or tf-idf) word scores up until 2016.[3] (For more details on what this is, I have written about tf-idf here and here).

例如,Uber,Udemy Slack和Shopify(以及3,000個其他企業和組織[2])都使用Elasticsearch。 該搜索引擎由令人難以置信的簡單詞頻,反文檔頻度 (或tf-idf)單詞得分提供支持,直到2016年。[3] (有關這是什么的更多詳細信息,我在這里和這里已經寫過關于tf-idf的信息 )。

After this point, it switched to the more sophisticated (but still very simple) BM25 which is still used today. This is also the algorithm implemented within Azure Cognitive Search[4].

此后,它切換到了今天仍在使用的更復雜(但仍然非常簡單)的BM25。 這也是Azure認知搜索中實現的算法[4]。

BM25:您從未聽說過的最重要的算法 (BM25: the most important algorithm you have never heard of)

So what is BM25? It stands for ‘Best match 25’ (the other 24 attempts were clearly not very successful). It was released in 1994 at the third Text Retrieval Conference, yes there really was a conference dedicated to text retrieval…

那么什么是BM25? 它代表“最佳比賽25”(其他24次嘗試顯然不是很成功)。 它于1994年在第三次文本檢索會議上發布 ,是的,確實有一個專門針對文本檢索的會議…

It is probably best thought of as tf-idf ‘on steroids’, implementing two key refinements:

最好將其視為“類固醇上的tf-idf”,它實現了兩個關鍵改進:

  • Term frequency saturation. BM25 provides diminishing returns for the number of terms matched against documents. This is fairly intuitive, if you looking to search for a specific term which is very common in documents then there should become a point where the number of occurrences of this term become less useful to the search.

    術語頻率飽和 。 BM25提供與文檔匹配的條款數量遞減的回報。 這是非常直觀的,如果您要搜索在文檔中非常常見的特定術語,那么應該出現一個問題,即該術語的出現次數對搜索沒有太大用處。

  • Document length. BM25 considers document length in the matching process. Again, this is intuitive; if a shorter article contains the same number of terms that match as a longer article, then the shorter article is likely to be more relevant.

    文件長度。 BM25在匹配過程中考慮文檔長度。 再次,這很直觀; 如果較短的文章包含與較長的文章匹配的相同數量的術語,則較短的文章可能更相關。

These refinements also introduce two hyper-parameters to adjust the impact of these items on the ranking function. ‘k’ to tune the impact of term saturation and ‘b’ to tune document length.

這些改進還引入了兩個超參數,以調整這些項目對排名功能的影響。 “ k”用于調整術語飽和度的影響,“ b”用于調整文檔長度。

Bringing this all together, BM25 is calculated as:

綜上所述,BM25的計算公式為:

Image for post
The BM25 algorithm simplified. Source: Author
簡化了BM25算法。 資料來源:作者

實施BM25,一個可行的例子 (Implementing BM25, a worked example)

Implementing BM25 is incredibly simple. Thanks to the rank-bm25 Python library this can be achieved in a handful of lines of code.

實施BM25非常簡單。 多虧了rank-bm25 Python庫,這可以用幾行代碼來實現。

In our example, we are going to create a search engine to query contract notices that have been published by UK public sector organisations.

在我們的示例中,我們將創建一個搜索引擎來查詢由英國公共部門組織發布的合同通知。

Our starting point is a dateset which contains the title of a contract notice, the description along with the link to the notice itself. To keep things simple, we have combined the title and description together to create the ‘text’ column in the dataset. It is this column that we will use to search. There are 50,000 documents which we want to search across:

我們的出發點是一個日期集,其中包含合同通知書的標題,說明以及通知書本身的鏈接。 為了簡單起見,我們將標題和描述結合在一起以在數據集中創建“文本”列。 我們將使用此列進行搜索。 我們要搜索50,000個文檔:

Image for post
The format of the data to be used in the search engine (first two rows out of 50,000). Source: Author
搜索引擎中要使用的數據格式(50,000中的前兩行)。 資料來源:作者

A link to the data and the code can be found at the bottom of this article.

可以在本文底部找到數據和代碼的鏈接。

The first step in this exercise is to extract all the words within the ‘text’ column of this dataset to create a ‘list of lists’ consisting of each document and the words within them. This is known as tokenization and can be handled by the excellent spaCy library:

此練習的第一步是提取此數據集的“文本”列中的所有單詞,以創建一個“列表列表”,其中包括每個文檔及其中的單詞。 這被稱為標記化,可以通過出色的spaCy庫進行處理:

import spacy
from rank_bm25 import BM25Okapi
from tqdm import tqdm
nlp = spacy.load("en_core_web_sm")text_list = df.text.str.lower().values
tok_text=[] # for our tokenised corpus#Tokenising using SpaCy:
for doc in tqdm(nlp.pipe(text_list, disable=["tagger", "parser","ner"])):
tok = [t.text for t in doc if t.is_alpha]
tok_text.append(tok)

Building a BM25 index can be done in a single line of code:

建立BM25索引可以用單行代碼完成:

bm25 = BM25Okapi(tok_text)

Querying this index just requires a search input which has also been tokenized:

查詢該索引僅需要搜索輸入,該輸入也已被標記化:

query = "Flood Defence"tokenized_query = query.lower().split(" ")import timet0 = time.time()
results = bm25.get_top_n(tokenized_query, df.text.values, n=3)
t1 = time.time()print(f'Searched 50,000 records in {round(t1-t0,3) } seconds \n')for i in results:
print(i)

This returns the following top 3 results that are clearly highly relevant to the search query of ‘Flood Defence’:

這將返回以下與搜索“防洪”高度相關的前3個結果:

Searched 50,000 records in 0.061 seconds:Forge Island Flood Defence and Public Realm Works Award of Flood defence and public realm works along the canal embankment at Forge Island, Market Street, Rotherham as part of the Rotherham Renaissance Flood Alleviation Scheme. Flood defence maintenance works for Lewisham and Southwark College **AWARD** Following RfQ NCG contracted with T Gunning for Flood defence maintenance works for Lewisham and Southwark College Freckleton St Byrom Street River Walls Freckleton St Byrom Street River Walls, Strengthening of existing river wall parapets to provide flood defence measures

We could fine-tune the values for ‘k’ and ‘b’ based on the expected preferences of the users performing the searches, however the defaults of k=1.5 and b=0.75 seem to work well here.

我們可以根據執行搜索的用戶的期望偏好來微調“ k”和“ b”的值,但是默認設置k = 1.5和b = 0.75在這里似乎很好。

在結束時 (In closing)

Hopefully this example highlights how simple it is to implement a robust full text search in Python. This could easily be used to power a simple web app or a smart document search tool. There is also significant scope to improve the performance of this further , which may form the topic of a future post!

希望該示例突出顯示在Python中實現健壯的全文本搜索有多么簡單。 這可以輕松地用于驅動簡單的Web應用程序或智能文檔搜索工具。 還有很大的空間可以進一步改善此性能,這可能會成為將來帖子的主題!

The code and data to this article can be found in this Colab notebook.

可在此Colab筆記本中找到本文的代碼和數據。

[1] McKinsey review of web search: https://www.mckinsey.com/~/media/mckinsey/dotcom/client_service/High%20Tech/PDFs/Impact_of_Internet_technologies_search_final2.aspx

[1]麥肯錫網絡搜索評論: https : //www.mckinsey.com/~/media/mckinsey/dotcom/client_service/High%20Tech/PDFs/Impact_of_Internet_technologies_search_final2.aspx

[2] According to StackShare, August 2020.

[2] 根據StackShare,2020年8月。

[3] BM25 became the default search from Elasticsearch v5.0 onward: https://www.elastic.co/blog/elasticsearch-5-0-0-released

[3] BM25從Elasticsearch v5.0開始成為默認搜索: https : //www.elastic.co/blog/elasticsearch-5-0-0-released

[4] https://docs.microsoft.com/en-us/azure/search/index-ranking-similarity

[4] https://docs.microsoft.com/zh-CN/azure/search/index-ranking-similarity

翻譯自: https://towardsdatascience.com/how-to-build-a-search-engine-9f8ffa405eac

如何建立搜索引擎

本文來自互聯網用戶投稿,該文觀點僅代表作者本人,不代表本站立場。本站僅提供信息存儲空間服務,不擁有所有權,不承擔相關法律責任。
如若轉載,請注明出處:http://www.pswp.cn/news/390664.shtml
繁體地址,請注明出處:http://hk.pswp.cn/news/390664.shtml
英文地址,請注明出處:http://en.pswp.cn/news/390664.shtml

如若內容造成侵權/違法違規/事實不符,請聯系多彩編程網進行投訴反饋email:809451989@qq.com,一經查實,立即刪除!

相關文章

用Docker自動構建紙殼CMS

紙殼CMS可以運行在Docker上,接下來看看如何自動構建紙殼CMS的Docker Image。我們希望的是在代碼提交到GitHub以后,容器鏡像服務可以自動構建Docker Image,構建好以后,就可以直接拿這個Docker Image來運行了。 Dockerfile 最重要的…

Linux學習筆記15—RPM包的安裝OR源碼包的安裝

RPM安裝命令1、 安裝一個rpm包rpm –ivh 包名“-i” : 安裝的意思“-v” : 可視化“-h” : 顯示安裝進度另外在安裝一個rpm包時常用的附帶參數有:--force : 強制安裝,即使覆蓋屬于其他包的文件也要安裝--nodeps : 當要安裝的rpm包依賴其他包時&#xff0…

leetcode 518. 零錢兌換 II

給定不同面額的硬幣和一個總金額。寫出函數來計算可以湊成總金額的硬幣組合數。假設每一種面額的硬幣有無限個。 示例 1: 輸入: amount 5, coins [1, 2, 5] 輸出: 4 解釋: 有四種方式可以湊成總金額: 55 5221 52111 511111 示例 2: 輸入: amount 3, coins [2] 輸出: 0 解…

軟件測試中什么是正交實驗法_軟件工程中的正交性

軟件測試中什么是正交實驗法正交性 (Orthogonality) In software engineering, a system is considered orthogonal if changing one of its components changes the state of that component only. 在軟件工程中,如果更改系統的組件之一僅更改該組件的狀態&#xf…

leetcode 279. 完全平方數(dp)

題目一 給定正整數 n,找到若干個完全平方數(比如 1, 4, 9, 16, …)使得它們的和等于 n。你需要讓組成和的完全平方數的個數最少。 給你一個整數 n ,返回和為 n 的完全平方數的 最少數量 。 完全平方數 是一個整數,其…

github代碼_GitHub啟動代碼空間

github代碼Codespaces works like a virtual Integrated Development Environment (IDE) on the cloud.代碼空間的工作方式類似于云上的虛擬集成開發環境(IDE)。 Until now, you had to make a pull request to contribute to a project. This required setting up the enviro…

php變量

什么叫變量&#xff1f; 變量可以通過變量名訪問。在指令式語言中&#xff0c;變量通常是可變的&#xff1b; 這里就先這么簡單理解&#xff0c;通過對語言的研究會更加的理解變量的其他意義。 在PHP中變量是用于存儲信息的"容器"&#xff1a; <?php $x5; $y6;…

js將base64做UrlEncode轉碼

使用 encodeURIComponent() 其詳細介紹 https://developer.mozilla.org/zh-CN/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent var base64 &#xff08;base64值&#xff09;encodeURIComponent(base64 ) //轉化 轉載于:https://www.cnblogs.com/xll-qg/p…

引用自己創建的css樣式表_如何使用CSS創建聯系表

引用自己創建的css樣式表First we create the HTML elements - input fields for First Name, Last Name, Email and a Text Area for the message.首先&#xff0c;我們創建HTML元素-名字&#xff0c;姓氏&#xff0c;電子郵件和消息的文本區域的輸入字段。 Later we apply C…

leetcode 1449. 數位成本和為目標值的最大數字(dp)

這是我參與更文挑戰的第12天 &#xff0c;活動詳情查看更文挑戰 題目 給你一個整數數組 cost 和一個整數 target 。請你返回滿足如下規則可以得到的 最大 整數&#xff1a; 給當前結果添加一個數位&#xff08;i 1&#xff09;的成本為 cost[i] &#xff08;cost 數組下標…

風能matlab仿真_風能產量預測—深度學習項目

風能matlab仿真DL DATATHON- AI4ImpactDL DATATHON- AI4影響 Published by Team AI Traders — Suyash Lohia, Nguyen Khoi Phan, Nikunj Taneja, Naman Agarwal and Mihir GuptaAI交易員團隊發布 -Suyash Lohia&#xff0c;Nguyen Khoi Phan&#xff0c;Nikonj Taneja&#x…

android JNI調用(Android Studio 3.0.1)(轉)

最近回頭復習了一下android 的jni調用&#xff0c;卻發現按以前的方法調用失敗&#xff0c;一怒之下就重新摸索&#xff0c;碰了幾次壁&#xff0c;發現網上好多教程都不能成功調用&#xff0c;于是記錄一下現在AS版本成功好用的調用方法。 這里設定你的ndk已經下載并且設置沒問…

安卓源碼 代號,標簽和內部版本號

SetupSecurityPortingTuningCompatibilityReference轉到源代碼Getting Started OverviewCodelines, Branches, and ReleasesCodenames, Tags, and Build NumbersProject RolesBrand GuidelinesLicensesFAQSite UpdatesDownloading and Building RequirementsEstablishing a Bui…

git 列出標簽_Git標簽介紹:如何在Git中列出,創建,刪除和顯示標簽

git 列出標簽Tagging lets developers mark important checkpoints in the course of their projects development. For instance, software release versions can be tagged. (Ex: v1.3.2) It essentially allows you to give a commit a special name(tag).通過標記&#xff…

leetcode 278. 第一個錯誤的版本(二分)

題目 你是產品經理&#xff0c;目前正在帶領一個團隊開發新的產品。不幸的是&#xff0c;你的產品的最新版本沒有通過質量檢測。由于每個版本都是基于之前的版本開發的&#xff0c;所以錯誤的版本之后的所有版本都是錯的。 假設你有 n 個版本 [1, 2, …, n]&#xff0c;你想找…

騰訊哈勃_用Python的黑客統計資料重新審視哈勃定律

騰訊哈勃Simple OLS Regression, Pairs Bootstrap Resampling, and Hypothesis Testing to observe the effect of Hubble’s Law in Python.通過簡單的OLS回歸&#xff0c;配對Bootstrap重采樣和假設檢驗來觀察哈勃定律在Python中的效果。 In this post, we will revisit Hub…

JAVA中動態編譯的簡單使用

一、引用庫 pom文件中申明如下&#xff1a; <dependencies><!-- https://mvnrepository.com/artifact/junit/junit --><dependency><groupId>junit</groupId><artifactId>junit</artifactId><version>4.12</version><…

程序員實用小程序_我從閱讀《實用程序員》中學到了什么

程序員實用小程序In short: old but gold.簡而言之&#xff1a;古老而又黃金。 Published in 1999, The Pragmatic Programmer is a book about how to become a Pragmatic Programmer. Which really means a ‘Good Programmer’. 《實用程序員》于1999年出版&#xff0c;是一…

leetcode 5786. 可移除字符的最大數目(二分法)

題目 給你兩個字符串 s 和 p &#xff0c;其中 p 是 s 的一個 子序列 。同時&#xff0c;給你一個元素 互不相同 且下標 從 0 開始 計數的整數數組 removable &#xff0c;該數組是 s 中下標的一個子集&#xff08;s 的下標也 從 0 開始 計數&#xff09;。 請你找出一個整數…

如何使用Picterra的地理空間平臺分析衛星圖像

From April-May 2020, Sentinel-Hub had organized the third edition of their custom script competition. The competition was organized in collaboration with the Copernicus EU Earth Observation programme, the European Space Agency and AI4EO consortium.從2020年…