[C++ Project in Practice] A Boost Search Engine Based on Forward and Inverted Indexes (Part 1)

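1. Project background and goals

The Boost website has no built-in search or navigation feature, so this project adds search over the Boost documentation.
Site search: the data being searched is more vertical (limited to one site) and the data volume is small.
The result is similar to the search on cplusplus.com.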

2. High-level principles of a search engine

[Figure: high-level workflow of the search engine]

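3. Tech stack and project environment

Tech stack: C/C++, C++11, STL, the quasi-standard library Boost (for file operations), jsoncpp (data exchange between the client side and the data side), cppjieba (splitting search keywords into terms), cpp-httplib (building the HTTP server)
Other technologies (front end): HTML5, CSS, JS, jQuery, Ajax
Project environment: CentOS 7 cloud server, vim/gcc (g++)/Makefile, VS2019/VS Code (web pages)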

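4. Forward index and inverted index

Forward index: maps a document ID to the document's content (the keywords inside the document).
A forward index resembles a book's table of contents: given a page number, we can find the corresponding content.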
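
As a minimal sketch of this idea, a forward index can be a vector in which a document's position serves as its ID. The names `ForwardIndex`, `AddDocument`, and `GetDocument` are hypothetical and chosen only for illustration; `DocInfo` mirrors the struct the parser uses in section 5.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One parsed document; its position in the vector serves as its ID.
struct DocInfo {
    std::string title;
    std::string content;
    std::string url;
};

// Hypothetical forward index: document ID -> document content.
class ForwardIndex {
public:
    // Stores a document and returns the ID assigned to it.
    std::uint64_t AddDocument(DocInfo doc) {
        docs_.push_back(std::move(doc));
        return docs_.size() - 1;
    }

    // Returns the document for an ID, or nullptr if the ID is out of range.
    const DocInfo* GetDocument(std::uint64_t doc_id) const {
        if (doc_id >= docs_.size()) return nullptr;
        return &docs_[doc_id];
    }

private:
    std::vector<DocInfo> docs_;
};
```

Using the vector position as the ID keeps lookups O(1) and avoids a separate ID-to-document map.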

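Tokenizing the target documents: the purpose is to make it easier to build the inverted index and to search.
Stop words: words such as 了, 嗎, 的, the, and a can generally be ignored during tokenization.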
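
The sketch below illustrates tokenization with stop-word filtering. It is not cppjieba: a simple ASCII-only splitter stands in for the real segmenter, so Chinese text is not handled. The function name `Tokenize` and the stop-word set passed to it are assumptions made for this example.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Splits ASCII text into lowercase words and drops stop words.
// A stand-in for a real segmenter such as cppjieba.
std::vector<std::string> Tokenize(
    const std::string& text,
    const std::unordered_set<std::string>& stop_words) {
    std::vector<std::string> tokens;
    std::string word;
    // Emits the current word unless it is empty or a stop word.
    auto flush = [&]() {
        if (!word.empty() && stop_words.count(word) == 0) {
            tokens.push_back(word);
        }
        word.clear();
    };
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            word.push_back(static_cast<char>(std::tolower(ch)));
        } else {
            flush();  // any non-alphanumeric byte ends the current word
        }
    }
    flush();
    return tokens;
}
```

Dropping stop words at this stage keeps very common words out of the inverted index, where they would match almost every document.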

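Inverted index: tokenize the document content, collect the distinct keywords, and associate each keyword with the IDs of the documents that contain it.
Within each keyword's list, the document IDs are sorted by weight.
An inverted index is the opposite of a forward index: given a piece of content, we can find which files it appears in and thereby locate those files.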
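
A minimal sketch of such a structure follows. The names `InvertedElem`, `InvertedIndex`, `Add`, `Finalize`, and `Lookup` are hypothetical, and the weight is supplied by the caller because this part of the article does not define how it is computed.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One posting: which document contains the keyword and how relevant it is.
struct InvertedElem {
    std::uint64_t doc_id;
    int weight;  // assumed to be computed elsewhere
};

// Hypothetical inverted index: keyword -> postings sorted by descending weight.
class InvertedIndex {
public:
    // Records that `word` occurs in document `doc_id` with the given weight.
    void Add(const std::string& word, std::uint64_t doc_id, int weight) {
        postings_[word].push_back({doc_id, weight});
    }

    // Sorts every posting list once all documents have been added.
    void Finalize() {
        for (auto& entry : postings_) {
            std::sort(entry.second.begin(), entry.second.end(),
                      [](const InvertedElem& a, const InvertedElem& b) {
                          return a.weight > b.weight;
                      });
        }
    }

    // Returns the posting list for a keyword, or nullptr if it is unknown.
    const std::vector<InvertedElem>* Lookup(const std::string& word) const {
        auto it = postings_.find(word);
        return it == postings_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, std::vector<InvertedElem>> postings_;
};
```

Sorting once at build time means a query can return the posting list as it is, with the highest-weight documents first.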

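Simulating a search
User input:
keyword -> look up in the inverted index -> extract the document IDs (x, y, z, ...) -> look up in the forward index -> get each document's content -> build a summary from title + content (desc) + url -> build the response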
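
The flow above can be sketched in a few lines. This is a simplification under stated assumptions: the inverted index is a plain map from keyword to document IDs that are already ordered by weight, the forward index is a vector, and the summary is just the first `desc_len` bytes of the content. The names `Doc`, `Result`, and `Search` are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct Doc {
    std::string title;
    std::string content;
    std::string url;
};

struct Result {
    std::string title;
    std::string desc;  // short summary of the content
    std::string url;
};

// forward:  document ID (vector position) -> document
// inverted: keyword -> document IDs, assumed already ordered by weight
std::vector<Result> Search(
    const std::string& keyword,
    const std::unordered_map<std::string, std::vector<std::size_t>>& inverted,
    const std::vector<Doc>& forward,
    std::size_t desc_len = 50) {
    std::vector<Result> results;
    auto it = inverted.find(keyword);
    if (it == inverted.end()) return results;  // keyword not indexed

    for (std::size_t doc_id : it->second) {
        if (doc_id >= forward.size()) continue;  // skip stale IDs
        const Doc& doc = forward[doc_id];
        // Simplified summary: a prefix of the content.
        results.push_back({doc.title, doc.content.substr(0, desc_len), doc.url});
    }
    return results;
}
```

A real summary would more likely be cut around the position where the keyword occurs; the prefix is used here only to keep the example short.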

5. Writing the Parser module for tag removal and data cleaning


Boost official website: https://www.boost.org/
// For now we only need the HTML files under boost_1_78_0/doc/html, and we use them to build the index

Removing tags

[@VM-0-3-centos boost_searcher]$ touch parser.cc
// raw data -> data with the tags removed
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html> <!-- this is a tag -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Chapter 30. Boost.Process</title>
<link rel="stylesheet" href="../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="index.html" title="The Boost C++ Libraries BoostBook
Documentation Subset">
<link rel="up" href="libraries.html" title="Part I. The Boost C++ Libraries
(BoostBook Subset)">
<link rel="prev" href="poly_collection/acknowledgments.html" title="Acknowledgments">
<link rel="next" href="boost_process/concepts.html" title="Concepts">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table cellpadding="2" width="100%"><tr>
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../boost.png"></td>
<td align="center"><a href="../../index.html">Home</a></td>
<td align="center"><a href="../../libs/libraries.htm">Libraries</a></td>
<td align="center"><a href="http://www.boost.org/users/people.html">People</a>
</td>
<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
<td align="center"><a href="../../more/index.htm">More</a></td>
</tr></table>
.........
// <> : HTML tags. They carry no value for search, so they must be removed. Tags generally come in pairs!
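
One common way to remove tags is a two-state scan over the characters, sketched below. The function name `StripTags` is hypothetical, and the sketch makes simplifying assumptions: it does not handle a `>` inside attribute values, HTML entities, or the bodies of script and style elements.

```cpp
#include <string>

// Removes everything between '<' and '>' and turns newlines into spaces,
// so that one document's text can later be stored on a single line.
std::string StripTags(const std::string& html) {
    enum class State { kText, kTag };
    State state = State::kText;
    std::string out;
    for (char c : html) {
        if (state == State::kTag) {
            if (c == '>') state = State::kText;  // the tag ends here
        } else if (c == '<') {
            state = State::kTag;                 // a tag starts here
        } else {
            out.push_back(c == '\n' ? ' ' : c);  // keep visible text only
        }
    }
    return out;
}
```

Replacing newlines during the same pass keeps each cleaned document on one line, which matters once `\n` is used as the record separator.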
[@VM-0-3-centos data]$ mkdir raw_html
[@VM-0-3-centos data]$ ll
total 20
drwxrwxr-x 60 16384 Mar 24 16:49 input // the original HTML documents go here
drwxrwxr-x 2 4096 Mar 24 16:56 raw_html // the clean documents with the tags removed go here
[@VM-0-3-centos input]$ ls -Rl | grep -E '*.html' | wc -l
8141
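Goal: strip the tags from every document and write them all into the same file. A document's content must not contain any \n. Documents are separated from each other by \3.
version1:
Like this: XXXXXXXXXXXXXXXXX\3YYYYYYYYYYYYYYYYYYYYY\3ZZZZZZZZZZZZZZZZZZZZZZZZZ\3
We adopt the following scheme instead:
version2: when writing to the file, make sure the data is also easy to read back the next time!
Like this: title\3content\3url \n title\3content\3url \n title\3content\3url \n ...
This lets us call getline(ifstream, line) to fetch one document's entire content directly: title\3content\3url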
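
The version2 layout can be sketched as a pair of helper functions. `Serialize` and `Deserialize` are hypothetical names, and a string stream stands in for the file so the example is self-contained; the same `std::getline` loop works on a `std::ifstream`.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct DocInfo {
    std::string title;
    std::string content;
    std::string url;
};

const char kSep = '\3';  // field separator inside one record

// Writes one document per line: title\3content\3url\n
std::string Serialize(const std::vector<DocInfo>& docs) {
    std::string out;
    for (const DocInfo& d : docs) {
        out += d.title;
        out += kSep;
        out += d.content;
        out += kSep;
        out += d.url;
        out += '\n';
    }
    return out;
}

// Reads the records back, one document per std::getline call.
std::vector<DocInfo> Deserialize(const std::string& data) {
    std::vector<DocInfo> docs;
    std::istringstream in(data);
    std::string line;
    while (std::getline(in, line)) {
        std::size_t first = line.find(kSep);
        if (first == std::string::npos) continue;   // malformed line
        std::size_t second = line.find(kSep, first + 1);
        if (second == std::string::npos) continue;  // malformed line
        docs.push_back({line.substr(0, first),
                        line.substr(first + 1, second - first - 1),
                        line.substr(second + 1)});
    }
    return docs;
}
```

The layout relies on the content containing neither `\n` nor `\3`, which is why newlines have to be removed during cleaning.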

Writing the parser

// Basic structure of the code:
#include <iostream>
#include <string>
#include <vector>

// A directory that holds all the HTML pages
const std::string src_path = "data/input/";
const std::string output = "data/raw_html/raw.txt";

typedef struct DocInfo{
    std::string title;   // title of the document
    std::string content; // content of the document
    std::string url;     // URL of the document on the official website
}DocInfo_t;

// const &: input
// *: output
// &: input and output
bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list);
bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results);
bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output);
int main()
{
    std::vector<std::string> files_list;
    // Step 1: recursively collect every HTML file name, with its path, into files_list so the files can be read one by one later
    if(!EnumFile(src_path, &files_list)){
        std::cerr << "enum file name error!" << std::endl;
        return 1;
    }
    // Step 2: read each file listed in files_list and parse its content
    std::vector<DocInfo_t> results;
    if(!ParseHtml(files_list, &results)){
        std::cerr << "parse html error" << std::endl;
        return 2;
    }
    // Step 3: write the parsed content of every file to output, using \3 as the separator for each document
    if(!SaveHtml(results, output)){
        std::cerr << "save html error" << std::endl;
        return 3;
    }
    return 0;
}
bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list)
{
    return true;
}
bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results)
{
    return true;
}
bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output)
{
    return true;
}
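
ParseHtml is still a stub at this point. As one possible building block for it, the sketch below extracts a document's title from the raw HTML. The helper name `ParseTitle` and the approach of searching for the literal `<title>` and `</title>` strings are assumptions made for this example, not code from the article.

```cpp
#include <cstddef>
#include <string>

// Extracts the text between <title> and </title>.
// Returns false if either tag is missing.
bool ParseTitle(const std::string& html, std::string* title) {
    const std::string open = "<title>";
    const std::string close = "</title>";

    std::size_t begin = html.find(open);
    if (begin == std::string::npos) return false;
    begin += open.size();  // move past the opening tag

    std::size_t end = html.find(close, begin);
    if (end == std::string::npos) return false;

    *title = html.substr(begin, end - begin);
    return true;
}
```

For the sample page shown earlier, this would fill `title` with `Chapter 30. Boost.Process`.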

Installing the Boost development library

[whb@VM-0-3-centos boost_searcher]$ sudo yum install -y boost-devel // the Boost development library


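Goal:
Strip the tags from every document and write them all into the same file, with each document's content occupying exactly one line. Documents are separated from each other by '\3'.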
