文檔去重(TF-IDF，MinHash, SimHash)

2個doc有些相似有些不相似，如何衡量這個相似度；

直接用Jaccard距離，計算量太大

TF-IDF:?TF*IDF

TF：該詞在該文檔中的出現次數，

IDF：該詞在所有文檔中的多少個文檔出現是DF，lg(N/(1+DF))

MinHash

代碼：

import random
from typing import List, Set, Tupleclass MinHash:def __init__(self, num_hashes: int = 100):self.num_hashes = num_hashesself.hash_functions = self._generate_hash_functions()def _generate_hash_functions(self) -> List[Tuple[int, int, int]]:"""Generate hash functions of the form (a*x + b) % c."""max_prime = 2**31 - 1  # Mersenne primereturn [(random.randint(1, max_prime), random.randint(0, max_prime), max_prime)for _ in range(self.num_hashes)]def _minhash(self, document: Set[str]) -> List[int]:"""Compute MinHash signature for a document."""signature = [float('inf')] * self.num_hashesfor word in document:for i, (a, b, c) in enumerate(self.hash_functions):hash_value = (a * hash(word) + b) % csignature[i] = min(signature[i], hash_value)return signaturedef jaccard_similarity(self, sig1: List[int], sig2: List[int]) -> float:"""Estimate Jaccard similarity from MinHash signatures."""return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / self.num_hashesdef deduplicate(self, documents: List[str], threshold: float = 0.5) -> List[str]:"""Deduplicate documents based on MinHash similarity."""# Preprocess documents into sets of wordsdoc_sets = [set(doc.lower().split()) for doc in documents]# Compute MinHash signatures for all documentssignatures = [self._minhash(doc_set) for doc_set in doc_sets]# Find unique documentsunique_docs = []for i, doc in enumerate(documents):is_duplicate = Falsefor j in range(len(unique_docs)):if self.jaccard_similarity(signatures[i], signatures[j]) >= threshold:is_duplicate = Truebreakif not is_duplicate:unique_docs.append(doc)return unique_docs

100個hash函數；

在某個hash函數上，1個doc里的所有word，在該函數上的hash值，其中最小的那個，記下來；

該doc得到100個最小hash值，該100維向量，作為其signature；

2個doc的相似度，就是100個維度里的相等數目，除以100;

SimHash

MinHash和SimHash_minhash simhash-CSDN博客

海量文本去重（允許一定的噪聲）；文檔里權重最大的前N個詞（或詞組）進行Hash編碼，1正0負乘以詞的權重，N個詞的向量按位相加，再反編碼（正1負0），得到該文檔的編碼；兩篇文檔的距離用編碼的海明距離，小于Bar（例如3）則認為二者相似；

import hashlib
from typing import List, Tupleclass SimHash:def __init__(self, hash_bits: int = 64):self.hash_bits = hash_bitsdef _string_hash(self, text: str) -> int:"""Create a hash for a given string."""return int(hashlib.md5(text.encode('utf-8')).hexdigest(), 16)def _create_shingles(self, text: str, k: int = 2) -> List[str]:"""Create k-shingles from the text."""return [text[i:i+k] for i in range(len(text) - k + 1)]def _compute_simhash(self, features: List[str]) -> int:"""Compute the SimHash of a list of features."""v = [0] * self.hash_bitsfor feature in features:feature_hash = self._string_hash(feature)for i in range(self.hash_bits):bitmask = 1 << iif feature_hash & bitmask:v[i] += 1else:v[i] -= 1fingerprint = 0for i in range(self.hash_bits):if v[i] >= 0:fingerprint |= 1 << ireturn fingerprintdef _hamming_distance(self, hash1: int, hash2: int) -> int:"""Compute the Hamming distance between two hashes."""xor = hash1 ^ hash2return bin(xor).count('1')def compute_similarity(self, hash1: int, hash2: int) -> float:"""Compute the similarity between two SimHashes."""distance = self._hamming_distance(hash1, hash2)return 1 - (distance / self.hash_bits)def deduplicate(self, documents: List[str], threshold: float = 0.9) -> List[Tuple[str, int]]:"""Deduplicate documents based on SimHash similarity."""unique_docs = []for doc in documents:shingles = self._create_shingles(doc.lower())doc_hash = self._compute_simhash(shingles)is_duplicate = Falsefor unique_doc, unique_hash in unique_docs:if self.compute_similarity(doc_hash, unique_hash) >= threshold:is_duplicate = Truebreakif not is_duplicate:unique_docs.append((doc, doc_hash))return unique_docs# Example usage
if __name__ == "__main__":simhash = SimHash(hash_bits=64)documents = ["The quick brown fox jumps over the lazy dog","The quick brown fox jumps over the sleeping dog","The lazy dog is sleeping","A completely different document"]unique_docs = simhash.deduplicate(documents, threshold=0.7)print("Original documents:")for doc in documents:print(f"- {doc}")print("\nUnique documents:")for doc, _ in unique_docs:print(f"- {doc}")

本文來自互聯網用戶投稿，該文觀點僅代表作者本人，不代表本站立場。本站僅提供信息存儲空間服務，不擁有所有權，不承擔相關法律責任。
如若轉載，請注明出處：http://www.pswp.cn/web/44297.shtml
繁體地址，請注明出處：http://hk.pswp.cn/web/44297.shtml
英文地址，請注明出處：http://en.pswp.cn/web/44297.shtml

如若內容造成侵權/違法違規/事實不符，請聯系多彩編程網進行投訴反饋email:809451989@qq.com，一經查實，立即刪除！