Using Cosine Similarity to Find Plagiarized Articles in a Large Collection

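My previous two articles covered, first, how to use cosine similarity to measure how similar two articles are and so decide whether one plagiarizes the other, and second, the principle behind cosine similarity, that is, how it actually ranks the similarity of texts. This article looks at how to find pairs of highly similar articles when the articles are long and there are many of them. At that scale the risk of running out of memory has to be considered, so I reworked the code from the first article. The changes reduce that risk to some extent.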

pom dependency

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-math3</artifactId>
    <version>3.6.1</version>
</dependency>

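This differs slightly from the first article, which used the hankcs package to implement the cosine similarity algorithm. This article uses the math3 package instead, but the principle is the same.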
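For reference, the quantity that both approaches compute can be written out as below. This is the standard definition of cosine similarity for two word-frequency vectors A and B of length n (one entry per vocabulary word). The small numeric vectors in the worked line are made up purely for illustration.

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]

% Worked example with A = (1, 2, 0) and B = (2, 1, 1):
\[
\cos\theta
  = \frac{1\cdot 2 + 2\cdot 1 + 0\cdot 1}{\sqrt{1+4+0}\;\sqrt{4+1+1}}
  = \frac{4}{\sqrt{30}} \approx 0.73
\]
```

A value of about 0.73 would fall below the 0.9 threshold used in the code in this article, so such a pair would not be reported.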

The code is as follows:

package com.lsl.config;

import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.RealVector;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.*;
import java.util.stream.Stream;

public class PlagiarismDetector {

    // Compute the cosine similarity of two vectors
    public static double cosineSimilarity(RealVector vectorA, RealVector vectorB) {
        double dotProduct = vectorA.dotProduct(vectorB);
        double normA = vectorA.getNorm();
        double normB = vectorB.getNorm();
        // An all-zero vector (e.g. an empty file) would otherwise give 0/0 = NaN
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dotProduct / (normA * normB);
    }

    // Convert text into a word-frequency map
    public static Map<String, Integer> textToWordFrequency(String text) {
        Map<String, Integer> wordFrequency = new HashMap<>();
        String[] words = text.split("\\s+");
        for (String word : words) {
            wordFrequency.put(word, wordFrequency.getOrDefault(word, 0) + 1);
        }
        return wordFrequency;
    }

    // Convert a word-frequency map into a vector over the shared vocabulary
    public static RealVector wordFrequencyToVector(Map<String, Integer> wordFrequency, List<String> vocabulary) {
        double[] vector = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            vector[i] = wordFrequency.getOrDefault(vocabulary.get(i), 0);
        }
        return new ArrayRealVector(vector);
    }

    // Read a file's content line by line
    public static String readFile(String filePath) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = br.readLine()) != null) {
                content.append(line).append("\n");
            }
        }
        return content.toString();
    }

    // Build the vocabulary incrementally, one file at a time
    public static List<String> buildVocabulary(Path papersDir) throws IOException {
        Set<String> vocabulary = new HashSet<>();
        // try-with-resources closes the directory stream when done
        try (Stream<Path> paths = Files.list(papersDir)) {
            paths.forEach(path -> {
                try {
                    String content = readFile(path.toString());
                    String[] words = content.split("\\s+");
                    vocabulary.addAll(Arrays.asList(words));
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
        return new ArrayList<>(vocabulary);
    }

    // Entry point
    public static void main(String[] args) throws IOException {
        // Directory containing the papers
        Path papersDir = Paths.get("D:\\codeabc");

        // Build the vocabulary
        List<String> vocabulary = buildVocabulary(papersDir);

        // Word-frequency vector of each paper
        List<RealVector> vectors = new ArrayList<>();

        // Process the papers one at a time
        try (Stream<Path> paths = Files.list(papersDir)) {
            paths.forEach(path -> {
                try {
                    String content = readFile(path.toString());
                    Map<String, Integer> wordFrequency = textToWordFrequency(content);
                    RealVector vector = wordFrequencyToVector(wordFrequency, vocabulary);
                    vectors.add(vector);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
        System.err.println("Number of articles: " + vectors.size());

        // Compare the similarity of every pair of papers
        for (int i = 0; i < vectors.size(); i++) {
            for (int j = i + 1; j < vectors.size(); j++) {
                double similarity = cosineSimilarity(vectors.get(i), vectors.get(j));
                if (similarity > 0.9) { // assume a similarity above 0.9 indicates plagiarism
                    System.out.printf("Paper %d and Paper %d are similar with cosine similarity: %.2f%n", i, j, similarity);
                }
            }
        }
    }
}

[Screenshot of the program output]
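The dense vectors built in the listing have one slot per vocabulary word, so every paper costs as much memory as the whole vocabulary. As a rough sketch of an alternative (not the method used in this article), the same cosine value can be computed directly on the word-frequency maps, treating missing words as 0. The class name SparseCosine and its methods are hypothetical names chosen for this illustration, and it uses only the JDK rather than math3.

```java
import java.util.HashMap;
import java.util.Map;

class SparseCosine {

    // Count whitespace-separated tokens, similar to textToWordFrequency in the listing
    // (empty tokens are skipped here).
    static Map<String, Integer> wordFrequency(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum);
            }
        }
        return freq;
    }

    // Euclidean norm of the counts in one map.
    static double norm(Map<String, Integer> freq) {
        double sum = 0.0;
        for (int count : freq.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity computed on the maps themselves. Words absent from a map
    // count as 0, so no vocabulary-sized array is allocated.
    static double cosineSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
        // Only words present in both maps contribute to the dot product,
        // so it is enough to walk the smaller map.
        Map<String, Integer> small = a.size() <= b.size() ? a : b;
        Map<String, Integer> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : small.entrySet()) {
            Integer other = large.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: treat an empty text as similar to nothing
        }
        return dot / (normA * normB);
    }
}
```

Calling SparseCosine.cosineSimilarity(SparseCosine.wordFrequency(textA), SparseCosine.wordFrequency(textB)) gives the value for one pair. With this approach the vocabulary pass is not needed at all, at the cost of a hash lookup per shared word.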

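Notes on the improvements

Streaming file reads:
- BufferedReader reads each file line by line rather than in a single call. Note that readFile still appends every line to a StringBuilder, so the full text of the file being processed is in memory until that file is done.
Incremental vocabulary construction:
- Files.list is used to read the papers one at a time and grow the vocabulary step by step, instead of loading the content of all papers at once.
Processing papers one at a time:
- When the word-frequency vectors are built, papers are processed one by one, so only one paper's text and word-frequency map are held at any moment. The resulting vectors are still collected in a list for the pairwise comparison.
Memory optimization:
- The vocabulary is stored in a HashSet, so repeated words do not take up extra memory.
- The word-frequency vectors are stored in an ArrayList, which keeps memory usage predictable: one vector of vocabulary size per paper.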
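To take the streaming idea in the notes above one step further, the words can be counted while the file is being read, so that only the current line and the frequency map are in memory. The following is a minimal sketch under stated assumptions: the class StreamingWordCount and the method countWords are hypothetical names, and the file is assumed to be UTF-8 encoded (the listing in this article uses FileReader with the platform default encoding).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class StreamingWordCount {

    // Count words line by line. Only the current line and the frequency map
    // are held in memory; the full file text is never assembled.
    static Map<String, Integer> countWords(Path file) throws IOException {
        Map<String, Integer> freq = new HashMap<>();
        // Assumption: the file is UTF-8 encoded.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) { // blank lines produce one empty token
                        freq.merge(word, 1, Integer::sum);
                    }
                }
            }
        }
        return freq;
    }
}
```

The returned map could be used wherever the result of textToWordFrequency is used. Because words are split on whitespace, splitting per line gives the same tokens as splitting the whole text.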

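Suggestions for further optimization

Distributed computing:
- If the data set is very large (for example 100,000 papers), consider a distributed computing framework such as Apache Spark to process the data in parallel.
External storage:
- Store the vocabulary and the word-frequency vectors on disk (in a database or in files) to avoid running out of memory.
Chunked comparison:
- Split the papers into several chunks and compare similarity chunk by chunk to reduce memory usage further.
Removing noise words:
- For example, when the articles contain source code, import statements can be removed.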
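The last suggestion can be sketched as a small pre-processing step applied to the text before the words are counted. The rule used here is an assumption made for illustration: a line counts as noise if, after trimming, it starts with "import " or "package ". The class NoiseFilter and its methods are hypothetical names, and the prefixes would need adjusting for other kinds of boilerplate.

```java
class NoiseFilter {

    // Assumed rule: a line is noise if it starts with "import " or "package "
    // once leading and trailing whitespace is removed.
    static boolean isNoiseLine(String line) {
        String trimmed = line.trim();
        return trimmed.startsWith("import ") || trimmed.startsWith("package ");
    }

    // Return the text with noise lines dropped; each kept line ends with '\n'.
    static String stripNoise(String text) {
        StringBuilder kept = new StringBuilder();
        for (String line : text.split("\\R")) { // \R matches any line break
            if (!isNoiseLine(line)) {
                kept.append(line).append('\n');
            }
        }
        return kept.toString();
    }
}
```

Running stripNoise on a file's content before the word counting keeps near-identical import blocks from inflating the similarity of otherwise unrelated source files.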
