在2G大小的文件中,找出高頻top100的單詞

將 2GB 的大文件分割為 2048 個大小為 512KB 的小文件，采用流式讀取方式處理，避免一次性加載整個文件導致內存溢出。
初始化一個長度為 2048 的哈希表數組，用于分別統計各個小文件中單詞的出現頻率。
利用多線程并行處理機制遍歷所有 2048 個小文件，對每個文件中的單詞執行哈希取模運算（int hash = Math.abs(word.hashCode() % hashTableSize)），并通過hashTables[hash].merge(word, 1, Integer::sum)的方式將單詞頻率累加到對應的哈希表中。
遍歷這 2048 個哈希表，對每個哈希表中的單詞按頻率排序，將頻率排名前 100 的單詞存入一個小頂堆。
經過上述處理后，小頂堆中最終保留的 100 個單詞即為整個文件中出現頻率最高的前 100 個單詞。

package com.example.bigfiletop100;import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;public class Top100Words {private static final int HASH_TABLE_SIZE = 2048;private static final int TOP_WORDS_COUNT = 100;public static void main(String[] args) throws IOException, InterruptedException {String largeFilePath = "C:\\Users\\亞洲\\IdeaProjects\\springboot-easyexcel-mybatisplus\\src\\main\\java\\com\\example\\bigfiletop100\\input.txt"; // 替換為大文件的路徑String tempDirPath = "C:\\Users\\亞洲\\IdeaProjects\\springboot-easyexcel-mybatisplus\\src\\main\\java\\com\\example\\bigfiletop100\\temp_files"; // 替換為臨時目錄的路徑Path tempDir = Paths.get(tempDirPath);if (!Files.exists(tempDir)) {Files.createDirectories(tempDir);}// 1. 分割大文件為512KB的小文件（改進版）splitFile(largeFilePath, tempDir.toString(), 512 * 1024);// 2. 初始化哈希表數組ConcurrentHashMap<Integer, Map<String, Integer>> hashTables = new ConcurrentHashMap<>();for (int i = 0; i < HASH_TABLE_SIZE; i++) {hashTables.put(i, new ConcurrentHashMap<>());}// 3. 并行遍歷小文件，統計單詞頻率File[] smallFiles = new File(tempDir.toString()).listFiles((dir, name) -> name.endsWith(".txt"));if (smallFiles == null || smallFiles.length == 0) {System.err.println("No small files found.");return;}ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());CountDownLatch latch = new CountDownLatch(smallFiles.length);for (File smallFile : smallFiles) {executor.submit(() -> {try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(smallFile), StandardCharsets.UTF_8))) {String line;while ((line = reader.readLine()) != null) {String[] words = line.trim().split("\\s+");for (String word : words) {if (word.isEmpty()) continue;int hash = Math.abs(word.hashCode() % HASH_TABLE_SIZE);hashTables.get(hash).merge(word, 1, Integer::sum);}}} catch (IOException e) {System.err.println("Error reading file: " + smallFile.getName());e.printStackTrace();} finally {latch.countDown();}});}executor.shutdown();latch.await(); // 等待所有任務完成// 4. 合并哈希表，并把頻率前100的單詞存入小頂堆PriorityQueue<Map.Entry<String, Integer>> minHeap = new PriorityQueue<>(TOP_WORDS_COUNT,(o1, o2) -> o1.getValue().compareTo(o2.getValue()));hashTables.values().stream().filter(Objects::nonNull).flatMap(map -> map.entrySet().stream()).forEach(entry -> {if (minHeap.size() < TOP_WORDS_COUNT) {minHeap.offer(entry);} else if (entry.getValue() > minHeap.peek().getValue()) {minHeap.poll();minHeap.offer(entry);}});// 5. 輸出結果System.out.println("Top " + TOP_WORDS_COUNT + " words:");List<Map.Entry<String, Integer>> topWords = new ArrayList<>(minHeap);topWords.sort((o1, o2) -> o2.getValue().compareTo(o1.getValue())); // 按降序排列topWords.forEach(entry -> System.out.println(entry.getKey() + ": " + entry.getValue()));// 清理臨時文件和目錄Arrays.stream(smallFiles).forEach(File::delete);Files.deleteIfExists(tempDir);}private static void splitFile(String largeFilePath, String tempDir, int size) throws IOException {try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(largeFilePath))) {byte[] buffer = new byte[size];int bytesRead;int fileCount = 0;while ((bytesRead = bis.read(buffer)) != -1) {File smallFile = new File(tempDir, "part-" + fileCount++ + ".txt");// 避免截斷單詞int actualBytesRead = bytesRead;if (bytesRead <= size && buffer[bytesRead - 1] != '\n') {// 查找最后一個空格或換行符作為分割點for (int i = bytesRead - 1; i >= 0; i--) {if (buffer[i] == ' ' || buffer[i] == '\n') {actualBytesRead = i + 1;bis.reset();bis.skip(actualBytesRead);break;}}}try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(smallFile))) {bos.write(buffer, 0, actualBytesRead);}}}}
}

關鍵說明

文件分割優化：
分割時避免截斷單詞（通過查找最后一個非字母位置），確保每個單詞完整屬于一個小文件，避免統計誤差。
內存控制：
單個小文件 512KB，2G 文件約 4096 個小文件（而非 2048，更保守），每個小文件單獨處理，避免 OOM。
多線程效率：
使用線程池（大小為 CPU 核心數）并行處理小文件，提升統計速度，哈希表數組避免線程競爭（每個哈希槽獨立）。
小頂堆篩選：
堆大小固定為 100，每次插入復雜度 O (log 100)，整體效率遠高于全量排序（O (n log n)）。
單詞處理：
統一轉為小寫（忽略大小寫差異），用正則提取純字母單詞（可根據需求調整匹配規則，如保留數字）。

此方案適用于大文件場景，通過分治思想將問題拆解為可并行處理的子任務，再通過堆排序高效篩選結果。

本文來自互聯網用戶投稿，該文觀點僅代表作者本人，不代表本站立場。本站僅提供信息存儲空間服務，不擁有所有權，不承擔相關法律責任。
如若轉載，請注明出處：http://www.pswp.cn/news/916526.shtml
繁體地址，請注明出處：http://hk.pswp.cn/news/916526.shtml
英文地址，請注明出處：http://en.pswp.cn/news/916526.shtml

如若內容造成侵權/違法違規/事實不符，請聯系多彩編程網進行投訴反饋email:809451989@qq.com，一經查實，立即刪除！