統計可重復列表中的TOP N

文章目錄

- - 方案1：HashMap統計 + 全排序
  - - 實現步驟：
    - 代碼實現：
    - 優缺點：
  - 方案2：HashMap統計 + 最小堆（優先隊列）
  - - 實現步驟：
    - 代碼實現：
    - 優缺點：
  - 方案3：Java Stream API
  - - 實現步驟：
    - 代碼實現：
    - 優缺點：
  - 完整示例代碼
  - 關鍵點總結
  - 方案4：并行流處理（Parallel Stream）
  - - 實現步驟：
    - 代碼實現：
    - 優缺點：
  - 方案5：桶排序（Bucket Sort）
  - - 實現步驟：
    - 代碼實現：
    - 優缺點：
  - 方案6：快速選擇（Quickselect）算法
  - - 實現步驟：
    - 代碼實現（部分）：
    - 優缺點：
  - 方案7：Guava庫的MultiSet（第三方依賴）
  - - 實現步驟：
    - 代碼實現：
    - 優缺點：
- 二、方案對比總表
- 三、總結建議

這種統計top值的情況場景使用的不少，面試過程中也有聊到過這類問題，在這詳細介紹一下思路和方案

在Java中統計列表中出現次數最多的前N個對象，常見的實現方案及其優缺點如下：

方案1：HashMap統計 + 全排序

實現步驟：

使用HashMap統計每個元素的頻率。
將統計結果轉為列表，按頻率降序排序。
取前N個元素。

代碼實現：

public static List<Map.Entry<String, Integer>> topNWithSort(List<String> list, int n) {// 統計頻率Map<String, Integer> freqMap = new HashMap<>();for (String item : list) {freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);}// 轉換為列表并排序List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqMap.entrySet());entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));// 取前N個return entries.subList(0, Math.min(n, entries.size()));
}

優缺點：

優點：實現簡單，代碼直觀。
缺點：全排序時間復雜度為 (O(m \log m))（(m) 為不同元素的數量），當 (m) 較大時效率低。

方案2：HashMap統計 + 最小堆（優先隊列）

實現步驟：

使用HashMap統計頻率。
使用大小為N的最小堆，遍歷頻率表，維護堆頂為當前最小的頻率。
將堆中元素逆序輸出。

代碼實現：

public static List<Map.Entry<String, Integer>> topNWithHeap(List<String> list, int n) {// 統計頻率Map<String, Integer> freqMap = new HashMap<>();for (String item : list) {freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);}// 初始化最小堆（按頻率升序）PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>((a, b) -> a.getValue() - b.getValue());// 遍歷頻率表，維護堆的大小為Nfor (Map.Entry<String, Integer> entry : freqMap.entrySet()) {if (heap.size() < n) {heap.offer(entry);} else if (entry.getValue() > heap.peek().getValue()) {heap.poll();heap.offer(entry);}}// 將堆轉換為列表并逆序List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);result.sort((a, b) -> b.getValue().compareTo(a.getValue()));return result;
}

優缺點：

優點：時間復雜度為 (O(m \log n))，適合大數據量且 (n \ll m) 的場景。
缺點：需要手動維護堆，代碼稍復雜。

方案3：Java Stream API

實現步驟：

使用Stream的groupingBy和counting統計頻率。
按頻率降序排序后取前N個。

代碼實現：

public static List<Map.Entry<String, Long>> topNWithStream(List<String> list, int n) {return list.stream().collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).entrySet().stream().sorted(Map.Entry.<String, Long>comparingByValue().reversed()).limit(n).collect(Collectors.toList());
}

優缺點：

優點：代碼簡潔，函數式編程風格。
缺點：隱藏實現細節，可能對內存和性能控制不足。

完整示例代碼

import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;public class TopNFrequency {public static void main(String[] args) {List<String> list = Arrays.asList("apple", "banana", "apple", "orange", "banana", "apple");int n = 2;// 方法1：全排序System.out.println("HashMap + Sorting: " + topNWithSort(list, n));// 方法2：最小堆System.out.println("HashMap + Heap: " + topNWithHeap(list, n));// 方法3：Stream APISystem.out.println("Stream API: " + topNWithStream(list, n));}// 方法1：全排序public static List<Map.Entry<String, Integer>> topNWithSort(List<String> list, int n) {Map<String, Integer> freqMap = new HashMap<>();for (String item : list) {freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);}List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqMap.entrySet());entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));return entries.subList(0, Math.min(n, entries.size()));}// 方法2：最小堆public static List<Map.Entry<String, Integer>> topNWithHeap(List<String> list, int n) {Map<String, Integer> freqMap = new HashMap<>();for (String item : list) {freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);}PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>((a, b) -> a.getValue() - b.getValue());for (Map.Entry<String, Integer> entry : freqMap.entrySet()) {if (heap.size() < n) {heap.offer(entry);} else if (entry.getValue() > heap.peek().getValue()) {heap.poll();heap.offer(entry);}}List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);result.sort((a, b) -> b.getValue().compareTo(a.getValue()));return result;}// 方法3：Stream APIpublic static List<Map.Entry<String, Long>> topNWithStream(List<String> list, int n) {return list.stream().collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).entrySet().stream().sorted(Map.Entry.<String, Long>comparingByValue().reversed()).limit(n).collect(Collectors.toList());}
}

關鍵點總結

全排序適合數據量小的場景，代碼簡單但效率低。
最小堆適合大數據量，時間復雜度更優。
Stream API以簡潔性取勝，但需注意類型轉換和性能。

方案4：并行流處理（Parallel Stream）

實現步驟：

使用并行流加速統計和排序。
利用ConcurrentHashMap保證線程安全。

代碼實現：

public static List<Map.Entry<String, Long>> topNParallelStream(List<String> list, int n) {return list.parallelStream().collect(Collectors.groupingByConcurrent(Function.identity(), Collectors.counting())).entrySet().parallelStream().sorted(Map.Entry.<String, Long>comparingByValue().reversed()).limit(n).collect(Collectors.toList());
}

優缺點：

優點：利用多核并行處理，適合超大數據量。
缺點：線程安全控制復雜，可能因數據傾斜導致性能提升有限。

方案5：桶排序（Bucket Sort）

實現步驟：

統計頻率，記錄最大頻率。
創建頻率桶，索引為頻率，值為元素列表。
從高到低遍歷桶，收集前N個元素。

代碼實現：

public static List<Map.Entry<String, Integer>> topNBucketSort(List<String> list, int n) {Map<String, Integer> freqMap = new HashMap<>();int maxFreq = 0;for (String item : list) {int freq = freqMap.getOrDefault(item, 0) + 1;freqMap.put(item, freq);maxFreq = Math.max(maxFreq, freq);}// 創建桶（索引為頻率）List<List<String>> buckets = new ArrayList<>(maxFreq + 1);for (int i = 0; i <= maxFreq; i++) {buckets.add(new ArrayList<>());}freqMap.forEach((k, v) -> buckets.get(v).add(k));// 從高到低收集結果List<Map.Entry<String, Integer>> result = new ArrayList<>();for (int i = maxFreq; i >= 0 && result.size() < n; i--) {for (String item : buckets.get(i)) {result.add(new AbstractMap.SimpleEntry<>(item, i));if (result.size() == n) break;}}return result;
}

優缺點：

優點：時間復雜度 (O(m + k))（(k)為最大頻率），適合頻率分布集中的場景。
缺點：空間復雜度 (O(k))，若最大頻率極高則浪費內存。

方案6：快速選擇（Quickselect）算法

實現步驟：

統計頻率，將Entry存入列表。
使用快速選擇算法找到第N大的頻率分界點。
對前N個元素進行排序。

代碼實現（部分）：

public static List<Map.Entry<String, Integer>> topNQuickSelect(List<String> list, int n) {Map<String, Integer> freqMap = new HashMap<>();for (String item : list) {freqMap.put(item, freqMap.getOrDefault(item, 0) + 1);}List<Map.Entry<String, Integer>> entries = new ArrayList<>(freqMap.entrySet());quickSelect(entries, n);return entries.subList(0, n).stream().sorted((a, b) -> b.getValue().compareTo(a.getValue())).collect(Collectors.toList());
}private static void quickSelect(List<Map.Entry<String, Integer>> list, int n) {int left = 0, right = list.size() - 1;while (left <= right) {int pivotIndex = partition(list, left, right);if (pivotIndex == n) break;else if (pivotIndex < n) left = pivotIndex + 1;else right = pivotIndex - 1;}
}private static int partition(List<Map.Entry<String, Integer>> list, int low, int high) {int pivotValue = list.get(high).getValue();int i = low;for (int j = low; j < high; j++) {if (list.get(j).getValue() > pivotValue) {Collections.swap(list, i, j);i++;}}Collections.swap(list, i, high);return i;
}

優缺點：

優點：平均時間復雜度 (O(m))，適合對性能要求極高的場景。
缺點：實現復雜，需處理大量邊界條件。

方案7：Guava庫的MultiSet（第三方依賴）

實現步驟：

使用Guava的Multiset統計頻率。
按頻率排序后取前N個。

代碼實現：

public static List<Multiset.Entry<String>> topNGuava(List<String> list, int n) {Multiset<String> multiset = HashMultiset.create(list);return multiset.entrySet().stream().sorted((a, b) -> b.getCount() - a.getCount()).limit(n).collect(Collectors.toList());
}

優缺點：

優點：代碼極簡，依賴Guava工具類。
缺點：需引入第三方庫，不適合純JDK環境。

二、方案對比總表

方案	時間復雜度	空間復雜度	適用場景
全排序	(O(m \log m))	(O(m))	數據量小，代碼簡單
最小堆	(O(m \log n))	(O(n))	大數據量且 (n \ll m)
Stream API	(O(m \log m))	(O(m))	快速開發，代碼簡潔
并行流	(O(m \log m / p))	(O(m))	多核環境，超大數據量
桶排序	(O(m + k))	(O(k))	頻率集中且最大值已知
快速選擇	(O(m))（平均）	(O(m))	高性能需求，允許復雜實現
Guava MultiSet	(O(m \log m))	(O(m))	允許第三方依賴