【Using Zero-Copy for File Operations】

The Conventional Approach

When we need to split a text file into chunks of a maximum size, we might be tempted to write code like this:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TestSplit {
    private static final long maxFileSizeBytes = 10 * 1024 * 1024; // default: 10 MB

    public void split(Path inputFile, Path outputDir) throws IOException {
        if (!Files.exists(inputFile)) {
            throw new IOException("Input file does not exist: " + inputFile);
        }
        if (Files.size(inputFile) == 0) {
            throw new IOException("Input file is empty: " + inputFile);
        }
        Files.createDirectories(outputDir);
        try (BufferedReader reader = Files.newBufferedReader(inputFile)) {
            int fileIndex = 0;
            long currentSize = 0;
            BufferedWriter writer = null;
            try {
                writer = newWriter(outputDir, fileIndex++);
                String line;
                while ((line = reader.readLine()) != null) {
                    // Encode as UTF-8 so the size matches what the UTF-8 writer emits
                    byte[] lineBytes = (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
                    // Start a new part when this line would exceed the size limit
                    if (currentSize + lineBytes.length > maxFileSizeBytes) {
                        writer.close();
                        writer = newWriter(outputDir, fileIndex++);
                        currentSize = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    currentSize += lineBytes.length;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
    }

    private BufferedWriter newWriter(Path dir, int index) throws IOException {
        Path filePath = dir.resolve("part_" + index + ".txt");
        return Files.newBufferedWriter(filePath);
    }

    public static void main(String[] args) {
        // Backslashes must be escaped inside Java string literals
        String inputFilePath = "C:\\Users\\fei\\Desktop\\testTwo.txt";
        String outputDirPath = "C:\\Users\\fei\\Desktop\\testTwo";
        TestSplit splitter = new TestSplit();
        try {
            long startTime = System.currentTimeMillis();
            splitter.split(Paths.get(inputFilePath), Paths.get(outputDirPath));
            long endTime = System.currentTimeMillis();
            long duration = endTime - startTime;
            System.out.println("File split complete!");
            System.out.printf("Total time: %d ms%n", duration);
        } catch (IOException e) {
            System.out.println("Error while splitting the file: " + e.getMessage());
        }
    }
}

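Efficiency Analysis

This code is technically correct, but it is very inefficient at splitting a large file into chunks, for two reasons:

It performs many heap allocations (several per line), creating and discarding large numbers of temporary objects (Strings, byte arrays).
Less obviously, it copies the data through multiple buffers and performs context switches between user mode and kernel mode.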

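Detailed Code Analysis

BufferedReader:

Calls read() on the underlying FileReader or InputStreamReader.
Data is copied from kernel space into a user-space buffer.
The bytes are then decoded into Java Strings (heap allocation).

getBytes():

Converts each String into a new byte[], which means more heap allocation.

BufferedWriter:

Takes byte/char data from user space.
Calls write(), which in turn copies the data from user space to kernel space.
The data is eventually flushed to disk.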

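As a result, the data moves back and forth between kernel space and user space several times, with extra heap churn along the way. Besides garbage-collection pressure, this has the following consequences:

Memory bandwidth is wasted copying between buffers.
CPU utilization is high for what is essentially a disk-to-disk transfer.
The operating system could handle the bulk copy directly (via DMA or optimized I/O), but the Java code prevents this by routing the data through user-space logic.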
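To make the copy path concrete, here is a minimal sketch of the same pattern reduced to raw bytes. The class name UserSpaceCopy and the 8 KB buffer size are assumptions chosen for illustration; they are not part of the original code. Every iteration pulls data from the kernel into a heap buffer and then pushes it back down again.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: a classic stream copy that routes every byte through user space.
class UserSpaceCopy {

    // Copies src to dst through a heap buffer and returns the number of bytes copied.
    static long copy(Path src, Path dst) throws IOException {
        byte[] buffer = new byte[8 * 1024]; // user-space (heap) buffer, size is an arbitrary choice
        long total = 0;
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Files.newOutputStream(dst)) {
            int n;
            while ((n = in.read(buffer)) != -1) { // kernel space -> user space
                out.write(buffer, 0, n);          // user space -> kernel space
                total += n;
            }
        }
        return total;
    }
}
```

The line-based splitter above does all of this and additionally decodes the bytes into Strings and encodes them back, which is where the extra allocations come from.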

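Solution

So how do we avoid these problems?

The answer is to use zero copy wherever possible, that is, to avoid leaving kernel space as much as we can. In Java this can be done with the FileChannel method long transferTo(long position, long count, WritableByteChannel target). It lets the operating system move the bytes from file to file without passing them through the JVM heap, and it can take advantage of OS-level I/O optimizations where the platform supports them.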
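As a minimal sketch of the call itself, the following copies a whole file with transferTo. The class name ZeroCopyFileCopy and its copy method are hypothetical names used only for this illustration. Because transferTo is allowed to transfer fewer bytes than requested, the sketch loops until the whole range has been moved.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper showing FileChannel.transferTo on a whole file.
class ZeroCopyFileCopy {

    // Copies source to target via transferTo and returns the number of bytes transferred.
    static long copy(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                // May move fewer bytes than asked for, so advance by the actual count
                long transferred = in.transferTo(position, size - position, out);
                if (transferred <= 0) {
                    break; // defensive: stop if no progress is made
                }
                position += transferred;
            }
            return position;
        }
    }
}
```

Note that no byte[] or String is allocated for the file contents; the Java code only passes positions and counts.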

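The catch is that this method operates on blocks of bytes and may therefore cut lines in half. To solve this, we need a strategy that keeps lines intact even though the file is processed as byte ranges.

Without that constraint the job would be easy: call transferTo for each chunk, advancing the position by position = position + maxFileSize, until there is no more data to transfer.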
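The simple variant just described, which ignores line boundaries, could look roughly like the sketch below. FixedSizeSplitter, its split signature, and the part_N file naming are assumptions made for this example. A chunk may end in the middle of a line, which is exactly the problem the next step addresses.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a file into fixed-size byte chunks without regard to line endings.
class FixedSizeSplitter {

    // Writes part_0, part_1, ... into outputDir and returns the created files in order.
    static List<Path> split(Path input, Path outputDir, long maxChunkBytes) throws IOException {
        if (maxChunkBytes <= 0) {
            throw new IllegalArgumentException("maxChunkBytes must be positive");
        }
        Files.createDirectories(outputDir);
        List<Path> parts = new ArrayList<>();
        try (FileChannel in = FileChannel.open(input, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                long chunkEnd = Math.min(position + maxChunkBytes, size);
                Path part = outputDir.resolve("part_" + parts.size());
                try (FileChannel out = FileChannel.open(part,
                        StandardOpenOption.CREATE,
                        StandardOpenOption.WRITE,
                        StandardOpenOption.TRUNCATE_EXISTING)) {
                    // transferTo may return early, so keep going until the chunk is complete
                    while (position < chunkEnd) {
                        long n = in.transferTo(position, chunkEnd - position, out);
                        if (n <= 0) {
                            throw new IOException("Transfer stalled at position " + position);
                        }
                        position += n;
                    }
                }
                parts.add(part);
            }
        }
        return parts;
    }
}
```

For a 25-byte input and a 10-byte limit this produces three parts of 10, 10, and 5 bytes.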

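To keep lines intact, we need to determine where the last complete line in each byte chunk ends. To do this, we first seek to the intended end of the chunk and then scan backwards to find the preceding newline. This gives us the exact byte count for the chunk, so that it ends with a complete, unbroken line. This is the only part of the code that allocates a buffer and copies data, and since these operations should be small, the performance impact is expected to be negligible.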

// Excerpt from the splitter class; it needs these imports:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

private static final int LINE_ENDING_SEARCH_WINDOW = 8 * 1024;

private long maxSizePerFileInBytes;
private Path outputDirectory;
private Path tempDir;

private void split(Path fileToSplit) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(fileToSplit.toFile(), "r");
         FileChannel inputChannel = raf.getChannel()) {

        long fileSize = raf.length();
        long position = 0;
        int fileCounter = 1;

        while (position < fileSize) {
            // Calculate end position (try to get close to max size)
            long targetEndPosition = Math.min(position + maxSizePerFileInBytes, fileSize);

            // If we're not at the end of the file, find the last line ending before max size
            long endPosition = targetEndPosition;
            if (endPosition < fileSize) {
                endPosition = findLastLineEndBeforePosition(raf, position, targetEndPosition);
            }

            var outputFilePath = tempDir.resolve("_part" + fileCounter);
            try (FileOutputStream fos = new FileOutputStream(outputFilePath.toFile());
                 FileChannel outputChannel = fos.getChannel()) {
                // transferTo may move fewer bytes than requested, so loop until the chunk is complete
                long transferPosition = position;
                while (transferPosition < endPosition) {
                    long transferred = inputChannel.transferTo(
                            transferPosition, endPosition - transferPosition, outputChannel);
                    if (transferred <= 0) {
                        throw new IOException("Transfer stalled at position " + transferPosition);
                    }
                    transferPosition += transferred;
                }
            }

            position = endPosition;
            fileCounter++;
        }
    }
}

private long findLastLineEndBeforePosition(RandomAccessFile raf, long startPosition, long maxPosition)
        throws IOException {
    long originalPosition = raf.getFilePointer();

    try {
        int bufferSize = LINE_ENDING_SEARCH_WINDOW;
        long chunkSize = maxPosition - startPosition;

        if (chunkSize < bufferSize) {
            bufferSize = (int) chunkSize;
        }

        byte[] buffer = new byte[bufferSize];
        long searchPos = maxPosition;

        while (searchPos > startPosition) {
            long distanceToStart = searchPos - startPosition;
            int bytesToRead = (int) Math.min(bufferSize, distanceToStart);

            long readStartPos = searchPos - bytesToRead;
            raf.seek(readStartPos);

            int bytesRead = raf.read(buffer, 0, bytesToRead);
            if (bytesRead <= 0)
                break;

            // Search backwards through the buffer for newline
            for (int i = bytesRead - 1; i >= 0; i--) {
                if (buffer[i] == '\n') {
                    // The chunk ends just after the newline
                    return readStartPos + i + 1;
                }
            }

            searchPos -= bytesRead;
        }

        throw new IllegalArgumentException("File cannot be split: no newline found between positions "
                + startPosition + " and " + maxPosition + ".");
    } finally {
        // Restore the file pointer so the caller's state is unchanged
        raf.seek(originalPosition);
    }
}

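The findLastLineEndBeforePosition method has some limitations. Specifically, it recognizes only \n as the line terminator (the Unix convention), very long lines can cause a large number of backward read iterations, and a file containing a line longer than maxSizePerFileInBytes cannot be split. However, it is well suited to scenarios such as splitting access log files, which typically have short lines and a large number of entries.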

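Performance Analysis

In theory, zero-copy splitting should be faster than the conventional approach; now it is time to measure how much faster. I ran benchmarks for both implementations, and these are the results.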

Benchmark                                                    Mode  Cnt           Score      Error   Units
FileSplitterBenchmark.splitFile                              avgt   15        1179.429 ±   54.271   ms/op
FileSplitterBenchmark.splitFile:·gc.alloc.rate               avgt   15        1349.613 ±   60.903  MB/sec
FileSplitterBenchmark.splitFile:·gc.alloc.rate.norm          avgt   15  1694927403.481 ± 6060.581    B/op
FileSplitterBenchmark.splitFile:·gc.count                    avgt   15         718.000             counts
FileSplitterBenchmark.splitFile:·gc.time                     avgt   15         317.000                 ms
FileSplitterBenchmark.splitFileZeroCopy                      avgt   15          77.352 ±    1.339   ms/op
FileSplitterBenchmark.splitFileZeroCopy:·gc.alloc.rate       avgt   15          23.759 ±    0.465  MB/sec
FileSplitterBenchmark.splitFileZeroCopy:·gc.alloc.rate.norm  avgt   15     2555608.877 ± 8644.153    B/op
FileSplitterBenchmark.splitFileZeroCopy:·gc.count            avgt   15          10.000             counts
FileSplitterBenchmark.splitFileZeroCopy:·gc.time             avgt   15           5.000                 ms

Below are the benchmark code and the file size used to produce the results above.

int maxSizePerFileInBytes = 1024 * 1024; // 1 MB chunks

@Setup
public void setup() throws Exception {
    inputFile = Paths.get("/tmp/large_input.txt");
    outputDir = Paths.get("/tmp/split_output");
    // Create a large file for benchmarking if it doesn't exist
    if (!Files.exists(inputFile)) {
        try (BufferedWriter writer = Files.newBufferedWriter(inputFile)) {
            for (int i = 0; i < 10_000_000; i++) {
                writer.write("This is line number " + i);
                writer.newLine();
            }
        }
    }
}

@Benchmark
public void splitFile() throws Exception {
    splitter.split(inputFile, outputDir);
}

@Benchmark
public void splitFileZeroCopy() throws Exception {
    zeroCopySplitter.split(inputFile);
}

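The zero-copy version shows a considerable speedup: it took only 77 ms, while the conventional approach needed 1179 ms for this particular case. When processing large volumes of data or many files, this performance advantage can be critical.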
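To put the table into ratios, the small sketch below divides the conventional figures by the zero-copy figures. The class name SpeedupCalc is hypothetical, and the inputs are copied directly from the benchmark output above. The results come out at roughly 15x for time per operation and roughly 660x for bytes allocated per operation.

```java
// Hypothetical helper that turns the benchmark scores above into ratios.
class SpeedupCalc {

    // How many times larger the baseline value is than the improved value.
    static double ratio(double baseline, double improved) {
        return baseline / improved;
    }

    public static void main(String[] args) {
        // ms/op: splitFile vs. splitFileZeroCopy
        System.out.printf("Time speedup: %.1fx%n", ratio(1179.429, 77.352));
        // B/op (gc.alloc.rate.norm): splitFile vs. splitFileZeroCopy
        System.out.printf("Allocation reduction: %.0fx%n", ratio(1694927403.481, 2555608.877));
    }
}
```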