JVM 的 Finalization Delay 引起的 OOM（java.lang.OutOfMemoryError：null at sun.misc.Unsafe.allocateMemory.）

今天在壓力測試環境某一個服務出現crash了，經過一番檢查，終于發現是由于JVM的Finalization Delay引起的，這個問題比較特殊，這里記錄一下。

這個服務是用Java寫的，主要完成的功能是根據特定的指令文件生成mp4文件，用到的java庫主要有javacv，這個庫底層其實是使用JNI調用操作系統里安裝的ffmpeg。

檢查日志文件

首先檢查日志文件，發現日志里出現了OOM的報錯

java.lang.OutOfMemoryError: nullat sun.misc.Unsafe.allocateMemory(Native Method) ~[na:1.7.0_79]at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127) ~[na:1.7.0_79]at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_79]at org.bytedeco.javacv.Frame.<init>(Frame.java:105) ~[javacv-1.1.jar:1.1]at org.bytedeco.javacv.Java2DFrameConverter.getFrame(Java2DFrameConverter.java:712) ~[javacv-1.1.jar:1.1]at org.bytedeco.javacv.Java2DFrameConverter.getFrame(Java2DFrameConverter.java:679) ~[javacv-1.1.jar:1.1]at org.bytedeco.javacv.Java2DFrameConverter.getFrame(Java2DFrameConverter.java:673) ~[javacv-1.1.jar:1.1]at org.bytedeco.javacv.Java2DFrameConverter.convert(Java2DFrameConverter.java:62) ~[javacv-1.1.jar:1.1]

很明顯這里是申請DirectByteBuffer出現了OOM，所以應該是Direct Memory申請太多了。為了確認問題，將服務跑起來，使用jconsole看了下JVM的堆內存使用情況，發現堆內存使用一直都是比較穩定的，但使用top -p ${pid}查看進程占用的內存，發現RES字段的值一直是在增長的，而且增長得很快，不到半個小時值就從原來的500M增長到1.6G。

分析代碼

接下來看一下關鍵代碼

public void encodeFrame(BufferedImage image, long frameTime) {try {long t = frameTime * 1000L;if(t>recorder.getTimestamp()) {recorder.setTimestamp(t);}Frame frame = java2dConverter.convert(image);recorder.record(frame);} catch (FrameRecorder.Exception e) {log.error("JavaCVMp4Encoder encode frame error.", e);}
}

業務層是不停地調用encodeFrame將每一張圖片編碼到mp4文件里。

而java2dConverter.convert(image);這句代碼的實現里會申請一個DirectByteBuffer。如下面的代碼。

public Frame(int width, int height, int depth, int channels) {int pixelSize = Math.abs(depth) / 8;this.imageWidth = width;this.imageHeight = height;this.imageDepth = depth;this.imageChannels = channels;this.imageStride = ((imageWidth * imageChannels * pixelSize + 7) & ~7) / pixelSize; // 8-byte alignedthis.image = new Buffer[1];ByteBuffer buffer = ByteBuffer.allocateDirect(imageHeight * imageStride * pixelSize).order(ByteOrder.nativeOrder());switch (imageDepth) {case DEPTH_BYTE:case DEPTH_UBYTE:  image[0] = buffer;                  break;case DEPTH_SHORT:case DEPTH_USHORT: image[0] = buffer.asShortBuffer();  break;case DEPTH_INT:    image[0] = buffer.asIntBuffer();    break;case DEPTH_LONG:   image[0] = buffer.asLongBuffer();   break;case DEPTH_FLOAT:  image[0] = buffer.asFloatBuffer();  break;case DEPTH_DOUBLE: image[0] = buffer.asDoubleBuffer(); break;default: throw new UnsupportedOperationException("Unsupported depth value: " + imageDepth);}
}

這里ByteBuffer.allocateDirect方法申請的DirectByteBuffer并不是Java堆內存，而是直接在C堆上申請的。而DirectByteBuffer申請的C堆內存釋放很特殊，并不是簡單地由JVM GC完成的。

先看一下DirectByteBuffer的定義

class DirectByteBuffer extends MappedByteBuffer implements DirectBuffer {...protected static final Unsafe unsafe = Bits.unsafe();...private static class Deallocatorimplements Runnable{private static Unsafe unsafe = Unsafe.getUnsafe();private long address;private long size;private int capacity;private Deallocator(long address, long size, int capacity) {assert (address != 0);this.address = address;this.size = size;this.capacity = capacity;}public void run() {if (address == 0) {// Paranoiareturn;}unsafe.freeMemory(address);address = 0;Bits.unreserveMemory(size, capacity);}}private final Cleaner cleaner;...DirectByteBuffer(int cap) {                   // package-privatesuper(-1, 0, cap, cap);boolean pa = VM.isDirectMemoryPageAligned();int ps = Bits.pageSize();long size = Math.max(1L, (long)cap + (pa ? ps : 0));Bits.reserveMemory(size, cap);long base = 0;try {base = unsafe.allocateMemory(size);} catch (OutOfMemoryError x) {Bits.unreserveMemory(size, cap);throw x;}unsafe.setMemory(base, size, (byte) 0);if (pa && (base % ps != 0)) {// Round up to page boundaryaddress = base + ps - (base & (ps - 1));} else {address = base;}cleaner = Cleaner.create(this, new Deallocator(base, size, cap));att = null;}...
}

可以看到創建DirectByteBuffer對象時實際上使用unsafe.allocateMemory申請一塊C堆內存的。DirectByteBuffer對象內部有一個Cleaner cleaner，看樣子應該是這個東東負責對申請申請的C堆內存進行釋放的。看一下Cleaner的定義：

public class Cleaner extends PhantomReference<Object> {
...private static final ReferenceQueue<Object> dummyQueue = new ReferenceQueue();private Cleaner(Object var1, Runnable var2) {super(var1, dummyQueue);this.thunk = var2;}public static Cleaner create(Object var0, Runnable var1) {return var1 == null?null:add(new Cleaner(var0, var1));}public void clean() {if(remove(this)) {try {this.thunk.run();} catch (final Throwable var2) {AccessController.doPrivileged(new PrivilegedAction() {public Void run() {if(System.err != null) {(new Error("Cleaner terminated abnormally", var2)).printStackTrace();}System.exit(1);return null;}});}}}
...
}

原來Cleaner實際上是一個PhantomReference

PhantomReference虛引用主要用來跟蹤對象被垃圾回收器回收的活動。虛引用與軟引用和弱引用的一個區別在于：虛引用必須和引用隊列（ReferenceQueue）聯合使用。當垃圾回收器準備回收一個對象時，如果發現它還有虛引用，就會在回收對象的內存之前，把這個虛引用加入到與之關聯的引用隊列中。

在這個場景里也就是說當JVM垃圾回收器準備回收某個DirectByteBuffer對象時，發現這個DirectByteBuffer對象有虛引用，就會將虛引用加入到與之關聯的引用隊列中。將虛引用加入到與之關聯的引用隊列中有什么作用？看一下Reference的實現代碼

public abstract class Reference<T> {...private T referent;...static private class Lock { }private static Lock lock = new Lock();/* List of References waiting to be enqueued.  The collector adds* References to this list, while the Reference-handler thread removes* them.  This list is protected by the above lock object. The* list uses the discovered field to link its elements.*/private static Reference<Object> pending = null;/* High-priority thread to enqueue pending References*/private static class ReferenceHandler extends Thread {private static void ensureClassInitialized(Class<?> clazz) {try {Class.forName(clazz.getName(), true, clazz.getClassLoader());} catch (ClassNotFoundException e) {throw (Error) new NoClassDefFoundError(e.getMessage()).initCause(e);}}static {// pre-load and initialize InterruptedException and Cleaner classes// so that we don't get into trouble later in the run loop if there's// memory shortage while loading/initializing them lazily.ensureClassInitialized(InterruptedException.class);ensureClassInitialized(Cleaner.class);}ReferenceHandler(ThreadGroup g, String name) {super(g, name);}public void run() {while (true) {tryHandlePending(true);}}}/*** Try handle pending {@link Reference} if there is one.<p>* Return {@code true} as a hint that there might be another* {@link Reference} pending or {@code false} when there are no more pending* {@link Reference}s at the moment and the program can do some other* useful work instead of looping.** @param waitForNotify if {@code true} and there was no pending*                      {@link Reference}, wait until notified from VM*                      or interrupted; if {@code false}, return immediately*                      when there is no pending {@link Reference}.* @return {@code true} if there was a {@link Reference} pending and it*         was processed, or we waited for notification and either got it*         or thread was interrupted before being notified;*         {@code false} otherwise.*/static boolean tryHandlePending(boolean waitForNotify) {Reference<Object> r;Cleaner c;try {synchronized (lock) {if (pending != null) {r = pending;// 'instanceof' might throw OutOfMemoryError sometimes// so do this before un-linking 'r' from the 'pending' chain...c = r instanceof Cleaner ? (Cleaner) r : null;// unlink 'r' from 'pending' chainpending = r.discovered;r.discovered = null;} else {// The waiting on the lock may cause an OutOfMemoryError// because it may try to allocate exception objects.if (waitForNotify) {lock.wait();}// retry if waitedreturn waitForNotify;}}} catch (OutOfMemoryError x) {// Give other threads CPU time so they hopefully drop some live references// and GC reclaims some space.// Also prevent CPU intensive spinning in case 'r instanceof Cleaner' above// persistently throws OOME for some time...Thread.yield();// retryreturn true;} catch (InterruptedException x) {// retryreturn true;}// Fast path for cleanersif (c != null) {c.clean();return true;}ReferenceQueue<? super Object> q = r.queue;if (q != ReferenceQueue.NULL) q.enqueue(r);return true;}static {ThreadGroup tg = Thread.currentThread().getThreadGroup();for (ThreadGroup tgn = tg;tgn != null;tg = tgn, tgn = tg.getParent());Thread handler = new ReferenceHandler(tg, "Reference Handler");/* If there were a special system-only priority greater than* MAX_PRIORITY, it would be used here*/handler.setPriority(Thread.MAX_PRIORITY);handler.setDaemon(true);handler.start();// provide access in SharedSecretsSharedSecrets.setJavaLangRefAccess(new JavaLangRefAccess() {@Overridepublic boolean tryHandlePendingReference() {return tryHandlePending(false);}});}...Reference(T referent, ReferenceQueue<? super T> queue) {this.referent = referent;this.queue = (queue == null) ? ReferenceQueue.NULL : queue;}}

這里代碼看著有些糊涂，并沒有代碼給pending這個類變量賦值，為啥ReferenceHandler這個線程執行體里又在讀取它的值，但看了看private static Reference<Object> pending = null;這一行上面的注釋，想了想終于明白了，原來JVM垃圾回收器將將虛引用加入到與之關聯的引用隊列后，JVM垃圾回收器又負責逐個將引用隊列中的引用拿出來賦于pending，然后通知ReferenceHandler線程，ReferenceHandler線程拿到引用后，發現如果是Cleaner，則調用其clean方法。然后終于與DirectByteBuffer里的Deallocator接上了，最終DirectByteBuffer申請的C堆內存被釋放。

既然DirectByteBuffer申請的C堆內存釋放是自動的，為啥在這個場景里會出現OOM呢？查閱java的bug記錄，終于找到原因。http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4857305，http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4469299。

意思是如果DirectByteBuffer創建得過于頻繁，服務器的CPU太繁忙，C堆內存還是會OOM的，原因是JVM來不及進行GC及Finalization，大量對象的銷毀工作被推后，最終C堆內存無法得到釋放。

解決方案

bug記錄提到了3個解決方案：

Insert occasional explicit System.gc() invocations to ensure that direct buffers are reclaimed.
Reduce the size of the young generation to force more frequent GCs.
Explicitly pool direct buffers at the application level.

我這里采用了第一個解決方案，代碼如下：

public void encodeFrame(BufferedImage image, long frameTime) {try {long t = frameTime * 1000L;if(t>recorder.getTimestamp()) {recorder.setTimestamp(t);}Frame frame = java2dConverter.convert(image);recorder.record(frame);if(System.currentTimeMillis() - lastGCTime > 60000){System.gc();System.runFinalization();lastGCTime = System.currentTimeMillis();Thread.yield();TimeUnit.SECONDS.sleep(3);}} catch (FrameRecorder.Exception e) {log.error("JavaCVMp4Encoder encode frame error.", e);}
}