We ran into the following messages in a production Router log:
2025-04-23 17:44:15,780 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router1:8888->hh-fed-sub25:nn2:nn2:8020-EXPIRED
2025-04-23 17:44:15,781 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router1:8888->hh-fed-sub25:nn1:nn1:8020-EXPIRED
2025-04-23 17:44:15,781 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router2:8888->hh-fed-sub25:nn1:nn1:8020-EXPIRED
2025-04-23 17:44:15,781 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router2:8888->hh-fed-sub25:nn2:nn2:8020-EXPIRED
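The cause: the subcluster was originally configured with 3 Routers and 2 NameNodes (nn), so 6 MembershipState records (one per Router/NameNode pair) were stored in the StateStore.
Later, two of the subcluster's Routers were stopped, leaving only one running. As a result, the running Router's log shows the messages above.
The Router periodically loads the MembershipState records and checks each one for expiry. The two stopped Routers had each registered a membership with the NameNodes and reported it to the StateStore, and those records are no longer being refreshed. Because we had also disabled deletion of expired records (parameter dfs.federation.router.store.membership.expiration.deletion), the stale records stay in the StateStore and the running Router prints the messages above.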
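The situation can be reproduced with a small model. The sketch below is not Hadoop code: the Membership class, the name "router3" for the Router that is still running, and the "stopped an hour ago" timestamps are all assumptions made for illustration. It only reuses the comparison from checkExpired (modified time plus expiration is earlier than the current time) to show why 4 of the 6 records are reported.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the expiry check; not the Hadoop implementation. */
class ExpiredMembershipSketch {
  /** Assumed expiration window: the 5 min default, in milliseconds. */
  static final long EXPIRATION_MS = 5 * 60 * 1000L;

  /** Hypothetical stand-in for a MembershipState record. */
  static final class Membership {
    final String router;
    final String namenode;
    final long dateModified;

    Membership(String router, String namenode, long dateModified) {
      this.router = router;
      this.namenode = namenode;
      this.dateModified = dateModified;
    }
  }

  /** Same comparison as checkExpired: modified + expiration < now. */
  static List<Membership> findExpired(List<Membership> records, long now) {
    List<Membership> expired = new ArrayList<>();
    for (Membership m : records) {
      if (m.dateModified > 0 && m.dateModified + EXPIRATION_MS < now) {
        expired.add(m);
      }
    }
    return expired;
  }

  /** 3 Routers x 2 NameNodes; router1 and router2 stopped an hour ago. */
  static List<Membership> sampleRecords(long now) {
    List<Membership> records = new ArrayList<>();
    for (String nn : new String[] {"nn1", "nn2"}) {
      records.add(new Membership("router1", nn, now - 3_600_000L));
      records.add(new Membership("router2", nn, now - 3_600_000L));
      // "router3" is a made-up name for the Router that is still running.
      records.add(new Membership("router3", nn, now - 10_000L));
    }
    return records;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    for (Membership m : findExpired(sampleRecords(now), now)) {
      System.out.println("expired: " + m.router + " -> " + m.namenode);
    }
  }
}
```

Only the records written by the two stopped Routers fall outside the window, which matches the four log lines above.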
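To fix it, either of the following works:
- Enable deletion of expired records
  - dfs.federation.router.store.membership.expiration defaults to 5 min. If dfs.federation.router.store.membership.expiration.deletion is set to 2 min, a membership that has expired (no report for more than 5 min) is deleted once a further 2 min have passed.
- Start the stopped Routers again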
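As a rough illustration of the two settings, the following sketch classifies a record by how long ago it was last updated. It is a simplified model, not the Hadoop API: the class, the Phase enum and phaseAt are hypothetical names, and a non-positive deletion delay is assumed to mean "deletion disabled", in line with the deletionTime > 0 check in shouldBeDeleted.

```java
import java.time.Duration;

/** Timeline model for the expiration and deletion settings (hypothetical names). */
class MembershipTimelineSketch {
  enum Phase { ACTIVE, EXPIRED, DELETABLE }

  /**
   * @param sinceLastUpdate time since the record was last modified
   * @param expiration      value of ...membership.expiration
   * @param deletionDelay   value of ...membership.expiration.deletion
   */
  static Phase phaseAt(Duration sinceLastUpdate, Duration expiration,
      Duration deletionDelay) {
    // Not expired until strictly more than the expiration window has passed.
    if (sinceLastUpdate.compareTo(expiration) <= 0) {
      return Phase.ACTIVE;
    }
    // Assumption: a non-positive delay disables deletion.
    if (deletionDelay.isNegative() || deletionDelay.isZero()) {
      return Phase.EXPIRED;
    }
    Duration sinceExpiry = sinceLastUpdate.minus(expiration);
    return sinceExpiry.compareTo(deletionDelay) > 0
        ? Phase.DELETABLE : Phase.EXPIRED;
  }

  public static void main(String[] args) {
    Duration expiration = Duration.ofMinutes(5);
    Duration deletion = Duration.ofMinutes(2);
    for (int minutes : new int[] {4, 6, 8}) {
      System.out.println(minutes + " min: "
          + phaseAt(Duration.ofMinutes(minutes), expiration, deletion));
    }
  }
}
```

With 5 min and 2 min, a record becomes deletable once more than 7 min have passed since its last update. With deletion disabled it stays in the EXPIRED phase indefinitely, which is the state we were in.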
Source code for reference
org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords
public void overrideExpiredRecords(QueryResult<R> query) throws IOException {
  List<R> commitRecords = new ArrayList<>();
  List<R> deleteRecords = new ArrayList<>();
  List<R> newRecords = query.getRecords();
  long currentDriverTime = query.getTimestamp();
  if (newRecords == null || currentDriverTime <= 0) {
    LOG.error("Cannot check overrides for record");
    return;
  }
  for (R record : newRecords) {
    if (record.shouldBeDeleted(currentDriverTime)) {
      String recordName = StateStoreUtils.getRecordName(record.getClass());
      if (getDriver().remove(record)) {
        deleteRecords.add(record);
        LOG.info("Deleted State Store record {}: {}", recordName, record);
      } else {
        LOG.warn("Couldn't delete State Store record {}: {}", recordName,
            record);
      }
    } else if (record.checkExpired(currentDriverTime)) {
      String recordName = StateStoreUtils.getRecordName(record.getClass());
      LOG.info("Override State Store record {}: {}", recordName, record);
      commitRecords.add(record);
    }
  }
  if (commitRecords.size() > 0) {
    getDriver().putAll(commitRecords, true, false);
  }
  if (deleteRecords.size() > 0) {
    newRecords.removeAll(deleteRecords);
  }
}
org.apache.hadoop.hdfs.server.federation.store.records.MembershipState#checkExpired
@Override
public boolean checkExpired(long currentTime) {
  if (super.checkExpired(currentTime)) {
    this.setState(EXPIRED);
    // Commit it
    return true;
  }
  return false;
}
org.apache.hadoop.hdfs.server.federation.store.records.BaseRecord#checkExpired
public boolean checkExpired(long currentTime) {
  long expiration = getExpirationMs();
  long modifiedTime = getDateModified();
  if (modifiedTime > 0 && expiration > 0) {
    return (modifiedTime + expiration) < currentTime;
  }
  return false;
}
org.apache.hadoop.hdfs.server.federation.store.records.BaseRecord#shouldBeDeleted
public boolean shouldBeDeleted(long currentTime) {
  long deletionTime = getDeletionMs();
  if (isExpired() && deletionTime > 0) {
    long elapsedTime = currentTime - (getDateModified() + getExpirationMs());
    return elapsedTime > deletionTime;
  } else {
    return false;
  }
}
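To see how these methods interact across successive checks, here is a toy model of the flow in overrideExpiredRecords. It is a sketch under assumptions, not the Hadoop classes: Record, Outcome and refresh are hypothetical names, and the expired flag is assumed to behave like isExpired(), that is, it only becomes true after an earlier checkExpired call has marked the record.

```java
/** Toy model of the override-then-delete flow; all names are hypothetical. */
class RecordLifecycleSketch {
  enum Outcome { KEPT, OVERRIDDEN, DELETED }

  static final class Record {
    final long dateModified;
    final long expirationMs;
    final long deletionMs;
    boolean expired; // assumed to mirror isExpired()

    Record(long dateModified, long expirationMs, long deletionMs) {
      this.dateModified = dateModified;
      this.expirationMs = expirationMs;
      this.deletionMs = deletionMs;
    }

    /** Marks the record as expired when its window has passed. */
    boolean checkExpired(long now) {
      if (dateModified > 0 && expirationMs > 0
          && dateModified + expirationMs < now) {
        expired = true;
        return true;
      }
      return false;
    }

    /** Only true for records already marked expired, with deletion enabled. */
    boolean shouldBeDeleted(long now) {
      if (expired && deletionMs > 0) {
        long elapsed = now - (dateModified + expirationMs);
        return elapsed > deletionMs;
      }
      return false;
    }
  }

  /** One check of a single record, in the same order as the loop body. */
  static Outcome refresh(Record r, long now) {
    if (r.shouldBeDeleted(now)) {
      return Outcome.DELETED;
    }
    if (r.checkExpired(now)) {
      return Outcome.OVERRIDDEN;
    }
    return Outcome.KEPT;
  }

  public static void main(String[] args) {
    long min = 60_000L;
    Record withDeletion = new Record(1, 5 * min, 2 * min);
    Record noDeletion = new Record(1, 5 * min, -1);
    for (long t : new long[] {4 * min, 6 * min, 8 * min}) {
      System.out.println(t / min + " min: " + refresh(withDeletion, t)
          + " / " + refresh(noDeletion, t));
    }
  }
}
```

In this model a record with deletion disabled is reported as OVERRIDDEN on every check after it expires, while a record with a positive deletion delay is overridden first and removed on a later check.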