StarRocks BE Crash Troubleshooting
Check for OOM
dmesg -T | grep -i oom  # check whether the kernel OOM killer was triggered
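On a host that runs several services, the plain grep above can return OOM events for other processes as well. A minimal sketch that narrows the kernel log to one process is shown below. The helper name oom_hits_for is hypothetical, and starrocks_be is assumed to be the BE process name as it appears in the kernel log.

```shell
# Sketch: keep only OOM-related kernel log lines that mention a given process.
# oom_hits_for is a hypothetical helper, not a StarRocks or Linux tool.
oom_hits_for() {
    # $1: process name to look for; kernel log lines are read from stdin
    grep -iE 'oom|out of memory' | grep -i -- "$1"
}

# Usage (assumes the BE process is named starrocks_be):
#   dmesg -T | oom_hits_for starrocks_be
```

If the function prints nothing, the kernel log holds no OOM record for that process name, and the crash should be investigated through be.out instead.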
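Causes:
Causes of OOM in 2.x versions
- mem_limit in the BE configuration file (be.conf) is set unreasonably. It should be set to mem_limit = (total machine memory - memory used by other services - 1~2 GB reserved for the system).
For example, if the machine has 40 GB of memory and also runs a MySQL instance that can use up to 4 GB, set mem_limit=34G (40 - 4 - 2).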
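The rule above is simple arithmetic and can be sketched as a small shell function. The helper name suggest_mem_limit is hypothetical, and the 2 GB default reserve is just the upper end of the 1~2 GB range mentioned above.

```shell
# Sketch: derive a mem_limit value (in GB) from the rule above.
# suggest_mem_limit is a hypothetical helper, not a StarRocks tool.
suggest_mem_limit() {
    local total_gb="$1"          # total machine memory, in GB
    local other_gb="$2"          # upper bound used by other services, in GB
    local reserved_gb="${3:-2}"  # system reserve; 1~2 GB per the rule above
    local limit=$(( total_gb - other_gb - reserved_gb ))
    if [ "$limit" -le 0 ]; then
        echo "error: no memory left for the BE" >&2
        return 1
    fi
    echo "mem_limit=${limit}G"
}

suggest_mem_limit 40 4 2   # prints mem_limit=34G
```

The printed line matches the example above (40 - 4 - 2 = 34) and can be compared against the value currently in be.conf.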
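Check system parameters
Start by checking that the system parameters are configured sensibly. The recommended settings are described at https://docs.starrocks.io/zh/docs/deployment/environment_configurations/.
Pay particular attention to the ulimit, overcommit, and swap settings. They can be checked as follows.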
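ulimit check
Look at max processes and max open files; both must be >= 65535.
ulimit -a  # show the limits of the current shell session
cat /proc/$be_pid/limits  # show the limits applied to the running BE process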
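The limits file can also be checked automatically rather than read by eye. The sketch below parses a file in the /proc/<pid>/limits layout and flags soft limits below 65535. The helper name check_limits is hypothetical, and the pgrep pattern in the usage comment assumes the BE binary is named starrocks_be.

```shell
# Sketch: verify "Max processes" and "Max open files" in a /proc/<pid>/limits file.
# check_limits is a hypothetical helper; it exits non-zero if any soft limit is too low.
check_limits() {
    # $1: a limits file such as /proc/<be_pid>/limits
    awk -v min=65535 '
        function check(name, soft) {
            if (soft != "unlimited" && soft + 0 < min) {
                printf "FAIL %s: soft limit %s < %d\n", name, soft, min
                bad = 1
            } else {
                printf "OK   %s: soft limit %s\n", name, soft
            }
        }
        # The soft limit is the first numeric column after the limit name.
        /^Max processes/  { check("max processes", $3) }
        /^Max open files/ { check("max open files", $4) }
        END { exit bad + 0 }
    ' "$1"
}

# Usage (assumes the BE process can be found under the name starrocks_be):
#   check_limits "/proc/$(pgrep -o -f starrocks_be)/limits"
```

Checking the process file rather than ulimit -a matters because a BE started by a service manager or a different user may not inherit the limits of the interactive shell.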
overcommit check
The value below should be 1.
cat /proc/sys/vm/overcommit_memory
swap check
The value below should be 0, so that swapping is effectively disabled.
cat /proc/sys/vm/swappiness
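The overcommit and swap checks can be combined into one small helper that compares each kernel value with the expected one. This is a sketch only; check_kernel_value is a hypothetical name, and the expected values are the ones stated above.

```shell
# Sketch: compare the content of a kernel parameter file with an expected value.
# check_kernel_value is a hypothetical helper, not part of StarRocks.
check_kernel_value() {
    local file="$1" expected="$2" actual
    actual=$(cat "$file") || return 2   # file missing or unreadable
    if [ "$actual" = "$expected" ]; then
        echo "OK   $file = $actual"
    else
        echo "FAIL $file = $actual (expected $expected)"
        return 1
    fi
}

# Usage, with the expected values from the checks above:
#   check_kernel_value /proc/sys/vm/overcommit_memory 1
#   check_kernel_value /proc/sys/vm/swappiness 0
```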
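Check the BE logs
If the parameters above are configured correctly and the BE still crashes, look at be.out: at present, every crash prints its stack trace there.
First, open be.out: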
# less be.out
query_id:0862041d-07bd-11f0-9214-005056853513, fragment_instance:0862041d-07bd-11f0-9214-005056853518
..............
*** Aborted at 1742716891 (unix time) try "date -d @1742716891" if you are using GNU date ***
PC: @ 0x527d26b starrocks::SegmentIterator::_finish_late_materialization()
*** SIGSEGV (@0x0) received by PID 22176 (TID 0x7f06987b1700) from PID 0; stack trace: ***
    @ 0x688b642 google::(anonymous namespace)::FailureSignalHandler()
    @ 0x7f089e584630 (unknown)
    @ 0x527d26b starrocks::SegmentIterator::_finish_late_materialization()
    @ 0x5288648 starrocks::SegmentIterator::_do_get_next()
    @ 0x528aa30 starrocks::SegmentIterator::do_get_next()
    @ 0x530e573 starrocks::ProjectionIterator::do_get_next()
    @ 0x5994675 starrocks::SegmentIteratorWrapper::do_get_next()
    @ 0x57c62d3 starrocks::TimedChunkIterator::do_get_next()
    @ 0x5341706 starrocks::TabletReader::do_get_next()
    @ 0x3b0271b starrocks::pipeline::OlapChunkSource::_read_chunk_from_storage()
    @ 0x3b02e42 starrocks::pipeline::OlapChunkSource::_read_chunk()
    @ 0x3afba17 starrocks::pipeline::ChunkSource::buffer_next_batch_chunks_blocking()
    @ 0x37c0c38 _ZZN9starrocks8pipeline12ScanOperator18_trigger_next_scanEPNS_12RuntimeStateEiENKUlvE_clEv
    @ 0x38d4c91 starrocks::workgroup::ScanExecutor::worker_thread()
    @ 0x2ed30ec starrocks::ThreadPool::dispatch_thread()
    @ 0x2ecc7ba starrocks::Thread::supervise_thread()
    @ 0x7f089e57cea5 start_thread
    @ 0x7f089d97d9fd __clone
    @ 0x0 (unknown)
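- First, search the "Common Crash / BUG Stack Lookup" page for a keyword from the stack (in the trace above, the keyword is _finish_late_materialization) to determine whether this is a known issue;
- Then use the query_id to find the corresponding SQL in the FE audit log.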
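Both steps start from values that are already in be.out, so they can be pulled out with standard text tools. The sketch below assumes the be.out layout shown above (a query_id:<uuid> prefix and a "PC: @ 0x..." line). The function names are hypothetical, and the audit log path in the usage comment is an assumption that should be adjusted to the actual FE log directory.

```shell
# Sketch: extract the search keys for a crash from a be.out-style file.
# Both helpers are hypothetical and rely on the log layout shown above.

# Print each distinct query_id found in the file.
extract_query_ids() {
    grep -o 'query_id:[0-9a-f-]*' "$1" | cut -d: -f2 | sort -u
}

# Print the function name on the "PC: @ 0x... <function>" line,
# which is the keyword to search for among known crash stacks.
extract_crash_function() {
    sed -n 's/^PC: @ *0x[0-9a-f]* *//p' "$1"
}

# Usage (the fe.audit.log path is an assumption; adjust to your FE log directory):
#   extract_crash_function be.out
#   for qid in $(extract_query_ids be.out); do
#       grep -F "$qid" fe/log/fe.audit.log
#   done
```

Using grep -F treats the query_id as a fixed string, so the lookup does not depend on the exact field layout of the audit log.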
Reference: https://forum.mirrorship.cn/t/topic/4930