PostgreSQL Huge Page 使用建議 - 大內存主機、實例注意

標簽

PostgreSQL , Linux , huge page , shared buffer , page table , 虛擬地址 , 物理地址 , 內存地址轉換表

背景

當內存很大時，除了刷臟頁的調度可能需要優化，還有一方面是虛擬內存與物理內存映射表相關的部分需要優化。

1 臟頁調度優化

1、主要包括，調整后臺進程刷臟頁的閾值、喚醒間隔、以及老化閾值。（臟頁大于多少時開始刷、多久探測一次有多少臟頁、刷時多老的臟頁刷出。）。

vm.dirty_background_bytes = 4096000000    
vm.dirty_background_ratio = 0    
vm.dirty_expire_centisecs = 6000    
vm.dirty_writeback_centisecs = 100

2、用戶進程刷臟頁調度，當臟頁大于多少時，用戶如果要申請內存，需要協助刷臟頁。

vm.dirty_bytes = 0    
vm.dirty_ratio = 80

《DBA不可不知的操作系統內核參數》

2 內存表映射優化

這部分主要是因為虛擬內存管理，Linux需要維護虛擬內存地址與物理內存的映射關系，為了提升轉換性能，最好這部分能夠cache在cpu的cache里面。頁越大，映射表就越小。使用huge page可以減少頁表大小。

默認頁大小可以這樣獲取，

# getconf PAGESIZE    
4096

https://en.wikipedia.org/wiki/Page_table

另一個使用HUGE PAGE的原因，HUGE PAGE是常駐內存的，不會被交換出去，這也是重度依賴內存的應用（包括數據庫）非常喜歡的。

In a virtual memory system, the tables store the mappings between virtual addresses and physical addresses. When the system needs to access a virtual memory location, it uses the page tables to translate the virtual address to a physical address. Using huge pages means that the system needs to load fewer such mappings into the Translation Lookaside Buffer (TLB), which is the cache of page tables on a CPU that speeds up the translation of virtual addresses to physical addresses. Enabling the HugePages feature allows the kernel to use hugetlb entries in the TLB that point to huge pages. The hugetbl entries mean that the TLB entries can cover a larger address space, requiring many fewer entries to map the SGA, and releasing entries that can map other portions of the address space.

With HugePages enabled, the system uses fewer page tables, reducing the overhead for maintaining and accessing them. Huges pages remain pinned in memory and are not replaced, so the kernel swap daemon has no work to do in managing them, and the kernel does not need to perform page table lookups for them. The smaller number of pages reduces the overhead involved in performing memory operations, and also reduces the likelihood of a bottleneck when accessing page tables.

PostgreSQL HugePage使用建議

1、查看Linux huage page頁大小

# grep Hugepage /proc/meminfo     
Hugepagesize:       2048 kB

2、準備設置多大的shared buffer參數，假設我們的內存有512GB，想設置128GB的SHARED BUFFER。

vi postgresql.conf    shared_buffers='128GB'

3、計算需要多少huge page

128GB/2MB=65535

4、設置Linux huge page頁數

sysctl -w vm.nr_hugepages=67537

5、設置使用huge page。

vi $PGDATA/postgresql.conf    huge_pages = on                 # on, off, or try    # 設置為try的話，會先嘗試huge page，如果啟動時無法鎖定給定數目的大頁，則不會使用huge page

6、啟動數據庫

pg_ctl start

7、查看當前使用了多少huge page

cat /proc/meminfo |grep -i huge    AnonHugePages:      6144 kB    
HugePages_Total:   67537  ## 設置的HUGE PAGE    
HugePages_Free:    66117  ## 這個是當前剩余的，但是實際上真正可用的并沒有這么多，因為被PG鎖定了65708個大頁    
HugePages_Rsvd:    65708  ## 啟動PG時申請的HUGE PAGE    
HugePages_Surp:        0    
Hugepagesize:       2048 kB   ## 當前大頁2M

8、執行一些查詢，可以看到Free會變小。被PG使用掉了。

cat /proc/meminfo |grep -i huge    
AnonHugePages:      6144 kB    
HugePages_Total:   67537    
HugePages_Free:    57482    
HugePages_Rsvd:    57073    
HugePages_Surp:        0    
Hugepagesize:       2048 kB

Oracle HugePage使用建議

Oracle也是重度內存使用應用，當SGA配置較大時，同樣建議設置HUGEPAGE。

Oracle 建議當SGA大于或等于8GB時，使用huge page。

10.1 About HugePages

The HugePages feature enables the Linux kernel to manage large pages of memory in addition to the standard 4KB (on x86 and x86_64) or 16KB (on IA64) page size. If you have a system with more than 16GB of memory running Oracle databases with a total System Global Area (SGA) larger than 8GB, you should enable the HugePages feature to improve database performance.

Note
The Automatic Memory Management (AMM) and HugePages features are not compatible in Oracle Database 11g and later. You must disable AMM to be able to use HugePages.

The memory allocated to huge pages is pinned to primary storage, and is never paged nor swapped to secondary storage. You reserve memory for huge pages during system startup, and this memory remains allocated until you change the configuration.

Huge pages are 4MB in size on x86, 2MB on x86_64, and 256MB on IA64.

https://docs.oracle.com/cd/E11882_01/server.112/e10839/appi_vlm.htm#UNXAR394

https://docs.oracle.com/cd/E37670_01/E37355/html/ol_about_hugepages.html

測試對比是否使用HugePage

設計test case

創建10240個表，使用merge insert寫入200億數據。

1、建表

do language plpgsql $$    
declare    
begin    execute 'drop table if exists test';    execute 'create table test(id int8 primary key, info text, crt_time timestamp)';    for i in 0..10239 loop    execute format('drop table if exists test%s', i);    execute format('create table test%s (like test including all)', i);    end loop;    
end;    
$$;

2、創建動態寫入函數，第一種不使用綁定變量

create or replace function dyn_pre(int8) returns void as $$    
declare    suffix int8 := mod($1,10240);    
begin    execute format('insert into test%s values(%s, md5(random()::text), now()) on conflict(id) do update set info=excluded.info,crt_time=excluded.crt_time', suffix, $1);    
end;    
$$ language plpgsql strict;

3、使用綁定變量，性能更好。

create or replace function dyn_pre(int8) returns void as $$    
declare    suffix int8 := mod($1,10240);    
begin    execute format('execute p%s(%s)', suffix, $1);    exception when others then    execute format('prepare p%s(int8) as insert into test%s values($1, md5(random()::text), now()) on conflict(id) do update set info=excluded.info,crt_time=excluded.crt_time', suffix, suffix);    execute format('execute p%s(%s)', suffix, $1);    
end;    
$$ language plpgsql strict;

4、創建壓測腳本

vi test.sql    \set id random(1,20000000000)    
select dyn_pre(:id);

5、寫入性能壓測

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 56 -j 56 -T 1200000

6、多長連接壓測，PAGE TABLE觀察

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 950 -j 950 -T 1200000    pgbench -M prepared -n -r -P 1 -f ./test.sql -c 950 -j 950 -T 1200000

1 使用HugePage

1、小量連接寫入性能

transaction type: ./test.sql    
scaling factor: 1    
query mode: prepared    
number of clients: 56    
number of threads: 56    
duration: 120 s    
number of transactions actually processed: 17122345    
latency average = 0.392 ms    
latency stddev = 0.251 ms    
tps = 142657.055512 (including connections establishing)    
tps = 142687.784245 (excluding connections establishing)    
script statistics:    - statement latencies in milliseconds:    0.002  \set id random(1,20000000000)    0.390  select dyn_pre(:id);

2、1900個長連接，PAGE TABLE大小（由于是虛擬、物理內存映射關系表。所以耗費取決于連接數，以及每個連接相關聯的SHARED BUFFER以及會話自己的relcache, SYSCACHE）

cat /proc/meminfo |grep -i table    
Unevictable:           0 kB    
PageTables:       578612 kB  ## shared buffer使用了huge page，這塊省了很多。    
NFS_Unstable:          0 kB

2 未使用HugePage

sysctl -w vm.nr_hugepages=0

1、小量連接的寫入性能

transaction type: ./test.sql    
scaling factor: 1    
query mode: prepared    
number of clients: 56    
number of threads: 56    
duration: 120 s    
number of transactions actually processed: 18484181    
latency average = 0.364 ms    
latency stddev = 0.212 ms    
tps = 153887.936028 (including connections establishing)    
tps = 153905.968799 (excluding connections establishing)    
script statistics:    - statement latencies in milliseconds:    0.002  \set id random(1,20000000000)    0.362  select dyn_pre(:id);

小量連接未使用HUGE PAGE性能比使用huge page更好，猜測可能是huge page使用了類似兩級轉換(因為2MB為單個目標的映射，并不能精準定位到默認8K的數據頁的物理內存位置。可能和數據庫的索引bitmap scan道理類似，bitmap scan告訴你數據在哪個PAGE內，而不是直接告訴你數據在哪個PAGE的第幾條記錄上。)，導致了一定的損耗。

cat /proc/meminfo |grep -i table    
Unevictable:           0 kB    
PageTables:     10956556 kB  ## 不一會就增長到了10GB，因為每個連接都在TOUCH shared buffer內的數據，可能導致映射表很大。連接越多。TOUCH shared buffer內數據越多越明顯    
# PageTables 還在不斷增長    NFS_Unstable:          0 kB

CentOS 7u 配置大頁例子

1、修改/boot/grub2/grub.cfg

定位到第一個menuentry 'CentOS Linux'，在linux16 /vmlinuz最后面添加如下：

說明（關閉透明大頁，使用默認的2MB大頁，你也可以選擇用1G的大頁，但是在此之前應該先到系統中判斷支持哪些大頁規格. 查看/proc/cpuinfo里面的FLAG?Valid pages sizes on x86-64 are 2M (when the CPU supports "pse") and 1G (when the CPU supports the "pdpe1gb" cpuinfo flag).?，設置啟動時創建1536個大頁（這部分內存會被保留，所以一定要注意設置合適的大小，建議在LINUX啟動后通過sysctl來設置）。）

numa=off transparent_hugepage=never default_hugepagesz=2M hugepagesz=2M hugepages=1536

transparent_hugepage=never表示關閉透明大頁，以免不必要的麻煩。透明大頁這個特性應該還不太成熟。

hugepagesz 表示頁面大小，2M和1G選其一，默認為2M。

hugepages 表示大頁面數

總共大頁面內存量為hugepagesz * hugepages，這里為3G

例子：

menuentry 'CentOS Linux (3.10.0-693.5.2.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-693.el7.x86_64-advanced-d8179b22-8b44-4552-bf2a-04bae2a5f5dd' {  load_video  set gfxpayload=keep  insmod gzio  insmod part_msdos  insmod xfs  set root='hd0,msdos1'  if [ x$feature_platform_search_hint = xy ]; then  search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 --hint='hd0,msdos1'  34f87a8d-8b73-4f80-b0ff-8d49b17975ca  else  search --no-floppy --fs-uuid --set=root 34f87a8d-8b73-4f80-b0ff-8d49b17975ca  fi  linux16 /vmlinuz-3.10.0-693.5.2.el7.x86_64 root=/dev/mapper/centos-root ro rd.lvm.lv=centos/root rhgb quiet LANG=en_US.UTF-8 numa=off transparent_hugepage=never default_hugepagesz=2M hugepagesz=2M hugepages=1536  initrd16 /initramfs-3.10.0-693.5.2.el7.x86_64.img  
}

重啟系統（如果你不想重啟系統而使用HUGE PAGE，使用這種方法即可sysctl -w vm.nr_hugepages=1536?）

但是修改默認的大頁規格(2M or 1G)則一定要重啟，例如：

numa=off transparent_hugepage=never default_hugepagesz=1G hugepagesz=2M hugepagesz=1G  重啟后就會變這樣  cat /proc/meminfo |grep -i huge  
AnonHugePages:         0 kB  
HugePages_Total:       0  
HugePages_Free:        0  
HugePages_Rsvd:        0  
HugePages_Surp:        0  
Hugepagesize:    1048576 kB

申請132GB大頁

sysctl -w vm.nr_hugepages=132  
vm.nr_hugepages = 132

重啟后可以使用grep Huge /proc/meminfo查看配置情況。看到下面的數據表示已經生效

HugePages_Total:    1536  
HugePages_Free:     1499  
HugePages_Rsvd:     1024  
HugePages_Surp:        0  
Hugepagesize:       2048 kB

數據庫配置（如果你想好了非大頁不可，就設置huge_pages為on，否則設置為try。on的情況下如果HUGE PAGE不夠，則啟動會報錯。TRY的話，大頁不夠就嘗試申請普通頁的方式啟動。）

postgresql.conf  huge_pages = on  
shared_buffers = 2GB  # 使用2G內存，這個值需要小于總共大頁面內存量

注意

如果postgresql.conf配置huge_pages=on時，且shared_buffers值等于huge_page總內存量（hugepagesz*hugepages）時，數據庫無法啟動，報如下錯誤：

This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory or swap space.

解決辦法shared_buffers值要小于huge_page總內存量

libhugetlbfs

安裝libhugetlbfs可以觀察大頁的統計信息，分配大頁文件系統，例如你想把數據放到內存中持久化測試。

https://lwn.net/Articles/376606/

yum install -y libhugetlbfs*

小結

1、查看、修改Linux目前支持的大頁大小。

https://unix.stackexchange.com/questions/270949/how-do-you-change-the-hugepagesize

2、如果連接數較少時，使用HUGE PAGE性能不如不使用（猜測可能是huge page使用了類似兩級轉換，導致了一定的損耗。）。因此我們可以盡量使用連接池，減少連接數，提升性能。

3、能使用連接池的話，盡量使用連接池，減少連接到數據庫的連接數。因為PG與Oracle一樣是進程模型，連接越多則進程越多，大內存需要注意一些問題：

3.1 上下文切換，MEM COPY的開銷。

3.2 PAGE TABLE增大，內存使用增加。

PageTables: Amount of memory dedicated to the lowest level of page tables. This can increase to a high value if a lot of processes are attached to the same shared memory segment.

3.3 每個會話要緩存syscache, relcache等信息，如果訪問的對象很多，會導致內存使用爆增。(這個屬于邏輯層面內存使用放大, 如果訪問對象不多或者訪問過好多對象的長連接不多的話，問題不明顯)

《PostgreSQL relcache在長連接應用中的內存霸占"坑"》

這個很容易模擬，使用本例的壓測CASE，增加表的數目，增加表的字段數，每個連接的relcache就會增加。

4、如果不能使用連接池，連接數非常多，并且都是長連接(訪問較多的對象、shared buffer中的數據時)。那么當shared buffer非常大時，需要考慮使用huge page。這樣page tables會比較小。如果無法使用HugePage，那么建議設置較小的shared_buffer。

5、進程自己的內存使用,PAGE TABLE不會有放大效果，因為只是自己使用。所以work_mem, maintenance_work_mem的使用，不大會引起PAGE TABLE過大的問題。

通過觀察/proc/meminfo來查看PageTable的占用，判斷是否需要啟用大頁或降低shared_buffer或者連接數。

參考

https://en.wikipedia.org/wiki/Page_table

https://docs.oracle.com/cd/E11882_01/server.112/e10839/appi_vlm.htm#UNXAR394

https://docs.oracle.com/cd/E37670_01/E37355/html/ol_about_hugepages.html

https://unix.stackexchange.com/questions/270949/how-do-you-change-the-hugepagesize

《PostgreSQL on Linux 最佳部署手冊》

《PostgreSQL 10 + PostGIS + Sharding(pg_pathman) + MySQL(fdw外部表) on ECS 部署指南(適合新用戶)》

https://blog.dbi-services.com/configuring-huge-pages-for-your-postgresql-instance-redhatcentos-version/

https://momjian.us/main/writings/pgsql/hw_performance/

https://www.kernel.org/doc/gorman/html/understand/understand006.html

https://wiki.osdev.org/Page_Tables

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Memory-Configuring-huge-pages