Using RDMA with pagepool larger than 8GB
If you receive an RDMA error similar to this:
Wed Apr 20 09:53:38 CEST 2011: runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
Unloading modules from /lib/modules/2.6.18-194.el5/extra
Loading modules from /lib/modules/2.6.18-194.el5/extra
Module Size Used by
mmfs26 1656104 0
mmfslinux 322632 1 mmfs26
tracedev 67020 2 mmfs26,mmfslinux
Wed Apr 20 09:53:39.976 2011: mmfsd initializing. {Version: 3.4.0.4 Built: Feb 15 2011 11:25:39} …
Wed Apr 20 09:53:49.918 2011: OpenSSL library loaded and initialized.
Wed Apr 20 09:53:51.710 2011: VERBS RDMA starting.
Wed Apr 20 09:53:51.716 2011: VERBS RDMA library libibverbs.so (version >= 1.1) loaded and initialized.
Wed Apr 20 09:53:54.268 2011: VERBS RDMA ibv_reg_mr err 12 device mlx4_0 addr 0x4000000000 len 8388608 KB. Try increasing device MTTs.
Wed Apr 20 09:53:56.748 2011: VERBS RDMA ibv_reg_mr err 12 device mlx4_1 addr 0x4000000000 len 8388608 KB. Try increasing device MTTs.
Wed Apr 20 09:53:56.749 2011: VERBS RDMA library libibverbs.so unloaded.
Wed Apr 20 09:53:56.748 2011: VERBS RDMA failed to start.
when starting GPFS, you most likely need to increase the value of the mlx4_core configuration parameter log_mtts_per_seg.
When GPFS starts with Mellanox InfiniBand RDMA (VERBS) enabled, it maps all of the memory defined by pagepool into the RDMA (VERBS) driver. In fact it maps it twice, so it actually maps 2x the memory defined by the pagepool parameter. By default the mlx4 driver can map about 32GiB of memory, which equates to a GPFS pagepool setting of just under 16GiB.
To check the configuration of the mlx4 driver look at
# more /sys/module/mlx4_core/parameters/log_num_mtt
0
# more /sys/module/mlx4_core/parameters/log_mtts_per_seg
3
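The same two values can be read from a script. The following is only a minimal sketch: the helper name read_mlx4_mtt_params is hypothetical, and the directory is passed as a parameter so the function can be pointed somewhere other than the sysfs path used in the commands above.

```python
from pathlib import Path

# sysfs directory used by the commands above
MLX4_PARAM_DIR = "/sys/module/mlx4_core/parameters"


def read_mlx4_mtt_params(param_dir=MLX4_PARAM_DIR):
    """Return log_num_mtt and log_mtts_per_seg as integers.

    Hypothetical helper for illustration; it simply parses the two
    parameter files and does not interpret the values.
    """
    params = {}
    for name in ("log_num_mtt", "log_mtts_per_seg"):
        text = (Path(param_dir) / name).read_text()
        params[name] = int(text.strip())
    return params

# Example use on a node with the mlx4_core module loaded:
#   print(read_mlx4_mtt_params())
```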
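The default value of log_num_mtt is 20, and the default for log_mtts_per_seg is 3. The 0 shown above for log_num_mtt most likely means the parameter was not set explicitly and the driver default is in use.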
log_num_mtt = 20 – This value is used as 2^log_num_mtt, or 2^20 = 1Mi (1,048,576)
log_mtts_per_seg = 3 – This value is used as 2^log_mtts_per_seg, or 2^3 = 8
PAGE_SIZE = 4KiB
So with this configuration (1Mi * 8 * 4KiB = 32GiB), 32GiB is the maximum memory that can be registered with InfiniBand, based on the MTT resources configured for mlx4_core. Since GPFS registers twice the value of pagepool and some MTT space is used elsewhere, the maximum pagepool you can use with the default settings is somewhere just below 16GiB.
The formula to compute the maximum value of pagepool when using RDMA is:
2^log_num_mtt x 2^log_mtts_per_seg x PAGE_SIZE > (2 x pagepool)
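As a rough check of this formula, the arithmetic can be sketched in Python. The function names are made up for illustration, the 4KiB page size is taken from the values above, and the extra MTT space used elsewhere is not modelled, so the real limit will be slightly lower than what this reports.

```python
GIB = 2**30


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Left-hand side of the formula: 2^log_num_mtt x 2^log_mtts_per_seg x PAGE_SIZE."""
    return (2**log_num_mtt) * (2**log_mtts_per_seg) * page_size


def pagepool_fits(pagepool_bytes, log_num_mtt, log_mtts_per_seg, page_size=4096):
    """True if the registerable memory exceeds twice the pagepool.

    Does not account for MTT space consumed by anything other than GPFS.
    """
    limit = max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size)
    return limit > 2 * pagepool_bytes


# Default settings: log_num_mtt=20, log_mtts_per_seg=3
print(max_registerable_bytes(20, 3) // GIB)   # 32
print(pagepool_fits(15 * GIB, 20, 3))         # True
print(pagepool_fits(16 * GIB, 20, 3))         # False
```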
You can increase the maximum amount of memory supported by increasing the value of log_mtts_per_seg. For example, to support a pagepool of 24GiB, increase log_mtts_per_seg to 4.
log_num_mtt = 20
log_mtts_per_seg = 4
This equates to 64GiB as the maximum amount of memory that can be mapped:
2^20 x 2^4 x 4KiB = 64GiB
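To pick a value for a different pagepool size, the same arithmetic can be run in reverse. This is a sketch only: the function name is hypothetical, the upper bound of 7 for log_mtts_per_seg is an assumption that should be checked against your driver version, and MTT space used outside GPFS is ignored.

```python
GIB = 2**30


def min_log_mtts_per_seg(pagepool_bytes, log_num_mtt=20, page_size=4096,
                         max_value=7):
    """Smallest log_mtts_per_seg for which
    2^log_num_mtt x 2^log_mtts_per_seg x PAGE_SIZE > 2 x pagepool.

    max_value is an assumed upper bound, not taken from the driver.
    Returns None if no value up to max_value is large enough.
    """
    for log_mtts_per_seg in range(max_value + 1):
        limit = (2**log_num_mtt) * (2**log_mtts_per_seg) * page_size
        if limit > 2 * pagepool_bytes:
            return log_mtts_per_seg
    return None


print(min_log_mtts_per_seg(24 * GIB))  # 4
```

The result can be lower than the default of 3 for small pagepools; in that case there is no need to change the setting.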
These parameters are set on the mlx4_core module, either in /etc/modprobe.conf or by placing the following line at the end of the /etc/modprobe.d/mlx4_core.conf file, depending on your version of Linux.
options mlx4_core log_num_mtt=20 log_mtts_per_seg=4