### GPFS administration

#### 4.1.0.x client building

* As with Lustre, build on a node running the desired kernel for the RPM packages.
* Start with the GPFS base packages, then apply the version packages (e.g. 4.0 to 4.1).
* /bin/ksh is required.
* net-tools is required on RHEL 7.x.
* A systemd bug can keep the fs from mounting; work around it with:
  systemctl start dev-rcfs.device > /dev/null 2>&1 & /usr/lpp/mmfs/bin/mmmount rcfs

1.) sh gpfs_install-4.1.0-0_x86_64 --text-only
2.) rpm -ivh *.rpm in /usr/lpp/mmfs/4.1
3.) unpack the version RPMs from the update tar, then rpm -Uvh *.rpm
4.) build the source in /usr/lpp/mmfs/src
5.) make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig
5a.) make World
5b.) make InstallImages
5c.) make rpm
6.) yum localinstall gpfs.gplbin-2.6.32-431.23.3.el6.x86_64-4.1.0-6.x86_64.rpm from /root/rpmbuild

(The 4.1 build sequence is sketched as a script below, after the client sections.)

#### 4.2.x client building

* The device is no longer considered a "local" block device.

1.) sh Spectrum_Scale_Standard-4.2.3.7-x86_64-Linux-install --text-only
2.) rpm -ivh *.rpm in /usr/lpp/mmfs/4.2.3.7/gpfs_rpms/ (skip the GPFS GUI, license, and Java RPMs)
3.) build the source in /usr/lpp/mmfs/src
4.) make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig
4a.) make World
4b.) make InstallImages
4c.) make rpm
5.) yum localinstall gpfs.gplbin-2.6.32-642.3.1.el6.x86_64-4.2.3-7.x86_64.rpm

#### Adding a node into cluster

* Ensure root SSH from the GPFS server node to the client works with an empty passphrase.
* PTR records must be enabled!

1.) Create a client text file (e.g. client-all.txt) with one host per line
2.) run mmaddnode -N client-all.txt
3.) run mmchlicense client -N node,node
4.) run mmstartup -N client-all.txt
5.) run mmgetstate -a (verify the clients are in the cluster)
5a.) the fs is mounted automatically and an entry is added to /etc/fstab

(See the scripted version of this procedure below.)

#### Installing RPMs for compute nodes

1.) yum install -y gpfs.base-4.1.0-0 gpfs.docs-4.1.0-0 gpfs.ext-4.1.0-0 gpfs.gpl-4.1.0 gpfs.gskit-8.0.50-16 gpfs.msg.en_US-4.1.0-0
2.) yum upgrade -y gpfs.base gpfs.docs gpfs.ext gpfs.gpl gpfs.gskit gpfs.msg.en_US
3.) yum install -y gpfs.gplbin-$(uname -r)-4.1.0-6

#### Issues readding clients into cluster

* Place a copy of /var/mmfs/gen/mmsdrfs from a production node on the client (sketch below).
* The node may need to be deleted first and then re-created.
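A minimal sketch tying the 4.1.x build steps together as one script, assuming the paths and versions shown above. The gplbin RPM filename tracks the build node's kernel, so the exact filename and the /root/rpmbuild output layout are assumptions to verify.

```sh
#!/bin/sh
# Sketch of the 4.1.x client build above; run on a node booted into the
# kernel the gplbin RPM should target. Versions/paths mirror the steps above.
set -e

sh gpfs_install-4.1.0-0_x86_64 --text-only     # unpack the base packages
(cd /usr/lpp/mmfs/4.1 && rpm -ivh *.rpm)       # install the base RPMs
# ...unpack the 4.1.0-6 update tar here, then: rpm -Uvh *.rpm

cd /usr/lpp/mmfs/src                           # build the GPL (portability) layer
make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig
make World
make InstallImages
make rpm                                       # output lands under /root/rpmbuild

# install the kernel-matched portability RPM (exact filename varies by kernel)
yum localinstall -y \
    /root/rpmbuild/RPMS/x86_64/gpfs.gplbin-$(uname -r)-4.1.0-6.x86_64.rpm
```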
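The node-add procedure above as a hedged script. client01/client02 are placeholder hostnames, and --accept on mmchlicense (which skips the license prompt) is the only flag added beyond the steps above.

```sh
#!/bin/sh
# Sketch of the node-add procedure above. Assumes passphraseless root SSH
# to the clients and working PTR records; hostnames are placeholders.
set -e
PATH=$PATH:/usr/lpp/mmfs/bin

cat > client-all.txt <<EOF
client01
client02
EOF

mmaddnode  -N client-all.txt                       # add nodes to the cluster
mmchlicense client --accept -N client01,client02   # designate them as clients
mmstartup  -N client-all.txt                       # start GPFS on the new nodes
mmgetstate -a                                      # verify they reach 'active'
```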
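For the re-add issue, a minimal sketch of seeding the client with the cluster configuration file; prodnode and client01 are placeholders.

```sh
# Sketch: seed a client being re-added with the current cluster config.
# 'prodnode' is a placeholder for any healthy production node.
scp prodnode:/var/mmfs/gen/mmsdrfs /var/mmfs/gen/mmsdrfs

# If the stale node is still defined, delete and re-create it first:
#   mmdelnode -N client01
#   mmaddnode -N client01
```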
### Administrative & troubleshooting commands

* mmdiag --waiters = dumps all waiters
* mmdiag --deadlock = filtered list of long waiters
* mmdiag --config = lists all configuration settings
* mmfsadm dump waiters = gets a list of waiters and host IP addresses
* mmfsadm dump mb = get worker threads (use grep)
* mmfsadm dump iohist = verbose dump of I/O history, including potential inodes, and nodes!
* mmfsadm dump tscomm = dump network traffic; focus on pending messages and nodes that haven't responded
* mmfsadm dump threads = obtain stacks and extra info on mmfsd threads
* mmfsadm dump mutex = dump all mutexes (locks)
* mmfsadm dump condvar = dump all condition variables (points to OpenFile)
* mmfsadm dump stripe = dump more information about the file system
* mmtracectl = RTFM, but allows fine-grained traces to reference against GPFS inode numbers
* gpfs.snap [-N node] = capture an information snapshot at a single point in time; useful for troubleshooting. Use -N to run against a specific node.

#### Interpreting GPFS waiters

* Order matters (highest to lowest priority): recovery, local-io, client-io, revoke, others, secondary, remaining.
* recovery: events that could hang a cluster due to unresponsive nodes
* local-io: events that could point to issues within the local disk subsystem (array, multipath)
* client-io: events that could be due to network problems if local-io waiters are low or non-existent
* revoke: events that point to nodes holding resources that other nodes need
* others: uncommon events that point to specific items in the config
* secondary: events that aren't the source of the problem but occur because of a larger issue
* remaining: events with no specific category that need to be examined

(A quick cluster-wide triage sketch appears at the end of this page.)

#### mmpmon usage

* Default mode is interactive.
* -p produces non-human-readable output (useful for scripting)
* -i filename = read commands from filename
* -r n = repeat the commands n times
* -d n = delay in milliseconds after one pass of _all_ commands in the file
* A size range and latency range must be specified for histograms (rhist).
  * rhist requests can cause performance degradation.

Commands:

* fs_io_s = get stats for the entire system
* io_s = get stats for a node/nodes
* reset = reset stats to zero
* nlist s = show the current hostlist
* nlist [add,del] n,n = add/remove hosts from the hostlist
* "." / "*" = add the current node / all local nodes in the cluster (with *, the IB address must be provided on client nodes)

(A scripted mmpmon example appears at the end of this page.)

### Tuning

* Do not use connected mode on IPoIB interfaces.
* Set the ib_ipoib send_queue_size and recv_queue_size to 8192.
* Tune vm.min_free_kbytes to 5% of total memory.

#### Settings to explore

* gpfs (matching Linux sysctl indented underneath)
  * socketMaxListenConnections = number of nodes in cluster + 1
    * linux: net.core.somaxconn = 128 by default; must be >= socketMaxListenConnections
  * idleSocketTimeout = 3600 (change to 0)
  * failureDetectionTime = 35 seconds; should be increased to 60-120 seconds (default is 1.4 renewals per second); GPFS must be down on all nodes to change this value
    * linux: net.core.netdev_max_backlog = 250000
  * usePersistentReserve

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20%28HPC%29%20Central/page/Linux%20System%20Tuning%20Recommendations
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Best%20Practices%20Network%20Tuning

#### GPFS HPC parameter tuning

| GPFS parameter | Exact or minimum value recommended | Maximum value recommended |
|---|---|---|
| failureDetectionTime | 60 | |
| idleSocketTimeout | 0 | |
| leaseRecoveryWait | 60 | |
| maxMissedPingTimeout | 35 | 120 |
| maxReceiverThreads | TOTAL_CPUS_IN_NODE | |
| minMissedPingTimeout | 60 | |
| verbsRdma | enable | |
| verbsRdmaTimeout | 18 | |
| worker1Threads | 128 | |
| tscWorkerPool | 128 | |
| socketMaxListenConnections | TOTAL_NODES_IN_CLUSTER | |

Note that TOTAL_CPUS_IN_NODE represents the total number of logical CPUs present in the machine in the cluster with the most logical CPUs.
Note that TOTAL_NODES_IN_CLUSTER represents the number of nodes defined in the GPFS cluster plus the number of nodes that may join the group from remote clusters (multi-cluster).
Note that when setting socketMaxListenConnections, net.core.somaxconn must be set to the same value.

(A sketch applying these values follows below.)
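A minimal sketch applying the tuning table with mmchconfig and sysctl. The value 1024 is a placeholder for TOTAL_NODES_IN_CLUSTER, $(nproc) stands in for TOTAL_CPUS_IN_NODE, and every value just mirrors the table; verify each against your cluster before running.

```sh
#!/bin/sh
# Sketch: apply the HPC tuning table above. 1024 is a placeholder for
# TOTAL_NODES_IN_CLUSTER; $(nproc) stands in for TOTAL_CPUS_IN_NODE.
set -e
PATH=$PATH:/usr/lpp/mmfs/bin

mmshutdown -a    # failureDetectionTime requires GPFS down on all nodes

mmchconfig failureDetectionTime=60,idleSocketTimeout=0,leaseRecoveryWait=60
mmchconfig minMissedPingTimeout=60,maxMissedPingTimeout=120
mmchconfig worker1Threads=128,tscWorkerPool=128
mmchconfig verbsRdma=enable,verbsRdmaTimeout=18
mmchconfig maxReceiverThreads=$(nproc)
mmchconfig socketMaxListenConnections=1024

mmstartup -a

# keep the kernel listen backlog in step with socketMaxListenConnections
sysctl -w net.core.somaxconn=1024
sysctl -w net.core.netdev_max_backlog=250000
```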
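For the waiter-interpretation section above, a quick triage sketch: collect waiters cluster-wide and look at the longest first. mmdsh ships with GPFS but is an internal tool, and the sort field assumes the usual "Waiting N.NNNN sec" line layout; both are assumptions to verify.

```sh
# Sketch: gather waiters from every node and show the longest first.
PATH=$PATH:/usr/lpp/mmfs/bin

mmdsh -N all 'mmdiag --waiters' > /tmp/waiters.txt

# With mmdsh's "host: " prefix, the wait time is the 3rd field on
# 'Waiting N.NNNN sec ...' lines; adjust if the layout differs.
grep ' sec ' /tmp/waiters.txt | sort -k3,3 -rn | head -20

mmdiag --deadlock    # pre-filtered list of long waiters on this node
```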
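And a scripted mmpmon run using the flags from the usage section above; the command file contents and the repeat/delay counts are just examples.

```sh
# Sketch: non-interactive mmpmon run per the usage notes above.
PATH=$PATH:/usr/lpp/mmfs/bin

cat > /tmp/mmpmon.cmd <<EOF
fs_io_s
io_s
EOF

# -p machine-readable output, -i command file,
# -r 5 repeat the file five times, -d 1000 ms delay after each pass
mmpmon -p -i /tmp/mmpmon.cmd -r 5 -d 1000
```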