### GPFS administration

#### 4.1.0.x client building

* As with Lustre, build on a node running the desired kernel for the RPM packages.
* Start with the GPFS base packages, then apply the version packages (e.g. 4.0 to 4.1).
* /bin/ksh is required.
* net-tools is required on RHEL 7.x.
* A systemd bug can keep the fs from mounting; work around it with:
  systemctl start dev-rcfs.device > /dev/null 2>&1 & /usr/lpp/mmfs/bin/mmmount rcfs

1.) sh gpfs_install-4.1.0-0_x86_64 --text-only
2.) rpm -ivh *.rpm in /usr/lpp/mmfs/4.1
3.) unpack the version RPMs from the update tar, then rpm -Uvh *.rpm
4.) build the source in /usr/lpp/mmfs/src
5.) make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig
5a.) make World
5b.) make InstallImages
5c.) make rpm
6.) yum localinstall gpfs.gplbin-2.6.32-431.23.3.el6.x86_64-4.1.0-6.x86_64.rpm from /root/rpmbuild

(The 4.1 build sequence is sketched as a script below, after the client sections.)

#### 4.2.x client building

* The device is no longer considered a "local" block device.

1.) sh Spectrum_Scale_Standard-4.2.3.7-x86_64-Linux-install --text-only
2.) rpm -ivh *.rpm in /usr/lpp/mmfs/4.2.3.7/gpfs_rpms/ (skip the GPFS GUI, license, and Java RPMs)
3.) build the source in /usr/lpp/mmfs/src
4.) make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig
4a.) make World
4b.) make InstallImages
4c.) make rpm
5.) yum localinstall gpfs.gplbin-2.6.32-642.3.1.el6.x86_64-4.2.3-7.x86_64.rpm

#### Adding a node into cluster

* Ensure root SSH from the GPFS server node to the client works with an empty passphrase.
* PTR records must be enabled!

1.) Create a client text file (e.g. client-all.txt) with one host per line
2.) run mmaddnode -N client-all.txt
3.) run mmchlicense client -N node,node
4.) run mmstartup -N client-all.txt
5.) run mmgetstate -a (verify the clients are in the cluster)
5a.) the fs is mounted automatically and an entry is added to /etc/fstab

(See the scripted version of this procedure below.)

#### Installing RPMs for compute nodes

1.) yum install -y gpfs.base-4.1.0-0 gpfs.docs-4.1.0-0 gpfs.ext-4.1.0-0 gpfs.gpl-4.1.0 gpfs.gskit-8.0.50-16 gpfs.msg.en_US-4.1.0-0
2.) yum upgrade -y gpfs.base gpfs.docs gpfs.ext gpfs.gpl gpfs.gskit gpfs.msg.en_US
3.) yum install -y gpfs.gplbin-$(uname -r)-4.1.0-6

#### Issues readding clients into cluster

* Place a copy of /var/mmfs/gen/mmsdrfs from a production node on the client (sketch below).
* The node may need to be deleted first and then re-created.
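A minimal sketch tying the 4.1.x build steps together as one script, assuming the paths and versions shown above. The gplbin RPM filename tracks the build node's kernel, so the exact filename and the /root/rpmbuild output layout are assumptions to verify.

```sh
#!/bin/sh
# Sketch of the 4.1.x client build above; run on a node booted into the
# kernel the gplbin RPM should target. Versions/paths mirror the steps above.
set -e

sh gpfs_install-4.1.0-0_x86_64 --text-only     # unpack the base packages
(cd /usr/lpp/mmfs/4.1 && rpm -ivh *.rpm)       # install the base RPMs
# ...unpack the 4.1.0-6 update tar here, then: rpm -Uvh *.rpm

cd /usr/lpp/mmfs/src                           # build the GPL (portability) layer
make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig
make World
make InstallImages
make rpm                                       # output lands under /root/rpmbuild

# install the kernel-matched portability RPM (exact filename varies by kernel)
yum localinstall -y \
    /root/rpmbuild/RPMS/x86_64/gpfs.gplbin-$(uname -r)-4.1.0-6.x86_64.rpm
```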
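The node-add procedure above as a hedged script. client01/client02 are placeholder hostnames, and --accept on mmchlicense (which skips the license prompt) is the only flag added beyond the steps above.

```sh
#!/bin/sh
# Sketch of the node-add procedure above. Assumes passphraseless root SSH
# to the clients and working PTR records; hostnames are placeholders.
set -e
PATH=$PATH:/usr/lpp/mmfs/bin

cat > client-all.txt <<EOF
client01
client02
EOF

mmaddnode  -N client-all.txt                       # add nodes to the cluster
mmchlicense client --accept -N client01,client02   # designate them as clients
mmstartup  -N client-all.txt                       # start GPFS on the new nodes
mmgetstate -a                                      # verify they reach 'active'
```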
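For the re-add issue, a minimal sketch of seeding the client with the cluster configuration file; prodnode and client01 are placeholders.

```sh
# Sketch: seed a client being re-added with the current cluster config.
# 'prodnode' is a placeholder for any healthy production node.
scp prodnode:/var/mmfs/gen/mmsdrfs /var/mmfs/gen/mmsdrfs

# If the stale node is still defined, delete and re-create it first:
#   mmdelnode -N client01
#   mmaddnode -N client01
```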
### Administrative & troubleshooting commands

* mmdiag --waiters = dumps all waiters
* mmdiag --deadlock = filtered list of long waiters
* mmdiag --config = lists all configuration settings
* mmfsadm dump waiters = gets a list of waiters and host IP addresses
* mmfsadm dump mb = get worker threads (use grep)
* mmfsadm dump iohist = verbose dump of I/O history, including potential inodes, and nodes!
* mmfsadm dump tscomm = dump network traffic; focus on pending messages and nodes that haven't responded
* mmfsadm dump threads = obtain stacks and extra info on mmfsd threads
* mmfsadm dump mutex = dump all mutexes (locks)
* mmfsadm dump condvar = dump all condition variables (points to OpenFile)
* mmfsadm dump stripe = dump more information about the file system
* mmtracectl = RTFM, but allows fine-grained traces to reference against GPFS inode numbers
* gpfs.snap [-N node] = capture an information snapshot at a single point in time; useful for troubleshooting. Use -N to run against a specific node.

#### Interpreting GPFS waiters

* Order matters (highest to lowest priority): recovery, local-io, client-io, revoke, others, secondary, remaining.
* recovery: events that could hang a cluster due to unresponsive nodes
* local-io: events that could point to issues within the local disk subsystem (array, multipath)
* client-io: events that could be due to network problems if local-io waiters are low or non-existent
* revoke: events that point to nodes holding resources that other nodes need
* others: uncommon events that point to specific items in the config
* secondary: events that aren't the source of the problem but occur because of a larger issue
* remaining: events with no specific category that need to be examined

(A quick cluster-wide triage sketch appears at the end of this page.)

#### mmpmon usage

* Default mode is interactive.
* -p produces non-human-readable output (useful for scripting)
* -i filename = read commands from filename
* -r n = repeat the commands n times
* -d n = delay in milliseconds after one pass of _all_ commands in the file
* A size range and latency range must be specified for histograms (rhist).
  * rhist requests can cause performance degradation.

Commands:

* fs_io_s = get stats for the entire system
* io_s = get stats for a node/nodes
* reset = reset stats to zero
* nlist s = show the current hostlist
* nlist [add,del] n,n = add/remove hosts from the hostlist
* "." / "*" = add the current node / all local nodes in the cluster (with *, the IB address must be provided on client nodes)

(A scripted mmpmon example appears at the end of this page.)

### Tuning

* Do not use connected mode on IPoIB interfaces.
* Set the ib_ipoib send_queue_size and recv_queue_size to 8192.
* Tune vm.min_free_kbytes to 5% of total memory.

#### Settings to explore

* gpfs (matching Linux sysctl indented underneath)
  * socketMaxListenConnections = number of nodes in cluster + 1
    * linux: net.core.somaxconn = 128 by default; must be >= socketMaxListenConnections
  * idleSocketTimeout = 3600 (change to 0)
  * failureDetectionTime = 35 seconds; should be increased to 60-120 seconds (default is 1.4 renewals per second); GPFS must be down on all nodes to change this value
    * linux: net.core.netdev_max_backlog = 250000
  * usePersistentReserve

https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20%28HPC%29%20Central/page/Linux%20System%20Tuning%20Recommendations
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Best%20Practices%20Network%20Tuning

#### GPFS HPC parameter tuning

| GPFS parameter | Exact or minimum value recommended | Maximum value recommended |
|---|---|---|
| failureDetectionTime | 60 | |
| idleSocketTimeout | 0 | |
| leaseRecoveryWait | 60 | |
| maxMissedPingTimeout | 35 | 120 |
| maxReceiverThreads | TOTAL_CPUS_IN_NODE | |
| minMissedPingTimeout | 60 | |
| verbsRdma | enable | |
| verbsRdmaTimeout | 18 | |
| worker1Threads | 128 | |
| tscWorkerPool | 128 | |
| socketMaxListenConnections | TOTAL_NODES_IN_CLUSTER | |

Note that TOTAL_CPUS_IN_NODE represents the total number of logical CPUs present in the machine in the cluster with the most logical CPUs.
Note that TOTAL_NODES_IN_CLUSTER represents the number of nodes defined in the GPFS cluster plus the number of nodes that may join the group from remote clusters (multi-cluster).
Note that when setting socketMaxListenConnections, net.core.somaxconn must be set to the same value.

(A sketch applying these values follows below.)
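A minimal sketch applying the tuning table with mmchconfig and sysctl. The value 1024 is a placeholder for TOTAL_NODES_IN_CLUSTER, $(nproc) stands in for TOTAL_CPUS_IN_NODE, and every value just mirrors the table; verify each against your cluster before running.

```sh
#!/bin/sh
# Sketch: apply the HPC tuning table above. 1024 is a placeholder for
# TOTAL_NODES_IN_CLUSTER; $(nproc) stands in for TOTAL_CPUS_IN_NODE.
set -e
PATH=$PATH:/usr/lpp/mmfs/bin

mmshutdown -a    # failureDetectionTime requires GPFS down on all nodes

mmchconfig failureDetectionTime=60,idleSocketTimeout=0,leaseRecoveryWait=60
mmchconfig minMissedPingTimeout=60,maxMissedPingTimeout=120
mmchconfig worker1Threads=128,tscWorkerPool=128
mmchconfig verbsRdma=enable,verbsRdmaTimeout=18
mmchconfig maxReceiverThreads=$(nproc)
mmchconfig socketMaxListenConnections=1024

mmstartup -a

# keep the kernel listen backlog in step with socketMaxListenConnections
sysctl -w net.core.somaxconn=1024
sysctl -w net.core.netdev_max_backlog=250000
```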
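For the waiter-interpretation section above, a quick triage sketch: collect waiters cluster-wide and look at the longest first. mmdsh ships with GPFS but is an internal tool, and the sort field assumes the usual "Waiting N.NNNN sec" line layout; both are assumptions to verify.

```sh
# Sketch: gather waiters from every node and show the longest first.
PATH=$PATH:/usr/lpp/mmfs/bin

mmdsh -N all 'mmdiag --waiters' > /tmp/waiters.txt

# With mmdsh's "host: " prefix, the wait time is the 3rd field on
# 'Waiting N.NNNN sec ...' lines; adjust if the layout differs.
grep ' sec ' /tmp/waiters.txt | sort -k3,3 -rn | head -20

mmdiag --deadlock    # pre-filtered list of long waiters on this node
```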
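And a scripted mmpmon run using the flags from the usage section above; the command file contents and the repeat/delay counts are just examples.

```sh
# Sketch: non-interactive mmpmon run per the usage notes above.
PATH=$PATH:/usr/lpp/mmfs/bin

cat > /tmp/mmpmon.cmd <<EOF
fs_io_s
io_s
EOF

# -p machine-readable output, -i command file,
# -r 5 repeat the file five times, -d 1000 ms delay after each pass
mmpmon -p -i /tmp/mmpmon.cmd -r 5 -d 1000
```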