########################### bgfs stuff ####

# alpha installation notes

10.250.17.240   svc-rcfc-01-ib   svc-rcfc-01.rc.usf.edu   Metadata & management server
10.250.17.241   svc-rcfc-02-ib   svc-rcfc-02.rc.usf.edu   Storage server 1
10.250.17.243   svc-rcfc-03-ib   svc-rcfc-03.rc.usf.edu   Storage server 2

* enable IB support in /etc/beegfs/beegfs-client-autobuild.conf:
  buildArgs=-j8 BEEGFS_OPENTK_IBVERBS=1
* rebuild client: /etc/init.d/beegfs-client rebuild
* ensure libibverbs, librdmacm, libibverbs-utils, & libibcm (RHEL repo) are installed
* all nodes need the same IB cards!

# configuring beegfs first steps (quick overview)

1.) configure mgmtd service: /opt/beegfs/sbin/beegfs-setup-mgmtd -p /hipaa/beegfs/beegfs_mgmtd
    (chkconfig beegfs-mgmtd on)
2.) configure meta service: /opt/beegfs/sbin/beegfs-setup-meta -p /hipaa/beegfs/beegfs_meta -s 1 -m svc
    (chkconfig beegfs-meta on)
3.) configure storage service: /opt/beegfs/sbin/beegfs-setup-storage -p /hipaa -s 3 -i 301 -m svc-rcfc-02-ib
    (chkconfig beegfs-storage on)
4.) configure storage service: /opt/beegfs/sbin/beegfs-setup-storage -p /hipaa -s 4 -i 401 -m svc-rcfc-03-ib
    (chkconfig beegfs-storage on)
5.) configure client: /opt/beegfs/sbin/beegfs-setup-client -m svc-rcfc-01-ib
    (management node; edit mountpoint in /etc/beegfs/beegfs-mounts.conf; chkconfig beegfs-{client,helperd} on)
6.) ensure connInterfacesFile is configured in all appropriate config files (mgmtd, meta, storage, client)
    with the preferred NIC listed; see the sketch after this list
7.) ensure connNetFilterFile is configured in all appropriate config files (mgmtd, meta, storage, client)
    with the allowed NIC address range listed
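A minimal sketch of the two files from steps 6 and 7. The /etc/beegfs paths and the interface name ib0 are assumptions; the 10.250.16.0/21 range is derived from the 10.250.17.x addresses above:

  # /etc/beegfs/connInterfacesFile  (set connInterfacesFile = /etc/beegfs/connInterfacesFile in each conf)
  # one interface name per line, most preferred first
  ib0

  # /etc/beegfs/connNetFilterFile  (set connNetFilterFile = /etc/beegfs/connNetFilterFile in each conf)
  # only addresses within the listed CIDR ranges will be used
  10.250.16.0/21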
# Configure LUKS multipath devices

1.) create vg with pv; vg_bgfs_(m|s)#
2.) create lv on vg; -n bgfs_(m|s)_disk
3.) luksFormat the new lv:
    cryptsetup --verbose --verify-passphrase luksFormat /dev/vg_bgfs_(s|m)#/bgfs_(s|m)#_disk
4.) luksOpen the new lv AND supply an alias:
    cryptsetup luksOpen /dev/vg_bgfs_(s|m)#/bgfs_(s|m)#_disk bgfs_(s|m)#_disk
5.) mkfs on the new _mapped_ device:
    mkfs.xfs -d su=128k,sw=4 -l version=2,su=128k -i size=512 /dev/mapper/bgfs_s#_disk   (storage)
    mkfs.ext4 -i 2048 -I 512 -J size=400 -O dir_index,filetype /dev/mapper/bgfs_m#_disk  (meta)

# Configure LUKS multipath devices, no LVM

1.) cryptsetup --verbose --verify-passphrase luksFormat /dev/mapper/DEVICE
2.) cryptsetup luksOpen /dev/mapper/DEVICE decrypt-DEVICE
3.) mkfs.xfs -d su=128k,sw=4 -l version=2,su=128k -i size=512 /dev/mapper/decrypt-DEVICE
    note: sw is the number of DATA DISKS in the RAID pool; a consolidated sketch follows the old notes below

# LUKS procedure for existing device(s)

1.) cryptsetup luksOpen /path/to/dm/device cryptName
    (/path/to/dm/device is the LV Path from lvdisplay; cryptName is the device in /etc/fstab, e.g. bgfs_s2_disk)
2.) cryptsetup luksSuspend cryptName
3.) cryptsetup luksClose cryptName

# old notes for reference

cryptsetup --verbose --verify-passphrase luksFormat /dev/vg_raid10/r10disk
cryptsetup luksOpen /dev/vg_raid10/r10disk r10disk
    (the name r10disk is the mapping and is required; cannot make an fs on the "raw" mapper path once the device is LUKS-formatted)
cryptsetup luksSuspend | luksResume
cryptsetup luksClose
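A consolidated end-to-end sketch of the no-LVM variant. The device name mpatha and a pool of 4 data disks are assumptions; the mount options mirror the meta-server mount line used elsewhere in these notes:

  DEV=mpatha   # hypothetical multipath device name, taken from `multipath -l`
  cryptsetup --verbose --verify-passphrase luksFormat /dev/mapper/$DEV
  cryptsetup luksOpen /dev/mapper/$DEV decrypt-$DEV
  # sw = number of DATA disks in the RAID pool (4 assumed here)
  mkfs.xfs -d su=128k,sw=4 -l version=2,su=128k -i size=512 /dev/mapper/decrypt-$DEV
  mount -o noatime,nodiratime /dev/mapper/decrypt-$DEV /hipaa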
# beegfs mgmtd server install

# beegfs metadata server install

* tune2fs -o user_xattr /dev/mapper/bgfs_m1_disk
* mount -t ext4 -o noatime,nodiratime,nobarrier /dev/mapper/bgfs_m1_disk /hipaa
* for i in {b..e}; do echo deadline > /sys/block/sd$i/queue/scheduler; done
* for i in {b..e}; do echo 128 > /sys/block/sd$i/queue/nr_requests; done
* yum install beegfs-meta beegfs-client beegfs-utils

# beegfs storage server install

* for DEV in $(multipath -l | grep -Eo 'sd[a-z]+[0-9]*'); do echo deadline > /sys/block/$DEV/queue/scheduler; done
* for DEV in $(multipath -l | grep -Eo 'sd[a-z]+[0-9]*'); do echo 4096 > /sys/block/$DEV/queue/nr_requests; done
* for DEV in $(multipath -l | grep -Eo 'sd[a-z]+[0-9]*'); do echo 4096 > /sys/block/$DEV/queue/read_ahead_kb; done
* yum install beegfs-storage beegfs-client beegfs-utils

# beegfs client install

* assuming the IB device is up and running
1.) yum install beegfs-{client,helperd,utils}
2.) yum install kernel-devel-$(uname -r)
3.) /etc/init.d/beegfs-client rebuild
4.) /opt/beegfs/sbin/beegfs-setup-client -m svc-rcfc-01-ib
5.) copy config files to /etc/beegfs/
6.) service beegfs-client start

### beegfs client utilities

* beegfs-net ; lists connections from the local node to beegfs servers (obtain numeric ids of servers)
* beegfs-check-servers ; wrapper script run from the local node to check connectivity to beegfs servers;
  can be used with -p (check against mount point) or -c (check against config file)
* beegfs-ctl ; general-purpose utility to list nodes (clients/servers, --listnodes), list targets (meta/storage, --listtargets),
  remove nodes/targets (--removenode, --removetarget), get striping data (--getentryinfo), set striping (--setpattern *),
  find files on a storage target (--find), refresh file system metadata (--refreshentryinfo),
  obtain I/O details for users, clients, storage (--userstats, --clientstats, --serverstats),
  and configure and list quotas (--setquota * --getquota); all modes take an additional --help for more details
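A few example invocations of the utilities above (a sketch; the /hipaa paths and the stripe values are placeholders, not values from these notes):

  beegfs-check-servers -p /hipaa                   # reachability of mgmtd/meta/storage via the mount point
  beegfs-ctl --listnodes --nodetype=storage        # numeric ids of the storage nodes
  beegfs-ctl --listtargets --nodetype=storage      # storage targets and the nodes they live on
  beegfs-ctl --getentryinfo /hipaa/some/file       # stripe pattern, chunk size, and targets of one entry
  beegfs-ctl --setpattern --chunksize=1m --numtargets=4 /hipaa/some/dir   # striping for NEW files in dir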
### tuning options

** general client (beegfs-client.conf)

tuneCoherentBuffers = 0|1 ; defaults to true; applications (git, in-memory db's) may misbehave if their view of an mmap'ed file isn't consistent
tuneFileCacheBufSize = n bytes ; defaults to 524288 (512 KiB); larger values require less communication between servers; use with striping
tuneFileCacheType = none|buffered ; defaults to buffered; leave as-is for general-purpose use
tuneUseGlobalAppendLocks = 0|1 ; defaults to false; controls whether append locks are local to the node (false) or global on the servers (true)
tuneUseGlobalFileLocks = 0|1 ; defaults to false; controls whether advisory file locks (flock, fcntl) are checked on the local node only (false) or globally (true)
tunePreferredMetaFile = file_path ; defaults to none; path to a text file containing numeric ids of preferred meta targets (1 target per line)
tunePreferredStorageFile = file_path ; defaults to none; path to a text file containing numeric ids of preferred storage targets (1 target per line)
tuneRemoteFSync = 0|1 ; disabling can significantly improve performance since only the cache is used vs. "direct" writes to disk(s)
connMaxInternodeNum = n ; number of concurrent connections to storage nodes; default is 12; a common choice is to match nproc; set higher for login nodes

** general meta (beegfs-meta.conf)

tuneBindToNumaZone = n ; bind operations to a specific NUMA zone (FC card, etc.)
tuneNumStreamListeners = n ; defaults to 1; number of threads waiting for incoming data events; connections are handed over to worker threads for actual processing
tuneNumWorkers = n ; defaults to 0 (nproc * 2); number of worker threads; larger values allow servers to handle more requests in parallel; small clusters (100-200 nodes) should use tuneNumWorkers=64, larger ones 128
tuneTargetChooser = algo ; defaults to randomized; algorithm used to select storage targets for file creation; the default honors clients' preferred nodes/targets
tuneUseAggressiveStreamPoll = 0|1 ; defaults to false; if true, actively polls, which reduces latency at the cost of higher CPU usage
tuneUsePerUserMsgQueues = 0|1 ; defaults to false; per-user FIFO for I/O threads; if true, may improve fairness in multi-user environments (login-style nodes)

** general storage (beegfs-storage.conf)

tuneBindToNumaZone = n ; bind operations to a specific NUMA zone (FC card, etc.)
tuneFileReadAheadSize = ... ; defaults to 0m; _research_
tuneFileReadAheadTriggerSize = ... ; defaults to 4m; _research_
tuneFileReadSize = n bytes ; defaults to 128k; maximum amount of data the server reads from the underlying FS in a single operation; no effect if the value > file chunk size or tuneWorkerBufSize
tuneFileWriteSize = n bytes ; defaults to 128k; maximum amount of data the server writes to the underlying FS in a single operation; no effect if the value > file chunk size or tuneWorkerBufSize
tuneFileWriteSyncSize = ... ; defaults to 0m; _research_
tuneNumResyncGatherSlaves = n ; defaults to 6; number of threads used to gather FS information for a buddy mirror resync
tuneNumResyncSlaves = n ; defaults to 12; number of threads used to perform the actual file and directory syncs per buddy resync
tuneNumStreamListeners = n ; defaults to 1; number of threads waiting for incoming data events; connections are handed over to worker threads for actual processing
tuneNumWorkers = n ; defaults to 0 (nproc * 2); number of worker threads; larger values allow servers to handle more requests in parallel; small clusters (100-200 nodes) should use tuneNumWorkers=64, larger ones 128
tuneUseAggressiveStreamPoll = 0|1 ; defaults to false; if true, actively polls, which reduces latency at the cost of higher CPU usage
tuneUsePerUserMsgQueues = 0|1 ; defaults to false; per-user FIFO for I/O threads; if true, may improve fairness in multi-user environments (login-style nodes)
tuneUsePerTargetWorkers = 0|1 ; defaults to true; improves balance of I/O (thread count is tuneNumWorkers x number of attached targets)
tuneWorkerBufSize = n{k,m} ; buffer size, allocated twice by each worker thread, for I/O and network buffering; for optimal performance set at least 1 MB larger than tuneFile{Read,Write}Size (128k default)
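A hedged snippet pulling a few of these together; the values are illustrative assumptions for a login-style client and a small (100-200 node) cluster, not settings taken from a running system:

  # beegfs-client.conf (login node)
  connMaxInternodeNum = 24          # above the default of 12; many users share the node
  tuneRemoteFSync     = false       # trade fsync durability for speed

  # beegfs-meta.conf and beegfs-storage.conf (small cluster)
  tuneNumWorkers          = 64
  tuneUsePerUserMsgQueues = true    # fairness when login-style clients dominate

  # beegfs-storage.conf only
  tuneWorkerBufSize = 2m            # >= 1 MB above tuneFile{Read,Write}Size (128k default)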
### Optional notes

* Consider IB tuning based upon Intel/Mellanox guidance
* Extended attributes are _not_ enabled by default; they can be enabled if the underlying FS supports ACLs:
  store{UseExtendedAttribs,ClientXAttrs,ClientACLs} = true (beegfs-meta.conf)
  store{XAttrsEnabled,ACLsEnabled} = true (beegfs-client.conf)
* Metadata targets and nodes can be moved
* Storage targets and nodes can be moved
* Preferred metadata & storage targets are easily configured per client
* chmod 750 /etc/beegfs will function

### Quotas

** quotaQueryType "system" cannot be used unless `getent passwd` returns _ALL_ results
** any change to beegfs-mgmtd.conf requires a restart of beegfs-mgmtd, and then of the beegfs-{meta,storage} services

# Enable quotas

* beegfs storage disks must be mounted with quota support enabled; no quotas are configured on the underlying disks of the beegfs-meta/beegfs-mgmt nodes
* quotaEnabled = true ; must be set within beegfs-client.conf
* beegfs-fsck --enableQuota ; run _after_ beegfs-client.conf is modified
* beegfs-ctl --getquota --uid UID|login ; get quota for a UID/username
* beegfs-ctl --getquota --gid GID|group ; get quota for a GID/group
* beegfs-ctl --getquota --defaultlimits ; get the default quotas for the system
* beegfs-ctl --setquota --uid|--gid ID --sizelimit=n[KMGT] ; set block quota for a user/group
* beegfs-ctl --setquota --uid|--gid ID --inodelimit=n ; set inode quota for a user/group
* beegfs-ctl --setquota --default --sizelimit=n --inodelimit=n ; set default limits for users/groups (check --setquota --help for the exact syntax)

# Quota enforcement

* quotaEnableEnforcement = true ; add in beegfs-storage.conf; enable in beegfs-mgmtd.conf; enable in beegfs-meta.conf
* quotaUpdateIntervalMin = n ; default 10 minutes; period of quota updates; small chance a user could exceed quota during this period
* quotaQueryType = system|range|file ; default system; controls how getent passwd|group calls are performed:
  range allows specified UID/GID ranges to be queried vs. all on the system; helps with performance
  file allows specified UID/GIDs within a file to be queried; suitable for non-sequential IDs
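A short worked example of the flow above (a sketch; the user name alice and the limits are hypothetical, and the exact --setquota units should be checked against --help):

  # beegfs-client.conf (restart beegfs-client afterwards)
  quotaEnabled = true

  # one-time, after the client change
  beegfs-fsck --enableQuota

  # set and verify limits for a hypothetical user
  beegfs-ctl --setquota --uid alice --sizelimit=500g --inodelimit=1000000
  beegfs-ctl --getquota --uid alice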
### Mirroring stuff

* ensure clients aren't mounted during metadata mirroring setup
* restart the metadata service on all metadata nodes after changes
* use manual configuration if you want to define different data domains, e.g. separate racks
* beegfs-ctl --addmirrorgroup --automatic --nodetype=[meta|storage] - automatically configure meta/storage mirror groups
* beegfs-ctl --addmirrorgroup --nodetype=storage --primary=n --secondary=n --groupid=n - configure a storage mirror group on targets n and n, with groupid n
* beegfs-ctl --listmirrorgroups --nodetype=[meta|storage] - display configured mirror groups
* beegfs-ctl --mirrormd - enable metadata mirroring; restart ALL metadata services afterwards

#### Adding a target

* ensure that the FILE SYSTEM is mounted
* ensure that beegfs-storage.conf is identical to the other storage nodes
* ensure that the management host is correct
* /opt/beegfs/sbin/beegfs-setup-storage -p /LUN -s SERVER_ID -i SERVER_ID_TARGET_NUM -m MANAGEMENT_HOST
* restart the beegfs-storage service afterwards

### Avoid unnecessary resyncs of metadata

* watch -n 1 "beegfs-ctl --listtargets --nodetype=meta --mirrorgroups"
* stop the primary
* after 5 seconds or so, restart the secondary
* after 2-3 seconds or so, start the primary

### Misc details

* metadata mirroring requires an even number of nodes/targets; storage targets do not
* mirroring provides fault tolerance provided that the management service isn't down (a separate node is recommended)
* mirroring should be configured on a new install, since existing content _will not_ be mirrored
* upgrading notes with different storage node names (same ids):
  https://groups.google.com/forum/#!searchin/fhgfs-user/trey$20dockendorf%7Csort:date/fhgfs-user/xZc62RYBaYA/r3T1A23mBQAJ
* if using netfilters and specific interfaces, ensure _all_ server nodes are configured the same
* server nodes need beegfs-client.conf & beegfs-nic.conf to utilize the beegfs-ctl command
* metadata mirroring is _very sensitive_; resyncs can cause noticeable latency and can occur even after a mgmtd daemon restart
* if using "legacy" configurations (servers with ib/opa, etc.), do not specify multiple ranges of the same subnet for separate nics, or there will be a failure: 10.250.17.1/21 AND 10.250.56.1/21
* growing a storage pool simply requires adding a new target and then restarting the storage service
* using the bind mount helper requires the script to be executable and the bind mount points to exist
* beegfs-ctl --migrate will only migrate files to separate storage targets; directories must have their patterns changed to reflect the move

### oddities

* seemingly long delays for storage pools to return to a normal state after a mgmtd restart
* during metadata resync, users may experience remote I/O errors due to stripe patterns being unavailable (manifests in meta logs)
* targetIDs and nodeIDs cannot be reused easily; it is easier to create new server and target IDs

### BeeGFS chunk file corruption

* chunk files were created in affected file system directories that mapped to real file names via inode number
** deletion of these chunk files resulted in unrecoverable errors
* solution was to copy/rename the chunk file to match and then remove the "old" file
** script developed below; input created from `find /bgfs_itn18/sc_shares -type f -printf '%i####%p\n'`

#########################################################################################################################################
#!/usr/bin/awk -f
# Input lines: "<inode>####<path>" (from find -printf '%i####%p\n').
# Chunk files end in three hex groups (e.g. 5D-ABCDEF01-2); when two paths
# share an inode, emit shell commands that break the link by re-creating
# the real file on a fresh inode and removing the chunk-style name.
BEGIN { FS = "####" }
{
    inode = $1
    filename = $NF
    if (inode in inodes) {
        if (inodes[inode] ~ /[A-F0-9]+-[A-F0-9]+-[A-F0-9]+$/) {
            # first-seen path was the chunk file: copy the real file aside
            # (new inode), remove the chunk name, move the copy back in place
            printf("cp -p \"%s\" \"%s_old\" ; rm -f \"%s\" ; mv \"%s_old\" \"%s\"\n",
                   filename, filename, inodes[inode], filename, filename)
        } else if (filename ~ /[A-F0-9]+-[A-F0-9]+-[A-F0-9]+$/) {
            # second-seen path is the chunk file: same fix, roles reversed
            printf("cp -p \"%s\" \"%s_old\" ; rm -f \"%s\" ; mv \"%s_old\" \"%s\"\n",
                   inodes[inode], inodes[inode], filename, inodes[inode], inodes[inode])
        } else {
            # neither path looks like a chunk file: break the shared inode by
            # replacing both names with independent copies
            printf("cp -p \"%s\" \"%s_old\" ; cp -p \"%s\" \"%s_old\" ; mv \"%s_old\" \"%s\" ; mv \"%s_old\" \"%s\"\n",
                   filename, filename, inodes[inode], inodes[inode],
                   filename, filename, inodes[inode], inodes[inode])
        }
        x++
    } else {
        inodes[inode] = filename
    }
}
END {
    # uncomment for a summary:
    # if (x) print "Total of", x, "duplicates found"
    # else   print "No duplicated inodes found"
}
#########################################################################################################################################
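A usage sketch for the script above; the file names fix_chunks.awk, dup_input.txt, and fix_commands.sh are assumptions, and the generated commands should be reviewed before running:

  find /bgfs_itn18/sc_shares -type f -printf '%i####%p\n' > dup_input.txt
  awk -f fix_chunks.awk dup_input.txt > fix_commands.sh
  # inspect fix_commands.sh by hand, then:
  sh fix_commands.sh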