#### disable spectre/meltdown mitigations
# echo 0 > /sys/kernel/debug/x86/pti_enabled
# echo 0 > /sys/kernel/debug/x86/retp_enabled
# echo 0 > /sys/kernel/debug/x86/ibrs_enabled

##### systems monitoring tools with quick options
* /proc/net/sockstat = get all quick TCP/UDP socket stats
* /sys/devices/system/node/node* ; numa zones
* /sys/devices/system/node/node*/cpulist ; CPUs assigned to node

#### irqbalance service should be configured to run more than once vs. oneshot (which never rebalances)

## sar and useful arguments
-P ALL = individual cpu stats; use 0..n for a specific cpu
-u ALL = CPU utilization stats, including software and hardware interrupts
-W / -S = swapping stats: pages swapped in/out / swap space utilization
-v = file handles and other kernel tables
-w = procs/s and context switches/s
-d = disk stats (local); use -p for human-readable device names
-b = i/o rates
-B = paging stats
-s / -e = start and end time of report (HH:MM:SS, 24-hour)
* consider using sadc with OPTIONS and an interval to glean data ad hoc
* sadc network options don't need to be configured for "live" use with an interval
* do not forget the interval option for profiling!
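* a minimal ad-hoc collection sketch for the interval points above (the sadc path varies by distro, e.g. /usr/lib/sa/sadc vs. /usr/lib/sysstat/sadc; /tmp/sa.bin is a hypothetical output file):
# collect one sample per second for 30 seconds into a binary file
/usr/lib/sa/sadc 1 30 /tmp/sa.bin
# report per-CPU and per-disk stats from the collected file
sar -P ALL -f /tmp/sa.bin
sar -d -p -f /tmp/sa.bin
# live profiling: always supply an interval (and optional count)
sar -u ALL 1 5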
## mpstat and useful arguments
-I SUM = get all software and hardware interrupts per second for ALL CPUs
         (cat /proc/interrupts for definitions of TLB/s, etc.)
-P n = receive statistics on processor "n"; use ALL for all processors
-u = get all CPU stats
* flags can be combined for more details

## pidstat and useful arguments
-C "pattern" = query pids matching pattern
-d = display disk stats
-h = put all content on one line
-l = display arguments to commands (useful with -C)
-p = display PID data (multiple pids separated by commas)
-r = display memory data
-t = display threads associated with a process
-u = report CPU usage per process
-w = report task switching activity (voluntary and involuntary)
* do not forget the interval option for profiling!
* one liner: pidstat -h -dwlur

## iostat and useful arguments
* 3 types of reports: CPU, device utilization, NFS
* initial values are totals since boot
* r/s, w/s are IOPS values
-k,-m = display report in kilobytes/megabytes vs. blocks
-h = display report (e.g. NFS) in human-readable format
-c = display CPU report
-n = display NFS report (useful for CLIENTS)
-N = display device-mapper names vs. system names
-p dev[,dev] = display specific dev or devs (use full path for mapper devices)
-z = omit output for idle devices
-x = display extended stats: queue times, saturation, etc.
* do not forget the interval option for profiling!
* client one liner: iostat -zmx -hn
* server one liner: iostat -zmx -hN

## vmstat and useful arguments
-D = displays summarized disk activity (total disks, reads, writes, etc.)
-d = displays disk statistics (reads, writes, iops); last two fields are I/O:
     cur = I/O in progress, s = seconds spent doing I/O (totals since boot)
-a = displays active/inactive memory
-w = displays enlarged (wide) fields (default behavior displays compact layout)
-s = displays summary of memory details, cpu ticks, context switches, and interrupts
-m = displays slab info
-S {k,K,m,M} = displays stats in kilobyte/megabyte formats (1000 vs. 1024 units)
* one liner: vmstat -w 1

## prtstat
prtstat PID = displays human-readable statistics from /proc/PID/stat

## dstat (may need to install)
* hybrid tool which offers quick-glance stats vs. specific tools: vmstat, slabtop, mpstat, etc.
* has plugins which can be utilized; check --list
* lower-case options must be present when using their upper-case equivalents, e.g. -c with -C
* good plugins: top-bio (top block io), top-cputime (highest cpu time process), top-latency, top-io
* use --nocolor
dstat [options] [delay [count]] (default displays cpu usage stats, disk totals, net totals, paging, and system interrupts/context switches)
** "nice" options
* CPU options
-c / -C 0,1,total = displays CPU report for all or specific cpus with optional total
* disk options
-d / -D dm-2,dm-3,total = displays disk report (reads and writes) with optional total
--disk-util = displays %utilization of all disks or a specific disk
* net options
-n / -N dev,dev,total = displays network report (bytes received/sent)
--net-packets = displays packets received/sent per polling interval
* interrupts
-i / -I n,dev,total = displays interrupts for interrupt "n" (cat /proc/interrupts) and/or device with optional total
-r / --io = display i/o requests completed
-p / --proc = process stats (vmstat)
-s = swap stats (vmstat)
-y / --sys = system stats (interrupts, context switches)
--{ipc,lock,raw,socket,tcp,udp,unix} = display ipc, lock, raw, socket, tcp, udp, and unix statistics

## slabtop and useful arguments
-d, --delay n = refresh after n seconds
-s, --sort S = sort by criterion S (see sort criteria below)
-o, --once = display once
* sort criteria
a,b,c,l,v,n = active objects, objects per slab, cache size, number of slabs, active slabs, name
o,p,s,u = number of objects, pages per slab, object size, utilization

## fio (testing IOPS vs. dd)
* dd only provides sequential tests, and is single-threaded
* dd only provides write performance, not read
* fio can be run on the console or via config files
** fio must have a name supplied for the test on the cli or it will quietly fail
** fio must have the complete path to the file name to reflect the directory, or it will use the current working directory
** network tests require job files
** primary options
--readwrite= ; read/write (seq), randread/randwrite (random), readwrite (mixed seq rw), randrw (mixed random rw)
--size= ; total io size of the job (see the man page for interaction with --io_size)
--io_size= ; amount of io to perform when --size sets the file size (RTFM)
--bs= ; blocksize
--bsrange=n,n ; range of differing block sizes (use bssplit for finer-grained control)
--bssplit=1k/10:4k/20 ; split bs up over the job with various weights applied to sizes; uses percentages
--ioengine=libaio ; default linux io engine (rtfm for all engines)
--iodepth=n ; number of io units in flight against the file
--rwmixread/--rwmixwrite= ; percentage of a mixed workload that should be reads/writes
--runtime=n ; terminate after n seconds
--thread ; use pthread_create(3) vs. fork(2)
--numjobs=n ; number of jobs to run
fio --name=job_name
** network stuff
# server
[global]
ioengine=net
port=5201
protocol=tcp
interface=131.247.252.98
bs=1M
size=150G
window_size=2M

[receiver]
listen
rw=read
filename=/run/user/0/testout

# client
[sender]
startdelay=1
rw=write
size=10G
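* a minimal command-line sketch of a random-read IOPS test (file path and sizes are arbitrary placeholders; --time_based and --group_reporting are standard fio flags not covered above):
# 4k random reads, 4 jobs, queue depth 32, time-based 60s run against a 1G file
fio --name=randread-test --filename=/tmp/fio.test --size=1G \
    --readwrite=randread --bs=4k --ioengine=libaio --iodepth=32 \
    --numjobs=4 --runtime=60 --time_based --group_reporting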
## btrace
* time stamps in sorted format are in nanoseconds
* btrace actually uses blktrace piped to blkparse
* Trace Actions: C (complete), D (issued), I (inserted), Q (queued), B (bounced),
  M (back merge), F (front merge), G (get request), S (sleep),
  P (plug), U (unplug), T (timer unplug), X (split), A (remap)
* RWBS: R (read), W (write), D (discard), B (barrier op), S (synchronous op)
-t = use time deltas per IO
-w n = run for "n" seconds
-s = sort by program (end of run)
* format fields: device, cpu, seqID, time, pid, command, N bytes, action, RWBS
* useful, more complete one liner:
blktrace -d /dev/DISK -w 15 -o - | blkparse -i - -s -t -f "%D %2c %8s %5T.%9t %5p %10C %5N %2a %3d\n" > blk-trace.txt

## perf
* use perf record to store data into the perf.data file
# record: need-to-know options (perf record [options])
-e = event (get from perf list)
-a = all CPUs
-C n = only watch events on processor n (n,n)
-p = pid (pid,pid)
-t = thread (thread,thread)
-u = uid (all events of user uid)
-c = sample period (count of events per sample)
-F = profile at CPU frequency (usually 997)
-g = enable call-graph (backtrace) recording
-s = per-thread counts
-T = timestamps
perf record [options] sleep 5 ; record for 5 seconds
# report options
--stdio = use standard output interface (otherwise ncurses window)
--header = header details (status information)
-s field = sort on field
# sched
perf sched record = record latencies
perf sched latency = report per-task sched latencies
perf sched script = report detailed trace of the recorded workload
# trace
perf trace = "global" strace for everything running on the node, including all system calls
# stat
perf stat = count events
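* a minimal profiling sketch (the dd command is an arbitrary workload placeholder):
# system-wide sampling at 997 Hz with call graphs for 5 seconds
perf record -a -g -F 997 sleep 5
# summarize hottest paths from perf.data, sorted by command and symbol
perf report --stdio -s comm,dso,symbol
# count events for a single command
perf stat dd if=/dev/zero of=/dev/null bs=1M count=1000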
#### numa options
* numa_hit = memory successfully allocated on "this" node
* numa_miss = memory allocated on "this" node despite the process preferring a different node
* numa_foreign = memory intended for "this" node but allocated on another node
numactl --show = display numa options (policy, preferred node, cpu binding, etc.)
numactl --hardware = display numa hardware details (nodes, node cpus, memory, distances)
numastat = displays statistics (hits, misses, etc.)
options: -m (meminfo-like stats), -v (verbose), -z (no empty stats),
         -s node (must be last) = sort the given numa node's column in descending order

## check IO scheduler (deadline, noop, cfq)
cat /sys/block/DISK/queue/scheduler

######## network specific

## lnstat

## nstat
* run in daemon mode; use -d for interval in secs between records & -t for time interval in secs to average rates
* prints formatted statistics from /proc/net/netstat & snmp

## netstat
-s = get network statistics; use -t or -u for TCP/UDP
-I=dev = get quick statistics for device (eth4, etc.)
-i = get quick statistics for ALL devices
Recv-Q/Send-Q = bytes not yet copied by the application / bytes not yet acked by the remote host
* don't forget the interval

## ss
* state filters (ss state X):
  * established, syn-sent, syn-recv, fin-wait-{1,2}, time-wait, closed
  * close-wait, last-ack, closing, listening
  * all, connected, synchronized, bucket, big
* in the listening state, Recv-Q contains the current backlog and Send-Q displays the max backlog (somaxconn)
-6 = ipv6 only
-4 = ipv4 only
-t/-u = tcp/udp only
-n = no resolve
-e = extended details: inode, uid, sk (memory)
-o = timer information
-i = tcp details: rtt, cwnd, bytes_(acked,received), segs_(in,out), transfer rate
  ts = shown if the timestamp option is set
  sack = shown if the sack option is set
  ecn = shown if the explicit congestion notification option is set
  ecnseen = shown if the ecn flag was found in received packets
  fastopen = shown if the fastopen option is set
  cong_alg = the congestion algorithm name; the default is "cubic"
  wscale:snd,rcv = send and receive scale factors, if the window scale option is used
  rto: = tcp retransmission timeout value, in milliseconds
  backoff: = used for exponential backoff retransmission; the actual retransmission timeout is icsk_rto << icsk_backoff
  rtt:rtt/rttvar = average round trip time / mean deviation of rtt, in milliseconds
  ato: = ack timeout in milliseconds, used for delayed ack mode
  mss: = max segment size
  cwnd: = congestion window size
  pmtu: = path MTU value
  ssthresh: = tcp congestion window slow start threshold
  bytes_acked: = bytes acked
  bytes_received: = bytes received
  segs_out: = segments sent out
  segs_in: = segments received
  send bps = egress bps
  lastsnd: = time in milliseconds since the last packet was sent
  lastrcv: = time in milliseconds since the last packet was received
  lastack: = time in milliseconds since the last ack was received
  pacing_rate bps/bps = the pacing rate and max pacing rate
  rcv_space: = a helper variable for TCP internal auto tuning of the socket receive buffer
* filter {s,d}port using '( dst :# or dst :# )' or service names, e.g. '( dport = :ssh or dport = :https )'
* filter using IP directly on the console using dst/src
* filter using dport = :# or sport = :#
* filter using expressions: src IP and sport gt :5000 ; dst IP and sport = :#
* -p output appears as "users:(("prog",pid,fd))"
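* a minimal filtering sketch (192.0.2.10 and the port are placeholders):
# numeric TCP sockets with timer and internal TCP details
ss -tnoi
# established HTTPS connections to one destination
ss -tn state established '( dport = :443 )' dst 192.0.2.10
# listening sockets with extended details, no resolving
ss -tlne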
## potential network tuning options
* utilization can be calculated as current throughput (rx, tx) divided by negotiated speed
* saturation is hard to measure; look for "overruns" and lots of retransmits
** check kernel-doc/Documentation/networking/scaling.txt
* RSS (receive side scaling) = HW-based; algorithm that attempts to ensure (via hashing) that packets from the same connection are processed by the same CPU
* RPS (receive packet steering) = software implementation of RSS via a short interrupt routine
* RFS (receive flow steering) = software similar to RPS but with a focus on affinity for the socket that last processed a packet; improves cache hit rates
* ARFS (accelerated RFS) = HW-based RFS; updates the NIC with flow information to identify which CPU(s) to interrupt
* XPS (transmit packet steering) = NICs with multiple tx queues utilize multiple CPUs to transmit packets (requires multiqueue HW)
* each TCP socket in LISTEN state has two queues: syn and accept
* connection bursts are serviced by two backlog queues: syn and listen
* the maximum allowed length of both queues is the "backlog" (somaxconn)
* syn backlog/queue ; stores inbound syn packets (3-way handshake)
* listen backlog/queue ; stores packets for established connections waiting to be serviced by the application
* accept queue ; contains fully established connections
* interrupt coalescing (IC) mode ; allows the NIC to store many packets/reach a timer and then interrupt the kernel, vs. the traditional method of interrupting the kernel upon each received packet
* packet drops and overruns are typically due to rx buffers not being drained fast enough
* pause frames (ethernet flow control) operate between the adapter and the switch port; they cause the switch port to stop sending frames
* NIC IC is either time before a hardware interrupt (usecs, micro) or # of packets
* /proc/sys/net/{core,ipv4,ipv6,netfilter,unix,bridge}
* sysctl values below
* rmem = receive buffer in bytes
* wmem = send buffer in bytes

*** timers
tcp_fin_timeout = 60s default; consider tuning for high network i/o
tcp_keepalive_time = 7200s default
tcp_keepalive_probes = 9 by default; # of probes sent out to determine if a client is dead
tcp_keepalive_intvl = 75s by default; used with keepalive_probes to determine when a dead/idle connection is aborted
fs.file-max = each connection requires a file handle
net.core.dev_weight = n ; number of packets the kernel can handle per cpu on a NAPI interrupt
net.core.netdev_budget = n ; number of packets taken from all interfaces during a polling cycle (3rd column in /proc/net/softnet_stat)
* /sys/class/net/dev/weight = set budget per device
* small increments are normal, and do not require tuning

# Device backlog
net.core.netdev_max_backlog = n ; queue which holds packets before processing, one per CPU; check softnet_stat
/proc/net/softnet_stat = each line represents a CPU:
1st col is # of frames received via interrupt;
2nd col is # of packets dropped due to max_backlog exceeded;
3rd col is # of times ksoftirqd ran out of netdev_budget or CPU time

# Socket buffers for all protocol types (read and write) set below
net.core.[rw]mem_max = use 16MB for a 1GbE nic (16 * 1024 * 1024) or 32-64MB for 10GbE
net.core.[rw]mem_default = " "
net.ipv4.tcp_moderate_rcvbuf = 1 (enables TCP autotuning)
net.ipv4.tcp_[rw]mem = "minimum default maximum"; set maximum to the value of net.core.[rw]mem_max; default is 4MB, increase by a factor of 4;
the application must be restarted if the middle value is changed; it overrides net.core.[rw]mem_default; the maximum does NOT override net.core.[rw]mem_max;
you may not need to adjust minimum and default; instead, start with maximum
net.core.optmem_max = used in few cases and shouldn't be changed unless ENOMEM is manifest; RDMA utilizes socket memory..
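* a minimal sketch applying the socket buffer guidance above on a 10GbE host (values are illustrative, not recommendations; persist them in /etc/sysctl.d/ rather than setting live):
# max socket buffer sizes (32MB shown)
sysctl -w net.core.rmem_max=33554432
sysctl -w net.core.wmem_max=33554432
# tcp autotuning range: "min default max"; max matches net.core.[rw]mem_max
sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"
# ensure receive-side autotuning stays enabled
sysctl -w net.ipv4.tcp_moderate_rcvbuf=1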
# first backlog queue, half-open connections
net.ipv4.tcp_max_syn_backlog = n ; max # of unanswered connection requests (tcp handshake requests);
if set too low, latency will be increased on the client due to retransmits
# second backlog queue, connections sent to accept()
net.core.somaxconn = n ; listen backlog: established connections in LISTEN state waiting to be handed off to the application via accept(); application restart necessary
net.ipv4.tcp_max_tw_buckets = n ; max # of sockets held in time_wait
net.ipv4.tcp_tw_reuse = n ; allow reuse of time_wait sockets for new outbound connections when safe
net.ipv4.tcp_fin_timeout = n seconds ; adjust how long sockets stay in fin-wait-2/time_wait
net.ipv4.tcp_slow_start_after_idle = 0|1 ; disable/enable slow start after idle (research)
* RX/TX buffer tuning via ethtool -g dev; these are often set too low, and increasing them alone can fix packet drops - check RX/TX queues via ss -nmp
* RX/TX flow control (disable)
* Recv-Q/Send-Q: n bytes not copied by the socket program (in use) / n bytes not acknowledged by the remote host
* txqueuelen/qlen = # of packets that can be queued before transmission; the default is usually fine, but increase it if tx errors are manifest
* check ethtool -S dev for fails, misses, discards, drops, buffers, fifos, full
* check for balancing in /proc/interrupts across all CPUs for TX and RX on dev
* consider enabling adaptive IC mode to allow the driver to adjust values automatically
* check RSS via /proc/interrupts, e.g. egrep 'CPU|eth0' /proc/interrupts
* check RPS via /sys/class/net/dev/queues/rx-*/rps_cpus; shouldn't be configured if RSS is available, but can be used with any network card!
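* a minimal checking/tuning sketch for the NIC items above (eth0, the ring sizes, and the CPU mask are placeholders; run as root):
# view ring sizes, then raise them (must stay within the reported pre-set maximums)
ethtool -g eth0
ethtool -G eth0 rx 4096 tx 4096
# look for drops/overruns in driver stats and check IRQ spread across CPUs
ethtool -S eth0 | egrep -i 'drop|miss|fifo|discard'
egrep 'CPU|eth0' /proc/interrupts
# enable adaptive interrupt coalescing if the driver supports it
ethtool -C eth0 adaptive-rx on adaptive-tx on
# if no RSS: steer rx queue 0 softirq work to CPUs 0-3 (hex mask f)
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus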