#### infiniband nomenclature

* openib - base package for infiniband operations; loads all the necessary kernel modules on boot up; use chkconfig to enable at boot; configuration at /etc/ofed/openib.conf.
* rdma - equivalent package to openib (Fedora, RHEL6+); the same as openib except the service is named rdma; config file is /etc/rdma/rdma.conf.
* libibverbs - core user space library which handles the hardware abstraction 'verbs' protocol on infiniband/iwarp.
* libmthca, libmlx4, libipathverbs, libehca, libcxgb3, libnes - user space hardware drivers that implement the hardware specific bit fiddling operations that libibverbs exposes as the verbs API; versions must match the kernel version.
* libibcm, librdmacm - libraries which ease the process of initiating connections between infiniband hosts; libibcm uses infiniband native hardware addresses, while librdmacm allows you to specify connections using tcp/ip addresses (IPoIB) with rdma specific connections.
* librdmacm-utils - package that includes some tools for testing your network connectivity.
* libibcommon, libibumad, libibmad - libraries used for creation of management messages; used by opensm, infiniband-diags, ibutils, and ibsim.
* opensm - infiniband subnet manager; must enable the opensmd service; configuration done via /etc/ofed/opensm.conf or /etc/rdma/opensm.conf.
* dapl - application development environment which provides a transport neutral API for machine to machine operations; used by the closed source Intel MPI; test utilities in the dapl-utils package.
* ibutils, infiniband-diags (openib-diags) - various utilities for assessing infiniband fabric health and testing end to end connectivity.
* ibsim - infiniband fabric simulator; by creating a topology file that mimics the physical network/switch layout and specifying other relevant factors, one can test the overall simulated performance of the network.
* libsdp (RHEL only) - library that allows transparent tcp/ip applications to use a limited form of RDMA over RDMA fabrics; performs better than plain IPoIB interfaces.
* mstflint, tvflash - tools for burning new firmware onto Mellanox hardware.
* perftest, qperf - performance testing tools specific to RDMA fabrics.
* ofed-docs - documentation from the upstream OFED project.
* qlvnictools - tools for configuring participation in QLogic's proprietary InfiniPath switch vlan setups; requires the correct InfiniPath switch hardware and IPath host adapters.
* srptools - simple daemon for attaching to and maintaining connections to SRP protocol based disks.
* mpi-selector - simple login script based application that allows either a system wide or user specific default MPI implementation to be selected.

#### infiniband error definitions

* SymbolErrors - The total number of minor link errors detected on one or more physical lanes; this includes 8B/10B coding violations and is typically an indication of a bit error on the line.
* LinkRecovers - The total number of times the Port Training state machine has successfully completed the link error recovery process.
* LinkDowned - The total number of times the Port Training state machine has failed the link error recovery process and downed the link.
* RcvErrors - The total number of packets received on the port containing errors: local physical errors, malformed data packet errors, malformed link packet errors, and packets discarded due to overrun.
* RcvRemotePhysErrors - The total number of packets marked with the EBP (End of Bad Packet) delimiter received on the port; this is typically due to a physical error that was detected and marked by an upstream port.
* RcvSwRelayErrors - The total number of packets received on the port that were discarded because they could not be forwarded by the switch relay.
* XmitDiscards - The total number of packets dropped because the port is down or congested.
* XmitConstraintErrors - The total number of packets not transmitted from the switch port because FilterRawOutbound is true and the packet is raw, or because PartitionEnforcementOutbound is true and the packet fails the partition check.
* RcvConstraintErrors - The total number of packets received on the switch port that were discarded because FilterRawInbound is true and the packet is raw, or because PartitionEnforcementInbound is true and the packet fails the partition check.
* LinkIntegrityErrors - The total number of times that the count of local physical errors exceeded the threshold specified by LocalPhysErrors.
* ExcBufOverrunErrors - The total number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error.

#### OpenSM unhealthy port options (flapping, unresponsive, etc.)

* detected by opensm and configured in opensm.conf; offending ports will appear in opensm-unhealthy-ports.log
* unhealthy port conditions: constantly rebooted nodes/ports, flapping links, unresponsive ports, noisy ports, errors on SET SMP queries, illegal SMPs
* configuration options (different for switches):
** Ignore - OpenSM will ignore the unhealthy condition
** Report - OpenSM will report the unhealthy ports in opensm-unhealthy-ports.dump
** Isolate - OpenSM will isolate the unhealthy port from the routing (no unicast/multicast routing through the port(s)); will log the action
** No discover - OpenSM will ignore the port entirely (behave as if it does not exist)

#### infiniband data rates

* SDR = single data rate; 2.5 Gbps per lane
* DDR = double data rate; 5 Gbps per lane
* QDR = quad data rate; 10 Gbps per lane
* FDR = fourteen data rate; 14.06 Gbps per lane
* EDR = enhanced data rate; 25.78 Gbps per lane

### binary tools (requires elevated permissions)

* get devices in fabric
** iblinkinfo <- obtain status, speed, and lid numbers of links in the ib network
** ibnodes <- lists ib nodes in the topology, including switch(es)
** ibhosts <- obtain detected infiniband neighbors
** ibswitches <- lists all switches within the fabric
** ibroute -a switch_lid <- display connected hosts and lid ids on a switch
** ibtracert lid lid <- trace the route between two infiniband hosts
** ibtracert -n <- display only lids and GUIDs
** ibtracert -e <- show errors
** ibtracert -d <- debugging information
** ibdiagnet --get_cable_info <- get cable information on the fabric (also grabs temperature info)
* get device information
** ibstat <- obtain infiniband CA type, number of ports, lid numbers, states, etc.
** ibstat -p <- obtain port GUIDs
** ibv_devinfo -l <- obtain ib devices
** ibv_devinfo -d device -i port -v <- obtain all device information (-i can be omitted)
** ibv_devices <- quick listing of IB devices on the host
** ibportstate -L LID (port 1|2) <- get lid, link, and speed information on a device (can use -C dev or -G 0x000(GUID); can enable/disable/reset, query, etc.)
** ibcheckportstate lid|guid port <- check port status on a switch/hca (limited detail)
** ibchecknet -v <- scan the entire subnet for errors; useful for nodes and switches
* connectivity tests
** ibping lidNum | ibping -G portGUID <- infiniband ping to the specified machine; the receiving machine must be in server mode (ibping -S)
** ibv_uc_pingpong dns_host <- performs a UC transport send test; the receiving machine must be in listen mode (ibv_uc_pingpong)
** ibv_rc_pingpong dns_host <- performs an RC transport send test; the receiving machine must be in listen mode (ibv_rc_pingpong)
** ibv_ud_pingpong dns_host <- performs a UD transport send test; the receiving machine must be in listen mode (ibv_ud_pingpong)
** ibv_srq_pingpong dns_host <- performs a shared receive queue test; the receiving machine must be in listen mode (ibv_srq_pingpong)
** ibv_asyncwatch <- listen for asynchronous events on a device
* trap details
** /usr/include/infiniband/iba/ib_types.h <- contains a listing of the trap messages found in subnet manager logs
* modify ports
** iba_portconfig -l LID -m PORT (-S state, -P physstate, -w width, -s speed) <- geared towards switch ports
* error checks
** ibcheckerrs -v LID PORT (e.g. ibcheckerrs -v 92 1)
** ibcheckerrors <- obtain a list of possible port errors
** ibqueryerrors --switch -G SWITCH_GUID <- can also be used to clear counters and other errors

#### Protocols

* IPoIB = IP over Infiniband
* SDP = Sockets Direct Protocol; allows socket based applications to exploit RDMA
* SRP = SCSI RDMA Protocol; allows IB hosts to communicate with FC storage devices
* iSER = iSCSI Extensions for RDMA
* uDAPL = user Direct Access Programming Library; user mode API which natively supports IB fabrics

#### terminology

* partitioning determines which nodes can communicate with which other nodes (sub clusters, analogous to VLANs or FC zones); set in the subnet manager's partitions.conf
* blocking factor = on the edge switch(es), the ratio of ports used to connect to servers vs. ports used to connect to core switches
** blocking is only introduced between edge and core switches
** does not affect fabric latency or the ability to access I/O resources; does affect worst-case throughput
* non blocking = an equal number of ports used to connect to servers and to connect to the core
* to calculate the blocking factor, divide the number of server ports by the number of core ports; blocking causes queuing and introduces latency
* manual calculation for the blocking factor: (sports/(d+u))*d
* manual calculation for leading n blocking: (sports/(n+d+u))*(d+n)
* CBB - constant bisectional bandwidth
* LID - local identifier
* verbs - abstract representation defining the required interface between client software and HCA functions

### topology

* fat tree - fully connected "spine" fabric (each edge switch connected to each spine switch); the total number of links to sibling switches matches the total number of links to parents
* total number of fat tree nodes (2 levels): Nmax = Pedge*Pcore/2
* edge/leaf - a switch which connects to the spine
* to create a non blocking fabric: N1 (node switches) * P/2 (ports on node switches); e.g. 4 * 8/2 = 16 nodes max using 6 switches with 8 ports each
* the total number of ports should be three times the number of end nodes to create a non blocking fabric

### calculate switches needed

1a.) Get the blocking factor per switch (divide the available ports evenly into groups; e.g. 36 ports = 18:18; 18/3 = 6, 18+6 = 24, 36-24 = 12, giving 24:12 [2:1])
1b.) Alternatively, get the blocking factor per switch by dividing by the desired factor: /2 [1:1], /3 [2:1], /4 [3:1], etc.
2.) Calculate edge switches: nnodes / ports down; round up
3.) Calculate core switches: ports up * nedge / switchports; round up

###### changing card parameters

* modinfo ib_qib (or other driver) | grep "feature" (e.g. modinfo ib_qib | grep qps)
* vi /etc/modprobe.d/qib.conf (or the file named for the driver); insert: options ib_qib max_qps=65536 (for this example)
* reboot

#### change hostname output in ibnodes/iblinkinfo

echo "BLAH" > /sys/class/infiniband/module/node_desc

#### Mellanox VPI valid port settings

* P1: eth P2: eth
* P1: ib P2: ib
* P1: auto P2: auto
* P1: ib P2: eth
* P1: ib P2: auto
* P1: auto P2: eth

###### max_qp cannot be changed

* http://comments.gmane.org/gmane.linux.drivers.openib/44840
* /sys/module/ib_qib/parameters - location of parameters
* /usr/include/infiniband/verbs.h - definitions of parameters

###### PSM shared contexts

export PSM_SHAREDCONTEXTS=0

###### Connected vs. datagram

* CONNECTED_MODE=yes ; enables connected rather than datagram mode
* the max MTU is 65520 in connected mode due to IP limitations
* connected mode interfaces will fall back to datagram mode for multicast traffic (switches, fabric, and hosts in datagram mode)
* programs which send multicast data at the interface's max MTU will need the interface configured for datagram operation, or must cap their send size to fit in datagram sized packets

#### A note on IPoIB

* Create an ifcfg-ib* file per port on a Linux node, using the port GUID as the hardware address.

Once you have the opensm machine selected, and you've started the machine with both the openibd and opensmd services enabled, you should have a functional infiniband fabric. An easy way to test this is to make sure that the libibverbs-utils package is installed and run ibv_devinfo and ibv_devices to see what infiniband/iwarp devices the system thinks are present. Assuming that your devices are found and ibv_devinfo shows your port state to be active, you are ready to run programs on the infiniband fabric.
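That quick test can be sketched as a few lines of shell; the grep pattern matches the "state: PORT_ACTIVE" line ibv_devinfo prints for a working port, and the status strings are just illustrative:

```shell
# Quick fabric sanity check; assumes the libibverbs-utils package is installed.
if command -v ibv_devinfo >/dev/null 2>&1; then
    ibv_devices                                   # list detected IB/iwarp devices
    if ibv_devinfo | grep -q 'state:.*PORT_ACTIVE'; then
        status="fabric ready"                     # at least one port is ACTIVE
    else
        status="no active ports - check cabling and the subnet manager"
    fi
else
    status="libibverbs-utils not installed"
fi
echo "$status"
```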
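Separately, the arithmetic from the "calculate switches needed" steps earlier can be checked with shell arithmetic; the 36-port switch split comes from step 1a, while the 96-node cluster size is a made-up example:

```shell
# Step 1a split of a 36-port switch: 24 ports down to servers, 12 up to core.
sports=36                     # total ports per switch
down=24                       # edge ports cabled to servers
up=$((sports - down))         # edge ports cabled to core switches
echo "blocking factor $((down / up)):1"           # 24:12 reduces to 2:1

# Steps 2 and 3 for a hypothetical 96-node cluster.
nnodes=96
nedge=$(( (nnodes + down - 1) / down ))           # nnodes / ports down, rounded up
ncore=$(( (up * nedge + sports - 1) / sports ))   # ports up * nedge / switchports, rounded up
echo "edge switches: $nedge, core switches: $ncore"
```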
In addition to this, you can create tcp/ip interfaces over the infiniband network (IPoIB). To do so, you will need to create the ifcfg-ib0 (and possibly ifcfg-ib1) file in /etc/sysconfig/network-scripts. IPoIB interface types have not been added to our system-config-network tool, hence the need to create the files manually. In addition, IPoIB interfaces cannot use dhcp, so they must be statically configured. A sample ifcfg-ib0 file looks like this:

DEVICE=ib0
TYPE=Infiniband
BOOTPROTO=static
BROADCAST=192.168.0.255
IPADDR=192.168.0.1
NETMASK=255.255.255.0
NETWORK=192.168.0.0
ONBOOT=yes

In the case that you have two IB ports plugged into the same infiniband fabric (that is, on the same subnet, not each port on its own subnet) and you also have IPoIB enabled on both ports, then in order to avoid confusion over why things sometimes work and sometimes don't when using IPoIB interface addresses to initiate connections between machines, it is best to add the following lines to your /etc/sysctl.conf file:

net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.ib0.arp_ignore=1
net.ipv4.conf.ib1.arp_ignore=1

If you intend to run infiniband applications as any user other than root, you will also need to adjust the maximum locked memory for the system. This is done by modifying the /etc/security/limits.conf file. Depending on whether you want to raise the limit only for a specific group that is allowed to run infiniband applications or for all logins, your change should look something like this:

@ib_user - memlock 8192

or

* - memlock 8192

The value used above is a sample value. You can set the limit to -1 to remove it entirely. The actual amount of locked memory your application will need depends on how many connections it opens and how large a message queue it allocates for each connection, plus memory for the actual read/write buffers it sends. All RDMA memory must be locked into physical memory so that the infiniband/iwarp hardware can safely access it via DMA.
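After editing limits.conf and logging back in, the effective limit can be verified from a shell; the 8192 KB threshold below simply mirrors the sample value above:

```shell
# Print the per-process locked-memory limit in KB ("unlimited" means no cap).
limit=$(ulimit -l)
echo "memlock limit: $limit"

# Warn when the limit is below the sample 8192 KB; non-root RDMA applications
# will fail to register memory once they hit this cap.
if [ "$limit" != "unlimited" ] && [ "$limit" -lt 8192 ]; then
    echo "memlock limit too low for RDMA use" >&2
fi
```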