Friday, September 13, 2019

Transceivers: The Device Between the NIC and the Network


Transceivers: The Device Between the NIC and the Network

One of the stories that has stuck with me over the years came from a support case that a former colleague, Ryan Nelson, had point on. At Joyent, we had third parties running our cloud orchestration software in their own data centers with hardware that they had acquired and assembled themselves. In this particular episode, Ryan was diagnosing a case where a customer was complaining about the fact that networking wasn’t working for them. The operating system saw the link as down, but the customer insisted it was plugged into the switch and that a transceiver was plugged in. Eventually, Ryan asked them to take a picture of the back of the server, which is where the NIC (Network Interface Card) would be visible. It turned out that the transceiver looked like it had been run over by a truck and had been jammed in — it didn’t matter what NIC it was plugged into, it was never going to work.

As part of a broader push on datacenter management, I was thinking about this story and some questions that had often come up in the field regarding why the NIC said the link was down. These were:

  1. Was there actually a transceiver plugged into the NIC?
  2. If so, did the NIC actually support using this transceiver?

Now, the second question is a bit of a funny one. The NIC obviously knows whether or not it can use what’s plugged in, but almost every time, the system doesn’t actually make it easy to find out. A lot of NIC drivers will emit a message that goes to a system log when the transceiver is plugged in or the NIC driver first attaches, but if you’re not looking for that message or just don’t happen to be on the system’s console when that happens, suddenly you’re out of luck. You might also ask why are there transceivers that aren’t supported by a NIC, but that’s a real can of worms.

Anyways, with that all in mind, I set out on a bit of a journey and put together some more concrete proposals for what to do here in terms of RFD 89: Project Tiresias. We’ll spend the rest of this entry going into a bit of background on transceivers and then discuss how we go from knowing whether or not they’re plugged in to actually determining who made them and where they are in the system.

What is a transceiver?

We’ve been using the term 'transceiver' quite a bit so far, but that’s a pretty generic term. Let’s spend a bit of time talking through that. First, we’re really focused on transceivers as used in the context of networking. When people think of wired networking, the most common thing that comes to mind are Ethernet Cables. Ethernet isn’t the only type of cable that’s been used. Before Ethernet was common, BNC coaxial cables were used on some NICs as well.

However, in the data center, Ethernet didn’t end up keeping up with the speeds and distances that connections were being use for (though 10 Gigabit Ethernet, 10GBASE-T, has started becoming more common). In this space, fiber-optic cables and copper twinaxial cables (twinax) are much more prominent. Note, twinaxial cables are rather different from their BNC coaxial relatives. Coaxial cables are used more often when there are shorter distances to cover, such as between a top of rack switch and a server. Fiber optic cables often cover longer distances or have higher throughputs.

Because different types of cables are used in different situations, several vendors got together and agreed upon a set of standards to use when manufacturing these cables. This allowed NIC manufacturers to design a single physical and electrical interface, but still support different types of transceivers. These standards (technically a multi-source agreement) are maintained by the Small Form Factor (SFF) Committee. The committee manages standards not only for networking, but also for SAS cables and other devices.

If you’ve worked in this space, you may have heard of what are called SFP and SFP+ cables. These cables generally support 1 and 10 Gigabit networking respectively. The transceiver is controlled over an i2c bus by the NIC. The addresses and their meanings are standardized. They were originally standardized in a standard called INF-8074, but the current active standard for these devices is called SFF-8472.

With faster networking speeds, there have been additional revisions and standards put out. Devices that support 40 Gigabit networking are called QSFP+ because they combine 4 SFP+ devices. To support 25 Gigabit networking, a variant of SFP+ was created called SFP28. Finally, to support 100 Gigabit networking, they combined 4 SFP28 devices together. The 40 Gigabit devices are standardized in SFF-8436 and the 100 Gigabit have their management interface described in SFF-8636.

The standards for various devices have somewhat similar layouts. They break data into a series of different pages of which a specific offset into the page can then be accessed via the NIC’s i2c bus. These pages contain some of the following information:

  • Control over the device and its configuration

  • Static manufacturing information such as the manufacturer’s name, the
    device’s name, the serial number, and more

  • Optional information about the health of the device such as the
    temperature, voltage, and more

The pages and addresses change from specification to specification, though a large amount of the data overlaps between them. The health information of the device is required when the connector is considered active (generally fiber-optic cables with lasers) and is optional when you have a passive device (such as a copper twinax cable).

The MAC Transceiver Capability

The first part of our adventure with getting to this data begins in the operating system kernel. Similar to the case of managing NIC LEDs, the networking device driver framework has an optional capability that a driver can implement to expose this information called MAC_CAPAB_TRANSCEIVER.

The transceiver capability has a structure that the device driver has to fill out which describes some basic information for dealing with transceivers. This includes the following fields:

  • The number of transceivers present on the device.
  • A function, mct_info() that allows one to get basic information about the transceiver.
  • A function, mct_read() that allows one to read i2c data from the device.

The driver first indicates the number of transceivers that are present for it. In general, this is one. However some devices actually support combining multiple transceivers and ports into one logical device — though this isn’t commonly used. The next item, the mct_info() function, is used to answer the two questions that were posed at the beginning of this: Does the NIC think a transceiver is present and can the NIC use the transceiver? Finally, the mct_read() function allows us to go through and read specific regions of the memory map of the transceiver. Generally, user land reads an entire 256-byte page at any given time.

The kernel only facilitates reading information. It generally doesn’t try and make semantic sense of any of the data. That is purposefully left to user land — unless there’s a good reason to parse binary data in the kernel, you’re usually better off not doing that.

The following device families and drivers support the MAC_CAPAB_TRANSCEIVER capability. Some drivers that you may be more familiar with such as 1 Gigabit Ethernet devices aren’t on this list because they don’t support transceivers. Supported devices include:

  • Broadcom NetExtreme II 10 Gb devices based on the bnxe driver.
  • Chelsio T4, T5, and T6 10, 25, 40, and 100 Gbit devices based on the cxgbe driver.
  • Intel X520 10 Gbit SFP+ devices based on the ixgbe driver.
  • Intel 10, 25, and 40 Gbit SFP+, SFP28, and QSFP+ devices based on the i40e driver.
  • QLogic FastLinQ QL45xxx devices based on the qede driver.

The way that each driver gets access to this information varies from device to device. Some, like i40e and cxgbe, issue a firmware command to read information from the transceiver. Others have dedicated registers that can be programmed to read arbitrary data from the i2c device.

libsff and dltraninfo

Once we have the ability to read the actual data from the transceiver, we have to make logical sense of it. To handle that, the first thing I did was write a small library called libsff. The goal of libsff is to parse the various SFF binary data payloads and return structured data as a set of name-value pairs.

If you look at the header file, libsff.h, you’ll see a list of different keys that can be looked up. Some of these are rather straightforward, such as the string "vendor", which has the name of the manufacturer of the transceiver. Others are a bit more opaque and require referencing the actual SFF documents. Another useful feature of the library is that it tries to abstract out the differences between different versions of the specifications. The goal is that when there is similar data, it should always be found under the same key even if they are found in wildly different parts of the memory map or the way we have to parse the data is different. The goal of libraries (or really any interface and abstraction) should be to take something grotty and transform it into something more usable as though reality were as simple as it presents.

The one thing that the library doesn’t generally do today is parse all of the sensor data that may be available on the transceiver. The main reason for this is that the vast majority of transceivers that I had access to, did not implement it. On SFP, SFP+, and SFP28 devices, sensor information is optional for twinax based devices. With a few devices to test with, it would be pretty straightforward to add support for it though.

On its own, a library isn’t useful unless it has a consumer. The first consumer that I’ll discuss is dltrainfo. This is an unstable, development program that I wrote to exercise this functionality and to try and get a sense of what interfaces might be useful. There are two forms of the dltrainfo command. The first answers the questions that we laid out in the beginning about whether the transceiver is present or usable. When run this way, you see something like:

# /usr/lib/dl/dltraninfo ixgbe0: discovered 1 transceiver transceiver 0 present: yes transceiver 0 usable: yes ixgbe1: discovered 1 transceiver transceiver 0 present: yes transceiver 0 usable: yes ixgbe2: discovered 1 transceiver transceiver 0 present: yes transceiver 0 usable: yes ixgbe3: discovered 1 transceiver transceiver 0 present: yes transceiver 0 usable: yes 

The next option is to read the information from the transceiver. Here’s an example of reading this on an Intel 10 Gbit fiber-optic transceiver:

# /usr/lib/dl/dltraninfo -v ixgbe1 ixgbe1: discovered 1 transceivers transceiver 0 present: yes transceiver 0 usable: yes Identifier: 'SFP/SFP+/SFP28' Extended Identifier: 4 Connector: 'LC (Lucent Connector)' 10G+ Ethernet Compliance Codes[0]: '10G Base-SR' Ethernet Compliance Codes[0]: '1000BASE-SX' Encoding: '64B/66B' BR, nominal: '10300 MBd' Length 50um OM2: '80 m' Length 62.5um OM1: '30 m' Length OM3: '300 m' Vendor: 'Intel Corp' OUI[0]: 0 OUI[1]: 27 OUI[2]: 33 Part Number: 'FTLX8571D3BCV-IT' Revision: 'A' Laser Wavelength: '850 nm' Options[0]: 'Rx_LOS implemented' Options[1]: 'TX_FAULT implemented' Options[2]: 'TX_DISABLE implemented' Options[3]: 'RATE_SELECT implemented' Serial Number: 'AKR0EQ0' Date Code: '110618' Extended Options[0]: 'Soft Rate Select Control Implemented' Extended Options[1]: 'Soft RATE_SELECT implemented' Extended Options[2]: 'Soft RX_LOS implemented' Extended Options[3]: 'Soft TX_FAULT implemented' Extended Options[4]: 'Soft TX_DISABLE implemented' Extended Options[5]: 'Alarm/Warning flags implemented' 8472 Compliance: 'Rev 10.2' 

This allows us to interact with the information in a readable way. Effectively, this dumps out the entire name-value pair set that we construct when parsing data with libsff. There are two additional ways to print this data. The first one, -x, dumps out the data as hex data (kind of like if you run the program xxd). The second option -w writes out the first page, 0xa0, to a file. This allows you to take the raw data with you.

Seeing Transceivers in Topo

The next step with all this work is to expose the transceivers as part of the system topology in the fault management architecture (FMA). This is useful for a few reasons:

  1. It allows us to see what devices are present in the same snapshot as other devices like disks, CPUs, DIMMs, etc.
  2. FMA’s topology is a natural place for us to expose sensors.
  3. If a device is in topology, then we can generate error reports and faults against those devices.

Basically, being visible in the topology allows us to integrate it more fully into the system and makes it easy for various monitoring and inventory tools in the system to see these devices without having to make them aware of the underlying ways of getting data.

The topology information is organized as a tree. When we encounter hardware that we believe is a networking device (because its PCI class indicates it is), then we ask the kernel about how many transceivers it supports. For each transceiver, we create a port node under the NIC whose type indicates that it is intended for SFF devices.

When a transceiver is present, then we will place a transceiver node under the port. This node has two different groups of properties. The first is generic to all transceivers, which is where we indicate whether or not the hardware can use the transceiver. The second group are properties that we derive from the SFF specifications about the transceiver’s manufacturing data. This includes the vendor, part number, serial number, etc. The following block of text shows three different nodes: the NIC, the port, and the transceiver:

hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1 group: protocol version: 1 stability: Private/Private resource fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc iexdev=0/pciexfn=1 label string MB FRU fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0 ASRU fmri dev:////pci@0,0/pci8086,155@1,1/pci103c,17d3@0,1 group: authority version: 1 stability: Private/Private product-id string X9SCL-X9SCM chassis-id string 0123456789 server-id string ivy group: io version: 1 stability: Private/Private dev string /pci@0,0/pci8086,155@1,1/pci103c,17d3@0,1 driver string ixgbe module fmri mod:///mod-name=ixgbe/mod-id=242 group: pci version: 1 stability: Private/Private device-id string 10fb extended-capabilities string pciexdev class-code string 20000 vendor-id string 8086 assigned-addresses uint32[] [ 2197946640 0 3750756352 0 1048576 2164392216 0 57344 0 32 2197946656 0 3753902080 0 16384 ] hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0 group: protocol version: 1 stability: Private/Private resource fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc iexdev=0/pciexfn=1/port=0 FRU fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc iexdev=0/pciexfn=1 group: authority version: 1 stability: Private/Private product-id string X9SCL-X9SCM chassis-id string 0123456789 server-id string ivy group: port version: 1 stability: Private/Private type string sff hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0 group: protocol version: 1 stability: Private/Private resource fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0 FRU fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0 group: authority version: 1 stability: Private/Private product-id string X9SCL-X9SCM chassis-id string 0123456789 server-id string ivy group: transceiver version: 1 stability: Private/Private type string sff usable string true group: sff-transceiver version: 1 stability: Private/Private vendor string Intel Corp part-number string FTLX8571D3BCV-IT revision string A serial-number string AKR0EQ0 

If you plug in a transceiver dedicated to fibre channel, then we’ll properly note that we can’t use the transceiver by setting the usable property to false. The following is an example of the transceiver node in that case:

hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0 group: protocol version: 1 stability: Private/Private resource fmri hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0 FRU fmri hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0 group: authority version: 1 stability: Private/Private product-id string X10SLM+-LN4F chassis-id string 0123456789 server-id string haswell group: transceiver version: 1 stability: Private/Private type string sff usable string false 

Further Reading

If you’d like to read more on this, there are a couple of different places that I can send you.

For more on the SFF standards, there’s:

  • SFF-8472 which describes the SFP/SFP+ devices.
  • SFF-8436 which describes QSFP+ devices.
  • SFF-8636 which describes the memory map of QSFP28+ devices.

If you’re interested in the illumos implementation, there’s:

Looking Ahead

If you have a favorite NIC that uses SFP-based transceivers and it isn’t supported, reach out and we’ll see what we can do. If you’d find it interesting to work on exposing more of the sensor information present in the SFPs, then we’d be happy to further mentor someone there. Once these pieces are exposed in topology, it could also make sense to wire up the FRU monitor to watch for temperature thresholds, voltage drops, or device faults.

Up next, we’ll talk about understanding the topology of USB devices.



from Hacker News https://ift.tt/31jL2Jx

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.