Getting FCoE on a RHEL Linux Server working with Cisco NX-OS

Sunday, 26 Jan 2020

Recently I had a great deal of "fun" getting the following technologies to actually work together, so I thought this might be a good blog post if anyone else out there is still working with circa 2012 technology and needs to get stuff working:

  • Fibre Channel over Ethernet (FCoE)
  • Fibre Channel Fabric Login (FLOGI)
  • Virtual Port Channels (vPC, the non-AWS kind...) in "Individual Mode" (hack)
  • Red Hat Enterprise Linux (RHEL)

What are we working with then (The Topology)?

FCoE Storage and RHEL Server Topology

  • Storage-side
    • 1x Hitachi G200 SAN Array
      • Dual-attached via 8 Gbit Fibre Channel (8GFC) to two FC SANs
        • SAN_A = FCOE-SWITCH-01 VSAN10/VLAN20
        • SAN_B = FCOE-SWITCH-02 VSAN11/VLAN21
  • Server-side
    • 1x Dell R620 with RHEL 6.9 installed as Baremetal OS (no Virtualisation)
    • 2x Intel 82599ES Converged Network Adapters (CNAs) running at 10 Gigabit Ethernet
    • FCoE yum package installed
      • yum install fcoe-utils
  • Network-side
    • 2x Cisco Nexus 5596UP Converged Ethernet/FC/FCoE Network Switches running Cisco NX-OS
    • 2x Virtual Fibre Channel (VFC) Bindings
      • Each Switch runs a binding of VFC48 to Physical Interface Eth1/5 (sketched just after this list)
    • 1x vPC 48 mapped into both Eth1/5 instances (vPC 48 = PortChannel 48)
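
For context, here's roughly what that VFC-to-physical binding looks like in NX-OS. I've reconstructed this from the topology rather than lifting it from the real switches, so treat the exact lines as illustrative (FCOE-SWITCH-01 shown; FCOE-SWITCH-02 is the same but with VSAN 11):

feature fcoe
interface vfc48
  bind interface Ethernet1/5
  switchport trunk allowed vsan 10
  no shutdown
vsan database
  vsan 10 interface vfc48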

What didn't work then (The Problem Statement)?

This was half set up (the best kind of setup, because it's me who's getting set up...) when I got involved, and the issues were:

  1. Sporadic/intermittent pings to Server01
    1. Which could be restored by disabling one of the two Switch-Server Uplinks (Eth1/5<->Em1/Em2), on either the Switch side (Eth1/5) or the Server side (Em1/Em2)
  2. Storage Array not seeing the Server01 WWNs
    1. Which had already been set up in a SAN Zoning, and the WWN values had been confirmed

What did you do to fix it (The Poirot Moment)?

Firstly, I flexed my Google muscles and found this lovely Configuring a Fibre Channel over Ethernet Interface in RHEL 7 Guide, which came in very handy. The first thing to know is that your FCoE Subinterfaces won't (or shouldn't) show up as /etc/sysconfig/network-scripts objects, as those are for IP/Ethernet NICs (i.e. OSI Model L2/L3), whereas FCoE rides directly on Ethernet with no IP layer involved (i.e. OSI Model L2).

In my case, there were some rogue ifcfg-em1.20, ifcfg-em1.21 and ifcfg-em2.21 definitions I had to delete (ifcfg-em1.21 makes no sense anyway - VLAN 21 belongs on em2, so it should have been em2.21 to begin with):

cd /etc/sysconfig/network-scripts
rm -f ifcfg-em1.20 ifcfg-em1.21 ifcfg-em2.21   # -f skips the y/n prompt from the default root "rm -i" alias
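
A quick sanity check afterwards (standard RHEL network-scripts path assumed) to confirm only the base NIC definitions remain:

ls /etc/sysconfig/network-scripts/ifcfg-em*
# should now list only ifcfg-em1 and ifcfg-em2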

Configuring the FCoE NICs (CNAs)

Upon reading the RHEL Guide, I was expecting something about a VLAN definition here, because (handily) the previous person who set up the Converged SAN didn't like matching the Virtual Storage Area Network (VSAN) numbers to the VLAN numbers, so I've got this:

Switch           VSAN ID   VLAN ID
FCOE-SWITCH-01   10        20
FCOE-SWITCH-02   11        21
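
On the switch side, that VSAN-to-VLAN mapping lives in the VLAN definition itself. A reconstructed example for FCOE-SWITCH-01 (standard N5K syntax, not a capture from the real switch), plus the command to check what's actually mapped:

vlan 20
  fcoe vsan 10

FCOE-SWITCH-01# show vlan fcoe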

On Cisco NX-OS, you Trunk-through the VLAN ID, not the VSAN ID, so my 802.1q Trunk to the Server looks like this:

FCOE-SWITCH-01# sh run int po48
interface port-channel48
  description Po48 - Server01
  switchport mode trunk
  switchport trunk allowed vlan 20,380-381
  spanning-tree port type edge trunk
  speed 10000
  vpc 48

So I'm happy VLAN20 (VSAN10) is being Trunked-through, but upon inspection of the /etc/fcoe/cfg-ethx example file, I find no reference to a "VLAN ID", only this option relating to VLAN:

AUTO_VLAN="no"

Which is disabled, and there's no config present for my interfaces, em1 and em2:

[root@server01 ~]# ip a | grep em
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 44:a8:42:2b:4c:39 brd ff:ff:ff:ff:ff:ff
3: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 44:a8:42:2b:4c:3a brd ff:ff:ff:ff:ff:ff

So I do the following:

cp /etc/fcoe/cfg-ethx /etc/fcoe/cfg-em1
cp /etc/fcoe/cfg-ethx /etc/fcoe/cfg-em2
nano /etc/fcoe/cfg-em1   # change AUTO_VLAN="no" to AUTO_VLAN="yes"; Ctrl+O to save, Ctrl+X to exit
nano /etc/fcoe/cfg-em2   # change AUTO_VLAN="no" to AUTO_VLAN="yes"; Ctrl+O to save, Ctrl+X to exit
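
If you'd rather not drive nano twice, a non-interactive equivalent (same two files, same single AUTO_VLAN change) would be something like:

# copy the shipped template once per CNA port, then flip AUTO_VLAN on
for nic in em1 em2; do
  cp /etc/fcoe/cfg-ethx /etc/fcoe/cfg-${nic}
  sed -i 's/^AUTO_VLAN="no"/AUTO_VLAN="yes"/' /etc/fcoe/cfg-${nic}
done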

Then restart the FCoE Daemon:

[root@server01 ~]# service fcoe restart
[root@server01 ~]# service fcoe status
/usr/sbin/fcoemon -- RUNNING, pid=28331
Created interfaces: em1.20 em2.21
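
Incidentally, if you want to watch the FIP VLAN discovery happen rather than waiting blind, tailing syslog works; the exact log strings vary by driver, so take the grep terms as a starting point:

tail -f /var/log/messages | grep -i -e fcoe -e fip -e ixgbe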

Then I wait a bit, a few Kernel Syslog messages appear, et voila, my two FCoE Interfaces magically come up in the fcoeadm tool:

[root@server01 ~]# fcoeadm -i
    Description:      82599ES 10-Gigabit SFI/SFP+ Network Connection
    Revision:         01
    Manufacturer:     Intel Corporation
    Serial Number:    246E966B0120
    Driver:           ixgbe 4.2.1-k
    Number of Ports:  1

        Symbolic Name:     fcoe v0.1 over em1.20
        OS Device Name:    host12
        Node Name:         0x2000246E966B0121
        Port Name:         0x2001246E966B0121
        FabricName:        0x200A8C604F332001
        Speed:             10 Gbit
        Supported Speed:   1 Gbit, 10 Gbit
        MaxFrameSize:      2112
        FC-ID (Port ID):   0x0103C0
        State:             Online

        Symbolic Name:     fcoe v0.1 over em2.21
        OS Device Name:    host13
        Node Name:         0x2000246E966B0123
        Port Name:         0x2001246E966B0123
        FabricName:        0x200B8C604F2DE381
        Speed:             10 Gbit
        Supported Speed:   1 Gbit, 10 Gbit
        MaxFrameSize:      2112
        FC-ID (Port ID):   0x0103A0
        State:             Online

And they also pop up as normal Ethernet Interfaces in the normal place:

[root@server01 ~]# ip a
13: em1.20@em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 24:6e:96:6b:01:20 brd ff:ff:ff:ff:ff:ff
       valid_lft forever preferred_lft forever
17: em2.21@em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 24:6e:96:6b:01:22 brd ff:ff:ff:ff:ff:ff
       valid_lft forever preferred_lft forever
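
You can also cross-check which VLAN IDs FIP auto-discovered via the 8021q module's view of things (standard path on RHEL 6):

cat /proc/net/vlan/config
# expect em1.20 -> VLAN 20 on em1, and em2.21 -> VLAN 21 on em2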

And if I check on the Cisco N5K side, I can see Fabric Login events occurring for the Virtual FC interfaces:

FCOE-SWITCH-01# sh flogi database | inc vfc48
vfc48            10    0x0103a0  20:01:24:6e:96:6a:f3:01 20:00:24:6e:96:6a:f3:01
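
With FLOGI succeeding on both fabrics, the remaining server-side check is whether the zoned targets and their LUNs are actually visible. fcoeadm covers both (assuming the zoning and the G200 host groups are correct):

fcoeadm -t    # FC targets discovered per FCoE interface
fcoeadm -l    # LUNs presented behind those targets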

So onward to the next bit of the challenge: intermittent pings/Network Connectivity.

Fixing the pings (IP/Ethernet issues)

A quick rack of the brains reveals something I've hit before - when you don't set an explicit mode on Cisco PortChannels (a bare channel-group means static mode "on"), they come up unconditionally, which can cause problems with some Server OSes/Hypervisors (I'm looking at you, ESXi without a dvSwitch...):

FCOE-SWITCH-01# sh run int eth1/5
interface Ethernet1/5
  description Server01 Em1
  switchport mode trunk
  switchport trunk allowed vlan 20,380-381
  channel-group 48

Sure enough, there it is. On the RHEL side I'd maybe expect a "bond0" or equivalent interface, but there isn't one - all the configuration is done as VLAN Subinterfaces directly on Em1/Em2. So out comes the "hack": get the links up, but keep the vPC parent in place (in case the Server ever gets properly configured for LAG, i.e. LACP Active) by adding "mode active" to the PortChannel/vPC Member Interfaces. Note you have to "no" the previous channel-group command first, otherwise NX-OS will bitch at you:

FCOE-SWITCH-01# conf t
FCOE-SWITCH-01(config)# interface Ethernet1/5
FCOE-SWITCH-01(config-if)# no channel-group 48
FCOE-SWITCH-01(config-if)# channel-group 48 mode active
FCOE-SWITCH-01(config-if)# end
FCOE-SWITCH-01# copy run start

Et voila, pings are restored when both Eth1/5 interfaces in vPC48 are up. It's as hacky as they come, though, because the ports come up in non-PortChannel "Individual" Mode, so these are the unusual-looking outputs (for something that works):

FCOE-SWITCH-01# sh port-channel summary | inc Protocol|48|I
        I - Individual  H - Hot-standby (LACP only)
Group Port-       Type     Protocol  Member Ports
48    Po48(SD)    Eth      LACP      Eth1/5(I)

FCOE-SWITCH-01# sh int po48
port-channel48 is down (No operational members)
 vPC Status: Down, vPC number: 48 [packets forwarded via vPC peer-link]
  Hardware: Port-Channel, address: 8c60.4f33.200c (bia 8c60.4f33.200c)
  Description: Po48 - Server01
  MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec

Why even bother with a vPC if you're not actually LAGing?

A great question, and one I'll have to explore now the connectivity is back and working. Conceptually, all I've done above is bypass the PortChannel: for the FCoE this is fine, as FC multipaths anyway; for the IP/Ethernet, you either LAG or you don't, and this is some frankenstate in between. So while the link is "PortChannel-eligible" (which is why the vPC48 and Po48 definitions still make sense), there's no 802.3ad configuration on the RHEL OS (something like this How to configure LACP 802.3ad with bonded interfaces RHEL guide would do it), and the fact it comes up as an "I" (Individual) port basically means it's two standalone interfaces - which negates the need for a vPC or PortChannel at all.
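
For completeness, if this Server ever were to be properly configured for LACP, the RHEL 6 side would want a bonded interface along these lines. This is a minimal sketch using the standard Red Hat bonding directives, not something I've applied to this box - and note you'd bond only the IP/Ethernet side; the FCoE subinterfaces stay on the physical NICs:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
# 802.3ad = LACP; pairs with "channel-group 48 mode active" on the switch side
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=fast"

# /etc/sysconfig/network-scripts/ifcfg-em1 (and the same again for em2)
DEVICE=em1
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes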

So the PortChannel isn't required, but maybe I'd still need the vPC? Or maybe I'm losing the plot and don't need either, much like this lovely LACP and vSphere (ESXi) hosts: not a very good marriage post argues.

One for another day :).