Getting FCoE on a RHEL Linux Server working with Cisco NX-OS
Recently I had a great deal of "fun" getting the following technologies to actually work together, so I thought it might make a good blog post for anyone else out there still working with circa-2012 technology who needs to get it all talking:
- Fibre Channel over Ethernet (FCoE)
- Fibre Channel Fabric Login (FLOGI)
- Virtual Port Channels (vPC, the other non-AWS one...) in "Individual Mode" (hack)
- Red Hat Enterprise Linux (RHEL)
What are we working with then (The Topology)?
- Storage-side
  - 1x Hitachi G200 SAN Array
    - Dual-attached via 8 Gbit Fibre Channel (8GFC) to two FC SANs
      - SAN_A = FCOE-SWITCH-01 VSAN10/VLAN20
      - SAN_B = FCOE-SWITCH-02 VSAN11/VLAN21
- Server-side
  - 1x Dell R620 with RHEL 6.9 installed as the bare-metal OS (no Virtualisation)
  - 2x Intel 82599ES Converged Network Adapters (CNAs) running at 10 Gigabit Ethernet
  - FCoE yum package installed:
    yum install fcoe-utils
- Network-side
  - 2x Cisco Nexus 5596UP Converged Ethernet/FC/FCoE Network Switches running Cisco NX-OS
  - 2x Virtual Fibre Channel (VFC) Bindings
    - Each Switch binds VFC48 to Physical Interface Eth1/5 (a rough sketch of this binding follows below)
  - 1x vPC 48 mapped into both Eth1/5 instances (vPC 48 = PortChannel 48)
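For reference, the VFC-to-physical-port binding on each switch looks roughly like the below. This is a reconstructed sketch rather than a copy-paste from these particular switches (FCOE-SWITCH-02 would use VSAN 11 instead of 10):

interface vfc48
  bind interface Ethernet1/5
  switchport trunk allowed vsan 10
  no shutdown
vsan database
  vsan 10 interface vfc48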
What didn't work then (The Problem Statement)?
This was half-setup (the best kind of setup, because it's me who's getting set-up...) when I got involved, and the issues were:
- Sporadic/intermittent pings to Server01
  - Which could be restored by disabling one of the two Switch-to-Server uplinks (Eth1/5 <-> Em1/Em2), on either the Switch side (Eth1/5) or the Server side (Em1/Em2)
- Storage Array not seeing the Server01 WWNs
  - Which had already been set up in a SAN Zone, and the WWN values had been confirmed
What did you do to fix it (The Poirot Moment)?
Firstly, I flexed my Google muscles and found this lovely Configuring a Fibre Channel over Ethernet Interface in RHEL 7 Guide, which came in very handy. The first thing to know about this is that your FCoE Subinterfaces won't show up (or shouldn't) as /etc/sysconfig/network-scripts objects, as those are for IP/Ethernet NICs (i.e. OSI Model L2/L3), and we're dealing with FCoE/Ethernet (i.e. OSI Model L1/L2).
In my case, there were some rogue ifcfg-em1.20, ifcfg-em1.21 (which doesn't make sense - it would have been em2.21) and ifcfg-em2.21 definitions I had to delete:
cd /etc/sysconfig/network-scripts
rm ifcfg-em1.20
y
rm ifcfg-em1.21
y
rm ifcfg-em2.21
y
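Before moving on, it's worth a quick check that nothing else VLAN-shaped is left lurking in there (a trivial sketch; adjust the grep pattern to your own interface names):

ls /etc/sysconfig/network-scripts/ | grep -E '^ifcfg-em[12]\.'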
Configuring the FCoE NICs (CNAs)
Upon reading the RHEL Guide, I was expecting something about a VLAN definition here - handily, the previous person who set up the Converged SAN didn't like matching Virtual Storage Area Network (VSAN) and VLAN numbering schemes, so I've got this:
Switch | VSAN ID | VLAN ID |
---|---|---|
FCOE-SWITCH-01 | 10 | 20 |
FCOE-SWITCH-02 | 11 | 21 |
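For completeness, that mapping is declared on the NX-OS side by tagging each VLAN as an "FCoE VLAN" for its VSAN, roughly like this (a reconstructed sketch for FCOE-SWITCH-01 rather than a copy of its running config; FCOE-SWITCH-02 would map VLAN 21 to VSAN 11):

vlan 20
  fcoe vsan 10
vsan database
  vsan 10 name SAN_A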
On Cisco NX-OS, you Trunk-through the VLAN ID, not the VSAN ID, so my 802.1q Trunk to the Server looks like this:
FCOE-SWITCH-01# sh run int po48
interface port-channel48
description Po48 - Server01
switchport mode trunk
switchport trunk allowed vlan 20,380-381
spanning-tree port type edge trunk
speed 10000
vpc 48
So I'm happy VLAN20 (VSAN10) is being Trunked-through, but upon inspection of the /etc/fcoe/cfg-ethX example file, I find no reference to a "VLAN ID", only this option relating to VLAN:
AUTO_VLAN="no"
Which is disabled, and there's no config present for my interfaces, em1 and em2:
[root@server01 ~]# ip a | grep em
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 44:a8:42:2b:4c:39 brd ff:ff:ff:ff:ff:ff
3: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 44:a8:42:2b:4c:3a brd ff:ff:ff:ff:ff:ff
So I do the following:
cp /etc/fcoe/cfg-ethX /etc/fcoe/cfg-em1
cp /etc/fcoe/cfg-ethX /etc/fcoe/cfg-em2
nano /etc/fcoe/cfg-em1
[Edit line AUTO_VLAN to be equal to "yes"]
[Ctrl+O to save]
nano /etc/fcoe/cfg-em2
[Edit line AUTO_VLAN to be equal to "yes"]
[Ctrl+O to save]
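For reference, /etc/fcoe/cfg-em1 should now look roughly like this (the FCOE_ENABLE and DCB_REQUIRED values are the shipped defaults, which I'm assuming were left alone; only AUTO_VLAN was changed):

# /etc/fcoe/cfg-em1
FCOE_ENABLE="yes"
DCB_REQUIRED="yes"
AUTO_VLAN="yes"

DCB_REQUIRED="yes" also means the lldpad service needs to be running to handle the DCBX negotiation for these Intel 82599 CNAs, and it's worth making both daemons start on boot:

service lldpad start
chkconfig lldpad on
chkconfig fcoe on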
Then restart the FCoE Daemon:
[root@server01 ~]# service fcoe restart
[root@server01 ~]# service fcoe status
/usr/sbin/fcoemon -- RUNNING, pid=28331
Created interfaces: em1.20 em2.21
Then wait a bit and a few Kernel Syslog messages appear; et voila, my two FCoE Interfaces magically come up in the fcoeadm tool:
[root@server01 ~]# fcoeadm -i
Description: 82599ES 10-Gigabit SFI/SFP+ Network Connection
Revision: 01
Manufacturer: Intel Corporation
Serial Number: 246E966B0120
Driver: ixgbe 4.2.1-k
Number of Ports: 1
Symbolic Name: fcoe v0.1 over em1.20
OS Device Name: host12
Node Name: 0x2000246E966B0121
Port Name: 0x2001246E966B0121
FabricName: 0x200A8C604F332001
Speed: 10 Gbit
Supported Speed: 1 Gbit, 10 Gbit
MaxFrameSize: 2112
FC-ID (Port ID): 0x0103C0
State: Online
Symbolic Name: fcoe v0.1 over em2.21
OS Device Name: host13
Node Name: 0x2000246E966B0123
Port Name: 0x2001246E966B0123
FabricName: 0x200B8C604F2DE381
Speed: 10 Gbit
Supported Speed: 1 Gbit, 10 Gbit
MaxFrameSize: 2112
FC-ID (Port ID): 0x0103A0
State: Online
And they also pop up as normal Ethernet Interfaces in the normal place:
[root@server01 ~]# ip a
13: em1.20@em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 24:6e:96:6b:01:20 brd ff:ff:ff:ff:ff:ff
valid_lft forever preferred_lft forever
17: em2.21@em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 24:6e:96:6b:01:22 brd ff:ff:ff:ff:ff:ff
valid_lft forever preferred_lft forever
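As an aside, if those VLAN sub-interfaces hadn't appeared, fcoe-utils also ships a fipvlan tool that runs FIP VLAN Discovery by hand and shows which FCoE VLAN the switch is actually advertising. I didn't need it here, so treat this as a debugging sketch:

fipvlan em1 em2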
And if I check on the Cisco N5K side, I can see Fabric Login events occurring for the Virtual FC interfaces:
FCOE-SWITCH-01# sh flogi database | inc vfc48
vfc48 10 0x0103a0 20:01:24:6e:96:6a:f3:01 20:00:24:6e:96:6a:f3:01
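That FLOGI is what the original SAN Zoning was waiting for, so a couple of sanity checks are worth running at this point: the name server and active zoneset on the switch, and the Hitachi's targets/LUNs from the host. These are standard NX-OS and fcoe-utils commands (outputs not reproduced here, and multipath -ll assumes device-mapper-multipath is configured):

FCOE-SWITCH-01# show fcns database vsan 10
FCOE-SWITCH-01# show zoneset active vsan 10

[root@server01 ~]# fcoeadm -t
[root@server01 ~]# multipath -ll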
So onward to the next bit of the challenge: intermittent pings/Network Connectivity.
Fixing the pings (IP/Ethernet issues)
A quick rack of the brains reveals something I've had before - when you don't set an explicit mode on a Cisco channel-group, it defaults to mode on (static bundling), so the switch bundles the links unconditionally, which can cause problems with some Server OSes/Hypervisors (I'm looking at you, ESXi without a dvSwitch...):
FCOE-SWITCH-01# sh run int eth1/5
interface Ethernet1/5
description Server01 Em1
switchport mode trunk
switchport trunk allowed vlan 20,380-381
channel-group 48
Sure enough, there it is. On the RHEL side I'd expect to see a "bond0" or equivalent interface to match, but there isn't one - all the configuration is done as VLAN Subinterfaces on Em1/Em2. So out comes the "hack": keep the PortChannel/vPC parent in place (in case the Server ever gets properly configured for LAG, i.e. LACP Active), but add "mode active" to the PortChannel/vPC member interfaces, which (as we'll see) leaves the links forwarding as standalone "Individual" ports until LACP ever shows up. Note you have to "no" the existing channel-group command first, otherwise NX-OS will bitch at you:
FCOE-SWITCH-01# conf t
FCOE-SWITCH-01(config)# interface Ethernet1/5
FCOE-SWITCH-01(config-if)# no channel-group 48
FCOE-SWITCH-01(config-if)# channel-group 48 mode active
FCOE-SWITCH-01(config-if)# end
FCOE-SWITCH-01# copy run start
Et voila, pings are restored when both Eth1/5 interfaces in vPC48 are up, but it is hacky, as they come up in non-PortChannel "Individual" Mode, so these are the unusual-looking outputs (for something that works):
FCOE-SWITCH-01# sh port-channel summary | inc Protocol|48|I
I - Individual H - Hot-standby (LACP only)
Group Port- Type Protocol Member Ports
48 Po48(SD) Eth LACP Eth1/5(I)
FCOE-SWITCH-01# sh int po48
port-channel48 is down (No operational members)
vPC Status: Down, vPC number: 48 [packets forwarded via vPC peer-link]
Hardware: Port-Channel, address: 8c60.4f33.200c (bia 8c60.4f33.200c)
Description: Po48 - Server01
MTU 1500 bytes, BW 10000000 Kbit, DLY 10 usec
Why even bother with a vPC if you're not actually LAGing?
A great question, and one I'll have to explore now that connectivity is back and working. Conceptually, all I've done above is bypass the PortChannel: for the FCoE that's fine, as FC multipaths anyway; for the IP/Ethernet you either LAG or you don't, and this is some frankenstate in between. The links are still "PortChannel-eligible" (so the vPC48 and Po48 config still make sense), but there's no 802.3ad configuration on the RHEL OS (something like in this How to configure LACP 802.3ad with bonded interfaces RHEL guide), and the fact the ports come up with the "I" (Individual) flag basically means they're two standalone interfaces - which negates the need for a vPC or PortChannel at all.
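For what it's worth, if the Server ever were configured for LACP, the RHEL 6 side would look roughly like the below - a hypothetical sketch following the bonding guide linked above, untested here; the bond0 name and the interplay with the FCoE sub-interfaces are assumptions, and IP details are omitted:

# /etc/sysconfig/network-scripts/ifcfg-bond0 (hypothetical)
DEVICE=bond0
BONDING_OPTS="mode=802.3ad miimon=100"
BOOTPROTO=none
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-em1 (and the same for em2)
DEVICE=em1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

With that in place (and LACP active on both ends), the member ports should show "P" rather than "I" in show port-channel summary, and Po48/vPC48 would actually be earning their keep.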
So the PortChannel isn't required, but maybe I'd need the vPC still? Or maybe I'm losing the plot and don't need either, much like this lovely post on why LACP and vSphere (ESXi) hosts are not a very good marriage.
One for another day :).