Terraforming an F5 Cluster into Azure with pesky DO

Thursday, 23 Mar 2023

Struggling to make an F5 BIG-IP Virtual Edition (VE) just cluster up already and stop giving errors like these?

Error 422: Invalid IP Address
Failed to send declaration
Error 500 invalid config - rolled back

Then you're in luck, because I too have been there, gotten the t-shirt and will one day wreak revenge upon F5 (or rather, use other Load Balancers instead out of spite). But first, we'll need some primers on the over-convoluted ecosystem that F5 use to orchestrate provisioning of their BIG-IP via IaC techniques - or as they brand it, the F5 BIG-IP Automation Toolchain. This mainly consists of the following "Extensions", which are in effect F5's equivalent of an apt-get/yum package on other Linux distributions:

  • Declarative Onboarding (DO)
  • Cloud Failover Extension (CFE)
  • Application Services 3 (AS3)
  • Cloud-init
    • This isn't strictly speaking an F5 thing, but you'll need to wrangle with it to get DO and CFE to do their thing on boot of the F5 BIG-IP VE.

The BIG-IP Azure Terraform Module

Handily (or so you'll initially think), F5 supply a Terraform Module to "rapidly" get you up and running with a single or clustered F5 BIG-IP VE node(s) in Azure, in the Terraform Registry as F5Networks/bigip-module/azure. For those of you not familiar with a Terraform Module, it's just a collection of Terraform Resources with an opinionated setup (i.e. the F5 Module deploys Azure VMs that end "-f5vm01" in their Hostname), but which can accept some configurable options - you can quickly see how the BIG-IP Module works by scrutinising its main.tf and tracing how the inputs you set in your own module "bigip" block are passed through as variables inside the Module.

The savvy amongst you will be drawn in by the custom_user_data input - essentially this is used by the BIG-IP Module to invoke a one-time Linux startup/bootstrap script (using Cloud-init under the hood) to pass-through the DO, CFE and AS3 declarations into the F5 in a rendered file which is part Bash Script, part YAML and all pain. It's probably best to start at this Cloud-init Bash Script, as that will help you understand some more relationships between variables passed-through from the tfvars file - let's take the suggested custom-onboard-big.tmpl and break down what it does. First, here's the code - basically one big Bash script:

#!/bin/bash -x

# NOTE: Startup Script is run once / initialization only (Cloud-Init behavior vs. typical re-entrant for Azure Custom Script Extension )
# For 15.1+ and above, Cloud-Init will run the script directly and can remove Azure Custom Script Extension


mkdir -p  /var/log/cloud /config/cloud /var/config/rest/downloads

mkdir -p /config/cloud

LOG_FILE=/var/log/cloud/startup-script.log
[[ ! -f $LOG_FILE ]] && touch $LOG_FILE || { echo "Run Only Once. Exiting"; exit; }
npipe=/tmp/$$.tmp
trap "rm -f $npipe" EXIT
mknod $npipe p
tee <$npipe -a $LOG_FILE /dev/ttyS0 &
exec 1>&-
exec 1>$npipe
exec 2>&1

# Run Immediately Before MCPD
/usr/bin/setdb provision.extramb 1000
/usr/bin/setdb restjavad.useextramb true

curl -o /config/cloud/do_w_admin.json -s --fail --retry 60 -m 10 -L https://raw.githubusercontent.com/F5Networks/terraform-azure-bigip-module/main/config/onboard_do.json


### write_files:
# Download or Render BIG-IP Runtime Init Config

cat << 'EOF' > /config/cloud/runtime-init-conf.yaml
---
runtime_parameters:
  - name: USER_NAME
    type: static
    value: ${bigip_username}
  - name: HOST_NAME
    type: metadata
    metadataProvider:
      environment: azure
      type: compute
      field: name
  - name: SSH_KEYS
    type: static
    value: "${ssh_keypair}"
EOF

if ${az_keyvault_authentication}
then
   cat << 'EOF' >> /config/cloud/runtime-init-conf.yaml
  - name: ADMIN_PASS
    type: secret
    secretProvider:
      environment: azure
      type: KeyVault
      vaultUrl: ${vault_url}
      secretId: ${secret_id}
pre_onboard_enabled: []
EOF
else

   cat << 'EOF' >> /config/cloud/runtime-init-conf.yaml
  - name: ADMIN_PASS
    type: static
    value: ${bigip_password}
pre_onboard_enabled: []
EOF
fi

cat /config/cloud/runtime-init-conf.yaml > /config/cloud/runtime-init-conf-backup.yaml

cat << 'EOF' >> /config/cloud/runtime-init-conf.yaml
extension_packages:
  install_operations:
    - extensionType: do
      extensionVersion: ${DO_VER}
      extensionUrl: ${DO_URL}
    - extensionType: as3
      extensionVersion: ${AS3_VER}
      extensionUrl: ${AS3_URL}
    - extensionType: ts
      extensionVersion: ${TS_VER}
      extensionUrl: ${TS_URL}
    - extensionType: cf
      extensionVersion: ${CFE_VER}
      extensionUrl: ${CFE_URL}
    - extensionType: fast
      extensionVersion: ${FAST_VER}
      extensionUrl: ${FAST_URL}
extension_services:
  service_operations:
    - extensionType: do
      type: inline
      value:
        schemaVersion: 1.0.0
        class: Device
        async: true
        Common:
          class: Tenant
          hostname: '{{{HOST_NAME}}}.com'
          myNtp:
            class: NTP
            servers:
              - 0.pool.ntp.org
            timezone: UTC
          myDns:
            class: DNS
            nameServers:
              - 168.63.129.16
          admin:
            class: User
            partitionAccess:
              all-partitions:
                role: admin
            password: '{{{ADMIN_PASS}}}'
            shell: bash
            keys:
              - '{{{SSH_KEYS}}}'
            userType: regular
          '{{{USER_NAME}}}':
            class: User
            partitionAccess:
              all-partitions:
                role: admin
            password: '{{{ADMIN_PASS}}}'
            shell: bash
            keys:
              - '{{{SSH_KEYS}}}'
            userType: regular
post_onboard_enabled: []
EOF

cat << 'EOF' >> /config/cloud/runtime-init-conf-backup.yaml
extension_services:
  service_operations:
    - extensionType: do
      type: inline
      value:
        schemaVersion: 1.0.0
        class: Device
        async: true
        Common:
          class: Tenant
          hostname: '{{{HOST_NAME}}}.com'
          myNtp:
            class: NTP
            servers:
              - 0.pool.ntp.org
            timezone: UTC
          myDns:
            class: DNS
            nameServers:
              - 168.63.129.16
          admin:
            class: User
            partitionAccess:
              all-partitions:
                role: admin
            password: '{{{ADMIN_PASS}}}'
            shell: bash
            keys:
              - '{{{SSH_KEYS}}}'
            userType: regular
          '{{{USER_NAME}}}':
            class: User
            partitionAccess:
              all-partitions:
                role: admin
            password: '{{{ADMIN_PASS}}}'
            shell: bash
            keys:
              - '{{{SSH_KEYS}}}'
            userType: regular
post_onboard_enabled: []
EOF

# # Download
#PACKAGE_URL='https://cdn.f5.com/product/cloudsolutions/f5-bigip-runtime-init/v1.1.0/dist/f5-bigip-runtime-init-1.1.0-1.gz.run'
#PACKAGE_URL='https://cdn.f5.com/product/cloudsolutions/f5-bigip-runtime-init/v1.2.0/dist/f5-bigip-runtime-init-1.2.0-1.gz.run'
for i in {1..30}; do
    curl -fv --retry 1 --connect-timeout 5 -L ${INIT_URL} -o "/var/config/rest/downloads/f5-bigip-runtime-init.gz.run" && break || sleep 10
done
# Install
bash /var/config/rest/downloads/f5-bigip-runtime-init.gz.run -- '--cloud azure'
# Run
f5-bigip-runtime-init --config-file /config/cloud/runtime-init-conf.yaml
sleep 5
f5-bigip-runtime-init --config-file /config/cloud/runtime-init-conf-backup.yaml

So let's break down what happens when this is passed by the Azure VM Extension to be run as a Bash Script and saved on the F5 BIG-IP VE itself as executable /var/lib/waagent/CustomData. If you're interested in how this ends up on the F5 Linux VM as this file, this is a good write-up of custom data and Cloud-init on Azure Virtual Machines.

Here's the skinny on that Bash Script's workings:

  • (Lines 1-20) Set this up as a Bash Script for execution, create the log and config directories, bail out if the script has already run once, and use a Named Pipe to copy stdout/stderr into both a log file and the serial console
  • (Lines 21-23) Do some performance tweaks I don't know why F5 don't just bake into their stock image
  • (Line 25) Download some factory DO as JSON (yeah I know, I said YAML earlier - it takes both because F5 hate consistency it seems) to /config/cloud/do_w_admin.json as a pre-install step
  • (Lines 31-46) Use heredoc (or multi-line strings to you and me) to generate the DO YAML file from the passed-in variables and save it within the F5 BIG-IP itself as YAML file /config/cloud/runtime-init-conf.yaml
    • You'll see two types of variable that can be passed-through here (and used anywhere within the heredoc definition, that is from Line 31 to Line 175, effectively)
      • So-called "moustache" variables are like {{{THIS}}} and refer back to the values passed in to the runtime_parameters section of the runtime-init-conf.yaml DO declaration - these effectively only reference variables locally defined within the same tmpl Bash Script file.
      • Standard Linux escape variables (strictly speaking, Terraform template interpolations that merely look like Bash variables) are like ${this} and refer back to the values passed in from the inbuilt variables definition within the templatefile definition of the custom_user_data variable in your main.tf file
        • Which in turn, are probably references back to Terraform variables such as var.INIT_URL, specified in your terraform.tfvars file (turtles all the way down)
    • When run, if you login to the F5 BIG-IP VE instance CLI (using Azure Serial Console), you can cat /config/cloud/runtime-init-conf.yaml to see the difference in how these two variables work at runtime, where
      • "Moustache" variables (like {{{THIS}}}) remain the same as you typed them initially; the replacement is done on execution of f5-bigip-runtime-init by this binary itself - so a line might still look like hostname: '{{{HOST_NAME}}}.com'
      • Standard Linux escape variables (like ${this}) have already been replaced by the text string and differ from how you typed them initially; the replacement has been done by the Terraform run itself - so maybe now look like value: password123 instead of previously being value: ${bigip_password}
  • (Lines 48-68) Do a Bash "if" test based on the value derived from Standard Linux escape variable ${az_keyvault_authentication} (true or false) - which is defined in main.tf in the templatefile definition of the custom_user_data variable (so only passed-through one layer of turtles, from main.tf into custom-onboard-big.tmpl)
    • Output the next section of F5 DO YAML into /config/cloud/runtime-init-conf.yaml based on whether this was set to true (i.e. your F5 CFE cluster password is stored in an Azure Keyvault) or false (i.e. you're just hard-setting a password in the YAML DO definition)
  • (Line 70) Make a backup of the /config/cloud/runtime-init-conf.yaml definition and save this as /config/cloud/runtime-init-conf-backup.yaml
  • (Lines 72-131) Append the F5 DO YAML soup which tells all the extensions to install (if you've used Linux, this is the equivalent of a string of apt-get install... commands, shown instead as YAML), and uses the moustache/Linux escape variables you defined earlier to setup the box - Hostname, DNS, NTP, System Users and so on
  • (Lines 133-175) Repeat what was done for the "production" /config/cloud/runtime-init-conf.yaml F5 DO YAML file above for the "backup" F5 DO YAML file located at /config/cloud/runtime-init-conf-backup.yaml
  • (Lines 177-184) Download and install the f5-bigip-runtime-init executable - which is effectively F5's version of Cloud-init
  • (Line 186) Invoke F5 Cloud-init with the /config/cloud/runtime-init-conf.yaml file - to kick in the DO, CFE and AS3 processes and make your F5 go whir now
    • (Line 188) Bonus "do that again for no particular reason" run (Cloud-init is a one-time operation, not something that runs on every reboot...)
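If the two flavours of variable still feel slippery, the ordering can be mimicked in a few lines of Bash. This is only a sketch: sed stands in for both Terraform's templatefile and f5-bigip-runtime-init (neither is actually called), and the demo-f5vm01 hostname is made up for illustration.

```shell
#!/bin/bash
# Sketch of the two substitution passes. Assumption: sed merely mimics the
# ordering; the real work is done by Terraform and f5-bigip-runtime-init.

# The template as typed in the .tmpl file. The quoted 'EOF' means Bash itself
# expands nothing inside the heredoc.
template=$(cat << 'EOF'
runtime_parameters:
  - name: ADMIN_PASS
    type: static
    value: ${bigip_password}
hostname: '{{{HOST_NAME}}}.com'
EOF
)

# Pass 1 (stand-in for the Terraform run): ${...} placeholders are filled in
# before the script ever reaches the VM.
render_terraform() {
  sed 's/\${bigip_password}/password123/'
}

# Pass 2 (stand-in for f5-bigip-runtime-init): {{{...}}} placeholders are
# resolved on the box itself. "demo-f5vm01" is a hypothetical hostname.
render_runtime_init() {
  sed 's/{{{HOST_NAME}}}/demo-f5vm01/'
}

on_disk=$(printf '%s\n' "$template" | render_terraform)
applied=$(printf '%s\n' "$on_disk" | render_runtime_init)

echo "--- what cat shows on the BIG-IP ---"
printf '%s\n' "$on_disk"
echo "--- what DO finally receives ---"
printf '%s\n' "$applied"
```

The first dump still contains the moustache placeholder but no ${...}, which is the same difference you'd spot when running cat on the real runtime-init-conf.yaml.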

So that was fun eh? You mean to say I got all that from one bag of F5 oranges?

Troy McClure squeezing those F5 YAML oranges the hard way

What are you on about these turtles for?

It's a fancy way of saying infinite recursion (or what feels like it, with the F5 DO/YAML) - this Wikipedia write-up explains it better than I can.

Under the hood

To understand some of the pain you're going to encounter (yes, there's more), it's worth understanding the internals of what really happens under the hood. That's right, there's even more fun to this story that's hidden in those extension_services stanzas - and to expand on this, we need to move away from YAML for a second and focus on the F5 BIG-IP Automation Toolchain, namely what happens when these stanzas execute in the Cloud-init YAML:

  • extension_services -> service_operations -> extensionType: do
  • extension_services -> service_operations -> extensionType: cf

Declarative Onboarding (Aga DO, DO, push pineapple...)

Somewhere in the backend, your YAML is converted into JSON and posted to an HTTP REST API endpoint. You can probe that endpoint yourself in advance: swap from the F5 BIG-IP default tmsh shell to the standard Linux bash shell, then post the content of a file you saved as do_test.json, as follows:

  1. Login to F5 via SSH or Azure Serial Console
  2. Swap to Bash prompt by typing: bash then hit Return key
  3. Save some DO-formatted JSON (like this example) as a file called do_test.json
  4. Throw it at the HTTP REST API with a curl post as follows:
curl -su admin: -d "@do_test.json" http://127.0.0.1:8100/mgmt/shared/declarative-onboarding | jq
  5. You'll get a JSON payload back, consisting first of an HTTP Status Code for the result, and also a playback of the JSON payload you posted in the declaration section

Here's an example F5 DO JSON payload you can tweak and play with:

{
  "schemaVersion": "1.0.0",
  "class": "Device",
  "async": true,
  "Common": {
    "class": "Tenant",
    "hostname": "f5vm01.test.net",
    "myDb": {
      "class": "DbVariables",
      "provision.extramb": 1000,
      "restjavad.useextramb": true,
      "dhclient.mgmt": "disable",
      "config.allow.rfc3927": "enable",
      "tm.tcpudptxchecksum": "Software-only"
    },
    "myModules": {
      "class": "Provision",
      "asm": "nominal",
      "ltm": "nominal"
    },
    "myNtp": {
      "class": "NTP",
      "servers": [
        "time.windows.com"
      ],
      "timezone": "UTC"
    },
    "myDns": {
      "class": "DNS",
      "nameServers": [
        "168.63.129.16"
      ]
    },
    "admin": {
      "class": "User",
      "partitionAccess": {
        "all-partitions": {
          "role": "admin"
        }
      },
      "shell": "bash",
      "userType": "regular",
      "keys": []
    },
    "bigipuser": {
      "class": "User",
      "partitionAccess": {
        "all-partitions": {
          "role": "admin"
        }
      },
      "shell": "bash",
      "userType": "regular",
      "keys": []
    },
    "internal": {
      "class": "VLAN",
      "interfaces": [
        {
          "name": "1.1",
          "tagged": false
        }
      ],
      "mtu": 1500,
      "tag": 4094,
      "cmpHash": "default",
      "failsafeEnabled": false,
      "failsafeAction": "failover-restart-tm",
      "failsafeTimeout": 90
    },
    "internal-self": {
      "class": "SelfIp",
      "address": "10.255.2.4/24",
      "vlan": "internal",
      "allowService": "none",
      "trafficGroup": "traffic-group-local-only"
    },
    "configSync": {
      "class": "ConfigSync",
      "configsyncIp": "/Common/internal-self/address"
    },
    "failoverAddress": {
      "class": "FailoverUnicast",
      "address": "/Common/internal-self/address",
      "port": 1026
    },
    "failoverGroup": {
      "class": "DeviceGroup",
      "type": "sync-failover",
      "members": [
        "f5vm01.test.net",
        "f5vm02.test.net"
      ],
      "owner": "/Common/failoverGroup/members/0",
      "autoSync": true,
      "saveOnAutoSync": false,
      "networkFailover": true,
      "fullLoadOnSync": false,
      "asmSync": false
    },
    "trust": {
      "class": "DeviceTrust",
      "localUsername": "admin",
      "remoteHost": "/Common/failoverGroup/members/0",
      "remoteUsername": "admin"
    }
  }
}

Generally here, 200 or 20x (where x is any number) means times are gravy, and the F5 successfully took your DO and configured itself as per your commands. Anything else and you should sit yourself down for some debugging fun. Some helpful hints:

  • It won't tell you which line your invalid IP Address is in, so good luck fishing
    • Note that in F5 land, this is a valid IP Address that effectively refers to whatever you configured the Internal NIC as: /Common/internal-self/address, or you can go for the more traditional 10.255.2.4 approach if, y'know, you like sleeping and/or seeing your kids of an evening
  • Sometimes it decides an error is not passable, and rolls back your entire config accordingly
    • It's much quicker having the F5 kick you in the balls via a "try some JSON and see if it works" approach using this method of tweaking do_test.json and POSTing to the HTTP REST Endpoint URL than forming the runtime-init-conf.yaml file from initial custom-onboard-big.tmpl and having to wait for terraform apply and related Azure VM Extension to kick in, and run f5-bigip-runtime-init all over again (2-4 minutes)
  • Not that the F5 ARM template examples make it obvious, but both nodes in an HA Active/Passive cluster need to list the failoverGroup members in the same, consistent order: (Line 1) node0.hostname.com and (Line 2) node1.hostname.com
    • If you're like me, you'll read them, see remote_host swapping order between instance01.yaml and instance02.yaml, and think "Huh, so it's specified Node0/Node1 on the Node0 DO YAML, then swaps to Node1/Node0 order on the Node1 DO YAML file" - nope, it's just that F5 actually mean "not the current node, y'know, the other one" when they say remote - meaning it changes on each node and is locally relative
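To make that "remote is relative" reading concrete, here's a minimal Bash sketch. It assumes the two hostnames from the example payload above, and the function names are mine rather than anything F5 ships; it follows my reading of the examples, not a documented rule.

```shell
#!/bin/bash
# Hostnames borrowed from the example DO payload; swap in your own.
MEMBERS=("f5vm01.test.net" "f5vm02.test.net")

# The members list comes out in the same order, whichever node asks.
failover_members() {
  printf '%s\n' "${MEMBERS[@]}"
}

# "remote" is relative: whichever member is NOT the node being configured.
remote_host_for() {
  local self=$1 m
  for m in "${MEMBERS[@]}"; do
    [[ $m != "$self" ]] && printf '%s\n' "$m"
  done
  return 0
}

# Show what each node would put in its own declaration.
for node in "${MEMBERS[@]}"; do
  echo "node=$node members=$(failover_members | paste -sd, -) remote=$(remote_host_for "$node")"
done
```

The members column is identical for both nodes, while the remote column flips, which is the pattern that looks like order-swapping at first glance.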

If you just want to check the status of the latest DO JSON invocation without supplying some fresh DO JSON to execute, then run:

curl -su admin: http://127.0.0.1:8100/mgmt/shared/declarative-onboarding | jq

(Note: The jq command that the output is piped into simply takes the JSON response and pretty-prints it as indented, multi-line JSON, rather than just-showing-it-as-one-big-block-of-illegible-text)
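Since the example declaration sets async: true, you may find yourself re-running that status check by hand until something changes. A rough polling sketch follows. It assumes (and do verify this against your DO version) that the status JSON exposes the code at .result.code and that 202 means the declaration is still being processed; the function names are hypothetical.

```shell
#!/bin/bash
# Assumption: the code lives at .result.code in the status JSON.
fetch_do_code() {
  curl -su admin: http://127.0.0.1:8100/mgmt/shared/declarative-onboarding \
    | jq -r '.result.code'
}

# 200 or 20x counts as "times are gravy".
is_success() {
  [[ $1 =~ ^20[0-9]$ ]]
}

# Poll until the code is something other than 202, print it, and return
# 0 for a 20x result, 1 for anything else, 2 on timeout.
# Args: fetch command (default fetch_do_code), max tries, seconds between tries.
wait_for_do() {
  local fetch=${1:-fetch_do_code} tries=${2:-30} delay=${3:-10} code i
  for ((i = 1; i <= tries; i++)); do
    code=$("$fetch" 2>/dev/null)
    if [[ -n $code && $code != 202 ]]; then
      printf '%s\n' "$code"
      is_success "$code"
      return
    fi
    sleep "$delay"
  done
  echo "timed out waiting for DO" >&2
  return 2
}
```

On the box itself you'd call something like wait_for_do && echo "gravy" || echo "debugging time". The fetch command is passed in by name so the loop can be tried out with a stub before pointing it at a real BIG-IP.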

Cloud Failover Extension (DO isn't an "Extension" clearly, otherwise it'd be acronym'd as "DOE")

Pretty much the same idea goes here, but there's less you can configure, and this is a good reference of some working CFE JSON. To test CFE in advance you would:

  1. Login to F5 via SSH or Azure Serial Console
  2. Swap to Bash prompt by typing: bash then hit Return key
  3. Save some CFE-formatted JSON (like this example) as a file called cfe_test.json
  4. Throw it at the HTTP REST API with a curl post as follows:
curl -su admin: -d "@cfe_test.json" http://127.0.0.1:8100/mgmt/shared/cloud-failover/declare | jq
  5. You'll get a JSON payload back, consisting first of an HTTP Status Code for the result, and also a playback of the JSON payload you posted in the declaration section

Here's an example F5 CFE JSON payload you can tweak and play with:

{
  "failoverAddresses":{
     "enabled":true,
     "scopingTags": {
        "f5_cloud_failover_label": "mydeployment"
     },
     "addressGroupDefinitions": [
        {
           "type": "networkInterfaceAddress",
           "scopingAddress": "10.0.1.100"
        },
        {
           "type": "networkInterfaceAddress",
           "scopingAddress": "10.0.1.101"
        }
     ]
  }
}

If you just want to check the status of the latest CFE JSON invocation without supplying some fresh CFE JSON to execute, then run:

curl -su admin: http://127.0.0.1:8100/mgmt/shared/cloud-failover/info | jq

(Note: The jq command that the output is piped into simply takes the JSON response and pretty-prints it as indented, multi-line JSON, rather than just-showing-it-as-one-big-block-of-illegible-text)

Bonus Timesaver - F5 Extension versions that don't work with each other

For added bonus fun, when you're looking around the Interwebs to find the correct F5 Extension RPMs (yes, at least that bit is standard Linux-like), you might stumble on a few in some of F5Networks' own GitHub repos that don't work well together at all, and in one case cause the Cloud-init process to crap out just after the DO process is done and before the CFE process even begins. Here's a list of F5 Extension versions that don't play well together and which you should avoid using:

F5 Extension | Version | URL
DO | 1.21.0 | https://github.com/F5Networks/f5-declarative-onboarding/releases/download/v1.21.0/f5-declarative-onboarding-1.21.0-3.noarch.rpm
TS | 1.20.0 | https://github.com/F5Networks/f5-telemetry-streaming/releases/download/v1.20.0/f5-telemetry-1.20.0-3.noarch.rpm
FAST | 1.9.0 | https://github.com/F5Networks/f5-appsvcs-templates/releases/download/v1.9.0/f5-appsvcs-templates-1.9.0-1.noarch.rpm
CFE | 1.8.0 | https://github.com/F5Networks/f5-cloud-failover-extension/releases/download/v1.8.0/f5-cloud-failover-1.8.0-0.noarch.rpm
Cloud-init | 1.2.1 | https://cdn.f5.com/product/cloudsolutions/f5-bigip-runtime-init/v1.2.1/dist/f5-bigip-runtime-init-1.2.1-1.gz.run
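If you'd rather have a script shout at you than rediscover this by hand, a tiny guard along these lines could sit in front of your terraform apply. It's a sketch under assumptions: I'm treating the list as one combination that misbehaves as a whole, and the function name and the runtime-init version argument are hypothetical (the template only carries INIT_URL).

```shell
#!/bin/bash
# Returns 0 when every pinned version matches the combination listed above.
# Argument order: DO, TS, FAST, CFE, runtime-init (the last is hypothetical,
# as the template only passes a URL for it).
matches_known_bad_combo() {
  local do_ver=$1 ts_ver=$2 fast_ver=$3 cfe_ver=$4 init_ver=$5
  [[ $do_ver == "1.21.0" && $ts_ver == "1.20.0" && $fast_ver == "1.9.0" \
     && $cfe_ver == "1.8.0" && $init_ver == "1.2.1" ]]
}

# Demo using the versions from the list itself.
if matches_known_bad_combo 1.21.0 1.20.0 1.9.0 1.8.0 1.2.1; then
  echo "WARNING: this set of F5 Extension versions is on the known-bad list" >&2
fi
```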

All or nothing

One big thing to note with the Cloud-init/DO combo is it's all or nothing - if any one single part of that large /config/cloud/runtime-init-conf.yaml file goes wrong during the Cloud-init process, the F5 rolls itself back to its factory state like a wet fish. Flapping. In the Azure winds. With an unknown default username and password, so you can't login to your newly-deployed localhost.localdomain to debug.

My Terraform'd F5 leaving me unable to do anything, which was nice

What you therefore might like to do, to combat this situation while working out the magic incantations you need to generate your Cloud-init YAML (what, you weren't just born knowing the F5 Schema?), is to plant a known username and password before the Cloud-init process kicks off, by inserting the relevant F5 tmsh commands into the top part of the custom-onboard-big.tmpl Bash script, before the cat /config/cloud/runtime-init-conf.yaml > /config/cloud/runtime-init-conf-backup.yaml section. Probably something like:

# Set a known user password before the DO process fails miserably
# (assumes this user already exists on the box at this point)
tmsh modify auth user bigipuser password 'password123!'

Time to hit refresh (yes, that is an F5 pun)

Much like an F5 deployed through Cloud-init, I've hit the end of my metaphorical anger threshold and it's probably time for a reset on what's left of my soul and sanity. I genuinely hope some of this helps those of you unfortunate enough to have to deploy an F5 into Azure through automation.

For the rest of you, I implore you - make better Load Balancer decisions. F5s appear to be flakier than a Cornetto from your local Ice Cream van in the summer.