Terraforming an F5 Cluster into Azure with pesky DO
Struggling to make an F5 BIG-IP Virtual Edition (VE) just cluster up already and stop giving errors like these?
Error 422: Invalid IP Address
Failed to send declaration
Error 500 invalid config - rolled back
Then you're in luck, because I too have been there, gotten the t-shirt, and will one day wreak revenge upon F5 by using other Load Balancers instead, out of spite. But first, we'll need a primer on the over-convoluted ecosystem that F5 uses to orchestrate provisioning of its BIG-IP via IaC techniques - or as they brand it, the F5 BIG-IP Automation Toolchain. This mainly consists of the following "Extensions", which are in effect F5's equivalent of an apt-get/yum package on other Linux distributions:
- Declarative Onboarding (DO)
- Cloud Failover Extension (CFE)
- Application Services 3 (AS3)
- Cloud-init
- This isn't strictly speaking an F5 thing, but you'll need to wrangle with it to get DO and CFE to do their thing on boot of the F5 BIG-IP VE.
The BIG-IP Azure Terraform Module
Handily (or so you'll initially think), F5 supply a Terraform Module to "rapidly" get you up and running with a single or clustered F5 BIG-IP VE node(s) in Azure, in the Terraform Registry as F5Networks/bigip-module/azure. For those of you not familiar with a Terraform Module, it's just a collection of Terraform Resources with an opinionated setup (e.g. the F5 Module deploys Azure VMs whose hostnames end in "-f5vm01"), but which can accept some configurable options - you can quickly see how the BIG-IP Module works by scrutinising main.tf and working through how the inputs set in the module "bigip" call on the outside are passed in as variables on the inside.
The savvy amongst you will be drawn in by the custom_user_data input - essentially this is used by the BIG-IP Module to invoke a one-time Linux startup/bootstrap script (using Cloud-init under the hood) that passes the DO, CFE and AS3 declarations through to the F5 in a rendered file which is part Bash Script, part YAML and all pain. It's probably best to start at this Cloud-init Bash Script, as that will help you understand some more of the relationships between the variables passed through from the tfvars file - let's take the suggested custom-onboard-big.tmpl and break down what it does. First, here's the code - basically one big Bash script:
#!/bin/bash -x
# NOTE: Startup Script is run once / initialization only (Cloud-Init behavior vs. typical re-entrant for Azure Custom Script Extension )
# For 15.1+ and above, Cloud-Init will run the script directly and can remove Azure Custom Script Extension
mkdir -p /var/log/cloud /config/cloud /var/config/rest/downloads
mkdir -p /config/cloud
LOG_FILE=/var/log/cloud/startup-script.log
[[ ! -f $LOG_FILE ]] && touch $LOG_FILE || { echo "Run Only Once. Exiting"; exit; }
npipe=/tmp/$$.tmp
trap "rm -f $npipe" EXIT
mknod $npipe p
tee <$npipe -a $LOG_FILE /dev/ttyS0 &
exec 1>&-
exec 1>$npipe
exec 2>&1
# Run Immediately Before MCPD
/usr/bin/setdb provision.extramb 1000
/usr/bin/setdb restjavad.useextramb true
curl -o /config/cloud/do_w_admin.json -s --fail --retry 60 -m 10 -L https://raw.githubusercontent.com/F5Networks/terraform-azure-bigip-module/main/config/onboard_do.json
### write_files:
# Download or Render BIG-IP Runtime Init Config
cat << 'EOF' > /config/cloud/runtime-init-conf.yaml
---
runtime_parameters:
  - name: USER_NAME
    type: static
    value: ${bigip_username}
  - name: HOST_NAME
    type: metadata
    metadataProvider:
      environment: azure
      type: compute
      field: name
  - name: SSH_KEYS
    type: static
    value: "${ssh_keypair}"
EOF
if ${az_keyvault_authentication}
then
cat << 'EOF' >> /config/cloud/runtime-init-conf.yaml
  - name: ADMIN_PASS
    type: secret
    secretProvider:
      environment: azure
      type: KeyVault
      vaultUrl: ${vault_url}
      secretId: ${secret_id}
pre_onboard_enabled: []
EOF
else
cat << 'EOF' >> /config/cloud/runtime-init-conf.yaml
  - name: ADMIN_PASS
    type: static
    value: ${bigip_password}
pre_onboard_enabled: []
EOF
fi
cat /config/cloud/runtime-init-conf.yaml > /config/cloud/runtime-init-conf-backup.yaml
cat << 'EOF' >> /config/cloud/runtime-init-conf.yaml
extension_packages:
  install_operations:
    - extensionType: do
      extensionVersion: ${DO_VER}
      extensionUrl: ${DO_URL}
    - extensionType: as3
      extensionVersion: ${AS3_VER}
      extensionUrl: ${AS3_URL}
    - extensionType: ts
      extensionVersion: ${TS_VER}
      extensionUrl: ${TS_URL}
    - extensionType: cf
      extensionVersion: ${CFE_VER}
      extensionUrl: ${CFE_URL}
    - extensionType: fast
      extensionVersion: ${FAST_VER}
      extensionUrl: ${FAST_URL}
extension_services:
  service_operations:
    - extensionType: do
      type: inline
      value:
        schemaVersion: 1.0.0
        class: Device
        async: true
        Common:
          class: Tenant
          hostname: '{{{HOST_NAME}}}.com'
          myNtp:
            class: NTP
            servers:
              - 0.pool.ntp.org
            timezone: UTC
          myDns:
            class: DNS
            nameServers:
              - 168.63.129.16
          admin:
            class: User
            partitionAccess:
              all-partitions:
                role: admin
            password: '{{{ADMIN_PASS}}}'
            shell: bash
            keys:
              - '{{{SSH_KEYS}}}'
            userType: regular
          '{{{USER_NAME}}}':
            class: User
            partitionAccess:
              all-partitions:
                role: admin
            password: '{{{ADMIN_PASS}}}'
            shell: bash
            keys:
              - '{{{SSH_KEYS}}}'
            userType: regular
post_onboard_enabled: []
EOF
cat << 'EOF' >> /config/cloud/runtime-init-conf-backup.yaml
extension_services:
  service_operations:
    - extensionType: do
      type: inline
      value:
        schemaVersion: 1.0.0
        class: Device
        async: true
        Common:
          class: Tenant
          hostname: '{{{HOST_NAME}}}.com'
          myNtp:
            class: NTP
            servers:
              - 0.pool.ntp.org
            timezone: UTC
          myDns:
            class: DNS
            nameServers:
              - 168.63.129.16
          admin:
            class: User
            partitionAccess:
              all-partitions:
                role: admin
            password: '{{{ADMIN_PASS}}}'
            shell: bash
            keys:
              - '{{{SSH_KEYS}}}'
            userType: regular
          '{{{USER_NAME}}}':
            class: User
            partitionAccess:
              all-partitions:
                role: admin
            password: '{{{ADMIN_PASS}}}'
            shell: bash
            keys:
              - '{{{SSH_KEYS}}}'
            userType: regular
post_onboard_enabled: []
EOF
# # Download
#PACKAGE_URL='https://cdn.f5.com/product/cloudsolutions/f5-bigip-runtime-init/v1.1.0/dist/f5-bigip-runtime-init-1.1.0-1.gz.run'
#PACKAGE_URL='https://cdn.f5.com/product/cloudsolutions/f5-bigip-runtime-init/v1.2.0/dist/f5-bigip-runtime-init-1.2.0-1.gz.run'
for i in {1..30}; do
    curl -fv --retry 1 --connect-timeout 5 -L ${INIT_URL} -o "/var/config/rest/downloads/f5-bigip-runtime-init.gz.run" && break || sleep 10
done
# Install
bash /var/config/rest/downloads/f5-bigip-runtime-init.gz.run -- '--cloud azure'
# Run
f5-bigip-runtime-init --config-file /config/cloud/runtime-init-conf.yaml
sleep 5
f5-bigip-runtime-init --config-file /config/cloud/runtime-init-conf-backup.yaml
So let's break down what happens when this is passed by the Azure VM Extension to be run as a Bash Script and saved on the F5 BIG-IP VE itself as the executable /var/lib/waagent/CustomData. If you're interested in how this ends up on the F5 Linux VM as this file, this is a good write-up of custom data and Cloud-init on Azure Virtual Machines.
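One detail of the script is easier to see in isolation before the full breakdown: the run-once guard near the top, which is what stops a re-entrant Azure Custom Script Extension from running the whole thing twice. Here's a minimal sketch of that pattern, assuming a throwaway temp path rather than the real /var/log/cloud/startup-script.log; the function name run_once_guard is made up for illustration, as the original does this inline.

```shell
#!/bin/bash
# Sketch of the script's run-once guard, pointed at a temp file (assumption)
# instead of /var/log/cloud/startup-script.log so it is safe to try anywhere.
LOG_FILE="$(mktemp -d)/startup-script.log"

# run_once_guard is a hypothetical wrapper name; the original does this inline.
run_once_guard() {
  if [[ -f "$LOG_FILE" ]]; then
    echo "Run Only Once. Exiting"
    return 1
  fi
  touch "$LOG_FILE"
  echo "First run, carrying on"
}

# The first call creates the log file, the second one finds it and bails out.
first=$(run_once_guard)
second=$(run_once_guard) || true
echo "$first"
echo "$second"
```

The practical upshot is that if you want the startup script to run again on the same box, the log file has to go first.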
Here's the skinny on that Bash Script's workings:
- (Lines 1-20) Set this up as a Bash Script for execution, and set up some of the Linux log, config and Named Pipe operations so that everything the script prints to stdout/stderr lands in both the log file and the serial console
- (Lines 21-23) Do some performance tweaks I don't know why F5 don't just bake into their stock image
- (Line 25) Fetch some factory DO as JSON (yeah I know, I said YAML earlier - it takes both because F5 hate consistency it seems) as a pre-install
- (Lines 31-46) Use heredoc (or multi-line strings to you and me) to generate the DO YAML file from the passed-in variables and save it within the F5 BIG-IP itself as the YAML file /config/cloud/runtime-init-conf.yaml
  - You'll see two types of variable that can be passed through here (and used anywhere within the heredoc definition, that is from Line 31 to Line 175, effectively)
    - So-called "moustache" variables look like {{{THIS}}} and refer back to the values passed in to the runtime_parameters section of the runtime-init-conf.yaml DO declaration - these effectively only reference variables locally defined within the same tmpl Bash Script file.
    - Standard Linux escape variables look like ${this} and refer back to the values passed in from the inbuilt variables definition within the templatefile definition of the custom_user_data variable in your main.tf file
      - Which in turn, are probably references back to Terraform variables such as var.INIT_URL, specified in your terraform.tfvars file (turtles all the way down)
  - When run, if you log in to the F5 BIG-IP VE instance CLI (using Azure Serial Console), you can cat /config/cloud/runtime-init-conf.yaml to see the difference in how these two variables work at runtime, where:
    - "Moustache" variables (like {{{THIS}}}) remain the same as you typed them initially; the replacement is done on execution of f5-bigip-runtime-init by this binary itself - so they may still look like hostname: '{{{HOST_NAME}}}.com'
    - Standard Linux escape variables (like ${this}) have already been replaced by the text string and differ from how you typed them initially; the replacement has been done by the Terraform run itself - so they may now look like value: password123 instead of previously being value: ${bigip_password}
- (Lines 48-68) Do a Bash "if" statement based on the value derived from the Standard Linux escape variable ${az_keyvault_authentication} (true or false) - which is defined in main.tf in the templatefile definition of the custom_user_data variable (so only passed through one layer of turtles, from main.tf into custom-onboard-big.tmpl)
  - Output the next section of F5 DO YAML into /config/cloud/runtime-init-conf.yaml based on whether this was set to true (i.e. your F5 CFE cluster password is stored in an Azure Keyvault) or false (i.e. you're just hard-setting a password in the YAML DO definition)
- (Line 70) Make a backup of the /config/cloud/runtime-init-conf.yaml definition and save this as /config/cloud/runtime-init-conf-backup.yaml
- (Lines 72-131) Append the F5 DO YAML soup which tells all the extensions to install (if you've used Linux, this is the equivalent of a string of apt-get install... commands, shown instead as YAML), and uses the moustache/Linux escape variables you defined earlier to set up the box - Hostname, DNS, NTP, System Users and so on
- (Lines 133-175) Repeat what was done for the "production" /config/cloud/runtime-init-conf.yaml F5 DO YAML file above for the "backup" F5 DO YAML file located at /config/cloud/runtime-init-conf-backup.yaml
- (Lines 177-184) Download and install the f5-bigip-runtime-init executable - which is effectively F5's version of Cloud-init
- (Line 186) Invoke F5 Cloud-init with the /config/cloud/runtime-init-conf.yaml file - to kick in the DO, CFE and AS3 processes and make your F5 go whir now
  - (Line 188) Bonus "do that again for no particular reason" run (Cloud-init is a one-time operation, not something that runs on every reboot...)
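The difference between the two variable types can be reproduced away from the F5. The sketch below is an illustration only: terraform_render is a made-up stand-in (a one-line sed) for what the Terraform render does to the template, and password123 is the same dummy value used above. The quoted 'EOF' is the same trick the real script uses, so Bash itself touches nothing.

```shell
#!/bin/bash
# Write a two-line template with a quoted heredoc, as the real script does.
# The quoted 'EOF' stops Bash expanding ${...}, so both forms survive as typed.
template_file=$(mktemp)
cat << 'EOF' > "$template_file"
value: ${bigip_password}
hostname: '{{{HOST_NAME}}}.com'
EOF

# Hypothetical stand-in for the Terraform render step: only ${...} is replaced.
terraform_render() {
  sed 's/\${bigip_password}/password123/'
}

rendered=$(terraform_render < "$template_file")
printf '%s\n' "$rendered"
```

The first line comes out with the password filled in, while the moustache line is left exactly as typed, waiting for f5-bigip-runtime-init to deal with it later.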
So that was fun eh? You mean to say I got all that from one bag of F5 oranges?
What are you on about these turtles for?
It's a fancy way of saying (with the F5 DO/YAML, what feels like) infinite recursion - this Wikipedia write-up explains it better than I can.
Under the hood
To understand some of the pain you're going to encounter (yes, there's more), it's worth understanding the internals of what really happens under the hood. That's right, there's even more fun to this story that's hidden in those extension_services stanzas - and to expand on this, we need to move away from YAML for a second and focus on the F5 BIG-IP Automation Toolchain, namely what happens when these stanzas execute in the Cloud-init YAML:
extension_services -> service_operations -> extensionType: do
extension_services -> service_operations -> extensionType: cf
Declarative Onboarding (Aga DO, DO, push pineapple...)
Somewhere in the backend, your YAML is converted into JSON and POSTed to an HTTP REST API endpoint - specifically one you can probe yourself in advance by posting the content of a file you saved as do_test.json, after swapping from the F5 BIG-IP default tmsh shell to the standard Linux bash shell, as follows:
- Log in to the F5 via SSH or Azure Serial Console
- Swap to the Bash prompt by typing bash, then hit the Return key
- Save some DO-formatted JSON (like this example) as a file called do_test.json
- Throw it at the HTTP REST API with a curl POST as follows:
  curl -su admin: -d "@do_test.json" http://127.0.0.1:8100/mgmt/shared/declarative-onboarding | jq
- You'll get a JSON payload back, consisting first of an HTTP Status Code for the result, and also a playback of the JSON payload you posted in the declaration section
Here's an example F5 DO JSON payload you can tweak and play with:
{
"schemaVersion": "1.0.0",
"class": "Device",
"async": true,
"Common": {
"class": "Tenant",
"hostname": "f5vm01.test.net",
"myDb": {
"class": "DbVariables",
"provision.extramb": 1000,
"restjavad.useextramb": true,
"dhclient.mgmt": "disable",
"config.allow.rfc3927": "enable",
"tm.tcpudptxchecksum": "Software-only"
},
"myModules": {
"class": "Provision",
"asm": "nominal",
"ltm": "nominal"
},
"myNtp": {
"class": "NTP",
"servers": [
"time.windows.com"
],
"timezone": "UTC"
},
"myDns": {
"class": "DNS",
"nameServers": [
"168.63.129.16"
]
},
"admin": {
"class": "User",
"partitionAccess": {
"all-partitions": {
"role": "admin"
}
},
"shell": "bash",
"userType": "regular",
"keys": []
},
"bigipuser": {
"class": "User",
"partitionAccess": {
"all-partitions": {
"role": "admin"
}
},
"shell": "bash",
"userType": "regular",
"keys": []
},
"internal": {
"class": "VLAN",
"interfaces": [
{
"name": "1.1",
"tagged": false
}
],
"mtu": 1500,
"tag": 4094,
"cmpHash": "default",
"failsafeEnabled": false,
"failsafeAction": "failover-restart-tm",
"failsafeTimeout": 90
},
"internal-self": {
"class": "SelfIp",
"address": "10.255.2.4/24",
"vlan": "internal",
"allowService": "none",
"trafficGroup": "traffic-group-local-only"
},
"configSync": {
"class": "ConfigSync",
"configsyncIp": "/Common/internal-self/address"
},
"failoverAddress": {
"class": "FailoverUnicast",
"address": "/Common/internal-self/address",
"port": 1026
},
"failoverGroup": {
"class": "DeviceGroup",
"type": "sync-failover",
"members": [
"f5vm01.test.net",
"f5vm02.test.net"
],
"owner": "/Common/failoverGroup/members/0",
"autoSync": true,
"saveOnAutoSync": false,
"networkFailover": true,
"fullLoadOnSync": false,
"asmSync": false
},
"trust": {
"class": "DeviceTrust",
"localUsername": "admin",
"remoteHost": "/Common/failoverGroup/members/0",
"remoteUsername": "admin"
}
}
}
Generally here, 200 or 20x (where x is any number) means times are gravy, and the F5 successfully took your DO and configured itself as per your commands (with async set to true, as in the example, a 202 means the declaration has been accepted for processing rather than finished). Anything else and you should sit yourself down for some debugging fun. Some helpful hints:
- It won't tell you which line your invalid IP Address is in, so good luck fishing
  - Note that in F5 land, this is a valid IP Address that effectively refers to whatever you configured the Internal NIC as: /Common/internal-self/address, or you can go for the more traditional 10.255.2.4 approach if, y'know, you like sleeping and/or seeing your kids of an evening
- Sometimes it decides an error is not passable, and rolls back your entire config accordingly
  - It's much quicker having the F5 kick you in the balls via a "try some JSON and see if it works" approach, using this method of tweaking do_test.json and POSTing to the HTTP REST Endpoint URL, than forming the runtime-init-conf.yaml file from the initial custom-onboard-big.tmpl and having to wait for terraform apply and the related Azure VM Extension to kick in and run f5-bigip-runtime-init all over again (2-4 minutes)
- Not that the F5 ARM template examples make it obvious, but on both nodes in an HA Active/Passive cluster, it wants the failoverGroup members listed in the same consistent order: (Line 1) node0.hostname.com and (Line 2) node1.hostname.com
  - If you're like me, you'll read them and the order-swapping of remote_host between instance01.yaml and instance02.yaml and think "Huh, so it's specified Node0/Node1 on Node0 DO YAML, then swaps to Node1/Node0 order on Node1 DO YAML file" - nope, it's just that F5 actually mean "not the current node, y'know, the other one" when they say remote - meaning it changes each time and is locally relative
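Since DO won't point at the offending line for an invalid IP Address, one cheap trick is to list every address-like field with its line number and eyeball them. This is a rough sketch, not an F5 tool: list_address_fields is a made-up helper name, and the three key names it searches for are just the ones that appear in the example payload above, so extend the pattern to suit your own declaration.

```shell
#!/bin/bash
# list_address_fields is a hypothetical helper: print "line:content" for every
# address-like key found in the given declaration file.
list_address_fields() {
  grep -nE '"(address|configsyncIp|remoteHost)"[[:space:]]*:' "$1"
}

# Tiny sample file for the demo; on the BIG-IP you would pass do_test.json instead.
sample=$(mktemp)
cat << 'EOF' > "$sample"
{
"internal-self": {
"class": "SelfIp",
"address": "10.255.2.4/24"
},
"configSync": {
"configsyncIp": "/Common/internal-self/address"
}
}
EOF

found=$(list_address_fields "$sample")
echo "$found"
```

Against the real file that would be list_address_fields do_test.json, which at least narrows the fishing trip down to a handful of lines.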
If you just want to check the status of the latest DO JSON invocation without supplying some fresh DO JSON to execute, then run:
curl -su admin: http://127.0.0.1:8100/mgmt/shared/declarative-onboarding | jq
(Note: The jq command that the output is piped into simply takes the JSON response and pretty-prints it as indented, multi-line JSON, rather than just-showing-it-as-one-big-block-of-illegible-text)
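Because the example declaration sets async to true, you'll likely end up running that status check several times in a row. Here's a sketch of a polling loop that does it for you. It rests on assumptions worth checking against your own responses: that the status string lives at .result.status in the JSON, and that it reads RUNNING while DO is still working. Both function names are made up for illustration.

```shell
#!/bin/bash
# fetch_do_status prints the status string of the latest DO task.
# Assumption: the response carries it at .result.status (e.g. RUNNING, OK, ERROR).
fetch_do_status() {
  curl -su admin: http://127.0.0.1:8100/mgmt/shared/declarative-onboarding \
    | jq -r '.result.status'
}

# wait_for_do TRIES DELAY: poll until the status is no longer RUNNING,
# print the final status, or print TIMEOUT and fail after TRIES attempts.
wait_for_do() {
  local tries=$1 delay=$2 status i
  for ((i = 0; i < tries; i++)); do
    status=$(fetch_do_status)
    if [[ -n $status && $status != "RUNNING" ]]; then
      echo "$status"
      return 0
    fi
    sleep "$delay"
  done
  echo "TIMEOUT"
  return 1
}
```

On the BIG-IP, from the bash prompt, something like wait_for_do 30 10 would check every 10 seconds for up to 5 minutes.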
Cloud Failover Extension (DO isn't an "Extension" clearly, otherwise it'd be acronym'd as "DOE")
Pretty much the same idea goes here, but there's less you can configure, and this is a good reference of some working CFE JSON. In this case, to test CFE in advance you would:
- Log in to the F5 via SSH or Azure Serial Console
- Swap to the Bash prompt by typing bash, then hit the Return key
- Save some CFE-formatted JSON (like this example) as a file called cfe_test.json
- Throw it at the HTTP REST API with a curl POST as follows:
  curl -su admin: -d "@cfe_test.json" http://127.0.0.1:8100/mgmt/shared/cloud-failover/declare | jq
- You'll get a JSON payload back, consisting first of an HTTP Status Code for the result, and also a playback of the JSON payload you posted in the declaration section
Here's an example F5 CFE JSON payload you can tweak and play with:
{
"failoverAddresses":{
"enabled":true,
"scopingTags": {
"f5_cloud_failover_label": "mydeployment"
},
"addressGroupDefinitions": [
{
"type": "networkInterfaceAddress",
"scopingAddress": "10.0.1.100"
},
{
"type": "networkInterfaceAddress",
"scopingAddress": "10.0.1.101"
}
]
}
}
If you just want to check the status of the latest CFE JSON invocation without supplying some fresh CFE JSON to execute, then run:
curl -su admin: http://127.0.0.1:8100/mgmt/shared/cloud-failover/info | jq
(Note: As before, jq simply pretty-prints the JSON response as indented, multi-line JSON, rather than just-showing-it-as-one-big-block-of-illegible-text)
Bonus Timesaver - F5 Extension versions that don't work with each other
For added bonus fun, when you're looking around the Interwebs to find the correct F5 Extension RPMs (yes, at least that bit is standard Linux-like), you might stumble on a few in some of F5Networks' own GitHub repos that don't work well together at all, and in one case cause the Cloud-init process to crap out just after the DO process is done and before the CFE process even begins. Here's a list of F5 Extension versions that don't play well together and which you should avoid using:
F5 Extension | Version | URL |
---|---|---|
DO | 1.21.0 | https://github.com/F5Networks/f5-declarative-onboarding/releases/download/v1.21.0/f5-declarative-onboarding-1.21.0-3.noarch.rpm |
TS | 1.20.0 | https://github.com/F5Networks/f5-telemetry-streaming/releases/download/v1.20.0/f5-telemetry-1.20.0-3.noarch.rpm |
FAST | 1.9.0 | https://github.com/F5Networks/f5-appsvcs-templates/releases/download/v1.9.0/f5-appsvcs-templates-1.9.0-1.noarch.rpm |
CFE | 1.8.0 | https://github.com/F5Networks/f5-cloud-failover-extension/releases/download/v1.8.0/f5-cloud-failover-1.8.0-0.noarch.rpm |
Cloud-init | 1.2.1 | https://cdn.f5.com/product/cloudsolutions/f5-bigip-runtime-init/v1.2.1/dist/f5-bigip-runtime-init-1.2.1-1.gz.run |
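If you pin these versions in variables, a small guard can shout before you burn another deploy cycle. This is only a sketch: the flagged map simply copies the versions from the table above, is_flagged is a made-up helper, INIT is my own shorthand key for the Cloud-init row, and the example pins are placeholders rather than recommendations. It needs Bash 4 or later for associative arrays.

```shell
#!/bin/bash
# Versions copied from the table above, keyed by extension name.
declare -A flagged=(
  [DO]="1.21.0"
  [TS]="1.20.0"
  [FAST]="1.9.0"
  [CFE]="1.8.0"
  [INIT]="1.2.1"
)

# is_flagged EXT VER is a hypothetical helper: succeeds when VER is the
# version listed against EXT in the table.
is_flagged() {
  [[ ${flagged[$1]:-} == "$2" ]]
}

# Example pins; these values are placeholders, not recommendations.
DO_VER="1.21.0"
CFE_VER="1.9.0"

if is_flagged DO "$DO_VER"; then
  echo "DO ${DO_VER} is on the avoid list"
fi
if ! is_flagged CFE "$CFE_VER"; then
  echo "CFE ${CFE_VER} is not on the avoid list"
fi
```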
All or nothing
One big thing to note with the Cloud-init/DO combo is it's all or nothing - if any one single part of that large /config/cloud/runtime-init-conf.yaml file goes wrong during the Cloud-init process, the F5 rolls itself back to its factory state like a wet fish. Flapping. In the Azure winds. With an unknown default username and password, so you can't log in to your newly-deployed localhost.localdomain to debug.
What you therefore might like to do, to combat this situation while working out the magic incantations you need to generate your Cloud-init YAML (what, you weren't just born knowing the F5 Schema?), is to plant a known username and password before the Cloud-init process kicks off, by inserting the relevant F5 tmsh commands into the top part of the custom-onboard-big.tmpl Bash script, before the cat /config/cloud/runtime-init-conf.yaml > /config/cloud/runtime-init-conf-backup.yaml section. Probably something like:
# Set a known user password before the DO process fails miserably
# (assumes the bigipuser account already exists on the box)
tmsh modify auth user bigipuser password 'password123!'
Time to hit refresh (yes, that is an F5 pun)
Much like an F5 deployed through Cloud-init, I've hit the end of my metaphorical anger threshold and it's probably time for a reset on what's left of my soul and sanity. I genuinely hope some of this helps those of you unfortunate enough to have to deploy an F5 into Azure through automation.
For the rest of you, I implore you - make better Load Balancer decisions. F5s appear to be flakier than a Cornetto from your local Ice Cream van in the summer.