WiPi Monitoring with a Raspberry Pi WLAN Device

Saturday, 04 Jul 2020

The idea

I wanted to monitor the quality of the Wireless in my Office to emulate the real-world experience of some of our End Users, as I was struggling to correlate what I was seeing on the Wireless LAN Controller (WLC)'s Access Point (AP) monitoring stats with the poor experience being reported. Simplistically, this is taking the magic of a Raspberry Pi, a bit of JSON and a Log Analysis Stack (I chose Splunk, I could have used ELK Stack, or Logz.io or...) that gives you pretty dashboards of the data.

The gear

The steps

Any values below <like_this> are variables to show an example value, specific to your installation; for instance <vps-server.com> is whatever DNS Domain Name/Dynamic DNS Domain Name/Public IP Address that your VPS uses.

VPS Setup

We'll start with the VPS setup, from a high-level overview; if you're using a Provider like Vultr you should definitely harden your VPS and configure the Firewall/WAF to only allow your WiPi to send HTTP JSON feeds.

  1. Install your Linux distribution of choice when you boot your VPS (I'm a Debian man)
  2. Install Splunk onto your VPS, or more concisely:
    1. Download the RPM from Splunk's website onto your VPS (I ended up downloading locally, and then uploading the RPM via SCP to the VPS with scp splunk.rpm admin@vps-server.com:/tmp/)
    2. Install the RPM from the temporary directory:
      • rpm -i /tmp/splunk.rpm
    3. Run Splunk and accept the Terms from the /opt/splunk directory:
      • /opt/splunk/bin/splunk start
  3. Set up your VPS Firewall/WAF to allow tcp/8000 and tcp/8088 (and probably tcp/22 so you can SSH into it)
  4. I'd suggest using a Domain Name to point at it, or a Free Dynamic DNS Service such as DDNSS.de, and pointing an A record at your VPS' Public IP
  5. Harden your VPS by installing Fail2Ban and other similar tools

Splunk Setup

I'm focusing on Splunk because that's what I used, but similar steps will exist for ELK or Logz.io. The main advantage of Splunk is that it's free and quick; the Free version does have limitations, however - such as 500 MB of Indexed Data per day and a time-limit on concurrent Search Queries used in your Dashboards, i.e. the number of widgets you can use. There's much more that you can do, but I'll get you going with some basics.

  1. Create an Index from Settings -> Data -> Index -> New Index
    1. I called mine wipi_monitoring and accepted the defaults
    2. Indexes store the Events, so you can reference the data in this Index with a SQL-like Query, like this, from the Search & Reporting App on the Homepage:
      • index="wipi_monitoring"
  2. Create an HTTP Event Collector (HEC) to receive the JSON payload from the WiPi
    1. Go to Settings -> Data -> Data Inputs -> HTTP Event Collector -> Add new
    2. Give it a Name (I went for WiPi Monitoring) and on the Next screen, associate it with the wipi_monitoring Index you made earlier
      1. You can also tell it the Source Type is _json to speed processing up
      2. Make a note of the API Key it generates, you'll need this later
  3. Turn on HEC (it doesn't auto-enable) from the Global Settings -> Enable -> Save option, next to the New Token option within Settings -> Data -> Data Inputs -> HTTP Event Collector
  4. Create a simple Splunk Dashboard, from Splunk -> Search & Reporting -> Dashboards -> Create New Dashboard
    1. Most of my Panels are either Line Chart or Single Values, here's some of the example Searches used for them, style them how you want:
      • (Uptime Panel, Single Values) Search: index="wipi_monitoring" | timechart max(uptime)
      • (BBC.co.uk Ping RTT Panel, Line Chart) Search: index="wipi_monitoring" | timechart max(ping_bbc_avg)
      • (WLAN MAC Change Count Panel, Line Chart) Search: index="wipi_monitoring" | timechart distinct_count(wlan_ap_mac)
        • This returns a count of unique AP MAC Addresses seen (i.e. if it goes up from one, you've roamed between AP Coverage Areas)
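
To make the last Search a bit more concrete, here's a rough Python sketch of what a per-bucket distinct count boils down to. It isn't Splunk code; the function name, the 15-minute bucket size and the sample events are all made up for illustration.

```python
from collections import defaultdict


def distinct_aps_per_bucket(events, span_seconds=900):
    """Count unique AP MACs seen in each time bucket.

    events: iterable of (epoch_seconds, ap_mac) tuples.
    Returns {bucket_start_epoch: unique_mac_count}.
    """
    buckets = defaultdict(set)
    for ts, mac in events:
        # Round the timestamp down to the start of its bucket
        bucket_start = int(ts) - (int(ts) % span_seconds)
        buckets[bucket_start].add(mac)
    return {start: len(macs) for start, macs in sorted(buckets.items())}


# Hypothetical reports, one every 5 minutes; the WiPi roams in the second bucket
sample = [
    (0, "00:11:22:33:44:55"),
    (300, "00:11:22:33:44:55"),
    (600, "00:11:22:33:44:55"),
    (900, "00:11:22:33:44:55"),
    (1200, "00:11:22:33:44:66"),
    (1500, "00:11:22:33:44:66"),
]
print(distinct_aps_per_bucket(sample))  # {0: 1, 900: 2}
```

One thing this makes obvious: with a single report every 5 minutes, a bucket has to span several reports before the count can go above one, so the Line Chart is most useful over a wider time range or with a larger span.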

You should now be able to login to your Splunk instance at http://<vps-server.com>:8000.

Make sure you can send HTTP to your Splunk instance on Port 8088 (or whatever <HEC Port> you picked otherwise); a quick way of checking this is using Telnet to see if it connects at all, i.e. telnet <vps-server.com> 8088.
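
If Telnet isn't installed on the Pi, the same "does it connect at all" check can be sketched in a few lines of Python 3 using only the standard library. The function name port_is_open is my own invention; it only proves the TCP handshake completes, not that the HEC accepts your token.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False
```

Calling port_is_open('<vps-server.com>', 8088) with your own host name should return True once the Firewall/WAF rule is in place.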

The script

The following script will perform a series of checks (i.e. Ping a host, or TCP-connect to a host), time how long each took, and then consolidate the results into a JSON payload, similar to the below, and finally push this to Splunk via the HTTP Event Collector as frequently as scheduled with your Cron job. It also scrapes information from the WiPi's WLAN interface, such as RTS/CTS issues; signal quality; current bitrate; AP MACs seen and so on.

You can customise the variables towards the top of the script with your values; don't forget to replace <vps-server.com> with your VPS Server's IP Address/DNS Name; <your_location> with a meaningful Location String to you and <your_splunk_api_token> with your Splunk HEC API Token from earlier on.

JSON Payload example

{
 "dns_ms_onedrive": 101,
 "host": "wipi1",
 "http_ms_teams": 220,
 "http_google": 500,
 "location": "Some House",
 "ping_bbc_avg": 22,
 "ping_bbc_loss": "0%",
 "ping_bbc_max": 24,
 "ping_default_gateway_avg": 13,
 "ping_default_gateway_loss": "0%",
 "ping_default_gateway_max": 47,
 "uptime": 417.7,
 "wlan_ap_mac": "00:11:22:33:44:55",
 "wlan_bitrate": "72.2 Mb/s",
 "wlan_fragment_threshold": "off",
 "wlan_invalid": 0,
 "wlan_link_quality": "49/70",
 "wlan_missed_beacon": 0,
 "wlan_rts_threshold": "off",
 "wlan_rx_invalid_crypt": 0,
 "wlan_rx_invalid_frag": 0,
 "wlan_rx_invalid_nwid": 0,
 "wlan_signal": "-61 dBm",
 "wlan_tx_excessive_retries": 10,
 "wlan_tx_power": "31 dBm"
}
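
A few of these fields (bitrate, signal, Tx power, link quality and the loss percentages) are strings with units attached, which can be awkward to chart. As a sketch only, assuming you'd rather send plain numbers alongside them, something like the following could run before the payload is posted. The helper names and the extra field names (_mbps, _dbm, _pct, _ratio) are hypothetical, not part of the script further down.

```python
import re

# Matches the first integer or decimal number, with an optional minus sign
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

# Hypothetical mapping of payload field -> suffix for the added numeric field
_UNIT_FIELDS = (
    ("wlan_bitrate", "_mbps"),
    ("wlan_signal", "_dbm"),
    ("wlan_tx_power", "_dbm"),
    ("ping_bbc_loss", "_pct"),
    ("ping_default_gateway_loss", "_pct"),
)


def leading_number(value):
    """Return the first number found in e.g. '72.2 Mb/s', or None."""
    match = _NUMBER.search(str(value))
    return float(match.group()) if match else None


def normalise(payload):
    """Return a copy of payload with extra numeric fields added."""
    out = dict(payload)
    for key, suffix in _UNIT_FIELDS:
        if key in payload:
            out[key + suffix] = leading_number(payload[key])
    if "wlan_link_quality" in payload:
        # '49/70' becomes a ratio between 0 and 1
        current, maximum = str(payload["wlan_link_quality"]).split("/")
        out["wlan_link_quality_ratio"] = round(float(current) / float(maximum), 3)
    return out


example = normalise({
    "wlan_bitrate": "72.2 Mb/s",
    "wlan_signal": "-61 dBm",
    "wlan_link_quality": "49/70",
    "ping_bbc_loss": "0%",
})
print(example["wlan_bitrate_mbps"], example["wlan_signal_dbm"])  # 72.2 -61.0
```

The original string fields are left untouched, so any existing Searches keep working; the numeric copies are just there for things like timechart max() or avg().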

Cron job example

Adding this to /etc/crontab would cause the script to be run every 5 minutes. Change admin to the Admin User of your Pi (which you should change from the factory pi username for security reasons):

# WiPi Monitoring report back to Splunk
*/5 *   * * *   admin    python /opt/wipi-monitoring/main.py >> /opt/wipi-monitoring/main.log 2>&1

Python Script

This will require you to pip install netifaces dnspython requests first.

# Author: notworkd.com
# Date: 19-Jun-2020
# Description: Monitor WiFi data and send back to WiPi Monitoring Dashboard
import netifaces
import os
import subprocess
import requests
import json
import dns.resolver
import time
import datetime

# Define constants
SPLUNK_SERVER = 'https://<vps-server.com>:8088'
SPLUNK_API_KEY = '<your_splunk_api_token>'
WIPI_LOCATION = '<your_location>'

# Functions
# Update Splunk HTTP Event Collector
def updateSplunkHec(**data):
 url = SPLUNK_SERVER + '/services/collector'
 post = {
  "event": data
 }
 r = requests.post(url, json=post, headers={"Authorization":"Splunk "+SPLUNK_API_KEY}, verify=False)
 print(r.text)
 if r.status_code == 200:
  return True
 else:
  return False

# Get Hostname of this WiPi
def getHostname():
 hostname = os.uname()[1]
 return hostname

# Get Default Gateway for WiFi Adapter
def getDefaultGateway():
 gws = netifaces.gateways()
 return gws['default'][netifaces.AF_INET][0]

# Get ICMP Ping response time
def getIcmpPing(host, count=5):
 cmd = "ping -c {} {}".format(count, host).split(' ')
 try:
  output = subprocess.check_output(cmd).decode().strip()
  lines = output.split("\n")
  total = lines[-2].split(',')[3].split()[1]
  loss = lines[-2].split(',')[2].split()[0]
  timing = lines[-1].split()[3].split('/')
  return {
   'type': 'rtt',
   'min': float(timing[0]),
   'avg': float(timing[1]),
   'max': float(timing[2]),
   'mdev': float(timing[3]),
   'total': str(total),
   'loss': str(loss),
  }
 except Exception as e:
  print(e)
  return None

# Get HTTP Connect response time
def getHttpConnect(url, timeout):
 r = requests.get(url, timeout=timeout)
 return int(1000 * round(r.elapsed.total_seconds(), 2))

# Get DNS response time
def getDnsResolve(fqdn):
 answers = dns.resolver.query(fqdn, 'a')
 return int(1000 * answers.response.time)

# Get WLAN Interface Stats
def getWlanStats(adapter):
 cmd = "iwconfig " + adapter
 try:
  output = subprocess.check_output(cmd, shell=True).decode().strip()
  lines = output.split("\n")
  frequency = lines[-7].split('  ')[6].split(':')[1]
  ap = lines[-7].split('  ')[7].split(': ')[1]
  bitrate = lines[-6].split('  ')[5].split('=')[1]
  txpwr = lines[-6].split('  ')[6].split('=')[1]
  rtsthrsh = lines[-5].split('  ')[6].split(':')[1]
  frgthrsh = lines[-5].split('  ')[7].split(':')[1]
  link = lines[-3].split('  ')[5].split('=')[1]
  snr = lines[-3].split('  ')[6].split('=')[1]
  rxinnw = lines[-2].split('  ')[5].split(':')[1]
  rxincr = lines[-2].split('  ')[6].split(':')[1]
  rxinfr = lines[-2].split('  ')[7].split(':')[1]
  txrtry = lines[-1].split('  ')[5].split(':')[1]
  invalid = lines[-1].split('  ')[6].split(':')[1]
  missedbcn = lines[-1].split('  ')[7].split(':')[1]
  return {
   'frequency': frequency,
   'access_point': ap,
   'bitrate': bitrate,
   'tx_power': txpwr,
   'rts_threshold': rtsthrsh,
   'fragment_threshold': frgthrsh,
   'link_quality': link,
   'signal': snr,
   'rx_invalid_nwid': rxinnw,
   'rx_invalid_crypt': rxincr,
   'rx_invalid_frag': rxinfr,
   'tx_excessive_retries': txrtry,
   'invalid': invalid,
   'missed_beacon': missedbcn
  }
 except Exception as e:
  print(e)
  return None

# Get Device Uptime
def getUptime():
 cmd = "awk '{print $0/60;}' /proc/uptime"
 try:
  output = subprocess.check_output(cmd, shell=True).decode().strip()
  return output
 except Exception as e:
  print(e)
  return None

# Main program
# Initialise variables
output = {'host': getHostname(), 'location': WIPI_LOCATION, 'uptime': getUptime()}

# Ping Default Gateway (max, avg and loss)
ping = getIcmpPing(getDefaultGateway())
output.update({'ping_default_gateway_max': int(ping['max'])})
output.update({'ping_default_gateway_avg': int(ping['avg'])})
output.update({'ping_default_gateway_loss': str(ping['loss'])})

# Ping BBC
ping = getIcmpPing('www.bbc.co.uk')
output.update({'ping_bbc_max': int(ping['max'])})
output.update({'ping_bbc_avg': int(ping['avg'])})
output.update({'ping_bbc_loss': str(ping['loss'])})

# HTTP Connect Google
http = getHttpConnect('https://www.google.co.uk', 30)
output.update({'http_google': int(http)})

# HTTP Connect Microsoft Teams
http = getHttpConnect('https://teams.microsoft.com', 30)
output.update({'http_ms_teams': int(http)})

# DNS Resolve OneDrive
# (named dns_ms so it doesn't overwrite the imported dns module)
dns_ms = getDnsResolve('sharepoint.com')
output.update({'dns_ms_onedrive': int(dns_ms)})

# Get WiFi Stats
wifi = getWlanStats('wlan0')
output.update({'wlan_ap_mac': str(wifi['access_point'])})
output.update({'wlan_bitrate': str(wifi['bitrate'])})
output.update({'wlan_tx_power': str(wifi['tx_power'])})
output.update({'wlan_rts_threshold': str(wifi['rts_threshold'])})
output.update({'wlan_fragment_threshold': str(wifi['fragment_threshold'])})
output.update({'wlan_link_quality': str(wifi['link_quality'])})
output.update({'wlan_signal': str(wifi['signal'])})
output.update({'wlan_rx_invalid_nwid': int(wifi['rx_invalid_nwid'])})
output.update({'wlan_rx_invalid_crypt': int(wifi['rx_invalid_crypt'])})
output.update({'wlan_rx_invalid_frag': int(wifi['rx_invalid_frag'])})
output.update({'wlan_tx_excessive_retries': int(wifi['tx_excessive_retries'])})
output.update({'wlan_invalid': int(wifi['invalid'])})
output.update({'wlan_missed_beacon': int(wifi['missed_beacon'])})

# Update Splunk via REST API
print(datetime.datetime.now().replace(microsecond=0).isoformat())
updateSplunkHec(**output)
print("----")
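
One caveat with getWlanStats: it picks fields out of the iwconfig output by line and column position, which breaks if your driver prints a slightly different layout. As a hedged alternative, here's a sketch in Python 3 that looks fields up by their label instead. The sample output is illustrative only (I've assembled it from the values in the JSON example above, and real output varies by driver and wireless-tools version), and parse_iwconfig is a hypothetical name, not part of the script.

```python
import re

# Illustrative iwconfig output; real output differs between drivers
SAMPLE = """\
wlan0     IEEE 802.11  ESSID:"ExampleNet"
          Mode:Managed  Frequency:2.437 GHz  Access Point: 00:11:22:33:44:55
          Bit Rate=72.2 Mb/s   Tx-Power=31 dBm
          Retry short limit:7   RTS thr:off   Fragment thr:off
          Power Management:on
          Link Quality=49/70  Signal level=-61 dBm
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:10  Invalid misc:0   Missed beacon:0
"""

# One pattern per field, keyed by the names getWlanStats already uses
PATTERNS = {
    "access_point": r"Access Point:\s*([0-9A-Fa-f:]{17})",
    "bitrate": r"Bit Rate[=:]\s*([\d.]+ \S+)",
    "tx_power": r"Tx-Power[=:]\s*(-?\d+ dBm)",
    "link_quality": r"Link Quality[=:]\s*(\d+/\d+)",
    "signal": r"Signal level[=:]\s*(-?\d+ dBm)",
    "tx_excessive_retries": r"Tx excessive retries:\s*(\d+)",
    "missed_beacon": r"Missed beacon:\s*(\d+)",
}


def parse_iwconfig(text):
    """Return a dict of field -> string value, or None where a label is absent."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


stats = parse_iwconfig(SAMPLE)
print(stats["access_point"], stats["signal"])  # 00:11:22:33:44:55 -61 dBm
```

A missing field comes back as None rather than raising an IndexError, so a WiPi that isn't associated to an AP can still report the rest of its stats.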

The outcome

Once done, you'll get JSON payloads being sent into Splunk via the HEC every 5 minutes, and you'll start to have lovely visual dashboards like this, showing you the HTTP, DNS, Ping and WLAN stats a real-world User in that Office Location might actually be seeing.

Splunk Dashboard showing overview of WiPi HTTP and DNS Probes

Splunk Dashboard showing overview of WiPi WLAN Stats

Splunk Dashboard showing detail of WiPi WLAN Stats

Enjoy finally having a use for that Pi other than taking up space in your drawer :).

What happened to the OpenFlow dream?

Wednesday, 29 Apr 2020

Apparently I've had this blog idea in draft for nearly two years, since April 2018, so it seems an apt time to expand on it. I hope this won't come out as one of those twee "Thot Leader" pieces, as I'm certainly not one; nor, like many Thot Leaders, do I have any authority in telling anyone what to do or think. But I've had a drink, so I will wax lyrical about the topic with an opinion or two.

The OpenFlow Dream

Cast your mind back to 2018 (or perhaps a few years earlier if you're lucky enough to not work in #PubSec), and the two things you'd have heard ad infinitum in the Twitterverse and Tech World:

  1. Software Defined Networking (SDN) will take over all teh Networkz!!!11!!
  2. omgomgOMG OpenFlow is teh only futurez!!1111!! CLI is lame lmao

Well, OK - maybe not; maybe it's unfair to think the Thot Leaders could spell that well or luck out on sentence structure, but humour me a little. Certainly from where I stood (or sat), everyone seemed to be banging on about a centralised SDN Controller of some sort controlling all Network Control Planes ever, across the Data Centre, Enterprise, Wireless and, well, pretty much everything. For some reason everyone had simultaneously seen the light around the death of "One box, one job" (i.e. Firewall is one box; Router is another box; Switch is another box; DDoS is another box; IPS is...) while also espousing a utopian future with one centralised Controller that would replace all these middleboxen, and essentially become the entire Network.

The Rhetoric

Irn Bru must get through

It was compelling, you could even argue it was so obvious it was revolutionary - none of these middleboxes existed just 5-10 years prior, when mostly we were worried about Routers and Switches (or for the unlucky of you, the Core Switch Routing Modules - I still have the scars, Cisco...) - all we previously had to manage was maybe a handful of L2 Access Switches or ToRs, and a few Campus Edge Routers, and away we went. But then the fleets of middleboxes came, and suddenly the Campus or Data Centre looked like some horrific mash up of Lego meets Playmobil meets Duplo: nothing quite fitted, everything was managed in something else, and it was bloody impossible to even think about driving it based on higher-level abstractions like "Intent".

So when someone turned around and said, "Yeah, sack all those boxes off; bang in an OpenFlow Controller, and it'll do the whole lot for you, mate" people listened. Many people listened in fact, and I was definitely one of them - which led to this:

Northbound Networks ZodiacFX OpenFlow Switch

The Northbound Networks ZodiacFX

Buzzword happy, imagine my excitement when Northbound Networks announced a small, mini-USB powered 4-port OpenFlow Switch - I didn't even mind my PayPal being charged in AUD to get my hands on one, this was the future! After a few weeks of international shipping, when it arrived I immediately got to work messing with OpenFlow Controllers like:

Faucet was probably my favourite (I'm a sucker for a cool name or logo, I really am) - and it was great. I sat through hours of tutorials from people like David Bombal and INE Instructor Jasson Casey and duly saw that I could program an OpenFlow Controller to per-packet police, Access Control, Rate Limit and basically do anything I'd previously had to do on a fleet of middleboxes. This was great! The future was surely OpenFlow, all hail the new king... right?

Well, no - wrong. Just like the yo-yo craze had quickly come and gone in my '90s childhood (I'm bitter, ProYo, I couldn't afford you at the time), so too did my love for OpenFlow evaporate near-overnight. But why? Am I just another Network Buzzword Jockey (well maybe, but let's keep some of my fragile ego intact...)? Did I not "get" the OpenFlow paradigm shift? What was going on?

The OSI Model

Remember this, the stupid thing they force you to learn by rote in Computer Science and Cisco Exams:

  7. Application
  6. Presentation
  5. Session
  4. Transport
  3. Network
  2. Data Link
  1. Physical

You learn it, you don't really understand why you're learning it, you spit it out for the exam; you move on with your life (wholly indoors, during this time of COVID-19, for the history books [Hi, Archive.org waves]). A few years later, you get a job in IT/Telco/Networking, maybe a few people superficially reference it now and again; you're still not getting it. Then, in more recent years, people say a few phrases that really make it "click" in your head - one of them being "up the stack".

Up the Stack

The Stack being an allegorical notion, it has a bottom, a middle, and a top - and is largely used in the IT/Networking field when talking about a collection of all the Routers, Switches, Servers, Middleboxes and so on you require to host a given Application/Set of Applications/"Service". For me, in the CapEx-rich/OpEx-poor Public Sector world, this often means one unique stack of Routers/Switches/Servers/etc per Application/"Service", due to a unique Project-driven focus in the world that no Real World Private Sector firm could afford to employ (you do "Multi-tenant", I do "Same tenant, second house on the street"). Either way, "the Stack" - for me as an Infrastructure Guy - refers to the collection of stuff that makes this up. If you were a Software Girl, this might refer to the stack of middleware (SQL, MongoDB, Amazon SQS etc) as well as the data structures and ultimately code or language binaries/compilers that run your Application - it's much the same concept, you'd just be decomposing an upper part of the OSI Model.

Which aptly brings me to my point; Infrastructure peoples like me are at the "bottom of the Stack"; Software peoples are instead at the "top of the Stack". A major paradigm shift did indeed occur around 2016-2018, but it wasn't SDN; it was something inspired by the notion that Software is Eating The World, and it moved the problem "up the Stack" (see, that OSI Model is useful - as a shared point of reference - that's the bit they neglect to tell you in school). Where to? Glad you asked.

Kubernetes and the Sunshine Gang

"Na, na, na, na, na..." Alright, I give (it) up, that's not going to work. By now (2016 for the Real World, no idea when for the PubSec world), something called k8s was taking over mindshare from OpenFlow, and SDN in general, and really driving home the point of software being the driving force of the IT world. The same was (and is) happening in the world of Cloud vs On-Prem; the true value of Cloud is in its higher-layer abstraction and orchestration capabilities (up that Stack again), not because it's marginally cheaper than On-Prem. There are many more things you can do in the world of k8s, OpenShift, EKS et al around overcoming the "middlebox problem" - and pushing it back to where it belongs/who knows the most about it (the Application and Software Peoples) - than using an SDN-backed approach.

Regarding the term "On-Prem" - I know, I know; but it sounds cool. Fight me.

Kubernetes searches on Google Search Trends

The nail in the OpenFlow coffin

I don't doubt that OpenFlow has a few valid use cases in the real world, but they are few and far between; in the main, the bitter cold truth is the IT world is, currently, split into sects aligned around two disciplines:

  1. Developers
  2. Operators

How do I prove I'm right here? DevOps - you can't have an abbreviation based on tribes that don't exist. You know what this means in practice, or did mean in practice? Tribes of people, aligned to that pesky OSI Model:

  1. Applicationy Peoples
    1. Aligned to OSI L4-7
  2. Infrastructurey Peoples
    1. Aligned to OSI L1-3

How do I prove I'm right here? Middleboxes - those things that were neither OSI L3 nor OSI L4, they were a bit of both. You know why they were always a pain in the arse? Because they were trying to do with physical kit what DevOps is trying to do with people; align the tribes.

As the world has progressively moved up the Stack, and in doing so to "enabling the Application" (and by extension, the Developer), unfortunately the Infrastructure Peoples (myself included) have become less relevant. With k8s and its ilk, no longer do you need us to mangle some middlebox via Chinese Whispers ("I'm sure he said TCP/1521? That's the Oracle DB Port isn't it? I'll set up a Load Balancer VIP Pool for that, not got time to ask him..."); you can do it yourself with things like Istio, Envoy and other cool-kid stuff I'm not Dev enough to do.

And frankly, why wouldn't you? It's your App; you built it (or are unlucky enough to be charged with keeping its COTS form alive and kicking); you know what it does, what it needs to do, and what it doesn't need to do. Us Infrastructure folk, frankly, don't; and we don't really care, because we're too busy tweaking various OSI L1-3 knobs to stop everything setting on fire.

Which brings me nicely back to OpenFlow. Sure, it's a great idea; but it's an Infrastructure person's view of the world, not a Developer's view of the world. It's us as Infrastructure folk trying to bring our detailed, abstraction-averse OSI L1-3 thinking (as you go down the Stack, the level of detail for any one given Layer goes up) up the Stack to the more abstraction-dependent OSI L4-7 thinking of the Developer folk. Sprinkle on some organisational politics and Development vs Operations tribal thinking, add in some Enterprise "JFDI it, my golf mate wrote it, we must use Crapplication 1.2 now!" and you've got a recipe for a self-tapping hammer for the nail in the OpenFlow coffin.

Farewell, OpenFlow - I hardly knew ye.