WiPi Monitoring with a Raspberry Pi WLAN Device

Saturday, 04 Jul 2020

The idea

I wanted to monitor the quality of the Wireless in my Office to emulate the real-world experience of some of our End Users, as I was struggling to correlate the Access Point (AP) monitoring stats on the Wireless LAN Controller (WLC) with the poor experience being reported. Put simply, this combines the magic of a Raspberry Pi, a bit of JSON and a Log Analysis Stack (I chose Splunk; I could have used the ELK Stack, Logz.io or...) to give you pretty dashboards of the data.

The gear

The steps

Any values below written <like_this> are placeholders for values specific to your installation; for instance, <vps-server.com> is whatever DNS domain name, Dynamic DNS name or public IP address your VPS uses.

VPS Setup

We'll start with the VPS setup, from a high-level overview; if you're using a Provider like Vultr you should definitely harden your VPS and configure the Firewall/WAF to only allow your WiPi to send HTTP JSON feeds.

  1. Install your Linux distribution of choice when you boot your VPS (I'm a Debian man)
  2. Install Splunk onto your VPS, or more concisely:
    1. Download the installer package from Splunk's website onto your VPS (I ended up downloading it locally and then uploading it to the VPS via SCP, e.g. scp splunk.rpm admin@<vps-server.com>:/tmp/)
    2. Install the package from the temporary directory:
      • rpm -i /tmp/splunk.rpm
      • On Debian-based distributions, download the .deb package instead and install it with dpkg -i /tmp/splunk.deb
    3. Run Splunk and accept the Terms from the /opt/splunk directory:
      • /opt/splunk/bin/splunk start
  3. Set up your VPS Firewall/WAF to allow tcp/8000 (Splunk Web) and tcp/8088 (HEC), and probably tcp/22 so you can SSH into it
  4. I'd suggest pointing a Domain Name at it, or using a free Dynamic DNS service such as DDNSS.de and pointing an A record at your VPS' public IP
  5. Harden your VPS by installing Fail2Ban and other similar tools

Splunk Setup

I'm focusing on Splunk because that's what I used, but similar steps will exist for ELK or Logz.io. The main advantages of Splunk are that it's free and quick to set up; the Free version does have limitations, however, such as a 500 MB per day indexing limit and a limit on the concurrent Search Queries used by your Dashboards, which in practice caps the number of panels (widgets) you can use. There's much more that you can do, but I'll get you going with some basics.

  1. Create an Index from Settings -> Data -> Index -> New Index
    1. I called mine wipi_monitoring and accepted the defaults
    2. Indexes store the Events, so you can reference the data in this Index with a SQL-like Query, like this, from the Search & Reporting App on the Homepage:
      • index="wipi_monitoring"
  2. Create a HTTP Event Collector (HEC) to receive the JSON payload from the WiPi
    1. Go to Settings -> Data -> Data Inputs -> HTTP Event Collector -> Add new
    2. Give it a Name (I went for WiPi Monitoring) and on the Next screen, associate it with the wipi_monitoring Index you made earlier
    3. You can also tell it the Source Type is _json to speed processing up
    4. Make a note of the API Key it generates, you'll need this later
  3. Turn on HEC (it doesn't auto-enable) from the Global Settings -> Enable -> Save option, next to the New Token option within Settings -> Data -> Data Inputs -> HTTP Event Collector
  4. Create a simple Splunk Dashboard, from Splunk -> Search & Reporting -> Dashboards -> Create New Dashboard
    1. Most of my Panels are either Line Charts or Single Values; here are some of the example Searches used for them, so style them however you want:
      • (Uptime Panel, Single Values) Search: index="wipi_monitoring" | timechart max(uptime)
      • (BBC.co.uk Ping RTT Panel, Line Chart) Search: index="wipi_monitoring" | timechart max(ping_bbc_avg)
      • (WLAN MAC Change Count Panel, Line Chart) Search: index="wipi_monitoring" | timechart distinct_count(wlan_ap_mac)
      • This returns a count of unique AP MAC Addresses seen (i.e. if it goes up from one, you've roamed between AP Coverage Areas)
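Before pointing the Pi at Splunk, it can help to send one test event by hand. The sketch below is an assumption rather than part of the original setup: it uses only Python's standard library (urllib.request and ssl), targets the /services/collector/event endpoint, and sets the index and sourcetype fields in the HEC envelope. HEC_URL and HEC_TOKEN are hypothetical values to replace with your own.

```python
import json
import ssl
import urllib.request

# Hypothetical values: replace with your own VPS name and HEC token
HEC_URL = "https://vps-server.example:8088/services/collector/event"
HEC_TOKEN = "your_splunk_api_token"


def build_hec_request(event, index="wipi_monitoring", sourcetype="_json"):
    """Build (but do not send) a POST request carrying one event for the HTTP Event Collector."""
    body = json.dumps({
        "event": event,            # the data itself, stored as the event body
        "index": index,            # must be an index the token is allowed to write to
        "sourcetype": sourcetype,  # _json lets Splunk extract the fields automatically
    }).encode("utf-8")
    return urllib.request.Request(
        HEC_URL,
        data=body,
        headers={"Authorization": "Splunk " + HEC_TOKEN,
                 "Content-Type": "application/json"},
        method="POST",
    )


def send_hec_request(req, verify_tls=False):
    """Send a prepared request and return the decoded JSON reply from Splunk."""
    ctx = ssl.create_default_context()
    if not verify_tls:
        # A default Splunk install presents a self-signed certificate on the HEC port
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling send_hec_request(build_hec_request({"host": "wipi1", "message": "test"})) against a working HEC should return {"text": "Success", "code": 0}; anything else usually points at the token, its index association, or HEC not being enabled. The test event should then show up under index="wipi_monitoring".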

You should now be able to log in to your Splunk instance at http://<vps-server.com>:8000.

Make sure you can send HTTP to your Splunk instance on port 8088 (or whichever <HEC Port> you picked otherwise); a quick check is to use Telnet to see whether it connects at all, e.g. telnet <vps-server.com> 8088.
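If Telnet isn't installed, the same reachability check can be sketched in Python with the standard library's socket.create_connection. This is an alternative to the Telnet test rather than something the original setup uses, and hec_port_open is a hypothetical helper name.

```python
import socket


def hec_port_open(host, port=8088, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the name and tries each returned address in turn
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures
        return False
```

For example, hec_port_open("<vps-server.com>") returning False suggests that the firewall rule, the HEC port or HEC itself isn't enabled yet. Like Telnet, this only proves a TCP connection, not that the HEC token works.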

The script

The following script performs a series of checks (e.g. pinging a host, making an HTTP request to a site or resolving a DNS name), times how long each one took, consolidates the results into a JSON payload similar to the one below, and finally pushes this to Splunk via the HTTP Event Collector as frequently as your Cron job schedules it. It also scrapes information from the WiPi's WLAN interface, such as RTS/fragmentation thresholds, signal quality, current bitrate, the AP MAC seen and so on.

You can customise the variables towards the top of the script with your values; don't forget to replace <vps-server.com> with your VPS Server's IP Address/DNS Name; <your_location> with a meaningful Location String to you and <your_splunk_api_token> with your Splunk HEC API Token from earlier on.

JSON Payload example

{
 "dns_ms_onedrive": 101,
 "host": "wipi1",
 "http_ms_teams": 220,
 "http_google": 500,
 "location": "Some House",
 "ping_bbc_avg": 22,
 "ping_bbc_loss": "0%",
 "ping_bbc_max": 24,
 "ping_default_gateway_avg": 13,
 "ping_default_gateway_loss": "0%",
 "ping_default_gateway_max": 47,
 "uptime": 417.7,
 "wlan_ap_mac": "00:11:22:33:44:55",
 "wlan_bitrate": "72.2 Mb/s",
 "wlan_fragment_threshold": "off",
 "wlan_invalid": 0,
 "wlan_link_quality": "49/70",
 "wlan_missed_beacon": 0,
 "wlan_rts_threshold": "off",
 "wlan_rx_invalid_crypt": 0,
 "wlan_rx_invalid_frag": 0,
 "wlan_rx_invalid_nwid": 0,
 "wlan_signal": "-61 dBm",
 "wlan_tx_excessive_retries": 10,
 "wlan_tx_power": "31 dBm"
}

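Several fields in this payload (wlan_signal, wlan_tx_power, wlan_bitrate, wlan_link_quality and the ping loss values) are sent as strings with units attached, which makes them awkward to chart as numbers. One option, sketched here as an assumption rather than the author's approach, is to convert them on the Pi before sending; to_number and link_quality_percent are hypothetical helper names.

```python
import re


def to_number(value):
    """Extract the leading number from strings like '-61 dBm', '72.2 Mb/s' or '0%'."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", str(value))
    # Non-numeric values such as 'off' have no leading number
    return float(match.group(1)) if match else None


def link_quality_percent(value):
    """Convert an iwconfig-style Link Quality fraction such as '49/70' to a percentage."""
    num, den = str(value).split("/")
    return round(100.0 * int(num) / int(den), 1)
```

For example, the script could add output.update({'wlan_signal_dbm': to_number(wifi['signal'])}) alongside the existing string field, where wlan_signal_dbm is a hypothetical field name; a Splunk search such as timechart max(wlan_signal_dbm) would then chart a plain number.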
Cron job example

Adding this to /etc/crontab runs the script every 5 minutes. Change admin to the admin user on your Pi (you should replace the factory-default pi user for security reasons):

# WiPi Monitoring report back to Splunk
*/5 *   * * *   admin    python /opt/wipi-monitoring/main.py >> /opt/wipi-monitoring/main.log 2>&1

Python Script

This requires the netifaces, dnspython and requests modules, so run pip install netifaces dnspython requests first.

# Author: notworkd.com
# Date: 19-Jun-2020
# Description: Monitor WiFi data and send back to WiPi Monitoring Dashboard
import netifaces
import os
import subprocess
import requests
import json
import dns.resolver
import time
import datetime

# Define constants
SPLUNK_SERVER = 'https://<vps-server.com>:8088'
SPLUNK_API_KEY = '<your_splunk_api_token>'
WIPI_LOCATION = '<your_location>'

# Functions
# Update Splunk HTTP Event Collector
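def updateSplunkHec(**data):
 url = SPLUNK_SERVER + '/services/collector'
 post = {
  "event": data
 }
 # verify=False skips TLS certificate checks, as a default Splunk install serves a self-signed certificate on the HEC port
 r = requests.post(url, json=post, headers={"Authorization": "Splunk " + SPLUNK_API_KEY}, verify=False, timeout=30)
 print(r.text)
 if r.status_code == 200:
  return True
 return False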

# Get Hostname of this WiPi
def getHostname():
 hostname = os.uname()[1]
 return hostname

# Get Default Gateway for WiFi Adapter
def getDefaultGateway():
 gws = netifaces.gateways()
 return gws['default'][netifaces.AF_INET][0]

# Get ICMP Ping response time
def getIcmpPing(host, count=5):
 cmd = "ping -c {} {}".format(count, host).split(' ')
 try:
  output = subprocess.check_output(cmd).decode().strip()
  lines = output.split("\n")
  # Summary line, e.g. "5 packets transmitted, 5 received, 0% packet loss, time 4005ms"
  total = lines[-2].split(',')[3].split()[1]
  loss = lines[-2].split(',')[2].split()[0]
  # RTT line, e.g. "rtt min/avg/max/mdev = 12.345/13.456/14.567/0.612 ms"
  timing = lines[-1].split()[3].split('/')
  return {
   'type': 'rtt',
   'min': float(timing[0]),
   'avg': float(timing[1]),
   'max': float(timing[2]),
   'mdev': float(timing[3]),
   'total': str(total),
   'loss': str(loss),
  }
 except Exception as e:
  # ping exits non-zero (so check_output raises) when no replies come back
  return None

# Get HTTP Connect response time
def getHttpConnect(url, timeout):
 r = requests.get(url, timeout=timeout)
 return int(1000 * round(r.elapsed.total_seconds(), 2))

# Get DNS response time
def getDnsResolve(fqdn):
 answers = dns.resolver.query(fqdn, 'a')
 return int(1000 * answers.response.time)

# Get WLAN Interface Stats
def getWlanStats(adapter):
 cmd = "iwconfig " + adapter
 try:
  output = subprocess.check_output(cmd, shell=True).decode().strip()
  # Fields are picked out by line (counted from the end) and by double-space-separated
  # column, so this assumes the standard iwconfig layout for a connected adapter
  lines = output.split("\n")
  frequency = lines[-7].split('  ')[6].split(':')[1]
  ap = lines[-7].split('  ')[7].split(': ')[1]
  bitrate = lines[-6].split('  ')[5].split('=')[1]
  txpwr = lines[-6].split('  ')[6].split('=')[1]
  rtsthrsh = lines[-5].split('  ')[6].split(':')[1]
  frgthrsh = lines[-5].split('  ')[7].split(':')[1]
  link = lines[-3].split('  ')[5].split('=')[1]
  snr = lines[-3].split('  ')[6].split('=')[1]
  rxinnw = lines[-2].split('  ')[5].split(':')[1]
  rxincr = lines[-2].split('  ')[6].split(':')[1]
  rxinfr = lines[-2].split('  ')[7].split(':')[1]
  txrtry = lines[-1].split('  ')[5].split(':')[1]
  invalid = lines[-1].split('  ')[6].split(':')[1]
  missedbcn = lines[-1].split('  ')[7].split(':')[1]
  return {
   'frequency': frequency,
   'access_point': ap,
   'bitrate': bitrate,
   'tx_power': txpwr,
   'rts_threshold': rtsthrsh,
   'fragment_threshold': frgthrsh,
   'link_quality': link,
   'signal': snr,
   'rx_invalid_nwid': rxinnw,
   'rx_invalid_crypt': rxincr,
   'rx_invalid_frag': rxinfr,
   'tx_excessive_retries': txrtry,
   'invalid': invalid,
   'missed_beacon': missedbcn
  }
 except Exception as e:
  return None

# Get Device Uptime
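def getUptime():
 # The first field of /proc/uptime is seconds since boot; dividing by 60 reports minutes
 cmd = "awk '{print $1/60;}' /proc/uptime"
 try:
  output = subprocess.check_output(cmd, shell=True).decode().strip()
  return float(output)
 except Exception as e:
  return None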

# Main program
# Initialise variables
output = {'host': getHostname(), 'location': WIPI_LOCATION, 'uptime': getUptime()}

# Ping Default Gateway (max, avg and loss); skipped if the ping failed
ping = getIcmpPing(getDefaultGateway())
if ping:
 output.update({'ping_default_gateway_max': int(ping['max'])})
 output.update({'ping_default_gateway_avg': int(ping['avg'])})
 output.update({'ping_default_gateway_loss': str(ping['loss'])})

# Ping BBC (max, avg and loss); skipped if the ping failed
ping = getIcmpPing('www.bbc.co.uk')
if ping:
 output.update({'ping_bbc_max': int(ping['max'])})
 output.update({'ping_bbc_avg': int(ping['avg'])})
 output.update({'ping_bbc_loss': str(ping['loss'])})

# HTTP Connect Google
http = getHttpConnect('https://www.google.co.uk', 30)
output.update({'http_google': int(http)})

# HTTP Connect Microsoft Teams
http = getHttpConnect('https://teams.microsoft.com', 30)
output.update({'http_ms_teams': int(http)})

# DNS Resolve OneDrive (stored as dns_ms so it doesn't shadow the dns module)
dns_ms = getDnsResolve('sharepoint.com')
output.update({'dns_ms_onedrive': int(dns_ms)})

# Get WiFi Stats (skipped if the iwconfig output couldn't be parsed)
wifi = getWlanStats('wlan0')
if wifi:
 output.update({'wlan_ap_mac': str(wifi['access_point'])})
 output.update({'wlan_bitrate': str(wifi['bitrate'])})
 output.update({'wlan_tx_power': str(wifi['tx_power'])})
 output.update({'wlan_rts_threshold': str(wifi['rts_threshold'])})
 output.update({'wlan_fragment_threshold': str(wifi['fragment_threshold'])})
 output.update({'wlan_link_quality': str(wifi['link_quality'])})
 output.update({'wlan_signal': str(wifi['signal'])})
 output.update({'wlan_rx_invalid_nwid': int(wifi['rx_invalid_nwid'])})
 output.update({'wlan_rx_invalid_crypt': int(wifi['rx_invalid_crypt'])})
 output.update({'wlan_rx_invalid_frag': int(wifi['rx_invalid_frag'])})
 output.update({'wlan_tx_excessive_retries': int(wifi['tx_excessive_retries'])})
 output.update({'wlan_invalid': int(wifi['invalid'])})
 output.update({'wlan_missed_beacon': int(wifi['missed_beacon'])})

# Update Splunk via REST API
updateSplunkHec(**output)
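The getIcmpPing() function above picks values out of the ping summary by position, so an extra field (such as the "+N errors" count Linux ping adds when a host is unreachable) shifts the indices, and check_output() raises when every packet is lost. A more tolerant alternative, sketched here with regular expressions and assuming the Linux iputils ping output format (parse_ping_output is a hypothetical name), is:

```python
import re

# Matches the summary lines printed by Linux iputils ping, e.g.
#   "5 packets transmitted, 5 received, 0% packet loss, time 4005ms"
#   "rtt min/avg/max/mdev = 12.345/13.456/14.567/0.612 ms"
LOSS_RE = re.compile(r"([\d.]+)% packet loss")
RTT_RE = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_ping_output(text):
    """Return loss and RTT stats from ping output, or None if no summary line is found."""
    loss = LOSS_RE.search(text)
    if loss is None:
        return None
    result = {"loss": loss.group(1) + "%"}
    rtt = RTT_RE.search(text)
    if rtt:  # the rtt line is absent when every packet was lost
        result.update(zip(("min", "avg", "max", "mdev"), map(float, rtt.groups())))
    return result
```

To keep the output even when ping exits non-zero, the command could be run with subprocess.run(cmd, capture_output=True, text=True).stdout (Python 3.7+) instead of check_output(), and the result passed to parse_ping_output(). A 100% loss result then still reports the loss, rather than the whole check returning None.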

The outcome

Once done, Splunk will receive a JSON payload via the HEC every 5 minutes, and you can start building lovely visual dashboards like these, showing the HTTP, DNS, Ping and WLAN stats that a real User in that Office Location might actually be seeing.

[Figure: Splunk Dashboard showing overview of WiPi HTTP and DNS Probes]
[Figure: Splunk Dashboard showing overview of WiPi WLAN Stats]
[Figure: Splunk Dashboard showing detail of WiPi WLAN Stats]

Enjoy finally having a use for that Pi other than taking up space in your drawer :).