Juniper SRX Overlay and Underlay VRF-separated GRE Tunnels

Saturday, 22 Aug 2020

Are you used to Cisco IOS kit and trying to make a GRE Tunnel extend an Overlay VRF (or Routing Instance), using Junos SRX kit, and getting weird errors like this:

(errno=1000) create nsp tunnel failed 1
(errno=1000) tunnel session add(gr-0/0/0) failed

Then read on, dear friend - this post's for you.

Diagram time

This is what we're trying to achieve, namely:

  • Underlay Network
    • MPLS IP VPN with own BGP/IGP routing setup
    • Two Juniper SRX Firewalls in two locations, connected to said MPLS IP VPN
    • Underlying Global Table or VRF "Prod" that those Firewalls connect to
    • Each Firewall has 1x 10 Gig Ethernet connection, Xe-0/0/1, connected to the MPLS IP VPN
  • Overlay Network
    • Routing Instance (VRF/Routing Table) called "Other" present on each Juniper SRX Firewall
    • We want to join together VRF "Other" on each SRX Firewall to each other, using a GRE Tunnel Gr-0/0/0.1 to extend them
    • We'll spin up a Routed /30 p2p atop this GRE Tunnel, so we can Static Route to it, each side, to get traffic from SiteA to SiteB

Juniper Overlay GRE Tunnel Topology

The ultimate goal is for Gr-0/0/0.1 on each SRX to be able to talk to each other's 192.168.99.x IP directly, over the GRE Tunnel (which gets transited atop the Underlay MPLS IP VPN, VRF "Prod").

What your Cisco mind tells you might work

This isn't your first GRE rodeo, you've played this game before, and done something similar on Cisco IOS kit, like this:

interface Tunnel99
 description GRE Tunnel to SiteB Tun99
 ip vrf forwarding Other
 ip address 192.168.99.1 255.255.255.252
 ip mtu 1476
 tunnel source 10.0.0.7
 tunnel destination 10.1.0.98

That worked well for you, so you reckon some config conversion like this might bear some fruit:

set interfaces gr-0/0/0 unit 1 description "GRE Tunnel to SiteB Gr-0/0/0.1"
set interfaces gr-0/0/0 unit 1 tunnel source 10.0.0.7
set interfaces gr-0/0/0 unit 1 tunnel destination 10.1.0.98
set interfaces gr-0/0/0 unit 1 tunnel routing-instance destination Other
set interfaces gr-0/0/0 unit 1 family inet mtu 1476
set interfaces gr-0/0/0 unit 1 family inet address 192.168.99.1/30
!
set routing-instances Other interface gr-0/0/0.1

That'll do it, right?

Nope.

What the Juniper SRX throws back at you

So you try your ping 192.168.99.2 routing-instance Other, it fails miserably; then you go down to checking the Underlay Tunnel Source can ping the Tunnel Destination:

PING 10.1.0.98: 56 data bytes
64 bytes from 10.1.0.98: icmp_seq=0 ttl=122 time=5.661 ms
64 bytes from 10.1.0.98: icmp_seq=1 ttl=122 time=6.619 ms

Hmm, that's all fine. Alright, Juniper, what about a little show log messages | last 10 then, eh?

Aug 21 23:00:03 node0.fpc1.pic0 IFP error> ../../../../../../../src/pfe/usp/control/applications/interface/ifp.c@3069:(errno=1000) tunnel session add(gr-0/0/0) failed

Huh, what's that all about?

Hunting the Interwebs for clues

Time to invoke Dr Google then, and off we find this similarly afflicted person with an SRX1400, with this key scrap of thinking material:

This error is when you need the static route if gr-0/0/0 and egress interface are in two different routing-instances. So you need to point the static route of the gr tunnel to the table that have the external interface.

It's not really sinking in, so you re-read it a few times; why does Junos care if my Gr-0/0/0.1 interface is in a different Routing Instance (VRF) to the Underlay, that's the point of the ...tunnel routing-instance destination Other command, right, to tell it to set the destination of the Tunnel Endpoint into another Routing Table, right?

Nope.

In Juniper land, Destination Sources you!

It's probably best if I show you the fix first, then this might click - here is the fix to suddenly make the Tunnel spring into life:

set routing-instances Other routing-options static route 10.1.0.98/32 next-table Prod.inet.0

Note1: On the SiteB Firewall you'll need the mirror image of this inter-VRF Static Route, pointing at SiteA's IP (10.0.0.7/32) instead.

Note2: You'll need to add that ".inet.0" to your Routing Instance name, because Junos and consistency don't mix.

So if you're like me, you're looking at that bemused, and saying to yourself: "But, but... I set the Tunnel Destination to look in a different VRF, right, otherwise what exactly is the point of that routing-instance destination command all about?".

Well, I'll tell you what I reckon it's to do, I think it's to set the source of the Tunnel (i.e. what Cisco would call the "Tunnel Source") to use the Underlay (Prod VRF) to form; not, as the fecking word in the command reads, the Tunnel Destination. Hence, as there is no command that sets the destination, it assumes the Tunnel Destination is in the same VRF as the Tunnel itself, so you have to set an inter-VRF Static Route to tell it to go into the Underlay to form the GRE Tunnel.

It's the only way I can reconcile it in my head, with you needing to add the destination route/prefix as well as that ...tunnel routing-instance destination... command.

Conclusion

So there you go, in Junos land, this magic badly-named combo does what the sensibly-named tunnel source and tunnel destination combo does in Cisco IOS:

set interfaces gr-0/0/0 unit 1 tunnel source 10.0.0.7 #Note this is in VRF Prod underlay
set interfaces gr-0/0/0 unit 1 tunnel destination 10.1.0.98 #Note this is in VRF Prod underlay
set interfaces gr-0/0/0 unit 1 tunnel routing-instance destination Prod #Note this sets the VRF for the Tunnel Source
set routing-instances Other routing-options static route 10.1.0.98/32 next-table Prod.inet.0 #Note this inter-VRFs the Tunnel Destination
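
As Note1 says, SiteB needs the same four lines with the two IPs swapped. If you'd rather not hand-edit that, a minimal Python sketch along these lines could render both sides. The function name render_gre_commands and its parameters are made up for illustration; all it does is reproduce the set commands from this post.

```python
def render_gre_commands(source, destination, overlay="Other", underlay="Prod",
                        unit=1):
    """Render the Junos set commands from this post for one side of the tunnel.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    ifl = "gr-0/0/0 unit {}".format(unit)
    return [
        "set interfaces {} tunnel source {}".format(ifl, source),
        "set interfaces {} tunnel destination {}".format(ifl, destination),
        # Despite its name, this points the tunnel at the Underlay VRF
        "set interfaces {} tunnel routing-instance destination {}".format(ifl, underlay),
        # Inter-VRF route so the Overlay VRF can reach the far tunnel endpoint
        "set routing-instances {} routing-options static route {}/32 "
        "next-table {}.inet.0".format(overlay, destination, underlay),
    ]


# SiteA is 10.0.0.7 -> 10.1.0.98; SiteB is the same with the IPs swapped
site_a = render_gre_commands("10.0.0.7", "10.1.0.98")
site_b = render_gre_commands("10.1.0.98", "10.0.0.7")
print("\n".join(site_a))
print("\n".join(site_b))
```

The family inet address lines are left out on purpose, as each side keeps its own 192.168.99.x/30 address.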

Mind you, in Juniper land opposites are clearly true, so maybe I should call this section the Introduction instead, seeing as it's at the end of the blog post.

If only Juniper sold mineral water, it'd be bottled at destination instead.

Sorry, I'll start now. Wait, do I mean stop...

WiPi Monitoring with a Raspberry Pi WLAN Device

Saturday, 04 Jul 2020

The idea

I wanted to monitor the quality of the Wireless in my Office to emulate the real-world experience of some of our End Users, as I was struggling to correlate what I was seeing on the Wireless LAN Controller (WLC)'s Access Point (AP) monitoring stats with the poor experience being reported. Simplistically, this takes the magic of a Raspberry Pi, a bit of JSON and a Log Analysis Stack (I chose Splunk, I could have used ELK Stack, or Logz.io or...) to give you pretty dashboards of the data.

The gear

The steps

Any values below written <like_this> are placeholders showing an example value, specific to your installation; for instance <vps-server.com> is whatever DNS Domain Name/Dynamic DNS Domain Name/Public IP Address your VPS uses.

VPS Setup

We'll start with the VPS setup, from a high-level overview; if you're using a Provider like Vultr you should definitely harden your VPS and configure the Firewall/WAF to only allow your WiPi to send HTTP JSON feeds.

  1. Install your Linux distribution of choice when you boot your VPS (I'm a Debian man)
  2. Install Splunk onto your VPS, or more precisely:
    1. Download the RPM from Splunk's website onto your VPS (I ended up downloading locally, and then uploading the RPM via SCP to the VPS with scp splunk.rpm admin@vps-server.com:/tmp/)
    2. Install the RPM from the temporary directory:
      • rpm -i /tmp/splunk.rpm
    3. Run Splunk and accept the Terms from the /opt/splunk directory:
      • /opt/splunk/bin/splunk start
  3. Set up your VPS Firewall/WAF to allow tcp/8000 and tcp/8088 (and probably tcp/22 so you can SSH into it)
  4. I'd suggest using a Domain Name to point at it, or a Free Dynamic DNS Service such as DDNSS.de, and pointing an A Record at your VPS' Public IP
  5. Harden your VPS by installing Fail2Ban and other similar tools

Splunk Setup

I'm focusing on Splunk because that's what I used, but similar steps will exist for ELK or Logz.io. The main advantage of Splunk is that it's free and quick; the Free version does have limitations, however, such as 500 MB of Indexed Data per day and a time-limit on concurrent Search Queries used in your Dashboards, i.e. the number of widgets you can use. There's much more that you can do, but I'll get you going with some basics.

  1. Create an Index from Settings -> Data -> Index -> New Index
    1. I called mine wipi_monitoring and accepted the defaults
    2. Indexes store the Events, so you can reference the data in this Index with a SQL-like Query, like this, from the Search & Reporting App on the Homepage:
      • index="wipi_monitoring"
  2. Create an HTTP Event Collector (HEC) to receive the JSON payload from the WiPi
    1. Go to Settings -> Data -> Data Inputs -> HTTP Event Collector -> Add new
    2. Give it a Name (I went for WiPi Monitoring) and on the Next screen, associate it with the wipi_monitoring Index you made earlier
    3. You can also tell it the Source Type is _json to speed processing up
    4. Make a note of the API Key (the Token Value) it generates; you'll need this later
  3. Turn on HEC (it doesn't auto-enable) from the Global Settings -> Enable -> Save option, next to the New Token option within Settings -> Data -> Data Inputs -> HTTP Event Collector
  4. Create a simple Splunk Dashboard, from Splunk -> Search & Reporting -> Dashboards -> Create New Dashboard
    1. Most of my Panels are either Line Chart or Single Values, here's some of the example Searches used for them, style them how you want:
      • (Uptime Panel, Single Values) Search: index="wipi_monitoring" | timechart max(uptime)
      • (BBC.co.uk Ping RTT Panel, Line Chart) Search: index="wipi_monitoring" | timechart max(ping_bbc_avg)
      • (WLAN MAC Change Count Panel, Line Chart) Search: index="wipi_monitoring" | timechart distinct_count(wlan_ap_mac)
      • This returns a count of unique AP MAC Addresses seen (i.e. if it goes up from one, you've roamed between AP Coverage Areas)

You should now be able to log in to your Splunk instance at http://<vps-server.com>:8000.

Make sure you can send HTTP to your Splunk instance on Port 8088 (or whatever <HEC Port> you picked otherwise); a quick way to check this is using Telnet to see if it connects at all, i.e. telnet <vps-server.com> 8088.
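
If Telnet isn't installed, the same check can be sketched in a few lines of Python using only the standard library. The function name port_is_open is made up for this example, and it only proves that a TCP handshake completes, not that the HEC will accept your token.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures
        return False
    sock.close()
    return True

# Example call, with your own values substituted in:
#   port_is_open("<vps-server.com>", 8088)
```

A False here usually points at the VPS Firewall/WAF rules or at HEC not being enabled yet.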

The script

The following script will perform a series of checks (i.e. Ping a host, or TCP-connect to a host), time how long each took, consolidate the results into a JSON payload similar to the below, and finally push this to Splunk via the HTTP Event Collector as frequently as scheduled with your Cron job. It also scrapes information from the WiPi's WLAN interface, such as RTS/CTS issues, signal quality, current bitrate, AP MACs seen and so on.

You can customise the variables towards the top of the script with your values; don't forget to replace <vps-server.com> with your VPS Server's IP Address/DNS Name; <your_location> with a meaningful Location String to you and <your_splunk_api_token> with your Splunk HEC API Token from earlier on.

JSON Payload example

{
 "dns_ms_onedrive": 101,
 "host": "wipi1",
 "http_ms_teams": 220,
 "http_google": 500,
 "location": "Some House",
 "ping_bbc_avg": 22,
 "ping_bbc_loss": "0%",
 "ping_bbc_max": 24,
 "ping_default_gateway_avg": 13,
 "ping_default_gateway_loss": "0%",
 "ping_default_gateway_max": 47,
 "uptime": 417.7,
 "wlan_ap_mac": "00:11:22:33:44:55",
 "wlan_bitrate": "72.2 Mb/s",
 "wlan_fragment_threshold": "off",
 "wlan_invalid": 0,
 "wlan_link_quality": "49/70",
 "wlan_missed_beacon": 0,
 "wlan_rts_threshold": "off",
 "wlan_rx_invalid_crypt": 0,
 "wlan_rx_invalid_frag": 0,
 "wlan_rx_invalid_nwid": 0,
 "wlan_signal": "-61 dBm",
 "wlan_tx_excessive_retries": 10,
 "wlan_tx_power": "31 dBm"
}
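
For reference, the script wraps this payload in an "event" key before posting it to the HEC. A minimal sketch of that wrapping, using a cut-down payload, might look like this; build_hec_body is a hypothetical helper name and is not part of the script further down.

```python
import json


def build_hec_body(payload):
    """Wrap a metrics dict in the {"event": ...} envelope and serialise it."""
    return json.dumps({"event": payload}, sort_keys=True)


# Cut-down version of the example payload above
sample = {"host": "wipi1", "ping_bbc_avg": 22, "ping_bbc_loss": "0%"}
body = build_hec_body(sample)
print(body)

# Round-tripping through json.loads is a cheap check that the body is valid JSON
decoded = json.loads(body)
```

Running a hand-written payload through json.loads like this is also a quick way to catch stray quotes or commas before blaming Splunk.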

Cron job example

Adding this to /etc/crontab would cause the script to be run every 5 minutes. Change admin to the Admin User of your Pi (which you should change from the factory pi username for security reasons):

# WiPi Monitoring report back to Splunk
*/5 *   * * *   admin    python /opt/wipi-monitoring/main.py >> /opt/wipi-monitoring/main.log 2>&1

Python Script

This will require you to pip install netifaces dnspython requests first.

# Author: notworkd.com
# Date: 19-Jun-2020
# Description: Monitor WiFi data and send back to WiPi Monitoring Dashboard
import netifaces
import os
import subprocess
import requests
import json
import dns.resolver
import time
import datetime

# Define constants
SPLUNK_SERVER = 'https://<vps-server.com>:8088'
SPLUNK_API_KEY = '<your_splunk_api_token>'
WIPI_LOCATION = '<your_location>'

# Functions
# Update Splunk HTTP Event Collector
def updateSplunkHec(**data):
 url = SPLUNK_SERVER + '/services/collector'
 post = {
  "event": data
 }
 r = requests.post(url, json=post, headers={"Authorization":"Splunk "+SPLUNK_API_KEY}, verify=False)
 print(r.text)
 # HEC replies with HTTP 200 when the event is accepted
 return r.status_code == 200

# Get Hostname of this WiPi
def getHostname():
 hostname = os.uname()[1]
 return hostname

# Get Default Gateway for WiFi Adapter
def getDefaultGateway():
 gws = netifaces.gateways()
 return gws['default'][netifaces.AF_INET][0]

# Get ICMP Ping response time
def getIcmpPing(host, count=5):
 cmd = "ping -c {} {}".format(count, host).split(' ')
 try:
  output = subprocess.check_output(cmd).decode().strip()
  lines = output.split("\n")
  total = lines[-2].split(',')[3].split()[1]
  loss = lines[-2].split(',')[2].split()[0]
  timing = lines[-1].split()[3].split('/')
  return {
   'type': 'rtt',
   'min': float(timing[0]),
   'avg': float(timing[1]),
   'max': float(timing[2]),
   'mdev': float(timing[3]),
   'total': str(total),
   'loss': str(loss),
  }
 except Exception as e:
  print(e)
  return None

# Get HTTP Connect response time
def getHttpConnect(url, timeout):
 r = requests.get(url, timeout=timeout)
 return int(1000 * round(r.elapsed.total_seconds(), 2))

# Get DNS response time
def getDnsResolve(fqdn):
 answers = dns.resolver.query(fqdn, 'a')
 return int(1000 * answers.response.time)

# Get WLAN Interface Stats
def getWlanStats(adapter):
 cmd = "iwconfig " + adapter
 try:
  output = subprocess.check_output(cmd, shell=True).decode().strip()
  lines = output.split("\n")
  frequency = lines[-7].split('  ')[6].split(':')[1]
  ap = lines[-7].split('  ')[7].split(': ')[1]
  bitrate = lines[-6].split('  ')[5].split('=')[1]
  txpwr = lines[-6].split('  ')[6].split('=')[1]
  rtsthrsh = lines[-5].split('  ')[6].split(':')[1]
  frgthrsh = lines[-5].split('  ')[7].split(':')[1]
  link = lines[-3].split('  ')[5].split('=')[1]
  snr = lines[-3].split('  ')[6].split('=')[1]
  rxinnw = lines[-2].split('  ')[5].split(':')[1]
  rxincr = lines[-2].split('  ')[6].split(':')[1]
  rxinfr = lines[-2].split('  ')[7].split(':')[1]
  txrtry = lines[-1].split('  ')[5].split(':')[1]
  invalid = lines[-1].split('  ')[6].split(':')[1]
  missedbcn = lines[-1].split('  ')[7].split(':')[1]
  return {
   'frequency': frequency,
   'access_point': ap,
   'bitrate': bitrate,
   'tx_power': txpwr,
   'rts_threshold': rtsthrsh,
   'fragment_threshold': frgthrsh,
   'link_quality': link,
   'signal': snr,
   'rx_invalid_nwid': rxinnw,
   'rx_invalid_crypt': rxincr,
   'rx_invalid_frag': rxinfr,
   'tx_excessive_retries': txrtry,
   'invalid': invalid,
   'missed_beacon': missedbcn
  }
 except Exception as e:
  print(e)
  return None

# Get Device Uptime
def getUptime():
 cmd = "awk '{print $0/60;}' /proc/uptime"
 try:
  output = subprocess.check_output(cmd, shell=True).decode().strip()
  return output
 except Exception as e:
  print(e)
  return None

# Main program
# Initialise variables
output = {'host': getHostname(), 'location': WIPI_LOCATION, 'uptime': getUptime()}

# Ping Default Gateway (max, avg and loss)
ping = getIcmpPing(getDefaultGateway())
output.update({'ping_default_gateway_max': int(ping['max'])})
output.update({'ping_default_gateway_avg': int(ping['avg'])})
output.update({'ping_default_gateway_loss': str(ping['loss'])})

# Ping BBC
ping = getIcmpPing('www.bbc.co.uk')
output.update({'ping_bbc_max': int(ping['max'])})
output.update({'ping_bbc_avg': int(ping['avg'])})
output.update({'ping_bbc_loss': str(ping['loss'])})

# HTTP Connect Google
http = getHttpConnect('https://www.google.co.uk', 30)
output.update({'http_google': int(http)})

# HTTP Connect Microsoft Teams
http = getHttpConnect('https://teams.microsoft.com', 30)
output.update({'http_ms_teams': int(http)})

# DNS Resolve OneDrive
# Named dns_ms so the imported dns module isn't overwritten
dns_ms = getDnsResolve('sharepoint.com')
output.update({'dns_ms_onedrive': int(dns_ms)})

# Get WiFi Stats
wifi = getWlanStats('wlan0')
output.update({'wlan_ap_mac': str(wifi['access_point'])})
output.update({'wlan_bitrate': str(wifi['bitrate'])})
output.update({'wlan_tx_power': str(wifi['tx_power'])})
output.update({'wlan_rts_threshold': str(wifi['rts_threshold'])})
output.update({'wlan_fragment_threshold': str(wifi['fragment_threshold'])})
output.update({'wlan_link_quality': str(wifi['link_quality'])})
output.update({'wlan_signal': str(wifi['signal'])})
output.update({'wlan_rx_invalid_nwid': int(wifi['rx_invalid_nwid'])})
output.update({'wlan_rx_invalid_crypt': int(wifi['rx_invalid_crypt'])})
output.update({'wlan_rx_invalid_frag': int(wifi['rx_invalid_frag'])})
output.update({'wlan_tx_excessive_retries': int(wifi['tx_excessive_retries'])})
output.update({'wlan_invalid': int(wifi['invalid'])})
output.update({'wlan_missed_beacon': int(wifi['missed_beacon'])})

# Update Splunk via REST API
print(datetime.datetime.now().replace(microsecond=0).isoformat())
updateSplunkHec(**output)
print("----")
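
One caveat on getWlanStats: the positional splits depend on the exact line and column layout of the iwconfig output, which may differ between drivers and versions. As a hedged alternative, a sketch that picks fields out by label with regular expressions might look like this. The SAMPLE text is an illustrative approximation of iwconfig output rather than a capture from a real device, and parse_iwconfig and the FIELDS table are hypothetical names.

```python
import re

# Illustrative iwconfig-style output (an assumption, not a real capture)
SAMPLE = """wlan0     IEEE 802.11  ESSID:"Office"
          Mode:Managed  Frequency:2.437 GHz  Access Point: 00:11:22:33:44:55
          Bit Rate=72.2 Mb/s   Tx-Power=31 dBm
          Retry short limit:7   RTS thr:off   Fragment thr:off
          Power Management:on
          Link Quality=49/70  Signal level=-61 dBm
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:10  Invalid misc:0   Missed beacon:0"""

# Output key -> pattern with one capture group for the value
FIELDS = {
    "access_point": r"Access Point: ([0-9A-Fa-f:]{17})",
    "bitrate": r"Bit Rate[=:]([\d.]+ \S+)",
    "tx_power": r"Tx-Power[=:](-?\d+ dBm)",
    "link_quality": r"Link Quality[=:](\d+/\d+)",
    "signal": r"Signal level[=:](-?\d+ dBm)",
    "tx_excessive_retries": r"Tx excessive retries:(\d+)",
    "missed_beacon": r"Missed beacon:(\d+)",
}


def parse_iwconfig(text):
    """Return a dict of the fields found in text; missing fields map to None."""
    stats = {}
    for key, pattern in FIELDS.items():
        match = re.search(pattern, text)
        stats[key] = match.group(1) if match else None
    return stats


stats = parse_iwconfig(SAMPLE)
print(stats)
```

Fields that aren't found come back as None instead of raising an IndexError, so the caller can decide whether to skip them or send a default.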

The outcome

Once done, you'll get JSON payloads being sent into Splunk via the HEC every 5 minutes, and you'll start to have lovely visual dashboards like this, showing you the HTTP, DNS, Ping and WLAN stats a real-world User in that Office Location might actually be seeing.

Splunk Dashboard showing overview of WiPi HTTP and DNS Probes

Splunk Dashboard showing overview of WiPi WLAN Stats

Splunk Dashboard showing detail of WiPi WLAN Stats

Enjoy finally having a use for that Pi other than taking up space in your drawer :).