Disappearing DNS entries when your CNAME TTL differs from your PaaS Provider's

Saturday, 26 Jan 2019

The dreaded "Page can't be displayed" error

Most people in the field of IT or Networking will have seen this lovely Internet Explorer error, and immediately recognised their day was about to change course away from the schedule:

Internet Explorer Page can not be displayed error

The why can vary massively; for this blog post, we'll look at one case in point - what happens when the DNS Time to Live (TTL) on your CNAME record doesn't match up with the TTL on your Platform as a Service (PaaS) provider's A record (or "A Name", as I'll call it here). But first, a bit of background - names changed to protect the innocent.

The Scenario with the PaaS Provider

We've got a Web Application, which used to be on-premises (or "on-prem" for you cool Cloud Kids), that we've decided to farm-out to a PaaS Provider. It's very important to the Business, but in terms of the technology employed it's nothing special - think an HTTPS Website, where the PaaS provider does DNS-based "Elastic" (boing!) Load Balancing - also known as GSLB, but the new Cloudy World has to re-invent the terms we're already used to... grumble grumble

Let's throw in some made-up pseudonyms to anonymise this a bit, and add some context:

  • My Employer (Enterprise Business, or "the Business")
    • Name - MyCompany Ltd
    • Main External URL/Domain - mycompany.com
    • Main Internal URL/Domain - prod.mycompany.uk
  • PaaS Provider
    • Name - PaaS Co. Ltd
    • Main PaaS URL/Domain - paasco.com
    • Cloud Environment Name - PaaSCloud
    • Use Load Balancers from - BigAssLoadBalancers (Vendor)

Because the Business (rightly) thinks that a new PaaS URL of https://bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com might not be as easy-to-remember as the old on-prem (yes, I'm trying to bait you with that phrase) one of https://appname.prod.mycompany.uk; and because we've got no choice about the PaaS URL, we've taken the decision to make a new sub-domain of *.paascloud.mycompany.com. While we're there, we think we'll sort out the outmoded concept of Internal (prod.mycompany.uk) vs External (mycompany.com) URLs, because this is all hosted off-prem anyway; so it's technically no longer part of our "internal" Domain.

Regardless of PaaS Co, MyCompany uses Internal DNS that sits on Active Directory Domain Controllers; for the sake of ease, I'll call this "Internal DNS". MyCompany outsources its Internet DMZ Data Centres to another MSP; we'll call them MSPCo. MSPCo's only relevance here is that they run our External DNS/Domain (from Internet-facing ns1.mspco.com DNS Servers), whereas we run our Internal DNS/Domain AD-DC DNS Servers. Or, in short:

  • MyCompany
    • Run Internal DNS Servers (i.e. pdc1.mycompany.uk) that are authoritative (but not advertised to Internet) for *.mycompany.uk
  • MSPCo
    • Run External DNS Servers (i.e. ns1.mspco.com) that are authoritative for *.mycompany.com

To give us an easy-to-remember FQDN for the AppName Web Application, we've set up the following, which means it will be https://appname.paascloud.mycompany.com:

  • Sub-Domain Space (for all Apps on PaaS Co)
    • *.paascloud.mycompany.com
  • Current PaaS Web App (one of the Apps on PaaS Co)
    • appname.paascloud.mycompany.com
  • Internal DNS (MyCompany, i.e. pdc1.mycompany.uk)
    • Authoritatively Resolve requests for *.prod.mycompany.uk
    • Conditional Forward requests for *.paascloud.mycompany.com to ns1.mspco.com
  • External DNS (MSPCo, i.e. ns1.mspco.com)
    • Authoritatively Resolve requests for *.paascloud.mycompany.com
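To make the Internal DNS behaviour above a bit more concrete, here's a minimal Python sketch of the decision it makes for each query. This is a toy model only, not how AD-integrated DNS is actually implemented; the function names (in_zone, route_query) and the returned strings are made up for illustration, while the zone and server names come from the lists above.

```python
# Toy model of the Internal DNS decision; names of functions are hypothetical.
AUTHORITATIVE_ZONES = ["prod.mycompany.uk"]
CONDITIONAL_FORWARDERS = {"paascloud.mycompany.com": "ns1.mspco.com"}


def in_zone(name, zone):
    """True if name is the zone itself or sits underneath it."""
    name = name.rstrip(".").lower()
    return name == zone or name.endswith("." + zone)


def route_query(name):
    """Describe what the Internal DNS server would do with a query for name."""
    for zone in AUTHORITATIVE_ZONES:
        if in_zone(name, zone):
            return "answer authoritatively"
    for zone, server in CONDITIONAL_FORWARDERS.items():
        if in_zone(name, zone):
            return f"forward to {server}"
    # Anything else is assumed to follow the server's default recursion path.
    return "default recursion/forwarding"


print(route_query("appname.paascloud.mycompany.com."))
print(route_query("appname.prod.mycompany.uk"))
```

The first call lands on the Conditional Forwarder to ns1.mspco.com, and the second is answered locally.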

The Problem with DNS Recursion

All that we've achieved above is a series of "forwarders", such that, for the worst case (Internal Client), they'll do this:

  1. Lookup appname.paascloud.mycompany.com against Internal AD-DC DNS (i.e. pdc1.mycompany.uk)
  2. Internal AD-DC DNS Conditional Forwards this to MSPCo External DNS (i.e. ns1.mspco.com)
  3. MSPCo External DNS (i.e. ns1.mspco.com) resolves this to a CNAME of bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
    1. MSPCo External DNS (i.e. ns1.mspco.com) then Recursively Resolves this against its upstream DNS Provider (let's say dns1.bigisp.com)...
    2. ...Which queries the Root DNS Servers (i.e. a.root-servers.net), which (by way of the .com TLD Servers) tell it to ask the PaaS Co Authoritative DNS Servers (i.e. ns1.paasco.com) for the A Name associated with bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com...
    3. ...Which comes back from PaaS Co DNS Servers (i.e. ns1.paasco.com) as Public IP Address 203.0.113.234 (not real, check out RFC 5737 - IPv4 Address Blocks Reserved for Documentation)
  4. Internal AD-DC DNS replies back to the Internal Client, for a request of appname.paascloud.mycompany.com, with:
    1. (The CNAME) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
    2. (The A Name) 203.0.113.234
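If it helps to see the chain above as code, here's a rough Python sketch that follows a CNAME until it hits an A Name. It's a toy with no real network lookups: the two dictionaries stand in for the MSPCo and PaaS Co authoritative servers, the function names are hypothetical, and the TTL values are the ones that show up in the nslookup output later in this post.

```python
# Toy model of the lookup chain; no real DNS queries are made.
CNAME_TARGET = "bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com"

# name -> (record type, value, TTL in seconds)
MSPCO_ZONE = {"appname.paascloud.mycompany.com": ("CNAME", CNAME_TARGET, 7200)}
PAASCO_ZONE = {CNAME_TARGET: ("A", "203.0.113.234", 60)}


def authoritative_lookup(name):
    """Return the record for name from whichever toy zone holds it, else None."""
    for zone in (MSPCO_ZONE, PAASCO_ZONE):
        if name in zone:
            return zone[name]
    return None


def resolve(name, max_hops=8):
    """Follow CNAMEs until an A record is found; return every answer on the way."""
    answers = []
    for _ in range(max_hops):
        record = authoritative_lookup(name)
        if record is None:
            break
        rtype, value, ttl = record
        answers.append((name, rtype, value, ttl))
        if rtype == "A":
            break
        name = value  # chase the CNAME target next
    return answers


for answer in resolve("appname.paascloud.mycompany.com"):
    print(answer)
```

The result has two entries, the CNAME and then the A Name, each carrying its own TTL; that independence is what matters for the rest of this post.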

Phew, there's a lot of steps eh? But at least we're out of the woods now, the client has the IPv4 Address it needs, so what's the "Page not Displayed" thing all about?

Pesky DNS TTLs

Here's the bit where the hierarchy of recursion in DNS starts to 1-up you, and the bad day kicks in - perhaps as known all-too-well by these graffiti artists:

DNS Graffiti Artists make Mario 1UP his DNS

Firstly, a caveat - all of the below may be different for your scenario, depending on how MSPCo DNS Recursion is/isn't setup.

If we make use of the lovely nslookup tool on Windows, here's what we can deduce for our good response (i.e. when the page actually displays, rather than the dreaded IE "Page not Displayed" error). Remember that pdc1.mycompany.uk is my Internal DNS Server (for this example anyway, in reality AD has a Parent/Child Regional Domain Controller hierarchy, so each Client uses a different AD-DC):

C:\Users\NervousAdmin>nslookup
> set debug
> server pdc1.mycompany.uk
<snip - goes off and resolves pdc1.mycompany.uk to IP 10.0.1.99>
> appname.paascloud.mycompany.com.
Server:  pdc1.mycompany.uk
Address:  10.0.1.99

------------
Got answer:
    HEADER:
        opcode = QUERY, id = 24, rcode = NOERROR
        header flags:  response, want recursion, recursion avail.
        questions = 1,  answers = 2,  authority records = 0,  additional = 0

    QUESTIONS:
        appname.paascloud.mycompany.com, type = A, class = IN
    ANSWERS:
    ->  appname.paascloud.mycompany.com
        canonical name = bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
        ttl = 7200 (2 hours)
    ->  bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
        internet address = 203.0.113.234
        ttl = 60 (1 min)
<snip>
------------
Name:    bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
Address:  203.0.113.234
Aliases:  appname.paascloud.mycompany.com

Given the response above is good (when everything is working), what does the above tell you? If we focus on the TTL sections, you'll see Windows has cached two responses here:

  1. appname.paascloud.mycompany.com -[CNAME]-> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com, cached for 7200 seconds (or 2 hours)
  2. bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com -[A Name]-> 203.0.113.234, cached for 60 seconds (1 min)

So what happens in 60 seconds, when that A Name expires? Let's find out - the ">" shows you are still within nslookup, so just hit the Up key, then Enter, to re-lookup appname.paascloud.mycompany.com. (as per prior posts, the appended dot means "just this exact FQDN, and no additional DNS Suffixes"). If you just did nslookup appname.paascloud.mycompany.com, without the suffixed dot, it would try (amongst others) to lookup appname.paascloud.mycompany.com.prod.mycompany.uk and fail miserably.

Keep repeating the lookup, and eventually you'll notice the bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com section goes to a TTL of 0:

> appname.paascloud.mycompany.com.
Server:  pdc1.mycompany.uk
Address:  10.0.1.99
<snip - only interested in the CNAME ttl section>
ANSWERS:
<snip>
-> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
 internet address = 203.0.113.234
 ttl = 0

But you'll notice your browser access to https://appname.paascloud.mycompany.com works fine during these tests; until you do the nslookup again, after the ttl = 0 response. Now, there be dragons.

Uh-oh, where's my response gone?

When you refresh again, your heart will drop, your bum will tighten, your browser access to https://appname.paascloud.mycompany.com will stop working, and you'll see this:

C:\Users\NervousAdmin>nslookup
> set debug
> server pdc1.mycompany.uk
<snip - goes off and resolves pdc1.mycompany.uk to IP 10.0.1.99>
> appname.paascloud.mycompany.com.
Server:  pdc1.mycompany.uk
Address:  10.0.1.99

------------
Got answer:
    HEADER:
        opcode = QUERY, id = 28, rcode = NOERROR
        header flags:  response, want recursion, recursion avail.
        questions = 1,  answers = 1,  authority records = 0,  additional = 0

    QUESTIONS:
        appname.paascloud.mycompany.com, type = A, class = IN
    ANSWERS:
    ->  appname.paascloud.mycompany.com
        canonical name = bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
        ttl = 6926 (1 hour 55 mins 26 secs)

<snip>
------------
Name:    appname.paascloud.mycompany.com

Which will give you your dreaded "Page not Displayed" friend, for exactly another 1 hour, 55 minutes and 26 seconds.

And how do I know that? Because that's what the TTL says that CNAME entry will stay in your cache for - regardless of the fact your Windows Client hasn't had a recursive response of the actual IP Address that it ultimately resolves to (203.0.113.234).
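Here's a small Python sketch of that failure, for anyone who'd rather read code than nslookup output. It assumes a caching layer that expires each record on its own TTL and answers from whatever is still cached, without going back for the expired A Name; that's the behaviour seen in this post, not necessarily how every resolver behaves. The ToyCache class is hypothetical, and the 274 second mark is simply 7200 minus the 6926 seconds shown above.

```python
# Toy cache: every record expires independently, and nothing is re-queried.
CNAME_TARGET = "bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com"


class ToyCache:
    def __init__(self):
        self.entries = {}  # name -> (record type, value, expiry time in seconds)

    def store(self, name, rtype, value, ttl, now):
        self.entries[name] = (rtype, value, now + ttl)

    def answer(self, name, now):
        """Follow the chain using only records that have not yet expired."""
        answers = []
        while name in self.entries:
            rtype, value, expires_at = self.entries[name]
            if expires_at <= now:
                break  # expired: the chain stops here, with no fresh lookup
            answers.append((rtype, value, expires_at - now))
            if rtype != "CNAME":
                break
            name = value
        return answers


cache = ToyCache()
cache.store("appname.paascloud.mycompany.com", "CNAME", CNAME_TARGET, 7200, now=0)
cache.store(CNAME_TARGET, "A", "203.0.113.234", 60, now=0)

fresh = cache.answer("appname.paascloud.mycompany.com", now=30)   # CNAME + A Name
stale = cache.answer("appname.paascloud.mycompany.com", now=274)  # CNAME only
print(len(fresh), len(stale), stale[0][2])  # prints: 2 1 6926
```

At 30 seconds both records come back; at 274 seconds only the CNAME is left, with 6926 seconds still on its clock and no IP Address behind it.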

So what's the fix? Firstly, let's touch on DNS TTL. Conceptually this isn't much different to IPv4 TTL, in that it counts down towards 0 (although a DNS TTL is measured in seconds, whereas an IPv4 TTL is decremented per hop); once the TTL hits 0, the entry will be purged from your local DNS Cache. What happens next is the crucial part, dictated by the "DNS Response Hierarchy" your response had; if it's just a straight single-level hierarchy (i.e. domain.com -> 203.0.113.1), then your Client will go off and send a fresh DNS Request to resolve domain.com to an IP Address.

But our case is different, and not in a good way - our "DNS Response Hierarchy" looks like this:

  1. (Parent) Fetch appname.paascloud.mycompany.com
    1. (Child) If you got here, now fetch bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com

But our TTLs look like this:

  1. (Parent) appname.paascloud.mycompany.com = TTL <bigger than "Child">
    1. (Child) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com = TTL <smaller than "Parent">

That's not what we want at all; given these are two differing DNS Administrative Domains (owned and operated by two differing Companies - MSPCo for appname.paascloud.mycompany.com and PaaS Co for bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com), we (MyCompany) don't have any direct control over these. Regardless though, we need them to flip-it-around so that this happens:

  1. (Parent) appname.paascloud.mycompany.com = TTL <smaller than (or same as) "Child">
    1. (Child) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com = TTL <bigger than (or same as) "Parent">

This way, when the "Parent" (initial, or root, or "actual FQDN I wanted the IP for") TTL expires, it will take the "Child" (the CNAME's target) entry with it; which means the whole DNS Lookup process will re-occur, and we'll happily get an IPv4 Address back. Technically simple, but you try and explain that to MSPCo and PaaS Co, and you'll find your "shouty voice TTL" quickly gets towards that precious 0...
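If you want something to wave at MSPCo and PaaS Co, the rule boils down to a check you could script. Below is a minimal Python sketch of it, assuming you've already collected the (name, TTL) pairs for the chain by hand (from nslookup, say); find_ttl_inversions is a made-up helper name, and the 60 second "fixed" CNAME TTL is just one example value that satisfies the rule.

```python
# Flag any alias ("Parent") whose TTL outlives the record it points at ("Child").
CNAME_TARGET = "bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com"


def find_ttl_inversions(chain):
    """chain is a list of (name, ttl_seconds) in resolution order, alias first.

    Returns one (parent, parent_ttl, child, child_ttl) tuple per bad link.
    """
    problems = []
    for (parent, parent_ttl), (child, child_ttl) in zip(chain, chain[1:]):
        if parent_ttl > child_ttl:
            problems.append((parent, parent_ttl, child, child_ttl))
    return problems


# TTLs as seen in this post: the CNAME outlives the A Name.
current = [("appname.paascloud.mycompany.com", 7200), (CNAME_TARGET, 60)]
# Hypothetical fix: CNAME TTL lowered to match the A Name.
fixed = [("appname.paascloud.mycompany.com", 60), (CNAME_TARGET, 60)]

print(find_ttl_inversions(current))  # one inversion reported
print(find_ttl_inversions(fixed))    # prints: []
```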