Disappearing DNS entries when your CNAME TTL differs from your PaaS Provider's
The dreaded "Page can't be displayed" error
Most people in the field of IT or Networking will have seen this lovely Internet Explorer error, and immediately recognised their day was about to change course away from the schedule:
The why can vary massively; for this blog post, we'll look at one case in point - what happens when the DNS Time to Live (TTL) on your CNAME doesn't match up with the TTL on your Platform as a Service (PaaS) provider's A Name (A record). But first, a bit of background here - names changed to protect the innocent.
The Scenario with the PaaS Provider
We've got a Web Application, which used to be on-premises (or "on-prem" for you cool Cloud Kids), that we've decided to farm out to a PaaS Provider. It's very important to the Business, but in terms of the technology employed it's nothing special - think an HTTPS Website, where the PaaS provider does DNS-based "Elastic" (boing!) Load Balancing - also known as GSLB, but the new Cloudy World has to re-invent the terms we're already used to... grumble grumble.
Let's throw in some made-up pseudonyms to anonymise this a bit, and add some context:
- My Employer (Enterprise Business, or "the Business")
  - Name - MyCompany Ltd
  - Main External URL/Domain - mycompany.com
  - Main Internal URL/Domain - prod.mycompany.uk
- PaaS Provider
  - Name - PaaS Co. Ltd
  - Main PaaS URL/Domain - paasco.com
  - Cloud Environment Name - PaaSCloud
  - Use Load Balancers from - BigAssLoadBalancers (Vendor)
Because the Business (rightly) thinks that a new PaaS URL of https://bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com might not be as easy-to-remember as the old on-prem (yes, I'm trying to bait you with that phrase) one of https://appname.prod.mycompany.uk, and because we've got no choice about the PaaS URL, we've taken the decision to make a new sub-domain of *.paascloud.mycompany.com. While we're there, we think we'll sort out the outmoded concept of Internal (prod.mycompany.uk) vs External (mycompany.com) URLs, because this is all hosted off-prem anyway; so it's technically no longer part of our "internal" Domain.
Regardless of PaaS Co, MyCompany uses Internal DNS that sits on Active Directory Domain Controllers; for the sake of ease, I'll call this "Internal DNS". MyCompany outsources its Internet DMZ Data Centres to another MSP; we'll call them MSPCo. MSPCo's only relevance here is that they run our External DNS/Domain (from Internet-facing ns1.mspco.com DNS Servers), whereas we run our Internal DNS/Domain AD-DC DNS Servers. Or, in short:
- MyCompany
  - Run Internal DNS Servers (i.e. pdc1.mycompany.uk) that are authoritative (but not advertised to Internet) for *.mycompany.uk
- MSPCo
  - Run External DNS Servers (i.e. ns1.mspco.com) that are authoritative for *.mycompany.com
To give us an easy-to-remember FQDN for the AppName Web Application, we've set up the following, which means it will be https://appname.paascloud.mycompany.com:
- Sub-Domain Space (for all Apps on PaaS Co) - *.paascloud.mycompany.com
- Current PaaS Web App (one of the Apps on PaaS Co) - appname.paascloud.mycompany.com
- Internal DNS (MyCompany, i.e. pdc1.mycompany.uk)
  - Authoritatively Resolve requests for *.prod.mycompany.uk
  - Conditional Forward requests for *.paascloud.mycompany.com to ns1.mspco.com
- External DNS (MSPCo, i.e. ns1.mspco.com)
  - Authoritatively Resolve requests for *.paascloud.mycompany.com
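As a rough sketch of the forwarding rules above, here's the decision the Internal DNS server makes for each query, written as a simple suffix match in Python. This is a toy model, not an actual Windows DNS configuration; the names route_query, AUTHORITATIVE_ZONES and CONDITIONAL_FORWARDERS are made up for illustration.

```python
# Toy model of the Internal DNS routing rules described in the post.
# All names here are hypothetical; a real AD-DC holds this in its DNS config.
AUTHORITATIVE_ZONES = ("prod.mycompany.uk",)
CONDITIONAL_FORWARDERS = {"paascloud.mycompany.com": "ns1.mspco.com"}


def in_zone(fqdn, zone):
    """True if fqdn is the zone apex or sits anywhere beneath it."""
    name = fqdn.rstrip(".").lower()
    return name == zone or name.endswith("." + zone)


def route_query(fqdn):
    """Say what the Internal DNS server would do with a query for fqdn."""
    for zone in AUTHORITATIVE_ZONES:
        if in_zone(fqdn, zone):
            return "answer locally (authoritative)"
    for zone, forwarder in CONDITIONAL_FORWARDERS.items():
        if in_zone(fqdn, zone):
            return "forward to " + forwarder
    return "default recursion"


if __name__ == "__main__":
    for query in ("appname.prod.mycompany.uk",
                  "appname.paascloud.mycompany.com.",
                  "example.org"):
        print(query, "->", route_query(query))
```

The trailing dot is stripped before matching, so the fully-qualified form used with nslookup later in this post routes the same way as the bare name.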
The Problem with DNS Recursion
All that we've achieved above is a series of "forwarders", such that, for the worst case (Internal Client), they'll do this:
- Lookup appname.paascloud.mycompany.com against Internal AD-DC DNS (i.e. pdc1.mycompany.uk)
- Internal AD-DC DNS Conditional Forwards this to MSPCo External DNS (i.e. ns1.mspco.com)
- MSPCo External DNS (i.e. ns1.mspco.com) resolves this to a CNAME of bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
- MSPCo External DNS (i.e. ns1.mspco.com) then Recursively Resolves this against its upstream DNS Provider (let's say dns1.bigisp.com)...
  - ...Which queries the Root DNS Servers (i.e. a.root-servers.net), which (by way of the .com TLD servers) tell it to ask the PaaS Co Authoritative DNS Servers (i.e. ns1.paasco.com) for the A Name associated with bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com...
  - ...Which comes back from PaaS Co DNS Servers (i.e. ns1.paasco.com) as Public IP Address 203.0.113.234 (not real, check out RFC 5737 - IPv4 Address Blocks Reserved for Documentation)
- Internal AD-DC DNS replies back to the Internal Client, for a request of appname.paascloud.mycompany.com, with:
  - (The CNAME) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
  - (The A Name) 203.0.113.234
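Stripped of all the forwarding hops, the steps above boil down to "follow the CNAME until you hit an A record". Here's a minimal Python sketch of that chain, using a hard-coded dictionary in place of real DNS servers; the RECORDS table and the resolve function are illustrative assumptions, and the TTL values are the ones that show up in the nslookup output in this post.

```python
# Hard-coded stand-in for the two DNS zones involved; no network access.
# Keys are (name, record type); values are (record data, TTL in seconds).
RECORDS = {
    ("appname.paascloud.mycompany.com", "CNAME"):
        ("bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com", 7200),
    ("bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com", "A"):
        ("203.0.113.234", 60),
}


def resolve(name, records=RECORDS, max_hops=8):
    """Follow CNAMEs from name until an A record is found.

    Returns the answer section as a list of (name, type, data, ttl) tuples.
    """
    answers = []
    current = name.rstrip(".")
    for _ in range(max_hops):  # guard against CNAME loops
        if (current, "A") in records:
            data, ttl = records[(current, "A")]
            answers.append((current, "A", data, ttl))
            return answers
        if (current, "CNAME") in records:
            data, ttl = records[(current, "CNAME")]
            answers.append((current, "CNAME", data, ttl))
            current = data
            continue
        break  # nothing known about this name
    return answers


if __name__ == "__main__":
    for record in resolve("appname.paascloud.mycompany.com."):
        print(record)
```

Note that each record in the answer carries its own TTL, set by whoever owns that zone.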
Phew, there's a lot of steps eh? But at least we're out of the woods now, the client has the IPv4 Address it needs, so what's the "Page not Displayed" thing all about?
Pesky DNS TTLs
Here's the bit where the hierarchy of recursion in DNS starts to 1-up you, and the bad day kicks in - perhaps as known all-too-well by these graffiti artists:
Firstly, a caveat - all of the below may be different for your scenario, depending on how MSPCo DNS Recursion is (or isn't) set up.
If we make use of the lovely nslookup tool on Windows, here's what we can deduce for our good response (i.e. when the page actually displays, rather than the dreaded IE "Page not Displayed" error). Remember that pdc1.mycompany.uk is my Internal DNS Server (for this example anyway; in reality AD has a Parent/Child Regional Domain Controller hierarchy, so each Client uses a different AD-DC):
C:\Users\NervousAdmin>nslookup
> set debug
> server pdc1.mycompany.uk
<snip - goes off and resolves pdc1.mycompany.uk to IP 10.0.1.99>
> appname.paascloud.mycompany.com.
Server: pdc1.mycompany.uk
Address: 10.0.1.99
------------
Got answer:
HEADER:
opcode = QUERY, id = 24, rcode = NOERROR
header flags: response, want recursion, recursion avail.
questions = 1, answers = 2, authority records = 0, additional = 0
QUESTIONS:
appname.paascloud.mycompany.com, type = A, class = IN
ANSWERS:
-> appname.paascloud.mycompany.com
canonical name = bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
ttl = 7200 (2 hours)
-> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
internet address = 203.0.113.234
ttl = 60 (1 min)
<snip>
------------
Name: bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
Address: 203.0.113.234
Aliases: appname.paascloud.mycompany.com
Given the response above is good (when everything is working), what does the above tell you? If we focus on the TTL sections, you'll see Windows has cached two responses here:
- appname.paascloud.mycompany.com -[CNAME]-> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com, cached for 7200 seconds (or 2 hours)
- bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com -[A Name]-> 203.0.113.234, cached for 60 seconds (1 min)
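To make the mismatch concrete, here's a tiny Python sketch of how those two cached entries count down side by side. It assumes the simplest possible model (remaining TTL = original TTL minus seconds elapsed, floored at zero); the function names are made up for illustration.

```python
# Original TTLs (seconds) as seen in the nslookup debug output.
CACHED_TTLS = {"CNAME": 7200, "A": 60}


def remaining_ttl(original_ttl, seconds_elapsed):
    """Seconds left before a cached record expires, never below zero."""
    return max(original_ttl - seconds_elapsed, 0)


def snapshot(seconds_elapsed, cached=CACHED_TTLS):
    """Remaining TTL of every cached record after seconds_elapsed."""
    return {rtype: remaining_ttl(ttl, seconds_elapsed)
            for rtype, ttl in cached.items()}


if __name__ == "__main__":
    for elapsed in (0, 30, 60, 3600):
        print(elapsed, snapshot(elapsed))
```

One minute in, the A Name entry has run out while the CNAME entry still has 7140 seconds to go.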
So what happens in 60 seconds, when that A Name expires? Let's find out - the ">" shows you are within nslookup, so just hit the Up key, and Enter, to re-lookup appname.paascloud.mycompany.com. (as per prior posts, the appended dot means "just this exact FQDN, and no additional DNS Suffixes"). If you just did nslookup appname.paascloud.mycompany.com, without the suffixed dot, it would try (amongst others) to lookup appname.paascloud.mycompany.com.prod.mycompany.uk and fail miserably. Keep repeating the lookup, and eventually you'll notice the bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com section goes to a TTL of 0:
> appname.paascloud.mycompany.com.
Server: pdc1.mycompany.uk
Address: 10.0.1.99
<snip - only interested in the CNAME ttl section>
ANSWERS:
<snip>
-> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
internet address = 203.0.113.234
ttl = 0
But you'll notice your browser access to https://appname.paascloud.mycompany.com works fine during these tests; until you do the nslookup again, after the ttl = 0 response. Now, there be dragons.
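If you get bored of eyeballing the debug output on every repeat, the TTL lines can be pulled out with a few lines of Python. This is only a sketch: it assumes the output looks like the (de-indented) samples pasted in this post, with a "-> name" line followed later by a "ttl = N" line, and the real layout may differ between Windows versions. The function name extract_ttls is made up.

```python
import re

# A record starts with "-> <name>" and its TTL follows on a "ttl = N" line.
RECORD_LINE = re.compile(r"^\s*->\s*(\S+)\s*$")
TTL_LINE = re.compile(r"^\s*ttl\s*=\s*(\d+)")


def extract_ttls(debug_output):
    """Map each record name in pasted nslookup debug output to its TTL."""
    ttls = {}
    current = None
    for line in debug_output.splitlines():
        record = RECORD_LINE.match(line)
        if record:
            current = record.group(1)
            continue
        ttl = TTL_LINE.match(line)
        if ttl and current is not None:
            ttls[current] = int(ttl.group(1))
            current = None
    return ttls


SAMPLE = """ANSWERS:
-> appname.paascloud.mycompany.com
canonical name = bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
ttl = 7200 (2 hours)
-> bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
internet address = 203.0.113.234
ttl = 60 (1 min)
"""

if __name__ == "__main__":
    for name, ttl in extract_ttls(SAMPLE).items():
        print(ttl, name)
```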
Uh-oh, where's my response gone?
When you refresh again, your heart will drop, your bum will tighten, your browser access to https://appname.paascloud.mycompany.com will stop working, and you'll see this:
C:\Users\NervousAdmin>nslookup
> set debug
> server pdc1.mycompany.uk
<snip - goes off and resolves pdc1.mycompany.uk to IP 10.0.1.99>
> appname.paascloud.mycompany.com.
Server: pdc1.mycompany.uk
Address: 10.0.1.99
------------
Got answer:
HEADER:
opcode = QUERY, id = 28, rcode = NOERROR
header flags: response, want recursion, recursion avail.
questions = 1, answers = 1, authority records = 0, additional = 0
QUESTIONS:
appname.paascloud.mycompany.com, type = A, class = IN
ANSWERS:
-> appname.paascloud.mycompany.com
canonical name = bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
ttl = 6926 (1 hour 55 mins 26 secs)
<snip>
------------
Name: appname.paascloud.mycompany.com
Which will give you your dreaded "Page not Displayed" friend, for exactly another 1 hour, 55 minutes and 26 seconds.
And how do I know that? Because that's what the TTL says that CNAME entry will stay in your cache for - regardless of the fact your Windows Client hasn't had a recursive response of the actual IP Address that it ultimately resolves to (203.0.113.234).
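Here's a toy Python model of the behaviour seen above: a cache that hands back whatever unexpired records it still holds, and never goes upstream to refresh the expired A Name. To be clear, this is my guess at a model that reproduces the symptoms, not a description of how Windows DNS is actually implemented; NaiveCache and hms are made-up names.

```python
NAME = "appname.paascloud.mycompany.com"
TARGET = "bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com"


class NaiveCache:
    """Serves unexpired records only; deliberately never re-queries upstream."""

    def __init__(self):
        self._entries = {}  # (name, rtype) -> (data, expiry time in seconds)

    def put(self, name, rtype, data, ttl, now):
        self._entries[(name, rtype)] = (data, now + ttl)

    def get(self, name, rtype, now):
        entry = self._entries.get((name, rtype))
        if entry is None:
            return None
        data, expires_at = entry
        if expires_at <= now:
            return None  # expired, treated as if it was never cached
        return data, expires_at - now

    def answer(self, name, now):
        """Answer section for name: optional CNAME, then optional A."""
        answers = []
        target = name
        cname = self.get(name, "CNAME", now)
        if cname:
            answers.append(("CNAME", cname[0], cname[1]))
            target = cname[0]
        address = self.get(target, "A", now)
        if address:
            answers.append(("A", address[0], address[1]))
        return answers


def hms(seconds):
    """Split a number of seconds into (hours, minutes, seconds)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return hours, minutes, secs


cache = NaiveCache()
cache.put(NAME, "CNAME", TARGET, 7200, now=0)
cache.put(TARGET, "A", "203.0.113.234", 60, now=0)

if __name__ == "__main__":
    print(cache.answer(NAME, now=0))    # CNAME and A
    print(cache.answer(NAME, now=274))  # CNAME only, no IP Address
    print(hms(6926))
```

In this model, 274 seconds after the first lookup the answer contains only the CNAME with 6926 seconds left, which is the alias-with-no-address response shown above.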
So what's the fix? Firstly, let's touch on DNS TTL. Like the IPv4 TTL it counts down to zero, but it's measured in seconds rather than router hops; once the TTL hits 0, the entry will be purged from your local DNS Cache. What happens next is the crucial part, dictated by the "DNS Response Hierarchy" your response had; if it's just a straight single-level hierarchy (i.e. domain.com -> 203.0.113.1), then your Client will go off and send a fresh DNS Request to resolve domain.com to an IP Address.
But our case is different, and not in a good way - our "DNS Response Hierarchy" looks like this:
- (Parent) Fetch appname.paascloud.mycompany.com
  - (Child) If you got here, now fetch bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com
But our TTLs look like this:
- (Parent) appname.paascloud.mycompany.com = TTL <bigger than "Child">
  - (Child) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com = TTL <smaller than "Parent">
That's not what we want at all; given these are two differing DNS Administrative Domains (owned and operated by two differing Companies - MSPCo for appname.paascloud.mycompany.com and PaaS Co for bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com), we (MyCompany) don't have any direct control over these. Regardless though, we need them to flip-it-around so that this happens:
- (Parent) appname.paascloud.mycompany.com = TTL <smaller than (or same as) "Child">
  - (Child) bigassloadbalancers-appnameprodmycompany-paascloud.paasco.com = TTL <bigger than (or same as) "Parent">
This way, when the "Parent" (initial, or root, or "actual FQDN I wanted the IP for") TTL expires, it will remove the "Child" (CNAME target) entry with it; which means the DNS Lookup process will re-occur, and we'll happily get an IPv4 Address back. Technically simple, but you try and explain that to MSPCo and PaaS Co, and you'll find your "shouty voice TTL" quickly gets towards that precious 0...
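The ordering rule above is easy to check mechanically. Here's a small Python sketch that takes the two TTLs and reports whether the "Parent" outlives the "Child", plus the worst-case window during which the CNAME could be cached with no address behind it (under the behaviour described in this post). The function name check_chain and the "exposure" wording are my own.

```python
def check_chain(cname_ttl, target_ttl):
    """Compare the CNAME ("Parent") TTL against its target's ("Child") TTL.

    Returns a dict with:
      ok               - True when the CNAME TTL is <= the target's TTL
      exposure_seconds - how long the CNAME can outlive the target record
    """
    if cname_ttl <= target_ttl:
        return {"ok": True, "exposure_seconds": 0}
    return {"ok": False, "exposure_seconds": cname_ttl - target_ttl}


if __name__ == "__main__":
    # The values from this post: CNAME at 7200 seconds, A Name at 60 seconds.
    print(check_chain(7200, 60))
    # The flipped-around arrangement: Parent no bigger than Child.
    print(check_chain(60, 60))
```

With the values from this post, the CNAME can outlive the A Name by 7140 seconds, so just under two hours of potential "Page not Displayed".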