Last updated at Tue, 05 Dec 2017 19:48:04 GMT

Synopsis:

Host operating system resolver libraries do not cope well with an unreachable nameserver. Even if you specify multiple nameservers in resolv.conf, when one of them goes down there will be a period during which connections fail because names cannot be resolved in time. There are a number of resolver tuning options, but even with the timeout reduced to 1 second there is still a delay. This affects nearly all Unix-like operating systems, including GNU/Linux.

In this article we are concerned with an unresponsive nameserver, i.e. one from which no packets come back at all, in contrast to a nameserver whose DNS service has failed but whose host can still answer requests with ICMP Port Unreachable messages. In the latter situation failover happens more quickly because the system doesn’t need to wait for the timeout to expire. The former situation often causes applications to fail. This is especially important for a monitoring server such as Nagios, which can generate a large number of notifications if it cannot resolve names or connect to a host within a defined threshold or timeout value.

Case Study:

In the following examples, I will simulate an unresponsive nameserver by using iptables to prevent communication from a host to its primary name server.

Working Primary Nameserver

A normal looking resolv.conf:

$ cat /etc/resolv.conf 
search internal.company.com company.com 

nameserver 10.1.2.2 
nameserver 10.1.230.144

The following tests indicate that networking and name resolution are working.

$ ping -c 1 10.1.2.2 
PING 10.1.2.2 (10.1.2.2) 56(84) bytes of data.
64 bytes from 10.1.2.2: icmp_seq=1 ttl=61 time=0.243 ms
$ dig nagios.internal.company.com
... 
;; ANSWER SECTION: 
nagios.internal.company.com. 86400 IN A 10.1.1.9 
... 
;; Query time: 0 msec 
;; SERVER: 10.1.2.2#53(10.1.2.2) 
;; WHEN: Thu Sep 11 11:55:39 2014 
;; MSG SIZE rcvd: 146

Failing Primary Nameserver:

Let’s fake a DNS outage by preventing our host from contacting the primary name server.

$ iptables -A OUTPUT -d 10.1.2.2 -j DROP

Verify that the host is unreachable

$ ping -c 1 10.1.2.2
PING 10.1.2.2 (10.1.2.2) 56(84) bytes of data.
ping: sendmsg: Operation not permitted 
^C
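
The DROP rule above models the silent failure. To compare it with the other case described in the synopsis, where the host still answers with ICMP Port Unreachable, a REJECT rule can be used instead. The following is a minimal sketch under that assumption. dns_failure_rule is a hypothetical helper, not part of the original tests, and it only prints the rule so that nothing changes until the output is run as root.

```shell
# Print (without running) the iptables rule that simulates a failure mode.
# $1: "silent" or "refused"   $2: nameserver IP address
# dns_failure_rule is a hypothetical helper name used for illustration.
dns_failure_rule() {
    case "$1" in
        silent)
            # DROP discards the packets and sends nothing back,
            # which is the unresponsive case this article studies.
            echo "iptables -A OUTPUT -d $2 -j DROP"
            ;;
        refused)
            # REJECT answers with an ICMP Port Unreachable error,
            # which is the quicker-failing case from the synopsis.
            echo "iptables -A OUTPUT -d $2 -j REJECT --reject-with icmp-port-unreachable"
            ;;
        *)
            echo "usage: dns_failure_rule silent|refused IP" >&2
            return 1
            ;;
    esac
}
```

For example, running `dns_failure_rule refused 10.1.2.2 | sh` as root would install the REJECT variant, so the two failure modes can be timed against each other.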

Nagios is returning critical statuses on its DNS checks because they’re taking longer than 1 second to complete. The 1 second threshold is defined in the Nagios configuration.

$ tail -f /usr/local/nagios/var/nagios.log
[1410454706] SERVICE ALERT: nagios-console;Check DNS;CRITICAL;SOFT;2;DNS CRITICAL: 1.009 second response time. 
[1410454791] SERVICE ALERT: open-nsm;Check DNS;CRITICAL;HARD;3;DNS CRITICAL: 1.008 second response time. 
[1410454826] SERVICE ALERT: nagios-console;Check DNS;CRITICAL;HARD;3;DNS CRITICAL: 1.009 second response time. 
[1410454826] SERVICE NOTIFICATION: companybot;nagios-console;Check DNS;CRITICAL;notify-service-by-irc;DNS CRITICAL: 1.009 second response time.

Tuning the Resolver: An Incomplete Solution

By default on Linux, the resolver’s timeout is 5 seconds and its attempts value (the number of tries) is 2. According to resolv.conf(5), the resolver tries the nameservers in the order listed and moves on to the next one only after a query times out, so with the defaults a lookup waits about 5 seconds on the unresponsive primary before the secondary is asked, and up to 10 seconds can be spent on the dead server if a second round of attempts is needed. Note also that this wait is incurred by each process that attempts resolution. Many processes will simply give up before the secondary nameserver is used.
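
The waiting time discussed above follows from two numbers in resolv.conf. As a rough sketch, the helper below reads the options line and prints timeout multiplied by attempts, i.e. roughly the longest the resolver is prepared to wait on one unresponsive nameserver. resolver_wait_budget is a hypothetical name. The sketch assumes the defaults of 5 and 2 when no options are set and ignores any other resolver behaviour.

```shell
# Print timeout * attempts (in seconds) for a resolv.conf-style file.
# $1: path to the file (defaults to /etc/resolv.conf)
# resolver_wait_budget is a hypothetical helper name used for illustration.
resolver_wait_budget() {
    awk '
        # Defaults documented in resolv.conf(5)
        BEGIN { timeout = 5; attempts = 2 }
        # Pick up "timeout:N" and "attempts:N" from any options line
        $1 == "options" {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^timeout:[0-9]+$/)  { split($i, kv, ":"); timeout  = kv[2] }
                if ($i ~ /^attempts:[0-9]+$/) { split($i, kv, ":"); attempts = kv[2] }
            }
        }
        END { print timeout * attempts }
    ' "${1:-/etc/resolv.conf}"
}
```

For a file with no options line it prints 10, and for a file containing `options timeout:1 attempts:1` it prints 1.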

In this example I set timeout and attempts in resolv.conf to their lowest possible values so that the resolver switches name servers after 1 second.

$ cat /etc/resolv.conf 
options timeout:1 attempts:1 
search internal.company.com company.com 

nameserver 10.1.2.2 
nameserver 10.1.230.144

Create the test case by making the DNS server unreachable again

$ iptables -A OUTPUT -d 10.1.2.2 -j DROP

Verify that the host is unreachable

$ ping -c 1 10.1.2.2 
PING 10.1.2.2 (10.1.2.2) 56(84) bytes of data.
ping: sendmsg: Operation not permitted 
^C

We still have failures because the delay in switching still exceeds the plugin’s critical threshold of 1 second, though only by thousandths of a second (0.008 to be precise).

$ tail -f /usr/local/nagios/var/nagios.log
[1410455356] SERVICE ALERT: nagios-console;Check DNS;CRITICAL;SOFT;1;DNS CRITICAL: 1.008 second response time. 
[1410455441] SERVICE ALERT: open-nsm;Check DNS;CRITICAL;SOFT;1;DNS CRITICAL: 1.008 second response time. 
[1410455476] SERVICE ALERT: nagios-console;Check DNS;CRITICAL;SOFT;2;DNS CRITICAL: 1.008 second response time.
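
The same delay can be observed outside Nagios by timing a lookup through the system resolver. This is a minimal sketch, assuming GNU date (for %N) is available on the host. elapsed_ms is a hypothetical helper name.

```shell
# Run a command and print how long it took, in milliseconds.
# Assumes GNU date, whose %N format gives nanoseconds.
# elapsed_ms is a hypothetical helper name used for illustration.
elapsed_ms() {
    start=$(date +%s%N)
    # Discard the command output; only the duration matters here,
    # and a failed lookup should still be timed.
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}
```

For example, `elapsed_ms getent hosts nagios.internal.company.com` prints the lookup time in milliseconds. getent resolves through the C library, which is the same path most applications use.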

Complete Solution:

A solution is to run a more sophisticated DNS resolver such as Unbound on your Nagios systems. Unbound is a lightweight caching resolver that keeps track of how your name servers are responding and forwards requests to one that’s working. If one of them becomes unresponsive, you will not experience a noticeable delay and your applications will continue to run as if nothing happened, because Unbound already knows which nameservers are operational. Caching resolutions also improves performance.

Unbound DNS Resolver:

Start the unbound service

$ service unbound start 
Starting unbound:

Create the test case by making the DNS server unreachable again

$ iptables -A OUTPUT -d 10.1.2.2 -j DROP

Verify that the host is unreachable

$ ping -c 1 10.1.2.2 
PING 10.1.2.2 (10.1.2.2) 56(84) bytes of data.
ping: sendmsg: Operation not permitted 
^C

Nagios doesn’t see any problems: failover combined with resolution takes less than 1 second.

$ tail -f /usr/local/nagios/var/nagios.log
...waiting... 🙂
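
Each test in this walkthrough appends another copy of the DROP rule with -A and none of them is removed afterwards. Once testing is done, the rules need to be deleted or the primary nameserver stays blocked. A minimal sketch, assuming an iptables version that supports -C (rule check) and that it is run as root. clear_dns_block is a hypothetical helper name.

```shell
# Delete every copy of the DROP rule for one destination from the OUTPUT chain.
# $1: nameserver IP address
# clear_dns_block is a hypothetical helper name used for illustration.
clear_dns_block() {
    # -C succeeds while a matching rule still exists;
    # -D removes one matching copy per call.
    while iptables -C OUTPUT -d "$1" -j DROP 2>/dev/null; do
        iptables -D OUTPUT -d "$1" -j DROP || return 1
    done
}
```

Running `clear_dns_block 10.1.2.2` followed by the ping test from the start of the case study should show the host reachable again.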

Configuration:

Configuring Unbound is very easy. In our case we want the local resolver to send its requests to localhost, where Unbound will be listening. Unbound will then handle forwarding our requests to the appropriate name servers.

$ cat /etc/resolv.conf 
nameserver 127.0.0.1
$ cat /etc/unbound/unbound.conf 
server: 
    interface: 127.0.0.1 
    port: 53 
    do-ip6: no 

forward-zone: 
    name: "." # Forward all 
    forward-addr: 10.1.2.2 # Company Primary 
    forward-addr: 10.1.230.144 # Company Secondary 
    forward-addr: 8.8.8.8 # Google Public DNS

Activate the new configuration

$ service unbound restart

Resolution will fail if the Unbound service stops, so test your Unbound configuration before deployment and monitor the Unbound service for failures with a tool like Nagios, addressing it by its IP address rather than by name.
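
One way to monitor the local resolver by IP address is a small check that queries it directly and reports in the Nagios plugin convention (exit 0 for OK, 2 for CRITICAL). This is only a sketch, assuming dig is installed. check_resolver is a hypothetical helper, not the plugin used in the logs above.

```shell
# Query one resolver directly by IP and report Nagios-style.
# $1: resolver IP address   $2: name to look up
# check_resolver is a hypothetical helper name used for illustration.
check_resolver() {
    server=$1
    name=$2
    # +time=1 +tries=1: give up after a single 1 second attempt.
    # +short: print only the answer data, so empty output means no answer.
    if answer=$(dig "@${server}" "$name" A +short +time=1 +tries=1 2>/dev/null) \
        && [ -n "$answer" ]; then
        echo "DNS OK: $name resolved via $server"
        return 0
    fi
    echo "DNS CRITICAL: no answer for $name from $server"
    return 2
}
```

For example, `check_resolver 127.0.0.1 nagios.internal.company.com` exercises the Unbound instance configured above. Before restarting the service, `unbound-checkconf /etc/unbound/unbound.conf` can be used to validate the configuration file.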
