DevOps Troubleshooting: Stop Blaming DNS!
Kyle Rankin
Sr. Systems Administrator, DevOps Engineer
Agenda
- Introduction
- How DNS Works
- Don't Blame DNS
- Local Network Issues
- Routing Issues
- Host is Down
- Blame DNS
- No Servers Could Be Reached
- Server can't find host: NXDOMAIN
- DNS Recursion Problems
- Updates Don't Take
- Questions?
Introduction
- What is DevOps?
- Cooperation between Ops, Devs, and QA
- Break down traditional walls and blame
- What is DevOps Troubleshooting?
- All members of DevOps Team troubleshoot together
- Similar set of basic troubleshooting skills
- Same benefits as DevOps development, applied to troubleshooting
- Why People Blame DNS
- "DNS is a black box"
- People blame the technology they understand the least
- Some are unsure how to troubleshoot DNS.
How DNS Works
- Primary job: converting hostnames to IPs
- Client sends request to local DNS server
- "What is the IP for www.google.com?"
- DNS server starts recursive query
- Tracing a recursive query
- To play at home: dig www.google.com +trace
- Recommend tracing domains you own.
Local Network Issues
Quick Sanity Check
- Ping is not a DNS troubleshooting tool
- Can perform a quick sanity check though:
$ ping web1
PING web1.example.net (10.1.2.5) 56(84) bytes of data.
DNS WORKS!
Surprising how often this is overlooked
nslookup or dig would be better:
$ nslookup web1
Server: 10.1.1.3
Address: 10.1.1.3#53
Name: web1.example.net
Address: 10.1.2.5
Local Network Issues
- Can you ping hosts on the local subnet?
- When in doubt, try your gateway:
$ sudo route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
10.1.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
0.0.0.0 10.1.1.1 0.0.0.0 UG 100 0 0 eth0
Use ping to test the gateway:
$ ping -c 5 10.1.1.1
PING 10.1.1.1 (10.1.1.1) 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=1 ttl=64 time=3.13 ms
64 bytes from 10.1.1.1: icmp_seq=2 ttl=64 time=1.43 ms
64 bytes from 10.1.1.1: icmp_seq=3 ttl=64 time=1.79 ms
64 bytes from 10.1.1.1: icmp_seq=5 ttl=64 time=1.50 ms
--- 10.1.1.1 ping statistics ---
5 packets transmitted, 4 received, 20% packet loss, time 4020ms
rtt min/avg/max/mdev = 1.436/1.966/3.132/0.686 ms
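The gateway check above can be scripted. A minimal sketch, assuming Linux route output in the format shown on this slide; the function name default_gateway is made up for illustration:

```shell
# Print the default gateway from route output read on stdin.
# The default route has Destination 0.0.0.0 (shown as "default" without -n)
# and a G in the Flags column.
default_gateway() {
    awk '($1 == "0.0.0.0" || $1 == "default") && $4 ~ /G/ { print $2; exit }'
}

# Example (assumes route and ping are installed on the host):
#   gw=$(route -n | default_gateway)
#   ping -c 5 "$gw"
```

Packet loss or no reply from the gateway points at the local network, not DNS.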
Routing Issues
- Can packets from your host reach a remote host?
- Ping remote host:
- Ping works, move to next step
- Ping doesn't work, ping another host on same network
- Ping still doesn't work, use traceroute.
- Successful output:
$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms
2 web1 (10.1.2.5) 8.039 ms 8.348 ms 8.643 ms
Routing Issues
Traceroute with asterisks
$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms
2 * * *
3 * * *
Last IP to respond, first IP to check
In this case, check 10.1.1.1.
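"Last IP to respond" can be pulled out of the output mechanically. A rough sketch, assuming the default traceroute line format shown above (hop, name, IP in parentheses); last_responding_hop is a hypothetical helper name:

```shell
# Print the IP of the last hop that answered, from traceroute output on stdin.
# Assumes the default "hop name (ip) time ms ..." line format (no -n flag).
last_responding_hop() {
    awk '$1 ~ /^[0-9]+$/ && $2 != "*" {
             ip = $3
             gsub(/[()]/, "", ip)   # strip the parentheses around the IP
             last = ip
         }
         END { if (last != "") print last }'
}

# Example: traceroute 10.1.2.5 | last_responding_hop
```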
Routing Issues
Traceroute timeouts
$ traceroute 10.1.2.5
traceroute to 10.1.2.5 (10.1.2.5), 30 hops max, 40 byte packets
1 10.1.1.1 (10.1.1.1) 5.432 ms 5.206 ms 5.472 ms
2 10.1.1.1 (10.1.1.1) 3006.477 ms !H 3006.779 ms !H 3007.072 ms
Probes timed out at the gateway (10.1.1.1), which reports the host unreachable (!H)
Host likely down
Possibly unreachable even from its own subnet
Or the network admin blocked ICMP (grrr!)
If so, use tcptraceroute instead (yay).
Host is Down
- Test whether remote port is open
- For DNS, port 53, for web, port 80:
$ nmap -p 80 10.1.2.5
Starting Nmap 4.62 ( http://nmap.org ) at 2009-02-05 18:49 PST
Interesting ports on web1 (10.1.2.5):
PORT STATE SERVICE
80/tcp filtered http
If open, probably not a networking problem
If closed, either service down (fix the server) or firewalled off
If filtered, firewall or router dropping packets.
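The three outcomes above map onto the STATE column of the nmap output. A small sketch that turns that column into the next step; port_verdict is a hypothetical name, and the wording of each verdict paraphrases this slide:

```shell
# Read nmap output on stdin and print the next step for the given port number.
port_verdict() {
    local port="$1"
    awk -v p="$port" '$1 ~ ("^" p "/") {
             found = 1
             if ($2 == "open")          print "open: probably not a networking problem"
             else if ($2 == "closed")   print "closed: service down or firewalled off"
             else if ($2 == "filtered") print "filtered: firewall or router dropping packets"
             else                       print $2 ": state not covered here"
         }
         END { if (!found) print "port not found in output" }'
}

# Example: nmap -p 80 10.1.2.5 | port_verdict 80
```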
No Servers Could Be Reached
$ nslookup web1
;; connection timed out; no servers could be reached
Possible causes:
- No name servers configured for your host
- Name servers configured, but inaccessible
Name servers defined in resolv.conf:
search example.net
nameserver 10.1.1.3
If no name servers configured, add one!
If name server listed by its hostname, not IP... think about that for a second.
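Both causes can be checked straight from the file. A minimal sketch, assuming the standard resolv.conf layout shown above; check_nameservers is a hypothetical helper, and the IP test is deliberately loose (dotted quad, or anything containing a colon for IPv6):

```shell
# List the nameserver entries in a resolv.conf-style file and flag any
# entry that is a hostname rather than an IP address.
check_nameservers() {
    local file="${1:-/etc/resolv.conf}"
    awk '$1 == "nameserver" {
             n++
             if ($2 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ || $2 ~ /:/) print "ok " $2
             else print "not-an-ip " $2
         }
         END { if (n == 0) print "no nameservers configured" }' "$file"
}

# Example: check_nameservers /etc/resolv.conf
```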
No Servers Could Be Reached
Name servers configured, but inaccessible
- If name servers configured, but can't be reached, test connection to them
- Test all configured DNS servers
- Start with ping
- No ping, same subnet: DNS server could be down
- No ping, different subnet: test route to DNS server
- Ping works, DNS server not responding, test remote DNS port:
$ nmap -p 53 10.1.1.3
Starting Nmap 4.62 ( http://nmap.org ) at 2009-02-05 18:49 PST
Interesting ports on ns1 (10.1.1.3):
PORT STATE SERVICE
53/tcp open domain
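Note that nmap -p 53 as run above probes TCP only, while most lookups travel over UDP, so it is worth asking each configured server a real question over both transports. A sketch under the assumption that dig is installed and exits non-zero when no server replies; probe_nameservers is a hypothetical name:

```shell
# Query every nameserver listed in a resolv.conf-style file directly,
# once over UDP and once over TCP, and report which combinations answer.
probe_nameservers() {
    local name="$1" file="${2:-/etc/resolv.conf}" ns proto opt
    for ns in $(awk '$1 == "nameserver" { print $2 }' "$file"); do
        for proto in udp tcp; do
            if [ "$proto" = tcp ]; then opt=+tcp; else opt=+notcp; fi
            # +time and +tries keep a dead server from stalling the loop
            if dig "@$ns" "$name" "$opt" +time=2 +tries=1 >/dev/null 2>&1; then
                echo "$ns $proto ok"
            else
                echo "$ns $proto FAILED"
            fi
        done
    done
}

# Example: probe_nameservers web1.example.net
```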
Server can't find host: NXDOMAIN
$ nslookup web1
Server: 10.1.1.3
Address: 10.1.1.3#53
** server can't find web1: NXDOMAIN
DNS server works, but can't find web1
Could be a DNS search path issue. Check /etc/resolv.conf:
search example.net
nameserver 10.1.1.3
- web1.example.net works, web1.dev.example.net doesn't
- Solution: use FQDN or add search path
If FQDN doesn't resolve, likely DNS server config problem
- If authoritative for domain, check zone config
- If recursive DNS server, confirm recursion enabled
- Test other domains.
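To see why web1 resolves but a host under dev.example.net does not, it helps to list the names the search path produces. A simplified sketch: it only expands the search line and ignores resolver details such as the ndots option; search_candidates is a hypothetical helper:

```shell
# Print the fully qualified names that the "search" line of a
# resolv.conf-style file produces for a short hostname.
search_candidates() {
    local host="$1" file="${2:-/etc/resolv.conf}"
    awk -v h="$host" '$1 == "search" {
             for (i = 2; i <= NF; i++) print h "." $i
         }' "$file"
}

# Example: search_candidates web1
```

If the name you expect is not in the list, use the FQDN or extend the search line.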
DNS Recursion Problems
- Our DNS server performs queries on our behalf
- Known as recursive queries
- Many DNS servers restrict who can perform recursive queries
- If recursion is disabled:
$ nslookup www.example.net 10.1.1.4
Server: 10.1.1.4
Address: 10.1.1.4#53
** server can't find www.example.net: REFUSED
Check DNS server settings for recursion or allow-recursion option. In BIND:
options {
allow-recursion { 10.1.1/24; };
...
};
DNS Recursion Problems Continued
Recursion can also be limited to a named ACL:
acl "internal" { 127.0.0.1; 192.168.0.0/24; 10.1.0.0/16; };
options {
allow-recursion { "internal"; };
...
};
Or it might be disabled:
options {
recursion no;
...
};
If recursion is disabled, the server only answers for zones it is authoritative for.
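From the client side, dig shows whether a server offers recursion through the ra (recursion available) flag in the response header. A small sketch that checks for it; recursion_available is a hypothetical name, and it relies on the ";; flags:" header line dig prints by default:

```shell
# Succeed if dig output on stdin carries the "ra" flag in its header,
# meaning the queried server is willing to recurse for this client.
recursion_available() {
    grep -Eq '^;; flags:[^;]* ra[; ]'
}

# Example:
#   if dig @10.1.1.4 www.example.net | recursion_available; then
#       echo "recursion available"
#   else
#       echo "recursion refused or disabled"
#   fi
```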
Updates Don't Take
- Symptoms:
- DNS works, but reports bad (or old) record
- You changed a record, but it doesn't seem to take
- DNS is right part of the time
- Some people see the update, others don't
- Three main causes:
- DNS caching issues
- Zone syntax issues
- Zone transfer issues.
DNS Caching and TTL
- TTL = Time To Live, how long to cache
- TTL = 1 day, may take 1 day for changes to propagate
- Lower high TTLs a few days before important changes
- Even then, some ISPs don't honor low TTLs
- Check TTL with dig:
$ dig web1.example.net
. . .
;; QUESTION SECTION:
;web1.example.net. IN A
;; ANSWER SECTION:
web1.example.net. 300 IN A 10.1.2.5
TTL = 300 seconds.
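The TTL is the second column of each answer line, so it is easy to extract. A minimal sketch, assuming the dig answer format shown above; answer_ttls is a hypothetical helper:

```shell
# Print "name ttl" for every A record in dig output read on stdin.
# Lines starting with ";" (comments and the question section) are skipped.
answer_ttls() {
    awk '$1 !~ /^;/ && $3 == "IN" && $4 == "A" { print $1, $2 }'
}

# Example: dig web1.example.net | answer_ttls
```

When the answer comes from a caching resolver, the number counts down toward zero between queries; ask an authoritative server directly to see the configured value.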
DNS Caching and TTL
Identify Caching Issues
- First search for official nameservers:
$ dig example.net NS
. . .
;; ANSWER SECTION:
example.net. 300 IN NS ns1.example.net.
example.net. 300 IN NS ns2.example.net.
;; ADDITIONAL SECTION:
ns1.example.net. 300 IN A 10.1.1.3
ns2.example.net. 300 IN A 10.1.1.4
Then query the nameservers directly:
$ dig web1.example.net @10.1.1.4
All have new IP? Caching issue. Flush local caches
Some NS with old IP? Possible zone transfer issue
No NS with new IP? Possible zone syntax error.
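The three-way decision above can be written down directly. A sketch that takes one "nameserver answer" pair per line and the IP you expect after the change; classify_ns_answers is a hypothetical name and 10.1.2.6 in the example is a made-up new address:

```shell
# Read "nameserver ip" pairs on stdin and compare each answer with the
# expected new IP, then print which of the three causes is most likely.
classify_ns_answers() {
    local new_ip="$1"
    awk -v want="$new_ip" '
        NF >= 1 { total++; if ($2 == want) fresh++ }
        END {
            if (total == 0)          print "no answers to compare"
            else if (fresh == total) print "all NS have the new IP: caching issue"
            else if (fresh > 0)      print "some NS have the old IP: possible zone transfer issue"
            else                     print "no NS has the new IP: possible zone syntax error"
        }'
}

# Example (assumes dig is installed):
#   for ns in 10.1.1.3 10.1.1.4; do
#       echo "$ns $(dig +short web1.example.net "@$ns" | head -n 1)"
#   done | classify_ns_answers 10.1.2.6
```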
Zone Syntax Errors
- Made a change to a zone, no NS show the change
- When a zone has a syntax error, BIND disregards the new file and keeps serving the old zone
- The only hint of failure is in the system logs:
Mar 27 21:07:26 ns1 named[25967]: /etc/bind/db.example.net:20: #ns2.example.net: bad owner name (check-names)
Mar 27 21:07:26 ns1 named[25967]: zone example.net/IN: loading from master file /etc/bind/db.example.net failed: bad owner name (check-names)
Mar 27 21:07:26 ns1 named[25967]: zone example.net/IN: not loaded due to errors.
In this case, the zone file used # for comments instead of ;
Fix the syntax error, reload BIND, check log for errors.
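Syntax errors can be caught before the reload instead of after. A minimal sketch built on BIND's named-checkzone and rndc tools; safe_reload and the CHECKZONE and RNDC override variables are hypothetical, added here only so the commands can be swapped out:

```shell
# Validate a zone file and reload the zone only if validation passes.
# CHECKZONE and RNDC may be set to override the commands that are run.
safe_reload() {
    local zone="$1" file="$2"
    if "${CHECKZONE:-named-checkzone}" "$zone" "$file"; then
        "${RNDC:-rndc}" reload "$zone"
    else
        echo "zone $zone failed validation; not reloading" >&2
        return 1
    fi
}

# Example: safe_reload example.net /etc/bind/db.example.net
```

Still check the log afterwards: a clean syntax check does not rule out a forgotten serial bump.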
Zone Transfer Issues
- For ease of updates, generally one NS acts as master, rest slaves
- Changes performed on master, slaves notified of change
- If serial number for zone on master is larger, slaves pull update
- To test, perform direct query against all NS for your change
- If some don't have the change, identify the master, confirm it has the change
- Master DNS server should be listed in SOA record:
$ dig example.net SOA
. . .
;; ANSWER SECTION:
example.net. 300 IN SOA ns1.example.net. admin.example.net. 2011062300 10800 2000 604800 7200
If it has the change, but no slaves do, investigate zone transfer issue.
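The SOA answer carries both the master's name and the zone serial, which makes it a quick way to compare servers. A sketch assuming the dig answer format shown above; soa_master_serial is a hypothetical helper:

```shell
# Print "master serial" from the SOA record in dig output read on stdin.
# In an SOA answer line, field 5 is the primary server and field 7 the serial.
soa_master_serial() {
    awk '$1 !~ /^;/ && $4 == "SOA" { print $5, $7 }'
}

# Example: compare the serial each nameserver is serving
#   for ns in 10.1.1.3 10.1.1.4; do
#       echo "$ns: $(dig example.net SOA "@$ns" | soa_master_serial)"
#   done
```

A server reporting a lower serial than the master has not picked up the change yet.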
Zone Transfer Issues
Troubleshooting the Master
- First log in to the master and check the system logs for the zone transfer:
Mar 27 21:47:16 ns1 named[25967]: zone example.net/IN: loaded serial 2012032700
Mar 27 21:47:16 ns1 named[25967]: zone example.net/IN: sending notifies (serial 2012032700)
No logs? Check that BIND is running:
$ ps -ef | grep '[n]amed'
Confirm this host is configured as a master
Check that the serial number was updated; if it was not, an error is logged:
Mar 27 21:09:52 ns1 named[25967]: zone example.net/IN: zone serial (2012011301) unchanged. zone may fail to transfer to slaves.
Mar 27 21:09:52 ns1 named[25967]: zone example.net/IN: loaded serial 2012011301
Mar 27 21:09:52 ns1 named[25967]: zone example.net/IN: sending notifies (serial 2012011301)
Zone Transfer Issues
Troubleshooting the Slaves from the Master
- For each slave you should see a zone transfer logged on the master:
Mar 27 21:47:16 ns1 named[25967]: client 10.1.1.4#38239: transfer of 'example.net/IN': AXFR-style IXFR started
Mar 27 21:47:16 ns1 named[25967]: client 10.1.1.4#38239: transfer of 'example.net/IN': AXFR-style IXFR ended
For all slaves:
- Confirm slave is listed as NS for zone
- Or is listed for zone as also-notify
If all this looks right, move troubleshooting to slaves.
Zone Transfer Issues
Troubleshooting on the Slaves
- A successful zone transfer shows up in the logs:
Mar 27 21:58:44 ns2 named[22774]: client 10.1.1.3#50946: view external: received notify for zone 'example.net'
Mar 27 21:58:44 ns2 named[22774]: zone example.net/IN/external: Transfer started.
Mar 27 21:58:44 ns2 named[22774]: transfer of 'example.net/IN' from 10.1.1.3#53: connected using 10.1.1.4#38239
Mar 27 21:58:44 ns2 named[22774]: zone example.net/IN/external: transferred serial 2012032700
Mar 27 21:58:44 ns2 named[22774]: transfer of 'example.net/IN' from 10.1.1.3#53: end of transfer
If you see this:
Mar 27 21:58:45 ns2 named[22774]: zone example.net/IN/external: refused notify from non-master: 10.1.1.7#35615
Confirm slave is configured as slave
Confirm slave has correct IP for master.
Zone Transfer Issues
Troubleshooting on the Slaves Cont.
- Check for serial number problems on the slave:
Mar 27 22:09:00 ns2 named[22774]: client 10.1.1.3#42895: view external: received notify for zone 'example.net'
Mar 27 22:09:00 ns2 named[22774]: zone example.net/IN/external: notify from 10.1.1.3#42895: zone is up to date
Common when using datestamps for serial
Admin probably typoed the serial
Find the zone's cached data on slave, confirm serial number is lower
If serial is higher, delete the cached zone and restart BIND
Slave will notice missing zone and request a zone transfer.
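The serial check itself is a plain number comparison, which is why one typoed datestamp can block every later update. A simplified sketch: real servers use wrap-around serial arithmetic, which this ignores, and will_transfer and the example serials are made up for illustration:

```shell
# Compare the master's serial with the serial cached on a slave and say
# whether the slave would pull the zone.
will_transfer() {
    local master="$1" slave="$2"
    if [ "$master" -gt "$slave" ]; then
        echo "master serial is larger: slave will pull the update"
    else
        echo "slave serial is not lower: slave considers the zone up to date"
    fi
}

# Example: a datestamp serial typoed with next year's date on an earlier change
#   will_transfer 2012032800 2013032700
```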
Questions?