Linux Localhost Troubleshooting
Kyle Rankin
Systems Architect
QuinStreet Inc.
Agenda
- General Troubleshooting Philosophies
- Localhost Troubleshooting
- Diagnose high load
- Load averages
- Diagnose CPU-bound load
- Diagnose Out of memory issues
- Diagnose high I/O issues
- Diagnose out of disk space
General Troubleshooting Philosophies
- Troubleshooting is a skill
- Not everyone naturally has this skill
- Everyone can be a better troubleshooter
- You also want to be a faster troubleshooter
- Here are some general troubleshooting rules...
General Troubleshooting Philosophies
Divide the problem space
- Pick a number between 1 and 100...
- With each test, try to rule out classes of problems
- Divide the problem between people
General Troubleshooting Philosophies
Favor quick, simple tests over slow, complex tests
- Especially when downtime is measured in $$$
- Is it plugged in?
General Troubleshooting Philosophies
Favor past solutions
- Most problems happen more than once
- You'll often see the same symptoms
- If it walks like a duck, and quacks like a duck...
- Sometimes it's not a duck
- Still test hypotheses
General Troubleshooting Philosophies
Good communication is critical when collaborating
- Chat rooms better than conference calls
- Yelling over the cubicle better than conference calls
- Email probably better than conference calls
- Most things better than conference calls
General Troubleshooting Philosophies
Understand how systems work
- Everyone blames the technology they understand least
- Understand TCP/IP, DNS, Linux processes, programming, and memory management
- Good things to know anyway
- Helps avoid wild goose chases
General Troubleshooting Philosophies
Document your problems and solutions
- Many places call this a postmortem
- Repeat: most problems happen more than once
- In a team setting, makes everyone a better problem solver
General Troubleshooting Philosophies
What Changed?
- Many problems caused by changes
- More likely in stable, consistent systems
- Develop ability to track system changes, roll back
- Try to change one thing at a time
- Changes are sometimes a red herring
General Troubleshooting Philosophies
Use the Internet, but carefully
- You probably aren't the first in the world to have a particular error message
- Must have a good understanding of the problem first
- Google search "server not on network" won't help you
- Even if you are feeling lucky
General Troubleshooting Philosophies
Resist rebooting
- This isn't Windows 95
- Yes, rebooting does fix problems, but...
- You may never isolate the cause
- Rebooting is the last resort
Localhost Troubleshooting
- Sometimes hard to diagnose
- Network and local problems often have similar symptoms
- Most common local issue: host sluggish or unresponsive
- Primary local resources: CPU, RAM, disk I/O
- When you overuse any, the system acts certain ways
System Load
- Fundamental metric for local troubleshooting
- First command I run on a sluggish system is uptime:
$ uptime
13:35:03 up 103 days, 8 min, 5 users, load average: 2.03, 20.17, 15.09
- 2.03, 20.17, and 15.09 are my 1-, 5-, and 15-minute load averages
What is a load average?
How to interpret it
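On Linux, the load average counts processes that are running, runnable, or in uninterruptible sleep (usually waiting on disk I/O), averaged over the last 1, 5, and 15 minutes. A common rule of thumb is to read it relative to the number of CPUs. The sketch below shows one way to do that; the helper name `load_per_cpu` is hypothetical, and the "above roughly 1.00 per CPU means work is queuing" reading is a rule of thumb, not a hard limit.

```shell
# Hypothetical helper: divide a load average by the number of CPUs.
# A result above roughly 1.00 suggests more runnable work than CPUs.
load_per_cpu() {
    # $1 = load average (e.g. 2.03), $2 = CPU count (e.g. 4)
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Usage on a live Linux host (1-minute average from /proc/loadavg):
#   load_per_cpu "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

With the uptime output above, a 5-minute load of 20.17 on a 4-CPU host would come out near 5 per CPU, which is heavily overloaded; the same number on a 32-CPU host would be comfortable.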
Top
- OK, my load is high. WHY?
- Next tool: top
- Tons of load information in a few lines:
top - 14:08:25 up 38 days, 8:02, 1 user, load average: 1.70, 1.77, 1.68
Tasks: 107 total, 3 running, 104 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1024176k total, 997408k used, 26768k free, 85520k buffers
Swap: 1004052k total, 4360k used, 999692k free, 286040k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9463 mysql     16   0  686m 111m 3328 S   53  5.5 569:17.64 mysqld
18749 nagios    16   0  140m 134m 1868 S   12  6.6   1345:01 nagios2db_status
24636 nagios    17   0 34660  10m  712 S    8  0.5   1195:15 nagios
22442 nagios    24   0  6048 2024 1452 S    8  0.1   0:00.04 check_time.pl
CPU-bound Load
- First question: is the load CPU-bound?
- To check, look at the CPU line:
Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
us: user CPU time
sy: system CPU time
ni: nice CPU time
id: CPU idle time (high is good)
wa: I/O wait (important)
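As a rough guide, high us or sy with low id points at CPU-bound load, while high wa points at processes waiting on disk. The sketch below pulls a single field out of the Cpu(s) line so it can be checked in a script. The helper name `cpu_field` is hypothetical, and it assumes the older `11.4%us` layout shown here; newer versions of top print `11.4 us` instead and would need a different pattern.

```shell
# Hypothetical helper: pull one field (us, sy, id, wa, ...) out of a
# top "Cpu(s):" summary line in the older format shown on the slide.
cpu_field() {
    # $1 = field suffix such as "wa"; the Cpu(s) line arrives on stdin
    tr ',' '\n' | sed -n "s/^[^0-9.]*\([0-9.]*\)%$1\$/\1/p"
}

# Usage on a live host (batch mode, one iteration):
#   top -b -n 1 | grep 'Cpu(s)' | cpu_field wa
```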
Out of Memory Issues
- Swap death is bad
- To check, look at Mem and Swap lines:
Mem: 1024176k total, 997408k used, 26768k free, 85520k buffers
Swap: 1004052k total, 4360k used, 999692k free, 286040k cached
- Mem used and free can be misleading!
- Always check cached first, then swap used
- Real RAM used ~= used - buffers - cached + swap used
- If you are out of RAM, hit M to sort top processes by RAM use:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18749 nagios    16   0  140m 134m 1868 S   12  6.6   1345:01 nagios2db_status
 9463 mysql     16   0  686m 111m 3328 S   53  5.5 569:17.64 mysqld
24636 nagios    17   0 34660  10m  712 S    8  0.5   1195:15 nagios
22442 nagios    24   0  6048 2024 1452 S    8  0.1   0:00.04 check_time.pl
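The "real RAM used" estimate can be worked through with the numbers from the Mem and Swap lines. This is a minimal sketch with a hypothetical helper name, `real_ram_used`. It subtracts buffers as well as cached, on the assumption that the kernel can reclaim both when processes need the memory.

```shell
# Hypothetical helper: estimate RAM really in use, in kB.
# Assumes buffers and cached are both reclaimable, so both are subtracted.
real_ram_used() {
    # $1 = used, $2 = buffers, $3 = cached, $4 = swap used (all in kB)
    echo $(( $1 - $2 - $3 + $4 ))
}

# Figures from the Mem and Swap lines on the slide:
real_ram_used 997408 85520 286040 4360    # prints 630208
```

So although top reports only about 26 MB free, roughly 630 MB of the 1 GB is really committed, and the host is not out of RAM.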
Out of Memory Killer
- Remember that psdoom game? It's like that
- In an OOM condition, the OOM killer kills processes to free RAM
- Sometimes even the right processes
- Often I've had to reboot to fix the damage (if I could even log in)
- OOM killer evidence in system logs:
1228419127.32453_1704.hostname:2,S:Out of Memory: Killed process 21389 (java).
1228419127.32453_1710.hostname:2,S:Out of Memory: Killed process 21389 (java).
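To look for the same evidence on a live host, a case-insensitive search of the kernel log is usually enough. The sketch below is one way to do it; the helper name `oom_events` is hypothetical, and the log file path in the usage comment is an assumption that varies by distribution.

```shell
# Hypothetical helper: keep only lines that look like OOM-killer reports.
# The match is case-insensitive because the wording and capitalization
# vary between kernel versions.
oom_events() {
    grep -i -E 'out of memory|oom-killer'
}

# Usage on a live host (log locations vary by distribution):
#   dmesg | oom_events
#   oom_events < /var/log/syslog
```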
Troubleshoot High I/O Wait
- Can be tricky to track down
- Usually Oracle's fault (kidding, DBAs!)
- Check for swapping first
- Use iostat to get disk I/O diagnostics
- On Ubuntu/Debian install sysstat package
Troubleshoot High I/O Wait
iostat
- iostat with no arguments gives good overall view:
Linux 2.6.24-19-server (hostname) 01/31/2009
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.73    0.07    2.03    0.53    0.00   91.64
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               9.82       417.96        27.53   30227262    1990625
sda1              6.55       219.10         7.12   15845129     515216
sda2              0.04         0.74         3.31      53506     239328
sda3              3.24       198.12        17.09   14328323    1236081
tps = transactions per second
Blk_read/s = blocks read per second
Blk_wrtn/s = blocks written per second
Blk_read = total blocks read
Blk_wrtn = total blocks written
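When a host has many disks, it can help to let a script pick out the busiest one. The following is a sketch only: the helper name `busiest_device` is hypothetical, and it assumes the column layout shown above, with tps as the second column of the rows under the Device header.

```shell
# Hypothetical helper: print the device with the highest tps from
# iostat output on stdin. Assumes the layout on the slide, where tps
# is the second column of the rows below the "Device:" header.
busiest_device() {
    awk '
        /^Device/ { in_table = 1; next }       # rows after this are devices
        in_table && NF >= 2 && $2 + 0 > max { max = $2 + 0; dev = $1 }
        END { if (dev != "") print dev, max }
    '
}

# Usage on a live host:
#   iostat | busiest_device
```

Whole disks and their partitions both appear in the table, so the whole disk will normally win; the partition rows then show where on that disk the traffic is going.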
Out of Disk Space Issues
- Common problem
- Hopefully server monitoring catches it
- Disk always fills when you sleep
- Start diagnosis with df:
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  541M  7.0G   8% /
varrun                189M   40K  189M   1% /var/run
varlock               189M     0  189M   0% /var/lock
udev                  189M   44K  189M   1% /dev
devshm                189M     0  189M   0% /dev/shm
/dev/sda3              20G   15G  5.9G  71% /home
- Identify the full disk, then use du to find what's filling it
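Spotting the full filesystem can be scripted. The sketch below filters df output by a usage threshold; the helper name `full_filesystems` and the default of 90 percent are assumptions for illustration. It reads `df -P` output, which keeps each filesystem on a single line, and it will misreport mount points that contain spaces.

```shell
# Hypothetical helper: list mount points at or above a usage threshold.
# Expects "df -P" output on stdin; -P keeps each filesystem on one line.
full_filesystems() {
    # $1 = threshold in percent (defaults to 90)
    awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5; sub(/%/, "", use)      # "71%" -> "71"
            if (use + 0 >= limit + 0) print $6, $5
        }
    '
}

# Usage on a live host:
#   df -P | full_filesystems 70
```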
Out of Disk Space Issues
du
- cd to the root of the mount point (say / or /home)
- run "The duck command":
$ sudo du -ckx | sort -n > /tmp/duck-root
Output looks like:
67872 /lib/modules/2.6.24-19-server
67876 /lib/modules
69092 /var/cache/apt
69448 /var/cache
76924 /usr/share
82832 /lib
124164 /usr
404168 /
404168 total
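Because the list is sorted ascending, the biggest directories are at the bottom, so in practice only the tail matters. A small sketch of that step, with a hypothetical helper name `largest_entries`:

```shell
# Hypothetical helper: show the N largest entries from "du -k" output.
largest_entries() {
    # $1 = how many entries to show (defaults to 10); du output on stdin
    sort -n | tail -n "${1:-10}"
}

# Usage on a live host, from the root of the mount point:
#   sudo du -ckx | largest_entries 20
```

The -x flag keeps du on one filesystem, so directories that live on other mounts do not inflate the totals.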
Out of Disk Space Issues
How to Solve
- Compress logs
- Clear your package cache
- The dreaded vim full /tmp issue...
- Get a bigger disk
Out of Disk Space Issues
Out of Inodes
- File system says it's full, df disagrees
- Could be out of inodes
- ext3 has a fixed inode limit, set at mkfs time
- use df -i to check:
$ df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda              520192   17539  502653    4% /
If you run out... delete some files
Or back up and reformat. Seriously.
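Before deleting anything, it helps to know which directory holds all the small files. The sketch below counts regular files under each top-level directory of a starting point. The helper name `files_per_dir` is hypothetical, it assumes a find that supports -mindepth (GNU and BSD find do), and the count is only an approximation of inode use because directories and other file types consume inodes too.

```shell
# Hypothetical helper: count files under each top-level directory of
# a starting point, largest count first, staying on one filesystem.
files_per_dir() {
    # $1 = directory to scan (defaults to the current directory)
    ( cd "${1:-.}" && find . -mindepth 2 -xdev -type f ) |
        cut -d/ -f2 | sort | uniq -c | sort -rn
}

# Usage on a live host:
#   sudo sh -c '. ./helpers.sh; files_per_dir /var'   # helpers.sh is a placeholder
```

Run it again inside the directory that tops the list to narrow things down one level at a time.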