Linux Localhost Troubleshooting

Kyle Rankin

Systems Architect

QuinStreet Inc.

Author of The Official Ubuntu Server Book, Ubuntu Hacks, and Knoppix Hacks

http://greenfly.org/talks/misc/troubleshooting1.html

Agenda

General Troubleshooting Philosophies
Localhost Troubleshooting

Diagnose high load
Load averages
Diagnose CPU-bound load
Diagnose out-of-memory issues
Diagnose high I/O issues
Diagnose out of disk space

General Troubleshooting Philosophies

Troubleshooting is a skill
Not everyone naturally has this skill
Everyone can be a better troubleshooter
You also want to be a faster troubleshooter
Here are some general troubleshooting rules...

General Troubleshooting Philosophies

Divide the problem space

Pick a number between 1 and 100...
With each test, try to rule out classes of problems
Divide the problem between people
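
The "pick a number between 1 and 100" game is a binary search: each guess rules out half of what is left. A minimal sketch of that idea follows; the function name and the 1 to 100 range are illustrative, not from the talk.

```python
def guesses_needed(secret, low=1, high=100):
    """Count the guesses a halving strategy needs to find `secret`."""
    guesses = 0
    while low <= high:
        mid = (low + high) // 2
        guesses += 1
        if mid == secret:
            return guesses
        if mid < secret:
            low = mid + 1    # rule out the lower half
        else:
            high = mid - 1   # rule out the upper half
    raise ValueError("secret is outside the search range")


# Worst case over every possible secret number in the range.
worst_case = max(guesses_needed(n) for n in range(1, 101))
```

Here worst_case stays in the single digits even though there are 100 candidates, which is the payoff of tests that each rule out a whole class of problems.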

General Troubleshooting Philosophies

Favor quick, simple tests over slow, complex tests

Especially when downtime is measured in $$$
Is it plugged in?

General Troubleshooting Philosophies

Favor past solutions

Most problems happen more than once
You'll often see the same symptoms
If it walks like a duck, and quacks like a duck...
Sometimes it's not a duck
Still test hypotheses

General Troubleshooting Philosophies

Good communication is critical when collaborating

Chat rooms better than conference calls
Yelling over the cubicle better than conference calls
Email probably better than conference calls
Most things better than conference calls

General Troubleshooting Philosophies

Understand how systems work

Everyone blames the technology they understand least
Understand TCP/IP, DNS, Linux processes, programming, and memory management
Good things to know anyway
Helps avoid wild goose chases

General Troubleshooting Philosophies

Document your problems and solutions

Many places call this a postmortem
Repeat: most problems happen more than once
In a team setting, makes everyone a better problem solver

General Troubleshooting Philosophies

What Changed?

Many problems caused by changes
More likely in stable, consistent systems
Develop ability to track system changes, roll back
Try to change one thing at a time
Changes sometimes red herring

General Troubleshooting Philosophies

Use the Internet, but carefully

You probably aren't the first in the world to have a particular error message
Must have a good understanding of the problem first
Google search "server not on network" won't help you
Even if you are feeling lucky

General Troubleshooting Philosophies

Resist rebooting

This isn't Windows 95
Yes, rebooting does fix problems, but...
You may never isolate the cause
Rebooting is the last resort

Localhost Troubleshooting

Sometimes hard to diagnose
Network and local problems often have similar symptoms
Most common local issue: host sluggish or unresponsive
Primary local resources: CPU, RAM, disk I/O
When you overuse any, the system acts certain ways

System Load

Fundamental metric for local troubleshooting
First command I run on a sluggish system is uptime:

$ uptime
13:35:03 up 103 days, 8 min, 5 users, load average: 2.03, 20.17, 15.09

2.03, 20.17, 15.09 are my 1, 5, and 15 minute load averages
What is a load average?
How to interpret it
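
One way to read those three numbers, sketched in Python: pull them out of the uptime line, compare the 1 and 5 minute values to see whether load is rising or falling, and divide by the CPU count. The per-CPU division is a common rule of thumb rather than something the slide states, and all function names here are illustrative.

```python
import re


def parse_load_averages(uptime_line):
    """Return the (1, 5, 15) minute load averages from an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def load_trend(one, five, fifteen):
    """Compare the 1 minute average with the 5 minute average."""
    if one > five:
        return "rising"
    if one < five:
        return "falling"
    return "steady"


def per_cpu_load(load, cpu_count):
    """Scale a load average by the number of CPUs (rule of thumb)."""
    return load / cpu_count
```

For the uptime line above, the 1 minute value is far below the 5 minute value, so the spike is already subsiding. On a live host, os.getloadavg() and os.cpu_count() supply the same inputs without parsing.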

Top

OK my load is high. WHY?
Next tool: top

Tons of load information in a few lines:

top - 14:08:25 up 38 days,  8:02,  1 user,  load average: 1.70, 1.77, 1.68
Tasks: 107 total,   3 running, 104 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   1024176k total,   997408k used,    26768k free,    85520k buffers
Swap:  1004052k total,     4360k used,   999692k free,   286040k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9463 mysql     16   0  686m 111m 3328 S   53  5.5 569:17.64 mysqld
18749 nagios    16   0  140m 134m 1868 S   12  6.6   1345:01 nagios2db_status
24636 nagios    17   0 34660  10m  712 S    8  0.5   1195:15 nagios
22442 nagios    24   0  6048 2024 1452 S    8  0.1   0:00.04 check_time.pl

CPU-bound Load

First question: is the load CPU-bound?
To check, look at the CPU line:

Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st

us: user CPU time
sy: system CPU time
ni: nice CPU time
id: CPU idle time (high is good)
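wa: I/O wait (important)
hi: hardware interrupt time
si: software interrupt time
st: steal time (CPU taken by the hypervisor for other virtual machines)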
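
A rough sketch of how the CPU line could be turned into a first guess at the kind of load. It targets the older top format shown on this slide (newer versions print the fields differently), and the 80% and 20% thresholds are arbitrary assumptions for illustration, not values from the talk.

```python
import re


def parse_cpu_line(line):
    """Map each field name on top's Cpu(s) line to its percentage."""
    pairs = re.findall(r"(\d*\.?\d+)%\s*(\w+)", line)
    return {name: float(value) for value, name in pairs}


def classify_load(cpu, busy_threshold=80.0, iowait_threshold=20.0):
    """Guess whether load is I/O-bound, CPU-bound, or neither.

    Both thresholds are illustrative assumptions; tune them per host.
    """
    if cpu.get("wa", 0.0) >= iowait_threshold:
        return "io-bound"
    busy = cpu.get("us", 0.0) + cpu.get("sy", 0.0) + cpu.get("ni", 0.0)
    if busy >= busy_threshold:
        return "cpu-bound"
    return "neither"
```

With the line above, idle time is 58.3% and I/O wait is under 1%, so under these assumed thresholds the host is neither CPU-bound nor I/O-bound.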

Out of Memory Issues

Swap death is bad
To check, look at Mem and Swap lines:
Mem:   1024176k total,   997408k used,    26768k free,    85520k buffers
Swap:  1004052k total,     4360k used,   999692k free,   286040k cached
Mem used and free can be misleading!
Always check cached first, then swap used
Real RAM used ~= used - cached + swap used
If you are out of RAM, hit M to sort top processes by RAM use:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18749 nagios    16   0  140m 134m 1868 S   12  6.6   1345:01 nagios2db_status
 9463 mysql     16   0  686m 111m 3328 S   53  5.5 569:17.64 mysqld
24636 nagios    17   0 34660  10m  712 S    8  0.5   1195:15 nagios
22442 nagios    24   0  6048 2024 1452 S    8  0.1   0:00.04 check_time.pl
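
The rule of thumb "used - cached + swap used" can be worked through on the Mem and Swap lines above. This is a minimal sketch that assumes the older top layout from the slide, where cached appears on the Swap line; the function names are illustrative.

```python
import re


def parse_kb_fields(line):
    """Map field names such as 'used' or 'cached' to their size in KB."""
    pairs = re.findall(r"(\d+)k\s+(\w+)", line)
    return {name: int(value) for value, name in pairs}


def real_ram_used_kb(mem_line, swap_line):
    """Apply the slide's estimate: used - cached + swap used."""
    mem = parse_kb_fields(mem_line)
    swap = parse_kb_fields(swap_line)
    # In this top layout, 'cached' is printed at the end of the Swap line.
    return mem["used"] - swap["cached"] + swap["used"]
```

For the sample lines this gives 997408 - 286040 + 4360 = 715728 KB, roughly 70% of the 1 GB installed, even though the Mem line alone suggests only 26 MB is free.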

Out of Memory Killer

Remember that psdoom game? It's like that
In an OOM condition, the OOM killer kills processes to free RAM
Sometimes even the right processes
Often I've had to reboot to fix the damage (if I could even log in)
OOM killer evidence in system logs:

1228419127.32453_1704.hostname:2,S:Out of Memory: Killed process    21389 (java).
1228419127.32453_1710.hostname:2,S:Out of Memory: Killed process    21389 (java).
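
A small sketch for pulling OOM killer victims out of log lines like the two above. The pattern is an assumption based on that sample message; the exact wording varies between kernel versions, so treat it as a starting point.

```python
import re

# Assumed message shape, taken from the sample log lines on this slide.
OOM_PATTERN = re.compile(
    r"Out of memory: Killed process\s+(\d+)\s+\(([^)]+)\)", re.IGNORECASE
)


def find_oom_kills(lines):
    """Return a (pid, process name) pair for every OOM kill message found."""
    kills = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

Passing an open log file (for example `find_oom_kills(open(path))`, with a path suited to the distribution) works because the function only iterates over lines.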

Troubleshoot High I/O Wait

Can be tricky to track down
Usually Oracle's fault (kidding, DBAs!)
Check for swapping first
Use iostat to get disk I/O diagnostics
On Ubuntu/Debian install sysstat package

Troubleshoot High I/O Wait

iostat

iostat with no arguments gives good overall view:

Linux 2.6.24-19-server (hostname) 	01/31/2009

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.73    0.07    2.03    0.53    0.00   91.64

Device:            tps  Blk_read/s  Blk_wrtn/s   Blk_read   Blk_wrtn
sda               9.82       417.96        27.53   30227262    1990625
sda1              6.55       219.10         7.12   15845129     515216
sda2              0.04         0.74         3.31      53506     239328
sda3              3.24       198.12        17.09   14328323    1236081

tps = transactions per second
Blk_read/s = blocks read per second
Blk_wrtn/s = blocks written per second
Blk_read = total blocks read
Blk_wrtn = total blocks written
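
To see which device carries most of the traffic, the device table can be parsed and ranked by blocks read plus blocks written per second. This sketch assumes the column layout shown above (other sysstat versions label the columns differently), and the function names are illustrative.

```python
def parse_iostat_devices(text):
    """Return {device: rates} from the device table of plain iostat output."""
    devices = {}
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            in_table = False          # a blank line ends the table
            continue
        if fields[0] in ("Device:", "Device"):
            in_table = True           # header row; data rows follow
            continue
        if in_table and len(fields) >= 6:
            devices[fields[0]] = {
                "tps": float(fields[1]),
                "read_per_s": float(fields[2]),
                "write_per_s": float(fields[3]),
            }
    return devices


def busiest_devices(devices):
    """Order device names by combined read and write rate, busiest first."""
    def combined_rate(name):
        return devices[name]["read_per_s"] + devices[name]["write_per_s"]
    return sorted(devices, key=combined_rate, reverse=True)
```

For the sample table the whole disk sda comes first, followed by sda1 and sda3, which points at the root and /home partitions rather than sda2.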

Out of Disk Space Issues

Common problem
Hopefully server monitoring catches it
Disk always fills when you sleep
Start diagnosis with df:

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  541M  7.0G   8% /
varrun                189M   40K  189M   1% /var/run
varlock               189M     0  189M   0% /var/lock
udev                  189M   44K  189M   1% /dev
devshm                189M     0  189M   0% /dev/shm
/dev/sda3              20G   15G  5.9G  71% /home

Identify full disk, then use du to find what's causing it
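
A sketch of the "identify the full disk" step: scan df output and report the mount points at or above a chosen Use% threshold. The threshold and function name are illustrative. It assumes each file system fits on one line, which `df -P` guarantees.

```python
def filesystems_over(df_output, threshold_percent):
    """Return (mount point, Use%) pairs at or above the threshold."""
    flagged = []
    for line in df_output.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue                          # not a complete data row
        use_percent = int(fields[4].rstrip("%"))
        if use_percent >= threshold_percent:
            flagged.append((fields[5], use_percent))
    return flagged
```

With a threshold of 70, only /home (71%) from the sample output is flagged.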

Out of Disk Space Issues

du

cd to the root of the mount point (say / or /home)
Run "the duck command" (-c prints a grand total, -k reports 1K blocks, -x stays on one file system):

$ sudo du -ckx | sort -n > /tmp/duck-root

Output looks like:

67872  /lib/modules/2.6.24-19-server
67876  /lib/modules
69092  /var/cache/apt
69448  /var/cache
76924  /usr/share
82832  /lib
124164 /usr
404168 /
404168 total
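
Because the output is sorted ascending, the biggest directories are at the bottom of the file. A small sketch for listing the largest entries from the saved file, skipping the grand total line; the function name is illustrative.

```python
def largest_entries(duck_lines, count=5):
    """Return the `count` largest (size in KB, path) pairs, largest first."""
    entries = []
    for line in duck_lines:
        parts = line.split(None, 1)       # size, then the rest as the path
        if len(parts) != 2:
            continue
        size_kb, path = int(parts[0]), parts[1].strip()
        if path == "total":
            continue                      # grand total row added by du -c
        entries.append((size_kb, path))
    entries.sort(reverse=True)
    return entries[:count]
```

Calling `largest_entries(open("/tmp/duck-root"), 3)` on the sample output would list /, /usr, and /lib. A parent directory always includes its children, so read down the list until the sizes stop being explained by a subdirectory.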

Out of Disk Space Issues

How to Solve

Compress logs
Clear your package cache
The dreaded vim full /tmp issue...
Get a bigger disk

Out of Disk Space Issues

Out of Inodes

File system says it's full, df disagrees
Could be out of inodes
ext3 has a fixed inode limit, set at mkfs time
Use df -i to check:

$ df -i
Filesystem   Inodes   IUsed   IFree IUse% Mounted on
/dev/sda     520192   17539  502653    4% /

If you run out... delete some files
Or back up and reformat. Seriously.
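
The same check can be made from a script with os.statvfs, which reports total and free inode counts for a mounted file system. This is a sketch: the function names are illustrative, and rounding up is an assumption chosen because it reproduces the IUse% in the sample output.

```python
import math
import os


def inode_use_percent(total, free):
    """Percentage of inodes in use, rounded up; None if not reported."""
    if total == 0:
        return None   # some file systems do not report inode counts
    used = total - free
    return math.ceil(used * 100 / total)


def inode_usage(path):
    """Inode use percentage for the file system containing `path`."""
    stats = os.statvfs(path)
    return inode_use_percent(stats.f_files, stats.f_ffree)
```

With the sample numbers (520192 inodes, 502653 free) the result is 4, matching the IUse% column. `inode_usage("/")` runs the same calculation on a live host.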

TechSkills - 2010-08-24

Questions?