What do you do when your Linux server crashes? This question came up at the beginning of this year. I spent quite some time tracking down my problems and learned a lot about my machine along the way. I hope the following text helps other people with similar problems, especially if they are new to debugging hardware and software.
At the beginning of this year my Linux box started crashing. I had unrecoverable hardware errors on my root file system and had to set up a new server. Since setting up Debian is easy, that was really no problem at all.
However, after installing a fresh Debian, my box still crashed on a regular basis, first weekly, then daily.
My first suspicion was that some application was causing the machine to crash. I thought Apache or Tomcat could be the reason, but I saw nothing unusual in their logfiles.
Then I started reading the system logfiles, looking for anything unusual or for recurring entries that appeared before a crash. On a Debian machine the logfiles are located in /var/log/, and the directory looks something like this:
-rw-r--r-- 1 root root 0 2007-03-01 07:46 aptitude
-rw-r----- 1 root adm 1049437 2007-04-19 11:57 auth.log
-rw-rw-r-- 1 root utmp 0 2007-04-02 07:51 btmp
-rw-r----- 1 root adm 878545 2007-04-19 11:57 daemon.log
drwxr-xr-x 3 root root 4096 2007-02-09 23:27 debian-installer
-rw-r----- 1 root adm 4630 2007-04-16 21:36 debug
-rw-r--r-- 1 root root 11968 2007-04-16 21:36 dmesg
drwxr-s--- 2 Debian-exim adm 4096 2007-04-19 07:50 exim4
-rw-r--r-- 1 root root 404 2007-04-13 15:13 fontconfig.log
-rw-r----- 1 root adm 21893 2007-04-17 16:52 kern.log
drwxr-xr-x 2 root root 8192 2007-04-19 07:50 ksymoops
-rw-rw-r-- 1 root utmp 292292 2007-04-19 11:57 lastlog
-rw-r--r-- 1 root root 0 2007-04-15 06:47 lp-acct
-rw-r--r-- 1 root root 0 2007-04-15 06:47 lp-errs
-rw-r----- 1 root adm 47 2007-04-16 21:36 lpr.log
-rw-r--r-- 1 root root 0 2007-02-09 23:28 mail.err
-rw-r--r-- 1 root root 0 2007-02-09 23:28 mail.info
-rw-r--r-- 1 root root 0 2007-02-09 23:28 mail.log
-rw-r--r-- 1 root root 0 2007-02-09 23:28 mail.warn
-rw-r----- 1 root adm 28233 2007-04-19 11:56 messages
drwxr-sr-x 2 news news 4096 2007-02-09 23:28 news
-rw-r----- 1 root adm 65936 2007-04-19 11:57 syslog
-rw-r----- 1 root adm 0 2007-04-15 06:48 user.log
-rw-rw-r-- 1 root utmp 198144 2007-04-19 11:57 wtmp
However, looking at the system's logfiles did not help me either.
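The search for recurring entries can be partly automated. The following is only a minimal sketch, assuming the traditional syslog line format ("Mon DD HH:MM:SS host program[pid]: message"). The function name recurring_messages is made up for this example. It strips the timestamp, host name and process ID, then counts how often each remaining message occurs.

```shell
#!/bin/sh
# Hypothetical helper: list the most frequent messages in a syslog-style file.
# Assumes lines start with "Mon DD HH:MM:SS host", i.e. four whitespace-separated fields.
recurring_messages() {
    # $1: path to a log file, e.g. /var/log/syslog
    # $2: how many messages to show (default 10)
    awk '{ $1=$2=$3=$4=""; sub(/^ +/, ""); print }' "$1" |
        sed 's/\[[0-9]*\]//' |   # drop the "[pid]" part so equal messages group together
        sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

Called as recurring_messages /var/log/syslog 20, it prints the twenty most common messages with their counts. A message that shows up suspiciously often is a good place to start reading the full log.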
I had a couple of applications I suspected of causing my system to crash. Since the logfiles did not provide any hints, I thought of memory leaks. A cool tool to track memory usage and possible leaks is Valgrind. This tool lets you debug and profile programs under Linux.
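You start Valgrind with something like this (recent versions spell the option --log-file; --trace-children=yes makes Valgrind follow the programs that a start script launches, and %p puts the process ID into each log file name):
valgrind --log-file=/log/valgrind.someapp.%p.log --leak-check=yes --trace-children=yes ./bin/startup_someapp.sh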
Valgrind writes a very detailed logfile of the memory operations of the program it monitors. The applications I suspected did not show any remarkable memory leaks, at least as far as I could interpret the Valgrind logfiles. So I figured this was not the problem.
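Since the Valgrind logfiles are long, it helps to pull out just the leak totals first. This is a small sketch, assuming the logs end with Memcheck's usual leak summary, which contains lines such as "definitely lost: 48 bytes in 3 blocks". The function name leak_summary is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: show only the "... lost:" lines of Valgrind log files.
# Assumes Memcheck's leak summary wording ("definitely lost:", "possibly lost:", ...).
leak_summary() {
    # $@: one or more Valgrind log files
    grep -E '(definitely|indirectly|possibly) lost:' "$@"
}
```

If the "definitely lost" figure stays small and does not grow with the runtime of the program, a memory leak is an unlikely cause of the crashes.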
The next thing I thought of was a hardware problem. The machine I have is an ordinary desktop PC. A cool tool you can install to monitor your computer is Munin. Munin monitors your system's filesystem usage, inode usage, mail queue, MySQL, network traffic, Postfix, processes, hard disc temperature, CPU and RAM usage, and much more. A how-to for installing Munin on Debian can be read at
http://www.debian-administration.org/articles/229.
Since I was suspicious of hardware problems, I thought of CPU and/or hard disc temperature. If you want to monitor your system's temperature, you need to install the corresponding tools: hddtemp for hard drives and lm-sensors for the CPU. Setting up hddtemp is easy. Setting up lm-sensors required re-compiling my kernel, so I skipped that. A nice tutorial on how to set this up can be read at
http://www.debian-administration.org/articles/327.
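For a quick check without any graphs, the temperature that hddtemp reports can be compared against a limit. The sketch below is only an illustration: the function name temp_status and the limit of 50 degrees are my own assumptions, not values recommended by any drive vendor.

```shell
#!/bin/sh
# Hypothetical helper: compare a temperature against a warning threshold.
temp_status() {
    # $1: temperature in whole degrees Celsius
    # $2: warning threshold in degrees Celsius
    if [ "$1" -ge "$2" ]; then
        echo "WARN: ${1}C is at or above ${2}C"
    else
        echo "OK: ${1}C"
    fi
}
```

With hddtemp installed, a call could look like temp_status "$(hddtemp -n /dev/hda)" 50, where -n makes hddtemp print only the number and /dev/hda stands for whatever your disc is called. hddtemp usually needs root rights to read the drive.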
The following graphs were interesting for me.

The blank spots in the graphs above show when my machine crashed. As you can see, the HDD temperature is pretty constant and not related to the crashes. The same goes for memory and CPU usage.
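The blank spots can also be found without looking at a graph: wherever regular measurements are missing, the machine was probably down. Here is a rough sketch, assuming you have a file with one Unix timestamp per line, oldest first, one line per measurement. That file format and the name find_gaps are made up for this example. Munin collects data every five minutes by default, so anything beyond ten minutes is treated as a gap here.

```shell
#!/bin/sh
# Hypothetical helper: report gaps between consecutive measurements.
find_gaps() {
    # $1: file with one epoch timestamp per line, oldest first (assumed format)
    # $2: longest gap in seconds that still counts as normal (default 600)
    awk -v max="${2:-600}" '
        NR > 1 && $1 - prev > max { print prev, $1, $1 - prev }
        { prev = $1 }
    ' "$1"
}
```

Each output line shows the last timestamp before the gap, the first one after it, and the length of the gap in seconds. These are the time windows worth checking in the system logfiles.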
Then I ran another tool called memtest86+. A screenshot (not from my server) is shown on the left. This tool comes with KNOPPIX, and you start it by typing the cheat code memtest when prompted by KNOPPIX at start-up. Memtest86+ runs indefinitely, repeating several memory tests over and over. I ran it for a couple of hours and found some errors (~100). So I figured my RAM was broken and replaced it. Running memtest86+ again with the new RAM yielded no errors.
I think the crashes my computer had were partly due to broken RAM. After replacing it, my machine does not crash that much any more. I suspect that the main board or the CPU also has some problems. But right now the server has a level of stability that is reasonable for me.
I hope this article helps people....
June 4th, 2007
Well, all of the above is nice, but it did not help me. The computer kept crashing, even with new RAM. So finally I decided to change the last parts of my computer: the mainboard and/or the CPU. When I opened the computer, I saw that some of the capacitors on the mainboard were broken. This is pretty easy to see, because they look burst open or exploded. So I decided to change only the mainboard. Luckily, the mainboard I use was still available (ASRock K7S41, don't laugh). I also changed the power supply, because I had a cheap no-name device. So all I did was put the processor on the new mainboard, plug everything in and fire up the computer. So far it looks fine, let's see how long it lasts.
OK, what are the key takeaways? My RAM broke and a hard disc broke. I read in some thread that this can be caused by an unsteady power supply. Maybe this was the case with my server. If anyone asks me what computer to buy for a server, I would say: "Don't be cheap, invest some extra bucks in quality parts (+100€)".