What to do when your Linux server crashes?

This question came up at the beginning of this year. I spent quite some time figuring out my problems and learned a lot about my machine along the way. I hope the following text helps other people with similar problems, especially if they are new to debugging hardware and software.

At the beginning of this year my Linux box started crashing. I had unrecoverable hardware errors on my root file system and had to set up a new server. Setting up Debian is easy, so that part was no problem at all.

However, after installing a fresh Debian, my box still crashed regularly: first weekly, then daily.

My first suspicion was that some application was causing my machine to crash. I thought Apache or Tomcat could be the reason, but I saw nothing unusual in their logfiles.

Then I started reading the system logfiles, looking for anything unusual or for recurring entries that appeared before a crash. On a Debian machine the logfiles are located in /var/log/, and the directory looks something like this:

-rw-r--r--  1 root        root       0 2007-03-01 07:46 aptitude
-rw-r-----  1 root        adm  1049437 2007-04-19 11:57 auth.log
-rw-rw-r--  1 root        utmp       0 2007-04-02 07:51 btmp
-rw-r-----  1 root        adm   878545 2007-04-19 11:57 daemon.log
drwxr-xr-x  3 root        root    4096 2007-02-09 23:27 debian-installer
-rw-r-----  1 root        adm     4630 2007-04-16 21:36 debug
-rw-r--r--  1 root        root   11968 2007-04-16 21:36 dmesg
drwxr-s---  2 Debian-exim adm     4096 2007-04-19 07:50 exim4
-rw-r--r--  1 root        root     404 2007-04-13 15:13 fontconfig.log
-rw-r-----  1 root        adm    21893 2007-04-17 16:52 kern.log
drwxr-xr-x  2 root        root    8192 2007-04-19 07:50 ksymoops
-rw-rw-r--  1 root        utmp  292292 2007-04-19 11:57 lastlog
-rw-r--r--  1 root        root       0 2007-04-15 06:47 lp-acct
-rw-r--r--  1 root        root       0 2007-04-15 06:47 lp-errs
-rw-r-----  1 root        adm       47 2007-04-16 21:36 lpr.log
-rw-r--r--  1 root        root       0 2007-02-09 23:28 mail.err
-rw-r--r--  1 root        root       0 2007-02-09 23:28 mail.info
-rw-r--r--  1 root        root       0 2007-02-09 23:28 mail.log
-rw-r--r--  1 root        root       0 2007-02-09 23:28 mail.warn
-rw-r-----  1 root        adm    28233 2007-04-19 11:56 messages
drwxr-sr-x  2 news        news    4096 2007-02-09 23:28 news
-rw-r-----  1 root        adm    65936 2007-04-19 11:57 syslog
-rw-r-----  1 root        adm        0 2007-04-15 06:48 user.log
-rw-rw-r--  1 root        utmp  198144 2007-04-19 11:57 wtmp

However, looking at the system's logfiles did not help me either.
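
The kind of search described above can be partly automated. What follows is only a rough sketch: the function name scan_log and the keyword list are assumptions made for illustration, and the list is certainly not everything a failing machine might write to its logs.

```shell
# scan_log FILE
# Print every line of FILE (prefixed with its line number) that contains
# one of a few typical trouble keywords, ignoring upper/lower case.
# The keyword list is an assumption and incomplete; extend it as needed.
scan_log() {
    grep -n -i -E 'oops|panic|segfault|i/o error|machine check' "$1"
}
```

Run as root, something like scan_log /var/log/kern.log or scan_log /var/log/syslog goes through the kernel-related files from the listing above. An empty result only means that none of these keywords appeared, not that the machine is healthy.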

I had a couple of applications I suspected of causing my system to crash. Since the logfiles did not provide any hints, I thought of memory leaks. A cool tool to track memory usage and possible leaks is Valgrind. This tool lets you debug and profile programs under Linux.

You start valgrind with something like this:

valgrind --log-file=/log/valgrind.someapp.log --leak-check=yes ./bin/startup_someapp.sh

Valgrind writes a very detailed logfile of the memory operations of the program it monitors. The applications I suspected did not show any remarkable memory leaks, or at least that was my interpretation of the Valgrind logfiles. I concluded that this was not the problem.
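
Since these logfiles get long, it can help to pull out only the leak totals first. The sketch below assumes the log ends with Valgrind's usual LEAK SUMMARY block; the function name leak_summary is made up for this example.

```shell
# leak_summary FILE
# Print only the "... lost:" lines of the LEAK SUMMARY block from a
# Valgrind logfile, e.g. "definitely lost: 48 bytes in 3 blocks".
# Assumes the usual summary wording; other Valgrind versions may differ.
leak_summary() {
    grep -E '(definitely|indirectly|possibly) lost:' "$1"
}
```

Called as leak_summary /log/valgrind.someapp.log, this prints a handful of lines instead of the whole log. Keep in mind that when the program you start is only a wrapper script, Valgrind follows the processes started by that script only if you add --trace-children=yes.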

The next thing I thought of was a hardware problem. My machine is an ordinary desktop PC. A cool tool you can install to monitor your computer is Munin. Munin monitors your system's filesystem usage, inode usage, mail queue, MySQL, network traffic, Postfix, processes, hard disk temperature, CPU and RAM usage, and much more. A how-to for installing Munin on Debian can be found at http://www.debian-administration.org/articles/229.

Since I suspected hardware problems, I thought of the CPU and/or hard disk temperature. If you want to monitor your system's temperature, you need to install the corresponding tools: hddtemp for hard drives and lm-sensors for the CPU. Setting up hddtemp is easy; setting up lm-sensors required recompiling my kernel, so I skipped that. A nice tutorial on how to set this up can be found at http://www.debian-administration.org/articles/327.
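
For a quick check without any graphs, the reading from hddtemp can be compared against a limit. This is a minimal sketch: the function name check_temp, the device /dev/hda and the limit of 50 degrees are assumptions for illustration, not recommended values for your drive.

```shell
# check_temp VALUE LIMIT
# VALUE and LIMIT are whole numbers in degrees Celsius.
# Prints a warning and returns 1 if VALUE is above LIMIT,
# otherwise prints an OK line and returns 0.
check_temp() {
    if [ "$1" -gt "$2" ]; then
        echo "WARNING: ${1} C is above the limit of ${2} C"
        return 1
    fi
    echo "OK: ${1} C"
}
```

On a machine with hddtemp installed, a call could look like check_temp "$(hddtemp -n /dev/hda)" 50, where the -n option makes hddtemp print only the number. The sketch does not handle the case where hddtemp cannot read the drive and prints nothing.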

The following graphs were interesting for me.

  [Slideshow: Munin graphs]

In the graphs above you can see when my machine crashed: the crashes are the blank spots. As the graphs show, the hard disk temperature is pretty constant and not related to the crashes. The same goes for memory and CPU usage.
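
The blank spots can also be found without looking at the graphs, provided you have the times at which samples were recorded. Munin collects data every five minutes, so a gap clearly longer than 300 seconds means the machine was not running. The sketch below assumes one Unix timestamp per line in ascending order; the function name find_gaps and this input format are assumptions, and how you export the timestamps from your monitoring data is left open.

```shell
# find_gaps MAX_GAP_SECONDS
# Reads Unix timestamps (one per line, ascending) from standard input and
# prints "last_sample next_sample gap_in_seconds" for every pair of
# neighbouring samples that are more than MAX_GAP_SECONDS apart.
find_gaps() {
    awk -v max="$1" '
        NR > 1 && ($1 - prev) > max { print prev, $1, $1 - prev }
        { prev = $1 }
    '
}
```

With a limit of 600 seconds (two missed samples), a call such as find_gaps 600 < timestamps.txt prints one line per outage, which makes it easy to compare the crash times with the entries in the logfiles.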

Then I ran another tool called memtest86+. A screenshot (not from my server) is shown on the left. This tool comes with KNOPPIX, and you start it by typing the cheat code memtest at the KNOPPIX boot prompt. Memtest86+ runs indefinitely, cycling through several memory tests and repeating them. I ran it for a couple of hours and it found some errors (~100). So I concluded that my RAM was broken and replaced it. Running memtest86+ again with the new RAM yielded no errors.

I think the crashes were partly due to broken RAM. After replacing it, my machine does not crash as often any more. I suspect that the mainboard or the CPU also has some problems. But right now the server has a level of stability that is acceptable for me.

I hope this article helps people.


June 4th, 2007

Well, all of the above is nice, but it did not solve my problem. The computer kept crashing, even with new RAM. So I finally decided to replace the last remaining parts of my computer: the mainboard and/or the CPU. When I opened the computer, I saw that some of the capacitors on the mainboard were broken. This is pretty easy to see, because they look burst open or exploded. So I decided to replace only the mainboard. Luckily, the mainboard I use was still available (ASRock K7S41, don't laugh). I also replaced the power supply, because I had a cheap no-name unit. All I had to do was put the processor on the new mainboard, plug everything in, and fire up the computer. So far it looks fine; let's see how long it lasts.

OK, what are the key takeaways? My RAM broke and a hard disk broke. I read in some thread that this can be caused by an unstable power supply. Maybe this was the case with my server. If anyone asks me what computer to buy for a server, I would say: "Don't be cheap, invest some extra bucks in quality parts (+100€)".