Wednesday, October 24, 2007

How to really test RAM, or My search for system stability

I recently bought a Mac because my previous system kept crashing. It would just beep and reboot without any visible cause. Since I still wanted to use the old system as my Linux firewall, I needed to find the culprit.

Over time I found out that:
- The reboots started intermittently after I installed new memory. However, Memtest86 reported no problems whatsoever.
- Over time the reboots occurred more and more often.
- The problem occurred more often during heavy disk activity, sometimes after only 2 minutes. I could not even finish a long Cygwin install session.
- According to SpeedFan, my processor ran quite hot (up to 65ºC). Putting a bit of heat sink paste between the CPU and the heat sink solved that problem.
- SpeedFan also reported high temperatures for my hard disk, reaching 50ºC and still rising when the system went down. Some searching taught me that this is high but acceptable. A full copy of the hard disk (attached as a USB drive) did not cause any problems.
- Replacing the power supply did not help.

When I moved everything to another motherboard, an interesting thing happened: once, just once out of many reboots, I got a memory failure. Got you!

I was finally able to pinpoint the faulty RAM module using an old memory test from Doug Ledford.

Since the script shown there cannot be used as is, here is what I did to make it work on Ubuntu 7.10:
- Download a Linux kernel from (we are not going to compile a kernel, we just need a large compressed archive): wget
- Transform it to a gzipped tar:
bunzip2 linux-
gzip linux-
cp linux- /tmp

- Download the adapted script. My changes auto-detect files named linux-*.tar.gz and use the file name to predict the name of the root folder in the tar.
- Place the memory modules in your computer one at a time and run the script for each configuration.
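
The auto-detection mentioned above can be sketched roughly as follows. This is only an illustration, not the adapted script itself: the function names root_dir_for and find_kernel_tarball are hypothetical, and it assumes the archive unpacks into a folder named after the file minus its .tar.gz suffix.

```shell
#!/bin/sh
# Illustrative sketch, not the adapted script. Function names are hypothetical.

# Predict the top-level folder inside the tar from the archive name,
# e.g. /tmp/linux-X.Y.Z.tar.gz -> linux-X.Y.Z
# (assumes the archive follows that naming convention).
root_dir_for() {
    basename "$1" .tar.gz
}

# Print the first file matching linux-*.tar.gz in the given directory.
# Returns non-zero if no such file exists.
find_kernel_tarball() {
    for f in "$1"/linux-*.tar.gz; do
        # If the glob matched nothing, $f is the literal pattern and -f fails.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}
```

With an archive in /tmp, `root_dir_for "$(find_kernel_tarball /tmp)"` would print the folder name to expect after unpacking.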

The original site has more information on how the script works and why Memtest86 is actually useless. The point is that a modern CPU on its own cannot put enough load on your memory. Once DMA transfers run concurrently, as they do during heavy disk activity, more errors are detected.
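
To give a feel for the idea, here is a minimal sketch. It assumes the load is generated by unpacking the same archive several times in parallel and comparing the results. It is not Doug Ledford's script, and the function name stress_extract and its arguments are made up for illustration.

```shell
#!/bin/sh
# Sketch only, not the original memory test script.
# Usage: stress_extract ARCHIVE.tar.gz WORKDIR COPIES
stress_extract() {
    archive=$1
    workdir=$2
    copies=$3

    i=1
    while [ "$i" -le "$copies" ]; do
        mkdir -p "$workdir/copy$i"
        # Run each extraction in the background so that all of them
        # read, decompress and write at the same time.
        tar -xzf "$archive" -C "$workdir/copy$i" &
        i=$((i + 1))
    done
    wait    # block until every background extraction has finished

    rc=0
    i=2
    while [ "$i" -le "$copies" ]; do
        # All copies come from one archive, so any difference between
        # them points at corruption somewhere along the way.
        if ! diff -r "$workdir/copy1" "$workdir/copy$i" >/dev/null; then
            echo "MISMATCH: copy1 vs copy$i"
            rc=1
        fi
        i=$((i + 1))
    done
    return "$rc"
}
```

A call such as `stress_extract /tmp/linux-X.Y.Z.tar.gz /tmp/work 4` would return non-zero if any two copies differ. A mismatch does not prove that RAM is at fault, but on otherwise healthy hardware the copies should always be identical.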
