
Live Memtest II


Earlier this year, I wrote about the necessity of a live memory test (or memory scrubbing) for low- to medium-end servers. The recently published Google paper (“DRAM Errors in the Wild: A Large-Scale Field Study” [1]) shows that the current situation is worse than I imagined. As a consequence, I will write a memory scrubber for the Linux kernel. This posting describes the whys and some of the planned hows.

The following three sections spotlight three important results of the Google study and elaborate on their relation to the yet-to-be-written memory scrubber.

Use ECC (or better) memory

Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously
reported.

About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year. … The number of correctable errors per DIMM is highly variable, with some DIMMs experiencing a huge number of errors, compared to others. The annual incidence of uncorrectable errors was 1.3% per machine and 0.22% per DIMM.

The conclusion we draw is that error correcting codes are crucial for reducing the large number of memory errors to a manageable number of uncorrectable errors.

Although the advice to use ECC memory is nothing new, it is always better to know why you insist on expensive ECC memory.

Scrub your memory

Single-bit soft errors in the memory array can accumulate over time and turn into multi-bit errors. In order to avoid this accumulation of single-bit errors, memory systems can employ a hardware scrubber [14] that scans through the memory, while the memory is otherwise idle. Any memory words with single-bit errors are written back after correction, thus eliminating the single-bit error if it was soft.

Scrubbing the memory of the system helps to prevent correctable (soft) errors from becoming uncorrectable. If your Linux system has ECC RAM but no hardware scrubber you can – as of now – only hope that errors are found soon enough in the course of regular memory accesses.
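The core of a scrub pass is simple enough to sketch in a few lines of C. The helper below (`scrub_region` is a hypothetical name, not kernel code): on an ECC system, merely reading a word gives the memory controller a chance to detect and correct a single-bit error, and writing the corrected value back keeps a soft error from lingering until a second bit flips.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal sketch of a software scrub pass (hypothetical helper, not
 * actual kernel code). On ECC hardware, reading a word triggers
 * detection/correction of single-bit errors; writing the (corrected)
 * value back stores it, so the error cannot accumulate. The volatile
 * qualifier keeps the compiler from optimizing away the redundant
 * read-then-write of the same value. */
static void scrub_region(volatile uint64_t *mem, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        mem[i] = mem[i];   /* read each word, write it back */
}
```

Without ECC this loop cannot correct anything, of course; it only illustrates the access pattern a scrubber generates.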

Error rates are unlikely to be dominated by soft errors.

Conclusion 7: Error rates are unlikely to be dominated by soft errors.

… this observation leads us to the conclusion that a significant fraction of errors is likely due to mechanisms other than soft errors, such as hard errors or errors induced on the datapath.

This really speaks for itself: detected errors are likely to be caused by bad hardware. This is contrary to what was generally believed: that errors are transient and caused by solar activity or other external (non-permanent, environmental) factors.

My conclusion

The last conclusion in the Google paper (conclusion 7) implies that a memory tester can be of great help, even for non-ECC systems. The reasoning is:

  1. If the majority of errors are caused by hardware failures and
  2. if these errors are persistent (e.g. a given cell is much more likely to fail than another cell),
  3. then these errors can be found without hardware ECC.

These errors can be found by writing patterns to memory cells and re-reading them. If errors are found, the affected memory frames can be excluded from normal usage [2]. Although this approach does not correct errors, it can prevent bad cells from constantly corrupting data.
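The write-and-re-read idea can be sketched as a destructive pattern test over a region, in the spirit of classic memtest checks. The function name and the choice of two complementary patterns are my own illustration; a real tester would use more patterns (moving inversions, address-in-address, etc.):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pattern test over a memory region (destroys its
 * contents). Writes a pattern and its complement, re-reads each word,
 * and returns the index of the first mismatching word, or -1 if no
 * error was detected. A hard-failing cell that cannot hold a 0 or a 1
 * will mismatch on at least one of the two patterns; the containing
 * frame is then a candidate for exclusion from normal usage. */
static long test_region(volatile uint64_t *mem, size_t nwords)
{
    const uint64_t patterns[] = { 0x5555555555555555ULL,
                                  0xAAAAAAAAAAAAAAAAULL };
    for (size_t p = 0; p < 2; p++) {
        for (size_t i = 0; i < nwords; i++)
            mem[i] = patterns[p];
        for (size_t i = 0; i < nwords; i++)
            if (mem[i] != patterns[p])
                return (long)i;   /* bad frame candidate */
    }
    return -1;   /* no error detected */
}
```

Because the test overwrites the region, a live scrubber could only run it on frames that are currently free (or whose contents have been migrated away first).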

…and action!

If one piece of advice is to be taken from the paper, it is “buy good ECC memory”. The next most important conclusion is: “use a memory scrubber to detect errors early”.

I will dedicate time to write a memory scrubber for Linux. The work will be accompanied by some research questions:

  • What is the performance impact of a software memory scrubber for desktop/server systems?
  • What is the best scheduling strategy for a scrubber? Preemptive or cooperative (e.g. on __free_page)?
  • Does it make sense to implement the scrubber in user space or in kernel space?
  • What are typical memory access patterns of the Linux kernel anyway?
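The performance and scheduling questions above can be made concrete with a small user-space sketch of a cooperative, rate-limited scrub pass: walk the region in chunks and yield between chunks so the scrubber's cache and memory-bus pressure stays bounded. The chunk size and delay are made-up tuning knobs, not measured values, and the function name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical user-space sketch of a cooperative scrub pass: scrub
 * CHUNK_WORDS words at a time and sleep between chunks, so regular
 * workloads are only briefly disturbed. CHUNK_WORDS and delay_ns are
 * illustrative tuning knobs; finding good values is exactly the
 * "performance impact" research question. */
#define CHUNK_WORDS 4096

static void scrub_pass(volatile uint64_t *mem, size_t nwords,
                       long delay_ns)
{
    struct timespec ts = { 0, delay_ns };
    for (size_t base = 0; base < nwords; base += CHUNK_WORDS) {
        size_t end = base + CHUNK_WORDS;
        if (end > nwords)
            end = nwords;
        for (size_t i = base; i < end; i++)
            mem[i] = mem[i];        /* read + write back */
        nanosleep(&ts, NULL);       /* yield between chunks */
    }
}
```

A kernel-space variant could instead hook a callback such as `__free_page` and test only frames at the moment they become free, which sidesteps the "destroys contents" problem entirely.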

There has been some prior research on this topic (e.g. Singh, Bose and Darisala wrote a proof of concept for Solaris 8 [3]), but to my knowledge no one has ever created a solution suitable for real-life environments.

More information (and ongoing results) will be posted here.

Resources

[1] DRAM Errors in the Wild: A Large-Scale Field Study, SIGMETRICS/Performance’09 (online)

[2] The physical memory frames can be allocated to a pool of pages that are never used and never freed.

[3] Amandeep Singh, Debashish Bose and Sandeep Darisala (all Sun Microsystems, Inc.): Software Based In-System Memory Test for Highly Available Systems (online)

[14] (This is the reference quoted inside the quotation.) S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt. Cache scrubbing in microprocessors: Myth or necessity? In PRDC ’04: Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. (online)
