Your Data Is Corrupted, and You Don't Know It

Chances are that you have corrupted data stored on your server. And written to your backup. And you'll only find out when there is no correct copy left. But help is on the way. Kind of.

Be warned: this is a rather long piece of work. I finally got around to putting my diploma thesis on github, and this post is a mixture of an introduction to memory management, computer hardware, and the thesis itself.

The target audience for the first sections is the interested layman; later on, it gets more technical. If you are really interested, read the diploma thesis.

But don't worry: the post is structured so that it gradually gets more technical. I tried to keep technical mumbo-jumbo to the absolute minimum and to explain as much as possible.

TL;DR

Memory errors are bad, bad, bad. And they happen quite frequently: each year, about a third of Google's servers showed them. Use ECC memory in all your computers, if supported. Run a memory test program every now and then. Watch out for unexplained program crashes. Make backups.

For my diploma thesis I wrote a hybrid (kernel, userspace) memory tester for Linux that solves the main problem of current memory test programs:

Reliably test most (~70%, YMMV) of the computer's memory while still being able to use the computer for productive work.

If the headline made you wonder "Do I have corrupt data? Why do you think I have corrupted data?", or if you just want to learn a bit about RAM in general or about Linux memory management: read on.

Errors in RAM

Everything a computer does at some point involves RAM (memory). Typical home computers and small servers have between 4 and 16 gigabytes (GiB) of it. Defects in RAM are not easily detected because the symptoms are very unspecific. But in almost all cases bad RAM corrupts data that is saved to disk at some point, either by changing the data directly or by changing the programs handling the data. Corrupted data can be anything from a pixel in a video having a slightly different colour than it should, to documents/databases that are no longer readable, to – worst of all – data that you or your company is liable for changing its meaning without anyone finding out until it is too late.

In mid-2009, Google and Bianca Schroeder from the University of Toronto published a study showing that Google's – albeit commodity-grade – server hardware is prone to memory errors: each year, a third of the studied systems suffered at least one correctable memory error (CE). According to the study, a system that had a CE in the past is very likely to have many more CEs, or even uncorrectable errors (UE), in the near future. It is safe to assume that consumer-grade hardware shows even worse behaviour. These results emphasise the importance of the early detection of defective memory modules.

Google is not the only one to have found out that RAM is a tricky beast. From the NASA website, updated May 17, 2010 at 5:00 PT:

One flip of a bit in the memory of an onboard computer appears to have caused the change in the science data pattern returning from Voyager 2, engineers at NASA’s Jet Propulsion Laboratory said Monday, May 17. A value in a single memory location was changed from a 0 to a 1.

Finding defective RAM is a non-trivial task and can be done in hardware (e.g. ECC memory), in software (e.g. memtest86+), or with a combination of both. Hardware tests do not find all errors, and conventional software tests require hours-long downtimes of the machines. Other software tests can run while the computer is in normal use, e.g. editing spreadsheets or serving webpages. These programs have a different problem: they just randomly poke around in the computer's memory and often test only very small parts of the system memory.

Why should I have corrupted data?

How can this be true? How is data corrupted silently?

The laptop I am writing this post on is equipped with 8 GiB of RAM, a staggering 68 billion (68,719,476,736) single bits. A bit is the smallest unit of information; it can only have two values: 0 or 1.

To corrupt a document, it is enough that one of these 68 billion bits flips (1 to 0 or vice versa) when it shouldn't. It does not need to be an easily detected flip. Maybe a slightly tainted memory cell has no problem storing 00000000 or 11111111, but randomly flips 11101111 to 11111111 at more than 40 °C. In another case it is an alpha particle knocking an electron out of a transistor.

Why is this dangerous? Bits have a meaning, be it data – e.g. the sign bit of a number – or program code. Again: the symptoms of a flipped bit range from instant crashes over seemingly randomly behaving systems to silently corrupted data.

Data can be documents, e.g. a spreadsheet, or important system data, like the file system structure. If a corruption goes undetected, it can lead to consequential errors. Maybe the flipped bit gets interpreted by the file system implementation as "this file is deleted" when the file should not be deleted at all. The flip might happen in cached data that is about to be written back to disk, where it renders a spreadsheet unreadable because some internal structure is corrupted. Worse, the file could still be perfectly readable, but contain the wrong numbers. The possibilities for disaster are endless.

It gets worse, as corruption is often silent: it is not detected until the symptoms are so bad that someone investigates, e.g. constant server crashes. Only later does it turn out that important financial records have been corrupted for months, and that there is no way to verify the correctness of the data.

How can RAM break?

When talking about hardware failures, it is common to distinguish between hard and soft errors. For this discussion, a hard error is a physical hardware defect, e.g. a crossed lane, a broken connection, etc. Hard errors are permanent and have to be physically repaired. Soft errors are caused by some external factor. For example, natural radioactive decay produces alpha particles; if an alpha particle hits a transistor, it can change the transistor's charge. Despite these definitions, the line between hard and soft errors is a fuzzy one. Leaking charges in a processor at normal operating temperature could be classified as a hard error, while leaking charges outside the specified temperature range could be classified as a soft error, because the CPU is used outside its specified parameters. I am no physicist, so please take this explanation with a grain of salt.

Said study from Google (I wrote about it here) shows that memory errors are very likely to be permanent (“hard”) defects in the hardware.

Why are hard errors so bad?

This definition of a hard error has the consequence that a broken RAM cell (where one bit is stored) has a high chance of always corrupting the data stored in it. Memory is re-used by the operating system quickly, and a (broken) cell can easily be used for dozens of different tasks in a single second, multiplying the damage.

How is RAM used by the kernel?

Most computers manage memory in chunks of 4 KiB called pages. If you want to know more about pages, see the explanation below.

The following three images, taken approx. 15 seconds apart, show how quickly page usage changes by visualising the RAM used by user processes, e.g. a document editor, a browser, etc. One pixel represents 4 KiB of RAM (32,768 bits).

In the first image, nearly all of the system's memory (1334 of 2048 MiB) is allocated to user processes via mmap or as anonymous memory. In this case the user processes were compiling the Linux kernel, but they could just as well have been spreadsheet programs. 15 seconds later this drops to 814 MiB, then rises to 1352 MiB of RAM used by programs. This shows that each memory cell is used for many, many different programs/things during a day.

  1. Page (memory) allocation to user processes while compiling Linux. The page flags were logically ANDed with (ANON | MMAP) prior to visualisation, so "(no flags set)" refers only to ANON/MMAP.
  2. The same system 15 seconds later: only 814 MiB allocated to user space programs.
  3. Another 17 seconds later: back to 1352 MiB.

I made a video (watch in HD!) visualising the whole 90-minute kernel build in 75 seconds. The video shows memory directly available to user processes (anonymous memory/mmaped memory); for a more detailed video look here.

What to do to prevent data corruption?

With computers there are two basic strategies: hardware or software. A pure hardware solution sounds good for the programmer, because there are no failures left for him to handle. In practice this does not work: if the damage is big enough, a program will be affected.

Hardware

The dangers of errors in computer memory and caches led to the development of countermeasures, the most prominent and most widely implemented being ECC (Error-Correcting Codes). In this context, ECC stands for memory subsystems equipped with in-situ detection and correction capabilities for certain error classes. The basic form of ECC is termed SEC-DED (Single Error Correction, Double Error Detection) and allows the detection of two-bit errors and the correction of one-bit errors.
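To illustrate the principle, here is a toy SEC-DED implementation in C: a Hamming(7,4) code extended with an overall parity bit, protecting just four data bits. Real ECC memory applies the same idea to whole 64-bit words (e.g. as a (72,64) code); the bit layout and names below are my own choices for this example, and __builtin_parity is a GCC/Clang builtin.

#include <stdio.h>
#include <stdint.h>

/* Toy SEC-DED: Hamming(7,4) plus an overall parity bit p0.
 * Codeword bit layout (bit 0 = LSB): p0 p1 p2 d0 p3 d1 d2 d3,
 * i.e. the classic Hamming positions 1..7 live in bits 1..7. */
static uint8_t secded_encode(uint8_t data) {
    int d0 = data & 1, d1 = (data >> 1) & 1,
        d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
    int p1 = d0 ^ d1 ^ d3;  /* checks positions 1,3,5,7 */
    int p2 = d0 ^ d2 ^ d3;  /* checks positions 2,3,6,7 */
    int p3 = d1 ^ d2 ^ d3;  /* checks positions 4,5,6,7 */
    uint8_t cw = (uint8_t)((p1 << 1) | (p2 << 2) | (d0 << 3) |
                           (p3 << 4) | (d1 << 5) | (d2 << 6) | (d3 << 7));
    return cw | (uint8_t)__builtin_parity(cw);  /* p0: even overall parity */
}

/* Returns 0 and the corrected data, or -1 on an uncorrectable
 * (double bit) error. */
static int secded_decode(uint8_t cw, uint8_t *data) {
    int syndrome = ((((cw >> 1) ^ (cw >> 3) ^ (cw >> 5) ^ (cw >> 7)) & 1) << 0)
                 | ((((cw >> 2) ^ (cw >> 3) ^ (cw >> 6) ^ (cw >> 7)) & 1) << 1)
                 | ((((cw >> 4) ^ (cw >> 5) ^ (cw >> 6) ^ (cw >> 7)) & 1) << 2);
    int parity_bad = __builtin_parity(cw);      /* should be even */

    if (syndrome && parity_bad)
        cw ^= (uint8_t)(1u << syndrome);        /* single error: correct it    */
    else if (syndrome && !parity_bad)
        return -1;                              /* double error: detect only   */
    else if (!syndrome && parity_bad)
        cw ^= 1;                                /* the parity bit p0 flipped   */

    *data = (uint8_t)(((cw >> 3) & 1) | (((cw >> 5) & 1) << 1) |
                      (((cw >> 6) & 1) << 2) | (((cw >> 7) & 1) << 3));
    return 0;
}

int main(void) {
    uint8_t data;
    uint8_t cw = secded_encode(0xB);

    cw ^= 1u << 5;                              /* flip one bit: correctable */
    int rc = secded_decode(cw, &data);
    printf("single flip: rc=%d, data=0x%X\n", rc, data);

    cw ^= 1u << 2;                              /* flip a second bit */
    rc = secded_decode(cw, &data);
    printf("double flip: rc=%d (uncorrectable)\n", rc);
    return 0;
}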

Various vendors improved ECC for systems that need a greater level of confidence in the correct behaviour of memory. The list of ECC improvements includes IBM Chipkill, Extended ECC, Intel SDDC, and Chipspare from HP. These technologies are mostly found in more expensive server systems. The same is true for Intel's MCA Recovery, a mechanism that allows operating systems to detect and potentially repair/circumvent hardware errors.

The biggest advantage of hardware is that it can correct certain kinds of errors.

Example: How Chipkill works. The image shows two memory modules (DDR modules, to be precise). DDR modules are built from multiple memory chips. The data width of these chips is 4, 8, or 16 bits; depending on the type of chips used to build a DDR memory module, the module is called a x4, x8, or x16 module. The figure shows two x4 modules, combined in the way used by Sun and AMD to implement Chipkill. In this example, one of the chips (2a) is defective. In a regular SEC-DED setup this would lead to undetectable memory errors, because 4 out of 64 bits would be faulty. By calculating the checksum over the combined data path of 128 bits, the missing symbol can be reconstructed (see here, where this image has been adapted from, for more details).

Software

Software-based countermeasures against defective memory have a great disadvantage compared to hardware-based ones: they cannot detect memory errors at the moment a client (any program) uses the memory, because the interaction between memory and client is private to the client. Software-based tools can only reduce the likelihood of a memory error affecting the system. They do this based on the idea that most errors are hard errors: if the software tests memory often enough, it can find errors very early. Once a defective memory cell is found, countermeasures can be taken.

Imagine you have a bottle of milk in your fridge, and you don't want your kid to pour sour milk on his/her cereal. A solution analogous to the hardware measures described above would be a milk bottle that beeps loudly when the milk has gone bad. Kid takes milk, bottle beeps – no sour milk on the cereal. Problem solved. Only: the bottle is more expensive than regular bottles and does not detect all kinds of bad milk (e.g. flies in the milk – yes, I am stretching it). A software solution would be you sniffing at the milk every now and then. No need for fancy hardware, and you (the software) can do a much more thorough test, e.g. look for flies or use a microscope. Still, there is the chance that the milk goes bad just after you checked.

How do Software Tests Work?

All software-based tools work as memory testers that write different patterns to the memory and verify that the memory works correctly by reading the data back. These software memory testers run either on the bare metal or in cooperation with the regular operating system that the computer runs. More on that later.

Software memory testers write, and later verify, patterns in RAM. A very naive implementation would fill the whole memory with zeroes and then check whether it still reads back as zero. This implementation has several drawbacks. For example, it cannot detect a cell that is always 0, even when a 1 is written into it. It also does not take into account that cells can influence each other, e.g. writing a value into one cell changes a different cell as well. At the end of the post I link to my thesis, where this is discussed in much greater detail. As a takeaway: good-enough memory test algorithms can be written in O(n), while a full test takes O((3n^2 + 2n) * 2^n), where n is the number of bits – the 68 billion number from above.
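To make this concrete, here is a minimal sketch of a classic "walking ones" pattern test over a buffer: every bit position is tested with a word containing exactly one 1 bit, and then with its inverse. The function name and interface are mine, purely for illustration; the thesis discusses the actual algorithms and fault models in depth.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Walking-ones pattern test over a buffer (illustrative sketch).
 * For every bit position, fill the buffer with a word that has exactly
 * one 1 bit, verify the read-back, then repeat with the inverted word.
 * This catches stuck-at-0/stuck-at-1 bits that a plain fill-with-zero
 * test misses; it still cannot catch every coupling fault. */
static bool test_buffer(volatile uint64_t *buf, size_t words) {
    for (int bit = 0; bit < 64; bit++) {
        const uint64_t patterns[2] = { 1ULL << bit, ~(1ULL << bit) };
        for (int p = 0; p < 2; p++) {
            for (size_t i = 0; i < words; i++)
                buf[i] = patterns[p];
            for (size_t i = 0; i < words; i++)
                if (buf[i] != patterns[p])
                    return false;          /* the buffer showed an error */
        }
    }
    return true;
}

Note that on real hardware a tester also has to get past the CPU caches (e.g. by flushing them or by using uncached mappings), otherwise it mostly exercises the cache instead of the RAM.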

For the sake of definition: a page is something mapped into a user process (virtual address space); it is backed by a page frame. Multiple pages can be backed by the same page frame. A page frame is a region (4 KiB on Intel) of physical memory.

On computer systems with a paged memory layout (e.g. Windows, Linux, Solaris) tests should work on a per-page (or page-frame) basis.

An idealised memory test.
/*
 * Simple scanner that tests all frames for errors
 * and marks bad frames by calling `mark_frame_bad`.
 */
void test_all_frames(size_t max_pfn) {
    // Frames are numbered [0..max_pfn)
    for (size_t page_frame_number = 0;
         page_frame_number < max_pfn;
         page_frame_number++) {
        // Skip frames that are missing or already marked bad
        if (is_page_frame_marked_bad(page_frame_number))
            continue;
        // This is the difficult part: getting a *specific* frame of memory
        if (!IS_OK(acquire_frame(page_frame_number)))
            continue;
        // `test_frame_for_errors` returns true when the frame showed errors
        if (test_frame_for_errors(page_frame_number))
            mark_frame_bad(page_frame_number);  // keep the frame, isolating it
        else
            release_frame(page_frame_number);   // frame is fine, give it back
    }
}

A software-based memory test needs to do four different things:

  1. It must decide which page frame should be tested when (and how).
  2. It must be able to get exclusive access to the chosen page frame.
  3. It must be able to decide whether a page frame is defective or not.
  4. It must be able to isolate defective page frames, so that no other process can access them any more.

Memory tests can be divided into bare-metal memory tests, which run directly on the system without an underlying general-purpose operating system, and memory testers that run on top of an operating system.

Bare Metal Memory Tests

Bare-metal memory testers like memtest86+ prevent any other usage of the computer while the tests run. This makes them infeasible for the timely detection of defective memory, because a downtime must be scheduled for each run.

Tests in Userspace

If frequent downtimes of servers/systems are not acceptable, then the use of bare-metal memory tests is difficult. The alternatives are memory testers that run while the computer does what it was set up for in the first place, e.g. serving webpages or running a spreadsheet calculator.

Challenges of Userspace Tests

In order to test memory effectively, a test program must be able to test a specific hardware address, that is, a specific cell of memory. Unfortunately, this is not possible under modern operating systems.

Virtual Address Spaces. This image shows how address spaces are mapped to physical memory (2). Each of the two processes has its own virtual address space (1a, 1b) that is mapped (3) onto physical memory. A few things are noteworthy: the kernel resides in the first few page frames, and each process maps the kernel at the end of its address space. Both processes share one frame, but process #1 has it mapped to a different page than process #2 (4a). A page of process #2 has been swapped out (4b); if process #2 accesses an address in this page, a fault is generated and the page is – transparently to the process – connected to a frame containing the then swapped-in data. (4c) shows a page that has no mapping; accessing it would cause the kernel to kill the process with a segfault.

Modern systems (e.g. Windows 95 and upwards, so "modern" might not be the best word) hide physical memory from programs. Instead of working with physical memory addresses, programs use virtual addresses. Wikipedia has more on the issue. The essence is: user space programs cannot even tell which physical address they are currently testing, let alone choose which address to test.
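As a side note, Linux lets a (privileged) process at least observe, though not choose, this mapping: since kernel 2.6.25, /proc/self/pagemap contains one 64-bit entry per virtual page, with a present bit and the number of the backing page frame. A small sketch (since Linux 4.0 the frame number reads as zero without the required privileges):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* Look up which physical page frame backs a virtual address by reading
 * /proc/self/pagemap: one 64-bit entry per virtual page, bit 63 = page
 * present, bits 0-54 = page frame number. */
int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);
    char *buf = malloc(page_size);
    *(volatile char *)buf = 1;            /* touch it so a frame is assigned */

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t entry;
    off_t offset = (off_t)((uintptr_t)buf / page_size) * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
        perror("pread"); return 1;
    }
    if (entry & (1ULL << 63))
        printf("virtual %p -> page frame number %llu\n",
               (void *)buf, (unsigned long long)(entry & ((1ULL << 55) - 1)));
    else
        printf("page not present\n");

    close(fd);
    free(buf);
    return 0;
}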

Summary: for tasks 2 & 4 (getting access to a page frame & isolating a defective page frame), a userspace program needs the help of the kernel.

Implemented Architecture

The architecture of my implementation consists of two basic parts: the memory test program as a user space process (written in Python), and a kernel module (phys_mem).

Coming back to the 4 things a memory tester needs to do:

  1. It must decide which page frame should be tested when (and how).

    –> A userspace program keeps track of when a page frame was last tested, and with what result. The program selects the frames to be tested and acquires them from a kernel module (phys_mem).

  2. It must be able to get exclusive access to the chosen page frame.

    –> A kernel module (phys_mem) provides an API that allows a userspace program to mmap page frames. phys_mem implements various strategies to acquire the selected page frame, e.g. by manipulating the kernel's buddy allocator (see the hypothetical sketch after this list).

  3. It must be able to decide whether a page frame is defective or not.

    –> The userspace tool implements different test algorithms to test frames.

  4. It must be able to isolate defective page frames, so that no other process can access them any more.

    –> When the memory test finds a bad page frame, it tells phys_mem to isolate the frame so that it is no longer used.
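To give a feel for how this can look from userspace, here is a purely hypothetical sketch: the device path /dev/phys_mem, the convention that the mmap offset selects the page frame, and the error handling are all my assumptions for illustration, not the actual phys_mem API – see the thesis and the code for the real interface. The sketch reuses the test_buffer function from above.

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

/* Pattern test from the sketch above. */
bool test_buffer(volatile uint64_t *buf, size_t words);

/* HYPOTHETICAL sketch: acquire one specific page frame through a
 * phys_mem-style character device and run a pattern test on it.
 * Device name and "mmap offset = page_frame_number * page_size" are
 * assumed for illustration only. Returns 1 for a bad frame, 0 for a
 * good one, -1 on failure to acquire the frame. */
int test_one_frame(uint64_t page_frame_number) {
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = open("/dev/phys_mem", O_RDWR);        /* assumed device node */
    if (fd < 0) { perror("open"); return -1; }

    /* Ask the kernel module for this specific frame ... */
    void *frame = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, (off_t)page_frame_number * page_size);
    if (frame == MAP_FAILED) { perror("mmap"); close(fd); return -1; }

    /* ... and run a pattern test on it. */
    int bad = !test_buffer((volatile uint64_t *)frame,
                           (size_t)page_size / sizeof(uint64_t));

    munmap(frame, (size_t)page_size);
    close(fd);
    return bad;   /* the caller would ask phys_mem to isolate bad frames */
}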

Package Diagram of the Design. The package diagram of the implementation shows seven packages, six of which are part of the implementation. The TestScheduler on the left determines which frames are to be tested when, and how. Algorithms for different fault models are implemented according to the strategy pattern. Kernel-based services are located in the middle column: at the top is the StructPageMap package, which allows user space processes to mmap the page flags into their address space; below it are the Linux kernel and the PhysMem module, which gives user space access to the frames acquired by the PageClaiming implementations. On the right, the MemoryVisualization package collects snapshots of the page flags and generates videos that visualise the behaviour of the memory management.

Furthermore, I wrote some tools to analyse & visualise the memory management.

Reflection on the Architecture

Splitting the whole tool into a kernel and a userspace component has three major advantages:

  1. Testability: Testing userspace components is very easy, especially with a decoupled design. The clear & simple API of the kernel module also allows it to be tested from userspace (with Python tests!).
  2. Reliability: Crashes in userland have no impact on system stability. The memory tested by the userspace component is automatically released when the process dies.
  3. Low impact: The userspace program can be scheduled by the regular task scheduler. A test process with nice 19 has negligible impact on the system's performance.

More

If you want to know more, you can download my diploma thesis. If you have IEEE access, you can download a paper based on it (RAMpage: Graceful Degradation Management for Memory Errors in Commodity Linux Servers). You can also download the code.

Contribute

At the moment I have no ambition to continue this project, mainly due to time constraints (job, family, other hobbies and pet projects). If you want to hack on it: fork it on github. Maybe I'll jump back in if I see others continuing the project.

There are still many things to be done:

  • It was developed for kernel 2.6.34; things have probably changed since then.
  • The algorithms to get a specific page frame from the kernel are far from complete. Several more strategies are needed, e.g. to acquire memory currently held by programs, and buffers.
  • The test algorithms are proof-of-concept implementations in Python. C is a much better choice and will yield a performance improvement of several orders of magnitude (this has already been tested).

Closing Remarks

Thank you for reading this far. If you have any questions or remarks, just drop them here or on github.