
KVM, OpenSolaris and Large IDE Drives That Corrupt Data


Currently I am evaluating KVM as the new virtualisation solution for my home server. For me it is very important that I can run OpenSolaris as a zfs-based storage server (NFS, SMB, iSCSI).

The base for my evaluation is a dual-core system with two SATA drives. The first drive (sda, 250 GB) is used as the system drive for my host; the second drive (sdb, 1 TB) will act as one half of my yet-to-be zfs mirror.

For maximum performance and a minimum of filesystem stacking (just consider the iSCSI case: ext3 on iSCSI on zfs on virtualized IDE on ext3 ...) I will pass sdb directly into the Solaris domain. The OpenSolaris domain itself lives on a qcow2 image, which in turn is hosted on the (Linux) host drive.

Creating the OpenSolaris domain

I created the test domain sol1 via virsh. The following snippet is the kvm command line that libvirt generated (taken from the domain log):

libvirt/qemu/sol1.log.1:LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/X11R6/bin HOME=/home/jens USER=root LOGNAME=root \
/usr/bin/kvm \
-S -M pc -m 1024 -smp 1 -name sol1 \
-uuid 133990ab-91bd-ad36-162f-032c548176dc \
-monitor pty \
-boot c \
-drive file=/srv/machines/sol1/rpool.qcow2,if=ide,index=0,boot=on \
-drive file=/dev/sdb,if=ide,index=2 \
-net nic,macaddr=54:52:00:26:5f:5a,vlan=0,model=e1000 \
-net tap,fd=22,vlan=0 \
-serial pty \
-parallel none \
-vnc 127.0.0.1:0 -k de

Note that I use ide and not scsi as the bus system. The SCSI adapter emulated by KVM is not supported by OpenSolaris, and I could not find a virtio interface.

After I installed and updated the domain I created a zpool on sdb (c7d1 in OpenSolaris). Everything went well until I “stress tested” the drive by creating a large file of zeroes on it.

$ dd if=/dev/zero of=/storage/ZEROES bs=8M

After a few gigabytes were written, zfs freaked out:

$ pfexec zpool status -v storage
  pool: storage
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     UNAVAIL      0     0     1  insufficient replicas
          c7d1      UNAVAIL      0     0     6  cannot open

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x17>

As I found out later, the IO failures were in fact corrupted data blocks. Hooray for zfs checksumming! Without it I would surely have lost data.

After an unsuccessful (because it never really finished) attempt of

zpool clear storage

a new SSH connection allowed me another status query:

$ pfexec zpool status -v storage
  pool: storage
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     UNAVAIL      0     0     0  insufficient replicas
          c7d1      UNAVAIL      0     0     0  cannot open

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x17>
        /storage/

That was bad news indeed. Further tests showed that I was able to reproduce the error every time.

Updating KVM

Updating KVM to the latest version (88) did not help. With KVM being supported by Red Hat, I thought that a bug found in KVM would have a good chance of getting fixed.

So I decided to dig into the issue.

Testing the drive

I destroyed the domain and tested the disk from the host. Fortunately (or unfortunately) the test ran without issues. That ruled out hardware and host-driver problems.

Solaris based tests

The first real test was to see whether the problem lay within zfs or below it.

pfexec format
...
analyze> verify
Ready to verify (will corrupt data). This takes a long time,
but is interruptible with CTRL-C. Continue? y
pass 0
pass 1
Data miscompare error (expecting 0xe, got 0x7000000e) at 0/0/0, offset = 0x1c00.
Data miscompare error (expecting 0x7e, got 0x7000007e) at 0/1/0, offset = 0x0.
Data miscompare error (expecting 0xfc, got 0x700000fc) at 0/2/0, offset = 0x0.
Data miscompare error (expecting 0x17a, got 0x7000017a) at 0/3/0, offset = 0x0.
Data miscompare error (expecting 0x1f8, got 0x700001f8) at 0/4/0, offset = 0x0.
Data miscompare error (expecting 0x276, got 0x70000276) at 0/5/0, offset = 0x0.
Data miscompare error (expecting 0x2f4, got 0x700002f4) at 0/6/0, offset = 0x0.
Data miscompare error (expecting 0x372, got 0x70000372) at 0/7/0, offset = 0x0.
Data miscompare error (expecting 0x3f0, got 0x700003f0) at 0/8/0, offset = 0x0.
Data miscompare error (expecting 0x46e, got 0x7000046e) at 0/9/0, offset = 0x0.
Data miscompare error (expecting 0x4ec, got 0x700004ec) at 0/10/0, offset = 0x0.
Data miscompare error (expecting 0x56a, got 0x7000056a) at 0/11/0, offset = 0x0.
Data miscompare error (expecting 0x5e8, got 0x700005e8) at 0/12/0, offset = 0x0.
Data miscompare error (expecting 0x666, got 0x70000666) at 0/13/0, offset = 0x0.
Data miscompare error (expecting 0x6e4, got 0x700006e4) at 0/14/0, offset = 0x0.
Data miscompare error (expecting 0x762, got 0x70000762) at 0/15/0, offset = 0x0.
....
Data miscompare error (expecting 0x624432, got 0x70624432) at 200/111/0, offset = 0x0.
Data miscompare error (expecting 0x^C

Uh. Bad. The hardware is good (see above), so this is a problem in one of the following:

  • The Solaris IDE-drivers
  • KVM / QEMU
  • Something different altogether

Solaris, second try

I wanted to find out what actually got written to the disk. For that I dd'ed the drive full of zeroes and then checked the disk from my host.

On Solaris I destroyed the pool and filled the disk with zeroes:

dd if=/dev/zero of=/dev/dsk/c7d1p0 bs=8M

After that finished (with an average throughput of 100 MB/s, not bad for a virtualized drive) I shut down the domain and used the host to test the drive for “all zeroes”.

As doing that by hand seemed like a bad idea™, I wrote a small C program to do my chores.
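The program is not reproduced here, but a minimal sketch of such a checker might look like the following. This is my reconstruction, not the original scandisk source; the chunk size and all names are my own choices. It seeks to a start offset, reads the device in large chunks and reports every 32-bit word that is not zero.

/* scandisk-like checker: a reconstruction, not the original program.
 * Usage: ./scandisk <device> <start offset in bytes>                  */
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)   /* read 8 MiB per iteration (assumed value) */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <device> <start offset in bytes>\n", argv[0]);
        return 1;
    }

    printf("sizeof(long)   %zu\n", sizeof(long));
    printf("sizeof(int)    %zu\n", sizeof(int));
    printf("sizeof(size_t) %zu\n", sizeof(size_t));
    printf("sizeof(off_t)  %zu\n", sizeof(off_t));

    off_t offset = (off_t)strtoull(argv[2], NULL, 0);

    printf("Opening %s\n", argv[1]);
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    printf("Initial seek to %016" PRIx64 "\n", (uint64_t)offset);
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1) { perror("lseek"); return 1; }

    uint32_t *buf = malloc(CHUNK);
    if (buf == NULL) { perror("malloc"); return 1; }

    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0) {
        size_t words = (size_t)n / sizeof(uint32_t);
        for (size_t i = 0; i < words; i++) {
            if (buf[i] != 0)     /* anything non-zero means the write was corrupted */
                printf("Validation error @ %016" PRIx64
                       " : Expected 00000000 got %08" PRIx32 "\n",
                       (uint64_t)(offset + (off_t)(i * sizeof(uint32_t))), buf[i]);
        }
        offset += n;
    }
    if (n < 0) perror("read");

    free(buf);
    close(fd);
    return 0;
}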

The errors started at an offset of 0x2000000000 (128 GiB):

./scandisk /dev/sdb 0000000000
sizeof(long) 8
sizeof(int) 4
sizeof(size_t) 8
sizeof(off_t) 8
Opening /dev/sdb
Initial seek to 0000000000000000
Validation error @ 0000002000000000 : Expected 00000000 got 70000000
Validation error @ 0000002000000004 : Expected 00000000 got 70000000
Validation error @ 0000002000000008 : Expected 00000000 got 70000000
...

Later the error pattern changed, so I adapted my test to accumulate the errors. Note that each wrong value repeats 128 times, that is 128 * sizeof(int) = 512 bytes, exactly one disk sector.
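The accumulating version collapses runs of identical wrong values before printing. A sketch of the per-chunk reporting loop that replaces the simple per-word check above might look like this (again my reconstruction; the function name and structure are mine):

/* Reporting loop of the adapted checker, per chunk of 32-bit words.
 * This is a reconstruction, not the original test code.              */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

void report_errors(const uint32_t *buf, size_t words, uint64_t base_offset)
{
    uint32_t run_value  = 0;   /* last wrong value seen */
    uint64_t run_length = 0;   /* how often it repeated so far */

    for (size_t i = 0; i < words; i++) {
        if (buf[i] == 0)
            continue;                  /* expected content, nothing to report */

        if (run_length > 0 && buf[i] == run_value) {
            run_length++;              /* same wrong value again: extend the run */
            continue;
        }

        if (run_length > 1)            /* a run ends here: summarise it */
            printf("The value %08" PRIx32 " appeared %" PRIu64 " times in succession.\n",
                   run_value, run_length);

        run_value  = buf[i];
        run_length = 1;
        printf("Validation error @ %016" PRIx64 " : Expected 00000000 got %08" PRIx32 "\n",
               base_offset + (uint64_t)i * sizeof(uint32_t), run_value);
    }

    if (run_length > 1)                /* flush the final run */
        printf("The value %08" PRIx32 " appeared %" PRIu64 " times in succession.\n",
               run_value, run_length);
}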

After 16 blocks with unchanged MSB, the MSB decremented from 0x70 to 0x60.

Validation error @ 0000002000000000 : Expected 00000000 got 70000000
The value 70000000 appeared 128 times in succession.
Validation error @ 0000002000000200 : Expected 00000000 got 70000001
The value 70000001 appeared 128 times in succession.
... [the 128-element blocks continue for 70000002, 70000003, ...]
Validation error @ 0000002000001800 : Expected 00000000 got 7000000c
The value 7000000c appeared 128 times in succession.
Validation error @ 0000002000001a00 : Expected 00000000 got 7000000d
The value 7000000d appeared 128 times in succession.
Validation error @ 0000002000001c00 : Expected 00000000 got 6000000e
The value 6000000e appeared 128 times in succession.
Validation error @ 0000002000001e00 : Expected 00000000 got 6000000f
The value 6000000f appeared 128 times in succession.
Validation error @ 0000002000002000 : Expected 00000000 got 60000010
The value 60000010 appeared 128 times in succession.
.... [ there are 128 6... blocks ]
Validation error @ 0000002000003a00 : Expected 00000000 got 6000001d
The value 6000001d appeared 128 times in succession.
Validation error @ 0000002000003c00 : Expected 00000000 got 5000001e
The value 5000001e appeared 128 times in succession.
Validation error @ 0000002000003e00 : Expected 00000000 got 5000001f
The value 5000001f appeared 128 times in succession.
.... [ there are 128 5... blocks ]
Validation error @ 0000002000005a00 : Expected 00000000 got 5000002d
The value 5000002d appeared 128 times in succession.
Validation error @ 0000002000005c00 : Expected 00000000 got 4000002e
The value 4000002e appeared 128 times in succession.
.... [ there are 128 4... blocks ]
Validation error @ 0000002000007a00 : Expected 00000000 got 4000003d
The value 4000003d appeared 128 times in succession.
Validation error @ 0000002000007c00 : Expected 00000000 got 3000003e
The value 3000003e appeared 128 times in succession.
... 128 blocks of 3...
Validation error @ 0000002000009a00 : Expected 00000000 got 3000004d
The value 3000004d appeared 128 times in succession.
Validation error @ 0000002000009c00 : Expected 00000000 got 2000004e
The value 2000004e appeared 128 times in succession.
... 128 blocks of 2...
Validation error @ 000000200000ba00 : Expected 00000000 got 2000005d
The value 2000005d appeared 128 times in succession.
Validation error @ 000000200000bc00 : Expected 00000000 got 1000005e
The value 1000005e appeared 128 times in succession.
... 128 blocks of 1...
Validation error @ 000000200000da00 : Expected 00000000 got 1000006d
The value 1000006d appeared 128 times in succession.
Validation error @ 000000200000dc00 : Expected 00000000 got 55555555
.....

And from there the errors continue more or less to the end of the disk (the 555... pattern changes as well).

Conclusion

I am sure that there is an issue with KVM, OpenSolaris and IDE drives larger than 128 GiB. The next question is whether this can be fixed in KVM or OpenSolaris, or whether it is an IDE-induced problem. I still remember all those IDE out-of-addressing-bits issues from the past.
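A back-of-the-envelope check supports that suspicion: 0x2000000000 bytes divided by 512 bytes per sector is 0x10000000 = 2^28 sectors, and 28 bits is exactly the LBA width of the original ATA command set. In other words, the corruption starts precisely at the classic 128 GiB (137 GB) barrier that 48-bit LBA was introduced to overcome.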

There is a posting on the Fedora mailing list that hints that the BIOS used by QEMU might be out of date. The posting is from 2007 and talks about Xen, but I will have a look at the BIOS anyway.

My plan is to install a Linux distro in KVM and run the same test. That would show whether it is a KVM or an OpenSolaris error.

Versions

These are the software versions I used in this test:

$ modinfo kvm
filename: /lib/modules/2.6.28-14-server/extra/kvm.ko
license: GPL
author: Qumranet
version: kvm-kmod-devel-88
srcversion: 55127C6ABBF102EEB757625
depends:
vermagic: 2.6.28-14-server SMP mod_unload modversions
parm: oos_shadow:bool
parm: ignore_msrs:bool
$ modinfo kvm_intel
filename: /lib/modules/2.6.28-14-server/extra/kvm-intel.ko
license: GPL
author: Qumranet
version: kvm-kmod-devel-88
srcversion: 31D33AA0A551656CCEC55E8
depends: kvm
vermagic: 2.6.28-14-server SMP mod_unload modversions
parm: bypass_guest_pf:bool
parm: vpid:bool
parm: flexpriority:bool
parm: ept:bool
parm: unrestricted_guest:bool
parm: emulate_invalid_guest_state:bool
$ ls -la $(which kvm)
lrwxrwxrwx 1 root root 27 2009-08-04 16:16 /usr/bin/kvm -> /usr/bin/qemu-system-x86_64
$ kvm --version
QEMU PC emulator version 0.10.50 (qemu-kvm-devel-88), Copyright (c) 2003-2008 Fabrice Bellard
