2011-06-29 07:20:37

by Ulrich Windl

Subject: nested block devices (partitioned RAID with LVM): where Linux sucks ;-)

Hi!

I decided to write this to the general kernel list instead of sending to the more specific lists, as this seems to be a collaboration issue:

For SLES11 SP1 (x86_64) I had configured an MD-RAID1 (0.9 superblock) on multipathed SAN devices (the latter should not be important). Then I partitioned the RAID, and one partition was used as a PV for LVM. A VG had been created with LVs in it. Filesystems created, populated, etc.

The RAID device was being used as boot disk for XEN VMs. Everything worked fine until the host machine was rebooted.

(Note: The mdadm command (mdadm - v3.0.3 - 22nd October 2009) has several mis-features in its error reporting)

The RAIDs couldn't be assembled with errors like this:
mdadm: /dev/disk/by-id/dm-name-whatever-E1 has wrong uuid.
mdadm: /dev/disk/by-id/dm-name-whatever-E2 has wrong uuid.

However:
# mdadm --examine /dev/disk/by-id/dm-name-whatever-E1 |grep -i uuid
UUID : 2861aad0:228a48bc:f93e96a3:b6fdd813 (local to host host)
# mdadm --examine /dev/disk/by-id/dm-name-whatever-E2 |grep -i uuid
UUID : 2861aad0:228a48bc:f93e96a3:b6fdd813 (local to host host)

Only when calling "mdadm -v -A /dev/md1" are there more reasonable messages like:
mdadm: cannot open device /dev/disk/by-id/dm-name-whatever-E1: Device or resource busy

Now the questions are: "Why is the device busy?" and "Who is holding the device busy?"
Unfortunately (and here's a problem), neither "lsof" nor "fuser" could tell. That gave me a big headache.

Digging further into the verbose output of "mdadm", I found lines like this:
mdadm: no recogniseable superblock on /dev/disk/by-id/dm-name-whatever-E2_part5
mdadm: /dev/disk/by-id/dm-name-whatever-E2_part5 has wrong uuid.
mdadm: cannot open device /dev/disk/by-id/dm-name-whatever-E2_part2: Device or resource busy
mdadm: /dev/disk/by-id/dm-name-whatever-E2_part2 has wrong uuid.
mdadm: no recogniseable superblock on /dev/disk/by-id/dm-name-whatever-E2_part1
mdadm: /dev/disk/by-id/dm-name-whatever-E2_part1 has wrong uuid.
mdadm: cannot open device /dev/disk/by-id/dm-name-whatever-E2: Device or resource busy
mdadm: /dev/disk/by-id/dm-name-whatever-E2 has wrong uuid.

So mdadm is considering partitions as well. I guessed that the activated partitions might keep the "parent device" busy, so I tried "kpartx -vd /dev/disk/by-id/dm-name-whatever-E2", but that did nothing (and gave no error message).

Then I suspected LVM could activate the PV in partition 5. I tried to deactivate LVM on the device, but that also failed.

At this point I had googled a lot, and the kernel boot parameter "nodmraid" did not help either.

In a state of despair I decided to zap away the partition table temporarily:
# sfdisk -d /dev/disk/by-id/dm-name-whatever-E1 >E1 ## Backup
# sfdisk -d /dev/disk/by-id/dm-name-whatever-E2 >E2 ## Backup
# dd if=/dev/zero bs=512 count=1 of=/dev/disk/by-id/dm-name-whatever-E1
# dd if=/dev/zero bs=512 count=1 of=/dev/disk/by-id/dm-name-whatever-E2

Then I logically disconnected the SAN disks and reconnected them (via some /sys magic).

Then the RAID devices could be assembled again! This demonstrates that:
1) The original error message of mdadm about a wrong UUID is completely wrong ("device busy" would have been correct)
2) partitions on unassembled raid legs are activated before the RAID is assembled, effectively preventing a RAID assembly (I could not find out how to fix/prevent this)

After that I restored the saved partition table to the RAID(!) device (as it had been done originally).
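
(Roughly, for anyone repeating this: the restore is just the saved dump fed back in, assuming the assembled array shows up as /dev/md1 again:
# sfdisk /dev/md1 <E1
# blockdev --rereadpt /dev/md1   ## make the kernel re-read the restored table
)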

I haven't studied the block data structures, but obviously the RAID metadata is not at the start of the devices. If it were, a partition table would not have been found, and the RAID could have been assembled without a problem.

I'm not subscribed to the kernel list, so please CC your replies! Thanks!

I'm sending this message to make developers aware of the problem, and possibly to help normal users find this solution via Google.

Regards,
Ulrich Windl
P.S. Novell Support was not able to provide a solution for this problem in time


2011-06-29 09:46:16

by martin f krafft

Subject: Re: nested block devices (partitioned RAID with LVM): where Linux sucks ;-)

also sprach Ulrich Windl <[email protected]> [2011.06.29.0914 +0200]:
> 1) The original error message of mdadm about a wrong UUID is
> completely wrong ("device busy" would have been correct)

Correct. It would be nice if you could file a bug about this in your
distro's bug tracker (or Debian's).

> 2) partitions on unassembled raid legs are activated before the
> RAID is assembled, effectively preventing a RAID assembly (I could
> not find out how to fix/prevent this)

I think you will find that LVM snatched the PV before mdadm had
a chance, hence it was busy. This is a common problem with LVM and
RAID1: LVM (also) scans all devices, and because RAID1 is merely
a mirroring setup, LVM can use either of the components just as
well.
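
One way to verify this (device names adjusted to your setup, of
course):

  # pvs -o pv_name,vg_name,pv_uuid

If the PV is reported on one of the dm-name-*_part devices rather
than on a partition of the assembled array, then LVM got there
first.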

The solution? I thought that there was a patch to LVM that prevented
it from using RAID members. But if there isn't, then exclude the
devices from its scan.
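
For example, if your PVs only ever live on md devices, something
like this in the devices section of /etc/lvm/lvm.conf should do
(a sketch, untested on SLES; note that the initrd usually carries
its own copy of lvm.conf, so rebuild it afterwards):

  devices {
      # accept PVs on md devices only, reject everything else
      filter = [ "a|^/dev/md.*|", "r|.*|" ]
  }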

… and before you claim that Linux sucks, consider how a computer
should do it differently. I think you will find that every
implementation has to make certain hard assumptions (for it lacks
human abstraction and combination abilities), and your current
griefs make you feel like Linux does it worst of all. Rest assured:
it doesn't. It makes assumptions, but you will find that they are
quite sensible.

> I haven't studied the block data structures, but obviously the
> RAID metadata is not at the start of the devices. If they were,
> a partition table would not be found, and the RAID could have been
> assembled without a problem.

You are right, the metadata are not at the start. This is by design.

> I'm sending this message to make developers aware of the problem,

(in which case it might be wise to avoid exclaiming things like
"linux sucks"…)

> P.S. Novell Support was not able to provide a solution for this problem in time

News at 11…

--
martin | http://madduck.net/ | http://two.sentenc.es/

"no, 'eureka' is greek for 'this bath is too hot.'"
-- dr. who

spamtraps: [email protected]



2011-06-29 15:00:00

by Phil Turmel

Subject: Re: nested block devices (partitioned RAID with LVM): where Linux sucks ;-)

[Added linux-raid, where this should have gone in the first place.]

Note: I'm somewhat less polite than usual, considering the subject line and tone.

On 06/29/2011 03:14 AM, Ulrich Windl wrote:
> Hi!
>
> I decided to write this to the general kernel list instead of sending to the more specific lists, as this seems to be a collaboration issue:

There's nothing in your report about general kernel development, sorry. Doesn't seem to be a collaboration issue, either. A distribution issue, perhaps.

> For SLES11 SP1 (x86_64) I had configured an MD-RAID1 (0.9 superblock) on multipathed SAN devices (the latter should not be important). Then I partitioned the RAID, and one partition was used as a PV for LVM. A VG had been created with LVs in it. Filesystems created, populated, etc.

> The RAID device was being used as boot disk for XEN VMs. Everything worked fine until the host machine was rebooted.

I hope you didn't put it in production without testing.

> (Note: The mdadm command (mdadm - v3.0.3 - 22nd October 2009) has several mis-features in its error reporting)

Indeed, 2-1/2 years in the open source world closes many bugs. Please retest with current kernel, udev, mdadm, and LVM.

FWIW, the default metadata was changed to v1.1 in November of 2009, and later to v1.2. Either would have avoided your problems.
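
If you rebuild the array, you can request the newer format explicitly; a sketch using the device names from your report (careful: --create is destructive):

  # mdadm --create /dev/md1 --metadata=1.2 --level=1 --raid-devices=2 \
        /dev/disk/by-id/dm-name-whatever-E1 /dev/disk/by-id/dm-name-whatever-E2
  # mdadm --examine /dev/disk/by-id/dm-name-whatever-E1 | grep -i version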

> The RAIDs couldn't be assembled with errors like this:
> mdadm: /dev/disk/by-id/dm-name-whatever-E1 has wrong uuid.
> mdadm: /dev/disk/by-id/dm-name-whatever-E2 has wrong uuid.
>
> However:
> # mdadm --examine /dev/disk/by-id/dm-name-whatever-E1 |grep -i uuid
> UUID : 2861aad0:228a48bc:f93e96a3:b6fdd813 (local to host host)
> # mdadm --examine /dev/disk/by-id/dm-name-whatever-E2 |grep -i uuid
> UUID : 2861aad0:228a48bc:f93e96a3:b6fdd813 (local to host host)
>
> Only when calling "mdadm -v -A /dev/md1" are there more reasonable messages like:
> mdadm: cannot open device /dev/disk/by-id/dm-name-whatever-E1: Device or resource busy
>
> Now the questions are: "Why is the device busy?" and "Who is holding the device busy?"
> Unfortunately (and here's a problem), neither "lsof" nor "fuser" could tell. That gave me a big headache.

Stacked devices have been this way forever. There's no process holding the device. The kernel's got it internally.
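
If you want to see who has it, the kernel exposes the stacking in sysfs, and device-mapper can show it too:

  # ls /sys/block/dm-*/holders/    ## which device sits on top of which
  # dmsetup ls --tree              ## the whole device-mapper stack at a glance

lsof and fuser only see userspace openers, not in-kernel claims.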

> Digging further into the verbose output of "mdadm", I found lines like this:
> mdadm: no recogniseable superblock on /dev/disk/by-id/dm-name-whatever-E2_part5
> mdadm: /dev/disk/by-id/dm-name-whatever-E2_part5 has wrong uuid.
> mdadm: cannot open device /dev/disk/by-id/dm-name-whatever-E2_part2: Device or resource busy
> mdadm: /dev/disk/by-id/dm-name-whatever-E2_part2 has wrong uuid.
> mdadm: no recogniseable superblock on /dev/disk/by-id/dm-name-whatever-E2_part1
> mdadm: /dev/disk/by-id/dm-name-whatever-E2_part1 has wrong uuid.
> mdadm: cannot open device /dev/disk/by-id/dm-name-whatever-E2: Device or resource busy
> mdadm: /dev/disk/by-id/dm-name-whatever-E2 has wrong uuid.
>
> So mdadm is considering partitions as well. I guessed that the activated partitions might keep the "parent device" busy, so I tried "kpartx -vd /dev/disk/by-id/dm-name-whatever-E2", but that did nothing (and gave no error message).

Without instructions otherwise, both mdadm and LVM consider every block device.

The man-page for mdadm.conf describes how to filter the devices to consider. Did you read about this?
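
For illustration only, using the names from your report (the config file path may differ by distro):

  # /etc/mdadm.conf
  DEVICE /dev/disk/by-id/dm-name-whatever-E1 /dev/disk/by-id/dm-name-whatever-E2
  ARRAY /dev/md1 UUID=2861aad0:228a48bc:f93e96a3:b6fdd813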

The man-page for lvm.conf describes how to filter devices to consider, including a setting called "md_component_detection". Did you read about this?
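
For reference, a sketch; md_component_detection defaults to on, but with 0.90 metadata at the end of the component it may not catch a PV sitting inside a partition of that component, which is presumably what bit you here:

  devices {
      md_component_detection = 1
      # plus a filter line rejecting the raw multipath devices and their partitions
  }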

> Then I suspected LVM could activate the PV in partition 5. I tried to deactivate LVM on the device, but that also failed.
>
> At this point I had googled a lot, and the kernel boot parameter "nodmraid" did not help either.
>
> In a state of despair I decided to zap away the partition table temporarily:
> # sfdisk -d /dev/disk/by-id/dm-name-whatever-E1 >E1 ## Backup
> # sfdisk -d /dev/disk/by-id/dm-name-whatever-E2 >E2 ## Backup
> # dd if=/dev/zero bs=512 count=1 of=/dev/disk/by-id/dm-name-whatever-E1
> # dd if=/dev/zero bs=512 count=1 of=/dev/disk/by-id/dm-name-whatever-E2
>
> Then I logically disconnected the SAN disks and reconnected them (via some /sys magic).
>
> Then the RAID devices could be assembled again! This demonstrates that:
> 1) The original error message of mdadm about a wrong UUID is completely wrong ("device busy" would have been correct)
> 2) partitions on unassembled raid legs are activated before the RAID is assembled, effectively preventing a RAID assembly (I could not find out how to fix/prevent this)
>
> After that I restored the saved partition table to the RAID(!) device (as it had been done originally).
>
> I haven't studied the block data structures, but obviously the RAID metadata is not at the start of the devices. If it were, a partition table would not have been found, and the RAID could have been assembled without a problem.

The metadata placement for the various versions is well documented in the man pages. Metadata versions 1.1 and 1.2 are at the beginning of the device, for this very reason, among others.

> I'm not subscribed to the kernel list, so please CC your replies! Thanks!
>
> I'm sending this message to make developers aware of the problem, and possibly to help normal users find this solution via Google.

Developers dealt with these use-cases more than a year ago, and pushed the fixes out in the normal way. You used old tools to set up a system. They could have been configured to deal with your use-case. This is your problem, or your distribution's problem.

> Regards,
> Ulrich Windl
> P.S. Novell Support was not able to provide a solution for this problem in time

"in time" ? So, you *did* put an untested system into production. And you are rude to the volunteers who might help? Not a good start.

Phil