2003-08-22 05:25:14

by NeilBrown

Subject: Re: md: bug in file raid5.c, line 540 was: Re: Linux 2.4.22-rc1

On Tuesday August 19, [email protected] wrote:
> On Tue, Aug 19, 2003 at 01:26:29PM -0700, Mike Fedyk wrote:
> > Details in dmesg output...
> >
>
> This didn't make it to the list because it was too big.
>
> compressing dmesg output, and here's an excerpt:
>
> At this point md0 had:
>
> md0 : active raid5 hda3[3] hdg3[1] hde3[0]
> 319388032 blocks level 5, 64k chunk, algorithm 0 [3/3] [UU_]
>
> Aug 18 18:29:29 srv-lr2600 kernel: md: trying to hot-add hda3 to md0 ...
> Aug 18 18:29:42 srv-lr2600 kernel: md: trying to hot-add hde3 to md0 ...
> Aug 18 18:29:44 srv-lr2600 kernel: md: trying to hot-add hdg3 to md0 ...
> Aug 18 18:36:25 srv-lr2600 kernel: md: trying to remove hda3 from md0 ...
>
> I thought I did a fail before the remove...
>
> Aug 18 18:36:25 srv-lr2600 kernel: md: cannot remove active disk hda3 from md0 ...
> Aug 18 18:36:34 srv-lr2600 kernel: md: bug in file raid5.c, line 540
> Aug 18 18:36:34 srv-lr2600 kernel:
>
> But why am I getting a bug message?

This bug could happen if you try to fail a device that is not active.
i.e. you do
"mdadm -f /dev/md0 /dev/hda3"
or "raidsetfaulty /dev/md0 /dev/hda3"
when /dev/hda3 is an idle spare or a failed drive that has been
replaced by a spare.
The BUG call can just be removed from that line.
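
As a rough illustration of that suggestion (a standalone model, not the actual raid5.c code), the sketch below treats a "fail" request on an idle spare or an already-failed device as an ordinary case instead of a fatal one. The enum, the function name and the return codes are all hypothetical and exist only for this example.

```c
/* Hypothetical, simplified device states; not the kernel's data structures. */
enum model_dev_state { MODEL_ACTIVE, MODEL_SPARE, MODEL_FAULTY };

/*
 * Mark device idx as faulty.
 * Returns  1 if the state changed (device was active or an idle spare),
 *          0 if the device was already faulty (nothing to do),
 *         -1 if idx does not name a device in the array.
 * No path aborts: a request against a non-active device is not a bug.
 */
int model_set_faulty(enum model_dev_state *devs, int ndevs, int idx)
{
    if (idx < 0 || idx >= ndevs)
        return -1;

    switch (devs[idx]) {
    case MODEL_ACTIVE:
    case MODEL_SPARE:
        devs[idx] = MODEL_FAULTY;
        return 1;
    case MODEL_FAULTY:
        return 0;
    }
    return -1; /* unreachable for valid enum values */
}
```

The point of the sketch is only that each of the three states has a defined outcome, so there is no remaining "cannot happen" branch that would need a BUG call.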

NeilBrown


2003-08-22 18:31:03

by Lars Marowsky-Bree

Subject: Re: md: bug in file raid5.c, line 540 was: Re: Linux 2.4.22-rc1

On 2003-08-22T15:24:46,
Neil Brown <[email protected]> said:

> This bug could happen if you try to fail a device that is not active.

Yes, that's not generally a tested code path in 2.4. When removing the
BUG() statement, also check that all counters get incremented and
decremented correctly, or the next lurking bug will hit you.
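
A minimal sketch of what such a check could look like, written as a standalone model and not taken from the md code. It assumes, for illustration only, that working devices are the active ones plus the spares, and that every device is either working or failed; the struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping model; field names are illustrative only. */
struct md_model_counters {
    int total;   /* all devices known to the array */
    int active;  /* in-sync members */
    int spare;   /* idle spares */
    int working; /* assumed here: active + spare */
    int failed;  /* devices marked faulty */
};

/* True if the counters agree with each other under the assumptions above. */
bool md_model_consistent(const struct md_model_counters *c)
{
    return c->active >= 0 && c->spare >= 0 && c->failed >= 0 &&
           c->working == c->active + c->spare &&
           c->total == c->working + c->failed;
}

/* Failing an active member: three counters move together. */
void md_model_fail_active(struct md_model_counters *c)
{
    c->active--;
    c->working--;
    c->failed++;
}

/* Failing an idle spare: the path that is easy to forget. */
void md_model_fail_spare(struct md_model_counters *c)
{
    c->spare--;
    c->working--;
    c->failed++;
}
```

Calling the consistency check after every transition in a test run is one way to catch a path that updates only some of the counters.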

I fixed that for multipath in 2.4 too, but I can't get around to cleaning
up the patchset *sigh*


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering ever tried. ever failed. no matter.
SuSE Labs try again. fail again. fail better.
Research & Development, SuSE Linux AG -- Samuel Beckett

2003-08-22 21:27:06

by Mike Fedyk

Subject: Re: md: bug in file raid5.c, line 540 was: Re: Linux 2.4.22-rc1

On Fri, Aug 22, 2003 at 05:50:39PM +0200, Lars Marowsky-Bree wrote:
> On 2003-08-22T15:24:46,
> Neil Brown <[email protected]> said:
>
> > This bug could happen if you try to fail a device that is not active.
>
> Yes, that's not generally a tested code path in 2.4. When removing the
> BUG() statement, also check that all counters get incremented and
> decremented correctly, or the next lurking bug will hit you.
>
> I fixed that for multipath in 2.4 too, but I can't get around to cleaning
> up the patchset *sigh*

Then send the patch to someone who can...

Has anyone attempted to create a testbed for md? If not, can you list all
of the legal states and state transitions? I might be able to
cook something up when I get some time.

2003-08-23 17:00:12

by Lars Marowsky-Bree

Subject: Re: md: bug in file raid5.c, line 540 was: Re: Linux 2.4.22-rc1

On 2003-08-22T14:26:59,
Mike Fedyk <[email protected]> said:

> > I fixed that for multipath in 2.4 too, but I can't get around to cleaning
> > up the patchset *sigh*
> Then send the patch to someone who can...

I've repeatedly announced the updated patches at
ftp://ftp.suse.com/pub/people/lmb/md-mp/kernel/ in the past (though the
URL has varied slightly), so feel free to pick them up.

> Has anyone attempted to create a testbed for md?

I've done that for the multipath module. I've put my mp-test.sh at
ftp://ftp.suse.com/pub/people/lmb/md-mp/mp-test.sh, too. It comes
without any warranty or documentation besides the comments in the
script though ;-)

It has helped me tremendously in creating weird situations and making sure
the md module keeps track of all its associated gazillions of counters
'correctly'.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering ever tried. ever failed. no matter.
SuSE Labs try again. fail again. fail better.
Research & Development, SuSE Linux AG -- Samuel Beckett



2003-08-26 17:23:03

by Mike Fedyk

Subject: Re: md: bug in file raid5.c, line 540 was: Re: Linux 2.4.22-rc1

On Sat, Aug 23, 2003 at 05:28:26PM +0200, Lars Marowsky-Bree wrote:
> On 2003-08-22T14:26:59,
> Mike Fedyk <[email protected]> said:
>
> > > I fixed that for multipath in 2.4 too, but I can't get around to cleaning
> > > up the patchset *sigh*
> > Then send the patch to someone who can...
>
> I've repeatedly announced the updated patches at
> ftp://ftp.suse.com/pub/people/lmb/md-mp/kernel/ in the past (though the
> URL has varied slightly), so feel free to pick them up.
>
> > Has anyone attempted to create a testbed for md?
>
> I've done that for the multipath module. I've put my mp-test.sh at
> ftp://ftp.suse.com/pub/people/lmb/md-mp/mp-test.sh, too. It comes
> without any warranty or documentation besides the comments in the
> script though ;-)
>

Is there any way to get it working on one partition, or does it require at
least two backing block devices (actual physical disks) that a bunch
of loop devices point to? (I'm thinking of the raid[15] case.)

2003-08-26 17:33:46

by Lars Marowsky-Bree

Subject: Re: md: bug in file raid5.c, line 540 was: Re: Linux 2.4.22-rc1

On 2003-08-26T10:22:57,
Mike Fedyk <[email protected]> said:

> Is there any way to get it working on one partition, or does it require at
> least two backing block devices (actual physical disks) that a bunch
> of loop devices point to? (I'm thinking of the raid[15] case.)

md will work just fine, although with much reduced performance, if
set up on top of partitions on the same disk. If all you have is a single
physical disk, you can create the loop devices accordingly. For
multipath testing, I have used LVM logical volumes + loop devices to
simulate multiple paths, or used UML and fed it with a bunch of block
devices (LVs or loop devices) from the host.

(The mp-test.sh script actually knows how to create arbitrary numbers of
loop devices for multipath testing, which in turn uncovered a bug in our
loop handling, which axboe took care of...)


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
High Availability & Clustering ever tried. ever failed. no matter.
SuSE Labs try again. fail again. fail better.
Research & Development, SuSE Linux AG -- Samuel Beckett

