2007-01-12 13:33:45

by Michal Piotrowski

[permalink] [raw]
Subject: 2.6.20-rc4-mm1 md problem

My system hangs on this
http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/bug2.jpg
http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/mm-config

Debug plan:
- revert md-* patches
- binary search

Does someone have a better idea?

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/)


2007-01-12 15:51:23

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: 2.6.20-rc4-mm1 md problem

On Friday, 12 January 2007 14:33, Michal Piotrowski wrote:
> My system hangs on this
> http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/bug2.jpg
> http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/mm-config
>
> Debug plan:
> - revert md-* patches
> - binary search
>
> Does someone have a better idea?

Revert git-block.patch and related stuff?

Greetings,
Rafael


--
If you don't have the time to read,
you don't have the time or the tools to write.
- Stephen King

2007-01-12 17:40:36

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-rc4-mm1 md problem

On 12/01/07, Rafael J. Wysocki <[email protected]> wrote:
> On Friday, 12 January 2007 14:33, Michal Piotrowski wrote:
> > My system hangs on this
> > http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/bug2.jpg
> > http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/mm-config
> >
> > Debug plan:
> > - revert md-* patches
> > - binary search
> >
> > Does someone have a better idea?
>
> Revert git-block.patch and related stuff?

Indeed.

GOOD
#
##git-sym2.patch
#git-scsi-target.patch
#git-scsi-target-fixup.patch
#
git-block.patch
git-block-fixup.patch
BAD

git-block.patch it's huge patch.

diffstat git-block.patch
[..]
drivers/md/bitmap.c | 1
drivers/md/dm-emc.c | 2
drivers/md/dm-table.c | 14 -
drivers/md/dm.c | 18 -
drivers/md/dm.h | 1
drivers/md/linear.c | 14 -
drivers/md/md.c | 3
drivers/md/multipath.c | 32 --
drivers/md/raid0.c | 17 -
drivers/md/raid1.c | 70 -----
drivers/md/raid10.c | 73 ------
drivers/md/raid5.c | 60 ----
[..]

I'll do a binary search through those files.

>
> Greetings,
> Rafael

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/)

2007-01-12 17:58:26

by Michal Piotrowski

[permalink] [raw]
Subject: Re: 2.6.20-rc4-mm1 md problem

On 12/01/07, Michal Piotrowski <[email protected]> wrote:
> On 12/01/07, Rafael J. Wysocki <[email protected]> wrote:
> > On Friday, 12 January 2007 14:33, Michal Piotrowski wrote:
> > > My system hangs on this
> > > http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/bug2.jpg
> > > http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/mm-config
> > >
> > > Debug plan:
> > > - revert md-* patches
> > > - binary search
> > >
> > > Does someone have a better idea?
> >
> > Revert git-block.patch and related stuff?
>
> Indeed.
>
> GOOD
> #
> ##git-sym2.patch
> #git-scsi-target.patch
> #git-scsi-target-fixup.patch
> #
> git-block.patch
> git-block-fixup.patch
> BAD
>
> git-block.patch it's huge patch.
>
> diffstat git-block.patch
> [..]
> drivers/md/bitmap.c | 1
> drivers/md/dm-emc.c | 2
> drivers/md/dm-table.c | 14 -
> drivers/md/dm.c | 18 -
> drivers/md/dm.h | 1
> drivers/md/linear.c | 14 -
> drivers/md/md.c | 3
> drivers/md/multipath.c | 32 --
> drivers/md/raid0.c | 17 -
> drivers/md/raid1.c | 70 -----
> drivers/md/raid10.c | 73 ------
> drivers/md/raid5.c | 60 ----
> [..]
>
> I'll do a binary search through those files.
>

That wasn't a good idea.

/mnt/md0/devel/linux-mm/drivers/md/raid1.c: In function 'unplug_slaves':
/mnt/md0/devel/linux-mm/drivers/md/raid1.c:556: error:
'request_queue_t' has no member named 'unplug_fn'

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/)

2007-01-14 22:05:47

by Jens Axboe

[permalink] [raw]
Subject: Re: 2.6.20-rc4-mm1 md problem

On Fri, Jan 12 2007, Michal Piotrowski wrote:
> On 12/01/07, Michal Piotrowski <[email protected]> wrote:
> >On 12/01/07, Rafael J. Wysocki <[email protected]> wrote:
> >> On Friday, 12 January 2007 14:33, Michal Piotrowski wrote:
> >> > My system hangs on this
> >> >
> >http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/bug2.jpg
> >> >
> >http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/mm-config
> >> >
> >> > Debug plan:
> >> > - revert md-* patches
> >> > - binary search
> >> >
> >> > Does someone have a better idea?
> >>
> >> Revert git-block.patch and related stuff?
> >
> >Indeed.
> >
> >GOOD
> >#
> >##git-sym2.patch
> >#git-scsi-target.patch
> >#git-scsi-target-fixup.patch
> >#
> >git-block.patch
> >git-block-fixup.patch
> >BAD
> >
> >git-block.patch it's huge patch.
> >
> >diffstat git-block.patch
> >[..]
> >drivers/md/bitmap.c | 1
> > drivers/md/dm-emc.c | 2
> > drivers/md/dm-table.c | 14 -
> > drivers/md/dm.c | 18 -
> > drivers/md/dm.h | 1
> > drivers/md/linear.c | 14 -
> > drivers/md/md.c | 3
> > drivers/md/multipath.c | 32 --
> > drivers/md/raid0.c | 17 -
> > drivers/md/raid1.c | 70 -----
> > drivers/md/raid10.c | 73 ------
> > drivers/md/raid5.c | 60 ----
> >[..]
> >
> >I'll do a binary search through those files.
> >
>
> That wasn't a good idea.
>
> /mnt/md0/devel/linux-mm/drivers/md/raid1.c: In function 'unplug_slaves':
> /mnt/md0/devel/linux-mm/drivers/md/raid1.c:556: error:
> 'request_queue_t' has no member named 'unplug_fn'

You can't go throught he individual files and expect that to compile,
yet alone work. Fortunately I think I already fixed this last week,
unfortunately I think I forgot to push the #plug branch for #for-akpm so
that Andrew automatically picks it up...

--
Jens Axboe

2007-01-15 07:22:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.6.20-rc4-mm1 md problem


* Michal Piotrowski <[email protected]> wrote:

> My system hangs on this
> http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/bug2.jpg
> http://www.stardust.webpages.pl/files/tbf/euridica/2.6.20-rc4-mm1/mm-config
>
> Debug plan:
> - revert md-* patches
> - binary search
>
> Does someone have a better idea?

Thomas saw something similar yesterday and he the partial results that
git.block (between rc2-mm1 and rc4-mm1) breaks certain disk drivers or
filesystems drivers. For me it worked fine, so it must be only on some
combinations. The changes to ll_rw_block.c look quite extensive.

Ingo

2007-01-15 08:02:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: 2.6.20-rc4-mm1 md problem

On Mon, 2007-01-15 at 08:17 +0100, Ingo Molnar wrote:
> > Debug plan:
> > - revert md-* patches
> > - binary search
> >
> > Does someone have a better idea?
>
> Thomas saw something similar yesterday and he the partial results that
> git.block (between rc2-mm1 and rc4-mm1) breaks certain disk drivers or
> filesystems drivers. For me it worked fine, so it must be only on some
> combinations. The changes to ll_rw_block.c look quite extensive.

Yes. Jens Axboe confirmed yesterday that the plug changes broke RAID.

tglx


2007-01-15 17:57:28

by Thomas Gleixner

[permalink] [raw]
Subject: [patch-mm] Workaround for RAID breakage

On Mon, 2007-01-15 at 09:08 +0100, Thomas Gleixner wrote:
> > Thomas saw something similar yesterday and he the partial results that
> > git.block (between rc2-mm1 and rc4-mm1) breaks certain disk drivers or
> > filesystems drivers. For me it worked fine, so it must be only on some
> > combinations. The changes to ll_rw_block.c look quite extensive.
>
> Yes. Jens Axboe confirmed yesterday that the plug changes broke RAID.

I tracked this down and found two problems:

- The new plug/unplug code does not check for underruns. That allows the
plug count (ioc->plugged) to become negative. This gets triggered from
various places.

AFAICS this is intentional to avoid checks all over the place, but the
underflow check is missing. All we need to do is make sure, that in case
of ioc->plugged == 0 we return early and bug, if there is either a queue
plugged in or the plugged_list is not empty.

Jens ?

- The raid1 code has no bitmap set in remount r/w. So the
pending_bio_list gets not processed for quite a time. The workaround is
to kick mddev->thread, so the list is processed. Not sure about that.

Neil ?

At least it boots and behaves normal.

tglx


Index: linux-2.6.20-rc4-mm1/block/ll_rw_blk.c
===================================================================
--- linux-2.6.20-rc4-mm1.orig/block/ll_rw_blk.c
+++ linux-2.6.20-rc4-mm1/block/ll_rw_blk.c
@@ -3757,6 +3757,12 @@ void blk_unplug_current(void)
if (!ioc)
return;

+ if (!ioc->plugged) {
+ BUG_ON(!list_empty(&ioc->plugged_list));
+ BUG_ON(ioc->plugged_queue);
+ return;
+ }
+
ioc->plugged--;
if (ioc->plugged)
return;
Index: linux-2.6.20-rc4-mm1/drivers/md/raid1.c
===================================================================
--- linux-2.6.20-rc4-mm1.orig/drivers/md/raid1.c
+++ linux-2.6.20-rc4-mm1/drivers/md/raid1.c
@@ -897,7 +897,7 @@ static int make_request(request_queue_t

spin_unlock_irqrestore(&conf->device_lock, flags);

- if (do_sync)
+ if (do_sync || !bitmap)
md_wakeup_thread(mddev->thread);
#if 0
while ((bio = bio_list_pop(&bl)) != NULL)



2007-01-15 20:17:44

by Michal Piotrowski

[permalink] [raw]
Subject: Re: [patch-mm] Workaround for RAID breakage

On 15/01/07, Thomas Gleixner <[email protected]> wrote:
> On Mon, 2007-01-15 at 09:08 +0100, Thomas Gleixner wrote:
> > > Thomas saw something similar yesterday and he the partial results that
> > > git.block (between rc2-mm1 and rc4-mm1) breaks certain disk drivers or
> > > filesystems drivers. For me it worked fine, so it must be only on some
> > > combinations. The changes to ll_rw_block.c look quite extensive.
> >
> > Yes. Jens Axboe confirmed yesterday that the plug changes broke RAID.
>
> I tracked this down and found two problems:
>
> - The new plug/unplug code does not check for underruns. That allows the
> plug count (ioc->plugged) to become negative. This gets triggered from
> various places.
>
> AFAICS this is intentional to avoid checks all over the place, but the
> underflow check is missing. All we need to do is make sure, that in case
> of ioc->plugged == 0 we return early and bug, if there is either a queue
> plugged in or the plugged_list is not empty.
>
> Jens ?
>
> - The raid1 code has no bitmap set in remount r/w. So the
> pending_bio_list gets not processed for quite a time. The workaround is
> to kick mddev->thread, so the list is processed. Not sure about that.
>
> Neil ?
>
> At least it boots and behaves normal.

Yes, it works fine now. Thanks!

> tglx

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/)

2007-01-16 00:41:44

by Jens Axboe

[permalink] [raw]
Subject: Re: [patch-mm] Workaround for RAID breakage

On Mon, Jan 15 2007, Thomas Gleixner wrote:
> On Mon, 2007-01-15 at 09:08 +0100, Thomas Gleixner wrote:
> > > Thomas saw something similar yesterday and he the partial results that
> > > git.block (between rc2-mm1 and rc4-mm1) breaks certain disk drivers or
> > > filesystems drivers. For me it worked fine, so it must be only on some
> > > combinations. The changes to ll_rw_block.c look quite extensive.
> >
> > Yes. Jens Axboe confirmed yesterday that the plug changes broke RAID.
>
> I tracked this down and found two problems:
>
> - The new plug/unplug code does not check for underruns. That allows the
> plug count (ioc->plugged) to become negative. This gets triggered from
> various places.
>
> AFAICS this is intentional to avoid checks all over the place, but the
> underflow check is missing. All we need to do is make sure, that in case
> of ioc->plugged == 0 we return early and bug, if there is either a queue
> plugged in or the plugged_list is not empty.
>
> Jens ?

It should not go negative, that would be a bug elsewhere. So it's
interesting if it does, we should definitely put a WARN_ON() check in
there for that.

> - The raid1 code has no bitmap set in remount r/w. So the
> pending_bio_list gets not processed for quite a time. The workaround is
> to kick mddev->thread, so the list is processed. Not sure about that.
>
> Neil ?

Super, thanks for that Thomas! I'll merge it in the plug branch.

--
Jens Axboe

2007-01-16 08:21:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch-mm] Workaround for RAID breakage

On Tue, 2007-01-16 at 11:41 +1100, Jens Axboe wrote:
> > AFAICS this is intentional to avoid checks all over the place, but the
> > underflow check is missing. All we need to do is make sure, that in case
> > of ioc->plugged == 0 we return early and bug, if there is either a queue
> > plugged in or the plugged_list is not empty.
> >
> > Jens ?
>
> It should not go negative, that would be a bug elsewhere. So it's
> interesting if it does, we should definitely put a WARN_ON() check in
> there for that.

Well. All offenders come via queue_sync_plugs(). queue_sync_plugs()
calls blk_unplug_current().

One path which triggers is blk_sync_queue(). This happens e.g. in the
cleanup of the floppy check. There are other call pathes too.

The other is raid md_super_write(). It is not plugged and calls with the
barrier bit set, which executes the unlikely path in __make_request():

if (unlikely(bio_barrier(bio))) {
queue_sync_plugs(q);
spin_lock_irq(q->queue_lock);
goto get_rq;
}

So you either need checks all over the place to avoid calling
blk_unplug_current(), or you prevent the unplug below 0 like I did. The
BUG_ON()s I added should catch any real invalid callers. But it's up to
you.

tglx


2007-01-16 10:08:24

by Jens Axboe

[permalink] [raw]
Subject: Re: [patch-mm] Workaround for RAID breakage

On Tue, Jan 16 2007, Thomas Gleixner wrote:
> On Tue, 2007-01-16 at 11:41 +1100, Jens Axboe wrote:
> > > AFAICS this is intentional to avoid checks all over the place, but the
> > > underflow check is missing. All we need to do is make sure, that in case
> > > of ioc->plugged == 0 we return early and bug, if there is either a queue
> > > plugged in or the plugged_list is not empty.
> > >
> > > Jens ?
> >
> > It should not go negative, that would be a bug elsewhere. So it's
> > interesting if it does, we should definitely put a WARN_ON() check in
> > there for that.
>
> Well. All offenders come via queue_sync_plugs(). queue_sync_plugs()
> calls blk_unplug_current().

Ah, again a problem because the #plug branch wasn't pushed to #for-akpm
in due time. That problem was fixed already :/

> One path which triggers is blk_sync_queue(). This happens e.g. in the
> cleanup of the floppy check. There are other call pathes too.
>
> The other is raid md_super_write(). It is not plugged and calls with the
> barrier bit set, which executes the unlikely path in __make_request():
>
> if (unlikely(bio_barrier(bio))) {
> queue_sync_plugs(q);
> spin_lock_irq(q->queue_lock);
> goto get_rq;
> }
>
> So you either need checks all over the place to avoid calling
> blk_unplug_current(), or you prevent the unplug below 0 like I did. The
> BUG_ON()s I added should catch any real invalid callers. But it's up to
> you.

blk_replug_current_nested() should do it. I'll make sure the branch is
updated for next -mm.

The BUG_ON()'s are indeed correct, they should just be moved further
down the function.

--
Jens Axboe