raid5's 'make_request' function calls generic_make_request on
underlying devices and if we run out of stripe heads, it could end up
waiting for one of those requests to complete.
This is bad as recursive calls to generic_make_request go on a queue
and are not even attempted until make_request completes.

So: don't make any generic_make_request calls in raid5 make_request
until all waiting has been done.  We do this by simply setting
STRIPE_HANDLE instead of calling handle_stripe().

If we need more stripe_heads, raid5d will get called to process the
pending stripe_heads which will call generic_make_request from a
different thread where no deadlock will happen.

This change by itself causes a performance hit.  So add a change so
that raid5_activate_delayed is only called at unplug time, never in
raid5d.  This seems to bring back the performance numbers.  Calling it
in raid5d was sometimes too soon...
Cc: "Dan Williams" <[email protected]>
Signed-off-by: Neil Brown <[email protected]>
### Diffstat output
./drivers/md/raid5.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c 2008-01-11 15:01:14.000000000 +1100
+++ ./drivers/md/raid5.c 2008-01-14 12:24:07.000000000 +1100
@@ -3157,7 +3157,8 @@ static void raid5_activate_delayed(raid5
atomic_inc(&conf->preread_active_stripes);
list_add_tail(&sh->lru, &conf->handle_list);
}
- }
+ } else
+ blk_plug_device(conf->mddev->queue);
}
static void activate_bit_delay(raid5_conf_t *conf)
@@ -3547,7 +3548,7 @@ static int make_request(struct request_q
goto retry;
}
finish_wait(&conf->wait_for_overlap, &w);
- handle_stripe(sh, NULL);
+ set_bit(STRIPE_HANDLE, &sh->state);
release_stripe(sh);
} else {
/* cannot get stripe for read-ahead, just give-up */
@@ -3890,7 +3891,7 @@ static int retry_aligned_read(raid5_con
* During the scan, completed stripes are saved for us by the interrupt
* handler, so that they will not have to wait for our next wakeup.
*/
-static void raid5d (mddev_t *mddev)
+static void raid5d(mddev_t *mddev)
{
struct stripe_head *sh;
raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -3915,12 +3916,6 @@ static void raid5d (mddev_t *mddev)
activate_bit_delay(conf);
}
- if (list_empty(&conf->handle_list) &&
- atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
- !blk_queue_plugged(mddev->queue) &&
- !list_empty(&conf->delayed_list))
- raid5_activate_delayed(conf);
-
while ((bio = remove_bio_from_retry(conf))) {
int ok;
spin_unlock_irq(&conf->device_lock);
On Mon, 14 Jan 2008, NeilBrown wrote:
> [...]
probably doesn't matter, but for the record:
Tested-by: dean gaudet <[email protected]>
this time i tested with internal and external bitmaps and it survived 8h
and 14h resp. under the parallel tar workload i used to reproduce the
hang.
btw this should probably be a candidate for 2.6.22 and .23 stable.
thanks
-dean
On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet <[email protected]> wrote:
> On Mon, 14 Jan 2008, NeilBrown wrote:
>
> >
> > [...]
>
> probably doesn't matter, but for the record:
>
> Tested-by: dean gaudet <[email protected]>
>
> this time i tested with internal and external bitmaps and it survived 8h
> and 14h resp. under the parallel tar workload i used to reproduce the
> hang.
>
> btw this should probably be a candidate for 2.6.22 and .23 stable.
>
hm, Neil said
The first fixes a bug which could make it a candidate for 24-final.
However it is a deadlock that seems to occur very rarely, and has been in
mainline since 2.6.22. So letting it into one more release shouldn't be
a big problem. While the fix is fairly simple, it could have some
unexpected consequences, so I'd rather go for the next cycle.
food fight!
On Tue, 15 Jan 2008, Andrew Morton wrote:
> On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet <[email protected]> wrote:
>
> > On Mon, 14 Jan 2008, NeilBrown wrote:
> >
> > > [...]
> >
> > [...]
> > btw this should probably be a candidate for 2.6.22 and .23 stable.
> >
>
> hm, Neil said
>
> The first fixes a bug which could make it a candidate for 24-final.
> However it is a deadlock that seems to occur very rarely, and has been in
> mainline since 2.6.22. So letting it into one more release shouldn't be
> a big problem. While the fix is fairly simple, it could have some
> unexpected consequences, so I'd rather go for the next cycle.
>
> food fight!
>
heheh.
it's really easy to reproduce the hang without the patch -- i could
hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
i'll try with ext3... Dan's experiences suggest it won't happen with ext3
(or is even more rare), which would explain why this is overall a
rare problem.
but it doesn't result in data loss or permanent system hangups as long
as you can become root and raise the size of the stripe cache...
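For the record, the workaround dean is referring to is the md sysfs
knob; a sketch of it (the array name md0 is an assumption -- substitute
your own from /proc/mdstat -- and the write needs root):

```shell
# Hypothetical array name; check /proc/mdstat for yours.
MD_DEV=md0
KNOB=/sys/block/$MD_DEV/md/stripe_cache_size

if [ -w "$KNOB" ]; then
    cat "$KNOB"          # current number of stripe heads; the default
                         # on these kernels is 256
    echo 1024 > "$KNOB"  # raise it; each stripe head costs roughly one
                         # page per member disk, so size this modestly
else
    echo "no writable $KNOB (not root, or no such array); commands shown for reference"
fi
```

A larger cache makes it much less likely that make_request runs out of
stripe heads and blocks, which is why it papers over the hang.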
so OK i agree with Neil, let's test more... food fight over! :)
-dean
> heheh.
>
> it's really easy to reproduce the hang without the patch -- i could
> hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
> i'll try with ext3... Dan's experiences suggest it won't happen with ext3
> (or is even more rare), which would explain why this is overall a
> rare problem.
>
Hmmm... how rare?
http://marc.info/?l=linux-kernel&m=119461747005776&w=2
There is nothing specific that prevents other filesystems from hitting
it, perhaps XFS is just better at submitting large i/o's. -stable
should get some kind of treatment. I'll take altered performance over
a hung system.
--
Dan
On Wed, 16 Jan 2008 00:09:31 -0700 "Dan Williams" <[email protected]> wrote:
> > [...]
> >
>
> Hmmm... how rare?
>
> http://marc.info/?l=linux-kernel&m=119461747005776&w=2
>
> There is nothing specific that prevents other filesystems from hitting
> it, perhaps XFS is just better at submitting large i/o's. -stable
> should get some kind of treatment. I'll take altered performance over
> a hung system.
We can always target 2.6.25-rc1 then 2.6.24.1 if Neil is still feeling
wimpy.
On Tuesday January 15, [email protected] wrote:
> On Wed, 16 Jan 2008 00:09:31 -0700 "Dan Williams" <[email protected]> wrote:
>
> > > [...]
> >
> > Hmmm... how rare?
> >
> > http://marc.info/?l=linux-kernel&m=119461747005776&w=2
> >
> > There is nothing specific that prevents other filesystems from hitting
> > it, perhaps XFS is just better at submitting large i/o's. -stable
> > should get some kind of treatment. I'll take altered performance over
> > a hung system.
>
> We can always target 2.6.25-rc1 then 2.6.24.1 if Neil is still feeling
> wimpy.
I am feeling wimpy. There've been a few too many raid5 breakages
recently and it is very hard to really judge the performance impact of
this change. I even have a small uncertainty of correctness - could
it still hang in some other way? I don't think so, but this is
complex code...
If it were really common I would have expected more noise on the
mailing list. Sure, there has been some, but not much. However maybe
people are searching the archives and finding the "increase stripe
cache size" trick, and not reporting anything .... seems unlikely
though.
How about we queue it for 2.6.25-rc1 and then about when -rc2 comes
out, we queue it for 2.6.24.y?  Anyone (or any distro) that really
needs it can of course grab the patch themselves...
??
NeilBrown