2008-03-03 00:17:35

by NeilBrown

Subject: [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.


When handling a read error, we freeze the array to stop any other
IO while attempting to over-write with correct data.

This is done in the raid1d (raid10d) thread and must wait for all
submitted IO to complete (except for requests that failed and are
sitting in the retry queue - these are counted in ->nr_queued and will
stay there during a freeze).

However, write requests need attention from raid1d as bitmap updates
might be required. This can cause a deadlock, as raid1d is waiting for
requests to finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention,
and call it in freeze_array to be sure that we aren't waiting on raid1d.
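
To illustrate the shape of the problem, here is a hypothetical,
much-simplified userspace model of the dependency (pthreads instead of
the kernel primitives; nr_pending and queued loosely mirror fields in
conf_t, everything else is invented for the sketch). Without the flush
inside the wait loop, the freeze blocks forever, because the parked
writes can only complete once this same thread submits them:

/* deadlock_sketch.c - illustrative model only, not kernel code.
 * Build with: cc -pthread deadlock_sketch.c
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wait_barrier = PTHREAD_COND_INITIALIZER;

static int nr_pending;  /* IO submitted but not yet completed */
static int queued;      /* writes parked, awaiting attention from "raid1d" */

/* A write arrives and is parked on the pending list, where it needs
 * attention from raid1d before it can complete. */
static void queue_write(void)
{
        pthread_mutex_lock(&lock);
        nr_pending++;
        queued++;
        pthread_cond_broadcast(&wait_barrier);
        pthread_mutex_unlock(&lock);
}

/* The fix: submitting the parked writes lets them complete, so
 * nr_pending can drop.  Returns 1 if anything was flushed. */
static int flush_pending_writes_model(void)
{
        if (!queued)
                return 0;
        nr_pending -= queued;
        queued = 0;
        return 1;
}

/* freeze_array() as run by raid1d: wait until no IO is outstanding. */
static void freeze_array_model(int with_fix)
{
        pthread_mutex_lock(&lock);
        while (nr_pending != 0) {
                if (with_fix && flush_pending_writes_model())
                        continue;
                pthread_cond_wait(&wait_barrier, &lock);
        }
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        queue_write();          /* a write is parked before the freeze */
        freeze_array_model(1);  /* pass 0 instead and this never returns */
        printf("array frozen, no IO outstanding\n");
        return 0;
}

In the real code the parked writes arrive via make_request() on other
CPUs; the sketch only sets up the state that the freeze then has to get
out of.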

Thanks to "K.Tanaka" <[email protected]> for finding and reporting
this problem.

Cc: "K.Tanaka" <[email protected]>
Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./drivers/md/raid1.c | 62 +++++++++++++++++++++++++++++++++-----------------
./drivers/md/raid10.c | 60 +++++++++++++++++++++++++++++++-----------------
2 files changed, 80 insertions(+), 42 deletions(-)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c 2008-02-22 15:45:35.000000000 +1100
+++ ./drivers/md/raid10.c 2008-02-22 15:45:35.000000000 +1100
@@ -629,7 +629,36 @@ static int raid10_congested(void *data,
return ret;
}

+static int flush_pending_writes(conf_t *conf)
+{
+ /* Any writes that have been queued but are awaiting
+ * bitmap updates get flushed here.
+ * We return 1 if any requests were actually submitted.
+ */
+ int rv = 0;
+
+ spin_lock_irq(&conf->device_lock);

+ if (conf->pending_bio_list.head) {
+ struct bio *bio;
+ bio = bio_list_get(&conf->pending_bio_list);
+ blk_remove_plug(conf->mddev->queue);
+ spin_unlock_irq(&conf->device_lock);
+ /* flush any pending bitmap writes to disk
+ * before proceeding w/ I/O */
+ bitmap_unplug(conf->mddev->bitmap);
+
+ while (bio) { /* submit pending writes */
+ struct bio *next = bio->bi_next;
+ bio->bi_next = NULL;
+ generic_make_request(bio);
+ bio = next;
+ }
+ rv = 1;
+ } else
+ spin_unlock_irq(&conf->device_lock);
+ return rv;
+}
/* Barriers....
* Sometimes we need to suspend IO while we do something else,
* either some resync/recovery, or reconfigure the array.
@@ -720,7 +749,8 @@ static void freeze_array(conf_t *conf)
wait_event_lock_irq(conf->wait_barrier,
conf->barrier+conf->nr_pending == conf->nr_queued+2,
conf->resync_lock,
- raid10_unplug(conf->mddev->queue));
+ ({ flush_pending_writes(conf);
+ raid10_unplug(conf->mddev->queue); }));
spin_unlock_irq(&conf->resync_lock);
}

@@ -892,6 +922,9 @@ static int make_request(struct request_q
blk_plug_device(mddev->queue);
spin_unlock_irqrestore(&conf->device_lock, flags);

+ /* In case raid10d snuck in to freeze_array */
+ wake_up(&conf->wait_barrier);
+
if (do_sync)
md_wakeup_thread(mddev->thread);

@@ -1464,28 +1497,14 @@ static void raid10d(mddev_t *mddev)

for (;;) {
char b[BDEVNAME_SIZE];
- spin_lock_irqsave(&conf->device_lock, flags);
-
- if (conf->pending_bio_list.head) {
- bio = bio_list_get(&conf->pending_bio_list);
- blk_remove_plug(mddev->queue);
- spin_unlock_irqrestore(&conf->device_lock, flags);
- /* flush any pending bitmap writes to disk before proceeding w/ I/O */
- bitmap_unplug(mddev->bitmap);

- while (bio) { /* submit pending writes */
- struct bio *next = bio->bi_next;
- bio->bi_next = NULL;
- generic_make_request(bio);
- bio = next;
- }
- unplug = 1;
+ unplug += flush_pending_writes(conf);

- continue;
- }
-
- if (list_empty(head))
+ spin_lock_irqsave(&conf->device_lock, flags);
+ if (list_empty(head)) {
+ spin_unlock_irqrestore(&conf->device_lock, flags);
break;
+ }
r10_bio = list_entry(head->prev, r10bio_t, retry_list);
list_del(head->prev);
conf->nr_queued--;
@@ -1548,7 +1567,6 @@ static void raid10d(mddev_t *mddev)
}
}
}
- spin_unlock_irqrestore(&conf->device_lock, flags);
if (unplug)
unplug_slaves(mddev);
}

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c 2008-02-22 15:45:35.000000000 +1100
+++ ./drivers/md/raid1.c 2008-02-22 15:45:35.000000000 +1100
@@ -592,6 +592,37 @@ static int raid1_congested(void *data, i
}


+static int flush_pending_writes(conf_t *conf)
+{
+ /* Any writes that have been queued but are awaiting
+ * bitmap updates get flushed here.
+ * We return 1 if any requests were actually submitted.
+ */
+ int rv = 0;
+
+ spin_lock_irq(&conf->device_lock);
+
+ if (conf->pending_bio_list.head) {
+ struct bio *bio;
+ bio = bio_list_get(&conf->pending_bio_list);
+ blk_remove_plug(conf->mddev->queue);
+ spin_unlock_irq(&conf->device_lock);
+ /* flush any pending bitmap writes to
+ * disk before proceeding w/ I/O */
+ bitmap_unplug(conf->mddev->bitmap);
+
+ while (bio) { /* submit pending writes */
+ struct bio *next = bio->bi_next;
+ bio->bi_next = NULL;
+ generic_make_request(bio);
+ bio = next;
+ }
+ rv = 1;
+ } else
+ spin_unlock_irq(&conf->device_lock);
+ return rv;
+}
+
/* Barriers....
* Sometimes we need to suspend IO while we do something else,
* either some resync/recovery, or reconfigure the array.
@@ -681,7 +712,8 @@ static void freeze_array(conf_t *conf)
wait_event_lock_irq(conf->wait_barrier,
conf->barrier+conf->nr_pending == conf->nr_queued+2,
conf->resync_lock,
- raid1_unplug(conf->mddev->queue));
+ ({ flush_pending_writes(conf);
+ raid1_unplug(conf->mddev->queue); }));
spin_unlock_irq(&conf->resync_lock);
}
static void unfreeze_array(conf_t *conf)
@@ -907,6 +939,9 @@ static int make_request(struct request_q
blk_plug_device(mddev->queue);
spin_unlock_irqrestore(&conf->device_lock, flags);

+ /* In case raid1d snuck into freeze_array */
+ wake_up(&conf->wait_barrier);
+
if (do_sync)
md_wakeup_thread(mddev->thread);
#if 0
@@ -1473,28 +1508,14 @@ static void raid1d(mddev_t *mddev)

for (;;) {
char b[BDEVNAME_SIZE];
- spin_lock_irqsave(&conf->device_lock, flags);

- if (conf->pending_bio_list.head) {
- bio = bio_list_get(&conf->pending_bio_list);
- blk_remove_plug(mddev->queue);
- spin_unlock_irqrestore(&conf->device_lock, flags);
- /* flush any pending bitmap writes to disk before proceeding w/ I/O */
- bitmap_unplug(mddev->bitmap);
-
- while (bio) { /* submit pending writes */
- struct bio *next = bio->bi_next;
- bio->bi_next = NULL;
- generic_make_request(bio);
- bio = next;
- }
- unplug = 1;
-
- continue;
- }
+ unplug += flush_pending_writes(conf);

- if (list_empty(head))
+ spin_lock_irqsave(&conf->device_lock, flags);
+ if (list_empty(head)) {
+ spin_unlock_irqrestore(&conf->device_lock, flags);
break;
+ }
r1_bio = list_entry(head->prev, r1bio_t, retry_list);
list_del(head->prev);
conf->nr_queued--;
@@ -1590,7 +1611,6 @@ static void raid1d(mddev_t *mddev)
}
}
}
- spin_unlock_irqrestore(&conf->device_lock, flags);
if (unplug)
unplug_slaves(mddev);
}


2008-03-03 15:55:26

by Andre Noll

Subject: Re: [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.

On 11:17, NeilBrown wrote:

> So we create a new function 'flush_pending_writes' to give that attention,
> and call it in freeze_array to be sure that we aren't waiting on raid1d.

Two minor remarks:

> +static int flush_pending_writes(conf_t *conf)
> +{
> + /* Any writes that have been queued but are awaiting
> + * bitmap updates get flushed here.
> + * We return 1 if any requests were actually submitted.
> + */
> + int rv = 0;
> +
> + spin_lock_irq(&conf->device_lock);
>
> + if (conf->pending_bio_list.head) {
> + struct bio *bio;
> + bio = bio_list_get(&conf->pending_bio_list);
> + blk_remove_plug(conf->mddev->queue);
> + spin_unlock_irq(&conf->device_lock);
> + /* flush any pending bitmap writes to disk
> + * before proceeding w/ I/O */
> + bitmap_unplug(conf->mddev->bitmap);
> +
> + while (bio) { /* submit pending writes */
> + struct bio *next = bio->bi_next;
> + bio->bi_next = NULL;
> + generic_make_request(bio);
> + bio = next;
> + }
> + rv = 1;
> + } else
> + spin_unlock_irq(&conf->device_lock);
> + return rv;
> +}

Do we really need to take the spin lock in the common case where
conf->pending_bio_list.head is NULL? If not, the above could be
optimized to the slightly faster and more readable

struct bio *bio;

if (!conf->pending_bio_list.head)
return 0;
spin_lock_irq(&conf->device_lock);
bio = bio_list_get(&conf->pending_bio_list);
...
spin_unlock_irq(&conf->device_lock);
return 1;
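
Spelled out with the body taken from the patch above (keeping the early
unlock before bitmap_unplug(), so only the initial test moves outside
the lock), the suggested variant would read roughly:

static int flush_pending_writes(conf_t *conf)
{
        struct bio *bio;

        if (!conf->pending_bio_list.head)
                return 0;

        spin_lock_irq(&conf->device_lock);
        bio = bio_list_get(&conf->pending_bio_list);
        blk_remove_plug(conf->mddev->queue);
        spin_unlock_irq(&conf->device_lock);

        /* flush any pending bitmap writes to disk
         * before proceeding w/ I/O */
        bitmap_unplug(conf->mddev->bitmap);

        while (bio) { /* submit pending writes */
                struct bio *next = bio->bi_next;
                bio->bi_next = NULL;
                generic_make_request(bio);
                bio = next;
        }
        return 1;
}

Whether the unlocked test is safe is exactly what the rest of the
thread discusses.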


> diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
> --- .prev/drivers/md/raid1.c 2008-02-22 15:45:35.000000000 +1100
> +++ ./drivers/md/raid1.c 2008-02-22 15:45:35.000000000 +1100
> @@ -592,6 +592,37 @@ static int raid1_congested(void *data, i
> }
>
>
> +static int flush_pending_writes(conf_t *conf)
> +{

[snip]

Any chance to avoid this code duplication?

Regards
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2008-03-04 06:09:18

by NeilBrown

Subject: Re: [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.

On Monday March 3, [email protected] wrote:
> On 11:17, NeilBrown wrote:
>
> > So we create a new function 'flush_pending_writes' to give that attention,
> > and call it in freeze_array to be sure that we aren't waiting on raid1d.
>
> Two minor remarks:

Thanks for looking.

>
> Do we really need to take the spin lock in the common case where
> conf->pending_bio_list.head is NULL? If not, the above could be
> optimized to the slightly faster and better readable
>
> struct bio *bio;
>
> if (!conf->pending_bio_list.head)
> return 0;
> spin_lock_irq(&conf->device_lock);
> bio = bio_list_get(&conf->pending_bio_list);
> ...
> spin_unlock_irq(&conf->device_lock);
> return 1;

Maybe... If I write to a memory location inside a spinlock, then after
the spinlock is dropped, read that location on a different CPU,
am I always guaranteed to see the new value? Or do I need some sort of
memory barrier?
If I could be clear on that (and memory-barriers.txt isn't
helping... maybe if I read it 7 more times) then we could probably
make the change you suggest.

>
>
> > diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
> > --- .prev/drivers/md/raid1.c 2008-02-22 15:45:35.000000000 +1100
> > +++ ./drivers/md/raid1.c 2008-02-22 15:45:35.000000000 +1100
> > @@ -592,6 +592,37 @@ static int raid1_congested(void *data, i
> > }
> >
> >
> > +static int flush_pending_writes(conf_t *conf)
> > +{
>
> [snip]
>
> Any chance to avoid this code duplication?

Not really.
The "conf_t" in each case is a very different conf_t that just happens
to have the same name and some common fields.
There is actually quite a lot of structural similarity between raid1
and raid10, but I don't think there is much to be gained by trying to
share code there.

Thanks,
NeilBrown

2008-03-04 11:30:28

by Andre Noll

Subject: Re: [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.

On 17:08, Neil Brown wrote:
> > Do we really need to take the spin lock in the common case where
> > conf->pending_bio_list.head is NULL? If not, the above could be
> > optimized to the slightly faster and better readable
> >
> > struct bio *bio;
> >
> > if (!conf->pending_bio_list.head)
> > return 0;
> > spin_lock_irq(&conf->device_lock);
> > bio = bio_list_get(&conf->pending_bio_list);
> > ...
> > spin_unlock_irq(&conf->device_lock);
> > return 1;
>
> Maybe... If I write to a memory location inside a spinlock, then after
> the spinlock is dropped, read that location on a different CPU,
> am I always guaranteed to see the new value? Or do I need some sort of
> memory barrier?

Are you worried about another CPU setting conf->pending_bio_list.head
to != NULL after the if statement? If that's an issue I think also
the original patch is problematic because the same might happen after
the final spin_unlock_irq() but before flush_pending_writes()
returns zero.

Andre
--
The only person who always got his work done by Friday was Robinson Crusoe



2008-03-06 03:30:05

by NeilBrown

Subject: Re: [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.

On Tuesday March 4, [email protected] wrote:
> On 17:08, Neil Brown wrote:
> > > Do we really need to take the spin lock in the common case where
> > > conf->pending_bio_list.head is NULL? If not, the above could be
> > > optimized to the slightly faster and better readable
> > >
> > > struct bio *bio;
> > >
> > > if (!conf->pending_bio_list.head)
> > > return 0;
> > > spin_lock_irq(&conf->device_lock);
> > > bio = bio_list_get(&conf->pending_bio_list);
> > > ...
> > > spin_unlock_irq(&conf->device_lock);
> > > return 1;
> >
> > Maybe... If I write to a memory location inside a spinlock, then after
> > the spinlock is dropped, read that location on a different CPU,
> > am I always guaranteed to see the new value? Or do I need some sort of
> > memory barrier?
>
> Are you worried about another CPU setting conf->pending_bio_list.head
> to != NULL after the if statement? If that's an issue I think also
> the original patch is problematic because the same might happen after
> the final spin_unlock_irq() but before flush_pending_writes()
> returns zero.

No. I'm worried that another CPU might set
conf->pending_bio_list.head *before* the if statement, but it isn't
seen by this CPU because of the lack of memory barriers. The spinlock
ensures that the memory state is consistent.
It is possible that I am being overcautious. But I think that is
better than the alternative.

Thanks,
NeilBrown

2008-03-06 10:52:59

by Andre Noll

Subject: Re: [PATCH 001 of 9] md: Fix deadlock in md/raid1 and md/raid10 when handling a read error.

On 14:29, Neil Brown wrote:
> > Are you worried about another CPU setting conf->pending_bio_list.head
> > to != NULL after the if statement? If that's an issue I think also
> > the original patch is problematic because the same might happen after
> > the final spin_unlock_irq() but before flush_pending_writes()
> > returns zero.
>
> No. I'm worried that another CPU might set
> conf->pending_bio_list.head *before* the if statement, but it isn't
> seen by this CPU because of the lack of memory barriers. The spinlock
> ensures that the memory state is consistent.

But is that enough to avoid the deadlock? I think the following
scenario would be possible with the code in the original patch:

// suppose conf->pending_bio_list.head == NULL ATM

CPU0:
        int rv = 0;
        spin_lock_irq(&conf->device_lock);
        if (conf->pending_bio_list.head)        // false
                ...
        else
                spin_unlock_irq(&conf->device_lock);

CPU1:
        conf->pending_bio_list.head = something;

CPU0:
        return rv;      // zero

Andre
--
The only person who always got his work done by Friday was Robinson Crusoe

