2017-03-03 05:14:44

by NeilBrown

Subject: [PATCH] blk: improve order of bio handling in generic_make_request()


[ Hi Jens,
you might have seen assorted email threads recently about
deadlocks, particularly in dm-snap or md/raid1/10. Also about
the excess of rescuer threads.
I think a big part of the problem is my ancient improvement
to generic_make_request to queue bios and handle them in
a strict FIFO order. As described below, that can cause
problems which individual block devices cannot fix themselves
without punting to various threads.
This patch does not fix everything, but provides a basis that
drivers can build on to create deadlock-free solutions without
excess threads.
If you accept this, I will look into improving at least md
and bio_alloc_set() to be less dependent on rescuer threads.
Thanks,
NeilBrown
]


To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially. If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.

These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, is not enough to remove the deadlocks. It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocating from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.
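
For illustration, a make_request_fn following this pattern might look
roughly like the sketch below. It is modelled on the existing
bio_split()/bio_chain() usage in blk_queue_split(); my_make_request(),
my_max_sectors(), my_handle_bio() and my_split_set are hypothetical
driver hooks, not part of this patch.

/* hypothetical driver hooks, for illustration only */
extern struct bio_set *my_split_set;
extern unsigned int my_max_sectors(struct request_queue *q, struct bio *bio);
extern blk_qc_t my_handle_bio(struct request_queue *q, struct bio *bio);

static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
{
	unsigned int max = my_max_sectors(q, bio);

	if (bio_sectors(bio) > max) {
		/* 'split' is the first part; 'bio' becomes the remainder */
		struct bio *split = bio_split(bio, max, GFP_NOIO, my_split_set);

		bio_chain(split, bio);
		/* queue the remainder; generic_make_request() will only
		 * process it after everything submitted on behalf of
		 * 'split' has been completely handled
		 */
		generic_make_request(bio);
		bio = split;
	}
	/* handle the (possibly trimmed) first part, then return */
	return my_handle_bio(q, bio);
}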

With this in place, it should be possible to disable the
punt_bios_to_rescuer() rescuer thread for many block devices, and
eventually it may be possible to remove it completely.

Tested-by: Jinpu Wang <[email protected]>
Inspired-by: Lars Ellenberg <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
block/blk-core.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b9e857f4afe8..ef55f210dd7c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
struct request_queue *q = bdev_get_queue(bio->bi_bdev);

if (likely(blk_queue_enter(q, false) == 0)) {
+ struct bio_list hold;
+ struct bio_list lower, same;
+
+ /* Create a fresh bio_list for all subordinate requests */
+ bio_list_init(&hold);
+ bio_list_merge(&hold, &bio_list_on_stack);
+ bio_list_init(&bio_list_on_stack);
ret = q->make_request_fn(q, bio);

blk_queue_exit(q);

+ /* sort new bios into those for a lower level
+ * and those for the same level
+ */
+ bio_list_init(&lower);
+ bio_list_init(&same);
+ while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+ if (q == bdev_get_queue(bio->bi_bdev))
+ bio_list_add(&same, bio);
+ else
+ bio_list_add(&lower, bio);
+ /* now assemble so we handle the lowest level first */
+ bio_list_merge(&bio_list_on_stack, &lower);
+ bio_list_merge(&bio_list_on_stack, &same);
+ bio_list_merge(&bio_list_on_stack, &hold);
+
bio = bio_list_pop(current->bio_list);
} else {
struct bio *bio_next = bio_list_pop(current->bio_list);
--
2.11.0



2017-03-03 09:29:58

by Jack Wang

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()



On 03.03.2017 06:14, NeilBrown wrote:
>
> [ Hi Jens,
> you might have seen assorted email threads recently about
> deadlocks, particular in dm-snap or md/raid1/10. Also about
> the excess of rescuer threads.
> I think a big part of the problem is my ancient improvement
> to generic_make_request to queue bios and handle them in
> a strict FIFO order. As described below, that can cause
> problems which individual block devices cannot fix themselves
> without punting to various threads.
> This patch does not fix everything, but provides a basis that
> drives can build on to create dead-lock free solutions without
> excess threads.
> If you accept this, I will look into improving at least md
> and bio_alloc_set() to be less dependant on rescuer threads.
> Thanks,
> NeilBrown
> ]
>
>
> To avoid recursion on the kernel stack when stacked block devices
> are in use, generic_make_request() will, when called recursively,
> queue new requests for later handling. They will be handled when the
> make_request_fn for the current bio completes.
>
> If any bios are submitted by a make_request_fn, these will ultimately
> handled seqeuntially. If the handling of one of those generates
> further requests, they will be added to the end of the queue.
>
> This strict first-in-first-out behaviour can lead to deadlocks in
> various ways, normally because a request might need to wait for a
> previous request to the same device to complete. This can happen when
> they share a mempool, and can happen due to interdependencies
> particular to the device. Both md and dm have examples where this happens.
>
> These deadlocks can be erradicated by more selective ordering of bios.
> Specifically by handling them in depth-first order. That is: when the
> handling of one bio generates one or more further bios, they are
> handled immediately after the parent, before any siblings of the
> parent. That way, when generic_make_request() calls make_request_fn
> for some particular device, it we can be certain that all previously
> submited request for that device have been completely handled and are
> not waiting for anything in the queue of requests maintained in
> generic_make_request().
>
> An easy way to achieve this would be to use a last-in-first-out stack
> instead of a queue. However this will change the order of consecutive
> bios submitted by a make_request_fn, which could have unexpected consequences.
> Instead we take a slightly more complex approach.
> A fresh queue is created for each call to a make_request_fn. After it completes,
> any bios for a different device are placed on the front of the main queue, followed
> by any bios for the same device, followed by all bios that were already on
> the queue before the make_request_fn was called.
> This provides the depth-first approach without reordering bios on the same level.
>
> This, by itself, it not enough to remove the deadlocks. It just makes
> it possible for drivers to take the extra step required themselves.
>
> To avoid deadlocks, drivers must never risk waiting for a request
> after submitting one to generic_make_request. This includes never
> allocing from a mempool twice in the one call to a make_request_fn.
>
> A common pattern in drivers is to call bio_split() in a loop, handling
> the first part and then looping around to possibly split the next part.
> Instead, a driver that finds it needs to split a bio should queue
> (with generic_make_request) the second part, handle the first part,
> and then return. The new code in generic_make_request will ensure the
> requests to underlying bios are processed first, then the second bio
> that was split off. If it splits again, the same process happens. In
> each case one bio will be completely handled before the next one is attempted.
>
> With this is place, it should be possible to disable the
> punt_bios_to_recover() recovery thread for many block devices, and
> eventually it may be possible to remove it completely.
>
> Tested-by: Jinpu Wang <[email protected]>
> Inspired-by: Lars Ellenberg <[email protected]>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> block/blk-core.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b9e857f4afe8..ef55f210dd7c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
> struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
> if (likely(blk_queue_enter(q, false) == 0)) {
> + struct bio_list hold;
> + struct bio_list lower, same;
> +
> + /* Create a fresh bio_list for all subordinate requests */
> + bio_list_init(&hold);
> + bio_list_merge(&hold, &bio_list_on_stack);
> + bio_list_init(&bio_list_on_stack);
> ret = q->make_request_fn(q, bio);
>
> blk_queue_exit(q);
>
> + /* sort new bios into those for a lower level
> + * and those for the same level
> + */
> + bio_list_init(&lower);
> + bio_list_init(&same);
> + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> + if (q == bdev_get_queue(bio->bi_bdev))
> + bio_list_add(&same, bio);
> + else
> + bio_list_add(&lower, bio);
> + /* now assemble so we handle the lowest level first */
> + bio_list_merge(&bio_list_on_stack, &lower);
> + bio_list_merge(&bio_list_on_stack, &same);
> + bio_list_merge(&bio_list_on_stack, &hold);
> +
> bio = bio_list_pop(current->bio_list);
> } else {
> struct bio *bio_next = bio_list_pop(current->bio_list);
>

Thanks Neil for pushing the fix.

We can optimize generic_make_request a little bit:
- assign the bio_list struct 'hold' directly instead of init and merge
- remove duplicate code

I think it is better to squash this into your fix.
---
block/blk-core.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3bc7202..b29b7e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
struct bio_list lower, same, hold;

/* Create a fresh bio_list for all subordinate requests */
- bio_list_init(&hold);
- bio_list_merge(&hold, &bio_list_on_stack);
+ hold = bio_list_on_stack;
bio_list_init(&bio_list_on_stack);

ret = q->make_request_fn(q, bio);
@@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_merge(&bio_list_on_stack, &lower);
bio_list_merge(&bio_list_on_stack, &same);
bio_list_merge(&bio_list_on_stack, &hold);
-
- bio = bio_list_pop(current->bio_list);
} else {
- struct bio *bio_next = bio_list_pop(current->bio_list);
-
bio_io_error(bio);
- bio = bio_next;
}
+ bio = bio_list_pop(current->bio_list);
} while (bio);
current->bio_list = NULL; /* deactivate */

--

Regards,
Jack

2017-03-06 04:40:47

by NeilBrown

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()

On Fri, Mar 03 2017, Jack Wang wrote:
>
> Thanks Neil for pushing the fix.
>
> We can optimize generic_make_request a little bit:
> - assign bio_list struct hold directly instead init and merge
> - remove duplicate code
>
> I think better to squash into your fix.

Hi Jack,
I don't object to your changes, but I'd like to see a response from
Jens first.
My preference would be to get the original patch in, then other changes
that build on it, such as this one, can be added. Until the core
change lands, any other work is pointless.

Of course if Jens wants this merged before he'll apply it, I'll
happily do that.

Thanks,
NeilBrown



> ---
> block/blk-core.c | 9 ++-------
> 1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3bc7202..b29b7e5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
> struct bio_list lower, same, hold;
>
> /* Create a fresh bio_list for all subordinate requests */
> - bio_list_init(&hold);
> - bio_list_merge(&hold, &bio_list_on_stack);
> + hold = bio_list_on_stack;
> bio_list_init(&bio_list_on_stack);
>
> ret = q->make_request_fn(q, bio);
> @@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
> bio_list_merge(&bio_list_on_stack, &lower);
> bio_list_merge(&bio_list_on_stack, &same);
> bio_list_merge(&bio_list_on_stack, &hold);
> -
> - bio = bio_list_pop(current->bio_list);
> } else {
> - struct bio *bio_next = bio_list_pop(current->bio_list);
> -
> bio_io_error(bio);
> - bio = bio_next;
> }
> + bio = bio_list_pop(current->bio_list);
> } while (bio);
> current->bio_list = NULL; /* deactivate */
>
> --
>
> Regards,
> Jack



2017-03-06 09:47:06

by Jack Wang

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()



On 06.03.2017 05:40, NeilBrown wrote:
> On Fri, Mar 03 2017, Jack Wang wrote:
>>
>> Thanks Neil for pushing the fix.
>>
>> We can optimize generic_make_request a little bit:
>> - assign bio_list struct hold directly instead init and merge
>> - remove duplicate code
>>
>> I think better to squash into your fix.
>
> Hi Jack,
> I don't object to your changes, but I'd like to see a response from
> Jens first.
> My preference would be to get the original patch in, then other changes
> that build on it, such as this one, can be added. Until the core
> changes lands, any other work is pointless.
>
> Of course if Jens wants a this merged before he'll apply it, I'll
> happily do that.
>
> Thanks,
> NeilBrown

Hi Neil,

Totally agree, let's wait for Jens's decision.

Hi Jens,

Please consider this fix also for stable 4.3+

Thanks,
Jack Wang

>
>
>
>> ---
>> block/blk-core.c | 9 ++-------
>> 1 file changed, 2 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 3bc7202..b29b7e5 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>> struct bio_list lower, same, hold;
>>
>> /* Create a fresh bio_list for all subordinate requests */
>> - bio_list_init(&hold);
>> - bio_list_merge(&hold, &bio_list_on_stack);
>> + hold = bio_list_on_stack;
>> bio_list_init(&bio_list_on_stack);
>>
>> ret = q->make_request_fn(q, bio);
>> @@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>> bio_list_merge(&bio_list_on_stack, &lower);
>> bio_list_merge(&bio_list_on_stack, &same);
>> bio_list_merge(&bio_list_on_stack, &hold);
>> -
>> - bio = bio_list_pop(current->bio_list);
>> } else {
>> - struct bio *bio_next = bio_list_pop(current->bio_list);
>> -
>> bio_io_error(bio);
>> - bio = bio_next;
>> }
>> + bio = bio_list_pop(current->bio_list);
>> } while (bio);
>> current->bio_list = NULL; /* deactivate */
>>
>> --
>>
>> Regards,
>> Jack

2017-03-07 02:03:50

by Jens Axboe

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()

On 03/05/2017 09:40 PM, NeilBrown wrote:
> On Fri, Mar 03 2017, Jack Wang wrote:
>>
>> Thanks Neil for pushing the fix.
>>
>> We can optimize generic_make_request a little bit:
>> - assign bio_list struct hold directly instead init and merge
>> - remove duplicate code
>>
>> I think better to squash into your fix.
>
> Hi Jack,
> I don't object to your changes, but I'd like to see a response from
> Jens first.
> My preference would be to get the original patch in, then other changes
> that build on it, such as this one, can be added. Until the core
> changes lands, any other work is pointless.
>
> Of course if Jens wants a this merged before he'll apply it, I'll
> happily do that.

I like the change, and thanks for tackling this. It's been a pending
issue for way too long. I do think we should squash Jack's patch
into the original, as it does clean up the code nicely.

Do we have a proper test case for this, so we can verify that it
does indeed also work in practice?

--
Jens Axboe

2017-03-07 08:51:21

by Jack Wang

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()



On 06.03.2017 21:18, Jens Axboe wrote:
> On 03/05/2017 09:40 PM, NeilBrown wrote:
>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>
>>> Thanks Neil for pushing the fix.
>>>
>>> We can optimize generic_make_request a little bit:
>>> - assign bio_list struct hold directly instead init and merge
>>> - remove duplicate code
>>>
>>> I think better to squash into your fix.
>>
>> Hi Jack,
>> I don't object to your changes, but I'd like to see a response from
>> Jens first.
>> My preference would be to get the original patch in, then other changes
>> that build on it, such as this one, can be added. Until the core
>> changes lands, any other work is pointless.
>>
>> Of course if Jens wants a this merged before he'll apply it, I'll
>> happily do that.
>
> I like the change, and thanks for tackling this. It's been a pending
> issue for way too long. I do think we should squash Jack's patch
> into the original, as it does clean up the code nicely.
>
> Do we have a proper test case for this, so we can verify that it
> does indeed also work in practice?
>
Hi Jens,

I can trigger a deadlock in RAID1 with the test below:

I create one md with one local loop device and one remote scsi device
exported by SRP. I run fio with mixed rw on top of the md, then force_close
the session on the storage side. mdx_raid1 is waiting on free_array in D state,
and a lot of fio processes are also in D state in wait_barrier.

With the patch from Neil above, I can no longer trigger it.

The discussion was in link below:
http://www.spinics.net/lists/raid/msg54680.html

Thanks,
Jack Wang

2017-03-07 15:47:13

by Pavel Machek

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()

On Mon 2017-03-06 10:43:59, Jack Wang wrote:
>
>
> On 06.03.2017 05:40, NeilBrown wrote:
> > On Fri, Mar 03 2017, Jack Wang wrote:
> >>
> >> Thanks Neil for pushing the fix.
> >>
> >> We can optimize generic_make_request a little bit:
> >> - assign bio_list struct hold directly instead init and merge
> >> - remove duplicate code
> >>
> >> I think better to squash into your fix.
> >
> > Hi Jack,
> > I don't object to your changes, but I'd like to see a response from
> > Jens first.
> > My preference would be to get the original patch in, then other changes
> > that build on it, such as this one, can be added. Until the core
> > changes lands, any other work is pointless.
> >
> > Of course if Jens wants a this merged before he'll apply it, I'll
> > happily do that.
> >
> > Thanks,
> > NeilBrown
>
> Hi Neil,
>
> Totally agree, let's wait for Jens's decision.
>
> Hi Jens,
>
> Please consider this fix also for stable 4.3+

Stable? We don't put this into stable, with the exception of minimal fixes
for real bugs...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-03-07 16:21:54

by Jens Axboe

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()

On 03/07/2017 08:46 AM, Pavel Machek wrote:
> On Mon 2017-03-06 10:43:59, Jack Wang wrote:
>>
>>
>> On 06.03.2017 05:40, NeilBrown wrote:
>>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>>
>>>> Thanks Neil for pushing the fix.
>>>>
>>>> We can optimize generic_make_request a little bit:
>>>> - assign bio_list struct hold directly instead init and merge
>>>> - remove duplicate code
>>>>
>>>> I think better to squash into your fix.
>>>
>>> Hi Jack,
>>> I don't object to your changes, but I'd like to see a response from
>>> Jens first.
>>> My preference would be to get the original patch in, then other changes
>>> that build on it, such as this one, can be added. Until the core
>>> changes lands, any other work is pointless.
>>>
>>> Of course if Jens wants a this merged before he'll apply it, I'll
>>> happily do that.
>>>
>>> Thanks,
>>> NeilBrown
>>
>> Hi Neil,
>>
>> Totally agree, let's wait for Jens's decision.
>>
>> Hi Jens,
>>
>> Please consider this fix also for stable 4.3+
>
> Stable? We don't put this into stable, with exception of minimal fixes
> for real bugs...

What are you smoking? How is this not a real bug?

--
Jens Axboe

2017-03-07 17:13:01

by Jens Axboe

Subject: Re: blk: improve order of bio handling in generic_make_request()

On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> On Tue, Mar 07 2017 at 3:49am -0500,
> Jack Wang <[email protected]> wrote:
>
>>
>>
>> On 06.03.2017 21:18, Jens Axboe wrote:
>>> On 03/05/2017 09:40 PM, NeilBrown wrote:
>>>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>>>
>>>>> Thanks Neil for pushing the fix.
>>>>>
>>>>> We can optimize generic_make_request a little bit:
>>>>> - assign bio_list struct hold directly instead init and merge
>>>>> - remove duplicate code
>>>>>
>>>>> I think better to squash into your fix.
>>>>
>>>> Hi Jack,
>>>> I don't object to your changes, but I'd like to see a response from
>>>> Jens first.
>>>> My preference would be to get the original patch in, then other changes
>>>> that build on it, such as this one, can be added. Until the core
>>>> changes lands, any other work is pointless.
>>>>
>>>> Of course if Jens wants a this merged before he'll apply it, I'll
>>>> happily do that.
>>>
>>> I like the change, and thanks for tackling this. It's been a pending
>>> issue for way too long. I do think we should squash Jack's patch
>>> into the original, as it does clean up the code nicely.
>>>
>>> Do we have a proper test case for this, so we can verify that it
>>> does indeed also work in practice?
>>>
>> Hi Jens,
>>
>> I can trigger deadlock with in RAID1 with test below:
>>
>> I create one md with one local loop device and one remote scsi
>> exported by SRP. running fio with mix rw on top of md, force_close
>> session on storage side. mdx_raid1 is wait on free_array in D state,
>> and a lot of fio also in D state in wait_barrier.
>>
>> With the patch from Neil above, I can no longer trigger it anymore.
>>
>> The discussion was in link below:
>> http://www.spinics.net/lists/raid/msg54680.html
>
> In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> albeit unpolished/needy to get running, see:
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html

Can you run this patch with that test, reverting your DM workaround?

--
Jens Axboe

2017-03-07 17:32:59

by Mike Snitzer

Subject: Re: blk: improve order of bio handling in generic_make_request()

On Tue, Mar 07 2017 at 12:05pm -0500,
Jens Axboe <[email protected]> wrote:

> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> > On Tue, Mar 07 2017 at 3:49am -0500,
> > Jack Wang <[email protected]> wrote:
> >
> >>
> >>
> >> On 06.03.2017 21:18, Jens Axboe wrote:
> >>> On 03/05/2017 09:40 PM, NeilBrown wrote:
> >>>> On Fri, Mar 03 2017, Jack Wang wrote:
> >>>>>
> >>>>> Thanks Neil for pushing the fix.
> >>>>>
> >>>>> We can optimize generic_make_request a little bit:
> >>>>> - assign bio_list struct hold directly instead init and merge
> >>>>> - remove duplicate code
> >>>>>
> >>>>> I think better to squash into your fix.
> >>>>
> >>>> Hi Jack,
> >>>> I don't object to your changes, but I'd like to see a response from
> >>>> Jens first.
> >>>> My preference would be to get the original patch in, then other changes
> >>>> that build on it, such as this one, can be added. Until the core
> >>>> changes lands, any other work is pointless.
> >>>>
> >>>> Of course if Jens wants a this merged before he'll apply it, I'll
> >>>> happily do that.
> >>>
> >>> I like the change, and thanks for tackling this. It's been a pending
> >>> issue for way too long. I do think we should squash Jack's patch
> >>> into the original, as it does clean up the code nicely.
> >>>
> >>> Do we have a proper test case for this, so we can verify that it
> >>> does indeed also work in practice?
> >>>
> >> Hi Jens,
> >>
> >> I can trigger deadlock with in RAID1 with test below:
> >>
> >> I create one md with one local loop device and one remote scsi
> >> exported by SRP. running fio with mix rw on top of md, force_close
> >> session on storage side. mdx_raid1 is wait on free_array in D state,
> >> and a lot of fio also in D state in wait_barrier.
> >>
> >> With the patch from Neil above, I can no longer trigger it anymore.
> >>
> >> The discussion was in link below:
> >> http://www.spinics.net/lists/raid/msg54680.html
> >
> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> > albeit unpolished/needy to get running, see:
> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
>
> Can you run this patch with that test, reverting your DM workaround?

Yeap, will do. Last time Mikulas tried a similar patch it still
deadlocked. But I'll give it a go (likely tomorrow).

2017-03-07 17:47:57

by Mike Snitzer

Subject: Re: blk: improve order of bio handling in generic_make_request()

On Tue, Mar 07 2017 at 3:49am -0500,
Jack Wang <[email protected]> wrote:

>
>
> On 06.03.2017 21:18, Jens Axboe wrote:
> > On 03/05/2017 09:40 PM, NeilBrown wrote:
> >> On Fri, Mar 03 2017, Jack Wang wrote:
> >>>
> >>> Thanks Neil for pushing the fix.
> >>>
> >>> We can optimize generic_make_request a little bit:
> >>> - assign bio_list struct hold directly instead init and merge
> >>> - remove duplicate code
> >>>
> >>> I think better to squash into your fix.
> >>
> >> Hi Jack,
> >> I don't object to your changes, but I'd like to see a response from
> >> Jens first.
> >> My preference would be to get the original patch in, then other changes
> >> that build on it, such as this one, can be added. Until the core
> >> changes lands, any other work is pointless.
> >>
> >> Of course if Jens wants a this merged before he'll apply it, I'll
> >> happily do that.
> >
> > I like the change, and thanks for tackling this. It's been a pending
> > issue for way too long. I do think we should squash Jack's patch
> > into the original, as it does clean up the code nicely.
> >
> > Do we have a proper test case for this, so we can verify that it
> > does indeed also work in practice?
> >
> Hi Jens,
>
> I can trigger deadlock with in RAID1 with test below:
>
> I create one md with one local loop device and one remote scsi
> exported by SRP. running fio with mix rw on top of md, force_close
> session on storage side. mdx_raid1 is wait on free_array in D state,
> and a lot of fio also in D state in wait_barrier.
>
> With the patch from Neil above, I can no longer trigger it anymore.
>
> The discussion was in link below:
> http://www.spinics.net/lists/raid/msg54680.html

In addition to Jack's MD raid test there is a DM snapshot deadlock test,
albeit unpolished/needy to get running, see:
https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html

But to actually test block core's ability to handle this, upstream
commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 ("dm: flush queued bios
when process blocks to avoid deadlock") would need to be reverted.

Also, I know Lars had a drbd deadlock too. Not sure if Jack's MD test
is sufficient coverage for drbd. Lars?

2017-03-07 18:52:02

by Jack Wang

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()



On 07.03.2017 16:46, Pavel Machek wrote:
> On Mon 2017-03-06 10:43:59, Jack Wang wrote:
>>
>>
>> On 06.03.2017 05:40, NeilBrown wrote:
>>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>>
>>>> Thanks Neil for pushing the fix.
>>>>
>>>> We can optimize generic_make_request a little bit:
>>>> - assign bio_list struct hold directly instead init and merge
>>>> - remove duplicate code
>>>>
>>>> I think better to squash into your fix.
>>>
>>> Hi Jack,
>>> I don't object to your changes, but I'd like to see a response from
>>> Jens first.
>>> My preference would be to get the original patch in, then other changes
>>> that build on it, such as this one, can be added. Until the core
>>> changes lands, any other work is pointless.
>>>
>>> Of course if Jens wants a this merged before he'll apply it, I'll
>>> happily do that.
>>>
>>> Thanks,
>>> NeilBrown
>>
>> Hi Neil,
>>
>> Totally agree, let's wait for Jens's decision.
>>
>> Hi Jens,
>>
>> Please consider this fix also for stable 4.3+
>
> Stable? We don't put this into stable, with exception of minimal fixes
> for real bugs...
> Pavel
>
It indeed fixes the deadlock in the RAID1 case.
Please follow the link in my reply to Jens regarding the test.
It did hit us in production.

Thanks,
Jack

2017-03-07 20:44:27

by NeilBrown

Subject: [PATCH v2] blk: improve order of bio handling in generic_make_request()


To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially. If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.

These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, is not enough to remove all deadlocks. It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocating from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.

With this in place, it should be possible to disable the
punt_bios_to_rescuer() rescuer thread for many block devices, and
eventually it may be possible to remove it completely.

Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <[email protected]>
Inspired-by: Lars Ellenberg <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
block/blk-core.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)

Changes since v1:
- merge code improvements from Jack Wang
- more edits to changelog comment
- add Ref: link.
- Add some lists to Cc, that should have been there the first time.



diff --git a/block/blk-core.c b/block/blk-core.c
index b9e857f4afe8..9520e82aa78c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2018,17 +2018,34 @@ blk_qc_t generic_make_request(struct bio *bio)
struct request_queue *q = bdev_get_queue(bio->bi_bdev);

if (likely(blk_queue_enter(q, false) == 0)) {
+ struct bio_list hold;
+ struct bio_list lower, same;
+
+ /* Create a fresh bio_list for all subordinate requests */
+ hold = bio_list_on_stack;
+ bio_list_init(&bio_list_on_stack);
ret = q->make_request_fn(q, bio);

blk_queue_exit(q);

- bio = bio_list_pop(current->bio_list);
+ /* sort new bios into those for a lower level
+ * and those for the same level
+ */
+ bio_list_init(&lower);
+ bio_list_init(&same);
+ while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+ if (q == bdev_get_queue(bio->bi_bdev))
+ bio_list_add(&same, bio);
+ else
+ bio_list_add(&lower, bio);
+ /* now assemble so we handle the lowest level first */
+ bio_list_merge(&bio_list_on_stack, &lower);
+ bio_list_merge(&bio_list_on_stack, &same);
+ bio_list_merge(&bio_list_on_stack, &hold);
} else {
- struct bio *bio_next = bio_list_pop(current->bio_list);
-
bio_io_error(bio);
- bio = bio_next;
}
+ bio = bio_list_pop(current->bio_list);
} while (bio);
current->bio_list = NULL; /* deactivate */

--
2.12.0



2017-03-07 21:14:23

by NeilBrown

Subject: Re: blk: improve order of bio handling in generic_make_request()

On Tue, Mar 07 2017, Mike Snitzer wrote:

> On Tue, Mar 07 2017 at 12:05pm -0500,
> Jens Axboe <[email protected]> wrote:
>
>> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
>> > On Tue, Mar 07 2017 at 3:49am -0500,
>> > Jack Wang <[email protected]> wrote:
>> >
>> >>
>> >>
>> >> On 06.03.2017 21:18, Jens Axboe wrote:
>> >>> On 03/05/2017 09:40 PM, NeilBrown wrote:
>> >>>> On Fri, Mar 03 2017, Jack Wang wrote:
>> >>>>>
>> >>>>> Thanks Neil for pushing the fix.
>> >>>>>
>> >>>>> We can optimize generic_make_request a little bit:
>> >>>>> - assign bio_list struct hold directly instead init and merge
>> >>>>> - remove duplicate code
>> >>>>>
>> >>>>> I think better to squash into your fix.
>> >>>>
>> >>>> Hi Jack,
>> >>>> I don't object to your changes, but I'd like to see a response from
>> >>>> Jens first.
>> >>>> My preference would be to get the original patch in, then other changes
>> >>>> that build on it, such as this one, can be added. Until the core
>> >>>> changes lands, any other work is pointless.
>> >>>>
>> >>>> Of course if Jens wants a this merged before he'll apply it, I'll
>> >>>> happily do that.
>> >>>
>> >>> I like the change, and thanks for tackling this. It's been a pending
>> >>> issue for way too long. I do think we should squash Jack's patch
>> >>> into the original, as it does clean up the code nicely.
>> >>>
>> >>> Do we have a proper test case for this, so we can verify that it
>> >>> does indeed also work in practice?
>> >>>
>> >> Hi Jens,
>> >>
>> >> I can trigger deadlock with in RAID1 with test below:
>> >>
>> >> I create one md with one local loop device and one remote scsi
>> >> exported by SRP. running fio with mix rw on top of md, force_close
>> >> session on storage side. mdx_raid1 is wait on free_array in D state,
>> >> and a lot of fio also in D state in wait_barrier.
>> >>
>> >> With the patch from Neil above, I can no longer trigger it anymore.
>> >>
>> >> The discussion was in link below:
>> >> http://www.spinics.net/lists/raid/msg54680.html
>> >
>> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
>> > albeit unpolished/needy to get running, see:
>> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
>>
>> Can you run this patch with that test, reverting your DM workaround?
>
> Yeap, will do. Last time Mikulas tried a similar patch it still
> deadlocked. But I'll give it a go (likely tomorrow).

I don't think this will fix the DM snapshot deadlock by itself.
Rather, it makes it possible for some internal changes to DM to fix it.
The DM change might be something vaguely like:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5664f3..06ee0960e415 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)

len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);

+ if (len < ci->sector_count) {
+ struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
+ bio_chain(split, bio);
+ generic_make_request(bio);
+ bio = split;
+ ci->sector_count = len;
+ }
+
r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
if (r < 0)
return r;

Instead of looping inside DM, this change causes the remainder to be
passed to generic_make_request() and DM only handles one region at a
time. So there is only one loop, in the top generic_make_request().
That loop will not reliable handle bios in the "right" order.

Thanks,
NeilBrown



2017-03-08 00:56:09

by Mike Snitzer

Subject: Re: blk: improve order of bio handling in generic_make_request()

On Tue, Mar 07 2017 at 3:29pm -0500,
NeilBrown <[email protected]> wrote:

> On Tue, Mar 07 2017, Mike Snitzer wrote:
>
> > On Tue, Mar 07 2017 at 12:05pm -0500,
> > Jens Axboe <[email protected]> wrote:
> >
> >> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> >> >
> >> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> >> > albeit unpolished/needy to get running, see:
> >> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> >>
> >> Can you run this patch with that test, reverting your DM workaround?
> >
> > Yeap, will do. Last time Mikulas tried a similar patch it still
> > deadlocked. But I'll give it a go (likely tomorrow).
>
> I don't think this will fix the DM snapshot deadlock by itself.
> Rather, it make it possible for some internal changes to DM to fix it.
> The DM change might be something vaguely like:
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5664f3..06ee0960e415 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
>
> len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
>
> + if (len < ci->sector_count) {
> + struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
> + bio_chain(split, bio);
> + generic_make_request(bio);
> + bio = split;
> + ci->sector_count = len;
> + }
> +
> r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
> if (r < 0)
> return r;
>
> Instead of looping inside DM, this change causes the remainder to be
> passed to generic_make_request() and DM only handles or region at a
> time. So there is only one loop, in the top generic_make_request().
> That loop will not reliable handle bios in the "right" order.

s/not reliable/now reliably/ ? ;)

But thanks for the suggestion Neil. Will dig in once I get through a
backlog of other DM target code I have queued for 4.12 review.

Mike

2017-03-08 12:52:06

by Lars Ellenberg

Subject: Re: blk: improve order of bio handling in generic_make_request()

On 7 March 2017 at 17:52, Mike Snitzer <[email protected]> wrote:

> > On 06.03.2017 21:18, Jens Axboe wrote:
> > > I like the change, and thanks for tackling this. It's been a pending
> > > issue for way too long. I do think we should squash Jack's patch
> > > into the original, as it does clean up the code nicely.
> > >
> > > Do we have a proper test case for this, so we can verify that it
> > > does indeed also work in practice?
> > >
> > Hi Jens,
> >
> > I can trigger deadlock with in RAID1 with test below:
> >
> > I create one md with one local loop device and one remote scsi
> > exported by SRP. running fio with mix rw on top of md, force_close
> > session on storage side. mdx_raid1 is wait on free_array in D state,
> > and a lot of fio also in D state in wait_barrier.
> >
> > With the patch from Neil above, I can no longer trigger it anymore.
> >
> > The discussion was in link below:
> > http://www.spinics.net/lists/raid/msg54680.html
>
> In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> albeit unpolished/needy to get running, see:
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
>
> But to actually test block core's ability to handle this, upstream
> commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 ("dm: flush queued bios
> when process blocks to avoid deadlock") would need to be reverted.
>
> Also, I know Lars had a drbd deadlock too. Not sure if Jack's MD test
> is sufficient to coverage for drbd. Lars?
>

As this is just a slightly different implementation, trading some bytes of
stack for more local, self-contained, "obvious" code changes (good job!),
but following the same basic idea as my original RFC [*] (see the
"Inspired-by" tag), I have no doubt it fixes the issues we are able to
provoke with DRBD.

[*] https://lkml.org/lkml/2016/7/19/263
(where I also already suggested fixing the device-mapper issues
by losing the in-device-mapper loop,
relying on the loop in generic_make_request())

Cheers,
Lars

2017-03-08 16:49:58

by Mikulas Patocka

Subject: Re: blk: improve order of bio handling in generic_make_request()



On Wed, 8 Mar 2017, NeilBrown wrote:

> On Tue, Mar 07 2017, Mike Snitzer wrote:
>
> > On Tue, Mar 07 2017 at 12:05pm -0500,
> > Jens Axboe <[email protected]> wrote:
> >
> >> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> >> > On Tue, Mar 07 2017 at 3:49am -0500,
> >> > Jack Wang <[email protected]> wrote:
> >> >
> >> >>
> >> >>
> >> >> On 06.03.2017 21:18, Jens Axboe wrote:
> >> >>> On 03/05/2017 09:40 PM, NeilBrown wrote:
> >> >>>> On Fri, Mar 03 2017, Jack Wang wrote:
> >> >>>>>
> >> >>>>> Thanks Neil for pushing the fix.
> >> >>>>>
> >> >>>>> We can optimize generic_make_request a little bit:
> >> >>>>> - assign bio_list struct hold directly instead init and merge
> >> >>>>> - remove duplicate code
> >> >>>>>
> >> >>>>> I think better to squash into your fix.
> >> >>>>
> >> >>>> Hi Jack,
> >> >>>> I don't object to your changes, but I'd like to see a response from
> >> >>>> Jens first.
> >> >>>> My preference would be to get the original patch in, then other changes
> >> >>>> that build on it, such as this one, can be added. Until the core
> >> >>>> changes lands, any other work is pointless.
> >> >>>>
> >> >>>> Of course if Jens wants a this merged before he'll apply it, I'll
> >> >>>> happily do that.
> >> >>>
> >> >>> I like the change, and thanks for tackling this. It's been a pending
> >> >>> issue for way too long. I do think we should squash Jack's patch
> >> >>> into the original, as it does clean up the code nicely.
> >> >>>
> >> >>> Do we have a proper test case for this, so we can verify that it
> >> >>> does indeed also work in practice?
> >> >>>
> >> >> Hi Jens,
> >> >>
> >> >> I can trigger deadlock with in RAID1 with test below:
> >> >>
> >> >> I create one md with one local loop device and one remote scsi
> >> >> exported by SRP. running fio with mix rw on top of md, force_close
> >> >> session on storage side. mdx_raid1 is wait on free_array in D state,
> >> >> and a lot of fio also in D state in wait_barrier.
> >> >>
> >> >> With the patch from Neil above, I can no longer trigger it anymore.
> >> >>
> >> >> The discussion was in link below:
> >> >> http://www.spinics.net/lists/raid/msg54680.html
> >> >
> >> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> >> > albeit unpolished/needy to get running, see:
> >> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> >>
> >> Can you run this patch with that test, reverting your DM workaround?
> >
> > Yeap, will do. Last time Mikulas tried a similar patch it still
> > deadlocked. But I'll give it a go (likely tomorrow).
>
> I don't think this will fix the DM snapshot deadlock by itself.
> Rather, it make it possible for some internal changes to DM to fix it.
> The DM change might be something vaguely like:
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5664f3..06ee0960e415 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
>
> len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
>
> + if (len < ci->sector_count) {
> + struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);

fs_bio_set is a shared bio set, so it is prone to deadlocks. For this
change, we would need two bio sets per dm device, one for the split bio
and one for the outgoing bio. (this also means having one more kernel
thread per dm device)
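
As a rough sketch of the split-bio side of that, assuming the existing
bioset_create()/bio_split() interfaces; the example_* names are purely
illustrative, not existing dm code:

/* illustrative only: a device-local pool used for split bios */
struct example_dev {
	struct bio_set *split_bs;
};

static int example_alloc_split_bs(struct example_dev *ed)
{
	ed->split_bs = bioset_create(BIO_POOL_SIZE, 0);
	return ed->split_bs ? 0 : -ENOMEM;
}

static struct bio *example_split(struct example_dev *ed,
				 struct bio *bio, int sectors)
{
	/* split from the device-local pool, not the shared fs_bio_set */
	return bio_split(bio, sectors, GFP_NOIO, ed->split_bs);
}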

It would be possible to avoid having two bio sets if the incoming bio were
the same as the outgoing bio (allocate a small structure, move bi_end_io
and bi_private into it, replace bi_end_io and bi_private with pointers to
device mapper and send the bio to the target driver), but it would need
many more changes - basically rewriting the whole bio handling code in dm.c
and in the targets.
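
A rough sketch of the "reuse the incoming bio" idea described above; the
endio_hook structure and the hook/unhook helpers are hypothetical, meant
only to show the bi_end_io/bi_private juggling, not a dm implementation:

/* hypothetical: save and later restore a bio's completion routing */
struct endio_hook {
	bio_end_io_t	*orig_end_io;
	void		*orig_private;
};

static void hook_bio(struct endio_hook *h, struct bio *bio,
		     bio_end_io_t *end_io, void *private)
{
	h->orig_end_io = bio->bi_end_io;
	h->orig_private = bio->bi_private;
	bio->bi_end_io = end_io;	/* completion returns to dm first */
	bio->bi_private = private;
}

static void unhook_bio(struct endio_hook *h, struct bio *bio)
{
	bio->bi_end_io = h->orig_end_io;	/* hand back to the original owner */
	bio->bi_private = h->orig_private;
}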

Mikulas

> + bio_chain(split, bio);
> + generic_make_request(bio);
> + bio = split;
> + ci->sector_count = len;
> + }
> +
> r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
> if (r < 0)
> return r;
>
> Instead of looping inside DM, this change causes the remainder to be
> passed to generic_make_request() and DM only handles or region at a
> time. So there is only one loop, in the top generic_make_request().
> That loop will not reliable handle bios in the "right" order.
>
> Thanks,
> NeilBrown
>

2017-03-08 17:32:21

by Lars Ellenberg

Subject: Re: blk: improve order of bio handling in generic_make_request()

On 8 March 2017 at 17:40, Mikulas Patocka <[email protected]> wrote:
>
> On Wed, 8 Mar 2017, NeilBrown wrote:
> > I don't think this will fix the DM snapshot deadlock by itself.
> > Rather, it make it possible for some internal changes to DM to fix it.
> > The DM change might be something vaguely like:
> >
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 3086da5664f3..06ee0960e415 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
> >
> > len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
> >
> > + if (len < ci->sector_count) {
> > + struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
>
> fs_bio_set is a shared bio set, so it is prone to deadlocks. For this
> change, we would need two bio sets per dm device, one for the split bio
> and one for the outgoing bio. (this also means having one more kernel
> thread per dm device)
>
> It would be possible to avoid having two bio sets if the incoming bio were
> the same as the outgoing bio (allocate a small structure, move bi_end_io
> and bi_private into it, replace bi_end_io and bi_private with pointers to
> device mapper and send the bio to the target driver), but it would need
> much more changes - basically rewrite the whole bio handling code in dm.c
> and in the targets.
>
> Mikulas

"back then" (see previously posted link into ML archive)
I suggested this:

...

A bit of conflict here may be that DM has all its own
split and clone and queue magic, and wants to process
"all of the bio" before returning back to generic_make_request().

To change that, __split_and_process_bio() and all its helpers
would need to learn to "push back" (pieces of) the bio they are
currently working on, and not push back via "DM_ENDIO_REQUEUE",
but by bio_list_add_head(&current->bio_lists->queue, piece_to_be_done_later).

Then, after they processed each piece,
*return* all the way up to the top-level generic_make_request(),
where the recursion-to-iteration logic would then
make sure that all deeper level bios, submitted via
recursive calls to generic_make_request() will be processed, before the
next, pushed back, piece of the "original incoming" bio.

And *not* do their own iteration over all pieces first.

Probably not as easy as dropping the while loop,
using bio_advance, and pushing that "advanced" bio back to
current->...queue?

static void __split_and_process_bio(struct mapped_device *md,
struct dm_table *map, struct bio *bio)
...
ci.bio = bio;
ci.sector_count = bio_sectors(bio);
while (ci.sector_count && !error)
error = __split_and_process_non_flush(&ci);
...
error = __split_and_process_non_flush(&ci);
if (ci.sector_count)
bio_advance()
bio_list_add_head(&current->bio_lists->queue, )
...

Something like that, maybe?


Needs to be adapted to this new and improved recursion-to-iteration
logic, obviously. Would that be doable, or does device-mapper for some
reason really *need* its own iteration loop (which, because it is called
from generic_make_request(), won't be able to ever submit anything to
any device, ever, so needs all these helper threads just in case).

Lars

2017-03-09 06:08:28

by NeilBrown

Subject: Re: blk: improve order of bio handling in generic_make_request()

On Wed, Mar 08 2017, Mikulas Patocka wrote:

> On Wed, 8 Mar 2017, NeilBrown wrote:
>>
>> I don't think this will fix the DM snapshot deadlock by itself.
>> Rather, it make it possible for some internal changes to DM to fix it.
>> The DM change might be something vaguely like:
>>
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index 3086da5664f3..06ee0960e415 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
>>
>> len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
>>
>> + if (len < ci->sector_count) {
>> + struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
>
> fs_bio_set is a shared bio set, so it is prone to deadlocks. For this
> change, we would need two bio sets per dm device, one for the split bio
> and one for the outgoing bio. (this also means having one more kernel
> thread per dm device)

Yes, two local bio_sets would be best.
But we don't really need those extra kernel threads. I'll start working
on patches to make them optional, and then to start removing them.

Thanks,
NeilBrown



2017-03-10 04:32:37

by NeilBrown

Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()


I started looking further at the improvements we can make once
generic_make_request is fixed, and realised that I had missed an
important detail in this patch.
Several places test current->bio_list, and two actually edit the list.
With this change, they cannot see the whole list, so it could cause a
regression.

So I've revised the patch to make sure that the entire list of queued
bios remains visible, and change the relevant code to look at both
the new list and the old list.

Following that there are some patches which make the rescuer thread
optional, and then starts removing it from some easy-to-fix places.

The series summary is below.

NeilBrown


NeilBrown (5):
blk: improve order of bio handling in generic_make_request()
blk: remove bio_set arg from blk_queue_split()
blk: make the bioset rescue_workqueue optional.
blk: use non-rescuing bioset for q->bio_split.
block_dev: make blkdev_dio_pool a non-rescuing bioset


block/bio.c | 39 +++++++++++++++++++++++++++------
block/blk-core.c | 42 ++++++++++++++++++++++++++++-------
block/blk-merge.c | 7 +++---
block/blk-mq.c | 4 ++-
drivers/block/drbd/drbd_main.c | 2 +-
drivers/block/drbd/drbd_req.c | 2 +-
drivers/block/pktcdvd.c | 2 +-
drivers/block/ps3vram.c | 2 +-
drivers/block/rsxx/dev.c | 2 +-
drivers/block/umem.c | 2 +-
drivers/block/zram/zram_drv.c | 2 +-
drivers/lightnvm/rrpc.c | 2 +-
drivers/md/bcache/super.c | 4 ++-
drivers/md/dm-crypt.c | 2 +-
drivers/md/dm-io.c | 2 +-
drivers/md/dm.c | 32 +++++++++++++++------------
drivers/md/md.c | 4 ++-
drivers/md/raid10.c | 3 ++-
drivers/md/raid5-cache.c | 2 +-
drivers/s390/block/dcssblk.c | 2 +-
drivers/s390/block/xpram.c | 2 +-
drivers/target/target_core_iblock.c | 2 +-
fs/btrfs/extent_io.c | 4 ++-
fs/xfs/xfs_super.c | 2 +-
include/linux/bio.h | 2 ++
include/linux/blkdev.h | 3 +--
26 files changed, 114 insertions(+), 60 deletions(-)



2017-03-10 04:33:54

by NeilBrown

Subject: [PATCH 1/5 v3] blk: improve order of bio handling in generic_make_request()


To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially. If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.

These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, is not enough to remove all deadlocks. It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocating from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.
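
As an illustration only, a sketch of a hypothetical make_request_fn
following that advice; too_big(), my_max_sectors() and my_handle() are
invented names and error handling is omitted:

        static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
        {
                if (too_big(bio)) {
                        /* Split once; do not loop splitting and handling
                         * each piece, and do not allocate from the same
                         * mempool again in this call.
                         */
                        struct bio *split = bio_split(bio, my_max_sectors(bio),
                                                      GFP_NOIO, q->bio_split);

                        bio_chain(split, bio);
                        /* Queue the remainder; generic_make_request() will
                         * only return to it after everything generated by
                         * 'split' has been fully handled.
                         */
                        generic_make_request(bio);
                        bio = split;
                }
                my_handle(q, bio);      /* handle the first part now */
                return BLK_QC_T_NONE;
        }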

With this in place, it should be possible to disable the
punt_bios_to_rescuer() rescuer thread for many block devices, and
eventually it may be possible to remove it completely.

Note that as some drivers look inside the bio_list, sometimes to punt
some bios to rescuer threads, we need to make both the pending list and the
new list visible. So current->bio_list is now an array of 2 lists,
and relevant code examines both of them.
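
In practice the test becomes (as in the bio_alloc_bioset() and raid10
hunks below):

        if (current->bio_list &&
            (!bio_list_empty(&current->bio_list[0]) ||
             !bio_list_empty(&current->bio_list[1]))) {
                /* bios are queued for later handling by
                 * generic_make_request(); do not block waiting
                 * for any of them to complete.
                 */
        }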

Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <[email protected]>
Inspired-by: Lars Ellenberg <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
block/bio.c | 11 ++++++++---
block/blk-core.c | 40 ++++++++++++++++++++++++++++++++--------
drivers/md/dm.c | 29 ++++++++++++++++-------------
drivers/md/raid10.c | 3 ++-
4 files changed, 58 insertions(+), 25 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5eec5e08417f..84ae39f06f81 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -376,10 +376,13 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
bio_list_init(&punt);
bio_list_init(&nopunt);

- while ((bio = bio_list_pop(current->bio_list)))
+ while ((bio = bio_list_pop(&current->bio_list[0])))
bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+ current->bio_list[0] = nopunt;

- *current->bio_list = nopunt;
+ while ((bio = bio_list_pop(&current->bio_list[1])))
+ bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+ current->bio_list[1] = nopunt;

spin_lock(&bs->rescue_lock);
bio_list_merge(&bs->rescue_list, &punt);
@@ -466,7 +469,9 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
* we retry with the original gfp_flags.
*/

- if (current->bio_list && !bio_list_empty(current->bio_list))
+ if (current->bio_list &&
+ (!bio_list_empty(&current->bio_list[0]) ||
+ !bio_list_empty(&current->bio_list[1])))
gfp_mask &= ~__GFP_DIRECT_RECLAIM;

p = mempool_alloc(bs->bio_pool, gfp_mask);
diff --git a/block/blk-core.c b/block/blk-core.c
index 1086dac8724c..bd2cb4ba674e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
*/
blk_qc_t generic_make_request(struct bio *bio)
{
- struct bio_list bio_list_on_stack;
+ /*
+ * bio_list_on_stack[0] contains bios submitted by the current
+ * make_request_fn.
+ * bio_list_on_stack[1] contains bios that were submitted before
+ * the current make_request_fn, but that haven't been processed
+ * yet.
+ */
+ struct bio_list bio_list_on_stack[2];
blk_qc_t ret = BLK_QC_T_NONE;

if (!generic_make_request_checks(bio))
@@ -1992,7 +1999,7 @@ blk_qc_t generic_make_request(struct bio *bio)
* should be added at the tail
*/
if (current->bio_list) {
- bio_list_add(current->bio_list, bio);
+ bio_list_add(&current->bio_list[0], bio);
goto out;
}

@@ -2011,23 +2018,40 @@ blk_qc_t generic_make_request(struct bio *bio)
* bio_list, and call into ->make_request() again.
*/
BUG_ON(bio->bi_next);
- bio_list_init(&bio_list_on_stack);
- current->bio_list = &bio_list_on_stack;
+ bio_list_init(&bio_list_on_stack[0]);
+ bio_list_init(&bio_list_on_stack[1]);
+ current->bio_list = bio_list_on_stack;
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);

if (likely(blk_queue_enter(q, false) == 0)) {
+ struct bio_list lower, same;
+
+ /* Create a fresh bio_list for all subordinate requests */
+ bio_list_on_stack[1] = bio_list_on_stack[0];
+ bio_list_init(&bio_list_on_stack[0]);
ret = q->make_request_fn(q, bio);

blk_queue_exit(q);

- bio = bio_list_pop(current->bio_list);
+ /* sort new bios into those for a lower level
+ * and those for the same level
+ */
+ bio_list_init(&lower);
+ bio_list_init(&same);
+ while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
+ if (q == bdev_get_queue(bio->bi_bdev))
+ bio_list_add(&same, bio);
+ else
+ bio_list_add(&lower, bio);
+ /* now assemble so we handle the lowest level first */
+ bio_list_merge(&bio_list_on_stack[0], &lower);
+ bio_list_merge(&bio_list_on_stack[0], &same);
+ bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
} else {
- struct bio *bio_next = bio_list_pop(current->bio_list);
-
bio_io_error(bio);
- bio = bio_next;
}
+ bio = bio_list_pop(&bio_list_on_stack[0]);
} while (bio);
current->bio_list = NULL; /* deactivate */

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f4ffd1eb8f44..dfb75979e455 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -989,26 +989,29 @@ static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
struct dm_offload *o = container_of(cb, struct dm_offload, cb);
struct bio_list list;
struct bio *bio;
+ int i;

INIT_LIST_HEAD(&o->cb.list);

if (unlikely(!current->bio_list))
return;

- list = *current->bio_list;
- bio_list_init(current->bio_list);
-
- while ((bio = bio_list_pop(&list))) {
- struct bio_set *bs = bio->bi_pool;
- if (unlikely(!bs) || bs == fs_bio_set) {
- bio_list_add(current->bio_list, bio);
- continue;
+ for (i = 0; i < 2; i++) {
+ list = current->bio_list[i];
+ bio_list_init(&current->bio_list[i]);
+
+ while ((bio = bio_list_pop(&list))) {
+ struct bio_set *bs = bio->bi_pool;
+ if (unlikely(!bs) || bs == fs_bio_set) {
+ bio_list_add(&current->bio_list[i], bio);
+ continue;
+ }
+
+ spin_lock(&bs->rescue_lock);
+ bio_list_add(&bs->rescue_list, bio);
+ queue_work(bs->rescue_workqueue, &bs->rescue_work);
+ spin_unlock(&bs->rescue_lock);
}
-
- spin_lock(&bs->rescue_lock);
- bio_list_add(&bs->rescue_list, bio);
- queue_work(bs->rescue_workqueue, &bs->rescue_work);
- spin_unlock(&bs->rescue_lock);
}
}

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 063c43d83b72..0536658c9d40 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -974,7 +974,8 @@ static void wait_barrier(struct r10conf *conf)
!conf->barrier ||
(atomic_read(&conf->nr_pending) &&
current->bio_list &&
- !bio_list_empty(current->bio_list)),
+ (!bio_list_empty(&current->bio_list[0]) ||
+ !bio_list_empty(&current->bio_list[1]))),
conf->resync_lock);
conf->nr_waiting--;
if (!conf->nr_waiting)




2017-03-10 04:34:51

by NeilBrown

[permalink] [raw]
Subject: [PATCH 2/5] blk: remove bio_set arg from blk_queue_split()


blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary. Remove the last arg and always use
q->bio_split inside blk_queue_split()

Signed-off-by: NeilBrown <[email protected]>
---
block/blk-core.c | 2 +-
block/blk-merge.c | 7 +++----
block/blk-mq.c | 4 ++--
drivers/block/drbd/drbd_req.c | 2 +-
drivers/block/pktcdvd.c | 2 +-
drivers/block/ps3vram.c | 2 +-
drivers/block/rsxx/dev.c | 2 +-
drivers/block/umem.c | 2 +-
drivers/block/zram/zram_drv.c | 2 +-
drivers/lightnvm/rrpc.c | 2 +-
drivers/md/md.c | 2 +-
drivers/s390/block/dcssblk.c | 2 +-
drivers/s390/block/xpram.c | 2 +-
include/linux/blkdev.h | 3 +--
14 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bd2cb4ba674e..375006c94c15 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1637,7 +1637,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
*/
blk_queue_bounce(q, &bio);

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
bio->bi_error = -EIO;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2afa262425d1..ce8838aff7f7 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -188,8 +188,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
return do_split ? new : NULL;
}

-void blk_queue_split(struct request_queue *q, struct bio **bio,
- struct bio_set *bs)
+void blk_queue_split(struct request_queue *q, struct bio **bio)
{
struct bio *split, *res;
unsigned nsegs;
@@ -197,14 +196,14 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
switch (bio_op(*bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
- split = blk_bio_discard_split(q, *bio, bs, &nsegs);
+ split = blk_bio_discard_split(q, *bio, q->bio_split, &nsegs);
break;
case REQ_OP_WRITE_ZEROES:
split = NULL;
nsegs = (*bio)->bi_phys_segments;
break;
case REQ_OP_WRITE_SAME:
- split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
+ split = blk_bio_write_same_split(q, *bio, q->bio_split, &nsegs);
break;
default:
split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b2fd175e84d7..ef7b289e873c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1502,7 +1502,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
return BLK_QC_T_NONE;
}

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if (!is_flush_fua && !blk_queue_nomerges(q) &&
blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
@@ -1624,7 +1624,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
return BLK_QC_T_NONE;
}

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if (!is_flush_fua && !blk_queue_nomerges(q)) {
if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 652114ae1a8a..f6ed6f7f5ab2 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1554,7 +1554,7 @@ blk_qc_t drbd_make_request(struct request_queue *q, struct bio *bio)
struct drbd_device *device = (struct drbd_device *) q->queuedata;
unsigned long start_jif;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

start_jif = jiffies;

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 66d846ba85a9..98394d034c29 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2414,7 +2414,7 @@ static blk_qc_t pkt_make_request(struct request_queue *q, struct bio *bio)

blk_queue_bounce(q, &bio);

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

pd = q->queuedata;
if (!pd) {
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 456b4fe21559..48072c0c1010 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -606,7 +606,7 @@ static blk_qc_t ps3vram_make_request(struct request_queue *q, struct bio *bio)

dev_dbg(&dev->core, "%s\n", __func__);

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

spin_lock_irq(&priv->lock);
busy = !bio_list_empty(&priv->list);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index f81d70b39d10..8c8024f616ae 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -151,7 +151,7 @@ static blk_qc_t rsxx_make_request(struct request_queue *q, struct bio *bio)
struct rsxx_bio_meta *bio_meta;
int st = -EINVAL;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

might_sleep();

diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index c141cc3be22b..c8d8a2f16f8e 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -529,7 +529,7 @@ static blk_qc_t mm_make_request(struct request_queue *q, struct bio *bio)
(unsigned long long)bio->bi_iter.bi_sector,
bio->bi_iter.bi_size);

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

spin_lock_irq(&card->lock);
*card->biotail = bio;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index e27d89a36c34..973dac2a9a99 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -880,7 +880,7 @@ static blk_qc_t zram_make_request(struct request_queue *queue, struct bio *bio)
{
struct zram *zram = queue->queuedata;

- blk_queue_split(queue, &bio, queue->bio_split);
+ blk_queue_split(queue, &bio);

if (!valid_io_request(zram, bio->bi_iter.bi_sector,
bio->bi_iter.bi_size)) {
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index e00b1d7b976f..701fa2fafbb2 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -999,7 +999,7 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, struct bio *bio)
struct nvm_rq *rqd;
int err;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if (bio_op(bio) == REQ_OP_DISCARD) {
rrpc_discard(rrpc, bio);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 548d1b8014f8..d7d2bb51a58d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -253,7 +253,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
unsigned int sectors;
int cpu;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if (mddev == NULL || mddev->pers == NULL) {
bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 415d10a67b7a..10ece6f3c7eb 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -829,7 +829,7 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
unsigned long source_addr;
unsigned long bytes_done;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

bytes_done = 0;
dev_info = bio->bi_bdev->bd_disk->private_data;
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index b9d7e755c8a3..a48f0d40c1d2 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,7 +190,7 @@ static blk_qc_t xpram_make_request(struct request_queue *q, struct bio *bio)
unsigned long page_addr;
unsigned long bytes;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if ((bio->bi_iter.bi_sector & 7) != 0 ||
(bio->bi_iter.bi_size & 4095) != 0)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 796016e63c1d..63008a246403 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -934,8 +934,7 @@ extern int blk_insert_cloned_request(struct request_queue *q,
struct request *rq);
extern int blk_rq_append_bio(struct request *rq, struct bio *bio);
extern void blk_delay_queue(struct request_queue *, unsigned long);
-extern void blk_queue_split(struct request_queue *, struct bio **,
- struct bio_set *);
+extern void blk_queue_split(struct request_queue *, struct bio **);
extern void blk_recount_segments(struct request_queue *, struct bio *);
extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,




2017-03-10 04:35:54

by NeilBrown

[permalink] [raw]
Subject: [PATCH 3/5] blk: make the bioset rescue_workqueue optional.


This patch converts bioset_create() and bioset_create_nobvec()
to not create a workqueue, so allocations will never trigger
punt_bios_to_rescuer().
It also introduces bioset_create_rescued() and bioset_create_nobvec_rescued()
which preserve the old behaviour.

*All* callers of bioset_create() and bioset_create_nobvec() are
converted to the _rescued() version, so that no change in behaviour
is experienced.

It is hoped that most, if not all, biosets can end up being the
non-rescued version.
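
Illustrative only (my_dev and its fields are invented names): a bioset
whose bios may be allocated while current->bio_list is active keeps the
old behaviour via the _rescued() variant, while one used only outside
generic_make_request() can now skip the rescuer workqueue:

        /* allocations can happen inside a make_request_fn: keep a rescuer */
        my_dev->split_bs = bioset_create_rescued(BIO_POOL_SIZE, 0);

        /* allocations only happen in ordinary process context: no rescuer */
        my_dev->dio_bs = bioset_create(BIO_POOL_SIZE, 0);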

Signed-off-by: NeilBrown <[email protected]>
---
block/bio.c | 30 +++++++++++++++++++++++++-----
block/blk-core.c | 2 +-
drivers/block/drbd/drbd_main.c | 2 +-
drivers/md/bcache/super.c | 4 ++--
drivers/md/dm-crypt.c | 2 +-
drivers/md/dm-io.c | 2 +-
drivers/md/dm.c | 5 +++--
drivers/md/md.c | 2 +-
drivers/md/raid5-cache.c | 2 +-
drivers/target/target_core_iblock.c | 2 +-
fs/block_dev.c | 2 +-
fs/btrfs/extent_io.c | 4 ++--
fs/xfs/xfs_super.c | 2 +-
include/linux/bio.h | 2 ++
14 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 84ae39f06f81..06587f1119f5 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -362,6 +362,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
struct bio_list punt, nopunt;
struct bio *bio;

+ if (WARN_ON_ONCE(!bs->rescue_workqueue))
+ return;
/*
* In order to guarantee forward progress we must punt only bios that
* were allocated from this bio_set; otherwise, if there was a bio on
@@ -471,7 +473,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)

if (current->bio_list &&
(!bio_list_empty(&current->bio_list[0]) ||
- !bio_list_empty(&current->bio_list[1])))
+ !bio_list_empty(&current->bio_list[1])) &&
+ bs->rescue_workqueue)
gfp_mask &= ~__GFP_DIRECT_RECLAIM;

p = mempool_alloc(bs->bio_pool, gfp_mask);
@@ -1940,7 +1943,8 @@ EXPORT_SYMBOL(bioset_free);

static struct bio_set *__bioset_create(unsigned int pool_size,
unsigned int front_pad,
- bool create_bvec_pool)
+ bool create_bvec_pool,
+ bool create_rescue_workqueue)
{
unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
struct bio_set *bs;
@@ -1971,6 +1975,9 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
goto bad;
}

+ if (!create_rescue_workqueue)
+ return bs;
+
bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
if (!bs->rescue_workqueue)
goto bad;
@@ -1996,10 +2003,16 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
*/
struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
{
- return __bioset_create(pool_size, front_pad, true);
+ return __bioset_create(pool_size, front_pad, true, false);
}
EXPORT_SYMBOL(bioset_create);

+struct bio_set *bioset_create_rescued(unsigned int pool_size, unsigned int front_pad)
+{
+ return __bioset_create(pool_size, front_pad, true, true);
+}
+EXPORT_SYMBOL(bioset_create_rescued);
+
/**
* bioset_create_nobvec - Create a bio_set without bio_vec mempool
* @pool_size: Number of bio to cache in the mempool
@@ -2011,10 +2024,17 @@ EXPORT_SYMBOL(bioset_create);
*/
struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_pad)
{
- return __bioset_create(pool_size, front_pad, false);
+ return __bioset_create(pool_size, front_pad, false, false);
}
EXPORT_SYMBOL(bioset_create_nobvec);

+struct bio_set *bioset_create_nobvec_rescued(unsigned int pool_size,
+ unsigned int front_pad)
+{
+ return __bioset_create(pool_size, front_pad, false, true);
+}
+EXPORT_SYMBOL(bioset_create_nobvec_rescued);
+
#ifdef CONFIG_BLK_CGROUP

/**
@@ -2129,7 +2149,7 @@ static int __init init_bio(void)
bio_integrity_init();
biovec_init_slabs();

- fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
+ fs_bio_set = bioset_create_rescued(BIO_POOL_SIZE, 0);
if (!fs_bio_set)
panic("bio: can't allocate bios\n");

diff --git a/block/blk-core.c b/block/blk-core.c
index 375006c94c15..c3992d17dc2c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -714,7 +714,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (q->id < 0)
goto fail_q;

- q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+ q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
if (!q->bio_split)
goto fail_id;

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 92c60cbd04ee..2c69c2ab0fff 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2166,7 +2166,7 @@ static int drbd_create_mempools(void)
goto Enomem;

/* mempools */
- drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0);
+ drbd_md_io_bio_set = bioset_create_rescued(DRBD_MIN_POOL_PAGES, 0);
if (drbd_md_io_bio_set == NULL)
goto Enomem;

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 85e3f21c2514..6cb30792f0ed 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -786,7 +786,7 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,

minor *= BCACHE_MINORS;

- if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+ if (!(d->bio_split = bioset_create_rescued(4, offsetof(struct bbio, bio))) ||
!(d->disk = alloc_disk(BCACHE_MINORS))) {
ida_simple_remove(&bcache_minor, minor);
return -ENOMEM;
@@ -1520,7 +1520,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
sizeof(struct bbio) + sizeof(struct bio_vec) *
bucket_pages(c))) ||
!(c->fill_iter = mempool_create_kmalloc_pool(1, iter_size)) ||
- !(c->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+ !(c->bio_split = bioset_create_rescued(4, offsetof(struct bbio, bio))) ||
!(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) ||
!(c->moving_gc_wq = alloc_workqueue("bcache_gc",
WQ_MEM_RECLAIM, 0)) ||
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 389a3637ffcc..91a2d637d44f 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1936,7 +1936,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
goto bad;
}

- cc->bs = bioset_create(MIN_IOS, 0);
+ cc->bs = bioset_create_rescued(MIN_IOS, 0);
if (!cc->bs) {
ti->error = "Cannot allocate crypt bioset";
goto bad;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 03940bf36f6c..fe1241c196b1 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -58,7 +58,7 @@ struct dm_io_client *dm_io_client_create(void)
if (!client->pool)
goto bad;

- client->bios = bioset_create(min_ios, 0);
+ client->bios = bioset_create_rescued(min_ios, 0);
if (!client->bios)
goto bad;

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..41b1f033841f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1002,7 +1002,8 @@ static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)

while ((bio = bio_list_pop(&list))) {
struct bio_set *bs = bio->bi_pool;
- if (unlikely(!bs) || bs == fs_bio_set) {
+ if (unlikely(!bs) || bs == fs_bio_set ||
+ !bs->rescue_workqueue) {
bio_list_add(&current->bio_list[i], bio);
continue;
}
@@ -2577,7 +2578,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned t
BUG();
}

- pools->bs = bioset_create_nobvec(pool_size, front_pad);
+ pools->bs = bioset_create_nobvec_rescued(pool_size, front_pad);
if (!pools->bs)
goto out;

diff --git a/drivers/md/md.c b/drivers/md/md.c
index d7d2bb51a58d..e5f08a195837 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5220,7 +5220,7 @@ int md_run(struct mddev *mddev)
}

if (mddev->bio_set == NULL) {
- mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);
+ mddev->bio_set = bioset_create_rescued(BIO_POOL_SIZE, 0);
if (!mddev->bio_set)
return -ENOMEM;
}
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be01b10..c95c6c046395 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2831,7 +2831,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
if (!log->io_pool)
goto io_pool;

- log->bs = bioset_create(R5L_POOL_SIZE, 0);
+ log->bs = bioset_create_rescued(R5L_POOL_SIZE, 0);
if (!log->bs)
goto io_bs;

diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index d316ed537d59..5bf3392195c6 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -93,7 +93,7 @@ static int iblock_configure_device(struct se_device *dev)
return -EINVAL;
}

- ib_dev->ibd_bio_set = bioset_create(IBLOCK_BIO_POOL_SIZE, 0);
+ ib_dev->ibd_bio_set = bioset_create_rescued(IBLOCK_BIO_POOL_SIZE, 0);
if (!ib_dev->ibd_bio_set) {
pr_err("IBLOCK: Unable to create bioset\n");
goto out;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..c0ca5f0d0369 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,7 +436,7 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)

static __init int blkdev_init(void)
{
- blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio));
+ blkdev_dio_pool = bioset_create_rescued(4, offsetof(struct blkdev_dio, bio));
if (!blkdev_dio_pool)
return -ENOMEM;
return 0;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e81922a21c..34aa8893790a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -173,8 +173,8 @@ int __init extent_io_init(void)
if (!extent_buffer_cache)
goto free_state_cache;

- btrfs_bioset = bioset_create(BIO_POOL_SIZE,
- offsetof(struct btrfs_io_bio, bio));
+ btrfs_bioset = bioset_create_rescued(BIO_POOL_SIZE,
+ offsetof(struct btrfs_io_bio, bio));
if (!btrfs_bioset)
goto free_buffer_cache;

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 890862f2447c..f4c4d6f41d91 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1756,7 +1756,7 @@ MODULE_ALIAS_FS("xfs");
STATIC int __init
xfs_init_zones(void)
{
- xfs_ioend_bioset = bioset_create(4 * MAX_BUF_PER_PAGE,
+ xfs_ioend_bioset = bioset_create_rescued(4 * MAX_BUF_PER_PAGE,
offsetof(struct xfs_ioend, io_inline_bio));
if (!xfs_ioend_bioset)
goto out;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..05730603fcf1 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -379,7 +379,9 @@ static inline struct bio *bio_next_split(struct bio *bio, int sectors,
}

extern struct bio_set *bioset_create(unsigned int, unsigned int);
+extern struct bio_set *bioset_create_rescued(unsigned int, unsigned int);
extern struct bio_set *bioset_create_nobvec(unsigned int, unsigned int);
+extern struct bio_set *bioset_create_nobvec_rescued(unsigned int, unsigned int);
extern void bioset_free(struct bio_set *);
extern mempool_t *biovec_create_pool(int pool_entries);





2017-03-10 04:36:33

by NeilBrown

[permalink] [raw]
Subject: [PATCH 4/5] blk: use non-rescuing bioset for q->bio_split.


A rescuing bioset is only useful if there might be bios from
that same bioset on the bio_list_on_stack queue at a time
when bio_alloc_bioset() is called. This never applies to
q->bio_split.

Allocations from q->bio_split are only ever made from
blk_queue_split() which is only ever called early in each of
various make_request_fn()s. The original bio (call this A)
is then passed to generic_make_request() and is placed on
the bio_list_on_stack queue, and the bio that was allocated
from q->bio_split (B) is processed.

The processing of this may cause other bios to be passed to
generic_make_request() or may even cause the bio B itself to
be passed, possibly after some prefix has been split off
(using some other bioset).

generic_make_request() now guarantees that all of these bios
(B and dependants) will be fully processed before the tail
of the original bio A gets handled. None of these early bios
can possibly trigger an allocation from the original
q->bio_split as they are either too small to require
splitting or (more likely) are destined for a different queue.

The next time that the original q->bio_split might be used
by this thread is when A is processed again, as it might
still be too big to handle directly. By this time there
cannot be any other bios allocated from q->bio_split in the
generic_make_request() queue. So no rescuing will ever be
needed.
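
For readers following the argument, a hedged sketch of that flow as seen
from a hypothetical make_request_fn (my_issue() is an invented name;
blk_queue_split() is shown with the two-argument form introduced earlier
in this series):

        static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
        {
                /* If the bio is too big, blk_queue_split() allocates a
                 * piece B from q->bio_split, passes the original bio A
                 * back to generic_make_request() -- which only queues it
                 * on bio_list_on_stack -- and leaves us holding B.
                 */
                blk_queue_split(q, &bio);

                /* Everything B generates is fully processed before A is
                 * looked at again, so by the time A can reach this
                 * function and allocate from q->bio_split once more, no
                 * bio from q->bio_split remains queued.
                 */
                return my_issue(q, bio);
        }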

Signed-off-by: NeilBrown <[email protected]>
---
block/blk-core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c3992d17dc2c..375006c94c15 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -714,7 +714,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (q->id < 0)
goto fail_q;

- q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
+ q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
if (!q->bio_split)
goto fail_id;





2017-03-10 04:37:54

by NeilBrown

[permalink] [raw]
Subject: [PATCH 5/5] block_dev: make blkdev_dio_pool a non-rescuing bioset


Allocations from blkdev_dio_pool are never made under
generic_make_request, so this bioset does not need a rescuer thread.

Signed-off-by: NeilBrown <[email protected]>
---
fs/block_dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c0ca5f0d0369..2eca00ec4370 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,7 +436,7 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)

static __init int blkdev_init(void)
{
- blkdev_dio_pool = bioset_create_rescued(4, offsetof(struct blkdev_dio, bio));
+ blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio));
if (!blkdev_dio_pool)
return -ENOMEM;
return 0;




2017-03-10 04:38:17

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On 03/09/2017 09:32 PM, NeilBrown wrote:
>
> I started looking further at the improvements we can make once
> generic_make_request is fixed, and realised that I had missed an
> important detail in this patch.
> Several places test current->bio_list, and two actually edit the list.
> With this change, that cannot see the whole lists, so it could cause a
> regression.
>
> So I've revised the patch to make sure that the entire list of queued
> bios remains visible, and change the relevant code to look at both
> the new list and the old list.
>
> Following that there are some patches which make the rescuer thread
> optional, and then starts removing it from some easy-to-fix places.

Neil, note that the v2 patch is already in Linus tree as of earlier
today. You need to rebase the series, and if we need fixups on
top of v2, then that should be done separately and with increased
urgency.

--
Jens Axboe

2017-03-10 04:40:31

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On 03/09/2017 09:38 PM, Jens Axboe wrote:
> On 03/09/2017 09:32 PM, NeilBrown wrote:
>>
>> I started looking further at the improvements we can make once
>> generic_make_request is fixed, and realised that I had missed an
>> important detail in this patch.
>> Several places test current->bio_list, and two actually edit the list.
>> With this change, that cannot see the whole lists, so it could cause a
>> regression.
>>
>> So I've revised the patch to make sure that the entire list of queued
>> bios remains visible, and change the relevant code to look at both
>> the new list and the old list.
>>
>> Following that there are some patches which make the rescuer thread
>> optional, and then starts removing it from some easy-to-fix places.
>
> Neil, note that the v2 patch is already in Linus tree as of earlier
> today. You need to rebase the series, and if we need fixups on
> top of v2, then that should be done separately and with increased
> urgency.

Additionally, at least the first patch appears to be badly mangled.
The formatting is screwed up.

--
Jens Axboe

2017-03-10 05:19:29

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On Thu, Mar 09 2017, Jens Axboe wrote:

> On 03/09/2017 09:32 PM, NeilBrown wrote:
>>
>> I started looking further at the improvements we can make once
>> generic_make_request is fixed, and realised that I had missed an
>> important detail in this patch.
>> Several places test current->bio_list, and two actually edit the list.
>> With this change, that cannot see the whole lists, so it could cause a
>> regression.
>>
>> So I've revised the patch to make sure that the entire list of queued
>> bios remains visible, and change the relevant code to look at both
>> the new list and the old list.
>>
>> Following that there are some patches which make the rescuer thread
>> optional, and then starts removing it from some easy-to-fix places.
>
> Neil, note that the v2 patch is already in Linus tree as of earlier
> today. You need to rebase the series, and if we need fixups on
> top of v2, then that should be done separately and with increased
> urgency.

I had checked linux-next, but not the latest from Linus.
I see it now - thanks!
I'll rebase (and ensure nothing gets mangled)

Thanks,
NeilBrown



2017-03-10 12:34:18

by Lars Ellenberg

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> */
> blk_qc_t generic_make_request(struct bio *bio)
> {
> - struct bio_list bio_list_on_stack;
> + /*
> + * bio_list_on_stack[0] contains bios submitted by the current
> + * make_request_fn.
> + * bio_list_on_stack[1] contains bios that were submitted before
> + * the current make_request_fn, but that haven't been processed
> + * yet.
> + */
> + struct bio_list bio_list_on_stack[2];
> blk_qc_t ret = BLK_QC_T_NONE;

May I suggest that, if you intend to assign something that is not a
plain &(struct bio_list), but a &(struct bio_list[2]),
you change the task member so it is renamed (current->bio_list vs
current->bio_lists, plural, is what I did last year).
Or you will break external modules, silently, and horribly (or,
rather, they won't notice, but break the kernel).
Examples of such modules would be DRBD, ZFS, quite possibly others.

Thanks,

Lars

2017-03-10 14:38:44

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On Fri, Mar 10 2017 at 7:34am -0500,
Lars Ellenberg <[email protected]> wrote:

> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> > */
> > blk_qc_t generic_make_request(struct bio *bio)
> > {
> > - struct bio_list bio_list_on_stack;
> > + /*
> > + * bio_list_on_stack[0] contains bios submitted by the current
> > + * make_request_fn.
> > + * bio_list_on_stack[1] contains bios that were submitted before
> > + * the current make_request_fn, but that haven't been processed
> > + * yet.
> > + */
> > + struct bio_list bio_list_on_stack[2];
> > blk_qc_t ret = BLK_QC_T_NONE;
>
> May I suggest that, if you intend to assign something that is not a
> plain &(struct bio_list), but a &(struct bio_list[2]),
> you change the task member so it is renamed (current->bio_list vs
> current->bio_lists, plural, is what I did last year).
> Or you will break external modules, silently, and horribly (or,
> rather, they won't notice, but break the kernel).
> Examples of such modules would be DRBD, ZFS, quite possibly others.

drbd is upstream -- so what is the problem? (if you are having to
distribute drbd independent of the upstream drbd then why is drbd
upstream?)

As for ZFS, worrying about ZFS kABI breakage is the last thing we should
be doing.

So Nack from me on this defensive make-work for external modules.

2017-03-10 14:55:07

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()



On Fri, 10 Mar 2017, Mike Snitzer wrote:

> On Fri, Mar 10 2017 at 7:34am -0500,
> Lars Ellenberg <[email protected]> wrote:
>
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> > > */
> > > blk_qc_t generic_make_request(struct bio *bio)
> > > {
> > > - struct bio_list bio_list_on_stack;
> > > + /*
> > > + * bio_list_on_stack[0] contains bios submitted by the current
> > > + * make_request_fn.
> > > + * bio_list_on_stack[1] contains bios that were submitted before
> > > + * the current make_request_fn, but that haven't been processed
> > > + * yet.
> > > + */
> > > + struct bio_list bio_list_on_stack[2];
> > > blk_qc_t ret = BLK_QC_T_NONE;
> >
> > May I suggest that, if you intend to assign something that is not a
> > plain &(struct bio_list), but a &(struct bio_list[2]),
> > you change the task member so it is renamed (current->bio_list vs
> > current->bio_lists, plural, is what I did last year).
> > Or you will break external modules, silently, and horribly (or,
> > rather, they won't notice, but break the kernel).
> > Examples of such modules would be DRBD, ZFS, quite possibly others.
>
> drbd is upstream -- so what is the problem? (if you are having to
> distribute drbd independent of the upstream drbd then why is drbd
> upstream?)
>
> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
> be doing.

It's better to make external modules not compile than to silently
introduce bugs in them. So yes, I would rename that.

Mikulas

> So Nack from me on this defensive make-work for external modules.
>

2017-03-10 15:16:31

by Jack Wang

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()



On 10.03.2017 15:55, Mikulas Patocka wrote:
>
>
> On Fri, 10 Mar 2017, Mike Snitzer wrote:
>
>> On Fri, Mar 10 2017 at 7:34am -0500,
>> Lars Ellenberg <[email protected]> wrote:
>>
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
>>>> */
>>>> blk_qc_t generic_make_request(struct bio *bio)
>>>> {
>>>> - struct bio_list bio_list_on_stack;
>>>> + /*
>>>> + * bio_list_on_stack[0] contains bios submitted by the current
>>>> + * make_request_fn.
>>>> + * bio_list_on_stack[1] contains bios that were submitted before
>>>> + * the current make_request_fn, but that haven't been processed
>>>> + * yet.
>>>> + */
>>>> + struct bio_list bio_list_on_stack[2];
>>>> blk_qc_t ret = BLK_QC_T_NONE;
>>>
>>> May I suggest that, if you intend to assign something that is not a
>>> plain &(struct bio_list), but a &(struct bio_list[2]),
>>> you change the task member so it is renamed (current->bio_list vs
>>> current->bio_lists, plural, is what I did last year).
>>> Or you will break external modules, silently, and horribly (or,
>>> rather, they won't notice, but break the kernel).
>>> Examples of such modules would be DRBD, ZFS, quite possibly others.
>>
>> drbd is upstream -- so what is the problem? (if you are having to
>> distribute drbd independent of the upstream drbd then why is drbd
>> upstream?)
>>
>> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
>> be doing.
>
> It's better to make external modules not compile than to silently
> introduce bugs in them. So yes, I would rename that.
>
> Mikulas

Agree, better rename current->bio_list to current->bio_lists

Regards,
Jack

2017-03-10 15:35:59

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On Fri, Mar 10 2017 at 10:07am -0500,
Jack Wang <[email protected]> wrote:

>
>
> On 10.03.2017 15:55, Mikulas Patocka wrote:
> >
> >
> > On Fri, 10 Mar 2017, Mike Snitzer wrote:
> >
> >> On Fri, Mar 10 2017 at 7:34am -0500,
> >> Lars Ellenberg <[email protected]> wrote:
> >>
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> >>>> */
> >>>> blk_qc_t generic_make_request(struct bio *bio)
> >>>> {
> >>>> - struct bio_list bio_list_on_stack;
> >>>> + /*
> >>>> + * bio_list_on_stack[0] contains bios submitted by the current
> >>>> + * make_request_fn.
> >>>> + * bio_list_on_stack[1] contains bios that were submitted before
> >>>> + * the current make_request_fn, but that haven't been processed
> >>>> + * yet.
> >>>> + */
> >>>> + struct bio_list bio_list_on_stack[2];
> >>>> blk_qc_t ret = BLK_QC_T_NONE;
> >>>
> >>> May I suggest that, if you intend to assign something that is not a
> >>> plain &(struct bio_list), but a &(struct bio_list[2]),
> >>> you change the task member so it is renamed (current->bio_list vs
> >>> current->bio_lists, plural, is what I did last year).
> >>> Or you will break external modules, silently, and horribly (or,
> >>> rather, they won't notice, but break the kernel).
> >>> Examples of such modules would be DRBD, ZFS, quite possibly others.
> >>
> >> drbd is upstream -- so what is the problem? (if you are having to
> >> distribute drbd independent of the upstream drbd then why is drbd
> >> upstream?)
> >>
> >> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
> >> be doing.
> >
> > It's better to make external modules not compile than to silently
> > introduce bugs in them. So yes, I would rename that.
> >
> > Mikulas
>
> Agree, better rename current->bio_list to current->bio_lists

Fine, normally wouldn't do so but I'm not so opposed that we need to get
hung up on this detail. If Neil and Jens agree then so be it.

2017-03-10 18:51:18

by Lars Ellenberg

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On Fri, Mar 10, 2017 at 04:07:58PM +0100, Jack Wang wrote:
> On 10.03.2017 15:55, Mikulas Patocka wrote:
> > On Fri, 10 Mar 2017, Mike Snitzer wrote:
> >> On Fri, Mar 10 2017 at 7:34am -0500,
> >> Lars Ellenberg <[email protected]> wrote:
> >>
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> >>>> */
> >>>> blk_qc_t generic_make_request(struct bio *bio)
> >>>> {
> >>>> - struct bio_list bio_list_on_stack;
> >>>> + /*
> >>>> + * bio_list_on_stack[0] contains bios submitted by the current
> >>>> + * make_request_fn.
> >>>> + * bio_list_on_stack[1] contains bios that were submitted before
> >>>> + * the current make_request_fn, but that haven't been processed
> >>>> + * yet.
> >>>> + */
> >>>> + struct bio_list bio_list_on_stack[2];
> >>>> blk_qc_t ret = BLK_QC_T_NONE;
> >>>
> >>> May I suggest that, if you intend to assign something that is not a
> >>> plain &(struct bio_list), but a &(struct bio_list[2]),
> >>> you change the task member so it is renamed (current->bio_list vs
> >>> current->bio_lists, plural, is what I did last year).
> >>> Or you will break external modules, silently, and horribly (or,
> >>> rather, they won't notice, but break the kernel).
> >>> Examples of such modules would be DRBD, ZFS, quite possibly others.

> > It's better to make external modules not compile than to silently
> > introduce bugs in them. So yes, I would rename that.
> >
> > Mikulas
>
> Agree, better rename current->bio_list to current->bio_lists
>
> Regards,
> Jack

Thank you.

(I don't know if someone does, but...)
Thing is: *IF* some external thing messes with
current->bio_list in "interesting" ways, and not just the
"I don't care, one level of real recursion fixes this for me"
pattern of

	struct bio_list *tmp = current->bio_list;
	current->bio_list = NULL;
	submit_bio()
	current->bio_list = tmp;

you get a chance of stack corruption,
without even as much as a compiler warning.

Which is why I wrote https://lkml.org/lkml/2016/7/8/469
...

Instead, I suggest to distinguish between recursive calls to
generic_make_request(), and pushing back the remainder part in
blk_queue_split(), by pointing current->bio_lists to a
	struct recursion_to_iteration_bio_lists {
		struct bio_list recursion;
		struct bio_list queue;
	}

By providing each q->make_request_fn() with an empty "recursion"
bio_list, then merging any recursively submitted bios to the
head of the "queue" list, we can make the recursion-to-iteration
logic in generic_make_request() process deepest level bios first,
and "sibling" bios of the same level in "natural" order.

...

Cheers,

Lars

2017-03-11 00:48:48

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On Fri, Mar 10 2017, Lars Ellenberg wrote:

>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
>> */
>> blk_qc_t generic_make_request(struct bio *bio)
>> {
>> - struct bio_list bio_list_on_stack;
>> + /*
>> + * bio_list_on_stack[0] contains bios submitted by the current
>> + * make_request_fn.
>> + * bio_list_on_stack[1] contains bios that were submitted before
>> + * the current make_request_fn, but that haven't been processed
>> + * yet.
>> + */
>> + struct bio_list bio_list_on_stack[2];
>> blk_qc_t ret = BLK_QC_T_NONE;
>
> May I suggest that, if you intend to assign something that is not a
> plain &(struct bio_list), but a &(struct bio_list[2]),
> you change the task member so it is renamed (current->bio_list vs
> current->bio_lists, plural, is what I did last year).
> Or you will break external modules, silently, and horribly (or,
> rather, they won't notice, but break the kernel).
> Examples of such modules would be DRBD, ZFS, quite possibly others.
>

This is exactly what I did in my first draft (bio_list -> bio_lists),
but then I reverted that change because it didn't seem to be worth the
noise.
It isn't much noise: sched.h, bcache/btree.c, md/dm-bufio.c, and
md/raid1.c get minor changes.
But as I'm hoping to get rid of all of those uses, renaming before
removing seemed pointless ... though admittedly that is what I did for
bioset_create().... I wondered about that too.

The example you give later:

	struct bio_list *tmp = current->bio_list;
	current->bio_list = NULL;
	submit_bio()
	current->bio_list = tmp;

won't cause any problem. Whatever lists the parent generic_make_request
is holding onto will be untouched during the submit_bio() call, and will
be exactly as it expects them when this caller returns.

If some out-of-tree code does anything with ->bio_list that makes sense
with the previous code, then it will still make sense with the new
code. However there will be a few bios that it didn't get to look at.
These will all be bios that were submitted by a device further up the
stack (closer to the filesystem), so they *should* be irrelevant.
I could probably come up with some weird behaviour that might have
worked before but now wouldn't quite work the same way. But just fixing
bugs can sometimes affect an out-of-tree driver in a strange way because
it was assuming those bugs.

I hope that I'll soon be able to remove punt_bios_to_rescuer and
flush_current_bio_list, after which current->bio_list can really be
just a list again. I don't think it is worth changing the name for a
transient situation.

But thanks for the review - it encouraged me to think though the
consequences again and I'm now more confident.
I actually now think that change probably wasn't necessary. It is
safer though. It ensures that current functionality isn't removed
without a clear justification.

Thanks,
NeilBrown

