2017-03-03 05:14:44

by NeilBrown

Subject: [PATCH] blk: improve order of bio handling in generic_make_request()


[ Hi Jens,
you might have seen assorted email threads recently about
deadlocks, particularly in dm-snap or md/raid1/10. Also about
the excess of rescuer threads.
I think a big part of the problem is my ancient improvement
to generic_make_request to queue bios and handle them in
a strict FIFO order. As described below, that can cause
problems which individual block devices cannot fix themselves
without punting to various threads.
This patch does not fix everything, but provides a basis that
drivers can build on to create deadlock-free solutions without
excess threads.
If you accept this, I will look into improving at least md
and bio_alloc_set() to be less dependent on rescuer threads.
Thanks,
NeilBrown
]


To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially. If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.

These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, is not enough to remove the deadlocks. It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocating from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.
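
For illustration, a make_request_fn following this pattern might look
roughly like the sketch below. It is modelled on the existing
bio_split()/bio_chain() usage in blk_queue_split(); my_make_request(),
my_max_sectors(), my_handle_bio() and my_split_set are hypothetical
driver hooks, not part of this patch.

/* hypothetical driver hooks, for illustration only */
extern struct bio_set *my_split_set;
extern unsigned int my_max_sectors(struct request_queue *q, struct bio *bio);
extern blk_qc_t my_handle_bio(struct request_queue *q, struct bio *bio);

static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
{
	unsigned int max = my_max_sectors(q, bio);

	if (bio_sectors(bio) > max) {
		/* 'split' is the first part; 'bio' becomes the remainder */
		struct bio *split = bio_split(bio, max, GFP_NOIO, my_split_set);

		bio_chain(split, bio);
		/* queue the remainder; generic_make_request() will only
		 * process it after everything submitted on behalf of
		 * 'split' has been completely handled
		 */
		generic_make_request(bio);
		bio = split;
	}
	/* handle the (possibly trimmed) first part, then return */
	return my_handle_bio(q, bio);
}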

With this in place, it should be possible to disable the
punt_bios_to_rescuer() rescuer thread for many block devices, and
eventually it may be possible to remove it completely.

Tested-by: Jinpu Wang <[email protected]>
Inspired-by: Lars Ellenberg <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
block/blk-core.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b9e857f4afe8..ef55f210dd7c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
struct request_queue *q = bdev_get_queue(bio->bi_bdev);

if (likely(blk_queue_enter(q, false) == 0)) {
+ struct bio_list hold;
+ struct bio_list lower, same;
+
+ /* Create a fresh bio_list for all subordinate requests */
+ bio_list_init(&hold);
+ bio_list_merge(&hold, &bio_list_on_stack);
+ bio_list_init(&bio_list_on_stack);
ret = q->make_request_fn(q, bio);

blk_queue_exit(q);

+ /* sort new bios into those for a lower level
+ * and those for the same level
+ */
+ bio_list_init(&lower);
+ bio_list_init(&same);
+ while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+ if (q == bdev_get_queue(bio->bi_bdev))
+ bio_list_add(&same, bio);
+ else
+ bio_list_add(&lower, bio);
+ /* now assemble so we handle the lowest level first */
+ bio_list_merge(&bio_list_on_stack, &lower);
+ bio_list_merge(&bio_list_on_stack, &same);
+ bio_list_merge(&bio_list_on_stack, &hold);
+
bio = bio_list_pop(current->bio_list);
} else {
struct bio *bio_next = bio_list_pop(current->bio_list);
--
2.11.0



2017-03-03 09:29:58

by Jack Wang

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()



On 03.03.2017 06:14, NeilBrown wrote:
>
> [ Hi Jens,
> you might have seen assorted email threads recently about
> deadlocks, particular in dm-snap or md/raid1/10. Also about
> the excess of rescuer threads.
> I think a big part of the problem is my ancient improvement
> to generic_make_request to queue bios and handle them in
> a strict FIFO order. As described below, that can cause
> problems which individual block devices cannot fix themselves
> without punting to various threads.
> This patch does not fix everything, but provides a basis that
> drives can build on to create dead-lock free solutions without
> excess threads.
> If you accept this, I will look into improving at least md
> and bio_alloc_set() to be less dependant on rescuer threads.
> Thanks,
> NeilBrown
> ]
>
>
> To avoid recursion on the kernel stack when stacked block devices
> are in use, generic_make_request() will, when called recursively,
> queue new requests for later handling. They will be handled when the
> make_request_fn for the current bio completes.
>
> If any bios are submitted by a make_request_fn, these will ultimately
> handled seqeuntially. If the handling of one of those generates
> further requests, they will be added to the end of the queue.
>
> This strict first-in-first-out behaviour can lead to deadlocks in
> various ways, normally because a request might need to wait for a
> previous request to the same device to complete. This can happen when
> they share a mempool, and can happen due to interdependencies
> particular to the device. Both md and dm have examples where this happens.
>
> These deadlocks can be erradicated by more selective ordering of bios.
> Specifically by handling them in depth-first order. That is: when the
> handling of one bio generates one or more further bios, they are
> handled immediately after the parent, before any siblings of the
> parent. That way, when generic_make_request() calls make_request_fn
> for some particular device, it we can be certain that all previously
> submited request for that device have been completely handled and are
> not waiting for anything in the queue of requests maintained in
> generic_make_request().
>
> An easy way to achieve this would be to use a last-in-first-out stack
> instead of a queue. However this will change the order of consecutive
> bios submitted by a make_request_fn, which could have unexpected consequences.
> Instead we take a slightly more complex approach.
> A fresh queue is created for each call to a make_request_fn. After it completes,
> any bios for a different device are placed on the front of the main queue, followed
> by any bios for the same device, followed by all bios that were already on
> the queue before the make_request_fn was called.
> This provides the depth-first approach without reordering bios on the same level.
>
> This, by itself, it not enough to remove the deadlocks. It just makes
> it possible for drivers to take the extra step required themselves.
>
> To avoid deadlocks, drivers must never risk waiting for a request
> after submitting one to generic_make_request. This includes never
> allocing from a mempool twice in the one call to a make_request_fn.
>
> A common pattern in drivers is to call bio_split() in a loop, handling
> the first part and then looping around to possibly split the next part.
> Instead, a driver that finds it needs to split a bio should queue
> (with generic_make_request) the second part, handle the first part,
> and then return. The new code in generic_make_request will ensure the
> requests to underlying bios are processed first, then the second bio
> that was split off. If it splits again, the same process happens. In
> each case one bio will be completely handled before the next one is attempted.
>
> With this is place, it should be possible to disable the
> punt_bios_to_recover() recovery thread for many block devices, and
> eventually it may be possible to remove it completely.
>
> Tested-by: Jinpu Wang <[email protected]>
> Inspired-by: Lars Ellenberg <[email protected]>
> Signed-off-by: NeilBrown <[email protected]>
> ---
> block/blk-core.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b9e857f4afe8..ef55f210dd7c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
> struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
> if (likely(blk_queue_enter(q, false) == 0)) {
> + struct bio_list hold;
> + struct bio_list lower, same;
> +
> + /* Create a fresh bio_list for all subordinate requests */
> + bio_list_init(&hold);
> + bio_list_merge(&hold, &bio_list_on_stack);
> + bio_list_init(&bio_list_on_stack);
> ret = q->make_request_fn(q, bio);
>
> blk_queue_exit(q);
>
> + /* sort new bios into those for a lower level
> + * and those for the same level
> + */
> + bio_list_init(&lower);
> + bio_list_init(&same);
> + while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> + if (q == bdev_get_queue(bio->bi_bdev))
> + bio_list_add(&same, bio);
> + else
> + bio_list_add(&lower, bio);
> + /* now assemble so we handle the lowest level first */
> + bio_list_merge(&bio_list_on_stack, &lower);
> + bio_list_merge(&bio_list_on_stack, &same);
> + bio_list_merge(&bio_list_on_stack, &hold);
> +
> bio = bio_list_pop(current->bio_list);
> } else {
> struct bio *bio_next = bio_list_pop(current->bio_list);
>

Thanks Neil for pushing the fix.

We can optimize generic_make_request a little bit:
- assign the bio_list struct 'hold' directly instead of init and merge
- remove duplicate code

I think it is better to squash this into your fix.
---
block/blk-core.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3bc7202..b29b7e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
struct bio_list lower, same, hold;

/* Create a fresh bio_list for all subordinate requests */
- bio_list_init(&hold);
- bio_list_merge(&hold, &bio_list_on_stack);
+ hold = bio_list_on_stack;
bio_list_init(&bio_list_on_stack);

ret = q->make_request_fn(q, bio);
@@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_merge(&bio_list_on_stack, &lower);
bio_list_merge(&bio_list_on_stack, &same);
bio_list_merge(&bio_list_on_stack, &hold);
-
- bio = bio_list_pop(current->bio_list);
} else {
- struct bio *bio_next = bio_list_pop(current->bio_list);
-
bio_io_error(bio);
- bio = bio_next;
}
+ bio = bio_list_pop(current->bio_list);
} while (bio);
current->bio_list = NULL; /* deactivate */

--

Regards,
Jack

2017-03-06 04:40:47

by NeilBrown

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()

On Fri, Mar 03 2017, Jack Wang wrote:
>
> Thanks Neil for pushing the fix.
>
> We can optimize generic_make_request a little bit:
> - assign bio_list struct hold directly instead init and merge
> - remove duplicate code
>
> I think better to squash into your fix.

Hi Jack,
I don't object to your changes, but I'd like to see a response from
Jens first.
My preference would be to get the original patch in, then other changes
that build on it, such as this one, can be added. Until the core
change lands, any other work is pointless.

Of course if Jens wants this merged before he'll apply it, I'll
happily do that.

Thanks,
NeilBrown



> ---
> block/blk-core.c | 9 ++-------
> 1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3bc7202..b29b7e5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
> struct bio_list lower, same, hold;
>
> /* Create a fresh bio_list for all subordinate requests */
> - bio_list_init(&hold);
> - bio_list_merge(&hold, &bio_list_on_stack);
> + hold = bio_list_on_stack;
> bio_list_init(&bio_list_on_stack);
>
> ret = q->make_request_fn(q, bio);
> @@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
> bio_list_merge(&bio_list_on_stack, &lower);
> bio_list_merge(&bio_list_on_stack, &same);
> bio_list_merge(&bio_list_on_stack, &hold);
> -
> - bio = bio_list_pop(current->bio_list);
> } else {
> - struct bio *bio_next = bio_list_pop(current->bio_list);
> -
> bio_io_error(bio);
> - bio = bio_next;
> }
> + bio = bio_list_pop(current->bio_list);
> } while (bio);
> current->bio_list = NULL; /* deactivate */
>
> --
>
> Regards,
> Jack



2017-03-06 09:47:06

by Jack Wang

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()



On 06.03.2017 05:40, NeilBrown wrote:
> On Fri, Mar 03 2017, Jack Wang wrote:
>>
>> Thanks Neil for pushing the fix.
>>
>> We can optimize generic_make_request a little bit:
>> - assign bio_list struct hold directly instead init and merge
>> - remove duplicate code
>>
>> I think better to squash into your fix.
>
> Hi Jack,
> I don't object to your changes, but I'd like to see a response from
> Jens first.
> My preference would be to get the original patch in, then other changes
> that build on it, such as this one, can be added. Until the core
> changes lands, any other work is pointless.
>
> Of course if Jens wants a this merged before he'll apply it, I'll
> happily do that.
>
> Thanks,
> NeilBrown

Hi Neil,

Totally agree, let's wait for Jens's decision.

Hi Jens,

Please consider this fix also for stable 4.3+

Thanks,
Jack Wang

>
>
>
>> ---
>> block/blk-core.c | 9 ++-------
>> 1 file changed, 2 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 3bc7202..b29b7e5 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>> struct bio_list lower, same, hold;
>>
>> /* Create a fresh bio_list for all subordinate requests */
>> - bio_list_init(&hold);
>> - bio_list_merge(&hold, &bio_list_on_stack);
>> + hold = bio_list_on_stack;
>> bio_list_init(&bio_list_on_stack);
>>
>> ret = q->make_request_fn(q, bio);
>> @@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
>> bio_list_merge(&bio_list_on_stack, &lower);
>> bio_list_merge(&bio_list_on_stack, &same);
>> bio_list_merge(&bio_list_on_stack, &hold);
>> -
>> - bio = bio_list_pop(current->bio_list);
>> } else {
>> - struct bio *bio_next = bio_list_pop(current->bio_list);
>> -
>> bio_io_error(bio);
>> - bio = bio_next;
>> }
>> + bio = bio_list_pop(current->bio_list);
>> } while (bio);
>> current->bio_list = NULL; /* deactivate */
>>
>> --
>>
>> Regards,
>> Jack

2017-03-07 02:03:50

by Jens Axboe

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()

On 03/05/2017 09:40 PM, NeilBrown wrote:
> On Fri, Mar 03 2017, Jack Wang wrote:
>>
>> Thanks Neil for pushing the fix.
>>
>> We can optimize generic_make_request a little bit:
>> - assign bio_list struct hold directly instead init and merge
>> - remove duplicate code
>>
>> I think better to squash into your fix.
>
> Hi Jack,
> I don't object to your changes, but I'd like to see a response from
> Jens first.
> My preference would be to get the original patch in, then other changes
> that build on it, such as this one, can be added. Until the core
> changes lands, any other work is pointless.
>
> Of course if Jens wants a this merged before he'll apply it, I'll
> happily do that.

I like the change, and thanks for tackling this. It's been a pending
issue for way too long. I do think we should squash Jack's patch
into the original, as it does clean up the code nicely.

Do we have a proper test case for this, so we can verify that it
does indeed also work in practice?

--
Jens Axboe

2017-03-07 08:51:21

by Jack Wang

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()



On 06.03.2017 21:18, Jens Axboe wrote:
> On 03/05/2017 09:40 PM, NeilBrown wrote:
>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>
>>> Thanks Neil for pushing the fix.
>>>
>>> We can optimize generic_make_request a little bit:
>>> - assign bio_list struct hold directly instead init and merge
>>> - remove duplicate code
>>>
>>> I think better to squash into your fix.
>>
>> Hi Jack,
>> I don't object to your changes, but I'd like to see a response from
>> Jens first.
>> My preference would be to get the original patch in, then other changes
>> that build on it, such as this one, can be added. Until the core
>> changes lands, any other work is pointless.
>>
>> Of course if Jens wants a this merged before he'll apply it, I'll
>> happily do that.
>
> I like the change, and thanks for tackling this. It's been a pending
> issue for way too long. I do think we should squash Jack's patch
> into the original, as it does clean up the code nicely.
>
> Do we have a proper test case for this, so we can verify that it
> does indeed also work in practice?
>
Hi Jens,

I can trigger a deadlock in RAID1 with the test below:

I create one md with one local loop device and one remote scsi device
exported by SRP. I run fio with mixed rw on top of the md, then force_close
the session on the storage side. mdx_raid1 is waiting on free_array in D state,
and a lot of fio processes are also in D state in wait_barrier.

With the patch from Neil above, I can no longer trigger it.

The discussion was in link below:
http://www.spinics.net/lists/raid/msg54680.html

Thanks,
Jack Wang

2017-03-07 15:47:13

by Pavel Machek

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()

On Mon 2017-03-06 10:43:59, Jack Wang wrote:
>
>
> On 06.03.2017 05:40, NeilBrown wrote:
> > On Fri, Mar 03 2017, Jack Wang wrote:
> >>
> >> Thanks Neil for pushing the fix.
> >>
> >> We can optimize generic_make_request a little bit:
> >> - assign bio_list struct hold directly instead init and merge
> >> - remove duplicate code
> >>
> >> I think better to squash into your fix.
> >
> > Hi Jack,
> > I don't object to your changes, but I'd like to see a response from
> > Jens first.
> > My preference would be to get the original patch in, then other changes
> > that build on it, such as this one, can be added. Until the core
> > changes lands, any other work is pointless.
> >
> > Of course if Jens wants a this merged before he'll apply it, I'll
> > happily do that.
> >
> > Thanks,
> > NeilBrown
>
> Hi Neil,
>
> Totally agree, let's wait for Jens's decision.
>
> Hi Jens,
>
> Please consider this fix also for stable 4.3+

Stable? We don't put this into stable, with the exception of minimal fixes
for real bugs...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-03-07 16:21:54

by Jens Axboe

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()

On 03/07/2017 08:46 AM, Pavel Machek wrote:
> On Mon 2017-03-06 10:43:59, Jack Wang wrote:
>>
>>
>> On 06.03.2017 05:40, NeilBrown wrote:
>>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>>
>>>> Thanks Neil for pushing the fix.
>>>>
>>>> We can optimize generic_make_request a little bit:
>>>> - assign bio_list struct hold directly instead init and merge
>>>> - remove duplicate code
>>>>
>>>> I think better to squash into your fix.
>>>
>>> Hi Jack,
>>> I don't object to your changes, but I'd like to see a response from
>>> Jens first.
>>> My preference would be to get the original patch in, then other changes
>>> that build on it, such as this one, can be added. Until the core
>>> changes lands, any other work is pointless.
>>>
>>> Of course if Jens wants a this merged before he'll apply it, I'll
>>> happily do that.
>>>
>>> Thanks,
>>> NeilBrown
>>
>> Hi Neil,
>>
>> Totally agree, let's wait for Jens's decision.
>>
>> Hi Jens,
>>
>> Please consider this fix also for stable 4.3+
>
> Stable? We don't put this into stable, with exception of minimal fixes
> for real bugs...

What are you smoking? How is this not a real bug?

--
Jens Axboe

2017-03-07 17:13:01

by Jens Axboe

Subject: Re: blk: improve order of bio handling in generic_make_request()

On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> On Tue, Mar 07 2017 at 3:49am -0500,
> Jack Wang <[email protected]> wrote:
>
>>
>>
>> On 06.03.2017 21:18, Jens Axboe wrote:
>>> On 03/05/2017 09:40 PM, NeilBrown wrote:
>>>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>>>
>>>>> Thanks Neil for pushing the fix.
>>>>>
>>>>> We can optimize generic_make_request a little bit:
>>>>> - assign bio_list struct hold directly instead init and merge
>>>>> - remove duplicate code
>>>>>
>>>>> I think better to squash into your fix.
>>>>
>>>> Hi Jack,
>>>> I don't object to your changes, but I'd like to see a response from
>>>> Jens first.
>>>> My preference would be to get the original patch in, then other changes
>>>> that build on it, such as this one, can be added. Until the core
>>>> changes lands, any other work is pointless.
>>>>
>>>> Of course if Jens wants a this merged before he'll apply it, I'll
>>>> happily do that.
>>>
>>> I like the change, and thanks for tackling this. It's been a pending
>>> issue for way too long. I do think we should squash Jack's patch
>>> into the original, as it does clean up the code nicely.
>>>
>>> Do we have a proper test case for this, so we can verify that it
>>> does indeed also work in practice?
>>>
>> Hi Jens,
>>
>> I can trigger deadlock with in RAID1 with test below:
>>
>> I create one md with one local loop device and one remote scsi
>> exported by SRP. running fio with mix rw on top of md, force_close
>> session on storage side. mdx_raid1 is wait on free_array in D state,
>> and a lot of fio also in D state in wait_barrier.
>>
>> With the patch from Neil above, I can no longer trigger it anymore.
>>
>> The discussion was in link below:
>> http://www.spinics.net/lists/raid/msg54680.html
>
> In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> albeit unpolished/needy to get running, see:
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html

Can you run this patch with that test, reverting your DM workaround?

--
Jens Axboe

2017-03-07 17:32:59

by Mike Snitzer

Subject: Re: blk: improve order of bio handling in generic_make_request()

On Tue, Mar 07 2017 at 12:05pm -0500,
Jens Axboe <[email protected]> wrote:

> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> > On Tue, Mar 07 2017 at 3:49am -0500,
> > Jack Wang <[email protected]> wrote:
> >
> >>
> >>
> >> On 06.03.2017 21:18, Jens Axboe wrote:
> >>> On 03/05/2017 09:40 PM, NeilBrown wrote:
> >>>> On Fri, Mar 03 2017, Jack Wang wrote:
> >>>>>
> >>>>> Thanks Neil for pushing the fix.
> >>>>>
> >>>>> We can optimize generic_make_request a little bit:
> >>>>> - assign bio_list struct hold directly instead init and merge
> >>>>> - remove duplicate code
> >>>>>
> >>>>> I think better to squash into your fix.
> >>>>
> >>>> Hi Jack,
> >>>> I don't object to your changes, but I'd like to see a response from
> >>>> Jens first.
> >>>> My preference would be to get the original patch in, then other changes
> >>>> that build on it, such as this one, can be added. Until the core
> >>>> changes lands, any other work is pointless.
> >>>>
> >>>> Of course if Jens wants a this merged before he'll apply it, I'll
> >>>> happily do that.
> >>>
> >>> I like the change, and thanks for tackling this. It's been a pending
> >>> issue for way too long. I do think we should squash Jack's patch
> >>> into the original, as it does clean up the code nicely.
> >>>
> >>> Do we have a proper test case for this, so we can verify that it
> >>> does indeed also work in practice?
> >>>
> >> Hi Jens,
> >>
> >> I can trigger deadlock with in RAID1 with test below:
> >>
> >> I create one md with one local loop device and one remote scsi
> >> exported by SRP. running fio with mix rw on top of md, force_close
> >> session on storage side. mdx_raid1 is wait on free_array in D state,
> >> and a lot of fio also in D state in wait_barrier.
> >>
> >> With the patch from Neil above, I can no longer trigger it anymore.
> >>
> >> The discussion was in link below:
> >> http://www.spinics.net/lists/raid/msg54680.html
> >
> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> > albeit unpolished/needy to get running, see:
> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
>
> Can you run this patch with that test, reverting your DM workaround?

Yeap, will do. Last time Mikulas tried a similar patch it still
deadlocked. But I'll give it a go (likely tomorrow).

2017-03-07 17:47:57

by Mike Snitzer

Subject: Re: blk: improve order of bio handling in generic_make_request()

On Tue, Mar 07 2017 at 3:49am -0500,
Jack Wang <[email protected]> wrote:

>
>
> On 06.03.2017 21:18, Jens Axboe wrote:
> > On 03/05/2017 09:40 PM, NeilBrown wrote:
> >> On Fri, Mar 03 2017, Jack Wang wrote:
> >>>
> >>> Thanks Neil for pushing the fix.
> >>>
> >>> We can optimize generic_make_request a little bit:
> >>> - assign bio_list struct hold directly instead init and merge
> >>> - remove duplicate code
> >>>
> >>> I think better to squash into your fix.
> >>
> >> Hi Jack,
> >> I don't object to your changes, but I'd like to see a response from
> >> Jens first.
> >> My preference would be to get the original patch in, then other changes
> >> that build on it, such as this one, can be added. Until the core
> >> changes lands, any other work is pointless.
> >>
> >> Of course if Jens wants a this merged before he'll apply it, I'll
> >> happily do that.
> >
> > I like the change, and thanks for tackling this. It's been a pending
> > issue for way too long. I do think we should squash Jack's patch
> > into the original, as it does clean up the code nicely.
> >
> > Do we have a proper test case for this, so we can verify that it
> > does indeed also work in practice?
> >
> Hi Jens,
>
> I can trigger deadlock with in RAID1 with test below:
>
> I create one md with one local loop device and one remote scsi
> exported by SRP. running fio with mix rw on top of md, force_close
> session on storage side. mdx_raid1 is wait on free_array in D state,
> and a lot of fio also in D state in wait_barrier.
>
> With the patch from Neil above, I can no longer trigger it anymore.
>
> The discussion was in link below:
> http://www.spinics.net/lists/raid/msg54680.html

In addition to Jack's MD raid test there is a DM snapshot deadlock test,
albeit unpolished/needy to get running, see:
https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html

But to actually test block core's ability to handle this, upstream
commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 ("dm: flush queued bios
when process blocks to avoid deadlock") would need to be reverted.

Also, I know Lars had a drbd deadlock too. Not sure if Jack's MD test
is sufficient coverage for drbd. Lars?

2017-03-07 18:52:02

by Jack Wang

Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()



On 07.03.2017 16:46, Pavel Machek wrote:
> On Mon 2017-03-06 10:43:59, Jack Wang wrote:
>>
>>
>> On 06.03.2017 05:40, NeilBrown wrote:
>>> On Fri, Mar 03 2017, Jack Wang wrote:
>>>>
>>>> Thanks Neil for pushing the fix.
>>>>
>>>> We can optimize generic_make_request a little bit:
>>>> - assign bio_list struct hold directly instead init and merge
>>>> - remove duplicate code
>>>>
>>>> I think better to squash into your fix.
>>>
>>> Hi Jack,
>>> I don't object to your changes, but I'd like to see a response from
>>> Jens first.
>>> My preference would be to get the original patch in, then other changes
>>> that build on it, such as this one, can be added. Until the core
>>> changes lands, any other work is pointless.
>>>
>>> Of course if Jens wants a this merged before he'll apply it, I'll
>>> happily do that.
>>>
>>> Thanks,
>>> NeilBrown
>>
>> Hi Neil,
>>
>> Totally agree, let's wait for Jens's decision.
>>
>> Hi Jens,
>>
>> Please consider this fix also for stable 4.3+
>
> Stable? We don't put this into stable, with exception of minimal fixes
> for real bugs...
> Pavel
>
It indeed fixes the deadlock in the RAID1 case.
Please follow the link in my reply to Jens regarding the test.
It did hit us in production.

Thanks,
Jack

2017-03-07 20:44:27

by NeilBrown

Subject: [PATCH v2] blk: improve order of bio handling in generic_make_request()


To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially. If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.

These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, is not enough to remove all deadlocks. It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocating from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.

With this in place, it should be possible to disable the
punt_bios_to_rescuer() rescuer thread for many block devices, and
eventually it may be possible to remove it completely.

Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <[email protected]>
Inspired-by: Lars Ellenberg <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
block/blk-core.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)

Changes since v1:
- merge code improvements from Jack Wang
- more edits to changelog comment
- add Ref: link.
- Add some lists to Cc, that should have been there the first time.



diff --git a/block/blk-core.c b/block/blk-core.c
index b9e857f4afe8..9520e82aa78c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2018,17 +2018,34 @@ blk_qc_t generic_make_request(struct bio *bio)
struct request_queue *q = bdev_get_queue(bio->bi_bdev);

if (likely(blk_queue_enter(q, false) == 0)) {
+ struct bio_list hold;
+ struct bio_list lower, same;
+
+ /* Create a fresh bio_list for all subordinate requests */
+ hold = bio_list_on_stack;
+ bio_list_init(&bio_list_on_stack);
ret = q->make_request_fn(q, bio);

blk_queue_exit(q);

- bio = bio_list_pop(current->bio_list);
+ /* sort new bios into those for a lower level
+ * and those for the same level
+ */
+ bio_list_init(&lower);
+ bio_list_init(&same);
+ while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+ if (q == bdev_get_queue(bio->bi_bdev))
+ bio_list_add(&same, bio);
+ else
+ bio_list_add(&lower, bio);
+ /* now assemble so we handle the lowest level first */
+ bio_list_merge(&bio_list_on_stack, &lower);
+ bio_list_merge(&bio_list_on_stack, &same);
+ bio_list_merge(&bio_list_on_stack, &hold);
} else {
- struct bio *bio_next = bio_list_pop(current->bio_list);
-
bio_io_error(bio);
- bio = bio_next;
}
+ bio = bio_list_pop(current->bio_list);
} while (bio);
current->bio_list = NULL; /* deactivate */

--
2.12.0



2017-03-07 21:14:23

by NeilBrown

Subject: Re: blk: improve order of bio handling in generic_make_request()

On Tue, Mar 07 2017, Mike Snitzer wrote:

> On Tue, Mar 07 2017 at 12:05pm -0500,
> Jens Axboe <[email protected]> wrote:
>
>> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
>> > On Tue, Mar 07 2017 at 3:49am -0500,
>> > Jack Wang <[email protected]> wrote:
>> >
>> >>
>> >>
>> >> On 06.03.2017 21:18, Jens Axboe wrote:
>> >>> On 03/05/2017 09:40 PM, NeilBrown wrote:
>> >>>> On Fri, Mar 03 2017, Jack Wang wrote:
>> >>>>>
>> >>>>> Thanks Neil for pushing the fix.
>> >>>>>
>> >>>>> We can optimize generic_make_request a little bit:
>> >>>>> - assign bio_list struct hold directly instead init and merge
>> >>>>> - remove duplicate code
>> >>>>>
>> >>>>> I think better to squash into your fix.
>> >>>>
>> >>>> Hi Jack,
>> >>>> I don't object to your changes, but I'd like to see a response from
>> >>>> Jens first.
>> >>>> My preference would be to get the original patch in, then other changes
>> >>>> that build on it, such as this one, can be added. Until the core
>> >>>> changes lands, any other work is pointless.
>> >>>>
>> >>>> Of course if Jens wants a this merged before he'll apply it, I'll
>> >>>> happily do that.
>> >>>
>> >>> I like the change, and thanks for tackling this. It's been a pending
>> >>> issue for way too long. I do think we should squash Jack's patch
>> >>> into the original, as it does clean up the code nicely.
>> >>>
>> >>> Do we have a proper test case for this, so we can verify that it
>> >>> does indeed also work in practice?
>> >>>
>> >> Hi Jens,
>> >>
>> >> I can trigger deadlock with in RAID1 with test below:
>> >>
>> >> I create one md with one local loop device and one remote scsi
>> >> exported by SRP. running fio with mix rw on top of md, force_close
>> >> session on storage side. mdx_raid1 is wait on free_array in D state,
>> >> and a lot of fio also in D state in wait_barrier.
>> >>
>> >> With the patch from Neil above, I can no longer trigger it anymore.
>> >>
>> >> The discussion was in link below:
>> >> http://www.spinics.net/lists/raid/msg54680.html
>> >
>> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
>> > albeit unpolished/needy to get running, see:
>> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
>>
>> Can you run this patch with that test, reverting your DM workaround?
>
> Yeap, will do. Last time Mikulas tried a similar patch it still
> deadlocked. But I'll give it a go (likely tomorrow).

I don't think this will fix the DM snapshot deadlock by itself.
Rather, it makes it possible for some internal changes to DM to fix it.
The DM change might be something vaguely like:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5664f3..06ee0960e415 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)

len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);

+ if (len < ci->sector_count) {
+ struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
+ bio_chain(split, bio);
+ generic_make_request(bio);
+ bio = split;
+ ci->sector_count = len;
+ }
+
r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
if (r < 0)
return r;

Instead of looping inside DM, this change causes the remainder to be
passed to generic_make_request() and DM only handles one region at a
time. So there is only one loop, in the top generic_make_request().
That loop will not reliable handle bios in the "right" order.

Thanks,
NeilBrown



2017-03-08 00:56:09

by Mike Snitzer

Subject: Re: blk: improve order of bio handling in generic_make_request()

On Tue, Mar 07 2017 at 3:29pm -0500,
NeilBrown <[email protected]> wrote:

> On Tue, Mar 07 2017, Mike Snitzer wrote:
>
> > On Tue, Mar 07 2017 at 12:05pm -0500,
> > Jens Axboe <[email protected]> wrote:
> >
> >> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> >> >
> >> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> >> > albeit unpolished/needy to get running, see:
> >> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> >>
> >> Can you run this patch with that test, reverting your DM workaround?
> >
> > Yeap, will do. Last time Mikulas tried a similar patch it still
> > deadlocked. But I'll give it a go (likely tomorrow).
>
> I don't think this will fix the DM snapshot deadlock by itself.
> Rather, it make it possible for some internal changes to DM to fix it.
> The DM change might be something vaguely like:
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5664f3..06ee0960e415 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
>
> len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
>
> + if (len < ci->sector_count) {
> + struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
> + bio_chain(split, bio);
> + generic_make_request(bio);
> + bio = split;
> + ci->sector_count = len;
> + }
> +
> r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
> if (r < 0)
> return r;
>
> Instead of looping inside DM, this change causes the remainder to be
> passed to generic_make_request() and DM only handles or region at a
> time. So there is only one loop, in the top generic_make_request().
> That loop will not reliable handle bios in the "right" order.

s/not reliable/now reliably/ ? ;)

But thanks for the suggestion Neil. Will dig in once I get through a
backlog of other DM target code I have queued for 4.12 review.

Mike

2017-03-08 12:52:06

by Lars Ellenberg

Subject: Re: blk: improve order of bio handling in generic_make_request()

On 7 March 2017 at 17:52, Mike Snitzer <[email protected]> wrote:

> > On 06.03.2017 21:18, Jens Axboe wrote:
> > > I like the change, and thanks for tackling this. It's been a pending
> > > issue for way too long. I do think we should squash Jack's patch
> > > into the original, as it does clean up the code nicely.
> > >
> > > Do we have a proper test case for this, so we can verify that it
> > > does indeed also work in practice?
> > >
> > Hi Jens,
> >
> > I can trigger deadlock with in RAID1 with test below:
> >
> > I create one md with one local loop device and one remote scsi
> > exported by SRP. running fio with mix rw on top of md, force_close
> > session on storage side. mdx_raid1 is wait on free_array in D state,
> > and a lot of fio also in D state in wait_barrier.
> >
> > With the patch from Neil above, I can no longer trigger it anymore.
> >
> > The discussion was in link below:
> > http://www.spinics.net/lists/raid/msg54680.html
>
> In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> albeit unpolished/needy to get running, see:
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
>
> But to actually test block core's ability to handle this, upstream
> commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 ("dm: flush queued bios
> when process blocks to avoid deadlock") would need to be reverted.
>
> Also, I know Lars had a drbd deadlock too. Not sure if Jack's MD test
> is sufficient to coverage for drbd. Lars?
>

As this is just a slightly different implementation, trading some bytes of
stack for more local, self-contained, "obvious" code changes (good job!),
but following the same basic idea as my original RFC [*] (see the
"Inspired-by" tag), I have no doubt it fixes the issues we are able to
provoke with DRBD.

[*] https://lkml.org/lkml/2016/7/19/263
(where I also already suggested fixing the device-mapper issues
by losing the in-device-mapper loop,
relying on the loop in generic_make_request())

Cheers,
Lars

2017-03-08 16:49:58

by Mikulas Patocka

Subject: Re: blk: improve order of bio handling in generic_make_request()



On Wed, 8 Mar 2017, NeilBrown wrote:

> On Tue, Mar 07 2017, Mike Snitzer wrote:
>
> > On Tue, Mar 07 2017 at 12:05pm -0500,
> > Jens Axboe <[email protected]> wrote:
> >
> >> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> >> > On Tue, Mar 07 2017 at 3:49am -0500,
> >> > Jack Wang <[email protected]> wrote:
> >> >
> >> >>
> >> >>
> >> >> On 06.03.2017 21:18, Jens Axboe wrote:
> >> >>> On 03/05/2017 09:40 PM, NeilBrown wrote:
> >> >>>> On Fri, Mar 03 2017, Jack Wang wrote:
> >> >>>>>
> >> >>>>> Thanks Neil for pushing the fix.
> >> >>>>>
> >> >>>>> We can optimize generic_make_request a little bit:
> >> >>>>> - assign bio_list struct hold directly instead init and merge
> >> >>>>> - remove duplicate code
> >> >>>>>
> >> >>>>> I think better to squash into your fix.
> >> >>>>
> >> >>>> Hi Jack,
> >> >>>> I don't object to your changes, but I'd like to see a response from
> >> >>>> Jens first.
> >> >>>> My preference would be to get the original patch in, then other changes
> >> >>>> that build on it, such as this one, can be added. Until the core
> >> >>>> changes lands, any other work is pointless.
> >> >>>>
> >> >>>> Of course if Jens wants a this merged before he'll apply it, I'll
> >> >>>> happily do that.
> >> >>>
> >> >>> I like the change, and thanks for tackling this. It's been a pending
> >> >>> issue for way too long. I do think we should squash Jack's patch
> >> >>> into the original, as it does clean up the code nicely.
> >> >>>
> >> >>> Do we have a proper test case for this, so we can verify that it
> >> >>> does indeed also work in practice?
> >> >>>
> >> >> Hi Jens,
> >> >>
> >> >> I can trigger deadlock with in RAID1 with test below:
> >> >>
> >> >> I create one md with one local loop device and one remote scsi
> >> >> exported by SRP. running fio with mix rw on top of md, force_close
> >> >> session on storage side. mdx_raid1 is wait on free_array in D state,
> >> >> and a lot of fio also in D state in wait_barrier.
> >> >>
> >> >> With the patch from Neil above, I can no longer trigger it anymore.
> >> >>
> >> >> The discussion was in link below:
> >> >> http://www.spinics.net/lists/raid/msg54680.html
> >> >
> >> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> >> > albeit unpolished/needy to get running, see:
> >> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> >>
> >> Can you run this patch with that test, reverting your DM workaround?
> >
> > Yeap, will do. Last time Mikulas tried a similar patch it still
> > deadlocked. But I'll give it a go (likely tomorrow).
>
> I don't think this will fix the DM snapshot deadlock by itself.
> Rather, it make it possible for some internal changes to DM to fix it.
> The DM change might be something vaguely like:
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5664f3..06ee0960e415 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
>
> len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
>
> + if (len < ci->sector_count) {
> + struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);

fs_bio_set is a shared bio set, so it is prone to deadlocks. For this
change, we would need two bio sets per dm device, one for the split bio
and one for the outgoing bio. (this also means having one more kernel
thread per dm device)
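
As a rough sketch of the split-bio side of that, assuming the existing
bioset_create()/bio_split() interfaces; the example_* names are purely
illustrative, not existing dm code:

/* illustrative only: a device-local pool used for split bios */
struct example_dev {
	struct bio_set *split_bs;
};

static int example_alloc_split_bs(struct example_dev *ed)
{
	ed->split_bs = bioset_create(BIO_POOL_SIZE, 0);
	return ed->split_bs ? 0 : -ENOMEM;
}

static struct bio *example_split(struct example_dev *ed,
				 struct bio *bio, int sectors)
{
	/* split from the device-local pool, not the shared fs_bio_set */
	return bio_split(bio, sectors, GFP_NOIO, ed->split_bs);
}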

It would be possible to avoid having two bio sets if the incoming bio were
the same as the outgoing bio (allocate a small structure, move bi_end_io
and bi_private into it, replace bi_end_io and bi_private with pointers to
device mapper and send the bio to the target driver), but it would need
many more changes - basically rewriting the whole bio handling code in dm.c
and in the targets.
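
A rough sketch of the "reuse the incoming bio" idea described above; the
endio_hook structure and the hook/unhook helpers are hypothetical, meant
only to show the bi_end_io/bi_private juggling, not a dm implementation:

/* hypothetical: save and later restore a bio's completion routing */
struct endio_hook {
	bio_end_io_t	*orig_end_io;
	void		*orig_private;
};

static void hook_bio(struct endio_hook *h, struct bio *bio,
		     bio_end_io_t *end_io, void *private)
{
	h->orig_end_io = bio->bi_end_io;
	h->orig_private = bio->bi_private;
	bio->bi_end_io = end_io;	/* completion returns to dm first */
	bio->bi_private = private;
}

static void unhook_bio(struct endio_hook *h, struct bio *bio)
{
	bio->bi_end_io = h->orig_end_io;	/* hand back to the original owner */
	bio->bi_private = h->orig_private;
}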

Mikulas

> + bio_chain(split, bio);
> + generic_make_request(bio);
> + bio = split;
> + ci->sector_count = len;
> + }
> +
> r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
> if (r < 0)
> return r;
>
> Instead of looping inside DM, this change causes the remainder to be
> passed to generic_make_request() and DM only handles or region at a
> time. So there is only one loop, in the top generic_make_request().
> That loop will not reliable handle bios in the "right" order.
>
> Thanks,
> NeilBrown
>

2017-03-08 17:32:21

by Lars Ellenberg

Subject: Re: blk: improve order of bio handling in generic_make_request()

On 8 March 2017 at 17:40, Mikulas Patocka <[email protected]> wrote:
>
> On Wed, 8 Mar 2017, NeilBrown wrote:
> > I don't think this will fix the DM snapshot deadlock by itself.
> > Rather, it make it possible for some internal changes to DM to fix it.
> > The DM change might be something vaguely like:
> >
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 3086da5664f3..06ee0960e415 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
> >
> > len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
> >
> > + if (len < ci->sector_count) {
> > + struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
>
> fs_bio_set is a shared bio set, so it is prone to deadlocks. For this
> change, we would need two bio sets per dm device, one for the split bio
> and one for the outgoing bio. (this also means having one more kernel
> thread per dm device)
>
> It would be possible to avoid having two bio sets if the incoming bio were
> the same as the outgoing bio (allocate a small structure, move bi_end_io
> and bi_private into it, replace bi_end_io and bi_private with pointers to
> device mapper and send the bio to the target driver), but it would need
> much more changes - basically rewrite the whole bio handling code in dm.c
> and in the targets.
>
> Mikulas

"back then" (see previously posted link into ML archive)
I suggested this:

...

A bit of conflict here may be that DM has all its own
split and clone and queue magic, and wants to process
"all of the bio" before returning back to generic_make_request().

To change that, __split_and_process_bio() and all its helpers
would need to learn to "push back" (pieces of) the bio they are
currently working on, and not push back via "DM_ENDIO_REQUEUE",
but by bio_list_add_head(&current->bio_lists->queue, piece_to_be_done_later).

Then, after they processed each piece,
*return* all the way up to the top-level generic_make_request(),
where the recursion-to-iteration logic would then
make sure that all deeper level bios, submitted via
recursive calls to generic_make_request() will be processed, before the
next, pushed back, piece of the "original incoming" bio.

And *not* do their own iteration over all pieces first.

Probably not as easy as dropping the while loop,
using bio_advance, and pushing that "advanced" bio back to
current->...queue?

static void __split_and_process_bio(struct mapped_device *md,
struct dm_table *map, struct bio *bio)
...
ci.bio = bio;
ci.sector_count = bio_sectors(bio);
while (ci.sector_count && !error)
error = __split_and_process_non_flush(&ci);
...
error = __split_and_process_non_flush(&ci);
if (ci.sector_count)
bio_advance()
bio_list_add_head(&current->bio_lists->queue, )
...

Something like that, maybe?


Needs to be adapted to this new and improved recursion-to-iteration
logic, obviously. Would that be doable, or does device-mapper for some
reason really *need* its own iteration loop (which, because it is called
from generic_make_request(), won't be able to ever submit anything to
any device, ever, so needs all these helper threads just in case).

Lars

2017-03-09 06:08:28

by NeilBrown

Subject: Re: blk: improve order of bio handling in generic_make_request()

On Wed, Mar 08 2017, Mikulas Patocka wrote:

> On Wed, 8 Mar 2017, NeilBrown wrote:
>>
>> I don't think this will fix the DM snapshot deadlock by itself.
>> Rather, it make it possible for some internal changes to DM to fix it.
>> The DM change might be something vaguely like:
>>
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index 3086da5664f3..06ee0960e415 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
>>
>> len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
>>
>> + if (len < ci->sector_count) {
>> + struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
>
> fs_bio_set is a shared bio set, so it is prone to deadlocks. For this
> change, we would need two bio sets per dm device, one for the split bio
> and one for the outgoing bio. (this also means having one more kernel
> thread per dm device)

Yes, two local bio_sets would be best.
But we don't really need those extra kernel threads. I'll start working
on patches to make them optional, and then to start removing them.

Thanks,
NeilBrown



2017-03-10 04:32:37

by NeilBrown

Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()


I started looking further at the improvements we can make once
generic_make_request is fixed, and realised that I had missed an
important detail in this patch.
Several places test current->bio_list, and two actually edit the list.
With this change, they cannot see the whole list, so it could cause a
regression.

So I've revised the patch to make sure that the entire list of queued
bios remains visible, and change the relevant code to look at both
the new list and the old list.

Following that there are some patches which make the rescuer thread
optional, and then starts removing it from some easy-to-fix places.

The series summary is below.

NeilBrown


NeilBrown (5):
blk: improve order of bio handling in generic_make_request()
blk: remove bio_set arg from blk_queue_split()
blk: make the bioset rescue_workqueue optional.
blk: use non-rescuing bioset for q->bio_split.
block_dev: make blkdev_dio_pool a non-rescuing bioset


block/bio.c | 39 +++++++++++++++++++++++++++------
block/blk-core.c | 42 ++++++++++++++++++++++++++++-------
block/blk-merge.c | 7 +++---
block/blk-mq.c | 4 ++-
drivers/block/drbd/drbd_main.c | 2 +-
drivers/block/drbd/drbd_req.c | 2 +-
drivers/block/pktcdvd.c | 2 +-
drivers/block/ps3vram.c | 2 +-
drivers/block/rsxx/dev.c | 2 +-
drivers/block/umem.c | 2 +-
drivers/block/zram/zram_drv.c | 2 +-
drivers/lightnvm/rrpc.c | 2 +-
drivers/md/bcache/super.c | 4 ++-
drivers/md/dm-crypt.c | 2 +-
drivers/md/dm-io.c | 2 +-
drivers/md/dm.c | 32 +++++++++++++++------------
drivers/md/md.c | 4 ++-
drivers/md/raid10.c | 3 ++-
drivers/md/raid5-cache.c | 2 +-
drivers/s390/block/dcssblk.c | 2 +-
drivers/s390/block/xpram.c | 2 +-
drivers/target/target_core_iblock.c | 2 +-
fs/btrfs/extent_io.c | 4 ++-
fs/xfs/xfs_super.c | 2 +-
include/linux/bio.h | 2 ++
include/linux/blkdev.h | 3 +--
26 files changed, 114 insertions(+), 60 deletions(-)



2017-03-10 04:33:54

by NeilBrown

Subject: [PATCH 1/5 v3] blk: improve order of bio handling in generic_make_request()


To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.

If any bios are submitted by a make_request_fn, these will ultimately
be handled sequentially. If the handling of one of those generates
further requests, they will be added to the end of the queue.

This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.

These deadlocks can be eradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submitted requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().

An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.

This, by itself, is not enough to remove all deadlocks. It just makes
it possible for drivers to take the extra step required themselves.

To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocating from a mempool twice in the one call to a make_request_fn.

A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.
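
As an illustration only, a sketch of a hypothetical make_request_fn
following that advice; too_big(), my_max_sectors() and my_handle() are
invented names and error handling is omitted:

        static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
        {
                if (too_big(bio)) {
                        /* Split once; do not loop splitting and handling
                         * each piece, and do not allocate from the same
                         * mempool again in this call.
                         */
                        struct bio *split = bio_split(bio, my_max_sectors(bio),
                                                      GFP_NOIO, q->bio_split);

                        bio_chain(split, bio);
                        /* Queue the remainder; generic_make_request() will
                         * only return to it after everything generated by
                         * 'split' has been fully handled.
                         */
                        generic_make_request(bio);
                        bio = split;
                }
                my_handle(q, bio);      /* handle the first part now */
                return BLK_QC_T_NONE;
        }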

With this in place, it should be possible to disable the
punt_bios_to_rescuer() rescuer thread for many block devices, and
eventually it may be possible to remove it completely.

Note that as some drivers look inside the bio_list, sometimes to punt
some bios to rescuer threads, we need to make both the pending list and the
new list visible. So current->bio_list is now an array of 2 lists,
and relevant code examines both of them.
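
In practice the test becomes (as in the bio_alloc_bioset() and raid10
hunks below):

        if (current->bio_list &&
            (!bio_list_empty(&current->bio_list[0]) ||
             !bio_list_empty(&current->bio_list[1]))) {
                /* bios are queued for later handling by
                 * generic_make_request(); do not block waiting
                 * for any of them to complete.
                 */
        }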

Ref: http://www.spinics.net/lists/raid/msg54680.html
Tested-by: Jinpu Wang <[email protected]>
Inspired-by: Lars Ellenberg <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
block/bio.c | 11 ++++++++---
block/blk-core.c | 40 ++++++++++++++++++++++++++++++++--------
drivers/md/dm.c | 29 ++++++++++++++++-------------
drivers/md/raid10.c | 3 ++-
4 files changed, 58 insertions(+), 25 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5eec5e08417f..84ae39f06f81 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -376,10 +376,13 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
bio_list_init(&punt);
bio_list_init(&nopunt);

- while ((bio = bio_list_pop(current->bio_list)))
+ while ((bio = bio_list_pop(&current->bio_list[0])))
bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+ current->bio_list[0] = nopunt;

- *current->bio_list = nopunt;
+ while ((bio = bio_list_pop(&current->bio_list[1])))
+ bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+ current->bio_list[1] = nopunt;

spin_lock(&bs->rescue_lock);
bio_list_merge(&bs->rescue_list, &punt);
@@ -466,7 +469,9 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
* we retry with the original gfp_flags.
*/

- if (current->bio_list && !bio_list_empty(current->bio_list))
+ if (current->bio_list &&
+ (!bio_list_empty(&current->bio_list[0]) ||
+ !bio_list_empty(&current->bio_list[1])))
gfp_mask &= ~__GFP_DIRECT_RECLAIM;

p = mempool_alloc(bs->bio_pool, gfp_mask);
diff --git a/block/blk-core.c b/block/blk-core.c
index 1086dac8724c..bd2cb4ba674e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
*/
blk_qc_t generic_make_request(struct bio *bio)
{
- struct bio_list bio_list_on_stack;
+ /*
+ * bio_list_on_stack[0] contains bios submitted by the current
+ * make_request_fn.
+ * bio_list_on_stack[1] contains bios that were submitted before
+ * the current make_request_fn, but that haven't been processed
+ * yet.
+ */
+ struct bio_list bio_list_on_stack[2];
blk_qc_t ret = BLK_QC_T_NONE;

if (!generic_make_request_checks(bio))
@@ -1992,7 +1999,7 @@ blk_qc_t generic_make_request(struct bio *bio)
* should be added at the tail
*/
if (current->bio_list) {
- bio_list_add(current->bio_list, bio);
+ bio_list_add(&current->bio_list[0], bio);
goto out;
}

@@ -2011,23 +2018,40 @@ blk_qc_t generic_make_request(struct bio *bio)
* bio_list, and call into ->make_request() again.
*/
BUG_ON(bio->bi_next);
- bio_list_init(&bio_list_on_stack);
- current->bio_list = &bio_list_on_stack;
+ bio_list_init(&bio_list_on_stack[0]);
+ bio_list_init(&bio_list_on_stack[1]);
+ current->bio_list = bio_list_on_stack;
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);

if (likely(blk_queue_enter(q, false) == 0)) {
+ struct bio_list lower, same;
+
+ /* Create a fresh bio_list for all subordinate requests */
+ bio_list_on_stack[1] = bio_list_on_stack[0];
+ bio_list_init(&bio_list_on_stack[0]);
ret = q->make_request_fn(q, bio);

blk_queue_exit(q);

- bio = bio_list_pop(current->bio_list);
+ /* sort new bios into those for a lower level
+ * and those for the same level
+ */
+ bio_list_init(&lower);
+ bio_list_init(&same);
+ while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
+ if (q == bdev_get_queue(bio->bi_bdev))
+ bio_list_add(&same, bio);
+ else
+ bio_list_add(&lower, bio);
+ /* now assemble so we handle the lowest level first */
+ bio_list_merge(&bio_list_on_stack[0], &lower);
+ bio_list_merge(&bio_list_on_stack[0], &same);
+ bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
} else {
- struct bio *bio_next = bio_list_pop(current->bio_list);
-
bio_io_error(bio);
- bio = bio_next;
}
+ bio = bio_list_pop(&bio_list_on_stack[0]);
} while (bio);
current->bio_list = NULL; /* deactivate */

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f4ffd1eb8f44..dfb75979e455 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -989,26 +989,29 @@ static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
struct dm_offload *o = container_of(cb, struct dm_offload, cb);
struct bio_list list;
struct bio *bio;
+ int i;

INIT_LIST_HEAD(&o->cb.list);

if (unlikely(!current->bio_list))
return;

- list = *current->bio_list;
- bio_list_init(current->bio_list);
-
- while ((bio = bio_list_pop(&list))) {
- struct bio_set *bs = bio->bi_pool;
- if (unlikely(!bs) || bs == fs_bio_set) {
- bio_list_add(current->bio_list, bio);
- continue;
+ for (i = 0; i < 2; i++) {
+ list = current->bio_list[i];
+ bio_list_init(&current->bio_list[i]);
+
+ while ((bio = bio_list_pop(&list))) {
+ struct bio_set *bs = bio->bi_pool;
+ if (unlikely(!bs) || bs == fs_bio_set) {
+ bio_list_add(&current->bio_list[i], bio);
+ continue;
+ }
+
+ spin_lock(&bs->rescue_lock);
+ bio_list_add(&bs->rescue_list, bio);
+ queue_work(bs->rescue_workqueue, &bs->rescue_work);
+ spin_unlock(&bs->rescue_lock);
}
-
- spin_lock(&bs->rescue_lock);
- bio_list_add(&bs->rescue_list, bio);
- queue_work(bs->rescue_workqueue, &bs->rescue_work);
- spin_unlock(&bs->rescue_lock);
}
}

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 063c43d83b72..0536658c9d40 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -974,7 +974,8 @@ static void wait_barrier(struct r10conf *conf)
!conf->barrier ||
(atomic_read(&conf->nr_pending) &&
current->bio_list &&
- !bio_list_empty(current->bio_list)),
+ (!bio_list_empty(&current->bio_list[0]) ||
+ !bio_list_empty(&current->bio_list[1]))),
conf->resync_lock);
conf->nr_waiting--;
if (!conf->nr_waiting)




2017-03-10 04:34:51

by NeilBrown

[permalink] [raw]
Subject: [PATCH 2/5] blk: remove bio_set arg from blk_queue_split()


blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary. Remove the last arg and always use
q->bio_split inside blk_queue_split()

Signed-off-by: NeilBrown <[email protected]>
---
block/blk-core.c | 2 +-
block/blk-merge.c | 7 +++----
block/blk-mq.c | 4 ++--
drivers/block/drbd/drbd_req.c | 2 +-
drivers/block/pktcdvd.c | 2 +-
drivers/block/ps3vram.c | 2 +-
drivers/block/rsxx/dev.c | 2 +-
drivers/block/umem.c | 2 +-
drivers/block/zram/zram_drv.c | 2 +-
drivers/lightnvm/rrpc.c | 2 +-
drivers/md/md.c | 2 +-
drivers/s390/block/dcssblk.c | 2 +-
drivers/s390/block/xpram.c | 2 +-
include/linux/blkdev.h | 3 +--
14 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bd2cb4ba674e..375006c94c15 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1637,7 +1637,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
*/
blk_queue_bounce(q, &bio);

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
bio->bi_error = -EIO;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2afa262425d1..ce8838aff7f7 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -188,8 +188,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
return do_split ? new : NULL;
}

-void blk_queue_split(struct request_queue *q, struct bio **bio,
- struct bio_set *bs)
+void blk_queue_split(struct request_queue *q, struct bio **bio)
{
struct bio *split, *res;
unsigned nsegs;
@@ -197,14 +196,14 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
switch (bio_op(*bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
- split = blk_bio_discard_split(q, *bio, bs, &nsegs);
+ split = blk_bio_discard_split(q, *bio, q->bio_split, &nsegs);
break;
case REQ_OP_WRITE_ZEROES:
split = NULL;
nsegs = (*bio)->bi_phys_segments;
break;
case REQ_OP_WRITE_SAME:
- split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
+ split = blk_bio_write_same_split(q, *bio, q->bio_split, &nsegs);
break;
default:
split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b2fd175e84d7..ef7b289e873c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1502,7 +1502,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
return BLK_QC_T_NONE;
}

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if (!is_flush_fua && !blk_queue_nomerges(q) &&
blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
@@ -1624,7 +1624,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
return BLK_QC_T_NONE;
}

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if (!is_flush_fua && !blk_queue_nomerges(q)) {
if (blk_attempt_plug_merge(q, bio, &request_count, NULL))
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 652114ae1a8a..f6ed6f7f5ab2 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1554,7 +1554,7 @@ blk_qc_t drbd_make_request(struct request_queue *q, struct bio *bio)
struct drbd_device *device = (struct drbd_device *) q->queuedata;
unsigned long start_jif;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

start_jif = jiffies;

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 66d846ba85a9..98394d034c29 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2414,7 +2414,7 @@ static blk_qc_t pkt_make_request(struct request_queue *q, struct bio *bio)

blk_queue_bounce(q, &bio);

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

pd = q->queuedata;
if (!pd) {
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 456b4fe21559..48072c0c1010 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -606,7 +606,7 @@ static blk_qc_t ps3vram_make_request(struct request_queue *q, struct bio *bio)

dev_dbg(&dev->core, "%s\n", __func__);

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

spin_lock_irq(&priv->lock);
busy = !bio_list_empty(&priv->list);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index f81d70b39d10..8c8024f616ae 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -151,7 +151,7 @@ static blk_qc_t rsxx_make_request(struct request_queue *q, struct bio *bio)
struct rsxx_bio_meta *bio_meta;
int st = -EINVAL;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

might_sleep();

diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index c141cc3be22b..c8d8a2f16f8e 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -529,7 +529,7 @@ static blk_qc_t mm_make_request(struct request_queue *q, struct bio *bio)
(unsigned long long)bio->bi_iter.bi_sector,
bio->bi_iter.bi_size);

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

spin_lock_irq(&card->lock);
*card->biotail = bio;
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index e27d89a36c34..973dac2a9a99 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -880,7 +880,7 @@ static blk_qc_t zram_make_request(struct request_queue *queue, struct bio *bio)
{
struct zram *zram = queue->queuedata;

- blk_queue_split(queue, &bio, queue->bio_split);
+ blk_queue_split(queue, &bio);

if (!valid_io_request(zram, bio->bi_iter.bi_sector,
bio->bi_iter.bi_size)) {
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index e00b1d7b976f..701fa2fafbb2 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -999,7 +999,7 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, struct bio *bio)
struct nvm_rq *rqd;
int err;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if (bio_op(bio) == REQ_OP_DISCARD) {
rrpc_discard(rrpc, bio);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 548d1b8014f8..d7d2bb51a58d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -253,7 +253,7 @@ static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
unsigned int sectors;
int cpu;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if (mddev == NULL || mddev->pers == NULL) {
bio_io_error(bio);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 415d10a67b7a..10ece6f3c7eb 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -829,7 +829,7 @@ dcssblk_make_request(struct request_queue *q, struct bio *bio)
unsigned long source_addr;
unsigned long bytes_done;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

bytes_done = 0;
dev_info = bio->bi_bdev->bd_disk->private_data;
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index b9d7e755c8a3..a48f0d40c1d2 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -190,7 +190,7 @@ static blk_qc_t xpram_make_request(struct request_queue *q, struct bio *bio)
unsigned long page_addr;
unsigned long bytes;

- blk_queue_split(q, &bio, q->bio_split);
+ blk_queue_split(q, &bio);

if ((bio->bi_iter.bi_sector & 7) != 0 ||
(bio->bi_iter.bi_size & 4095) != 0)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 796016e63c1d..63008a246403 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -934,8 +934,7 @@ extern int blk_insert_cloned_request(struct request_queue *q,
struct request *rq);
extern int blk_rq_append_bio(struct request *rq, struct bio *bio);
extern void blk_delay_queue(struct request_queue *, unsigned long);
-extern void blk_queue_split(struct request_queue *, struct bio **,
- struct bio_set *);
+extern void blk_queue_split(struct request_queue *, struct bio **);
extern void blk_recount_segments(struct request_queue *, struct bio *);
extern int scsi_verify_blk_ioctl(struct block_device *, unsigned int);
extern int scsi_cmd_blk_ioctl(struct block_device *, fmode_t,




2017-03-10 04:35:54

by NeilBrown

[permalink] [raw]
Subject: [PATCH 3/5] blk: make the bioset rescue_workqueue optional.


This patch converts bioset_create() and bioset_create_nobvec()
to not create a workqueue, so allocations will never trigger
punt_bios_to_rescuer().
It also introduces bioset_create_rescued() and bioset_create_nobvec_rescued()
which preserve the old behaviour.

*All* callers of bioset_create() and bioset_create_nobvec() are
converted to the _rescued() version, so that no change in behaviour
is experienced.

It is hoped that most, if not all, biosets can end up being the
non-rescued version.
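
Illustrative only (my_dev and its fields are invented names): a bioset
whose bios may be allocated while current->bio_list is active keeps the
old behaviour via the _rescued() variant, while one used only outside
generic_make_request() can now skip the rescuer workqueue:

        /* allocations can happen inside a make_request_fn: keep a rescuer */
        my_dev->split_bs = bioset_create_rescued(BIO_POOL_SIZE, 0);

        /* allocations only happen in ordinary process context: no rescuer */
        my_dev->dio_bs = bioset_create(BIO_POOL_SIZE, 0);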

Signed-off-by: NeilBrown <[email protected]>
---
block/bio.c | 30 +++++++++++++++++++++++++-----
block/blk-core.c | 2 +-
drivers/block/drbd/drbd_main.c | 2 +-
drivers/md/bcache/super.c | 4 ++--
drivers/md/dm-crypt.c | 2 +-
drivers/md/dm-io.c | 2 +-
drivers/md/dm.c | 5 +++--
drivers/md/md.c | 2 +-
drivers/md/raid5-cache.c | 2 +-
drivers/target/target_core_iblock.c | 2 +-
fs/block_dev.c | 2 +-
fs/btrfs/extent_io.c | 4 ++--
fs/xfs/xfs_super.c | 2 +-
include/linux/bio.h | 2 ++
14 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 84ae39f06f81..06587f1119f5 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -362,6 +362,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
struct bio_list punt, nopunt;
struct bio *bio;

+ if (WARN_ON_ONCE(!bs->rescue_workqueue))
+ return;
/*
* In order to guarantee forward progress we must punt only bios that
* were allocated from this bio_set; otherwise, if there was a bio on
@@ -471,7 +473,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)

if (current->bio_list &&
(!bio_list_empty(&current->bio_list[0]) ||
- !bio_list_empty(&current->bio_list[1])))
+ !bio_list_empty(&current->bio_list[1])) &&
+ bs->rescue_workqueue)
gfp_mask &= ~__GFP_DIRECT_RECLAIM;

p = mempool_alloc(bs->bio_pool, gfp_mask);
@@ -1940,7 +1943,8 @@ EXPORT_SYMBOL(bioset_free);

static struct bio_set *__bioset_create(unsigned int pool_size,
unsigned int front_pad,
- bool create_bvec_pool)
+ bool create_bvec_pool,
+ bool create_rescue_workqueue)
{
unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
struct bio_set *bs;
@@ -1971,6 +1975,9 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
goto bad;
}

+ if (!create_rescue_workqueue)
+ return bs;
+
bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
if (!bs->rescue_workqueue)
goto bad;
@@ -1996,10 +2003,16 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
*/
struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
{
- return __bioset_create(pool_size, front_pad, true);
+ return __bioset_create(pool_size, front_pad, true, false);
}
EXPORT_SYMBOL(bioset_create);

+struct bio_set *bioset_create_rescued(unsigned int pool_size, unsigned int front_pad)
+{
+ return __bioset_create(pool_size, front_pad, true, true);
+}
+EXPORT_SYMBOL(bioset_create_rescued);
+
/**
* bioset_create_nobvec - Create a bio_set without bio_vec mempool
* @pool_size: Number of bio to cache in the mempool
@@ -2011,10 +2024,17 @@ EXPORT_SYMBOL(bioset_create);
*/
struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_pad)
{
- return __bioset_create(pool_size, front_pad, false);
+ return __bioset_create(pool_size, front_pad, false, false);
}
EXPORT_SYMBOL(bioset_create_nobvec);

+struct bio_set *bioset_create_nobvec_rescued(unsigned int pool_size,
+ unsigned int front_pad)
+{
+ return __bioset_create(pool_size, front_pad, false, true);
+}
+EXPORT_SYMBOL(bioset_create_nobvec_rescued);
+
#ifdef CONFIG_BLK_CGROUP

/**
@@ -2129,7 +2149,7 @@ static int __init init_bio(void)
bio_integrity_init();
biovec_init_slabs();

- fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
+ fs_bio_set = bioset_create_rescued(BIO_POOL_SIZE, 0);
if (!fs_bio_set)
panic("bio: can't allocate bios\n");

diff --git a/block/blk-core.c b/block/blk-core.c
index 375006c94c15..c3992d17dc2c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -714,7 +714,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (q->id < 0)
goto fail_q;

- q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+ q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
if (!q->bio_split)
goto fail_id;

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 92c60cbd04ee..2c69c2ab0fff 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2166,7 +2166,7 @@ static int drbd_create_mempools(void)
goto Enomem;

/* mempools */
- drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0);
+ drbd_md_io_bio_set = bioset_create_rescued(DRBD_MIN_POOL_PAGES, 0);
if (drbd_md_io_bio_set == NULL)
goto Enomem;

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 85e3f21c2514..6cb30792f0ed 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -786,7 +786,7 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,

minor *= BCACHE_MINORS;

- if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+ if (!(d->bio_split = bioset_create_rescued(4, offsetof(struct bbio, bio))) ||
!(d->disk = alloc_disk(BCACHE_MINORS))) {
ida_simple_remove(&bcache_minor, minor);
return -ENOMEM;
@@ -1520,7 +1520,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
sizeof(struct bbio) + sizeof(struct bio_vec) *
bucket_pages(c))) ||
!(c->fill_iter = mempool_create_kmalloc_pool(1, iter_size)) ||
- !(c->bio_split = bioset_create(4, offsetof(struct bbio, bio))) ||
+ !(c->bio_split = bioset_create_rescued(4, offsetof(struct bbio, bio))) ||
!(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) ||
!(c->moving_gc_wq = alloc_workqueue("bcache_gc",
WQ_MEM_RECLAIM, 0)) ||
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 389a3637ffcc..91a2d637d44f 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1936,7 +1936,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
goto bad;
}

- cc->bs = bioset_create(MIN_IOS, 0);
+ cc->bs = bioset_create_rescued(MIN_IOS, 0);
if (!cc->bs) {
ti->error = "Cannot allocate crypt bioset";
goto bad;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 03940bf36f6c..fe1241c196b1 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -58,7 +58,7 @@ struct dm_io_client *dm_io_client_create(void)
if (!client->pool)
goto bad;

- client->bios = bioset_create(min_ios, 0);
+ client->bios = bioset_create_rescued(min_ios, 0);
if (!client->bios)
goto bad;

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb75979e455..41b1f033841f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1002,7 +1002,8 @@ static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)

while ((bio = bio_list_pop(&list))) {
struct bio_set *bs = bio->bi_pool;
- if (unlikely(!bs) || bs == fs_bio_set) {
+ if (unlikely(!bs) || bs == fs_bio_set ||
+ !bs->rescue_workqueue) {
bio_list_add(&current->bio_list[i], bio);
continue;
}
@@ -2577,7 +2578,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, unsigned t
BUG();
}

- pools->bs = bioset_create_nobvec(pool_size, front_pad);
+ pools->bs = bioset_create_nobvec_rescued(pool_size, front_pad);
if (!pools->bs)
goto out;

diff --git a/drivers/md/md.c b/drivers/md/md.c
index d7d2bb51a58d..e5f08a195837 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5220,7 +5220,7 @@ int md_run(struct mddev *mddev)
}

if (mddev->bio_set == NULL) {
- mddev->bio_set = bioset_create(BIO_POOL_SIZE, 0);
+ mddev->bio_set = bioset_create_rescued(BIO_POOL_SIZE, 0);
if (!mddev->bio_set)
return -ENOMEM;
}
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be01b10..c95c6c046395 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2831,7 +2831,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
if (!log->io_pool)
goto io_pool;

- log->bs = bioset_create(R5L_POOL_SIZE, 0);
+ log->bs = bioset_create_rescued(R5L_POOL_SIZE, 0);
if (!log->bs)
goto io_bs;

diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index d316ed537d59..5bf3392195c6 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -93,7 +93,7 @@ static int iblock_configure_device(struct se_device *dev)
return -EINVAL;
}

- ib_dev->ibd_bio_set = bioset_create(IBLOCK_BIO_POOL_SIZE, 0);
+ ib_dev->ibd_bio_set = bioset_create_rescued(IBLOCK_BIO_POOL_SIZE, 0);
if (!ib_dev->ibd_bio_set) {
pr_err("IBLOCK: Unable to create bioset\n");
goto out;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eca00ec4370..c0ca5f0d0369 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,7 +436,7 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)

static __init int blkdev_init(void)
{
- blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio));
+ blkdev_dio_pool = bioset_create_rescued(4, offsetof(struct blkdev_dio, bio));
if (!blkdev_dio_pool)
return -ENOMEM;
return 0;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 28e81922a21c..34aa8893790a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -173,8 +173,8 @@ int __init extent_io_init(void)
if (!extent_buffer_cache)
goto free_state_cache;

- btrfs_bioset = bioset_create(BIO_POOL_SIZE,
- offsetof(struct btrfs_io_bio, bio));
+ btrfs_bioset = bioset_create_rescued(BIO_POOL_SIZE,
+ offsetof(struct btrfs_io_bio, bio));
if (!btrfs_bioset)
goto free_buffer_cache;

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 890862f2447c..f4c4d6f41d91 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1756,7 +1756,7 @@ MODULE_ALIAS_FS("xfs");
STATIC int __init
xfs_init_zones(void)
{
- xfs_ioend_bioset = bioset_create(4 * MAX_BUF_PER_PAGE,
+ xfs_ioend_bioset = bioset_create_rescued(4 * MAX_BUF_PER_PAGE,
offsetof(struct xfs_ioend, io_inline_bio));
if (!xfs_ioend_bioset)
goto out;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8e521194f6fc..05730603fcf1 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -379,7 +379,9 @@ static inline struct bio *bio_next_split(struct bio *bio, int sectors,
}

extern struct bio_set *bioset_create(unsigned int, unsigned int);
+extern struct bio_set *bioset_create_rescued(unsigned int, unsigned int);
extern struct bio_set *bioset_create_nobvec(unsigned int, unsigned int);
+extern struct bio_set *bioset_create_nobvec_rescued(unsigned int, unsigned int);
extern void bioset_free(struct bio_set *);
extern mempool_t *biovec_create_pool(int pool_entries);





2017-03-10 04:36:33

by NeilBrown

[permalink] [raw]
Subject: [PATCH 4/5] blk: use non-rescuing bioset for q->bio_split.


A rescuing bioset is only useful if there might be bios from
that same bioset on the bio_list_on_stack queue at a time
when bio_alloc_bioset() is called. This never applies to
q->bio_split.

Allocations from q->bio_split are only ever made from
blk_queue_split() which is only ever called early in each of
various make_request_fn()s. The original bio (call this A)
is then passed to generic_make_request() and is placed on
the bio_list_on_stack queue, and the bio that was allocated
from q->bio_split (B) is processed.

The processing of this may cause other bios to be passed to
generic_make_request() or may even cause the bio B itself to
be passed, possibly after some prefix has been split off
(using some other bioset).

generic_make_request() now guarantees that all of these bios
(B and dependants) will be fully processed before the tail
of the original bio A gets handled. None of these early bios
can possibly trigger an allocation from the original
q->bio_split as they are either too small to require
splitting or (more likely) are destined for a different queue.

The next time that the original q->bio_split might be used
by this thread is when A is processed again, as it might
still be too big to handle directly. By this time there
cannot be any other bios allocated from q->bio_split in the
generic_make_request() queue. So no rescuing will ever be
needed.
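
For readers following the argument, a hedged sketch of that flow as seen
from a hypothetical make_request_fn (my_issue() is an invented name;
blk_queue_split() is shown with the two-argument form introduced earlier
in this series):

        static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
        {
                /* If the bio is too big, blk_queue_split() allocates a
                 * piece B from q->bio_split, passes the original bio A
                 * back to generic_make_request() -- which only queues it
                 * on bio_list_on_stack -- and leaves us holding B.
                 */
                blk_queue_split(q, &bio);

                /* Everything B generates is fully processed before A is
                 * looked at again, so by the time A can reach this
                 * function and allocate from q->bio_split once more, no
                 * bio from q->bio_split remains queued.
                 */
                return my_issue(q, bio);
        }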

Signed-off-by: NeilBrown <[email protected]>
---
block/blk-core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c3992d17dc2c..375006c94c15 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -714,7 +714,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (q->id < 0)
goto fail_q;

- q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
+ q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
if (!q->bio_split)
goto fail_id;





2017-03-10 04:37:54

by NeilBrown

[permalink] [raw]
Subject: [PATCH 5/5] block_dev: make blkdev_dio_pool a non-rescuing bioset


Allocations from blkdev_dio_pool are never made under
generic_make_request, so this bioset does not need a rescuer thread.

Signed-off-by: NeilBrown <[email protected]>
---
fs/block_dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c0ca5f0d0369..2eca00ec4370 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -436,7 +436,7 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)

static __init int blkdev_init(void)
{
- blkdev_dio_pool = bioset_create_rescued(4, offsetof(struct blkdev_dio, bio));
+ blkdev_dio_pool = bioset_create(4, offsetof(struct blkdev_dio, bio));
if (!blkdev_dio_pool)
return -ENOMEM;
return 0;




2017-03-10 04:38:17

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On 03/09/2017 09:32 PM, NeilBrown wrote:
>
> I started looking further at the improvements we can make once
> generic_make_request is fixed, and realised that I had missed an
> important detail in this patch.
> Several places test current->bio_list, and two actually edit the list.
> With this change, that cannot see the whole lists, so it could cause a
> regression.
>
> So I've revised the patch to make sure that the entire list of queued
> bios remains visible, and change the relevant code to look at both
> the new list and the old list.
>
> Following that there are some patches which make the rescuer thread
> optional, and then starts removing it from some easy-to-fix places.

Neil, note that the v2 patch is already in Linus tree as of earlier
today. You need to rebase the series, and if we need fixups on
top of v2, then that should be done separately and with increased
urgency.

--
Jens Axboe

2017-03-10 04:40:31

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On 03/09/2017 09:38 PM, Jens Axboe wrote:
> On 03/09/2017 09:32 PM, NeilBrown wrote:
>>
>> I started looking further at the improvements we can make once
>> generic_make_request is fixed, and realised that I had missed an
>> important detail in this patch.
>> Several places test current->bio_list, and two actually edit the list.
>> With this change, that cannot see the whole lists, so it could cause a
>> regression.
>>
>> So I've revised the patch to make sure that the entire list of queued
>> bios remains visible, and change the relevant code to look at both
>> the new list and the old list.
>>
>> Following that there are some patches which make the rescuer thread
>> optional, and then starts removing it from some easy-to-fix places.
>
> Neil, note that the v2 patch is already in Linus tree as of earlier
> today. You need to rebase the series, and if we need fixups on
> top of v2, then that should be done separately and with increased
> urgency.

Additionally, at least the first patch appears to be badly mangled.
The formatting is screwed up.

--
Jens Axboe

2017-03-10 05:19:29

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On Thu, Mar 09 2017, Jens Axboe wrote:

> On 03/09/2017 09:32 PM, NeilBrown wrote:
>>
>> I started looking further at the improvements we can make once
>> generic_make_request is fixed, and realised that I had missed an
>> important detail in this patch.
>> Several places test current->bio_list, and two actually edit the list.
>> With this change, that cannot see the whole lists, so it could cause a
>> regression.
>>
>> So I've revised the patch to make sure that the entire list of queued
>> bios remains visible, and change the relevant code to look at both
>> the new list and the old list.
>>
>> Following that there are some patches which make the rescuer thread
>> optional, and then starts removing it from some easy-to-fix places.
>
> Neil, note that the v2 patch is already in Linus tree as of earlier
> today. You need to rebase the series, and if we need fixups on
> top of v2, then that should be done separately and with increased
> urgency.

I had checked linux-next, but not the latest from Linus.
I see it now - thanks!
I'll rebase (and ensure nothing gets mangled)

Thanks,
NeilBrown



2017-03-10 12:34:18

by Lars Ellenberg

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> */
> blk_qc_t generic_make_request(struct bio *bio)
> {
> - struct bio_list bio_list_on_stack;
> + /*
> + * bio_list_on_stack[0] contains bios submitted by the current
> + * make_request_fn.
> + * bio_list_on_stack[1] contains bios that were submitted before
> + * the current make_request_fn, but that haven't been processed
> + * yet.
> + */
> + struct bio_list bio_list_on_stack[2];
> blk_qc_t ret = BLK_QC_T_NONE;

May I suggest that, if you intend to assign something that is not a
plain &(struct bio_list), but a &(struct bio_list[2]),
you change the task member so it is renamed (current->bio_list vs
current->bio_lists, plural, is what I did last year).
Or you will break external modules, silently, and horribly (or,
rather, they won't notice, but break the kernel).
Examples of such modules would be DRBD, ZFS, quite possibly others.

Thanks,

Lars

2017-03-10 14:38:44

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On Fri, Mar 10 2017 at 7:34am -0500,
Lars Ellenberg <[email protected]> wrote:

> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> > */
> > blk_qc_t generic_make_request(struct bio *bio)
> > {
> > - struct bio_list bio_list_on_stack;
> > + /*
> > + * bio_list_on_stack[0] contains bios submitted by the current
> > + * make_request_fn.
> > + * bio_list_on_stack[1] contains bios that were submitted before
> > + * the current make_request_fn, but that haven't been processed
> > + * yet.
> > + */
> > + struct bio_list bio_list_on_stack[2];
> > blk_qc_t ret = BLK_QC_T_NONE;
>
> May I suggest that, if you intend to assign something that is not a
> plain &(struct bio_list), but a &(struct bio_list[2]),
> you change the task member so it is renamed (current->bio_list vs
> current->bio_lists, plural, is what I did last year).
> Or you will break external modules, silently, and horribly (or,
> rather, they won't notice, but break the kernel).
> Examples of such modules would be DRBD, ZFS, quite possibly others.

drbd is upstream -- so what is the problem? (if you are having to
distribute drbd independent of the upstream drbd then why is drbd
upstream?)

As for ZFS, worrying about ZFS kABI breakage is the last thing we should
be doing.

So Nack from me on this defensive make-work for external modules.

2017-03-10 14:55:07

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()



On Fri, 10 Mar 2017, Mike Snitzer wrote:

> On Fri, Mar 10 2017 at 7:34am -0500,
> Lars Ellenberg <[email protected]> wrote:
>
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> > > */
> > > blk_qc_t generic_make_request(struct bio *bio)
> > > {
> > > - struct bio_list bio_list_on_stack;
> > > + /*
> > > + * bio_list_on_stack[0] contains bios submitted by the current
> > > + * make_request_fn.
> > > + * bio_list_on_stack[1] contains bios that were submitted before
> > > + * the current make_request_fn, but that haven't been processed
> > > + * yet.
> > > + */
> > > + struct bio_list bio_list_on_stack[2];
> > > blk_qc_t ret = BLK_QC_T_NONE;
> >
> > May I suggest that, if you intend to assign something that is not a
> > plain &(struct bio_list), but a &(struct bio_list[2]),
> > you change the task member so it is renamed (current->bio_list vs
> > current->bio_lists, plural, is what I did last year).
> > Or you will break external modules, silently, and horribly (or,
> > rather, they won't notice, but break the kernel).
> > Examples of such modules would be DRBD, ZFS, quite possibly others.
>
> drbd is upstream -- so what is the problem? (if you are having to
> distribute drbd independent of the upstream drbd then why is drbd
> upstream?)
>
> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
> be doing.

It's better to make external modules not compile than to silently
introduce bugs in them. So yes, I would rename that.

Mikulas

> So Nack from me on this defensive make-work for external modules.
>

2017-03-10 15:16:31

by Jack Wang

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()



On 10.03.2017 15:55, Mikulas Patocka wrote:
>
>
> On Fri, 10 Mar 2017, Mike Snitzer wrote:
>
>> On Fri, Mar 10 2017 at 7:34am -0500,
>> Lars Ellenberg <[email protected]> wrote:
>>
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
>>>> */
>>>> blk_qc_t generic_make_request(struct bio *bio)
>>>> {
>>>> - struct bio_list bio_list_on_stack;
>>>> + /*
>>>> + * bio_list_on_stack[0] contains bios submitted by the current
>>>> + * make_request_fn.
>>>> + * bio_list_on_stack[1] contains bios that were submitted before
>>>> + * the current make_request_fn, but that haven't been processed
>>>> + * yet.
>>>> + */
>>>> + struct bio_list bio_list_on_stack[2];
>>>> blk_qc_t ret = BLK_QC_T_NONE;
>>>
>>> May I suggest that, if you intend to assign something that is not a
>>> plain &(struct bio_list), but a &(struct bio_list[2]),
>>> you change the task member so it is renamed (current->bio_list vs
>>> current->bio_lists, plural, is what I did last year).
>>> Or you will break external modules, silently, and horribly (or,
>>> rather, they won't notice, but break the kernel).
>>> Examples of such modules would be DRBD, ZFS, quite possibly others.
>>
>> drbd is upstream -- so what is the problem? (if you are having to
>> distribute drbd independent of the upstream drbd then why is drbd
>> upstream?)
>>
>> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
>> be doing.
>
> It's better to make external modules not compile than to silently
> introduce bugs in them. So yes, I would rename that.
>
> Mikulas

Agree, better rename current->bio_list to current->bio_lists

Regards,
Jack

2017-03-10 15:35:59

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On Fri, Mar 10 2017 at 10:07am -0500,
Jack Wang <[email protected]> wrote:

>
>
> On 10.03.2017 15:55, Mikulas Patocka wrote:
> >
> >
> > On Fri, 10 Mar 2017, Mike Snitzer wrote:
> >
> >> On Fri, Mar 10 2017 at 7:34am -0500,
> >> Lars Ellenberg <[email protected]> wrote:
> >>
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> >>>> */
> >>>> blk_qc_t generic_make_request(struct bio *bio)
> >>>> {
> >>>> - struct bio_list bio_list_on_stack;
> >>>> + /*
> >>>> + * bio_list_on_stack[0] contains bios submitted by the current
> >>>> + * make_request_fn.
> >>>> + * bio_list_on_stack[1] contains bios that were submitted before
> >>>> + * the current make_request_fn, but that haven't been processed
> >>>> + * yet.
> >>>> + */
> >>>> + struct bio_list bio_list_on_stack[2];
> >>>> blk_qc_t ret = BLK_QC_T_NONE;
> >>>
> >>> May I suggest that, if you intend to assign something that is not a
> >>> plain &(struct bio_list), but a &(struct bio_list[2]),
> >>> you change the task member so it is renamed (current->bio_list vs
> >>> current->bio_lists, plural, is what I did last year).
> >>> Or you will break external modules, silently, and horribly (or,
> >>> rather, they won't notice, but break the kernel).
> >>> Examples of such modules would be DRBD, ZFS, quite possibly others.
> >>
> >> drbd is upstream -- so what is the problem? (if you are having to
> >> distribute drbd independent of the upstream drbd then why is drbd
> >> upstream?)
> >>
> >> As for ZFS, worrying about ZFS kABI breakage is the last thing we should
> >> be doing.
> >
> > It's better to make external modules not compile than to silently
> > introduce bugs in them. So yes, I would rename that.
> >
> > Mikulas
>
> Agree, better rename current->bio_list to current->bio_lists

Fine, normally wouldn't do so but I'm not so opposed that we need to get
hung up on this detail. If Neil and Jens agree then so be it.

2017-03-10 18:51:18

by Lars Ellenberg

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On Fri, Mar 10, 2017 at 04:07:58PM +0100, Jack Wang wrote:
> On 10.03.2017 15:55, Mikulas Patocka wrote:
> > On Fri, 10 Mar 2017, Mike Snitzer wrote:
> >> On Fri, Mar 10 2017 at 7:34am -0500,
> >> Lars Ellenberg <[email protected]> wrote:
> >>
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
> >>>> */
> >>>> blk_qc_t generic_make_request(struct bio *bio)
> >>>> {
> >>>> - struct bio_list bio_list_on_stack;
> >>>> + /*
> >>>> + * bio_list_on_stack[0] contains bios submitted by the current
> >>>> + * make_request_fn.
> >>>> + * bio_list_on_stack[1] contains bios that were submitted before
> >>>> + * the current make_request_fn, but that haven't been processed
> >>>> + * yet.
> >>>> + */
> >>>> + struct bio_list bio_list_on_stack[2];
> >>>> blk_qc_t ret = BLK_QC_T_NONE;
> >>>
> >>> May I suggest that, if you intend to assign something that is not a
> >>> plain &(struct bio_list), but a &(struct bio_list[2]),
> >>> you change the task member so it is renamed (current->bio_list vs
> >>> current->bio_lists, plural, is what I did last year).
> >>> Or you will break external modules, silently, and horribly (or,
> >>> rather, they won't notice, but break the kernel).
> >>> Examples of such modules would be DRBD, ZFS, quite possibly others.

> > It's better to make external modules not compile than to silently
> > introduce bugs in them. So yes, I would rename that.
> >
> > Mikulas
>
> Agree, better rename current->bio_list to current->bio_lists
>
> Regards,
> Jack

Thank you.

(I don't know if someone does, but...)
Thing is: *IF* some external thing messes with
current->bio_list in "interesting" ways, and not just the
"I don't care, one level of real recursion fixes this for me"
pattern of

	struct bio_list *tmp = current->bio_list;
	current->bio_list = NULL;
	submit_bio()
	current->bio_list = tmp;

you get a chance of stack corruption,
without even as much as a compiler warning.

Which is why I wrote https://lkml.org/lkml/2016/7/8/469
...

Instead, I suggest to distinguish between recursive calls to
generic_make_request(), and pushing back the remainder part in
blk_queue_split(), by pointing current->bio_lists to a
	struct recursion_to_iteration_bio_lists {
		struct bio_list recursion;
		struct bio_list queue;
	}

By providing each q->make_request_fn() with an empty "recursion"
bio_list, then merging any recursively submitted bios to the
head of the "queue" list, we can make the recursion-to-iteration
logic in generic_make_request() process deepest level bios first,
and "sibling" bios of the same level in "natural" order.

...

Cheers,

Lars

2017-03-11 00:48:48

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH v2] blk: improve order of bio handling in generic_make_request()

On Fri, Mar 10 2017, Lars Ellenberg wrote:

>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1975,7 +1975,14 @@ generic_make_request_checks(struct bio *bio)
>> */
>> blk_qc_t generic_make_request(struct bio *bio)
>> {
>> - struct bio_list bio_list_on_stack;
>> + /*
>> + * bio_list_on_stack[0] contains bios submitted by the current
>> + * make_request_fn.
>> + * bio_list_on_stack[1] contains bios that were submitted before
>> + * the current make_request_fn, but that haven't been processed
>> + * yet.
>> + */
>> + struct bio_list bio_list_on_stack[2];
>> blk_qc_t ret = BLK_QC_T_NONE;
>
> May I suggest that, if you intend to assign something that is not a
> plain &(struct bio_list), but a &(struct bio_list[2]),
> you change the task member so it is renamed (current->bio_list vs
> current->bio_lists, plural, is what I did last year).
> Or you will break external modules, silently, and horribly (or,
> rather, they won't notice, but break the kernel).
> Examples of such modules would be DRBD, ZFS, quite possibly others.
>

This is exactly what I did in my first draft (bio_list -> bio_lists),
but then I reverted that change because it didn't seem to be worth the
noise.
It isn't much noise: sched.h, bcache/btree.c, md/dm-bufio.c, and
md/raid1.c get minor changes.
But as I'm hoping to get rid of all of those uses, renaming before
removing seemed pointless ... though admittedly that is what I did for
bioset_create().... I wondered about that too.

The example you give later:

	struct bio_list *tmp = current->bio_list;
	current->bio_list = NULL;
	submit_bio()
	current->bio_list = tmp;

won't cause any problem. Whatever lists the parent generic_make_request
is holding onto will be untouched during the submit_bio() call, and will
be exactly as it expects them when this caller returns.

If some out-of-tree code does anything with ->bio_list that makes sense
with the previous code, then it will still make sense with the new
code. However there will be a few bios that it didn't get to look at.
These will all be bios that were submitted by a device further up the
stack (closer to the filesystem), so they *should* be irrelevant.
I could probably come up with some weird behaviour that might have
worked before but now wouldn't quite work the same way. But just fixing
bugs can sometimes affect an out-of-tree driver in a strange way because
it was assuming those bugs.

I hope that I'll soon be able to remove punt_bios_to_rescuer and
flush_current_bio_list, after which current->bio_list can really be
just a list again. I don't think it is worth changing the name for a
transient situation.

But thanks for the review - it encouraged me to think though the
consequences again and I'm now more confident.
I actually now think that change probably wasn't necessary. It is
safer though. It ensures that current functionality isn't removed
without a clear justification.

Thanks,
NeilBrown

