Subject: Re: [PATCH] blk: improve order of bio handling in generic_make_request()
To: NeilBrown, Jens Axboe
Cc: LKML, Lars Ellenberg, Kent Overstreet, Pavel Machek, Mike Snitzer, Mikulas Patocka
From: Jack Wang
Date: Fri, 3 Mar 2017 10:28:44 +0100
Message-ID: <71562c2c-97f4-9a0a-32ec-30e0702ca575@profitbricks.com>
In-Reply-To: <87h93blz6g.fsf@notabene.neil.brown.name>

On 03.03.2017 06:14, NeilBrown wrote:
>
> [ Hi Jens,
>   you might have seen assorted email threads recently about
>   deadlocks, particularly in dm-snap or md/raid1/10, and about
>   the excess of rescuer threads.
>   I think a big part of the problem is my ancient improvement
>   to generic_make_request to queue bios and handle them in
>   strict FIFO order. As described below, that can cause
>   problems which individual block devices cannot fix themselves
>   without punting to various threads.
>   This patch does not fix everything, but provides a basis that
>   drivers can build on to create deadlock-free solutions without
>   excess threads.
>   If you accept this, I will look into improving at least md
>   and bio_alloc_set() to be less dependent on rescuer threads.
>   Thanks,
>   NeilBrown
> ]
>
> To avoid recursion on the kernel stack when stacked block devices
> are in use, generic_make_request() will, when called recursively,
> queue new requests for later handling. They will be handled when the
> make_request_fn for the current bio completes.
>
> If any bios are submitted by a make_request_fn, these will ultimately
> be handled sequentially. If the handling of one of those generates
> further requests, they will be added to the end of the queue.
>
> This strict first-in-first-out behaviour can lead to deadlocks in
> various ways, normally because a request might need to wait for a
> previous request to the same device to complete. This can happen when
> they share a mempool, and can happen due to interdependencies
> particular to the device. Both md and dm have examples where this happens.
>
> These deadlocks can be eradicated by more selective ordering of bios.
> Specifically, by handling them in depth-first order. That is: when the
> handling of one bio generates one or more further bios, they are
> handled immediately after the parent, before any siblings of the
> parent. That way, when generic_make_request() calls make_request_fn
> for some particular device, we can be certain that all previously
> submitted requests for that device have been completely handled and are
> not waiting for anything in the queue of requests maintained in
> generic_make_request().
>
> An easy way to achieve this would be to use a last-in-first-out stack
> instead of a queue. However, this would change the order of consecutive
> bios submitted by a make_request_fn, which could have unexpected consequences.
> Instead we take a slightly more complex approach.
> A fresh queue is created for each call to a make_request_fn.
> After it completes,
> any bios for a different device are placed on the front of the main
> queue, followed by any bios for the same device, followed by all bios
> that were already on the queue before the make_request_fn was called.
> This provides the depth-first approach without reordering bios on the same level.
>
> This, by itself, is not enough to remove the deadlocks. It just makes
> it possible for drivers to take the extra step required themselves.
>
> To avoid deadlocks, drivers must never risk waiting for a request
> after submitting one to generic_make_request. This includes never
> allocating from a mempool twice in the one call to a make_request_fn.
>
> A common pattern in drivers is to call bio_split() in a loop, handling
> the first part and then looping around to possibly split the next part.
> Instead, a driver that finds it needs to split a bio should queue
> (with generic_make_request) the second part, handle the first part,
> and then return. The new code in generic_make_request will ensure the
> requests to underlying devices are processed first, then the second bio
> that was split off. If it splits again, the same process happens. In
> each case one bio will be completely handled before the next one is attempted.
>
> With this in place, it should be possible to disable the
> punt_bios_to_rescuer() recovery thread for many block devices, and
> eventually it may be possible to remove it completely.
>
> Tested-by: Jinpu Wang
> Inspired-by: Lars Ellenberg
> Signed-off-by: NeilBrown
> ---
>  block/blk-core.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b9e857f4afe8..ef55f210dd7c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2018,10 +2018,32 @@ blk_qc_t generic_make_request(struct bio *bio)
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
>  		if (likely(blk_queue_enter(q, false) == 0)) {
> +			struct bio_list hold;
> +			struct bio_list lower, same;
> +
> +			/* Create a fresh bio_list for all subordinate requests */
> +			bio_list_init(&hold);
> +			bio_list_merge(&hold, &bio_list_on_stack);
> +			bio_list_init(&bio_list_on_stack);
>  			ret = q->make_request_fn(q, bio);
>
>  			blk_queue_exit(q);
>
> +			/* sort new bios into those for a lower level
> +			 * and those for the same level
> +			 */
> +			bio_list_init(&lower);
> +			bio_list_init(&same);
> +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> +				if (q == bdev_get_queue(bio->bi_bdev))
> +					bio_list_add(&same, bio);
> +				else
> +					bio_list_add(&lower, bio);
> +			/* now assemble so we handle the lowest level first */
> +			bio_list_merge(&bio_list_on_stack, &lower);
> +			bio_list_merge(&bio_list_on_stack, &same);
> +			bio_list_merge(&bio_list_on_stack, &hold);
> +
>  			bio = bio_list_pop(current->bio_list);
>  		} else {
>  			struct bio *bio_next = bio_list_pop(current->bio_list);
>

Thanks Neil for pushing the fix.

We can optimize generic_make_request a little bit:
- assign the hold bio_list with a struct copy instead of init plus merge
- remove the duplicated bio_list_pop in both branches

I think it is better to squash this into your fix.
---
 block/blk-core.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3bc7202..b29b7e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2147,8 +2147,7 @@ blk_qc_t generic_make_request(struct bio *bio)
 			struct bio_list lower, same, hold;

 			/* Create a fresh bio_list for all subordinate requests */
-			bio_list_init(&hold);
-			bio_list_merge(&hold, &bio_list_on_stack);
+			hold = bio_list_on_stack;
 			bio_list_init(&bio_list_on_stack);
 			ret = q->make_request_fn(q, bio);

@@ -2168,14 +2167,10 @@ blk_qc_t generic_make_request(struct bio *bio)
 			bio_list_merge(&bio_list_on_stack, &lower);
 			bio_list_merge(&bio_list_on_stack, &same);
 			bio_list_merge(&bio_list_on_stack, &hold);
-
-			bio = bio_list_pop(current->bio_list);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(current->bio_list);
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
--

Regards,
Jack