Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753640AbdCHRcV (ORCPT ); Wed, 8 Mar 2017 12:32:21 -0500
Received: from mail-wm0-f47.google.com ([74.125.82.47]:35489 "EHLO
	mail-wm0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752324AbdCHRcS (ORCPT ); Wed, 8 Mar 2017 12:32:18 -0500
MIME-Version: 1.0
In-Reply-To:
References: <87h93blz6g.fsf@notabene.neil.brown.name>
	<71562c2c-97f4-9a0a-32ec-30e0702ca575@profitbricks.com>
	<87lgsjj9w8.fsf@notabene.neil.brown.name>
	<20170307165233.GB30230@redhat.com>
	<5cfbdc6b-9ba7-605a-642b-7f625cf5f5b7@kernel.dk>
	<20170307171436.GA2109@redhat.com>
	<87tw74j0e4.fsf@notabene.neil.brown.name>
Date: Wed, 8 Mar 2017 18:15:04 +0100
Message-ID:
Subject: Re: blk: improve order of bio handling in generic_make_request()
From: Lars Ellenberg
To: Mikulas Patocka
Cc: NeilBrown, Mike Snitzer, Jens Axboe, Jack Wang, LKML,
	Kent Overstreet, Pavel Machek
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 3239
Lines: 83

On 8 March 2017 at 17:40, Mikulas Patocka wrote:
>
> On Wed, 8 Mar 2017, NeilBrown wrote:
> > I don't think this will fix the DM snapshot deadlock by itself.
> > Rather, it make it possible for some internal changes to DM to fix it.
> > The DM change might be something vaguely like:
> >
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 3086da5664f3..06ee0960e415 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
> >
> >  	len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
> >
> > +	if (len < ci->sector_count) {
> > +		struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);
>
> fs_bio_set is a shared bio set, so it is prone to deadlocks.
> For this change, we would need two bio sets per dm device, one for the
> split bio and one for the outgoing bio. (this also means having one more
> kernel thread per dm device)
>
> It would be possible to avoid having two bio sets if the incoming bio were
> the same as the outgoing bio (allocate a small structure, move bi_end_io
> and bi_private into it, replace bi_end_io and bi_private with pointers to
> device mapper and send the bio to the target driver), but it would need
> much more changes - basically rewrite the whole bio handling code in dm.c
> and in the targets.
>
> Mikulas

"back then" (see previously posted link into ML archive)
I suggested this:

...

A bit of conflict here may be that DM has all its own
split and clone and queue magic, and wants to process
"all of the bio" before returning back to generic_make_request().

To change that, __split_and_process_bio() and all its helpers
would need to learn to "push back" (pieces of) the bio they are
currently working on, and not push back via "DM_ENDIO_REQUEUE",
but by
bio_list_add_head(&current->bio_lists->queue, piece_to_be_done_later).

Then, after they processed each piece,
*return* all the way up to the top-level generic_make_request(),
where the recursion-to-iteration logic would then
make sure that all deeper level bios, submitted via
recursive calls to generic_make_request() will be processed, before the
next, pushed back, piece of the "original incoming" bio.

And *not* do their own iteration over all pieces first.

Probably not as easy as dropping the while loop,
using bio_advance, and pushing that "advanced" bio back to
current->...queue?

static void __split_and_process_bio(struct mapped_device *md,
				    struct dm_table *map, struct bio *bio)
...
		ci.bio = bio;
		ci.sector_count = bio_sectors(bio);
		while (ci.sector_count && !error)
			error = __split_and_process_non_flush(&ci);
...
		error = __split_and_process_non_flush(&ci);
		if (ci.sector_count)
			bio_advance()
			bio_list_add_head(&current->bio_lists->queue, )
...
Something like that, maybe?
Needs to be adapted to this new and improved recursion-to-iteration
logic, obviously.

Would that be doable, or does device-mapper for some reason really
*need* its own iteration loop (which, because it is called from
generic_make_request(), won't be able to ever submit anything to
any device, ever, so needs all these helper threads just in case).

	Lars
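
For illustration only, here is a toy userspace model of the push-back
idea described above. It is not kernel code and uses no kernel API.
Every name in it (toy_bio, toy_list, toy_submit, MAX_IO_LEN) is
hypothetical, and the way freshly submitted bios are merged in front of
the queue is an assumption made for this sketch, not a description of
what generic_make_request() actually does. The stacked level maps one
piece, pushes the advanced remainder back to the head of the queue, and
returns; the top-level loop then runs the lower-level piece before the
remainder.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IO_LEN 8u   /* hypothetical split boundary of the stacked device */
#define POOL_SIZE  64u  /* static pool so the sketch needs no malloc() */

struct toy_bio {
	int level;              /* 0 = stacked (dm-like) device, 1 = lower device */
	unsigned sector, count;
	struct toy_bio *next;
};

struct toy_list { struct toy_bio *head, *tail; };

static struct toy_bio pool[POOL_SIZE];
static size_t pool_used;

static struct toy_bio *toy_alloc(int level, unsigned sector, unsigned count)
{
	assert(pool_used < POOL_SIZE);
	struct toy_bio *b = &pool[pool_used++];
	b->level = level; b->sector = sector; b->count = count; b->next = NULL;
	return b;
}

static void list_add_tail(struct toy_list *l, struct toy_bio *b)
{
	b->next = NULL;
	if (l->tail) l->tail->next = b; else l->head = b;
	l->tail = b;
}

static void list_add_head(struct toy_list *l, struct toy_bio *b)
{
	b->next = l->head;
	l->head = b;
	if (!l->tail) l->tail = b;
}

static struct toy_bio *list_pop(struct toy_list *l)
{
	struct toy_bio *b = l->head;
	if (b) {
		l->head = b->next;
		if (!l->head) l->tail = NULL;
		b->next = NULL;
	}
	return b;
}

/* Put everything from src in front of dst, keeping src's order. */
static void list_merge_head(struct toy_list *dst, struct toy_list *src)
{
	if (!src->head) return;
	src->tail->next = dst->head;
	if (!dst->tail) dst->tail = src->tail;
	dst->head = src->head;
	src->head = src->tail = NULL;
}

/*
 * Top-level iteration. Records in order[] the start sector of each piece
 * in the sequence the lower device sees them; returns the number recorded.
 */
size_t toy_submit(unsigned sector, unsigned count, unsigned *order, size_t max)
{
	struct toy_list queue = { NULL, NULL }, fresh = { NULL, NULL };
	struct toy_bio *b;
	size_t n = 0;

	pool_used = 0;
	list_add_tail(&queue, toy_alloc(0, sector, count));
	while ((b = list_pop(&queue)) != NULL) {
		if (b->level == 0) {
			unsigned len = b->count < MAX_IO_LEN ? b->count : MAX_IO_LEN;
			/* map exactly one piece to the lower device ... */
			list_add_tail(&fresh, toy_alloc(1, b->sector, len));
			if (len < b->count) {
				/* ... advance the original and push it back */
				b->sector += len;
				b->count -= len;
				list_add_head(&queue, b);
			}
		} else if (n < max) {
			order[n++] = b->sector;  /* lower device handles the piece */
		}
		/* deeper-level bios go ahead of the pushed-back remainder */
		list_merge_head(&queue, &fresh);
	}
	return n;
}
```

Calling toy_submit(0, 20, order, 8) splits a 20-sector bio into pieces
starting at sectors 0, 8 and 16, and each piece reaches the lower level
before the stacked level touches the next one. The stacked level never
loops over all pieces itself, which is the property asked about above.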