Received: by 2002:ac0:a874:0:0:0:0:0 with SMTP id c49csp580451ima; Fri, 15 Mar 2019 09:16:17 -0700 (PDT) X-Google-Smtp-Source: APXvYqxiDlwtuUstqIFwrngL8MUvvt/lj7pDVL0tDteOKrs5h+TTcKH/17uDW472L1H3i2kGgZ1B X-Received: by 2002:a17:902:8a8a:: with SMTP id p10mr4989658plo.92.1552666577598; Fri, 15 Mar 2019 09:16:17 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1552666577; cv=none; d=google.com; s=arc-20160816; b=SaDAaYmLcwr4GCSbzVNmNXhtPx/9vk8KrcYgkRAhH/luOjektRSAAmwdJx5qbO7css Ja2mxfx57ZUDnuzZFl/fLCYWtX1P2qwGmH8tzBaa3WqH6ngIQvXxnetdTBiW+/y9tYBG k4MVCYxo0XdTZC35aT+RVoVkGvdFcYsAYnJCnSPHFPlU8bl9Kxz4tFMG5/V3Iv+71sQg LPm+H2rRGKpnHNo7vkrx6SHO3D6slgPbXk/Wt+UKg9hQhSXdfNG+IL1bDXqGwG1Ml9f/ /jwmbqwuwZ8ZZG/bAC7UOn5Tl8L9pFB1u74+psCAV9dPqRFcI4k9b03mU35BobSf0heo +lQg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=3cGtbrBYA1yduBCJN73VYh8B3LOKK3467es1tpkFFRI=; b=cRZ7AcZtP0hLP4NmS1qRqCh1Yy5RK17OL3l1CuLx79rdZ5DgW3kElkf4CmNOjmYhE+ 5x0idIzftsHzXiEpZufa+GsZFz39TNEKaMFPd/EeejuYM03ONTltiODz3zPmw51a14jc l3+8KDgVjOiraNs8gowErPPHGzp2Yjcnj+q51RdrcJF6ZOGDL5j6SABV+hgn2SnRdeh5 2IbBARneqLL8OFMIhwEm0G9jD8gi1i4h1+35KJpug1z6bI11lA72HGyRm3BFYbjnbK7B 5R+0nCfQ+pM5FfncxJylCLOewq9vb59T0Gl3dG0BJGIEXx/3hv2kgK58Cy5fL3Ihlm8A UKnQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id r13si2208197pfh.75.2019.03.15.09.16.01; Fri, 15 Mar 2019 09:16:17 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729397AbfCOQPX (ORCPT + 99 others); Fri, 15 Mar 2019 12:15:23 -0400 Received: from mga14.intel.com ([192.55.52.115]:51229 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729465AbfCOQPW (ORCPT ); Fri, 15 Mar 2019 12:15:22 -0400 X-Amp-Result: UNKNOWN X-Amp-Original-Verdict: FILE UNKNOWN X-Amp-File-Uploaded: False Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga103.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 15 Mar 2019 09:15:21 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.58,482,1544515200"; d="scan'208";a="327645325" Received: from unknown (HELO localhost.localdomain) ([10.232.112.69]) by fmsmga006.fm.intel.com with ESMTP; 15 Mar 2019 09:15:21 -0700 Date: Fri, 15 Mar 2019 10:16:08 -0600 From: Keith Busch To: Jianchao Wang Cc: "axboe@kernel.dk" , "hch@lst.de" , "jthumshirn@suse.de" , "hare@suse.de" , "josef@toxicpanda.com" , "bvanassche@acm.org" , "sagi@grimberg.me" , "Busch, Keith" , "jsmart2021@gmail.com" , "linux-block@vger.kernel.org" , "linux-nvme@lists.infradead.org" , "linux-kernel@vger.kernel.org" Subject: Re: [PATCH 2/8] blk-mq: change the method of iterating busy tags of a request_queue Message-ID: <20190315161608.GB15153@localhost.localdomain> References: <1552640264-26101-1-git-send-email-jianchao.w.wang@oracle.com> <1552640264-26101-3-git-send-email-jianchao.w.wang@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1552640264-26101-3-git-send-email-jianchao.w.wang@oracle.com> User-Agent: Mutt/1.9.1 (2017-09-22) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Mar 15, 2019 at 01:57:38AM -0700, Jianchao Wang wrote: > tags->rqs[] will not been cleaned when free driver tag and there > is a window between get driver tag and write tags->rqs[], so we > may see stale rq in tags->rqs[] which may have been freed, as > following case, > blk_mq_get_request blk_mq_queue_tag_busy_iter > -> blk_mq_get_tag > -> bt_for_each > -> bt_iter > -> rq = taags->rqs[] > -> rq->q > -> blk_mq_rq_ctx_init > -> data->hctx->tags->rqs[rq->tag] = rq; > > To fix this, the blk_mq_queue_tag_busy_iter is changed in this > patch to use tags->static_rqs[] instead of tags->rqs[]. We have > to identify whether there is a io scheduler attached to decide > to use hctx->tags or hctx->sched_tags. And we will try to get a > non-zero q_usage_counter before that, so it is safe to access > them. Add 'inflight' parameter to determine to iterate in-flight > requests or just busy tags. A correction here is that > part_in_flight should count the busy tags instead of rqs that > have got driver tags. > > Moreover, get rid of the 'hctx' parameter in iter function as > we could get it from rq->mq_hctx and export this interface for > drivers. > > Signed-off-by: Jianchao Wang > --- > block/blk-mq-tag.c | 76 +++++++++++++++++++++++++++++++++----------------- > block/blk-mq-tag.h | 2 -- > block/blk-mq.c | 31 ++++++++------------ > include/linux/blk-mq.h | 5 ++-- > 4 files changed, 64 insertions(+), 50 deletions(-) > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c > index b792537..cdec2cd 100644 > --- a/block/blk-mq-tag.c > +++ b/block/blk-mq-tag.c > @@ -216,26 +216,38 @@ struct bt_iter_data { > busy_iter_fn *fn; > void *data; > bool reserved; > + bool inflight; > }; > > static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) > { > struct bt_iter_data *iter_data = data; > struct blk_mq_hw_ctx *hctx = iter_data->hctx; > - struct blk_mq_tags *tags = hctx->tags; > bool reserved = iter_data->reserved; > + struct blk_mq_tags *tags; > struct request *rq; > > + tags = hctx->sched_tags ? hctx->sched_tags : hctx->tags; > + > if (!reserved) > bitnr += tags->nr_reserved_tags; > - rq = tags->rqs[bitnr]; > + /* > + * Because tags->rqs[] will not been cleaned when free driver tag > + * and there is a window between get driver tag and write tags->rqs[], > + * so we may see stale rq in tags->rqs[] which may have been freed. > + * Using static_rqs[] is safer. > + */ > + rq = tags->static_rqs[bitnr]; > > /* > - * We can hit rq == NULL here, because the tagging functions > - * test and set the bit before assigning ->rqs[]. > + * There is a small window between get tag and blk_mq_rq_ctx_init, > + * so rq->q and rq->mq_hctx maybe different. > */ > - if (rq && rq->q == hctx->queue) > - return iter_data->fn(hctx, rq, iter_data->data, reserved); > + if (rq && rq->q == hctx->queue && > + rq->mq_hctx == hctx && > + (!iter_data->inflight || > + blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)) > + return iter_data->fn(rq, iter_data->data, reserved); There is still a window where the check may succeed, but the request is being assigned to a completely different request_queue. The callback then operates on a request it doesn't own. Tag iteration from a driver can be safe only if the driver initiates a freeze and quiesced all queues sharing the same tagset first. The nvme driver does that, but I think there may need to be an additional grace period to wait for the allocation's queue_enter to call queue_exit to ensure the request is initialized through blk_mq_rq_ctx_init().