From: Jianchao Wang <jianchao.w.wang@oracle.com>
To: axboe@kernel.dk, ming.lei@redhat.com, bart.vanassche@wdc.com
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC] blk-mq: clean up the hctx restart
Date: Tue, 31 Jul 2018 12:02:15 +0800
Message-Id: <1533009735-2221-1-git-send-email-jianchao.w.wang@oracle.com>
X-Mailer: git-send-email 2.7.4

Currently, we always set SCHED_RESTART whenever there are requests on
hctx->dispatch; then, when a request is completed and freed, the hctx
queues are restarted to avoid IO hangs.
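For context, the pre-patch flow can be condensed roughly as follows (a
simplified sketch paraphrased from the blk-mq-sched.c hunks below, with
the local-variable setup and scheduler dispatch abbreviated; not the
verbatim kernel code):

	/* Sketch only: details and error handling elided. */
	void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
	{
		struct request_queue *q = hctx->queue;
		LIST_HEAD(rq_list);

		/* ... previously deferred requests are moved onto rq_list ... */

		if (!list_empty(&rq_list)) {
			/* marked unconditionally, even when no restart is needed */
			blk_mq_sched_mark_restart_hctx(hctx);
			blk_mq_dispatch_rq_list(q, &rq_list, false);
		}

		/* ... otherwise dispatch from the scheduler or sw queues ... */
	}

On request completion, blk_mq_sched_restart() then walks the queues
sharing the tag set round-robin and reruns every hctx that has
SCHED_RESTART set.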
This is unnecessary most of the time. In particular, when there are
lots of LUNs attached to one host, the round-robin (RR) restart loop
can be very expensive.

Requests are queued on the hctx dispatch list due to:
 - flush and cpu hotplug cases
 - no driver tag w/ io scheduler
 - LLDD returns BLK_STS_RESOURCE/BLK_STS_DEV_RESOURCE
 - no driver budget

We needn't care about the 1st case; it is a normal insert. Regarding
the 2nd case, if tags are shared, the sbitmap_queue wakeup hook will
work and hctx_may_queue will avoid starving the other hctxs; otherwise
hctx restart is employed, but that is done by blk_mq_mark_tag_wait. We
need the hctx restart for the 3rd and 4th cases, to rerun the hctx
queue once LLDD limits are lifted as in-flight requests complete.

So remove blk_mq_sched_mark_restart_hctx from
blk_mq_sched_dispatch_requests and only use it when the LLDD returns
BLK_STS_DEV_RESOURCE or there is no driver budget. In addition, some
race conditions are fixed in this patch. For more details, please
refer to the comment in blk_mq_dispatch_rq_list.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Roman Pen
---
 block/blk-mq-sched.c |  7 ++---
 block/blk-mq-sched.h |  1 +
 block/blk-mq.c       | 82 +++++++++++++++++++++++++++++++---------------------
 3 files changed, 52 insertions(+), 38 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 56c493c..3270ad0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -54,7 +54,7 @@ void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio)
  * Mark a hardware queue as needing a restart. For shared queues, maintain
  * a count of how many hardware queues are marked for restart.
  */
-static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
+void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 {
 	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
 		return;
@@ -202,15 +202,13 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 	 * we only ever dispatch a fraction of the requests available because
 	 * of low device queue depth. Once we pull requests out of the IO
 	 * scheduler, we can no longer merge or sort them. So it's best to
-	 * leave them there for as long as we can. Mark the hw queue as
-	 * needing a restart in that case.
+	 * leave them there for as long as we can.
 	 *
 	 * We want to dispatch from the scheduler if there was nothing
 	 * on the dispatch list or we were able to dispatch from the
 	 * dispatch list.
 	 */
 	if (!list_empty(&rq_list)) {
-		blk_mq_sched_mark_restart_hctx(hctx);
 		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
 			if (has_sched_dispatch)
 				blk_mq_do_dispatch_sched(hctx);
@@ -417,7 +415,6 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
 		 */
 		if (!atomic_read(&queue->shared_hctx_restart))
 			return;
-
 		rcu_read_lock();
 		list_for_each_entry_rcu_rr(q, queue, &set->tag_list,
 					   tag_set_list) {
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 0cb8f93..650cbfc 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -32,6 +32,7 @@ int blk_mq_sched_init_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			   unsigned int hctx_idx);
 void blk_mq_sched_exit_hctx(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 			    unsigned int hctx_idx);
+void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx);
 
 static inline bool
 blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45df734..b47627a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1086,6 +1086,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	bool no_tag = false;
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
+	bool marked = false;
 
 	if (list_empty(list))
 		return false;
@@ -1096,6 +1097,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * Now process all the entries, sending them to the driver.
 	 */
 	errors = queued = 0;
+retry:
 	do {
 		struct blk_mq_queue_data bd;
 
@@ -1115,12 +1117,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			 */
 			if (!blk_mq_mark_tag_wait(&hctx, rq)) {
 				blk_mq_put_dispatch_budget(hctx);
-				/*
-				 * For non-shared tags, the RESTART check
-				 * will suffice.
-				 */
-				if (hctx->flags & BLK_MQ_F_TAG_SHARED)
-					no_tag = true;
+				no_tag = true;
 				break;
 			}
 		}
@@ -1172,42 +1169,61 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
-		bool needs_restart;
+		bool run_queue = false;
+
+		if (!no_tag && (ret != BLK_STS_RESOURCE) && !marked) {
+			blk_mq_sched_mark_restart_hctx(hctx);
+			marked = true;
+			/*
+			 * .queue_rq will put the budget, so we need to get it again.
+			 */
+			got_budget = false;
+			goto retry;
+		}
 
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
 
 		/*
-		 * If SCHED_RESTART was set by the caller of this function and
-		 * it is no longer set that means that it was cleared by another
-		 * thread and hence that a queue rerun is needed.
+		 * The reasons why we reach here could be:
+		 * - No driver tag
+		 *   For the shared-tags case, the tag wakeup hook will work
+		 *   and hctx_may_queue will avoid starving the other hctxs.
+		 *   For the non-shared-tags case, hctx restart is employed.
 		 *
-		 * If 'no_tag' is set, that means that we failed getting
-		 * a driver tag with an I/O scheduler attached. If our dispatch
-		 * waitqueue is no longer active, ensure that we run the queue
-		 * AFTER adding our entries back to the list.
+		 *   Because the requests had not been queued on hctx->dispatch
+		 *   when blk_mq_mark_tag_wait ran, both conditions may have
+		 *   fired against an empty hctx->dispatch and done nothing.
+		 *   So recheck them here and rerun the queue if needed.
 		 *
-		 * If no I/O scheduler has been configured it is possible that
-		 * the hardware queue got stopped and restarted before requests
-		 * were pushed back onto the dispatch list. Rerun the queue to
-		 * avoid starvation. Notes:
-		 * - blk_mq_run_hw_queue() checks whether or not a queue has
-		 *   been stopped before rerunning a queue.
-		 * - Some but not all block drivers stop a queue before
-		 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
-		 *   and dm-rq.
-		 *
-		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
-		 * bit is set, run queue after a delay to avoid IO stalls
-		 * that could otherwise occur if the queue is idle.
+		 * - BLK_STS_DEV_RESOURCE / no driver budget
+		 *   Employ the hctx restart mechanism to restart the queue.
+		 *   The LLDD resources could be released:
+		 *   - before SCHED_RESTART is marked: we retry the dispatch
+		 *     to cover this case.
+		 *   - after SCHED_RESTART is marked but before the requests
+		 *     are queued on hctx->dispatch: we recheck SCHED_RESTART
+		 *     and rerun the hctx if it is no longer set.
+		 * - BLK_STS_RESOURCE
+		 *   Run the queue after a delay to avoid IO stalls. For more
+		 *   details, refer to the comment on BLK_STS_DEV_RESOURCE.
 		 */
-		needs_restart = blk_mq_sched_needs_restart(hctx);
-		if (!needs_restart ||
-		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
-			blk_mq_run_hw_queue(hctx, true);
-		else if (needs_restart && (ret == BLK_STS_RESOURCE))
-			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
+
+		if (no_tag) {
+			if (hctx->flags & BLK_MQ_F_TAG_SHARED)
+				run_queue = list_empty_careful(&hctx->dispatch_wait.entry);
+			else
+				run_queue = !blk_mq_sched_needs_restart(hctx);
+			if (run_queue)
+				blk_mq_run_hw_queue(hctx, true);
+		} else if (ret == BLK_STS_RESOURCE) {
+			if (blk_mq_sched_needs_restart(hctx))
+				blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
+		} else {
+			if (!blk_mq_sched_needs_restart(hctx))
+				blk_mq_run_hw_queue(hctx, true);
+		}
 
 		return false;
 	}
-- 
2.7.4
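For review convenience, the leftover-request handling at the tail of
blk_mq_dispatch_rq_list() after this patch reduces to roughly the
following (a condensed restatement of the final hunk above, with the
per-request bookkeeping elided and the comments mine; not the literal
patched code):

	/* Condensed sketch of the patched leftover-request handling. */
	if (!list_empty(list)) {
		bool run_queue = false;

		/* First pass: mark SCHED_RESTART, regain the budget, retry once. */
		if (!no_tag && ret != BLK_STS_RESOURCE && !marked) {
			blk_mq_sched_mark_restart_hctx(hctx);
			marked = true;
			got_budget = false;	/* .queue_rq already put the budget */
			goto retry;
		}

		/* Second pass: park the leftovers on hctx->dispatch ... */
		spin_lock(&hctx->lock);
		list_splice_init(list, &hctx->dispatch);
		spin_unlock(&hctx->lock);

		/* ... and recheck whether a rerun is still needed. */
		if (no_tag) {
			if (hctx->flags & BLK_MQ_F_TAG_SHARED)
				run_queue = list_empty_careful(&hctx->dispatch_wait.entry);
			else
				run_queue = !blk_mq_sched_needs_restart(hctx);
			if (run_queue)
				blk_mq_run_hw_queue(hctx, true);
		} else if (ret == BLK_STS_RESOURCE) {
			if (blk_mq_sched_needs_restart(hctx))
				blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
		} else {
			if (!blk_mq_sched_needs_restart(hctx))
				blk_mq_run_hw_queue(hctx, true);
		}

		return false;
	}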