Received: by 10.223.164.221 with SMTP id h29csp2764379wrb; Wed, 18 Oct 2017 06:36:21 -0700 (PDT) X-Received: by 10.84.174.67 with SMTP id q61mr15646671plb.184.1508333781529; Wed, 18 Oct 2017 06:36:21 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1508333781; cv=none; d=google.com; s=arc-20160816; b=hKUA/Y7aJmcVDywYyKXtrDQyr6OBL7nvyxMlAqEbjz5TPvCTI+wAmdFiCbzdgYh4rr f0JalDjHPNRTWpW1wnLcTO8wfEb22j43+jCQDJH01UJ/7WmfX6yZXNvCDKd3YjsWRBik nRTyLQuN6rbDno7AwTeWVghbsiuPpHdVqoSBP9WZ0MBOVr1NIz9bKbKxTW2sSNa2fqgb 4YlXc+ChL39ieMVQa1BExuIhixJJGNhewvloQGADv8vcQX4KwdAup0FWLK1/PIA9Raxb Myn0fO7IUrRGeOvidlp1Ncsv1CvXu88lA5B0XfNxuOb74NVyXhf3Xtfv1EH/zzjLxNPr g/MA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:to:message-id:date:subject:cc:from :dkim-signature:arc-authentication-results; bh=dQPZ6s9d5bgbekGmYP8yD+oeViWw/sA94WB9Rgk7n5s=; b=U35mFKG09WZ8Y6Qxnb2PZFxy1J93o/jCQOrrC1S54jSMnQRsuteL0wFc8uUl5sg4yK gsL4+ANCSRih+9qfih9fP+WO6mttQa7pSvRVbQ3Vjg6pXmGveoJUjHmevMIPc+8rSFCD jEuW7FIzOsfEnvxalGPNMMGYNjLkoVTGWh+AwUwwsShcb8GwAphS12oglPYw85lMk3vV mP/H7Ttrc+W8N/iihJvuzOVikPoTcUBTTK/YeUMZWi9Dy61kNuc4lv2hsAHUwJD80wyG Mdf9oEmcJvzmUdD9Ck6fnGG2ogVZn8Mslo579/WKLla8hYqPXX3Tx6qnShlWzFX37yNP gNUg== ARC-Authentication-Results: i=1; mx.google.com; dkim=fail header.i=@profitbricks-com.20150623.gappssmtp.com header.s=20150623 header.b=RB5Vyji+; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id e62si7481671pfa.196.2017.10.18.06.36.06; Wed, 18 Oct 2017 06:36:21 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=fail header.i=@profitbricks-com.20150623.gappssmtp.com header.s=20150623 header.b=RB5Vyji+; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S937786AbdJRKWb (ORCPT + 99 others); Wed, 18 Oct 2017 06:22:31 -0400 Received: from mail-wr0-f177.google.com ([209.85.128.177]:56105 "EHLO mail-wr0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S937695AbdJRKW1 (ORCPT ); Wed, 18 Oct 2017 06:22:27 -0400 Received: by mail-wr0-f177.google.com with SMTP id z99so1983491wrc.12 for ; Wed, 18 Oct 2017 03:22:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=profitbricks-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id; bh=dQPZ6s9d5bgbekGmYP8yD+oeViWw/sA94WB9Rgk7n5s=; b=RB5Vyji+DM5b2gSwVMN0DRX8MP/E9l1wSt8zAAKe65JB47iLeMEmMYHYwked6D2/nm RJuSPKMcb7wNFbnvoHXqMrLPWRUNToglBPlExi/uk8Fi0k79GesOeAf+NrirLJGU0i8U I76sQ0ZS4tVL2zJYV76hQPxasbetAkR1T4QH52qAehAX+w2+XRJGAgiYl602YmNxTTRI ogv1jrp+SBGH4nhERP8ZSDuHL0nXT4OlbZWQMNXy3R5t3mWnrw0mT8qwJWZPzCJd+3kS 9g3fCqGKSakMTLgmLBRLR+Y7TVUJjtpc03ns0r891mJUCuVchMx7nTvLIoo9d8ImRLDM GFIw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id; bh=dQPZ6s9d5bgbekGmYP8yD+oeViWw/sA94WB9Rgk7n5s=; b=qMPiDsNTFWKu3qEirhIUuKzk8lmrlCticgd59SQPdHRRjGLStPF4n40m7Y8njZ25KH GxcVCZjG3+oxn69phlTzmG5/ElcQ+zsuCd1eAYwJdrWnfToB5vJVOxR+XDxjKrFwHr0V gUyMdJbMmCyso/VpCWSQUVbKbQAzYGGqX3ZZDlLJOKZp1N+KoeMRwxQtXm8Hyxg5CSTf 80E5qZWI4uyqcHzEAJ97HEBauylMgudrW3NhDwejHOz7R/KO99kyelIiZ1kUK99ZZDnn e4nce+ozmemN6c/UOmx8encqiFQynsZ6xNgcy5UbPJYf6S/eeqTESEnxb0HDERyEHXhK PgTg== X-Gm-Message-State: AMCzsaXNOUvFx0p7edvc71ZHoQtXeOpqEV5l7sfVPiMAcExJq0C5yXQK Zf1B/nwRf+jGjWS/PHetB7GcbQ== X-Google-Smtp-Source: ABhQp+QcumtTgL51iNE2rL0AR01WKKFnCE7XA3GU1BUN3P/BefAaWpCOeQ7HJ0zyhvc5ihGSkCNBtA== X-Received: by 10.223.143.105 with SMTP id p96mr6526791wrb.266.1508322145734; Wed, 18 Oct 2017 03:22:25 -0700 (PDT) Received: from pb.pb.local ([62.217.45.26]) by smtp.gmail.com with ESMTPSA id x75sm12150722wme.3.2017.10.18.03.22.24 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 18 Oct 2017 03:22:24 -0700 (PDT) From: Roman Pen Cc: Roman Pen , linux-kernel@vger.kernel.org, linux-block@vger.kernel.org, Bart Van Assche , Christoph Hellwig , Hannes Reinecke , Jens Axboe Subject: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart Date: Wed, 18 Oct 2017 12:22:06 +0200 Message-Id: <20171018102206.26020-1-roman.penyaev@profitbricks.com> X-Mailer: git-send-email 2.13.1 To: unlisted-recipients:; (no To-header on input) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi all, the patch below fixes queue stalling when shared hctx marked for restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero. The root cause is that hctxs are shared between queues, but 'shared_hctx_restart' belongs to the particular queue, which in fact may not need to be restarted, thus we return from blk_mq_sched_restart() and leave shared hctx of another queue never restarted. The fix is to make shared_hctx_restart counter belong not to the queue, but to tags, thereby counter will reflect real number of shared hctx needed to be restarted. During tests 1 hctx (set->nr_hw_queues) was used and all stalled requests were noticed in dd->fifo_list of mq-deadline scheduler. Seeming possible sequence of events: 1. Request A of queue A is inserted into dd->fifo_list of the scheduler. 2. Request B of queue A bypasses scheduler and goes directly to hctx->dispatch. 3. Request C of queue B is inserted. 4. blk_mq_sched_dispatch_requests() is invoked, since hctx->dispatch is not empty (request B is in the list) hctx is only marked for for next restart and request A is left in a list (see comment "So it's best to leave them there for as long as we can. Mark the hw queue as needing a restart in that case." in blk-mq-sched.c) 5. Eventually request B is completed/freed and blk_mq_sched_restart() is called, but by chance hctx from queue B is chosen for restart and request C gets a chance to be dispatched. 6. Eventually request C is completed/freed and blk_mq_sched_restart() is called, but shared_hctx_restart for queue B is zero and we return without attempt to restart hctx from queue A, thus request A is stuck forever. But stalling queue is not the only one problem with blk_mq_sched_restart(). My tests show that those loops thru all queues and hctxs can be very costly, even with shared_hctx_restart counter, which aims to fix performance issue. For my tests I create 128 devices with 64 hctx each, which share same tags set. The following is the fio and ftrace output for v4.14-rc4 kernel: READ: io=5630.3MB, aggrb=573208KB/s, minb=573208KB/s, maxb=573208KB/s, mint=10058msec, maxt=10058msec WRITE: io=5650.9MB, aggrb=575312KB/s, minb=575312KB/s, maxb=575312KB/s, mint=10058msec, maxt=10058msec root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq Function Hit Time Avg s^2 -------- --- ---- --- --- blk_mq_sched_restart 16347 9540759 us 583.639 us 8804801 us blk_mq_sched_restart 7884 6073471 us 770.354 us 8780054 us blk_mq_sched_restart 14176 7586794 us 535.185 us 2822731 us blk_mq_sched_restart 7843 6205435 us 791.206 us 12424960 us blk_mq_sched_restart 1490 4786107 us 3212.153 us 1949753 us blk_mq_sched_restart 7892 6039311 us 765.244 us 2994627 us blk_mq_sched_restart 15382 7511126 us 488.306 us 3090912 us [cut] And here are results with two patches reverted: 8e8320c9315c ("blk-mq: fix performance regression with shared tags") 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared") READ: io=12884MB, aggrb=1284.3MB/s, minb=1284.3MB/s, maxb=1284.3MB/s, mint=10032msec, maxt=10032msec WRITE: io=12987MB, aggrb=1294.6MB/s, minb=1294.6MB/s, maxb=1294.6MB/s, mint=10032msec, maxt=10032msec root@pserver16:~/roman# cat /sys/kernel/debug/tracing/trace_stat/* | grep blk_mq Function Hit Time Avg s^2 -------- --- ---- --- --- blk_mq_sched_restart 50699 8802.349 us 0.173 us 121.771 us blk_mq_sched_restart 50362 8740.470 us 0.173 us 161.494 us blk_mq_sched_restart 50402 9066.337 us 0.179 us 113.009 us blk_mq_sched_restart 50104 9366.197 us 0.186 us 188.645 us blk_mq_sched_restart 50375 9317.727 us 0.184 us 54.218 us blk_mq_sched_restart 50136 9311.657 us 0.185 us 446.790 us blk_mq_sched_restart 50103 9179.625 us 0.183 us 114.472 us [cut] Timings and stdevs are terrible, which leads to significant difference: 570MB/s vs 1280MB/s. This is RFC since current patch fixes queue stalling but performance issue still remains and for me is not clear is it better to improve commit 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared") making percpu restart lists (to avoid looping and to dequeue hctx immediately) or revert it (frankly I did not notice any difference on small number of devices and hctxs, when looping issue does not impact much). -- Roman Signed-off-by: Roman Pen Cc: linux-kernel@vger.kernel.org Cc: linux-block@vger.kernel.org Cc: Bart Van Assche Cc: Christoph Hellwig Cc: Hannes Reinecke Cc: Jens Axboe --- block/blk-mq-sched.c | 10 +++++----- block/blk-mq-tag.c | 1 + block/blk-mq-tag.h | 1 + block/blk-mq.c | 4 ++-- include/linux/blkdev.h | 2 -- 5 files changed, 9 insertions(+), 9 deletions(-) diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 4ab69435708c..a19a7f275173 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -60,10 +60,10 @@ static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx) return; if (hctx->flags & BLK_MQ_F_TAG_SHARED) { - struct request_queue *q = hctx->queue; + struct blk_mq_tags *tags = hctx->tags; if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) - atomic_inc(&q->shared_hctx_restart); + atomic_inc(&tags->shared_hctx_restart); } else set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state); } @@ -74,10 +74,10 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx) return false; if (hctx->flags & BLK_MQ_F_TAG_SHARED) { - struct request_queue *q = hctx->queue; + struct blk_mq_tags *tags = hctx->tags; if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) - atomic_dec(&q->shared_hctx_restart); + atomic_dec(&tags->shared_hctx_restart); } else clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state); @@ -312,7 +312,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx) * If this is 0, then we know that no hardware queues * have RESTART marked. We're done. */ - if (!atomic_read(&queue->shared_hctx_restart)) + if (!atomic_read(&tags->shared_hctx_restart)) return; rcu_read_lock(); diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index d0be72ccb091..598f8e0095ff 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -385,6 +385,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags, tags->nr_tags = total_tags; tags->nr_reserved_tags = reserved_tags; + atomic_set(&tags->shared_hctx_restart, 0); return blk_mq_init_bitmap_tags(tags, node, alloc_policy); } diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index 5cb51e53cc03..adf05c8811cd 100644 --- a/block/blk-mq-tag.h +++ b/block/blk-mq-tag.h @@ -11,6 +11,7 @@ struct blk_mq_tags { unsigned int nr_reserved_tags; atomic_t active_queues; + atomic_t shared_hctx_restart; struct sbitmap_queue bitmap_tags; struct sbitmap_queue breserved_tags; diff --git a/block/blk-mq.c b/block/blk-mq.c index 4603b115e234..7639f978ea2c 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2120,11 +2120,11 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared) queue_for_each_hw_ctx(q, hctx, i) { if (shared) { if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) - atomic_inc(&q->shared_hctx_restart); + atomic_inc(&hctx->tags->shared_hctx_restart); hctx->flags |= BLK_MQ_F_TAG_SHARED; } else { if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) - atomic_dec(&q->shared_hctx_restart); + atomic_dec(&hctx->tags->shared_hctx_restart); hctx->flags &= ~BLK_MQ_F_TAG_SHARED; } } diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 2a5d52fa90f5..3852a9ea87d0 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -393,8 +393,6 @@ struct request_queue { int nr_rqs[2]; /* # allocated [a]sync rqs */ int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */ - atomic_t shared_hctx_restart; - struct blk_queue_stats *stats; struct rq_wb *rq_wb; -- 2.13.1 From 1583480743439632044@xxx Wed Nov 08 07:08:35 +0000 2017 X-GM-THRID: 1583480743439632044 X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread