Subject: Re: bug in tag handling in blk-mq?
To: Mike Galbraith, Paolo Valente
Cc: Christoph Hellwig, linux-block, Ulf Hansson, LKML, Linus Walleij,
 Oleksandr Natalenko
References: <999DF2B3-4EE8-4BDF-89C5-EB0C2D8BF69E@linaro.org>
 <7760d23b-7a4c-a645-1c7a-da7569bb44dc@kernel.dk>
 <84145CD7-B917-4B32-8A5C-310C1910DB71@linaro.org>
 <1525755090.24338.1.camel@gmx.de> <1525768632.5208.4.camel@gmx.de>
 <1525797766.5204.2.camel@gmx.de>
 <3692ce7d-a767-72e6-65ae-6178b6c2e7d8@kernel.dk>
 <57952405-bdeb-f4e4-1aef-a7c0a8a68674@kernel.dk>
 <1525839089.15732.1.camel@gmx.de>
From: Jens Axboe
Date: Wed, 9 May 2018 09:18:17 -0600
In-Reply-To: <1525839089.15732.1.camel@gmx.de>

On 5/8/18 10:11 PM, Mike Galbraith wrote:
> On Tue, 2018-05-08 at 19:09 -0600, Jens Axboe wrote:
>>
>> Alright, I managed to reproduce it. What I think is happening is that
>> BFQ is limiting the inflight case to something less than the wake
>> batch for sbitmap, which can lead to stalls. I don't have time to test
>> this tonight, but perhaps you can give it a go when you are back at it.
>> If not, I'll try tomorrow morning.
>>
>> If this is the issue, I can turn it into a real patch. This is just to
>> confirm that the issue goes away with the below.
>
> Confirmed. Impressive high speed bug stomping.

Well, that's good news. Can I get you to try this patch? It needs to be
split, but it'll be good to know if this fixes it too (since it's an
ACTUAL attempt at a fix, not just a masking).
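To make the suspected failure mode concrete: sbitmap only wakes batched
waiters after a full wake batch of tags has been freed, and that batch size
is computed from the full queue depth. If an I/O scheduler caps allocations
at a shallow depth smaller than the wake batch, there may never be enough
pending frees to trigger the wakeup, and tasks sleeping on the bitmap stall.
The standalone sketch below is not part of the patch; the helper is a
simplified stand-in for sbq_calc_wake_batch(), the constants mirror
SBQ_WAIT_QUEUES/SBQ_WAKE_BATCH from lib/sbitmap.c, and the example depths
are made up:

#include <stdio.h>

#define SBQ_WAIT_QUEUES	8	/* number of sbitmap wait queues */
#define SBQ_WAKE_BATCH	8	/* max frees per wakeup batch */

/* Simplified stand-in for the kernel's sbq_calc_wake_batch(). */
static unsigned int calc_wake_batch(unsigned int depth)
{
	unsigned int batch = depth / SBQ_WAIT_QUEUES;

	if (batch < 1)
		batch = 1;
	if (batch > SBQ_WAKE_BATCH)
		batch = SBQ_WAKE_BATCH;
	return batch;
}

int main(void)
{
	unsigned int depth = 256;	/* full tag space (made-up example) */
	unsigned int shallow_depth = 3;	/* e.g. a small async limit from a scheduler */
	unsigned int wake_batch = calc_wake_batch(depth);

	/*
	 * Waiters are only woken once wake_batch tags have been freed.
	 * If at most shallow_depth (< wake_batch) tags can ever be in
	 * flight, that many frees can never accumulate, the wakeup never
	 * fires, and allocators sleeping on the bitmap stall.
	 */
	printf("wake_batch=%u shallow_depth=%u -> %s\n", wake_batch,
	       shallow_depth,
	       shallow_depth < wake_batch ? "stall possible" : "ok");
	return 0;
}

With a depth of 256 the wake batch clamps to 8, so a shallow depth of 3 can
never accumulate 8 frees while everyone is asleep.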
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ebc264c87a09..b0dbfd297d20 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -533,19 +533,20 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
  * Limit depths of async I/O and sync writes so as to counter both
  * problems.
  */
-static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
+static int bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
 {
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
 	struct bfq_data *bfqd = data->q->elevator->elevator_data;
 	struct sbitmap_queue *bt;
+	int old_depth;
 
 	if (op_is_sync(op) && !op_is_write(op))
-		return;
+		return 0;
 
 	if (data->flags & BLK_MQ_REQ_RESERVED) {
 		if (unlikely(!tags->nr_reserved_tags)) {
 			WARN_ON_ONCE(1);
-			return;
+			return 0;
 		}
 		bt = &tags->breserved_tags;
 	} else
@@ -554,12 +555,18 @@ static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
 	if (unlikely(bfqd->sb_shift != bt->sb.shift))
 		bfq_update_depths(bfqd, bt);
 
+	old_depth = data->shallow_depth;
 	data->shallow_depth =
 		bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
 
 	bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
 			__func__, bfqd->wr_busy_queues, op_is_sync(op),
 			data->shallow_depth);
+
+	if (old_depth != data->shallow_depth)
+		return data->shallow_depth;
+
+	return 0;
 }
 
 static struct bfq_queue *
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 25c14c58385c..0c53a254671f 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -16,6 +16,32 @@
 #include "blk-mq-tag.h"
 #include "blk-wbt.h"
 
+void blk_mq_sched_limit_depth(struct elevator_queue *e,
+			      struct blk_mq_alloc_data *data, unsigned int op)
+{
+	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
+	struct sbitmap_queue *bt;
+	int ret;
+
+	/*
+	 * Flush requests are special and go directly to the
+	 * dispatch list.
+	 */
+	if (op_is_flush(op) || !e->type->ops.mq.limit_depth)
+		return;
+
+	ret = e->type->ops.mq.limit_depth(op, data);
+	if (!ret)
+		return;
+
+	if (data->flags & BLK_MQ_REQ_RESERVED)
+		bt = &tags->breserved_tags;
+	else
+		bt = &tags->bitmap_tags;
+
+	sbitmap_queue_shallow_depth(bt, ret);
+}
+
 void blk_mq_sched_free_hctx_data(struct request_queue *q,
 				 void (*exit)(struct blk_mq_hw_ctx *))
 {
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1e9c9018ace1..6abebc1b9ae0 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -5,6 +5,9 @@
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
+void blk_mq_sched_limit_depth(struct elevator_queue *e,
+			      struct blk_mq_alloc_data *data, unsigned int op);
+
 void blk_mq_sched_free_hctx_data(struct request_queue *q,
 				 void (*exit)(struct blk_mq_hw_ctx *));
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4e9d83594cca..1bb7aa40c192 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -357,13 +357,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 
 	if (e) {
 		data->flags |= BLK_MQ_REQ_INTERNAL;
-
-		/*
-		 * Flush requests are special and go directly to the
-		 * dispatch list.
-		 */
-		if (!op_is_flush(op) && e->type->ops.mq.limit_depth)
-			e->type->ops.mq.limit_depth(op, data);
+		blk_mq_sched_limit_depth(e, data, op);
 	}
 
 	tag = blk_mq_get_tag(data);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 564967fafe5f..d2622386c115 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -433,17 +433,23 @@ static void rq_clear_domain_token(struct kyber_queue_data *kqd,
 	}
 }
 
-static void kyber_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
+static int kyber_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
 {
+	struct kyber_queue_data *kqd = data->q->elevator->elevator_data;
+
+	if (op_is_sync(op))
+		return 0;
+
 	/*
 	 * We use the scheduler tags as per-hardware queue queueing tokens.
 	 * Async requests can be limited at this stage.
 	 */
-	if (!op_is_sync(op)) {
-		struct kyber_queue_data *kqd = data->q->elevator->elevator_data;
-
+	if (data->shallow_depth != kqd->async_depth) {
 		data->shallow_depth = kqd->async_depth;
+		return data->shallow_depth;
 	}
+
+	return 0;
 }
 
 static void kyber_prepare_request(struct request *rq, struct bio *bio)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 6d9e230dffd2..b2712f4ca9f1 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -105,7 +105,7 @@ struct elevator_mq_ops {
 	int (*request_merge)(struct request_queue *q, struct request **, struct bio *);
 	void (*request_merged)(struct request_queue *, struct request *, enum elv_merge);
 	void (*requests_merged)(struct request_queue *, struct request *, struct request *);
-	void (*limit_depth)(unsigned int, struct blk_mq_alloc_data *);
+	int (*limit_depth)(unsigned int, struct blk_mq_alloc_data *);
 	void (*prepare_request)(struct request *, struct bio *bio);
 	void (*finish_request)(struct request *);
 	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 841585f6e5f2..99059789f45f 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -164,6 +164,17 @@ static inline void sbitmap_free(struct sbitmap *sb)
 void sbitmap_resize(struct sbitmap *sb, unsigned int depth);
 
 /**
+ * sbitmap_queue_shallow_depth() - Inform sbitmap about shallow depth changes
+ * @sbq: Bitmap queue in question
+ * @depth: Shallow depth limit
+ *
+ * Due to how sbitmap does batched wakes, if a user of sbitmap updates the
+ * shallow depth, then we might need to update our batched wake counts.
+ *
+ */
+void sbitmap_queue_shallow_depth(struct sbitmap_queue *sbq, unsigned int depth);
+
+/**
  * sbitmap_get() - Try to allocate a free bit from a &struct sbitmap.
  * @sb: Bitmap to allocate from.
  * @alloc_hint: Hint for where to start searching for a free bit.
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index e6a9c06ec70c..563ae9d75fb8 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -327,7 +327,8 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_init_node);
 
-void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth)
+static void sbitmap_queue_update_batch_wake(struct sbitmap_queue *sbq,
+					    unsigned int depth)
 {
 	unsigned int wake_batch = sbq_calc_wake_batch(depth);
 	int i;
@@ -342,6 +343,11 @@ void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth)
 		for (i = 0; i < SBQ_WAIT_QUEUES; i++)
 			atomic_set(&sbq->ws[i].wait_cnt, 1);
 	}
+}
+
+void sbitmap_queue_resize(struct sbitmap_queue *sbq, unsigned int depth)
+{
+	sbitmap_queue_update_batch_wake(sbq, depth);
 
 	sbitmap_resize(&sbq->sb, depth);
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_resize);
@@ -403,6 +409,15 @@ int __sbitmap_queue_get_shallow(struct sbitmap_queue *sbq,
 }
 EXPORT_SYMBOL_GPL(__sbitmap_queue_get_shallow);
 
+/*
+ * User has limited the shallow depth to 'depth', update batch wake counts
+ */
+void sbitmap_queue_shallow_depth(struct sbitmap_queue *sbq, unsigned int depth)
+{
+	sbitmap_queue_update_batch_wake(sbq, depth);
+}
+EXPORT_SYMBOL_GPL(sbitmap_queue_shallow_depth);
+
 static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
 {
 	int i, wake_index;

-- 
Jens Axboe
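For reference, a minimal sketch (a hypothetical scheduler with made-up names,
not compilable on its own and not taken from the patch) of what an elevator's
limit_depth hook looks like under the interface change above: it returns the
shallow depth only when it actually changed it, and 0 otherwise, so that
blk_mq_sched_limit_depth() knows when to call sbitmap_queue_shallow_depth()
and resync the wake batch:

/* Hypothetical scheduler hook following the new int-returning contract;
 * "example_data" and its "async_depth" field are invented for illustration. */
static int example_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
{
	struct example_data *ed = data->q->elevator->elevator_data;

	/* Only throttle async I/O; sync requests keep the full depth. */
	if (op_is_sync(op))
		return 0;

	if (data->shallow_depth == ed->async_depth)
		return 0;	/* unchanged: no wake-batch update needed */

	data->shallow_depth = ed->async_depth;
	return data->shallow_depth;	/* caller updates the sbitmap wake batch */
}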