Received: by 10.223.185.116 with SMTP id b49csp1603142wrg; Fri, 23 Feb 2018 23:56:04 -0800 (PST) X-Google-Smtp-Source: AH8x226iIep+3qk/s6/EPnj4ewSEvU/LEPYJGnqse/FXsaUKFV6cTe4Jp1HFhnU8IKkmV0aBZDYY X-Received: by 2002:a17:902:4827:: with SMTP id s36-v6mr4218739pld.337.1519458964604; Fri, 23 Feb 2018 23:56:04 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1519458964; cv=none; d=google.com; s=arc-20160816; b=LpxchnJQrQJDkc8RuOpW4ctbMIUSvGaIr16XCe64H/1gRtTiQOBNsqhkuwlk3qYp0B SircKOlaahA2YIaeeOiHVJUEtvFU8aWivnZBWB2vEXmCmlO8r30BHh2ayVfjfaKr4gmk CSf+GewGTng/B9j8rZY/RUGOlyL/DyUBM/Nw1VJVBL6CetkvMJXpej+dmLAr4dcAAbkz PvZc8Zt2tGYW1WCpG4SoVa7rxnhusu3VeAGN1vVj6717+aqt9CKk4vsgWHFPQdbwJtqV AdaV7e7SKqnUMsTThOC5GBPin6MzYdha9v7kfHpoe8+4JmxkRzc4P6cA3Y7Ja/5TuzDl 4dqw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:to:references:message-id :content-transfer-encoding:cc:date:in-reply-to:from:subject :mime-version:dkim-signature:arc-authentication-results; bh=x5RA/ni3mum2uFetKHC1tlGd0Wd8RhHPK1h+Fy4Ckl8=; b=zWD8x+7C4/aVzLMwLp59lGfPdfStKFeHMCYEPAjntZR4jHZT8Yvf6HcvFfd88B+fQr O9AHfQ1iabvwK5NiU+lOXXQvM4GCQcNmd/3fpm3cBGN3bcTVg5dHtMLbQzAqMrrkvZdm Kk1RalAYOGHqRdxfKYu6iD8eEYQ2YP0SAEHOR2RBMOq10UwyzJcQR2qjkWiNiL/siM8a y5RhTsLzXeQju7t3uvsZpHkONlFaxxfQL3YEC6U+v6sTRQHfyztJVe489hVTTihbPVdf rbpw8NqUIQyeeMsRJR6dHFnTSMVggLAcSDC9K1n1o7TqnJj2yUn//iAMPSaPtBfo0RrM HOng== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=XVGvTqjL; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id r14si2618499pgs.604.2018.02.23.23.55.49; Fri, 23 Feb 2018 23:56:04 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=XVGvTqjL; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751437AbeBXHyi (ORCPT + 99 others); Sat, 24 Feb 2018 02:54:38 -0500 Received: from mail-wr0-f175.google.com ([209.85.128.175]:37002 "EHLO mail-wr0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751399AbeBXHyg (ORCPT ); Sat, 24 Feb 2018 02:54:36 -0500 Received: by mail-wr0-f175.google.com with SMTP id z12so16295492wrg.4 for ; Fri, 23 Feb 2018 23:54:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=x5RA/ni3mum2uFetKHC1tlGd0Wd8RhHPK1h+Fy4Ckl8=; b=XVGvTqjLIZXcajbhTtwjc1U2sP9D5d9jvWAecUY90PdkdE4dlmhA19urxUB/Z9y1wH 27u5uIdSkipaAFdJHZxrIebbiGXc83DZY+hDhpGgpy/6ezHJHFTtM7Osscr+K5bukgJs LE7YJ3sTXuGCSF0CwMUjP2B1ddVXdV3S4HYWc= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=x5RA/ni3mum2uFetKHC1tlGd0Wd8RhHPK1h+Fy4Ckl8=; b=XPr4PKZI+ETUJ/9wGknH9PjLOa1BlyPS5lqnT+lmsf1OXPPB1tSM5cfBV7czyT6th+ 2LImOBgwJj+JdYJahaNID/xd3FHKEgD/6PhLhOZIBfikZ2KeyNPbyBQK5a7MRWXoWQ9J F/Nd9iJM5NsS11r4jBoxtB6y5nzv51loZlQqHG3PFJi0SB7FSDtNMh1Wwb7GwonQ6Ae/ z2pKhYW7KuPTJ48zNA/hiBkmZA79/2NvqbSv5U0fSe9RZmWZCdMPOnAh+a1V2dyLfnHN Bjdz/4uBbAH8cyLfi/VEmR1AJsIEta+v5og4vZQPeaPUVGVby0yzFhPtTRG8ySpR8Qpz nY7g== X-Gm-Message-State: APf1xPAU4qWCIoUzQ7zw9A0Zax7KVlEUkcF15EbcNKV39T4sExx451ty OQu9I+AQPzyR5dwr2aIN3QFJ7w== X-Received: by 10.223.145.226 with SMTP id 89mr3713453wri.262.1519458874891; Fri, 23 Feb 2018 23:54:34 -0800 (PST) Received: from [192.168.0.101] (146-241-57-68.dyn.eolo.it. [146.241.57.68]) by smtp.gmail.com with ESMTPSA id r6sm5088956wrg.63.2018.02.23.23.54.32 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 23 Feb 2018 23:54:33 -0800 (PST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 11.2 \(3445.5.20\)) Subject: Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook From: Paolo Valente In-Reply-To: <20180223161751.GA17466@ming.t460p> Date: Sat, 24 Feb 2018 08:54:31 +0100 Cc: Jens Axboe , linux-block , Linux Kernel Mailing List , Ulf Hansson , Mark Brown , Linus Walleij , 'Paolo Valente' via bfq-iosched , Oleksandr Natalenko , Alban Browaeys , Ivan Kozik , SERENA ZIVIANI <169364@studenti.unimore.it>, =?utf-8?Q?Holger_Hoffst=C3=A4tte?= , efault@gmx.de, Serena Ziviani Content-Transfer-Encoding: quoted-printable Message-Id: References: <20180207211920.6343-1-paolo.valente@linaro.org> <20180223150705.GB30079@ming.t460p> <8296094C-C3DB-434F-A83D-64CBDE927D09@linaro.org> <20180223161751.GA17466@ming.t460p> To: Ming Lei X-Mailer: Apple Mail (2.3445.5.20) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > Il giorno 23 feb 2018, alle ore 17:17, Ming Lei = ha scritto: >=20 > Hi Paolo, >=20 > On Fri, Feb 23, 2018 at 04:41:36PM +0100, Paolo Valente wrote: >>=20 >>=20 >>> Il giorno 23 feb 2018, alle ore 16:07, Ming Lei = ha scritto: >>>=20 >>> Hi Paolo, >>>=20 >>> On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote: >>>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq = via >>>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a = device >>>> be re-inserted into the active I/O scheduler for that device. As a >>>=20 >>> No, this behaviour isn't related with commit a6a252e64914, and >>> it has been there since blk_mq_requeue_request() is introduced. >>>=20 >>=20 >> Hi Ming, >> actually, we wrote the above statement after simply following the = call >> chain that led to the failure. In this respect, the change in commit >> a6a252e64914: >>=20 >> static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, >> + bool has_sched, >> struct request *rq) >> { >> - if (rq->tag =3D=3D -1) { >> + /* dispatch flush rq directly */ >> + if (rq->rq_flags & RQF_FLUSH_SEQ) { >> + spin_lock(&hctx->lock); >> + list_add(&rq->queuelist, &hctx->dispatch); >> + spin_unlock(&hctx->lock); >> + return true; >> + } >> + >> + if (has_sched) { >> rq->rq_flags |=3D RQF_SORTED; >> - return false; >> + WARN_ON(rq->tag !=3D -1); >> } >>=20 >> - /* >> - * If we already have a real request tag, send directly to >> - * the dispatch list. >> - */ >> - spin_lock(&hctx->lock); >> - list_add(&rq->queuelist, &hctx->dispatch); >> - spin_unlock(&hctx->lock); >> - return true; >> + return false; >> } >>=20 >> makes blk_mq_sched_bypass_insert return false for all non-flush >> requests. =46rom that, the anomaly described in our commit follows, = for >> bfq any stateful scheduler that waits for the completion of requests >> that passed through it. I'm elaborating again a little bit on this = in >> my replies to your next points below. >=20 > Before a6a252e64914, follows blk_mq_sched_bypass_insert() >=20 > if (rq->tag =3D=3D -1) { > rq->rq_flags |=3D RQF_SORTED; > return false; > } >=20 > /* > * If we already have a real request tag, send directly to > * the dispatch list. > */ > spin_lock(&hctx->lock); > list_add(&rq->queuelist, &hctx->dispatch); > spin_unlock(&hctx->lock); > return true; >=20 > This function still returns false for all non-flush request, so = nothing > changes wrt. this kind of handling. >=20 Yep Ming. I don't have the expertise to tell you why, but the failure in the USB case was caused by an rq that is not flush, but for which rq->tag !=3D -1. So, the previous version of blk_mq_sched_bypass_insert returned true, and there was not failure, while after commit a6a252e64914 the function returns true and the failure occurs if bfq does not exploit the requeue hook. You have actually shown it yourself, several months ago, through the simple code instrumentation you made and used to show that bfq was stuck. And I guess it can still be reproduced very easily, unless something else has changed in blk-mq. BTW, if you can shed a light on this fact, that would be a great occasion to learn for me. >>=20 >> I don't mean that this change is an error, it simply sends a stateful >> scheduler in an inconsistent state, unless the scheduler properly >> handles the requeue that precedes the re-insertion into the >> scheduler. >>=20 >> If this clarifies the situation, but there is still some misleading >> statement in the commit, just let me better understand, and I'll be >> glad to rectify it, if possible somehow. >>=20 >>=20 >>> And you can see blk_mq_requeue_request() is called by lots of = drivers, >>> especially it is often used in error handler, see SCSI's example. >>>=20 >>>> consequence, I/O schedulers may get the same request inserted = again, >>>> even several times, without a finish_request invoked on that = request >>>> before each re-insertion. >>>>=20 >>=20 >> ... >>=20 >>>> @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq =3D = { >>>> .ops.mq =3D { >>>> .limit_depth =3D bfq_limit_depth, >>>> .prepare_request =3D bfq_prepare_request, >>>> - .finish_request =3D bfq_finish_request, >>>> + .requeue_request =3D bfq_finish_requeue_request, >>>> + .finish_request =3D bfq_finish_requeue_request, >>>> .exit_icq =3D bfq_exit_icq, >>>> .insert_requests =3D bfq_insert_requests, >>>> .dispatch_request =3D bfq_dispatch_request, >>>=20 >>> This way may not be correct since blk_mq_sched_requeue_request() can = be >>> called for one request which won't enter io scheduler. >>>=20 >>=20 >> Exactly, there are two cases: requeues that lead to subsequent >> re-insertions, and requeues that don't. The function >> bfq_finish_requeue_request handles both, and both need to be handled, >> to inform bfq that it has not to wait for the completion of rq any >> longer. >>=20 >> One special case is when bfq_finish_requeue_request gets invoked even >> if rq has nothing to do with any scheduler. In that case, >> bfq_finish_requeue_request exists immediately. >>=20 >>=20 >>> __blk_mq_requeue_request() is called for two cases: >>>=20 >>> - one is that the requeued request is added to hctx->dispatch, = such >>> as blk_mq_dispatch_rq_list() >>=20 >> yes >>=20 >>> - another case is that the request is requeued to io scheduler, = such as >>> blk_mq_requeue_request(). >>>=20 >>=20 >> yes >>=20 >>> For the 1st case, blk_mq_sched_requeue_request() shouldn't be called >>> since it is nothing to do with scheduler, >>=20 >> No, if that rq has been inserted and then extracted from the = scheduler >> through a dispatch_request, then it has. The scheduler is stateful, >> and keeps state for rq, because it must do so, until a completion or = a >> requeue arrive. In particular, bfq may decide that no other >=20 > This 'requeue' to hctx->dispatch is invisible to io scheduler, and = from > io scheduler view, seems no difference between STS_OK and = STS_RESOURCE, > since this request will be submitted to lld automatically by blk-mq > core. >=20 > Also in legacy path, no such notification to io scheduler too. ok, still, for reasons that at this point I don't know, in the last 9-10 years it has never been reported a failure caused by a request that has been first dispatched by bfq and then re-inserted into bfq without being completed. This new event is the problem. >> bfq_queues must be served before rq has been completed, because this >> is necessary to preserve its target service guarantees. If bfq is = not >> informed, either through its completion or its requeue hook, then it >> will wait forever. >=20 > The .finish_request will be called after this request is completed for = this > case, either after this io is completed by device, or timeout. >=20 > Or .requeue_request will be called if the driver need to requeue it to > io scheduler via blk_mq_requeue_request() explicitly. >=20 > So io scheduler will be made sure to get notified. >=20 So, you mean that the scheduler will get notified in a way or the other, if it has a requeue hook defined, right? Or, simply, the request will be eventually completed, thus unblocking the scheduler? >>=20 >>> seems we only need to do that >>> for 2nd case. >>>=20 >>=20 >>> So looks we need the following patch: >>>=20 >>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>> index 23de7fd8099a..a216f3c3c3ce 100644 >>> --- a/block/blk-mq.c >>> +++ b/block/blk-mq.c >>> @@ -712,7 +714,6 @@ static void __blk_mq_requeue_request(struct = request *rq) >>>=20 >>> trace_block_rq_requeue(q, rq); >>> wbt_requeue(q->rq_wb, &rq->issue_stat); >>> - blk_mq_sched_requeue_request(rq); >>>=20 >>=20 >> By doing so, if rq has 'passed' through bfq, then you block again bfq >> forever. >=20 > Could you explain it a bit why bfq is blocked by this way? It's a theoretical problem (unfortunately not just a contingency you can work around). If you want to guarantee that a process doing sync I/O, and with a higher weight than other processes, gets a share of the total throughput proportional to its weight, then you have to idle the device during the service slot of that process. Otherwise, the pending requests of the other processes will simply slip through *every time* the high-weight process has no pending request because it is blocked by one of its sync request. The final result will be loss of control on the percentage of requests of the high-weight process that get dispatched w.r.t. to the total number of requests dispatched. So, failure to provide the process with its due share of the throughput. I'm working on solutions to compensate for this nasty service constraint, and its effects on throughput, but this is another story. >=20 >>=20 >> Of course, I may be wrong, as I don't have a complete mental model of >> all what blk-mq does around bfq. But I thin that you can quickly >> check whether a hang occurs again if you add this change. In >> particular, it should happen in the USB former failure case. >=20 > USB/BFQ runs well on my parted test with this change, and this test = can > trigger the IO hang issue easily on kyber or without your patch of = 'block, > bfq: add requeue-request hook'. >=20 Great! According to the facts I reported at the beginning of this email, I guess your patch works because it avoids the nasty case for bfq, i.e., a request re-inserted into bfq without being completed first, right? And then, about the requeues without re-insertions, everything works because the latter type of requests eventually get a completion. Is it correct? Sorry for being pedantic, but because of our lack of expertise on these requeueing mechanism, we worked really a lot on this commit, and it would be frustrating if we bumped again into troubles, after removing it. Apart from that, if this commit is useless, it is perfectly fine for me that it is reverted and replaced by a better solution. In the end, less code in bfq means less room for bugs ;) Thanks, Paolo > Thanks, > Ming