Received: by 2002:a25:6193:0:0:0:0:0 with SMTP id v141csp133700ybb; Tue, 31 Mar 2020 19:05:44 -0700 (PDT) X-Google-Smtp-Source: APiQypL+7B43aaVwXrrZxPX0ie/FAcR9EzfwVNaGjz/jYNb95S69BwmSPPeb+I8/17mmjJvHpSLC X-Received: by 2002:a05:6808:495:: with SMTP id z21mr1338736oid.30.1585706744234; Tue, 31 Mar 2020 19:05:44 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1585706744; cv=none; d=google.com; s=arc-20160816; b=lI7osGzMQlDYbShJchIf5Sw+HUdRR+R+dy44Rmrv3vQhvpadLY6R75BCigs8xRLcT1 yASCpbOAK4+DvbRKh7qW/9g1KmMotp/QmOvnOFeg8GxK+lZAWFAIlg8nrYmVE12XUjVu AHSs6HErZSuPnZDSMhflxtgIYSB03xmEHO4KGwGkkKEumE9jEkJj+XIzZYo7dpt/1shV 5FDuqQe9pgP53ttSf4D4FgHdcYItp4FLj/iCPMUCwgSw9fWCm4kpFkClb7BVHaWjlvD8 K2zqBw7x/diZ/4Z0K657QnTG0t9B9DI+mAcgMSTaRXv50VXAG3YwJSqrr3S0iwPBSk98 PWtQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=ufaOCEFCA4uXdrXl9mettuNypsTl5ZbRrfDLVi4fUdY=; b=Uiod2bX71GFjLPMBpQOaUGHmKqnI+VuxiJh7Uilft0/WOV6sY6RXgbUs/VfbIRu40T OOYgkTwGNlPhwRm6uhzdYs6DvkmLTf2qthFTrLyz0CvUYHebCbXWgKmT1jnfgDbOssKu bMiAGDR649HxqtASeuN523KF5YLojbHAV5O9xYx40Bba5fhk/dLAcL8PPaTgKWPJ9kSL q2yeAqFBdwIIpR0Q/XH9OyrmcjBsaPJXGXIJ9i3i6tGHIpqQfhF3r8bUVfKo7ed/cdsV jsIdVHhFuVRtTlIhwmipqx27+TOS4uejZbdhcPf/tSPbeaduu182adV+J+hm78UGOaVW 7NsQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=YSXG50L7; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id a18si185269otq.286.2020.03.31.19.05.31; Tue, 31 Mar 2020 19:05:44 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=YSXG50L7; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731616AbgDACEj (ORCPT + 99 others); Tue, 31 Mar 2020 22:04:39 -0400 Received: from us-smtp-1.mimecast.com ([205.139.110.61]:54351 "EHLO us-smtp-delivery-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1731523AbgDACEi (ORCPT ); Tue, 31 Mar 2020 22:04:38 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1585706676; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=ufaOCEFCA4uXdrXl9mettuNypsTl5ZbRrfDLVi4fUdY=; b=YSXG50L7UCnaqqIdTNrbp4Oe16XCKyoz2yxHaZWczkL6F13SISds2bUQEtL2PorzCrEEBa Y2ycFMlRvP75bvTJG8hpmmSd0uKGWOttivh1FKjyT3Gbecq+7Iz1uS0C2Wqu6jpYVK0VPF EPd/H7rUF3sULTAWnIhv98S9aoRfSSg= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-488-PRoNgaLfMJKVdHw3z1WhnQ-1; Tue, 31 Mar 2020 22:04:32 -0400 X-MC-Unique: PRoNgaLfMJKVdHw3z1WhnQ-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 2302913F9; Wed, 1 Apr 2020 02:04:31 +0000 (UTC) Received: from ming.t460p (ovpn-8-19.pek2.redhat.com [10.72.8.19]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 4733B5C1C5; Wed, 1 Apr 2020 02:04:22 +0000 (UTC) Date: Wed, 1 Apr 2020 10:04:18 +0800 From: Ming Lei To: Doug Anderson Cc: Jens Axboe , Paolo Valente , "James E.J. Bottomley" , "Martin K. Petersen" , linux-block , Guenter Roeck , linux-scsi@vger.kernel.org, Salman Qazi , LKML Subject: Re: [PATCH 2/2] scsi: core: Fix stall if two threads request budget at the same time Message-ID: <20200401020418.GA16793@ming.t460p> References: <20200330144907.13011-1-dianders@chromium.org> <20200330074856.2.I28278ef8ea27afc0ec7e597752a6d4e58c16176f@changeid> <20200331014109.GA20230@ming.t460p> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.12.1 (2019-06-15) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.16 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 31, 2020 at 04:51:00PM -0700, Doug Anderson wrote: > Hi, > > On Tue, Mar 31, 2020 at 11:26 AM Jens Axboe wrote: > > > > On 3/31/20 12:07 PM, Paolo Valente wrote: > > >> Il giorno 31 mar 2020, alle ore 03:41, Ming Lei ha scritto: > > >> > > >> On Mon, Mar 30, 2020 at 07:49:06AM -0700, Douglas Anderson wrote: > > >>> It is possible for two threads to be running > > >>> blk_mq_do_dispatch_sched() at the same time with the same "hctx". > > >>> This is because there can be more than one caller to > > >>> __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't > > >>> prevent more than one thread from entering. > > >>> > > >>> If more than one thread is running blk_mq_do_dispatch_sched() at the > > >>> same time with the same "hctx", they may have contention acquiring > > >>> budget. The blk_mq_get_dispatch_budget() can eventually translate > > >>> into scsi_mq_get_budget(). If the device's "queue_depth" is 1 (not > > >>> uncommon) then only one of the two threads will be the one to > > >>> increment "device_busy" to 1 and get the budget. > > >>> > > >>> The losing thread will break out of blk_mq_do_dispatch_sched() and > > >>> will stop dispatching requests. The assumption is that when more > > >>> budget is available later (when existing transactions finish) the > > >>> queue will be kicked again, perhaps in scsi_end_request(). > > >>> > > >>> The winning thread now has budget and can go on to call > > >>> dispatch_request(). If dispatch_request() returns NULL here then we > > >>> have a potential problem. Specifically we'll now call > > >> > > >> I guess this problem should be BFQ specific. Now there is definitely > > >> requests in BFQ queue wrt. this hctx. However, looks this request is > > >> only available from another loser thread, and it won't be retrieved in > > >> the winning thread via e->type->ops.dispatch_request(). > > >> > > >> Just wondering why BFQ is implemented in this way? > > >> > > > > > > BFQ inherited this powerful non-working scheme from CFQ, some age ago. > > > > > > In more detail: if BFQ has at least one non-empty internal queue, then > > > is says of course that there is work to do. But if the currently > > > in-service queue is empty, and is expected to receive new I/O, then > > > BFQ plugs I/O dispatch to enforce service guarantees for the > > > in-service queue, i.e., BFQ responds NULL to a dispatch request. > > > > What BFQ is doing is fine, IFF it always ensures that the queue is run > > at some later time, if it returns "yep I have work" yet returns NULL > > when attempting to retrieve that work. Generally this should happen from > > subsequent IO completion, or whatever else condition will resolve the > > issue that is currently preventing dispatch of that request. Last resort > > would be a timer, but that can happen if you're slicing your scheduling > > somehow. > > I've been poking more at this today trying to understand why the idle > timer that Paolo says is in BFQ isn't doing what it should be doing. > I've been continuing to put most of my stream-of-consciousness at > to avoid so much spamming of this thread. > In the trace I looked at most recently it looks like BFQ does try to > ensure that the queue is run at a later time, but at least in this > trace the later time is not late enough. Specifically the quick > summary of my recent trace: > > 28977309us - PID 2167 got the budget. > 28977518us - BFQ told PID 2167 that there was nothing to dispatch. > 28977702us - BFQ idle timer fires. > 28977725us - We start to try to dispatch as a result of BFQ's idle timer. > 28977732us - Dispatching that was a result of BFQ's idle timer can't get > budget and thus does nothing. Looks the BFQ idle timer may be re-tried given it knows there is work to do. > 28977780us - PID 2167 put the budget and exits since there was nothing > to dispatch. > > This is only one particular trace, but in this case BFQ did attempt to > rerun the queue after it returned NULL, but that ran almost > immediately after it returned NULL and thus ran into the race. :( > > > > > It would be very easy to change bfq_has_work so that it returns false > > > in case the in-service queue is empty, even if there is I/O > > > backlogged. My only concern is: since everything has worked with the > > > current scheme for probably 15 years, are we sure that everything is > > > still ok after we change this scheme? > > > > You're comparing apples to oranges, CFQ never worked within the blk-mq > > scheduling framework. > > > > That said, I don't think such a change is needed. If we currently have a > > hang due to this discrepancy between has_work and gets_work, then it > > sounds like we're not always re-running the queue as we should. From the > > original patch, the budget putting is not something the scheduler is > > involved with. Do we just need to ensure that if we put budget without > > having dispatched a request, we need to kick off dispatching again? > > By this you mean a change like this in blk_mq_do_dispatch_sched()? > > if (!rq) { > blk_mq_put_dispatch_budget(hctx); > + ret = true; > break; > } From Jens's tree, blk_mq_do_dispatch_sched() returns nothing. Which tree are you talking against? Thanks, Ming