Date: Fri, 19 Jan 2018 15:26:24 +0800
From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe
Cc: Bart Van Assche, snitzer@redhat.com, dm-devel@redhat.com,
	hch@infradead.org, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, osandov@fb.com
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
Message-ID: <20180119072623.GB25369@ming.t460p>
References: <20180118024124.8079-1-ming.lei@redhat.com>
	<20180118170353.GB19734@redhat.com>
	<1516296056.2676.23.camel@wdc.com>
	<20180118183039.GA20121@redhat.com>
	<1516301278.2676.35.camel@wdc.com>
	<20180119023212.GA25413@ming.t460p>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jan 18, 2018 at 09:02:45PM -0700, Jens Axboe wrote:
> On 1/18/18 7:32 PM, Ming Lei wrote:
> > On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
> >> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >>>> This is all very tiresome.
> >>>
> >>> Yes, this is tiresome. It is very annoying to me that others keep
> >>> introducing so many regressions in such important parts of the kernel.
> >>> It is also annoying to me that I get blamed if I report a regression
> >>> instead of seeing that the regression gets fixed.
> >>
> >> I agree, it sucks that any change there introduces the regression. I'm
> >> fine with doing the delay insert again until a new patch is proven to
> >> be better.
> >
> > That way is still buggy, as I explained: rerunning the queue before
> > adding the request to hctx->dispatch_list isn't correct. Who can make
> > sure the request is visible when __blk_mq_run_hw_queue() is called?
>
> That race basically doesn't exist for a 10ms gap.
>
> > Not to mention that this way will cause a performance regression again.
>
> How so? It's _exactly_ the same as what you are proposing, except mine
> will potentially run the queue when it need not do so. But given that
> these are random 10ms queue kicks because we are screwed, it should not
> matter. The key point is that it should only be done if we have NO
> better options. If it's a frequently occurring event that we have to
> return BLK_STS_RESOURCE, then we need to find a way to register an
> event for when that condition clears. That event will then kick the
> necessary queue(s).
Please see queue_delayed_work_on(): hctx->run_work is shared by all
scheduling of this hw queue, so once blk_mq_delay_run_hw_queue(100ms) is
queued, no new scheduling can make progress during that 100ms.

> >> From the original topic of this email, we have conditions that can cause
> >> the driver to not be able to submit an IO. A set of those conditions can
> >> only happen if IO is in flight, and those cases we have covered just
> >> fine. Another set can potentially trigger without IO being in flight.
> >> These are cases where a non-device resource is unavailable at the time
> >> of submission. This might be the iommu running out of space, for
> >> instance, or it might be a memory allocation of some sort. For these
> >> cases, we don't get any notification when the shortage clears. All we
> >> can do is ensure that we restart operations at some point in the
> >> future. We're SOL at that point, but we have to ensure that we make
> >> forward progress.
> >
> > Right, it is a generic issue, not a DM-specific one; almost all drivers
> > call kmalloc(GFP_ATOMIC) in the IO path.
>
> GFP_ATOMIC basically never fails, unless we are out of memory. The

I guess GFP_KERNEL may never fail, but GFP_ATOMIC failure is possible:
it is mentioned[1] that there is code like the following in the mm
allocation path, and OOM can happen too:

	if ((some randomly generated condition) && (request is atomic))
		return NULL;

[1] https://lwn.net/Articles/276731/

> exception is higher order allocations. If a driver has a higher order
> atomic allocation in its IO path, the device driver writer needs to be
> taken out behind the barn and shot. Simple as that. It will NEVER work
> well in a production environment. Witness the disaster that so many NIC
> driver writers have learned.
>
> This is NOT the case we care about here. It's resources that are more
> readily depleted because other devices are using them. If it's a high
> frequency or generally occurring event, then we simply must have a
> callback to restart the queue from that.
> The condition then becomes identical to device private starvation, the
> only difference being from where we restart the queue.
>
> > IMO, there is enough time for figuring out a generic solution before
> > the 4.16 release.
>
> I would hope so, but the proposed solutions have not filled me with
> a lot of confidence in the end result so far.
>
> >> That last set of conditions had better not be a common occurrence,
> >> since performance is down the toilet at that point. I don't want to
> >> introduce hot path code to rectify it. Have the driver return if that
> >> happens in a way that is DIFFERENT from needing a normal restart. The
> >> driver knows if this is a resource that will become available when IO
> >> completes on this device or not. If we get that return, we have a
> >> generic run-again delay.
> >
> > Now most of the time neither NVMe nor SCSI returns BLK_STS_RESOURCE,
> > and it should be DM only which returns STS_RESOURCE so often.
>
> Where does the dm STS_RESOURCE error usually come from - what exact
> resource are we running out of?

It is from blk_get_request(underlying queue), see
multipath_clone_and_map().

Thanks,
Ming