Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
To: Ming Lei
Cc: Bart Van Assche, snitzer@redhat.com, dm-devel@redhat.com,
    hch@infradead.org, linux-kernel@vger.kernel.org,
    linux-block@vger.kernel.org, osandov@fb.com
References: <20180118024124.8079-1-ming.lei@redhat.com>
    <20180118170353.GB19734@redhat.com>
    <1516296056.2676.23.camel@wdc.com>
    <20180118183039.GA20121@redhat.com>
    <1516301278.2676.35.camel@wdc.com>
    <20180119023212.GA25413@ming.t460p>
From: Jens Axboe
Date: Thu, 18 Jan 2018 21:02:45 -0700
In-Reply-To: <20180119023212.GA25413@ming.t460p>
X-Mailing-List: linux-kernel@vger.kernel.org

On 1/18/18 7:32 PM, Ming Lei wrote:
> On Thu, Jan 18, 2018 at 01:11:01PM -0700, Jens Axboe wrote:
>> On 1/18/18 11:47 AM, Bart Van Assche wrote:
>>>> This is all very tiresome.
>>>
>>> Yes, this is tiresome.
>>> It is very annoying to me that others keep
>>> introducing so many regressions in such important parts of the kernel.
>>> It is also annoying to me that I get blamed if I report a regression
>>> instead of seeing that the regression gets fixed.
>>
>> I agree, it sucks that any change there introduces the regression. I'm
>> fine with doing the delay insert again until a new patch is proven to be
>> better.
>
> That way is still buggy as I explained, since rerun queue before adding
> request to hctx->dispatch_list isn't correct. Who can make sure the request
> is visible when __blk_mq_run_hw_queue() is called?

That race basically doesn't exist for a 10ms gap.

> Not to mention this way will cause performance regression again.

How so? It's _exactly_ the same as what you are proposing, except mine
will potentially run the queue when it need not do so. But given that
these are random 10ms queue kicks because we are screwed, it should not
matter. The key point is that it only should be if we have NO better
options. If it's a frequently occurring event that we have to return
BLK_STS_RESOURCE, then we need to get a way to register an event for
when that condition clears. That event will then kick the necessary
queue(s).

>> From the original topic of this email, we have conditions that can cause
>> the driver to not be able to submit an IO. A set of those conditions can
>> only happen if IO is in flight, and those cases we have covered just
>> fine. Another set can potentially trigger without IO being in flight.
>> These are cases where a non-device resource is unavailable at the time
>> of submission. This might be iommu running out of space, for instance,
>> or it might be a memory allocation of some sort. For these cases, we
>> don't get any notification when the shortage clears. All we can do is
>> ensure that we restart operations at some point in the future. We're SOL
>> at that point, but we have to ensure that we make forward progress.
>
> Right, it is a generic issue, not DM-specific one, almost all drivers
> call kmalloc(GFP_ATOMIC) in IO path.

GFP_ATOMIC basically never fails, unless we are out of memory. The
exception is higher order allocations. If a driver has a higher order
atomic allocation in its IO path, the device driver writer needs to be
taken out behind the barn and shot. Simple as that. It will NEVER work
well in a production environment. Witness the disaster that so many NIC
driver writers have learned.

This is NOT the case we care about here. It's resources that are more
readily depleted because other devices are using them. If it's a high
frequency or generally occurring event, then we simply must have a
callback to restart the queue from that. The condition then becomes
identical to device private starvation, the only difference being from
where we restart the queue.

> IMO, there is enough time for figuring out a generic solution before
> 4.16 release.

I would hope so, but the proposed solutions have not filled me with a
lot of confidence in the end result so far.

>> That last set of conditions better not be a common occurrence, since
>> performance is down the toilet at that point. I don't want to introduce
>> hot path code to rectify it. Have the driver return if that happens in a
>> way that is DIFFERENT from needing a normal restart. The driver knows if
>> this is a resource that will become available when IO completes on this
>> device or not. If we get that return, we have a generic run-again delay.
>
> Now most of the time both NVMe and SCSI won't return BLK_STS_RESOURCE, and
> it should be DM-only which returns STS_RESOURCE so often.

Where does the dm STS_RESOURCE error usually come from - what exact
resource are we running out of?

-- 
Jens Axboe
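
For illustration only, the split argued for above (a driver return that is
DIFFERENT from a normal restart, answered with a generic run-again delay)
could be modelled roughly as follows. This is a userspace sketch under
stated assumptions, not kernel code and not a proposed patch: every name
in it (sketch_status, sketch_queue, SKETCH_RERUN_DELAY_MS and the three
helpers) is hypothetical, and the 10ms value is simply the figure
mentioned in the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; not the kernel's blk_status_t. */
enum sketch_status {
    SKETCH_OK,          /* request was issued */
    SKETCH_DEV_BUSY,    /* device-private shortage: a completion on this
                           device will clear it, so a normal restart works */
    SKETCH_SHARED_BUSY, /* non-device shortage (iommu space, memory, ...):
                           nothing notifies us when it clears */
};

#define SKETCH_RERUN_DELAY_MS 10 /* assumed delay for the fallback queue kick */

struct sketch_queue {
    bool restart_on_completion; /* rerun when one of our own IOs completes */
    long rerun_at_ms;           /* deadline of the delayed rerun, -1 if none */
};

/* Record how the queue will make forward progress after a dispatch attempt. */
static void sketch_handle_status(struct sketch_queue *q,
                                 enum sketch_status st, long now_ms)
{
    assert(q != NULL);
    switch (st) {
    case SKETCH_OK:
        break;
    case SKETCH_DEV_BUSY:
        q->restart_on_completion = true;
        break;
    case SKETCH_SHARED_BUSY:
        /* Keep an already armed deadline instead of pushing it out. */
        if (q->rerun_at_ms < 0)
            q->rerun_at_ms = now_ms + SKETCH_RERUN_DELAY_MS;
        break;
    }
}

/* IO completion on this device: returns true if the queue should be rerun. */
static bool sketch_on_completion(struct sketch_queue *q)
{
    bool run = q->restart_on_completion;

    q->restart_on_completion = false;
    return run;
}

/* Timer tick: returns true once the delayed rerun is due, then disarms it. */
static bool sketch_on_timer(struct sketch_queue *q, long now_ms)
{
    if (q->rerun_at_ms >= 0 && now_ms >= q->rerun_at_ms) {
        q->rerun_at_ms = -1;
        return true;
    }
    return false;
}
```

In this model the delayed rerun is only armed on the SKETCH_SHARED_BUSY
path, so the common dispatch path carries no extra work. A shortage that
turned out to be frequent would call for a callback from whoever owns the
resource instead of the timer, which the sketch does not attempt to show.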