Date: Fri, 19 Jan 2018 11:13:36 -0500
From: Mike Snitzer
To: Jens Axboe
Cc: Ming Lei, Bart Van Assche, dm-devel@redhat.com, hch@infradead.org,
    linux-kernel@vger.kernel.org, linux-block@vger.kernel.org, osandov@fb.com
Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
Message-ID: <20180119161336.GA22600@redhat.com>
In-Reply-To: <540e1239-c415-766b-d4ff-bb0b7f3517a7@kernel.dk>
References: <1516296056.2676.23.camel@wdc.com>
    <20180118183039.GA20121@redhat.com>
    <1516301278.2676.35.camel@wdc.com>
    <20180119023212.GA25413@ming.t460p>
    <20180119072623.GB25369@ming.t460p>
    <047f68ec-f51b-190f-2f89-f413325c2540@kernel.dk>
    <20180119154047.GB14827@ming.t460p>
    <540e1239-c415-766b-d4ff-bb0b7f3517a7@kernel.dk>

On Fri, Jan 19 2018 at 10:48am -0500,
Jens Axboe wrote:

> On 1/19/18 8:40 AM, Ming Lei wrote:
> >>>> Where does the dm STS_RESOURCE error usually come from - what exact
> >>>> resource are we running out of?
> >>>
> >>> It is from blk_get_request(underlying queue), see
> >>> multipath_clone_and_map().
> >>
> >> That's what I thought. So for a low queue depth underlying queue, it's
> >> quite possible that this situation can happen. Two potential solutions
> >> I see:
> >>
> >> 1) As described earlier in this thread, having a mechanism for being
> >>    notified when the scarce resource becomes available. It would not
> >>    be hard to tap into the existing sbitmap wait queue for that.
> >>
> >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >>    allocation. I haven't read the dm code to know if this is a
> >>    possibility or not.

Right, #2 is _not_ the way forward.

Historically, request-based DM used its own mempool for requests; this
was to have some measure of control and resiliency in the face of low
memory conditions that might be affecting the broader system.

Then Christoph switched over to adding per-request data, which ushered
in the use of blk_get_request() with ATOMIC allocations.  I like the
result of that line of development.  But taking the next step of
setting BLK_MQ_F_BLOCKING is highly unfortunate (especially since this
dm-mpath.c code is common to the old .request_fn path and blk-mq, at
least where the call to blk_get_request() is concerned).  Ultimately
dm-mpath would like to avoid blocking for a request, because for this
dm-mpath device we have multiple queues to allocate from if need be
(provided we have an active-active storage network topology).

> >> I'd probably prefer #1.  It's a classic case of trying to get the
> >> request, and if it fails, add ourselves to the sbitmap tag wait
> >> queue head, retry, and bail if that also fails.  Connecting the
> >> scarce resource and the consumer is the only way to really fix
> >> this, without bogus arbitrary delays.
> >
> > Right, as I have replied to Bart, using mod_delayed_work_on() with
> > returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
> > resource should fix this issue.
>
> It'll fix the forever stall, but it won't really fix it, as we'll slow
> down the dm device by some random amount.

Agreed.

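To make #1 concrete, this is roughly the shape of it -- a userspace
sketch only, with a plain pool and waiter list standing in for the
underlying queue's tags and the sbitmap wait queue (none of the names
below are real block-layer interfaces):

/*
 * Userspace sketch of the "try, register as a waiter, retry, bail"
 * pattern from #1.  The pool stands in for the underlying queue's
 * request tags, the waiter list for the sbitmap wait queue; all names
 * are made up for illustration.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct waiter {
    struct waiter *next;
    void (*wake)(void *data);   /* e.g. re-run the dm hw queue */
    void *data;
};

struct tag_pool {
    pthread_mutex_t lock;
    int free;                   /* remaining "tags" (queue depth) */
    struct waiter *waiters;
};

/* Non-blocking attempt, standing in for the atomic blk_get_request(). */
static bool pool_try_get(struct tag_pool *p)
{
    bool ok;

    pthread_mutex_lock(&p->lock);
    ok = p->free > 0;
    if (ok)
        p->free--;
    pthread_mutex_unlock(&p->lock);
    return ok;
}

static void pool_add_waiter(struct tag_pool *p, struct waiter *w)
{
    pthread_mutex_lock(&p->lock);
    w->next = p->waiters;
    p->waiters = w;
    pthread_mutex_unlock(&p->lock);
}

/*
 * The #1 pattern: try, add ourselves to the wait queue, retry (a tag
 * may have been freed in between, and without the recheck we'd risk
 * missing the wakeup), and bail if that also fails.  "Bail" maps to
 * returning BLK_STS_RESOURCE; the registered waiter is what re-runs
 * the queue once a tag frees up, not an arbitrary delay.
 */
static bool pool_get_or_register(struct tag_pool *p, struct waiter *w)
{
    if (pool_try_get(p))
        return true;

    pool_add_waiter(p, w);

    /* real code would unregister on success; a spurious wakeup is
     * harmless for this sketch */
    return pool_try_get(p);
}

/* Freeing a tag is what connects the scarce resource to its consumer. */
static void pool_put(struct tag_pool *p)
{
    struct waiter *w;

    pthread_mutex_lock(&p->lock);
    p->free++;
    w = p->waiters;
    if (w)
        p->waiters = w->next;
    pthread_mutex_unlock(&p->lock);

    if (w)
        w->wake(w->data);
}

static void rerun_queue(void *data)
{
    printf("tag freed, re-run %s\n", (const char *)data);
}

int main(void)
{
    struct tag_pool pool = { .lock = PTHREAD_MUTEX_INITIALIZER, .free = 1 };
    struct waiter w = { .wake = rerun_queue, .data = (void *)"dm-mpath queue" };

    pool_try_get(&pool);                /* depth-1 queue is now busy */
    if (!pool_get_or_register(&pool, &w))
        printf("no tag: return BLK_STS_RESOURCE and wait for wakeup\n");
    pool_put(&pool);                    /* completion frees the tag */
    return 0;
}

The key point is that pool_put() is what wakes the consumer, so the dm
queue gets re-run exactly when a request frees up rather than after
some arbitrary delay.
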
> A simple test case would be to have a null_blk device with a queue depth
> of one, and dm on top of that. Start a fio job that runs two jobs: one
> that does IO to the underlying device, and one that does IO to the dm
> device. If the job on the dm device runs substantially slower than the
> one to the underlying device, then the problem isn't really fixed.

Not sure DM will allow the underlying device to be opened (due to the
master/slave ownership that is part of loading a DM table)?

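fio is obviously the right tool for that; purely to spell out the shape
of the comparison, a crude stand-in would be two synchronous readers
racing for the same depth-1 queue and reporting how many IOs each
completed (device paths below are placeholders, and this assumes both
devices can in fact be opened):

/*
 * Crude stand-in for the fio test above: one thread hammers the
 * underlying null_blk device, another hammers the dm device stacked on
 * it, then compare completion counts.  Paths are placeholders.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BS              4096
#define RUNTIME_SECS    10

struct job {
    const char *path;
    unsigned long completed;
};

static void *run_job(void *arg)
{
    struct job *job = arg;
    time_t end = time(NULL) + RUNTIME_SECS;
    void *buf = NULL;
    int fd = open(job->path, O_RDONLY | O_DIRECT);

    if (fd < 0 || posix_memalign(&buf, BS, BS)) {
        perror(job->path);
        return NULL;
    }

    /* synchronous 4k reads of block 0: queue depth 1 per job, enough
     * to show one job starving behind the other */
    while (time(NULL) < end) {
        if (pread(fd, buf, BS, 0) != BS)
            break;
        job->completed++;
    }
    free(buf);
    close(fd);
    return NULL;
}

int main(void)
{
    struct job jobs[2] = {
        { .path = "/dev/nullb0" },              /* underlying device */
        { .path = "/dev/mapper/mpath-test" },   /* dm device on top */
    };
    pthread_t threads[2];
    int i;

    for (i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, run_job, &jobs[i]);
    for (i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);

    for (i = 0; i < 2; i++)
        printf("%-26s %lu IOs in %d seconds\n",
               jobs[i].path, jobs[i].completed, RUNTIME_SECS);
    return 0;
}

If the dm job's count ends up way below the null_blk job's, the RESTART
handling still isn't really fixed by Jens's criterion above.
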
> That said, I'm fine with ensuring that we make forward progress always
> first, and then we can come up with a proper solution to the issue. The
> forward progress guarantee will be needed for the more rare failure
> cases, like allocation failures. nvme needs that too, for instance, for
> the discard range struct allocation.

Yeap, I'd be OK with that too.  We'd be better for revisiting this and
then having some time to develop the ultimate robust fix (#1, the
callback from above).

Mike