Subject: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
To: Mike Snitzer
Cc: Ming Lei, Bart Van Assche, dm-devel@redhat.com, hch@infradead.org,
    linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
    osandov@fb.com
References: <1516296056.2676.23.camel@wdc.com>
    <20180118183039.GA20121@redhat.com>
    <1516301278.2676.35.camel@wdc.com>
    <20180119023212.GA25413@ming.t460p>
    <20180119072623.GB25369@ming.t460p>
    <047f68ec-f51b-190f-2f89-f413325c2540@kernel.dk>
    <20180119154047.GB14827@ming.t460p>
    <540e1239-c415-766b-d4ff-bb0b7f3517a7@kernel.dk>
    <20180119161336.GA22600@redhat.com>
From: Jens Axboe
Message-ID: <65a40164-275c-824b-64fc-b9e4cf781861@kernel.dk>
Date: Fri, 19 Jan 2018 09:23:35 -0700
In-Reply-To: <20180119161336.GA22600@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 1/19/18 9:13 AM, Mike Snitzer wrote:
> On Fri, Jan 19 2018 at 10:48am -0500,
> Jens Axboe wrote:
>
>> On 1/19/18 8:40 AM, Ming Lei wrote:
>>>>>> Where does the dm STS_RESOURCE error usually come from - what
>>>>>> exact resource are we running out of?
>>>>>
>>>>> It is from blk_get_request(underlying queue), see
>>>>> multipath_clone_and_map().
>>>>
>>>> That's what I thought. So for a low queue depth underlying queue, it's
>>>> quite possible that this situation can happen. Two potential solutions
>>>> I see:
>>>>
>>>> 1) As described earlier in this thread, having a mechanism for being
>>>>    notified when the scarce resource becomes available. It would not
>>>>    be hard to tap into the existing sbitmap wait queue for that.
>>>>
>>>> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>>>>    allocation. I haven't read the dm code to know if this is a
>>>>    possibility or not.
>
> Right, #2 is _not_ the way forward. Historically request-based DM used
> its own mempool for requests, this was to be able to have some measure
> of control and resiliency in the face of low memory conditions that
> might be affecting the broader system.
>
> Then Christoph switched over to adding per-request-data; which ushered
> in the use of blk_get_request using ATOMIC allocations. I like the
> result of that line of development. But taking the next step of setting
> BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this
> dm-mpath.c code is common to old .request_fn and blk-mq, at least the
> call to blk_get_request is). Ultimately dm-mpath would like to avoid
> blocking for a request because for this dm-mpath device we have multiple
> queues to allocate from if need be (provided we have an active-active
> storage network topology).

If you can go to multiple devices, obviously it should not block on a
single device. Blocking would probably be fine only for the case where
you can go to just one device. Or if all your paths are busy, then
blocking would also be OK.
But it's a much larger change, and would entail changing more than just
the actual call to blk_get_request().

>> A simple test case would be to have a null_blk device with a queue depth
>> of one, and dm on top of that. Start a fio job that runs two jobs: one
>> that does IO to the underlying device, and one that does IO to the dm
>> device. If the job on the dm device runs substantially slower than the
>> one to the underlying device, then the problem isn't really fixed.
>
> Not sure DM will allow the underlying device to be opened (due to
> master/slave ownership that is part of loading a DM table)?

There are many ways it could be set up - just partition the underlying
device then, and have one partition be part of the dm setup and the
other used directly.

>> That said, I'm fine with ensuring that we make forward progress always
>> first, and then we can come up with a proper solution to the issue. The
>> forward progress guarantee will be needed for the more rare failure
>> cases, like allocation failures. nvme needs that too, for instance, for
>> the discard range struct allocation.
>
> Yeap, I'd be OK with that too. We'd be better off for having revisited
> this and then have some time to develop the ultimate robust fix (#1,
> callback from above).

Yeah, we need the quick and dirty sooner, which just brings us back to
what we had before, essentially.

-- 
Jens Axboe
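
Option 1 quoted above (being notified when the scarce resource becomes
available, instead of blocking or blindly re-running the queue) can be
illustrated with a toy model. This is not kernel code and does not use
the sbitmap API: TagPool, StackedQueue, and all of their methods are
hypothetical names invented for this sketch, which also assumes a
single-threaded setting with no locking.

```python
from collections import deque


class TagPool:
    """Toy stand-in for a fixed-depth tag set (hypothetical, not kernel code)."""

    def __init__(self, depth):
        self.free = depth
        self.waiters = deque()  # callbacks to invoke when a tag is released

    def try_get(self):
        """Non-blocking allocation: return True if a tag was taken."""
        if self.free > 0:
            self.free -= 1
            return True
        return False

    def add_waiter(self, callback):
        self.waiters.append(callback)

    def put(self):
        """Release a tag and notify at most one waiter."""
        self.free += 1
        if self.waiters:
            self.waiters.popleft()()


class StackedQueue:
    """Toy upper-level queue that needs one lower-level tag per request."""

    def __init__(self, pool):
        self.pool = pool
        self.pending = deque()
        self.dispatched = []
        self.waiting = False  # True while a wakeup callback is registered

    def submit(self, req):
        self.pending.append(req)
        self.run()

    def _wake(self):
        # Called by the pool when a tag is released.
        self.waiting = False
        self.run()

    def run(self):
        while self.pending:
            if not self.pool.try_get():
                # Out of tags: register for a wakeup once, then stop
                # instead of sleeping or spinning.
                if not self.waiting:
                    self.waiting = True
                    self.pool.add_waiter(self._wake)
                return
            self.dispatched.append(self.pending.popleft())


pool = TagPool(depth=1)          # queue depth of one, as in the test case
queue = StackedQueue(pool)
for name in ("a", "b", "c"):
    queue.submit(name)
print(queue.dispatched)          # only the first request holds a tag
pool.put()                       # completion of "a" wakes the queue
pool.put()                       # completion of "b" wakes it again
print(queue.dispatched)
```

The point of the sketch is that forward progress is driven by the
release of the resource itself: the stalled queue is re-run exactly when
a tag is freed, so the submitter never has to sleep in the allocation
path.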