Date: Mon, 25 Jun 2018 13:20:36 -0600
From: Ross Zwisler
To: Mike Snitzer
Cc: Ross Zwisler, Toshi Kani, dm-devel@redhat.com, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nvdimm@lists.01.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH v2 4/7] dm: prevent DAX mounts if not supported
Message-ID: <20180625192036.GA11672@linux.intel.com>
References: <20180529195106.14268-1-ross.zwisler@linux.intel.com>
 <20180529195106.14268-5-ross.zwisler@linux.intel.com>
 <20180601215513.GA18712@redhat.com>
 <20180604231508.GA10666@linux.intel.com>
 <20180620151748.GA4847@redhat.com>
In-Reply-To: <20180620151748.GA4847@redhat.com>

On Wed, Jun 20, 2018 at 11:17:49AM -0400, Mike Snitzer wrote:
> On Mon, Jun 04 2018 at 7:15pm -0400,
> Ross Zwisler wrote:
>
> > On Fri, Jun 01, 2018 at 05:55:13PM -0400, Mike Snitzer wrote:
> > > On Tue, May 29 2018 at 3:51pm -0400,
> > > Ross Zwisler wrote:
> > >
> > > > Currently the code in dm_dax_direct_access() only checks whether the target
> > > > type has a direct_access() operation defined, not whether the underlying
> > > > block devices all support DAX. This latter property can be seen by looking
> > > > at whether we set the QUEUE_FLAG_DAX request queue flag when creating the
> > > > DM device.
> > >
> > > Wait... I thought DAX support was all or nothing?
> >
> > Right, it is, and that's what I'm trying to capture. The point of this series
> > is to make sure that we don't use DAX thru DM if one of the DM members doesn't
> > support DAX.
> >
> > This is a bit tricky, though, because as you've pointed out there are a lot of
> > elements that go into a block device actually supporting DAX.
> >
> > First, the block device has to have a direct_access() operation defined in its
> > struct dax_operations table. This is a static definition in the drivers,
> > though, so it's necessary but not sufficient. For example, the PMEM driver
> > always defines a direct_access() operation, but depending on the mode of the
> > namespace (raw, fsdax or sector) it may or may not support DAX.
> >
> > The next step is that a driver needs to say that the block queue supports
> > QUEUE_FLAG_DAX.
> > This again is necessary but not sufficient. The PMEM driver
> > currently sets this for all namespace modes, but I agree that this should be
> > restricted to modes that support DAX. Even once we do that, though, for the
> > block driver this isn't fully sufficient. We'd really like users to call
> > bdev_dax_supported() so it can run some additional tests to make sure that DAX
> > will work.
> >
> > So, the real test that filesystems rely on is bdev_dax_supported().
> >
> > The trick is that with DM we need to verify each block device via
> > bdev_dax_supported() just like a filesystem would, and then have some way of
> > communicating the result of all those checks to the filesystem which is
> > eventually mounted on the DM device. At DAX mount time the filesystem will
> > call bdev_dax_supported() on the DM device, but it'll really only check the
> > first device.
> >
> > So, the strategy is to have DM manually check each member device via
> > bdev_dax_supported() and then, if they all pass, set QUEUE_FLAG_DAX. This then
> > becomes our one source of truth on whether or not a DM device supports DAX.
> > When the filesystem mounts with DAX support it'll also run
> > bdev_dax_supported(), but if we have QUEUE_FLAG_DAX set on the DM device, we
> > know that this check will pass.
> >
> > > > This is problematic if we have, for example, a dm-linear device made up of
> > > > a PMEM namespace in fsdax mode followed by a ramdisk from BRD.
> > > > QUEUE_FLAG_DAX won't be set on the dm-linear device's request queue, but
> > > > we have a working direct_access() entry point and the first member of the
> > > > dm-linear set *does* support DAX.
> > >
> > > If you don't have a uniformly capable device then it is very dangerous
> > > to advertise that the entire device has a certain capability. That
> > > completely bit me in the past with discard (because for every IO I
> > > wasn't then checking if the destination device supported discards).
> > >
> > > It is all well and good that you're adding that check here. But what I
> > > don't like is how you're saying QUEUE_FLAG_DAX implies direct_access()
> > > operation exists.. yet for raw PMEM namespaces we just discussed how
> > > that is a lie.
> >
> > QUEUE_FLAG_DAX does imply that direct_access() exists. However, as discussed
> > above for a given bdev we really do need to check bdev_dax_supported().
> >
> > > So this type of change showcases how the QUEUE_FLAG_DAX doesn't _really_
> > > imply direct_access() exists.
> > >
> > > > This allows the user to create a filesystem on the dm-linear device, and
> > > > then mount it with DAX. The filesystem's bdev_dax_supported() test will
> > > > pass because it'll operate on the first member of the dm-linear device,
> > > > which happens to be a fsdax PMEM namespace.
> > > >
> > > > All DAX I/O will then fail to that dm-linear device because the lack of
> > > > QUEUE_FLAG_DAX prevents fs_dax_get_by_bdev() from working. This means that
> > > > the struct dax_device isn't ever set in the filesystem, so
> > > > dax_direct_access() will always return -EOPNOTSUPP.
> > >
> > > Now you've lost me... these past 2 paragraphs. Why can a user mount it
> > > in DAX mode? Because bdev_dax_supported() only accesses the first
> > > portion (which happens to have DAX capabilities?)
> >
> > Right. bdev_dax_supported() runs all of its checks, and because they are
> > running against the first block device in the dm set, they all pass. But the
> > overall DM device does not actually support DAX.
> >
> > > Isn't this exactly why you should be checking for QUEUE_FLAG_DAX in the
> > > caller (bdev_dax_supported)? Why not use bdev_get_queue() and verify
> > > QUEUE_FLAG_DAX is set in there?
> >
> > I'll look into that for the next revision, thanks.
>
> Have you made any progress on a new revision?
>
> > > > By failing out of dm_dax_direct_access() if QUEUE_FLAG_DAX isn't set we let
> > > > the filesystem know we don't support DAX at mount time. The filesystem
> > > > will then silently fall back and remove the dax mount option, causing it to
> > > > work properly.
> > >
> > > This shouldn't be needed. Again, QUEUE_FLAG_DAX wasn't set.. so don't
> > > allow code to falsely try operations that should've been gated by the
> > > fact it wasn't set.
> >
> > Right, the goal is to make QUEUE_FLAG_DAX our one source of truth for whether
> > DM devices support DAX, and not have it half defined by that and half by the
> > DM_TYPE_DAX_BIO_BASED.
>
> My hope is that you can ignore the DM-internal book-keeping
> (DM_TYPE_DAX_BIO_BASED) for now and just focus on fixing the real issue
> of needing proper checking (as well as properly _not_ setting
> QUEUE_FLAG_DAX in the case of pmem "raw").
>
> Please advise, thanks Ross!

I'm back working on this, and will send out another revision in the next day or
so.
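
For reference, the strategy discussed in this thread can be illustrated with a
minimal userspace sketch. This is not kernel code and not the patch itself: the
model_* structs and functions are hypothetical stand-ins for a block device, a
DM device and bdev_dax_supported(), and the pass/fail rules are assumptions
that only mirror what is described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a member block device (not struct block_device). */
struct model_bdev {
	const char *name;
	bool has_direct_access;  /* driver defines a direct_access() operation */
	bool mode_supports_dax;  /* assumed: true for PMEM fsdax, false otherwise */
	bool queue_flag_dax;     /* models QUEUE_FLAG_DAX on the member's queue */
};

/* Hypothetical model of a DM device built from several members. */
struct model_dm {
	const struct model_bdev *members;
	size_t nr_members;
	bool queue_flag_dax;     /* models QUEUE_FLAG_DAX on the DM queue */
};

/* Models the per-device test; every condition is necessary, none alone
 * is sufficient. */
static bool model_bdev_dax_supported(const struct model_bdev *bdev)
{
	return bdev->has_direct_access &&
	       bdev->queue_flag_dax &&
	       bdev->mode_supports_dax;
}

/* Strategy from the thread: check every member, and set the DM flag
 * only if all of them pass. */
static void model_dm_set_dax_flag(struct model_dm *dm)
{
	assert(dm->nr_members > 0);
	dm->queue_flag_dax = true;
	for (size_t i = 0; i < dm->nr_members; i++) {
		if (!model_bdev_dax_supported(&dm->members[i])) {
			dm->queue_flag_dax = false;
			break;
		}
	}
}

/* Problem case: a mount-time check that only looks at the first member. */
static bool model_dm_mount_check_first_only(const struct model_dm *dm)
{
	return model_bdev_dax_supported(&dm->members[0]);
}

/* Proposed behaviour: the DM queue flag is the one source of truth. */
static bool model_dm_mount_check_flag(const struct model_dm *dm)
{
	return dm->queue_flag_dax;
}

/* Models direct access being refused when the flag is not set. */
static int model_dm_direct_access(const struct model_dm *dm)
{
	return dm->queue_flag_dax ? 0 : -EOPNOTSUPP;
}
```

With a two-member table made of an fsdax PMEM namespace followed by a BRD
ramdisk, the first-member-only check passes in this model while the DM flag
stays clear and the modeled direct access returns -EOPNOTSUPP, which is the
mismatch the flag-based check is meant to close.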