Date: Fri, 23 Sep 2022 11:36:34 +1000
From: Dave Chinner
To: Dan Williams
Cc: Jason Gunthorpe, akpm@linux-foundation.org, Matthew Wilcox, Jan Kara,
    "Darrick J. Wong", Christoph Hellwig, John Hubbard,
    linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev,
    linux-xfs@vger.kernel.org, linux-mm@kvack.org, linux-ext4@vger.kernel.org
Subject: Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
Message-ID: <20220923013634.GY3600936@dread.disaster.area>
References: <166329936739.2786261.14035402420254589047.stgit@dwillia2-xfh.jf.intel.com>
    <632b2b4edd803_66d1a2941a@dwillia2-xfh.jf.intel.com.notmuch>
    <632b8470d34a6_34962946d@dwillia2-xfh.jf.intel.com.notmuch>
    <632ba8eaa5aea_349629422@dwillia2-xfh.jf.intel.com.notmuch>
    <632bc5c4363e9_349629486@dwillia2-xfh.jf.intel.com.notmuch>
    <632cd9a2a023_3496294da@dwillia2-xfh.jf.intel.com.notmuch>
In-Reply-To: <632cd9a2a023_3496294da@dwillia2-xfh.jf.intel.com.notmuch>

On Thu, Sep 22, 2022 at 02:54:42PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Wed, Sep 21, 2022 at 07:17:40PM -0700, Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > > > On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:
> > > >
> > > > > > Indeed, you could reasonably put such a liveness test at the moment
> > > > > > every driver takes a 0 refcount struct page and turns it into a 1
> > > > > > refcount struct page.
> > > > >
> > > > > I could do it with a flag, but the reason to have pgmap->ref managed at
> > > > > the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> > > > > time memunmap_pages() can look at the one counter rather than scanning
> > > > > and rescanning all the pages to see when they go to final idle.
> > > >
> > > > That makes some sense too, but the logical way to do that is to put some
> > > > counter along the page_free() path, and establish a 'make a page not
> > > > free' path that does the other side.
> > > >
> > > > ie it should not be in DAX code, it should be all in common pgmap
> > > > code. The pgmap should never be freed while any page->refcount != 0
> > > > and that should be an intrinsic property of pgmap, not relying on
> > > > external parties.
> > >
> > > I just do not know where to put such intrinsics since there is nothing
> > > today that requires going through the pgmap object to discover the pfn
> > > and 'allocate' the page.
> >
> > I think that is just a new API that wrappers the set refcount = 1,
> > percpu refcount and maybe building appropriate compound pages too.
> >
> > Eg maybe something like:
> >
> >   struct folio *pgmap_alloc_folios(pgmap, start, length)
> >
> > And you get back maximally sized allocated folios with refcount = 1
> > that span the requested range.
> >
> > > In other words make dax_direct_access() the 'allocation' event that pins
> > > the pgmap? I might be speaking a foreign language if you're not familiar
> > > with the relationship of 'struct dax_device' to 'struct dev_pagemap'
> > > instances. This is not the first time I have considered making them one
> > > in the same.
> >
> > I don't know enough about dax, so yes very foreign :)
> >
> > I'm thinking broadly about how to make pgmap usable to all the other
> > drivers in a safe and robust way that makes some kind of logical sense.
>
> I think the API should be pgmap_folio_get() because, at least for DAX,
> the memory is already allocated.
> The 'allocator' for fsdax is the
> filesystem block allocator, and pgmap_folio_get() grants access to a

No, the "allocator" for fsdax is the inode iomap interface, not the
filesystem block allocator. The filesystem block allocator is only
involved in iomapping if we have to allocate a new mapping for a
given file offset.

A better name for this is "arbiter", not allocator. To get an
active mapping of the DAX pages backing a file, we need to ask the
inode iomap subsystem to *map a file offset* and it will return
kaddr and/or pfns for the backing store the file offset maps to.

IOWs, for FSDAX, access to the backing store (i.e. the physical
pages) is arbitrated by the *inode*, not the filesystem allocator or
the dax device. Hence if a subsystem needs to pin the backing store
for some use, it must first ensure that it holds an inode reference
(direct or indirect) for that range of the backing store that spans
the life of the pin. When the pin is done, it can tear down the
mappings it was using and then the inode reference can be released.

This ensures that any racing unlink of the inode will not result in
the backing store being freed from under the application that has a
pin. It will also prevent the inode from being reclaimed while the
pin is held, which could otherwise lead to accesses to stale or
freed in-memory structures. And it will prevent the filesystem from
being unmounted while the application using FSDAX access is still
actively using that functionality, even if it has already closed all
its fds....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
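
The ordering described above (take an inode reference, pin, unpin, then
release the reference) can be modelled in a few lines of userspace C.
This is only a toy sketch, not kernel code: every name in it (toy_inode,
toy_pin, toy_unlink and so on) is hypothetical, and it assumes the
simplified rule that an unlinked inode's backing store is freed only
once the last inode reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct toy_inode {
	int refs;		/* references held on the inode */
	int pins;		/* active pins on its backing store */
	bool unlinked;		/* unlink has been requested */
	bool store_freed;	/* backing store has been released */
};

/* Assumed rule: the store is freed only when unlinked with no references. */
static void toy_maybe_free(struct toy_inode *ino)
{
	if (ino->unlinked && ino->refs == 0)
		ino->store_freed = true;
}

/* Take an inode reference; this must happen before any pin. */
static void toy_inode_get(struct toy_inode *ino)
{
	ino->refs++;
}

/* Drop an inode reference; the last put requires all pins to be gone. */
static void toy_inode_put(struct toy_inode *ino)
{
	assert(ino->refs > 0);
	ino->refs--;
	if (ino->refs == 0)
		assert(ino->pins == 0);
	toy_maybe_free(ino);
}

/* Pin the backing store; refused unless an inode reference is held. */
static bool toy_pin(struct toy_inode *ino)
{
	if (ino->refs == 0 || ino->store_freed)
		return false;
	ino->pins++;
	return true;
}

/* Release a pin once the mappings using it have been torn down. */
static void toy_unpin(struct toy_inode *ino)
{
	assert(ino->pins > 0);
	ino->pins--;
}

/* Racing unlink: marks the inode, frees the store only if unreferenced. */
static void toy_unlink(struct toy_inode *ino)
{
	ino->unlinked = true;
	toy_maybe_free(ino);
}

/* Walk the ordering from the mail: get, pin, racing unlink, unpin, put. */
static bool toy_racing_unlink_is_safe(void)
{
	struct toy_inode ino = {0};
	bool survived;

	toy_inode_get(&ino);
	if (!toy_pin(&ino))
		return false;
	toy_unlink(&ino);
	survived = !ino.store_freed;	/* store must outlive the pin */
	toy_unpin(&ino);
	toy_inode_put(&ino);
	return survived && ino.store_freed;
}
```

In this model toy_racing_unlink_is_safe() returns true because the
unlink cannot free the store while the reference taken before the pin
is still held; the store goes away only at the final toy_inode_put().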