2021-03-18 04:13:49

by Dan Williams

[permalink] [raw]
Subject: [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove

Summary:

A dax_dev can be unbound from its driver at any time. Unbind can not
fail. The driver-core will always trigger ->remove() and the result from
->remove() is ignored. After ->remove() the driver-core proceeds to tear
down context. The filesystem-dax implementation can leave pfns mapped
after ->remove() if it is triggered while the filesystem is mounted.
Security and data-integrity is forfeit if the dax_dev is repurposed for
another security domain (new filesystem or change device modes), or if
the dax_dev is physically replaced. CXL is a hotplug bus that makes
dax_dev physical replace a real world prospect.

All dax_dev pfns must be unmapped at remove. Detect the "remove while
mounted" case and trigger memory_failure() over the entire dax_dev
range.

Details:

The get_user_pages_fast() path expects all synchronization to be handled
by the pattern of checking for pte presence, taking a page reference,
and then validating that the pte was stable over that event. The
gup-fast path for devmap / DAX pages additionally attempts to take/hold
a live reference against the hosting pgmap over the page pin. The
rational for the pgmap reference is to synchronize against a dax-device
unbind / ->remove() event, but that is unnecessary if pte invalidation
is guaranteed in the ->remove() path.

Global dax-device pte invalidation *does* happen when the device is in
raw "device-dax" mode where there is a single shared inode to unmap at
remove, but the filesystem-dax path has a large number of actively
mapped inodes unknown to the driver at ->remove() time. So, that unmap
does not happen today for filesystem-dax. However, as Jason points out,
that unmap / invalidation *needs* to happen not only to cleanup
get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev
->remove() is correlated with actual physical removal / replacement the
implications of allowing a physical pfn to be exchanged without tearing
down old mappings are severe (security and data-integrity).

What is not in this patch set is coordination with the dax_kmem driver
to trigger memory_failure() when the dax_dev is onlined as "System
RAM". The remove_memory() API was built with the assumption that
platform firmware negotiates all removal requests and the OS has a
chance to say "no". This is why dax_kmem today simply leaks
request_region() to burn that physical address space for any other
usage until the next reboot on a manual unbind event if the memory can't
be offlined. However a future to make sure that remove_memory() succeeds
after memory_failure() of the same range seems a better semantic than
permanently burning physical address space.

The topic of remove_memory() failures gets to the question of what
happens to active page references when the inopportune ->remove() event
happens. For transient pins the ->remove() event will wait for for all
pins to be dropped before allowing ->remove() to complete. Since
fileystem-dax forbids longterm pins all those pins are transient.
Device-dax, on the other hand, does allow longterm pins which means that
->remove() will hang unless / until the longterm pin is dropped.
Hopefully an unmap_mapping_range() event is sufficient to get the pin
dropped, but I suspect device-dax might need to trigger memory_failure()
as well to get the longterm pin holder to wake up and get out of the
way (TBD).

Lest we repeat the "longterm-pin-revoke" debate, which highlighted that
RDMA devices do not respond well to having context torn down, keep in
mind that this proposal is to do a best effort recovery of an event that
should not happen (surprise removal) under nominal operation.

---

Dan Williams (3):
mm/memory-failure: Prepare for mass memory_failure()
mm, dax, pmem: Introduce dev_pagemap_failure()
mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path


drivers/dax/super.c | 15 +++++++++++++++
drivers/nvdimm/pmem.c | 10 +++++++++-
drivers/nvdimm/pmem.h | 1 +
include/linux/dax.h | 5 +++++
include/linux/memremap.h | 5 +++++
include/linux/mm.h | 3 +++
mm/gup.c | 38 ++++++++++++++++----------------------
mm/memory-failure.c | 36 +++++++++++++++++++++++-------------
mm/memremap.c | 11 +++++++++++
9 files changed, 88 insertions(+), 36 deletions(-)


2021-03-18 04:14:13

by Dan Williams

[permalink] [raw]
Subject: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

Jason wondered why the get_user_pages_fast() path takes references on a
@pgmap object. The rationale was to protect against accessing a 'struct
page' that might be in the process of being removed by the driver, but
he rightly points out that should be solved the same way all gup-fast
synchronization is solved which is invalidate the mapping and let the
gup slow path do @pgmap synchronization [1].

To achieve that it means that new user mappings need to stop being
created and all existing user mappings need to be invalidated.

For device-dax this is already the case as kill_dax() prevents future
faults from installing a pte, and the single device-dax inode
address_space can be trivially unmapped.

The situation is different for filesystem-dax where device pages could
be mapped by any number of inode address_space instances. An initial
thought was to treat the device removal event like a drop_pagecache_sb()
event that walks superblocks and unmaps all inodes. However, Dave points
out that it is not just the filesystem user-mappings that need to react
to global DAX page-unmap events, it is also filesystem metadata
(proposed DAX metadata access), and other drivers (upstream
DM-writecache) that need to react to this event [2].

The only kernel facility that is meant to globally broadcast the loss of
a page (via corruption or surprise remove) is memory_failure(). The
downside of memory_failure() is that it is a pfn-at-a-time interface.
However, the events that would trigger the need to call memory_failure()
over a full PMEM device should be rare. Remove should always be
coordinated by the administrator with the filesystem. If someone force
removes a device from underneath a mounted filesystem the driver assumes
they have a good reason, or otherwise get to keep the pieces. Since
->remove() callbacks can not fail the only option is to trigger the mass
memory_failure().

The mechanism to determine whether memory_failure() triggers at
pmem->remove() time is whether the associated dax_device has an elevated
reference at @pgmap ->kill() time.

With this in place the get_user_pages_fast() path can drop its
half-measure synchronization with an @pgmap reference.

Link: http://lore.kernel.org/r/[email protected] [1]
Link: http://lore.kernel.org/r/[email protected] [2]
Reported-by: Jason Gunthorpe <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Shiyang Ruan <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/super.c | 15 +++++++++++++++
drivers/nvdimm/pmem.c | 10 +++++++++-
drivers/nvdimm/pmem.h | 1 +
include/linux/dax.h | 5 +++++
include/linux/memremap.h | 5 +++++
include/linux/mm.h | 3 +++
mm/memory-failure.c | 11 +++++++++--
mm/memremap.c | 11 +++++++++++
8 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 5fa6ae9dbc8b..5ebcedf4a68c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev)
}
EXPORT_SYMBOL_GPL(put_dax);

+bool dax_is_idle(struct dax_device *dax_dev)
+{
+ struct inode *inode;
+
+ if (!dax_dev)
+ return true;
+
+ WARN_ONCE(test_bit(DAXDEV_ALIVE, &dax_dev->flags),
+ "dax idle check on live device.\n");
+
+ inode = &dax_dev->inode;
+ return atomic_read(&inode->i_count) < 2;
+}
+EXPORT_SYMBOL_GPL(dax_is_idle);
+
/**
* dax_get_by_host() - temporary lookup mechanism for filesystem-dax
* @host: alternate name for the device registered by a dax driver
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b8a85bfb2e95..e8822c9262ee 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -348,15 +348,21 @@ static void pmem_pagemap_kill(struct dev_pagemap *pgmap)
{
struct request_queue *q =
container_of(pgmap->ref, struct request_queue, q_usage_counter);
+ struct pmem_device *pmem = q->queuedata;

blk_freeze_queue_start(q);
+ kill_dax(pmem->dax_dev);
+ if (!dax_is_idle(pmem->dax_dev)) {
+ dev_warn(pmem->dev,
+ "DAX active at remove, trigger mass memory failure\n");
+ dev_pagemap_failure(pgmap);
+ }
}

static void pmem_release_disk(void *__pmem)
{
struct pmem_device *pmem = __pmem;

- kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);
del_gendisk(pmem->disk);
put_disk(pmem->disk);
@@ -406,6 +412,7 @@ static int pmem_attach_disk(struct device *dev,
devm_namespace_disable(dev, ndns);

dev_set_drvdata(dev, pmem);
+ pmem->dev = dev;
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
fua = nvdimm_has_flush(nd_region);
@@ -467,6 +474,7 @@ static int pmem_attach_disk(struct device *dev,
blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
if (pmem->pfn_flags & PFN_MAP)
blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+ q->queuedata = pmem;

disk = alloc_disk_node(0, nid);
if (!disk)
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 59cfe13ea8a8..1222088a569a 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -23,6 +23,7 @@ struct pmem_device {
struct badblocks bb;
struct dax_device *dax_dev;
struct gendisk *disk;
+ struct device *dev;
struct dev_pagemap pgmap;
};

diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..015f1d9a8232 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -46,6 +46,7 @@ struct dax_device *alloc_dax(void *private, const char *host,
const struct dax_operations *ops, unsigned long flags);
void put_dax(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
+bool dax_is_idle(struct dax_device *dax_dev);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
bool dax_write_cache_enabled(struct dax_device *dax_dev);
bool __dax_synchronous(struct dax_device *dax_dev);
@@ -92,6 +93,10 @@ static inline void put_dax(struct dax_device *dax_dev)
static inline void kill_dax(struct dax_device *dax_dev)
{
}
+static inline bool dax_is_idle(struct dax_device *dax_dev)
+{
+ return true;
+}
static inline void dax_write_cache(struct dax_device *dax_dev, bool wc)
{
}
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f5b464daeeca..d52cdc6c5313 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -137,6 +137,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
struct dev_pagemap *pgmap);
+void dev_pagemap_failure(struct dev_pagemap *pgmap);
bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);

unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
@@ -160,6 +161,10 @@ static inline void devm_memunmap_pages(struct device *dev,
{
}

+static inline void dev_pagemap_failure(struct dev_pagemap *pgmap)
+{
+}
+
static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
struct dev_pagemap *pgmap)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..95f79f457bab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3002,8 +3002,11 @@ enum mf_flags {
MF_ACTION_REQUIRED = 1 << 1,
MF_MUST_KILL = 1 << 2,
MF_SOFT_OFFLINE = 1 << 3,
+ MF_MEM_REMOVE = 1 << 4,
};
extern int memory_failure(unsigned long pfn, int flags);
+extern int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+ struct dev_pagemap *pgmap);
extern void memory_failure_queue(unsigned long pfn, int flags);
extern void memory_failure_queue_kick(int cpu);
extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 43ba4307c526..8f557beb19ee 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1296,8 +1296,8 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
return res;
}

-static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
- struct dev_pagemap *pgmap)
+int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+ struct dev_pagemap *pgmap)
{
struct page *page = pfn_to_page(pfn);
const bool unmap_success = true;
@@ -1377,6 +1377,13 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
unlock:
dax_unlock_page(page, cookie);
out:
+ /*
+ * In the removal case, given unmap is always successful, and
+ * the driver is responsible for the direct map the recovery is
+ * always successful
+ */
+ if (flags & MF_MEM_REMOVE)
+ rc = 0;
action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
return rc;
}
diff --git a/mm/memremap.c b/mm/memremap.c
index 7aa7d6e80ee5..f34da1e14b52 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -165,6 +165,17 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
pgmap_array_delete(range);
}

+void dev_pagemap_failure(struct dev_pagemap *pgmap)
+{
+ unsigned long pfn;
+ int i;
+
+ for (i = 0; i < pgmap->nr_range; i++)
+ for_each_device_pfn(pfn, pgmap, i)
+ memory_failure_dev_pagemap(pfn, MF_MEM_REMOVE, pgmap);
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_failure);
+
void memunmap_pages(struct dev_pagemap *pgmap)
{
unsigned long pfn;

2021-03-18 04:17:09

by Dan Williams

[permalink] [raw]
Subject: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

Now that device-dax and filesystem-dax are guaranteed to unmap all user
mappings of devmap / DAX pages before tearing down the 'struct page'
array, get_user_pages_fast() can rely on its traditional synchronization
method "validate_pte(); get_page(); revalidate_pte()" to catch races with
device shutdown. Specifically the unmap guarantee ensures that gup-fast
either succeeds in taking a page reference (lock-less), or it detects a
need to fall back to the slow path where the device presence can be
revalidated with locks held.

Reported-by: Jason Gunthorpe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Shiyang Ruan <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
mm/gup.c | 38 ++++++++++++++++----------------------
1 file changed, 16 insertions(+), 22 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..dfeb47e4e8d4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1996,9 +1996,8 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
unsigned int flags, struct page **pages, int *nr)
{
- struct dev_pagemap *pgmap = NULL;
- int nr_start = *nr, ret = 0;
pte_t *ptep, *ptem;
+ int ret = 0;

ptem = ptep = pte_offset_map(&pmd, addr);
do {
@@ -2015,16 +2014,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
if (!pte_access_permitted(pte, flags & FOLL_WRITE))
goto pte_unmap;

- if (pte_devmap(pte)) {
- if (unlikely(flags & FOLL_LONGTERM))
- goto pte_unmap;
+ if (pte_devmap(pte) && (flags & FOLL_LONGTERM))
+ goto pte_unmap;

- pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
- if (unlikely(!pgmap)) {
- undo_dev_pagemap(nr, nr_start, flags, pages);
- goto pte_unmap;
- }
- } else if (pte_special(pte))
+ if (pte_special(pte))
goto pte_unmap;

VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
@@ -2063,8 +2056,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
ret = 1;

pte_unmap:
- if (pgmap)
- put_dev_pagemap(pgmap);
pte_unmap(ptem);
return ret;
}
@@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */

#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+
static int __gup_device_huge(unsigned long pfn, unsigned long addr,
unsigned long end, unsigned int flags,
struct page **pages, int *nr)
{
int nr_start = *nr;
- struct dev_pagemap *pgmap = NULL;

do {
- struct page *page = pfn_to_page(pfn);
+ struct page *page;
+
+ /*
+ * Typically pfn_to_page() on a devmap pfn is not safe
+ * without holding a live reference on the hosting
+ * pgmap. In the gup-fast path it is safe because any
+ * races will be resolved by either gup-fast taking a
+ * reference or the shutdown path unmapping the pte to
+ * trigger gup-fast to fall back to the slow path.
+ */
+ page = pfn_to_page(pfn);

- pgmap = get_dev_pagemap(pfn, pgmap);
- if (unlikely(!pgmap)) {
- undo_dev_pagemap(nr, nr_start, flags, pages);
- return 0;
- }
SetPageReferenced(page);
pages[*nr] = page;
if (unlikely(!try_grab_page(page, flags))) {
@@ -2112,8 +2108,6 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
pfn++;
} while (addr += PAGE_SIZE, addr != end);

- if (pgmap)
- put_dev_pagemap(pgmap);
return 1;
}


2021-03-18 04:47:05

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> Jason wondered why the get_user_pages_fast() path takes references on a
> @pgmap object. The rationale was to protect against accessing a 'struct
> page' that might be in the process of being removed by the driver, but
> he rightly points out that should be solved the same way all gup-fast
> synchronization is solved which is invalidate the mapping and let the
> gup slow path do @pgmap synchronization [1].
>
> To achieve that it means that new user mappings need to stop being
> created and all existing user mappings need to be invalidated.
>
> For device-dax this is already the case as kill_dax() prevents future
> faults from installing a pte, and the single device-dax inode
> address_space can be trivially unmapped.
>
> The situation is different for filesystem-dax where device pages could
> be mapped by any number of inode address_space instances. An initial
> thought was to treat the device removal event like a drop_pagecache_sb()
> event that walks superblocks and unmaps all inodes. However, Dave points
> out that it is not just the filesystem user-mappings that need to react
> to global DAX page-unmap events, it is also filesystem metadata
> (proposed DAX metadata access), and other drivers (upstream
> DM-writecache) that need to react to this event [2].
>
> The only kernel facility that is meant to globally broadcast the loss of
> a page (via corruption or surprise remove) is memory_failure(). The
> downside of memory_failure() is that it is a pfn-at-a-time interface.
> However, the events that would trigger the need to call memory_failure()
> over a full PMEM device should be rare. Remove should always be
> coordinated by the administrator with the filesystem. If someone force
> removes a device from underneath a mounted filesystem the driver assumes
> they have a good reason, or otherwise get to keep the pieces. Since
> ->remove() callbacks can not fail the only option is to trigger the mass
> memory_failure().
>
> The mechanism to determine whether memory_failure() triggers at
> pmem->remove() time is whether the associated dax_device has an elevated
> reference at @pgmap ->kill() time.
>
> With this in place the get_user_pages_fast() path can drop its
> half-measure synchronization with an @pgmap reference.
>
> Link: http://lore.kernel.org/r/[email protected] [1]
> Link: http://lore.kernel.org/r/[email protected] [2]
> Reported-by: Jason Gunthorpe <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Shiyang Ruan <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ira Weiny <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Naoya Horiguchi <[email protected]>
> Cc: "Darrick J. Wong" <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> drivers/dax/super.c | 15 +++++++++++++++
> drivers/nvdimm/pmem.c | 10 +++++++++-
> drivers/nvdimm/pmem.h | 1 +
> include/linux/dax.h | 5 +++++
> include/linux/memremap.h | 5 +++++
> include/linux/mm.h | 3 +++
> mm/memory-failure.c | 11 +++++++++--
> mm/memremap.c | 11 +++++++++++
> 8 files changed, 58 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 5fa6ae9dbc8b..5ebcedf4a68c 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev)
> }
> EXPORT_SYMBOL_GPL(put_dax);
>
> +bool dax_is_idle(struct dax_device *dax_dev)
> +{
> + struct inode *inode;
> +
> + if (!dax_dev)
> + return true;
> +
> + WARN_ONCE(test_bit(DAXDEV_ALIVE, &dax_dev->flags),
> + "dax idle check on live device.\n");
> +
> + inode = &dax_dev->inode;
> + return atomic_read(&inode->i_count) < 2;
> +}
> +EXPORT_SYMBOL_GPL(dax_is_idle);
> +
> /**
> * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
> * @host: alternate name for the device registered by a dax driver
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index b8a85bfb2e95..e8822c9262ee 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -348,15 +348,21 @@ static void pmem_pagemap_kill(struct dev_pagemap *pgmap)
> {
> struct request_queue *q =
> container_of(pgmap->ref, struct request_queue, q_usage_counter);
> + struct pmem_device *pmem = q->queuedata;
>
> blk_freeze_queue_start(q);
> + kill_dax(pmem->dax_dev);
> + if (!dax_is_idle(pmem->dax_dev)) {
> + dev_warn(pmem->dev,
> + "DAX active at remove, trigger mass memory failure\n");
> + dev_pagemap_failure(pgmap);
> + }
> }
>
> static void pmem_release_disk(void *__pmem)
> {
> struct pmem_device *pmem = __pmem;
>
> - kill_dax(pmem->dax_dev);
> put_dax(pmem->dax_dev);
> del_gendisk(pmem->disk);
> put_disk(pmem->disk);
> @@ -406,6 +412,7 @@ static int pmem_attach_disk(struct device *dev,
> devm_namespace_disable(dev, ndns);
>
> dev_set_drvdata(dev, pmem);
> + pmem->dev = dev;
> pmem->phys_addr = res->start;
> pmem->size = resource_size(res);
> fua = nvdimm_has_flush(nd_region);
> @@ -467,6 +474,7 @@ static int pmem_attach_disk(struct device *dev,
> blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> if (pmem->pfn_flags & PFN_MAP)
> blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> + q->queuedata = pmem;
>
> disk = alloc_disk_node(0, nid);
> if (!disk)
> diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
> index 59cfe13ea8a8..1222088a569a 100644
> --- a/drivers/nvdimm/pmem.h
> +++ b/drivers/nvdimm/pmem.h
> @@ -23,6 +23,7 @@ struct pmem_device {
> struct badblocks bb;
> struct dax_device *dax_dev;
> struct gendisk *disk;
> + struct device *dev;
> struct dev_pagemap pgmap;
> };
>
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b52f084aa643..015f1d9a8232 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -46,6 +46,7 @@ struct dax_device *alloc_dax(void *private, const char *host,
> const struct dax_operations *ops, unsigned long flags);
> void put_dax(struct dax_device *dax_dev);
> void kill_dax(struct dax_device *dax_dev);
> +bool dax_is_idle(struct dax_device *dax_dev);
> void dax_write_cache(struct dax_device *dax_dev, bool wc);
> bool dax_write_cache_enabled(struct dax_device *dax_dev);
> bool __dax_synchronous(struct dax_device *dax_dev);
> @@ -92,6 +93,10 @@ static inline void put_dax(struct dax_device *dax_dev)
> static inline void kill_dax(struct dax_device *dax_dev)
> {
> }
> +static inline bool dax_is_idle(struct dax_device *dax_dev)
> +{
> + return true;
> +}
> static inline void dax_write_cache(struct dax_device *dax_dev, bool wc)
> {
> }
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index f5b464daeeca..d52cdc6c5313 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -137,6 +137,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
> struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> struct dev_pagemap *pgmap);
> +void dev_pagemap_failure(struct dev_pagemap *pgmap);
> bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>
> unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
> @@ -160,6 +161,10 @@ static inline void devm_memunmap_pages(struct device *dev,
> {
> }
>
> +static inline void dev_pagemap_failure(struct dev_pagemap *pgmap)
> +{
> +}
> +
> static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> struct dev_pagemap *pgmap)
> {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77e64e3eac80..95f79f457bab 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3002,8 +3002,11 @@ enum mf_flags {
> MF_ACTION_REQUIRED = 1 << 1,
> MF_MUST_KILL = 1 << 2,
> MF_SOFT_OFFLINE = 1 << 3,
> + MF_MEM_REMOVE = 1 << 4,
> };
> extern int memory_failure(unsigned long pfn, int flags);
> +extern int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> + struct dev_pagemap *pgmap);
> extern void memory_failure_queue(unsigned long pfn, int flags);
> extern void memory_failure_queue_kick(int cpu);
> extern int unpoison_memory(unsigned long pfn);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 43ba4307c526..8f557beb19ee 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1296,8 +1296,8 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
> return res;
> }
>
> -static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> - struct dev_pagemap *pgmap)
> +int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> + struct dev_pagemap *pgmap)
> {
> struct page *page = pfn_to_page(pfn);
> const bool unmap_success = true;
> @@ -1377,6 +1377,13 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> unlock:
> dax_unlock_page(page, cookie);
> out:
> + /*
> + * In the removal case, given unmap is always successful, and
> + * the driver is responsible for the direct map the recovery is
> + * always successful
> + */
> + if (flags & MF_MEM_REMOVE)
> + rc = 0;
> action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
> return rc;
> }
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 7aa7d6e80ee5..f34da1e14b52 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -165,6 +165,17 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
> pgmap_array_delete(range);
> }
>
> +void dev_pagemap_failure(struct dev_pagemap *pgmap)
> +{
> + unsigned long pfn;
> + int i;
> +
> + for (i = 0; i < pgmap->nr_range; i++)
> + for_each_device_pfn(pfn, pgmap, i)
> + memory_failure_dev_pagemap(pfn, MF_MEM_REMOVE, pgmap);

So my 6TB memory chassis falls off the desk and we have to call
memory_failure_dev_pagemap for 1.6 billion PFNs?

Honestly if you're going offline the /entire/ device then just tell us
sb->memory_failure(dev, 0, -1ULL) and we'll just kill everything all at
once. That was where I was trying to push Shiyang's patchset, and I had
nearly succeeded when you NAKd the whole thing.

In the meantime, I estimate that there are ~45 months worth of deferred
XFS patch review that I can make progress on, so that's where I'm going
to focus.

--D

> +}
> +EXPORT_SYMBOL_GPL(dev_pagemap_failure);
> +
> void memunmap_pages(struct dev_pagemap *pgmap)
> {
> unsigned long pfn;
>

2021-03-18 05:02:11

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> Jason wondered why the get_user_pages_fast() path takes references on a
> @pgmap object. The rationale was to protect against accessing a 'struct
> page' that might be in the process of being removed by the driver, but
> he rightly points out that should be solved the same way all gup-fast
> synchronization is solved which is invalidate the mapping and let the
> gup slow path do @pgmap synchronization [1].
>
> To achieve that it means that new user mappings need to stop being
> created and all existing user mappings need to be invalidated.
>
> For device-dax this is already the case as kill_dax() prevents future
> faults from installing a pte, and the single device-dax inode
> address_space can be trivially unmapped.
>
> The situation is different for filesystem-dax where device pages could
> be mapped by any number of inode address_space instances. An initial
> thought was to treat the device removal event like a drop_pagecache_sb()
> event that walks superblocks and unmaps all inodes. However, Dave points
> out that it is not just the filesystem user-mappings that need to react
> to global DAX page-unmap events, it is also filesystem metadata
> (proposed DAX metadata access), and other drivers (upstream
> DM-writecache) that need to react to this event [2].
>
> The only kernel facility that is meant to globally broadcast the loss of
> a page (via corruption or surprise remove) is memory_failure(). The
> downside of memory_failure() is that it is a pfn-at-a-time interface.
> However, the events that would trigger the need to call memory_failure()
> over a full PMEM device should be rare.

This is a highly suboptimal design. Filesystems only need a single
callout to trigger a shutdown that unmaps every active mapping in
the filesystem - we do not need a page-by-page error notification
which results in 250 million hwposion callouts per TB of pmem to do
this.

Indeed, the moment we get the first hwpoison from this patch, we'll
map it to the primary XFS superblock and we'd almost certainly
consider losing the storage behind that block to be a shut down
trigger. During the shutdown, the filesystem should unmap all the
active mappings (we already need to add this to shutdown on DAX
regardless of this device remove issue) and so we really don't need
a page-by-page notification of badness.

AFAICT, it's going to take minutes, maybe hours for do the page-by-page
iteration to hwposion every page. It's going to take a few seconds
for the filesystem shutdown to run a device wide invalidation.

SO, yeah, I think this should simply be a single ranged call to the
filesystem like:

->memory_failure(dev, 0, -1ULL)

to tell the filesystem that the entire backing device has gone away,
and leave the filesystem to handle failure entirely at the
filesystem level.

-Dave.
--
Dave Chinner
[email protected]

2021-03-18 10:03:50

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

On 3/18/21 4:08 AM, Dan Williams wrote:
> Now that device-dax and filesystem-dax are guaranteed to unmap all user
> mappings of devmap / DAX pages before tearing down the 'struct page'
> array, get_user_pages_fast() can rely on its traditional synchronization
> method "validate_pte(); get_page(); revalidate_pte()" to catch races with
> device shutdown. Specifically the unmap guarantee ensures that gup-fast
> either succeeds in taking a page reference (lock-less), or it detects a
> need to fall back to the slow path where the device presence can be
> revalidated with locks held.

[...]

> @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>
> #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +
> static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> unsigned long end, unsigned int flags,
> struct page **pages, int *nr)
> {
> int nr_start = *nr;
> - struct dev_pagemap *pgmap = NULL;
>
> do {
> - struct page *page = pfn_to_page(pfn);
> + struct page *page;
> +
> + /*
> + * Typically pfn_to_page() on a devmap pfn is not safe
> + * without holding a live reference on the hosting
> + * pgmap. In the gup-fast path it is safe because any
> + * races will be resolved by either gup-fast taking a
> + * reference or the shutdown path unmapping the pte to
> + * trigger gup-fast to fall back to the slow path.
> + */
> + page = pfn_to_page(pfn);
>
> - pgmap = get_dev_pagemap(pfn, pgmap);
> - if (unlikely(!pgmap)) {
> - undo_dev_pagemap(nr, nr_start, flags, pages);
> - return 0;
> - }
> SetPageReferenced(page);
> pages[*nr] = page;
> if (unlikely(!try_grab_page(page, flags))) {

So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
try_grab_page() for checking pgmap type to see if we are in a device-dax
longterm pin?

Joao

[0] https://lore.kernel.org/linux-mm/[email protected]/

2021-03-18 17:04:56

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

On Thu, Mar 18, 2021 at 3:02 AM Joao Martins <[email protected]> wrote:
>
> On 3/18/21 4:08 AM, Dan Williams wrote:
> > Now that device-dax and filesystem-dax are guaranteed to unmap all user
> > mappings of devmap / DAX pages before tearing down the 'struct page'
> > array, get_user_pages_fast() can rely on its traditional synchronization
> > method "validate_pte(); get_page(); revalidate_pte()" to catch races with
> > device shutdown. Specifically the unmap guarantee ensures that gup-fast
> > either succeeds in taking a page reference (lock-less), or it detects a
> > need to fall back to the slow path where the device presence can be
> > revalidated with locks held.
>
> [...]
>
> > @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
> >
> > #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > +
> > static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> > unsigned long end, unsigned int flags,
> > struct page **pages, int *nr)
> > {
> > int nr_start = *nr;
> > - struct dev_pagemap *pgmap = NULL;
> >
> > do {
> > - struct page *page = pfn_to_page(pfn);
> > + struct page *page;
> > +
> > + /*
> > + * Typically pfn_to_page() on a devmap pfn is not safe
> > + * without holding a live reference on the hosting
> > + * pgmap. In the gup-fast path it is safe because any
> > + * races will be resolved by either gup-fast taking a
> > + * reference or the shutdown path unmapping the pte to
> > + * trigger gup-fast to fall back to the slow path.
> > + */
> > + page = pfn_to_page(pfn);
> >
> > - pgmap = get_dev_pagemap(pfn, pgmap);
> > - if (unlikely(!pgmap)) {
> > - undo_dev_pagemap(nr, nr_start, flags, pages);
> > - return 0;
> > - }
> > SetPageReferenced(page);
> > pages[*nr] = page;
> > if (unlikely(!try_grab_page(page, flags))) {
>
> So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
> try_grab_page() for checking pgmap type to see if we are in a device-dax
> longterm pin?
>

Yes. I still need to answer the question of whether mapping
invalidation triggers longterm pin holders to relinquish their hold,
but that's a problem regardless of whether gup-fast is supported or
not.

2021-03-18 19:25:38

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

On Wed, Mar 17, 2021 at 9:58 PM Dave Chinner <[email protected]> wrote:
>
> On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > Jason wondered why the get_user_pages_fast() path takes references on a
> > @pgmap object. The rationale was to protect against accessing a 'struct
> > page' that might be in the process of being removed by the driver, but
> > he rightly points out that should be solved the same way all gup-fast
> > synchronization is solved which is invalidate the mapping and let the
> > gup slow path do @pgmap synchronization [1].
> >
> > To achieve that it means that new user mappings need to stop being
> > created and all existing user mappings need to be invalidated.
> >
> > For device-dax this is already the case as kill_dax() prevents future
> > faults from installing a pte, and the single device-dax inode
> > address_space can be trivially unmapped.
> >
> > The situation is different for filesystem-dax where device pages could
> > be mapped by any number of inode address_space instances. An initial
> > thought was to treat the device removal event like a drop_pagecache_sb()
> > event that walks superblocks and unmaps all inodes. However, Dave points
> > out that it is not just the filesystem user-mappings that need to react
> > to global DAX page-unmap events, it is also filesystem metadata
> > (proposed DAX metadata access), and other drivers (upstream
> > DM-writecache) that need to react to this event [2].
> >
> > The only kernel facility that is meant to globally broadcast the loss of
> > a page (via corruption or surprise remove) is memory_failure(). The
> > downside of memory_failure() is that it is a pfn-at-a-time interface.
> > However, the events that would trigger the need to call memory_failure()
> > over a full PMEM device should be rare.
>
> This is a highly suboptimal design. Filesystems only need a single
> callout to trigger a shutdown that unmaps every active mapping in
> the filesystem - we do not need a page-by-page error notification
> which results in 250 million hwposion callouts per TB of pmem to do
> this.
>
> Indeed, the moment we get the first hwpoison from this patch, we'll
> map it to the primary XFS superblock and we'd almost certainly
> consider losing the storage behind that block to be a shut down
> trigger. During the shutdown, the filesystem should unmap all the
> active mappings (we already need to add this to shutdown on DAX
> regardless of this device remove issue) and so we really don't need
> a page-by-page notification of badness.

XFS doesn't, but what about device-mapper and other agents? Even if
the driver had a callback up the stack memory_failure() still needs to
be able to trigger failures down the stack for CPU consumed poison.

>
> AFAICT, it's going to take minutes, maybe hours for do the page-by-page
> iteration to hwposion every page. It's going to take a few seconds
> for the filesystem shutdown to run a device wide invalidation.
>
> SO, yeah, I think this should simply be a single ranged call to the
> filesystem like:
>
> ->memory_failure(dev, 0, -1ULL)
>
> to tell the filesystem that the entire backing device has gone away,
> and leave the filesystem to handle failure entirely at the
> filesystem level.

So I went with memory_failure() after our discussion of all the other
agents in the system that might care about these pfns going offline
and relying on memory_failure() to route down to each of those. I.e.
the "reuse the drop_pagecache_sb() model" idea was indeed
insufficient. Now I'm trying to reconcile the fact that platform
poison handling will hit memory_failure() first and may not
immediately reach the driver, if ever (see the perennially awkward
firmware-first-mode error handling: ghes_handle_memory_failure()) . So
even if the ->memory_failure(dev...) up call exists there is no
guarantee it can get called for all poison before the memory_failure()
down call happens. Which means regardless of whether
->memory_failure(dev...) exists memory_failure() needs to be able to
do the right thing.

Combine that with the fact that new buses like CXL might be configured
in "poison on decode error" mode which means that a memory_failure()
storm can happen regardless of whether the driver initiates it
programatically.

How about a mechanism to optionally let a filesystem take over memory
failure handling for a range of pfns that the memory_failure() can
consult to fail ranges at a time rather than one by one? So a new
'struct dax_operations' op (void) (*memory_failure_register(struct
dax_device *, void *data). Where any agent that claims a dax_dev can
register to take over memory_failure() handling for any event that
happens in that range. This would be routed through device-mapper like
any other 'struct dax_operations' op. I think that meets your
requirement to get notifications of all the events you want to handle,
but still allows memory_failure() to be the last resort for everything
that has not opted into this error handling.

2021-03-18 19:30:20

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

On Wed, Mar 17, 2021 at 9:45 PM Darrick J. Wong <[email protected]> wrote:
>
> On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > Jason wondered why the get_user_pages_fast() path takes references on a
> > @pgmap object. The rationale was to protect against accessing a 'struct
> > page' that might be in the process of being removed by the driver, but
> > he rightly points out that should be solved the same way all gup-fast
> > synchronization is solved which is invalidate the mapping and let the
> > gup slow path do @pgmap synchronization [1].
> >
> > To achieve that it means that new user mappings need to stop being
> > created and all existing user mappings need to be invalidated.
> >
> > For device-dax this is already the case as kill_dax() prevents future
> > faults from installing a pte, and the single device-dax inode
> > address_space can be trivially unmapped.
> >
> > The situation is different for filesystem-dax where device pages could
> > be mapped by any number of inode address_space instances. An initial
> > thought was to treat the device removal event like a drop_pagecache_sb()
> > event that walks superblocks and unmaps all inodes. However, Dave points
> > out that it is not just the filesystem user-mappings that need to react
> > to global DAX page-unmap events, it is also filesystem metadata
> > (proposed DAX metadata access), and other drivers (upstream
> > DM-writecache) that need to react to this event [2].
> >
> > The only kernel facility that is meant to globally broadcast the loss of
> > a page (via corruption or surprise remove) is memory_failure(). The
> > downside of memory_failure() is that it is a pfn-at-a-time interface.
> > However, the events that would trigger the need to call memory_failure()
> > over a full PMEM device should be rare. Remove should always be
> > coordinated by the administrator with the filesystem. If someone force
> > removes a device from underneath a mounted filesystem the driver assumes
> > they have a good reason, or otherwise get to keep the pieces. Since
> > ->remove() callbacks can not fail the only option is to trigger the mass
> > memory_failure().
> >
> > The mechanism to determine whether memory_failure() triggers at
> > pmem->remove() time is whether the associated dax_device has an elevated
> > reference at @pgmap ->kill() time.
> >
> > With this in place the get_user_pages_fast() path can drop its
> > half-measure synchronization with an @pgmap reference.
> >
> > Link: http://lore.kernel.org/r/[email protected] [1]
> > Link: http://lore.kernel.org/r/[email protected] [2]
> > Reported-by: Jason Gunthorpe <[email protected]>
> > Cc: Dave Chinner <[email protected]>
> > Cc: Christoph Hellwig <[email protected]>
> > Cc: Shiyang Ruan <[email protected]>
> > Cc: Vishal Verma <[email protected]>
> > Cc: Dave Jiang <[email protected]>
> > Cc: Ira Weiny <[email protected]>
> > Cc: Matthew Wilcox <[email protected]>
> > Cc: Jan Kara <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Naoya Horiguchi <[email protected]>
> > Cc: "Darrick J. Wong" <[email protected]>
> > Signed-off-by: Dan Williams <[email protected]>
> > ---
> > drivers/dax/super.c | 15 +++++++++++++++
> > drivers/nvdimm/pmem.c | 10 +++++++++-
> > drivers/nvdimm/pmem.h | 1 +
> > include/linux/dax.h | 5 +++++
> > include/linux/memremap.h | 5 +++++
> > include/linux/mm.h | 3 +++
> > mm/memory-failure.c | 11 +++++++++--
> > mm/memremap.c | 11 +++++++++++
> > 8 files changed, 58 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 5fa6ae9dbc8b..5ebcedf4a68c 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev)
> > }
> > EXPORT_SYMBOL_GPL(put_dax);
> >
> > +bool dax_is_idle(struct dax_device *dax_dev)
> > +{
> > + struct inode *inode;
> > +
> > + if (!dax_dev)
> > + return true;
> > +
> > + WARN_ONCE(test_bit(DAXDEV_ALIVE, &dax_dev->flags),
> > + "dax idle check on live device.\n");
> > +
> > + inode = &dax_dev->inode;
> > + return atomic_read(&inode->i_count) < 2;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_is_idle);
> > +
> > /**
> > * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
> > * @host: alternate name for the device registered by a dax driver
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index b8a85bfb2e95..e8822c9262ee 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -348,15 +348,21 @@ static void pmem_pagemap_kill(struct dev_pagemap *pgmap)
> > {
> > struct request_queue *q =
> > container_of(pgmap->ref, struct request_queue, q_usage_counter);
> > + struct pmem_device *pmem = q->queuedata;
> >
> > blk_freeze_queue_start(q);
> > + kill_dax(pmem->dax_dev);
> > + if (!dax_is_idle(pmem->dax_dev)) {
> > + dev_warn(pmem->dev,
> > + "DAX active at remove, trigger mass memory failure\n");
> > + dev_pagemap_failure(pgmap);
> > + }
> > }
> >
> > static void pmem_release_disk(void *__pmem)
> > {
> > struct pmem_device *pmem = __pmem;
> >
> > - kill_dax(pmem->dax_dev);
> > put_dax(pmem->dax_dev);
> > del_gendisk(pmem->disk);
> > put_disk(pmem->disk);
> > @@ -406,6 +412,7 @@ static int pmem_attach_disk(struct device *dev,
> > devm_namespace_disable(dev, ndns);
> >
> > dev_set_drvdata(dev, pmem);
> > + pmem->dev = dev;
> > pmem->phys_addr = res->start;
> > pmem->size = resource_size(res);
> > fua = nvdimm_has_flush(nd_region);
> > @@ -467,6 +474,7 @@ static int pmem_attach_disk(struct device *dev,
> > blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > if (pmem->pfn_flags & PFN_MAP)
> > blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > + q->queuedata = pmem;
> >
> > disk = alloc_disk_node(0, nid);
> > if (!disk)
> > diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
> > index 59cfe13ea8a8..1222088a569a 100644
> > --- a/drivers/nvdimm/pmem.h
> > +++ b/drivers/nvdimm/pmem.h
> > @@ -23,6 +23,7 @@ struct pmem_device {
> > struct badblocks bb;
> > struct dax_device *dax_dev;
> > struct gendisk *disk;
> > + struct device *dev;
> > struct dev_pagemap pgmap;
> > };
> >
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index b52f084aa643..015f1d9a8232 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -46,6 +46,7 @@ struct dax_device *alloc_dax(void *private, const char *host,
> > const struct dax_operations *ops, unsigned long flags);
> > void put_dax(struct dax_device *dax_dev);
> > void kill_dax(struct dax_device *dax_dev);
> > +bool dax_is_idle(struct dax_device *dax_dev);
> > void dax_write_cache(struct dax_device *dax_dev, bool wc);
> > bool dax_write_cache_enabled(struct dax_device *dax_dev);
> > bool __dax_synchronous(struct dax_device *dax_dev);
> > @@ -92,6 +93,10 @@ static inline void put_dax(struct dax_device *dax_dev)
> > static inline void kill_dax(struct dax_device *dax_dev)
> > {
> > }
> > +static inline bool dax_is_idle(struct dax_device *dax_dev)
> > +{
> > + return true;
> > +}
> > static inline void dax_write_cache(struct dax_device *dax_dev, bool wc)
> > {
> > }
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index f5b464daeeca..d52cdc6c5313 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -137,6 +137,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> > void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
> > struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> > struct dev_pagemap *pgmap);
> > +void dev_pagemap_failure(struct dev_pagemap *pgmap);
> > bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
> >
> > unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
> > @@ -160,6 +161,10 @@ static inline void devm_memunmap_pages(struct device *dev,
> > {
> > }
> >
> > +static inline void dev_pagemap_failure(struct dev_pagemap *pgmap)
> > +{
> > +}
> > +
> > static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> > struct dev_pagemap *pgmap)
> > {
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 77e64e3eac80..95f79f457bab 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3002,8 +3002,11 @@ enum mf_flags {
> > MF_ACTION_REQUIRED = 1 << 1,
> > MF_MUST_KILL = 1 << 2,
> > MF_SOFT_OFFLINE = 1 << 3,
> > + MF_MEM_REMOVE = 1 << 4,
> > };
> > extern int memory_failure(unsigned long pfn, int flags);
> > +extern int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> > + struct dev_pagemap *pgmap);
> > extern void memory_failure_queue(unsigned long pfn, int flags);
> > extern void memory_failure_queue_kick(int cpu);
> > extern int unpoison_memory(unsigned long pfn);
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 43ba4307c526..8f557beb19ee 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1296,8 +1296,8 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
> > return res;
> > }
> >
> > -static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> > - struct dev_pagemap *pgmap)
> > +int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> > + struct dev_pagemap *pgmap)
> > {
> > struct page *page = pfn_to_page(pfn);
> > const bool unmap_success = true;
> > @@ -1377,6 +1377,13 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> > unlock:
> > dax_unlock_page(page, cookie);
> > out:
> > + /*
> > + * In the removal case, given unmap is always successful, and
> > + * the driver is responsible for the direct map the recovery is
> > + * always successful
> > + */
> > + if (flags & MF_MEM_REMOVE)
> > + rc = 0;
> > action_result(pfn, MF_MSG_DAX, rc ? MF_FAILED : MF_RECOVERED);
> > return rc;
> > }
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 7aa7d6e80ee5..f34da1e14b52 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -165,6 +165,17 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
> > pgmap_array_delete(range);
> > }
> >
> > +void dev_pagemap_failure(struct dev_pagemap *pgmap)
> > +{
> > + unsigned long pfn;
> > + int i;
> > +
> > + for (i = 0; i < pgmap->nr_range; i++)
> > + for_each_device_pfn(pfn, pgmap, i)
> > + memory_failure_dev_pagemap(pfn, MF_MEM_REMOVE, pgmap);
>
> So my 6TB memory chassis falls off the desk and we have to call
> memory_failure_dev_pagemap for 1.6 billion PFNs?
>

In some respects it's a "doctor it hurts when I remove my devices
without umounting the filesytem first" type situation.

> Honestly if you're going offline the /entire/ device then just tell us
> sb->memory_failure(dev, 0, -1ULL) and we'll just kill everything all at
> once. That was where I was trying to push Shiyang's patchset, and I had
> nearly succeeded when you NAKd the whole thing.

I think we're closer. As I said to Dave in the other thread what if we
just flipped this around to allow the FS to takeover memory_failure()
rather than hoping that the device-driver can do the right up-calls at
the right time.

>
> In the meantime, I estimate that there are ~45 months worth of deferred
> XFS patch review that I can make progress on, so that's where I'm going
> to focus.

I do feel bad for not engaging sooner on this. Usually there's a
forcing function like Plumbers, or LSF to clear out the backlog.

2021-03-20 01:50:05

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

On Thu, Mar 18, 2021 at 12:20:35PM -0700, Dan Williams wrote:
> On Wed, Mar 17, 2021 at 9:58 PM Dave Chinner <[email protected]> wrote:
> >
> > On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > > Jason wondered why the get_user_pages_fast() path takes references on a
> > > @pgmap object. The rationale was to protect against accessing a 'struct
> > > page' that might be in the process of being removed by the driver, but
> > > he rightly points out that should be solved the same way all gup-fast
> > > synchronization is solved which is invalidate the mapping and let the
> > > gup slow path do @pgmap synchronization [1].
> > >
> > > To achieve that it means that new user mappings need to stop being
> > > created and all existing user mappings need to be invalidated.
> > >
> > > For device-dax this is already the case as kill_dax() prevents future
> > > faults from installing a pte, and the single device-dax inode
> > > address_space can be trivially unmapped.
> > >
> > > The situation is different for filesystem-dax where device pages could
> > > be mapped by any number of inode address_space instances. An initial
> > > thought was to treat the device removal event like a drop_pagecache_sb()
> > > event that walks superblocks and unmaps all inodes. However, Dave points
> > > out that it is not just the filesystem user-mappings that need to react
> > > to global DAX page-unmap events, it is also filesystem metadata
> > > (proposed DAX metadata access), and other drivers (upstream
> > > DM-writecache) that need to react to this event [2].
> > >
> > > The only kernel facility that is meant to globally broadcast the loss of
> > > a page (via corruption or surprise remove) is memory_failure(). The
> > > downside of memory_failure() is that it is a pfn-at-a-time interface.
> > > However, the events that would trigger the need to call memory_failure()
> > > over a full PMEM device should be rare.
> >
> > This is a highly suboptimal design. Filesystems only need a single
> > callout to trigger a shutdown that unmaps every active mapping in
> > the filesystem - we do not need a page-by-page error notification
> > which results in 250 million hwposion callouts per TB of pmem to do
> > this.
> >
> > Indeed, the moment we get the first hwpoison from this patch, we'll
> > map it to the primary XFS superblock and we'd almost certainly
> > consider losing the storage behind that block to be a shut down
> > trigger. During the shutdown, the filesystem should unmap all the
> > active mappings (we already need to add this to shutdown on DAX
> > regardless of this device remove issue) and so we really don't need
> > a page-by-page notification of badness.
>
> XFS doesn't, but what about device-mapper and other agents? Even if
> the driver had a callback up the stack memory_failure() still needs to
> be able to trigger failures down the stack for CPU consumed poison.

If the device is gone, then they don't need page by page
notifucation, either. Tell them the entire device is gone so they
can do what they need (like pass it up to the filesystem as ranges
of badness!).

> > AFAICT, it's going to take minutes, maybe hours for do the page-by-page
> > iteration to hwposion every page. It's going to take a few seconds
> > for the filesystem shutdown to run a device wide invalidation.
> >
> > SO, yeah, I think this should simply be a single ranged call to the
> > filesystem like:
> >
> > ->memory_failure(dev, 0, -1ULL)
> >
> > to tell the filesystem that the entire backing device has gone away,
> > and leave the filesystem to handle failure entirely at the
> > filesystem level.
>
> So I went with memory_failure() after our discussion of all the other
> agents in the system that might care about these pfns going offline
> and relying on memory_failure() to route down to each of those. I.e.
> the "reuse the drop_pagecache_sb() model" idea was indeed
> insufficient.

Using drop_pagecache_sb is insufficient because filesystems have
more than just inode indexed caches that pmem failure may affect.
This is not an argument against a "knock everything down at once"
notification model, just that drop_pagecache_sb() is ...
insufficient to do what we need...

> Now I'm trying to reconcile the fact that platform
> poison handling will hit memory_failure() first and may not
> immediately reach the driver, if ever (see the perennially awkward
> firmware-first-mode error handling: ghes_handle_memory_failure()) . So
> even if the ->memory_failure(dev...) up call exists there is no
> guarantee it can get called for all poison before the memory_failure()
> down call happens. Which means regardless of whether
> ->memory_failure(dev...) exists memory_failure() needs to be able to
> do the right thing.

I don't see how a poor implementation of memory_failure in a driver
or hardware is even remotely relevant to the interface used to
notify the filesystem of a media or device failure. It sounds like
you are trying to use memory_failure() for something it was never
intended to support and that there's a bunch of driver and
infrastructure work needed to make it work how you want it to work.
And even then it may not work the way we want it to work....

> Combine that with the fact that new buses like CXL might be configured
> in "poison on decode error" mode which means that a memory_failure()
> storm can happen regardless of whether the driver initiates it
> programatically.

Straw man argument.

"We can't make this interface a ranged notification because the
hardware might only be able to do page-by-page notification."

You can do page-by-page notification with a range based interface.
We are talking about how to efficiently and reliably inform the
filesystem that a range of a device is no longer accessible and so
it needs to revoke all mappings over that range of it's address
space. That does not need to be a single page at a time interface.

If your hardware is configured to do stupid things, then that is not
the fault of the software interface used to communicate faults up
the stack, nor is it something that the notfication interface should
try to fix or mitigate.....

> How about a mechanism to optionally let a filesystem take over memory
> failure handling for a range of pfns that the memory_failure() can
> consult to fail ranges at a time rather than one by one? So a new
> 'struct dax_operations' op (void) (*memory_failure_register(struct
> dax_device *, void *data). Where any agent that claims a dax_dev can
> register to take over memory_failure() handling for any event that
> happens in that range. This would be routed through device-mapper like
> any other 'struct dax_operations' op. I think that meets your
> requirement to get notifications of all the events you want to handle,
> but still allows memory_failure() to be the last resort for everything
> that has not opted into this error handling.

Which is basically the same as the proposed ->corrupted_range stack,
except it doesn't map the pfns back to LBA addresses the filesystem
needs to make sense of the failure.

fs-dax filesystems have no clue what pfns are, or how to translate
them to LBAs in their block device address space that the map
everything to. The fs-dax infrastructure asks the filesystem for
bdev/sector based mappings, and internally converts them to pfns by
a combination of bdev and daxdev callouts. Hence fs-dax filesystems
never see nor interpret pfns at all. Nor do they have the
capability to convert a PFN to a LBA address. And neither the
underlying block device nor the associated DAX device provide a
method for doing this reverse mapping translation.

So if we have an interface that hands a {daxdev,PFN,len} tuple to
the filesystem, exactly what is the filesystem supposed to do with
it? How do we turn that back into a {bdev,sector,len} tuple so we
can do reverse mapping lookups to find the filesystem objects that
allocated within the notified range?

I'll point out again that these address space translations were
something that the ->corrupted_range callbacks handled directly - no
layer in the stack was handed a range that it didn't know how to map
to it's own internal structures. By the time it got to the
filesystem, it was a {bdev,sector,len} tuple, and the filesystem
could feed that directly to it's reverse mapping lookups....

Maybe I'm missing something magical about ->memory_failure that does
all this translation for us, but I don't see it in this patchset. I
just don't see how this proposed interface is a usable at the
filesystem level as it stands.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-03-20 02:41:02

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

On Fri, Mar 19, 2021 at 6:47 PM Dave Chinner <[email protected]> wrote:
[..]
> > Now I'm trying to reconcile the fact that platform
> > poison handling will hit memory_failure() first and may not
> > immediately reach the driver, if ever (see the perennially awkward
> > firmware-first-mode error handling: ghes_handle_memory_failure()) . So
> > even if the ->memory_failure(dev...) up call exists there is no
> > guarantee it can get called for all poison before the memory_failure()
> > down call happens. Which means regardless of whether
> > ->memory_failure(dev...) exists memory_failure() needs to be able to
> > do the right thing.
>
> I don't see how a poor implementation of memory_failure in a driver
> or hardware is even remotely relevant to the interface used to
> notify the filesystem of a media or device failure. It sounds like
> you are trying to use memory_failure() for something it was never
> intended to support and that there's a bunch of driver and
> infrastructure work needed to make it work how you want it to work.
> And even then it may not work the way we want it to work....
>
> > Combine that with the fact that new buses like CXL might be configured
> > in "poison on decode error" mode which means that a memory_failure()
> > storm can happen regardless of whether the driver initiates it
> > programatically.
>
> Straw man argument.
>
> "We can't make this interface a ranged notification because the
> hardware might only be able to do page-by-page notification."

No, it's "we can't make this interface notify the filesytem that
sectors have failed before the memory_failure() (ranged or not) has
communicated that pfns have failed."

memory_failure() today is the first and sometimes only interface that
learns of pfn failures.

>
> You can do page-by-page notification with a range based interface.
> We are talking about how to efficiently and reliably inform the
> filesystem that a range of a device is no longer accessible and so
> it needs to revoke all mappings over that range of it's address
> space. That does not need to be a single page at a time interface.
>
> If your hardware is configured to do stupid things, then that is not
> the fault of the software interface used to communicate faults up
> the stack, nor is it something that the notfication interface should
> try to fix or mitigate.....
>
> > How about a mechanism to optionally let a filesystem take over memory
> > failure handling for a range of pfns that the memory_failure() can
> > consult to fail ranges at a time rather than one by one? So a new
> > 'struct dax_operations' op (void) (*memory_failure_register(struct
> > dax_device *, void *data). Where any agent that claims a dax_dev can
> > register to take over memory_failure() handling for any event that
> > happens in that range. This would be routed through device-mapper like
> > any other 'struct dax_operations' op. I think that meets your
> > requirement to get notifications of all the events you want to handle,
> > but still allows memory_failure() to be the last resort for everything
> > that has not opted into this error handling.
>
> Which is basically the same as the proposed ->corrupted_range stack,
> except it doesn't map the pfns back to LBA addresses the filesystem
> needs to make sense of the failure.
>
> fs-dax filesystems have no clue what pfns are, or how to translate
> them to LBAs in their block device address space that the map
> everything to. The fs-dax infrastructure asks the filesystem for
> bdev/sector based mappings, and internally converts them to pfns by
> a combination of bdev and daxdev callouts. Hence fs-dax filesystems
> never see nor interpret pfns at all. Nor do they have the
> capability to convert a PFN to a LBA address. And neither the
> underlying block device nor the associated DAX device provide a
> method for doing this reverse mapping translation.

True.

>
> So if we have an interface that hands a {daxdev,PFN,len} tuple to
> the filesystem, exactly what is the filesystem supposed to do with
> it? How do we turn that back into a {bdev,sector,len} tuple so we
> can do reverse mapping lookups to find the filesystem objects that
> allocated within the notified range?
>
> I'll point out again that these address space translations were
> something that the ->corrupted_range callbacks handled directly - no
> layer in the stack was handed a range that it didn't know how to map
> to it's own internal structures. By the time it got to the
> filesystem, it was a {bdev,sector,len} tuple, and the filesystem
> could feed that directly to it's reverse mapping lookups....
>
> Maybe I'm missing something magical about ->memory_failure that does
> all this translation for us, but I don't see it in this patchset. I
> just don't see how this proposed interface is a usable at the
> filesystem level as it stands.

So then it's not the filesystem that needs to register for
memory_failure() it's the driver in order to translate the failed LBAs
up the stack. However, memory_failure() still needs to make sure that
the pfns are unmapped regardless of any LBA notification after the
fact. So memory_failure() would still need to gain a range based
failure mode that optionally coordinates with a driver that can try to
head off sub-optimal memory_failure() default behavior.

Something like:

memory_failure_range()
for_each_range_owner() {
handled = notify_range_owner()
if (!handled)
do_fail_each_pfn()
}

...then in the pmem driver.

pmem_pfn_range_failure_handler()
lba_range_failure_notifier()

Then each stacking block device registers for the lba_notifier from
the next block device in the stack.

2021-03-24 17:48:29

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

On Thu, Mar 18, 2021 at 3:02 AM Joao Martins <[email protected]> wrote:
>
> On 3/18/21 4:08 AM, Dan Williams wrote:
> > Now that device-dax and filesystem-dax are guaranteed to unmap all user
> > mappings of devmap / DAX pages before tearing down the 'struct page'
> > array, get_user_pages_fast() can rely on its traditional synchronization
> > method "validate_pte(); get_page(); revalidate_pte()" to catch races with
> > device shutdown. Specifically the unmap guarantee ensures that gup-fast
> > either succeeds in taking a page reference (lock-less), or it detects a
> > need to fall back to the slow path where the device presence can be
> > revalidated with locks held.
>
> [...]
>
> > @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> > #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
> >
> > #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > +
> > static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> > unsigned long end, unsigned int flags,
> > struct page **pages, int *nr)
> > {
> > int nr_start = *nr;
> > - struct dev_pagemap *pgmap = NULL;
> >
> > do {
> > - struct page *page = pfn_to_page(pfn);
> > + struct page *page;
> > +
> > + /*
> > + * Typically pfn_to_page() on a devmap pfn is not safe
> > + * without holding a live reference on the hosting
> > + * pgmap. In the gup-fast path it is safe because any
> > + * races will be resolved by either gup-fast taking a
> > + * reference or the shutdown path unmapping the pte to
> > + * trigger gup-fast to fall back to the slow path.
> > + */
> > + page = pfn_to_page(pfn);
> >
> > - pgmap = get_dev_pagemap(pfn, pgmap);
> > - if (unlikely(!pgmap)) {
> > - undo_dev_pagemap(nr, nr_start, flags, pages);
> > - return 0;
> > - }
> > SetPageReferenced(page);
> > pages[*nr] = page;
> > if (unlikely(!try_grab_page(page, flags))) {
>
> So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
> try_grab_page() for checking pgmap type to see if we are in a device-dax
> longterm pin?

So, there is an effort to add a new pte bit p{m,u}d_special to disable
gup-fast for huge pages [1]. I'd like to investigate whether we could
use devmap + special as an encoding for "no longterm" and never
consult the pgmap in the gup-fast path.

[1]: https://lore.kernel.org/linux-mm/[email protected]/

2021-03-24 19:03:48

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

On 3/24/21 5:45 PM, Dan Williams wrote:
> On Thu, Mar 18, 2021 at 3:02 AM Joao Martins <[email protected]> wrote:
>> On 3/18/21 4:08 AM, Dan Williams wrote:
>>> Now that device-dax and filesystem-dax are guaranteed to unmap all user
>>> mappings of devmap / DAX pages before tearing down the 'struct page'
>>> array, get_user_pages_fast() can rely on its traditional synchronization
>>> method "validate_pte(); get_page(); revalidate_pte()" to catch races with
>>> device shutdown. Specifically the unmap guarantee ensures that gup-fast
>>> either succeeds in taking a page reference (lock-less), or it detects a
>>> need to fall back to the slow path where the device presence can be
>>> revalidated with locks held.
>>
>> [...]
>>
>>> @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>> #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>>
>>> #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>> +
>>> static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>> unsigned long end, unsigned int flags,
>>> struct page **pages, int *nr)
>>> {
>>> int nr_start = *nr;
>>> - struct dev_pagemap *pgmap = NULL;
>>>
>>> do {
>>> - struct page *page = pfn_to_page(pfn);
>>> + struct page *page;
>>> +
>>> + /*
>>> + * Typically pfn_to_page() on a devmap pfn is not safe
>>> + * without holding a live reference on the hosting
>>> + * pgmap. In the gup-fast path it is safe because any
>>> + * races will be resolved by either gup-fast taking a
>>> + * reference or the shutdown path unmapping the pte to
>>> + * trigger gup-fast to fall back to the slow path.
>>> + */
>>> + page = pfn_to_page(pfn);
>>>
>>> - pgmap = get_dev_pagemap(pfn, pgmap);
>>> - if (unlikely(!pgmap)) {
>>> - undo_dev_pagemap(nr, nr_start, flags, pages);
>>> - return 0;
>>> - }
>>> SetPageReferenced(page);
>>> pages[*nr] = page;
>>> if (unlikely(!try_grab_page(page, flags))) {
>>
>> So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
>> try_grab_page() for checking pgmap type to see if we are in a device-dax
>> longterm pin?
>
> So, there is an effort to add a new pte bit p{m,u}d_special to disable
> gup-fast for huge pages [1]. I'd like to investigate whether we could
> use devmap + special as an encoding for "no longterm" and never
> consult the pgmap in the gup-fast path.
>
> [1]: https://lore.kernel.org/linux-mm/[email protected]/
>

Oh, nice! That would be ideal indeed, as we would skip the pgmap lookup enterily.

I suppose device-dax would use pfn_t PFN_MAP while fs-dax memory device would set PFN_MAP
| PFN_DEV (provided vmf_insert_pfn_{pmd,pud} calls mkspecial on PFN_DEV).

I haven't been following that thread, but for PMD/PUD special in vmf_* these might be useful:

https://lore.kernel.org/linux-mm/[email protected]/
https://lore.kernel.org/linux-mm/[email protected]/

2021-03-25 13:05:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 0/3] mm, pmem: Force unmap pmem on surprise remove

On 18.03.21 05:08, Dan Williams wrote:
> Summary:
>
> A dax_dev can be unbound from its driver at any time. Unbind can not
> fail. The driver-core will always trigger ->remove() and the result from
> ->remove() is ignored. After ->remove() the driver-core proceeds to tear
> down context. The filesystem-dax implementation can leave pfns mapped
> after ->remove() if it is triggered while the filesystem is mounted.
> Security and data-integrity is forfeit if the dax_dev is repurposed for
> another security domain (new filesystem or change device modes), or if
> the dax_dev is physically replaced. CXL is a hotplug bus that makes
> dax_dev physical replace a real world prospect.
>
> All dax_dev pfns must be unmapped at remove. Detect the "remove while
> mounted" case and trigger memory_failure() over the entire dax_dev
> range.
>
> Details:
>
> The get_user_pages_fast() path expects all synchronization to be handled
> by the pattern of checking for pte presence, taking a page reference,
> and then validating that the pte was stable over that event. The
> gup-fast path for devmap / DAX pages additionally attempts to take/hold
> a live reference against the hosting pgmap over the page pin. The
> rational for the pgmap reference is to synchronize against a dax-device
> unbind / ->remove() event, but that is unnecessary if pte invalidation
> is guaranteed in the ->remove() path.
>
> Global dax-device pte invalidation *does* happen when the device is in
> raw "device-dax" mode where there is a single shared inode to unmap at
> remove, but the filesystem-dax path has a large number of actively
> mapped inodes unknown to the driver at ->remove() time. So, that unmap
> does not happen today for filesystem-dax. However, as Jason points out,
> that unmap / invalidation *needs* to happen not only to cleanup
> get_user_pages_fast() semantics, but in a future (see CXL) where dax_dev
> ->remove() is correlated with actual physical removal / replacement the
> implications of allowing a physical pfn to be exchanged without tearing
> down old mappings are severe (security and data-integrity).
>
> What is not in this patch set is coordination with the dax_kmem driver
> to trigger memory_failure() when the dax_dev is onlined as "System
> RAM". The remove_memory() API was built with the assumption that
> platform firmware negotiates all removal requests and the OS has a
> chance to say "no". This is why dax_kmem today simply leaks
> request_region() to burn that physical address space for any other
> usage until the next reboot on a manual unbind event if the memory can't
> be offlined. However a future to make sure that remove_memory() succeeds
> after memory_failure() of the same range seems a better semantic than
> permanently burning physical address space.

I'd have similar requirements for virtio-mem on forced driver unloading
(although less of an issue, because in contrast to cxl, it doesn't
really happen in sane environments). However, I'm afraid there are
absolutely no guarantees that you can actually offline+rip out memory
exposed to the buddy, at least in the general case.

I guess it might be possible to some degree if memory was onlined to
ZONE_MOVABLE (there are still no guarantees, though), however, bets are
completely off with ZONE_NORMAL. Just imagine the memmap of one dax
device being allocated from another dax device. You cannot possibly
invalidate via memory_failure() and rip out the memory without crashing
the whole system afterwards.

--
Thanks,

David / dhildenb

2021-03-25 14:36:06

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

On Thu, Mar 18, 2021 at 10:03:06AM -0700, Dan Williams wrote:
> Yes. I still need to answer the question of whether mapping
> invalidation triggers longterm pin holders to relinquish their hold,
> but that's a problem regardless of whether gup-fast is supported or
> not.

It does not, GUP users do not interact with addres_space or mmu
notifiers

Jason

2021-03-25 14:41:05

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

On Wed, Mar 17, 2021 at 09:08:28PM -0700, Dan Williams wrote:
> Now that device-dax and filesystem-dax are guaranteed to unmap all user
> mappings of devmap / DAX pages before tearing down the 'struct page'
> array, get_user_pages_fast() can rely on its traditional synchronization
> method "validate_pte(); get_page(); revalidate_pte()" to catch races with
> device shutdown. Specifically the unmap guarantee ensures that gup-fast
> either succeeds in taking a page reference (lock-less), or it detects a
> need to fall back to the slow path where the device presence can be
> revalidated with locks held.
>
> Reported-by: Jason Gunthorpe <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Shiyang Ruan <[email protected]>
> Cc: Vishal Verma <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Ira Weiny <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Signed-off-by: Dan Williams <[email protected]>
> ---
> mm/gup.c | 38 ++++++++++++++++----------------------
> 1 file changed, 16 insertions(+), 22 deletions(-)

I'm happy to see this, and it is really the right thing that PTEs are
properly removed before anything happens to the pages under them.

Reviewed-by: Jason Gunthorpe <[email protected]>

Jason

2021-03-29 23:26:49

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

On Thu, Mar 25, 2021 at 7:34 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Mar 18, 2021 at 10:03:06AM -0700, Dan Williams wrote:
> > Yes. I still need to answer the question of whether mapping
> > invalidation triggers longterm pin holders to relinquish their hold,
> > but that's a problem regardless of whether gup-fast is supported or
> > not.
>
> It does not, GUP users do not interact with addres_space or mmu
> notifiers
>

Ok, but the SIGKILL from the memory_failure() will drop the pin?

Can memory_failure() find the right processes to kill if the memory
registration has been passed by SCM_RIGHTS?

2021-03-30 13:51:28

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

On Mon, Mar 29, 2021 at 04:24:19PM -0700, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 7:34 AM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Thu, Mar 18, 2021 at 10:03:06AM -0700, Dan Williams wrote:
> > > Yes. I still need to answer the question of whether mapping
> > > invalidation triggers longterm pin holders to relinquish their hold,
> > > but that's a problem regardless of whether gup-fast is supported or
> > > not.
> >
> > It does not, GUP users do not interact with addres_space or mmu
> > notifiers
>
> Ok, but the SIGKILL from the memory_failure() will drop the pin?

In the general case I doubt it..

I think this is fine, we shouldn't expect unplugging a driver to not
block if there are open users.

Demading DMA users to be able to shoot down access to system memory is
a really big kernel change.

Jason

2021-04-01 19:55:55

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

On 3/24/21 7:00 PM, Joao Martins wrote:
> On 3/24/21 5:45 PM, Dan Williams wrote:
>> On Thu, Mar 18, 2021 at 3:02 AM Joao Martins <[email protected]> wrote:
>>> On 3/18/21 4:08 AM, Dan Williams wrote:
>>>> Now that device-dax and filesystem-dax are guaranteed to unmap all user
>>>> mappings of devmap / DAX pages before tearing down the 'struct page'
>>>> array, get_user_pages_fast() can rely on its traditional synchronization
>>>> method "validate_pte(); get_page(); revalidate_pte()" to catch races with
>>>> device shutdown. Specifically the unmap guarantee ensures that gup-fast
>>>> either succeeds in taking a page reference (lock-less), or it detects a
>>>> need to fall back to the slow path where the device presence can be
>>>> revalidated with locks held.
>>>
>>> [...]
>>>
>>> So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
>>> try_grab_page() for checking pgmap type to see if we are in a device-dax
>>> longterm pin?
>>
>> So, there is an effort to add a new pte bit p{m,u}d_special to disable
>> gup-fast for huge pages [1]. I'd like to investigate whether we could
>> use devmap + special as an encoding for "no longterm" and never
>> consult the pgmap in the gup-fast path.
>>
>> [1]: https://lore.kernel.org/linux-mm/[email protected]/
>>
>
> Oh, nice! That would be ideal indeed, as we would skip the pgmap lookup enterily.
>
> I suppose device-dax would use pfn_t PFN_MAP while fs-dax memory device would set PFN_MAP
> | PFN_DEV (provided vmf_insert_pfn_{pmd,pud} calls mkspecial on PFN_DEV).
>
> I haven't been following that thread, but for PMD/PUD special in vmf_* these might be useful:
>
> https://lore.kernel.org/linux-mm/[email protected]/
> https://lore.kernel.org/linux-mm/[email protected]/
>

On a second thought, maybe it doesn't need to be that complicated for {fs,dev}dax if the
added special bit is just a subcase of devmap pte/pmd/puds. See below scsissors mark as a
rough estimation on the changes (nothing formal/proper as it isn't properly splitted).
Running gup_test with devdax/fsdax FOLL_LONGTERM and without does the intended. (gup_test
-m 16384 -r 10 -a -S -n 512 -w -f <file> [-F 0x10000]).

Unless this is about using special PMD/PUD bits without page structs (thus without devmap
bits) as in the previous two links.

-- >8 --

Subject: mm/gup, nvdimm: allow FOLL_LONGTERM for device-dax gup-fast

Signed-off-by: Joao Martins <[email protected]>

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 47027796c2f9..8b5d68d89cde 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -439,11 +439,16 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define pmd_devmap(pmd) pte_devmap(pmd_pte(pmd))
+#define pmd_special(pmd) pte_special(pmd_pte(pmd))
#endif
static inline pmd_t pmd_mkdevmap(pmd_t pmd)
{
return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)));
}
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+ return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_SPECIAL)));
+}

#define __pmd_to_phys(pmd) __pte_to_phys(pmd_pte(pmd))
#define __phys_to_pmd_val(phys) __phys_to_pte_val(phys)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b1099f2d9800..45449ee86d4f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -259,13 +259,13 @@ static inline int pmd_large(pmd_t pte)
/* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_large */
static inline int pmd_trans_huge(pmd_t pmd)
{
- return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+ return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_SPECIAL)) == _PAGE_PSE;
}

#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
static inline int pud_trans_huge(pud_t pud)
{
- return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+ return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_SPECIAL)) == _PAGE_PSE;
}
#endif

@@ -297,6 +297,19 @@ static inline int pgd_devmap(pgd_t pgd)
{
return 0;
}
+
+static inline bool pmd_special(pmd_t pmd)
+{
+ return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
+}
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline bool pud_special(pud_t pud)
+{
+ return !!(pud_flags(pud) & _PAGE_SPECIAL);
+}
+#endif
+
#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

@@ -452,6 +465,11 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
return pmd_set_flags(pmd, _PAGE_DEVMAP);
}

+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+ return pmd_set_flags(pmd, _PAGE_SPECIAL);
+}
+
static inline pmd_t pmd_mkhuge(pmd_t pmd)
{
return pmd_set_flags(pmd, _PAGE_PSE);
@@ -511,6 +529,11 @@ static inline pud_t pud_mkhuge(pud_t pud)
return pud_set_flags(pud, _PAGE_PSE);
}

+static inline pud_t pud_mkspecial(pud_t pud)
+{
+ return pud_set_flags(pud, _PAGE_SPECIAL);
+}
+
static inline pud_t pud_mkyoung(pud_t pud)
{
return pud_set_flags(pud, _PAGE_ACCESSED);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 16760b237229..156ceae33164 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -435,7 +435,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) -
range_len(&pmem->pgmap.range);
- pmem->pfn_flags |= PFN_MAP;
+ pmem->pfn_flags |= PFN_MAP | PFN_SPECIAL;
bb_range = pmem->pgmap.range;
bb_range.start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
@@ -445,7 +445,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
addr = devm_memremap_pages(dev, &pmem->pgmap);
- pmem->pfn_flags |= PFN_MAP;
+ pmem->pfn_flags |= PFN_MAP | PFN_SPECIAL;
bb_range = pmem->pgmap.range;
} else {
if (devm_add_action_or_reset(dev, pmem_release_queue,
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 4ee6f734ba83..873c8e53c85d 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -748,7 +748,7 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev,
pgoff_t pgoff,
*kaddr = fs->window_kaddr + offset;
if (pfn)
*pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
- PFN_DEV | PFN_MAP);
+ PFN_DEV | PFN_MAP | PFN_SPECIAL);
return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
}

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 7364e5a70228..ad7078e38ef2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1189,6 +1189,18 @@ static inline int pgd_devmap(pgd_t pgd)
}
#endif

+#if !defined(CONFIG_ARCH_HAS_PTE_SPECIAL) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
+static inline bool pmd_special(pmd_t pmd)
+{
+ return false;
+}
+
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+ return pmd;
+}
+#endif
+
#if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \
(defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
!defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD))
@@ -1196,6 +1208,14 @@ static inline int pud_trans_huge(pud_t pud)
{
return 0;
}
+static inline bool pud_special(pud_t pud)
+{
+ return false;
+}
+static inline pud_t pud_mkspecial(pud_t pud)
+{
+ return pud;
+}
#endif

/* See pmd_none_or_trans_huge_or_clear_bad for discussion. */
diff --git a/mm/gup.c b/mm/gup.c
index b3e647c8b7ee..87aa229a9347 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2086,7 +2086,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned
long end,
goto pte_unmap;

if (pte_devmap(pte)) {
- if (unlikely(flags & FOLL_LONGTERM))
+ if (unlikely(flags & FOLL_LONGTERM) && pte_special(pte))
goto pte_unmap;

pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
@@ -2338,7 +2338,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return 0;

if (pmd_devmap(orig)) {
- if (unlikely(flags & FOLL_LONGTERM))
+ if (unlikely(flags & FOLL_LONGTERM) && pmd_special(orig))
return 0;
return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
pages, nr);
@@ -2372,7 +2372,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return 0;

if (pud_devmap(orig)) {
- if (unlikely(flags & FOLL_LONGTERM))
+ if (unlikely(flags & FOLL_LONGTERM) && pud_special(orig))
return 0;
return __gup_device_huge_pud(orig, pudp, addr, end, flags,
pages, nr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f6f70632fc29..9d5117711919 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -796,8 +796,11 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long
addr,
}

entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
- if (pfn_t_devmap(pfn))
+ if (pfn_t_devmap(pfn)) {
entry = pmd_mkdevmap(entry);
+ if (pfn_t_special(pfn))
+ entry = pmd_mkspecial(entry);
+ }
if (write) {
entry = pmd_mkyoung(pmd_mkdirty(entry));
entry = maybe_pmd_mkwrite(entry, vma);
@@ -896,8 +899,11 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long
addr,
}

entry = pud_mkhuge(pfn_t_pud(pfn, prot));
- if (pfn_t_devmap(pfn))
+ if (pfn_t_devmap(pfn)) {
entry = pud_mkdevmap(entry);
+ if (pfn_t_special(pfn))
+ entry = pud_mkspecial(entry);
+ }
if (write) {
entry = pud_mkyoung(pud_mkdirty(entry));
entry = maybe_pud_mkwrite(entry, vma);