LinuxLists.cc - [RFC 00/10] fs/dax: Fix FS DAX page reference counts

2024-04-11 00:57:54

by Alistair Popple

[permalink] [raw]

Subject: [RFC 00/10] fs/dax: Fix FS DAX page reference counts

FS DAX pages have always maintained their own page reference counts
without following the normal rules for page reference counting. In
particular pages are considered free when the refcount hits one rather
than zero and refcounts are not added when mapping the page.

Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.

This RFC is an initial attempt at removing the special reference
counting and instead refcount FS DAX pages the same as normal pages.

There are still a couple of rough edges - in particular I haven't
completely removed the devmap PTE bit references from arch specific
code and there is probably some more cleanup of dev_pagemap reference
counting that could be done, particular in mm/gup.c. I also haven't
yet compiled on anything other than x86_64.

Before continuing further with this clean-up though I would appreciate
some feedback on the viability of this approach and any issues I may
have overlooked, as I am not intimately familiar with FS DAX code (or
for that matter the FS layer in general).

I have of course run some basic testing which didn't reveal any
problems.

Signed-off-by: Alistair Popple <[email protected]>

Alistair Popple (10):
mm/gup.c: Remove redundant check for PCI P2PDMA page
mm/hmm: Remove dead check for HugeTLB and FS DAX
pci/p2pdma: Don't initialise page refcount to one
fs/dax: Don't track page mapping/index
fs/dax: Refactor wait for dax idle page
fs/dax: Add dax_page_free callback
mm: Allow compound zone device pages
fs/dax: Properly refcount fs dax pages
mm/khugepage.c: Warn if trying to scan devmap pmd
mm: Remove pXX_devmap

Documentation/mm/arch_pgtable_helpers.rst | 6 +-
arch/arm64/include/asm/pgtable.h | 24 +---
arch/powerpc/include/asm/book3s/64/pgtable.h | 42 +-----
arch/powerpc/mm/book3s64/hash_pgtable.c | 3 +-
arch/powerpc/mm/book3s64/pgtable.c | 8 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 5 +-
arch/powerpc/mm/pgtable.c | 2 +-
arch/x86/include/asm/pgtable.h | 31 +---
drivers/dax/super.c | 2 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
drivers/nvdimm/pmem.c | 10 +-
drivers/pci/p2pdma.c | 4 +-
fs/dax.c | 158 +++++++-----------
fs/ext4/inode.c | 5 +-
fs/fuse/dax.c | 4 +-
fs/fuse/virtio_fs.c | 8 +-
fs/userfaultfd.c | 2 +-
fs/xfs/xfs_file.c | 4 +-
include/linux/dax.h | 16 ++-
include/linux/huge_mm.h | 11 +-
include/linux/memremap.h | 12 +-
include/linux/migrate.h | 2 +-
include/linux/mm.h | 41 +-----
include/linux/page-flags.h | 6 +-
include/linux/pgtable.h | 17 +--
lib/test_hmm.c | 2 +-
mm/debug_vm_pgtable.c | 51 +------
mm/gup.c | 165 +------------------
mm/hmm.c | 40 +----
mm/huge_memory.c | 180 +++++++++-----------
mm/internal.h | 2 +-
mm/khugepaged.c | 2 +-
mm/mapping_dirty_helpers.c | 4 +-
mm/memory-failure.c | 6 +-
mm/memory.c | 109 ++++++++----
mm/memremap.c | 36 +---
mm/migrate_device.c | 6 +-
mm/mm_init.c | 5 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 5 +-
mm/page_vma_mapped.c | 5 +-
mm/pgtable-generic.c | 7 +-
mm/swap.c | 2 +-
mm/vmscan.c | 5 +-
44 files changed, 338 insertions(+), 721 deletions(-)

base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa
--
git-series 0.9.1

2024-04-11 00:58:24

by Alistair Popple

[permalink] [raw]

Subject: [RFC 02/10] mm/hmm: Remove dead check for HugeTLB and FS DAX

pud_huge() returns true only for a HugeTLB page. pud_devmap() is only
used by FS DAX pages. These two things are mutually exclusive so this
code is dead code and can be removed.

Signed-off-by: Alistair Popple <[email protected]>
---
mm/hmm.c | 33 ---------------------------------
1 file changed, 33 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 277ddca..5bbfb0e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -411,9 +411,6 @@ static inline unsigned long pud_to_hmm_pfn_flags(struct hmm_range *range,
static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
struct mm_walk *walk)
{
- struct hmm_vma_walk *hmm_vma_walk = walk->private;
- struct hmm_range *range = hmm_vma_walk->range;
- unsigned long addr = start;
pud_t pud;
spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);

@@ -429,39 +426,9 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
return hmm_vma_walk_hole(start, end, -1, walk);
}

- if (pud_huge(pud) && pud_devmap(pud)) {
- unsigned long i, npages, pfn;
- unsigned int required_fault;
- unsigned long *hmm_pfns;
- unsigned long cpu_flags;
-
- if (!pud_present(pud)) {
- spin_unlock(ptl);
- return hmm_vma_walk_hole(start, end, -1, walk);
- }
-
- i = (addr - range->start) >> PAGE_SHIFT;
- npages = (end - addr) >> PAGE_SHIFT;
- hmm_pfns = &range->hmm_pfns[i];
-
- cpu_flags = pud_to_hmm_pfn_flags(range, pud);
- required_fault = hmm_range_need_fault(hmm_vma_walk, hmm_pfns,
- npages, cpu_flags);
- if (required_fault) {
- spin_unlock(ptl);
- return hmm_vma_fault(addr, end, required_fault, walk);
- }
-
- pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
- for (i = 0; i < npages; ++i, ++pfn)
- hmm_pfns[i] = pfn | cpu_flags;
- goto out_unlock;
- }
-
/* Ask for the PUD to be split */
walk->action = ACTION_SUBTREE;

-out_unlock:
spin_unlock(ptl);
return 0;
}
--
git-series 0.9.1

2024-04-11 00:58:28

by Alistair Popple

[permalink] [raw]

Subject: [RFC 03/10] pci/p2pdma: Don't initialise page refcount to one

The reference counts for ZONE_DEVICE private pages should be
initialised by the driver when the page is actually allocated by the
driver allocator, not when they are first created. This is currently
the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.

Signed-off-by: Alistair Popple <[email protected]>
---
drivers/pci/p2pdma.c | 2 ++
mm/memremap.c | 8 ++++----
mm/mm_init.c | 4 +++-
3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index fa7370f..ab7ef18 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -128,6 +128,8 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
goto out;
}

+ get_page(virt_to_page(kaddr));
+
/*
* vm_insert_page() can sleep, so a reference is taken to mapping
* such that rcu_read_unlock() can be done before inserting the
diff --git a/mm/memremap.c b/mm/memremap.c
index bee8556..99d26ff 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -508,15 +508,15 @@ void free_zone_device_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);

- if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
- page->pgmap->type != MEMORY_DEVICE_COHERENT)
+ if (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
+ page->pgmap->type == MEMORY_DEVICE_COHERENT)
+ put_dev_pagemap(page->pgmap);
+ else if (page->pgmap->type != MEMORY_DEVICE_PCI_P2PDMA)
/*
* Reset the page count to 1 to prepare for handing out the page
* again.
*/
set_page_count(page, 1);
- else
- put_dev_pagemap(page->pgmap);
}

void zone_device_page_init(struct page *page)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 50f2f34..da45abd 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -6,6 +6,7 @@
* Author Mel Gorman <[email protected]>
*
*/
+#include "linux/memremap.h"
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/kobject.h>
@@ -1006,7 +1007,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
* which will set the page count to 1 when allocating the page.
*/
if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_COHERENT)
+ pgmap->type == MEMORY_DEVICE_COHERENT ||
+ pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
set_page_count(page, 0);
}

--
git-series 0.9.1

2024-04-11 00:58:39

by Alistair Popple

[permalink] [raw]

Subject: [RFC 05/10] fs/dax: Refactor wait for dax idle page

A FS DAX page is considered idle when its refcount drops to one. This
is currently open-coded in all file systems supporting FS DAX. Move
the idle detection to a common function to make future changes easier.

Signed-off-by: Alistair Popple <[email protected]>
---
fs/ext4/inode.c | 5 +----
fs/fuse/dax.c | 4 +---
fs/xfs/xfs_file.c | 4 +---
include/linux/dax.h | 11 +++++++++++
4 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4ce35f1..e9cef7d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3868,10 +3868,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;

- error = ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1,
- TASK_INTERRUPTIBLE, 0, 0,
- ext4_wait_dax_page(inode));
+ error = dax_wait_page_idle(page, ext4_wait_dax_page, inode);
} while (error == 0);

return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 23904a6..8a62483 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -676,9 +676,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
return 0;

*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, fuse_wait_dax_page(inode));
+ return dax_wait_page_idle(page, fuse_wait_dax_page, inode);
}

/* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2037002..099cd70 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -849,9 +849,7 @@ xfs_break_dax_layouts(
return 0;

*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, xfs_wait_dax_page(inode));
+ return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
}

int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 22cd990..bced4d4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -212,6 +212,17 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
const struct iomap_ops *ops);

+static inline int dax_wait_page_idle(struct page *page,
+ void (cb)(struct inode *),
+ struct inode *inode)
+{
+ int ret;
+
+ ret = ___wait_var_event(page, page_ref_count(page) == 1,
+ TASK_INTERRUPTIBLE, 0, 0, cb(inode));
+ return ret;
+}
+
#if IS_ENABLED(CONFIG_DAX)
int dax_read_lock(void);
void dax_read_unlock(int id);
--
git-series 0.9.1

2024-04-11 00:58:46

by Alistair Popple

[permalink] [raw]

Subject: [RFC 06/10] fs/dax: Add dax_page_free callback

When a fs dax page is freed it has to notify filesystems that the page
has been unpinned/unmapped and is free. Currently this involves
special code in the page free paths to detect a transition of refcount
from 2 to 1 and to call some fs dax specific code.

A future change will require this to happen when the page refcount
drops to zero. In this case we can use the existing
pgmap->ops->page_free() callback so wire that up for all devices that
support FS DAX (nvdimm and virtio).

Signed-off-by: Alistair Popple <[email protected]>
---
drivers/nvdimm/pmem.c | 1 +
fs/dax.c | 6 ++++++
fs/fuse/virtio_fs.c | 5 +++++
include/linux/dax.h | 1 +
4 files changed, 13 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb..b027e1f 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -444,6 +444,7 @@ static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,

static const struct dev_pagemap_ops fsdax_pagemap_ops = {
.memory_failure = pmem_pagemap_memory_failure,
+ .page_free = dax_page_free,
};

static int pmem_attach_disk(struct device *dev,
diff --git a/fs/dax.c b/fs/dax.c
index a7bd423..17b1c5f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1981,3 +1981,9 @@ int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
pos_out, len, remap_flags, ops);
}
EXPORT_SYMBOL_GPL(dax_remap_file_range_prep);
+
+void dax_page_free(struct page *page)
+{
+ wake_up_var(page);
+}
+EXPORT_SYMBOL_GPL(dax_page_free);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1d..11bfc28 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -795,6 +795,10 @@ static void virtio_fs_cleanup_dax(void *data)
put_dax(dax_dev);
}

+static const struct dev_pagemap_ops fsdax_pagemap_ops = {
+ .page_free = dax_page_free,
+};
+
static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
{
struct virtio_shm_region cache_reg;
@@ -827,6 +831,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
return -ENOMEM;

pgmap->type = MEMORY_DEVICE_FS_DAX;
+ pgmap->ops = &fsdax_pagemap_ops;

/* Ideally we would directly use the PCI BAR resource but
* devm_memremap_pages() wants its own copy in pgmap. So
diff --git a/include/linux/dax.h b/include/linux/dax.h
index bced4d4..c0c3206 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -212,6 +212,7 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
const struct iomap_ops *ops);

+void dax_page_free(struct page *page);
static inline int dax_wait_page_idle(struct page *page,
void (cb)(struct inode *),
struct inode *inode)
--
git-series 0.9.1

2024-04-11 00:58:55

by Alistair Popple

[permalink] [raw]

Subject: [RFC 07/10] mm: Allow compound zone device pages

Zone device pages are used to represent various type of device memory
managed by device drivers. Currently compound zone device pages are
not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
user of higher order zone device pages and have their own page
reference counting.

A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference
counting. Supporting that requires compound zone device pages.

Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.

A tail page is distinguished by having bit zero being set in
page->compound_head, with the remaining bits pointing to the head
page. For zone device pages page->compound_head is shared with
page->pgmap.

The page->pgmap field is common to all pages within a memory section.
Therefore pgmap is the same for both head and tail pages and we can
use the same scheme to distinguish tail pages. To obtain the pgmap for
a tail page a new accessor is introduced to fetch it from
compound_head.

Signed-off-by: Alistair Popple <[email protected]>
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
drivers/pci/p2pdma.c | 2 +-
include/linux/memremap.h | 12 +++++++++---
include/linux/migrate.h | 2 +-
lib/test_hmm.c | 2 +-
mm/hmm.c | 2 +-
mm/memory.c | 2 +-
mm/memremap.c | 6 +++---
mm/migrate_device.c | 4 ++--
9 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 12feecf..eb49f07 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -88,7 +88,7 @@ struct nouveau_dmem {

static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
{
- return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
+ return container_of(page_dev_pagemap(page), struct nouveau_dmem_chunk, pagemap);
}

static struct nouveau_drm *page_to_drm(struct page *page)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ab7ef18..dfc2a17 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -195,7 +195,7 @@ static const struct attribute_group p2pmem_group = {

static void p2pdma_page_free(struct page *page)
{
- struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+ struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_dev_pagemap(page));
/* safe to dereference while a reference is held to the percpu ref */
struct pci_p2pdma *p2pdma =
rcu_dereference_protected(pgmap->provider->p2pdma, 1);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1314d9c..0773f8b 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -139,6 +139,12 @@ struct dev_pagemap {
};
};

+static inline struct dev_pagemap *page_dev_pagemap(const struct page *page)
+{
+ WARN_ON(!is_zone_device_page(page));
+ return compound_head(page)->pgmap;
+}
+
static inline bool pgmap_has_memory_failure(struct dev_pagemap *pgmap)
{
return pgmap->ops && pgmap->ops->memory_failure;
@@ -160,7 +166,7 @@ static inline bool is_device_private_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+ page_dev_pagemap(page)->type == MEMORY_DEVICE_PRIVATE;
}

static inline bool folio_is_device_private(const struct folio *folio)
@@ -172,13 +178,13 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+ page_dev_pagemap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
}

static inline bool is_device_coherent_page(const struct page *page)
{
return is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_COHERENT;
+ page_dev_pagemap(page)->type == MEMORY_DEVICE_COHERENT;
}

static inline bool folio_is_device_coherent(const struct folio *folio)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 711dd94..ebaf279 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -200,7 +200,7 @@ struct migrate_vma {
unsigned long end;

/*
- * Set to the owner value also stored in page->pgmap->owner for
+ * Set to the owner value also stored in page_dev_pagemap(page)->owner for
* migrating out of device private memory. The flags also need to
* be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
* The caller should always set this field when using mmu notifier
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 717dcb8..1101ff4 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -195,7 +195,7 @@ static int dmirror_fops_release(struct inode *inode, struct file *filp)

static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
{
- return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+ return container_of(page_dev_pagemap(page), struct dmirror_chunk, pagemap);
}

static struct dmirror_device *dmirror_page_to_device(struct page *page)
diff --git a/mm/hmm.c b/mm/hmm.c
index 5bbfb0e..a665a3c 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -248,7 +248,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
* just report the PFN.
*/
if (is_device_private_entry(entry) &&
- pfn_swap_entry_to_page(entry)->pgmap->owner ==
+ page_dev_pagemap(pfn_swap_entry_to_page(entry))->owner ==
range->dev_private_owner) {
cpu_flags = HMM_PFN_VALID;
if (is_writable_device_private_entry(entry))
diff --git a/mm/memory.c b/mm/memory.c
index 517221f..52248d4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3768,7 +3768,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+ ret = page_dev_pagemap(vmf->page)->ops->migrate_to_ram(vmf);
put_page(vmf->page);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
diff --git a/mm/memremap.c b/mm/memremap.c
index 99d26ff..619b059 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -470,7 +470,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);

void free_zone_device_page(struct page *page)
{
- if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
+ if (WARN_ON_ONCE(!page_dev_pagemap(page)->ops || !page_dev_pagemap(page)->ops->page_free))
return;

mem_cgroup_uncharge(page_folio(page));
@@ -506,7 +506,7 @@ void free_zone_device_page(struct page *page)
* to clear page->mapping.
*/
page->mapping = NULL;
- page->pgmap->ops->page_free(page);
+ page_dev_pagemap(page)->ops->page_free(page);

if (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
page->pgmap->type == MEMORY_DEVICE_COHERENT)
@@ -525,7 +525,7 @@ void zone_device_page_init(struct page *page)
* Drivers shouldn't be allocating pages after calling
* memunmap_pages().
*/
- WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
+ WARN_ON_ONCE(!percpu_ref_tryget_live(&page_dev_pagemap(page)->ref));
set_page_count(page, 1);
lock_page(page);
}
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 8ac1f79..1e1c82f 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -134,7 +134,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
page = pfn_swap_entry_to_page(entry);
if (!(migrate->flags &
MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
- page->pgmap->owner != migrate->pgmap_owner)
+ page_dev_pagemap(page)->owner != migrate->pgmap_owner)
goto next;

mpfn = migrate_pfn(page_to_pfn(page)) |
@@ -155,7 +155,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
goto next;
else if (page && is_device_coherent_page(page) &&
(!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
- page->pgmap->owner != migrate->pgmap_owner))
+ page_dev_pagemap(page)->owner != migrate->pgmap_owner))
goto next;
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
--
git-series 0.9.1

2024-04-11 00:59:05

by Alistair Popple

[permalink] [raw]

Subject: [RFC 08/10] fs/dax: Properly refcount fs dax pages

Currently fs dax pages are considered free when the refcount drops to
one and their refcounts are not increased when mapped via PTEs or
decreased when unmapped. This requires special logic in mm paths to
detect that these pages should not be properly refcounted, and to
detect when the refcount drops to one instead of zero.

On the other hand get_user_pages(), etc. will properly refcount fs dax
pages by taking a reference and dropping it when the page is
unpinned.

Tracking this special behaviour requires extra PTE bits
(eg. pte_devmap) and introduces rules that are potentially confusing
and specific to FS DAX pages. To fix this, and to possibly allow
removal of the special PTE bits in future, convert the fs dax page
refcounts to be zero based and instead take a reference on the page
each time it is mapped as is currently the case for normal pages.

This may also allow a future clean-up to remove the pgmap refcounting
that is currently done in mm/gup.c.

Signed-off-by: Alistair Popple <[email protected]>
---
drivers/dax/super.c | 2 +-
drivers/nvdimm/pmem.c | 9 +---
fs/dax.c | 91 +++++++++++++++++++++++++++++++++---------
fs/fuse/virtio_fs.c | 3 +-
include/linux/dax.h | 6 ++-
include/linux/huge_mm.h | 1 +-
include/linux/mm.h | 34 +---------------
mm/gup.c | 9 +---
mm/huge_memory.c | 80 +++++++++++++++++++++++++++++++++++--
mm/internal.h | 2 +-
mm/memory-failure.c | 6 +--
mm/memory.c | 82 ++++++++++++++++++++++++++++++++++----
mm/memremap.c | 24 +----------
mm/mm_init.c | 3 +-
mm/swap.c | 2 +-
15 files changed, 251 insertions(+), 103 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232..d393cd3 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -256,7 +256,7 @@ EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
void arch_wb_cache_pmem(void *addr, size_t size);
void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
{
- if (unlikely(!dax_write_cache_enabled(dax_dev)))
+ if (unlikely(dax_dev && !dax_write_cache_enabled(dax_dev)))
return;

arch_wb_cache_pmem(addr, size);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b027e1f..c7cb6b4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -505,7 +505,7 @@ static int pmem_attach_disk(struct device *dev,

pmem->disk = disk;
pmem->pgmap.owner = pmem;
- pmem->pfn_flags = PFN_DEV;
+ pmem->pfn_flags = 0;
if (is_nd_pfn(dev)) {
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
@@ -514,7 +514,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) -
range_len(&pmem->pgmap.range);
- pmem->pfn_flags |= PFN_MAP;
+ blk_queue_flag_set(QUEUE_FLAG_DAX, q);
bb_range = pmem->pgmap.range;
bb_range.start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
@@ -524,9 +524,10 @@ static int pmem_attach_disk(struct device *dev,
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
addr = devm_memremap_pages(dev, &pmem->pgmap);
- pmem->pfn_flags |= PFN_MAP;
+ blk_queue_flag_set(QUEUE_FLAG_DAX, q);
bb_range = pmem->pgmap.range;
} else {
+ pmem->pfn_flags = PFN_DEV;
addr = devm_memremap(dev, pmem->phys_addr,
pmem->size, ARCH_MEMREMAP_PMEM);
bb_range.start = res->start;
@@ -545,8 +546,6 @@ static int pmem_attach_disk(struct device *dev,
blk_queue_max_hw_sectors(q, UINT_MAX);
blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
- if (pmem->pfn_flags & PFN_MAP)
- blk_queue_flag_set(QUEUE_FLAG_DAX, q);

disk->fops = &pmem_fops;
disk->private_data = pmem;
diff --git a/fs/dax.c b/fs/dax.c
index 17b1c5f..a45793f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -71,6 +71,11 @@ static unsigned long dax_to_pfn(void *entry)
return xa_to_value(entry) >> DAX_SHIFT;
}

+static struct folio *dax_to_folio(void *entry)
+{
+ return page_folio(pfn_to_page(dax_to_pfn(entry)));
+}
+
static void *dax_make_entry(pfn_t pfn, unsigned long flags)
{
return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
@@ -318,7 +323,44 @@ static unsigned long dax_end_pfn(void *entry)
*/
#define for_each_mapped_pfn(entry, pfn) \
for (pfn = dax_to_pfn(entry); \
- pfn < dax_end_pfn(entry); pfn++)
+ pfn < dax_end_pfn(entry); pfn++)
+
+static void dax_device_folio_init(struct folio *folio, int order)
+{
+ int orig_order = folio_order(folio);
+ int i;
+
+ if (orig_order != order) {
+ for (i = 0; i < (1UL << orig_order); i++)
+ ClearPageHead(folio_page(folio, i));
+ }
+
+ if (order > 0) {
+ prep_compound_page(&folio->page, order);
+ if (order > 1) {
+ VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
+ INIT_LIST_HEAD(&folio->_deferred_list);
+ }
+ }
+}
+
+static void dax_associate_new_entry(void *entry, struct address_space *mapping, pgoff_t index)
+{
+ unsigned long order = dax_entry_order(entry);
+ struct folio *folio = dax_to_folio(entry);
+
+ if (!dax_entry_size(entry))
+ return;
+
+ // We don't hold a reference for the DAX pagecache entry for the page. But we
+ // need to initialise the folio so we can hand it out. Nothing else should have
+ // a reference if it's zeroed either.
+ WARN_ON_ONCE(folio_ref_count(folio));
+ WARN_ON_ONCE(folio->mapping);
+ dax_device_folio_init(folio, order);
+ folio->mapping = mapping;
+ folio->index = index;
+}

static struct page *dax_busy_page(void *entry)
{
@@ -327,7 +369,7 @@ static struct page *dax_busy_page(void *entry)
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);

- if (page_ref_count(page) > 1)
+ if (page_ref_count(page))
return page;
}
return NULL;
@@ -346,10 +388,10 @@ dax_entry_t dax_lock_page(struct page *page)
XA_STATE(xas, NULL, 0);
void *entry;

- /* Ensure page->mapping isn't freed while we look at it */
+ /* Ensure page_folio(page)->mapping isn't freed while we look at it */
rcu_read_lock();
for (;;) {
- struct address_space *mapping = READ_ONCE(page->mapping);
+ struct address_space *mapping = READ_ONCE(page_folio(page)->mapping);

entry = NULL;
if (!mapping || !dax_mapping(mapping))
@@ -368,7 +410,7 @@ dax_entry_t dax_lock_page(struct page *page)

xas.xa = &mapping->i_pages;
xas_lock_irq(&xas);
- if (mapping != page->mapping) {
+ if (mapping != page_folio(page)->mapping) {
xas_unlock_irq(&xas);
continue;
}
@@ -390,7 +432,7 @@ dax_entry_t dax_lock_page(struct page *page)

void dax_unlock_page(struct page *page, dax_entry_t cookie)
{
- struct address_space *mapping = page->mapping;
+ struct address_space *mapping = page_folio(page)->mapping;
XA_STATE(xas, &mapping->i_pages, page->index);

if (S_ISCHR(mapping->host->i_mode))
@@ -662,8 +704,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
}
EXPORT_SYMBOL_GPL(dax_layout_busy_page);

-static int __dax_invalidate_entry(struct address_space *mapping,
- pgoff_t index, bool trunc)
+int __dax_invalidate_entry(struct address_space *mapping,
+ pgoff_t index, bool trunc)
{
XA_STATE(xas, &mapping->i_pages, index);
int ret = 0;
@@ -813,6 +855,11 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
if (shared || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;

+ if (!shared) {
+ dax_associate_new_entry(new_entry, mapping,
+ linear_page_index(vmf->vma, vmf->address));
+ }
+
/*
* Only swap our new entry into the page cache if the current
* entry is a zero page or an empty entry. If a normal PTE or
@@ -1000,9 +1047,7 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
goto out;
if (pfn_t_to_pfn(*pfnp) & (PHYS_PFN(size)-1))
goto out;
- /* For larger pages we need devmap */
- if (length > 1 && !pfn_t_devmap(*pfnp))
- goto out;
+
rc = 0;

out_check_addr:
@@ -1109,7 +1154,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,

*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);

- ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+ ret = dax_insert_pfn(vmf->vma, vaddr, pfn, false);
trace_dax_load_hole(inode, vmf, ret);
return ret;
}
@@ -1602,12 +1647,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,

/* insert PMD pfn */
if (pmd)
- return vmf_insert_pfn_pmd(vmf, pfn, write);
+ return dax_insert_pfn_pmd(vmf, pfn, write);

/* insert PTE pfn */
- if (write)
- return vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
- return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+ return dax_insert_pfn(vmf->vma, vmf->address, pfn, write);
}

static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
@@ -1864,10 +1907,10 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
dax_lock_entry(&xas, entry);
xas_unlock_irq(&xas);
if (order == 0)
- ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
+ ret = dax_insert_pfn(vmf->vma, vmf->address, pfn, true);
#ifdef CONFIG_FS_DAX_PMD
else if (order == PMD_ORDER)
- ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
+ ret = dax_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
#endif
else
ret = VM_FAULT_FALLBACK;
@@ -1984,6 +2027,18 @@ EXPORT_SYMBOL_GPL(dax_remap_file_range_prep);

void dax_page_free(struct page *page)
{
+ /*
+ * Set trunc to true because we want to remove the entry from the DAX
+ * page-cache.
+ */
+ __dax_invalidate_entry(page->mapping, page->index, true);
+
+ /*
+ * Make sure we flush any cached data to the page now that it's free.
+ */
+ if (PageDirty(page))
+ dax_flush(NULL, page_address(page), 1);
+
wake_up_var(page);
}
EXPORT_SYMBOL_GPL(dax_page_free);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 11bfc28..c196cae 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -761,8 +761,7 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
if (kaddr)
*kaddr = fs->window_kaddr + offset;
if (pfn)
- *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
- PFN_DEV | PFN_MAP);
+ *pfn = phys_to_pfn_t(fs->window_phys_addr + offset, 0);
return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
}

diff --git a/include/linux/dax.h b/include/linux/dax.h
index c0c3206..74a40e5 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -217,9 +217,13 @@ static inline int dax_wait_page_idle(struct page *page,
void (cb)(struct inode *),
struct inode *inode)
{
+ int i = 0;
int ret;

- ret = ___wait_var_event(page, page_ref_count(page) == 1,
+ /*
+ * Wait for the pgmap->ops->page_free callback.
+ */
+ ret = ___wait_var_event(page, !page_ref_count(page) || i++,
TASK_INTERRUPTIBLE, 0, 0, cb(inode));
return ret;
}
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fa0350b..bf49efa 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,

vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);

enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_UNSUPPORTED,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1..f10aa62 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1040,6 +1040,8 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
struct mmu_gather;
struct inode;

+extern void prep_compound_page(struct page *page, unsigned int order);
+
/*
* compound_order() can be called without holding a reference, which means
* that niceties like page_folio() don't work. These callers should be
@@ -1400,30 +1402,6 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
* back into memory.
*/

-#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-bool __put_devmap_managed_page_refs(struct page *page, int refs);
-static inline bool put_devmap_managed_page_refs(struct page *page, int refs)
-{
- if (!static_branch_unlikely(&devmap_managed_key))
- return false;
- if (!is_zone_device_page(page))
- return false;
- return __put_devmap_managed_page_refs(page, refs);
-}
-#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-static inline bool put_devmap_managed_page_refs(struct page *page, int refs)
-{
- return false;
-}
-#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-
-static inline bool put_devmap_managed_page(struct page *page)
-{
- return put_devmap_managed_page_refs(page, 1);
-}
-
/* 127: arbitrary random number, small enough to assemble well */
#define folio_ref_zero_or_close_to_overflow(folio) \
((unsigned int) folio_ref_count(folio) + 127u <= 127u)
@@ -1535,12 +1513,6 @@ static inline void put_page(struct page *page)
{
struct folio *folio = page_folio(page);

- /*
- * For some devmap managed pages we need to catch refcount transition
- * from 2 to 1:
- */
- if (put_devmap_managed_page(&folio->page))
- return;
folio_put(folio);
}

@@ -3465,6 +3437,8 @@ int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
unsigned long num);
int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
unsigned long num);
+vm_fault_t dax_insert_pfn(struct vm_area_struct *vma,
+ unsigned long addr, pfn_t pfn, bool write);
vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/gup.c b/mm/gup.c
index a9c8a09..6a3141d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -89,8 +89,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
* belongs to this folio.
*/
if (unlikely(page_folio(page) != folio)) {
- if (!put_devmap_managed_page_refs(&folio->page, refs))
- folio_put_refs(folio, refs);
+ folio_put_refs(folio, refs);
goto retry;
}

@@ -156,8 +155,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
*/
if (unlikely((flags & FOLL_LONGTERM) &&
!folio_is_longterm_pinnable(folio))) {
- if (!put_devmap_managed_page_refs(&folio->page, refs))
- folio_put_refs(folio, refs);
+ folio_put_refs(folio, refs);
return NULL;
}

@@ -198,8 +196,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
refs *= GUP_PIN_COUNTING_BIAS;
}

- if (!put_devmap_managed_page_refs(&folio->page, refs))
- folio_put_refs(folio, refs);
+ folio_put_refs(folio, refs);
}

/**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd9..c657886 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -901,8 +901,6 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
* but we need to be consistent with PTEs and architectures that
* can't support a 'special' bit.
*/
- BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
- !pfn_t_devmap(pfn));
BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
(VM_PFNMAP|VM_MIXEDMAP));
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -923,6 +921,79 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
}
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);

+static vm_fault_t insert_page_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pmd_t entry;
+ spinlock_t *ptl;
+ pgprot_t pgprot = vma->vm_page_prot;
+ pgtable_t pgtable = NULL;
+ struct page *page;
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (!pgtable)
+ return VM_FAULT_OOM;
+ }
+
+ track_pfn_insert(vma, &pgprot, pfn);
+
+ ptl = pmd_lock(mm, pmd);
+ if (!pmd_none(*pmd)) {
+ if (write) {
+ if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
+ WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
+ goto out_unlock;
+ }
+ entry = pmd_mkyoung(*pmd);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
+ update_mmu_cache_pmd(vma, addr, pmd);
+ }
+
+ goto out_unlock;
+ }
+
+ entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
+ if (pfn_t_devmap(pfn))
+ entry = pmd_mkdevmap(entry);
+ if (write) {
+ entry = pmd_mkyoung(pmd_mkdirty(entry));
+ entry = maybe_pmd_mkwrite(entry, vma);
+ }
+
+ if (pgtable) {
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ mm_inc_nr_ptes(mm);
+ pgtable = NULL;
+ }
+
+ page = pfn_t_to_page(pfn);
+ folio_get(page_folio(page));
+ folio_add_file_rmap_range(page_folio(page), page, 1, vma, true);
+ add_mm_counter(mm, mm_counter_file(page), HPAGE_PMD_NR);
+ set_pmd_at(mm, addr, pmd, entry);
+ update_mmu_cache_pmd(vma, addr, pmd);
+
+out_unlock:
+ spin_unlock(ptl);
+ if (pgtable)
+ pte_free(mm, pgtable);
+
+ return VM_FAULT_NOPAGE;
+}
+
+vm_fault_t dax_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
+{
+ return insert_page_pmd(vmf->vma, vmf->address & PMD_MASK, vmf->pmd, pfn,
+ vmf->vma->vm_page_prot, write);
+}
+EXPORT_SYMBOL_GPL(dax_insert_pfn_pmd);
+
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
{
@@ -1677,7 +1748,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb->fullmm);
arch_check_zapped_pmd(vma, orig_pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- if (vma_is_special_huge(vma)) {
+ if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
if (arch_needs_pgtable_deposit())
zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
@@ -2092,8 +2163,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
*/
if (arch_needs_pgtable_deposit())
zap_deposited_table(mm, pmd);
- if (vma_is_special_huge(vma))
+ if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
return;
+ }
if (unlikely(is_pmd_migration_entry(old_pmd))) {
swp_entry_t entry;

diff --git a/mm/internal.h b/mm/internal.h
index 30cf724..81597b6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -434,8 +434,6 @@ static inline void prep_compound_tail(struct page *head, int tail_idx)
set_page_private(p, 0);
}

-extern void prep_compound_page(struct page *page, unsigned int order);
-
extern void post_alloc_hook(struct page *page, unsigned int order,
gfp_t gfp_flags);
extern int user_min_free_kbytes;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4d6e43c..de64958 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -394,18 +394,18 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
pud = pud_offset(p4d, address);
if (!pud_present(*pud))
return 0;
- if (pud_devmap(*pud))
+ if (pud_trans_huge(*pud))
return PUD_SHIFT;
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return 0;
- if (pmd_devmap(*pmd))
+ if (pmd_trans_huge(*pmd))
return PMD_SHIFT;
pte = pte_offset_map(pmd, address);
if (!pte)
return 0;
ptent = ptep_get(pte);
- if (pte_present(ptent) && pte_devmap(ptent))
+ if (pte_present(ptent))
ret = PAGE_SHIFT;
pte_unmap(pte);
return ret;
diff --git a/mm/memory.c b/mm/memory.c
index 52248d4..418b630 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1834,15 +1834,44 @@ static int validate_page_before_insert(struct page *page)
}

static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
- unsigned long addr, struct page *page, pgprot_t prot)
+ unsigned long addr, struct page *page, pgprot_t prot, bool mkwrite)
{
- if (!pte_none(ptep_get(pte)))
+ pte_t entry = ptep_get(pte);
+
+ if (!pte_none(entry)) {
+ if (mkwrite) {
+ /*
+ * For read faults on private mappings the PFN passed
+ * in may not match the PFN we have mapped if the
+ * mapped PFN is a writeable COW page. In the mkwrite
+ * case we are creating a writable PTE for a shared
+ * mapping and we expect the PFNs to match. If they
+ * don't match, we are likely racing with block
+ * allocation and mapping invalidation so just skip the
+ * update.
+ */
+ if (pte_pfn(entry) != page_to_pfn(page)) {
+ WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
+ return -EFAULT;
+ }
+ entry = maybe_mkwrite(entry, vma);
+ entry = pte_mkyoung(entry);
+ if (ptep_set_access_flags(vma, addr, pte, entry, 1))
+ update_mmu_cache(vma, addr, pte);
+ return 0;
+ }
return -EBUSY;
+ }
+
/* Ok, finally just insert the thing.. */
get_page(page);
+ if (mkwrite)
+ entry = maybe_mkwrite(mk_pte(page, prot), vma);
+ else
+ entry = mk_pte(page, prot);
inc_mm_counter(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, vma, false);
- set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
+ set_pte_at(vma->vm_mm, addr, pte, entry);
return 0;
}

@@ -1854,7 +1883,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
* pages reserved for the old functions anyway.
*/
static int insert_page(struct vm_area_struct *vma, unsigned long addr,
- struct page *page, pgprot_t prot)
+ struct page *page, pgprot_t prot, bool mkwrite)
{
int retval;
pte_t *pte;
@@ -1867,7 +1896,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
pte = get_locked_pte(vma->vm_mm, addr, &ptl);
if (!pte)
goto out;
- retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
+ retval = insert_page_into_pte_locked(vma, pte, addr, page, prot, mkwrite);
pte_unmap_unlock(pte, ptl);
out:
return retval;
@@ -1883,7 +1912,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
err = validate_page_before_insert(page);
if (err)
return err;
- return insert_page_into_pte_locked(vma, pte, addr, page, prot);
+ return insert_page_into_pte_locked(vma, pte, addr, page, prot, false);
}

/* insert_pages() amortizes the cost of spinlock operations
@@ -2020,7 +2049,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
BUG_ON(vma->vm_flags & VM_PFNMAP);
vm_flags_set(vma, VM_MIXEDMAP);
}
- return insert_page(vma, addr, page, vma->vm_page_prot);
+ return insert_page(vma, addr, page, vma->vm_page_prot, false);
}
EXPORT_SYMBOL(vm_insert_page);

@@ -2294,7 +2323,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
* result in pfn_t_has_page() == false.
*/
page = pfn_to_page(pfn_t_to_pfn(pfn));
- err = insert_page(vma, addr, page, pgprot);
+ err = insert_page(vma, addr, page, pgprot, mkwrite);
} else {
return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
}
@@ -2307,6 +2336,43 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
return VM_FAULT_NOPAGE;
}

+vm_fault_t dax_insert_pfn(struct vm_area_struct *vma,
+ unsigned long addr, pfn_t pfn_t, bool write)
+{
+ pgprot_t pgprot = vma->vm_page_prot;
+ unsigned long pfn = pfn_t_to_pfn(pfn_t);
+ struct page *page = pfn_to_page(pfn);
+ int err;
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+
+ track_pfn_insert(vma, &pgprot, pfn_t);
+
+ if (!pfn_modify_allowed(pfn, pgprot))
+ return VM_FAULT_SIGBUS;
+
+ /*
+ * We refcount the page normally so make sure pfn_valid is true.
+ */
+ if (!pfn_t_valid(pfn_t))
+ return VM_FAULT_SIGBUS;
+
+ WARN_ON_ONCE(pfn_t_devmap(pfn_t));
+
+ if (WARN_ON(is_zero_pfn(pfn) && write))
+ return VM_FAULT_SIGBUS;
+
+ err = insert_page(vma, addr, page, pgprot, write);
+ if (err == -ENOMEM)
+ return VM_FAULT_OOM;
+ if (err < 0 && err != -EBUSY)
+ return VM_FAULT_SIGBUS;
+
+ return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(dax_insert_pfn);
+
vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
pfn_t pfn)
{
diff --git a/mm/memremap.c b/mm/memremap.c
index 619b059..3aab098 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -505,18 +505,20 @@ void free_zone_device_page(struct page *page)
* handled differently or not done at all, so there is no need
* to clear page->mapping.
*/
- page->mapping = NULL;
page_dev_pagemap(page)->ops->page_free(page);

if (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
page->pgmap->type == MEMORY_DEVICE_COHERENT)
put_dev_pagemap(page->pgmap);
- else if (page->pgmap->type != MEMORY_DEVICE_PCI_P2PDMA)
+ else if (page->pgmap->type != MEMORY_DEVICE_PCI_P2PDMA &&
+ page->pgmap->type != MEMORY_DEVICE_FS_DAX)
/*
* Reset the page count to 1 to prepare for handing out the page
* again.
*/
set_page_count(page, 1);
+
+ page->mapping = NULL;
}

void zone_device_page_init(struct page *page)
@@ -530,21 +532,3 @@ void zone_device_page_init(struct page *page)
lock_page(page);
}
EXPORT_SYMBOL_GPL(zone_device_page_init);
-
-#ifdef CONFIG_FS_DAX
-bool __put_devmap_managed_page_refs(struct page *page, int refs)
-{
- if (page->pgmap->type != MEMORY_DEVICE_FS_DAX)
- return false;
-
- /*
- * fsdax page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (page_ref_sub_return(page, refs) == 1)
- wake_up_var(&page->_refcount);
- return true;
-}
-EXPORT_SYMBOL(__put_devmap_managed_page_refs);
-#endif /* CONFIG_FS_DAX */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index da45abd..2a2864e 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1008,7 +1008,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
*/
if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
pgmap->type == MEMORY_DEVICE_COHERENT ||
- pgmap->type == MEMORY_DEVICE_PCI_P2PDMA)
+ pgmap->type == MEMORY_DEVICE_PCI_P2PDMA ||
+ pgmap->type == MEMORY_DEVICE_FS_DAX)
set_page_count(page, 0);
}

diff --git a/mm/swap.c b/mm/swap.c
index cd8f015..fe76552 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -990,8 +990,6 @@ void release_pages(release_pages_arg arg, int nr)
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- if (put_devmap_managed_page(&folio->page))
- continue;
if (folio_put_testzero(folio))
free_zone_device_page(&folio->page);
continue;
--
git-series 0.9.1

2024-04-11 00:59:11

by Alistair Popple

[permalink] [raw]

Subject: [RFC 09/10] mm/khugepage.c: Warn if trying to scan devmap pmd

The only user of devmap PTEs is FS DAX, and khugepaged should not be
scanning these VMAs. This is checked by calling
hugepage_vma_check. Therefore khugepaged should never encounter a
devmap PTE. Warn if this occurs.

Signed-off-by: Alistair Popple <[email protected]>

---

Note this is a transitory patch to test the above assumption both at
runtime and during review. I will likely remove it as the whole thing
gets deleted when pXX_devmap is removed.
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 88433cc..b10db15 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -955,7 +955,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
return SCAN_PMD_NULL;
if (pmd_trans_huge(pmde))
return SCAN_PMD_MAPPED;
- if (pmd_devmap(pmde))
+ if (WARN_ON_ONCE(pmd_devmap(pmde)))
return SCAN_PMD_NULL;
if (pmd_bad(pmde))
return SCAN_PMD_NULL;
--
git-series 0.9.1

2024-04-11 00:59:46

by Alistair Popple

[permalink] [raw]

Subject: [RFC 01/10] mm/gup.c: Remove redundant check for PCI P2PDMA page

PCI P2PDMA pages are not mapped with pXX_devmap PTEs therefore the
check in __gup_device_huge() is redundant. Remove it

Signed-off-by: Alistair Popple <[email protected]>
---
mm/gup.c | 5 -----
1 file changed, 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 2f8a2d8..a9c8a09 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2683,11 +2683,6 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
break;
}

- if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
- undo_dev_pagemap(nr, nr_start, flags, pages);
- break;
- }
-
SetPageReferenced(page);
pages[*nr] = page;
if (unlikely(try_grab_page(page, flags))) {
--
git-series 0.9.1

2024-04-11 01:00:00

by Alistair Popple

[permalink] [raw]

Subject: [RFC 04/10] fs/dax: Don't track page mapping/index

The page->mapping and page->index fields are normally used by the
pagecache and rmap for looking up virtual mappings of pages. FS DAX
implements it's own kind of page cache and rmap look ups so these
fields are unnecessary. They are currently only used to detect
error/warning conditions which should never occur.

A future change will change the way shared mappings are detected by
doing normal page reference counting instead, so remove the
unnecessary checks.

Signed-off-by: Alistair Popple <[email protected]>
---
fs/dax.c | 84 +---------------------------------------
include/linux/page-flags.h | 6 +---
2 files changed, 90 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 8fafecb..a7bd423 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -320,85 +320,6 @@ static unsigned long dax_end_pfn(void *entry)
for (pfn = dax_to_pfn(entry); \
pfn < dax_end_pfn(entry); pfn++)

-static inline bool dax_page_is_shared(struct page *page)
-{
- return page->mapping == PAGE_MAPPING_DAX_SHARED;
-}
-
-/*
- * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
- * refcount.
- */
-static inline void dax_page_share_get(struct page *page)
-{
- if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
- /*
- * Reset the index if the page was already mapped
- * regularly before.
- */
- if (page->mapping)
- page->share = 1;
- page->mapping = PAGE_MAPPING_DAX_SHARED;
- }
- page->share++;
-}
-
-static inline unsigned long dax_page_share_put(struct page *page)
-{
- return --page->share;
-}
-
-/*
- * When it is called in dax_insert_entry(), the shared flag will indicate that
- * whether this entry is shared by multiple files. If so, set the page->mapping
- * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
- */
-static void dax_associate_entry(void *entry, struct address_space *mapping,
- struct vm_area_struct *vma, unsigned long address, bool shared)
-{
- unsigned long size = dax_entry_size(entry), pfn, index;
- int i = 0;
-
- if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
- return;
-
- index = linear_page_index(vma, address & ~(size - 1));
- for_each_mapped_pfn(entry, pfn) {
- struct page *page = pfn_to_page(pfn);
-
- if (shared) {
- dax_page_share_get(page);
- } else {
- WARN_ON_ONCE(page->mapping);
- page->mapping = mapping;
- page->index = index + i++;
- }
- }
-}
-
-static void dax_disassociate_entry(void *entry, struct address_space *mapping,
- bool trunc)
-{
- unsigned long pfn;
-
- if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
- return;
-
- for_each_mapped_pfn(entry, pfn) {
- struct page *page = pfn_to_page(pfn);
-
- WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
- if (dax_page_is_shared(page)) {
- /* keep the shared flag if this page is still shared */
- if (dax_page_share_put(page) > 0)
- continue;
- } else
- WARN_ON_ONCE(page->mapping && page->mapping != mapping);
- page->mapping = NULL;
- page->index = 0;
- }
-}
-
static struct page *dax_busy_page(void *entry)
{
unsigned long pfn;
@@ -620,7 +541,6 @@ static void *grab_mapping_entry(struct xa_state *xas,
xas_lock_irq(xas);
}

- dax_disassociate_entry(entry, mapping, false);
xas_store(xas, NULL); /* undo the PMD join */
dax_wake_entry(xas, entry, WAKE_ALL);
mapping->nrpages -= PG_PMD_NR;
@@ -757,7 +677,6 @@ static int __dax_invalidate_entry(struct address_space *mapping,
(xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
goto out;
- dax_disassociate_entry(entry, mapping, trunc);
xas_store(&xas, NULL);
mapping->nrpages -= 1UL << dax_entry_order(entry);
ret = 1;
@@ -894,9 +813,6 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
if (shared || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
void *old;

- dax_disassociate_entry(entry, mapping, false);
- dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
- shared);
/*
* Only swap our new entry into the page cache if the current
* entry is a zero page or an empty entry. If a normal PTE or
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 5c02720..85d5427 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -631,12 +631,6 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted)
#define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

-/*
- * Different with flags above, this flag is used only for fsdax mode. It
- * indicates that this page->mapping is now under reflink case.
- */
-#define PAGE_MAPPING_DAX_SHARED ((void *)0x1)
-
static __always_inline bool folio_mapping_flags(struct folio *folio)
{
return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) != 0;
--
git-series 0.9.1

2024-04-11 01:00:21

by Alistair Popple

[permalink] [raw]

Subject: [RFC 10/10] mm: Remove pXX_devmap

The devmap PTE special bit was used to detect mappings of FS DAX
pages. This tracking was required to ensure the generic mm did not
manipulate the page reference counts as FS DAX implemented it's own
reference counting scheme.

Now that FS DAX pages have their references counted the same way as
normal pages this tracking is no longer needed and can be
removed.

Almost all existing uses of pmd_devmap() are paired with a check of
pmd_trans_huge(). As pmd_trans_huge() now returns true for FS DAX pages
dropping the check in these cases doesn't change anything.

However care needs to be taken because pmd_trans_huge() also checks that
a page is not an FS DAX page. This is dealt with either by checking
!vma_is_dax() or relying on the fact that the page pointer was obtained
from a page list. This is possible because zone device pages cannot
appear in any page list due to sharing page->lru with page->pgmap.

Signed-off-by: Alistair Popple <[email protected]>
---
Documentation/mm/arch_pgtable_helpers.rst | 6 +-
arch/arm64/include/asm/pgtable.h | 24 +---
arch/powerpc/include/asm/book3s/64/pgtable.h | 42 +------
arch/powerpc/mm/book3s64/hash_pgtable.c | 3 +-
arch/powerpc/mm/book3s64/pgtable.c | 8 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 5 +-
arch/powerpc/mm/pgtable.c | 2 +-
arch/x86/include/asm/pgtable.h | 31 +----
fs/dax.c | 5 +-
fs/userfaultfd.c | 2 +-
include/linux/huge_mm.h | 10 +-
include/linux/mm.h | 7 +-
include/linux/pgtable.h | 17 +--
mm/debug_vm_pgtable.c | 51 +-------
mm/gup.c | 151 +--------------------
mm/hmm.c | 5 +-
mm/huge_memory.c | 100 +-------------
mm/khugepaged.c | 2 +-
mm/mapping_dirty_helpers.c | 4 +-
mm/memory.c | 25 +---
mm/migrate_device.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 5 +-
mm/page_vma_mapped.c | 5 +-
mm/pgtable-generic.c | 7 +-
mm/vmscan.c | 5 +-
26 files changed, 48 insertions(+), 478 deletions(-)

diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index c82e3ee..ab3238c 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -32,8 +32,6 @@ PTE Page Table Helpers
+---------------------------+--------------------------------------------------+
| pte_protnone | Tests a PROT_NONE PTE |
+---------------------------+--------------------------------------------------+
-| pte_devmap | Tests a ZONE_DEVICE mapped PTE |
-+---------------------------+--------------------------------------------------+
| pte_soft_dirty | Tests a soft dirty PTE |
+---------------------------+--------------------------------------------------+
| pte_swp_soft_dirty | Tests a soft dirty swapped PTE |
@@ -108,8 +106,6 @@ PMD Page Table Helpers
+---------------------------+--------------------------------------------------+
| pmd_protnone | Tests a PROT_NONE PMD |
+---------------------------+--------------------------------------------------+
-| pmd_devmap | Tests a ZONE_DEVICE mapped PMD |
-+---------------------------+--------------------------------------------------+
| pmd_soft_dirty | Tests a soft dirty PMD |
+---------------------------+--------------------------------------------------+
| pmd_swp_soft_dirty | Tests a soft dirty swapped PMD |
@@ -182,8 +178,6 @@ PUD Page Table Helpers
+---------------------------+--------------------------------------------------+
| pud_write | Tests a writable PUD |
+---------------------------+--------------------------------------------------+
-| pud_devmap | Tests a ZONE_DEVICE mapped PUD |
-+---------------------------+--------------------------------------------------+
| pud_mkyoung | Creates a young PUD |
+---------------------------+--------------------------------------------------+
| pud_mkold | Creates an old PUD |
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7f7d9b1..506f78f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -107,7 +107,6 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
#define pte_user(pte) (!!(pte_val(pte) & PTE_USER))
#define pte_user_exec(pte) (!(pte_val(pte) & PTE_UXN))
#define pte_cont(pte) (!!(pte_val(pte) & PTE_CONT))
-#define pte_devmap(pte) (!!(pte_val(pte) & PTE_DEVMAP))
#define pte_tagged(pte) ((pte_val(pte) & PTE_ATTRINDX_MASK) == \
PTE_ATTRINDX(MT_NORMAL_TAGGED))

@@ -256,11 +255,6 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
return __pmd(pmd_val(pmd) | PMD_SECT_CONT);
}

-static inline pte_t pte_mkdevmap(pte_t pte)
-{
- return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
-}
-
static inline void set_pte(pte_t *ptep, pte_t pte)
{
WRITE_ONCE(*ptep, pte);
@@ -506,14 +500,6 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)

#define pmd_mkhuge(pmd) (__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define pmd_devmap(pmd) pte_devmap(pmd_pte(pmd))
-#endif
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
- return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)));
-}
-
#define __pmd_to_phys(pmd) __pte_to_phys(pmd_pte(pmd))
#define __phys_to_pmd_val(phys) __phys_to_pte_val(phys)
#define pmd_pfn(pmd) ((__pmd_to_phys(pmd) & PMD_MASK) >> PAGE_SHIFT)
@@ -847,16 +833,6 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
{
return ptep_set_access_flags(vma, address, (pte_t *)pmdp, pmd_pte(entry), dirty);
}
-
-static inline int pud_devmap(pud_t pud)
-{
- return 0;
-}
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
#endif

#ifdef CONFIG_PAGE_TABLE_CHECK
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 5c497c8..51351a0 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -639,19 +639,6 @@ static inline pte_t pte_mkuser(pte_t pte)
return __pte_raw(pte_raw(pte) & cpu_to_be64(~_PAGE_PRIVILEGED));
}

-/*
- * This is potentially called with a pmd as the argument, in which case it's not
- * safe to check _PAGE_DEVMAP unless we also confirm that _PAGE_PTE is set.
- * That's because the bit we use for _PAGE_DEVMAP is not reserved for software
- * use in page directory entries (ie. non-ptes).
- */
-static inline int pte_devmap(pte_t pte)
-{
- u64 mask = cpu_to_be64(_PAGE_DEVMAP | _PAGE_PTE);
-
- return (pte_raw(pte) & mask) == mask;
-}
-
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
/* FIXME!! check whether this need to be a conditional */
@@ -1428,35 +1415,6 @@ static inline bool arch_needs_pgtable_deposit(void)
}
extern void serialize_against_pte_lookup(struct mm_struct *mm);

-
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
- if (radix_enabled())
- return radix__pmd_mkdevmap(pmd);
- return hash__pmd_mkdevmap(pmd);
-}
-
-static inline pud_t pud_mkdevmap(pud_t pud)
-{
- if (radix_enabled())
- return radix__pud_mkdevmap(pud);
- BUG();
- return pud;
-}
-
-static inline int pmd_devmap(pmd_t pmd)
-{
- return pte_devmap(pmd_pte(pmd));
-}
-
-static inline int pud_devmap(pud_t pud)
-{
- return pte_devmap(pud_pte(pud));
-}
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
index 988948d..82d3117 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -195,7 +195,7 @@ unsigned long hash__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr
unsigned long old;

#ifdef CONFIG_DEBUG_VM
- WARN_ON(!hash__pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ WARN_ON(!hash__pmd_trans_huge(*pmdp));
assert_spin_locked(pmd_lockptr(mm, pmdp));
#endif

@@ -227,7 +227,6 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres

VM_BUG_ON(address & ~HPAGE_PMD_MASK);
VM_BUG_ON(pmd_trans_huge(*pmdp));
- VM_BUG_ON(pmd_devmap(*pmdp));

pmd = *pmdp;
pmd_clear(pmdp);
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 8f8a62d..8341957 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -50,7 +50,7 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
{
int changed;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ WARN_ON(!pmd_trans_huge(*pmdp));
assert_spin_locked(pmd_lockptr(vma->vm_mm, pmdp));
#endif
changed = !pmd_same(*(pmdp), entry);
@@ -70,7 +70,6 @@ int pudp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
{
int changed;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!pud_devmap(*pudp));
assert_spin_locked(pud_lockptr(vma->vm_mm, pudp));
#endif
changed = !pud_same(*(pudp), entry);
@@ -181,7 +180,7 @@ pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
pmd_t pmd;
VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
- !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
+ || !pmd_present(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
/*
* if it not a fullmm flush, then we can possibly end up converting
@@ -199,8 +198,7 @@ pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
pud_t pud;

VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
- VM_BUG_ON((pud_present(*pudp) && !pud_devmap(*pudp)) ||
- !pud_present(*pudp));
+ VM_BUG_ON(!pud_present(*pudp));
pud = pudp_huge_get_and_clear(vma->vm_mm, addr, pudp);
/*
* if it not a fullmm flush, then we can possibly end up converting
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index c6a4ac7..b50f999 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1362,7 +1362,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long add
unsigned long old;

#ifdef CONFIG_DEBUG_VM
- WARN_ON(!radix__pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ WARN_ON(!radix__pmd_trans_huge(*pmdp));
assert_spin_locked(pmd_lockptr(mm, pmdp));
#endif

@@ -1379,7 +1379,7 @@ unsigned long radix__pud_hugepage_update(struct mm_struct *mm, unsigned long add
unsigned long old;

#ifdef CONFIG_DEBUG_VM
- WARN_ON(!pud_devmap(*pudp));
+ WARN_ON(!pud_trans_huge(*pudp));
assert_spin_locked(pud_lockptr(mm, pudp));
#endif

@@ -1397,7 +1397,6 @@ pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addre

VM_BUG_ON(address & ~HPAGE_PMD_MASK);
VM_BUG_ON(radix__pmd_trans_huge(*pmdp));
- VM_BUG_ON(pmd_devmap(*pmdp));
/*
* khugepaged calls this for normal pmd
*/
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 4d69bfb..d1f98e0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -467,7 +467,7 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
return NULL;
#endif

- if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+ if (pmd_trans_huge(pmd)) {
if (is_thp)
*is_thp = true;
ret_pte = (pte_t *)pmdp;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e02b179..9257da3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -267,7 +267,6 @@ static inline int pmd_large(pmd_t pte)
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_large */
static inline int pmd_trans_huge(pmd_t pmd)
{
return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
@@ -286,29 +285,6 @@ static inline int has_transparent_hugepage(void)
return boot_cpu_has(X86_FEATURE_PSE);
}

-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pmd_devmap(pmd_t pmd)
-{
- return !!(pmd_val(pmd) & _PAGE_DEVMAP);
-}
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static inline int pud_devmap(pud_t pud)
-{
- return !!(pud_val(pud) & _PAGE_DEVMAP);
-}
-#else
-static inline int pud_devmap(pud_t pud)
-{
- return 0;
-}
-#endif
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
-#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
@@ -968,13 +944,6 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
}

-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pte_devmap(pte_t a)
-{
- return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP;
-}
-#endif
-
#define pte_accessible pte_accessible
static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
{
diff --git a/fs/dax.c b/fs/dax.c
index a45793f..b83c668 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1694,7 +1694,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
* the PTE we need to set up. If so just return and the fault will be
* retried.
*/
- if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
+ if (pmd_trans_huge(*vmf->pmd)) {
ret = VM_FAULT_NOPAGE;
goto unlock_entry;
}
@@ -1815,8 +1815,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
* the PMD we need to set up. If so just return and the fault will be
* retried.
*/
- if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd) &&
- !pmd_devmap(*vmf->pmd)) {
+ if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd)) {
ret = 0;
goto unlock_entry;
}
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 56eaae9..33f9448 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -353,7 +353,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
goto out;

ret = false;
- if (!pmd_present(_pmd) || pmd_devmap(_pmd))
+ if (!pmd_present(_pmd) || vma_is_dax(vmf->vma))
goto out;

if (pmd_trans_huge(_pmd)) {
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index bf49efa..81b5b49 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -156,8 +156,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
- if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
- || pmd_devmap(*____pmd)) \
+ if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)) \
__split_huge_pmd(__vma, __pmd, __address, \
false, NULL); \
} while (0)
@@ -172,8 +171,7 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
#define split_huge_pud(__vma, __pud, __address) \
do { \
pud_t *____pud = (__pud); \
- if (pud_trans_huge(*____pud) \
- || pud_devmap(*____pud)) \
+ if (pud_trans_huge(*____pud)) \
__split_huge_pud(__vma, __pud, __address); \
} while (0)

@@ -196,7 +194,7 @@ static inline int is_swap_pmd(pmd_t pmd)
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
return NULL;
@@ -204,7 +202,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
struct vm_area_struct *vma)
{
- if (pud_trans_huge(*pud) || pud_devmap(*pud))
+ if (pud_trans_huge(*pud))
return __pud_trans_huge_lock(pud, vma);
else
return NULL;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f10aa62..d299b42 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2620,13 +2620,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
}
#endif

-#ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pte_devmap(pte_t pte)
-{
- return 0;
-}
-#endif
-
extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
spinlock_t **ptl);
static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c..ff7ca9d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1363,21 +1363,6 @@ static inline int pud_write(pud_t pud)
}
#endif /* pud_write */

-#if !defined(CONFIG_ARCH_HAS_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline int pmd_devmap(pmd_t pmd)
-{
- return 0;
-}
-static inline int pud_devmap(pud_t pud)
-{
- return 0;
-}
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
-#endif
-
#if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \
!defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
static inline int pud_trans_huge(pud_t pud)
@@ -1392,7 +1377,7 @@ static inline int pud_trans_unstable(pud_t *pud)
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
pud_t pudval = READ_ONCE(*pud);

- if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
+ if (pud_none(pudval) || pud_trans_huge(pudval))
return 1;
if (unlikely(pud_bad(pudval))) {
pud_clear_bad(pud);
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 48e329e..edb9fcb 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -705,53 +705,6 @@ static void __init pmd_protnone_tests(struct pgtable_debug_args *args)
static void __init pmd_protnone_tests(struct pgtable_debug_args *args) { }
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static void __init pte_devmap_tests(struct pgtable_debug_args *args)
-{
- pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
-
- pr_debug("Validating PTE devmap\n");
- WARN_ON(!pte_devmap(pte_mkdevmap(pte)));
-}
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args)
-{
- pmd_t pmd;
-
- if (!has_transparent_hugepage())
- return;
-
- pr_debug("Validating PMD devmap\n");
- pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot);
- WARN_ON(!pmd_devmap(pmd_mkdevmap(pmd)));
-}
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static void __init pud_devmap_tests(struct pgtable_debug_args *args)
-{
- pud_t pud;
-
- if (!has_transparent_pud_hugepage())
- return;
-
- pr_debug("Validating PUD devmap\n");
- pud = pfn_pud(args->fixed_pud_pfn, args->page_prot);
- WARN_ON(!pud_devmap(pud_mkdevmap(pud)));
-}
-#else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#else
-static void __init pte_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */
-
static void __init pte_soft_dirty_tests(struct pgtable_debug_args *args)
{
pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
@@ -1352,10 +1305,6 @@ static int __init debug_vm_pgtable(void)
pte_protnone_tests(&args);
pmd_protnone_tests(&args);

- pte_devmap_tests(&args);
- pmd_devmap_tests(&args);
- pud_devmap_tests(&args);
-
pte_soft_dirty_tests(&args);
pmd_soft_dirty_tests(&args);
pte_swap_soft_dirty_tests(&args);
diff --git a/mm/gup.c b/mm/gup.c
index 6a3141d..8c3c7d3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -600,8 +600,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = vm_normal_page(vma, address, pte);

/*
- * We only care about anon pages in can_follow_write_pte() and don't
- * have to worry about pte_devmap() because they are never anon.
+ * We only care about anon pages in can_follow_write_pte().
*/
if ((flags & FOLL_WRITE) &&
!can_follow_write_pte(pte, page, vma, flags)) {
@@ -609,18 +608,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
goto out;
}

- if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
- /*
- * Only return device mapping pages in the FOLL_GET or FOLL_PIN
- * case since they are only valid while holding the pgmap
- * reference.
- */
- *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
- if (*pgmap)
- page = pte_page(pte);
- else
- goto no_page;
- } else if (unlikely(!page)) {
+ if (unlikely(!page)) {
if (flags & FOLL_DUMP) {
/* Avoid special (like zero) pages in core dumps */
page = ERR_PTR(-EFAULT);
@@ -701,13 +689,6 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return no_page_table(vma, flags);
if (!pmd_present(pmdval))
return no_page_table(vma, flags);
- if (pmd_devmap(pmdval)) {
- ptl = pmd_lock(mm, pmd);
- page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
- spin_unlock(ptl);
- if (page)
- return page;
- }
if (likely(!pmd_trans_huge(pmdval)))
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

@@ -742,20 +723,10 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
struct follow_page_context *ctx)
{
pud_t *pud;
- spinlock_t *ptl;
- struct page *page;
- struct mm_struct *mm = vma->vm_mm;

pud = pud_offset(p4dp, address);
if (pud_none(*pud))
return no_page_table(vma, flags);
- if (pud_devmap(*pud)) {
- ptl = pud_lock(mm, pud);
- page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
- spin_unlock(ptl);
- if (page)
- return page;
- }
if (unlikely(pud_bad(*pud)))
return no_page_table(vma, flags);

@@ -2554,7 +2525,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
struct page **pages, int *nr)
{
struct dev_pagemap *pgmap = NULL;
- int nr_start = *nr, ret = 0;
+ int ret = 0;
pte_t *ptep, *ptem;

ptem = ptep = pte_offset_map(&pmd, addr);
@@ -2578,16 +2549,7 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
if (!pte_access_permitted(pte, flags & FOLL_WRITE))
goto pte_unmap;

- if (pte_devmap(pte)) {
- if (unlikely(flags & FOLL_LONGTERM))
- goto pte_unmap;
-
- pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
- if (unlikely(!pgmap)) {
- undo_dev_pagemap(nr, nr_start, flags, pages);
- goto pte_unmap;
- }
- } else if (pte_special(pte))
+ if (pte_special(pte))
goto pte_unmap;

VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
@@ -2663,90 +2625,6 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
}
#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */

-#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static int __gup_device_huge(unsigned long pfn, unsigned long addr,
- unsigned long end, unsigned int flags,
- struct page **pages, int *nr)
-{
- int nr_start = *nr;
- struct dev_pagemap *pgmap = NULL;
-
- do {
- struct page *page = pfn_to_page(pfn);
-
- pgmap = get_dev_pagemap(pfn, pgmap);
- if (unlikely(!pgmap)) {
- undo_dev_pagemap(nr, nr_start, flags, pages);
- break;
- }
-
- SetPageReferenced(page);
- pages[*nr] = page;
- if (unlikely(try_grab_page(page, flags))) {
- undo_dev_pagemap(nr, nr_start, flags, pages);
- break;
- }
- (*nr)++;
- pfn++;
- } while (addr += PAGE_SIZE, addr != end);
-
- put_dev_pagemap(pgmap);
- return addr == end;
-}
-
-static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
- unsigned long end, unsigned int flags,
- struct page **pages, int *nr)
-{
- unsigned long fault_pfn;
- int nr_start = *nr;
-
- fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
- return 0;
-
- if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
- undo_dev_pagemap(nr, nr_start, flags, pages);
- return 0;
- }
- return 1;
-}
-
-static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
- unsigned long end, unsigned int flags,
- struct page **pages, int *nr)
-{
- unsigned long fault_pfn;
- int nr_start = *nr;
-
- fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
- if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
- return 0;
-
- if (unlikely(pud_val(orig) != pud_val(*pudp))) {
- undo_dev_pagemap(nr, nr_start, flags, pages);
- return 0;
- }
- return 1;
-}
-#else
-static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
- unsigned long end, unsigned int flags,
- struct page **pages, int *nr)
-{
- BUILD_BUG();
- return 0;
-}
-
-static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
- unsigned long end, unsigned int flags,
- struct page **pages, int *nr)
-{
- BUILD_BUG();
- return 0;
-}
-#endif
-
static int record_subpages(struct page *page, unsigned long addr,
unsigned long end, struct page **pages)
{
@@ -2852,13 +2730,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
return 0;

- if (pmd_devmap(orig)) {
- if (unlikely(flags & FOLL_LONGTERM))
- return 0;
- return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
- pages, nr);
- }
-
page = nth_page(pmd_page(orig), (addr & ~PMD_MASK) >> PAGE_SHIFT);
refs = record_subpages(page, addr, end, pages + *nr);

@@ -2896,13 +2767,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
if (!pud_access_permitted(orig, flags & FOLL_WRITE))
return 0;

- if (pud_devmap(orig)) {
- if (unlikely(flags & FOLL_LONGTERM))
- return 0;
- return __gup_device_huge_pud(orig, pudp, addr, end, flags,
- pages, nr);
- }
-
page = nth_page(pud_page(orig), (addr & ~PUD_MASK) >> PAGE_SHIFT);
refs = record_subpages(page, addr, end, pages + *nr);

@@ -2941,8 +2805,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
if (!pgd_access_permitted(orig, flags & FOLL_WRITE))
return 0;

- BUILD_BUG_ON(pgd_devmap(orig));
-
page = nth_page(pgd_page(orig), (addr & ~PGDIR_MASK) >> PAGE_SHIFT);
refs = record_subpages(page, addr, end, pages + *nr);

@@ -2984,8 +2846,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
if (!pmd_present(pmd))
return 0;

- if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
- pmd_devmap(pmd))) {
+ if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
/* See gup_pte_range() */
if (pmd_protnone(pmd))
return 0;
@@ -3022,7 +2883,7 @@ static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned lo
next = pud_addr_end(addr, end);
if (unlikely(!pud_present(pud)))
return 0;
- if (unlikely(pud_huge(pud) || pud_devmap(pud))) {
+ if (unlikely(pud_huge(pud))) {
if (!gup_huge_pud(pud, pudp, addr, next, flags,
pages, nr))
return 0;
diff --git a/mm/hmm.c b/mm/hmm.c
index a665a3c..5ad89f9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -298,7 +298,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
* fall through and treat it like a normal page.
*/
if (!vm_normal_page(walk->vma, addr, pte) &&
- !pte_devmap(pte) &&
!is_zero_pfn(pte_pfn(pte))) {
if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
pte_unmap(ptep);
@@ -351,7 +350,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
}

- if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) {
+ if (pmd_trans_huge(pmd)) {
/*
* No need to take pmd_lock here, even if some other thread
* is splitting the huge pmd we will get that event through
@@ -362,7 +361,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
* values.
*/
pmd = pmdp_get_lockless(pmdp);
- if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
+ if (!pmd_trans_huge(pmd))
goto again;

return hmm_vma_handle_pmd(walk, addr, end, hmm_pfns, pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c657886..265506a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1090,46 +1090,6 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
update_mmu_cache_pmd(vma, addr, pmd);
}

-struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
-{
- unsigned long pfn = pmd_pfn(*pmd);
- struct mm_struct *mm = vma->vm_mm;
- struct page *page;
- int ret;
-
- assert_spin_locked(pmd_lockptr(mm, pmd));
-
- if (flags & FOLL_WRITE && !pmd_write(*pmd))
- return NULL;
-
- if (pmd_present(*pmd) && pmd_devmap(*pmd))
- /* pass */;
- else
- return NULL;
-
- if (flags & FOLL_TOUCH)
- touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
-
- /*
- * device mapped pages can only be returned if the
- * caller will manage the page reference count.
- */
- if (!(flags & (FOLL_GET | FOLL_PIN)))
- return ERR_PTR(-EEXIST);
-
- pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
- *pgmap = get_dev_pagemap(pfn, *pgmap);
- if (!*pgmap)
- return ERR_PTR(-EFAULT);
- page = pfn_to_page(pfn);
- ret = try_grab_page(page, flags);
- if (ret)
- page = ERR_PTR(ret);
-
- return page;
-}
-
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
@@ -1245,49 +1205,6 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
update_mmu_cache_pud(vma, addr, pud);
}

-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
- pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
- unsigned long pfn = pud_pfn(*pud);
- struct mm_struct *mm = vma->vm_mm;
- struct page *page;
- int ret;
-
- assert_spin_locked(pud_lockptr(mm, pud));
-
- if (flags & FOLL_WRITE && !pud_write(*pud))
- return NULL;
-
- if (pud_present(*pud) && pud_devmap(*pud))
- /* pass */;
- else
- return NULL;
-
- if (flags & FOLL_TOUCH)
- touch_pud(vma, addr, pud, flags & FOLL_WRITE);
-
- /*
- * device mapped pages can only be returned if the
- * caller will manage the page reference count.
- *
- * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
- */
- if (!(flags & (FOLL_GET | FOLL_PIN)))
- return ERR_PTR(-EEXIST);
-
- pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
- *pgmap = get_dev_pagemap(pfn, *pgmap);
- if (!*pgmap)
- return ERR_PTR(-EFAULT);
- page = pfn_to_page(pfn);
-
- ret = try_grab_page(page, flags);
- if (ret)
- page = ERR_PTR(ret);
-
- return page;
-}
-
int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
struct vm_area_struct *vma)
@@ -1302,7 +1219,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,

ret = -EAGAIN;
pud = *src_pud;
- if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
+ if (unlikely(!pud_trans_huge(pud)))
goto out_unlock;

/*
@@ -2013,8 +1930,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
spinlock_t *ptl;
ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
- pmd_devmap(*pmd)))
+ if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)))
return ptl;
spin_unlock(ptl);
return NULL;
@@ -2031,7 +1947,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
spinlock_t *ptl;

ptl = pud_lock(vma->vm_mm, pud);
- if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)))
+ if (likely(pud_trans_huge(*pud)))
return ptl;
spin_unlock(ptl);
return NULL;
@@ -2065,7 +1981,7 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
- VM_BUG_ON(!pud_trans_huge(*pud) && !pud_devmap(*pud));
+ VM_BUG_ON(!pud_trans_huge(*pud));

count_vm_event(THP_SPLIT_PUD);

@@ -2083,7 +1999,7 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
(address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE);
mmu_notifier_invalidate_range_start(&range);
ptl = pud_lock(vma->vm_mm, pud);
- if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud)))
+ if (unlikely(!pud_trans_huge(*pud)))
goto out;
__split_huge_pud_locked(vma, pud, range.start);

@@ -2150,8 +2066,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
- VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
- && !pmd_devmap(*pmd));
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));

count_vm_event(THP_SPLIT_PMD);

@@ -2354,8 +2269,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
VM_BUG_ON(freeze && !folio);
VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));

- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
- is_pmd_migration_entry(*pmd)) {
+ if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd)) {
/*
* It's safe to call pmd_page when folio is set because it's
* guaranteed that pmd is present.
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b10db15..7254ad4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -955,8 +955,6 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
return SCAN_PMD_NULL;
if (pmd_trans_huge(pmde))
return SCAN_PMD_MAPPED;
- if (WARN_ON_ONCE(pmd_devmap(pmde)))
- return SCAN_PMD_NULL;
if (pmd_bad(pmde))
return SCAN_PMD_NULL;
return SCAN_SUCCEED;
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 2f8829b..208b428 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -129,7 +129,7 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
pmd_t pmdval = pmdp_get_lockless(pmd);

/* Do not split a huge pmd, present or migrated */
- if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
+ if (pmd_trans_huge(pmdval)) {
WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
walk->action = ACTION_CONTINUE;
}
@@ -152,7 +152,7 @@ static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
pud_t pudval = READ_ONCE(*pud);

/* Do not split a huge pud */
- if (pud_trans_huge(pudval) || pud_devmap(pudval)) {
+ if (pud_trans_huge(pudval)) {
WARN_ON(pud_write(pudval) || pud_dirty(pudval));
walk->action = ACTION_CONTINUE;
}
diff --git a/mm/memory.c b/mm/memory.c
index 418b630..6b0a2d1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -592,16 +592,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
return NULL;
if (is_zero_pfn(pfn))
return NULL;
- if (pte_devmap(pte))
- /*
- * NOTE: New users of ZONE_DEVICE will not set pte_devmap()
- * and will have refcounts incremented on their struct pages
- * when they are inserted into PTEs, thus they are safe to
- * return here. Legacy ZONE_DEVICE pages that set pte_devmap()
- * do not have refcounts. Example of legacy ZONE_DEVICE is
- * MEMORY_DEVICE_FS_DAX type in pmem or virtio_fs drivers.
- */
- return NULL;

print_bad_pte(vma, addr, pte, NULL);
return NULL;
@@ -677,8 +667,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
}
}

- if (pmd_devmap(pmd))
- return NULL;
if (is_huge_zero_pmd(pmd))
return NULL;
if (unlikely(pfn > highest_memmap_pfn))
@@ -1150,8 +1138,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
- || pmd_devmap(*src_pmd)) {
+ if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1187,7 +1174,7 @@ copy_pud_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
src_pud = pud_offset(src_p4d, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
+ if (pud_trans_huge(*src_pud)) {
int err;

VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, src_vma);
@@ -1547,7 +1534,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
__split_huge_pmd(vma, pmd, addr, false, NULL);
else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
@@ -1589,7 +1576,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
+ if (pud_trans_huge(*pud)) {
if (next - addr != HPAGE_PUD_SIZE) {
mmap_assert_locked(tlb->mm);
split_huge_pud(vma, pud, addr);
@@ -5129,7 +5116,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
pud_t orig_pud = *vmf.pud;

barrier();
- if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
+ if (pud_trans_huge(orig_pud)) {

/*
* TODO once we support anonymous PUDs: NUMA case and
@@ -5169,7 +5156,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
}
- if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
+ if (pmd_trans_huge(vmf.orig_pmd)) {
if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf);

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 1e1c82f..a3659c0 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -590,7 +590,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pmdp = pmd_alloc(mm, pudp, addr);
if (!pmdp)
goto abort;
- if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
+ if (pmd_trans_huge(*pmdp))
goto abort;
if (pte_alloc(mm, pmdp))
goto abort;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b94fbb4..a83e9f4 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -387,7 +387,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
}

_pmd = pmdp_get_lockless(pmd);
- if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
+ if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd)) {
if ((next - addr != HPAGE_PMD_SIZE) ||
pgtable_split_needed(vma, cp_flags)) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
diff --git a/mm/mremap.c b/mm/mremap.c
index 382e81c..78ed214 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -527,7 +527,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
if (!new_pud)
break;
- if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
+ if (pud_trans_huge(*old_pud)) {
if (extent == HPAGE_PUD_SIZE) {
move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
old_pud, new_pud, need_rmap_locks);
@@ -549,8 +549,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (!new_pmd)
break;
again:
- if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
- pmd_devmap(*old_pmd)) {
+ if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE &&
move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
old_pmd, new_pmd, need_rmap_locks))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e0b368e..c519dbf 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -235,8 +235,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
*/
pmde = pmdp_get_lockless(pvmw->pmd);

- if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
- (pmd_present(pmde) && pmd_devmap(pmde))) {
+ if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
pmde = *pvmw->pmd;
if (!pmd_present(pmde)) {
@@ -251,7 +250,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return not_found(pvmw);
return true;
}
- if (likely(pmd_trans_huge(pmde) || pmd_devmap(pmde))) {
+ if (likely(pmd_trans_huge(pmde))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
if (!check_pmd(pmd_pfn(pmde), pvmw))
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4fcd959..ab24a0c 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -139,8 +139,7 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pmd_t pmd;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
- !pmd_devmap(*pmdp));
+ VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return pmd;
@@ -153,7 +152,7 @@ pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
pud_t pud;

VM_BUG_ON(address & ~HPAGE_PUD_MASK);
- VM_BUG_ON(!pud_trans_huge(*pudp) && !pud_devmap(*pudp));
+ VM_BUG_ON(!pud_trans_huge(*pudp));
pud = pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
return pud;
@@ -291,7 +290,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
goto nomap;
- if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+ if (unlikely(pmd_trans_huge(pmdval)))
goto nomap;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394..09064c2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3940,7 +3940,7 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
if (!pte_present(pte) || is_zero_pfn(pfn))
return -1;

- if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte)))
+ if (WARN_ON_ONCE(pte_special(pte)))
return -1;

if (WARN_ON_ONCE(!pfn_valid(pfn)))
@@ -3959,9 +3959,6 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
if (!pmd_present(pmd) || is_huge_zero_pmd(pmd))
return -1;

- if (WARN_ON_ONCE(pmd_devmap(pmd)))
- return -1;
-
if (WARN_ON_ONCE(!pfn_valid(pfn)))
return -1;

--
git-series 0.9.1

2024-04-11 12:29:50

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [RFC 03/10] pci/p2pdma: Don't initialise page refcount to one

On Thu, Apr 11, 2024 at 10:57:24AM +1000, Alistair Popple wrote:
> The reference counts for ZONE_DEVICE private pages should be
> initialised by the driver when the page is actually allocated by the
> driver allocator, not when they are first created. This is currently
> the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
> but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.
>
> Signed-off-by: Alistair Popple <[email protected]>
> ---
> drivers/pci/p2pdma.c | 2 ++
> mm/memremap.c | 8 ++++----
> mm/mm_init.c | 4 +++-
> 3 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index fa7370f..ab7ef18 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -128,6 +128,8 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> goto out;
> }
>
> + get_page(virt_to_page(kaddr));
> +

Should this be

set_page_count(page, 1)

If the refcount is already known to be 0 ?

> @@ -508,15 +508,15 @@ void free_zone_device_page(struct page *page)
> page->mapping = NULL;
> page->pgmap->ops->page_free(page);
>
> - if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
> - page->pgmap->type != MEMORY_DEVICE_COHERENT)
> + if (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
> + page->pgmap->type == MEMORY_DEVICE_COHERENT)
> + put_dev_pagemap(page->pgmap);

Not related, but we should really be getting rid of this devmap
refcount traffic too, IMHO..

If an implementation wants this then it should hook the page
free/alloc callbacks and do this, not put it in the core code.

Jason

2024-04-11 12:32:20

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [RFC 07/10] mm: Allow compound zone device pages

On Thu, Apr 11, 2024 at 10:57:28AM +1000, Alistair Popple wrote:
> Zone device pages are used to represent various type of device memory
> managed by device drivers. Currently compound zone device pages are
> not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
> user of higher order zone device pages and have their own page
> reference counting.
>
> A future change will unify FS DAX reference counting with normal page
> reference counting rules and remove the special FS DAX reference
> counting. Supporting that requires compound zone device pages.
>
> Supporting compound zone device pages requires compound_head() to
> distinguish between head and tail pages whilst still preserving the
> special struct page fields that are specific to zone device pages.
>
> A tail page is distinguished by having bit zero being set in
> page->compound_head, with the remaining bits pointing to the head
> page. For zone device pages page->compound_head is shared with
> page->pgmap.
>
> The page->pgmap field is common to all pages within a memory section.
> Therefore pgmap is the same for both head and tail pages and we can
> use the same scheme to distinguish tail pages. To obtain the pgmap for
> a tail page a new accessor is introduced to fetch it from
> compound_head.
>
> Signed-off-by: Alistair Popple <[email protected]>
> ---
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
> drivers/pci/p2pdma.c | 2 +-
> include/linux/memremap.h | 12 +++++++++---
> include/linux/migrate.h | 2 +-
> lib/test_hmm.c | 2 +-
> mm/hmm.c | 2 +-
> mm/memory.c | 2 +-
> mm/memremap.c | 6 +++---
> mm/migrate_device.c | 4 ++--
> 9 files changed, 20 insertions(+), 14 deletions(-)

Makes sense to me

Reviewed-by: Jason Gunthorpe <[email protected]>

Jason

2024-04-11 12:42:08

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [RFC 02/10] mm/hmm: Remove dead check for HugeTLB and FS DAX

On Thu, Apr 11, 2024 at 10:57:23AM +1000, Alistair Popple wrote:
> pud_huge() returns true only for a HugeTLB page. pud_devmap() is only
> used by FS DAX pages. These two things are mutually exclusive so this
> code is dead code and can be removed.

I'm not sure this is true.. pud_huge() is mostly a misspelling of pud_leaf()..

> - if (pud_huge(pud) && pud_devmap(pud)) {

I suspect this should be written as:

if (pud_leaf(pud) && pud_devmap(pud)) {

In line with Peter's work here:

https://lore.kernel.org/linux-mm/[email protected]/

Jason

2024-04-11 12:57:42

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [RFC 10/10] mm: Remove pXX_devmap

On Thu, Apr 11, 2024 at 10:57:31AM +1000, Alistair Popple wrote:
> The devmap PTE special bit was used to detect mappings of FS DAX
> pages. This tracking was required to ensure the generic mm did not
> manipulate the page reference counts as FS DAX implemented it's own
> reference counting scheme.
>
> Now that FS DAX pages have their references counted the same way as
> normal pages this tracking is no longer needed and can be
> removed.
>
> Almost all existing uses of pmd_devmap() are paired with a check of
> pmd_trans_huge(). As pmd_trans_huge() now returns true for FS DAX pages
> dropping the check in these cases doesn't change anything.
>
> However care needs to be taken because pmd_trans_huge() also checks that
> a page is not an FS DAX page. This is dealt with either by checking
> !vma_is_dax() or relying on the fact that the page pointer was obtained
> from a page list. This is possible because zone device pages cannot
> appear in any page list due to sharing page->lru with page->pgmap.
>
> Signed-off-by: Alistair Popple <[email protected]>
> ---
> Documentation/mm/arch_pgtable_helpers.rst | 6 +-
> arch/arm64/include/asm/pgtable.h | 24 +---
> arch/powerpc/include/asm/book3s/64/pgtable.h | 42 +------
> arch/powerpc/mm/book3s64/hash_pgtable.c | 3 +-
> arch/powerpc/mm/book3s64/pgtable.c | 8 +-
> arch/powerpc/mm/book3s64/radix_pgtable.c | 5 +-
> arch/powerpc/mm/pgtable.c | 2 +-
> arch/x86/include/asm/pgtable.h | 31 +----
> fs/dax.c | 5 +-
> fs/userfaultfd.c | 2 +-
> include/linux/huge_mm.h | 10 +-
> include/linux/mm.h | 7 +-
> include/linux/pgtable.h | 17 +--
> mm/debug_vm_pgtable.c | 51 +-------
> mm/gup.c | 151 +--------------------
> mm/hmm.c | 5 +-
> mm/huge_memory.c | 100 +-------------
> mm/khugepaged.c | 2 +-
> mm/mapping_dirty_helpers.c | 4 +-
> mm/memory.c | 25 +---
> mm/migrate_device.c | 2 +-
> mm/mprotect.c | 2 +-
> mm/mremap.c | 5 +-
> mm/page_vma_mapped.c | 5 +-
> mm/pgtable-generic.c | 7 +-
> mm/vmscan.c | 5 +-
> 26 files changed, 48 insertions(+), 478 deletions(-)

This is great

I suggest splitting it up into two parts, remove the functional code
in the core mm calling these functions testing and setting pXX_devmap
(and maybe this needs several patches for clarity) and then remove all
the remaining dead code once all the callers are gone.

Jason

2024-04-11 13:01:36

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [RFC 01/10] mm/gup.c: Remove redundant check for PCI P2PDMA page

On Thu, Apr 11, 2024 at 10:57:22AM +1000, Alistair Popple wrote:
> PCI P2PDMA pages are not mapped with pXX_devmap PTEs therefore the
> check in __gup_device_huge() is redundant. Remove it
>
> Signed-off-by: Alistair Popple <[email protected]>
> ---
> mm/gup.c | 5 -----
> 1 file changed, 5 deletions(-)

Reviewed-by: Jason Gunthorpe <[email protected]>

Jason

2024-04-11 13:38:01

[permalink] [raw]

Subject: Re: [RFC 02/10] mm/hmm: Remove dead check for HugeTLB and FS DAX

On Thu, Apr 11, 2024 at 09:25:30AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 11, 2024 at 10:57:23AM +1000, Alistair Popple wrote:
> > pud_huge() returns true only for a HugeTLB page. pud_devmap() is only
> > used by FS DAX pages. These two things are mutually exclusive so this
> > code is dead code and can be removed.
>
> I'm not sure this is true.. pud_huge() is mostly a misspelling of pud_leaf()..
>
> > - if (pud_huge(pud) && pud_devmap(pud)) {
>
> I suspect this should be written as:
>
> if (pud_leaf(pud) && pud_devmap(pud)) {
>
> In line with Peter's work here:
>
> https://lore.kernel.org/linux-mm/[email protected]/

Just to provide more information for Alistair, this patch already switched
that over to a _leaf():

https://lore.kernel.org/r/[email protected]

That's in mm-unstable now, so should see that in a rebase.

And btw it's great to see that pxx_devmap() can go away.

Thanks,

--
Peter Xu

2024-04-11 13:49:20

by David Hildenbrand

[permalink] [raw]

Subject: Re: [RFC 09/10] mm/khugepage.c: Warn if trying to scan devmap pmd

On 11.04.24 02:57, Alistair Popple wrote:
> The only user of devmap PTEs is FS DAX, and khugepaged should not be
> scanning these VMAs. This is checked by calling
> hugepage_vma_check. Therefore khugepaged should never encounter a
> devmap PTE. Warn if this occurs.
>
> Signed-off-by: Alistair Popple <[email protected]>
>
> ---
>
> Note this is a transitory patch to test the above assumption both at
> runtime and during review. I will likely remove it as the whole thing
> gets deleted when pXX_devmap is removed.

Yes, doesn't make sense for this patch to exist if it would go upstream
along with the next patch that removes that completely.

--
Cheers,

David / dhildenb

2024-04-11 14:10:26

by David Hildenbrand

[permalink] [raw]

Subject: Re: [RFC 01/10] mm/gup.c: Remove redundant check for PCI P2PDMA page

On 11.04.24 02:57, Alistair Popple wrote:
> PCI P2PDMA pages are not mapped with pXX_devmap PTEs therefore the
> check in __gup_device_huge() is redundant. Remove it
>
> Signed-off-by: Alistair Popple <[email protected]>
> ---
> mm/gup.c | 5 -----
> 1 file changed, 5 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 2f8a2d8..a9c8a09 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2683,11 +2683,6 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> break;
> }
>
> - if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
> - undo_dev_pagemap(nr, nr_start, flags, pages);
> - break;
> - }
> -
> SetPageReferenced(page);
> pages[*nr] = page;
> if (unlikely(try_grab_page(page, flags))) {

Rebasing on mm-unstable, you'll notice some minor conflicts, but nothing
earth shattering :)

Acked-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb

2024-04-11 15:29:08

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [RFC 07/10] mm: Allow compound zone device pages

On Thu, Apr 11, 2024 at 10:57:28AM +1000, Alistair Popple wrote:
> Supporting compound zone device pages requires compound_head() to
> distinguish between head and tail pages whilst still preserving the
> special struct page fields that are specific to zone device pages.
>
> A tail page is distinguished by having bit zero being set in
> page->compound_head, with the remaining bits pointing to the head
> page. For zone device pages page->compound_head is shared with
> page->pgmap.
>
> The page->pgmap field is common to all pages within a memory section.
> Therefore pgmap is the same for both head and tail pages and we can
> use the same scheme to distinguish tail pages. To obtain the pgmap for
> a tail page a new accessor is introduced to fetch it from
> compound_head.

Would it make sense at this point to move pgmap and zone_device_data
from struct page to struct folio? That will make any forgotten
places fail to compile instead of getting a bogus value.

2024-04-11 18:22:59

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [RFC 00/10] fs/dax: Fix FS DAX page reference counts

On Thu, Apr 11, 2024 at 10:28:56AM -0700, Dan Williams wrote:
> Alistair Popple wrote:
> > FS DAX pages have always maintained their own page reference counts
> > without following the normal rules for page reference counting. In
> > particular pages are considered free when the refcount hits one rather
> > than zero and refcounts are not added when mapping the page.
>
> > Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> > mechanism for allowing GUP to hold references on the page (see
> > get_dev_pagemap). However there doesn't seem to be any reason why FS
> > DAX pages need their own reference counting scheme.
>
> This is fair. However, for anyone coming in fresh to this situation
> maybe some more "how we get here" history helps. That longer story is
> here:
>
> http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/

This never got merged? :(

Jason

2024-04-11 18:27:33

by Dan Williams

[permalink] [raw]

Subject: Re: [RFC 00/10] fs/dax: Fix FS DAX page reference counts

Alistair Popple wrote:
> FS DAX pages have always maintained their own page reference counts
> without following the normal rules for page reference counting. In
> particular pages are considered free when the refcount hits one rather
> than zero and refcounts are not added when mapping the page.

> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> mechanism for allowing GUP to hold references on the page (see
> get_dev_pagemap). However there doesn't seem to be any reason why FS
> DAX pages need their own reference counting scheme.

This is fair. However, for anyone coming in fresh to this situation
maybe some more "how we get here" history helps. That longer story is
here:

http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/

> This RFC is an initial attempt at removing the special reference
> counting and instead refcount FS DAX pages the same as normal pages.
>
> There are still a couple of rough edges - in particular I haven't
> completely removed the devmap PTE bit references from arch specific
> code and there is probably some more cleanup of dev_pagemap reference
> counting that could be done, particular in mm/gup.c. I also haven't
> yet compiled on anything other than x86_64.
>
> Before continuing further with this clean-up though I would appreciate
> some feedback on the viability of this approach and any issues I may
> have overlooked, as I am not intimately familiar with FS DAX code (or
> for that matter the FS layer in general).
>
> I have of course run some basic testing which didn't reveal any
> problems.

FWIW I see the following with the ndctl/dax test-suite (double-checked
that vanilla v6.6 passes). I will take a look at the patches, but in the
meantime...

# meson test -C build --suite ndctl:dax
ninja: no work to do.
ninja: Entering directory `/root/git/ndctl/build'
[1/70] Generating version.h with a custom command
1/13 ndctl:dax / daxdev-errors.sh OK 14.46s
2/13 ndctl:dax / multi-dax.sh OK 2.70s
3/13 ndctl:dax / sub-section.sh OK 7.21s
4/13 ndctl:dax / dax-dev OK 0.08s
[5/13] ???? ndctl:dax / dax-ext4.sh 0/600s

...that last test crashed with:

EXT4-fs (pmem0): mounted filesystem 2adea02a-a791-4714-be40-125afd16634b r/w with ordered
ota mode: none.
page:ffffea0005f00000 refcount:0 mapcount:0 mapping:ffff8882a8a6be10 index:0x5800 pfn:0x1

head:ffffea0005f00000 order:9 entire_mapcount:0 nr_pages_mapped:0 pincount:0
aops:ext4_dax_aops ino:c dentry name:"image"
flags: 0x4ffff800004040(reserved|head|node=0|zone=4|lastcpupid=0x1ffff)
page_type: 0xffffffff()
raw: 004ffff800004040 ffff888202681520 0000000000000000 ffff8882a8a6be10
raw: 0000000000005800 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127

------------[ cut here ]------------
kernel BUG at include/linux/mm.h:1419!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 1415 Comm: dax-pmd Tainted: G OE N 6.6.0+ #209
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
RIP: 0010:dax_insert_pfn_pmd+0x41c/0x430
Code: 89 c1 41 b8 01 00 00 00 48 89 ea 4c 89 e6 4c 89 f7 e8 18 8a c7 ff e9 e0 fc ff ff 48
c b3 48 89 c7 e8 a4 53 f7 ff <0f> 0b e8 0d ba a8 00 48 8b 15 86 8a 62 01 e9 89 fc ff ff 90

RSP: 0000:ffffc90001d57b68 EFLAGS: 00010246
RAX: 000000000000005c RBX: ffffea0005f00000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffffb3749a15 RDI: 00000000ffffffff
RBP: ffff8882982c07e0 R08: 00000000ffffdfff R09: 0000000000000001
R10: 00000000ffffdfff R11: ffffffffb3a771c0 R12: 800000017c0008e7
R13: 8000000000000025 R14: ffff888202a395f8 R15: ffffea0005f00000
FS: 00007fdaa00e3d80(0000) GS:ffff888477000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fda9f800000 CR3: 0000000296224000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
? die+0x32/0x80
? do_trap+0xd6/0x100
? dax_insert_pfn_pmd+0x41c/0x430
? dax_insert_pfn_pmd+0x41c/0x430
? do_error_trap+0x81/0x110
? dax_insert_pfn_pmd+0x41c/0x430
? exc_invalid_op+0x4c/0x60
? dax_insert_pfn_pmd+0x41c/0x430
? asm_exc_invalid_op+0x16/0x20
? dax_insert_pfn_pmd+0x41c/0x430
? dax_insert_pfn_pmd+0x41c/0x430
dax_fault_iter+0x5d0/0x700
dax_iomap_pmd_fault+0x212/0x450
ext4_dax_huge_fault+0x1dc/0x470
__handle_mm_fault+0x808/0x13e0
handle_mm_fault+0x178/0x3e0
do_user_addr_fault+0x186/0x830
exc_page_fault+0x6f/0x1d0
asm_exc_page_fault+0x22/0x30
RIP: 0033:0x7fdaa072d009

2024-04-11 18:34:20

by Dan Williams

[permalink] [raw]

Subject: Re: [RFC 00/10] fs/dax: Fix FS DAX page reference counts

Jason Gunthorpe wrote:
> On Thu, Apr 11, 2024 at 10:28:56AM -0700, Dan Williams wrote:
> > Alistair Popple wrote:
> > > FS DAX pages have always maintained their own page reference counts
> > > without following the normal rules for page reference counting. In
> > > particular pages are considered free when the refcount hits one rather
> > > than zero and refcounts are not added when mapping the page.
> >
> > > Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> > > mechanism for allowing GUP to hold references on the page (see
> > > get_dev_pagemap). However there doesn't seem to be any reason why FS
> > > DAX pages need their own reference counting scheme.
> >
> > This is fair. However, for anyone coming in fresh to this situation
> > maybe some more "how we get here" history helps. That longer story is
> > here:
> >
> > http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/
>
> This never got merged? :(

Yeah... I got hung up on the "allocation" API to take ZONE_DEVICE pages
from recfount 0 to 1 and then never circled back.

2024-04-12 01:34:13

by Alistair Popple

[permalink] [raw]

Subject: Re: [RFC 02/10] mm/hmm: Remove dead check for HugeTLB and FS DAX

Peter Xu <[email protected]> writes:

> On Thu, Apr 11, 2024 at 09:25:30AM -0300, Jason Gunthorpe wrote:
>> On Thu, Apr 11, 2024 at 10:57:23AM +1000, Alistair Popple wrote:
>> > pud_huge() returns true only for a HugeTLB page. pud_devmap() is only
>> > used by FS DAX pages. These two things are mutually exclusive so this
>> > code is dead code and can be removed.
>>
>> I'm not sure this is true.. pud_huge() is mostly a misspelling of pud_leaf()..
>>
>> > - if (pud_huge(pud) && pud_devmap(pud)) {
>>
>> I suspect this should be written as:
>>
>> if (pud_leaf(pud) && pud_devmap(pud)) {

Oh that makes a lot more sense. I'd taken the comment for pud_huge() at
face value (ie. that it's a hugetlbfs page) without digging further.

>> In line with Peter's work here:
>>
>> https://lore.kernel.org/linux-mm/[email protected]/
>
> Just to provide more information for Alistair, this patch already switched
> that over to a _leaf():

Got it, thanks (and apologies for missing my Cc on that).

> https://lore.kernel.org/r/[email protected]
>
> That's in mm-unstable now, so should see that in a rebase.
>
> And btw it's great to see that pxx_devmap() can go away.

Yep, AFAICT pxx_devmap only exists to do this special FS DAX
refcounting. Once that is fixed it can go away.

> Thanks,

2024-04-12 01:36:39

by Alistair Popple

[permalink] [raw]

Subject: Re: [RFC 09/10] mm/khugepage.c: Warn if trying to scan devmap pmd

David Hildenbrand <[email protected]> writes:

> On 11.04.24 02:57, Alistair Popple wrote:
>> The only user of devmap PTEs is FS DAX, and khugepaged should not be
>> scanning these VMAs. This is checked by calling
>> hugepage_vma_check. Therefore khugepaged should never encounter a
>> devmap PTE. Warn if this occurs.
>> Signed-off-by: Alistair Popple <[email protected]>
>> ---
>> Note this is a transitory patch to test the above assumption both at
>> runtime and during review. I will likely remove it as the whole thing
>> gets deleted when pXX_devmap is removed.
>
> Yes, doesn't make sense for this patch to exist if it would go
> upstream along with the next patch that removes that completely.

Yep. I'd had it to sanity check my own understanding and figured I'd
leave it in the RFC series to see if it provoked a reaction from anyone.

Will drop it in the next version.

2024-04-12 01:38:20

by Alistair Popple

[permalink] [raw]

Subject: Re: [RFC 01/10] mm/gup.c: Remove redundant check for PCI P2PDMA page

David Hildenbrand <[email protected]> writes:

> On 11.04.24 02:57, Alistair Popple wrote:
>> PCI P2PDMA pages are not mapped with pXX_devmap PTEs therefore the
>> check in __gup_device_huge() is redundant. Remove it
>> Signed-off-by: Alistair Popple <[email protected]>
>> ---
>> mm/gup.c | 5 -----
>> 1 file changed, 5 deletions(-)
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 2f8a2d8..a9c8a09 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2683,11 +2683,6 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>> break;
>> }
>> - if (!(flags & FOLL_PCI_P2PDMA) &&
>> is_pci_p2pdma_page(page)) {
>> - undo_dev_pagemap(nr, nr_start, flags, pages);
>> - break;
>> - }
>> -
>> SetPageReferenced(page);
>> pages[*nr] = page;
>> if (unlikely(try_grab_page(page, flags))) {
>
> Rebasing on mm-unstable, you'll notice some minor conflicts, but
> nothing earth shattering :)

Thanks. Rebasing was the other thing I meant to add as a TODO in the
cover letter :)

> Acked-by: David Hildenbrand <[email protected]>

2024-04-12 01:40:26

by Alistair Popple

[permalink] [raw]

Subject: Re: [RFC 07/10] mm: Allow compound zone device pages

Matthew Wilcox <[email protected]> writes:

> On Thu, Apr 11, 2024 at 10:57:28AM +1000, Alistair Popple wrote:
>> Supporting compound zone device pages requires compound_head() to
>> distinguish between head and tail pages whilst still preserving the
>> special struct page fields that are specific to zone device pages.
>>
>> A tail page is distinguished by having bit zero being set in
>> page->compound_head, with the remaining bits pointing to the head
>> page. For zone device pages page->compound_head is shared with
>> page->pgmap.
>>
>> The page->pgmap field is common to all pages within a memory section.
>> Therefore pgmap is the same for both head and tail pages and we can
>> use the same scheme to distinguish tail pages. To obtain the pgmap for
>> a tail page a new accessor is introduced to fetch it from
>> compound_head.
>
> Would it make sense at this point to move pgmap and zone_device_data
> from struct page to struct folio? That will make any forgotten
> places fail to compile instead of getting a bogus value.

Thanks for the suggestion, I think that makes a lot of sense. Will try
it for the next version to see what it looks like.

2024-04-12 04:36:18

by Alistair Popple

[permalink] [raw]

Subject: Re: [RFC 00/10] fs/dax: Fix FS DAX page reference counts

Dan Williams <[email protected]> writes:

> Alistair Popple wrote:
>> FS DAX pages have always maintained their own page reference counts
>> without following the normal rules for page reference counting. In
>> particular pages are considered free when the refcount hits one rather
>> than zero and refcounts are not added when mapping the page.
>
>> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
>> mechanism for allowing GUP to hold references on the page (see
>> get_dev_pagemap). However there doesn't seem to be any reason why FS
>> DAX pages need their own reference counting scheme.
>
> This is fair. However, for anyone coming in fresh to this situation
> maybe some more "how we get here" history helps. That longer story is
> here:
>
> http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/

Good idea.

>> This RFC is an initial attempt at removing the special reference
>> counting and instead refcount FS DAX pages the same as normal pages.
>>
>> There are still a couple of rough edges - in particular I haven't
>> completely removed the devmap PTE bit references from arch specific
>> code and there is probably some more cleanup of dev_pagemap reference
>> counting that could be done, particular in mm/gup.c. I also haven't
>> yet compiled on anything other than x86_64.
>>
>> Before continuing further with this clean-up though I would appreciate
>> some feedback on the viability of this approach and any issues I may
>> have overlooked, as I am not intimately familiar with FS DAX code (or
>> for that matter the FS layer in general).
>>
>> I have of course run some basic testing which didn't reveal any
>> problems.
>
> FWIW I see the following with the ndctl/dax test-suite (double-checked
> that vanilla v6.6 passes). I will take a look at the patches, but in the
> meantime...

Hmmm...

> # meson test -C build --suite ndctl:dax
> ninja: no work to do.
> ninja: Entering directory `/root/git/ndctl/build'
> [1/70] Generating version.h with a custom command
> 1/13 ndctl:dax / daxdev-errors.sh OK 14.46s
> 2/13 ndctl:dax / multi-dax.sh OK 2.70s
> 3/13 ndctl:dax / sub-section.sh OK 7.21s
> 4/13 ndctl:dax / dax-dev OK 0.08s
> [5/13] ???? ndctl:dax / dax-ext4.sh 0/600s

...thanks for pasting that output. Turns out I didn't have destructive
testing enabled during the build so hadn't noticed these tests were not
running. It would be nice if these were reported as skipped when not
enabled rather than hidden.

With that fixed I'm seeing a couple of kernel warnings (and I think I
know why), so it might be worth holding off looking at this too closely
until I've fixed these.

> ...that last test crashed with:
>
> EXT4-fs (pmem0): mounted filesystem 2adea02a-a791-4714-be40-125afd16634b r/w with ordered
> ota mode: none.
> page:ffffea0005f00000 refcount:0 mapcount:0 mapping:ffff8882a8a6be10 index:0x5800 pfn:0x1
>
> head:ffffea0005f00000 order:9 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> aops:ext4_dax_aops ino:c dentry name:"image"
> flags: 0x4ffff800004040(reserved|head|node=0|zone=4|lastcpupid=0x1ffff)
> page_type: 0xffffffff()
> raw: 004ffff800004040 ffff888202681520 0000000000000000 ffff8882a8a6be10
> raw: 0000000000005800 0000000000000000 00000000ffffffff 0000000000000000
> page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127
>
> ------------[ cut here ]------------
> kernel BUG at include/linux/mm.h:1419!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 0 PID: 1415 Comm: dax-pmd Tainted: G OE N 6.6.0+ #209
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
> RIP: 0010:dax_insert_pfn_pmd+0x41c/0x430
> Code: 89 c1 41 b8 01 00 00 00 48 89 ea 4c 89 e6 4c 89 f7 e8 18 8a c7 ff e9 e0 fc ff ff 48
> c b3 48 89 c7 e8 a4 53 f7 ff <0f> 0b e8 0d ba a8 00 48 8b 15 86 8a 62 01 e9 89 fc ff ff 90
>
> RSP: 0000:ffffc90001d57b68 EFLAGS: 00010246
> RAX: 000000000000005c RBX: ffffea0005f00000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffffb3749a15 RDI: 00000000ffffffff
> RBP: ffff8882982c07e0 R08: 00000000ffffdfff R09: 0000000000000001
> R10: 00000000ffffdfff R11: ffffffffb3a771c0 R12: 800000017c0008e7
> R13: 8000000000000025 R14: ffff888202a395f8 R15: ffffea0005f00000
> FS: 00007fdaa00e3d80(0000) GS:ffff888477000000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fda9f800000 CR3: 0000000296224000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> ? die+0x32/0x80
> ? do_trap+0xd6/0x100
> ? dax_insert_pfn_pmd+0x41c/0x430
> ? dax_insert_pfn_pmd+0x41c/0x430
> ? do_error_trap+0x81/0x110
> ? dax_insert_pfn_pmd+0x41c/0x430
> ? exc_invalid_op+0x4c/0x60
> ? dax_insert_pfn_pmd+0x41c/0x430
> ? asm_exc_invalid_op+0x16/0x20
> ? dax_insert_pfn_pmd+0x41c/0x430
> ? dax_insert_pfn_pmd+0x41c/0x430
> dax_fault_iter+0x5d0/0x700
> dax_iomap_pmd_fault+0x212/0x450
> ext4_dax_huge_fault+0x1dc/0x470
> __handle_mm_fault+0x808/0x13e0
> handle_mm_fault+0x178/0x3e0
> do_user_addr_fault+0x186/0x830
> exc_page_fault+0x6f/0x1d0
> asm_exc_page_fault+0x22/0x30
> RIP: 0033:0x7fdaa072d009

2024-04-12 05:51:56

by Alistair Popple

[permalink] [raw]

Subject: Re: [RFC 03/10] pci/p2pdma: Don't initialise page refcount to one

Jason Gunthorpe <[email protected]> writes:

> On Thu, Apr 11, 2024 at 10:57:24AM +1000, Alistair Popple wrote:
>> The reference counts for ZONE_DEVICE private pages should be
>> initialised by the driver when the page is actually allocated by the
>> driver allocator, not when they are first created. This is currently
>> the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
>> but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.
>>
>> Signed-off-by: Alistair Popple <[email protected]>
>> ---
>> drivers/pci/p2pdma.c | 2 ++
>> mm/memremap.c | 8 ++++----
>> mm/mm_init.c | 4 +++-
>> 3 files changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index fa7370f..ab7ef18 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -128,6 +128,8 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>> goto out;
>> }
>>
>> + get_page(virt_to_page(kaddr));
>> +
>
> Should this be
>
> set_page_count(page, 1)
>
> If the refcount is already known to be 0 ?

Yeah, that would avoid the obvious warning that calling get_page there
will generate. My test setup for p2pdma is pretty clunky, so haven't run
it a while. Not sure if there are any good qemu based tests for this.

>> @@ -508,15 +508,15 @@ void free_zone_device_page(struct page *page)
>> page->mapping = NULL;
>> page->pgmap->ops->page_free(page);
>>
>> - if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
>> - page->pgmap->type != MEMORY_DEVICE_COHERENT)
>> + if (page->pgmap->type == MEMORY_DEVICE_PRIVATE ||
>> + page->pgmap->type == MEMORY_DEVICE_COHERENT)
>> + put_dev_pagemap(page->pgmap);
>
> Not related, but we should really be getting rid of this devmap
> refcount traffic too, IMHO..

Absolutely. I think there's a bunch of clean ups for this in mm/gup.c
that could be done as well. I plan on doing that as a follow up to this
series. We pretty much don't use that for device private/coherent pages
anyway.

> If an implementation wants this then it should hook the page
> free/alloc callbacks and do this, not put it in the core code.
>
> Jason

2024-04-12 07:07:09

by Alistair Popple

[permalink] [raw]

Subject: Re: [RFC 00/10] fs/dax: Fix FS DAX page reference counts

Alistair Popple <[email protected]> writes:

> Dan Williams <[email protected]> writes:
>
>> Alistair Popple wrote:
>>> FS DAX pages have always maintained their own page reference counts
>>> without following the normal rules for page reference counting. In
>>> particular pages are considered free when the refcount hits one rather
>>> than zero and refcounts are not added when mapping the page.
>>
>>> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
>>> mechanism for allowing GUP to hold references on the page (see
>>> get_dev_pagemap). However there doesn't seem to be any reason why FS
>>> DAX pages need their own reference counting scheme.
>>
>> This is fair. However, for anyone coming in fresh to this situation
>> maybe some more "how we get here" history helps. That longer story is
>> here:
>>
>> http://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com/
>
> Good idea.
>
>>> This RFC is an initial attempt at removing the special reference
>>> counting and instead refcount FS DAX pages the same as normal pages.
>>>
>>> There are still a couple of rough edges - in particular I haven't
>>> completely removed the devmap PTE bit references from arch specific
>>> code and there is probably some more cleanup of dev_pagemap reference
>>> counting that could be done, particular in mm/gup.c. I also haven't
>>> yet compiled on anything other than x86_64.
>>>
>>> Before continuing further with this clean-up though I would appreciate
>>> some feedback on the viability of this approach and any issues I may
>>> have overlooked, as I am not intimately familiar with FS DAX code (or
>>> for that matter the FS layer in general).
>>>
>>> I have of course run some basic testing which didn't reveal any
>>> problems.
>>
>> FWIW I see the following with the ndctl/dax test-suite (double-checked
>> that vanilla v6.6 passes). I will take a look at the patches, but in the
>> meantime...
>
> Hmmm...
>
>> # meson test -C build --suite ndctl:dax
>> ninja: no work to do.
>> ninja: Entering directory `/root/git/ndctl/build'
>> [1/70] Generating version.h with a custom command
>> 1/13 ndctl:dax / daxdev-errors.sh OK 14.46s
>> 2/13 ndctl:dax / multi-dax.sh OK 2.70s
>> 3/13 ndctl:dax / sub-section.sh OK 7.21s
>> 4/13 ndctl:dax / dax-dev OK 0.08s
>> [5/13] ???? ndctl:dax / dax-ext4.sh 0/600s
>
> ...thanks for pasting that output. Turns out I didn't have destructive
> testing enabled during the build so hadn't noticed these tests were not
> running. It would be nice if these were reported as skipped when not
> enabled rather than hidden.
>
> With that fixed I'm seeing a couple of kernel warnings (and I think I
> know why), so it might be worth holding off looking at this too closely
> until I've fixed these.

Ok, I think I found the dragons you were talking about earlier for
device-dax. I completely broke that because as you've already pointed
out pmd_trans_huge() won't filter out DAX pages. That's fine for FS DAX
(because the pages are essentially normal pages now anyway), but we
don't have a PMD equivalent of vm_normal_page() which leads to all sorts
of issues for DEVDAX.

So I will probably have to add something like that unless we only need
to support large (pmd/pud) mappings of DEVDAX pages on systems with
CONFIG_ARCH_HAS_PTE_SPECIAL in which case I guess we could just filter
based on pte_special().

>> ...that last test crashed with:
>>
>> EXT4-fs (pmem0): mounted filesystem 2adea02a-a791-4714-be40-125afd16634b r/w with ordered
>> ota mode: none.
>> page:ffffea0005f00000 refcount:0 mapcount:0 mapping:ffff8882a8a6be10 index:0x5800 pfn:0x1
>>
>> head:ffffea0005f00000 order:9 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>> aops:ext4_dax_aops ino:c dentry name:"image"
>> flags: 0x4ffff800004040(reserved|head|node=0|zone=4|lastcpupid=0x1ffff)
>> page_type: 0xffffffff()
>> raw: 004ffff800004040 ffff888202681520 0000000000000000 ffff8882a8a6be10
>> raw: 0000000000005800 0000000000000000 00000000ffffffff 0000000000000000
>> page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127
>>
>> ------------[ cut here ]------------
>> kernel BUG at include/linux/mm.h:1419!
>> invalid opcode: 0000 [#1] PREEMPT SMP PTI
>> CPU: 0 PID: 1415 Comm: dax-pmd Tainted: G OE N 6.6.0+ #209
>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
>> RIP: 0010:dax_insert_pfn_pmd+0x41c/0x430
>> Code: 89 c1 41 b8 01 00 00 00 48 89 ea 4c 89 e6 4c 89 f7 e8 18 8a c7 ff e9 e0 fc ff ff 48
>> c b3 48 89 c7 e8 a4 53 f7 ff <0f> 0b e8 0d ba a8 00 48 8b 15 86 8a 62 01 e9 89 fc ff ff 90
>>
>> RSP: 0000:ffffc90001d57b68 EFLAGS: 00010246
>> RAX: 000000000000005c RBX: ffffea0005f00000 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: ffffffffb3749a15 RDI: 00000000ffffffff
>> RBP: ffff8882982c07e0 R08: 00000000ffffdfff R09: 0000000000000001
>> R10: 00000000ffffdfff R11: ffffffffb3a771c0 R12: 800000017c0008e7
>> R13: 8000000000000025 R14: ffff888202a395f8 R15: ffffea0005f00000
>> FS: 00007fdaa00e3d80(0000) GS:ffff888477000000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007fda9f800000 CR3: 0000000296224000 CR4: 00000000000006f0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>> <TASK>
>> ? die+0x32/0x80
>> ? do_trap+0xd6/0x100
>> ? dax_insert_pfn_pmd+0x41c/0x430
>> ? dax_insert_pfn_pmd+0x41c/0x430
>> ? do_error_trap+0x81/0x110
>> ? dax_insert_pfn_pmd+0x41c/0x430
>> ? exc_invalid_op+0x4c/0x60
>> ? dax_insert_pfn_pmd+0x41c/0x430
>> ? asm_exc_invalid_op+0x16/0x20
>> ? dax_insert_pfn_pmd+0x41c/0x430
>> ? dax_insert_pfn_pmd+0x41c/0x430
>> dax_fault_iter+0x5d0/0x700
>> dax_iomap_pmd_fault+0x212/0x450
>> ext4_dax_huge_fault+0x1dc/0x470
>> __handle_mm_fault+0x808/0x13e0
>> handle_mm_fault+0x178/0x3e0
>> do_user_addr_fault+0x186/0x830
>> exc_page_fault+0x6f/0x1d0
>> asm_exc_page_fault+0x22/0x30
>> RIP: 0033:0x7fdaa072d009

2024-04-12 11:54:12

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [RFC 00/10] fs/dax: Fix FS DAX page reference counts

On Fri, Apr 12, 2024 at 04:55:31PM +1000, Alistair Popple wrote:

> Ok, I think I found the dragons you were talking about earlier for
> device-dax. I completely broke that because as you've already pointed
> out pmd_trans_huge() won't filter out DAX pages. That's fine for FS DAX
> (because the pages are essentially normal pages now anyway), but we
> don't have a PMD equivalent of vm_normal_page() which leads to all sorts
> of issues for DEVDAX.

What about vm_normal_page() depends on the radix level ?

Doesn't DEVDAX memory have struct page too?

> So I will probably have to add something like that unless we only need
> to support large (pmd/pud) mappings of DEVDAX pages on systems with
> CONFIG_ARCH_HAS_PTE_SPECIAL in which case I guess we could just filter
> based on pte_special().

pte_special should only be used by memory without a struct page, is
that what DEVDAX is?

Jason

2024-04-12 14:38:16

[permalink] [raw]

Subject: Re: [RFC 05/10] fs/dax: Refactor wait for dax idle page

On Thu 11-04-24 10:57:26, Alistair Popple wrote:
> A FS DAX page is considered idle when its refcount drops to one. This
> is currently open-coded in all file systems supporting FS DAX. Move
> the idle detection to a common function to make future changes easier.
>
> Signed-off-by: Alistair Popple <[email protected]>

Good cleanup. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inode.c | 5 +----
> fs/fuse/dax.c | 4 +---
> fs/xfs/xfs_file.c | 4 +---
> include/linux/dax.h | 11 +++++++++++
> 4 files changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 4ce35f1..e9cef7d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3868,10 +3868,7 @@ int ext4_break_layouts(struct inode *inode)
> if (!page)
> return 0;
>
> - error = ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1,
> - TASK_INTERRUPTIBLE, 0, 0,
> - ext4_wait_dax_page(inode));
> + error = dax_wait_page_idle(page, ext4_wait_dax_page, inode);
> } while (error == 0);
>
> return error;
> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
> index 23904a6..8a62483 100644
> --- a/fs/fuse/dax.c
> +++ b/fs/fuse/dax.c
> @@ -676,9 +676,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
> return 0;
>
> *retry = true;
> - return ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> - 0, 0, fuse_wait_dax_page(inode));
> + return dax_wait_page_idle(page, fuse_wait_dax_page, inode);
> }
>
> /* dmap_end == 0 leads to unmapping of whole file */
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 2037002..099cd70 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -849,9 +849,7 @@ xfs_break_dax_layouts(
> return 0;
>
> *retry = true;
> - return ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> - 0, 0, xfs_wait_dax_page(inode));
> + return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
> }
>
> int
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 22cd990..bced4d4 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -212,6 +212,17 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> const struct iomap_ops *ops);
>
> +static inline int dax_wait_page_idle(struct page *page,
> + void (cb)(struct inode *),
> + struct inode *inode)
> +{
> + int ret;
> +
> + ret = ___wait_var_event(page, page_ref_count(page) == 1,
> + TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> + return ret;
> +}
> +
> #if IS_ENABLED(CONFIG_DAX)
> int dax_read_lock(void);
> void dax_read_unlock(int id);
> --
> git-series 0.9.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-04-12 15:33:03

[permalink] [raw]

Subject: Re: [RFC 04/10] fs/dax: Don't track page mapping/index

On Thu 11-04-24 10:57:25, Alistair Popple wrote:
> The page->mapping and page->index fields are normally used by the
> pagecache and rmap for looking up virtual mappings of pages. FS DAX
> implements it's own kind of page cache and rmap look ups so these
> fields are unnecessary. They are currently only used to detect
> error/warning conditions which should never occur.
>
> A future change will change the way shared mappings are detected by
> doing normal page reference counting instead, so remove the
> unnecessary checks.
>
> Signed-off-by: Alistair Popple <[email protected]>
...
> -/*
> - * When it is called in dax_insert_entry(), the shared flag will indicate that
> - * whether this entry is shared by multiple files. If so, set the page->mapping
> - * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
> - */
> -static void dax_associate_entry(void *entry, struct address_space *mapping,
> - struct vm_area_struct *vma, unsigned long address, bool shared)
> -{
> - unsigned long size = dax_entry_size(entry), pfn, index;
> - int i = 0;
> -
> - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> - return;
> -
> - index = linear_page_index(vma, address & ~(size - 1));
> - for_each_mapped_pfn(entry, pfn) {
> - struct page *page = pfn_to_page(pfn);
> -
> - if (shared) {
> - dax_page_share_get(page);
> - } else {
> - WARN_ON_ONCE(page->mapping);
> - page->mapping = mapping;
> - page->index = index + i++;
> - }
> - }
> -}

Hum, but what about existing uses of folio->mapping and folio->index in
fs/dax.c? AFAICT this patch breaks them. What am I missing? How can this
ever work?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-04-12 17:20:25

by Dan Williams

[permalink] [raw]

Subject: RE: [RFC 03/10] pci/p2pdma: Don't initialise page refcount to one

Alistair Popple wrote:
> The reference counts for ZONE_DEVICE private pages should be
> initialised by the driver when the page is actually allocated by the
> driver allocator, not when they are first created. This is currently
> the case for MEMORY_DEVICE_PRIVATE and MEMORY_DEVICE_COHERENT pages
> but not MEMORY_DEVICE_PCI_P2PDMA pages so fix that up.

I know you said to hold off on looking at this series until you fixed up
the kernel assertions, but I would not expect to remove logic before the
replacement is available. So this seems to be in the wrong place in the
series, or am I missing something?

2024-04-12 17:21:38

by Dan Williams

[permalink] [raw]

Subject: RE: [RFC 04/10] fs/dax: Don't track page mapping/index

Alistair Popple wrote:
> The page->mapping and page->index fields are normally used by the
> pagecache and rmap for looking up virtual mappings of pages. FS DAX
> implements it's own kind of page cache and rmap look ups so these
> fields are unnecessary. They are currently only used to detect
> error/warning conditions which should never occur.
>
> A future change will change the way shared mappings are detected by
> doing normal page reference counting instead, so remove the
> unnecessary checks.

Ignore my comment on patch3, I fumble fingered the reply, it was meant
to for this patch. I.e. that "future change" should just be present
before removing old logic.

2024-04-12 17:32:12

by Dan Williams

[permalink] [raw]

Subject: Re: [RFC 04/10] fs/dax: Don't track page mapping/index

Jan Kara wrote:
> On Thu 11-04-24 10:57:25, Alistair Popple wrote:
> > The page->mapping and page->index fields are normally used by the
> > pagecache and rmap for looking up virtual mappings of pages. FS DAX
> > implements it's own kind of page cache and rmap look ups so these
> > fields are unnecessary. They are currently only used to detect
> > error/warning conditions which should never occur.
> >
> > A future change will change the way shared mappings are detected by
> > doing normal page reference counting instead, so remove the
> > unnecessary checks.
> >
> > Signed-off-by: Alistair Popple <[email protected]>
> ...
> > -/*
> > - * When it is called in dax_insert_entry(), the shared flag will indicate that
> > - * whether this entry is shared by multiple files. If so, set the page->mapping
> > - * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
> > - */
> > -static void dax_associate_entry(void *entry, struct address_space *mapping,
> > - struct vm_area_struct *vma, unsigned long address, bool shared)
> > -{
> > - unsigned long size = dax_entry_size(entry), pfn, index;
> > - int i = 0;
> > -
> > - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> > - return;
> > -
> > - index = linear_page_index(vma, address & ~(size - 1));
> > - for_each_mapped_pfn(entry, pfn) {
> > - struct page *page = pfn_to_page(pfn);
> > -
> > - if (shared) {
> > - dax_page_share_get(page);
> > - } else {
> > - WARN_ON_ONCE(page->mapping);
> > - page->mapping = mapping;
> > - page->index = index + i++;
> > - }
> > - }
> > -}
>
> Hum, but what about existing uses of folio->mapping and folio->index in
> fs/dax.c? AFAICT this patch breaks them. What am I missing? How can this
> ever work?

Right, as far as I can see every fsdax filesystem would need to be
converted to use dax_holder_operations() so that the fs can backfill
->mapping and ->index.

2024-04-12 17:33:00

by Dan Williams

[permalink] [raw]

Subject: Re: [RFC 00/10] fs/dax: Fix FS DAX page reference counts

Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 04:55:31PM +1000, Alistair Popple wrote:
>
> > Ok, I think I found the dragons you were talking about earlier for
> > device-dax. I completely broke that because as you've already pointed
> > out pmd_trans_huge() won't filter out DAX pages. That's fine for FS DAX
> > (because the pages are essentially normal pages now anyway), but we
> > don't have a PMD equivalent of vm_normal_page() which leads to all sorts
> > of issues for DEVDAX.
>
> What about vm_normal_page() depends on the radix level ?
>
> Doesn't DEVDAX memory have struct page too?

Yes.

> > So I will probably have to add something like that unless we only need
> > to support large (pmd/pud) mappings of DEVDAX pages on systems with
> > CONFIG_ARCH_HAS_PTE_SPECIAL in which case I guess we could just filter
> > based on pte_special().
>
> pte_special should only be used by memory without a struct page, is
> that what DEVDAX is?

Right, I don't think pte_special is applicable for any DAX pages.

2024-04-13 20:19:33

by John Hubbard

[permalink] [raw]

Subject: Re: [RFC 05/10] fs/dax: Refactor wait for dax idle page

On 4/10/24 5:57 PM, Alistair Popple wrote:
...
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 22cd990..bced4d4 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -212,6 +212,17 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> const struct iomap_ops *ops);
>
> +static inline int dax_wait_page_idle(struct page *page,
> + void (cb)(struct inode *),
> + struct inode *inode)
> +{
> + int ret;
> +
> + ret = ___wait_var_event(page, page_ref_count(page) == 1,
> + TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> + return ret;
> +}

Or just:
{
return ___wait_var_event(page, page_ref_count(page) == 1,
TASK_INTERRUPTIBLE, 0, 0, cb(inode));
}

...yes?

thanks,
--
John Hubbard
NVIDIA

2024-04-15 07:04:32

by Alistair Popple

[permalink] [raw]

Subject: Re: [RFC 04/10] fs/dax: Don't track page mapping/index

Dan Williams <[email protected]> writes:

> Jan Kara wrote:
>> On Thu 11-04-24 10:57:25, Alistair Popple wrote:
>> > The page->mapping and page->index fields are normally used by the
>> > pagecache and rmap for looking up virtual mappings of pages. FS DAX
>> > implements it's own kind of page cache and rmap look ups so these
>> > fields are unnecessary. They are currently only used to detect
>> > error/warning conditions which should never occur.
>> >
>> > A future change will change the way shared mappings are detected by
>> > doing normal page reference counting instead, so remove the
>> > unnecessary checks.
>> >
>> > Signed-off-by: Alistair Popple <[email protected]>
>> ...
>> > -/*
>> > - * When it is called in dax_insert_entry(), the shared flag will indicate that
>> > - * whether this entry is shared by multiple files. If so, set the page->mapping
>> > - * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
>> > - */
>> > -static void dax_associate_entry(void *entry, struct address_space *mapping,
>> > - struct vm_area_struct *vma, unsigned long address, bool shared)
>> > -{
>> > - unsigned long size = dax_entry_size(entry), pfn, index;
>> > - int i = 0;
>> > -
>> > - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> > - return;
>> > -
>> > - index = linear_page_index(vma, address & ~(size - 1));
>> > - for_each_mapped_pfn(entry, pfn) {
>> > - struct page *page = pfn_to_page(pfn);
>> > -
>> > - if (shared) {
>> > - dax_page_share_get(page);
>> > - } else {
>> > - WARN_ON_ONCE(page->mapping);
>> > - page->mapping = mapping;
>> > - page->index = index + i++;
>> > - }
>> > - }
>> > -}
>>
>> Hum, but what about existing uses of folio->mapping and folio->index in
>> fs/dax.c? AFAICT this patch breaks them. What am I missing? How can this
>> ever work?

I did feel I was missing something here as well, but nothing obviously
breaks with this change from a test perspective (ie. ndctl tests, manual
tests). Somehow I missed how this was used in code, but Dan provided
enough of a hint below though so now I see the errors of my ways :-)

> Right, as far as I can see every fsdax filesystem would need to be
> converted to use dax_holder_operations() so that the fs can backfill
> ->mapping and ->index.

Oh, that was the hint I needed. Thanks. So basically it's just used for
memory failure like so:

memory_failure()
-> memory_failure_dev_pagemap()
-> mf_generic_kill_procs()
-> dax_lock_page()
-> mapping = READ_ONCE(page->mapping);

Somehow I had missed that bleatingly obvious usage of page->mapping. I
also couldn't understand how it was important if it was safe for it to
be just randomly overwritten in the shared case.

But I think I understand now - shared fs dax pages are only supported on
xfs and the mapping/index fields aren't used there because xfs provides
it's own look up for memory failure using dax_holder_operations.

I was initially concerned about these cases because I was wondering if
folio subpages could ever get different mappings and the shared case
implied they could. But it seems that's xfs specific and there is a
separate mechanism to deal with looking up ->mapping/index for that. So
I guess we should still be able to safely store this on the folio
head. I will double check and update this change.

2024-04-15 08:42:50

by Alistair Popple

[permalink] [raw]

Subject: Re: [RFC 05/10] fs/dax: Refactor wait for dax idle page

John Hubbard <[email protected]> writes:

> On 4/10/24 5:57 PM, Alistair Popple wrote:
> ...
>> diff --git a/include/linux/dax.h b/include/linux/dax.h
>> index 22cd990..bced4d4 100644
>> --- a/include/linux/dax.h
>> +++ b/include/linux/dax.h
>> @@ -212,6 +212,17 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>> int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
>> const struct iomap_ops *ops);
>> +static inline int dax_wait_page_idle(struct page *page,
>> + void (cb)(struct inode *),
>> + struct inode *inode)
>> +{
>> + int ret;
>> +
>> + ret = ___wait_var_event(page, page_ref_count(page) == 1,
>> + TASK_INTERRUPTIBLE, 0, 0, cb(inode));
>> + return ret;
>> +}
>
> Or just:
> {
> return ___wait_var_event(page, page_ref_count(page) == 1,
> TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> }
>
> ...yes?

Yep. Thanks.

> thanks,

2024-04-15 20:51:52

by Dan Williams

[permalink] [raw]

Subject: Re: [RFC 04/10] fs/dax: Don't track page mapping/index

Alistair Popple wrote:
> I was initially concerned about these cases because I was wondering if
> folio subpages could ever get different mappings and the shared case
> implied they could. But it seems that's xfs specific and there is a
> separate mechanism to deal with looking up ->mapping/index for that. So
> I guess we should still be able to safely store this on the folio
> head. I will double check and update this change.
>

I think there is path to store this information only on the folio head.
However, ugh, I think this is potentially another "head" of the
pmd_devmap() hydra.

pmd_devmap() taught the core-mm to treat dax_pmds indentically to
thp_pmds *except* for the __split_huge_pmd() case:

5c7fb56e5e3f mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd

Later on pmd migration entries joined pmd_devmap() in skipping splits:

84c3fc4e9c56 mm: thp: check pmd migration entry in common path

Unfortunately, pmd_devmap() stopped being considered for skipping
splits here:

7f7609175ff2 mm/huge_memory: remove stale locking logic from __split_huge_pmd()

Likely __split_huge_pmd_locked() grew support for pmd migration handling
and forgot about the pmd_devmap() case.

So now Linux has been allowing FSDAX pmd splits since v5.18... but with
no reports of any issues. Likely this is benefiting from the fact that
the preconditions for splitting are rarely if ever satisfied because
FSDAX mappings are never anon, and establishing the mapping in the first
place requires a 2MB aligned file extent and that is likely never
fractured.

Same for device-dax where the fracturing *should* not be allowed, but I
will feel better with focus tests to go after mremap() cases that would
attempt to split the page.

2024-04-16 00:17:04

by Alistair Popple

[permalink] [raw]

Subject: Re: [RFC 04/10] fs/dax: Don't track page mapping/index

Dan Williams <[email protected]> writes:

> Alistair Popple wrote:
>> I was initially concerned about these cases because I was wondering if
>> folio subpages could ever get different mappings and the shared case
>> implied they could. But it seems that's xfs specific and there is a
>> separate mechanism to deal with looking up ->mapping/index for that. So
>> I guess we should still be able to safely store this on the folio
>> head. I will double check and update this change.
>>
>
> I think there is path to store this information only on the folio head.
> However, ugh, I think this is potentially another "head" of the
> pmd_devmap() hydra.
>
> pmd_devmap() taught the core-mm to treat dax_pmds indentically to
> thp_pmds *except* for the __split_huge_pmd() case:
>
> 5c7fb56e5e3f mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd
>
> Later on pmd migration entries joined pmd_devmap() in skipping splits:
>
> 84c3fc4e9c56 mm: thp: check pmd migration entry in common path
>
> Unfortunately, pmd_devmap() stopped being considered for skipping
> splits here:
>
> 7f7609175ff2 mm/huge_memory: remove stale locking logic from __split_huge_pmd()
>
> Likely __split_huge_pmd_locked() grew support for pmd migration handling
> and forgot about the pmd_devmap() case.
>
> So now Linux has been allowing FSDAX pmd splits since v5.18...

From what I see we currently (in v6.6) have this in
__split_huge_pmd_locked():

if (!vma_is_anonymous(vma)) {
old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
/*
* We are going to unmap this huge page. So
* just go ahead and zap it
*/
if (arch_needs_pgtable_deposit())
zap_deposited_table(mm, pmd);
if (vma_is_special_huge(vma))
return;

Where vma_is_special_huge(vma) returns true for vma_is_dax(). So AFAICT
we're still skipping the split right? In all versions we just zap the
PMD and continue. What am I missing?

> but with
> no reports of any issues. Likely this is benefiting from the fact that
> the preconditions for splitting are rarely if ever satisfied because
> FSDAX mappings are never anon, and establishing the mapping in the first
> place requires a 2MB aligned file extent and that is likely never
> fractured.
>
> Same for device-dax where the fracturing *should* not be allowed, but I
> will feel better with focus tests to go after mremap() cases that would
> attempt to split the page.

2024-04-16 00:37:19

by Dan Williams

[permalink] [raw]

Subject: Re: [RFC 04/10] fs/dax: Don't track page mapping/index

Alistair Popple wrote:
>
> Dan Williams <[email protected]> writes:
>
> > Alistair Popple wrote:
> >> I was initially concerned about these cases because I was wondering if
> >> folio subpages could ever get different mappings and the shared case
> >> implied they could. But it seems that's xfs specific and there is a
> >> separate mechanism to deal with looking up ->mapping/index for that. So
> >> I guess we should still be able to safely store this on the folio
> >> head. I will double check and update this change.
> >>
> >
> > I think there is path to store this information only on the folio head.
> > However, ugh, I think this is potentially another "head" of the
> > pmd_devmap() hydra.
> >
> > pmd_devmap() taught the core-mm to treat dax_pmds indentically to
> > thp_pmds *except* for the __split_huge_pmd() case:
> >
> > 5c7fb56e5e3f mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd
> >
> > Later on pmd migration entries joined pmd_devmap() in skipping splits:
> >
> > 84c3fc4e9c56 mm: thp: check pmd migration entry in common path
> >
> > Unfortunately, pmd_devmap() stopped being considered for skipping
> > splits here:
> >
> > 7f7609175ff2 mm/huge_memory: remove stale locking logic from __split_huge_pmd()
> >
> > Likely __split_huge_pmd_locked() grew support for pmd migration handling
> > and forgot about the pmd_devmap() case.
> >
> > So now Linux has been allowing FSDAX pmd splits since v5.18...
>
> From what I see we currently (in v6.6) have this in
> __split_huge_pmd_locked():
>
> if (!vma_is_anonymous(vma)) {
> old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
> /*
> * We are going to unmap this huge page. So
> * just go ahead and zap it
> */
> if (arch_needs_pgtable_deposit())
> zap_deposited_table(mm, pmd);
> if (vma_is_special_huge(vma))
> return;
>
> Where vma_is_special_huge(vma) returns true for vma_is_dax(). So AFAICT
> we're still skipping the split right? In all versions we just zap the
> PMD and continue. What am I missing?

Ah, good point I missed that. One more dragon vanquished.