2022-12-16 19:42:50

by Mike Kravetz

Subject: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range

zap_page_range was originally designed to unmap pages within an address
range that could span multiple vmas. While working on [1], it was
discovered that all callers of zap_page_range pass a range entirely within
a single vma. In addition, the mmu notification call within
zap_page_range does not correctly handle ranges that span multiple vmas,
as such calls should be vma specific.

Instead of fixing zap_page_range, change all callers to use the new
routine zap_vma_page_range. zap_vma_page_range is just a wrapper around
zap_page_range_single passing in NULL zap details. The name is also
more in line with other exported routines that operate within a vma.
We can then remove zap_page_range.

Also, change madvise_dontneed_single_vma to use this new routine.

[1] https://lore.kernel.org/linux-mm/[email protected]/
Suggested-by: Peter Xu <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
---
arch/arm64/kernel/vdso.c | 4 ++--
arch/powerpc/kernel/vdso.c | 2 +-
arch/powerpc/platforms/book3s/vas-api.c | 2 +-
arch/powerpc/platforms/pseries/vas.c | 2 +-
arch/riscv/kernel/vdso.c | 4 ++--
arch/s390/kernel/vdso.c | 2 +-
arch/s390/mm/gmap.c | 2 +-
arch/x86/entry/vdso/vma.c | 2 +-
drivers/android/binder_alloc.c | 2 +-
include/linux/mm.h | 7 ++++--
mm/madvise.c | 4 ++--
mm/memory.c | 30 -------------------------
mm/page-writeback.c | 2 +-
net/ipv4/tcp.c | 6 ++---
14 files changed, 22 insertions(+), 49 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index e59a32aa0c49..a7b10e182f78 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -141,10 +141,10 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
unsigned long size = vma->vm_end - vma->vm_start;

if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
- zap_page_range(vma, vma->vm_start, size);
+ zap_vma_page_range(vma, vma->vm_start, size);
#ifdef CONFIG_COMPAT_VDSO
if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
- zap_page_range(vma, vma->vm_start, size);
+ zap_vma_page_range(vma, vma->vm_start, size);
#endif
}

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 507f8228f983..479d70fe8c55 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -123,7 +123,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
unsigned long size = vma->vm_end - vma->vm_start;

if (vma_is_special_mapping(vma, &vvar_spec))
- zap_page_range(vma, vma->vm_start, size);
+ zap_vma_page_range(vma, vma->vm_start, size);
}
mmap_read_unlock(mm);

diff --git a/arch/powerpc/platforms/book3s/vas-api.c b/arch/powerpc/platforms/book3s/vas-api.c
index eb5bed333750..8f57388b760b 100644
--- a/arch/powerpc/platforms/book3s/vas-api.c
+++ b/arch/powerpc/platforms/book3s/vas-api.c
@@ -414,7 +414,7 @@ static vm_fault_t vas_mmap_fault(struct vm_fault *vmf)
/*
* When the LPAR lost credits due to core removal or during
* migration, invalidate the existing mapping for the current
- * paste addresses and set windows in-active (zap_page_range in
+ * paste addresses and set windows in-active (zap_vma_page_range in
* reconfig_close_windows()).
* New mapping will be done later after migration or new credits
* available. So continue to receive faults if the user space
diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
index 4ad6e510d405..2aef8d9295a2 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -760,7 +760,7 @@ static int reconfig_close_windows(struct vas_caps *vcap, int excess_creds,
* is done before the original mmap() and after the ioctl.
*/
if (vma)
- zap_page_range(vma, vma->vm_start,
+ zap_vma_page_range(vma, vma->vm_start,
vma->vm_end - vma->vm_start);

mmap_write_unlock(task_ref->mm);
diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
index e410275918ac..a405119da2c0 100644
--- a/arch/riscv/kernel/vdso.c
+++ b/arch/riscv/kernel/vdso.c
@@ -127,10 +127,10 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
unsigned long size = vma->vm_end - vma->vm_start;

if (vma_is_special_mapping(vma, vdso_info.dm))
- zap_page_range(vma, vma->vm_start, size);
+ zap_vma_page_range(vma, vma->vm_start, size);
#ifdef CONFIG_COMPAT
if (vma_is_special_mapping(vma, compat_vdso_info.dm))
- zap_page_range(vma, vma->vm_start, size);
+ zap_vma_page_range(vma, vma->vm_start, size);
#endif
}

diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index ff7bf4432229..eccfcd505403 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -63,7 +63,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)

if (!vma_is_special_mapping(vma, &vvar_mapping))
continue;
- zap_page_range(vma, vma->vm_start, size);
+ zap_vma_page_range(vma, vma->vm_start, size);
break;
}
mmap_read_unlock(mm);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 74e1d873dce0..67d998152142 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -722,7 +722,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
if (is_vm_hugetlb_page(vma))
continue;
size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
- zap_page_range(vma, vmaddr, size);
+ zap_vma_page_range(vma, vmaddr, size);
}
mmap_read_unlock(gmap->mm);
}
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index b8f3f9b9e53c..5aafbd19e869 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -116,7 +116,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
unsigned long size = vma->vm_end - vma->vm_start;

if (vma_is_special_mapping(vma, &vvar_mapping))
- zap_page_range(vma, vma->vm_start, size);
+ zap_vma_page_range(vma, vma->vm_start, size);
}
mmap_read_unlock(mm);

diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index 4ad42b0f75cd..f7f10248c742 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -1019,7 +1019,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
if (vma) {
trace_binder_unmap_user_start(alloc, index);

- zap_page_range(vma, page_addr, PAGE_SIZE);
+ zap_vma_page_range(vma, page_addr, PAGE_SIZE);

trace_binder_unmap_user_end(alloc, index);
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b28eb9c6ea2..706efaf95783 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1980,10 +1980,13 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,

void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
-void zap_page_range(struct vm_area_struct *vma, unsigned long address,
- unsigned long size);
void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details);
+static inline void zap_vma_page_range(struct vm_area_struct *vma,
+ unsigned long address, unsigned long size)
+{
+ zap_page_range_single(vma, address, size, NULL);
+}
void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *start_vma, unsigned long start,
unsigned long end);
diff --git a/mm/madvise.c b/mm/madvise.c
index 87703a19bbef..3c4d9829d4e1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -787,7 +787,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
* data it wants to keep. Be sure to free swap resources too. The
- * zap_page_range_single call sets things up for shrink_active_list to actually
+ * zap_vma_page_range call sets things up for shrink_active_list to actually
* free these pages later if no one else has touched them in the meantime,
* although we could add these pages to a global reuse list for
* shrink_active_list to pick up before reclaiming other pages.
@@ -805,7 +805,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- zap_page_range_single(vma, start, end - start, NULL);
+ zap_vma_page_range(vma, start, end - start);
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index 5b2c137dfb2a..e953a0108278 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1687,36 +1687,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
mmu_notifier_invalidate_range_end(&range);
}

-/**
- * zap_page_range - remove user pages in a given range
- * @vma: vm_area_struct holding the applicable pages
- * @start: starting address of pages to zap
- * @size: number of bytes to zap
- *
- * Caller must protect the VMA list
- */
-void zap_page_range(struct vm_area_struct *vma, unsigned long start,
- unsigned long size)
-{
- struct maple_tree *mt = &vma->vm_mm->mm_mt;
- unsigned long end = start + size;
- struct mmu_notifier_range range;
- struct mmu_gather tlb;
- MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
-
- lru_add_drain();
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
- start, start + size);
- tlb_gather_mmu(&tlb, vma->vm_mm);
- update_hiwater_rss(vma->vm_mm);
- mmu_notifier_invalidate_range_start(&range);
- do {
- unmap_single_vma(&tlb, vma, start, range.end, NULL);
- } while ((vma = mas_find(&mas, end - 1)) != NULL);
- mmu_notifier_invalidate_range_end(&range);
- tlb_finish_mmu(&tlb);
-}
-
/**
* zap_page_range_single - remove user pages in a given range
* @vma: vm_area_struct holding the applicable pages
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ad608ef2a243..bd9fe6ff6557 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2713,7 +2713,7 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
*
* The caller must hold lock_page_memcg(). Most callers have the folio
* locked. A few have the folio blocked from truncation through other
- * means (eg zap_page_range() has it mapped and is holding the page table
+ * means (eg zap_vma_page_range() has it mapped and is holding the page table
* lock). This can also be called from mark_buffer_dirty(), which I
* cannot prove is always protected against truncate.
*/
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c567d5e8053e..afaad3cfed00 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2092,7 +2092,7 @@ static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
maybe_zap_len = total_bytes_to_map - /* All bytes to map */
*length + /* Mapped or pending */
(pages_remaining * PAGE_SIZE); /* Failed map. */
- zap_page_range(vma, *address, maybe_zap_len);
+ zap_vma_page_range(vma, *address, maybe_zap_len);
err = 0;
}

@@ -2100,7 +2100,7 @@ static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
unsigned long leftover_pages = pages_remaining;
int bytes_mapped;

- /* We called zap_page_range, try to reinsert. */
+ /* We called zap_vma_page_range, try to reinsert. */
err = vm_insert_pages(vma, *address,
pending_pages,
&pages_remaining);
@@ -2234,7 +2234,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
if (total_bytes_to_map) {
if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
- zap_page_range(vma, address, total_bytes_to_map);
+ zap_vma_page_range(vma, address, total_bytes_to_map);
zc->length = total_bytes_to_map;
zc->recv_skip_hint = 0;
} else {
--
2.38.1


2022-12-19 12:12:15

by Michal Hocko

Subject: Re: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range

On Fri 16-12-22 11:20:12, Mike Kravetz wrote:
> zap_page_range was originally designed to unmap pages within an address
> range that could span multiple vmas. While working on [1], it was
> discovered that all callers of zap_page_range pass a range entirely within
> a single vma. In addition, the mmu notification call within zap_page
> range does not correctly handle ranges that span multiple vmas as calls
> should be vma specific.

Could you spend a sentence or two explaining what is wrong here?

> Instead of fixing zap_page_range, change all callers to use the new
> routine zap_vma_page_range. zap_vma_page_range is just a wrapper around
> zap_page_range_single passing in NULL zap details. The name is also
> more in line with other exported routines that operate within a vma.
> We can then remove zap_page_range.

I would stick with zap_page_range_single rather than adding a new
wrapper but nothing really critical.

> Also, change madvise_dontneed_single_vma to use this new routine.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> Suggested-by: Peter Xu <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>

Other than that LGTM
Acked-by: Michal Hocko <[email protected]>

Thanks!

> ---
> arch/arm64/kernel/vdso.c | 4 ++--
> arch/powerpc/kernel/vdso.c | 2 +-
> arch/powerpc/platforms/book3s/vas-api.c | 2 +-
> arch/powerpc/platforms/pseries/vas.c | 2 +-
> arch/riscv/kernel/vdso.c | 4 ++--
> arch/s390/kernel/vdso.c | 2 +-
> arch/s390/mm/gmap.c | 2 +-
> arch/x86/entry/vdso/vma.c | 2 +-
> drivers/android/binder_alloc.c | 2 +-
> include/linux/mm.h | 7 ++++--
> mm/madvise.c | 4 ++--
> mm/memory.c | 30 -------------------------
> mm/page-writeback.c | 2 +-
> net/ipv4/tcp.c | 6 ++---
> 14 files changed, 22 insertions(+), 49 deletions(-)
>
> diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
> index e59a32aa0c49..a7b10e182f78 100644
> --- a/arch/arm64/kernel/vdso.c
> +++ b/arch/arm64/kernel/vdso.c
> @@ -141,10 +141,10 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
> unsigned long size = vma->vm_end - vma->vm_start;
>
> if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
> #ifdef CONFIG_COMPAT_VDSO
> if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
> #endif
> }
>
> diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
> index 507f8228f983..479d70fe8c55 100644
> --- a/arch/powerpc/kernel/vdso.c
> +++ b/arch/powerpc/kernel/vdso.c
> @@ -123,7 +123,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
> unsigned long size = vma->vm_end - vma->vm_start;
>
> if (vma_is_special_mapping(vma, &vvar_spec))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
> }
> mmap_read_unlock(mm);
>
> diff --git a/arch/powerpc/platforms/book3s/vas-api.c b/arch/powerpc/platforms/book3s/vas-api.c
> index eb5bed333750..8f57388b760b 100644
> --- a/arch/powerpc/platforms/book3s/vas-api.c
> +++ b/arch/powerpc/platforms/book3s/vas-api.c
> @@ -414,7 +414,7 @@ static vm_fault_t vas_mmap_fault(struct vm_fault *vmf)
> /*
> * When the LPAR lost credits due to core removal or during
> * migration, invalidate the existing mapping for the current
> - * paste addresses and set windows in-active (zap_page_range in
> + * paste addresses and set windows in-active (zap_vma_page_range in
> * reconfig_close_windows()).
> * New mapping will be done later after migration or new credits
> * available. So continue to receive faults if the user space
> diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
> index 4ad6e510d405..2aef8d9295a2 100644
> --- a/arch/powerpc/platforms/pseries/vas.c
> +++ b/arch/powerpc/platforms/pseries/vas.c
> @@ -760,7 +760,7 @@ static int reconfig_close_windows(struct vas_caps *vcap, int excess_creds,
> * is done before the original mmap() and after the ioctl.
> */
> if (vma)
> - zap_page_range(vma, vma->vm_start,
> + zap_vma_page_range(vma, vma->vm_start,
> vma->vm_end - vma->vm_start);
>
> mmap_write_unlock(task_ref->mm);
> diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
> index e410275918ac..a405119da2c0 100644
> --- a/arch/riscv/kernel/vdso.c
> +++ b/arch/riscv/kernel/vdso.c
> @@ -127,10 +127,10 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
> unsigned long size = vma->vm_end - vma->vm_start;
>
> if (vma_is_special_mapping(vma, vdso_info.dm))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
> #ifdef CONFIG_COMPAT
> if (vma_is_special_mapping(vma, compat_vdso_info.dm))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
> #endif
> }
>
> diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
> index ff7bf4432229..eccfcd505403 100644
> --- a/arch/s390/kernel/vdso.c
> +++ b/arch/s390/kernel/vdso.c
> @@ -63,7 +63,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
>
> if (!vma_is_special_mapping(vma, &vvar_mapping))
> continue;
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
> break;
> }
> mmap_read_unlock(mm);
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 74e1d873dce0..67d998152142 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -722,7 +722,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
> if (is_vm_hugetlb_page(vma))
> continue;
> size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
> - zap_page_range(vma, vmaddr, size);
> + zap_vma_page_range(vma, vmaddr, size);
> }
> mmap_read_unlock(gmap->mm);
> }
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index b8f3f9b9e53c..5aafbd19e869 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -116,7 +116,7 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
> unsigned long size = vma->vm_end - vma->vm_start;
>
> if (vma_is_special_mapping(vma, &vvar_mapping))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
> }
> mmap_read_unlock(mm);
>
> diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> index 4ad42b0f75cd..f7f10248c742 100644
> --- a/drivers/android/binder_alloc.c
> +++ b/drivers/android/binder_alloc.c
> @@ -1019,7 +1019,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
> if (vma) {
> trace_binder_unmap_user_start(alloc, index);
>
> - zap_page_range(vma, page_addr, PAGE_SIZE);
> + zap_vma_page_range(vma, page_addr, PAGE_SIZE);
>
> trace_binder_unmap_user_end(alloc, index);
> }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6b28eb9c6ea2..706efaf95783 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1980,10 +1980,13 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>
> void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
> unsigned long size);
> -void zap_page_range(struct vm_area_struct *vma, unsigned long address,
> - unsigned long size);
> void zap_page_range_single(struct vm_area_struct *vma, unsigned long address,
> unsigned long size, struct zap_details *details);
> +static inline void zap_vma_page_range(struct vm_area_struct *vma,
> + unsigned long address, unsigned long size)
> +{
> + zap_page_range_single(vma, address, size, NULL);
> +}
> void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
> struct vm_area_struct *start_vma, unsigned long start,
> unsigned long end);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 87703a19bbef..3c4d9829d4e1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -787,7 +787,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> * Application no longer needs these pages. If the pages are dirty,
> * it's OK to just throw them away. The app will be more careful about
> * data it wants to keep. Be sure to free swap resources too. The
> - * zap_page_range_single call sets things up for shrink_active_list to actually
> + * zap_vma_page_range call sets things up for shrink_active_list to actually
> * free these pages later if no one else has touched them in the meantime,
> * although we could add these pages to a global reuse list for
> * shrink_active_list to pick up before reclaiming other pages.
> @@ -805,7 +805,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> unsigned long start, unsigned long end)
> {
> - zap_page_range_single(vma, start, end - start, NULL);
> + zap_vma_page_range(vma, start, end - start);
> return 0;
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5b2c137dfb2a..e953a0108278 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1687,36 +1687,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
> mmu_notifier_invalidate_range_end(&range);
> }
>
> -/**
> - * zap_page_range - remove user pages in a given range
> - * @vma: vm_area_struct holding the applicable pages
> - * @start: starting address of pages to zap
> - * @size: number of bytes to zap
> - *
> - * Caller must protect the VMA list
> - */
> -void zap_page_range(struct vm_area_struct *vma, unsigned long start,
> - unsigned long size)
> -{
> - struct maple_tree *mt = &vma->vm_mm->mm_mt;
> - unsigned long end = start + size;
> - struct mmu_notifier_range range;
> - struct mmu_gather tlb;
> - MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
> -
> - lru_add_drain();
> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> - start, start + size);
> - tlb_gather_mmu(&tlb, vma->vm_mm);
> - update_hiwater_rss(vma->vm_mm);
> - mmu_notifier_invalidate_range_start(&range);
> - do {
> - unmap_single_vma(&tlb, vma, start, range.end, NULL);
> - } while ((vma = mas_find(&mas, end - 1)) != NULL);
> - mmu_notifier_invalidate_range_end(&range);
> - tlb_finish_mmu(&tlb);
> -}
> -
> /**
> * zap_page_range_single - remove user pages in a given range
> * @vma: vm_area_struct holding the applicable pages
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ad608ef2a243..bd9fe6ff6557 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2713,7 +2713,7 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
> *
> * The caller must hold lock_page_memcg(). Most callers have the folio
> * locked. A few have the folio blocked from truncation through other
> - * means (eg zap_page_range() has it mapped and is holding the page table
> + * means (eg zap_vma_page_range() has it mapped and is holding the page table
> * lock). This can also be called from mark_buffer_dirty(), which I
> * cannot prove is always protected against truncate.
> */
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index c567d5e8053e..afaad3cfed00 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2092,7 +2092,7 @@ static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
> maybe_zap_len = total_bytes_to_map - /* All bytes to map */
> *length + /* Mapped or pending */
> (pages_remaining * PAGE_SIZE); /* Failed map. */
> - zap_page_range(vma, *address, maybe_zap_len);
> + zap_vma_page_range(vma, *address, maybe_zap_len);
> err = 0;
> }
>
> @@ -2100,7 +2100,7 @@ static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
> unsigned long leftover_pages = pages_remaining;
> int bytes_mapped;
>
> - /* We called zap_page_range, try to reinsert. */
> + /* We called zap_vma_page_range, try to reinsert. */
> err = vm_insert_pages(vma, *address,
> pending_pages,
> &pages_remaining);
> @@ -2234,7 +2234,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
> total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
> if (total_bytes_to_map) {
> if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
> - zap_page_range(vma, address, total_bytes_to_map);
> + zap_vma_page_range(vma, address, total_bytes_to_map);
> zc->length = total_bytes_to_map;
> zc->recv_skip_hint = 0;
> } else {
> --
> 2.38.1

--
Michal Hocko
SUSE Labs

2022-12-19 19:53:34

by Mike Kravetz

Subject: Re: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range

On 12/19/22 13:06, Michal Hocko wrote:
> On Fri 16-12-22 11:20:12, Mike Kravetz wrote:
> > zap_page_range was originally designed to unmap pages within an address
> > range that could span multiple vmas. While working on [1], it was
> > discovered that all callers of zap_page_range pass a range entirely within
> > a single vma. In addition, the mmu notification call within zap_page
> > range does not correctly handle ranges that span multiple vmas as calls
> > should be vma specific.
>
> Could you spend a sentence or two explaining what is wrong here?

Hmmmm? My assumption was that the range passed to mmu_notifier_range_init()
was supposed to be within the specified vma. When looking into the notifier
routines, I could not find any documentation about the usage of the vma within
the mmu_notifier_range structure. It was introduced with commit bf198b2b34bf
"mm/mmu_notifier: pass down vma and reasons why mmu notifier is happening".
However, I do not see this being used today.
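
To be concrete, the pattern I was worried about is in the zap_page_range
body removed by this patch: the notifier range is initialized against the
first vma only, while the unmap loop can walk on to later vmas. Trimmed
from the removed function (the comment is mine, added for emphasis):

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				start, start + size);
	...
	mmu_notifier_invalidate_range_start(&range);
	do {
		/* vma here may no longer be the one recorded in 'range' */
		unmap_single_vma(&tlb, vma, start, range.end, NULL);
	} while ((vma = mas_find(&mas, end - 1)) != NULL);
	mmu_notifier_invalidate_range_end(&range);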

Of course, I could be missing something, so adding Jérôme.

>
> > Instead of fixing zap_page_range, change all callers to use the new
> > routine zap_vma_page_range. zap_vma_page_range is just a wrapper around
> > zap_page_range_single passing in NULL zap details. The name is also
> > more in line with other exported routines that operate within a vma.
> > We can then remove zap_page_range.
>
> I would stick with zap_page_range_single rather than adding a new
> wrapper but nothing really critical.

I am fine with doing that as well. My only reason for the wrapper is that all
callers outside mm/memory.c would pass in NULL zap details.
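
That is, the only difference at a call site is dropping the trailing NULL
(both forms are visible in the madvise hunk of the patch):

	zap_page_range_single(vma, start, end - start, NULL);	/* without the wrapper */
	zap_vma_page_range(vma, start, end - start);		/* with the wrapper */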

>
> > Also, change madvise_dontneed_single_vma to use this new routine.
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
> > Suggested-by: Peter Xu <[email protected]>
> > Signed-off-by: Mike Kravetz <[email protected]>
>
> Other than that LGTM
> Acked-by: Michal Hocko <[email protected]>
>
> Thanks!

Thanks for taking a look.
--
Mike Kravetz

2022-12-20 18:09:04

by Peter Xu

Subject: Re: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range

On Fri, Dec 16, 2022 at 11:20:12AM -0800, Mike Kravetz wrote:
> zap_page_range was originally designed to unmap pages within an address
> range that could span multiple vmas. While working on [1], it was
> discovered that all callers of zap_page_range pass a range entirely within
> a single vma. In addition, the mmu notification call within zap_page
> range does not correctly handle ranges that span multiple vmas as calls
> should be vma specific.
>
> Instead of fixing zap_page_range, change all callers to use the new
> routine zap_vma_page_range. zap_vma_page_range is just a wrapper around
> zap_page_range_single passing in NULL zap details. The name is also
> more in line with other exported routines that operate within a vma.
> We can then remove zap_page_range.
>
> Also, change madvise_dontneed_single_vma to use this new routine.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> Suggested-by: Peter Xu <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>

Acked-by: Peter Xu <[email protected]>

Thanks!

--
Peter Xu

2022-12-21 03:38:44

by Michael Ellerman

Subject: Re: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range

Mike Kravetz <[email protected]> writes:
> zap_page_range was originally designed to unmap pages within an address
> range that could span multiple vmas. While working on [1], it was
> discovered that all callers of zap_page_range pass a range entirely within
> a single vma. In addition, the mmu notification call within zap_page
> range does not correctly handle ranges that span multiple vmas as calls
> should be vma specific.
>
> Instead of fixing zap_page_range, change all callers to use the new
> routine zap_vma_page_range. zap_vma_page_range is just a wrapper around
> zap_page_range_single passing in NULL zap details. The name is also
> more in line with other exported routines that operate within a vma.
> We can then remove zap_page_range.
>
> Also, change madvise_dontneed_single_vma to use this new routine.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> Suggested-by: Peter Xu <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> arch/arm64/kernel/vdso.c | 4 ++--
> arch/powerpc/kernel/vdso.c | 2 +-
> arch/powerpc/platforms/book3s/vas-api.c | 2 +-
> arch/powerpc/platforms/pseries/vas.c | 2 +-

Acked-by: Michael Ellerman <[email protected]> (powerpc)

cheers

2022-12-23 17:51:16

by Christoph Hellwig

Subject: Re: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range

> unsigned long size = vma->vm_end - vma->vm_start;
>
> if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
> #ifdef CONFIG_COMPAT_VDSO
> if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
> #endif

So for something called zap_vma_page_range I'd expect to just pass
the vma and zap all of it, which this and many other callers want
anyway.

> +++ b/arch/s390/mm/gmap.c
> @@ -722,7 +722,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
> if (is_vm_hugetlb_page(vma))
> continue;
> size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
> - zap_page_range(vma, vmaddr, size);
> + zap_vma_page_range(vma, vmaddr, size);

And then just call zap_page_range_single directly for those that
don't want to zap the entire vma.
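
I.e., something along these lines (name and placement purely illustrative,
untested sketch):

	/* zap every user page covered by the vma */
	static inline void zap_vma_pages(struct vm_area_struct *vma)
	{
		zap_page_range_single(vma, vma->vm_start,
				      vma->vm_end - vma->vm_start, NULL);
	}

while callers like gmap_discard() that really do zap a sub-range would
just open code zap_page_range_single(vma, vmaddr, size, NULL).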

2022-12-23 21:31:29

by Mike Kravetz

Subject: Re: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range

On 12/23/22 08:27, Christoph Hellwig wrote:
> > unsigned long size = vma->vm_end - vma->vm_start;
> >
> > if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
> > - zap_page_range(vma, vma->vm_start, size);
> > + zap_vma_page_range(vma, vma->vm_start, size);
> > #ifdef CONFIG_COMPAT_VDSO
> > if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA32].dm))
> > - zap_page_range(vma, vma->vm_start, size);
> > + zap_vma_page_range(vma, vma->vm_start, size);
> > #endif
>
> So for something called zap_vma_page_range I'd expect to just pass
> the vma and zap all of it, which this and many other callers want
> anyway.
>
> > +++ b/arch/s390/mm/gmap.c
> > @@ -722,7 +722,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
> > if (is_vm_hugetlb_page(vma))
> > continue;
> > size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
> > - zap_page_range(vma, vmaddr, size);
> > + zap_vma_page_range(vma, vmaddr, size);
>
> And then just call zap_page_range_single directly for those that
> don't want to zap the entire vma.

Thanks!

This sounds like a good idea and I will incorporate it in a new patch.

--
Mike Kravetz

2022-12-29 16:44:58

by Palmer Dabbelt

Subject: Re: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range

On Fri, 16 Dec 2022 11:20:12 PST (-0800), [email protected] wrote:
> zap_page_range was originally designed to unmap pages within an address
> range that could span multiple vmas. While working on [1], it was
> discovered that all callers of zap_page_range pass a range entirely within
> a single vma. In addition, the mmu notification call within zap_page
> range does not correctly handle ranges that span multiple vmas as calls
> should be vma specific.
>
> Instead of fixing zap_page_range, change all callers to use the new
> routine zap_vma_page_range. zap_vma_page_range is just a wrapper around
> zap_page_range_single passing in NULL zap details. The name is also
> more in line with other exported routines that operate within a vma.
> We can then remove zap_page_range.
>
> Also, change madvise_dontneed_single_vma to use this new routine.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> Suggested-by: Peter Xu <[email protected]>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> arch/arm64/kernel/vdso.c | 4 ++--
> arch/powerpc/kernel/vdso.c | 2 +-
> arch/powerpc/platforms/book3s/vas-api.c | 2 +-
> arch/powerpc/platforms/pseries/vas.c | 2 +-
> arch/riscv/kernel/vdso.c | 4 ++--
> arch/s390/kernel/vdso.c | 2 +-
> arch/s390/mm/gmap.c | 2 +-
> arch/x86/entry/vdso/vma.c | 2 +-
> drivers/android/binder_alloc.c | 2 +-
> include/linux/mm.h | 7 ++++--
> mm/madvise.c | 4 ++--
> mm/memory.c | 30 -------------------------
> mm/page-writeback.c | 2 +-
> net/ipv4/tcp.c | 6 ++---
> 14 files changed, 22 insertions(+), 49 deletions(-)

[snip]

> diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
> index e410275918ac..a405119da2c0 100644
> --- a/arch/riscv/kernel/vdso.c
> +++ b/arch/riscv/kernel/vdso.c
> @@ -127,10 +127,10 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
> unsigned long size = vma->vm_end - vma->vm_start;
>
> if (vma_is_special_mapping(vma, vdso_info.dm))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
> #ifdef CONFIG_COMPAT
> if (vma_is_special_mapping(vma, compat_vdso_info.dm))
> - zap_page_range(vma, vma->vm_start, size);
> + zap_vma_page_range(vma, vma->vm_start, size);
> #endif
> }

Acked-by: Palmer Dabbelt <[email protected]> # RISC-V

Thanks!

2023-01-05 01:55:40

by Alistair Popple

Subject: Re: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range


Mike Kravetz <[email protected]> writes:

> On 12/19/22 13:06, Michal Hocko wrote:
>> On Fri 16-12-22 11:20:12, Mike Kravetz wrote:
>> > zap_page_range was originally designed to unmap pages within an address
>> > range that could span multiple vmas. While working on [1], it was
>> > discovered that all callers of zap_page_range pass a range entirely within
>> > a single vma. In addition, the mmu notification call within zap_page
>> > range does not correctly handle ranges that span multiple vmas as calls
>> > should be vma specific.
>>
>> Could you spend a sentence or two explaining what is wrong here?
>
> Hmmmm? My assumption was that the range passed to mmu_notifier_range_init()
> was supposed to be within the specified vma. When looking into the notifier
> routines, I could not find any documentation about the usage of the vma within
> the mmu_notifier_range structure. It was introduced with commit bf198b2b34bf
> "mm/mmu_notifier: pass down vma and reasons why mmu notifier is happening".
> However, I do not see this being used today.
>
> Of course, I could be missing something, so adding Jérôme.

The only use for mmu_notifier_range->vma I can find is in
mmu_notifier_range_update_to_read_only() which was introduced in
c6d23413f81b ("mm/mmu_notifier: mmu_notifier_range_update_to_read_only()
helper"). However there are no users of that symbol so I think we can
remove it along with the mmu_notifier_range->vma field.

I will put together a patch to do that.
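
For reference, that helper is the only reader of the field and is tiny;
roughly (abridged from mm/mmu_notifier.c, quoting from memory so the exact
body may differ):

	bool
	mmu_notifier_range_update_to_read_only(const struct mmu_notifier_range *range)
	{
		if (!range->vma || range->event != MMU_NOTIFY_PROTECTION_VMA)
			return false;

		return range->vma->vm_flags & VM_READ;
	}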

>>
>> > Instead of fixing zap_page_range, change all callers to use the new
>> > routine zap_vma_page_range. zap_vma_page_range is just a wrapper around
>> > zap_page_range_single passing in NULL zap details. The name is also
>> > more in line with other exported routines that operate within a vma.
>> > We can then remove zap_page_range.
>>
>> I would stick with zap_page_range_single rather than adding a new
>> wrapper but nothing really critical.
>
> I am fine with doing that as well. My only reason for the wrapper is that all
> callers outside mm/memory.c would pass in NULL zap details.
>
>>
>> > Also, change madvise_dontneed_single_vma to use this new routine.
>> >
>> > [1] https://lore.kernel.org/linux-mm/[email protected]/
>> > Suggested-by: Peter Xu <[email protected]>
>> > Signed-off-by: Mike Kravetz <[email protected]>
>>
>> Other than that LGTM
>> Acked-by: Michal Hocko <[email protected]>
>>
>> Thanks!
>
> Thanks for taking a look.