2023-07-17 14:39:22

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v1 0/3] Optimize large folio interaction with deferred split

Hi All,

This is a small series in support of my work to enable the use of large folios
for anonymous memory (currently called "FLEXIBLE_THP") [1]. It first makes it
possible to add large, non-pmd-mappable folios to the deferred split queue. Then
it modifies zap_pte_range() to batch-remove spans of physically contiguous pages
from the rmap, which means that in the common case, we elide the need to ever
put the folio on the deferred split queue, thus reducing lock contention and
improving performance.

This becomes more visible once there are lots of large anonymous folios in the
system, and Huang Ying has suggested that solving it should be a prerequisite
for merging the main FLEXIBLE_THP work.

The series applies on top of v6.5-rc2 and a branch is available at [2].

I don't have a full test run with the latest versions of all the patches on
top of the latest baseline, so I'm not posting results formally; I can get
them if people feel they are necessary. Anecdotally, for the kernel
compilation workload, this series reduces kernel time by ~4% and real time by
~0.4%, compared with [1].

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/deferredsplit-lkml_v1

Thanks,
Ryan


Ryan Roberts (3):
mm: Allow deferred splitting of arbitrary large anon folios
mm: Implement folio_remove_rmap_range()
mm: Batch-zap large anonymous folio PTE mappings

include/linux/rmap.h | 2 +
mm/memory.c | 119 +++++++++++++++++++++++++++++++++++++++++++
mm/rmap.c | 67 +++++++++++++++++++++++-
3 files changed, 187 insertions(+), 1 deletion(-)

--
2.25.1



2023-07-17 14:57:35

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()

Like page_remove_rmap(), but batch-removes the rmap for a range of pages
belonging to a folio. This can provide a small speedup due to less
manipulation of the various counters. But more crucially, when the rmap for
all pages of a folio is removed in one batch, there is no need to
(spuriously) add the folio to the deferred split list, which saves significant
cost when there is contention for the split queue lock.

All contained pages are accounted using the order-0 folio (or base page)
scheme.

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/rmap.h | 2 ++
mm/rmap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 67 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b87d01660412..f578975c12c0 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -200,6 +200,8 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
bool compound);
void page_remove_rmap(struct page *, struct vm_area_struct *,
bool compound);
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+ int nr, struct vm_area_struct *vma);

void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long address, rmap_t flags);
diff --git a/mm/rmap.c b/mm/rmap.c
index 2baf57d65c23..1da05aca2bb1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1359,6 +1359,71 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
mlock_vma_folio(folio, vma, compound);
}

+/*
+ * folio_remove_rmap_range - take down pte mappings from a range of pages
+ * belonging to a folio. All pages are accounted as small pages.
+ * @folio: folio that all pages belong to
+ * @page: first page in range to remove mapping from
+ * @nr: number of pages in range to remove mapping from
+ * @vma: the vm area from which the mapping is removed
+ *
+ * The caller needs to hold the pte lock.
+ */
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+ int nr, struct vm_area_struct *vma)
+{
+ atomic_t *mapped = &folio->_nr_pages_mapped;
+ int nr_unmapped = 0;
+ int nr_mapped;
+ bool last;
+ enum node_stat_item idx;
+
+ if (unlikely(folio_test_hugetlb(folio))) {
+ VM_WARN_ON_FOLIO(1, folio);
+ return;
+ }
+
+ if (!folio_test_large(folio)) {
+ /* Is this the page's last map to be removed? */
+ last = atomic_add_negative(-1, &page->_mapcount);
+ nr_unmapped = last;
+ } else {
+ for (; nr != 0; nr--, page++) {
+ /* Is this the page's last map to be removed? */
+ last = atomic_add_negative(-1, &page->_mapcount);
+ if (last) {
+ /* Page still mapped if folio mapped entirely */
+ nr_mapped = atomic_dec_return_relaxed(mapped);
+ if (nr_mapped < COMPOUND_MAPPED)
+ nr_unmapped++;
+ }
+ }
+ }
+
+ if (nr_unmapped) {
+ idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
+ __lruvec_stat_mod_folio(folio, idx, -nr_unmapped);
+
+ /*
+ * Queue anon THP for deferred split if we have just unmapped at
+ * least 1 page, while at least 1 page remains mapped.
+ */
+ if (folio_test_large(folio) && folio_test_anon(folio))
+ if (nr_mapped)
+ deferred_split_folio(folio);
+ }
+
+ /*
+ * It would be tidy to reset folio_test_anon mapping when fully
+ * unmapped, but that might overwrite a racing page_add_anon_rmap
+ * which increments mapcount after us but sets mapping before us:
+ * so leave the reset to free_pages_prepare, and remember that
+ * it's only reliable while mapped.
+ */
+
+ munlock_vma_folio(folio, vma, false);
+}
+
/**
* page_remove_rmap - take down pte mapping from a page
* @page: page to remove mapping from
--
2.25.1


2023-07-17 14:58:18

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

In preparation for the introduction of large folios for anonymous
memory, we would like to be able to split them when they have unmapped
subpages, in order to free those unused pages under memory pressure. So
remove the artificial requirement that the large folio needed to be at
least PMD-sized.

Signed-off-by: Ryan Roberts <[email protected]>
Reviewed-by: Yu Zhao <[email protected]>
Reviewed-by: Yin Fengwei <[email protected]>
---
mm/rmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 0c0d8857dfce..2baf57d65c23 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1430,7 +1430,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
* page of the folio is unmapped and at least one page
* is still mapped.
*/
- if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
+ if (folio_test_large(folio) && folio_test_anon(folio))
if (!compound || nr < nr_pmdmapped)
deferred_split_folio(folio);
}
--
2.25.1


2023-07-17 14:59:29

by Ryan Roberts

[permalink] [raw]
Subject: [PATCH v1 3/3] mm: Batch-zap large anonymous folio PTE mappings

This allows batching the rmap removal with folio_remove_rmap_range(),
which means we avoid spuriously adding a partially unmapped folio to the
deferred split queue in the common case, which reduces split queue lock
contention.

Previously each page was removed from the rmap individually with
page_remove_rmap(). If the first page belonged to a large folio, this
would cause page_remove_rmap() to conclude that the folio was now
partially mapped and add the folio to the deferred split queue. But
subsequent calls would cause the folio to become fully unmapped, meaning
there is no value to adding it to the split queue.

Signed-off-by: Ryan Roberts <[email protected]>
---
mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 119 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 01f39e8144ef..6facb8c8807a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1391,6 +1391,95 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
}

+static inline unsigned long page_addr(struct page *page,
+ struct page *anchor, unsigned long anchor_addr)
+{
+ unsigned long offset;
+ unsigned long addr;
+
+ offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+ addr = anchor_addr + offset;
+
+ if (anchor > page) {
+ if (addr > anchor_addr)
+ return 0;
+ } else {
+ if (addr < anchor_addr)
+ return ULONG_MAX;
+ }
+
+ return addr;
+}
+
+static int calc_anon_folio_map_pgcount(struct folio *folio,
+ struct page *page, pte_t *pte,
+ unsigned long addr, unsigned long end)
+{
+ pte_t ptent;
+ int floops;
+ int i;
+ unsigned long pfn;
+
+ end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
+ end);
+ floops = (end - addr) >> PAGE_SHIFT;
+ pfn = page_to_pfn(page);
+ pfn++;
+ pte++;
+
+ for (i = 1; i < floops; i++) {
+ ptent = ptep_get(pte);
+
+ if (!pte_present(ptent) ||
+ pte_pfn(ptent) != pfn) {
+ return i;
+ }
+
+ pfn++;
+ pte++;
+ }
+
+ return floops;
+}
+
+static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ struct page *page, pte_t *pte,
+ unsigned long addr, unsigned long end,
+ bool *full_out)
+{
+ struct folio *folio = page_folio(page);
+ struct mm_struct *mm = tlb->mm;
+ pte_t ptent;
+ int pgcount;
+ int i;
+ bool full;
+
+ pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
+
+ for (i = 0; i < pgcount;) {
+ ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ full = __tlb_remove_page(tlb, page, 0);
+
+ if (unlikely(page_mapcount(page) < 1))
+ print_bad_pte(vma, addr, ptent, page);
+
+ i++;
+ page++;
+ pte++;
+ addr += PAGE_SIZE;
+
+ if (unlikely(full))
+ break;
+ }
+
+ folio_remove_rmap_range(folio, page - i, i, vma);
+
+ *full_out = full;
+ return i;
+}
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -1428,6 +1517,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
page = vm_normal_page(vma, addr, ptent);
if (unlikely(!should_zap_page(details, page)))
continue;
+
+ /*
+ * Batch zap large anonymous folio mappings. This allows
+ * batching the rmap removal, which means we avoid
+ * spuriously adding a partially unmapped folio to the
+ * deferred split queue in the common case, which
+ * reduces split queue lock contention. Require the VMA
+ * to be anonymous to ensure that none of the PTEs in
+ * the range require zap_install_uffd_wp_if_needed().
+ */
+ if (page && PageAnon(page) && vma_is_anonymous(vma)) {
+ bool full;
+ int pgcount;
+
+ pgcount = zap_anon_pte_range(tlb, vma,
+ page, pte, addr, end, &full);
+
+ rss[mm_counter(page)] -= pgcount;
+ pgcount--;
+ pte += pgcount;
+ addr += pgcount << PAGE_SHIFT;
+
+ if (unlikely(full)) {
+ force_flush = 1;
+ addr += PAGE_SIZE;
+ break;
+ }
+ continue;
+ }
+
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
--
2.25.1


2023-07-17 15:28:28

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()

On Mon, Jul 17, 2023 at 03:31:09PM +0100, Ryan Roberts wrote:
> +/*
> + * folio_remove_rmap_range - take down pte mappings from a range of pages
> + * belonging to a folio. All pages are accounted as small pages.
> + * @folio: folio that all pages belong to
> + * @page: first page in range to remove mapping from
> + * @nr: number of pages in range to remove mapping from
> + * @vma: the vm area from which the mapping is removed
> + *
> + * The caller needs to hold the pte lock.
> + */

This could stand a little reworking. How about this?

/**
* folio_remove_rmap_range - Take down PTE mappings from a range of pages.
* @folio: Folio containing all pages in range.
* @page: First page in range to unmap.
* @nr: Number of pages to unmap.
* @vma: The VM area containing the range.
*
* All pages in the range must belong to the same VMA & folio. They
* must be mapped with PTEs, not a PMD.
*
* Context: Caller holds the pte lock.
*/

> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
> + int nr, struct vm_area_struct *vma)
> +{
> + atomic_t *mapped = &folio->_nr_pages_mapped;
> + int nr_unmapped = 0;
> + int nr_mapped;
> + bool last;
> + enum node_stat_item idx;
> +
> + if (unlikely(folio_test_hugetlb(folio))) {
> + VM_WARN_ON_FOLIO(1, folio);
> + return;
> + }
> +
> + if (!folio_test_large(folio)) {
> + /* Is this the page's last map to be removed? */
> + last = atomic_add_negative(-1, &page->_mapcount);
> + nr_unmapped = last;
> + } else {
> + for (; nr != 0; nr--, page++) {
> + /* Is this the page's last map to be removed? */
> + last = atomic_add_negative(-1, &page->_mapcount);
> + if (last) {
> + /* Page still mapped if folio mapped entirely */
> + nr_mapped = atomic_dec_return_relaxed(mapped);

We're still doing one atomic op per page on the folio's nr_pages_mapped
... is it possible to batch this and use atomic_sub_return_relaxed()?


2023-07-17 15:30:04

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()

On 17 Jul 2023, at 10:31, Ryan Roberts wrote:

> Like page_remove_rmap() but batch-removes the rmap for a range of pages
> belonging to a folio. This can provide a small speedup due to less
> manipuation of the various counters. But more crucially, if removing the
> rmap for all pages of a folio in a batch, there is no need to
> (spuriously) add it to the deferred split list, which saves significant
> cost when there is contention for the split queue lock.
>
> All contained pages are accounted using the order-0 folio (or base page)
> scheme.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/rmap.h | 2 ++
> mm/rmap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 67 insertions(+)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b87d01660412..f578975c12c0 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -200,6 +200,8 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
> bool compound);
> void page_remove_rmap(struct page *, struct vm_area_struct *,
> bool compound);
> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
> + int nr, struct vm_area_struct *vma);
>
> void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
> unsigned long address, rmap_t flags);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2baf57d65c23..1da05aca2bb1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1359,6 +1359,71 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> mlock_vma_folio(folio, vma, compound);
> }
>
> +/*
> + * folio_remove_rmap_range - take down pte mappings from a range of pages
> + * belonging to a folio. All pages are accounted as small pages.
> + * @folio: folio that all pages belong to
> + * @page: first page in range to remove mapping from
> + * @nr: number of pages in range to remove mapping from

We might need some checks to make sure [page, page+nr] is in the range of
the folio. Something like:

page >= &folio->page && page + nr < (&folio->page + folio_nr_pages(folio))

> + * @vma: the vm area from which the mapping is removed
> + *
> + * The caller needs to hold the pte lock.
> + */
> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
> + int nr, struct vm_area_struct *vma)
> +{
> + atomic_t *mapped = &folio->_nr_pages_mapped;
> + int nr_unmapped = 0;
> + int nr_mapped;
> + bool last;
> + enum node_stat_item idx;
> +
> + if (unlikely(folio_test_hugetlb(folio))) {
> + VM_WARN_ON_FOLIO(1, folio);
> + return;
> + }
> +
> + if (!folio_test_large(folio)) {
> + /* Is this the page's last map to be removed? */
> + last = atomic_add_negative(-1, &page->_mapcount);
> + nr_unmapped = last;
> + } else {
> + for (; nr != 0; nr--, page++) {
> + /* Is this the page's last map to be removed? */
> + last = atomic_add_negative(-1, &page->_mapcount);
> + if (last) {
> + /* Page still mapped if folio mapped entirely */
> + nr_mapped = atomic_dec_return_relaxed(mapped);
> + if (nr_mapped < COMPOUND_MAPPED)
> + nr_unmapped++;
> + }
> + }
> + }
> +
> + if (nr_unmapped) {
> + idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
> + __lruvec_stat_mod_folio(folio, idx, -nr_unmapped);
> +
> + /*
> + * Queue anon THP for deferred split if we have just unmapped at
> + * least 1 page, while at least 1 page remains mapped.
> + */
> + if (folio_test_large(folio) && folio_test_anon(folio))
> + if (nr_mapped)
> + deferred_split_folio(folio);
> + }
> +
> + /*
> + * It would be tidy to reset folio_test_anon mapping when fully
> + * unmapped, but that might overwrite a racing page_add_anon_rmap
> + * which increments mapcount after us but sets mapping before us:
> + * so leave the reset to free_pages_prepare, and remember that
> + * it's only reliable while mapped.
> + */
> +
> + munlock_vma_folio(folio, vma, false);
> +}
> +
> /**
> * page_remove_rmap - take down pte mapping from a page
> * @page: page to remove mapping from
> --
> 2.25.1

Everything else looks good to me. Reviewed-by: Zi Yan <[email protected]>

--
Best Regards,
Yan, Zi



2023-07-17 15:49:52

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On Mon, Jul 17, 2023 at 03:31:08PM +0100, Ryan Roberts wrote:
> In preparation for the introduction of large folios for anonymous
> memory, we would like to be able to split them when they have unmapped
> subpages, in order to free those unused pages under memory pressure. So
> remove the artificial requirement that the large folio needed to be at
> least PMD-sized.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> Reviewed-by: Yu Zhao <[email protected]>
> Reviewed-by: Yin Fengwei <[email protected]>

Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

> */
> - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
> + if (folio_test_large(folio) && folio_test_anon(folio))
> if (!compound || nr < nr_pmdmapped)
> deferred_split_folio(folio);

I wonder if it's worth introducing a folio_test_deferred_split() (better
naming appreciated ...) to allow us to allocate order-1 folios and not
do horrible things. Maybe it's not worth supporting order-1 folios;
we're always better off going to order-2 immediately. Just thinking.
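
For illustration only, a minimal sketch of what such a predicate could look
like (the name comes from the suggestion above; the order-based test is an
assumption, based on _deferred_list living in the folio's third struct page):

/*
 * Hypothetical helper; the exact test is an assumption. Only folios of
 * order >= 2 have a third struct page, so only those can carry a
 * _deferred_list and go on the deferred split queue.
 */
static inline bool folio_test_deferred_split(struct folio *folio)
{
	return folio_order(folio) >= 2;
}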

2023-07-17 16:08:36

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On Mon, Jul 17, 2023 at 05:43:40PM +0200, David Hildenbrand wrote:
> On 17.07.23 17:41, Ryan Roberts wrote:
> > On 17/07/2023 16:30, Matthew Wilcox wrote:
> > > On Mon, Jul 17, 2023 at 03:31:08PM +0100, Ryan Roberts wrote:
> > > > In preparation for the introduction of large folios for anonymous
> > > > memory, we would like to be able to split them when they have unmapped
> > > > subpages, in order to free those unused pages under memory pressure. So
> > > > remove the artificial requirement that the large folio needed to be at
> > > > least PMD-sized.
> > > >
> > > > Signed-off-by: Ryan Roberts <[email protected]>
> > > > Reviewed-by: Yu Zhao <[email protected]>
> > > > Reviewed-by: Yin Fengwei <[email protected]>
> > >
> > > Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
> >
> > Thanks!
> >
> > >
> > > > */
> > > > - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
> > > > + if (folio_test_large(folio) && folio_test_anon(folio))
> > > > if (!compound || nr < nr_pmdmapped)
> > > > deferred_split_folio(folio);
> > >
> > > I wonder if it's worth introducing a folio_test_deferred_split() (better
> > > naming appreciated ...) to allow us to allocate order-1 folios and not
> > > do horrible things. Maybe it's not worth supporting order-1 folios;
> > > we're always better off going to order-2 immediately. Just thinking.
> >
> > There is more than just _deferred_list in the 3rd page; you also have _flags_2a
> > and _head_2a. I guess you know much better than me what they store. But I'm
> > guessing its harder than jsut not splitting an order-1 page?

Those are page->flags and page->compound_head for the third page in
the folio. They don't really need a name; nothing refers to them,
but it's important that space not be reused ;-)

This is slightly different from _flags_1; we do have some flags which
reuse the bits (they're labelled as PF_SECOND). Right now, it's only
PG_has_hwpoisoned, but we used to have PG_double_map. Others may arise.

> > With the direction of large anon folios (_not_ retrying with every order down to
> > 0), I'm not sure what the use case would be for order-1 anyway?
>
> Just noting that we might need some struct-page space for better
> mapcount/shared tracking, which might get hard for order-1 pages.

My assumption had been that we'd be able to reuse the _entire_mapcount
and _nr_pages_mapped fields and not spill into the third page, but the
third page is definitely available today if we want it. I'm fine with
disallowing order-1 anon/file folios forever.

2023-07-17 16:13:16

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On 17/07/2023 16:30, Matthew Wilcox wrote:
> On Mon, Jul 17, 2023 at 03:31:08PM +0100, Ryan Roberts wrote:
>> In preparation for the introduction of large folios for anonymous
>> memory, we would like to be able to split them when they have unmapped
>> subpages, in order to free those unused pages under memory pressure. So
>> remove the artificial requirement that the large folio needed to be at
>> least PMD-sized.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> Reviewed-by: Yu Zhao <[email protected]>
>> Reviewed-by: Yin Fengwei <[email protected]>
>
> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

Thanks!

>
>> */
>> - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>> + if (folio_test_large(folio) && folio_test_anon(folio))
>> if (!compound || nr < nr_pmdmapped)
>> deferred_split_folio(folio);
>
> I wonder if it's worth introducing a folio_test_deferred_split() (better
> naming appreciated ...) to allow us to allocate order-1 folios and not
> do horrible things. Maybe it's not worth supporting order-1 folios;
> we're always better off going to order-2 immediately. Just thinking.

There is more than just _deferred_list in the 3rd page; you also have _flags_2a
and _head_2a. I guess you know much better than me what they store. But I'm
guessing it's harder than just not splitting an order-1 page?

With the direction of large anon folios (_not_ retrying with every order down to
0), I'm not sure what the use case would be for order-1 anyway?

2023-07-17 16:14:05

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()

On Mon, Jul 17, 2023 at 04:49:19PM +0100, Ryan Roberts wrote:
> > We're still doing one atomic op per page on the folio's nr_pages_mapped
> > ... is it possible to batch this and use atomic_sub_return_relaxed()?
>
> Good spot, something like this:
>
> 	} else {
> 		for (; nr != 0; nr--, page++) {
> 			/* Is this the page's last map to be removed? */
> 			last = atomic_add_negative(-1, &page->_mapcount);
> 			if (last)
> 				nr_unmapped++;
> 		}
>
> 		/* Pages still mapped if folio mapped entirely */
> 		nr_mapped = atomic_sub_return_relaxed(nr_unmapped, mapped);
> 		if (nr_mapped >= COMPOUND_MAPPED)
> 			nr_unmapped = 0;
> 	}

I think that's right, but my eyes always go slightly crossed trying to
read the new mapcount scheme.

2023-07-17 16:15:18

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On 17/07/2023 16:42, David Hildenbrand wrote:
> On 17.07.23 16:31, Ryan Roberts wrote:
>> In preparation for the introduction of large folios for anonymous
>> memory, we would like to be able to split them when they have unmapped
>> subpages, in order to free those unused pages under memory pressure. So
>> remove the artificial requirement that the large folio needed to be at
>> least PMD-sized.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> Reviewed-by: Yu Zhao <[email protected]>
>> Reviewed-by: Yin Fengwei <[email protected]>
>> ---
>>   mm/rmap.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 0c0d8857dfce..2baf57d65c23 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1430,7 +1430,7 @@ void page_remove_rmap(struct page *page, struct
>> vm_area_struct *vma,
>>            * page of the folio is unmapped and at least one page
>>            * is still mapped.
>>            */
>> -        if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>> +        if (folio_test_large(folio) && folio_test_anon(folio))
>>               if (!compound || nr < nr_pmdmapped)
>>                   deferred_split_folio(folio);
>
> !compound will always be true I guess, so nr_pmdmapped == 0 (which will always
> be the case) will be ignored.

I don't follow why !compound will always be true. This function is
page_remove_rmap() (not folio_remove_rmap_range() which I add in a later patch).
page_remove_rmap() can work on pmd-mapped pages where compound=true is passed in.

>
> Reviewed-by: David Hildenbrand <[email protected]>
>


2023-07-17 16:18:09

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On 17.07.23 16:31, Ryan Roberts wrote:
> In preparation for the introduction of large folios for anonymous
> memory, we would like to be able to split them when they have unmapped
> subpages, in order to free those unused pages under memory pressure. So
> remove the artificial requirement that the large folio needed to be at
> least PMD-sized.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> Reviewed-by: Yu Zhao <[email protected]>
> Reviewed-by: Yin Fengwei <[email protected]>
> ---
> mm/rmap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0c0d8857dfce..2baf57d65c23 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1430,7 +1430,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> * page of the folio is unmapped and at least one page
> * is still mapped.
> */
> - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
> + if (folio_test_large(folio) && folio_test_anon(folio))
> if (!compound || nr < nr_pmdmapped)
> deferred_split_folio(folio);

!compound will always be true I guess, so nr_pmdmapped == 0 (which will
always be the case) will be ignored.

Reviewed-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb


2023-07-17 16:19:12

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On 17.07.23 17:41, Ryan Roberts wrote:
> On 17/07/2023 16:30, Matthew Wilcox wrote:
>> On Mon, Jul 17, 2023 at 03:31:08PM +0100, Ryan Roberts wrote:
>>> In preparation for the introduction of large folios for anonymous
>>> memory, we would like to be able to split them when they have unmapped
>>> subpages, in order to free those unused pages under memory pressure. So
>>> remove the artificial requirement that the large folio needed to be at
>>> least PMD-sized.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> Reviewed-by: Yu Zhao <[email protected]>
>>> Reviewed-by: Yin Fengwei <[email protected]>
>>
>> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
>
> Thanks!
>
>>
>>> */
>>> - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>> + if (folio_test_large(folio) && folio_test_anon(folio))
>>> if (!compound || nr < nr_pmdmapped)
>>> deferred_split_folio(folio);
>>
>> I wonder if it's worth introducing a folio_test_deferred_split() (better
>> naming appreciated ...) to allow us to allocate order-1 folios and not
>> do horrible things. Maybe it's not worth supporting order-1 folios;
>> we're always better off going to order-2 immediately. Just thinking.
>
> There is more than just _deferred_list in the 3rd page; you also have _flags_2a
> and _head_2a. I guess you know much better than me what they store. But I'm
> guessing its harder than jsut not splitting an order-1 page?
>
> With the direction of large anon folios (_not_ retrying with every order down to
> 0), I'm not sure what the use case would be for order-1 anyway?

Just noting that we might need some struct-page space for better
mapcount/shared tracking, which might get hard for order-1 pages.

--
Cheers,

David / dhildenb


2023-07-17 16:21:55

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH v1 3/3] mm: Batch-zap large anonymous folio PTE mappings

On 17 Jul 2023, at 10:31, Ryan Roberts wrote:

> This allows batching the rmap removal with folio_remove_rmap_range(),
> which means we avoid spuriously adding a partially unmapped folio to the
> deferrred split queue in the common case, which reduces split queue lock
> contention.
>
> Previously each page was removed from the rmap individually with
> page_remove_rmap(). If the first page belonged to a large folio, this
> would cause page_remove_rmap() to conclude that the folio was now
> partially mapped and add the folio to the deferred split queue. But
> subsequent calls would cause the folio to become fully unmapped, meaning
> there is no value to adding it to the split queue.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 119 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 01f39e8144ef..6facb8c8807a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1391,6 +1391,95 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> }
>
> +static inline unsigned long page_addr(struct page *page,
> + struct page *anchor, unsigned long anchor_addr)
> +{
> + unsigned long offset;
> + unsigned long addr;
> +
> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
> + addr = anchor_addr + offset;
> +
> + if (anchor > page) {
> + if (addr > anchor_addr)
> + return 0;
> + } else {
> + if (addr < anchor_addr)
> + return ULONG_MAX;
> + }
> +
> + return addr;
> +}
> +
> +static int calc_anon_folio_map_pgcount(struct folio *folio,
> + struct page *page, pte_t *pte,
> + unsigned long addr, unsigned long end)
> +{
> + pte_t ptent;
> + int floops;
> + int i;
> + unsigned long pfn;
> +
> + end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
> + end);
> + floops = (end - addr) >> PAGE_SHIFT;
> + pfn = page_to_pfn(page);
> + pfn++;
> + pte++;
> +
> + for (i = 1; i < floops; i++) {
> + ptent = ptep_get(pte);
> +
> + if (!pte_present(ptent) ||
> + pte_pfn(ptent) != pfn) {
> + return i;
> + }
> +
> + pfn++;
> + pte++;
> + }
> +
> + return floops;
> +}
> +
> +static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
> + struct vm_area_struct *vma,
> + struct page *page, pte_t *pte,
> + unsigned long addr, unsigned long end,
> + bool *full_out)
> +{
> + struct folio *folio = page_folio(page);
> + struct mm_struct *mm = tlb->mm;
> + pte_t ptent;
> + int pgcount;
> + int i;
> + bool full;
> +
> + pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
> +
> + for (i = 0; i < pgcount;) {
> + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + full = __tlb_remove_page(tlb, page, 0);
> +
> + if (unlikely(page_mapcount(page) < 1))
> + print_bad_pte(vma, addr, ptent, page);
> +
> + i++;
> + page++;
> + pte++;
> + addr += PAGE_SIZE;
> +
> + if (unlikely(full))
> + break;
> + }
> +
> + folio_remove_rmap_range(folio, page - i, i, vma);
> +
> + *full_out = full;
> + return i;
> +}
> +
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> @@ -1428,6 +1517,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> page = vm_normal_page(vma, addr, ptent);
> if (unlikely(!should_zap_page(details, page)))
> continue;
> +
> + /*
> + * Batch zap large anonymous folio mappings. This allows
> + * batching the rmap removal, which means we avoid
> + * spuriously adding a partially unmapped folio to the
> + * deferrred split queue in the common case, which
> + * reduces split queue lock contention. Require the VMA
> + * to be anonymous to ensure that none of the PTEs in
> + * the range require zap_install_uffd_wp_if_needed().
> + */
> + if (page && PageAnon(page) && vma_is_anonymous(vma)) {
> + bool full;
> + int pgcount;
> +
> + pgcount = zap_anon_pte_range(tlb, vma,
> + page, pte, addr, end, &full);

Are you trying to zap as many ptes as possible if all these ptes are
within a folio? If so, why not calculate end before calling zap_anon_pte_range()?
That would make zap_anon_pte_range() simpler. Also check if page is part of
a large folio first to make sure you can batch.
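
For illustration only, a minimal sketch of clamping the end in the caller
before the call (the helper name is made up; it assumes folio_page_idx() and
folio_nr_pages() behave as in current mainline):

/* Hypothetical sketch: bound the zap range to the end of this folio. */
static inline unsigned long folio_bound_end(struct folio *folio,
				struct page *page, unsigned long addr,
				unsigned long end)
{
	unsigned long nr = folio_nr_pages(folio) - folio_page_idx(folio, page);

	return min(end, addr + (nr << PAGE_SHIFT));
}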

> +
> + rss[mm_counter(page)] -= pgcount;
> + pgcount--;
> + pte += pgcount;
> + addr += pgcount << PAGE_SHIFT;
> +
> + if (unlikely(full)) {
> + force_flush = 1;
> + addr += PAGE_SIZE;
> + break;
> + }
> + continue;
> + }
> +
> ptent = ptep_get_and_clear_full(mm, addr, pte,
> tlb->fullmm);
> tlb_remove_tlb_entry(tlb, pte, addr);
> --
> 2.25.1


--
Best Regards,
Yan, Zi



2023-07-17 16:26:28

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()

On 17/07/2023 16:09, Zi Yan wrote:
> On 17 Jul 2023, at 10:31, Ryan Roberts wrote:
>
>> Like page_remove_rmap() but batch-removes the rmap for a range of pages
>> belonging to a folio. This can provide a small speedup due to less
>> manipuation of the various counters. But more crucially, if removing the
>> rmap for all pages of a folio in a batch, there is no need to
>> (spuriously) add it to the deferred split list, which saves significant
>> cost when there is contention for the split queue lock.
>>
>> All contained pages are accounted using the order-0 folio (or base page)
>> scheme.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> include/linux/rmap.h | 2 ++
>> mm/rmap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 67 insertions(+)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index b87d01660412..f578975c12c0 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -200,6 +200,8 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
>> bool compound);
>> void page_remove_rmap(struct page *, struct vm_area_struct *,
>> bool compound);
>> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
>> + int nr, struct vm_area_struct *vma);
>>
>> void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
>> unsigned long address, rmap_t flags);
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2baf57d65c23..1da05aca2bb1 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1359,6 +1359,71 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
>> mlock_vma_folio(folio, vma, compound);
>> }
>>
>> +/*
>> + * folio_remove_rmap_range - take down pte mappings from a range of pages
>> + * belonging to a folio. All pages are accounted as small pages.
>> + * @folio: folio that all pages belong to
>> + * @page: first page in range to remove mapping from
>> + * @nr: number of pages in range to remove mapping from
>
> We might need some checks to make sure [page, page+nr] is in the range of
> the folio. Something like:
>
> page >= &folio->page && page + nr < (&folio->page + folio_nr_pages(folio))

No problem. Is a VM_WARN_ON() appropriate for something like this?

>
>> + * @vma: the vm area from which the mapping is removed
>> + *
>> + * The caller needs to hold the pte lock.
>> + */
>> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
>> + int nr, struct vm_area_struct *vma)
>> +{
>> + atomic_t *mapped = &folio->_nr_pages_mapped;
>> + int nr_unmapped = 0;
>> + int nr_mapped;
>> + bool last;
>> + enum node_stat_item idx;
>> +
>> + if (unlikely(folio_test_hugetlb(folio))) {
>> + VM_WARN_ON_FOLIO(1, folio);
>> + return;
>> + }
>> +
>> + if (!folio_test_large(folio)) {
>> + /* Is this the page's last map to be removed? */
>> + last = atomic_add_negative(-1, &page->_mapcount);
>> + nr_unmapped = last;
>> + } else {
>> + for (; nr != 0; nr--, page++) {
>> + /* Is this the page's last map to be removed? */
>> + last = atomic_add_negative(-1, &page->_mapcount);
>> + if (last) {
>> + /* Page still mapped if folio mapped entirely */
>> + nr_mapped = atomic_dec_return_relaxed(mapped);
>> + if (nr_mapped < COMPOUND_MAPPED)
>> + nr_unmapped++;
>> + }
>> + }
>> + }
>> +
>> + if (nr_unmapped) {
>> + idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
>> + __lruvec_stat_mod_folio(folio, idx, -nr_unmapped);
>> +
>> + /*
>> + * Queue anon THP for deferred split if we have just unmapped at
>> + * least 1 page, while at least 1 page remains mapped.
>> + */
>> + if (folio_test_large(folio) && folio_test_anon(folio))
>> + if (nr_mapped)
>> + deferred_split_folio(folio);
>> + }
>> +
>> + /*
>> + * It would be tidy to reset folio_test_anon mapping when fully
>> + * unmapped, but that might overwrite a racing page_add_anon_rmap
>> + * which increments mapcount after us but sets mapping before us:
>> + * so leave the reset to free_pages_prepare, and remember that
>> + * it's only reliable while mapped.
>> + */
>> +
>> + munlock_vma_folio(folio, vma, false);
>> +}
>> +
>> /**
>> * page_remove_rmap - take down pte mapping from a page
>> * @page: page to remove mapping from
>> --
>> 2.25.1
>
> Everything else looks good to me. Reviewed-by: Zi Yan <[email protected]>
>
> --
> Best Regards,
> Yan, Zi


2023-07-17 16:27:56

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()

On 17/07/2023 16:07, Matthew Wilcox wrote:
> On Mon, Jul 17, 2023 at 03:31:09PM +0100, Ryan Roberts wrote:
>> +/*
>> + * folio_remove_rmap_range - take down pte mappings from a range of pages
>> + * belonging to a folio. All pages are accounted as small pages.
>> + * @folio: folio that all pages belong to
>> + * @page: first page in range to remove mapping from
>> + * @nr: number of pages in range to remove mapping from
>> + * @vma: the vm area from which the mapping is removed
>> + *
>> + * The caller needs to hold the pte lock.
>> + */
>
> This could stand a little reworking. How about this?
>
> /**
> * folio_remove_rmap_range - Take down PTE mappings from a range of pages.
> * @folio: Folio containing all pages in range.
> * @page: First page in range to unmap.
> * @nr: Number of pages to unmap.
> * @vma: The VM area containing the range.
> *
> * All pages in the range must belong to the same VMA & folio. They
> * must be mapped with PTEs, not a PMD.
> *
> * Context: Caller holds the pte lock.
> */

LGTM! thanks.

>
>> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
>> + int nr, struct vm_area_struct *vma)
>> +{
>> + atomic_t *mapped = &folio->_nr_pages_mapped;
>> + int nr_unmapped = 0;
>> + int nr_mapped;
>> + bool last;
>> + enum node_stat_item idx;
>> +
>> + if (unlikely(folio_test_hugetlb(folio))) {
>> + VM_WARN_ON_FOLIO(1, folio);
>> + return;
>> + }
>> +
>> + if (!folio_test_large(folio)) {
>> + /* Is this the page's last map to be removed? */
>> + last = atomic_add_negative(-1, &page->_mapcount);
>> + nr_unmapped = last;
>> + } else {
>> + for (; nr != 0; nr--, page++) {
>> + /* Is this the page's last map to be removed? */
>> + last = atomic_add_negative(-1, &page->_mapcount);
>> + if (last) {
>> + /* Page still mapped if folio mapped entirely */
>> + nr_mapped = atomic_dec_return_relaxed(mapped);
>
> We're still doing one atomic op per page on the folio's nr_pages_mapped
> ... is it possible to batch this and use atomic_sub_return_relaxed()?

Good spot, something like this:

	} else {
		for (; nr != 0; nr--, page++) {
			/* Is this the page's last map to be removed? */
			last = atomic_add_negative(-1, &page->_mapcount);
			if (last)
				nr_unmapped++;
		}

		/* Pages still mapped if folio mapped entirely */
		nr_mapped = atomic_sub_return_relaxed(nr_unmapped, mapped);
		if (nr_mapped >= COMPOUND_MAPPED)
			nr_unmapped = 0;
	}



2023-07-17 16:44:10

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()

On 17 Jul 2023, at 11:51, Ryan Roberts wrote:

> On 17/07/2023 16:09, Zi Yan wrote:
>> On 17 Jul 2023, at 10:31, Ryan Roberts wrote:
>>
>>> Like page_remove_rmap() but batch-removes the rmap for a range of pages
>>> belonging to a folio. This can provide a small speedup due to less
>>> manipuation of the various counters. But more crucially, if removing the
>>> rmap for all pages of a folio in a batch, there is no need to
>>> (spuriously) add it to the deferred split list, which saves significant
>>> cost when there is contention for the split queue lock.
>>>
>>> All contained pages are accounted using the order-0 folio (or base page)
>>> scheme.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>> include/linux/rmap.h | 2 ++
>>> mm/rmap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
>>> 2 files changed, 67 insertions(+)
>>>
>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>> index b87d01660412..f578975c12c0 100644
>>> --- a/include/linux/rmap.h
>>> +++ b/include/linux/rmap.h
>>> @@ -200,6 +200,8 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
>>> bool compound);
>>> void page_remove_rmap(struct page *, struct vm_area_struct *,
>>> bool compound);
>>> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
>>> + int nr, struct vm_area_struct *vma);
>>>
>>> void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
>>> unsigned long address, rmap_t flags);
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2baf57d65c23..1da05aca2bb1 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1359,6 +1359,71 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
>>> mlock_vma_folio(folio, vma, compound);
>>> }
>>>
>>> +/*
>>> + * folio_remove_rmap_range - take down pte mappings from a range of pages
>>> + * belonging to a folio. All pages are accounted as small pages.
>>> + * @folio: folio that all pages belong to
>>> + * @page: first page in range to remove mapping from
>>> + * @nr: number of pages in range to remove mapping from
>>
>> We might need some checks to make sure [page, page+nr] is in the range of
>> the folio. Something like:
>>
>> page >= &folio->page && page + nr < (&folio->page + folio_nr_pages(folio))
>
> No problem. Is a VM_WARN_ON() appropriate for something like this?

VM_WARN_ON_ONCE() might be better.
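
For illustration only, a sketch of how that might look at the top of
folio_remove_rmap_range() (the exact form is an assumption; it warns when
[page, page + nr) is not fully contained in the folio):

	VM_WARN_ON_ONCE(page < &folio->page ||
			page + nr > &folio->page + folio_nr_pages(folio));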

--
Best Regards,
Yan, Zi



2023-07-17 16:44:13

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH v1 3/3] mm: Batch-zap large anonymous folio PTE mappings

On 17 Jul 2023, at 11:55, Ryan Roberts wrote:

> On 17/07/2023 16:25, Zi Yan wrote:
>> On 17 Jul 2023, at 10:31, Ryan Roberts wrote:
>>
>>> This allows batching the rmap removal with folio_remove_rmap_range(),
>>> which means we avoid spuriously adding a partially unmapped folio to the
>>> deferrred split queue in the common case, which reduces split queue lock
>>> contention.
>>>
>>> Previously each page was removed from the rmap individually with
>>> page_remove_rmap(). If the first page belonged to a large folio, this
>>> would cause page_remove_rmap() to conclude that the folio was now
>>> partially mapped and add the folio to the deferred split queue. But
>>> subsequent calls would cause the folio to become fully unmapped, meaning
>>> there is no value to adding it to the split queue.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>> mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 119 insertions(+)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 01f39e8144ef..6facb8c8807a 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1391,6 +1391,95 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>>> pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>>> }
>>>
>>> +static inline unsigned long page_addr(struct page *page,
>>> + struct page *anchor, unsigned long anchor_addr)
>>> +{
>>> + unsigned long offset;
>>> + unsigned long addr;
>>> +
>>> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>> + addr = anchor_addr + offset;
>>> +
>>> + if (anchor > page) {
>>> + if (addr > anchor_addr)
>>> + return 0;
>>> + } else {
>>> + if (addr < anchor_addr)
>>> + return ULONG_MAX;
>>> + }
>>> +
>>> + return addr;
>>> +}
>>> +
>>> +static int calc_anon_folio_map_pgcount(struct folio *folio,
>>> + struct page *page, pte_t *pte,
>>> + unsigned long addr, unsigned long end)
>>> +{
>>> + pte_t ptent;
>>> + int floops;
>>> + int i;
>>> + unsigned long pfn;
>>> +
>>> + end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
>>> + end);
>>> + floops = (end - addr) >> PAGE_SHIFT;
>>> + pfn = page_to_pfn(page);
>>> + pfn++;
>>> + pte++;
>>> +
>>> + for (i = 1; i < floops; i++) {
>>> + ptent = ptep_get(pte);
>>> +
>>> + if (!pte_present(ptent) ||
>>> + pte_pfn(ptent) != pfn) {
>>> + return i;
>>> + }
>>> +
>>> + pfn++;
>>> + pte++;
>>> + }
>>> +
>>> + return floops;
>>> +}
>>> +
>>> +static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
>>> + struct vm_area_struct *vma,
>>> + struct page *page, pte_t *pte,
>>> + unsigned long addr, unsigned long end,
>>> + bool *full_out)
>>> +{
>>> + struct folio *folio = page_folio(page);
>>> + struct mm_struct *mm = tlb->mm;
>>> + pte_t ptent;
>>> + int pgcount;
>>> + int i;
>>> + bool full;
>>> +
>>> + pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
>>> +
>>> + for (i = 0; i < pgcount;) {
>>> + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>> + tlb_remove_tlb_entry(tlb, pte, addr);
>>> + full = __tlb_remove_page(tlb, page, 0);
>>> +
>>> + if (unlikely(page_mapcount(page) < 1))
>>> + print_bad_pte(vma, addr, ptent, page);
>>> +
>>> + i++;
>>> + page++;
>>> + pte++;
>>> + addr += PAGE_SIZE;
>>> +
>>> + if (unlikely(full))
>>> + break;
>>> + }
>>> +
>>> + folio_remove_rmap_range(folio, page - i, i, vma);
>>> +
>>> + *full_out = full;
>>> + return i;
>>> +}
>>> +
>>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>> struct vm_area_struct *vma, pmd_t *pmd,
>>> unsigned long addr, unsigned long end,
>>> @@ -1428,6 +1517,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>> page = vm_normal_page(vma, addr, ptent);
>>> if (unlikely(!should_zap_page(details, page)))
>>> continue;
>>> +
>>> + /*
>>> + * Batch zap large anonymous folio mappings. This allows
>>> + * batching the rmap removal, which means we avoid
>>> + * spuriously adding a partially unmapped folio to the
>>> + * deferrred split queue in the common case, which
>>> + * reduces split queue lock contention. Require the VMA
>>> + * to be anonymous to ensure that none of the PTEs in
>>> + * the range require zap_install_uffd_wp_if_needed().
>>> + */
>>> + if (page && PageAnon(page) && vma_is_anonymous(vma)) {
>>> + bool full;
>>> + int pgcount;
>>> +
>>> + pgcount = zap_anon_pte_range(tlb, vma,
>>> + page, pte, addr, end, &full);
>>
>> Are you trying to zap as many ptes as possible if all these ptes are
>> within a folio?
>
> Yes.
>
>> If so, why not calculate end before calling zap_anon_pte_range()?
>> That would make zap_anon_pte_range() simpler.
>
> I'm not sure I follow. That's currently done in calc_anon_folio_map_pgcount(). I
> could move it to here, but I'm not sure that makes things simpler, just puts
> more code in here and less in there?

Otherwise your zap_anon_pte_range() is really zap_anon_pte_in_folio_range(), or
needs some other more descriptive name. When I first looked at the name, I
thought the PTEs would be zapped all the way to end, but that is not the case
once I read the code. Future users could easily be confused in the same way and
use it incorrectly.

BTW, page_addr() needs a better name and is easily confused with existing
page_address().

>
>> Also check if page is part of
>> a large folio first to make sure you can batch.
>
> Yeah that's fair. I'd be inclined to put that in zap_anon_pte_range() to short
> circuit calc_anon_folio_map_pgcount(). But ultimately zap_anon_pte_range() would
> still zap the single pte.
>
>
>>
>>> +
>>> + rss[mm_counter(page)] -= pgcount;
>>> + pgcount--;
>>> + pte += pgcount;
>>> + addr += pgcount << PAGE_SHIFT;
>>> +
>>> + if (unlikely(full)) {
>>> + force_flush = 1;
>>> + addr += PAGE_SIZE;
>>> + break;
>>> + }
>>> + continue;
>>> + }
>>> +
>>> ptent = ptep_get_and_clear_full(mm, addr, pte,
>>> tlb->fullmm);
>>> tlb_remove_tlb_entry(tlb, pte, addr);
>>> --
>>> 2.25.1
>>
>>
>> --
>> Best Regards,
>> Yan, Zi


--
Best Regards,
Yan, Zi



2023-07-17 16:52:18

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 3/3] mm: Batch-zap large anonymous folio PTE mappings

On 17/07/2023 16:25, Zi Yan wrote:
> On 17 Jul 2023, at 10:31, Ryan Roberts wrote:
>
>> This allows batching the rmap removal with folio_remove_rmap_range(),
>> which means we avoid spuriously adding a partially unmapped folio to the
>> deferrred split queue in the common case, which reduces split queue lock
>> contention.
>>
>> Previously each page was removed from the rmap individually with
>> page_remove_rmap(). If the first page belonged to a large folio, this
>> would cause page_remove_rmap() to conclude that the folio was now
>> partially mapped and add the folio to the deferred split queue. But
>> subsequent calls would cause the folio to become fully unmapped, meaning
>> there is no value to adding it to the split queue.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 119 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 01f39e8144ef..6facb8c8807a 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1391,6 +1391,95 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>> pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>> }
>>
>> +static inline unsigned long page_addr(struct page *page,
>> + struct page *anchor, unsigned long anchor_addr)
>> +{
>> + unsigned long offset;
>> + unsigned long addr;
>> +
>> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>> + addr = anchor_addr + offset;
>> +
>> + if (anchor > page) {
>> + if (addr > anchor_addr)
>> + return 0;
>> + } else {
>> + if (addr < anchor_addr)
>> + return ULONG_MAX;
>> + }
>> +
>> + return addr;
>> +}
>> +
>> +static int calc_anon_folio_map_pgcount(struct folio *folio,
>> + struct page *page, pte_t *pte,
>> + unsigned long addr, unsigned long end)
>> +{
>> + pte_t ptent;
>> + int floops;
>> + int i;
>> + unsigned long pfn;
>> +
>> + end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
>> + end);
>> + floops = (end - addr) >> PAGE_SHIFT;
>> + pfn = page_to_pfn(page);
>> + pfn++;
>> + pte++;
>> +
>> + for (i = 1; i < floops; i++) {
>> + ptent = ptep_get(pte);
>> +
>> + if (!pte_present(ptent) ||
>> + pte_pfn(ptent) != pfn) {
>> + return i;
>> + }
>> +
>> + pfn++;
>> + pte++;
>> + }
>> +
>> + return floops;
>> +}
>> +
>> +static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
>> + struct vm_area_struct *vma,
>> + struct page *page, pte_t *pte,
>> + unsigned long addr, unsigned long end,
>> + bool *full_out)
>> +{
>> + struct folio *folio = page_folio(page);
>> + struct mm_struct *mm = tlb->mm;
>> + pte_t ptent;
>> + int pgcount;
>> + int i;
>> + bool full;
>> +
>> + pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
>> +
>> + for (i = 0; i < pgcount;) {
>> + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>> + tlb_remove_tlb_entry(tlb, pte, addr);
>> + full = __tlb_remove_page(tlb, page, 0);
>> +
>> + if (unlikely(page_mapcount(page) < 1))
>> + print_bad_pte(vma, addr, ptent, page);
>> +
>> + i++;
>> + page++;
>> + pte++;
>> + addr += PAGE_SIZE;
>> +
>> + if (unlikely(full))
>> + break;
>> + }
>> +
>> + folio_remove_rmap_range(folio, page - i, i, vma);
>> +
>> + *full_out = full;
>> + return i;
>> +}
>> +
>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> struct vm_area_struct *vma, pmd_t *pmd,
>> unsigned long addr, unsigned long end,
>> @@ -1428,6 +1517,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> page = vm_normal_page(vma, addr, ptent);
>> if (unlikely(!should_zap_page(details, page)))
>> continue;
>> +
>> + /*
>> + * Batch zap large anonymous folio mappings. This allows
>> + * batching the rmap removal, which means we avoid
>> + * spuriously adding a partially unmapped folio to the
>> + * deferrred split queue in the common case, which
>> + * reduces split queue lock contention. Require the VMA
>> + * to be anonymous to ensure that none of the PTEs in
>> + * the range require zap_install_uffd_wp_if_needed().
>> + */
>> + if (page && PageAnon(page) && vma_is_anonymous(vma)) {
>> + bool full;
>> + int pgcount;
>> +
>> + pgcount = zap_anon_pte_range(tlb, vma,
>> + page, pte, addr, end, &full);
>
> Are you trying to zap as many ptes as possible if all these ptes are
> within a folio?

Yes.

> If so, why not calculate end before calling zap_anon_pte_range()?
> That would make zap_anon_pte_range() simpler.

I'm not sure I follow. That's currently done in calc_anon_folio_map_pgcount(). I
could move it to here, but I'm not sure that makes things simpler, just puts
more code in here and less in there?

> Also check if page is part of
> a large folio first to make sure you can batch.

Yeah that's fair. I'd be inclined to put that in zap_anon_pte_range() to short
circuit calc_anon_folio_map_pgcount(). But ultimately zap_anon_pte_range() would
still zap the single pte.


>
>> +
>> + rss[mm_counter(page)] -= pgcount;
>> + pgcount--;
>> + pte += pgcount;
>> + addr += pgcount << PAGE_SHIFT;
>> +
>> + if (unlikely(full)) {
>> + force_flush = 1;
>> + addr += PAGE_SIZE;
>> + break;
>> + }
>> + continue;
>> + }
>> +
>> ptent = ptep_get_and_clear_full(mm, addr, pte,
>> tlb->fullmm);
>> tlb_remove_tlb_entry(tlb, pte, addr);
>> --
>> 2.25.1
>
>
> --
> Best Regards,
> Yan, Zi


2023-07-17 16:54:58

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On Mon, Jul 17, 2023 at 04:54:58PM +0100, Matthew Wilcox wrote:
> Those are page->flags and page->compound_head for the third page in
> the folio. They don't really need a name; nothing refers to them,
> but it's important that space not be reused ;-)
>
> This is slightly different from _flags_1; we do have some flags which
> reuse the bits (they're labelled as PF_SECOND). Right now, it's only
> PG_has_hwpoisoned, but we used to have PG_double_map. Others may arise.

Sorry, this was incomplete. We do still have per-page flags! HWPoison
is the obvious one, but PG_head is per-page (... think about it ...), and
PG_anon_exclusive is actually per-page.

Most of the flags labelled as PF_ANY are mislabelled. PG_private and
PG_private2 are never set/cleared/tested on tail pages. PG_young and
PG_idle are only ever tested on the head page, but some code incorrectly
sets them on tail pages, where those bits are ignored. I tried to fix
that a while ago, but the patch was overlooked and I couldn't be bothered
to try all that hard. I have no clue about PG_vmemmap_self_hosted.
I think PG_isolated is probably never set on compound pages.
PG_owner_priv_1 is a disaster, as you might expect.

2023-07-17 16:56:42

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On 17.07.23 18:01, Ryan Roberts wrote:
> On 17/07/2023 16:42, David Hildenbrand wrote:
>> On 17.07.23 16:31, Ryan Roberts wrote:
>>> In preparation for the introduction of large folios for anonymous
>>> memory, we would like to be able to split them when they have unmapped
>>> subpages, in order to free those unused pages under memory pressure. So
>>> remove the artificial requirement that the large folio needed to be at
>>> least PMD-sized.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> Reviewed-by: Yu Zhao <[email protected]>
>>> Reviewed-by: Yin Fengwei <[email protected]>
>>> ---
>>>   mm/rmap.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 0c0d8857dfce..2baf57d65c23 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1430,7 +1430,7 @@ void page_remove_rmap(struct page *page, struct
>>> vm_area_struct *vma,
>>>            * page of the folio is unmapped and at least one page
>>>            * is still mapped.
>>>            */
>>> -        if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>> +        if (folio_test_large(folio) && folio_test_anon(folio))
>>>               if (!compound || nr < nr_pmdmapped)
>>>                   deferred_split_folio(folio);
>>
>> !compound will always be true I guess, so nr_pmdmapped == 0 (which will always
>> be the case) will be ignored.
>
> I don't follow why !compound will always be true. This function is
> page_remove_rmap() (not folio_remove_rmap_range() which I add in a later patch).
> page_remove_rmap() can work on pmd-mapped pages where compound=true is passed in.

I was talking about the folio_test_pmd_mappable() -> folio_test_large()
change. For folio_test_large() && !folio_test_pmd_mappable() I expect
that we'll never pass in "compound=true".

--
Cheers,

David / dhildenb


2023-07-17 17:25:24

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On 17.07.23 17:54, Matthew Wilcox wrote:
> On Mon, Jul 17, 2023 at 05:43:40PM +0200, David Hildenbrand wrote:
>> On 17.07.23 17:41, Ryan Roberts wrote:
>>> On 17/07/2023 16:30, Matthew Wilcox wrote:
>>>> On Mon, Jul 17, 2023 at 03:31:08PM +0100, Ryan Roberts wrote:
>>>>> In preparation for the introduction of large folios for anonymous
>>>>> memory, we would like to be able to split them when they have unmapped
>>>>> subpages, in order to free those unused pages under memory pressure. So
>>>>> remove the artificial requirement that the large folio needed to be at
>>>>> least PMD-sized.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>> Reviewed-by: Yu Zhao <[email protected]>
>>>>> Reviewed-by: Yin Fengwei <[email protected]>
>>>>
>>>> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
>>>
>>> Thanks!
>>>
>>>>
>>>>> */
>>>>> - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>>> + if (folio_test_large(folio) && folio_test_anon(folio))
>>>>> if (!compound || nr < nr_pmdmapped)
>>>>> deferred_split_folio(folio);
>>>>
>>>> I wonder if it's worth introducing a folio_test_deferred_split() (better
>>>> naming appreciated ...) to allow us to allocate order-1 folios and not
>>>> do horrible things. Maybe it's not worth supporting order-1 folios;
>>>> we're always better off going to order-2 immediately. Just thinking.
>>>
>>> There is more than just _deferred_list in the 3rd page; you also have _flags_2a
>>> and _head_2a. I guess you know much better than me what they store. But I'm
>>> guessing it's harder than just not splitting an order-1 page?
>
> Those are page->flags and page->compound_head for the third page in
> the folio. They don't really need a name; nothing refers to them,
> but it's important that space not be reused ;-)
>
> This is slightly different from _flags_1; we do have some flags which
> reuse the bits (they're labelled as PF_SECOND). Right now, it's only
> PG_has_hwpoisoned, but we used to have PG_double_map. Others may arise.
>
>>> With the direction of large anon folios (_not_ retrying with every order down to
>>> 0), I'm not sure what the use case would be for order-1 anyway?
>>
>> Just noting that we might need some struct-page space for better
>> mapcount/shared tracking, which might get hard for order-1 pages.
>
> My assumption had been that we'd be able to reuse the _entire_mapcount
> and _nr_pages_mapped fields and not spill into the third page, but the

We most likely have to keep _entire_mapcount to keep "PMD mapped"
working (I don't think we can not account that, some user space relies
on that). Reusing _nr_pages_mapped for _total_mapcount would work until
we need more bits.

But once we want to sort out some other questions like "is this folio
mapped shared or mapped exclusive" we might need more space.

What I am playing with right now to tackle that would most probably not
fit in there (but I'll keep trying ;) ).

> third page is definitely available today if we want it. I'm fine with
> disallowing order-1 anon/file folios forever.

Yes, let's first sort out the open issues before going down that path
(might not really be worth it after all).

--
Cheers,

David / dhildenb


2023-07-18 00:18:01

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v1 3/3] mm: Batch-zap large anonymous folio PTE mappings



On 7/17/23 22:31, Ryan Roberts wrote:
> This allows batching the rmap removal with folio_remove_rmap_range(),
> which means we avoid spuriously adding a partially unmapped folio to the
> deferred split queue in the common case, which reduces split queue lock
> contention.
>
> Previously each page was removed from the rmap individually with
> page_remove_rmap(). If the first page belonged to a large folio, this
> would cause page_remove_rmap() to conclude that the folio was now
> partially mapped and add the folio to the deferred split queue. But
> subsequent calls would cause the folio to become fully unmapped, meaning
> there is no value to adding it to the split queue.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 119 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 01f39e8144ef..6facb8c8807a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1391,6 +1391,95 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
> }
>
> +static inline unsigned long page_addr(struct page *page,
> + struct page *anchor, unsigned long anchor_addr)
> +{
> + unsigned long offset;
> + unsigned long addr;
> +
> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
> + addr = anchor_addr + offset;
> +
> + if (anchor > page) {
> + if (addr > anchor_addr)
> + return 0;
> + } else {
> + if (addr < anchor_addr)
> + return ULONG_MAX;
> + }
> +
> + return addr;
> +}
> +
> +static int calc_anon_folio_map_pgcount(struct folio *folio,
> + struct page *page, pte_t *pte,
> + unsigned long addr, unsigned long end)
> +{
> + pte_t ptent;
> + int floops;
> + int i;
> + unsigned long pfn;
> +
> + end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
> + end);
> + floops = (end - addr) >> PAGE_SHIFT;
> + pfn = page_to_pfn(page);
> + pfn++;
> + pte++;
> +
> + for (i = 1; i < floops; i++) {
> + ptent = ptep_get(pte);
> +
> + if (!pte_present(ptent) ||
> + pte_pfn(ptent) != pfn) {
> + return i;
> + }
> +
> + pfn++;
> + pte++;
> + }
> +
> + return floops;
> +}
> +
> +static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
> + struct vm_area_struct *vma,
> + struct page *page, pte_t *pte,
> + unsigned long addr, unsigned long end,
> + bool *full_out)
> +{
> + struct folio *folio = page_folio(page);
> + struct mm_struct *mm = tlb->mm;
> + pte_t ptent;
> + int pgcount;
> + int i;
> + bool full;
> +
> + pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
> +
> + for (i = 0; i < pgcount;) {
> + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + full = __tlb_remove_page(tlb, page, 0);
> +
> + if (unlikely(page_mapcount(page) < 1))
> + print_bad_pte(vma, addr, ptent, page);
> +
> + i++;
> + page++;
> + pte++;
> + addr += PAGE_SIZE;
> +
> + if (unlikely(full))
> + break;
> + }
> +
> + folio_remove_rmap_range(folio, page - i, i, vma);
> +
> + *full_out = full;
> + return i;
> +}
> +
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> @@ -1428,6 +1517,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> page = vm_normal_page(vma, addr, ptent);
> if (unlikely(!should_zap_page(details, page)))
> continue;
> +
> + /*
> + * Batch zap large anonymous folio mappings. This allows
> + * batching the rmap removal, which means we avoid
> + * spuriously adding a partially unmapped folio to the
> + * deferred split queue in the common case, which
> + * reduces split queue lock contention. Require the VMA
> + * to be anonymous to ensure that none of the PTEs in
> + * the range require zap_install_uffd_wp_if_needed().
> + */
> + if (page && PageAnon(page) && vma_is_anonymous(vma)) {
Why is this only for anonymous pages? I suppose it could support file mappings also.


Regards
Yin, Fengwei

> + bool full;
> + int pgcount;
> +
> + pgcount = zap_anon_pte_range(tlb, vma,
> + page, pte, addr, end, &full);
> +
> + rss[mm_counter(page)] -= pgcount;
> + pgcount--;
> + pte += pgcount;
> + addr += pgcount << PAGE_SHIFT;
> +
> + if (unlikely(full)) {
> + force_flush = 1;
> + addr += PAGE_SIZE;
> + break;
> + }
> + continue;
> + }
> +
> ptent = ptep_get_and_clear_full(mm, addr, pte,
> tlb->fullmm);
> tlb_remove_tlb_entry(tlb, pte, addr);

2023-07-18 01:31:47

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()



On 7/17/23 22:31, Ryan Roberts wrote:
> Like page_remove_rmap() but batch-removes the rmap for a range of pages
> belonging to a folio. This can provide a small speedup due to less
> manipuation of the various counters. But more crucially, if removing the
> rmap for all pages of a folio in a batch, there is no need to
> (spuriously) add it to the deferred split list, which saves significant
> cost when there is contention for the split queue lock.
>
> All contained pages are accounted using the order-0 folio (or base page)
> scheme.
>
> Signed-off-by: Ryan Roberts <[email protected]>

Reviewed-by: Yin Fengwei <[email protected]>

Regards
Yin, Fengwei

> ---
> include/linux/rmap.h | 2 ++
> mm/rmap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 67 insertions(+)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b87d01660412..f578975c12c0 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -200,6 +200,8 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
> bool compound);
> void page_remove_rmap(struct page *, struct vm_area_struct *,
> bool compound);
> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
> + int nr, struct vm_area_struct *vma);
>
> void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
> unsigned long address, rmap_t flags);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2baf57d65c23..1da05aca2bb1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1359,6 +1359,71 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> mlock_vma_folio(folio, vma, compound);
> }
>
> +/*
> + * folio_remove_rmap_range - take down pte mappings from a range of pages
> + * belonging to a folio. All pages are accounted as small pages.
> + * @folio: folio that all pages belong to
> + * @page: first page in range to remove mapping from
> + * @nr: number of pages in range to remove mapping from
> + * @vma: the vm area from which the mapping is removed
> + *
> + * The caller needs to hold the pte lock.
> + */
> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
> + int nr, struct vm_area_struct *vma)
> +{
> + atomic_t *mapped = &folio->_nr_pages_mapped;
> + int nr_unmapped = 0;
> + int nr_mapped;
> + bool last;
> + enum node_stat_item idx;
> +
> + if (unlikely(folio_test_hugetlb(folio))) {
> + VM_WARN_ON_FOLIO(1, folio);
> + return;
> + }
> +
> + if (!folio_test_large(folio)) {
> + /* Is this the page's last map to be removed? */
> + last = atomic_add_negative(-1, &page->_mapcount);
> + nr_unmapped = last;
> + } else {
> + for (; nr != 0; nr--, page++) {
> + /* Is this the page's last map to be removed? */
> + last = atomic_add_negative(-1, &page->_mapcount);
> + if (last) {
> + /* Page still mapped if folio mapped entirely */
> + nr_mapped = atomic_dec_return_relaxed(mapped);
> + if (nr_mapped < COMPOUND_MAPPED)
> + nr_unmapped++;
> + }
> + }
> + }
> +
> + if (nr_unmapped) {
> + idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
> + __lruvec_stat_mod_folio(folio, idx, -nr_unmapped);
> +
> + /*
> + * Queue anon THP for deferred split if we have just unmapped at
> + * least 1 page, while at least 1 page remains mapped.
> + */
> + if (folio_test_large(folio) && folio_test_anon(folio))
> + if (nr_mapped)
> + deferred_split_folio(folio);
> + }
> +
> + /*
> + * It would be tidy to reset folio_test_anon mapping when fully
> + * unmapped, but that might overwrite a racing page_add_anon_rmap
> + * which increments mapcount after us but sets mapping before us:
> + * so leave the reset to free_pages_prepare, and remember that
> + * it's only reliable while mapped.
> + */
> +
> + munlock_vma_folio(folio, vma, false);
> +}
> +
> /**
> * page_remove_rmap - take down pte mapping from a page
> * @page: page to remove mapping from

2023-07-18 06:40:06

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()

Ryan Roberts <[email protected]> writes:

> Like page_remove_rmap() but batch-removes the rmap for a range of pages
> belonging to a folio. This can provide a small speedup due to less
> manipuation of the various counters. But more crucially, if removing the
> rmap for all pages of a folio in a batch, there is no need to
> (spuriously) add it to the deferred split list, which saves significant
> cost when there is contention for the split queue lock.
>
> All contained pages are accounted using the order-0 folio (or base page)
> scheme.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/rmap.h | 2 ++
> mm/rmap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 67 insertions(+)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b87d01660412..f578975c12c0 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -200,6 +200,8 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
> bool compound);
> void page_remove_rmap(struct page *, struct vm_area_struct *,
> bool compound);
> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
> + int nr, struct vm_area_struct *vma);
>
> void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
> unsigned long address, rmap_t flags);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2baf57d65c23..1da05aca2bb1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1359,6 +1359,71 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> mlock_vma_folio(folio, vma, compound);
> }
>
> +/*
> + * folio_remove_rmap_range - take down pte mappings from a range of pages
> + * belonging to a folio. All pages are accounted as small pages.
> + * @folio: folio that all pages belong to
> + * @page: first page in range to remove mapping from
> + * @nr: number of pages in range to remove mapping from
> + * @vma: the vm area from which the mapping is removed
> + *
> + * The caller needs to hold the pte lock.
> + */
> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
> + int nr, struct vm_area_struct *vma)
> +{
> + atomic_t *mapped = &folio->_nr_pages_mapped;
> + int nr_unmapped = 0;
> + int nr_mapped;
> + bool last;
> + enum node_stat_item idx;
> +
> + if (unlikely(folio_test_hugetlb(folio))) {
> + VM_WARN_ON_FOLIO(1, folio);
> + return;
> + }
> +
> + if (!folio_test_large(folio)) {
> + /* Is this the page's last map to be removed? */
> + last = atomic_add_negative(-1, &page->_mapcount);
> + nr_unmapped = last;
> + } else {
> + for (; nr != 0; nr--, page++) {
> + /* Is this the page's last map to be removed? */
> + last = atomic_add_negative(-1, &page->_mapcount);
> + if (last) {
> + /* Page still mapped if folio mapped entirely */
> + nr_mapped = atomic_dec_return_relaxed(mapped);
> + if (nr_mapped < COMPOUND_MAPPED)
> + nr_unmapped++;
> + }
> + }
> + }
> +
> + if (nr_unmapped) {
> + idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
> + __lruvec_stat_mod_folio(folio, idx, -nr_unmapped);
> +
> + /*
> + * Queue anon THP for deferred split if we have just unmapped at

Just some nitpicks. So feel free to ignore.

s/anon THP/large folio/ ?

> + * least 1 page, while at least 1 page remains mapped.
> + */
> + if (folio_test_large(folio) && folio_test_anon(folio))
> + if (nr_mapped)

if (folio_test_large(folio) && folio_test_anon(folio) && nr_mapped) ?

> + deferred_split_folio(folio);
> + }
> +
> + /*
> + * It would be tidy to reset folio_test_anon mapping when fully
> + * unmapped, but that might overwrite a racing page_add_anon_rmap
> + * which increments mapcount after us but sets mapping before us:
> + * so leave the reset to free_pages_prepare, and remember that
> + * it's only reliable while mapped.
> + */
> +
> + munlock_vma_folio(folio, vma, false);
> +}
> +
> /**
> * page_remove_rmap - take down pte mapping from a page
> * @page: page to remove mapping from

Best Regards,
Huang, Ying

2023-07-18 07:41:08

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()

Ryan Roberts <[email protected]> writes:

> Like page_remove_rmap() but batch-removes the rmap for a range of pages
> belonging to a folio. This can provide a small speedup due to less
> manipuation of the various counters. But more crucially, if removing the
> rmap for all pages of a folio in a batch, there is no need to
> (spuriously) add it to the deferred split list, which saves significant
> cost when there is contention for the split queue lock.
>
> All contained pages are accounted using the order-0 folio (or base page)
> scheme.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/linux/rmap.h | 2 ++
> mm/rmap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 67 insertions(+)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b87d01660412..f578975c12c0 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -200,6 +200,8 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
> bool compound);
> void page_remove_rmap(struct page *, struct vm_area_struct *,
> bool compound);
> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
> + int nr, struct vm_area_struct *vma);
>
> void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
> unsigned long address, rmap_t flags);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2baf57d65c23..1da05aca2bb1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1359,6 +1359,71 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> mlock_vma_folio(folio, vma, compound);
> }
>
> +/*
> + * folio_remove_rmap_range - take down pte mappings from a range of pages
> + * belonging to a folio. All pages are accounted as small pages.
> + * @folio: folio that all pages belong to
> + * @page: first page in range to remove mapping from
> + * @nr: number of pages in range to remove mapping from
> + * @vma: the vm area from which the mapping is removed
> + *
> + * The caller needs to hold the pte lock.
> + */
> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
> + int nr, struct vm_area_struct *vma)

> Can we call folio_remove_rmap_range() in page_remove_rmap() if
!compound? This can give us some opportunities to reduce code
duplication?

Best Regards,
Huang, Ying

> +{
> + atomic_t *mapped = &folio->_nr_pages_mapped;
> + int nr_unmapped = 0;
> + int nr_mapped;
> + bool last;
> + enum node_stat_item idx;
> +
> + if (unlikely(folio_test_hugetlb(folio))) {
> + VM_WARN_ON_FOLIO(1, folio);
> + return;
> + }
> +
> + if (!folio_test_large(folio)) {
> + /* Is this the page's last map to be removed? */
> + last = atomic_add_negative(-1, &page->_mapcount);
> + nr_unmapped = last;
> + } else {
> + for (; nr != 0; nr--, page++) {
> + /* Is this the page's last map to be removed? */
> + last = atomic_add_negative(-1, &page->_mapcount);
> + if (last) {
> + /* Page still mapped if folio mapped entirely */
> + nr_mapped = atomic_dec_return_relaxed(mapped);
> + if (nr_mapped < COMPOUND_MAPPED)
> + nr_unmapped++;
> + }
> + }
> + }
> +
> + if (nr_unmapped) {
> + idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
> + __lruvec_stat_mod_folio(folio, idx, -nr_unmapped);
> +
> + /*
> + * Queue anon THP for deferred split if we have just unmapped at
> + * least 1 page, while at least 1 page remains mapped.
> + */
> + if (folio_test_large(folio) && folio_test_anon(folio))
> + if (nr_mapped)
> + deferred_split_folio(folio);
> + }
> +
> + /*
> + * It would be tidy to reset folio_test_anon mapping when fully
> + * unmapped, but that might overwrite a racing page_add_anon_rmap
> + * which increments mapcount after us but sets mapping before us:
> + * so leave the reset to free_pages_prepare, and remember that
> + * it's only reliable while mapped.
> + */
> +
> + munlock_vma_folio(folio, vma, false);
> +}
> +
> /**
> * page_remove_rmap - take down pte mapping from a page
> * @page: page to remove mapping from

2023-07-18 09:19:53

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On 18.07.23 10:58, Ryan Roberts wrote:
> On 17/07/2023 17:48, David Hildenbrand wrote:
>> On 17.07.23 18:01, Ryan Roberts wrote:
>>> On 17/07/2023 16:42, David Hildenbrand wrote:
>>>> On 17.07.23 16:31, Ryan Roberts wrote:
>>>>> In preparation for the introduction of large folios for anonymous
>>>>> memory, we would like to be able to split them when they have unmapped
>>>>> subpages, in order to free those unused pages under memory pressure. So
>>>>> remove the artificial requirement that the large folio needed to be at
>>>>> least PMD-sized.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>> Reviewed-by: Yu Zhao <[email protected]>
>>>>> Reviewed-by: Yin Fengwei <[email protected]>
>>>>> ---
>>>>>    mm/rmap.c | 2 +-
>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 0c0d8857dfce..2baf57d65c23 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1430,7 +1430,7 @@ void page_remove_rmap(struct page *page, struct
>>>>> vm_area_struct *vma,
>>>>>             * page of the folio is unmapped and at least one page
>>>>>             * is still mapped.
>>>>>             */
>>>>> -        if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>>> +        if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>                if (!compound || nr < nr_pmdmapped)
>>>>>                    deferred_split_folio(folio);
>>>>
>>>> !compound will always be true I guess, so nr_pmdmapped == 0 (which will always
>>>> be the case) will be ignored.
>>>
>>> I don't follow why !compound will always be true. This function is
>>> page_remove_rmap() (not folio_remove_rmap_range() which I add in a later patch).
>>> page_remove_rmap() can work on pmd-mapped pages where compound=true is passed in.
>>
>> I was talking about the folio_test_pmd_mappable() -> folio_test_large() change.
>> For folio_test_large() && !folio_test_pmd_mappable() I expect that we'll never
>> pass in "compound=true".
>>
>
> Sorry David, I've been staring at the code and your comment, and I still don't
> understand your point. I assumed you were trying to say that compound is always
> false and therefore "if (!compound || nr < nr_pmdmapped)" can be removed? But
> it's not the case that compound is always false; it will be true when called to
> remove a pmd-mapped compound page.

Let me try again:

Assume, as I wrote, that we are given a folio that is
"folio_test_large() && !folio_test_pmd_mappable()". That is, a folio
that is *not* pmd mappable.

If it's not pmd-mappable, certainly, nr_pmdmapped == 0, and therefore,
"nr < nr_pmdmapped" will never ever trigger.

The only way to have it added to the deferred split queue is, therefore
"if (!compound)".

So *for these folios*, we will always pass "compound == false" to make
that "if (!compound)" succeed.
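
Or, spelling the same thing out against the new code (purely restating the
above, not suggesting any change):

        if (folio_test_large(folio) && folio_test_anon(folio))
                /*
                 * For a large folio that is not pmd-mappable, nr_pmdmapped
                 * is always 0, so "nr < nr_pmdmapped" can never be true.
                 * The only way onto the deferred split queue is therefore
                 * !compound, and callers only ever pass compound=false for
                 * such folios anyway.
                 */
                if (!compound || nr < nr_pmdmapped)
                        deferred_split_folio(folio);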


Does that make sense?

> What change are you suggesting, exactly?

Oh, I never suggested a change (I even gave you my RB). I was just
thinking out loud.

--
Cheers,

David / dhildenb


2023-07-18 09:51:42

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On 17/07/2023 17:48, David Hildenbrand wrote:
> On 17.07.23 18:01, Ryan Roberts wrote:
>> On 17/07/2023 16:42, David Hildenbrand wrote:
>>> On 17.07.23 16:31, Ryan Roberts wrote:
>>>> In preparation for the introduction of large folios for anonymous
>>>> memory, we would like to be able to split them when they have unmapped
>>>> subpages, in order to free those unused pages under memory pressure. So
>>>> remove the artificial requirement that the large folio needed to be at
>>>> least PMD-sized.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> Reviewed-by: Yu Zhao <[email protected]>
>>>> Reviewed-by: Yin Fengwei <[email protected]>
>>>> ---
>>>>    mm/rmap.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 0c0d8857dfce..2baf57d65c23 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1430,7 +1430,7 @@ void page_remove_rmap(struct page *page, struct
>>>> vm_area_struct *vma,
>>>>             * page of the folio is unmapped and at least one page
>>>>             * is still mapped.
>>>>             */
>>>> -        if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>> +        if (folio_test_large(folio) && folio_test_anon(folio))
>>>>                if (!compound || nr < nr_pmdmapped)
>>>>                    deferred_split_folio(folio);
>>>
>>> !compound will always be true I guess, so nr_pmdmapped == 0 (which will always
>>> be the case) will be ignored.
>>
>> I don't follow why !compound will always be true. This function is
>> page_remove_rmap() (not folio_remove_rmap_range() which I add in a later patch).
>> page_remove_rmap() can work on pmd-mapped pages where compound=true is passed in.
>
> I was talking about the folio_test_pmd_mappable() -> folio_test_large() change.
> For folio_test_large() && !folio_test_pmd_mappable() I expect that we'll never
> pass in "compound=true".
>

Sorry David, I've been staring at the code and your comment, and I still don't
understand your point. I assumed you were trying to say that compound is always
false and therefore "if (!compound || nr < nr_pmdmapped)" can be removed? But
its not the case that compound is always false; it will be true when called to
remove a pmd-mapped compound page. What change are you suggesting, exactly?

2023-07-18 10:05:47

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()

On 18/07/2023 07:22, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>> Like page_remove_rmap() but batch-removes the rmap for a range of pages
>> belonging to a folio. This can provide a small speedup due to less
>> manipuation of the various counters. But more crucially, if removing the
>> rmap for all pages of a folio in a batch, there is no need to
>> (spuriously) add it to the deferred split list, which saves significant
>> cost when there is contention for the split queue lock.
>>
>> All contained pages are accounted using the order-0 folio (or base page)
>> scheme.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> include/linux/rmap.h | 2 ++
>> mm/rmap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 67 insertions(+)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index b87d01660412..f578975c12c0 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -200,6 +200,8 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
>> bool compound);
>> void page_remove_rmap(struct page *, struct vm_area_struct *,
>> bool compound);
>> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
>> + int nr, struct vm_area_struct *vma);
>>
>> void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
>> unsigned long address, rmap_t flags);
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2baf57d65c23..1da05aca2bb1 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1359,6 +1359,71 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
>> mlock_vma_folio(folio, vma, compound);
>> }
>>
>> +/*
>> + * folio_remove_rmap_range - take down pte mappings from a range of pages
>> + * belonging to a folio. All pages are accounted as small pages.
>> + * @folio: folio that all pages belong to
>> + * @page: first page in range to remove mapping from
>> + * @nr: number of pages in range to remove mapping from
>> + * @vma: the vm area from which the mapping is removed
>> + *
>> + * The caller needs to hold the pte lock.
>> + */
>> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
>> + int nr, struct vm_area_struct *vma)
>> +{
>> + atomic_t *mapped = &folio->_nr_pages_mapped;
>> + int nr_unmapped = 0;
>> + int nr_mapped;
>> + bool last;
>> + enum node_stat_item idx;
>> +
>> + if (unlikely(folio_test_hugetlb(folio))) {
>> + VM_WARN_ON_FOLIO(1, folio);
>> + return;
>> + }
>> +
>> + if (!folio_test_large(folio)) {
>> + /* Is this the page's last map to be removed? */
>> + last = atomic_add_negative(-1, &page->_mapcount);
>> + nr_unmapped = last;
>> + } else {
>> + for (; nr != 0; nr--, page++) {
>> + /* Is this the page's last map to be removed? */
>> + last = atomic_add_negative(-1, &page->_mapcount);
>> + if (last) {
>> + /* Page still mapped if folio mapped entirely */
>> + nr_mapped = atomic_dec_return_relaxed(mapped);
>> + if (nr_mapped < COMPOUND_MAPPED)
>> + nr_unmapped++;
>> + }
>> + }
>> + }
>> +
>> + if (nr_unmapped) {
>> + idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
>> + __lruvec_stat_mod_folio(folio, idx, -nr_unmapped);
>> +
>> + /*
>> + * Queue anon THP for deferred split if we have just unmapped at
>
> Just some nitpicks. So feel free to ignore.
>
> s/anon THP/large folio/ ?

ACK

>
>> + * least 1 page, while at least 1 page remains mapped.
>> + */
>> + if (folio_test_large(folio) && folio_test_anon(folio))
>> + if (nr_mapped)
>
> if (folio_test_large(folio) && folio_test_anon(folio) && nr_mapped) ?

ACK: I'll make these changes for the next version.

>
>> + deferred_split_folio(folio);
>> + }
>> +
>> + /*
>> + * It would be tidy to reset folio_test_anon mapping when fully
>> + * unmapped, but that might overwrite a racing page_add_anon_rmap
>> + * which increments mapcount after us but sets mapping before us:
>> + * so leave the reset to free_pages_prepare, and remember that
>> + * it's only reliable while mapped.
>> + */
>> +
>> + munlock_vma_folio(folio, vma, false);
>> +}
>> +
>> /**
>> * page_remove_rmap - take down pte mapping from a page
>> * @page: page to remove mapping from
>
> Best Regards,
> Huang, Ying


2023-07-18 10:16:07

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] mm: Allow deferred splitting of arbitrary large anon folios

On 18/07/2023 10:08, David Hildenbrand wrote:
> On 18.07.23 10:58, Ryan Roberts wrote:
>> On 17/07/2023 17:48, David Hildenbrand wrote:
>>> On 17.07.23 18:01, Ryan Roberts wrote:
>>>> On 17/07/2023 16:42, David Hildenbrand wrote:
>>>>> On 17.07.23 16:31, Ryan Roberts wrote:
>>>>>> In preparation for the introduction of large folios for anonymous
>>>>>> memory, we would like to be able to split them when they have unmapped
>>>>>> subpages, in order to free those unused pages under memory pressure. So
>>>>>> remove the artificial requirement that the large folio needed to be at
>>>>>> least PMD-sized.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>>> Reviewed-by: Yu Zhao <[email protected]>
>>>>>> Reviewed-by: Yin Fengwei <[email protected]>
>>>>>> ---
>>>>>>     mm/rmap.c | 2 +-
>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>>> index 0c0d8857dfce..2baf57d65c23 100644
>>>>>> --- a/mm/rmap.c
>>>>>> +++ b/mm/rmap.c
>>>>>> @@ -1430,7 +1430,7 @@ void page_remove_rmap(struct page *page, struct
>>>>>> vm_area_struct *vma,
>>>>>>              * page of the folio is unmapped and at least one page
>>>>>>              * is still mapped.
>>>>>>              */
>>>>>> -        if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
>>>>>> +        if (folio_test_large(folio) && folio_test_anon(folio))
>>>>>>                 if (!compound || nr < nr_pmdmapped)
>>>>>>                     deferred_split_folio(folio);
>>>>>
>>>>> !compound will always be true I guess, so nr_pmdmapped == 0 (which will always
>>>>> be the case) will be ignored.
>>>>
>>>> I don't follow why !compound will always be true. This function is
>>>> page_remove_rmap() (not folio_remove_rmap_range() which I add in a later
>>>> patch).
>>>> page_remove_rmap() can work on pmd-mapped pages where compound=true is
>>>> passed in.
>>>
>>> I was talking about the folio_test_pmd_mappable() -> folio_test_large() change.
>>> For folio_test_large() && !folio_test_pmd_mappable() I expect that we'll never
>>> pass in "compound=true".
>>>
>>
>> Sorry David, I've been staring at the code and your comment, and I still don't
>> understand your point. I assumed you were trying to say that compound is always
>> false and therefore "if (!compound || nr < nr_pmdmapped)" can be removed? But
>> it's not the case that compound is always false; it will be true when called to
>> remove a pmd-mapped compound page.
>
> Let me try again:
>
> Assume, as I wrote, that we are given a folio that is "folio_test_large() &&
> !folio_test_pmd_mappable()". That is, a folio that is *not* pmd mappable.
>
> If it's not pmd-mappable, certainly, nr_pmdmapped == 0, and therefore, "nr <
> nr_pmdmapped" will never ever trigger.
>
> The only way to have it added to the deferred split queue is, therefore "if
> (!compound)".
>
> So *for these folios*, we will always pass "compound == false" to make that "if
> (!compound)" succeed.
>
>
> Does that make sense?

Yes I agree with all of this. I thought you were pointing out an issue or
proposing a change to the logic. Hence my confusion.

>
>> What change are you suggesting, exactly?
>
> Oh, I never suggested a change (I even gave you my RB). I was just thinking out
> loud.
>


2023-07-18 10:32:15

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] mm: Implement folio_remove_rmap_range()

On 18/07/2023 08:12, Huang, Ying wrote:
> Ryan Roberts <[email protected]> writes:
>
>> Like page_remove_rmap() but batch-removes the rmap for a range of pages
>> belonging to a folio. This can provide a small speedup due to less
>> manipuation of the various counters. But more crucially, if removing the
>> rmap for all pages of a folio in a batch, there is no need to
>> (spuriously) add it to the deferred split list, which saves significant
>> cost when there is contention for the split queue lock.
>>
>> All contained pages are accounted using the order-0 folio (or base page)
>> scheme.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> include/linux/rmap.h | 2 ++
>> mm/rmap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 67 insertions(+)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index b87d01660412..f578975c12c0 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -200,6 +200,8 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
>> bool compound);
>> void page_remove_rmap(struct page *, struct vm_area_struct *,
>> bool compound);
>> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
>> + int nr, struct vm_area_struct *vma);
>>
>> void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
>> unsigned long address, rmap_t flags);
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2baf57d65c23..1da05aca2bb1 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1359,6 +1359,71 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
>> mlock_vma_folio(folio, vma, compound);
>> }
>>
>> +/*
>> + * folio_remove_rmap_range - take down pte mappings from a range of pages
>> + * belonging to a folio. All pages are accounted as small pages.
>> + * @folio: folio that all pages belong to
>> + * @page: first page in range to remove mapping from
>> + * @nr: number of pages in range to remove mapping from
>> + * @vma: the vm area from which the mapping is removed
>> + *
>> + * The caller needs to hold the pte lock.
>> + */
>> +void folio_remove_rmap_range(struct folio *folio, struct page *page,
>> + int nr, struct vm_area_struct *vma)
>
> Can we call folio_remove_rmap_range() in page_remove_rmap() if
> !compound? This can give us some opportunities to reduce code
> duplication?

I considered that, but it felt like the savings were pretty small, so my opinion
was that it was cleaner not to do this. This is the best I came up with. Perhaps
you can see further improvements?

void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
                      bool compound)
{
        struct folio *folio = page_folio(page);
        atomic_t *mapped = &folio->_nr_pages_mapped;
        int nr = 0, nr_pmdmapped = 0;
        bool last;
        enum node_stat_item idx;

        VM_BUG_ON_PAGE(compound && !PageHead(page), page);

        /* Hugetlb pages are not counted in NR_*MAPPED */
        if (unlikely(folio_test_hugetlb(folio))) {
                /* hugetlb pages are always mapped with pmds */
                atomic_dec(&folio->_entire_mapcount);
                return;
        }

        /* Is page being unmapped by PTE? Is this its last map to be removed? */
        if (likely(!compound)) {
                folio_remove_rmap_range(folio, page, 1, vma);
                return;
        } else if (folio_test_pmd_mappable(folio)) {
                /* That test is redundant: it's for safety or to optimize out */

                last = atomic_add_negative(-1, &folio->_entire_mapcount);
                if (last) {
                        nr = atomic_sub_return_relaxed(COMPOUND_MAPPED, mapped);
                        if (likely(nr < COMPOUND_MAPPED)) {
                                nr_pmdmapped = folio_nr_pages(folio);
                                nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
                                /* Raced ahead of another remove and an add? */
                                if (unlikely(nr < 0))
                                        nr = 0;
                        } else {
                                /* An add of COMPOUND_MAPPED raced ahead */
                                nr = 0;
                        }
                }
        }

        if (nr_pmdmapped) {
                if (folio_test_anon(folio))
                        idx = NR_ANON_THPS;
                else if (folio_test_swapbacked(folio))
                        idx = NR_SHMEM_PMDMAPPED;
                else
                        idx = NR_FILE_PMDMAPPED;
                __lruvec_stat_mod_folio(folio, idx, -nr_pmdmapped);
        }
        if (nr) {
                idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
                __lruvec_stat_mod_folio(folio, idx, -nr);

                /*
                 * Queue anon THP for deferred split if at least one
                 * page of the folio is unmapped and at least one page
                 * is still mapped.
                 */
                if (folio_test_anon(folio) && nr < nr_pmdmapped)
                        deferred_split_folio(folio);
        }

        /*
         * It would be tidy to reset folio_test_anon mapping when fully
         * unmapped, but that might overwrite a racing page_add_anon_rmap
         * which increments mapcount after us but sets mapping before us:
         * so leave the reset to free_pages_prepare, and remember that
         * it's only reliable while mapped.
         */

        munlock_vma_folio(folio, vma, compound);
}

>
> Best Regards,
> Huang, Ying
>

2023-07-18 11:04:59

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 3/3] mm: Batch-zap large anonymous folio PTE mappings

On 17/07/2023 17:15, Zi Yan wrote:
> On 17 Jul 2023, at 11:55, Ryan Roberts wrote:
>
>> On 17/07/2023 16:25, Zi Yan wrote:
>>> On 17 Jul 2023, at 10:31, Ryan Roberts wrote:
>>>
>>>> This allows batching the rmap removal with folio_remove_rmap_range(),
>>>> which means we avoid spuriously adding a partially unmapped folio to the
>>>> deferred split queue in the common case, which reduces split queue lock
>>>> contention.
>>>>
>>>> Previously each page was removed from the rmap individually with
>>>> page_remove_rmap(). If the first page belonged to a large folio, this
>>>> would cause page_remove_rmap() to conclude that the folio was now
>>>> partially mapped and add the folio to the deferred split queue. But
>>>> subsequent calls would cause the folio to become fully unmapped, meaning
>>>> there is no value to adding it to the split queue.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>> mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 119 insertions(+)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 01f39e8144ef..6facb8c8807a 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -1391,6 +1391,95 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>>>> pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>>>> }
>>>>
>>>> +static inline unsigned long page_addr(struct page *page,
>>>> + struct page *anchor, unsigned long anchor_addr)
>>>> +{
>>>> + unsigned long offset;
>>>> + unsigned long addr;
>>>> +
>>>> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>> + addr = anchor_addr + offset;
>>>> +
>>>> + if (anchor > page) {
>>>> + if (addr > anchor_addr)
>>>> + return 0;
>>>> + } else {
>>>> + if (addr < anchor_addr)
>>>> + return ULONG_MAX;
>>>> + }
>>>> +
>>>> + return addr;
>>>> +}
>>>> +
>>>> +static int calc_anon_folio_map_pgcount(struct folio *folio,
>>>> + struct page *page, pte_t *pte,
>>>> + unsigned long addr, unsigned long end)
>>>> +{
>>>> + pte_t ptent;
>>>> + int floops;
>>>> + int i;
>>>> + unsigned long pfn;
>>>> +
>>>> + end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
>>>> + end);
>>>> + floops = (end - addr) >> PAGE_SHIFT;
>>>> + pfn = page_to_pfn(page);
>>>> + pfn++;
>>>> + pte++;
>>>> +
>>>> + for (i = 1; i < floops; i++) {
>>>> + ptent = ptep_get(pte);
>>>> +
>>>> + if (!pte_present(ptent) ||
>>>> + pte_pfn(ptent) != pfn) {
>>>> + return i;
>>>> + }
>>>> +
>>>> + pfn++;
>>>> + pte++;
>>>> + }
>>>> +
>>>> + return floops;
>>>> +}
>>>> +
>>>> +static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
>>>> + struct vm_area_struct *vma,
>>>> + struct page *page, pte_t *pte,
>>>> + unsigned long addr, unsigned long end,
>>>> + bool *full_out)
>>>> +{
>>>> + struct folio *folio = page_folio(page);
>>>> + struct mm_struct *mm = tlb->mm;
>>>> + pte_t ptent;
>>>> + int pgcount;
>>>> + int i;
>>>> + bool full;
>>>> +
>>>> + pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
>>>> +
>>>> + for (i = 0; i < pgcount;) {
>>>> + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
>>>> + full = __tlb_remove_page(tlb, page, 0);
>>>> +
>>>> + if (unlikely(page_mapcount(page) < 1))
>>>> + print_bad_pte(vma, addr, ptent, page);
>>>> +
>>>> + i++;
>>>> + page++;
>>>> + pte++;
>>>> + addr += PAGE_SIZE;
>>>> +
>>>> + if (unlikely(full))
>>>> + break;
>>>> + }
>>>> +
>>>> + folio_remove_rmap_range(folio, page - i, i, vma);
>>>> +
>>>> + *full_out = full;
>>>> + return i;
>>>> +}
>>>> +
>>>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>> struct vm_area_struct *vma, pmd_t *pmd,
>>>> unsigned long addr, unsigned long end,
>>>> @@ -1428,6 +1517,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>> page = vm_normal_page(vma, addr, ptent);
>>>> if (unlikely(!should_zap_page(details, page)))
>>>> continue;
>>>> +
>>>> + /*
>>>> + * Batch zap large anonymous folio mappings. This allows
>>>> + * batching the rmap removal, which means we avoid
>>>> + * spuriously adding a partially unmapped folio to the
>>>> + * deferred split queue in the common case, which
>>>> + * reduces split queue lock contention. Require the VMA
>>>> + * to be anonymous to ensure that none of the PTEs in
>>>> + * the range require zap_install_uffd_wp_if_needed().
>>>> + */
>>>> + if (page && PageAnon(page) && vma_is_anonymous(vma)) {
>>>> + bool full;
>>>> + int pgcount;
>>>> +
>>>> + pgcount = zap_anon_pte_range(tlb, vma,
>>>> + page, pte, addr, end, &full);
>>>
>>> Are you trying to zap as many ptes as possible if all these ptes are
>>> within a folio?
>>
>> Yes.
>>
>>> If so, why not calculate end before calling zap_anon_pte_range()?
>>> That would make zap_anon_pte_range() simpler.
>>
>> I'm not sure I follow. That's currently done in calc_anon_folio_map_pgcount(). I
>> could move it to here, but I'm not sure that makes things simpler, just puts
>> more code in here and less in there?
>
> Otherwise your zap_anon_pte_range() is really zap_anon_pte_in_folio_range() or
> some other more descriptive name. When I first looked at the name, I thought
> PTEs would be zapped until the end. But that is not the case when I looked at
> the code. And future users could easily be confused too and use it in the wrong way.

OK, I see your point. Let me pull the page count calculation into here and pass
it to zap_anon_pte_range(). Then I think we can keep the name as is?
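
i.e. something of this shape (untested sketch just to confirm we mean the same
thing; the exact signature will be settled in v2):

        if (page && PageAnon(page) && vma_is_anonymous(vma)) {
                struct folio *folio = page_folio(page);
                bool full;
                int pgcount;

                /* caller works out how many PTEs can be batched... */
                pgcount = calc_anon_folio_map_pgcount(folio, page, pte,
                                                      addr, end);

                /* ...and zap_anon_pte_range() zaps exactly that many */
                pgcount = zap_anon_pte_range(tlb, vma, page, pte, addr,
                                             pgcount, &full);

                rss[mm_counter(page)] -= pgcount;
                /* pte/addr advance and force_flush handling as before */
        }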


>
> BTW, page_addr() needs a better name and is easily confused with existing
> page_address().

Yeah... I'll try to think of something for v2.

>
>>
>>> Also check if page is part of
>>> a large folio first to make sure you can batch.
>>
>> Yeah that's fair. I'd be inclined to put that in zap_anon_pte_range() to short
>> circuit calc_anon_folio_map_pgcount(). But ultimately zap_anon_pte_range() would
>> still zap the single pte.
>>
>>
>>>
>>>> +
>>>> + rss[mm_counter(page)] -= pgcount;
>>>> + pgcount--;
>>>> + pte += pgcount;
>>>> + addr += pgcount << PAGE_SHIFT;
>>>> +
>>>> + if (unlikely(full)) {
>>>> + force_flush = 1;
>>>> + addr += PAGE_SIZE;
>>>> + break;
>>>> + }
>>>> + continue;
>>>> + }
>>>> +
>>>> ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>> tlb->fullmm);
>>>> tlb_remove_tlb_entry(tlb, pte, addr);
>>>> --
>>>> 2.25.1
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>
>
> --
> Best Regards,
> Yan, Zi


2023-07-18 11:12:07

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 3/3] mm: Batch-zap large anonymous folio PTE mappings

On 18/07/2023 00:27, Yin Fengwei wrote:
>
>
> On 7/17/23 22:31, Ryan Roberts wrote:
>> This allows batching the rmap removal with folio_remove_rmap_range(),
>> which means we avoid spuriously adding a partially unmapped folio to the
>> deferred split queue in the common case, which reduces split queue lock
>> contention.
>>
>> Previously each page was removed from the rmap individually with
>> page_remove_rmap(). If the first page belonged to a large folio, this
>> would cause page_remove_rmap() to conclude that the folio was now
>> partially mapped and add the folio to the deferred split queue. But
>> subsequent calls would cause the folio to become fully unmapped, meaning
>> there is no value to adding it to the split queue.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 119 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 01f39e8144ef..6facb8c8807a 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1391,6 +1391,95 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>> pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>> }
>>
>> +static inline unsigned long page_addr(struct page *page,
>> + struct page *anchor, unsigned long anchor_addr)
>> +{
>> + unsigned long offset;
>> + unsigned long addr;
>> +
>> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>> + addr = anchor_addr + offset;
>> +
>> + if (anchor > page) {
>> + if (addr > anchor_addr)
>> + return 0;
>> + } else {
>> + if (addr < anchor_addr)
>> + return ULONG_MAX;
>> + }
>> +
>> + return addr;
>> +}
>> +
>> +static int calc_anon_folio_map_pgcount(struct folio *folio,
>> + struct page *page, pte_t *pte,
>> + unsigned long addr, unsigned long end)
>> +{
>> + pte_t ptent;
>> + int floops;
>> + int i;
>> + unsigned long pfn;
>> +
>> + end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
>> + end);
>> + floops = (end - addr) >> PAGE_SHIFT;
>> + pfn = page_to_pfn(page);
>> + pfn++;
>> + pte++;
>> +
>> + for (i = 1; i < floops; i++) {
>> + ptent = ptep_get(pte);
>> +
>> + if (!pte_present(ptent) ||
>> + pte_pfn(ptent) != pfn) {
>> + return i;
>> + }
>> +
>> + pfn++;
>> + pte++;
>> + }
>> +
>> + return floops;
>> +}
>> +
>> +static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
>> + struct vm_area_struct *vma,
>> + struct page *page, pte_t *pte,
>> + unsigned long addr, unsigned long end,
>> + bool *full_out)
>> +{
>> + struct folio *folio = page_folio(page);
>> + struct mm_struct *mm = tlb->mm;
>> + pte_t ptent;
>> + int pgcount;
>> + int i;
>> + bool full;
>> +
>> + pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
>> +
>> + for (i = 0; i < pgcount;) {
>> + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>> + tlb_remove_tlb_entry(tlb, pte, addr);
>> + full = __tlb_remove_page(tlb, page, 0);
>> +
>> + if (unlikely(page_mapcount(page) < 1))
>> + print_bad_pte(vma, addr, ptent, page);
>> +
>> + i++;
>> + page++;
>> + pte++;
>> + addr += PAGE_SIZE;
>> +
>> + if (unlikely(full))
>> + break;
>> + }
>> +
>> + folio_remove_rmap_range(folio, page - i, i, vma);
>> +
>> + *full_out = full;
>> + return i;
>> +}
>> +
>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> struct vm_area_struct *vma, pmd_t *pmd,
>> unsigned long addr, unsigned long end,
>> @@ -1428,6 +1517,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>> page = vm_normal_page(vma, addr, ptent);
>> if (unlikely(!should_zap_page(details, page)))
>> continue;
>> +
>> + /*
>> + * Batch zap large anonymous folio mappings. This allows
>> + * batching the rmap removal, which means we avoid
>> + * spuriously adding a partially unmapped folio to the
>> + * deferred split queue in the common case, which
>> + * reduces split queue lock contention. Require the VMA
>> + * to be anonymous to ensure that none of the PTEs in
>> + * the range require zap_install_uffd_wp_if_needed().
>> + */
>> + if (page && PageAnon(page) && vma_is_anonymous(vma)) {
> Why is this only for anonymous pages? I suppose it could support file mappings also.

I was trying to avoid the complexity. For file-backed pages, there is a bunch of
dirty and access management stuff that needs to happen (see "if
(!PageAnon(page)) {" a bit further down). And for file-backed VMAs (even if the
page is anon, I think?) zap_install_uffd_wp_if_needed() might do some extra
work, which again I didn't want to have to drag into zap_anon_pte_range().

I guess it's implementable, but given that only anon folios will be
deferred-split, and that anon folios in a file-backed VMA will all be single
pages, I didn't feel that the extra complexity would add anything performance-wise.
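
For reference, the sort of handling I mean for the file-backed case is roughly
the below (paraphrased and simplified from the existing code in zap_pte_range();
not part of this patch):

        if (!PageAnon(page)) {
                if (pte_dirty(ptent)) {
                        force_flush = 1;
                        set_page_dirty(page);
                }
                /* simplified: the real code also checks the VMA's readahead hints */
                if (pte_young(ptent))
                        mark_page_accessed(page);
        }

All of that would need to move into (or be duplicated in) any batched
file-backed variant, on top of the uffd-wp handling.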


>
>
> Regards
> Yin, Fengwei
>
>> + bool full;
>> + int pgcount;
>> +
>> + pgcount = zap_anon_pte_range(tlb, vma,
>> + page, pte, addr, end, &full);
>> +
>> + rss[mm_counter(page)] -= pgcount;
>> + pgcount--;
>> + pte += pgcount;
>> + addr += pgcount << PAGE_SHIFT;
>> +
>> + if (unlikely(full)) {
>> + force_flush = 1;
>> + addr += PAGE_SIZE;
>> + break;
>> + }
>> + continue;
>> + }
>> +
>> ptent = ptep_get_and_clear_full(mm, addr, pte,
>> tlb->fullmm);
>> tlb_remove_tlb_entry(tlb, pte, addr);


2023-07-18 14:35:38

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH v1 3/3] mm: Batch-zap large anonymous folio PTE mappings

On 18 Jul 2023, at 6:19, Ryan Roberts wrote:

> On 17/07/2023 17:15, Zi Yan wrote:
>> On 17 Jul 2023, at 11:55, Ryan Roberts wrote:
>>
>>> On 17/07/2023 16:25, Zi Yan wrote:
>>>> On 17 Jul 2023, at 10:31, Ryan Roberts wrote:
>>>>
>>>>> This allows batching the rmap removal with folio_remove_rmap_range(),
>>>>> which means we avoid spuriously adding a partially unmapped folio to the
>>>>> deferred split queue in the common case, which reduces split queue lock
>>>>> contention.
>>>>>
>>>>> Previously each page was removed from the rmap individually with
>>>>> page_remove_rmap(). If the first page belonged to a large folio, this
>>>>> would cause page_remove_rmap() to conclude that the folio was now
>>>>> partially mapped and add the folio to the deferred split queue. But
>>>>> subsequent calls would cause the folio to become fully unmapped, meaning
>>>>> there is no value to adding it to the split queue.
>>>>>
>>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>>> ---
>>>>> mm/memory.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> 1 file changed, 119 insertions(+)
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 01f39e8144ef..6facb8c8807a 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -1391,6 +1391,95 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
>>>>> pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
>>>>> }
>>>>>
>>>>> +static inline unsigned long page_addr(struct page *page,
>>>>> + struct page *anchor, unsigned long anchor_addr)
>>>>> +{
>>>>> + unsigned long offset;
>>>>> + unsigned long addr;
>>>>> +
>>>>> + offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
>>>>> + addr = anchor_addr + offset;
>>>>> +
>>>>> + if (anchor > page) {
>>>>> + if (addr > anchor_addr)
>>>>> + return 0;
>>>>> + } else {
>>>>> + if (addr < anchor_addr)
>>>>> + return ULONG_MAX;
>>>>> + }
>>>>> +
>>>>> + return addr;
>>>>> +}
>>>>> +
>>>>> +static int calc_anon_folio_map_pgcount(struct folio *folio,
>>>>> + struct page *page, pte_t *pte,
>>>>> + unsigned long addr, unsigned long end)
>>>>> +{
>>>>> + pte_t ptent;
>>>>> + int floops;
>>>>> + int i;
>>>>> + unsigned long pfn;
>>>>> +
>>>>> + end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
>>>>> + end);
>>>>> + floops = (end - addr) >> PAGE_SHIFT;
>>>>> + pfn = page_to_pfn(page);
>>>>> + pfn++;
>>>>> + pte++;
>>>>> +
>>>>> + for (i = 1; i < floops; i++) {
>>>>> + ptent = ptep_get(pte);
>>>>> +
>>>>> + if (!pte_present(ptent) ||
>>>>> + pte_pfn(ptent) != pfn) {
>>>>> + return i;
>>>>> + }
>>>>> +
>>>>> + pfn++;
>>>>> + pte++;
>>>>> + }
>>>>> +
>>>>> + return floops;
>>>>> +}
>>>>> +
>>>>> +static unsigned long zap_anon_pte_range(struct mmu_gather *tlb,
>>>>> + struct vm_area_struct *vma,
>>>>> + struct page *page, pte_t *pte,
>>>>> + unsigned long addr, unsigned long end,
>>>>> + bool *full_out)
>>>>> +{
>>>>> + struct folio *folio = page_folio(page);
>>>>> + struct mm_struct *mm = tlb->mm;
>>>>> + pte_t ptent;
>>>>> + int pgcount;
>>>>> + int i;
>>>>> + bool full;
>>>>> +
>>>>> + pgcount = calc_anon_folio_map_pgcount(folio, page, pte, addr, end);
>>>>> +
>>>>> + for (i = 0; i < pgcount;) {
>>>>> + ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
>>>>> + tlb_remove_tlb_entry(tlb, pte, addr);
>>>>> + full = __tlb_remove_page(tlb, page, 0);
>>>>> +
>>>>> + if (unlikely(page_mapcount(page) < 1))
>>>>> + print_bad_pte(vma, addr, ptent, page);
>>>>> +
>>>>> + i++;
>>>>> + page++;
>>>>> + pte++;
>>>>> + addr += PAGE_SIZE;
>>>>> +
>>>>> + if (unlikely(full))
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> + folio_remove_rmap_range(folio, page - i, i, vma);
>>>>> +
>>>>> + *full_out = full;
>>>>> + return i;
>>>>> +}
>>>>> +
>>>>> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>> struct vm_area_struct *vma, pmd_t *pmd,
>>>>> unsigned long addr, unsigned long end,
>>>>> @@ -1428,6 +1517,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>>>>> page = vm_normal_page(vma, addr, ptent);
>>>>> if (unlikely(!should_zap_page(details, page)))
>>>>> continue;
>>>>> +
>>>>> + /*
>>>>> + * Batch zap large anonymous folio mappings. This allows
>>>>> + * batching the rmap removal, which means we avoid
>>>>> + * spuriously adding a partially unmapped folio to the
>>>>> + * deferred split queue in the common case, which
>>>>> + * reduces split queue lock contention. Require the VMA
>>>>> + * to be anonymous to ensure that none of the PTEs in
>>>>> + * the range require zap_install_uffd_wp_if_needed().
>>>>> + */
>>>>> + if (page && PageAnon(page) && vma_is_anonymous(vma)) {
>>>>> + bool full;
>>>>> + int pgcount;
>>>>> +
>>>>> + pgcount = zap_anon_pte_range(tlb, vma,
>>>>> + page, pte, addr, end, &full);
>>>>
>>>> Are you trying to zap as many ptes as possible if all these ptes are
>>>> within a folio?
>>>
>>> Yes.
>>>
>>>> If so, why not calculate end before calling zap_anon_pte_range()?
>>>> That would make zap_anon_pte_range() simpler.
>>>
>>> I'm not sure I follow. That's currently done in calc_anon_folio_map_pgcount(). I
>>> could move it to here, but I'm not sure that makes things simpler, just puts
>>> more code in here and less in there?
>>
>> Otherwise your zap_anon_pte_range() is really zap_anon_pte_in_folio_range() or
>> some other more descriptive name. When I first look at the name, I thought
>> PTEs will be zapped until the end. But that is not the case when I look at the
>> code. And future users can easily be confused too and use it in a wrong way.
>
> OK, I see your point. Let me pull the page count calculation into here and pass
> it to zap_anon_pte_range(). Then I think we can keep the name as is?

Yes. Thanks.

>
>
>>
>> BTW, page_addr() needs a better name and is easily confused with existing
>> page_address().
>
> Yeah... I'll try to think of something for v2.
>
>>
>>>
>>>> Also check if page is part of
>>>> a large folio first to make sure you can batch.
>>>
>>> Yeah that's fair. I'd be inclined to put that in zap_anon_pte_range() to short
>>> circuit calc_anon_folio_map_pgcount(). But ultimately zap_anon_pte_range() would
>>> still zap the single pte.
>>>
>>>
>>>>
>>>>> +
>>>>> + rss[mm_counter(page)] -= pgcount;
>>>>> + pgcount--;
>>>>> + pte += pgcount;
>>>>> + addr += pgcount << PAGE_SHIFT;
>>>>> +
>>>>> + if (unlikely(full)) {
>>>>> + force_flush = 1;
>>>>> + addr += PAGE_SIZE;
>>>>> + break;
>>>>> + }
>>>>> + continue;
>>>>> + }
>>>>> +
>>>>> ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>> tlb->fullmm);
>>>>> tlb_remove_tlb_entry(tlb, pte, addr);
>>>>> --
>>>>> 2.25.1
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Yan, Zi
>>
>>
>> --
>> Best Regards,
>> Yan, Zi


--
Best Regards,
Yan, Zi

