2023-08-10 10:50:48

by Ryan Roberts

Subject: [PATCH v1 0/4] Optimize mmap_exit for large folios

Hi All,

This series improves the performance of process teardown, taking
advantage of the fact that large folios are increasingly often
pte-mapped in user space; supporting filesystems already use large
folios for pagecache memory, and large folios for anonymous memory are
(hopefully) on the horizon.

See last patch for performance numbers, including measurements that show
this approach doesn't regress (and actually improves a little bit) when
all folios are small.

The basic approach is to accumulate contiguous ranges of pages in the
mmu_gather structure (instead of storing each individual page pointer),
then take advantage of this internal format to efficiently batch rmap
removal, swapcache removal and page release - see the commit messages
for more details.
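
As a quick illustration, the range format introduced by the last patch
boils down to a [start, end) pair of page pointers per contiguous run
(the struct below matches that patch; the helper is a hypothetical
sketch for exposition only):

	struct folio_range {
		struct page *start;	/* first page in the run */
		struct page *end;	/* one past the last page */
	};

	/* Hypothetical helper: number of pages in a run; all pages in
	 * a range belong to the same folio. */
	static inline int folio_range_nr(const struct folio_range *r)
	{
		return r->end - r->start;
	}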

This series replaces the previous approach I took at [1], which was much
smaller in scope, only attempting to batch rmap removal for anon pages.
Feedback was that I should do something more general that would also
batch-remove pagecache pages from the rmap. But while designing that, I
found it was also possible to improve swapcache removal and page
release. Hopefully I haven't gone too far the other way now! Note that
patch 1 is unchanged from that original series.

This series is based on mm-unstable (ad3232df3e41).

I'm going to be out on holiday from the end of today, returning on 29th
August. So responses will likely be patchy, as I'm terrified of posting
to list from my phone!

[1] https://lore.kernel.org/linux-mm/[email protected]/

Thanks,
Ryan


Ryan Roberts (4):
mm: Implement folio_remove_rmap_range()
mm/mmu_gather: generalize mmu_gather rmap removal mechanism
mm/mmu_gather: Remove encoded_page infrastructure
mm/mmu_gather: Store and process pages in contig ranges

arch/s390/include/asm/tlb.h | 9 +--
include/asm-generic/tlb.h | 49 +++++++-------
include/linux/mm.h | 11 +++-
include/linux/mm_types.h | 34 +---------
include/linux/rmap.h | 2 +
include/linux/swap.h | 6 +-
mm/memory.c | 24 ++++---
mm/mmu_gather.c | 112 +++++++++++++++++++++++---------
mm/rmap.c | 125 +++++++++++++++++++++++++++---------
mm/swap.c | 99 ++++++++++++++++++++++++++--
mm/swap_state.c | 11 ++--
11 files changed, 333 insertions(+), 149 deletions(-)

--
2.25.1



2023-08-10 11:08:55

by Ryan Roberts

Subject: [PATCH v1 2/4] mm/mmu_gather: generalize mmu_gather rmap removal mechanism

Commit 5df397dec7c4 ("mm: delay page_remove_rmap() until after the TLB
has been flushed") added a mechanism whereby pages added to the
mmu_gather buffer could indicate whether they should also be removed
from the rmap. Then a call to the new tlb_flush_rmaps() API would
iterate through the buffer and remove each flagged page from the rmap.
This mechanism was intended for use with !PageAnon(page) pages only.

Let's generalize this rmap removal mechanism so that any type of page
can be removed from the rmap. This is done as preparation for batching
rmap removals with folio_remove_rmap_range(), whereby we will pass a
contiguous range of pages belonging to the same folio to be removed in
one shot for a performance improvement.

The mmu_gather now maintains a "pointer", consisting of a batch and an
index within that batch, identifying the next page in the queue that
has yet to be removed from the rmap. tlb_discard_rmaps() resets this
"pointer" to the first empty location in the queue. Whenever
tlb_flush_rmaps() is called, every page from the "pointer" to the end
of the queue is removed from the rmap. Once the mmu is flushed
(tlb_flush_mmu()/tlb_finish_mmu()), any pending rmap removals are
discarded. This pointer mechanism ensures that tlb_flush_rmaps() only
has to walk the part of the queue for which rmap removal is pending,
avoiding the (potentially large) early portion of the queue for which
rmap removal has already been performed but for which tlb
invalidation/page freeing is still pending.
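
Concretely, the "pointer" is the (rmap_pend, rmap_pend_first) pair
added to struct mmu_gather below; an illustrative (not literal) view of
how a caller uses it:

	/* Everything already queued does not need rmap removal. */
	tlb_discard_rmaps(tlb);	/* rmap_pend = active, rmap_pend_first = active->nr */

	/* ... clear ptes and queue pages via __tlb_remove_page() ... */

	/* Remove rmap for exactly the pages queued since the discard. */
	tlb_flush_rmaps(tlb, vma);	/* walks from (rmap_pend, rmap_pend_first) to queue end */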

tlb_flush_rmaps() must always be called under the same PTL as was used
to clear the corresponding PTEs. So in practice rmap removal will be
done in a batch for each PTE table, while the tlbi/freeing can continue
to be done in much bigger batches outside the PTL. See this example
flow:

tlb_gather_mmu()
for each pte table {
	with ptl held {
		for each pte {
			tlb_remove_tlb_entry()
			__tlb_remove_page()
		}

		if (any removed pages require rmap after tlbi)
			tlb_flush_mmu_tlbonly()

		tlb_flush_rmaps()
	}

	if (full)
		tlb_flush_mmu()
}
tlb_finish_mmu()

So this more general mechanism is no longer just for delaying rmap
removal until after tlbi, but can be used that way when required.

Note that s390 does not gather pages, but does immediate tlbi and page
freeing. In this case we continue to do the rmap removal page-by-page
without gathering them in the mmu_gather.

Signed-off-by: Ryan Roberts <[email protected]>
---
include/asm-generic/tlb.h | 34 ++++++++++++------------
mm/memory.c | 24 ++++++++++-------
mm/mmu_gather.c | 55 +++++++++++++++++++++++----------------
3 files changed, 66 insertions(+), 47 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..f339d68cf44f 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -266,25 +266,30 @@ extern bool __tlb_remove_page_size(struct mmu_gather *tlb,

#ifdef CONFIG_SMP
/*
- * This both sets 'delayed_rmap', and returns true. It would be an inline
- * function, except we define it before the 'struct mmu_gather'.
+ * For configurations that support batching the rmap removal, the removal is
+ * triggered by calling tlb_flush_rmaps(), which must be called after the pte(s)
+ * are cleared and the page has been added to the mmu_gather, and before the ptl
+ * lock that was held for clearing the pte is released.
*/
-#define tlb_delay_rmap(tlb) (((tlb)->delayed_rmap = 1), true)
+#define tlb_batch_rmap(tlb) (true)
extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
+extern void tlb_discard_rmaps(struct mmu_gather *tlb);
#endif

#endif

/*
- * We have a no-op version of the rmap removal that doesn't
- * delay anything. That is used on S390, which flushes remote
- * TLBs synchronously, and on UP, which doesn't have any
- * remote TLBs to flush and is not preemptible due to this
- * all happening under the page table lock.
+ * We have a no-op version of the rmap removal that doesn't do anything. That is
+ * used on S390, which flushes remote TLBs synchronously, and on UP, which
+ * doesn't have any remote TLBs to flush and is not preemptible due to this all
+ * happening under the page table lock. Here, the caller must manage each rmap
+ * removal separately.
*/
-#ifndef tlb_delay_rmap
-#define tlb_delay_rmap(tlb) (false)
-static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
+#ifndef tlb_batch_rmap
+#define tlb_batch_rmap(tlb) (false)
+static inline void tlb_flush_rmaps(struct mmu_gather *tlb,
+ struct vm_area_struct *vma) { }
+static inline void tlb_discard_rmaps(struct mmu_gather *tlb) { }
#endif

/*
@@ -317,11 +322,6 @@ struct mmu_gather {
*/
unsigned int freed_tables : 1;

- /*
- * Do we have pending delayed rmap removals?
- */
- unsigned int delayed_rmap : 1;
-
/*
* at which levels have we cleared entries?
*/
@@ -343,6 +343,8 @@ struct mmu_gather {
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
struct page *__pages[MMU_GATHER_BUNDLE];
+ struct mmu_gather_batch *rmap_pend;
+ unsigned int rmap_pend_first;

#ifdef CONFIG_MMU_GATHER_PAGE_SIZE
unsigned int page_size;
diff --git a/mm/memory.c b/mm/memory.c
index d003076b218d..94a6ebd409a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1405,6 +1405,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
swp_entry_t entry;

tlb_change_page_size(tlb, PAGE_SIZE);
+ tlb_discard_rmaps(tlb);
init_rss_vec(rss);
start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte)
@@ -1423,7 +1424,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
break;

if (pte_present(ptent)) {
- unsigned int delay_rmap;
+ unsigned int batch_rmap;

page = vm_normal_page(vma, addr, ptent);
if (unlikely(!should_zap_page(details, page)))
@@ -1438,12 +1439,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
}

- delay_rmap = 0;
+ batch_rmap = tlb_batch_rmap(tlb);
if (!PageAnon(page)) {
if (pte_dirty(ptent)) {
set_page_dirty(page);
- if (tlb_delay_rmap(tlb)) {
- delay_rmap = 1;
+ if (batch_rmap) {
+ /*
+ * Ensure tlb flush happens
+ * before rmap remove.
+ */
force_flush = 1;
}
}
@@ -1451,12 +1455,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
mark_page_accessed(page);
}
rss[mm_counter(page)]--;
- if (!delay_rmap) {
+ if (!batch_rmap) {
page_remove_rmap(page, vma, false);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
}
- if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+ if (unlikely(__tlb_remove_page(tlb, page, 0))) {
force_flush = 1;
addr += PAGE_SIZE;
break;
@@ -1517,10 +1521,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
arch_leave_lazy_mmu_mode();

/* Do the actual TLB flush before dropping ptl */
- if (force_flush) {
+ if (force_flush)
tlb_flush_mmu_tlbonly(tlb);
- tlb_flush_rmaps(tlb, vma);
- }
+
+ /* Rmap removal must always happen before dropping ptl */
+ tlb_flush_rmaps(tlb, vma);
+
pte_unmap_unlock(start_pte, ptl);

/*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index ea9683e12936..ca328ecef5c2 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -19,10 +19,6 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
{
struct mmu_gather_batch *batch;

- /* Limit batching if we have delayed rmaps pending */
- if (tlb->delayed_rmap && tlb->active != &tlb->local)
- return false;
-
batch = tlb->active;
if (batch->next) {
tlb->active = batch->next;
@@ -48,36 +44,49 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
}

#ifdef CONFIG_SMP
-static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch, struct vm_area_struct *vma)
+static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch,
+ unsigned int first,
+ struct vm_area_struct *vma)
{
- for (int i = 0; i < batch->nr; i++) {
+ for (int i = first; i < batch->nr; i++) {
struct encoded_page *enc = batch->encoded_pages[i];
+ struct page *page = encoded_page_ptr(enc);

- if (encoded_page_flags(enc)) {
- struct page *page = encoded_page_ptr(enc);
- page_remove_rmap(page, vma, false);
- }
+ page_remove_rmap(page, vma, false);
}
}

/**
- * tlb_flush_rmaps - do pending rmap removals after we have flushed the TLB
+ * tlb_flush_rmaps - do pending rmap removals
* @tlb: the current mmu_gather
+ * @vma: vm area from which all pages are removed
*
- * Note that because of how tlb_next_batch() above works, we will
- * never start multiple new batches with pending delayed rmaps, so
- * we only need to walk through the current active batch and the
- * original local one.
+ * Removes rmap from all pages added via (e.g.) __tlb_remove_page_size() since
+ * the last call to tlb_discard_rmaps() or tlb_flush_rmaps(). All of those pages
+ * must have been mapped by vma. Must be called after the pte(s) are cleared,
+ * and before the ptl lock that was held for clearing the pte is released. Pages
+ * are accounted using the order-0 folio (or base page) scheme.
*/
void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
- if (!tlb->delayed_rmap)
- return;
+ struct mmu_gather_batch *batch = tlb->rmap_pend;

- tlb_flush_rmap_batch(&tlb->local, vma);
- if (tlb->active != &tlb->local)
- tlb_flush_rmap_batch(tlb->active, vma);
- tlb->delayed_rmap = 0;
+ tlb_flush_rmap_batch(batch, tlb->rmap_pend_first, vma);
+
+ for (batch = batch->next; batch && batch->nr; batch = batch->next)
+ tlb_flush_rmap_batch(batch, 0, vma);
+
+ tlb_discard_rmaps(tlb);
+}
+
+/**
+ * tlb_discard_rmaps - discard any pending rmap removals
+ * @tlb: the current mmu_gather
+ */
+void tlb_discard_rmaps(struct mmu_gather *tlb)
+{
+ tlb->rmap_pend = tlb->active;
+ tlb->rmap_pend_first = tlb->active->nr;
}
#endif

@@ -102,6 +111,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
} while (batch->nr);
}
tlb->active = &tlb->local;
+ tlb_discard_rmaps(tlb);
}

static void tlb_batch_list_free(struct mmu_gather *tlb)
@@ -312,8 +322,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
tlb->local.max = ARRAY_SIZE(tlb->__pages);
tlb->active = &tlb->local;
tlb->batch_count = 0;
+ tlb->rmap_pend = &tlb->local;
+ tlb->rmap_pend_first = 0;
#endif
- tlb->delayed_rmap = 0;

tlb_table_init(tlb);
#ifdef CONFIG_MMU_GATHER_PAGE_SIZE
--
2.25.1


2023-08-10 11:15:07

by Ryan Roberts

Subject: [PATCH v1 3/4] mm/mmu_gather: Remove encoded_page infrastructure

Commit 70fb4fdff582 ("mm: introduce 'encoded' page pointers with
embedded extra bits") and commit 7cc8f9c7146a ("mm: mmu_gather: prepare
to gather encoded page pointers with flags") converted mmu_gather to
deal with encoded_page pointers, where the bottom 2 bits could encode
extra flags. Only 1 bit was ever used: to flag whether the page should
participate in delayed rmap removal.

Now that the mmu_gather batched rmap removal mechanism has been
generalized, all pages participate and therefore the flag is unused. So
let's remove encoded_page to simplify the code. It also gets in the way
of further optimization, which will be done in a follow-up patch.
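
For reference, the encoding being removed packed the flag into the low
bits of the page pointer; roughly (condensed from the helpers deleted
below):

	#define ENCODE_PAGE_BITS	3ul

	encoded = (struct encoded_page *)(flags | (unsigned long)page);
	flags   = ENCODE_PAGE_BITS & (unsigned long)encoded;
	page    = (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)encoded);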

Signed-off-by: Ryan Roberts <[email protected]>
---
arch/s390/include/asm/tlb.h | 9 +++------
include/asm-generic/tlb.h | 10 +++++-----
include/linux/mm.h | 4 +---
include/linux/mm_types.h | 34 +---------------------------------
include/linux/swap.h | 2 +-
mm/memory.c | 2 +-
mm/mmu_gather.c | 11 +++++------
mm/swap.c | 8 +++-----
mm/swap_state.c | 4 ++--
9 files changed, 22 insertions(+), 62 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 383b1f91442c..c40b44f6a31b 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -25,7 +25,7 @@
void __tlb_remove_table(void *_table);
static inline void tlb_flush(struct mmu_gather *tlb);
static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
+ struct page *page,
int page_size);

#define tlb_flush tlb_flush
@@ -41,15 +41,12 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
* Release the page cache reference for a pte removed by
* tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
* has already been freed, so just do free_page_and_swap_cache.
- *
- * s390 doesn't delay rmap removal, so there is nothing encoded in
- * the page pointer.
*/
static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
+ struct page *page,
int page_size)
{
- free_page_and_swap_cache(encoded_page_ptr(page));
+ free_page_and_swap_cache(page);
return false;
}

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f339d68cf44f..d874415aaa33 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -246,7 +246,7 @@ struct mmu_gather_batch {
struct mmu_gather_batch *next;
unsigned int nr;
unsigned int max;
- struct encoded_page *encoded_pages[];
+ struct page *pages[];
};

#define MAX_GATHER_BATCH \
@@ -261,7 +261,7 @@ struct mmu_gather_batch {
#define MAX_GATHER_BATCH_COUNT (10000UL/MAX_GATHER_BATCH)

extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
+ struct page *page,
int page_size);

#ifdef CONFIG_SMP
@@ -464,13 +464,13 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
static inline void tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, int page_size)
{
- if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size))
+ if (__tlb_remove_page_size(tlb, page, page_size))
tlb_flush_mmu(tlb);
}

-static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page, unsigned int flags)
+static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
{
- return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
+ return __tlb_remove_page_size(tlb, page, PAGE_SIZE);
}

/* tlb_remove_page
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6a95dfed4957..914e08185272 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1518,8 +1518,7 @@ static inline void folio_put_refs(struct folio *folio, int refs)
*
* release_pages() releases a simple array of multiple pages, and
* accepts various different forms of said page array: either
- * a regular old boring array of pages, an array of folios, or
- * an array of encoded page pointers.
+ * a regular old boring array of pages or an array of folios.
*
* The transparent union syntax for this kind of "any of these
* argument types" is all kinds of ugly, so look away.
@@ -1527,7 +1526,6 @@ static inline void folio_put_refs(struct folio *folio, int refs)
typedef union {
struct page **pages;
struct folio **folios;
- struct encoded_page **encoded_pages;
} release_pages_arg __attribute__ ((__transparent_union__));

void release_pages(release_pages_arg, int nr);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 291c05cacd48..b2cf57f9134c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -68,7 +68,7 @@ struct mem_cgroup;
#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
#define _struct_page_alignment __aligned(2 * sizeof(unsigned long))
#else
-#define _struct_page_alignment __aligned(sizeof(unsigned long))
+#define _struct_page_alignment
#endif

struct page {
@@ -216,38 +216,6 @@ struct page {
#endif
} _struct_page_alignment;

-/*
- * struct encoded_page - a nonexistent type marking this pointer
- *
- * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
- * with the low bits of the pointer indicating extra context-dependent
- * information. Not super-common, but happens in mmu_gather and mlock
- * handling, and this acts as a type system check on that use.
- *
- * We only really have two guaranteed bits in general, although you could
- * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
- * for more.
- *
- * Use the supplied helper functions to endcode/decode the pointer and bits.
- */
-struct encoded_page;
-#define ENCODE_PAGE_BITS 3ul
-static __always_inline struct encoded_page *encode_page(struct page *page, unsigned long flags)
-{
- BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
- return (struct encoded_page *)(flags | (unsigned long)page);
-}
-
-static inline unsigned long encoded_page_flags(struct encoded_page *page)
-{
- return ENCODE_PAGE_BITS & (unsigned long)page;
-}
-
-static inline struct page *encoded_page_ptr(struct encoded_page *page)
-{
- return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
-}
-
/**
* struct folio - Represents a contiguous set of bytes.
* @flags: Identical to the page flags.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bb5adc604144..f199df803b33 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -453,7 +453,7 @@ static inline unsigned long total_swapcache_pages(void)

extern void free_swap_cache(struct page *page);
extern void free_page_and_swap_cache(struct page *);
-extern void free_pages_and_swap_cache(struct encoded_page **, int);
+extern void free_pages_and_swap_cache(struct page **, int);
/* linux/mm/swapfile.c */
extern atomic_long_t nr_swap_pages;
extern long total_swap_pages;
diff --git a/mm/memory.c b/mm/memory.c
index 94a6ebd409a6..b4f757171cf9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1460,7 +1460,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
}
- if (unlikely(__tlb_remove_page(tlb, page, 0))) {
+ if (unlikely(__tlb_remove_page(tlb, page))) {
force_flush = 1;
addr += PAGE_SIZE;
break;
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index ca328ecef5c2..5d100ac85e21 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -49,8 +49,7 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch,
struct vm_area_struct *vma)
{
for (int i = first; i < batch->nr; i++) {
- struct encoded_page *enc = batch->encoded_pages[i];
- struct page *page = encoded_page_ptr(enc);
+ struct page *page = batch->pages[i];

page_remove_rmap(page, vma, false);
}
@@ -95,7 +94,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
struct mmu_gather_batch *batch;

for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
- struct encoded_page **pages = batch->encoded_pages;
+ struct page **pages = batch->pages;

do {
/*
@@ -125,7 +124,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
tlb->local.next = NULL;
}

-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
{
struct mmu_gather_batch *batch;

@@ -140,13 +139,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, i
* Add the page and check if we are full. If so
* force a flush.
*/
- batch->encoded_pages[batch->nr++] = page;
+ batch->pages[batch->nr++] = page;
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
batch = tlb->active;
}
- VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page));
+ VM_BUG_ON_PAGE(batch->nr > batch->max, page);

return false;
}
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..b05cce475202 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -953,14 +953,12 @@ void lru_cache_disable(void)
* Decrement the reference count on all the pages in @arg. If it
* fell to zero, remove the page from the LRU and free it.
*
- * Note that the argument can be an array of pages, encoded pages,
- * or folio pointers. We ignore any encoded bits, and turn any of
- * them into just a folio that gets free'd.
+ * Note that the argument can be an array of pages or folio pointers.
*/
void release_pages(release_pages_arg arg, int nr)
{
int i;
- struct encoded_page **encoded = arg.encoded_pages;
+ struct page **pages = arg.pages;
LIST_HEAD(pages_to_free);
struct lruvec *lruvec = NULL;
unsigned long flags = 0;
@@ -970,7 +968,7 @@ void release_pages(release_pages_arg arg, int nr)
struct folio *folio;

/* Turn any of the argument types into a folio */
- folio = page_folio(encoded_page_ptr(encoded[i]));
+ folio = page_folio(pages[i]);

/*
* Make sure the IRQ-safe lock-holding time does not get
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 01f15139b7d9..73b16795b0ff 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -307,11 +307,11 @@ void free_page_and_swap_cache(struct page *page)
* Passed an array of pages, drop them all from swapcache and then release
* them. They are removed from the LRU and freed if this is their last use.
*/
-void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
+void free_pages_and_swap_cache(struct page **pages, int nr)
{
lru_add_drain();
for (int i = 0; i < nr; i++)
- free_swap_cache(encoded_page_ptr(pages[i]));
+ free_swap_cache(pages[i]);
release_pages(pages, nr);
}

--
2.25.1


2023-08-10 11:18:23

by Ryan Roberts

Subject: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

mmu_gather accumulates a set of pages into a buffer for later rmap
removal and freeing. Page pointers were previously stored in a "linked
list of arrays", then at flush time, each page in the buffer was removed
from the rmap, removed from the swapcache and its refcount was
decremented; if the refcount reached 0, then it was freed.

With increasing numbers of large folios (or at least contiguous parts of
large folios) mapped into userspace processes (pagecache pages for
supporting filesystems currently, but in future also large anonymous
folios), we can measurably improve performance of process teardown:

- For rmap removal, we can batch-remove a range of pages belonging to
the same folio with folio_remove_rmap_range(), which is more efficient
because atomics can be manipulated just once per range. In the common
case, it also allows us to elide adding the (anon) folio to the
deferred split queue, only to remove it a bit later, once all pages of
the folio have been removed from the rmap.

- For swapcache removal, we only need to check and remove the folio from
the swap cache once, rather than trying for each individual page.

- For page release, we can batch-decrement the refcount for each page in
the folio and free it if it hits zero.

Change the page pointer storage format within the mmu_gather batch
structure to store "folio_range"s: a [start, end) page pointer pair.
This allows us to run-length encode a contiguous range of pages that all
belong to the same folio. This likely allows us to improve cache
locality a bit. But it also gives us a convenient format for
implementing the above 3 optimizations.
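
A consumer of a range recovers the folio and the page count with plain
pointer arithmetic; this mirrors tlb_flush_rmap_batch() in the diff
below (no new API is implied):

	struct folio_range *range = &batch->ranges[i];
	int nr = range->end - range->start;		/* pages in this run */
	struct folio *folio = page_folio(range->start);	/* shared by all pages in the run */

	folio_remove_rmap_range(folio, range->start, nr, vma);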

Of course if running on a system that does not extensively use large
pte-mapped folios, then the RLE approach uses twice as much memory,
because each range is 1 page long and uses 2 pointers. But
measurements show no performance impact even in that case.

Macro Performance Results
-------------------------

Test: Timed kernel compilation on Ampere Altra (arm64), 80 jobs
Configs: Comparing with and without large anon folios

Without large anon folios:
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-off | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.3% | -0.3% | 0.1% |

With large anon folios (order-3):
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-on | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.7% | -3.9% | -0.1% |

Test: Timed kernel compilation in VM on Apple M2 MacBook Pro, 8 jobs
Configs: Comparing with and without large anon folios

Without large anon folios:
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-off | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.9% | -2.9% | -0.6% |

With large anon folios (order-3):
| kernel | real-time | kern-time | user-time |
|:-----------------|------------:|------------:|------------:|
| baseline-laf-on | 0.0% | 0.0% | 0.0% |
| mmugather-range | -0.4% | -3.7% | -0.2% |

Micro Performance Results
-------------------------

Flame graphs for kernel compilation on Ampere Altra show reduction in
cycles consumed by __arm64_sys_exit_group syscall:

Without large anon folios: -2%
With large anon folios: -26%

For the large anon folios case, it also shows a big difference in cost
of rmap removal:

baseline: cycles in page_remove_rmap(): 24.7B
mmugather-range: cycles in folio_remove_rmap_range(): 5.5B

Furthermore, the baseline shows 5.2B cycles used by
deferred_split_folio() which has completely disappeared after
applying this series.

Signed-off-by: Ryan Roberts <[email protected]>
---
include/asm-generic/tlb.h | 7 +--
include/linux/mm.h | 7 +++
include/linux/swap.h | 6 +--
mm/mmu_gather.c | 56 ++++++++++++++++++++----
mm/swap.c | 91 +++++++++++++++++++++++++++++++++++++++
mm/swap_state.c | 11 ++---
6 files changed, 158 insertions(+), 20 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index d874415aaa33..fe300a64e59d 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -246,11 +246,11 @@ struct mmu_gather_batch {
struct mmu_gather_batch *next;
unsigned int nr;
unsigned int max;
- struct page *pages[];
+ struct folio_range ranges[];
};

#define MAX_GATHER_BATCH \
- ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
+ ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(struct folio_range))

/*
* Limit the maximum number of mmu_gather batches to reduce a risk of soft
@@ -342,7 +342,8 @@ struct mmu_gather {
#ifndef CONFIG_MMU_GATHER_NO_GATHER
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
- struct page *__pages[MMU_GATHER_BUNDLE];
+ struct folio_range __ranges[MMU_GATHER_BUNDLE];
+ struct page *range_limit;
struct mmu_gather_batch *rmap_pend;
unsigned int rmap_pend_first;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 914e08185272..f86c905a065d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1513,6 +1513,13 @@ static inline void folio_put_refs(struct folio *folio, int refs)
__folio_put(folio);
}

+struct folio_range {
+ struct page *start;
+ struct page *end;
+};
+
+void folios_put_refs(struct folio_range *folios, int nr);
+
/*
* union release_pages_arg - an array of pages or folios
*
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f199df803b33..06a7cf3ad6c9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -453,7 +453,7 @@ static inline unsigned long total_swapcache_pages(void)

extern void free_swap_cache(struct page *page);
extern void free_page_and_swap_cache(struct page *);
-extern void free_pages_and_swap_cache(struct page **, int);
+extern void free_folios_and_swap_cache(struct folio_range *, int);
/* linux/mm/swapfile.c */
extern atomic_long_t nr_swap_pages;
extern long total_swap_pages;
@@ -530,8 +530,8 @@ static inline void put_swap_device(struct swap_info_struct *si)
* so leave put_page and release_pages undeclared... */
#define free_page_and_swap_cache(page) \
put_page(page)
-#define free_pages_and_swap_cache(pages, nr) \
- release_pages((pages), (nr));
+#define free_folios_and_swap_cache(folios, nr) \
+ folios_put_refs((folios), (nr))

/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
#define free_swap_and_cache(e) is_pfn_swap_entry(e)
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 5d100ac85e21..fd2ea7577817 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -22,6 +22,7 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
batch = tlb->active;
if (batch->next) {
tlb->active = batch->next;
+ tlb->range_limit = NULL;
return true;
}

@@ -39,6 +40,7 @@ static bool tlb_next_batch(struct mmu_gather *tlb)

tlb->active->next = batch;
tlb->active = batch;
+ tlb->range_limit = NULL;

return true;
}
@@ -49,9 +51,11 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch,
struct vm_area_struct *vma)
{
for (int i = first; i < batch->nr; i++) {
- struct page *page = batch->pages[i];
+ struct folio_range *range = &batch->ranges[i];
+ int nr = range->end - range->start;
+ struct folio *folio = page_folio(range->start);

- page_remove_rmap(page, vma, false);
+ folio_remove_rmap_range(folio, range->start, nr, vma);
}
}

@@ -75,6 +79,11 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
for (batch = batch->next; batch && batch->nr; batch = batch->next)
tlb_flush_rmap_batch(batch, 0, vma);

+ /*
+ * Move to the next range on next page insertion to prevent any future
+ * pages from being accumulated into the range we just did the rmap for.
+ */
+ tlb->range_limit = NULL;
tlb_discard_rmaps(tlb);
}

@@ -94,7 +103,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
struct mmu_gather_batch *batch;

for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
- struct page **pages = batch->pages;
+ struct folio_range *ranges = batch->ranges;

do {
/*
@@ -102,14 +111,15 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
*/
unsigned int nr = min(512U, batch->nr);

- free_pages_and_swap_cache(pages, nr);
- pages += nr;
+ free_folios_and_swap_cache(ranges, nr);
+ ranges += nr;
batch->nr -= nr;

cond_resched();
} while (batch->nr);
}
tlb->active = &tlb->local;
+ tlb->range_limit = NULL;
tlb_discard_rmaps(tlb);
}

@@ -127,6 +137,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
{
struct mmu_gather_batch *batch;
+ struct folio_range *range;

VM_BUG_ON(!tlb->end);

@@ -135,11 +146,37 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
#endif

batch = tlb->active;
+ range = &batch->ranges[batch->nr - 1];
+
+ /*
+ * If there is a range being accumulated, add the page to the range if
+ * its contiguous, else start the next range. range_limit is always NULL
+ * when nr is 0, which protects the batch->ranges[-1] case.
+ */
+ if (tlb->range_limit && page == range->end) {
+ range->end++;
+ } else {
+ struct folio *folio = page_folio(page);
+
+ range = &batch->ranges[batch->nr++];
+ range->start = page;
+ range->end = page + 1;
+
+ tlb->range_limit = &folio->page + folio_nr_pages(folio);
+ }
+
+ /*
+ * If we have reached the end of the folio, move to the next range when
+ * we add the next page; Never span multiple folios in the same range.
+ */
+ if (range->end == tlb->range_limit)
+ tlb->range_limit = NULL;
+
/*
- * Add the page and check if we are full. If so
- * force a flush.
+ * Check if we are full. If so force a flush. In order to ensure we
+ * always have a free range for the next added page, the last range in a
+ * batch always only has a single page.
*/
- batch->pages[batch->nr++] = page;
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
@@ -318,8 +355,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
tlb->need_flush_all = 0;
tlb->local.next = NULL;
tlb->local.nr = 0;
- tlb->local.max = ARRAY_SIZE(tlb->__pages);
+ tlb->local.max = ARRAY_SIZE(tlb->__ranges);
tlb->active = &tlb->local;
+ tlb->range_limit = NULL;
tlb->batch_count = 0;
tlb->rmap_pend = &tlb->local;
tlb->rmap_pend_first = 0;
diff --git a/mm/swap.c b/mm/swap.c
index b05cce475202..e238d3623fcb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1041,6 +1041,97 @@ void release_pages(release_pages_arg arg, int nr)
}
EXPORT_SYMBOL(release_pages);

+/**
+ * folios_put_refs - batched folio_put_refs()
+ * @folios: array of `struct folio_range`s to release
+ * @nr: number of folio ranges
+ *
+ * Each `struct folio_range` describes the start and end page of a range within
+ * a folio. The folio reference count is decremented once for each page in the
+ * range. If it fell to zero, remove the page from the LRU and free it.
+ */
+void folios_put_refs(struct folio_range *folios, int nr)
+{
+ int i;
+ LIST_HEAD(pages_to_free);
+ struct lruvec *lruvec = NULL;
+ unsigned long flags = 0;
+ unsigned int lock_batch;
+
+ for (i = 0; i < nr; i++) {
+ struct folio *folio = page_folio(folios[i].start);
+ int refs = folios[i].end - folios[i].start;
+
+ /*
+ * Make sure the IRQ-safe lock-holding time does not get
+ * excessive with a continuous string of pages from the
+ * same lruvec. The lock is held only if lruvec != NULL.
+ */
+ if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = NULL;
+ }
+
+ if (is_huge_zero_page(&folio->page))
+ continue;
+
+ if (folio_is_zone_device(folio)) {
+ if (lruvec) {
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = NULL;
+ }
+ if (put_devmap_managed_page(&folio->page))
+ continue;
+ if (folio_put_testzero(folio))
+ free_zone_device_page(&folio->page);
+ continue;
+ }
+
+ if (!folio_ref_sub_and_test(folio, refs))
+ continue;
+
+ if (folio_test_large(folio)) {
+ if (lruvec) {
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = NULL;
+ }
+ __folio_put_large(folio);
+ continue;
+ }
+
+ if (folio_test_lru(folio)) {
+ struct lruvec *prev_lruvec = lruvec;
+
+ lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
+ &flags);
+ if (prev_lruvec != lruvec)
+ lock_batch = 0;
+
+ lruvec_del_folio(lruvec, folio);
+ __folio_clear_lru_flags(folio);
+ }
+
+ /*
+ * In rare cases, when truncation or holepunching raced with
+ * munlock after VM_LOCKED was cleared, Mlocked may still be
+ * found set here. This does not indicate a problem, unless
+ * "unevictable_pgs_cleared" appears worryingly large.
+ */
+ if (unlikely(folio_test_mlocked(folio))) {
+ __folio_clear_mlocked(folio);
+ zone_stat_sub_folio(folio, NR_MLOCK);
+ count_vm_event(UNEVICTABLE_PGCLEARED);
+ }
+
+ list_add(&folio->lru, &pages_to_free);
+ }
+ if (lruvec)
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+
+ mem_cgroup_uncharge_list(&pages_to_free);
+ free_unref_page_list(&pages_to_free);
+}
+
/*
* The folios which we're about to release may be in the deferred lru-addition
* queues. That would prevent them from really being freed right now. That's
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 73b16795b0ff..526bbd5a2ce1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -304,15 +304,16 @@ void free_page_and_swap_cache(struct page *page)
}

/*
- * Passed an array of pages, drop them all from swapcache and then release
- * them. They are removed from the LRU and freed if this is their last use.
+ * Passed an array of folio ranges, drop all folios from swapcache and then put
+ * a folio reference for each page in the range. They are removed from the LRU
+ * and freed if this is their last use.
*/
-void free_pages_and_swap_cache(struct page **pages, int nr)
+void free_folios_and_swap_cache(struct folio_range *folios, int nr)
{
lru_add_drain();
for (int i = 0; i < nr; i++)
- free_swap_cache(pages[i]);
- release_pages(pages, nr);
+ free_swap_cache(folios[i].start);
+ folios_put_refs(folios, nr);
}

static inline bool swap_use_vma_readahead(void)
--
2.25.1


2023-08-10 11:23:00

by Ryan Roberts

Subject: [PATCH v1 1/4] mm: Implement folio_remove_rmap_range()

Like page_remove_rmap() but batch-removes the rmap for a range of pages
belonging to a folio. This can provide a small speedup due to less
manipulation of the various counters. But more crucially, if removing the
rmap for all pages of a folio in a batch, there is no need to
(spuriously) add it to the deferred split list, which saves significant
cost when there is contention for the split queue lock.

All contained pages are accounted using the order-0 folio (or base page)
scheme.

page_remove_rmap() is refactored so that it forwards to
folio_remove_rmap_range() for !compound cases, and both functions now
share a common epilogue function. The intention here is to avoid
duplication of code.
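
For illustration, the intended call pattern (the caller holds the PTE
lock and guarantees all nr pages belong to the same folio and vma):

	/* Equivalent to nr calls to page_remove_rmap(page + i, vma, false),
	 * but the shared counters are only manipulated once per range. */
	folio_remove_rmap_range(folio, page, nr, vma);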

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/rmap.h | 2 +
mm/rmap.c | 125 ++++++++++++++++++++++++++++++++-----------
2 files changed, 97 insertions(+), 30 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a3825ce81102..d442d1e5425d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -202,6 +202,8 @@ void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
struct vm_area_struct *, bool compound);
void page_remove_rmap(struct page *, struct vm_area_struct *,
bool compound);
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+ int nr, struct vm_area_struct *vma);

void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long address, rmap_t flags);
diff --git a/mm/rmap.c b/mm/rmap.c
index 1f04debdc87a..d82d52ebf3a6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1379,6 +1379,94 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
folio_add_file_rmap_range(folio, page, nr_pages, vma, compound);
}

+/**
+ * __remove_rmap_finish - common operations when taking down a mapping.
+ * @folio: Folio containing all pages taken down.
+ * @vma: The VM area containing the range.
+ * @compound: True if pages were taken down from PMD or false if from PTE(s).
+ * @nr_unmapped: Number of pages within folio that are now unmapped.
+ * @nr_mapped: Number of pages within folio that are still mapped.
+ */
+static void __remove_rmap_finish(struct folio *folio,
+ struct vm_area_struct *vma, bool compound,
+ int nr_unmapped, int nr_mapped)
+{
+ enum node_stat_item idx;
+
+ if (nr_unmapped) {
+ idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
+ __lruvec_stat_mod_folio(folio, idx, -nr_unmapped);
+
+ /*
+ * Queue large anon folio for deferred split if at least one
+ * page of the folio is unmapped and at least one page is still
+ * mapped.
+ */
+ if (folio_test_large(folio) &&
+ folio_test_anon(folio) && nr_mapped)
+ deferred_split_folio(folio);
+ }
+
+ /*
+ * It would be tidy to reset folio_test_anon mapping when fully
+ * unmapped, but that might overwrite a racing page_add_anon_rmap
+ * which increments mapcount after us but sets mapping before us:
+ * so leave the reset to free_pages_prepare, and remember that
+ * it's only reliable while mapped.
+ */
+
+ munlock_vma_folio(folio, vma, compound);
+}
+
+/**
+ * folio_remove_rmap_range - Take down PTE mappings from a range of pages.
+ * @folio: Folio containing all pages in range.
+ * @page: First page in range to unmap.
+ * @nr: Number of pages to unmap.
+ * @vma: The VM area containing the range.
+ *
+ * All pages in the range must belong to the same VMA & folio. They must be
+ * mapped with PTEs, not a PMD.
+ *
+ * Context: Caller holds the pte lock.
+ */
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+ int nr, struct vm_area_struct *vma)
+{
+ atomic_t *mapped = &folio->_nr_pages_mapped;
+ int nr_unmapped = 0;
+ int nr_mapped = 0;
+ bool last;
+
+ if (unlikely(folio_test_hugetlb(folio))) {
+ VM_WARN_ON_FOLIO(1, folio);
+ return;
+ }
+
+ VM_WARN_ON_ONCE(page < &folio->page ||
+ page + nr > (&folio->page + folio_nr_pages(folio)));
+
+ if (!folio_test_large(folio)) {
+ /* Is this the page's last map to be removed? */
+ last = atomic_add_negative(-1, &page->_mapcount);
+ nr_unmapped = last;
+ } else {
+ for (; nr != 0; nr--, page++) {
+ /* Is this the page's last map to be removed? */
+ last = atomic_add_negative(-1, &page->_mapcount);
+ if (last)
+ nr_unmapped++;
+ }
+
+ /* Pages still mapped if folio mapped entirely */
+ nr_mapped = atomic_sub_return_relaxed(nr_unmapped, mapped);
+ if (nr_mapped >= COMPOUND_MAPPED)
+ nr_unmapped = 0;
+ }
+
+ __remove_rmap_finish(folio, vma, false, nr_unmapped, nr_mapped);
+}
+
/**
* page_remove_rmap - take down pte mapping from a page
* @page: page to remove mapping from
@@ -1405,15 +1493,13 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
return;
}

- /* Is page being unmapped by PTE? Is this its last map to be removed? */
+ /* Is page being unmapped by PTE? */
if (likely(!compound)) {
- last = atomic_add_negative(-1, &page->_mapcount);
- nr = last;
- if (last && folio_test_large(folio)) {
- nr = atomic_dec_return_relaxed(mapped);
- nr = (nr < COMPOUND_MAPPED);
- }
- } else if (folio_test_pmd_mappable(folio)) {
+ folio_remove_rmap_range(folio, page, 1, vma);
+ return;
+ }
+
+ if (folio_test_pmd_mappable(folio)) {
/* That test is redundant: it's for safety or to optimize out */

last = atomic_add_negative(-1, &folio->_entire_mapcount);
@@ -1441,29 +1527,8 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
idx = NR_FILE_PMDMAPPED;
__lruvec_stat_mod_folio(folio, idx, -nr_pmdmapped);
}
- if (nr) {
- idx = folio_test_anon(folio) ? NR_ANON_MAPPED : NR_FILE_MAPPED;
- __lruvec_stat_mod_folio(folio, idx, -nr);
-
- /*
- * Queue anon THP for deferred split if at least one
- * page of the folio is unmapped and at least one page
- * is still mapped.
- */
- if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
- if (!compound || nr < nr_pmdmapped)
- deferred_split_folio(folio);
- }
-
- /*
- * It would be tidy to reset folio_test_anon mapping when fully
- * unmapped, but that might overwrite a racing page_add_anon_rmap
- * which increments mapcount after us but sets mapping before us:
- * so leave the reset to free_pages_prepare, and remember that
- * it's only reliable while mapped.
- */

- munlock_vma_folio(folio, vma, compound);
+ __remove_rmap_finish(folio, vma, compound, nr, nr_pmdmapped - nr);
}

/*
--
2.25.1


2023-08-10 15:16:54

by Zi Yan

Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

On 10 Aug 2023, at 6:33, Ryan Roberts wrote:

> mmu_gather accumulates a set of pages into a buffer for later rmap
> removal and freeing. Page pointers were previously stored in a "linked
> list of arrays", then at flush time, each page in the buffer was removed
> from the rmap, removed from the swapcache and its refcount was
> decremented; if the refcount reached 0, then it was freed.
>
> With increasing numbers of large folios (or at least contiguous parts of
> large folios) mapped into userspace processes (pagecache pages for
> supporting filesystems currently, but in future also large anonymous
> folios), we can measurably improve performance of process teardown:
>
> - For rmap removal, we can batch-remove a range of pages belonging to
> the same folio with folio_remove_rmap_range(), which is more efficient
> because atomics can be manipulated just once per range. In the common
> case, it also allows us to elide adding the (anon) folio to the
> deferred split queue, only to remove it a bit later, once all pages of
> the folio have been removed fro mthe rmap.
>
> - For swapcache removal, we only need to check and remove the folio from
> the swap cache once, rather than trying for each individual page.
>
> - For page release, we can batch-decrement the refcount for each page in
> the folio and free it if it hits zero.
>
> Change the page pointer storage format within the mmu_gather batch
> structure to store "folio_range"s; a [start, end) page pointer pair.
> This allows us to run length encode a contiguous range of pages that all
> belong to the same folio. This likely allows us to improve cache
> locality a bit. But it also gives us a convenient format for
> implementing the above 3 optimizations.
>
> Of course if running on a system that does not extensively use large
> pte-mapped folios, then the RLE approach uses twice as much memory,
> because each range is 1 page long and uses 2 pointers. But performance
> measurements show no impact in terms of performance.
>
> Macro Performance Results
> -------------------------
>
> Test: Timed kernel compilation on Ampere Altra (arm64), 80 jobs
> Configs: Comparing with and without large anon folios
>
> Without large anon folios:
> | kernel | real-time | kern-time | user-time |
> |:-----------------|------------:|------------:|------------:|
> | baseline-laf-off | 0.0% | 0.0% | 0.0% |
> | mmugather-range | -0.3% | -0.3% | 0.1% |
>
> With large anon folios (order-3):
> | kernel | real-time | kern-time | user-time |
> |:-----------------|------------:|------------:|------------:|
> | baseline-laf-on | 0.0% | 0.0% | 0.0% |
> | mmugather-range | -0.7% | -3.9% | -0.1% |
>
> Test: Timed kernel compilation in VM on Apple M2 MacBook Pro, 8 jobs
> Configs: Comparing with and without large anon folios
>
> Without large anon folios:
> | kernel | real-time | kern-time | user-time |
> |:-----------------|------------:|------------:|------------:|
> | baseline-laf-off | 0.0% | 0.0% | 0.0% |
> | mmugather-range | -0.9% | -2.9% | -0.6% |
>
> With large anon folios (order-3):
> | kernel | real-time | kern-time | user-time |
> |:-----------------|------------:|------------:|------------:|
> | baseline-laf-on | 0.0% | 0.0% | 0.0% |
> | mmugather-range | -0.4% | -3.7% | -0.2% |
>
> Micro Performance Results
> -------------------------
>
> Flame graphs for kernel compilation on Ampere Altra show reduction in
> cycles consumed by __arm64_sys_exit_group syscall:
>
> Without large anon folios: -2%
> With large anon folios: -26%
>
> For the large anon folios case, it also shows a big difference in cost
> of rmap removal:
>
> baseline: cycles in page_remove_rmap(): 24.7B
> mmugather-range: cycles in folio_remove_rmap_range(): 5.5B
>
> Furthermore, the baseline shows 5.2B cycles used by
> deferred_split_folio() which has completely disappeared after
> applying this series.
>
> Signed-off-by: Ryan Roberts <[email protected]>
> ---
> include/asm-generic/tlb.h | 7 +--
> include/linux/mm.h | 7 +++
> include/linux/swap.h | 6 +--
> mm/mmu_gather.c | 56 ++++++++++++++++++++----
> mm/swap.c | 91 +++++++++++++++++++++++++++++++++++++++
> mm/swap_state.c | 11 ++---
> 6 files changed, 158 insertions(+), 20 deletions(-)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index d874415aaa33..fe300a64e59d 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -246,11 +246,11 @@ struct mmu_gather_batch {
> struct mmu_gather_batch *next;
> unsigned int nr;
> unsigned int max;
> - struct page *pages[];
> + struct folio_range ranges[];
> };
>
> #define MAX_GATHER_BATCH \
> - ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
> + ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(struct folio_range))
>
> /*
> * Limit the maximum number of mmu_gather batches to reduce a risk of soft
> @@ -342,7 +342,8 @@ struct mmu_gather {
> #ifndef CONFIG_MMU_GATHER_NO_GATHER
> struct mmu_gather_batch *active;
> struct mmu_gather_batch local;
> - struct page *__pages[MMU_GATHER_BUNDLE];
> + struct folio_range __ranges[MMU_GATHER_BUNDLE];
> + struct page *range_limit;
> struct mmu_gather_batch *rmap_pend;
> unsigned int rmap_pend_first;
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 914e08185272..f86c905a065d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1513,6 +1513,13 @@ static inline void folio_put_refs(struct folio *folio, int refs)
> __folio_put(folio);
> }
>
> +struct folio_range {
> + struct page *start;
> + struct page *end;
> +};

I see end is used for calculating nr_pages multiple times below. Maybe just
use nr_pages instead of end here.

Also, struct page (memmap) might not always be contiguous, so using struct page
pointers to represent a folio range might not give the result you want.
See nth_page() and folio_page_idx() in include/linux/mm.h.
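
(A sketch of the distinction being pointed out, not a proposed fix:
with SPARSEMEM and without VMEMMAP the memmap is not virtually
contiguous across section boundaries, so plain pointer arithmetic on
struct page can cross into unrelated memory:)

	/* Only valid while the memmap is virtually contiguous: */
	struct page *next = page + 1;

	/* Valid regardless of memmap layout (see include/linux/mm.h): */
	struct page *next_safe = nth_page(page, 1);
	long idx = folio_page_idx(folio, page);	/* page's index within its folio */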

> +
> +void folios_put_refs(struct folio_range *folios, int nr);
> +
> /*
> * union release_pages_arg - an array of pages or folios
> *
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f199df803b33..06a7cf3ad6c9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -453,7 +453,7 @@ static inline unsigned long total_swapcache_pages(void)
>
> extern void free_swap_cache(struct page *page);
> extern void free_page_and_swap_cache(struct page *);
> -extern void free_pages_and_swap_cache(struct page **, int);
> +extern void free_folios_and_swap_cache(struct folio_range *, int);
> /* linux/mm/swapfile.c */
> extern atomic_long_t nr_swap_pages;
> extern long total_swap_pages;
> @@ -530,8 +530,8 @@ static inline void put_swap_device(struct swap_info_struct *si)
> * so leave put_page and release_pages undeclared... */
> #define free_page_and_swap_cache(page) \
> put_page(page)
> -#define free_pages_and_swap_cache(pages, nr) \
> - release_pages((pages), (nr));
> +#define free_folios_and_swap_cache(folios, nr) \
> + folios_put_refs((folios), (nr))
>
> /* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
> #define free_swap_and_cache(e) is_pfn_swap_entry(e)
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 5d100ac85e21..fd2ea7577817 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -22,6 +22,7 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
> batch = tlb->active;
> if (batch->next) {
> tlb->active = batch->next;
> + tlb->range_limit = NULL;
> return true;
> }
>
> @@ -39,6 +40,7 @@ static bool tlb_next_batch(struct mmu_gather *tlb)
>
> tlb->active->next = batch;
> tlb->active = batch;
> + tlb->range_limit = NULL;
>
> return true;
> }
> @@ -49,9 +51,11 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch *batch,
> struct vm_area_struct *vma)
> {
> for (int i = first; i < batch->nr; i++) {
> - struct page *page = batch->pages[i];
> + struct folio_range *range = &batch->ranges[i];
> + int nr = range->end - range->start;
> + struct folio *folio = page_folio(range->start);
>
> - page_remove_rmap(page, vma, false);
> + folio_remove_rmap_range(folio, range->start, nr, vma);
> }
> }
>
> @@ -75,6 +79,11 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
> for (batch = batch->next; batch && batch->nr; batch = batch->next)
> tlb_flush_rmap_batch(batch, 0, vma);
>
> + /*
> + * Move to the next range on next page insertion to prevent any future
> + * pages from being accumulated into the range we just did the rmap for.
> + */
> + tlb->range_limit = NULL;
> tlb_discard_rmaps(tlb);
> }
>
> @@ -94,7 +103,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
> struct mmu_gather_batch *batch;
>
> for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
> - struct page **pages = batch->pages;
> + struct folio_range *ranges = batch->ranges;
>
> do {
> /*
> @@ -102,14 +111,15 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
> */
> unsigned int nr = min(512U, batch->nr);
>
> - free_pages_and_swap_cache(pages, nr);
> - pages += nr;
> + free_folios_and_swap_cache(ranges, nr);
> + ranges += nr;
> batch->nr -= nr;
>
> cond_resched();
> } while (batch->nr);
> }
> tlb->active = &tlb->local;
> + tlb->range_limit = NULL;
> tlb_discard_rmaps(tlb);
> }
>
> @@ -127,6 +137,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
> bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_size)
> {
> struct mmu_gather_batch *batch;
> + struct folio_range *range;
>
> VM_BUG_ON(!tlb->end);
>
> @@ -135,11 +146,37 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
> #endif
>
> batch = tlb->active;
> + range = &batch->ranges[batch->nr - 1];
> +
> + /*
> + * If there is a range being accumulated, add the page to the range if
> + * its contiguous, else start the next range. range_limit is always NULL
> + * when nr is 0, which protects the batch->ranges[-1] case.
> + */
> + if (tlb->range_limit && page == range->end) {
> + range->end++;
> + } else {
> + struct folio *folio = page_folio(page);
> +
> + range = &batch->ranges[batch->nr++];
> + range->start = page;
> + range->end = page + 1;
> +
> + tlb->range_limit = &folio->page + folio_nr_pages(folio);
> + }
> +
> + /*
> + * If we have reached the end of the folio, move to the next range when
> + * we add the next page; Never span multiple folios in the same range.
> + */
> + if (range->end == tlb->range_limit)
> + tlb->range_limit = NULL;
> +
> /*
> - * Add the page and check if we are full. If so
> - * force a flush.
> + * Check if we are full. If so force a flush. In order to ensure we
> + * always have a free range for the next added page, the last range in a
> + * batch always only has a single page.
> */
> - batch->pages[batch->nr++] = page;
> if (batch->nr == batch->max) {
> if (!tlb_next_batch(tlb))
> return true;
> @@ -318,8 +355,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
> tlb->need_flush_all = 0;
> tlb->local.next = NULL;
> tlb->local.nr = 0;
> - tlb->local.max = ARRAY_SIZE(tlb->__pages);
> + tlb->local.max = ARRAY_SIZE(tlb->__ranges);
> tlb->active = &tlb->local;
> + tlb->range_limit = NULL;
> tlb->batch_count = 0;
> tlb->rmap_pend = &tlb->local;
> tlb->rmap_pend_first = 0;
> diff --git a/mm/swap.c b/mm/swap.c
> index b05cce475202..e238d3623fcb 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1041,6 +1041,97 @@ void release_pages(release_pages_arg arg, int nr)
> }
> EXPORT_SYMBOL(release_pages);
>
> +/**
> + * folios_put_refs - batched folio_put_refs()
> + * @folios: array of `struct folio_range`s to release
> + * @nr: number of folio ranges
> + *
> + * Each `struct folio_range` describes the start and end page of a range within
> + * a folio. The folio reference count is decremented once for each page in the
> + * range. If it fell to zero, remove the page from the LRU and free it.
> + */
> +void folios_put_refs(struct folio_range *folios, int nr)
> +{
> + int i;
> + LIST_HEAD(pages_to_free);
> + struct lruvec *lruvec = NULL;
> + unsigned long flags = 0;
> + unsigned int lock_batch;
> +
> + for (i = 0; i < nr; i++) {
> + struct folio *folio = page_folio(folios[i].start);
> + int refs = folios[i].end - folios[i].start;
> +
> + /*
> + * Make sure the IRQ-safe lock-holding time does not get
> + * excessive with a continuous string of pages from the
> + * same lruvec. The lock is held only if lruvec != NULL.
> + */
> + if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
> + unlock_page_lruvec_irqrestore(lruvec, flags);
> + lruvec = NULL;
> + }
> +
> + if (is_huge_zero_page(&folio->page))
> + continue;
> +
> + if (folio_is_zone_device(folio)) {
> + if (lruvec) {
> + unlock_page_lruvec_irqrestore(lruvec, flags);
> + lruvec = NULL;
> + }
> + if (put_devmap_managed_page(&folio->page))
> + continue;
> + if (folio_put_testzero(folio))
> + free_zone_device_page(&folio->page);
> + continue;
> + }
> +
> + if (!folio_ref_sub_and_test(folio, refs))
> + continue;
> +
> + if (folio_test_large(folio)) {
> + if (lruvec) {
> + unlock_page_lruvec_irqrestore(lruvec, flags);
> + lruvec = NULL;
> + }
> + __folio_put_large(folio);
> + continue;
> + }
> +
> + if (folio_test_lru(folio)) {
> + struct lruvec *prev_lruvec = lruvec;
> +
> + lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
> + &flags);
> + if (prev_lruvec != lruvec)
> + lock_batch = 0;
> +
> + lruvec_del_folio(lruvec, folio);
> + __folio_clear_lru_flags(folio);
> + }
> +
> + /*
> + * In rare cases, when truncation or holepunching raced with
> + * munlock after VM_LOCKED was cleared, Mlocked may still be
> + * found set here. This does not indicate a problem, unless
> + * "unevictable_pgs_cleared" appears worryingly large.
> + */
> + if (unlikely(folio_test_mlocked(folio))) {
> + __folio_clear_mlocked(folio);
> + zone_stat_sub_folio(folio, NR_MLOCK);
> + count_vm_event(UNEVICTABLE_PGCLEARED);
> + }
> +
> + list_add(&folio->lru, &pages_to_free);
> + }
> + if (lruvec)
> + unlock_page_lruvec_irqrestore(lruvec, flags);
> +
> + mem_cgroup_uncharge_list(&pages_to_free);
> + free_unref_page_list(&pages_to_free);
> +}
> +
> /*
> * The folios which we're about to release may be in the deferred lru-addition
> * queues. That would prevent them from really being freed right now. That's
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 73b16795b0ff..526bbd5a2ce1 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -304,15 +304,16 @@ void free_page_and_swap_cache(struct page *page)
> }
>
> /*
> - * Passed an array of pages, drop them all from swapcache and then release
> - * them. They are removed from the LRU and freed if this is their last use.
> + * Passed an array of folio ranges, drop all folios from swapcache and then put
> + * a folio reference for each page in the range. They are removed from the LRU
> + * and freed if this is their last use.
> */
> -void free_pages_and_swap_cache(struct page **pages, int nr)
> +void free_folios_and_swap_cache(struct folio_range *folios, int nr)
> {
> lru_add_drain();
> for (int i = 0; i < nr; i++)
> - free_swap_cache(pages[i]);
> - release_pages(pages, nr);
> + free_swap_cache(folios[i].start);
> + folios_put_refs(folios, nr);
> }
>
> static inline bool swap_use_vma_readahead(void)
> --
> 2.25.1


--
Best Regards,
Yan, Zi



2023-08-10 15:19:38

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

On 10/08/2023 15:44, Zi Yan wrote:
> On 10 Aug 2023, at 6:33, Ryan Roberts wrote:
>
>> mmu_gather accumulates a set of pages into a buffer for later rmap
>> removal and freeing. Page pointers were previously stored in a "linked
>> list of arrays", then at flush time, each page in the buffer was removed
>> from the rmap, removed from the swapcache and its refcount was
>> decremented; if the refcount reached 0, then it was freed.
>>
>> With increasing numbers of large folios (or at least contiguous parts of
>> large folios) mapped into userspace processes (pagecache pages for
>> supporting filesystems currently, but in future also large anonymous
>> folios), we can measurably improve performance of process teardown:
>>
>> - For rmap removal, we can batch-remove a range of pages belonging to
>> the same folio with folio_remove_rmap_range(), which is more efficient
>> because atomics can be manipulated just once per range. In the common
>> case, it also allows us to elide adding the (anon) folio to the
>> deferred split queue, only to remove it a bit later, once all pages of
>> the folio have been removed from the rmap.
>>
>> - For swapcache removal, we only need to check and remove the folio from
>> the swap cache once, rather than trying for each individual page.
>>
>> - For page release, we can batch-decrement the refcount for each page in
>> the folio and free it if it hits zero.
>>
>> Change the page pointer storage format within the mmu_gather batch
>> structure to store "folio_range"s; a [start, end) page pointer pair.
>> This allows us to run length encode a contiguous range of pages that all
>> belong to the same folio. This likely allows us to improve cache
>> locality a bit. But it also gives us a convenient format for
>> implementing the above 3 optimizations.
>>
>> Of course if running on a system that does not extensively use large
>> pte-mapped folios, then the RLE approach uses twice as much memory,
>> because each range is 1 page long and uses 2 pointers. But performance
>> measurements show no impact in terms of performance.
>>
>> Macro Performance Results
>> -------------------------
>>
>> Test: Timed kernel compilation on Ampere Altra (arm64), 80 jobs
>> Configs: Comparing with and without large anon folios
>>
>> Without large anon folios:
>> | kernel | real-time | kern-time | user-time |
>> |:-----------------|------------:|------------:|------------:|
>> | baseline-laf-off | 0.0% | 0.0% | 0.0% |
>> | mmugather-range | -0.3% | -0.3% | 0.1% |
>>
>> With large anon folios (order-3):
>> | kernel | real-time | kern-time | user-time |
>> |:-----------------|------------:|------------:|------------:|
>> | baseline-laf-on | 0.0% | 0.0% | 0.0% |
>> | mmugather-range | -0.7% | -3.9% | -0.1% |
>>
>> Test: Timed kernel compilation in VM on Apple M2 MacBook Pro, 8 jobs
>> Configs: Comparing with and without large anon folios
>>
>> Without large anon folios:
>> | kernel | real-time | kern-time | user-time |
>> |:-----------------|------------:|------------:|------------:|
>> | baseline-laf-off | 0.0% | 0.0% | 0.0% |
>> | mmugather-range | -0.9% | -2.9% | -0.6% |
>>
>> With large anon folios (order-3):
>> | kernel | real-time | kern-time | user-time |
>> |:-----------------|------------:|------------:|------------:|
>> | baseline-laf-on | 0.0% | 0.0% | 0.0% |
>> | mmugather-range | -0.4% | -3.7% | -0.2% |
>>
>> Micro Performance Results
>> -------------------------
>>
>> Flame graphs for kernel compilation on Ampere Altra show reduction in
>> cycles consumed by __arm64_sys_exit_group syscall:
>>
>> Without large anon folios: -2%
>> With large anon folios: -26%
>>
>> For the large anon folios case, it also shows a big difference in cost
>> of rmap removal:
>>
>> baseline: cycles in page_remove_rmap(): 24.7B
>> mmugather-range: cycles in folio_remove_rmap_range(): 5.5B
>>
>> Furthermore, the baseline shows 5.2B cycles used by
>> deferred_split_folio() which has completely disappeared after
>> applying this series.
>>
>> Signed-off-by: Ryan Roberts <[email protected]>
>> ---
>> include/asm-generic/tlb.h | 7 +--
>> include/linux/mm.h | 7 +++
>> include/linux/swap.h | 6 +--
>> mm/mmu_gather.c | 56 ++++++++++++++++++++----
>> mm/swap.c | 91 +++++++++++++++++++++++++++++++++++++++
>> mm/swap_state.c | 11 ++---
>> 6 files changed, 158 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index d874415aaa33..fe300a64e59d 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -246,11 +246,11 @@ struct mmu_gather_batch {
>> struct mmu_gather_batch *next;
>> unsigned int nr;
>> unsigned int max;
>> - struct page *pages[];
>> + struct folio_range ranges[];
>> };
>>
>> #define MAX_GATHER_BATCH \
>> - ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
>> + ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(struct folio_range))
>>
>> /*
>> * Limit the maximum number of mmu_gather batches to reduce a risk of soft
>> @@ -342,7 +342,8 @@ struct mmu_gather {
>> #ifndef CONFIG_MMU_GATHER_NO_GATHER
>> struct mmu_gather_batch *active;
>> struct mmu_gather_batch local;
>> - struct page *__pages[MMU_GATHER_BUNDLE];
>> + struct folio_range __ranges[MMU_GATHER_BUNDLE];
>> + struct page *range_limit;
>> struct mmu_gather_batch *rmap_pend;
>> unsigned int rmap_pend_first;
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 914e08185272..f86c905a065d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1513,6 +1513,13 @@ static inline void folio_put_refs(struct folio *folio, int refs)
>> __folio_put(folio);
>> }
>>
>> +struct folio_range {
>> + struct page *start;
>> + struct page *end;
>> +};
>
> I see end is used for calculating nr_pages multiple times below. Maybe just
> use nr_pages instead of end here.

But then I'd need to calculate end (= start + nr_pages) every time
__tlb_remove_page() is called to figure out if the page being removed is the
next contiguous page in the current range. __tlb_remove_page() gets called for
every page, but the current way I do it, I only calculate nr_pages once per
range. So I think my way is more efficient?
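
For illustration, the per-page hot path in __tlb_remove_page_size() compares
roughly like this under the two encodings (just a sketch; the nr_pages variant
is hypothetical, not something in the patch):

        /* [start, end) encoding, as in this patch: one compare per page. */
        if (tlb->range_limit && page == range->end)
                range->end++;

        /* Hypothetical {start, nr_pages} encoding: an extra add per page. */
        if (tlb->range_limit && page == range->start + range->nr_pages)
                range->nr_pages++;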

>
> Also, struct page (memmap) might not be always contiguous, using struct page
> points to represent folio range might not give the result you want.
> See nth_page() and folio_page_idx() in include/linux/mm.h.

Is that true for pages within the same folio too? Or are all pages in a folio
guaranteed contiguous? Perhaps I'm better off using pfn?
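
To make the concern concrete (sketch only, reusing the field names from the
patch; `i` is just an index into the range):

        /* Assumes the memmap is contiguous across the whole folio: */
        struct page *p = range->start + i;

        /* Defensive form (see nth_page() in include/linux/mm.h): */
        struct page *q = nth_page(range->start, i);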



2023-08-10 15:51:53

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

On 10/08/2023 15:59, Zi Yan wrote:
> On 10 Aug 2023, at 10:55, Ryan Roberts wrote:
>
>> On 10/08/2023 15:44, Zi Yan wrote:
>>> On 10 Aug 2023, at 6:33, Ryan Roberts wrote:
>>>
>>>> mmu_gather accumulates a set of pages into a buffer for later rmap
>>>> removal and freeing. Page pointers were previously stored in a "linked
>>>> list of arrays", then at flush time, each page in the buffer was removed
>>>> from the rmap, removed from the swapcache and its refcount was
>>>> decremented; if the refcount reached 0, then it was freed.
>>>>
>>>> With increasing numbers of large folios (or at least contiguous parts of
>>>> large folios) mapped into userspace processes (pagecache pages for
>>>> supporting filesystems currently, but in future also large anonymous
>>>> folios), we can measurably improve performance of process teardown:
>>>>
>>>> - For rmap removal, we can batch-remove a range of pages belonging to
>>>> the same folio with folio_remove_rmap_range(), which is more efficient
>>>> because atomics can be manipulated just once per range. In the common
>>>> case, it also allows us to elide adding the (anon) folio to the
>>>> deferred split queue, only to remove it a bit later, once all pages of
>>>> the folio have been removed from the rmap.
>>>>
>>>> - For swapcache removal, we only need to check and remove the folio from
>>>> the swap cache once, rather than trying for each individual page.
>>>>
>>>> - For page release, we can batch-decrement the refcount for each page in
>>>> the folio and free it if it hits zero.
>>>>
>>>> Change the page pointer storage format within the mmu_gather batch
>>>> structure to store "folio_range"s; a [start, end) page pointer pair.
>>>> This allows us to run length encode a contiguous range of pages that all
>>>> belong to the same folio. This likely allows us to improve cache
>>>> locality a bit. But it also gives us a convenient format for
>>>> implementing the above 3 optimizations.
>>>>
>>>> Of course if running on a system that does not extensively use large
>>>> pte-mapped folios, then the RLE approach uses twice as much memory,
>>>> because each range is 1 page long and uses 2 pointers. But performance
>>>> measurements show no impact in terms of performance.
>>>>
>>>> Macro Performance Results
>>>> -------------------------
>>>>
>>>> Test: Timed kernel compilation on Ampere Altra (arm64), 80 jobs
>>>> Configs: Comparing with and without large anon folios
>>>>
>>>> Without large anon folios:
>>>> | kernel | real-time | kern-time | user-time |
>>>> |:-----------------|------------:|------------:|------------:|
>>>> | baseline-laf-off | 0.0% | 0.0% | 0.0% |
>>>> | mmugather-range | -0.3% | -0.3% | 0.1% |
>>>>
>>>> With large anon folios (order-3):
>>>> | kernel | real-time | kern-time | user-time |
>>>> |:-----------------|------------:|------------:|------------:|
>>>> | baseline-laf-on | 0.0% | 0.0% | 0.0% |
>>>> | mmugather-range | -0.7% | -3.9% | -0.1% |
>>>>
>>>> Test: Timed kernel compilation in VM on Apple M2 MacBook Pro, 8 jobs
>>>> Configs: Comparing with and without large anon folios
>>>>
>>>> Without large anon folios:
>>>> | kernel | real-time | kern-time | user-time |
>>>> |:-----------------|------------:|------------:|------------:|
>>>> | baseline-laf-off | 0.0% | 0.0% | 0.0% |
>>>> | mmugather-range | -0.9% | -2.9% | -0.6% |
>>>>
>>>> With large anon folios (order-3):
>>>> | kernel | real-time | kern-time | user-time |
>>>> |:-----------------|------------:|------------:|------------:|
>>>> | baseline-laf-on | 0.0% | 0.0% | 0.0% |
>>>> | mmugather-range | -0.4% | -3.7% | -0.2% |
>>>>
>>>> Micro Performance Results
>>>> -------------------------
>>>>
>>>> Flame graphs for kernel compilation on Ampere Altra show reduction in
>>>> cycles consumed by __arm64_sys_exit_group syscall:
>>>>
>>>> Without large anon folios: -2%
>>>> With large anon folios: -26%
>>>>
>>>> For the large anon folios case, it also shows a big difference in cost
>>>> of rmap removal:
>>>>
>>>> baseline: cycles in page_remove_rmap(): 24.7B
>>>> mmugather-range: cycles in folio_remove_rmap_range(): 5.5B
>>>>
>>>> Furthermore, the baseline shows 5.2B cycles used by
>>>> deferred_split_folio() which has completely disappeared after
>>>> applying this series.
>>>>
>>>> Signed-off-by: Ryan Roberts <[email protected]>
>>>> ---
>>>> include/asm-generic/tlb.h | 7 +--
>>>> include/linux/mm.h | 7 +++
>>>> include/linux/swap.h | 6 +--
>>>> mm/mmu_gather.c | 56 ++++++++++++++++++++----
>>>> mm/swap.c | 91 +++++++++++++++++++++++++++++++++++++++
>>>> mm/swap_state.c | 11 ++---
>>>> 6 files changed, 158 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>>>> index d874415aaa33..fe300a64e59d 100644
>>>> --- a/include/asm-generic/tlb.h
>>>> +++ b/include/asm-generic/tlb.h
>>>> @@ -246,11 +246,11 @@ struct mmu_gather_batch {
>>>> struct mmu_gather_batch *next;
>>>> unsigned int nr;
>>>> unsigned int max;
>>>> - struct page *pages[];
>>>> + struct folio_range ranges[];
>>>> };
>>>>
>>>> #define MAX_GATHER_BATCH \
>>>> - ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
>>>> + ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(struct folio_range))
>>>>
>>>> /*
>>>> * Limit the maximum number of mmu_gather batches to reduce a risk of soft
>>>> @@ -342,7 +342,8 @@ struct mmu_gather {
>>>> #ifndef CONFIG_MMU_GATHER_NO_GATHER
>>>> struct mmu_gather_batch *active;
>>>> struct mmu_gather_batch local;
>>>> - struct page *__pages[MMU_GATHER_BUNDLE];
>>>> + struct folio_range __ranges[MMU_GATHER_BUNDLE];
>>>> + struct page *range_limit;
>>>> struct mmu_gather_batch *rmap_pend;
>>>> unsigned int rmap_pend_first;
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 914e08185272..f86c905a065d 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -1513,6 +1513,13 @@ static inline void folio_put_refs(struct folio *folio, int refs)
>>>> __folio_put(folio);
>>>> }
>>>>
>>>> +struct folio_range {
>>>> + struct page *start;
>>>> + struct page *end;
>>>> +};
>>>
>>> I see end is used for calculating nr_pages multiple times below. Maybe just
>>> use nr_pages instead of end here.
>>
>> But then I'd need to calculate end (= start + nr_pages) every time
>> __tlb_remove_page() is called to figure out if the page being removed is the
>> next contiguous page in the current range. __tlb_remove_page() gets called for
>> every page, but the current way I do it, I only calculate nr_pages once per
>> range. So I think my way is more efficient?
>>
>>>
>>> Also, struct page (memmap) might not be always contiguous, using struct page
>>> points to represent folio range might not give the result you want.
>>> See nth_page() and folio_page_idx() in include/linux/mm.h.
>>
>> Is that true for pages within the same folio too? Or are all pages in a folio
>> guaranteed contiguous? Perhaps I'm better off using pfn?
>
> folio_page_idx() says not all pages in a folio are guaranteed to be contiguous.
> PFN might be a better choice.

OK thanks for pointing this out. I'll plan to use PFNs for v2, which I'll send
out once I'm back from holiday (and hopefully have feedback from others to roll
in too).
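
For illustration, a pfn-based range might look roughly like this (purely a
hypothetical sketch of the idea; names invented here, untested):

        /*
         * Encode the range as pfns; converting back with pfn_to_page() is
         * correct even when the memmap is not virtually contiguous.
         */
        struct folio_pfn_range {
                unsigned long start_pfn;        /* first page in the range */
                unsigned long end_pfn;          /* one past the last page */
        };

        static inline struct page *range_page(struct folio_pfn_range *r, int i)
        {
                return pfn_to_page(r->start_pfn + i);
        }

        static inline unsigned long range_nr_pages(struct folio_pfn_range *r)
        {
                return r->end_pfn - r->start_pfn;
        }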



2023-08-10 17:36:50

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

On 10 Aug 2023, at 10:55, Ryan Roberts wrote:

> On 10/08/2023 15:44, Zi Yan wrote:
>> On 10 Aug 2023, at 6:33, Ryan Roberts wrote:
>>
>>> mmu_gather accumulates a set of pages into a buffer for later rmap
>>> removal and freeing. Page pointers were previously stored in a "linked
>>> list of arrays", then at flush time, each page in the buffer was removed
>>> from the rmap, removed from the swapcache and its refcount was
>>> decremented; if the refcount reached 0, then it was freed.
>>>
>>> With increasing numbers of large folios (or at least contiguous parts of
>>> large folios) mapped into userspace processes (pagecache pages for
>>> supporting filesystems currently, but in future also large anonymous
>>> folios), we can measurably improve performance of process teardown:
>>>
>>> - For rmap removal, we can batch-remove a range of pages belonging to
>>> the same folio with folio_remove_rmap_range(), which is more efficient
>>> because atomics can be manipulated just once per range. In the common
>>> case, it also allows us to elide adding the (anon) folio to the
>>> deferred split queue, only to remove it a bit later, once all pages of
>>> the folio have been removed from the rmap.
>>>
>>> - For swapcache removal, we only need to check and remove the folio from
>>> the swap cache once, rather than trying for each individual page.
>>>
>>> - For page release, we can batch-decrement the refcount for each page in
>>> the folio and free it if it hits zero.
>>>
>>> Change the page pointer storage format within the mmu_gather batch
>>> structure to store "folio_range"s; a [start, end) page pointer pair.
>>> This allows us to run length encode a contiguous range of pages that all
>>> belong to the same folio. This likely allows us to improve cache
>>> locality a bit. But it also gives us a convenient format for
>>> implementing the above 3 optimizations.
>>>
>>> Of course if running on a system that does not extensively use large
>>> pte-mapped folios, then the RLE approach uses twice as much memory,
>>> because each range is 1 page long and uses 2 pointers. But performance
>>> measurements show no impact in terms of performance.
>>>
>>> Macro Performance Results
>>> -------------------------
>>>
>>> Test: Timed kernel compilation on Ampere Altra (arm64), 80 jobs
>>> Configs: Comparing with and without large anon folios
>>>
>>> Without large anon folios:
>>> | kernel | real-time | kern-time | user-time |
>>> |:-----------------|------------:|------------:|------------:|
>>> | baseline-laf-off | 0.0% | 0.0% | 0.0% |
>>> | mmugather-range | -0.3% | -0.3% | 0.1% |
>>>
>>> With large anon folios (order-3):
>>> | kernel | real-time | kern-time | user-time |
>>> |:-----------------|------------:|------------:|------------:|
>>> | baseline-laf-on | 0.0% | 0.0% | 0.0% |
>>> | mmugather-range | -0.7% | -3.9% | -0.1% |
>>>
>>> Test: Timed kernel compilation in VM on Apple M2 MacBook Pro, 8 jobs
>>> Configs: Comparing with and without large anon folios
>>>
>>> Without large anon folios:
>>> | kernel | real-time | kern-time | user-time |
>>> |:-----------------|------------:|------------:|------------:|
>>> | baseline-laf-off | 0.0% | 0.0% | 0.0% |
>>> | mmugather-range | -0.9% | -2.9% | -0.6% |
>>>
>>> With large anon folios (order-3):
>>> | kernel | real-time | kern-time | user-time |
>>> |:-----------------|------------:|------------:|------------:|
>>> | baseline-laf-on | 0.0% | 0.0% | 0.0% |
>>> | mmugather-range | -0.4% | -3.7% | -0.2% |
>>>
>>> Micro Performance Results
>>> -------------------------
>>>
>>> Flame graphs for kernel compilation on Ampere Altra show reduction in
>>> cycles consumed by __arm64_sys_exit_group syscall:
>>>
>>> Without large anon folios: -2%
>>> With large anon folios: -26%
>>>
>>> For the large anon folios case, it also shows a big difference in cost
>>> of rmap removal:
>>>
>>> baseline: cycles in page_remove_rmap(): 24.7B
>>> mmugather-range: cycles in folio_remove_rmap_range(): 5.5B
>>>
>>> Furthermore, the baseline shows 5.2B cycles used by
>>> deferred_split_folio() which has completely disappeared after
>>> applying this series.
>>>
>>> Signed-off-by: Ryan Roberts <[email protected]>
>>> ---
>>> include/asm-generic/tlb.h | 7 +--
>>> include/linux/mm.h | 7 +++
>>> include/linux/swap.h | 6 +--
>>> mm/mmu_gather.c | 56 ++++++++++++++++++++----
>>> mm/swap.c | 91 +++++++++++++++++++++++++++++++++++++++
>>> mm/swap_state.c | 11 ++---
>>> 6 files changed, 158 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>>> index d874415aaa33..fe300a64e59d 100644
>>> --- a/include/asm-generic/tlb.h
>>> +++ b/include/asm-generic/tlb.h
>>> @@ -246,11 +246,11 @@ struct mmu_gather_batch {
>>> struct mmu_gather_batch *next;
>>> unsigned int nr;
>>> unsigned int max;
>>> - struct page *pages[];
>>> + struct folio_range ranges[];
>>> };
>>>
>>> #define MAX_GATHER_BATCH \
>>> - ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(void *))
>>> + ((PAGE_SIZE - sizeof(struct mmu_gather_batch)) / sizeof(struct folio_range))
>>>
>>> /*
>>> * Limit the maximum number of mmu_gather batches to reduce a risk of soft
>>> @@ -342,7 +342,8 @@ struct mmu_gather {
>>> #ifndef CONFIG_MMU_GATHER_NO_GATHER
>>> struct mmu_gather_batch *active;
>>> struct mmu_gather_batch local;
>>> - struct page *__pages[MMU_GATHER_BUNDLE];
>>> + struct folio_range __ranges[MMU_GATHER_BUNDLE];
>>> + struct page *range_limit;
>>> struct mmu_gather_batch *rmap_pend;
>>> unsigned int rmap_pend_first;
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 914e08185272..f86c905a065d 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1513,6 +1513,13 @@ static inline void folio_put_refs(struct folio *folio, int refs)
>>> __folio_put(folio);
>>> }
>>>
>>> +struct folio_range {
>>> + struct page *start;
>>> + struct page *end;
>>> +};
>>
>> I see end is used for calculating nr_pages multiple times below. Maybe just
>> use nr_pages instead of end here.
>
> But then I'd need to calculate end (= start + nr_pages) every time
> __tlb_remove_page() is called to figure out if the page being removed is the
> next contiguous page in the current range. __tlb_remove_page() gets called for
> every page, but the current way I do it, I only calculate nr_pages once per
> range. So I think my way is more efficient?
>
>>
>> Also, struct page (memmap) might not be always contiguous, using struct page
>> points to represent folio range might not give the result you want.
>> See nth_page() and folio_page_idx() in include/linux/mm.h.
>
> Is that true for pages within the same folio too? Or are all pages in a folio
> guaranteed contiguous? Perhaps I'm better off using pfn?

folio_page_idx() says not all pages in a folio are guaranteed to be contiguous.
PFN might be a better choice.
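
(Roughly what folio_page_idx() does, from memory - with CONFIG_SPARSEMEM and
without CONFIG_SPARSEMEM_VMEMMAP it falls back to pfn arithmetic rather than
pointer arithmetic, along these lines:

        #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
        #define folio_page_idx(folio, p)        (page_to_pfn(p) - folio_pfn(folio))
        #else
        #define folio_page_idx(folio, p)        ((p) - &(folio)->page)
        #endif

so pfns stay valid even where struct page pointers are not contiguous.)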



--
Best Regards,
Yan, Zi



2023-08-10 19:09:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 3/4] mm/mmu_gather: Remove encoded_page infrastructure

On Thu, 10 Aug 2023 at 10:35, Yu Zhao <[email protected]> wrote:
>
> Adding the original author and reviewers... They might want (need) to
> take a look at this series.

It looks fine to me. The important part is that the rmap removal has
to be done after the TLB flush, but before the page table lock is
released.

That used to be a special thing for anonymous pages and thus needed
that special flag. But if it's done for *all* pages the need to flag
pages goes away.
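
Heavily simplified, the required ordering in the zap path is (a sketch, not
the exact mm/memory.c code):

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl); /* take the PTL */
        /* ... clear ptes, queue the pages into the mmu_gather ... */
        tlb_flush_mmu_tlbonly(tlb);                     /* 1. flush the TLB */
        tlb_flush_rmaps(tlb, vma);                      /* 2. do the rmap removal */
        pte_unmap_unlock(pte, ptl);                     /* 3. release the PTL */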

I see no issues with this, although obviously I might have missed something.

Linus

2023-08-10 19:17:17

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v1 3/4] mm/mmu_gather: Remove encoded_page infrastructure

On Thu, Aug 10, 2023 at 4:33 AM Ryan Roberts <[email protected]> wrote:
>
> commit 70fb4fdff582 ("mm: introduce 'encoded' page pointers with
> embedded extra bits") and commit 7cc8f9c7146a ("mm: mmu_gather: prepare
> to gather encoded page pointers with flags") converted mmu_gather for
> dealing with encoded_page, where the bottom 2 bits could encode extra
> flags. Only 1 bit was ever used; to flag whether the page should
> participate in a delayed rmap removal.
>
> Now that the mmu_gather batched rmap removal mechanism has been
> generalized, all pages participate and therefore the flag is unused. So
> let's remove encoded_page to simplify the code. It also gets in the way
> of further optimization which will be done in a follow up patch.
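
(For reference, the encoding being removed stashes flag bits in the low bits
of the page pointer - roughly, a simplified sketch of the helpers from
include/linux/mm_types.h:

        #define ENCODE_PAGE_BITS        3ul

        static inline struct encoded_page *encode_page(struct page *page,
                                                       unsigned long flags)
        {
                return (struct encoded_page *)(flags | (unsigned long)page);
        }

        static inline struct page *encoded_page_ptr(struct encoded_page *page)
        {
                return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
        }

with bit 0 being the "delay rmap" flag that was the only one ever used.)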

Adding the original author and reviewers... They might want (need) to
take a look at this series.

2023-08-10 19:20:19

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 3/4] mm/mmu_gather: Remove encoded_page infrastructure

On 10/08/2023 19:31, Linus Torvalds wrote:
> On Thu, 10 Aug 2023 at 10:35, Yu Zhao <[email protected]> wrote:
>>
>> Adding the original author and reviewers... They might want (need) to
>> take a look at this series.
>
> It looks fine to me. The important part is that the rmap removal has
> to be done after the TLB flush, but before the page table lock is
> released.

Yes, we still abide by this rule for the !PageAnon(page) pages.

>
> That used to be a special thing for anonymous pages and thus needed
> that special flag. But if it's done for *all* pages the need to flag
> pages goes away.

I think you mistyped - assuming you meant pagecache pages rather than anonymous
pages?

>
> I see no issues with this, although obviously I might have missed something.

Thanks for looking so quickly - really appreciate it. And sorry not to have
included you initially.

Thanks,
Ryan

>
> Linus


2023-08-29 16:48:57

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

On 25/08/2023 05:09, Matthew Wilcox wrote:
> On Thu, Aug 10, 2023 at 11:33:32AM +0100, Ryan Roberts wrote:
>> +void folios_put_refs(struct folio_range *folios, int nr)
>> +{
>> + int i;
>> + LIST_HEAD(pages_to_free);
>> + struct lruvec *lruvec = NULL;
>> + unsigned long flags = 0;
>> + unsigned int lock_batch;
>> +
>> + for (i = 0; i < nr; i++) {
>> + struct folio *folio = page_folio(folios[i].start);
>> + int refs = folios[i].end - folios[i].start;
>> +
>> + /*
>> + * Make sure the IRQ-safe lock-holding time does not get
>> + * excessive with a continuous string of pages from the
>> + * same lruvec. The lock is held only if lruvec != NULL.
>> + */
>> + if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
>> + unlock_page_lruvec_irqrestore(lruvec, flags);
>> + lruvec = NULL;
>> + }
>> +
>> + if (is_huge_zero_page(&folio->page))
>> + continue;
>> +
>> + if (folio_is_zone_device(folio)) {
>> + if (lruvec) {
>> + unlock_page_lruvec_irqrestore(lruvec, flags);
>> + lruvec = NULL;
>> + }
>> + if (put_devmap_managed_page(&folio->page))
>> + continue;
>> + if (folio_put_testzero(folio))
>
> We're only putting one ref for the zone_device folios? Surely
> this should be ref_sub_and_test like below?

Good point. This function was originally a copy/paste of release_pages(), and I
obviously missed this. In fact, looking at it again today, I think I'll factor
out the vast majority into a common helper, since I'm currently duplicating a
whole bunch of it here. (A possible shape of the fix is sketched at the end of
this mail.)

In practice I think all devmap folios are small today though? So while I agree
I need to fix this, I think it will currently do the right thing?

>
>> + free_zone_device_page(&folio->page);
>> + continue;
>> + }
>> +
>> + if (!folio_ref_sub_and_test(folio, refs))
>> + continue;
>> +
>> + if (folio_test_large(folio)) {
>> + if (lruvec) {
>> + unlock_page_lruvec_irqrestore(lruvec, flags);
>> + lruvec = NULL;
>> + }
>> + __folio_put_large(folio);
>> + continue;
>> + }
>> +
>> + if (folio_test_lru(folio)) {
>> + struct lruvec *prev_lruvec = lruvec;
>> +
>> + lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
>> + &flags);
>> + if (prev_lruvec != lruvec)
>> + lock_batch = 0;
>> +
>> + lruvec_del_folio(lruvec, folio);
>> + __folio_clear_lru_flags(folio);
>> + }
>> +
>> + /*
>> + * In rare cases, when truncation or holepunching raced with
>> + * munlock after VM_LOCKED was cleared, Mlocked may still be
>> + * found set here. This does not indicate a problem, unless
>> + * "unevictable_pgs_cleared" appears worryingly large.
>> + */
>> + if (unlikely(folio_test_mlocked(folio))) {
>> + __folio_clear_mlocked(folio);
>> + zone_stat_sub_folio(folio, NR_MLOCK);
>> + count_vm_event(UNEVICTABLE_PGCLEARED);
>> + }
>
> You'll be glad to know I've factored out a nice little helper for that.

OK, what's it called? This is just copied from release_pages() at the moment.
Happy to use your helper in the refactored common helper.

>
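
For reference, the zone-device fix mentioned above might look something like
the following, mirroring how gup_put_folio() drops multiple references on
devmap pages. This is only a sketch against the quoted hunk, assuming the same
refs = folios[i].end - folios[i].start as above, not the final code:

                if (folio_is_zone_device(folio)) {
                        if (lruvec) {
                                unlock_page_lruvec_irqrestore(lruvec, flags);
                                lruvec = NULL;
                        }
                        /* Drop every ref the range holds, not just one. */
                        if (put_devmap_managed_page_refs(&folio->page, refs))
                                continue;
                        if (folio_ref_sub_and_test(folio, refs))
                                free_zone_device_page(&folio->page);
                        continue;
                }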


2023-08-29 17:19:44

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

On 29/08/2023 15:24, Matthew Wilcox wrote:
> On Tue, Aug 29, 2023 at 03:19:29PM +0100, Matthew Wilcox wrote:
>>>> You'll be glad to know I've factored out a nice little helper for that.
>>>
>>> OK, what's it called? This is just copied from release_pages() at the moment.
>>> Happy to use your helper in the refactored common helper.
>>
>> I'll send out those patches today.
>
> No, wait, I sent them on Friday.
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> is the important one from your point of view. It's
> __page_cache_release(), which is a little different from the current
> __page_cache_release().

Thanks! Given your series is marked RFC, I won't take the dependency for now;
I'd rather keep my series independent for review. We can race to mm-unstable and
I guess the loser gets to do the merge. ;-)

2023-08-29 19:15:37

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

On Tue, Aug 29, 2023 at 03:19:29PM +0100, Matthew Wilcox wrote:
> > > You'll be glad to know I've factored out a nice little helper for that.
> >
> > OK, what's it called? This is just copied from release_pages() at the moment.
> > Happy to use your helper in the refactored common helper.
>
> I'll send out those patches today.

No, wait, I sent them on Friday.

https://lore.kernel.org/linux-mm/[email protected]/

is the important one from your point of view. It's
__page_cache_release(), which is a little different from the current
__page_cache_release().

2023-12-04 12:26:54

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

>>>
>>> Also, struct page (memmap) might not always be contiguous; using struct page
>>> pointers to represent a folio range might not give the result you want.
>>> See nth_page() and folio_page_idx() in include/linux/mm.h.
>>
>> Is that true for pages within the same folio too? Or are all pages in a folio
>> guaranteed contiguous? Perhaps I'm better off using pfn?
>
> folio_page_idx() says not all pages in a folio are guaranteed to be contiguous.
> PFN might be a better choice.

Hi Zi, Matthew,

Zi made this comment a couple of months back that it is incorrect to assume that
`struct page`s within a folio are (virtually) contiguous. I'm not sure if that's
really the case though? I see other sites in the source that do page++ when
iterating over a folio. e.g. smaps_account(), splice_folio_into_pipe(),
__collapse_huge_page_copy(), etc.

Any chance someone could explain the rules?

Thanks,
Ryan


2023-12-04 12:29:17

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

On 04.12.23 13:26, Ryan Roberts wrote:
>>>>
>>>> Also, struct page (memmap) might not always be contiguous; using struct page
>>>> pointers to represent a folio range might not give the result you want.
>>>> See nth_page() and folio_page_idx() in include/linux/mm.h.
>>>
>>> Is that true for pages within the same folio too? Or are all pages in a folio
>>> guaranteed contiguous? Perhaps I'm better off using pfn?
>>
>> folio_page_idx() says not all pages in a folio are guaranteed to be contiguous.
>> PFN might be a better choice.
>
> Hi Zi, Matthew,
>
> Zi made this comment a couple of months back that it is incorrect to assume that
> `struct page`s within a folio are (virtually) contiguous. I'm not sure if that's
> really the case though? I see other sites in the source that do page++ when
> iterating over a folio. e.g. smaps_account(), splice_folio_into_pipe(),
> __collapse_huge_page_copy(), etc.
>
> Any chance someone could explain the rules?

With the vmemmap, they are contiguous. Without a vmemmap, but with
sparsemem, we might end up allocating one memmap chunk per memory
section (e.g., 128 MiB).

So, for example, a 1 GiB hugetlb page could cross multiple 128 MiB
sections, and therefore, the memmap might not be virtually consecutive.
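
In code terms, this is what nth_page() hides (roughly, per the current
include/linux/mm.h definition), and why page++ is only safe when the memmap is
known to be virtually contiguous. A sketch of a safe folio walk:

#include <linux/mm.h>

/*
 * Roughly the include/linux/mm.h definition:
 *
 *   #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 *   #define nth_page(page, n)  pfn_to_page(page_to_pfn((page)) + (n))
 *   #else
 *   #define nth_page(page, n)  ((page) + (n))
 *   #endif
 */

/* Sketch: walk every page of a folio without assuming page++ is valid. */
static void walk_folio_pages_sketch(struct folio *folio)
{
        long i, nr = folio_nr_pages(folio);

        for (i = 0; i < nr; i++) {
                /* folio_page() goes through nth_page() internally. */
                struct page *page = folio_page(folio, i);

                /* ... operate on "page" here ... */
                (void)page;
        }
}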


--
Cheers,

David / dhildenb

2023-12-04 12:40:42

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

On 04/12/2023 12:28, David Hildenbrand wrote:
> On 04.12.23 13:26, Ryan Roberts wrote:
>>>>>
>>>>> Also, struct page (memmap) might not always be contiguous; using struct page
>>>>> pointers to represent a folio range might not give the result you want.
>>>>> See nth_page() and folio_page_idx() in include/linux/mm.h.
>>>>
>>>> Is that true for pages within the same folio too? Or are all pages in a folio
>>>> guaranteed contiguous? Perhaps I'm better off using pfn?
>>>
>>> folio_page_idx() says not all pages in a folio are guaranteed to be contiguous.
>>> PFN might be a better choice.
>>
>> Hi Zi, Matthew,
>>
>> Zi made this comment a couple of months back that it is incorrect to assume that
>> `struct page`s within a folio are (virtually) contiguous. I'm not sure if that's
>> really the case though? I see other sites in the source that do page++ when
>> iterating over a folio. e.g. smaps_account(), splice_folio_into_pipe(),
>> __collapse_huge_page_copy(), etc.
>>
>> Any chance someone could explain the rules?
>
> With the vmemmap, they are contiguous. Without a vmemmap, but with sparsemem, we
> might end up allocating one memmap chunk per memory section (e.g., 128 MiB).
>
> So, for example, a 1 GiB hugetlb page could cross multiple 128 MiB sections, and
> therefore, the memmap might not be virtually consecutive.

OK, is a "memory section" always 128M or is it variable? If fixed, does that
mean that it's impossible for a THP to cross section boundaries? (because a THP
is always smaller than a section?)

Trying to figure out why my original usage in this series was wrong, but
presumably the other places that I mentioned are safe.

>
>

2023-12-04 12:44:17

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

On 04.12.23 13:39, Ryan Roberts wrote:
> On 04/12/2023 12:28, David Hildenbrand wrote:
>> On 04.12.23 13:26, Ryan Roberts wrote:
>>>>>>
>>>>>> Also, struct page (memmap) might not always be contiguous; using struct page
>>>>>> pointers to represent a folio range might not give the result you want.
>>>>>> See nth_page() and folio_page_idx() in include/linux/mm.h.
>>>>>
>>>>> Is that true for pages within the same folio too? Or are all pages in a folio
>>>>> guaranteed contiguous? Perhaps I'm better off using pfn?
>>>>
>>>> folio_page_idx() says not all pages in a folio are guaranteed to be contiguous.
>>>> PFN might be a better choice.
>>>
>>> Hi Zi, Matthew,
>>>
>>> Zi made this comment a couple of months back that it is incorrect to assume that
>>> `struct page`s within a folio are (virtually) contiguous. I'm not sure if that's
>>> really the case though? I see other sites in the source that do page++ when
>>> iterating over a folio. e.g. smaps_account(), splice_folio_into_pipe(),
>>> __collapse_huge_page_copy(), etc.
>>>
>>> Any chance someone could explain the rules?
>>
>> With the vmemmap, they are contiguous. Without a vmemmap, but with sparsemem, we
>> might end up allocating one memmap chunk per memory section (e.g., 128 MiB).
>>
>> So, for example, a 1 GiB hugetlb page could cross multiple 128 MiB sections, and
>> therefore, the memmap might not be virtually consecutive.
>
> OK, is a "memory section" always 128M or is it variable? If fixed, does that
> mean that it's impossible for a THP to cross section boundaries? (because a THP
> is always smaller than a section?)

Section size is variable (see SECTION_SIZE_BITS), but IIRC, buddy
allocations will never cross a section boundary.

>
> Trying to figure out why my original usage in this series was wrong, but
> presumably the other places that I mentioned are safe.

If only dealing with buddy allocations, *currently* it might always fall
into a single memory section.

--
Cheers,

David / dhildenb

2023-12-04 12:57:56

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v1 4/4] mm/mmu_gather: Store and process pages in contig ranges

On 04/12/2023 12:43, David Hildenbrand wrote:
> On 04.12.23 13:39, Ryan Roberts wrote:
>> On 04/12/2023 12:28, David Hildenbrand wrote:
>>> On 04.12.23 13:26, Ryan Roberts wrote:
>>>>>>>
>>>>>>> Also, struct page (memmap) might not always be contiguous; using struct page
>>>>>>> pointers to represent a folio range might not give the result you want.
>>>>>>> See nth_page() and folio_page_idx() in include/linux/mm.h.
>>>>>>
>>>>>> Is that true for pages within the same folio too? Or are all pages in a folio
>>>>>> guaranteed contiguous? Perhaps I'm better off using pfn?
>>>>>
>>>>> folio_page_idx() says not all pages in a folio are guaranteed to be contiguous.
>>>>> PFN might be a better choice.
>>>>
>>>> Hi Zi, Matthew,
>>>>
>>>> Zi made this comment a couple of months back that it is incorrect to assume
>>>> that
>>>> `struct page`s within a folio are (virtually) contiguous. I'm not sure if
>>>> that's
>>>> really the case though? I see other sites in the source that do page++ when
>>>> iterating over a folio. e.g. smaps_account(), splice_folio_into_pipe(),
>>>> __collapse_huge_page_copy(), etc.
>>>>
>>>> Any chance someone could explain the rules?
>>>
>>> With the vmemmap, they are contiguous. Without a vmemmap, but with sparsemem, we
>>> might end up allocating one memmap chunk per memory section (e.g., 128 MiB).
>>>
>>> So, for example, a 1 GiB hugetlb page could cross multiple 128 MiB sections, and
>>> therefore, the memmap might not be virtually consecutive.
>>
>> OK, is a "memory section" always 128M or is it variable? If fixed, does that
>> mean that it's impossible for a THP to cross section boundaries? (because a THP
>> is always smaller than a section?)
>
> Section size is variable (see SECTION_SIZE_BITS), but IIRC, buddy allocations
> will never cross a section boundary.
>
>>
>> Trying to figure out why my original usage in this series was wrong, but
>> presumably the other places that I mentioned are safe.
>
> If only dealing with buddy allocations, *currently* it might always fall into a
> single memory section.

OK that makes sense - thanks!

>
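
For what it's worth, the pfn-based alternative floated earlier in the thread
could look something like this (a minimal sketch; the names are hypothetical
and not from the posted series):

#include <linux/mm.h>

/* Hypothetical pfn-based range, instead of start/end struct page pointers. */
struct pfn_range {
        unsigned long start;    /* first pfn in the range */
        unsigned long end;      /* one past the last pfn */
};

static inline unsigned long pfn_range_nr_pages(const struct pfn_range *r)
{
        return r->end - r->start;
}

/*
 * Look the page up via its pfn each time, so the result is correct even
 * when the memmap backing a large folio is not virtually contiguous.
 */
static inline struct page *pfn_range_page(const struct pfn_range *r,
                                          unsigned long n)
{
        return pfn_to_page(r->start + n);
}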