2023-10-25 14:46:57

by Ryan Roberts

Subject: [PATCH v3 0/4] Swap-out small-sized THP without splitting

Hi All,

This is v3 of a series to add support for swapping out small-sized THP without
needing to first split the large folio via __split_huge_page(). It closely
follows the approach already used by PMD-sized THP.

"Small-sized THP" is an upcoming feature that enables performance improvements
by allocating large folios for anonymous memory, where the large folio size is
smaller than the traditional PMD-size. See [3].

In some circumstances I've observed a performance regression (see patch 2 for
details), and this series is an attempt to fix the regression in advance of
merging small-sized THP support.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when the swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). Discussion against the RFC concluded
that this is probably sufficient.

The series applies against mm-unstable (1a3c85fa684a).


Changes since v2 [2]
====================

- Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
allocation. This required some refactoring to make everything work nicely
(new patches 2 and 3).
- Fix bug where nr_swap_pages would say there are pages available but the
scanner would not be able to allocate them because they were reserved for the
per-cpu allocator. We now allow stealing of order-0 entries from the high
order per-cpu clusters (in addition to existing stealing from order-0
per-cpu clusters).

Thanks to Huang, Ying for the review feedback and suggestions!


Changes since v1 [1]
====================

- patch 1:
- Use cluster_set_count() instead of cluster_set_count_flag() in
swap_alloc_cluster() since we no longer have any flag to set. I was unable
to kill cluster_set_count_flag() as proposed against v1 as other call
sites depend on explicitly setting flags to 0.
- patch 2:
- Moved large_next[] array into percpu_cluster to make it per-cpu
(recommended by Huang, Ying).
- large_next[] array is dynamically allocated because PMD_ORDER is not
compile-time constant for powerpc (fixes build error).


Thanks,
Ryan

P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
but given Huang Ying had provided some review feedback, I wanted to progress it.
All the actual prerequisites are either complete or being worked on by others.


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/


Ryan Roberts (4):
mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
mm: swap: Remove struct percpu_cluster
mm: swap: Simplify ssd behavior when scanner steals entry
mm: swap: Swap-out small-sized THP without splitting

include/linux/swap.h | 31 +++---
mm/huge_memory.c | 3 -
mm/swapfile.c | 232 ++++++++++++++++++++++++-------------------
mm/vmscan.c | 10 +-
4 files changed, 149 insertions(+), 127 deletions(-)

--
2.25.1


2023-10-25 14:47:06

by Ryan Roberts

Subject: [PATCH v3 2/4] mm: swap: Remove struct percpu_cluster

struct percpu_cluster stores the index of the cpu's current cluster and the
offset of the next entry that will be allocated for that cpu. These two
pieces of information are redundant because the cluster index is just
(offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the
cluster index is that the structure used for it also has a flag to
indicate "no cluster". However, this data structure also contains a spin
lock, which is never used in this context; as a side effect the code
copies the spinlock_t structure, which is questionable coding practice
in my view.

So let's clean this up and store only the next offset, and use a
sentinel value (SWAP_NEXT_NULL) to indicate "no cluster". SWAP_NEXT_NULL
is chosen to be 0 because 0 will never be seen legitimately: the first
page in the swap file is the swap header, which is always marked bad to
prevent it from being allocated as an entry. This also prevents the
cluster to which it belongs from being marked free, so it will never
appear on the free list.

This change saves 16 bytes per cpu. And given we are shortly going to
extend this mechanism to be per-cpu-AND-per-order, we will end up saving
16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the
system.
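
As an aside, here is a minimal userspace sketch (not kernel code) of the
arithmetic this relies on; SWAPFILE_CLUSTER is assumed to be 256 and the
names only mirror the kernel ones for illustration:

#include <stdio.h>

#define SWAPFILE_CLUSTER 256  /* assumed value for this sketch */
#define SWAP_NEXT_NULL   0    /* offset 0 is the swap header, never a valid entry */

/* The cluster index is fully derivable from the offset. */
static unsigned int cluster_of(unsigned int offset)
{
    return offset / SWAPFILE_CLUSTER;
}

int main(void)
{
    unsigned int cpu_next = 513;  /* pretend per-cpu next-allocation offset */

    printf("offset %u -> cluster %u\n", cpu_next, cluster_of(cpu_next));

    cpu_next = SWAP_NEXT_NULL;    /* "no cluster assigned to this cpu" */
    if (cpu_next == SWAP_NEXT_NULL)
        printf("no per-cpu cluster; the scanner must pick a new one\n");

    return 0;
}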

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/swap.h | 21 +++++++++++++--------
mm/swapfile.c | 43 +++++++++++++++++++------------------------
2 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a073366a227c..0ca8aaa098ba 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,14 +261,12 @@ struct swap_cluster_info {
#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */

/*
- * We assign a cluster to each CPU, so each CPU can allocate swap entry from
- * its own cluster and swapout sequentially. The purpose is to optimize swapout
- * throughput.
+ * The first page in the swap file is the swap header, which is always marked
+ * bad to prevent it from being allocated as an entry. This also prevents the
+ * cluster to which it belongs being marked free. Therefore 0 is safe to use as
+ * a sentinel to indicate cpu_next is not valid in swap_info_struct.
*/
-struct percpu_cluster {
- struct swap_cluster_info index; /* Current cluster index */
- unsigned int next; /* Likely next allocation offset */
-};
+#define SWAP_NEXT_NULL 0

struct swap_cluster_list {
struct swap_cluster_info head;
@@ -295,7 +293,14 @@ struct swap_info_struct {
unsigned int cluster_next; /* likely index for next allocation */
unsigned int cluster_nr; /* countdown to next cluster search */
unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
- struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
+ unsigned int __percpu *cpu_next;/*
+ * Likely next allocation offset. We
+ * assign a cluster to each CPU, so each
+ * CPU can allocate swap entry from its
+ * own cluster and swapout sequentially.
+ * The purpose is to optimize swapout
+ * throughput.
+ */
struct rb_root swap_extent_root;/* root of the swap extent rbtree */
struct block_device *bdev; /* swap device or bdev of swap file */
struct file *swap_file; /* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b83ad77e04c0..617e34b8cdbe 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -591,7 +591,6 @@ static bool
scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
unsigned long offset)
{
- struct percpu_cluster *percpu_cluster;
bool conflict;

offset /= SWAPFILE_CLUSTER;
@@ -602,8 +601,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
if (!conflict)
return false;

- percpu_cluster = this_cpu_ptr(si->percpu_cluster);
- cluster_set_null(&percpu_cluster->index);
+ *this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
return true;
}

@@ -614,16 +612,16 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
unsigned long *offset, unsigned long *scan_base)
{
- struct percpu_cluster *cluster;
struct swap_cluster_info *ci;
- unsigned long tmp, max;
+ unsigned int tmp, max;
+ unsigned int *cpu_next;

new_cluster:
- cluster = this_cpu_ptr(si->percpu_cluster);
- if (cluster_is_null(&cluster->index)) {
+ cpu_next = this_cpu_ptr(si->cpu_next);
+ tmp = *cpu_next;
+ if (tmp == SWAP_NEXT_NULL) {
if (!cluster_list_empty(&si->free_clusters)) {
- cluster->index = si->free_clusters.head;
- cluster->next = cluster_next(&cluster->index) *
+ tmp = cluster_next(&si->free_clusters.head) *
SWAPFILE_CLUSTER;
} else if (!cluster_list_empty(&si->discard_clusters)) {
/*
@@ -643,9 +641,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
* Other CPUs can use our cluster if they can't find a free cluster,
* check if there is still free entry in the cluster
*/
- tmp = cluster->next;
max = min_t(unsigned long, si->max,
- (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER);
+ ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER);
if (tmp < max) {
ci = lock_cluster(si, tmp);
while (tmp < max) {
@@ -656,12 +653,13 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
unlock_cluster(ci);
}
if (tmp >= max) {
- cluster_set_null(&cluster->index);
+ *cpu_next = SWAP_NEXT_NULL;
goto new_cluster;
}
- cluster->next = tmp + 1;
*offset = tmp;
*scan_base = tmp;
+ tmp += 1;
+ *cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
return true;
}

@@ -2488,8 +2486,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
arch_swap_invalidate_area(p->type);
zswap_swapoff(p->type);
mutex_unlock(&swapon_mutex);
- free_percpu(p->percpu_cluster);
- p->percpu_cluster = NULL;
+ free_percpu(p->cpu_next);
+ p->cpu_next = NULL;
free_percpu(p->cluster_next_cpu);
p->cluster_next_cpu = NULL;
vfree(swap_map);
@@ -3073,16 +3071,13 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
for (ci = 0; ci < nr_cluster; ci++)
spin_lock_init(&((cluster_info + ci)->lock));

- p->percpu_cluster = alloc_percpu(struct percpu_cluster);
- if (!p->percpu_cluster) {
+ p->cpu_next = alloc_percpu(unsigned int);
+ if (!p->cpu_next) {
error = -ENOMEM;
goto bad_swap_unlock_inode;
}
- for_each_possible_cpu(cpu) {
- struct percpu_cluster *cluster;
- cluster = per_cpu_ptr(p->percpu_cluster, cpu);
- cluster_set_null(&cluster->index);
- }
+ for_each_possible_cpu(cpu)
+ per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
} else {
atomic_inc(&nr_rotate_swap);
inced_nr_rotate_swap = true;
@@ -3171,8 +3166,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
bad_swap_unlock_inode:
inode_unlock(inode);
bad_swap:
- free_percpu(p->percpu_cluster);
- p->percpu_cluster = NULL;
+ free_percpu(p->cpu_next);
+ p->cpu_next = NULL;
free_percpu(p->cluster_next_cpu);
p->cluster_next_cpu = NULL;
if (inode && S_ISBLK(inode->i_mode) && p->bdev) {
--
2.25.1

2023-11-29 07:48:25

by Barry Song

Subject: Re: [PATCH v3 0/4] Swap-out small-sized THP without splitting

> Hi All,
>
> This is v3 of a series to add support for swapping out small-sized THP without
> needing to first split the large folio via __split_huge_page(). It closely
> follows the approach already used by PMD-sized THP.
>
> "Small-sized THP" is an upcoming feature that enables performance improvements
> by allocating large folios for anonymous memory, where the large folio size is
> smaller than the traditional PMD-size. See [3].
>
> In some circumstances I've observed a performance regression (see patch 2 for
> details), and this series is an attempt to fix the regression in advance of
> merging small-sized THP support.
>
> I've done what I thought was the smallest change possible, and as a result, this
> approach is only employed when the swap is backed by a non-rotating block device
> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
> that this is probably sufficient.
>
> The series applies against mm-unstable (1a3c85fa684a)
>
>
> Changes since v2 [2]
> ====================
>
> - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
> allocation. This required some refactoring to make everything work nicely
> (new patches 2 and 3).
> - Fix bug where nr_swap_pages would say there are pages available but the
> scanner would not be able to allocate them because they were reserved for the
> per-cpu allocator. We now allow stealing of order-0 entries from the high
> order per-cpu clusters (in addition to exisiting stealing from order-0
> per-cpu clusters).
>
> Thanks to Huang, Ying for the review feedback and suggestions!
>
>
> Changes since v1 [1]
> ====================
>
> - patch 1:
> - Use cluster_set_count() instead of cluster_set_count_flag() in
> swap_alloc_cluster() since we no longer have any flag to set. I was unable
> to kill cluster_set_count_flag() as proposed against v1 as other call
> sites depend explicitly setting flags to 0.
> - patch 2:
> - Moved large_next[] array into percpu_cluster to make it per-cpu
> (recommended by Huang, Ying).
> - large_next[] array is dynamically allocated because PMD_ORDER is not
> compile-time constant for powerpc (fixes build error).
>
>
> Thanks,
> Ryan

> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
> but given Huang Ying had provided some review feedback, I wanted to progress it.
> All the actual prerequisites are either complete or being worked on by others.
>

Hi Ryan,

this is quite important to a phone and a must-have component, and so is
large-folio swapin, as I explained to you in another email.
Luckily, we have Chuanhua Han (Cc-ed) preparing a patchset for large-folio
swapin on top of this patchset of yours, probably a port and cleanup of our
do_swap_page[1] against yours.

Another concern is that swap slots can become fragmented: if we place small and
large folios in the same swap device, then since large folios always require
contiguous swap slots, we can fail to get slots even when many free slots
remain, just not contiguous ones. To avoid this, the dynamic hugepage solution
[2] has two swap devices, one for basepages and the other for CONTPTE. We have
modified the priority-based selection of swap devices to choose a swap device
based on whether the folio is small or large. I realize this approach is super
ugly and might be very hard to upstream; it seems not universal, especially if
you are a Linux server (-_-)

Two devices are not a nice approach even though it works well for a real
product; we might still need some decent way to address this problem, while the
problem is for sure not a stopper for your patchset.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
[2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129

>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
> [3] https://lore.kernel.org/linux-mm/[email protected]/
>
>
> Ryan Roberts (4):
> mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
> mm: swap: Remove struct percpu_cluster
> mm: swap: Simplify ssd behavior when scanner steals entry
> mm: swap: Swap-out small-sized THP without splitting
>
> include/linux/swap.h | 31 +++---
> mm/huge_memory.c | 3 -
> mm/swapfile.c | 232 ++++++++++++++++++++++++-------------------
> mm/vmscan.c | 10 +-
> 4 files changed, 149 insertions(+), 127 deletions(-)

Thanks
Barry

2023-11-29 12:07:08

by Ryan Roberts

Subject: Re: [PATCH v3 0/4] Swap-out small-sized THP without splitting

On 29/11/2023 07:47, Barry Song wrote:
>> Hi All,
>>
>> This is v3 of a series to add support for swapping out small-sized THP without
>> needing to first split the large folio via __split_huge_page(). It closely
>> follows the approach already used by PMD-sized THP.
>>
>> "Small-sized THP" is an upcoming feature that enables performance improvements
>> by allocating large folios for anonymous memory, where the large folio size is
>> smaller than the traditional PMD-size. See [3].
>>
>> In some circumstances I've observed a performance regression (see patch 2 for
>> details), and this series is an attempt to fix the regression in advance of
>> merging small-sized THP support.
>>
>> I've done what I thought was the smallest change possible, and as a result, this
>> approach is only employed when the swap is backed by a non-rotating block device
>> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
>> that this is probably sufficient.
>>
>> The series applies against mm-unstable (1a3c85fa684a)
>>
>>
>> Changes since v2 [2]
>> ====================
>>
>> - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
>> allocation. This required some refactoring to make everything work nicely
>> (new patches 2 and 3).
>> - Fix bug where nr_swap_pages would say there are pages available but the
>> scanner would not be able to allocate them because they were reserved for the
>> per-cpu allocator. We now allow stealing of order-0 entries from the high
>> order per-cpu clusters (in addition to exisiting stealing from order-0
>> per-cpu clusters).
>>
>> Thanks to Huang, Ying for the review feedback and suggestions!
>>
>>
>> Changes since v1 [1]
>> ====================
>>
>> - patch 1:
>> - Use cluster_set_count() instead of cluster_set_count_flag() in
>> swap_alloc_cluster() since we no longer have any flag to set. I was unable
>> to kill cluster_set_count_flag() as proposed against v1 as other call
>> sites depend explicitly setting flags to 0.
>> - patch 2:
>> - Moved large_next[] array into percpu_cluster to make it per-cpu
>> (recommended by Huang, Ying).
>> - large_next[] array is dynamically allocated because PMD_ORDER is not
>> compile-time constant for powerpc (fixes build error).
>>
>>
>> Thanks,
>> Ryan
>
>> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
>> but given Huang Ying had provided some review feedback, I wanted to progress it.
>> All the actual prerequisites are either complete or being worked on by others.
>>
>
> Hi Ryan,
>
> this is quite important to a phone and a must-have component, so is large-folio
> swapin, as i explained to you in another email.

Yes understood; the "prerequisites" are just the things that must be merged
*before* small-sized THP to ensure we don't regress existing behaviour or to
ensure that small-size THP is correct/robust when enabled. Performance
improvements can be merged after the initial small-sized series.

> Luckily, we are having Chuanhua Han(Cc-ed) to prepare a patchset of largefolio
> swapin on top of your this patchset, probably a port and cleanup of our
> do_swap_page[1] againest yours.

That's great to hear - welcome aboard, Chuanhua Han! Feel free to reach out if
you have questions.

I would guess that any large swap-in changes would be independent of this
swap-out patch though? Wouldn't you just be looking for contiguous swap entries
in the page table to determine a suitable folio order, then swap in each of
those entries into the folio? And if they happen to have contiguous swap offsets
(enabled by this swap-out series) then you potentially get a batched disk access
benefit.

That's just a guess though, perhaps you can describe your proposed approach?
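
Roughly, the kind of check I have in mind is something like the below
(simplified userspace sketch only; in the kernel this would walk real PTEs
under the PTL, and the helper name is made up):

#include <stdbool.h>
#include <stdio.h>

/*
 * Stand-in for the swap offsets found in a page-table range: check whether
 * the nr entries starting at 'first' are contiguous, which would allow
 * swapping them in as one large folio.
 */
static bool offsets_contiguous(const unsigned long *offsets, int first, int nr)
{
    for (int i = 1; i < nr; i++)
        if (offsets[first + i] != offsets[first] + i)
            return false;
    return true;
}

int main(void)
{
    unsigned long offsets[] = { 512, 513, 514, 515, 900, 901 };

    printf("4 contiguous from 0? %d\n", offsets_contiguous(offsets, 0, 4));
    printf("2 contiguous from 3? %d\n", offsets_contiguous(offsets, 3, 2));
    return 0;
}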

>
> Another concern is that swapslots can be fragmented, if we place small/large folios
> in a swap device, since large folios always require contiguous swapslot, we can
> result in failure of getting slots even we still have many free slots which are not
> contiguous.

This series tries to mitigate that problem by reserving a swap cluster per
order. That works well until we run out of swap clusters; a cluster can't be
freed until all contained swap entries are swapped back in and deallocated.
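
As a very rough userspace sketch of that idea only (this is not how the series
implements it; the real code is per-cpu, uses the existing cluster lists and
locking, and the names here are invented):

#include <stdio.h>

#define SWAPFILE_CLUSTER 256            /* entries per cluster (assumed) */
#define NR_ORDERS        4              /* illustration only */

static unsigned long next_free_cluster;      /* stand-in for the free-cluster list */
static unsigned long cur_cluster[NR_ORDERS]; /* 0 == no cluster reserved yet */
static unsigned long used[NR_ORDERS];        /* entries consumed in that cluster */

/* Carve 1 << order contiguous entries out of the cluster reserved for 'order'. */
static unsigned long alloc_entries(int order)
{
    unsigned long nr = 1UL << order;
    unsigned long offset;

    if (cur_cluster[order] == 0 || used[order] + nr > SWAPFILE_CLUSTER) {
        cur_cluster[order] = ++next_free_cluster; /* reserve a fresh cluster */
        used[order] = 0;
    }

    offset = cur_cluster[order] * SWAPFILE_CLUSTER + used[order];
    used[order] += nr;
    return offset;
}

int main(void)
{
    printf("order-2 folio -> offset %lu\n", alloc_entries(2));
    printf("order-0 page  -> offset %lu\n", alloc_entries(0));
    printf("order-2 folio -> offset %lu\n", alloc_entries(2));
    return 0;
}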

But I think we should start with the simple approach first, and only solve the
problems as they arise through real testing.

> To avoid this, [2] dynamic hugepage solution have two swap devices,
> one for basepage, the other one for CONTPTE. we have modified the priority-based
> selection of swap devices to choose swap devices based on small/large folios.
> i realize this approache is super ugly and might be very hard to find a way to
> upstream though, it seems not universal especially if you are a linux server (-_-)
>
> two devices are not a nice approach though it works well for a real product,
> we might still need some decent way to address this problem while the problem
> is for sure not a stopper of your patchset.

I guess that approach works for your case because A) you only have 2 sizes, and
B) your swap device is zRAM, which dynamically allocates RAM as it needs it.

The upstream small-sized THP solution can support multiple sizes, so you would
need a swap device per size (I think 13 is the limit at the moment - PMD size
for 64K base page). And if your swap device is a physical block device, you
can't dynamically partition it the way you can with zRAM. Neither of those things
scale particularly well IMHO.
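
For instance, with 64KiB base pages a PMD maps 2^13 pages (2^13 * 64KiB =
512MiB), so there are 13 possible large-folio orders, each of which would need
its own swap device under that scheme.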

>
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
> [2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129
>
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>> [2] https://lore.kernel.org/linux-mm/[email protected]/
>> [3] https://lore.kernel.org/linux-mm/[email protected]/
>>
>>
>> Ryan Roberts (4):
>> mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
>> mm: swap: Remove struct percpu_cluster
>> mm: swap: Simplify ssd behavior when scanner steals entry
>> mm: swap: Swap-out small-sized THP without splitting
>>
>> include/linux/swap.h | 31 +++---
>> mm/huge_memory.c | 3 -
>> mm/swapfile.c | 232 ++++++++++++++++++++++++-------------------
>> mm/vmscan.c | 10 +-
>> 4 files changed, 149 insertions(+), 127 deletions(-)
>
> Thanks
> Barry

2023-11-29 20:39:15

by Barry Song

Subject: Re: [PATCH v3 0/4] Swap-out small-sized THP without splitting

On Thu, Nov 30, 2023 at 1:06 AM Ryan Roberts <[email protected]> wrote:
>
> On 29/11/2023 07:47, Barry Song wrote:
> >> Hi All,
> >>
> >> This is v3 of a series to add support for swapping out small-sized THP without
> >> needing to first split the large folio via __split_huge_page(). It closely
> >> follows the approach already used by PMD-sized THP.
> >>
> >> "Small-sized THP" is an upcoming feature that enables performance improvements
> >> by allocating large folios for anonymous memory, where the large folio size is
> >> smaller than the traditional PMD-size. See [3].
> >>
> >> In some circumstances I've observed a performance regression (see patch 2 for
> >> details), and this series is an attempt to fix the regression in advance of
> >> merging small-sized THP support.
> >>
> >> I've done what I thought was the smallest change possible, and as a result, this
> >> approach is only employed when the swap is backed by a non-rotating block device
> >> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
> >> that this is probably sufficient.
> >>
> >> The series applies against mm-unstable (1a3c85fa684a)
> >>
> >>
> >> Changes since v2 [2]
> >> ====================
> >>
> >> - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
> >> allocation. This required some refactoring to make everything work nicely
> >> (new patches 2 and 3).
> >> - Fix bug where nr_swap_pages would say there are pages available but the
> >> scanner would not be able to allocate them because they were reserved for the
> >> per-cpu allocator. We now allow stealing of order-0 entries from the high
> >> order per-cpu clusters (in addition to exisiting stealing from order-0
> >> per-cpu clusters).
> >>
> >> Thanks to Huang, Ying for the review feedback and suggestions!
> >>
> >>
> >> Changes since v1 [1]
> >> ====================
> >>
> >> - patch 1:
> >> - Use cluster_set_count() instead of cluster_set_count_flag() in
> >> swap_alloc_cluster() since we no longer have any flag to set. I was unable
> >> to kill cluster_set_count_flag() as proposed against v1 as other call
> >> sites depend explicitly setting flags to 0.
> >> - patch 2:
> >> - Moved large_next[] array into percpu_cluster to make it per-cpu
> >> (recommended by Huang, Ying).
> >> - large_next[] array is dynamically allocated because PMD_ORDER is not
> >> compile-time constant for powerpc (fixes build error).
> >>
> >>
> >> Thanks,
> >> Ryan
> >
> >> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
> >> but given Huang Ying had provided some review feedback, I wanted to progress it.
> >> All the actual prerequisites are either complete or being worked on by others.
> >>
> >
> > Hi Ryan,
> >
> > this is quite important to a phone and a must-have component, so is large-folio
> > swapin, as i explained to you in another email.
>
> Yes understood; the "prerequisites" are just the things that must be merged
> *before* small-sized THP to ensure we don't regress existing behaviour or to
> ensure that small-size THP is correct/robust when enabled. Performance
> improvements can be merged after the initial small-sized series.

I completely agree. I didn't mean that small-THP swap-out as a whole should be
a prerequisite for the initial small-THP patchset; I was just describing how
important it is to a phone :-)

And actually we have gone much further than this on phones by optimizing
zsmalloc/zRAM to allow a large folio to be compressed and decompressed as a
whole; we have seen that compressing/decompressing a whole large folio can
significantly improve the compression ratio and decrease CPU consumption.

So that means large folios can not only save memory but also decrease CPU
consumption.

>
> > Luckily, we are having Chuanhua Han(Cc-ed) to prepare a patchset of largefolio
> > swapin on top of your this patchset, probably a port and cleanup of our
> > do_swap_page[1] againest yours.
>
> That's great to hear - welcome aboard, Chuanhua Han! Feel free to reach out if
> you have questions.
>
> I would guess that any large swap-in changes would be independent of this
> swap-out patch though? Wouldn't you just be looking for contiguous swap entries
> in the page table to determine a suitable folio order, then swap-in each of
> those entries into the folio? And if they happen to have contiguous swap offsets
> (enabled by this swap-out series) then you potentially get a batched disk access
> benefit.

I agree. Maybe we still need to check if the number of contiguous swap entries
is one of those supported large folio sizes?

>
> That's just a guess though, perhaps you can describe your proposed approach?

We have an ugly hack: if we are swapping in from the zRAM device dedicated to
large folios, we assume we have a chance to swap in as a whole, but we do also
handle corner cases in which some entries might have been zap_pte_range()-ed.

My current proposal is as below:
A1. we get the number of contiguous swap entries under the PTL and find that
    it is a valid large folio size;
A2. we allocate the large folio without the PTL;
A3. after taking the PTL again, we re-check whether the PTEs from A1 have been
    changed; if no other thread has changed those PTEs, we set_ptes() and
    finish the swap-in.

But we have a chance to fail in A2, so in that case we still need to fall back
to basepages.
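
A rough runnable sketch of just that check -> allocate -> re-check pattern, in
userspace C with made-up names (a mutex stands in for the PTL and an array for
the swap PTEs; none of the real fault plumbing is shown):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;
static unsigned long ptes[16];

/* A1: count contiguous swap entries, done under the "PTL" */
static int count_contig_swap_entries(int idx)
{
    int nr = 1;

    while (idx + nr < 16 && ptes[idx + nr] == ptes[idx] + nr)
        nr++;
    return nr;
}

/* A2: folio allocation may fail; no lock held here */
static void *alloc_folio(int nr)
{
    return (nr > 1 && rand() % 4 == 0) ? NULL : malloc((size_t)nr * 4096);
}

int main(void)
{
    int i, nr;
    void *folio;

    for (i = 0; i < 16; i++)
        ptes[i] = 512 + i;              /* pretend: contiguous swap entries */

    pthread_mutex_lock(&ptl);
    nr = count_contig_swap_entries(0);  /* A1 */
    pthread_mutex_unlock(&ptl);

    folio = alloc_folio(nr);            /* A2 */
    if (!folio) {
        nr = 1;                         /* fall back to a base page */
        folio = alloc_folio(1);
    }

    pthread_mutex_lock(&ptl);
    if (nr > 1 && count_contig_swap_entries(0) != nr) {  /* A3: re-check */
        pthread_mutex_unlock(&ptl);
        free(folio);
        printf("PTEs changed under us; retry or fall back\n");
        return 0;
    }
    /* set_ptes() equivalent would go here, still under the lock */
    pthread_mutex_unlock(&ptl);

    printf("swapped in %d page(s) as one folio\n", nr);
    free(folio);
    return 0;
}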

Considering the MTE thread [1] I am handling, where the MTE tag life cycle is
the same as the swap entry life cycle, it seems we will still need a page-level
arch_swap_restore even after we support large folio swap-in, for the below two
reasons:

1. contiguous PTEs might be partially dropped by madvise(MADV_DONTNEED) etc.;
2. we can still fall back to basepages for swap-in if we fail to allocate a
   large folio, even when the PTEs are all contiguous swap entries.

Of course, if we succeed in setting all PTEs for a large folio in A3, we can
have a folio-level arch_swap_restore.

To me, a universal folio-level arch_swap_restore does not seem sensible for
handling all kinds of complex cases.

[1] [RFC V3 PATCH] arm64: mm: swap: save and restore mte tags for large folios
https://lore.kernel.org/linux-mm/[email protected]/

>
> >
> > Another concern is that swapslots can be fragmented, if we place small/large folios
> > in a swap device, since large folios always require contiguous swapslot, we can
> > result in failure of getting slots even we still have many free slots which are not
> > contiguous.
>
> This series tries to mitigate that problem by reserving a swap cluster per
> order. That works well until we run out of swap clusters; a cluster can't be
> freed until all contained swap entries are swapped back in and deallocated.
>
> But I think we should start with the simple approach first, and only solve the
> problems as they arise through real testing.

I agree.

>
> > To avoid this, [2] dynamic hugepage solution have two swap devices,
> > one for basepage, the other one for CONTPTE. we have modified the priority-based
> > selection of swap devices to choose swap devices based on small/large folios.
> > i realize this approache is super ugly and might be very hard to find a way to
> > upstream though, it seems not universal especially if you are a linux server (-_-)
> >
> > two devices are not a nice approach though it works well for a real product,
> > we might still need some decent way to address this problem while the problem
> > is for sure not a stopper of your patchset.
>
> I guess that approach works for your case because A) you only have 2 sizes, and
> B) your swap device is zRAM, which dynamically allocate RAM as it needs it.
>
> The upstream small-sized THP solution can support multiple sizes, so you would
> need a swap device per size (I think 13 is the limit at the moment - PMD size
> for 64K base page). And if your swap device is a physical block device, you
> can't dynamically parition it the way you can with zRAM. Nether of those things
> scale particularly well IMHO.

right.

>
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
> > [2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129
> >
> >>
> >> [1] https://lore.kernel.org/linux-mm/[email protected]/
> >> [2] https://lore.kernel.org/linux-mm/[email protected]/
> >> [3] https://lore.kernel.org/linux-mm/[email protected]/
> >>
> >>
> >> Ryan Roberts (4):
> >> mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
> >> mm: swap: Remove struct percpu_cluster
> >> mm: swap: Simplify ssd behavior when scanner steals entry
> >> mm: swap: Swap-out small-sized THP without splitting
> >>
> >> include/linux/swap.h | 31 +++---
> >> mm/huge_memory.c | 3 -
> >> mm/swapfile.c | 232 ++++++++++++++++++++++++-------------------
> >> mm/vmscan.c | 10 +-
> >> 4 files changed, 149 insertions(+), 127 deletions(-)
> >

Thanks
Barry

2024-01-18 11:12:49

by Barry Song

Subject: [PATCH RFC 0/6] mm: support large folios swap-in

On an embedded system like Android, more than half of anon memory is actually
in swap devices such as zRAM. For example, while an app is switched to the
background, most of its memory might be swapped out.

Now we have mTHP features; unfortunately, if we don't support large folio
swap-in, then once those large folios are swapped out we immediately lose the
performance gain we can get through large folios and hardware optimizations
such as CONT-PTE.

In theory, we don't need to rely on Ryan's swap-out patchset [1]. That is to
say, even if some memory was mapped as normal pages before swap-out, we could
still swap it in as large folios. But this might require I/O to happen at
random places in the swap device. So we limit large folio swap-in to those
areas which were large folios before swapping out, i.e. where the swap slots
are also contiguous on the device. On the other hand, in OPPO's products we
have deployed anon large folios on millions of phones [2]. We enhanced zsmalloc
and zRAM to compress and decompress large folios as a whole, which helps
improve the compression ratio and decrease CPU consumption significantly. In
zsmalloc and zRAM we can store large objects whose original size is, for
example, 64KiB. So it is also a better choice for us to only swap in large
folios for those compressed large objects, as a large folio can be decompressed
all together.

Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
with MTE" to this series as it might help review.

[1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
https://lore.kernel.org/linux-mm/[email protected]/
[2] OnePlusOSS / android_kernel_oneplus_sm8550
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11

Barry Song (2):
arm64: mm: swap: support THP_SWAP on hardware with MTE
mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

Chuanhua Han (4):
mm: swap: introduce swap_nr_free() for batched swap_free()
mm: swap: make should_try_to_free_swap() support large-folio
mm: support large folios swapin as a whole
mm: madvise: don't split mTHP for MADV_PAGEOUT

arch/arm64/include/asm/pgtable.h | 21 ++----
arch/arm64/mm/mteswap.c | 42 ++++++++++++
include/asm-generic/tlb.h | 10 +++
include/linux/huge_mm.h | 12 ----
include/linux/pgtable.h | 62 ++++++++++++++++-
include/linux/swap.h | 6 ++
mm/madvise.c | 48 ++++++++++++++
mm/memory.c | 110 ++++++++++++++++++++++++++-----
mm/page_io.c | 2 +-
mm/rmap.c | 5 +-
mm/swap_slots.c | 2 +-
mm/swapfile.c | 29 ++++++++
12 files changed, 301 insertions(+), 48 deletions(-)

--
2.34.1


2024-01-18 11:13:16

by Barry Song

Subject: [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free()

From: Chuanhua Han <[email protected]>

While swapping in a large folio, we need to free the swap entries for the
whole folio. To avoid frequently acquiring and releasing swap locks, it is
better to introduce an API for batched freeing.
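
As a toy userspace illustration of why batching matters (not kernel code; the
helpers below only stand in for per-entry swap_free() versus the batched
swap_nr_free() introduced here):

#include <stdio.h>

static int lock_roundtrips;

static void lock(void)   { lock_roundtrips++; }
static void unlock(void) { }
static void drop_count(unsigned long offset) { (void)offset; }

/* one lock/unlock per entry, like freeing page by page */
static void free_one(unsigned long offset)
{
    lock();
    drop_count(offset);
    unlock();
}

/* one lock/unlock for the whole batch, like the new swap_nr_free() */
static void free_batch(unsigned long offset, int nr)
{
    int i;

    lock();
    for (i = 0; i < nr; i++)
        drop_count(offset + i);
    unlock();
}

int main(void)
{
    int i;

    for (i = 0; i < 16; i++)
        free_one(512 + i);
    printf("per-page freeing: %d lock round-trips\n", lock_roundtrips);

    lock_roundtrips = 0;
    free_batch(512, 16);
    printf("batched freeing:  %d lock round-trip(s)\n", lock_roundtrips);
    return 0;
}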

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/linux/swap.h | 6 ++++++
mm/swapfile.c | 29 +++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4db00ddad261..31a4ee2dcd1c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -478,6 +478,7 @@ extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
+extern void swap_nr_free(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
extern int free_swap_and_cache(swp_entry_t);
int swap_type_of(dev_t device, sector_t offset);
@@ -553,6 +554,11 @@ static inline void swap_free(swp_entry_t swp)
{
}

+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+
+}
+
static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
{
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 556ff7347d5f..6321bda96b77 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1335,6 +1335,35 @@ void swap_free(swp_entry_t entry)
__swap_entry_free(p, entry);
}

+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+ int i;
+ struct swap_cluster_info *ci;
+ struct swap_info_struct *p;
+ unsigned type = swp_type(entry);
+ unsigned long offset = swp_offset(entry);
+ DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
+
+ VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
+
+ if (nr_pages == 1) {
+ swap_free(entry);
+ return;
+ }
+
+ p = _swap_info_get(entry);
+
+ ci = lock_cluster(p, offset);
+ for (i = 0; i < nr_pages; i++) {
+ if (__swap_entry_free_locked(p, offset + i, 1))
+ __bitmap_set(usage, i, 1);
+ }
+ unlock_cluster(ci);
+
+ for_each_clear_bit(i, usage, nr_pages)
+ free_swap_slot(swp_entry(type, offset + i));
+}
+
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
--
2.34.1


2024-01-18 11:13:40

by Barry Song

Subject: [PATCH RFC 4/6] mm: support large folios swapin as a whole

From: Chuanhua Han <[email protected]>

On an embedded system like Android, more than half of anon memory is actually
in swap devices such as zRAM. For example, while an app is switched to the
background, most of its memory might be swapped out.

Now we have mTHP features; unfortunately, if we don't support large folio
swap-in, then once those large folios are swapped out we immediately lose the
performance gain we can get through large folios and hardware optimizations
such as CONT-PTE.

This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in to
those contiguous swap entries which were likely swapped out from an mTHP as a
whole.

On the other hand, the current implementation only covers the SWAP_SYNCHRONOUS
case. It doesn't support swapin_readahead() with large folios yet.

Right now, we re-fault large folios which are still in the swapcache as a
whole; this can effectively decrease the extra loops and early exits which we
introduced in arch_swap_restore() while supporting MTE restore for folios
rather than pages.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 94 insertions(+), 14 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f61a48929ba7..928b3f542932 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
static vm_fault_t do_fault(struct vm_fault *vmf);
static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
static bool vmf_pte_changed(struct vm_fault *vmf);
+static struct folio *alloc_anon_folio(struct vm_fault *vmf,
+ bool (*pte_range_check)(pte_t *, int));

/*
* Return true if the original pte was a uffd-wp pte marker (so the pte was
@@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
}

+static bool pte_range_swap(pte_t *pte, int nr_pages)
+{
+ int i;
+ swp_entry_t entry;
+ unsigned type;
+ pgoff_t start_offset;
+
+ entry = pte_to_swp_entry(ptep_get_lockless(pte));
+ if (non_swap_entry(entry))
+ return false;
+ start_offset = swp_offset(entry);
+ if (start_offset % nr_pages)
+ return false;
+
+ type = swp_type(entry);
+ for (i = 1; i < nr_pages; i++) {
+ entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
+ if (non_swap_entry(entry))
+ return false;
+ if (swp_offset(entry) != start_offset + i)
+ return false;
+ if (swp_type(entry) != type)
+ return false;
+ }
+
+ return true;
+}
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte_t pte;
vm_fault_t ret = 0;
void *shadow = NULL;
+ int nr_pages = 1;
+ unsigned long start_address;
+ pte_t *start_pte;

if (!pte_unmap_same(vmf))
goto out;
@@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
/* skip swapcache */
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
- vma, vmf->address, false);
+ folio = alloc_anon_folio(vmf, pte_range_swap);
page = &folio->page;
if (folio) {
__folio_set_locked(folio);
__folio_set_swapbacked(folio);

+ if (folio_test_large(folio)) {
+ unsigned long start_offset;
+
+ nr_pages = folio_nr_pages(folio);
+ start_offset = swp_offset(entry) & ~(nr_pages - 1);
+ entry = swp_entry(swp_type(entry), start_offset);
+ }
+
if (mem_cgroup_swapin_charge_folio(folio,
vma->vm_mm, GFP_KERNEL,
entry)) {
@@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+
+ start_address = vmf->address;
+ start_pte = vmf->pte;
+ if (folio_test_large(folio)) {
+ unsigned long nr = folio_nr_pages(folio);
+ unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+ pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
+
+ /*
+ * case 1: we are allocating large_folio, try to map it as a whole
+ * iff the swap entries are still entirely mapped;
+ * case 2: we hit a large folio in swapcache, and all swap entries
+ * are still entirely mapped, try to map a large folio as a whole.
+ * otherwise, map only the faulting page within the large folio
+ * which is swapcache
+ */
+ if (pte_range_swap(pte_t, nr)) {
+ start_address = addr;
+ start_pte = pte_t;
+ if (unlikely(folio == swapcache)) {
+ /*
+ * the below has been done before swap_read_folio()
+ * for case 1
+ */
+ nr_pages = nr;
+ entry = pte_to_swp_entry(ptep_get(start_pte));
+ page = &folio->page;
+ }
+ } else if (nr_pages > 1) { /* ptes have changed for case 1 */
+ goto out_nomap;
+ }
+ }
+
if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
goto out_nomap;

@@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* We're already holding a reference on the page but haven't mapped it
* yet.
*/
- swap_free(entry);
+ swap_nr_free(entry, nr_pages);
if (should_try_to_free_swap(folio, vma, vmf->flags))
folio_free_swap(folio);

- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+ folio_ref_add(folio, nr_pages - 1);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+
pte = mk_pte(page, vma->vm_page_prot);

/*
@@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* exclusivity.
*/
if (!folio_test_ksm(folio) &&
- (exclusive || folio_ref_count(folio) == 1)) {
+ (exclusive || folio_ref_count(folio) == nr_pages)) {
if (vmf->flags & FAULT_FLAG_WRITE) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
}
rmap_flags |= RMAP_EXCLUSIVE;
}
- flush_icache_page(vma, page);
+ flush_icache_pages(vma, page, nr_pages);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
if (pte_swp_uffd_wp(vmf->orig_pte))
@@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_add_new_anon_rmap(folio, vma, vmf->address);
folio_add_lru_vma(folio, vma);
} else {
- folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
+ folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
rmap_flags);
}

VM_BUG_ON(!folio_test_anon(folio) ||
(pte_write(pte) && !PageAnonExclusive(page)));
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
- arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+ set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
+
+ arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);

folio_unlock(folio);
if (folio != swapcache && swapcache) {
@@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

if (vmf->flags & FAULT_FLAG_WRITE) {
+ if (folio_test_large(folio) && nr_pages > 1)
+ vmf->orig_pte = ptep_get(vmf->pte);
+
ret |= do_wp_page(vmf);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
@@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

/* No need to invalidate - it was non-present before */
- update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
return true;
}

-static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+static struct folio *alloc_anon_folio(struct vm_fault *vmf,
+ bool (*pte_range_check)(pte_t *, int))
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct vm_area_struct *vma = vmf->vma;
@@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
order = highest_order(orders);
while (orders) {
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
- if (pte_range_none(pte + pte_index(addr), 1 << order))
+ if (pte_range_check(pte + pte_index(addr), 1 << order))
break;
order = next_order(&orders, order);
}
@@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (unlikely(anon_vma_prepare(vma)))
goto oom;
/* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
- folio = alloc_anon_folio(vmf);
+ folio = alloc_anon_folio(vmf, pte_range_none);
if (IS_ERR(folio))
return 0;
if (!folio)
--
2.34.1


2024-01-18 11:15:02

by Barry Song

Subject: [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE

From: Barry Song <[email protected]>

Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
THP_SWAP on ARM64, but it doesn't enable THP_SWAP on hardware with
MTE, as the MTE code works under the assumption that tag save/restore
always handles a folio with only one page.

The limitation should be removed as more and more ARM64 SoCs have
this feature. Co-existence of MTE and THP_SWAP becomes more and
more important.

This patch makes MTE tag saving support large folios, so we no longer
need to split large folios into base pages for swapping out on ARM64
SoCs with MTE.

arch_prepare_to_swap() should take a folio rather than a page as its
parameter because we support THP swap-out as a whole. It saves tags for
all pages in a large folio.

As we are now restoring tags based on the folio, arch_swap_restore()
may incur some extra loops and early exits while refaulting a large
folio which is still in the swapcache in do_swap_page(). If a large
folio has nr pages, do_swap_page() will only set the PTE of the
particular page which is causing the page fault. Thus do_swap_page()
runs nr times, and each time arch_swap_restore() loops nr times over
the subpages in the folio. So right now the algorithmic complexity
becomes O(nr^2).
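
For example, with 4KiB base pages, refaulting a 64KiB folio (nr = 16) that is
still in the swapcache enters do_swap_page() 16 times, and each of those calls
walks all 16 subpages in arch_swap_restore(), i.e. 16 * 16 = 256 tag-restore
checks instead of 16.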

Once we support mapping large folios in do_swap_page(), the extra loops
and early exits will decrease, though they will not be completely
removed, as a large folio might be partially tagged in corner cases such
as:
1. a large folio in the swapcache can be partially unmapped; thus, MTE
tags for the unmapped pages will be invalidated;
2. users might use mprotect() to set MTE on only a part of a large folio.

arch_thp_swp_supported() is dropped since ARM64 MTE was the only user
that needed it.

Reviewed-by: Steven Price <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 21 +++-------------
arch/arm64/mm/mteswap.c | 42 ++++++++++++++++++++++++++++++++
include/linux/huge_mm.h | 12 ---------
include/linux/pgtable.h | 2 +-
mm/page_io.c | 2 +-
mm/swap_slots.c | 2 +-
6 files changed, 49 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 79ce70fbb751..9902395ca426 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -45,12 +45,6 @@
__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

-static inline bool arch_thp_swp_supported(void)
-{
- return !system_supports_mte();
-}
-#define arch_thp_swp_supported arch_thp_swp_supported
-
/*
* Outside of a few very special situations (e.g. hibernation), we always
* use broadcast TLB invalidation instructions, therefore a spurious page
@@ -1042,12 +1036,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
#ifdef CONFIG_ARM64_MTE

#define __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
-{
- if (system_supports_mte())
- return mte_save_tags(page);
- return 0;
-}
+#define arch_prepare_to_swap arch_prepare_to_swap
+extern int arch_prepare_to_swap(struct folio *folio);

#define __HAVE_ARCH_SWAP_INVALIDATE
static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
@@ -1063,11 +1053,8 @@ static inline void arch_swap_invalidate_area(int type)
}

#define __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
-{
- if (system_supports_mte())
- mte_restore_tags(entry, &folio->page);
-}
+#define arch_swap_restore arch_swap_restore
+extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);

#endif /* CONFIG_ARM64_MTE */

diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..b9ca1b35902f 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
mte_free_tag_storage(tags);
}

+static inline void __mte_invalidate_tags(struct page *page)
+{
+ swp_entry_t entry = page_swap_entry(page);
+
+ mte_invalidate_tags(swp_type(entry), swp_offset(entry));
+}
+
void mte_invalidate_tags_area(int type)
{
swp_entry_t entry = swp_entry(type, 0);
@@ -83,3 +90,38 @@ void mte_invalidate_tags_area(int type)
}
xa_unlock(&mte_pages);
}
+
+int arch_prepare_to_swap(struct folio *folio)
+{
+ int err;
+ long i;
+
+ if (system_supports_mte()) {
+ long nr = folio_nr_pages(folio);
+
+ for (i = 0; i < nr; i++) {
+ err = mte_save_tags(folio_page(folio, i));
+ if (err)
+ goto out;
+ }
+ }
+ return 0;
+
+out:
+ while (i--)
+ __mte_invalidate_tags(folio_page(folio, i));
+ return err;
+}
+
+void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+{
+ if (system_supports_mte()) {
+ long i, nr = folio_nr_pages(folio);
+
+ entry.val -= swp_offset(entry) & (nr - 1);
+ for (i = 0; i < nr; i++) {
+ mte_restore_tags(entry, folio_page(folio, i));
+ entry.val++;
+ }
+ }
+}
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..67219d2309dd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -530,16 +530,4 @@ static inline int split_folio(struct folio *folio)
return split_folio_to_list(folio, NULL);
}

-/*
- * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
- * limitations in the implementation like arm64 MTE can override this to
- * false
- */
-#ifndef arch_thp_swp_supported
-static inline bool arch_thp_swp_supported(void)
-{
- return true;
-}
-#endif
-
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f6d0e3513948..37fe83b0c358 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -925,7 +925,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
* prototypes must be defined in the arch-specific asm/pgtable.h file.
*/
#ifndef __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
+static inline int arch_prepare_to_swap(struct folio *folio)
{
return 0;
}
diff --git a/mm/page_io.c b/mm/page_io.c
index ae2b49055e43..a9a7c236aecc 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
* Arch code may have to preserve more data than just the page
* contents, e.g. memory tags.
*/
- ret = arch_prepare_to_swap(&folio->page);
+ ret = arch_prepare_to_swap(folio);
if (ret) {
folio_mark_dirty(folio);
folio_unlock(folio);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..2325adbb1f19 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
entry.val = 0;

if (folio_test_large(folio)) {
- if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
+ if (IS_ENABLED(CONFIG_THP_SWAP))
get_swap_pages(1, &entry, folio_nr_pages(folio));
goto out;
}
--
2.34.1


2024-01-18 11:15:25

by Barry Song

Subject: [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio

From: Chuanhua Han <[email protected]>

should_try_to_free_swap() works under the assumption that swap-in is always
done at normal page granularity, i.e. folio_nr_pages() == 1. To support large
folio swap-in, this patch removes that assumption: since the swapcache holds
one reference per page of the folio, the expected reference count for an
exclusive folio becomes 1 + folio_nr_pages() rather than 2.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e1f4849463a..f61a48929ba7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3714,7 +3714,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
* reference only in case it's likely that we'll be the exlusive user.
*/
return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == 2;
+ folio_ref_count(folio) == (1 + folio_nr_pages(folio));
}

static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
--
2.34.1


2024-01-18 11:15:47

by Barry Song

Subject: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

From: Barry Song <[email protected]>

In do_swap_page(), while supporting large folio swap-in, we are using the helper
folio_add_anon_rmap_ptes(). This is triggering a WARN_ON in
__folio_add_anon_rmap(). We can make the warning quiet in two ways:
1. in do_swap_page(), call folio_add_new_anon_rmap() if we are sure the large
folio is a newly allocated one, and call folio_add_anon_rmap_ptes() if we find
the large folio in the swapcache;
2. always call folio_add_anon_rmap_ptes() in do_swap_page() but weaken the
WARN_ON in __folio_add_anon_rmap() by making it less sensitive.

Option 2 seems to be better for do_swap_page() as it can use unified code for
all cases.

Signed-off-by: Barry Song <[email protected]>
Tested-by: Chuanhua Han <[email protected]>
---
mm/rmap.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index f5d43edad529..469fcfd32317 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
* page.
*/
VM_WARN_ON_FOLIO(folio_test_large(folio) &&
- level != RMAP_LEVEL_PMD, folio);
+ level != RMAP_LEVEL_PMD &&
+ (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
+ (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
+ page != &folio->page), folio);
__folio_set_anon(folio, vma, address,
!!(flags & RMAP_EXCLUSIVE));
} else if (likely(!folio_test_ksm(folio))) {
--
2.34.1


2024-01-18 11:15:59

by Barry Song

Subject: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

From: Chuanhua Han <[email protected]>

MADV_PAGEOUT and MADV_FREE are common cases on Android. Ryan's patchset already
supports swapping large folios out as a whole for the vmscan case. This patch
extends the feature to madvise.

If the madvised range covers the whole large folio, we don't split it.
Otherwise, we still need to split it.

This patch doesn't depend on ARM64's CONT-PTE; instead, it defines a helper
named pte_range_cont_mapped() to check whether all PTEs are contiguously mapped
to a large folio.
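
As an illustration of the "covers the whole large folio" check (userspace
sketch only; the function below mirrors the pte_nr_addr_end() macro from this
patch, and the addresses are toy values):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* same computation as the pte_nr_addr_end() macro in this patch */
static unsigned long pte_nr_addr_end(unsigned long addr, unsigned long size,
                                     unsigned long end)
{
    unsigned long boundary = (addr + size) & ~(size - 1);

    return boundary - 1 < end - 1 ? boundary : end;
}

int main(void)
{
    unsigned long folio_size = 16 * PAGE_SIZE;  /* a 64KiB folio */
    unsigned long addr = 16 * PAGE_SIZE;        /* folio-aligned start address */
    unsigned long next;

    /* madvised range extends past the folio -> next - addr == folio_size, no split */
    next = pte_nr_addr_end(addr, folio_size, addr + 32 * PAGE_SIZE);
    printf("full cover: next - addr = %lu pages\n", (next - addr) / PAGE_SIZE);

    /* madvised range ends mid-folio -> next - addr != folio_size, must split */
    next = pte_nr_addr_end(addr, folio_size, addr + 4 * PAGE_SIZE);
    printf("partial:    next - addr = %lu pages\n", (next - addr) / PAGE_SIZE);
    return 0;
}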

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/asm-generic/tlb.h | 10 +++++++
include/linux/pgtable.h | 60 +++++++++++++++++++++++++++++++++++++++
mm/madvise.c | 48 +++++++++++++++++++++++++++++++
3 files changed, 118 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..f894e22da5d6 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)

+#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr) \
+ do { \
+ int i; \
+ tlb_flush_pte_range(tlb, address, \
+ PAGE_SIZE * nr); \
+ for (i = 0; i < nr; i++) \
+ __tlb_remove_tlb_entry(tlb, ptep + i, \
+ address + i * PAGE_SIZE); \
+ } while (0)
+
#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
do { \
unsigned long _sz = huge_page_size(h); \
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 37fe83b0c358..da0c1cf447e3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
}
#endif

+#ifndef pte_range_cont_mapped
+static inline bool pte_range_cont_mapped(unsigned long start_pfn,
+ pte_t *start_pte,
+ unsigned long start_addr,
+ int nr)
+{
+ int i;
+ pte_t pte_val;
+
+ for (i = 0; i < nr; i++) {
+ pte_val = ptep_get(start_pte + i);
+
+ if (pte_none(pte_val))
+ return false;
+
+ if (pte_pfn(pte_val) != (start_pfn + i))
+ return false;
+ }
+
+ return true;
+}
+#endif
+
+#ifndef pte_range_young
+static inline bool pte_range_young(pte_t *start_pte, int nr)
+{
+ int i;
+
+ for (i = 0; i < nr; i++)
+ if (pte_young(ptep_get(start_pte + i)))
+ return true;
+
+ return false;
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
@@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
}
#endif

+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
+static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
+ unsigned long start_addr,
+ pte_t *start_pte,
+ int nr, int full)
+{
+ int i;
+ pte_t pte;
+
+ pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
+
+ for (i = 1; i < nr; i++)
+ ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
+ start_pte + i, full);
+
+ return pte;
+}

/*
* If two threads concurrently fault at the same page, the thread that
@@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
})
#endif

+#ifndef pte_nr_addr_end
+#define pte_nr_addr_end(addr, size, end) \
+({ unsigned long __boundary = ((addr) + size) & (~(size - 1)); \
+ (__boundary - 1 < (end) - 1)? __boundary: (end); \
+})
+#endif
+
/*
* When walking page tables, we usually want to skip any p?d_none entries;
* and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..262460ac4b2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (folio_test_large(folio)) {
int err;

+ if (!folio_test_pmd_mappable(folio)) {
+ int nr_pages = folio_nr_pages(folio);
+ unsigned long folio_size = PAGE_SIZE * nr_pages;
+ unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);
+ unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
+ pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
+ unsigned long next = pte_nr_addr_end(addr, folio_size, end);
+
+ if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
+ goto split;
+
+ if (next - addr != folio_size) {
+ goto split;
+ } else {
+ /* Do not interfere with other mappings of this page */
+ if (folio_estimated_sharers(folio) != 1)
+ goto skip;
+
+ VM_BUG_ON(addr != start_addr || pte != start_pte);
+
+ if (pte_range_young(start_pte, nr_pages)) {
+ ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
+ nr_pages, tlb->fullmm);
+ ptent = pte_mkold(ptent);
+
+ set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
+ tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
+ }
+
+ folio_clear_referenced(folio);
+ folio_test_clear_young(folio);
+ if (pageout) {
+ if (folio_isolate_lru(folio)) {
+ if (folio_test_unevictable(folio))
+ folio_putback_lru(folio);
+ else
+ list_add(&folio->lru, &folio_list);
+ }
+ } else
+ folio_deactivate(folio);
+ }
+skip:
+ pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
+ addr = next - PAGE_SIZE;
+ continue;
+
+ }
+split:
if (folio_estimated_sharers(folio) != 1)
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
--
2.34.1


2024-01-18 11:54:32

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On 18.01.24 12:10, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> In do_swap_page(), while supporting large folio swap-in, we are using the helper
> folio_add_anon_rmap_ptes. This is triggering a WARN_ON in __folio_add_anon_rmap.
> We can quiet the warning in two ways:
> 1. in do_swap_page, we call folio_add_new_anon_rmap() if we are sure the large
> folio is a newly allocated one; we call folio_add_anon_rmap_ptes() if we find the
> large folio in the swapcache.
> 2. we always call folio_add_anon_rmap_ptes() in do_swap_page but weaken the
> WARN_ON in __folio_add_anon_rmap() by making it less sensitive.
>
> Option 2 seems to be better for do_swap_page() as it can use unified code for
> all cases.
>
> Signed-off-by: Barry Song <[email protected]>
> Tested-by: Chuanhua Han <[email protected]>
> ---
> mm/rmap.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f5d43edad529..469fcfd32317 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> * page.
> */
> VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> - level != RMAP_LEVEL_PMD, folio);
> + level != RMAP_LEVEL_PMD &&
> + (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
> + (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
> + page != &folio->page), folio);
> __folio_set_anon(folio, vma, address,
> !!(flags & RMAP_EXCLUSIVE));
> } else if (likely(!folio_test_ksm(folio))) {


I have on my todo list to move all that !anon handling out of
folio_add_anon_rmap_ptes(), and instead make the swapin code call
folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
then (-> whole new folio exclusive).

That's the cleaner approach.

--
Cheers,

David / dhildenb


2024-01-18 15:29:22

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 0/6] mm: support large folios swap-in

On 18/01/2024 11:10, Barry Song wrote:
> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> before swap-out, if some memory were normal pages, but when swapping in, we
> can also swap-in them as large folios.

I think this could also violate MADV_NOHUGEPAGE; if the application has
requested that we do not create a THP, then we had better not; it could cause a
correctness issue in some circumstances. You would need to pay attention to this
vma flag if taking this approach.

> But this might require I/O happen at
> some random places in swap devices. So we limit the large folios swap-in to
> those areas which were large folios before swapping-out, aka, swaps are also
> contiguous in hardware.

In fact, even this may not be sufficient; it's possible that a contiguous set of
base pages (small folios) were allocated to a virtual mapping and all swapped
out together - they would likely end up contiguous in the swap file, but should
not be swapped back in as a single folio because of this (same reasoning applies
to cluster of smaller THPs that you mistake for a larger THP, etc).

So you will need to check what THP sizes are enabled and check the VMA
suitability regardless; Perhaps you are already doing this - I haven't looked at
the code yet.

I'll aim to review the code in the next couple of weeks.

Thanks,
Ryan

> On the other hand, in OPPO's product, we've deployed
> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> compress and decompress large folios as a whole, which help improve compression
> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> save large objects whose original size are 64KiB for example. So it is also a
> better choice for us to only swap-in large folios for those compressed large
> objects as a large folio can be decompressed all together.
>
> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> with MTE" to this series as it might help review.
>
> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> https://lore.kernel.org/linux-mm/[email protected]/
> [2] OnePlusOSS / android_kernel_oneplus_sm8550
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>
> Barry Song (2):
> arm64: mm: swap: support THP_SWAP on hardware with MTE
> mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
>
> Chuanhua Han (4):
> mm: swap: introduce swap_nr_free() for batched swap_free()
> mm: swap: make should_try_to_free_swap() support large-folio
> mm: support large folios swapin as a whole
> mm: madvise: don't split mTHP for MADV_PAGEOUT
>
> arch/arm64/include/asm/pgtable.h | 21 ++----
> arch/arm64/mm/mteswap.c | 42 ++++++++++++
> include/asm-generic/tlb.h | 10 +++
> include/linux/huge_mm.h | 12 ----
> include/linux/pgtable.h | 62 ++++++++++++++++-
> include/linux/swap.h | 6 ++
> mm/madvise.c | 48 ++++++++++++++
> mm/memory.c | 110 ++++++++++++++++++++++++++-----
> mm/page_io.c | 2 +-
> mm/rmap.c | 5 +-
> mm/swap_slots.c | 2 +-
> mm/swapfile.c | 29 ++++++++
> 12 files changed, 301 insertions(+), 48 deletions(-)
>


2024-01-18 23:54:33

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 0/6] mm: support large folios swap-in

On Thu, Jan 18, 2024 at 11:25 PM Ryan Roberts <[email protected]> wrote:
>
> On 18/01/2024 11:10, Barry Song wrote:
> > On an embedded system like Android, more than half of anon memory is actually
> > in swap devices such as zRAM. For example, while an app is switched to back-
> > ground, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> > before swap-out, if some memory were normal pages, but when swapping in, we
> > can also swap-in them as large folios.
>
> I think this could also violate MADV_NOHUGEPAGE; if the application has
> requested that we do not create a THP, then we had better not; it could cause a
> correctness issue in some circumstances. You would need to pay attention to this
> vma flag if taking this approach.
>
> > But this might require I/O happen at
> > some random places in swap devices. So we limit the large folios swap-in to
> > those areas which were large folios before swapping-out, aka, swaps are also
> > contiguous in hardware.
>
> In fact, even this may not be sufficient; it's possible that a contiguous set of
> base pages (small folios) were allocated to a virtual mapping and all swapped
> out together - they would likely end up contiguous in the swap file, but should
> not be swapped back in as a single folio because of this (same reasoning applies
> to cluster of smaller THPs that you mistake for a larger THP, etc).
>
> So you will need to check what THP sizes are enabled and check the VMA
> suitability regardless; Perhaps you are already doing this - I haven't looked at
> the code yet.

we are actually re-using your alloc_anon_folio() by adding a parameter
to make it
support both do_anon_page and do_swap_page,

-static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+static struct folio *alloc_anon_folio(struct vm_fault *vmf,
+ bool (*pte_range_check)(pte_t *, int))
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct vm_area_struct *vma = vmf->vma;
@@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct
vm_fault *vmf)
order = highest_order(orders);
while (orders) {
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
- if (pte_range_none(pte + pte_index(addr), 1 << order))
+ if (pte_range_check(pte + pte_index(addr), 1 << order))
break;
order = next_order(&orders, order);
}
@@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (unlikely(anon_vma_prepare(vma)))
goto oom;
/* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
- folio = alloc_anon_folio(vmf);
+ folio = alloc_anon_folio(vmf, pte_range_none);
if (IS_ERR(folio))
return 0;
if (!folio)
--

I assume this has checked everything?

>
> I'll aim to review the code in the next couple of weeks.

nice, thanks!

>
> Thanks,
> Ryan
>
> > On the other hand, in OPPO's product, we've deployed
> > anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> > compress and decompress large folios as a whole, which help improve compression
> > ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> > save large objects whose original size are 64KiB for example. So it is also a
> > better choice for us to only swap-in large folios for those compressed large
> > objects as a large folio can be decompressed all together.
> >
> > Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> > with MTE" to this series as it might help review.
> >
> > [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> > https://lore.kernel.org/linux-mm/[email protected]/
> > [2] OnePlusOSS / android_kernel_oneplus_sm8550
> > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >
> > Barry Song (2):
> > arm64: mm: swap: support THP_SWAP on hardware with MTE
> > mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
> >
> > Chuanhua Han (4):
> > mm: swap: introduce swap_nr_free() for batched swap_free()
> > mm: swap: make should_try_to_free_swap() support large-folio
> > mm: support large folios swapin as a whole
> > mm: madvise: don't split mTHP for MADV_PAGEOUT
> >
> > arch/arm64/include/asm/pgtable.h | 21 ++----
> > arch/arm64/mm/mteswap.c | 42 ++++++++++++
> > include/asm-generic/tlb.h | 10 +++
> > include/linux/huge_mm.h | 12 ----
> > include/linux/pgtable.h | 62 ++++++++++++++++-
> > include/linux/swap.h | 6 ++
> > mm/madvise.c | 48 ++++++++++++++
> > mm/memory.c | 110 ++++++++++++++++++++++++++-----
> > mm/page_io.c | 2 +-
> > mm/rmap.c | 5 +-
> > mm/swap_slots.c | 2 +-
> > mm/swapfile.c | 29 ++++++++
> > 12 files changed, 301 insertions(+), 48 deletions(-)
> >
>

Thanks
Barry

2024-01-19 13:26:52

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 0/6] mm: support large folios swap-in

On 18/01/2024 23:54, Barry Song wrote:
> On Thu, Jan 18, 2024 at 11:25 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 18/01/2024 11:10, Barry Song wrote:
>>> On an embedded system like Android, more than half of anon memory is actually
>>> in swap devices such as zRAM. For example, while an app is switched to back-
>>> ground, its most memory might be swapped-out.
>>>
>>> Now we have mTHP features, unfortunately, if we don't support large folios
>>> swap-in, once those large folios are swapped-out, we immediately lose the
>>> performance gain we can get through large folios and hardware optimization
>>> such as CONT-PTE.
>>>
>>> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
>>> before swap-out, if some memory were normal pages, but when swapping in, we
>>> can also swap-in them as large folios.
>>
>> I think this could also violate MADV_NOHUGEPAGE; if the application has
>> requested that we do not create a THP, then we had better not; it could cause a
>> correctness issue in some circumstances. You would need to pay attention to this
>> vma flag if taking this approach.
>>
>>> But this might require I/O happen at
>>> some random places in swap devices. So we limit the large folios swap-in to
>>> those areas which were large folios before swapping-out, aka, swaps are also
>>> contiguous in hardware.
>>
>> In fact, even this may not be sufficient; it's possible that a contiguous set of
>> base pages (small folios) were allocated to a virtual mapping and all swapped
>> out together - they would likely end up contiguous in the swap file, but should
>> not be swapped back in as a single folio because of this (same reasoning applies
>> to cluster of smaller THPs that you mistake for a larger THP, etc).
>>
>> So you will need to check what THP sizes are enabled and check the VMA
>> suitability regardless; Perhaps you are already doing this - I haven't looked at
>> the code yet.
>
> we are actually re-using your alloc_anon_folio() by adding a parameter
> to make it
> support both do_anon_page and do_swap_page,
>
> -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> + bool (*pte_range_check)(pte_t *, int))
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> struct vm_area_struct *vma = vmf->vma;
> @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct
> vm_fault *vmf)
> order = highest_order(orders);
> while (orders) {
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> - if (pte_range_none(pte + pte_index(addr), 1 << order))
> + if (pte_range_check(pte + pte_index(addr), 1 << order))
> break;
> order = next_order(&orders, order);
> }
> @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> - folio = alloc_anon_folio(vmf);
> + folio = alloc_anon_folio(vmf, pte_range_none);
> if (IS_ERR(folio))
> return 0;
> if (!folio)
> --
>
> I assume this has checked everything?

Ahh yes, very good. In that case you can disregard what I said; it's already covered.

I notice that this series appears as a reply to my series. I'm not sure what the
normal convention is, but I expect more people would see it if you posted it as
its own thread?


>
>>
>> I'll aim to review the code in the next couple of weeks.
>
> nice, thanks!
>
>>
>> Thanks,
>> Ryan
>>
>>> On the other hand, in OPPO's product, we've deployed
>>> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
>>> compress and decompress large folios as a whole, which help improve compression
>>> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
>>> save large objects whose original size are 64KiB for example. So it is also a
>>> better choice for us to only swap-in large folios for those compressed large
>>> objects as a large folio can be decompressed all together.
>>>
>>> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
>>> with MTE" to this series as it might help review.
>>>
>>> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>> [2] OnePlusOSS / android_kernel_oneplus_sm8550
>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>>>
>>> Barry Song (2):
>>> arm64: mm: swap: support THP_SWAP on hardware with MTE
>>> mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
>>>
>>> Chuanhua Han (4):
>>> mm: swap: introduce swap_nr_free() for batched swap_free()
>>> mm: swap: make should_try_to_free_swap() support large-folio
>>> mm: support large folios swapin as a whole
>>> mm: madvise: don't split mTHP for MADV_PAGEOUT
>>>
>>> arch/arm64/include/asm/pgtable.h | 21 ++----
>>> arch/arm64/mm/mteswap.c | 42 ++++++++++++
>>> include/asm-generic/tlb.h | 10 +++
>>> include/linux/huge_mm.h | 12 ----
>>> include/linux/pgtable.h | 62 ++++++++++++++++-
>>> include/linux/swap.h | 6 ++
>>> mm/madvise.c | 48 ++++++++++++++
>>> mm/memory.c | 110 ++++++++++++++++++++++++++-----
>>> mm/page_io.c | 2 +-
>>> mm/rmap.c | 5 +-
>>> mm/swap_slots.c | 2 +-
>>> mm/swapfile.c | 29 ++++++++
>>> 12 files changed, 301 insertions(+), 48 deletions(-)
>>>
>>
>
> Thanks
> Barry


2024-01-23 06:49:31

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On Thu, Jan 18, 2024 at 7:54 PM David Hildenbrand <[email protected]> wrote:
>
> On 18.01.24 12:10, Barry Song wrote:
> > From: Barry Song <[email protected]>
> >
> > In do_swap_page(), while supporting large folio swap-in, we are using the helper
> > folio_add_anon_rmap_ptes. This is triggering a WARN_ON in __folio_add_anon_rmap.
> > We can quiet the warning in two ways:
> > 1. in do_swap_page, we call folio_add_new_anon_rmap() if we are sure the large
> > folio is a newly allocated one; we call folio_add_anon_rmap_ptes() if we find the
> > large folio in the swapcache.
> > 2. we always call folio_add_anon_rmap_ptes() in do_swap_page but weaken the
> > WARN_ON in __folio_add_anon_rmap() by making it less sensitive.
> >
> > Option 2 seems to be better for do_swap_page() as it can use unified code for
> > all cases.
> >
> > Signed-off-by: Barry Song <[email protected]>
> > Tested-by: Chuanhua Han <[email protected]>
> > ---
> > mm/rmap.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index f5d43edad529..469fcfd32317 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> > * page.
> > */
> > VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> > - level != RMAP_LEVEL_PMD, folio);
> > + level != RMAP_LEVEL_PMD &&
> > + (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
> > + (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
> > + page != &folio->page), folio);
> > __folio_set_anon(folio, vma, address,
> > !!(flags & RMAP_EXCLUSIVE));
> > } else if (likely(!folio_test_ksm(folio))) {
>
>
> I have on my todo list to move all that !anon handling out of
> folio_add_anon_rmap_ptes(), and instead make the swapin code call
> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> then (-> whole new folio exclusive).
>
> That's the cleaner approach.
>

One tricky thing is that sometimes it is hard to know who is the first one
to add the rmap and thus should call folio_add_new_anon_rmap().
Especially when we want to support swapin_readahead(), the one who allocated
the large folio might not be the one who first does the rmap.
Is it an acceptable way to do the below in do_swap_page?

        if (!folio_test_anon(folio))
                folio_add_new_anon_rmap()
        else
                folio_add_anon_rmap_ptes()
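
i.e. with the argument names used in this series, roughly (just a sketch,
not tested, and the exclusive handling would be whatever your rework ends
up requiring):

        if (!folio_test_anon(folio))
                /* first one to map this (possibly readahead) large folio */
                folio_add_new_anon_rmap(folio, vma, start_address);
        else
                /* already anon, e.g. found in the swapcache */
                folio_add_anon_rmap_ptes(folio, page, nr_pages, vma,
                                         start_address, rmap_flags);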

> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

2024-01-26 23:16:53

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE

On Thu, Jan 18, 2024 at 3:11 AM Barry Song <[email protected]> wrote:
>
> From: Barry Song <[email protected]>
>
> Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> MTE as the MTE code works with the assumption tags save/restore is
> always handling a folio with only one page.
>
> The limitation should be removed as more and more ARM64 SoCs have
> this feature. Co-existence of MTE and THP_SWAP becomes more and
> more important.
>
> This patch makes MTE tags saving support large folios, then we don't
> need to split large folios into base pages for swapping out on ARM64
> SoCs with MTE any more.
>
> arch_prepare_to_swap() should take folio rather than page as parameter
> because we support THP swap-out as a whole. It saves tags for all
> pages in a large folio.
>
> As now we are restoring tags based-on folio, in arch_swap_restore(),
> we may increase some extra loops and early-exitings while refaulting
> a large folio which is still in swapcache in do_swap_page(). In case
> a large folio has nr pages, do_swap_page() will only set the PTE of
> the particular page which is causing the page fault.
> Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> will loop nr times for those subpages in the folio. So right now the
> algorithmic complexity becomes O(nr^2).
>
> Once we support mapping large folios in do_swap_page(), extra loops
> and early-exitings will decrease while not being completely removed
> as a large folio might get partially tagged in corner cases such as,
> 1. a large folio in swapcache can be partially unmapped, thus, MTE
> tags for the unmapped pages will be invalidated;
> 2. users might use mprotect() to set MTEs on a part of a large folio.
>
> arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> who needed it.
>
> Reviewed-by: Steven Price <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 21 +++-------------
> arch/arm64/mm/mteswap.c | 42 ++++++++++++++++++++++++++++++++
> include/linux/huge_mm.h | 12 ---------
> include/linux/pgtable.h | 2 +-
> mm/page_io.c | 2 +-
> mm/swap_slots.c | 2 +-
> 6 files changed, 49 insertions(+), 32 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 79ce70fbb751..9902395ca426 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -45,12 +45,6 @@
> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -static inline bool arch_thp_swp_supported(void)
> -{
> - return !system_supports_mte();
> -}
> -#define arch_thp_swp_supported arch_thp_swp_supported
> -
> /*
> * Outside of a few very special situations (e.g. hibernation), we always
> * use broadcast TLB invalidation instructions, therefore a spurious page
> @@ -1042,12 +1036,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> #ifdef CONFIG_ARM64_MTE
>
> #define __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> -{
> - if (system_supports_mte())
> - return mte_save_tags(page);
> - return 0;
> -}
> +#define arch_prepare_to_swap arch_prepare_to_swap

This seems like a noop, defining "arch_prepare_to_swap" back to itself.
What am I missing?

I see. Answering my own question: I guess you want to allow someone to
override arch_prepare_to_swap.
Wouldn't testing against __HAVE_ARCH_PREPARE_TO_SWAP be enough to support that?

Maybe I need to understand better how you want others to extend this
code before making suggestions.
As it is, this looks strange.
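
For reference, the self-referential define is the usual idiom that lets a
generic header detect an arch override with #ifndef rather than a separate
__HAVE_ARCH_* symbol. A minimal sketch of that pattern (this is an assumption
about the intent; the patch as posted still keys the generic fallback off
__HAVE_ARCH_PREPARE_TO_SWAP):

        /* arch header */
        #define arch_prepare_to_swap arch_prepare_to_swap
        extern int arch_prepare_to_swap(struct folio *folio);

        /* generic header fallback, if it tested the define instead */
        #ifndef arch_prepare_to_swap
        static inline int arch_prepare_to_swap(struct folio *folio)
        {
                return 0;       /* nothing to do on archs without tags */
        }
        #endif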

> +extern int arch_prepare_to_swap(struct folio *folio);
>
> #define __HAVE_ARCH_SWAP_INVALIDATE
> static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> @@ -1063,11 +1053,8 @@ static inline void arch_swap_invalidate_area(int type)
> }
>
> #define __HAVE_ARCH_SWAP_RESTORE
> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> -{
> - if (system_supports_mte())
> - mte_restore_tags(entry, &folio->page);
> -}
> +#define arch_swap_restore arch_swap_restore

Same here.

> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>
> #endif /* CONFIG_ARM64_MTE */
>
> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> index a31833e3ddc5..b9ca1b35902f 100644
> --- a/arch/arm64/mm/mteswap.c
> +++ b/arch/arm64/mm/mteswap.c
> @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> mte_free_tag_storage(tags);
> }
>
> +static inline void __mte_invalidate_tags(struct page *page)
> +{
> + swp_entry_t entry = page_swap_entry(page);
> +
> + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> +}
> +
> void mte_invalidate_tags_area(int type)
> {
> swp_entry_t entry = swp_entry(type, 0);
> @@ -83,3 +90,38 @@ void mte_invalidate_tags_area(int type)
> }
> xa_unlock(&mte_pages);
> }
> +
> +int arch_prepare_to_swap(struct folio *folio)
> +{
> + int err;
> + long i;
> +
> + if (system_supports_mte()) {
Very minor nitpick.

You can do

        if (!system_supports_mte())
                return 0;

here, and then the for loop would have less indent. The function looks flatter.
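
Applied to arch_prepare_to_swap() from this hunk, that would look roughly
like the below (just a sketch of the suggestion, not a replacement patch):

        int arch_prepare_to_swap(struct folio *folio)
        {
                long i, nr;
                int err;

                if (!system_supports_mte())
                        return 0;

                nr = folio_nr_pages(folio);
                for (i = 0; i < nr; i++) {
                        err = mte_save_tags(folio_page(folio, i));
                        if (err)
                                goto out;
                }
                return 0;

        out:
                /* undo the tags already saved for earlier subpages */
                while (i--)
                        __mte_invalidate_tags(folio_page(folio, i));
                return err;
        }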

> + long nr = folio_nr_pages(folio);
> +
> + for (i = 0; i < nr; i++) {
> + err = mte_save_tags(folio_page(folio, i));
> + if (err)
> + goto out;
> + }
> + }
> + return 0;
> +
> +out:
> + while (i--)
> + __mte_invalidate_tags(folio_page(folio, i));
> + return err;
> +}
> +
> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> +{
> + if (system_supports_mte()) {

Same here.

Looks good otherwise. None of the nitpicks are deal breakers.

Acked-by: Chris Li <[email protected]>


Chris

> + long i, nr = folio_nr_pages(folio);
> +
> + entry.val -= swp_offset(entry) & (nr - 1);
> + for (i = 0; i < nr; i++) {
> + mte_restore_tags(entry, folio_page(folio, i));
> + entry.val++;
> + }
> + }
> +}
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..67219d2309dd 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -530,16 +530,4 @@ static inline int split_folio(struct folio *folio)
> return split_folio_to_list(folio, NULL);
> }
>
> -/*
> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> - * limitations in the implementation like arm64 MTE can override this to
> - * false
> - */
> -#ifndef arch_thp_swp_supported
> -static inline bool arch_thp_swp_supported(void)
> -{
> - return true;
> -}
> -#endif
> -
> #endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f6d0e3513948..37fe83b0c358 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -925,7 +925,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> * prototypes must be defined in the arch-specific asm/pgtable.h file.
> */
> #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> +static inline int arch_prepare_to_swap(struct folio *folio)
> {
> return 0;
> }
> diff --git a/mm/page_io.c b/mm/page_io.c
> index ae2b49055e43..a9a7c236aecc 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> * Arch code may have to preserve more data than just the page
> * contents, e.g. memory tags.
> */
> - ret = arch_prepare_to_swap(&folio->page);
> + ret = arch_prepare_to_swap(folio);
> if (ret) {
> folio_mark_dirty(folio);
> folio_unlock(folio);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..2325adbb1f19 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> entry.val = 0;
>
> if (folio_test_large(folio)) {
> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> + if (IS_ENABLED(CONFIG_THP_SWAP))
> get_swap_pages(1, &entry, folio_nr_pages(folio));
> goto out;
> }
> --
> 2.34.1
>
>

2024-01-26 23:23:27

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free()

On Thu, Jan 18, 2024 at 3:11 AM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/swap.h | 6 ++++++
> mm/swapfile.c | 29 +++++++++++++++++++++++++++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4db00ddad261..31a4ee2dcd1c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -478,6 +478,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> extern void swap_free(swp_entry_t);
> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> extern int free_swap_and_cache(swp_entry_t);
> int swap_type_of(dev_t device, sector_t offset);
> @@ -553,6 +554,11 @@ static inline void swap_free(swp_entry_t swp)
> {
> }
>
> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> +{
> +
> +}
> +
> static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> {
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 556ff7347d5f..6321bda96b77 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1335,6 +1335,35 @@ void swap_free(swp_entry_t entry)
> __swap_entry_free(p, entry);
> }
>
> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> +{
> + int i;
> + struct swap_cluster_info *ci;
> + struct swap_info_struct *p;
> + unsigned type = swp_type(entry);
> + unsigned long offset = swp_offset(entry);
> + DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> +
> + VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);

The BUG_ON here seems a bit too developer-oriented. Maybe warn once and
fall back to freeing one by one?

How big are your typical SWAPFILE_CLUSTER and nr_pages on arm?

I ask because if nr_pages > 64, that is a totally different game: we can
completely bypass the swap slot cache.

> +
> + if (nr_pages == 1) {
> + swap_free(entry);
> + return;
> + }
> +
> + p = _swap_info_get(entry);
> +
> + ci = lock_cluster(p, offset);
> + for (i = 0; i < nr_pages; i++) {
> + if (__swap_entry_free_locked(p, offset + i, 1))
> + __bitmap_set(usage, i, 1);
> + }
> + unlock_cluster(ci);
> +
> + for_each_clear_bit(i, usage, nr_pages)
> + free_swap_slot(swp_entry(type, offset + i));

Notice that free_swap_slot() internally has per-CPU cache batching as
well. Every free_swap_slot() call takes the per-cpu swap slot cache and
cache->lock, so there is double batching here.
If the typical batch size here is bigger than 64 entries, we can go
directly to a batched swap_entry_free() and avoid the free_swap_slot()
batching altogether. Unlike free_swap_slot_entries(), the swap slots here
are all from one swap device, so there is no need to sort and group the
swap slots by swap device.
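
Concretely, something along these lines (a rough sketch of the suggestion;
swap_entry_free() here is the existing swapfile.c-internal helper mentioned
above, and the exact locking it expects is glossed over):

        ci = lock_cluster(p, offset);
        for (i = 0; i < nr_pages; i++) {
                /* entry still referenced elsewhere? then keep it */
                if (__swap_entry_free_locked(p, offset + i, 1))
                        __bitmap_set(usage, i, 1);
        }
        unlock_cluster(ci);

        /*
         * Free the entries whose count dropped to zero directly, bypassing
         * the per-cpu slot cache that free_swap_slot() would take.
         */
        for_each_clear_bit(i, usage, nr_pages)
                swap_entry_free(p, swp_entry(type, offset + i));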

Chris

> +}
> +
> /*
> * Called after dropping swapcache to decrease refcnt to swap entries.
> */
> --
> 2.34.1
>
>

2024-01-26 23:34:38

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio

Acked-by: Chris Li <[email protected]>

Chris

On Thu, Jan 18, 2024 at 3:11 AM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> should_try_to_free_swap() works with an assumption that swap-in is always done
> at normal page granularity, aka, folio_nr_pages = 1. To support large folio
> swap-in, this patch removes the assumption.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e1f4849463a..f61a48929ba7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3714,7 +3714,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
> * reference only in case it's likely that we'll be the exlusive user.
> */
> return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> - folio_ref_count(folio) == 2;
> + folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> }
>
> static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> --
> 2.34.1
>
>

2024-01-27 14:27:42

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 0/6] mm: support large folios swap-in

On Fri, Jan 19, 2024 at 9:25 PM Ryan Roberts <[email protected]> wrote:
>
> On 18/01/2024 23:54, Barry Song wrote:
> > On Thu, Jan 18, 2024 at 11:25 PM Ryan Roberts <ryan.roberts@armcom> wrote:
> >>
> >> On 18/01/2024 11:10, Barry Song wrote:
> >>> On an embedded system like Android, more than half of anon memory is actually
> >>> in swap devices such as zRAM. For example, while an app is switched to back-
> >>> ground, its most memory might be swapped-out.
> >>>
> >>> Now we have mTHP features, unfortunately, if we don't support large folios
> >>> swap-in, once those large folios are swapped-out, we immediately lose the
> >>> performance gain we can get through large folios and hardware optimization
> >>> such as CONT-PTE.
> >>>
> >>> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> >>> before swap-out, if some memory were normal pages, but when swapping in, we
> >>> can also swap-in them as large folios.
> >>
> >> I think this could also violate MADV_NOHUGEPAGE; if the application has
> >> requested that we do not create a THP, then we had better not; it could cause a
> >> correctness issue in some circumstances. You would need to pay attention to this
> >> vma flag if taking this approach.
> >>
> >>> But this might require I/O happen at
> >>> some random places in swap devices. So we limit the large folios swap-in to
> >>> those areas which were large folios before swapping-out, aka, swaps are also
> >>> contiguous in hardware.
> >>
> >> In fact, even this may not be sufficient; it's possible that a contiguous set of
> >> base pages (small folios) were allocated to a virtual mapping and all swapped
> >> out together - they would likely end up contiguous in the swap file, but should
> >> not be swapped back in as a single folio because of this (same reasoning applies
> >> to cluster of smaller THPs that you mistake for a larger THP, etc).
> >>
> >> So you will need to check what THP sizes are enabled and check the VMA
> >> suitability regardless; Perhaps you are already doing this - I haven't looked at
> >> the code yet.
> >
> > we are actually re-using your alloc_anon_folio() by adding a parameter
> > to make it
> > support both do_anon_page and do_swap_page,
> >
> > -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > + bool (*pte_range_check)(pte_t *, int))
> > {
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > struct vm_area_struct *vma = vmf->vma;
> > @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct
> > vm_fault *vmf)
> > order = highest_order(orders);
> > while (orders) {
> > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > - if (pte_range_none(pte + pte_index(addr), 1 << order))
> > + if (pte_range_check(pte + pte_index(addr), 1 << order))
> > break;
> > order = next_order(&orders, order);
> > }
> > @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > if (unlikely(anon_vma_prepare(vma)))
> > goto oom;
> > /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> > - folio = alloc_anon_folio(vmf);
> > + folio = alloc_anon_folio(vmf, pte_range_none);
> > if (IS_ERR(folio))
> > return 0;
> > if (!folio)
> > --
> >
> > I assume this has checked everything?
>
> Ahh yes, very good. In that case you can disregard what I said; it's already covered.
>
> I notice that this series appears as a reply to my series. I'm not sure what the
> normal convention is, but I expect more people would see it if you posted it as
> its own thread?

Yes. I was replying to your series because we were using your series to
swap out large folios without splitting. In v2, I will send it as a new
thread.

>
>
> >
> >>
> >> I'll aim to review the code in the next couple of weeks.
> >
> > nice, thanks!
> >
> >>
> >> Thanks,
> >> Ryan
> >>
> >>> On the other hand, in OPPO's product, we've deployed
> >>> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> >>> compress and decompress large folios as a whole, which help improve compression
> >>> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> >>> save large objects whose original size are 64KiB for example. So it is also a
> >>> better choice for us to only swap-in large folios for those compressed large
> >>> objects as a large folio can be decompressed all together.
> >>>
> >>> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> >>> with MTE" to this series as it might help review.
> >>>
> >>> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> >>> https://lore.kernel.org/linux-mm/[email protected]/
> >>> [2] OnePlusOSS / android_kernel_oneplus_sm8550
> >>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >>>
> >>> Barry Song (2):
> >>> arm64: mm: swap: support THP_SWAP on hardware with MTE
> >>> mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
> >>>
> >>> Chuanhua Han (4):
> >>> mm: swap: introduce swap_nr_free() for batched swap_free()
> >>> mm: swap: make should_try_to_free_swap() support large-folio
> >>> mm: support large folios swapin as a whole
> >>> mm: madvise: don't split mTHP for MADV_PAGEOUT
> >>>
> >>> arch/arm64/include/asm/pgtable.h | 21 ++----
> >>> arch/arm64/mm/mteswap.c | 42 ++++++++++++
> >>> include/asm-generic/tlb.h | 10 +++
> >>> include/linux/huge_mm.h | 12 ----
> >>> include/linux/pgtable.h | 62 ++++++++++++++++-
> >>> include/linux/swap.h | 6 ++
> >>> mm/madvise.c | 48 ++++++++++++++
> >>> mm/memory.c | 110 ++++++++++++++++++++++++++-----
> >>> mm/page_io.c | 2 +-
> >>> mm/rmap.c | 5 +-
> >>> mm/swap_slots.c | 2 +-
> >>> mm/swapfile.c | 29 ++++++++
> >>> 12 files changed, 301 insertions(+), 48 deletions(-)
> >>>
> >>
> >
Thanks
Barry
>

2024-01-27 19:53:49

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> to those contiguous swaps which were likely swapped out from mTHP as a whole.
>
> On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
> case. It doesn't support swapin_readahead as large folios yet.
>
> Right now, we are re-faulting large folios which are still in swapcache as a
> whole, this can effectively decrease extra loops and early-exitings which we
> have increased in arch_swap_restore() while supporting MTE restore for folios
> rather than page.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 94 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f61a48929ba7..928b3f542932 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
> static vm_fault_t do_fault(struct vm_fault *vmf);
> static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
> static bool vmf_pte_changed(struct vm_fault *vmf);
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> + bool (*pte_range_check)(pte_t *, int));

Instead of returning "bool", the pte_range_check() can return the
start of the swap entry of the large folio.
That will save some of the later code needed to get the start of the
large folio.

>
> /*
> * Return true if the original pte was a uffd-wp pte marker (so the pte was
> @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> return VM_FAULT_SIGBUS;
> }
>
> +static bool pte_range_swap(pte_t *pte, int nr_pages)

This function name seems to suggest it will perform the range swap,
which is not what it is doing.
Suggest changing it to some other name reflecting that it is only a
condition test without any actual swap action.
I am not very good at naming functions; just thinking out loud, e.g.
pte_range_swap_check, pte_test_range_swap. You can come up with
something better.


> +{
> + int i;
> + swp_entry_t entry;
> + unsigned type;
> + pgoff_t start_offset;
> +
> + entry = pte_to_swp_entry(ptep_get_lockless(pte));
> + if (non_swap_entry(entry))
> + return false;
> + start_offset = swp_offset(entry);
> + if (start_offset % nr_pages)
> + return false;

This suggests the pte argument needs to point to the beginning of the
large-folio equivalent of the swap entry (not sure what to call it; let me
call it "large folio swap" here).
We might want to unify the terms for that.
Anyway, we might want to document this requirement, otherwise the caller
might consider passing the current pte that generates the fault. From
the function name it is not obvious which pte should be passed.

> +
> + type = swp_type(entry);
> + for (i = 1; i < nr_pages; i++) {

You might want to test the last page first, because if this range is not
a large folio swap, it is most likely the last entry that will be invalid.
Some of the beginning swap entries might match due to batch allocation
etc.; the SSD likes to group nearby swap entries' write-out together on
the disk.
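
E.g. once the first entry has been read and start_offset/type derived,
something like the below (illustrative only):

        /*
         * Check the last entry next: if this is not a large folio swap,
         * the last entry is the most likely one to mismatch, giving a
         * cheap early exit before walking the whole range.
         */
        entry = pte_to_swp_entry(ptep_get_lockless(pte + nr_pages - 1));
        if (non_swap_entry(entry) || swp_type(entry) != type ||
            swp_offset(entry) != start_offset + nr_pages - 1)
                return false;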



> + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));

> + if (non_swap_entry(entry))
> + return false;
> + if (swp_offset(entry) != start_offset + i)
> + return false;
> + if (swp_type(entry) != type)
> + return false;
> + }
> +
> + return true;
> +}
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long start_address;
> + pte_t *start_pte;
>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> /* skip swapcache */
> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> - vma, vmf->address, false);
> + folio = alloc_anon_folio(vmf, pte_range_swap);

This function can call pte_range_swap() twice: once here, and another time
in the folio_test_large() branch.
Consider caching the result so it does not need to walk the pte range
twice.

I think alloc_anon_folio() should either be told what the size is
(preferred) or just figure out the right size itself. I don't think it
needs the checking function passed in as a callback. There are only two
call sites of alloc_anon_folio(), both within this file. The callback
seems a bit overkill here, and it also duplicates the range swap walk.

> page = &folio->page;
> if (folio) {
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
>
> + if (folio_test_large(folio)) {
> + unsigned long start_offset;
> +
> + nr_pages = folio_nr_pages(folio);
> + start_offset = swp_offset(entry) & ~(nr_pages - 1);
Here is the first place we roll up the start offset to the folio size.

> + entry = swp_entry(swp_type(entry), start_offset);
> + }
> +
> if (mem_cgroup_swapin_charge_folio(folio,
> vma->vm_mm, GFP_KERNEL,
> entry)) {
> @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);
> +
> + start_address = vmf->address;
> + start_pte = vmf->pte;
> + if (folio_test_large(folio)) {
> + unsigned long nr = folio_nr_pages(folio);
> + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> + pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;

Here is the second place we roll up to the folio size.
Maybe we can cache results and avoid repetition?

> +
> + /*
> + * case 1: we are allocating large_folio, try to map it as a whole
> + * iff the swap entries are still entirely mapped;
> + * case 2: we hit a large folio in swapcache, and all swap entries
> + * are still entirely mapped, try to map a large folio as a whole.
> + * otherwise, map only the faulting page within the large folio
> + * which is swapcache
> + */

One question I have in mind: the swap device is locked, so we can't
change the swap slot allocations.
But that does not stop the pte entries from getting changed, right? Then
we can have someone in user space racing to change the PTEs while we are
checking them here.

> + if (pte_range_swap(pte_t, nr)) {

After this pte_range_swap() check, could some of the PTE entries get
changed, so that we no longer have the full large folio swap?
At least I can't yet conclude that this can't happen; please enlighten me.

> + start_address = addr;
> + start_pte = pte_t;
> + if (unlikely(folio == swapcache)) {
> + /*
> + * the below has been done before swap_read_folio()
> + * for case 1
> + */
> + nr_pages = nr;
> + entry = pte_to_swp_entry(ptep_get(start_pte));

If we make pte_range_swap() return the entry, we can avoid refetching
the swap entry here.
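
For example, with a hypothetical extra out-parameter (illustrative only):

        static bool pte_range_swap(pte_t *pte, int nr_pages, swp_entry_t *first)
        {
                swp_entry_t entry = pte_to_swp_entry(ptep_get_lockless(pte));
                pgoff_t start_offset;
                unsigned type;
                int i;

                if (non_swap_entry(entry))
                        return false;
                start_offset = swp_offset(entry);
                if (start_offset % nr_pages)
                        return false;

                type = swp_type(entry);
                for (i = 1; i < nr_pages; i++) {
                        swp_entry_t e = pte_to_swp_entry(ptep_get_lockless(pte + i));

                        if (non_swap_entry(e) || swp_type(e) != type ||
                            swp_offset(e) != start_offset + i)
                                return false;
                }

                *first = entry;         /* callers stop refetching/realigning it */
                return true;
        }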

> + page = &folio->page;
> + }
> + } else if (nr_pages > 1) { /* ptes have changed for case 1 */
> + goto out_nomap;
> + }
> + }
> +
I rewrote the above to make the code indentation match the execution flow.
There is no functional change; I just rearranged the code to be a bit more
streamlined and got rid of the "else if goto":
        if (!pte_range_swap(pte_t, nr)) {
                if (nr_pages > 1) /* ptes have changed for case 1 */
                        goto out_nomap;
                goto check_pte;
        }

        start_address = addr;
        start_pte = pte_t;
        if (unlikely(folio == swapcache)) {
                /*
                 * the below has been done before swap_read_folio()
                 * for case 1
                 */
                nr_pages = nr;
                entry = pte_to_swp_entry(ptep_get(start_pte));
                page = &folio->page;
        }
}

check_pte:

> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> @@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_nr_free(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || folio_ref_count(folio) == nr_pages)) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> @@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio_add_new_anon_rmap(folio, vma, vmf->address);
> folio_add_lru_vma(folio, vma);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> +
> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> if (vmf->flags & FAULT_FLAG_WRITE) {
> + if (folio_test_large(folio) && nr_pages > 1)
> + vmf->orig_pte = ptep_get(vmf->pte);
> +
> ret |= do_wp_page(vmf);
> if (ret & VM_FAULT_ERROR)
> ret &= VM_FAULT_ERROR;
> @@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
> return true;
> }
>
> -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> + bool (*pte_range_check)(pte_t *, int))
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> struct vm_area_struct *vma = vmf->vma;
> @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)

About this patch context, we have the following comment in the source code:

        /*
         * Find the highest order where the aligned range is completely
         * pte_none(). Note that all remaining orders will be completely
         * pte_none().
         */
> order = highest_order(orders);
> while (orders) {
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> - if (pte_range_none(pte + pte_index(addr), 1 << order))
> + if (pte_range_check(pte + pte_index(addr), 1 << order))

Again, I don't think we need to pass in pte_range_check() as a callback
function.
There are only two call sites, both within this file, and the callback
totally invalidates the above comment about pte_none(). In the worst case,
just make it accept one extra argument saying whether it is checking for a
swap range or a none range, and check accordingly.
My gut feeling is that there should be a better way to make the range check
blend in with alloc_anon_folio(), e.g. maybe store some of the large swap
context in the vmf and pass it to the different places, etc. I need to
spend more time thinking about it to come up with a happier solution.
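
E.g. one minimal way to drop the callback while keeping both users
(hypothetical shape, illustrative only):

        static bool pte_range_check(pte_t *pte, int nr_pages, bool want_swap)
        {
                /*
                 * Either a fully pte_none() range (anonymous fault) or a
                 * fully contiguous swap range (swap-in), using the existing
                 * helpers.
                 */
                return want_swap ? pte_range_swap(pte, nr_pages) :
                                   pte_range_none(pte, nr_pages);
        }

and then alloc_anon_folio(vmf, true/false) instead of passing a function
pointer -- again, just to show the shape; the vmf-based idea above may well
turn out nicer.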

Chris

> break;
> order = next_order(&orders, order);
> }
> @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> - folio = alloc_anon_folio(vmf);
> + folio = alloc_anon_folio(vmf, pte_range_none);
> if (IS_ERR(folio))
> return 0;
> if (!folio)
> --
> 2.34.1
>
>

2024-01-27 20:06:37

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> to those contiguous swaps which were likely swapped out from mTHP as a whole.
>
> On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
> case. It doesn't support swapin_readahead as large folios yet.
>
> Right now, we are re-faulting large folios which are still in swapcache as a
> whole, this can effectively decrease extra loops and early-exitings which we
> have increased in arch_swap_restore() while supporting MTE restore for folios
> rather than page.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 94 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f61a48929ba7..928b3f542932 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
> static vm_fault_t do_fault(struct vm_fault *vmf);
> static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
> static bool vmf_pte_changed(struct vm_fault *vmf);
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> + bool (*pte_range_check)(pte_t *, int));
>
> /*
> * Return true if the original pte was a uffd-wp pte marker (so the pte was
> @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> return VM_FAULT_SIGBUS;
> }
>
> +static bool pte_range_swap(pte_t *pte, int nr_pages)
> +{
> + int i;
> + swp_entry_t entry;
> + unsigned type;
> + pgoff_t start_offset;
> +
> + entry = pte_to_swp_entry(ptep_get_lockless(pte));
> + if (non_swap_entry(entry))
> + return false;
> + start_offset = swp_offset(entry);
> + if (start_offset % nr_pages)
> + return false;
> +
> + type = swp_type(entry);
> + for (i = 1; i < nr_pages; i++) {
> + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> + if (non_swap_entry(entry))
> + return false;
> + if (swp_offset(entry) != start_offset + i)
> + return false;
> + if (swp_type(entry) != type)
> + return false;
> + }
> +
> + return true;
> +}
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long start_address;
> + pte_t *start_pte;
>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> /* skip swapcache */
> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> - vma, vmf->address, false);
> + folio = alloc_anon_folio(vmf, pte_range_swap);
> page = &folio->page;
> if (folio) {
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
>
> + if (folio_test_large(folio)) {
> + unsigned long start_offset;
> +
> + nr_pages = folio_nr_pages(folio);
> + start_offset = swp_offset(entry) & ~(nr_pages - 1);
> + entry = swp_entry(swp_type(entry), start_offset);
> + }
> +
> if (mem_cgroup_swapin_charge_folio(folio,
> vma->vm_mm, GFP_KERNEL,
> entry)) {
> @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);
> +
> + start_address = vmf->address;
> + start_pte = vmf->pte;
> + if (folio_test_large(folio)) {
> + unsigned long nr = folio_nr_pages(folio);
> + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> + pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;

I forgot one comment here.
Please change the variable name to something other than "pte_t"; it is a
bit strange to use the typedef name as a variable name here.

Chris

> +
> + /*
> + * case 1: we are allocating large_folio, try to map it as a whole
> + * iff the swap entries are still entirely mapped;
> + * case 2: we hit a large folio in swapcache, and all swap entries
> + * are still entirely mapped, try to map a large folio as a whole.
> + * otherwise, map only the faulting page within the large folio
> + * which is swapcache
> + */
> + if (pte_range_swap(pte_t, nr)) {
> + start_address = addr;
> + start_pte = pte_t;
> + if (unlikely(folio == swapcache)) {
> + /*
> + * the below has been done before swap_read_folio()
> + * for case 1
> + */
> + nr_pages = nr;
> + entry = pte_to_swp_entry(ptep_get(start_pte));
> + page = &folio->page;
> + }
> + } else if (nr_pages > 1) { /* ptes have changed for case 1 */
> + goto out_nomap;
> + }
> + }
> +
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> @@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_nr_free(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || folio_ref_count(folio) == nr_pages)) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> @@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio_add_new_anon_rmap(folio, vma, vmf->address);
> folio_add_lru_vma(folio, vma);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> +
> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> if (vmf->flags & FAULT_FLAG_WRITE) {
> + if (folio_test_large(folio) && nr_pages > 1)
> + vmf->orig_pte = ptep_get(vmf->pte);
> +
> ret |= do_wp_page(vmf);
> if (ret & VM_FAULT_ERROR)
> ret &= VM_FAULT_ERROR;
> @@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
> return true;
> }
>
> -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> + bool (*pte_range_check)(pte_t *, int))
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> struct vm_area_struct *vma = vmf->vma;
> @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> order = highest_order(orders);
> while (orders) {
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> - if (pte_range_none(pte + pte_index(addr), 1 << order))
> + if (pte_range_check(pte + pte_index(addr), 1 << order))
> break;
> order = next_order(&orders, order);
> }
> @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> - folio = alloc_anon_folio(vmf);
> + folio = alloc_anon_folio(vmf, pte_range_none);
> if (IS_ERR(folio))
> return 0;
> if (!folio)
> --
> 2.34.1
>
>

2024-01-27 23:42:10

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
>
> From: Barry Song <[email protected]>
>
> In do_swap_page(), while supporting large folio swap-in, we are using the helper
> folio_add_anon_rmap_ptes. This is triggering a WARN_ON in __folio_add_anon_rmap.
> We can quiet the warning in two ways:
> 1. in do_swap_page, we call folio_add_new_anon_rmap() if we are sure the large
> folio is a newly allocated one; we call folio_add_anon_rmap_ptes() if we find the
> large folio in swapcache.
> 2. we always call folio_add_anon_rmap_ptes() in do_swap_page but weaken the
> WARN_ON in __folio_add_anon_rmap() by making the WARN_ON less sensitive.
>
> Option 2 seems to be better for do_swap_page() as it can use unified code for
> all cases.
>
> Signed-off-by: Barry Song <[email protected]>
> Tested-by: Chuanhua Han <[email protected]>
> ---
> mm/rmap.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f5d43edad529..469fcfd32317 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> * page.
> */
> VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> - level != RMAP_LEVEL_PMD, folio);
> + level != RMAP_LEVEL_PMD &&
> + (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
A minor nitpick here.
There are two leading "(" on this line and the next. This is the first "("
> + (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
Second "(" here.

These two "(" are NOT at the same nested level. They should not have
the same indentation.
On my first glance, I misread the scope of the "||" due to the same
level indentation.
We can do one of the two
1) add more indentation on the second "(" to reflect the nesting level.

> + page != &folio->page), folio);

Also move the trailing "folio" argument to the next line, because the
multiline expression is huge and complex; make it obvious that the
ending "folio" is not part of the test condition.

2) Move the multiline test condition into a checking function. Inside
the function it can return early when a short-circuit condition is met.
That will also help the readability of this warning condition.
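
A rough sketch of option 2, with the helper name and placement only
illustrative (it simply mirrors the condition above):

static bool __folio_add_anon_rmap_sanity_failed(struct folio *folio,
						struct page *page,
						unsigned long address,
						int nr_pages,
						enum rmap_level level)
{
	/* the warning only applies to PTE-mapped large folios */
	if (!folio_test_large(folio) || level == RMAP_LEVEL_PMD)
		return false;
	if (!IS_ALIGNED(address, nr_pages * PAGE_SIZE))
		return true;
	if (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages))
		return true;
	return page != &folio->page;
}

	VM_WARN_ON_FOLIO(__folio_add_anon_rmap_sanity_failed(folio, page,
				address, nr_pages, level), folio);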

Chris

> __folio_set_anon(folio, vma, address,
> !!(flags & RMAP_EXCLUSIVE));
> } else if (likely(!folio_test_ksm(folio))) {
> --
> 2.34.1
>
>

2024-01-29 02:15:29

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> supported swapping large folios out as a whole for vmscan case. This patch
> extends the feature to madvise.
>
> If madvised range covers the whole large folio, we don't split it. Otherwise,
> we still need to split it.
>
> This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> mapped to a large folio.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/asm-generic/tlb.h | 10 +++++++
> include/linux/pgtable.h | 60 +++++++++++++++++++++++++++++++++++++++
> mm/madvise.c | 48 +++++++++++++++++++++++++++++++
> 3 files changed, 118 insertions(+)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 129a3a759976..f894e22da5d6 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> __tlb_remove_tlb_entry(tlb, ptep, address); \
> } while (0)
>
> +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr) \
> + do { \
> + int i; \
> + tlb_flush_pte_range(tlb, address, \
> + PAGE_SIZE * nr); \
> + for (i = 0; i < nr; i++) \
> + __tlb_remove_tlb_entry(tlb, ptep + i, \
> + address + i * PAGE_SIZE); \
> + } while (0)
> +
> #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
> do { \
> unsigned long _sz = huge_page_size(h); \
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 37fe83b0c358..da0c1cf447e3 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> }
> #endif
>
> +#ifndef pte_range_cont_mapped
> +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> + pte_t *start_pte,
> + unsigned long start_addr,
> + int nr)
> +{
> + int i;
> + pte_t pte_val;
> +
> + for (i = 0; i < nr; i++) {
> + pte_val = ptep_get(start_pte + i);
> +
> + if (pte_none(pte_val))
> + return false;

Hmm, shouldn't the following pte_pfn == start_pfn + i check already cover
the pte_none case?

I think pte_none means the PTE can't have a valid pfn, so this check
can be skipped?

> +
> + if (pte_pfn(pte_val) != (start_pfn + i))
> + return false;
> + }
> +
> + return true;
> +}
> +#endif
> +
> +#ifndef pte_range_young
> +static inline bool pte_range_young(pte_t *start_pte, int nr)
> +{
> + int i;
> +
> + for (i = 0; i < nr; i++)
> + if (pte_young(ptep_get(start_pte + i)))
> + return true;
> +
> + return false;
> +}
> +#endif
> +
> #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long address,
> @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> }
> #endif
>
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> + unsigned long start_addr,
> + pte_t *start_pte,
> + int nr, int full)
> +{
> + int i;
> + pte_t pte;
> +
> + pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> +
> + for (i = 1; i < nr; i++)
> + ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> + start_pte + i, full);
> +
> + return pte;
> +}
>
> /*
> * If two threads concurrently fault at the same page, the thread that
> @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> })
> #endif
>
> +#ifndef pte_nr_addr_end
> +#define pte_nr_addr_end(addr, size, end) \
> +({ unsigned long __boundary = ((addr) + size) & (~(size - 1)); \
> + (__boundary - 1 < (end) - 1)? __boundary: (end); \
> +})
> +#endif
> +
> /*
> * When walking page tables, we usually want to skip any p?d_none entries;
> * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 912155a94ed5..262460ac4b2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (folio_test_large(folio)) {
> int err;
>
> + if (!folio_test_pmd_mappable(folio)) {

This section of code is indented too far to the right.
You can do:

if (folio_test_pmd_mappable(folio))
goto split;

to make the code flatter.

> + int nr_pages = folio_nr_pages(folio);
> + unsigned long folio_size = PAGE_SIZE * nr_pages;
> + unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;
> + unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> + pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
> + unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> +
> + if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> + goto split;
> +
> + if (next - addr != folio_size) {

Nitpick: a one-line statement does not need braces.

> + goto split;
> + } else {

When the previous if statement already does "goto split", there is no need
for the else; you can save one level of indentation.



> + /* Do not interfere with other mappings of this page */
> + if (folio_estimated_sharers(folio) != 1)
> + goto skip;
> +
> + VM_BUG_ON(addr != start_addr || pte != start_pte);
> +
> + if (pte_range_young(start_pte, nr_pages)) {
> + ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> + nr_pages, tlb->fullmm);
> + ptent = pte_mkold(ptent);
> +
> + set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> + tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> + }
> +
> + folio_clear_referenced(folio);
> + folio_test_clear_young(folio);
> + if (pageout) {
> + if (folio_isolate_lru(folio)) {
> + if (folio_test_unevictable(folio))
> + folio_putback_lru(folio);
> + else
> + list_add(&folio->lru, &folio_list);
> + }
> + } else
> + folio_deactivate(folio);

I notice this section is very similar to the earlier code inside
the same function, under "if (pmd_trans_huge(*pmd)) {".

Wondering if there is some way to unify the two somehow.

Also notice that if you test the else condition first,

If (!pageout) {
folio_deactivate(folio);
goto skip;
}

You can save one level of indentation.
Not your fault, but I notice the section inside (pmd_trans_huge(*pmd))
does exactly the same thing.

Chris


> + }
> +skip:
> + pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> + addr = next - PAGE_SIZE;
> + continue;
> +
> + }
> +split:
> if (folio_estimated_sharers(folio) != 1)
> break;
> if (pageout_anon_only_filter && !folio_test_anon(folio))
> --
> 2.34.1
>
>

2024-01-29 03:25:45

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

Hi David and Barry,

On Mon, Jan 22, 2024 at 10:49 PM Barry Song <[email protected]> wrote:
>
> >
> >
> > I have on my todo list to move all that !anon handling out of
> > folio_add_anon_rmap_ptes(), and instead make swapin code call add
> > folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> > then (-> whole new folio exclusive).
> >
> > That's the cleaner approach.
> >
>
> one tricky thing is that sometimes it is hard to know who is the first
> one to add rmap and thus should
> call folio_add_new_anon_rmap.
> especially when we want to support swapin_readahead(), the one who
> allocated large filio might not
> be that one who firstly does rmap.

I think Barry has a point. Two tasks might race to swap in the folio
then race to perform the rmap.
folio_add_new_anon_rmap() should only be called on a folio that is absolutely
"new", not shared. Sharing via the swap cache disqualifies that
condition.

> is it an acceptable way to do the below in do_swap_page?
> if (!folio_test_anon(folio))
> folio_add_new_anon_rmap()
> else
> folio_add_anon_rmap_ptes()

I am curious to know the answer as well.

BTW, that test might have a race as well. By the time the task gets the
!anon result, it might have been changed by another task. We need
to make sure this race can't happen in the caller's context. Otherwise
we can't do the above safely.

Chris.

2024-01-29 09:07:22

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH RFC 0/6] mm: support large folios swap-in

Barry Song <[email protected]> writes:

> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> before swap-out, if some memory were normal pages, but when swapping in, we
> can also swap-in them as large folios. But this might require I/O happen at
> some random places in swap devices. So we limit the large folios swap-in to
> those areas which were large folios before swapping-out, aka, swaps are also
> contiguous in hardware. On the other hand, in OPPO's product, we've deployed
> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> compress and decompress large folios as a whole, which help improve compression
> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> save large objects whose original size are 64KiB for example. So it is also a
> better choice for us to only swap-in large folios for those compressed large
> objects as a large folio can be decompressed all together.

Another possibility is to combine large folio swap-in with VMA-based
swap-in readahead. If we are going to read ahead several pages based on
the VMA, we can swap in a large folio instead.

I think it is similar to allocating large file folios for file
readahead (TBH, I haven't checked the file large folio allocation code).

--
Best Regards,
Huang, Ying

> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> with MTE" to this series as it might help review.
>
> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> https://lore.kernel.org/linux-mm/[email protected]/
> [2] OnePlusOSS / android_kernel_oneplus_sm8550
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>
> Barry Song (2):
> arm64: mm: swap: support THP_SWAP on hardware with MTE
> mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
>
> Chuanhua Han (4):
> mm: swap: introduce swap_nr_free() for batched swap_free()
> mm: swap: make should_try_to_free_swap() support large-folio
> mm: support large folios swapin as a whole
> mm: madvise: don't split mTHP for MADV_PAGEOUT
>
> arch/arm64/include/asm/pgtable.h | 21 ++----
> arch/arm64/mm/mteswap.c | 42 ++++++++++++
> include/asm-generic/tlb.h | 10 +++
> include/linux/huge_mm.h | 12 ----
> include/linux/pgtable.h | 62 ++++++++++++++++-
> include/linux/swap.h | 6 ++
> mm/madvise.c | 48 ++++++++++++++
> mm/memory.c | 110 ++++++++++++++++++++++++++-----
> mm/page_io.c | 2 +-
> mm/rmap.c | 5 +-
> mm/swap_slots.c | 2 +-
> mm/swapfile.c | 29 ++++++++
> 12 files changed, 301 insertions(+), 48 deletions(-)

2024-01-29 10:07:16

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On 29.01.24 04:25, Chris Li wrote:
> Hi David and Barry,
>
> On Mon, Jan 22, 2024 at 10:49 PM Barry Song <[email protected]> wrote:
>>
>>>
>>>
>>> I have on my todo list to move all that !anon handling out of
>>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
>>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
>>> then (-> whole new folio exclusive).
>>>
>>> That's the cleaner approach.
>>>
>>
>> one tricky thing is that sometimes it is hard to know who is the first
>> one to add rmap and thus should
>> call folio_add_new_anon_rmap.
>> especially when we want to support swapin_readahead(), the one who
>> allocated large filio might not
>> be that one who firstly does rmap.
>
> I think Barry has a point. Two tasks might race to swap in the folio
> then race to perform the rmap.
> folio_add_new_anon_rmap() should only call a folio that is absolutely
> "new", not shared. The sharing in swap cache disqualifies that
> condition.

We have to hold the folio lock. So only one task at a time might do the
folio_add_anon_rmap_ptes() right now, and the
folio_add_new_shared_anon_rmap() in the future [below].

Also observe how folio_add_anon_rmap_ptes() states that one must hold
the page lock, because otherwise this would all be completely racy.

From the pte swp exclusive flags, we know for sure whether we are
dealing with exclusive vs. shared. I think patch #6 does not properly
check that all entries are actually the same in that regard (all
exclusive vs all shared). That likely needs fixing.

[I have converting per-page PageAnonExclusive flags to a single
per-folio flag on my todo list. I suspect that we'll keep the
per-swp-pte exclusive bits, but the question is rather what we can
actually make work, because swap and migration just make it much more
complicated. Anyhow, future work]

>
>> is it an acceptable way to do the below in do_swap_page?
>> if (!folio_test_anon(folio))
>> folio_add_new_anon_rmap()
>> else
>> folio_add_anon_rmap_ptes()
>
> I am curious to know the answer as well.


Yes, the end code should likely be something like:

/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
folio_add_new_anon_rmap(folio, vma, vmf->address);
folio_add_lru_vma(folio, vma);
} else if (folio_test_anon(folio)) {
folio_add_anon_rmap_ptes(rmap_flags)
} else {
folio_add_new_anon_rmap(rmap_flags)
}

Maybe we want to avoid teaching all existing folio_add_new_anon_rmap()
callers about a new flag, and just have a new
folio_add_new_shared_anon_rmap() instead. TBD.

>
> BTW, that test might have a race as well. By the time the task got
> !anon result, this result might get changed by another task. We need
> to make sure in the caller context this race can't happen. Otherwise
> we can't do the above safely.
Again, folio lock. Observe the folio_lock_or_retry() call that covers
our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.

--
Cheers,

David / dhildenb


2024-01-29 16:35:19

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On Mon, Jan 29, 2024 at 2:07 AM David Hildenbrand <[email protected]> wrote:
>
> On 29.01.24 04:25, Chris Li wrote:
> > Hi David and Barry,
> >
> > On Mon, Jan 22, 2024 at 10:49 PM Barry Song <[email protected]> wrote:
> >>
> >>>
> >>>
> >>> I have on my todo list to move all that !anon handling out of
> >>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
> >>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> >>> then (-> whole new folio exclusive).
> >>>
> >>> That's the cleaner approach.
> >>>
> >>
> >> one tricky thing is that sometimes it is hard to know who is the first
> >> one to add rmap and thus should
> >> call folio_add_new_anon_rmap.
> >> especially when we want to support swapin_readahead(), the one who
> >> allocated large filio might not
> >> be that one who firstly does rmap.
> >
> > I think Barry has a point. Two tasks might race to swap in the folio
> > then race to perform the rmap.
> > folio_add_new_anon_rmap() should only call a folio that is absolutely
> > "new", not shared. The sharing in swap cache disqualifies that
> > condition.
>
> We have to hold the folio lock. So only one task at a time might do the
> folio_add_anon_rmap_ptes() right now, and the
> folio_add_new_shared_anon_rmap() in the future [below].
>

Ah, I see. The folio_lock() is the answer I am looking for.

> Also observe how folio_add_anon_rmap_ptes() states that one must hold
> the page lock, because otherwise this would all be completely racy.
>
> From the pte swp exclusive flags, we know for sure whether we are
> dealing with exclusive vs. shared. I think patch #6 does not properly
> check that all entries are actually the same in that regard (all
> exclusive vs all shared). That likely needs fixing.
>
> [I have converting per-page PageAnonExclusive flags to a single
> per-folio flag on my todo list. I suspect that we'll keep the
> per-swp-pte exlusive bits, but the question is rather what we can
> actually make work, because swap and migration just make it much more
> complicated. Anyhow, future work]
>
> >
> >> is it an acceptable way to do the below in do_swap_page?
> >> if (!folio_test_anon(folio))
> >> folio_add_new_anon_rmap()
> >> else
> >> folio_add_anon_rmap_ptes()
> >
> > I am curious to know the answer as well.
>
>
> Yes, the end code should likely be something like:
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> folio_add_new_anon_rmap(folio, vma, vmf->address);
> folio_add_lru_vma(folio, vma);
> } else if (folio_test_anon(folio)) {
> folio_add_anon_rmap_ptes(rmap_flags)
> } else {
> folio_add_new_anon_rmap(rmap_flags)
> }
>
> Maybe we want to avoid teaching all existing folio_add_new_anon_rmap()
> callers about a new flag, and just have a new
> folio_add_new_shared_anon_rmap() instead. TBD.

More than one caller needs to perform that dance around
folio_test_anon() and then decide which function to call. It would be nice
to have a wrapper function folio_add_new_shared_anon_rmap() to
abstract this behavior.


>
> >
> > BTW, that test might have a race as well. By the time the task got
> > !anon result, this result might get changed by another task. We need
> > to make sure in the caller context this race can't happen. Otherwise
> > we can't do the above safely.
> Again, folio lock. Observe the folio_lock_or_retry() call that covers
> our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.

Ack. Thanks for the explanation.

Chris

2024-02-26 03:00:23

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE

Hi Chris,

Thanks for reviewing. Sorry for the late reply; I've had a lot to do
recently.

On Sat, Jan 27, 2024 at 12:14 PM Chris Li <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 3:11 AM Barry Song <[email protected]> wrote:
> >
> > From: Barry Song <[email protected]>
> >
> > Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> > THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> > MTE as the MTE code works with the assumption tags save/restore is
> > always handling a folio with only one page.
> >
> > The limitation should be removed as more and more ARM64 SoCs have
> > this feature. Co-existence of MTE and THP_SWAP becomes more and
> > more important.
> >
> > This patch makes MTE tags saving support large folios, then we don't
> > need to split large folios into base pages for swapping out on ARM64
> > SoCs with MTE any more.
> >
> > arch_prepare_to_swap() should take folio rather than page as parameter
> > because we support THP swap-out as a whole. It saves tags for all
> > pages in a large folio.
> >
> > As now we are restoring tags based-on folio, in arch_swap_restore(),
> > we may increase some extra loops and early-exitings while refaulting
> > a large folio which is still in swapcache in do_swap_page(). In case
> > a large folio has nr pages, do_swap_page() will only set the PTE of
> > the particular page which is causing the page fault.
> > Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> > will loop nr times for those subpages in the folio. So right now the
> > algorithmic complexity becomes O(nr^2).
> >
> > Once we support mapping large folios in do_swap_page(), extra loops
> > and early-exitings will decrease while not being completely removed
> > as a large folio might get partially tagged in corner cases such as,
> > 1. a large folio in swapcache can be partially unmapped, thus, MTE
> > tags for the unmapped pages will be invalidated;
> > 2. users might use mprotect() to set MTEs on a part of a large folio.
> >
> > arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> > who needed it.
> >
> > Reviewed-by: Steven Price <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > arch/arm64/include/asm/pgtable.h | 21 +++-------------
> > arch/arm64/mm/mteswap.c | 42 ++++++++++++++++++++++++++++++++
> > include/linux/huge_mm.h | 12 ---------
> > include/linux/pgtable.h | 2 +-
> > mm/page_io.c | 2 +-
> > mm/swap_slots.c | 2 +-
> > 6 files changed, 49 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 79ce70fbb751..9902395ca426 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -45,12 +45,6 @@
> > __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > - return !system_supports_mte();
> > -}
> > -#define arch_thp_swp_supported arch_thp_swp_supported
> > -
> > /*
> > * Outside of a few very special situations (e.g. hibernation), we always
> > * use broadcast TLB invalidation instructions, therefore a spurious page
> > @@ -1042,12 +1036,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > #ifdef CONFIG_ARM64_MTE
> >
> > #define __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > -{
> > - if (system_supports_mte())
> > - return mte_save_tags(page);
> > - return 0;
> > -}
> > +#define arch_prepare_to_swap arch_prepare_to_swap
>
> This seems like a no-op, defining "arch_prepare_to_swap" back to itself.
> What am I missing?
>
> I see. Answering my own question: I guess you want to allow someone to
> override arch_prepare_to_swap.
> Wouldn't testing against __HAVE_ARCH_PREPARE_TO_SWAP be enough to support that?

You are right. I was blindly copying my previous code:

static inline bool arch_thp_swp_supported(void)
{
return !system_supports_mte();
}
#define arch_thp_swp_supported arch_thp_swp_supported

For arch_thp_swp_supported, there isn't a similar macro, so we depend on
arch_thp_swp_supported being defined as a macro for itself in include/linux/huge_mm.h:

/*
* archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
* limitations in the implementation like arm64 MTE can override this to
* false
*/
#ifndef arch_thp_swp_supported
static inline bool arch_thp_swp_supported(void)
{
return true;
}
#endif

Now the case is different, we do have __HAVE_ARCH_PREPARE_TO_SWAP
instead.

>
> Maybe I need to understand better how you want others to extend this
> code to make suggestions.
> As it is, this looks strange.
>
> > +extern int arch_prepare_to_swap(struct folio *folio);
> >
> > #define __HAVE_ARCH_SWAP_INVALIDATE
> > static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > @@ -1063,11 +1053,8 @@ static inline void arch_swap_invalidate_area(int type)
> > }
> >
> > #define __HAVE_ARCH_SWAP_RESTORE
> > -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > -{
> > - if (system_supports_mte())
> > - mte_restore_tags(entry, &folio->page);
> > -}
> > +#define arch_swap_restore arch_swap_restore
>
> Same here.

You are right, again.

>
> > +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >
> > #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > index a31833e3ddc5..b9ca1b35902f 100644
> > --- a/arch/arm64/mm/mteswap.c
> > +++ b/arch/arm64/mm/mteswap.c
> > @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> > mte_free_tag_storage(tags);
> > }
> >
> > +static inline void __mte_invalidate_tags(struct page *page)
> > +{
> > + swp_entry_t entry = page_swap_entry(page);
> > +
> > + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > +}
> > +
> > void mte_invalidate_tags_area(int type)
> > {
> > swp_entry_t entry = swp_entry(type, 0);
> > @@ -83,3 +90,38 @@ void mte_invalidate_tags_area(int type)
> > }
> > xa_unlock(&mte_pages);
> > }
> > +
> > +int arch_prepare_to_swap(struct folio *folio)
> > +{
> > + int err;
> > + long i;
> > +
> > + if (system_supports_mte()) {
> Very minor nitpick.
>
> You can do
> if (!system_supports_mte())
> return 0;
>
> Here and the for loop would have less indent. The function looks flatter.

I agree.

>
> > + long nr = folio_nr_pages(folio);
> > +
> > + for (i = 0; i < nr; i++) {
> > + err = mte_save_tags(folio_page(folio, i));
> > + if (err)
> > + goto out;
> > + }
> > + }
> > + return 0;
> > +
> > +out:
> > + while (i--)
> > + __mte_invalidate_tags(folio_page(folio, i));
> > + return err;
> > +}
> > +
> > +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > +{
> > + if (system_supports_mte()) {
>
> Same here.
>
> Looks good otherwise. None of the nitpicks are deal breakers.
>
> Acked-by: Chris Li <[email protected]>

Thanks!

>
>
> Chris
>
> > + long i, nr = folio_nr_pages(folio);
> > +
> > + entry.val -= swp_offset(entry) & (nr - 1);
> > + for (i = 0; i < nr; i++) {
> > + mte_restore_tags(entry, folio_page(folio, i));
> > + entry.val++;
> > + }
> > + }
> > +}
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 5adb86af35fc..67219d2309dd 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -530,16 +530,4 @@ static inline int split_folio(struct folio *folio)
> > return split_folio_to_list(folio, NULL);
> > }
> >
> > -/*
> > - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > - * limitations in the implementation like arm64 MTE can override this to
> > - * false
> > - */
> > -#ifndef arch_thp_swp_supported
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > - return true;
> > -}
> > -#endif
> > -
> > #endif /* _LINUX_HUGE_MM_H */
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index f6d0e3513948..37fe83b0c358 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -925,7 +925,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> > * prototypes must be defined in the arch-specific asm/pgtable.h file.
> > */
> > #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > +static inline int arch_prepare_to_swap(struct folio *folio)
> > {
> > return 0;
> > }
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index ae2b49055e43..a9a7c236aecc 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> > * Arch code may have to preserve more data than just the page
> > * contents, e.g. memory tags.
> > */
> > - ret = arch_prepare_to_swap(&folio->page);
> > + ret = arch_prepare_to_swap(folio);
> > if (ret) {
> > folio_mark_dirty(folio);
> > folio_unlock(folio);
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 0bec1f705f8e..2325adbb1f19 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> > entry.val = 0;
> >
> > if (folio_test_large(folio)) {
> > - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > + if (IS_ENABLED(CONFIG_THP_SWAP))
> > get_swap_pages(1, &entry, folio_nr_pages(folio));
> > goto out;
> > }
> > --
> > 2.34.1
> >

Best Regards,
Barry

2024-02-26 04:47:25

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free()

Hi Chris,

Thanks!

On Sat, Jan 27, 2024 at 12:17 PM Chris Li <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 3:11 AM Barry Song <[email protected]> wrote:
> >
> > From: Chuanhua Han <[email protected]>
> >
> > While swapping in a large folio, we need to free swaps related to the whole
> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > to introduce an API for batched free.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/linux/swap.h | 6 ++++++
> > mm/swapfile.c | 29 +++++++++++++++++++++++++++++
> > 2 files changed, 35 insertions(+)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 4db00ddad261..31a4ee2dcd1c 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -478,6 +478,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> > extern int swap_duplicate(swp_entry_t);
> > extern int swapcache_prepare(swp_entry_t);
> > extern void swap_free(swp_entry_t);
> > +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> > extern void swapcache_free_entries(swp_entry_t *entries, int n);
> > extern int free_swap_and_cache(swp_entry_t);
> > int swap_type_of(dev_t device, sector_t offset);
> > @@ -553,6 +554,11 @@ static inline void swap_free(swp_entry_t swp)
> > {
> > }
> >
> > +void swap_nr_free(swp_entry_t entry, int nr_pages)
> > +{
> > +
> > +}
> > +
> > static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> > {
> > }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 556ff7347d5f..6321bda96b77 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1335,6 +1335,35 @@ void swap_free(swp_entry_t entry)
> > __swap_entry_free(p, entry);
> > }
> >
> > +void swap_nr_free(swp_entry_t entry, int nr_pages)
> > +{
> > + int i;
> > + struct swap_cluster_info *ci;
> > + struct swap_info_struct *p;
> > + unsigned type = swp_type(entry);
> > + unsigned long offset = swp_offset(entry);
> > + DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> > +
> > + VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>
> The BUG_ON here seems a bit too developer-oriented. Maybe warn once and
> fall back to freeing one by one?

The function is used only for cases where we are quite sure we are freeing
contiguous swap entries within a cluster; if that is not the case,
we would need an array of entries[]. Would people be more comfortable with
a WARN_ON instead? The problem is that if this really happens,
it is a bug, and a WARN isn't enough.
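
For comparison, the warn-and-fall-back variant Chris suggests would look
roughly like this (a sketch only, reusing the locals from the patch above):

	if (WARN_ON_ONCE(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER)) {
		/* not a contiguous range within one cluster: free one by one */
		for (i = 0; i < nr_pages; i++)
			swap_free(swp_entry(type, offset + i));
		return;
	}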

>
> How big are SWAPFILE_CLUSTER and nr_pages typically on arm?

My case is SWAPFILE_CLUSTER = HPAGE_PMD_NR = 2MB/4KB = 512.

>
> I ask this question because if nr_pages > 64, that is a totally
> different game: we can completely bypass the swap slot cache.
>

I agree we have a chance to bypass the slot cache if nr_pages is bigger than
SWAP_SLOTS_CACHE_SIZE. On the other hand, even when nr_pages <
64, we still have a good chance to optimize free_swap_slot() by batching,
as there are many spin_lock and sort() operations for each single entry.


> > +
> > + if (nr_pages == 1) {
> > + swap_free(entry);
> > + return;
> > + }
> > +
> > + p = _swap_info_get(entry);
> > +
> > + ci = lock_cluster(p, offset);
> > + for (i = 0; i < nr_pages; i++) {
> > + if (__swap_entry_free_locked(p, offset + i, 1))
> > + __bitmap_set(usage, i, 1);
> > + }
> > + unlock_cluster(ci);
> > +
> > + for_each_clear_bit(i, usage, nr_pages)
> > + free_swap_slot(swp_entry(type, offset + i));
>
> Notice that free_swap_slot() internally has per-CPU cache batching as
> well. Every free_swap_slot() will take some per-CPU swap slot cache and
> its cache->lock. There is double batching here.
> If the typical batch size here is bigger than 64 entries, we can go
> directly to batching swap_entry_free and avoid the free_swap_slot()
> batching altogether. Unlike free_swap_slot_entries(), here the swap slots
> are all from one swap device, so there is no need to sort and group the
> swap slots by swap device.

I agree. You are completely right!
However, to keep the patchset smaller at the beginning, I prefer
to defer these optimizations to a separate patchset after this one.

>
> Chris
>
>
> Chris
>
> > +}
> > +
> > /*
> > * Called after dropping swapcache to decrease refcnt to swap entries.
> > */
> > --
> > 2.34.1

Thanks
Barry

2024-02-26 05:06:05

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On Tue, Jan 30, 2024 at 5:32 AM Chris Li <[email protected]> wrote:
>
> On Mon, Jan 29, 2024 at 2:07 AM David Hildenbrand <[email protected]> wrote:
> >
> > On 29.01.24 04:25, Chris Li wrote:
> > > Hi David and Barry,
> > >
> > > On Mon, Jan 22, 2024 at 10:49 PM Barry Song <[email protected]> wrote:
> > >>
> > >>>
> > >>>
> > >>> I have on my todo list to move all that !anon handling out of
> > >>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
> > >>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> > >>> then (-> whole new folio exclusive).
> > >>>
> > >>> That's the cleaner approach.
> > >>>
> > >>
> > >> one tricky thing is that sometimes it is hard to know who is the first
> > >> one to add rmap and thus should
> > >> call folio_add_new_anon_rmap.
> > >> especially when we want to support swapin_readahead(), the one who
> > >> allocated large filio might not
> > >> be that one who firstly does rmap.
> > >
> > > I think Barry has a point. Two tasks might race to swap in the folio
> > > then race to perform the rmap.
> > > folio_add_new_anon_rmap() should only call a folio that is absolutely
> > > "new", not shared. The sharing in swap cache disqualifies that
> > > condition.
> >
> > We have to hold the folio lock. So only one task at a time might do the
> > folio_add_anon_rmap_ptes() right now, and the
> > folio_add_new_shared_anon_rmap() in the future [below].
> >
>
> Ah, I see. The folio_lock() is the answer I am looking for.
>
> > Also observe how folio_add_anon_rmap_ptes() states that one must hold
> > the page lock, because otherwise this would all be completely racy.
> >
> > From the pte swp exclusive flags, we know for sure whether we are
> > dealing with exclusive vs. shared. I think patch #6 does not properly
> > check that all entries are actually the same in that regard (all
> > exclusive vs all shared). That likely needs fixing.
> >
> > [I have converting per-page PageAnonExclusive flags to a single
> > per-folio flag on my todo list. I suspect that we'll keep the
> > per-swp-pte exlusive bits, but the question is rather what we can
> > actually make work, because swap and migration just make it much more
> > complicated. Anyhow, future work]
> >
> > >
> > >> is it an acceptable way to do the below in do_swap_page?
> > >> if (!folio_test_anon(folio))
> > >> folio_add_new_anon_rmap()
> > >> else
> > >> folio_add_anon_rmap_ptes()
> > >
> > > I am curious to know the answer as well.
> >
> >
> > Yes, the end code should likely be something like:
> >
> > /* ksm created a completely new copy */
> > if (unlikely(folio != swapcache && swapcache)) {
> > folio_add_new_anon_rmap(folio, vma, vmf->address);
> > folio_add_lru_vma(folio, vma);
> > } else if (folio_test_anon(folio)) {
> > folio_add_anon_rmap_ptes(rmap_flags)
> > } else {
> > folio_add_new_anon_rmap(rmap_flags)
> > }
> >
> > Maybe we want to avoid teaching all existing folio_add_new_anon_rmap()
> > callers about a new flag, and just have a new
> > folio_add_new_shared_anon_rmap() instead. TBD.

If we have to add a wrapper like folio_add_new_shared_anon_rmap()
to avoid "if (folio_test_anon(folio))" and "else" everywhere, why don't
we just do it in folio_add_anon_rmap_ptes()?

folio_add_anon_rmap_ptes()
{
if (!folio_test_anon(folio))
return folio_add_new_anon_rmap();
}

Anyway, I am going to change patch 4/6 to the if(folio_test_anon)/else form first
and drop this 5/6.
We can figure out whether we need a wrapper later.

>
> There is more than one caller needed to perform that dance around
> folio_test_anon() then decide which function to call. It would be nice
> to have a wrapper function folio_add_new_shared_anon_rmap() to
> abstract this behavior.
>
>
> >
> > >
> > > BTW, that test might have a race as well. By the time the task got
> > > !anon result, this result might get changed by another task. We need
> > > to make sure in the caller context this race can't happen. Otherwise
> > > we can't do the above safely.
> > Again, folio lock. Observe the folio_lock_or_retry() call that covers
> > our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.
>
> Ack. Thanks for the explanation.
>
> Chris

Thanks
Barry

2024-02-26 06:40:13

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

On Mon, Jan 29, 2024 at 3:15 PM Chris Li <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
> >
> > From: Chuanhua Han <[email protected]>
> >
> > MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> > supported swapping large folios out as a whole for vmscan case. This patch
> > extends the feature to madvise.
> >
> > If madvised range covers the whole large folio, we don't split it. Otherwise,
> > we still need to split it.
> >
> > This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> > helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> > mapped to a large folio.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/asm-generic/tlb.h | 10 +++++++
> > include/linux/pgtable.h | 60 +++++++++++++++++++++++++++++++++++++++
> > mm/madvise.c | 48 +++++++++++++++++++++++++++++++
> > 3 files changed, 118 insertions(+)
> >
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index 129a3a759976..f894e22da5d6 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> > __tlb_remove_tlb_entry(tlb, ptep, address); \
> > } while (0)
> >
> > +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr) \
> > + do { \
> > + int i; \
> > + tlb_flush_pte_range(tlb, address, \
> > + PAGE_SIZE * nr); \
> > + for (i = 0; i < nr; i++) \
> > + __tlb_remove_tlb_entry(tlb, ptep + i, \
> > + address + i * PAGE_SIZE); \
> > + } while (0)
> > +
> > #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
> > do { \
> > unsigned long _sz = huge_page_size(h); \
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 37fe83b0c358..da0c1cf447e3 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> > }
> > #endif
> >
> > +#ifndef pte_range_cont_mapped
> > +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> > + pte_t *start_pte,
> > + unsigned long start_addr,
> > + int nr)
> > +{
> > + int i;
> > + pte_t pte_val;
> > +
> > + for (i = 0; i < nr; i++) {
> > + pte_val = ptep_get(start_pte + i);
> > +
> > + if (pte_none(pte_val))
> > + return false;
>
> Hmm, the following check pte_pfn == start_pfn + i should have covered
> the pte none case?
>
> I think the pte_none means it can't have a valid pfn. So this check
> can be skipped?

Yes, the pte_pfn == start_pfn + i check should have covered the pte_none
case, but leaving pte_none there seems to make the code more
readable. I guess we need to check pte_present() too: there is a small chance
that a swp_offset can equal a pte_pfn after some shifting, in case a PTE
within the large folio range has become a swap entry?

I am still thinking about whether we have a cheaper way to check if a folio
is still entirely mapped, maybe something like
if (list_empty(&folio->_deferred_list))?
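
A minimal sketch of that stricter loop, using pte_present() so that both
the none and the swap/migration-entry cases are rejected (names follow the
quoted patch):

	for (i = 0; i < nr; i++) {
		pte_t pte_val = ptep_get(start_pte + i);

		/* rejects pte_none and swap/migration entries in one test */
		if (!pte_present(pte_val))
			return false;
		if (pte_pfn(pte_val) != start_pfn + i)
			return false;
	}
	return true;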

>
> > +
> > + if (pte_pfn(pte_val) != (start_pfn + i))
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +#endif
> > +
> > +#ifndef pte_range_young
> > +static inline bool pte_range_young(pte_t *start_pte, int nr)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr; i++)
> > + if (pte_young(ptep_get(start_pte + i)))
> > + return true;
> > +
> > + return false;
> > +}
> > +#endif
> > +
> > #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> > static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > unsigned long address,
> > @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> > }
> > #endif
> >
> > +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> > +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> > + unsigned long start_addr,
> > + pte_t *start_pte,
> > + int nr, int full)
> > +{
> > + int i;
> > + pte_t pte;
> > +
> > + pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> > +
> > + for (i = 1; i < nr; i++)
> > + ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> > + start_pte + i, full);
> > +
> > + return pte;
> > +}
> >
> > /*
> > * If two threads concurrently fault at the same page, the thread that
> > @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > })
> > #endif
> >
> > +#ifndef pte_nr_addr_end
> > +#define pte_nr_addr_end(addr, size, end) \
> > +({ unsigned long __boundary = ((addr) + size) & (~(size - 1)); \
> > + (__boundary - 1 < (end) - 1)? __boundary: (end); \
> > +})
> > +#endif
> > +
> > /*
> > * When walking page tables, we usually want to skip any p?d_none entries;
> > * and any p?d_bad entries - reporting the error before resetting to none.
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..262460ac4b2e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > if (folio_test_large(folio)) {
> > int err;
> >
> > + if (!folio_test_pmd_mappable(folio)) {
>
> This session of code indent into the right too much.
> You can do:
>
> if (folio_test_pmd_mappable(folio))
> goto split;
>
> to make the code flatter.

I guess we don't need "if (!folio_test_pmd_mappable(folio))" at all,
as the pmd case has already been handled at the very beginning of
madvise_cold_or_pageout_pte_range().

>
> > + int nr_pages = folio_nr_pages(folio);
> > + unsigned long folio_size = PAGE_SIZE * nr_pages;
> > + unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;
> > + unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> > + pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
> > + unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> > +
> > + if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> > + goto split;
> > +
> > + if (next - addr != folio_size) {
>
> Nitpick: One line statement does not need {
>
> > + goto split;
> > + } else {
>
> When the previous if statement already "goto split", there is no need
> for the else. You can save one level of indentation.

right!

>
>
>
> > + /* Do not interfere with other mappings of this page */
> > + if (folio_estimated_sharers(folio) != 1)
> > + goto skip;
> > +
> > + VM_BUG_ON(addr != start_addr || pte != start_pte);
> > +
> > + if (pte_range_young(start_pte, nr_pages)) {
> > + ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> > + nr_pages, tlb->fullmm);
> > + ptent = pte_mkold(ptent);
> > +
> > + set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> > + tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> > + }
> > +
> > + folio_clear_referenced(folio);
> > + folio_test_clear_young(folio);
> > + if (pageout) {
> > + if (folio_isolate_lru(folio)) {
> > + if (folio_test_unevictable(folio))
> > + folio_putback_lru(folio);
> > + else
> > + list_add(&folio->lru, &folio_list);
> > + }
> > + } else
> > + folio_deactivate(folio);
>
> I notice this section is very similar to the earlier statements inside
> the same function.
> "if (pmd_trans_huge(*pmd)) {"
>
> Wondering if there is some way to unify the two a bit somehow.

We have duplicated the code three times - pmd, pte-mapped large, normal folio.
Let me see if we can extract a common function.

>
> Also notice if you test the else condition first,
>
> If (!pageout) {
> folio_deactivate(folio);
> goto skip;
> }
>
> You can save one level of indentation.
> Not your fault, I notice the section inside (pmd_trans_huge(*pmd))
> does exactly the same thing.
>

We can address this issue once we have a common function.
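
To make that concrete, a possible shape for such a common helper (the name
and exact signature are hypothetical; it only uses calls already present in
the patch) could be:

static void madvise_folio_deactivate_or_pageout(struct folio *folio,
						struct list_head *folio_list,
						bool pageout)
{
	folio_clear_referenced(folio);
	folio_test_clear_young(folio);
	if (!pageout) {
		folio_deactivate(folio);
		return;
	}
	if (folio_isolate_lru(folio)) {
		if (folio_test_unevictable(folio))
			folio_putback_lru(folio);
		else
			list_add(&folio->lru, folio_list);
	}
}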

> Chris
>
>
> > + }
> > +skip:
> > + pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> > + addr = next - PAGE_SIZE;
> > + continue;
> > +
> > + }
> > +split:
> > if (folio_estimated_sharers(folio) != 1)
> > break;
> > if (pageout_anon_only_filter && !folio_test_anon(folio))
> > --
> > 2.34.1
> >
> >

Thanks
Barry

2024-02-26 07:29:56

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole

On Sun, Jan 28, 2024 at 8:53 AM Chris Li <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
> >
> > From: Chuanhua Han <[email protected]>
> >
> > On an embedded system like Android, more than half of anon memory is actually
> > in swap devices such as zRAM. For example, while an app is switched to back-
> > ground, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> > to those contiguous swaps which were likely swapped out from mTHP as a whole.
> >
> > On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
> > case. It doesn't support swapin_readahead as large folios yet.
> >
> > Right now, we are re-faulting large folios which are still in swapcache as a
> > whole, this can effectively decrease extra loops and early-exitings which we
> > have increased in arch_swap_restore() while supporting MTE restore for folios
> > rather than page.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
> > 1 file changed, 94 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f61a48929ba7..928b3f542932 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
> > static vm_fault_t do_fault(struct vm_fault *vmf);
> > static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
> > static bool vmf_pte_changed(struct vm_fault *vmf);
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > + bool (*pte_range_check)(pte_t *, int));
>
> Instead of returning "bool", the pte_range_check() can return the
> start of the swap entry of the large folio.
> That will save some of the later code needed to get the start of the
> large folio.

I am trying to reuse alloc_anon_folio() for both do_anonymous_page() and
do_swap_page(). Unfortunately, this function returns a folio, and there is no
place to return a swap entry unless we add a parameter. Getting the
start swap entry is quite cheap, on the other hand.

>
> >
> > /*
> > * Return true if the original pte was a uffd-wp pte marker (so the pte was
> > @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> > return VM_FAULT_SIGBUS;
> > }
> >
> > +static bool pte_range_swap(pte_t *pte, int nr_pages)
>
> This function name seems to suggest it will perform the range swap.
> That is not what it is doing.
> Suggest change to some other name reflecting that it is only a
> condition test without actual swap action.
> I am not very good at naming functions. Just thinking out loud: e.g.
> pte_range_swap_check or pte_test_range_swap. You can come up with
> something better.

Ryan has a function named pte_range_none, which checks that the whole
range is pte_none. Maybe we can have an is_pte_range_contig_swap,
whose name covers both "swap" and "contiguous", as we only need contiguous
swap entries.

>
>
> > +{
> > + int i;
> > + swp_entry_t entry;
> > + unsigned type;
> > + pgoff_t start_offset;
> > +
> > + entry = pte_to_swp_entry(ptep_get_lockless(pte));
> > + if (non_swap_entry(entry))
> > + return false;
> > + start_offset = swp_offset(entry);
> > + if (start_offset % nr_pages)
> > + return false;
>
> This suggests the pte argument needs to point to the beginning of the
> large-folio equivalent of the swap entry (not sure what to call it; let me
> call it "large folio swap" here).
> We might want to unify the terms for that.
> Anyway, we might want to document this requirement, otherwise the caller
> might consider passing the current pte that generated the fault. From
> the function name it is not obvious which pte should be passed.

OK. Ryan's swap-out will allocate swap entries whose start offset is
aligned to nr_pages. I will add some documentation to describe the first entry.
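
The documentation could be something along these lines (the wording, and
whatever final name the helper ends up with, are only suggestions):

/*
 * Check that the @nr_pages swap PTEs starting at @pte are contiguous:
 * all of them are swap entries of the same swap type with consecutive
 * offsets.
 *
 * @pte must point to the PTE covering the first page of the candidate
 * large folio; its swap offset is expected to be aligned to @nr_pages,
 * matching how swap slots are allocated for large folios at swap-out.
 */
static bool pte_range_swap(pte_t *pte, int nr_pages);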

>
> > +
> > + type = swp_type(entry);
> > + for (i = 1; i < nr_pages; i++) {
>
> You might want to test from the last page backwards, because if the entry
> is not a large folio swap, most likely the last entry will be
> invalid. Some of the beginning swap entries might match due to batch
> allocation etc. The SSD likes to group nearby swap entry writes
> together on the disk.

I am not sure I got your point. This checks all pages within
the range of a large folio; Ryan's patch allocates swap entries
all together as a whole for a large folio while swapping out:

@@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t
swp_entries[], int entry_size)
spin_unlock(&si->lock);
goto nextsi;
}
- if (size == SWAPFILE_CLUSTER) {
- if (si->flags & SWP_BLKDEV)
- n_ret = swap_alloc_cluster(si, swp_entries);
+ if (size > 1) {
+ n_ret = swap_alloc_large(si, swp_entries, size);
} else
n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
n_goal, swp_entries);


>
>
>
> > + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
>
> > + if (non_swap_entry(entry))
> > + return false;
> > + if (swp_offset(entry) != start_offset + i)
> > + return false;
> > + if (swp_type(entry) != type)
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +
> > /*
> > * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > + int nr_pages = 1;
> > + unsigned long start_address;
> > + pte_t *start_pte;
> >
> > if (!pte_unmap_same(vmf))
> > goto out;
> > @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > __swap_count(entry) == 1) {
> > /* skip swapcache */
> > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > - vma, vmf->address, false);
> > + folio = alloc_anon_folio(vmf, pte_range_swap);
>
> This function can call pte_range_swap() twice: one here, another one
> in folio_test_large().
> Consider caching the result so it does not need to walk the pte range
> swap twice.
>
> I think alloc_anon_folio should either be told what is the
> size(prefered) or just figure out the right size. I don't think it
> needs to pass in the checking function as function callbacks. There
> are two call sites of alloc_anon_folio, they are all within this
> function. The call back seems a bit overkill here. Also duplicate the
> range swap walk.

alloc_anon_folio() here reuses the helper from do_anonymous_page(); in
both cases we scan the PTEs to figure out the proper size. The other
call site is within do_anonymous_page().

>
> > page = &folio->page;
> > if (folio) {
> > __folio_set_locked(folio);
> > __folio_set_swapbacked(folio);
> >
> > + if (folio_test_large(folio)) {
> > + unsigned long start_offset;
> > +
> > + nr_pages = folio_nr_pages(folio);
> > + start_offset = swp_offset(entry) & ~(nr_pages - 1);
> Here is the first place we roll up the start offset with the folio size.
>
> > + entry = swp_entry(swp_type(entry), start_offset);
> > + }
> > +
> > if (mem_cgroup_swapin_charge_folio(folio,
> > vma->vm_mm, GFP_KERNEL,
> > entry)) {
> > @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> > &vmf->ptl);
> > +
> > + start_address = vmf->address;
> > + start_pte = vmf->pte;
> > + if (folio_test_large(folio)) {
> > + unsigned long nr = folio_nr_pages(folio);
> > + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> > + pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
>
> Here is the second place we roll up the folio size.
> Maybe we can cache results and avoid repetition?

We have two paths getting into large folios here:
1. we allocate a new large folio;
2. we find a large folio in the swapcache.

We have already rolled up the folio size for case 1 above, but here we
also need to take care of case 2, which is why we need both. Let me
think about whether there is some way to remove the redundant code for
case 1.
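
One option might be a small helper shared by both places, something like
the below (untested sketch; the helper name is made up):

static void folio_swap_range(struct vm_fault *vmf, struct folio *folio,
			     unsigned long *start_addr, pte_t **start_pte)
{
	unsigned long nr = folio_nr_pages(folio);
	unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);

	/* roll the faulting address/pte back to the start of the folio */
	*start_addr = addr;
	*start_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
}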

>
> > +
> > + /*
> > + * case 1: we are allocating large_folio, try to map it as a whole
> > + * iff the swap entries are still entirely mapped;
> > + * case 2: we hit a large folio in swapcache, and all swap entries
> > + * are still entirely mapped, try to map a large folio as a whole.
> > + * otherwise, map only the faulting page within the large folio
> > + * which is swapcache
> > + */
>
> One question I have in mind: the swap device is locked, so we
> can't change the swap slot allocations.
> But that does not stop the pte entries from getting changed, right? Then
> we could have someone in user space racing to change the PTE while we
> are checking it here.
>
> > + if (pte_range_swap(pte_t, nr)) {
>
> After this pte_range_swap() check, some of the PTE entries get changed
> and now we don't have the full large page swap any more?
> At least I can't conclude this possibility can't happen yet, please
> enlighten me.

This check is done under the PTL; nobody else can change the PTEs,
because they would have to hold the PTL to do so:
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
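
To spell out the window (a simplified sketch of the ordering in the
patch above, not new code):

	/* PTL taken: from here on the PTEs cannot change under us */
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
				       &vmf->ptl);
	if (folio_test_large(folio) && !pte_range_swap(start_pte, nr))
		goto out_nomap;		/* somebody raced before we locked */
	/* ... build the new pte and install it while still holding PTL ... */
	set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
	pte_unmap_unlock(vmf->pte, vmf->ptl);	/* PTL released */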


>
> > + start_address = addr;
> > + start_pte = pte_t;
> > + if (unlikely(folio == swapcache)) {
> > + /*
> > + * the below has been done before swap_read_folio()
> > + * for case 1
> > + */
> > + nr_pages = nr;
> > + entry = pte_to_swp_entry(ptep_get(start_pte));
>
> If we make pte_range_swap() return the entry, we can avoid refetching
> the swap entry here.

We would have to add a parameter such as swp_entry_t *first_entry to
return the entry. The difficulty is that we would also have to add this
parameter to alloc_anon_folio(), which is a bit of overkill for that
function.
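
For reference, just to show the extra plumbing (hypothetical signature,
not something I am proposing):

static struct folio *alloc_anon_folio(struct vm_fault *vmf,
				      bool (*pte_range_check)(pte_t *, int),
				      swp_entry_t *first_entry);

/* and do_anonymous_page() would then have to pass a dummy argument: */
folio = alloc_anon_folio(vmf, pte_range_none, NULL);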


>
> > + page = &folio->page;
> > + }
> > + } else if (nr_pages > 1) { /* ptes have changed for case 1 */
> > + goto out_nomap;
> > + }
> > + }
> > +
> I rewrote the above to make the code indentation match the execution flow.
> I did not add any functional change, just rearranged the code to be a
> bit more streamlined and got rid of the "else if goto".
> if (!pte_range_swap(pte_t, nr)) {
> if (nr_pages > 1) /* ptes have changed for case 1 */
> goto out_nomap;
> goto check_pte;
> }
>
> start_address = addr;
> start_pte = pte_t;
> if (unlikely(folio == swapcache)) {
> /*
> * the below has been done before swap_read_folio()
> * for case 1
> */
> nr_pages = nr;
> entry = pte_to_swp_entry(ptep_get(start_pte));
> page = &folio->page;
> }
> }

looks good to me.

>
> check_pte:
>
> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > goto out_nomap;
> >
> > @@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * We're already holding a reference on the page but haven't mapped it
> > * yet.
> > */
> > - swap_free(entry);
> > + swap_nr_free(entry, nr_pages);
> > if (should_try_to_free_swap(folio, vma, vmf->flags))
> > folio_free_swap(folio);
> >
> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > + folio_ref_add(folio, nr_pages - 1);
> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +
> > pte = mk_pte(page, vma->vm_page_prot);
> >
> > /*
> > @@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * exclusivity.
> > */
> > if (!folio_test_ksm(folio) &&
> > - (exclusive || folio_ref_count(folio) == 1)) {
> > + (exclusive || folio_ref_count(folio) == nr_pages)) {
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > vmf->flags &= ~FAULT_FLAG_WRITE;
> > }
> > rmap_flags |= RMAP_EXCLUSIVE;
> > }
> > - flush_icache_page(vma, page);
> > + flush_icache_pages(vma, page, nr_pages);
> > if (pte_swp_soft_dirty(vmf->orig_pte))
> > pte = pte_mksoft_dirty(pte);
> > if (pte_swp_uffd_wp(vmf->orig_pte))
> > @@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > folio_add_new_anon_rmap(folio, vma, vmf->address);
> > folio_add_lru_vma(folio, vma);
> > } else {
> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> > rmap_flags);
> > }
> >
> > VM_BUG_ON(!folio_test_anon(folio) ||
> > (pte_write(pte) && !PageAnonExclusive(page)));
> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > +
> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
> >
> > folio_unlock(folio);
> > if (folio != swapcache && swapcache) {
> > @@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > + if (folio_test_large(folio) && nr_pages > 1)
> > + vmf->orig_pte = ptep_get(vmf->pte);
> > +
> > ret |= do_wp_page(vmf);
> > if (ret & VM_FAULT_ERROR)
> > ret &= VM_FAULT_ERROR;
> > @@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > /* No need to invalidate - it was non-present before */
> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> > unlock:
> > if (vmf->pte)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > @@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
> > return true;
> > }
> >
> > -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > + bool (*pte_range_check)(pte_t *, int))
> > {
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > struct vm_area_struct *vma = vmf->vma;
> > @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>
> About this patch context we have the following comments in the source code.
> /*
> * Find the highest order where the aligned range is completely
> * pte_none(). Note that all remaining orders will be completely
> * pte_none().
> */
> > order = highest_order(orders);
> > while (orders) {
> > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > - if (pte_range_none(pte + pte_index(addr), 1 << order))
> > + if (pte_range_check(pte + pte_index(addr), 1 << order))
>
> Again, I don't think we need to pass in pte_range_check() as a
> callback function.
> There are only two call sites, both within this file. This will totally
> invalidate the above comment about pte_none(). In the worst case, just
> make it accept one argument saying whether it is checking a swap range
> or a none range, and check accordingly.
> We should make the range check blend in with alloc_anon_folio() better.
> My gut feeling is that there should be a nicer way to do that, e.g.
> maybe store some of the large swap context in the vmf and pass it to
> the different places, etc. I need to spend more time thinking about it
> to come up with a happier solution.

We could pass a type to select between pte_range_none() and
pte_range_swap(). I'd like to avoid stashing state in something global
like vmf: people would have to read two or more functions to understand
what is going on, and even though the second function could use the
value the first one stored in vmf, it introduces more coupling.
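
For example, roughly like the below (untested sketch; the enum and its
names are made up, and only the inner loop of alloc_anon_folio() is
shown):

enum anon_folio_check {
	ANON_CHECK_NONE,	/* do_anonymous_page(): range must be pte_none() */
	ANON_CHECK_SWAP,	/* do_swap_page(): range must be contiguous swap */
};

	/* inside alloc_anon_folio(vmf, check) */
	order = highest_order(orders);
	while (orders) {
		bool ok;

		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		ok = (check == ANON_CHECK_SWAP) ?
			pte_range_swap(pte + pte_index(addr), 1 << order) :
			pte_range_none(pte + pte_index(addr), 1 << order);
		if (ok)
			break;
		order = next_order(&orders, order);
	}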

>
> Chris
>
> > break;
> > order = next_order(&orders, order);
> > }
> > @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > if (unlikely(anon_vma_prepare(vma)))
> > goto oom;
> > /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> > - folio = alloc_anon_folio(vmf);
> > + folio = alloc_anon_folio(vmf, pte_range_none);
> > if (IS_ERR(folio))
> > return 0;
> > if (!folio)
> > --
> > 2.34.1
> >

Thanks
Barry

2024-02-26 07:40:09

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole

On Sun, Jan 28, 2024 at 9:06 AM Chris Li <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
> >
> > From: Chuanhua Han <[email protected]>
> >
> > On an embedded system like Android, more than half of anon memory is actually
> > in swap devices such as zRAM. For example, while an app is switched to back-
> > ground, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> > to those contiguous swaps which were likely swapped out from mTHP as a whole.
> >
> > On the other hand, the current implementation only covers the SWAP_SYNCHRONOUS
> > case. It doesn't support swapin_readahead as large folios yet.
> >
> > Right now, we are re-faulting large folios which are still in swapcache as a
> > whole, this can effectively decrease extra loops and early-exitings which we
> > have increased in arch_swap_restore() while supporting MTE restore for folios
> > rather than page.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
> > 1 file changed, 94 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f61a48929ba7..928b3f542932 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
> > static vm_fault_t do_fault(struct vm_fault *vmf);
> > static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
> > static bool vmf_pte_changed(struct vm_fault *vmf);
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > + bool (*pte_range_check)(pte_t *, int));
> >
> > /*
> > * Return true if the original pte was a uffd-wp pte marker (so the pte was
> > @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> > return VM_FAULT_SIGBUS;
> > }
> >
> > +static bool pte_range_swap(pte_t *pte, int nr_pages)
> > +{
> > + int i;
> > + swp_entry_t entry;
> > + unsigned type;
> > + pgoff_t start_offset;
> > +
> > + entry = pte_to_swp_entry(ptep_get_lockless(pte));
> > + if (non_swap_entry(entry))
> > + return false;
> > + start_offset = swp_offset(entry);
> > + if (start_offset % nr_pages)
> > + return false;
> > +
> > + type = swp_type(entry);
> > + for (i = 1; i < nr_pages; i++) {
> > + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> > + if (non_swap_entry(entry))
> > + return false;
> > + if (swp_offset(entry) != start_offset + i)
> > + return false;
> > + if (swp_type(entry) != type)
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +
> > /*
> > * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > + int nr_pages = 1;
> > + unsigned long start_address;
> > + pte_t *start_pte;
> >
> > if (!pte_unmap_same(vmf))
> > goto out;
> > @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > __swap_count(entry) == 1) {
> > /* skip swapcache */
> > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > - vma, vmf->address, false);
> > + folio = alloc_anon_folio(vmf, pte_range_swap);
> > page = &folio->page;
> > if (folio) {
> > __folio_set_locked(folio);
> > __folio_set_swapbacked(folio);
> >
> > + if (folio_test_large(folio)) {
> > + unsigned long start_offset;
> > +
> > + nr_pages = folio_nr_pages(folio);
> > + start_offset = swp_offset(entry) & ~(nr_pages - 1);
> > + entry = swp_entry(swp_type(entry), start_offset);
> > + }
> > +
> > if (mem_cgroup_swapin_charge_folio(folio,
> > vma->vm_mm, GFP_KERNEL,
> > entry)) {
> > @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> > &vmf->ptl);
> > +
> > + start_address = vmf->address;
> > + start_pte = vmf->pte;
> > + if (folio_test_large(folio)) {
> > + unsigned long nr = folio_nr_pages(folio);
> > + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> > + pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
>
> I forgot about one comment here.
> Please change the variable name to something other than "pte_t"; it is
> a bit strange to use the typedef name as a variable name here.
>

Makes sense!

> Chris

Thanks
Barry

2024-02-27 12:26:28

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

Hi Barry,

I've scanned through this patch as part of trying to understand the races you
have reported (It's going to take me a while to fully understand it all :) ). In
the meantime I have a few comments on this patch...

On 18/01/2024 11:10, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> supported swapping large folios out as a whole for vmscan case. This patch
> extends the feature to madvise.
>
> If madvised range covers the whole large folio, we don't split it. Otherwise,
> we still need to split it.
>
> This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> mapped to a large folio.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/asm-generic/tlb.h | 10 +++++++
> include/linux/pgtable.h | 60 +++++++++++++++++++++++++++++++++++++++
> mm/madvise.c | 48 +++++++++++++++++++++++++++++++
> 3 files changed, 118 insertions(+)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 129a3a759976..f894e22da5d6 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> __tlb_remove_tlb_entry(tlb, ptep, address); \
> } while (0)
>
> +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr) \
> + do { \
> + int i; \
> + tlb_flush_pte_range(tlb, address, \
> + PAGE_SIZE * nr); \
> + for (i = 0; i < nr; i++) \
> + __tlb_remove_tlb_entry(tlb, ptep + i, \
> + address + i * PAGE_SIZE); \
> + } while (0)

David has recently added tlb_remove_tlb_entries() which does the same thing.

> +
> #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
> do { \
> unsigned long _sz = huge_page_size(h); \
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 37fe83b0c358..da0c1cf447e3 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> }
> #endif
>
> +#ifndef pte_range_cont_mapped
> +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> + pte_t *start_pte,
> + unsigned long start_addr,
> + int nr)
> +{
> + int i;
> + pte_t pte_val;
> +
> + for (i = 0; i < nr; i++) {
> + pte_val = ptep_get(start_pte + i);
> +
> + if (pte_none(pte_val))
> + return false;
> +
> + if (pte_pfn(pte_val) != (start_pfn + i))
> + return false;
> + }
> +
> + return true;
> +}
> +#endif

David has recently added folio_pte_batch() which does a similar thing (as
discussed in other context).

> +
> +#ifndef pte_range_young
> +static inline bool pte_range_young(pte_t *start_pte, int nr)
> +{
> + int i;
> +
> + for (i = 0; i < nr; i++)
> + if (pte_young(ptep_get(start_pte + i)))
> + return true;
> +
> + return false;
> +}
> +#endif

I wonder if this should come from folio_pte_batch()?

> +
> #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long address,
> @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> }
> #endif
>
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> + unsigned long start_addr,
> + pte_t *start_pte,
> + int nr, int full)
> +{
> + int i;
> + pte_t pte;
> +
> + pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> +
> + for (i = 1; i < nr; i++)
> + ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> + start_pte + i, full);
> +
> + return pte;
> +}

David has recently added get_and_clear_full_ptes(). Your version isn't gathering
access/dirty, which may be ok for your case, but not ok in general.
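
i.e. something along these lines, just to illustrate what "gathering"
means here (sketch only; the real get_and_clear_full_ptes() is the thing
to use rather than open-coding this):

static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
						  unsigned long start_addr,
						  pte_t *start_pte,
						  int nr, int full)
{
	pte_t pte, tmp;
	int i;

	pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);

	for (i = 1; i < nr; i++) {
		tmp = ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
					      start_pte + i, full);
		/* fold every PTE's access/dirty state into the returned pte */
		if (pte_dirty(tmp))
			pte = pte_mkdirty(pte);
		if (pte_young(tmp))
			pte = pte_mkyoung(pte);
	}

	return pte;
}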

>
> /*
> * If two threads concurrently fault at the same page, the thread that
> @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> })
> #endif
>
> +#ifndef pte_nr_addr_end
> +#define pte_nr_addr_end(addr, size, end) \
> +({ unsigned long __boundary = ((addr) + size) & (~(size - 1)); \
> + (__boundary - 1 < (end) - 1)? __boundary: (end); \
> +})
> +#endif
> +
> /*
> * When walking page tables, we usually want to skip any p?d_none entries;
> * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 912155a94ed5..262460ac4b2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (folio_test_large(folio)) {
> int err;
>
> + if (!folio_test_pmd_mappable(folio)) {
> + int nr_pages = folio_nr_pages(folio);
> + unsigned long folio_size = PAGE_SIZE * nr_pages;
> + unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;

I doubt it is correct to align down here. Couldn't you be going outside the
bounds that the user supplied?

nit: you've defined folio_size, why not use it here?
nit: double semi-colon.

> + unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> + pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;

I think start_pte could be off the start of the pgtable and into random memory
in some corner cases (and outside the protection of the PTL)? You're assuming
> that the folio is fully and contiguously mapped and correctly aligned. mremap
(and other things) could break that assumption.

> + unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> +
> + if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> + goto split;
> +
> + if (next - addr != folio_size) {
> + goto split;
> + } else {
> + /* Do not interfere with other mappings of this page */
> + if (folio_estimated_sharers(folio) != 1)
> + goto skip;
> +
> + VM_BUG_ON(addr != start_addr || pte != start_pte);
> +
> + if (pte_range_young(start_pte, nr_pages)) {
> + ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> + nr_pages, tlb->fullmm);
> + ptent = pte_mkold(ptent);
> +
> + set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> + tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> + }
> +
> + folio_clear_referenced(folio);
> + folio_test_clear_young(folio);
> + if (pageout) {
> + if (folio_isolate_lru(folio)) {
> + if (folio_test_unevictable(folio))
> + folio_putback_lru(folio);
> + else
> + list_add(&folio->lru, &folio_list);
> + }
> + } else
> + folio_deactivate(folio);
> + }
> +skip:
> + pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> + addr = next - PAGE_SIZE;
> + continue;
> +
> + }
> +split:
> if (folio_estimated_sharers(folio) != 1)
> break;
> if (pageout_anon_only_filter && !folio_test_anon(folio))


2024-02-27 15:02:32

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

On 18/01/2024 11:10, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> supported swapping large folios out as a whole for vmscan case. This patch
> extends the feature to madvise.
>
> If madvised range covers the whole large folio, we don't split it. Otherwise,
> we still need to split it.
>
> This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> mapped to a large folio.

I'm going to rework this patch and integrate it into my series if that's ok with
you?


2024-02-27 18:59:44

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

On Wed, Feb 28, 2024 at 3:40 AM Ryan Roberts <[email protected]> wrote:
>
> On 18/01/2024 11:10, Barry Song wrote:
> > From: Chuanhua Han <[email protected]>
> >
> > MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> > supported swapping large folios out as a whole for vmscan case. This patch
> > extends the feature to madvise.
> >
> > If madvised range covers the whole large folio, we don't split it. Otherwise,
> > we still need to split it.
> >
> > This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> > helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> > mapped to a large folio.
>
> I'm going to rework this patch and integrate it into my series if that's ok with
> you?

This is perfect. Please integrate it into your swap-out series which is the
perfect place for this MADV_PAGEOUT.

Thanks
Barry

2024-02-27 22:40:08

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

On Wed, Feb 28, 2024 at 1:22 AM Ryan Roberts <[email protected]> wrote:
>
> Hi Barry,
>
> I've scanned through this patch as part of trying to understand the races you
> have reported (It's going to take me a while to fully understand it all :) ). In
> the meantime I have a few comments on this patch...
>
> On 18/01/2024 11:10, Barry Song wrote:
> > From: Chuanhua Han <[email protected]>
> >
> > MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> > supported swapping large folios out as a whole for vmscan case. This patch
> > extends the feature to madvise.
> >
> > If madvised range covers the whole large folio, we don't split it. Otherwise,
> > we still need to split it.
> >
> > This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> > helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> > mapped to a large folio.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/asm-generic/tlb.h | 10 +++++++
> > include/linux/pgtable.h | 60 +++++++++++++++++++++++++++++++++++++++
> > mm/madvise.c | 48 +++++++++++++++++++++++++++++++
> > 3 files changed, 118 insertions(+)
> >
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index 129a3a759976..f894e22da5d6 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> > __tlb_remove_tlb_entry(tlb, ptep, address); \
> > } while (0)
> >
> > +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr) \
> > + do { \
> > + int i; \
> > + tlb_flush_pte_range(tlb, address, \
> > + PAGE_SIZE * nr); \
> > + for (i = 0; i < nr; i++) \
> > + __tlb_remove_tlb_entry(tlb, ptep + i, \
> > + address + i * PAGE_SIZE); \
> > + } while (0)
>
> David has recently added tlb_remove_tlb_entries() which does the same thing.

Cool. When we sent the patchset, it did not depend on any other in-flight
work. Nice to know David's work can help with this case.

>
> > +
> > #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
> > do { \
> > unsigned long _sz = huge_page_size(h); \
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 37fe83b0c358..da0c1cf447e3 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> > }
> > #endif
> >
> > +#ifndef pte_range_cont_mapped
> > +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> > + pte_t *start_pte,
> > + unsigned long start_addr,
> > + int nr)
> > +{
> > + int i;
> > + pte_t pte_val;
> > +
> > + for (i = 0; i < nr; i++) {
> > + pte_val = ptep_get(start_pte + i);
> > +
> > + if (pte_none(pte_val))
> > + return false;
> > +
> > + if (pte_pfn(pte_val) != (start_pfn + i))
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +#endif
>
> David has recently added folio_pte_batch() which does a similar thing (as
> discussed in other context).

yes.

>
> > +
> > +#ifndef pte_range_young
> > +static inline bool pte_range_young(pte_t *start_pte, int nr)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr; i++)
> > + if (pte_young(ptep_get(start_pte + i)))
> > + return true;
> > +
> > + return false;
> > +}
> > +#endif
>
> I wonder if this should come from folio_pte_batch()?

I'm not quite sure folio_pte_batch() can report the young state, but I
guess you already have a batched function to check whether a large folio
is young?

>
> > +
> > #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> > static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > unsigned long address,
> > @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> > }
> > #endif
> >
> > +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> > +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> > + unsigned long start_addr,
> > + pte_t *start_pte,
> > + int nr, int full)
> > +{
> > + int i;
> > + pte_t pte;
> > +
> > + pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> > +
> > + for (i = 1; i < nr; i++)
> > + ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> > + start_pte + i, full);
> > +
> > + return pte;
> > +}
>
> David has recently added get_and_clear_full_ptes(). Your version isn't gathering
> access/dirty, which may be ok for your case, but not ok in general.

ok. glad to know we can use get_and_clear_full_ptes().

>
> >
> > /*
> > * If two threads concurrently fault at the same page, the thread that
> > @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > })
> > #endif
> >
> > +#ifndef pte_nr_addr_end
> > +#define pte_nr_addr_end(addr, size, end) \
> > +({ unsigned long __boundary = ((addr) + size) & (~(size - 1)); \
> > + (__boundary - 1 < (end) - 1)? __boundary: (end); \
> > +})
> > +#endif
> > +
> > /*
> > * When walking page tables, we usually want to skip any p?d_none entries;
> > * and any p?d_bad entries - reporting the error before resetting to none.
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..262460ac4b2e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > if (folio_test_large(folio)) {
> > int err;
> >
> > + if (!folio_test_pmd_mappable(folio)) {
> > + int nr_pages = folio_nr_pages(folio);
> > + unsigned long folio_size = PAGE_SIZE * nr_pages;
> > + unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;
>
> I doubt it is correct to align down here. Couldn't you be going outside the
> bounds that the user supplied?

Yes, it can. This is ugly and suspicious, but it does not cause problems
as long as the large folio's virtual address is aligned; it is wrong if
the virtual address is not aligned, as explained below.

>
> nit: you've defined folio_size, why not use it here?
> nit: double semi-colon.
>
> > + unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> > + pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
>
> I think start_pte could be off the start of the pgtable and into random memory

> in some corner cases (and outside the protection of the PTL)? You're assuming
> that the folio is fully and contiguously mapped and correctly aligned. mremap
> (and other things) could break that assumption.

Actually we don't run under the assumption that the folio is fully and
contiguously mapped, but the code does assume that a large folio's
virtual address is aligned to nr_pages * PAGE_SIZE.

OTOH, we have the "if (next - addr != folio_size)" check to split folios
if users only want to partially reclaim a large folio, but I do agree we
should move that check before pte_range_cont_mapped().

As long as the virtual address is aligned, pte_range_cont_mapped() won't
cause a problem even before the "if (next - addr != folio_size)" check
(though it is ugly and suspicious), since it is still under the
protection of the PTL: we never cross a PMD for a pte-mapped large
folio.

But you are right: cases like mremap can remap an aligned large folio to
an unaligned address. I placed a tracepoint in the kernel and ran lots
of phones, and never saw this case happen, so I feel such mremaps are
really rare. Would it be acceptable to split large folios, and avoid the
complexity, if we are remapping to an unaligned address?

And the code is completely wrong if the large folio is unaligned; we
have to drop that assumption if this can really happen, so we shouldn't
do the ALIGN_DOWN.
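
One simple option (untested sketch) would be to require natural
alignment and fall back to the split path otherwise, instead of doing
the ALIGN_DOWN; unaligned (e.g. mremap'ed) folios would then simply be
split as before:

	/* only take the batched path for naturally aligned mappings */
	if (!IS_ALIGNED(addr, folio_size) || next - addr != folio_size)
		goto split;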

>
> > + unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> > +
> > + if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> > + goto split;
> > +
> > + if (next - addr != folio_size) {
> > + goto split;
> > + } else {
> > + /* Do not interfere with other mappings of this page */
> > + if (folio_estimated_sharers(folio) != 1)
> > + goto skip;
> > +
> > + VM_BUG_ON(addr != start_addr || pte != start_pte);
> > +
> > + if (pte_range_young(start_pte, nr_pages)) {
> > + ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> > + nr_pages, tlb->fullmm);
> > + ptent = pte_mkold(ptent);
> > +
> > + set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> > + tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> > + }
> > +
> > + folio_clear_referenced(folio);
> > + folio_test_clear_young(folio);
> > + if (pageout) {
> > + if (folio_isolate_lru(folio)) {
> > + if (folio_test_unevictable(folio))
> > + folio_putback_lru(folio);
> > + else
> > + list_add(&folio->lru, &folio_list);
> > + }
> > + } else
> > + folio_deactivate(folio);
> > + }
> > +skip:
> > + pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> > + addr = next - PAGE_SIZE;
> > + continue;
> > +
> > + }
> > +split:
> > if (folio_estimated_sharers(folio) != 1)
> > break;
> > if (pageout_anon_only_filter && !folio_test_anon(folio))

Thanks
Barry

2024-02-28 03:49:42

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

>> I'm going to rework this patch and integrate it into my series if that's ok with
>> you?
>
> This is perfect. Please integrate it into your swap-out series which is the
> perfect place for this MADV_PAGEOUT.

BTW, Ryan, while you integrate this into your swap-put series, can you also
add the below one which is addressing one comment of Chris,

From: Barry Song <[email protected]>
Date: Tue, 27 Feb 2024 22:03:59 +1300
Subject: [PATCH] mm: madvise: extract common function
folio_deactivate_or_add_to_reclaim_list

For madvise_cold_or_pageout_pte_range(), the pmd-mapped and pte-mapped
normal folio paths currently duplicate the same code, and we may get
more users of it, such as pte-mapped large folios. It is better to
extract a common function.

Cc: Chris Li <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/madvise.c | 52 ++++++++++++++++++++--------------------------------
1 file changed, 20 insertions(+), 32 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 44a498c94158..1812457144ea 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -321,6 +321,24 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
file_permission(vma->vm_file, MAY_WRITE) == 0;
}

+static inline void folio_deactivate_or_add_to_reclaim_list(struct folio *folio, bool pageout,
+ struct list_head *folio_list)
+{
+ folio_clear_referenced(folio);
+ folio_test_clear_young(folio);
+
+ if (folio_test_active(folio))
+ folio_set_workingset(folio);
+ if (!pageout)
+ return folio_deactivate(folio);
+ if (folio_isolate_lru(folio)) {
+ if (folio_test_unevictable(folio))
+ folio_putback_lru(folio);
+ else
+ list_add(&folio->lru, folio_list);
+ }
+}
+
static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
@@ -394,19 +412,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
}

- folio_clear_referenced(folio);
- folio_test_clear_young(folio);
- if (folio_test_active(folio))
- folio_set_workingset(folio);
- if (pageout) {
- if (folio_isolate_lru(folio)) {
- if (folio_test_unevictable(folio))
- folio_putback_lru(folio);
- else
- list_add(&folio->lru, &folio_list);
- }
- } else
- folio_deactivate(folio);
+ folio_deactivate_or_add_to_reclaim_list(folio, pageout, &folio_list);
huge_unlock:
spin_unlock(ptl);
if (pageout)
@@ -498,25 +504,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
tlb_remove_tlb_entry(tlb, pte, addr);
}

- /*
- * We are deactivating a folio for accelerating reclaiming.
- * VM couldn't reclaim the folio unless we clear PG_young.
- * As a side effect, it makes confuse idle-page tracking
- * because they will miss recent referenced history.
- */
- folio_clear_referenced(folio);
- folio_test_clear_young(folio);
- if (folio_test_active(folio))
- folio_set_workingset(folio);
- if (pageout) {
- if (folio_isolate_lru(folio)) {
- if (folio_test_unevictable(folio))
- folio_putback_lru(folio);
- else
- list_add(&folio->lru, &folio_list);
- }
- } else
- folio_deactivate(folio);
+ folio_deactivate_or_add_to_reclaim_list(folio, pageout, &folio_list);
}

if (start_pte) {
--
2.34.1

Thanks
Barry

2024-04-06 23:27:47

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On Mon, Jan 29, 2024 at 11:07 PM David Hildenbrand <[email protected]> wrote:
>
> On 29.01.24 04:25, Chris Li wrote:
> > Hi David and Barry,
> >
> > On Mon, Jan 22, 2024 at 10:49 PM Barry Song <[email protected]> wrote:
> >>
> >>>
> >>>
> >>> I have on my todo list to move all that !anon handling out of
> >>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
> >>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> >>> then (-> whole new folio exclusive).
> >>>
> >>> That's the cleaner approach.
> >>>
> >>
> >> one tricky thing is that sometimes it is hard to know who is the first
> >> one to add rmap and thus should
> >> call folio_add_new_anon_rmap.
> >> especially when we want to support swapin_readahead(), the one who
> >> allocated the large folio might not
> >> be that one who firstly does rmap.
> >
> > I think Barry has a point. Two tasks might race to swap in the folio
> > then race to perform the rmap.
> > folio_add_new_anon_rmap() should only be called on a folio that is absolutely
> > "new", not shared. The sharing in swap cache disqualifies that
> > condition.
>
> We have to hold the folio lock. So only one task at a time might do the
> folio_add_anon_rmap_ptes() right now, and the
> folio_add_new_shared_anon_rmap() in the future [below].
>
> Also observe how folio_add_anon_rmap_ptes() states that one must hold
> the page lock, because otherwise this would all be completely racy.
>
> From the pte swp exclusive flags, we know for sure whether we are
> dealing with exclusive vs. shared. I think patch #6 does not properly
> check that all entries are actually the same in that regard (all
> exclusive vs all shared). That likely needs fixing.
>
> [I have converting per-page PageAnonExclusive flags to a single
> per-folio flag on my todo list. I suspect that we'll keep the
> per-swp-pte exlusive bits, but the question is rather what we can
> actually make work, because swap and migration just make it much more
> complicated. Anyhow, future work]
>
> >
> >> is it an acceptable way to do the below in do_swap_page?
> >> if (!folio_test_anon(folio))
> >> folio_add_new_anon_rmap()
> >> else
> >> folio_add_anon_rmap_ptes()
> >
> > I am curious to know the answer as well.
>
>
> Yes, the end code should likely be something like:
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> folio_add_new_anon_rmap(folio, vma, vmf->address);
> folio_add_lru_vma(folio, vma);
> } else if (folio_test_anon(folio)) {
> folio_add_anon_rmap_ptes(rmap_flags)
> } else {
> folio_add_new_anon_rmap(rmap_flags)
> }
>
> Maybe we want to avoid teaching all existing folio_add_new_anon_rmap()
> callers about a new flag, and just have a new
> folio_add_new_shared_anon_rmap() instead. TBD.

right.

We need to clarify that the new anon_folio might not necessarily be exclusive.
Unlike folio_add_new_anon_rmap, which assumes the new folio is exclusive,
folio_add_anon_rmap_ptes is capable of handling both exclusive and
non-exclusive new anon folios.

The code would be like:

if (unlikely(folio != swapcache && swapcache)) {
	folio_add_new_anon_rmap(folio, vma, vmf->address);
	folio_add_lru_vma(folio, vma);
} else if (!folio_test_anon(folio)) {
	folio_add_anon_rmap_ptes(rmap_flags);
} else {
	if (exclusive)
		folio_add_new_anon_rmap();
	else
		folio_add_new_shared_anon_rmap();
}

It appears a bit lengthy?

>
> >
> > BTW, that test might have a race as well. By the time the task got
> > !anon result, this result might get changed by another task. We need
> > to make sure in the caller context this race can't happen. Otherwise
> > we can't do the above safely.
> Again, folio lock. Observe the folio_lock_or_retry() call that covers
> our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.
>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry