2024-05-24 17:17:39

by Chris Li

[permalink] [raw]
Subject: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

This is the short term solutiolns "swap cluster order" listed
in my "Swap Abstraction" discussion slice 8 in the recent
LSF/MM conference.

When commit 845982eb264bc "mm: swap: allow storage of all mTHP
orders" is introduced, it only allocates the mTHP swap entries
from new empty cluster list. That works well for PMD size THP,
but it has a serius fragmentation issue reported by Barry.

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/

The mTHP allocation failure rate raises to almost 100% after a few
hours in Barry's test run.

The reason is that all the empty cluster has been exhausted while
there are planty of free swap entries to in the cluster that is
not 100% free.

Address this by remember the swap allocation order in the cluster.
Keep track of the per order non full cluster list for later allocation.

This greatly improve the sucess rate of the mTHP swap allocation.
While I am still waiting for Barry's test result. I paste Kairui's test
result here:

I'm able to reproduce such an issue with a simple script (enabling all order of mthp):

modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0

rmdir /sys/fs/cgroup/benchmark
mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark
echo 8G > memory.max
echo $$ > cgroup.procs

memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &

/usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
--ratio 1:0 --pipeline 8 -d 1024

Before:
Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
After:
Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74

And the fallback ratio dropped by a lot:
Before:
hugepages-32kB/stats/anon_swpout_fallback:15997
hugepages-32kB/stats/anon_swpout:18712
hugepages-512kB/stats/anon_swpout_fallback:192
hugepages-512kB/stats/anon_swpout:0
hugepages-2048kB/stats/anon_swpout_fallback:2
hugepages-2048kB/stats/anon_swpout:0
hugepages-1024kB/stats/anon_swpout_fallback:0
hugepages-1024kB/stats/anon_swpout:0
hugepages-64kB/stats/anon_swpout_fallback:18246
hugepages-64kB/stats/anon_swpout:17644
hugepages-16kB/stats/anon_swpout_fallback:13701
hugepages-16kB/stats/anon_swpout:18234
hugepages-256kB/stats/anon_swpout_fallback:8642
hugepages-256kB/stats/anon_swpout:93
hugepages-128kB/stats/anon_swpout_fallback:21497
hugepages-128kB/stats/anon_swpout:7596

(Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)

After:
hugepages-32kB/stats/swpout:34445
hugepages-32kB/stats/swpout_fallback:0
hugepages-512kB/stats/swpout:1
hugepages-512kB/stats/swpout_fallback:134
hugepages-2048kB/stats/swpout:1
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:6
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35495
hugepages-64kB/stats/swpout_fallback:0
hugepages-16kB/stats/swpout:32441
hugepages-16kB/stats/swpout_fallback:0
hugepages-256kB/stats/swpout:2223
hugepages-256kB/stats/swpout_fallback:6278
hugepages-128kB/stats/swpout:29136
hugepages-128kB/stats/swpout_fallback:52

Reported-by: Barry Song <[email protected]>
Tested-by: Kairui Song <[email protected]>
Signed-off-by: Chris Li <[email protected]>
---
Chris Li (2):
mm: swap: swap cluster switch to double link list
mm: swap: mTHP allocate swap entries from nonfull list

include/linux/swap.h | 18 ++--
mm/swapfile.c | 252 +++++++++++++++++----------------------------------
2 files changed, 93 insertions(+), 177 deletions(-)
---
base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
change-id: 20240523-swap-allocator-1534c480ece4

Best regards,
--
Chris Li <[email protected]>



2024-05-24 17:17:51

by Chris Li

[permalink] [raw]
Subject: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list

Track the nonfull cluster as well as the empty cluster
on lists. Each order has one nonfull cluster list.

The cluster will remember which order it was used during
new cluster allocation.

When the cluster has free entry, add to the nonfull[order]
list.  When the free cluster list is empty, also allocate
from the nonempty list of that order.

This improves the mTHP swap allocation success rate.

There are limitations if the distribution of numbers of
different orders of mTHP changes a lot. e.g. there are a lot
of nonfull cluster assign to order A while later time there
are a lot of order B allocation while very little allocation
in order A. Currently the cluster used by order A will not
reused by order B unless the cluster is 100% empty.

This situation is best addressed by the longer term "swap
buddy allocator", in future patches.
---
include/linux/swap.h | 4 ++++
mm/swapfile.c | 25 +++++++++++++++++++++++--
2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0d3906eff3c9..1b7f0794b9bf 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -255,10 +255,12 @@ struct swap_cluster_info {
* cluster
*/
unsigned int count:16;
+ unsigned int order:8;
unsigned int flags:8;
struct list_head next;
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
+#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */


/*
@@ -297,6 +299,8 @@ struct swap_info_struct {
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
+ struct list_head nonfull_clusters[SWAP_NR_ORDERS];
+ /* list of cluster that contains at least one free slot */
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 205a60c5f9cb..51923aba500e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,

static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
+ if (ci->flags & CLUSTER_FLAG_NONFULL)
+ list_move_tail(&ci->next, &si->free_clusters);
+ else
+ list_add_tail(&ci->next, &si->free_clusters);
ci->flags = CLUSTER_FLAG_FREE;
- list_add_tail(&ci->next, &si->free_clusters);
}

/*
@@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
ci->count--;

if (!ci->count)
- free_cluster(p, ci);
+ return free_cluster(p, ci);
+
+ if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
+ list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]);
+ ci->flags |= CLUSTER_FLAG_NONFULL;
+ }
}

/*
@@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
list_del(&ci->next);
spin_lock(&ci->lock);
+ ci->order = order;
+ ci->flags = 0;
+ spin_unlock(&ci->lock);
+ tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
+ } else if (!list_empty(&si->nonfull_clusters[order])) {
+ ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next);
+ list_del(&ci->next);
+ spin_lock(&ci->lock);
ci->flags = 0;
spin_unlock(&ci->lock);
tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
@@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
break;
tmp += nr_pages;
}
+ WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
unlock_cluster(ci);
}
if (tmp >= max) {
@@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
ci->count = 0;
+ ci->order = 0;
ci->flags = 0;
free_cluster(si, ci);
unlock_cluster(ci);
@@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
INIT_LIST_HEAD(&p->free_clusters);
INIT_LIST_HEAD(&p->discard_clusters);

+ for (i = 0; i < SWAP_NR_ORDERS; i++)
+ INIT_LIST_HEAD(&p->nonfull_clusters[i]);
+
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
if (page_nr == 0 || page_nr > swap_header->info.last_page)

--
2.45.1.288.g0e0cd299f1-goog


2024-05-24 17:17:55

by Chris Li

[permalink] [raw]
Subject: [PATCH 1/2] mm: swap: swap cluster switch to double link list

Previously, the swap cluster used a cluster index as a pointer
to construct a custom single link list type "swap_cluster_list".
The next cluster pointer is shared with the cluster->count.
The assumption is that only the free cluster needs to be put
on the list.

That assumption is not true for mTHP allocators any more. Need
to track the non full cluster on the list as well.  Move the
current cluster single link list into standard double link list.

Remove the cluster getter/setter for accessing the cluster
struct member.  Move the cluster locking in the caller function
rather than the getter/setter function. That way the locking can
protect more than one member, e.g. cluster->flag.

Change cluster code to use "struct swap_cluster_info *" to
reference the cluster rather than by using index. That is more
consistent with the list manipulation. It avoids the repeat
adding index to the cluser_info. The code is easier to understand.

Remove the cluster next pointer is NULL flag, the double link
list can handle the empty list pretty well.

The "swap_cluster_info" struct is two pointer bigger, because
512 swap entries share one swap struct, it has very little impact
on the average memory usage per swap entry.  Other than the list
conversion, there is no real function change in this patch.
---
include/linux/swap.h | 14 ++--
mm/swapfile.c | 231 ++++++++++++++-------------------------------------
2 files changed, 68 insertions(+), 177 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 11c53692f65f..0d3906eff3c9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,11 +254,12 @@ struct swap_cluster_info {
* elements correspond to the swap
* cluster
*/
- unsigned int data:24;
+ unsigned int count:16;
unsigned int flags:8;
+ struct list_head next;
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+

/*
* The first page in the swap file is the swap header, which is always marked
@@ -283,11 +284,6 @@ struct percpu_cluster {
unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
};

-struct swap_cluster_list {
- struct swap_cluster_info head;
- struct swap_cluster_info tail;
-};
-
/*
* The in-memory structure used to track swap areas.
*/
@@ -300,7 +296,7 @@ struct swap_info_struct {
unsigned int max; /* extent of the swap_map */
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
- struct swap_cluster_list free_clusters; /* free clusters list */
+ struct list_head free_clusters; /* free clusters list */
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
@@ -333,7 +329,7 @@ struct swap_info_struct {
* list.
*/
struct work_struct discard_work; /* discard worker */
- struct swap_cluster_list discard_clusters; /* discard clusters list */
+ struct list_head discard_clusters; /* discard clusters list */
struct plist_node avail_lists[]; /*
* entries in swap_avail_heads, one
* entry per node.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4f0e8b2ac8aa..205a60c5f9cb 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
#endif
#define LATENCY_LIMIT 256

-static inline void cluster_set_flag(struct swap_cluster_info *info,
- unsigned int flag)
-{
- info->flags = flag;
-}
-
-static inline unsigned int cluster_count(struct swap_cluster_info *info)
-{
- return info->data;
-}
-
-static inline void cluster_set_count(struct swap_cluster_info *info,
- unsigned int c)
-{
- info->data = c;
-}
-
-static inline void cluster_set_count_flag(struct swap_cluster_info *info,
- unsigned int c, unsigned int f)
-{
- info->flags = f;
- info->data = c;
-}
-
-static inline unsigned int cluster_next(struct swap_cluster_info *info)
-{
- return info->data;
-}
-
-static inline void cluster_set_next(struct swap_cluster_info *info,
- unsigned int n)
-{
- info->data = n;
-}
-
-static inline void cluster_set_next_flag(struct swap_cluster_info *info,
- unsigned int n, unsigned int f)
-{
- info->flags = f;
- info->data = n;
-}
-
static inline bool cluster_is_free(struct swap_cluster_info *info)
{
return info->flags & CLUSTER_FLAG_FREE;
}

-static inline bool cluster_is_null(struct swap_cluster_info *info)
-{
- return info->flags & CLUSTER_FLAG_NEXT_NULL;
-}
-
-static inline void cluster_set_null(struct swap_cluster_info *info)
-{
- info->flags = CLUSTER_FLAG_NEXT_NULL;
- info->data = 0;
-}
-
static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
unsigned long offset)
{
@@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
spin_unlock(&si->lock);
}

-static inline bool cluster_list_empty(struct swap_cluster_list *list)
-{
- return cluster_is_null(&list->head);
-}
-
-static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
-{
- return cluster_next(&list->head);
-}
-
-static void cluster_list_init(struct swap_cluster_list *list)
-{
- cluster_set_null(&list->head);
- cluster_set_null(&list->tail);
-}
-
-static void cluster_list_add_tail(struct swap_cluster_list *list,
- struct swap_cluster_info *ci,
- unsigned int idx)
-{
- if (cluster_list_empty(list)) {
- cluster_set_next_flag(&list->head, idx, 0);
- cluster_set_next_flag(&list->tail, idx, 0);
- } else {
- struct swap_cluster_info *ci_tail;
- unsigned int tail = cluster_next(&list->tail);
-
- /*
- * Nested cluster lock, but both cluster locks are
- * only acquired when we held swap_info_struct->lock
- */
- ci_tail = ci + tail;
- spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
- cluster_set_next(ci_tail, idx);
- spin_unlock(&ci_tail->lock);
- cluster_set_next_flag(&list->tail, idx, 0);
- }
-}
-
-static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
- struct swap_cluster_info *ci)
-{
- unsigned int idx;
-
- idx = cluster_next(&list->head);
- if (cluster_next(&list->tail) == idx) {
- cluster_set_null(&list->head);
- cluster_set_null(&list->tail);
- } else
- cluster_set_next_flag(&list->head,
- cluster_next(&ci[idx]), 0);
-
- return idx;
-}
-
/* Add a cluster to discard list and schedule it to do discard */
static void swap_cluster_schedule_discard(struct swap_info_struct *si,
- unsigned int idx)
+ struct swap_cluster_info *ci)
{
+ unsigned int idx = ci - si->cluster_info;
/*
* If scan_swap_map_slots() can't find a free cluster, it will check
* si->swap_map directly. To make sure the discarding cluster isn't
@@ -462,17 +355,16 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);

- cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
-
+ spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING);
+ list_add_tail(&ci->next, &si->discard_clusters);
+ spin_unlock(&ci->lock);
schedule_work(&si->discard_work);
}

-static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
- struct swap_cluster_info *ci = si->cluster_info;
-
- cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
- cluster_list_add_tail(&si->free_clusters, ci, idx);
+ ci->flags = CLUSTER_FLAG_FREE;
+ list_add_tail(&ci->next, &si->free_clusters);
}

/*
@@ -481,21 +373,21 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
*/
static void swap_do_scheduled_discard(struct swap_info_struct *si)
{
- struct swap_cluster_info *info, *ci;
+ struct swap_cluster_info *ci;
unsigned int idx;

- info = si->cluster_info;
-
- while (!cluster_list_empty(&si->discard_clusters)) {
- idx = cluster_list_del_first(&si->discard_clusters, info);
+ while (!list_empty(&si->discard_clusters)) {
+ ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, next);
+ idx = ci - si->cluster_info;
spin_unlock(&si->lock);

discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
SWAPFILE_CLUSTER);

spin_lock(&si->lock);
- ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
- __free_cluster(si, idx);
+
+ spin_lock(&ci->lock);
+ __free_cluster(si, ci);
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
0, SWAPFILE_CLUSTER);
unlock_cluster(ci);
@@ -521,20 +413,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
complete(&si->comp);
}

-static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
{
- struct swap_cluster_info *ci = si->cluster_info;
+ struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);

- VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
- cluster_list_del_first(&si->free_clusters, ci);
- cluster_set_count_flag(ci + idx, 0, 0);
+ VM_BUG_ON(ci - si->cluster_info != idx);
+ list_del(&ci->next);
+ ci->count = 0;
+ ci->flags = 0;
+ return ci;
}

-static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
- struct swap_cluster_info *ci = si->cluster_info + idx;
-
- VM_BUG_ON(cluster_count(ci) != 0);
+ VM_BUG_ON(ci->count != 0);
/*
* If the swap is discardable, prepare discard the cluster
* instead of free it immediately. The cluster will be freed
@@ -542,11 +434,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
*/
if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
(SWP_WRITEOK | SWP_PAGE_DISCARD)) {
- swap_cluster_schedule_discard(si, idx);
+ swap_cluster_schedule_discard(si, ci);
return;
}

- __free_cluster(si, idx);
+ __free_cluster(si, ci);
}

/*
@@ -559,15 +451,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
unsigned long count)
{
unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+ struct swap_cluster_info *ci = cluster_info + idx;

if (!cluster_info)
return;
- if (cluster_is_free(&cluster_info[idx]))
+ if (cluster_is_free(ci))
alloc_cluster(p, idx);

- VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
- cluster_set_count(&cluster_info[idx],
- cluster_count(&cluster_info[idx]) + count);
+ VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
+ ci->count += count;
}

/*
@@ -581,24 +473,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
}

/*
- * The cluster corresponding to page_nr decreases one usage. If the usage
- * counter becomes 0, which means no page in the cluster is in using, we can
- * optionally discard the cluster and add it to free cluster list.
+ * The cluster ci decreases one usage. If the usage counter becomes 0,
+ * which means no page in the cluster is in using, we can optionally discard
+ * the cluster and add it to free cluster list.
*/
-static void dec_cluster_info_page(struct swap_info_struct *p,
- struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
{
- unsigned long idx = page_nr / SWAPFILE_CLUSTER;
-
- if (!cluster_info)
+ if (!p->cluster_info)
return;

- VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
- cluster_set_count(&cluster_info[idx],
- cluster_count(&cluster_info[idx]) - 1);
+ VM_BUG_ON(ci->count == 0);
+ ci->count--;

- if (cluster_count(&cluster_info[idx]) == 0)
- free_cluster(p, idx);
+ if (!ci->count)
+ free_cluster(p, ci);
}

/*
@@ -611,10 +499,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
{
struct percpu_cluster *percpu_cluster;
bool conflict;
-
+ struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
offset /= SWAPFILE_CLUSTER;
- conflict = !cluster_list_empty(&si->free_clusters) &&
- offset != cluster_list_first(&si->free_clusters) &&
+ conflict = !list_empty(&si->free_clusters) &&
+ offset != first - si->cluster_info &&
cluster_is_free(&si->cluster_info[offset]);

if (!conflict)
@@ -655,10 +543,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
cluster = this_cpu_ptr(si->percpu_cluster);
tmp = cluster->next[order];
if (tmp == SWAP_NEXT_INVALID) {
- if (!cluster_list_empty(&si->free_clusters)) {
- tmp = cluster_next(&si->free_clusters.head) *
- SWAPFILE_CLUSTER;
- } else if (!cluster_list_empty(&si->discard_clusters)) {
+ if (!list_empty(&si->free_clusters)) {
+ ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
+ list_del(&ci->next);
+ spin_lock(&ci->lock);
+ ci->flags = 0;
+ spin_unlock(&ci->lock);
+ tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
+ } else if (!list_empty(&si->discard_clusters)) {
/*
* we don't have free cluster but have some clusters in
* discarding, do discard now and reclaim them, then
@@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
goto new_cluster;
} else
return false;
- }
+ } else
+ ci = si->cluster_info + tmp;

/*
* Other CPUs can use our cluster if they can't find a free cluster,
@@ -1062,8 +955,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)

ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
- cluster_set_count_flag(ci, 0, 0);
- free_cluster(si, idx);
+ ci->count = 0;
+ ci->flags = 0;
+ free_cluster(si, ci);
unlock_cluster(ci);
swap_range_free(si, offset, SWAPFILE_CLUSTER);
}
@@ -1336,7 +1230,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
count = p->swap_map[offset];
VM_BUG_ON(count != SWAP_HAS_CACHE);
p->swap_map[offset] = 0;
- dec_cluster_info_page(p, p->cluster_info, offset);
+ dec_cluster_info_page(p, ci);
unlock_cluster(ci);

mem_cgroup_uncharge_swap(entry, 1);
@@ -2985,8 +2879,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,

nr_good_pages = maxpages - 1; /* omit header page */

- cluster_list_init(&p->free_clusters);
- cluster_list_init(&p->discard_clusters);
+ INIT_LIST_HEAD(&p->free_clusters);
+ INIT_LIST_HEAD(&p->discard_clusters);

for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
@@ -3037,14 +2931,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
j = (k + col) % SWAP_CLUSTER_COLS;
for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
+ struct swap_cluster_info *ci;
idx = i * SWAP_CLUSTER_COLS + j;
+ ci = cluster_info + idx;
if (idx >= nr_clusters)
continue;
- if (cluster_count(&cluster_info[idx]))
+ if (ci->count)
continue;
- cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
- cluster_list_add_tail(&p->free_clusters, cluster_info,
- idx);
+ ci->flags = CLUSTER_FLAG_FREE;
+ list_add_tail(&ci->next, &p->free_clusters);
}
}
return nr_extents;

--
2.45.1.288.g0e0cd299f1-goog


2024-05-28 03:07:25

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Sat, May 25, 2024 at 5:17 AM Chris Li <[email protected]> wrote:
>
> This is the short term solutiolns "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
>
> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" is introduced, it only allocates the mTHP swap entries
> from new empty cluster list. That works well for PMD size THP,
> but it has a serius fragmentation issue reported by Barry.
>
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>
> The mTHP allocation failure rate raises to almost 100% after a few
> hours in Barry's test run.
>
> The reason is that all the empty cluster has been exhausted while
> there are planty of free swap entries to in the cluster that is
> not 100% free.
>
> Address this by remember the swap allocation order in the cluster.
> Keep track of the per order non full cluster list for later allocation.
>
> This greatly improve the sucess rate of the mTHP swap allocation.
> While I am still waiting for Barry's test result. I paste Kairui's test

Hi Chris,

Attached are the test results from a real phone using 4-order mTHP. The results
seem better overall, but after 7 hours, especially when the swap device becomes
full(soon some apps are killed to free memory and swap), the fallback
ratio still
reaches 100%.

I haven't debugged this, but my guess is that the cluster's order can
shift between
4-order and 0-order. Sometimes, they all shift to 0-order, and hardly can they
get back to 4-order.

> result here:
>
> I'm able to reproduce such an issue with a simple script (enabling all order of mthp):
>
> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
> swapoff -a
> mkswap /dev/ram0
> swapon /dev/ram0
>
> rmdir /sys/fs/cgroup/benchmark
> mkdir -p /sys/fs/cgroup/benchmark
> cd /sys/fs/cgroup/benchmark
> echo 8G > memory.max
> echo $$ > cgroup.procs
>
> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
>
> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
> -P memcache_binary -n allkeys --key-minimum=1 \
> --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
> --ratio 1:0 --pipeline 8 -d 1024
>
> Before:
> Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
> After:
> Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74
>
> And the fallback ratio dropped by a lot:
> Before:
> hugepages-32kB/stats/anon_swpout_fallback:15997
> hugepages-32kB/stats/anon_swpout:18712
> hugepages-512kB/stats/anon_swpout_fallback:192
> hugepages-512kB/stats/anon_swpout:0
> hugepages-2048kB/stats/anon_swpout_fallback:2
> hugepages-2048kB/stats/anon_swpout:0
> hugepages-1024kB/stats/anon_swpout_fallback:0
> hugepages-1024kB/stats/anon_swpout:0
> hugepages-64kB/stats/anon_swpout_fallback:18246
> hugepages-64kB/stats/anon_swpout:17644
> hugepages-16kB/stats/anon_swpout_fallback:13701
> hugepages-16kB/stats/anon_swpout:18234
> hugepages-256kB/stats/anon_swpout_fallback:8642
> hugepages-256kB/stats/anon_swpout:93
> hugepages-128kB/stats/anon_swpout_fallback:21497
> hugepages-128kB/stats/anon_swpout:7596
>
> (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)
>
> After:
> hugepages-32kB/stats/swpout:34445
> hugepages-32kB/stats/swpout_fallback:0
> hugepages-512kB/stats/swpout:1
> hugepages-512kB/stats/swpout_fallback:134
> hugepages-2048kB/stats/swpout:1
> hugepages-2048kB/stats/swpout_fallback:1
> hugepages-1024kB/stats/swpout:6
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:35495
> hugepages-64kB/stats/swpout_fallback:0
> hugepages-16kB/stats/swpout:32441
> hugepages-16kB/stats/swpout_fallback:0
> hugepages-256kB/stats/swpout:2223
> hugepages-256kB/stats/swpout_fallback:6278
> hugepages-128kB/stats/swpout:29136
> hugepages-128kB/stats/swpout_fallback:52
>
> Reported-by: Barry Song <[email protected]>
> Tested-by: Kairui Song <[email protected]>
> Signed-off-by: Chris Li <[email protected]>
> ---
> Chris Li (2):
> mm: swap: swap cluster switch to double link list
> mm: swap: mTHP allocate swap entries from nonfull list
>
> include/linux/swap.h | 18 ++--
> mm/swapfile.c | 252 +++++++++++++++++----------------------------------
> 2 files changed, 93 insertions(+), 177 deletions(-)
> ---
> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
> change-id: 20240523-swap-allocator-1534c480ece4
>
> Best regards,
> --
> Chris Li <[email protected]>
>

Thanks
Barry


Attachments:
chris-swap-patch.png (24.52 kB)

2024-05-28 16:24:25

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list

On Sat, May 25, 2024 at 1:17 AM Chris Li <[email protected]> wrote:
>
> Previously, the swap cluster used a cluster index as a pointer
> to construct a custom single link list type "swap_cluster_list".
> The next cluster pointer is shared with the cluster->count.
> The assumption is that only the free cluster needs to be put
> on the list.
>
> That assumption is not true for mTHP allocators any more. Need
> to track the non full cluster on the list as well. Move the
> current cluster single link list into standard double link list.
>
> Remove the cluster getter/setter for accessing the cluster
> struct member. Move the cluster locking in the caller function
> rather than the getter/setter function. That way the locking can
> protect more than one member, e.g. cluster->flag.
>
> Change cluster code to use "struct swap_cluster_info *" to
> reference the cluster rather than by using index. That is more
> consistent with the list manipulation. It avoids the repeat
> adding index to the cluser_info. The code is easier to understand.
>
> Remove the cluster next pointer is NULL flag, the double link
> list can handle the empty list pretty well.
>
> The "swap_cluster_info" struct is two pointer bigger, because
> 512 swap entries share one swap struct, it has very little impact
> on the average memory usage per swap entry. Other than the list
> conversion, there is no real function change in this patch.
> ---
> include/linux/swap.h | 14 ++--
> mm/swapfile.c | 231 ++++++++++++++-------------------------------------
> 2 files changed, 68 insertions(+), 177 deletions(-)
>

Hi Chris,

Thanks for this very nice clean up, the code is much easier to read.

> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..0d3906eff3c9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,11 +254,12 @@ struct swap_cluster_info {
> * elements correspond to the swap
> * cluster
> */
> - unsigned int data:24;
> + unsigned int count:16;
> unsigned int flags:8;
> + struct list_head next;
> };
> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> +
>
> /*
> * The first page in the swap file is the swap header, which is always marked
> @@ -283,11 +284,6 @@ struct percpu_cluster {
> unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> };
>
> -struct swap_cluster_list {
> - struct swap_cluster_info head;
> - struct swap_cluster_info tail;
> -};
> -
> /*
> * The in-memory structure used to track swap areas.
> */
> @@ -300,7 +296,7 @@ struct swap_info_struct {
> unsigned int max; /* extent of the swap_map */
> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> - struct swap_cluster_list free_clusters; /* free clusters list */
> + struct list_head free_clusters; /* free clusters list */
> unsigned int lowest_bit; /* index of first free in swap_map */
> unsigned int highest_bit; /* index of last free in swap_map */
> unsigned int pages; /* total of usable pages of swap */
> @@ -333,7 +329,7 @@ struct swap_info_struct {
> * list.
> */
> struct work_struct discard_work; /* discard worker */
> - struct swap_cluster_list discard_clusters; /* discard clusters list */
> + struct list_head discard_clusters; /* discard clusters list */
> struct plist_node avail_lists[]; /*
> * entries in swap_avail_heads, one
> * entry per node.
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4f0e8b2ac8aa..205a60c5f9cb 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> #endif
> #define LATENCY_LIMIT 256
>
> -static inline void cluster_set_flag(struct swap_cluster_info *info,
> - unsigned int flag)
> -{
> - info->flags = flag;
> -}
> -
> -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> -{
> - return info->data;
> -}
> -
> -static inline void cluster_set_count(struct swap_cluster_info *info,
> - unsigned int c)
> -{
> - info->data = c;
> -}
> -
> -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> - unsigned int c, unsigned int f)
> -{
> - info->flags = f;
> - info->data = c;
> -}
> -
> -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> -{
> - return info->data;
> -}
> -
> -static inline void cluster_set_next(struct swap_cluster_info *info,
> - unsigned int n)
> -{
> - info->data = n;
> -}
> -
> -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> - unsigned int n, unsigned int f)
> -{
> - info->flags = f;
> - info->data = n;
> -}
> -
> static inline bool cluster_is_free(struct swap_cluster_info *info)
> {
> return info->flags & CLUSTER_FLAG_FREE;
> }
>
> -static inline bool cluster_is_null(struct swap_cluster_info *info)
> -{
> - return info->flags & CLUSTER_FLAG_NEXT_NULL;
> -}
> -
> -static inline void cluster_set_null(struct swap_cluster_info *info)
> -{
> - info->flags = CLUSTER_FLAG_NEXT_NULL;
> - info->data = 0;
> -}
> -
> static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> unsigned long offset)
> {
> @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> spin_unlock(&si->lock);
> }
>
> -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> -{
> - return cluster_is_null(&list->head);
> -}
> -
> -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> -{
> - return cluster_next(&list->head);
> -}
> -
> -static void cluster_list_init(struct swap_cluster_list *list)
> -{
> - cluster_set_null(&list->head);
> - cluster_set_null(&list->tail);
> -}
> -
> -static void cluster_list_add_tail(struct swap_cluster_list *list,
> - struct swap_cluster_info *ci,
> - unsigned int idx)
> -{
> - if (cluster_list_empty(list)) {
> - cluster_set_next_flag(&list->head, idx, 0);
> - cluster_set_next_flag(&list->tail, idx, 0);
> - } else {
> - struct swap_cluster_info *ci_tail;
> - unsigned int tail = cluster_next(&list->tail);
> -
> - /*
> - * Nested cluster lock, but both cluster locks are
> - * only acquired when we held swap_info_struct->lock
> - */
> - ci_tail = ci + tail;
> - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
> - cluster_set_next(ci_tail, idx);
> - spin_unlock(&ci_tail->lock);
> - cluster_set_next_flag(&list->tail, idx, 0);
> - }
> -}
> -
> -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> - struct swap_cluster_info *ci)
> -{
> - unsigned int idx;
> -
> - idx = cluster_next(&list->head);
> - if (cluster_next(&list->tail) == idx) {
> - cluster_set_null(&list->head);
> - cluster_set_null(&list->tail);
> - } else
> - cluster_set_next_flag(&list->head,
> - cluster_next(&ci[idx]), 0);
> -
> - return idx;
> -}
> -
> /* Add a cluster to discard list and schedule it to do discard */
> static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> - unsigned int idx)
> + struct swap_cluster_info *ci)
> {
> + unsigned int idx = ci - si->cluster_info;
> /*
> * If scan_swap_map_slots() can't find a free cluster, it will check
> * si->swap_map directly. To make sure the discarding cluster isn't
> @@ -462,17 +355,16 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>
> - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> -
> + spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING);
> + list_add_tail(&ci->next, &si->discard_clusters);
> + spin_unlock(&ci->lock);
> schedule_work(&si->discard_work);
> }
>
> -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> {
> - struct swap_cluster_info *ci = si->cluster_info;
> -
> - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> - cluster_list_add_tail(&si->free_clusters, ci, idx);
> + ci->flags = CLUSTER_FLAG_FREE;
> + list_add_tail(&ci->next, &si->free_clusters);
> }
>
> /*
> @@ -481,21 +373,21 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> */
> static void swap_do_scheduled_discard(struct swap_info_struct *si)
> {
> - struct swap_cluster_info *info, *ci;
> + struct swap_cluster_info *ci;
> unsigned int idx;
>
> - info = si->cluster_info;
> -
> - while (!cluster_list_empty(&si->discard_clusters)) {
> - idx = cluster_list_del_first(&si->discard_clusters, info);
> + while (!list_empty(&si->discard_clusters)) {
> + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, next);
> + idx = ci - si->cluster_info;
> spin_unlock(&si->lock);
>
> discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> SWAPFILE_CLUSTER);
>
> spin_lock(&si->lock);
> - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> - __free_cluster(si, idx);
> +
> + spin_lock(&ci->lock);
> + __free_cluster(si, ci);
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> 0, SWAPFILE_CLUSTER);
> unlock_cluster(ci);
> @@ -521,20 +413,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> complete(&si->comp);
> }
>
> -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> {
> - struct swap_cluster_info *ci = si->cluster_info;
> + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
>
> - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> - cluster_list_del_first(&si->free_clusters, ci);
> - cluster_set_count_flag(ci + idx, 0, 0);
> + VM_BUG_ON(ci - si->cluster_info != idx);
> + list_del(&ci->next);
> + ci->count = 0;
> + ci->flags = 0;
> + return ci;
> }
>
> -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> {
> - struct swap_cluster_info *ci = si->cluster_info + idx;
> -
> - VM_BUG_ON(cluster_count(ci) != 0);
> + VM_BUG_ON(ci->count != 0);
> /*
> * If the swap is discardable, prepare discard the cluster
> * instead of free it immediately. The cluster will be freed
> @@ -542,11 +434,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> */
> if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> - swap_cluster_schedule_discard(si, idx);
> + swap_cluster_schedule_discard(si, ci);
> return;
> }
>
> - __free_cluster(si, idx);
> + __free_cluster(si, ci);
> }
>
> /*
> @@ -559,15 +451,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
> unsigned long count)
> {
> unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> + struct swap_cluster_info *ci = cluster_info + idx;
>
> if (!cluster_info)
> return;
> - if (cluster_is_free(&cluster_info[idx]))
> + if (cluster_is_free(ci))
> alloc_cluster(p, idx);
>
> - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> - cluster_set_count(&cluster_info[idx],
> - cluster_count(&cluster_info[idx]) + count);
> + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> + ci->count += count;
> }
>
> /*
> @@ -581,24 +473,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> }
>
> /*
> - * The cluster corresponding to page_nr decreases one usage. If the usage
> - * counter becomes 0, which means no page in the cluster is in using, we can
> - * optionally discard the cluster and add it to free cluster list.
> + * The cluster ci decreases one usage. If the usage counter becomes 0,
> + * which means no page in the cluster is in using, we can optionally discard
> + * the cluster and add it to free cluster list.
> */
> -static void dec_cluster_info_page(struct swap_info_struct *p,
> - struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> {
> - unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> -
> - if (!cluster_info)
> + if (!p->cluster_info)
> return;
>
> - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> - cluster_set_count(&cluster_info[idx],
> - cluster_count(&cluster_info[idx]) - 1);
> + VM_BUG_ON(ci->count == 0);
> + ci->count--;
>
> - if (cluster_count(&cluster_info[idx]) == 0)
> - free_cluster(p, idx);
> + if (!ci->count)
> + free_cluster(p, ci);
> }
>
> /*
> @@ -611,10 +499,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> {

This whole scan_swap_map_ssd_cluster_conflict function seems not
needed now. free_clusters is a double linked list, so using a cluster
in the middle won't corrupt the list. The comments are still for the
old list design.

> struct percpu_cluster *percpu_cluster;
> bool conflict;
> -
> + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> offset /= SWAPFILE_CLUSTER;
> - conflict = !cluster_list_empty(&si->free_clusters) &&
> - offset != cluster_list_first(&si->free_clusters) &&
> + conflict = !list_empty(&si->free_clusters) &&
> + offset != first - si->cluster_info &&
> cluster_is_free(&si->cluster_info[offset]);
>
> if (!conflict)
> @@ -655,10 +543,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> cluster = this_cpu_ptr(si->percpu_cluster);
> tmp = cluster->next[order];
> if (tmp == SWAP_NEXT_INVALID) {
> - if (!cluster_list_empty(&si->free_clusters)) {
> - tmp = cluster_next(&si->free_clusters.head) *
> - SWAPFILE_CLUSTER;
> - } else if (!cluster_list_empty(&si->discard_clusters)) {
> + if (!list_empty(&si->free_clusters)) {
> + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> + list_del(&ci->next);
> + spin_lock(&ci->lock);

Shouldn't this list_del also be protected by ci->lock? It was
protected in alloc_cluster before, keeping the flag synced with
cluster status so cluster_is_free won't return false positive.

> + ci->flags = 0;
> + spin_unlock(&ci->lock);
> + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> + } else if (!list_empty(&si->discard_clusters)) {
> /*
> * we don't have free cluster but have some clusters in
> * discarding, do discard now and reclaim them, then
> @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> goto new_cluster;
> } else
> return false;
> - }
> + } else
> + ci = si->cluster_info + tmp;

This "else ci = ..." seems wrong, tmp is not an array index, and not
needed either.

>
> /*
> * Other CPUs can use our cluster if they can't find a free cluster,
> @@ -1062,8 +955,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>
> ci = lock_cluster(si, offset);
> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> - cluster_set_count_flag(ci, 0, 0);
> - free_cluster(si, idx);
> + ci->count = 0;
> + ci->flags = 0;
> + free_cluster(si, ci);
> unlock_cluster(ci);
> swap_range_free(si, offset, SWAPFILE_CLUSTER);
> }
> @@ -1336,7 +1230,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> count = p->swap_map[offset];
> VM_BUG_ON(count != SWAP_HAS_CACHE);
> p->swap_map[offset] = 0;
> - dec_cluster_info_page(p, p->cluster_info, offset);
> + dec_cluster_info_page(p, ci);
> unlock_cluster(ci);
>
> mem_cgroup_uncharge_swap(entry, 1);
> @@ -2985,8 +2879,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>
> nr_good_pages = maxpages - 1; /* omit header page */
>
> - cluster_list_init(&p->free_clusters);
> - cluster_list_init(&p->discard_clusters);
> + INIT_LIST_HEAD(&p->free_clusters);
> + INIT_LIST_HEAD(&p->discard_clusters);
>
> for (i = 0; i < swap_header->info.nr_badpages; i++) {
> unsigned int page_nr = swap_header->info.badpages[i];
> @@ -3037,14 +2931,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
> j = (k + col) % SWAP_CLUSTER_COLS;
> for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> + struct swap_cluster_info *ci;
> idx = i * SWAP_CLUSTER_COLS + j;
> + ci = cluster_info + idx;
> if (idx >= nr_clusters)
> continue;
> - if (cluster_count(&cluster_info[idx]))
> + if (ci->count)
> continue;
> - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
> - cluster_list_add_tail(&p->free_clusters, cluster_info,
> - idx);
> + ci->flags = CLUSTER_FLAG_FREE;
> + list_add_tail(&ci->next, &p->free_clusters);
> }
> }
> return nr_extents;
>
> --
> 2.45.1.288.g0e0cd299f1-goog
>
>

2024-05-28 21:04:55

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

I am spinning a new version for this series to address two issues
found in this series:

1) Oppo discovered a bug in the following line:
+ ci = si->cluster_info + tmp;
Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
That is a serious bug but trivial to fix.

2) order 0 allocation currently blindly scans swap_map disregarding
the cluster->order. Given enough order 0 swap allocations(close to the
swap file size) the order 0 allocation head will eventually sweep
across the whole swapfile and destroy other cluster order allocations.

The short term fix is just skipping clusters that are already assigned
to higher orders.

In the long term, I want to unify the non-SSD to use clusters for
locking and allocations as well, just try to follow the last
allocation (less seeking) as much as possible.

Chris



On Fri, May 24, 2024 at 10:17 AM Chris Li <[email protected]> wrote:
>
> This is the short term solutiolns "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
>
> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" is introduced, it only allocates the mTHP swap entries
> from new empty cluster list. That works well for PMD size THP,
> but it has a serius fragmentation issue reported by Barry.
>
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>
> The mTHP allocation failure rate raises to almost 100% after a few
> hours in Barry's test run.
>
> The reason is that all the empty cluster has been exhausted while
> there are planty of free swap entries to in the cluster that is
> not 100% free.
>
> Address this by remember the swap allocation order in the cluster.
> Keep track of the per order non full cluster list for later allocation.
>
> This greatly improve the sucess rate of the mTHP swap allocation.
> While I am still waiting for Barry's test result. I paste Kairui's test
> result here:
>
> I'm able to reproduce such an issue with a simple script (enabling all order of mthp):
>
> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
> swapoff -a
> mkswap /dev/ram0
> swapon /dev/ram0
>
> rmdir /sys/fs/cgroup/benchmark
> mkdir -p /sys/fs/cgroup/benchmark
> cd /sys/fs/cgroup/benchmark
> echo 8G > memory.max
> echo $$ > cgroup.procs
>
> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
>
> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
> -P memcache_binary -n allkeys --key-minimum=1 \
> --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
> --ratio 1:0 --pipeline 8 -d 1024
>
> Before:
> Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
> After:
> Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74
>
> And the fallback ratio dropped by a lot:
> Before:
> hugepages-32kB/stats/anon_swpout_fallback:15997
> hugepages-32kB/stats/anon_swpout:18712
> hugepages-512kB/stats/anon_swpout_fallback:192
> hugepages-512kB/stats/anon_swpout:0
> hugepages-2048kB/stats/anon_swpout_fallback:2
> hugepages-2048kB/stats/anon_swpout:0
> hugepages-1024kB/stats/anon_swpout_fallback:0
> hugepages-1024kB/stats/anon_swpout:0
> hugepages-64kB/stats/anon_swpout_fallback:18246
> hugepages-64kB/stats/anon_swpout:17644
> hugepages-16kB/stats/anon_swpout_fallback:13701
> hugepages-16kB/stats/anon_swpout:18234
> hugepages-256kB/stats/anon_swpout_fallback:8642
> hugepages-256kB/stats/anon_swpout:93
> hugepages-128kB/stats/anon_swpout_fallback:21497
> hugepages-128kB/stats/anon_swpout:7596
>
> (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)
>
> After:
> hugepages-32kB/stats/swpout:34445
> hugepages-32kB/stats/swpout_fallback:0
> hugepages-512kB/stats/swpout:1
> hugepages-512kB/stats/swpout_fallback:134
> hugepages-2048kB/stats/swpout:1
> hugepages-2048kB/stats/swpout_fallback:1
> hugepages-1024kB/stats/swpout:6
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:35495
> hugepages-64kB/stats/swpout_fallback:0
> hugepages-16kB/stats/swpout:32441
> hugepages-16kB/stats/swpout_fallback:0
> hugepages-256kB/stats/swpout:2223
> hugepages-256kB/stats/swpout_fallback:6278
> hugepages-128kB/stats/swpout:29136
> hugepages-128kB/stats/swpout_fallback:52
>
> Reported-by: Barry Song <[email protected]>
> Tested-by: Kairui Song <[email protected]>
> Signed-off-by: Chris Li <[email protected]>
> ---
> Chris Li (2):
> mm: swap: swap cluster switch to double link list
> mm: swap: mTHP allocate swap entries from nonfull list
>
> include/linux/swap.h | 18 ++--
> mm/swapfile.c | 252 +++++++++++++++++----------------------------------
> 2 files changed, 93 insertions(+), 177 deletions(-)
> ---
> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
> change-id: 20240523-swap-allocator-1534c480ece4
>
> Best regards,
> --
> Chris Li <[email protected]>
>

2024-05-28 22:27:43

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list

Hi Kairui,


On Tue, May 28, 2024 at 9:24 AM Kairui Song <[email protected]> wrote:
>
> On Sat, May 25, 2024 at 1:17 AM Chris Li <[email protected]> wrote:
> >
> > Previously, the swap cluster used a cluster index as a pointer
> > to construct a custom single link list type "swap_cluster_list".
> > The next cluster pointer is shared with the cluster->count.
> > The assumption is that only the free cluster needs to be put
> > on the list.
> >
> > That assumption is not true for mTHP allocators any more. Need
> > to track the non full cluster on the list as well. Move the
> > current cluster single link list into standard double link list.
> >
> > Remove the cluster getter/setter for accessing the cluster
> > struct member. Move the cluster locking in the caller function
> > rather than the getter/setter function. That way the locking can
> > protect more than one member, e.g. cluster->flag.
> >
> > Change cluster code to use "struct swap_cluster_info *" to
> > reference the cluster rather than by using index. That is more
> > consistent with the list manipulation. It avoids the repeat
> > adding index to the cluser_info. The code is easier to understand.
> >
> > Remove the cluster next pointer is NULL flag, the double link
> > list can handle the empty list pretty well.
> >
> > The "swap_cluster_info" struct is two pointer bigger, because
> > 512 swap entries share one swap struct, it has very little impact
> > on the average memory usage per swap entry. Other than the list
> > conversion, there is no real function change in this patch.
> > ---
> > include/linux/swap.h | 14 ++--
> > mm/swapfile.c | 231 ++++++++++++++-------------------------------------
> > 2 files changed, 68 insertions(+), 177 deletions(-)
> >
>
> Hi Chris,
>
> Thanks for this very nice clean up, the code is much easier to read.

Thanks for the review.

See my comments below. I am working on a V2 to address the two issues
identified so far.

BTW, I am pretty happy the patch stats have much more deltes than insert.

>
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 11c53692f65f..0d3906eff3c9 100644
> > --- a/include/linux/swap.hm
> > +++ b/include/linux/swap.h
> > @@ -254,11 +254,12 @@ struct swap_cluster_info {
> > * elements correspond to the swap
> > * cluster
> > */
> > - unsigned int data:24;
> > + unsigned int count:16;
> > unsigned int flags:8;
> > + struct list_head next;
> > };
> > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> > +
> >
> > /*
> > * The first page in the swap file is the swap header, which is always marked
> > @@ -283,11 +284,6 @@ struct percpu_cluster {
> > unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > };
> >
> > -struct swap_cluster_list {
> > - struct swap_cluster_info head;
> > - struct swap_cluster_info tail;
> > -};
> > -
> > /*
> > * The in-memory structure used to track swap areas.
> > */
> > @@ -300,7 +296,7 @@ struct swap_info_struct {
> > unsigned int max; /* extent of the swap_map */
> > unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> > - struct swap_cluster_list free_clusters; /* free clusters list */
> > + struct list_head free_clusters; /* free clusters list */
> > unsigned int lowest_bit; /* index of first free in swap_map */
> > unsigned int highest_bit; /* index of last free in swap_map */
> > unsigned int pages; /* total of usable pages of swap */
> > @@ -333,7 +329,7 @@ struct swap_info_struct {
> > * list.
> > */
> > struct work_struct discard_work; /* discard worker */
> > - struct swap_cluster_list discard_clusters; /* discard clusters list */
> > + struct list_head discard_clusters; /* discard clusters list */
> > struct plist_node avail_lists[]; /*
> > * entries in swap_avail_heads, one
> > * entry per node.
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 4f0e8b2ac8aa..205a60c5f9cb 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> > #endif
> > #define LATENCY_LIMIT 256
> >
> > -static inline void cluster_set_flag(struct swap_cluster_info *info,
> > - unsigned int flag)
> > -{
> > - info->flags = flag;
> > -}
> > -
> > -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> > -{
> > - return info->data;
> > -}
> > -
> > -static inline void cluster_set_count(struct swap_cluster_info *info,
> > - unsigned int c)
> > -{
> > - info->data = c;
> > -}
> > -
> > -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> > - unsigned int c, unsigned int f)
> > -{
> > - info->flags = f;
> > - info->data = c;
> > -}
> > -
> > -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> > -{
> > - return info->data;
> > -}
> > -
> > -static inline void cluster_set_next(struct swap_cluster_info *info,
> > - unsigned int n)
> > -{
> > - info->data = n;
> > -}
> > -
> > -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> > - unsigned int n, unsigned int f)
> > -{
> > - info->flags = f;
> > - info->data = n;
> > -}
> > -
> > static inline bool cluster_is_free(struct swap_cluster_info *info)
> > {
> > return info->flags & CLUSTER_FLAG_FREE;
> > }
> >
> > -static inline bool cluster_is_null(struct swap_cluster_info *info)
> > -{
> > - return info->flags & CLUSTER_FLAG_NEXT_NULL;
> > -}
> > -
> > -static inline void cluster_set_null(struct swap_cluster_info *info)
> > -{
> > - info->flags = CLUSTER_FLAG_NEXT_NULL;
> > - info->data = 0;
> > -}
> > -
> > static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> > unsigned long offset)
> > {
> > @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> > spin_unlock(&si->lock);
> > }
> >
> > -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> > -{
> > - return cluster_is_null(&list->head);
> > -}
> > -
> > -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> > -{
> > - return cluster_next(&list->head);
> > -}
> > -
> > -static void cluster_list_init(struct swap_cluster_list *list)
> > -{
> > - cluster_set_null(&list->head);
> > - cluster_set_null(&list->tail);
> > -}
> > -
> > -static void cluster_list_add_tail(struct swap_cluster_list *list,
> > - struct swap_cluster_info *ci,
> > - unsigned int idx)
> > -{
> > - if (cluster_list_empty(list)) {
> > - cluster_set_next_flag(&list->head, idx, 0);
> > - cluster_set_next_flag(&list->tail, idx, 0);
> > - } else {
> > - struct swap_cluster_info *ci_tail;
> > - unsigned int tail = cluster_next(&list->tail);
> > -
> > - /*
> > - * Nested cluster lock, but both cluster locks are
> > - * only acquired when we held swap_info_struct->lock
> > - */
> > - ci_tail = ci + tail;
> > - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
> > - cluster_set_next(ci_tail, idx);
> > - spin_unlock(&ci_tail->lock);
> > - cluster_set_next_flag(&list->tail, idx, 0);
> > - }
> > -}
> > -
> > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> > - struct swap_cluster_info *ci)
> > -{
> > - unsigned int idx;
> > -
> > - idx = cluster_next(&list->head);
> > - if (cluster_next(&list->tail) == idx) {
> > - cluster_set_null(&list->head);
> > - cluster_set_null(&list->tail);
> > - } else
> > - cluster_set_next_flag(&list->head,
> > - cluster_next(&ci[idx]), 0);
> > -
> > - return idx;
> > -}
> > -
> > /* Add a cluster to discard list and schedule it to do discard */
> > static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > - unsigned int idx)
> > + struct swap_cluster_info *ci)
> > {
> > + unsigned int idx = ci - si->cluster_info;
> > /*
> > * If scan_swap_map_slots() can't find a free cluster, it will check
> > * si->swap_map directly. To make sure the discarding cluster isn't
> > @@ -462,17 +355,16 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >
> > - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> > -
> > + spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING);
> > + list_add_tail(&ci->next, &si->discard_clusters);
> > + spin_unlock(&ci->lock);
> > schedule_work(&si->discard_work);
> > }
> >
> > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info;
> > -
> > - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> > - cluster_list_add_tail(&si->free_clusters, ci, idx);
> > + ci->flags = CLUSTER_FLAG_FREE;
> > + list_add_tail(&ci->next, &si->free_clusters);
> > }
> >
> > /*
> > @@ -481,21 +373,21 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> > */
> > static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > {
> > - struct swap_cluster_info *info, *ci;
> > + struct swap_cluster_info *ci;
> > unsigned int idx;
> >
> > - info = si->cluster_info;
> > -
> > - while (!cluster_list_empty(&si->discard_clusters)) {
> > - idx = cluster_list_del_first(&si->discard_clusters, info);
> > + while (!list_empty(&si->discard_clusters)) {
> > + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, next);
> > + idx = ci - si->cluster_info;
> > spin_unlock(&si->lock);
> >
> > discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> > SWAPFILE_CLUSTER);
> >
> > spin_lock(&si->lock);
> > - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> > - __free_cluster(si, idx);
> > +
> > + spin_lock(&ci->lock);
> > + __free_cluster(si, ci);
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > 0, SWAPFILE_CLUSTER);
> > unlock_cluster(ci);
> > @@ -521,20 +413,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> > complete(&si->comp);
> > }
> >
> > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info;
> > + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> >
> > - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> > - cluster_list_del_first(&si->free_clusters, ci);
> > - cluster_set_count_flag(ci + idx, 0, 0);
> > + VM_BUG_ON(ci - si->cluster_info != idx);
> > + list_del(&ci->next);
> > + ci->count = 0;
> > + ci->flags = 0;
> > + return ci;
> > }
> >
> > -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info + idx;
> > -
> > - VM_BUG_ON(cluster_count(ci) != 0);
> > + VM_BUG_ON(ci->count != 0);
> > /*
> > * If the swap is discardable, prepare discard the cluster
> > * instead of free it immediately. The cluster will be freed
> > @@ -542,11 +434,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> > */
> > if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> > (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> > - swap_cluster_schedule_discard(si, idx);
> > + swap_cluster_schedule_discard(si, ci);
> > return;
> > }
> >
> > - __free_cluster(si, idx);
> > + __free_cluster(si, ci);
> > }
> >
> > /*
> > @@ -559,15 +451,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
> > unsigned long count)
> > {
> > unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > + struct swap_cluster_info *ci = cluster_info + idx;
> >
> > if (!cluster_info)
> > return;
> > - if (cluster_is_free(&cluster_info[idx]))
> > + if (cluster_is_free(ci))
> > alloc_cluster(p, idx);
> >
> > - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> > - cluster_set_count(&cluster_info[idx],
> > - cluster_count(&cluster_info[idx]) + count);
> > + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> > + ci->count += count;
> > }
> >
> > /*
> > @@ -581,24 +473,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> > }
> >
> > /*
> > - * The cluster corresponding to page_nr decreases one usage. If the usage
> > - * counter becomes 0, which means no page in the cluster is in using, we can
> > - * optionally discard the cluster and add it to free cluster list.
> > + * The cluster ci decreases one usage. If the usage counter becomes 0,
> > + * which means no page in the cluster is in using, we can optionally discard
> > + * the cluster and add it to free cluster list.
> > */
> > -static void dec_cluster_info_page(struct swap_info_struct *p,
> > - struct swap_cluster_info *cluster_info, unsigned long page_nr)
> > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> > {
> > - unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > -
> > - if (!cluster_info)
> > + if (!p->cluster_info)
> > return;
> >
> > - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> > - cluster_set_count(&cluster_info[idx],
> > - cluster_count(&cluster_info[idx]) - 1);
> > + VM_BUG_ON(ci->count == 0);
> > + ci->count--;
> >
> > - if (cluster_count(&cluster_info[idx]) == 0)
> > - free_cluster(p, idx);
> > + if (!ci->count)
> > + free_cluster(p, ci);
> > }
> >
> > /*
> > @@ -611,10 +499,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> > {
>
> This whole scan_swap_map_ssd_cluster_conflict function seems not
> needed now. free_clusters is a double linked list, so using a cluster
> in the middle won't corrupt the list. The comments are still for the
> old list design.

I was debating removing the cluster_conflict() as well and found out
it can't be removed until we change the order 0 allocations also use
clusters.
There can still be conflict because the order 0 allocations just do
the bruce force scan of swap_map[] when try_ssd fails. This causes
other problems as well. As far as I can tell, the conflict can still
happen.

>
> > struct percpu_cluster *percpu_cluster;
> > bool conflict;
> > -
> > + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> > offset /= SWAPFILE_CLUSTER;
> > - conflict = !cluster_list_empty(&si->free_clusters) &&
> > - offset != cluster_list_first(&si->free_clusters) &&
> > + conflict = !list_empty(&si->free_clusters) &&
> > + offset != first - si->cluster_info &&
> > cluster_is_free(&si->cluster_info[offset]);
> >
> > if (!conflict)
> > @@ -655,10 +543,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> > cluster = this_cpu_ptr(si->percpu_cluster);
> > tmp = cluster->next[order];
> > if (tmp == SWAP_NEXT_INVALID) {
> > - if (!cluster_list_empty(&si->free_clusters)) {
> > - tmp = cluster_next(&si->free_clusters.head) *
> > - SWAPFILE_CLUSTER;
> > - } else if (!cluster_list_empty(&si->discard_clusters)) {
> > + if (!list_empty(&si->free_clusters)) {
> > + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> > + list_del(&ci->next);
> > + spin_lock(&ci->lock);
>
> Shouldn't this list_del also be protected by ci->lock? It was
> protected in alloc_cluster before, keeping the flag synced with
> cluster status so cluster_is_free won't return false positive.

The list add and list del are protected by Si->lock not by cluster lock.
Previously I wanted to use cluster->lock to protect it and realized
that adding/deleting the cluster to/from the list will change three
clusters. (current, prev, next). We need to get three cluster locks.
We might change to a per list spinlock. e.g. one lock for one list to
reduce the contention on Si->lock. However, per cluster lock is not
enough if we only take one cluster lock.

>
> > + ci->flags = 0;
> > + spin_unlock(&ci->lock);
> > + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> > + } else if (!list_empty(&si->discard_clusters)) {
> > /*
> > * we don't have free cluster but have some clusters in
> > * discarding, do discard now and reclaim them, then
> > @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> > goto new_cluster;
> > } else
> > return false;
> > - }
> > + } else
> > + ci = si->cluster_info + tmp;
>
> This "else ci = ..." seems wrong, tmp is not an array index, and not
> needed either.

Yes, there is a bug there, pointed out by OPPO as well. It should be
ci = si->cluster_info + (tmp/ SWAPFILE_CLUSTER);

"tmp" is needed because "tmp" or " cluster->next[order]" keep track of
the current cluster allocation offset,
in the per cpu cluster struct.

BTW, In my V2 I have changed "tmp" to "offset" and previous "offset"
to "retoffset" to make it more obvious. "tmp" does not give much
information about what it really does.

Chris

>
> >
> > /*
> > * Other CPUs can use our cluster if they can't find a free cluster,
> > @@ -1062,8 +955,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >
> > ci = lock_cluster(si, offset);
> > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> > - cluster_set_count_flag(ci, 0, 0);
> > - free_cluster(si, idx);
> > + ci->count = 0;
> > + ci->flags = 0;
> > + free_cluster(si, ci);
> > unlock_cluster(ci);
> > swap_range_free(si, offset, SWAPFILE_CLUSTER);
> > }
> > @@ -1336,7 +1230,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> > count = p->swap_map[offset];
> > VM_BUG_ON(count != SWAP_HAS_CACHE);
> > p->swap_map[offset] = 0;
> > - dec_cluster_info_page(p, p->cluster_info, offset);
> > + dec_cluster_info_page(p, ci);
> > unlock_cluster(ci);
> >
> > mem_cgroup_uncharge_swap(entry, 1);
> > @@ -2985,8 +2879,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >
> > nr_good_pages = maxpages - 1; /* omit header page */
> >
> > - cluster_list_init(&p->free_clusters);
> > - cluster_list_init(&p->discard_clusters);
> > + INIT_LIST_HEAD(&p->free_clusters);
> > + INIT_LIST_HEAD(&p->discard_clusters);
> >
> > for (i = 0; i < swap_header->info.nr_badpages; i++) {
> > unsigned int page_nr = swap_header->info.badpages[i];
> > @@ -3037,14 +2931,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> > for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
> > j = (k + col) % SWAP_CLUSTER_COLS;
> > for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> > + struct swap_cluster_info *ci;
> > idx = i * SWAP_CLUSTER_COLS + j;
> > + ci = cluster_info + idx;
> > if (idx >= nr_clusters)
> > continue;
> > - if (cluster_count(&cluster_info[idx]))
> > + if (ci->count)
> > continue;
> > - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
> > - cluster_list_add_tail(&p->free_clusters, cluster_info,
> > - idx);
> > + ci->flags = CLUSTER_FLAG_FREE;
> > + list_add_tail(&ci->next, &p->free_clusters);
> > }
> > }
> > return nr_extents;
> >
> > --
> > 2.45.1.288.g0e0cd299f1-goog
> >
> >
>

2024-05-29 00:51:13

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list

On Tue, May 28, 2024 at 3:27 PM Chris Li <[email protected]> wrote:
> > > @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> > > goto new_cluster;
> > > } else
> > > return false;
> > > - }
> > > + } else
> > > + ci = si->cluster_info + tmp;
> >
> > This "else ci = ..." seems wrong, tmp is not an array index, and not
> > needed either.
>
> Yes, there is a bug there, pointed out by OPPO as well. It should be
> ci = si->cluster_info + (tmp/ SWAPFILE_CLUSTER);
>
> "tmp" is needed because "tmp" or " cluster->next[order]" keep track of
> the current cluster allocation offset,
> in the per cpu cluster struct.

Hi Kairui,


Actually, you are right, the "ci" is not used here. That is why that
ci out of bound error does not trigger kernel OOPS.
We can delete that else line completely.

Chris

2024-05-29 08:52:16

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list

Chris Li <[email protected]> writes:

> Previously, the swap cluster used a cluster index as a pointer
> to construct a custom single link list type "swap_cluster_list".
> The next cluster pointer is shared with the cluster->count.
> The assumption is that only the free cluster needs to be put
> on the list.
>
> That assumption is not true for mTHP allocators any more.

I think that the words aren't correct here. mTHP swap entry allocators
can work with current cluster definition.

> Need to track the non full cluster on the list as well.  Move the
> current cluster single link list into standard double link list.

It's an optimization to track non-full cluster with a list.

I understand that you want to change cluster list definition. I just
feel the wording isn't accurate.

> Remove the cluster getter/setter for accessing the cluster
> struct member.  Move the cluster locking in the caller function
> rather than the getter/setter function. That way the locking can
> protect more than one member, e.g. cluster->flag.

Sorry, I don't understand the locking in above words. I don't find that
we lock/unlock in the original getter/setter functions. I found that
the cluster locking rule for cluster list is changed. Better to make
this explicit.

> Change cluster code to use "struct swap_cluster_info *" to
> reference the cluster rather than by using index. That is more
> consistent with the list manipulation. It avoids the repeat
> adding index to the cluser_info. The code is easier to understand.
>
> Remove the cluster next pointer is NULL flag, the double link
> list can handle the empty list pretty well.
>
> The "swap_cluster_info" struct is two pointer bigger, because
> 512 swap entries share one swap struct, it has very little impact
> on the average memory usage per swap entry.  Other than the list
> conversion, there is no real function change in this patch.

On 64bit platform, the size of swap_cluster_info increases from 8 bytes
to 24 bytes. For a 1TB swap device, the memory usage will increase from
4MB to 12MB. This looks OK for me.

Another choice is to use a customized double linked list using "unsigned
int" as pointer to cluster. That will reduce the size of cluster to 16
bytes. But it may be not necessary to do that.

Anyway, I think that it's better to add more calculation in changelog
for memory usage increment.

> ---
> include/linux/swap.h | 14 ++--
> mm/swapfile.c | 231 ++++++++++++++-------------------------------------
> 2 files changed, 68 insertions(+), 177 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 11c53692f65f..0d3906eff3c9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,11 +254,12 @@ struct swap_cluster_info {
> * elements correspond to the swap
> * cluster
> */
> - unsigned int data:24;
> + unsigned int count:16;
> unsigned int flags:8;

If we use 16bits and 8 bits in bit fields, why not just use u8 and u16
instead?

> + struct list_head next;

"next" isn't a good naming because prev pointer is in list_head too.
The common naming is "list".

Need to revise comments for swap_cluster_info.lock and add the locking
rule comments for swap_cluster_info.next.

> };
> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> +
>
> /*
> * The first page in the swap file is the swap header, which is always marked
> @@ -283,11 +284,6 @@ struct percpu_cluster {
> unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> };
>
> -struct swap_cluster_list {
> - struct swap_cluster_info head;
> - struct swap_cluster_info tail;
> -};
> -
> /*
> * The in-memory structure used to track swap areas.
> */
> @@ -300,7 +296,7 @@ struct swap_info_struct {
> unsigned int max; /* extent of the swap_map */
> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> - struct swap_cluster_list free_clusters; /* free clusters list */
> + struct list_head free_clusters; /* free clusters list */
> unsigned int lowest_bit; /* index of first free in swap_map */
> unsigned int highest_bit; /* index of last free in swap_map */
> unsigned int pages; /* total of usable pages of swap */
> @@ -333,7 +329,7 @@ struct swap_info_struct {
> * list.
> */
> struct work_struct discard_work; /* discard worker */
> - struct swap_cluster_list discard_clusters; /* discard clusters list */
> + struct list_head discard_clusters; /* discard clusters list */
> struct plist_node avail_lists[]; /*
> * entries in swap_avail_heads, one
> * entry per node.
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4f0e8b2ac8aa..205a60c5f9cb 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> #endif
> #define LATENCY_LIMIT 256
>
> -static inline void cluster_set_flag(struct swap_cluster_info *info,
> - unsigned int flag)
> -{
> - info->flags = flag;
> -}
> -
> -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> -{
> - return info->data;
> -}
> -
> -static inline void cluster_set_count(struct swap_cluster_info *info,
> - unsigned int c)
> -{
> - info->data = c;
> -}
> -
> -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> - unsigned int c, unsigned int f)
> -{
> - info->flags = f;
> - info->data = c;
> -}
> -
> -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> -{
> - return info->data;
> -}
> -
> -static inline void cluster_set_next(struct swap_cluster_info *info,
> - unsigned int n)
> -{
> - info->data = n;
> -}
> -
> -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> - unsigned int n, unsigned int f)
> -{
> - info->flags = f;
> - info->data = n;
> -}
> -
> static inline bool cluster_is_free(struct swap_cluster_info *info)
> {
> return info->flags & CLUSTER_FLAG_FREE;
> }
>
> -static inline bool cluster_is_null(struct swap_cluster_info *info)
> -{
> - return info->flags & CLUSTER_FLAG_NEXT_NULL;
> -}
> -
> -static inline void cluster_set_null(struct swap_cluster_info *info)
> -{
> - info->flags = CLUSTER_FLAG_NEXT_NULL;
> - info->data = 0;
> -}
> -
> static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> unsigned long offset)
> {
> @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> spin_unlock(&si->lock);
> }
>
> -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> -{
> - return cluster_is_null(&list->head);
> -}
> -
> -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> -{
> - return cluster_next(&list->head);
> -}
> -
> -static void cluster_list_init(struct swap_cluster_list *list)
> -{
> - cluster_set_null(&list->head);
> - cluster_set_null(&list->tail);
> -}
> -
> -static void cluster_list_add_tail(struct swap_cluster_list *list,
> - struct swap_cluster_info *ci,
> - unsigned int idx)
> -{
> - if (cluster_list_empty(list)) {
> - cluster_set_next_flag(&list->head, idx, 0);
> - cluster_set_next_flag(&list->tail, idx, 0);
> - } else {
> - struct swap_cluster_info *ci_tail;
> - unsigned int tail = cluster_next(&list->tail);
> -
> - /*
> - * Nested cluster lock, but both cluster locks are
> - * only acquired when we held swap_info_struct->lock
> - */
> - ci_tail = ci + tail;
> - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
> - cluster_set_next(ci_tail, idx);
> - spin_unlock(&ci_tail->lock);
> - cluster_set_next_flag(&list->tail, idx, 0);
> - }
> -}
> -
> -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> - struct swap_cluster_info *ci)
> -{
> - unsigned int idx;
> -
> - idx = cluster_next(&list->head);
> - if (cluster_next(&list->tail) == idx) {
> - cluster_set_null(&list->head);
> - cluster_set_null(&list->tail);
> - } else
> - cluster_set_next_flag(&list->head,
> - cluster_next(&ci[idx]), 0);
> -
> - return idx;
> -}
> -
> /* Add a cluster to discard list and schedule it to do discard */
> static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> - unsigned int idx)
> + struct swap_cluster_info *ci)
> {
> + unsigned int idx = ci - si->cluster_info;
> /*
> * If scan_swap_map_slots() can't find a free cluster, it will check
> * si->swap_map directly. To make sure the discarding cluster isn't
> @@ -462,17 +355,16 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>
> - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> -
> + spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING);

If we don't use ci->lock to protect ci->next, we don't need spin_lock here.

> + list_add_tail(&ci->next, &si->discard_clusters);
> + spin_unlock(&ci->lock);
> schedule_work(&si->discard_work);
> }
>
> -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> {
> - struct swap_cluster_info *ci = si->cluster_info;
> -
> - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> - cluster_list_add_tail(&si->free_clusters, ci, idx);
> + ci->flags = CLUSTER_FLAG_FREE;
> + list_add_tail(&ci->next, &si->free_clusters);
> }
>
> /*
> @@ -481,21 +373,21 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> */
> static void swap_do_scheduled_discard(struct swap_info_struct *si)
> {
> - struct swap_cluster_info *info, *ci;
> + struct swap_cluster_info *ci;
> unsigned int idx;
>
> - info = si->cluster_info;
> -
> - while (!cluster_list_empty(&si->discard_clusters)) {
> - idx = cluster_list_del_first(&si->discard_clusters, info);
> + while (!list_empty(&si->discard_clusters)) {
> + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, next);
> + idx = ci - si->cluster_info;
> spin_unlock(&si->lock);
>
> discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> SWAPFILE_CLUSTER);
>
> spin_lock(&si->lock);
> - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> - __free_cluster(si, idx);
> +
> + spin_lock(&ci->lock);
> + __free_cluster(si, ci);
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> 0, SWAPFILE_CLUSTER);
> unlock_cluster(ci);
> @@ -521,20 +413,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> complete(&si->comp);
> }
>
> -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> {
> - struct swap_cluster_info *ci = si->cluster_info;
> + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
>
> - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> - cluster_list_del_first(&si->free_clusters, ci);
> - cluster_set_count_flag(ci + idx, 0, 0);
> + VM_BUG_ON(ci - si->cluster_info != idx);
> + list_del(&ci->next);
> + ci->count = 0;
> + ci->flags = 0;
> + return ci;
> }
>
> -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> {
> - struct swap_cluster_info *ci = si->cluster_info + idx;
> -
> - VM_BUG_ON(cluster_count(ci) != 0);
> + VM_BUG_ON(ci->count != 0);
> /*
> * If the swap is discardable, prepare discard the cluster
> * instead of free it immediately. The cluster will be freed
> @@ -542,11 +434,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> */
> if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> - swap_cluster_schedule_discard(si, idx);
> + swap_cluster_schedule_discard(si, ci);
> return;
> }
>
> - __free_cluster(si, idx);
> + __free_cluster(si, ci);
> }
>
> /*
> @@ -559,15 +451,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
> unsigned long count)
> {
> unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> + struct swap_cluster_info *ci = cluster_info + idx;
>
> if (!cluster_info)
> return;
> - if (cluster_is_free(&cluster_info[idx]))
> + if (cluster_is_free(ci))
> alloc_cluster(p, idx);
>
> - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> - cluster_set_count(&cluster_info[idx],
> - cluster_count(&cluster_info[idx]) + count);
> + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> + ci->count += count;
> }
>
> /*
> @@ -581,24 +473,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> }
>
> /*
> - * The cluster corresponding to page_nr decreases one usage. If the usage
> - * counter becomes 0, which means no page in the cluster is in using, we can
> - * optionally discard the cluster and add it to free cluster list.
> + * The cluster ci decreases one usage. If the usage counter becomes 0,
> + * which means no page in the cluster is in using, we can optionally discard
> + * the cluster and add it to free cluster list.
> */
> -static void dec_cluster_info_page(struct swap_info_struct *p,
> - struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> {
> - unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> -
> - if (!cluster_info)
> + if (!p->cluster_info)
> return;
>
> - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> - cluster_set_count(&cluster_info[idx],
> - cluster_count(&cluster_info[idx]) - 1);
> + VM_BUG_ON(ci->count == 0);
> + ci->count--;
>
> - if (cluster_count(&cluster_info[idx]) == 0)
> - free_cluster(p, idx);
> + if (!ci->count)
> + free_cluster(p, ci);
> }
>
> /*
> @@ -611,10 +499,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> {
> struct percpu_cluster *percpu_cluster;
> bool conflict;
> -
> + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> offset /= SWAPFILE_CLUSTER;
> - conflict = !cluster_list_empty(&si->free_clusters) &&
> - offset != cluster_list_first(&si->free_clusters) &&
> + conflict = !list_empty(&si->free_clusters) &&
> + offset != first - si->cluster_info &&
> cluster_is_free(&si->cluster_info[offset]);
>
> if (!conflict)
> @@ -655,10 +543,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> cluster = this_cpu_ptr(si->percpu_cluster);
> tmp = cluster->next[order];
> if (tmp == SWAP_NEXT_INVALID) {
> - if (!cluster_list_empty(&si->free_clusters)) {
> - tmp = cluster_next(&si->free_clusters.head) *
> - SWAPFILE_CLUSTER;
> - } else if (!cluster_list_empty(&si->discard_clusters)) {
> + if (!list_empty(&si->free_clusters)) {
> + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> + list_del(&ci->next);
> + spin_lock(&ci->lock);
> + ci->flags = 0;
> + spin_unlock(&ci->lock);
> + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> + } else if (!list_empty(&si->discard_clusters)) {
> /*
> * we don't have free cluster but have some clusters in
> * discarding, do discard now and reclaim them, then
> @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> goto new_cluster;
> } else
> return false;
> - }
> + } else
> + ci = si->cluster_info + tmp;
>
> /*
> * Other CPUs can use our cluster if they can't find a free cluster,
> @@ -1062,8 +955,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>
> ci = lock_cluster(si, offset);
> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> - cluster_set_count_flag(ci, 0, 0);
> - free_cluster(si, idx);
> + ci->count = 0;
> + ci->flags = 0;
> + free_cluster(si, ci);
> unlock_cluster(ci);
> swap_range_free(si, offset, SWAPFILE_CLUSTER);
> }
> @@ -1336,7 +1230,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> count = p->swap_map[offset];
> VM_BUG_ON(count != SWAP_HAS_CACHE);
> p->swap_map[offset] = 0;
> - dec_cluster_info_page(p, p->cluster_info, offset);
> + dec_cluster_info_page(p, ci);
> unlock_cluster(ci);
>
> mem_cgroup_uncharge_swap(entry, 1);
> @@ -2985,8 +2879,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>
> nr_good_pages = maxpages - 1; /* omit header page */
>
> - cluster_list_init(&p->free_clusters);
> - cluster_list_init(&p->discard_clusters);
> + INIT_LIST_HEAD(&p->free_clusters);
> + INIT_LIST_HEAD(&p->discard_clusters);
>
> for (i = 0; i < swap_header->info.nr_badpages; i++) {
> unsigned int page_nr = swap_header->info.badpages[i];
> @@ -3037,14 +2931,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
> j = (k + col) % SWAP_CLUSTER_COLS;
> for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> + struct swap_cluster_info *ci;
> idx = i * SWAP_CLUSTER_COLS + j;
> + ci = cluster_info + idx;
> if (idx >= nr_clusters)
> continue;
> - if (cluster_count(&cluster_info[idx]))
> + if (ci->count)
> continue;
> - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
> - cluster_list_add_tail(&p->free_clusters, cluster_info,
> - idx);
> + ci->flags = CLUSTER_FLAG_FREE;
> + list_add_tail(&ci->next, &p->free_clusters);
> }
> }
> return nr_extents;

--
Best Regards,
Huang, Ying

2024-05-29 09:19:41

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Chris Li <[email protected]> writes:

> I am spinning a new version for this series to address two issues
> found in this series:
>
> 1) Oppo discovered a bug in the following line:
> + ci = si->cluster_info + tmp;
> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> That is a serious bug but trivial to fix.
>
> 2) order 0 allocation currently blindly scans swap_map disregarding
> the cluster->order.

IIUC, now, we only scan swap_map[] only if
!list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
That is, if you doesn't run low swap free space, you will not do that.

> Given enough order 0 swap allocations(close to the
> swap file size) the order 0 allocation head will eventually sweep
> across the whole swapfile and destroy other cluster order allocations.
>
> The short term fix is just skipping clusters that are already assigned
> to higher orders.

Better to do any further optimization on top of the simpler one. Need
to evaluate whether it's necessary to add more complexity.

> In the long term, I want to unify the non-SSD to use clusters for
> locking and allocations as well, just try to follow the last
> allocation (less seeking) as much as possible.

I have thought about that too. Personally, I think that it's good to
remove swap_map[] scanning. The implementation can be simplified too.
I don't know whether do we need to consider the performance of HDD swap
now.

--
Best Regards,
Huang, Ying

> On Fri, May 24, 2024 at 10:17 AM Chris Li <[email protected]> wrote:
>>
>> This is the short term solutiolns "swap cluster order" listed
>> in my "Swap Abstraction" discussion slice 8 in the recent
>> LSF/MM conference.
>>
>> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> orders" is introduced, it only allocates the mTHP swap entries
>> from new empty cluster list. That works well for PMD size THP,
>> but it has a serius fragmentation issue reported by Barry.
>>
>> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>>
>> The mTHP allocation failure rate raises to almost 100% after a few
>> hours in Barry's test run.
>>
>> The reason is that all the empty cluster has been exhausted while
>> there are planty of free swap entries to in the cluster that is
>> not 100% free.
>>
>> Address this by remember the swap allocation order in the cluster.
>> Keep track of the per order non full cluster list for later allocation.
>>
>> This greatly improve the sucess rate of the mTHP swap allocation.
>> While I am still waiting for Barry's test result. I paste Kairui's test
>> result here:
>>
>> I'm able to reproduce such an issue with a simple script (enabling all order of mthp):
>>
>> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
>> swapoff -a
>> mkswap /dev/ram0
>> swapon /dev/ram0
>>
>> rmdir /sys/fs/cgroup/benchmark
>> mkdir -p /sys/fs/cgroup/benchmark
>> cd /sys/fs/cgroup/benchmark
>> echo 8G > memory.max
>> echo $$ > cgroup.procs
>>
>> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
>>
>> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
>> -P memcache_binary -n allkeys --key-minimum=1 \
>> --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
>> --ratio 1:0 --pipeline 8 -d 1024
>>
>> Before:
>> Totals 48805.63 0.00 0.00 5.26045 119100 38.91100 59.64700 51063.98
>> After:
>> Totals 71098.84 0.00 0.00 3.60585 071100 26.36700 39.16700 74388.74
>>
>> And the fallback ratio dropped by a lot:
>> Before:
>> hugepages-32kB/stats/anon_swpout_fallback:15997
>> hugepages-32kB/stats/anon_swpout:18712
>> hugepages-512kB/stats/anon_swpout_fallback:192
>> hugepages-512kB/stats/anon_swpout:0
>> hugepages-2048kB/stats/anon_swpout_fallback:2
>> hugepages-2048kB/stats/anon_swpout:0
>> hugepages-1024kB/stats/anon_swpout_fallback:0
>> hugepages-1024kB/stats/anon_swpout:0
>> hugepages-64kB/stats/anon_swpout_fallback:18246
>> hugepages-64kB/stats/anon_swpout:17644
>> hugepages-16kB/stats/anon_swpout_fallback:13701
>> hugepages-16kB/stats/anon_swpout:18234
>> hugepages-256kB/stats/anon_swpout_fallback:8642
>> hugepages-256kB/stats/anon_swpout:93
>> hugepages-128kB/stats/anon_swpout_fallback:21497
>> hugepages-128kB/stats/anon_swpout:7596
>>
>> (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)
>>
>> After:
>> hugepages-32kB/stats/swpout:34445
>> hugepages-32kB/stats/swpout_fallback:0
>> hugepages-512kB/stats/swpout:1
>> hugepages-512kB/stats/swpout_fallback:134
>> hugepages-2048kB/stats/swpout:1
>> hugepages-2048kB/stats/swpout_fallback:1
>> hugepages-1024kB/stats/swpout:6
>> hugepages-1024kB/stats/swpout_fallback:0
>> hugepages-64kB/stats/swpout:35495
>> hugepages-64kB/stats/swpout_fallback:0
>> hugepages-16kB/stats/swpout:32441
>> hugepages-16kB/stats/swpout_fallback:0
>> hugepages-256kB/stats/swpout:2223
>> hugepages-256kB/stats/swpout_fallback:6278
>> hugepages-128kB/stats/swpout:29136
>> hugepages-128kB/stats/swpout_fallback:52
>>
>> Reported-by: Barry Song <[email protected]>
>> Tested-by: Kairui Song <[email protected]>
>> Signed-off-by: Chris Li <[email protected]>
>> ---
>> Chris Li (2):
>> mm: swap: swap cluster switch to double link list
>> mm: swap: mTHP allocate swap entries from nonfull list
>>
>> include/linux/swap.h | 18 ++--
>> mm/swapfile.c | 252 +++++++++++++++++----------------------------------
>> 2 files changed, 93 insertions(+), 177 deletions(-)
>> ---
>> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
>> change-id: 20240523-swap-allocator-1534c480ece4
>>
>> Best regards,
>> --
>> Chris Li <[email protected]>
>>

2024-05-30 01:14:03

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Hi Ying,

On Wed, May 29, 2024 at 1:57 AM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > I am spinning a new version for this series to address two issues
> > found in this series:
> >
> > 1) Oppo discovered a bug in the following line:
> > + ci = si->cluster_info + tmp;
> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> > That is a serious bug but trivial to fix.
> >
> > 2) order 0 allocation currently blindly scans swap_map disregarding
> > the cluster->order.
>
> IIUC, now, we only scan swap_map[] only if
> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> That is, if you doesn't run low swap free space, you will not do that.

You can still swap space in order 0 clusters while order 4 runs out of
free_cluster
or nonfull_clusters[order]. For Android that is a common case.

>
> > Given enough order 0 swap allocations(close to the
> > swap file size) the order 0 allocation head will eventually sweep
> > across the whole swapfile and destroy other cluster order allocations.
> >
> > The short term fix is just skipping clusters that are already assigned
> > to higher orders.
>
> Better to do any further optimization on top of the simpler one. Need
> to evaluate whether it's necessary to add more complexity.

I agree this needs more careful planning and discussion. In Android's
use case, the swapfile is always almost full. It will run into this
situation after long enough swap time.
Once the order 0 swap entry starts to pollute the higher order
cluster, there is no going back(until the whole cluster is 100% free).

> > In the long term, I want to unify the non-SSD to use clusters for
> > locking and allocations as well, just try to follow the last
> > allocation (less seeking) as much as possible.
>
> I have thought about that too. Personally, I think that it's good to
> remove swap_map[] scanning. The implementation can be simplified too.

Agree. I look at the commit that introduces the SSD cluster. The
commit message indicates a lot of CPU time spent in swap_map scanning,
especially when the swapfile is almost full. The main motivation to
introduce the cluster in HDD is to simplify and unify the code.

> I don't know whether do we need to consider the performance of HDD swap
> now.

I am not sure about that either. We can make the best effort to reduce the seek.

Chris

> --
> Best Regards,
> Huang, Ying
>
> > On Fri, May 24, 2024 at 10:17 AM Chris Li <[email protected]> wrote:
> >>
> >> This is the short term solutiolns "swap cluster order" listed
> >> in my "Swap Abstraction" discussion slice 8 in the recent
> >> LSF/MM conference.
> >>
> >> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> >> orders" is introduced, it only allocates the mTHP swap entries
> >> from new empty cluster list. That works well for PMD size THP,
> >> but it has a serius fragmentation issue reported by Barry.
> >>
> >> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >>
> >> The mTHP allocation failure rate raises to almost 100% after a few
> >> hours in Barry's test run.
> >>
> >> The reason is that all the empty cluster has been exhausted while
> >> there are planty of free swap entries to in the cluster that is
> >> not 100% free.
> >>
> >> Address this by remember the swap allocation order in the cluster.
> >> Keep track of the per order non full cluster list for later allocation.
> >>
> >> This greatly improve the sucess rate of the mTHP swap allocation.
> >> While I am still waiting for Barry's test result. I paste Kairui's test
> >> result here:
> >>
> >> I'm able to reproduce such an issue with a simple script (enabling all order of mthp):
> >>
> >> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
> >> swapoff -a
> >> mkswap /dev/ram0
> >> swapon /dev/ram0
> >>
> >> rmdir /sys/fs/cgroup/benchmark
> >> mkdir -p /sys/fs/cgroup/benchmark
> >> cd /sys/fs/cgroup/benchmark
> >> echo 8G > memory.max
> >> echo $$ > cgroup.procs
> >>
> >> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
> >>
> >> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
> >> -P memcache_binary -n allkeys --key-minimum=1 \
> >> --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
> >> --ratio 1:0 --pipeline 8 -d 1024
> >>
> >> Before:
> >> Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
> >> After:
> >> Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74
> >>
> >> And the fallback ratio dropped by a lot:
> >> Before:
> >> hugepages-32kB/stats/anon_swpout_fallback:15997
> >> hugepages-32kB/stats/anon_swpout:18712
> >> hugepages-512kB/stats/anon_swpout_fallback:192
> >> hugepages-512kB/stats/anon_swpout:0
> >> hugepages-2048kB/stats/anon_swpout_fallback:2
> >> hugepages-2048kB/stats/anon_swpout:0
> >> hugepages-1024kB/stats/anon_swpout_fallback:0
> >> hugepages-1024kB/stats/anon_swpout:0
> >> hugepages-64kB/stats/anon_swpout_fallback:18246
> >> hugepages-64kB/stats/anon_swpout:17644
> >> hugepages-16kB/stats/anon_swpout_fallback:13701
> >> hugepages-16kB/stats/anon_swpout:18234
> >> hugepages-256kB/stats/anon_swpout_fallback:8642
> >> hugepages-256kB/stats/anon_swpout:93
> >> hugepages-128kB/stats/anon_swpout_fallback:21497
> >> hugepages-128kB/stats/anon_swpout:7596
> >>
> >> (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)
> >>
> >> After:
> >> hugepages-32kB/stats/swpout:34445
> >> hugepages-32kB/stats/swpout_fallback:0
> >> hugepages-512kB/stats/swpout:1
> >> hugepages-512kB/stats/swpout_fallback:134
> >> hugepages-2048kB/stats/swpout:1
> >> hugepages-2048kB/stats/swpout_fallback:1
> >> hugepages-1024kB/stats/swpout:6
> >> hugepages-1024kB/stats/swpout_fallback:0
> >> hugepages-64kB/stats/swpout:35495
> >> hugepages-64kB/stats/swpout_fallback:0
> >> hugepages-16kB/stats/swpout:32441
> >> hugepages-16kB/stats/swpout_fallback:0
> >> hugepages-256kB/stats/swpout:2223
> >> hugepages-256kB/stats/swpout_fallback:6278
> >> hugepages-128kB/stats/swpout:29136
> >> hugepages-128kB/stats/swpout_fallback:52
> >>
> >> Reported-by: Barry Song <[email protected]>
> >> Tested-by: Kairui Song <[email protected]>
> >> Signed-off-by: Chris Li <[email protected]>
> >> ---
> >> Chris Li (2):
> >> mm: swap: swap cluster switch to double link list
> >> mm: swap: mTHP allocate swap entries from nonfull list
> >>
> >> include/linux/swap.h | 18 ++--
> >> mm/swapfile.c | 252 +++++++++++++++++----------------------------------
> >> 2 files changed, 93 insertions(+), 177 deletions(-)
> >> ---
> >> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
> >> change-id: 20240523-swap-allocator-1534c480ece4
> >>
> >> Best regards,
> >> --
> >> Chris Li <[email protected]>
> >>
>

2024-05-30 02:54:29

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Chris Li <[email protected]> writes:

> Hi Ying,
>
> On Wed, May 29, 2024 at 1:57 AM Huang, Ying <[email protected]> wrote:
>>
>> Chris Li <[email protected]> writes:
>>
>> > I am spinning a new version for this series to address two issues
>> > found in this series:
>> >
>> > 1) Oppo discovered a bug in the following line:
>> > + ci = si->cluster_info + tmp;
>> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
>> > That is a serious bug but trivial to fix.
>> >
>> > 2) order 0 allocation currently blindly scans swap_map disregarding
>> > the cluster->order.
>>
>> IIUC, now, we only scan swap_map[] only if
>> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
>> That is, if you doesn't run low swap free space, you will not do that.
>
> You can still swap space in order 0 clusters while order 4 runs out of
> free_cluster
> or nonfull_clusters[order]. For Android that is a common case.

When we fail to allocate order 4, we will fallback to order 0. Still
don't need to scan swap_map[]. But after looking at your below reply, I
realized that the swap space is almost full at most times in your cases.
Then, it's possible that we run into scanning swap_map[].
list_empty(&si->free_clusters) &&
list_empty(&si->nonfull_clusters[order]) will become true, if we put too
many clusters in si->percpu_cluster. So, if we want to avoid to scan
swap_map[], we can stop add clusters in si->percpu_cluster when swap
space runs low. And maybe take clusters out of si->percpu_cluster
sometimes.

Another issue is nonfull_cluster[order1] cannot be used for
nonfull_cluster[order2]. In definition, we should not fail order 0
allocation, we need to steal nonfull_cluster[order>0] for order 0
allocation. This can avoid to scan swap_map[] too. This may be not
perfect, but it is the simplest first step implementation. You can
optimize based on it further.

And, I checked your code again. It appears that si->percpu_cluster may
be put in si->nonfull_cluster[], then be used by another CPU. Please
check it.

--
Best Regards,
Huang, Ying

[snip]

2024-05-30 07:49:44

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Wed, May 29, 2024 at 9:04 AM Chris Li <[email protected]> wrote:
>
> I am spinning a new version for this series to address two issues
> found in this series:
>
> 1) Oppo discovered a bug in the following line:
> + ci = si->cluster_info + tmp;
> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> That is a serious bug but trivial to fix.
>
> 2) order 0 allocation currently blindly scans swap_map disregarding
> the cluster->order. Given enough order 0 swap allocations(close to the
> swap file size) the order 0 allocation head will eventually sweep
> across the whole swapfile and destroy other cluster order allocations.
>
> The short term fix is just skipping clusters that are already assigned
> to higher orders.
>
> In the long term, I want to unify the non-SSD to use clusters for
> locking and allocations as well, just try to follow the last
> allocation (less seeking) as much as possible.

Hi Chris,

I am sharing some new test results with you. This time, we used two
zRAM devices by modifying get_swap_pages().

zram0 -> dedicated for order-0 swpout
zram1 -> dedicated for order-4 swpout

We allocate a generous amount of space for zRAM1 to ensure it never gets full
and always has ample free space. However, we found that Ryan's approach
does not perform well even in this straightforward scenario. Despite zRAM1
having 80% of its space remaining, we still experience issues obtaining
contiguous swap slots and encounter a high swpout_fallback ratio.

Sorry for the report, Ryan :-)

In contrast, with your patch, we consistently see the thp_swpout_fallback ratio
at 0%, indicating a significant improvement in the situation.

Although your patch still has issues supporting the mixing of order-0 and
order-4 pages in a swap device, it represents a significant improvement.

I would be delighted to witness your approach advancing with Ying
Huang’s assistance. However, due to my current commitments, I
regret that I am unable to allocate time for debugging.

>
> Chris
>
>
>
> On Fri, May 24, 2024 at 10:17 AM Chris Li <[email protected]> wrote:
> >
> > This is the short term solutiolns "swap cluster order" listed
> > in my "Swap Abstraction" discussion slice 8 in the recent
> > LSF/MM conference.
> >
> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > orders" is introduced, it only allocates the mTHP swap entries
> > from new empty cluster list. That works well for PMD size THP,
> > but it has a serius fragmentation issue reported by Barry.
> >
> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >
> > The mTHP allocation failure rate raises to almost 100% after a few
> > hours in Barry's test run.
> >
> > The reason is that all the empty cluster has been exhausted while
> > there are planty of free swap entries to in the cluster that is
> > not 100% free.
> >
> > Address this by remember the swap allocation order in the cluster.
> > Keep track of the per order non full cluster list for later allocation.
> >
> > This greatly improve the sucess rate of the mTHP swap allocation.
> > While I am still waiting for Barry's test result. I paste Kairui's test
> > result here:
> >
> > I'm able to reproduce such an issue with a simple script (enabling all order of mthp):
> >
> > modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
> > swapoff -a
> > mkswap /dev/ram0
> > swapon /dev/ram0
> >
> > rmdir /sys/fs/cgroup/benchmark
> > mkdir -p /sys/fs/cgroup/benchmark
> > cd /sys/fs/cgroup/benchmark
> > echo 8G > memory.max
> > echo $$ > cgroup.procs
> >
> > memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
> >
> > /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
> > -P memcache_binary -n allkeys --key-minimum=1 \
> > --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
> > --ratio 1:0 --pipeline 8 -d 1024
> >
> > Before:
> > Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
> > After:
> > Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74
> >
> > And the fallback ratio dropped by a lot:
> > Before:
> > hugepages-32kB/stats/anon_swpout_fallback:15997
> > hugepages-32kB/stats/anon_swpout:18712
> > hugepages-512kB/stats/anon_swpout_fallback:192
> > hugepages-512kB/stats/anon_swpout:0
> > hugepages-2048kB/stats/anon_swpout_fallback:2
> > hugepages-2048kB/stats/anon_swpout:0
> > hugepages-1024kB/stats/anon_swpout_fallback:0
> > hugepages-1024kB/stats/anon_swpout:0
> > hugepages-64kB/stats/anon_swpout_fallback:18246
> > hugepages-64kB/stats/anon_swpout:17644
> > hugepages-16kB/stats/anon_swpout_fallback:13701
> > hugepages-16kB/stats/anon_swpout:18234
> > hugepages-256kB/stats/anon_swpout_fallback:8642
> > hugepages-256kB/stats/anon_swpout:93
> > hugepages-128kB/stats/anon_swpout_fallback:21497
> > hugepages-128kB/stats/anon_swpout:7596
> >
> > (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)
> >
> > After:
> > hugepages-32kB/stats/swpout:34445
> > hugepages-32kB/stats/swpout_fallback:0
> > hugepages-512kB/stats/swpout:1
> > hugepages-512kB/stats/swpout_fallback:134
> > hugepages-2048kB/stats/swpout:1
> > hugepages-2048kB/stats/swpout_fallback:1
> > hugepages-1024kB/stats/swpout:6
> > hugepages-1024kB/stats/swpout_fallback:0
> > hugepages-64kB/stats/swpout:35495
> > hugepages-64kB/stats/swpout_fallback:0
> > hugepages-16kB/stats/swpout:32441
> > hugepages-16kB/stats/swpout_fallback:0
> > hugepages-256kB/stats/swpout:2223
> > hugepages-256kB/stats/swpout_fallback:6278
> > hugepages-128kB/stats/swpout:29136
> > hugepages-128kB/stats/swpout_fallback:52
> >
> > Reported-by: Barry Song <[email protected]>
> > Tested-by: Kairui Song <[email protected]>
> > Signed-off-by: Chris Li <[email protected]>
> > ---
> > Chris Li (2):
> > mm: swap: swap cluster switch to double link list
> > mm: swap: mTHP allocate swap entries from nonfull list
> >
> > include/linux/swap.h | 18 ++--
> > mm/swapfile.c | 252 +++++++++++++++++----------------------------------
> > 2 files changed, 93 insertions(+), 177 deletions(-)
> > ---
> > base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
> > change-id: 20240523-swap-allocator-1534c480ece4
> >
> > Best regards,
> > --
> > Chris Li <[email protected]>
> >

Thanks
Barry


Attachments:
chris-swap-patch-dedicated-zram.png (37.88 kB)

2024-05-30 08:08:35

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Thu, May 30, 2024 at 10:54 AM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > Hi Ying,
> >
> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <[email protected]> wrote:
> >>
> >> Chris Li <[email protected]> writes:
> >>
> >> > I am spinning a new version for this series to address two issues
> >> > found in this series:
> >> >
> >> > 1) Oppo discovered a bug in the following line:
> >> > + ci = si->cluster_info + tmp;
> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> > That is a serious bug but trivial to fix.
> >> >
> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> >> > the cluster->order.
> >>
> >> IIUC, now, we only scan swap_map[] only if
> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> >> That is, if you doesn't run low swap free space, you will not do that.
> >
> > You can still swap space in order 0 clusters while order 4 runs out of
> > free_cluster
> > or nonfull_clusters[order]. For Android that is a common case.
>
> When we fail to allocate order 4, we will fallback to order 0. Still
> don't need to scan swap_map[]. But after looking at your below reply, I
> realized that the swap space is almost full at most times in your cases.
> Then, it's possible that we run into scanning swap_map[].
> list_empty(&si->free_clusters) &&
> list_empty(&si->nonfull_clusters[order]) will become true, if we put too
> many clusters in si->percpu_cluster. So, if we want to avoid to scan
> swap_map[], we can stop add clusters in si->percpu_cluster when swap
> space runs low. And maybe take clusters out of si->percpu_cluster
> sometimes.

Stop adding when it runs low seems too late, there could still be a
free cluster stuck on a CPU, and not getting scanned, right?

> Another issue is nonfull_cluster[order1] cannot be used for
> nonfull_cluster[order2]. In definition, we should not fail order 0
> allocation, we need to steal nonfull_cluster[order>0] for order 0
> allocation. This can avoid to scan swap_map[] too. This may be not
> perfect, but it is the simplest first step implementation. You can
> optimize based on it further.

This can be extended to allow any order < MAX_ORDER to steal from
higher order, which might increase fragmentation though.

So this is looking more and more like a buddy allocator, and that
should be the long term solution.

2024-05-30 18:32:34

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Thu, May 30, 2024 at 1:08 AM Kairui Song <[email protected]> wrote:
>
> On Thu, May 30, 2024 at 10:54 AM Huang, Ying <[email protected]> wrote:
> >
> > Chris Li <[email protected]> writes:
> >
> > > Hi Ying,
> > >
> > > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intelcom> wrote:
> > >>
> > >> Chris Li <[email protected]> writes:
> > >>
> > >> > I am spinning a new version for this series to address two issues
> > >> > found in this series:
> > >> >
> > >> > 1) Oppo discovered a bug in the following line:
> > >> > + ci = si->cluster_info + tmp;
> > >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> > >> > That is a serious bug but trivial to fix.
> > >> >
> > >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> > >> > the cluster->order.
> > >>
> > >> IIUC, now, we only scan swap_map[] only if
> > >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> > >> That is, if you doesn't run low swap free space, you will not do that.
> > >
> > > You can still swap space in order 0 clusters while order 4 runs out of
> > > free_cluster
> > > or nonfull_clusters[order]. For Android that is a common case.
> >
> > When we fail to allocate order 4, we will fallback to order 0. Still
> > don't need to scan swap_map[]. But after looking at your below reply, I
> > realized that the swap space is almost full at most times in your cases.
> > Then, it's possible that we run into scanning swap_map[].
> > list_empty(&si->free_clusters) &&
> > list_empty(&si->nonfull_clusters[order]) will become true, if we put too
> > many clusters in si->percpu_cluster. So, if we want to avoid to scan
> > swap_map[], we can stop add clusters in si->percpu_cluster when swap
> > space runs low. And maybe take clusters out of si->percpu_cluster
> > sometimes.
>
> Stop adding when it runs low seems too late, there could still be a
> free cluster stuck on a CPU, and not getting scanned, right?

The free clusters stuck on the CPU are a small number. Only a handful
of clusters. Preventing low order swap polluting the high order
cluster is more urgent.

>
> > Another issue is nonfull_cluster[order1] cannot be used for
> > nonfull_cluster[order2]. In definition, we should not fail order 0
> > allocation, we need to steal nonfull_cluster[order>0] for order 0
> > allocation. This can avoid to scan swap_map[] too. This may be not
> > perfect, but it is the simplest first step implementation. You can
> > optimize based on it further.
>
> This can be extended to allow any order < MAX_ORDER to steal from
> higher order, which might increase fragmentation though.

Steal from higher order is a bad thing. Because the value of the
allocator is able to allocate from higher order.
High to low is always trivil, the low to high is impossible. See the
other email having a "knob" to reserve some swap space for high order
allocations. That is not perfect but more useful.

>
> So this is looking more and more like a buddy allocator, and that
> should be the long term solution.
>
In Barry's test case, there is a huge swing of order 0 and order 4
allocation caused by the low memory killer. Apps get killed and take a
while for the app to launch and swap out high order entries. The buddy
allocator will have limited help there because once cluster is used
for order 0, the fragmentation will prevent higher order allocation.
Buddy allocator might not be able to help much in this situation. We
do need a way to swap out large folios using discontiguous swap
entries. That is the longer term solution to Barry's usage situation.

Chris

2024-05-30 21:44:54

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Wed, May 29, 2024 at 7:54 PM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > Hi Ying,
> >
> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <[email protected]> wrote:
> >>
> >> Chris Li <[email protected]> writes:
> >>
> >> > I am spinning a new version for this series to address two issues
> >> > found in this series:
> >> >
> >> > 1) Oppo discovered a bug in the following line:
> >> > + ci = si->cluster_info + tmp;
> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> > That is a serious bug but trivial to fix.
> >> >
> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> >> > the cluster->order.
> >>
> >> IIUC, now, we only scan swap_map[] only if
> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> >> That is, if you doesn't run low swap free space, you will not do that.
> >
> > You can still swap space in order 0 clusters while order 4 runs out of
> > free_cluster
> > or nonfull_clusters[order]. For Android that is a common case.
>
> When we fail to allocate order 4, we will fallback to order 0. Still
> don't need to scan swap_map[]. But after looking at your below reply, I
> realized that the swap space is almost full at most times in your cases.
> Then, it's possible that we run into scanning swap_map[].
> list_empty(&si->free_clusters) &&
> list_empty(&si->nonfull_clusters[order]) will become true, if we put too
> many clusters in si->percpu_cluster. So, if we want to avoid to scan
> swap_map[], we can stop add clusters in si->percpu_cluster when swap
> space runs low. And maybe take clusters out of si->percpu_cluster
> sometimes.

One idea after reading your reply. If we run out of the
nonfull_cluster[order], we should be able to use other cpu's
si->percpu_cluster[] as well. That is a very small win for Android,
because android does not have too many cpu. We are talking about a
handful of clusters, which might not justify the code complexity. It
does not change the behavior that order 0 can pollut higher order.

>
> Another issue is nonfull_cluster[order1] cannot be used for
> nonfull_cluster[order2]. In definition, we should not fail order 0
> allocation, we need to steal nonfull_cluster[order>0] for order 0
> allocation. This can avoid to scan swap_map[] too. This may be not
> perfect, but it is the simplest first step implementation. You can
> optimize based on it further.

Yes, that is listed as the limitation of this cluster order approach.
Initially we need to support one order well first. We might choose
what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
are too big. The sweet spot might be some there in between. If we can
support one order well, we can demonstrate the value of the mTHP. We
can worry about other mix orders later.

Do you have any suggestions for how to prevent the order 0 polluting
the higher order cluster? If we allow that to happen, then it defeats
the goal of being able to allocate higher order swap entries. The
tricky question is we don't know how much swap space we should reserve
for each order. We can always break higher order clusters to lower
order, but can't do the reserves. The current patch series lets the
actual usage determine the percentage of the cluster for each order.
However that seems not enough for the test case Barry has. When the
app gets OOM kill that is where a large swing of order 0 swap will
show up and not enough higher order usage for the brief moment. The
order 0 swap entry will pollute the high order cluster. We are
currently debating a "knob" to be able to reserve a certain % of swap
space for a certain order. Those reservations will be guaranteed and
order 0 swap entry can't pollute them even when it runs out of swap
space. That can make the mTHP at least usable for the Android case.

Do you see another way to protect the high order cluster polluted by
lower order one?

>
> And, I checked your code again. It appears that si->percpu_cluster may
> be put in si->nonfull_cluster[], then be used by another CPU. Please
> check it.

Ah, good point. I think it does. Let me take a closer look.

Chris


Chris

2024-05-30 21:49:50

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list

On Wed, May 29, 2024 at 1:48 AM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > Previously, the swap cluster used a cluster index as a pointer
> > to construct a custom single link list type "swap_cluster_list".
> > The next cluster pointer is shared with the cluster->count.
> > The assumption is that only the free cluster needs to be put
> > on the list.
> >
> > That assumption is not true for mTHP allocators any more.
>
> I think that the words aren't correct here. mTHP swap entry allocators
> can work with current cluster definition.

The current behavior is very problematic though:

If we only allocate and free order 4 swap entries, nothing else. After
a while, the free cluster will be used up, the swap entry allocation
will fail even though there is a lot of swap space left.

> > Need to track the non full cluster on the list as well. Move the
> > current cluster single link list into standard double link list.
>
> It's an optimization to track non-full cluster with a list.
>
> I understand that you want to change cluster list definition. I just

In my mind, I was changing the list implementation so it can be
tracked non free cluster as well.

> feel the wording isn't accurate.

Help me improve it. I am happy to adjust the wording in V2, you can
provide more feedback then.

>
> > Remove the cluster getter/setter for accessing the cluster
> > struct member. Move the cluster locking in the caller function
> > rather than the getter/setter function. That way the locking can
> > protect more than one member, e.g. cluster->flag.
>
> Sorry, I don't understand the locking in above words. I don't find that
> we lock/unlock in the original getter/setter functions. I found that
> the cluster locking rule for cluster list is changed. Better to make
> this explicit.

The original cluster single link list add/remove will require si->lock
protection as well, because the list head and tail pointer are outside
of the cluster pointer.
In this regard, the cluster double link list locking rule is very
similar. Yes, I move the list_del() outside of the cluster lock, is
that what you are referring to as the locking change?

> > Change cluster code to use "struct swap_cluster_info *" to
> > reference the cluster rather than by using index. That is more
> > consistent with the list manipulation. It avoids the repeat
> > adding index to the cluser_info. The code is easier to understand.
> >
> > Remove the cluster next pointer is NULL flag, the double link
> > list can handle the empty list pretty well.
> >
> > The "swap_cluster_info" struct is two pointer bigger, because
> > 512 swap entries share one swap struct, it has very little impact
> > on the average memory usage per swap entry. Other than the list
> > conversion, there is no real function change in this patch.
>
> On 64bit platform, the size of swap_cluster_info increases from 8 bytes
> to 24 bytes. For a 1TB swap device, the memory usage will increase from
> 4MB to 12MB. This looks OK for me.

Will add the size change calculation in V2 and have you review it again.

>
> Another choice is to use a customized double linked list using "unsigned
> int" as pointer to cluster. That will reduce the size of cluster to 16
> bytes. But it may be not necessary to do that.

We can always do that as a follow up step to optimize the 24 byte to
16 byte, at the price of more code complicity.
The trick part is the link list head, it is not part of the cluster
array, it does not have an index, and will need a special handle for
that.

>
> Anyway, I think that it's better to add more calculation in changelog
> for memory usage increment.

Sure, I will adjust the commit message in V2.

Chris

>
> > ---
> > include/linux/swap.h | 14 ++--
> > mm/swapfile.c | 231 ++++++++++++++-------------------------------------
> > 2 files changed, 68 insertions(+), 177 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 11c53692f65f..0d3906eff3c9 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -254,11 +254,12 @@ struct swap_cluster_info {
> > * elements correspond to the swap
> > * cluster
> > */
> > - unsigned int data:24;
> > + unsigned int count:16;
> > unsigned int flags:8;
>
> If we use 16bits and 8 bits in bit fields, why not just use u8 and u16
> instead?
Not sure about the

>
> > + struct list_head next;
>
> "next" isn't a good naming because prev pointer is in list_head too.
> The common naming is "list".

Sure, I can change it to "list".

>
> Need to revise comments for swap_cluster_info.lock and add the locking
> rule comments for swap_cluster_info.next.

Will do.

>
> > };
> > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> > +
> >
> > /*
> > * The first page in the swap file is the swap header, which is always marked
> > @@ -283,11 +284,6 @@ struct percpu_cluster {
> > unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > };
> >
> > -struct swap_cluster_list {
> > - struct swap_cluster_info head;
> > - struct swap_cluster_info tail;
> > -};
> > -
> > /*
> > * The in-memory structure used to track swap areas.
> > */
> > @@ -300,7 +296,7 @@ struct swap_info_struct {
> > unsigned int max; /* extent of the swap_map */
> > unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> > - struct swap_cluster_list free_clusters; /* free clusters list */
> > + struct list_head free_clusters; /* free clusters list */
> > unsigned int lowest_bit; /* index of first free in swap_map */
> > unsigned int highest_bit; /* index of last free in swap_map */
> > unsigned int pages; /* total of usable pages of swap */
> > @@ -333,7 +329,7 @@ struct swap_info_struct {
> > * list.
> > */
> > struct work_struct discard_work; /* discard worker */
> > - struct swap_cluster_list discard_clusters; /* discard clusters list */
> > + struct list_head discard_clusters; /* discard clusters list */
> > struct plist_node avail_lists[]; /*
> > * entries in swap_avail_heads, one
> > * entry per node.
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 4f0e8b2ac8aa..205a60c5f9cb 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> > #endif
> > #define LATENCY_LIMIT 256
> >
> > -static inline void cluster_set_flag(struct swap_cluster_info *info,
> > - unsigned int flag)
> > -{
> > - info->flags = flag;
> > -}
> > -
> > -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> > -{
> > - return info->data;
> > -}
> > -
> > -static inline void cluster_set_count(struct swap_cluster_info *info,
> > - unsigned int c)
> > -{
> > - info->data = c;
> > -}
> > -
> > -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> > - unsigned int c, unsigned int f)
> > -{
> > - info->flags = f;
> > - info->data = c;
> > -}
> > -
> > -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> > -{
> > - return info->data;
> > -}
> > -
> > -static inline void cluster_set_next(struct swap_cluster_info *info,
> > - unsigned int n)
> > -{
> > - info->data = n;
> > -}
> > -
> > -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> > - unsigned int n, unsigned int f)
> > -{
> > - info->flags = f;
> > - info->data = n;
> > -}
> > -
> > static inline bool cluster_is_free(struct swap_cluster_info *info)
> > {
> > return info->flags & CLUSTER_FLAG_FREE;
> > }
> >
> > -static inline bool cluster_is_null(struct swap_cluster_info *info)
> > -{
> > - return info->flags & CLUSTER_FLAG_NEXT_NULL;
> > -}
> > -
> > -static inline void cluster_set_null(struct swap_cluster_info *info)
> > -{
> > - info->flags = CLUSTER_FLAG_NEXT_NULL;
> > - info->data = 0;
> > -}
> > -
> > static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> > unsigned long offset)
> > {
> > @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> > spin_unlock(&si->lock);
> > }
> >
> > -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> > -{
> > - return cluster_is_null(&list->head);
> > -}
> > -
> > -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> > -{
> > - return cluster_next(&list->head);
> > -}
> > -
> > -static void cluster_list_init(struct swap_cluster_list *list)
> > -{
> > - cluster_set_null(&list->head);
> > - cluster_set_null(&list->tail);
> > -}
> > -
> > -static void cluster_list_add_tail(struct swap_cluster_list *list,
> > - struct swap_cluster_info *ci,
> > - unsigned int idx)
> > -{
> > - if (cluster_list_empty(list)) {
> > - cluster_set_next_flag(&list->head, idx, 0);
> > - cluster_set_next_flag(&list->tail, idx, 0);
> > - } else {
> > - struct swap_cluster_info *ci_tail;
> > - unsigned int tail = cluster_next(&list->tail);
> > -
> > - /*
> > - * Nested cluster lock, but both cluster locks are
> > - * only acquired when we held swap_info_struct->lock
> > - */
> > - ci_tail = ci + tail;
> > - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
> > - cluster_set_next(ci_tail, idx);
> > - spin_unlock(&ci_tail->lock);
> > - cluster_set_next_flag(&list->tail, idx, 0);
> > - }
> > -}
> > -
> > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> > - struct swap_cluster_info *ci)
> > -{
> > - unsigned int idx;
> > -
> > - idx = cluster_next(&list->head);
> > - if (cluster_next(&list->tail) == idx) {
> > - cluster_set_null(&list->head);
> > - cluster_set_null(&list->tail);
> > - } else
> > - cluster_set_next_flag(&list->head,
> > - cluster_next(&ci[idx]), 0);
> > -
> > - return idx;
> > -}
> > -
> > /* Add a cluster to discard list and schedule it to do discard */
> > static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > - unsigned int idx)
> > + struct swap_cluster_info *ci)
> > {
> > + unsigned int idx = ci - si->cluster_info;
> > /*
> > * If scan_swap_map_slots() can't find a free cluster, it will check
> > * si->swap_map directly. To make sure the discarding cluster isn't
> > @@ -462,17 +355,16 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >
> > - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> > -
> > + spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING);
>
> If we don't use ci->lock to protect ci->next, we don't need spin_lock here.

Good point. Thanks.

>
> > + list_add_tail(&ci->next, &si->discard_clusters);
> > + spin_unlock(&ci->lock);
> > schedule_work(&si->discard_work);
> > }
> >
> > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info;
> > -
> > - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> > - cluster_list_add_tail(&si->free_clusters, ci, idx);
> > + ci->flags = CLUSTER_FLAG_FREE;
> > + list_add_tail(&ci->next, &si->free_clusters);
> > }
> >
> > /*
> > @@ -481,21 +373,21 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> > */
> > static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > {
> > - struct swap_cluster_info *info, *ci;
> > + struct swap_cluster_info *ci;
> > unsigned int idx;
> >
> > - info = si->cluster_info;
> > -
> > - while (!cluster_list_empty(&si->discard_clusters)) {
> > - idx = cluster_list_del_first(&si->discard_clusters, info);
> > + while (!list_empty(&si->discard_clusters)) {
> > + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, next);
> > + idx = ci - si->cluster_info;
> > spin_unlock(&si->lock);
> >
> > discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> > SWAPFILE_CLUSTER);
> >
> > spin_lock(&si->lock);
> > - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> > - __free_cluster(si, idx);
> > +
> > + spin_lock(&ci->lock);
> > + __free_cluster(si, ci);
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > 0, SWAPFILE_CLUSTER);
> > unlock_cluster(ci);
> > @@ -521,20 +413,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> > complete(&si->comp);
> > }
> >
> > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info;
> > + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> >
> > - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> > - cluster_list_del_first(&si->free_clusters, ci);
> > - cluster_set_count_flag(ci + idx, 0, 0);
> > + VM_BUG_ON(ci - si->cluster_info != idx);
> > + list_del(&ci->next);
> > + ci->count = 0;
> > + ci->flags = 0;
> > + return ci;
> > }
> >
> > -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info + idx;
> > -
> > - VM_BUG_ON(cluster_count(ci) != 0);
> > + VM_BUG_ON(ci->count != 0);
> > /*
> > * If the swap is discardable, prepare discard the cluster
> > * instead of free it immediately. The cluster will be freed
> > @@ -542,11 +434,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> > */
> > if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> > (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> > - swap_cluster_schedule_discard(si, idx);
> > + swap_cluster_schedule_discard(si, ci);
> > return;
> > }
> >
> > - __free_cluster(si, idx);
> > + __free_cluster(si, ci);
> > }
> >
> > /*
> > @@ -559,15 +451,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
> > unsigned long count)
> > {
> > unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > + struct swap_cluster_info *ci = cluster_info + idx;
> >
> > if (!cluster_info)
> > return;
> > - if (cluster_is_free(&cluster_info[idx]))
> > + if (cluster_is_free(ci))
> > alloc_cluster(p, idx);
> >
> > - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> > - cluster_set_count(&cluster_info[idx],
> > - cluster_count(&cluster_info[idx]) + count);
> > + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> > + ci->count += count;
> > }
> >
> > /*
> > @@ -581,24 +473,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> > }
> >
> > /*
> > - * The cluster corresponding to page_nr decreases one usage. If the usage
> > - * counter becomes 0, which means no page in the cluster is in using, we can
> > - * optionally discard the cluster and add it to free cluster list.
> > + * The cluster ci decreases one usage. If the usage counter becomes 0,
> > + * which means no page in the cluster is in using, we can optionally discard
> > + * the cluster and add it to free cluster list.
> > */
> > -static void dec_cluster_info_page(struct swap_info_struct *p,
> > - struct swap_cluster_info *cluster_info, unsigned long page_nr)
> > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> > {
> > - unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > -
> > - if (!cluster_info)
> > + if (!p->cluster_info)
> > return;
> >
> > - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> > - cluster_set_count(&cluster_info[idx],
> > - cluster_count(&cluster_info[idx]) - 1);
> > + VM_BUG_ON(ci->count == 0);
> > + ci->count--;
> >
> > - if (cluster_count(&cluster_info[idx]) == 0)
> > - free_cluster(p, idx);
> > + if (!ci->count)
> > + free_cluster(p, ci);
> > }
> >
> > /*
> > @@ -611,10 +499,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> > {
> > struct percpu_cluster *percpu_cluster;
> > bool conflict;
> > -
> > + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> > offset /= SWAPFILE_CLUSTER;
> > - conflict = !cluster_list_empty(&si->free_clusters) &&
> > - offset != cluster_list_first(&si->free_clusters) &&
> > + conflict = !list_empty(&si->free_clusters) &&
> > + offset != first - si->cluster_info &&
> > cluster_is_free(&si->cluster_info[offset]);
> >
> > if (!conflict)
> > @@ -655,10 +543,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> > cluster = this_cpu_ptr(si->percpu_cluster);
> > tmp = cluster->next[order];
> > if (tmp == SWAP_NEXT_INVALID) {
> > - if (!cluster_list_empty(&si->free_clusters)) {
> > - tmp = cluster_next(&si->free_clusters.head) *
> > - SWAPFILE_CLUSTER;
> > - } else if (!cluster_list_empty(&si->discard_clusters)) {
> > + if (!list_empty(&si->free_clusters)) {
> > + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> > + list_del(&ci->next);
> > + spin_lock(&ci->lock);
> > + ci->flags = 0;
> > + spin_unlock(&ci->lock);
> > + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> > + } else if (!list_empty(&si->discard_clusters)) {
> > /*
> > * we don't have free cluster but have some clusters in
> > * discarding, do discard now and reclaim them, then
> > @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> > goto new_cluster;
> > } else
> > return false;
> > - }
> > + } else
> > + ci = si->cluster_info + tmp;
> >
> > /*
> > * Other CPUs can use our cluster if they can't find a free cluster,
> > @@ -1062,8 +955,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >
> > ci = lock_cluster(si, offset);
> > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> > - cluster_set_count_flag(ci, 0, 0);
> > - free_cluster(si, idx);
> > + ci->count = 0;
> > + ci->flags = 0;
> > + free_cluster(si, ci);
> > unlock_cluster(ci);
> > swap_range_free(si, offset, SWAPFILE_CLUSTER);
> > }
> > @@ -1336,7 +1230,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> > count = p->swap_map[offset];
> > VM_BUG_ON(count != SWAP_HAS_CACHE);
> > p->swap_map[offset] = 0;
> > - dec_cluster_info_page(p, p->cluster_info, offset);
> > + dec_cluster_info_page(p, ci);
> > unlock_cluster(ci);
> >
> > mem_cgroup_uncharge_swap(entry, 1);
> > @@ -2985,8 +2879,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >
> > nr_good_pages = maxpages - 1; /* omit header page */
> >
> > - cluster_list_init(&p->free_clusters);
> > - cluster_list_init(&p->discard_clusters);
> > + INIT_LIST_HEAD(&p->free_clusters);
> > + INIT_LIST_HEAD(&p->discard_clusters);
> >
> > for (i = 0; i < swap_header->info.nr_badpages; i++) {
> > unsigned int page_nr = swap_header->info.badpages[i];
> > @@ -3037,14 +2931,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> > for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
> > j = (k + col) % SWAP_CLUSTER_COLS;
> > for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> > + struct swap_cluster_info *ci;
> > idx = i * SWAP_CLUSTER_COLS + j;
> > + ci = cluster_info + idx;
> > if (idx >= nr_clusters)
> > continue;
> > - if (cluster_count(&cluster_info[idx]))
> > + if (ci->count)
> > continue;
> > - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
> > - cluster_list_add_tail(&p->free_clusters, cluster_info,
> > - idx);
> > + ci->flags = CLUSTER_FLAG_FREE;
> > + list_add_tail(&ci->next, &p->free_clusters);
> > }
> > }
> > return nr_extents;
>
> --
> Best Regards,
> Huang, Ying
>

2024-05-31 02:05:46

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list

Chris Li <[email protected]> writes:

> On Wed, May 29, 2024 at 1:48 AM Huang, Ying <[email protected]> wrote:
>>
>> Chris Li <[email protected]> writes:
>>
>> > Previously, the swap cluster used a cluster index as a pointer
>> > to construct a custom single link list type "swap_cluster_list".
>> > The next cluster pointer is shared with the cluster->count.
>> > The assumption is that only the free cluster needs to be put
>> > on the list.
>> >
>> > That assumption is not true for mTHP allocators any more.
>>
>> I think that the words aren't correct here. mTHP swap entry allocators
>> can work with current cluster definition.
>
> The current behavior is very problematic though:
>
> If we only allocate and free order 4 swap entries, nothing else. After
> a while, the free cluster will be used up, the swap entry allocation
> will fail even though there is a lot of swap space left.

The original behavior doesn't work well for order-0 allocation too.
percpu_cluster and quick (cluster) scan path cannot be used for
fragmented swap devices.

>> > Need to track the non full cluster on the list as well. Move the
>> > current cluster single link list into standard double link list.
>>
>> It's an optimization to track non-full cluster with a list.
>>
>> I understand that you want to change cluster list definition. I just
>
> In my mind, I was changing the list implementation so it can be
> tracked non free cluster as well.
>
>> feel the wording isn't accurate.
>
> Help me improve it. I am happy to adjust the wording in V2, you can
> provide more feedback then.

I suggest you to focus on improvement. The original implementation
hasn't the assumption that it's the best or perfect. It improves from
its base and you can continue to improve it for more situations.
Describe the situation where current implementation doesn't performance
well and how do you improve it. Better with the cost.

>>
>> > Remove the cluster getter/setter for accessing the cluster
>> > struct member. Move the cluster locking in the caller function
>> > rather than the getter/setter function. That way the locking can
>> > protect more than one member, e.g. cluster->flag.
>>
>> Sorry, I don't understand the locking in above words. I don't find that
>> we lock/unlock in the original getter/setter functions. I found that
>> the cluster locking rule for cluster list is changed. Better to make
>> this explicit.
>
> The original cluster single link list add/remove will require si->lock
> protection as well, because the list head and tail pointer are outside
> of the cluster pointer.
> In this regard, the cluster double link list locking rule is very
> similar. Yes, I move the list_del() outside of the cluster lock, is
> that what you are referring to as the locking change?

In the original implementation, ci->lock is held when changing ci->data
(in fact next here). Now, you don't need the ci->lock. This is a
locking rule change, I suggest you to make it explicit in change log and
comments.

>> > Change cluster code to use "struct swap_cluster_info *" to
>> > reference the cluster rather than by using index. That is more
>> > consistent with the list manipulation. It avoids the repeat
>> > adding index to the cluser_info. The code is easier to understand.
>> >
>> > Remove the cluster next pointer is NULL flag, the double link
>> > list can handle the empty list pretty well.
>> >
>> > The "swap_cluster_info" struct is two pointer bigger, because
>> > 512 swap entries share one swap struct, it has very little impact
>> > on the average memory usage per swap entry. Other than the list
>> > conversion, there is no real function change in this patch.
>>
>> On 64bit platform, the size of swap_cluster_info increases from 8 bytes
>> to 24 bytes. For a 1TB swap device, the memory usage will increase from
>> 4MB to 12MB. This looks OK for me.
>
> Will add the size change calculation in V2 and have you review it again.
>
>>
>> Another choice is to use a customized double linked list using "unsigned
>> int" as pointer to cluster. That will reduce the size of cluster to 16
>> bytes. But it may be not necessary to do that.
>
> We can always do that as a follow up step to optimize the 24 byte to
> 16 byte, at the price of more code complicity.
> The trick part is the link list head, it is not part of the cluster
> array, it does not have an index, and will need a special handle for
> that.

In theory, you can define a "struct list_u32_head" and a set of
list_u32_* functions. But I don't find that it's necessary to do that.

>>
>> Anyway, I think that it's better to add more calculation in changelog
>> for memory usage increment.
>
> Sure, I will adjust the commit message in V2.
>
> Chris
>
>>
>> > ---
>> > include/linux/swap.h | 14 ++--
>> > mm/swapfile.c | 231 ++++++++++++++-------------------------------------
>> > 2 files changed, 68 insertions(+), 177 deletions(-)
>> >
>> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > index 11c53692f65f..0d3906eff3c9 100644
>> > --- a/include/linux/swap.h
>> > +++ b/include/linux/swap.h
>> > @@ -254,11 +254,12 @@ struct swap_cluster_info {
>> > * elements correspond to the swap
>> > * cluster
>> > */
>> > - unsigned int data:24;
>> > + unsigned int count:16;
>> > unsigned int flags:8;
>>
>> If we use 16bits and 8 bits in bit fields, why not just use u8 and u16
>> instead?
> Not sure about the

?

u16 count;
u8 flags;

>>
>> > + struct list_head next;
>>
>> "next" isn't a good naming because prev pointer is in list_head too.
>> The common naming is "list".
>
> Sure, I can change it to "list".
>
>>
>> Need to revise comments for swap_cluster_info.lock and add the locking
>> rule comments for swap_cluster_info.next.
>
> Will do.
>
>>
>> > };
>> > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
>> > +
>> >

[snip]

--
Best Regards,
Huang, Ying

2024-05-31 02:37:21

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Chris Li <[email protected]> writes:

> On Wed, May 29, 2024 at 7:54 PM Huang, Ying <[email protected]> wrote:
>>
>> Chris Li <[email protected]> writes:
>>
>> > Hi Ying,
>> >
>> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <[email protected]> wrote:
>> >>
>> >> Chris Li <[email protected]> writes:
>> >>
>> >> > I am spinning a new version for this series to address two issues
>> >> > found in this series:
>> >> >
>> >> > 1) Oppo discovered a bug in the following line:
>> >> > + ci = si->cluster_info + tmp;
>> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
>> >> > That is a serious bug but trivial to fix.
>> >> >
>> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
>> >> > the cluster->order.
>> >>
>> >> IIUC, now, we only scan swap_map[] only if
>> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
>> >> That is, if you doesn't run low swap free space, you will not do that.
>> >
>> > You can still swap space in order 0 clusters while order 4 runs out of
>> > free_cluster
>> > or nonfull_clusters[order]. For Android that is a common case.
>>
>> When we fail to allocate order 4, we will fallback to order 0. Still
>> don't need to scan swap_map[]. But after looking at your below reply, I
>> realized that the swap space is almost full at most times in your cases.
>> Then, it's possible that we run into scanning swap_map[].
>> list_empty(&si->free_clusters) &&
>> list_empty(&si->nonfull_clusters[order]) will become true, if we put too
>> many clusters in si->percpu_cluster. So, if we want to avoid to scan
>> swap_map[], we can stop add clusters in si->percpu_cluster when swap
>> space runs low. And maybe take clusters out of si->percpu_cluster
>> sometimes.
>
> One idea after reading your reply. If we run out of the
> nonfull_cluster[order], we should be able to use other cpu's
> si->percpu_cluster[] as well. That is a very small win for Android,

This will be useful in general. The number CPU may be large, and
multiple orders may be used.

> because android does not have too many cpu. We are talking about a
> handful of clusters, which might not justify the code complexity. It
> does not change the behavior that order 0 can pollut higher order.

I have a feeling that you don't really know why swap_map[] is scanned.
I suggest you to do more test and tracing to find out the reason. I
suspect that there are some non-full cluster collection issues.

>> Another issue is nonfull_cluster[order1] cannot be used for
>> nonfull_cluster[order2]. In definition, we should not fail order 0
>> allocation, we need to steal nonfull_cluster[order>0] for order 0
>> allocation. This can avoid to scan swap_map[] too. This may be not
>> perfect, but it is the simplest first step implementation. You can
>> optimize based on it further.
>
> Yes, that is listed as the limitation of this cluster order approach.
> Initially we need to support one order well first. We might choose
> what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
> are too big. The sweet spot might be some there in between. If we can
> support one order well, we can demonstrate the value of the mTHP. We
> can worry about other mix orders later.
>
> Do you have any suggestions for how to prevent the order 0 polluting
> the higher order cluster? If we allow that to happen, then it defeats
> the goal of being able to allocate higher order swap entries. The
> tricky question is we don't know how much swap space we should reserve
> for each order. We can always break higher order clusters to lower
> order, but can't do the reserves. The current patch series lets the
> actual usage determine the percentage of the cluster for each order.
> However that seems not enough for the test case Barry has. When the
> app gets OOM kill that is where a large swing of order 0 swap will
> show up and not enough higher order usage for the brief moment. The
> order 0 swap entry will pollute the high order cluster. We are
> currently debating a "knob" to be able to reserve a certain % of swap
> space for a certain order. Those reservations will be guaranteed and
> order 0 swap entry can't pollute them even when it runs out of swap
> space. That can make the mTHP at least usable for the Android case.

IMO, the bottom line is that order-0 allocation is the first class
citizen, we must keep it optimized. And, OOM with free swap space isn't
acceptable. Please consider the policy we used for page allocation.

> Do you see another way to protect the high order cluster polluted by
> lower order one?

If we use high-order page allocation as reference, we need something
like compaction to guarantee high-order allocation finally. But we are
too far from that.

For specific configuration, I believe that we can get reasonable
high-order swap entry allocation success rate for specific use cases.
For example, if we only do limited maximum number order-0 swap entries
allocation, can we keep high-order clusters?

>>
>> And, I checked your code again. It appears that si->percpu_cluster may
>> be put in si->nonfull_cluster[], then be used by another CPU. Please
>> check it.
>
> Ah, good point. I think it does. Let me take a closer look.

--
Best Regards,
Huang, Ying

2024-05-31 12:48:35

by Kairui Song

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Fri, May 31, 2024 at 10:37 AM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Chris Li <[email protected]> writes:
> >>
> >> > Hi Ying,
> >> >
> >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <[email protected]> wrote:
> >> >>
> >> >> Chris Li <[email protected]> writes:
> >> >>
> >> >> > I am spinning a new version for this series to address two issues
> >> >> > found in this series:
> >> >> >
> >> >> > 1) Oppo discovered a bug in the following line:
> >> >> > + ci = si->cluster_info + tmp;
> >> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> >> > That is a serious bug but trivial to fix.
> >> >> >
> >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
> >> >> > the cluster->order.
> >> >>
> >> >> IIUC, now, we only scan swap_map[] only if
> >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
> >> >> That is, if you doesn't run low swap free space, you will not do that.
> >> >
> >> > You can still swap space in order 0 clusters while order 4 runs out of
> >> > free_cluster
> >> > or nonfull_clusters[order]. For Android that is a common case.
> >>
> >> When we fail to allocate order 4, we will fallback to order 0. Still
> >> don't need to scan swap_map[]. But after looking at your below reply, I
> >> realized that the swap space is almost full at most times in your cases.
> >> Then, it's possible that we run into scanning swap_map[].
> >> list_empty(&si->free_clusters) &&
> >> list_empty(&si->nonfull_clusters[order]) will become true, if we put too
> >> many clusters in si->percpu_cluster. So, if we want to avoid to scan
> >> swap_map[], we can stop add clusters in si->percpu_cluster when swap
> >> space runs low. And maybe take clusters out of si->percpu_cluster
> >> sometimes.
> >
> > One idea after reading your reply. If we run out of the
> > nonfull_cluster[order], we should be able to use other cpu's
> > si->percpu_cluster[] as well. That is a very small win for Android,
>
> This will be useful in general. The number CPU may be large, and
> multiple orders may be used.
>
> > because android does not have too many cpu. We are talking about a
> > handful of clusters, which might not justify the code complexity. It
> > does not change the behavior that order 0 can pollut higher order.
>
> I have a feeling that you don't really know why swap_map[] is scanned.
> I suggest you to do more test and tracing to find out the reason. I
> suspect that there are some non-full cluster collection issues.
>
> >> Another issue is nonfull_cluster[order1] cannot be used for
> >> nonfull_cluster[order2]. In definition, we should not fail order 0
> >> allocation, we need to steal nonfull_cluster[order>0] for order 0
> >> allocation. This can avoid to scan swap_map[] too. This may be not
> >> perfect, but it is the simplest first step implementation. You can
> >> optimize based on it further.
> >
> > Yes, that is listed as the limitation of this cluster order approach.
> > Initially we need to support one order well first. We might choose
> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
> > are too big. The sweet spot might be some there in between. If we can
> > support one order well, we can demonstrate the value of the mTHP. We
> > can worry about other mix orders later.
> >
> > Do you have any suggestions for how to prevent the order 0 polluting
> > the higher order cluster? If we allow that to happen, then it defeats
> > the goal of being able to allocate higher order swap entries. The
> > tricky question is we don't know how much swap space we should reserve
> > for each order. We can always break higher order clusters to lower
> > order, but can't do the reserves. The current patch series lets the
> > actual usage determine the percentage of the cluster for each order.
> > However that seems not enough for the test case Barry has. When the
> > app gets OOM kill that is where a large swing of order 0 swap will
> > show up and not enough higher order usage for the brief moment. The
> > order 0 swap entry will pollute the high order cluster. We are
> > currently debating a "knob" to be able to reserve a certain % of swap
> > space for a certain order. Those reservations will be guaranteed and
> > order 0 swap entry can't pollute them even when it runs out of swap
> > space. That can make the mTHP at least usable for the Android case.
>
> IMO, the bottom line is that order-0 allocation is the first class
> citizen, we must keep it optimized. And, OOM with free swap space isn't
> acceptable. Please consider the policy we used for page allocation.
>
> > Do you see another way to protect the high order cluster polluted by
> > lower order one?
>
> If we use high-order page allocation as reference, we need something
> like compaction to guarantee high-order allocation finally. But we are
> too far from that.
>
> For specific configuration, I believe that we can get reasonable
> high-order swap entry allocation success rate for specific use cases.
> For example, if we only do limited maximum number order-0 swap entries
> allocation, can we keep high-order clusters?

Isn't limiting order-0 allocation breaks the bottom line that order-0
allocation is the first class citizen, and should not fail if there is
space?

Just my two cents...

I had a try locally based on Chris's work, allowing order 0 to use
nonfull_clusters as Ying has suggested, and starting with low order
and increase the order until nonfull_cluster[order] is not empty, that
way higher order is just better protected, because unless we ran out
of free_cluster and nonfull_cluster, direct scan won't happen.

More concretely, I applied the following changes, which didn't change
the code much:
- In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then
free_clusters, then discard_cluster.
- If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i)
nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster
returns false.

A quick test still using the memtier test, but decreased the swap
device size from 10G to 8g for higher pressure.

Before:
hugepages-32kB/stats/swpout:34013
hugepages-32kB/stats/swpout_fallback:266
hugepages-512kB/stats/swpout:0
hugepages-512kB/stats/swpout_fallback:77
hugepages-2048kB/stats/swpout:0
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35088
hugepages-64kB/stats/swpout_fallback:66
hugepages-16kB/stats/swpout:31848
hugepages-16kB/stats/swpout_fallback:402
hugepages-256kB/stats/swpout:390
hugepages-256kB/stats/swpout_fallback:7244
hugepages-128kB/stats/swpout:28573
hugepages-128kB/stats/swpout_fallback:474

After:
hugepages-32kB/stats/swpout:31448
hugepages-32kB/stats/swpout_fallback:3354
hugepages-512kB/stats/swpout:30
hugepages-512kB/stats/swpout_fallback:33
hugepages-2048kB/stats/swpout:2
hugepages-2048kB/stats/swpout_fallback:0
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:31255
hugepages-64kB/stats/swpout_fallback:3112
hugepages-16kB/stats/swpout:29931
hugepages-16kB/stats/swpout_fallback:3397
hugepages-256kB/stats/swpout:5223
hugepages-256kB/stats/swpout_fallback:2351
hugepages-128kB/stats/swpout:25600
hugepages-128kB/stats/swpout_fallback:2194

High order (256k) swapout rate are significantly higher, 512k is now
possible, which indicate high orders are better protected, lower
orders are sacrificed but seems worth it.

2024-06-04 07:33:00

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Kairui Song <[email protected]> writes:

> On Fri, May 31, 2024 at 10:37 AM Huang, Ying <[email protected]> wrote:
>>
>> Chris Li <[email protected]> writes:
>>
>> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <[email protected]> wrote:
>> >>
>> >> Chris Li <[email protected]> writes:
>> >>
>> >> > Hi Ying,
>> >> >
>> >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <[email protected]> wrote:
>> >> >>
>> >> >> Chris Li <[email protected]> writes:
>> >> >>
>> >> >> > I am spinning a new version for this series to address two issues
>> >> >> > found in this series:
>> >> >> >
>> >> >> > 1) Oppo discovered a bug in the following line:
>> >> >> > + ci = si->cluster_info + tmp;
>> >> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
>> >> >> > That is a serious bug but trivial to fix.
>> >> >> >
>> >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding
>> >> >> > the cluster->order.
>> >> >>
>> >> >> IIUC, now, we only scan swap_map[] only if
>> >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]).
>> >> >> That is, if you doesn't run low swap free space, you will not do that.
>> >> >
>> >> > You can still swap space in order 0 clusters while order 4 runs out of
>> >> > free_cluster
>> >> > or nonfull_clusters[order]. For Android that is a common case.
>> >>
>> >> When we fail to allocate order 4, we will fallback to order 0. Still
>> >> don't need to scan swap_map[]. But after looking at your below reply, I
>> >> realized that the swap space is almost full at most times in your cases.
>> >> Then, it's possible that we run into scanning swap_map[].
>> >> list_empty(&si->free_clusters) &&
>> >> list_empty(&si->nonfull_clusters[order]) will become true, if we put too
>> >> many clusters in si->percpu_cluster. So, if we want to avoid to scan
>> >> swap_map[], we can stop add clusters in si->percpu_cluster when swap
>> >> space runs low. And maybe take clusters out of si->percpu_cluster
>> >> sometimes.
>> >
>> > One idea after reading your reply. If we run out of the
>> > nonfull_cluster[order], we should be able to use other cpu's
>> > si->percpu_cluster[] as well. That is a very small win for Android,
>>
>> This will be useful in general. The number CPU may be large, and
>> multiple orders may be used.
>>
>> > because android does not have too many cpu. We are talking about a
>> > handful of clusters, which might not justify the code complexity. It
>> > does not change the behavior that order 0 can pollut higher order.
>>
>> I have a feeling that you don't really know why swap_map[] is scanned.
>> I suggest you to do more test and tracing to find out the reason. I
>> suspect that there are some non-full cluster collection issues.
>>
>> >> Another issue is nonfull_cluster[order1] cannot be used for
>> >> nonfull_cluster[order2]. In definition, we should not fail order 0
>> >> allocation, we need to steal nonfull_cluster[order>0] for order 0
>> >> allocation. This can avoid to scan swap_map[] too. This may be not
>> >> perfect, but it is the simplest first step implementation. You can
>> >> optimize based on it further.
>> >
>> > Yes, that is listed as the limitation of this cluster order approach.
>> > Initially we need to support one order well first. We might choose
>> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
>> > are too big. The sweet spot might be some there in between. If we can
>> > support one order well, we can demonstrate the value of the mTHP. We
>> > can worry about other mix orders later.
>> >
>> > Do you have any suggestions for how to prevent the order 0 polluting
>> > the higher order cluster? If we allow that to happen, then it defeats
>> > the goal of being able to allocate higher order swap entries. The
>> > tricky question is we don't know how much swap space we should reserve
>> > for each order. We can always break higher order clusters to lower
>> > order, but can't do the reserves. The current patch series lets the
>> > actual usage determine the percentage of the cluster for each order.
>> > However that seems not enough for the test case Barry has. When the
>> > app gets OOM kill that is where a large swing of order 0 swap will
>> > show up and not enough higher order usage for the brief moment. The
>> > order 0 swap entry will pollute the high order cluster. We are
>> > currently debating a "knob" to be able to reserve a certain % of swap
>> > space for a certain order. Those reservations will be guaranteed and
>> > order 0 swap entry can't pollute them even when it runs out of swap
>> > space. That can make the mTHP at least usable for the Android case.
>>
>> IMO, the bottom line is that order-0 allocation is the first class
>> citizen, we must keep it optimized. And, OOM with free swap space isn't
>> acceptable. Please consider the policy we used for page allocation.
>>
>> > Do you see another way to protect the high order cluster polluted by
>> > lower order one?
>>
>> If we use high-order page allocation as reference, we need something
>> like compaction to guarantee high-order allocation finally. But we are
>> too far from that.
>>
>> For specific configuration, I believe that we can get reasonable
>> high-order swap entry allocation success rate for specific use cases.
>> For example, if we only do limited maximum number order-0 swap entries
>> allocation, can we keep high-order clusters?
>
> Isn't limiting order-0 allocation breaks the bottom line that order-0
> allocation is the first class citizen, and should not fail if there is
> space?

Sorry for confusing words. I mean limiting maximum number order-0 swap
entries allocation in workloads, instead of limiting that in kernel.

> Just my two cents...
>
> I had a try locally based on Chris's work, allowing order 0 to use
> nonfull_clusters as Ying has suggested, and starting with low order
> and increase the order until nonfull_cluster[order] is not empty, that
> way higher order is just better protected, because unless we ran out
> of free_cluster and nonfull_cluster, direct scan won't happen.
>
> More concretely, I applied the following changes, which didn't change
> the code much:
> - In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then
> free_clusters, then discard_cluster.
> - If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i)
> nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster
> returns false.
>
> A quick test still using the memtier test, but decreased the swap
> device size from 10G to 8g for higher pressure.
>
> Before:
> hugepages-32kB/stats/swpout:34013
> hugepages-32kB/stats/swpout_fallback:266
> hugepages-512kB/stats/swpout:0
> hugepages-512kB/stats/swpout_fallback:77
> hugepages-2048kB/stats/swpout:0
> hugepages-2048kB/stats/swpout_fallback:1
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:35088
> hugepages-64kB/stats/swpout_fallback:66
> hugepages-16kB/stats/swpout:31848
> hugepages-16kB/stats/swpout_fallback:402
> hugepages-256kB/stats/swpout:390
> hugepages-256kB/stats/swpout_fallback:7244
> hugepages-128kB/stats/swpout:28573
> hugepages-128kB/stats/swpout_fallback:474
>
> After:
> hugepages-32kB/stats/swpout:31448
> hugepages-32kB/stats/swpout_fallback:3354
> hugepages-512kB/stats/swpout:30
> hugepages-512kB/stats/swpout_fallback:33
> hugepages-2048kB/stats/swpout:2
> hugepages-2048kB/stats/swpout_fallback:0
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:31255
> hugepages-64kB/stats/swpout_fallback:3112
> hugepages-16kB/stats/swpout:29931
> hugepages-16kB/stats/swpout_fallback:3397
> hugepages-256kB/stats/swpout:5223
> hugepages-256kB/stats/swpout_fallback:2351
> hugepages-128kB/stats/swpout:25600
> hugepages-128kB/stats/swpout_fallback:2194
>
> High order (256k) swapout rate are significantly higher, 512k is now
> possible, which indicate high orders are better protected, lower
> orders are sacrificed but seems worth it.

Yes. I think that this reflects another aspect of the problem. In some
situations, it's better to steal one high-order cluster and use it for
order-0 allocation instead of scattering order-0 allocation in random
high-order clusters.

--
Best Regards,
Huang, Ying

2024-06-05 07:08:35

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Thu, May 30, 2024 at 7:37 PM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <[email protected]> wrote:
> > because android does not have too many cpu. We are talking about a
> > handful of clusters, which might not justify the code complexity. It
> > does not change the behavior that order 0 can pollut higher order.
>
> I have a feeling that you don't really know why swap_map[] is scanned.
> I suggest you to do more test and tracing to find out the reason. I
> suspect that there are some non-full cluster collection issues.

Swap_map[] is scanned because of running out of non full clusters.
This can happen because Android tries to make full use of the swapfile.
However, once the swap_map[] scan happens, the non full cluster is polluted.

I currently don't have a local reproduction of the issue Barry reported.
However here is some data point:
Two swap files, one for high order allocation only with this patch. No
fall back.
If there is a non-full cluster collection issue, we should see the
fall back in this case as well.

BTW, same setup without this patch series it will fall back on the
high order allocation as well.

>
> >> Another issue is nonfull_cluster[order1] cannot be used for
> >> nonfull_cluster[order2]. In definition, we should not fail order 0
> >> allocation, we need to steal nonfull_cluster[order>0] for order 0
> >> allocation. This can avoid to scan swap_map[] too. This may be not
> >> perfect, but it is the simplest first step implementation. You can
> >> optimize based on it further.
> >
> > Yes, that is listed as the limitation of this cluster order approach.
> > Initially we need to support one order well first. We might choose
> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
> > are too big. The sweet spot might be some there in between. If we can
> > support one order well, we can demonstrate the value of the mTHP. We
> > can worry about other mix orders later.
> >
> > Do you have any suggestions for how to prevent the order 0 polluting
> > the higher order cluster? If we allow that to happen, then it defeats
> > the goal of being able to allocate higher order swap entries. The
> > tricky question is we don't know how much swap space we should reserve
> > for each order. We can always break higher order clusters to lower
> > order, but can't do the reserves. The current patch series lets the
> > actual usage determine the percentage of the cluster for each order.
> > However that seems not enough for the test case Barry has. When the
> > app gets OOM kill that is where a large swing of order 0 swap will
> > show up and not enough higher order usage for the brief moment. The
> > order 0 swap entry will pollute the high order cluster. We are
> > currently debating a "knob" to be able to reserve a certain % of swap
> > space for a certain order. Those reservations will be guaranteed and
> > order 0 swap entry can't pollute them even when it runs out of swap
> > space. That can make the mTHP at least usable for the Android case.
>
> IMO, the bottom line is that order-0 allocation is the first class
> citizen, we must keep it optimized. And, OOM with free swap space isn't
> acceptable. Please consider the policy we used for page allocation.

We need to make order-0 and high order allocation both can work after
the initial pass of allocating from empty clusters.
Only order-0 allocation work is not good enough.

In the page allocation side, we have the hugetlbfs which reserve some
memory for high order pages.
We should have similar things to allow reserve some high order swap
entries without getting polluted by low order one.

>
> > Do you see another way to protect the high order cluster polluted by
> > lower order one?
>
> If we use high-order page allocation as reference, we need something
> like compaction to guarantee high-order allocation finally. But we are
> too far from that.

We should consider reservation for high-order swap entry allocation
similar to hugetlbfs for memory.
Swap compaction will be very complicated because it needs to scan the
PTE to migrate the swap entry. It might be easier to support folio
write out compound discontiguous swap entries. That is another way to
address the fragmentation issue. We are also too far from that as
right now.

>
> For specific configuration, I believe that we can get reasonable
> high-order swap entry allocation success rate for specific use cases.
> For example, if we only do limited maximum number order-0 swap entries
> allocation, can we keep high-order clusters?

Yes we can. Having a knob to reserve some high order swap space.
Limiting order 0 is the same as having some high order swap entries
reserved.

That is a short term solution.

Chris

2024-06-05 07:30:59

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Fri, May 31, 2024 at 5:40 AM Kairui Song <[email protected]> wrote:
>
> On Fri, May 31, 2024 at 10:37 AM Huang, Ying <[email protected]> wrote:

> >
> > For specific configuration, I believe that we can get reasonable
> > high-order swap entry allocation success rate for specific use cases.
> > For example, if we only do limited maximum number order-0 swap entries
> > allocation, can we keep high-order clusters?
>
> Isn't limiting order-0 allocation breaks the bottom line that order-0
> allocation is the first class citizen, and should not fail if there is
> space?

We need to have high order and low order swap allocation working. Able
to recover from the swapfile full case.

>
> Just my two cents...
>
> I had a try locally based on Chris's work, allowing order 0 to use
> nonfull_clusters as Ying has suggested, and starting with low order
> and increase the order until nonfull_cluster[order] is not empty, that
> way higher order is just better protected, because unless we ran out
> of free_cluster and nonfull_cluster, direct scan won't happen.

That does not help the Android test case Barry is running because
Android tries to keep the swapfile full.
It will hit the case both empty and nonfull list all used up.
When it performs the low memory kill. There will be a big change in
the ratio of low vs high order swap.
Allocating high order swap entries should be able to recover from that.

>
> More concretely, I applied the following changes, which didn't change
> the code much:
> - In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then
> free_clusters, then discard_cluster.

I consider high the nonfull list before the empty list. The current
allocation tries to make the HAS_CACHE only swap entry stay in the
disk for a longer time before recycling it. If the folio is still in
swap cache and not dirty, it can skip the write out and directly reuse
the swap slot during reclaim. I am not sure this code path is
important now, it seems when the swap slot is free, it will remove the
HAS_CACHE as well. BTW, I noticed that the discard cluster doesn't
check if the swap cache has a folio point to it. After discarding it
just set the swap_map to 0. I wonder if swap cache has a folio in that
discarded slot that would hit the skip writeback logic. If that is
triggerable, it would be a corruption bug.

The current SSD allocation also has some command said old SSD can
benefit from not writing to the same block too many times to help the
wear leveling. I don't think that is a big deal now, even cheap SD
cards have wear leveling nowadays.

> - If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i)
> nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster
> returns false.

Ideally to have some option to reserve some high order swap space so
order 0 can't pollute high order clusters.


Chris

>
> A quick test still using the memtier test, but decreased the swap
> device size from 10G to 8g for higher pressure.
>
> Before:
> hugepages-32kB/stats/swpout:34013
> hugepages-32kB/stats/swpout_fallback:266
> hugepages-512kB/stats/swpout:0
> hugepages-512kB/stats/swpout_fallback:77
> hugepages-2048kB/stats/swpout:0
> hugepages-2048kB/stats/swpout_fallback:1
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:35088
> hugepages-64kB/stats/swpout_fallback:66
> hugepages-16kB/stats/swpout:31848
> hugepages-16kB/stats/swpout_fallback:402
> hugepages-256kB/stats/swpout:390
> hugepages-256kB/stats/swpout_fallback:7244
> hugepages-128kB/stats/swpout:28573
> hugepages-128kB/stats/swpout_fallback:474
>
> After:
> hugepages-32kB/stats/swpout:31448
> hugepages-32kB/stats/swpout_fallback:3354
> hugepages-512kB/stats/swpout:30
> hugepages-512kB/stats/swpout_fallback:33
> hugepages-2048kB/stats/swpout:2
> hugepages-2048kB/stats/swpout_fallback:0
> hugepages-1024kB/stats/swpout:0
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:31255
> hugepages-64kB/stats/swpout_fallback:3112
> hugepages-16kB/stats/swpout:29931
> hugepages-16kB/stats/swpout_fallback:3397
> hugepages-256kB/stats/swpout:5223
> hugepages-256kB/stats/swpout_fallback:2351
> hugepages-128kB/stats/swpout:25600
> hugepages-128kB/stats/swpout_fallback:2194
>
> High order (256k) swapout rate are significantly higher, 512k is now
> possible, which indicate high orders are better protected, lower
> orders are sacrificed but seems worth it.
>

2024-06-05 07:40:56

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Tue, Jun 4, 2024 at 12:29 AM Huang, Ying <[email protected]> wrote:
>
> Kairui Song <[email protected]> writes:
>
> > On Fri, May 31, 2024 at 10:37 AM Huang, Ying <[email protected]> wrote:
> > Isn't limiting order-0 allocation breaks the bottom line that order-0
> > allocation is the first class citizen, and should not fail if there is
> > space?
>
> Sorry for confusing words. I mean limiting maximum number order-0 swap
> entries allocation in workloads, instead of limiting that in kernel.

What interface does it use to limit the order 0 swap entries?

I was thinking the kernel would enforce the high order swap space
reservation just like hugetlbfs does on huge pages.
We will need to introduce some interface to specify the reservation.

>
> > Just my two cents...
> >
> > I had a try locally based on Chris's work, allowing order 0 to use
> > nonfull_clusters as Ying has suggested, and starting with low order
> > and increase the order until nonfull_cluster[order] is not empty, that
> > way higher order is just better protected, because unless we ran out
> > of free_cluster and nonfull_cluster, direct scan won't happen.
> >
> > More concretely, I applied the following changes, which didn't change
> > the code much:
> > - In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then
> > free_clusters, then discard_cluster.
> > - If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i)
> > nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster
> > returns false.
> >
> > A quick test still using the memtier test, but decreased the swap
> > device size from 10G to 8g for higher pressure.
> >
> > Before:
> > hugepages-32kB/stats/swpout:34013
> > hugepages-32kB/stats/swpout_fallback:266
> > hugepages-512kB/stats/swpout:0
> > hugepages-512kB/stats/swpout_fallback:77
> > hugepages-2048kB/stats/swpout:0
> > hugepages-2048kB/stats/swpout_fallback:1
> > hugepages-1024kB/stats/swpout:0
> > hugepages-1024kB/stats/swpout_fallback:0
> > hugepages-64kB/stats/swpout:35088
> > hugepages-64kB/stats/swpout_fallback:66
> > hugepages-16kB/stats/swpout:31848
> > hugepages-16kB/stats/swpout_fallback:402
> > hugepages-256kB/stats/swpout:390
> > hugepages-256kB/stats/swpout_fallback:7244
> > hugepages-128kB/stats/swpout:28573
> > hugepages-128kB/stats/swpout_fallback:474
> >
> > After:
> > hugepages-32kB/stats/swpout:31448
> > hugepages-32kB/stats/swpout_fallback:3354
> > hugepages-512kB/stats/swpout:30
> > hugepages-512kB/stats/swpout_fallback:33
> > hugepages-2048kB/stats/swpout:2
> > hugepages-2048kB/stats/swpout_fallback:0
> > hugepages-1024kB/stats/swpout:0
> > hugepages-1024kB/stats/swpout_fallback:0
> > hugepages-64kB/stats/swpout:31255
> > hugepages-64kB/stats/swpout_fallback:3112
> > hugepages-16kB/stats/swpout:29931
> > hugepages-16kB/stats/swpout_fallback:3397
> > hugepages-256kB/stats/swpout:5223
> > hugepages-256kB/stats/swpout_fallback:2351
> > hugepages-128kB/stats/swpout:25600
> > hugepages-128kB/stats/swpout_fallback:2194
> >
> > High order (256k) swapout rate are significantly higher, 512k is now
> > possible, which indicate high orders are better protected, lower
> > orders are sacrificed but seems worth it.
>
> Yes. I think that this reflects another aspect of the problem. In some
> situations, it's better to steal one high-order cluster and use it for
> order-0 allocation instead of scattering order-0 allocation in random
> high-order clusters.

Agree, the scan loop on swap_map[] has the worst pollution.

Chris

2024-06-06 01:57:28

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Chris Li <[email protected]> writes:

> On Thu, May 30, 2024 at 7:37 PM Huang, Ying <[email protected]> wrote:
>>
>> Chris Li <[email protected]> writes:
>>
>> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <[email protected]> wrote:
>> > because android does not have too many cpu. We are talking about a
>> > handful of clusters, which might not justify the code complexity. It
>> > does not change the behavior that order 0 can pollut higher order.
>>
>> I have a feeling that you don't really know why swap_map[] is scanned.
>> I suggest you to do more test and tracing to find out the reason. I
>> suspect that there are some non-full cluster collection issues.
>
> Swap_map[] is scanned because of running out of non full clusters.
> This can happen because Android tries to make full use of the swapfile.
> However, once the swap_map[] scan happens, the non full cluster is polluted.
>
> I currently don't have a local reproduction of the issue Barry reported.
> However here is some data point:
> Two swap files, one for high order allocation only with this patch. No
> fall back.
> If there is a non-full cluster collection issue, we should see the
> fall back in this case as well.
>
> BTW, same setup without this patch series it will fall back on the
> high order allocation as well.
>
>>
>> >> Another issue is nonfull_cluster[order1] cannot be used for
>> >> nonfull_cluster[order2]. In definition, we should not fail order 0
>> >> allocation, we need to steal nonfull_cluster[order>0] for order 0
>> >> allocation. This can avoid to scan swap_map[] too. This may be not
>> >> perfect, but it is the simplest first step implementation. You can
>> >> optimize based on it further.
>> >
>> > Yes, that is listed as the limitation of this cluster order approach.
>> > Initially we need to support one order well first. We might choose
>> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages
>> > are too big. The sweet spot might be some there in between. If we can
>> > support one order well, we can demonstrate the value of the mTHP. We
>> > can worry about other mix orders later.
>> >
>> > Do you have any suggestions for how to prevent the order 0 polluting
>> > the higher order cluster? If we allow that to happen, then it defeats
>> > the goal of being able to allocate higher order swap entries. The
>> > tricky question is we don't know how much swap space we should reserve
>> > for each order. We can always break higher order clusters to lower
>> > order, but can't do the reserves. The current patch series lets the
>> > actual usage determine the percentage of the cluster for each order.
>> > However that seems not enough for the test case Barry has. When the
>> > app gets OOM kill that is where a large swing of order 0 swap will
>> > show up and not enough higher order usage for the brief moment. The
>> > order 0 swap entry will pollute the high order cluster. We are
>> > currently debating a "knob" to be able to reserve a certain % of swap
>> > space for a certain order. Those reservations will be guaranteed and
>> > order 0 swap entry can't pollute them even when it runs out of swap
>> > space. That can make the mTHP at least usable for the Android case.
>>
>> IMO, the bottom line is that order-0 allocation is the first class
>> citizen, we must keep it optimized. And, OOM with free swap space isn't
>> acceptable. Please consider the policy we used for page allocation.
>
> We need to make order-0 and high order allocation both can work after
> the initial pass of allocating from empty clusters.
> Only order-0 allocation work is not good enough.
>
> In the page allocation side, we have the hugetlbfs which reserve some
> memory for high order pages.
> We should have similar things to allow reserve some high order swap
> entries without getting polluted by low order one.

TBH, I don't like the idea of high order swap entries reservation. If
that's really important for you, I think that it's better to design
something like hugetlbfs vs core mm, that is, be separated from the
normal swap subsystem as much as possible.

>>
>> > Do you see another way to protect the high order cluster polluted by
>> > lower order one?
>>
>> If we use high-order page allocation as reference, we need something
>> like compaction to guarantee high-order allocation finally. But we are
>> too far from that.
>
> We should consider reservation for high-order swap entry allocation
> similar to hugetlbfs for memory.
> Swap compaction will be very complicated because it needs to scan the
> PTE to migrate the swap entry. It might be easier to support folio
> write out compound discontiguous swap entries. That is another way to
> address the fragmentation issue. We are also too far from that as
> right now.

That's not easy to write out compound discontiguous swap entries too.
For example, how to put folios in swap cache?

>>
>> For specific configuration, I believe that we can get reasonable
>> high-order swap entry allocation success rate for specific use cases.
>> For example, if we only do limited maximum number order-0 swap entries
>> allocation, can we keep high-order clusters?
>
> Yes we can. Having a knob to reserve some high order swap space.
> Limiting order 0 is the same as having some high order swap entries
> reserved.
>
> That is a short term solution.

--
Best Regards,
Huang, Ying

2024-06-07 09:53:39

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Sorry I'm late to the discussion - I've been out for the last 3.5 weeks and just
getting through my mail now...


On 24/05/2024 18:17, Chris Li wrote:
> This is the short term solutiolns "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.

I've read the article on lwn and look forward to watching the video once
available. The longer term plans look interesting.

>
> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" is introduced, it only allocates the mTHP swap entries
> from new empty cluster list. That works well for PMD size THP,
> but it has a serius fragmentation issue reported by Barry.

Yes, that was a deliberate initial approach to be conservative, just like the
original PMD-size THP support. I'm glad to see work to improve the situation!

>
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>
> The mTHP allocation failure rate raises to almost 100% after a few
> hours in Barry's test run.
>
> The reason is that all the empty cluster has been exhausted while
> there are planty of free swap entries to in the cluster that is
> not 100% free.
>
> Address this by remember the swap allocation order in the cluster.
> Keep track of the per order non full cluster list for later allocation.

I don't immediately see how this helps because memory is swapped back in
per-page (currently), so just because a given cluster was initially filled with
entries of a given order, doesn't mean that those entries are freed in atomic
units; only specific pages could have been swapped back in, meaning the holes
are not of the required order. Additionally, scanning could lead to order-0
pages being populated in random places.

My naive assumption was that the obvious way to solve this problem in the short
term would be to extend the scanning logic to be able to scan for an arbitrary
order. That way you could find an allocation of the required order in any of the
clusters, even a cluster that was not originally allocated for the required order.

I guess I should read your patches to understand exactly what you are doing
rather than making assumptions...

Thanks,
Ryan

>
> This greatly improve the sucess rate of the mTHP swap allocation.
> While I am still waiting for Barry's test result. I paste Kairui's test
> result here:
>
> I'm able to reproduce such an issue with a simple script (enabling all order of mthp):
>
> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
> swapoff -a
> mkswap /dev/ram0
> swapon /dev/ram0
>
> rmdir /sys/fs/cgroup/benchmark
> mkdir -p /sys/fs/cgroup/benchmark
> cd /sys/fs/cgroup/benchmark
> echo 8G > memory.max
> echo $$ > cgroup.procs
>
> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
>
> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
> -P memcache_binary -n allkeys --key-minimum=1 \
> --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
> --ratio 1:0 --pipeline 8 -d 1024
>
> Before:
> Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
> After:
> Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74
>
> And the fallback ratio dropped by a lot:
> Before:
> hugepages-32kB/stats/anon_swpout_fallback:15997
> hugepages-32kB/stats/anon_swpout:18712
> hugepages-512kB/stats/anon_swpout_fallback:192
> hugepages-512kB/stats/anon_swpout:0
> hugepages-2048kB/stats/anon_swpout_fallback:2
> hugepages-2048kB/stats/anon_swpout:0
> hugepages-1024kB/stats/anon_swpout_fallback:0
> hugepages-1024kB/stats/anon_swpout:0
> hugepages-64kB/stats/anon_swpout_fallback:18246
> hugepages-64kB/stats/anon_swpout:17644
> hugepages-16kB/stats/anon_swpout_fallback:13701
> hugepages-16kB/stats/anon_swpout:18234
> hugepages-256kB/stats/anon_swpout_fallback:8642
> hugepages-256kB/stats/anon_swpout:93
> hugepages-128kB/stats/anon_swpout_fallback:21497
> hugepages-128kB/stats/anon_swpout:7596
>
> (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)
>
> After:
> hugepages-32kB/stats/swpout:34445
> hugepages-32kB/stats/swpout_fallback:0
> hugepages-512kB/stats/swpout:1
> hugepages-512kB/stats/swpout_fallback:134
> hugepages-2048kB/stats/swpout:1
> hugepages-2048kB/stats/swpout_fallback:1
> hugepages-1024kB/stats/swpout:6
> hugepages-1024kB/stats/swpout_fallback:0
> hugepages-64kB/stats/swpout:35495
> hugepages-64kB/stats/swpout_fallback:0
> hugepages-16kB/stats/swpout:32441
> hugepages-16kB/stats/swpout_fallback:0
> hugepages-256kB/stats/swpout:2223
> hugepages-256kB/stats/swpout_fallback:6278
> hugepages-128kB/stats/swpout:29136
> hugepages-128kB/stats/swpout_fallback:52
>
> Reported-by: Barry Song <[email protected]>
> Tested-by: Kairui Song <[email protected]>
> Signed-off-by: Chris Li <[email protected]>
> ---
> Chris Li (2):
> mm: swap: swap cluster switch to double link list
> mm: swap: mTHP allocate swap entries from nonfull list
>
> include/linux/swap.h | 18 ++--
> mm/swapfile.c | 252 +++++++++++++++++----------------------------------
> 2 files changed, 93 insertions(+), 177 deletions(-)
> ---
> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
> change-id: 20240523-swap-allocator-1534c480ece4
>
> Best regards,


2024-06-07 10:36:25

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list

On 24/05/2024 18:17, Chris Li wrote:
> Track the nonfull cluster as well as the empty cluster
> on lists. Each order has one nonfull cluster list.
>
> The cluster will remember which order it was used during
> new cluster allocation.
>
> When the cluster has free entry, add to the nonfull[order]
> list.  When the free cluster list is empty, also allocate
> from the nonempty list of that order.
>
> This improves the mTHP swap allocation success rate.

If I've understood correctly, the aim here is to link all the current per-cpu
clusters for a given order together so that if a cpu can't allocate a new
cluster for a given order, then it can steal another CPU's current cluster for
that order?

If that's the intent, couldn't that be done just by iterating over the per-cpu,
per-order cluster pointers? Then you don't need all the linked list churn
(althogh I like the linked list changes as a nice cleanup, I'm not sure the
churn is neccessary for this change?). There would likely need to be some
locking considerations, but it would also allow you to get access to the next
entry within the cluster for allocation.

However, fundamentally, I don't think this change solves the problem; it just
takes a bit longer before the allocation fails. The real problem is
fragmentation due to freeing individual pages from swap entries at different times.

Wouldn't it be better to just extend scanning to support high order allocations?
Then we can steal a high order block from any cluster, even clusters that were
previously full, just like we currently do for order-0. Given we are already
falling back to this path for order-0, I don't think it would be any more
expensive; infact its less expensive because we only scan once for the high
order block, rather than scan for every split order-0 page.

Of course that still doesn't solve the proplem entirely; if swap is so
fragmented that there is no contiguous block of the required order then you
still have to fall back to splitting. As an extra optimization, you could store
the largest contiguous free space available in each cluster to avoid scanning in
case its too small?


>
> There are limitations if the distribution of numbers of
> different orders of mTHP changes a lot. e.g. there are a lot
> of nonfull cluster assign to order A while later time there
> are a lot of order B allocation while very little allocation
> in order A. Currently the cluster used by order A will not
> reused by order B unless the cluster is 100% empty.
>
> This situation is best addressed by the longer term "swap
> buddy allocator", in future patches.
> ---
> include/linux/swap.h | 4 ++++
> mm/swapfile.c | 25 +++++++++++++++++++++++--
> 2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0d3906eff3c9..1b7f0794b9bf 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -255,10 +255,12 @@ struct swap_cluster_info {
> * cluster
> */
> unsigned int count:16;
> + unsigned int order:8;
> unsigned int flags:8;
> struct list_head next;
> };
> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>
>
> /*
> @@ -297,6 +299,8 @@ struct swap_info_struct {
> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> struct list_head free_clusters; /* free clusters list */
> + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> + /* list of cluster that contains at least one free slot */
> unsigned int lowest_bit; /* index of first free in swap_map */
> unsigned int highest_bit; /* index of last free in swap_map */
> unsigned int pages; /* total of usable pages of swap */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 205a60c5f9cb..51923aba500e 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>
> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> {
> + if (ci->flags & CLUSTER_FLAG_NONFULL)
> + list_move_tail(&ci->next, &si->free_clusters);
> + else
> + list_add_tail(&ci->next, &si->free_clusters);
> ci->flags = CLUSTER_FLAG_FREE;
> - list_add_tail(&ci->next, &si->free_clusters);
> }
>
> /*
> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> ci->count--;
>
> if (!ci->count)
> - free_cluster(p, ci);
> + return free_cluster(p, ci);
> +
> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]);
> + ci->flags |= CLUSTER_FLAG_NONFULL;
> + }
> }
>
> /*
> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> list_del(&ci->next);
> spin_lock(&ci->lock);
> + ci->order = order;
> + ci->flags = 0;
> + spin_unlock(&ci->lock);
> + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> + } else if (!list_empty(&si->nonfull_clusters[order])) {
> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next);
> + list_del(&ci->next);
> + spin_lock(&ci->lock);
> ci->flags = 0;
> spin_unlock(&ci->lock);
> tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;

This looks wrong to me; if the cluster is on the nonfull list then it will have
had some entries already allocated (by another cpu). So pointing tmp to the
first block in the cluster will never yield a free block. The cpu from which you
are stealing the cluster stores the next free block location in its per-cpu
structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is
a better approach than the nonfull list?

Additionally, this cluster will be stored back to this cpu's current cluster at
the bottom of the function. That may or may not be what you intended.

> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> break;
> tmp += nr_pages;
> }
> + WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
> unlock_cluster(ci);
> }
> if (tmp >= max) {
> @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> ci = lock_cluster(si, offset);
> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> ci->count = 0;
> + ci->order = 0;
> ci->flags = 0;
> free_cluster(si, ci);
> unlock_cluster(ci);
> @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> INIT_LIST_HEAD(&p->free_clusters);
> INIT_LIST_HEAD(&p->discard_clusters);
>
> + for (i = 0; i < SWAP_NR_ORDERS; i++)
> + INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> +
> for (i = 0; i < swap_header->info.nr_badpages; i++) {
> unsigned int page_nr = swap_header->info.badpages[i];
> if (page_nr == 0 || page_nr > swap_header->info.last_page)
>


2024-06-07 10:49:34

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On 30/05/2024 08:49, Barry Song wrote:
> On Wed, May 29, 2024 at 9:04 AM Chris Li <[email protected]> wrote:
>>
>> I am spinning a new version for this series to address two issues
>> found in this series:
>>
>> 1) Oppo discovered a bug in the following line:
>> + ci = si->cluster_info + tmp;
>> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
>> That is a serious bug but trivial to fix.
>>
>> 2) order 0 allocation currently blindly scans swap_map disregarding
>> the cluster->order. Given enough order 0 swap allocations(close to the
>> swap file size) the order 0 allocation head will eventually sweep
>> across the whole swapfile and destroy other cluster order allocations.
>>
>> The short term fix is just skipping clusters that are already assigned
>> to higher orders.
>>
>> In the long term, I want to unify the non-SSD to use clusters for
>> locking and allocations as well, just try to follow the last
>> allocation (less seeking) as much as possible.
>
> Hi Chris,
>
> I am sharing some new test results with you. This time, we used two
> zRAM devices by modifying get_swap_pages().
>
> zram0 -> dedicated for order-0 swpout
> zram1 -> dedicated for order-4 swpout
>
> We allocate a generous amount of space for zRAM1 to ensure it never gets full
> and always has ample free space. However, we found that Ryan's approach
> does not perform well even in this straightforward scenario. Despite zRAM1
> having 80% of its space remaining, we still experience issues obtaining
> contiguous swap slots and encounter a high swpout_fallback ratio.
>
> Sorry for the report, Ryan :-)

No problem; clearly it needs to be fixed, and I'll help where I can. I'm pretty
sure that this is due to fragmentation preventing clusters from being freed back
to the free list.

>
> In contrast, with your patch, we consistently see the thp_swpout_fallback ratio
> at 0%, indicating a significant improvement in the situation.

Unless I've misunderstood something critical, Chris's change is just allowing a
cpu to steal a block from another cpu's current cluster for that order. So it
just takes longer (approx by a factor of the number of cpus in the system) to
get to the state where fragmentation is causing fallbacks? As I said in the
other thread, I think the more robust solution is to implement scanning for high
order blocks.

>
> Although your patch still has issues supporting the mixing of order-0 and
> order-4 pages in a swap device, it represents a significant improvement.
>
> I would be delighted to witness your approach advancing with Ying
> Huang’s assistance. However, due to my current commitments, I
> regret that I am unable to allocate time for debugging.
>
>>
>> Chris
>>
>>
>>
>> On Fri, May 24, 2024 at 10:17 AM Chris Li <[email protected]> wrote:
>>>
>>> This is the short term solutiolns "swap cluster order" listed
>>> in my "Swap Abstraction" discussion slice 8 in the recent
>>> LSF/MM conference.
>>>
>>> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>>> orders" is introduced, it only allocates the mTHP swap entries
>>> from new empty cluster list. That works well for PMD size THP,
>>> but it has a serius fragmentation issue reported by Barry.
>>>
>>> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>>>
>>> The mTHP allocation failure rate raises to almost 100% after a few
>>> hours in Barry's test run.
>>>
>>> The reason is that all the empty cluster has been exhausted while
>>> there are planty of free swap entries to in the cluster that is
>>> not 100% free.
>>>
>>> Address this by remember the swap allocation order in the cluster.
>>> Keep track of the per order non full cluster list for later allocation.
>>>
>>> This greatly improve the sucess rate of the mTHP swap allocation.
>>> While I am still waiting for Barry's test result. I paste Kairui's test
>>> result here:
>>>
>>> I'm able to reproduce such an issue with a simple script (enabling all order of mthp):
>>>
>>> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
>>> swapoff -a
>>> mkswap /dev/ram0
>>> swapon /dev/ram0
>>>
>>> rmdir /sys/fs/cgroup/benchmark
>>> mkdir -p /sys/fs/cgroup/benchmark
>>> cd /sys/fs/cgroup/benchmark
>>> echo 8G > memory.max
>>> echo $$ > cgroup.procs
>>>
>>> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
>>>
>>> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
>>> -P memcache_binary -n allkeys --key-minimum=1 \
>>> --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
>>> --ratio 1:0 --pipeline 8 -d 1024
>>>
>>> Before:
>>> Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
>>> After:
>>> Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74
>>>
>>> And the fallback ratio dropped by a lot:
>>> Before:
>>> hugepages-32kB/stats/anon_swpout_fallback:15997
>>> hugepages-32kB/stats/anon_swpout:18712
>>> hugepages-512kB/stats/anon_swpout_fallback:192
>>> hugepages-512kB/stats/anon_swpout:0
>>> hugepages-2048kB/stats/anon_swpout_fallback:2
>>> hugepages-2048kB/stats/anon_swpout:0
>>> hugepages-1024kB/stats/anon_swpout_fallback:0
>>> hugepages-1024kB/stats/anon_swpout:0
>>> hugepages-64kB/stats/anon_swpout_fallback:18246
>>> hugepages-64kB/stats/anon_swpout:17644
>>> hugepages-16kB/stats/anon_swpout_fallback:13701
>>> hugepages-16kB/stats/anon_swpout:18234
>>> hugepages-256kB/stats/anon_swpout_fallback:8642
>>> hugepages-256kB/stats/anon_swpout:93
>>> hugepages-128kB/stats/anon_swpout_fallback:21497
>>> hugepages-128kB/stats/anon_swpout:7596
>>>
>>> (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)
>>>
>>> After:
>>> hugepages-32kB/stats/swpout:34445
>>> hugepages-32kB/stats/swpout_fallback:0
>>> hugepages-512kB/stats/swpout:1
>>> hugepages-512kB/stats/swpout_fallback:134
>>> hugepages-2048kB/stats/swpout:1
>>> hugepages-2048kB/stats/swpout_fallback:1
>>> hugepages-1024kB/stats/swpout:6
>>> hugepages-1024kB/stats/swpout_fallback:0
>>> hugepages-64kB/stats/swpout:35495
>>> hugepages-64kB/stats/swpout_fallback:0
>>> hugepages-16kB/stats/swpout:32441
>>> hugepages-16kB/stats/swpout_fallback:0
>>> hugepages-256kB/stats/swpout:2223
>>> hugepages-256kB/stats/swpout_fallback:6278
>>> hugepages-128kB/stats/swpout:29136
>>> hugepages-128kB/stats/swpout_fallback:52
>>>
>>> Reported-by: Barry Song <[email protected]>
>>> Tested-by: Kairui Song <[email protected]>
>>> Signed-off-by: Chris Li <[email protected]>
>>> ---
>>> Chris Li (2):
>>> mm: swap: swap cluster switch to double link list
>>> mm: swap: mTHP allocate swap entries from nonfull list
>>>
>>> include/linux/swap.h | 18 ++--
>>> mm/swapfile.c | 252 +++++++++++++++++----------------------------------
>>> 2 files changed, 93 insertions(+), 177 deletions(-)
>>> ---
>>> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
>>> change-id: 20240523-swap-allocator-1534c480ece4
>>>
>>> Best regards,
>>> --
>>> Chris Li <[email protected]>
>>>
>
> Thanks
> Barry


2024-06-07 10:58:03

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list

On 07/06/2024 11:35, Ryan Roberts wrote:
> On 24/05/2024 18:17, Chris Li wrote:
>> Track the nonfull cluster as well as the empty cluster
>> on lists. Each order has one nonfull cluster list.
>>
>> The cluster will remember which order it was used during
>> new cluster allocation.
>>
>> When the cluster has free entry, add to the nonfull[order]
>> list.  When the free cluster list is empty, also allocate
>> from the nonempty list of that order.
>>
>> This improves the mTHP swap allocation success rate.
>
> If I've understood correctly, the aim here is to link all the current per-cpu
> clusters for a given order together so that if a cpu can't allocate a new
> cluster for a given order, then it can steal another CPU's current cluster for
> that order?
>
> If that's the intent, couldn't that be done just by iterating over the per-cpu,
> per-order cluster pointers? Then you don't need all the linked list churn
> (althogh I like the linked list changes as a nice cleanup, I'm not sure the
> churn is neccessary for this change?). There would likely need to be some
> locking considerations, but it would also allow you to get access to the next
> entry within the cluster for allocation.
>
> However, fundamentally, I don't think this change solves the problem; it just
> takes a bit longer before the allocation fails. The real problem is
> fragmentation due to freeing individual pages from swap entries at different times.
>
> Wouldn't it be better to just extend scanning to support high order allocations?
> Then we can steal a high order block from any cluster, even clusters that were
> previously full, just like we currently do for order-0. Given we are already
> falling back to this path for order-0, I don't think it would be any more
> expensive; infact its less expensive because we only scan once for the high
> order block, rather than scan for every split order-0 page.
>
> Of course that still doesn't solve the proplem entirely; if swap is so
> fragmented that there is no contiguous block of the required order then you
> still have to fall back to splitting. As an extra optimization, you could store
> the largest contiguous free space available in each cluster to avoid scanning in
> case its too small?
>
>
>>
>> There are limitations if the distribution of numbers of
>> different orders of mTHP changes a lot. e.g. there are a lot
>> of nonfull cluster assign to order A while later time there
>> are a lot of order B allocation while very little allocation
>> in order A. Currently the cluster used by order A will not
>> reused by order B unless the cluster is 100% empty.
>>
>> This situation is best addressed by the longer term "swap
>> buddy allocator", in future patches.
>> ---
>> include/linux/swap.h | 4 ++++
>> mm/swapfile.c | 25 +++++++++++++++++++++++--
>> 2 files changed, 27 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 0d3906eff3c9..1b7f0794b9bf 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -255,10 +255,12 @@ struct swap_cluster_info {
>> * cluster
>> */
>> unsigned int count:16;
>> + unsigned int order:8;
>> unsigned int flags:8;
>> struct list_head next;
>> };
>> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>>
>>
>> /*
>> @@ -297,6 +299,8 @@ struct swap_info_struct {
>> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>> struct list_head free_clusters; /* free clusters list */
>> + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>> + /* list of cluster that contains at least one free slot */
>> unsigned int lowest_bit; /* index of first free in swap_map */
>> unsigned int highest_bit; /* index of last free in swap_map */
>> unsigned int pages; /* total of usable pages of swap */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 205a60c5f9cb..51923aba500e 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>
>> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>> {
>> + if (ci->flags & CLUSTER_FLAG_NONFULL)
>> + list_move_tail(&ci->next, &si->free_clusters);
>> + else
>> + list_add_tail(&ci->next, &si->free_clusters);
>> ci->flags = CLUSTER_FLAG_FREE;
>> - list_add_tail(&ci->next, &si->free_clusters);
>> }
>>
>> /*
>> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>> ci->count--;
>>
>> if (!ci->count)
>> - free_cluster(p, ci);
>> + return free_cluster(p, ci);
>> +
>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]);
>> + ci->flags |= CLUSTER_FLAG_NONFULL;
>> + }
>> }
>>
>> /*
>> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
>> list_del(&ci->next);
>> spin_lock(&ci->lock);
>> + ci->order = order;
>> + ci->flags = 0;
>> + spin_unlock(&ci->lock);
>> + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>> + } else if (!list_empty(&si->nonfull_clusters[order])) {
>> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next);
>> + list_del(&ci->next);
>> + spin_lock(&ci->lock);
>> ci->flags = 0;
>> spin_unlock(&ci->lock);
>> tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>
> This looks wrong to me; if the cluster is on the nonfull list then it will have
> had some entries already allocated (by another cpu). So pointing tmp to the
> first block in the cluster will never yield a free block. The cpu from which you
> are stealing the cluster stores the next free block location in its per-cpu
> structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is
> a better approach than the nonfull list?

Ahh; of course the cluster scan below will move this along to a free block.

>
> Additionally, this cluster will be stored back to this cpu's current cluster at
> the bottom of the function. That may or may not be what you intended.
>
>> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> break;
>> tmp += nr_pages;
>> }
>> + WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
>> unlock_cluster(ci);
>> }
>> if (tmp >= max) {
>> @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>> ci = lock_cluster(si, offset);
>> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>> ci->count = 0;
>> + ci->order = 0;
>> ci->flags = 0;
>> free_cluster(si, ci);
>> unlock_cluster(ci);
>> @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>> INIT_LIST_HEAD(&p->free_clusters);
>> INIT_LIST_HEAD(&p->discard_clusters);
>>
>> + for (i = 0; i < SWAP_NR_ORDERS; i++)
>> + INIT_LIST_HEAD(&p->nonfull_clusters[i]);
>> +
>> for (i = 0; i < swap_header->info.nr_badpages; i++) {
>> unsigned int page_nr = swap_header->info.badpages[i];
>> if (page_nr == 0 || page_nr > swap_header->info.last_page)
>>
>


2024-06-07 18:40:40

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>

> > In the page allocation side, we have the hugetlbfs which reserve some
> > memory for high order pages.
> > We should have similar things to allow reserve some high order swap
> > entries without getting polluted by low order one.
>
> TBH, I don't like the idea of high order swap entries reservation.
May I know more if you don't like the idea? I understand this can be
controversial, because previously we like to take the THP as the best
effort approach. If there is some reason we can't make THP, we use the
order 0 as fall back.

For discussion purpose, I want break it down to smaller steps:

First, can we agree that the following usage case is reasonable:
The usage case is that, as Barry has shown, zsmalloc compresses bigger
size than 4K and can have both better compress ratio and CPU
performance gain.
https://lore.kernel.org/linux-mm/[email protected]/

So the goal is to make THP/mTHP have some reasonable success rate
running in the mix size swap allocation, after either low order or
high order swap requests can overflow the swap file size. The allocate
can still recover from that, after some swap entries got free.

Please let me know if you think the above usage case and goal are not
reasonable for the kernel.

> that's really important for you, I think that it's better to design
> something like hugetlbfs vs core mm, that is, be separated from the
> normal swap subsystem as much as possible.

I am giving hugetlbfs just to make the point using reservation, or
isolation of the resource to prevent mixing fragmentation existing in
core mm.
I am not suggesting copying the hugetlbfs implementation to the swap
system. Unlike hugetlbfs, the swap allocation is typically done from
the kernel, it is transparent from the application. I don't think
separate from the swap subsystem is a good way to go.

This comes down to why you don't like the reservation. e.g. if we use
two swapfile, one swapfile is purely allocate for high order, would
that be better?
>
> >>
> >> > Do you see another way to protect the high order cluster polluted by
> >> > lower order one?
> >>
> >> If we use high-order page allocation as reference, we need something
> >> like compaction to guarantee high-order allocation finally. But we are
> >> too far from that.
> >
> > We should consider reservation for high-order swap entry allocation
> > similar to hugetlbfs for memory.
> > Swap compaction will be very complicated because it needs to scan the
> > PTE to migrate the swap entry. It might be easier to support folio
> > write out compound discontiguous swap entries. That is another way to
> > address the fragmentation issue. We are also too far from that as
> > right now.
>
> That's not easy to write out compound discontiguous swap entries too.
> For example, how to put folios in swap cache?

I propose the idea in the recent LSF/MM discussion, the last few
slides are for the discontiguous swap and it has the discontiguous
entries in swap cache.
https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view

Agree it is not an easy change. The cache cache would have to change
the assumption all offset are contiguous.
For swap, we kind of have some in memory data associated with per
offset already, so it might provide an opportunity to combine the
offset related data structure for swap together. Another alternative
might be using xarray without the multi entry property. , just treat
each offset like a single entry. I haven't dug deep into this
direction yet.

We can have more discussion, maybe arrange an upstream alignment
meeting if there is interest.

Chris

2024-06-07 18:49:19

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Fri, Jun 7, 2024 at 2:43 AM Ryan Roberts <[email protected]> wrote:
>
> Sorry I'm late to the discussion - I've been out for the last 3.5 weeks and just
> getting through my mail now...

No problem at all, please take it easy.

>
>
> On 24/05/2024 18:17, Chris Li wrote:
> > This is the short term solutiolns "swap cluster order" listed
> > in my "Swap Abstraction" discussion slice 8 in the recent
> > LSF/MM conference.
>
> I've read the article on lwn and look forward to watching the video once
> available. The longer term plans look interesting.
>
> >
> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > orders" is introduced, it only allocates the mTHP swap entries
> > from new empty cluster list. That works well for PMD size THP,
> > but it has a serius fragmentation issue reported by Barry.
>
> Yes, that was a deliberate initial approach to be conservative, just like the
> original PMD-size THP support. I'm glad to see work to improve the situation!
>
> >
> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >
> > The mTHP allocation failure rate raises to almost 100% after a few
> > hours in Barry's test run.
> >
> > The reason is that all the empty cluster has been exhausted while
> > there are planty of free swap entries to in the cluster that is
> > not 100% free.
> >
> > Address this by remember the swap allocation order in the cluster.
> > Keep track of the per order non full cluster list for later allocation.
>
> I don't immediately see how this helps because memory is swapped back in
> per-page (currently), so just because a given cluster was initially filled with

That is not the case for Barry's setup, he has some other patch series
to swap in mTHP as a whole. Especially in for the mTHP store in the
zsmalloc as bigger than 4K pages.
https://lore.kernel.org/linux-mm/[email protected]/

> entries of a given order, doesn't mean that those entries are freed in atomic
> units; only specific pages could have been swapped back in, meaning the holes
> are not of the required order. Additionally, scanning could lead to order-0
> pages being populated in random places.

Yes, that is a problem we need to address. The proposed short term
solution is to have an isolation scheme preventing the high order swap
entry mix with the lower order one inside one cluster. That is easy to
do and has some test results confirming the reservation/isolation
effect.

>
> My naive assumption was that the obvious way to solve this problem in the short
> term would be to extend the scanning logic to be able to scan for an arbitrary
> order. That way you could find an allocation of the required order in any of the
> clusters, even a cluster that was not originally allocated for the required order.
>
> I guess I should read your patches to understand exactly what you are doing
> rather than making assumptions...

Scanning is not enough. We need to have some way to prevent the
fragmentation from happening.
Once the fragmentation has happened, it can't be easily reversed.
Scanning does not help the fragmentation aspect.

Chris

>
> Thanks,
> Ryan
>
> >
> > This greatly improve the sucess rate of the mTHP swap allocation.
> > While I am still waiting for Barry's test result. I paste Kairui's test
> > result here:
> >
> > I'm able to reproduce such an issue with a simple script (enabling all order of mthp):
> >
> > modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
> > swapoff -a
> > mkswap /dev/ram0
> > swapon /dev/ram0
> >
> > rmdir /sys/fs/cgroup/benchmark
> > mkdir -p /sys/fs/cgroup/benchmark
> > cd /sys/fs/cgroup/benchmark
> > echo 8G > memory.max
> > echo $$ > cgroup.procs
> >
> > memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
> >
> > /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
> > -P memcache_binary -n allkeys --key-minimum=1 \
> > --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
> > --ratio 1:0 --pipeline 8 -d 1024
> >
> > Before:
> > Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
> > After:
> > Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74
> >
> > And the fallback ratio dropped by a lot:
> > Before:
> > hugepages-32kB/stats/anon_swpout_fallback:15997
> > hugepages-32kB/stats/anon_swpout:18712
> > hugepages-512kB/stats/anon_swpout_fallback:192
> > hugepages-512kB/stats/anon_swpout:0
> > hugepages-2048kB/stats/anon_swpout_fallback:2
> > hugepages-2048kB/stats/anon_swpout:0
> > hugepages-1024kB/stats/anon_swpout_fallback:0
> > hugepages-1024kB/stats/anon_swpout:0
> > hugepages-64kB/stats/anon_swpout_fallback:18246
> > hugepages-64kB/stats/anon_swpout:17644
> > hugepages-16kB/stats/anon_swpout_fallback:13701
> > hugepages-16kB/stats/anon_swpout:18234
> > hugepages-256kB/stats/anon_swpout_fallback:8642
> > hugepages-256kB/stats/anon_swpout:93
> > hugepages-128kB/stats/anon_swpout_fallback:21497
> > hugepages-128kB/stats/anon_swpout:7596
> >
> > (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate)
> >
> > After:
> > hugepages-32kB/stats/swpout:34445
> > hugepages-32kB/stats/swpout_fallback:0
> > hugepages-512kB/stats/swpout:1
> > hugepages-512kB/stats/swpout_fallback:134
> > hugepages-2048kB/stats/swpout:1
> > hugepages-2048kB/stats/swpout_fallback:1
> > hugepages-1024kB/stats/swpout:6
> > hugepages-1024kB/stats/swpout_fallback:0
> > hugepages-64kB/stats/swpout:35495
> > hugepages-64kB/stats/swpout_fallback:0
> > hugepages-16kB/stats/swpout:32441
> > hugepages-16kB/stats/swpout_fallback:0
> > hugepages-256kB/stats/swpout:2223
> > hugepages-256kB/stats/swpout_fallback:6278
> > hugepages-128kB/stats/swpout:29136
> > hugepages-128kB/stats/swpout_fallback:52
> >
> > Reported-by: Barry Song <[email protected]>
> > Tested-by: Kairui Song <[email protected]>
> > Signed-off-by: Chris Li <[email protected]>
> > ---
> > Chris Li (2):
> > mm: swap: swap cluster switch to double link list
> > mm: swap: mTHP allocate swap entries from nonfull list
> >
> > include/linux/swap.h | 18 ++--
> > mm/swapfile.c | 252 +++++++++++++++++----------------------------------
> > 2 files changed, 93 insertions(+), 177 deletions(-)
> > ---
> > base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
> > change-id: 20240523-swap-allocator-1534c480ece4
> >
> > Best regards,
>

2024-06-07 18:58:18

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Fri, Jun 7, 2024 at 3:49 AM Ryan Roberts <[email protected]> wrote:
>
> On 30/05/2024 08:49, Barry Song wrote:
> > On Wed, May 29, 2024 at 9:04 AM Chris Li <[email protected]> wrote:
> >>
> >> I am spinning a new version for this series to address two issues
> >> found in this series:
> >>
> >> 1) Oppo discovered a bug in the following line:
> >> + ci = si->cluster_info + tmp;
> >> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp".
> >> That is a serious bug but trivial to fix.
> >>
> >> 2) order 0 allocation currently blindly scans swap_map disregarding
> >> the cluster->order. Given enough order 0 swap allocations(close to the
> >> swap file size) the order 0 allocation head will eventually sweep
> >> across the whole swapfile and destroy other cluster order allocations.
> >>
> >> The short term fix is just skipping clusters that are already assigned
> >> to higher orders.
> >>
> >> In the long term, I want to unify the non-SSD to use clusters for
> >> locking and allocations as well, just try to follow the last
> >> allocation (less seeking) as much as possible.
> >
> > Hi Chris,
> >
> > I am sharing some new test results with you. This time, we used two
> > zRAM devices by modifying get_swap_pages().
> >
> > zram0 -> dedicated for order-0 swpout
> > zram1 -> dedicated for order-4 swpout
> >
> > We allocate a generous amount of space for zRAM1 to ensure it never gets full
> > and always has ample free space. However, we found that Ryan's approach
> > does not perform well even in this straightforward scenario. Despite zRAM1
> > having 80% of its space remaining, we still experience issues obtaining
> > contiguous swap slots and encounter a high swpout_fallback ratio.
> >
> > Sorry for the report, Ryan :-)
>
> No problem; clearly it needs to be fixed, and I'll help where I can. I'm pretty
> sure that this is due to fragmentation preventing clusters from being freed back
> to the free list.
>
> >
> > In contrast, with your patch, we consistently see the thp_swpout_fallback ratio
> > at 0%, indicating a significant improvement in the situation.
>
> Unless I've misunderstood something critical, Chris's change is just allowing a
> cpu to steal a block from another cpu's current cluster for that order. So it

No, that is not the main change. The main change is to allow the CPU
to allocate from the nonfull and non-empty cluster, which are not in
any CPU's current cluster, not in the empty list either. The current
patch does not prevent the CPU from stealing from the other CPU
current order. It will get addressed in V2.

> just takes longer (approx by a factor of the number of cpus in the system) to
> get to the state where fragmentation is causing fallbacks? As I said in the
> other thread, I think the more robust solution is to implement scanning for high
> order blocks.

That will introduce more fragmentation to the high order cluster, and
will make it harder to allocate high order swap entry later.

Please see my previous email for the usage case and the goal of the change.
https://lore.kernel.org/linux-mm/CANeU7QnVzqGKXp9pKDLWiuhqTvBxXupgFCRXejYhshAjw6uDyQ@mail.gmail.com/T/#mf431a743e458896c2ab4a4077b103341313c9cf4

Let's discuss whether the usage case and the goal makes sense or not.

Chris

2024-06-07 20:53:26

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list

On Fri, Jun 7, 2024 at 3:35 AM Ryan Roberts <[email protected]> wrote:
>
> On 24/05/2024 18:17, Chris Li wrote:
> > Track the nonfull cluster as well as the empty cluster
> > on lists. Each order has one nonfull cluster list.
> >
> > The cluster will remember which order it was used during
> > new cluster allocation.
> >
> > When the cluster has free entry, add to the nonfull[order]
> > list. When the free cluster list is empty, also allocate
> > from the nonempty list of that order.
> >
> > This improves the mTHP swap allocation success rate.
>
> If I've understood correctly, the aim here is to link all the current per-cpu
> clusters for a given order together so that if a cpu can't allocate a new
> cluster for a given order, then it can steal another CPU's current cluster for
> that order?

Stealing other CPU's *current* cluster is not the intent. The intent
is after all current per-cpu done with this cluster(full), those full
clusters are not tracked by any per-cpu struct. When those full
clusters become non-full. Track it in the global nonfull cluster list.
The per-cpu allocation can take a cluster from that nonfull cluster
list and start allocating from it.

The V1 code does not specifically check for the stealing behavior, the
V2 code will prevent that from happening. Basically each cluster has 4
states and owners:
1) empty, owned by a global free cluster list.
2) per cpu allocating. owned by per CPU current.
3) nonfull (also non empty). own global nonfull list.
4) full, currently not tracked, we can track it under global full list.

When the per cpu runs out of free cluster, it can take a cluster from
3) and move it to 2).

>
> If that's the intent, couldn't that be done just by iterating over the per-cpu,
> per-order cluster pointers? Then you don't need all the linked list churn

Again, that is not the intent.

> (althogh I like the linked list changes as a nice cleanup, I'm not sure the
> churn is neccessary for this change?). There would likely need to be some
> locking considerations, but it would also allow you to get access to the next
> entry within the cluster for allocation.
>
> However, fundamentally, I don't think this change solves the problem; it just
> takes a bit longer before the allocation fails. The real problem is
> fragmentation due to freeing individual pages from swap entries at different times.

It definitely helps to find nonfull clusters quicker. Please take a
look at my above comment and read the patch again.

>
> Wouldn't it be better to just extend scanning to support high order allocations?
> Then we can steal a high order block from any cluster, even clusters that were

Steal from higher order causes the higher order harder to allocate,
that is downside.
In my mind, ideally have some high order cluster reservation scheme so
the high order one doesn't mix with the low order one.

> previously full, just like we currently do for order-0. Given we are already
> falling back to this path for order-0, I don't think it would be any more
> expensive; infact its less expensive because we only scan once for the high
> order block, rather than scan for every split order-0 page.
>
> Of course that still doesn't solve the proplem entirely; if swap is so
> fragmented that there is no contiguous block of the required order then you
> still have to fall back to splitting. As an extra optimization, you could store

Exactly. That is why I think some high order cluster reservation
scheme is needed for a short term solution.
The change itself is not too complicated if we can agree on this approach.

> the largest contiguous free space available in each cluster to avoid scanning in
> case its too small?

Avoid scanning does just get to the non available high order result quicker.
Does not seem to help increase the high order allocation success rate.

>
>
> >
> > There are limitations if the distribution of numbers of
> > different orders of mTHP changes a lot. e.g. there are a lot
> > of nonfull cluster assign to order A while later time there
> > are a lot of order B allocation while very little allocation
> > in order A. Currently the cluster used by order A will not
> > reused by order B unless the cluster is 100% empty.
> >
> > This situation is best addressed by the longer term "swap
> > buddy allocator", in future patches.
> > ---
> > include/linux/swap.h | 4 ++++
> > mm/swapfile.c | 25 +++++++++++++++++++++++--
> > 2 files changed, 27 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 0d3906eff3c9..1b7f0794b9bf 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -255,10 +255,12 @@ struct swap_cluster_info {
> > * cluster
> > */
> > unsigned int count:16;
> > + unsigned int order:8;
> > unsigned int flags:8;
> > struct list_head next;
> > };
> > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> >
> >
> > /*
> > @@ -297,6 +299,8 @@ struct swap_info_struct {
> > unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> > struct list_head free_clusters; /* free clusters list */
> > + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> > + /* list of cluster that contains at least one free slot */
> > unsigned int lowest_bit; /* index of first free in swap_map */
> > unsigned int highest_bit; /* index of last free in swap_map */
> > unsigned int pages; /* total of usable pages of swap */
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 205a60c5f9cb..51923aba500e 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >
> > static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > {
> > + if (ci->flags & CLUSTER_FLAG_NONFULL)
> > + list_move_tail(&ci->next, &si->free_clusters);
> > + else
> > + list_add_tail(&ci->next, &si->free_clusters);
> > ci->flags = CLUSTER_FLAG_FREE;
> > - list_add_tail(&ci->next, &si->free_clusters);
> > }
> >
> > /*
> > @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> > ci->count--;
> >
> > if (!ci->count)
> > - free_cluster(p, ci);
> > + return free_cluster(p, ci);
> > +
> > + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> > + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]);
> > + ci->flags |= CLUSTER_FLAG_NONFULL;
> > + }
> > }
> >
> > /*
> > @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> > ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> > list_del(&ci->next);
> > spin_lock(&ci->lock);
> > + ci->order = order;
> > + ci->flags = 0;
> > + spin_unlock(&ci->lock);
> > + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> > + } else if (!list_empty(&si->nonfull_clusters[order])) {
> > + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next);
> > + list_del(&ci->next);
> > + spin_lock(&ci->lock);
> > ci->flags = 0;
> > spin_unlock(&ci->lock);
> > tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>
> This looks wrong to me; if the cluster is on the nonfull list then it will have
> had some entries already allocated (by another cpu). So pointing tmp to the
> first block in the cluster will never yield a free block. The cpu from which you

I believe it will scan until it finds a free block with alignment down
in the offset < max loop.

while (offset < max) {
if (swap_range_empty(si->swap_map, offset, nr_pages))
break;
offset += nr_pages;
}

> are stealing the cluster stores the next free block location in its per-cpu
> structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is
> a better approach than the nonfull list?

No, stealing is not the intent. The intent is quickly finding the non
full cluster NOT in other per cpu allocation.

>
> Additionally, this cluster will be stored back to this cpu's current cluster at
> the bottom of the function. That may or may not be what you intended.

That is what I intended. It remembers the current allocating cluster,
in case this cluster has more than one high order swap entries.

Chris

>
> > @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> > break;
> > tmp += nr_pages;
> > }
> > + WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
> > unlock_cluster(ci);
> > }
> > if (tmp >= max) {
> > @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> > ci = lock_cluster(si, offset);
> > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> > ci->count = 0;
> > + ci->order = 0;
> > ci->flags = 0;
> > free_cluster(si, ci);
> > unlock_cluster(ci);
> > @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> > INIT_LIST_HEAD(&p->free_clusters);
> > INIT_LIST_HEAD(&p->discard_clusters);
> >
> > + for (i = 0; i < SWAP_NR_ORDERS; i++)
> > + INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> > +
> > for (i = 0; i < swap_header->info.nr_badpages; i++) {
> > unsigned int page_nr = swap_header->info.badpages[i];
> > if (page_nr == 0 || page_nr > swap_header->info.last_page)
> >
>

2024-06-07 20:54:00

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list

On Fri, Jun 7, 2024 at 3:57 AM Ryan Roberts <[email protected]> wrote:
>
> On 07/06/2024 11:35, Ryan Roberts wrote:
> > On 24/05/2024 18:17, Chris Li wrote:
> >> Track the nonfull cluster as well as the empty cluster
> >> on lists. Each order has one nonfull cluster list.
> >>
> >> The cluster will remember which order it was used during
> >> new cluster allocation.
> >>
> >> When the cluster has free entry, add to the nonfull[order]
> >> list. When the free cluster list is empty, also allocate
> >> from the nonempty list of that order.
> >>
> >> This improves the mTHP swap allocation success rate.
> >
> > If I've understood correctly, the aim here is to link all the current per-cpu
> > clusters for a given order together so that if a cpu can't allocate a new
> > cluster for a given order, then it can steal another CPU's current cluster for
> > that order?
> >
> > If that's the intent, couldn't that be done just by iterating over the per-cpu,
> > per-order cluster pointers? Then you don't need all the linked list churn
> > (althogh I like the linked list changes as a nice cleanup, I'm not sure the
> > churn is neccessary for this change?). There would likely need to be some
> > locking considerations, but it would also allow you to get access to the next
> > entry within the cluster for allocation.
> >
> > However, fundamentally, I don't think this change solves the problem; it just
> > takes a bit longer before the allocation fails. The real problem is
> > fragmentation due to freeing individual pages from swap entries at different times.
> >
> > Wouldn't it be better to just extend scanning to support high order allocations?
> > Then we can steal a high order block from any cluster, even clusters that were
> > previously full, just like we currently do for order-0. Given we are already
> > falling back to this path for order-0, I don't think it would be any more
> > expensive; infact its less expensive because we only scan once for the high
> > order block, rather than scan for every split order-0 page.
> >
> > Of course that still doesn't solve the proplem entirely; if swap is so
> > fragmented that there is no contiguous block of the required order then you
> > still have to fall back to splitting. As an extra optimization, you could store
> > the largest contiguous free space available in each cluster to avoid scanning in
> > case its too small?
> >
> >
> >>
> >> There are limitations if the distribution of numbers of
> >> different orders of mTHP changes a lot. e.g. there are a lot
> >> of nonfull cluster assign to order A while later time there
> >> are a lot of order B allocation while very little allocation
> >> in order A. Currently the cluster used by order A will not
> >> reused by order B unless the cluster is 100% empty.
> >>
> >> This situation is best addressed by the longer term "swap
> >> buddy allocator", in future patches.
> >> ---
> >> include/linux/swap.h | 4 ++++
> >> mm/swapfile.c | 25 +++++++++++++++++++++++--
> >> 2 files changed, 27 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> index 0d3906eff3c9..1b7f0794b9bf 100644
> >> --- a/include/linux/swap.h
> >> +++ b/include/linux/swap.h
> >> @@ -255,10 +255,12 @@ struct swap_cluster_info {
> >> * cluster
> >> */
> >> unsigned int count:16;
> >> + unsigned int order:8;
> >> unsigned int flags:8;
> >> struct list_head next;
> >> };
> >> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> >> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> >>
> >>
> >> /*
> >> @@ -297,6 +299,8 @@ struct swap_info_struct {
> >> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> >> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> >> struct list_head free_clusters; /* free clusters list */
> >> + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> >> + /* list of cluster that contains at least one free slot */
> >> unsigned int lowest_bit; /* index of first free in swap_map */
> >> unsigned int highest_bit; /* index of last free in swap_map */
> >> unsigned int pages; /* total of usable pages of swap */
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index 205a60c5f9cb..51923aba500e 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >>
> >> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >> {
> >> + if (ci->flags & CLUSTER_FLAG_NONFULL)
> >> + list_move_tail(&ci->next, &si->free_clusters);
> >> + else
> >> + list_add_tail(&ci->next, &si->free_clusters);
> >> ci->flags = CLUSTER_FLAG_FREE;
> >> - list_add_tail(&ci->next, &si->free_clusters);
> >> }
> >>
> >> /*
> >> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> >> ci->count--;
> >>
> >> if (!ci->count)
> >> - free_cluster(p, ci);
> >> + return free_cluster(p, ci);
> >> +
> >> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >> + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]);
> >> + ci->flags |= CLUSTER_FLAG_NONFULL;
> >> + }
> >> }
> >>
> >> /*
> >> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> >> list_del(&ci->next);
> >> spin_lock(&ci->lock);
> >> + ci->order = order;
> >> + ci->flags = 0;
> >> + spin_unlock(&ci->lock);
> >> + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> >> + } else if (!list_empty(&si->nonfull_clusters[order])) {
> >> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next);
> >> + list_del(&ci->next);
> >> + spin_lock(&ci->lock);
> >> ci->flags = 0;
> >> spin_unlock(&ci->lock);
> >> tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> >
> > This looks wrong to me; if the cluster is on the nonfull list then it will have
> > had some entries already allocated (by another cpu). So pointing tmp to the
> > first block in the cluster will never yield a free block. The cpu from which you
> > are stealing the cluster stores the next free block location in its per-cpu
> > structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is
> > a better approach than the nonfull list?
>
> Ahh; of course the cluster scan below will move this along to a free block.

You mean the (offset < max) loop, right?

Agree.

Chris

>
> >
> > Additionally, this cluster will be stored back to this cpu's current cluster at
> > the bottom of the function. That may or may not be what you intended.
> >
> >> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >> break;
> >> tmp += nr_pages;
> >> }
> >> + WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
> >> unlock_cluster(ci);
> >> }
> >> if (tmp >= max) {
> >> @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >> ci = lock_cluster(si, offset);
> >> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> >> ci->count = 0;
> >> + ci->order = 0;
> >> ci->flags = 0;
> >> free_cluster(si, ci);
> >> unlock_cluster(ci);
> >> @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >> INIT_LIST_HEAD(&p->free_clusters);
> >> INIT_LIST_HEAD(&p->discard_clusters);
> >>
> >> + for (i = 0; i < SWAP_NR_ORDERS; i++)
> >> + INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> >> +
> >> for (i = 0; i < swap_header->info.nr_badpages; i++) {
> >> unsigned int page_nr = swap_header->info.badpages[i];
> >> if (page_nr == 0 || page_nr > swap_header->info.last_page)
> >>
> >
>

2024-06-10 11:19:01

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list

On 07/06/2024 21:52, Chris Li wrote:
> On Fri, Jun 7, 2024 at 3:35 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 24/05/2024 18:17, Chris Li wrote:
>>> Track the nonfull cluster as well as the empty cluster
>>> on lists. Each order has one nonfull cluster list.
>>>
>>> The cluster will remember which order it was used during
>>> new cluster allocation.
>>>
>>> When the cluster has free entry, add to the nonfull[order]
>>> list. When the free cluster list is empty, also allocate
>>> from the nonempty list of that order.
>>>
>>> This improves the mTHP swap allocation success rate.
>>
>> If I've understood correctly, the aim here is to link all the current per-cpu
>> clusters for a given order together so that if a cpu can't allocate a new
>> cluster for a given order, then it can steal another CPU's current cluster for
>> that order?
>
> Stealing other CPU's *current* cluster is not the intent. The intent
> is after all current per-cpu done with this cluster(full), those full
> clusters are not tracked by any per-cpu struct. When those full
> clusters become non-full. Track it in the global nonfull cluster list.
> The per-cpu allocation can take a cluster from that nonfull cluster
> list and start allocating from it.
>
> The V1 code does not specifically check for the stealing behavior, the
> V2 code will prevent that from happening. Basically each cluster has 4
> states and owners:
> 1) empty, owned by a global free cluster list.
> 2) per cpu allocating. owned by per CPU current.
> 3) nonfull (also non empty). own global nonfull list.
> 4) full, currently not tracked, we can track it under global full list.
>
> When the per cpu runs out of free cluster, it can take a cluster from
> 3) and move it to 2).

OK, sorry for my misunderstanding, and thanks for the explanaiton. I've taken a
proper look at the patch and with this explanation now understand the intent.

I guess in effect, this is a scanning approach, but you are limiting the
clusters that you scan to those that were originally allocated for the required
order and which are known to have some free space.

I guess this will work more effectively with Barry's series that swaps in a
whole large folio in one go, because it is more likely that holes of the
required size will appear in the non-full clusters. But previous discussions
concluded that it was not always going to be the right approach to swap-in large
folios in one go (certainly not for higer orders). So I don't think you can
(yet) rely on swap slots being freed as order-sized blocks. That said I still
think that your approach should improve the situation even without Barry's series.

In fact, why don't you put the cluster on the non-full list at cluster
allocation time? Then you can also allow a cpu to steal from another cpu's
current cluster (as I initially thought you were doing). I think this should
work with pretty much the same logic? And improve chances of allocation without
increasing chances of fragmentation? (more on this below).

>
>>
>> If that's the intent, couldn't that be done just by iterating over the per-cpu,
>> per-order cluster pointers? Then you don't need all the linked list churn
>
> Again, that is not the intent.
>
>> (althogh I like the linked list changes as a nice cleanup, I'm not sure the
>> churn is neccessary for this change?). There would likely need to be some
>> locking considerations, but it would also allow you to get access to the next
>> entry within the cluster for allocation.
>>
>> However, fundamentally, I don't think this change solves the problem; it just
>> takes a bit longer before the allocation fails. The real problem is
>> fragmentation due to freeing individual pages from swap entries at different times.
>
> It definitely helps to find nonfull clusters quicker. Please take a
> look at my above comment and read the patch again.
>
>>
>> Wouldn't it be better to just extend scanning to support high order allocations?
>> Then we can steal a high order block from any cluster, even clusters that were
>
> Steal from higher order causes the higher order harder to allocate,
> that is downside.
> In my mind, ideally have some high order cluster reservation scheme so
> the high order one doesn't mix with the low order one.

Yes, that would make sense; you could limit the number of clusters allocated for
each order at any given time.

Order-0 stealing will still cause problems. You could probably just remove that
and limit order-0 scanning/stealing to clusters that were originally allocated
for order-0 too, using the same logic.

>
>> previously full, just like we currently do for order-0. Given we are already
>> falling back to this path for order-0, I don't think it would be any more
>> expensive; infact its less expensive because we only scan once for the high
>> order block, rather than scan for every split order-0 page.
>>
>> Of course that still doesn't solve the proplem entirely; if swap is so
>> fragmented that there is no contiguous block of the required order then you
>> still have to fall back to splitting. As an extra optimization, you could store
>
> Exactly. That is why I think some high order cluster reservation
> scheme is needed for a short term solution.
> The change itself is not too complicated if we can agree on this approach.
>
>> the largest contiguous free space available in each cluster to avoid scanning in
>> case its too small?
>
> Avoid scanning does just get to the non available high order result quicker.
> Does not seem to help increase the high order allocation success rate.
>
>>
>>
>>>
>>> There are limitations if the distribution of numbers of
>>> different orders of mTHP changes a lot. e.g. there are a lot
>>> of nonfull cluster assign to order A while later time there
>>> are a lot of order B allocation while very little allocation
>>> in order A. Currently the cluster used by order A will not
>>> reused by order B unless the cluster is 100% empty.
>>>
>>> This situation is best addressed by the longer term "swap
>>> buddy allocator", in future patches.
>>> ---
>>> include/linux/swap.h | 4 ++++
>>> mm/swapfile.c | 25 +++++++++++++++++++++++--
>>> 2 files changed, 27 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 0d3906eff3c9..1b7f0794b9bf 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -255,10 +255,12 @@ struct swap_cluster_info {
>>> * cluster
>>> */
>>> unsigned int count:16;
>>> + unsigned int order:8;
>>> unsigned int flags:8;
>>> struct list_head next;
>>> };
>>> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>>>
>>>
>>> /*
>>> @@ -297,6 +299,8 @@ struct swap_info_struct {
>>> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
>>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>>> struct list_head free_clusters; /* free clusters list */
>>> + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>>> + /* list of cluster that contains at least one free slot */
>>> unsigned int lowest_bit; /* index of first free in swap_map */
>>> unsigned int highest_bit; /* index of last free in swap_map */
>>> unsigned int pages; /* total of usable pages of swap */
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 205a60c5f9cb..51923aba500e 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>>
>>> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>>> {
>>> + if (ci->flags & CLUSTER_FLAG_NONFULL)
>>> + list_move_tail(&ci->next, &si->free_clusters);
>>> + else
>>> + list_add_tail(&ci->next, &si->free_clusters);
>>> ci->flags = CLUSTER_FLAG_FREE;
>>> - list_add_tail(&ci->next, &si->free_clusters);
>>> }
>>>
>>> /*
>>> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>>> ci->count--;
>>>
>>> if (!ci->count)
>>> - free_cluster(p, ci);
>>> + return free_cluster(p, ci);
>>> +
>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>> + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]);
>>> + ci->flags |= CLUSTER_FLAG_NONFULL;
>>> + }
>>> }
>>>
>>> /*
>>> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
>>> list_del(&ci->next);
>>> spin_lock(&ci->lock);
>>> + ci->order = order;
>>> + ci->flags = 0;
>>> + spin_unlock(&ci->lock);
>>> + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>>> + } else if (!list_empty(&si->nonfull_clusters[order])) {

You are preferring to scan the nonfull clusters over doing discard and
allocating a newly freed cluster; wouldn't it be better to prefer discard over
nonfull scanning?

>>> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next);
>>> + list_del(&ci->next);

I'm struggling a bit with what the value of the nonfull_clusters linked list is.
I wonder if it would be simpler to just track the number of free slots in the
cluster and iterate over the clusters, scanning the ones with the desired order
and which have at least (1 << order) free slots? I guess this list is giving you
an ordering such that the cluster you pull off the list first had its first slot
freed longest ago so it is most likely to have most space?

>>> + spin_lock(&ci->lock);
>>> ci->flags = 0;
>>> spin_unlock(&ci->lock);
>>> tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>>
>> This looks wrong to me; if the cluster is on the nonfull list then it will have
>> had some entries already allocated (by another cpu). So pointing tmp to the
>> first block in the cluster will never yield a free block. The cpu from which you
>
> I believe it will scan until it finds a free block with alignment down
> in the offset < max loop.

Yes agreed, my bad. Sorry about that.

Thanks,
Ryan

>
> while (offset < max) {
> if (swap_range_empty(si->swap_map, offset, nr_pages))
> break;
> offset += nr_pages;
> }
>
>> are stealing the cluster stores the next free block location in its per-cpu
>> structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is
>> a better approach than the nonfull list?
>
> No, stealing is not the intent. The intent is quickly finding the non
> full cluster NOT in other per cpu allocation.
>
>>
>> Additionally, this cluster will be stored back to this cpu's current cluster at
>> the bottom of the function. That may or may not be what you intended.
>
> That is what I intended. It remembers the current allocating cluster,
> in case this cluster has more than one high order swap entries.
>
> Chris
>
>>
>>> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> break;
>>> tmp += nr_pages;
>>> }
>>> + WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
>>> unlock_cluster(ci);
>>> }
>>> if (tmp >= max) {
>>> @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>> ci = lock_cluster(si, offset);
>>> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>>> ci->count = 0;
>>> + ci->order = 0;
>>> ci->flags = 0;
>>> free_cluster(si, ci);
>>> unlock_cluster(ci);
>>> @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>>> INIT_LIST_HEAD(&p->free_clusters);
>>> INIT_LIST_HEAD(&p->discard_clusters);
>>>
>>> + for (i = 0; i < SWAP_NR_ORDERS; i++)
>>> + INIT_LIST_HEAD(&p->nonfull_clusters[i]);
>>> +
>>> for (i = 0; i < swap_header->info.nr_badpages; i++) {
>>> unsigned int page_nr = swap_header->info.badpages[i];
>>> if (page_nr == 0 || page_nr > swap_header->info.last_page)
>>>
>>


2024-06-11 02:38:32

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Chris Li <[email protected]> writes:

> On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <[email protected]> wrote:
>>
>> Chris Li <[email protected]> writes:
>>
>
>> > In the page allocation side, we have the hugetlbfs which reserve some
>> > memory for high order pages.
>> > We should have similar things to allow reserve some high order swap
>> > entries without getting polluted by low order one.
>>
>> TBH, I don't like the idea of high order swap entries reservation.
> May I know more if you don't like the idea? I understand this can be
> controversial, because previously we like to take the THP as the best
> effort approach. If there is some reason we can't make THP, we use the
> order 0 as fall back.
>
> For discussion purpose, I want break it down to smaller steps:
>
> First, can we agree that the following usage case is reasonable:
> The usage case is that, as Barry has shown, zsmalloc compresses bigger
> size than 4K and can have both better compress ratio and CPU
> performance gain.
> https://lore.kernel.org/linux-mm/[email protected]/
>
> So the goal is to make THP/mTHP have some reasonable success rate
> running in the mix size swap allocation, after either low order or
> high order swap requests can overflow the swap file size. The allocate
> can still recover from that, after some swap entries got free.
>
> Please let me know if you think the above usage case and goal are not
> reasonable for the kernel.

I think that it's reasonable to improve the success rate of high-order
swap entries allocation. I just think that it's hard to use the
reservation based method. For example, how much should be reserved?
Why system OOM when there's still swap space available? And so forth.
So, I prefer the transparent methods. Just like THP vs. hugetlbfs.

>> that's really important for you, I think that it's better to design
>> something like hugetlbfs vs core mm, that is, be separated from the
>> normal swap subsystem as much as possible.
>
> I am giving hugetlbfs just to make the point using reservation, or
> isolation of the resource to prevent mixing fragmentation existing in
> core mm.
> I am not suggesting copying the hugetlbfs implementation to the swap
> system. Unlike hugetlbfs, the swap allocation is typically done from
> the kernel, it is transparent from the application. I don't think
> separate from the swap subsystem is a good way to go.
>
> This comes down to why you don't like the reservation. e.g. if we use
> two swapfile, one swapfile is purely allocate for high order, would
> that be better?

Sorry, my words weren't accurate. Personally, I just think that it's
better to make reservation related code not too intrusive.

And, before reservation, we need to consider something else firstly.
Whether is it generally good to swap-in with swap-out order? Should we
consider memory wastage too? One static policy doesn't fit all, we may
need either a dynamic policy, or make policy configurable.

In general, I think that we need to do this step by step.

>> >> > Do you see another way to protect the high order cluster polluted by
>> >> > lower order one?
>> >>
>> >> If we use high-order page allocation as reference, we need something
>> >> like compaction to guarantee high-order allocation finally. But we are
>> >> too far from that.
>> >
>> > We should consider reservation for high-order swap entry allocation
>> > similar to hugetlbfs for memory.
>> > Swap compaction will be very complicated because it needs to scan the
>> > PTE to migrate the swap entry. It might be easier to support folio
>> > write out compound discontiguous swap entries. That is another way to
>> > address the fragmentation issue. We are also too far from that as
>> > right now.
>>
>> That's not easy to write out compound discontiguous swap entries too.
>> For example, how to put folios in swap cache?
>
> I propose the idea in the recent LSF/MM discussion, the last few
> slides are for the discontiguous swap and it has the discontiguous
> entries in swap cache.
> https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
>
> Agree it is not an easy change. The cache cache would have to change
> the assumption all offset are contiguous.
> For swap, we kind of have some in memory data associated with per
> offset already, so it might provide an opportunity to combine the
> offset related data structure for swap together. Another alternative
> might be using xarray without the multi entry property. , just treat
> each offset like a single entry. I haven't dug deep into this
> direction yet.

Thanks! I will study your idea.

> We can have more discussion, maybe arrange an upstream alignment
> meeting if there is interest.

Sure.

--
Best Regards,
Huang, Ying

2024-06-11 06:10:26

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list

On Mon, Jun 10, 2024 at 4:18 AM Ryan Roberts <[email protected]> wrote:
>
> On 07/06/2024 21:52, Chris Li wrote:
> > On Fri, Jun 7, 2024 at 3:35 AM Ryan Roberts <[email protected]> wrote:
> > Stealing other CPU's *current* cluster is not the intent. The intent
> > is after all current per-cpu done with this cluster(full), those full
> > clusters are not tracked by any per-cpu struct. When those full
> > clusters become non-full. Track it in the global nonfull cluster list.
> > The per-cpu allocation can take a cluster from that nonfull cluster
> > list and start allocating from it.
> >
> > The V1 code does not specifically check for the stealing behavior, the
> > V2 code will prevent that from happening. Basically each cluster has 4
> > states and owners:
> > 1) empty, owned by a global free cluster list.
> > 2) per cpu allocating. owned by per CPU current.
> > 3) nonfull (also non empty). own global nonfull list.
> > 4) full, currently not tracked, we can track it under global full list.
> >
> > When the per cpu runs out of free cluster, it can take a cluster from
> > 3) and move it to 2).
>
> OK, sorry for my misunderstanding, and thanks for the explanaiton. I've taken a
> proper look at the patch and with this explanation now understand the intent.
>
> I guess in effect, this is a scanning approach, but you are limiting the
> clusters that you scan to those that were originally allocated for the required
> order and which are known to have some free space.

Yes, not only some free space. When we swap in the same size as the
swap out size, we can have the cluster dedicated to that order. Avoid
mixing order in that cluster.

>
> I guess this will work more effectively with Barry's series that swaps in a
> whole large folio in one go, because it is more likely that holes of the
> required size will appear in the non-full clusters. But previous discussions

Ack.

> concluded that it was not always going to be the right approach to swap-in large
> folios in one go (certainly not for higer orders). So I don't think you can

We need to start from somewhere. Right now it is still too early to
say which approach will win out. I think it is beneficial to try out
swapin the same size as swap out, and see how the data play out. 4K is
too small and 2M is too big. Maybe one of the mTHP size would be the
sweet spot.

> (yet) rely on swap slots being freed as order-sized blocks. That said I still
> think that your approach should improve the situation even without Barry's series.

Agree.

> In fact, why don't you put the cluster on the non-full list at cluster
> allocation time? Then you can also allow a cpu to steal from another cpu's

That would be the reservation approach. If we know how much swap space
we want for that order in advance. We can make the reservation at swap
on time, then that order will only allocate from the reserved cluster.

> current cluster (as I initially thought you were doing). I think this should

For stealing from other cpu's current order cluster, we can have a
global list of clusters for per_cpu[order] for the cluster. It's
better to track it separately as the other nonfull cluster. We only
steal from other CPU when we run out of global nonfull clusters. The
current patch allows stealing from other CPU as well. Just might
happen before it runs out of the nonfull cluster. Might not be too big
a deal.

> work with pretty much the same logic? And improve chances of allocation without
> increasing chances of fragmentation? (more on this below).
>
> >
> >>
> >> If that's the intent, couldn't that be done just by iterating over the per-cpu,
> >> per-order cluster pointers? Then you don't need all the linked list churn
> >
> > Again, that is not the intent.
> >
> >> (althogh I like the linked list changes as a nice cleanup, I'm not sure the
> >> churn is neccessary for this change?). There would likely need to be some
> >> locking considerations, but it would also allow you to get access to the next
> >> entry within the cluster for allocation.
> >>
> >> However, fundamentally, I don't think this change solves the problem; it just
> >> takes a bit longer before the allocation fails. The real problem is
> >> fragmentation due to freeing individual pages from swap entries at different times.
> >
> > It definitely helps to find nonfull clusters quicker. Please take a
> > look at my above comment and read the patch again.
> >
> >>
> >> Wouldn't it be better to just extend scanning to support high order allocations?
> >> Then we can steal a high order block from any cluster, even clusters that were
> >
> > Steal from higher order causes the higher order harder to allocate,
> > that is downside.
> > In my mind, ideally have some high order cluster reservation scheme so
> > the high order one doesn't mix with the low order one.
>
> Yes, that would make sense; you could limit the number of clusters allocated for
> each order at any given time.
>
> Order-0 stealing will still cause problems. You could probably just remove that
> and limit order-0 scanning/stealing to clusters that were originally allocated
> for order-0 too, using the same logic.

Yes, that is the plan.

If we reserve some space for higher order, then order-0 can't steal
from the reserved higher order cluster.

>
> >
> >> previously full, just like we currently do for order-0. Given we are already
> >> falling back to this path for order-0, I don't think it would be any more
> >> expensive; infact its less expensive because we only scan once for the high
> >> order block, rather than scan for every split order-0 page.
> >>
> >> Of course that still doesn't solve the proplem entirely; if swap is so
> >> fragmented that there is no contiguous block of the required order then you
> >> still have to fall back to splitting. As an extra optimization, you could store
> >
> > Exactly. That is why I think some high order cluster reservation
> > scheme is needed for a short term solution.
> > The change itself is not too complicated if we can agree on this approach.
> >
> >> the largest contiguous free space available in each cluster to avoid scanning in
> >> case its too small?
> >
> > Avoid scanning does just get to the non available high order result quicker.
> > Does not seem to help increase the high order allocation success rate.
> >
> >>
> >>
> >>>
> >>> There are limitations if the distribution of numbers of
> >>> different orders of mTHP changes a lot. e.g. there are a lot
> >>> of nonfull cluster assign to order A while later time there
> >>> are a lot of order B allocation while very little allocation
> >>> in order A. Currently the cluster used by order A will not
> >>> reused by order B unless the cluster is 100% empty.
> >>>
> >>> This situation is best addressed by the longer term "swap
> >>> buddy allocator", in future patches.
> >>> ---
> >>> include/linux/swap.h | 4 ++++
> >>> mm/swapfile.c | 25 +++++++++++++++++++++++--
> >>> 2 files changed, 27 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>> index 0d3906eff3c9..1b7f0794b9bf 100644
> >>> --- a/include/linux/swap.h
> >>> +++ b/include/linux/swap.h
> >>> @@ -255,10 +255,12 @@ struct swap_cluster_info {
> >>> * cluster
> >>> */
> >>> unsigned int count:16;
> >>> + unsigned int order:8;
> >>> unsigned int flags:8;
> >>> struct list_head next;
> >>> };
> >>> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> >>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> >>>
> >>>
> >>> /*
> >>> @@ -297,6 +299,8 @@ struct swap_info_struct {
> >>> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> >>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> >>> struct list_head free_clusters; /* free clusters list */
> >>> + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> >>> + /* list of cluster that contains at least one free slot */
> >>> unsigned int lowest_bit; /* index of first free in swap_map */
> >>> unsigned int highest_bit; /* index of last free in swap_map */
> >>> unsigned int pages; /* total of usable pages of swap */
> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>> index 205a60c5f9cb..51923aba500e 100644
> >>> --- a/mm/swapfile.c
> >>> +++ b/mm/swapfile.c
> >>> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >>>
> >>> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >>> {
> >>> + if (ci->flags & CLUSTER_FLAG_NONFULL)
> >>> + list_move_tail(&ci->next, &si->free_clusters);
> >>> + else
> >>> + list_add_tail(&ci->next, &si->free_clusters);
> >>> ci->flags = CLUSTER_FLAG_FREE;
> >>> - list_add_tail(&ci->next, &si->free_clusters);
> >>> }
> >>>
> >>> /*
> >>> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> >>> ci->count--;
> >>>
> >>> if (!ci->count)
> >>> - free_cluster(p, ci);
> >>> + return free_cluster(p, ci);
> >>> +
> >>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >>> + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]);
> >>> + ci->flags |= CLUSTER_FLAG_NONFULL;
> >>> + }
> >>> }
> >>>
> >>> /*
> >>> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >>> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
> >>> list_del(&ci->next);
> >>> spin_lock(&ci->lock);
> >>> + ci->order = order;
> >>> + ci->flags = 0;
> >>> + spin_unlock(&ci->lock);
> >>> + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> >>> + } else if (!list_empty(&si->nonfull_clusters[order])) {
>
> You are preferring to scan the nonfull clusters over doing discard and
> allocating a newly freed cluster; wouldn't it be better to prefer discard over
> nonfull scanning?

My consideration is that issuing a discard command takes some IO time.
If we have an alternative path to move forward, e.g. using another
nonfull cluster, we should do that to avoid latency penalty.

>
> >>> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next);
> >>> + list_del(&ci->next);
>
> I'm struggling a bit with what the value of the nonfull_clusters linked list is.

Assume we have the swap in size the same as swap out size. The
nonfull_clusters will get to the cluster in the desired order quicker.

> I wonder if it would be simpler to just track the number of free slots in the
> cluster and iterate over the clusters, scanning the ones with the desired order
> and which have at least (1 << order) free slots? I guess this list is giving you

That would be slower than this approach. We would likely repeatedly
scan over a lot of lower order clusters which can't find a big enough
free range. We can cache it and speed it up by avoiding repeat
scanning the unfitting cluster over and over again, that effectively
turns into a buddy allocator approach then. It can work but is more
complex.

> an ordering such that the cluster you pull off the list first had its first slot
> freed longest ago so it is most likely to have most space?

Yes, that is one of the considerations as well.

Chris

2024-06-11 07:12:04

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Chris Li <[email protected]> writes:
> >>
> >
> >> > In the page allocation side, we have the hugetlbfs which reserve some
> >> > memory for high order pages.
> >> > We should have similar things to allow reserve some high order swap
> >> > entries without getting polluted by low order one.
> >>
> >> TBH, I don't like the idea of high order swap entries reservation.
> > May I know more if you don't like the idea? I understand this can be
> > controversial, because previously we like to take the THP as the best
> > effort approach. If there is some reason we can't make THP, we use the
> > order 0 as fall back.
> >
> > For discussion purpose, I want break it down to smaller steps:
> >
> > First, can we agree that the following usage case is reasonable:
> > The usage case is that, as Barry has shown, zsmalloc compresses bigger
> > size than 4K and can have both better compress ratio and CPU
> > performance gain.
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > So the goal is to make THP/mTHP have some reasonable success rate
> > running in the mix size swap allocation, after either low order or
> > high order swap requests can overflow the swap file size. The allocate
> > can still recover from that, after some swap entries got free.
> >
> > Please let me know if you think the above usage case and goal are not
> > reasonable for the kernel.
>
> I think that it's reasonable to improve the success rate of high-order

Glad to hear that.

> swap entries allocation. I just think that it's hard to use the
> reservation based method. For example, how much should be reserved?

Understand, it is harder to use than a fully transparent method, but
still better than no solution at all. The alternative right now is we
can't do it.

Regarding how much we should reserve. Similarly, how much should you
choose your swap file size? If you choose N, why not N*120% or N*80%?
That did not stop us from having a swapfile, right?

> Why system OOM when there's still swap space available? And so forth.

Keep in mind that the reservation is an option. If you prefer the old
behavior, you don't have to use the reservation. That shouldn't be a
reason to stop others who want to use it. We don't have an alternative
solution for the long run mix size allocation yet. If there is, I like
to hear it.

> So, I prefer the transparent methods. Just like THP vs. hugetlbfs.

Me too. I prefer transparent over reservation if it can achieve the
same goal. Do we have a fully transparent method spec out? How to
achieve fully transparent and also avoid fragmentation caused by mix
order allocation/free?

Keep in mind that we are still in the early stage of the mTHP swap
development, I can have the reservation patch relatively easily. If
you come up with a better transparent method patch which can achieve
the same goal later, we can use it instead.

>
> >> that's really important for you, I think that it's better to design
> >> something like hugetlbfs vs core mm, that is, be separated from the
> >> normal swap subsystem as much as possible.
> >
> > I am giving hugetlbfs just to make the point using reservation, or
> > isolation of the resource to prevent mixing fragmentation existing in
> > core mm.
> > I am not suggesting copying the hugetlbfs implementation to the swap
> > system. Unlike hugetlbfs, the swap allocation is typically done from
> > the kernel, it is transparent from the application. I don't think
> > separate from the swap subsystem is a good way to go.
> >
> > This comes down to why you don't like the reservation. e.g. if we use
> > two swapfile, one swapfile is purely allocate for high order, would
> > that be better?
>
> Sorry, my words weren't accurate. Personally, I just think that it's
> better to make reservation related code not too intrusive.

Yes. I will try to make it not too intrusive.

> And, before reservation, we need to consider something else firstly.
> Whether is it generally good to swap-in with swap-out order? Should we

When we have the reservation patch (or other means to sustain mix size
swap allocation/free), we can test it out to get more data to reason
about it.
I consider the swap in size policy an orthogonal issue.

> consider memory wastage too? One static policy doesn't fit all, we may
> need either a dynamic policy, or make policy configurable.
> In general, I think that we need to do this step by step.

The core swap layer needs to be able to sustain mix size swap
allocation free in the long run. Without that the swap in size policy
is meaningless.

Yes, that is the step by step approach. Allowing long run mix size
swap allocation as the first step.

> >> >> > Do you see another way to protect the high order cluster polluted by
> >> >> > lower order one?
> >> >>
> >> >> If we use high-order page allocation as reference, we need something
> >> >> like compaction to guarantee high-order allocation finally. But we are
> >> >> too far from that.
> >> >
> >> > We should consider reservation for high-order swap entry allocation
> >> > similar to hugetlbfs for memory.
> >> > Swap compaction will be very complicated because it needs to scan the
> >> > PTE to migrate the swap entry. It might be easier to support folio
> >> > write out compound discontiguous swap entries. That is another way to
> >> > address the fragmentation issue. We are also too far from that as
> >> > right now.
> >>
> >> That's not easy to write out compound discontiguous swap entries too.
> >> For example, how to put folios in swap cache?
> >
> > I propose the idea in the recent LSF/MM discussion, the last few
> > slides are for the discontiguous swap and it has the discontiguous
> > entries in swap cache.
> > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
> >
> > Agree it is not an easy change. The cache cache would have to change
> > the assumption all offset are contiguous.
> > For swap, we kind of have some in memory data associated with per
> > offset already, so it might provide an opportunity to combine the
> > offset related data structure for swap together. Another alternative
> > might be using xarray without the multi entry property. , just treat
> > each offset like a single entry. I haven't dug deep into this
> > direction yet.
>
> Thanks! I will study your idea.
>

I am happy to discuss if you have any questions.

> > We can have more discussion, maybe arrange an upstream alignment
> > meeting if there is interest.
>
> Sure.

Ideally, if we can resolve our differences over the mail list then we
don't need to have a separate meeting :-)

Chris

2024-06-13 08:43:36

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order

Chris Li <[email protected]> writes:

> On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <[email protected]> wrote:
>>
>> Chris Li <[email protected]> writes:
>>
>> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <[email protected]> wrote:
>> >>
>> >> Chris Li <[email protected]> writes:
>> >>
>> >
>> >> > In the page allocation side, we have the hugetlbfs which reserve some
>> >> > memory for high order pages.
>> >> > We should have similar things to allow reserve some high order swap
>> >> > entries without getting polluted by low order one.
>> >>
>> >> TBH, I don't like the idea of high order swap entries reservation.
>> > May I know more if you don't like the idea? I understand this can be
>> > controversial, because previously we like to take the THP as the best
>> > effort approach. If there is some reason we can't make THP, we use the
>> > order 0 as fall back.
>> >
>> > For discussion purpose, I want break it down to smaller steps:
>> >
>> > First, can we agree that the following usage case is reasonable:
>> > The usage case is that, as Barry has shown, zsmalloc compresses bigger
>> > size than 4K and can have both better compress ratio and CPU
>> > performance gain.
>> > https://lore.kernel.org/linux-mm/[email protected]/
>> >
>> > So the goal is to make THP/mTHP have some reasonable success rate
>> > running in the mix size swap allocation, after either low order or
>> > high order swap requests can overflow the swap file size. The allocate
>> > can still recover from that, after some swap entries got free.
>> >
>> > Please let me know if you think the above usage case and goal are not
>> > reasonable for the kernel.
>>
>> I think that it's reasonable to improve the success rate of high-order
>
> Glad to hear that.
>
>> swap entries allocation. I just think that it's hard to use the
>> reservation based method. For example, how much should be reserved?
>
> Understand, it is harder to use than a fully transparent method, but
> still better than no solution at all. The alternative right now is we
> can't do it.
>
> Regarding how much we should reserve. Similarly, how much should you
> choose your swap file size? If you choose N, why not N*120% or N*80%?
> That did not stop us from having a swapfile, right?
>
>> Why system OOM when there's still swap space available? And so forth.
>
> Keep in mind that the reservation is an option. If you prefer the old
> behavior, you don't have to use the reservation. That shouldn't be a
> reason to stop others who want to use it. We don't have an alternative
> solution for the long run mix size allocation yet. If there is, I like
> to hear it.

It's not enough to make it optional. When you run into issue, you need
to debug it. And you may debug an issue on a system that is configured
by someone else.

>> So, I prefer the transparent methods. Just like THP vs. hugetlbfs.
>
> Me too. I prefer transparent over reservation if it can achieve the
> same goal. Do we have a fully transparent method spec out? How to
> achieve fully transparent and also avoid fragmentation caused by mix
> order allocation/free?
>
> Keep in mind that we are still in the early stage of the mTHP swap
> development, I can have the reservation patch relatively easily. If
> you come up with a better transparent method patch which can achieve
> the same goal later, we can use it instead.

Because we are still in the early stage, I think that we should try to
improve transparent solution firstly. Personally, what I don't like is
that we don't work on the transparent solution because we have the
reservation solution.

>>
>> >> that's really important for you, I think that it's better to design
>> >> something like hugetlbfs vs core mm, that is, be separated from the
>> >> normal swap subsystem as much as possible.
>> >
>> > I am giving hugetlbfs just to make the point using reservation, or
>> > isolation of the resource to prevent mixing fragmentation existing in
>> > core mm.
>> > I am not suggesting copying the hugetlbfs implementation to the swap
>> > system. Unlike hugetlbfs, the swap allocation is typically done from
>> > the kernel, it is transparent from the application. I don't think
>> > separate from the swap subsystem is a good way to go.
>> >
>> > This comes down to why you don't like the reservation. e.g. if we use
>> > two swapfile, one swapfile is purely allocate for high order, would
>> > that be better?
>>
>> Sorry, my words weren't accurate. Personally, I just think that it's
>> better to make reservation related code not too intrusive.
>
> Yes. I will try to make it not too intrusive.
>
>> And, before reservation, we need to consider something else firstly.
>> Whether is it generally good to swap-in with swap-out order? Should we
>
> When we have the reservation patch (or other means to sustain mix size
> swap allocation/free), we can test it out to get more data to reason
> about it.
> I consider the swap in size policy an orthogonal issue.

No. I don't think so. If you swap-out in higher order, but swap-in in
lower order, you make the swap clusters fragmented.

>> consider memory wastage too? One static policy doesn't fit all, we may
>> need either a dynamic policy, or make policy configurable.
>> In general, I think that we need to do this step by step.
>
> The core swap layer needs to be able to sustain mix size swap
> allocation free in the long run. Without that the swap in size policy
> is meaningless.
>
> Yes, that is the step by step approach. Allowing long run mix size
> swap allocation as the first step.
>
>> >> >> > Do you see another way to protect the high order cluster polluted by
>> >> >> > lower order one?
>> >> >>
>> >> >> If we use high-order page allocation as reference, we need something
>> >> >> like compaction to guarantee high-order allocation finally. But we are
>> >> >> too far from that.
>> >> >
>> >> > We should consider reservation for high-order swap entry allocation
>> >> > similar to hugetlbfs for memory.
>> >> > Swap compaction will be very complicated because it needs to scan the
>> >> > PTE to migrate the swap entry. It might be easier to support folio
>> >> > write out compound discontiguous swap entries. That is another way to
>> >> > address the fragmentation issue. We are also too far from that as
>> >> > right now.
>> >>
>> >> That's not easy to write out compound discontiguous swap entries too.
>> >> For example, how to put folios in swap cache?
>> >
>> > I propose the idea in the recent LSF/MM discussion, the last few
>> > slides are for the discontiguous swap and it has the discontiguous
>> > entries in swap cache.
>> > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
>> >
>> > Agree it is not an easy change. The cache cache would have to change
>> > the assumption all offset are contiguous.
>> > For swap, we kind of have some in memory data associated with per
>> > offset already, so it might provide an opportunity to combine the
>> > offset related data structure for swap together. Another alternative
>> > might be using xarray without the multi entry property. , just treat
>> > each offset like a single entry. I haven't dug deep into this
>> > direction yet.
>>
>> Thanks! I will study your idea.
>>
>
> I am happy to discuss if you have any questions.
>
>> > We can have more discussion, maybe arrange an upstream alignment
>> > meeting if there is interest.
>>
>> Sure.
>
> Ideally, if we can resolve our differences over the mail list then we
> don't need to have a separate meeting :-)
>

--
Best Regards,
Huang, Ying