2023-10-25 14:46:57

by Ryan Roberts

Subject: [PATCH v3 0/4] Swap-out small-sized THP without splitting

Hi All,

This is v3 of a series to add support for swapping out small-sized THP without
needing to first split the large folio via __split_huge_page(). It closely
follows the approach already used by PMD-sized THP.

"Small-sized THP" is an upcoming feature that enables performance improvements
by allocating large folios for anonymous memory, where the large folio size is
smaller than the traditional PMD-size. See [3].

In some circumstances I've observed a performance regression (see patch 2 for
details), and this series is an attempt to fix the regression in advance of
merging small-sized THP support.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when the swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). Discussion against the RFC concluded
that this is probably sufficient.

The series applies against mm-unstable (1a3c85fa684a).


Changes since v2 [2]
====================

- Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
allocation. This required some refactoring to make everything work nicely
(new patches 2 and 3).
- Fix bug where nr_swap_pages would say there are pages available but the
scanner would not be able to allocate them because they were reserved for the
per-cpu allocator. We now allow stealing of order-0 entries from the high
order per-cpu clusters (in addition to existing stealing from order-0
per-cpu clusters).

Thanks to Huang, Ying for the review feedback and suggestions!


Changes since v1 [1]
====================

- patch 1:
- Use cluster_set_count() instead of cluster_set_count_flag() in
swap_alloc_cluster() since we no longer have any flag to set. I was unable
to kill cluster_set_count_flag() as proposed against v1 as other call
sites depend on explicitly setting flags to 0.
- patch 2:
- Moved large_next[] array into percpu_cluster to make it per-cpu
(recommended by Huang, Ying).
- large_next[] array is dynamically allocated because PMD_ORDER is not
compile-time constant for powerpc (fixes build error).


Thanks,
Ryan

P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
but given Huang Ying had provided some review feedback, I wanted to progress it.
All the actual prerequisites are either complete or being worked on by others.


[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/


Ryan Roberts (4):
mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
mm: swap: Remove struct percpu_cluster
mm: swap: Simplify ssd behavior when scanner steals entry
mm: swap: Swap-out small-sized THP without splitting

include/linux/swap.h | 31 +++---
mm/huge_memory.c | 3 -
mm/swapfile.c | 232 ++++++++++++++++++++++++-------------------
mm/vmscan.c | 10 +-
4 files changed, 149 insertions(+), 127 deletions(-)

--
2.25.1


2023-10-25 14:47:06

by Ryan Roberts

Subject: [PATCH v3 2/4] mm: swap: Remove struct percpu_cluster

struct percpu_cluster stores the index of the cpu's current cluster and the
offset of the next entry that will be allocated for that cpu. These two
pieces of information are redundant because the cluster index is just
(offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the
cluster index is that the structure used for it also has a flag to
indicate "no cluster". However, this data structure also contains a spin
lock, which is never used in this context; as a side effect the code
copies the spinlock_t structure, which is questionable coding practice
in my view.

So let's clean this up and store only the next offset, and use a
sentinel value (SWAP_NEXT_NULL) to indicate "no cluster". SWAP_NEXT_NULL
is chosen to be 0 because 0 will never be seen legitimately: the first
page in the swap file is the swap header, which is always marked bad to
prevent it from being allocated as an entry. This also prevents the
cluster to which it belongs from being marked free, so it will never
appear on the free list.

This change saves 16 bytes per cpu. And given we are shortly going to
extend this mechanism to be per-cpu-AND-per-order, we will end up saving
16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the
system.
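
As an aside, here is a minimal userspace sketch (not kernel code) of the
arithmetic this relies on; SWAPFILE_CLUSTER is assumed to be 256 and the
names only mirror the kernel ones for illustration:

#include <stdio.h>

#define SWAPFILE_CLUSTER 256  /* assumed value for this sketch */
#define SWAP_NEXT_NULL   0    /* offset 0 is the swap header, never a valid entry */

/* The cluster index is fully derivable from the offset. */
static unsigned int cluster_of(unsigned int offset)
{
    return offset / SWAPFILE_CLUSTER;
}

int main(void)
{
    unsigned int cpu_next = 513;  /* pretend per-cpu next-allocation offset */

    printf("offset %u -> cluster %u\n", cpu_next, cluster_of(cpu_next));

    cpu_next = SWAP_NEXT_NULL;    /* "no cluster assigned to this cpu" */
    if (cpu_next == SWAP_NEXT_NULL)
        printf("no per-cpu cluster; the scanner must pick a new one\n");

    return 0;
}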

Signed-off-by: Ryan Roberts <[email protected]>
---
include/linux/swap.h | 21 +++++++++++++--------
mm/swapfile.c | 43 +++++++++++++++++++------------------------
2 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a073366a227c..0ca8aaa098ba 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,14 +261,12 @@ struct swap_cluster_info {
#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */

/*
- * We assign a cluster to each CPU, so each CPU can allocate swap entry from
- * its own cluster and swapout sequentially. The purpose is to optimize swapout
- * throughput.
+ * The first page in the swap file is the swap header, which is always marked
+ * bad to prevent it from being allocated as an entry. This also prevents the
+ * cluster to which it belongs being marked free. Therefore 0 is safe to use as
+ * a sentinel to indicate cpu_next is not valid in swap_info_struct.
*/
-struct percpu_cluster {
- struct swap_cluster_info index; /* Current cluster index */
- unsigned int next; /* Likely next allocation offset */
-};
+#define SWAP_NEXT_NULL 0

struct swap_cluster_list {
struct swap_cluster_info head;
@@ -295,7 +293,14 @@ struct swap_info_struct {
unsigned int cluster_next; /* likely index for next allocation */
unsigned int cluster_nr; /* countdown to next cluster search */
unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
- struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
+ unsigned int __percpu *cpu_next;/*
+ * Likely next allocation offset. We
+ * assign a cluster to each CPU, so each
+ * CPU can allocate swap entry from its
+ * own cluster and swapout sequentially.
+ * The purpose is to optimize swapout
+ * throughput.
+ */
struct rb_root swap_extent_root;/* root of the swap extent rbtree */
struct block_device *bdev; /* swap device or bdev of swap file */
struct file *swap_file; /* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b83ad77e04c0..617e34b8cdbe 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -591,7 +591,6 @@ static bool
scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
unsigned long offset)
{
- struct percpu_cluster *percpu_cluster;
bool conflict;

offset /= SWAPFILE_CLUSTER;
@@ -602,8 +601,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
if (!conflict)
return false;

- percpu_cluster = this_cpu_ptr(si->percpu_cluster);
- cluster_set_null(&percpu_cluster->index);
+ *this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
return true;
}

@@ -614,16 +612,16 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
unsigned long *offset, unsigned long *scan_base)
{
- struct percpu_cluster *cluster;
struct swap_cluster_info *ci;
- unsigned long tmp, max;
+ unsigned int tmp, max;
+ unsigned int *cpu_next;

new_cluster:
- cluster = this_cpu_ptr(si->percpu_cluster);
- if (cluster_is_null(&cluster->index)) {
+ cpu_next = this_cpu_ptr(si->cpu_next);
+ tmp = *cpu_next;
+ if (tmp == SWAP_NEXT_NULL) {
if (!cluster_list_empty(&si->free_clusters)) {
- cluster->index = si->free_clusters.head;
- cluster->next = cluster_next(&cluster->index) *
+ tmp = cluster_next(&si->free_clusters.head) *
SWAPFILE_CLUSTER;
} else if (!cluster_list_empty(&si->discard_clusters)) {
/*
@@ -643,9 +641,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
* Other CPUs can use our cluster if they can't find a free cluster,
* check if there is still free entry in the cluster
*/
- tmp = cluster->next;
max = min_t(unsigned long, si->max,
- (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER);
+ ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER);
if (tmp < max) {
ci = lock_cluster(si, tmp);
while (tmp < max) {
@@ -656,12 +653,13 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
unlock_cluster(ci);
}
if (tmp >= max) {
- cluster_set_null(&cluster->index);
+ *cpu_next = SWAP_NEXT_NULL;
goto new_cluster;
}
- cluster->next = tmp + 1;
*offset = tmp;
*scan_base = tmp;
+ tmp += 1;
+ *cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
return true;
}

@@ -2488,8 +2486,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
arch_swap_invalidate_area(p->type);
zswap_swapoff(p->type);
mutex_unlock(&swapon_mutex);
- free_percpu(p->percpu_cluster);
- p->percpu_cluster = NULL;
+ free_percpu(p->cpu_next);
+ p->cpu_next = NULL;
free_percpu(p->cluster_next_cpu);
p->cluster_next_cpu = NULL;
vfree(swap_map);
@@ -3073,16 +3071,13 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
for (ci = 0; ci < nr_cluster; ci++)
spin_lock_init(&((cluster_info + ci)->lock));

- p->percpu_cluster = alloc_percpu(struct percpu_cluster);
- if (!p->percpu_cluster) {
+ p->cpu_next = alloc_percpu(unsigned int);
+ if (!p->cpu_next) {
error = -ENOMEM;
goto bad_swap_unlock_inode;
}
- for_each_possible_cpu(cpu) {
- struct percpu_cluster *cluster;
- cluster = per_cpu_ptr(p->percpu_cluster, cpu);
- cluster_set_null(&cluster->index);
- }
+ for_each_possible_cpu(cpu)
+ per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
} else {
atomic_inc(&nr_rotate_swap);
inced_nr_rotate_swap = true;
@@ -3171,8 +3166,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
bad_swap_unlock_inode:
inode_unlock(inode);
bad_swap:
- free_percpu(p->percpu_cluster);
- p->percpu_cluster = NULL;
+ free_percpu(p->cpu_next);
+ p->cpu_next = NULL;
free_percpu(p->cluster_next_cpu);
p->cluster_next_cpu = NULL;
if (inode && S_ISBLK(inode->i_mode) && p->bdev) {
--
2.25.1

2023-11-29 07:48:25

by Barry Song

Subject: Re: [PATCH v3 0/4] Swap-out small-sized THP without splitting

> Hi All,
>
> This is v3 of a series to add support for swapping out small-sized THP without
> needing to first split the large folio via __split_huge_page(). It closely
> follows the approach already used by PMD-sized THP.
>
> "Small-sized THP" is an upcoming feature that enables performance improvements
> by allocating large folios for anonymous memory, where the large folio size is
> smaller than the traditional PMD-size. See [3].
>
> In some circumstances I've observed a performance regression (see patch 2 for
> details), and this series is an attempt to fix the regression in advance of
> merging small-sized THP support.
>
> I've done what I thought was the smallest change possible, and as a result, this
> approach is only employed when the swap is backed by a non-rotating block device
> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
> that this is probably sufficient.
>
> The series applies against mm-unstable (1a3c85fa684a)
>
>
> Changes since v2 [2]
> ====================
>
> - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
> allocation. This required some refactoring to make everything work nicely
> (new patches 2 and 3).
> - Fix bug where nr_swap_pages would say there are pages available but the
> scanner would not be able to allocate them because they were reserved for the
> per-cpu allocator. We now allow stealing of order-0 entries from the high
> order per-cpu clusters (in addition to exisiting stealing from order-0
> per-cpu clusters).
>
> Thanks to Huang, Ying for the review feedback and suggestions!
>
>
> Changes since v1 [1]
> ====================
>
> - patch 1:
> - Use cluster_set_count() instead of cluster_set_count_flag() in
> swap_alloc_cluster() since we no longer have any flag to set. I was unable
> to kill cluster_set_count_flag() as proposed against v1 as other call
> sites depend explicitly setting flags to 0.
> - patch 2:
> - Moved large_next[] array into percpu_cluster to make it per-cpu
> (recommended by Huang, Ying).
> - large_next[] array is dynamically allocated because PMD_ORDER is not
> compile-time constant for powerpc (fixes build error).
>
>
> Thanks,
> Ryan

> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
> but given Huang Ying had provided some review feedback, I wanted to progress it.
> All the actual prerequisites are either complete or being worked on by others.
>

Hi Ryan,

this is quite important to a phone and a must-have component, and so is
large-folio swapin, as I explained to you in another email.
Luckily, we have Chuanhua Han (Cc-ed) preparing a patchset for large-folio
swapin on top of this patchset of yours, probably a port and cleanup of our
do_swap_page[1] against yours.

Another concern is that swap slots can become fragmented: if we place small and
large folios in the same swap device, then since large folios always require
contiguous swap slots, we can fail to get slots even when many free slots
remain, just not contiguous ones. To avoid this, the dynamic hugepage solution
[2] has two swap devices, one for basepages and the other for CONTPTE. We have
modified the priority-based selection of swap devices to choose a swap device
based on whether the folio is small or large. I realize this approach is super
ugly and might be very hard to upstream; it seems not universal, especially if
you are a Linux server (-_-)

Two devices are not a nice approach even though it works well for a real
product; we might still need some decent way to address this problem, while the
problem is for sure not a stopper for your patchset.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
[2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129

>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
> [3] https://lore.kernel.org/linux-mm/[email protected]/
>
>
> Ryan Roberts (4):
> mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
> mm: swap: Remove struct percpu_cluster
> mm: swap: Simplify ssd behavior when scanner steals entry
> mm: swap: Swap-out small-sized THP without splitting
>
> include/linux/swap.h | 31 +++---
> mm/huge_memory.c | 3 -
> mm/swapfile.c | 232 ++++++++++++++++++++++++-------------------
> mm/vmscan.c | 10 +-
> 4 files changed, 149 insertions(+), 127 deletions(-)

Thanks
Barry

2023-11-29 12:07:08

by Ryan Roberts

Subject: Re: [PATCH v3 0/4] Swap-out small-sized THP without splitting

On 29/11/2023 07:47, Barry Song wrote:
>> Hi All,
>>
>> This is v3 of a series to add support for swapping out small-sized THP without
>> needing to first split the large folio via __split_huge_page(). It closely
>> follows the approach already used by PMD-sized THP.
>>
>> "Small-sized THP" is an upcoming feature that enables performance improvements
>> by allocating large folios for anonymous memory, where the large folio size is
>> smaller than the traditional PMD-size. See [3].
>>
>> In some circumstances I've observed a performance regression (see patch 2 for
>> details), and this series is an attempt to fix the regression in advance of
>> merging small-sized THP support.
>>
>> I've done what I thought was the smallest change possible, and as a result, this
>> approach is only employed when the swap is backed by a non-rotating block device
>> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
>> that this is probably sufficient.
>>
>> The series applies against mm-unstable (1a3c85fa684a)
>>
>>
>> Changes since v2 [2]
>> ====================
>>
>> - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
>> allocation. This required some refactoring to make everything work nicely
>> (new patches 2 and 3).
>> - Fix bug where nr_swap_pages would say there are pages available but the
>> scanner would not be able to allocate them because they were reserved for the
>> per-cpu allocator. We now allow stealing of order-0 entries from the high
>> order per-cpu clusters (in addition to exisiting stealing from order-0
>> per-cpu clusters).
>>
>> Thanks to Huang, Ying for the review feedback and suggestions!
>>
>>
>> Changes since v1 [1]
>> ====================
>>
>> - patch 1:
>> - Use cluster_set_count() instead of cluster_set_count_flag() in
>> swap_alloc_cluster() since we no longer have any flag to set. I was unable
>> to kill cluster_set_count_flag() as proposed against v1 as other call
>> sites depend explicitly setting flags to 0.
>> - patch 2:
>> - Moved large_next[] array into percpu_cluster to make it per-cpu
>> (recommended by Huang, Ying).
>> - large_next[] array is dynamically allocated because PMD_ORDER is not
>> compile-time constant for powerpc (fixes build error).
>>
>>
>> Thanks,
>> Ryan
>
>> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
>> but given Huang Ying had provided some review feedback, I wanted to progress it.
>> All the actual prerequisites are either complete or being worked on by others.
>>
>
> Hi Ryan,
>
> this is quite important to a phone and a must-have component, so is large-folio
> swapin, as i explained to you in another email.

Yes understood; the "prerequisites" are just the things that must be merged
*before* small-sized THP to ensure we don't regress existing behaviour or to
ensure that small-size THP is correct/robust when enabled. Performance
improvements can be merged after the initial small-sized series.

> Luckily, we are having Chuanhua Han(Cc-ed) to prepare a patchset of largefolio
> swapin on top of your this patchset, probably a port and cleanup of our
> do_swap_page[1] againest yours.

That's great to hear - welcome aboard, Chuanhua Han! Feel free to reach out if
you have questions.

I would guess that any large swap-in changes would be independent of this
swap-out patch though? Wouldn't you just be looking for contiguous swap entries
in the page table to determine a suitable folio order, then swap in each of
those entries into the folio? And if they happen to have contiguous swap offsets
(enabled by this swap-out series) then you potentially get a batched disk access
benefit.

That's just a guess though, perhaps you can describe your proposed approach?
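
Roughly, the kind of check I have in mind is something like the below
(simplified userspace sketch only; in the kernel this would walk real PTEs
under the PTL, and the helper name is made up):

#include <stdbool.h>
#include <stdio.h>

/*
 * Stand-in for the swap offsets found in a page-table range: check whether
 * the nr entries starting at 'first' are contiguous, which would allow
 * swapping them in as one large folio.
 */
static bool offsets_contiguous(const unsigned long *offsets, int first, int nr)
{
    for (int i = 1; i < nr; i++)
        if (offsets[first + i] != offsets[first] + i)
            return false;
    return true;
}

int main(void)
{
    unsigned long offsets[] = { 512, 513, 514, 515, 900, 901 };

    printf("4 contiguous from 0? %d\n", offsets_contiguous(offsets, 0, 4));
    printf("2 contiguous from 3? %d\n", offsets_contiguous(offsets, 3, 2));
    return 0;
}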

>
> Another concern is that swapslots can be fragmented, if we place small/large folios
> in a swap device, since large folios always require contiguous swapslot, we can
> result in failure of getting slots even we still have many free slots which are not
> contiguous.

This series tries to mitigate that problem by reserving a swap cluster per
order. That works well until we run out of swap clusters; a cluster can't be
freed until all contained swap entries are swapped back in and deallocated.
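
As a very rough userspace sketch of that idea only (this is not how the series
implements it; the real code is per-cpu, uses the existing cluster lists and
locking, and the names here are invented):

#include <stdio.h>

#define SWAPFILE_CLUSTER 256            /* entries per cluster (assumed) */
#define NR_ORDERS        4              /* illustration only */

static unsigned long next_free_cluster;      /* stand-in for the free-cluster list */
static unsigned long cur_cluster[NR_ORDERS]; /* 0 == no cluster reserved yet */
static unsigned long used[NR_ORDERS];        /* entries consumed in that cluster */

/* Carve 1 << order contiguous entries out of the cluster reserved for 'order'. */
static unsigned long alloc_entries(int order)
{
    unsigned long nr = 1UL << order;
    unsigned long offset;

    if (cur_cluster[order] == 0 || used[order] + nr > SWAPFILE_CLUSTER) {
        cur_cluster[order] = ++next_free_cluster; /* reserve a fresh cluster */
        used[order] = 0;
    }

    offset = cur_cluster[order] * SWAPFILE_CLUSTER + used[order];
    used[order] += nr;
    return offset;
}

int main(void)
{
    printf("order-2 folio -> offset %lu\n", alloc_entries(2));
    printf("order-0 page  -> offset %lu\n", alloc_entries(0));
    printf("order-2 folio -> offset %lu\n", alloc_entries(2));
    return 0;
}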

But I think we should start with the simple approach first, and only solve the
problems as they arise through real testing.

> To avoid this, [2] dynamic hugepage solution have two swap devices,
> one for basepage, the other one for CONTPTE. we have modified the priority-based
> selection of swap devices to choose swap devices based on small/large folios.
> i realize this approache is super ugly and might be very hard to find a way to
> upstream though, it seems not universal especially if you are a linux server (-_-)
>
> two devices are not a nice approach though it works well for a real product,
> we might still need some decent way to address this problem while the problem
> is for sure not a stopper of your patchset.

I guess that approach works for your case because A) you only have 2 sizes, and
B) your swap device is zRAM, which dynamically allocates RAM as it needs it.

The upstream small-sized THP solution can support multiple sizes, so you would
need a swap device per size (I think 13 is the limit at the moment - PMD size
for 64K base page). And if your swap device is a physical block device, you
can't dynamically partition it the way you can with zRAM. Neither of those things
scale particularly well IMHO.
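
For instance, with 64KiB base pages a PMD maps 2^13 pages (2^13 * 64KiB =
512MiB), so there are 13 possible large-folio orders, each of which would need
its own swap device under that scheme.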

>
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
> [2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129
>
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>> [2] https://lore.kernel.org/linux-mm/[email protected]/
>> [3] https://lore.kernel.org/linux-mm/[email protected]/
>>
>>
>> Ryan Roberts (4):
>> mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
>> mm: swap: Remove struct percpu_cluster
>> mm: swap: Simplify ssd behavior when scanner steals entry
>> mm: swap: Swap-out small-sized THP without splitting
>>
>> include/linux/swap.h | 31 +++---
>> mm/huge_memory.c | 3 -
>> mm/swapfile.c | 232 ++++++++++++++++++++++++-------------------
>> mm/vmscan.c | 10 +-
>> 4 files changed, 149 insertions(+), 127 deletions(-)
>
> Thanks
> Barry

2023-11-29 20:39:15

by Barry Song

Subject: Re: [PATCH v3 0/4] Swap-out small-sized THP without splitting

On Thu, Nov 30, 2023 at 1:06 AM Ryan Roberts <[email protected]> wrote:
>
> On 29/11/2023 07:47, Barry Song wrote:
> >> Hi All,
> >>
> >> This is v3 of a series to add support for swapping out small-sized THP without
> >> needing to first split the large folio via __split_huge_page(). It closely
> >> follows the approach already used by PMD-sized THP.
> >>
> >> "Small-sized THP" is an upcoming feature that enables performance improvements
> >> by allocating large folios for anonymous memory, where the large folio size is
> >> smaller than the traditional PMD-size. See [3].
> >>
> >> In some circumstances I've observed a performance regression (see patch 2 for
> >> details), and this series is an attempt to fix the regression in advance of
> >> merging small-sized THP support.
> >>
> >> I've done what I thought was the smallest change possible, and as a result, this
> >> approach is only employed when the swap is backed by a non-rotating block device
> >> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
> >> that this is probably sufficient.
> >>
> >> The series applies against mm-unstable (1a3c85fa684a)
> >>
> >>
> >> Changes since v2 [2]
> >> ====================
> >>
> >> - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
> >> allocation. This required some refactoring to make everything work nicely
> >> (new patches 2 and 3).
> >> - Fix bug where nr_swap_pages would say there are pages available but the
> >> scanner would not be able to allocate them because they were reserved for the
> >> per-cpu allocator. We now allow stealing of order-0 entries from the high
> >> order per-cpu clusters (in addition to exisiting stealing from order-0
> >> per-cpu clusters).
> >>
> >> Thanks to Huang, Ying for the review feedback and suggestions!
> >>
> >>
> >> Changes since v1 [1]
> >> ====================
> >>
> >> - patch 1:
> >> - Use cluster_set_count() instead of cluster_set_count_flag() in
> >> swap_alloc_cluster() since we no longer have any flag to set. I was unable
> >> to kill cluster_set_count_flag() as proposed against v1 as other call
> >> sites depend explicitly setting flags to 0.
> >> - patch 2:
> >> - Moved large_next[] array into percpu_cluster to make it per-cpu
> >> (recommended by Huang, Ying).
> >> - large_next[] array is dynamically allocated because PMD_ORDER is not
> >> compile-time constant for powerpc (fixes build error).
> >>
> >>
> >> Thanks,
> >> Ryan
> >
> >> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
> >> but given Huang Ying had provided some review feedback, I wanted to progress it.
> >> All the actual prerequisites are either complete or being worked on by others.
> >>
> >
> > Hi Ryan,
> >
> > this is quite important to a phone and a must-have component, so is large-folio
> > swapin, as i explained to you in another email.
>
> Yes understood; the "prerequisites" are just the things that must be merged
> *before* small-sized THP to ensure we don't regress existing behaviour or to
> ensure that small-size THP is correct/robust when enabled. Performance
> improvements can be merged after the initial small-sized series.

I completely agree. I didn't mean that small-THP swap-out as a whole should be
a prerequisite for the initial small-THP patchset; I was just describing how
important it is to a phone :-)

And actually we have gone much further than this on phones by optimizing
zsmalloc/zRAM to allow a large folio to be compressed and decompressed as a
whole; we have seen that compressing/decompressing a whole large folio can
significantly improve the compression ratio and decrease CPU consumption.

So that means large folios can not only save memory but also decrease CPU
consumption.

>
> > Luckily, we are having Chuanhua Han(Cc-ed) to prepare a patchset of largefolio
> > swapin on top of your this patchset, probably a port and cleanup of our
> > do_swap_page[1] againest yours.
>
> That's great to hear - welcome aboard, Chuanhua Han! Feel free to reach out if
> you have questions.
>
> I would guess that any large swap-in changes would be independent of this
> swap-out patch though? Wouldn't you just be looking for contiguous swap entries
> in the page table to determine a suitable folio order, then swap-in each of
> those entries into the folio? And if they happen to have contiguous swap offsets
> (enabled by this swap-out series) then you potentially get a batched disk access
> benefit.

I agree. Maybe we still need to check if the number of contiguous swap entries
is one of those supported large folio sizes?

>
> That's just a guess though, perhaps you can describe your proposed approach?

We have an ugly hack: if we are swapping in from the zRAM device dedicated to
large folios, we assume we have a chance to swap in as a whole, but we do also
handle corner cases in which some entries might have been zap_pte_range()-ed.

My current proposal is as below:
A1. we get the number of contiguous swap entries under the PTL and find that
    it is a valid large folio size;
A2. we allocate the large folio without the PTL;
A3. after taking the PTL again, we re-check whether the PTEs from A1 have been
    changed; if no other thread has changed those PTEs, we set_ptes() and
    finish the swap-in.

But we have a chance to fail in A2, so in that case we still need to fall back
to basepages.
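
A rough runnable sketch of just that check -> allocate -> re-check pattern, in
userspace C with made-up names (a mutex stands in for the PTL and an array for
the swap PTEs; none of the real fault plumbing is shown):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;
static unsigned long ptes[16];

/* A1: count contiguous swap entries, done under the "PTL" */
static int count_contig_swap_entries(int idx)
{
    int nr = 1;

    while (idx + nr < 16 && ptes[idx + nr] == ptes[idx] + nr)
        nr++;
    return nr;
}

/* A2: folio allocation may fail; no lock held here */
static void *alloc_folio(int nr)
{
    return (nr > 1 && rand() % 4 == 0) ? NULL : malloc((size_t)nr * 4096);
}

int main(void)
{
    int i, nr;
    void *folio;

    for (i = 0; i < 16; i++)
        ptes[i] = 512 + i;              /* pretend: contiguous swap entries */

    pthread_mutex_lock(&ptl);
    nr = count_contig_swap_entries(0);  /* A1 */
    pthread_mutex_unlock(&ptl);

    folio = alloc_folio(nr);            /* A2 */
    if (!folio) {
        nr = 1;                         /* fall back to a base page */
        folio = alloc_folio(1);
    }

    pthread_mutex_lock(&ptl);
    if (nr > 1 && count_contig_swap_entries(0) != nr) {  /* A3: re-check */
        pthread_mutex_unlock(&ptl);
        free(folio);
        printf("PTEs changed under us; retry or fall back\n");
        return 0;
    }
    /* set_ptes() equivalent would go here, still under the lock */
    pthread_mutex_unlock(&ptl);

    printf("swapped in %d page(s) as one folio\n", nr);
    free(folio);
    return 0;
}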

Considering the MTE thread [1] I am handling, where the MTE tag life cycle is
the same as the swap entry life cycle, it seems we will still need a page-level
arch_swap_restore even after we support large folio swap-in, for the below two
reasons:

1. contiguous PTEs might be partially dropped by madvise(MADV_DONTNEED) etc.;
2. we can still fall back to basepages for swap-in if we fail to allocate a
   large folio, even when the PTEs are all contiguous swap entries.

Of course, if we succeed in setting all PTEs for a large folio in A3, we can
have a folio-level arch_swap_restore.

To me, a universal folio-level arch_swap_restore does not seem sensible for
handling all kinds of complex cases.

[1] [RFC V3 PATCH] arm64: mm: swap: save and restore mte tags for large folios
https://lore.kernel.org/linux-mm/[email protected]/

>
> >
> > Another concern is that swapslots can be fragmented, if we place small/large folios
> > in a swap device, since large folios always require contiguous swapslot, we can
> > result in failure of getting slots even we still have many free slots which are not
> > contiguous.
>
> This series tries to mitigate that problem by reserving a swap cluster per
> order. That works well until we run out of swap clusters; a cluster can't be
> freed until all contained swap entries are swapped back in and deallocated.
>
> But I think we should start with the simple approach first, and only solve the
> problems as they arise through real testing.

I agree.

>
> > To avoid this, [2] dynamic hugepage solution have two swap devices,
> > one for basepage, the other one for CONTPTE. we have modified the priority-based
> > selection of swap devices to choose swap devices based on small/large folios.
> > i realize this approache is super ugly and might be very hard to find a way to
> > upstream though, it seems not universal especially if you are a linux server (-_-)
> >
> > two devices are not a nice approach though it works well for a real product,
> > we might still need some decent way to address this problem while the problem
> > is for sure not a stopper of your patchset.
>
> I guess that approach works for your case because A) you only have 2 sizes, and
> B) your swap device is zRAM, which dynamically allocate RAM as it needs it.
>
> The upstream small-sized THP solution can support multiple sizes, so you would
> need a swap device per size (I think 13 is the limit at the moment - PMD size
> for 64K base page). And if your swap device is a physical block device, you
> can't dynamically parition it the way you can with zRAM. Nether of those things
> scale particularly well IMHO.

right.

>
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
> > [2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129
> >
> >>
> >> [1] https://lore.kernel.org/linux-mm/[email protected]/
> >> [2] https://lore.kernel.org/linux-mm/[email protected]/
> >> [3] https://lore.kernel.org/linux-mm/[email protected]/
> >>
> >>
> >> Ryan Roberts (4):
> >> mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
> >> mm: swap: Remove struct percpu_cluster
> >> mm: swap: Simplify ssd behavior when scanner steals entry
> >> mm: swap: Swap-out small-sized THP without splitting
> >>
> >> include/linux/swap.h | 31 +++---
> >> mm/huge_memory.c | 3 -
> >> mm/swapfile.c | 232 ++++++++++++++++++++++++-------------------
> >> mm/vmscan.c | 10 +-
> >> 4 files changed, 149 insertions(+), 127 deletions(-)
> >

Thanks
Barry

2024-01-18 11:12:49

by Barry Song

Subject: [PATCH RFC 0/6] mm: support large folios swap-in

On an embedded system like Android, more than half of anon memory is actually
in swap devices such as zRAM. For example, while an app is switched to the
background, most of its memory might be swapped out.

Now we have mTHP features; unfortunately, if we don't support large folio
swap-in, then once those large folios are swapped out we immediately lose the
performance gain we can get through large folios and hardware optimizations
such as CONT-PTE.

In theory, we don't need to rely on Ryan's swap-out patchset [1]. That is to
say, even if some memory was mapped as normal pages before swap-out, we could
still swap it in as large folios. But this might require I/O to happen at
random places in the swap device. So we limit large folio swap-in to those
areas which were large folios before swapping out, i.e. where the swap slots
are also contiguous on the device. On the other hand, in OPPO's products we
have deployed anon large folios on millions of phones [2]. We enhanced zsmalloc
and zRAM to compress and decompress large folios as a whole, which helps
improve the compression ratio and decrease CPU consumption significantly. In
zsmalloc and zRAM we can store large objects whose original size is, for
example, 64KiB. So it is also a better choice for us to only swap in large
folios for those compressed large objects, as a large folio can be decompressed
all together.

Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
with MTE" to this series as it might help review.

[1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
https://lore.kernel.org/linux-mm/[email protected]/
[2] OnePlusOSS / android_kernel_oneplus_sm8550
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11

Barry Song (2):
arm64: mm: swap: support THP_SWAP on hardware with MTE
mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

Chuanhua Han (4):
mm: swap: introduce swap_nr_free() for batched swap_free()
mm: swap: make should_try_to_free_swap() support large-folio
mm: support large folios swapin as a whole
mm: madvise: don't split mTHP for MADV_PAGEOUT

arch/arm64/include/asm/pgtable.h | 21 ++----
arch/arm64/mm/mteswap.c | 42 ++++++++++++
include/asm-generic/tlb.h | 10 +++
include/linux/huge_mm.h | 12 ----
include/linux/pgtable.h | 62 ++++++++++++++++-
include/linux/swap.h | 6 ++
mm/madvise.c | 48 ++++++++++++++
mm/memory.c | 110 ++++++++++++++++++++++++++-----
mm/page_io.c | 2 +-
mm/rmap.c | 5 +-
mm/swap_slots.c | 2 +-
mm/swapfile.c | 29 ++++++++
12 files changed, 301 insertions(+), 48 deletions(-)

--
2.34.1


2024-01-18 11:13:16

by Barry Song

Subject: [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free()

From: Chuanhua Han <[email protected]>

While swapping in a large folio, we need to free the swap entries for the
whole folio. To avoid frequently acquiring and releasing swap locks, it is
better to introduce an API for batched freeing.
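
As a toy userspace illustration of why batching matters (not kernel code; the
helpers below only stand in for per-entry swap_free() versus the batched
swap_nr_free() introduced here):

#include <stdio.h>

static int lock_roundtrips;

static void lock(void)   { lock_roundtrips++; }
static void unlock(void) { }
static void drop_count(unsigned long offset) { (void)offset; }

/* one lock/unlock per entry, like freeing page by page */
static void free_one(unsigned long offset)
{
    lock();
    drop_count(offset);
    unlock();
}

/* one lock/unlock for the whole batch, like the new swap_nr_free() */
static void free_batch(unsigned long offset, int nr)
{
    int i;

    lock();
    for (i = 0; i < nr; i++)
        drop_count(offset + i);
    unlock();
}

int main(void)
{
    int i;

    for (i = 0; i < 16; i++)
        free_one(512 + i);
    printf("per-page freeing: %d lock round-trips\n", lock_roundtrips);

    lock_roundtrips = 0;
    free_batch(512, 16);
    printf("batched freeing:  %d lock round-trip(s)\n", lock_roundtrips);
    return 0;
}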

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/linux/swap.h | 6 ++++++
mm/swapfile.c | 29 +++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4db00ddad261..31a4ee2dcd1c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -478,6 +478,7 @@ extern void swap_shmem_alloc(swp_entry_t);
extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
+extern void swap_nr_free(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
extern int free_swap_and_cache(swp_entry_t);
int swap_type_of(dev_t device, sector_t offset);
@@ -553,6 +554,11 @@ static inline void swap_free(swp_entry_t swp)
{
}

+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+
+}
+
static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
{
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 556ff7347d5f..6321bda96b77 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1335,6 +1335,35 @@ void swap_free(swp_entry_t entry)
__swap_entry_free(p, entry);
}

+void swap_nr_free(swp_entry_t entry, int nr_pages)
+{
+ int i;
+ struct swap_cluster_info *ci;
+ struct swap_info_struct *p;
+ unsigned type = swp_type(entry);
+ unsigned long offset = swp_offset(entry);
+ DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
+
+ VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
+
+ if (nr_pages == 1) {
+ swap_free(entry);
+ return;
+ }
+
+ p = _swap_info_get(entry);
+
+ ci = lock_cluster(p, offset);
+ for (i = 0; i < nr_pages; i++) {
+ if (__swap_entry_free_locked(p, offset + i, 1))
+ __bitmap_set(usage, i, 1);
+ }
+ unlock_cluster(ci);
+
+ for_each_clear_bit(i, usage, nr_pages)
+ free_swap_slot(swp_entry(type, offset + i));
+}
+
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
--
2.34.1


2024-01-18 11:13:40

by Barry Song

Subject: [PATCH RFC 4/6] mm: support large folios swapin as a whole

From: Chuanhua Han <[email protected]>

On an embedded system like Android, more than half of anon memory is actually
in swap devices such as zRAM. For example, while an app is switched to the
background, most of its memory might be swapped out.

Now we have mTHP features; unfortunately, if we don't support large folio
swap-in, then once those large folios are swapped out we immediately lose the
performance gain we can get through large folios and hardware optimizations
such as CONT-PTE.

This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in to
those contiguous swap entries which were likely swapped out from an mTHP as a
whole.

On the other hand, the current implementation only covers the SWAP_SYNCHRONOUS
case. It doesn't support swapin_readahead() with large folios yet.

Right now, we re-fault large folios which are still in the swapcache as a
whole; this can effectively decrease the extra loops and early exits which we
introduced in arch_swap_restore() while supporting MTE restore for folios
rather than pages.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 94 insertions(+), 14 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f61a48929ba7..928b3f542932 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
static vm_fault_t do_fault(struct vm_fault *vmf);
static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
static bool vmf_pte_changed(struct vm_fault *vmf);
+static struct folio *alloc_anon_folio(struct vm_fault *vmf,
+ bool (*pte_range_check)(pte_t *, int));

/*
* Return true if the original pte was a uffd-wp pte marker (so the pte was
@@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
}

+static bool pte_range_swap(pte_t *pte, int nr_pages)
+{
+ int i;
+ swp_entry_t entry;
+ unsigned type;
+ pgoff_t start_offset;
+
+ entry = pte_to_swp_entry(ptep_get_lockless(pte));
+ if (non_swap_entry(entry))
+ return false;
+ start_offset = swp_offset(entry);
+ if (start_offset % nr_pages)
+ return false;
+
+ type = swp_type(entry);
+ for (i = 1; i < nr_pages; i++) {
+ entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
+ if (non_swap_entry(entry))
+ return false;
+ if (swp_offset(entry) != start_offset + i)
+ return false;
+ if (swp_type(entry) != type)
+ return false;
+ }
+
+ return true;
+}
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte_t pte;
vm_fault_t ret = 0;
void *shadow = NULL;
+ int nr_pages = 1;
+ unsigned long start_address;
+ pte_t *start_pte;

if (!pte_unmap_same(vmf))
goto out;
@@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
/* skip swapcache */
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
- vma, vmf->address, false);
+ folio = alloc_anon_folio(vmf, pte_range_swap);
page = &folio->page;
if (folio) {
__folio_set_locked(folio);
__folio_set_swapbacked(folio);

+ if (folio_test_large(folio)) {
+ unsigned long start_offset;
+
+ nr_pages = folio_nr_pages(folio);
+ start_offset = swp_offset(entry) & ~(nr_pages - 1);
+ entry = swp_entry(swp_type(entry), start_offset);
+ }
+
if (mem_cgroup_swapin_charge_folio(folio,
vma->vm_mm, GFP_KERNEL,
entry)) {
@@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+
+ start_address = vmf->address;
+ start_pte = vmf->pte;
+ if (folio_test_large(folio)) {
+ unsigned long nr = folio_nr_pages(folio);
+ unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+ pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
+
+ /*
+ * case 1: we are allocating large_folio, try to map it as a whole
+ * iff the swap entries are still entirely mapped;
+ * case 2: we hit a large folio in swapcache, and all swap entries
+ * are still entirely mapped, try to map a large folio as a whole.
+ * otherwise, map only the faulting page within the large folio
+ * which is swapcache
+ */
+ if (pte_range_swap(pte_t, nr)) {
+ start_address = addr;
+ start_pte = pte_t;
+ if (unlikely(folio == swapcache)) {
+ /*
+ * the below has been done before swap_read_folio()
+ * for case 1
+ */
+ nr_pages = nr;
+ entry = pte_to_swp_entry(ptep_get(start_pte));
+ page = &folio->page;
+ }
+ } else if (nr_pages > 1) { /* ptes have changed for case 1 */
+ goto out_nomap;
+ }
+ }
+
if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
goto out_nomap;

@@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* We're already holding a reference on the page but haven't mapped it
* yet.
*/
- swap_free(entry);
+ swap_nr_free(entry, nr_pages);
if (should_try_to_free_swap(folio, vma, vmf->flags))
folio_free_swap(folio);

- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+ folio_ref_add(folio, nr_pages - 1);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+
pte = mk_pte(page, vma->vm_page_prot);

/*
@@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* exclusivity.
*/
if (!folio_test_ksm(folio) &&
- (exclusive || folio_ref_count(folio) == 1)) {
+ (exclusive || folio_ref_count(folio) == nr_pages)) {
if (vmf->flags & FAULT_FLAG_WRITE) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
}
rmap_flags |= RMAP_EXCLUSIVE;
}
- flush_icache_page(vma, page);
+ flush_icache_pages(vma, page, nr_pages);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
if (pte_swp_uffd_wp(vmf->orig_pte))
@@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_add_new_anon_rmap(folio, vma, vmf->address);
folio_add_lru_vma(folio, vma);
} else {
- folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
+ folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
rmap_flags);
}

VM_BUG_ON(!folio_test_anon(folio) ||
(pte_write(pte) && !PageAnonExclusive(page)));
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
- arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+ set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
+
+ arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);

folio_unlock(folio);
if (folio != swapcache && swapcache) {
@@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

if (vmf->flags & FAULT_FLAG_WRITE) {
+ if (folio_test_large(folio) && nr_pages > 1)
+ vmf->orig_pte = ptep_get(vmf->pte);
+
ret |= do_wp_page(vmf);
if (ret & VM_FAULT_ERROR)
ret &= VM_FAULT_ERROR;
@@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

/* No need to invalidate - it was non-present before */
- update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
return true;
}

-static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+static struct folio *alloc_anon_folio(struct vm_fault *vmf,
+ bool (*pte_range_check)(pte_t *, int))
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct vm_area_struct *vma = vmf->vma;
@@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
order = highest_order(orders);
while (orders) {
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
- if (pte_range_none(pte + pte_index(addr), 1 << order))
+ if (pte_range_check(pte + pte_index(addr), 1 << order))
break;
order = next_order(&orders, order);
}
@@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (unlikely(anon_vma_prepare(vma)))
goto oom;
/* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
- folio = alloc_anon_folio(vmf);
+ folio = alloc_anon_folio(vmf, pte_range_none);
if (IS_ERR(folio))
return 0;
if (!folio)
--
2.34.1


2024-01-18 11:15:02

by Barry Song

Subject: [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE

From: Barry Song <[email protected]>

Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
THP_SWAP on ARM64, but it doesn't enable THP_SWAP on hardware with
MTE, as the MTE code works under the assumption that tag save/restore
always handles a folio with only one page.

The limitation should be removed as more and more ARM64 SoCs have
this feature. Co-existence of MTE and THP_SWAP becomes more and
more important.

This patch makes MTE tag saving support large folios, so we no longer
need to split large folios into base pages for swapping out on ARM64
SoCs with MTE.

arch_prepare_to_swap() should take a folio rather than a page as its
parameter because we support THP swap-out as a whole. It saves tags for
all pages in a large folio.

As we are now restoring tags based on the folio, arch_swap_restore()
may incur some extra loops and early exits while refaulting a large
folio which is still in the swapcache in do_swap_page(). If a large
folio has nr pages, do_swap_page() will only set the PTE of the
particular page which is causing the page fault. Thus do_swap_page()
runs nr times, and each time arch_swap_restore() loops nr times over
the subpages in the folio. So right now the algorithmic complexity
becomes O(nr^2).
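
For example, with 4KiB base pages, refaulting a 64KiB folio (nr = 16) that is
still in the swapcache enters do_swap_page() 16 times, and each of those calls
walks all 16 subpages in arch_swap_restore(), i.e. 16 * 16 = 256 tag-restore
checks instead of 16.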

Once we support mapping large folios in do_swap_page(), the extra loops
and early exits will decrease, though they will not be completely
removed, as a large folio might be partially tagged in corner cases such
as:
1. a large folio in the swapcache can be partially unmapped; thus, MTE
tags for the unmapped pages will be invalidated;
2. users might use mprotect() to set MTE on only a part of a large folio.

arch_thp_swp_supported() is dropped since ARM64 MTE was the only user
that needed it.

Reviewed-by: Steven Price <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 21 +++-------------
arch/arm64/mm/mteswap.c | 42 ++++++++++++++++++++++++++++++++
include/linux/huge_mm.h | 12 ---------
include/linux/pgtable.h | 2 +-
mm/page_io.c | 2 +-
mm/swap_slots.c | 2 +-
6 files changed, 49 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 79ce70fbb751..9902395ca426 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -45,12 +45,6 @@
__flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

-static inline bool arch_thp_swp_supported(void)
-{
- return !system_supports_mte();
-}
-#define arch_thp_swp_supported arch_thp_swp_supported
-
/*
* Outside of a few very special situations (e.g. hibernation), we always
* use broadcast TLB invalidation instructions, therefore a spurious page
@@ -1042,12 +1036,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
#ifdef CONFIG_ARM64_MTE

#define __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
-{
- if (system_supports_mte())
- return mte_save_tags(page);
- return 0;
-}
+#define arch_prepare_to_swap arch_prepare_to_swap
+extern int arch_prepare_to_swap(struct folio *folio);

#define __HAVE_ARCH_SWAP_INVALIDATE
static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
@@ -1063,11 +1053,8 @@ static inline void arch_swap_invalidate_area(int type)
}

#define __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
-{
- if (system_supports_mte())
- mte_restore_tags(entry, &folio->page);
-}
+#define arch_swap_restore arch_swap_restore
+extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);

#endif /* CONFIG_ARM64_MTE */

diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..b9ca1b35902f 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
mte_free_tag_storage(tags);
}

+static inline void __mte_invalidate_tags(struct page *page)
+{
+ swp_entry_t entry = page_swap_entry(page);
+
+ mte_invalidate_tags(swp_type(entry), swp_offset(entry));
+}
+
void mte_invalidate_tags_area(int type)
{
swp_entry_t entry = swp_entry(type, 0);
@@ -83,3 +90,38 @@ void mte_invalidate_tags_area(int type)
}
xa_unlock(&mte_pages);
}
+
+int arch_prepare_to_swap(struct folio *folio)
+{
+ int err;
+ long i;
+
+ if (system_supports_mte()) {
+ long nr = folio_nr_pages(folio);
+
+ for (i = 0; i < nr; i++) {
+ err = mte_save_tags(folio_page(folio, i));
+ if (err)
+ goto out;
+ }
+ }
+ return 0;
+
+out:
+ while (i--)
+ __mte_invalidate_tags(folio_page(folio, i));
+ return err;
+}
+
+void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+{
+ if (system_supports_mte()) {
+ long i, nr = folio_nr_pages(folio);
+
+ entry.val -= swp_offset(entry) & (nr - 1);
+ for (i = 0; i < nr; i++) {
+ mte_restore_tags(entry, folio_page(folio, i));
+ entry.val++;
+ }
+ }
+}
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..67219d2309dd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -530,16 +530,4 @@ static inline int split_folio(struct folio *folio)
return split_folio_to_list(folio, NULL);
}

-/*
- * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
- * limitations in the implementation like arm64 MTE can override this to
- * false
- */
-#ifndef arch_thp_swp_supported
-static inline bool arch_thp_swp_supported(void)
-{
- return true;
-}
-#endif
-
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f6d0e3513948..37fe83b0c358 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -925,7 +925,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
* prototypes must be defined in the arch-specific asm/pgtable.h file.
*/
#ifndef __HAVE_ARCH_PREPARE_TO_SWAP
-static inline int arch_prepare_to_swap(struct page *page)
+static inline int arch_prepare_to_swap(struct folio *folio)
{
return 0;
}
diff --git a/mm/page_io.c b/mm/page_io.c
index ae2b49055e43..a9a7c236aecc 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
* Arch code may have to preserve more data than just the page
* contents, e.g. memory tags.
*/
- ret = arch_prepare_to_swap(&folio->page);
+ ret = arch_prepare_to_swap(folio);
if (ret) {
folio_mark_dirty(folio);
folio_unlock(folio);
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..2325adbb1f19 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
entry.val = 0;

if (folio_test_large(folio)) {
- if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
+ if (IS_ENABLED(CONFIG_THP_SWAP))
get_swap_pages(1, &entry, folio_nr_pages(folio));
goto out;
}
--
2.34.1


2024-01-18 11:15:25

by Barry Song

Subject: [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio

From: Chuanhua Han <[email protected]>

should_try_to_free_swap() works under the assumption that swap-in is always
done at normal page granularity, i.e. folio_nr_pages() == 1. To support large
folio swap-in, this patch removes that assumption: since the swapcache holds
one reference per page of the folio, the expected reference count for an
exclusive folio becomes 1 + folio_nr_pages() rather than 2.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e1f4849463a..f61a48929ba7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3714,7 +3714,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
* reference only in case it's likely that we'll be the exlusive user.
*/
return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == 2;
+ folio_ref_count(folio) == (1 + folio_nr_pages(folio));
}

static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
--
2.34.1


2024-01-18 11:15:47

by Barry Song

Subject: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

From: Barry Song <[email protected]>

In do_swap_page(), while supporting large folio swap-in, we are using the helper
folio_add_anon_rmap_ptes(). This is triggering a WARN_ON in
__folio_add_anon_rmap(). We can make the warning quiet in two ways:
1. in do_swap_page(), call folio_add_new_anon_rmap() if we are sure the large
folio is a newly allocated one, and call folio_add_anon_rmap_ptes() if we find
the large folio in the swapcache;
2. always call folio_add_anon_rmap_ptes() in do_swap_page() but weaken the
WARN_ON in __folio_add_anon_rmap() by making it less sensitive.

Option 2 seems to be better for do_swap_page() as it can use unified code for
all cases.

Signed-off-by: Barry Song <[email protected]>
Tested-by: Chuanhua Han <[email protected]>
---
mm/rmap.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index f5d43edad529..469fcfd32317 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
* page.
*/
VM_WARN_ON_FOLIO(folio_test_large(folio) &&
- level != RMAP_LEVEL_PMD, folio);
+ level != RMAP_LEVEL_PMD &&
+ (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
+ (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
+ page != &folio->page), folio);
__folio_set_anon(folio, vma, address,
!!(flags & RMAP_EXCLUSIVE));
} else if (likely(!folio_test_ksm(folio))) {
--
2.34.1


2024-01-18 11:15:59

by Barry Song

Subject: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

From: Chuanhua Han <[email protected]>

MADV_PAGEOUT and MADV_FREE are common cases on Android. Ryan's patchset already
supports swapping large folios out as a whole for the vmscan case. This patch
extends the feature to madvise.

If the madvised range covers the whole large folio, we don't split it.
Otherwise, we still need to split it.

This patch doesn't depend on ARM64's CONT-PTE; instead, it defines a helper
named pte_range_cont_mapped() to check whether all PTEs are contiguously mapped
to a large folio.
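
As an illustration of the "covers the whole large folio" check (userspace
sketch only; the function below mirrors the pte_nr_addr_end() macro from this
patch, and the addresses are toy values):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* same computation as the pte_nr_addr_end() macro in this patch */
static unsigned long pte_nr_addr_end(unsigned long addr, unsigned long size,
                                     unsigned long end)
{
    unsigned long boundary = (addr + size) & ~(size - 1);

    return boundary - 1 < end - 1 ? boundary : end;
}

int main(void)
{
    unsigned long folio_size = 16 * PAGE_SIZE;  /* a 64KiB folio */
    unsigned long addr = 16 * PAGE_SIZE;        /* folio-aligned start address */
    unsigned long next;

    /* madvised range extends past the folio -> next - addr == folio_size, no split */
    next = pte_nr_addr_end(addr, folio_size, addr + 32 * PAGE_SIZE);
    printf("full cover: next - addr = %lu pages\n", (next - addr) / PAGE_SIZE);

    /* madvised range ends mid-folio -> next - addr != folio_size, must split */
    next = pte_nr_addr_end(addr, folio_size, addr + 4 * PAGE_SIZE);
    printf("partial:    next - addr = %lu pages\n", (next - addr) / PAGE_SIZE);
    return 0;
}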

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/asm-generic/tlb.h | 10 +++++++
include/linux/pgtable.h | 60 +++++++++++++++++++++++++++++++++++++++
mm/madvise.c | 48 +++++++++++++++++++++++++++++++
3 files changed, 118 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..f894e22da5d6 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)

+#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr) \
+ do { \
+ int i; \
+ tlb_flush_pte_range(tlb, address, \
+ PAGE_SIZE * nr); \
+ for (i = 0; i < nr; i++) \
+ __tlb_remove_tlb_entry(tlb, ptep + i, \
+ address + i * PAGE_SIZE); \
+ } while (0)
+
#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
do { \
unsigned long _sz = huge_page_size(h); \
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 37fe83b0c358..da0c1cf447e3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
}
#endif

+#ifndef pte_range_cont_mapped
+static inline bool pte_range_cont_mapped(unsigned long start_pfn,
+ pte_t *start_pte,
+ unsigned long start_addr,
+ int nr)
+{
+ int i;
+ pte_t pte_val;
+
+ for (i = 0; i < nr; i++) {
+ pte_val = ptep_get(start_pte + i);
+
+ if (pte_none(pte_val))
+ return false;
+
+ if (pte_pfn(pte_val) != (start_pfn + i))
+ return false;
+ }
+
+ return true;
+}
+#endif
+
+#ifndef pte_range_young
+static inline bool pte_range_young(pte_t *start_pte, int nr)
+{
+ int i;
+
+ for (i = 0; i < nr; i++)
+ if (pte_young(ptep_get(start_pte + i)))
+ return true;
+
+ return false;
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
@@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
}
#endif

+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
+static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
+ unsigned long start_addr,
+ pte_t *start_pte,
+ int nr, int full)
+{
+ int i;
+ pte_t pte;
+
+ pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
+
+ for (i = 1; i < nr; i++)
+ ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
+ start_pte + i, full);
+
+ return pte;
+}

/*
* If two threads concurrently fault at the same page, the thread that
@@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
})
#endif

+#ifndef pte_nr_addr_end
+#define pte_nr_addr_end(addr, size, end) \
+({ unsigned long __boundary = ((addr) + size) & (~(size - 1)); \
+ (__boundary - 1 < (end) - 1)? __boundary: (end); \
+})
+#endif
+
/*
* When walking page tables, we usually want to skip any p?d_none entries;
* and any p?d_bad entries - reporting the error before resetting to none.
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..262460ac4b2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (folio_test_large(folio)) {
int err;

+ if (!folio_test_pmd_mappable(folio)) {
+ int nr_pages = folio_nr_pages(folio);
+ unsigned long folio_size = PAGE_SIZE * nr_pages;
+ unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);
+ unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
+ pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
+ unsigned long next = pte_nr_addr_end(addr, folio_size, end);
+
+ if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
+ goto split;
+
+ if (next - addr != folio_size) {
+ goto split;
+ } else {
+ /* Do not interfere with other mappings of this page */
+ if (folio_estimated_sharers(folio) != 1)
+ goto skip;
+
+ VM_BUG_ON(addr != start_addr || pte != start_pte);
+
+ if (pte_range_young(start_pte, nr_pages)) {
+ ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
+ nr_pages, tlb->fullmm);
+ ptent = pte_mkold(ptent);
+
+ set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
+ tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
+ }
+
+ folio_clear_referenced(folio);
+ folio_test_clear_young(folio);
+ if (pageout) {
+ if (folio_isolate_lru(folio)) {
+ if (folio_test_unevictable(folio))
+ folio_putback_lru(folio);
+ else
+ list_add(&folio->lru, &folio_list);
+ }
+ } else
+ folio_deactivate(folio);
+ }
+skip:
+ pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
+ addr = next - PAGE_SIZE;
+ continue;
+
+ }
+split:
if (folio_estimated_sharers(folio) != 1)
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
--
2.34.1


2024-01-18 11:54:32

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On 18.01.24 12:10, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> In do_swap_page(), while supporting large folio swap-in, we are using the helper
> folio_add_anon_rmap_ptes. This is triggering a WARN_ON in __folio_add_anon_rmap.
> We can quiet the warning in two ways:
> 1. in do_swap_page, we call folio_add_new_anon_rmap() if we are sure the large
> folio is a newly allocated one; we call folio_add_anon_rmap_ptes() if we find the
> large folio in the swapcache.
> 2. we always call folio_add_anon_rmap_ptes() in do_swap_page but weaken the
> WARN_ON in __folio_add_anon_rmap() by making it less sensitive.
>
> Option 2 seems to be better for do_swap_page() as it can use unified code for
> all cases.
>
> Signed-off-by: Barry Song <[email protected]>
> Tested-by: Chuanhua Han <[email protected]>
> ---
> mm/rmap.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f5d43edad529..469fcfd32317 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> * page.
> */
> VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> - level != RMAP_LEVEL_PMD, folio);
> + level != RMAP_LEVEL_PMD &&
> + (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
> + (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
> + page != &folio->page), folio);
> __folio_set_anon(folio, vma, address,
> !!(flags & RMAP_EXCLUSIVE));
> } else if (likely(!folio_test_ksm(folio))) {


I have on my todo list to move all that !anon handling out of
folio_add_anon_rmap_ptes(), and instead make the swapin code call
folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
then (-> whole new folio exclusive).

That's the cleaner approach.

--
Cheers,

David / dhildenb


2024-01-18 15:29:22

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 0/6] mm: support large folios swap-in

On 18/01/2024 11:10, Barry Song wrote:
> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> before swap-out, if some memory were normal pages, but when swapping in, we
> can also swap-in them as large folios.

I think this could also violate MADV_NOHUGEPAGE; if the application has
requested that we do not create a THP, then we had better not; it could cause a
correctness issue in some circumstances. You would need to pay attention to this
vma flag if taking this approach.

> But this might require I/O happen at
> some random places in swap devices. So we limit the large folios swap-in to
> those areas which were large folios before swapping-out, aka, swaps are also
> contiguous in hardware.

In fact, even this may not be sufficient; it's possible that a contiguous set of
base pages (small folios) were allocated to a virtual mapping and all swapped
out together - they would likely end up contiguous in the swap file, but should
not be swapped back in as a single folio because of this (same reasoning applies
to cluster of smaller THPs that you mistake for a larger THP, etc).

So you will need to check what THP sizes are enabled and check the VMA
suitability regardless; Perhaps you are already doing this - I haven't looked at
the code yet.

I'll aim to review the code in the next couple of weeks.

Thanks,
Ryan

> On the other hand, in OPPO's product, we've deployed
> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> compress and decompress large folios as a whole, which help improve compression
> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> save large objects whose original size are 64KiB for example. So it is also a
> better choice for us to only swap-in large folios for those compressed large
> objects as a large folio can be decompressed all together.
>
> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> with MTE" to this series as it might help review.
>
> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> https://lore.kernel.org/linux-mm/[email protected]/
> [2] OnePlusOSS / android_kernel_oneplus_sm8550
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>
> Barry Song (2):
> arm64: mm: swap: support THP_SWAP on hardware with MTE
> mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
>
> Chuanhua Han (4):
> mm: swap: introduce swap_nr_free() for batched swap_free()
> mm: swap: make should_try_to_free_swap() support large-folio
> mm: support large folios swapin as a whole
> mm: madvise: don't split mTHP for MADV_PAGEOUT
>
> arch/arm64/include/asm/pgtable.h | 21 ++----
> arch/arm64/mm/mteswap.c | 42 ++++++++++++
> include/asm-generic/tlb.h | 10 +++
> include/linux/huge_mm.h | 12 ----
> include/linux/pgtable.h | 62 ++++++++++++++++-
> include/linux/swap.h | 6 ++
> mm/madvise.c | 48 ++++++++++++++
> mm/memory.c | 110 ++++++++++++++++++++++++++-----
> mm/page_io.c | 2 +-
> mm/rmap.c | 5 +-
> mm/swap_slots.c | 2 +-
> mm/swapfile.c | 29 ++++++++
> 12 files changed, 301 insertions(+), 48 deletions(-)
>


2024-01-18 23:54:33

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 0/6] mm: support large folios swap-in

On Thu, Jan 18, 2024 at 11:25 PM Ryan Roberts <[email protected]> wrote:
>
> On 18/01/2024 11:10, Barry Song wrote:
> > On an embedded system like Android, more than half of anon memory is actually
> > in swap devices such as zRAM. For example, while an app is switched to back-
> > ground, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> > before swap-out, if some memory were normal pages, but when swapping in, we
> > can also swap-in them as large folios.
>
> I think this could also violate MADV_NOHUGEPAGE; if the application has
> requested that we do not create a THP, then we had better not; it could cause a
> correctness issue in some circumstances. You would need to pay attention to this
> vma flag if taking this approach.
>
> > But this might require I/O happen at
> > some random places in swap devices. So we limit the large folios swap-in to
> > those areas which were large folios before swapping-out, aka, swaps are also
> > contiguous in hardware.
>
> In fact, even this may not be sufficient; it's possible that a contiguous set of
> base pages (small folios) were allocated to a virtual mapping and all swapped
> out together - they would likely end up contiguous in the swap file, but should
> not be swapped back in as a single folio because of this (same reasoning applies
> to cluster of smaller THPs that you mistake for a larger THP, etc).
>
> So you will need to check what THP sizes are enabled and check the VMA
> suitability regardless; Perhaps you are already doing this - I haven't looked at
> the code yet.

we are actually re-using your alloc_anon_folio() by adding a parameter
to make it
support both do_anon_page and do_swap_page,

-static struct folio *alloc_anon_folio(struct vm_fault *vmf)
+static struct folio *alloc_anon_folio(struct vm_fault *vmf,
+ bool (*pte_range_check)(pte_t *, int))
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct vm_area_struct *vma = vmf->vma;
@@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct
vm_fault *vmf)
order = highest_order(orders);
while (orders) {
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
- if (pte_range_none(pte + pte_index(addr), 1 << order))
+ if (pte_range_check(pte + pte_index(addr), 1 << order))
break;
order = next_order(&orders, order);
}
@@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (unlikely(anon_vma_prepare(vma)))
goto oom;
/* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
- folio = alloc_anon_folio(vmf);
+ folio = alloc_anon_folio(vmf, pte_range_none);
if (IS_ERR(folio))
return 0;
if (!folio)
--

I assume this has checked everything?

>
> I'll aim to review the code in the next couple of weeks.

nice, thanks!

>
> Thanks,
> Ryan
>
> > On the other hand, in OPPO's product, we've deployed
> > anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> > compress and decompress large folios as a whole, which help improve compression
> > ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> > save large objects whose original size are 64KiB for example. So it is also a
> > better choice for us to only swap-in large folios for those compressed large
> > objects as a large folio can be decompressed all together.
> >
> > Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> > with MTE" to this series as it might help review.
> >
> > [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> > https://lore.kernel.org/linux-mm/[email protected]/
> > [2] OnePlusOSS / android_kernel_oneplus_sm8550
> > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >
> > Barry Song (2):
> > arm64: mm: swap: support THP_SWAP on hardware with MTE
> > mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
> >
> > Chuanhua Han (4):
> > mm: swap: introduce swap_nr_free() for batched swap_free()
> > mm: swap: make should_try_to_free_swap() support large-folio
> > mm: support large folios swapin as a whole
> > mm: madvise: don't split mTHP for MADV_PAGEOUT
> >
> > arch/arm64/include/asm/pgtable.h | 21 ++----
> > arch/arm64/mm/mteswap.c | 42 ++++++++++++
> > include/asm-generic/tlb.h | 10 +++
> > include/linux/huge_mm.h | 12 ----
> > include/linux/pgtable.h | 62 ++++++++++++++++-
> > include/linux/swap.h | 6 ++
> > mm/madvise.c | 48 ++++++++++++++
> > mm/memory.c | 110 ++++++++++++++++++++++++++-----
> > mm/page_io.c | 2 +-
> > mm/rmap.c | 5 +-
> > mm/swap_slots.c | 2 +-
> > mm/swapfile.c | 29 ++++++++
> > 12 files changed, 301 insertions(+), 48 deletions(-)
> >
>

Thanks
Barry

2024-01-19 13:26:52

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 0/6] mm: support large folios swap-in

On 18/01/2024 23:54, Barry Song wrote:
> On Thu, Jan 18, 2024 at 11:25 PM Ryan Roberts <[email protected]> wrote:
>>
>> On 18/01/2024 11:10, Barry Song wrote:
>>> On an embedded system like Android, more than half of anon memory is actually
>>> in swap devices such as zRAM. For example, while an app is switched to back-
>>> ground, its most memory might be swapped-out.
>>>
>>> Now we have mTHP features, unfortunately, if we don't support large folios
>>> swap-in, once those large folios are swapped-out, we immediately lose the
>>> performance gain we can get through large folios and hardware optimization
>>> such as CONT-PTE.
>>>
>>> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
>>> before swap-out, if some memory were normal pages, but when swapping in, we
>>> can also swap-in them as large folios.
>>
>> I think this could also violate MADV_NOHUGEPAGE; if the application has
>> requested that we do not create a THP, then we had better not; it could cause a
>> correctness issue in some circumstances. You would need to pay attention to this
>> vma flag if taking this approach.
>>
>>> But this might require I/O happen at
>>> some random places in swap devices. So we limit the large folios swap-in to
>>> those areas which were large folios before swapping-out, aka, swaps are also
>>> contiguous in hardware.
>>
>> In fact, even this may not be sufficient; it's possible that a contiguous set of
>> base pages (small folios) were allocated to a virtual mapping and all swapped
>> out together - they would likely end up contiguous in the swap file, but should
>> not be swapped back in as a single folio because of this (same reasoning applies
>> to cluster of smaller THPs that you mistake for a larger THP, etc).
>>
>> So you will need to check what THP sizes are enabled and check the VMA
>> suitability regardless; Perhaps you are already doing this - I haven't looked at
>> the code yet.
>
> we are actually re-using your alloc_anon_folio() by adding a parameter
> to make it
> support both do_anon_page and do_swap_page,
>
> -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> + bool (*pte_range_check)(pte_t *, int))
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> struct vm_area_struct *vma = vmf->vma;
> @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct
> vm_fault *vmf)
> order = highest_order(orders);
> while (orders) {
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> - if (pte_range_none(pte + pte_index(addr), 1 << order))
> + if (pte_range_check(pte + pte_index(addr), 1 << order))
> break;
> order = next_order(&orders, order);
> }
> @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> - folio = alloc_anon_folio(vmf);
> + folio = alloc_anon_folio(vmf, pte_range_none);
> if (IS_ERR(folio))
> return 0;
> if (!folio)
> --
>
> I assume this has checked everything?

Ahh yes, very good. In that case you can disregard what I said; it's already covered.

I notice that this series appears as a reply to my series. I'm not sure what the
normal convention is, but I expect more people would see it if you posted it as
its own thread?


>
>>
>> I'll aim to review the code in the next couple of weeks.
>
> nice, thanks!
>
>>
>> Thanks,
>> Ryan
>>
>>> On the other hand, in OPPO's product, we've deployed
>>> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
>>> compress and decompress large folios as a whole, which help improve compression
>>> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
>>> save large objects whose original size are 64KiB for example. So it is also a
>>> better choice for us to only swap-in large folios for those compressed large
>>> objects as a large folio can be decompressed all together.
>>>
>>> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
>>> with MTE" to this series as it might help review.
>>>
>>> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>> [2] OnePlusOSS / android_kernel_oneplus_sm8550
>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>>>
>>> Barry Song (2):
>>> arm64: mm: swap: support THP_SWAP on hardware with MTE
>>> mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
>>>
>>> Chuanhua Han (4):
>>> mm: swap: introduce swap_nr_free() for batched swap_free()
>>> mm: swap: make should_try_to_free_swap() support large-folio
>>> mm: support large folios swapin as a whole
>>> mm: madvise: don't split mTHP for MADV_PAGEOUT
>>>
>>> arch/arm64/include/asm/pgtable.h | 21 ++----
>>> arch/arm64/mm/mteswap.c | 42 ++++++++++++
>>> include/asm-generic/tlb.h | 10 +++
>>> include/linux/huge_mm.h | 12 ----
>>> include/linux/pgtable.h | 62 ++++++++++++++++-
>>> include/linux/swap.h | 6 ++
>>> mm/madvise.c | 48 ++++++++++++++
>>> mm/memory.c | 110 ++++++++++++++++++++++++++-----
>>> mm/page_io.c | 2 +-
>>> mm/rmap.c | 5 +-
>>> mm/swap_slots.c | 2 +-
>>> mm/swapfile.c | 29 ++++++++
>>> 12 files changed, 301 insertions(+), 48 deletions(-)
>>>
>>
>
> Thanks
> Barry


2024-01-23 06:49:31

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On Thu, Jan 18, 2024 at 7:54 PM David Hildenbrand <[email protected]> wrote:
>
> On 18.01.24 12:10, Barry Song wrote:
> > From: Barry Song <[email protected]>
> >
> > In do_swap_page(), while supporting large folio swap-in, we are using the helper
> > folio_add_anon_rmap_ptes. This is triggering a WARN_ON in __folio_add_anon_rmap.
> > We can quiet the warning in two ways:
> > 1. in do_swap_page, we call folio_add_new_anon_rmap() if we are sure the large
> > folio is a newly allocated one; we call folio_add_anon_rmap_ptes() if we find the
> > large folio in the swapcache.
> > 2. we always call folio_add_anon_rmap_ptes() in do_swap_page but weaken the
> > WARN_ON in __folio_add_anon_rmap() by making it less sensitive.
> >
> > Option 2 seems to be better for do_swap_page() as it can use unified code for
> > all cases.
> >
> > Signed-off-by: Barry Song <[email protected]>
> > Tested-by: Chuanhua Han <[email protected]>
> > ---
> > mm/rmap.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index f5d43edad529..469fcfd32317 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> > * page.
> > */
> > VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> > - level != RMAP_LEVEL_PMD, folio);
> > + level != RMAP_LEVEL_PMD &&
> > + (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
> > + (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
> > + page != &folio->page), folio);
> > __folio_set_anon(folio, vma, address,
> > !!(flags & RMAP_EXCLUSIVE));
> > } else if (likely(!folio_test_ksm(folio))) {
>
>
> I have on my todo list to move all that !anon handling out of
> folio_add_anon_rmap_ptes(), and instead make the swapin code call
> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> then (-> whole new folio exclusive).
>
> That's the cleaner approach.
>

One tricky thing is that sometimes it is hard to know who is the first one
to add the rmap and thus should call folio_add_new_anon_rmap().
Especially when we want to support swapin_readahead(), the one who allocated
the large folio might not be the one who first does the rmap.
Is it an acceptable way to do the below in do_swap_page?

        if (!folio_test_anon(folio))
                folio_add_new_anon_rmap()
        else
                folio_add_anon_rmap_ptes()
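
i.e. with the argument names used in this series, roughly (just a sketch,
not tested, and the exclusive handling would be whatever your rework ends
up requiring):

        if (!folio_test_anon(folio))
                /* first one to map this (possibly readahead) large folio */
                folio_add_new_anon_rmap(folio, vma, start_address);
        else
                /* already anon, e.g. found in the swapcache */
                folio_add_anon_rmap_ptes(folio, page, nr_pages, vma,
                                         start_address, rmap_flags);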

> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

2024-01-26 23:16:53

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE

On Thu, Jan 18, 2024 at 3:11 AM Barry Song <[email protected]> wrote:
>
> From: Barry Song <[email protected]>
>
> Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> MTE as the MTE code works with the assumption tags save/restore is
> always handling a folio with only one page.
>
> The limitation should be removed as more and more ARM64 SoCs have
> this feature. Co-existence of MTE and THP_SWAP becomes more and
> more important.
>
> This patch makes MTE tags saving support large folios, then we don't
> need to split large folios into base pages for swapping out on ARM64
> SoCs with MTE any more.
>
> arch_prepare_to_swap() should take folio rather than page as parameter
> because we support THP swap-out as a whole. It saves tags for all
> pages in a large folio.
>
> As now we are restoring tags based-on folio, in arch_swap_restore(),
> we may increase some extra loops and early-exitings while refaulting
> a large folio which is still in swapcache in do_swap_page(). In case
> a large folio has nr pages, do_swap_page() will only set the PTE of
> the particular page which is causing the page fault.
> Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> will loop nr times for those subpages in the folio. So right now the
> algorithmic complexity becomes O(nr^2).
>
> Once we support mapping large folios in do_swap_page(), extra loops
> and early-exitings will decrease while not being completely removed
> as a large folio might get partially tagged in corner cases such as,
> 1. a large folio in swapcache can be partially unmapped, thus, MTE
> tags for the unmapped pages will be invalidated;
> 2. users might use mprotect() to set MTEs on a part of a large folio.
>
> arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> who needed it.
>
> Reviewed-by: Steven Price <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 21 +++-------------
> arch/arm64/mm/mteswap.c | 42 ++++++++++++++++++++++++++++++++
> include/linux/huge_mm.h | 12 ---------
> include/linux/pgtable.h | 2 +-
> mm/page_io.c | 2 +-
> mm/swap_slots.c | 2 +-
> 6 files changed, 49 insertions(+), 32 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 79ce70fbb751..9902395ca426 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -45,12 +45,6 @@
> __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -static inline bool arch_thp_swp_supported(void)
> -{
> - return !system_supports_mte();
> -}
> -#define arch_thp_swp_supported arch_thp_swp_supported
> -
> /*
> * Outside of a few very special situations (e.g. hibernation), we always
> * use broadcast TLB invalidation instructions, therefore a spurious page
> @@ -1042,12 +1036,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> #ifdef CONFIG_ARM64_MTE
>
> #define __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> -{
> - if (system_supports_mte())
> - return mte_save_tags(page);
> - return 0;
> -}
> +#define arch_prepare_to_swap arch_prepare_to_swap

This seems like a noop, defining "arch_prepare_to_swap" back to itself.
What am I missing?

I see. Answering my own question: I guess you want to allow someone to
override arch_prepare_to_swap.
Wouldn't testing against __HAVE_ARCH_PREPARE_TO_SWAP be enough to support that?

Maybe I need to understand better how you want others to extend this
code before making suggestions.
As it is, this looks strange.
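
For reference, the self-referential define is the usual idiom that lets a
generic header detect an arch override with #ifndef rather than a separate
__HAVE_ARCH_* symbol. A minimal sketch of that pattern (this is an assumption
about the intent; the patch as posted still keys the generic fallback off
__HAVE_ARCH_PREPARE_TO_SWAP):

        /* arch header */
        #define arch_prepare_to_swap arch_prepare_to_swap
        extern int arch_prepare_to_swap(struct folio *folio);

        /* generic header fallback, if it tested the define instead */
        #ifndef arch_prepare_to_swap
        static inline int arch_prepare_to_swap(struct folio *folio)
        {
                return 0;       /* nothing to do on archs without tags */
        }
        #endif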

> +extern int arch_prepare_to_swap(struct folio *folio);
>
> #define __HAVE_ARCH_SWAP_INVALIDATE
> static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> @@ -1063,11 +1053,8 @@ static inline void arch_swap_invalidate_area(int type)
> }
>
> #define __HAVE_ARCH_SWAP_RESTORE
> -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> -{
> - if (system_supports_mte())
> - mte_restore_tags(entry, &folio->page);
> -}
> +#define arch_swap_restore arch_swap_restore

Same here.

> +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
>
> #endif /* CONFIG_ARM64_MTE */
>
> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> index a31833e3ddc5..b9ca1b35902f 100644
> --- a/arch/arm64/mm/mteswap.c
> +++ b/arch/arm64/mm/mteswap.c
> @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> mte_free_tag_storage(tags);
> }
>
> +static inline void __mte_invalidate_tags(struct page *page)
> +{
> + swp_entry_t entry = page_swap_entry(page);
> +
> + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> +}
> +
> void mte_invalidate_tags_area(int type)
> {
> swp_entry_t entry = swp_entry(type, 0);
> @@ -83,3 +90,38 @@ void mte_invalidate_tags_area(int type)
> }
> xa_unlock(&mte_pages);
> }
> +
> +int arch_prepare_to_swap(struct folio *folio)
> +{
> + int err;
> + long i;
> +
> + if (system_supports_mte()) {
Very minor nitpick.

You can do

        if (!system_supports_mte())
                return 0;

here, and then the for loop would have less indent. The function looks flatter.
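
Applied to arch_prepare_to_swap() from this hunk, that would look roughly
like the below (just a sketch of the suggestion, not a replacement patch):

        int arch_prepare_to_swap(struct folio *folio)
        {
                long i, nr;
                int err;

                if (!system_supports_mte())
                        return 0;

                nr = folio_nr_pages(folio);
                for (i = 0; i < nr; i++) {
                        err = mte_save_tags(folio_page(folio, i));
                        if (err)
                                goto out;
                }
                return 0;

        out:
                /* undo the tags already saved for earlier subpages */
                while (i--)
                        __mte_invalidate_tags(folio_page(folio, i));
                return err;
        }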

> + long nr = folio_nr_pages(folio);
> +
> + for (i = 0; i < nr; i++) {
> + err = mte_save_tags(folio_page(folio, i));
> + if (err)
> + goto out;
> + }
> + }
> + return 0;
> +
> +out:
> + while (i--)
> + __mte_invalidate_tags(folio_page(folio, i));
> + return err;
> +}
> +
> +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> +{
> + if (system_supports_mte()) {

Same here.

Looks good otherwise. None of the nitpicks are deal breakers.

Acked-by: Chris Li <[email protected]>


Chris

> + long i, nr = folio_nr_pages(folio);
> +
> + entry.val -= swp_offset(entry) & (nr - 1);
> + for (i = 0; i < nr; i++) {
> + mte_restore_tags(entry, folio_page(folio, i));
> + entry.val++;
> + }
> + }
> +}
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..67219d2309dd 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -530,16 +530,4 @@ static inline int split_folio(struct folio *folio)
> return split_folio_to_list(folio, NULL);
> }
>
> -/*
> - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> - * limitations in the implementation like arm64 MTE can override this to
> - * false
> - */
> -#ifndef arch_thp_swp_supported
> -static inline bool arch_thp_swp_supported(void)
> -{
> - return true;
> -}
> -#endif
> -
> #endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f6d0e3513948..37fe83b0c358 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -925,7 +925,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> * prototypes must be defined in the arch-specific asm/pgtable.h file.
> */
> #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> -static inline int arch_prepare_to_swap(struct page *page)
> +static inline int arch_prepare_to_swap(struct folio *folio)
> {
> return 0;
> }
> diff --git a/mm/page_io.c b/mm/page_io.c
> index ae2b49055e43..a9a7c236aecc 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> * Arch code may have to preserve more data than just the page
> * contents, e.g. memory tags.
> */
> - ret = arch_prepare_to_swap(&folio->page);
> + ret = arch_prepare_to_swap(folio);
> if (ret) {
> folio_mark_dirty(folio);
> folio_unlock(folio);
> diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> index 0bec1f705f8e..2325adbb1f19 100644
> --- a/mm/swap_slots.c
> +++ b/mm/swap_slots.c
> @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> entry.val = 0;
>
> if (folio_test_large(folio)) {
> - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> + if (IS_ENABLED(CONFIG_THP_SWAP))
> get_swap_pages(1, &entry, folio_nr_pages(folio));
> goto out;
> }
> --
> 2.34.1
>
>

2024-01-26 23:23:27

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free()

On Thu, Jan 18, 2024 at 3:11 AM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> While swapping in a large folio, we need to free swaps related to the whole
> folio. To avoid frequently acquiring and releasing swap locks, it is better
> to introduce an API for batched free.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/swap.h | 6 ++++++
> mm/swapfile.c | 29 +++++++++++++++++++++++++++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4db00ddad261..31a4ee2dcd1c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -478,6 +478,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> extern int swap_duplicate(swp_entry_t);
> extern int swapcache_prepare(swp_entry_t);
> extern void swap_free(swp_entry_t);
> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> extern void swapcache_free_entries(swp_entry_t *entries, int n);
> extern int free_swap_and_cache(swp_entry_t);
> int swap_type_of(dev_t device, sector_t offset);
> @@ -553,6 +554,11 @@ static inline void swap_free(swp_entry_t swp)
> {
> }
>
> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> +{
> +
> +}
> +
> static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> {
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 556ff7347d5f..6321bda96b77 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1335,6 +1335,35 @@ void swap_free(swp_entry_t entry)
> __swap_entry_free(p, entry);
> }
>
> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> +{
> + int i;
> + struct swap_cluster_info *ci;
> + struct swap_info_struct *p;
> + unsigned type = swp_type(entry);
> + unsigned long offset = swp_offset(entry);
> + DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> +
> + VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);

The BUG_ON here seems a bit too developer-oriented. Maybe warn once and
fall back to freeing one by one?

How big are your typical SWAPFILE_CLUSTER and nr_pages on arm?

I ask because if nr_pages > 64, that is a totally different game: we can
completely bypass the swap slot cache.

> +
> + if (nr_pages == 1) {
> + swap_free(entry);
> + return;
> + }
> +
> + p = _swap_info_get(entry);
> +
> + ci = lock_cluster(p, offset);
> + for (i = 0; i < nr_pages; i++) {
> + if (__swap_entry_free_locked(p, offset + i, 1))
> + __bitmap_set(usage, i, 1);
> + }
> + unlock_cluster(ci);
> +
> + for_each_clear_bit(i, usage, nr_pages)
> + free_swap_slot(swp_entry(type, offset + i));

Notice that free_swap_slot() internally has per-CPU cache batching as
well. Every free_swap_slot() call takes the per-cpu swap slot cache and
cache->lock, so there is double batching here.
If the typical batch size here is bigger than 64 entries, we can go
directly to a batched swap_entry_free() and avoid the free_swap_slot()
batching altogether. Unlike free_swap_slot_entries(), the swap slots here
are all from one swap device, so there is no need to sort and group the
swap slots by swap device.
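
Concretely, something along these lines (a rough sketch of the suggestion;
swap_entry_free() here is the existing swapfile.c-internal helper mentioned
above, and the exact locking it expects is glossed over):

        ci = lock_cluster(p, offset);
        for (i = 0; i < nr_pages; i++) {
                /* entry still referenced elsewhere? then keep it */
                if (__swap_entry_free_locked(p, offset + i, 1))
                        __bitmap_set(usage, i, 1);
        }
        unlock_cluster(ci);

        /*
         * Free the entries whose count dropped to zero directly, bypassing
         * the per-cpu slot cache that free_swap_slot() would take.
         */
        for_each_clear_bit(i, usage, nr_pages)
                swap_entry_free(p, swp_entry(type, offset + i));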

Chris

> +}
> +
> /*
> * Called after dropping swapcache to decrease refcnt to swap entries.
> */
> --
> 2.34.1
>
>

2024-01-26 23:34:38

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio

Acked-by: Chris Li <[email protected]>

Chris

On Thu, Jan 18, 2024 at 3:11 AM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> should_try_to_free_swap() works with an assumption that swap-in is always done
> at normal page granularity, aka, folio_nr_pages = 1. To support large folio
> swap-in, this patch removes the assumption.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e1f4849463a..f61a48929ba7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3714,7 +3714,7 @@ static inline bool should_try_to_free_swap(struct folio *folio,
> * reference only in case it's likely that we'll be the exlusive user.
> */
> return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> - folio_ref_count(folio) == 2;
> + folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> }
>
> static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> --
> 2.34.1
>
>

2024-01-27 14:27:42

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 0/6] mm: support large folios swap-in

On Fri, Jan 19, 2024 at 9:25 PM Ryan Roberts <[email protected]> wrote:
>
> On 18/01/2024 23:54, Barry Song wrote:
> > On Thu, Jan 18, 2024 at 11:25 PM Ryan Roberts <ryan.roberts@armcom> wrote:
> >>
> >> On 18/01/2024 11:10, Barry Song wrote:
> >>> On an embedded system like Android, more than half of anon memory is actually
> >>> in swap devices such as zRAM. For example, while an app is switched to back-
> >>> ground, its most memory might be swapped-out.
> >>>
> >>> Now we have mTHP features, unfortunately, if we don't support large folios
> >>> swap-in, once those large folios are swapped-out, we immediately lose the
> >>> performance gain we can get through large folios and hardware optimization
> >>> such as CONT-PTE.
> >>>
> >>> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> >>> before swap-out, if some memory were normal pages, but when swapping in, we
> >>> can also swap-in them as large folios.
> >>
> >> I think this could also violate MADV_NOHUGEPAGE; if the application has
> >> requested that we do not create a THP, then we had better not; it could cause a
> >> correctness issue in some circumstances. You would need to pay attention to this
> >> vma flag if taking this approach.
> >>
> >>> But this might require I/O happen at
> >>> some random places in swap devices. So we limit the large folios swap-in to
> >>> those areas which were large folios before swapping-out, aka, swaps are also
> >>> contiguous in hardware.
> >>
> >> In fact, even this may not be sufficient; it's possible that a contiguous set of
> >> base pages (small folios) were allocated to a virtual mapping and all swapped
> >> out together - they would likely end up contiguous in the swap file, but should
> >> not be swapped back in as a single folio because of this (same reasoning applies
> >> to cluster of smaller THPs that you mistake for a larger THP, etc).
> >>
> >> So you will need to check what THP sizes are enabled and check the VMA
> >> suitability regardless; Perhaps you are already doing this - I haven't looked at
> >> the code yet.
> >
> > we are actually re-using your alloc_anon_folio() by adding a parameter
> > to make it
> > support both do_anon_page and do_swap_page,
> >
> > -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > + bool (*pte_range_check)(pte_t *, int))
> > {
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > struct vm_area_struct *vma = vmf->vma;
> > @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct
> > vm_fault *vmf)
> > order = highest_order(orders);
> > while (orders) {
> > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > - if (pte_range_none(pte + pte_index(addr), 1 << order))
> > + if (pte_range_check(pte + pte_index(addr), 1 << order))
> > break;
> > order = next_order(&orders, order);
> > }
> > @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > if (unlikely(anon_vma_prepare(vma)))
> > goto oom;
> > /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> > - folio = alloc_anon_folio(vmf);
> > + folio = alloc_anon_folio(vmf, pte_range_none);
> > if (IS_ERR(folio))
> > return 0;
> > if (!folio)
> > --
> >
> > I assume this has checked everything?
>
> Ahh yes, very good. In that case you can disregard what I said; it's already covered.
>
> I notice that this series appears as a reply to my series. I'm not sure what the
> normal convention is, but I expect more people would see it if you posted it as
> its own thread?

Yes. I was replying to your series because we were using your series to
swap out large folios without splitting. In v2, I will send it as a new
thread.

>
>
> >
> >>
> >> I'll aim to review the code in the next couple of weeks.
> >
> > nice, thanks!
> >
> >>
> >> Thanks,
> >> Ryan
> >>
> >>> On the other hand, in OPPO's product, we've deployed
> >>> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> >>> compress and decompress large folios as a whole, which help improve compression
> >>> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> >>> save large objects whose original size are 64KiB for example. So it is also a
> >>> better choice for us to only swap-in large folios for those compressed large
> >>> objects as a large folio can be decompressed all together.
> >>>
> >>> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> >>> with MTE" to this series as it might help review.
> >>>
> >>> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> >>> https://lore.kernel.org/linux-mm/[email protected]/
> >>> [2] OnePlusOSS / android_kernel_oneplus_sm8550
> >>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >>>
> >>> Barry Song (2):
> >>> arm64: mm: swap: support THP_SWAP on hardware with MTE
> >>> mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
> >>>
> >>> Chuanhua Han (4):
> >>> mm: swap: introduce swap_nr_free() for batched swap_free()
> >>> mm: swap: make should_try_to_free_swap() support large-folio
> >>> mm: support large folios swapin as a whole
> >>> mm: madvise: don't split mTHP for MADV_PAGEOUT
> >>>
> >>> arch/arm64/include/asm/pgtable.h | 21 ++----
> >>> arch/arm64/mm/mteswap.c | 42 ++++++++++++
> >>> include/asm-generic/tlb.h | 10 +++
> >>> include/linux/huge_mm.h | 12 ----
> >>> include/linux/pgtable.h | 62 ++++++++++++++++-
> >>> include/linux/swap.h | 6 ++
> >>> mm/madvise.c | 48 ++++++++++++++
> >>> mm/memory.c | 110 ++++++++++++++++++++++++++-----
> >>> mm/page_io.c | 2 +-
> >>> mm/rmap.c | 5 +-
> >>> mm/swap_slots.c | 2 +-
> >>> mm/swapfile.c | 29 ++++++++
> >>> 12 files changed, 301 insertions(+), 48 deletions(-)
> >>>
> >>
> >
Thanks
Barry
>

2024-01-27 19:53:49

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> to those contiguous swaps which were likely swapped out from mTHP as a whole.
>
> On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
> case. It doesn't support swapin_readahead as large folios yet.
>
> Right now, we are re-faulting large folios which are still in swapcache as a
> whole, this can effectively decrease extra loops and early-exitings which we
> have increased in arch_swap_restore() while supporting MTE restore for folios
> rather than page.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 94 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f61a48929ba7..928b3f542932 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
> static vm_fault_t do_fault(struct vm_fault *vmf);
> static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
> static bool vmf_pte_changed(struct vm_fault *vmf);
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> + bool (*pte_range_check)(pte_t *, int));

Instead of returning "bool", the pte_range_check() can return the
start of the swap entry of the large folio.
That will save some of the later code needed to get the start of the
large folio.

>
> /*
> * Return true if the original pte was a uffd-wp pte marker (so the pte was
> @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> return VM_FAULT_SIGBUS;
> }
>
> +static bool pte_range_swap(pte_t *pte, int nr_pages)

This function name seems to suggest it will perform the range swap,
which is not what it is doing.
Suggest changing it to some other name reflecting that it is only a
condition test without any actual swap action.
I am not very good at naming functions; just thinking out loud, e.g.
pte_range_swap_check, pte_test_range_swap. You can come up with
something better.


> +{
> + int i;
> + swp_entry_t entry;
> + unsigned type;
> + pgoff_t start_offset;
> +
> + entry = pte_to_swp_entry(ptep_get_lockless(pte));
> + if (non_swap_entry(entry))
> + return false;
> + start_offset = swp_offset(entry);
> + if (start_offset % nr_pages)
> + return false;

This suggests the pte argument needs to point to the beginning of the
large-folio equivalent of the swap entry (not sure what to call it; let me
call it "large folio swap" here).
We might want to unify the terms for that.
Anyway, we might want to document this requirement, otherwise the caller
might consider passing the current pte that generates the fault. From
the function name it is not obvious which pte should be passed.

> +
> + type = swp_type(entry);
> + for (i = 1; i < nr_pages; i++) {

You might want to test the last page first, because if this range is not
a large folio swap, it is most likely the last entry that will be invalid.
Some of the beginning swap entries might match due to batch allocation
etc.; the SSD likes to group nearby swap entries' write-out together on
the disk.
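
E.g. once the first entry has been read and start_offset/type derived,
something like the below (illustrative only):

        /*
         * Check the last entry next: if this is not a large folio swap,
         * the last entry is the most likely one to mismatch, giving a
         * cheap early exit before walking the whole range.
         */
        entry = pte_to_swp_entry(ptep_get_lockless(pte + nr_pages - 1));
        if (non_swap_entry(entry) || swp_type(entry) != type ||
            swp_offset(entry) != start_offset + nr_pages - 1)
                return false;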



> + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));

> + if (non_swap_entry(entry))
> + return false;
> + if (swp_offset(entry) != start_offset + i)
> + return false;
> + if (swp_type(entry) != type)
> + return false;
> + }
> +
> + return true;
> +}
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long start_address;
> + pte_t *start_pte;
>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> /* skip swapcache */
> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> - vma, vmf->address, false);
> + folio = alloc_anon_folio(vmf, pte_range_swap);

This function can call pte_range_swap() twice: once here, and another time
in the folio_test_large() branch.
Consider caching the result so it does not need to walk the pte range
twice.

I think alloc_anon_folio() should either be told what the size is
(preferred) or just figure out the right size itself. I don't think it
needs the checking function passed in as a callback. There are only two
call sites of alloc_anon_folio(), both within this file. The callback
seems a bit overkill here, and it also duplicates the range swap walk.

> page = &folio->page;
> if (folio) {
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
>
> + if (folio_test_large(folio)) {
> + unsigned long start_offset;
> +
> + nr_pages = folio_nr_pages(folio);
> + start_offset = swp_offset(entry) & ~(nr_pages - 1);
Here is the first place we roll up the start offset to the folio size.

> + entry = swp_entry(swp_type(entry), start_offset);
> + }
> +
> if (mem_cgroup_swapin_charge_folio(folio,
> vma->vm_mm, GFP_KERNEL,
> entry)) {
> @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);
> +
> + start_address = vmf->address;
> + start_pte = vmf->pte;
> + if (folio_test_large(folio)) {
> + unsigned long nr = folio_nr_pages(folio);
> + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> + pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;

Here is the second place we roll up to the folio size.
Maybe we can cache results and avoid repetition?

> +
> + /*
> + * case 1: we are allocating large_folio, try to map it as a whole
> + * iff the swap entries are still entirely mapped;
> + * case 2: we hit a large folio in swapcache, and all swap entries
> + * are still entirely mapped, try to map a large folio as a whole.
> + * otherwise, map only the faulting page within the large folio
> + * which is swapcache
> + */

One question I have in mind: the swap device is locked, so we can't
change the swap slot allocations.
But that does not stop the pte entries from getting changed, right? Then
we can have someone in user space racing to change the PTEs while we are
checking them here.

> + if (pte_range_swap(pte_t, nr)) {

After this pte_range_swap() check, could some of the PTE entries get
changed, so that we no longer have the full large folio swap?
At least I can't yet conclude that this can't happen; please enlighten me.

> + start_address = addr;
> + start_pte = pte_t;
> + if (unlikely(folio == swapcache)) {
> + /*
> + * the below has been done before swap_read_folio()
> + * for case 1
> + */
> + nr_pages = nr;
> + entry = pte_to_swp_entry(ptep_get(start_pte));

If we make pte_range_swap() return the entry, we can avoid refetching
the swap entry here.
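
For example, with a hypothetical extra out-parameter (illustrative only):

        static bool pte_range_swap(pte_t *pte, int nr_pages, swp_entry_t *first)
        {
                swp_entry_t entry = pte_to_swp_entry(ptep_get_lockless(pte));
                pgoff_t start_offset;
                unsigned type;
                int i;

                if (non_swap_entry(entry))
                        return false;
                start_offset = swp_offset(entry);
                if (start_offset % nr_pages)
                        return false;

                type = swp_type(entry);
                for (i = 1; i < nr_pages; i++) {
                        swp_entry_t e = pte_to_swp_entry(ptep_get_lockless(pte + i));

                        if (non_swap_entry(e) || swp_type(e) != type ||
                            swp_offset(e) != start_offset + i)
                                return false;
                }

                *first = entry;         /* callers stop refetching/realigning it */
                return true;
        }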

> + page = &folio->page;
> + }
> + } else if (nr_pages > 1) { /* ptes have changed for case 1 */
> + goto out_nomap;
> + }
> + }
> +
I rewrote the above to make the code indentation match the execution flow.
There is no functional change; I just rearranged the code to be a bit more
streamlined and got rid of the "else if goto":
        if (!pte_range_swap(pte_t, nr)) {
                if (nr_pages > 1) /* ptes have changed for case 1 */
                        goto out_nomap;
                goto check_pte;
        }

        start_address = addr;
        start_pte = pte_t;
        if (unlikely(folio == swapcache)) {
                /*
                 * the below has been done before swap_read_folio()
                 * for case 1
                 */
                nr_pages = nr;
                entry = pte_to_swp_entry(ptep_get(start_pte));
                page = &folio->page;
        }
}

check_pte:

> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> @@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_nr_free(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || folio_ref_count(folio) == nr_pages)) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> @@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio_add_new_anon_rmap(folio, vma, vmf->address);
> folio_add_lru_vma(folio, vma);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> +
> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> if (vmf->flags & FAULT_FLAG_WRITE) {
> + if (folio_test_large(folio) && nr_pages > 1)
> + vmf->orig_pte = ptep_get(vmf->pte);
> +
> ret |= do_wp_page(vmf);
> if (ret & VM_FAULT_ERROR)
> ret &= VM_FAULT_ERROR;
> @@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
> return true;
> }
>
> -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> + bool (*pte_range_check)(pte_t *, int))
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> struct vm_area_struct *vma = vmf->vma;
> @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)

About this patch context, we have the following comment in the source code:

        /*
         * Find the highest order where the aligned range is completely
         * pte_none(). Note that all remaining orders will be completely
         * pte_none().
         */
> order = highest_order(orders);
> while (orders) {
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> - if (pte_range_none(pte + pte_index(addr), 1 << order))
> + if (pte_range_check(pte + pte_index(addr), 1 << order))

Again, I don't think we need to pass in pte_range_check() as a callback
function.
There are only two call sites, both within this file, and the callback
totally invalidates the above comment about pte_none(). In the worst case,
just make it accept one extra argument saying whether it is checking for a
swap range or a none range, and check accordingly.
My gut feeling is that there should be a better way to make the range check
blend in with alloc_anon_folio(), e.g. maybe store some of the large swap
context in the vmf and pass it to the different places, etc. I need to
spend more time thinking about it to come up with a happier solution.
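
E.g. one minimal way to drop the callback while keeping both users
(hypothetical shape, illustrative only):

        static bool pte_range_check(pte_t *pte, int nr_pages, bool want_swap)
        {
                /*
                 * Either a fully pte_none() range (anonymous fault) or a
                 * fully contiguous swap range (swap-in), using the existing
                 * helpers.
                 */
                return want_swap ? pte_range_swap(pte, nr_pages) :
                                   pte_range_none(pte, nr_pages);
        }

and then alloc_anon_folio(vmf, true/false) instead of passing a function
pointer -- again, just to show the shape; the vmf-based idea above may well
turn out nicer.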

Chris

> break;
> order = next_order(&orders, order);
> }
> @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> - folio = alloc_anon_folio(vmf);
> + folio = alloc_anon_folio(vmf, pte_range_none);
> if (IS_ERR(folio))
> return 0;
> if (!folio)
> --
> 2.34.1
>
>

2024-01-27 20:06:37

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> to those contiguous swaps which were likely swapped out from mTHP as a whole.
>
> On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
> case. It doesn't support swapin_readahead as large folios yet.
>
> Right now, we are re-faulting large folios which are still in swapcache as a
> whole, this can effectively decrease extra loops and early-exitings which we
> have increased in arch_swap_restore() while supporting MTE restore for folios
> rather than page.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 94 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f61a48929ba7..928b3f542932 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
> static vm_fault_t do_fault(struct vm_fault *vmf);
> static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
> static bool vmf_pte_changed(struct vm_fault *vmf);
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> + bool (*pte_range_check)(pte_t *, int));
>
> /*
> * Return true if the original pte was a uffd-wp pte marker (so the pte was
> @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> return VM_FAULT_SIGBUS;
> }
>
> +static bool pte_range_swap(pte_t *pte, int nr_pages)
> +{
> + int i;
> + swp_entry_t entry;
> + unsigned type;
> + pgoff_t start_offset;
> +
> + entry = pte_to_swp_entry(ptep_get_lockless(pte));
> + if (non_swap_entry(entry))
> + return false;
> + start_offset = swp_offset(entry);
> + if (start_offset % nr_pages)
> + return false;
> +
> + type = swp_type(entry);
> + for (i = 1; i < nr_pages; i++) {
> + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> + if (non_swap_entry(entry))
> + return false;
> + if (swp_offset(entry) != start_offset + i)
> + return false;
> + if (swp_type(entry) != type)
> + return false;
> + }
> +
> + return true;
> +}
> +
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> * but allow concurrent faults), and pte mapped but not yet locked.
> @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long start_address;
> + pte_t *start_pte;
>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> /* skip swapcache */
> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> - vma, vmf->address, false);
> + folio = alloc_anon_folio(vmf, pte_range_swap);
> page = &folio->page;
> if (folio) {
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
>
> + if (folio_test_large(folio)) {
> + unsigned long start_offset;
> +
> + nr_pages = folio_nr_pages(folio);
> + start_offset = swp_offset(entry) & ~(nr_pages - 1);
> + entry = swp_entry(swp_type(entry), start_offset);
> + }
> +
> if (mem_cgroup_swapin_charge_folio(folio,
> vma->vm_mm, GFP_KERNEL,
> entry)) {
> @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);
> +
> + start_address = vmf->address;
> + start_pte = vmf->pte;
> + if (folio_test_large(folio)) {
> + unsigned long nr = folio_nr_pages(folio);
> + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> + pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;

I forgot one comment here.
Please change the variable name to something other than "pte_t"; it is a
bit strange to use the typedef name as a variable name here.

Chris

> +
> + /*
> + * case 1: we are allocating large_folio, try to map it as a whole
> + * iff the swap entries are still entirely mapped;
> + * case 2: we hit a large folio in swapcache, and all swap entries
> + * are still entirely mapped, try to map a large folio as a whole.
> + * otherwise, map only the faulting page within the large folio
> + * which is swapcache
> + */
> + if (pte_range_swap(pte_t, nr)) {
> + start_address = addr;
> + start_pte = pte_t;
> + if (unlikely(folio == swapcache)) {
> + /*
> + * the below has been done before swap_read_folio()
> + * for case 1
> + */
> + nr_pages = nr;
> + entry = pte_to_swp_entry(ptep_get(start_pte));
> + page = &folio->page;
> + }
> + } else if (nr_pages > 1) { /* ptes have changed for case 1 */
> + goto out_nomap;
> + }
> + }
> +
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> @@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_nr_free(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || folio_ref_count(folio) == nr_pages)) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> @@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio_add_new_anon_rmap(folio, vma, vmf->address);
> folio_add_lru_vma(folio, vma);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> +
> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> if (vmf->flags & FAULT_FLAG_WRITE) {
> + if (folio_test_large(folio) && nr_pages > 1)
> + vmf->orig_pte = ptep_get(vmf->pte);
> +
> ret |= do_wp_page(vmf);
> if (ret & VM_FAULT_ERROR)
> ret &= VM_FAULT_ERROR;
> @@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
> return true;
> }
>
> -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> + bool (*pte_range_check)(pte_t *, int))
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> struct vm_area_struct *vma = vmf->vma;
> @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> order = highest_order(orders);
> while (orders) {
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> - if (pte_range_none(pte + pte_index(addr), 1 << order))
> + if (pte_range_check(pte + pte_index(addr), 1 << order))
> break;
> order = next_order(&orders, order);
> }
> @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> - folio = alloc_anon_folio(vmf);
> + folio = alloc_anon_folio(vmf, pte_range_none);
> if (IS_ERR(folio))
> return 0;
> if (!folio)
> --
> 2.34.1
>
>

2024-01-27 23:42:10

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
>
> From: Barry Song <[email protected]>
>
> In do_swap_page(), while supporting large folio swap-in, we are using the helper
> folio_add_anon_rmap_ptes. This is triggering a WARN_ON in __folio_add_anon_rmap.
> We can quiet the warning in two ways:
> 1. in do_swap_page, we call folio_add_new_anon_rmap() if we are sure the large
> folio is a newly allocated one; we call folio_add_anon_rmap_ptes() if we find the
> large folio in swapcache.
> 2. we always call folio_add_anon_rmap_ptes() in do_swap_page but weaken the
> WARN_ON in __folio_add_anon_rmap() by making the WARN_ON less sensitive.
>
> Option 2 seems to be better for do_swap_page() as it can use unified code for
> all cases.
>
> Signed-off-by: Barry Song <[email protected]>
> Tested-by: Chuanhua Han <[email protected]>
> ---
> mm/rmap.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f5d43edad529..469fcfd32317 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1304,7 +1304,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> * page.
> */
> VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> - level != RMAP_LEVEL_PMD, folio);
> + level != RMAP_LEVEL_PMD &&
> + (!IS_ALIGNED(address, nr_pages * PAGE_SIZE) ||
A minor nitpick here.
There are two leading "(" on this line and the next. This is the first "("
> + (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages)) ||
Second "(" here.

These two "(" are NOT at the same nested level. They should not have
the same indentation.
On my first glance, I misread the scope of the "||" due to the same
level indentation.
We can do one of the two
1) add more indentation on the second "(" to reflect the nesting level.

> + page != &folio->page), folio);

Also move the trailing "folio" argument to the next line, because the
multiline expression is huge and complex; make it obvious that the
ending "folio" is not part of the test condition.

2) Move the multiline test condition into a checking function. Inside
the function it can return early when a short-circuit condition is met.
That will also help the readability of this warning condition.
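
A rough sketch of option 2, with the helper name and placement only
illustrative (it simply mirrors the condition above):

static bool __folio_add_anon_rmap_sanity_failed(struct folio *folio,
						struct page *page,
						unsigned long address,
						int nr_pages,
						enum rmap_level level)
{
	/* the warning only applies to PTE-mapped large folios */
	if (!folio_test_large(folio) || level == RMAP_LEVEL_PMD)
		return false;
	if (!IS_ALIGNED(address, nr_pages * PAGE_SIZE))
		return true;
	if (folio_test_swapcache(folio) && !IS_ALIGNED(folio->index, nr_pages))
		return true;
	return page != &folio->page;
}

	VM_WARN_ON_FOLIO(__folio_add_anon_rmap_sanity_failed(folio, page,
				address, nr_pages, level), folio);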

Chris

> __folio_set_anon(folio, vma, address,
> !!(flags & RMAP_EXCLUSIVE));
> } else if (likely(!folio_test_ksm(folio))) {
> --
> 2.34.1
>
>

2024-01-29 02:15:29

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
>
> From: Chuanhua Han <[email protected]>
>
> MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> supported swapping large folios out as a whole for vmscan case. This patch
> extends the feature to madvise.
>
> If madvised range covers the whole large folio, we don't split it. Otherwise,
> we still need to split it.
>
> This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> mapped to a large folio.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/asm-generic/tlb.h | 10 +++++++
> include/linux/pgtable.h | 60 +++++++++++++++++++++++++++++++++++++++
> mm/madvise.c | 48 +++++++++++++++++++++++++++++++
> 3 files changed, 118 insertions(+)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 129a3a759976..f894e22da5d6 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> __tlb_remove_tlb_entry(tlb, ptep, address); \
> } while (0)
>
> +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr) \
> + do { \
> + int i; \
> + tlb_flush_pte_range(tlb, address, \
> + PAGE_SIZE * nr); \
> + for (i = 0; i < nr; i++) \
> + __tlb_remove_tlb_entry(tlb, ptep + i, \
> + address + i * PAGE_SIZE); \
> + } while (0)
> +
> #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
> do { \
> unsigned long _sz = huge_page_size(h); \
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 37fe83b0c358..da0c1cf447e3 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> }
> #endif
>
> +#ifndef pte_range_cont_mapped
> +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> + pte_t *start_pte,
> + unsigned long start_addr,
> + int nr)
> +{
> + int i;
> + pte_t pte_val;
> +
> + for (i = 0; i < nr; i++) {
> + pte_val = ptep_get(start_pte + i);
> +
> + if (pte_none(pte_val))
> + return false;

Hmm, shouldn't the following pte_pfn == start_pfn + i check already cover
the pte_none case?

I think pte_none means the PTE can't have a valid pfn, so this check
can be skipped?

> +
> + if (pte_pfn(pte_val) != (start_pfn + i))
> + return false;
> + }
> +
> + return true;
> +}
> +#endif
> +
> +#ifndef pte_range_young
> +static inline bool pte_range_young(pte_t *start_pte, int nr)
> +{
> + int i;
> +
> + for (i = 0; i < nr; i++)
> + if (pte_young(ptep_get(start_pte + i)))
> + return true;
> +
> + return false;
> +}
> +#endif
> +
> #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long address,
> @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> }
> #endif
>
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> + unsigned long start_addr,
> + pte_t *start_pte,
> + int nr, int full)
> +{
> + int i;
> + pte_t pte;
> +
> + pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> +
> + for (i = 1; i < nr; i++)
> + ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> + start_pte + i, full);
> +
> + return pte;
> +}
>
> /*
> * If two threads concurrently fault at the same page, the thread that
> @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> })
> #endif
>
> +#ifndef pte_nr_addr_end
> +#define pte_nr_addr_end(addr, size, end) \
> +({ unsigned long __boundary = ((addr) + size) & (~(size - 1)); \
> + (__boundary - 1 < (end) - 1)? __boundary: (end); \
> +})
> +#endif
> +
> /*
> * When walking page tables, we usually want to skip any p?d_none entries;
> * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 912155a94ed5..262460ac4b2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (folio_test_large(folio)) {
> int err;
>
> + if (!folio_test_pmd_mappable(folio)) {

This section of code is indented too far to the right.
You can do:

if (folio_test_pmd_mappable(folio))
goto split;

to make the code flatter.

> + int nr_pages = folio_nr_pages(folio);
> + unsigned long folio_size = PAGE_SIZE * nr_pages;
> + unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;
> + unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> + pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
> + unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> +
> + if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> + goto split;
> +
> + if (next - addr != folio_size) {

Nitpick: a one-line statement does not need braces.

> + goto split;
> + } else {

When the previous if statement already does "goto split", there is no need
for the else; you can save one level of indentation.



> + /* Do not interfere with other mappings of this page */
> + if (folio_estimated_sharers(folio) != 1)
> + goto skip;
> +
> + VM_BUG_ON(addr != start_addr || pte != start_pte);
> +
> + if (pte_range_young(start_pte, nr_pages)) {
> + ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> + nr_pages, tlb->fullmm);
> + ptent = pte_mkold(ptent);
> +
> + set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> + tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> + }
> +
> + folio_clear_referenced(folio);
> + folio_test_clear_young(folio);
> + if (pageout) {
> + if (folio_isolate_lru(folio)) {
> + if (folio_test_unevictable(folio))
> + folio_putback_lru(folio);
> + else
> + list_add(&folio->lru, &folio_list);
> + }
> + } else
> + folio_deactivate(folio);

I notice this section is very similar to the earlier code inside
the same function, under "if (pmd_trans_huge(*pmd)) {".

Wondering if there is some way to unify the two somehow.

Also notice that if you test the else condition first,

If (!pageout) {
folio_deactivate(folio);
goto skip;
}

You can save one level of indentation.
Not your fault, but I notice the section inside (pmd_trans_huge(*pmd))
does exactly the same thing.

Chris


> + }
> +skip:
> + pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> + addr = next - PAGE_SIZE;
> + continue;
> +
> + }
> +split:
> if (folio_estimated_sharers(folio) != 1)
> break;
> if (pageout_anon_only_filter && !folio_test_anon(folio))
> --
> 2.34.1
>
>

2024-01-29 03:25:45

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

Hi David and Barry,

On Mon, Jan 22, 2024 at 10:49 PM Barry Song <[email protected]> wrote:
>
> >
> >
> > I have on my todo list to move all that !anon handling out of
> > folio_add_anon_rmap_ptes(), and instead make swapin code call add
> > folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> > then (-> whole new folio exclusive).
> >
> > That's the cleaner approach.
> >
>
> one tricky thing is that sometimes it is hard to know who is the first
> one to add rmap and thus should
> call folio_add_new_anon_rmap.
> especially when we want to support swapin_readahead(), the one who
> allocated large filio might not
> be that one who firstly does rmap.

I think Barry has a point. Two tasks might race to swap in the folio
then race to perform the rmap.
folio_add_new_anon_rmap() should only be called on a folio that is absolutely
"new", not shared. Sharing via the swap cache disqualifies that
condition.

> is it an acceptable way to do the below in do_swap_page?
> if (!folio_test_anon(folio))
> folio_add_new_anon_rmap()
> else
> folio_add_anon_rmap_ptes()

I am curious to know the answer as well.

BTW, that test might have a race as well. By the time the task gets the
!anon result, it might have been changed by another task. We need
to make sure this race can't happen in the caller's context. Otherwise
we can't do the above safely.

Chris.

2024-01-29 09:07:22

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH RFC 0/6] mm: support large folios swap-in

Barry Song <[email protected]> writes:

> On an embedded system like Android, more than half of anon memory is actually
> in swap devices such as zRAM. For example, while an app is switched to back-
> ground, its most memory might be swapped-out.
>
> Now we have mTHP features, unfortunately, if we don't support large folios
> swap-in, once those large folios are swapped-out, we immediately lose the
> performance gain we can get through large folios and hardware optimization
> such as CONT-PTE.
>
> In theory, we don't need to rely on Ryan's swap out patchset[1]. That is to say,
> before swap-out, if some memory were normal pages, but when swapping in, we
> can also swap-in them as large folios. But this might require I/O happen at
> some random places in swap devices. So we limit the large folios swap-in to
> those areas which were large folios before swapping-out, aka, swaps are also
> contiguous in hardware. On the other hand, in OPPO's product, we've deployed
> anon large folios on millions of phones[2]. we enhanced zsmalloc and zRAM to
> compress and decompress large folios as a whole, which help improve compression
> ratio and decrease CPU consumption significantly. In zsmalloc and zRAM we can
> save large objects whose original size are 64KiB for example. So it is also a
> better choice for us to only swap-in large folios for those compressed large
> objects as a large folio can be decompressed all together.

Another possibility is to combine large folio swap-in with VMA-based
swap-in readahead. If we are going to read ahead several pages based on
the VMA, we can swap in a large folio instead.

I think it is similar to allocating large file folios for file
readahead (TBH, I haven't checked the file large folio allocation code).

--
Best Regards,
Huang, Ying

> Note I am moving my previous "arm64: mm: swap: support THP_SWAP on hardware
> with MTE" to this series as it might help review.
>
> [1] [PATCH v3 0/4] Swap-out small-sized THP without splitting
> https://lore.kernel.org/linux-mm/[email protected]/
> [2] OnePlusOSS / android_kernel_oneplus_sm8550
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>
> Barry Song (2):
> arm64: mm: swap: support THP_SWAP on hardware with MTE
> mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()
>
> Chuanhua Han (4):
> mm: swap: introduce swap_nr_free() for batched swap_free()
> mm: swap: make should_try_to_free_swap() support large-folio
> mm: support large folios swapin as a whole
> mm: madvise: don't split mTHP for MADV_PAGEOUT
>
> arch/arm64/include/asm/pgtable.h | 21 ++----
> arch/arm64/mm/mteswap.c | 42 ++++++++++++
> include/asm-generic/tlb.h | 10 +++
> include/linux/huge_mm.h | 12 ----
> include/linux/pgtable.h | 62 ++++++++++++++++-
> include/linux/swap.h | 6 ++
> mm/madvise.c | 48 ++++++++++++++
> mm/memory.c | 110 ++++++++++++++++++++++++++-----
> mm/page_io.c | 2 +-
> mm/rmap.c | 5 +-
> mm/swap_slots.c | 2 +-
> mm/swapfile.c | 29 ++++++++
> 12 files changed, 301 insertions(+), 48 deletions(-)

2024-01-29 10:07:16

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On 29.01.24 04:25, Chris Li wrote:
> Hi David and Barry,
>
> On Mon, Jan 22, 2024 at 10:49 PM Barry Song <[email protected]> wrote:
>>
>>>
>>>
>>> I have on my todo list to move all that !anon handling out of
>>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
>>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
>>> then (-> whole new folio exclusive).
>>>
>>> That's the cleaner approach.
>>>
>>
>> one tricky thing is that sometimes it is hard to know who is the first
>> one to add rmap and thus should
>> call folio_add_new_anon_rmap.
>> especially when we want to support swapin_readahead(), the one who
>> allocated large filio might not
>> be that one who firstly does rmap.
>
> I think Barry has a point. Two tasks might race to swap in the folio
> then race to perform the rmap.
> folio_add_new_anon_rmap() should only call a folio that is absolutely
> "new", not shared. The sharing in swap cache disqualifies that
> condition.

We have to hold the folio lock. So only one task at a time might do the
folio_add_anon_rmap_ptes() right now, and the
folio_add_new_shared_anon_rmap() in the future [below].

Also observe how folio_add_anon_rmap_ptes() states that one must hold
the page lock, because otherwise this would all be completely racy.

From the pte swp exclusive flags, we know for sure whether we are
dealing with exclusive vs. shared. I think patch #6 does not properly
check that all entries are actually the same in that regard (all
exclusive vs all shared). That likely needs fixing.

[I have converting per-page PageAnonExclusive flags to a single
per-folio flag on my todo list. I suspect that we'll keep the
per-swp-pte exclusive bits, but the question is rather what we can
actually make work, because swap and migration just make it much more
complicated. Anyhow, future work]

>
>> is it an acceptable way to do the below in do_swap_page?
>> if (!folio_test_anon(folio))
>> folio_add_new_anon_rmap()
>> else
>> folio_add_anon_rmap_ptes()
>
> I am curious to know the answer as well.


Yes, the end code should likely be something like:

/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
folio_add_new_anon_rmap(folio, vma, vmf->address);
folio_add_lru_vma(folio, vma);
} else if (folio_test_anon(folio)) {
folio_add_anon_rmap_ptes(rmap_flags)
} else {
folio_add_new_anon_rmap(rmap_flags)
}

Maybe we want to avoid teaching all existing folio_add_new_anon_rmap()
callers about a new flag, and just have a new
folio_add_new_shared_anon_rmap() instead. TBD.

>
> BTW, that test might have a race as well. By the time the task got
> !anon result, this result might get changed by another task. We need
> to make sure in the caller context this race can't happen. Otherwise
> we can't do the above safely.
Again, folio lock. Observe the folio_lock_or_retry() call that covers
our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.

--
Cheers,

David / dhildenb


2024-01-29 16:35:19

by Chris Li

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On Mon, Jan 29, 2024 at 2:07 AM David Hildenbrand <[email protected]> wrote:
>
> On 29.01.24 04:25, Chris Li wrote:
> > Hi David and Barry,
> >
> > On Mon, Jan 22, 2024 at 10:49 PM Barry Song <[email protected]> wrote:
> >>
> >>>
> >>>
> >>> I have on my todo list to move all that !anon handling out of
> >>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
> >>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> >>> then (-> whole new folio exclusive).
> >>>
> >>> That's the cleaner approach.
> >>>
> >>
> >> one tricky thing is that sometimes it is hard to know who is the first
> >> one to add rmap and thus should
> >> call folio_add_new_anon_rmap.
> >> especially when we want to support swapin_readahead(), the one who
> >> allocated large filio might not
> >> be that one who firstly does rmap.
> >
> > I think Barry has a point. Two tasks might race to swap in the folio
> > then race to perform the rmap.
> > folio_add_new_anon_rmap() should only call a folio that is absolutely
> > "new", not shared. The sharing in swap cache disqualifies that
> > condition.
>
> We have to hold the folio lock. So only one task at a time might do the
> folio_add_anon_rmap_ptes() right now, and the
> folio_add_new_shared_anon_rmap() in the future [below].
>

Ah, I see. The folio_lock() is the answer I am looking for.

> Also observe how folio_add_anon_rmap_ptes() states that one must hold
> the page lock, because otherwise this would all be completely racy.
>
> From the pte swp exclusive flags, we know for sure whether we are
> dealing with exclusive vs. shared. I think patch #6 does not properly
> check that all entries are actually the same in that regard (all
> exclusive vs all shared). That likely needs fixing.
>
> [I have converting per-page PageAnonExclusive flags to a single
> per-folio flag on my todo list. I suspect that we'll keep the
> per-swp-pte exlusive bits, but the question is rather what we can
> actually make work, because swap and migration just make it much more
> complicated. Anyhow, future work]
>
> >
> >> is it an acceptable way to do the below in do_swap_page?
> >> if (!folio_test_anon(folio))
> >> folio_add_new_anon_rmap()
> >> else
> >> folio_add_anon_rmap_ptes()
> >
> > I am curious to know the answer as well.
>
>
> Yes, the end code should likely be something like:
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> folio_add_new_anon_rmap(folio, vma, vmf->address);
> folio_add_lru_vma(folio, vma);
> } else if (folio_test_anon(folio)) {
> folio_add_anon_rmap_ptes(rmap_flags)
> } else {
> folio_add_new_anon_rmap(rmap_flags)
> }
>
> Maybe we want to avoid teaching all existing folio_add_new_anon_rmap()
> callers about a new flag, and just have a new
> folio_add_new_shared_anon_rmap() instead. TBD.

More than one caller needs to perform that dance around
folio_test_anon() and then decide which function to call. It would be nice
to have a wrapper function folio_add_new_shared_anon_rmap() to
abstract this behavior.


>
> >
> > BTW, that test might have a race as well. By the time the task got
> > !anon result, this result might get changed by another task. We need
> > to make sure in the caller context this race can't happen. Otherwise
> > we can't do the above safely.
> Again, folio lock. Observe the folio_lock_or_retry() call that covers
> our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.

Ack. Thanks for the explanation.

Chris

2024-02-26 03:00:23

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE

Hi Chris,

Thanks for reviewing. Sorry for the late reply; I've had a lot to do
recently.

On Sat, Jan 27, 2024 at 12:14 PM Chris Li <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 3:11 AM Barry Song <[email protected]> wrote:
> >
> > From: Barry Song <[email protected]>
> >
> > Commit d0637c505f8a1 ("arm64: enable THP_SWAP for arm64") brings up
> > THP_SWAP on ARM64, but it doesn't enable THP_SWP on hardware with
> > MTE as the MTE code works with the assumption tags save/restore is
> > always handling a folio with only one page.
> >
> > The limitation should be removed as more and more ARM64 SoCs have
> > this feature. Co-existence of MTE and THP_SWAP becomes more and
> > more important.
> >
> > This patch makes MTE tags saving support large folios, then we don't
> > need to split large folios into base pages for swapping out on ARM64
> > SoCs with MTE any more.
> >
> > arch_prepare_to_swap() should take folio rather than page as parameter
> > because we support THP swap-out as a whole. It saves tags for all
> > pages in a large folio.
> >
> > As now we are restoring tags based-on folio, in arch_swap_restore(),
> > we may increase some extra loops and early-exitings while refaulting
> > a large folio which is still in swapcache in do_swap_page(). In case
> > a large folio has nr pages, do_swap_page() will only set the PTE of
> > the particular page which is causing the page fault.
> > Thus do_swap_page() runs nr times, and each time, arch_swap_restore()
> > will loop nr times for those subpages in the folio. So right now the
> > algorithmic complexity becomes O(nr^2).
> >
> > Once we support mapping large folios in do_swap_page(), extra loops
> > and early-exitings will decrease while not being completely removed
> > as a large folio might get partially tagged in corner cases such as,
> > 1. a large folio in swapcache can be partially unmapped, thus, MTE
> > tags for the unmapped pages will be invalidated;
> > 2. users might use mprotect() to set MTEs on a part of a large folio.
> >
> > arch_thp_swp_supported() is dropped since ARM64 MTE was the only one
> > who needed it.
> >
> > Reviewed-by: Steven Price <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > arch/arm64/include/asm/pgtable.h | 21 +++-------------
> > arch/arm64/mm/mteswap.c | 42 ++++++++++++++++++++++++++++++++
> > include/linux/huge_mm.h | 12 ---------
> > include/linux/pgtable.h | 2 +-
> > mm/page_io.c | 2 +-
> > mm/swap_slots.c | 2 +-
> > 6 files changed, 49 insertions(+), 32 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 79ce70fbb751..9902395ca426 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -45,12 +45,6 @@
> > __flush_tlb_range(vma, addr, end, PUD_SIZE, false, 1)
> > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > - return !system_supports_mte();
> > -}
> > -#define arch_thp_swp_supported arch_thp_swp_supported
> > -
> > /*
> > * Outside of a few very special situations (e.g. hibernation), we always
> > * use broadcast TLB invalidation instructions, therefore a spurious page
> > @@ -1042,12 +1036,8 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > #ifdef CONFIG_ARM64_MTE
> >
> > #define __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > -{
> > - if (system_supports_mte())
> > - return mte_save_tags(page);
> > - return 0;
> > -}
> > +#define arch_prepare_to_swap arch_prepare_to_swap
>
> This seems like a no-op, defining "arch_prepare_to_swap" back to itself.
> What am I missing?
>
> I see. Answering my own question: I guess you want to allow someone to
> override arch_prepare_to_swap.
> Wouldn't testing against __HAVE_ARCH_PREPARE_TO_SWAP be enough to support that?

You are right. I was blindly copying my previous code:

static inline bool arch_thp_swp_supported(void)
{
return !system_supports_mte();
}
#define arch_thp_swp_supported arch_thp_swp_supported

For arch_thp_swp_supported, there isn't a similar macro, so we depend on
arch_thp_swp_supported being defined as a macro for itself in include/linux/huge_mm.h:

/*
* archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
* limitations in the implementation like arm64 MTE can override this to
* false
*/
#ifndef arch_thp_swp_supported
static inline bool arch_thp_swp_supported(void)
{
return true;
}
#endif

Now the case is different, we do have __HAVE_ARCH_PREPARE_TO_SWAP
instead.

>
> Maybe I need to understand better how you want others to extend this
> code to make suggestions.
> As it is, this looks strange.
>
> > +extern int arch_prepare_to_swap(struct folio *folio);
> >
> > #define __HAVE_ARCH_SWAP_INVALIDATE
> > static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
> > @@ -1063,11 +1053,8 @@ static inline void arch_swap_invalidate_area(int type)
> > }
> >
> > #define __HAVE_ARCH_SWAP_RESTORE
> > -static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > -{
> > - if (system_supports_mte())
> > - mte_restore_tags(entry, &folio->page);
> > -}
> > +#define arch_swap_restore arch_swap_restore
>
> Same here.

You are right, again.

>
> > +extern void arch_swap_restore(swp_entry_t entry, struct folio *folio);
> >
> > #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > index a31833e3ddc5..b9ca1b35902f 100644
> > --- a/arch/arm64/mm/mteswap.c
> > +++ b/arch/arm64/mm/mteswap.c
> > @@ -68,6 +68,13 @@ void mte_invalidate_tags(int type, pgoff_t offset)
> > mte_free_tag_storage(tags);
> > }
> >
> > +static inline void __mte_invalidate_tags(struct page *page)
> > +{
> > + swp_entry_t entry = page_swap_entry(page);
> > +
> > + mte_invalidate_tags(swp_type(entry), swp_offset(entry));
> > +}
> > +
> > void mte_invalidate_tags_area(int type)
> > {
> > swp_entry_t entry = swp_entry(type, 0);
> > @@ -83,3 +90,38 @@ void mte_invalidate_tags_area(int type)
> > }
> > xa_unlock(&mte_pages);
> > }
> > +
> > +int arch_prepare_to_swap(struct folio *folio)
> > +{
> > + int err;
> > + long i;
> > +
> > + if (system_supports_mte()) {
> Very minor nitpick.
>
> You can do
> if (!system_supports_mte())
> return 0;
>
> Here and the for loop would have less indent. The function looks flatter.

I agree.

>
> > + long nr = folio_nr_pages(folio);
> > +
> > + for (i = 0; i < nr; i++) {
> > + err = mte_save_tags(folio_page(folio, i));
> > + if (err)
> > + goto out;
> > + }
> > + }
> > + return 0;
> > +
> > +out:
> > + while (i--)
> > + __mte_invalidate_tags(folio_page(folio, i));
> > + return err;
> > +}
> > +
> > +void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > +{
> > + if (system_supports_mte()) {
>
> Same here.
>
> Looks good otherwise. None of the nitpicks are deal breakers.
>
> Acked-by: Chris Li <[email protected]>

Thanks!

>
>
> Chris
>
> > + long i, nr = folio_nr_pages(folio);
> > +
> > + entry.val -= swp_offset(entry) & (nr - 1);
> > + for (i = 0; i < nr; i++) {
> > + mte_restore_tags(entry, folio_page(folio, i));
> > + entry.val++;
> > + }
> > + }
> > +}
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 5adb86af35fc..67219d2309dd 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -530,16 +530,4 @@ static inline int split_folio(struct folio *folio)
> > return split_folio_to_list(folio, NULL);
> > }
> >
> > -/*
> > - * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
> > - * limitations in the implementation like arm64 MTE can override this to
> > - * false
> > - */
> > -#ifndef arch_thp_swp_supported
> > -static inline bool arch_thp_swp_supported(void)
> > -{
> > - return true;
> > -}
> > -#endif
> > -
> > #endif /* _LINUX_HUGE_MM_H */
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index f6d0e3513948..37fe83b0c358 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -925,7 +925,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
> > * prototypes must be defined in the arch-specific asm/pgtable.h file.
> > */
> > #ifndef __HAVE_ARCH_PREPARE_TO_SWAP
> > -static inline int arch_prepare_to_swap(struct page *page)
> > +static inline int arch_prepare_to_swap(struct folio *folio)
> > {
> > return 0;
> > }
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index ae2b49055e43..a9a7c236aecc 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -189,7 +189,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> > * Arch code may have to preserve more data than just the page
> > * contents, e.g. memory tags.
> > */
> > - ret = arch_prepare_to_swap(&folio->page);
> > + ret = arch_prepare_to_swap(folio);
> > if (ret) {
> > folio_mark_dirty(folio);
> > folio_unlock(folio);
> > diff --git a/mm/swap_slots.c b/mm/swap_slots.c
> > index 0bec1f705f8e..2325adbb1f19 100644
> > --- a/mm/swap_slots.c
> > +++ b/mm/swap_slots.c
> > @@ -307,7 +307,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
> > entry.val = 0;
> >
> > if (folio_test_large(folio)) {
> > - if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
> > + if (IS_ENABLED(CONFIG_THP_SWAP))
> > get_swap_pages(1, &entry, folio_nr_pages(folio));
> > goto out;
> > }
> > --
> > 2.34.1
> >

Best Regards,
Barry

2024-02-26 04:47:25

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free()

Hi Chris,

Thanks!

On Sat, Jan 27, 2024 at 12:17 PM Chris Li <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 3:11 AM Barry Song <[email protected]> wrote:
> >
> > From: Chuanhua Han <[email protected]>
> >
> > While swapping in a large folio, we need to free swaps related to the whole
> > folio. To avoid frequently acquiring and releasing swap locks, it is better
> > to introduce an API for batched free.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/linux/swap.h | 6 ++++++
> > mm/swapfile.c | 29 +++++++++++++++++++++++++++++
> > 2 files changed, 35 insertions(+)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 4db00ddad261..31a4ee2dcd1c 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -478,6 +478,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> > extern int swap_duplicate(swp_entry_t);
> > extern int swapcache_prepare(swp_entry_t);
> > extern void swap_free(swp_entry_t);
> > +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> > extern void swapcache_free_entries(swp_entry_t *entries, int n);
> > extern int free_swap_and_cache(swp_entry_t);
> > int swap_type_of(dev_t device, sector_t offset);
> > @@ -553,6 +554,11 @@ static inline void swap_free(swp_entry_t swp)
> > {
> > }
> >
> > +void swap_nr_free(swp_entry_t entry, int nr_pages)
> > +{
> > +
> > +}
> > +
> > static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> > {
> > }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 556ff7347d5f..6321bda96b77 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1335,6 +1335,35 @@ void swap_free(swp_entry_t entry)
> > __swap_entry_free(p, entry);
> > }
> >
> > +void swap_nr_free(swp_entry_t entry, int nr_pages)
> > +{
> > + int i;
> > + struct swap_cluster_info *ci;
> > + struct swap_info_struct *p;
> > + unsigned type = swp_type(entry);
> > + unsigned long offset = swp_offset(entry);
> > + DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> > +
> > + VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
>
> The BUG_ON here seems a bit too developer-oriented. Maybe warn once and
> fall back to freeing one by one?

The function is used only for cases where we are quite sure we are freeing
contiguous swap entries within a cluster; if that is not the case,
we would need an array of entries[]. Would people be more comfortable with
a WARN_ON instead? The problem is that if this really happens,
it is a bug, and a WARN isn't enough.
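
For comparison, the warn-and-fall-back variant Chris suggests would look
roughly like this (a sketch only, reusing the locals from the patch above):

	if (WARN_ON_ONCE(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER)) {
		/* not a contiguous range within one cluster: free one by one */
		for (i = 0; i < nr_pages; i++)
			swap_free(swp_entry(type, offset + i));
		return;
	}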

>
> How big are SWAPFILE_CLUSTER and nr_pages typically on arm?

My case is SWAPFILE_CLUSTER = HPAGE_PMD_NR = 2MB/4KB = 512.

>
> I ask this question because if nr_pages > 64, that is a totally
> different game: we can completely bypass the swap slot cache.
>

I agree we have a chance to bypass the slot cache if nr_pages is bigger than
SWAP_SLOTS_CACHE_SIZE. On the other hand, even when nr_pages <
64, we still have a good chance to optimize free_swap_slot() by batching,
as there are many spin_lock and sort() operations for each single entry.


> > +
> > + if (nr_pages == 1) {
> > + swap_free(entry);
> > + return;
> > + }
> > +
> > + p = _swap_info_get(entry);
> > +
> > + ci = lock_cluster(p, offset);
> > + for (i = 0; i < nr_pages; i++) {
> > + if (__swap_entry_free_locked(p, offset + i, 1))
> > + __bitmap_set(usage, i, 1);
> > + }
> > + unlock_cluster(ci);
> > +
> > + for_each_clear_bit(i, usage, nr_pages)
> > + free_swap_slot(swp_entry(type, offset + i));
>
> Notice that free_swap_slot() internally has per-CPU cache batching as
> well. Every free_swap_slot() will take some per-CPU swap slot cache and
> its cache->lock. There is double batching here.
> If the typical batch size here is bigger than 64 entries, we can go
> directly to batching swap_entry_free and avoid the free_swap_slot()
> batching altogether. Unlike free_swap_slot_entries(), here the swap slots
> are all from one swap device, so there is no need to sort and group the
> swap slots by swap device.

I agree. You are completely right!
However, to keep the patchset smaller at the beginning, I prefer
to defer these optimizations to a separate patchset after this one.

>
> Chris
>
>
> Chris
>
> > +}
> > +
> > /*
> > * Called after dropping swapcache to decrease refcnt to swap entries.
> > */
> > --
> > 2.34.1

Thanks
Barry

2024-02-26 05:06:05

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On Tue, Jan 30, 2024 at 5:32 AM Chris Li <[email protected]> wrote:
>
> On Mon, Jan 29, 2024 at 2:07 AM David Hildenbrand <[email protected]> wrote:
> >
> > On 29.01.24 04:25, Chris Li wrote:
> > > Hi David and Barry,
> > >
> > > On Mon, Jan 22, 2024 at 10:49 PM Barry Song <[email protected]> wrote:
> > >>
> > >>>
> > >>>
> > >>> I have on my todo list to move all that !anon handling out of
> > >>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
> > >>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> > >>> then (-> whole new folio exclusive).
> > >>>
> > >>> That's the cleaner approach.
> > >>>
> > >>
> > >> one tricky thing is that sometimes it is hard to know who is the first
> > >> one to add rmap and thus should
> > >> call folio_add_new_anon_rmap.
> > >> especially when we want to support swapin_readahead(), the one who
> > >> allocated large filio might not
> > >> be that one who firstly does rmap.
> > >
> > > I think Barry has a point. Two tasks might race to swap in the folio
> > > then race to perform the rmap.
> > > folio_add_new_anon_rmap() should only call a folio that is absolutely
> > > "new", not shared. The sharing in swap cache disqualifies that
> > > condition.
> >
> > We have to hold the folio lock. So only one task at a time might do the
> > folio_add_anon_rmap_ptes() right now, and the
> > folio_add_new_shared_anon_rmap() in the future [below].
> >
>
> Ah, I see. The folio_lock() is the answer I am looking for.
>
> > Also observe how folio_add_anon_rmap_ptes() states that one must hold
> > the page lock, because otherwise this would all be completely racy.
> >
> > From the pte swp exclusive flags, we know for sure whether we are
> > dealing with exclusive vs. shared. I think patch #6 does not properly
> > check that all entries are actually the same in that regard (all
> > exclusive vs all shared). That likely needs fixing.
> >
> > [I have converting per-page PageAnonExclusive flags to a single
> > per-folio flag on my todo list. I suspect that we'll keep the
> > per-swp-pte exlusive bits, but the question is rather what we can
> > actually make work, because swap and migration just make it much more
> > complicated. Anyhow, future work]
> >
> > >
> > >> is it an acceptable way to do the below in do_swap_page?
> > >> if (!folio_test_anon(folio))
> > >> folio_add_new_anon_rmap()
> > >> else
> > >> folio_add_anon_rmap_ptes()
> > >
> > > I am curious to know the answer as well.
> >
> >
> > Yes, the end code should likely be something like:
> >
> > /* ksm created a completely new copy */
> > if (unlikely(folio != swapcache && swapcache)) {
> > folio_add_new_anon_rmap(folio, vma, vmf->address);
> > folio_add_lru_vma(folio, vma);
> > } else if (folio_test_anon(folio)) {
> > folio_add_anon_rmap_ptes(rmap_flags)
> > } else {
> > folio_add_new_anon_rmap(rmap_flags)
> > }
> >
> > Maybe we want to avoid teaching all existing folio_add_new_anon_rmap()
> > callers about a new flag, and just have a new
> > folio_add_new_shared_anon_rmap() instead. TBD.

If we have to add a wrapper like folio_add_new_shared_anon_rmap()
to avoid "if (folio_test_anon(folio))" and "else" everywhere, why don't
we just do it in folio_add_anon_rmap_ptes()?

folio_add_anon_rmap_ptes()
{
if (!folio_test_anon(folio))
return folio_add_new_anon_rmap();
}

Anyway, I am going to change patch 4/6 to the if(folio_test_anon)/else form first
and drop this 5/6.
We can figure out whether we need a wrapper later.

>
> There is more than one caller needed to perform that dance around
> folio_test_anon() then decide which function to call. It would be nice
> to have a wrapper function folio_add_new_shared_anon_rmap() to
> abstract this behavior.
>
>
> >
> > >
> > > BTW, that test might have a race as well. By the time the task got
> > > !anon result, this result might get changed by another task. We need
> > > to make sure in the caller context this race can't happen. Otherwise
> > > we can't do the above safely.
> > Again, folio lock. Observe the folio_lock_or_retry() call that covers
> > our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.
>
> Ack. Thanks for the explanation.
>
> Chris

Thanks
Barry

2024-02-26 06:40:13

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

On Mon, Jan 29, 2024 at 3:15 PM Chris Li <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
> >
> > From: Chuanhua Han <[email protected]>
> >
> > MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> > supported swapping large folios out as a whole for vmscan case. This patch
> > extends the feature to madvise.
> >
> > If madvised range covers the whole large folio, we don't split it. Otherwise,
> > we still need to split it.
> >
> > This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> > helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> > mapped to a large folio.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/asm-generic/tlb.h | 10 +++++++
> > include/linux/pgtable.h | 60 +++++++++++++++++++++++++++++++++++++++
> > mm/madvise.c | 48 +++++++++++++++++++++++++++++++
> > 3 files changed, 118 insertions(+)
> >
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index 129a3a759976..f894e22da5d6 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> > __tlb_remove_tlb_entry(tlb, ptep, address); \
> > } while (0)
> >
> > +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr) \
> > + do { \
> > + int i; \
> > + tlb_flush_pte_range(tlb, address, \
> > + PAGE_SIZE * nr); \
> > + for (i = 0; i < nr; i++) \
> > + __tlb_remove_tlb_entry(tlb, ptep + i, \
> > + address + i * PAGE_SIZE); \
> > + } while (0)
> > +
> > #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
> > do { \
> > unsigned long _sz = huge_page_size(h); \
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 37fe83b0c358..da0c1cf447e3 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> > }
> > #endif
> >
> > +#ifndef pte_range_cont_mapped
> > +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> > + pte_t *start_pte,
> > + unsigned long start_addr,
> > + int nr)
> > +{
> > + int i;
> > + pte_t pte_val;
> > +
> > + for (i = 0; i < nr; i++) {
> > + pte_val = ptep_get(start_pte + i);
> > +
> > + if (pte_none(pte_val))
> > + return false;
>
> Hmm, the following check pte_pfn == start_pfn + i should have covered
> the pte none case?
>
> I think the pte_none means it can't have a valid pfn. So this check
> can be skipped?

Yes, the pte_pfn == start_pfn + i check should have covered the pte_none
case, but leaving pte_none there seems to make the code more
readable. I guess we need to check pte_present() too: there is a small chance
that a swp_offset can equal a pte_pfn after some shifting, in case a PTE
within the large folio range has become a swap entry?

I am still thinking about whether we have a cheaper way to check if a folio
is still entirely mapped, maybe something like
if (list_empty(&folio->_deferred_list))?
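
A minimal sketch of that stricter loop, using pte_present() so that both
the none and the swap/migration-entry cases are rejected (names follow the
quoted patch):

	for (i = 0; i < nr; i++) {
		pte_t pte_val = ptep_get(start_pte + i);

		/* rejects pte_none and swap/migration entries in one test */
		if (!pte_present(pte_val))
			return false;
		if (pte_pfn(pte_val) != start_pfn + i)
			return false;
	}
	return true;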

>
> > +
> > + if (pte_pfn(pte_val) != (start_pfn + i))
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +#endif
> > +
> > +#ifndef pte_range_young
> > +static inline bool pte_range_young(pte_t *start_pte, int nr)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr; i++)
> > + if (pte_young(ptep_get(start_pte + i)))
> > + return true;
> > +
> > + return false;
> > +}
> > +#endif
> > +
> > #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> > static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > unsigned long address,
> > @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> > }
> > #endif
> >
> > +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> > +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> > + unsigned long start_addr,
> > + pte_t *start_pte,
> > + int nr, int full)
> > +{
> > + int i;
> > + pte_t pte;
> > +
> > + pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> > +
> > + for (i = 1; i < nr; i++)
> > + ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> > + start_pte + i, full);
> > +
> > + return pte;
> > +}
> >
> > /*
> > * If two threads concurrently fault at the same page, the thread that
> > @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > })
> > #endif
> >
> > +#ifndef pte_nr_addr_end
> > +#define pte_nr_addr_end(addr, size, end) \
> > +({ unsigned long __boundary = ((addr) + size) & (~(size - 1)); \
> > + (__boundary - 1 < (end) - 1)? __boundary: (end); \
> > +})
> > +#endif
> > +
> > /*
> > * When walking page tables, we usually want to skip any p?d_none entries;
> > * and any p?d_bad entries - reporting the error before resetting to none.
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..262460ac4b2e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > if (folio_test_large(folio)) {
> > int err;
> >
> > + if (!folio_test_pmd_mappable(folio)) {
>
> This session of code indent into the right too much.
> You can do:
>
> if (folio_test_pmd_mappable(folio))
> goto split;
>
> to make the code flatter.

I guess we don't need "if (!folio_test_pmd_mappable(folio))" at all,
as the pmd case has already been handled at the very beginning of
madvise_cold_or_pageout_pte_range().

>
> > + int nr_pages = folio_nr_pages(folio);
> > + unsigned long folio_size = PAGE_SIZE * nr_pages;
> > + unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;
> > + unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> > + pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
> > + unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> > +
> > + if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> > + goto split;
> > +
> > + if (next - addr != folio_size) {
>
> Nitpick: One line statement does not need {
>
> > + goto split;
> > + } else {
>
> When the previous if statement already "goto split", there is no need
> for the else. You can save one level of indentation.

right!

>
>
>
> > + /* Do not interfere with other mappings of this page */
> > + if (folio_estimated_sharers(folio) != 1)
> > + goto skip;
> > +
> > + VM_BUG_ON(addr != start_addr || pte != start_pte);
> > +
> > + if (pte_range_young(start_pte, nr_pages)) {
> > + ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> > + nr_pages, tlb->fullmm);
> > + ptent = pte_mkold(ptent);
> > +
> > + set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> > + tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> > + }
> > +
> > + folio_clear_referenced(folio);
> > + folio_test_clear_young(folio);
> > + if (pageout) {
> > + if (folio_isolate_lru(folio)) {
> > + if (folio_test_unevictable(folio))
> > + folio_putback_lru(folio);
> > + else
> > + list_add(&folio->lru, &folio_list);
> > + }
> > + } else
> > + folio_deactivate(folio);
>
> I notice this section is very similar to the earlier statements inside
> the same function.
> "if (pmd_trans_huge(*pmd)) {"
>
> Wondering if there is some way to unify the two a bit somehow.

We have duplicated the code three times - pmd, pte-mapped large, normal folio.
Let me see if we can extract a common function.

>
> Also notice if you test the else condition first,
>
> If (!pageout) {
> folio_deactivate(folio);
> goto skip;
> }
>
> You can save one level of indentation.
> Not your fault, I notice the section inside (pmd_trans_huge(*pmd))
> does exactly the same thing.
>

We can address this issue once we have a common function.
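
To make that concrete, a possible shape for such a common helper (the name
and exact signature are hypothetical; it only uses calls already present in
the patch) could be:

static void madvise_folio_deactivate_or_pageout(struct folio *folio,
						struct list_head *folio_list,
						bool pageout)
{
	folio_clear_referenced(folio);
	folio_test_clear_young(folio);
	if (!pageout) {
		folio_deactivate(folio);
		return;
	}
	if (folio_isolate_lru(folio)) {
		if (folio_test_unevictable(folio))
			folio_putback_lru(folio);
		else
			list_add(&folio->lru, folio_list);
	}
}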

> Chris
>
>
> > + }
> > +skip:
> > + pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> > + addr = next - PAGE_SIZE;
> > + continue;
> > +
> > + }
> > +split:
> > if (folio_estimated_sharers(folio) != 1)
> > break;
> > if (pageout_anon_only_filter && !folio_test_anon(folio))
> > --
> > 2.34.1
> >
> >

Thanks
Barry

2024-02-26 07:29:56

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole

On Sun, Jan 28, 2024 at 8:53 AM Chris Li <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
> >
> > From: Chuanhua Han <[email protected]>
> >
> > On an embedded system like Android, more than half of anon memory is actually
> > in swap devices such as zRAM. For example, while an app is switched to back-
> > ground, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> > to those contiguous swaps which were likely swapped out from mTHP as a whole.
> >
> > On the other hand, the current implementation only covers the SWAP_SYCHRONOUS
> > case. It doesn't support swapin_readahead as large folios yet.
> >
> > Right now, we are re-faulting large folios which are still in swapcache as a
> > whole, this can effectively decrease extra loops and early-exitings which we
> > have increased in arch_swap_restore() while supporting MTE restore for folios
> > rather than page.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
> > 1 file changed, 94 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f61a48929ba7..928b3f542932 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
> > static vm_fault_t do_fault(struct vm_fault *vmf);
> > static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
> > static bool vmf_pte_changed(struct vm_fault *vmf);
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > + bool (*pte_range_check)(pte_t *, int));
>
> Instead of returning "bool", the pte_range_check() can return the
> start of the swap entry of the large folio.
> That will save some of the later code needed to get the start of the
> large folio.

I am trying to reuse alloc_anon_folio() for both do_anonymous_page() and
do_swap_page(). Unfortunately, this function returns a folio, and there is no
place to return a swap entry unless we add a parameter. Getting the
start swap entry is quite cheap, on the other hand.

>
> >
> > /*
> > * Return true if the original pte was a uffd-wp pte marker (so the pte was
> > @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> > return VM_FAULT_SIGBUS;
> > }
> >
> > +static bool pte_range_swap(pte_t *pte, int nr_pages)
>
> This function name seems to suggest it will perform the range swap.
> That is not what it is doing.
> Suggest change to some other name reflecting that it is only a
> condition test without actual swap action.
> I am not very good at naming functions. Just thinking out loud: e.g.
> pte_range_swap_check or pte_test_range_swap. You can come up with
> something better.

Ryan has a function named pte_range_none, which checks that the whole
range is pte_none. Maybe we can have an is_pte_range_contig_swap,
whose name covers both "swap" and "contiguous", as we only need contiguous
swap entries.

>
>
> > +{
> > + int i;
> > + swp_entry_t entry;
> > + unsigned type;
> > + pgoff_t start_offset;
> > +
> > + entry = pte_to_swp_entry(ptep_get_lockless(pte));
> > + if (non_swap_entry(entry))
> > + return false;
> > + start_offset = swp_offset(entry);
> > + if (start_offset % nr_pages)
> > + return false;
>
> This suggests the pte argument needs to point to the beginning of the
> large-folio equivalent of the swap entry (not sure what to call it; let me
> call it "large folio swap" here).
> We might want to unify the terms for that.
> Anyway, we might want to document this requirement, otherwise the caller
> might consider passing the current pte that generated the fault. From
> the function name it is not obvious which pte should be passed.

OK. Ryan's swap-out will allocate swap entries whose start offset is
aligned to nr_pages. I will add some documentation to describe the first entry.
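
The documentation could be something along these lines (the wording, and
whatever final name the helper ends up with, are only suggestions):

/*
 * Check that the @nr_pages swap PTEs starting at @pte are contiguous:
 * all of them are swap entries of the same swap type with consecutive
 * offsets.
 *
 * @pte must point to the PTE covering the first page of the candidate
 * large folio; its swap offset is expected to be aligned to @nr_pages,
 * matching how swap slots are allocated for large folios at swap-out.
 */
static bool pte_range_swap(pte_t *pte, int nr_pages);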

>
> > +
> > + type = swp_type(entry);
> > + for (i = 1; i < nr_pages; i++) {
>
> You might want to test from the last page backwards, because if the entry
> is not a large folio swap, most likely the last entry will be
> invalid. Some of the beginning swap entries might match due to batch
> allocation etc. The SSD likes to group nearby swap entry writes
> together on the disk.

I am not sure I got your point. This checks all pages within
the range of a large folio; Ryan's patch allocates swap entries
all together as a whole for a large folio while swapping out:

@@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t
swp_entries[], int entry_size)
spin_unlock(&si->lock);
goto nextsi;
}
- if (size == SWAPFILE_CLUSTER) {
- if (si->flags & SWP_BLKDEV)
- n_ret = swap_alloc_cluster(si, swp_entries);
+ if (size > 1) {
+ n_ret = swap_alloc_large(si, swp_entries, size);
} else
n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
n_goal, swp_entries);


>
>
>
> > + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
>
> > + if (non_swap_entry(entry))
> > + return false;
> > + if (swp_offset(entry) != start_offset + i)
> > + return false;
> > + if (swp_type(entry) != type)
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +
> > /*
> > * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > + int nr_pages = 1;
> > + unsigned long start_address;
> > + pte_t *start_pte;
> >
> > if (!pte_unmap_same(vmf))
> > goto out;
> > @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > __swap_count(entry) == 1) {
> > /* skip swapcache */
> > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > - vma, vmf->address, false);
> > + folio = alloc_anon_folio(vmf, pte_range_swap);
>
> This function can call pte_range_swap() twice: one here, another one
> in folio_test_large().
> Consider caching the result so it does not need to walk the pte range
> swap twice.
>
> I think alloc_anon_folio should either be told what is the
> size(prefered) or just figure out the right size. I don't think it
> needs to pass in the checking function as function callbacks. There
> are two call sites of alloc_anon_folio, they are all within this
> function. The call back seems a bit overkill here. Also duplicate the
> range swap walk.

alloc_anon_folio() here reuses the helper from do_anonymous_page(); in
both cases we scan the PTEs to figure out the proper size. The other
call site is within do_anonymous_page().

>
> > page = &folio->page;
> > if (folio) {
> > __folio_set_locked(folio);
> > __folio_set_swapbacked(folio);
> >
> > + if (folio_test_large(folio)) {
> > + unsigned long start_offset;
> > +
> > + nr_pages = folio_nr_pages(folio);
> > + start_offset = swp_offset(entry) & ~(nr_pages - 1);
> Here is the first place we roll up the start offset with the folio size.
>
> > + entry = swp_entry(swp_type(entry), start_offset);
> > + }
> > +
> > if (mem_cgroup_swapin_charge_folio(folio,
> > vma->vm_mm, GFP_KERNEL,
> > entry)) {
> > @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> > &vmf->ptl);
> > +
> > + start_address = vmf->address;
> > + start_pte = vmf->pte;
> > + if (folio_test_large(folio)) {
> > + unsigned long nr = folio_nr_pages(folio);
> > + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> > + pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
>
> Here is the second place we roll up the folio size.
> Maybe we can cache results and avoid repetition?

We have two paths getting into large folios here:
1. we allocate a new large folio;
2. we find a large folio in the swapcache.

We have already rolled up the folio size for case 1 above, but here we
also need to take care of case 2, which is why we need both. Let me
think about whether there is some way to remove the redundant code for
case 1.
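
One option might be a small helper shared by both places, something like
the below (untested sketch; the helper name is made up):

static void folio_swap_range(struct vm_fault *vmf, struct folio *folio,
			     unsigned long *start_addr, pte_t **start_pte)
{
	unsigned long nr = folio_nr_pages(folio);
	unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);

	/* roll the faulting address/pte back to the start of the folio */
	*start_addr = addr;
	*start_pte = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
}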

>
> > +
> > + /*
> > + * case 1: we are allocating large_folio, try to map it as a whole
> > + * iff the swap entries are still entirely mapped;
> > + * case 2: we hit a large folio in swapcache, and all swap entries
> > + * are still entirely mapped, try to map a large folio as a whole.
> > + * otherwise, map only the faulting page within the large folio
> > + * which is swapcache
> > + */
>
> One question I have in mind: the swap device is locked, so we
> can't change the swap slot allocations.
> But that does not stop the pte entries from getting changed, right? Then
> we could have someone in user space racing to change the PTE while we
> are checking it here.
>
> > + if (pte_range_swap(pte_t, nr)) {
>
> After this pte_range_swap() check, some of the PTE entries get changed
> and now we don't have the full large page swap any more?
> At least I can't conclude this possibility can't happen yet, please
> enlighten me.

This check is done under the PTL; nobody else can change the PTEs,
because they would have to hold the PTL to do so:
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
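
To spell out the window (a simplified sketch of the ordering in the
patch above, not new code):

	/* PTL taken: from here on the PTEs cannot change under us */
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
				       &vmf->ptl);
	if (folio_test_large(folio) && !pte_range_swap(start_pte, nr))
		goto out_nomap;		/* somebody raced before we locked */
	/* ... build the new pte and install it while still holding PTL ... */
	set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
	pte_unmap_unlock(vmf->pte, vmf->ptl);	/* PTL released */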


>
> > + start_address = addr;
> > + start_pte = pte_t;
> > + if (unlikely(folio == swapcache)) {
> > + /*
> > + * the below has been done before swap_read_folio()
> > + * for case 1
> > + */
> > + nr_pages = nr;
> > + entry = pte_to_swp_entry(ptep_get(start_pte));
>
> If we make pte_range_swap() return the entry, we can avoid refetching
> the swap entry here.

We would have to add a parameter such as swp_entry_t *first_entry to
return the entry. The difficulty is that we would also have to add this
parameter to alloc_anon_folio(), which is a bit of overkill for that
function.
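
For reference, just to show the extra plumbing (hypothetical signature,
not something I am proposing):

static struct folio *alloc_anon_folio(struct vm_fault *vmf,
				      bool (*pte_range_check)(pte_t *, int),
				      swp_entry_t *first_entry);

/* and do_anonymous_page() would then have to pass a dummy argument: */
folio = alloc_anon_folio(vmf, pte_range_none, NULL);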


>
> > + page = &folio->page;
> > + }
> > + } else if (nr_pages > 1) { /* ptes have changed for case 1 */
> > + goto out_nomap;
> > + }
> > + }
> > +
> I rewrote the above to make the code indentation match the execution flow.
> I did not add any functional change, just rearranged the code to be a
> bit more streamlined and got rid of the "else if goto".
> if (!pte_range_swap(pte_t, nr)) {
> if (nr_pages > 1) /* ptes have changed for case 1 */
> goto out_nomap;
> goto check_pte;
> }
>
> start_address = addr;
> start_pte = pte_t;
> if (unlikely(folio == swapcache)) {
> /*
> * the below has been done before swap_read_folio()
> * for case 1
> */
> nr_pages = nr;
> entry = pte_to_swp_entry(ptep_get(start_pte));
> page = &folio->page;
> }
> }

looks good to me.

>
> check_pte:
>
> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > goto out_nomap;
> >
> > @@ -4047,12 +4120,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * We're already holding a reference on the page but haven't mapped it
> > * yet.
> > */
> > - swap_free(entry);
> > + swap_nr_free(entry, nr_pages);
> > if (should_try_to_free_swap(folio, vma, vmf->flags))
> > folio_free_swap(folio);
> >
> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > + folio_ref_add(folio, nr_pages - 1);
> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +
> > pte = mk_pte(page, vma->vm_page_prot);
> >
> > /*
> > @@ -4062,14 +4137,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * exclusivity.
> > */
> > if (!folio_test_ksm(folio) &&
> > - (exclusive || folio_ref_count(folio) == 1)) {
> > + (exclusive || folio_ref_count(folio) == nr_pages)) {
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > vmf->flags &= ~FAULT_FLAG_WRITE;
> > }
> > rmap_flags |= RMAP_EXCLUSIVE;
> > }
> > - flush_icache_page(vma, page);
> > + flush_icache_pages(vma, page, nr_pages);
> > if (pte_swp_soft_dirty(vmf->orig_pte))
> > pte = pte_mksoft_dirty(pte);
> > if (pte_swp_uffd_wp(vmf->orig_pte))
> > @@ -4081,14 +4156,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > folio_add_new_anon_rmap(folio, vma, vmf->address);
> > folio_add_lru_vma(folio, vma);
> > } else {
> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> > rmap_flags);
> > }
> >
> > VM_BUG_ON(!folio_test_anon(folio) ||
> > (pte_write(pte) && !PageAnonExclusive(page)));
> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > +
> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, vmf->orig_pte);
> >
> > folio_unlock(folio);
> > if (folio != swapcache && swapcache) {
> > @@ -4105,6 +4181,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > + if (folio_test_large(folio) && nr_pages > 1)
> > + vmf->orig_pte = ptep_get(vmf->pte);
> > +
> > ret |= do_wp_page(vmf);
> > if (ret & VM_FAULT_ERROR)
> > ret &= VM_FAULT_ERROR;
> > @@ -4112,7 +4191,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > /* No need to invalidate - it was non-present before */
> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> > unlock:
> > if (vmf->pte)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > @@ -4148,7 +4227,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
> > return true;
> > }
> >
> > -static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > + bool (*pte_range_check)(pte_t *, int))
> > {
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > struct vm_area_struct *vma = vmf->vma;
> > @@ -4190,7 +4270,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>
> About this patch context we have the following comments in the source code.
> /*
> * Find the highest order where the aligned range is completely
> * pte_none(). Note that all remaining orders will be completely
> * pte_none().
> */
> > order = highest_order(orders);
> > while (orders) {
> > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > - if (pte_range_none(pte + pte_index(addr), 1 << order))
> > + if (pte_range_check(pte + pte_index(addr), 1 << order))
>
> Again, I don't think we need to pass in pte_range_check() as a
> callback function.
> There are only two call sites, both within this file. This will totally
> invalidate the above comment about pte_none(). In the worst case, just
> make it accept one argument saying whether it is checking a swap range
> or a none range, and check accordingly.
> We should make the range check blend in with alloc_anon_folio() better.
> My gut feeling is that there should be a nicer way to do that, e.g.
> maybe store some of the large swap context in the vmf and pass it to
> the different places, etc. I need to spend more time thinking about it
> to come up with a happier solution.

We could pass a type to select between pte_range_none() and
pte_range_swap(). I'd like to avoid stashing state in something global
like vmf: people would have to read two or more functions to understand
what is going on, and even though the second function could use the
value the first one stored in vmf, it introduces more coupling.
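
For example, roughly like the below (untested sketch; the enum and its
names are made up, and only the inner loop of alloc_anon_folio() is
shown):

enum anon_folio_check {
	ANON_CHECK_NONE,	/* do_anonymous_page(): range must be pte_none() */
	ANON_CHECK_SWAP,	/* do_swap_page(): range must be contiguous swap */
};

	/* inside alloc_anon_folio(vmf, check) */
	order = highest_order(orders);
	while (orders) {
		bool ok;

		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
		ok = (check == ANON_CHECK_SWAP) ?
			pte_range_swap(pte + pte_index(addr), 1 << order) :
			pte_range_none(pte + pte_index(addr), 1 << order);
		if (ok)
			break;
		order = next_order(&orders, order);
	}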

>
> Chris
>
> > break;
> > order = next_order(&orders, order);
> > }
> > @@ -4269,7 +4349,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > if (unlikely(anon_vma_prepare(vma)))
> > goto oom;
> > /* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
> > - folio = alloc_anon_folio(vmf);
> > + folio = alloc_anon_folio(vmf, pte_range_none);
> > if (IS_ERR(folio))
> > return 0;
> > if (!folio)
> > --
> > 2.34.1
> >

Thanks
Barry

2024-02-26 07:40:09

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 4/6] mm: support large folios swapin as a whole

On Sun, Jan 28, 2024 at 9:06 AM Chris Li <[email protected]> wrote:
>
> On Thu, Jan 18, 2024 at 3:12 AM Barry Song <[email protected]> wrote:
> >
> > From: Chuanhua Han <[email protected]>
> >
> > On an embedded system like Android, more than half of anon memory is actually
> > in swap devices such as zRAM. For example, while an app is switched to back-
> > ground, its most memory might be swapped-out.
> >
> > Now we have mTHP features, unfortunately, if we don't support large folios
> > swap-in, once those large folios are swapped-out, we immediately lose the
> > performance gain we can get through large folios and hardware optimization
> > such as CONT-PTE.
> >
> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> > to those contiguous swaps which were likely swapped out from mTHP as a whole.
> >
> > On the other hand, the current implementation only covers the SWAP_SYNCHRONOUS
> > case. It doesn't support swapin_readahead as large folios yet.
> >
> > Right now, we are re-faulting large folios which are still in swapcache as a
> > whole, this can effectively decrease extra loops and early-exitings which we
> > have increased in arch_swap_restore() while supporting MTE restore for folios
> > rather than page.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/memory.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------
> > 1 file changed, 94 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f61a48929ba7..928b3f542932 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -107,6 +107,8 @@ EXPORT_SYMBOL(mem_map);
> > static vm_fault_t do_fault(struct vm_fault *vmf);
> > static vm_fault_t do_anonymous_page(struct vm_fault *vmf);
> > static bool vmf_pte_changed(struct vm_fault *vmf);
> > +static struct folio *alloc_anon_folio(struct vm_fault *vmf,
> > + bool (*pte_range_check)(pte_t *, int));
> >
> > /*
> > * Return true if the original pte was a uffd-wp pte marker (so the pte was
> > @@ -3784,6 +3786,34 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> > return VM_FAULT_SIGBUS;
> > }
> >
> > +static bool pte_range_swap(pte_t *pte, int nr_pages)
> > +{
> > + int i;
> > + swp_entry_t entry;
> > + unsigned type;
> > + pgoff_t start_offset;
> > +
> > + entry = pte_to_swp_entry(ptep_get_lockless(pte));
> > + if (non_swap_entry(entry))
> > + return false;
> > + start_offset = swp_offset(entry);
> > + if (start_offset % nr_pages)
> > + return false;
> > +
> > + type = swp_type(entry);
> > + for (i = 1; i < nr_pages; i++) {
> > + entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
> > + if (non_swap_entry(entry))
> > + return false;
> > + if (swp_offset(entry) != start_offset + i)
> > + return false;
> > + if (swp_type(entry) != type)
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +
> > /*
> > * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -3804,6 +3834,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > + int nr_pages = 1;
> > + unsigned long start_address;
> > + pte_t *start_pte;
> >
> > if (!pte_unmap_same(vmf))
> > goto out;
> > @@ -3868,13 +3901,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > __swap_count(entry) == 1) {
> > /* skip swapcache */
> > - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > - vma, vmf->address, false);
> > + folio = alloc_anon_folio(vmf, pte_range_swap);
> > page = &folio->page;
> > if (folio) {
> > __folio_set_locked(folio);
> > __folio_set_swapbacked(folio);
> >
> > + if (folio_test_large(folio)) {
> > + unsigned long start_offset;
> > +
> > + nr_pages = folio_nr_pages(folio);
> > + start_offset = swp_offset(entry) & ~(nr_pages - 1);
> > + entry = swp_entry(swp_type(entry), start_offset);
> > + }
> > +
> > if (mem_cgroup_swapin_charge_folio(folio,
> > vma->vm_mm, GFP_KERNEL,
> > entry)) {
> > @@ -3980,6 +4020,39 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> > &vmf->ptl);
> > +
> > + start_address = vmf->address;
> > + start_pte = vmf->pte;
> > + if (folio_test_large(folio)) {
> > + unsigned long nr = folio_nr_pages(folio);
> > + unsigned long addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
> > + pte_t *pte_t = vmf->pte - (vmf->address - addr) / PAGE_SIZE;
>
> I forgot about one comment here.
> Please change the variable name to something other than "pte_t"; it is
> a bit strange to use the typedef name as a variable name here.
>

Makes sense!

> Chris

Thanks
Barry

2024-02-27 12:26:28

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

Hi Barry,

I've scanned through this patch as part of trying to understand the races you
have reported (It's going to take me a while to fully understand it all :) ). In
the meantime I have a few comments on this patch...

On 18/01/2024 11:10, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> supported swapping large folios out as a whole for vmscan case. This patch
> extends the feature to madvise.
>
> If madvised range covers the whole large folio, we don't split it. Otherwise,
> we still need to split it.
>
> This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> mapped to a large folio.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/asm-generic/tlb.h | 10 +++++++
> include/linux/pgtable.h | 60 +++++++++++++++++++++++++++++++++++++++
> mm/madvise.c | 48 +++++++++++++++++++++++++++++++
> 3 files changed, 118 insertions(+)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 129a3a759976..f894e22da5d6 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> __tlb_remove_tlb_entry(tlb, ptep, address); \
> } while (0)
>
> +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr) \
> + do { \
> + int i; \
> + tlb_flush_pte_range(tlb, address, \
> + PAGE_SIZE * nr); \
> + for (i = 0; i < nr; i++) \
> + __tlb_remove_tlb_entry(tlb, ptep + i, \
> + address + i * PAGE_SIZE); \
> + } while (0)

David has recently added tlb_remove_tlb_entries() which does the same thing.

> +
> #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
> do { \
> unsigned long _sz = huge_page_size(h); \
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 37fe83b0c358..da0c1cf447e3 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> }
> #endif
>
> +#ifndef pte_range_cont_mapped
> +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> + pte_t *start_pte,
> + unsigned long start_addr,
> + int nr)
> +{
> + int i;
> + pte_t pte_val;
> +
> + for (i = 0; i < nr; i++) {
> + pte_val = ptep_get(start_pte + i);
> +
> + if (pte_none(pte_val))
> + return false;
> +
> + if (pte_pfn(pte_val) != (start_pfn + i))
> + return false;
> + }
> +
> + return true;
> +}
> +#endif

David has recently added folio_pte_batch() which does a similar thing (as
discussed in other context).

> +
> +#ifndef pte_range_young
> +static inline bool pte_range_young(pte_t *start_pte, int nr)
> +{
> + int i;
> +
> + for (i = 0; i < nr; i++)
> + if (pte_young(ptep_get(start_pte + i)))
> + return true;
> +
> + return false;
> +}
> +#endif

I wonder if this should come from folio_pte_batch()?

> +
> #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long address,
> @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> }
> #endif
>
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> + unsigned long start_addr,
> + pte_t *start_pte,
> + int nr, int full)
> +{
> + int i;
> + pte_t pte;
> +
> + pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> +
> + for (i = 1; i < nr; i++)
> + ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> + start_pte + i, full);
> +
> + return pte;
> +}

David has recently added get_and_clear_full_ptes(). Your version isn't gathering
access/dirty, which may be ok for your case, but not ok in general.
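
i.e. something along these lines, just to illustrate what "gathering"
means here (sketch only; the real get_and_clear_full_ptes() is the thing
to use rather than open-coding this):

static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
						  unsigned long start_addr,
						  pte_t *start_pte,
						  int nr, int full)
{
	pte_t pte, tmp;
	int i;

	pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);

	for (i = 1; i < nr; i++) {
		tmp = ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
					      start_pte + i, full);
		/* fold every PTE's access/dirty state into the returned pte */
		if (pte_dirty(tmp))
			pte = pte_mkdirty(pte);
		if (pte_young(tmp))
			pte = pte_mkyoung(pte);
	}

	return pte;
}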

>
> /*
> * If two threads concurrently fault at the same page, the thread that
> @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> })
> #endif
>
> +#ifndef pte_nr_addr_end
> +#define pte_nr_addr_end(addr, size, end) \
> +({ unsigned long __boundary = ((addr) + size) & (~(size - 1)); \
> + (__boundary - 1 < (end) - 1)? __boundary: (end); \
> +})
> +#endif
> +
> /*
> * When walking page tables, we usually want to skip any p?d_none entries;
> * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 912155a94ed5..262460ac4b2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (folio_test_large(folio)) {
> int err;
>
> + if (!folio_test_pmd_mappable(folio)) {
> + int nr_pages = folio_nr_pages(folio);
> + unsigned long folio_size = PAGE_SIZE * nr_pages;
> + unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;

I doubt it is correct to align down here. Couldn't you be going outside the
bounds that the user supplied?

nit: you've defined folio_size, why not use it here?
nit: double semi-colon.

> + unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> + pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;

I think start_pte could be off the start of the pgtable and into random memory
in some corner cases (and outside the protection of the PTL)? You're assuming
> that the folio is fully and contiguously mapped and correctly aligned. mremap
(and other things) could break that assumption.

> + unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> +
> + if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> + goto split;
> +
> + if (next - addr != folio_size) {
> + goto split;
> + } else {
> + /* Do not interfere with other mappings of this page */
> + if (folio_estimated_sharers(folio) != 1)
> + goto skip;
> +
> + VM_BUG_ON(addr != start_addr || pte != start_pte);
> +
> + if (pte_range_young(start_pte, nr_pages)) {
> + ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> + nr_pages, tlb->fullmm);
> + ptent = pte_mkold(ptent);
> +
> + set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> + tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> + }
> +
> + folio_clear_referenced(folio);
> + folio_test_clear_young(folio);
> + if (pageout) {
> + if (folio_isolate_lru(folio)) {
> + if (folio_test_unevictable(folio))
> + folio_putback_lru(folio);
> + else
> + list_add(&folio->lru, &folio_list);
> + }
> + } else
> + folio_deactivate(folio);
> + }
> +skip:
> + pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> + addr = next - PAGE_SIZE;
> + continue;
> +
> + }
> +split:
> if (folio_estimated_sharers(folio) != 1)
> break;
> if (pageout_anon_only_filter && !folio_test_anon(folio))


2024-02-27 15:02:32

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

On 18/01/2024 11:10, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> supported swapping large folios out as a whole for vmscan case. This patch
> extends the feature to madvise.
>
> If madvised range covers the whole large folio, we don't split it. Otherwise,
> we still need to split it.
>
> This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> mapped to a large folio.

I'm going to rework this patch and integrate it into my series if that's ok with
you?


2024-02-27 18:59:44

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

On Wed, Feb 28, 2024 at 3:40 AM Ryan Roberts <[email protected]> wrote:
>
> On 18/01/2024 11:10, Barry Song wrote:
> > From: Chuanhua Han <[email protected]>
> >
> > MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> > supported swapping large folios out as a whole for vmscan case. This patch
> > extends the feature to madvise.
> >
> > If madvised range covers the whole large folio, we don't split it. Otherwise,
> > we still need to split it.
> >
> > This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> > helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> > mapped to a large folio.
>
> I'm going to rework this patch and integrate it into my series if that's ok with
> you?

This is perfect. Please integrate it into your swap-out series which is the
perfect place for this MADV_PAGEOUT.

Thanks
Barry

2024-02-27 22:40:08

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

On Wed, Feb 28, 2024 at 1:22 AM Ryan Roberts <[email protected]> wrote:
>
> Hi Barry,
>
> I've scanned through this patch as part of trying to understand the races you
> have reported (It's going to take me a while to fully understand it all :) ). In
> the meantime I have a few comments on this patch...
>
> On 18/01/2024 11:10, Barry Song wrote:
> > From: Chuanhua Han <[email protected]>
> >
> > MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> > supported swapping large folios out as a whole for vmscan case. This patch
> > extends the feature to madvise.
> >
> > If madvised range covers the whole large folio, we don't split it. Otherwise,
> > we still need to split it.
> >
> > This patch doesn't depend on ARM64's CONT-PTE, alternatively, it defines one
> > helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> > mapped to a large folio.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/asm-generic/tlb.h | 10 +++++++
> > include/linux/pgtable.h | 60 +++++++++++++++++++++++++++++++++++++++
> > mm/madvise.c | 48 +++++++++++++++++++++++++++++++
> > 3 files changed, 118 insertions(+)
> >
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index 129a3a759976..f894e22da5d6 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> > __tlb_remove_tlb_entry(tlb, ptep, address); \
> > } while (0)
> >
> > +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr) \
> > + do { \
> > + int i; \
> > + tlb_flush_pte_range(tlb, address, \
> > + PAGE_SIZE * nr); \
> > + for (i = 0; i < nr; i++) \
> > + __tlb_remove_tlb_entry(tlb, ptep + i, \
> > + address + i * PAGE_SIZE); \
> > + } while (0)
>
> David has recently added tlb_remove_tlb_entries() which does the same thing.

Cool. When we sent the patchset, it did not depend on any other in-flight
work. Nice to know David's work can help with this case.

>
> > +
> > #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
> > do { \
> > unsigned long _sz = huge_page_size(h); \
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 37fe83b0c358..da0c1cf447e3 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> > }
> > #endif
> >
> > +#ifndef pte_range_cont_mapped
> > +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> > + pte_t *start_pte,
> > + unsigned long start_addr,
> > + int nr)
> > +{
> > + int i;
> > + pte_t pte_val;
> > +
> > + for (i = 0; i < nr; i++) {
> > + pte_val = ptep_get(start_pte + i);
> > +
> > + if (pte_none(pte_val))
> > + return false;
> > +
> > + if (pte_pfn(pte_val) != (start_pfn + i))
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +#endif
>
> David has recently added folio_pte_batch() which does a similar thing (as
> discussed in other context).

yes.

>
> > +
> > +#ifndef pte_range_young
> > +static inline bool pte_range_young(pte_t *start_pte, int nr)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr; i++)
> > + if (pte_young(ptep_get(start_pte + i)))
> > + return true;
> > +
> > + return false;
> > +}
> > +#endif
>
> I wonder if this should come from folio_pte_batch()?

I'm not quite sure folio_pte_batch() can report the young state, but I
guess you already have a batched function to check whether a large folio
is young?

>
> > +
> > #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> > static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > unsigned long address,
> > @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> > }
> > #endif
> >
> > +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> > +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> > + unsigned long start_addr,
> > + pte_t *start_pte,
> > + int nr, int full)
> > +{
> > + int i;
> > + pte_t pte;
> > +
> > + pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> > +
> > + for (i = 1; i < nr; i++)
> > + ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> > + start_pte + i, full);
> > +
> > + return pte;
> > +}
>
> David has recently added get_and_clear_full_ptes(). Your version isn't gathering
> access/dirty, which may be ok for your case, but not ok in general.

ok. glad to know we can use get_and_clear_full_ptes().

>
> >
> > /*
> > * If two threads concurrently fault at the same page, the thread that
> > @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > })
> > #endif
> >
> > +#ifndef pte_nr_addr_end
> > +#define pte_nr_addr_end(addr, size, end) \
> > +({ unsigned long __boundary = ((addr) + size) & (~(size - 1)); \
> > + (__boundary - 1 < (end) - 1)? __boundary: (end); \
> > +})
> > +#endif
> > +
> > /*
> > * When walking page tables, we usually want to skip any p?d_none entries;
> > * and any p?d_bad entries - reporting the error before resetting to none.
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..262460ac4b2e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > if (folio_test_large(folio)) {
> > int err;
> >
> > + if (!folio_test_pmd_mappable(folio)) {
> > + int nr_pages = folio_nr_pages(folio);
> > + unsigned long folio_size = PAGE_SIZE * nr_pages;
> > + unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;
>
> I doubt it is correct to align down here. Couldn't you be going outside the
> bounds that the user supplied?

Yes, it can. This is ugly and suspicious, but it does not cause problems
as long as the large folio's virtual address is aligned; it is wrong if
the virtual address is not aligned, as explained below.

>
> nit: you've defined folio_size, why not use it here?
> nit: double semi-colon.
>
> > + unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> > + pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
>
> I think start_pte could be off the start of the pgtable and into random memory

> in some corner cases (and outside the protection of the PTL)? You're assuming
> that the folio is fully and contiguously mapped and correctly aligned. mremap
> (and other things) could break that assumption.

Actually we don't run under the assumption that the folio is fully and
contiguously mapped, but the code does assume that a large folio's
virtual address is aligned to nr_pages * PAGE_SIZE.

OTOH, we have the "if (next - addr != folio_size)" check to split folios
if users only want to partially reclaim a large folio, but I do agree we
should move that check before pte_range_cont_mapped().

As long as the virtual address is aligned, pte_range_cont_mapped() won't
cause a problem even before the "if (next - addr != folio_size)" check
(though it is ugly and suspicious), since it is still under the
protection of the PTL: we never cross a PMD for a pte-mapped large
folio.

But you are right: cases like mremap can remap an aligned large folio to
an unaligned address. I placed a tracepoint in the kernel and ran lots
of phones, and never saw this case happen, so I feel such mremaps are
really rare. Would it be acceptable to split large folios, and avoid the
complexity, if we are remapping to an unaligned address?

And the code is completely wrong if the large folio is unaligned; we
have to drop that assumption if this can really happen, so we shouldn't
do the ALIGN_DOWN.
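
One simple option (untested sketch) would be to require natural
alignment and fall back to the split path otherwise, instead of doing
the ALIGN_DOWN; unaligned (e.g. mremap'ed) folios would then simply be
split as before:

	/* only take the batched path for naturally aligned mappings */
	if (!IS_ALIGNED(addr, folio_size) || next - addr != folio_size)
		goto split;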

>
> > + unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> > +
> > + if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> > + goto split;
> > +
> > + if (next - addr != folio_size) {
> > + goto split;
> > + } else {
> > + /* Do not interfere with other mappings of this page */
> > + if (folio_estimated_sharers(folio) != 1)
> > + goto skip;
> > +
> > + VM_BUG_ON(addr != start_addr || pte != start_pte);
> > +
> > + if (pte_range_young(start_pte, nr_pages)) {
> > + ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> > + nr_pages, tlb->fullmm);
> > + ptent = pte_mkold(ptent);
> > +
> > + set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> > + tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> > + }
> > +
> > + folio_clear_referenced(folio);
> > + folio_test_clear_young(folio);
> > + if (pageout) {
> > + if (folio_isolate_lru(folio)) {
> > + if (folio_test_unevictable(folio))
> > + folio_putback_lru(folio);
> > + else
> > + list_add(&folio->lru, &folio_list);
> > + }
> > + } else
> > + folio_deactivate(folio);
> > + }
> > +skip:
> > + pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> > + addr = next - PAGE_SIZE;
> > + continue;
> > +
> > + }
> > +split:
> > if (folio_estimated_sharers(folio) != 1)
> > break;
> > if (pageout_anon_only_filter && !folio_test_anon(folio))

Thanks
Barry

2024-02-28 03:49:42

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT

>> I'm going to rework this patch and integrate it into my series if that's ok with
>> you?
>
> This is perfect. Please integrate it into your swap-out series which is the
> perfect place for this MADV_PAGEOUT.

BTW, Ryan, while you integrate this into your swap-put series, can you also
add the below one which is addressing one comment of Chris,

From: Barry Song <[email protected]>
Date: Tue, 27 Feb 2024 22:03:59 +1300
Subject: [PATCH] mm: madvise: extract common function
folio_deactivate_or_add_to_reclaim_list

For madvise_cold_or_pageout_pte_range(), the pmd-mapped and pte-mapped
normal folio paths currently duplicate the same code, and we may get
more users of it, such as pte-mapped large folios. It is better to
extract a common function.

Cc: Chris Li <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: SeongJae Park <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/madvise.c | 52 ++++++++++++++++++++--------------------------------
1 file changed, 20 insertions(+), 32 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 44a498c94158..1812457144ea 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -321,6 +321,24 @@ static inline bool can_do_file_pageout(struct vm_area_struct *vma)
file_permission(vma->vm_file, MAY_WRITE) == 0;
}

+static inline void folio_deactivate_or_add_to_reclaim_list(struct folio *folio, bool pageout,
+ struct list_head *folio_list)
+{
+ folio_clear_referenced(folio);
+ folio_test_clear_young(folio);
+
+ if (folio_test_active(folio))
+ folio_set_workingset(folio);
+ if (!pageout)
+ return folio_deactivate(folio);
+ if (folio_isolate_lru(folio)) {
+ if (folio_test_unevictable(folio))
+ folio_putback_lru(folio);
+ else
+ list_add(&folio->lru, folio_list);
+ }
+}
+
static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
@@ -394,19 +412,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
}

- folio_clear_referenced(folio);
- folio_test_clear_young(folio);
- if (folio_test_active(folio))
- folio_set_workingset(folio);
- if (pageout) {
- if (folio_isolate_lru(folio)) {
- if (folio_test_unevictable(folio))
- folio_putback_lru(folio);
- else
- list_add(&folio->lru, &folio_list);
- }
- } else
- folio_deactivate(folio);
+ folio_deactivate_or_add_to_reclaim_list(folio, pageout, &folio_list);
huge_unlock:
spin_unlock(ptl);
if (pageout)
@@ -498,25 +504,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
tlb_remove_tlb_entry(tlb, pte, addr);
}

- /*
- * We are deactivating a folio for accelerating reclaiming.
- * VM couldn't reclaim the folio unless we clear PG_young.
- * As a side effect, it makes confuse idle-page tracking
- * because they will miss recent referenced history.
- */
- folio_clear_referenced(folio);
- folio_test_clear_young(folio);
- if (folio_test_active(folio))
- folio_set_workingset(folio);
- if (pageout) {
- if (folio_isolate_lru(folio)) {
- if (folio_test_unevictable(folio))
- folio_putback_lru(folio);
- else
- list_add(&folio->lru, &folio_list);
- }
- } else
- folio_deactivate(folio);
+ folio_deactivate_or_add_to_reclaim_list(folio, pageout, &folio_list);
}

if (start_pte) {
--
2.34.1

Thanks
Barry

2024-04-06 23:27:47

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap()

On Mon, Jan 29, 2024 at 11:07 PM David Hildenbrand <[email protected]> wrote:
>
> On 29.01.24 04:25, Chris Li wrote:
> > Hi David and Barry,
> >
> > On Mon, Jan 22, 2024 at 10:49 PM Barry Song <[email protected]> wrote:
> >>
> >>>
> >>>
> >>> I have on my todo list to move all that !anon handling out of
> >>> folio_add_anon_rmap_ptes(), and instead make swapin code call add
> >>> folio_add_new_anon_rmap(), where we'll have to pass an exclusive flag
> >>> then (-> whole new folio exclusive).
> >>>
> >>> That's the cleaner approach.
> >>>
> >>
> >> one tricky thing is that sometimes it is hard to know who is the first
> >> one to add rmap and thus should
> >> call folio_add_new_anon_rmap.
> >> especially when we want to support swapin_readahead(), the one who
> >> allocated the large folio might not
> >> be that one who firstly does rmap.
> >
> > I think Barry has a point. Two tasks might race to swap in the folio
> > then race to perform the rmap.
> > folio_add_new_anon_rmap() should only be called on a folio that is absolutely
> > "new", not shared. The sharing in swap cache disqualifies that
> > condition.
>
> We have to hold the folio lock. So only one task at a time might do the
> folio_add_anon_rmap_ptes() right now, and the
> folio_add_new_shared_anon_rmap() in the future [below].
>
> Also observe how folio_add_anon_rmap_ptes() states that one must hold
> the page lock, because otherwise this would all be completely racy.
>
> From the pte swp exclusive flags, we know for sure whether we are
> dealing with exclusive vs. shared. I think patch #6 does not properly
> check that all entries are actually the same in that regard (all
> exclusive vs all shared). That likely needs fixing.
>
> [I have converting per-page PageAnonExclusive flags to a single
> per-folio flag on my todo list. I suspect that we'll keep the
> per-swp-pte exlusive bits, but the question is rather what we can
> actually make work, because swap and migration just make it much more
> complicated. Anyhow, future work]
>
> >
> >> is it an acceptable way to do the below in do_swap_page?
> >> if (!folio_test_anon(folio))
> >> folio_add_new_anon_rmap()
> >> else
> >> folio_add_anon_rmap_ptes()
> >
> > I am curious to know the answer as well.
>
>
> Yes, the end code should likely be something like:
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> folio_add_new_anon_rmap(folio, vma, vmf->address);
> folio_add_lru_vma(folio, vma);
> } else if (folio_test_anon(folio)) {
> folio_add_anon_rmap_ptes(rmap_flags)
> } else {
> folio_add_new_anon_rmap(rmap_flags)
> }
>
> Maybe we want to avoid teaching all existing folio_add_new_anon_rmap()
> callers about a new flag, and just have a new
> folio_add_new_shared_anon_rmap() instead. TBD.

right.

We need to clarify that the new anon_folio might not necessarily be exclusive.
Unlike folio_add_new_anon_rmap, which assumes the new folio is exclusive,
folio_add_anon_rmap_ptes is capable of handling both exclusive and
non-exclusive new anon folios.

The code would be like:

if (unlikely(folio != swapcache && swapcache)) {
	folio_add_new_anon_rmap(folio, vma, vmf->address);
	folio_add_lru_vma(folio, vma);
} else if (!folio_test_anon(folio)) {
	folio_add_anon_rmap_ptes(rmap_flags);
} else {
	if (exclusive)
		folio_add_new_anon_rmap();
	else
		folio_add_new_shared_anon_rmap();
}

It appears a bit lengthy?

>
> >
> > BTW, that test might have a race as well. By the time the task got
> > !anon result, this result might get changed by another task. We need
> > to make sure in the caller context this race can't happen. Otherwise
> > we can't do the above safely.
> Again, folio lock. Observe the folio_lock_or_retry() call that covers
> our existing folio_add_new_anon_rmap/folio_add_anon_rmap_pte calls.
>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry