2024-04-09 08:32:59

by Barry Song

Subject: [PATCH v2 0/5] large folios swap-in: handle refault cases first

From: Barry Song <[email protected]>

This patchset is extracted from the large folio swapin series[1] and primarily
addresses the handling of large folios that are hit in the swap cache. At
present it focuses on the refault of mTHP which is still undergoing
reclamation. Splitting it out this way aims to streamline code review and
expedite the integration of this part into the MM tree.

It relies on Ryan's swap-out series v7[2], leveraging the helper function
swap_pte_batch() introduced by that series.
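
For reference, the core check from patch 4/5 below, condensed into a small
helper for illustration only (the helper name is made up and the final code
may differ):

static bool swap_entries_batched(struct vm_fault *vmf, struct folio *folio,
                                 struct page *page, bool *any_swap_shared)
{
        int nr = folio_nr_pages(folio);
        int idx = folio_page_idx(folio, page);
        pte_t *folio_ptep = vmf->pte - idx;     /* pte of the folio's first page */
        pte_t folio_pte = ptep_get(folio_ptep);

        /* the first entry must still be a plain swap pte */
        if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)))
                return false;

        /* swap_pte_batch() counts consecutive swap ptes with matching sw bits */
        return swap_pte_batch(folio_ptep, nr, folio_pte, any_swap_shared) == nr;
}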

Presently, do_swap_page only encounters a large folio in the swap
cache before the large folio is released by vmscan. However, the code
should remain equally useful once we support large folio swap-in via
swapin_readahead(). This approach can effectively reduce page faults
and eliminate most of the redundant checks and early exits for MTE
restoration in the recent MTE patchset[3].

The large folio swap-in for SWP_SYNCHRONOUS_IO and swapin_readahead()
will be split into separate patch sets and sent at a later time.

-v2:
- rebase on top of mm-unstable in which Ryan's swap_pte_batch() has changed
a lot.
- remove folio_add_new_anon_rmap() for !folio_test_anon(),
as currently large folios are always anon (refault).
- add mTHP swpin refault counters

-v1:
Link: https://lore.kernel.org/linux-mm/[email protected]/

Differences from the original large folios swap-in series:
- collected Reviewed-by and Acked-by tags;
- renamed swap_nr_free to swap_free_nr, per Ryan;
- limited the maximum kernel stack usage of swap_free_nr, per Ryan;
- added an output argument to swap_pte_batch to expose whether all entries
are exclusive;
- many cleanups and refinements; handled the corner case where a folio's
virtual address might not be naturally aligned.

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/

Barry Song (2):
mm: swap_pte_batch: add an output argument to return if all swap
entries are exclusive
mm: add per-order mTHP swpin_refault counter

Chuanhua Han (3):
mm: swap: introduce swap_free_nr() for batched swap_free()
mm: swap: make should_try_to_free_swap() support large-folio
mm: swap: entirely map large folios found in swapcache

include/linux/huge_mm.h | 1 +
include/linux/swap.h | 5 +++
mm/huge_memory.c | 2 ++
mm/internal.h | 9 +++++-
mm/madvise.c | 2 +-
mm/memory.c | 69 ++++++++++++++++++++++++++++++++---------
mm/swapfile.c | 51 ++++++++++++++++++++++++++++++
7 files changed, 123 insertions(+), 16 deletions(-)

Appendix:

The following program can generate numerous instances where large folios
are hit in the swap cache if we enable 64KiB mTHP:

#echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define DATA_SIZE (128UL * 1024)
#define PAGE_SIZE (4UL * 1024)
#define LARGE_FOLIO_SIZE (64UL * 1024)

static void *write_data(void *addr)
{
        unsigned long i;

        for (i = 0; i < DATA_SIZE; i += PAGE_SIZE)
                memset(addr + i, (char)i, PAGE_SIZE);
        return NULL;
}

static void *read_data(void *addr)
{
        unsigned long i;

        for (i = 0; i < DATA_SIZE; i += PAGE_SIZE) {
                if (*((char *)addr + i) != (char)i) {
                        perror("mismatched data");
                        _exit(-1);
                }
        }
        return NULL;
}

static void *pgout_data(void *addr)
{
        /* push the freshly written 64KiB folios out to swap */
        madvise(addr, DATA_SIZE, MADV_PAGEOUT);
        return NULL;
}

int main(int argc, char **argv)
{
        for (int i = 0; i < 10000; i++) {
                pthread_t tid1, tid2;
                void *addr = mmap(NULL, DATA_SIZE * 2, PROT_READ | PROT_WRITE,
                                  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
                unsigned long aligned_addr;

                if (addr == MAP_FAILED) {
                        perror("fail to mmap");
                        return -1;
                }

                /* align to the 64KiB mTHP size so we fault in large folios */
                aligned_addr = ((unsigned long)addr + LARGE_FOLIO_SIZE) &
                               ~(LARGE_FOLIO_SIZE - 1);

                write_data((void *)aligned_addr);

                /* reclaim and re-read the same folios concurrently */
                if (pthread_create(&tid1, NULL, pgout_data, (void *)aligned_addr)) {
                        perror("fail to pthread_create");
                        return -1;
                }

                if (pthread_create(&tid2, NULL, read_data, (void *)aligned_addr)) {
                        perror("fail to pthread_create");
                        return -1;
                }

                pthread_join(tid1, NULL);
                pthread_join(tid2, NULL);
                munmap(addr, DATA_SIZE * 2);
        }

        return 0;
}

# cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/anon_swpout
932
# cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/anon_swpin_refault
1488

--
2.34.1



2024-04-09 08:33:51

by Barry Song

Subject: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

From: Barry Song <[email protected]>

Currently, we handle the scenario where we hit a large folio in the
swapcache while the reclamation of that large folio is still ongoing.
Add a per-order mTHP swpin_refault counter so this path is visible to users.

Signed-off-by: Barry Song <[email protected]>
---
include/linux/huge_mm.h | 1 +
mm/huge_memory.c | 2 ++
mm/memory.c | 1 +
3 files changed, 4 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c8256af83e33..b67294d5814f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -269,6 +269,7 @@ enum mthp_stat_item {
MTHP_STAT_ANON_ALLOC_FALLBACK,
MTHP_STAT_ANON_SWPOUT,
MTHP_STAT_ANON_SWPOUT_FALLBACK,
+ MTHP_STAT_ANON_SWPIN_REFAULT,
__MTHP_STAT_COUNT
};

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8d2ed80b0bf..fb95345b0bde 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);

static struct attribute *stats_attrs[] = {
&anon_alloc_attr.attr,
&anon_alloc_fallback_attr.attr,
&anon_swpout_attr.attr,
&anon_swpout_fallback_attr.attr,
+ &anon_swpin_refault_attr.attr,
NULL,
};

diff --git a/mm/memory.c b/mm/memory.c
index 9818dc1893c8..acc023795a4d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
nr_pages = nr;
entry = folio->swap;
page = &folio->page;
+ count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
}

check_pte:
--
2.34.1


2024-04-09 08:50:01

by Barry Song

Subject: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

From: Chuanhua Han <[email protected]>

When a large folio is found in the swapcache, the current implementation
requires calling do_swap_page() nr_pages times, resulting in nr_pages
page faults. This patch opts to map the entire large folio at once to
minimize page faults. Additionally, redundant checks and early exits
for ARM64 MTE restoring are removed.

Signed-off-by: Chuanhua Han <[email protected]>
Co-developed-by: Barry Song <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 52 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c4a52e8d740a..9818dc1893c8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte_t pte;
vm_fault_t ret = 0;
void *shadow = NULL;
+ int nr_pages = 1;
+ unsigned long start_address = vmf->address;
+ pte_t *start_pte = vmf->pte;
+ bool any_swap_shared = false;

if (!pte_unmap_same(vmf))
goto out;
@@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+
+ /* We hit large folios in swapcache */
+ if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
+ int nr = folio_nr_pages(folio);
+ int idx = folio_page_idx(folio, page);
+ unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
+ unsigned long folio_end = folio_start + nr * PAGE_SIZE;
+ pte_t *folio_ptep;
+ pte_t folio_pte;
+
+ if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
+ goto check_pte;
+ if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
+ goto check_pte;
+
+ folio_ptep = vmf->pte - idx;
+ folio_pte = ptep_get(folio_ptep);
+ if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
+ swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
+ goto check_pte;
+
+ start_address = folio_start;
+ start_pte = folio_ptep;
+ nr_pages = nr;
+ entry = folio->swap;
+ page = &folio->page;
+ }
+
+check_pte:
if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
goto out_nomap;

@@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
exclusive = false;
}
+
+ /* Reuse the whole large folio iff all entries are exclusive */
+ if (nr_pages > 1 && any_swap_shared)
+ exclusive = false;
}

/*
@@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* We're already holding a reference on the page but haven't mapped it
* yet.
*/
- swap_free(entry);
+ swap_free_nr(entry, nr_pages);
if (should_try_to_free_swap(folio, vma, vmf->flags))
folio_free_swap(folio);

- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+ folio_ref_add(folio, nr_pages - 1);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
+ add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
+
pte = mk_pte(page, vma->vm_page_prot);

/*
@@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* exclusivity.
*/
if (!folio_test_ksm(folio) &&
- (exclusive || folio_ref_count(folio) == 1)) {
+ (exclusive || (folio_ref_count(folio) == nr_pages &&
+ folio_nr_pages(folio) == nr_pages))) {
if (vmf->flags & FAULT_FLAG_WRITE) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
}
rmap_flags |= RMAP_EXCLUSIVE;
}
- flush_icache_page(vma, page);
+ flush_icache_pages(vma, page, nr_pages);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
if (pte_swp_uffd_wp(vmf->orig_pte))
pte = pte_mkuffd_wp(pte);
- vmf->orig_pte = pte;

/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
- folio_add_new_anon_rmap(folio, vma, vmf->address);
+ folio_add_new_anon_rmap(folio, vma, start_address);
folio_add_lru_vma(folio, vma);
} else {
- folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
- rmap_flags);
+ folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
+ rmap_flags);
}

VM_BUG_ON(!folio_test_anon(folio) ||
(pte_write(pte) && !PageAnonExclusive(page)));
- set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
- arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
+ set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
+ vmf->orig_pte = ptep_get(vmf->pte);
+ arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);

folio_unlock(folio);
if (folio != swapcache && swapcache) {
@@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

/* No need to invalidate - it was non-present before */
- update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+ update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
unlock:
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
--
2.34.1


2024-04-10 23:15:49

by SeongJae Park

Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

Hi Barry,

On Tue, 9 Apr 2024 20:26:31 +1200 Barry Song <[email protected]> wrote:

> From: Barry Song <[email protected]>
>
> Currently, we are handling the scenario where we've hit a
> large folio in the swapcache, and the reclaiming process
> for this large folio is still ongoing.
>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/huge_mm.h | 1 +
> mm/huge_memory.c | 2 ++
> mm/memory.c | 1 +
> 3 files changed, 4 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c8256af83e33..b67294d5814f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -269,6 +269,7 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_ALLOC_FALLBACK,
> MTHP_STAT_ANON_SWPOUT,
> MTHP_STAT_ANON_SWPOUT_FALLBACK,
> + MTHP_STAT_ANON_SWPIN_REFAULT,
> __MTHP_STAT_COUNT
> };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..fb95345b0bde 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>
> static struct attribute *stats_attrs[] = {
> &anon_alloc_attr.attr,
> &anon_alloc_fallback_attr.attr,
> &anon_swpout_attr.attr,
> &anon_swpout_fallback_attr.attr,
> + &anon_swpin_refault_attr.attr,
> NULL,
> };
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9818dc1893c8..acc023795a4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> nr_pages = nr;
> entry = folio->swap;
> page = &folio->page;
> + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> }

From the latest mm-unstable tree, I get the below kunit build failure, and
'git bisect' points to this patch.

$ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
[16:07:40] Configuring KUnit Kernel ...
[16:07:40] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=../kunit.out/ olddefconfig
Building with:
$ make ARCH=um O=../kunit.out/ --jobs=36
ERROR:root:.../mm/memory.c: In function ‘do_swap_page’:
.../mm/memory.c:4169:17: error: implicit declaration of function ‘count_mthp_stat’ [-Werror=implicit-function-declaration]
4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
| ^~~~~~~~~~~~~~~
.../mm/memory.c:4169:53: error: ‘MTHP_STAT_ANON_SWPIN_REFAULT’ undeclared (first use in this function)
4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
.../mm/memory.c:4169:53: note: each undeclared identifier is reported only once for each function it appears in
cc1: some warnings being treated as errors

My kunit build config doesn't have CONFIG_TRANSPARENT_HUGEPAGE. Maybe that's
the reason, and this patch, or the patch that introduced the function and the
enum, needs to take care of the case?


Thanks,
SJ

[...]

2024-04-11 01:47:04

by Barry Song

Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

>> + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>> }
>>
> From the latest mm-unstable tree, I get below kunit build failure and
> 'git bisect' points this patch.
>
> $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> [16:07:40] Configuring KUnit Kernel ...
> [16:07:40] Building KUnit Kernel ...
> Populating config with:
> $ make ARCH=um O=../kunit.out/ olddefconfig
> Building with:
> $ make ARCH=um O=../kunit.out/ --jobs=36
> ERROR:root:.../mm/memory.c: In function ‘do_swap_page’:
> .../mm/memory.c:4169:17: error: implicit declaration of function ‘count_mthp_stat’ [-Werror=implicit-function-declaration]
> 4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> | ^~~~~~~~~~~~~~~
> .../mm/memory.c:4169:53: error: ‘MTHP_STAT_ANON_SWPIN_REFAULT’ undeclared (first use in this function)
> 4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> .../mm/memory.c:4169:53: note: each undeclared identifier is reported only once for each function it appears in
> cc1: some warnings being treated as errors
>
> My kunit build config doesn't have CONFIG_TRANSPARE_HUGEPAGE. Maybe that's the
> reason and this patch, or the patch that introduced the function and the enum
> need to take care of the case?

Hi SeongJae,
Thanks very much. Can you check if the below fixes the build? If so, I will
include this fix when sending v3.

Subject: [PATCH] mm: fix build errors on CONFIG_TRANSPARENT_HUGEPAGE=N

Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index acc023795a4d..1d587d1eb432 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4142,6 +4142,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* We hit large folios in swapcache */
if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
int nr = folio_nr_pages(folio);
@@ -4171,6 +4172,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

check_pte:
+#endif
if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
goto out_nomap;

--
2.34.1


2024-04-11 15:57:13

by Ryan Roberts

Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On 09/04/2024 09:26, Barry Song wrote:
> From: Chuanhua Han <[email protected]>
>
> When a large folio is found in the swapcache, the current implementation
> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> page faults. This patch opts to map the entire large folio at once to
> minimize page faults. Additionally, redundant checks and early exits
> for ARM64 MTE restoring are removed.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 52 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c4a52e8d740a..9818dc1893c8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long start_address = vmf->address;
> + pte_t *start_pte = vmf->pte;

possible bug?: there are code paths that assign to vmf->pte below in this
function, so couldn't start_pte be stale in some cases? I'd just do the
assignment (all 4 of these variables in fact) in an else clause below, after any
messing about with them is complete.

nit: rename start_pte -> start_ptep ?

> + bool any_swap_shared = false;

Suggest you defer initialization of this to your "We hit large folios in
swapcache" block below, and init it to:

any_swap_shared = !pte_swp_exclusive(vmf->orig_pte);

Then the any_shared semantic in swap_pte_batch() can be the same as for
folio_pte_batch().

>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);

bug: vmf->pte may be NULL and you are not checking it until check_pte:. But you
are using it in this block. It also seems odd to do all the work in the below
block under the PTL but before checking if the pte has changed. Suggest moving
both checks here.
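
Something like the below, perhaps (untested sketch):

        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                                       &vmf->ptl);
        /* validate the pte before doing any of the batching work */
        if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                goto out_nomap;

        if (folio_test_large(folio) && folio_test_swapcache(folio)) {
                /* ... batch detection as below; start_pte check no longer needed ... */
        }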

> +
> + /* We hit large folios in swapcache */
> + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {

What's the start_pte check protecting?

> + int nr = folio_nr_pages(folio);
> + int idx = folio_page_idx(folio, page);
> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> + pte_t *folio_ptep;
> + pte_t folio_pte;
> +
> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> + goto check_pte;
> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> + goto check_pte;
> +
> + folio_ptep = vmf->pte - idx;
> + folio_pte = ptep_get(folio_ptep);
> + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> + goto check_pte;
> +
> + start_address = folio_start;
> + start_pte = folio_ptep;
> + nr_pages = nr;
> + entry = folio->swap;
> + page = &folio->page;
> + }
> +
> +check_pte:
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> exclusive = false;
> }
> +
> + /* Reuse the whole large folio iff all entries are exclusive */
> + if (nr_pages > 1 && any_swap_shared)
> + exclusive = false;

If you init any_shared with the first pte as I suggested then you could just set
exclusive = !any_shared at the top of this if block without needing this
separate fixup.
> }
>
> /*
> @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_free_nr(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || (folio_ref_count(folio) == nr_pages &&
> + folio_nr_pages(folio) == nr_pages))) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> pte = pte_mkuffd_wp(pte);

I'm not sure about all this... you are smearing these SW bits from the faulting
PTE across all the ptes you are mapping. Although I guess actually that's ok
because swap_pte_batch() only returns a batch with all these bits the same?

> - vmf->orig_pte = pte;

Instead of doing a readback below, perhaps:

vmf->orig_pte = pte_advance_pfn(pte, nr_pages);

>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_add_new_anon_rmap(folio, vma, start_address);
> folio_add_lru_vma(folio, vma);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> - rmap_flags);
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> + rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> + vmf->orig_pte = ptep_get(vmf->pte);
> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>
> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);


2024-04-11 17:27:18

by Ryan Roberts

Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

On 09/04/2024 09:26, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> Currently, we are handling the scenario where we've hit a
> large folio in the swapcache, and the reclaiming process
> for this large folio is still ongoing.
>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/huge_mm.h | 1 +
> mm/huge_memory.c | 2 ++
> mm/memory.c | 1 +
> 3 files changed, 4 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c8256af83e33..b67294d5814f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -269,6 +269,7 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_ALLOC_FALLBACK,
> MTHP_STAT_ANON_SWPOUT,
> MTHP_STAT_ANON_SWPOUT_FALLBACK,
> + MTHP_STAT_ANON_SWPIN_REFAULT,

I don't see any equivalent counter for small folios. Is there an analogue?

> __MTHP_STAT_COUNT
> };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..fb95345b0bde 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>
> static struct attribute *stats_attrs[] = {
> &anon_alloc_attr.attr,
> &anon_alloc_fallback_attr.attr,
> &anon_swpout_attr.attr,
> &anon_swpout_fallback_attr.attr,
> + &anon_swpin_refault_attr.attr,
> NULL,
> };
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9818dc1893c8..acc023795a4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> nr_pages = nr;
> entry = folio->swap;
> page = &folio->page;
> + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);

I don't think this is the point of no return yet? There's the pte_same() check
immediately below (although I've suggested that needs to be moved to earlier),
but also the folio_test_uptodate() check. Perhaps this should go after that?

> }
>
> check_pte:


2024-04-11 17:42:48

by SeongJae Park

Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

Hi Barry,

On Thu, 11 Apr 2024 13:46:36 +1200 Barry Song <[email protected]> wrote:

> >> + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> >> }
> >>
> > From the latest mm-unstable tree, I get below kunit build failure and
> > 'git bisect' points this patch.
> >
> > $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> > [16:07:40] Configuring KUnit Kernel ...
> > [16:07:40] Building KUnit Kernel ...
> > Populating config with:
> > $ make ARCH=um O=../kunit.out/ olddefconfig
> > Building with:
> > $ make ARCH=um O=../kunit.out/ --jobs=36
> > ERROR:root:.../mm/memory.c: In function ‘do_swap_page’:
> > .../mm/memory.c:4169:17: error: implicit declaration of function ‘count_mthp_stat’ [-Werror=implicit-function-declaration]
> > 4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> > | ^~~~~~~~~~~~~~~
> > .../mm/memory.c:4169:53: error: ‘MTHP_STAT_ANON_SWPIN_REFAULT’ undeclared (first use in this function)
> > 4169 | count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > .../mm/memory.c:4169:53: note: each undeclared identifier is reported only once for each function it appears in
> > cc1: some warnings being treated as errors
> >
> > My kunit build config doesn't have CONFIG_TRANSPARE_HUGEPAGE. Maybe that's the
> > reason and this patch, or the patch that introduced the function and the enum
> > need to take care of the case?
>
> Hi SeongJae,
> Thanks very much, can you check if the below fix the build? If yes, I will
> include this fix while sending v3.

Thank you for the quick and kind reply :) I confirmed this fixes the build failure.

>
> Subject: [PATCH] mm: fix build errors on CONFIG_TRANSPARENT_HUGEPAGE=N
>
> Signed-off-by: Barry Song <[email protected]>
Tested-by: SeongJae Park <[email protected]>


Thanks,
SJ

> ---
> mm/memory.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index acc023795a4d..1d587d1eb432 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4142,6 +4142,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> /* We hit large folios in swapcache */
> if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> int nr = folio_nr_pages(folio);
> @@ -4171,6 +4172,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> check_pte:
> +#endif
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> --
> 2.34.1
>
>

2024-04-11 23:02:42

by Barry Song

Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

On Fri, Apr 12, 2024 at 3:53 AM Ryan Roberts <[email protected]> wrote:
>
> On 09/04/2024 09:26, Barry Song wrote:
> > From: Barry Song <[email protected]>
> >
> > Currently, we are handling the scenario where we've hit a
> > large folio in the swapcache, and the reclaiming process
> > for this large folio is still ongoing.
> >
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/linux/huge_mm.h | 1 +
> > mm/huge_memory.c | 2 ++
> > mm/memory.c | 1 +
> > 3 files changed, 4 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index c8256af83e33..b67294d5814f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
> > MTHP_STAT_ANON_ALLOC_FALLBACK,
> > MTHP_STAT_ANON_SWPOUT,
> > MTHP_STAT_ANON_SWPOUT_FALLBACK,
> > + MTHP_STAT_ANON_SWPIN_REFAULT,
>
> I don't see any equivalent counter for small folios. Is there an analogue?

Indeed, we don't count refaults for small folios, as their refault mechanism
is much simpler than that of large folios. Implementing this counter enhances
the system's visibility for users.

Personally, having this counter and observing a non-zero value greatly enhances
my confidence when debugging this refault series. Otherwise, it feels like being
blind to what's happening inside the system :-)

>
> > __MTHP_STAT_COUNT
> > };
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d8d2ed80b0bf..fb95345b0bde 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> > DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> > DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> > DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
> >
> > static struct attribute *stats_attrs[] = {
> > &anon_alloc_attr.attr,
> > &anon_alloc_fallback_attr.attr,
> > &anon_swpout_attr.attr,
> > &anon_swpout_fallback_attr.attr,
> > + &anon_swpin_refault_attr.attr,
> > NULL,
> > };
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9818dc1893c8..acc023795a4d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > nr_pages = nr;
> > entry = folio->swap;
> > page = &folio->page;
> > + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>
> I don't think this is the point of no return yet? There's the pte_same() check
> immediately below (although I've suggested that needs to be moved to earlier),
> but also the folio_test_uptodate() check. Perhaps this should go after that?
>

If swap_pte_batch() == nr_pages, the pte_same() test should already pass.
!folio_test_uptodate(folio) should also be unlikely to be true, as we are not
reading from swap devices in the refault case.

But I agree we can move all the refault handling after those two "goto
out_nomap"s.
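
i.e. roughly (untested sketch):

        if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                goto out_nomap;

        if (unlikely(!folio_test_uptodate(folio))) {
                ret = VM_FAULT_SIGBUS;
                goto out_nomap;
        }

        /*
         * The whole large-folio batching block, including the
         * MTHP_STAT_ANON_SWPIN_REFAULT count, would move here.
         */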

> > }
> >
> > check_pte:
>

Thanks
Barry

2024-04-11 23:30:31

by Barry Song

Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On Fri, Apr 12, 2024 at 3:33 AM Ryan Roberts <[email protected]> wrote:
>
> On 09/04/2024 09:26, Barry Song wrote:
> > From: Chuanhua Han <[email protected]>
> >
> > When a large folio is found in the swapcache, the current implementation
> > requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > page faults. This patch opts to map the entire large folio at once to
> > minimize page faults. Additionally, redundant checks and early exits
> > for ARM64 MTE restoring are removed.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
> > 1 file changed, 52 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c4a52e8d740a..9818dc1893c8 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > + int nr_pages = 1;
> > + unsigned long start_address = vmf->address;
> > + pte_t *start_pte = vmf->pte;
>
> possible bug?: there are code paths that assign to vmf-pte below in this
> function, so couldn't start_pte be stale in some cases? I'd just do the
> assignment (all 4 of these variables in fact) in an else clause below, after any
> messing about with them is complete.
>
> nit: rename start_pte -> start_ptep ?

Agreed.

>
> > + bool any_swap_shared = false;
>
> Suggest you defer initialization of this to your "We hit large folios in
> swapcache" block below, and init it to:
>
> any_swap_shared = !pte_swp_exclusive(vmf->pte);
>
> Then the any_shared semantic in swap_pte_batch() can be the same as for
> folio_pte_batch().
>
> >
> > if (!pte_unmap_same(vmf))
> > goto out;
> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> > &vmf->ptl);
>
> bug: vmf->pte may be NULL and you are not checking it until check_pte:. Byt you
> are using it in this block. It also seems odd to do all the work in the below
> block under the PTL but before checking if the pte has changed. Suggest moving
> both checks here.

agreed.

>
> > +
> > + /* We hit large folios in swapcache */
> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>
> What's the start_pte check protecting?

This is exactly protecting against the case vmf->pte == NULL, but for some
reason start_pte was incorrectly assigned at the beginning of the function.
The intention of the code was actually to do start_pte = vmf->pte after
"vmf->pte = pte_offset_map_lock()".
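
i.e. the intended flow was roughly (sketch only):

        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                                       &vmf->ptl);
        start_address = vmf->address;
        start_pte = vmf->pte;           /* may be NULL, hence the check below */

        if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
                /* batch detection as in the patch */
        }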

>
> > + int nr = folio_nr_pages(folio);
> > + int idx = folio_page_idx(folio, page);
> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > + pte_t *folio_ptep;
> > + pte_t folio_pte;
> > +
> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > + goto check_pte;
> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > + goto check_pte;
> > +
> > + folio_ptep = vmf->pte - idx;
> > + folio_pte = ptep_get(folio_ptep);
> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> > + goto check_pte;
> > +
> > + start_address = folio_start;
> > + start_pte = folio_ptep;
> > + nr_pages = nr;
> > + entry = folio->swap;
> > + page = &folio->page;
> > + }
> > +
> > +check_pte:
> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > goto out_nomap;
> >
> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > exclusive = false;
> > }
> > +
> > + /* Reuse the whole large folio iff all entries are exclusive */
> > + if (nr_pages > 1 && any_swap_shared)
> > + exclusive = false;
>
> If you init any_shared with the firt pte as I suggested then you could just set
> exclusive = !any_shared at the top of this if block without needing this
> separate fixup.

Since your swap_pte_batch() function checks that all PTEs have the same
exclusive bits, I'll be removing any_shared first in version 3 per David's
suggestion. We could potentially develop "any_shared" as an incremental
patchset later on.

> > }
> >
> > /*
> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * We're already holding a reference on the page but haven't mapped it
> > * yet.
> > */
> > - swap_free(entry);
> > + swap_free_nr(entry, nr_pages);
> > if (should_try_to_free_swap(folio, vma, vmf->flags))
> > folio_free_swap(folio);
> >
> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > + folio_ref_add(folio, nr_pages - 1);
> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +
> > pte = mk_pte(page, vma->vm_page_prot);
> >
> > /*
> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * exclusivity.
> > */
> > if (!folio_test_ksm(folio) &&
> > - (exclusive || folio_ref_count(folio) == 1)) {
> > + (exclusive || (folio_ref_count(folio) == nr_pages &&
> > + folio_nr_pages(folio) == nr_pages))) {
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > vmf->flags &= ~FAULT_FLAG_WRITE;
> > }
> > rmap_flags |= RMAP_EXCLUSIVE;
> > }
> > - flush_icache_page(vma, page);
> > + flush_icache_pages(vma, page, nr_pages);
> > if (pte_swp_soft_dirty(vmf->orig_pte))
> > pte = pte_mksoft_dirty(pte);
> > if (pte_swp_uffd_wp(vmf->orig_pte))
> > pte = pte_mkuffd_wp(pte);
>
> I'm not sure about all this... you are smearing these SW bits from the faulting
> PTE across all the ptes you are mapping. Although I guess actually that's ok
> because swap_pte_batch() only returns a batch with all these bits the same?

Initially, I didn't recognize the issue at all because the tested
architecture, arm64, doesn't include these bits. However, after reviewing your
latest swpout series, which verifies that the soft_dirty and uffd_wp bits are
consistent across a batch, I now feel it is safe even for platforms with these
bits.

>
> > - vmf->orig_pte = pte;
>
> Instead of doing a readback below, perhaps:
>
> vmf->orig_pte = pte_advance_pfn(pte, nr_pages);

Nice !

>
> >
> > /* ksm created a completely new copy */
> > if (unlikely(folio != swapcache && swapcache)) {
> > - folio_add_new_anon_rmap(folio, vma, vmf->address);
> > + folio_add_new_anon_rmap(folio, vma, start_address);
> > folio_add_lru_vma(folio, vma);
> > } else {
> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > - rmap_flags);
> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> > + rmap_flags);
> > }
> >
> > VM_BUG_ON(!folio_test_anon(folio) ||
> > (pte_write(pte) && !PageAnonExclusive(page)));
> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > + vmf->orig_pte = ptep_get(vmf->pte);
> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
> >
> > folio_unlock(folio);
> > if (folio != swapcache && swapcache) {
> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > /* No need to invalidate - it was non-present before */
> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> > unlock:
> > if (vmf->pte)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
>

Thanks
Barry

2024-04-12 11:33:51

by Ryan Roberts

Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On 12/04/2024 00:30, Barry Song wrote:
> On Fri, Apr 12, 2024 at 3:33 AM Ryan Roberts <[email protected]> wrote:
>>
>> On 09/04/2024 09:26, Barry Song wrote:
>>> From: Chuanhua Han <[email protected]>
>>>
>>> When a large folio is found in the swapcache, the current implementation
>>> requires calling do_swap_page() nr_pages times, resulting in nr_pages
>>> page faults. This patch opts to map the entire large folio at once to
>>> minimize page faults. Additionally, redundant checks and early exits
>>> for ARM64 MTE restoring are removed.
>>>
>>> Signed-off-by: Chuanhua Han <[email protected]>
>>> Co-developed-by: Barry Song <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>> ---
>>> mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
>>> 1 file changed, 52 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index c4a52e8d740a..9818dc1893c8 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> pte_t pte;
>>> vm_fault_t ret = 0;
>>> void *shadow = NULL;
>>> + int nr_pages = 1;
>>> + unsigned long start_address = vmf->address;
>>> + pte_t *start_pte = vmf->pte;
>>
>> possible bug?: there are code paths that assign to vmf-pte below in this
>> function, so couldn't start_pte be stale in some cases? I'd just do the
>> assignment (all 4 of these variables in fact) in an else clause below, after any
>> messing about with them is complete.
>>
>> nit: rename start_pte -> start_ptep ?
>
> Agreed.
>
>>
>>> + bool any_swap_shared = false;
>>
>> Suggest you defer initialization of this to your "We hit large folios in
>> swapcache" block below, and init it to:
>>
>> any_swap_shared = !pte_swp_exclusive(vmf->pte);
>>
>> Then the any_shared semantic in swap_pte_batch() can be the same as for
>> folio_pte_batch().
>>
>>>
>>> if (!pte_unmap_same(vmf))
>>> goto out;
>>> @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> */
>>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>>> &vmf->ptl);
>>
>> bug: vmf->pte may be NULL and you are not checking it until check_pte:. Byt you
>> are using it in this block. It also seems odd to do all the work in the below
>> block under the PTL but before checking if the pte has changed. Suggest moving
>> both checks here.
>
> agreed.
>
>>
>>> +
>>> + /* We hit large folios in swapcache */
>>> + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>>
>> What's the start_pte check protecting?
>
> This is exactly protecting the case vmf->pte==NULL but for some reason it was
> assigned in the beginning of the function incorrectly. The intention of the code
> was actually doing start_pte = vmf->pte after "vmf->pte = pte_offset_map_lock".
>
>>
>>> + int nr = folio_nr_pages(folio);
>>> + int idx = folio_page_idx(folio, page);
>>> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>>> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>>> + pte_t *folio_ptep;
>>> + pte_t folio_pte;
>>> +
>>> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>>> + goto check_pte;
>>> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>>> + goto check_pte;
>>> +
>>> + folio_ptep = vmf->pte - idx;
>>> + folio_pte = ptep_get(folio_ptep);
>>> + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>>> + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>>> + goto check_pte;
>>> +
>>> + start_address = folio_start;
>>> + start_pte = folio_ptep;
>>> + nr_pages = nr;
>>> + entry = folio->swap;
>>> + page = &folio->page;
>>> + }
>>> +
>>> +check_pte:
>>> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>>> goto out_nomap;
>>>
>>> @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> */
>>> exclusive = false;
>>> }
>>> +
>>> + /* Reuse the whole large folio iff all entries are exclusive */
>>> + if (nr_pages > 1 && any_swap_shared)
>>> + exclusive = false;
>>
>> If you init any_shared with the firt pte as I suggested then you could just set
>> exclusive = !any_shared at the top of this if block without needing this
>> separate fixup.
>
> Since your swap_pte_batch() function checks that all PTEs have the same
> exclusive bits, I'll be removing any_shared first in version 3 per David's
> suggestions. We could potentially develop "any_shared" as an incremental
> patchset later on .

Ahh yes, good point. I'll admit that your conversation about this went over my
head at the time since I hadn't yet looked at this.

>
>>> }
>>>
>>> /*
>>> @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> * We're already holding a reference on the page but haven't mapped it
>>> * yet.
>>> */
>>> - swap_free(entry);
>>> + swap_free_nr(entry, nr_pages);
>>> if (should_try_to_free_swap(folio, vma, vmf->flags))
>>> folio_free_swap(folio);
>>>
>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>>> + folio_ref_add(folio, nr_pages - 1);
>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>>> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>>> +
>>> pte = mk_pte(page, vma->vm_page_prot);
>>>
>>> /*
>>> @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> * exclusivity.
>>> */
>>> if (!folio_test_ksm(folio) &&
>>> - (exclusive || folio_ref_count(folio) == 1)) {
>>> + (exclusive || (folio_ref_count(folio) == nr_pages &&
>>> + folio_nr_pages(folio) == nr_pages))) {
>>> if (vmf->flags & FAULT_FLAG_WRITE) {
>>> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>>> vmf->flags &= ~FAULT_FLAG_WRITE;
>>> }
>>> rmap_flags |= RMAP_EXCLUSIVE;
>>> }
>>> - flush_icache_page(vma, page);
>>> + flush_icache_pages(vma, page, nr_pages);
>>> if (pte_swp_soft_dirty(vmf->orig_pte))
>>> pte = pte_mksoft_dirty(pte);
>>> if (pte_swp_uffd_wp(vmf->orig_pte))
>>> pte = pte_mkuffd_wp(pte);
>>
>> I'm not sure about all this... you are smearing these SW bits from the faulting
>> PTE across all the ptes you are mapping. Although I guess actually that's ok
>> because swap_pte_batch() only returns a batch with all these bits the same?
>
> Initially, I didn't recognize the issue at all because the tested
> architecture arm64
> didn't include these bits. However, after reviewing your latest swpout series,
> which verifies the consistent bits for soft_dirty and uffd_wp, I now
> feel its safety
> even for platforms with these bits.

Yep, agreed.

>
>>
>>> - vmf->orig_pte = pte;
>>
>> Instead of doing a readback below, perhaps:
>>
>> vmf->orig_pte = pte_advance_pfn(pte, nr_pages);
>
> Nice !
>
>>
>>>
>>> /* ksm created a completely new copy */
>>> if (unlikely(folio != swapcache && swapcache)) {
>>> - folio_add_new_anon_rmap(folio, vma, vmf->address);
>>> + folio_add_new_anon_rmap(folio, vma, start_address);
>>> folio_add_lru_vma(folio, vma);
>>> } else {
>>> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>>> - rmap_flags);
>>> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>>> + rmap_flags);
>>> }
>>>
>>> VM_BUG_ON(!folio_test_anon(folio) ||
>>> (pte_write(pte) && !PageAnonExclusive(page)));
>>> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>>> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>>> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
>>> + vmf->orig_pte = ptep_get(vmf->pte);
>>> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>>>
>>> folio_unlock(folio);
>>> if (folio != swapcache && swapcache) {
>>> @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> }
>>>
>>> /* No need to invalidate - it was non-present before */
>>> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>>> unlock:
>>> if (vmf->pte)
>>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>>
>
> Thanks
> Barry


2024-04-15 08:40:06

by Huang, Ying

Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

Barry Song <[email protected]> writes:

> From: Chuanhua Han <[email protected]>
>
> When a large folio is found in the swapcache, the current implementation
> requires calling do_swap_page() nr_pages times, resulting in nr_pages
> page faults. This patch opts to map the entire large folio at once to
> minimize page faults. Additionally, redundant checks and early exits
> for ARM64 MTE restoring are removed.
>
> Signed-off-by: Chuanhua Han <[email protected]>
> Co-developed-by: Barry Song <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 52 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c4a52e8d740a..9818dc1893c8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> pte_t pte;
> vm_fault_t ret = 0;
> void *shadow = NULL;
> + int nr_pages = 1;
> + unsigned long start_address = vmf->address;
> + pte_t *start_pte = vmf->pte;

IMHO, it's better to rename the above 2 local variables to "address" and
"ptep". Just my personal opinion. Feel free to ignore the comments.

> + bool any_swap_shared = false;
>
> if (!pte_unmap_same(vmf))
> goto out;
> @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> &vmf->ptl);

We should move the pte check here. That is,

if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
        goto out_nomap;

This will simplify the situation for large folios.

> +
> + /* We hit large folios in swapcache */

The comment seems unnecessary because the code already says that.

> + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> + int nr = folio_nr_pages(folio);
> + int idx = folio_page_idx(folio, page);
> + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> + pte_t *folio_ptep;
> + pte_t folio_pte;
> +
> + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> + goto check_pte;
> + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> + goto check_pte;
> +
> + folio_ptep = vmf->pte - idx;
> + folio_pte = ptep_get(folio_ptep);

It's better to construct the pte based on the faulting PTE by generalizing
pte_next_swp_offset() (maybe as pte_move_swp_offset()). Then we can find
inconsistent PTEs more quickly.
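
For example (a hypothetical pte_move_swp_offset() modelled on
pte_next_swp_offset(); untested):

/*
 * Return a pte with the swap offset moved by delta, preserving the swp
 * pte software bits, so the expected pte of any entry in the folio can
 * be constructed from the faulting pte without re-reading it.
 */
static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
        swp_entry_t entry = pte_to_swp_entry(pte);
        pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
                                                   swp_offset(entry) + delta));

        if (pte_swp_soft_dirty(pte))
                new = pte_swp_mksoft_dirty(new);
        if (pte_swp_exclusive(pte))
                new = pte_swp_mkexclusive(new);
        if (pte_swp_uffd_wp(pte))
                new = pte_swp_mkuffd_wp(new);

        return new;
}

Then the expected first pte could be compared against something like
pte_move_swp_offset(vmf->orig_pte, -idx) before walking the batch.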

> + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> + goto check_pte;
> +
> + start_address = folio_start;
> + start_pte = folio_ptep;
> + nr_pages = nr;
> + entry = folio->swap;
> + page = &folio->page;
> + }
> +
> +check_pte:
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> exclusive = false;
> }
> +
> + /* Reuse the whole large folio iff all entries are exclusive */
> + if (nr_pages > 1 && any_swap_shared)
> + exclusive = false;
> }
>
> /*
> @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * We're already holding a reference on the page but haven't mapped it
> * yet.
> */
> - swap_free(entry);
> + swap_free_nr(entry, nr_pages);
> if (should_try_to_free_swap(folio, vma, vmf->flags))
> folio_free_swap(folio);
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> + folio_ref_add(folio, nr_pages - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> +
> pte = mk_pte(page, vma->vm_page_prot);
>
> /*
> @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * exclusivity.
> */
> if (!folio_test_ksm(folio) &&
> - (exclusive || folio_ref_count(folio) == 1)) {
> + (exclusive || (folio_ref_count(folio) == nr_pages &&
> + folio_nr_pages(folio) == nr_pages))) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> vmf->flags &= ~FAULT_FLAG_WRITE;
> }
> rmap_flags |= RMAP_EXCLUSIVE;
> }
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr_pages);
> if (pte_swp_soft_dirty(vmf->orig_pte))
> pte = pte_mksoft_dirty(pte);
> if (pte_swp_uffd_wp(vmf->orig_pte))
> pte = pte_mkuffd_wp(pte);
> - vmf->orig_pte = pte;
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_add_new_anon_rmap(folio, vma, start_address);
> folio_add_lru_vma(folio, vma);
> } else {
> - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> - rmap_flags);
> + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> + rmap_flags);
> }
>
> VM_BUG_ON(!folio_test_anon(folio) ||
> (pte_write(pte) && !PageAnonExclusive(page)));
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> + vmf->orig_pte = ptep_get(vmf->pte);
> + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);

Do we need to call arch_do_swap_page() for each subpage? IIUC, the
corresponding arch_unmap_one() will be called for each subpage.

> folio_unlock(folio);
> if (folio != swapcache && swapcache) {
> @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> unlock:
> if (vmf->pte)
> pte_unmap_unlock(vmf->pte, vmf->ptl);

--
Best Regards,
Huang, Ying

2024-04-15 09:11:51

by Barry Song

Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > From: Chuanhua Han <[email protected]>
> >
> > When a large folio is found in the swapcache, the current implementation
> > requires calling do_swap_page() nr_pages times, resulting in nr_pages
> > page faults. This patch opts to map the entire large folio at once to
> > minimize page faults. Additionally, redundant checks and early exits
> > for ARM64 MTE restoring are removed.
> >
> > Signed-off-by: Chuanhua Han <[email protected]>
> > Co-developed-by: Barry Song <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/memory.c | 64 +++++++++++++++++++++++++++++++++++++++++++----------
> > 1 file changed, 52 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c4a52e8d740a..9818dc1893c8 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3947,6 +3947,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > pte_t pte;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> > + int nr_pages = 1;
> > + unsigned long start_address = vmf->address;
> > + pte_t *start_pte = vmf->pte;
>
> IMHO, it's better to rename the above 2 local variables to "address" and
> "ptep". Just my personal opinion. Feel free to ignore the comments.

fine.

>
> > + bool any_swap_shared = false;
> >
> > if (!pte_unmap_same(vmf))
> > goto out;
> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> > &vmf->ptl);
>
> We should move pte check here. That is,
>
> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> goto out_nomap;
>
> This will simplify the situation for large folio.

The plan is to move the whole code block

if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))

after

if (unlikely(!folio_test_uptodate(folio))) {
        ret = VM_FAULT_SIGBUS;
        goto out_nomap;
}

Though !folio_test_uptodate(folio) can't actually be true when we hit the
swapcache, it seems logically better for future use.

>
> > +
> > + /* We hit large folios in swapcache */
>
> The comments seems unnecessary because the code tells that already.
>
> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> > + int nr = folio_nr_pages(folio);
> > + int idx = folio_page_idx(folio, page);
> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > + pte_t *folio_ptep;
> > + pte_t folio_pte;
> > +
> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> > + goto check_pte;
> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> > + goto check_pte;
> > +
> > + folio_ptep = vmf->pte - idx;
> > + folio_pte = ptep_get(folio_ptep);
>
> It's better to construct pte based on fault PTE via generalizing
> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
> inconsistent PTEs quicker.

It seems your point is to derive the pte of page 0 via pte_next_swp_offset();
unfortunately, pte_next_swp_offset() can't go backwards. On the other hand, we
have to check the real pte value of the 0th entry right now, because
swap_pte_batch() only really reads ptes from the 1st entry onwards; it assumes
the pte argument is the real value of the 0th pte entry.

static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
{
        pte_t expected_pte = pte_next_swp_offset(pte);
        const pte_t *end_ptep = start_ptep + max_nr;
        pte_t *ptep = start_ptep + 1;

        VM_WARN_ON(max_nr < 1);
        VM_WARN_ON(!is_swap_pte(pte));
        VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));

        while (ptep < end_ptep) {
                pte = ptep_get(ptep);

                if (!pte_same(pte, expected_pte))
                        break;

                expected_pte = pte_next_swp_offset(expected_pte);
                ptep++;
        }

        return ptep - start_ptep;
}

>
> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> > + goto check_pte;
> > +
> > + start_address = folio_start;
> > + start_pte = folio_ptep;
> > + nr_pages = nr;
> > + entry = folio->swap;
> > + page = &folio->page;
> > + }
> > +
> > +check_pte:
> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > goto out_nomap;
> >
> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > exclusive = false;
> > }
> > +
> > + /* Reuse the whole large folio iff all entries are exclusive */
> > + if (nr_pages > 1 && any_swap_shared)
> > + exclusive = false;
> > }
> >
> > /*
> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * We're already holding a reference on the page but haven't mapped it
> > * yet.
> > */
> > - swap_free(entry);
> > + swap_free_nr(entry, nr_pages);
> > if (should_try_to_free_swap(folio, vma, vmf->flags))
> > folio_free_swap(folio);
> >
> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > + folio_ref_add(folio, nr_pages - 1);
> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > +
> > pte = mk_pte(page, vma->vm_page_prot);
> >
> > /*
> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * exclusivity.
> > */
> > if (!folio_test_ksm(folio) &&
> > - (exclusive || folio_ref_count(folio) == 1)) {
> > + (exclusive || (folio_ref_count(folio) == nr_pages &&
> > + folio_nr_pages(folio) == nr_pages))) {
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > vmf->flags &= ~FAULT_FLAG_WRITE;
> > }
> > rmap_flags |= RMAP_EXCLUSIVE;
> > }
> > - flush_icache_page(vma, page);
> > + flush_icache_pages(vma, page, nr_pages);
> > if (pte_swp_soft_dirty(vmf->orig_pte))
> > pte = pte_mksoft_dirty(pte);
> > if (pte_swp_uffd_wp(vmf->orig_pte))
> > pte = pte_mkuffd_wp(pte);
> > - vmf->orig_pte = pte;
> >
> > /* ksm created a completely new copy */
> > if (unlikely(folio != swapcache && swapcache)) {
> > - folio_add_new_anon_rmap(folio, vma, vmf->address);
> > + folio_add_new_anon_rmap(folio, vma, start_address);
> > folio_add_lru_vma(folio, vma);
> > } else {
> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> > - rmap_flags);
> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> > + rmap_flags);
> > }
> >
> > VM_BUG_ON(!folio_test_anon(folio) ||
> > (pte_write(pte) && !PageAnonExclusive(page)));
> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > + vmf->orig_pte = ptep_get(vmf->pte);
> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>
> Do we need to call arch_do_swap_page() for each subpage? IIUC, the
> corresponding arch_unmap_one() will be called for each subpage.

i actually thought about this very carefully. right now, the only arch that
needs this is sparc, and it doesn't support THP_SWAPOUT at all; and there is
no proof that doing the restoration one by one wouldn't break sparc either.
so i'd like to defer this to when sparc really needs THP_SWAPOUT.
on the other hand, it seems really bad that we have both

arch_swap_restore - for this, arm64 has moved to using folio
and
arch_do_swap_page

we should somehow unify them later if sparc wants THP_SWAPOUT.

>
> > folio_unlock(folio);
> > if (folio != swapcache && swapcache) {
> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > }
> >
> > /* No need to invalidate - it was non-present before */
> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> > + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> > unlock:
> > if (vmf->pte)
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
>
> --
> Best Regards,
> Huang, Ying

2024-04-16 02:27:42

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache


Added Khalid for arch_do_swap_page().

Barry Song <[email protected]> writes:

> On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:

[snip]

>>
>> > + bool any_swap_shared = false;
>> >
>> > if (!pte_unmap_same(vmf))
>> > goto out;
>> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > */
>> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> > &vmf->ptl);
>>
>> We should move pte check here. That is,
>>
>> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> goto out_nomap;
>>
>> This will simplify the situation for large folio.
>
> the plan is moving the whole code block
>
> if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
>
> after
> if (unlikely(!folio_test_uptodate(folio))) {
> ret = VM_FAULT_SIGBUS;
> goto out_nomap;
> }
>
> though we couldn't be !folio_test_uptodate(folio)) for hitting
> swapcache but it seems
> logically better for future use.

LGTM, Thanks!

>>
>> > +
>> > + /* We hit large folios in swapcache */
>>
>> The comments seems unnecessary because the code tells that already.
>>
>> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>> > + int nr = folio_nr_pages(folio);
>> > + int idx = folio_page_idx(folio, page);
>> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>> > + pte_t *folio_ptep;
>> > + pte_t folio_pte;
>> > +
>> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>> > + goto check_pte;
>> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>> > + goto check_pte;
>> > +
>> > + folio_ptep = vmf->pte - idx;
>> > + folio_pte = ptep_get(folio_ptep);
>>
>> It's better to construct pte based on fault PTE via generalizing
>> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
>> inconsistent PTEs quicker.
>
> it seems your point is getting the pte of page0 by pte_next_swp_offset()
> unfortunately pte_next_swp_offset can't go back. on the other hand,
> we have to check the real pte value of the 0nd entry right now because
> swap_pte_batch() only really reads pte from the 1st entry. it assumes
> pte argument is the real value for the 0nd pte entry.
>
> static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> {
> pte_t expected_pte = pte_next_swp_offset(pte);
> const pte_t *end_ptep = start_ptep + max_nr;
> pte_t *ptep = start_ptep + 1;
>
> VM_WARN_ON(max_nr < 1);
> VM_WARN_ON(!is_swap_pte(pte));
> VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>
> while (ptep < end_ptep) {
> pte = ptep_get(ptep);
>
> if (!pte_same(pte, expected_pte))
> break;
>
> expected_pte = pte_next_swp_offset(expected_pte);
> ptep++;
> }
>
> return ptep - start_ptep;
> }

Yes. You are right.

But we may check whether the pte of page0 is the same as
"vmf->orig_pte - folio_page_idx()" (fake code).

You need to check the pte of page 0 anyway.

>>
>> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> > + goto check_pte;
>> > +
>> > + start_address = folio_start;
>> > + start_pte = folio_ptep;
>> > + nr_pages = nr;
>> > + entry = folio->swap;
>> > + page = &folio->page;
>> > + }
>> > +
>> > +check_pte:
>> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> > goto out_nomap;
>> >
>> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > */
>> > exclusive = false;
>> > }
>> > +
>> > + /* Reuse the whole large folio iff all entries are exclusive */
>> > + if (nr_pages > 1 && any_swap_shared)
>> > + exclusive = false;
>> > }
>> >
>> > /*
>> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > * We're already holding a reference on the page but haven't mapped it
>> > * yet.
>> > */
>> > - swap_free(entry);
>> > + swap_free_nr(entry, nr_pages);
>> > if (should_try_to_free_swap(folio, vma, vmf->flags))
>> > folio_free_swap(folio);
>> >
>> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>> > + folio_ref_add(folio, nr_pages - 1);
>> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
>> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
>> > +
>> > pte = mk_pte(page, vma->vm_page_prot);
>> >
>> > /*
>> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > * exclusivity.
>> > */
>> > if (!folio_test_ksm(folio) &&
>> > - (exclusive || folio_ref_count(folio) == 1)) {
>> > + (exclusive || (folio_ref_count(folio) == nr_pages &&
>> > + folio_nr_pages(folio) == nr_pages))) {
>> > if (vmf->flags & FAULT_FLAG_WRITE) {
>> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>> > vmf->flags &= ~FAULT_FLAG_WRITE;
>> > }
>> > rmap_flags |= RMAP_EXCLUSIVE;
>> > }
>> > - flush_icache_page(vma, page);
>> > + flush_icache_pages(vma, page, nr_pages);
>> > if (pte_swp_soft_dirty(vmf->orig_pte))
>> > pte = pte_mksoft_dirty(pte);
>> > if (pte_swp_uffd_wp(vmf->orig_pte))
>> > pte = pte_mkuffd_wp(pte);
>> > - vmf->orig_pte = pte;
>> >
>> > /* ksm created a completely new copy */
>> > if (unlikely(folio != swapcache && swapcache)) {
>> > - folio_add_new_anon_rmap(folio, vma, vmf->address);
>> > + folio_add_new_anon_rmap(folio, vma, start_address);
>> > folio_add_lru_vma(folio, vma);
>> > } else {
>> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
>> > - rmap_flags);
>> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
>> > + rmap_flags);
>> > }
>> >
>> > VM_BUG_ON(!folio_test_anon(folio) ||
>> > (pte_write(pte) && !PageAnonExclusive(page)));
>> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
>> > + vmf->orig_pte = ptep_get(vmf->pte);
>> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
>>
>> Do we need to call arch_do_swap_page() for each subpage? IIUC, the
>> corresponding arch_unmap_one() will be called for each subpage.
>
> i actually thought about this very carefully, right now, the only one who
> needs this is sparc and it doesn't support THP_SWAPOUT at all. and
> there is no proof doing restoration one by one won't really break sparc.
> so i'd like to defer this to when sparc really needs THP_SWAPOUT.

Let's ask SPARC developer (Cced) for this.

IMHO, even if we cannot get help, we need to change the code based on our
own understanding instead of deferring it.

> on the other hand, it seems really bad we have both
> arch_swap_restore - for this, arm64 has moved to using folio
> and
> arch_do_swap_page
>
> we should somehow unify them later if sparc wants THP_SWPOUT.
>
>>
>> > folio_unlock(folio);
>> > if (folio != swapcache && swapcache) {
>> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > }
>> >
>> > /* No need to invalidate - it was non-present before */
>> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>> > + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
>> > unlock:
>> > if (vmf->pte)
>> > pte_unmap_unlock(vmf->pte, vmf->ptl);

--
Best Regards,
Huang, Ying

2024-04-16 02:36:31

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <[email protected]> wrote:
>
>
> Added Khalid for arch_do_swap_page().
>
> Barry Song <[email protected]> writes:
>
> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
>
> [snip]
>
> >>
> >> > + bool any_swap_shared = false;
> >> >
> >> > if (!pte_unmap_same(vmf))
> >> > goto out;
> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > */
> >> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> > &vmf->ptl);
> >>
> >> We should move pte check here. That is,
> >>
> >> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> goto out_nomap;
> >>
> >> This will simplify the situation for large folio.
> >
> > the plan is moving the whole code block
> >
> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
> >
> > after
> > if (unlikely(!folio_test_uptodate(folio))) {
> > ret = VM_FAULT_SIGBUS;
> > goto out_nomap;
> > }
> >
> > though we couldn't be !folio_test_uptodate(folio)) for hitting
> > swapcache but it seems
> > logically better for future use.
>
> LGTM, Thanks!
>
> >>
> >> > +
> >> > + /* We hit large folios in swapcache */
> >>
> >> The comments seems unnecessary because the code tells that already.
> >>
> >> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> >> > + int nr = folio_nr_pages(folio);
> >> > + int idx = folio_page_idx(folio, page);
> >> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >> > + pte_t *folio_ptep;
> >> > + pte_t folio_pte;
> >> > +
> >> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >> > + goto check_pte;
> >> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >> > + goto check_pte;
> >> > +
> >> > + folio_ptep = vmf->pte - idx;
> >> > + folio_pte = ptep_get(folio_ptep);
> >>
> >> It's better to construct pte based on fault PTE via generalizing
> >> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
> >> inconsistent PTEs quicker.
> >
> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
> > unfortunately pte_next_swp_offset can't go back. on the other hand,
> > we have to check the real pte value of the 0nd entry right now because
> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
> > pte argument is the real value for the 0nd pte entry.
> >
> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> > {
> > pte_t expected_pte = pte_next_swp_offset(pte);
> > const pte_t *end_ptep = start_ptep + max_nr;
> > pte_t *ptep = start_ptep + 1;
> >
> > VM_WARN_ON(max_nr < 1);
> > VM_WARN_ON(!is_swap_pte(pte));
> > VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
> >
> > while (ptep < end_ptep) {
> > pte = ptep_get(ptep);
> >
> > if (!pte_same(pte, expected_pte))
> > break;
> >
> > expected_pte = pte_next_swp_offset(expected_pte);
> > ptep++;
> > }
> >
> > return ptep - start_ptep;
> > }
>
> Yes. You are right.
>
> But we may check whether the pte of page0 is same as "vmf->orig_pte -
> folio_page_idx()" (fake code).

right, that is why we are reading and checking PTE0 before calling
swap_pte_batch() right now.

	folio_ptep = vmf->pte - idx;
	folio_pte = ptep_get(folio_ptep);
	if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
	    swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
		goto check_pte;

So, if I understand correctly, you're proposing that we should directly check
PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
However, I'd also like to hear the feedback from Ryan and David :-)

>
> You need to check the pte of page 0 anyway.
>
> >>
> >> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> > + goto check_pte;
> >> > +
> >> > + start_address = folio_start;
> >> > + start_pte = folio_ptep;
> >> > + nr_pages = nr;
> >> > + entry = folio->swap;
> >> > + page = &folio->page;
> >> > + }
> >> > +
> >> > +check_pte:
> >> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> > goto out_nomap;
> >> >
> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > */
> >> > exclusive = false;
> >> > }
> >> > +
> >> > + /* Reuse the whole large folio iff all entries are exclusive */
> >> > + if (nr_pages > 1 && any_swap_shared)
> >> > + exclusive = false;
> >> > }
> >> >
> >> > /*
> >> > @@ -4204,12 +4241,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > * We're already holding a reference on the page but haven't mapped it
> >> > * yet.
> >> > */
> >> > - swap_free(entry);
> >> > + swap_free_nr(entry, nr_pages);
> >> > if (should_try_to_free_swap(folio, vma, vmf->flags))
> >> > folio_free_swap(folio);
> >> >
> >> > - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >> > - dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> >> > + folio_ref_add(folio, nr_pages - 1);
> >> > + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> >> > + add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> >> > +
> >> > pte = mk_pte(page, vma->vm_page_prot);
> >> >
> >> > /*
> >> > @@ -4219,33 +4258,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > * exclusivity.
> >> > */
> >> > if (!folio_test_ksm(folio) &&
> >> > - (exclusive || folio_ref_count(folio) == 1)) {
> >> > + (exclusive || (folio_ref_count(folio) == nr_pages &&
> >> > + folio_nr_pages(folio) == nr_pages))) {
> >> > if (vmf->flags & FAULT_FLAG_WRITE) {
> >> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> >> > vmf->flags &= ~FAULT_FLAG_WRITE;
> >> > }
> >> > rmap_flags |= RMAP_EXCLUSIVE;
> >> > }
> >> > - flush_icache_page(vma, page);
> >> > + flush_icache_pages(vma, page, nr_pages);
> >> > if (pte_swp_soft_dirty(vmf->orig_pte))
> >> > pte = pte_mksoft_dirty(pte);
> >> > if (pte_swp_uffd_wp(vmf->orig_pte))
> >> > pte = pte_mkuffd_wp(pte);
> >> > - vmf->orig_pte = pte;
> >> >
> >> > /* ksm created a completely new copy */
> >> > if (unlikely(folio != swapcache && swapcache)) {
> >> > - folio_add_new_anon_rmap(folio, vma, vmf->address);
> >> > + folio_add_new_anon_rmap(folio, vma, start_address);
> >> > folio_add_lru_vma(folio, vma);
> >> > } else {
> >> > - folio_add_anon_rmap_pte(folio, page, vma, vmf->address,
> >> > - rmap_flags);
> >> > + folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, start_address,
> >> > + rmap_flags);
> >> > }
> >> >
> >> > VM_BUG_ON(!folio_test_anon(folio) ||
> >> > (pte_write(pte) && !PageAnonExclusive(page)));
> >> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> >> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> >> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> >> > + vmf->orig_pte = ptep_get(vmf->pte);
> >> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
> >>
> >> Do we need to call arch_do_swap_page() for each subpage? IIUC, the
> >> corresponding arch_unmap_one() will be called for each subpage.
> >
> > i actually thought about this very carefully, right now, the only one who
> > needs this is sparc and it doesn't support THP_SWAPOUT at all. and
> > there is no proof doing restoration one by one won't really break sparc.
> > so i'd like to defer this to when sparc really needs THP_SWAPOUT.
>
> Let's ask SPARC developer (Cced) for this.
>
> IMHO, even if we cannot get help, we need to change code with our
> understanding instead of deferring it.

ok. Thanks for Ccing sparc developers.

>
> > on the other hand, it seems really bad we have both
> > arch_swap_restore - for this, arm64 has moved to using folio
> > and
> > arch_do_swap_page
> >
> > we should somehow unify them later if sparc wants THP_SWPOUT.
> >
> >>
> >> > folio_unlock(folio);
> >> > if (folio != swapcache && swapcache) {
> >> > @@ -4269,7 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > }
> >> >
> >> > /* No need to invalidate - it was non-present before */
> >> > - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
> >> > + update_mmu_cache_range(vmf, vma, start_address, start_pte, nr_pages);
> >> > unlock:
> >> > if (vmf->pte)
> >> > pte_unmap_unlock(vmf->pte, vmf->ptl);
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

2024-04-16 02:52:01

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

Barry Song <[email protected]> writes:

> On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <[email protected]> wrote:
>>
>>
>> Added Khalid for arch_do_swap_page().
>>
>> Barry Song <[email protected]> writes:
>>
>> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
>> >>
>> >> Barry Song <[email protected]> writes:
>>
>> [snip]
>>
>> >>
>> >> > + bool any_swap_shared = false;
>> >> >
>> >> > if (!pte_unmap_same(vmf))
>> >> > goto out;
>> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> > */
>> >> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> >> > &vmf->ptl);
>> >>
>> >> We should move pte check here. That is,
>> >>
>> >> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >> goto out_nomap;
>> >>
>> >> This will simplify the situation for large folio.
>> >
>> > the plan is moving the whole code block
>> >
>> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
>> >
>> > after
>> > if (unlikely(!folio_test_uptodate(folio))) {
>> > ret = VM_FAULT_SIGBUS;
>> > goto out_nomap;
>> > }
>> >
>> > though we couldn't be !folio_test_uptodate(folio)) for hitting
>> > swapcache but it seems
>> > logically better for future use.
>>
>> LGTM, Thanks!
>>
>> >>
>> >> > +
>> >> > + /* We hit large folios in swapcache */
>> >>
>> >> The comments seems unnecessary because the code tells that already.
>> >>
>> >> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>> >> > + int nr = folio_nr_pages(folio);
>> >> > + int idx = folio_page_idx(folio, page);
>> >> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>> >> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>> >> > + pte_t *folio_ptep;
>> >> > + pte_t folio_pte;
>> >> > +
>> >> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>> >> > + goto check_pte;
>> >> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>> >> > + goto check_pte;
>> >> > +
>> >> > + folio_ptep = vmf->pte - idx;
>> >> > + folio_pte = ptep_get(folio_ptep);
>> >>
>> >> It's better to construct pte based on fault PTE via generalizing
>> >> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
>> >> inconsistent PTEs quicker.
>> >
>> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
>> > unfortunately pte_next_swp_offset can't go back. on the other hand,
>> > we have to check the real pte value of the 0nd entry right now because
>> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
>> > pte argument is the real value for the 0nd pte entry.
>> >
>> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
>> > {
>> > pte_t expected_pte = pte_next_swp_offset(pte);
>> > const pte_t *end_ptep = start_ptep + max_nr;
>> > pte_t *ptep = start_ptep + 1;
>> >
>> > VM_WARN_ON(max_nr < 1);
>> > VM_WARN_ON(!is_swap_pte(pte));
>> > VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>> >
>> > while (ptep < end_ptep) {
>> > pte = ptep_get(ptep);
>> >
>> > if (!pte_same(pte, expected_pte))
>> > break;
>> >
>> > expected_pte = pte_next_swp_offset(expected_pte);
>> > ptep++;
>> > }
>> >
>> > return ptep - start_ptep;
>> > }
>>
>> Yes. You are right.
>>
>> But we may check whether the pte of page0 is same as "vmf->orig_pte -
>> folio_page_idx()" (fake code).
>
> right, that is why we are reading and checking PTE0 before calling
> swap_pte_batch()
> right now.
>
> folio_ptep = vmf->pte - idx;
> folio_pte = ptep_get(folio_ptep);
> if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> goto check_pte;
>
> So, if I understand correctly, you're proposing that we should directly check
> PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
> However, I'd also like to hear the feedback from Ryan and David :-)

I mean that we can replace

!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))

in the above code with a pte_same() check against a constructed expected
first pte.
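
For example, with a generalized helper such as pte_move_swp_offset()
(suggested earlier; illustrative only), the PTE0 check could be sketched as:

	folio_ptep = vmf->pte - idx;
	folio_pte = ptep_get(folio_ptep);
	/* one pte_same() against the constructed PTE0 replaces the two checks */
	if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
	    swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
		goto check_pte;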

>>
>> You need to check the pte of page 0 anyway.
>>
>> >>
>> >> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> >> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> >> > + goto check_pte;
>> >> > +
>> >> > + start_address = folio_start;
>> >> > + start_pte = folio_ptep;
>> >> > + nr_pages = nr;
>> >> > + entry = folio->swap;
>> >> > + page = &folio->page;
>> >> > + }
>> >> > +
>> >> > +check_pte:
>> >> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >> > goto out_nomap;
>> >> >
>> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> > */
>> >> > exclusive = false;
>> >> > }
>> >> > +
>> >> > + /* Reuse the whole large folio iff all entries are exclusive */
>> >> > + if (nr_pages > 1 && any_swap_shared)
>> >> > + exclusive = false;
>> >> > }
>> >> >

[snip]

--
Best Regards,
Huang, Ying

2024-04-16 02:52:45

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On Tue, Apr 16, 2024 at 2:41 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <[email protected]> wrote:
> >>
> >>
> >> Added Khalid for arch_do_swap_page().
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
> >> >>
> >> >> Barry Song <[email protected]> writes:
> >>
> >> [snip]
> >>
> >> >>
> >> >> > + bool any_swap_shared = false;
> >> >> >
> >> >> > if (!pte_unmap_same(vmf))
> >> >> > goto out;
> >> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> > */
> >> >> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> >> > &vmf->ptl);
> >> >>
> >> >> We should move pte check here. That is,
> >> >>
> >> >> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >> goto out_nomap;
> >> >>
> >> >> This will simplify the situation for large folio.
> >> >
> >> > the plan is moving the whole code block
> >> >
> >> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
> >> >
> >> > after
> >> > if (unlikely(!folio_test_uptodate(folio))) {
> >> > ret = VM_FAULT_SIGBUS;
> >> > goto out_nomap;
> >> > }
> >> >
> >> > though we couldn't be !folio_test_uptodate(folio)) for hitting
> >> > swapcache but it seems
> >> > logically better for future use.
> >>
> >> LGTM, Thanks!
> >>
> >> >>
> >> >> > +
> >> >> > + /* We hit large folios in swapcache */
> >> >>
> >> >> The comments seems unnecessary because the code tells that already.
> >> >>
> >> >> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> >> >> > + int nr = folio_nr_pages(folio);
> >> >> > + int idx = folio_page_idx(folio, page);
> >> >> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >> >> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >> >> > + pte_t *folio_ptep;
> >> >> > + pte_t folio_pte;
> >> >> > +
> >> >> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >> >> > + goto check_pte;
> >> >> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >> >> > + goto check_pte;
> >> >> > +
> >> >> > + folio_ptep = vmf->pte - idx;
> >> >> > + folio_pte = ptep_get(folio_ptep);
> >> >>
> >> >> It's better to construct pte based on fault PTE via generalizing
> >> >> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
> >> >> inconsistent PTEs quicker.
> >> >
> >> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
> >> > unfortunately pte_next_swp_offset can't go back. on the other hand,
> >> > we have to check the real pte value of the 0nd entry right now because
> >> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
> >> > pte argument is the real value for the 0nd pte entry.
> >> >
> >> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> >> > {
> >> > pte_t expected_pte = pte_next_swp_offset(pte);
> >> > const pte_t *end_ptep = start_ptep + max_nr;
> >> > pte_t *ptep = start_ptep + 1;
> >> >
> >> > VM_WARN_ON(max_nr < 1);
> >> > VM_WARN_ON(!is_swap_pte(pte));
> >> > VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
> >> >
> >> > while (ptep < end_ptep) {
> >> > pte = ptep_get(ptep);
> >> >
> >> > if (!pte_same(pte, expected_pte))
> >> > break;
> >> >
> >> > expected_pte = pte_next_swp_offset(expected_pte);
> >> > ptep++;
> >> > }
> >> >
> >> > return ptep - start_ptep;
> >> > }
> >>
> >> Yes. You are right.
> >>
> >> But we may check whether the pte of page0 is same as "vmf->orig_pte -
> >> folio_page_idx()" (fake code).
> >
> > right, that is why we are reading and checking PTE0 before calling
> > swap_pte_batch()
> > right now.
> >
> > folio_ptep = vmf->pte - idx;
> > folio_pte = ptep_get(folio_ptep);
> > if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> > swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> > goto check_pte;
> >
> > So, if I understand correctly, you're proposing that we should directly check
> > PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
> > However, I'd also like to hear the feedback from Ryan and David :-)
>
> I mean that we can replace
>
> !is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))
>
> in above code with pte_same() with constructed expected first pte.

Got it. It could be quite tricky, especially with considerations like
pte_swp_soft_dirty, pte_swp_exclusive, and pte_swp_uffd_wp. We might
require a helper function similar to pte_next_swp_offset() but capable of
moving both forward and backward. For instance:

pte_move_swp_offset(pte_t pte, long delta)

pte_next_swp_offset() could then simply call it as:
pte_move_swp_offset(pte, 1);
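
A rough sketch of what I have in mind (illustrative only, mirroring the swap
pte bit handling that pte_next_swp_offset() already does):

static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
	swp_entry_t entry = pte_to_swp_entry(pte);
	pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
						   (swp_offset(entry) + delta)));

	/* preserve the software swap pte bits */
	if (pte_swp_soft_dirty(pte))
		new = pte_swp_mksoft_dirty(new);
	if (pte_swp_exclusive(pte))
		new = pte_swp_mkexclusive(new);
	if (pte_swp_uffd_wp(pte))
		new = pte_swp_mkuffd_wp(new);

	return new;
}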

Is that what you are proposing?

>
> >>
> >> You need to check the pte of page 0 anyway.
> >>
> >> >>
> >> >> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> >> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> >> > + goto check_pte;
> >> >> > +
> >> >> > + start_address = folio_start;
> >> >> > + start_pte = folio_ptep;
> >> >> > + nr_pages = nr;
> >> >> > + entry = folio->swap;
> >> >> > + page = &folio->page;
> >> >> > + }
> >> >> > +
> >> >> > +check_pte:
> >> >> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >> > goto out_nomap;
> >> >> >
> >> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> > */
> >> >> > exclusive = false;
> >> >> > }
> >> >> > +
> >> >> > + /* Reuse the whole large folio iff all entries are exclusive */
> >> >> > + if (nr_pages > 1 && any_swap_shared)
> >> >> > + exclusive = false;
> >> >> > }
> >> >> >
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying

2024-04-16 03:19:57

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

Barry Song <[email protected]> writes:

> On Tue, Apr 16, 2024 at 2:41 PM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:
>>
>> > On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <[email protected]> wrote:
>> >>
>> >>
>> >> Added Khalid for arch_do_swap_page().
>> >>
>> >> Barry Song <[email protected]> writes:
>> >>
>> >> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
>> >> >>
>> >> >> Barry Song <[email protected]> writes:
>> >>
>> >> [snip]
>> >>
>> >> >>
>> >> >> > + bool any_swap_shared = false;
>> >> >> >
>> >> >> > if (!pte_unmap_same(vmf))
>> >> >> > goto out;
>> >> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> >> > */
>> >> >> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> >> >> > &vmf->ptl);
>> >> >>
>> >> >> We should move pte check here. That is,
>> >> >>
>> >> >> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >> >> goto out_nomap;
>> >> >>
>> >> >> This will simplify the situation for large folio.
>> >> >
>> >> > the plan is moving the whole code block
>> >> >
>> >> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
>> >> >
>> >> > after
>> >> > if (unlikely(!folio_test_uptodate(folio))) {
>> >> > ret = VM_FAULT_SIGBUS;
>> >> > goto out_nomap;
>> >> > }
>> >> >
>> >> > though we couldn't be !folio_test_uptodate(folio)) for hitting
>> >> > swapcache but it seems
>> >> > logically better for future use.
>> >>
>> >> LGTM, Thanks!
>> >>
>> >> >>
>> >> >> > +
>> >> >> > + /* We hit large folios in swapcache */
>> >> >>
>> >> >> The comments seems unnecessary because the code tells that already.
>> >> >>
>> >> >> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
>> >> >> > + int nr = folio_nr_pages(folio);
>> >> >> > + int idx = folio_page_idx(folio, page);
>> >> >> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
>> >> >> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
>> >> >> > + pte_t *folio_ptep;
>> >> >> > + pte_t folio_pte;
>> >> >> > +
>> >> >> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
>> >> >> > + goto check_pte;
>> >> >> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
>> >> >> > + goto check_pte;
>> >> >> > +
>> >> >> > + folio_ptep = vmf->pte - idx;
>> >> >> > + folio_pte = ptep_get(folio_ptep);
>> >> >>
>> >> >> It's better to construct pte based on fault PTE via generalizing
>> >> >> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
>> >> >> inconsistent PTEs quicker.
>> >> >
>> >> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
>> >> > unfortunately pte_next_swp_offset can't go back. on the other hand,
>> >> > we have to check the real pte value of the 0nd entry right now because
>> >> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
>> >> > pte argument is the real value for the 0nd pte entry.
>> >> >
>> >> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
>> >> > {
>> >> > pte_t expected_pte = pte_next_swp_offset(pte);
>> >> > const pte_t *end_ptep = start_ptep + max_nr;
>> >> > pte_t *ptep = start_ptep + 1;
>> >> >
>> >> > VM_WARN_ON(max_nr < 1);
>> >> > VM_WARN_ON(!is_swap_pte(pte));
>> >> > VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
>> >> >
>> >> > while (ptep < end_ptep) {
>> >> > pte = ptep_get(ptep);
>> >> >
>> >> > if (!pte_same(pte, expected_pte))
>> >> > break;
>> >> >
>> >> > expected_pte = pte_next_swp_offset(expected_pte);
>> >> > ptep++;
>> >> > }
>> >> >
>> >> > return ptep - start_ptep;
>> >> > }
>> >>
>> >> Yes. You are right.
>> >>
>> >> But we may check whether the pte of page0 is same as "vmf->orig_pte -
>> >> folio_page_idx()" (fake code).
>> >
>> > right, that is why we are reading and checking PTE0 before calling
>> > swap_pte_batch()
>> > right now.
>> >
>> > folio_ptep = vmf->pte - idx;
>> > folio_pte = ptep_get(folio_ptep);
>> > if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> > swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> > goto check_pte;
>> >
>> > So, if I understand correctly, you're proposing that we should directly check
>> > PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
>> > However, I'd also like to hear the feedback from Ryan and David :-)
>>
>> I mean that we can replace
>>
>> !is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))
>>
>> in above code with pte_same() with constructed expected first pte.
>
> Got it. It could be quite tricky, especially with considerations like
> pte_swp_soft_dirty, pte_swp_exclusive, and pte_swp_uffd_wp. We might
> require a helper function similar to pte_next_swp_offset() but capable of
> moving both forward and backward. For instance:
>
> pte_move_swp_offset(pte_t pte, long delta)
>
> pte_next_swp_offset can insteadly call it by:
> pte_move_swp_offset(pte, 1);
>
> Is it what you are proposing?

Yes. Exactly.

--
Best Regards,
Huang, Ying

>>
>> >>
>> >> You need to check the pte of page 0 anyway.
>> >>
>> >> >>
>> >> >> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
>> >> >> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
>> >> >> > + goto check_pte;
>> >> >> > +
>> >> >> > + start_address = folio_start;
>> >> >> > + start_pte = folio_ptep;
>> >> >> > + nr_pages = nr;
>> >> >> > + entry = folio->swap;
>> >> >> > + page = &folio->page;
>> >> >> > + }
>> >> >> > +
>> >> >> > +check_pte:
>> >> >> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
>> >> >> > goto out_nomap;
>> >> >> >
>> >> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> >> >> > */
>> >> >> > exclusive = false;
>> >> >> > }
>> >> >> > +
>> >> >> > + /* Reuse the whole large folio iff all entries are exclusive */
>> >> >> > + if (nr_pages > 1 && any_swap_shared)
>> >> >> > + exclusive = false;
>> >> >> > }
>> >> >> >
>>
>> [snip]
>>
>> --
>> Best Regards,
>> Huang, Ying

2024-04-16 04:40:30

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

On Tue, Apr 16, 2024 at 3:19 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Tue, Apr 16, 2024 at 2:41 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >> > On Tue, Apr 16, 2024 at 2:27 PM Huang, Ying <[email protected]> wrote:
> >> >>
> >> >>
> >> >> Added Khalid for arch_do_swap_page().
> >> >>
> >> >> Barry Song <[email protected]> writes:
> >> >>
> >> >> > On Mon, Apr 15, 2024 at 8:39 PM Huang, Ying <[email protected]> wrote:
> >> >> >>
> >> >> >> Barry Song <[email protected]> writes:
> >> >>
> >> >> [snip]
> >> >>
> >> >> >>
> >> >> >> > + bool any_swap_shared = false;
> >> >> >> >
> >> >> >> > if (!pte_unmap_same(vmf))
> >> >> >> > goto out;
> >> >> >> > @@ -4137,6 +4141,35 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> >> > */
> >> >> >> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> >> >> > &vmf->ptl);
> >> >> >>
> >> >> >> We should move pte check here. That is,
> >> >> >>
> >> >> >> if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >> >> goto out_nomap;
> >> >> >>
> >> >> >> This will simplify the situation for large folio.
> >> >> >
> >> >> > the plan is moving the whole code block
> >> >> >
> >> >> > if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio))
> >> >> >
> >> >> > after
> >> >> > if (unlikely(!folio_test_uptodate(folio))) {
> >> >> > ret = VM_FAULT_SIGBUS;
> >> >> > goto out_nomap;
> >> >> > }
> >> >> >
> >> >> > though we couldn't be !folio_test_uptodate(folio)) for hitting
> >> >> > swapcache but it seems
> >> >> > logically better for future use.
> >> >>
> >> >> LGTM, Thanks!
> >> >>
> >> >> >>
> >> >> >> > +
> >> >> >> > + /* We hit large folios in swapcache */
> >> >> >>
> >> >> >> The comments seems unnecessary because the code tells that already.
> >> >> >>
> >> >> >> > + if (start_pte && folio_test_large(folio) && folio_test_swapcache(folio)) {
> >> >> >> > + int nr = folio_nr_pages(folio);
> >> >> >> > + int idx = folio_page_idx(folio, page);
> >> >> >> > + unsigned long folio_start = vmf->address - idx * PAGE_SIZE;
> >> >> >> > + unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> >> >> >> > + pte_t *folio_ptep;
> >> >> >> > + pte_t folio_pte;
> >> >> >> > +
> >> >> >> > + if (unlikely(folio_start < max(vmf->address & PMD_MASK, vma->vm_start)))
> >> >> >> > + goto check_pte;
> >> >> >> > + if (unlikely(folio_end > pmd_addr_end(vmf->address, vma->vm_end)))
> >> >> >> > + goto check_pte;
> >> >> >> > +
> >> >> >> > + folio_ptep = vmf->pte - idx;
> >> >> >> > + folio_pte = ptep_get(folio_ptep);
> >> >> >>
> >> >> >> It's better to construct pte based on fault PTE via generalizing
> >> >> >> pte_next_swp_offset() (may be pte_move_swp_offset()). Then we can find
> >> >> >> inconsistent PTEs quicker.
> >> >> >
> >> >> > it seems your point is getting the pte of page0 by pte_next_swp_offset()
> >> >> > unfortunately pte_next_swp_offset can't go back. on the other hand,
> >> >> > we have to check the real pte value of the 0nd entry right now because
> >> >> > swap_pte_batch() only really reads pte from the 1st entry. it assumes
> >> >> > pte argument is the real value for the 0nd pte entry.
> >> >> >
> >> >> > static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
> >> >> > {
> >> >> > pte_t expected_pte = pte_next_swp_offset(pte);
> >> >> > const pte_t *end_ptep = start_ptep + max_nr;
> >> >> > pte_t *ptep = start_ptep + 1;
> >> >> >
> >> >> > VM_WARN_ON(max_nr < 1);
> >> >> > VM_WARN_ON(!is_swap_pte(pte));
> >> >> > VM_WARN_ON(non_swap_entry(pte_to_swp_entry(pte)));
> >> >> >
> >> >> > while (ptep < end_ptep) {
> >> >> > pte = ptep_get(ptep);
> >> >> >
> >> >> > if (!pte_same(pte, expected_pte))
> >> >> > break;
> >> >> >
> >> >> > expected_pte = pte_next_swp_offset(expected_pte);
> >> >> > ptep++;
> >> >> > }
> >> >> >
> >> >> > return ptep - start_ptep;
> >> >> > }
> >> >>
> >> >> Yes. You are right.
> >> >>
> >> >> But we may check whether the pte of page0 is same as "vmf->orig_pte -
> >> >> folio_page_idx()" (fake code).
> >> >
> >> > right, that is why we are reading and checking PTE0 before calling
> >> > swap_pte_batch()
> >> > right now.
> >> >
> >> > folio_ptep = vmf->pte - idx;
> >> > folio_pte = ptep_get(folio_ptep);
> >> > if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> > swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> > goto check_pte;
> >> >
> >> > So, if I understand correctly, you're proposing that we should directly check
> >> > PTE0 in swap_pte_batch(). Personally, I don't have any objections to this idea.
> >> > However, I'd also like to hear the feedback from Ryan and David :-)
> >>
> >> I mean that we can replace
> >>
> >> !is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte))
> >>
> >> in above code with pte_same() with constructed expected first pte.
> >
> > Got it. It could be quite tricky, especially with considerations like
> > pte_swp_soft_dirty, pte_swp_exclusive, and pte_swp_uffd_wp. We might
> > require a helper function similar to pte_next_swp_offset() but capable of
> > moving both forward and backward. For instance:
> >
> > pte_move_swp_offset(pte_t pte, long delta)
> >
> > pte_next_swp_offset can insteadly call it by:
> > pte_move_swp_offset(pte, 1);
> >
> > Is it what you are proposing?
>
> Yes. Exactly.

Great. I agree that this appears to be much cleaner than the current code.

>
> --
> Best Regards,
> Huang, Ying
>
> >>
> >> >>
> >> >> You need to check the pte of page 0 anyway.
> >> >>
> >> >> >>
> >> >> >> > + if (!is_swap_pte(folio_pte) || non_swap_entry(pte_to_swp_entry(folio_pte)) ||
> >> >> >> > + swap_pte_batch(folio_ptep, nr, folio_pte, &any_swap_shared) != nr)
> >> >> >> > + goto check_pte;
> >> >> >> > +
> >> >> >> > + start_address = folio_start;
> >> >> >> > + start_pte = folio_ptep;
> >> >> >> > + nr_pages = nr;
> >> >> >> > + entry = folio->swap;
> >> >> >> > + page = &folio->page;
> >> >> >> > + }
> >> >> >> > +
> >> >> >> > +check_pte:
> >> >> >> > if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> >> >> >> > goto out_nomap;
> >> >> >> >
> >> >> >> > @@ -4190,6 +4223,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> >> >> > */
> >> >> >> > exclusive = false;
> >> >> >> > }
> >> >> >> > +
> >> >> >> > + /* Reuse the whole large folio iff all entries are exclusive */
> >> >> >> > + if (nr_pages > 1 && any_swap_shared)
> >> >> >> > + exclusive = false;
> >> >> >> > }
> >> >> >> >
> >>
> >> [snip]
> >>
> >> --
> >> Best Regards,
> >> Huang, Ying

2024-04-17 00:47:15

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

Barry Song <[email protected]> writes:

> From: Barry Song <[email protected]>
>
> Currently, we are handling the scenario where we've hit a
> large folio in the swapcache, and the reclaiming process
> for this large folio is still ongoing.
>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/huge_mm.h | 1 +
> mm/huge_memory.c | 2 ++
> mm/memory.c | 1 +
> 3 files changed, 4 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c8256af83e33..b67294d5814f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -269,6 +269,7 @@ enum mthp_stat_item {
> MTHP_STAT_ANON_ALLOC_FALLBACK,
> MTHP_STAT_ANON_SWPOUT,
> MTHP_STAT_ANON_SWPOUT_FALLBACK,
> + MTHP_STAT_ANON_SWPIN_REFAULT,

This is different from the refault concept used in other places in the mm
subsystem. Please check the following code

	if (shadow)
		workingset_refault(folio, shadow);

in __read_swap_cache_async().

> __MTHP_STAT_COUNT
> };

--
Best Regards,
Huang, Ying

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d8d2ed80b0bf..fb95345b0bde 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>
> static struct attribute *stats_attrs[] = {
> &anon_alloc_attr.attr,
> &anon_alloc_fallback_attr.attr,
> &anon_swpout_attr.attr,
> &anon_swpout_fallback_attr.attr,
> + &anon_swpin_refault_attr.attr,
> NULL,
> };
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9818dc1893c8..acc023795a4d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> nr_pages = nr;
> entry = folio->swap;
> page = &folio->page;
> + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> }
>
> check_pte:

2024-04-17 01:16:26

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

On Wed, Apr 17, 2024 at 8:47 AM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > From: Barry Song <[email protected]>
> >
> > Currently, we are handling the scenario where we've hit a
> > large folio in the swapcache, and the reclaiming process
> > for this large folio is still ongoing.
> >
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > include/linux/huge_mm.h | 1 +
> > mm/huge_memory.c | 2 ++
> > mm/memory.c | 1 +
> > 3 files changed, 4 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index c8256af83e33..b67294d5814f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
> > MTHP_STAT_ANON_ALLOC_FALLBACK,
> > MTHP_STAT_ANON_SWPOUT,
> > MTHP_STAT_ANON_SWPOUT_FALLBACK,
> > + MTHP_STAT_ANON_SWPIN_REFAULT,
>
> This is different from the refault concept used in other place in mm
> subystem. Please check the following code
>
> if (shadow)
> workingset_refault(folio, shadow);
>
> in __read_swap_cache_async().

right. it is slightly different, as refault also covers the case where folios
have been entirely released and a new page fault happens soon after.
Do you have a better name for this?
MTHP_STAT_ANON_SWPIN_UNDER_RECLAIM
or
MTHP_STAT_ANON_SWPIN_RECLAIMING ?

>
> > __MTHP_STAT_COUNT
> > };
>
> --
> Best Regards,
> Huang, Ying
>
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d8d2ed80b0bf..fb95345b0bde 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> > DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> > DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> > DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
> >
> > static struct attribute *stats_attrs[] = {
> > &anon_alloc_attr.attr,
> > &anon_alloc_fallback_attr.attr,
> > &anon_swpout_attr.attr,
> > &anon_swpout_fallback_attr.attr,
> > + &anon_swpin_refault_attr.attr,
> > NULL,
> > };
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9818dc1893c8..acc023795a4d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > nr_pages = nr;
> > entry = folio->swap;
> > page = &folio->page;
> > + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> > }
> >
> > check_pte:

2024-04-17 01:40:39

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

Barry Song <[email protected]> writes:

> On Wed, Apr 17, 2024 at 8:47 AM Huang, Ying <[email protected]> wrote:
>>
>> Barry Song <[email protected]> writes:
>>
>> > From: Barry Song <[email protected]>
>> >
>> > Currently, we are handling the scenario where we've hit a
>> > large folio in the swapcache, and the reclaiming process
>> > for this large folio is still ongoing.
>> >
>> > Signed-off-by: Barry Song <[email protected]>
>> > ---
>> > include/linux/huge_mm.h | 1 +
>> > mm/huge_memory.c | 2 ++
>> > mm/memory.c | 1 +
>> > 3 files changed, 4 insertions(+)
>> >
>> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> > index c8256af83e33..b67294d5814f 100644
>> > --- a/include/linux/huge_mm.h
>> > +++ b/include/linux/huge_mm.h
>> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
>> > MTHP_STAT_ANON_ALLOC_FALLBACK,
>> > MTHP_STAT_ANON_SWPOUT,
>> > MTHP_STAT_ANON_SWPOUT_FALLBACK,
>> > + MTHP_STAT_ANON_SWPIN_REFAULT,
>>
>> This is different from the refault concept used in other place in mm
>> subystem. Please check the following code
>>
>> if (shadow)
>> workingset_refault(folio, shadow);
>>
>> in __read_swap_cache_async().
>
> right. it is slightly different as refault can also cover the case folios
> have been entirely released and a new page fault happens soon
> after it.
> Do you have a better name for this?
> MTHP_STAT_ANON_SWPIN_UNDER_RECLAIM
> or
> MTHP_STAT_ANON_SWPIN_RECLAIMING ?

TBH, I don't think we need this counter. It's important for you during
implementation. But I don't think it's important for end users.

--
Best Regards,
Huang, Ying

>>
>> > __MTHP_STAT_COUNT
>> > };
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> > index d8d2ed80b0bf..fb95345b0bde 100644
>> > --- a/mm/huge_memory.c
>> > +++ b/mm/huge_memory.c
>> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
>> > DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
>> > DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
>> > DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
>> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
>> >
>> > static struct attribute *stats_attrs[] = {
>> > &anon_alloc_attr.attr,
>> > &anon_alloc_fallback_attr.attr,
>> > &anon_swpout_attr.attr,
>> > &anon_swpout_fallback_attr.attr,
>> > + &anon_swpin_refault_attr.attr,
>> > NULL,
>> > };
>> >
>> > diff --git a/mm/memory.c b/mm/memory.c
>> > index 9818dc1893c8..acc023795a4d 100644
>> > --- a/mm/memory.c
>> > +++ b/mm/memory.c
>> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> > nr_pages = nr;
>> > entry = folio->swap;
>> > page = &folio->page;
>> > + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
>> > }
>> >
>> > check_pte:

2024-04-17 01:48:56

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] mm: add per-order mTHP swpin_refault counter

On Wed, Apr 17, 2024 at 1:40 PM Huang, Ying <[email protected]> wrote:
>
> Barry Song <[email protected]> writes:
>
> > On Wed, Apr 17, 2024 at 8:47 AM Huang, Ying <[email protected]> wrote:
> >>
> >> Barry Song <[email protected]> writes:
> >>
> >> > From: Barry Song <[email protected]>
> >> >
> >> > Currently, we are handling the scenario where we've hit a
> >> > large folio in the swapcache, and the reclaiming process
> >> > for this large folio is still ongoing.
> >> >
> >> > Signed-off-by: Barry Song <[email protected]>
> >> > ---
> >> > include/linux/huge_mm.h | 1 +
> >> > mm/huge_memory.c | 2 ++
> >> > mm/memory.c | 1 +
> >> > 3 files changed, 4 insertions(+)
> >> >
> >> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> > index c8256af83e33..b67294d5814f 100644
> >> > --- a/include/linux/huge_mm.h
> >> > +++ b/include/linux/huge_mm.h
> >> > @@ -269,6 +269,7 @@ enum mthp_stat_item {
> >> > MTHP_STAT_ANON_ALLOC_FALLBACK,
> >> > MTHP_STAT_ANON_SWPOUT,
> >> > MTHP_STAT_ANON_SWPOUT_FALLBACK,
> >> > + MTHP_STAT_ANON_SWPIN_REFAULT,
> >>
> >> This is different from the refault concept used in other place in mm
> >> subystem. Please check the following code
> >>
> >> if (shadow)
> >> workingset_refault(folio, shadow);
> >>
> >> in __read_swap_cache_async().
> >
> > right. it is slightly different as refault can also cover the case folios
> > have been entirely released and a new page fault happens soon
> > after it.
> > Do you have a better name for this?
> > MTHP_STAT_ANON_SWPIN_UNDER_RECLAIM
> > or
> > MTHP_STAT_ANON_SWPIN_RECLAIMING ?
>
> TBH, I don't think we need this counter. It's important for you during
> implementation. But I don't think it's important for end users.

Okay. If we can't find a shared interest between the implementer and the
user, I'm perfectly fine with keeping it local, for debugging purposes only.

>
> --
> Best Regards,
> Huang, Ying
>
> >>
> >> > __MTHP_STAT_COUNT
> >> > };
> >>
> >> --
> >> Best Regards,
> >> Huang, Ying
> >>
> >> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> > index d8d2ed80b0bf..fb95345b0bde 100644
> >> > --- a/mm/huge_memory.c
> >> > +++ b/mm/huge_memory.c
> >> > @@ -556,12 +556,14 @@ DEFINE_MTHP_STAT_ATTR(anon_alloc, MTHP_STAT_ANON_ALLOC);
> >> > DEFINE_MTHP_STAT_ATTR(anon_alloc_fallback, MTHP_STAT_ANON_ALLOC_FALLBACK);
> >> > DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
> >> > DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
> >> > +DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
> >> >
> >> > static struct attribute *stats_attrs[] = {
> >> > &anon_alloc_attr.attr,
> >> > &anon_alloc_fallback_attr.attr,
> >> > &anon_swpout_attr.attr,
> >> > &anon_swpout_fallback_attr.attr,
> >> > + &anon_swpin_refault_attr.attr,
> >> > NULL,
> >> > };
> >> >
> >> > diff --git a/mm/memory.c b/mm/memory.c
> >> > index 9818dc1893c8..acc023795a4d 100644
> >> > --- a/mm/memory.c
> >> > +++ b/mm/memory.c
> >> > @@ -4167,6 +4167,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> > nr_pages = nr;
> >> > entry = folio->swap;
> >> > page = &folio->page;
> >> > + count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_SWPIN_REFAULT);
> >> > }
> >> >
> >> > check_pte:

2024-04-18 09:55:37

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] mm: swap: entirely map large folios found in swapcache

[snip]

> > >> >
> > >> > VM_BUG_ON(!folio_test_anon(folio) ||
> > >> > (pte_write(pte) && !PageAnonExclusive(page)));
> > >> > - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > >> > - arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > >> > + set_ptes(vma->vm_mm, start_address, start_pte, pte, nr_pages);
> > >> > + vmf->orig_pte = ptep_get(vmf->pte);
> > >> > + arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
> > >>
> > >> Do we need to call arch_do_swap_page() for each subpage? IIUC, the
> > >> corresponding arch_unmap_one() will be called for each subpage.
> > >
> > > i actually thought about this very carefully, right now, the only one who
> > > needs this is sparc and it doesn't support THP_SWAPOUT at all. and
> > > there is no proof doing restoration one by one won't really break sparc.
> > > so i'd like to defer this to when sparc really needs THP_SWAPOUT.
> >
> > Let's ask SPARC developer (Cced) for this.
> >
> > IMHO, even if we cannot get help, we need to change code with our
> > understanding instead of deferring it.
>
> ok. Thanks for Ccing sparc developers.

Hi Khalid & Ying (also Cced sparc maillist),

SPARC is the only platform which needs arch_do_swap_page(), and right now
its THP_SWAPOUT is not enabled, so we will not really hit a large folio
in the swapcache there. Just in case you might need THP_SWAPOUT later, i am
changing the code as below,

@@ -4286,7 +4285,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	VM_BUG_ON(!folio_test_anon(folio) ||
 		  (pte_write(pte) && !PageAnonExclusive(page)));
 	set_ptes(vma->vm_mm, start_address, start_ptep, pte, nr_pages);
-	arch_do_swap_page(vma->vm_mm, vma, start_address, pte, pte);
+	for (int i = 0; i < nr_pages; i++) {
+		arch_do_swap_page(vma->vm_mm, vma,
+				  start_address + i * PAGE_SIZE, pte, pte);
+		pte = pte_advance_pfn(pte, 1);
+	}
 
 	folio_unlock(folio);
 	if (folio != swapcache && swapcache) {

for sparc, nr_pages will always be 1 (THP_SWAPOUT is not enabled). for
arm64/x86/riscv, it seems redundant to run a "for (int i = 0; i < nr_pages; i++)"
loop.

so another option is adding a helper as below to avoid the idle loop
on arm64/x86/riscv etc.

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e2f45e22a6d1..ea314a5f9b5e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1085,6 +1085,28 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
 {
 
 }
+
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long addr,
+					pte_t pte, pte_t oldpte,
+					int nr)
+{
+
+}
+#else
+static inline void arch_do_swap_page_nr(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long addr,
+					pte_t pte, pte_t oldpte,
+					int nr)
+{
+	for (int i = 0; i < nr; i++) {
+		arch_do_swap_page(vma->vm_mm, vma, addr + i * PAGE_SIZE,
+				  pte_advance_pfn(pte, i),
+				  pte_advance_pfn(oldpte, i));
+	}
+}
 #endif
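
With such a helper, the call site in do_swap_page() could then simply be
(sketch only, assuming the helper above):

	set_ptes(vma->vm_mm, start_address, start_ptep, pte, nr_pages);
	arch_do_swap_page_nr(vma->vm_mm, vma, start_address, pte, pte, nr_pages);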

Please tell me your preference.

BTW, i found that oldpte and pte are always the same in do_swap_page(); is
something wrong there? does arch_do_swap_page() really need two identical
arguments?


	vmf->orig_pte = pte;
	..
	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);


>
> >
> > > on the other hand, it seems really bad we have both
> > > arch_swap_restore - for this, arm64 has moved to using folio
> > > and
> > > arch_do_swap_page
> > >
> > > we should somehow unify them later if sparc wants THP_SWPOUT.

Thanks
Barry