From: Kairui Song <[email protected]>
This series tries to unify and clean up the swapin path, fixing a few
issues and adding some optimizations:
1. Memcg leak issue: when a process that previously swapped out some
memory is migrated to another cgroup, and the original cgroup is dead,
doing a swapoff will account the swapped-in pages to the process doing
the swapoff instead of the new cgroup. This easily allows the process
to use more memory than expected.
This can be easily reproduced by:
- Setup a swap.
- Create memory cgroup A, B and C.
- Spawn process P1 in cgroup A and make it swap out some pages.
- Move process P1 to memory cgroup B.
- Destroy cgroup A.
- Do a swapoff in cgroup C.
- Swapped-in pages are accounted to cgroup C.
This series fixes it by making the swapped-in pages accounted to cgroup B.
2. When there are multiple swap devices configured, if one of these
devices is not an SSD, VMA readahead will be globally disabled.
This series makes the readahead policy check per swap entry (see the
sketch after this list).
3. This series also includes many refactors and optimizations:
- The swap readahead policy check is unified for page-fault/swapoff/shmem,
so swapin from a ramdisk (e.g. ZRAM) will always bypass the swapcache.
Previously shmem and swapoff had different behavior here.
- Some micro-optimizations (e.g. skipping duplicated xarray lookups)
in the swapin path while doing the refactor.
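A minimal sketch of the per-entry policy check, taken from the "check
readahead policy per entry" patch below (the helper now takes the swap
device backing the entry instead of consulting global state):

    /*
     * Decided per device backing this entry, so one slow device no
     * longer disables VMA readahead globally.
     */
    static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
    {
            return data_race(si->flags & SWP_SOLIDSTATE) &&
                   READ_ONCE(enable_vma_readahead);
    }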
Some benchmarks:
1. fio test for shmem (within a 2G memcg limit and using lzo-rle ZRAM swap):
fio -name=tmpfs --numjobs=16 --directory=/tmpfs --size=256m --ioengine=mmap \
--iodepth=128 --rw=randrw --random_distribution=<RANDOM> --time_based \
--ramp_time=1m --runtime=1m --group_reporting
RANDOM=zipf:1.2 ZRAM
Before (R/W, bw): 7339MiB/s / 7341MiB/s
After (R/W, bw): 7305MiB/s / 7308MiB/s (-0.5%)
RANDOM=zipf:0.5 ZRAM
Before (R/W, bw): 770MiB/s / 770MiB/s
After (R/W, bw): 775MiB/s / 774MiB/s (+0.6%)
RANDOM=random ZRAM
Before (R/W, bw): 537MiB/s / 537MiB/s
After (R/W, bw): 552MiB/s / 552MiB/s (+2.7%)
We can see readahead barely helps here, and for random R/W there is an
observable performance gain.
2. A micro-benchmark that uses madvise to swap out 10G of zero-filled data
to ZRAM and then reads it back in shows a performance gain in the swapin
path (a sketch of the test follows the numbers):
Before: 12480532 us
After: 12013318 us (+3.8%)
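The test is roughly equivalent to the sketch below (an illustrative
reconstruction, not the exact test; it assumes MADV_PAGEOUT is used to
force the swap-out):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    int main(void)
    {
            size_t size = 10UL << 30;       /* 10G of zero-filled data */
            struct timeval t0, t1;
            volatile char sum = 0;
            char *buf;
            size_t i;

            buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED)
                    return 1;
            memset(buf, 0, size);                 /* populate the pages */
            madvise(buf, size, MADV_PAGEOUT);     /* push them out to ZRAM swap */

            gettimeofday(&t0, NULL);
            for (i = 0; i < size; i += 4096)      /* fault everything back in */
                    sum += buf[i];
            gettimeofday(&t1, NULL);

            printf("swapin: %ld us (sum=%d)\n",
                   (t1.tv_sec - t0.tv_sec) * 1000000L +
                   (t1.tv_usec - t0.tv_usec), sum);
            return 0;
    }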
4. The final vmlinux is also a little bit smaller (gcc 8.5.0):
./scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 8/7 grow/shrink: 5/6 up/down: 5737/-5789 (-52)
Function old new delta
unuse_vma - 3204 +3204
swapin_page_fault - 1804 +1804
swapin_page_non_fault - 437 +437
swapin_no_readahead - 165 +165
swap_cache_get_folio 291 326 +35
__pfx_unuse_vma - 16 +16
__pfx_swapin_page_non_fault - 16 +16
__pfx_swapin_page_fault - 16 +16
__pfx_swapin_no_readahead - 16 +16
read_swap_cache_async 179 191 +12
swap_cluster_readahead 912 921 +9
__read_swap_cache_async 669 673 +4
zswap_writeback_entry 1463 1466 +3
__do_sys_swapon 4923 4920 -3
nr_rotate_swap 4 - -4
__pfx_unuse_pte_range 16 - -16
__pfx_swapin_readahead 16 - -16
__pfx___swap_count 16 - -16
__x64_sys_swapoff 1347 1312 -35
__ia32_sys_swapoff 1346 1311 -35
__swap_count 72 - -72
shmem_swapin_folio 1697 1535 -162
do_swap_page 2404 1942 -462
try_to_unuse 1867 880 -987
swapin_readahead 1377 - -1377
unuse_pte_range 2604 - -2604
Total: Before=30085393, After=30085341, chg -0.00%
Kairui Song (24):
mm/swap: fix a potential undefined behavior issue
mm/swapfile.c: add back some comment
mm/swap: move no readahead swapin code to a stand alone helper
mm/swap: avoid setting page lock bit and doing extra unlock check
mm/swap: move readahead policy checking into swapin_readahead
swap: rework swapin_no_readahead arguments
mm/swap: move swap_count to header to be shared
mm/swap: check readahead policy per entry
mm/swap: inline __swap_count
mm/swap: remove nr_rotate_swap and related code
mm/swap: also handle swapcache lookup in swapin_readahead
mm/swap: simplify arguments for swap_cache_get_folio
swap: simplify swap_cache_get_folio
mm/swap: do shadow lookup as well when doing swap cache lookup
mm/swap: avoid a duplicated swap cache lookup for SYNCHRONOUS_IO
device
mm/swap: reduce scope of get_swap_device in swapin path
mm/swap: fix false error when swapoff race with swapin
mm/swap: introduce a helper non fault swapin
shmem, swap: refactor error check on OOM or race
swap: simplify and make swap_find_cache static
swap: make swapin_readahead result checking argument mandatory
swap: make swap_cluster_readahead static
swap: fix multiple swap leaks after cgroup migrate
mm/swap: change swapin_readahead to swapin_page_fault
include/linux/swap.h | 7 --
mm/memory.c | 109 +++++++--------------
mm/shmem.c | 55 ++++-------
mm/swap.h | 34 ++++---
mm/swap_state.c | 222 ++++++++++++++++++++++++++++++++-----------
mm/swapfile.c | 70 ++++++--------
mm/zswap.c | 2 +-
7 files changed, 269 insertions(+), 230 deletions(-)
--
2.42.0
From: Kairui Song <[email protected]>
No feature change, prepare for later commits.
Signed-off-by: Kairui Song <[email protected]>
---
mm/swap.h | 5 +++++
mm/swapfile.c | 5 -----
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index f82d43d7b52a..a9a654af791e 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -61,6 +61,11 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
{
return page_swap_info(&folio->page)->flags;
}
+
+static inline unsigned char swap_count(unsigned char ent)
+{
+ return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
+}
#else /* CONFIG_SWAP */
struct swap_iocb;
static inline void swap_readpage(struct page *page, bool do_poll,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0142bfc71b81..a8ae472ed2b6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -114,11 +114,6 @@ static struct swap_info_struct *swap_type_to_swap_info(int type)
return READ_ONCE(swap_info[type]); /* rcu_dereference() */
}
-static inline unsigned char swap_count(unsigned char ent)
-{
- return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
-}
-
/* Reclaim the swap entry anyway if possible */
#define TTRS_ANYWAY 0x1
/*
--
2.42.0
From: Kairui Song <[email protected]>
Currently, VMA readahead is globally disabled when any rotational disk is
used as a swap backend. So when multiple swap devices are enabled, if a
slower hard disk is set as a low-priority fallback and a high-performance
SSD is used as the high-priority swap device, VMA readahead is still
disabled globally, and the SSD swap device's performance will drop by a lot.
Check the readahead policy per entry to avoid this problem.
Signed-off-by: Kairui Song <[email protected]>
---
mm/swap_state.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ff6756f2e8e4..fb78f7f18ed7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
}
-static inline bool swap_use_vma_readahead(void)
+static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
{
- return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
+ return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);
}
/*
@@ -341,7 +341,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
if (!IS_ERR(folio)) {
- bool vma_ra = swap_use_vma_readahead();
+ bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
bool readahead;
/*
@@ -920,16 +920,18 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct vm_fault *vmf, bool *swapcached)
{
+ struct swap_info_struct *si;
struct mempolicy *mpol;
struct page *page;
pgoff_t ilx;
bool cached;
+ si = swp_swap_info(entry);
mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
- if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
+ if (swap_use_no_readahead(si, entry)) {
page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
cached = false;
- } else if (swap_use_vma_readahead()) {
+ } else if (swap_use_vma_readahead(si)) {
page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
cached = true;
} else {
--
2.42.0
From: Kairui Song <[email protected]>
There is only one caller left in the swap subsystem now, where it can be
inlined smoothly, avoiding the extra memory access and function call
overhead.
Signed-off-by: Kairui Song <[email protected]>
---
include/linux/swap.h | 6 ------
mm/swap_state.c | 6 +++---
mm/swapfile.c | 8 --------
3 files changed, 3 insertions(+), 17 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2401990d954d..64a37819a9b3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -485,7 +485,6 @@ int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
extern sector_t swapdev_block(int, pgoff_t);
-extern int __swap_count(swp_entry_t entry);
extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
extern int swp_swapcount(swp_entry_t entry);
extern struct swap_info_struct *page_swap_info(struct page *);
@@ -559,11 +558,6 @@ static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
{
}
-static inline int __swap_count(swp_entry_t entry)
-{
- return 0;
-}
-
static inline int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
{
return 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index fb78f7f18ed7..d87c20f9f7ec 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -316,9 +316,9 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
release_pages(pages, nr);
}
-static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
+static inline bool swap_use_no_readahead(struct swap_info_struct *si, pgoff_t offset)
{
- return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
+ return data_race(si->flags & SWP_SYNCHRONOUS_IO) && swap_count(si->swap_map[offset]) == 1;
}
static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
@@ -928,7 +928,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
si = swp_swap_info(entry);
mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
- if (swap_use_no_readahead(si, entry)) {
+ if (swap_use_no_readahead(si, swp_offset(entry))) {
page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
cached = false;
} else if (swap_use_vma_readahead(si)) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a8ae472ed2b6..e15a6c464a38 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1431,14 +1431,6 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
spin_unlock(&p->lock);
}
-int __swap_count(swp_entry_t entry)
-{
- struct swap_info_struct *si = swp_swap_info(entry);
- pgoff_t offset = swp_offset(entry);
-
- return swap_count(si->swap_map[offset]);
-}
-
/*
* How many references to @entry are currently swapped out?
* This does not give an exact answer when swap count is continued,
--
2.42.0
From: Kairui Song <[email protected]>
Make it use alloc_pages_mpol instead of vma_alloc_folio, and accept an
mm_struct directly as an argument instead of taking a vmf. This makes
its arguments similar to swap_{cluster,vma}_readahead, so the code is
more aligned.
Also prepare for following commits which will have no vmf for certain
swapin paths.
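The resulting signature (as in the diff below) mirrors
swap_{cluster,vma}_readahead:

    static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                            struct mempolicy *mpol, pgoff_t ilx,
                                            struct mm_struct *mm);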
Signed-off-by: Kairui Song <[email protected]>
---
mm/swap_state.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index fd0047ae324e..ff6756f2e8e4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -867,17 +867,17 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
* in.
*/
static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
- struct vm_fault *vmf)
+ struct mempolicy *mpol, pgoff_t ilx,
+ struct mm_struct *mm)
{
- struct vm_area_struct *vma = vmf->vma;
- struct page *page = NULL;
struct folio *folio;
+ struct page *page;
void *shadow = NULL;
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
- vma, vmf->address, false);
+ page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
+ folio = (struct folio *)page;
if (folio) {
- if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
+ if (mem_cgroup_swapin_charge_folio(folio, mm,
GFP_KERNEL, entry)) {
folio_put(folio);
return NULL;
@@ -896,7 +896,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
/* To provide entry to swap_readpage() */
folio->swap = entry;
- page = &folio->page;
swap_readpage(page, true, NULL);
folio->private = NULL;
}
@@ -928,7 +927,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
- page = swapin_no_readahead(entry, gfp_mask, vmf);
+ page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
cached = false;
} else if (swap_use_vma_readahead()) {
page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
--
2.42.0
From: Kairui Song <[email protected]>
When an xa_value is returned by the cache lookup, keep it to be used
later for the workingset refault check instead of doing the lookup again
in swapin_no_readahead.
This does have the side effect of making swapoff also trigger the
workingset check, but that should be fine, since swapoff already affects
the workload in many ways.
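The shadow is now captured once at lookup time and consumed by the caller;
a condensed sketch of the new flow (matching the hunks below):

    void *shadow = NULL;

    folio = swap_cache_get_folio(entry, vmf, &shadow);
    if (!folio && swap_use_no_readahead(si, swp_offset(entry))) {
            page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, mm);
            /* Swap cache is bypassed: do the refault accounting here */
            if (shadow)
                    workingset_refault(page_folio(page), shadow);
    }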
Signed-off-by: Kairui Song <[email protected]>
---
mm/swap_state.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e057c79fb06f..51de2a0412df 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -872,7 +872,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
{
struct folio *folio;
struct page *page;
- void *shadow = NULL;
page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
folio = (struct folio *)page;
@@ -888,10 +887,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
mem_cgroup_swapin_uncharge_swap(entry);
- shadow = get_shadow_from_swap_cache(entry);
- if (shadow)
- workingset_refault(folio, shadow);
-
folio_add_lru(folio);
/* To provide entry to swap_readpage() */
@@ -922,11 +917,12 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
enum swap_cache_result cache_result;
struct swap_info_struct *si;
struct mempolicy *mpol;
+ void *shadow = NULL;
struct folio *folio;
struct page *page;
pgoff_t ilx;
- folio = swap_cache_get_folio(entry, vmf, NULL);
+ folio = swap_cache_get_folio(entry, vmf, &shadow);
if (folio) {
page = folio_file_page(folio, swp_offset(entry));
cache_result = SWAP_CACHE_HIT;
@@ -938,6 +934,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (swap_use_no_readahead(si, swp_offset(entry))) {
page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
cache_result = SWAP_CACHE_BYPASS;
+ if (shadow)
+ workingset_refault(page_folio(page), shadow);
} else if (swap_use_vma_readahead(si)) {
page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
cache_result = SWAP_CACHE_MISS;
--
2.42.0
From: Kairui Song <[email protected]>
Move get_swap_device into swapin_readahead, simplifying the code and
preparing for follow-up commits.
For the later part of do_swap_page, using swp_swap_info directly is fine
since, in that context, the swap device is pinned by the swapcache
reference.
Signed-off-by: Kairui Song <[email protected]>
---
mm/memory.c | 16 ++++------------
mm/swap_state.c | 8 ++++++--
mm/swapfile.c | 4 +++-
3 files changed, 13 insertions(+), 15 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 22af9f3e8c75..e399b37ef395 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3789,7 +3789,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
struct folio *swapcache = NULL, *folio = NULL;
enum swap_cache_result cache_result;
struct page *page;
- struct swap_info_struct *si = NULL;
rmap_t rmap_flags = RMAP_NONE;
bool exclusive = false;
swp_entry_t entry;
@@ -3845,14 +3844,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out;
}
- /* Prevent swapoff from happening to us. */
- si = get_swap_device(entry);
- if (unlikely(!si))
- goto out;
-
page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
vmf, &cache_result);
- if (page) {
+ if (PTR_ERR(page) == -EBUSY) {
+ goto out;
+ } else if (page) {
folio = page_folio(page);
if (cache_result != SWAP_CACHE_HIT) {
/* Had to read the page from swap area: Major fault */
@@ -3964,7 +3960,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
exclusive = true;
} else if (exclusive && folio_test_writeback(folio) &&
- data_race(si->flags & SWP_STABLE_WRITES)) {
+ (swp_swap_info(entry)->flags & SWP_STABLE_WRITES)) {
/*
* This is tricky: not all swap backends support
* concurrent page modifications while under writeback.
@@ -4068,8 +4064,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
- if (si)
- put_swap_device(si);
return ret;
out_nomap:
if (vmf->pte)
@@ -4082,8 +4076,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_unlock(swapcache);
folio_put(swapcache);
}
- if (si)
- put_swap_device(si);
return ret;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 51de2a0412df..ff8a166603d0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -922,6 +922,11 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct page *page;
pgoff_t ilx;
+ /* Prevent swapoff from happening to us */
+ si = get_swap_device(entry);
+ if (unlikely(!si))
+ return ERR_PTR(-EBUSY);
+
folio = swap_cache_get_folio(entry, vmf, &shadow);
if (folio) {
page = folio_file_page(folio, swp_offset(entry));
@@ -929,7 +934,6 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
goto done;
}
- si = swp_swap_info(entry);
mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
if (swap_use_no_readahead(si, swp_offset(entry))) {
page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
@@ -944,8 +948,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
cache_result = SWAP_CACHE_MISS;
}
mpol_cond_put(mpol);
-
done:
+ put_swap_device(si);
if (result)
*result = cache_result;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b6d57fff5e21..925ad92486a4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1857,7 +1857,9 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
pte = NULL;
page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
&vmf, NULL);
- if (page)
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+ else if (page)
folio = page_folio(page);
if (!folio) {
/*
--
2.42.0
From: Kairui Song <[email protected]>
When swapoff races with swapin, get_swap_device may fail and cause
swapin_readahead to return -EBUSY. In such a case, check whether the page
was already swapped in by the swapoff path.
Signed-off-by: Kairui Song <[email protected]>
---
mm/memory.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index e399b37ef395..620fa87557fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3846,9 +3846,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
vmf, &cache_result);
- if (PTR_ERR(page) == -EBUSY) {
- goto out;
- } else if (page) {
+ if (IS_ERR_OR_NULL(page)) {
+ /*
+ * Back out if somebody else faulted in this pte
+ * while we released the pte lock.
+ */
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+ if (!page)
+ ret = VM_FAULT_OOM;
+ else
+ ret = VM_FAULT_RETRY;
+ }
+ goto unlock;
+ } else {
folio = page_folio(page);
if (cache_result != SWAP_CACHE_HIT) {
/* Had to read the page from swap area: Major fault */
@@ -3866,17 +3878,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
ret = VM_FAULT_HWPOISON;
goto out_release;
}
- } else {
- /*
- * Back out if somebody else faulted in this pte
- * while we released the pte lock.
- */
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
- if (likely(vmf->pte &&
- pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
- ret = VM_FAULT_OOM;
- goto unlock;
}
ret |= folio_lock_or_retry(folio, vmf);
--
2.42.0
From: Kairui Song <[email protected]>
There are two places where swapin is not directly caused by a page fault:
shmem swapin is invoked through the shmem mapping, and swapoff causes
swapin by walking the page table. Both used to construct a pseudo
vm_fault struct for the swapin function.
Shmem dropped the pseudo vm_fault recently in commit ddc1a5cbc05d
("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). The
swapoff path is still using one.
Introduce a helper for them both; this helps save stack usage for the
swapoff path and applies a unified swapin cache and readahead policy
check. Also prepare for follow-up commits.
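With the helper, both callers follow the same pattern; for example, the
swapoff path (from the mm/swapfile.c hunk below) becomes:

    mpol = get_vma_policy(vma, addr, 0, &ilx);
    page = swapin_page_non_fault(entry, GFP_HIGHUSER_MOVABLE,
                                 mpol, ilx, vma->vm_mm, NULL);
    mpol_cond_put(mpol);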
Signed-off-by: Kairui Song <[email protected]>
---
mm/shmem.c | 51 ++++++++++++++++---------------------------------
mm/swap.h | 11 +++++++++++
mm/swap_state.c | 38 ++++++++++++++++++++++++++++++++++++
mm/swapfile.c | 23 +++++++++++-----------
4 files changed, 76 insertions(+), 47 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index f9ce4067c742..81d129aa66d1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1565,22 +1565,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
pgoff_t index, unsigned int order, pgoff_t *ilx);
-static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
-{
- struct mempolicy *mpol;
- pgoff_t ilx;
- struct page *page;
-
- mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
- page = swap_cluster_readahead(swap, gfp, mpol, ilx);
- mpol_cond_put(mpol);
-
- if (!page)
- return NULL;
- return page_folio(page);
-}
-
/*
* Make sure huge_gfp is always more limited than limit_gfp.
* Some of the flags set permissions, while others set limitations.
@@ -1854,9 +1838,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
- struct swap_info_struct *si;
+ enum swap_cache_result result;
struct folio *folio = NULL;
+ struct mempolicy *mpol;
+ struct page *page;
swp_entry_t swap;
+ pgoff_t ilx;
int error;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
@@ -1866,34 +1853,30 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (is_poisoned_swp_entry(swap))
return -EIO;
- si = get_swap_device(swap);
- if (!si) {
+ mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
+ page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
+ mpol_cond_put(mpol);
+
+ if (PTR_ERR(page) == -EBUSY) {
if (!shmem_confirm_swap(mapping, index, swap))
return -EEXIST;
else
return -EINVAL;
- }
-
- /* Look it up and read it in.. */
- folio = swap_cache_get_folio(swap, NULL, NULL);
- if (!folio) {
- /* Or update major stats only when swapin succeeds?? */
- if (fault_type) {
+ } else if (!page) {
+ error = -ENOMEM;
+ goto failed;
+ } else {
+ folio = page_folio(page);
+ if (fault_type && result != SWAP_CACHE_HIT) {
*fault_type |= VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
- /* Here we actually start the io */
- folio = shmem_swapin_cluster(swap, gfp, info, index);
- if (!folio) {
- error = -ENOMEM;
- goto failed;
- }
}
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
- if (!folio_test_swapcache(folio) ||
+ if ((result != SWAP_CACHE_BYPASS && !folio_test_swapcache(folio)) ||
folio->swap.val != swap.val ||
!shmem_confirm_swap(mapping, index, swap)) {
error = -EEXIST;
@@ -1930,7 +1913,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
delete_from_swap_cache(folio);
folio_mark_dirty(folio);
swap_free(swap);
- put_swap_device(si);
*foliop = folio;
return 0;
@@ -1944,7 +1926,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio_unlock(folio);
folio_put(folio);
}
- put_swap_device(si);
return error;
}
diff --git a/mm/swap.h b/mm/swap.h
index da9deb5ba37d..b073c29c9790 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -62,6 +62,10 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf, enum swap_cache_result *result);
+struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
+ struct mempolicy *mpol, pgoff_t ilx,
+ struct mm_struct *mm,
+ enum swap_cache_result *result);
static inline unsigned int folio_swap_flags(struct folio *folio)
{
@@ -103,6 +107,13 @@ static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
return NULL;
}
+static inline struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
+ struct mempolicy *mpol, pgoff_t ilx, struct mm_struct *mm,
+ enum swap_cache_result *result)
+{
+ return NULL;
+}
+
static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
{
return 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ff8a166603d0..eef66757c615 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -956,6 +956,44 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
return page;
}
+struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
+ struct mempolicy *mpol, pgoff_t ilx,
+ struct mm_struct *mm, enum swap_cache_result *result)
+{
+ enum swap_cache_result cache_result;
+ struct swap_info_struct *si;
+ void *shadow = NULL;
+ struct folio *folio;
+ struct page *page;
+
+ /* Prevent swapoff from happening to us */
+ si = get_swap_device(entry);
+ if (unlikely(!si))
+ return ERR_PTR(-EBUSY);
+
+ folio = swap_cache_get_folio(entry, NULL, &shadow);
+ if (folio) {
+ page = folio_file_page(folio, swp_offset(entry));
+ cache_result = SWAP_CACHE_HIT;
+ goto done;
+ }
+
+ if (swap_use_no_readahead(si, swp_offset(entry))) {
+ page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, mm);
+ if (shadow)
+ workingset_refault(page_folio(page), shadow);
+ cache_result = SWAP_CACHE_BYPASS;
+ } else {
+ page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+ cache_result = SWAP_CACHE_MISS;
+ }
+done:
+ put_swap_device(si);
+ if (result)
+ *result = cache_result;
+ return page;
+}
+
#ifdef CONFIG_SYSFS
static ssize_t vma_ra_enabled_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 925ad92486a4..f8c5096fe0f0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1822,20 +1822,15 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
si = swap_info[type];
do {
+ int ret;
+ pte_t ptent;
+ pgoff_t ilx;
+ swp_entry_t entry;
struct page *page;
unsigned long offset;
+ struct mempolicy *mpol;
unsigned char swp_count;
struct folio *folio = NULL;
- swp_entry_t entry;
- int ret;
- pte_t ptent;
-
- struct vm_fault vmf = {
- .vma = vma,
- .address = addr,
- .real_address = addr,
- .pmd = pmd,
- };
if (!pte++) {
pte = pte_offset_map(pmd, addr);
@@ -1855,8 +1850,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
offset = swp_offset(entry);
pte_unmap(pte);
pte = NULL;
- page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- &vmf, NULL);
+
+ mpol = get_vma_policy(vma, addr, 0, &ilx);
+ page = swapin_page_non_fault(entry, GFP_HIGHUSER_MOVABLE,
+ mpol, ilx, vma->vm_mm, NULL);
+ mpol_cond_put(mpol);
+
if (IS_ERR(page))
return PTR_ERR(page);
else if (page)
--
2.42.0
From: Kairui Song <[email protected]>
There are only two callers within the same file now, so make it static
and drop the redundant if check on the shadow variable.
Signed-off-by: Kairui Song <[email protected]>
---
mm/swap.h | 2 --
mm/swap_state.c | 5 ++---
2 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index b073c29c9790..4402970547e7 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -46,8 +46,6 @@ void __delete_from_swap_cache(struct folio *folio,
void delete_from_swap_cache(struct folio *folio);
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end);
-struct folio *swap_cache_get_folio(swp_entry_t entry,
- struct vm_fault *vmf, void **shadowp);
struct folio *filemap_get_incore_folio(struct address_space *mapping,
pgoff_t index);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index eef66757c615..6f39aa8394f1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -334,15 +334,14 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
*
* Caller must lock the swap device or hold a reference to keep it valid.
*/
-struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
+static struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
{
bool vma_ra, readahead;
struct folio *folio;
folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
if (xa_is_value(folio)) {
- if (shadowp)
- *shadowp = folio;
+ *shadowp = folio;
return NULL;
}
--
2.42.0
From: Kairui Song <[email protected]>
There is only one caller now, in the page fault path; make the result
return argument mandatory.
Signed-off-by: Kairui Song <[email protected]>
---
mm/swap_state.c | 17 +++++++----------
1 file changed, 7 insertions(+), 10 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6f39aa8394f1..0433a2586c6d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -913,7 +913,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct vm_fault *vmf, enum swap_cache_result *result)
{
- enum swap_cache_result cache_result;
struct swap_info_struct *si;
struct mempolicy *mpol;
void *shadow = NULL;
@@ -928,29 +927,27 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
folio = swap_cache_get_folio(entry, vmf, &shadow);
if (folio) {
+ *result = SWAP_CACHE_HIT;
page = folio_file_page(folio, swp_offset(entry));
- cache_result = SWAP_CACHE_HIT;
goto done;
}
mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
if (swap_use_no_readahead(si, swp_offset(entry))) {
+ *result = SWAP_CACHE_BYPASS;
page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
- cache_result = SWAP_CACHE_BYPASS;
if (shadow)
workingset_refault(page_folio(page), shadow);
- } else if (swap_use_vma_readahead(si)) {
- page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
- cache_result = SWAP_CACHE_MISS;
} else {
- page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
- cache_result = SWAP_CACHE_MISS;
+ *result = SWAP_CACHE_MISS;
+ if (swap_use_vma_readahead(si))
+ page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
+ else
+ page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
}
mpol_cond_put(mpol);
done:
put_swap_device(si);
- if (result)
- *result = cache_result;
return page;
}
--
2.42.0
From: Kairui Song <[email protected]>
Shmem should always check whether a swap entry was already swapped in
when an error occurs; fix a potential false error issue.
Signed-off-by: Kairui Song <[email protected]>
---
mm/shmem.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 81d129aa66d1..6154b5b68b8f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1857,13 +1857,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
mpol_cond_put(mpol);
- if (PTR_ERR(page) == -EBUSY) {
- if (!shmem_confirm_swap(mapping, index, swap))
- return -EEXIST;
+ if (IS_ERR_OR_NULL(page)) {
+ if (!page)
+ error = -ENOMEM;
else
- return -EINVAL;
- } else if (!page) {
- error = -ENOMEM;
+ error = -EINVAL;
goto failed;
} else {
folio = page_folio(page);
--
2.42.0
From: Kairui Song <[email protected]>
Now that there is no caller outside the same file, make it static.
Signed-off-by: Kairui Song <[email protected]>
---
mm/swap.h | 8 --------
mm/swap_state.c | 4 ++--
2 files changed, 2 insertions(+), 10 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 4402970547e7..795a25df87da 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -56,8 +56,6 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx,
bool *new_page_allocated);
-struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
- struct mempolicy *mpol, pgoff_t ilx);
struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf, enum swap_cache_result *result);
struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
@@ -93,12 +91,6 @@ static inline void show_swap_cache_info(void)
{
}
-static inline struct page *swap_cluster_readahead(swp_entry_t entry,
- gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx)
-{
- return NULL;
-}
-
static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
struct vm_fault *vmf, enum swap_cache_result *result)
{
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0433a2586c6d..b377e55cb850 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -627,8 +627,8 @@ static unsigned long swapin_nr_pages(unsigned long offset)
* are used for every page of the readahead: neighbouring pages on swap
* are fairly likely to have been swapped out from the same node.
*/
-struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t ilx)
+static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
+ struct mempolicy *mpol, pgoff_t ilx)
{
struct page *page;
unsigned long entry_offset = swp_offset(entry);
--
2.42.0
From: Kairui Song <[email protected]>
There are only two callers now; simplify the arguments.
Signed-off-by: Kairui Song <[email protected]>
---
mm/shmem.c | 2 +-
mm/swap.h | 2 +-
mm/swap_state.c | 15 +++++++--------
3 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 0d1ce70bce38..72239061c655 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1875,7 +1875,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
/* Look it up and read it in.. */
- folio = swap_cache_get_folio(swap, NULL, 0);
+ folio = swap_cache_get_folio(swap, NULL);
if (!folio) {
/* Or update major stats only when swapin succeeds?? */
if (fault_type) {
diff --git a/mm/swap.h b/mm/swap.h
index ac9136eee690..e43e965f123f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -47,7 +47,7 @@ void delete_from_swap_cache(struct folio *folio);
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end);
struct folio *swap_cache_get_folio(swp_entry_t entry,
- struct vm_area_struct *vma, unsigned long addr);
+ struct vm_fault *vmf);
struct folio *filemap_get_incore_folio(struct address_space *mapping,
pgoff_t index);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e96d63bf8a22..91461e26a8cc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -334,8 +334,7 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
*
* Caller must lock the swap device or hold a reference to keep it valid.
*/
-struct folio *swap_cache_get_folio(swp_entry_t entry,
- struct vm_area_struct *vma, unsigned long addr)
+struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)
{
struct folio *folio;
@@ -352,22 +351,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
return folio;
readahead = folio_test_clear_readahead(folio);
- if (vma && vma_ra) {
+ if (vmf && vma_ra) {
unsigned long ra_val;
int win, hits;
- ra_val = GET_SWAP_RA_VAL(vma);
+ ra_val = GET_SWAP_RA_VAL(vmf->vma);
win = SWAP_RA_WIN(ra_val);
hits = SWAP_RA_HITS(ra_val);
if (readahead)
hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
- atomic_long_set(&vma->swap_readahead_info,
- SWAP_RA_VAL(addr, win, hits));
+ atomic_long_set(&vmf->vma->swap_readahead_info,
+ SWAP_RA_VAL(vmf->address, win, hits));
}
if (readahead) {
count_vm_event(SWAP_RA_HIT);
- if (!vma || !vma_ra)
+ if (!vmf || !vma_ra)
atomic_inc(&swapin_readahead_hits);
}
} else {
@@ -926,7 +925,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct page *page;
pgoff_t ilx;
- folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
+ folio = swap_cache_get_folio(entry, vmf);
if (folio) {
page = folio_file_page(folio, swp_offset(entry));
cache_result = SWAP_CACHE_HIT;
--
2.42.0
From: Kairui Song <[email protected]>
No feature change, just prepare for later commits.
Signed-off-by: Kairui Song <[email protected]>
---
mm/memory.c | 61 +++++++++++++++++++++++--------------------------
mm/swap.h | 10 ++++++--
mm/swap_state.c | 26 +++++++++++++--------
mm/swapfile.c | 30 +++++++++++-------------
4 files changed, 66 insertions(+), 61 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index f4237a2e3b93..22af9f3e8c75 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3786,13 +3786,13 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- struct folio *swapcache, *folio = NULL;
+ struct folio *swapcache = NULL, *folio = NULL;
+ enum swap_cache_result cache_result;
struct page *page;
struct swap_info_struct *si = NULL;
rmap_t rmap_flags = RMAP_NONE;
bool exclusive = false;
swp_entry_t entry;
- bool swapcached;
pte_t pte;
vm_fault_t ret = 0;
@@ -3850,42 +3850,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(!si))
goto out;
- folio = swap_cache_get_folio(entry, vma, vmf->address);
- if (folio)
- page = folio_file_page(folio, swp_offset(entry));
- swapcache = folio;
-
- if (!folio) {
- page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- vmf, &swapcached);
- if (page) {
- folio = page_folio(page);
- if (swapcached)
- swapcache = folio;
- } else {
+ page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+ vmf, &cache_result);
+ if (page) {
+ folio = page_folio(page);
+ if (cache_result != SWAP_CACHE_HIT) {
+ /* Had to read the page from swap area: Major fault */
+ ret = VM_FAULT_MAJOR;
+ count_vm_event(PGMAJFAULT);
+ count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+ }
+ if (cache_result != SWAP_CACHE_BYPASS)
+ swapcache = folio;
+ if (PageHWPoison(page)) {
/*
- * Back out if somebody else faulted in this pte
- * while we released the pte lock.
+ * hwpoisoned dirty swapcache pages are kept for killing
+ * owner processes (which may be unknown at hwpoison time)
*/
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
- if (likely(vmf->pte &&
- pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
- ret = VM_FAULT_OOM;
- goto unlock;
+ ret = VM_FAULT_HWPOISON;
+ goto out_release;
}
-
- /* Had to read the page from swap area: Major fault */
- ret = VM_FAULT_MAJOR;
- count_vm_event(PGMAJFAULT);
- count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
- } else if (PageHWPoison(page)) {
+ } else {
/*
- * hwpoisoned dirty swapcache pages are kept for killing
- * owner processes (which may be unknown at hwpoison time)
+ * Back out if somebody else faulted in this pte
+ * while we released the pte lock.
*/
- ret = VM_FAULT_HWPOISON;
- goto out_release;
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (likely(vmf->pte &&
+ pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
+ ret = VM_FAULT_OOM;
+ goto unlock;
}
ret |= folio_lock_or_retry(folio, vmf);
diff --git a/mm/swap.h b/mm/swap.h
index a9a654af791e..ac9136eee690 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -30,6 +30,12 @@ extern struct address_space *swapper_spaces[];
(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
>> SWAP_ADDRESS_SPACE_SHIFT])
+enum swap_cache_result {
+ SWAP_CACHE_HIT,
+ SWAP_CACHE_MISS,
+ SWAP_CACHE_BYPASS,
+};
+
void show_swap_cache_info(void);
bool add_to_swap(struct folio *folio);
void *get_shadow_from_swap_cache(swp_entry_t entry);
@@ -55,7 +61,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
- struct vm_fault *vmf, bool *swapcached);
+ struct vm_fault *vmf, enum swap_cache_result *result);
static inline unsigned int folio_swap_flags(struct folio *folio)
{
@@ -92,7 +98,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
}
static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
- struct vm_fault *vmf, bool *swapcached)
+ struct vm_fault *vmf, enum swap_cache_result *result)
{
return NULL;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d87c20f9f7ec..e96d63bf8a22 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -908,8 +908,7 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
* @entry: swap entry of this memory
* @gfp_mask: memory allocation flags
* @vmf: fault information
- * @swapcached: pointer to a bool used as indicator if the
- * page is swapped in through swapcache.
+ * @result: a return value to indicate swap cache usage.
*
* Returns the struct page for entry and addr, after queueing swapin.
*
@@ -918,30 +917,39 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
* or vma-based(ie, virtual address based on faulty address) readahead.
*/
struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
- struct vm_fault *vmf, bool *swapcached)
+ struct vm_fault *vmf, enum swap_cache_result *result)
{
+ enum swap_cache_result cache_result;
struct swap_info_struct *si;
struct mempolicy *mpol;
+ struct folio *folio;
struct page *page;
pgoff_t ilx;
- bool cached;
+
+ folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
+ if (folio) {
+ page = folio_file_page(folio, swp_offset(entry));
+ cache_result = SWAP_CACHE_HIT;
+ goto done;
+ }
si = swp_swap_info(entry);
mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
if (swap_use_no_readahead(si, swp_offset(entry))) {
page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
- cached = false;
+ cache_result = SWAP_CACHE_BYPASS;
} else if (swap_use_vma_readahead(si)) {
page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
- cached = true;
+ cache_result = SWAP_CACHE_MISS;
} else {
page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
- cached = true;
+ cache_result = SWAP_CACHE_MISS;
}
mpol_cond_put(mpol);
- if (swapcached)
- *swapcached = cached;
+done:
+ if (result)
+ *result = cache_result;
return page;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 01c3f53b6521..b6d57fff5e21 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1822,13 +1822,21 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
si = swap_info[type];
do {
- struct folio *folio;
+ struct page *page;
unsigned long offset;
unsigned char swp_count;
+ struct folio *folio = NULL;
swp_entry_t entry;
int ret;
pte_t ptent;
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = addr,
+ .real_address = addr,
+ .pmd = pmd,
+ };
+
if (!pte++) {
pte = pte_offset_map(pmd, addr);
if (!pte)
@@ -1847,22 +1855,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
offset = swp_offset(entry);
pte_unmap(pte);
pte = NULL;
-
- folio = swap_cache_get_folio(entry, vma, addr);
- if (!folio) {
- struct page *page;
- struct vm_fault vmf = {
- .vma = vma,
- .address = addr,
- .real_address = addr,
- .pmd = pmd,
- };
-
- page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- &vmf, NULL);
- if (page)
- folio = page_folio(page);
- }
+ page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
+ &vmf, NULL);
+ if (page)
+ folio = page_folio(page);
if (!folio) {
/*
* The entry could have been freed, and will not
--
2.42.0
From: Kairui Song <[email protected]>
Make swap_cache_get_folio capable of returning the shadow value when the
xarray contains a shadow instead of a valid folio. Just extend the
arguments; the shadow will be used by later commits.
Signed-off-by: Kairui Song <[email protected]>
---
mm/shmem.c | 2 +-
mm/swap.h | 2 +-
mm/swap_state.c | 11 +++++++----
3 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 72239061c655..f9ce4067c742 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1875,7 +1875,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
/* Look it up and read it in.. */
- folio = swap_cache_get_folio(swap, NULL);
+ folio = swap_cache_get_folio(swap, NULL, NULL);
if (!folio) {
/* Or update major stats only when swapin succeeds?? */
if (fault_type) {
diff --git a/mm/swap.h b/mm/swap.h
index e43e965f123f..da9deb5ba37d 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -47,7 +47,7 @@ void delete_from_swap_cache(struct folio *folio);
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end);
struct folio *swap_cache_get_folio(swp_entry_t entry,
- struct vm_fault *vmf);
+ struct vm_fault *vmf, void **shadowp);
struct folio *filemap_get_incore_folio(struct address_space *mapping,
pgoff_t index);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3b5a34f47192..e057c79fb06f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -334,14 +334,17 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
*
* Caller must lock the swap device or hold a reference to keep it valid.
*/
-struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)
+struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
{
bool vma_ra, readahead;
struct folio *folio;
- folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
- if (IS_ERR(folio))
+ folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
+ if (xa_is_value(folio)) {
+ if (shadowp)
+ *shadowp = folio;
return NULL;
+ }
/*
* At the moment, we don't support PG_readahead for anon THP
@@ -923,7 +926,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct page *page;
pgoff_t ilx;
- folio = swap_cache_get_folio(entry, vmf);
+ folio = swap_cache_get_folio(entry, vmf, NULL);
if (folio) {
page = folio_file_page(folio, swp_offset(entry));
cache_result = SWAP_CACHE_HIT;
--
2.42.0
From: Kairui Song <[email protected]>
Now swapin_readahead is only called from the direct page fault path, so
rename it and drop the gfp argument, since the only caller always uses
the same flag for userspace page faults.
Signed-off-by: Kairui Song <[email protected]>
---
mm/memory.c | 4 ++--
mm/swap.h | 6 +++---
mm/swap_state.c | 15 +++++++++------
3 files changed, 14 insertions(+), 11 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 620fa87557fd..4907a5b1b75b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3844,8 +3844,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out;
}
- page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- vmf, &cache_result);
+ page = swapin_page_fault(entry, GFP_HIGHUSER_MOVABLE,
+ vmf, &cache_result);
if (IS_ERR_OR_NULL(page)) {
/*
* Back out if somebody else faulted in this pte
diff --git a/mm/swap.h b/mm/swap.h
index 4374bf11ca41..2f8f8befff89 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -56,8 +56,8 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx,
struct mm_struct *mm, bool *new_page_allocated);
-struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
- struct vm_fault *vmf, enum swap_cache_result *result);
+struct page *swapin_page_fault(swp_entry_t entry, gfp_t flag,
+ struct vm_fault *vmf, enum swap_cache_result *result);
struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx,
struct mm_struct *mm,
@@ -91,7 +91,7 @@ static inline void show_swap_cache_info(void)
{
}
-static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
+static inline struct page *swapin_page_fault(swp_entry_t swp, gfp_t gfp_mask,
struct vm_fault *vmf, enum swap_cache_result *result)
{
return NULL;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 362a6f674b36..2f51d2e64e59 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -899,7 +899,7 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
}
/**
- * swapin_readahead - swap in pages in hope we need them soon
+ * swapin_page_fault - swap in a page from page fault context
* @entry: swap entry of this memory
* @gfp_mask: memory allocation flags
* @vmf: fault information
@@ -911,8 +911,8 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
* it will read ahead blocks by cluster-based(ie, physical disk based)
* or vma-based(ie, virtual address based on faulty address) readahead.
*/
-struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
- struct vm_fault *vmf, enum swap_cache_result *result)
+struct page *swapin_page_fault(swp_entry_t entry, gfp_t gfp_mask,
+ struct vm_fault *vmf, enum swap_cache_result *result)
{
struct swap_info_struct *si;
struct mempolicy *mpol;
@@ -936,15 +936,18 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
if (swap_use_no_readahead(si, swp_offset(entry))) {
*result = SWAP_CACHE_BYPASS;
- page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
+ page = swapin_no_readahead(entry, GFP_HIGHUSER_MOVABLE,
+ mpol, ilx, vmf->vma->vm_mm);
if (shadow)
workingset_refault(page_folio(page), shadow);
} else {
*result = SWAP_CACHE_MISS;
if (swap_use_vma_readahead(si))
- page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
+ page = swap_vma_readahead(entry, GFP_HIGHUSER_MOVABLE,
+ mpol, ilx, vmf);
else
- page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+ page = swap_cluster_readahead(entry, GFP_HIGHUSER_MOVABLE,
+ mpol, ilx, vmf->vma->vm_mm);
}
mpol_cond_put(mpol);
done:
--
2.42.0
From: Kairui Song <[email protected]>
Rearrange the if statement to reduce the code indentation; no feature
change.
Signed-off-by: Kairui Song <[email protected]>
---
mm/swap_state.c | 58 ++++++++++++++++++++++++-------------------------
1 file changed, 28 insertions(+), 30 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 91461e26a8cc..3b5a34f47192 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -336,41 +336,39 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
*/
struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)
{
+ bool vma_ra, readahead;
struct folio *folio;
folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
- if (!IS_ERR(folio)) {
- bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
- bool readahead;
+ if (IS_ERR(folio))
+ return NULL;
- /*
- * At the moment, we don't support PG_readahead for anon THP
- * so let's bail out rather than confusing the readahead stat.
- */
- if (unlikely(folio_test_large(folio)))
- return folio;
-
- readahead = folio_test_clear_readahead(folio);
- if (vmf && vma_ra) {
- unsigned long ra_val;
- int win, hits;
-
- ra_val = GET_SWAP_RA_VAL(vmf->vma);
- win = SWAP_RA_WIN(ra_val);
- hits = SWAP_RA_HITS(ra_val);
- if (readahead)
- hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
- atomic_long_set(&vmf->vma->swap_readahead_info,
- SWAP_RA_VAL(vmf->address, win, hits));
- }
+ /*
+ * At the moment, we don't support PG_readahead for anon THP
+ * so let's bail out rather than confusing the readahead stat.
+ */
+ if (unlikely(folio_test_large(folio)))
+ return folio;
- if (readahead) {
- count_vm_event(SWAP_RA_HIT);
- if (!vmf || !vma_ra)
- atomic_inc(&swapin_readahead_hits);
- }
- } else {
- folio = NULL;
+ vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
+ readahead = folio_test_clear_readahead(folio);
+ if (vmf && vma_ra) {
+ unsigned long ra_val;
+ int win, hits;
+
+ ra_val = GET_SWAP_RA_VAL(vmf->vma);
+ win = SWAP_RA_WIN(ra_val);
+ hits = SWAP_RA_HITS(ra_val);
+ if (readahead)
+ hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
+ atomic_long_set(&vmf->vma->swap_readahead_info,
+ SWAP_RA_VAL(vmf->address, win, hits));
+ }
+
+ if (readahead) {
+ count_vm_event(SWAP_RA_HIT);
+ if (!vmf || !vma_ra)
+ atomic_inc(&swapin_readahead_hits);
}
return folio;
--
2.42.0
From: Kairui Song <[email protected]>
No longer needed after we switched to a per-entry swap readahead policy.
Signed-off-by: Kairui Song <[email protected]>
---
include/linux/swap.h | 1 -
mm/swapfile.c | 11 -----------
2 files changed, 12 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 64a37819a9b3..cc83fb884757 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -454,7 +454,6 @@ extern void free_pages_and_swap_cache(struct encoded_page **, int);
/* linux/mm/swapfile.c */
extern atomic_long_t nr_swap_pages;
extern long total_swap_pages;
-extern atomic_t nr_rotate_swap;
extern bool has_usable_swap(void);
/* Swap 50% full? Release swapcache more aggressively.. */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e15a6c464a38..01c3f53b6521 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -104,8 +104,6 @@ static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
/* Activity counter to indicate that a swapon or swapoff has occurred */
static atomic_t proc_poll_event = ATOMIC_INIT(0);
-atomic_t nr_rotate_swap = ATOMIC_INIT(0);
-
static struct swap_info_struct *swap_type_to_swap_info(int type)
{
if (type >= MAX_SWAPFILES)
@@ -2486,9 +2484,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
if (p->flags & SWP_CONTINUED)
free_swap_count_continuations(p);
- if (!p->bdev || !bdev_nonrot(p->bdev))
- atomic_dec(&nr_rotate_swap);
-
mutex_lock(&swapon_mutex);
spin_lock(&swap_lock);
spin_lock(&p->lock);
@@ -2990,7 +2985,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
struct swap_cluster_info *cluster_info = NULL;
struct page *page = NULL;
struct inode *inode = NULL;
- bool inced_nr_rotate_swap = false;
if (swap_flags & ~SWAP_FLAGS_VALID)
return -EINVAL;
@@ -3112,9 +3106,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
cluster = per_cpu_ptr(p->percpu_cluster, cpu);
cluster_set_null(&cluster->index);
}
- } else {
- atomic_inc(&nr_rotate_swap);
- inced_nr_rotate_swap = true;
}
error = swap_cgroup_swapon(p->type, maxpages);
@@ -3218,8 +3209,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
spin_unlock(&swap_lock);
vfree(swap_map);
kvfree(cluster_info);
- if (inced_nr_rotate_swap)
- atomic_dec(&nr_rotate_swap);
if (swap_file)
filp_close(swap_file, NULL);
out:
--
2.42.0
From: Kairui Song <[email protected]>
When a process which previously swapped out some memory is moved to
another cgroup, and the cgroup it was previously in is dead, the
swapped-in pages will be leaked into the root cgroup. Previous commits
fixed the bug for the no-readahead path; this commit fixes the same
issue for the readahead path.
This can be easily reproduced by:
- Setup an SSD or HDD swap.
- Create memory cgroup A, B and C.
- Spawn process P1 in cgroup A and make it swap out some pages.
- Move process P1 to memory cgroup B.
- Destroy cgroup A.
- Do a swapoff in cgroup C.
- Swapped-in pages are accounted to cgroup C.
This patch fixes it by making the swapped-in pages accounted to cgroup B.
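The key change (see the hunks below) is threading the mm through
__read_swap_cache_async, so mem_cgroup_swapin_charge_folio(folio, mm, ...)
charges the memcg owning that mm rather than the one doing the swapin:

    /* Readahead callers now pass the faulting mm down: */
    page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
                                   vmf->vma->vm_mm, &page_allocated);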
Signed-off-by: Kairui Song <[email protected]>
---
mm/swap.h | 2 +-
mm/swap_state.c | 19 ++++++++++---------
mm/zswap.c | 2 +-
3 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 795a25df87da..4374bf11ca41 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -55,7 +55,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct swap_iocb **plug);
struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx,
- bool *new_page_allocated);
+ struct mm_struct *mm, bool *new_page_allocated);
struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf, enum swap_cache_result *result);
struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b377e55cb850..362a6f674b36 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -416,7 +416,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx,
- bool *new_page_allocated)
+ struct mm_struct *mm, bool *new_page_allocated)
{
struct swap_info_struct *si;
struct folio *folio;
@@ -462,7 +462,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
mpol, ilx, numa_node_id());
if (!folio)
goto fail_put_swap;
- if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
+ if (mem_cgroup_swapin_charge_folio(folio, mm, gfp_mask, entry))
goto fail_put_folio;
/*
@@ -540,7 +540,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
mpol = get_vma_policy(vma, addr, 0, &ilx);
page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
- &page_allocated);
+ vma->vm_mm, &page_allocated);
mpol_cond_put(mpol);
if (page_allocated)
@@ -628,7 +628,8 @@ static unsigned long swapin_nr_pages(unsigned long offset)
* are fairly likely to have been swapped out from the same node.
*/
static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t ilx)
+ struct mempolicy *mpol, pgoff_t ilx,
+ struct mm_struct *mm)
{
struct page *page;
unsigned long entry_offset = swp_offset(entry);
@@ -657,7 +658,7 @@ static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
/* Ok, do the async read-ahead now */
page = __read_swap_cache_async(
swp_entry(swp_type(entry), offset),
- gfp_mask, mpol, ilx, &page_allocated);
+ gfp_mask, mpol, ilx, mm, &page_allocated);
if (!page)
continue;
if (page_allocated) {
@@ -675,7 +676,7 @@ static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
skip:
/* The page was likely read above, so no need for plugging here */
page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
- &page_allocated);
+ mm, &page_allocated);
if (unlikely(page_allocated))
swap_readpage(page, false, NULL);
return page;
@@ -830,7 +831,7 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
pte_unmap(pte);
pte = NULL;
page = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
- &page_allocated);
+ vmf->vma->vm_mm, &page_allocated);
if (!page)
continue;
if (page_allocated) {
@@ -850,7 +851,7 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
skip:
/* The page was likely read above, so no need for plugging here */
page = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
- &page_allocated);
+ vmf->vma->vm_mm, &page_allocated);
if (unlikely(page_allocated))
swap_readpage(page, false, NULL);
return page;
@@ -980,7 +981,7 @@ struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
workingset_refault(page_folio(page), shadow);
cache_result = SWAP_CACHE_BYPASS;
} else {
- page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
+ page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx, mm);
cache_result = SWAP_CACHE_MISS;
}
done:
diff --git a/mm/zswap.c b/mm/zswap.c
index 030cc137138f..e2712ff169b1 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1081,7 +1081,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
/* try to allocate swap cache page */
mpol = get_task_policy(current);
page = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
- NO_INTERLEAVE_INDEX, &page_was_allocated);
+ NO_INTERLEAVE_INDEX, NULL, &page_was_allocated);
if (!page) {
ret = -ENOMEM;
goto fail;
--
2.42.0
On Mon, Nov 20, 2023 at 03:47:32AM +0800, Kairui Song wrote:
> page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> vmf, &cache_result);
> - if (page) {
> + if (PTR_ERR(page) == -EBUSY) {
if (page == ERR_PTR(-EBUSY)) {
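For anyone following along: PTR_ERR() converts a pointer already known to
encode an error into an errno, and since page is a pointer the comparison
should stay in pointer space. A sketch of the corrected check, borrowing
the -EBUSY handling that shmem_swapin_folio() in this series already uses
(the surrounding variables are assumed from that context):

	page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
	if (page == ERR_PTR(-EBUSY)) {
		/* another path is swapping this entry in right now */
		if (!shmem_confirm_swap(mapping, index, swap))
			return -EEXIST;
		else
			return -EINVAL;
	}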
Hi Kairui,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.7-rc2 next-20231117]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-fix-a-potential-undefined-behavior-issue/20231120-035926
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231119194740.94101-7-ryncsn%40gmail.com
patch subject: [PATCH 06/24] swap: rework swapin_no_readahead arguments
config: i386-buildonly-randconfig-003-20231120 (https://download.01.org/0day-ci/archive/20231120/[email protected]/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231120/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
>> mm/swap_state.c:872: warning: Function parameter or member 'mpol' not described in 'swapin_no_readahead'
>> mm/swap_state.c:872: warning: Function parameter or member 'ilx' not described in 'swapin_no_readahead'
>> mm/swap_state.c:872: warning: Function parameter or member 'mm' not described in 'swapin_no_readahead'
>> mm/swap_state.c:872: warning: Excess function parameter 'vmf' description in 'swapin_no_readahead'
vim +872 mm/swap_state.c
d9bfcfdc41e8e5 Huang Ying 2017-09-06 859
19f582d2684e47 Kairui Song 2023-11-20 860 /**
19f582d2684e47 Kairui Song 2023-11-20 861 * swapin_no_readahead - swap in pages skipping swap cache and readahead
19f582d2684e47 Kairui Song 2023-11-20 862 * @entry: swap entry of this memory
19f582d2684e47 Kairui Song 2023-11-20 863 * @gfp_mask: memory allocation flags
19f582d2684e47 Kairui Song 2023-11-20 864 * @vmf: fault information
19f582d2684e47 Kairui Song 2023-11-20 865 *
19f582d2684e47 Kairui Song 2023-11-20 866 * Returns the struct page for entry and addr after the swap entry is read
19f582d2684e47 Kairui Song 2023-11-20 867 * in.
19f582d2684e47 Kairui Song 2023-11-20 868 */
598f2616cde014 Kairui Song 2023-11-20 869 static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
2538a5e96fe62f Kairui Song 2023-11-20 870 struct mempolicy *mpol, pgoff_t ilx,
2538a5e96fe62f Kairui Song 2023-11-20 871 struct mm_struct *mm)
19f582d2684e47 Kairui Song 2023-11-20 @872 {
19f582d2684e47 Kairui Song 2023-11-20 873 struct folio *folio;
2538a5e96fe62f Kairui Song 2023-11-20 874 struct page *page;
19f582d2684e47 Kairui Song 2023-11-20 875 void *shadow = NULL;
19f582d2684e47 Kairui Song 2023-11-20 876
2538a5e96fe62f Kairui Song 2023-11-20 877 page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
2538a5e96fe62f Kairui Song 2023-11-20 878 folio = (struct folio *)page;
19f582d2684e47 Kairui Song 2023-11-20 879 if (folio) {
2538a5e96fe62f Kairui Song 2023-11-20 880 if (mem_cgroup_swapin_charge_folio(folio, mm,
c2ac0dcbf9ab6a Kairui Song 2023-11-20 881 GFP_KERNEL, entry)) {
19f582d2684e47 Kairui Song 2023-11-20 882 folio_put(folio);
19f582d2684e47 Kairui Song 2023-11-20 883 return NULL;
19f582d2684e47 Kairui Song 2023-11-20 884 }
c2ac0dcbf9ab6a Kairui Song 2023-11-20 885
c2ac0dcbf9ab6a Kairui Song 2023-11-20 886 __folio_set_locked(folio);
c2ac0dcbf9ab6a Kairui Song 2023-11-20 887 __folio_set_swapbacked(folio);
c2ac0dcbf9ab6a Kairui Song 2023-11-20 888
19f582d2684e47 Kairui Song 2023-11-20 889 mem_cgroup_swapin_uncharge_swap(entry);
19f582d2684e47 Kairui Song 2023-11-20 890
19f582d2684e47 Kairui Song 2023-11-20 891 shadow = get_shadow_from_swap_cache(entry);
19f582d2684e47 Kairui Song 2023-11-20 892 if (shadow)
19f582d2684e47 Kairui Song 2023-11-20 893 workingset_refault(folio, shadow);
19f582d2684e47 Kairui Song 2023-11-20 894
19f582d2684e47 Kairui Song 2023-11-20 895 folio_add_lru(folio);
19f582d2684e47 Kairui Song 2023-11-20 896
19f582d2684e47 Kairui Song 2023-11-20 897 /* To provide entry to swap_readpage() */
19f582d2684e47 Kairui Song 2023-11-20 898 folio->swap = entry;
19f582d2684e47 Kairui Song 2023-11-20 899 swap_readpage(page, true, NULL);
19f582d2684e47 Kairui Song 2023-11-20 900 folio->private = NULL;
19f582d2684e47 Kairui Song 2023-11-20 901 }
19f582d2684e47 Kairui Song 2023-11-20 902
19f582d2684e47 Kairui Song 2023-11-20 903 return page;
19f582d2684e47 Kairui Song 2023-11-20 904 }
19f582d2684e47 Kairui Song 2023-11-20 905
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
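A likely fix for the four kernel-doc warnings above is to bring the
comment block in line with the new signature; the wording of the @mpol,
@ilx and @mm descriptions below is a suggestion, not taken from the
series:

/**
 * swapin_no_readahead - swap in pages skipping swap cache and readahead
 * @entry: swap entry of this memory
 * @gfp_mask: memory allocation flags
 * @mpol: NUMA memory allocation policy to be applied
 * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
 * @mm: the mm_struct to charge the swapped-in page against
 *
 * Returns the struct page for the given swap entry after it is read in.
 */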
Hi Kairui,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.7-rc2 next-20231117]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-fix-a-potential-undefined-behavior-issue/20231120-035926
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231119194740.94101-12-ryncsn%40gmail.com
patch subject: [PATCH 11/24] mm/swap: also handle swapcache lookup in swapin_readahead
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20231120/[email protected]/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231120/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All error/warnings (new ones prefixed by >>):
In file included from mm/filemap.c:62:
>> mm/swap.h:101:52: warning: 'enum swap_cache_result' declared inside parameter list will not be visible outside of this definition or declaration
101 | struct vm_fault *vmf, enum swap_cache_result *result)
| ^~~~~~~~~~~~~~~~~
--
In file included from mm/memory.c:93:
>> mm/swap.h:101:52: warning: 'enum swap_cache_result' declared inside parameter list will not be visible outside of this definition or declaration
101 | struct vm_fault *vmf, enum swap_cache_result *result)
| ^~~~~~~~~~~~~~~~~
mm/memory.c: In function 'do_swap_page':
>> mm/memory.c:3790:32: error: storage size of 'cache_result' isn't known
3790 | enum swap_cache_result cache_result;
| ^~~~~~~~~~~~
>> mm/memory.c:3857:37: error: 'SWAP_CACHE_HIT' undeclared (first use in this function); did you mean 'DQST_CACHE_HITS'?
3857 | if (cache_result != SWAP_CACHE_HIT) {
| ^~~~~~~~~~~~~~
| DQST_CACHE_HITS
mm/memory.c:3857:37: note: each undeclared identifier is reported only once for each function it appears in
>> mm/memory.c:3863:37: error: 'SWAP_CACHE_BYPASS' undeclared (first use in this function); did you mean 'SMP_CACHE_BYTES'?
3863 | if (cache_result != SWAP_CACHE_BYPASS)
| ^~~~~~~~~~~~~~~~~
| SMP_CACHE_BYTES
>> mm/memory.c:3790:32: warning: unused variable 'cache_result' [-Wunused-variable]
3790 | enum swap_cache_result cache_result;
| ^~~~~~~~~~~~
vim +3790 mm/memory.c
3777
3778 /*
3779 * We enter with non-exclusive mmap_lock (to exclude vma changes,
3780 * but allow concurrent faults), and pte mapped but not yet locked.
3781 * We return with pte unmapped and unlocked.
3782 *
3783 * We return with the mmap_lock locked or unlocked in the same cases
3784 * as does filemap_fault().
3785 */
3786 vm_fault_t do_swap_page(struct vm_fault *vmf)
3787 {
3788 struct vm_area_struct *vma = vmf->vma;
3789 struct folio *swapcache = NULL, *folio = NULL;
> 3790 enum swap_cache_result cache_result;
3791 struct page *page;
3792 struct swap_info_struct *si = NULL;
3793 rmap_t rmap_flags = RMAP_NONE;
3794 bool exclusive = false;
3795 swp_entry_t entry;
3796 pte_t pte;
3797 vm_fault_t ret = 0;
3798
3799 if (!pte_unmap_same(vmf))
3800 goto out;
3801
3802 entry = pte_to_swp_entry(vmf->orig_pte);
3803 if (unlikely(non_swap_entry(entry))) {
3804 if (is_migration_entry(entry)) {
3805 migration_entry_wait(vma->vm_mm, vmf->pmd,
3806 vmf->address);
3807 } else if (is_device_exclusive_entry(entry)) {
3808 vmf->page = pfn_swap_entry_to_page(entry);
3809 ret = remove_device_exclusive_entry(vmf);
3810 } else if (is_device_private_entry(entry)) {
3811 if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
3812 /*
3813 * migrate_to_ram is not yet ready to operate
3814 * under VMA lock.
3815 */
3816 vma_end_read(vma);
3817 ret = VM_FAULT_RETRY;
3818 goto out;
3819 }
3820
3821 vmf->page = pfn_swap_entry_to_page(entry);
3822 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
3823 vmf->address, &vmf->ptl);
3824 if (unlikely(!vmf->pte ||
3825 !pte_same(ptep_get(vmf->pte),
3826 vmf->orig_pte)))
3827 goto unlock;
3828
3829 /*
3830 * Get a page reference while we know the page can't be
3831 * freed.
3832 */
3833 get_page(vmf->page);
3834 pte_unmap_unlock(vmf->pte, vmf->ptl);
3835 ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
3836 put_page(vmf->page);
3837 } else if (is_hwpoison_entry(entry)) {
3838 ret = VM_FAULT_HWPOISON;
3839 } else if (is_pte_marker_entry(entry)) {
3840 ret = handle_pte_marker(vmf);
3841 } else {
3842 print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
3843 ret = VM_FAULT_SIGBUS;
3844 }
3845 goto out;
3846 }
3847
3848 /* Prevent swapoff from happening to us. */
3849 si = get_swap_device(entry);
3850 if (unlikely(!si))
3851 goto out;
3852
3853 page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
3854 vmf, &cache_result);
3855 if (page) {
3856 folio = page_folio(page);
> 3857 if (cache_result != SWAP_CACHE_HIT) {
3858 /* Had to read the page from swap area: Major fault */
3859 ret = VM_FAULT_MAJOR;
3860 count_vm_event(PGMAJFAULT);
3861 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
3862 }
> 3863 if (cache_result != SWAP_CACHE_BYPASS)
3864 swapcache = folio;
3865 if (PageHWPoison(page)) {
3866 /*
3867 * hwpoisoned dirty swapcache pages are kept for killing
3868 * owner processes (which may be unknown at hwpoison time)
3869 */
3870 ret = VM_FAULT_HWPOISON;
3871 goto out_release;
3872 }
3873 } else {
3874 /*
3875 * Back out if somebody else faulted in this pte
3876 * while we released the pte lock.
3877 */
3878 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
3879 vmf->address, &vmf->ptl);
3880 if (likely(vmf->pte &&
3881 pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
3882 ret = VM_FAULT_OOM;
3883 goto unlock;
3884 }
3885
3886 ret |= folio_lock_or_retry(folio, vmf);
3887 if (ret & VM_FAULT_RETRY)
3888 goto out_release;
3889
3890 if (swapcache) {
3891 /*
3892 * Make sure folio_free_swap() or swapoff did not release the
3893 * swapcache from under us. The page pin, and pte_same test
3894 * below, are not enough to exclude that. Even if it is still
3895 * swapcache, we need to check that the page's swap has not
3896 * changed.
3897 */
3898 if (unlikely(!folio_test_swapcache(folio) ||
3899 page_swap_entry(page).val != entry.val))
3900 goto out_page;
3901
3902 /*
3903 * KSM sometimes has to copy on read faults, for example, if
3904 * page->index of !PageKSM() pages would be nonlinear inside the
3905 * anon VMA -- PageKSM() is lost on actual swapout.
3906 */
3907 page = ksm_might_need_to_copy(page, vma, vmf->address);
3908 if (unlikely(!page)) {
3909 ret = VM_FAULT_OOM;
3910 goto out_page;
3911 } else if (unlikely(PTR_ERR(page) == -EHWPOISON)) {
3912 ret = VM_FAULT_HWPOISON;
3913 goto out_page;
3914 }
3915 folio = page_folio(page);
3916
3917 /*
3918 * If we want to map a page that's in the swapcache writable, we
3919 * have to detect via the refcount if we're really the exclusive
3920 * owner. Try removing the extra reference from the local LRU
3921 * caches if required.
3922 */
3923 if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
3924 !folio_test_ksm(folio) && !folio_test_lru(folio))
3925 lru_add_drain();
3926 }
3927
3928 folio_throttle_swaprate(folio, GFP_KERNEL);
3929
3930 /*
3931 * Back out if somebody else already faulted in this pte.
3932 */
3933 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
3934 &vmf->ptl);
3935 if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
3936 goto out_nomap;
3937
3938 if (unlikely(!folio_test_uptodate(folio))) {
3939 ret = VM_FAULT_SIGBUS;
3940 goto out_nomap;
3941 }
3942
3943 /*
3944 * PG_anon_exclusive reuses PG_mappedtodisk for anon pages. A swap pte
3945 * must never point at an anonymous page in the swapcache that is
3946 * PG_anon_exclusive. Sanity check that this holds and especially, that
3947 * no filesystem set PG_mappedtodisk on a page in the swapcache. Sanity
3948 * check after taking the PT lock and making sure that nobody
3949 * concurrently faulted in this page and set PG_anon_exclusive.
3950 */
3951 BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
3952 BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
3953
3954 /*
3955 * Check under PT lock (to protect against concurrent fork() sharing
3956 * the swap entry concurrently) for certainly exclusive pages.
3957 */
3958 if (!folio_test_ksm(folio)) {
3959 exclusive = pte_swp_exclusive(vmf->orig_pte);
3960 if (folio != swapcache) {
3961 /*
3962 * We have a fresh page that is not exposed to the
3963 * swapcache -> certainly exclusive.
3964 */
3965 exclusive = true;
3966 } else if (exclusive && folio_test_writeback(folio) &&
3967 data_race(si->flags & SWP_STABLE_WRITES)) {
3968 /*
3969 * This is tricky: not all swap backends support
3970 * concurrent page modifications while under writeback.
3971 *
3972 * So if we stumble over such a page in the swapcache
3973 * we must not set the page exclusive, otherwise we can
3974 * map it writable without further checks and modify it
3975 * while still under writeback.
3976 *
3977 * For these problematic swap backends, simply drop the
3978 * exclusive marker: this is perfectly fine as we start
3979 * writeback only if we fully unmapped the page and
3980 * there are no unexpected references on the page after
3981 * unmapping succeeded. After fully unmapped, no
3982 * further GUP references (FOLL_GET and FOLL_PIN) can
3983 * appear, so dropping the exclusive marker and mapping
3984 * it only R/O is fine.
3985 */
3986 exclusive = false;
3987 }
3988 }
3989
3990 /*
3991 * Some architectures may have to restore extra metadata to the page
3992 * when reading from swap. This metadata may be indexed by swap entry
3993 * so this must be called before swap_free().
3994 */
3995 arch_swap_restore(entry, folio);
3996
3997 /*
3998 * Remove the swap entry and conditionally try to free up the swapcache.
3999 * We're already holding a reference on the page but haven't mapped it
4000 * yet.
4001 */
4002 swap_free(entry);
4003 if (should_try_to_free_swap(folio, vma, vmf->flags))
4004 folio_free_swap(folio);
4005
4006 inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
4007 dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
4008 pte = mk_pte(page, vma->vm_page_prot);
4009
4010 /*
4011 * Same logic as in do_wp_page(); however, optimize for pages that are
4012 * certainly not shared either because we just allocated them without
4013 * exposing them to the swapcache or because the swap entry indicates
4014 * exclusivity.
4015 */
4016 if (!folio_test_ksm(folio) &&
4017 (exclusive || folio_ref_count(folio) == 1)) {
4018 if (vmf->flags & FAULT_FLAG_WRITE) {
4019 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
4020 vmf->flags &= ~FAULT_FLAG_WRITE;
4021 }
4022 rmap_flags |= RMAP_EXCLUSIVE;
4023 }
4024 flush_icache_page(vma, page);
4025 if (pte_swp_soft_dirty(vmf->orig_pte))
4026 pte = pte_mksoft_dirty(pte);
4027 if (pte_swp_uffd_wp(vmf->orig_pte))
4028 pte = pte_mkuffd_wp(pte);
4029 vmf->orig_pte = pte;
4030
4031 /* ksm created a completely new copy */
4032 if (unlikely(folio != swapcache && swapcache)) {
4033 page_add_new_anon_rmap(page, vma, vmf->address);
4034 folio_add_lru_vma(folio, vma);
4035 } else {
4036 page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
4037 }
4038
4039 VM_BUG_ON(!folio_test_anon(folio) ||
4040 (pte_write(pte) && !PageAnonExclusive(page)));
4041 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
4042 arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
4043
4044 folio_unlock(folio);
4045 if (folio != swapcache && swapcache) {
4046 /*
4047 * Hold the lock to avoid the swap entry to be reused
4048 * until we take the PT lock for the pte_same() check
4049 * (to avoid false positives from pte_same). For
4050 * further safety release the lock after the swap_free
4051 * so that the swap count won't change under a
4052 * parallel locked swapcache.
4053 */
4054 folio_unlock(swapcache);
4055 folio_put(swapcache);
4056 }
4057
4058 if (vmf->flags & FAULT_FLAG_WRITE) {
4059 ret |= do_wp_page(vmf);
4060 if (ret & VM_FAULT_ERROR)
4061 ret &= VM_FAULT_ERROR;
4062 goto out;
4063 }
4064
4065 /* No need to invalidate - it was non-present before */
4066 update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
4067 unlock:
4068 if (vmf->pte)
4069 pte_unmap_unlock(vmf->pte, vmf->ptl);
4070 out:
4071 if (si)
4072 put_swap_device(si);
4073 return ret;
4074 out_nomap:
4075 if (vmf->pte)
4076 pte_unmap_unlock(vmf->pte, vmf->ptl);
4077 out_page:
4078 folio_unlock(folio);
4079 out_release:
4080 folio_put(folio);
4081 if (folio != swapcache && swapcache) {
4082 folio_unlock(swapcache);
4083 folio_put(swapcache);
4084 }
4085 if (si)
4086 put_swap_device(si);
4087 return ret;
4088 }
4089
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
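The warnings and errors above share one root cause: enum
swap_cache_result is first seen inside a parameter list in mm/swap.h, so
builds without CONFIG_SWAP (such as this x86_64-allnoconfig one) never
see a real definition of the type. A likely fix, sketched here under that
assumption, is to define the enum unconditionally in mm/swap.h, above any
#ifdef and above the declarations that use it:

enum swap_cache_result {
	SWAP_CACHE_HIT,		/* page was found in the swap cache */
	SWAP_CACHE_MISS,	/* page was read in through readahead */
	SWAP_CACHE_BYPASS,	/* page was read in, skipping the swap cache */
};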
Hi Kairui,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.7-rc2 next-20231117]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-fix-a-potential-undefined-behavior-issue/20231120-035926
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231119194740.94101-19-ryncsn%40gmail.com
patch subject: [PATCH 18/24] mm/swap: introduce a helper non fault swapin
config: i386-buildonly-randconfig-002-20231120 (https://download.01.org/0day-ci/archive/20231120/[email protected]/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231120/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
In file included from mm/shmem.c:43:
mm/swap.h:105:31: warning: declaration of 'enum swap_cache_result' will not be visible outside of this function [-Wvisibility]
struct vm_fault *vmf, enum swap_cache_result *result)
^
mm/swap.h:112:8: warning: declaration of 'enum swap_cache_result' will not be visible outside of this function [-Wvisibility]
enum swap_cache_result *result)
^
>> mm/shmem.c:1841:25: error: variable has incomplete type 'enum swap_cache_result'
enum swap_cache_result result;
^
mm/shmem.c:1841:7: note: forward declaration of 'enum swap_cache_result'
enum swap_cache_result result;
^
>> mm/shmem.c:1870:31: error: use of undeclared identifier 'SWAP_CACHE_HIT'
if (fault_type && result != SWAP_CACHE_HIT) {
^
>> mm/shmem.c:1879:17: error: use of undeclared identifier 'SWAP_CACHE_BYPASS'
if ((result != SWAP_CACHE_BYPASS && !folio_test_swapcache(folio)) ||
^
2 warnings and 3 errors generated.
vim +1841 mm/shmem.c
1827
1828 /*
1829 * Swap in the folio pointed to by *foliop.
1830 * Caller has to make sure that *foliop contains a valid swapped folio.
1831 * Returns 0 and the folio in foliop if success. On failure, returns the
1832 * error code and NULL in *foliop.
1833 */
1834 static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
1835 struct folio **foliop, enum sgp_type sgp,
1836 gfp_t gfp, struct mm_struct *fault_mm,
1837 vm_fault_t *fault_type)
1838 {
1839 struct address_space *mapping = inode->i_mapping;
1840 struct shmem_inode_info *info = SHMEM_I(inode);
> 1841 enum swap_cache_result result;
1842 struct folio *folio = NULL;
1843 struct mempolicy *mpol;
1844 struct page *page;
1845 swp_entry_t swap;
1846 pgoff_t ilx;
1847 int error;
1848
1849 VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
1850 swap = radix_to_swp_entry(*foliop);
1851 *foliop = NULL;
1852
1853 if (is_poisoned_swp_entry(swap))
1854 return -EIO;
1855
1856 mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
1857 page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
1858 mpol_cond_put(mpol);
1859
1860 if (PTR_ERR(page) == -EBUSY) {
1861 if (!shmem_confirm_swap(mapping, index, swap))
1862 return -EEXIST;
1863 else
1864 return -EINVAL;
1865 } else if (!page) {
1866 error = -ENOMEM;
1867 goto failed;
1868 } else {
1869 folio = page_folio(page);
> 1870 if (fault_type && result != SWAP_CACHE_HIT) {
1871 *fault_type |= VM_FAULT_MAJOR;
1872 count_vm_event(PGMAJFAULT);
1873 count_memcg_event_mm(fault_mm, PGMAJFAULT);
1874 }
1875 }
1876
1877 /* We have to do this with folio locked to prevent races */
1878 folio_lock(folio);
> 1879 if ((result != SWAP_CACHE_BYPASS && !folio_test_swapcache(folio)) ||
1880 folio->swap.val != swap.val ||
1881 !shmem_confirm_swap(mapping, index, swap)) {
1882 error = -EEXIST;
1883 goto unlock;
1884 }
1885 if (!folio_test_uptodate(folio)) {
1886 error = -EIO;
1887 goto failed;
1888 }
1889 folio_wait_writeback(folio);
1890
1891 /*
1892 * Some architectures may have to restore extra metadata to the
1893 * folio after reading from swap.
1894 */
1895 arch_swap_restore(swap, folio);
1896
1897 if (shmem_should_replace_folio(folio, gfp)) {
1898 error = shmem_replace_folio(&folio, gfp, info, index);
1899 if (error)
1900 goto failed;
1901 }
1902
1903 error = shmem_add_to_page_cache(folio, mapping, index,
1904 swp_to_radix_entry(swap), gfp);
1905 if (error)
1906 goto failed;
1907
1908 shmem_recalc_inode(inode, 0, -1);
1909
1910 if (sgp == SGP_WRITE)
1911 folio_mark_accessed(folio);
1912
1913 delete_from_swap_cache(folio);
1914 folio_mark_dirty(folio);
1915 swap_free(swap);
1916
1917 *foliop = folio;
1918 return 0;
1919 failed:
1920 if (!shmem_confirm_swap(mapping, index, swap))
1921 error = -EEXIST;
1922 if (error == -EIO)
1923 shmem_set_folio_swapin_error(inode, index, folio, swap);
1924 unlock:
1925 if (folio) {
1926 folio_unlock(folio);
1927 folio_put(folio);
1928 }
1929
1930 return error;
1931 }
1932
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
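These shmem.c errors have the same root cause as the previous report. In
C, a type first declared inside a function prototype has prototype scope
only, so any later definition names a different, incompatible type. A
minimal standalone illustration (toy code, not from the kernel):

void paint(enum color *c);	/* 'enum color' is visible only inside this prototype */

enum color { RED, GREEN };	/* too late: this is a distinct type, and the
				 * earlier declaration can never be completed */

Moving the enum definition above every declaration that mentions it, as
sketched after the previous report, resolves both reports.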
Kairui Song <[email protected]> writes:
> From: Kairui Song <[email protected]>
>
> Currently VMA readahead is globally disabled when any rotating disk is
> used as a swap backend. So when multiple swap devices are enabled, if a
> slower hard disk is set as a low-priority fallback and a high-performance
> SSD is used as the high-priority swap device, VMA readahead is disabled
> globally. The SSD swap device's performance will drop by a lot.
>
> Check readahead policy per entry to avoid such problem.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/swap_state.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ff6756f2e8e4..fb78f7f18ed7 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
> return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> }
>
> -static inline bool swap_use_vma_readahead(void)
> +static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> {
> - return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> + return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);
> }
>
> /*
> @@ -341,7 +341,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
>
> folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
> if (!IS_ERR(folio)) {
> - bool vma_ra = swap_use_vma_readahead();
> + bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
> bool readahead;
>
> /*
> @@ -920,16 +920,18 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_fault *vmf, bool *swapcached)
> {
> + struct swap_info_struct *si;
> struct mempolicy *mpol;
> struct page *page;
> pgoff_t ilx;
> bool cached;
>
> + si = swp_swap_info(entry);
> mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> - if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> + if (swap_use_no_readahead(si, entry)) {
> page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> cached = false;
> - } else if (swap_use_vma_readahead()) {
> + } else if (swap_use_vma_readahead(si)) {
It's possible that some pages are swapped out to SSD while others are
swapped out to HDD in a readahead window.
I suspect that there are practical requirements to use swap on SSD and
HDD at the same time.
> page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> cached = true;
> } else {
--
Best Regards,
Huang, Ying
Hi Kairui,
On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> It should always check if a swap entry has already been swapped in on
> error, fixing a potential false error issue.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/shmem.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 81d129aa66d1..6154b5b68b8f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1857,13 +1857,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
> mpol_cond_put(mpol);
>
> - if (PTR_ERR(page) == -EBUSY) {
> - if (!shmem_confirm_swap(mapping, index, swap))
> - return -EEXIST;
Did you intentionally remove the shmem_confirm_swap() check?
I am not sure I am following.
> + if (IS_ERR_OR_NULL(page)) {
> + if (!page)
> + error = -ENOMEM;
> else
> - return -EINVAL;
> - } else if (!page) {
> - error = -ENOMEM;
> + error = -EINVAL;
The resulting code is a bit hard to read as a diff. In plain source it looks like:
if (IS_ERR_OR_NULL(page)) {
if (!page)
error = -ENOMEM;
else
error = -EINVAL;
goto failed;
} else {
folio = page_folio(page);
if (fault_type && result != SWAP_CACHE_HIT) {
*fault_type |= VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
}
First of all, if the first branch always does "goto failed", the second else is not needed.
The whole else block can be flattened by one indentation level.
if (IS_ERR_OR_NULL(page)) {
if (!page)
error = -ENOMEM;
else
error = -EINVAL;
goto failed;
} else {
It can be rewritten as follows with less indentation:
if (!page) {
error = -ENOMEM;
goto failed;
}
if (IS_ERR(page)) {
error = -EINVAL;
goto failed;
}
/* else block */
Am I missing something and misreading your code?
Chris
Kairui Song <[email protected]> writes:
> From: Kairui Song <[email protected]>
>
> When a process that previously swapped out some memory is moved to
> another cgroup, and the cgroup it was previously in is dead, the
> swapped-in pages will be leaked into the root cgroup. Previous commits
> fixed this bug for the no-readahead path; this commit fixes the same
> issue for the readahead path by passing the target mm_struct down to
> mem_cgroup_swapin_charge_folio().
>
> This can be easily reproduced by:
> - Set up an SSD or HDD swap.
> - Create memory cgroups A, B and C.
> - Spawn process P1 in cgroup A and make it swap out some pages.
> - Move process P1 to memory cgroup B.
> - Destroy cgroup A.
> - Do a swapoff in cgroup C.
> - Swapped-in pages are accounted into cgroup C.
>
> This patch fixes it by making the swapped-in pages accounted to cgroup B.
According to the "Memory Ownership" section of
Documentation/admin-guide/cgroup-v2.rst,
"
A memory area is charged to the cgroup which instantiated it and stays
charged to the cgroup until the area is released. Migrating a process
to a different cgroup doesn't move the memory usages that it
instantiated while in the previous cgroup to the new cgroup.
"
Because we don't move the charge when we move a task from one cgroup to
another, it's controversial which cgroup should be charged.
According to the above document, it's acceptable to charge to the cgroup
C (cgroup where swapoff happens).
--
Best Regards,
Huang, Ying
Kairui Song <[email protected]> writes:
> From: Kairui Song <[email protected]>
>
> There is only one caller in the swap subsystem now, where it can be
> inlined smoothly, avoiding the memory access and function call overheads.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> include/linux/swap.h | 6 ------
> mm/swap_state.c | 6 +++---
> mm/swapfile.c | 8 --------
> 3 files changed, 3 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2401990d954d..64a37819a9b3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -485,7 +485,6 @@ int swap_type_of(dev_t device, sector_t offset);
> int find_first_swap(dev_t *device);
> extern unsigned int count_swap_pages(int, int);
> extern sector_t swapdev_block(int, pgoff_t);
> -extern int __swap_count(swp_entry_t entry);
> extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
> extern int swp_swapcount(swp_entry_t entry);
> extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -559,11 +558,6 @@ static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> {
> }
>
> -static inline int __swap_count(swp_entry_t entry)
> -{
> - return 0;
> -}
> -
> static inline int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
> {
> return 0;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index fb78f7f18ed7..d87c20f9f7ec 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -316,9 +316,9 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> release_pages(pages, nr);
> }
>
> -static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
> +static inline bool swap_use_no_readahead(struct swap_info_struct *si, pgoff_t offset)
> {
> - return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> + return data_race(si->flags & SWP_SYNCHRONOUS_IO) && swap_count(si->swap_map[offset]) == 1;
> }
>
> static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> @@ -928,7 +928,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>
> si = swp_swap_info(entry);
> mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> - if (swap_use_no_readahead(si, entry)) {
> + if (swap_use_no_readahead(si, swp_offset(entry))) {
> page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> cached = false;
> } else if (swap_use_vma_readahead(si)) {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a8ae472ed2b6..e15a6c464a38 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1431,14 +1431,6 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
> spin_unlock(&p->lock);
> }
>
> -int __swap_count(swp_entry_t entry)
> -{
> - struct swap_info_struct *si = swp_swap_info(entry);
> - pgoff_t offset = swp_offset(entry);
> -
> - return swap_count(si->swap_map[offset]);
> -}
> -
I'd rather keep __swap_count() in its original place together with the
other swap count related functions. And si->swap_map[] was hidden in
swapfile.c before. I don't think the change will bring any real
performance improvement.
> /*
> * How many references to @entry are currently swapped out?
> * This does not give an exact answer when swap count is continued,
--
Best Regards,
Huang, Ying
Matthew Wilcox <[email protected]> 于2023年11月20日周一 05:12写道:
>
> On Mon, Nov 20, 2023 at 03:47:32AM +0800, Kairui Song wrote:
> > page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > vmf, &cache_result);
> > - if (page) {
> > + if (PTR_ERR(page) == -EBUSY) {
>
> if (page == ERR_PTR(-EBUSY)) {
>
Thanks, I'll fix this.
Huang, Ying <[email protected]> 于2023年11月20日周一 15:37写道:
>
> Kairui Song <[email protected]> writes:
>
> > From: Kairui Song <[email protected]>
> >
> > When a process that previously swapped out some memory is moved to
> > another cgroup, and the cgroup it was previously in is dead, the
> > swapped-in pages will be leaked into the root cgroup. Previous commits
> > fixed this bug for the no-readahead path; this commit fixes the same
> > issue for the readahead path by passing the target mm_struct down to
> > mem_cgroup_swapin_charge_folio().
> >
> > This can be easily reproduced by:
> > - Set up an SSD or HDD swap.
> > - Create memory cgroups A, B and C.
> > - Spawn process P1 in cgroup A and make it swap out some pages.
> > - Move process P1 to memory cgroup B.
> > - Destroy cgroup A.
> > - Do a swapoff in cgroup C.
> > - Swapped-in pages are accounted into cgroup C.
> >
> > This patch fixes it by making the swapped-in pages accounted to cgroup B.
>
> According to the "Memory Ownership" section of
> Documentation/admin-guide/cgroup-v2.rst,
>
> "
> A memory area is charged to the cgroup which instantiated it and stays
> charged to the cgroup until the area is released. Migrating a process
> to a different cgroup doesn't move the memory usages that it
> instantiated while in the previous cgroup to the new cgroup.
> "
>
> Because we don't move the charge when we move a task from one cgroup to
> another, it's controversial which cgroup should be charged.
> According to the above document, it's acceptable to charge to the cgroup
> C (cgroup where swapoff happens).
Hi Ying, thank you very much for the info!
It is controversial indeed; it's just that the original behavior is kind
of counter-intuitive.
Imagine there is a cgroup P1 with child cgroups C1 and C2. If a process
swapped out some memory in C1, then moved to C2, and C1 is dead,
on swapoff the charge will be moved out of P1...
And swapoff often happens in some unlimited cgroup, or in a cgroup for a
management agent.
If P1 has a memory limit, it can breach the limit easily; we will see
a process that never left P1 having a much higher RSS than P1/C1/C2's
limit.
And if there is a limit on the management agent's cgroup, the agent
will be OOMed instead of the OOM happening in P1.
Simply moving a process between child cgroups of the same parent won't
cause such an issue; things get weird when swapoff is involved.
Or maybe we should try to stay compatible, and introduce a sysctl or
cmdline option for this?
Huang, Ying <[email protected]> 于2023年11月20日周一 14:07写道:
>
> Kairui Song <[email protected]> writes:
>
> > From: Kairui Song <[email protected]>
> >
> > Currently VMA readahead is globally disabled when any rotating disk is
> > used as a swap backend. So when multiple swap devices are enabled, if a
> > slower hard disk is set as a low-priority fallback and a high-performance
> > SSD is used as the high-priority swap device, VMA readahead is disabled
> > globally. The SSD swap device's performance will drop by a lot.
> >
> > Check readahead policy per entry to avoid such problem.
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/swap_state.c | 12 +++++++-----
> > 1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index ff6756f2e8e4..fb78f7f18ed7 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
> > return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> > }
> >
> > -static inline bool swap_use_vma_readahead(void)
> > +static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> > {
> > - return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> > + return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);
> > }
> >
> > /*
> > @@ -341,7 +341,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
> >
> > folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
> > if (!IS_ERR(folio)) {
> > - bool vma_ra = swap_use_vma_readahead();
> > + bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
> > bool readahead;
> >
> > /*
> > @@ -920,16 +920,18 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > struct vm_fault *vmf, bool *swapcached)
> > {
> > + struct swap_info_struct *si;
> > struct mempolicy *mpol;
> > struct page *page;
> > pgoff_t ilx;
> > bool cached;
> >
> > + si = swp_swap_info(entry);
> > mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > - if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> > + if (swap_use_no_readahead(si, entry)) {
> > page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> > cached = false;
> > - } else if (swap_use_vma_readahead()) {
> > + } else if (swap_use_vma_readahead(si)) {
>
> It's possible that some pages are swapped out to SSD while others are
> swapped out to HDD in a readahead window.
>
> I suspect that there are practical requirements to use swap on SSD and
> HDD at the same time.
Hi Ying,
Thanks for the review!
For the first issue "fragmented readahead window", I was planning to
do an extra check in the readahead path to skip readahead entries that
are on different swap devices, which is not hard to do, but this series
is growing too long so I thought it would be better done later.
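A minimal sketch of what I have in mind, as a guard inside the
swap_vma_readahead() loop (the check itself is hypothetical; entry and
targ_entry are the names the existing code already uses):

		/* never mix swap devices within one readahead window:
		 * skip entries living on a different device than the
		 * faulting entry
		 */
		if (swp_type(entry) != swp_type(targ_entry))
			continue;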
For the second issue, "is there any practical use for multiple swap",
I think there actually are. For example, we are trying to use multi-layer
swap for offloading memory of different hotness on servers. And we also
tried to implement a mechanism that migrates long-sleeping swap entries
from high-performance SSD/RAMDISK swap to a cheap HDD swap device, with
more than two layers of swap. This worked except for the upstream issue
that the readahead policy no longer works as expected.
>
> > page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> > cached = true;
> > } else {
>
> --
> Best Regards,
> Huang, Ying
Chris Li <[email protected]> 于2023年11月20日周一 15:12写道:
>
> Hi Kairui,
>
> On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > It should always check if a swap entry has already been swapped in on
> > error, fixing a potential false error issue.
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/shmem.c | 10 ++++------
> > 1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 81d129aa66d1..6154b5b68b8f 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1857,13 +1857,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
> > mpol_cond_put(mpol);
> >
> > - if (PTR_ERR(page) == -EBUSY) {
> > - if (!shmem_confirm_swap(mapping, index, swap))
> > - return -EEXIST;
>
> Did you intentionally remove the shmem_confirm_swap() check?
> I am not sure I am following.
Hi Chris, thanks for the review.
This is also called under the failed: label, so I think there is no need
to keep it here.
>
> > + if (IS_ERR_OR_NULL(page)) {
> > + if (!page)
> > + error = -ENOMEM;
> > else
> > - return -EINVAL;
> > - } else if (!page) {
> > - error = -ENOMEM;
> > + error = -EINVAL;
>
> The resulting code is a bit hard to read as a diff. In plain source it looks like:
>
> if (IS_ERR_OR_NULL(page)) {
> if (!page)
> error = -ENOMEM;
> else
> error = -EINVAL;
> goto failed;
> } else {
> folio = page_folio(page);
> if (fault_type && result != SWAP_CACHE_HIT) {
> *fault_type |= VM_FAULT_MAJOR;
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> }
>
> First of all, if the first branch always does "goto failed", the second else is not needed.
> The whole else block can be flattened by one indentation level.
Yes, thanks for the suggestion.
>
> if (IS_ERR_OR_NULL(page)) {
> if (!page)
> error = -ENOMEM;
> else
> error = -EINVAL;
> goto failed;
> } else {
>
> It can be rewritten as follows with less indentation:
>
> if (!page) {
> error = -ENOMEM;
> goto failed;
> }
> if (IS_ERR(page)) {
> error = -EINVAL;
> goto failed;
> }
> /* else block */
>
> Am I missing something and misreading your code?
Your code looks cleaner :)
This patch is trying to clean up the error path after the helper
refactor, and your suggestions are very helpful, thanks!
>
> Chris
On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> This series tries to unify and clean up the swapin path, fixing a few
> issues with optimizations:
>
> 1. Memcg leak issue: when a process that previously swapped out some
> migrated to another cgroup, and the origianl cgroup is dead. If we
> do a swapoff, swapped in pages will be accounted into the process
> doing swapoff instead of the new cgroup. This will allow the process
> to use more memory than expect easily.
>
> This can be easily reproduced by:
> - Setup a swap.
> - Create memory cgroup A, B and C.
> - Spawn process P1 in cgroup A and make it swap out some pages.
> - Move process P1 to memory cgroup B.
> - Destroy cgroup A.
> - Do a swapoff in cgroup C
> - Swapped in pages is accounted into cgroup C.
>
> This patch will fix it make the swapped in pages accounted in cgroup B.
>
I guess this only works for anonymous memory and not shmem, right?
I think tying memcg charges to a process is not something we usually
do. Charging the pages to the memcg of the faulting process if the
previous owner is dead makes sense; it's essentially recharging the
memory to the new owner. Swapoff is indeed a special case, since the
faulting process is not the new owner, but an admin process or so. I
am guessing charging to the new memcg of the previous owner might make
sense in this case, but it is a change of behavior.
Hi Kairui,
On Mon, Nov 20, 2023 at 11:10 AM Yosry Ahmed <[email protected]> wrote:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > This series tries to unify and clean up the swapin path, fixing a few
> > issues with optimizations:
> >
> > 1. Memcg leak issue: when a process that previously swapped out some
> > migrated to another cgroup, and the origianl cgroup is dead. If we
> > do a swapoff, swapped in pages will be accounted into the process
> > doing swapoff instead of the new cgroup. This will allow the process
> > to use more memory than expect easily.
> >
> > This can be easily reproduced by:
> > - Setup a swap.
> > - Create memory cgroup A, B and C.
> > - Spawn process P1 in cgroup A and make it swap out some pages.
> > - Move process P1 to memory cgroup B.
> > - Destroy cgroup A.
> > - Do a swapoff in cgroup C
> > - Swapped in pages is accounted into cgroup C.
> >
> > This patch will fix it make the swapped in pages accounted in cgroup B.
> >
>
> I guess this only works for anonymous memory and not shmem, right?
>
> I think tying memcg charges to a process is not something we usually
> do. Charging the pages to the memcg of the faulting process if the
> previous owner is dead makes sense, it's essentially recharging the
> memory to the new owner. Swapoff is indeed a special case, since the
> faulting process is not the new owner, but an admin process or so. I
> am guessing charging to the new memcg of the previous owner might make
> sense in this case, but it is a change of behavior.
>
I was looking at this in patch 23 as well; I will ask more questions in
the patch thread.
I would suggest separating these two behavior-change patches, patch 5
and patch 23, out from the cleanup series to give them more exposure
and proper discussion.
Chris
Kairui Song <[email protected]> writes:
> Huang, Ying <[email protected]> 于2023年11月20日周一 14:07写道:
>>
>> Kairui Song <[email protected]> writes:
>>
>> > From: Kairui Song <[email protected]>
>> >
>> > Currently VMA readahead is globally disabled when any rotating disk is
>> > used as a swap backend. So when multiple swap devices are enabled, if a
>> > slower hard disk is set as a low-priority fallback and a high-performance
>> > SSD is used as the high-priority swap device, VMA readahead is disabled
>> > globally. The SSD swap device's performance will drop by a lot.
>> >
>> > Check readahead policy per entry to avoid such problem.
>> >
>> > Signed-off-by: Kairui Song <[email protected]>
>> > ---
>> > mm/swap_state.c | 12 +++++++-----
>> > 1 file changed, 7 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/mm/swap_state.c b/mm/swap_state.c
>> > index ff6756f2e8e4..fb78f7f18ed7 100644
>> > --- a/mm/swap_state.c
>> > +++ b/mm/swap_state.c
>> > @@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
>> > return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
>> > }
>> >
>> > -static inline bool swap_use_vma_readahead(void)
>> > +static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
>> > {
>> > - return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
>> > + return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);
>> > }
>> >
>> > /*
>> > @@ -341,7 +341,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
>> >
>> > folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
>> > if (!IS_ERR(folio)) {
>> > - bool vma_ra = swap_use_vma_readahead();
>> > + bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
>> > bool readahead;
>> >
>> > /*
>> > @@ -920,16 +920,18 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>> > struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>> > struct vm_fault *vmf, bool *swapcached)
>> > {
>> > + struct swap_info_struct *si;
>> > struct mempolicy *mpol;
>> > struct page *page;
>> > pgoff_t ilx;
>> > bool cached;
>> >
>> > + si = swp_swap_info(entry);
>> > mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
>> > - if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
>> > + if (swap_use_no_readahead(si, entry)) {
>> > page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
>> > cached = false;
>> > - } else if (swap_use_vma_readahead()) {
>> > + } else if (swap_use_vma_readahead(si)) {
>>
>> It's possible that some pages are swapped out to SSD while others are
>> swapped out to HDD in a readahead window.
>>
>> I suspect that there are practical requirements to use swap on SSD and
>> HDD at the same time.
>
> Hi Ying,
>
> Thanks for the review!
>
> For the first issue "fragmented readahead window", I was planning to
> do an extra check in the readahead path to skip readahead entries that
> are on different swap devices, which is not hard to do,
This is a possible solution.
> but this series is growing too long so I thought it would be better
> done later.
You don't need to keep everything in one series. Just use multiple
series. Even if they are all swap-related. They are dealing with
different problems, in fact.
> For the second issue, "is there any practical use for multiple swap",
> I think there actually are. For example, we are trying to use multi-layer
> swap for offloading memory of different hotness on servers. And we also
> tried to implement a mechanism that migrates long-sleeping swap entries
> from high-performance SSD/RAMDISK swap to a cheap HDD swap device, with
> more than two layers of swap. This worked except for the upstream issue
> that the readahead policy no longer works as expected.
Thanks for your information.
>> > page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
>> > cached = true;
>> > } else {
--
Best Regards,
Huang, Ying
On Mon, Nov 20, 2023 at 3:17 AM Kairui Song <[email protected]> wrote:
>
> Hi Ying,
>
> Thanks for the review!
>
> For the first issue "fragmented readahead window", I was planning to
> do an extra check in the readahead path to skip readahead entries that are
That makes sense. The readahead is an optional speed optimization; if
the readahead crosses swap device boundaries, the readahead portion can
be capped.
> on different swap devices, which is not hard to do, but this series is
> growing too long so I thought it would be better done later.
>
> For the second issue, "is there any practical use for multiple swap",
> I think there actually are. For example, we are trying to use multi-layer
> swap for offloading memory of different hotness on servers. And we also
> tried to implement a mechanism that migrates long-sleeping swap entries
> from high-performance SSD/RAMDISK swap to a cheap HDD swap device, with
> more than two layers of swap. This worked except for the upstream issue
> that the readahead policy no longer works as expected.
Thank you very much for sharing your use case. I am proposing
"memory.swap.tiers" in this email thread:
https://lore.kernel.org/linux-mm/CAF8kJuOD6zq2VPcVdoZGvkzYX8iXn1akuYhNDJx-LUdS+Sx3GA@mail.gmail.com/
It allows a memcg to select which swap devices/tiers it wants to opt in to.
Your SSD and HDD swap combination is what I have in mind as well.
Chris
On Mon, Nov 20, 2023 at 5:12 PM Huang, Ying <[email protected]> wrote:
> > but this series is growing too long so I thought it would be better
> > done later.
>
> You don't need to keep everything in one series. Just use multiple
> series. Even if they are all swap-related. They are dealing with
> different problems, in fact.
I second that. Actually having multiple smaller series is *preferred*
over one long series.
Shorter series are easier to review.
Chris
On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> Make it use alloc_pages_mpol instead of vma_alloc_folio, and accept
> mm_struct directly as an argument instead of taking a vmf as argument.
> Make its arguments similar to swap_{cluster,vma}_readahead, to
> make the code more aligned.
It is unclear to me what the value is of making this more aligned with
swap_{cluster,vma}_readahead here.
> Also prepare for following commits which will skip vmf for certain
> swapin paths.
The following patch 07/24 does not seem to use this patch.
Can it be merged with the patch that actually uses it?
As it is, this patch does not have enough stand-alone value to justify it.
Chris
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/swap_state.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index fd0047ae324e..ff6756f2e8e4 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -867,17 +867,17 @@ static struct page *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> * in.
> */
> static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> - struct vm_fault *vmf)
> + struct mempolicy *mpol, pgoff_t ilx,
> + struct mm_struct *mm)
> {
> - struct vm_area_struct *vma = vmf->vma;
> - struct page *page = NULL;
> struct folio *folio;
> + struct page *page;
> void *shadow = NULL;
>
> - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> - vma, vmf->address, false);
> + page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
> + folio = (struct folio *)page;
> if (folio) {
> - if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> + if (mem_cgroup_swapin_charge_folio(folio, mm,
> GFP_KERNEL, entry)) {
> folio_put(folio);
> return NULL;
> @@ -896,7 +896,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>
> /* To provide entry to swap_readpage() */
> folio->swap = entry;
> - page = &folio->page;
> swap_readpage(page, true, NULL);
> folio->private = NULL;
> }
> @@ -928,7 +927,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>
> mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> if (swap_use_no_readahead(swp_swap_info(entry), entry)) {
> - page = swapin_no_readahead(entry, gfp_mask, vmf);
> + page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> cached = false;
> } else if (swap_use_vma_readahead()) {
> page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> --
> 2.42.0
>
>
Hi Kairui,
On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> No feature change, prepare for later commits.
Again, I don't see the value of having this as a stand-alone patch.
If one of the later patches needs to use this function externally
rather than as static, move it in the patch that uses it. From the
reviewing point of view, it is unnecessary overhead to cross-reference
different patches in order to figure out why it was moved.
Chris
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/swap.h | 5 +++++
> mm/swapfile.c | 5 -----
> 2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index f82d43d7b52a..a9a654af791e 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -61,6 +61,11 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
> {
> return page_swap_info(&folio->page)->flags;
> }
> +
> +static inline unsigned char swap_count(unsigned char ent)
> +{
> + return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
> +}
> #else /* CONFIG_SWAP */
> struct swap_iocb;
> static inline void swap_readpage(struct page *page, bool do_poll,
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 0142bfc71b81..a8ae472ed2b6 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -114,11 +114,6 @@ static struct swap_info_struct *swap_type_to_swap_info(int type)
> return READ_ONCE(swap_info[type]); /* rcu_dereference() */
> }
>
> -static inline unsigned char swap_count(unsigned char ent)
> -{
> - return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
> -}
> -
> /* Reclaim the swap entry anyway if possible */
> #define TTRS_ANYWAY 0x1
> /*
> --
> 2.42.0
>
>
Chris Li <[email protected]> 于2023年11月21日周二 14:52写道:
>
> Hi Kairui,
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > No feature change, prepare for later commits.
>
> Again, I don't see the value of having this as a standalone patch.
> If one of the later patches needs to use this function externally
> rather than as static, move it with the patch that uses it. From the
> reviewing point of view, it is unnecessary overhead to cross-reference
> different patches in order to figure out why it is moved.
Good suggestion. I do this locally for easier rebasing & conflict
resolution; I will get rid of these little parts next time.
>
> Chris
>
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/swap.h | 5 +++++
> > mm/swapfile.c | 5 -----
> > 2 files changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index f82d43d7b52a..a9a654af791e 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -61,6 +61,11 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
> > {
> > return page_swap_info(&folio->page)->flags;
> > }
> > +
> > +static inline unsigned char swap_count(unsigned char ent)
> > +{
> > + return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
> > +}
> > #else /* CONFIG_SWAP */
> > struct swap_iocb;
> > static inline void swap_readpage(struct page *page, bool do_poll,
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 0142bfc71b81..a8ae472ed2b6 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -114,11 +114,6 @@ static struct swap_info_struct *swap_type_to_swap_info(int type)
> > return READ_ONCE(swap_info[type]); /* rcu_dereference() */
> > }
> >
> > -static inline unsigned char swap_count(unsigned char ent)
> > -{
> > - return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
> > -}
> > -
> > /* Reclaim the swap entry anyway if possible */
> > #define TTRS_ANYWAY 0x1
> > /*
> > --
> > 2.42.0
> >
> >
On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> Currently VMA readahead is globally disabled when any rotating disk is
> used as a swap backend. So when multiple swap devices are enabled, if a
> slower hard disk is set as a low priority fallback and a high performance
> SSD is used as the high priority swap device, VMA readahead is disabled
> globally. The SSD swap device performance will drop by a lot.
>
> Check readahead policy per entry to avoid such problem.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/swap_state.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ff6756f2e8e4..fb78f7f18ed7 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
> return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> }
>
> -static inline bool swap_use_vma_readahead(void)
> +static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> {
> - return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> + return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);
A very minor point:
I notice you change the order so enable_vma_readahead comes last.
Normally if enable_vma_readahead == 0, there is no need to check si->flags.
The si->flags check is more expensive than a simple memory load.
You might want to check enable_vma_readahead first so you can
short-circuit the more expensive part.
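For illustration, the reordered check could look like this (a minimal
sketch combining the two halves already present in the diff above, not
tested):

        static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
        {
                /* cheap global load first, short-circuits before touching si */
                return READ_ONCE(enable_vma_readahead) &&
                       data_race(si->flags & SWP_SOLIDSTATE);
        }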
Chris
On Sun, Nov 19, 2023 at 11:43 PM Huang, Ying <[email protected]> wrote:
>
> Kairui Song <[email protected]> writes:
>
> > From: Kairui Song <[email protected]>
> >
> > There is only one caller in swap subsystem now, where it can be inline
> > smoothly, avoid the memory access and function call overheads.
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > include/linux/swap.h | 6 ------
> > mm/swap_state.c | 6 +++---
> > mm/swapfile.c | 8 --------
> > 3 files changed, 3 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2401990d954d..64a37819a9b3 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -485,7 +485,6 @@ int swap_type_of(dev_t device, sector_t offset);
> > int find_first_swap(dev_t *device);
> > extern unsigned int count_swap_pages(int, int);
> > extern sector_t swapdev_block(int, pgoff_t);
> > -extern int __swap_count(swp_entry_t entry);
> > extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
> > extern int swp_swapcount(swp_entry_t entry);
> > extern struct swap_info_struct *page_swap_info(struct page *);
> > @@ -559,11 +558,6 @@ static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> > {
> > }
> >
> > -static inline int __swap_count(swp_entry_t entry)
> > -{
> > - return 0;
> > -}
> > -
> > static inline int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
> > {
> > return 0;
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index fb78f7f18ed7..d87c20f9f7ec 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -316,9 +316,9 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> > release_pages(pages, nr);
> > }
> >
> > -static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
> > +static inline bool swap_use_no_readahead(struct swap_info_struct *si, pgoff_t offset)
> > {
> > - return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> > + return data_race(si->flags & SWP_SYNCHRONOUS_IO) && swap_count(si->swap_map[offset]) == 1;
> > }
> >
> > static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> > @@ -928,7 +928,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >
> > si = swp_swap_info(entry);
> > mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > - if (swap_use_no_readahead(si, entry)) {
> > + if (swap_use_no_readahead(si, swp_offset(entry))) {
> > page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> > cached = false;
> > } else if (swap_use_vma_readahead(si)) {
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index a8ae472ed2b6..e15a6c464a38 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1431,14 +1431,6 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
> > spin_unlock(&p->lock);
> > }
> >
> > -int __swap_count(swp_entry_t entry)
> > -{
> > - struct swap_info_struct *si = swp_swap_info(entry);
> > - pgoff_t offset = swp_offset(entry);
> > -
> > - return swap_count(si->swap_map[offset]);
> > -}
> > -
>
> I'd rather keep __swap_count() in the original place together with other
> swap count related functions. And si->swap_map[] was hided in
> swapfile.c before. I don't think the change will have any real
> performance improvement.
I agree with Ying here. This does not seem to have enough value to
justify a patch by itself.
Chris
Hi Kairui,
On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> No longer needed after we switched to a per-entry swap readahead policy.
This is a behavior change patch; better to separate it out from the
cleanup series for exposure and discussion.
I think the idea is reasonable. The policy should be made at the device
level rather than the system level. I saw Ying has some great feedback
regarding readahead crossing device boundaries.
Chris
On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> No feature change, just prepare for later commits.
You need a proper commit message explaining why this change needs to happen.
"Prepare" is too generic; it does not give any real information.
For example, it seems you want to eliminate one swap cache lookup because
swap_readahead already has one?
I am a bit puzzled by this patch. It shuffles a lot of sensitive code,
but I do not get the value.
It seems like this patch should be merged with the later patch that
depends on it, so they can be judged together.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/memory.c | 61 +++++++++++++++++++++++--------------------------
> mm/swap.h | 10 ++++++--
> mm/swap_state.c | 26 +++++++++++++--------
> mm/swapfile.c | 30 +++++++++++-------------
> 4 files changed, 66 insertions(+), 61 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f4237a2e3b93..22af9f3e8c75 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3786,13 +3786,13 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> vm_fault_t do_swap_page(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> - struct folio *swapcache, *folio = NULL;
> + struct folio *swapcache = NULL, *folio = NULL;
> + enum swap_cache_result cache_result;
> struct page *page;
> struct swap_info_struct *si = NULL;
> rmap_t rmap_flags = RMAP_NONE;
> bool exclusive = false;
> swp_entry_t entry;
> - bool swapcached;
> pte_t pte;
> vm_fault_t ret = 0;
>
> @@ -3850,42 +3850,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (unlikely(!si))
> goto out;
>
> - folio = swap_cache_get_folio(entry, vma, vmf->address);
> - if (folio)
> - page = folio_file_page(folio, swp_offset(entry));
> - swapcache = folio;
Is the motivation that swap_readahead() already has a swap cache lookup,
so you remove this lookup here?
> -
> - if (!folio) {
> - page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> - vmf, &swapcached);
> - if (page) {
> - folio = page_folio(page);
> - if (swapcached)
> - swapcache = folio;
> - } else {
> + page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> + vmf, &cache_result);
> + if (page) {
> + folio = page_folio(page);
> + if (cache_result != SWAP_CACHE_HIT) {
> + /* Had to read the page from swap area: Major fault */
> + ret = VM_FAULT_MAJOR;
> + count_vm_event(PGMAJFAULT);
> + count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> + }
> + if (cache_result != SWAP_CACHE_BYPASS)
> + swapcache = folio;
> + if (PageHWPoison(page)) {
There is a lot of code shuffling here. From the diff it is hard to tell
whether it is doing the same thing as before.
> /*
> - * Back out if somebody else faulted in this pte
> - * while we released the pte lock.
> + * hwpoisoned dirty swapcache pages are kept for killing
> + * owner processes (which may be unknown at hwpoison time)
> */
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> - if (likely(vmf->pte &&
> - pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> - ret = VM_FAULT_OOM;
> - goto unlock;
> + ret = VM_FAULT_HWPOISON;
> + goto out_release;
> }
> -
> - /* Had to read the page from swap area: Major fault */
> - ret = VM_FAULT_MAJOR;
> - count_vm_event(PGMAJFAULT);
> - count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> - } else if (PageHWPoison(page)) {
> + } else {
> /*
> - * hwpoisoned dirty swapcache pages are kept for killing
> - * owner processes (which may be unknown at hwpoison time)
> + * Back out if somebody else faulted in this pte
> + * while we released the pte lock.
> */
> - ret = VM_FAULT_HWPOISON;
> - goto out_release;
> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> + vmf->address, &vmf->ptl);
> + if (likely(vmf->pte &&
> + pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> + ret = VM_FAULT_OOM;
> + goto unlock;
> }
>
> ret |= folio_lock_or_retry(folio, vmf);
> diff --git a/mm/swap.h b/mm/swap.h
> index a9a654af791e..ac9136eee690 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -30,6 +30,12 @@ extern struct address_space *swapper_spaces[];
> (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
> >> SWAP_ADDRESS_SPACE_SHIFT])
>
> +enum swap_cache_result {
> + SWAP_CACHE_HIT,
> + SWAP_CACHE_MISS,
> + SWAP_CACHE_BYPASS,
> +};
Does any function later care about CACHE_BYPASS?
Again, better to introduce it with the function that uses it. Don't
introduce it "just in case I might use it later".
> +
> void show_swap_cache_info(void);
> bool add_to_swap(struct folio *folio);
> void *get_shadow_from_swap_cache(swp_entry_t entry);
> @@ -55,7 +61,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> struct mempolicy *mpol, pgoff_t ilx);
> struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> - struct vm_fault *vmf, bool *swapcached);
> + struct vm_fault *vmf, enum swap_cache_result *result);
>
> static inline unsigned int folio_swap_flags(struct folio *folio)
> {
> @@ -92,7 +98,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
> }
>
> static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> - struct vm_fault *vmf, bool *swapcached)
> + struct vm_fault *vmf, enum swap_cache_result *result)
> {
> return NULL;
> }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index d87c20f9f7ec..e96d63bf8a22 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -908,8 +908,7 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> * @entry: swap entry of this memory
> * @gfp_mask: memory allocation flags
> * @vmf: fault information
> - * @swapcached: pointer to a bool used as indicator if the
> - * page is swapped in through swapcache.
> + * @result: a return value to indicate swap cache usage.
> *
> * Returns the struct page for entry and addr, after queueing swapin.
> *
> @@ -918,30 +917,39 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> * or vma-based(ie, virtual address based on faulty address) readahead.
> */
> struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> - struct vm_fault *vmf, bool *swapcached)
> + struct vm_fault *vmf, enum swap_cache_result *result)
> {
> + enum swap_cache_result cache_result;
> struct swap_info_struct *si;
> struct mempolicy *mpol;
> + struct folio *folio;
> struct page *page;
> pgoff_t ilx;
> - bool cached;
> +
> + folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
> + if (folio) {
> + page = folio_file_page(folio, swp_offset(entry));
> + cache_result = SWAP_CACHE_HIT;
> + goto done;
> + }
>
> si = swp_swap_info(entry);
> mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> if (swap_use_no_readahead(si, swp_offset(entry))) {
> page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> - cached = false;
> + cache_result = SWAP_CACHE_BYPASS;
> } else if (swap_use_vma_readahead(si)) {
> page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> - cached = true;
> + cache_result = SWAP_CACHE_MISS;
> } else {
> page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> - cached = true;
> + cache_result = SWAP_CACHE_MISS;
> }
> mpol_cond_put(mpol);
>
> - if (swapcached)
> - *swapcached = cached;
> +done:
> + if (result)
> + *result = cache_result;
>
> return page;
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 01c3f53b6521..b6d57fff5e21 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1822,13 +1822,21 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>
> si = swap_info[type];
> do {
> - struct folio *folio;
> + struct page *page;
> unsigned long offset;
> unsigned char swp_count;
> + struct folio *folio = NULL;
> swp_entry_t entry;
> int ret;
> pte_t ptent;
>
> + struct vm_fault vmf = {
> + .vma = vma,
> + .address = addr,
> + .real_address = addr,
> + .pmd = pmd,
> + };
Is this code move caused by skipping the swap cache lookup here?
This is very sensitive code related to swap cache racing. It needs
very careful review. Better not to shuffle it without good reason.
Chris
> +
> if (!pte++) {
> pte = pte_offset_map(pmd, addr);
> if (!pte)
> @@ -1847,22 +1855,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> offset = swp_offset(entry);
> pte_unmap(pte);
> pte = NULL;
> -
> - folio = swap_cache_get_folio(entry, vma, addr);
> - if (!folio) {
> - struct page *page;
> - struct vm_fault vmf = {
> - .vma = vma,
> - .address = addr,
> - .real_address = addr,
> - .pmd = pmd,
> - };
> -
> - page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> - &vmf, NULL);
> - if (page)
> - folio = page_folio(page);
> - }
> + page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> + &vmf, NULL);
> + if (page)
> + folio = page_folio(page);
> if (!folio) {
> /*
> * The entry could have been freed, and will not
> --
> 2.42.0
>
>
On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> There are only two callers now, simplify the arguments.
I don't think this patch is needed. It will not have a real impact on
the resulting kernel.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/shmem.c | 2 +-
> mm/swap.h | 2 +-
> mm/swap_state.c | 15 +++++++--------
> 3 files changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0d1ce70bce38..72239061c655 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1875,7 +1875,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> }
>
> /* Look it up and read it in.. */
> - folio = swap_cache_get_folio(swap, NULL, 0);
> + folio = swap_cache_get_folio(swap, NULL);
> if (!folio) {
> /* Or update major stats only when swapin succeeds?? */
> if (fault_type) {
> diff --git a/mm/swap.h b/mm/swap.h
> index ac9136eee690..e43e965f123f 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -47,7 +47,7 @@ void delete_from_swap_cache(struct folio *folio);
> void clear_shadow_from_swap_cache(int type, unsigned long begin,
> unsigned long end);
> struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_area_struct *vma, unsigned long addr);
> + struct vm_fault *vmf);
> struct folio *filemap_get_incore_folio(struct address_space *mapping,
> pgoff_t index);
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index e96d63bf8a22..91461e26a8cc 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -334,8 +334,7 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> *
> * Caller must lock the swap device or hold a reference to keep it valid.
> */
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_area_struct *vma, unsigned long addr)
> +struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)
I actually prefer the original code. The vm_fault is a much heavier object;
vma and address are really all this function minimally needs. Even though
they are two arguments, I consider vma and address simpler than a vm_fault
struct.
It is possible some other non-fault path will want to look up the
swap cache; constructing a vmf just to carry the vma and address is kind
of unnatural.
> {
> struct folio *folio;
>
> @@ -352,22 +351,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
> return folio;
>
> readahead = folio_test_clear_readahead(folio);
> - if (vma && vma_ra) {
> + if (vmf && vma_ra) {
> unsigned long ra_val;
> int win, hits;
>
> - ra_val = GET_SWAP_RA_VAL(vma);
> + ra_val = GET_SWAP_RA_VAL(vmf->vma);
> win = SWAP_RA_WIN(ra_val);
> hits = SWAP_RA_HITS(ra_val);
> if (readahead)
> hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> - atomic_long_set(&vma->swap_readahead_info,
> - SWAP_RA_VAL(addr, win, hits));
> + atomic_long_set(&vmf->vma->swap_readahead_info,
> + SWAP_RA_VAL(vmf->address, win, hits));
> }
>
> if (readahead) {
> count_vm_event(SWAP_RA_HIT);
> - if (!vma || !vma_ra)
> + if (!vmf || !vma_ra)
> atomic_inc(&swapin_readahead_hits);
> }
> } else {
> @@ -926,7 +925,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> struct page *page;
> pgoff_t ilx;
>
> - folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
> + folio = swap_cache_get_folio(entry, vmf);
> if (folio) {
> page = folio_file_page(folio, swp_offset(entry));
> cache_result = SWAP_CACHE_HIT;
> --
> 2.42.0
>
>
Hi Kairui,
I agree the resulting code is marginally better.
However, this change does not bring enough value to justify a standalone
patch. It has no impact on the resulting kernel.
It would be much better if the code had been checked in like this, or if
you are modifying this function anyway, rewrite it better then. In my
opinion, very trivial code shuffling for the sake of cleaning up is not
justifiable for the value it brings.
For one, it will make git blame less obvious about who actually changed
that code and for what reason. I am against trivial code shuffling.
Chris
On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> Rearrange the if statement, reduce the code indent, no feature change.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/swap_state.c | 58 ++++++++++++++++++++++++-------------------------
> 1 file changed, 28 insertions(+), 30 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 91461e26a8cc..3b5a34f47192 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -336,41 +336,39 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> */
> struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)
> {
> + bool vma_ra, readahead;
> struct folio *folio;
>
> folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
> - if (!IS_ERR(folio)) {
> - bool vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
> - bool readahead;
> + if (IS_ERR(folio))
> + return NULL;
>
> - /*
> - * At the moment, we don't support PG_readahead for anon THP
> - * so let's bail out rather than confusing the readahead stat.
> - */
> - if (unlikely(folio_test_large(folio)))
> - return folio;
> -
> - readahead = folio_test_clear_readahead(folio);
> - if (vmf && vma_ra) {
> - unsigned long ra_val;
> - int win, hits;
> -
> - ra_val = GET_SWAP_RA_VAL(vmf->vma);
> - win = SWAP_RA_WIN(ra_val);
> - hits = SWAP_RA_HITS(ra_val);
> - if (readahead)
> - hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> - atomic_long_set(&vmf->vma->swap_readahead_info,
> - SWAP_RA_VAL(vmf->address, win, hits));
> - }
> + /*
> + * At the moment, we don't support PG_readahead for anon THP
> + * so let's bail out rather than confusing the readahead stat.
> + */
> + if (unlikely(folio_test_large(folio)))
> + return folio;
>
> - if (readahead) {
> - count_vm_event(SWAP_RA_HIT);
> - if (!vmf || !vma_ra)
> - atomic_inc(&swapin_readahead_hits);
> - }
> - } else {
> - folio = NULL;
> + vma_ra = swap_use_vma_readahead(swp_swap_info(entry));
> + readahead = folio_test_clear_readahead(folio);
> + if (vmf && vma_ra) {
> + unsigned long ra_val;
> + int win, hits;
> +
> + ra_val = GET_SWAP_RA_VAL(vmf->vma);
> + win = SWAP_RA_WIN(ra_val);
> + hits = SWAP_RA_HITS(ra_val);
> + if (readahead)
> + hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> + atomic_long_set(&vmf->vma->swap_readahead_info,
> + SWAP_RA_VAL(vmf->address, win, hits));
> + }
> +
> + if (readahead) {
> + count_vm_event(SWAP_RA_HIT);
> + if (!vmf || !vma_ra)
> + atomic_inc(&swapin_readahead_hits);
> }
>
> return folio;
> --
> 2.42.0
>
>
Hi Kairui,
Too trivial to stand alone as a patch. Merge it with the patch that
needs to use that "*shadow".
Chris
On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> Make swap_cache_get_folio capable of returning the shadow value when the
> xarray contains a shadow instead of a valid folio. Just extend the
> arguments to be used later.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/shmem.c | 2 +-
> mm/swap.h | 2 +-
> mm/swap_state.c | 11 +++++++----
> 3 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 72239061c655..f9ce4067c742 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1875,7 +1875,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> }
>
> /* Look it up and read it in.. */
> - folio = swap_cache_get_folio(swap, NULL);
> + folio = swap_cache_get_folio(swap, NULL, NULL);
> if (!folio) {
> /* Or update major stats only when swapin succeeds?? */
> if (fault_type) {
> diff --git a/mm/swap.h b/mm/swap.h
> index e43e965f123f..da9deb5ba37d 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -47,7 +47,7 @@ void delete_from_swap_cache(struct folio *folio);
> void clear_shadow_from_swap_cache(int type, unsigned long begin,
> unsigned long end);
> struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_fault *vmf);
> + struct vm_fault *vmf, void **shadowp);
> struct folio *filemap_get_incore_folio(struct address_space *mapping,
> pgoff_t index);
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3b5a34f47192..e057c79fb06f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -334,14 +334,17 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> *
> * Caller must lock the swap device or hold a reference to keep it valid.
> */
> -struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf)
> +struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
> {
> bool vma_ra, readahead;
> struct folio *folio;
>
> - folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
> - if (IS_ERR(folio))
> + folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
> + if (xa_is_value(folio)) {
> + if (shadowp)
> + *shadowp = folio;
> return NULL;
> + }
>
> /*
> * At the moment, we don't support PG_readahead for anon THP
> @@ -923,7 +926,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> struct page *page;
> pgoff_t ilx;
>
> - folio = swap_cache_get_folio(entry, vmf);
> + folio = swap_cache_get_folio(entry, vmf, NULL);
> if (folio) {
> page = folio_file_page(folio, swp_offset(entry));
> cache_result = SWAP_CACHE_HIT;
> --
> 2.42.0
>
>
On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> When an xa_value is returned by the cache lookup, keep it to be used
> later for the workingset refault check instead of doing the lookup again
> in swapin_no_readahead.
>
> This does have a side effect of making swapoff also trigger the workingset
> check, but that should be fine since swapoff affects the workload in many
> ways already.
I need to sleep on it a bit to see if this will create another problem or not.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/swap_state.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index e057c79fb06f..51de2a0412df 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -872,7 +872,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> {
> struct folio *folio;
> struct page *page;
> - void *shadow = NULL;
>
> page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
> folio = (struct folio *)page;
> @@ -888,10 +887,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
>
> mem_cgroup_swapin_uncharge_swap(entry);
>
> - shadow = get_shadow_from_swap_cache(entry);
> - if (shadow)
> - workingset_refault(folio, shadow);
> -
> folio_add_lru(folio);
>
> /* To provide entry to swap_readpage() */
> @@ -922,11 +917,12 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> enum swap_cache_result cache_result;
> struct swap_info_struct *si;
> struct mempolicy *mpol;
> + void *shadow = NULL;
> struct folio *folio;
> struct page *page;
> pgoff_t ilx;
>
> - folio = swap_cache_get_folio(entry, vmf, NULL);
> + folio = swap_cache_get_folio(entry, vmf, &shadow);
> if (folio) {
> page = folio_file_page(folio, swp_offset(entry));
> cache_result = SWAP_CACHE_HIT;
> @@ -938,6 +934,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> if (swap_use_no_readahead(si, swp_offset(entry))) {
> page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> cache_result = SWAP_CACHE_BYPASS;
> + if (shadow)
> + workingset_refault(page_folio(page), shadow);
It is inconsistent that the other flavors of readahead do not do the
workingset_refault here.
I suggest keeping the workingset_refault in swapin_no_readahead() and
passing the shadow argument in.
Chris
> } else if (swap_use_vma_readahead(si)) {
> page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> cache_result = SWAP_CACHE_MISS;
> --
> 2.42.0
>
>
On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> Move get_swap_device into swapin_readahead, simplify the code
> and prepare for follow up commits.
>
> For the later part in do_swap_page, using swp_swap_info directly is fine
> since in that context, the swap device is pinned by swapcache reference.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/memory.c | 16 ++++------------
> mm/swap_state.c | 8 ++++++--
> mm/swapfile.c | 4 +++-
> 3 files changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 22af9f3e8c75..e399b37ef395 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3789,7 +3789,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> struct folio *swapcache = NULL, *folio = NULL;
> enum swap_cache_result cache_result;
> struct page *page;
> - struct swap_info_struct *si = NULL;
> rmap_t rmap_flags = RMAP_NONE;
> bool exclusive = false;
> swp_entry_t entry;
> @@ -3845,14 +3844,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> goto out;
> }
>
> - /* Prevent swapoff from happening to us. */
> - si = get_swap_device(entry);
> - if (unlikely(!si))
> - goto out;
> -
> page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> vmf, &cache_result);
> - if (page) {
> + if (PTR_ERR(page) == -EBUSY) {
> + goto out;
> + } else if (page) {
As Matthew suggested, use ERR_PTR(-EBUSY) for the comparison.
You also don't need the else here; the goto already terminates the flow:

        if (page == ERR_PTR(-EBUSY))
                goto out;

        if (page) {
Chris
Kairui Song <[email protected]> writes:
> From: Kairui Song <[email protected]>
>
> Move get_swap_device into swapin_readahead, simplify the code
> and prepare for follow up commits.
No. Please don't do this. Please check the get/put_swap_device() usage
rule in the comments of get_swap_device().
"
* When we get a swap entry, if there aren't some other ways to
* prevent swapoff, such as the folio in swap cache is locked, page
* table lock is held, etc., the swap entry may become invalid because
* of swapoff. Then, we need to enclose all swap related functions
* with get_swap_device() and put_swap_device(), unless the swap
* functions call get/put_swap_device() by themselves.
"
This is to simplify the reasoning about swapoff and swap entry.
Why does it bother you?
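For reference, the usage pattern that comment describes is roughly (a
minimal sketch, not code from this patch):

        si = get_swap_device(entry);
        if (unlikely(!si))
                goto out;       /* raced with swapoff, treat the entry as stale */
        /* ... all swap operations that use 'entry' go here ... */
        put_swap_device(si);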
> For the later part in do_swap_page, using swp_swap_info directly is fine
> since in that context, the swap device is pinned by swapcache reference.
[snip]
--
Best Regards,
Huang, Ying
On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> There are two places where swapin is not directly caused by a page fault:
> shmem swapin is invoked through the shmem mapping, and swapoff causes
> swapin by walking the page table. They used to construct a pseudo vmfault
> struct for the swapin function.
>
> Shmem has dropped the pseudo vmfault recently in commit ddc1a5cbc05d
> ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). Swapoff
> path is still using a pseudo vmfault.
>
> Introduce a helper for them both; this helps save stack usage in the
> swapoff path, and helps apply a unified swapin cache and readahead policy check.
>
> Also prepare for follow up commits.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/shmem.c | 51 ++++++++++++++++---------------------------------
> mm/swap.h | 11 +++++++++++
> mm/swap_state.c | 38 ++++++++++++++++++++++++++++++++++++
> mm/swapfile.c | 23 +++++++++++-----------
> 4 files changed, 76 insertions(+), 47 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f9ce4067c742..81d129aa66d1 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1565,22 +1565,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
> static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
> pgoff_t index, unsigned int order, pgoff_t *ilx);
>
> -static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
> - struct shmem_inode_info *info, pgoff_t index)
> -{
> - struct mempolicy *mpol;
> - pgoff_t ilx;
> - struct page *page;
> -
> - mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> - page = swap_cluster_readahead(swap, gfp, mpol, ilx);
> - mpol_cond_put(mpol);
> -
> - if (!page)
> - return NULL;
> - return page_folio(page);
> -}
> -
Nice. Thank you.
> /*
> * Make sure huge_gfp is always more limited than limit_gfp.
> * Some of the flags set permissions, while others set limitations.
> @@ -1854,9 +1838,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> {
> struct address_space *mapping = inode->i_mapping;
> struct shmem_inode_info *info = SHMEM_I(inode);
> - struct swap_info_struct *si;
> + enum swap_cache_result result;
> struct folio *folio = NULL;
> + struct mempolicy *mpol;
> + struct page *page;
> swp_entry_t swap;
> + pgoff_t ilx;
> int error;
>
> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> @@ -1866,34 +1853,30 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> if (is_poisoned_swp_entry(swap))
> return -EIO;
>
> - si = get_swap_device(swap);
> - if (!si) {
> + mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> + page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
Notice this "result" CAN be outdated. e.g. after this call, the swap
cache can be changed by another thread generating the swap page fault
and installing the folio into the swap cache or removing it.
> + mpol_cond_put(mpol);
> +
> + if (PTR_ERR(page) == -EBUSY) {
> if (!shmem_confirm_swap(mapping, index, swap))
> return -EEXIST;
Not your fault. The if statement already returned.
> else
This is not needed; the next return -EINVAL can be one indent level less.
> return -EINVAL;
> - }
> -
> - /* Look it up and read it in.. */
> - folio = swap_cache_get_folio(swap, NULL, NULL);
> - if (!folio) {
> - /* Or update major stats only when swapin succeeds?? */
> - if (fault_type) {
> + } else if (!page) {
Don't need the else here, because the previous if statement always returns.
> + error = -ENOMEM;
> + goto failed;
> + } else {
Don't need the else here. The previous goto terminates the flow.
> + folio = page_folio(page);
> + if (fault_type && result != SWAP_CACHE_HIT) {
> *fault_type |= VM_FAULT_MAJOR;
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> - /* Here we actually start the io */
> - folio = shmem_swapin_cluster(swap, gfp, info, index);
> - if (!folio) {
> - error = -ENOMEM;
> - goto failed;
> - }
> }
>
> /* We have to do this with folio locked to prevent races */
> folio_lock(folio);
> - if (!folio_test_swapcache(folio) ||
> + if ((result != SWAP_CACHE_BYPASS && !folio_test_swapcache(folio)) ||
I think there is a possible racing bug here, because the result can be outdated.
> folio->swap.val != swap.val ||
> !shmem_confirm_swap(mapping, index, swap)) {
> error = -EEXIST;
> @@ -1930,7 +1913,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> delete_from_swap_cache(folio);
> folio_mark_dirty(folio);
> swap_free(swap);
> - put_swap_device(si);
>
> *foliop = folio;
> return 0;
> @@ -1944,7 +1926,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> folio_unlock(folio);
> folio_put(folio);
> }
> - put_swap_device(si);
>
> return error;
> }
> diff --git a/mm/swap.h b/mm/swap.h
> index da9deb5ba37d..b073c29c9790 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -62,6 +62,10 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> struct mempolicy *mpol, pgoff_t ilx);
> struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> struct vm_fault *vmf, enum swap_cache_result *result);
> +struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
> + struct mempolicy *mpol, pgoff_t ilx,
> + struct mm_struct *mm,
> + enum swap_cache_result *result);
>
> static inline unsigned int folio_swap_flags(struct folio *folio)
> {
> @@ -103,6 +107,13 @@ static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> return NULL;
> }
>
> +static inline struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
> + struct mempolicy *mpol, pgoff_t ilx, struct mm_struct *mm,
> + enum swap_cache_result *result)
> +{
> + return NULL;
> +}
> +
> static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> {
> return 0;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ff8a166603d0..eef66757c615 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -956,6 +956,44 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> return page;
> }
>
> +struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
> + struct mempolicy *mpol, pgoff_t ilx,
> + struct mm_struct *mm, enum swap_cache_result *result)
Can you pick a better function name, e.g. one with no negative words? The
function should be named after what it does, not who calls it. The
caller usage might change over time.
I saw that swapin_page_non_fault() and swapin_readahead() are doing
similar things with a similar structure. Can you unify these two
somehow?
Chris
> +{
> + enum swap_cache_result cache_result;
> + struct swap_info_struct *si;
> + void *shadow = NULL;
> + struct folio *folio;
> + struct page *page;
> +
> + /* Prevent swapoff from happening to us */
> + si = get_swap_device(entry);
> + if (unlikely(!si))
> + return ERR_PTR(-EBUSY);
> +
> + folio = swap_cache_get_folio(entry, NULL, &shadow);
> + if (folio) {
> + page = folio_file_page(folio, swp_offset(entry));
> + cache_result = SWAP_CACHE_HIT;
> + goto done;
> + }
> +
> + if (swap_use_no_readahead(si, swp_offset(entry))) {
> + page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, mm);
> + if (shadow)
> + workingset_refault(page_folio(page), shadow);
> + cache_result = SWAP_CACHE_BYPASS;
> + } else {
> + page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> + cache_result = SWAP_CACHE_MISS;
> + }
> +done:
> + put_swap_device(si);
> + if (result)
> + *result = cache_result;
> + return page;
> +}
> +
> #ifdef CONFIG_SYSFS
> static ssize_t vma_ra_enabled_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 925ad92486a4..f8c5096fe0f0 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1822,20 +1822,15 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>
> si = swap_info[type];
> do {
> + int ret;
> + pte_t ptent;
> + pgoff_t ilx;
> + swp_entry_t entry;
> struct page *page;
> unsigned long offset;
> + struct mempolicy *mpol;
> unsigned char swp_count;
> struct folio *folio = NULL;
> - swp_entry_t entry;
> - int ret;
> - pte_t ptent;
> -
> - struct vm_fault vmf = {
> - .vma = vma,
> - .address = addr,
> - .real_address = addr,
> - .pmd = pmd,
> - };
>
> if (!pte++) {
> pte = pte_offset_map(pmd, addr);
> @@ -1855,8 +1850,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> offset = swp_offset(entry);
> pte_unmap(pte);
> pte = NULL;
> - page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> - &vmf, NULL);
> +
> + mpol = get_vma_policy(vma, addr, 0, &ilx);
> + page = swapin_page_non_fault(entry, GFP_HIGHUSER_MOVABLE,
> + mpol, ilx, vma->vm_mm, NULL);
> + mpol_cond_put(mpol);
> +
> if (IS_ERR(page))
> return PTR_ERR(page);
> else if (page)
> --
> 2.42.0
>
>
On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> There are only two callers within the same file now, make it static and
> drop the redundant if check on the shadow variable.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/swap.h | 2 --
> mm/swap_state.c | 5 ++---
> 2 files changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index b073c29c9790..4402970547e7 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -46,8 +46,6 @@ void __delete_from_swap_cache(struct folio *folio,
> void delete_from_swap_cache(struct folio *folio);
> void clear_shadow_from_swap_cache(int type, unsigned long begin,
> unsigned long end);
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_fault *vmf, void **shadowp);
> struct folio *filemap_get_incore_folio(struct address_space *mapping,
> pgoff_t index);
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index eef66757c615..6f39aa8394f1 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -334,15 +334,14 @@ static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> *
> * Caller must lock the swap device or hold a reference to keep it valid.
> */
> -struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
> +static struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_fault *vmf, void **shadowp)
> {
> bool vma_ra, readahead;
> struct folio *folio;
>
> folio = filemap_get_entry(swap_address_space(entry), swp_offset(entry));
> if (xa_is_value(folio)) {
> - if (shadowp)
> - *shadowp = folio;
> + *shadowp = folio;
I prefer to keep the original code, which performs the check before
assigning. The caller usage situation might change, and it is more
defensive.
This does not seem to justify a standalone patch either.
Chris
On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> There is only one caller now, in the page fault path; make the result
> return argument mandatory.
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/swap_state.c | 17 +++++++----------
> 1 file changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 6f39aa8394f1..0433a2586c6d 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -913,7 +913,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_fault *vmf, enum swap_cache_result *result)
> {
> - enum swap_cache_result cache_result;
> struct swap_info_struct *si;
> struct mempolicy *mpol;
> void *shadow = NULL;
> @@ -928,29 +927,27 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>
> folio = swap_cache_get_folio(entry, vmf, &shadow);
> if (folio) {
> + *result = SWAP_CACHE_HIT;
> page = folio_file_page(folio, swp_offset(entry));
> - cache_result = SWAP_CACHE_HIT;
> goto done;
> }
>
> mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> if (swap_use_no_readahead(si, swp_offset(entry))) {
> + *result = SWAP_CACHE_BYPASS;
Each of this "*result" will compile into memory store instructions.
The compiler most likely can't optimize and combine them together
because the store can cause segfault from the compiler's point of
view. The multiple local variable assignment can be compiled into a
few registers assignment so it does not cost as much as multiple
memory stores.
> page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> - cache_result = SWAP_CACHE_BYPASS;
> if (shadow)
> workingset_refault(page_folio(page), shadow);
> - } else if (swap_use_vma_readahead(si)) {
> - page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> - cache_result = SWAP_CACHE_MISS;
> } else {
> - page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> - cache_result = SWAP_CACHE_MISS;
> + *result = SWAP_CACHE_MISS;
> + if (swap_use_vma_readahead(si))
> + page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> + else
> + page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
I recall you introduced or heavily modified this function in a previous
patch. Consider combining some of the patches and presenting the final
version sooner.
From the reviewing point of view, there is no need to review so many
internal versions which get overwritten anyway.
> }
> mpol_cond_put(mpol);
> done:
> put_swap_device(si);
> - if (result)
> - *result = cache_result;
The original version, with the check and assignment in one place, is
better. It is safer and produces better code.
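As a minimal sketch (using the names from the diff above), the one-store
pattern looks like:

        enum swap_cache_result cache_result;

        if (folio)
                cache_result = SWAP_CACHE_HIT;
        else if (swap_use_no_readahead(si, swp_offset(entry)))
                cache_result = SWAP_CACHE_BYPASS;
        else
                cache_result = SWAP_CACHE_MISS;

        if (result)
                *result = cache_result; /* one store, in one place */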
Chris
Hi Kairui,
On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <[email protected]> wrote:
>
> From: Kairui Song <[email protected]>
>
> Now there is no caller outside the same file, make it static.
Seems to me too trivial/low value to justify as a standalone patch.
Chris
>
> Signed-off-by: Kairui Song <[email protected]>
> ---
> mm/swap.h | 8 --------
> mm/swap_state.c | 4 ++--
> 2 files changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 4402970547e7..795a25df87da 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -56,8 +56,6 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct mempolicy *mpol, pgoff_t ilx,
> bool *new_page_allocated);
> -struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> - struct mempolicy *mpol, pgoff_t ilx);
> struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> struct vm_fault *vmf, enum swap_cache_result *result);
> struct page *swapin_page_non_fault(swp_entry_t entry, gfp_t gfp_mask,
> @@ -93,12 +91,6 @@ static inline void show_swap_cache_info(void)
> {
> }
>
> -static inline struct page *swap_cluster_readahead(swp_entry_t entry,
> - gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx)
> -{
> - return NULL;
> -}
> -
> static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> struct vm_fault *vmf, enum swap_cache_result *result)
> {
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 0433a2586c6d..b377e55cb850 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -627,8 +627,8 @@ static unsigned long swapin_nr_pages(unsigned long offset)
> * are used for every page of the readahead: neighbouring pages on swap
> * are fairly likely to have been swapped out from the same node.
> */
> -struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> - struct mempolicy *mpol, pgoff_t ilx)
> +static struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> + struct mempolicy *mpol, pgoff_t ilx)
> {
> struct page *page;
> unsigned long entry_offset = swp_offset(entry);
> --
> 2.42.0
>
>
Hi Kairui,
On Mon, Nov 20, 2023 at 3:17 AM Kairui Song <[email protected]> wrote:
>
> Huang, Ying <[email protected]> 于2023年11月20日周一 15:37写道:
> >
> > Kairui Song <[email protected]> writes:
> >
> > > From: Kairui Song <[email protected]>
> > >
> > > When a process which previously swapped some memory was moved to
> > > another cgroup, and the cgroup it previous in is dead, then swapped in
> > > pages will be leaked into rootcg. Previous commits fixed the bug for
> > > no readahead path, this commit fix the same issue for readahead path.
> > >
> > > This can be easily reproduced by:
> > > - Setup a SSD or HDD swap.
> > > - Create memory cgroup A, B and C.
> > > - Spawn process P1 in cgroup A and make it swap out some pages.
> > > - Move process P1 to memory cgroup B.
> > > - Destroy cgroup A.
> > > - Do a swapoff in cgroup C
> > > - Swapped in pages is accounted into cgroup C.
> > >
> > > This patch will fix it make the swapped in pages accounted in cgroup B.
> >
> > Accroding to "Memory Ownership" section of
> > Documentation/admin-guide/cgroup-v2.rst,
> >
> > "
> > A memory area is charged to the cgroup which instantiated it and stays
> > charged to the cgroup until the area is released. Migrating a process
> > to a different cgroup doesn't move the memory usages that it
> > instantiated while in the previous cgroup to the new cgroup.
> > "
> >
> > Because we don't move the charge when we move a task from one cgroup to
> > another. It's controversial which cgroup should be charged to.
> > According to the above document, it's acceptable to charge to the cgroup
> > C (cgroup where swapoff happens).
>
> Hi Ying, thank you very much for the info!
>
> It is controversial indeed, it's just that the original behavior is
> kind of counter-intuitive.
>
> Imagine there is a cgroup P1 with child cgroups C1 and C2. If a process
> swapped out some memory in C1 then moved to C2, and C1 is dead,
> on swapoff the charge will be moved out of P1...
>
> And swapoff often happens in some unlimited cgroup or some cgroup for a
> management agent.
>
> If P1 has a memory limit, it can breach the limit easily; we will see
> a process that never left P1 having a much higher RSS than P1/C1/C2's
> limit.
> And if there is a limit on the management agent cgroup, the agent
> will OOM instead of the OOM happening in P1.
I think I will reply to another similar email.
If you want the OOM in P1, you can have an admin program fork and execute
a new process, add the new process into P1, then run swapoff from that
new process.
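As a hedged illustration of that suggestion (the cgroup path and the use
of the cgroup v2 interface are assumptions for the example, not from this
thread), such an admin program could look like:

        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                /* join cgroup P1 (hypothetical path) so pages swapped in
                 * on behalf of dead child cgroups are charged under P1 */
                FILE *f = fopen("/sys/fs/cgroup/P1/cgroup.procs", "w");

                if (!f)
                        return 1;
                fprintf(f, "%d\n", (int)getpid());
                fclose(f);

                /* run swapoff from inside P1 */
                execlp("swapoff", "swapoff", "-a", (char *)NULL);
                return 1;       /* only reached if exec fails */
        }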
>
> Simply moving a process between child cgroups of the same parent
> cgroup won't cause such an issue; things get weird when swapoff is
> involved.
>
> Or maybe we should try to be compatible, and introduce a sysctl or
> cmdline for this?
If the above suggestion works, then you don't need to change swapoff, right?
If you still want to change the charging model, I would like to see the
bigger picture: what rules it follows and how it works in other
situations.
Chris
Yosry Ahmed <[email protected]> 于2023年11月21日周二 03:18写道:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > This series tries to unify and clean up the swapin path, fixing a few
> > issues with optimizations:
> >
> > 1. Memcg leak issue: when a process that previously swapped out some
> > migrated to another cgroup, and the origianl cgroup is dead. If we
> > do a swapoff, swapped in pages will be accounted into the process
> > doing swapoff instead of the new cgroup. This will allow the process
> > to use more memory than expect easily.
> >
> > This can be easily reproduced by:
> > - Setup a swap.
> > - Create memory cgroup A, B and C.
> > - Spawn process P1 in cgroup A and make it swap out some pages.
> > - Move process P1 to memory cgroup B.
> > - Destroy cgroup A.
> > - Do a swapoff in cgroup C
> > - Swapped in pages is accounted into cgroup C.
> >
> > This patch will fix it make the swapped in pages accounted in cgroup B.
> >
>
> I guess this only works for anonymous memory and not shmem, right?
Hi Yosry,
Yes, this patch only changed the charge target for anon pages.
>
> I think tying memcg charges to a process is not something we usually
> do. Charging the pages to the memcg of the faulting process if the
> previous owner is dead makes sense, it's essentially recharging the
> memory to the new owner. Swapoff is indeed a special case, since the
> faulting process is not the new owner, but an admin process or so. I
> am guessing charging to the new memcg of the previous owner might make
> sense in this case, but it is a change of behavior.
After discussing with others I also found this is a controversial issue;
I will send a separate series for this, since it needs more discussion.
Chris Li <[email protected]> 于2023年11月21日周二 04:23写道:
>
> Hi Kairui,
>
> On Mon, Nov 20, 2023 at 11:10 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
> > >
> > > From: Kairui Song <[email protected]>
> > >
> > > This series tries to unify and clean up the swapin path, fixing a few
> > > issues with optimizations:
> > >
> > > 1. Memcg leak issue: when a process that previously swapped out some
> > > migrated to another cgroup, and the origianl cgroup is dead. If we
> > > do a swapoff, swapped in pages will be accounted into the process
> > > doing swapoff instead of the new cgroup. This will allow the process
> > > to use more memory than expect easily.
> > >
> > > This can be easily reproduced by:
> > > - Setup a swap.
> > > - Create memory cgroup A, B and C.
> > > - Spawn process P1 in cgroup A and make it swap out some pages.
> > > - Move process P1 to memory cgroup B.
> > > - Destroy cgroup A.
> > > - Do a swapoff in cgroup C
> > > - Swapped in pages is accounted into cgroup C.
> > >
> > > This patch will fix it make the swapped in pages accounted in cgroup B.
> > >
> >
> > I guess this only works for anonymous memory and not shmem, right?
> >
> > I think tying memcg charges to a process is not something we usually
> > do. Charging the pages to the memcg of the faulting process if the
> > previous owner is dead makes sense, it's essentially recharging the
> > memory to the new owner. Swapoff is indeed a special case, since the
> > faulting process is not the new owner, but an admin process or so. I
> > am guessing charging to the new memcg of the previous owner might make
> > sense in this case, but it is a change of behavior.
> >
>
> I was looking at this at patch 23 as well. Will ask more questions in
> the patch thread.
> I would suggest making these two behavior change patches separate out
> from the clean up series to give it more exposure and proper
> discussion.
> Patch 5 and patch 23.
>
> Chris
>
Hi Chris,
Thank you very much for reviewing these details, it's really helpful.
I'll send out new series after checking your suggestions on these patches.
Chris Li <[email protected]> 于2023年11月22日周三 01:18写道:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > When an xa_value is returned by the cache lookup, keep it to be used
> > later for the workingset refault check instead of doing the lookup again
> > in swapin_no_readahead.
> >
> > This does have a side effect of making swapoff also trigger the workingset
> > check, but that should be fine since swapoff affects the workload in many
> > ways already.
>
> I need to sleep on it a bit to see if this will create another problem or not.
>
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/swap_state.c | 10 ++++------
> > 1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index e057c79fb06f..51de2a0412df 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -872,7 +872,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > {
> > struct folio *folio;
> > struct page *page;
> > - void *shadow = NULL;
> >
> > page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
> > folio = (struct folio *)page;
> > @@ -888,10 +887,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >
> > mem_cgroup_swapin_uncharge_swap(entry);
> >
> > - shadow = get_shadow_from_swap_cache(entry);
> > - if (shadow)
> > - workingset_refault(folio, shadow);
> > -
> > folio_add_lru(folio);
> >
> > /* To provide entry to swap_readpage() */
> > @@ -922,11 +917,12 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > enum swap_cache_result cache_result;
> > struct swap_info_struct *si;
> > struct mempolicy *mpol;
> > + void *shadow = NULL;
> > struct folio *folio;
> > struct page *page;
> > pgoff_t ilx;
> >
> > - folio = swap_cache_get_folio(entry, vmf, NULL);
> > + folio = swap_cache_get_folio(entry, vmf, &shadow);
> > if (folio) {
> > page = folio_file_page(folio, swp_offset(entry));
> > cache_result = SWAP_CACHE_HIT;
> > @@ -938,6 +934,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > if (swap_use_no_readahead(si, swp_offset(entry))) {
> > page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> > cache_result = SWAP_CACHE_BYPASS;
> > + if (shadow)
> > + workingset_refault(page_folio(page), shadow);
>
> It is inconsistent that the other flavors of readahead do not do the
> workingset_refault here.
Because of readahead and the swapcache. Every readahead page needs to
be checked by workingset_refault with a different shadow (so a
different xarray entry search is needed). And since the other swapin
paths need to insert pages into the swapcache, they will do an extra
xarray search/insert anyway, so this optimization won't help them.
> I suggest keeping the workingset_refault in swapin_no_readahead() and
> passing the shadow argument in.
That sounds good to me.
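A sketch of what that could look like, assuming only the extra 'shadow'
parameter (the rest is taken from the diffs earlier in this thread):

        static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
                                                struct mempolicy *mpol, pgoff_t ilx,
                                                struct mm_struct *mm, void *shadow)
        {
                struct folio *folio;
                struct page *page;

                page = alloc_pages_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
                folio = (struct folio *)page;
                if (!folio)
                        return NULL;

                if (mem_cgroup_swapin_charge_folio(folio, mm, GFP_KERNEL, entry)) {
                        folio_put(folio);
                        return NULL;
                }
                mem_cgroup_swapin_uncharge_swap(entry);

                /* the refault check stays local to the no-readahead path */
                if (shadow)
                        workingset_refault(folio, shadow);

                folio_add_lru(folio);

                /* To provide entry to swap_readpage() */
                folio->swap = entry;
                swap_readpage(page, true, NULL);
                folio->private = NULL;

                return page;
        }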
Chris Li <[email protected]> 于2023年11月21日周二 14:44写道:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > Make it use alloc_pages_mpol instead of vma_alloc_folio, and accept
> > mm_struct directly as an argument instead of taking a vmf as argument.
> > Make its arguments similar to swap_{cluster,vma}_readahead, to
> > make the code more aligned.
>
> It is unclear to me what the value is of making this more aligned
> with swap_{cluster,vma}_readahead here.
>
> > Also prepare for follow-up commits which will skip the vmf for
> > certain swapin paths.
>
> The following patch 07/24 does not seem to use this patch.
> Can it be merged with the other patch that uses it?
>
> As it is, this patch does not have enough standalone value to justify itself.
Yes, I'll rearrange this.
Chris Li <[email protected]> 于2023年11月21日周二 15:54写道:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > Currently VMA readahead is globally disabled when any rotating disk is
> > used as a swap backend. So when multiple swap devices are enabled, if a
> > slower hard disk is set as a low-priority fallback and a high-performance
> > SSD is used as the high-priority swap device, VMA readahead is still
> > disabled globally, and the SSD swap device's performance drops by a lot.
> >
> > Check the readahead policy per entry to avoid such a problem.
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/swap_state.c | 12 +++++++-----
> > 1 file changed, 7 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index ff6756f2e8e4..fb78f7f18ed7 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -321,9 +321,9 @@ static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_
> > return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> > }
> >
> > -static inline bool swap_use_vma_readahead(void)
> > +static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
> > {
> > - return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> > + return data_race(si->flags & SWP_SOLIDSTATE) && READ_ONCE(enable_vma_readahead);
>
> A very minor point:
> I notice you moved the enable_vma_readahead check to the end.
> Normally, if enable_vma_readahead == 0, there is no need to check
> si->flags at all. The si->flags check is more expensive than a simple
> memory load, so you might want to check enable_vma_readahead first and
> short-circuit the more expensive part.
Thanks, I'll improve this part.
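Something like this for V2, if I understand your point correctly (sketch):

static inline bool swap_use_vma_readahead(struct swap_info_struct *si)
{
	/*
	 * Check the cheap global switch first, so si->flags is not
	 * touched at all when VMA readahead is disabled.
	 */
	return READ_ONCE(enable_vma_readahead) &&
	       data_race(si->flags & SWP_SOLIDSTATE);
}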
Huang, Ying <[email protected]> 于2023年11月22日周三 08:38写道:
>
> Kairui Song <[email protected]> writes:
>
> > From: Kairui Song <[email protected]>
> >
> > Move get_swap_device into swapin_readahead, simplify the code,
> > and prepare for follow-up commits.
>
> No. Please don't do this. Please check the get/put_swap_device() usage
> rule in the comments of get_swap_device().
>
> "
> * When we get a swap entry, if there aren't some other ways to
> * prevent swapoff, such as the folio in swap cache is locked, page
> * table lock is held, etc., the swap entry may become invalid because
> * of swapoff. Then, we need to enclose all swap related functions
> * with get_swap_device() and put_swap_device(), unless the swap
> * functions call get/put_swap_device() by themselves.
> "
>
> This is to simplify the reasoning about swapoff and swap entry.
>
> Why does it bother you?
Hi Ying,
This is trying to reduce LOC, avoid a trivial si read, and make the
error checking logic easier to refactor in later commits.
Besides, there is one trivial change I forgot to include in this
commit: get_swap_device can be moved after swap_cache_get_folio in
swapin_readahead, since swap_cache_get_folio doesn't need to hold the
swap device, so in the cache hit case this get/put_swap_device() pair
can be saved.
The comment also mentioned:
"Then, we need to enclose all swap related functions with
get_swap_device() and put_swap_device(), unless the swap functions
call get/put_swap_device() by themselves"
So I think it should be OK to do this.
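To illustrate, roughly what I mean (a sketch on top of this series, not
tested):

	folio = swap_cache_get_folio(entry, vmf, &shadow);
	if (folio) {
		/*
		 * Cache hit: swap_cache_get_folio() doesn't require the
		 * swap device to be held, so skip get/put_swap_device().
		 */
		page = folio_file_page(folio, swp_offset(entry));
		cache_result = SWAP_CACHE_HIT;
		goto done;
	}

	/* Only the real swapin paths below need to block swapoff. */
	si = get_swap_device(entry);
	if (!si)
		goto out;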
Kairui Song <[email protected]> writes:
> On Wed, Nov 22, 2023 at 8:38 AM Huang, Ying <[email protected]> wrote:
>>
>> [...]
>>
>> This is to simplify the reasoning about swapoff and swap entry.
>>
>> Why does it bother you?
>
> Hi Ying,
>
> This is trying to reduce LOC, avoid a trivial si read, and make the
> error checking logic easier to refactor in later commits.
The race with swapoff usually isn't considered by many developers. So,
we should use a rule that is as simple as possible. For example, if you
get a swap entry, use get/put_swap_device() to enclose all code that
operates on the swap entry. This makes the code easier to maintain in
the long run. Yes, sometimes we break the rule a little, but only if
we gain enough benefit, such as improving performance in some practical
use cases.
> Besides, there is one trivial change I forgot to include in this
> commit: get_swap_device can be moved after swap_cache_get_folio in
> swapin_readahead, since swap_cache_get_folio doesn't need to hold the
> swap device, so in the cache hit case this get/put_swap_device() pair
> can be saved.
swapoff is a rare operation, so it's OK to delay it a little to make the
code easier to understand.
> The comment also mentioned:
>
> "Then, we need to enclose all swap related functions with
> get_swap_device() and put_swap_device(), unless the swap functions
> call get/put_swap_device() by themselves"
>
> So I think it should be OK to do this.
No. You should only change the code with a good enough reason. Compared
with the complexity it introduces, the benefit isn't enough for me so far.
--
Best Regards,
Huang, Ying
Chris Li <[email protected]> 于2023年11月22日周三 13:18写道:
>
> On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > There is only one caller now, in the page fault path, so make the
> > result return argument mandatory.
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/swap_state.c | 17 +++++++----------
> > 1 file changed, 7 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 6f39aa8394f1..0433a2586c6d 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -913,7 +913,6 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > struct vm_fault *vmf, enum swap_cache_result *result)
> > {
> > - enum swap_cache_result cache_result;
> > struct swap_info_struct *si;
> > struct mempolicy *mpol;
> > void *shadow = NULL;
> > @@ -928,29 +927,27 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >
> > folio = swap_cache_get_folio(entry, vmf, &shadow);
> > if (folio) {
> > + *result = SWAP_CACHE_HIT;
> > page = folio_file_page(folio, swp_offset(entry));
> > - cache_result = SWAP_CACHE_HIT;
> > goto done;
> > }
> >
> > mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > if (swap_use_no_readahead(si, swp_offset(entry))) {
> > + *result = SWAP_CACHE_BYPASS;
>
> Each of these "*result" assignments will compile into a memory store
> instruction. The compiler most likely can't optimize and combine them,
> because from the compiler's point of view the store could cause a
> segfault. Multiple local variable assignments, on the other hand, can
> be compiled into a few register assignments, so they do not cost as
> much as multiple memory stores.
>
> > page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> > - cache_result = SWAP_CACHE_BYPASS;
> > if (shadow)
> > workingset_refault(page_folio(page), shadow);
> > - } else if (swap_use_vma_readahead(si)) {
> > - page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> > - cache_result = SWAP_CACHE_MISS;
> > } else {
> > - page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > - cache_result = SWAP_CACHE_MISS;
> > + *result = SWAP_CACHE_MISS;
> > + if (swap_use_vma_readahead(si))
> > + page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> > + else
> > + page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
>
> I recall you introduced or heavily modified this function in a
> previous patch. Consider combining some of the patches and presenting
> the final version sooner. From the reviewing point of view, there is
> no need to review so many internal versions which get overwritten
> anyway.
>
> > }
> > mpol_cond_put(mpol);
> > done:
> > put_swap_device(si);
> > - if (result)
> > - *result = cache_result;
>
> The original version, which checks and assigns at one place, is
> better. It's safer and produces better code.
Yes, that's less error-prone indeed. Saving an "if" doesn't seem worth
all the potential trouble, I will drop this.
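That is, back to accumulating in a local variable and assigning at one
place, roughly:

	enum swap_cache_result cache_result;
	...
	cache_result = SWAP_CACHE_HIT;		/* stays in a register */
	...
done:
	put_swap_device(si);
	if (result)
		*result = cache_result;		/* single store, one place to audit */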
Chris Li <[email protected]> 于2023年11月22日周三 00:07写道:
>
> On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > No feature change, just prepare for later commits.
>
> You need to have a proper commit message explaining why this change
> needs to happen. "Preparing" is too generic; it does not give any real
> information. For example, it seems you want to drop one swap cache
> lookup because swap_readahead already has one?
>
> I am a bit puzzled by this patch. It shuffles a lot of sensitive code.
> However, I do not see the value yet.
> It seems like this patch should be merged with the later patch that
> depends on it, so they can be judged together.
>
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/memory.c | 61 +++++++++++++++++++++++--------------------------
> > mm/swap.h | 10 ++++++--
> > mm/swap_state.c | 26 +++++++++++++--------
> > mm/swapfile.c | 30 +++++++++++-------------
> > 4 files changed, 66 insertions(+), 61 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f4237a2e3b93..22af9f3e8c75 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3786,13 +3786,13 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
> > vm_fault_t do_swap_page(struct vm_fault *vmf)
> > {
> > struct vm_area_struct *vma = vmf->vma;
> > - struct folio *swapcache, *folio = NULL;
> > + struct folio *swapcache = NULL, *folio = NULL;
> > + enum swap_cache_result cache_result;
> > struct page *page;
> > struct swap_info_struct *si = NULL;
> > rmap_t rmap_flags = RMAP_NONE;
> > bool exclusive = false;
> > swp_entry_t entry;
> > - bool swapcached;
> > pte_t pte;
> > vm_fault_t ret = 0;
> >
> > @@ -3850,42 +3850,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (unlikely(!si))
> > goto out;
> >
> > - folio = swap_cache_get_folio(entry, vma, vmf->address);
> > - if (folio)
> > - page = folio_file_page(folio, swp_offset(entry));
> > - swapcache = folio;
>
> Is the motivation that swap_readahead() already has a swap cache
> lookup, so you remove this lookup here?
Yes, the cache lookup is moved into and shared by swapin_readahead,
and this also makes it possible to use that lookup to return a shadow
when the entry is not a page, so another shadow lookup can be saved for
the sync (ZRAM) swapin path. This helps improve ZRAM performance by
~4% for a 10G ZRAM, and should improve it more as the cache tree grows
larger.
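The resulting flow looks like this (simplified):

	void *shadow = NULL;

	folio = swap_cache_get_folio(entry, vmf, &shadow);
	if (folio) {
		/* Cache hit, use the folio directly. */
	} else if (swap_use_no_readahead(si, swp_offset(entry))) {
		page = swapin_no_readahead(entry, gfp_mask, mpol, ilx,
					   vmf->vma->vm_mm);
		/*
		 * Reuse the shadow from the lookup above instead of
		 * calling get_shadow_from_swap_cache() again.
		 */
		if (shadow)
			workingset_refault(page_folio(page), shadow);
	}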
>
> > -
> > - if (!folio) {
> > - page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > - vmf, &swapcached);
> > - if (page) {
> > - folio = page_folio(page);
> > - if (swapcached)
> > - swapcache = folio;
> > - } else {
> > + page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> > + vmf, &cache_result);
> > + if (page) {
> > + folio = page_folio(page);
> > + if (cache_result != SWAP_CACHE_HIT) {
> > + /* Had to read the page from swap area: Major fault */
> > + ret = VM_FAULT_MAJOR;
> > + count_vm_event(PGMAJFAULT);
> > + count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> > + }
> > + if (cache_result != SWAP_CACHE_BYPASS)
> > + swapcache = folio;
> > + if (PageHWPoison(page)) {
>
> There is a lot of code shuffling here. From the diff it is hard to
> tell whether it is doing the same thing as before.
>
> > /*
> > - * Back out if somebody else faulted in this pte
> > - * while we released the pte lock.
> > + * hwpoisoned dirty swapcache pages are kept for killing
> > + * owner processes (which may be unknown at hwpoison time)
> > */
> > - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > - vmf->address, &vmf->ptl);
> > - if (likely(vmf->pte &&
> > - pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > - ret = VM_FAULT_OOM;
> > - goto unlock;
> > + ret = VM_FAULT_HWPOISON;
> > + goto out_release;
> > }
> > -
> > - /* Had to read the page from swap area: Major fault */
> > - ret = VM_FAULT_MAJOR;
> > - count_vm_event(PGMAJFAULT);
> > - count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> > - } else if (PageHWPoison(page)) {
> > + } else {
> > /*
> > - * hwpoisoned dirty swapcache pages are kept for killing
> > - * owner processes (which may be unknown at hwpoison time)
> > + * Back out if somebody else faulted in this pte
> > + * while we released the pte lock.
> > */
> > - ret = VM_FAULT_HWPOISON;
> > - goto out_release;
> > + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > + vmf->address, &vmf->ptl);
> > + if (likely(vmf->pte &&
> > + pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
> > + ret = VM_FAULT_OOM;
> > + goto unlock;
> > }
> >
> > ret |= folio_lock_or_retry(folio, vmf);
> > diff --git a/mm/swap.h b/mm/swap.h
> > index a9a654af791e..ac9136eee690 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -30,6 +30,12 @@ extern struct address_space *swapper_spaces[];
> > (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
> > >> SWAP_ADDRESS_SPACE_SHIFT])
> >
> > +enum swap_cache_result {
> > + SWAP_CACHE_HIT,
> > + SWAP_CACHE_MISS,
> > + SWAP_CACHE_BYPASS,
> > +};
>
> Does any function later care about CACHE_BYPASS?
>
> Again, better to introduce it with the function that uses it. Don't
> introduce it "just in case I might use it later".
Yes, callers in shmem will also need to know whether the page came from
the swap cache, and they need a value to indicate the bypass case. I can
add some comments here to document the usage.
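e.g. something like:

enum swap_cache_result {
	SWAP_CACHE_HIT,		/* found in the swap cache */
	SWAP_CACHE_MISS,	/* missed, read in through the swap cache */
	SWAP_CACHE_BYPASS,	/* read in synchronously, swap cache bypassed */
};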
>
> > +
> > void show_swap_cache_info(void);
> > bool add_to_swap(struct folio *folio);
> > void *get_shadow_from_swap_cache(swp_entry_t entry);
> > @@ -55,7 +61,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> > struct mempolicy *mpol, pgoff_t ilx);
> > struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
> > - struct vm_fault *vmf, bool *swapcached);
> > + struct vm_fault *vmf, enum swap_cache_result *result);
> >
> > static inline unsigned int folio_swap_flags(struct folio *folio)
> > {
> > @@ -92,7 +98,7 @@ static inline struct page *swap_cluster_readahead(swp_entry_t entry,
> > }
> >
> > static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> > - struct vm_fault *vmf, bool *swapcached)
> > + struct vm_fault *vmf, enum swap_cache_result *result)
> > {
> > return NULL;
> > }
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index d87c20f9f7ec..e96d63bf8a22 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -908,8 +908,7 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > * @entry: swap entry of this memory
> > * @gfp_mask: memory allocation flags
> > * @vmf: fault information
> > - * @swapcached: pointer to a bool used as indicator if the
> > - * page is swapped in through swapcache.
> > + * @result: a return value to indicate swap cache usage.
> > *
> > * Returns the struct page for entry and addr, after queueing swapin.
> > *
> > @@ -918,30 +917,39 @@ static struct page *swapin_no_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > * or vma-based(ie, virtual address based on faulty address) readahead.
> > */
> > struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > - struct vm_fault *vmf, bool *swapcached)
> > + struct vm_fault *vmf, enum swap_cache_result *result)
> > {
> > + enum swap_cache_result cache_result;
> > struct swap_info_struct *si;
> > struct mempolicy *mpol;
> > + struct folio *folio;
> > struct page *page;
> > pgoff_t ilx;
> > - bool cached;
> > +
> > + folio = swap_cache_get_folio(entry, vmf->vma, vmf->address);
> > + if (folio) {
> > + page = folio_file_page(folio, swp_offset(entry));
> > + cache_result = SWAP_CACHE_HIT;
> > + goto done;
> > + }
> >
> > si = swp_swap_info(entry);
> > mpol = get_vma_policy(vmf->vma, vmf->address, 0, &ilx);
> > if (swap_use_no_readahead(si, swp_offset(entry))) {
> > page = swapin_no_readahead(entry, gfp_mask, mpol, ilx, vmf->vma->vm_mm);
> > - cached = false;
> > + cache_result = SWAP_CACHE_BYPASS;
> > } else if (swap_use_vma_readahead(si)) {
> > page = swap_vma_readahead(entry, gfp_mask, mpol, ilx, vmf);
> > - cached = true;
> > + cache_result = SWAP_CACHE_MISS;
> > } else {
> > page = swap_cluster_readahead(entry, gfp_mask, mpol, ilx);
> > - cached = true;
> > + cache_result = SWAP_CACHE_MISS;
> > }
> > mpol_cond_put(mpol);
> >
> > - if (swapcached)
> > - *swapcached = cached;
> > +done:
> > + if (result)
> > + *result = cache_result;
> >
> > return page;
> > }
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 01c3f53b6521..b6d57fff5e21 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1822,13 +1822,21 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >
> > si = swap_info[type];
> > do {
> > - struct folio *folio;
> > + struct page *page;
> > unsigned long offset;
> > unsigned char swp_count;
> > + struct folio *folio = NULL;
> > swp_entry_t entry;
> > int ret;
> > pte_t ptent;
> >
> > + struct vm_fault vmf = {
> > + .vma = vma,
> > + .address = addr,
> > + .real_address = addr,
> > + .pmd = pmd,
> > + };
>
> Is this code move caused by skipping the swap cache look up here?
Yes.
>
> This is very sensitive code related to swap cache racing. It needs
> very careful reviewing. Better not to shuffle it without a good reason.
Thanks for the suggestion, I'll try to avoid this shuffling, but the
cache lookup is moved into swapin_readahead so some changes in the
original callers are unavoidable...
On Fri, Nov 24, 2023 at 12:42 AM Kairui Song <[email protected]> wrote:
>
> Chris Li <[email protected]> 于2023年11月22日周三 00:07写道:
>
>
> [...]
>
> > Does any function later care about CACHE_BYPASS?
> >
> > Again, better to introduce it with the function that uses it. Don't
> > introduce it "just in case I might use it later".
>
> Yes, callers in shmem will also need to know whether the page came from
> the swap cache, and they need a value to indicate the bypass case. I can
> add some comments here to document the usage.
I also commented on a later patch. Because you do the lookup without
the folio locked, the swap cache can change by the time you use the
"*result". I suspect some of the swap cache lookups will need to be
added back due to the race before locking. Better to introduce this
field together with the user side; that makes it easier to reason
about the usage in one patch.
BTW, one way to flatten the development history of the series is to
flatten the branch into one big patch, then copy/paste from the big
patch to introduce sub-patches step by step. That way each sub-patch is
closer to the final version of the code. Just something for you to
consider.
>
> [...]
>
> > This is very sensitive code related to swap cache racing. It needs
> > very careful reviewing. Better not to shuffle it without a good reason.
>
> Thanks for the suggestion, I'll try to avoid this shuffling, but the
> cache lookup is moved into swapin_readahead so some changes in the
> original callers are unavoidable...
Yes, I agree sometimes it is unavoidable.
Chris
Chris Li <[email protected]> 于2023年11月22日周三 12:41写道:
>
> On Sun, Nov 19, 2023 at 11:49 AM Kairui Song <[email protected]> wrote:
> >
> > From: Kairui Song <[email protected]>
> >
> > There are two places where swapin is not directly caused by a page
> > fault: shmem swapin is invoked through the shmem mapping, and swapoff
> > causes swapin by walking the page table. They used to construct a
> > pseudo vmfault struct for the swapin function.
> >
> > Shmem dropped the pseudo vmfault recently in commit ddc1a5cbc05d
> > ("mempolicy: alloc_pages_mpol() for NUMA policy without vma"). The
> > swapoff path is still using a pseudo vmfault.
> >
> > Introduce a helper for them both; this helps save stack usage for the
> > swapoff path, and helps apply a unified swapin cache and readahead
> > policy check.
> >
> > Also prepare for follow-up commits.
> >
> > Signed-off-by: Kairui Song <[email protected]>
> > ---
> > mm/shmem.c | 51 ++++++++++++++++---------------------------------
> > mm/swap.h | 11 +++++++++++
> > mm/swap_state.c | 38 ++++++++++++++++++++++++++++++++++++
> > mm/swapfile.c | 23 +++++++++++-----------
> > 4 files changed, 76 insertions(+), 47 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index f9ce4067c742..81d129aa66d1 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1565,22 +1565,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
> > static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
> > pgoff_t index, unsigned int order, pgoff_t *ilx);
> >
> > -static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
> > - struct shmem_inode_info *info, pgoff_t index)
> > -{
> > - struct mempolicy *mpol;
> > - pgoff_t ilx;
> > - struct page *page;
> > -
> > - mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> > - page = swap_cluster_readahead(swap, gfp, mpol, ilx);
> > - mpol_cond_put(mpol);
> > -
> > - if (!page)
> > - return NULL;
> > - return page_folio(page);
> > -}
> > -
>
> Nice. Thank you.
>
> > /*
> > * Make sure huge_gfp is always more limited than limit_gfp.
> > * Some of the flags set permissions, while others set limitations.
> > @@ -1854,9 +1838,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > {
> > struct address_space *mapping = inode->i_mapping;
> > struct shmem_inode_info *info = SHMEM_I(inode);
> > - struct swap_info_struct *si;
> > + enum swap_cache_result result;
> > struct folio *folio = NULL;
> > + struct mempolicy *mpol;
> > + struct page *page;
> > swp_entry_t swap;
> > + pgoff_t ilx;
> > int error;
> >
> > VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> > @@ -1866,34 +1853,30 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > if (is_poisoned_swp_entry(swap))
> > return -EIO;
> >
> > - si = get_swap_device(swap);
> > - if (!si) {
> > + mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> > + page = swapin_page_non_fault(swap, gfp, mpol, ilx, fault_mm, &result);
Hi Chris,
I've been trying to address these issues in V2. Most issues in other
patches have a straightforward solution, and some could be discussed in
separate series, but I've come up with some thoughts here:
>
> Notice this "result" CAN be outdated. e.g. after this call, the swap
> cache can be changed by another thread generating the swap page fault
> and installing the folio into the swap cache or removing it.
This is true, and it seems a potential race also exists before this
series in the direct (no swapcache) swapin path (do_swap_page), if I
understand it correctly:
In the do_swap_page path, multiple processes could swap in the page at
the same time (a page mapped once can still be shared by sub-threads),
and they could get different folios. The later pte lock and pte_same
check are not enough, because while one process is not holding the pte
lock, another process could read the page in, swap_free the entry, then
swap the page out again reusing the same entry: an ABA problem. The
race is not likely to happen in reality, but it is possible in theory.
Same issue for shmem: there are shmem_confirm_swap /
shmem_add_to_page_cache checks later to prevent re-installing into the
shmem mapping after a direct swapin, but they are also not enough.
Another process could read the page in and re-swapout using the same
entry, so the mapping entry seems unchanged during the time window.
Still very unlikely to happen in reality, but not impossible.
When the swapcache is used there is no such issue, since the swap lock
and swap_map are used to sync all readers, and while one reader is
still holding the folio, the entry is locked through the swapcache; or
if a folio is removed from the swapcache, folio_test_swapcache will
fail and the reader can retry.
I'm trying to come up with a better locking scheme for direct swapin;
am I missing anything here? Correct me if I get it wrong...
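A timeline of the window I'm worried about (just a sketch; every step of
CPU B is theoretically possible while CPU A doesn't hold the pte lock):

	/*
	 * CPU A (do_swap_page, swap cache bypassed)    CPU B
	 *
	 * entry = pte_to_swp_entry(vmf->orig_pte);
	 * page = swapin_no_readahead(entry, ...);
	 *                                              swap in the same entry,
	 *                                              swap_free() frees the slot,
	 *                                              swap out other data that
	 *                                              happens to reuse the slot,
	 *                                              so the entry value is
	 *                                              unchanged
	 * spin_lock(vmf->ptl);
	 * pte_same(ptep_get(vmf->pte), vmf->orig_pte)
	 *   -> still true (ABA), stale data gets mapped
	 */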
On Tue, Nov 28, 2023 at 3:22 AM Kairui Song <[email protected]> wrote:
>
> [...]
>
> This is true, and it seems a potential race also exists before this
> series in the direct (no swapcache) swapin path (do_swap_page), if I
> understand it correctly:
I just noticed I missed this email while I was cleaning up my email
archive. Sorry for the late reply; traveling does not help either.
I am not aware of swapin racing bugs in the existing code. Racing,
yes. If you discover a code path where the racing causes a bug, please
report it.
>
> In do_swap_page path, multiple process could swapin the page at the
> same time (a mapped once page can still be shared by sub threads),
> they could get different folios. The later pte lock and pte_same check
> is not enough, because while one process is not holding the pte lock,
> another process could read-in, swap_free the entry, then swap-out the
> page again, using same entry, an ABA problem. The race is not likely
> to happen in reality but in theory possible.
Have you taken into account that if the page is locked, it can't be
removed from the swapcache? I think the swap cache find and get
function will return the page locked. Then the swapcache will not be
able to change the mapping as long as the page is still locked.
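For reference, the usual recheck in do_swap_page after locking looks
roughly like this (from memory, details vary between kernel versions):

	ret |= folio_lock_or_retry(folio, vmf);
	if (swapcache) {
		/*
		 * With the folio locked, it can't be removed from the swap
		 * cache under us; if it was removed or the entry was reused
		 * before we got the lock, these checks fail and we back out.
		 */
		if (unlikely(!folio_test_swapcache(folio) ||
			     page_swap_entry(page).val != entry.val))
			goto out_page;
	}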
>
> Same issue for shmem here, there are
> shmem_confirm_swap/shmem_add_to_page_cache check later to prevent
> re-installing into shmem mapping for direct swap in, but also not
> enough. Other process could read-in and re-swapout using same entry so
> the mapping entry seems unchanged during the time window. Still very
> unlikely to happen in reality, but not impossible.
Please take a look again with the page lock in mind. Report back if you
still think there is a racing bug in the existing code, and we can take
a closer look at the concurrent call stacks needed to trigger the bug.
Chris