2020-06-17 05:28:53

by Joonsoo Kim

[permalink] [raw]
Subject: [PATCH v6 0/6] workingset protection/detection on the anonymous LRU list

From: Joonsoo Kim <[email protected]>

Hello,

This patchset implements workingset protection and detection on
the anonymous LRU list.

* Changes on v6
- rework to reflect a new LRU balance model
- remove memcg charge timing stuff on v5 since alternative is already
merged on mainline
- remove readahead stuff on v5 (reason is the same with above)
- clear shadow entry if corresponding swap entry is deleted
(mm/swapcache: support to handle the exceptional entries in swapcache)
- change experiment environment
(from ssd swap to ram swap, for fast evaluation and for reducing side-effect of I/O)
- update performance number

* Changes on v5
- change memcg charge timing for the swapped-in page (fault -> swap-in)
- avoid readahead if previous owner of the swapped-out page isn't me
- use another lruvec to update the reclaim_stat for a new anonymous page
- add two more cases to fix up the reclaim_stat

* Changes on v4
- In the patch "mm/swapcache: support to handle the exceptional
entries in swapcache":
-- replace the word "value" with "exceptional entries"
-- add to handle the shadow entry in add_to_swap_cache()
-- support the huge page
-- remove the registration code for shadow shrinker

- remove the patch "mm/workingset: use the node counter
if memcg is the root memcg" since workingset detection for
anonymous page doesn't use shadow shrinker now
- minor style fixes

* Changes on v3
- rework the patch, "mm/vmscan: protect the workingset on anonymous LRU"
(use almost same reference tracking algorithm to the one for the file
mapped page)

* Changes on v2
- fix a critical bug that uses out of index lru list in
workingset_refault()
- fix a bug that reuses the rotate value for previous page

* SUBJECT
workingset protection

* PROBLEM
In current implementation, newly created or swap-in anonymous page is
started on the active list. Growing the active list results in rebalancing
active/inactive list so old pages on the active list are demoted to the
inactive list. Hence, hot page on the active list isn't protected at all.

Following is an example of this situation.

Assume that 50 hot pages on active list and system can contain total
100 pages. Numbers denote the number of pages on active/inactive
list (active | inactive). (h) stands for hot pages and (uo) stands for
used-once pages.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

As we can see, hot pages are swapped-out and it would cause swap-in later.

* SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection. Like as the file LRU list, newly created or swap-in anonymous
page is started on the inactive list. Also, like as the file LRU list,
if enough reference happens, the page will be promoted. This simple
modification changes the above example as following.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

hot pages remains in the active list. :)

* EXPERIMENT
I tested this scenario on my test bed and confirmed that this problem
happens on current implementation. I also checked that it is fixed by
this patchset.


* SUBJECT
workingset detection

* PROBLEM
Later part of the patchset implements the workingset detection for
the anonymous LRU list. There is a corner case that workingset protection
could cause thrashing. If we can avoid thrashing by workingset detection,
we can get the better performance.

Following is an example of thrashing due to the workingset protection.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (will be hot) pages
50(h) | 50(wh)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)

4. workload: 50 (will be hot) pages
50(h) | 50(wh), swap-in 50(wh)

5. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)

6. repeat 4, 5

Without workingset detection, this kind of workload cannot be promoted
and thrashing happens forever.

* SOLUTION
Therefore, this patchset implements workingset detection.
All the infrastructure for workingset detecion is already implemented,
so there is not much work to do. First, extend workingset detection
code to deal with the anonymous LRU list. Then, make swap cache handles
the exceptional value for the shadow entry. Lastly, install/retrieve
the shadow value into/from the swap cache and check the refault distance.

* EXPERIMENT
I made a test program to imitates above scenario and confirmed that
problem exists. Then, I checked that this patchset fixes it.

My test setup is a virtual machine with 8 cpus and 6100MB memory. But,
the amount of the memory that the test program can use is about 280 MB.
This is because the system uses large ram-backed swap and large ramdisk
to capture the trace.

Test scenario is like as below.

1. allocate cold memory (512MB)
2. allocate hot-1 memory (96MB)
3. activate hot-1 memory (96MB)
4. allocate another hot-2 memory (96MB)
5. access cold memory (128MB)
6. access hot-2 memory (96MB)
7. repeat 5, 6

Since hot-1 memory (96MB) is on the active list, the inactive list can
contains roughly 190MB pages. hot-2 memory's re-access interval
(96+128 MB) is more 190MB, so it cannot be promoted without workingset
detection and swap-in/out happens repeatedly. With this patchset,
workingset detection works and promotion happens. Therefore, swap-in/out
occurs less.

Here is the result. (average of 5 runs)

type swap-in swap-out
base 863240 989945
patch 681565 809273

As we can see, patched kernel do less swap-in/out.

* OVERALL TEST (ebizzy using modified random function)

ebizzy is the test program that main thread allocates lots of memory and
child threads access them randomly during the given times. Swap-in
will happen if allocated memory is larger than the system memory.

The random function that represents the zipf distribution is used to
make hot/cold memory. Hot/cold ratio is controlled by the parameter. If
the parameter is high, hot memory is accessed much larger than cold one.
If the parameter is low, the number of access on each memory would be
similar. I uses various parameters in order to show the effect of
patchset on various hot/cold ratio workload.

My test setup is a virtual machine with 8 cpus, 1024 MB memory and
5120 MB ram swap.

Result format is as following.

param: 1-1024-0.1
- 1 (number of thread)
- 1024 (allocated memory size, MB)
- 0.1 (zipf distribution alpha,
0.1 works like as roughly uniform random,
1.3 works like as small portion of memory is hot and the others are cold)

pswpin: smaller is better
std: standard deviation
improvement: negative is better

* single thread
param pswpin std improvement
base 1-1024.0-0.1 14101983.40 79441.19
prot 1-1024.0-0.1 14065875.80 136413.01 ( -0.26 )
detect 1-1024.0-0.1 13910435.60 100804.82 ( -1.36 )
base 1-1024.0-0.7 7998368.80 43469.32
prot 1-1024.0-0.7 7622245.80 88318.74 ( -4.70 )
detect 1-1024.0-0.7 7618515.20 59742.07 ( -4.75 )
base 1-1024.0-1.3 1017400.80 38756.30
prot 1-1024.0-1.3 940464.60 29310.69 ( -7.56 )
detect 1-1024.0-1.3 945511.40 24579.52 ( -7.07 )
base 1-1280.0-0.1 22895541.40 50016.08
prot 1-1280.0-0.1 22860305.40 51952.37 ( -0.15 )
detect 1-1280.0-0.1 22705565.20 93380.35 ( -0.83 )
base 1-1280.0-0.7 13717645.60 46250.65
prot 1-1280.0-0.7 12935355.80 64754.43 ( -5.70 )
detect 1-1280.0-0.7 13040232.00 63304.00 ( -4.94 )
base 1-1280.0-1.3 1654251.40 4159.68
prot 1-1280.0-1.3 1522680.60 33673.50 ( -7.95 )
detect 1-1280.0-1.3 1599207.00 70327.89 ( -3.33 )
base 1-1536.0-0.1 31621775.40 31156.28
prot 1-1536.0-0.1 31540355.20 62241.36 ( -0.26 )
detect 1-1536.0-0.1 31420056.00 123831.27 ( -0.64 )
base 1-1536.0-0.7 19620760.60 60937.60
prot 1-1536.0-0.7 18337839.60 56102.58 ( -6.54 )
detect 1-1536.0-0.7 18599128.00 75289.48 ( -5.21 )
base 1-1536.0-1.3 2378142.40 20994.43
prot 1-1536.0-1.3 2166260.60 48455.46 ( -8.91 )
detect 1-1536.0-1.3 2183762.20 16883.24 ( -8.17 )
base 1-1792.0-0.1 40259714.80 90750.70
prot 1-1792.0-0.1 40053917.20 64509.47 ( -0.51 )
detect 1-1792.0-0.1 39949736.40 104989.64 ( -0.77 )
base 1-1792.0-0.7 25704884.40 69429.68
prot 1-1792.0-0.7 23937389.00 79945.60 ( -6.88 )
detect 1-1792.0-0.7 24271902.00 35044.30 ( -5.57 )
base 1-1792.0-1.3 3129497.00 32731.86
prot 1-1792.0-1.3 2796994.40 19017.26 ( -10.62 )
detect 1-1792.0-1.3 2886840.40 33938.82 ( -7.75 )
base 1-2048.0-0.1 48746924.40 50863.88
prot 1-2048.0-0.1 48631954.40 24537.30 ( -0.24 )
detect 1-2048.0-0.1 48509419.80 27085.34 ( -0.49 )
base 1-2048.0-0.7 32046424.40 78624.22
prot 1-2048.0-0.7 29764182.20 86002.26 ( -7.12 )
detect 1-2048.0-0.7 30250315.80 101282.14 ( -5.60 )
base 1-2048.0-1.3 3916723.60 24048.55
prot 1-2048.0-1.3 3490781.60 33292.61 ( -10.87 )
detect 1-2048.0-1.3 3585002.20 44942.04 ( -8.47 )

* multi thread
param pswpin std improvement
base 8-1024.0-0.1 16219822.60 329474.01
prot 8-1024.0-0.1 15959494.00 654597.45 ( -1.61 )
detect 8-1024.0-0.1 15773790.80 502275.25 ( -2.75 )
base 8-1024.0-0.7 9174107.80 537619.33
prot 8-1024.0-0.7 8571915.00 385230.08 ( -6.56 )
detect 8-1024.0-0.7 8489484.20 364683.00 ( -7.46 )
base 8-1024.0-1.3 1108495.60 83555.98
prot 8-1024.0-1.3 1038906.20 63465.20 ( -6.28 )
detect 8-1024.0-1.3 941817.80 32648.80 ( -15.04 )
base 8-1280.0-0.1 25776114.20 450480.45
prot 8-1280.0-0.1 25430847.00 465627.07 ( -1.34 )
detect 8-1280.0-0.1 25282555.00 465666.55 ( -1.91 )
base 8-1280.0-0.7 15218968.00 702007.69
prot 8-1280.0-0.7 13957947.80 492643.86 ( -8.29 )
detect 8-1280.0-0.7 14158331.20 238656.02 ( -6.97 )
base 8-1280.0-1.3 1792482.80 30512.90
prot 8-1280.0-1.3 1577686.40 34002.62 ( -11.98 )
detect 8-1280.0-1.3 1556133.00 22944.79 ( -13.19 )
base 8-1536.0-0.1 33923761.40 575455.85
prot 8-1536.0-0.1 32715766.20 300633.51 ( -3.56 )
detect 8-1536.0-0.1 33158477.40 117764.51 ( -2.26 )
base 8-1536.0-0.7 20628907.80 303851.34
prot 8-1536.0-0.7 19329511.20 341719.31 ( -6.30 )
detect 8-1536.0-0.7 20013934.00 385358.66 ( -2.98 )
base 8-1536.0-1.3 2588106.40 130769.20
prot 8-1536.0-1.3 2275222.40 89637.06 ( -12.09 )
detect 8-1536.0-1.3 2365008.40 124412.55 ( -8.62 )
base 8-1792.0-0.1 43328279.20 946469.12
prot 8-1792.0-0.1 41481980.80 525690.89 ( -4.26 )
detect 8-1792.0-0.1 41713944.60 406798.93 ( -3.73 )
base 8-1792.0-0.7 27155647.40 536253.57
prot 8-1792.0-0.7 24989406.80 502734.52 ( -7.98 )
detect 8-1792.0-0.7 25524806.40 263237.87 ( -6.01 )
base 8-1792.0-1.3 3260372.80 137907.92
prot 8-1792.0-1.3 2879187.80 63597.26 ( -11.69 )
detect 8-1792.0-1.3 2892962.20 33229.13 ( -11.27 )
base 8-2048.0-0.1 50583989.80 710121.48
prot 8-2048.0-0.1 49599984.40 228782.42 ( -1.95 )
detect 8-2048.0-0.1 50578596.00 660971.66 ( -0.01 )
base 8-2048.0-0.7 33765479.60 812659.55
prot 8-2048.0-0.7 30767021.20 462907.24 ( -8.88 )
detect 8-2048.0-0.7 32213068.80 211884.24 ( -4.60 )
base 8-2048.0-1.3 3941675.80 28436.45
prot 8-2048.0-1.3 3538742.40 76856.08 ( -10.22 )
detect 8-2048.0-1.3 3579397.80 58630.95 ( -9.19 )

As we can see, all the cases show improvement. Especially,
test case with zipf distribution 1.3 show more improvements.
It means that if there is a hot/cold tendency in anon pages,
this patchset works better.

Patchset is based on next-20200616 + the patchset [1].
The patchset [1] is included into the Andrew's tree now.

Full patchset can also be available at

https://github.com/JoonsooKim/linux/tree/improve-anonymous-lru-management-v6.00-next-20200616

Enjoy it.

Thanks.

[1]: fix for "mm: balance LRU lists based on relative thrashing",
http://lkml.kernel.org/r/[email protected]

Joonsoo Kim (6):
mm/vmscan: make active/inactive ratio as 1:1 for anon lru
mm/vmscan: protect the workingset on anonymous LRU
mm/workingset: extend the workingset detection for anon LRU
mm/swapcache: support to handle the exceptional entries in swapcache
mm/swap: implement workingset detection for anonymous LRU
mm/vmscan: restore active/inactive ratio for anonymous LRU

include/linux/mmzone.h | 16 ++++++----
include/linux/swap.h | 25 ++++++++++++----
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 2 +-
mm/memcontrol.c | 16 ++++++----
mm/memory.c | 20 +++++--------
mm/migrate.c | 2 +-
mm/shmem.c | 3 +-
mm/swap.c | 13 +++++----
mm/swap_state.c | 77 ++++++++++++++++++++++++++++++++++++++++++-------
mm/swapfile.c | 4 ++-
mm/userfaultfd.c | 2 +-
mm/vmscan.c | 26 ++++++++++-------
mm/vmstat.c | 9 ++++--
mm/workingset.c | 15 +++++-----
16 files changed, 163 insertions(+), 71 deletions(-)

--
2.7.4


2020-06-17 05:29:04

by Joonsoo Kim

[permalink] [raw]
Subject: [PATCH v6 2/6] mm/vmscan: protect the workingset on anonymous LRU

From: Joonsoo Kim <[email protected]>

In current implementation, newly created or swap-in anonymous page
is started on active list. Growing active list results in rebalancing
active/inactive list so old pages on active list are demoted to inactive
list. Hence, the page on active list isn't protected at all.

Following is an example of this situation.

Assume that 50 hot pages on active list. Numbers denote the number of
pages on active/inactive list (active | inactive).

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)

3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)

This patch tries to fix this issue.
Like as file LRU, newly created or swap-in anonymous pages will be
inserted to the inactive list. They are promoted to active list if
enough reference happens. This simple modification changes the above
example as following.

1. 50 hot pages on active list
50(h) | 0

2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)

3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)

As you can see, hot pages on active list would be protected.

Note that, this implementation has a drawback that the page cannot
be promoted and will be swapped-out if re-access interval is greater than
the size of inactive list but less than the size of total(active+inactive).
To solve this potential issue, following patch will apply workingset
detection that is applied to file LRU some day before.

v6: Before this patch, all anon pages (inactive + active) are considered
as workingset. However, with this patch, only active pages are considered
as workingset. So, file refault formula which uses the number of all
anon pages is changed to use only the number of active anon pages.

Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/swap.h | 2 +-
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 2 +-
mm/memory.c | 9 ++++-----
mm/migrate.c | 2 +-
mm/swap.c | 13 +++++++------
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 2 +-
mm/vmscan.c | 4 +---
mm/workingset.c | 2 --
11 files changed, 19 insertions(+), 23 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5b3216b..f4f5f94 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -353,7 +353,7 @@ extern void deactivate_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
extern void swap_setup(void);

-extern void lru_cache_add_active_or_unevictable(struct page *page,
+extern void lru_cache_add_inactive_or_unevictable(struct page *page,
struct vm_area_struct *vma);

/* linux/mm/vmscan.c */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bb08628..67814de 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -184,7 +184,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
if (new_page) {
get_page(new_page);
page_add_new_anon_rmap(new_page, vma, addr, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ lru_cache_add_inactive_or_unevictable(new_page, vma);
} else
/* no new page, just dec_mm_counter for old_page */
dec_mm_counter(mm, MM_ANONPAGES);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 78c84be..ffbf5ad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -640,7 +640,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr, true);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b043c40..02fb51f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1173,7 +1173,7 @@ static void collapse_huge_page(struct mm_struct *mm,
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
page_add_new_anon_rmap(new_page, vma, address, true);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ lru_cache_add_inactive_or_unevictable(new_page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 3359057..f221f96 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2711,7 +2711,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
*/
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ lru_cache_add_inactive_or_unevictable(new_page, vma);
/*
* We call the notify macro here because, when using secondary
* mmu page tables (such as kvm shadow page tables), we want the
@@ -3260,10 +3260,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
- activate_page(page);
}

swap_free(entry);
@@ -3408,7 +3407,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)

inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

@@ -3666,7 +3665,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
diff --git a/mm/migrate.c b/mm/migrate.c
index c95912f..f0ec043 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2856,7 +2856,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
inc_mm_counter(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, addr, false);
if (!is_zone_device_page(page))
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
get_page(page);

if (flush) {
diff --git a/mm/swap.c b/mm/swap.c
index c5d5114..7cf3ab5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -476,23 +476,24 @@ void lru_cache_add(struct page *page)
EXPORT_SYMBOL(lru_cache_add);

/**
- * lru_cache_add_active_or_unevictable
+ * lru_cache_add_inactive_or_unevictable
* @page: the page to be added to LRU
* @vma: vma in which page is mapped for determining reclaimability
*
- * Place @page on the active or unevictable LRU list, depending on its
+ * Place @page on the inactive or unevictable LRU list, depending on its
* evictability. Note that if the page is not evictable, it goes
* directly back onto it's zone's unevictable list, it does NOT use a
* per cpu pagevec.
*/
-void lru_cache_add_active_or_unevictable(struct page *page,
+void lru_cache_add_inactive_or_unevictable(struct page *page,
struct vm_area_struct *vma)
{
+ bool unevictable;
+
VM_BUG_ON_PAGE(PageLRU(page), page);

- if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
- SetPageActive(page);
- else if (!TestSetPageMlocked(page)) {
+ unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
+ if (unevictable && !TestSetPageMlocked(page)) {
/*
* We use the irq-unsafe __mod_zone_page_stat because this
* counter is not modified from interrupt context, and the pte
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c047789..38f6433 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1920,7 +1920,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
page_add_anon_rmap(page, vma, addr, false);
} else { /* ksm created a completely new copy */
page_add_new_anon_rmap(page, vma, addr, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
}
swap_free(entry);
/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index b804193..9a3d451 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -123,7 +123,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,

inc_mm_counter(dst_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
- lru_cache_add_active_or_unevictable(page, dst_vma);
+ lru_cache_add_inactive_or_unevictable(page, dst_vma);

set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9f940c4..4745e88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1003,8 +1003,6 @@ static enum page_references page_check_references(struct page *page,
return PAGEREF_RECLAIM;

if (referenced_ptes) {
- if (PageSwapBacked(page))
- return PAGEREF_ACTIVATE;
/*
* All mapped pages start out with page table
* references from the instantiating fault, so we need
@@ -1027,7 +1025,7 @@ static enum page_references page_check_references(struct page *page,
/*
* Activate file-backed executable pages after first usage.
*/
- if (vm_flags & VM_EXEC)
+ if ((vm_flags & VM_EXEC) && !PageSwapBacked(page))
return PAGEREF_ACTIVATE;

return PAGEREF_KEEP;
diff --git a/mm/workingset.c b/mm/workingset.c
index 50b7937..fc16d97c 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -357,8 +357,6 @@ void workingset_refault(struct page *page, void *shadow)
workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
workingset_size += lruvec_page_state(eviction_lruvec,
- NR_INACTIVE_ANON);
- workingset_size += lruvec_page_state(eviction_lruvec,
NR_ACTIVE_ANON);
}
if (refault_distance > workingset_size)
--
2.7.4

2020-06-17 05:30:46

by Joonsoo Kim

[permalink] [raw]
Subject: [PATCH v6 6/6] mm/vmscan: restore active/inactive ratio for anonymous LRU

From: Joonsoo Kim <[email protected]>

Now, workingset detection is implemented for anonymous LRU.
We don't have to worry about the misfound for workingset due to
the ratio of active/inactive. Let's restore the ratio.

Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eb02d18..ec77691 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2211,7 +2211,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);

gb = (inactive + active) >> (30 - PAGE_SHIFT);
- if (gb && is_file_lru(inactive_lru))
+ if (gb)
inactive_ratio = int_sqrt(10 * gb);
else
inactive_ratio = 1;
--
2.7.4

2020-06-17 05:30:51

by Joonsoo Kim

[permalink] [raw]
Subject: [PATCH v6 5/6] mm/swap: implement workingset detection for anonymous LRU

From: Joonsoo Kim <[email protected]>

This patch implements workingset detection for anonymous LRU.
All the infrastructure is implemented by the previous patches so this patch
just activates the workingset detection by installing/retrieving
the shadow entry.

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/swap.h | 6 ++++++
mm/memory.c | 11 ++++-------
mm/swap_state.c | 23 ++++++++++++++++++-----
mm/vmscan.c | 7 ++++---
mm/workingset.c | 5 +++--
5 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 901da54..9ee78b8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -416,6 +416,7 @@ extern struct address_space *swapper_spaces[];
extern unsigned long total_swapcache_pages(void);
extern void show_swap_cache_info(void);
extern int add_to_swap(struct page *page);
+extern void *get_shadow_from_swap_cache(swp_entry_t entry);
extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
gfp_t gfp, void **shadowp);
extern void __delete_from_swap_cache(struct page *page,
@@ -575,6 +576,11 @@ static inline int add_to_swap(struct page *page)
return 0;
}

+static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+ return NULL;
+}
+
static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
gfp_t gfp_mask, void **shadowp)
{
diff --git a/mm/memory.c b/mm/memory.c
index f221f96..2411cf57 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3094,6 +3094,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
int locked;
int exclusive = 0;
vm_fault_t ret = 0;
+ void *shadow = NULL;

if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
goto out;
@@ -3143,13 +3144,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (err)
goto out_page;

- /*
- * XXX: Move to lru_cache_add() when it
- * supports new vs putback
- */
- spin_lock_irq(&page_pgdat(page)->lru_lock);
- lru_note_cost_page(page);
- spin_unlock_irq(&page_pgdat(page)->lru_lock);
+ shadow = get_shadow_from_swap_cache(entry);
+ if (shadow)
+ workingset_refault(page, shadow);

lru_cache_add(page);
swap_readpage(page, true);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 43c4e3a..90c5bd1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -106,6 +106,20 @@ void show_swap_cache_info(void)
printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
}

+void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+ struct address_space *address_space = swap_address_space(entry);
+ pgoff_t idx = swp_offset(entry);
+ struct page *page;
+
+ page = find_get_entry(address_space, idx);
+ if (xa_is_value(page))
+ return page;
+ if (page)
+ put_page(page);
+ return NULL;
+}
+
/*
* add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
* but sets SwapCache flag and private instead of mapping and index.
@@ -405,6 +419,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
{
struct swap_info_struct *si;
struct page *page;
+ void *shadow = NULL;

*new_page_allocated = false;

@@ -473,7 +488,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
__SetPageSwapBacked(page);

/* May fail (-ENOMEM) if XArray node allocation failed. */
- if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL, NULL)) {
+ if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL, &shadow)) {
put_swap_page(page, entry);
goto fail_unlock;
}
@@ -483,10 +498,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
goto fail_unlock;
}

- /* XXX: Move to lru_cache_add() when it supports new vs putback */
- spin_lock_irq(&page_pgdat(page)->lru_lock);
- lru_note_cost_page(page);
- spin_unlock_irq(&page_pgdat(page)->lru_lock);
+ if (shadow)
+ workingset_refault(page, shadow);

/* Caller will initiate read into locked page */
SetPageWorkingset(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 37943bf..eb02d18 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -859,6 +859,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
{
unsigned long flags;
int refcount;
+ void *shadow = NULL;

BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
@@ -901,13 +902,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
- __delete_from_swap_cache(page, swap, NULL);
+ if (reclaimed && !mapping_exiting(mapping))
+ shadow = workingset_eviction(page, target_memcg);
+ __delete_from_swap_cache(page, swap, shadow);
xa_unlock_irqrestore(&mapping->i_pages, flags);
put_swap_page(page, swap);
- workingset_eviction(page, target_memcg);
} else {
void (*freepage)(struct page *);
- void *shadow = NULL;

freepage = mapping->a_ops->freepage;
/*
diff --git a/mm/workingset.c b/mm/workingset.c
index 8395e60..3769ae6 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -353,8 +353,9 @@ void workingset_refault(struct page *page, void *shadow)
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
- * all the memory was available to the page cache. Whether
- * cache can compete with anon or not depends on having swap.
+ * all the memory was available to the workingset. Whether
+ * workingset competetion need to consider anon or not depends
+ * on having swap.
*/
workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
--
2.7.4

2020-06-17 05:31:03

by Joonsoo Kim

[permalink] [raw]
Subject: [PATCH v6 3/6] mm/workingset: extend the workingset detection for anon LRU

From: Joonsoo Kim <[email protected]>

In the following patch, workingset detection will be applied to
anonymous LRU. To prepare it, this patch adds some code to
distinguish/handle the both LRUs.

v6: do not introduce a new nonresident_age for anon LRU since
we need to use *unified* nonresident_age to implement workingset
detection for anon LRU.

Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 16 +++++++++++-----
mm/memcontrol.c | 16 +++++++++++-----
mm/vmscan.c | 15 ++++++++++-----
mm/vmstat.c | 9 ++++++---
mm/workingset.c | 8 +++++---
5 files changed, 43 insertions(+), 21 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f6f8849..8e9d0b9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -179,9 +179,15 @@ enum node_stat_item {
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
WORKINGSET_NODES,
- WORKINGSET_REFAULT,
- WORKINGSET_ACTIVATE,
- WORKINGSET_RESTORE,
+ WORKINGSET_REFAULT_BASE,
+ WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
+ WORKINGSET_REFAULT_FILE,
+ WORKINGSET_ACTIVATE_BASE,
+ WORKINGSET_ACTIVATE_ANON = WORKINGSET_ACTIVATE_BASE,
+ WORKINGSET_ACTIVATE_FILE,
+ WORKINGSET_RESTORE_BASE,
+ WORKINGSET_RESTORE_ANON = WORKINGSET_RESTORE_BASE,
+ WORKINGSET_RESTORE_FILE,
WORKINGSET_NODERECLAIM,
NR_ANON_MAPPED, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
@@ -259,8 +265,8 @@ struct lruvec {
unsigned long file_cost;
/* Non-resident age, driven by LRU movement */
atomic_long_t nonresident_age;
- /* Refaults at the time of last reclaim cycle */
- unsigned long refaults;
+ /* Refaults at the time of last reclaim cycle, anon=0, file=1 */
+ unsigned long refaults[2];
/* Various lruvec state flags (enum lruvec_flags) */
unsigned long flags;
#ifdef CONFIG_MEMCG
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0b38b6a..2127dd1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1425,12 +1425,18 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGMAJFAULT),
memcg_events(memcg, PGMAJFAULT));

- seq_buf_printf(&s, "workingset_refault %lu\n",
- memcg_page_state(memcg, WORKINGSET_REFAULT));
- seq_buf_printf(&s, "workingset_activate %lu\n",
- memcg_page_state(memcg, WORKINGSET_ACTIVATE));
+ seq_buf_printf(&s, "workingset_refault_anon %lu\n",
+ memcg_page_state(memcg, WORKINGSET_REFAULT_ANON));
+ seq_buf_printf(&s, "workingset_refault_file %lu\n",
+ memcg_page_state(memcg, WORKINGSET_REFAULT_FILE));
+ seq_buf_printf(&s, "workingset_activate_anon %lu\n",
+ memcg_page_state(memcg, WORKINGSET_ACTIVATE_ANON));
+ seq_buf_printf(&s, "workingset_activate_file %lu\n",
+ memcg_page_state(memcg, WORKINGSET_ACTIVATE_FILE));
seq_buf_printf(&s, "workingset_restore %lu\n",
- memcg_page_state(memcg, WORKINGSET_RESTORE));
+ memcg_page_state(memcg, WORKINGSET_RESTORE_ANON));
+ seq_buf_printf(&s, "workingset_restore %lu\n",
+ memcg_page_state(memcg, WORKINGSET_RESTORE_FILE));
seq_buf_printf(&s, "workingset_nodereclaim %lu\n",
memcg_page_state(memcg, WORKINGSET_NODERECLAIM));

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4745e88..3caa35f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2695,7 +2695,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (!sc->force_deactivate) {
unsigned long refaults;

- if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+ refaults = lruvec_page_state(target_lruvec,
+ WORKINGSET_ACTIVATE_ANON);
+ if (refaults != target_lruvec->refaults[0] ||
+ inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
sc->may_deactivate |= DEACTIVATE_ANON;
else
sc->may_deactivate &= ~DEACTIVATE_ANON;
@@ -2706,8 +2709,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* rid of any stale active pages quickly.
*/
refaults = lruvec_page_state(target_lruvec,
- WORKINGSET_ACTIVATE);
- if (refaults != target_lruvec->refaults ||
+ WORKINGSET_ACTIVATE_FILE);
+ if (refaults != target_lruvec->refaults[1] ||
inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
sc->may_deactivate |= DEACTIVATE_FILE;
else
@@ -2984,8 +2987,10 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
unsigned long refaults;

target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
- refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
- target_lruvec->refaults = refaults;
+ refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
+ target_lruvec->refaults[0] = refaults;
+ refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_FILE);
+ target_lruvec->refaults[1] = refaults;
}

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 80c9b62..8843e75 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1149,9 +1149,12 @@ const char * const vmstat_text[] = {
"nr_isolated_anon",
"nr_isolated_file",
"workingset_nodes",
- "workingset_refault",
- "workingset_activate",
- "workingset_restore",
+ "workingset_refault_anon",
+ "workingset_refault_file",
+ "workingset_activate_anon",
+ "workingset_activate_file",
+ "workingset_restore_anon",
+ "workingset_restore_file",
"workingset_nodereclaim",
"nr_anon_pages",
"nr_mapped",
diff --git a/mm/workingset.c b/mm/workingset.c
index fc16d97c..8395e60 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -6,6 +6,7 @@
*/

#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
#include <linux/writeback.h>
#include <linux/shmem_fs.h>
#include <linux/pagemap.h>
@@ -280,6 +281,7 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
*/
void workingset_refault(struct page *page, void *shadow)
{
+ bool file = page_is_file_lru(page);
struct mem_cgroup *eviction_memcg;
struct lruvec *eviction_lruvec;
unsigned long refault_distance;
@@ -346,7 +348,7 @@ void workingset_refault(struct page *page, void *shadow)
memcg = page_memcg(page);
lruvec = mem_cgroup_lruvec(memcg, pgdat);

- inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
+ inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);

/*
* Compare the distance to the existing workingset size. We
@@ -364,7 +366,7 @@ void workingset_refault(struct page *page, void *shadow)

SetPageActive(page);
workingset_age_nonresident(lruvec, hpage_nr_pages(page));
- inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+ inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file);

/* Page was active prior to eviction */
if (workingset) {
@@ -373,7 +375,7 @@ void workingset_refault(struct page *page, void *shadow)
spin_lock_irq(&page_pgdat(page)->lru_lock);
lru_note_cost_page(page);
spin_unlock_irq(&page_pgdat(page)->lru_lock);
- inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
+ inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file);
}
out:
rcu_read_unlock();
--
2.7.4

2020-06-17 05:31:10

by Joonsoo Kim

[permalink] [raw]
Subject: [PATCH v6 4/6] mm/swapcache: support to handle the exceptional entries in swapcache

From: Joonsoo Kim <[email protected]>

Swapcache doesn't handle the exceptional entries since there is no case
using it. In the following patch, workingset detection for anonymous
page will be implemented and it stores the shadow entries as exceptional
entries into the swapcache. So, we need to handle the exceptional entries
and this patch implements it.

Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/swap.h | 17 ++++++++++++----
mm/shmem.c | 3 ++-
mm/swap_state.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++------
mm/swapfile.c | 2 ++
mm/vmscan.c | 2 +-
5 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f4f5f94..901da54 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -416,9 +416,13 @@ extern struct address_space *swapper_spaces[];
extern unsigned long total_swapcache_pages(void);
extern void show_swap_cache_info(void);
extern int add_to_swap(struct page *page);
-extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
-extern void __delete_from_swap_cache(struct page *, swp_entry_t entry);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp, void **shadowp);
+extern void __delete_from_swap_cache(struct page *page,
+ swp_entry_t entry, void *shadow);
extern void delete_from_swap_cache(struct page *);
+extern void clear_shadow_from_swap_cache(int type, unsigned long begin,
+ unsigned long end);
extern void free_page_and_swap_cache(struct page *);
extern void free_pages_and_swap_cache(struct page **, int);
extern struct page *lookup_swap_cache(swp_entry_t entry,
@@ -572,13 +576,13 @@ static inline int add_to_swap(struct page *page)
}

static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, void **shadowp)
{
return -1;
}

static inline void __delete_from_swap_cache(struct page *page,
- swp_entry_t entry)
+ swp_entry_t entry, void *shadow)
{
}

@@ -586,6 +590,11 @@ static inline void delete_from_swap_cache(struct page *page)
{
}

+static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
+ unsigned long end)
+{
+}
+
static inline int page_swapcount(struct page *page)
{
return 0;
diff --git a/mm/shmem.c b/mm/shmem.c
index a0dbe62..e9a99a2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1374,7 +1374,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
list_add(&info->swaplist, &shmem_swaplist);

if (add_to_swap_cache(page, swap,
- __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN) == 0) {
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
+ NULL) == 0) {
spin_lock_irq(&info->lock);
shmem_recalc_inode(inode);
info->swapped++;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1050fde..43c4e3a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -110,12 +110,15 @@ void show_swap_cache_info(void)
* add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
* but sets SwapCache flag and private instead of mapping and index.
*/
-int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
+int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp, void **shadowp)
{
struct address_space *address_space = swap_address_space(entry);
pgoff_t idx = swp_offset(entry);
XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page));
unsigned long i, nr = hpage_nr_pages(page);
+ unsigned long nrexceptional = 0;
+ void *old;

VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageSwapCache(page), page);
@@ -131,10 +134,17 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
goto unlock;
for (i = 0; i < nr; i++) {
VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
+ old = xas_load(&xas);
+ if (xa_is_value(old)) {
+ nrexceptional++;
+ if (shadowp)
+ *shadowp = old;
+ }
set_page_private(page + i, entry.val + i);
xas_store(&xas, page);
xas_next(&xas);
}
+ address_space->nrexceptional -= nrexceptional;
address_space->nrpages += nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
ADD_CACHE_INFO(add_total, nr);
@@ -154,7 +164,8 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
* This must be called only on pages that have
* been verified to be in the swap cache.
*/
-void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
+void __delete_from_swap_cache(struct page *page,
+ swp_entry_t entry, void *shadow)
{
struct address_space *address_space = swap_address_space(entry);
int i, nr = hpage_nr_pages(page);
@@ -166,12 +177,14 @@ void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
VM_BUG_ON_PAGE(PageWriteback(page), page);

for (i = 0; i < nr; i++) {
- void *entry = xas_store(&xas, NULL);
+ void *entry = xas_store(&xas, shadow);
VM_BUG_ON_PAGE(entry != page, entry);
set_page_private(page + i, 0);
xas_next(&xas);
}
ClearPageSwapCache(page);
+ if (shadow)
+ address_space->nrexceptional += nr;
address_space->nrpages -= nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
ADD_CACHE_INFO(del_total, nr);
@@ -208,7 +221,7 @@ int add_to_swap(struct page *page)
* Add it to the swap cache.
*/
err = add_to_swap_cache(page, entry,
- __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
+ __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
if (err)
/*
* add_to_swap_cache() doesn't return -EEXIST, so we can safely
@@ -246,13 +259,44 @@ void delete_from_swap_cache(struct page *page)
struct address_space *address_space = swap_address_space(entry);

xa_lock_irq(&address_space->i_pages);
- __delete_from_swap_cache(page, entry);
+ __delete_from_swap_cache(page, entry, NULL);
xa_unlock_irq(&address_space->i_pages);

put_swap_page(page, entry);
page_ref_sub(page, hpage_nr_pages(page));
}

+void clear_shadow_from_swap_cache(int type, unsigned long begin,
+ unsigned long end)
+{
+ unsigned long curr;
+ void *old;
+ swp_entry_t entry = swp_entry(type, begin);
+ struct address_space *address_space = swap_address_space(entry);
+ XA_STATE(xas, &address_space->i_pages, begin);
+
+retry:
+ xa_lock_irq(&address_space->i_pages);
+ for (curr = begin; curr <= end; curr++) {
+ entry = swp_entry(type, curr);
+ if (swap_address_space(entry) != address_space) {
+ xa_unlock_irq(&address_space->i_pages);
+ address_space = swap_address_space(entry);
+ begin = curr;
+ xas_set(&xas, begin);
+ goto retry;
+ }
+
+ old = xas_load(&xas);
+ if (!xa_is_value(old))
+ continue;
+ xas_store(&xas, NULL);
+ address_space->nrexceptional--;
+ xas_next(&xas);
+ }
+ xa_unlock_irq(&address_space->i_pages);
+}
+
/*
* If we are the only user, then try to free up the swap cache.
*
@@ -429,7 +473,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
__SetPageSwapBacked(page);

/* May fail (-ENOMEM) if XArray node allocation failed. */
- if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
+ if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL, NULL)) {
put_swap_page(page, entry);
goto fail_unlock;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 38f6433..4630db1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -696,6 +696,7 @@ static void add_to_avail_list(struct swap_info_struct *p)
static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
unsigned int nr_entries)
{
+ unsigned long begin = offset;
unsigned long end = offset + nr_entries - 1;
void (*swap_slot_free_notify)(struct block_device *, unsigned long);

@@ -721,6 +722,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
swap_slot_free_notify(si->bdev, offset);
offset++;
}
+ clear_shadow_from_swap_cache(si->type, begin, end);
}

static void set_cluster_next(struct swap_info_struct *si, unsigned long next)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3caa35f..37943bf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -901,7 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
- __delete_from_swap_cache(page, swap);
+ __delete_from_swap_cache(page, swap, NULL);
xa_unlock_irqrestore(&mapping->i_pages, flags);
put_swap_page(page, swap);
workingset_eviction(page, target_memcg);
--
2.7.4

2020-06-17 12:19:55

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v6 4/6] mm/swapcache: support to handle the exceptional entries in swapcache

On Wed, Jun 17, 2020 at 02:26:21PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Swapcache doesn't handle the exceptional entries since there is no case

Don't call them exceptional entries.

The radix tree has/had the concecpt of exceptional entries. The swapcache
doesn't use the radix tree any more, it uses the XArray. The XArray
has value entries.

But you shouldn't call them value entries either; that's an XArray
concept. The swap cache and indeed page cache use value entries to
implement shadow entries (they're also used to implement dax entries and
swap entries in the page cache). So just call them shadow entries here.

I know there are still places which use the term 'nrexceptional' in
the kernel. I just haven't got round to replacing them yet. Please
don't add more.

> +void clear_shadow_from_swap_cache(int type, unsigned long begin,
> + unsigned long end)
> +{
> + unsigned long curr;
> + void *old;
> + swp_entry_t entry = swp_entry(type, begin);
> + struct address_space *address_space = swap_address_space(entry);
> + XA_STATE(xas, &address_space->i_pages, begin);
> +
> +retry:
> + xa_lock_irq(&address_space->i_pages);
> + for (curr = begin; curr <= end; curr++) {
> + entry = swp_entry(type, curr);
> + if (swap_address_space(entry) != address_space) {
> + xa_unlock_irq(&address_space->i_pages);
> + address_space = swap_address_space(entry);
> + begin = curr;
> + xas_set(&xas, begin);
> + goto retry;
> + }
> +
> + old = xas_load(&xas);
> + if (!xa_is_value(old))
> + continue;
> + xas_store(&xas, NULL);
> + address_space->nrexceptional--;
> + xas_next(&xas);
> + }
> + xa_unlock_irq(&address_space->i_pages);
> +}

This is a very clunky loop. I'm not sure it's even right, given that
you change address space without changing the xas's address space. How
about this?

for (;;) {
XA_STATE(xas, &address_space->i_pages, begin);
unsigned long nr_shadows = 0;

xas_lock_irq(&xas);
xas_for_each(&xas, entry, end) {
if (!xa_is_value(entry))
continue;
xas_store(&xas, NULL);
nr_shadows++;
}
address_space->nr_exceptionals -= nr_shadows;
xas_unlock_irq(&xas);

if (xas.xa_index >= end)
break;
entry = swp_entry(type, xas.xa_index);
address_space = swap_address_space(entry);
}

2020-06-19 02:17:19

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v6 4/6] mm/swapcache: support to handle the exceptional entries in swapcache

On Wed, Jun 17, 2020 at 05:17:17AM -0700, Matthew Wilcox wrote:
> On Wed, Jun 17, 2020 at 02:26:21PM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > Swapcache doesn't handle the exceptional entries since there is no case
>
> Don't call them exceptional entries.
>
> The radix tree has/had the concecpt of exceptional entries. The swapcache
> doesn't use the radix tree any more, it uses the XArray. The XArray
> has value entries.
>
> But you shouldn't call them value entries either; that's an XArray
> concept. The swap cache and indeed page cache use value entries to
> implement shadow entries (they're also used to implement dax entries and
> swap entries in the page cache). So just call them shadow entries here.
>
> I know there are still places which use the term 'nrexceptional' in
> the kernel. I just haven't got round to replacing them yet. Please
> don't add more.

Okay! Thanks for commenting.

>
> > +void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > + unsigned long end)
> > +{
> > + unsigned long curr;
> > + void *old;
> > + swp_entry_t entry = swp_entry(type, begin);
> > + struct address_space *address_space = swap_address_space(entry);
> > + XA_STATE(xas, &address_space->i_pages, begin);
> > +
> > +retry:
> > + xa_lock_irq(&address_space->i_pages);
> > + for (curr = begin; curr <= end; curr++) {
> > + entry = swp_entry(type, curr);
> > + if (swap_address_space(entry) != address_space) {
> > + xa_unlock_irq(&address_space->i_pages);
> > + address_space = swap_address_space(entry);
> > + begin = curr;
> > + xas_set(&xas, begin);
> > + goto retry;
> > + }
> > +
> > + old = xas_load(&xas);
> > + if (!xa_is_value(old))
> > + continue;
> > + xas_store(&xas, NULL);
> > + address_space->nrexceptional--;
> > + xas_next(&xas);
> > + }
> > + xa_unlock_irq(&address_space->i_pages);
> > +}
>
> This is a very clunky loop. I'm not sure it's even right, given that
> you change address space without changing the xas's address space. How
> about this?

You are correct. The xas's address space should be changed.


> for (;;) {
> XA_STATE(xas, &address_space->i_pages, begin);
> unsigned long nr_shadows = 0;
>
> xas_lock_irq(&xas);
> xas_for_each(&xas, entry, end) {
> if (!xa_is_value(entry))
> continue;
> xas_store(&xas, NULL);
> nr_shadows++;
> }
> address_space->nr_exceptionals -= nr_shadows;
> xas_unlock_irq(&xas);
>
> if (xas.xa_index >= end)
> break;
> entry = swp_entry(type, xas.xa_index);
> address_space = swap_address_space(entry);
> }

Thanks for suggestion.

I make a patch based on your suggestion. IIUC about Xarray,
after running xas_for_each(), xas.xa_index can be less than the end if
there are empty slots on last portion of array. Handling this case is
also considered in my patch.

Thanks.

------------------->8--------------------------------
From 72e97600ea294372a13ab8e208ebd3c0e1889408 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <[email protected]>
Date: Fri, 15 Nov 2019 09:48:32 +0900
Subject: [PATCH v6 4/6] mm/swapcache: support to handle the shadow entries

Workingset detection for anonymous page will be implemented in the
following patch and it requires to store the shadow entries into the
swapcache. This patch implements an infrastructure to store the shadow
entry in the swapcache.

Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/swap.h | 17 ++++++++++++----
mm/shmem.c | 3 ++-
mm/swap_state.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++------
mm/swapfile.c | 2 ++
mm/vmscan.c | 2 +-
5 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f4f5f94..901da54 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -416,9 +416,13 @@ extern struct address_space *swapper_spaces[];
extern unsigned long total_swapcache_pages(void);
extern void show_swap_cache_info(void);
extern int add_to_swap(struct page *page);
-extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
-extern void __delete_from_swap_cache(struct page *, swp_entry_t entry);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp, void **shadowp);
+extern void __delete_from_swap_cache(struct page *page,
+ swp_entry_t entry, void *shadow);
extern void delete_from_swap_cache(struct page *);
+extern void clear_shadow_from_swap_cache(int type, unsigned long begin,
+ unsigned long end);
extern void free_page_and_swap_cache(struct page *);
extern void free_pages_and_swap_cache(struct page **, int);
extern struct page *lookup_swap_cache(swp_entry_t entry,
@@ -572,13 +576,13 @@ static inline int add_to_swap(struct page *page)
}

static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, void **shadowp)
{
return -1;
}

static inline void __delete_from_swap_cache(struct page *page,
- swp_entry_t entry)
+ swp_entry_t entry, void *shadow)
{
}

@@ -586,6 +590,11 @@ static inline void delete_from_swap_cache(struct page *page)
{
}

+static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
+ unsigned long end)
+{
+}
+
static inline int page_swapcount(struct page *page)
{
return 0;
diff --git a/mm/shmem.c b/mm/shmem.c
index a0dbe62..e9a99a2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1374,7 +1374,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
list_add(&info->swaplist, &shmem_swaplist);

if (add_to_swap_cache(page, swap,
- __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN) == 0) {
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
+ NULL) == 0) {
spin_lock_irq(&info->lock);
shmem_recalc_inode(inode);
info->swapped++;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1050fde..49a66dc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -110,12 +110,14 @@ void show_swap_cache_info(void)
* add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
* but sets SwapCache flag and private instead of mapping and index.
*/
-int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
+int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp, void **shadowp)
{
struct address_space *address_space = swap_address_space(entry);
pgoff_t idx = swp_offset(entry);
XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page));
unsigned long i, nr = hpage_nr_pages(page);
+ void *old;

VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageSwapCache(page), page);
@@ -125,16 +127,25 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
SetPageSwapCache(page);

do {
+ unsigned long nr_shadows = 0;
+
xas_lock_irq(&xas);
xas_create_range(&xas);
if (xas_error(&xas))
goto unlock;
for (i = 0; i < nr; i++) {
VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
+ old = xas_load(&xas);
+ if (xa_is_value(old)) {
+ nr_shadows++;
+ if (shadowp)
+ *shadowp = old;
+ }
set_page_private(page + i, entry.val + i);
xas_store(&xas, page);
xas_next(&xas);
}
+ address_space->nrexceptional -= nr_shadows;
address_space->nrpages += nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
ADD_CACHE_INFO(add_total, nr);
@@ -154,7 +165,8 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
* This must be called only on pages that have
* been verified to be in the swap cache.
*/
-void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
+void __delete_from_swap_cache(struct page *page,
+ swp_entry_t entry, void *shadow)
{
struct address_space *address_space = swap_address_space(entry);
int i, nr = hpage_nr_pages(page);
@@ -166,12 +178,14 @@ void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
VM_BUG_ON_PAGE(PageWriteback(page), page);

for (i = 0; i < nr; i++) {
- void *entry = xas_store(&xas, NULL);
+ void *entry = xas_store(&xas, shadow);
VM_BUG_ON_PAGE(entry != page, entry);
set_page_private(page + i, 0);
xas_next(&xas);
}
ClearPageSwapCache(page);
+ if (shadow)
+ address_space->nrexceptional += nr;
address_space->nrpages -= nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
ADD_CACHE_INFO(del_total, nr);
@@ -208,7 +222,7 @@ int add_to_swap(struct page *page)
* Add it to the swap cache.
*/
err = add_to_swap_cache(page, entry,
- __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
+ __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
if (err)
/*
* add_to_swap_cache() doesn't return -EEXIST, so we can safely
@@ -246,13 +260,44 @@ void delete_from_swap_cache(struct page *page)
struct address_space *address_space = swap_address_space(entry);

xa_lock_irq(&address_space->i_pages);
- __delete_from_swap_cache(page, entry);
+ __delete_from_swap_cache(page, entry, NULL);
xa_unlock_irq(&address_space->i_pages);

put_swap_page(page, entry);
page_ref_sub(page, hpage_nr_pages(page));
}

+void clear_shadow_from_swap_cache(int type, unsigned long begin,
+ unsigned long end)
+{
+ unsigned long curr = begin;
+ void *old;
+
+ for (;;) {
+ unsigned long nr_shadows = 0;
+ swp_entry_t entry = swp_entry(type, curr);
+ struct address_space *address_space = swap_address_space(entry);
+ XA_STATE(xas, &address_space->i_pages, curr);
+
+ xa_lock_irq(&address_space->i_pages);
+ xas_for_each(&xas, old, end) {
+ if (!xa_is_value(old))
+ continue;
+ xas_store(&xas, NULL);
+ nr_shadows++;
+ }
+ address_space->nrexceptional -= nr_shadows;
+ xa_unlock_irq(&address_space->i_pages);
+
+ /* search the next swapcache until we meet end */
+ curr >>= SWAP_ADDRESS_SPACE_SHIFT;
+ curr++;
+ curr <<= SWAP_ADDRESS_SPACE_SHIFT;
+ if (curr > end)
+ break;
+ }
+}
+
/*
* If we are the only user, then try to free up the swap cache.
*
@@ -429,7 +474,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
__SetPageSwapBacked(page);

/* May fail (-ENOMEM) if XArray node allocation failed. */
- if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
+ if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL, NULL)) {
put_swap_page(page, entry);
goto fail_unlock;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 38f6433..4630db1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -696,6 +696,7 @@ static void add_to_avail_list(struct swap_info_struct *p)
static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
unsigned int nr_entries)
{
+ unsigned long begin = offset;
unsigned long end = offset + nr_entries - 1;
void (*swap_slot_free_notify)(struct block_device *, unsigned long);

@@ -721,6 +722,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
swap_slot_free_notify(si->bdev, offset);
offset++;
}
+ clear_shadow_from_swap_cache(si->type, begin, end);
}

static void set_cluster_next(struct swap_info_struct *si, unsigned long next)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3caa35f..37943bf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -901,7 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
- __delete_from_swap_cache(page, swap);
+ __delete_from_swap_cache(page, swap, NULL);
xa_unlock_irqrestore(&mapping->i_pages, flags);
put_swap_page(page, swap);
workingset_eviction(page, target_memcg);
--
2.7.4

2020-06-26 05:09:39

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v6 4/6] mm/swapcache: support to handle the exceptional entries in swapcache

2020년 6월 19일 (금) 오전 10:33, Joonsoo Kim <[email protected]>님이 작성:
>
> On Wed, Jun 17, 2020 at 05:17:17AM -0700, Matthew Wilcox wrote:
> > On Wed, Jun 17, 2020 at 02:26:21PM +0900, [email protected] wrote:
> > > From: Joonsoo Kim <[email protected]>
> > >
> > > Swapcache doesn't handle the exceptional entries since there is no case
> >
> > Don't call them exceptional entries.
> >
> > The radix tree has/had the concecpt of exceptional entries. The swapcache
> > doesn't use the radix tree any more, it uses the XArray. The XArray
> > has value entries.
> >
> > But you shouldn't call them value entries either; that's an XArray
> > concept. The swap cache and indeed page cache use value entries to
> > implement shadow entries (they're also used to implement dax entries and
> > swap entries in the page cache). So just call them shadow entries here.
> >
> > I know there are still places which use the term 'nrexceptional' in
> > the kernel. I just haven't got round to replacing them yet. Please
> > don't add more.
>
> Okay! Thanks for commenting.
>
> >
> > > +void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > > + unsigned long end)
> > > +{
> > > + unsigned long curr;
> > > + void *old;
> > > + swp_entry_t entry = swp_entry(type, begin);
> > > + struct address_space *address_space = swap_address_space(entry);
> > > + XA_STATE(xas, &address_space->i_pages, begin);
> > > +
> > > +retry:
> > > + xa_lock_irq(&address_space->i_pages);
> > > + for (curr = begin; curr <= end; curr++) {
> > > + entry = swp_entry(type, curr);
> > > + if (swap_address_space(entry) != address_space) {
> > > + xa_unlock_irq(&address_space->i_pages);
> > > + address_space = swap_address_space(entry);
> > > + begin = curr;
> > > + xas_set(&xas, begin);
> > > + goto retry;
> > > + }
> > > +
> > > + old = xas_load(&xas);
> > > + if (!xa_is_value(old))
> > > + continue;
> > > + xas_store(&xas, NULL);
> > > + address_space->nrexceptional--;
> > > + xas_next(&xas);
> > > + }
> > > + xa_unlock_irq(&address_space->i_pages);
> > > +}
> >
> > This is a very clunky loop. I'm not sure it's even right, given that
> > you change address space without changing the xas's address space. How
> > about this?
>
> You are correct. The xas's address space should be changed.
>
>
> > for (;;) {
> > XA_STATE(xas, &address_space->i_pages, begin);
> > unsigned long nr_shadows = 0;
> >
> > xas_lock_irq(&xas);
> > xas_for_each(&xas, entry, end) {
> > if (!xa_is_value(entry))
> > continue;
> > xas_store(&xas, NULL);
> > nr_shadows++;
> > }
> > address_space->nr_exceptionals -= nr_shadows;
> > xas_unlock_irq(&xas);
> >
> > if (xas.xa_index >= end)
> > break;
> > entry = swp_entry(type, xas.xa_index);
> > address_space = swap_address_space(entry);
> > }
>
> Thanks for suggestion.
>
> I make a patch based on your suggestion. IIUC about Xarray,
> after running xas_for_each(), xas.xa_index can be less than the end if
> there are empty slots on last portion of array. Handling this case is
> also considered in my patch.

Hello, Matthew.

Could you check if the following patch (Xarray part) is correct?
Since I made a patch based on your suggestion, I'd like to get your review. :)

Thanks.

> Thanks.
>
> ------------------->8--------------------------------
> From 72e97600ea294372a13ab8e208ebd3c0e1889408 Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <[email protected]>
> Date: Fri, 15 Nov 2019 09:48:32 +0900
> Subject: [PATCH v6 4/6] mm/swapcache: support to handle the shadow entries
>
> Workingset detection for anonymous page will be implemented in the
> following patch and it requires to store the shadow entries into the
> swapcache. This patch implements an infrastructure to store the shadow
> entry in the swapcache.
>
> Acked-by: Johannes Weiner <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> include/linux/swap.h | 17 ++++++++++++----
> mm/shmem.c | 3 ++-
> mm/swap_state.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++------
> mm/swapfile.c | 2 ++
> mm/vmscan.c | 2 +-
> 5 files changed, 69 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f4f5f94..901da54 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -416,9 +416,13 @@ extern struct address_space *swapper_spaces[];
> extern unsigned long total_swapcache_pages(void);
> extern void show_swap_cache_info(void);
> extern int add_to_swap(struct page *page);
> -extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
> -extern void __delete_from_swap_cache(struct page *, swp_entry_t entry);
> +extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
> + gfp_t gfp, void **shadowp);
> +extern void __delete_from_swap_cache(struct page *page,
> + swp_entry_t entry, void *shadow);
> extern void delete_from_swap_cache(struct page *);
> +extern void clear_shadow_from_swap_cache(int type, unsigned long begin,
> + unsigned long end);
> extern void free_page_and_swap_cache(struct page *);
> extern void free_pages_and_swap_cache(struct page **, int);
> extern struct page *lookup_swap_cache(swp_entry_t entry,
> @@ -572,13 +576,13 @@ static inline int add_to_swap(struct page *page)
> }
>
> static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
> - gfp_t gfp_mask)
> + gfp_t gfp_mask, void **shadowp)
> {
> return -1;
> }
>
> static inline void __delete_from_swap_cache(struct page *page,
> - swp_entry_t entry)
> + swp_entry_t entry, void *shadow)
> {
> }
>
> @@ -586,6 +590,11 @@ static inline void delete_from_swap_cache(struct page *page)
> {
> }
>
> +static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
> + unsigned long end)
> +{
> +}
> +
> static inline int page_swapcount(struct page *page)
> {
> return 0;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a0dbe62..e9a99a2 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1374,7 +1374,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> list_add(&info->swaplist, &shmem_swaplist);
>
> if (add_to_swap_cache(page, swap,
> - __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN) == 0) {
> + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
> + NULL) == 0) {
> spin_lock_irq(&info->lock);
> shmem_recalc_inode(inode);
> info->swapped++;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 1050fde..49a66dc 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -110,12 +110,14 @@ void show_swap_cache_info(void)
> * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
> * but sets SwapCache flag and private instead of mapping and index.
> */
> -int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
> +int add_to_swap_cache(struct page *page, swp_entry_t entry,
> + gfp_t gfp, void **shadowp)
> {
> struct address_space *address_space = swap_address_space(entry);
> pgoff_t idx = swp_offset(entry);
> XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page));
> unsigned long i, nr = hpage_nr_pages(page);
> + void *old;
>
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(PageSwapCache(page), page);
> @@ -125,16 +127,25 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
> SetPageSwapCache(page);
>
> do {
> + unsigned long nr_shadows = 0;
> +
> xas_lock_irq(&xas);
> xas_create_range(&xas);
> if (xas_error(&xas))
> goto unlock;
> for (i = 0; i < nr; i++) {
> VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
> + old = xas_load(&xas);
> + if (xa_is_value(old)) {
> + nr_shadows++;
> + if (shadowp)
> + *shadowp = old;
> + }
> set_page_private(page + i, entry.val + i);
> xas_store(&xas, page);
> xas_next(&xas);
> }
> + address_space->nrexceptional -= nr_shadows;
> address_space->nrpages += nr;
> __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
> ADD_CACHE_INFO(add_total, nr);
> @@ -154,7 +165,8 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
> * This must be called only on pages that have
> * been verified to be in the swap cache.
> */
> -void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
> +void __delete_from_swap_cache(struct page *page,
> + swp_entry_t entry, void *shadow)
> {
> struct address_space *address_space = swap_address_space(entry);
> int i, nr = hpage_nr_pages(page);
> @@ -166,12 +178,14 @@ void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
> VM_BUG_ON_PAGE(PageWriteback(page), page);
>
> for (i = 0; i < nr; i++) {
> - void *entry = xas_store(&xas, NULL);
> + void *entry = xas_store(&xas, shadow);
> VM_BUG_ON_PAGE(entry != page, entry);
> set_page_private(page + i, 0);
> xas_next(&xas);
> }
> ClearPageSwapCache(page);
> + if (shadow)
> + address_space->nrexceptional += nr;
> address_space->nrpages -= nr;
> __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
> ADD_CACHE_INFO(del_total, nr);
> @@ -208,7 +222,7 @@ int add_to_swap(struct page *page)
> * Add it to the swap cache.
> */
> err = add_to_swap_cache(page, entry,
> - __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
> + __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
> if (err)
> /*
> * add_to_swap_cache() doesn't return -EEXIST, so we can safely
> @@ -246,13 +260,44 @@ void delete_from_swap_cache(struct page *page)
> struct address_space *address_space = swap_address_space(entry);
>
> xa_lock_irq(&address_space->i_pages);
> - __delete_from_swap_cache(page, entry);
> + __delete_from_swap_cache(page, entry, NULL);
> xa_unlock_irq(&address_space->i_pages);
>
> put_swap_page(page, entry);
> page_ref_sub(page, hpage_nr_pages(page));
> }
>
> +void clear_shadow_from_swap_cache(int type, unsigned long begin,
> + unsigned long end)
> +{
> + unsigned long curr = begin;
> + void *old;
> +
> + for (;;) {
> + unsigned long nr_shadows = 0;
> + swp_entry_t entry = swp_entry(type, curr);
> + struct address_space *address_space = swap_address_space(entry);
> + XA_STATE(xas, &address_space->i_pages, curr);
> +
> + xa_lock_irq(&address_space->i_pages);
> + xas_for_each(&xas, old, end) {
> + if (!xa_is_value(old))
> + continue;
> + xas_store(&xas, NULL);
> + nr_shadows++;
> + }
> + address_space->nrexceptional -= nr_shadows;
> + xa_unlock_irq(&address_space->i_pages);
> +
> + /* search the next swapcache until we meet end */
> + curr >>= SWAP_ADDRESS_SPACE_SHIFT;
> + curr++;
> + curr <<= SWAP_ADDRESS_SPACE_SHIFT;
> + if (curr > end)
> + break;
> + }
> +}
> +
> /*
> * If we are the only user, then try to free up the swap cache.
> *
> @@ -429,7 +474,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> __SetPageSwapBacked(page);
>
> /* May fail (-ENOMEM) if XArray node allocation failed. */
> - if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
> + if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL, NULL)) {
> put_swap_page(page, entry);
> goto fail_unlock;
> }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 38f6433..4630db1 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -696,6 +696,7 @@ static void add_to_avail_list(struct swap_info_struct *p)
> static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> unsigned int nr_entries)
> {
> + unsigned long begin = offset;
> unsigned long end = offset + nr_entries - 1;
> void (*swap_slot_free_notify)(struct block_device *, unsigned long);
>
> @@ -721,6 +722,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> swap_slot_free_notify(si->bdev, offset);
> offset++;
> }
> + clear_shadow_from_swap_cache(si->type, begin, end);
> }
>
> static void set_cluster_next(struct swap_info_struct *si, unsigned long next)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3caa35f..37943bf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -901,7 +901,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
> if (PageSwapCache(page)) {
> swp_entry_t swap = { .val = page_private(page) };
> mem_cgroup_swapout(page, swap);
> - __delete_from_swap_cache(page, swap);
> + __delete_from_swap_cache(page, swap, NULL);
> xa_unlock_irqrestore(&mapping->i_pages, flags);
> put_swap_page(page, swap);
> workingset_eviction(page, target_memcg);
> --
> 2.7.4
>

2020-06-26 05:13:54

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v6 0/6] workingset protection/detection on the anonymous LRU list

2020년 6월 17일 (수) 오후 2:26, <[email protected]>님이 작성:
>
> From: Joonsoo Kim <[email protected]>
>
> Hello,
>
> This patchset implements workingset protection and detection on
> the anonymous LRU list.
>
> * Changes on v6
> - rework to reflect a new LRU balance model
> - remove memcg charge timing stuff on v5 since alternative is already
> merged on mainline
> - remove readahead stuff on v5 (reason is the same with above)
> - clear shadow entry if corresponding swap entry is deleted
> (mm/swapcache: support to handle the exceptional entries in swapcache)
> - change experiment environment
> (from ssd swap to ram swap, for fast evaluation and for reducing side-effect of I/O)
> - update performance number

Hello, Johannes.

Could you review the v6 patchset?

Some minor things have changed so it's really welcome if you review the patchset
again. Especially, "mm/swap: implement workingset detection for anonymous LRU"
doesn't get any ack yet. :)

Thanks.

2020-07-01 18:03:34

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] mm/vmscan: protect the workingset on anonymous LRU

On 6/17/20 7:26 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>

Hi, how about a more descriptive subject, such as

mm/vmscan: add new anonymous pages to inactive LRU list

> In current implementation, newly created or swap-in anonymous page
> is started on active list. Growing active list results in rebalancing
> active/inactive list so old pages on active list are demoted to inactive
> list. Hence, the page on active list isn't protected at all.
>
> Following is an example of this situation.
>
> Assume that 50 hot pages on active list. Numbers denote the number of
> pages on active/inactive list (active | inactive).
>
> 1. 50 hot pages on active list
> 50(h) | 0
>
> 2. workload: 50 newly created (used-once) pages
> 50(uo) | 50(h)
>
> 3. workload: another 50 newly created (used-once) pages
> 50(uo) | 50(uo), swap-out 50(h)
>
> This patch tries to fix this issue.
> Like as file LRU, newly created or swap-in anonymous pages will be
> inserted to the inactive list. They are promoted to active list if
> enough reference happens. This simple modification changes the above
> example as following.
>
> 1. 50 hot pages on active list
> 50(h) | 0
>
> 2. workload: 50 newly created (used-once) pages
> 50(h) | 50(uo)
>
> 3. workload: another 50 newly created (used-once) pages
> 50(h) | 50(uo), swap-out 50(uo)
>
> As you can see, hot pages on active list would be protected.
>
> Note that, this implementation has a drawback that the page cannot
> be promoted and will be swapped-out if re-access interval is greater than
> the size of inactive list but less than the size of total(active+inactive).
> To solve this potential issue, following patch will apply workingset
> detection that is applied to file LRU some day before.

detection similar to the one that's already applied to file LRU.

> v6: Before this patch, all anon pages (inactive + active) are considered
> as workingset. However, with this patch, only active pages are considered
> as workingset. So, file refault formula which uses the number of all
> anon pages is changed to use only the number of active anon pages.

a "v6" note is more suitable for a diffstat area than commit log, but it's good
to mention this so drop the 'v6:'?

> Acked-by: Johannes Weiner <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

One more nit below.

> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -476,23 +476,24 @@ void lru_cache_add(struct page *page)
> EXPORT_SYMBOL(lru_cache_add);
>
> /**
> - * lru_cache_add_active_or_unevictable
> + * lru_cache_add_inactive_or_unevictable
> * @page: the page to be added to LRU
> * @vma: vma in which page is mapped for determining reclaimability
> *
> - * Place @page on the active or unevictable LRU list, depending on its
> + * Place @page on the inactive or unevictable LRU list, depending on its
> * evictability. Note that if the page is not evictable, it goes
> * directly back onto it's zone's unevictable list, it does NOT use a
> * per cpu pagevec.
> */
> -void lru_cache_add_active_or_unevictable(struct page *page,
> +void lru_cache_add_inactive_or_unevictable(struct page *page,
> struct vm_area_struct *vma)
> {
> + bool unevictable;
> +
> VM_BUG_ON_PAGE(PageLRU(page), page);
>
> - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> - SetPageActive(page);
> - else if (!TestSetPageMlocked(page)) {
> + unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
> + if (unevictable && !TestSetPageMlocked(page)) {

I guess this could be "if (unlikely(unevictable) && ..." to match the previous
if (likely(evictable)) else ...

> /*
> * We use the irq-unsafe __mod_zone_page_stat because this
> * counter is not modified from interrupt context, and the pte
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index c047789..38f6433 100644

2020-07-01 21:26:09

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v6 3/6] mm/workingset: extend the workingset detection for anon LRU

On 6/17/20 7:26 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>

Hi,

I would adjust the subject, as it sounds like the patch does the whole
workingset detection, not just preparation.
How about:

mm/workingset: prepare the workingset infrastructure for anon LRU

> In the following patch, workingset detection will be applied to
> anonymous LRU. To prepare it, this patch adds some code to
> distinguish/handle the both LRUs.

How about:
To prepare for this, this patch splits workingset event counters for refault,
activate and restore into anon and file variants, as well as the refaults
counter in struct lruvec.

> v6: do not introduce a new nonresident_age for anon LRU since
> we need to use *unified* nonresident_age to implement workingset
> detection for anon LRU.

Again, v6 update info shouldn't go to changelog. In this case I think it doesn't
need mentioning at all, at least not in this patch.

> Acked-by: Johannes Weiner <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

2020-07-02 13:38:27

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v6 5/6] mm/swap: implement workingset detection for anonymous LRU

On 6/17/20 7:26 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> This patch implements workingset detection for anonymous LRU.
> All the infrastructure is implemented by the previous patches so this patch
> just activates the workingset detection by installing/retrieving
> the shadow entry.
>
> Signed-off-by: Joonsoo Kim <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> diff --git a/mm/workingset.c b/mm/workingset.c
> index 8395e60..3769ae6 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -353,8 +353,9 @@ void workingset_refault(struct page *page, void *shadow)
> /*
> * Compare the distance to the existing workingset size. We
> * don't activate pages that couldn't stay resident even if
> - * all the memory was available to the page cache. Whether
> - * cache can compete with anon or not depends on having swap.
> + * all the memory was available to the workingset. Whether
> + * workingset competetion need to consider anon or not depends

competition needs

> + * on having swap.
> */
> workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
> if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
>

2020-07-02 13:47:03

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v6 6/6] mm/vmscan: restore active/inactive ratio for anonymous LRU

On 6/17/20 7:26 AM, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Now, workingset detection is implemented for anonymous LRU.
> We don't have to worry about the misfound for workingset due to
> the ratio of active/inactive. Let's restore the ratio.

How about:

Now that workingset detection is implemented for anonymous LRU, we don't need
large inactive list to allow detecting frequently accessed pages before they are
reclaimed, anymore. This effectively reverts the temporary measure put in by
commit "mm/vmscan: make active/inactive ratio as 1:1 for anon lru".

> Acked-by: Johannes Weiner <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

Thanks!
I still hope Matthew can review updated patch 4/6 (I'm not really familiar with
proper xarray handling), and Johannes patch 5/6.

And then we just need a nice Documentation file describing how reclaim really
works after all the recent changes :)

Vlastimil

> ---
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eb02d18..ec77691 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2211,7 +2211,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
> active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
>
> gb = (inactive + active) >> (30 - PAGE_SHIFT);
> - if (gb && is_file_lru(inactive_lru))
> + if (gb)
> inactive_ratio = int_sqrt(10 * gb);
> else
> inactive_ratio = 1;
>

2020-07-03 00:49:37

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] mm/vmscan: protect the workingset on anonymous LRU

2020년 7월 2일 (목) 오전 3:02, Vlastimil Babka <[email protected]>님이 작성:
>
> On 6/17/20 7:26 AM, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
>
> Hi, how about a more descriptive subject, such as

Hello,

> mm/vmscan: add new anonymous pages to inactive LRU list

This patch does two things to implement workingset protection.

- add new anonymous pages to inactive LRU list
- check two reference to activate

So, I think that the current subject is more descriptive for this patch.

> > In current implementation, newly created or swap-in anonymous page
> > is started on active list. Growing active list results in rebalancing
> > active/inactive list so old pages on active list are demoted to inactive
> > list. Hence, the page on active list isn't protected at all.
> >
> > Following is an example of this situation.
> >
> > Assume that 50 hot pages on active list. Numbers denote the number of
> > pages on active/inactive list (active | inactive).
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(uo) | 50(h)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(uo) | 50(uo), swap-out 50(h)
> >
> > This patch tries to fix this issue.
> > Like as file LRU, newly created or swap-in anonymous pages will be
> > inserted to the inactive list. They are promoted to active list if
> > enough reference happens. This simple modification changes the above
> > example as following.
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(h) | 50(uo)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(h) | 50(uo), swap-out 50(uo)
> >
> > As you can see, hot pages on active list would be protected.
> >
> > Note that, this implementation has a drawback that the page cannot
> > be promoted and will be swapped-out if re-access interval is greater than
> > the size of inactive list but less than the size of total(active+inactive).
> > To solve this potential issue, following patch will apply workingset
> > detection that is applied to file LRU some day before.
>
> detection similar to the one that's already applied to file LRU.

Will change!

> > v6: Before this patch, all anon pages (inactive + active) are considered
> > as workingset. However, with this patch, only active pages are considered
> > as workingset. So, file refault formula which uses the number of all
> > anon pages is changed to use only the number of active anon pages.
>
> a "v6" note is more suitable for a diffstat area than commit log, but it's good
> to mention this so drop the 'v6:'?

Okay.

> > Acked-by: Johannes Weiner <[email protected]>
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>

Thanks!

> One more nit below.
>
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -476,23 +476,24 @@ void lru_cache_add(struct page *page)
> > EXPORT_SYMBOL(lru_cache_add);
> >
> > /**
> > - * lru_cache_add_active_or_unevictable
> > + * lru_cache_add_inactive_or_unevictable
> > * @page: the page to be added to LRU
> > * @vma: vma in which page is mapped for determining reclaimability
> > *
> > - * Place @page on the active or unevictable LRU list, depending on its
> > + * Place @page on the inactive or unevictable LRU list, depending on its
> > * evictability. Note that if the page is not evictable, it goes
> > * directly back onto it's zone's unevictable list, it does NOT use a
> > * per cpu pagevec.
> > */
> > -void lru_cache_add_active_or_unevictable(struct page *page,
> > +void lru_cache_add_inactive_or_unevictable(struct page *page,
> > struct vm_area_struct *vma)
> > {
> > + bool unevictable;
> > +
> > VM_BUG_ON_PAGE(PageLRU(page), page);
> >
> > - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> > - SetPageActive(page);
> > - else if (!TestSetPageMlocked(page)) {
> > + unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
> > + if (unevictable && !TestSetPageMlocked(page)) {
>
> I guess this could be "if (unlikely(unevictable) && ..." to match the previous
> if (likely(evictable)) else ...

I will fix it, too.

Thanks.

2020-07-03 00:52:41

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v6 3/6] mm/workingset: extend the workingset detection for anon LRU

2020년 7월 2일 (목) 오전 6:25, Vlastimil Babka <[email protected]>님이 작성:
>
> On 6/17/20 7:26 AM, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
>
> Hi,
>
> I would adjust the subject, as it sounds like the patch does the whole
> workingset detection, not just preparation.
> How about:
>
> mm/workingset: prepare the workingset infrastructure for anon LRU

Looks good. I will use it.

> > In the following patch, workingset detection will be applied to
> > anonymous LRU. To prepare it, this patch adds some code to
> > distinguish/handle the both LRUs.
>
> How about:
> To prepare for this, this patch splits workingset event counters for refault,
> activate and restore into anon and file variants, as well as the refaults
> counter in struct lruvec.

Will do.

> > v6: do not introduce a new nonresident_age for anon LRU since
> > we need to use *unified* nonresident_age to implement workingset
> > detection for anon LRU.
>
> Again, v6 update info shouldn't go to changelog. In this case I think it doesn't
> need mentioning at all, at least not in this patch.

Okay. I agree that this should not be included in the changelog. I just want
to notice someone who checked previous patches about that there is an
important change in this version.

> > Acked-by: Johannes Weiner <[email protected]>
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>

Thanks.

2020-07-03 00:55:16

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v6 5/6] mm/swap: implement workingset detection for anonymous LRU

2020년 7월 2일 (목) 오후 10:37, Vlastimil Babka <[email protected]>님이 작성:
>
> On 6/17/20 7:26 AM, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > This patch implements workingset detection for anonymous LRU.
> > All the infrastructure is implemented by the previous patches so this patch
> > just activates the workingset detection by installing/retrieving
> > the shadow entry.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
>
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 8395e60..3769ae6 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -353,8 +353,9 @@ void workingset_refault(struct page *page, void *shadow)
> > /*
> > * Compare the distance to the existing workingset size. We
> > * don't activate pages that couldn't stay resident even if
> > - * all the memory was available to the page cache. Whether
> > - * cache can compete with anon or not depends on having swap.
> > + * all the memory was available to the workingset. Whether
> > + * workingset competetion need to consider anon or not depends
>
> competition needs

Will fix it.

Thanks.

2020-07-03 00:56:02

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v6 6/6] mm/vmscan: restore active/inactive ratio for anonymous LRU

2020년 7월 2일 (목) 오후 10:45, Vlastimil Babka <[email protected]>님이 작성:
>
> On 6/17/20 7:26 AM, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > Now, workingset detection is implemented for anonymous LRU.
> > We don't have to worry about the misfound for workingset due to
> > the ratio of active/inactive. Let's restore the ratio.
>
> How about:
>
> Now that workingset detection is implemented for anonymous LRU, we don't need
> large inactive list to allow detecting frequently accessed pages before they are
> reclaimed, anymore. This effectively reverts the temporary measure put in by
> commit "mm/vmscan: make active/inactive ratio as 1:1 for anon lru".

Much better!. I will use the comment you suggested. Thanks.

> > Acked-by: Johannes Weiner <[email protected]>
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
>
> Thanks!
> I still hope Matthew can review updated patch 4/6 (I'm not really familiar with
> proper xarray handling), and Johannes patch 5/6.

Okay, I hope so, too. :)

> And then we just need a nice Documentation file describing how reclaim really
> works after all the recent changes :)

Agreed.

Thanks.

2020-07-17 14:00:36

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] mm/vmscan: protect the workingset on anonymous LRU

On Wed, Jun 17, 2020 at 02:26:19PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> In current implementation, newly created or swap-in anonymous page
> is started on active list. Growing active list results in rebalancing
> active/inactive list so old pages on active list are demoted to inactive
> list. Hence, the page on active list isn't protected at all.
>
> Following is an example of this situation.
>
> Assume that 50 hot pages on active list. Numbers denote the number of
> pages on active/inactive list (active | inactive).
>
> 1. 50 hot pages on active list
> 50(h) | 0
>
> 2. workload: 50 newly created (used-once) pages
> 50(uo) | 50(h)
>
> 3. workload: another 50 newly created (used-once) pages
> 50(uo) | 50(uo), swap-out 50(h)
>
> This patch tries to fix this issue.
> Like as file LRU, newly created or swap-in anonymous pages will be
> inserted to the inactive list. They are promoted to active list if
> enough reference happens. This simple modification changes the above
> example as following.
>
> 1. 50 hot pages on active list
> 50(h) | 0
>
> 2. workload: 50 newly created (used-once) pages
> 50(h) | 50(uo)
>
> 3. workload: another 50 newly created (used-once) pages
> 50(h) | 50(uo), swap-out 50(uo)
>
> As you can see, hot pages on active list would be protected.
>
> Note that, this implementation has a drawback that the page cannot
> be promoted and will be swapped-out if re-access interval is greater than
> the size of inactive list but less than the size of total(active+inactive).
> To solve this potential issue, following patch will apply workingset
> detection that is applied to file LRU some day before.
>
> v6: Before this patch, all anon pages (inactive + active) are considered
> as workingset. However, with this patch, only active pages are considered
> as workingset. So, file refault formula which uses the number of all
> anon pages is changed to use only the number of active anon pages.

I can see that also from the code, but it doesn't explain why.

And I'm not sure this is correct. I can see two problems with it.

After your patch series, there is still one difference between anon
and file: cache trim mode. If the "use-once" anon dominate most of
memory and you have a small set of heavily thrashing files, it would
not get recognized. File refaults *have* to compare their distance to
the *entire* anon set, or we could get trapped in cache trimming mode
even as file pages with access frequencies <= RAM are thrashing.

On the anon side, there is no cache trimming mode. But even if we're
not in cache trimming mode and active file is already being reclaimed,
we have to recognize thrashing on the anon side when reuse frequencies
are within available RAM. Otherwise we treat an inactive file that is
not being reused as having the same value as an anon page that is
being reused. And then we may reclaim file and anon at the same rate
even as anon is thrashing and file is not. That's not right.

We need to activate everything with a reuse frequency <= RAM. Reuse
frequency is refault distance plus size of the inactive list the page
was on. This means anon distances should be compared to active anon +
inactive file + active file, and file distances should be compared to
active file + inactive_anon + active anon.

workingset_size should basically always be everything except the
inactive list the page is refaulting from as that represents the delta
between total RAM and the amount of space this page had available.

2020-07-17 14:08:24

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v6 5/6] mm/swap: implement workingset detection for anonymous LRU

On Wed, Jun 17, 2020 at 02:26:22PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> This patch implements workingset detection for anonymous LRU.
> All the infrastructure is implemented by the previous patches so this patch
> just activates the workingset detection by installing/retrieving
> the shadow entry.
>
> Signed-off-by: Joonsoo Kim <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2020-07-20 06:54:10

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] mm/vmscan: protect the workingset on anonymous LRU

2020년 7월 17일 (금) 오후 10:59, Johannes Weiner <[email protected]>님이 작성:
>
> On Wed, Jun 17, 2020 at 02:26:19PM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > In current implementation, newly created or swap-in anonymous page
> > is started on active list. Growing active list results in rebalancing
> > active/inactive list so old pages on active list are demoted to inactive
> > list. Hence, the page on active list isn't protected at all.
> >
> > Following is an example of this situation.
> >
> > Assume that 50 hot pages on active list. Numbers denote the number of
> > pages on active/inactive list (active | inactive).
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(uo) | 50(h)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(uo) | 50(uo), swap-out 50(h)
> >
> > This patch tries to fix this issue.
> > Like as file LRU, newly created or swap-in anonymous pages will be
> > inserted to the inactive list. They are promoted to active list if
> > enough reference happens. This simple modification changes the above
> > example as following.
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(h) | 50(uo)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(h) | 50(uo), swap-out 50(uo)
> >
> > As you can see, hot pages on active list would be protected.
> >
> > Note that, this implementation has a drawback that the page cannot
> > be promoted and will be swapped-out if re-access interval is greater than
> > the size of inactive list but less than the size of total(active+inactive).
> > To solve this potential issue, following patch will apply workingset
> > detection that is applied to file LRU some day before.
> >
> > v6: Before this patch, all anon pages (inactive + active) are considered
> > as workingset. However, with this patch, only active pages are considered
> > as workingset. So, file refault formula which uses the number of all
> > anon pages is changed to use only the number of active anon pages.
>
> I can see that also from the code, but it doesn't explain why.
>
> And I'm not sure this is correct. I can see two problems with it.
>
> After your patch series, there is still one difference between anon
> and file: cache trim mode. If the "use-once" anon dominate most of
> memory and you have a small set of heavily thrashing files, it would
> not get recognized. File refaults *have* to compare their distance to
> the *entire* anon set, or we could get trapped in cache trimming mode
> even as file pages with access frequencies <= RAM are thrashing.
>
> On the anon side, there is no cache trimming mode. But even if we're
> not in cache trimming mode and active file is already being reclaimed,
> we have to recognize thrashing on the anon side when reuse frequencies
> are within available RAM. Otherwise we treat an inactive file that is
> not being reused as having the same value as an anon page that is
> being reused. And then we may reclaim file and anon at the same rate
> even as anon is thrashing and file is not. That's not right.
>
> We need to activate everything with a reuse frequency <= RAM. Reuse
> frequency is refault distance plus size of the inactive list the page
> was on. This means anon distances should be compared to active anon +
> inactive file + active file, and file distances should be compared to
> active file + inactive_anon + active anon.

You're right. Maybe, I'm confused about something at that time. I will change
it as you suggested.

Thanks.