From: Joonsoo Kim <[email protected]>
Hello,
This patchset implements workingset protection and detection on
the anonymous LRU list.
* Changes on v3
- rework the patch, "mm/vmscan: protect the workingset on anonymous LRU"
(use almost same reference tracking algorithm to the one for the file
mapped page)
* Changes on v2
- fix a critical bug that uses out of index lru list in
workingset_refault()
- fix a bug that reuses the rotate value for previous page
* SUBJECT
workingset protection
* PROBLEM
In current implementation, newly created or swap-in anonymous page is
started on the active list. Growing the active list results in rebalancing
active/inactive list so old pages on the active list are demoted to the
inactive list. Hence, hot page on the active list isn't protected at all.
Following is an example of this situation.
Assume that 50 hot pages on active list and system can contain total
100 pages. Numbers denote the number of pages on active/inactive
list (active | inactive). (h) stands for hot pages and (uo) stands for
used-once pages.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
As we can see, hot pages are swapped-out and it would cause swap-in later.
* SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection. Like as the file LRU list, newly created or swap-in anonymous
page is started on the inactive list. Also, like as the file LRU list,
if enough reference happens, the page will be promoted. This simple
modification changes the above example as following.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
hot pages remains in the active list. :)
* EXPERIMENT
I tested this scenario on my test bed and confirmed that this problem
happens on current implementation. I also checked that it is fixed by
this patchset.
I did another test to show the performance effect of this patchset.
- ebizzy (with modified random function)
ebizzy is the test program that main thread allocates lots of memory and
child threads access them randomly during the given times. Swap-in/out
will happen if allocated memory is larger than the system memory.
The random function that represents the zipf distribution is used to
make hot/cold memory. Hot/cold ratio is controlled by the parameter. If
the parameter is high, hot memory is accessed much larger than cold one.
If the parameter is low, the number of access on each memory would be
similar. I uses various parameters in order to show the effect of
patchset on various hot/cold ratio workload.
My test setup is a virtual machine with 8 cpus and 1024MB memory.
Result format is as following.
Parameter 0.1 ... 1.3
Allocated memory size
Throughput for base (larger is better)
Throughput for patchset (larger is better)
Improvement (larger is better)
* single thread
0.1 0.3 0.5 0.7 0.9 1.1 1.3
<512M>
7009.0 7372.0 7774.0 8523.0 9569.0 10724.0 11936.0
6973.0 7342.0 7745.0 8576.0 9441.0 10730.0 12033.0
-0.01 -0.0 -0.0 0.01 -0.01 0.0 0.01
<768M>
915.0 1039.0 1275.0 1687.0 2328.0 3486.0 5445.0
920.0 1037.0 1238.0 1689.0 2384.0 3638.0 5381.0
0.01 -0.0 -0.03 0.0 0.02 0.04 -0.01
<1024M>
425.0 471.0 539.0 753.0 1183.0 2130.0 3839.0
414.0 468.0 553.0 770.0 1242.0 2187.0 3932.0
-0.03 -0.01 0.03 0.02 0.05 0.03 0.02
<1280M>
320.0 346.0 410.0 556.0 871.0 1654.0 3298.0
316.0 346.0 411.0 550.0 892.0 1652.0 3293.0
-0.01 0.0 0.0 -0.01 0.02 -0.0 -0.0
<1536M>
273.0 290.0 341.0 458.0 733.0 1381.0 2925.0
271.0 293.0 344.0 462.0 740.0 1398.0 2969.0
-0.01 0.01 0.01 0.01 0.01 0.01 0.02
<2048M>
77.0 79.0 95.0 147.0 276.0 690.0 1816.0
91.0 94.0 115.0 170.0 321.0 770.0 2018.0
0.18 0.19 0.21 0.16 0.16 0.12 0.11
* multi thread (8)
0.1 0.3 0.5 0.7 0.9 1.1 1.3
<512M>
29083.0 29648.0 30145.0 31668.0 33964.0 38414.0 43707.0
29238.0 29701.0 30301.0 31328.0 33809.0 37991.0 43667.0
0.01 0.0 0.01 -0.01 -0.0 -0.01 -0.0
<768M>
3332.0 3699.0 4673.0 5830.0 8307.0 12969.0 17665.0
3579.0 3992.0 4432.0 6111.0 8699.0 12604.0 18061.0
0.07 0.08 -0.05 0.05 0.05 -0.03 0.02
<1024M>
1921.0 2141.0 2484.0 3296.0 5391.0 8227.0 14574.0
1989.0 2155.0 2609.0 3565.0 5463.0 8170.0 15642.0
0.04 0.01 0.05 0.08 0.01 -0.01 0.07
<1280M>
1524.0 1625.0 1931.0 2581.0 4155.0 6959.0 12443.0
1560.0 1707.0 2016.0 2714.0 4262.0 7518.0 13910.0
0.02 0.05 0.04 0.05 0.03 0.08 0.12
<1536M>
1303.0 1399.0 1550.0 2137.0 3469.0 6712.0 12944.0
1356.0 1465.0 1701.0 2237.0 3583.0 6830.0 13580.0
0.04 0.05 0.1 0.05 0.03 0.02 0.05
<2048M>
172.0 184.0 215.0 289.0 514.0 1318.0 4153.0
175.0 190.0 225.0 329.0 606.0 1585.0 5170.0
0.02 0.03 0.05 0.14 0.18 0.2 0.24
As we can see, as allocated memory grows, patched kernel get the better
result. Maximum improvement is 21% for the single thread test and
24% for the multi thread test.
* SUBJECT
workingset detection
* PROBLEM
Later part of the patchset implements the workingset detection for
the anonymous LRU list. There is a corner case that workingset protection
could cause thrashing. If we can avoid thrashing by workingset detection,
we can get the better performance.
Following is an example of thrashing due to the workingset protection.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (will be hot) pages
50(h) | 50(wh)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
4. workload: 50 (will be hot) pages
50(h) | 50(wh), swap-in 50(wh)
5. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
6. repeat 4, 5
Without workingset detection, this kind of workload cannot be promoted
and thrashing happens forever.
* SOLUTION
Therefore, this patchset implements workingset detection.
All the infrastructure for workingset detecion is already implemented,
so there is not much work to do. First, extend workingset detection
code to deal with the anonymous LRU list. Then, make swap cache handles
the exceptional value for the shadow entry. Lastly, install/retrieve
the shadow value into/from the swap cache and check the refault distance.
* EXPERIMENT
I made a test program to imitates above scenario and confirmed that
problem exists. Then, I checked that this patchset fixes it.
My test setup is a virtual machine with 8 cpus and 6100MB memory. But,
the amount of the memory that the test program can use is about 280 MB.
This is because the system uses large ram-backed swap and large ramdisk
to capture the trace.
Test scenario is like as below.
1. allocate cold memory (512MB)
2. allocate hot-1 memory (96MB)
3. activate hot-1 memory (96MB)
4. allocate another hot-2 memory (96MB)
5. access cold memory (128MB)
6. access hot-2 memory (96MB)
7. repeat 5, 6
Since hot-1 memory (96MB) is on the active list, the inactive list can
contains roughly 190MB pages. hot-2 memory's re-access interval
(96+128 MB) is more 190MB, so it cannot be promoted without workingset
detection and swap-in/out happens repeatedly. With this patchset,
workingset detection works and promotion happens. Therefore, swap-in/out
occurs less.
Here is the result. (average of 5 runs)
type swap-in swap-out
base 863240 989945
patch 681565 809273
As we can see, patched kernel do less swap-in/out.
Patchset is based on v5.5.
Patchset can also be available at
https://github.com/JoonsooKim/linux/tree/improve-anonymous-lru-management-v3.00-v5.5
Enjoy it.
Thanks.
Joonsoo Kim (9):
mm/vmscan: make active/inactive ratio as 1:1 for anon lru
mm/vmscan: protect the workingset on anonymous LRU
mm/workingset: extend the workingset detection for anon LRU
mm/swapcache: support to handle the value in swapcache
mm/workingset: use the node counter if memcg is the root memcg
mm/workingset: handle the page without memcg
mm/swap: implement workingset detection for anonymous LRU
mm/vmscan: restore active/inactive ratio for anonymous LRU
mm/swap: count a new anonymous page as a reclaim_state's rotate
include/linux/mmzone.h | 14 ++++++++-----
include/linux/swap.h | 18 ++++++++++++-----
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 6 +++---
mm/khugepaged.c | 2 +-
mm/memcontrol.c | 12 ++++++++----
mm/memory.c | 16 +++++++++------
mm/migrate.c | 2 +-
mm/shmem.c | 3 ++-
mm/swap.c | 42 ++++++++++++++++++++++++++++++++-------
mm/swap_state.c | 52 ++++++++++++++++++++++++++++++++++++++++++-------
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 2 +-
mm/vmscan.c | 26 +++++++++++++++----------
mm/vmstat.c | 6 ++++--
mm/workingset.c | 47 +++++++++++++++++++++++++++++++-------------
16 files changed, 183 insertions(+), 69 deletions(-)
--
2.7.4
From: Joonsoo Kim <[email protected]>
Current implementation of LRU management for anonymous page has some
problems. Most important one is that it doesn't protect the workingset,
that is, pages on the active LRU list. Although, this problem will be
fixed in the following patchset, the preparation is required and
this patch does it.
What following patchset does is to restore workingset protection. In this
case, newly created or swap-in pages are started their lifetime on the
inactive list. If inactive list is too small, there is not enough chance
to be referenced and the page cannot become the workingset.
In order to provide enough chance to the newly anonymous pages, this patch
makes active/inactive LRU ratio as 1:1.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 572fb17..e772f3f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2217,7 +2217,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
gb = (inactive + active) >> (30 - PAGE_SHIFT);
- if (gb)
+ if (gb && is_file_lru(inactive_lru))
inactive_ratio = int_sqrt(10 * gb);
else
inactive_ratio = 1;
--
2.7.4
From: Joonsoo Kim <[email protected]>
In current implementation, newly created or swap-in anonymous page
is started on active list. Growing active list results in rebalancing
active/inactive list so old pages on active list are demoted to inactive
list. Hence, the page on active list isn't protected at all.
Following is an example of this situation.
Assume that 50 hot pages on active list. Numbers denote the number of
pages on active/inactive list (active | inactive).
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
This patch tries to fix this issue.
Like as file LRU, newly created or swap-in anonymous pages will be
inserted to the inactive list. They are promoted to active list if
enough reference happens. This simple modification changes the above
example as following.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
As you can see, hot pages on active list would be protected.
Note that, this implementation has a drawback that the page cannot
be promoted and will be swapped-out if re-access interval is greater than
the size of inactive list but less than the size of total(active+inactive).
To solve this potential issue, following patch will apply workingset
detection that is applied to file LRU some day before.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/swap.h | 2 +-
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 6 +++---
mm/khugepaged.c | 2 +-
mm/memory.c | 9 ++++-----
mm/migrate.c | 2 +-
mm/swap.c | 13 +++++++------
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 2 +-
mm/vmscan.c | 4 +---
10 files changed, 21 insertions(+), 23 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e99f7a..954e13e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -344,7 +344,7 @@ extern void deactivate_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
extern void swap_setup(void);
-extern void lru_cache_add_active_or_unevictable(struct page *page,
+extern void lru_cache_add_inactive_or_unevictable(struct page *page,
struct vm_area_struct *vma);
/* linux/mm/vmscan.c */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ece7e13..14156fc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -190,7 +190,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
get_page(new_page);
page_add_new_anon_rmap(new_page, vma, addr, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ lru_cache_add_inactive_or_unevictable(new_page, vma);
} else
/* no new page, just dec_mm_counter for old_page */
dec_mm_counter(mm, MM_ANONPAGES);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a880932..6356dfd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -638,7 +638,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr, true);
mem_cgroup_commit_charge(page, memcg, false, true);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1282,7 +1282,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
set_page_private(pages[i], 0);
page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
mem_cgroup_commit_charge(pages[i], memcg, false, false);
- lru_cache_add_active_or_unevictable(pages[i], vma);
+ lru_cache_add_inactive_or_unevictable(pages[i], vma);
vmf->pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte_none(*vmf->pte));
set_pte_at(vma->vm_mm, haddr, vmf->pte, entry);
@@ -1435,7 +1435,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
page_add_new_anon_rmap(new_page, vma, haddr, true);
mem_cgroup_commit_charge(new_page, memcg, false, true);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ lru_cache_add_inactive_or_unevictable(new_page, vma);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
if (!page) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908..246c155 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1092,7 +1092,7 @@ static void collapse_huge_page(struct mm_struct *mm,
page_add_new_anon_rmap(new_page, vma, address, true);
mem_cgroup_commit_charge(new_page, memcg, false, true);
count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ lru_cache_add_inactive_or_unevictable(new_page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 45442d9..5f7813a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2513,7 +2513,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ lru_cache_add_inactive_or_unevictable(new_page, vma);
/*
* We call the notify macro here because, when using secondary
* mmu page tables (such as kvm shadow page tables), we want the
@@ -3038,11 +3038,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
- activate_page(page);
}
swap_free(entry);
@@ -3186,7 +3185,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
@@ -3449,7 +3448,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
diff --git a/mm/migrate.c b/mm/migrate.c
index 86873b6..ef034c0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2784,7 +2784,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
page_add_new_anon_rmap(page, vma, addr, false);
mem_cgroup_commit_charge(page, memcg, false, false);
if (!is_zone_device_page(page))
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
get_page(page);
if (flush) {
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae9..18b2735 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -448,23 +448,24 @@ void lru_cache_add(struct page *page)
}
/**
- * lru_cache_add_active_or_unevictable
+ * lru_cache_add_inactive_or_unevictable
* @page: the page to be added to LRU
* @vma: vma in which page is mapped for determining reclaimability
*
- * Place @page on the active or unevictable LRU list, depending on its
+ * Place @page on the inactive or unevictable LRU list, depending on its
* evictability. Note that if the page is not evictable, it goes
* directly back onto it's zone's unevictable list, it does NOT use a
* per cpu pagevec.
*/
-void lru_cache_add_active_or_unevictable(struct page *page,
+void lru_cache_add_inactive_or_unevictable(struct page *page,
struct vm_area_struct *vma)
{
+ bool evictable;
+
VM_BUG_ON_PAGE(PageLRU(page), page);
- if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
- SetPageActive(page);
- else if (!TestSetPageMlocked(page)) {
+ evictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED;
+ if (!evictable && !TestSetPageMlocked(page)) {
/*
* We use the irq-unsafe __mod_zone_page_stat because this
* counter is not modified from interrupt context, and the pte
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bb3261d..6bdcbf9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1888,7 +1888,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
} else { /* ksm created a completely new copy */
page_add_new_anon_rmap(page, vma, addr, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
}
swap_free(entry);
/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 1b0d7ab..875e329 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -120,7 +120,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
inc_mm_counter(dst_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, dst_vma);
+ lru_cache_add_inactive_or_unevictable(page, dst_vma);
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e772f3f..c932141 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1010,8 +1010,6 @@ static enum page_references page_check_references(struct page *page,
return PAGEREF_RECLAIM;
if (referenced_ptes) {
- if (PageSwapBacked(page))
- return PAGEREF_ACTIVATE;
/*
* All mapped pages start out with page table
* references from the instantiating fault, so we need
@@ -1034,7 +1032,7 @@ static enum page_references page_check_references(struct page *page,
/*
* Activate file-backed executable pages after first usage.
*/
- if (vm_flags & VM_EXEC)
+ if ((vm_flags & VM_EXEC) && !PageSwapBacked(page))
return PAGEREF_ACTIVATE;
return PAGEREF_KEEP;
--
2.7.4
From: Joonsoo Kim <[email protected]>
In the following patch, workingset detection will be applied to
anonymous LRU. To prepare it, this patch adds some code to
distinguish/handle the both LRUs.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 14 +++++++++-----
mm/memcontrol.c | 12 ++++++++----
mm/vmscan.c | 15 ++++++++++-----
mm/vmstat.c | 6 ++++--
mm/workingset.c | 35 ++++++++++++++++++++++-------------
5 files changed, 53 insertions(+), 29 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5334ad8..b78fd8c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -220,8 +220,12 @@ enum node_stat_item {
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
WORKINGSET_NODES,
- WORKINGSET_REFAULT,
- WORKINGSET_ACTIVATE,
+ WORKINGSET_REFAULT_BASE,
+ WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
+ WORKINGSET_REFAULT_FILE,
+ WORKINGSET_ACTIVATE_BASE,
+ WORKINGSET_ACTIVATE_ANON = WORKINGSET_ACTIVATE_BASE,
+ WORKINGSET_ACTIVATE_FILE,
WORKINGSET_RESTORE,
WORKINGSET_NODERECLAIM,
NR_ANON_MAPPED, /* Mapped anonymous pages */
@@ -304,10 +308,10 @@ enum lruvec_flags {
struct lruvec {
struct list_head lists[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
- /* Evictions & activations on the inactive file list */
- atomic_long_t inactive_age;
+ /* Evictions & activations on the inactive list */
+ atomic_long_t inactive_age[2];
/* Refaults at the time of last reclaim cycle */
- unsigned long refaults;
+ unsigned long refaults[2];
/* Various lruvec state flags (enum lruvec_flags) */
unsigned long flags;
#ifdef CONFIG_MEMCG
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6c83cf4..8f4473d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1431,10 +1431,14 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGMAJFAULT),
memcg_events(memcg, PGMAJFAULT));
- seq_buf_printf(&s, "workingset_refault %lu\n",
- memcg_page_state(memcg, WORKINGSET_REFAULT));
- seq_buf_printf(&s, "workingset_activate %lu\n",
- memcg_page_state(memcg, WORKINGSET_ACTIVATE));
+ seq_buf_printf(&s, "workingset_refault_anon %lu\n",
+ memcg_page_state(memcg, WORKINGSET_REFAULT_ANON));
+ seq_buf_printf(&s, "workingset_refault_file %lu\n",
+ memcg_page_state(memcg, WORKINGSET_REFAULT_FILE));
+ seq_buf_printf(&s, "workingset_activate_anon %lu\n",
+ memcg_page_state(memcg, WORKINGSET_ACTIVATE_ANON));
+ seq_buf_printf(&s, "workingset_activate_file %lu\n",
+ memcg_page_state(memcg, WORKINGSET_ACTIVATE_FILE));
seq_buf_printf(&s, "workingset_nodereclaim %lu\n",
memcg_page_state(memcg, WORKINGSET_NODERECLAIM));
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c932141..0493c25 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2716,7 +2716,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (!sc->force_deactivate) {
unsigned long refaults;
- if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+ refaults = lruvec_page_state(target_lruvec,
+ WORKINGSET_ACTIVATE_ANON);
+ if (refaults != target_lruvec->refaults[0] ||
+ inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
sc->may_deactivate |= DEACTIVATE_ANON;
else
sc->may_deactivate &= ~DEACTIVATE_ANON;
@@ -2727,8 +2730,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* rid of any stale active pages quickly.
*/
refaults = lruvec_page_state(target_lruvec,
- WORKINGSET_ACTIVATE);
- if (refaults != target_lruvec->refaults ||
+ WORKINGSET_ACTIVATE_FILE);
+ if (refaults != target_lruvec->refaults[1] ||
inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
sc->may_deactivate |= DEACTIVATE_FILE;
else
@@ -3007,8 +3010,10 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
unsigned long refaults;
target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
- refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
- target_lruvec->refaults = refaults;
+ refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
+ target_lruvec->refaults[0] = refaults;
+ refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_FILE);
+ target_lruvec->refaults[1] = refaults;
}
/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d5337..3cdf8e9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1146,8 +1146,10 @@ const char * const vmstat_text[] = {
"nr_isolated_anon",
"nr_isolated_file",
"workingset_nodes",
- "workingset_refault",
- "workingset_activate",
+ "workingset_refault_anon",
+ "workingset_refault_file",
+ "workingset_activate_anon",
+ "workingset_activate_file",
"workingset_restore",
"workingset_nodereclaim",
"nr_anon_pages",
diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b..5fb8f85 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -15,6 +15,7 @@
#include <linux/dax.h>
#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/mm_inline.h>
/*
* Double CLOCK lists
@@ -156,7 +157,7 @@
*
* Implementation
*
- * For each node's file LRU lists, a counter for inactive evictions
+ * For each node's anon/file LRU lists, a counter for inactive evictions
* and activations is maintained (node->inactive_age).
*
* On eviction, a snapshot of this counter (along with some bits to
@@ -213,7 +214,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
*workingsetp = workingset;
}
-static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat,
+ int is_file)
{
/*
* Reclaiming a cgroup means reclaiming all its children in a
@@ -230,7 +232,7 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
struct lruvec *lruvec;
lruvec = mem_cgroup_lruvec(memcg, pgdat);
- atomic_long_inc(&lruvec->inactive_age);
+ atomic_long_inc(&lruvec->inactive_age[is_file]);
} while (memcg && (memcg = parent_mem_cgroup(memcg)));
}
@@ -248,18 +250,19 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
unsigned long eviction;
struct lruvec *lruvec;
int memcgid;
+ int is_file = page_is_file_cache(page);
/* Page is fully exclusive and pins page->mem_cgroup */
VM_BUG_ON_PAGE(PageLRU(page), page);
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
- advance_inactive_age(page_memcg(page), pgdat);
+ advance_inactive_age(page_memcg(page), pgdat, is_file);
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
/* XXX: target_memcg can be NULL, go through lruvec */
memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
- eviction = atomic_long_read(&lruvec->inactive_age);
+ eviction = atomic_long_read(&lruvec->inactive_age[is_file]);
return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
}
@@ -278,13 +281,16 @@ void workingset_refault(struct page *page, void *shadow)
struct lruvec *eviction_lruvec;
unsigned long refault_distance;
struct pglist_data *pgdat;
- unsigned long active_file;
+ unsigned long active;
struct mem_cgroup *memcg;
unsigned long eviction;
struct lruvec *lruvec;
unsigned long refault;
bool workingset;
int memcgid;
+ int is_file = page_is_file_cache(page);
+ enum lru_list active_lru = page_lru_base_type(page) + LRU_ACTIVE;
+ enum node_stat_item workingset_stat;
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
@@ -309,8 +315,8 @@ void workingset_refault(struct page *page, void *shadow)
if (!mem_cgroup_disabled() && !eviction_memcg)
goto out;
eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
- refault = atomic_long_read(&eviction_lruvec->inactive_age);
- active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+ refault = atomic_long_read(&eviction_lruvec->inactive_age[is_file]);
+ active = lruvec_page_state(eviction_lruvec, active_lru);
/*
* Calculate the refault distance
@@ -341,19 +347,21 @@ void workingset_refault(struct page *page, void *shadow)
memcg = page_memcg(page);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
- inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
+ workingset_stat = WORKINGSET_REFAULT_BASE + is_file;
+ inc_lruvec_state(lruvec, workingset_stat);
/*
* Compare the distance to the existing workingset size. We
* don't act on pages that couldn't stay resident even if all
* the memory was available to the page cache.
*/
- if (refault_distance > active_file)
+ if (refault_distance > active)
goto out;
SetPageActive(page);
- advance_inactive_age(memcg, pgdat);
- inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+ advance_inactive_age(memcg, pgdat, is_file);
+ workingset_stat = WORKINGSET_ACTIVATE_BASE + is_file;
+ inc_lruvec_state(lruvec, workingset_stat);
/* Page was active prior to eviction */
if (workingset) {
@@ -371,6 +379,7 @@ void workingset_refault(struct page *page, void *shadow)
void workingset_activation(struct page *page)
{
struct mem_cgroup *memcg;
+ int is_file = page_is_file_cache(page);
rcu_read_lock();
/*
@@ -383,7 +392,7 @@ void workingset_activation(struct page *page)
memcg = page_memcg_rcu(page);
if (!mem_cgroup_disabled() && !memcg)
goto out;
- advance_inactive_age(memcg, page_pgdat(page));
+ advance_inactive_age(memcg, page_pgdat(page), is_file);
out:
rcu_read_unlock();
}
--
2.7.4
From: Joonsoo Kim <[email protected]>
Swapcache doesn't handle the value since there is no case using the value.
In the following patch, workingset detection for anonymous page will be
implemented and it stores the value into the swapcache. So, we need to
handle it and this patch implement handling.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/swap.h | 5 +++--
mm/swap_state.c | 23 ++++++++++++++++++++---
mm/vmscan.c | 2 +-
3 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 954e13e..0df8b3f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -410,7 +410,8 @@ extern void show_swap_cache_info(void);
extern int add_to_swap(struct page *page);
extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
-extern void __delete_from_swap_cache(struct page *, swp_entry_t entry);
+extern void __delete_from_swap_cache(struct page *page,
+ swp_entry_t entry, void *shadow);
extern void delete_from_swap_cache(struct page *);
extern void free_page_and_swap_cache(struct page *);
extern void free_pages_and_swap_cache(struct page **, int);
@@ -571,7 +572,7 @@ static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
}
static inline void __delete_from_swap_cache(struct page *page,
- swp_entry_t entry)
+ swp_entry_t entry, void *shadow)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8e7ce9a..3fbbe45 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -117,6 +117,10 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
pgoff_t idx = swp_offset(entry);
XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page));
unsigned long i, nr = compound_nr(page);
+ unsigned long nrexceptional = 0;
+ void *old;
+
+ xas_set_update(&xas, workingset_update_node);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageSwapCache(page), page);
@@ -132,10 +136,14 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
goto unlock;
for (i = 0; i < nr; i++) {
VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
+ old = xas_load(&xas);
+ if (xa_is_value(old))
+ nrexceptional++;
set_page_private(page + i, entry.val + i);
xas_store(&xas, page);
xas_next(&xas);
}
+ address_space->nrexceptional -= nrexceptional;
address_space->nrpages += nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
ADD_CACHE_INFO(add_total, nr);
@@ -155,24 +163,33 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
* This must be called only on pages that have
* been verified to be in the swap cache.
*/
-void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
+void __delete_from_swap_cache(struct page *page,
+ swp_entry_t entry, void *shadow)
{
struct address_space *address_space = swap_address_space(entry);
int i, nr = hpage_nr_pages(page);
pgoff_t idx = swp_offset(entry);
XA_STATE(xas, &address_space->i_pages, idx);
+ /* Do not apply workingset detection for the hugh page */
+ if (nr > 1)
+ shadow = NULL;
+
+ xas_set_update(&xas, workingset_update_node);
+
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
VM_BUG_ON_PAGE(PageWriteback(page), page);
for (i = 0; i < nr; i++) {
- void *entry = xas_store(&xas, NULL);
+ void *entry = xas_store(&xas, shadow);
VM_BUG_ON_PAGE(entry != page, entry);
set_page_private(page + i, 0);
xas_next(&xas);
}
ClearPageSwapCache(page);
+ if (shadow)
+ address_space->nrexceptional += nr;
address_space->nrpages -= nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
ADD_CACHE_INFO(del_total, nr);
@@ -247,7 +264,7 @@ void delete_from_swap_cache(struct page *page)
struct address_space *address_space = swap_address_space(entry);
xa_lock_irq(&address_space->i_pages);
- __delete_from_swap_cache(page, entry);
+ __delete_from_swap_cache(page, entry, NULL);
xa_unlock_irq(&address_space->i_pages);
put_swap_page(page, entry);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0493c25..9871861 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -909,7 +909,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
- __delete_from_swap_cache(page, swap);
+ __delete_from_swap_cache(page, swap, NULL);
xa_unlock_irqrestore(&mapping->i_pages, flags);
put_swap_page(page, swap);
} else {
--
2.7.4
From: Joonsoo Kim <[email protected]>
When implementing workingset detection for anonymous page, I found
some swapcache pages with NULL memcg. From the code reading, I found
two reasons.
One is the case that swap-in readahead happens. The other is the
corner case related to the shmem cache. These two problems should be
fixed, but, it's not straight-forward to fix. For example, when swap-off,
all swapped-out pages are read into swapcache. In this case, who's the
owner of the swapcache page?
Since this problem doesn't look trivial, I decide to leave the issue and
handles this corner case on the place where the error occurs.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/workingset.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/workingset.c b/mm/workingset.c
index a9f474a..8d2e83a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -257,6 +257,10 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
+ /* page_memcg() can be NULL if swap-in readahead happens */
+ if (!page_memcg(page))
+ return NULL;
+
advance_inactive_age(page_memcg(page), pgdat, is_file);
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
--
2.7.4
From: Joonsoo Kim <[email protected]>
reclaim_stat's rotate is used for controlling the ratio of scanning page
between file and anonymous LRU. All new anonymous pages are counted
for rotate before the patch, protecting anonymous pages on active LRU, and,
it makes that reclaim on anonymous LRU is less happened than file LRU.
Now, situation is changed. all new anonymous pages are not added
to the active LRU so rotate would be far less than before. It will cause
that reclaim on anonymous LRU happens more and it would result in bad
effect on some system that is optimized for previous setting.
Therefore, this patch counts a new anonymous page as a reclaim_state's
rotate. Although it is non-logical to add this count to
the reclaim_state's rotate in current algorithm, reducing the regression
would be more important.
I found this regression on kernel-build test and it is roughly 2~5%
performance degradation. With this workaround, performance is completely
restored.
v2: fix a bug that reuses the rotate value for previous page
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/swap.c | 29 ++++++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
diff --git a/mm/swap.c b/mm/swap.c
index 18b2735..9001d81 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -187,6 +187,9 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
}
EXPORT_SYMBOL_GPL(get_kernel_page);
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
+ void *arg);
+
static void pagevec_lru_move_fn(struct pagevec *pvec,
void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
void *arg)
@@ -199,6 +202,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
struct pglist_data *pagepgdat = page_pgdat(page);
+ void *arg_orig = arg;
if (pagepgdat != pgdat) {
if (pgdat)
@@ -207,8 +211,22 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
spin_lock_irqsave(&pgdat->lru_lock, flags);
}
+ if (move_fn == __pagevec_lru_add_fn) {
+ struct list_head *entry = &page->lru;
+ unsigned long next = (unsigned long)entry->next;
+ unsigned long rotate = next & 2;
+
+ if (rotate) {
+ VM_BUG_ON(arg);
+
+ next = next & ~2;
+ entry->next = (struct list_head *)next;
+ arg = (void *)rotate;
+ }
+ }
lruvec = mem_cgroup_page_lruvec(page, pgdat);
(*move_fn)(page, lruvec, arg);
+ arg = arg_orig;
}
if (pgdat)
spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -475,6 +493,14 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
hpage_nr_pages(page));
count_vm_event(UNEVICTABLE_PGMLOCKED);
}
+
+ if (PageSwapBacked(page) && evictable) {
+ struct list_head *entry = &page->lru;
+ unsigned long next = (unsigned long)entry->next;
+
+ next = next | 2;
+ entry->next = (struct list_head *)next;
+ }
lru_cache_add(page);
}
@@ -927,6 +953,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
{
enum lru_list lru;
int was_unevictable = TestClearPageUnevictable(page);
+ unsigned long rotate = (unsigned long)arg;
VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -962,7 +989,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
if (page_evictable(page)) {
lru = page_lru(page);
update_page_reclaim_stat(lruvec, page_is_file_cache(page),
- PageActive(page));
+ PageActive(page) | rotate);
if (was_unevictable)
count_vm_event(UNEVICTABLE_PGRESCUED);
} else {
--
2.7.4
From: Joonsoo Kim <[email protected]>
This patch implements workingset detection for anonymous LRU.
All the infrastructure is implemented by the previous patches so this patch
just activates the workingset detection by installing/retrieving
the shadow entry.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/swap.h | 11 +++++++++--
mm/memory.c | 7 ++++++-
mm/shmem.c | 3 ++-
mm/swap_state.c | 31 ++++++++++++++++++++++++++-----
mm/vmscan.c | 7 +++++--
5 files changed, 48 insertions(+), 11 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0df8b3f..fb4772e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -408,7 +408,9 @@ extern struct address_space *swapper_spaces[];
extern unsigned long total_swapcache_pages(void);
extern void show_swap_cache_info(void);
extern int add_to_swap(struct page *page);
-extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
+extern void *get_shadow_from_swap_cache(swp_entry_t entry);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp, void **shadowp);
extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
extern void __delete_from_swap_cache(struct page *page,
swp_entry_t entry, void *shadow);
@@ -565,8 +567,13 @@ static inline int add_to_swap(struct page *page)
return 0;
}
+static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+ return NULL;
+}
+
static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, void **shadowp)
{
return -1;
}
diff --git a/mm/memory.c b/mm/memory.c
index 5f7813a..91a2097 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2925,10 +2925,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (page) {
+ void *shadow;
+
__SetPageLocked(page);
__SetPageSwapBacked(page);
set_page_private(page, entry.val);
- lru_cache_add_anon(page);
+ shadow = get_shadow_from_swap_cache(entry);
+ if (shadow)
+ workingset_refault(page, shadow);
+ lru_cache_add(page);
swap_readpage(page, true);
}
} else {
diff --git a/mm/shmem.c b/mm/shmem.c
index 8793e8c..c6663ad 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1370,7 +1370,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
list_add(&info->swaplist, &shmem_swaplist);
if (add_to_swap_cache(page, swap,
- __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN) == 0) {
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
+ NULL) == 0) {
spin_lock_irq(&info->lock);
shmem_recalc_inode(inode);
info->swapped++;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3fbbe45..7f7cb19 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -107,11 +107,24 @@ void show_swap_cache_info(void)
printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
}
+void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+ struct address_space *address_space = swap_address_space(entry);
+ pgoff_t idx = swp_offset(entry);
+ struct page *page;
+
+ page = find_get_entry(address_space, idx);
+ if (xa_is_value(page))
+ return page;
+ return NULL;
+}
+
/*
* add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
* but sets SwapCache flag and private instead of mapping and index.
*/
-int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
+int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp, void **shadowp)
{
struct address_space *address_space = swap_address_space(entry);
pgoff_t idx = swp_offset(entry);
@@ -137,8 +150,11 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
for (i = 0; i < nr; i++) {
VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
old = xas_load(&xas);
- if (xa_is_value(old))
+ if (xa_is_value(old)) {
nrexceptional++;
+ if (shadowp)
+ *shadowp = old;
+ }
set_page_private(page + i, entry.val + i);
xas_store(&xas, page);
xas_next(&xas);
@@ -226,7 +242,7 @@ int add_to_swap(struct page *page)
* Add it to the swap cache.
*/
err = add_to_swap_cache(page, entry,
- __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
+ __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
if (err)
/*
* add_to_swap_cache() doesn't return -EEXIST, so we can safely
@@ -380,6 +396,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct page *found_page = NULL, *new_page = NULL;
struct swap_info_struct *si;
int err;
+ void *shadow;
*new_page_allocated = false;
do {
@@ -435,11 +452,15 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
/* May fail (-ENOMEM) if XArray node allocation failed. */
__SetPageLocked(new_page);
__SetPageSwapBacked(new_page);
- err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
+ shadow = NULL;
+ err = add_to_swap_cache(new_page, entry,
+ gfp_mask & GFP_KERNEL, &shadow);
if (likely(!err)) {
/* Initiate read into locked page */
SetPageWorkingset(new_page);
- lru_cache_add_anon(new_page);
+ if (shadow)
+ workingset_refault(new_page, shadow);
+ lru_cache_add(new_page);
*new_page_allocated = true;
return new_page;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9871861..b37cc26 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -867,6 +867,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
{
unsigned long flags;
int refcount;
+ void *shadow = NULL;
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
@@ -909,12 +910,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
- __delete_from_swap_cache(page, swap, NULL);
+ if (reclaimed && !mapping_exiting(mapping))
+ shadow = workingset_eviction(page, target_memcg);
+ __delete_from_swap_cache(page, swap, shadow);
xa_unlock_irqrestore(&mapping->i_pages, flags);
put_swap_page(page, swap);
} else {
void (*freepage)(struct page *);
- void *shadow = NULL;
freepage = mapping->a_ops->freepage;
/*
@@ -1476,6 +1478,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
SetPageActive(page);
stat->nr_activate[type] += nr_pages;
count_memcg_page_event(page, PGACTIVATE);
+ workingset_activation(page);
}
keep_locked:
unlock_page(page);
--
2.7.4
From: Joonsoo Kim <[email protected]>
In the following patch, workingset detection is implemented for the
swap cache. Swap cache's node is usually allocated by kswapd and it
isn't charged by kmemcg since it is from the kernel thread. So the swap
cache's shadow node is managed by the node list of the list_lru rather
than the memcg specific one.
If counting the shadow node on the root memcg happens to reclaim the slab
object, the shadow node count returns the number of the shadow node on
the node list of the list_lru since root memcg has the kmem_cache_id, -1.
However, the size of pages on the LRU is calculated by using the specific
memcg, so mismatch happens. This causes the number of shadow node not to
be increased to the enough size and, therefore, workingset detection
cannot work correctly. This patch fixes this bug by checking if the memcg
is the root memcg or not. If it is the root memcg, instead of using
the memcg-specific LRU, the system-wide LRU is used to calculate proper
size of the shadow node so that the number of the shadow node can grow
as expected.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/workingset.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 5fb8f85..a9f474a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -468,7 +468,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
* PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
*/
#ifdef CONFIG_MEMCG
- if (sc->memcg) {
+ /*
+ * Kernel allocation on root memcg isn't regarded as allocation of
+ * specific memcg. So, if sc->memcg is the root memcg, we need to
+ * use the count for the node rather than one for the specific
+ * memcg.
+ */
+ if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) {
struct lruvec *lruvec;
int i;
--
2.7.4
From: Joonsoo Kim <[email protected]>
Now, workingset detection is implemented for anonymous LRU.
We don't have to worry about the misfound for workingset due to
the ratio of active/inactive. Let's restore the ratio.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b37cc26..3d44e32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2218,7 +2218,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
gb = (inactive + active) >> (30 - PAGE_SHIFT);
- if (gb && is_file_lru(inactive_lru))
+ if (gb)
inactive_ratio = int_sqrt(10 * gb);
else
inactive_ratio = 1;
--
2.7.4
On Tue, Mar 17, 2020 at 02:41:49PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Current implementation of LRU management for anonymous page has some
> problems. Most important one is that it doesn't protect the workingset,
> that is, pages on the active LRU list. Although, this problem will be
> fixed in the following patchset, the preparation is required and
> this patch does it.
>
> What following patchset does is to restore workingset protection. In this
> case, newly created or swap-in pages are started their lifetime on the
> inactive list. If inactive list is too small, there is not enough chance
> to be referenced and the page cannot become the workingset.
>
> In order to provide enough chance to the newly anonymous pages, this patch
> makes active/inactive LRU ratio as 1:1.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Tue, Mar 17, 2020 at 02:41:50PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> In current implementation, newly created or swap-in anonymous page
> is started on active list. Growing active list results in rebalancing
> active/inactive list so old pages on active list are demoted to inactive
> list. Hence, the page on active list isn't protected at all.
>
> Following is an example of this situation.
>
> Assume that 50 hot pages on active list. Numbers denote the number of
> pages on active/inactive list (active | inactive).
>
> 1. 50 hot pages on active list
> 50(h) | 0
>
> 2. workload: 50 newly created (used-once) pages
> 50(uo) | 50(h)
>
> 3. workload: another 50 newly created (used-once) pages
> 50(uo) | 50(uo), swap-out 50(h)
>
> This patch tries to fix this issue.
> Like as file LRU, newly created or swap-in anonymous pages will be
> inserted to the inactive list. They are promoted to active list if
> enough reference happens. This simple modification changes the above
> example as following.
>
> 1. 50 hot pages on active list
> 50(h) | 0
>
> 2. workload: 50 newly created (used-once) pages
> 50(h) | 50(uo)
>
> 3. workload: another 50 newly created (used-once) pages
> 50(h) | 50(uo), swap-out 50(uo)
>
> As you can see, hot pages on active list would be protected.
>
> Note that, this implementation has a drawback that the page cannot
> be promoted and will be swapped-out if re-access interval is greater than
> the size of inactive list but less than the size of total(active+inactive).
> To solve this potential issue, following patch will apply workingset
> detection that is applied to file LRU some day before.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
> -void lru_cache_add_active_or_unevictable(struct page *page,
> +void lru_cache_add_inactive_or_unevictable(struct page *page,
> struct vm_area_struct *vma)
> {
> + bool evictable;
> +
> VM_BUG_ON_PAGE(PageLRU(page), page);
>
> - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> - SetPageActive(page);
> - else if (!TestSetPageMlocked(page)) {
> + evictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED;
> + if (!evictable && !TestSetPageMlocked(page)) {
Minor point, but in case there is a v4: `unevictable` instead of
!evictable would be a bit easier to read, match the function name,
PageUnevictable etc.
On Tue, Mar 17, 2020 at 02:41:51PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> In the following patch, workingset detection will be applied to
> anonymous LRU. To prepare it, this patch adds some code to
> distinguish/handle the both LRUs.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
This looks all correct to me, but I would ask for some style
fixes. With those applied, please feel free to add
Acked-by: Johannes Weiner <[email protected]>
> @@ -304,10 +308,10 @@ enum lruvec_flags {
> struct lruvec {
> struct list_head lists[NR_LRU_LISTS];
> struct zone_reclaim_stat reclaim_stat;
> - /* Evictions & activations on the inactive file list */
> - atomic_long_t inactive_age;
> + /* Evictions & activations on the inactive list */
> + atomic_long_t inactive_age[2];
Can you please add anon=0, file=1 to the comment?
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1431,10 +1431,14 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
> seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGMAJFAULT),
> memcg_events(memcg, PGMAJFAULT));
>
> - seq_buf_printf(&s, "workingset_refault %lu\n",
> - memcg_page_state(memcg, WORKINGSET_REFAULT));
> - seq_buf_printf(&s, "workingset_activate %lu\n",
> - memcg_page_state(memcg, WORKINGSET_ACTIVATE));
> + seq_buf_printf(&s, "workingset_refault_anon %lu\n",
> + memcg_page_state(memcg, WORKINGSET_REFAULT_ANON));
> + seq_buf_printf(&s, "workingset_refault_file %lu\n",
> + memcg_page_state(memcg, WORKINGSET_REFAULT_FILE));
> + seq_buf_printf(&s, "workingset_activate_anon %lu\n",
> + memcg_page_state(memcg, WORKINGSET_ACTIVATE_ANON));
> + seq_buf_printf(&s, "workingset_activate_file %lu\n",
> + memcg_page_state(memcg, WORKINGSET_ACTIVATE_FILE));
> seq_buf_printf(&s, "workingset_nodereclaim %lu\n",
> memcg_page_state(memcg, WORKINGSET_NODERECLAIM));
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c932141..0493c25 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2716,7 +2716,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> if (!sc->force_deactivate) {
> unsigned long refaults;
>
> - if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
> + refaults = lruvec_page_state(target_lruvec,
> + WORKINGSET_ACTIVATE_ANON);
> + if (refaults != target_lruvec->refaults[0] ||
> + inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
> sc->may_deactivate |= DEACTIVATE_ANON;
> else
> sc->may_deactivate &= ~DEACTIVATE_ANON;
> @@ -2727,8 +2730,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> * rid of any stale active pages quickly.
> */
> refaults = lruvec_page_state(target_lruvec,
> - WORKINGSET_ACTIVATE);
> - if (refaults != target_lruvec->refaults ||
> + WORKINGSET_ACTIVATE_FILE);
> + if (refaults != target_lruvec->refaults[1] ||
> inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
> sc->may_deactivate |= DEACTIVATE_FILE;
> else
> @@ -3007,8 +3010,10 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
> unsigned long refaults;
>
> target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
> - refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
> - target_lruvec->refaults = refaults;
> + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
> + target_lruvec->refaults[0] = refaults;
> + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_FILE);
> + target_lruvec->refaults[1] = refaults;
> }
>
> /*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 78d5337..3cdf8e9 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1146,8 +1146,10 @@ const char * const vmstat_text[] = {
> "nr_isolated_anon",
> "nr_isolated_file",
> "workingset_nodes",
> - "workingset_refault",
> - "workingset_activate",
> + "workingset_refault_anon",
> + "workingset_refault_file",
> + "workingset_activate_anon",
> + "workingset_activate_file",
> "workingset_restore",
> "workingset_nodereclaim",
> "nr_anon_pages",
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 474186b..5fb8f85 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -15,6 +15,7 @@
> #include <linux/dax.h>
> #include <linux/fs.h>
> #include <linux/mm.h>
> +#include <linux/mm_inline.h>
Please keep the line-length sorting of the includes intact. If that's
not possible, please separate the mm_inline.h include with an empty
line into its own section.
This makes it much easier to scan for specific headers, helps avoid
duplicate includes, and is generally just easier on the eyes.
> @@ -156,7 +157,7 @@
> *
> * Implementation
> *
> - * For each node's file LRU lists, a counter for inactive evictions
> + * For each node's anon/file LRU lists, a counter for inactive evictions
> * and activations is maintained (node->inactive_age).
> *
> * On eviction, a snapshot of this counter (along with some bits to
> @@ -213,7 +214,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
> *workingsetp = workingset;
> }
>
> -static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
> +static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat,
> + int is_file)
bool file, please.
> {
> /*
> * Reclaiming a cgroup means reclaiming all its children in a
> @@ -230,7 +232,7 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
> struct lruvec *lruvec;
>
> lruvec = mem_cgroup_lruvec(memcg, pgdat);
> - atomic_long_inc(&lruvec->inactive_age);
> + atomic_long_inc(&lruvec->inactive_age[is_file]);
> } while (memcg && (memcg = parent_mem_cgroup(memcg)));
> }
>
> @@ -248,18 +250,19 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
> unsigned long eviction;
> struct lruvec *lruvec;
> int memcgid;
> + int is_file = page_is_file_cache(page);
Please keep the line-length ordering intact here as well.
> /* Page is fully exclusive and pins page->mem_cgroup */
> VM_BUG_ON_PAGE(PageLRU(page), page);
> VM_BUG_ON_PAGE(page_count(page), page);
> VM_BUG_ON_PAGE(!PageLocked(page), page);
>
> - advance_inactive_age(page_memcg(page), pgdat);
> + advance_inactive_age(page_memcg(page), pgdat, is_file);
>
> lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
> /* XXX: target_memcg can be NULL, go through lruvec */
> memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
> - eviction = atomic_long_read(&lruvec->inactive_age);
> + eviction = atomic_long_read(&lruvec->inactive_age[is_file]);
> return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
> }
>
> @@ -278,13 +281,16 @@ void workingset_refault(struct page *page, void *shadow)
> struct lruvec *eviction_lruvec;
> unsigned long refault_distance;
> struct pglist_data *pgdat;
> - unsigned long active_file;
> + unsigned long active;
> struct mem_cgroup *memcg;
> unsigned long eviction;
> struct lruvec *lruvec;
> unsigned long refault;
> bool workingset;
> int memcgid;
> + int is_file = page_is_file_cache(page);
> + enum lru_list active_lru = page_lru_base_type(page) + LRU_ACTIVE;
> + enum node_stat_item workingset_stat;
Please use bool file and keep line-length ordering.
The workingset_stat variable doesn't seem helpful, see below:
> unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
>
> @@ -309,8 +315,8 @@ void workingset_refault(struct page *page, void *shadow)
> if (!mem_cgroup_disabled() && !eviction_memcg)
> goto out;
> eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
> - refault = atomic_long_read(&eviction_lruvec->inactive_age);
> - active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
> + refault = atomic_long_read(&eviction_lruvec->inactive_age[is_file]);
> + active = lruvec_page_state(eviction_lruvec, active_lru);
>
> /*
> * Calculate the refault distance
> @@ -341,19 +347,21 @@ void workingset_refault(struct page *page, void *shadow)
> memcg = page_memcg(page);
> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
> - inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
> + workingset_stat = WORKINGSET_REFAULT_BASE + is_file;
> + inc_lruvec_state(lruvec, workingset_stat);
inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file);
> /*
> * Compare the distance to the existing workingset size. We
> * don't act on pages that couldn't stay resident even if all
> * the memory was available to the page cache.
> */
> - if (refault_distance > active_file)
> + if (refault_distance > active)
> goto out;
>
> SetPageActive(page);
> - advance_inactive_age(memcg, pgdat);
> - inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
> + advance_inactive_age(memcg, pgdat, is_file);
> + workingset_stat = WORKINGSET_ACTIVATE_BASE + is_file;
> + inc_lruvec_state(lruvec, workingset_stat);
inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file);
> /* Page was active prior to eviction */
> if (workingset) {
> @@ -371,6 +379,7 @@ void workingset_refault(struct page *page, void *shadow)
> void workingset_activation(struct page *page)
> {
> struct mem_cgroup *memcg;
> + int is_file = page_is_file_cache(page);
bool file & length-ordering
On Tue, Mar 17, 2020 at 02:41:52PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Swapcache doesn't handle the value since there is no case using the value.
> In the following patch, workingset detection for anonymous page will be
> implemented and it stores the value into the swapcache. So, we need to
> handle it and this patch implement handling.
"value" is too generic, it's not quite clear what this refers to
here. "Exceptional entries" or "shadow entries" would be better.
> @@ -155,24 +163,33 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
> * This must be called only on pages that have
> * been verified to be in the swap cache.
> */
> -void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
> +void __delete_from_swap_cache(struct page *page,
> + swp_entry_t entry, void *shadow)
> {
> struct address_space *address_space = swap_address_space(entry);
> int i, nr = hpage_nr_pages(page);
> pgoff_t idx = swp_offset(entry);
> XA_STATE(xas, &address_space->i_pages, idx);
>
> + /* Do not apply workingset detection for the hugh page */
> + if (nr > 1)
> + shadow = NULL;
Hm, why is that? Should that be an XXX/TODO item? The comment should
explain the reason, not necessarily what the code is doing.
Also, s/hugh/huge/
The rest of the patch looks straight-forward to me.
On Tue, Mar 17, 2020 at 02:41:53PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> In the following patch, workingset detection is implemented for the
> swap cache. Swap cache's node is usually allocated by kswapd and it
> isn't charged by kmemcg since it is from the kernel thread. So the swap
> cache's shadow node is managed by the node list of the list_lru rather
> than the memcg specific one.
>
> If counting the shadow node on the root memcg happens to reclaim the slab
> object, the shadow node count returns the number of the shadow node on
> the node list of the list_lru since root memcg has the kmem_cache_id, -1.
>
> However, the size of pages on the LRU is calculated by using the specific
> memcg, so mismatch happens. This causes the number of shadow node not to
> be increased to the enough size and, therefore, workingset detection
> cannot work correctly. This patch fixes this bug by checking if the memcg
> is the root memcg or not. If it is the root memcg, instead of using
> the memcg-specific LRU, the system-wide LRU is used to calculate proper
> size of the shadow node so that the number of the shadow node can grow
> as expected.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
> ---
> mm/workingset.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 5fb8f85..a9f474a 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -468,7 +468,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
> */
> #ifdef CONFIG_MEMCG
> - if (sc->memcg) {
> + /*
> + * Kernel allocation on root memcg isn't regarded as allocation of
> + * specific memcg. So, if sc->memcg is the root memcg, we need to
> + * use the count for the node rather than one for the specific
> + * memcg.
> + */
> + if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) {
This is no good, unfortunately.
It allows the root cgroup's shadows to grow way too large. Consider a
large memory system where several workloads run in containers and only
some host software runs in the root, yet that tiny root group will
grow shadow entries in proportion to the entire RAM.
IMO, we have some choices here:
1. We say the swapcache is a shared system facility and its memory is
not accounted to anyone. In that case, we should either
1a. Reclaim them to a fixed threshold, regardless of cgroup, or
1b. Not reclaim them at all. Or
2. We account all nodes to the groups for which they are allocated.
Essentially like this:
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8e7ce9a9bc5e..d0d0dcc357fb 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -125,6 +125,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
page_ref_add(page, nr);
SetPageSwapCache(page);
+ memalloc_use_memcg(page_memcg(page));
do {
xas_lock_irq(&xas);
xas_create_range(&xas);
@@ -142,6 +143,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
unlock:
xas_unlock_irq(&xas);
} while (xas_nomem(&xas, gfp));
+ memalloc_unuse_memcg();
if (!xas_error(&xas))
return 0;
@@ -605,7 +607,8 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
return -ENOMEM;
for (i = 0; i < nr; i++) {
space = spaces + i;
- xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
+ xa_init_flags(&space->i_pages,
+ XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
atomic_set(&space->i_mmap_writable, 0);
space->a_ops = &swap_aops;
/* swap cache doesn't use writeback related tags */
(A reclaimer has PF_MEMALLOC set, so we'll bypass the limit when
recursing into charging the node.)
I'm leaning more toward 1b, actually. The shadow shrinker was written
because the combined address space of files on the filesystem can
easily be in the terabytes, and practically unbounded with sparse
files. The shadow shrinker is there to keep users from DoSing the
system with shadow entries for files.
However, the swap address space is bounded by a privileged user. And
the size is usually in the GB range. On my system, radix_tree_node is
~583 bytes, so a for a 16G swapfile, the swapcache xarray should max
out below 40M (36M worth of leaf nodes, plus some intermediate nodes).
It doesn't seem worth messing with the shrinker at all for these.
On Tue, Mar 17, 2020 at 02:41:54PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> When implementing workingset detection for anonymous page, I found
> some swapcache pages with NULL memcg. From the code reading, I found
> two reasons.
>
> One is the case that swap-in readahead happens. The other is the
> corner case related to the shmem cache. These two problems should be
> fixed, but, it's not straight-forward to fix. For example, when swap-off,
> all swapped-out pages are read into swapcache. In this case, who's the
> owner of the swapcache page?
>
> Since this problem doesn't look trivial, I decide to leave the issue and
> handles this corner case on the place where the error occurs.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
It wouldn't be hard to find out who owns this page. The code in
mem_cgroup_try_charge() is only a few lines:
swp_entry_t ent = { .val = page_private(page), };
unsigned short id = lookup_swap_cgroup_id(ent);
rcu_read_lock();
memcg = mem_cgroup_from_id(id);
if (memcg && !css_tryget_online(&memcg->css))
memcg = NULL;
rcu_read_unlock();
THAT BEING SAID, I don't think we actually *want* to know the original
cgroup for readahead pages. Because before they are accessed and
charged to the original owner, the pages are sitting on the root
cgroup LRU list and follow the root group's aging speed and LRU order.
Eviction and refault tracking is about the LRU that hosts the pages.
So IMO your patch is much less of a hack than you might think.
> diff --git a/mm/workingset.c b/mm/workingset.c
> index a9f474a..8d2e83a 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -257,6 +257,10 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
> VM_BUG_ON_PAGE(page_count(page), page);
> VM_BUG_ON_PAGE(!PageLocked(page), page);
>
> + /* page_memcg() can be NULL if swap-in readahead happens */
> + if (!page_memcg(page))
> + return NULL;
> +
> advance_inactive_age(page_memcg(page), pgdat, is_file);
>
> lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
This means a readahead page that hasn't been accessed will actively
not be tracked as an eviction and later as a refault.
I think that's the right thing to do, but I would expand the comment:
/*
* A page can be without a cgroup here when it was brought in by swap
* readahead and nobody has touched it since.
*
* The idea behind the workingset code is to tell on page fault time
* whether pages have been previously used or not. Since this page
* hasn't been used, don't store a shadow entry for it; when it later
* faults back in, we treat it as the new page that it is.
*/
2020년 3월 19일 (목) 오전 2:52, Johannes Weiner <[email protected]>님이 작성:
>
> On Tue, Mar 17, 2020 at 02:41:50PM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > In current implementation, newly created or swap-in anonymous page
> > is started on active list. Growing active list results in rebalancing
> > active/inactive list so old pages on active list are demoted to inactive
> > list. Hence, the page on active list isn't protected at all.
> >
> > Following is an example of this situation.
> >
> > Assume that 50 hot pages on active list. Numbers denote the number of
> > pages on active/inactive list (active | inactive).
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(uo) | 50(h)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(uo) | 50(uo), swap-out 50(h)
> >
> > This patch tries to fix this issue.
> > Like as file LRU, newly created or swap-in anonymous pages will be
> > inserted to the inactive list. They are promoted to active list if
> > enough reference happens. This simple modification changes the above
> > example as following.
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(h) | 50(uo)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(h) | 50(uo), swap-out 50(uo)
> >
> > As you can see, hot pages on active list would be protected.
> >
> > Note that, this implementation has a drawback that the page cannot
> > be promoted and will be swapped-out if re-access interval is greater than
> > the size of inactive list but less than the size of total(active+inactive).
> > To solve this potential issue, following patch will apply workingset
> > detection that is applied to file LRU some day before.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> Acked-by: Johannes Weiner <[email protected]>
>
> > -void lru_cache_add_active_or_unevictable(struct page *page,
> > +void lru_cache_add_inactive_or_unevictable(struct page *page,
> > struct vm_area_struct *vma)
> > {
> > + bool evictable;
> > +
> > VM_BUG_ON_PAGE(PageLRU(page), page);
> >
> > - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> > - SetPageActive(page);
> > - else if (!TestSetPageMlocked(page)) {
> > + evictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED;
> > + if (!evictable && !TestSetPageMlocked(page)) {
>
> Minor point, but in case there is a v4: `unevictable` instead of
> !evictable would be a bit easier to read, match the function name,
> PageUnevictable etc.
Okay. Looks like v4 is needed so I will change it as you said.
Thanks.
2020년 3월 19일 (목) 오전 3:06, Johannes Weiner <[email protected]>님이 작성:
>
> On Tue, Mar 17, 2020 at 02:41:51PM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > In the following patch, workingset detection will be applied to
> > anonymous LRU. To prepare it, this patch adds some code to
> > distinguish/handle the both LRUs.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> This looks all correct to me, but I would ask for some style
> fixes. With those applied, please feel free to add
>
> Acked-by: Johannes Weiner <[email protected]>
Thanks! I will apply all you commented at v4.
Thanks.
2020년 3월 19일 (목) 오전 3:33, Johannes Weiner <[email protected]>님이 작성:
>
> On Tue, Mar 17, 2020 at 02:41:52PM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > Swapcache doesn't handle the value since there is no case using the value.
> > In the following patch, workingset detection for anonymous page will be
> > implemented and it stores the value into the swapcache. So, we need to
> > handle it and this patch implement handling.
>
> "value" is too generic, it's not quite clear what this refers to
> here. "Exceptional entries" or "shadow entries" would be better.
Okay. Will change it.
> > @@ -155,24 +163,33 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
> > * This must be called only on pages that have
> > * been verified to be in the swap cache.
> > */
> > -void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
> > +void __delete_from_swap_cache(struct page *page,
> > + swp_entry_t entry, void *shadow)
> > {
> > struct address_space *address_space = swap_address_space(entry);
> > int i, nr = hpage_nr_pages(page);
> > pgoff_t idx = swp_offset(entry);
> > XA_STATE(xas, &address_space->i_pages, idx);
> >
> > + /* Do not apply workingset detection for the hugh page */
> > + if (nr > 1)
> > + shadow = NULL;
>
> Hm, why is that? Should that be an XXX/TODO item? The comment should
> explain the reason, not necessarily what the code is doing.
It was my TODO. Now, I check the code and find that there is no blocker
for the huge page support. So, I will remove this code and enable the
workingset detection even for the huge page.
> Also, s/hugh/huge/
Okay.
> The rest of the patch looks straight-forward to me.
Thanks.
2020년 3월 19일 (목) 오전 4:18, Johannes Weiner <[email protected]>님이 작성:
>
> On Tue, Mar 17, 2020 at 02:41:53PM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > In the following patch, workingset detection is implemented for the
> > swap cache. Swap cache's node is usually allocated by kswapd and it
> > isn't charged by kmemcg since it is from the kernel thread. So the swap
> > cache's shadow node is managed by the node list of the list_lru rather
> > than the memcg specific one.
> >
> > If counting the shadow node on the root memcg happens to reclaim the slab
> > object, the shadow node count returns the number of the shadow node on
> > the node list of the list_lru since root memcg has the kmem_cache_id, -1.
> >
> > However, the size of pages on the LRU is calculated by using the specific
> > memcg, so mismatch happens. This causes the number of shadow node not to
> > be increased to the enough size and, therefore, workingset detection
> > cannot work correctly. This patch fixes this bug by checking if the memcg
> > is the root memcg or not. If it is the root memcg, instead of using
> > the memcg-specific LRU, the system-wide LRU is used to calculate proper
> > size of the shadow node so that the number of the shadow node can grow
> > as expected.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
> > ---
> > mm/workingset.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 5fb8f85..a9f474a 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -468,7 +468,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> > * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
> > */
> > #ifdef CONFIG_MEMCG
> > - if (sc->memcg) {
> > + /*
> > + * Kernel allocation on root memcg isn't regarded as allocation of
> > + * specific memcg. So, if sc->memcg is the root memcg, we need to
> > + * use the count for the node rather than one for the specific
> > + * memcg.
> > + */
> > + if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) {
>
> This is no good, unfortunately.
>
> It allows the root cgroup's shadows to grow way too large. Consider a
> large memory system where several workloads run in containers and only
> some host software runs in the root, yet that tiny root group will
> grow shadow entries in proportion to the entire RAM.
Okay.
> IMO, we have some choices here:
>
> 1. We say the swapcache is a shared system facility and its memory is
> not accounted to anyone. In that case, we should either
> 1a. Reclaim them to a fixed threshold, regardless of cgroup, or
> 1b. Not reclaim them at all. Or
> 2. We account all nodes to the groups for which they are allocated.
> Essentially like this:
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 8e7ce9a9bc5e..d0d0dcc357fb 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -125,6 +125,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
> page_ref_add(page, nr);
> SetPageSwapCache(page);
>
> + memalloc_use_memcg(page_memcg(page));
> do {
> xas_lock_irq(&xas);
> xas_create_range(&xas);
> @@ -142,6 +143,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
> unlock:
> xas_unlock_irq(&xas);
> } while (xas_nomem(&xas, gfp));
> + memalloc_unuse_memcg();
>
> if (!xas_error(&xas))
> return 0;
> @@ -605,7 +607,8 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
> return -ENOMEM;
> for (i = 0; i < nr; i++) {
> space = spaces + i;
> - xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
> + xa_init_flags(&space->i_pages,
> + XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
> atomic_set(&space->i_mmap_writable, 0);
> space->a_ops = &swap_aops;
> /* swap cache doesn't use writeback related tags */
>
> (A reclaimer has PF_MEMALLOC set, so we'll bypass the limit when
> recursing into charging the node.)
>
> I'm leaning more toward 1b, actually. The shadow shrinker was written
> because the combined address space of files on the filesystem can
> easily be in the terabytes, and practically unbounded with sparse
> files. The shadow shrinker is there to keep users from DoSing the
> system with shadow entries for files.
>
> However, the swap address space is bounded by a privileged user. And
> the size is usually in the GB range. On my system, radix_tree_node is
> ~583 bytes, so a for a 16G swapfile, the swapcache xarray should max
> out below 40M (36M worth of leaf nodes, plus some intermediate nodes).
>
> It doesn't seem worth messing with the shrinker at all for these.
40M / 16G, 0.25% of the amount of the used swap looks okay to me.
I will rework the patch on that way.
Thanks.
2020년 3월 19일 (목) 오전 4:59, Johannes Weiner <[email protected]>님이 작성:
>
> On Tue, Mar 17, 2020 at 02:41:54PM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > When implementing workingset detection for anonymous page, I found
> > some swapcache pages with NULL memcg. From the code reading, I found
> > two reasons.
> >
> > One is the case that swap-in readahead happens. The other is the
> > corner case related to the shmem cache. These two problems should be
> > fixed, but, it's not straight-forward to fix. For example, when swap-off,
> > all swapped-out pages are read into swapcache. In this case, who's the
> > owner of the swapcache page?
> >
> > Since this problem doesn't look trivial, I decide to leave the issue and
> > handles this corner case on the place where the error occurs.
> >
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> It wouldn't be hard to find out who owns this page. The code in
> mem_cgroup_try_charge() is only a few lines:
lookup_swap_cgroup_id() uses additional memory and is only
usable if CONFIG_MEMCG_SWAP is on.
> swp_entry_t ent = { .val = page_private(page), };
> unsigned short id = lookup_swap_cgroup_id(ent);
>
> rcu_read_lock();
> memcg = mem_cgroup_from_id(id);
> if (memcg && !css_tryget_online(&memcg->css))
> memcg = NULL;
> rcu_read_unlock();
>
> THAT BEING SAID, I don't think we actually *want* to know the original
> cgroup for readahead pages. Because before they are accessed and
> charged to the original owner, the pages are sitting on the root
> cgroup LRU list and follow the root group's aging speed and LRU order.
Okay. Sound reasonable.
> Eviction and refault tracking is about the LRU that hosts the pages.
>
> So IMO your patch is much less of a hack than you might think.
Good!
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index a9f474a..8d2e83a 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -257,6 +257,10 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
> > VM_BUG_ON_PAGE(page_count(page), page);
> > VM_BUG_ON_PAGE(!PageLocked(page), page);
> >
> > + /* page_memcg() can be NULL if swap-in readahead happens */
> > + if (!page_memcg(page))
> > + return NULL;
> > +
> > advance_inactive_age(page_memcg(page), pgdat, is_file);
> >
> > lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
>
> This means a readahead page that hasn't been accessed will actively
> not be tracked as an eviction and later as a refault.
>
> I think that's the right thing to do, but I would expand the comment:
Okay. I will add the following comment.
Thanks.
> /*
> * A page can be without a cgroup here when it was brought in by swap
> * readahead and nobody has touched it since.
> *
> * The idea behind the workingset code is to tell on page fault time
> * whether pages have been previously used or not. Since this page
> * hasn't been used, don't store a shadow entry for it; when it later
> * faults back in, we treat it as the new page that it is.
> */