From: Joonsoo Kim <[email protected]>
Hello,
This patchset implements workingset protection and detection on
the anonymous LRU list.
* Changes on v2
- fix a critical bug that used an out-of-index lru list in
workingset_refault()
- fix a bug that reused the rotate value of the previous page
* SUBJECT
workingset protection
* PROBLEM
In the current implementation, a newly created or swapped-in anonymous page
starts its life on the active list. Growing the active list results in
rebalancing the active/inactive lists, so old pages on the active list are
demoted to the inactive list. Hence, hot pages on the active list aren't
protected at all.
The following is an example of this situation.
Assume that there are 50 hot pages on the active list and that the system
can hold 100 pages in total. Numbers denote the number of pages on the
active/inactive lists (active | inactive). (h) stands for hot pages and
(uo) stands for used-once pages.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
As we can see, the hot pages are swapped out, which will cause swap-ins later.
* SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection. As with the file LRU list, a newly created or swapped-in
anonymous page starts on the inactive list. Also, as with the file LRU list,
the page is promoted once it is referenced enough. This simple
modification changes the above example as follows.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
The hot pages remain on the active list. :)
* EXPERIMENT
I tested this scenario on my test bed and confirmed that this problem
occurs with the current implementation. I also checked that it is fixed by
this patchset.
I did another test to show the performance effect of this patchset.
- ebizzy (with a modified random function)
ebizzy is a test program in which the main thread allocates a lot of memory
and child threads access it randomly for a given time. Swap-in/out
happens if the allocated memory is larger than the system memory.
A random function following a Zipf distribution is used to create hot/cold
memory. The hot/cold ratio is controlled by its parameter: if the parameter
is high, hot memory is accessed far more often than cold memory; if the
parameter is low, the number of accesses to each region is similar. I used
various parameters in order to show the effect of the patchset on workloads
with various hot/cold ratios.
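For reference, the following is a minimal, self-contained sketch of how such
a Zipf-distributed chunk selector could look. It is not the actual modified
ebizzy code; the function name and the linear inverse-CDF scan are purely
illustrative, with the exponent s playing the role of the parameter above.

#include <stdlib.h>
#include <math.h>

/*
 * Illustrative only: pick an index in [0, n) with Zipf-like weights
 * 1/i^s, so low indexes ("hot" chunks) are chosen much more often when
 * s is large, and almost uniformly when s is small.
 */
static int zipf_pick(int n, double s)
{
	double norm = 0.0, target, sum = 0.0;
	int i;

	for (i = 1; i <= n; i++)
		norm += 1.0 / pow(i, s);

	target = ((double)rand() / RAND_MAX) * norm;
	for (i = 1; i <= n; i++) {
		sum += 1.0 / pow(i, s);
		if (sum >= target)
			return i - 1;
	}
	return n - 1;
}

In a real benchmark the normalization constant would be precomputed once;
only the shape of the distribution matters for the discussion here.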
My test setup is a virtual machine with 8 CPUs and 1024MB of memory.
The result format is as follows.
Parameter 0.1 ... 1.3
Allocated memory size
Throughput for base (larger is better)
Throughput for patchset (larger is better)
Improvement (larger is better)
* single thread
0.1 0.3 0.5 0.7 0.9 1.1 1.3
<512M>
7009.0 7372.0 7774.0 8523.0 9569.0 10724.0 11936.0
6973.0 7342.0 7745.0 8576.0 9441.0 10730.0 12033.0
-0.01 -0.0 -0.0 0.01 -0.01 0.0 0.01
<768M>
915.0 1039.0 1275.0 1687.0 2328.0 3486.0 5445.0
920.0 1037.0 1238.0 1689.0 2384.0 3638.0 5381.0
0.01 -0.0 -0.03 0.0 0.02 0.04 -0.01
<1024M>
425.0 471.0 539.0 753.0 1183.0 2130.0 3839.0
414.0 468.0 553.0 770.0 1242.0 2187.0 3932.0
-0.03 -0.01 0.03 0.02 0.05 0.03 0.02
<1280M>
320.0 346.0 410.0 556.0 871.0 1654.0 3298.0
316.0 346.0 411.0 550.0 892.0 1652.0 3293.0
-0.01 0.0 0.0 -0.01 0.02 -0.0 -0.0
<1536M>
273.0 290.0 341.0 458.0 733.0 1381.0 2925.0
271.0 293.0 344.0 462.0 740.0 1398.0 2969.0
-0.01 0.01 0.01 0.01 0.01 0.01 0.02
<2048M>
77.0 79.0 95.0 147.0 276.0 690.0 1816.0
91.0 94.0 115.0 170.0 321.0 770.0 2018.0
0.18 0.19 0.21 0.16 0.16 0.12 0.11
* multi thread (8)
0.1 0.3 0.5 0.7 0.9 1.1 1.3
<512M>
29083.0 29648.0 30145.0 31668.0 33964.0 38414.0 43707.0
29238.0 29701.0 30301.0 31328.0 33809.0 37991.0 43667.0
0.01 0.0 0.01 -0.01 -0.0 -0.01 -0.0
<768M>
3332.0 3699.0 4673.0 5830.0 8307.0 12969.0 17665.0
3579.0 3992.0 4432.0 6111.0 8699.0 12604.0 18061.0
0.07 0.08 -0.05 0.05 0.05 -0.03 0.02
<1024M>
1921.0 2141.0 2484.0 3296.0 5391.0 8227.0 14574.0
1989.0 2155.0 2609.0 3565.0 5463.0 8170.0 15642.0
0.04 0.01 0.05 0.08 0.01 -0.01 0.07
<1280M>
1524.0 1625.0 1931.0 2581.0 4155.0 6959.0 12443.0
1560.0 1707.0 2016.0 2714.0 4262.0 7518.0 13910.0
0.02 0.05 0.04 0.05 0.03 0.08 0.12
<1536M>
1303.0 1399.0 1550.0 2137.0 3469.0 6712.0 12944.0
1356.0 1465.0 1701.0 2237.0 3583.0 6830.0 13580.0
0.04 0.05 0.1 0.05 0.03 0.02 0.05
<2048M>
172.0 184.0 215.0 289.0 514.0 1318.0 4153.0
175.0 190.0 225.0 329.0 606.0 1585.0 5170.0
0.02 0.03 0.05 0.14 0.18 0.2 0.24
As we can see, the patched kernel gets better results as the allocated
memory grows. The maximum improvement is 21% for the single-threaded test
and 24% for the multi-threaded test.
* SUBJECT
workingset detection
* PROBLEM
The later part of the patchset implements workingset detection for
the anonymous LRU list. There is a corner case in which workingset protection
alone could cause thrashing. If we can avoid that thrashing with workingset
detection, we get better performance.
The following is an example of thrashing due to workingset protection.
(wh) stands for will-be-hot pages.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (will be hot) pages
50(h) | 50(wh)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
4. workload: 50 (will be hot) pages
50(h) | 50(wh), swap-in 50(wh)
5. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
6. repeat 4, 5
Without workingset detection, this kind of workload can never be promoted
and thrashing continues forever.
* SOLUTION
Therefore, this patchset implements workingset detection.
All the infrastructure for workingset detecion is already implemented,
so there is not much work to do. First, extend workingset detection
code to deal with the anonymous LRU list. Then, make swap cache handles
the exceptional value for the shadow entry. Lastly, install/retrieve
the shadow value into/from the swap cache and check the refault distance.
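To illustrate that last step, here is a simplified, self-contained sketch of
the refault-distance check. It is not the kernel code itself (the names and
the mask constant are illustrative), but it shows the comparison the
detection logic boils down to.

#include <stdbool.h>

#define EVICTION_MASK (~0UL >> 8)	/* assume a few bits are reserved for ids/flags */

/*
 * inactive_age is advanced on inactive-list evictions and activations.
 * The shadow entry stores a snapshot of it taken at eviction time; the
 * difference at refault time approximates how much reclaim activity the
 * page sat out while it was gone.
 */
static bool activate_on_refault(unsigned long inactive_age_now,
				unsigned long eviction_snapshot,
				unsigned long nr_active_pages)
{
	unsigned long refault_distance;

	/* Unsigned subtraction keeps the distance correct across wrap-around. */
	refault_distance = (inactive_age_now - eviction_snapshot) & EVICTION_MASK;

	/*
	 * If the refaulting page could have stayed resident had the space
	 * of the active list been available to it, consider it part of the
	 * workingset and promote it right away.
	 */
	return refault_distance <= nr_active_pages;
}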
* EXPERIMENT
I made a test program that imitates the above scenario and confirmed that
the problem exists. Then I checked that this patchset fixes it.
My test setup is a virtual machine with 8 CPUs and 6100MB of memory, but
the amount of memory that the test program can use is about 280MB.
This is because the system uses a large RAM-backed swap device and a large
ramdisk to capture the trace.
The test scenario is as below.
1. allocate cold memory (512MB)
2. allocate hot-1 memory (96MB)
3. activate hot-1 memory (96MB)
4. allocate another hot-2 memory (96MB)
5. access cold memory (128MB)
6. access hot-2 memory (96MB)
7. repeat 5, 6
Since hot-1 memory (96MB) is on the active list, the inactive list can
hold roughly 190MB of pages. hot-2 memory's re-access interval
(96+128MB) is larger than 190MB, so it cannot be promoted without workingset
detection, and swap-in/out happens repeatedly. With this patchset,
workingset detection works and promotion happens, so swap-in/out
occurs less often.
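For reference, a rough sketch of such a test program is shown below. This is
not my actual test code; the sizes and helpers are illustrative and simply
mirror the scenario steps above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MB (1024UL * 1024UL)

/* Allocate anonymous memory and fault every page in. */
static char *alloc_anon(size_t size)
{
	char *p = malloc(size);

	if (!p) {
		perror("malloc");
		exit(1);
	}
	memset(p, 1, size);
	return p;
}

/* Touch one byte per page over the given range. */
static void touch(char *p, size_t size)
{
	size_t i;

	for (i = 0; i < size; i += 4096)
		p[i]++;
}

int main(void)
{
	char *cold, *hot1, *hot2;

	cold = alloc_anon(512 * MB);	/* 1. allocate cold memory */
	hot1 = alloc_anon(96 * MB);	/* 2. allocate hot-1 memory */
	touch(hot1, 96 * MB);		/* 3. activate hot-1 memory */
	hot2 = alloc_anon(96 * MB);	/* 4. allocate hot-2 memory */

	for (;;) {
		touch(cold, 128 * MB);	/* 5. access cold memory (first 128MB) */
		touch(hot2, 96 * MB);	/* 6. access hot-2 memory */
	}				/* 7. repeat 5 and 6 */
}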
Here is the result. (average of 5 runs)
type swap-in swap-out
base 863240 989945
patch 681565 809273
As we can see, the patched kernel does fewer swap-ins and swap-outs.
The patchset is based on v5.5.
It is also available at
https://github.com/JoonsooKim/linux/tree/improve-anonymous-lru-management-v2.00-v5.5
Enjoy it.
Thanks.
Joonsoo Kim (9):
mm/vmscan: make active/inactive ratio as 1:1 for anon lru
mm/vmscan: protect the workingset on anonymous LRU
mm/workingset: extend the workingset detection for anon LRU
mm/swapcache: support to handle the value in swapcache
mm/workingset: use the node counter if memcg is the root memcg
mm/workingset: handle the page without memcg
mm/swap: implement workingset detection for anonymous LRU
mm/vmscan: restore active/inactive ratio for anonymous LRU
mm/swap: count a new anonymous page as a reclaim_state's rotate
include/linux/mmzone.h | 14 ++++++++-----
include/linux/swap.h | 18 ++++++++++++-----
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 6 +++---
mm/khugepaged.c | 2 +-
mm/memcontrol.c | 12 ++++++++----
mm/memory.c | 16 +++++++++------
mm/migrate.c | 2 +-
mm/shmem.c | 3 ++-
mm/swap.c | 42 ++++++++++++++++++++++++++++++++-------
mm/swap_state.c | 52 ++++++++++++++++++++++++++++++++++++++++++-------
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 2 +-
mm/vmscan.c | 43 +++++++++++++++++++++++++++++++---------
mm/vmstat.c | 6 ++++--
mm/workingset.c | 47 +++++++++++++++++++++++++++++++-------------
16 files changed, 201 insertions(+), 68 deletions(-)
--
2.7.4
From: Joonsoo Kim <[email protected]>
In the current implementation, a newly created or swapped-in anonymous page
starts on the active list. Growing the active list results in rebalancing
the active/inactive lists, so old pages on the active list are demoted to the
inactive list. Hence, hot pages on the active list aren't protected at all.
The following is an example of this situation.
Assume that there are 50 hot pages on the active list. Numbers denote the
number of pages on the active/inactive lists (active | inactive).
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
This patch tries to fix this issue.
As with the file LRU, newly created or swapped-in anonymous pages are
inserted onto the inactive list. They are promoted to the active list if
they are referenced enough. This simple modification changes the above
example as follows.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
As you can see, hot pages on the active list are now protected.
Note that this implementation has a drawback: a page cannot be promoted and
will be swapped out if its re-access interval is greater than the size of the
inactive list but smaller than the total size (active + inactive).
To solve this potential issue, a following patch applies the workingset
detection that was previously applied to the file LRU.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/swap.h | 2 +-
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 6 +++---
mm/khugepaged.c | 2 +-
mm/memory.c | 9 ++++-----
mm/migrate.c | 2 +-
mm/swap.c | 13 +++++++------
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 2 +-
mm/vmscan.c | 21 +++++++++++++++++++--
10 files changed, 39 insertions(+), 22 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e99f7a..954e13e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -344,7 +344,7 @@ extern void deactivate_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
extern void swap_setup(void);
-extern void lru_cache_add_active_or_unevictable(struct page *page,
+extern void lru_cache_add_inactive_or_unevictable(struct page *page,
struct vm_area_struct *vma);
/* linux/mm/vmscan.c */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ece7e13..14156fc 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -190,7 +190,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
get_page(new_page);
page_add_new_anon_rmap(new_page, vma, addr, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ lru_cache_add_inactive_or_unevictable(new_page, vma);
} else
/* no new page, just dec_mm_counter for old_page */
dec_mm_counter(mm, MM_ANONPAGES);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a880932..6356dfd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -638,7 +638,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr, true);
mem_cgroup_commit_charge(page, memcg, false, true);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -1282,7 +1282,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
set_page_private(pages[i], 0);
page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
mem_cgroup_commit_charge(pages[i], memcg, false, false);
- lru_cache_add_active_or_unevictable(pages[i], vma);
+ lru_cache_add_inactive_or_unevictable(pages[i], vma);
vmf->pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte_none(*vmf->pte));
set_pte_at(vma->vm_mm, haddr, vmf->pte, entry);
@@ -1435,7 +1435,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
page_add_new_anon_rmap(new_page, vma, haddr, true);
mem_cgroup_commit_charge(new_page, memcg, false, true);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ lru_cache_add_inactive_or_unevictable(new_page, vma);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
if (!page) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908..246c155 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1092,7 +1092,7 @@ static void collapse_huge_page(struct mm_struct *mm,
page_add_new_anon_rmap(new_page, vma, address, true);
mem_cgroup_commit_charge(new_page, memcg, false, true);
count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ lru_cache_add_inactive_or_unevictable(new_page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 45442d9..5f7813a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2513,7 +2513,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ lru_cache_add_inactive_or_unevictable(new_page, vma);
/*
* We call the notify macro here because, when using secondary
* mmu page tables (such as kvm shadow page tables), we want the
@@ -3038,11 +3038,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
- activate_page(page);
}
swap_free(entry);
@@ -3186,7 +3185,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
@@ -3449,7 +3448,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
diff --git a/mm/migrate.c b/mm/migrate.c
index 86873b6..ef034c0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2784,7 +2784,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
page_add_new_anon_rmap(page, vma, addr, false);
mem_cgroup_commit_charge(page, memcg, false, false);
if (!is_zone_device_page(page))
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
get_page(page);
if (flush) {
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae9..18b2735 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -448,23 +448,24 @@ void lru_cache_add(struct page *page)
}
/**
- * lru_cache_add_active_or_unevictable
+ * lru_cache_add_inactive_or_unevictable
* @page: the page to be added to LRU
* @vma: vma in which page is mapped for determining reclaimability
*
- * Place @page on the active or unevictable LRU list, depending on its
+ * Place @page on the inactive or unevictable LRU list, depending on its
* evictability. Note that if the page is not evictable, it goes
* directly back onto it's zone's unevictable list, it does NOT use a
* per cpu pagevec.
*/
-void lru_cache_add_active_or_unevictable(struct page *page,
+void lru_cache_add_inactive_or_unevictable(struct page *page,
struct vm_area_struct *vma)
{
+ bool evictable;
+
VM_BUG_ON_PAGE(PageLRU(page), page);
- if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
- SetPageActive(page);
- else if (!TestSetPageMlocked(page)) {
+ evictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED;
+ if (!evictable && !TestSetPageMlocked(page)) {
/*
* We use the irq-unsafe __mod_zone_page_stat because this
* counter is not modified from interrupt context, and the pte
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bb3261d..6bdcbf9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1888,7 +1888,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
} else { /* ksm created a completely new copy */
page_add_new_anon_rmap(page, vma, addr, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ lru_cache_add_inactive_or_unevictable(page, vma);
}
swap_free(entry);
/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 1b0d7ab..875e329 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -120,7 +120,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
inc_mm_counter(dst_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, dst_vma);
+ lru_cache_add_inactive_or_unevictable(page, dst_vma);
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e772f3f..4122a84 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1010,8 +1010,15 @@ static enum page_references page_check_references(struct page *page,
return PAGEREF_RECLAIM;
if (referenced_ptes) {
- if (PageSwapBacked(page))
- return PAGEREF_ACTIVATE;
+ if (PageSwapBacked(page)) {
+ if (referenced_page) {
+ ClearPageReferenced(page);
+ return PAGEREF_ACTIVATE;
+ }
+
+ SetPageReferenced(page);
+ return PAGEREF_KEEP;
+ }
/*
* All mapped pages start out with page table
* references from the instantiating fault, so we need
@@ -2056,6 +2063,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
}
}
+ /*
+ * Now, newly created anonymous pages aren't appended to the
+ * active list. We don't need to clear the reference bit here.
+ */
+ if (PageSwapBacked(page)) {
+ ClearPageReferenced(page);
+ goto deactivate;
+ }
+
if (page_referenced(page, 0, sc->target_mem_cgroup,
&vm_flags)) {
nr_rotated += hpage_nr_pages(page);
@@ -2074,6 +2090,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
}
}
+deactivate:
ClearPageActive(page); /* we are de-activating */
SetPageWorkingset(page);
list_add(&page->lru, &l_inactive);
--
2.7.4
From: Joonsoo Kim <[email protected]>
In a following patch, workingset detection will be applied to the
anonymous LRU. To prepare for that, this patch adds some code to
distinguish and handle both LRUs.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/mmzone.h | 14 +++++++++-----
mm/memcontrol.c | 12 ++++++++----
mm/vmscan.c | 15 ++++++++++-----
mm/vmstat.c | 6 ++++--
mm/workingset.c | 35 ++++++++++++++++++++++-------------
5 files changed, 53 insertions(+), 29 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5334ad8..b78fd8c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -220,8 +220,12 @@ enum node_stat_item {
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
WORKINGSET_NODES,
- WORKINGSET_REFAULT,
- WORKINGSET_ACTIVATE,
+ WORKINGSET_REFAULT_BASE,
+ WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
+ WORKINGSET_REFAULT_FILE,
+ WORKINGSET_ACTIVATE_BASE,
+ WORKINGSET_ACTIVATE_ANON = WORKINGSET_ACTIVATE_BASE,
+ WORKINGSET_ACTIVATE_FILE,
WORKINGSET_RESTORE,
WORKINGSET_NODERECLAIM,
NR_ANON_MAPPED, /* Mapped anonymous pages */
@@ -304,10 +308,10 @@ enum lruvec_flags {
struct lruvec {
struct list_head lists[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
- /* Evictions & activations on the inactive file list */
- atomic_long_t inactive_age;
+ /* Evictions & activations on the inactive list */
+ atomic_long_t inactive_age[2];
/* Refaults at the time of last reclaim cycle */
- unsigned long refaults;
+ unsigned long refaults[2];
/* Various lruvec state flags (enum lruvec_flags) */
unsigned long flags;
#ifdef CONFIG_MEMCG
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6c83cf4..8f4473d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1431,10 +1431,14 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGMAJFAULT),
memcg_events(memcg, PGMAJFAULT));
- seq_buf_printf(&s, "workingset_refault %lu\n",
- memcg_page_state(memcg, WORKINGSET_REFAULT));
- seq_buf_printf(&s, "workingset_activate %lu\n",
- memcg_page_state(memcg, WORKINGSET_ACTIVATE));
+ seq_buf_printf(&s, "workingset_refault_anon %lu\n",
+ memcg_page_state(memcg, WORKINGSET_REFAULT_ANON));
+ seq_buf_printf(&s, "workingset_refault_file %lu\n",
+ memcg_page_state(memcg, WORKINGSET_REFAULT_FILE));
+ seq_buf_printf(&s, "workingset_activate_anon %lu\n",
+ memcg_page_state(memcg, WORKINGSET_ACTIVATE_ANON));
+ seq_buf_printf(&s, "workingset_activate_file %lu\n",
+ memcg_page_state(memcg, WORKINGSET_ACTIVATE_FILE));
seq_buf_printf(&s, "workingset_nodereclaim %lu\n",
memcg_page_state(memcg, WORKINGSET_NODERECLAIM));
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4122a84..74c3ade 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2735,7 +2735,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (!sc->force_deactivate) {
unsigned long refaults;
- if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
+ refaults = lruvec_page_state(target_lruvec,
+ WORKINGSET_ACTIVATE_ANON);
+ if (refaults != target_lruvec->refaults[0] ||
+ inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
sc->may_deactivate |= DEACTIVATE_ANON;
else
sc->may_deactivate &= ~DEACTIVATE_ANON;
@@ -2746,8 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* rid of any stale active pages quickly.
*/
refaults = lruvec_page_state(target_lruvec,
- WORKINGSET_ACTIVATE);
- if (refaults != target_lruvec->refaults ||
+ WORKINGSET_ACTIVATE_FILE);
+ if (refaults != target_lruvec->refaults[1] ||
inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
sc->may_deactivate |= DEACTIVATE_FILE;
else
@@ -3026,8 +3029,10 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
unsigned long refaults;
target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
- refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE);
- target_lruvec->refaults = refaults;
+ refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
+ target_lruvec->refaults[0] = refaults;
+ refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_FILE);
+ target_lruvec->refaults[1] = refaults;
}
/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d5337..3cdf8e9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1146,8 +1146,10 @@ const char * const vmstat_text[] = {
"nr_isolated_anon",
"nr_isolated_file",
"workingset_nodes",
- "workingset_refault",
- "workingset_activate",
+ "workingset_refault_anon",
+ "workingset_refault_file",
+ "workingset_activate_anon",
+ "workingset_activate_file",
"workingset_restore",
"workingset_nodereclaim",
"nr_anon_pages",
diff --git a/mm/workingset.c b/mm/workingset.c
index 474186b..5fb8f85 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -15,6 +15,7 @@
#include <linux/dax.h>
#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/mm_inline.h>
/*
* Double CLOCK lists
@@ -156,7 +157,7 @@
*
* Implementation
*
- * For each node's file LRU lists, a counter for inactive evictions
+ * For each node's anon/file LRU lists, a counter for inactive evictions
* and activations is maintained (node->inactive_age).
*
* On eviction, a snapshot of this counter (along with some bits to
@@ -213,7 +214,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
*workingsetp = workingset;
}
-static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
+static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat,
+ int is_file)
{
/*
* Reclaiming a cgroup means reclaiming all its children in a
@@ -230,7 +232,7 @@ static void advance_inactive_age(struct mem_cgroup *memcg, pg_data_t *pgdat)
struct lruvec *lruvec;
lruvec = mem_cgroup_lruvec(memcg, pgdat);
- atomic_long_inc(&lruvec->inactive_age);
+ atomic_long_inc(&lruvec->inactive_age[is_file]);
} while (memcg && (memcg = parent_mem_cgroup(memcg)));
}
@@ -248,18 +250,19 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
unsigned long eviction;
struct lruvec *lruvec;
int memcgid;
+ int is_file = page_is_file_cache(page);
/* Page is fully exclusive and pins page->mem_cgroup */
VM_BUG_ON_PAGE(PageLRU(page), page);
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
- advance_inactive_age(page_memcg(page), pgdat);
+ advance_inactive_age(page_memcg(page), pgdat, is_file);
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
/* XXX: target_memcg can be NULL, go through lruvec */
memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
- eviction = atomic_long_read(&lruvec->inactive_age);
+ eviction = atomic_long_read(&lruvec->inactive_age[is_file]);
return pack_shadow(memcgid, pgdat, eviction, PageWorkingset(page));
}
@@ -278,13 +281,16 @@ void workingset_refault(struct page *page, void *shadow)
struct lruvec *eviction_lruvec;
unsigned long refault_distance;
struct pglist_data *pgdat;
- unsigned long active_file;
+ unsigned long active;
struct mem_cgroup *memcg;
unsigned long eviction;
struct lruvec *lruvec;
unsigned long refault;
bool workingset;
int memcgid;
+ int is_file = page_is_file_cache(page);
+ enum lru_list active_lru = page_lru_base_type(page) + LRU_ACTIVE;
+ enum node_stat_item workingset_stat;
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
@@ -309,8 +315,8 @@ void workingset_refault(struct page *page, void *shadow)
if (!mem_cgroup_disabled() && !eviction_memcg)
goto out;
eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
- refault = atomic_long_read(&eviction_lruvec->inactive_age);
- active_file = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
+ refault = atomic_long_read(&eviction_lruvec->inactive_age[is_file]);
+ active = lruvec_page_state(eviction_lruvec, active_lru);
/*
* Calculate the refault distance
@@ -341,19 +347,21 @@ void workingset_refault(struct page *page, void *shadow)
memcg = page_memcg(page);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
- inc_lruvec_state(lruvec, WORKINGSET_REFAULT);
+ workingset_stat = WORKINGSET_REFAULT_BASE + is_file;
+ inc_lruvec_state(lruvec, workingset_stat);
/*
* Compare the distance to the existing workingset size. We
* don't act on pages that couldn't stay resident even if all
* the memory was available to the page cache.
*/
- if (refault_distance > active_file)
+ if (refault_distance > active)
goto out;
SetPageActive(page);
- advance_inactive_age(memcg, pgdat);
- inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
+ advance_inactive_age(memcg, pgdat, is_file);
+ workingset_stat = WORKINGSET_ACTIVATE_BASE + is_file;
+ inc_lruvec_state(lruvec, workingset_stat);
/* Page was active prior to eviction */
if (workingset) {
@@ -371,6 +379,7 @@ void workingset_refault(struct page *page, void *shadow)
void workingset_activation(struct page *page)
{
struct mem_cgroup *memcg;
+ int is_file = page_is_file_cache(page);
rcu_read_lock();
/*
@@ -383,7 +392,7 @@ void workingset_activation(struct page *page)
memcg = page_memcg_rcu(page);
if (!mem_cgroup_disabled() && !memcg)
goto out;
- advance_inactive_age(memcg, page_pgdat(page));
+ advance_inactive_age(memcg, page_pgdat(page), is_file);
out:
rcu_read_unlock();
}
--
2.7.4
From: Joonsoo Kim <[email protected]>
In the following patch, workingset detection is implemented for the
swap cache. Swap cache's node is usually allocated by kswapd and it
isn't charged by kmemcg since it is from the kernel thread. So the swap
cache's shadow node is managed by the node list of the list_lru rather
than the memcg specific one.
If counting the shadow node on the root memcg happens to reclaim the slab
object, the shadow node count returns the number of the shadow node on
the node list of the list_lru since root memcg has the kmem_cache_id, -1.
However, the size of pages on the LRU is calculated by using the specific
memcg, so mismatch happens. This causes the number of shadow node not to
be increased to the enough size and, therefore, workingset detection
cannot work correctly. This patch fixes this bug by checking if the memcg
is the root memcg or not. If it is the root memcg, instead of using
the memcg-specific LRU, the system-wide LRU is used to calculate proper
size of the shadow node so that the number of the shadow node can grow
as expected.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/workingset.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 5fb8f85..a9f474a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -468,7 +468,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
* PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
*/
#ifdef CONFIG_MEMCG
- if (sc->memcg) {
+ /*
+ * A kernel allocation charged to the root memcg isn't regarded as an
+ * allocation of a specific memcg. So, if sc->memcg is the root memcg,
+ * we need to use the count for the node rather than the one for the
+ * specific memcg.
+ */
+ if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) {
struct lruvec *lruvec;
int i;
--
2.7.4
From: Joonsoo Kim <[email protected]>
While implementing workingset detection for anonymous pages, I found
some swapcache pages with a NULL memcg. From reading the code, I found
two reasons.
One is the case where swap-in readahead happens. The other is a
corner case related to the shmem cache. These two problems should be
fixed, but the fix isn't straightforward. For example, on swap-off,
all swapped-out pages are read back into the swapcache. In that case, who is
the owner of a swapcache page?
Since this problem doesn't look trivial, I decided to leave the issue as-is
and handle the corner case at the place where the error occurs.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/workingset.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/workingset.c b/mm/workingset.c
index a9f474a..8d2e83a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -257,6 +257,10 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
VM_BUG_ON_PAGE(page_count(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
+ /* page_memcg() can be NULL if swap-in readahead happens */
+ if (!page_memcg(page))
+ return NULL;
+
advance_inactive_age(page_memcg(page), pgdat, is_file);
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
--
2.7.4
From: Joonsoo Kim <[email protected]>
Now that workingset detection is implemented for the anonymous LRU,
we no longer have to worry about failing to detect the workingset because of
the active/inactive ratio. Let's restore the ratio.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a1892e7..81ff725 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2237,7 +2237,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
gb = (inactive + active) >> (30 - PAGE_SHIFT);
- if (gb && is_file_lru(inactive_lru))
+ if (gb)
inactive_ratio = int_sqrt(10 * gb);
else
inactive_ratio = 1;
--
2.7.4
From: Joonsoo Kim <[email protected]>
This patch implements workingset detection for the anonymous LRU.
All the infrastructure was implemented by the previous patches, so this patch
just activates workingset detection by installing/retrieving
the shadow entry.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/swap.h | 11 +++++++++--
mm/memory.c | 7 ++++++-
mm/shmem.c | 3 ++-
mm/swap_state.c | 31 ++++++++++++++++++++++++++-----
mm/vmscan.c | 7 +++++--
5 files changed, 48 insertions(+), 11 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0df8b3f..fb4772e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -408,7 +408,9 @@ extern struct address_space *swapper_spaces[];
extern unsigned long total_swapcache_pages(void);
extern void show_swap_cache_info(void);
extern int add_to_swap(struct page *page);
-extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
+extern void *get_shadow_from_swap_cache(swp_entry_t entry);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp, void **shadowp);
extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
extern void __delete_from_swap_cache(struct page *page,
swp_entry_t entry, void *shadow);
@@ -565,8 +567,13 @@ static inline int add_to_swap(struct page *page)
return 0;
}
+static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+ return NULL;
+}
+
static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, void **shadowp)
{
return -1;
}
diff --git a/mm/memory.c b/mm/memory.c
index 5f7813a..91a2097 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2925,10 +2925,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (page) {
+ void *shadow;
+
__SetPageLocked(page);
__SetPageSwapBacked(page);
set_page_private(page, entry.val);
- lru_cache_add_anon(page);
+ shadow = get_shadow_from_swap_cache(entry);
+ if (shadow)
+ workingset_refault(page, shadow);
+ lru_cache_add(page);
swap_readpage(page, true);
}
} else {
diff --git a/mm/shmem.c b/mm/shmem.c
index 8793e8c..c6663ad 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1370,7 +1370,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
list_add(&info->swaplist, &shmem_swaplist);
if (add_to_swap_cache(page, swap,
- __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN) == 0) {
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
+ NULL) == 0) {
spin_lock_irq(&info->lock);
shmem_recalc_inode(inode);
info->swapped++;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3fbbe45..7f7cb19 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -107,11 +107,24 @@ void show_swap_cache_info(void)
printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
}
+void *get_shadow_from_swap_cache(swp_entry_t entry)
+{
+ struct address_space *address_space = swap_address_space(entry);
+ pgoff_t idx = swp_offset(entry);
+ struct page *page;
+
+ page = find_get_entry(address_space, idx);
+ if (xa_is_value(page))
+ return page;
+ return NULL;
+}
+
/*
* add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
* but sets SwapCache flag and private instead of mapping and index.
*/
-int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
+int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp, void **shadowp)
{
struct address_space *address_space = swap_address_space(entry);
pgoff_t idx = swp_offset(entry);
@@ -137,8 +150,11 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
for (i = 0; i < nr; i++) {
VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
old = xas_load(&xas);
- if (xa_is_value(old))
+ if (xa_is_value(old)) {
nrexceptional++;
+ if (shadowp)
+ *shadowp = old;
+ }
set_page_private(page + i, entry.val + i);
xas_store(&xas, page);
xas_next(&xas);
@@ -226,7 +242,7 @@ int add_to_swap(struct page *page)
* Add it to the swap cache.
*/
err = add_to_swap_cache(page, entry,
- __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
+ __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL);
if (err)
/*
* add_to_swap_cache() doesn't return -EEXIST, so we can safely
@@ -380,6 +396,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct page *found_page = NULL, *new_page = NULL;
struct swap_info_struct *si;
int err;
+ void *shadow;
*new_page_allocated = false;
do {
@@ -435,11 +452,15 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
/* May fail (-ENOMEM) if XArray node allocation failed. */
__SetPageLocked(new_page);
__SetPageSwapBacked(new_page);
- err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
+ shadow = NULL;
+ err = add_to_swap_cache(new_page, entry,
+ gfp_mask & GFP_KERNEL, &shadow);
if (likely(!err)) {
/* Initiate read into locked page */
SetPageWorkingset(new_page);
- lru_cache_add_anon(new_page);
+ if (shadow)
+ workingset_refault(new_page, shadow);
+ lru_cache_add(new_page);
*new_page_allocated = true;
return new_page;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 99588ba..a1892e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -867,6 +867,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
{
unsigned long flags;
int refcount;
+ void *shadow = NULL;
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
@@ -909,12 +910,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
- __delete_from_swap_cache(page, swap, NULL);
+ if (reclaimed && !mapping_exiting(mapping))
+ shadow = workingset_eviction(page, target_memcg);
+ __delete_from_swap_cache(page, swap, shadow);
xa_unlock_irqrestore(&mapping->i_pages, flags);
put_swap_page(page, swap);
} else {
void (*freepage)(struct page *);
- void *shadow = NULL;
freepage = mapping->a_ops->freepage;
/*
@@ -1485,6 +1487,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
SetPageActive(page);
stat->nr_activate[type] += nr_pages;
count_memcg_page_event(page, PGACTIVATE);
+ workingset_activation(page);
}
keep_locked:
unlock_page(page);
--
2.7.4
From: Joonsoo Kim <[email protected]>
The current implementation of LRU management for anonymous pages has some
problems. The most important one is that it doesn't protect the workingset,
that is, the pages on the active LRU list. Although this problem will be
fixed by the following patches, some preparation is required and
this patch does it.
What the following patches do is restore workingset protection. In that
case, newly created or swapped-in pages start their lifetime on the
inactive list. If the inactive list is too small, there is not enough chance
for a page to be referenced, so it cannot become part of the workingset.
In order to give newly created anonymous pages enough chance, this patch
makes the active/inactive anon LRU ratio 1:1.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 572fb17..e772f3f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2217,7 +2217,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
gb = (inactive + active) >> (30 - PAGE_SHIFT);
- if (gb)
+ if (gb && is_file_lru(inactive_lru))
inactive_ratio = int_sqrt(10 * gb);
else
inactive_ratio = 1;
--
2.7.4
From: Joonsoo Kim <[email protected]>
reclaim_stat's rotate is used to control the scan ratio between the file
and anonymous LRUs. Before this patch series, all new anonymous pages were
counted toward rotate, which protected anonymous pages on the active LRU and
made reclaim happen less often on the anonymous LRU than on the file LRU.
Now the situation has changed: new anonymous pages are no longer added
to the active LRU, so rotate would be far lower than before. Reclaim on the
anonymous LRU would then happen more often, which could have a bad
effect on systems optimized for the previous behaviour.
Therefore, this patch counts a new anonymous page toward reclaim_stat's
rotate. Although adding this count to reclaim_stat's rotate is not strictly
logical in the current algorithm, reducing the regression
is more important.
I found this regression in a kernel-build test; it is roughly a 2~5%
performance degradation. With this workaround, performance is completely
restored.
v2: fix a bug that reused the rotate value of the previous page
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/swap.c | 29 ++++++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
diff --git a/mm/swap.c b/mm/swap.c
index 18b2735..9001d81 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -187,6 +187,9 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
}
EXPORT_SYMBOL_GPL(get_kernel_page);
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
+ void *arg);
+
static void pagevec_lru_move_fn(struct pagevec *pvec,
void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
void *arg)
@@ -199,6 +202,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
struct pglist_data *pagepgdat = page_pgdat(page);
+ void *arg_orig = arg;
if (pagepgdat != pgdat) {
if (pgdat)
@@ -207,8 +211,22 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
spin_lock_irqsave(&pgdat->lru_lock, flags);
}
+ if (move_fn == __pagevec_lru_add_fn) {
+ struct list_head *entry = &page->lru;
+ unsigned long next = (unsigned long)entry->next;
+ unsigned long rotate = next & 2;
+
+ if (rotate) {
+ VM_BUG_ON(arg);
+
+ next = next & ~2;
+ entry->next = (struct list_head *)next;
+ arg = (void *)rotate;
+ }
+ }
lruvec = mem_cgroup_page_lruvec(page, pgdat);
(*move_fn)(page, lruvec, arg);
+ arg = arg_orig;
}
if (pgdat)
spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -475,6 +493,14 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
hpage_nr_pages(page));
count_vm_event(UNEVICTABLE_PGMLOCKED);
}
+
+ if (PageSwapBacked(page) && evictable) {
+ struct list_head *entry = &page->lru;
+ unsigned long next = (unsigned long)entry->next;
+
+ next = next | 2;
+ entry->next = (struct list_head *)next;
+ }
lru_cache_add(page);
}
@@ -927,6 +953,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
{
enum lru_list lru;
int was_unevictable = TestClearPageUnevictable(page);
+ unsigned long rotate = (unsigned long)arg;
VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -962,7 +989,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
if (page_evictable(page)) {
lru = page_lru(page);
update_page_reclaim_stat(lruvec, page_is_file_cache(page),
- PageActive(page));
+ PageActive(page) | rotate);
if (was_unevictable)
count_vm_event(UNEVICTABLE_PGRESCUED);
} else {
--
2.7.4
From: Joonsoo Kim <[email protected]>
The swapcache doesn't handle exceptional (shadow) values since nothing
currently stores them there. In the following patch, workingset detection
for anonymous pages will be implemented and it will store such values in the
swapcache. So we need to handle them, and this patch implements that handling.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/swap.h | 5 +++--
mm/swap_state.c | 23 ++++++++++++++++++++---
mm/vmscan.c | 2 +-
3 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 954e13e..0df8b3f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -410,7 +410,8 @@ extern void show_swap_cache_info(void);
extern int add_to_swap(struct page *page);
extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
-extern void __delete_from_swap_cache(struct page *, swp_entry_t entry);
+extern void __delete_from_swap_cache(struct page *page,
+ swp_entry_t entry, void *shadow);
extern void delete_from_swap_cache(struct page *);
extern void free_page_and_swap_cache(struct page *);
extern void free_pages_and_swap_cache(struct page **, int);
@@ -571,7 +572,7 @@ static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
}
static inline void __delete_from_swap_cache(struct page *page,
- swp_entry_t entry)
+ swp_entry_t entry, void *shadow)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8e7ce9a..3fbbe45 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -117,6 +117,10 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
pgoff_t idx = swp_offset(entry);
XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page));
unsigned long i, nr = compound_nr(page);
+ unsigned long nrexceptional = 0;
+ void *old;
+
+ xas_set_update(&xas, workingset_update_node);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageSwapCache(page), page);
@@ -132,10 +136,14 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
goto unlock;
for (i = 0; i < nr; i++) {
VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
+ old = xas_load(&xas);
+ if (xa_is_value(old))
+ nrexceptional++;
set_page_private(page + i, entry.val + i);
xas_store(&xas, page);
xas_next(&xas);
}
+ address_space->nrexceptional -= nrexceptional;
address_space->nrpages += nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
ADD_CACHE_INFO(add_total, nr);
@@ -155,24 +163,33 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
* This must be called only on pages that have
* been verified to be in the swap cache.
*/
-void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
+void __delete_from_swap_cache(struct page *page,
+ swp_entry_t entry, void *shadow)
{
struct address_space *address_space = swap_address_space(entry);
int i, nr = hpage_nr_pages(page);
pgoff_t idx = swp_offset(entry);
XA_STATE(xas, &address_space->i_pages, idx);
+ /* Do not apply workingset detection for the huge page */
+ if (nr > 1)
+ shadow = NULL;
+
+ xas_set_update(&xas, workingset_update_node);
+
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
VM_BUG_ON_PAGE(PageWriteback(page), page);
for (i = 0; i < nr; i++) {
- void *entry = xas_store(&xas, NULL);
+ void *entry = xas_store(&xas, shadow);
VM_BUG_ON_PAGE(entry != page, entry);
set_page_private(page + i, 0);
xas_next(&xas);
}
ClearPageSwapCache(page);
+ if (shadow)
+ address_space->nrexceptional += nr;
address_space->nrpages -= nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
ADD_CACHE_INFO(del_total, nr);
@@ -247,7 +264,7 @@ void delete_from_swap_cache(struct page *page)
struct address_space *address_space = swap_address_space(entry);
xa_lock_irq(&address_space->i_pages);
- __delete_from_swap_cache(page, entry);
+ __delete_from_swap_cache(page, entry, NULL);
xa_unlock_irq(&address_space->i_pages);
put_swap_page(page, entry);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74c3ade..99588ba 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -909,7 +909,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
- __delete_from_swap_cache(page, swap);
+ __delete_from_swap_cache(page, swap, NULL);
xa_unlock_irqrestore(&mapping->i_pages, flags);
put_swap_page(page, swap);
} else {
--
2.7.4
On Thu, 20 Feb 2020 14:11:44 +0900 [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Hello,
>
> This patchset implements workingset protection and detection on
> the anonymous LRU list.
The test robot measurement got my attention!
http://lkml.kernel.org/r/20200227022905.GH6548@shao2-debian
> * Changes on v2
> - fix a critical bug that uses out of index lru list in
> workingset_refault()
> - fix a bug that reuses the rotate value for previous page
>
> * SUBJECT
> workingset protection
>
> * PROBLEM
> In current implementation, newly created or swap-in anonymous page is
> started on the active list. Growing the active list results in rebalancing
> active/inactive list so old pages on the active list are demoted to the
> inactive list. Hence, hot page on the active list isn't protected at all.
>
> Following is an example of this situation.
>
> Assume that 50 hot pages on active list and system can contain total
> 100 pages. Numbers denote the number of pages on active/inactive
> list (active | inactive). (h) stands for hot pages and (uo) stands for
> used-once pages.
>
> 1. 50 hot pages on active list
> 50(h) | 0
>
> 2. workload: 50 newly created (used-once) pages
> 50(uo) | 50(h)
>
> 3. workload: another 50 newly created (used-once) pages
> 50(uo) | 50(uo), swap-out 50(h)
>
> As we can see, hot pages are swapped-out and it would cause swap-in later.
>
> * SOLUTION
> Since this is what we want to avoid, this patchset implements workingset
> protection. Like as the file LRU list, newly created or swap-in anonymous
> page is started on the inactive list. Also, like as the file LRU list,
> if enough reference happens, the page will be promoted. This simple
> modification changes the above example as following.
One wonders why on earth we weren't doing these things in the first
place?
> * SUBJECT
> workingset detection
It sounds like the above simple aging changes provide most of the
improvement, and that the workingset changes are less beneficial and a
bit more risky/speculative?
If so, would it be best for us to concentrate on the aging changes
first, let that settle in and spread out and then turn attention to the
workingset changes?
Hello, Andrew.
On Wed, Feb 26, 2020 at 07:39:42PM -0800, Andrew Morton wrote:
> On Thu, 20 Feb 2020 14:11:44 +0900 [email protected] wrote:
>
> > From: Joonsoo Kim <[email protected]>
> >
> > Hello,
> >
> > This patchset implements workingset protection and detection on
> > the anonymous LRU list.
>
> The test robot measurement got my attention!
>
> http://lkml.kernel.org/r/20200227022905.GH6548@shao2-debian
I really hoped to get some attention!!!
Thanks, test robot and Andrew.
>
> > * Changes on v2
> > - fix a critical bug that uses out of index lru list in
> > workingset_refault()
> > - fix a bug that reuses the rotate value for previous page
> >
> > * SUBJECT
> > workingset protection
> >
> > * PROBLEM
> > In current implementation, newly created or swap-in anonymous page is
> > started on the active list. Growing the active list results in rebalancing
> > active/inactive list so old pages on the active list are demoted to the
> > inactive list. Hence, hot page on the active list isn't protected at all.
> >
> > Following is an example of this situation.
> >
> > Assume that 50 hot pages on active list and system can contain total
> > 100 pages. Numbers denote the number of pages on active/inactive
> > list (active | inactive). (h) stands for hot pages and (uo) stands for
> > used-once pages.
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(uo) | 50(h)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(uo) | 50(uo), swap-out 50(h)
> >
> > As we can see, hot pages are swapped-out and it would cause swap-in later.
> >
> > * SOLUTION
> > Since this is what we want to avoid, this patchset implements workingset
> > protection. Like as the file LRU list, newly created or swap-in anonymous
> > page is started on the inactive list. Also, like as the file LRU list,
> > if enough reference happens, the page will be promoted. This simple
> > modification changes the above example as following.
>
> One wonders why on earth we weren't doing these things in the first
> place?
I don't know. I tried to find the origin of this behaviour and found
that it comes from you, 18 years ago. :)
The changelog mentions that starting pages on the active list boosts
throughput on a stupid swapstormy test, but I cannot guess the exact reason
for that improvement.
Anyway, the following is the related patch history. Do you remember
anything about it?
commit 018c71d821e7cfb13470e43778645c899c30c53e
Author: Andrew Morton <[email protected]>
Date: Thu Oct 31 04:09:19 2002 -0800
[PATCH] start anon pages on the active list (properly this time)
Use lru_cache_add_active() so ensure that pages which are, or will be
mapped into pagetables are started out on the active list.
commit 1527d0b71fa1e9db1beb22fda689b9086d025455
Author: Andrew Morton <[email protected]>
Date: Thu Oct 31 04:09:13 2002 -0800
[PATCH] lru_add_active(): for starting pages on the active list
This is the first in a series of patches which tune up the 2.5
performance under heavy swap loads.
Throughput on stupid swapstormy tests is increased by 1.5x to 3x.
Still about 20% behind 2.4 with multithreaded tests. That is not
easily fixable - the virtual scan tends to apply a form of load
control: particular processes are heavily swapped out so the others can
get ahead. With 2.5 all processes make very even progress and much
more swapping is needed. It's on par with 2.4 for single-process
swapstorms.
In this patch:
The code which tries to start mapped pages out on the active list
doesn't work very well. It uses an "is it mapped into pagetables"
test. Which doesn't work for, say, swap readahead pages. They are not
mapped into pagetables when they are spilled onto the LRU.
So create a new `lru_cache_add_active()' function for deferred addition
of pages to their active list.
Also move mark_page_accessed() from filemap.c to swap.c where all
similar functions live. And teach it to not try to move pages which
are in the deferred-addition list onto the active list. That won't
work, and it's bogusly clearing PageReferenced in that case.
The deferred-addition lists are a pest. But lru_cache_add used to be
really expensive in sime workloads on some machines. Must persist.
> > * SUBJECT
> > workingset detection
>
> It sounds like the above simple aging changes provide most of the
> improvement, and that the workingset changes are less beneficial and a
> bit more risky/speculative?
I don't think so.
Although the test robot only found the improvement from the simple ratio
changes, the later patches have their own benefits. I saw the benefit of
the other patches on our production workload, although that isn't
mentioned in the cover letter.
And what this patchset does looks like the reasonable thing to do.
> If so, would it be best for us to concentrate on the aging changes
> first, let that settle in and spread out and then turn attention to the
> workingset changes?
I hope that more developers pay attention to this patchset and that
the patches are merged together.
Thanks.
On Wed, Feb 26, 2020 at 07:39:42PM -0800, Andrew Morton wrote:
> On Thu, 20 Feb 2020 14:11:44 +0900 [email protected] wrote:
>
> > From: Joonsoo Kim <[email protected]>
> >
> > Hello,
> >
> > This patchset implements workingset protection and detection on
> > the anonymous LRU list.
>
> The test robot measurement got my attention!
>
> http://lkml.kernel.org/r/20200227022905.GH6548@shao2-debian
>
> > * Changes on v2
> > - fix a critical bug that uses out of index lru list in
> > workingset_refault()
> > - fix a bug that reuses the rotate value for previous page
> >
> > * SUBJECT
> > workingset protection
> >
> > * PROBLEM
> > In current implementation, newly created or swap-in anonymous page is
> > started on the active list. Growing the active list results in rebalancing
> > active/inactive list so old pages on the active list are demoted to the
> > inactive list. Hence, hot page on the active list isn't protected at all.
> >
> > Following is an example of this situation.
> >
> > Assume that 50 hot pages on active list and system can contain total
> > 100 pages. Numbers denote the number of pages on active/inactive
> > list (active | inactive). (h) stands for hot pages and (uo) stands for
> > used-once pages.
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(uo) | 50(h)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(uo) | 50(uo), swap-out 50(h)
> >
> > As we can see, hot pages are swapped-out and it would cause swap-in later.
> >
> > * SOLUTION
> > Since this is what we want to avoid, this patchset implements workingset
> > protection. Like as the file LRU list, newly created or swap-in anonymous
> > page is started on the inactive list. Also, like as the file LRU list,
> > if enough reference happens, the page will be promoted. This simple
> > modification changes the above example as following.
>
> One wonders why on earth we weren't doing these things in the first
> place?
>
> > * SUBJECT
> > workingset detection
>
> It sounds like the above simple aging changes provide most of the
> improvement, and that the workingset changes are less beneficial and a
> bit more risky/speculative?
>
> If so, would it be best for us to concentrate on the aging changes
> first, let that settle in and spread out and then turn attention to the
> workingset changes?
Those two patches work well for some workloads (like the benchmark),
but not for others. The full patchset makes sure both types work well.
Specifically, the existing aging strategy for anon assumes that most
anon pages allocated are hot. That's why they all start active and we
then do second-chance with the small inactive LRU to filter out the
few cold ones to swap out. This is true for many common workloads.
The benchmark creates a larger-than-memory set of anon pages with a
flat access profile - to the VM a flood of one-off pages. Joonsoo's
first two patches allow the VM to usher those pages in and out of
memory very quickly, which explains the throughput boost. But it comes
at the cost of reducing space available to hot anon pages, which will
regress others.
Joonsoo's full patchset makes the VM support both types of workloads
well: by putting everything on the inactive list first, one-off pages
can move through the system without disturbing the hot pages. And by
supplementing the inactive list with non-resident information, he can
keep it tiny without the risk of one-off pages drowning out new hot
pages. He can retain today's level of active page protection and
detection, while allowing one-off pages to move through quickly.
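To make the non-resident ("shadow") bookkeeping a bit more concrete, here is a
rough sketch of the kind of refault-distance test the detection side relies on;
the names below are illustrative, not the patchset's actual
workingset_refault() code.

#include <stdbool.h>

/*
 * Illustrative sketch.  On refault (swap-in of a page evicted earlier),
 * compare how many evictions happened since this page left memory with
 * the size of the active anon list.  A small distance means the page
 * would have stayed resident with a modestly larger inactive list,
 * i.e. it is probably part of the workingset and deserves activation.
 */
static bool refault_is_workingset(unsigned long eviction_stamp,
				  unsigned long current_evictions,
				  unsigned long nr_active_anon)
{
	unsigned long refault_distance = current_evictions - eviction_stamp;

	return refault_distance <= nr_active_anon;
}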
On Thu, 27 Feb 2020 08:48:06 -0500 Johannes Weiner <[email protected]> wrote:
> > It sounds like the above simple aging changes provide most of the
> > improvement, and that the workingset changes are less beneficial and a
> > bit more risky/speculative?
> >
> > If so, would it be best for us to concentrate on the aging changes
> > first, let that settle in and spread out and then turn attention to the
> > workingset changes?
>
> Those two patches work well for some workloads (like the benchmark),
> but not for others. The full patchset makes sure both types work well.
>
> Specifically, the existing aging strategy for anon assumes that most
> anon pages allocated are hot. That's why they all start active and we
> then do second-chance with the small inactive LRU to filter out the
> few cold ones to swap out. This is true for many common workloads.
>
> The benchmark creates a larger-than-memory set of anon pages with a
> flat access profile - to the VM a flood of one-off pages. Joonsoo's
> first two patches allow the VM to usher those pages in and out of
> memory very quickly, which explains the throughput boost. But it comes
> at the cost of reducing space available to hot anon pages, which will
> regress others.
>
> Joonsoo's full patchset makes the VM support both types of workloads
> well: by putting everything on the inactive list first, one-off pages
> can move through the system without disturbing the hot pages. And by
> supplementing the inactive list with non-resident information, he can
> keep it tiny without the risk of one-off pages drowning out new hot
> pages. He can retain today's level of active page protection and
> detection, while allowing one-off pages to move through quickly.
Helpful, thanks.
At v2 with no evident review input I'd normally take a pass at this
stage. But given all the potential benefits, perhaps I should be more
aggressive here?
On Thu, Feb 27, 2020 at 08:48:06AM -0500, Johannes Weiner wrote:
> On Wed, Feb 26, 2020 at 07:39:42PM -0800, Andrew Morton wrote:
> > It sounds like the above simple aging changes provide most of the
> > improvement, and that the workingset changes are less beneficial and a
> > bit more risky/speculative?
> >
> > If so, would it be best for us to concentrate on the aging changes
> > first, let that settle in and spread out and then turn attention to the
> > workingset changes?
>
> Those two patches work well for some workloads (like the benchmark),
> but not for others. The full patchset makes sure both types work well.
>
> Specifically, the existing aging strategy for anon assumes that most
> anon pages allocated are hot. That's why they all start active and we
> then do second-chance with the small inactive LRU to filter out the
> few cold ones to swap out. This is true for many common workloads.
>
> The benchmark creates a larger-than-memory set of anon pages with a
> flat access profile - to the VM a flood of one-off pages. Joonsoo's
test: swap-w-rand-mt, which is a multi-threaded, swap-write-intensive
workload, so there will be swap-outs and swap-ins.
> first two patches allow the VM to usher those pages in and out of
The weird part is that the robot says the performance gain comes from the 1st
patch only, which adjusts the ratio, and not from the 2nd patch, which
makes anon pages start on the inactive list.
I find the performance gain hard to explain...
> memory very quickly, which explains the throughput boost. But it comes
> at the cost of reducing space available to hot anon pages, which will
> regress others.
>
Hello,
On Fri, Feb 28, 2020 at 11:23:58AM +0800, Aaron Lu wrote:
> On Thu, Feb 27, 2020 at 08:48:06AM -0500, Johannes Weiner wrote:
> > On Wed, Feb 26, 2020 at 07:39:42PM -0800, Andrew Morton wrote:
> > > It sounds like the above simple aging changes provide most of the
> > > improvement, and that the workingset changes are less beneficial and a
> > > bit more risky/speculative?
> > >
> > > If so, would it be best for us to concentrate on the aging changes
> > > first, let that settle in and spread out and then turn attention to the
> > > workingset changes?
> >
> > Those two patches work well for some workloads (like the benchmark),
> > but not for others. The full patchset makes sure both types work well.
> >
> > Specifically, the existing aging strategy for anon assumes that most
> > anon pages allocated are hot. That's why they all start active and we
> > then do second-chance with the small inactive LRU to filter out the
> > few cold ones to swap out. This is true for many common workloads.
> >
> > The benchmark creates a larger-than-memory set of anon pages with a
> > flat access profile - to the VM a flood of one-off pages. Joonsoo's
>
> test: swap-w-rand-mt, which is a multi thread swap write intensive
> workload so there will be swap out and swap ins.
>
> > first two patches allow the VM to usher those pages in and out of
>
> Weird part is, the robot says the performance gain comes from the 1st
> patch only, which adjust the ratio, not including the 2nd patch which
> makes anon page starting from inactive list.
>
> I find the performance gain hard to explain...
Let me explain the reason for the performance gain.
The 1st patch gives anonymous pages more of a second chance.
In the swap-w-rand-mt test, the memory used by all threads is greater than the
amount of system memory, but the memory used by each thread would
not be that much. So, although it is a random test, there is locality
in each thread's job. More second chances help to exploit this
locality, so performance could be improved.
Thanks.
On Fri, Feb 28, 2020 at 01:03:03PM +0900, Joonsoo Kim wrote:
> Hello,
>
> On Fri, Feb 28, 2020 at 11:23:58AM +0800, Aaron Lu wrote:
> > On Thu, Feb 27, 2020 at 08:48:06AM -0500, Johannes Weiner wrote:
> > > On Wed, Feb 26, 2020 at 07:39:42PM -0800, Andrew Morton wrote:
> > > > It sounds like the above simple aging changes provide most of the
> > > > improvement, and that the workingset changes are less beneficial and a
> > > > bit more risky/speculative?
> > > >
> > > > If so, would it be best for us to concentrate on the aging changes
> > > > first, let that settle in and spread out and then turn attention to the
> > > > workingset changes?
> > >
> > > Those two patches work well for some workloads (like the benchmark),
> > > but not for others. The full patchset makes sure both types work well.
> > >
> > > Specifically, the existing aging strategy for anon assumes that most
> > > anon pages allocated are hot. That's why they all start active and we
> > > then do second-chance with the small inactive LRU to filter out the
> > > few cold ones to swap out. This is true for many common workloads.
> > >
> > > The benchmark creates a larger-than-memory set of anon pages with a
> > > flat access profile - to the VM a flood of one-off pages. Joonsoo's
> >
> > test: swap-w-rand-mt, which is a multi thread swap write intensive
> > workload so there will be swap out and swap ins.
> >
> > > first two patches allow the VM to usher those pages in and out of
> >
> > Weird part is, the robot says the performance gain comes from the 1st
> > patch only, which adjust the ratio, not including the 2nd patch which
> > makes anon page starting from inactive list.
> >
> > I find the performance gain hard to explain...
>
> Let me explain the reason of the performance gain.
>
> 1st patch provides more second chance to the anonymous pages.
By second chance, do I understand correctly that this refers to pages on the
inactive list getting moved back to the active list?
> In swap-w-rand-mt test, memory used by all threads is greater than the
> amount of the system memory, but, memory used by each thread would
> not be much. So, although it is a rand test, there is a locality
> in each thread's job. More second chance helps to exploit this
> locality so performance could be improved.
Does this mean there should be fewer vmstat.pswpout and vmstat.pswpin
with patch1 compared to vanilla?
Thanks.
On Fri, Feb 28, 2020 at 01:57:26PM +0800, Aaron Lu wrote:
> On Fri, Feb 28, 2020 at 01:03:03PM +0900, Joonsoo Kim wrote:
> > Hello,
> >
> > On Fri, Feb 28, 2020 at 11:23:58AM +0800, Aaron Lu wrote:
> > > On Thu, Feb 27, 2020 at 08:48:06AM -0500, Johannes Weiner wrote:
> > > > On Wed, Feb 26, 2020 at 07:39:42PM -0800, Andrew Morton wrote:
> > > > > It sounds like the above simple aging changes provide most of the
> > > > > improvement, and that the workingset changes are less beneficial and a
> > > > > bit more risky/speculative?
> > > > >
> > > > > If so, would it be best for us to concentrate on the aging changes
> > > > > first, let that settle in and spread out and then turn attention to the
> > > > > workingset changes?
> > > >
> > > > Those two patches work well for some workloads (like the benchmark),
> > > > but not for others. The full patchset makes sure both types work well.
> > > >
> > > > Specifically, the existing aging strategy for anon assumes that most
> > > > anon pages allocated are hot. That's why they all start active and we
> > > > then do second-chance with the small inactive LRU to filter out the
> > > > few cold ones to swap out. This is true for many common workloads.
> > > >
> > > > The benchmark creates a larger-than-memory set of anon pages with a
> > > > flat access profile - to the VM a flood of one-off pages. Joonsoo's
> > >
> > > test: swap-w-rand-mt, which is a multi thread swap write intensive
> > > workload so there will be swap out and swap ins.
> > >
> > > > first two patches allow the VM to usher those pages in and out of
> > >
> > > Weird part is, the robot says the performance gain comes from the 1st
> > > patch only, which adjust the ratio, not including the 2nd patch which
> > > makes anon page starting from inactive list.
> > >
> > > I find the performance gain hard to explain...
> >
> > Let me explain the reason of the performance gain.
> >
> > 1st patch provides more second chance to the anonymous pages.
>
> By second chance, do I understand correctely this refers to pages on
> inactive list get moved back to active list?
Yes.
>
> > In swap-w-rand-mt test, memory used by all threads is greater than the
> > amount of the system memory, but, memory used by each thread would
> > not be much. So, although it is a rand test, there is a locality
> > in each thread's job. More second chance helps to exploit this
> > locality so performance could be improved.
>
> Does this mean there should be fewer vmstat.pswpout and vmstat.pswpin
> with patch1 compared to vanilla?
It depends on the workload. If the workload consists of anonymous
pages only, then I think, yes, pswpout/pswpin would be lower than vanilla
with patch #1. With a large inactive list, we can easily find the
frequently referenced pages, and that results in less swap in/out.
Thanks.
On Fri, Feb 28, 2020 at 03:52:59PM +0900, Joonsoo Kim wrote:
> On Fri, Feb 28, 2020 at 01:57:26PM +0800, Aaron Lu wrote:
> > On Fri, Feb 28, 2020 at 01:03:03PM +0900, Joonsoo Kim wrote:
> > > Hello,
> > >
> > > On Fri, Feb 28, 2020 at 11:23:58AM +0800, Aaron Lu wrote:
> > > > On Thu, Feb 27, 2020 at 08:48:06AM -0500, Johannes Weiner wrote:
> > > > > On Wed, Feb 26, 2020 at 07:39:42PM -0800, Andrew Morton wrote:
> > > > > > It sounds like the above simple aging changes provide most of the
> > > > > > improvement, and that the workingset changes are less beneficial and a
> > > > > > bit more risky/speculative?
> > > > > >
> > > > > > If so, would it be best for us to concentrate on the aging changes
> > > > > > first, let that settle in and spread out and then turn attention to the
> > > > > > workingset changes?
> > > > >
> > > > > Those two patches work well for some workloads (like the benchmark),
> > > > > but not for others. The full patchset makes sure both types work well.
> > > > >
> > > > > Specifically, the existing aging strategy for anon assumes that most
> > > > > anon pages allocated are hot. That's why they all start active and we
> > > > > then do second-chance with the small inactive LRU to filter out the
> > > > > few cold ones to swap out. This is true for many common workloads.
> > > > >
> > > > > The benchmark creates a larger-than-memory set of anon pages with a
> > > > > flat access profile - to the VM a flood of one-off pages. Joonsoo's
> > > >
> > > > test: swap-w-rand-mt, which is a multi thread swap write intensive
> > > > workload so there will be swap out and swap ins.
> > > >
> > > > > first two patches allow the VM to usher those pages in and out of
> > > >
> > > > Weird part is, the robot says the performance gain comes from the 1st
> > > > patch only, which adjust the ratio, not including the 2nd patch which
> > > > makes anon page starting from inactive list.
> > > >
> > > > I find the performance gain hard to explain...
> > >
> > > Let me explain the reason of the performance gain.
> > >
> > > 1st patch provides more second chance to the anonymous pages.
> >
> > By second chance, do I understand correctely this refers to pages on
> > inactive list get moved back to active list?
>
> Yes.
>
> >
> > > In swap-w-rand-mt test, memory used by all threads is greater than the
> > > amount of the system memory, but, memory used by each thread would
> > > not be much. So, although it is a rand test, there is a locality
> > > in each thread's job. More second chance helps to exploit this
> > > locality so performance could be improved.
> >
> > Does this mean there should be fewer vmstat.pswpout and vmstat.pswpin
> > with patch1 compared to vanilla?
>
> It depends on the workload. If the workload consists of anonymous
This swap-w-rand-mt workload is anon only.
> pages only, I think, yes, pswpout/pswpin would be lower than vanilla
I think the LKP robot captured these two metrics but the report didn't
show them, which means the numbers are about the same with or without
patch #1.
> with patch #1. With large inactive list, we can easily find the
> frequently referenced page and it would result in less swap in/out.
But with a small inactive list, the pages that would be on the inactive list
will stay on the active list? I think the larger inactive list is mainly
used to give anon pages a chance to be promoted to the active list now
that anon pages land on the inactive list first, but on reclaim, I don't see
how a larger inactive list can cause fewer swap-outs.
Forgive my curiosity and feel free to ignore my question as I
don't want to waste your time on this. Your patchset looks like a worthwhile
thing to do; it's just that the robot's report on patch 1 seems, er...
On Fri, Feb 28, 2020 at 05:17:00PM +0800, Aaron Lu wrote:
> On Fri, Feb 28, 2020 at 03:52:59PM +0900, Joonsoo Kim wrote:
> > On Fri, Feb 28, 2020 at 01:57:26PM +0800, Aaron Lu wrote:
> > > On Fri, Feb 28, 2020 at 01:03:03PM +0900, Joonsoo Kim wrote:
> > > > Hello,
> > > >
> > > > On Fri, Feb 28, 2020 at 11:23:58AM +0800, Aaron Lu wrote:
> > > > > On Thu, Feb 27, 2020 at 08:48:06AM -0500, Johannes Weiner wrote:
> > > > > > On Wed, Feb 26, 2020 at 07:39:42PM -0800, Andrew Morton wrote:
> > > > > > > It sounds like the above simple aging changes provide most of the
> > > > > > > improvement, and that the workingset changes are less beneficial and a
> > > > > > > bit more risky/speculative?
> > > > > > >
> > > > > > > If so, would it be best for us to concentrate on the aging changes
> > > > > > > first, let that settle in and spread out and then turn attention to the
> > > > > > > workingset changes?
> > > > > >
> > > > > > Those two patches work well for some workloads (like the benchmark),
> > > > > > but not for others. The full patchset makes sure both types work well.
> > > > > >
> > > > > > Specifically, the existing aging strategy for anon assumes that most
> > > > > > anon pages allocated are hot. That's why they all start active and we
> > > > > > then do second-chance with the small inactive LRU to filter out the
> > > > > > few cold ones to swap out. This is true for many common workloads.
> > > > > >
> > > > > > The benchmark creates a larger-than-memory set of anon pages with a
> > > > > > flat access profile - to the VM a flood of one-off pages. Joonsoo's
> > > > >
> > > > > test: swap-w-rand-mt, which is a multi thread swap write intensive
> > > > > workload so there will be swap out and swap ins.
> > > > >
> > > > > > first two patches allow the VM to usher those pages in and out of
> > > > >
> > > > > Weird part is, the robot says the performance gain comes from the 1st
> > > > > patch only, which adjust the ratio, not including the 2nd patch which
> > > > > makes anon page starting from inactive list.
> > > > >
> > > > > I find the performance gain hard to explain...
> > > >
> > > > Let me explain the reason of the performance gain.
> > > >
> > > > 1st patch provides more second chance to the anonymous pages.
> > >
> > > By second chance, do I understand correctely this refers to pages on
> > > inactive list get moved back to active list?
> >
> > Yes.
> >
> > >
> > > > In swap-w-rand-mt test, memory used by all threads is greater than the
> > > > amount of the system memory, but, memory used by each thread would
> > > > not be much. So, although it is a rand test, there is a locality
> > > > in each thread's job. More second chance helps to exploit this
> > > > locality so performance could be improved.
> > >
> > > Does this mean there should be fewer vmstat.pswpout and vmstat.pswpin
> > > with patch1 compared to vanilla?
> >
> > It depends on the workload. If the workload consists of anonymous
>
> This swap-rand-w-mt workload is anon only.
Yes, I know.
>
> > pages only, I think, yes, pswpout/pswpin would be lower than vanilla
>
> I think LKP robot has captured these two metrics but the report didn't
> show them, which means the number is about the same with or without
> patch #1.
The robot showed these two metrics. See below.
50190319 ± 31% -35.7% 32291856 ± 14% proc-vmstat.pswpin
56429784 ± 21% -42.6% 32386842 ± 14% proc-vmstat.pswpout
pswpin/out are improved.
> > with patch #1. With large inactive list, we can easily find the
> > frequently referenced page and it would result in less swap in/out.
>
> But with small inactive list, the pages that would be on inactive list
> will stay on active list? I think the larger inactive list is mainly
> used to give the anon page a chance to be promoted to active list now
> that anon pages land on inactive list first, but on reclaim, I don't see
> how a larger inactive list can cause fewer swap outs.
The point is that a larger inactive LRU helps to find hot pages, and these
hot pages lead to more cache hits.
When a cache hit happens, no swap-out happens. But if a cache miss
happens, a new page is added to the LRU, and that causes reclaim
and swap-out.
> Forgive me for my curiosity and feel free to ignore my question as I
> don't want to waste your time on this. Your patchset looks a worthwhile
> thing to do, it's just the robot's report on patch1 seems er...
I appreciate your attention. Feel free to ask. :)
Thanks.
On Fri, Feb 28, 2020 at 06:56:11PM +0900, Joonsoo Kim wrote:
> On Fri, Feb 28, 2020 at 05:17:00PM +0800, Aaron Lu wrote:
> > I think LKP robot has captured these two metrics but the report didn't
> > show them, which means the number is about the same with or without
> > patch #1.
>
> robot showed these two metrics. See below.
>
> 50190319 ± 31% -35.7% 32291856 ± 14% proc-vmstat.pswpin
> 56429784 ± 21% -42.6% 32386842 ± 14% proc-vmstat.pswpout
>
> pswpin/out are improved.
Oh yes, I checked the vmstat part when I should have checked the proc-vmstat
part... Sorry for missing this.
>
> > > with patch #1. With large inactive list, we can easily find the
> > > frequently referenced page and it would result in less swap in/out.
> >
> > But with small inactive list, the pages that would be on inactive list
> > will stay on active list? I think the larger inactive list is mainly
> > used to give the anon page a chance to be promoted to active list now
> > that anon pages land on inactive list first, but on reclaim, I don't see
> > how a larger inactive list can cause fewer swap outs.
>
> Point is that larger inactive LRU helps to find hot pages and these
> hot pages leads to more cache hits.
>
> When a cache hit happens, no swap outs happens. But, if a cache miss
> happens, a new page is added to the LRU and then it causes the reclaim
> and swap out.
OK, I think I'm starting to get your point. Your explanation makes sense.
> > Forgive me for my curiosity and feel free to ignore my question as I
> > don't want to waste your time on this. Your patchset looks a worthwhile
> > thing to do, it's just the robot's report on patch1 seems er...
>
> I appreciate your attention. Feel free to ask. :)
Thanks a lot for your patience and nice explanation :-)
On Thu, 27 Feb 2020 16:48:47 +0900 Joonsoo Kim <[email protected]> wrote:
> Hello, Andrew.
>
> > > * SOLUTION
> > > Since this is what we want to avoid, this patchset implements workingset
> > > protection. Like as the file LRU list, newly created or swap-in anonymous
> > > page is started on the inactive list. Also, like as the file LRU list,
> > > if enough reference happens, the page will be promoted. This simple
> > > modification changes the above example as following.
> >
> > One wonders why on earth we weren't doing these things in the first
> > place?
>
> I don't know. I tried to find the origin of this behaviour and found
> that it's from you 18 years ago. :)
>
> It mentions that starting pages on the active list boosts throughput on
> stupid swapstormy test but I cannot guess the exact reason of such
> improvement.
>
> Anyway, Following is the related patch history. Could you remember
> anything about it?
>
erm, yes, that was a long time ago ;) I guess enough other things have
changed since then to necessitate a revisit!
2020년 2월 28일 (금) 오전 8:36, Andrew Morton <[email protected]>님이 작성:
>
> On Thu, 27 Feb 2020 08:48:06 -0500 Johannes Weiner <[email protected]> wrote:
>
> > > It sounds like the above simple aging changes provide most of the
> > > improvement, and that the workingset changes are less beneficial and a
> > > bit more risky/speculative?
> > >
> > > If so, would it be best for us to concentrate on the aging changes
> > > first, let that settle in and spread out and then turn attention to the
> > > workingset changes?
> >
> > Those two patches work well for some workloads (like the benchmark),
> > but not for others. The full patchset makes sure both types work well.
> >
> > Specifically, the existing aging strategy for anon assumes that most
> > anon pages allocated are hot. That's why they all start active and we
> > then do second-chance with the small inactive LRU to filter out the
> > few cold ones to swap out. This is true for many common workloads.
> >
> > The benchmark creates a larger-than-memory set of anon pages with a
> > flat access profile - to the VM a flood of one-off pages. Joonsoo's
> > first two patches allow the VM to usher those pages in and out of
> > memory very quickly, which explains the throughput boost. But it comes
> > at the cost of reducing space available to hot anon pages, which will
> > regress others.
> >
> > Joonsoo's full patchset makes the VM support both types of workloads
> > well: by putting everything on the inactive list first, one-off pages
> > can move through the system without disturbing the hot pages. And by
> > supplementing the inactive list with non-resident information, he can
> > keep it tiny without the risk of one-off pages drowning out new hot
> > pages. He can retain today's level of active page protection and
> > detection, while allowing one-off pages to move through quickly.
>
> Helpful, thanks.
>
> At v2 with no evident review input I'd normally take a pass at this
> stage. But given all the potential benefits, perhaps I should be more
> aggressive here?
I hope so. It would boost the review. :)
Thanks.
2020년 3월 3일 (화) 오전 8:31, Joonsoo Kim <[email protected]>님이 작성:
>
> 2020년 2월 28일 (금) 오전 8:36, Andrew Morton <[email protected]>님이 작성:
> >
> > On Thu, 27 Feb 2020 08:48:06 -0500 Johannes Weiner <[email protected]> wrote:
> >
> > > > It sounds like the above simple aging changes provide most of the
> > > > improvement, and that the workingset changes are less beneficial and a
> > > > bit more risky/speculative?
> > > >
> > > > If so, would it be best for us to concentrate on the aging changes
> > > > first, let that settle in and spread out and then turn attention to the
> > > > workingset changes?
> > >
> > > Those two patches work well for some workloads (like the benchmark),
> > > but not for others. The full patchset makes sure both types work well.
> > >
> > > Specifically, the existing aging strategy for anon assumes that most
> > > anon pages allocated are hot. That's why they all start active and we
> > > then do second-chance with the small inactive LRU to filter out the
> > > few cold ones to swap out. This is true for many common workloads.
> > >
> > > The benchmark creates a larger-than-memory set of anon pages with a
> > > flat access profile - to the VM a flood of one-off pages. Joonsoo's
> > > first two patches allow the VM to usher those pages in and out of
> > > memory very quickly, which explains the throughput boost. But it comes
> > > at the cost of reducing space available to hot anon pages, which will
> > > regress others.
> > >
> > > Joonsoo's full patchset makes the VM support both types of workloads
> > > well: by putting everything on the inactive list first, one-off pages
> > > can move through the system without disturbing the hot pages. And by
> > > supplementing the inactive list with non-resident information, he can
> > > keep it tiny without the risk of one-off pages drowning out new hot
> > > pages. He can retain today's level of active page protection and
> > > detection, while allowing one-off pages to move through quickly.
> >
> > Helpful, thanks.
> >
> > At v2 with no evident review input I'd normally take a pass at this
> > stage. But given all the potential benefits, perhaps I should be more
> > aggressive here?
>
> I hope so. It would boost the review. :)
Hello, Andrew.
Would you mind telling me your plan for this patchset?
Merging it into the next tree would accelerate the review and testing. :)
Thanks.
On Thu, Feb 20, 2020 at 02:11:45PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Current implementation of LRU management for anonymous page has some
> problems. Most important one is that it doesn't protect the workingset,
> that is, pages on the active LRU list. Although, this problem will be
> fixed in the following patchset, the preparation is required and
> this patch does it.
>
> What following patchset does is to restore workingset protection. In this
> case, newly created or swap-in pages are started their lifetime on the
> inactive list. If inactive list is too small, there is not enough chance
> to be referenced and the page cannot become the workingset.
>
> In order to provide enough chance to the newly anonymous pages, this patch
> makes active/inactive LRU ratio as 1:1.
Patch 8/9 is a revert of this patch. I assume you did this for the
series to be bisectable and partially revertable, but I'm not sure
keeping only the first and second patch would be safe: they reduce
workingset protection quite dramatically on their own (on a 10G system
from 90% of RAM to 50% e.g.) and likely cause regressions.
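As a back-of-the-envelope check on those numbers, here is a small sketch
(assuming the pre-patch heuristic of inactive_ratio = int_sqrt(10 * gb) in
inactive_list_is_low(); illustrative arithmetic, not kernel code):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double gb = 10.0;				/* ~10G of anon pages */
	double old_ratio = sqrt(10.0 * gb);		/* int_sqrt(100) = 10 */
	double old_active = old_ratio / (old_ratio + 1.0);
	double new_active = 1.0 / (1.0 + 1.0);		/* patch 1: 1:1 ratio */

	printf("old active share: %.0f%%\n", old_active * 100.0);	/* ~91%, i.e. "90% of RAM" */
	printf("new active share: %.0f%%\n", new_active * 100.0);	/* 50% */
	return 0;
}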
So while patch 2 is probably a lot better with patch 1 than without,
it seems a bit unnecessary since we cannot keep patch 2 on its own. We
need the rest of the series to make these changes.
On the other hand, the patch is small and obviously correct. So no
strong feelings either way.
> Signed-off-by: Joonsoo Kim <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
On Thu, Feb 20, 2020 at 02:11:46PM +0900, [email protected] wrote:
> @@ -1010,8 +1010,15 @@ static enum page_references page_check_references(struct page *page,
> return PAGEREF_RECLAIM;
>
> if (referenced_ptes) {
> - if (PageSwapBacked(page))
> - return PAGEREF_ACTIVATE;
> + if (PageSwapBacked(page)) {
> + if (referenced_page) {
> + ClearPageReferenced(page);
> + return PAGEREF_ACTIVATE;
> + }
This looks odd to me. referenced_page = TestClearPageReferenced()
above, so it has already been cleared. Why clear it again?
> +
> + SetPageReferenced(page);
> + return PAGEREF_KEEP;
> + }
The existing file code already does:
SetPageReferenced(page);
if (referenced_page || referenced_ptes > 1)
return PAGEREF_ACTIVATE;
if (vm_flags & VM_EXEC)
return PAGEREF_ACTIVATE;
return PAGEREF_KEEP;
The differences are:
1) referenced_ptes > 1. We did this so that heavily shared file
mappings are protected a bit better than others. Arguably the same
could apply for anon pages when we put them on the inactive list.
2) vm_flags & VM_EXEC. This mostly doesn't apply to anon pages. The
exception would be jit code pages, but if we put anon pages on the
inactive list we should protect jit code the same way we protect file
executables.
Seems to me you don't need to add anything. Just remove the
PageSwapBacked branch and apply equal treatment to both types.
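A hedged sketch of what that equal treatment could look like, written as
standalone C rather than a drop-in patch; 'scan_state' and the helper below
are stand-ins for struct page and the real page flag accessors.

#include <stdbool.h>

enum page_references { PAGEREF_RECLAIM, PAGEREF_KEEP, PAGEREF_ACTIVATE };

struct scan_state {
	bool referenced_flag;	/* stands in for PG_referenced */
	bool vm_exec_mapping;	/* referenced via a VM_EXEC mapping */
};

/* same second-chance rules for mapped pages, file- or swap-backed alike */
static enum page_references check_mapped_refs(struct scan_state *p,
					      int referenced_ptes)
{
	bool referenced_page = p->referenced_flag;

	p->referenced_flag = false;		/* TestClearPageReferenced() */

	if (!referenced_ptes)
		return PAGEREF_RECLAIM;		/* no access since last scan */

	p->referenced_flag = true;		/* SetPageReferenced() */

	if (referenced_page || referenced_ptes > 1)
		return PAGEREF_ACTIVATE;	/* second cycle, or heavily shared */

	if (p->vm_exec_mapping)
		return PAGEREF_ACTIVATE;	/* whether anon keeps this is debated below */

	return PAGEREF_KEEP;			/* rotate once more on the inactive list */
}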
> @@ -2056,6 +2063,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
> }
> }
>
> + /*
> + * Now, newly created anonymous page isn't appened to the
> + * active list. We don't need to clear the reference bit here.
> + */
> + if (PageSwapBacked(page)) {
> + ClearPageReferenced(page);
> + goto deactivate;
> + }
I don't understand this.
If you don't clear the pte references, you're leaving behind stale
data. You already decide here that we consider the page referenced
when it reaches the end of the inactive list, regardless of what
happens in between. That makes the deactivation kind of useless.
And it blurs the lines between the inactive and active list.
shrink_page_list() (and page_check_references()) are written with the
notion that any references they look at are from the inactive list. If
you carry over stale data, this can cause more subtle bugs later on.
And again, I don't quite understand why anon would need different
treatment here than file.
> +
> if (page_referenced(page, 0, sc->target_mem_cgroup,
> &vm_flags)) {
> nr_rotated += hpage_nr_pages(page);
> @@ -2074,6 +2090,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
> }
> }
>
> +deactivate:
> ClearPageActive(page); /* we are de-activating */
> SetPageWorkingset(page);
> list_add(&page->lru, &l_inactive);
> --
> 2.7.4
>
2020년 3월 12일 (목) 오후 11:47, Johannes Weiner <[email protected]>님이 작성:
>
> On Thu, Feb 20, 2020 at 02:11:45PM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > Current implementation of LRU management for anonymous page has some
> > problems. Most important one is that it doesn't protect the workingset,
> > that is, pages on the active LRU list. Although, this problem will be
> > fixed in the following patchset, the preparation is required and
> > this patch does it.
> >
> > What following patchset does is to restore workingset protection. In this
> > case, newly created or swap-in pages are started their lifetime on the
> > inactive list. If inactive list is too small, there is not enough chance
> > to be referenced and the page cannot become the workingset.
> >
> > In order to provide enough chance to the newly anonymous pages, this patch
> > makes active/inactive LRU ratio as 1:1.
>
> Patch 8/9 is a revert of this patch. I assume you did this for the
> series to be bisectable and partially revertable, but I'm not sure
> keeping only the first and second patch would be safe: they reduce
> workingset protection quite dramatically on their own (on a 10G system
> from 90% of RAM to 50% e.g.) and likely cause regressions.
>
> So while patch 2 is probably a lot better with patch 1 than without,
Yes, that is what I intended.
> it seems a bit unnecessary since we cannot keep patch 2 on its own. We
> need the rest of the series to make these changes.
Yes, you're right.
> On the other hand, the patch is small and obviously correct. So no
> strong feelings either way.
Okay. I will keep the patches since I think that these patches will
help someone who wants to understand the LRU management.
> > Signed-off-by: Joonsoo Kim <[email protected]>
>
> Reviewed-by: Johannes Weiner <[email protected]>
Thanks!
2020년 3월 13일 (금) 오전 12:14, Johannes Weiner <[email protected]>님이 작성:
>
> On Thu, Feb 20, 2020 at 02:11:46PM +0900, [email protected] wrote:
> > @@ -1010,8 +1010,15 @@ static enum page_references page_check_references(struct page *page,
> > return PAGEREF_RECLAIM;
> >
> > if (referenced_ptes) {
> > - if (PageSwapBacked(page))
> > - return PAGEREF_ACTIVATE;
> > + if (PageSwapBacked(page)) {
> > + if (referenced_page) {
> > + ClearPageReferenced(page);
> > + return PAGEREF_ACTIVATE;
> > + }
>
> This looks odd to me. referenced_page = TestClearPageReferenced()
> above, so it's already be clear. Why clear it again?
Oops... it's just my fault. Will remove it.
> > +
> > + SetPageReferenced(page);
> > + return PAGEREF_KEEP;
> > + }
>
> The existing file code already does:
>
> SetPageReferenced(page);
> if (referenced_page || referenced_ptes > 1)
> return PAGEREF_ACTIVATE;
> if (vm_flags & VM_EXEC)
> return PAGEREF_ACTIVATE;
> return PAGEREF_KEEP;
>
> The differences are:
>
> 1) referenced_ptes > 1. We did this so that heavily shared file
> mappings are protected a bit better than others. Arguably the same
> could apply for anon pages when we put them on the inactive list.
Yes, these checks should be included for anon.
> 2) vm_flags & VM_EXEC. This mostly doesn't apply to anon pages. The
> exception would be jit code pages, but if we put anon pages on the
> inactive list we should protect jit code the same way we protect file
> executables.
I'm not sure that this is necessary for anon pages. From my understanding,
an executable mapped file page is more precious than other mapped file pages
because this mapping is usually used by *multiple* threads and there is
no way for the MM to check that. If anon JIT code also has such a characteristic,
this check should be included for anon, but it should be added separately. It
seems that it's beyond the scope of this patch.
> Seems to me you don't need to add anything. Just remove the
> PageSwapBacked branch and apply equal treatment to both types.
I will rework the code if you agree with my opinion.
> > @@ -2056,6 +2063,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
> > }
> > }
> >
> > + /*
> > + * Now, newly created anonymous page isn't appened to the
> > + * active list. We don't need to clear the reference bit here.
> > + */
> > + if (PageSwapBacked(page)) {
> > + ClearPageReferenced(page);
> > + goto deactivate;
> > + }
>
> I don't understand this.
>
> If you don't clear the pte references, you're leaving behind stale
> data. You already decide here that we consider the page referenced
> when it reaches the end of the inactive list, regardless of what
> happens in between. That makes the deactivation kind of useless.
My idea is that pages newly appended to the inactive list, for example
a newly allocated anon page or a deactivated page, should start from the same line.
A newly allocated anon page would have a mapping (reference), so I
wrote this code to let a deactivated page keep a mapping (reference) as well.
I think there is no reason to devalue a page that was accessed on the active list.
Before this patch, all newly allocated anon pages started on the active
list, so clearing the pte reference on deactivation was required
to check for further access. That is no longer the case, so I skip it here.
> And it blurs the lines between the inactive and active list.
>
> shrink_page_list() (and page_check_references()) are written with the
> notion that any references they look at are from the inactive list. If
> you carry over stale data, this can cause more subtle bugs later on.
It's not. For file pages, PageReferenced() is maintained even if deactivation
happens, and it means one reference.
> And again, I don't quite understand why anon would need different
> treatment here than file.
In order to preserve the current behaviour for file pages, I leave the code
as is for them and change the code only for anon pages. There is a
fundamental difference between them in how references are checked:
accessed by mapping versus accessed by syscall. I think that some difference
would be acceptable.
Thanks.
On Fri, Mar 13, 2020 at 04:40:18PM +0900, Joonsoo Kim wrote:
> 2020년 3월 13일 (금) 오전 12:14, Johannes Weiner <[email protected]>님이 작성:
> >
> > On Thu, Feb 20, 2020 at 02:11:46PM +0900, [email protected] wrote:
> > > @@ -1010,8 +1010,15 @@ static enum page_references page_check_references(struct page *page,
> > > return PAGEREF_RECLAIM;
> > >
> > > if (referenced_ptes) {
> > > - if (PageSwapBacked(page))
> > > - return PAGEREF_ACTIVATE;
> > > + if (PageSwapBacked(page)) {
> > > + if (referenced_page) {
> > > + ClearPageReferenced(page);
> > > + return PAGEREF_ACTIVATE;
> > > + }
> >
> > This looks odd to me. referenced_page = TestClearPageReferenced()
> > above, so it's already be clear. Why clear it again?
>
> Oops... it's just my fault. Will remove it.
>
> > > +
> > > + SetPageReferenced(page);
> > > + return PAGEREF_KEEP;
> > > + }
> >
> > The existing file code already does:
> >
> > SetPageReferenced(page);
> > if (referenced_page || referenced_ptes > 1)
> > return PAGEREF_ACTIVATE;
> > if (vm_flags & VM_EXEC)
> > return PAGEREF_ACTIVATE;
> > return PAGEREF_KEEP;
> >
> > The differences are:
> >
> > 1) referenced_ptes > 1. We did this so that heavily shared file
> > mappings are protected a bit better than others. Arguably the same
> > could apply for anon pages when we put them on the inactive list.
>
> Yes, these check should be included for anon.
>
> > 2) vm_flags & VM_EXEC. This mostly doesn't apply to anon pages. The
> > exception would be jit code pages, but if we put anon pages on the
> > inactive list we should protect jit code the same way we protect file
> > executables.
>
> I'm not sure that this is necessary for anon page. From my understanding,
> executable mapped file page is more precious than other mapped file page
> because this mapping is usually used by *multiple* thread and there is
> no way to check it by MM. If anon JIT code has also such characteristic, this
> code should be included for anon, but, should be included separately. It
> seems that it's beyond of this patch.
The sharing is what the referenced_ptes > 1 check is for.
The problem with executables is that when they are referenced, they
get a *lot* of references compared to data pages. Think about an
instruction stream and how many of those instructions result in data
references. So when you see an executable page that is being accessed,
it's likely being accessed at a high rate. They're much hotter, and
that's why reference bits from VM_EXEC mappings carry more weight.
IMO this applies to executable file and anon equally.
> > Seems to me you don't need to add anything. Just remove the
> > PageSwapBacked branch and apply equal treatment to both types.
>
> I will rework the code if you agree with my opinion.
>
> > > @@ -2056,6 +2063,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
> > > }
> > > }
> > >
> > > + /*
> > > + * Now, newly created anonymous page isn't appened to the
> > > + * active list. We don't need to clear the reference bit here.
> > > + */
> > > + if (PageSwapBacked(page)) {
> > > + ClearPageReferenced(page);
> > > + goto deactivate;
> > > + }
> >
> > I don't understand this.
> >
> > If you don't clear the pte references, you're leaving behind stale
> > data. You already decide here that we consider the page referenced
> > when it reaches the end of the inactive list, regardless of what
> > happens in between. That makes the deactivation kind of useless.
>
> My idea is that the pages newly appended to the inactive list, for example,
> a newly allocated anon page or deactivated page, start at the same line.
> A newly allocated anon page would have a mapping (reference) so I
> made this code to help for deactivated page to have a mapping (reference).
> I think that there is no reason to devalue the page accessed on active list.
I don't think that leads to desirable behavior, because it causes an
age inversion between deactivated and freshly instantiated pages.
We know the new page was referenced when it entered the head of the
inactive list. However, the old page's reference could be much, much
longer in the past. Hours ago. So when they both reach the end of the
list, we treat them as equally hot even though the new page has been
referenced very recently and the old page might be completely stale.
Keep in mind that we only deactivate in the first place because the
inactive list is struggling and we need to get rid of stale active
pages. We're in a workingset transition and *should* be giving old
pages the chance to move out quickly.
> Before this patch is applied, all newly allocated anon page are started
> at the active list so clearing the pte reference on deactivation is required
> to check the further access. However, it is not the case so I skip it here.
>
> > And it blurs the lines between the inactive and active list.
> >
> > shrink_page_list() (and page_check_references()) are written with the
> > notion that any references they look at are from the inactive list. If
> > you carry over stale data, this can cause more subtle bugs later on.
>
> It's not. For file page, PageReferenced() is maintained even if deactivation
> happens and it means one reference.
shrink_page_list() doesn't honor PageReferenced as a reference.
PG_referenced is primarily for the mark_page_accessed() state machine,
which is different from the reclaim scanner's reference tracking: for
unmapped pages we can detect accesses in realtime and don't need the
reference sampling from LRU cycle to LRU cycle. The bit carries over a
deactivation, but it doesn't prevent reclaim from freeing the page.
For mapped pages, we sample references using the LRU cycles, and
PG_referenced is otherwise unused. We repurpose it to implement
second-chance tracking of inactive pages with pte refs. It counts
inactive list cycles, not references.
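For contrast, here is a simplified sketch of the mark_page_accessed() state
machine described above for unmapped pages; the struct and field names are
stand-ins for the real page flags, not kernel API.

#include <stdbool.h>

struct page_state {
	bool referenced;	/* PG_referenced */
	bool active;		/* currently on the active list */
};

/* unmapped pages: accesses are seen synchronously, at syscall time */
static void mark_accessed(struct page_state *p)
{
	if (!p->active && p->referenced) {
		/* second access while still inactive: promote right away */
		p->active = true;
		p->referenced = false;
	} else if (!p->referenced) {
		/* first access: just note it in PG_referenced */
		p->referenced = true;
	}
}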
> > And again, I don't quite understand why anon would need different
> > treatment here than file.
>
> In order to preserve the current behaviour for the file page, I leave the code
> as is for the file page and change the code for the anon page. There is
> fundamental difference between them such as how referenced is checked,
> accessed by mapping and accessed by syscall. I think that some difference
> would be admitted.
Right, unmapped pages have their own reference tracking system because
they can be detected synchronously.
My questions center around this:
We have an existing sampling algorithm for the coarse-grained page
table referenced bit, where we start pages on inactive, treat
references a certain way, target a certain inactive:active ratio, use
refault information to detect workingset transitions etc. Anon used a
different system in the past, but your patch set switches it over to
the more universal model we have developed for mapped file pages.
However, you don't switch it over to this exact model we have for
mapped files, but rather a slightly modified version. And I don't
quite understand the rationale behind the individual modifications.
So let me turn it around. What would be the downsides of aging mapped
anon exactly the same way we age mapped files? Can we identify where
differences are necessary and document them?
2020년 3월 14일 (토) 오전 4:55, Johannes Weiner <[email protected]>님이 작성:
>
> On Fri, Mar 13, 2020 at 04:40:18PM +0900, Joonsoo Kim wrote:
> > 2020년 3월 13일 (금) 오전 12:14, Johannes Weiner <[email protected]>님이 작성:
> > >
> > > On Thu, Feb 20, 2020 at 02:11:46PM +0900, [email protected] wrote:
> > > > @@ -1010,8 +1010,15 @@ static enum page_references page_check_references(struct page *page,
> > > > return PAGEREF_RECLAIM;
> > > >
> > > > if (referenced_ptes) {
> > > > - if (PageSwapBacked(page))
> > > > - return PAGEREF_ACTIVATE;
> > > > + if (PageSwapBacked(page)) {
> > > > + if (referenced_page) {
> > > > + ClearPageReferenced(page);
> > > > + return PAGEREF_ACTIVATE;
> > > > + }
> > >
> > > This looks odd to me. referenced_page = TestClearPageReferenced()
> > > above, so it's already be clear. Why clear it again?
> >
> > Oops... it's just my fault. Will remove it.
> >
> > > > +
> > > > + SetPageReferenced(page);
> > > > + return PAGEREF_KEEP;
> > > > + }
> > >
> > > The existing file code already does:
> > >
> > > SetPageReferenced(page);
> > > if (referenced_page || referenced_ptes > 1)
> > > return PAGEREF_ACTIVATE;
> > > if (vm_flags & VM_EXEC)
> > > return PAGEREF_ACTIVATE;
> > > return PAGEREF_KEEP;
> > >
> > > The differences are:
> > >
> > > 1) referenced_ptes > 1. We did this so that heavily shared file
> > > mappings are protected a bit better than others. Arguably the same
> > > could apply for anon pages when we put them on the inactive list.
> >
> > Yes, these check should be included for anon.
> >
> > > 2) vm_flags & VM_EXEC. This mostly doesn't apply to anon pages. The
> > > exception would be jit code pages, but if we put anon pages on the
> > > inactive list we should protect jit code the same way we protect file
> > > executables.
> >
> > I'm not sure that this is necessary for anon page. From my understanding,
> > executable mapped file page is more precious than other mapped file page
> > because this mapping is usually used by *multiple* thread and there is
> > no way to check it by MM. If anon JIT code has also such characteristic, this
> > code should be included for anon, but, should be included separately. It
> > seems that it's beyond of this patch.
>
> The sharing is what the referenced_ptes > 1 check is for.
Sharing between processes can be checked with referenced_ptes,
but sharing between threads cannot. I guessed that the
VM_EXEC check was for the thread case, since most executables are usually
shared between threads. But, given your comment below, I have reconsidered
the reason for the VM_EXEC check. See below.
> The problem with executables is that when they are referenced, they
> get a *lot* of references compared to data pages. Think about an
> instruction stream and how many of those instructions result in data
> references. So when you see an executable page that is being accessed,
> it's likely being accessed at a high rate. They're much hotter, and
> that's why reference bits from VM_EXEC mappings carry more weight.
Sounds reasonable.
But now I have another thought about it. I think that the root reason for
this check is that there are two different reference tracking models
on the file LRU. If we assume that all file pages are mapped ones, this special
check isn't needed: if executable pages are accessed more frequently than
others, more references can be found for them at each LRU cycle, so there is
no need for the special check.
However, file pages include both syscall-accessed pages and mapped pages, and
the reference tracking model for mapped pages is at a disadvantage to the
one for syscall-accessed pages: several references to a mapped page are
counted as just one per LRU cycle. I think that this check is there to mitigate
that disadvantage, at least for executable file-mapped pages.
> IMO this applies to executable file and anon equally.
The anon LRU consists entirely of mapped pages, so, IMHO, there is no need for
special handling of executable pages on the anon LRU. Although references
are only checked at each LRU cycle, that would still correctly approximate
the reference frequency.
Moreover, there is a comment in shrink_active_list() that explains one
reason for omitting this check for anon pages:
"Anon pages are not likely to be evicted by use-once streaming IO, plus JVM
can create lots of anon VM_EXEC pages, so we ignore them here."
Lastly, as I said before, this should at least be done separately, with
appropriate investigation.
> > > Seems to me you don't need to add anything. Just remove the
> > > PageSwapBacked branch and apply equal treatment to both types.
> >
> > I will rework the code if you agree with my opinion.
> >
> > > > @@ -2056,6 +2063,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
> > > > }
> > > > }
> > > >
> > > > + /*
> > > > + * Now, newly created anonymous page isn't appened to the
> > > > + * active list. We don't need to clear the reference bit here.
> > > > + */
> > > > + if (PageSwapBacked(page)) {
> > > > + ClearPageReferenced(page);
> > > > + goto deactivate;
> > > > + }
> > >
> > > I don't understand this.
> > >
> > > If you don't clear the pte references, you're leaving behind stale
> > > data. You already decide here that we consider the page referenced
> > > when it reaches the end of the inactive list, regardless of what
> > > happens in between. That makes the deactivation kind of useless.
> >
> > My idea is that the pages newly appended to the inactive list, for example,
> > a newly allocated anon page or deactivated page, start at the same line.
> > A newly allocated anon page would have a mapping (reference) so I
> > made this code to help for deactivated page to have a mapping (reference).
> > I think that there is no reason to devalue the page accessed on active list.
>
> I don't think that leads to desirable behavior, because it causes an
> age inversion between deactivated and freshly instantiated pages.
>
> We know the new page was referenced when it entered the head of the
> inactive list. However, the old page's reference could be much, much
> longer in the past. Hours ago. So when they both reach the end of the
> list, we treat them as equally hot even though the new page has been
> referenced very recently and the old page might be completely stale.
>
> Keep in mind that we only deactivate in the first place because the
> inactive list is struggling and we need to get rid of stale active
> pages. We're in a workingset transition and *should* be giving old
> pages the chance to move out quickly.
You're right. I will fix it.
> > Before this patch is applied, all newly allocated anon page are started
> > at the active list so clearing the pte reference on deactivation is required
> > to check the further access. However, it is not the case so I skip it here.
> >
> > > And it blurs the lines between the inactive and active list.
> > >
> > > shrink_page_list() (and page_check_references()) are written with the
> > > notion that any references they look at are from the inactive list. If
> > > you carry over stale data, this can cause more subtle bugs later on.
> >
> > It's not. For file page, PageReferenced() is maintained even if deactivation
> > happens and it means one reference.
>
> shrink_page_list() doesn't honor PageReferenced as a reference.
>
> PG_referenced is primarily for the mark_page_accessed() state machine,
> which is different from the reclaim scanner's reference tracking: for
> unmapped pages we can detect accesses in realtime and don't need the
> reference sampling from LRU cycle to LRU cycle. The bit carries over a
> deactivation, but it doesn't prevent reclaim from freeing the page.
>
> For mapped pages, we sample references using the LRU cycles, and
> PG_referenced is otherwise unused. We repurpose it to implement
> second-chance tracking of inactive pages with pte refs. It counts
> inactive list cycles, not references.
Okay.
> > > And again, I don't quite understand why anon would need different
> > > treatment here than file.
> >
> > In order to preserve the current behaviour for the file page, I leave the code
> > as is for the file page and change the code for the anon page. There is
> > fundamental difference between them such as how referenced is checked,
> > accessed by mapping and accessed by syscall. I think that some difference
> > would be admitted.
>
> Right, unmapped pages have their own reference tracking system because
> they can be detected synchronously.
>
> My questions center around this:
>
> We have an existing sampling algorithm for the coarse-grained page
> table referenced bit, where we start pages on inactive, treat
> references a certain way, target a certain inactive:active ratio, use
> refault information to detect workingset transitions etc. Anon used a
> different system in the past, but your patch set switches it over to
> the more universal model we have developed for mapped file pages.
>
> However, you don't switch it over to this exact model we have for
> mapped files, but rather a slightly modified version. And I don't
> quite understand the rationale behind the individual modifications.
>
> So let me turn it around. What would be the downsides of aging mapped
> anon exactly the same way we age mapped files? Can we identify where
> differences are necessary and document them?
Now, I plan to make the next version with all your comments applied except
for the VM_EXEC case. As I said above, the fundamental difference between
file-mapped pages and anon-mapped pages is that the file LRU, where file-mapped
pages are managed, uses two reference tracking models, while the anon LRU uses
just one. The file LRU needs some heuristic to adjust for the difference between
the two models, but the anon LRU doesn't need one at all. And I think the
VM_EXEC check exists for this case.
Please let me know your opinion about the VM_EXEC case. I will start the rework
if you agree with my thinking.
Anyway, all your comments are very helpful to me. Really, thanks Johannes.
Thanks.
On Mon, Mar 16, 2020 at 04:05:30PM +0900, Joonsoo Kim wrote:
> 2020년 3월 14일 (토) 오전 4:55, Johannes Weiner <[email protected]>님이 작성:
> > The problem with executables is that when they are referenced, they
> > get a *lot* of references compared to data pages. Think about an
> > instruction stream and how many of those instructions result in data
> > references. So when you see an executable page that is being accessed,
> > it's likely being accessed at a high rate. They're much hotter, and
> > that's why reference bits from VM_EXEC mappings carry more weight.
>
> Sound reasonable.
>
> But, now, I have another thought about it. I think that the root of the reason
> of this check is that there are two different reference tracking models
> on file LRU. If we assume that all file pages are mapped ones, this special
> check isn't needed. If executable pages are accessed more frequently than
> others, reference can be found more for them at LRU cycle. No need for this
> special check.
>
> However, file pages includes syscall-ed pages and mapped pages, and,
> reference tracking models for mapped page has a disadvantage to
> one for syscall-ed page. Several references for mapped page would be
> considered as one at LRU cycle. I think that this check is to mitigate
> this disadvantage, at least, for executable file mapped page.
Hm, I don't quite understand this reasoning. Yes, there are two models
for unmapped and mapped file pages. But both types of pages get
activated with two or more references: for unmapped it's tracked
through mark_page_accessed(), and for mapped it's the two LRU cycles
with the referenced bit set (unmapped pages don't get that extra trip
around the LRU with one reference). With that alone, mapped pages and
unmapped pages should have equal chances, no?
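[ Illustration: a minimal userspace model of the two activation paths compared
above. This is not kernel code; struct toy_page, touch_syscall(),
touch_mapped() and lru_scan() are made-up names that only follow the spirit of
mark_page_accessed() and of the referenced-bit sampling done at reclaim. ]

#include <stdbool.h>
#include <stdio.h>

/* Toy page state for illustration only; not the kernel's struct page. */
struct toy_page {
        bool referenced;   /* software PG_referenced-style flag       */
        bool active;       /* promoted to the active list?            */
        bool pte_accessed; /* hardware accessed bit in the page table */
};

/* Unmapped (syscall-accessed) pages are tracked synchronously, in the
 * spirit of mark_page_accessed(): the first access marks the page
 * referenced, the second access activates it. */
static void touch_syscall(struct toy_page *p)
{
        if (!p->referenced) {
                p->referenced = true;
        } else if (!p->active) {
                p->active = true;
                p->referenced = false;
        }
}

/* Mapped pages only get the page-table accessed bit set... */
static void touch_mapped(struct toy_page *p)
{
        p->pte_accessed = true;
}

/* ...which reclaim samples once per LRU cycle: a reference seen on two
 * consecutive scans activates the page, a reference seen on one scan
 * buys one more trip around the inactive list. */
static void lru_scan(struct toy_page *p)
{
        bool pte = p->pte_accessed;

        p->pte_accessed = false;
        if (pte && p->referenced)
                p->active = true;
        else if (pte)
                p->referenced = true;
}

int main(void)
{
        struct toy_page u = { 0 }, m = { 0 };

        touch_syscall(&u); touch_mapped(&m);   /* first access             */
        lru_scan(&u);      lru_scan(&m);       /* one LRU cycle in between */
        touch_syscall(&u); touch_mapped(&m);   /* second access            */
        lru_scan(&u);      lru_scan(&m);       /* next LRU cycle           */

        printf("unmapped active=%d, mapped active=%d\n", u.active, m.active);
        return 0;
}

[ With an LRU scan falling between the two accesses, both models end up
activating the page, which is the "equal chances" argument above. If both
accesses to the mapped page fall between the same pair of scans, though, they
are sampled as a single reference, which is the asymmetry discussed further
down the thread. ]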
> > IMO this applies to executable file and anon equally.
>
> the anon LRU consists entirely of mapped pages, so, IMHO, there is no need
> for special handling of executable pages on the anon LRU. Although references
> are only checked at each LRU cycle, this would correctly approximate the
> reference frequency.
>
> Moreover, there is a comment in shrink_active_list() that explains one
> reason for omitting this check for anon pages.
>
> "Anon pages are not likely to be evicted by use-once streaming IO, plus JVM
> can create lots of anon VM_EXEC pages, so we ignore them here."
>
> Lastly, as I said before, this should at least be done separately, with
> appropriate investigation.
You do have a point here, though.
The VM_EXEC protection is a heuristic that I think was added for one
specific case. Instead of adopting it straight-away for anon pages, it
may be good to wait until a usecase materializes. If it never does,
all the better - one less heuristic to carry around.
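[ For context, the heuristic in question boils down to a special case during
active-list scanning: a referenced, file-backed, executable page is kept on
the active list for another round, while other referenced pages scanned there
are still deactivated, and anon pages are excluded for the reason given in the
comment quoted above. A toy decision function with made-up names, not the
actual kernel code, captures the shape of it: ]

#include <stdbool.h>
#include <stdio.h>

#define TOY_VM_EXEC 0x1u        /* stand-in for the real VM_EXEC flag */

enum toy_decision { TOY_DEACTIVATE, TOY_KEEP_ACTIVE };

/* Shape of the special case discussed above: referenced + file-backed +
 * executable stays on the active list, everything else scanned here is
 * deactivated. */
static enum toy_decision active_scan(bool referenced, bool file_backed,
                                     unsigned int vm_flags)
{
        if (referenced && file_backed && (vm_flags & TOY_VM_EXEC))
                return TOY_KEEP_ACTIVE;
        return TOY_DEACTIVATE;
}

int main(void)
{
        printf("file exec, referenced: %d\n", active_scan(true, true, TOY_VM_EXEC));
        printf("anon exec, referenced: %d\n", active_scan(true, false, TOY_VM_EXEC));
        printf("file data, referenced: %d\n", active_scan(true, true, 0));
        return 0;
}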
> Now, I plan to make the next version with all of your comments applied,
> except for the VM_EXEC case. As I said above, the fundamental difference
> between file-mapped pages and anon-mapped pages is that the file LRU, where
> file-mapped pages are managed, uses two reference tracking models, while the
> anon LRU uses just one. The file LRU needs some heuristic to adjust for the
> difference between the two models, but the anon LRU doesn't need one at all.
> And I think the VM_EXEC check exists for this reason.
>
> Please let me know your opinion about the VM_EXEC case. I will start the
> rework if you agree with my reasoning.
That sounds like a good plan. I'm looking forward to the next version!
Johannes
On Tue, Mar 17, 2020 at 1:12 AM, Johannes Weiner <[email protected]> wrote:
>
> On Mon, Mar 16, 2020 at 04:05:30PM +0900, Joonsoo Kim wrote:
> > On Sat, Mar 14, 2020 at 4:55 AM, Johannes Weiner <[email protected]> wrote:
> > > The problem with executables is that when they are referenced, they
> > > get a *lot* of references compared to data pages. Think about an
> > > instruction stream and how many of those instructions result in data
> > > references. So when you see an executable page that is being accessed,
> > > it's likely being accessed at a high rate. They're much hotter, and
> > > that's why reference bits from VM_EXEC mappings carry more weight.
> >
> > Sounds reasonable.
> >
> > But now I have another thought about it. I think that the root reason for
> > this check is that there are two different reference tracking models on the
> > file LRU. If we assume that all file pages are mapped, this special check
> > isn't needed. If executable pages are accessed more frequently than others,
> > more references would be found for them at each LRU cycle, so there would
> > be no need for this special check.
> >
> > However, file pages include both syscall-accessed pages and mapped pages,
> > and the reference tracking model for mapped pages is at a disadvantage
> > compared to the one for syscall-accessed pages: several references to a
> > mapped page are counted as just one per LRU cycle. I think that this check
> > is there to mitigate this disadvantage, at least for executable file-mapped
> > pages.
>
> Hm, I don't quite understand this reasoning. Yes, there are two models
> for unmapped and mapped file pages. But both types of pages get
> activated with two or more references: for unmapped it's tracked
> through mark_page_accessed(), and for mapped it's the two LRU cycles
> with the referenced bit set (unmapped pages don't get that extra trip
> around the LRU with one reference). With that alone, mapped pages and
> unmapped pages should have equal chances, no?
>
Think about the following example.
Assume that the size of the inactive list for file pages is 100.
1. a new page A starts on the inactive list and a first access to A happens
2. 50 inactive pages are reclaimed due to 50 new pages
3. a second access to A happens
4. 50 inactive pages are reclaimed due to 50 new pages
5. 100 inactive pages are reclaimed due to 100 new pages
If A is:
- an unmapped page: it is activated at #3 and survives #5
- a mapped page: its reference is checked at #4 and it is re-attached to the
inactive list with PageReferenced(), but it is eventually reclaimed at #5
Even though they are referenced with the same pattern, the mapped page is
reclaimed while the unmapped one isn't. This is, as I said before, the
disadvantage of mapped pages compared to unmapped pages. My guess is that the
VM_EXEC test exists to mitigate this disadvantage for mapped file pages, since
VM_EXEC is the most important case for mapped file pages.
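[ A small userspace model of this timeline, extending the toy sketch from
earlier in the thread with a reclaim outcome (again made-up names, an
illustration only, not kernel code), reproduces the asymmetry: both pages
receive the same two accesses, yet only the mapped one is reclaimed. ]

#include <stdbool.h>
#include <stdio.h>

/* Toy page state for illustration only; not the kernel's struct page. */
struct toy_page {
        bool mapped;       /* tracked via the pte accessed bit vs. syscalls */
        bool referenced;   /* software PG_referenced-style flag             */
        bool active;       /* promoted to the active list?                  */
        bool pte_accessed; /* hardware accessed bit                         */
        bool reclaimed;
};

/* One access: synchronous bookkeeping for the unmapped case (the second
 * access activates), only the pte accessed bit for the mapped case. */
static void touch(struct toy_page *p)
{
        if (p->mapped) {
                p->pte_accessed = true;
        } else if (!p->referenced) {
                p->referenced = true;
        } else {
                p->active = true;
                p->referenced = false;
        }
}

/* A reclaim scan reaching the page at the tail of the inactive list:
 * referenced on two consecutive scans activates, referenced on one scan
 * buys one more inactive trip, no reference seen means reclaim. */
static void scan(struct toy_page *p)
{
        bool pte;

        if (p->active || p->reclaimed)
                return;
        pte = p->pte_accessed;
        p->pte_accessed = false;
        if (pte && p->referenced)
                p->active = true;
        else if (pte)
                p->referenced = true;
        else
                p->reclaimed = true;
}

int main(void)
{
        struct toy_page unmapped = { .mapped = false };
        struct toy_page mapped = { .mapped = true };
        struct toy_page *pages[] = { &unmapped, &mapped };

        for (int i = 0; i < 2; i++) {
                struct toy_page *p = pages[i];

                touch(p); /* #1: first access                                */
                          /* #2: 50 pages reclaimed, A not at the tail yet   */
                touch(p); /* #3: second access                               */
                scan(p);  /* #4: A reaches the tail of the inactive list     */
                scan(p);  /* #5: scanned again if still on the inactive list */
                printf("%s A: active=%d reclaimed=%d\n",
                       p->mapped ? "mapped" : "unmapped",
                       p->active, p->reclaimed);
        }
        return 0;
}

[ Under this model the unmapped page ends up active while the mapped page is
reclaimed, matching the outcome described above; the corner cases of what real
reclaim does are elided here, but the timing dependence is the same. ]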
Thanks.