From: Joonsoo Kim <[email protected]>
Hello,
This patchset implements workingset protection and detection on
the anonymous LRU list.
* Changes on v4
- In the patch "mm/swapcache: support to handle the exceptional
entries in swapcache":
-- replace the word "value" with "exceptional entries"
-- handle the shadow entry in add_to_swap_cache()
-- support huge pages
-- remove the registration code for the shadow shrinker
- remove the patch "mm/workingset: use the node counter
if memcg is the root memcg", since workingset detection for
anonymous pages no longer uses the shadow shrinker
- minor style fixes
* Changes on v3
- rework the patch, "mm/vmscan: protect the workingset on anonymous LRU"
(use almost the same reference tracking algorithm as the one for
file-mapped pages)
* Changes on v2
- fix a critical bug that indexes the lru list out of bounds in
workingset_refault()
- fix a bug that reuses the rotate value of the previous page
* SUBJECT
workingset protection
* PROBLEM
In the current implementation, a newly created or swapped-in anonymous
page starts on the active list. Growing the active list triggers
rebalancing of the active/inactive lists, so old pages on the active list
are demoted to the inactive list. Hence, hot pages on the active list are
not protected at all.
The following is an example of this situation.
Assume that there are 50 hot pages on the active list and that the system
can hold 100 pages in total. The numbers denote the number of pages on the
active/inactive lists (active | inactive). (h) stands for hot pages and
(uo) stands for used-once pages.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
As we can see, the hot pages are swapped out, which will cause swap-ins later.
* SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection. As on the file LRU list, a newly created or swapped-in
anonymous page starts on the inactive list. Also as on the file LRU list,
the page is promoted once it has been referenced enough. This simple
modification changes the above example as follows.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
The hot pages remain on the active list. :)
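The two walkthroughs above can be reproduced with a small userspace model. This is only a sketch under simplifying assumptions that are not in the patchset: both lists are plain FIFOs, the active list is capped at half of memory, used-once pages are never referenced again, and all names (struct sim, run_example() and so on) are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define TOTAL_PAGES 100   /* system capacity from the example */
#define ACTIVE_MAX   50   /* assumed 1:1 active/inactive balance */

enum kind { HOT, USED_ONCE, NKINDS };

/* A FIFO list of pages; index 0 is the tail (oldest), n-1 is the head. */
struct lru {
	enum kind page[TOTAL_PAGES + 1];
	int n;
};

struct sim {
	struct lru active, inactive;
	int swapped_out[NKINDS];   /* swap-out count per page kind */
};

static void push_head(struct lru *l, enum kind k)
{
	assert(l->n <= TOTAL_PAGES);
	l->page[l->n++] = k;
}

static enum kind pop_tail(struct lru *l)
{
	enum kind k;

	assert(l->n > 0);
	k = l->page[0];
	memmove(&l->page[0], &l->page[1],
		(size_t)(l->n - 1) * sizeof(l->page[0]));
	l->n--;
	return k;
}

/* Swap out from the inactive tail until everything fits in memory. */
static void reclaim(struct sim *s)
{
	while (s->active.n + s->inactive.n > TOTAL_PAGES)
		s->swapped_out[pop_tail(&s->inactive)]++;
}

/* Base behaviour: a new anonymous page starts on the active list. */
static void add_page_active(struct sim *s, enum kind k)
{
	push_head(&s->active, k);
	while (s->active.n > ACTIVE_MAX)
		push_head(&s->inactive, pop_tail(&s->active));
	reclaim(s);
}

/* Patched behaviour: a new anonymous page starts on the inactive list. */
static void add_page_inactive(struct sim *s, enum kind k)
{
	push_head(&s->inactive, k);
	reclaim(s);
}

static int count(const struct lru *l, enum kind k)
{
	int i, c = 0;

	for (i = 0; i < l->n; i++)
		c += (l->page[i] == k);
	return c;
}

/* Run steps 1-3 of the example; 'patched' selects where new pages start. */
static struct sim run_example(int patched)
{
	struct sim s;
	int i;

	memset(&s, 0, sizeof(s));
	for (i = 0; i < 50; i++)        /* step 1: 50 hot pages on active */
		push_head(&s.active, HOT);
	for (i = 0; i < 100; i++) {     /* steps 2 and 3: 100 used-once pages */
		if (patched)
			add_page_inactive(&s, USED_ONCE);
		else
			add_page_active(&s, USED_ONCE);
	}
	return s;
}
```

Calling run_example(0) and run_example(1) and comparing swapped_out[HOT] shows the difference: in this model the base policy swaps out all 50 hot pages, while the patched policy swaps out only used-once pages.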
* EXPERIMENT
I tested this scenario on my test bed and confirmed that the problem
occurs with the current implementation. I also verified that this
patchset fixes it.
I ran another test to show the performance effect of this patchset.
- ebizzy (with modified random function)
ebizzy is a test program in which the main thread allocates a lot of
memory and child threads access it randomly for a given time. Swap-in/out
happens if the allocated memory is larger than the system memory.
A random function that follows the zipf distribution is used to create
hot/cold memory. The hot/cold ratio is controlled by a parameter. If
the parameter is high, hot memory is accessed much more often than cold
memory. If the parameter is low, the number of accesses to each memory
region is similar. I used various parameters in order to show the effect
of the patchset on workloads with various hot/cold ratios.
My test setup is a virtual machine with 8 cpus and 1024MB memory.
The result format is as follows.
Parameter 0.1 ... 1.3
Allocated memory size
Throughput for base (larger is better)
Throughput for patchset (larger is better)
Improvement (larger is better)
* single thread
                   0.1      0.3      0.5      0.7      0.9      1.1      1.3
<512M>
  base          7009.0   7372.0   7774.0   8523.0   9569.0  10724.0  11936.0
  patchset      6973.0   7342.0   7745.0   8576.0   9441.0  10730.0  12033.0
  improvement    -0.01     -0.0     -0.0     0.01    -0.01      0.0     0.01
<768M>
  base           915.0   1039.0   1275.0   1687.0   2328.0   3486.0   5445.0
  patchset       920.0   1037.0   1238.0   1689.0   2384.0   3638.0   5381.0
  improvement     0.01     -0.0    -0.03      0.0     0.02     0.04    -0.01
<1024M>
  base           425.0    471.0    539.0    753.0   1183.0   2130.0   3839.0
  patchset       414.0    468.0    553.0    770.0   1242.0   2187.0   3932.0
  improvement    -0.03    -0.01     0.03     0.02     0.05     0.03     0.02
<1280M>
  base           320.0    346.0    410.0    556.0    871.0   1654.0   3298.0
  patchset       316.0    346.0    411.0    550.0    892.0   1652.0   3293.0
  improvement    -0.01      0.0      0.0    -0.01     0.02     -0.0     -0.0
<1536M>
  base           273.0    290.0    341.0    458.0    733.0   1381.0   2925.0
  patchset       271.0    293.0    344.0    462.0    740.0   1398.0   2969.0
  improvement    -0.01     0.01     0.01     0.01     0.01     0.01     0.02
<2048M>
  base            77.0     79.0     95.0    147.0    276.0    690.0   1816.0
  patchset        91.0     94.0    115.0    170.0    321.0    770.0   2018.0
  improvement     0.18     0.19     0.21     0.16     0.16     0.12     0.11
* multi thread (8)
                   0.1      0.3      0.5      0.7      0.9      1.1      1.3
<512M>
  base         29083.0  29648.0  30145.0  31668.0  33964.0  38414.0  43707.0
  patchset     29238.0  29701.0  30301.0  31328.0  33809.0  37991.0  43667.0
  improvement     0.01      0.0     0.01    -0.01     -0.0    -0.01     -0.0
<768M>
  base          3332.0   3699.0   4673.0   5830.0   8307.0  12969.0  17665.0
  patchset      3579.0   3992.0   4432.0   6111.0   8699.0  12604.0  18061.0
  improvement     0.07     0.08    -0.05     0.05     0.05    -0.03     0.02
<1024M>
  base          1921.0   2141.0   2484.0   3296.0   5391.0   8227.0  14574.0
  patchset      1989.0   2155.0   2609.0   3565.0   5463.0   8170.0  15642.0
  improvement     0.04     0.01     0.05     0.08     0.01    -0.01     0.07
<1280M>
  base          1524.0   1625.0   1931.0   2581.0   4155.0   6959.0  12443.0
  patchset      1560.0   1707.0   2016.0   2714.0   4262.0   7518.0  13910.0
  improvement     0.02     0.05     0.04     0.05     0.03     0.08     0.12
<1536M>
  base          1303.0   1399.0   1550.0   2137.0   3469.0   6712.0  12944.0
  patchset      1356.0   1465.0   1701.0   2237.0   3583.0   6830.0  13580.0
  improvement     0.04     0.05      0.1     0.05     0.03     0.02     0.05
<2048M>
  base           172.0    184.0    215.0    289.0    514.0   1318.0   4153.0
  patchset       175.0    190.0    225.0    329.0    606.0   1585.0   5170.0
  improvement     0.02     0.03     0.05     0.14     0.18      0.2     0.24
As we can see, the patched kernel gets better results as the allocated
memory grows. The maximum improvement is 21% for the single thread test
and 24% for the multi thread test.
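For reference, the improvement rows appear to be the relative throughput change of the patched kernel over the base kernel. The helper below is a hypothetical sketch of that calculation, not part of the test harness.

```c
#include <assert.h>

/* Relative throughput change of the patched kernel over the base kernel. */
static double improvement(double base, double patched)
{
	assert(base > 0.0);
	return (patched - base) / base;
}
```

For example, improvement(95.0, 115.0) corresponds to the single thread <2048M> column at parameter 0.5, and improvement(4153.0, 5170.0) to the multi thread <2048M> column at parameter 1.3, the two maxima quoted above.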
* SUBJECT
workingset detection
* PROBLEM
The later part of the patchset implements workingset detection for
the anonymous LRU list. There is a corner case in which workingset
protection can cause thrashing. If we can avoid that thrashing with
workingset detection, we get better performance.
The following is an example of thrashing due to workingset protection.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (will be hot) pages
50(h) | 50(wh)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
4. workload: 50 (will be hot) pages
50(h) | 50(wh), swap-in 50(wh)
5. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
6. repeat 4, 5
Without workingset detection, the will-be-hot pages of this kind of
workload can never be promoted, and the thrashing continues forever.
* SOLUTION
Therefore, this patchset implements workingset detection.
All the infrastructure for workingset detection is already implemented,
so there is not much work to do. First, extend the workingset detection
code to deal with the anonymous LRU list. Then, make the swap cache handle
the exceptional value used for the shadow entry. Lastly, install/retrieve
the shadow value into/from the swap cache and check the refault distance.
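The refault distance check can be sketched in userspace as follows. This is a simplified model under stated assumptions: a single eviction counter per LRU, a shadow value that is just a snapshot of that counter, and activation when the distance does not exceed the active list size, as the existing file LRU code does. The names struct lru_state, shadow_on_evict() and refault_should_activate() are hypothetical and do not come from the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for per-LRU workingset state. */
struct lru_state {
	unsigned long age;        /* bumped on every eviction in this model */
	unsigned long nr_active;  /* pages currently on the active list */
};

/* On swap-out: return the value to store as the shadow entry. */
static unsigned long shadow_on_evict(struct lru_state *lru)
{
	return lru->age++;
}

/* On swap-in: decide whether the page should be activated right away. */
static bool refault_should_activate(const struct lru_state *lru,
				    unsigned long shadow)
{
	/* Unsigned subtraction stays correct if the counter wraps. */
	unsigned long refault_distance = lru->age - shadow;

	return refault_distance <= lru->nr_active;
}
```

The idea is that a page which refaults after only a few evictions would have stayed in memory if the inactive list had been slightly larger, so it is worth taking that space from the active list. A page that comes back after many more evictions than the active list holds would not have fit anyway and stays on the inactive list.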
* EXPERIMENT
I wrote a test program that imitates the above scenario and confirmed
that the problem exists. Then I verified that this patchset fixes it.
My test setup is a virtual machine with 8 cpus and 6100MB memory. However,
the amount of memory that the test program can use is about 280 MB,
because the system uses a large ram-backed swap and a large ramdisk
to capture the trace.
The test scenario is as follows.
1. allocate cold memory (512MB)
2. allocate hot-1 memory (96MB)
3. activate hot-1 memory (96MB)
4. allocate another hot-2 memory (96MB)
5. access cold memory (128MB)
6. access hot-2 memory (96MB)
7. repeat 5, 6
Since hot-1 memory (96MB) is on the active list, the inactive list can
hold roughly 190MB of pages. The re-access interval of hot-2 memory
(96+128 MB) is more than 190MB, so it cannot be promoted without
workingset detection, and swap-in/out happens repeatedly. With this
patchset, workingset detection works and promotion happens. Therefore,
less swap-in/out occurs.
Here is the result. (average of 5 runs)
type    swap-in  swap-out
base     863240    989945
patch    681565    809273
As we can see, the patched kernel does less swap-in/out.
The patchset is based on v5.5.
It is also available at
https://github.com/JoonsooKim/linux/tree/improve-anonymous-lru-management-v4.00-v5.5
Enjoy it.
Thanks.
Joonsoo Kim (8):
mm/vmscan: make active/inactive ratio as 1:1 for anon lru
mm/vmscan: protect the workingset on anonymous LRU
mm/workingset: extend the workingset detection for anon LRU
mm/swapcache: support to handle the exceptional entries in swapcache
mm/workingset: handle the page without memcg
mm/swap: implement workingset detection for anonymous LRU
mm/vmscan: restore active/inactive ratio for anonymous LRU
mm/swap: count a new anonymous page as a reclaim_state's rotate
include/linux/mmzone.h | 14 +++++++++-----
include/linux/swap.h | 18 +++++++++++++-----
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 6 +++---
mm/khugepaged.c | 2 +-
mm/memcontrol.c | 12 ++++++++----
mm/memory.c | 16 ++++++++++------
mm/migrate.c | 2 +-
mm/shmem.c | 3 ++-
mm/swap.c | 42 +++++++++++++++++++++++++++++++++++-------
mm/swap_state.c | 44 +++++++++++++++++++++++++++++++++++++-------
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 2 +-
mm/vmscan.c | 26 ++++++++++++++++----------
mm/vmstat.c | 6 ++++--
mm/workingset.c | 46 +++++++++++++++++++++++++++++++++-------------
16 files changed, 175 insertions(+), 68 deletions(-)
--
2.7.4
From: Joonsoo Kim <[email protected]>
reclaim_stat's rotate is used to control the ratio of page scanning
between the file and anonymous LRUs. Before the patch that protects
anonymous pages on the active LRU, all new anonymous pages were counted
as rotated, which made reclaim happen less often on the anonymous LRU
than on the file LRU.
Now the situation has changed. New anonymous pages are no longer added
to the active LRU, so rotate will be far lower than before. This makes
reclaim on the anonymous LRU happen more often, which can hurt systems
that are tuned for the previous behaviour.
Therefore, this patch counts a new anonymous page toward reclaim_stat's
rotate. Although adding this count to reclaim_stat's rotate is not
logical in the current algorithm, reducing the regression is more
important.
I found this regression in a kernel-build test, where it amounts to a
performance degradation of roughly 2~5%. With this workaround,
performance is completely restored.
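To illustrate why fewer rotations mean more anonymous reclaim, here is a minimal userspace sketch. It assumes that the scan pressure on a list is proportional to scanned/rotated, which is how get_scan_count() appears to weigh the two LRU types in v5.5. The struct and function names are hypothetical, and the priority value used in the usage note is only an example.

```c
#include <assert.h>

/* Hypothetical stand-in for the recent scanned/rotated counters of one LRU type. */
struct reclaim_counts {
	unsigned long scanned;
	unsigned long rotated;
};

/*
 * Relative scan pressure for one LRU type: the fewer of the scanned
 * pages were rotated (kept), the more that list is scanned.
 * The +1 terms avoid division by zero.
 */
static unsigned long scan_pressure(unsigned long prio, struct reclaim_counts c)
{
	assert(c.rotated + 1 != 0);
	return prio * (c.scanned + 1) / (c.rotated + 1);
}
```

With a priority of 60 and 1000 scanned pages, a list with 999 rotations gets a pressure of 60, while a list with only 99 rotations gets 600. Once new anonymous pages stop counting as rotated, the anonymous LRU moves toward the second case, which is the regression this patch works around.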
v2: fix a bug that reuses the rotate value of the previous page
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/swap.c | 29 ++++++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
diff --git a/mm/swap.c b/mm/swap.c
index 442d27e..1f19301 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -187,6 +187,9 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
}
EXPORT_SYMBOL_GPL(get_kernel_page);
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
+ void *arg);
+
static void pagevec_lru_move_fn(struct pagevec *pvec,
void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
void *arg)
@@ -199,6 +202,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
struct pglist_data *pagepgdat = page_pgdat(page);
+ void *arg_orig = arg;
if (pagepgdat != pgdat) {
if (pgdat)
@@ -207,8 +211,22 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
spin_lock_irqsave(&pgdat->lru_lock, flags);
}
+ if (move_fn == __pagevec_lru_add_fn) {
+ struct list_head *entry = &page->lru;
+ unsigned long next = (unsigned long)entry->next;
+ unsigned long rotate = next & 2;
+
+ if (rotate) {
+ VM_BUG_ON(arg);
+
+ next = next & ~2;
+ entry->next = (struct list_head *)next;
+ arg = (void *)rotate;
+ }
+ }
lruvec = mem_cgroup_page_lruvec(page, pgdat);
(*move_fn)(page, lruvec, arg);
+ arg = arg_orig;
}
if (pgdat)
spin_unlock_irqrestore(&pgdat->lru_lock, flags);
@@ -475,6 +493,14 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
hpage_nr_pages(page));
count_vm_event(UNEVICTABLE_PGMLOCKED);
}
+
+ if (PageSwapBacked(page) && !unevictable) {
+ struct list_head *entry = &page->lru;
+ unsigned long next = (unsigned long)entry->next;
+
+ next = next | 2;
+ entry->next = (struct list_head *)next;
+ }
lru_cache_add(page);
}
@@ -927,6 +953,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
{
enum lru_list lru;
int was_unevictable = TestClearPageUnevictable(page);
+ unsigned long rotate = (unsigned long)arg;
VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -962,7 +989,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
if (page_evictable(page)) {
lru = page_lru(page);
update_page_reclaim_stat(lruvec, page_is_file_cache(page),
- PageActive(page));
+ PageActive(page) | rotate);
if (was_unevictable)
count_vm_event(UNEVICTABLE_PGRESCUED);
} else {
--
2.7.4
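The patch above passes the "count as rotate" flag through a spare low bit of page->lru.next. The userspace sketch below shows the general pointer-tagging technique only. It assumes list nodes are at least 4-byte aligned so that bit 1 of a valid pointer is always zero, and the names struct list_node, tag_set() and tag_test_and_clear() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct list_node {
	struct list_node *next, *prev;
};

#define ROTATE_TAG ((uintptr_t)2)   /* same bit value the diff uses */

/* Hide a one-bit flag in the low bits of the next pointer. */
static void tag_set(struct list_node *n)
{
	n->next = (struct list_node *)((uintptr_t)n->next | ROTATE_TAG);
}

/* Report whether the flag was set and restore the clean pointer. */
static bool tag_test_and_clear(struct list_node *n)
{
	uintptr_t v = (uintptr_t)n->next;
	bool tagged = (v & ROTATE_TAG) != 0;

	n->next = (struct list_node *)(v & ~ROTATE_TAG);
	return tagged;
}
```

While the tag is set, the pointer must not be dereferenced or handed to ordinary list helpers, so every consumer has to remember to clear it first. That fragility is one reason to prefer passing such a flag explicitly.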
From: Joonsoo Kim <[email protected]>
Now that workingset detection is implemented for the anonymous LRU,
we no longer have to worry about the workingset being misidentified
because of the active/inactive ratio. Let's restore the ratio.
Signed-off-by: Joonsoo Kim <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b37cc26..3d44e32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2218,7 +2218,7 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
gb = (inactive + active) >> (30 - PAGE_SHIFT);
- if (gb && is_file_lru(inactive_lru))
+ if (gb)
inactive_ratio = int_sqrt(10 * gb);
else
inactive_ratio = 1;
--
2.7.4
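The effect of the restored ratio can be checked with a small userspace sketch of the calculation in the hunk above. isqrt() is a simple stand-in for the kernel's int_sqrt(), the final comparison is an assumption about how inactive_is_low() uses the ratio, and the page_shift parameter is added here only so the example does not depend on a particular page size.

```c
#include <assert.h>

/* Integer square root (floor), a userspace stand-in for int_sqrt(). */
static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* pages: inactive + active pages of one LRU type; page_shift: log2(page size). */
static unsigned long inactive_ratio(unsigned long pages, unsigned int page_shift)
{
	unsigned long gb;

	assert(page_shift <= 30);
	gb = pages >> (30 - page_shift);
	return gb ? isqrt(10 * gb) : 1;
}

/* Assumed final check: the inactive list is low if it is too small for the ratio. */
static int inactive_is_low(unsigned long inactive, unsigned long active,
			   unsigned int page_shift)
{
	return inactive * inactive_ratio(inactive + active, page_shift) < active;
}
```

With 4KB pages (page_shift = 12), an LRU smaller than 1GB keeps a 1:1 ratio, 1GB gives a ratio of 3 and 10GB gives a ratio of 10. After this patch the anonymous LRU follows the same curve as the file LRU, instead of always using 1:1.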
On Mon, Mar 23, 2020 at 02:52:11PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> Now, workingset detection is implemented for anonymous LRU.
> We don't have to worry about the misfound for workingset due to
> the ratio of active/inactive. Let's restore the ratio.
>
> Signed-off-by: Joonsoo Kim <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Mon, Mar 23, 2020 at 02:52:12PM +0900, [email protected] wrote:
> From: Joonsoo Kim <[email protected]>
>
> reclaim_stat's rotate is used for controlling the ratio of scanning page
> between file and anonymous LRU. All new anonymous pages are counted
> for rotate before the patch, protecting anonymous pages on active LRU, and,
> it makes that reclaim on anonymous LRU is less happened than file LRU.
>
> Now, situation is changed. all new anonymous pages are not added
> to the active LRU so rotate would be far less than before. It will cause
> that reclaim on anonymous LRU happens more and it would result in bad
> effect on some system that is optimized for previous setting.
>
> Therefore, this patch counts a new anonymous page as a reclaim_state's
> rotate. Although it is non-logical to add this count to
> the reclaim_state's rotate in current algorithm, reducing the regression
> would be more important.
>
> I found this regression on kernel-build test and it is roughly 2~5%
> performance degradation. With this workaround, performance is completely
> restored.
>
> v2: fix a bug that reuses the rotate value for previous page
I agree with the rationale, but the magic bit in the page->lru list
pointers seems pretty ugly.
I wrote a patch a few years ago that split lru_add_pvecs into an add
and a putback component. This was to avoid unintentional balancing
effects of LRU isolations, but I think you can benefit from that
cleanup here as well. Would you mind taking a look at it and maybe
take it up into your series?
https://lore.kernel.org/patchwork/patch/685708/
On Tue, Mar 24, 2020 at 2:43 AM, Johannes Weiner <[email protected]> wrote:
>
> On Mon, Mar 23, 2020 at 02:52:12PM +0900, [email protected] wrote:
> > From: Joonsoo Kim <[email protected]>
> >
> > reclaim_stat's rotate is used for controlling the ratio of scanning page
> > between file and anonymous LRU. All new anonymous pages are counted
> > for rotate before the patch, protecting anonymous pages on active LRU, and,
> > it makes that reclaim on anonymous LRU is less happened than file LRU.
> >
> > Now, situation is changed. all new anonymous pages are not added
> > to the active LRU so rotate would be far less than before. It will cause
> > that reclaim on anonymous LRU happens more and it would result in bad
> > effect on some system that is optimized for previous setting.
> >
> > Therefore, this patch counts a new anonymous page as a reclaim_state's
> > rotate. Although it is non-logical to add this count to
> > the reclaim_state's rotate in current algorithm, reducing the regression
> > would be more important.
> >
> > I found this regression on kernel-build test and it is roughly 2~5%
> > performance degradation. With this workaround, performance is completely
> > restored.
> >
> > v2: fix a bug that reuses the rotate value for previous page
>
> I agree with the rationale, but the magic bit in the page->lru list
> pointers seems pretty ugly.
Yes, pretty ugly.
> I wrote a patch a few years ago that split lru_add_pvecs into an add
> and a putback component. This was to avoid unintentional balancing
> effects of LRU isolations, but I think you can benefit from that
> cleanup here as well. Would you mind taking a look at it and maybe
> take it up into your series?
>
> https://lore.kernel.org/patchwork/patch/685708/
Thanks for pointing that out.
I will remove the magic bit and follow the approach in your patch to
solve the problem this patch addresses.
Thanks.