2020-03-13 18:35:46

by Yang Shi

Subject: [PATCH 1/2] mm: swap: make page_evictable() inline

When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more
skipping pagevecs") to our 4.9 kernel, our test bench noticed around a 10%
regression in a couple of vm-scalability test cases (lru-file-readonce,
lru-file-readtwice and lru-file-mmap-read). I didn't see that much of a
drop on my VM (32c-64g-2nodes). The difference is likely down to the test
configuration: the bench is 32c-256g with NUMA disabled and the tests run
in the root memcg, so they stress a single pair of inactive and active LRU
lists. That is not a very common setup in modern production environments.

That commit did two major changes:
1. Call page_evictable()
2. Use smp_mb() to make the setting of PG_lru visible

It looks like these two changes contribute most of the overhead.
page_evictable() is an out-of-line function, so every call pays for the
function prologue and epilogue, and it used to be called from the page
reclaim path only. The LRU add path, however, is very hot, so it is better
to make it inline. page_evictable() in turn calls page_mapping(), which is
not inlined either, but disassembly shows it does no push/pop operations,
and inlining it would be less straightforward.

Apart from that, the smp_mb() is not necessary on x86 since SetPageLRU()
is an atomic operation which already implies a full memory barrier there;
it is replaced with smp_mb__after_atomic() in the following patch.
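
For illustration, SetPageLRU() boils down to an atomic set_bit() on
page->flags, which on x86 is a lock-prefixed RMW and therefore already a
full barrier. A rough sketch (the real helper is generated by SETPAGEFLAG()
in include/linux/page-flags.h and also handles compound pages, so this is
not the verbatim definition):

    /* Simplified sketch, not the generated kernel code */
    static __always_inline void SetPageLRU(struct page *page)
    {
            set_bit(PG_lru, &page->flags);  /* on x86: lock-prefixed RMW */
    }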

With the two fixes applied, the tests regain around 5% on that test bench
and get back to normal on my VM. Since the test bench configuration is not
that common, and I also saw around a 6% improvement on the latest upstream,
this looks good enough IMHO.

Below is the test data (lru-file-readtwice throughput) against v5.6-rc4:
    mainline    w/ inline fix
    150MB       154MB

With this patch the throughput goes up by 2.67%. The data with
smp_mb__after_atomic() applied is shown in the following patch.

Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
Cc: Shakeel Butt <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/swap.h | 24 +++++++++++++++++++++++-
mm/vmscan.c | 23 -----------------------
2 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e99f7a..297eb66 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -374,7 +374,29 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
#define node_reclaim_mode 0
#endif

-extern int page_evictable(struct page *page);
+/*
+ * page_evictable - test whether a page is evictable
+ * @page: the page to test
+ *
+ * Test whether page is evictable--i.e., should be placed on active/inactive
+ * lists vs unevictable list.
+ *
+ * Reasons page might not be evictable:
+ * (1) page's mapping marked unevictable
+ * (2) page is part of an mlocked VMA
+ *
+ */
+static inline int page_evictable(struct page *page)
+{
+ int ret;
+
+ /* Prevent address_space of inode and swap cache from being freed */
+ rcu_read_lock();
+ ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
+ rcu_read_unlock();
+ return ret;
+}
+
extern void check_move_unevictable_pages(struct pagevec *pvec);

extern int kswapd_run(int nid);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8763705..855c395 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4277,29 +4277,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
}
#endif

-/*
- * page_evictable - test whether a page is evictable
- * @page: the page to test
- *
- * Test whether page is evictable--i.e., should be placed on active/inactive
- * lists vs unevictable list.
- *
- * Reasons page might not be evictable:
- * (1) page's mapping marked unevictable
- * (2) page is part of an mlocked VMA
- *
- */
-int page_evictable(struct page *page)
-{
- int ret;
-
- /* Prevent address_space of inode and swap cache from being freed */
- rcu_read_lock();
- ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
- rcu_read_unlock();
- return ret;
-}
-
/**
* check_move_unevictable_pages - check pages for evictability and move to
* appropriate zone lru list
--
1.8.3.1


2020-03-13 18:36:13

by Yang Shi

Subject: [PATCH 2/2] mm: swap: use smp_mb__after_atomic() to order LRU bit set

A memory barrier is needed after setting the LRU bit, but smp_mb() is too
strong. Some architectures, e.g. x86, already imply a memory barrier with
atomic operations, so replacing it with smp_mb__after_atomic() is better:
it is a no-op on strongly ordered machines and a full memory barrier on
the others. With this change the vm-scalability cases perform better on
x86; I saw a total 6% improvement with this patch plus the previous
inline fix.
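
Roughly speaking, the relaxed barrier resolves per architecture as in the
sketch below (simplified from include/asm-generic/barrier.h and the x86
barrier header; not the verbatim definitions):

    #ifdef CONFIG_X86
    /* Atomic RMW operations already serialize on x86, so this is a no-op. */
    #define smp_mb__after_atomic()  do { } while (0)
    #else
    /* Generic fallback for architectures whose atomics are not ordered. */
    #define smp_mb__after_atomic()  smp_mb()
    #endif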

The test data (lru-file-readtwice throughput) against v5.6-rc4:
    mainline    w/ inline fix    w/ both (adding this)
    150MB       154MB            159MB

Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
Cc: Shakeel Butt <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
mm/swap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index cf39d24..118bac4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -945,20 +945,20 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
* #0: __pagevec_lru_add_fn #1: clear_page_mlock
*
* SetPageLRU() TestClearPageMlocked()
- * smp_mb() // explicit ordering // above provides strict
+ * MB() // explicit ordering // above provides strict
* // ordering
* PageMlocked() PageLRU()
*
*
* if '#1' does not observe setting of PG_lru by '#0' and fails
* isolation, the explicit barrier will make sure that page_evictable
- * check will put the page in correct LRU. Without smp_mb(), SetPageLRU
+ * check will put the page in correct LRU. Without MB(), SetPageLRU
* can be reordered after PageMlocked check and can make '#1' to fail
* the isolation of the page whose Mlocked bit is cleared (#0 is also
* looking at the same page) and the evictable page will be stranded
* in an unevictable LRU.
*/
- smp_mb();
+ smp_mb__after_atomic();

if (page_evictable(page)) {
lru = page_lru(page);
--
1.8.3.1

2020-03-13 19:34:55

by Shakeel Butt

Subject: Re: [PATCH 1/2] mm: swap: make page_evictable() inline

On Fri, Mar 13, 2020 at 11:34 AM Yang Shi <[email protected]> wrote:
>
> When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more
> skipping pagevecs") to our 4.9 kernel, our test bench noticed around 10%
> down with a couple of vm-scalability's test cases (lru-file-readonce,
> lru-file-readtwice and lru-file-mmap-read). I didn't see that much down
> on my VM (32c-64g-2nodes). It might be caused by the test configuration,
> which is 32c-256g with NUMA disabled and the tests were run in root memcg,
> so the tests actually stress only one inactive and active lru. It
> sounds not very usual in modern production environment.
>
> That commit did two major changes:
> 1. Call page_evictable()
> 2. Use smp_mb to force the PG_lru set visible
>
> It looks they contribute the most overhead. The page_evictable() is a
> function which does function prologue and epilogue, and that was used by
> page reclaim path only. However, lru add is a very hot path, so it
> sounds better to make it inline. However, it calls page_mapping() which
> is not inlined either, but the disassemble shows it doesn't do push and
> pop operations and it sounds not very straightforward to inline it.
>
> Other than this, it sounds smp_mb() is not necessary for x86 since
> SetPageLRU is atomic which enforces memory barrier already, replace it
> with smp_mb__after_atomic() in the following patch.
>
> With the two fixes applied, the tests can get back around 5% on that
> test bench and get back normal on my VM. Since the test bench
> configuration is not that usual and I also saw around 6% up on the
> latest upstream, so it sounds good enough IMHO.
>
> The below is test data (lru-file-readtwice throughput) against the v5.6-rc4:
> mainline w/ inline fix
> 150MB 154MB
>

What is the test setup for the above experiment? I would like to get a repro.

> With this patch the throughput gets 2.67% up. The data with using
> smp_mb__after_atomic() is shown in the following patch.
>
> Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
> Cc: Shakeel Butt <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> include/linux/swap.h | 24 +++++++++++++++++++++++-
> mm/vmscan.c | 23 -----------------------
> 2 files changed, 23 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 1e99f7a..297eb66 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -374,7 +374,29 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> #define node_reclaim_mode 0
> #endif
>
> -extern int page_evictable(struct page *page);
> +/*
> + * page_evictable - test whether a page is evictable
> + * @page: the page to test
> + *
> + * Test whether page is evictable--i.e., should be placed on active/inactive
> + * lists vs unevictable list.
> + *
> + * Reasons page might not be evictable:
> + * (1) page's mapping marked unevictable
> + * (2) page is part of an mlocked VMA
> + *
> + */
> +static inline int page_evictable(struct page *page)
> +{
> + int ret;
> +
> + /* Prevent address_space of inode and swap cache from being freed */
> + rcu_read_lock();
> + ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
> + rcu_read_unlock();
> + return ret;
> +}
> +
> extern void check_move_unevictable_pages(struct pagevec *pvec);
>
> extern int kswapd_run(int nid);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8763705..855c395 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4277,29 +4277,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> }
> #endif
>
> -/*
> - * page_evictable - test whether a page is evictable
> - * @page: the page to test
> - *
> - * Test whether page is evictable--i.e., should be placed on active/inactive
> - * lists vs unevictable list.
> - *
> - * Reasons page might not be evictable:
> - * (1) page's mapping marked unevictable
> - * (2) page is part of an mlocked VMA
> - *
> - */
> -int page_evictable(struct page *page)
> -{
> - int ret;
> -
> - /* Prevent address_space of inode and swap cache from being freed */
> - rcu_read_lock();
> - ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
> - rcu_read_unlock();
> - return ret;
> -}
> -
> /**
> * check_move_unevictable_pages - check pages for evictability and move to
> * appropriate zone lru list
> --
> 1.8.3.1
>

2020-03-13 19:49:04

by Yang Shi

Subject: Re: [PATCH 1/2] mm: swap: make page_evictable() inline



On 3/13/20 12:33 PM, Shakeel Butt wrote:
> On Fri, Mar 13, 2020 at 11:34 AM Yang Shi <[email protected]> wrote:
>> When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more
>> skipping pagevecs") to our 4.9 kernel, our test bench noticed around 10%
>> down with a couple of vm-scalability's test cases (lru-file-readonce,
>> lru-file-readtwice and lru-file-mmap-read). I didn't see that much down
>> on my VM (32c-64g-2nodes). It might be caused by the test configuration,
>> which is 32c-256g with NUMA disabled and the tests were run in root memcg,
>> so the tests actually stress only one inactive and active lru. It
>> sounds not very usual in modern production environment.
>>
>> That commit did two major changes:
>> 1. Call page_evictable()
>> 2. Use smp_mb to force the PG_lru set visible
>>
>> It looks they contribute the most overhead. The page_evictable() is a
>> function which does function prologue and epilogue, and that was used by
>> page reclaim path only. However, lru add is a very hot path, so it
>> sounds better to make it inline. However, it calls page_mapping() which
>> is not inlined either, but the disassemble shows it doesn't do push and
>> pop operations and it sounds not very straightforward to inline it.
>>
>> Other than this, it sounds smp_mb() is not necessary for x86 since
>> SetPageLRU is atomic which enforces memory barrier already, replace it
>> with smp_mb__after_atomic() in the following patch.
>>
>> With the two fixes applied, the tests can get back around 5% on that
>> test bench and get back normal on my VM. Since the test bench
>> configuration is not that usual and I also saw around 6% up on the
>> latest upstream, so it sounds good enough IMHO.
>>
>> The below is test data (lru-file-readtwice throughput) against the v5.6-rc4:
>> mainline w/ inline fix
>> 150MB 154MB
>>
> What is the test setup for the above experiment? I would like to get a repro.

Just start up a VM with two nodes, then run case-lru-file-readtwice or
case-lru-file-readonce from vm-scalability in the root memcg or with memcg
disabled. Then get the average throughput (dd result) from the test.
Our test bench uses the script from lkp, but I just ran it manually.
A single-node VM showed the effect more clearly in my test.

>
>> With this patch the throughput gets 2.67% up. The data with using
>> smp_mb__after_atomic() is shown in the following patch.
>>
>> Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
>> Cc: Shakeel Butt <[email protected]>
>> Cc: Vlastimil Babka <[email protected]>
>> Signed-off-by: Yang Shi <[email protected]>
>> ---
>> include/linux/swap.h | 24 +++++++++++++++++++++++-
>> mm/vmscan.c | 23 -----------------------
>> 2 files changed, 23 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 1e99f7a..297eb66 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -374,7 +374,29 @@ extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>> #define node_reclaim_mode 0
>> #endif
>>
>> -extern int page_evictable(struct page *page);
>> +/*
>> + * page_evictable - test whether a page is evictable
>> + * @page: the page to test
>> + *
>> + * Test whether page is evictable--i.e., should be placed on active/inactive
>> + * lists vs unevictable list.
>> + *
>> + * Reasons page might not be evictable:
>> + * (1) page's mapping marked unevictable
>> + * (2) page is part of an mlocked VMA
>> + *
>> + */
>> +static inline int page_evictable(struct page *page)
>> +{
>> + int ret;
>> +
>> + /* Prevent address_space of inode and swap cache from being freed */
>> + rcu_read_lock();
>> + ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
>> + rcu_read_unlock();
>> + return ret;
>> +}
>> +
>> extern void check_move_unevictable_pages(struct pagevec *pvec);
>>
>> extern int kswapd_run(int nid);
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 8763705..855c395 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -4277,29 +4277,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>> }
>> #endif
>>
>> -/*
>> - * page_evictable - test whether a page is evictable
>> - * @page: the page to test
>> - *
>> - * Test whether page is evictable--i.e., should be placed on active/inactive
>> - * lists vs unevictable list.
>> - *
>> - * Reasons page might not be evictable:
>> - * (1) page's mapping marked unevictable
>> - * (2) page is part of an mlocked VMA
>> - *
>> - */
>> -int page_evictable(struct page *page)
>> -{
>> - int ret;
>> -
>> - /* Prevent address_space of inode and swap cache from being freed */
>> - rcu_read_lock();
>> - ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
>> - rcu_read_unlock();
>> - return ret;
>> -}
>> -
>> /**
>> * check_move_unevictable_pages - check pages for evictability and move to
>> * appropriate zone lru list
>> --
>> 1.8.3.1
>>

2020-03-13 19:51:23

by Shakeel Butt

Subject: Re: [PATCH 1/2] mm: swap: make page_evictable() inline

On Fri, Mar 13, 2020 at 12:46 PM Yang Shi <[email protected]> wrote:
>
>
>
> On 3/13/20 12:33 PM, Shakeel Butt wrote:
> > On Fri, Mar 13, 2020 at 11:34 AM Yang Shi <[email protected]> wrote:
> >> When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more
> >> skipping pagevecs") to our 4.9 kernel, our test bench noticed around 10%
> >> down with a couple of vm-scalability's test cases (lru-file-readonce,
> >> lru-file-readtwice and lru-file-mmap-read). I didn't see that much down
> >> on my VM (32c-64g-2nodes). It might be caused by the test configuration,
> >> which is 32c-256g with NUMA disabled and the tests were run in root memcg,
> >> so the tests actually stress only one inactive and active lru. It
> >> sounds not very usual in modern production environment.
> >>
> >> That commit did two major changes:
> >> 1. Call page_evictable()
> >> 2. Use smp_mb to force the PG_lru set visible
> >>
> >> It looks they contribute the most overhead. The page_evictable() is a
> >> function which does function prologue and epilogue, and that was used by
> >> page reclaim path only. However, lru add is a very hot path, so it
> >> sounds better to make it inline. However, it calls page_mapping() which
> >> is not inlined either, but the disassemble shows it doesn't do push and
> >> pop operations and it sounds not very straightforward to inline it.
> >>
> >> Other than this, it sounds smp_mb() is not necessary for x86 since
> >> SetPageLRU is atomic which enforces memory barrier already, replace it
> >> with smp_mb__after_atomic() in the following patch.
> >>
> >> With the two fixes applied, the tests can get back around 5% on that
> >> test bench and get back normal on my VM. Since the test bench
> >> configuration is not that usual and I also saw around 6% up on the
> >> latest upstream, so it sounds good enough IMHO.
> >>
> >> The below is test data (lru-file-readtwice throughput) against the v5.6-rc4:
> >> mainline w/ inline fix
> >> 150MB 154MB
> >>
> > What is the test setup for the above experiment? I would like to get a repro.
>
> Just startup a VM with two nodes, then run case-lru-file-readtwice or
> case-lru-file-readonce in vm-scalability in root memcg or with memcg
> disabled. Then get the average throughput (dd result) from the test.
> Our test bench uses the script from lkp, but I just ran it manually.
> A single-node VM showed the effect more clearly in my test.
>

Thanks, I will try this on a real machine.

2020-03-13 19:55:46

by Yang Shi

Subject: Re: [PATCH 1/2] mm: swap: make page_evictable() inline



On 3/13/20 12:50 PM, Shakeel Butt wrote:
> On Fri, Mar 13, 2020 at 12:46 PM Yang Shi <[email protected]> wrote:
>>
>>
>> On 3/13/20 12:33 PM, Shakeel Butt wrote:
>>> On Fri, Mar 13, 2020 at 11:34 AM Yang Shi <[email protected]> wrote:
>>>> When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more
>>>> skipping pagevecs") to our 4.9 kernel, our test bench noticed around 10%
>>>> down with a couple of vm-scalability's test cases (lru-file-readonce,
>>>> lru-file-readtwice and lru-file-mmap-read). I didn't see that much down
>>>> on my VM (32c-64g-2nodes). It might be caused by the test configuration,
>>>> which is 32c-256g with NUMA disabled and the tests were run in root memcg,
>>>> so the tests actually stress only one inactive and active lru. It
>>>> sounds not very usual in modern production environment.
>>>>
>>>> That commit did two major changes:
>>>> 1. Call page_evictable()
>>>> 2. Use smp_mb to force the PG_lru set visible
>>>>
>>>> It looks they contribute the most overhead. The page_evictable() is a
>>>> function which does function prologue and epilogue, and that was used by
>>>> page reclaim path only. However, lru add is a very hot path, so it
>>>> sounds better to make it inline. However, it calls page_mapping() which
>>>> is not inlined either, but the disassemble shows it doesn't do push and
>>>> pop operations and it sounds not very straightforward to inline it.
>>>>
>>>> Other than this, it sounds smp_mb() is not necessary for x86 since
>>>> SetPageLRU is atomic which enforces memory barrier already, replace it
>>>> with smp_mb__after_atomic() in the following patch.
>>>>
>>>> With the two fixes applied, the tests can get back around 5% on that
>>>> test bench and get back normal on my VM. Since the test bench
>>>> configuration is not that usual and I also saw around 6% up on the
>>>> latest upstream, so it sounds good enough IMHO.
>>>>
>>>> The below is test data (lru-file-readtwice throughput) against the v5.6-rc4:
>>>> mainline w/ inline fix
>>>> 150MB 154MB
>>>>
>>> What is the test setup for the above experiment? I would like to get a repro.
>> Just startup a VM with two nodes, then run case-lru-file-readtwice or
>> case-lru-file-readonce in vm-scalability in root memcg or with memcg
>> disabled. Then get the average throughput (dd result) from the test.
>> Our test bench uses the script from lkp, but I just ran it manually.
>> A single-node VM showed the effect more clearly in my test.
>>
> Thanks, I will try this on a real machine

A real machine should show it better. Our test bench is bare metal with
NUMA disabled; on my test VM the effect is not that obvious.


2020-03-15 07:16:11

by Matthew Wilcox (Oracle)

Subject: Re: [PATCH 1/2] mm: swap: make page_evictable() inline

On Sat, Mar 14, 2020 at 02:34:35AM +0800, Yang Shi wrote:
> -extern int page_evictable(struct page *page);
> +/*

This seems to be in kernel-doc format already; could you add the extra
'*' so it is added to the fine documentation?

> + * page_evictable - test whether a page is evictable
> + * @page: the page to test
> + *
> + * Test whether page is evictable--i.e., should be placed on active/inactive
> + * lists vs unevictable list.
> + *
> + * Reasons page might not be evictable:
> + * (1) page's mapping marked unevictable
> + * (2) page is part of an mlocked VMA
> + *
> + */
> +static inline int page_evictable(struct page *page)
> +{
> + int ret;
> +
> + /* Prevent address_space of inode and swap cache from being freed */
> + rcu_read_lock();
> + ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
> + rcu_read_unlock();
> + return ret;
> +}

This seems like it should return bool ... that might even lead to code
generation improvement.
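
Putting that together with the kernel-doc remark above, the reworked helper
might look roughly like the sketch below (an illustration of the suggestion,
not a posted v2):

    /**
     * page_evictable - test whether a page is evictable
     * @page: the page to test
     *
     * Test whether @page is evictable, i.e. should be placed on the
     * active/inactive lists rather than the unevictable list.
     */
    static inline bool page_evictable(struct page *page)
    {
            bool ret;

            /* Prevent address_space of inode and swap cache from being freed */
            rcu_read_lock();
            ret = !mapping_unevictable(page_mapping(page)) &&
                  !PageMlocked(page);
            rcu_read_unlock();
            return ret;
    }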

2020-03-16 16:38:31

by Yang Shi

Subject: Re: [PATCH 1/2] mm: swap: make page_evictable() inline



On 3/14/20 9:01 AM, Matthew Wilcox wrote:
> On Sat, Mar 14, 2020 at 02:34:35AM +0800, Yang Shi wrote:
>> -extern int page_evictable(struct page *page);
>> +/*
> This seems to be in kernel-doc format already; could you add the extra
> '*' so it is added to the fine documentation?

Yes, sure.

>
>> + * page_evictable - test whether a page is evictable
>> + * @page: the page to test
>> + *
>> + * Test whether page is evictable--i.e., should be placed on active/inactive
>> + * lists vs unevictable list.
>> + *
>> + * Reasons page might not be evictable:
>> + * (1) page's mapping marked unevictable
>> + * (2) page is part of an mlocked VMA
>> + *
>> + */
>> +static inline int page_evictable(struct page *page)
>> +{
>> + int ret;
>> +
>> + /* Prevent address_space of inode and swap cache from being freed */
>> + rcu_read_lock();
>> + ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
>> + rcu_read_unlock();
>> + return ret;
>> +}
> This seems like it should return bool ... that might even lead to code
> generation improvement.

Thanks for catching this. It looks like mapping_unevictable() needs to be
converted to bool as well.
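
Assuming the existing test_bit(AS_UNEVICTABLE, ...) implementation in
include/linux/pagemap.h, that conversion could be as small as the sketch
below (illustrative only):

    static inline bool mapping_unevictable(struct address_space *mapping)
    {
            /* A NULL mapping (e.g. anon page) counts as evictable. */
            return mapping && test_bit(AS_UNEVICTABLE, &mapping->flags);
    }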


2020-03-16 17:41:39

by Vlastimil Babka

Subject: Re: [PATCH 2/2] mm: swap: use smp_mb__after_atomic() to order LRU bit set

On 3/13/20 7:34 PM, Yang Shi wrote:
> Memory barrier is needed after setting LRU bit, but smp_mb() is too
> strong. Some architectures, i.e. x86, imply memory barrier with atomic
> operations, so replacing it with smp_mb__after_atomic() sounds better,
> which is nop on strong ordered machines, and full memory barriers on
> others. With this change the vm-scalability cases would perform better
> on x86, I saw total 6% improvement with this patch and previous inline
> fix.
>
> The test data (lru-file-readtwice throughput) against v5.6-rc4:
> mainline w/ inline fix w/ both (adding this)
> 150MB 154MB 159MB
>
> Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
> Cc: Shakeel Butt <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

According to my understanding of Documentation/memory-barriers.txt this would be
correct (but it might not say much :)

Acked-by: Vlastimil Babka <[email protected]>

But I have some suggestions...

> ---
> mm/swap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index cf39d24..118bac4 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -945,20 +945,20 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
> * #0: __pagevec_lru_add_fn #1: clear_page_mlock
> *
> * SetPageLRU() TestClearPageMlocked()
> - * smp_mb() // explicit ordering // above provides strict
> + * MB() // explicit ordering // above provides strict

Why MB()? That would be the first appearance of 'MB()' in the whole tree. I
think it's fine keeping smp_mb()...

> * // ordering
> * PageMlocked() PageLRU()
> *
> *
> * if '#1' does not observe setting of PG_lru by '#0' and fails
> * isolation, the explicit barrier will make sure that page_evictable
> - * check will put the page in correct LRU. Without smp_mb(), SetPageLRU
> + * check will put the page in correct LRU. Without MB(), SetPageLRU

... same here ...

> * can be reordered after PageMlocked check and can make '#1' to fail
> * the isolation of the page whose Mlocked bit is cleared (#0 is also
> * looking at the same page) and the evictable page will be stranded
> * in an unevictable LRU.

Only here I would note that SetPageLRU() is an atomic bitop so we can use the
__after_atomic() variant. And I would move the actual SetPageLRU() call from
above the comment here right before the barrier.
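
Something like the fragment below, i.e. SetPageLRU() placed directly above
the relaxed barrier it pairs with (a sketch only; the long ordering comment
and the rest of __pagevec_lru_add_fn() are elided):

    /* ... existing ordering comment kept here ... */
    SetPageLRU(page);               /* atomic bitop ... */
    smp_mb__after_atomic();         /* ... so the relaxed form is enough */

    if (page_evictable(page)) {
            lru = page_lru(page);
            /* ... */
    }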

> */
> - smp_mb();
> + smp_mb__after_atomic();

Thanks.

>
> if (page_evictable(page)) {
> lru = page_lru(page);
>

2020-03-16 17:50:06

by Yang Shi

Subject: Re: [PATCH 2/2] mm: swap: use smp_mb__after_atomic() to order LRU bit set



On 3/16/20 10:40 AM, Vlastimil Babka wrote:
> On 3/13/20 7:34 PM, Yang Shi wrote:
>> Memory barrier is needed after setting LRU bit, but smp_mb() is too
>> strong. Some architectures, i.e. x86, imply memory barrier with atomic
>> operations, so replacing it with smp_mb__after_atomic() sounds better,
>> which is nop on strong ordered machines, and full memory barriers on
>> others. With this change the vm-scalability cases would perform better
>> on x86, I saw total 6% improvement with this patch and previous inline
>> fix.
>>
>> The test data (lru-file-readtwice throughput) against v5.6-rc4:
>> mainline w/ inline fix w/ both (adding this)
>> 150MB 154MB 159MB
>>
>> Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
>> Cc: Shakeel Butt <[email protected]>
>> Cc: Vlastimil Babka <[email protected]>
>> Signed-off-by: Yang Shi <[email protected]>
> According to my understanding of Documentation/memory-barriers.txt this would be
> correct (but it might not say much :)

This is my understanding too.

>
> Acked-by: Vlastimil Babka <[email protected]>
>
> But i have some suggestions...
>
>> ---
>> mm/swap.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index cf39d24..118bac4 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -945,20 +945,20 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
>> * #0: __pagevec_lru_add_fn #1: clear_page_mlock
>> *
>> * SetPageLRU() TestClearPageMlocked()
>> - * smp_mb() // explicit ordering // above provides strict
>> + * MB() // explicit ordering // above provides strict
> Why MB()? That would be the first appearance of 'MB()' in the whole tree. I
> think it's fine keeping smp_mb()...

I would like to use a more general name, maybe just use "memory barrier"?

>
>> * // ordering
>> * PageMlocked() PageLRU()
>> *
>> *
>> * if '#1' does not observe setting of PG_lru by '#0' and fails
>> * isolation, the explicit barrier will make sure that page_evictable
>> - * check will put the page in correct LRU. Without smp_mb(), SetPageLRU
>> + * check will put the page in correct LRU. Without MB(), SetPageLRU
> ... same here ...
>
>> * can be reordered after PageMlocked check and can make '#1' to fail
>> * the isolation of the page whose Mlocked bit is cleared (#0 is also
>> * looking at the same page) and the evictable page will be stranded
>> * in an unevictable LRU.
> Only here I would note that SetPageLRU() is an atomic bitop so we can use the
> __after_atomic() variant. And I would move the actual SetPageLRU() call from
> above the comment here right before the barrier.

Sure. Thanks.

>
>> */
>> - smp_mb();
>> + smp_mb__after_atomic();
> Thanks.
>
>>
>> if (page_evictable(page)) {
>> lru = page_lru(page);
>>

2020-03-16 22:19:32

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm: swap: use smp_mb__after_atomic() to order LRU bit set



On 3/16/20 10:49 AM, Yang Shi wrote:
>
>
> On 3/16/20 10:40 AM, Vlastimil Babka wrote:
>> On 3/13/20 7:34 PM, Yang Shi wrote:
>>> Memory barrier is needed after setting LRU bit, but smp_mb() is too
>>> strong.  Some architectures, i.e. x86, imply memory barrier with atomic
>>> operations, so replacing it with smp_mb__after_atomic() sounds better,
>>> which is nop on strong ordered machines, and full memory barriers on
>>> others.  With this change the vm-scalability cases would perform better
>>> on x86, I saw total 6% improvement with this patch and previous inline
>>> fix.
>>>
>>> The test data (lru-file-readtwice throughput) against v5.6-rc4:
>>>     mainline    w/ inline fix    w/ both (adding this)
>>>     150MB        154MB        159MB
>>>
>>> Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
>>> Cc: Shakeel Butt <[email protected]>
>>> Cc: Vlastimil Babka <[email protected]>
>>> Signed-off-by: Yang Shi <[email protected]>
>> According to my understanding of Documentation/memory-barriers.txt
>> this would be
>> correct (but it might not say much :)
>
> This is my understanding too.
>
>>
>> Acked-by: Vlastimil Babka <[email protected]>
>>
>> But i have some suggestions...
>>
>>> ---
>>>   mm/swap.c | 6 +++---
>>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index cf39d24..118bac4 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -945,20 +945,20 @@ static void __pagevec_lru_add_fn(struct page
>>> *page, struct lruvec *lruvec,
>>>        * #0: __pagevec_lru_add_fn        #1: clear_page_mlock
>>>        *
>>>        * SetPageLRU()                TestClearPageMlocked()
>>> -     * smp_mb() // explicit ordering    // above provides strict
>>> +     * MB()     // explicit ordering    // above provides strict
>> Why MB()? That would be the first appearance of 'MB()' in the whole
>> tree. I
>> think it's fine keeping smp_mb()...
>
> I would like to use a more general name, maybe just use "memory barrier"?

Keeping smp_mb() should be just fine...

>
>>
>>>        *                    // ordering
>>>        * PageMlocked()            PageLRU()
>>>        *
>>>        *
>>>        * if '#1' does not observe setting of PG_lru by '#0' and fails
>>>        * isolation, the explicit barrier will make sure that
>>> page_evictable
>>> -     * check will put the page in correct LRU. Without smp_mb(),
>>> SetPageLRU
>>> +     * check will put the page in correct LRU. Without MB(),
>>> SetPageLRU
>> ... same here ...
>>
>>>        * can be reordered after PageMlocked check and can make '#1'
>>> to fail
>>>        * the isolation of the page whose Mlocked bit is cleared (#0
>>> is also
>>>        * looking at the same page) and the evictable page will be
>>> stranded
>>>        * in an unevictable LRU.
>> Only here I would note that SetPageLRU() is an atomic bitop so we can
>> use the
>> __after_atomic() variant. And I would move the actual SetPageLRU()
>> call from
>> above the comment here right before the barrier.
>
> Sure. Thanks.
>
>>
>>>        */
>>> -    smp_mb();
>>> +    smp_mb__after_atomic();
>> Thanks.
>>
>>>         if (page_evictable(page)) {
>>>           lru = page_lru(page);
>>>
>