2020-11-20 08:29:29

by Alex Shi

Subject: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

The current relock logic switches the lru_lock whenever it finds a new
lruvec, so if two memcgs are reading files or allocating pages at the
same time, they can end up holding the lru_lock alternately, waiting on
each other because of the fairness of the ticket spinlock.

This patch sorts the pages by lruvec so that each lru_lock is taken only
once in the above scenario, which reduces the fairness-induced waiting
when the lock is retaken. With it, vm-scalability/case-lru-file-readtwice
gains ~5% performance on my 2P*20core*HT machine.

Suggested-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 49 insertions(+), 8 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 490553f3f9ef..c787b38bf9c0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1009,24 +1009,65 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
trace_mm_lru_insertion(page, lru);
}

+struct lruvecs {
+ struct list_head lists[PAGEVEC_SIZE];
+ struct lruvec *vecs[PAGEVEC_SIZE];
+};
+
+/* Sort pvec pages on their lruvec */
+int sort_page_lruvec(struct lruvecs *lruvecs, struct pagevec *pvec)
+{
+ int i, j, nr_lruvec;
+ struct page *page;
+ struct lruvec *lruvec = NULL;
+
+ lruvecs->vecs[0] = NULL;
+ for (i = nr_lruvec = 0; i < pagevec_count(pvec); i++) {
+ page = pvec->pages[i];
+ lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+
+ /* Try to find a same lruvec */
+ for (j = 0; j <= nr_lruvec; j++)
+ if (lruvec == lruvecs->vecs[j])
+ break;
+
+ /* A new lruvec */
+ if (j > nr_lruvec) {
+ INIT_LIST_HEAD(&lruvecs->lists[nr_lruvec]);
+ lruvecs->vecs[nr_lruvec] = lruvec;
+ j = nr_lruvec++;
+ lruvecs->vecs[nr_lruvec] = 0;
+ }
+
+ list_add_tail(&page->lru, &lruvecs->lists[j]);
+ }
+
+ return nr_lruvec;
+}
+
/*
* Add the passed pages to the LRU, then drop the caller's refcount
* on them. Reinitialises the caller's pagevec.
*/
void __pagevec_lru_add(struct pagevec *pvec)
{
- int i;
- struct lruvec *lruvec = NULL;
+ int i, nr_lruvec;
unsigned long flags = 0;
+ struct page *page;
+ struct lruvecs lruvecs;

- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
+ nr_lruvec = sort_page_lruvec(&lruvecs, pvec);

- lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
- __pagevec_lru_add_fn(page, lruvec);
+ for (i = 0; i < nr_lruvec; i++) {
+ spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
+ while (!list_empty(&lruvecs.lists[i])) {
+ page = lru_to_page(&lruvecs.lists[i]);
+ list_del(&page->lru);
+ __pagevec_lru_add_fn(page, lruvecs.vecs[i]);
+ }
+ spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
}
- if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
+
release_pages(pvec->pages, pvec->nr);
pagevec_reinit(pvec);
}
--
2.29.GIT


2020-11-20 23:22:37

by Andrew Morton

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

On Fri, 20 Nov 2020 16:27:27 +0800 Alex Shi <[email protected]> wrote:

> The current relock logic switches the lru_lock whenever it finds a new
> lruvec, so if two memcgs are reading files or allocating pages at the
> same time, they can end up holding the lru_lock alternately, waiting on
> each other because of the fairness of the ticket spinlock.
>
> This patch sorts the pages by lruvec so that each lru_lock is taken only
> once in the above scenario, which reduces the fairness-induced waiting
> when the lock is retaken. With it, vm-scalability/case-lru-file-readtwice
> gains ~5% performance on my 2P*20core*HT machine.

But what happens when all or most of the pages belong to the same
lruvec? This sounds like the common case - won't it suffer?

2020-11-23 04:53:37

by Alex Shi

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add



On 2020/11/21 7:19 AM, Andrew Morton wrote:
> On Fri, 20 Nov 2020 16:27:27 +0800 Alex Shi <[email protected]> wrote:
>
>> The current relock logic switches the lru_lock whenever it finds a new
>> lruvec, so if two memcgs are reading files or allocating pages at the
>> same time, they can end up holding the lru_lock alternately, waiting on
>> each other because of the fairness of the ticket spinlock.
>>
>> This patch sorts the pages by lruvec so that each lru_lock is taken only
>> once in the above scenario, which reduces the fairness-induced waiting
>> when the lock is retaken. With it, vm-scalability/case-lru-file-readtwice
>> gains ~5% performance on my 2P*20core*HT machine.
>
> But what happens when all or most of the pages belong to the same
> lruvec? This sounds like the common case - won't it suffer?
>
Hi Andrew,

My testing shows no regression in this situation. As with the original
CentOS 7 kernel, most of the time is still spent on the lru_lock in
lru-sensitive cases.

Thanks
Alex

2020-11-25 15:41:48

by Vlastimil Babka

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

On 11/20/20 9:27 AM, Alex Shi wrote:
> The current relock logic switches the lru_lock whenever it finds a new
> lruvec, so if two memcgs are reading files or allocating pages at the
> same time, they can end up holding the lru_lock alternately, waiting on
> each other because of the fairness of the ticket spinlock.
>
> This patch sorts the pages by lruvec so that each lru_lock is taken only
> once in the above scenario, which reduces the fairness-induced waiting
> when the lock is retaken. With it, vm-scalability/case-lru-file-readtwice
> gains ~5% performance on my 2P*20core*HT machine.

Hm, once you sort the pages like this, it's a shame not to splice them
instead of more list_del() + list_add() iterations. update_lru_size()
could also be called once?
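
To make the suggestion concrete, here is a minimal sketch of what splicing
could look like for one lruvec's batch. It is not from any posted patch: it
assumes the batch has already been grouped by lru list and zone, and it
ignores the per-page flag and tracing work done in __pagevec_lru_add_fn --
which is exactly the part Alex points out below is awkward.

        /*
         * Hedged illustration only: move a whole pre-grouped batch with one
         * list operation and one accounting update, instead of a
         * list_del()/list_add() pair per page. "batch", "lru", "zid" and
         * "nr_pages" are assumed to come from a grouping step that kept
         * pages of one lru list and one zone together.
         */
        spin_lock_irqsave(&lruvec->lru_lock, flags);
        list_splice_tail_init(&batch, &lruvec->lists[lru]);
        update_lru_size(lruvec, lru, zid, nr_pages);
        spin_unlock_irqrestore(&lruvec->lru_lock, flags);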

> Suggested-by: Konstantin Khlebnikov <[email protected]>
> Signed-off-by: Alex Shi <[email protected]>
> Cc: Konstantin Khlebnikov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Yu Zhao <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 49 insertions(+), 8 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 490553f3f9ef..c787b38bf9c0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1009,24 +1009,65 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
> trace_mm_lru_insertion(page, lru);
> }
>
> +struct lruvecs {
> + struct list_head lists[PAGEVEC_SIZE];
> + struct lruvec *vecs[PAGEVEC_SIZE];
> +};
> +
> +/* Sort pvec pages on their lruvec */
> +int sort_page_lruvec(struct lruvecs *lruvecs, struct pagevec *pvec)
> +{
> + int i, j, nr_lruvec;
> + struct page *page;
> + struct lruvec *lruvec = NULL;
> +
> + lruvecs->vecs[0] = NULL;
> + for (i = nr_lruvec = 0; i < pagevec_count(pvec); i++) {
> + page = pvec->pages[i];
> + lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +
> + /* Try to find a same lruvec */
> + for (j = 0; j <= nr_lruvec; j++)
> + if (lruvec == lruvecs->vecs[j])
> + break;
> +
> + /* A new lruvec */
> + if (j > nr_lruvec) {
> + INIT_LIST_HEAD(&lruvecs->lists[nr_lruvec]);
> + lruvecs->vecs[nr_lruvec] = lruvec;
> + j = nr_lruvec++;
> + lruvecs->vecs[nr_lruvec] = 0;
> + }
> +
> + list_add_tail(&page->lru, &lruvecs->lists[j]);
> + }
> +
> + return nr_lruvec;
> +}
> +
> /*
> * Add the passed pages to the LRU, then drop the caller's refcount
> * on them. Reinitialises the caller's pagevec.
> */
> void __pagevec_lru_add(struct pagevec *pvec)
> {
> - int i;
> - struct lruvec *lruvec = NULL;
> + int i, nr_lruvec;
> unsigned long flags = 0;
> + struct page *page;
> + struct lruvecs lruvecs;
>
> - for (i = 0; i < pagevec_count(pvec); i++) {
> - struct page *page = pvec->pages[i];
> + nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
>
> - lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> - __pagevec_lru_add_fn(page, lruvec);
> + for (i = 0; i < nr_lruvec; i++) {
> + spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
> + while (!list_empty(&lruvecs.lists[i])) {
> + page = lru_to_page(&lruvecs.lists[i]);
> + list_del(&page->lru);
> + __pagevec_lru_add_fn(page, lruvecs.vecs[i]);
> + }
> + spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
> }
> - if (lruvec)
> - unlock_page_lruvec_irqrestore(lruvec, flags);
> +
> release_pages(pvec->pages, pvec->nr);
> pagevec_reinit(pvec);
> }
>

2020-11-26 09:39:26

by Alex Shi

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add



On 2020/11/25 11:38 PM, Vlastimil Babka wrote:
> On 11/20/20 9:27 AM, Alex Shi wrote:
>> The current relock logic switches the lru_lock whenever it finds a new
>> lruvec, so if two memcgs are reading files or allocating pages at the
>> same time, they can end up holding the lru_lock alternately, waiting on
>> each other because of the fairness of the ticket spinlock.
>>
>> This patch sorts the pages by lruvec so that each lru_lock is taken only
>> once in the above scenario, which reduces the fairness-induced waiting
>> when the lock is retaken. With it, vm-scalability/case-lru-file-readtwice
>> gains ~5% performance on my 2P*20core*HT machine.
>
> Hm, once you sort the pages like this, it's a shame not to splice them instead of more list_del() + list_add() iterations. update_lru_size() could also be called once?

Yes, it looks like a good idea to use splice instead of list_del/add, but pages
may be on different lru lists within the same lruvec, and may also come from
different zones. That could mean handling 5 different lists, and more once
zones are counted...

I gave up on trying it.
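
For reference, the "5 different lists" are the per-lruvec LRU lists, roughly as
defined in include/linux/mmzone.h at the time:

enum lru_list {
        LRU_INACTIVE_ANON = LRU_BASE,
        LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
        LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
        LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
        LRU_UNEVICTABLE,
        NR_LRU_LISTS
};

A splice would have to keep pages of one list (and, for the accounting, one
zone) together, which is why pre-sorting only by lruvec is not enough here.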

2020-11-26 09:44:16

by Yu Zhao

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

On Fri, Nov 20, 2020 at 04:27:27PM +0800, Alex Shi wrote:
> The current relock logic switches the lru_lock whenever it finds a new
> lruvec, so if two memcgs are reading files or allocating pages at the
> same time, they can end up holding the lru_lock alternately, waiting on
> each other because of the fairness of the ticket spinlock.
>
> This patch sorts the pages by lruvec so that each lru_lock is taken only
> once in the above scenario, which reduces the fairness-induced waiting
> when the lock is retaken. With it, vm-scalability/case-lru-file-readtwice
> gains ~5% performance on my 2P*20core*HT machine.
>
> Suggested-by: Konstantin Khlebnikov <[email protected]>
> Signed-off-by: Alex Shi <[email protected]>
> Cc: Konstantin Khlebnikov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Yu Zhao <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 49 insertions(+), 8 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 490553f3f9ef..c787b38bf9c0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1009,24 +1009,65 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
> trace_mm_lru_insertion(page, lru);
> }
>
> +struct lruvecs {
> + struct list_head lists[PAGEVEC_SIZE];
> + struct lruvec *vecs[PAGEVEC_SIZE];
> +};
> +
> +/* Sort pvec pages on their lruvec */
> +int sort_page_lruvec(struct lruvecs *lruvecs, struct pagevec *pvec)
> +{
> + int i, j, nr_lruvec;
> + struct page *page;
> + struct lruvec *lruvec = NULL;
> +
> + lruvecs->vecs[0] = NULL;
> + for (i = nr_lruvec = 0; i < pagevec_count(pvec); i++) {
> + page = pvec->pages[i];
> + lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
> +
> + /* Try to find a same lruvec */
> + for (j = 0; j <= nr_lruvec; j++)
> + if (lruvec == lruvecs->vecs[j])
> + break;
> +
> + /* A new lruvec */
> + if (j > nr_lruvec) {
> + INIT_LIST_HEAD(&lruvecs->lists[nr_lruvec]);
> + lruvecs->vecs[nr_lruvec] = lruvec;
> + j = nr_lruvec++;
> + lruvecs->vecs[nr_lruvec] = 0;
> + }
> +
> + list_add_tail(&page->lru, &lruvecs->lists[j]);
> + }
> +
> + return nr_lruvec;
> +}
> +
> /*
> * Add the passed pages to the LRU, then drop the caller's refcount
> * on them. Reinitialises the caller's pagevec.
> */
> void __pagevec_lru_add(struct pagevec *pvec)
> {
> - int i;
> - struct lruvec *lruvec = NULL;
> + int i, nr_lruvec;
> unsigned long flags = 0;
> + struct page *page;
> + struct lruvecs lruvecs;
>
> - for (i = 0; i < pagevec_count(pvec); i++) {
> - struct page *page = pvec->pages[i];
> + nr_lruvec = sort_page_lruvec(&lruvecs, pvec);

Simply looping pvec multiple times (15 at most) for different lruvecs
would be better because 1) it requires no extra data structures and
therefore has better cache locality (theoretically faster) 2) it only
loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
impact on Android and Chrome OS.

> - lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> - __pagevec_lru_add_fn(page, lruvec);
> + for (i = 0; i < nr_lruvec; i++) {
> + spin_lock_irqsave(&lruvecs.vecs[i]->lru_lock, flags);
> + while (!list_empty(&lruvecs.lists[i])) {
> + page = lru_to_page(&lruvecs.lists[i]);
> + list_del(&page->lru);
> + __pagevec_lru_add_fn(page, lruvecs.vecs[i]);
> + }
> + spin_unlock_irqrestore(&lruvecs.vecs[i]->lru_lock, flags);
> }
> - if (lruvec)
> - unlock_page_lruvec_irqrestore(lruvec, flags);
> +
> release_pages(pvec->pages, pvec->nr);
> pagevec_reinit(pvec);
> }
> --
> 2.29.GIT
>

2020-11-26 10:07:06

by Alex Shi

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add



On 2020/11/26 12:52 PM, Yu Zhao wrote:
>> */
>> void __pagevec_lru_add(struct pagevec *pvec)
>> {
>> - int i;
>> - struct lruvec *lruvec = NULL;
>> + int i, nr_lruvec;
>> unsigned long flags = 0;
>> + struct page *page;
>> + struct lruvecs lruvecs;
>>
>> - for (i = 0; i < pagevec_count(pvec); i++) {
>> - struct page *page = pvec->pages[i];
>> + nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
> Simply looping pvec multiple times (15 at most) for different lruvecs
> would be better because 1) it requires no extra data structures and
> therefore has better cache locality (theoretically faster) 2) it only
> loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
> impact on Android and Chrome OS.
>

With multiple memcgs it does help a lot; I had gotten a 30% gain on the
readtwice case. But yes, w/o MEMCG and NUMA it's good to keep the old
behavior. So would you like to send a proposal for this?

Thanks
Alex

2020-11-26 10:27:49

by Yu Zhao

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
>
>
> On 2020/11/26 12:52 PM, Yu Zhao wrote:
> >> */
> >> void __pagevec_lru_add(struct pagevec *pvec)
> >> {
> >> - int i;
> >> - struct lruvec *lruvec = NULL;
> >> + int i, nr_lruvec;
> >> unsigned long flags = 0;
> >> + struct page *page;
> >> + struct lruvecs lruvecs;
> >>
> >> - for (i = 0; i < pagevec_count(pvec); i++) {
> >> - struct page *page = pvec->pages[i];
> >> + nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
> > Simply looping pvec multiple times (15 at most) for different lruvecs
> > would be better because 1) it requires no extra data structures and
> > therefore has better cache locality (theoretically faster) 2) it only
> > loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
> > impact on Android and Chrome OS.
> >
>
> With multiple memcgs it does help a lot; I had gotten a 30% gain on the
> readtwice case. But yes, w/o MEMCG and NUMA it's good to keep the old
> behavior. So would you like to send a proposal for this?

Oh, no, I'm not against your idea. I was saying it doesn't seem
necessary to sort -- a nested loop would just do the job given
pagevec is small.

diff --git a/mm/swap.c b/mm/swap.c
index cb3794e13b48..1d238edc2907 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
*/
void __pagevec_lru_add(struct pagevec *pvec)
{
- int i;
+ int i, j;
struct lruvec *lruvec = NULL;
unsigned long flags = 0;

for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];

+ if (!page)
+ continue;
+
lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
- __pagevec_lru_add_fn(page, lruvec);
+
+ for (j = i; j < pagevec_count(pvec); j++) {
+ if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
+ page_memcg(pvec->pages[j]) != page_memcg(page))
+ continue;
+
+ __pagevec_lru_add_fn(pvec->pages[j], lruvec);
+ pvec->pages[j] = NULL;
+ }
}
if (lruvec)
unlock_page_lruvec_irqrestore(lruvec, flags);

2020-11-26 10:44:03

by Alex Shi

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add



On 2020/11/26 3:24 PM, Yu Zhao wrote:
> Oh, no, I'm not against your idea. I was saying it doesn't seem
> necessary to sort -- a nested loop would just do the job given
> pagevec is small.
>
> diff --git a/mm/swap.c b/mm/swap.c
> index cb3794e13b48..1d238edc2907 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
> */
> void __pagevec_lru_add(struct pagevec *pvec)
> {
> - int i;
> + int i, j;
> struct lruvec *lruvec = NULL;
> unsigned long flags = 0;
>
> for (i = 0; i < pagevec_count(pvec); i++) {
> struct page *page = pvec->pages[i];
>
> + if (!page)
> + continue;
> +
> lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> - __pagevec_lru_add_fn(page, lruvec);
> +
> + for (j = i; j < pagevec_count(pvec); j++) {
> + if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
> + page_memcg(pvec->pages[j]) != page_memcg(page))
> + continue;
> +
> + __pagevec_lru_add_fn(pvec->pages[j], lruvec);
> + pvec->pages[j] = NULL;
> + }

Uh, I have to say your method is better than mine. And this could be reused
for all relock_page_lruvec users. I expect this could speed up lru
performance a lot!


> }
> if (lruvec)
> unlock_page_lruvec_irqrestore(lruvec, flags);

2020-11-26 12:10:56

by Vlastimil Babka

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

On 11/26/20 8:24 AM, Yu Zhao wrote:
> On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
>>
>>
>> On 2020/11/26 12:52 PM, Yu Zhao wrote:
>> >> */
>> >> void __pagevec_lru_add(struct pagevec *pvec)
>> >> {
>> >> - int i;
>> >> - struct lruvec *lruvec = NULL;
>> >> + int i, nr_lruvec;
>> >> unsigned long flags = 0;
>> >> + struct page *page;
>> >> + struct lruvecs lruvecs;
>> >>
>> >> - for (i = 0; i < pagevec_count(pvec); i++) {
>> >> - struct page *page = pvec->pages[i];
>> >> + nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
>> > Simply looping pvec multiple times (15 at most) for different lruvecs
>> > would be better because 1) it requires no extra data structures and
>> > therefore has better cache locality (theoretically faster) 2) it only
>> > loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
>> > impact on Android and Chrome OS.
>> >
>>
>> With multiple memcgs it does help a lot; I had gotten a 30% gain on the
>> readtwice case. But yes, w/o MEMCG and NUMA it's good to keep the old
>> behavior. So would you like to send a proposal for this?
>
> Oh, no, I'm not against your idea. I was saying it doesn't seem
> necessary to sort -- a nested loop would just do the job given
> pagevec is small.

Yeah that could work. The worst case doesn't look nice (O(n^2)) but it should be
rather rare to have pages from many different memcgs/nodes?

Maybe with the following change? Avoids going through the nulls if we got lucky
(or have !MEMCG !NUMA).

> diff --git a/mm/swap.c b/mm/swap.c
> index cb3794e13b48..1d238edc2907 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
> */
> void __pagevec_lru_add(struct pagevec *pvec)
> {
> - int i;
> + int i, j;

int i, j, n;

> struct lruvec *lruvec = NULL;
> unsigned long flags = 0;
>

n = pagevec_count(pvec);
> for (i = 0; i < pagevec_count(pvec); i++) {

for (i = 0; n; i++) {

> struct page *page = pvec->pages[i];
>
> + if (!page)
> + continue;
> +
> lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
> - __pagevec_lru_add_fn(page, lruvec);

n--;

> +
> + for (j = i; j < pagevec_count(pvec); j++) {
> + if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
> + page_memcg(pvec->pages[j]) != page_memcg(page))
> + continue;
> +
> + __pagevec_lru_add_fn(pvec->pages[j], lruvec);
> + pvec->pages[j] = NULL;

n--;
> + }
> }
> if (lruvec)
> unlock_page_lruvec_irqrestore(lruvec, flags);
>
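
Putting the two pieces together for readability, Yu Zhao's loop with this
counter folded in would look roughly like the sketch below. It is assembled
from the fragments above, not a posted patch; the decrement sits only in the
inner loop, since the j == i pass already handles the page the outer loop
picked.

void __pagevec_lru_add(struct pagevec *pvec)
{
        int i, j, n;
        struct lruvec *lruvec = NULL;
        unsigned long flags = 0;

        n = pagevec_count(pvec);
        /* stop as soon as every page has been handled and NULLed out */
        for (i = 0; n; i++) {
                struct page *page = pvec->pages[i];

                if (!page)
                        continue;

                lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);

                /* handle all remaining pages of the same node + memcg */
                for (j = i; j < pagevec_count(pvec); j++) {
                        if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
                            page_memcg(pvec->pages[j]) != page_memcg(page))
                                continue;

                        __pagevec_lru_add_fn(pvec->pages[j], lruvec);
                        pvec->pages[j] = NULL;
                        n--;
                }
        }
        if (lruvec)
                unlock_page_lruvec_irqrestore(lruvec, flags);
        /* the release_pages()/pagevec_reinit() tail stays as in the diff above */
}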

2020-11-26 15:47:58

by Vlastimil Babka

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

On 11/26/20 12:22 PM, Vlastimil Babka wrote:
> On 11/26/20 8:24 AM, Yu Zhao wrote:
>> On Thu, Nov 26, 2020 at 02:39:03PM +0800, Alex Shi wrote:
>>>
>>>
>>> On 2020/11/26 12:52 PM, Yu Zhao wrote:
>>> >> */
>>> >> void __pagevec_lru_add(struct pagevec *pvec)
>>> >> {
>>> >> - int i;
>>> >> - struct lruvec *lruvec = NULL;
>>> >> + int i, nr_lruvec;
>>> >> unsigned long flags = 0;
>>> >> + struct page *page;
>>> >> + struct lruvecs lruvecs;
>>> >>
>>> >> - for (i = 0; i < pagevec_count(pvec); i++) {
>>> >> - struct page *page = pvec->pages[i];
>>> >> + nr_lruvec = sort_page_lruvec(&lruvecs, pvec);
>>> > Simply looping pvec multiple times (15 at most) for different lruvecs
>>> > would be better because 1) it requires no extra data structures and
>>> > therefore has better cache locality (theoretically faster) 2) it only
>>> > loops once when !CONFIG_MEMCG and !CONFIG_NUMA and therefore has no
>>> > impact on Android and Chrome OS.
>>> >
>>>
>>> With multiple memcgs it does help a lot; I had gotten a 30% gain on the
>>> readtwice case. But yes, w/o MEMCG and NUMA it's good to keep the old
>>> behavior. So would you like to send a proposal for this?
>>
>> Oh, no, I'm not against your idea. I was saying it doesn't seem
>> necessary to sort -- a nested loop would just do the job given
>> pagevec is small.
>
> Yeah that could work. The worst case doesn't look nice (O(n^2)) but it should be
> rather rare to have pages from many different memcgs/nodes?

However, Matthew wanted to increase pagevec size [1] and once 15^2 becomes 63^2,
it starts to be somewhat more worrying.

[1] https://lore.kernel.org/linux-mm/[email protected]/

> Maybe with the following change? Avoids going through the nulls if we got lucky
> (or have !MEMCG !NUMA).
>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index cb3794e13b48..1d238edc2907 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -996,15 +996,26 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
>> */
>> void __pagevec_lru_add(struct pagevec *pvec)
>> {
>> - int i;
>> + int i, j;
>
> int i, j, n;
>
>> struct lruvec *lruvec = NULL;
>> unsigned long flags = 0;
>>
>
> n = pagevec_count(pvec);
>> for (i = 0; i < pagevec_count(pvec); i++) {
>
> for (i = 0; n; i++) {
>
>> struct page *page = pvec->pages[i];
>>
>> + if (!page)
>> + continue;
>> +
>> lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
>> - __pagevec_lru_add_fn(page, lruvec);
>
> n--;
>
>> +
>> + for (j = i; j < pagevec_count(pvec); j++) {
>> + if (page_to_nid(pvec->pages[j]) != page_to_nid(page) ||
>> + page_memcg(pvec->pages[j]) != page_memcg(page))
>> + continue;
>> +
>> + __pagevec_lru_add_fn(pvec->pages[j], lruvec);
>> + pvec->pages[j] = NULL;
>
> n--;
>> + }
>> }
>> if (lruvec)
>> unlock_page_lruvec_irqrestore(lruvec, flags);
>>
>

2020-11-26 15:59:25

by Matthew Wilcox

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

On Thu, Nov 26, 2020 at 04:44:04PM +0100, Vlastimil Babka wrote:
> However, Matthew wanted to increase pagevec size [1] and once 15^2 becomes
> 63^2, it starts to be somewhat more worrying.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/

Well, Tim wanted it ;-)

I would suggest that rather than an insertion sort (or was it a bubble
sort?), we should be using a Shell sort. It's ideal for these kinds of
smallish arrays.

https://en.wikipedia.org/wiki/Shellsort
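
A gapped insertion sort of this kind fits the existing pagevec layout directly.
Here is a sketch of what it could look like, keyed on a precomputed array of
lruvec addresses kept in step with the pages; the names and the Knuth 3x+1 gap
sequence are illustrative choices, not necessarily what Alex's later RFC uses.

static void shell_sort(struct pagevec *pvec, unsigned long *lvaddr)
{
        int n = pagevec_count(pvec);
        int gap, i, j;

        /* pick a starting gap from Knuth's 1, 4, 13, 40, ... sequence */
        for (gap = 1; gap < n / 3; gap = gap * 3 + 1)
                ;

        for (; gap > 0; gap /= 3) {
                /* gapped insertion sort on the (lvaddr[], pages[]) pairs */
                for (i = gap; i < n; i++) {
                        unsigned long key = lvaddr[i];
                        struct page *page = pvec->pages[i];

                        for (j = i; j >= gap && lvaddr[j - gap] > key; j -= gap) {
                                lvaddr[j] = lvaddr[j - gap];
                                pvec->pages[j] = pvec->pages[j - gap];
                        }
                        lvaddr[j] = key;
                        pvec->pages[j] = page;
                }
        }
}

Pages sharing a lruvec then sit next to each other, so the caller can walk the
sorted pagevec and take each lru_lock exactly once.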

2020-11-27 08:13:52

by Vlastimil Babka

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add

On 11/26/20 4:12 AM, Alex Shi wrote:
>
>
> On 2020/11/25 11:38 PM, Vlastimil Babka wrote:
>> On 11/20/20 9:27 AM, Alex Shi wrote:
>>> The current relock logic switches the lru_lock whenever it finds a new
>>> lruvec, so if two memcgs are reading files or allocating pages at the
>>> same time, they can end up holding the lru_lock alternately, waiting on
>>> each other because of the fairness of the ticket spinlock.
>>>
>>> This patch sorts the pages by lruvec so that each lru_lock is taken only
>>> once in the above scenario, which reduces the fairness-induced waiting
>>> when the lock is retaken. With it, vm-scalability/case-lru-file-readtwice
>>> gains ~5% performance on my 2P*20core*HT machine.
>>
>> Hm, once you sort the pages like this, it's a shame not to splice them instead of more list_del() + list_add() iterations. update_lru_size() could also be called once?
>
> Yes, it looks like a good idea to use splice instead of list_del/add, but pages
> may be on different lru lists within the same lruvec, and may also come from
> different zones. That could mean handling 5 different lists, and more once
> zones are counted...

Hmm, zones wouldn't affect splicing (there's a per-node lru these days), but
would affect accounting. And yeah, there are 5 lru lists, and we probably need
to be under lru lock to stabilize where a page belongs, so pre-sorting without
lock wouldn't be safe? Bummer.

> I gave up on trying it.
>

2020-11-27 08:49:04

by Alex Shi

Subject: Re: [PATCH next] mm/swap.c: reduce lock contention in lru_cache_add



On 2020/11/26 11:55 PM, Matthew Wilcox wrote:
> On Thu, Nov 26, 2020 at 04:44:04PM +0100, Vlastimil Babka wrote:
>> However, Matthew wanted to increase pagevec size [1] and once 15^2 becomes
>> 63^2, it starts to be somewhat more worrying.
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Well, Tim wanted it ;-)
>
> I would suggest that rather than an insertion sort (or was it a bubble
> sort?), we should be using a Shell sort. It's ideal for these kinds of
> smallish arrays.
>
> https://en.wikipedia.org/wiki/Shellsort
>

Uh, that looks perfectly good! I'm gonna look into it. :)

Thanks!

2020-12-25 10:03:28

by Alex Shi

Subject: [RFC PATCH 0/4] pre sort pages on lruvec in pagevec

This idea was tried on top of the per-memcg lru_lock patchset v18 and had a
good result, about a 5%~20+% performance gain on lru_lock-busy benchmarks
like case-lru-file-readtwice.

But on the latest kernel I cannot reproduce that result on my box, nor can I
reproduce Tim's performance gain there.

So I don't know whether it is still worthwhile in some scenario; I am just
sending it out in case someone is interested...

Alex Shi (4):
mm/swap.c: pre-sort pages in pagevec for pagevec_lru_move_fn
mm/swap.c: bail out early for no memcg and no numa
mm/swap.c: extend the usage to pagevec_lru_add
mm/swap.c: no sort if all page's lruvec are same

Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]

mm/swap.c | 118 +++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 91 insertions(+), 27 deletions(-)

--
2.29.GIT

2020-12-25 10:04:11

by Alex Shi

Subject: [RFC PATCH 3/4] mm/swap.c: extend the usage to pagevec_lru_add

The only difference between __pagevec_lru_add and the other functions that
move pages between lru lists is that a page being added to an lru list has no
need to do TestClearPageLRU and set the lru bit back. So we can combine them
with a clear-lru-bit switch as a parameter of the sort function.

Then all the lru list operation functions can be unified.

Signed-off-by: Alex Shi <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/swap.c | 31 ++++++++++++-------------------
1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index bb5300b7e321..9a2269e5099b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -12,6 +12,7 @@
* Started 18.12.91
* Swap aging added 23.2.95, Stephen Tweedie.
* Buffermem limits added 12.3.98, Rik van Riel.
+ * Pre-sort pagevec added 12.1.20, Alex Shi.
*/

#include <linux/mm.h>
@@ -227,8 +228,8 @@ static void shell_sort(struct pagevec *pvec, unsigned long *lvaddr)
}

/* Get lru bit cleared page and their lruvec address, release the others */
-void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
- unsigned long *lvaddr)
+static void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
+ unsigned long *lvaddr, bool clearlru)
{
int i, j;
struct pagevec busypv;
@@ -242,7 +243,7 @@ void sort_isopv(struct pagevec *pvec, struct pagevec *isopv,
pvec->pages[i] = NULL;

/* block memcg migration during page moving between lru */
- if (!TestClearPageLRU(page)) {
+ if (clearlru && !TestClearPageLRU(page)) {
pagevec_add(&busypv, page);
continue;
}
@@ -266,9 +267,13 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
unsigned long flags = 0;
unsigned long lvaddr[PAGEVEC_SIZE];
struct pagevec sortedpv;
+ bool clearlru;
+
+ /* don't clear lru bit for new page adding to lru */
+ clearlru = pvec != this_cpu_ptr(&lru_pvecs.lru_add);

pagevec_init(&sortedpv);
- sort_isopv(pvec, &sortedpv, lvaddr);
+ sort_isopv(pvec, &sortedpv, lvaddr, clearlru);

n = pagevec_count(&sortedpv);
if (!n)
@@ -287,7 +292,8 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,

(*move_fn)(sortedpv.pages[i], lruvec);

- SetPageLRU(sortedpv.pages[i]);
+ if (clearlru)
+ SetPageLRU(sortedpv.pages[i]);
}
spin_unlock_irqrestore(&lruvec->lru_lock, flags);
release_pages(sortedpv.pages, sortedpv.nr);
@@ -1111,20 +1117,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
*/
void __pagevec_lru_add(struct pagevec *pvec)
{
- int i;
- struct lruvec *lruvec = NULL;
- unsigned long flags = 0;
-
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
-
- lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
- __pagevec_lru_add_fn(page, lruvec);
- }
- if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
- release_pages(pvec->pages, pvec->nr);
- pagevec_reinit(pvec);
+ pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
}

/**
--
2.29.GIT