2020-05-19 21:48:51

by Ahmed S. Darwish

Subject: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API

Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
implemented an optimization mechanism to exit the to-be-started LRU
drain operation (name it A) if another drain operation *started and
finished* while (A) was blocked on the LRU draining mutex.

This was done through a seqcount latch, which is an abuse of its
semantics:

1. Seqcount latching should be used for the purpose of switching
between two storage places with sequence protection to allow
interruptible, preemptible writer sections. The optimization
mechanism has absolutely nothing to do with that.

2. The used raw_write_seqcount_latch() has two smp write memory
barriers to always ensure one consistent storage place out of the
two storage places available. This extra smp_wmb() is redundant for
the optimization use case.

Besides the API abuse, the semantics of a latch sequence counter were
force-fitted into the optimization. What was actually meant is to track
generations of LRU draining operations, where "current lru draining
generation = x" implies that all generations 0 < n <= x are already
*scheduled* for draining.

Remove the conceptually-inappropriate seqcount latch usage and manually
implement the optimization using a counter and SMP memory barriers.

Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com
Signed-off-by: Ahmed S. Darwish <[email protected]>
---
mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 47 insertions(+), 10 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index bf9a79fed62d..d6910eeed43d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
*/
void lru_add_drain_all(void)
{
- static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
- static DEFINE_MUTEX(lock);
+ /*
+ * lru_drain_gen - Current generation of pages that could be in vectors
+ *
+ * (A) Definition: lru_drain_gen = x implies that all generations
+ * 0 < n <= x are already scheduled for draining.
+ *
+ * This is an optimization for the highly-contended use case where a
+ * user space workload keeps constantly generating a flow of pages
+ * for each CPU.
+ */
+ static unsigned int lru_drain_gen;
static struct cpumask has_work;
- int cpu, seq;
+ static DEFINE_MUTEX(lock);
+ int cpu, this_gen;

/*
* Make sure nobody triggers this path before mm_percpu_wq is fully
@@ -725,21 +735,48 @@ void lru_add_drain_all(void)
if (WARN_ON(!mm_percpu_wq))
return;

- seq = raw_read_seqcount_latch(&seqcount);
+ /*
+ * (B) Cache the LRU draining generation number
+ *
+ * smp_rmb() ensures that the counter is loaded before the mutex is
+ * taken. It pairs with the smp_wmb() inside the mutex critical section
+ * at (D).
+ */
+ this_gen = READ_ONCE(lru_drain_gen);
+ smp_rmb();

mutex_lock(&lock);

/*
- * Piggyback on drain started and finished while we waited for lock:
- * all pages pended at the time of our enter were drained from vectors.
+ * (C) Exit the draining operation if a newer generation, from another
+ * lru_add_drain_all(), was already scheduled for draining. Check (A).
*/
- if (__read_seqcount_retry(&seqcount, seq))
+ if (unlikely(this_gen != lru_drain_gen))
goto done;

- raw_write_seqcount_latch(&seqcount);
+ /*
+ * (D) Increment generation number
+ *
+ * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
+ * section.
+ *
+ * This pairing must be done here, before the for_each_online_cpu loop
+ * below which drains the page vectors.
+ *
+ * Let x, y, and z represent some system CPU numbers, where x < y < z.
+ * Assume CPU #z is in the middle of the for_each_online_cpu loop
+ * below and has already reached CPU #y's per-cpu data. CPU #x comes
+ * along, adds some pages to its per-cpu vectors, then calls
+ * lru_add_drain_all().
+ *
+ * If the paired smp_wmb() below is done at any later step, e.g. after
+ * the loop, CPU #x will just exit at (C) and miss flushing out all of
+ * its added pages.
+ */
+ WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
+ smp_wmb();

cpumask_clear(&has_work);
-
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

@@ -766,7 +803,7 @@ void lru_add_drain_all(void)
{
lru_add_drain();
}
-#endif
+#endif /* CONFIG_SMP */

/**
* release_pages - batched put_page()
--
2.20.1


2020-05-20 12:24:22

by Konstantin Khlebnikov

Subject: Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API

On 20/05/2020 00.45, Ahmed S. Darwish wrote:
> Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
> implemented an optimization mechanism to exit the to-be-started LRU
> drain operation (name it A) if another drain operation *started and
> finished* while (A) was blocked on the LRU draining mutex.
>
> This was done through a seqcount latch, which is an abuse of its
> semantics:
>
> 1. Seqcount latching should be used for the purpose of switching
> between two storage places with sequence protection to allow
> interruptible, preemptible writer sections. The optimization
> mechanism has absolutely nothing to do with that.
>
> 2. The used raw_write_seqcount_latch() has two smp write memory
> barriers to always ensure one consistent storage place out of the
> two storage places available. This extra smp_wmb() is redundant for
> the optimization use case.
>
> Besides the API abuse, the semantics of a latch sequence counter were
> force-fitted into the optimization. What was actually meant is to track
> generations of LRU draining operations, where "current lru draining
> generation = x" implies that all generations 0 < n <= x are already
> *scheduled* for draining.
>
> Remove the conceptually-inappropriate seqcount latch usage and manually
> implement the optimization using a counter and SMP memory barriers.

Well, I thought it fits perfectly =)

Maybe it's worth adding helpers with appropriate semantics?
This is a pretty common pattern.
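
For illustration only, a minimal sketch of what such generation-counter
helpers could look like -- the struct and the gen_counter_*() names are
hypothetical, not an existing kernel API, and the barrier choices follow
the version that was eventually merged (see the tip commit at the end of
this thread):

struct gen_counter {
	unsigned int gen;
};

/*
 * Snapshot the current generation *before* taking the serializing lock.
 * The smp_mb() orders the caller's earlier stores (e.g. pagevec counters)
 * before the load; it pairs with gen_counter_advance() below.
 */
static inline unsigned int gen_counter_snapshot(struct gen_counter *gc)
{
	smp_mb();
	return smp_load_acquire(&gc->gen);
}

/*
 * With the lock held: return true if another generation was already
 * scheduled while we were blocked on the lock, i.e. our work has been
 * piggybacked and nothing is left to do.
 */
static inline bool gen_counter_piggybacked(struct gen_counter *gc,
					   unsigned int snap)
{
	return snap != READ_ONCE(gc->gen);
}

/*
 * With the lock held: open a new generation before doing the work.
 * The smp_mb() orders the store before the work's later loads and stores.
 */
static inline void gen_counter_advance(struct gen_counter *gc)
{
	WRITE_ONCE(gc->gen, gc->gen + 1);
	smp_mb();
}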

>
> Link: https://lkml.kernel.org/r/CALYGNiPSr-cxV9MX9czaVh6Wz_gzSv3H_8KPvgjBTGbJywUJpA@mail.gmail.com
> Signed-off-by: Ahmed S. Darwish <[email protected]>
> ---
> mm/swap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 47 insertions(+), 10 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bf9a79fed62d..d6910eeed43d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
> */
> void lru_add_drain_all(void)
> {
> - static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
> - static DEFINE_MUTEX(lock);
> + /*
> + * lru_drain_gen - Current generation of pages that could be in vectors
> + *
> + * (A) Definition: lru_drain_gen = x implies that all generations
> + * 0 < n <= x are already scheduled for draining.
> + *
> + * This is an optimization for the highly-contended use case where a
> + * user space workload keeps constantly generating a flow of pages
> + * for each CPU.
> + */
> + static unsigned int lru_drain_gen;
> static struct cpumask has_work;
> - int cpu, seq;
> + static DEFINE_MUTEX(lock);
> + int cpu, this_gen;
>
> /*
> * Make sure nobody triggers this path before mm_percpu_wq is fully
> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> if (WARN_ON(!mm_percpu_wq))
> return;
>
> - seq = raw_read_seqcount_latch(&seqcount);
> + /*
> + * (B) Cache the LRU draining generation number
> + *
> + * smp_rmb() ensures that the counter is loaded before the mutex is
> + * taken. It pairs with the smp_wmb() inside the mutex critical section
> + * at (D).
> + */
> + this_gen = READ_ONCE(lru_drain_gen);
> + smp_rmb();
>
> mutex_lock(&lock);
>
> /*
> - * Piggyback on drain started and finished while we waited for lock:
> - * all pages pended at the time of our enter were drained from vectors.
> + * (C) Exit the draining operation if a newer generation, from another
> + * lru_add_drain_all(), was already scheduled for draining. Check (A).
> */
> - if (__read_seqcount_retry(&seqcount, seq))
> + if (unlikely(this_gen != lru_drain_gen))
> goto done;
>
> - raw_write_seqcount_latch(&seqcount);
> + /*
> + * (D) Increment generation number
> + *
> + * Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
> + * section.
> + *
> + * This pairing must be done here, before the for_each_online_cpu loop
> + * below which drains the page vectors.
> + *
> + * Let x, y, and z represent some system CPU numbers, where x < y < z.
> + * Assume CPU #z is in the middle of the for_each_online_cpu loop
> + * below and has already reached CPU #y's per-cpu data. CPU #x comes
> + * along, adds some pages to its per-cpu vectors, then calls
> + * lru_add_drain_all().
> + *
> + * If the paired smp_wmb() below is done at any later step, e.g. after
> + * the loop, CPU #x will just exit at (C) and miss flushing out all of
> + * its added pages.
> + */
> + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> + smp_wmb();
>
> cpumask_clear(&has_work);
> -
> for_each_online_cpu(cpu) {
> struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>
> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
> {
> lru_add_drain();
> }
> -#endif
> +#endif /* CONFIG_SMP */
>
> /**
> * release_pages - batched put_page()
>

2020-05-20 13:08:04

by Peter Zijlstra

Subject: Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API

On Wed, May 20, 2020 at 03:22:15PM +0300, Konstantin Khlebnikov wrote:
> On 20/05/2020 00.45, Ahmed S. Darwish wrote:
> > Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
> > implemented an optimization mechanism to exit the to-be-started LRU
> > drain operation (name it A) if another drain operation *started and
> > finished* while (A) was blocked on the LRU draining mutex.

That commit is horrible...

> Well, I thought it fits perfectly =)
>
> Maybe it's worth adding helpers with appropriate semantics?
> This is a pretty common pattern.

Where's more sites?

> > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> > if (WARN_ON(!mm_percpu_wq))
> > return;
> > - seq = raw_read_seqcount_latch(&seqcount);
> > mutex_lock(&lock);
> > /*
> > - * Piggyback on drain started and finished while we waited for lock:
> > - * all pages pended at the time of our enter were drained from vectors.
> > */
> > - if (__read_seqcount_retry(&seqcount, seq))
> > goto done;

Since there is no ordering in raw_read_seqcount_latch(), and
mutex_lock() is an ACQUIRE, there's no guarantee the read actually
happens before the mutex is acquired.
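
Spelled out as a sketch (illustration, not from the thread): an ACQUIRE only
keeps *later* accesses from being hoisted above it, it does not keep the
*earlier* load from sinking below it, so the above may effectively execute
as:

	mutex_lock(&lock);				/* ACQUIRE */
	seq = raw_read_seqcount_latch(&seqcount);	/* load moved into the critical section */

	if (__read_seqcount_retry(&seqcount, seq))	/* can no longer observe a drain that   */
		goto done;				/* started and finished while we blocked */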

> > - raw_write_seqcount_latch(&seqcount);
> > cpumask_clear(&has_work);

2020-05-22 14:59:41

by Peter Zijlstra

Subject: Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API

On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:
> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
> */
> void lru_add_drain_all(void)
> {

> + static unsigned int lru_drain_gen;
> static struct cpumask has_work;
> + static DEFINE_MUTEX(lock);
> + int cpu, this_gen;
>
> /*
> * Make sure nobody triggers this path before mm_percpu_wq is fully
> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> if (WARN_ON(!mm_percpu_wq))
> return;
>

> + this_gen = READ_ONCE(lru_drain_gen);
> + smp_rmb();

this_gen = smp_load_acquire(&lru_drain_gen);
>
> mutex_lock(&lock);
>
> /*
> + * (C) Exit the draining operation if a newer generation, from another
> + * lru_add_drain_all(), was already scheduled for draining. Check (A).
> */
> + if (unlikely(this_gen != lru_drain_gen))
> goto done;
>

> + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> + smp_wmb();

You can leave this smp_wmb() out and rely on the smp_mb() implied by
queue_work_on()'s test_and_set_bit().
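
For reference, the ordering relied upon here comes from the PENDING
test-and-set in queue_work_on(); roughly (paraphrased from
kernel/workqueue.c of that era, comments and tracing trimmed):

bool queue_work_on(int cpu, struct workqueue_struct *wq,
		   struct work_struct *work)
{
	bool ret = false;
	unsigned long flags;

	local_irq_save(flags);

	/* Value-returning atomic RMW: implies a full memory barrier. */
	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
		__queue_work(cpu, wq, work);
		ret = true;
	}

	local_irq_restore(flags);
	return ret;
}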

> cpumask_clear(&has_work);
> -
> for_each_online_cpu(cpu) {
> struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>

While you're here, do:

s/cpumask_set_cpu/__&/

> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
> {
> lru_add_drain();
> }
> -#endif
> +#endif /* CONFIG_SMP */
>
> /**
> * release_pages - batched put_page()

by Sebastian A. Siewior

Subject: Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API

On 2020-05-22 16:57:07 [+0200], Peter Zijlstra wrote:
> > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> > if (WARN_ON(!mm_percpu_wq))
> > return;
> >
>
> > + this_gen = READ_ONCE(lru_drain_gen);
> > + smp_rmb();
>
> this_gen = smp_load_acquire(&lru_drain_gen);
> >
> > mutex_lock(&lock);
> >
> > /*
> > + * (C) Exit the draining operation if a newer generation, from another
> > + * lru_add_drain_all(), was already scheduled for draining. Check (A).
> > */
> > + if (unlikely(this_gen != lru_drain_gen))
> > goto done;
> >
>
> > + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > + smp_wmb();
>
> You can leave this smp_wmb() out and rely on the smp_mb() implied by
> queue_work_on()'s test_and_set_bit().

This is to avoid smp_store_release() ?

Sebastian

2020-05-22 16:26:12

by Peter Zijlstra

Subject: Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API

On Fri, May 22, 2020 at 05:17:05PM +0200, Sebastian A. Siewior wrote:
> On 2020-05-22 16:57:07 [+0200], Peter Zijlstra wrote:
> > > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> > > if (WARN_ON(!mm_percpu_wq))
> > > return;
> > >
> >
> > > + this_gen = READ_ONCE(lru_drain_gen);
> > > + smp_rmb();
> >
> > this_gen = smp_load_acquire(&lru_drain_gen);
> > >
> > > mutex_lock(&lock);
> > >
> > > /*
> > > + * (C) Exit the draining operation if a newer generation, from another
> > > + * lru_add_drain_all(), was already scheduled for draining. Check (A).
> > > */
> > > + if (unlikely(this_gen != lru_drain_gen))
> > > goto done;
> > >
> >
> > > + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > > + smp_wmb();
> >
> > You can leave this smp_wmb() out and rely on the smp_mb() implied by
> > queue_work_on()'s test_and_set_bit().
>
> This is to avoid smp_store_release() ?

store_release would have the barrier on the other end. If you read the
comments (I so helpfully cut out) you'll see it wants to order against
later stores, not earlier.
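
Spelled out as a sketch (illustration only):

	/*
	 * smp_store_release(): the implied barrier sits *before* the store,
	 * i.e. it orders all of the CPU's earlier accesses against the
	 * generation update itself.
	 */
	smp_store_release(&lru_drain_gen, lru_drain_gen + 1);

	/*
	 * What (D) wants: the generation update ordered against what comes
	 * *after* it, i.e. before the per-cpu pagevec state is inspected in
	 * the for_each_online_cpu() loop. (smp_wmb() only covers the later
	 * stores; the finally merged version uses a full smp_mb() so that
	 * the later loads are ordered as well.)
	 */
	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
	smp_wmb();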

2020-05-25 15:27:10

by Ahmed S. Darwish

Subject: Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API

Peter Zijlstra <[email protected]> wrote:
> On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:
> > @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
> > */
> > void lru_add_drain_all(void)
> > {
>

Re-adding cut-out comment for context:

/*
* lru_drain_gen - Current generation of pages that could be in vectors
*
* (A) Definition: lru_drain_gen = x implies that all generations
* 0 < n <= x are already scheduled for draining.
*
* This is an optimization for the highly-contended use case where a
* user space workload keeps constantly generating a flow of pages
* for each CPU.
*/
> > + static unsigned int lru_drain_gen;
> > static struct cpumask has_work;
> > + static DEFINE_MUTEX(lock);
> > + int cpu, this_gen;
> >
> > /*
> > * Make sure nobody triggers this path before mm_percpu_wq is fully
> > @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
> > if (WARN_ON(!mm_percpu_wq))
> > return;
> >
>

Re-adding cut-out comment for context:

/*
* (B) Cache the LRU draining generation number
*
* smp_rmb() ensures that the counter is loaded before the mutex is
* taken. It pairs with the smp_wmb() inside the mutex critical section
* at (D).
*/
> > + this_gen = READ_ONCE(lru_drain_gen);
> > + smp_rmb();
>
> this_gen = smp_load_acquire(&lru_drain_gen);

ACK. will do.

> >
> > mutex_lock(&lock);
> >
> > /*
> > + * (C) Exit the draining operation if a newer generation, from another
> > + * lru_add_drain_all(), was already scheduled for draining. Check (A).
> > */
> > + if (unlikely(this_gen != lru_drain_gen))
> > goto done;
> >
>

Re-adding cut-out comment for context:

/*
* (D) Increment generation number
*
* Pairs with READ_ONCE() and smp_rmb() at (B), outside of the critical
* section.
*
* This pairing must be done here, before the for_each_online_cpu loop
* below which drains the page vectors.
*
* Let x, y, and z represent some system CPU numbers, where x < y < z.
* Assume CPU #z is in the middle of the for_each_online_cpu loop
* below and has already reached CPU #y's per-cpu data. CPU #x comes
* along, adds some pages to its per-cpu vectors, then calls
* lru_add_drain_all().
*
* If the paired smp_wmb() below is done at any later step, e.g. after
* the loop, CPU #x will just exit at (C) and miss flushing out all of
* its added pages.
*/
> > + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > + smp_wmb();
>
> You can leave this smp_wmb() out and rely on the smp_mb() implied by
> queue_work_on()'s test_and_set_bit().
>

Won't this be too implicit?

Isn't it possible that, over the years, the queue_work_on() implementation
changes and the test_and_set_bit()/smp_mb() gets removed?

If that happens, this commit will get *silently* broken and the local
CPU pages won't be drained.

> > cpumask_clear(&has_work);
> > -
> > for_each_online_cpu(cpu) {
> > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >
>
> While you're here, do:
>
> s/cpumask_set_cpu/__&/
>

ACK.

Thanks,

--
Ahmed S. Darwish
Linutronix GmbH

2020-05-25 15:50:29

by Peter Zijlstra

Subject: Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API

On Mon, May 25, 2020 at 05:24:01PM +0200, Ahmed S. Darwish wrote:
> Peter Zijlstra <[email protected]> wrote:
> > On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:

> > > + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> > > + smp_wmb();
> >
> > You can leave this smp_wmb() out and rely on the smp_mb() implied by
> > queue_work_on()'s test_and_set_bit().
> >
>
> Won't this be too implicit?
>
> Isn't it possible that, over the years, the queue_work_on() implementation
> changes and the test_and_set_bit()/smp_mb() gets removed?
>
> If that happens, this commit will get *silently* broken and the local
> CPU pages won't be drained.

Add a comment to queue_work_on() that points here? That way people are
aware.

2020-05-25 21:01:17

by John Ogness

Subject: Re: [PATCH v1 02/25] mm/swap: Don't abuse the seqcount latching API

Hi,

This optimization is broken. The main concern here: Is it possible that
lru_add_drain_all() _would_ have drained pagevec X, but then aborted
because another lru_add_drain_all() is underway and that other task will
_not_ drain pagevec X? I claim the answer is yes!

My suggested changes are inline below.

I attached a litmus test to verify it.

On 2020-05-22, Peter Zijlstra <[email protected]> wrote:
> On Tue, May 19, 2020 at 11:45:24PM +0200, Ahmed S. Darwish wrote:
>> @@ -713,10 +713,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
>> */
>> void lru_add_drain_all(void)
>> {
>
>> + static unsigned int lru_drain_gen;
>> static struct cpumask has_work;
>> + static DEFINE_MUTEX(lock);
>> + int cpu, this_gen;
>>
>> /*
>> * Make sure nobody triggers this path before mm_percpu_wq is fully
>> @@ -725,21 +735,48 @@ void lru_add_drain_all(void)
>> if (WARN_ON(!mm_percpu_wq))
>> return;
>>

An smp_mb() is needed here.

/*
* Guarantee the pagevec counter stores visible by
* this CPU are visible to other CPUs before loading
* the current drain generation.
*/
smp_mb();

>> + this_gen = READ_ONCE(lru_drain_gen);
>> + smp_rmb();
>
> this_gen = smp_load_acquire(&lru_drain_gen);
>>
>> mutex_lock(&lock);
>>
>> /*
>> + * (C) Exit the draining operation if a newer generation, from another
>> + * lru_add_drain_all(), was already scheduled for draining. Check (A).
>> */
>> + if (unlikely(this_gen != lru_drain_gen))
>> goto done;
>>
>
>> + WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
>> + smp_wmb();

Instead of smp_wmb(), this needs to be a full memory barrier.

/*
* Guarantee the new drain generation is stored before
* loading the pagevec counters.
*/
smp_mb();

> You can leave this smp_wmb() out and rely on the smp_mb() implied by
> queue_work_on()'s test_and_set_bit().
>
>> cpumask_clear(&has_work);
>> -
>> for_each_online_cpu(cpu) {
>> struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>>
>
> While you're here, do:
>
> s/cpumask_set_cpu/__&/
>
>> @@ -766,7 +803,7 @@ void lru_add_drain_all(void)
>> {
>> lru_add_drain();
>> }
>> -#endif
>> +#endif /* CONFIG_SMP */
>>
>> /**
>> * release_pages - batched put_page()

For the litmus test:

1:rx=0 (P1 did not see the pagevec counter)
2:rx=1 (P2 _would_ have seen the pagevec counter)
2:ry1=0 /\ 2:ry2=1 (P2 aborted due to optimization)

Changing the smp_mb() back to smp_wmb() in P1 and removing the smp_mb()
in P2 represents this patch. And it shows that sometimes P2 will abort
even though it would have drained the pagevec and P1 did not drain the
pagevec.
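
As a C-like sketch of that interleaving (illustration only, not the attached
litmus file), with P1 adding a page on its CPU and P2 being the concurrent
lru_add_drain_all() that gets to scan first:

	/* P1: add a page, then request a drain */
	pagevec_add(pvec, page);		/* pagevec counter store       */
	smp_mb();				/* needed: order the store ... */
	this_gen = READ_ONCE(lru_drain_gen);	/* ... before this load        */

	/* P2: concurrent lru_add_drain_all(), already holding the mutex */
	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
	smp_mb();				/* needed: order the store ... */
	if (!pagevec_count(pvec))		/* ... before this load        */
		;				/* no work queued for P1's CPU */

	/*
	 * With only smp_wmb()/smp_rmb(), both loads may still observe the
	 * old values: P2 misses P1's pagevec and queues no work for it,
	 * while P1 later sees the bumped generation, aborts at (C), and
	 * never drains its own pages.
	 */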

This is ugly as hell. And there may be other memory barrier types to make
it pretty. But as is, memory barriers are missing.

John Ogness


Attachments:
lru_add_drain_all.litmus (1.11 kB)

2020-09-10 19:28:46

by tip-bot2 for Ahmed S. Darwish

Subject: [tip: locking/core] mm/swap: Do not abuse the seqcount_t latching API

The following commit has been merged into the locking/core branch of tip:

Commit-ID: 6446a5131e24a834606c15a965fa920041581c2c
Gitweb: https://git.kernel.org/tip/6446a5131e24a834606c15a965fa920041581c2c
Author: Ahmed S. Darwish <[email protected]>
AuthorDate: Thu, 27 Aug 2020 13:40:38 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 10 Sep 2020 11:19:28 +02:00

mm/swap: Do not abuse the seqcount_t latching API

Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
implemented an optimization mechanism to exit the to-be-started LRU
drain operation (name it A) if another drain operation *started and
finished* while (A) was blocked on the LRU draining mutex.

This was done through a seqcount_t latch, which is an abuse of its
semantics:

1. seqcount_t latching should be used for the purpose of switching
between two storage places with sequence protection to allow
interruptible, preemptible writer sections. The referenced
optimization mechanism has absolutely nothing to do with that.

2. The used raw_write_seqcount_latch() has two SMP write memory
barriers to ensure one consistent storage place out of the two
storage places available. A full memory barrier is required
instead: to guarantee that the pagevec counter stores visible by
local CPU are visible to other CPUs -- before loading the current
drain generation.

Besides the seqcount_t API abuse, the semantics of a latch sequence
counter were force-fitted into the referenced optimization. What was
meant is to track "generations" of LRU draining operations, where
"global lru draining generation = x" implies that all generations
0 < n <= x are already *scheduled* for draining -- thus nothing needs
to be done if the current generation number n <= x.

Remove the conceptually-inappropriate seqcount_t latch usage. Manually
implement the referenced optimization using a counter and SMP memory
barriers.

Note, while at it, use the non-atomic variant of cpumask_set_cpu(),
__cpumask_set_cpu(), due to the already existing mutex protection.

Signed-off-by: Ahmed S. Darwish <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
mm/swap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 54 insertions(+), 11 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index d16d65d..a1ec807 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -763,10 +763,20 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
*/
void lru_add_drain_all(void)
{
- static seqcount_t seqcount = SEQCNT_ZERO(seqcount);
- static DEFINE_MUTEX(lock);
+ /*
+ * lru_drain_gen - Global pages generation number
+ *
+ * (A) Definition: global lru_drain_gen = x implies that all generations
+ * 0 < n <= x are already *scheduled* for draining.
+ *
+ * This is an optimization for the highly-contended use case where a
+ * user space workload keeps constantly generating a flow of pages for
+ * each CPU.
+ */
+ static unsigned int lru_drain_gen;
static struct cpumask has_work;
- int cpu, seq;
+ static DEFINE_MUTEX(lock);
+ unsigned cpu, this_gen;

/*
* Make sure nobody triggers this path before mm_percpu_wq is fully
@@ -775,21 +785,54 @@ void lru_add_drain_all(void)
if (WARN_ON(!mm_percpu_wq))
return;

- seq = raw_read_seqcount_latch(&seqcount);
+ /*
+ * Guarantee pagevec counter stores visible by this CPU are visible to
+ * other CPUs before loading the current drain generation.
+ */
+ smp_mb();
+
+ /*
+ * (B) Locally cache global LRU draining generation number
+ *
+ * The read barrier ensures that the counter is loaded before the mutex
+ * is taken. It pairs with smp_mb() inside the mutex critical section
+ * at (D).
+ */
+ this_gen = smp_load_acquire(&lru_drain_gen);

mutex_lock(&lock);

/*
- * Piggyback on drain started and finished while we waited for lock:
- * all pages pended at the time of our enter were drained from vectors.
+ * (C) Exit the draining operation if a newer generation, from another
+ * lru_add_drain_all(), was already scheduled for draining. Check (A).
*/
- if (__read_seqcount_retry(&seqcount, seq))
+ if (unlikely(this_gen != lru_drain_gen))
goto done;

- raw_write_seqcount_latch(&seqcount);
+ /*
+ * (D) Increment global generation number
+ *
+ * Pairs with smp_load_acquire() at (B), outside of the critical
+ * section. Use a full memory barrier to guarantee that the new global
+ * drain generation number is stored before loading pagevec counters.
+ *
+ * This pairing must be done here, before the for_each_online_cpu loop
+ * below which drains the page vectors.
+ *
+ * Let x, y, and z represent some system CPU numbers, where x < y < z.
+ * Assume CPU #z is in the middle of the for_each_online_cpu loop
+ * below and has already reached CPU #y's per-cpu data. CPU #x comes
+ * along, adds some pages to its per-cpu vectors, then calls
+ * lru_add_drain_all().
+ *
+ * If the paired barrier is done at any later step, e.g. after the
+ * loop, CPU #x will just exit at (C) and miss flushing out all of its
+ * added pages.
+ */
+ WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
+ smp_mb();

cpumask_clear(&has_work);
-
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

@@ -801,7 +844,7 @@ void lru_add_drain_all(void)
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
queue_work_on(cpu, mm_percpu_wq, work);
- cpumask_set_cpu(cpu, &has_work);
+ __cpumask_set_cpu(cpu, &has_work);
}
}

@@ -816,7 +859,7 @@ void lru_add_drain_all(void)
{
lru_add_drain();
}
-#endif
+#endif /* CONFIG_SMP */

/**
* release_pages - batched put_page()