2022-02-22 16:33:40

by Marcelo Tosatti

Subject: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu


On systems that run FIFO:1 applications that busy loop
on isolated CPUs, executing tasks on such CPUs at
lower priority is undesirable (since that will either
hang the system, or cause a longer interruption to the
FIFO task, as the lower priority task only runs in
very small sched slices).

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

However, it is possible to use synchronize_rcu(), which provides the same
guarantees (see the comment this patch modifies in lru_cache_disable()).

Fixes:

[ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
[ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
[ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
[ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
[ 1873.243936] Call Trace:
[ 1873.243938] __schedule+0x21b/0x5b0
[ 1873.243941] schedule+0x43/0xe0
[ 1873.243943] schedule_timeout+0x14d/0x190
[ 1873.243946] ? resched_curr+0x20/0xe0
[ 1873.243953] ? __prepare_to_swait+0x4b/0x70
[ 1873.243958] wait_for_completion+0x84/0xe0
[ 1873.243962] __flush_work.isra.0+0x146/0x200
[ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 1873.243971] __lru_add_drain_all+0x158/0x1f0
[ 1873.243978] do_migrate_pages+0x3d/0x2d0
[ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
[ 1873.243989] ? put_prev_task_fair+0x1e/0x30
[ 1873.243992] ? pick_next_task+0xb30/0xbd0
[ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
[ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
[ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
[ 1873.244005] ? __switch_to+0x12f/0x510
[ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
[ 1873.244016] process_one_work+0x1e0/0x410
[ 1873.244019] worker_thread+0x50/0x3b0
[ 1873.244022] ? process_one_work+0x410/0x410
[ 1873.244024] kthread+0x173/0x190
[ 1873.244027] ? set_kthread_struct+0x40/0x40
[ 1873.244031] ret_from_fork+0x1f/0x30

Signed-off-by: Marcelo Tosatti <[email protected]>

---

v3: update stale comment (Nicolas Saenz Julienne)
v2: rt_spin_lock calls rcu_read_lock, no need
to add it before local_lock on swap.c (Nicolas Saenz Julienne)

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..abb26293e7c1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

- if (force_all_cpus ||
- pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+ if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
@@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
void lru_cache_disable(void)
{
atomic_inc(&lru_disable_count);
+ synchronize_rcu();
#ifdef CONFIG_SMP
/*
- * lru_add_drain_all in the force mode will schedule draining on
- * all online CPUs so any calls of lru_cache_disabled wrapped by
- * local_lock or preemption disabled would be ordered by that.
- * The atomic operation doesn't need to have stronger ordering
- * requirements because that is enforced by the scheduling
- * guarantees.
+ * synchronize_rcu() waits for preemption disabled
+ * and RCU read side critical sections.
+ * For the users of lru_disable_count:
+ *
+ * preempt_disable, local_irq_disable [bh_lru_lock()]
+ * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
+ * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
+ *
+ * so any calls of lru_cache_disabled wrapped by local_lock or
+ * preemption disabled would be ordered by that.
*/
__lru_add_drain_all(true);
#else
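
To illustrate the guarantee the changelog and the updated comment rely on,
here is a minimal userspace sketch of the same pattern written against
liburcu. It is an analogy, not kernel code: the header and link names
(<urcu.h>, -lurcu) assume a typical liburcu install, and disable_count,
reader() and the loop bounds are invented for the sketch.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <urcu.h>              /* liburcu: rcu_read_lock(), synchronize_rcu() */

static atomic_int disable_count;       /* plays the role of lru_disable_count */

static void *reader(void *arg)
{
    (void)arg;
    rcu_register_thread();     /* liburcu readers must register themselves */

    for (int i = 0; i < 1000000; i++) {
        rcu_read_lock();       /* read-side critical section */
        if (atomic_load(&disable_count) == 0) {
            /* fast path: batch locally (the "pagevec" case) */
        } else {
            /* slow path: bypass the local batch */
        }
        rcu_read_unlock();
    }

    rcu_unregister_thread();
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, reader, NULL);

    atomic_fetch_add(&disable_count, 1);   /* like atomic_inc(&lru_disable_count) */
    synchronize_rcu();
    /*
     * Every read-side section that could have observed disable_count == 0
     * has finished; sections still running, and all later ones, observe
     * the increment and take the slow path.  That is the visibility the
     * patch obtains without queueing a work item on every CPU.
     */
    printf("disable_count is now visible to all remaining readers\n");

    pthread_join(tid, NULL);
    return 0;
}

Build with something like "cc -O2 demo.c -lurcu -lpthread" on a system with
liburcu installed.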


2022-02-22 19:40:28

by Nicolas Saenz Julienne

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Tue, 2022-02-22 at 13:07 -0300, Marcelo Tosatti wrote:
> On systems that run FIFO:1 applications that busy loop
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task
> with very small sched slices).
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
>
> Fixes:
>
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938] __schedule+0x21b/0x5b0
> [ 1873.243941] schedule+0x43/0xe0
> [ 1873.243943] schedule_timeout+0x14d/0x190
> [ 1873.243946] ? resched_curr+0x20/0xe0
> [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958] wait_for_completion+0x84/0xe0
> [ 1873.243962] __flush_work.isra.0+0x146/0x200
> [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005] ? __switch_to+0x12f/0x510
> [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016] process_one_work+0x1e0/0x410
> [ 1873.244019] worker_thread+0x50/0x3b0
> [ 1873.244022] ? process_one_work+0x410/0x410
> [ 1873.244024] kthread+0x173/0x190
> [ 1873.244027] ? set_kthread_struct+0x40/0x40
> [ 1873.244031] ret_from_fork+0x1f/0x30
>
> Signed-off-by: Marcelo Tosatti <[email protected]>

Reviewed-by: Nicolas Saenz Julienne <[email protected]>

Regards,

--
Nicolás Sáenz

2022-03-04 09:36:28

by Paul E. McKenney

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> (Question for paulmck below, please)
>
> On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <[email protected]> wrote:
>
> >
> > On systems that run FIFO:1 applications that busy loop
> > on isolated CPUs, executing tasks on such CPUs under
> > lower priority is undesired (since that will either
> > hang the system, or cause longer interruption to the
> > FIFO task due to execution of lower priority task
> > with very small sched slices).
> >
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > pagevec during the migration temporarily") relies on
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> >
> > However, its possible to use synchronize_rcu which will provide the same
> > guarantees (see comment this patch modifies on lru_cache_disable).
> >
> > Fixes:
> >
> > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > [ 1873.243936] Call Trace:
> > [ 1873.243938] __schedule+0x21b/0x5b0
> > [ 1873.243941] schedule+0x43/0xe0
> > [ 1873.243943] schedule_timeout+0x14d/0x190
> > [ 1873.243946] ? resched_curr+0x20/0xe0
> > [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> > [ 1873.243958] wait_for_completion+0x84/0xe0
> > [ 1873.243962] __flush_work.isra.0+0x146/0x200
> > [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> > [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> > [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> > [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> > [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> > [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> > [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> > [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> > [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> > [ 1873.244005] ? __switch_to+0x12f/0x510
> > [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> > [ 1873.244016] process_one_work+0x1e0/0x410
> > [ 1873.244019] worker_thread+0x50/0x3b0
> > [ 1873.244022] ? process_one_work+0x410/0x410
> > [ 1873.244024] kthread+0x173/0x190
> > [ 1873.244027] ? set_kthread_struct+0x40/0x40
> > [ 1873.244031] ret_from_fork+0x1f/0x30
> >
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > for_each_online_cpu(cpu) {
> > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >
> > - if (force_all_cpus ||
> > - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
>
> This change appears to be "don't queue work on CPUs which don't have
> any work to do". Correct? This isn't changelogged?
>
> > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> > void lru_cache_disable(void)
> > {
> > atomic_inc(&lru_disable_count);
> > + synchronize_rcu();
> > #ifdef CONFIG_SMP
> > /*
> > - * lru_add_drain_all in the force mode will schedule draining on
> > - * all online CPUs so any calls of lru_cache_disabled wrapped by
> > - * local_lock or preemption disabled would be ordered by that.
> > - * The atomic operation doesn't need to have stronger ordering
> > - * requirements because that is enforced by the scheduling
> > - * guarantees.
> > + * synchronize_rcu() waits for preemption disabled
> > + * and RCU read side critical sections.
> > + * For the users of lru_disable_count:
> > + *
> > + * preempt_disable, local_irq_disable [bh_lru_lock()]
> > + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> > + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> > + *
> > + * so any calls of lru_cache_disabled wrapped by local_lock or
> > + * preemption disabled would be ordered by that.
> > */
> > __lru_add_drain_all(true);
> > #else
>
> Does this also work with CONFIG_TINY_RCU?
>
> This seems abusive of synchronize_rcu(). None of this code uses RCU,
> but it so happens that synchronize_rcu() happily provides the desired
> effects. Changes in RCU's happy side-effects might break this.
> Perhaps a formal API function which does whatever-you-want-it-to-do
> would be better.

I don't claim to understand the full lru_cache_disable() use case, but
since v5.1 synchronize_rcu() is guaranteed to wait on preempt_disable()
regions of code. In contrast, back in the old days, you had to use
synchronize_sched() to wait on preempt_disable() regions, even in
CONFIG_PREEMPT=y kernels. So if the comment is accurate, it is OK.

Just be careful what you backport past v5.1...

> And... I really don't understand the fix. What is it about
> synchronize_rcu() which guarantees that a work function which is queued
> on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> userspace?

I don't understand this part, either.

Thanx, Paul

2022-03-04 11:30:42

by Andrew Morton

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

(Question for paulmck below, please)

On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <[email protected]> wrote:

>
> On systems that run FIFO:1 applications that busy loop
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task
> with very small sched slices).
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
>
> Fixes:
>
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938] __schedule+0x21b/0x5b0
> [ 1873.243941] schedule+0x43/0xe0
> [ 1873.243943] schedule_timeout+0x14d/0x190
> [ 1873.243946] ? resched_curr+0x20/0xe0
> [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958] wait_for_completion+0x84/0xe0
> [ 1873.243962] __flush_work.isra.0+0x146/0x200
> [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005] ? __switch_to+0x12f/0x510
> [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016] process_one_work+0x1e0/0x410
> [ 1873.244019] worker_thread+0x50/0x3b0
> [ 1873.244022] ? process_one_work+0x410/0x410
> [ 1873.244024] kthread+0x173/0x190
> [ 1873.244027] ? set_kthread_struct+0x40/0x40
> [ 1873.244031] ret_from_fork+0x1f/0x30
>
> ...
>
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> for_each_online_cpu(cpu) {
> struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>
> - if (force_all_cpus ||
> - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||

This change appears to be "don't queue work on CPUs which don't have
any work to do". Correct? This isn't changelogged?

> @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> void lru_cache_disable(void)
> {
> atomic_inc(&lru_disable_count);
> + synchronize_rcu();
> #ifdef CONFIG_SMP
> /*
> - * lru_add_drain_all in the force mode will schedule draining on
> - * all online CPUs so any calls of lru_cache_disabled wrapped by
> - * local_lock or preemption disabled would be ordered by that.
> - * The atomic operation doesn't need to have stronger ordering
> - * requirements because that is enforced by the scheduling
> - * guarantees.
> + * synchronize_rcu() waits for preemption disabled
> + * and RCU read side critical sections.
> + * For the users of lru_disable_count:
> + *
> + * preempt_disable, local_irq_disable [bh_lru_lock()]
> + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> + *
> + * so any calls of lru_cache_disabled wrapped by local_lock or
> + * preemption disabled would be ordered by that.
> */
> __lru_add_drain_all(true);
> #else

Does this also work with CONFIG_TINY_RCU?

This seems abusive of synchronize_rcu(). None of this code uses RCU,
but it so happens that synchronize_rcu() happily provides the desired
effects. Changes in RCU's happy side-effects might break this.
Perhaps a formal API function which does whatever-you-want-it-to-do
would be better.

And... I really don't understand the fix. What is it about
synchronize_rcu() which guarantees that a work function which is queued
on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
userspace?

2022-03-04 18:34:01

by Marcelo Tosatti

Subject: [patch v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu


On systems that run FIFO:1 applications that busy loop
on isolated CPUs, executing tasks on such CPUs at
lower priority is undesirable (since that will either
hang the system, or cause a longer interruption to the
FIFO task, as the lower priority task only runs in
very small sched slices).

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

However, it is possible to use synchronize_rcu(), which provides the same
guarantees (see the comment this patch modifies in lru_cache_disable()).

Fixes:

[ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
[ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
[ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
[ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
[ 1873.243936] Call Trace:
[ 1873.243938] __schedule+0x21b/0x5b0
[ 1873.243941] schedule+0x43/0xe0
[ 1873.243943] schedule_timeout+0x14d/0x190
[ 1873.243946] ? resched_curr+0x20/0xe0
[ 1873.243953] ? __prepare_to_swait+0x4b/0x70
[ 1873.243958] wait_for_completion+0x84/0xe0
[ 1873.243962] __flush_work.isra.0+0x146/0x200
[ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 1873.243971] __lru_add_drain_all+0x158/0x1f0
[ 1873.243978] do_migrate_pages+0x3d/0x2d0
[ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
[ 1873.243989] ? put_prev_task_fair+0x1e/0x30
[ 1873.243992] ? pick_next_task+0xb30/0xbd0
[ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
[ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
[ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
[ 1873.244005] ? __switch_to+0x12f/0x510
[ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
[ 1873.244016] process_one_work+0x1e0/0x410
[ 1873.244019] worker_thread+0x50/0x3b0
[ 1873.244022] ? process_one_work+0x410/0x410
[ 1873.244024] kthread+0x173/0x190
[ 1873.244027] ? set_kthread_struct+0x40/0x40
[ 1873.244031] ret_from_fork+0x1f/0x30

Signed-off-by: Marcelo Tosatti <[email protected]>

---

v4: improve comment clarity, mention synchronize_rcu guarantees
on v5.1 (Andrew Morton / Paul E. McKenney)
v3: update stale comment (Nicolas Saenz Julienne)
v2: rt_spin_lock calls rcu_read_lock, no need
to add it before local_lock on swap.c (Nicolas Saenz Julienne)

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..b5ee163daa66 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

- if (force_all_cpus ||
- pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+ if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
@@ -876,15 +875,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
void lru_cache_disable(void)
{
atomic_inc(&lru_disable_count);
-#ifdef CONFIG_SMP
/*
- * lru_add_drain_all in the force mode will schedule draining on
- * all online CPUs so any calls of lru_cache_disabled wrapped by
- * local_lock or preemption disabled would be ordered by that.
- * The atomic operation doesn't need to have stronger ordering
- * requirements because that is enforced by the scheduling
- * guarantees.
+ * Readers of lru_disable_count are protected by either disabling
+ * preemption or rcu_read_lock:
+ *
+ * preempt_disable, local_irq_disable [bh_lru_lock()]
+ * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
+ * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
+ *
+ * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
+ * preempt_disable() regions of code. So any CPU which sees
+ * lru_disable_count = 0 will have exited the critical
+ * section when synchronize_rcu() returns.
*/
+ synchronize_rcu();
+#ifdef CONFIG_SMP
__lru_add_drain_all(true);
#else
lru_add_and_bh_lrus_drain();

2022-03-04 19:04:52

by Marcelo Tosatti

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, Mar 03, 2022 at 05:49:30PM -0800, Paul E. McKenney wrote:
> On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> > (Question for paulmck below, please)
> >
> > On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <[email protected]> wrote:
> >
> > >
> > > On systems that run FIFO:1 applications that busy loop
> > > on isolated CPUs, executing tasks on such CPUs under
> > > lower priority is undesired (since that will either
> > > hang the system, or cause longer interruption to the
> > > FIFO task due to execution of lower priority task
> > > with very small sched slices).
> > >
> > > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > > pagevec during the migration temporarily") relies on
> > > queueing work items on all online CPUs to ensure visibility
> > > of lru_disable_count.
> > >
> > > However, its possible to use synchronize_rcu which will provide the same
> > > guarantees (see comment this patch modifies on lru_cache_disable).
> > >
> > > Fixes:
> > >
> > > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > > [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> > > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> > > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > > [ 1873.243936] Call Trace:
> > > [ 1873.243938] __schedule+0x21b/0x5b0
> > > [ 1873.243941] schedule+0x43/0xe0
> > > [ 1873.243943] schedule_timeout+0x14d/0x190
> > > [ 1873.243946] ? resched_curr+0x20/0xe0
> > > [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> > > [ 1873.243958] wait_for_completion+0x84/0xe0
> > > [ 1873.243962] __flush_work.isra.0+0x146/0x200
> > > [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> > > [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> > > [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> > > [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> > > [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> > > [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> > > [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> > > [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> > > [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> > > [ 1873.244005] ? __switch_to+0x12f/0x510
> > > [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> > > [ 1873.244016] process_one_work+0x1e0/0x410
> > > [ 1873.244019] worker_thread+0x50/0x3b0
> > > [ 1873.244022] ? process_one_work+0x410/0x410
> > > [ 1873.244024] kthread+0x173/0x190
> > > [ 1873.244027] ? set_kthread_struct+0x40/0x40
> > > [ 1873.244031] ret_from_fork+0x1f/0x30
> > >
> > > ...
> > >
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > > for_each_online_cpu(cpu) {
> > > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> > >
> > > - if (force_all_cpus ||
> > > - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > > + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > > data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> > > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> > > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> >
> > This change appears to be "don't queue work on CPUs which don't have
> > any work to do". Correct? This isn't changelogged?

It's replaced by synchronize_rcu(), and it's mentioned in the changelog:

"However, its possible to use synchronize_rcu which will provide the same
guarantees (see comment this patch modifies on lru_cache_disable)."

Will resend -v4 with a more verbose changelog.


> > > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> > > void lru_cache_disable(void)
> > > {
> > > atomic_inc(&lru_disable_count);
> > > + synchronize_rcu();
> > > #ifdef CONFIG_SMP
> > > /*
> > > - * lru_add_drain_all in the force mode will schedule draining on
> > > - * all online CPUs so any calls of lru_cache_disabled wrapped by
> > > - * local_lock or preemption disabled would be ordered by that.
> > > - * The atomic operation doesn't need to have stronger ordering
> > > - * requirements because that is enforced by the scheduling
> > > - * guarantees.
> > > + * synchronize_rcu() waits for preemption disabled
> > > + * and RCU read side critical sections.
> > > + * For the users of lru_disable_count:
> > > + *
> > > + * preempt_disable, local_irq_disable [bh_lru_lock()]
> > > + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> > > + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> > > + *
> > > + * so any calls of lru_cache_disabled wrapped by local_lock or
> > > + * preemption disabled would be ordered by that.
> > > */
> > > __lru_add_drain_all(true);
> > > #else
> >
> > Does this also work with CONFIG_TINY_RCU?
> >
> > This seems abusive of synchronize_rcu(). None of this code uses RCU,
> > but it so happens that synchronize_rcu() happily provides the desired
> > effects. Changes in RCU's happy side-effects might break this.
> > Perhaps a formal API function which does whatever-you-want-it-to-do
> > would be better.
>
> I don't claim to understand the full lru_cache_disable() use case, but
> since v5.1 synchronize_rcu() is guaranteed to wait on preempt_disable()
> regions of code. In contrast, back in the old days, you had to use
> synchronize_sched() to wait on preempt_disable() regions, even in
> CONFIG_PREEMPT=y kernels. So if the comment is accurate, it is OK.

OK, will add an additional comment regarding v5.1.

> Just be careful what you backport past v5.1...
>
> > And... I really don't understand the fix. What is it about
> > synchronize_rcu() which guarantees that a work function which is queued
> > on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> > userspace?
>
> I don't understand this part, either.


All CPUs should see lru_disable_count (and therefore not add pages
to per-CPU LRU pvecs); otherwise the page migration bug fixed
by commit d479960e44f27e0e52ba31b21740b703c538027c can occur.

To do this, the commit above ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count:

*/
+void lru_cache_disable(void)
+{
+ atomic_inc(&lru_disable_count);
+#ifdef CONFIG_SMP
+ /*
+ * lru_add_drain_all in the force mode will schedule draining on
+ * all online CPUs so any calls of lru_cache_disabled wrapped by
+ * local_lock or preemption disabled would be ordered by that.
+ * The atomic operation doesn't need to have stronger ordering
+ * requirements because that is enforeced by the scheduling
+ * guarantees.
+ */
+ __lru_add_drain_all(true);
+#else


CPU-0                                   CPU-1

                                        local_lock(&lru_pvecs.lock);
                                        pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
                                        add page to per-CPU LRU pvec
                                        if atomic_read(lru_disable_count) != 0
                                                flush per-CPU LRU pvec
atomic_inc(lru_disable_count)

                                        local_unlock(&lru_pvec.lock)
lru_add_drain_all(force_all_cpus=true)

However, queueing the work items disturbs isolated CPUs. To avoid that, it is
possible to use synchronize_rcu() instead:

CPU-0                                   CPU-1

                                        local_lock(&lru_pvecs.lock);
                                        pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
                                        add page to per-CPU LRU pvec
                                        if atomic_read(lru_disable_count) != 0
                                                flush per-CPU LRU pvec
atomic_inc(lru_disable_count)

                                        local_unlock(&lru_pvec.lock)
synchronize_rcu()

synchronize_rcu() will wait for all preemption-disabled (or IRQ-disabled)
sections to complete, therefore ensuring visibility of lru_disable_count.
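
The same argument can be modeled in userspace. The sketch below (C11
atomics plus pthreads, not kernel code) uses a per-thread mutex as a
stand-in for local_lock()/preempt_disable() and a lock-all pass as a
stand-in for synchronize_rcu(); reader_lock, pvec_len and
wait_for_readers() are names invented for the model.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_READERS 4

static atomic_int lru_disable_count;
static pthread_mutex_t reader_lock[NR_READERS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};
static int pvec_len[NR_READERS];   /* per-"CPU" pagevec fill level */

/* CPU-1 side of the diagram: add a page under "local_lock". */
static void *reader(void *arg)
{
    int cpu = (int)(long)arg;

    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&reader_lock[cpu]);    /* local_lock() */
        pvec_len[cpu]++;                          /* add page to pvec */
        if (atomic_load(&lru_disable_count) != 0)
            pvec_len[cpu] = 0;                    /* flush per-CPU LRU pvec */
        pthread_mutex_unlock(&reader_lock[cpu]);  /* local_unlock() */
    }
    return NULL;
}

/* Stand-in for synchronize_rcu(): wait out every read-side section. */
static void wait_for_readers(void)
{
    for (int cpu = 0; cpu < NR_READERS; cpu++) {
        pthread_mutex_lock(&reader_lock[cpu]);
        pthread_mutex_unlock(&reader_lock[cpu]);
    }
}

int main(void)
{
    pthread_t tid[NR_READERS];

    for (long cpu = 0; cpu < NR_READERS; cpu++)
        pthread_create(&tid[cpu], NULL, reader, (void *)cpu);

    /* CPU-0 side of the diagram: lru_cache_disable(). */
    atomic_fetch_add(&lru_disable_count, 1);
    wait_for_readers();
    /*
     * Every critical section that could have seen the count as 0 has
     * now exited; later sections see the count and flush by themselves.
     * Pages that an old section left behind sit in a non-empty pvec,
     * which is what the remaining __lru_add_drain_all() call drains.
     */
    printf("all pre-existing critical sections have completed\n");

    for (int cpu = 0; cpu < NR_READERS; cpu++)
        pthread_join(tid[cpu], NULL);
    return 0;
}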

2022-03-04 20:30:40

by Paul E. McKenney

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Fri, Mar 04, 2022 at 12:08:46PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 03, 2022 at 05:49:30PM -0800, Paul E. McKenney wrote:
> > On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> > > (Question for paulmck below, please)
> > >
> > > On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <[email protected]> wrote:
> > >
> > > >
> > > > On systems that run FIFO:1 applications that busy loop
> > > > on isolated CPUs, executing tasks on such CPUs under
> > > > lower priority is undesired (since that will either
> > > > hang the system, or cause longer interruption to the
> > > > FIFO task due to execution of lower priority task
> > > > with very small sched slices).
> > > >
> > > > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > > > pagevec during the migration temporarily") relies on
> > > > queueing work items on all online CPUs to ensure visibility
> > > > of lru_disable_count.
> > > >
> > > > However, its possible to use synchronize_rcu which will provide the same
> > > > guarantees (see comment this patch modifies on lru_cache_disable).
> > > >
> > > > Fixes:
> > > >
> > > > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > > > [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> > > > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> > > > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > > > [ 1873.243936] Call Trace:
> > > > [ 1873.243938] __schedule+0x21b/0x5b0
> > > > [ 1873.243941] schedule+0x43/0xe0
> > > > [ 1873.243943] schedule_timeout+0x14d/0x190
> > > > [ 1873.243946] ? resched_curr+0x20/0xe0
> > > > [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> > > > [ 1873.243958] wait_for_completion+0x84/0xe0
> > > > [ 1873.243962] __flush_work.isra.0+0x146/0x200
> > > > [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> > > > [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> > > > [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> > > > [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> > > > [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> > > > [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> > > > [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> > > > [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> > > > [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> > > > [ 1873.244005] ? __switch_to+0x12f/0x510
> > > > [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> > > > [ 1873.244016] process_one_work+0x1e0/0x410
> > > > [ 1873.244019] worker_thread+0x50/0x3b0
> > > > [ 1873.244022] ? process_one_work+0x410/0x410
> > > > [ 1873.244024] kthread+0x173/0x190
> > > > [ 1873.244027] ? set_kthread_struct+0x40/0x40
> > > > [ 1873.244031] ret_from_fork+0x1f/0x30
> > > >
> > > > ...
> > > >
> > > > --- a/mm/swap.c
> > > > +++ b/mm/swap.c
> > > > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > > > for_each_online_cpu(cpu) {
> > > > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> > > >
> > > > - if (force_all_cpus ||
> > > > - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > > > + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > > > data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> > > > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> > > > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> > >
> > > This change appears to be "don't queue work on CPUs which don't have
> > > any work to do". Correct? This isn't changelogged?
>
> Its replaced by synchronize_rcu, and its mentioned in the changelog:
>
> "However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable)."
>
> Will resend -v4 with a more verbose changelog.
>
>
> > > > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> > > > void lru_cache_disable(void)
> > > > {
> > > > atomic_inc(&lru_disable_count);
> > > > + synchronize_rcu();
> > > > #ifdef CONFIG_SMP
> > > > /*
> > > > - * lru_add_drain_all in the force mode will schedule draining on
> > > > - * all online CPUs so any calls of lru_cache_disabled wrapped by
> > > > - * local_lock or preemption disabled would be ordered by that.
> > > > - * The atomic operation doesn't need to have stronger ordering
> > > > - * requirements because that is enforced by the scheduling
> > > > - * guarantees.
> > > > + * synchronize_rcu() waits for preemption disabled
> > > > + * and RCU read side critical sections.
> > > > + * For the users of lru_disable_count:
> > > > + *
> > > > + * preempt_disable, local_irq_disable [bh_lru_lock()]
> > > > + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> > > > + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> > > > + *
> > > > + * so any calls of lru_cache_disabled wrapped by local_lock or
> > > > + * preemption disabled would be ordered by that.
> > > > */
> > > > __lru_add_drain_all(true);
> > > > #else
> > >
> > > Does this also work with CONFIG_TINY_RCU?
> > >
> > > This seems abusive of synchronize_rcu(). None of this code uses RCU,
> > > but it so happens that synchronize_rcu() happily provides the desired
> > > effects. Changes in RCU's happy side-effects might break this.
> > > Perhaps a formal API function which does whatever-you-want-it-to-do
> > > would be better.
> >
> > I don't claim to understand the full lru_cache_disable() use case, but
> > since v5.1 synchronize_rcu() is guaranteed to wait on preempt_disable()
> > regions of code. In contrast, back in the old days, you had to use
> > synchronize_sched() to wait on preempt_disable() regions, even in
> > CONFIG_PREEMPT=y kernels. So if the comment is accurate, it is OK.
>
> OK, will add an additional comment regarding v5.1.

And if someone does need to backport to a kernel version with old-style
limited synchronize_rcu() semantics, you can do this:

synchronize_rcu_mult(call_rcu, call_rcu_sched);

This will wait concurrently for an RCU and an RCU-sched grace period.

If you want the full-up semantics, you can do this to wait on all
three flavors, also including RCU-bh:

synchronize_rcu_mult(call_rcu, call_rcu_sched, call_rcu_bh);

> > Just be careful what you backport past v5.1...
> >
> > > And... I really don't understand the fix. What is it about
> > > synchronize_rcu() which guarantees that a work function which is queued
> > > on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> > > userspace?
> >
> > I don't understand this part, either.
>
>
> All CPUs should see lru_disable_count (and therefore not add pages
> to per-CPU LRU pvecs, otherwise the page migration bug fixed
> by d479960e44f27e0e52ba31b21740b703c538027c can occur.
>
> To do this, the commit above ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count:
>
> */
> +void lru_cache_disable(void)
> +{
> + atomic_inc(&lru_disable_count);
> +#ifdef CONFIG_SMP
> + /*
> + * lru_add_drain_all in the force mode will schedule draining on
> + * all online CPUs so any calls of lru_cache_disabled wrapped by
> + * local_lock or preemption disabled would be ordered by that.
> + * The atomic operation doesn't need to have stronger ordering
> + * requirements because that is enforeced by the scheduling
> + * guarantees.
> + */
> + __lru_add_drain_all(true);
> +#else
>
>
> CPU-0 CPU-1
>
>
> local_lock(&lru_pvecs.lock);
> pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
> add page to per-CPU LRU pvec
> if atomic_read(lru_disable_count) != 0
> flush per-CPU LRU pvec
> atomic_inc(lru_disable_count)
>
> local_unlock(&lru_pvec.lock)
> lru_add_drain_all(force_all_cpus=true)
>
> However queueing the work items disturbs isolated CPUs. To avoid it, its
> possible to use synchronize_rcu instead:
>
> CPU-0 CPU-1
>
>
> local_lock(&lru_pvecs.lock);
> pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
> add page to per-CPU LRU pvec
> if atomic_read(lru_disable_count) != 0
> flush per-CPU LRU pvec
> atomic_inc(lru_disable_count)
>
> local_unlock(&lru_pvec.lock)
> synchronize_rcu()
>
> Which will wait for all preemption (or IRQ disabled) sections to
> complete, therefore ensuring visibilily of lru_disable_count.

OK, that does sound plausible. (But please note that I am not familiar
with that code.)

Thanx, Paul

2022-03-04 20:31:11

by Marcelo Tosatti

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> (Question for paulmck below, please)
>
> On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <[email protected]> wrote:
>
> >
> > On systems that run FIFO:1 applications that busy loop
> > on isolated CPUs, executing tasks on such CPUs under
> > lower priority is undesired (since that will either
> > hang the system, or cause longer interruption to the
> > FIFO task due to execution of lower priority task
> > with very small sched slices).
> >
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > pagevec during the migration temporarily") relies on
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> >
> > However, its possible to use synchronize_rcu which will provide the same
> > guarantees (see comment this patch modifies on lru_cache_disable).
> >
> > Fixes:
> >
> > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > [ 1873.243936] Call Trace:
> > [ 1873.243938] __schedule+0x21b/0x5b0
> > [ 1873.243941] schedule+0x43/0xe0
> > [ 1873.243943] schedule_timeout+0x14d/0x190
> > [ 1873.243946] ? resched_curr+0x20/0xe0
> > [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> > [ 1873.243958] wait_for_completion+0x84/0xe0
> > [ 1873.243962] __flush_work.isra.0+0x146/0x200
> > [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> > [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> > [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> > [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> > [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> > [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> > [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> > [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> > [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> > [ 1873.244005] ? __switch_to+0x12f/0x510
> > [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> > [ 1873.244016] process_one_work+0x1e0/0x410
> > [ 1873.244019] worker_thread+0x50/0x3b0
> > [ 1873.244022] ? process_one_work+0x410/0x410
> > [ 1873.244024] kthread+0x173/0x190
> > [ 1873.244027] ? set_kthread_struct+0x40/0x40
> > [ 1873.244031] ret_from_fork+0x1f/0x30
> >
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > for_each_online_cpu(cpu) {
> > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >
> > - if (force_all_cpus ||
> > - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
>
> This change appears to be "don't queue work on CPUs which don't have
> any work to do". Correct? This isn't changelogged?
>
> > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> > void lru_cache_disable(void)
> > {
> > atomic_inc(&lru_disable_count);
> > + synchronize_rcu();
> > #ifdef CONFIG_SMP
> > /*
> > - * lru_add_drain_all in the force mode will schedule draining on
> > - * all online CPUs so any calls of lru_cache_disabled wrapped by
> > - * local_lock or preemption disabled would be ordered by that.
> > - * The atomic operation doesn't need to have stronger ordering
> > - * requirements because that is enforced by the scheduling
> > - * guarantees.
> > + * synchronize_rcu() waits for preemption disabled
> > + * and RCU read side critical sections.
> > + * For the users of lru_disable_count:
> > + *
> > + * preempt_disable, local_irq_disable [bh_lru_lock()]
> > + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> > + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> > + *
> > + * so any calls of lru_cache_disabled wrapped by local_lock or
> > + * preemption disabled would be ordered by that.
> > */
> > __lru_add_drain_all(true);
> > #else
>
> Does this also work with CONFIG_TINY_RCU?
>
> This seems abusive of synchronize_rcu(). None of this code uses RCU,
> but it so happens that synchronize_rcu() happily provides the desired
> effects. Changes in RCU's happy side-effects might break this.
> Perhaps a formal API function which does whatever-you-want-it-to-do
> would be better.
>
> And... I really don't understand the fix. What is it about
> synchronize_rcu() which guarantees that a work function which is queued
> on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> userspace?

It does not. synchronize_rcu() replaces the queueing of the work functions
as the means of ensuring visibility of lru_disable_count.

2022-03-05 01:47:59

by Andrew Morton

Subject: Re: [patch v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Fri, 4 Mar 2022 13:29:31 -0300 Marcelo Tosatti <[email protected]> wrote:

>
> On systems that run FIFO:1 applications that busy loop
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task
> with very small sched slices).
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
>
> Fixes:
>
> ...
>
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> for_each_online_cpu(cpu) {
> struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>
> - if (force_all_cpus ||
> - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||

Please changelog this alteration?

> data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> @@ -876,15 +875,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> void lru_cache_disable(void)
> {
> atomic_inc(&lru_disable_count);
> -#ifdef CONFIG_SMP
> /*
> - * lru_add_drain_all in the force mode will schedule draining on
> - * all online CPUs so any calls of lru_cache_disabled wrapped by
> - * local_lock or preemption disabled would be ordered by that.
> - * The atomic operation doesn't need to have stronger ordering
> - * requirements because that is enforced by the scheduling
> - * guarantees.
> + * Readers of lru_disable_count are protected by either disabling
> + * preemption or rcu_read_lock:
> + *
> + * preempt_disable, local_irq_disable [bh_lru_lock()]
> + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> + *
> + * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> + * preempt_disable() regions of code. So any CPU which sees
> + * lru_disable_count = 0 will have exited the critical
> + * section when synchronize_rcu() returns.
> */
> + synchronize_rcu();
> +#ifdef CONFIG_SMP
> __lru_add_drain_all(true);
> #else
> lru_add_and_bh_lrus_drain();

2022-03-05 20:15:35

by Paul E. McKenney

Subject: Re: [patch v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Fri, Mar 04, 2022 at 01:29:31PM -0300, Marcelo Tosatti wrote:
>
> On systems that run FIFO:1 applications that busy loop
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task
> with very small sched slices).
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
>
> Fixes:
>
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938] __schedule+0x21b/0x5b0
> [ 1873.243941] schedule+0x43/0xe0
> [ 1873.243943] schedule_timeout+0x14d/0x190
> [ 1873.243946] ? resched_curr+0x20/0xe0
> [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958] wait_for_completion+0x84/0xe0
> [ 1873.243962] __flush_work.isra.0+0x146/0x200
> [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005] ? __switch_to+0x12f/0x510
> [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016] process_one_work+0x1e0/0x410
> [ 1873.244019] worker_thread+0x50/0x3b0
> [ 1873.244022] ? process_one_work+0x410/0x410
> [ 1873.244024] kthread+0x173/0x190
> [ 1873.244027] ? set_kthread_struct+0x40/0x40
> [ 1873.244031] ret_from_fork+0x1f/0x30
>
> Signed-off-by: Marcelo Tosatti <[email protected]>

Given the explanation and the comments below, this does look plausible
to me.

Thanx, Paul

> ---
>
> v4: improve comment clarify, mention synchronize_rcu guarantees
> on v5.1 (Andrew Morton /
> Paul E. McKenney)
> v3: update stale comment (Nicolas Saenz Julienne)
> v2: rt_spin_lock calls rcu_read_lock, no need
> to add it before local_lock on swap.c (Nicolas Saenz Julienne)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bcf3ac288b56..b5ee163daa66 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> for_each_online_cpu(cpu) {
> struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>
> - if (force_all_cpus ||
> - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> @@ -876,15 +875,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> void lru_cache_disable(void)
> {
> atomic_inc(&lru_disable_count);
> -#ifdef CONFIG_SMP
> /*
> - * lru_add_drain_all in the force mode will schedule draining on
> - * all online CPUs so any calls of lru_cache_disabled wrapped by
> - * local_lock or preemption disabled would be ordered by that.
> - * The atomic operation doesn't need to have stronger ordering
> - * requirements because that is enforced by the scheduling
> - * guarantees.
> + * Readers of lru_disable_count are protected by either disabling
> + * preemption or rcu_read_lock:
> + *
> + * preempt_disable, local_irq_disable [bh_lru_lock()]
> + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> + *
> + * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> + * preempt_disable() regions of code. So any CPU which sees
> + * lru_disable_count = 0 will have exited the critical
> + * section when synchronize_rcu() returns.
> */
> + synchronize_rcu();
> +#ifdef CONFIG_SMP
> __lru_add_drain_all(true);
> #else
> lru_add_and_bh_lrus_drain();
>

2022-03-07 21:40:56

by Marcelo Tosatti

Subject: Re: [patch v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Fri, Mar 04, 2022 at 04:35:54PM -0800, Andrew Morton wrote:
> On Fri, 4 Mar 2022 13:29:31 -0300 Marcelo Tosatti <[email protected]> wrote:
>
> >
> > On systems that run FIFO:1 applications that busy loop
> > on isolated CPUs, executing tasks on such CPUs under
> > lower priority is undesired (since that will either
> > hang the system, or cause longer interruption to the
> > FIFO task due to execution of lower priority task
> > with very small sched slices).
> >
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > pagevec during the migration temporarily") relies on
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> >
> > However, its possible to use synchronize_rcu which will provide the same
> > guarantees (see comment this patch modifies on lru_cache_disable).
> >
> > Fixes:
> >
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > for_each_online_cpu(cpu) {
> > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >
> > - if (force_all_cpus ||
> > - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
>
> Please changelog this alteration?

It should be now. Are you OK with this changelog?
(If not, please let me know what should be improved.)

On systems that run FIFO:1 applications that busy loop,
any SCHED_OTHER task that attempts to execute
on such a CPU (such as work threads) will not
be scheduled, which leads to system hangs.

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

To fix this, replace the usage of work items with synchronize_rcu,
which provides the same guarantees:

Readers of lru_disable_count are protected by either disabling
preemption or rcu_read_lock:

preempt_disable, local_irq_disable [bh_lru_lock()]
rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
preempt_disable [local_lock !CONFIG_PREEMPT_RT]

Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
preempt_disable() regions of code. So any CPU which sees
lru_disable_count = 0 will have exited the critical
section when synchronize_rcu() returns.

Fixes:
...

Thanks.

2022-03-08 22:13:22

by Minchan Kim

Subject: Re: [patch v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Fri, Mar 04, 2022 at 01:29:31PM -0300, Marcelo Tosatti wrote:
>
> On systems that run FIFO:1 applications that busy loop
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task
> with very small sched slices).
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
>
> Fixes:
>
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938] __schedule+0x21b/0x5b0
> [ 1873.243941] schedule+0x43/0xe0
> [ 1873.243943] schedule_timeout+0x14d/0x190
> [ 1873.243946] ? resched_curr+0x20/0xe0
> [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958] wait_for_completion+0x84/0xe0
> [ 1873.243962] __flush_work.isra.0+0x146/0x200
> [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005] ? __switch_to+0x12f/0x510
> [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016] process_one_work+0x1e0/0x410
> [ 1873.244019] worker_thread+0x50/0x3b0
> [ 1873.244022] ? process_one_work+0x410/0x410
> [ 1873.244024] kthread+0x173/0x190
> [ 1873.244027] ? set_kthread_struct+0x40/0x40
> [ 1873.244031] ret_from_fork+0x1f/0x30
>
> Signed-off-by: Marcelo Tosatti <[email protected]>

Looks great to me.

Acked-by: Minchan Kim <[email protected]>

While reviewing it, I seem to have found a bug: bh_lru_install()
needs to check lru_cache_disabled() under bh_lru_lock(). Let me
cook the patch.

Thanks!
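
For reference, the race being pointed at has the shape of testing the
disable flag outside the critical section that the disable side waits on.
A minimal sketch of the two orderings (hypothetical names only; this is
not the fs/buffer.c code):

#include <pthread.h>
#include <stdatomic.h>

static atomic_bool cache_disabled;     /* stands in for lru_cache_disabled() */
static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;  /* "bh_lru_lock" */
static int lru_entries;

/* Racy ordering: the flag is tested before entering the section. */
static void install_racy(void)
{
    if (atomic_load(&cache_disabled))
        return;
    /*
     * Window: the disable side can set the flag, wait for all lock
     * holders, and drain the LRU right here -- and then the install
     * below happens anyway, leaving a stale entry behind.
     */
    pthread_mutex_lock(&lru_lock);
    lru_entries++;                     /* install into the LRU */
    pthread_mutex_unlock(&lru_lock);
}

/*
 * Safe ordering: the flag is tested inside the section the disable side
 * waits on, so either this section finishes before the drain (and the
 * drain removes the entry), or it observes the flag and skips the install.
 */
static void install_checked(void)
{
    pthread_mutex_lock(&lru_lock);
    if (!atomic_load(&cache_disabled))
        lru_entries++;
    pthread_mutex_unlock(&lru_lock);
}

int main(void)
{
    install_racy();
    install_checked();
    return 0;
}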

2022-03-11 23:03:40

by Marcelo Tosatti

Subject: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu


On systems that run FIFO:1 applications that busy loop,
any SCHED_OTHER task that attempts to execute
on such a CPU (such as work threads) will not
be scheduled, which leads to system hangs.

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

To fix this, replace the usage of work items with synchronize_rcu,
which provides the same guarantees.

Readers of lru_disable_count are protected by either disabling
preemption or rcu_read_lock:

preempt_disable, local_irq_disable [bh_lru_lock()]
rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
preempt_disable [local_lock !CONFIG_PREEMPT_RT]

Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
preempt_disable() regions of code. So any CPU which sees
lru_disable_count = 0 will have exited the critical
section when synchronize_rcu() returns.

Signed-off-by: Marcelo Tosatti <[email protected]>
Reviewed-by: Nicolas Saenz Julienne <[email protected]>
Acked-by: Minchan Kim <[email protected]>

---

v5: changelog improvements (Andrew Morton)
v4: improve comment clarity, mention synchronize_rcu guarantees
on v5.1 (Andrew Morton / Paul E. McKenney)
v3: update stale comment (Nicolas Saenz Julienne)
v2: rt_spin_lock calls rcu_read_lock, no need
to add it before local_lock on swap.c (Nicolas Saenz Julienne)

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..b5ee163daa66 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

- if (force_all_cpus ||
- pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+ if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
@@ -876,15 +875,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
void lru_cache_disable(void)
{
atomic_inc(&lru_disable_count);
-#ifdef CONFIG_SMP
/*
- * lru_add_drain_all in the force mode will schedule draining on
- * all online CPUs so any calls of lru_cache_disabled wrapped by
- * local_lock or preemption disabled would be ordered by that.
- * The atomic operation doesn't need to have stronger ordering
- * requirements because that is enforced by the scheduling
- * guarantees.
+ * Readers of lru_disable_count are protected by either disabling
+ * preemption or rcu_read_lock:
+ *
+ * preempt_disable, local_irq_disable [bh_lru_lock()]
+ * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
+ * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
+ *
+ * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
+ * preempt_disable() regions of code. So any CPU which sees
+ * lru_disable_count = 0 will have exited the critical
+ * section when synchronize_rcu() returns.
*/
+ synchronize_rcu();
+#ifdef CONFIG_SMP
__lru_add_drain_all(true);
#else
lru_add_and_bh_lrus_drain();
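
For illustration, a condensed sketch of the reader side that the changelog
and comment above refer to (it follows the shape of lru_cache_add() in
mm/swap.c, but is not the literal code):

/*
 * Condensed reader-side sketch (illustrative only).  The point: the check
 * of lru_disable_count happens inside a region that synchronize_rcu() in
 * lru_cache_disable() is guaranteed to wait for, so after the writer
 * returns from synchronize_rcu() no reader can still be batching pages
 * based on a stale lru_disable_count == 0.
 */
static void lru_cache_add_sketch(struct page *page)
{
	struct pagevec *pvec;

	get_page(page);		/* pin the page while it sits in the pagevec */

	/*
	 * !PREEMPT_RT: local_lock() disables preemption.
	 * PREEMPT_RT:  local_lock() is an rt_spin_lock, which takes
	 *              rcu_read_lock() internally.
	 * Either way the whole region below is covered by synchronize_rcu().
	 */
	local_lock(&lru_pvecs.lock);
	pvec = this_cpu_ptr(&lru_pvecs.lru_add);

	if (!pagevec_add(pvec, page) || lru_cache_disabled())
		__pagevec_lru_add(pvec);	/* flush straight to the LRU */

	local_unlock(&lru_pvecs.lock);
}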



2022-04-01 07:03:06

by Borislav Petkov

[permalink] [raw]
Subject: Re: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
>
> On systems that run FIFO:1 applications that busy loop,
> any SCHED_OTHER task that attempts to execute
> on such a CPU (such as work threads) will not
> be scheduled, which leads to system hangs.
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> To fix this, replace the usage of work items with synchronize_rcu,
> which provides the same guarantees.
>
> Readers of lru_disable_count are protected by either disabling
> preemption or rcu_read_lock:
>
> preempt_disable, local_irq_disable [bh_lru_lock()]
> rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> preempt_disable [local_lock !CONFIG_PREEMPT_RT]
>
> Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> preempt_disable() regions of code. So any CPU which sees
> lru_disable_count = 0 will have exited the critical
> section when synchronize_rcu() returns.
>
> Signed-off-by: Marcelo Tosatti <[email protected]>
> Reviewed-by: Nicolas Saenz Julienne <[email protected]>
> Acked-by: Minchan Kim <[email protected]>

Someone pointed me at this:

https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom

which says this one causes a performance regression with stress-ng's
NUMA test...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-04-29 11:49:12

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, Mar 31, 2022 at 03:52:45PM +0200, Borislav Petkov wrote:
> On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
> >
> > On systems that run FIFO:1 applications that busy loop,
> > any SCHED_OTHER task that attempts to execute
> > on such a CPU (such as work threads) will not
> > be scheduled, which leads to system hangs.
> >
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > pagevec during the migration temporarily") relies on
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> >
> > To fix this, replace the usage of work items with synchronize_rcu,
> > which provides the same guarantees.
> >
> > Readers of lru_disable_count are protected by either disabling
> > preemption or rcu_read_lock:
> >
> > preempt_disable, local_irq_disable [bh_lru_lock()]
> > rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> > preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> >
> > Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> > preempt_disable() regions of code. So any CPU which sees
> > lru_disable_count = 0 will have exited the critical
> > section when synchronize_rcu() returns.
> >
> > Signed-off-by: Marcelo Tosatti <[email protected]>
> > Reviewed-by: Nicolas Saenz Julienne <[email protected]>
> > Acked-by: Minchan Kim <[email protected]>
>
> Someone pointed me at this:
>
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom
>
> which says this one causes a performance regression with stress-ng's
> NUMA test...

Michael,

This is probably do_migrate_pages taking too long due to
synchronize_rcu().

Switching to synchronize_rcu_expedited() should probably fix it...
Can you give it a try, please?

diff --git a/mm/swap.c b/mm/swap.c
index bceff0cb559c..04a8bbf9817a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -879,7 +879,7 @@ void lru_cache_disable(void)
* lru_disable_count = 0 will have exited the critical
* section when synchronize_rcu() returns.
*/
- synchronize_rcu();
+ synchronize_rcu_expedited();
#ifdef CONFIG_SMP
__lru_add_drain_all(true);
#else
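
For context on why the grace period shows up in this benchmark: the cost is
paid once per do_migrate_pages() call, and the stress-ng NUMA test hits that
path in a tight loop. A hedged sketch of the shape of that path follows
(condensed; migrate_to_nodes() is a made-up stand-in for the real migration
loop in mm/mempolicy.c, not an actual kernel function):

/* Condensed, illustrative shape of do_migrate_pages(); not the literal code. */
static int do_migrate_pages_sketch(struct mm_struct *mm,
				   const nodemask_t *from,
				   const nodemask_t *to, int flags)
{
	int err;

	lru_cache_disable();	/* atomic_inc + synchronize_rcu(): one full
				 * grace period per call; the expedited
				 * variant is orders of magnitude faster,
				 * at the cost of IPIs to all CPUs */
	mmap_read_lock(mm);

	err = migrate_to_nodes(mm, from, to, flags);	/* hypothetical helper */

	mmap_read_unlock(mm);
	lru_cache_enable();	/* just atomic_dec(&lru_disable_count) */

	return err;
}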



2022-05-28 21:18:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, 28 Apr 2022 15:00:11 -0300 Marcelo Tosatti <[email protected]> wrote:

> On Thu, Mar 31, 2022 at 03:52:45PM +0200, Borislav Petkov wrote:
> > On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
> > >
> ...
>
> >
> > Someone pointed me at this:
> >
> > https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom
> >
> > which says this one causes a performance regression with stress-ng's
> > NUMA test...
>
> Michael,
>
> This is probably do_migrate_pages that is taking too long due to
> synchronize_rcu().
>
> Switching to synchronize_rcu_expedited() should probably fix it...
> Can you give it a try, please?

I guess not.

Is anyone else able to demonstrate a stress-ng performance regression
due to ff042f4a9b0508? And if so, are they able to try Marcelo's
one-liner?

> diff --git a/mm/swap.c b/mm/swap.c
> index bceff0cb559c..04a8bbf9817a 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -879,7 +879,7 @@ void lru_cache_disable(void)
> * lru_disable_count = 0 will have exited the critical
> * section when synchronize_rcu() returns.
> */
> - synchronize_rcu();
> + synchronize_rcu_expedited();
> #ifdef CONFIG_SMP
> __lru_add_drain_all(true);
> #else
>
>

2022-06-19 14:14:03

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

Hi, this is your Linux kernel regression tracker.

On 29.05.22 02:48, Michael Larabel wrote:
> On 5/28/22 17:54, Michael Larabel wrote:
>> On 5/28/22 16:18, Andrew Morton wrote:
>>> On Thu, 28 Apr 2022 15:00:11 -0300 Marcelo Tosatti
>>> <[email protected]> wrote:
>>>> On Thu, Mar 31, 2022 at 03:52:45PM +0200, Borislav Petkov wrote:
>>>>> On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
>>>>> Someone pointed me at this:
>>>>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom
>>>>>
>>>>> which says this one causes a performance regression with stress-ng's
>>>>> NUMA test...
>>>>
>>>> This is probably do_migrate_pages that is taking too long due to
>>>> synchronize_rcu().
>>>>
>>>> Switching to synchronize_rcu_expedited() should probably fix it...
>>>> Can you give it a try, please?
>>> I guess not.
>>>
>>> Is anyone else able to demonstrate a stress-ng performance regression
>>> due to ff042f4a9b0508?  And if so, are they able to try Marcelo's
>>> one-liner?
>>
>> Apologies, I don't believe I got the email previously (or it ended up
>> in spam or was otherwise overlooked), so I only noticed this thread now...
>>
>> I have the system around and will work on verifying it can reproduce
>> still and can then test the patch, should be able to get it tomorrow.
>>
>> Thanks and sorry about the delay.
>
> I had a chance to look at it today. I was able to reproduce the
> regression on that 5950X system going from v5.17 to v5.18 (using a
> newer stress-ng benchmark and with other system changes since the prior
> tests). I confirmed it still shows the slowdown as of today's Git.
>
> With Marcelo's patch below, I can confirm that stress-ng NUMA
> performance is back to the v5.17 level (actually, faster), and
> certainly not what I was seeing on v5.18 or current Git.
>
> So all seems to be good with that one-liner for the stress-ng NUMA test
> case. All the system details and results, for those interested, are
> documented @ https://openbenchmarking.org/result/2205284-PTS-NUMAREGR17
> but basically amounts to:
>
>     Stress-NG 0.14
>     Test: NUMA
>     Bogo Ops/s > Higher Is Better
>     v5.17: 412.88
>     v5.18: 49.33
>     20220528 Git: 49.66
>     20220528 Git + sched-rcu-exped patch: 468.81
>
> Apologies again about the delay / not seeing the email thread earlier.
>
> Thanks,
>
> Michael
>
> Tested-by: Michael Larabel <[email protected]>

Andrew, is there a reason why this patch afaics isn't mainlined yet and
has been lingering in linux-next for so long? Michael confirmed three
weeks ago that this patch fixes a regression, and a few days later Stefan
confirmed that his problem was solved as well:
https://lore.kernel.org/regressions/[email protected]/

Reminder: unless there are good reasons, it shouldn't take this long to
get a regression fix mainlined, for the reasons explained in
https://docs.kernel.org/process/handling-regressions.html

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

>>>> diff --git a/mm/swap.c b/mm/swap.c
>>>> index bceff0cb559c..04a8bbf9817a 100644
>>>> --- a/mm/swap.c
>>>> +++ b/mm/swap.c
>>>> @@ -879,7 +879,7 @@ void lru_cache_disable(void)
>>>>        * lru_disable_count = 0 will have exited the critical
>>>>        * section when synchronize_rcu() returns.
>>>>        */
>>>> -    synchronize_rcu();
>>>> +    synchronize_rcu_expedited();
>>>>   #ifdef CONFIG_SMP
>>>>       __lru_add_drain_all(true);
>>>>   #else
>>>>
>>>>

2022-06-22 00:19:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Sun, 19 Jun 2022 14:14:03 +0200 Thorsten Leemhuis <[email protected]> wrote:

> Andrew, is there a reason why this patch afaics isn't mainlined yet and
> lingering in linux-next for so long?

I didn't bother doing a hotfixes merge last week because there wasn't anything
very urgent-looking in there. I'll be putting together a pull request later this week.