2022-02-22 16:33:40

by Marcelo Tosatti

Subject: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu


On systems that run FIFO:1 applications that busy loop
on isolated CPUs, executing tasks on such CPUs at
lower priority is undesirable (since that will either
hang the system, or cause a longer interruption to the
FIFO task, as the lower priority task only runs in
very small sched slices).

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

However, it is possible to use synchronize_rcu(), which provides the same
guarantees (see the comment this patch modifies in lru_cache_disable()).

Fixes:

[ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
[ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
[ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
[ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
[ 1873.243936] Call Trace:
[ 1873.243938] __schedule+0x21b/0x5b0
[ 1873.243941] schedule+0x43/0xe0
[ 1873.243943] schedule_timeout+0x14d/0x190
[ 1873.243946] ? resched_curr+0x20/0xe0
[ 1873.243953] ? __prepare_to_swait+0x4b/0x70
[ 1873.243958] wait_for_completion+0x84/0xe0
[ 1873.243962] __flush_work.isra.0+0x146/0x200
[ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 1873.243971] __lru_add_drain_all+0x158/0x1f0
[ 1873.243978] do_migrate_pages+0x3d/0x2d0
[ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
[ 1873.243989] ? put_prev_task_fair+0x1e/0x30
[ 1873.243992] ? pick_next_task+0xb30/0xbd0
[ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
[ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
[ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
[ 1873.244005] ? __switch_to+0x12f/0x510
[ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
[ 1873.244016] process_one_work+0x1e0/0x410
[ 1873.244019] worker_thread+0x50/0x3b0
[ 1873.244022] ? process_one_work+0x410/0x410
[ 1873.244024] kthread+0x173/0x190
[ 1873.244027] ? set_kthread_struct+0x40/0x40
[ 1873.244031] ret_from_fork+0x1f/0x30

Signed-off-by: Marcelo Tosatti <[email protected]>

---

v3: update stale comment (Nicolas Saenz Julienne)
v2: rt_spin_lock calls rcu_read_lock, no need
to add it before local_lock on swap.c (Nicolas Saenz Julienne)

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..abb26293e7c1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

- if (force_all_cpus ||
- pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+ if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
@@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
void lru_cache_disable(void)
{
atomic_inc(&lru_disable_count);
+ synchronize_rcu();
#ifdef CONFIG_SMP
/*
- * lru_add_drain_all in the force mode will schedule draining on
- * all online CPUs so any calls of lru_cache_disabled wrapped by
- * local_lock or preemption disabled would be ordered by that.
- * The atomic operation doesn't need to have stronger ordering
- * requirements because that is enforced by the scheduling
- * guarantees.
+ * synchronize_rcu() waits for preemption disabled
+ * and RCU read side critical sections.
+ * For the users of lru_disable_count:
+ *
+ * preempt_disable, local_irq_disable [bh_lru_lock()]
+ * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
+ * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
+ *
+ * so any calls of lru_cache_disabled wrapped by local_lock or
+ * preemption disabled would be ordered by that.
*/
__lru_add_drain_all(true);
#else
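
To illustrate the guarantee the changelog and the updated comment rely on,
here is a minimal userspace sketch of the same pattern written against
liburcu. It is an analogy, not kernel code: the header and link names
(<urcu.h>, -lurcu) assume a typical liburcu install, and disable_count,
reader() and the loop bounds are invented for the sketch.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <urcu.h>              /* liburcu: rcu_read_lock(), synchronize_rcu() */

static atomic_int disable_count;       /* plays the role of lru_disable_count */

static void *reader(void *arg)
{
    (void)arg;
    rcu_register_thread();     /* liburcu readers must register themselves */

    for (int i = 0; i < 1000000; i++) {
        rcu_read_lock();       /* read-side critical section */
        if (atomic_load(&disable_count) == 0) {
            /* fast path: batch locally (the "pagevec" case) */
        } else {
            /* slow path: bypass the local batch */
        }
        rcu_read_unlock();
    }

    rcu_unregister_thread();
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, reader, NULL);

    atomic_fetch_add(&disable_count, 1);   /* like atomic_inc(&lru_disable_count) */
    synchronize_rcu();
    /*
     * Every read-side section that could have observed disable_count == 0
     * has finished; sections still running, and all later ones, observe
     * the increment and take the slow path.  That is the visibility the
     * patch obtains without queueing a work item on every CPU.
     */
    printf("disable_count is now visible to all remaining readers\n");

    pthread_join(tid, NULL);
    return 0;
}

Build with something like "cc -O2 demo.c -lurcu -lpthread" on a system with
liburcu installed.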


2022-02-22 19:40:28

by Nicolas Saenz Julienne

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Tue, 2022-02-22 at 13:07 -0300, Marcelo Tosatti wrote:
> On systems that run FIFO:1 applications that busy loop
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task
> with very small sched slices).
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
>
> Fixes:
>
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938] __schedule+0x21b/0x5b0
> [ 1873.243941] schedule+0x43/0xe0
> [ 1873.243943] schedule_timeout+0x14d/0x190
> [ 1873.243946] ? resched_curr+0x20/0xe0
> [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958] wait_for_completion+0x84/0xe0
> [ 1873.243962] __flush_work.isra.0+0x146/0x200
> [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005] ? __switch_to+0x12f/0x510
> [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016] process_one_work+0x1e0/0x410
> [ 1873.244019] worker_thread+0x50/0x3b0
> [ 1873.244022] ? process_one_work+0x410/0x410
> [ 1873.244024] kthread+0x173/0x190
> [ 1873.244027] ? set_kthread_struct+0x40/0x40
> [ 1873.244031] ret_from_fork+0x1f/0x30
>
> Signed-off-by: Marcelo Tosatti <[email protected]>

Reviewed-by: Nicolas Saenz Julienne <[email protected]>

Regards,

--
Nicolás Sáenz

2022-03-04 09:36:28

by Paul E. McKenney

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> (Question for paulmck below, please)
>
> On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <[email protected]> wrote:
>
> >
> > On systems that run FIFO:1 applications that busy loop
> > on isolated CPUs, executing tasks on such CPUs under
> > lower priority is undesired (since that will either
> > hang the system, or cause longer interruption to the
> > FIFO task due to execution of lower priority task
> > with very small sched slices).
> >
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > pagevec during the migration temporarily") relies on
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> >
> > However, its possible to use synchronize_rcu which will provide the same
> > guarantees (see comment this patch modifies on lru_cache_disable).
> >
> > Fixes:
> >
> > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > [ 1873.243936] Call Trace:
> > [ 1873.243938] __schedule+0x21b/0x5b0
> > [ 1873.243941] schedule+0x43/0xe0
> > [ 1873.243943] schedule_timeout+0x14d/0x190
> > [ 1873.243946] ? resched_curr+0x20/0xe0
> > [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> > [ 1873.243958] wait_for_completion+0x84/0xe0
> > [ 1873.243962] __flush_work.isra.0+0x146/0x200
> > [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> > [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> > [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> > [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> > [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> > [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> > [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> > [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> > [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> > [ 1873.244005] ? __switch_to+0x12f/0x510
> > [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> > [ 1873.244016] process_one_work+0x1e0/0x410
> > [ 1873.244019] worker_thread+0x50/0x3b0
> > [ 1873.244022] ? process_one_work+0x410/0x410
> > [ 1873.244024] kthread+0x173/0x190
> > [ 1873.244027] ? set_kthread_struct+0x40/0x40
> > [ 1873.244031] ret_from_fork+0x1f/0x30
> >
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > for_each_online_cpu(cpu) {
> > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >
> > - if (force_all_cpus ||
> > - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
>
> This change appears to be "don't queue work on CPUs which don't have
> any work to do". Correct? This isn't changelogged?
>
> > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> > void lru_cache_disable(void)
> > {
> > atomic_inc(&lru_disable_count);
> > + synchronize_rcu();
> > #ifdef CONFIG_SMP
> > /*
> > - * lru_add_drain_all in the force mode will schedule draining on
> > - * all online CPUs so any calls of lru_cache_disabled wrapped by
> > - * local_lock or preemption disabled would be ordered by that.
> > - * The atomic operation doesn't need to have stronger ordering
> > - * requirements because that is enforced by the scheduling
> > - * guarantees.
> > + * synchronize_rcu() waits for preemption disabled
> > + * and RCU read side critical sections.
> > + * For the users of lru_disable_count:
> > + *
> > + * preempt_disable, local_irq_disable [bh_lru_lock()]
> > + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> > + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> > + *
> > + * so any calls of lru_cache_disabled wrapped by local_lock or
> > + * preemption disabled would be ordered by that.
> > */
> > __lru_add_drain_all(true);
> > #else
>
> Does this also work with CONFIG_TINY_RCU?
>
> This seems abusive of synchronize_rcu(). None of this code uses RCU,
> but it so happens that synchronize_rcu() happily provides the desired
> effects. Changes in RCU's happy side-effects might break this.
> Perhaps a formal API function which does whatever-you-want-it-to-do
> would be better.

I don't claim to understand the full lru_cache_disable() use case, but
since v5.1 synchronize_rcu() is guaranteed to wait on preempt_disable()
regions of code. In contrast, back in the old days, you had to use
synchronize_sched() to wait on preempt_disable() regions, even in
CONFIG_PREEMPT=y kernels. So if the comment is accurate, it is OK.

Just be careful what you backport past v5.1...

> And... I really don't understand the fix. What is it about
> synchronize_rcu() which guarantees that a work function which is queued
> on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> userspace?

I don't understand this part, either.

Thanx, Paul

2022-03-04 11:30:42

by Andrew Morton

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

(Question for paulmck below, please)

On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <[email protected]> wrote:

>
> On systems that run FIFO:1 applications that busy loop
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task
> with very small sched slices).
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
>
> Fixes:
>
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938] __schedule+0x21b/0x5b0
> [ 1873.243941] schedule+0x43/0xe0
> [ 1873.243943] schedule_timeout+0x14d/0x190
> [ 1873.243946] ? resched_curr+0x20/0xe0
> [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958] wait_for_completion+0x84/0xe0
> [ 1873.243962] __flush_work.isra.0+0x146/0x200
> [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005] ? __switch_to+0x12f/0x510
> [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016] process_one_work+0x1e0/0x410
> [ 1873.244019] worker_thread+0x50/0x3b0
> [ 1873.244022] ? process_one_work+0x410/0x410
> [ 1873.244024] kthread+0x173/0x190
> [ 1873.244027] ? set_kthread_struct+0x40/0x40
> [ 1873.244031] ret_from_fork+0x1f/0x30
>
> ...
>
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> for_each_online_cpu(cpu) {
> struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>
> - if (force_all_cpus ||
> - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||

This change appears to be "don't queue work on CPUs which don't have
any work to do". Correct? This isn't changelogged?

> @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> void lru_cache_disable(void)
> {
> atomic_inc(&lru_disable_count);
> + synchronize_rcu();
> #ifdef CONFIG_SMP
> /*
> - * lru_add_drain_all in the force mode will schedule draining on
> - * all online CPUs so any calls of lru_cache_disabled wrapped by
> - * local_lock or preemption disabled would be ordered by that.
> - * The atomic operation doesn't need to have stronger ordering
> - * requirements because that is enforced by the scheduling
> - * guarantees.
> + * synchronize_rcu() waits for preemption disabled
> + * and RCU read side critical sections.
> + * For the users of lru_disable_count:
> + *
> + * preempt_disable, local_irq_disable [bh_lru_lock()]
> + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> + *
> + * so any calls of lru_cache_disabled wrapped by local_lock or
> + * preemption disabled would be ordered by that.
> */
> __lru_add_drain_all(true);
> #else

Does this also work with CONFIG_TINY_RCU?

This seems abusive of synchronize_rcu(). None of this code uses RCU,
but it so happens that synchronize_rcu() happily provides the desired
effects. Changes in RCU's happy side-effects might break this.
Perhaps a formal API function which does whatever-you-want-it-to-do
would be better.

And... I really don't understand the fix. What is it about
synchronize_rcu() which guarantees that a work function which is queued
on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
userspace?

2022-03-04 18:34:01

by Marcelo Tosatti

Subject: [patch v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu


On systems that run FIFO:1 applications that busy loop
on isolated CPUs, executing tasks on such CPUs at
lower priority is undesirable (since that will either
hang the system, or cause a longer interruption to the
FIFO task, as the lower priority task only runs in
very small sched slices).

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

However, it is possible to use synchronize_rcu(), which provides the same
guarantees (see the comment this patch modifies in lru_cache_disable()).

Fixes:

[ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
[ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
[ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
[ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
[ 1873.243936] Call Trace:
[ 1873.243938] __schedule+0x21b/0x5b0
[ 1873.243941] schedule+0x43/0xe0
[ 1873.243943] schedule_timeout+0x14d/0x190
[ 1873.243946] ? resched_curr+0x20/0xe0
[ 1873.243953] ? __prepare_to_swait+0x4b/0x70
[ 1873.243958] wait_for_completion+0x84/0xe0
[ 1873.243962] __flush_work.isra.0+0x146/0x200
[ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 1873.243971] __lru_add_drain_all+0x158/0x1f0
[ 1873.243978] do_migrate_pages+0x3d/0x2d0
[ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
[ 1873.243989] ? put_prev_task_fair+0x1e/0x30
[ 1873.243992] ? pick_next_task+0xb30/0xbd0
[ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
[ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
[ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
[ 1873.244005] ? __switch_to+0x12f/0x510
[ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
[ 1873.244016] process_one_work+0x1e0/0x410
[ 1873.244019] worker_thread+0x50/0x3b0
[ 1873.244022] ? process_one_work+0x410/0x410
[ 1873.244024] kthread+0x173/0x190
[ 1873.244027] ? set_kthread_struct+0x40/0x40
[ 1873.244031] ret_from_fork+0x1f/0x30

Signed-off-by: Marcelo Tosatti <[email protected]>

---

v4: improve comment clarity, mention synchronize_rcu guarantees
on v5.1 (Andrew Morton / Paul E. McKenney)
v3: update stale comment (Nicolas Saenz Julienne)
v2: rt_spin_lock calls rcu_read_lock, no need
to add it before local_lock on swap.c (Nicolas Saenz Julienne)

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..b5ee163daa66 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

- if (force_all_cpus ||
- pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+ if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
@@ -876,15 +875,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
void lru_cache_disable(void)
{
atomic_inc(&lru_disable_count);
-#ifdef CONFIG_SMP
/*
- * lru_add_drain_all in the force mode will schedule draining on
- * all online CPUs so any calls of lru_cache_disabled wrapped by
- * local_lock or preemption disabled would be ordered by that.
- * The atomic operation doesn't need to have stronger ordering
- * requirements because that is enforced by the scheduling
- * guarantees.
+ * Readers of lru_disable_count are protected by either disabling
+ * preemption or rcu_read_lock:
+ *
+ * preempt_disable, local_irq_disable [bh_lru_lock()]
+ * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
+ * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
+ *
+ * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
+ * preempt_disable() regions of code. So any CPU which sees
+ * lru_disable_count = 0 will have exited the critical
+ * section when synchronize_rcu() returns.
*/
+ synchronize_rcu();
+#ifdef CONFIG_SMP
__lru_add_drain_all(true);
#else
lru_add_and_bh_lrus_drain();

2022-03-04 19:04:52

by Marcelo Tosatti

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, Mar 03, 2022 at 05:49:30PM -0800, Paul E. McKenney wrote:
> On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> > (Question for paulmck below, please)
> >
> > On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <[email protected]> wrote:
> >
> > >
> > > On systems that run FIFO:1 applications that busy loop
> > > on isolated CPUs, executing tasks on such CPUs under
> > > lower priority is undesired (since that will either
> > > hang the system, or cause longer interruption to the
> > > FIFO task due to execution of lower priority task
> > > with very small sched slices).
> > >
> > > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > > pagevec during the migration temporarily") relies on
> > > queueing work items on all online CPUs to ensure visibility
> > > of lru_disable_count.
> > >
> > > However, its possible to use synchronize_rcu which will provide the same
> > > guarantees (see comment this patch modifies on lru_cache_disable).
> > >
> > > Fixes:
> > >
> > > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > > [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> > > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> > > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > > [ 1873.243936] Call Trace:
> > > [ 1873.243938] __schedule+0x21b/0x5b0
> > > [ 1873.243941] schedule+0x43/0xe0
> > > [ 1873.243943] schedule_timeout+0x14d/0x190
> > > [ 1873.243946] ? resched_curr+0x20/0xe0
> > > [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> > > [ 1873.243958] wait_for_completion+0x84/0xe0
> > > [ 1873.243962] __flush_work.isra.0+0x146/0x200
> > > [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> > > [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> > > [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> > > [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> > > [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> > > [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> > > [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> > > [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> > > [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> > > [ 1873.244005] ? __switch_to+0x12f/0x510
> > > [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> > > [ 1873.244016] process_one_work+0x1e0/0x410
> > > [ 1873.244019] worker_thread+0x50/0x3b0
> > > [ 1873.244022] ? process_one_work+0x410/0x410
> > > [ 1873.244024] kthread+0x173/0x190
> > > [ 1873.244027] ? set_kthread_struct+0x40/0x40
> > > [ 1873.244031] ret_from_fork+0x1f/0x30
> > >
> > > ...
> > >
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > > for_each_online_cpu(cpu) {
> > > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> > >
> > > - if (force_all_cpus ||
> > > - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > > + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > > data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> > > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> > > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> >
> > This change appears to be "don't queue work on CPUs which don't have
> > any work to do". Correct? This isn't changelogged?

It's replaced by synchronize_rcu(), and it's mentioned in the changelog:

"However, its possible to use synchronize_rcu which will provide the same
guarantees (see comment this patch modifies on lru_cache_disable)."

Will resend -v4 with a more verbose changelog.


> > > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> > > void lru_cache_disable(void)
> > > {
> > > atomic_inc(&lru_disable_count);
> > > + synchronize_rcu();
> > > #ifdef CONFIG_SMP
> > > /*
> > > - * lru_add_drain_all in the force mode will schedule draining on
> > > - * all online CPUs so any calls of lru_cache_disabled wrapped by
> > > - * local_lock or preemption disabled would be ordered by that.
> > > - * The atomic operation doesn't need to have stronger ordering
> > > - * requirements because that is enforced by the scheduling
> > > - * guarantees.
> > > + * synchronize_rcu() waits for preemption disabled
> > > + * and RCU read side critical sections.
> > > + * For the users of lru_disable_count:
> > > + *
> > > + * preempt_disable, local_irq_disable [bh_lru_lock()]
> > > + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> > > + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> > > + *
> > > + * so any calls of lru_cache_disabled wrapped by local_lock or
> > > + * preemption disabled would be ordered by that.
> > > */
> > > __lru_add_drain_all(true);
> > > #else
> >
> > Does this also work with CONFIG_TINY_RCU?
> >
> > This seems abusive of synchronize_rcu(). None of this code uses RCU,
> > but it so happens that synchronize_rcu() happily provides the desired
> > effects. Changes in RCU's happy side-effects might break this.
> > Perhaps a formal API function which does whatever-you-want-it-to-do
> > would be better.
>
> I don't claim to understand the full lru_cache_disable() use case, but
> since v5.1 synchronize_rcu() is guaranteed to wait on preempt_disable()
> regions of code. In contrast, back in the old days, you had to use
> synchronize_sched() to wait on preempt_disable() regions, even in
> CONFIG_PREEMPT=y kernels. So if the comment is accurate, it is OK.

OK, will add an additional comment regarding v5.1.

> Just be careful what you backport past v5.1...
>
> > And... I really don't understand the fix. What is it about
> > synchronize_rcu() which guarantees that a work function which is queued
> > on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> > userspace?
>
> I don't understand this part, either.


All CPUs should see lru_disable_count (and therefore not add pages
to per-CPU LRU pvecs); otherwise the page migration bug fixed
by commit d479960e44f27e0e52ba31b21740b703c538027c can occur.

To do this, the commit above ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count:

*/
+void lru_cache_disable(void)
+{
+ atomic_inc(&lru_disable_count);
+#ifdef CONFIG_SMP
+ /*
+ * lru_add_drain_all in the force mode will schedule draining on
+ * all online CPUs so any calls of lru_cache_disabled wrapped by
+ * local_lock or preemption disabled would be ordered by that.
+ * The atomic operation doesn't need to have stronger ordering
+ * requirements because that is enforeced by the scheduling
+ * guarantees.
+ */
+ __lru_add_drain_all(true);
+#else


CPU-0                                   CPU-1

                                        local_lock(&lru_pvecs.lock);
                                        pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
                                        add page to per-CPU LRU pvec
                                        if atomic_read(lru_disable_count) != 0
                                                flush per-CPU LRU pvec
atomic_inc(lru_disable_count)

                                        local_unlock(&lru_pvec.lock)
lru_add_drain_all(force_all_cpus=true)

However, queueing the work items disturbs isolated CPUs. To avoid that, it is
possible to use synchronize_rcu() instead:

CPU-0                                   CPU-1

                                        local_lock(&lru_pvecs.lock);
                                        pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
                                        add page to per-CPU LRU pvec
                                        if atomic_read(lru_disable_count) != 0
                                                flush per-CPU LRU pvec
atomic_inc(lru_disable_count)

                                        local_unlock(&lru_pvec.lock)
synchronize_rcu()

synchronize_rcu() will wait for all preemption-disabled (or IRQ-disabled)
sections to complete, therefore ensuring visibility of lru_disable_count.
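
The same argument can be modeled in userspace. The sketch below (C11
atomics plus pthreads, not kernel code) uses a per-thread mutex as a
stand-in for local_lock()/preempt_disable() and a lock-all pass as a
stand-in for synchronize_rcu(); reader_lock, pvec_len and
wait_for_readers() are names invented for the model.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_READERS 4

static atomic_int lru_disable_count;
static pthread_mutex_t reader_lock[NR_READERS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};
static int pvec_len[NR_READERS];   /* per-"CPU" pagevec fill level */

/* CPU-1 side of the diagram: add a page under "local_lock". */
static void *reader(void *arg)
{
    int cpu = (int)(long)arg;

    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&reader_lock[cpu]);    /* local_lock() */
        pvec_len[cpu]++;                          /* add page to pvec */
        if (atomic_load(&lru_disable_count) != 0)
            pvec_len[cpu] = 0;                    /* flush per-CPU LRU pvec */
        pthread_mutex_unlock(&reader_lock[cpu]);  /* local_unlock() */
    }
    return NULL;
}

/* Stand-in for synchronize_rcu(): wait out every read-side section. */
static void wait_for_readers(void)
{
    for (int cpu = 0; cpu < NR_READERS; cpu++) {
        pthread_mutex_lock(&reader_lock[cpu]);
        pthread_mutex_unlock(&reader_lock[cpu]);
    }
}

int main(void)
{
    pthread_t tid[NR_READERS];

    for (long cpu = 0; cpu < NR_READERS; cpu++)
        pthread_create(&tid[cpu], NULL, reader, (void *)cpu);

    /* CPU-0 side of the diagram: lru_cache_disable(). */
    atomic_fetch_add(&lru_disable_count, 1);
    wait_for_readers();
    /*
     * Every critical section that could have seen the count as 0 has
     * now exited; later sections see the count and flush by themselves.
     * Pages that an old section left behind sit in a non-empty pvec,
     * which is what the remaining __lru_add_drain_all() call drains.
     */
    printf("all pre-existing critical sections have completed\n");

    for (int cpu = 0; cpu < NR_READERS; cpu++)
        pthread_join(tid[cpu], NULL);
    return 0;
}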

2022-03-04 20:30:40

by Paul E. McKenney

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Fri, Mar 04, 2022 at 12:08:46PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 03, 2022 at 05:49:30PM -0800, Paul E. McKenney wrote:
> > On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> > > (Question for paulmck below, please)
> > >
> > > On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <[email protected]> wrote:
> > >
> > > >
> > > > On systems that run FIFO:1 applications that busy loop
> > > > on isolated CPUs, executing tasks on such CPUs under
> > > > lower priority is undesired (since that will either
> > > > hang the system, or cause longer interruption to the
> > > > FIFO task due to execution of lower priority task
> > > > with very small sched slices).
> > > >
> > > > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > > > pagevec during the migration temporarily") relies on
> > > > queueing work items on all online CPUs to ensure visibility
> > > > of lru_disable_count.
> > > >
> > > > However, its possible to use synchronize_rcu which will provide the same
> > > > guarantees (see comment this patch modifies on lru_cache_disable).
> > > >
> > > > Fixes:
> > > >
> > > > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > > > [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> > > > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> > > > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > > > [ 1873.243936] Call Trace:
> > > > [ 1873.243938] __schedule+0x21b/0x5b0
> > > > [ 1873.243941] schedule+0x43/0xe0
> > > > [ 1873.243943] schedule_timeout+0x14d/0x190
> > > > [ 1873.243946] ? resched_curr+0x20/0xe0
> > > > [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> > > > [ 1873.243958] wait_for_completion+0x84/0xe0
> > > > [ 1873.243962] __flush_work.isra.0+0x146/0x200
> > > > [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> > > > [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> > > > [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> > > > [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> > > > [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> > > > [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> > > > [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> > > > [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> > > > [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> > > > [ 1873.244005] ? __switch_to+0x12f/0x510
> > > > [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> > > > [ 1873.244016] process_one_work+0x1e0/0x410
> > > > [ 1873.244019] worker_thread+0x50/0x3b0
> > > > [ 1873.244022] ? process_one_work+0x410/0x410
> > > > [ 1873.244024] kthread+0x173/0x190
> > > > [ 1873.244027] ? set_kthread_struct+0x40/0x40
> > > > [ 1873.244031] ret_from_fork+0x1f/0x30
> > > >
> > > > ...
> > > >
> > > > --- a/mm/swap.c
> > > > +++ b/mm/swap.c
> > > > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > > > for_each_online_cpu(cpu) {
> > > > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> > > >
> > > > - if (force_all_cpus ||
> > > > - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > > > + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > > > data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> > > > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> > > > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> > >
> > > This change appears to be "don't queue work on CPUs which don't have
> > > any work to do". Correct? This isn't changelogged?
>
> Its replaced by synchronize_rcu, and its mentioned in the changelog:
>
> "However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable)."
>
> Will resend -v4 with a more verbose changelog.
>
>
> > > > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> > > > void lru_cache_disable(void)
> > > > {
> > > > atomic_inc(&lru_disable_count);
> > > > + synchronize_rcu();
> > > > #ifdef CONFIG_SMP
> > > > /*
> > > > - * lru_add_drain_all in the force mode will schedule draining on
> > > > - * all online CPUs so any calls of lru_cache_disabled wrapped by
> > > > - * local_lock or preemption disabled would be ordered by that.
> > > > - * The atomic operation doesn't need to have stronger ordering
> > > > - * requirements because that is enforced by the scheduling
> > > > - * guarantees.
> > > > + * synchronize_rcu() waits for preemption disabled
> > > > + * and RCU read side critical sections.
> > > > + * For the users of lru_disable_count:
> > > > + *
> > > > + * preempt_disable, local_irq_disable [bh_lru_lock()]
> > > > + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> > > > + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> > > > + *
> > > > + * so any calls of lru_cache_disabled wrapped by local_lock or
> > > > + * preemption disabled would be ordered by that.
> > > > */
> > > > __lru_add_drain_all(true);
> > > > #else
> > >
> > > Does this also work with CONFIG_TINY_RCU?
> > >
> > > This seems abusive of synchronize_rcu(). None of this code uses RCU,
> > > but it so happens that synchronize_rcu() happily provides the desired
> > > effects. Changes in RCU's happy side-effects might break this.
> > > Perhaps a formal API function which does whatever-you-want-it-to-do
> > > would be better.
> >
> > I don't claim to understand the full lru_cache_disable() use case, but
> > since v5.1 synchronize_rcu() is guaranteed to wait on preempt_disable()
> > regions of code. In contrast, back in the old days, you had to use
> > synchronize_sched() to wait on preempt_disable() regions, even in
> > CONFIG_PREEMPT=y kernels. So if the comment is accurate, it is OK.
>
> OK, will add an additional comment regarding v5.1.

And if someone does need to backport to a kernel version with old-style
limited synchronize_rcu() semantics, you can do this:

synchronize_rcu_mult(call_rcu, call_rcu_sched);

This will wait concurrently for an RCU and an RCU-sched grace period.

If you want the full-up semantics, you can do this to wait on all
three flavors, also including RCU-bh:

synchronize_rcu_mult(call_rcu, call_rcu_sched, call_rcu_bh);

> > Just be careful what you backport past v5.1...
> >
> > > And... I really don't understand the fix. What is it about
> > > synchronize_rcu() which guarantees that a work function which is queued
> > > on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> > > userspace?
> >
> > I don't understand this part, either.
>
>
> All CPUs should see lru_disable_count (and therefore not add pages
> to per-CPU LRU pvecs, otherwise the page migration bug fixed
> by d479960e44f27e0e52ba31b21740b703c538027c can occur.
>
> To do this, the commit above ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count:
>
> */
> +void lru_cache_disable(void)
> +{
> + atomic_inc(&lru_disable_count);
> +#ifdef CONFIG_SMP
> + /*
> + * lru_add_drain_all in the force mode will schedule draining on
> + * all online CPUs so any calls of lru_cache_disabled wrapped by
> + * local_lock or preemption disabled would be ordered by that.
> + * The atomic operation doesn't need to have stronger ordering
> + * requirements because that is enforeced by the scheduling
> + * guarantees.
> + */
> + __lru_add_drain_all(true);
> +#else
>
>
> CPU-0 CPU-1
>
>
> local_lock(&lru_pvecs.lock);
> pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
> add page to per-CPU LRU pvec
> if atomic_read(lru_disable_count) != 0
> flush per-CPU LRU pvec
> atomic_inc(lru_disable_count)
>
> local_unlock(&lru_pvec.lock)
> lru_add_drain_all(force_all_cpus=true)
>
> However queueing the work items disturbs isolated CPUs. To avoid it, its
> possible to use synchronize_rcu instead:
>
> CPU-0 CPU-1
>
>
> local_lock(&lru_pvecs.lock);
> pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);
> add page to per-CPU LRU pvec
> if atomic_read(lru_disable_count) != 0
> flush per-CPU LRU pvec
> atomic_inc(lru_disable_count)
>
> local_unlock(&lru_pvec.lock)
> synchronize_rcu()
>
> Which will wait for all preemption (or IRQ disabled) sections to
> complete, therefore ensuring visibilily of lru_disable_count.

OK, that does sound plausible. (But please note that I am not familiar
with that code.)

Thanx, Paul

2022-03-04 20:31:11

by Marcelo Tosatti

Subject: Re: [patch v3] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, Mar 03, 2022 at 05:03:23PM -0800, Andrew Morton wrote:
> (Question for paulmck below, please)
>
> On Tue, 22 Feb 2022 13:07:35 -0300 Marcelo Tosatti <[email protected]> wrote:
>
> >
> > On systems that run FIFO:1 applications that busy loop
> > on isolated CPUs, executing tasks on such CPUs under
> > lower priority is undesired (since that will either
> > hang the system, or cause longer interruption to the
> > FIFO task due to execution of lower priority task
> > with very small sched slices).
> >
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > pagevec during the migration temporarily") relies on
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> >
> > However, its possible to use synchronize_rcu which will provide the same
> > guarantees (see comment this patch modifies on lru_cache_disable).
> >
> > Fixes:
> >
> > [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> > [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> > [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> > [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> > [ 1873.243936] Call Trace:
> > [ 1873.243938] __schedule+0x21b/0x5b0
> > [ 1873.243941] schedule+0x43/0xe0
> > [ 1873.243943] schedule_timeout+0x14d/0x190
> > [ 1873.243946] ? resched_curr+0x20/0xe0
> > [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> > [ 1873.243958] wait_for_completion+0x84/0xe0
> > [ 1873.243962] __flush_work.isra.0+0x146/0x200
> > [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> > [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> > [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> > [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> > [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> > [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> > [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> > [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> > [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> > [ 1873.244005] ? __switch_to+0x12f/0x510
> > [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> > [ 1873.244016] process_one_work+0x1e0/0x410
> > [ 1873.244019] worker_thread+0x50/0x3b0
> > [ 1873.244022] ? process_one_work+0x410/0x410
> > [ 1873.244024] kthread+0x173/0x190
> > [ 1873.244027] ? set_kthread_struct+0x40/0x40
> > [ 1873.244031] ret_from_fork+0x1f/0x30
> >
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > for_each_online_cpu(cpu) {
> > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >
> > - if (force_all_cpus ||
> > - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> > pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
>
> This change appears to be "don't queue work on CPUs which don't have
> any work to do". Correct? This isn't changelogged?
>
> > @@ -876,14 +875,19 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> > void lru_cache_disable(void)
> > {
> > atomic_inc(&lru_disable_count);
> > + synchronize_rcu();
> > #ifdef CONFIG_SMP
> > /*
> > - * lru_add_drain_all in the force mode will schedule draining on
> > - * all online CPUs so any calls of lru_cache_disabled wrapped by
> > - * local_lock or preemption disabled would be ordered by that.
> > - * The atomic operation doesn't need to have stronger ordering
> > - * requirements because that is enforced by the scheduling
> > - * guarantees.
> > + * synchronize_rcu() waits for preemption disabled
> > + * and RCU read side critical sections.
> > + * For the users of lru_disable_count:
> > + *
> > + * preempt_disable, local_irq_disable [bh_lru_lock()]
> > + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> > + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> > + *
> > + * so any calls of lru_cache_disabled wrapped by local_lock or
> > + * preemption disabled would be ordered by that.
> > */
> > __lru_add_drain_all(true);
> > #else
>
> Does this also work with CONFIG_TINY_RCU?
>
> This seems abusive of synchronize_rcu(). None of this code uses RCU,
> but it so happens that synchronize_rcu() happily provides the desired
> effects. Changes in RCU's happy side-effects might break this.
> Perhaps a formal API function which does whatever-you-want-it-to-do
> would be better.
>
> And... I really don't understand the fix. What is it about
> synchronize_rcu() which guarantees that a work function which is queued
> on CPU N will now get executed even if CPU N is spinning in SCHED_FIFO
> userspace?

It does not. synchronize_rcu() replaces the queueing of the work functions
as the means of ensuring visibility of lru_disable_count.

2022-03-05 01:47:59

by Andrew Morton

Subject: Re: [patch v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Fri, 4 Mar 2022 13:29:31 -0300 Marcelo Tosatti <[email protected]> wrote:

>
> On systems that run FIFO:1 applications that busy loop
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task
> with very small sched slices).
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
>
> Fixes:
>
> ...
>
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> for_each_online_cpu(cpu) {
> struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>
> - if (force_all_cpus ||
> - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||

Please changelog this alteration?

> data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> @@ -876,15 +875,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> void lru_cache_disable(void)
> {
> atomic_inc(&lru_disable_count);
> -#ifdef CONFIG_SMP
> /*
> - * lru_add_drain_all in the force mode will schedule draining on
> - * all online CPUs so any calls of lru_cache_disabled wrapped by
> - * local_lock or preemption disabled would be ordered by that.
> - * The atomic operation doesn't need to have stronger ordering
> - * requirements because that is enforced by the scheduling
> - * guarantees.
> + * Readers of lru_disable_count are protected by either disabling
> + * preemption or rcu_read_lock:
> + *
> + * preempt_disable, local_irq_disable [bh_lru_lock()]
> + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> + *
> + * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> + * preempt_disable() regions of code. So any CPU which sees
> + * lru_disable_count = 0 will have exited the critical
> + * section when synchronize_rcu() returns.
> */
> + synchronize_rcu();
> +#ifdef CONFIG_SMP
> __lru_add_drain_all(true);
> #else
> lru_add_and_bh_lrus_drain();

2022-03-05 20:15:35

by Paul E. McKenney

Subject: Re: [patch v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Fri, Mar 04, 2022 at 01:29:31PM -0300, Marcelo Tosatti wrote:
>
> On systems that run FIFO:1 applications that busy loop
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task
> with very small sched slices).
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
>
> Fixes:
>
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938] __schedule+0x21b/0x5b0
> [ 1873.243941] schedule+0x43/0xe0
> [ 1873.243943] schedule_timeout+0x14d/0x190
> [ 1873.243946] ? resched_curr+0x20/0xe0
> [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958] wait_for_completion+0x84/0xe0
> [ 1873.243962] __flush_work.isra.0+0x146/0x200
> [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005] ? __switch_to+0x12f/0x510
> [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016] process_one_work+0x1e0/0x410
> [ 1873.244019] worker_thread+0x50/0x3b0
> [ 1873.244022] ? process_one_work+0x410/0x410
> [ 1873.244024] kthread+0x173/0x190
> [ 1873.244027] ? set_kthread_struct+0x40/0x40
> [ 1873.244031] ret_from_fork+0x1f/0x30
>
> Signed-off-by: Marcelo Tosatti <[email protected]>

Given the explanation and the comments below, this does look plausible
to me.

Thanx, Paul

> ---
>
> v4: improve comment clarify, mention synchronize_rcu guarantees
> on v5.1 (Andrew Morton /
> Paul E. McKenney)
> v3: update stale comment (Nicolas Saenz Julienne)
> v2: rt_spin_lock calls rcu_read_lock, no need
> to add it before local_lock on swap.c (Nicolas Saenz Julienne)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index bcf3ac288b56..b5ee163daa66 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> for_each_online_cpu(cpu) {
> struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
>
> - if (force_all_cpus ||
> - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
> pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
> @@ -876,15 +875,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
> void lru_cache_disable(void)
> {
> atomic_inc(&lru_disable_count);
> -#ifdef CONFIG_SMP
> /*
> - * lru_add_drain_all in the force mode will schedule draining on
> - * all online CPUs so any calls of lru_cache_disabled wrapped by
> - * local_lock or preemption disabled would be ordered by that.
> - * The atomic operation doesn't need to have stronger ordering
> - * requirements because that is enforced by the scheduling
> - * guarantees.
> + * Readers of lru_disable_count are protected by either disabling
> + * preemption or rcu_read_lock:
> + *
> + * preempt_disable, local_irq_disable [bh_lru_lock()]
> + * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> + * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> + *
> + * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> + * preempt_disable() regions of code. So any CPU which sees
> + * lru_disable_count = 0 will have exited the critical
> + * section when synchronize_rcu() returns.
> */
> + synchronize_rcu();
> +#ifdef CONFIG_SMP
> __lru_add_drain_all(true);
> #else
> lru_add_and_bh_lrus_drain();
>

2022-03-07 21:40:56

by Marcelo Tosatti

Subject: Re: [patch v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Fri, Mar 04, 2022 at 04:35:54PM -0800, Andrew Morton wrote:
> On Fri, 4 Mar 2022 13:29:31 -0300 Marcelo Tosatti <[email protected]> wrote:
>
> >
> > On systems that run FIFO:1 applications that busy loop
> > on isolated CPUs, executing tasks on such CPUs under
> > lower priority is undesired (since that will either
> > hang the system, or cause longer interruption to the
> > FIFO task due to execution of lower priority task
> > with very small sched slices).
> >
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > pagevec during the migration temporarily") relies on
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> >
> > However, its possible to use synchronize_rcu which will provide the same
> > guarantees (see comment this patch modifies on lru_cache_disable).
> >
> > Fixes:
> >
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
> > for_each_online_cpu(cpu) {
> > struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
> >
> > - if (force_all_cpus ||
> > - pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
> > + if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
>
> Please changelog this alteration?

It should be now. Are you OK with this changelog?
(If not, please let me know what should be improved.)

On systems that run FIFO:1 applications that busy loop,
any SCHED_OTHER task that attempts to execute
on such a CPU (such as work threads) will not
be scheduled, which leads to system hangs.

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

To fix this, replace the usage of work items with synchronize_rcu,
which provides the same guarantees:

Readers of lru_disable_count are protected by either disabling
preemption or rcu_read_lock:

preempt_disable, local_irq_disable [bh_lru_lock()]
rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
preempt_disable [local_lock !CONFIG_PREEMPT_RT]

Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
preempt_disable() regions of code. So any CPU which sees
lru_disable_count = 0 will have exited the critical
section when synchronize_rcu() returns.

Fixes:
...

Thanks.

2022-03-08 22:13:22

by Minchan Kim

Subject: Re: [patch v4] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Fri, Mar 04, 2022 at 01:29:31PM -0300, Marcelo Tosatti wrote:
>
> On systems that run FIFO:1 applications that busy loop
> on isolated CPUs, executing tasks on such CPUs under
> lower priority is undesired (since that will either
> hang the system, or cause longer interruption to the
> FIFO task due to execution of lower priority task
> with very small sched slices).
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> However, its possible to use synchronize_rcu which will provide the same
> guarantees (see comment this patch modifies on lru_cache_disable).
>
> Fixes:
>
> [ 1873.243925] INFO: task kworker/u160:0:9 blocked for more than 622 seconds.
> [ 1873.243927] Tainted: G I --------- --- 5.14.0-31.rt21.31.el9.x86_64 #1
> [ 1873.243929] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1873.243929] task:kworker/u160:0 state:D stack: 0 pid: 9 ppid: 2 flags:0x00004000
> [ 1873.243932] Workqueue: cpuset_migrate_mm cpuset_migrate_mm_workfn
> [ 1873.243936] Call Trace:
> [ 1873.243938] __schedule+0x21b/0x5b0
> [ 1873.243941] schedule+0x43/0xe0
> [ 1873.243943] schedule_timeout+0x14d/0x190
> [ 1873.243946] ? resched_curr+0x20/0xe0
> [ 1873.243953] ? __prepare_to_swait+0x4b/0x70
> [ 1873.243958] wait_for_completion+0x84/0xe0
> [ 1873.243962] __flush_work.isra.0+0x146/0x200
> [ 1873.243966] ? flush_workqueue_prep_pwqs+0x130/0x130
> [ 1873.243971] __lru_add_drain_all+0x158/0x1f0
> [ 1873.243978] do_migrate_pages+0x3d/0x2d0
> [ 1873.243985] ? pick_next_task_fair+0x39/0x3b0
> [ 1873.243989] ? put_prev_task_fair+0x1e/0x30
> [ 1873.243992] ? pick_next_task+0xb30/0xbd0
> [ 1873.243995] ? __tick_nohz_task_switch+0x1e/0x70
> [ 1873.244000] ? raw_spin_rq_unlock+0x18/0x60
> [ 1873.244002] ? finish_task_switch.isra.0+0xc1/0x2d0
> [ 1873.244005] ? __switch_to+0x12f/0x510
> [ 1873.244013] cpuset_migrate_mm_workfn+0x22/0x40
> [ 1873.244016] process_one_work+0x1e0/0x410
> [ 1873.244019] worker_thread+0x50/0x3b0
> [ 1873.244022] ? process_one_work+0x410/0x410
> [ 1873.244024] kthread+0x173/0x190
> [ 1873.244027] ? set_kthread_struct+0x40/0x40
> [ 1873.244031] ret_from_fork+0x1f/0x30
>
> Signed-off-by: Marcelo Tosatti <[email protected]>

Looks great to me.

Acked-by: Minchan Kim <[email protected]>

While reviewing it, I seem to have found a bug: bh_lru_install()
needs to check lru_cache_disabled() under bh_lru_lock(). Let me
cook the patch.

Thanks!
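
For reference, the race being pointed at has the shape of testing the
disable flag outside the critical section that the disable side waits on.
A minimal sketch of the two orderings (hypothetical names only; this is
not the fs/buffer.c code):

#include <pthread.h>
#include <stdatomic.h>

static atomic_bool cache_disabled;     /* stands in for lru_cache_disabled() */
static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;  /* "bh_lru_lock" */
static int lru_entries;

/* Racy ordering: the flag is tested before entering the section. */
static void install_racy(void)
{
    if (atomic_load(&cache_disabled))
        return;
    /*
     * Window: the disable side can set the flag, wait for all lock
     * holders, and drain the LRU right here -- and then the install
     * below happens anyway, leaving a stale entry behind.
     */
    pthread_mutex_lock(&lru_lock);
    lru_entries++;                     /* install into the LRU */
    pthread_mutex_unlock(&lru_lock);
}

/*
 * Safe ordering: the flag is tested inside the section the disable side
 * waits on, so either this section finishes before the drain (and the
 * drain removes the entry), or it observes the flag and skips the install.
 */
static void install_checked(void)
{
    pthread_mutex_lock(&lru_lock);
    if (!atomic_load(&cache_disabled))
        lru_entries++;
    pthread_mutex_unlock(&lru_lock);
}

int main(void)
{
    install_racy();
    install_checked();
    return 0;
}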

2022-03-11 23:03:40

by Marcelo Tosatti

Subject: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu


On systems that run FIFO:1 applications that busy loop,
any SCHED_OTHER task that attempts to execute
on such a CPU (such as work threads) will not
be scheduled, which leads to system hangs.

Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
pagevec during the migration temporarily") relies on
queueing work items on all online CPUs to ensure visibility
of lru_disable_count.

To fix this, replace the usage of work items with synchronize_rcu,
which provides the same guarantees.

Readers of lru_disable_count are protected by either disabling
preemption or rcu_read_lock:

preempt_disable, local_irq_disable [bh_lru_lock()]
rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
preempt_disable [local_lock !CONFIG_PREEMPT_RT]

Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
preempt_disable() regions of code. So any CPU which sees
lru_disable_count = 0 will have exited the critical
section when synchronize_rcu() returns.

Signed-off-by: Marcelo Tosatti <[email protected]>
Reviewed-by: Nicolas Saenz Julienne <[email protected]>
Acked-by: Minchan Kim <[email protected]>

---

v5: changelog improvements (Andrew Morton)
v4: improve comment clarity, mention synchronize_rcu guarantees
on v5.1 (Andrew Morton / Paul E. McKenney)
v3: update stale comment (Nicolas Saenz Julienne)
v2: rt_spin_lock calls rcu_read_lock, no need
to add it before local_lock on swap.c (Nicolas Saenz Julienne)

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..b5ee163daa66 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -831,8 +831,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
for_each_online_cpu(cpu) {
struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

- if (force_all_cpus ||
- pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
+ if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) ||
data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
@@ -876,15 +875,21 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
void lru_cache_disable(void)
{
atomic_inc(&lru_disable_count);
-#ifdef CONFIG_SMP
/*
- * lru_add_drain_all in the force mode will schedule draining on
- * all online CPUs so any calls of lru_cache_disabled wrapped by
- * local_lock or preemption disabled would be ordered by that.
- * The atomic operation doesn't need to have stronger ordering
- * requirements because that is enforced by the scheduling
- * guarantees.
+ * Readers of lru_disable_count are protected by either disabling
+ * preemption or rcu_read_lock:
+ *
+ * preempt_disable, local_irq_disable [bh_lru_lock()]
+ * rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
+ * preempt_disable [local_lock !CONFIG_PREEMPT_RT]
+ *
+ * Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
+ * preempt_disable() regions of code. So any CPU which sees
+ * lru_disable_count = 0 will have exited the critical
+ * section when synchronize_rcu() returns.
*/
+ synchronize_rcu();
+#ifdef CONFIG_SMP
__lru_add_drain_all(true);
#else
lru_add_and_bh_lrus_drain();
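
For illustration, a condensed sketch of the reader side that the changelog
and comment above refer to (it follows the shape of lru_cache_add() in
mm/swap.c, but is not the literal code):

/*
 * Condensed reader-side sketch (illustrative only).  The point: the check
 * of lru_disable_count happens inside a region that synchronize_rcu() in
 * lru_cache_disable() is guaranteed to wait for, so after the writer
 * returns from synchronize_rcu() no reader can still be batching pages
 * based on a stale lru_disable_count == 0.
 */
static void lru_cache_add_sketch(struct page *page)
{
	struct pagevec *pvec;

	get_page(page);		/* pin the page while it sits in the pagevec */

	/*
	 * !PREEMPT_RT: local_lock() disables preemption.
	 * PREEMPT_RT:  local_lock() is an rt_spin_lock, which takes
	 *              rcu_read_lock() internally.
	 * Either way the whole region below is covered by synchronize_rcu().
	 */
	local_lock(&lru_pvecs.lock);
	pvec = this_cpu_ptr(&lru_pvecs.lru_add);

	if (!pagevec_add(pvec, page) || lru_cache_disabled())
		__pagevec_lru_add(pvec);	/* flush straight to the LRU */

	local_unlock(&lru_pvecs.lock);
}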



2022-04-01 07:03:06

by Borislav Petkov

[permalink] [raw]
Subject: Re: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
>
> On systems that run FIFO:1 applications that busy loop,
> any SCHED_OTHER task that attempts to execute
> on such a CPU (such as work threads) will not
> be scheduled, which leads to system hangs.
>
> Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> pagevec during the migration temporarily") relies on
> queueing work items on all online CPUs to ensure visibility
> of lru_disable_count.
>
> To fix this, replace the usage of work items with synchronize_rcu,
> which provides the same guarantees.
>
> Readers of lru_disable_count are protected by either disabling
> preemption or rcu_read_lock:
>
> preempt_disable, local_irq_disable [bh_lru_lock()]
> rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> preempt_disable [local_lock !CONFIG_PREEMPT_RT]
>
> Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> preempt_disable() regions of code. So any CPU which sees
> lru_disable_count = 0 will have exited the critical
> section when synchronize_rcu() returns.
>
> Signed-off-by: Marcelo Tosatti <[email protected]>
> Reviewed-by: Nicolas Saenz Julienne <[email protected]>
> Acked-by: Minchan Kim <[email protected]>

Someone pointed me at this:

https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom

which says this one causes a performance regression with stress-ng's
NUMA test...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-04-29 11:49:12

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, Mar 31, 2022 at 03:52:45PM +0200, Borislav Petkov wrote:
> On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
> >
> > On systems that run FIFO:1 applications that busy loop,
> > any SCHED_OTHER task that attempts to execute
> > on such a CPU (such as work threads) will not
> > be scheduled, which leads to system hangs.
> >
> > Commit d479960e44f27e0e52ba31b21740b703c538027c ("mm: disable LRU
> > pagevec during the migration temporarily") relies on
> > queueing work items on all online CPUs to ensure visibility
> > of lru_disable_count.
> >
> > To fix this, replace the usage of work items with synchronize_rcu,
> > which provides the same guarantees.
> >
> > Readers of lru_disable_count are protected by either disabling
> > preemption or rcu_read_lock:
> >
> > preempt_disable, local_irq_disable [bh_lru_lock()]
> > rcu_read_lock [rt_spin_lock CONFIG_PREEMPT_RT]
> > preempt_disable [local_lock !CONFIG_PREEMPT_RT]
> >
> > Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
> > preempt_disable() regions of code. So any CPU which sees
> > lru_disable_count = 0 will have exited the critical
> > section when synchronize_rcu() returns.
> >
> > Signed-off-by: Marcelo Tosatti <[email protected]>
> > Reviewed-by: Nicolas Saenz Julienne <[email protected]>
> > Acked-by: Minchan Kim <[email protected]>
>
> Someone pointed me at this:
>
> https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom
>
> which says this one causes a performance regression with stress-ng's
> NUMA test...

Michael,

This is probably do_migrate_pages taking too long due to
synchronize_rcu().

Switching to synchronize_rcu_expedited() should probably fix it...
Can you give it a try, please?

diff --git a/mm/swap.c b/mm/swap.c
index bceff0cb559c..04a8bbf9817a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -879,7 +879,7 @@ void lru_cache_disable(void)
* lru_disable_count = 0 will have exited the critical
* section when synchronize_rcu() returns.
*/
- synchronize_rcu();
+ synchronize_rcu_expedited();
#ifdef CONFIG_SMP
__lru_add_drain_all(true);
#else
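
For context on why the grace period shows up in this benchmark: the cost is
paid once per do_migrate_pages() call, and the stress-ng NUMA test hits that
path in a tight loop. A hedged sketch of the shape of that path follows
(condensed; migrate_to_nodes() is a made-up stand-in for the real migration
loop in mm/mempolicy.c, not an actual kernel function):

/* Condensed, illustrative shape of do_migrate_pages(); not the literal code. */
static int do_migrate_pages_sketch(struct mm_struct *mm,
				   const nodemask_t *from,
				   const nodemask_t *to, int flags)
{
	int err;

	lru_cache_disable();	/* atomic_inc + synchronize_rcu(): one full
				 * grace period per call; the expedited
				 * variant is orders of magnitude faster,
				 * at the cost of IPIs to all CPUs */
	mmap_read_lock(mm);

	err = migrate_to_nodes(mm, from, to, flags);	/* hypothetical helper */

	mmap_read_unlock(mm);
	lru_cache_enable();	/* just atomic_dec(&lru_disable_count) */

	return err;
}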



2022-05-28 21:18:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Thu, 28 Apr 2022 15:00:11 -0300 Marcelo Tosatti <[email protected]> wrote:

> On Thu, Mar 31, 2022 at 03:52:45PM +0200, Borislav Petkov wrote:
> > On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
> > >
> ...
>
> >
> > Someone pointed me at this:
> >
> > https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom
> >
> > which says this one causes a performance regression with stress-ng's
> > NUMA test...
>
> Michael,
>
> This is probably do_migrate_pages that is taking too long due to
> synchronize_rcu().
>
> Switching to synchronize_rcu_expedited() should probably fix it...
> Can you give it a try, please?

I guess not.

Is anyone else able to demonstrate a stress-ng performance regression
due to ff042f4a9b0508? And if so, are they able to try Marcelo's
one-liner?

> diff --git a/mm/swap.c b/mm/swap.c
> index bceff0cb559c..04a8bbf9817a 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -879,7 +879,7 @@ void lru_cache_disable(void)
> * lru_disable_count = 0 will have exited the critical
> * section when synchronize_rcu() returns.
> */
> - synchronize_rcu();
> + synchronize_rcu_expedited();
> #ifdef CONFIG_SMP
> __lru_add_drain_all(true);
> #else
>
>

2022-06-19 14:14:03

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

Hi, this is your Linux kernel regression tracker.

On 29.05.22 02:48, Michael Larabel wrote:
> On 5/28/22 17:54, Michael Larabel wrote:
>> On 5/28/22 16:18, Andrew Morton wrote:
>>> On Thu, 28 Apr 2022 15:00:11 -0300 Marcelo Tosatti
>>> <[email protected]> wrote:
>>>> On Thu, Mar 31, 2022 at 03:52:45PM +0200, Borislav Petkov wrote:
>>>>> On Thu, Mar 10, 2022 at 10:22:12AM -0300, Marcelo Tosatti wrote:
>>>>> Someone pointed me at this:
>>>>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-518-Stress-NUMA-Goes-Boom
>>>>>
>>>>> which says this one causes a performance regression with stress-ng's
>>>>> NUMA test...
>>>>
>>>> This is probably do_migrate_pages that is taking too long due to
>>>> synchronize_rcu().
>>>>
>>>> Switching to synchronize_rcu_expedited() should probably fix it...
>>>> Can you give it a try, please?
>>> I guess not.
>>>
>>> Is anyone else able to demonstrate a stress-ng performance regression
>>> due to ff042f4a9b0508?  And if so, are they able to try Marcelo's
>>> one-liner?
>>
>> Apologies, I don't believe I got the email previously (or it ended up
>> in spam or was otherwise overlooked), so I only noticed this thread now...
>>
>> I have the system around and will work on verifying it can reproduce
>> still and can then test the patch, should be able to get it tomorrow.
>>
>> Thanks and sorry about the delay.
>
> I had a chance to look at it today. I was able to reproduce the
> regression on that 5950X system going from v5.17 to v5.18 (using a
> newer stress-ng benchmark and with other system changes since the prior
> tests). I confirmed it still shows the slowdown as of today's Git.
>
> With Marcelo's patch below, I can confirm that stress-ng NUMA
> performance is back to the v5.17 level (actually, faster), and
> certainly not what I was seeing on v5.18 or current Git.
>
> So all seems to be good with that one-liner for the stress-ng NUMA test
> case. All the system details and results, for those interested, are
> documented @ https://openbenchmarking.org/result/2205284-PTS-NUMAREGR17
> but basically amounts to:
>
>     Stress-NG 0.14
>     Test: NUMA
>     Bogo Ops/s > Higher Is Better
>     v5.17: 412.88
>     v5.18: 49.33
>     20220528 Git: 49.66
>     20220528 Git + sched-rcu-exped patch: 468.81
>
> Apologies again about the delay / not seeing the email thread earlier.
>
> Thanks,
>
> Michael
>
> Tested-by: Michael Larabel <[email protected]>

Andrew, is there a reason why this patch afaics isn't mainlined yet and
has been lingering in linux-next for so long? Michael confirmed three
weeks ago that this patch fixes a regression, and a few days later Stefan
confirmed that his problem was solved as well:
https://lore.kernel.org/regressions/[email protected]/

Reminder: unless there are good reasons, it shouldn't take this long to
get a regression fix mainlined, for the reasons explained in
https://docs.kernel.org/process/handling-regressions.html

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

>>>> diff --git a/mm/swap.c b/mm/swap.c
>>>> index bceff0cb559c..04a8bbf9817a 100644
>>>> --- a/mm/swap.c
>>>> +++ b/mm/swap.c
>>>> @@ -879,7 +879,7 @@ void lru_cache_disable(void)
>>>>        * lru_disable_count = 0 will have exited the critical
>>>>        * section when synchronize_rcu() returns.
>>>>        */
>>>> -    synchronize_rcu();
>>>> +    synchronize_rcu_expedited();
>>>>   #ifdef CONFIG_SMP
>>>>       __lru_add_drain_all(true);
>>>>   #else
>>>>
>>>>

2022-06-22 00:19:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch v5] mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu

On Sun, 19 Jun 2022 14:14:03 +0200 Thorsten Leemhuis <[email protected]> wrote:

> Andrew, is there a reason why this patch afaics isn't mainlined yet and
> lingering in linux-next for so long?

I didn't bother doing a hotfixes merge last week because there wasn't anything
very urgent-looking in there. I'll be putting together a pull request later this week.