2024-01-17 20:33:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] kprobes: Use synchronize_rcu_tasks_rude in kprobe_optimizer

On Wed, 17 Jan 2024 06:16:36 +0000 Chen Zhongjin <[email protected]> wrote:

> There is a deadlock scenario in kprobe_optimizer():
>
> pid A pid B pid C
> kprobe_optimizer() do_exit() perf_kprobe_init()
> mutex_lock(&kprobe_mutex) exit_tasks_rcu_start() mutex_lock(&kprobe_mutex)
> synchronize_rcu_tasks() zap_pid_ns_processes() // waiting kprobe_mutex
> // waiting tasks_rcu_exit_srcu kernel_wait4()
> // waiting pid C exit
>
> To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
> rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() can also promise
> that all preempted tasks have scheduled, but it will not wait tasks_rcu_exit_srcu.
>
> Signed-off-by: Chen Zhongjin <[email protected]>

Thanks. Should we backport this fix into earlier kernels? If so, are
we able to identify a suitable Fixes: target?


2024-01-18 14:42:37

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] kprobes: Use synchronize_rcu_tasks_rude in kprobe_optimizer

On Wed, Jan 17, 2024 at 12:31:33PM -0800, Andrew Morton wrote:
> On Wed, 17 Jan 2024 06:16:36 +0000 Chen Zhongjin <[email protected]> wrote:
>
> > There is a deadlock scenario in kprobe_optimizer():
> >
> > pid A pid B pid C
> > kprobe_optimizer() do_exit() perf_kprobe_init()
> > mutex_lock(&kprobe_mutex) exit_tasks_rcu_start() mutex_lock(&kprobe_mutex)
> > synchronize_rcu_tasks() zap_pid_ns_processes() // waiting kprobe_mutex
> > // waiting tasks_rcu_exit_srcu kernel_wait4()
> > // waiting pid C exit
> >
> > To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in kprobe_optimizer()
> > rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() can also promise
> > that all preempted tasks have scheduled, but it will not wait tasks_rcu_exit_srcu.

Hello, Chen Zhongjin,

Relying on the non-preemptability of the last portion of the do_exit()
function does appear to be the basis for a much better solution than
what we currently have. So at the very least, thank you for the idea!!!
I feel a bit silly for not having thought of it myself. ;-)

However, just invoking synchronize_rcu_tasks_rude() will be bad for both
energy efficiency and real-time response. This is due to the fact that
synchronize_rcu_tasks_rude() sends an IPI to each and every online CPUs,
almost none of which will be in the non-preemptible tail of do_exit()
at any given time. These additional unnecessary IPIs will drain battery
when they hit idle CPUs and degrade real-time response when they hit
CPUs running aggressive real-time applications. Which might not make
people happy.

So, would you be willing to use RCU's do_exit() hooks and RCU's hook
into the scheduler (either rcu_note_context_switch() or rcu_tasks_qs(),
whichever would work better) maintain a per-CPU variable that is a
pointer to the task in the non-preemptible tail of do_exit() if there
is one or NULL otherwise? This would get us the deadlock-avoidance
simplicity of your underlying idea, while avoiding (almost all of)
the energy-efficiency and real-time-response issues with your patch.

This does require a bit of memory-ordering care, so if you would prefer
that I do the implementation, just let me know.

(The memory-ordering trick is to use "smp_mb(); smp_load_acquire();" to
sample the counter and "smp_store_release(); smp_mb();" to update it.)

Thanx, Paul

> > Signed-off-by: Chen Zhongjin <[email protected]>
>
> Thanks. Should we backport this fix into earlier kernels? If so, are
> we able to identify a suitable Fixes: target?