put_task_struct() decrements a usage counter and calls
__put_task_struct() if the counter reaches zero.
__put_task_struct() indirectly acquires a spinlock, which is a sleeping
lock under PREEMPT_RT. Therefore, we can't call put_task_struct() in an
atomic context in RT kernels.
This patch series introduces put_task_struct_atomic_safe(), which defers
the call to __put_task_struct() to a process context when compiled with
PREEMPT_RT.
It also fixes known problematic call sites.
Changelog:
==========
v2:
* Add the put_task_struct_atomic_safe() function that is responsible for
handling the conditions to call put_task_struct().
* Replace put_task_struct() by put_task_struct_atomic_safe() in known
atomic call sites.
Wander Lairson Costa (4):
sched/task: Add the put_task_struct_atomic_safe function
sched/deadline: fix inactive_task_timer splat
sched/rt: use put_task_struct_atomic_safe() to avoid potential splat
sched/core: use put_task_struct_atomic_safe() to avoid potential splat
include/linux/sched/task.h | 21 +++++++++++++++++++++
kernel/fork.c | 8 ++++++++
kernel/sched/core.c | 2 +-
kernel/sched/deadline.c | 2 +-
kernel/sched/rt.c | 4 ++--
5 files changed, 33 insertions(+), 4 deletions(-)
--
2.39.0
With PREEMPT_RT, it is unsafe to call put_task_struct() in atomic
contexts because it indirectly acquires sleeping locks.
Introduce put_task_struct_atomic_safe(), which schedules
__put_task_struct() through call_rcu() when the kernel is compiled with
PREEMPT_RT.
A more natural approach would be to use a workqueue, but under
PREEMPT_RT we can't allocate dynamic memory from atomic context. We
would therefore have to embed a work_struct instance in task_struct and
initialize it whenever a new task_struct is allocated, which makes the
code more complex.
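For illustration, the workqueue-based variant would look roughly like
the sketch below (hypothetical and untested; the put_work member and
the helper shown here are not part of this series):

	/*
	 * Hypothetical alternative: struct task_struct would need an embedded
	 * struct work_struct put_work, initialized when the task is allocated,
	 * so that no work item has to be allocated from atomic context.
	 */
	static void __delayed_put_task_struct_work(struct work_struct *work)
	{
		struct task_struct *task = container_of(work, struct task_struct,
							put_work);

		__put_task_struct(task);
	}

	static inline void put_task_struct_atomic_safe(struct task_struct *task)
	{
		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
			if (refcount_dec_and_test(&task->usage))
				/* queue_work() is safe to call from atomic context */
				queue_work(system_unbound_wq, &task->put_work);
		} else {
			put_task_struct(task);
		}
	}

Deferring through call_rcu() avoids both the extra task_struct member
and the extra initialization.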
Signed-off-by: Wander Lairson Costa <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
include/linux/sched/task.h | 21 +++++++++++++++++++++
kernel/fork.c | 8 ++++++++
2 files changed, 29 insertions(+)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 357e0068497c..80b4c5812563 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -127,6 +127,27 @@ static inline void put_task_struct_many(struct task_struct *t, int nr)
void put_task_struct_rcu_user(struct task_struct *task);
+extern void __delayed_put_task_struct(struct rcu_head *rhp);
+
+static inline void put_task_struct_atomic_safe(struct task_struct *task)
+{
+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+ /*
+ * Decrement the refcount explicitly to avoid unnecessarily
+ * calling call_rcu.
+ */
+ if (refcount_dec_and_test(&task->usage))
+ /*
+ * under PREEMPT_RT, we can't call put_task_struct
+ * in atomic context because it will indirectly
+ * acquire sleeping locks.
+ */
+ call_rcu(&task->rcu, __delayed_put_task_struct);
+ } else {
+ put_task_struct(task);
+ }
+}
+
/* Free all architecture-specific resources held by a thread. */
void release_thread(struct task_struct *dead_task);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..3d7a4e9311b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -859,6 +859,14 @@ void __put_task_struct(struct task_struct *tsk)
}
EXPORT_SYMBOL_GPL(__put_task_struct);
+void __delayed_put_task_struct(struct rcu_head *rhp)
+{
+ struct task_struct *task = container_of(rhp, struct task_struct, rcu);
+
+ __put_task_struct(task);
+}
+EXPORT_SYMBOL_GPL(__delayed_put_task_struct);
+
void __init __weak arch_task_cache_init(void) { }
/*
--
2.39.0
rto_push_irq_work_func() is called in hardirq context, and it calls
push_rt_task(), which calls put_task_struct().
If the kernel is compiled with PREEMPT_RT and the usage count in
put_task_struct() reaches zero, it triggers a splat because
__put_task_struct() indirectly acquires sleeping locks.
The put_task_struct() call pairs with an earlier get_task_struct(),
which makes the probability of the usage count reaching zero pretty
low. In any case, let's play it safe and use the atomic-safe version.
Signed-off-by: Wander Lairson Costa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
kernel/sched/rt.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ed2a47e4ddae..30a4e9607bec 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2147,7 +2147,7 @@ static int push_rt_task(struct rq *rq, bool pull)
/*
* Something has shifted, try again.
*/
- put_task_struct(next_task);
+ put_task_struct_atomic_safe(next_task);
next_task = task;
goto retry;
}
@@ -2160,7 +2160,7 @@ static int push_rt_task(struct rq *rq, bool pull)
double_unlock_balance(rq, lowest_rq);
out:
- put_task_struct(next_task);
+ put_task_struct_atomic_safe(next_task);
return ret;
}
--
2.39.0
inactive_task_timer() executes in interrupt (atomic) context. It calls
put_task_struct(), which indirectly acquires sleeping locks under
PREEMPT_RT.
Below is an example of a splat that happened in a test environment:
CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
Call Trace:
dump_stack_lvl+0x57/0x7d
mark_lock_irq.cold+0x33/0xba
? stack_trace_save+0x4b/0x70
? save_trace+0x55/0x150
mark_lock+0x1e7/0x400
mark_usage+0x11d/0x140
__lock_acquire+0x30d/0x930
lock_acquire.part.0+0x9c/0x210
? refill_obj_stock+0x3d/0x3a0
? rcu_read_lock_sched_held+0x3f/0x70
? trace_lock_acquire+0x38/0x140
? lock_acquire+0x30/0x80
? refill_obj_stock+0x3d/0x3a0
rt_spin_lock+0x27/0xe0
? refill_obj_stock+0x3d/0x3a0
refill_obj_stock+0x3d/0x3a0
? inactive_task_timer+0x1ad/0x340
kmem_cache_free+0x357/0x560
inactive_task_timer+0x1ad/0x340
? switched_from_dl+0x2d0/0x2d0
__run_hrtimer+0x8a/0x1a0
__hrtimer_run_queues+0x91/0x130
hrtimer_interrupt+0x10f/0x220
__sysvec_apic_timer_interrupt+0x7b/0xd0
sysvec_apic_timer_interrupt+0x4f/0xd0
? asm_sysvec_apic_timer_interrupt+0xa/0x20
asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0033:0x7fff196bf6f5
Use put_task_struct_atomic_safe() instead.
Signed-off-by: Wander Lairson Costa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
kernel/sched/deadline.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0d97d54276cc..03400c61a994 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1442,7 +1442,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
dl_se->dl_non_contending = 0;
unlock:
task_rq_unlock(rq, p, &rf);
- put_task_struct(p);
+ put_task_struct_atomic_safe(p);
return HRTIMER_NORESTART;
}
--
2.39.0
push_cpu_stop() is called from a context in which it must not sleep.
Since push_cpu_stop() calls put_task_struct(), it can potentially sleep
under PREEMPT_RT if the usage count reaches zero.
Use put_task_struct_atomic_safe() instead to avoid this potential splat.
Signed-off-by: Wander Lairson Costa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c437260e19b0..e5d0c66cb90c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2565,7 +2565,7 @@ int push_cpu_stop(void *arg)
raw_spin_rq_unlock(rq);
raw_spin_unlock_irq(&p->pi_lock);
- put_task_struct(p);
+ put_task_struct_atomic_safe(p);
return 0;
}
--
2.39.0
On 20/01/23 12:02, Wander Lairson Costa wrote:
> put_task_struct() decrements a usage counter and calls
> __put_task_struct() if the counter reaches zero.
>
> __put_task_struct() indirectly acquires a spinlock, which is a sleeping
> lock under PREEMPT_RT. Therefore, we can't call put_task_struct() in an
> atomic context in RT kernels.
>
> This patch series introduces put_task_struct_atomic_safe(), which defers
> the call to __put_task_struct() to a process context when compiled with
> PREEMPT_RT.
>
> It also fixes known problematic call sites.
>
Browsing around put_task_struct() callsites gives me the impression there
are more problematic call sites lurking around, which makes me wonder:
should we make the PREEMPT_RT put_task_struct() *always* be done via
call_rcu()?
The task's stack is actually always freed that way in put_task_stack(), cf.
e540bf3162e8 ("fork: Only cache the VMAP stack in finish_task_switch()")
> Changelog:
> ==========
>
> v2:
> * Add the put_task_struct_atomic_safe() function that is responsible for
> handling the conditions to call put_task_struct().
> * Replace put_task_struct() by put_task_struct_atomic_safe() in known
> atomic call sites.
>
> Wander Lairson Costa (4):
> sched/task: Add the put_task_struct_atomic_safe function
> sched/deadline: fix inactive_task_timer splat
> sched/rt: use put_task_struct_atomic_safe() to avoid potential splat
> sched/core: use put_task_struct_atomic_safe() to avoid potential splat
>
> include/linux/sched/task.h | 21 +++++++++++++++++++++
> kernel/fork.c | 8 ++++++++
> kernel/sched/core.c | 2 +-
> kernel/sched/deadline.c | 2 +-
> kernel/sched/rt.c | 4 ++--
> 5 files changed, 33 insertions(+), 4 deletions(-)
>
> --
> 2.39.0
On Fri, Jan 20, 2023 at 2:45 PM Valentin Schneider <[email protected]> wrote:
>
> On 20/01/23 at 12:02, Wander Lairson Costa wrote:
> > put_task_struct() decrements a usage counter and calls
> > __put_task_struct() if the counter reaches zero.
> >
> > __put_task_struct() indirectly acquires a spinlock, which is a sleeping
> > lock under PREEMPT_RT. Therefore, we can't call put_task_struct() in an
> > atomic context in RT kernels.
> >
> > This patch series introduces put_task_struct_atomic_safe(), which defers
> > the call to __put_task_struct() to a process context when compiled with
> > PREEMPT_RT.
> >
> > It also fixes known problematic call sites.
> >
>
> Browsing around put_task_struct() callsites gives me the impression there
> are more problematic call sites lurking around, which makes me wonder:
> should we make the PREEMPT_RT put_task_struct() *always* be done via
> call_rcu()?
>
I thought about going down this route, but I was concerned about the
performance side effects this approach could bring. Another idea I had
was to check at runtime if we are in a preemptible context. Again,
this would have a (minor?) performance penalty.
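Roughly along these lines (untested sketch; preemptible() is just my
assumption of how the runtime check could be done):

	static inline void put_task_struct(struct task_struct *t)
	{
		if (refcount_dec_and_test(&t->usage)) {
			if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
				/* can't free synchronously here, defer to RCU */
				call_rcu(&t->rcu, __delayed_put_task_struct);
			else
				__put_task_struct(t);
		}
	}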
> The task's stack is actually always freed that way in put_task_stack(), cf.
>
> e540bf3162e8 ("fork: Only cache the VMAP stack in finish_task_switch()")
>
> > Changelog:
> > ==========
> >
> > v2:
> > * Add the put_task_struct_atomic_safe() function that is responsible for
> > handling the conditions to call put_task_struct().
> > * Replace put_task_struct() by put_task_struct_atomic_safe() in known
> > atomic call sites.
> >
> > Wander Lairson Costa (4):
> > sched/task: Add the put_task_struct_atomic_safe function
> > sched/deadline: fix inactive_task_timer splat
> > sched/rt: use put_task_struct_atomic_safe() to avoid potential splat
> > sched/core: use put_task_struct_atomic_safe() to avoid potential splat
> >
> > include/linux/sched/task.h | 21 +++++++++++++++++++++
> > kernel/fork.c | 8 ++++++++
> > kernel/sched/core.c | 2 +-
> > kernel/sched/deadline.c | 2 +-
> > kernel/sched/rt.c | 4 ++--
> > 5 files changed, 33 insertions(+), 4 deletions(-)
> >
> > --
> > 2.39.0
>
On 01/20, Wander Lairson Costa wrote:
>
> +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> +{
> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> + /*
> + * Decrement the refcount explicitly to avoid unnecessarily
> + * calling call_rcu.
> + */
> + if (refcount_dec_and_test(&task->usage))
> + /*
> + * under PREEMPT_RT, we can't call put_task_struct
> + * in atomic context because it will indirectly
> + * acquire sleeping locks.
> + */
> + call_rcu(&task->rcu, __delayed_put_task_struct);
^^^^^^^^^
I am not sure the usage of task->rcu is safe...
Suppose that, before __delayed_put_task_struct() is called by RCU, this task
does the last schedule and calls put_task_struct_rcu_user().
And, can't we simply turn put_task_struct() into something like
put_task_struct(struct task_struct *t)
{
if (refcount_dec_and_test(&t->usage)) {
if (IS_ENABLED(CONFIG_PREEMPT_RT)
&& (in_atomic() || irqs_disabled()))
call_rcu(...);
else
__put_task_struct(t);
}
}
?
Oleg.
On 01/23, Oleg Nesterov wrote:
>
> On 01/20, Wander Lairson Costa wrote:
> >
> > +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> > +{
> > + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > + /*
> > + * Decrement the refcount explicitly to avoid unnecessarily
> > + * calling call_rcu.
> > + */
> > + if (refcount_dec_and_test(&task->usage))
> > + /*
> > + * under PREEMPT_RT, we can't call put_task_struct
> > + * in atomic context because it will indirectly
> > + * acquire sleeping locks.
> > + */
> > + call_rcu(&task->rcu, __delayed_put_task_struct);
> ^^^^^^^^^
> I am not sure the usage of task->rcu is safe...
>
> Suppose that, before __delayed_put_task_struct() is called by RCU, this task
> does the last schedule and calls put_task_struct_rcu_user().
Ah, sorry, please forget, rcu_users != 0 implies task->usage != 0.
> And, can't we simply turn put_task_struct() into something like
>
> put_task_struct(struct task_struct *t)
> {
> if (refcount_dec_and_test(&t->usage)) {
> if (IS_ENABLED(CONFIG_PREEMPT_RT)
> && (in_atomic() || irqs_disabled()))
> call_rcu(...);
> else
> __put_task_struct(t);
> }
> }
>
> ?
>
> Oleg.
On Mon, Jan 23, 2023 at 1:30 PM Oleg Nesterov <[email protected]> wrote:
>
> On 01/20, Wander Lairson Costa wrote:
> >
> > +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> > +{
> > + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > + /*
> > + * Decrement the refcount explicitly to avoid unnecessarily
> > + * calling call_rcu.
> > + */
> > + if (refcount_dec_and_test(&task->usage))
> > + /*
> > + * under PREEMPT_RT, we can't call put_task_struct
> > + * in atomic context because it will indirectly
> > + * acquire sleeping locks.
> > + */
> > + call_rcu(&task->rcu, __delayed_put_task_struct);
> ^^^^^^^^^
> I am not sure the usage of task->rcu is safe...
>
> Suppose that, before __delayed_put_task_struct() is called by RCU, this task
> does the last schedule and calls put_task_struct_rcu_user().
>
> And, can't we simply turn put_task_struct() into something like
>
> put_task_struct(struct task_struct *t)
> {
> if (refcount_dec_and_test(&t->usage)) {
> if (IS_ENABLED(CONFIG_PREEMPT_RT)
> && (in_atomic() || irqs_disabled()))
> call_rcu(...);
> else
> __put_task_struct(t);
> }
> }
>
> ?
Yeah, that was one approach I thought about. I chose to use an
explicit function because I assumed calling __put_task_struct() from a
non-preemptable context should be the exception, not the rule.
Therefore (if I am correct in my assumption), it would make sense for
only some call sites to pay the overhead price for it. But this is
just a guess, and I have no evidence to support my claim.
On Fri, 20 Jan 2023 12:02:41 -0300
Wander Lairson Costa <[email protected]> wrote:
> rto_push_irq_work_func() is called in hardirq context, and it calls
> push_rt_task(), which calls put_task_struct().
>
> If the kernel is compiled with PREEMPT_RT and the usage count in
> put_task_struct() reaches zero, it triggers a splat because
> __put_task_struct() indirectly acquires sleeping locks.
> 
> The put_task_struct() call pairs with an earlier get_task_struct(),
> which makes the probability of the usage count reaching zero pretty
> low. In any case, let's play it safe and use the atomic-safe version.
>
> Signed-off-by: Wander Lairson Costa <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> ---
> kernel/sched/rt.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
For what it's worth:
Reviewed-by: Steven Rostedt (Google) <[email protected]>
-- Steve
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index ed2a47e4ddae..30a4e9607bec 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2147,7 +2147,7 @@ static int push_rt_task(struct rq *rq, bool pull)
> /*
> * Something has shifted, try again.
> */
> - put_task_struct(next_task);
> + put_task_struct_atomic_safe(next_task);
> next_task = task;
> goto retry;
> }
> @@ -2160,7 +2160,7 @@ static int push_rt_task(struct rq *rq, bool pull)
>
> double_unlock_balance(rq, lowest_rq);
> out:
> - put_task_struct(next_task);
> + put_task_struct_atomic_safe(next_task);
>
> return ret;
> }
On 23/01/23 14:24, Wander Lairson Costa wrote:
> On Mon, Jan 23, 2023 at 1:30 PM Oleg Nesterov <[email protected]> wrote:
>>
>> On 01/20, Wander Lairson Costa wrote:
>> >
>> > +static inline void put_task_struct_atomic_safe(struct task_struct *task)
>> > +{
>> > + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>> > + /*
>> > + * Decrement the refcount explicitly to avoid unnecessarily
>> > + * calling call_rcu.
>> > + */
>> > + if (refcount_dec_and_test(&task->usage))
>> > + /*
>> > + * under PREEMPT_RT, we can't call put_task_struct
>> > + * in atomic context because it will indirectly
>> > + * acquire sleeping locks.
>> > + */
>> > + call_rcu(&task->rcu, __delayed_put_task_struct);
>> ^^^^^^^^^
>> I am not sure the usage of task->rcu is safe...
>>
>> Suppose that, before __delayed_put_task_struct() is called by RCU, this task
>> does the last schedule and calls put_task_struct_rcu_user().
>>
>> And, can't we simply turn put_task_struct() into something like
>>
>> put_task_struct(struct task_struct *t)
>> {
>> if (refcount_dec_and_test(&t->usage)) {
>> if (IS_ENABLED(CONFIG_PREEMPT_RT)
>> && (in_atomic() || irqs_disabled()))
>> call_rcu(...);
>> else
>> __put_task_struct(t);
>> }
>> }
>>
>> ?
>
> Yeah, that was one approach I thought about. I chose to use an
> explicit function because I assumed calling __put_task_struct() from a
> non-preemptable context should be the exception, not the rule.
I'd tend to agree.
> Therefore (if I am correct in my assumption), it would make sense for
> only some call sites to pay the overhead price for it. But this is
> just a guess, and I have no evidence to support my claim.
My worry here is that it's easy to miss problematic callgraphs, and it's
potentially easy for new ones to creep in. Having a solution within
put_task_struct() itself would prevent that.
Another thing, if you look at release_task_stack(), it either caches the
outgoing stack for later use, or frees it via RCU (regardless of
PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
of the task struct to RCU?
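i.e., very roughly (untested sketch of the unconditional variant,
reusing the __delayed_put_task_struct() callback from patch 1):

	static inline void put_task_struct(struct task_struct *t)
	{
		if (refcount_dec_and_test(&t->usage))
			/* always defer the actual free to an RCU callback */
			call_rcu(&t->rcu, __delayed_put_task_struct);
	}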
On Fri, Jan 27, 2023 at 12:55 PM Valentin Schneider <[email protected]> wrote:
>
> On 23/01/23 14:24, Wander Lairson Costa wrote:
> > On Mon, Jan 23, 2023 at 1:30 PM Oleg Nesterov <[email protected]> wrote:
> >>
> >> On 01/20, Wander Lairson Costa wrote:
> >> >
> >> > +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> >> > +{
> >> > + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> >> > + /*
> >> > + * Decrement the refcount explicitly to avoid unnecessarily
> >> > + * calling call_rcu.
> >> > + */
> >> > + if (refcount_dec_and_test(&task->usage))
> >> > + /*
> >> > + * under PREEMPT_RT, we can't call put_task_struct
> >> > + * in atomic context because it will indirectly
> >> > + * acquire sleeping locks.
> >> > + */
> >> > + call_rcu(&task->rcu, __delayed_put_task_struct);
> >> ^^^^^^^^^
> >> I am not sure the usage of task->rcu is safe...
> >>
> >> Suppose that, before __delayed_put_task_struct() is called by RCU, this task
> >> does the last schedule and calls put_task_struct_rcu_user().
> >>
> >> And, can't we simply turn put_task_struct() into something like
> >>
> >> put_task_struct(struct task_struct *t)
> >> {
> >> if (refcount_dec_and_test(&t->usage)) {
> >> if (IS_ENABLED(CONFIG_PREEMPT_RT)
> >> && (in_atomic() || irqs_disabled()))
> >> call_rcu(...);
> >> else
> >> __put_task_struct(t);
> >> }
> >> }
> >>
> >> ?
> >
> > Yeah, that was one approach I thought about. I chose to use an
> > explicit function because I assumed calling __put_task_struct() from a
> > non-preemptable context should be the exception, not the rule.
>
> I'd tend to agree.
>
> > Therefore (if I am correct in my assumption), it would make sense for
> > only some call sites to pay the overhead price for it. But this is
> > just a guess, and I have no evidence to support my claim.
>
> My worry here is that it's easy to miss problematic callgraphs, and it's
> potentially easy for new ones to creep in. Having a solution within
> put_task_struct() itself would prevent that.
>
We could add a WARN_ON statement in put_task_struct() to detect such cases.
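For example, something along these lines (untested; the
in_atomic()/irqs_disabled() test is just an assumption of how the
misuse could be detected, mirroring Oleg's suggestion):

	static inline void put_task_struct(struct task_struct *t)
	{
		/* flag misuse even when the refcount doesn't drop to zero */
		WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT) &&
			     (in_atomic() || irqs_disabled()));

		if (refcount_dec_and_test(&t->usage))
			__put_task_struct(t);
	}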
> Another thing, if you look at release_task_stack(), it either caches the
> outgoing stack for later use, or frees it via RCU (regardless of
> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
> of the task struct to RCU?
>
That's a point. Do you mean doing that even for !PREEMPT_RT?
On 30/01/23 08:49, Wander Lairson Costa wrote:
> On Fri, Jan 27, 2023 at 12:55 PM Valentin Schneider <[email protected]> wrote:
>>
>> On 23/01/23 14:24, Wander Lairson Costa wrote:
>> > Therefore (if I am correct in my assumption), it would make sense for
>> > only some call sites to pay the overhead price for it. But this is
>> > just a guess, and I have no evidence to support my claim.
>>
>> My worry here is that it's easy to miss problematic callgraphs, and it's
>> potentially easy for new ones to creep in. Having a solution within
>> put_task_struct() itself would prevent that.
>>
>
> We could add a WARN_ON statement in put_task_struct() to detect such cases.
>
Anyone running their kernel with DEBUG_ATOMIC_SLEEP should be able to
detect misuse, but it doesn't change that some callgraphs will only
materialize under certain hardware/configuration combos.
>> Another thing, if you look at release_task_stack(), it either caches the
>> outgoing stack for later use, or frees it via RCU (regardless of
>> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
>> of the task struct to RCU?
>>
>
> That's a point. Do you mean doing that even for !PREEMPT_RT?
Could be worth a try? I think because of the cache thing the task stack is
a bit less aggressive wrt RCU callback processing, but at a quick glance I
don't see any fundamental reason why the task_struct itself can't be given
the same treatment.
On Mon, Jan 30, 2023 at 11:47 AM Valentin Schneider <[email protected]> wrote:
>
> On 30/01/23 08:49, Wander Lairson Costa wrote:
> > On Fri, Jan 27, 2023 at 12:55 PM Valentin Schneider <[email protected]> wrote:
> >>
> >> On 23/01/23 14:24, Wander Lairson Costa wrote:
> >> > Therefore (if I am correct in my assumption), it would make sense for
> >> > only some call sites to pay the overhead price for it. But this is
> >> > just a guess, and I have no evidence to support my claim.
> >>
> >> My worry here is that it's easy to miss problematic callgraphs, and it's
> >> potentially easy for new ones to creep in. Having a solution within
> >> put_task_struct() itself would prevent that.
> >>
> >
> > We could add a WARN_ON statement in put_task_struct() to detect such cases.
> >
>
> Anyone running their kernel with DEBUG_ATOMIC_SLEEP should be able to
> detect misuse, but it doesn't change that some callgraphs will only
> materialize under certain hardware/configuration combos.
>
If we put a WARN_ON in put_task_struct(), we also catch the cases
where the reference count doesn't reach zero.
> >> Another thing, if you look at release_task_stack(), it either caches the
> >> outgoing stack for later use, or frees it via RCU (regardless of
> >> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
> >> of the task struct to RCU?
> >>
> >
> > That's a point. Do you mean doing that even for !PREEMPT_RT?
>
> Could be worth a try?
Sure. But I would do it only for PREEMPT_RT.
> I think because of the cache thing the task stack is
> a bit less aggressive wrt RCU callback processing, but at a quick glance I
> don't see any fundamental reason why the task_struct itself can't be given
> the same treatment.
>
Any idea about tests to catch performance regressions?
On 30/01/23 11:58, Wander Lairson Costa wrote:
> On Mon, Jan 30, 2023 at 11:47 AM Valentin Schneider <[email protected]> wrote:
>>
>> On 30/01/23 08:49, Wander Lairson Costa wrote:
>> > On Fri, Jan 27, 2023 at 12:55 PM Valentin Schneider <[email protected]> wrote:
>> >>
>> >> On 23/01/23 14:24, Wander Lairson Costa wrote:
>> >> > Therefore (if I am correct in my assumption), it would make sense for
>> >> > only some call sites to pay the overhead price for it. But this is
>> >> > just a guess, and I have no evidence to support my claim.
>> >>
>> >> My worry here is that it's easy to miss problematic callgraphs, and it's
>> >> potentially easy for new ones to creep in. Having a solution within
>> >> put_task_struct() itself would prevent that.
>> >>
>> >
>> > We could add a WARN_ON statement in put_task_struct() to detect such cases.
>> >
>>
>> Anyone running their kernel with DEBUG_ATOMIC_SLEEP should be able to
>> detect misuse, but it doesn't change that some callgraphs will only
>> materialize under certain hardware/configuration combos.
>>
>
> If we put a WARN_ON in put_task_struct(), we also catch the cases
> where the reference count doesn't reach zero.
>
True, that'd be an improvement.
>> >> Another thing, if you look at release_task_stack(), it either caches the
>> >> outgoing stack for later use, or frees it via RCU (regardless of
>> >> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
>> >> of the task struct to RCU?
>> >>
>> >
>> > That's a point. Do you mean doing that even for !PREEMPT_RT?
>>
>> Could be worth a try?
>
> Sure. But I would do it only for PREEMPT_RT.
>
>> I think because of the cache thing the task stack is
>> a bit less aggressive wrt RCU callback processing, but at a quick glance I
>> don't see any fundamental reason why the task_struct itself can't be given
>> the same treatment.
>>
>
> Any idea about tests to catch performance regressions?
>
I would wager anything fork-heavy with short-lived tasks, say loops of
short hackbench runs; I believe stress-ng also has a fork test case.
On 1/30/23 08:49, Wander Lairson Costa wrote:
>> Another thing, if you look at release_task_stack(), it either caches the
>> outgoing stack for later use, or frees it via RCU (regardless of
>> PREEMPT_RT). Perhaps we could follow that and just always punt the freeing
>> of the task struct to RCU?
>>
> That's a point. Do you mean doing that even for !PREEMPT_RT?
>
It seems that Luca Abeni (in Cc) is hitting the bug in the non-rt kernel
with SCHED_DEADLINE...
adding him to the thread.
-- Daniel
Hi all,
On Fri, 17 Feb 2023 14:35:22 -0300
Daniel Bristot de Oliveira <[email protected]> wrote:
> On 1/30/23 08:49, Wander Lairson Costa wrote:
> >> Another thing, if you look at release_task_stack(), it either
> >> caches the outgoing stack for later use, or frees it via RCU
> >> (regardless of PREEMPT_RT). Perhaps we could follow that and just
> >> always punt the freeing of the task struct to RCU?
> >>
> > That's a point. Do you mean doing that even for !PREEMPT_RT?
> >
>
> It seems that Luca Abeni (in Cc) is hitting the bug in the non-rt
> kernel with SCHED_DEADLINE...
>
> adding him to the thread.
Thanks Daniel for adding me.
I triggered this "BUG: Invalid wait context" with a 5.15.76 kernel,
without PREEMPT_RT. I can reproduce it easily on a KVM-based VM; if
more information or tests are needed, let me know.
Luca
On Fri, Feb 17, 2023 at 08:04:37PM +0100, luca abeni wrote:
> Hi all,
>
> On Fri, 17 Feb 2023 14:35:22 -0300
> Daniel Bristot de Oliveira <[email protected]> wrote:
>
> > On 1/30/23 08:49, Wander Lairson Costa wrote:
> > >> Another thing, if you look at release_task_stack(), it either
> > >> caches the outgoing stack for later use, or frees it via RCU
> > >> (regardless of PREEMPT_RT). Perhaps we could follow that and just
> > >> always punt the freeing of the task struct to RCU?
> > >>
> > > That's a point. Do you mean doing that even for !PREEMPT_RT?
> > >
> >
> > It seems that Luca Abeni (in Cc) is hitting the bug in the non-rt
> > kernel with SCHED_DEADLINE...
> >
> > adding him to the thread.
>
> Thanks Daniel for adding me.
>
> I triggered this "BUG: Invalid wait context" with a 5.15.76 kernel,
> without PREEMPT_RT. I can reproduce it easily on a KVM-based VM; if
> more information or tests are needed, let me know.
>
Does it happen in linux-6.1 or linux-6.2?
>
> Luca
>
On Wed, 22 Feb 2023 15:42:37 -0300
Wander Lairson Costa <[email protected]> wrote:
[...]
> > I triggered this "BUG: Invalid wait context" with a 5.15.76 kernel,
> > without PREEMPT_RT. I can reproduce it easily on a KVM-based VM; if
> > more information or tests are needed, let me know.
> >
>
> Does it happen in linux-6.1 or linux-6.2?
I only tried with 5.15.76... I am going to test 6.2 and I'll let you
know ASAP.
Luca
On Wed, 22 Feb 2023 22:00:34 +0100
luca abeni <[email protected]> wrote:
> On Wed, 22 Feb 2023 15:42:37 -0300
> Wander Lairson Costa <[email protected]> wrote:
> [...]
> > > I triggered this "BUG: Invalid wait context" with a 5.15.76
> > > kernel, without PREEMPT_RT. I can reproduce it easily on a
> > > KVM-based VM; if more information or tests are needed, let me
> > > know.
> >
> > Does it happen in linux-6.1 or linux-6.2?
>
> I only tried with 5.15.76... I am going to test 6.2 and I'll let you
> know ASAP.
For various reasons it took more time than expected, but I managed to
reproduce the bug with 6.2... Here are the relevant kernel messages:
[ 1246.556100] =============================
[ 1246.559104] [ BUG: Invalid wait context ]
[ 1246.562270] 6.2.0 #4 Not tainted
[ 1246.564854] -----------------------------
[ 1246.567260] swapper/3/0 is trying to lock:
[ 1246.568665] ffff8c2c7ebb2c10 (&c->lock){..-.}-{3:3}, at: put_cpu_partial+0x24/0x1c0
[ 1246.571325] other info that might help us debug this:
[ 1246.573045] context-{2:2}
[ 1246.574166] no locks held by swapper/3/0.
[ 1246.575434] stack backtrace:
[ 1246.576207] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.2.0 #4
[ 1246.578184] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 1246.580815] Call Trace:
[ 1246.581723] <IRQ>
[ 1246.582570] dump_stack_lvl+0x49/0x61
[ 1246.583860] __lock_acquire.cold+0xc8/0x31c
[ 1246.584923] ? __lock_acquire+0x3be/0x1df0
[ 1246.585915] lock_acquire+0xce/0x2f0
[ 1246.586819] ? put_cpu_partial+0x24/0x1c0
[ 1246.588177] ? lock_is_held_type+0xdb/0x130
[ 1246.589519] put_cpu_partial+0x5b/0x1c0
[ 1246.590996] ? put_cpu_partial+0x24/0x1c0
[ 1246.592212] inactive_task_timer+0x263/0x4c0
[ 1246.593509] ? __pfx_inactive_task_timer+0x10/0x10
[ 1246.594953] __hrtimer_run_queues+0x1bf/0x470
[ 1246.596297] hrtimer_interrupt+0x117/0x250
[ 1246.597528] __sysvec_apic_timer_interrupt+0x99/0x270
[ 1246.599015] sysvec_apic_timer_interrupt+0x8d/0xc0
[ 1246.600416] </IRQ>
[ 1246.601170] <TASK>
[ 1246.601918] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 1246.603377] RIP: 0010:default_idle+0xf/0x20
[ 1246.604640] Code: f6 5d 41 5c e9 72 4a 6e ff cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 03 52 2a 00 fb f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[ 1246.609718] RSP: 0018:ffffa1a2c009bed0 EFLAGS: 00000202
[ 1246.611259] RAX: ffffffffa4961a60 RBX: ffff8c2c4126b000 RCX: 0000000000000000
[ 1246.613230] RDX: 0000000000000000 RSI: ffffffffa510271b RDI: ffffffffa50d5b15
[ 1246.615266] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000001
[ 1246.617275] R10: 0000000000000000 R11: ffff8c2c4126b000 R12: ffff8c2c4126b000
[ 1246.619318] R13: ffff8c2c4126b000 R14: 0000000000000000 R15: 0000000000000000
[ 1246.621293] ? __pfx_default_idle+0x10/0x10
[ 1246.622581] default_idle_call+0x71/0x220
[ 1246.623790] do_idle+0x210/0x290
[ 1246.624827] cpu_startup_entry+0x18/0x20
[ 1246.626016] start_secondary+0xf1/0x100
[ 1246.627200] secondary_startup_64_no_verify+0xe0/0xeb
[ 1246.628707] </TASK>
Let me know if you need more information, or
I should run other tests/experiments.
Luca
On Fri, Feb 24, 2023 at 09:46:48AM +0100, luca abeni wrote:
> On Wed, 22 Feb 2023 22:00:34 +0100
> luca abeni <[email protected]> wrote:
>
> > On Wed, 22 Feb 2023 15:42:37 -0300
> > Wander Lairson Costa <[email protected]> wrote:
> > [...]
> > > > I triggered this "BUG: Invalid wait context" with a 5.15.76
> > > > kernel, without PREEMPT_RT. I can reproduce it easily on a
> > > > KVM-based VM; if more information or tests are needed, let me
> > > > know.
> > >
> > > Does it happen in linux-6.1 or linux-6.2?
> >
> > I only tried with 5.15.76... I am going to test 6.2 and I'll let you
> > know ASAP.
>
> For various reasons it took more time than expected, but I managed to
> reproduce the bug with 6.2... Here are the relevant kernel messages:
>
> [ 1246.556100] =============================
> [ 1246.559104] [ BUG: Invalid wait context ]
> [ 1246.562270] 6.2.0 #4 Not tainted
> [ 1246.564854] -----------------------------
> [ 1246.567260] swapper/3/0 is trying to lock:
> [ 1246.568665] ffff8c2c7ebb2c10 (&c->lock){..-.}-{3:3}, at: put_cpu_partial+0x24/0x1c0
> [ 1246.571325] other info that might help us debug this:
> [ 1246.573045] context-{2:2}
> [ 1246.574166] no locks held by swapper/3/0.
> [ 1246.575434] stack backtrace:
> [ 1246.576207] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.2.0 #4
> [ 1246.578184] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> [ 1246.580815] Call Trace:
> [ 1246.581723] <IRQ>
> [ 1246.582570] dump_stack_lvl+0x49/0x61
> [ 1246.583860] __lock_acquire.cold+0xc8/0x31c
> [ 1246.584923] ? __lock_acquire+0x3be/0x1df0
> [ 1246.585915] lock_acquire+0xce/0x2f0
> [ 1246.586819] ? put_cpu_partial+0x24/0x1c0
> [ 1246.588177] ? lock_is_held_type+0xdb/0x130
> [ 1246.589519] put_cpu_partial+0x5b/0x1c0
> [ 1246.590996] ? put_cpu_partial+0x24/0x1c0
> [ 1246.592212] inactive_task_timer+0x263/0x4c0
> [ 1246.593509] ? __pfx_inactive_task_timer+0x10/0x10
> [ 1246.594953] __hrtimer_run_queues+0x1bf/0x470
> [ 1246.596297] hrtimer_interrupt+0x117/0x250
> [ 1246.597528] __sysvec_apic_timer_interrupt+0x99/0x270
> [ 1246.599015] sysvec_apic_timer_interrupt+0x8d/0xc0
> [ 1246.600416] </IRQ>
> [ 1246.601170] <TASK>
> [ 1246.601918] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 1246.603377] RIP: 0010:default_idle+0xf/0x20
> [ 1246.604640] Code: f6 5d 41 5c e9 72 4a 6e ff cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 03 52 2a 00 fb f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> [ 1246.609718] RSP: 0018:ffffa1a2c009bed0 EFLAGS: 00000202
> [ 1246.611259] RAX: ffffffffa4961a60 RBX: ffff8c2c4126b000 RCX: 0000000000000000
> [ 1246.613230] RDX: 0000000000000000 RSI: ffffffffa510271b RDI: ffffffffa50d5b15
> [ 1246.615266] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000001
> [ 1246.617275] R10: 0000000000000000 R11: ffff8c2c4126b000 R12: ffff8c2c4126b000
> [ 1246.619318] R13: ffff8c2c4126b000 R14: 0000000000000000 R15: 0000000000000000
> [ 1246.621293] ? __pfx_default_idle+0x10/0x10
> [ 1246.622581] default_idle_call+0x71/0x220
> [ 1246.623790] do_idle+0x210/0x290
> [ 1246.624827] cpu_startup_entry+0x18/0x20
> [ 1246.626016] start_secondary+0xf1/0x100
> [ 1246.627200] secondary_startup_64_no_verify+0xe0/0xeb
> [ 1246.628707] </TASK>
>
>
> Let me know if you need more information, or
> I should run other tests/experiments.
>
This seems to be a different (maybe related?) issue. Would you mind
sharing your .config and steps to reproduce it?
>
> Luca
>
On Fri, 24 Feb 2023 10:02:40 -0300
Wander Lairson Costa <[email protected]> wrote:
[...]
> > [ 1246.556100] =============================
> > [ 1246.559104] [ BUG: Invalid wait context ]
> > [ 1246.562270] 6.2.0 #4 Not tainted
> > [ 1246.564854] -----------------------------
> > [ 1246.567260] swapper/3/0 is trying to lock:
> > [ 1246.568665] ffff8c2c7ebb2c10 (&c->lock){..-.}-{3:3}, at: put_cpu_partial+0x24/0x1c0
> > [ 1246.571325] other info that might help us debug this:
> > [ 1246.573045] context-{2:2}
> > [ 1246.574166] no locks held by swapper/3/0.
> > [ 1246.575434] stack backtrace:
> > [ 1246.576207] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.2.0 #4
> > [ 1246.578184] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> > [ 1246.580815] Call Trace:
> > [ 1246.581723] <IRQ>
> > [ 1246.582570] dump_stack_lvl+0x49/0x61
> > [ 1246.583860] __lock_acquire.cold+0xc8/0x31c
> > [ 1246.584923] ? __lock_acquire+0x3be/0x1df0
> > [ 1246.585915] lock_acquire+0xce/0x2f0
> > [ 1246.586819] ? put_cpu_partial+0x24/0x1c0
> > [ 1246.588177] ? lock_is_held_type+0xdb/0x130
> > [ 1246.589519] put_cpu_partial+0x5b/0x1c0
> > [ 1246.590996] ? put_cpu_partial+0x24/0x1c0
> > [ 1246.592212] inactive_task_timer+0x263/0x4c0
> > [ 1246.593509] ? __pfx_inactive_task_timer+0x10/0x10
> > [ 1246.594953] __hrtimer_run_queues+0x1bf/0x470
> > [ 1246.596297] hrtimer_interrupt+0x117/0x250
> > [ 1246.597528] __sysvec_apic_timer_interrupt+0x99/0x270
> > [ 1246.599015] sysvec_apic_timer_interrupt+0x8d/0xc0
> > [ 1246.600416] </IRQ>
> > [ 1246.601170] <TASK>
> > [ 1246.601918] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> > [ 1246.603377] RIP: 0010:default_idle+0xf/0x20
> > [ 1246.604640] Code: f6 5d 41 5c e9 72 4a 6e ff cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 03 52 2a 00 fb f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> > [ 1246.609718] RSP: 0018:ffffa1a2c009bed0 EFLAGS: 00000202
> > [ 1246.611259] RAX: ffffffffa4961a60 RBX: ffff8c2c4126b000 RCX: 0000000000000000
> > [ 1246.613230] RDX: 0000000000000000 RSI: ffffffffa510271b RDI: ffffffffa50d5b15
> > [ 1246.615266] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000001
> > [ 1246.617275] R10: 0000000000000000 R11: ffff8c2c4126b000 R12: ffff8c2c4126b000
> > [ 1246.619318] R13: ffff8c2c4126b000 R14: 0000000000000000 R15: 0000000000000000
> > [ 1246.621293] ? __pfx_default_idle+0x10/0x10
> > [ 1246.622581] default_idle_call+0x71/0x220
> > [ 1246.623790] do_idle+0x210/0x290
> > [ 1246.624827] cpu_startup_entry+0x18/0x20
> > [ 1246.626016] start_secondary+0xf1/0x100
> > [ 1246.627200] secondary_startup_64_no_verify+0xe0/0xeb
> > [ 1246.628707] </TASK>
> >
> >
> > Let me know if you need more information, or
> > I should run other tests/experiments.
> >
>
> This seems to be a different (maybe related?) issue. Would you mind
> sharing your .config and steps to reproduce it?
Ah, sorry then... I probably misunderstood the kernel messages
(in my understanding, this is lockdep complaining because
put_task_struct() - which can take a sleeping lock - is invoked
from a timer callback).
Anyway, I attach the config (it is basically a "make defconfig;
make kvm_guest.config" with some debug options manually enabled - I
think the relevant one is CONFIG_PROVE_RAW_LOCK_NESTING)
Luca