The put_task_struct() function decrements a usage counter and invokes
__put_task_struct() when the counter reaches zero.
__put_task_struct() indirectly acquires a spinlock, which is a sleeping
lock under the PREEMPT_RT configuration. As a result, put_task_struct()
cannot safely be called from atomic context on real-time (RT) kernels.
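For reference, a simplified sketch of the existing helper and of where the
sleeping lock comes in (paraphrased, not verbatim kernel code):

        static inline void put_task_struct(struct task_struct *t)
        {
                /* The final reference drop frees the task struct... */
                if (refcount_dec_and_test(&t->usage))
                        /*
                         * ...and __put_task_struct() ends up in
                         * kmem_cache_free() and friends, which take
                         * spinlock_t locks -- sleeping locks on PREEMPT_RT.
                         */
                        __put_task_struct(t);
        }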
One practical example is a splat inside inactive_task_timer(), which is
called in an interrupt context:
CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
Call Trace:
dump_stack_lvl+0x57/0x7d
mark_lock_irq.cold+0x33/0xba
? stack_trace_save+0x4b/0x70
? save_trace+0x55/0x150
mark_lock+0x1e7/0x400
mark_usage+0x11d/0x140
__lock_acquire+0x30d/0x930
lock_acquire.part.0+0x9c/0x210
? refill_obj_stock+0x3d/0x3a0
? rcu_read_lock_sched_held+0x3f/0x70
? trace_lock_acquire+0x38/0x140
? lock_acquire+0x30/0x80
? refill_obj_stock+0x3d/0x3a0
rt_spin_lock+0x27/0xe0
? refill_obj_stock+0x3d/0x3a0
refill_obj_stock+0x3d/0x3a0
? inactive_task_timer+0x1ad/0x340
kmem_cache_free+0x357/0x560
inactive_task_timer+0x1ad/0x340
? switched_from_dl+0x2d0/0x2d0
__run_hrtimer+0x8a/0x1a0
__hrtimer_run_queues+0x91/0x130
hrtimer_interrupt+0x10f/0x220
__sysvec_apic_timer_interrupt+0x7b/0xd0
sysvec_apic_timer_interrupt+0x4f/0xd0
? asm_sysvec_apic_timer_interrupt+0xa/0x20
asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0033:0x7fff196bf6f5
To address this issue, this patch series introduces a new function
called put_task_struct_atomic_safe(). When compiled with the
PREEMPT_RT configuration, this function defers the call to
__put_task_struct() to a process context.
Additionally, the patch series fixes the known problematic call sites.
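As an illustration, a minimal sketch of what a converted call site looks
like, simplified from the inactive_task_timer() change later in this series:

        static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
        {
                struct sched_dl_entity *dl_se = container_of(timer,
                                struct sched_dl_entity, inactive_timer);
                struct task_struct *p = dl_task_of(dl_se);

                /* ... timer handling, running in hard interrupt context ... */

                /*
                 * Drop the reference taken when the timer was armed. On
                 * PREEMPT_RT this defers __put_task_struct() to process
                 * context instead of freeing the task here, in atomic context.
                 */
                put_task_struct_atomic_safe(p);
                return HRTIMER_NORESTART;
        }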
Changelog
=========
v1:
* Initial implementation fixing the splat.
v2:
* Isolate the logic in its own function.
* Fix two more cases caught in review.
v3:
* Change __put_task_struct() to handle the issue internally.
v4:
* Explain why call_rcu() is safe to call from interrupt context.
v5:
* Explain why __put_task_struct() doesn't conflict with
put_task_struct_rcu_user().
v6:
* As per Sebastian's review, revert to the v2 implementation with a
distinct function.
* Add a check in put_task_struct() to warn when it is called from a
non-sleepable context.
* Address more call sites.
v7:
* Fix typos.
* Add an explanation why the new function doesn't conflict with
delayed_free_task().
Wander Lairson Costa (3):
sched/core: warn on call put_task_struct in invalid context
sched/task: Add the put_task_struct_atomic_safe() function
treewide: replace put_task_struct() with the atomic safe version
include/linux/sched/task.h | 49 ++++++++++++++++++++++++++++++++++++++
kernel/events/core.c | 6 ++---
kernel/fork.c | 8 +++++++
kernel/locking/rtmutex.c | 10 ++++----
kernel/sched/core.c | 6 ++---
kernel/sched/deadline.c | 16 ++++++-------
kernel/sched/rt.c | 4 ++--
7 files changed, 78 insertions(+), 21 deletions(-)
--
2.40.0
Under PREEMPT_RT, spinlocks become sleepable locks. put_task_struct()
indirectly acquires a spinlock. Therefore, it can't be called in
atomic/interrupt context in RT kernels.
To help catch such conditions, add a check for atomic/interrupt context
to put_task_struct().
Signed-off-by: Wander Lairson Costa <[email protected]>
Suggested-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/sched/task.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 357e0068497c..b597b97b1f8f 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -113,14 +113,28 @@ static inline struct task_struct *get_task_struct(struct task_struct *t)
extern void __put_task_struct(struct task_struct *t);
+#define PUT_TASK_RESCHED_OFFSETS \
+ (rcu_preempt_depth() << MIGHT_RESCHED_RCU_SHIFT)
+
+#define __put_task_might_resched() \
+ __might_resched(__FILE__, __LINE__, PUT_TASK_RESCHED_OFFSETS)
+
+#define put_task_might_resched() \
+ do { \
+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) \
+ __put_task_might_resched(); \
+ } while (0)
+
static inline void put_task_struct(struct task_struct *t)
{
+ put_task_might_resched();
if (refcount_dec_and_test(&t->usage))
__put_task_struct(t);
}
static inline void put_task_struct_many(struct task_struct *t, int nr)
{
+ put_task_might_resched();
if (refcount_sub_and_test(nr, &t->usage))
__put_task_struct(t);
}
--
2.40.0
In places where put_task_struct() is called in a non-sleepable context,
replace those calls with put_task_struct_atomic_safe().
These call sites were found by running internal regression tests and
looking for warnings generated by put_task_might_resched().
Signed-off-by: Wander Lairson Costa <[email protected]>
Cc: Valentin Schneider <[email protected]>
---
kernel/events/core.c | 6 +++---
kernel/locking/rtmutex.c | 10 +++++-----
kernel/sched/core.c | 6 +++---
kernel/sched/deadline.c | 16 ++++++++--------
kernel/sched/rt.c | 4 ++--
5 files changed, 21 insertions(+), 21 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 435815d3be3f..8f823da02324 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1181,7 +1181,7 @@ static void put_ctx(struct perf_event_context *ctx)
if (ctx->parent_ctx)
put_ctx(ctx->parent_ctx);
if (ctx->task && ctx->task != TASK_TOMBSTONE)
- put_task_struct(ctx->task);
+ put_task_struct_atomic_safe(ctx->task);
call_rcu(&ctx->rcu_head, free_ctx);
}
}
@@ -13019,7 +13019,7 @@ static void perf_event_exit_task_context(struct task_struct *child)
RCU_INIT_POINTER(child->perf_event_ctxp, NULL);
put_ctx(child_ctx); /* cannot be last */
WRITE_ONCE(child_ctx->task, TASK_TOMBSTONE);
- put_task_struct(current); /* cannot be last */
+ put_task_struct_atomic_safe(current); /* cannot be last */
clone_ctx = unclone_ctx(child_ctx);
raw_spin_unlock_irq(&child_ctx->lock);
@@ -13124,7 +13124,7 @@ void perf_event_free_task(struct task_struct *task)
*/
RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
- put_task_struct(task); /* cannot be last */
+ put_task_struct_atomic_safe(task); /* cannot be last */
raw_spin_unlock_irq(&ctx->lock);
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 728f434de2bb..3ecb8d6ae039 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -509,7 +509,7 @@ static __always_inline void rt_mutex_wake_up_q(struct rt_wake_q_head *wqh)
{
if (IS_ENABLED(CONFIG_PREEMPT_RT) && wqh->rtlock_task) {
wake_up_state(wqh->rtlock_task, TASK_RTLOCK_WAIT);
- put_task_struct(wqh->rtlock_task);
+ put_task_struct_atomic_safe(wqh->rtlock_task);
wqh->rtlock_task = NULL;
}
@@ -649,7 +649,7 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
"task: %s (%d)\n", max_lock_depth,
top_task->comm, task_pid_nr(top_task));
}
- put_task_struct(task);
+ put_task_struct_atomic_safe(task);
return -EDEADLK;
}
@@ -817,7 +817,7 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
* No requeue[7] here. Just release @task [8]
*/
raw_spin_unlock(&task->pi_lock);
- put_task_struct(task);
+ put_task_struct_atomic_safe(task);
/*
* [9] check_exit_conditions_3 protected by lock->wait_lock.
@@ -886,7 +886,7 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
/* [8] Release the task */
raw_spin_unlock(&task->pi_lock);
- put_task_struct(task);
+ put_task_struct_atomic_safe(task);
/*
* [9] check_exit_conditions_3 protected by lock->wait_lock.
@@ -990,7 +990,7 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
out_unlock_pi:
raw_spin_unlock_irq(&task->pi_lock);
out_put_task:
- put_task_struct(task);
+ put_task_struct_atomic_safe(task);
return ret;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0d18c3969f90..a4783f0c9f01 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1007,7 +1007,7 @@ void wake_up_q(struct wake_q_head *head)
* the queueing in wake_q_add() so as not to miss wakeups.
*/
wake_up_process(task);
- put_task_struct(task);
+ put_task_struct_atomic_safe(task);
}
}
@@ -2528,7 +2528,7 @@ int push_cpu_stop(void *arg)
raw_spin_rq_unlock(rq);
raw_spin_unlock_irq(&p->pi_lock);
- put_task_struct(p);
+ put_task_struct_atomic_safe(p);
return 0;
}
@@ -9316,7 +9316,7 @@ static int __balance_push_cpu_stop(void *arg)
rq_unlock(rq, &rf);
raw_spin_unlock_irq(&p->pi_lock);
- put_task_struct(p);
+ put_task_struct_atomic_safe(p);
return 0;
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 71b24371a6f7..0f8b8a490dc0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -327,7 +327,7 @@ static void dl_change_utilization(struct task_struct *p, u64 new_bw)
* so we are still safe.
*/
if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
- put_task_struct(p);
+ put_task_struct_atomic_safe(p);
}
__sub_rq_bw(p->dl.dl_bw, &rq->dl);
__add_rq_bw(new_bw, &rq->dl);
@@ -467,7 +467,7 @@ static void task_contending(struct sched_dl_entity *dl_se, int flags)
* so we are still safe.
*/
if (hrtimer_try_to_cancel(&dl_se->inactive_timer) == 1)
- put_task_struct(dl_task_of(dl_se));
+ put_task_struct_atomic_safe(dl_task_of(dl_se));
} else {
/*
* Since "dl_non_contending" is not set, the
@@ -1207,7 +1207,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
* This can free the task_struct, including this hrtimer, do not touch
* anything related to that after this.
*/
- put_task_struct(p);
+ put_task_struct_atomic_safe(p);
return HRTIMER_NORESTART;
}
@@ -1442,7 +1442,7 @@ static enum hrtimer_restart inactive_task_timer(struct hrtimer *timer)
dl_se->dl_non_contending = 0;
unlock:
task_rq_unlock(rq, p, &rf);
- put_task_struct(p);
+ put_task_struct_atomic_safe(p);
return HRTIMER_NORESTART;
}
@@ -1899,7 +1899,7 @@ static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused
* so we are still safe.
*/
if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
- put_task_struct(p);
+ put_task_struct_atomic_safe(p);
}
sub_rq_bw(&p->dl, &rq->dl);
rq_unlock(rq, &rf);
@@ -2351,7 +2351,7 @@ static int push_dl_task(struct rq *rq)
/* No more tasks */
goto out;
- put_task_struct(next_task);
+ put_task_struct_atomic_safe(next_task);
next_task = task;
goto retry;
}
@@ -2366,7 +2366,7 @@ static int push_dl_task(struct rq *rq)
double_unlock_balance(rq, later_rq);
out:
- put_task_struct(next_task);
+ put_task_struct_atomic_safe(next_task);
return ret;
}
@@ -2633,7 +2633,7 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
static void switched_to_dl(struct rq *rq, struct task_struct *p)
{
if (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
- put_task_struct(p);
+ put_task_struct_atomic_safe(p);
/* If p is not queued we will update its parameters at next wakeup. */
if (!task_on_rq_queued(p)) {
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0a11f44adee5..e58a84535f61 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2150,7 +2150,7 @@ static int push_rt_task(struct rq *rq, bool pull)
/*
* Something has shifted, try again.
*/
- put_task_struct(next_task);
+ put_task_struct_atomic_safe(next_task);
next_task = task;
goto retry;
}
@@ -2163,7 +2163,7 @@ static int push_rt_task(struct rq *rq, bool pull)
double_unlock_balance(rq, lowest_rq);
out:
- put_task_struct(next_task);
+ put_task_struct_atomic_safe(next_task);
return ret;
}
--
2.40.0
Due to the possibility of indirectly acquiring sleeping locks, it is
unsafe to call put_task_struct() in atomic contexts when the kernel is
compiled with PREEMPT_RT.
To mitigate this issue, this commit introduces
put_task_struct_atomic_safe(), which schedules __put_task_struct()
through call_rcu() when PREEMPT_RT is enabled. While a workqueue would
be a more natural approach, we cannot allocate dynamic memory from
atomic context in PREEMPT_RT, which would make that approach more complex.
This implementation ensures safe execution in atomic contexts and avoids
the issues that would arise from using the non-atomic version there.
Signed-off-by: Wander Lairson Costa <[email protected]>
Reported-by: Hu Chunyu <[email protected]>
Reviewed-by: Paul McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
include/linux/sched/task.h | 35 +++++++++++++++++++++++++++++++++++
kernel/fork.c | 8 ++++++++
2 files changed, 43 insertions(+)
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index b597b97b1f8f..cf774b83b2ec 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -141,6 +141,41 @@ static inline void put_task_struct_many(struct task_struct *t, int nr)
void put_task_struct_rcu_user(struct task_struct *task);
+extern void __delayed_put_task_struct(struct rcu_head *rhp);
+
+static inline void put_task_struct_atomic_safe(struct task_struct *task)
+{
+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+ /*
+ * Decrement the refcount explicitly to avoid unnecessarily
+ * calling call_rcu.
+ */
+ if (refcount_dec_and_test(&task->usage))
+ /*
+ * under PREEMPT_RT, we can't call put_task_struct
+ * in atomic context because it will indirectly
+ * acquire sleeping locks.
+ * call_rcu() will schedule __delayed_put_task_struct()
+ * to be called in process context.
+ *
+ * __put_task_struct() is called when
+ * refcount_dec_and_test(&t->usage) succeeds.
+ *
+ * This means that it can't conflict with
+ * put_task_struct_rcu_user() which abuses ->rcu the same
+ * way; rcu_users has a reference so task->usage can't be
+ * zero after rcu_users 1 -> 0 transition.
+ *
+ * delayed_free_task() also uses ->rcu, but it is only called
+ * when it fails to fork a process. Therefore, there is no
+ * way it can conflict with put_task_struct().
+ */
+ call_rcu(&task->rcu, __delayed_put_task_struct);
+ } else {
+ put_task_struct(task);
+ }
+}
+
/* Free all architecture-specific resources held by a thread. */
void release_thread(struct task_struct *dead_task);
diff --git a/kernel/fork.c b/kernel/fork.c
index ea332319dffe..7f016b691b1d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -854,6 +854,14 @@ void __put_task_struct(struct task_struct *tsk)
}
EXPORT_SYMBOL_GPL(__put_task_struct);
+void __delayed_put_task_struct(struct rcu_head *rhp)
+{
+ struct task_struct *task = container_of(rhp, struct task_struct, rcu);
+
+ __put_task_struct(task);
+}
+EXPORT_SYMBOL_GPL(__delayed_put_task_struct);
+
void __init __weak arch_task_cache_init(void) { }
/*
--
2.40.0
On 25/04/23 08:43, Wander Lairson Costa wrote:
> The put_task_struct() function reduces a usage counter and invokes
> __put_task_struct() when the counter reaches zero.
>
> In the case of __put_task_struct(), it indirectly acquires a spinlock,
> which operates as a sleeping lock under the PREEMPT_RT configuration.
> As a result, invoking put_task_struct() within an atomic context is
> not feasible for real-time (RT) kernels.
>
> One practical example is a splat inside inactive_task_timer(), which is
> called in a interrupt context:
>
> CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
> Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
> Call Trace:
> dump_stack_lvl+0x57/0x7d
> mark_lock_irq.cold+0x33/0xba
> ? stack_trace_save+0x4b/0x70
> ? save_trace+0x55/0x150
> mark_lock+0x1e7/0x400
> mark_usage+0x11d/0x140
> __lock_acquire+0x30d/0x930
> lock_acquire.part.0+0x9c/0x210
> ? refill_obj_stock+0x3d/0x3a0
> ? rcu_read_lock_sched_held+0x3f/0x70
> ? trace_lock_acquire+0x38/0x140
> ? lock_acquire+0x30/0x80
> ? refill_obj_stock+0x3d/0x3a0
> rt_spin_lock+0x27/0xe0
> ? refill_obj_stock+0x3d/0x3a0
> refill_obj_stock+0x3d/0x3a0
> ? inactive_task_timer+0x1ad/0x340
> kmem_cache_free+0x357/0x560
> inactive_task_timer+0x1ad/0x340
> ? switched_from_dl+0x2d0/0x2d0
> __run_hrtimer+0x8a/0x1a0
> __hrtimer_run_queues+0x91/0x130
> hrtimer_interrupt+0x10f/0x220
> __sysvec_apic_timer_interrupt+0x7b/0xd0
> sysvec_apic_timer_interrupt+0x4f/0xd0
> ? asm_sysvec_apic_timer_interrupt+0xa/0x20
> asm_sysvec_apic_timer_interrupt+0x12/0x20
> RIP: 0033:0x7fff196bf6f5
>
> To address this issue, this patch series introduces a new function
> called put_task_struct_atomic_safe(). When compiled with the
> PREEMPT_RT configuration, this function defers the call to
> __put_task_struct() to a process context.
>
> Additionally, the patch series rectifies known problematic call sites
> to ensure smooth functioning.
>
It took me a bit of time to grok the put_task_struct_rcu_user() vs
delayed_free_task() vs put_task_struct_atomic_safe() situation, but other
than that the patches LGTM.
Reviewed-by: Valentin Schneider <[email protected]>
On 4/25/23 07:43, Wander Lairson Costa wrote:
> The put_task_struct() function reduces a usage counter and invokes
> __put_task_struct() when the counter reaches zero.
>
> In the case of __put_task_struct(), it indirectly acquires a spinlock,
> which operates as a sleeping lock under the PREEMPT_RT configuration.
> As a result, invoking put_task_struct() within an atomic context is
> not feasible for real-time (RT) kernels.
>
> One practical example is a splat inside inactive_task_timer(), which is
> called in a interrupt context:
>
> CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
> Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
> Call Trace:
> dump_stack_lvl+0x57/0x7d
> mark_lock_irq.cold+0x33/0xba
> ? stack_trace_save+0x4b/0x70
> ? save_trace+0x55/0x150
> mark_lock+0x1e7/0x400
> mark_usage+0x11d/0x140
> __lock_acquire+0x30d/0x930
> lock_acquire.part.0+0x9c/0x210
> ? refill_obj_stock+0x3d/0x3a0
> ? rcu_read_lock_sched_held+0x3f/0x70
> ? trace_lock_acquire+0x38/0x140
> ? lock_acquire+0x30/0x80
> ? refill_obj_stock+0x3d/0x3a0
> rt_spin_lock+0x27/0xe0
> ? refill_obj_stock+0x3d/0x3a0
> refill_obj_stock+0x3d/0x3a0
> ? inactive_task_timer+0x1ad/0x340
> kmem_cache_free+0x357/0x560
> inactive_task_timer+0x1ad/0x340
> ? switched_from_dl+0x2d0/0x2d0
> __run_hrtimer+0x8a/0x1a0
> __hrtimer_run_queues+0x91/0x130
> hrtimer_interrupt+0x10f/0x220
> __sysvec_apic_timer_interrupt+0x7b/0xd0
> sysvec_apic_timer_interrupt+0x4f/0xd0
> ? asm_sysvec_apic_timer_interrupt+0xa/0x20
> asm_sysvec_apic_timer_interrupt+0x12/0x20
> RIP: 0033:0x7fff196bf6f5
>
> To address this issue, this patch series introduces a new function
> called put_task_struct_atomic_safe(). When compiled with the
> PREEMPT_RT configuration, this function defers the call to
> __put_task_struct() to a process context.
>
> Additionally, the patch series rectifies known problematic call sites
> to ensure smooth functioning.
>
> Changelog
> =========
>
> v1:
> * Initial implementation fixing the splat.
>
> v2:
> * Isolate the logic in its own function.
> * Fix two more cases caught in review.
>
> v3:
> * Change __put_task_struct() to handle the issue internally.
>
> v4:
> * Explain why call_rcu() is safe to call from interrupt context.
>
> v5:
> * Explain why __put_task_struct() doesn't conflict with
> put_task_sruct_rcu_user.
>
> v6:
> * As per Sebastian's review, revert back the implementation of v2
> with a distinct function.
> * Add a check in put_task_struct() to warning when called from a
> non-sleepable context.
> * Address more call sites.
>
> v7:
> * Fix typos.
> * Add an explanation why the new function doesn't conflict with
> delayed_free_task().
>
> Wander Lairson Costa (3):
> sched/core: warn on call put_task_struct in invalid context
> sched/task: Add the put_task_struct_atomic_safe() function
> treewide: replace put_task_struct() with the atomic safe version
>
> include/linux/sched/task.h | 49 ++++++++++++++++++++++++++++++++++++++
> kernel/events/core.c | 6 ++---
> kernel/fork.c | 8 +++++++
> kernel/locking/rtmutex.c | 10 ++++----
> kernel/sched/core.c | 6 ++---
> kernel/sched/deadline.c | 16 ++++++-------
> kernel/sched/rt.c | 4 ++--
> 7 files changed, 78 insertions(+), 21 deletions(-)
This patch series looks good to me.
Acked-by: Waiman Long <[email protected]>
I notice that __put_task_struct() invokes quite a bit of cleanup work
from different subsystems, so it may burn quite a few CPU cycles to
complete. That may not be something we want in an atomic context; maybe
we should call call_rcu() irrespective of the PREEMPT_RT setting.
Anyway, this can be a follow-up patch if we want to do that.
Cheers,
Longman
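For illustration, a minimal sketch of that unconditional variant (not code
from this series), reusing task_struct::rcu the same way the series
already does:

        static inline void put_task_struct(struct task_struct *t)
        {
                /* Always defer the final free to process context via RCU. */
                if (refcount_dec_and_test(&t->usage))
                        call_rcu(&t->rcu, __delayed_put_task_struct);
        }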
On 2023-04-25 08:43:01 [-0300], Wander Lairson Costa wrote:
> Under PREEMPT_RT, spinlocks become sleepable locks. put_task_struct()
> indirectly acquires a spinlock. Therefore, it can't be called in
> atomic/interrupt context in RT kernels.
>
> To prevent such conditions, add a check for atomic/interrupt context
> before calling put_task_struct().
>
> Signed-off-by: Wander Lairson Costa <[email protected]>
> Suggested-by: Sebastian Andrzej Siewior <[email protected]>
Been only CCed here.
I asked not to special-case PREEMPT_RT but to do this (clean up via RCU)
unconditionally. I don't remember anyone saying "this is bad
because $reason".
Lockdep will complain about this on !RT.
The below open codes rtlock_might_resched() with no explanation of why
it works or where it comes from.
The function is named put_task_struct_atomic_safe(), yet it behaves
differently only on PREEMPT_RT; otherwise it remains put_task_struct().
Not good.
> ---
> include/linux/sched/task.h | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 357e0068497c..b597b97b1f8f 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -113,14 +113,28 @@ static inline struct task_struct *get_task_struct(struct task_struct *t)
>
> extern void __put_task_struct(struct task_struct *t);
>
> +#define PUT_TASK_RESCHED_OFFSETS \
> + (rcu_preempt_depth() << MIGHT_RESCHED_RCU_SHIFT)
> +
> +#define __put_task_might_resched() \
> + __might_resched(__FILE__, __LINE__, PUT_TASK_RESCHED_OFFSETS)
> +
> +#define put_task_might_resched() \
> + do { \
> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) \
> + __put_task_might_resched(); \
> + } while (0)
> +
> static inline void put_task_struct(struct task_struct *t)
> {
> + put_task_might_resched();
> if (refcount_dec_and_test(&t->usage))
> __put_task_struct(t);
> }
>
> static inline void put_task_struct_many(struct task_struct *t, int nr)
> {
> + put_task_might_resched();
> if (refcount_sub_and_test(nr, &t->usage))
> __put_task_struct(t);
> }
> --
> 2.40.0
>
Sebastian
On Fri, Apr 28, 2023 at 06:17:58PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-04-25 08:43:01 [-0300], Wander Lairson Costa wrote:
> > Under PREEMPT_RT, spinlocks become sleepable locks. put_task_struct()
> > indirectly acquires a spinlock. Therefore, it can't be called in
> > atomic/interrupt context in RT kernels.
> >
> > To prevent such conditions, add a check for atomic/interrupt context
> > before calling put_task_struct().
> >
> > Signed-off-by: Wander Lairson Costa <[email protected]>
> > Suggested-by: Sebastian Andrzej Siewior <[email protected]>
>
> Been only CCed here.
>
I relied on the get_maintainer script to generate the recipient list for
me. I will explicitly add you to the CC list next time.
> I asked to not special case PREEMPT_RT but doing this (clean up via RCU)
> unconditionally. I don't remember that someone said "this is a bad
> because $reason".
>
Yes, I can do it. Although I would prefer to do it in a follow up patch.
This way, if something goes wrong, it is easier to revert.
> Lockdep will complain about this on !RT.
>
In my tests, it didn't.
> The below open codes rtlock_might_resched() with no explanation on why
> it works or where it comes from.
>
I will add some comments on the next patch version.
> The function is named put_task_struct_atomic_safe() yet it behaves it
> differently on PREEMPT_RT otherwise it remains put_task_struct().
>
> Not good.
That's intentional. We only need to defer the cleanup under RT; !RT
doesn't need it. Anyway, in the end we will end up with an
unconditional call_rcu().
>
> > ---
> > include/linux/sched/task.h | 14 ++++++++++++++
> > 1 file changed, 14 insertions(+)
> >
> > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> > index 357e0068497c..b597b97b1f8f 100644
> > --- a/include/linux/sched/task.h
> > +++ b/include/linux/sched/task.h
> > @@ -113,14 +113,28 @@ static inline struct task_struct *get_task_struct(struct task_struct *t)
> >
> > extern void __put_task_struct(struct task_struct *t);
> >
> > +#define PUT_TASK_RESCHED_OFFSETS \
> > + (rcu_preempt_depth() << MIGHT_RESCHED_RCU_SHIFT)
> > +
> > +#define __put_task_might_resched() \
> > + __might_resched(__FILE__, __LINE__, PUT_TASK_RESCHED_OFFSETS)
> > +
> > +#define put_task_might_resched() \
> > + do { \
> > + if (IS_ENABLED(CONFIG_PREEMPT_RT)) \
> > + __put_task_might_resched(); \
> > + } while (0)
> > +
> > static inline void put_task_struct(struct task_struct *t)
> > {
> > + put_task_might_resched();
> > if (refcount_dec_and_test(&t->usage))
> > __put_task_struct(t);
> > }
> >
> > static inline void put_task_struct_many(struct task_struct *t, int nr)
> > {
> > + put_task_might_resched();
> > if (refcount_sub_and_test(nr, &t->usage))
> > __put_task_struct(t);
> > }
> > --
> > 2.40.0
> >
>
> Sebastian
>
On Tue, Apr 25, 2023 at 08:43:02AM -0300, Wander Lairson Costa wrote:
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index b597b97b1f8f..cf774b83b2ec 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -141,6 +141,41 @@ static inline void put_task_struct_many(struct task_struct *t, int nr)
>
> void put_task_struct_rcu_user(struct task_struct *task);
>
> +extern void __delayed_put_task_struct(struct rcu_head *rhp);
> +
> +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> +{
> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> + /*
> + * Decrement the refcount explicitly to avoid unnecessarily
> + * calling call_rcu.
> + */
> + if (refcount_dec_and_test(&task->usage))
> + /*
> + * under PREEMPT_RT, we can't call put_task_struct
> + * in atomic context because it will indirectly
> + * acquire sleeping locks.
> + * call_rcu() will schedule __delayed_put_task_struct()
> + * to be called in process context.
> + *
> + * __put_task_struct() is called when
> + * refcount_dec_and_test(&t->usage) succeeds.
> + *
> + * This means that it can't conflict with
> + * put_task_struct_rcu_user() which abuses ->rcu the same
> + * way; rcu_users has a reference so task->usage can't be
> + * zero after rcu_users 1 -> 0 transition.
> + *
> + * delayed_free_task() also uses ->rcu, but it is only called
> + * when it fails to fork a process. Therefore, there is no
> + * way it can conflict with put_task_struct().
> + */
> + call_rcu(&task->rcu, __delayed_put_task_struct);
> + } else {
> + put_task_struct(task);
> + }
> +}
Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
has already asked why can't we just rcu free the thing unconditionally.
Google only found me an earlier version of this same patch set, but I'm
sure we've had that discussion many times over the past several years.
The above and your follow-up patch are just horrible.
They require users to know the RT-specific context and give them no help
whatsoever for !RT builds.
On 04/05/23 10:42, Peter Zijlstra wrote:
> On Tue, Apr 25, 2023 at 08:43:02AM -0300, Wander Lairson Costa wrote:
>> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
>> index b597b97b1f8f..cf774b83b2ec 100644
>> --- a/include/linux/sched/task.h
>> +++ b/include/linux/sched/task.h
>> @@ -141,6 +141,41 @@ static inline void put_task_struct_many(struct task_struct *t, int nr)
>>
>> void put_task_struct_rcu_user(struct task_struct *task);
>>
>> +extern void __delayed_put_task_struct(struct rcu_head *rhp);
>> +
>> +static inline void put_task_struct_atomic_safe(struct task_struct *task)
>> +{
>> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>> + /*
>> + * Decrement the refcount explicitly to avoid unnecessarily
>> + * calling call_rcu.
>> + */
>> + if (refcount_dec_and_test(&task->usage))
>> + /*
>> + * under PREEMPT_RT, we can't call put_task_struct
>> + * in atomic context because it will indirectly
>> + * acquire sleeping locks.
>> + * call_rcu() will schedule __delayed_put_task_struct()
>> + * to be called in process context.
>> + *
>> + * __put_task_struct() is called when
>> + * refcount_dec_and_test(&t->usage) succeeds.
>> + *
>> + * This means that it can't conflict with
>> + * put_task_struct_rcu_user() which abuses ->rcu the same
>> + * way; rcu_users has a reference so task->usage can't be
>> + * zero after rcu_users 1 -> 0 transition.
>> + *
>> + * delayed_free_task() also uses ->rcu, but it is only called
>> + * when it fails to fork a process. Therefore, there is no
>> + * way it can conflict with put_task_struct().
>> + */
>> + call_rcu(&task->rcu, __delayed_put_task_struct);
>> + } else {
>> + put_task_struct(task);
>> + }
>> +}
>
> Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> has already asked why can't we just rcu free the thing unconditionally.
>
> Google only found me an earlier version of this same patch set, but I'm
> sure we've had that discussion many times over the past several years.
> The above and your follow up patch is just horrible.
>
So on v3/v4 we got to doing that unconditionally for PREEMPT_RT, but per
[1] Wander went back to hand-fixing the problematic callsites.
Now that I'm looking at it again, I couldn't find a concrete argument from
Oleg against doing this unconditionally - as Wander is pointing out in the
changelog and comments, reusing task_struct.rcu for that purpose is safe
(although not necessarily obviously so).
Is this just miscommunication, or is there a genuine issue with doing this
unconditionally? As argued before, I'd also much rather have this be an
unconditional call_rcu() (regardless of context or PREEMPT_RT).
On Thu, May 04, 2023 at 10:42:29AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 25, 2023 at 08:43:02AM -0300, Wander Lairson Costa wrote:
> > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> > index b597b97b1f8f..cf774b83b2ec 100644
> > --- a/include/linux/sched/task.h
> > +++ b/include/linux/sched/task.h
> > @@ -141,6 +141,41 @@ static inline void put_task_struct_many(struct task_struct *t, int nr)
> >
> > void put_task_struct_rcu_user(struct task_struct *task);
> >
> > +extern void __delayed_put_task_struct(struct rcu_head *rhp);
> > +
> > +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> > +{
> > + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > + /*
> > + * Decrement the refcount explicitly to avoid unnecessarily
> > + * calling call_rcu.
> > + */
> > + if (refcount_dec_and_test(&task->usage))
> > + /*
> > + * under PREEMPT_RT, we can't call put_task_struct
> > + * in atomic context because it will indirectly
> > + * acquire sleeping locks.
> > + * call_rcu() will schedule __delayed_put_task_struct()
> > + * to be called in process context.
> > + *
> > + * __put_task_struct() is called when
> > + * refcount_dec_and_test(&t->usage) succeeds.
> > + *
> > + * This means that it can't conflict with
> > + * put_task_struct_rcu_user() which abuses ->rcu the same
> > + * way; rcu_users has a reference so task->usage can't be
> > + * zero after rcu_users 1 -> 0 transition.
> > + *
> > + * delayed_free_task() also uses ->rcu, but it is only called
> > + * when it fails to fork a process. Therefore, there is no
> > + * way it can conflict with put_task_struct().
> > + */
> > + call_rcu(&task->rcu, __delayed_put_task_struct);
> > + } else {
> > + put_task_struct(task);
> > + }
> > +}
>
> Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> has already asked why can't we just rcu free the thing unconditionally.
>
> Google only found me an earlier version of this same patch set, but I'm
> sure we've had that discussion many times over the past several years.
> The above and your follow up patch is just horrible.
>
> It requires users to know the RT specific context and gives them no help
> what so ever for !RT builds.
>
>
No problem, I will send a new version doing it unconditionally.
On 05/04, Peter Zijlstra wrote:
>
> Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> has already asked why can't we just rcu free the thing unconditionally.
>
> Google only found me an earlier version of this same patch set, but I'm
> sure we've had that discussion many times over the past several years.
Yes... see for example
https://lore.kernel.org/lkml/CAHk-=whtj+aSYftniMRG4xvFE8dmmYyrqcJyPmzStsfj5w9r=w@mail.gmail.com/
We already have an rcu pass before put_task_struct(zombie), see
put_task_struct_rcu_user(); another one looks unfortunate.
Oleg.
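For context, the RCU pass mentioned above looks roughly like this upstream
(paraphrased from kernel/exit.c; details may differ):

        void put_task_struct_rcu_user(struct task_struct *task)
        {
                /* rcu_users is a separate refcount from task->usage. */
                if (refcount_dec_and_test(&task->rcu_users))
                        call_rcu(&task->rcu, delayed_put_task_struct);
        }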
On Thu, May 04, 2023 at 10:32:31AM +0100, Valentin Schneider wrote:
> On 04/05/23 10:42, Peter Zijlstra wrote:
> > On Tue, Apr 25, 2023 at 08:43:02AM -0300, Wander Lairson Costa wrote:
> >> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> >> index b597b97b1f8f..cf774b83b2ec 100644
> >> --- a/include/linux/sched/task.h
> >> +++ b/include/linux/sched/task.h
> >> @@ -141,6 +141,41 @@ static inline void put_task_struct_many(struct task_struct *t, int nr)
> >>
> >> void put_task_struct_rcu_user(struct task_struct *task);
> >>
> >> +extern void __delayed_put_task_struct(struct rcu_head *rhp);
> >> +
> >> +static inline void put_task_struct_atomic_safe(struct task_struct *task)
> >> +{
> >> + if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> >> + /*
> >> + * Decrement the refcount explicitly to avoid unnecessarily
> >> + * calling call_rcu.
> >> + */
> >> + if (refcount_dec_and_test(&task->usage))
> >> + /*
> >> + * under PREEMPT_RT, we can't call put_task_struct
> >> + * in atomic context because it will indirectly
> >> + * acquire sleeping locks.
> >> + * call_rcu() will schedule __delayed_put_task_struct()
> >> + * to be called in process context.
> >> + *
> >> + * __put_task_struct() is called when
> >> + * refcount_dec_and_test(&t->usage) succeeds.
> >> + *
> >> + * This means that it can't conflict with
> >> + * put_task_struct_rcu_user() which abuses ->rcu the same
> >> + * way; rcu_users has a reference so task->usage can't be
> >> + * zero after rcu_users 1 -> 0 transition.
> >> + *
> >> + * delayed_free_task() also uses ->rcu, but it is only called
> >> + * when it fails to fork a process. Therefore, there is no
> >> + * way it can conflict with put_task_struct().
> >> + */
> >> + call_rcu(&task->rcu, __delayed_put_task_struct);
> >> + } else {
> >> + put_task_struct(task);
> >> + }
> >> +}
> >
> > Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> > has already asked why can't we just rcu free the thing unconditionally.
> >
> > Google only found me an earlier version of this same patch set, but I'm
> > sure we've had that discussion many times over the past several years.
> > The above and your follow up patch is just horrible.
> >
>
> So on v3/v4 we got to doing that unconditionally for PREEMPT_RT, but per
> [1] Wander went back to hand-fixing the problematic callsites.
>
> Now that I'm looking at it again, I couldn't find a concrete argument from
> Oleg against doing this unconditionally - as Wander is pointing out in the
> changelog and comments, reusing task_struct.rcu for that purpose is safe
> (although not necessarily obviously so).
>
> Is this just miscommunication, or is there a genuine issue with doing this
> unconditionally? As argued before, I'd also much rather have this be an
> unconditional call_rcu() (regardless of context or PREEMPT_RT).
>
Yeah, I think it was a misunderstanding of mine.
On Thu, May 04, 2023 at 02:29:45PM +0200, Oleg Nesterov wrote:
> On 05/04, Peter Zijlstra wrote:
> >
> > Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> > has already asked why can't we just rcu free the thing unconditionally.
> >
> > Google only found me an earlier version of this same patch set, but I'm
> > sure we've had that discussion many times over the past several years.
>
> Yes... see for example
>
> https://lore.kernel.org/lkml/CAHk-=whtj+aSYftniMRG4xvFE8dmmYyrqcJyPmzStsfj5w9r=w@mail.gmail.com/
>
> We already have an rcu pass before put_task_struct(zombie), see
> put_task_struct_rcu_user(), another one look unfortunate.
Ah indeed, it got mentioned there as well. And Linus seems to be arguing
against doing an rcu free there. So humm..
Then I'm thinking something trivial like so:
static inline void put_task_struct(struct task_struct *t)
{
if (!refcount_dec_and_test(&t->usage))
return;
if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
call_rcu(&t->rcu, __put_task_struct_rcu);
__put_task_struct(t);
}
should do, or alternatively use irq_work, which has a much lower
latency, but meh..
On Thu, May 4, 2023 at 11:34 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, May 04, 2023 at 02:29:45PM +0200, Oleg Nesterov wrote:
> > On 05/04, Peter Zijlstra wrote:
> > >
> > > Urgh.. that's plenty horrible. And I'm sure everybody plus kitchen sink
> > > has already asked why can't we just rcu free the thing unconditionally.
> > >
> > > Google only found me an earlier version of this same patch set, but I'm
> > > sure we've had that discussion many times over the past several years.
> >
> > Yes... see for example
> >
> > https://lore.kernel.org/lkml/CAHk-=whtj+aSYftniMRG4xvFE8dmmYyrqcJyPmzStsfj5w9r=w@mail.gmail.com/
> >
> > We already have an rcu pass before put_task_struct(zombie), see
> > put_task_struct_rcu_user(), another one look unfortunate.
>
> Ah indeed, it got mentioned there as well. And Linus seems to be arguing
> against doing an rcu free there. So humm..
>
> Then I'm thinking something trivial like so:
>
> static inline void put_task_struct(struct task_struct *t)
> {
> if (!refcount_dec_and_test(&t->usage))
> return;
>
> if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
> call_rcu(&t->rcu, __put_task_struct_rcu);
>
> __put_task_struct(t);
> }
>
That's what v5 [1] does. What would be the path in this case? Should I
resend it as v8?
> should do, or alternatively use irq_work, which has a much lower
> latency, but meh..
>
Initially, I did that. I switched to call_rcu because the code got much simpler.
[1] https://lore.kernel.org/all/[email protected]/
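For reference, a rough sketch of what an irq_work based variant could look
like (hypothetical, not from any posted version; the put_work member added
to task_struct is an assumption):

        /* Sketch only: assumes <linux/irq_work.h> and a new task_struct member. */
        static void __put_task_struct_work(struct irq_work *work)
        {
                struct task_struct *t = container_of(work, struct task_struct,
                                                     put_work);

                __put_task_struct(t);
        }

        static inline void put_task_struct(struct task_struct *t)
        {
                if (!refcount_dec_and_test(&t->usage))
                        return;

                if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible()) {
                        /*
                         * On PREEMPT_RT, irq_work items not marked
                         * IRQ_WORK_HARD_IRQ run from the per-CPU irq_work
                         * kthread, i.e. in process context.
                         */
                        init_irq_work(&t->put_work, __put_task_struct_work);
                        irq_work_queue(&t->put_work);
                        return;
                }

                __put_task_struct(t);
        }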
On 05/04, Wander Lairson Costa wrote:
>
> On Thu, May 4, 2023 at 11:34 AM Peter Zijlstra <[email protected]> wrote:
> >
> > static inline void put_task_struct(struct task_struct *t)
> > {
> > if (!refcount_dec_and_test(&t->usage))
> > return;
> >
> > if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
> > call_rcu(&t->rcu, __put_task_struct_rcu);
> >
> > __put_task_struct(t);
> > }
> >
>
> That's what v5 [1] does.
Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
https://lore.kernel.org/all/Y+zFNrCjBn53%[email protected]/
Oleg.
On Thu, May 04, 2023 at 11:55:15AM -0300, Wander Lairson Costa wrote:
> > Then I'm thinking something trivial like so:
> >
> > static inline void put_task_struct(struct task_struct *t)
> > {
> > if (!refcount_dec_and_test(&t->usage))
> > return;
> >
> > if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
> > call_rcu(&t->rcu, __put_task_struct_rcu);
> >
> > __put_task_struct(t);
> > }
> >
>
> That's what v5 [1] does. What would be the path in this case? Should I
> resend it as v8?
It's almost what v5 does. v5 also has a !in_task() thing. v5 also
violates codingstyle :-)
But yeah.. something like that.
On Thu, May 04, 2023 at 05:23:07PM +0200, Oleg Nesterov wrote:
> On 05/04, Wander Lairson Costa wrote:
> >
> > On Thu, May 4, 2023 at 11:34 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > static inline void put_task_struct(struct task_struct *t)
> > > {
> > > if (!refcount_dec_and_test(&t->usage))
> > > return;
> > >
> > > if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
> > > call_rcu(&t->rcu, __put_task_struct_rcu);
> > >
> > > __put_task_struct(t);
> > > }
> > >
> >
> > That's what v5 [1] does.
>
> Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
This can help:
https://lkml.kernel.org/r/168303194177.404.8610123576035502891.tip-bot2@tip-bot2
On Thu, May 04, 2023 at 05:24:24PM +0200, Peter Zijlstra wrote:
> On Thu, May 04, 2023 at 11:55:15AM -0300, Wander Lairson Costa wrote:
>
> > > Then I'm thinking something trivial like so:
> > >
> > > static inline void put_task_struct(struct task_struct *t)
> > > {
> > > if (!refcount_dec_and_test(&t->usage))
> > > return;
> > >
> > > if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
> > > call_rcu(&t->rcu, __put_task_struct_rcu);
> > >
> > > __put_task_struct(t);
> > > }
> > >
> >
> > That's what v5 [1] does. What would be the path in this case? Should I
> > resend it as v8?
>
> It's almost what v5 does. v5 also has a !in_task() thing. v5 also
> violates codingstyle :-)
IIRC, the in_task() is there because preemptible() doesn't check if it
is running in interrupt context.
>
> But yeah.. something like that.
>
On Thu, May 4, 2023 at 12:23 PM Oleg Nesterov <[email protected]> wrote:
>
> On 05/04, Wander Lairson Costa wrote:
> >
> > On Thu, May 4, 2023 at 11:34 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > static inline void put_task_struct(struct task_struct *t)
> > > {
> > > if (!refcount_dec_and_test(&t->usage))
> > > return;
> > >
> > > if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
> > > call_rcu(&t->rcu, __put_task_struct_rcu);
> > >
> > > __put_task_struct(t);
> > > }
> > >
> >
> > That's what v5 [1] does.
>
> Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
>
> https://lore.kernel.org/all/Y+zFNrCjBn53%[email protected]/
>
I think that was my confusion in that thread. My understanding is that
CONFIG_PROVE_RAW_LOCK_NESTING will check lock ordering but not
context. I feel my ignorance is blocking me from seeing something that
is obvious to everyone else :/
> Oleg.
>
On Thu, May 4, 2023 at 4:23 PM Oleg Nesterov <[email protected]> wrote:
>
> On 05/04, Wander Lairson Costa wrote:
> >
> > On Thu, May 4, 2023 at 12:23 PM Oleg Nesterov <[email protected]> wrote:
> > >
> > > Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
> > >
> > > https://lore.kernel.org/all/Y+zFNrCjBn53%[email protected]/
> > >
> >
> > I think that was my confusion in that thread. My understanding is that
> > CONFIG_PROVE_RAW_LOCK_NESTING will check lock ordering but not
> > context.
>
> Sorry, I don't understand... perhaps I missed something. But iiuc
> the problem is simple.
>
> So, this code
>
> raw_spin_lock(one);
> spin_lock(two);
>
> is obviously wrong if CONFIG_PREEMPT_RT.
>
> Without PREEMPT_RT this code is fine because raw_spinlock_t and spinlock_t
> are the same thing. Except they have different lockdep annotations if
> CONFIG_PROVE_RAW_LOCK_NESTING is true, LD_WAIT_SPIN and LD_WAIT_CONFIG.
>
> So if CONFIG_PROVE_RAW_LOCK_NESTING is set, lockdep will complain even
> on the !PREEMPT_RT kernel, iow it checks the nesting as if the code runs
> on with PREEMPT_RT.
>
> Cough... not sure my explanation can help ;) It looks very confusing when
> I read it.
>
Thanks for the explanation. That's my understanding too. The part I
don't get is why this would fail with a call_rcu() inside
put_task_struct().
> Oleg.
>
On 05/04, Wander Lairson Costa wrote:
>
> On Thu, May 4, 2023 at 12:23 PM Oleg Nesterov <[email protected]> wrote:
> >
> > Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
> >
> > https://lore.kernel.org/all/Y+zFNrCjBn53%[email protected]/
> >
>
> I think that was my confusion in that thread. My understanding is that
> CONFIG_PROVE_RAW_LOCK_NESTING will check lock ordering but not
> context.
Sorry, I don't understand... perhaps I missed something. But iiuc
the problem is simple.
So, this code
raw_spin_lock(one);
spin_lock(two);
is obviously wrong if CONFIG_PREEMPT_RT.
Without PREEMPT_RT this code is fine because raw_spinlock_t and spinlock_t
are the same thing. Except they have different lockdep annotations if
CONFIG_PROVE_RAW_LOCK_NESTING is true, LD_WAIT_SPIN and LD_WAIT_CONFIG.
So if CONFIG_PROVE_RAW_LOCK_NESTING is set, lockdep will complain even
on the !PREEMPT_RT kernel; iow it checks the nesting as if the code ran
with PREEMPT_RT.
Cough... not sure my explanation can help ;) It looks very confusing when
I read it.
Oleg.
Hi Wander,
I certainly missed something ;) plus I am already sleeping, but let me
try to reply anyway.
On 05/04, Wander Lairson Costa wrote:
> On Thu, May 4, 2023 at 4:23 PM Oleg Nesterov <[email protected]> wrote:
> >
> > On 05/04, Wander Lairson Costa wrote:
> > >
> > > On Thu, May 4, 2023 at 12:23 PM Oleg Nesterov <[email protected]> wrote:
> > > >
> > > > Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
> > > >
> > > > https://lore.kernel.org/all/Y+zFNrCjBn53%[email protected]/
> > > >
> > >
> > > I think that was my confusion in that thread. My understanding is that
> > > CONFIG_PROVE_RAW_LOCK_NESTING will check lock ordering but not
> > > context.
> >
> > Sorry, I don't understand... perhaps I missed something. But iiuc
> > the problem is simple.
> >
> > So, this code
> >
> > raw_spin_lock(one);
> > spin_lock(two);
> >
> > is obviously wrong if CONFIG_PREEMPT_RT.
> >
> > Without PREEMPT_RT this code is fine because raw_spinlock_t and spinlock_t
> > are the same thing. Except they have different lockdep annotations if
> > CONFIG_PROVE_RAW_LOCK_NESTING is true, LD_WAIT_SPIN and LD_WAIT_CONFIG.
> >
> > So if CONFIG_PROVE_RAW_LOCK_NESTING is set, lockdep will complain even
> > on the !PREEMPT_RT kernel, iow it checks the nesting as if the code runs
> > on with PREEMPT_RT.
> >
> > Cough... not sure my explanation can help ;) It looks very confusing when
> > I read it.
> >
>
> Thanks for the explanation. That's my understanding too. The part I
> don't get is why this would fail with a call_rcu() inside
> put_task_struct().
the problem is that call_rcu() won't be called if !IS_ENABLED(PREEMPT_RT);
__put_task_struct() will be called directly instead.
CONFIG_PROVE_RAW_LOCK_NESTING can't know that this path can't happen if
PREEMPT_RT is set.
IOW. To simplify, suppose we have
// can be called in atomic context, e.g. under
// raw_spin_lock() so it is wrong with PREEMPT_RT
void __put_task_struct(struct task_struct *tsk)
{
spin_lock(some_lock);
}
lets "fix" the code above, lets change __put_task_struct,
void __put_task_struct(struct task_struct *tsk)
{
        if (IS_ENABLED(CONFIG_PREEMPT_RT))
                return;

        spin_lock(some_lock);
}
Now, if CONFIG_PREEMPT_RT is true then __put_task_struct() is fine
wrt lock nesting.
But, if CONFIG_PREEMPT_RT is not set, then __put_task_struct() still
does the same:
void __put_task_struct(struct task_struct *tsk)
{
spin_lock(some_lock);
}
and CONFIG_PROVE_RAW_LOCK_NESTING will complain. Because, once again,
it checks the nesting as if CONFIG_PREEMPT_RT is true, and in that case
__put_task_struct() takes a sleeping lock while called under raw_spin_lock().
Oleg.
On Thu, May 04, 2023 at 05:30:57PM +0200, Peter Zijlstra wrote:
> On Thu, May 04, 2023 at 05:23:07PM +0200, Oleg Nesterov wrote:
> > Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
>
> This can help:
>
> https://lkml.kernel.org/r/168303194177.404.8610123576035502891.tip-bot2@tip-bot2
Explicitly:
static inline void put_task_struct(struct task_struct *t)
{
if (!refcount_dec_and_test(&t->usage))
return;
if (!IS_ENABLED(CONFIG_PREEMPT_RT) || preemptible()) {
/*
* ... same comment as the other patch ...
*/
static DEFINE_WAIT_OVERRIDE_MAP(put_task_map, LD_WAIT_SLEEP);
lock_map_acquire_try(&put_task_map);
__put_task_struct(t);
lock_map_release(&put_task_map);
return;
}
call_rcu(&t->rcu, __put_task_struct_rcu);
}
Should not complain since we tell it to STFU :-)
On Thu, May 04, 2023 at 03:21:11PM -0300, Wander Lairson Costa wrote:
> On Thu, May 04, 2023 at 05:24:24PM +0200, Peter Zijlstra wrote:
> > On Thu, May 04, 2023 at 11:55:15AM -0300, Wander Lairson Costa wrote:
> >
> > > > Then I'm thinking something trivial like so:
> > > >
> > > > static inline void put_task_struct(struct task_struct *t)
> > > > {
> > > > if (!refcount_dec_and_test(&t->usage))
> > > > return;
> > > >
> > > > if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
> > > > call_rcu(&t->rcu, __put_task_struct_rcu);
> > > >
> > > > __put_task_struct(t);
> > > > }
> > > >
> > >
> > > That's what v5 [1] does. What would be the path in this case? Should I
> > > resend it as v8?
> >
> > It's almost what v5 does. v5 also has a !in_task() thing. v5 also
> > violates codingstyle :-)
>
> IIRC, the in_task() is there because preemptible() doesn't check if it
> is running in interrupt context.
#define preemptible() (preempt_count() == 0 && !irqs_disabled())
When in interrupt context preempt_count() will have a non-zero value in
HARDIRQ_MASK and IRQs must be disabled, so preemptible() evaluates to
(false && false), last time I checked that ends up being false.
On Fri, 5 May 2023 15:32:35 +0200
Peter Zijlstra <[email protected]> wrote:
> On Thu, May 04, 2023 at 03:21:11PM -0300, Wander Lairson Costa wrote:
> > On Thu, May 04, 2023 at 05:24:24PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 04, 2023 at 11:55:15AM -0300, Wander Lairson Costa wrote:
> > >
> > > > > Then I'm thinking something trivial like so:
> > > > >
> > > > > static inline void put_task_struct(struct task_struct *t)
> > > > > {
> > > > > if (!refcount_dec_and_test(&t->usage))
> > > > > return;
> > > > >
> > > > > if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
> > > > > call_rcu(&t->rcu, __put_task_struct_rcu);
> > > > >
> > > > > __put_task_struct(t);
> > > > > }
> > > > >
> > > >
> > > > That's what v5 [1] does. What would be the path in this case? Should I
> > > > resend it as v8?
> > >
> > > It's almost what v5 does. v5 also has a !in_task() thing. v5 also
> > > violates codingstyle :-)
> >
> > IIRC, the in_task() is there because preemptible() doesn't check if it
> > is running in interrupt context.
>
> #define preemptible() (preempt_count() == 0 && !irqs_disabled())
>
> When in interrupt context preempt_count() will have a non-zero value in
> HARDIRQ_MASK and IRQs must be disabled, so preemptible() evaluates to
> (false && false), last time I checked that ends up being false.
Interesting, I can't find v5 anywhere in my mail folders (but I have
v4 and v6!). Anyway, from just the context of this email, and seeing
IS_ENABLED(CONFIG_PREEMPT_RT), I'm guessing that in_task() returns false if
it's running in an interrupt thread, where preemptible() does not.
-- Steve
On Fri, 5 May 2023 10:26:02 -0400
Steven Rostedt <[email protected]> wrote:
> > > IIRC, the in_task() is there because preemptible() doesn't check if it
> > > is running in interrupt context.
> >
> > #define preemptible() (preempt_count() == 0 && !irqs_disabled())
> >
> > When in interrupt context preempt_count() will have a non-zero value in
> > HARDIRQ_MASK and IRQs must be disabled, so preemptible() evaluates to
> > (false && false), last time I checked that ends up being false.
>
> Interesting, I can't find v5 anywhere in my mail folders (but I have
> v4 and v6!). Anyway, from just the context of this email, and seeing
> IS_ENABLED(CONFIG_PREEMPT_RT), I'm guessing that in_task() returns false if
> it's running in a interrupt thread, where preemtible() does not.
But then I question, does it matter if it is running in an interrupt thread
or not for put_task_struct()?
-- Steve
On Thu, May 4, 2023 at 5:16 PM Oleg Nesterov <[email protected]> wrote:
>
> Hi Wander,
>
> I certainly missed something ;) plus I am already sleeping. but let me try to
> reply anyway.
>
> On 05/04, Wander Lairson Costa wrote:
> > On Thu, May 4, 2023 at 4:23 PM Oleg Nesterov <[email protected]> wrote:
> > >
> > > On 05/04, Wander Lairson Costa wrote:
> > > >
> > > > On Thu, May 4, 2023 at 12:23 PM Oleg Nesterov <[email protected]> wrote:
> > > > >
> > > > > Yes, but as Sebastian explained CONFIG_PROVE_RAW_LOCK_NESTING won't like it.
> > > > >
> > > > > https://lore.kernel.org/all/Y+zFNrCjBn53%[email protected]/
> > > > >
> > > >
> > > > I think that was my confusion in that thread. My understanding is that
> > > > CONFIG_PROVE_RAW_LOCK_NESTING will check lock ordering but not
> > > > context.
> > >
> > > Sorry, I don't understand... perhaps I missed something. But iiuc
> > > the problem is simple.
> > >
> > > So, this code
> > >
> > > raw_spin_lock(one);
> > > spin_lock(two);
> > >
> > > is obviously wrong if CONFIG_PREEMPT_RT.
> > >
> > > Without PREEMPT_RT this code is fine because raw_spinlock_t and spinlock_t
> > > are the same thing. Except they have different lockdep annotations if
> > > CONFIG_PROVE_RAW_LOCK_NESTING is true, LD_WAIT_SPIN and LD_WAIT_CONFIG.
> > >
> > > So if CONFIG_PROVE_RAW_LOCK_NESTING is set, lockdep will complain even
> > > on the !PREEMPT_RT kernel, iow it checks the nesting as if the code runs
> > > on with PREEMPT_RT.
> > >
> > > Cough... not sure my explanation can help ;) It looks very confusing when
> > > I read it.
> > >
> >
> > Thanks for the explanation. That's my understanding too. The part I
> > don't get is why this would fail with a call_rcu() inside
> > put_task_struct().
>
> the problem is that call_rcu() won't be called if !IS_ENABLED(PREEMPT_RT),
> ___put_task_struct() will be called.
>
> CONFIG_PROVE_RAW_LOCK_NESTING can't know this can't happen if PREEMPT_RT
> is set.
>
> IOW. To simplify, suppose we have
>
> // can be called in atomic context, e.g. under
> // raw_spin_lock() so it is wrong with PREEMPT_RT
> void __put_task_struct(struct task_struct *tsk)
> {
> spin_lock(some_lock);
> }
>
> lets "fix" the code above, lets change __put_task_struct,
>
> void __put_task_struct(struct task_struct *tsk)
> {
> if (IS_ENABLED(CONFIG_PREEMPT_RT))
> return;
>
> spin_lock(some_lock);
> }
>
> Now, if CONFIG_PREEMPT_RT is true then __put_task_struct() is fine
> wrt lock nesting.
>
> But, if CONFIG_PREEMPT_RT is not set, then __put_task_struct() still
> does the same:
>
> void __put_task_struct(struct task_struct *tsk)
> {
> spin_lock(some_lock);
> }
>
> and CONFIG_PROVE_RAW_LOCK_NESTING will complain. Because, once again,
> it checks the nesting as if CONFIG_PREEMPT_RT is true, and in this case
> __put_task_struct() if it is called under raw_spin_lock().
>
IIUC, this is a problem with the current implementation, isn't it?
Because the holder of raw_spin_lock is the put_task_struct() caller.
> Oleg.
>
On Fri, May 5, 2023 at 10:33 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, May 04, 2023 at 03:21:11PM -0300, Wander Lairson Costa wrote:
> > On Thu, May 04, 2023 at 05:24:24PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 04, 2023 at 11:55:15AM -0300, Wander Lairson Costa wrote:
> > >
> > > > > Then I'm thinking something trivial like so:
> > > > >
> > > > > static inline void put_task_struct(struct task_struct *t)
> > > > > {
> > > > > if (!refcount_dec_and_test(&t->usage))
> > > > > return;
> > > > >
> > > > > if (IS_ENABLED(CONFIG_PREEMPT_RT) && !preemptible())
> > > > > call_rcu(&t->rcu, __put_task_struct_rcu);
> > > > >
> > > > > __put_task_struct(t);
> > > > > }
> > > > >
> > > >
> > > > That's what v5 [1] does. What would be the path in this case? Should I
> > > > resend it as v8?
> > >
> > > It's almost what v5 does. v5 also has a !in_task() thing. v5 also
> > > violates codingstyle :-)
> >
> > IIRC, the in_task() is there because preemptible() doesn't check if it
> > is running in interrupt context.
>
> #define preemptible() (preempt_count() == 0 && !irqs_disabled())
>
> When in interrupt context preempt_count() will have a non-zero value in
> HARDIRQ_MASK and IRQs must be disabled, so preemptible() evaluates to
> (false && false), last time I checked that ends up being false.
>
When I first tested, I started with in_irq(), then I saw some warnings
and added preemptible().
Never tested with preemptible() alone. Thanks, I will change that and test it.