Subject: [PATCH v2 0/4] locking/rtmutex: Avoid overwriting pi_blocked_on while invoking blk_flush_plug().

Hi,

Crystal Wood reported that task_struct::pi_blocked_on can be mistakenly
overwritten, as follows:
rt_mutex_slowlock()
- task_blocks_on_rt_mutex()
  - current->pi_blocked_on = waiter;
- rt_mutex_slowlock_block()
  - schedule()
    - sched_submit_work()
      - blk_flush_plug()
        - *any* RT sleeping lock used by the plug
          - rtlock_slowlock_locked()
            - task_blocks_on_rt_mutex()
              - current->pi_blocked_on = waiter;   <-- XXX

The conditions to trigger this are:
- I/O queued,
- lock contention on a sleeping lock (a mutex_t),
- lock contention while flushing the queued I/O (in blk_flush_plug(), a
  spin_lock_t on PREEMPT_RT).

Later in review tglx pointed out that any function invoked from
sched_submit_work() is affected, so the problem is not limited to
blk_flush_plug().
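
For reference, sched_submit_work() currently boils down to roughly the
following, so any of these callees taking an rtmutex-based lock is affected:

  sched_submit_work()
    wq_worker_sleeping() / io_wq_worker_sleeping()   (PF_WQ_WORKER / PF_IO_WORKER)
    blk_flush_plug(tsk->plug, true)                  (plugged I/O)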

This series addresses the problem by
- exporting sched_submit_work(),
- invoking sched_submit_work() only once it is clear that the slow path
  is needed,
- invoking schedule_rtmutex() while blocking on the lock, which contains
  only the schedule loop (without sched_submit_work()).
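
With that, the contended rt_mutex slow path roughly becomes (simplified
sketch of what patch 2/4 does):

	rt_mutex_slowlock()
	{
		sched_submit_work();	/* pi_blocked_on is still NULL here */
		raw_spin_lock_irqsave(&lock->wait_lock, flags);
		ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
		sched_resume_work();
		return ret;
	}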

Original report by Crystal
https://lore.kernel.org/all/[email protected]

v1: https://lore.kernel.org/all/[email protected]

v1…v2:
- Avoid invoking blk_flush_plug() with DEBUG-enabled
- Also fix the ww-mutex implementation that is based on RT-mutex.
- Export sched_submit_work() and perform all of its work before blocking,
  not just blk_flush_plug().

Sebastian



Subject: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

From: Thomas Gleixner <[email protected]>

schedule() invokes sched_submit_work() before scheduling and
sched_update_worker() afterwards to ensure that queued block requests are
flushed and the (IO)worker machineries can instantiate new workers if
required. This avoids deadlocks and starvation.

With rt_mutexes this can lead to a subtle problem:

When a task blocks on an rtmutex, current::pi_blocked_on points to the
rtmutex it blocks on. If one of the functions in sched_submit/resume_work()
then contends on an rtmutex-based lock, that would corrupt
current::pi_blocked_on.

Make it possible to let rtmutex issue the calls outside of the slowpath,
i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:

- Exposing sched_submit_work() and moving the task_is_running() condition
  into schedule()

- Renaming sched_update_worker() to sched_resume_work() and exposing it
  too.

- Providing sched_rtmutex() which just does the inner loop of scheduling
  until need_resched() is no longer set. Split out the loop so this does
  not create yet another copy.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
include/linux/sched.h | 5 +++++
kernel/sched/core.c | 40 ++++++++++++++++++++++------------------
2 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 675298d6eb362..ff1ce66d8b6e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -304,6 +304,11 @@ extern long schedule_timeout_idle(long timeout);
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 asmlinkage void preempt_schedule_irq(void);
+
+extern void sched_submit_work(void);
+extern void sched_resume_work(void);
+extern void schedule_rtmutex(void);
+
 #ifdef CONFIG_PREEMPT_RT
 extern void schedule_rtlock(void);
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c415418b0b847..7c5cfae086c78 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6690,14 +6690,11 @@ void __noreturn do_task_dead(void)
 		cpu_relax();
 }
 
-static inline void sched_submit_work(struct task_struct *tsk)
+void sched_submit_work(void)
 {
-	unsigned int task_flags;
+	struct task_struct *tsk = current;
+	unsigned int task_flags = tsk->flags;
 
-	if (task_is_running(tsk))
-		return;
-
-	task_flags = tsk->flags;
 	/*
 	 * If a worker goes to sleep, notify and ask workqueue whether it
 	 * wants to wake up a task to maintain concurrency.
@@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	blk_flush_plug(tsk->plug, true);
 }
 
-static void sched_update_worker(struct task_struct *tsk)
+void sched_resume_work(void)
 {
+	struct task_struct *tsk = current;
+
 	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
 		if (tsk->flags & PF_WQ_WORKER)
 			wq_worker_running(tsk);
@@ -6733,20 +6732,29 @@ static void sched_update_worker(struct task_struct *tsk)
 	}
 }
 
-asmlinkage __visible void __sched schedule(void)
+static void schedule_loop(unsigned int sched_mode)
 {
-	struct task_struct *tsk = current;
-
-	sched_submit_work(tsk);
 	do {
 		preempt_disable();
-		__schedule(SM_NONE);
+		__schedule(sched_mode);
 		sched_preempt_enable_no_resched();
 	} while (need_resched());
-	sched_update_worker(tsk);
+}
+
+asmlinkage __visible void __sched schedule(void)
+{
+	if (!task_is_running(current))
+		sched_submit_work();
+	schedule_loop(SM_NONE);
+	sched_resume_work();
 }
 EXPORT_SYMBOL(schedule);
 
+void schedule_rtmutex(void)
+{
+	schedule_loop(SM_NONE);
+}
+
 /*
  * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
  * state (have scheduled out non-voluntarily) by making sure that all
@@ -6806,11 +6814,7 @@ void __sched schedule_preempt_disabled(void)
 #ifdef CONFIG_PREEMPT_RT
 void __sched notrace schedule_rtlock(void)
 {
-	do {
-		preempt_disable();
-		__schedule(SM_RTLOCK_WAIT);
-		sched_preempt_enable_no_resched();
-	} while (need_resched());
+	schedule_loop(SM_RTLOCK_WAIT);
 }
 NOKPROBE_SYMBOL(schedule_rtlock);
 #endif
--
2.40.1

Subject: [PATCH v2 2/4] locking/rtmutex: Submit/resume work explicitly before/after blocking

schedule() invokes sched_submit_work() before scheduling and
sched_resume_work() afterwards to ensure that queued block requests are
flushed and the (IO)worker machineries can instantiate new workers if
required. This avoids deadlocks and starvation.

With rt_mutexes this can lead to a subtle problem:

When a task blocks on an rtmutex, current::pi_blocked_on points to the
rtmutex it blocks on. If one of the functions in sched_submit/resume_work()
then contends on an rtmutex-based lock, that would corrupt
current::pi_blocked_on.

Let rtmutex and the RT lock variants which are based on it invoke
sched_submit/resume_work() explicitly before and after the slowpath so
it's guaranteed that current::pi_blocked_on cannot be corrupted by blocking
on two locks.

This does not apply to the PREEMPT_RT variants of spinlock_t and rwlock_t
as their scheduling slowpath is separate and cannot invoke the work related
functions due to potential deadlocks anyway.

[ tglx: Make it explicit and symmetric. Massage changelog ]

Fixes: e17ba59b7e8e1 ("locking/rtmutex: Guard regular sleeping locks specific functions")
Reported-by: Crystal Wood <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
kernel/locking/rtmutex.c | 11 +++++++++--
kernel/locking/rwbase_rt.c | 18 ++++++++++++++++--
kernel/locking/rwsem.c | 6 ++++++
kernel/locking/spinlock_rt.c | 3 +++
4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index 728f434de2bbf..aa66a3c5950a7 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -1555,7 +1555,7 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock,
 		raw_spin_unlock_irq(&lock->wait_lock);
 
 		if (!owner || !rtmutex_spin_on_owner(lock, waiter, owner))
-			schedule();
+			schedule_rtmutex();
 
 		raw_spin_lock_irq(&lock->wait_lock);
 		set_current_state(state);
@@ -1584,7 +1584,7 @@ static void __sched rt_mutex_handle_deadlock(int res, int detect_deadlock,
 	WARN(1, "rtmutex deadlock detected\n");
 	while (1) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		schedule();
+		schedule_rtmutex();
 	}
 }
 
@@ -1679,6 +1679,12 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 	unsigned long flags;
 	int ret;
 
+	/*
+	 * The task is about to sleep. Invoke sched_submit_work() before
+	 * blocking as that might take locks and corrupt tsk::pi_blocked_on.
+	 */
+	sched_submit_work();
+
 	/*
 	 * Technically we could use raw_spin_[un]lock_irq() here, but this can
 	 * be called in early boot if the cmpxchg() fast path is disabled
@@ -1691,6 +1697,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
 	ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state);
 	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
+	sched_resume_work();
 	return ret;
 }
 
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index 25ec0239477c2..945d474f5d27f 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -131,10 +131,21 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 static __always_inline int rwbase_read_lock(struct rwbase_rt *rwb,
 					    unsigned int state)
 {
+	int ret;
+
 	if (rwbase_read_trylock(rwb))
 		return 0;
 
-	return __rwbase_read_lock(rwb, state);
+	/*
+	 * The task is about to sleep. For rwsems this submits work as that
+	 * might take locks and corrupt tsk::pi_blocked_on. Must be
+	 * explicit here because __rwbase_read_lock() cannot invoke
+	 * rt_mutex_slowlock(). NOP for rwlocks.
+	 */
+	rwbase_sched_submit_work();
+	ret = __rwbase_read_lock(rwb, state);
+	rwbase_sched_resume_work();
+	return ret;
 }
 
 static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
@@ -230,7 +241,10 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 	struct rt_mutex_base *rtm = &rwb->rtmutex;
 	unsigned long flags;
 
-	/* Take the rtmutex as a first step */
+	/*
+	 * Take the rtmutex as a first step. For rwsem this will also
+	 * invoke sched_submit_work() to flush IO and workers.
+	 */
 	if (rwbase_rtmutex_lock_state(rtm, state))
 		return -EINTR;
 
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index acb5a50309a18..aca266006ad47 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -1415,6 +1415,12 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
 #define rwbase_rtmutex_lock_state(rtm, state)		\
 	__rt_mutex_lock(rtm, state)
 
+#define rwbase_sched_submit_work()			\
+	sched_submit_work()
+
+#define rwbase_sched_resume_work()			\
+	sched_resume_work()
+
 #define rwbase_rtmutex_slowlock_locked(rtm, state)	\
 	__rt_mutex_slowlock_locked(rtm, NULL, state)
 
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index 48a19ed8486d8..62c4a6866087a 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -159,6 +159,9 @@ rwbase_rtmutex_lock_state(struct rt_mutex_base *rtm, unsigned int state)
 	return 0;
 }
 
+static __always_inline void rwbase_sched_submit_work(void) { }
+static __always_inline void rwbase_sched_resume_work(void) { }
+
 static __always_inline int
 rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state)
 {
--
2.40.1

2023-05-03 13:22:00

by Peter Zijlstra

Subject: Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

On Thu, Apr 27, 2023 at 01:19:34PM +0200, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <[email protected]>
>
> schedule() invokes sched_submit_work() before scheduling and
> sched_update_worker() afterwards to ensure that queued block requests are
> flushed and the (IO)worker machineries can instantiate new workers if
> required. This avoids deadlocks and starvation.
>
> With rt_mutexes this can lead to a subtle problem:
>
> When rtmutex blocks current::pi_blocked_on points to the rtmutex it
> blocks on. When one of the functions in sched_submit/resume_work()
> contends on a rtmutex based lock then that would corrupt
> current::pi_blocked_on.
>
> Make it possible to let rtmutex issue the calls outside of the slowpath,
> i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:
>
> - Exposing sched_submit_work() and moving the task_running() condition
> into schedule()
>
> - Renaming sched_update_worker() to sched_resume_work() and exposing it
> too.
>
> - Providing sched_rtmutex() which just does the inner loop of scheduling
> until need_resched() is no longer set. Split out the loop so this does
> not create yet another copy.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>

Urgh, so I really don't like this.

The end result is something like:

        rt_mutex_lock()
          sched_submit_work();
            // a nested rt_mutex_lock() here will not clobber
            // ->pi_blocked_on because it's not set yet.

          task_blocks_on_rt_mutex();
            tsk->pi_blocked_on = waiter;
            rt_mutex_enqueue(lock, waiter); <-- the real problem

          rt_mutex_slowlock_block();
            schedule_rtmutex();

          sched_resume_work();

And all of this is not just because of tsk->pi_blocked_on, but mostly
because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
enqueue thing is what makes the 'simple' solution of saving/restoring
tsk->pi_blocked_on not work.

Basically the pi_blocked_on corruption is a side effect, not the
fundamental issue. One task having two waiters registered is the bigger
issue.

Now, sched_submit_work() could also use (regular) mutex -- after all
it's a fully preemptible context. And then we're subject to the 'same'
problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

This means that strictly speaking we should litter mutex with the same
thing :/
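
The path in question being roughly (DEBUG_MUTEXES=y):

	mutex_lock()
	  __mutex_lock_common()
	    debug_mutex_add_waiter()
	      task->blocked_on = waiter;
	    schedule_preempt_disabled()
	      schedule()
	        sched_submit_work()	// which could itself block on a mutex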

This all feels fragile to me. Too many things spread out in too many
places. An alternative is something like:

void __sched schedule_pi(void)
{
	struct task_struct *tsk = current;
	void *waiter = tsk->pi_blocked_on;

	sched_submit_work(tsk);
	do {
		preempt_disable();
		if (rt_mutex_blocks(tsk, waiter))
			schedule();
		sched_preempt_enable_no_resched();
	} while (need_resched());
	sched_update_worker(tsk);
}

And then rt_mutex_blocks() will do the enqueue/boost/optimistic_spin
thing. However, this is going to be a massive reorg of the rt_mutex code
and I'm not entirely sure the end result will actually be better... it
might just make a mess elsewhere :/

> +void sched_submit_work(void)
>  {
> -	unsigned int task_flags;
> +	struct task_struct *tsk = current;
> +	unsigned int task_flags = tsk->flags;
>  
> -	if (task_is_running(tsk))
> -		return;
> -
> -	task_flags = tsk->flags;
>  	/*
>  	 * If a worker goes to sleep, notify and ask workqueue whether it
>  	 * wants to wake up a task to maintain concurrency.
> @@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>  	blk_flush_plug(tsk->plug, true);
>  }

> +asmlinkage __visible void __sched schedule(void)
> +{
> +	if (!task_is_running(current))
> +		sched_submit_work();
> +	schedule_loop(SM_NONE);
> +	sched_resume_work();
>  }
> EXPORT_SYMBOL(schedule);

pulling out task_is_running() like this is going to get into conflict
with TJs patches here:

https://lkml.kernel.org/r/[email protected]

That makes sched_submit_work() do things even if task_is_running().

2023-05-09 22:18:46

by Crystal Wood

Subject: Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

On Wed, 2023-05-03 at 15:20 +0200, Peter Zijlstra wrote:
> On Thu, Apr 27, 2023 at 01:19:34PM +0200, Sebastian Andrzej Siewior wrote:
> > From: Thomas Gleixner <[email protected]>
> >
> > schedule() invokes sched_submit_work() before scheduling and
> > sched_update_worker() afterwards to ensure that queued block requests are
> > flushed and the (IO)worker machineries can instantiate new workers if
> > required. This avoids deadlocks and starvation.
> >
> > With rt_mutexes this can lead to a subtle problem:
> >
> >   When rtmutex blocks current::pi_blocked_on points to the rtmutex it
> >   blocks on. When one of the functions in sched_submit/resume_work()
> >   contends on a rtmutex based lock then that would corrupt
> >   current::pi_blocked_on.
> >
> > Make it possible to let rtmutex issue the calls outside of the slowpath,
> > i.e. when it is guaranteed that current::pi_blocked_on is NULL, by:
> >
> >   - Exposing sched_submit_work() and moving the task_running() condition
> >     into schedule()
> >
> >   - Renaming sched_update_worker() to sched_resume_work() and exposing it
> >     too.
> >
> >   - Providing sched_rtmutex() which just does the inner loop of scheduling
> >     until need_resched() is no longer set. Split out the loop so this does
> >     not create yet another copy.
> >
> > Signed-off-by: Thomas Gleixner <[email protected]>
> > Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
>
> Urgh, so I really don't like this.
>
> The end result is something like:
>
>         rt_mutex_lock()
>           sched_submit_work();
>             // a nested rt_mutex_lock() here will not clobber
>             // ->pi_blocked_on because it's not set yet.
>
>           task_blocks_on_rt_mutex();
>             tsk->pi_blocked_on = waiter;
>             rt_mutex_enqueue(lock, waiter); <-- the real problem
>
>           rt_mutex_slowlock_block();
>             schedule_rtmutex();
>
>           sched_resume_work();
>
> And all of this is not just because of tsk->pi_blocked_on, but mostly
> because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> enqueue thing is what makes the 'simple' solution of saving/restoring
> tsk->pi_blocked_on not work.
>
> Basically the pi_blocked_on corruption is a side effect, not the
> fundamental issue. One task having two waiters registered is the bigger
> issue.

Where do you see pi_blocked_on being saved/restored? The whole point of
this patchset is to deal with sched_submit_work() before anything has
been done on the "outer" lock acquisition (not just pi_blocked_on, but
also enqueuing) other than failing the fast path.
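
That is, with this series the ordering on a contended rt_mutex is roughly:

	rt_mutex_lock()
	  rt_mutex_cmpxchg_acquire()	// fast path fails on contention
	  rt_mutex_slowlock()
	    sched_submit_work()		// nothing registered for the outer lock yet
	    ... enqueue waiter, set ->pi_blocked_on, block ...
	    sched_resume_work()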

> Now, sched_submit_work() could also use (regular) mutex -- after all
> it's a fully preemptible context. And then we're subject to the 'same'
> problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

It's fully preemptible but it still shouldn't be doing things that would
block on non-RT. That'd already be broken for a number of reasons (task
state corruption, infinite recursion if current->plug isn't cleared
before doing whatever causes another standard schedule(), etc).

-Crystal

Subject: Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

On 2023-05-03 15:20:51 [+0200], Peter Zijlstra wrote:
> Urgh, so I really don't like this.
>
> The end result is something like:
>
>         rt_mutex_lock()
>           sched_submit_work();
>             // a nested rt_mutex_lock() here will not clobber
>             // ->pi_blocked_on because it's not set yet.
>
>           task_blocks_on_rt_mutex();
>             tsk->pi_blocked_on = waiter;
>             rt_mutex_enqueue(lock, waiter); <-- the real problem
>
>           rt_mutex_slowlock_block();
>             schedule_rtmutex();
>
>           sched_resume_work();
>
> And all of this is not just because of tsk->pi_blocked_on, but mostly
> because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> enqueue thing is what makes the 'simple' solution of saving/restoring
> tsk->pi_blocked_on not work.
>
> Basically the pi_blocked_on corruption is a side effect, not the
> fundamental issue. One task having two waiters registered is the bigger
> issue.

Yes, one task blocks on two locks, but the lock contention in
sched_submit_work() needs to be solved first.

> Now, sched_submit_work() could also use (regular) mutex -- after all
> it's a fully preemptible context. And then we're subject to the 'same'
> problem but with tsk->blocked_on (DEBUG_MUTEXES=y).

Not sure I follow. We only invoke sched_submit_work() if we block on a
lock which is also sleeping on !RT (mutex_t, not spinlock_t). I browsed a
few of the sched_submit_work() callbacks and they all use non-sleeping
locks (on !RT).
If a sched_submit_work() callback used a mutex_t lock then we would
recursively call blk_flush_plug() before setting tsk->blocked_on, perform
the same callback and block on the very same lock (again).
This is no different compared to !RT, therefore you must not use a
sleeping lock (mutex_t) in the callback.
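
I.e. roughly:

	schedule()
	  sched_submit_work()
	    blk_flush_plug()
	      mutex_lock()		// hypothetical sleeping lock in the callback
	        schedule()
	          sched_submit_work()
	            blk_flush_plug()	// … and so on until the stack is exhausted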

> This means that strictly speaking we should litter mutex with the same
> thing :/

No need, see above logic.

> This all feels fragile to me. Too many things spread out in too many
> places. An alternative is something like:
>
> void __sched schedule_pi(void)
> {
> 	struct task_struct *tsk = current;
> 	void *waiter = tsk->pi_blocked_on;
>
> 	sched_submit_work(tsk);
> 	do {
> 		preempt_disable();
> 		if (rt_mutex_blocks(tsk, waiter))
> 			schedule();
> 		sched_preempt_enable_no_resched();
> 	} while (need_resched());
> 	sched_update_worker(tsk);
> }
>
> And then rt_mutex_blocks() will do the enqueue/boost/optimistic_spin
> thing. However, this is going to be a massive reorg of the rt_mutex code
> and I'm not entirely sure the end result will actually be better... it
> might just make a mess elsewhere :/

It might not be needed…

> > @@ -6723,8 +6720,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
> >  	blk_flush_plug(tsk->plug, true);
> >  }
>
> > +asmlinkage __visible void __sched schedule(void)
> > +{
> > +	if (!task_is_running(current))
> > +		sched_submit_work();
> > +	schedule_loop(SM_NONE);
> > +	sched_resume_work();
> >  }
> >  EXPORT_SYMBOL(schedule);
>
> pulling out task_is_running() like this is going to get into conflict
> with TJs patches here:
>
> https://lkml.kernel.org/r/[email protected]
>
> That makes sched_submit_work() do things even if task_is_running().

Do I rebase my stuff on top of his then and we good?

Sebastian

2023-05-11 14:11:01

by Peter Zijlstra

Subject: Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

On Wed, May 10, 2023 at 05:04:15PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-05-03 15:20:51 [+0200], Peter Zijlstra wrote:
> > Urgh, so I really don't like this.
> >
> > The end result is something like:
> >
> >         rt_mutex_lock()
> >           sched_submit_work();
> >             // a nested rt_mutex_lock() here will not clobber
> >             // ->pi_blocked_on because it's not set yet.
> >
> >           task_blocks_on_rt_mutex();
> >             tsk->pi_blocked_on = waiter;
> >             rt_mutex_enqueue(lock, waiter); <-- the real problem
> >
> >           rt_mutex_slowlock_block();
> >             schedule_rtmutex();
> >
> >           sched_resume_work();
> >
> > And all of this is not just because of tsk->pi_blocked_on, but mostly
> > because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> > enqueue thing is what makes the 'simple' solution of saving/restoring
> > tsk->pi_blocked_on not work.
> >
> > Basically the pi_blocked_on corruption is a side effect, not the
> > fundamental issue. One task having two waiters registered is the bigger
> > issue.
>
> Yes, one task blocks on two locks but the lock in sched_submit_work()
> needs to be solved first.

Sure.. just saying that the focus on ->pi_blocked_on is a wee bit
under-representing the issue, because there's a 'simple' solution for
->pi_blocked_on. There is not for the whole waiter situation.
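
(The 'simple' solution being something along the lines of having the
nested acquisition do:

	struct rt_mutex_waiter *saved = current->pi_blocked_on;
	...
	current->pi_blocked_on = saved;

which keeps the pointer intact but does nothing about the waiter that is
already enqueued.)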

> > Now, sched_submit_work() could also use (regular) mutex -- after all
> > it's a fully preemptible context. And then we're subject to the 'same'
> > problem but with tsk->blocked_on (DEBUG_MUTEXES=y).
>
> Not sure I follow. We only invoke sched_submit_work() if we block on a
> lock which is sleeping on !RT (mutex_t, not spinlock_t). I browsed of
> few of the sched_submit_work() callbacks and they all use non-sleeping
> locks (on !RT).

Right, but this is not enforced, they could change this and we'd never
know.

It used to be run with IRQs disabled so anybody trying to do 'funny'
things would get yelled at, but commit 9c40cef2b799 ("sched: Move
blk_schedule_flush_plug() out of __schedule()") broke that.

> If a sched_submit_work() would use a mutex_t lock then we would
> recursively call blk_flush_plug() before setting tsk->blocked_on and

I'm not following, mutex code sets tsk->blocked_on before it calls
schedule(), getting into the very same problem you have with rt_mutex.

> perform the same callback and block on the very same lock (again).
> This isn't different compared to !RT therefore you must not use a
> sleeping lock (mutex_t) in the callback.

See the enforcement thing; today nothing stops the code from using a
mutex or other blocking primitives here.


> > pulling out task_is_running() like this is going to get into conflict
> > with TJs patches here:
> >
> > https://lkml.kernel.org/r/[email protected]
> >
> > That makes sched_submit_work() do things even if task_is_running().
>
> Do I rebase my stuff on top of his then and we good?

I just suggested he try something else:

https://lkml.kernel.org/r/[email protected]

if that works out this worry goes away.

If we get PROVE_RAW_LOCK_NESTING usable, something like the below might
help out with the validation part...

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 944c3ae39861..8df1aa333dee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6694,11 +6694,18 @@ void __noreturn do_task_dead(void)

static inline void sched_submit_work(struct task_struct *tsk)
{
+ static DEFINE_WAIT_OVERRIDE_MAP(sched_map, LD_WAIT_CONFIG);
unsigned int task_flags;

if (task_is_running(tsk))
return;

+ /*
+ * Establish LD_WAIT_CONFIG context to ensure none of the code called
+ * will use a blocking primitive.
+ */
+ lock_map_acquire_try(&sched_map);
+
task_flags = tsk->flags;
/*
* If a worker goes to sleep, notify and ask workqueue whether it
@@ -6723,6 +6730,8 @@ static inline void sched_submit_work(struct task_struct *tsk)
* make sure to submit it to avoid deadlocks.
*/
blk_flush_plug(tsk->plug, true);
+
+ lock_map_release(&sched_map);
}

static void sched_update_worker(struct task_struct *tsk)

2023-05-11 14:16:46

by Peter Zijlstra

Subject: Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

On Tue, May 09, 2023 at 05:14:38PM -0500, Crystal Wood wrote:
> On Wed, 2023-05-03 at 15:20 +0200, Peter Zijlstra wrote:

> > Urgh, so I really don't like this.
> >
> > The end result is something like:
> >
> >         rt_mutex_lock()
> >           sched_submit_work();
> >             // a nested rt_mutex_lock() here will not clobber
> >             // ->pi_blocked_on because it's not set yet.
> >
> >           task_blocks_on_rt_mutex();
> >             tsk->pi_blocked_on = waiter;
> >             rt_mutex_enqueue(lock, waiter); <-- the real problem
> >
> >           rt_mutex_slowlock_block();
> >             schedule_rtmutex();
> >
> >           sched_resume_work();
> >
> > And all of this is not just because of tsk->pi_blocked_on, but mostly
> > because of task_blocks_on_rt_mutex() enqueueing the waiter. The whole
> > enqueue thing is what makes the 'simple' solution of saving/restoring
> > tsk->pi_blocked_on not work.
> >
> > Basically the pi_blocked_on corruption is a side effect, not the
> > fundamental issue. One task having two waiters registered is the bigger
> > issue.
>
> Where do you see pi_blocked_on being saved/restored?

I'm not, I'm saying that *IF* ->pi_blocked_on corruption were the real
problem that would be a sufficient solution.

But ->pi_blocked_on corruption is just a wee side effect of the real
problem, which is that the waiter is already enqueued and we've already
done PI and can't very well back out of all that in a hurry.

> The whole point of
> this patchset is to deal with sched_submit_work() before anything has
> been done on the "outer" lock acquisition (not just pi_blocked_on, but
> also enqueuing) other than failing the fast path.

It's also terribly fragile, sprinkling stuff all over that shouldn't be
sprinkled.

And it's sprinkled far wider than it needs to be -- per the below
argument it really only should be applied to rtlock, not to rt_mutex or
ww_rt_mutex or any of the others that normally block and shouldn't be
used anyway.

> > Now, sched_submit_work() could also use (regular) mutex -- after all
> > it's a fully preemptible context. And then we're subject to the 'same'
> > problem but with tsk->blocked_on (DEBUG_MUTEXES=y).
>
> It's fully preemptible but it still shouldn't be doing things that would
> block on non-RT. That'd already be broken for a number of reasons (task
> state corruption, infinite recursion if current->plug isn't cleared
> before doing whatever causes another standard schedule(), etc).

task->state is fairly immune to corruption normally -- the typical
case is that the nested block resolves and resets it to RUNNING, at
which point the outer loop 'fails' to schedule() and re-tries. All that
is mostly harmless.
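
Roughly:

	set_current_state(TASK_UNINTERRUPTIBLE);	// outer wait
	schedule()
	  sched_submit_work()
	    // nested block:
	    set_current_state(TASK_UNINTERRUPTIBLE);
	    schedule();
	    __set_current_state(TASK_RUNNING);		// nested block resolves
	  __schedule()					// state is RUNNING again -> no block
	// outer wait loop notices it did not sleep and simply retries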

But yes, all that code *SHOULD* not block, but nothing is really
enforcing that.


Subject: Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

On 2023-05-11 15:43:08 [+0200], Peter Zijlstra wrote:
> > If a sched_submit_work() would use a mutex_t lock then we would
> > recursively call blk_flush_plug() before setting tsk->blocked_on and
>
> I'm not following, mutex code sets tsk->blocked_on before it calls
> schedule(), getting into the very same problem you have with rt_mutex.
>
> > perform the same callback and block on the very same lock (again).
> > This isn't different compared to !RT therefore you must not use a
> > sleeping lock (mutex_t) in the callback.
>
> See the enforcement thing; today nothing stops the code from using a
> mutex or other blocking primitives here.

I tried to explain that if blk_flush_plug() blocks on a mutex_t then it
will invoke schedule() -> blk_flush_plug() -> schedule() ->
blk_flush_plug() -> … until it runs out of stack.

So it is broken regardless of RT but yes we don't enforce it and yes
people might use it and it would work as long as the lock is not
contended.

> > Do I rebase my stuff on top of his then and we good?
>
> I just suggested he try something else:
>
> https://lkml.kernel.org/r/[email protected]
>
> if that works out this worry goes away.
>
> If we get PROVE_RAW_LOCK_NESTING usable, something like the below might
> help out with the validation part...

Okay. So if I don't collide with the workqueue changes, do you buy this
or do you ask for something else? I'm not sure…

Regarding PROVE_RAW_LOCK_NESTING: If I boot -rc3 with `quiet' then I
don't see any complaints.
Otherwise it is printk during boot (the caller is holding a raw_spinlock_t
and then printk() calls into the serial driver which uses spinlock_t).
From time to time people send "fixes" for PROVE_RAW_LOCK_NESTING splats so
I would guess they boot with `quiet' and there isn't much else. So we
are getting close here I guess.

Do you want me to test the suggested validation map somewhere? Because
if it works, it could be queued.

Sebastian

Subject: Re: [PATCH v2 1/4] sched/core: Provide sched_rtmutex() and expose sched work helpers

On 2023-05-25 17:25:07 [+0200], To Peter Zijlstra wrote:
> > if that works out this worry goes away.
> >
> > If we get PROVE_RAW_LOCK_NESTING usable, something like the below might
> > help out with the validation part...
>
> Okay. So if I don't collide with workqueue do you buy this or do you
> ask for something else. I'm not sure…
>
> Regarding PROVE_RAW_LOCK_NESTING: If I boot -rc3 with `quiet' then I
> don't see any complaints.
> Otherwise it is printk during boot (caller is holding raw_spinlock_t and
> then printk() calls to serial driver with spinlock_t).
> From time to time people send "fixes" for PROVE_RAW_LOCK_NESTING splats so
> I would guess they boot with `quiet' and there isn't much else. So we
> are getting close here I guess.
>
> Do you want me to test the suggested validation map somewhere? Because
> if it works, it could be queued.

Do you want just the validation map or is there something I'm missing?

Sebastian