2021-07-13 16:14:19

by Thomas Gleixner

Subject: [patch 03/50] sched: Prepare for RT sleeping spin/rwlocks

From: Thomas Gleixner <[email protected]>

Waiting for spinlocks and rwlocks on non-RT enabled kernels is task::state
preserving. Any wakeup which matches the state is valid.

RT enabled kernels substitute them with 'sleeping' spinlocks. This creates
an issue vs. task::state.

In order to block on the lock, the task has to overwrite task::state and a
subsequent wakeup issued by the unlocker sets the state back to
TASK_RUNNING. As a consequence the task loses the state which was set
before the lock acquisition and also any regular wakeup targeted at the
task while it is blocked on the lock.

To handle this gracefully add a 'saved_state' member to task_struct which
is used in the following way:

1) When a task blocks on a 'sleeping' spinlock, the current state is saved
in task::saved_state before it is set to TASK_RTLOCK_WAIT.

2) When the task unblocks and after acquiring the lock, it restores the saved
state.

3) When a regular wakeup happens for a task while it is blocked then the
state change of that wakeup is redirected to operate on task::saved_state.

This is also required when the task state is running because the task
might have been woken up from the lock wait and has not yet restored
the saved state.

To make it complete, provide the necessary helpers to save and restore the
saved state, along with documentation of how the RT lock blocking is
supposed to work.

For non-RT kernels there is no functional change.

Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/sched.h | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 33 +++++++++++++++++++++++
2 files changed, 103 insertions(+)
---
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -155,6 +155,27 @@ struct task_group;
WRITE_ONCE(current->__state, (state_value)); \
raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
} while (0)
+
+
+#define current_save_and_set_rtlock_wait_state() \
+ do { \
+ raw_spin_lock(&current->pi_lock); \
+ current->saved_state = current->__state; \
+ current->saved_state_change = current->task_state_change;\
+ current->task_state_change = _THIS_IP_; \
+ WRITE_ONCE(current->__state, TASK_RTLOCK_WAIT); \
+ raw_spin_unlock(&current->pi_lock); \
+ } while (0);
+
+#define current_restore_rtlock_saved_state() \
+ do { \
+ raw_spin_lock(&current->pi_lock); \
+ current->task_state_change = current->saved_state_change;\
+ WRITE_ONCE(current->__state, current->saved_state); \
+ current->saved_state = TASK_RUNNING; \
+ raw_spin_unlock(&current->pi_lock); \
+ } while (0);
+
#else
/*
* set_current_state() includes a barrier so that the write of current->state
@@ -213,6 +234,47 @@ struct task_group;
raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
} while (0)

+/*
+ * PREEMPT_RT specific variants for "sleeping" spin/rwlocks
+ *
+ * RT's spin/rwlock substitutions are state preserving. The state of the
+ * task when blocking on the lock is saved in task_struct::saved_state and
+ * restored after the lock has been acquired. These operations are
+ * serialized by task_struct::pi_lock against try_to_wake_up(). Any non RT
+ * lock related wakeups while the task is blocked on the lock are
+ * redirected to operate on task_struct::saved_state to ensure that these
+ * are not dropped. On restore task_struct::saved_state is set to
+ * TASK_RUNNING so any wakeup attempt redirected to saved_state will fail.
+ *
+ * The lock operation looks like this:
+ *
+ * current_save_and_set_rtlock_wait_state();
+ * for (;;) {
+ * if (try_lock())
+ * break;
+ * raw_spin_unlock_irq(&lock->wait_lock);
+ * schedule_rtlock();
+ * raw_spin_lock_irq(&lock->wait_lock);
+ * set_current_state(TASK_RTLOCK_WAIT);
+ * }
+ * current_restore_rtlock_saved_state();
+ */
+#define current_save_and_set_rtlock_wait_state() \
+ do { \
+ raw_spin_lock(&current->pi_lock); \
+ current->saved_state = current->state; \
+ WRITE_ONCE(current->__state, TASK_RTLOCK_WAIT); \
+ raw_spin_unlock(&current->pi_lock); \
+ } while (0);
+
+#define current_restore_rtlock_saved_state() \
+ do { \
+ raw_spin_lock(&current->pi_lock); \
+ WRITE_ONCE(current->__state, current->saved_state); \
+ current->saved_state = TASK_RUNNING; \
+ raw_spin_unlock(&current->pi_lock); \
+ } while (0);
+
#endif

#define get_current_state() READ_ONCE(current->__state)
@@ -670,6 +732,11 @@ struct task_struct {
#endif
unsigned int __state;

+#ifdef CONFIG_PREEMPT_RT
+ /* saved state for "spinlock sleepers" */
+ unsigned int saved_state;
+#endif
+
/*
* This begins the randomizable portion of task_struct. Only
* scheduling-critical items should be added above here.
@@ -1359,6 +1426,9 @@ struct task_struct {
struct kmap_ctrl kmap_ctrl;
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
+# ifdef CONFIG_PREEMPT_RT
+ unsigned long saved_state_change;
+# endif
#endif
int pagefault_disabled;
#ifdef CONFIG_MMU
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3568,14 +3568,47 @@ static void ttwu_queue(struct task_struc
*
* The caller holds p::pi_lock if p != current or has preemption
* disabled when p == current.
+ *
+ * The rules of PREEMPT_RT saved_state:
+ *
+ * The related locking code always holds p::pi_lock when updating
+ * p::saved_state, which means the code is fully serialized in both cases.
+ *
+ * The lock wait and lock wakeups happen via TASK_RTLOCK_WAIT. No other
+ * bits are set. This allows all wakeup scenarios to be distinguished.
*/
static __always_inline
bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
{
+ if (IS_ENABLED(CONFIG_DEBUG_PREEMPT)) {
+ WARN_ON_ONCE((state & TASK_RTLOCK_WAIT) &&
+ state != TASK_RTLOCK_WAIT);
+ }
+
if (READ_ONCE(p->__state) & state) {
*success = 1;
return true;
}
+
+#ifdef CONFIG_PREEMPT_RT
+ /*
+ * Saved state preserves the task state across blocking on
+ * a RT lock. If the state matches, set p::saved_state to
+ * TASK_RUNNING, but do not wake the task because it waits
+ * for a lock wakeup. Also indicate success because from
+ * the regular waker's point of view this has succeeded.
+ *
+ * After acquiring the lock the task will restore p::state
+ * from p::saved_state which ensures that the regular
+ * wakeup is not lost. The restore will also set
+ * p::saved_state to TASK_RUNNING so any further tests will
+ * not result in false positives vs. @success
+ */
+ if (p->saved_state & state) {
+ p->saved_state = TASK_RUNNING;
+ *success = 1;
+ }
+#endif
return false;
}



2021-07-13 19:53:46

by Waiman Long

Subject: Re: [patch 03/50] sched: Prepare for RT sleeping spin/rwlocks

On 7/13/21 11:10 AM, Thomas Gleixner wrote:
> From: Thomas Gleixner <[email protected]>
>
> Waiting for spinlocks and rwlocks on non RT enabled kernels is task::state
> preserving. Any wakeup which matches the state is valid.
>
> RT enabled kernels substitutes them with 'sleeping' spinlocks. This creates
> an issue vs. task::state.
>
> In order to block on the lock the task has to overwrite task::state and a
> consecutive wakeup issued by the unlocker sets the state back to
> TASK_RUNNING. As a consequence the task loses the state which was set
> before the lock acquire and also any regular wakeup targeted at the task
> while it is blocked on the lock.
>
> To handle this gracefully add a 'saved_state' member to task_struct which
> is used in the following way:
>
> 1) When a task blocks on a 'sleeping' spinlock, the current state is saved
> in task::saved_state before it is set to TASK_RTLOCK_WAIT.
>
> 2) When the task unblocks and after acquiring the lock, it restores the saved
> state.
>
> 3) When a regular wakeup happens for a task while it is blocked then the
> state change of that wakeup is redirected to operate on task::saved_state.
>
> This is also required when the task state is running because the task
> might have been woken up from the lock wait and has not yet restored
> the saved state.
>
> To make it complete provide the necessary helpers to save and restore the
> saved state along with the necessary documentation how the RT lock blocking
> is supposed to work.
>
> For non-RT kernels there is no functional change.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> include/linux/sched.h | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/core.c | 33 +++++++++++++++++++++++
> 2 files changed, 103 insertions(+)
> ---
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -155,6 +155,27 @@ struct task_group;
> WRITE_ONCE(current->__state, (state_value)); \
> raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
> } while (0)
> +
> +
> +#define current_save_and_set_rtlock_wait_state() \
> + do { \
> + raw_spin_lock(&current->pi_lock); \
> + current->saved_state = current->__state; \
> + current->saved_state_change = current->task_state_change;\
> + current->task_state_change = _THIS_IP_; \
> + WRITE_ONCE(current->__state, TASK_RTLOCK_WAIT); \
> + raw_spin_unlock(&current->pi_lock); \
> + } while (0);
> +
> +#define current_restore_rtlock_saved_state() \
> + do { \
> + raw_spin_lock(&current->pi_lock); \
> + current->task_state_change = current->saved_state_change;\
> + WRITE_ONCE(current->__state, current->saved_state); \
> + current->saved_state = TASK_RUNNING; \
> + raw_spin_unlock(&current->pi_lock); \
> + } while (0);
> +
> #else
> /*
> * set_current_state() includes a barrier so that the write of current->state
> @@ -213,6 +234,47 @@ struct task_group;
> raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
> } while (0)
>
> +/*
> + * PREEMPT_RT specific variants for "sleeping" spin/rwlocks
> + *
> + * RT's spin/rwlock substitutions are state preserving. The state of the
> + * task when blocking on the lock is saved in task_struct::saved_state and
> + * restored after the lock has been acquired. These operations are
> + * serialized by task_struct::pi_lock against try_to_wake_up(). Any non RT
> + * lock related wakeups while the task is blocked on the lock are
> + * redirected to operate on task_struct::saved_state to ensure that these
> + * are not dropped. On restore task_struct::saved_state is set to
> + * TASK_RUNNING so any wakeup attempt redirected to saved_state will fail.
> + *
> + * The lock operation looks like this:
> + *
> + * current_save_and_set_rtlock_wait_state();
> + * for (;;) {
> + * if (try_lock())
> + * break;
> + * raw_spin_unlock_irq(&lock->wait_lock);
> + * schedule_rtlock();
> + * raw_spin_lock_irq(&lock->wait_lock);
> + * set_current_state(TASK_RTLOCK_WAIT);
> + * }
> + * current_restore_rtlock_saved_state();
> + */
> +#define current_save_and_set_rtlock_wait_state() \
> + do { \
> + raw_spin_lock(&current->pi_lock); \
> + current->saved_state = current->state; \
> + WRITE_ONCE(current->__state, TASK_RTLOCK_WAIT); \
> + raw_spin_unlock(&current->pi_lock); \
> + } while (0);
> +
> +#define current_restore_rtlock_saved_state() \
> + do { \
> + raw_spin_lock(&current->pi_lock); \
> + WRITE_ONCE(current->__state, current->saved_state); \
> + current->saved_state = TASK_RUNNING; \
> + raw_spin_unlock(&current->pi_lock); \
> + } while (0);
> +
> #endif
>

The difference between the two versions of
current_save_and_set_rtlock_wait_state() is just the handling of
current->saved_state_change. I think it would be cleaner to add helper
macros that just save and restore saved_state_change, and to break out
current_save_and_set_rtlock_wait_state() and
current_restore_rtlock_saved_state() into their own block. They can also
be put under CONFIG_PREEMPT_RT with alternate null implementations so
that they can be used outside of the CONFIG_PREEMPT_RT conditional block.
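
Something like this, perhaps (just a rough sketch of the factoring, the
debug_rtlock_wait_*() helper names are made up here for illustration):

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
# define debug_rtlock_wait_set_state()                                  \
        do {                                                            \
                current->saved_state_change = current->task_state_change;\
                current->task_state_change = _THIS_IP_;                 \
        } while (0)

# define debug_rtlock_wait_restore_state()                              \
        do {                                                            \
                current->task_state_change = current->saved_state_change;\
        } while (0)
#else
# define debug_rtlock_wait_set_state()          do { } while (0)
# define debug_rtlock_wait_restore_state()      do { } while (0)
#endif

#ifdef CONFIG_PREEMPT_RT
/* One copy which works for both debug and non-debug builds */
# define current_save_and_set_rtlock_wait_state()                       \
        do {                                                            \
                raw_spin_lock(&current->pi_lock);                       \
                current->saved_state = current->__state;                \
                debug_rtlock_wait_set_state();                          \
                WRITE_ONCE(current->__state, TASK_RTLOCK_WAIT);         \
                raw_spin_unlock(&current->pi_lock);                     \
        } while (0)

# define current_restore_rtlock_saved_state()                           \
        do {                                                            \
                raw_spin_lock(&current->pi_lock);                       \
                debug_rtlock_wait_restore_state();                      \
                WRITE_ONCE(current->__state, current->saved_state);     \
                current->saved_state = TASK_RUNNING;                    \
                raw_spin_unlock(&current->pi_lock);                     \
        } while (0)
#else
# define current_save_and_set_rtlock_wait_state()       do { } while (0)
# define current_restore_rtlock_saved_state()           do { } while (0)
#endif

That way there is only one copy of each of the two rtlock macros instead
of one per debug variant.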

Cheers,
Longman


2021-07-15 01:17:30

by Valentin Schneider

Subject: Re: [patch 03/50] sched: Prepare for RT sleeping spin/rwlocks

Hi,

On 13/07/21 17:10, Thomas Gleixner wrote:
> From: Thomas Gleixner <[email protected]>
>
> Waiting for spinlocks and rwlocks on non RT enabled kernels is task::state
> preserving. Any wakeup which matches the state is valid.
>
> RT enabled kernels substitutes them with 'sleeping' spinlocks. This creates
> an issue vs. task::state.
>
> In order to block on the lock the task has to overwrite task::state and a
> consecutive wakeup issued by the unlocker sets the state back to
> TASK_RUNNING. As a consequence the task loses the state which was set
> before the lock acquire and also any regular wakeup targeted at the task
> while it is blocked on the lock.
>

I'm not sure I get this for spinlocks - p->__state != TASK_RUNNING means
task is stopped (or about to be), IMO that doesn't go with spinning. I was
thinking perhaps ptrace could be an issue, but I don't have a clear picture
on that either. What am I missing?

> @@ -213,6 +234,47 @@ struct task_group;
> raw_spin_unlock_irqrestore(&current->pi_lock, flags); \
> } while (0)
>
> +/*
> + * PREEMPT_RT specific variants for "sleeping" spin/rwlocks
> + *
> + * RT's spin/rwlock substitutions are state preserving. The state of the
> + * task when blocking on the lock is saved in task_struct::saved_state and
> + * restored after the lock has been acquired. These operations are
> + * serialized by task_struct::pi_lock against try_to_wake_up(). Any non RT
> + * lock related wakeups while the task is blocked on the lock are
> + * redirected to operate on task_struct::saved_state to ensure that these
> + * are not dropped. On restore task_struct::saved_state is set to
> + * TASK_RUNNING so any wakeup attempt redirected to saved_state will fail.
> + *
> + * The lock operation looks like this:
> + *
> + * current_save_and_set_rtlock_wait_state();
> + * for (;;) {
> + * if (try_lock())
> + * break;
> + * raw_spin_unlock_irq(&lock->wait_lock);
> + * schedule_rtlock();
> + * raw_spin_lock_irq(&lock->wait_lock);
> + * set_current_state(TASK_RTLOCK_WAIT);
> + * }
> + * current_restore_rtlock_saved_state();
> + */
> +#define current_save_and_set_rtlock_wait_state() \
> + do { \
> + raw_spin_lock(&current->pi_lock); \
> + current->saved_state = current->state; \
^^^^^
That one somehow survived the s/state/__state/ renaming.

2021-07-15 13:41:44

by Peter Zijlstra

Subject: Re: [patch 03/50] sched: Prepare for RT sleeping spin/rwlocks

On Thu, Jul 15, 2021 at 12:20:28AM +0100, Valentin Schneider wrote:
> Hi,
>
> On 13/07/21 17:10, Thomas Gleixner wrote:
> > From: Thomas Gleixner <[email protected]>
> >
> > Waiting for spinlocks and rwlocks on non RT enabled kernels is task::state
> > preserving. Any wakeup which matches the state is valid.
> >
> > RT enabled kernels substitutes them with 'sleeping' spinlocks. This creates
> > an issue vs. task::state.
> >
> > In order to block on the lock the task has to overwrite task::state and a
> > consecutive wakeup issued by the unlocker sets the state back to
> > TASK_RUNNING. As a consequence the task loses the state which was set
> > before the lock acquire and also any regular wakeup targeted at the task
> > while it is blocked on the lock.
> >
>
> I'm not sure I get this for spinlocks - p->__state != TASK_RUNNING means
> task is stopped (or about to be), IMO that doesn't go with spinning. I was
> thinking perhaps ptrace could be an issue, but I don't have a clear picture
> on that either. What am I missing?

spinlocks will become rtmutex. They're going to clobber __state by
virtue of a nested block.

2021-07-15 17:42:34

by Valentin Schneider

Subject: Re: [patch 03/50] sched: Prepare for RT sleeping spin/rwlocks

On 15/07/21 11:27, Peter Zijlstra wrote:
> On Thu, Jul 15, 2021 at 12:20:28AM +0100, Valentin Schneider wrote:
>> Hi,
>>
>> On 13/07/21 17:10, Thomas Gleixner wrote:
>> > From: Thomas Gleixner <[email protected]>
>> >
>> > Waiting for spinlocks and rwlocks on non RT enabled kernels is task::state
>> > preserving. Any wakeup which matches the state is valid.
>> >
>> > RT enabled kernels substitutes them with 'sleeping' spinlocks. This creates
>> > an issue vs. task::state.
>> >
>> > In order to block on the lock the task has to overwrite task::state and a
>> > consecutive wakeup issued by the unlocker sets the state back to
>> > TASK_RUNNING. As a consequence the task loses the state which was set
>> > before the lock acquire and also any regular wakeup targeted at the task
>> > while it is blocked on the lock.
>> >
>>
>> I'm not sure I get this for spinlocks - p->__state != TASK_RUNNING means
>> task is stopped (or about to be), IMO that doesn't go with spinning. I was
>> thinking perhaps ptrace could be an issue, but I don't have a clear picture
>> on that either. What am I missing?
>
> spinlocks will become rtmutex. They're going to clobber __state by
> virtue of a nested block.

I wasn't expecting there to be any task taking spinlocks with state !=
TASK_RUNNING, but I just didn't know where to look.

For instance do_wait() sets current to TASK_INTERRUPTIBLE and can then faff
around with some sighand locks before eventually calling into schedule(),
so clearly that one would be affected by the clobbering.
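
Spelling the pattern out (a simplified sketch, not the actual do_wait()
code):

        set_current_state(TASK_INTERRUPTIBLE);      /* wants to sleep            */
        read_lock(&tasklist_lock);                  /* "spinlocks" which, on RT, */
        spin_lock_irq(&current->sighand->siglock);  /* can block on an rtmutex   */
        /* ... inspect child/tracee state ... */
        spin_unlock_irq(&current->sighand->siglock);
        read_unlock(&tasklist_lock);
        schedule();     /* relies on TASK_INTERRUPTIBLE still being set, which is */
                        /* what task::saved_state preserves across the lock waits */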

The more you know...