2022-01-21 22:25:28

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not

Hi folks,

Problem
=======

Abhijeet pointed out that the following sequence of trace events can be
observed:

cat-1676 [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
<idle>-0 [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004

IOW, not-yet-woken task gets presented as runnable and switched out in
favor of the idle task... Dietmar and I had a look, turns out this sequence
can happen:

p->__state = TASK_INTERRUPTIBLE;
__schedule()
deactivate_task(p);
ttwu()
READ !p->on_rq
p->__state=TASK_WAKING
trace_sched_switch()
__trace_sched_switch_state()
task_state_index()
return 0;

TASK_WAKING isn't in TASK_REPORT, hence why task_state_index(p) returns 0.
This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
spinning on p->on_cpu") which punted the TASK_WAKING write to the other
side of

smp_cond_load_acquire(&p->on_cpu, !VAL);

in ttwu().

Uwe reported on #linux-rt what I think is a similar issue with
TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
you get prev_state=0 despite the task blocking on a lock.

Both of those are very confusing for tooling or even human observers.

Patches
=======

For the TASK_WAKING case, pushing the prev_state read in __schedule() down
to __trace_sched_switch_state() seems sufficient. That's patch 1.

For TASK_RTLOCK_WAIT, it's a bit trickier. One could use ->saved_state as
prev_state, but
a) that is also susceptible to racing (see ttwu() / ttwu_state_match())
b) some lock substitutions (e.g. rt_spin_lock()) leave ->saved_state as
TASK_RUNNING.

Patch 2 adds TASK_RTLOCK_WAIT to TASK_REPORT instead.

Testing
=======

Briefly tested on an Arm Juno via ftrace+hackbench against
o tip/sched/core@82762d2af31a
o v5.16-rt15-rebase w/ CONFIG_PREEMPT_RT


I also had a look at the __schedule() disassembly as suggested by Peter; on x86
prev_state gets pushed to the stack and popped prior to the trace event static
call, the rest seems mostly unchanged (GCC 9.3).

Revisions
=========

v2 -> v3
++++++++

o Dropped TASK_RTLOCK_WAIT from TASK_REPORT, made it appear as
TASK_UNINTERRUPTIBLE instead (Eric)

RFC v1 -> v2
++++++++++++

o Collected tags (Steven, Sebastian)
o Patched missed trace_switch event clients (Steven)

Cheers,
Valentin

Valentin Schneider (2):
sched/tracing: Don't re-read p->state when emitting sched_switch event
sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

include/linux/sched.h | 19 ++++++++++++++++---
include/trace/events/sched.h | 11 +++++++----
kernel/sched/core.c | 4 ++--
kernel/trace/fgraph.c | 4 +++-
kernel/trace/ftrace.c | 4 +++-
kernel/trace/trace_events.c | 8 ++++++--
kernel/trace/trace_osnoise.c | 4 +++-
kernel/trace/trace_sched_switch.c | 1 +
kernel/trace/trace_sched_wakeup.c | 1 +
9 files changed, 42 insertions(+), 14 deletions(-)

--
2.25.1


2022-01-21 22:25:33

by Valentin Schneider

[permalink] [raw]
Subject: [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.

The actual state is saved in p->saved_state, but reading it after reading
p->__state has a few issues:
o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
o ttwu_state_match() might have changed that to TASK_RUNNING

As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
exposing a new state to userspace tools which way not know what to do with
them. The only information that needs to be conveyed here is that a task is
waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
need for a new state.

Reported-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d00837d12b9d..068270ff04b8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1626,6 +1626,14 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
if (tsk_state == TASK_IDLE)
state = TASK_REPORT_IDLE;

+ /*
+ * We're lying here, but rather than expose a completely new task state
+ * to userspace, we can make this appear as if the task has gone through
+ * a regular rt_mutex_lock() call.
+ */
+ if (tsk_state == TASK_RTLOCK_WAIT)
+ state = TASK_UNINTERRUPTIBLE;
+
return fls(state);
}

--
2.25.1

2022-01-22 02:02:21

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not

On Thu, 20 Jan 2022 16:25:18 +0000
Valentin Schneider <[email protected]> wrote:

> Hi folks,
>
> Problem
> =======
>
> Abhijeet pointed out that the following sequence of trace events can be
> observed:
>
> cat-1676 [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
> grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
> <idle>-0 [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004
>
> IOW, not-yet-woken task gets presented as runnable and switched out in
> favor of the idle task... Dietmar and I had a look, turns out this sequence
> can happen:
>
> p->__state = TASK_INTERRUPTIBLE;
> __schedule()
> deactivate_task(p);
> ttwu()
> READ !p->on_rq
> p->__state=TASK_WAKING
> trace_sched_switch()
> __trace_sched_switch_state()
> task_state_index()
> return 0;
>
> TASK_WAKING isn't in TASK_REPORT, hence why task_state_index(p) returns 0.
> This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
> spinning on p->on_cpu") which punted the TASK_WAKING write to the other
> side of
>
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> in ttwu().
>
> Uwe reported on #linux-rt what I think is a similar issue with
> TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
> you get prev_state=0 despite the task blocking on a lock.
>
> Both of those are very confusing for tooling or even human observers.



This all looks fine to me:

Reviewed-by: Steven Rostedt (Google) <[email protected]>

Peter, want to take this through your tree?

-- Steve

2022-02-27 15:44:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not

On Fri, Jan 21, 2022 at 12:15:58PM -0500, Steven Rostedt wrote:
>
> This all looks fine to me:
>
> Reviewed-by: Steven Rostedt (Google) <[email protected]>
>
> Peter, want to take this through your tree?

Queued up now. Thanks!

Subject: [tip: sched/core] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 25795ef6299f07ce3838f3253a9cb34f64efcfae
Gitweb: https://git.kernel.org/tip/25795ef6299f07ce3838f3253a9cb34f64efcfae
Author: Valentin Schneider <[email protected]>
AuthorDate: Thu, 20 Jan 2022 16:25:20
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 01 Mar 2022 16:18:39 +01:00

sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.

The actual state is saved in p->saved_state, but reading it after reading
p->__state has a few issues:
o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
o ttwu_state_match() might have changed that to TASK_RUNNING

As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
exposing a new state to userspace tools which way not know what to do with
them. The only information that needs to be conveyed here is that a task is
waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
need for a new state.

Reported-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/sched.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 457c8a0..78b606c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1630,6 +1630,14 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
if (tsk_state == TASK_IDLE)
state = TASK_REPORT_IDLE;

+ /*
+ * We're lying here, but rather than expose a completely new task state
+ * to userspace, we can make this appear as if the task has gone through
+ * a regular rt_mutex_lock() call.
+ */
+ if (tsk_state == TASK_RTLOCK_WAIT)
+ state = TASK_UNINTERRUPTIBLE;
+
return fls(state);
}

2022-04-11 14:44:07

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

+CC stable

On 01/20/22 16:25, Valentin Schneider wrote:
> TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
>
> The actual state is saved in p->saved_state, but reading it after reading
> p->__state has a few issues:
> o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> o ttwu_state_match() might have changed that to TASK_RUNNING
>
> As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
> exposing a new state to userspace tools which way not know what to do with
> them. The only information that needs to be conveyed here is that a task is
> waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
> need for a new state.
>
> Reported-by: Uwe Kleine-König <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>

Any objection for this to be picked up by stable? We care about Patch 1 only in
this series for stable, but it seems sensible to pick this one too, no strong
feeling if it is omitted though.

AFAICT it seems the problem dates back since commit:

1593baab910d ("sched/debug: Implement consistent task-state printing")

or even before. I think v4.14+ is good enough.

Thoughts?

Thanks!

--
Qais Yousef

2022-04-12 04:01:46

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

On Sun, Apr 10, 2022 at 11:13:25PM +0100, Qais Yousef wrote:
> On 04/10/22 08:14, Greg KH wrote:
> > On Sun, Apr 10, 2022 at 12:42:24AM +0100, Qais Yousef wrote:
> > > +CC stable
> > >
> > > On 01/20/22 16:25, Valentin Schneider wrote:
> > > > TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> > > > on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
> > > >
> > > > The actual state is saved in p->saved_state, but reading it after reading
> > > > p->__state has a few issues:
> > > > o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> > > > o ttwu_state_match() might have changed that to TASK_RUNNING
> > > >
> > > > As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
> > > > exposing a new state to userspace tools which way not know what to do with
> > > > them. The only information that needs to be conveyed here is that a task is
> > > > waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
> > > > need for a new state.
> > > >
> > > > Reported-by: Uwe Kleine-K�nig <[email protected]>
> > > > Signed-off-by: Valentin Schneider <[email protected]>
> > >
> > > Any objection for this to be picked up by stable? We care about Patch 1 only in
> > > this series for stable, but it seems sensible to pick this one too, no strong
> > > feeling if it is omitted though.
> > >
> > > AFAICT it seems the problem dates back since commit:
> > >
> > > 1593baab910d ("sched/debug: Implement consistent task-state printing")
> > >
> > > or even before. I think v4.14+ is good enough.
> >
> >
> > <formletter>
> >
> > This is not the correct way to submit patches for inclusion in the
> > stable kernel tree. Please read:
> > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > for how to do this properly.
> >
> > </formletter>
>
> Apologies.
>
> commit: 25795ef6299f07ce3838f3253a9cb34f64efcfae
> Subject: sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
>
> I am interested in Patch 1 in this series as I know it impacts some Android
> 5.10 users. But this patch seems a good candidate for stable too since it was
> observed by a user (Uwe) and AFAICT the problem dates back to v4.14+ kernels.
>
> Suggested kernels: v4.14+. This has already been picked up by AUTOSEL for
> v5.15+ stable trees.

I do not think you have tested this in any of those kernels, as it
breaks the build :(

Please send a set of patches, properly backported and tested, that you
wish to see applied to stable kernels and we will be glad to review them
and apply them. But to suggest patches to be merged that you have not
even tested is not good.

thanks,

greg k-h

2022-04-12 07:18:47

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

On 04/10/22 08:14, Greg KH wrote:
> On Sun, Apr 10, 2022 at 12:42:24AM +0100, Qais Yousef wrote:
> > +CC stable
> >
> > On 01/20/22 16:25, Valentin Schneider wrote:
> > > TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> > > on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
> > >
> > > The actual state is saved in p->saved_state, but reading it after reading
> > > p->__state has a few issues:
> > > o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> > > o ttwu_state_match() might have changed that to TASK_RUNNING
> > >
> > > As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
> > > exposing a new state to userspace tools which way not know what to do with
> > > them. The only information that needs to be conveyed here is that a task is
> > > waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
> > > need for a new state.
> > >
> > > Reported-by: Uwe Kleine-K�nig <[email protected]>
> > > Signed-off-by: Valentin Schneider <[email protected]>
> >
> > Any objection for this to be picked up by stable? We care about Patch 1 only in
> > this series for stable, but it seems sensible to pick this one too, no strong
> > feeling if it is omitted though.
> >
> > AFAICT it seems the problem dates back since commit:
> >
> > 1593baab910d ("sched/debug: Implement consistent task-state printing")
> >
> > or even before. I think v4.14+ is good enough.
>
>
> <formletter>
>
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree. Please read:
> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.
>
> </formletter>

Apologies.

commit: 25795ef6299f07ce3838f3253a9cb34f64efcfae
Subject: sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

I am interested in Patch 1 in this series as I know it impacts some Android
5.10 users. But this patch seems a good candidate for stable too since it was
observed by a user (Uwe) and AFAICT the problem dates back to v4.14+ kernels.

Suggested kernels: v4.14+. This has already been picked up by AUTOSEL for
v5.15+ stable trees.

Hope I got the process right this time.

Thanks!

--
Qais Yousef

2022-04-12 08:25:40

by Qais Yousef

[permalink] [raw]
Subject: Re: [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

On 04/11/22 15:20, Greg KH wrote:
> On Sun, Apr 10, 2022 at 11:13:25PM +0100, Qais Yousef wrote:
> > On 04/10/22 08:14, Greg KH wrote:
> > > On Sun, Apr 10, 2022 at 12:42:24AM +0100, Qais Yousef wrote:
> > > > +CC stable
> > > >
> > > > On 01/20/22 16:25, Valentin Schneider wrote:
> > > > > TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> > > > > on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
> > > > >
> > > > > The actual state is saved in p->saved_state, but reading it after reading
> > > > > p->__state has a few issues:
> > > > > o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> > > > > o ttwu_state_match() might have changed that to TASK_RUNNING
> > > > >
> > > > > As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
> > > > > exposing a new state to userspace tools which way not know what to do with
> > > > > them. The only information that needs to be conveyed here is that a task is
> > > > > waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
> > > > > need for a new state.
> > > > >
> > > > > Reported-by: Uwe Kleine-K�nig <[email protected]>
> > > > > Signed-off-by: Valentin Schneider <[email protected]>
> > > >
> > > > Any objection for this to be picked up by stable? We care about Patch 1 only in
> > > > this series for stable, but it seems sensible to pick this one too, no strong
> > > > feeling if it is omitted though.
> > > >
> > > > AFAICT it seems the problem dates back since commit:
> > > >
> > > > 1593baab910d ("sched/debug: Implement consistent task-state printing")
> > > >
> > > > or even before. I think v4.14+ is good enough.
> > >
> > >
> > > <formletter>
> > >
> > > This is not the correct way to submit patches for inclusion in the
> > > stable kernel tree. Please read:
> > > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > for how to do this properly.
> > >
> > > </formletter>
> >
> > Apologies.
> >
> > commit: 25795ef6299f07ce3838f3253a9cb34f64efcfae
> > Subject: sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE
> >
> > I am interested in Patch 1 in this series as I know it impacts some Android
> > 5.10 users. But this patch seems a good candidate for stable too since it was
> > observed by a user (Uwe) and AFAICT the problem dates back to v4.14+ kernels.
> >
> > Suggested kernels: v4.14+. This has already been picked up by AUTOSEL for
> > v5.15+ stable trees.
>
> I do not think you have tested this in any of those kernels, as it
> breaks the build :(
>
> Please send a set of patches, properly backported and tested, that you
> wish to see applied to stable kernels and we will be glad to review them
> and apply them. But to suggest patches to be merged that you have not
> even tested is not good.

Okay. Maybe I was trying to chew too much. I care about patch 1 only. I'll make
sure it builds and works against 5.10 and post it separately. Please ignore
this one.

Sorry for the noise.

Thanks

--
Qais Yousef

2022-04-12 09:23:57

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v3 2/2] sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE

On Sun, Apr 10, 2022 at 12:42:24AM +0100, Qais Yousef wrote:
> +CC stable
>
> On 01/20/22 16:25, Valentin Schneider wrote:
> > TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> > on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
> >
> > The actual state is saved in p->saved_state, but reading it after reading
> > p->__state has a few issues:
> > o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> > o ttwu_state_match() might have changed that to TASK_RUNNING
> >
> > As pointed out by Eric, adding TASK_RTLOCK_WAIT to TASK_REPORT implies
> > exposing a new state to userspace tools which way not know what to do with
> > them. The only information that needs to be conveyed here is that a task is
> > waiting on an rt_mutex, which matches TASK_UNINTERRUPTIBLE - there's no
> > need for a new state.
> >
> > Reported-by: Uwe Kleine-K?nig <[email protected]>
> > Signed-off-by: Valentin Schneider <[email protected]>
>
> Any objection for this to be picked up by stable? We care about Patch 1 only in
> this series for stable, but it seems sensible to pick this one too, no strong
> feeling if it is omitted though.
>
> AFAICT it seems the problem dates back since commit:
>
> 1593baab910d ("sched/debug: Implement consistent task-state printing")
>
> or even before. I think v4.14+ is good enough.


<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree. Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>