2023-08-03 09:59:14

by Ze Gao

[permalink] [raw]
Subject: [RFC PATCH v6 0/5] fix task state report from sched tracepoint

This is the 6th attempt to fix the report task state issue in sched
tracepint, you can check out previous discussions here:

v1: https://lore.kernel.org/linux-trace-kernel/[email protected]
v2: https://lore.kernel.org/linux-trace-kernel/[email protected]
v3: https://lore.kernel.org/linux-trace-kernel/[email protected]
v4: https://lore.kernel.org/linux-trace-kernel/[email protected]
v5: https://lore.kernel.org/linux-trace-kernel/[email protected]

Against v5, fix some code style and change to report prev_state
in 'short'; also send libtraceevent patches seperately to
[email protected].

Note PATCH 1-3 are the normal fixes and cleanup whereas PATCH 4-5
introduce new changes.

--

FYI, this series are designed not to break anything now and still do the
1-1 correspondence int-char mapping for each distinct task state we want to
report, and thus will not lose any details intended for debug purposes. Of
course, this might be compromised because of bugs introduced due to my
stupidity. So your sage comments are very important and appreciated!

--

In the status quo, we should see three different outcomes of the reported
sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
of tracepoint sched_switch. And it's not hard to figure out that the
former two are built upon the third one, and the reason why we see this
inconsistency is that the former two does not catch up with the internal
change of reported task state definitions as the kernel evolves.

IMHO, exporting internal representations of task state in the tracepoint
sched_switch is not a good practice and not encouraged at all, which can
easily break userspace tools that relies on it. Especially when tracepoints
are massively used in many observability tools nowadays due to its stable
nature, which makes them no longer used for debug only purpose and we
should be careful to decide what ought to be reported to userspace and what
ought not.

Therefore, to fix the issues mentioned above for good, I proposed to add
a new variable to report task state in sched_switch with a symbolic char
along with the old hardcoded value, and save the further processing of
userspace tools and spare them from knowing implementation details in the
kernel.

After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint
only.

Reviews welcome!

Regards,
Ze

Ze Gao (5):
perf sched: sync state char array with the kernel
perf sched: reorganize sched-out task state report code
sched, tracing: reorganize fields of switch event struct
sched, tracing: add to report task state in symbolic chars
perf sched: prefer to use prev_state_char introduced in sched_switch

include/trace/events/sched.h | 70 +++++++++++++++++-------------
tools/perf/builtin-sched.c | 83 +++++++++++++++++-------------------
2 files changed, 78 insertions(+), 75 deletions(-)

--
2.41.0



2023-08-03 10:43:16

by Ze Gao

[permalink] [raw]
Subject: [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars

Internal representations of task state are likely to be changed
or ordered, and reporting them to userspace without exporting
them as part of API is basically wrong, which can easily break
a userspace observability tool as kernel evolves. For example,
perf suffers from this and still reports wrong states as of this
writing.

OTOH, some masqueraded states like TASK_REPORT_IDLE and
TASK_REPORT_MAX are also reported inadvertently, which confuses
things even more and most userspace tools do not even take them
into consideration.

So add a new variable in company with the old raw value to
report task state in symbolic chars, which are self-explaining
and no further translation is needed. Of course this does not
break any userspace tool.

Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
the old conventions for the rest.

Signed-off-by: Ze Gao <[email protected]>
Reviewed-by: Masami Hiramatsu (Google) <[email protected]>
Acked-by: Ian Rogers <[email protected]>
---
include/trace/events/sched.h | 44 ++++++++++++++++++++++--------------
1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 43492daefa6f..ae5b486cc969 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -6,6 +6,7 @@
#define _TRACE_SCHED_H

#include <linux/kthread.h>
+#include <linux/sched.h>
#include <linux/sched/numa_balancing.h>
#include <linux/tracepoint.h>
#include <linux/binfmts.h>
@@ -214,6 +215,27 @@ static inline short __trace_sched_switch_state(bool preempt,

return state ? (1 << (state - 1)) : state;
}
+
+static inline char __trace_sched_switch_state_char(bool preempt,
+ unsigned int prev_state,
+ struct task_struct *p)
+{
+ long state;
+
+#ifdef CONFIG_SCHED_DEBUG
+ WARN_ON_ONCE(p != current);
+#endif /* CONFIG_SCHED_DEBUG */
+
+ /*
+ * For PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
+ * conventions for the rest.
+ */
+ if (preempt)
+ return 'p';
+
+ state = __task_state_index(prev_state, p->exit_state);
+ return task_index_to_char(state);
+}
#endif /* CREATE_TRACE_POINTS */

/*
@@ -236,6 +258,7 @@ TRACE_EVENT(sched_switch,
__array( char, prev_comm, TASK_COMM_LEN )
__array( char, next_comm, TASK_COMM_LEN )
__field( short, prev_state )
+ __field( char, prev_state_char )
),

TP_fast_assign(
@@ -246,26 +269,13 @@ TRACE_EVENT(sched_switch,
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
__entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev);
+ __entry->prev_state_char = __trace_sched_switch_state_char(preempt, prev_state, prev);
/* XXX SCHED_DEADLINE */
),

- TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
- __entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
-
- (__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
- __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
- { TASK_INTERRUPTIBLE, "S" },
- { TASK_UNINTERRUPTIBLE, "D" },
- { __TASK_STOPPED, "T" },
- { __TASK_TRACED, "t" },
- { EXIT_DEAD, "X" },
- { EXIT_ZOMBIE, "Z" },
- { TASK_PARKED, "P" },
- { TASK_DEAD, "I" }) :
- "R",
-
- __entry->prev_state & TASK_REPORT_MAX ? "+" : "",
- __entry->next_comm, __entry->next_pid, __entry->next_prio)
+ TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%c ==> next_comm=%s next_pid=%d next_prio=%d",
+ __entry->prev_comm, __entry->prev_pid, __entry->prev_prio, __entry->prev_state_char, __entry->next_comm,
+ __entry->next_pid, __entry->next_prio)
);

/*
--
2.41.0


2023-08-03 11:05:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars

On Thu, 3 Aug 2023 04:33:51 -0400
Ze Gao <[email protected]> wrote:

> Internal representations of task state are likely to be changed
> or ordered, and reporting them to userspace without exporting
> them as part of API is basically wrong, which can easily break
> a userspace observability tool as kernel evolves. For example,
> perf suffers from this and still reports wrong states as of this
> writing.
>
> OTOH, some masqueraded states like TASK_REPORT_IDLE and
> TASK_REPORT_MAX are also reported inadvertently, which confuses
> things even more and most userspace tools do not even take them
> into consideration.
>
> So add a new variable in company with the old raw value to
> report task state in symbolic chars, which are self-explaining
> and no further translation is needed. Of course this does not
> break any userspace tool.
>
> Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
> the old conventions for the rest.

The above is actually good.


>
> Signed-off-by: Ze Gao <[email protected]>
> Reviewed-by: Masami Hiramatsu (Google) <[email protected]>
> Acked-by: Ian Rogers <[email protected]>
> ---
> include/trace/events/sched.h | 44 ++++++++++++++++++++++--------------
> 1 file changed, 27 insertions(+), 17 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 43492daefa6f..ae5b486cc969 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -6,6 +6,7 @@
> #define _TRACE_SCHED_H
>
> #include <linux/kthread.h>
> +#include <linux/sched.h>
> #include <linux/sched/numa_balancing.h>
> #include <linux/tracepoint.h>
> #include <linux/binfmts.h>
> @@ -214,6 +215,27 @@ static inline short __trace_sched_switch_state(bool preempt,
>
> return state ? (1 << (state - 1)) : state;
> }
> +
> +static inline char __trace_sched_switch_state_char(bool preempt,
> + unsigned int prev_state,
> + struct task_struct *p)
> +{
> + long state;
> +
> +#ifdef CONFIG_SCHED_DEBUG
> + WARN_ON_ONCE(p != current);
> +#endif /* CONFIG_SCHED_DEBUG */
> +
> + /*
> + * For PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
> + * conventions for the rest.
> + */
> + if (preempt)
> + return 'p';
> +
> + state = __task_state_index(prev_state, p->exit_state);
> + return task_index_to_char(state);
> +}
> #endif /* CREATE_TRACE_POINTS */
>
> /*
> @@ -236,6 +258,7 @@ TRACE_EVENT(sched_switch,
> __array( char, prev_comm, TASK_COMM_LEN )
> __array( char, next_comm, TASK_COMM_LEN )
> __field( short, prev_state )
> + __field( char, prev_state_char )
> ),
>
> TP_fast_assign(
> @@ -246,26 +269,13 @@ TRACE_EVENT(sched_switch,
> memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> __entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev);
> + __entry->prev_state_char = __trace_sched_switch_state_char(preempt, prev_state, prev);
> /* XXX SCHED_DEADLINE */
> ),
>
> - TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
> - __entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
> -
> - (__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
> - __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
> - { TASK_INTERRUPTIBLE, "S" },
> - { TASK_UNINTERRUPTIBLE, "D" },
> - { __TASK_STOPPED, "T" },
> - { __TASK_TRACED, "t" },
> - { EXIT_DEAD, "X" },
> - { EXIT_ZOMBIE, "Z" },
> - { TASK_PARKED, "P" },
> - { TASK_DEAD, "I" }) :
> - "R",

I just realized, I have user space code that looks at this. As in the format file we have:

print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, (REC->prev_state & ((((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) - 1)) ? __print_flags(REC->prev_state & ((((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) - 1), "|", { 0x00000001, "S" }, { 0x00000002, "D" }, { 0x00000004, "T" }, { 0x00000008, "t" }, { 0x00000010, "X" }, { 0x00000020, "Z" }, { 0x00000040, "P" }, { 0x00000080, "I" }) : "R", REC->prev_state & (((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio

And I have used this in applications to find out what values "S" and "D"
are.

So, we need to keep that still. But we can add the prev_state_char to the
output too.

"prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s[%c] ==> next_comm=%s next_pid=%d next_prio=%d"

> -
> - __entry->prev_state & TASK_REPORT_MAX ? "+" : "",
> - __entry->next_comm, __entry->next_pid, __entry->next_prio)
> + TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%c ==> next_comm=%s next_pid=%d next_prio=%d",
> + __entry->prev_comm, __entry->prev_pid, __entry->prev_prio, __entry->prev_state_char, __entry->next_comm,
> + __entry->next_pid, __entry->next_prio)
> );
>
> /*


-- Steve

2023-08-03 11:39:59

by Ze Gao

[permalink] [raw]
Subject: Re: [RFC PATCH v6 4/5] sched, tracing: add to report task state in symbolic chars

On Thu, Aug 3, 2023 at 5:29 PM Steven Rostedt <[email protected]> wrote:
>
> On Thu, 3 Aug 2023 04:33:51 -0400
> Ze Gao <[email protected]> wrote:
>
> > Internal representations of task state are likely to be changed
> > or ordered, and reporting them to userspace without exporting
> > them as part of API is basically wrong, which can easily break
> > a userspace observability tool as kernel evolves. For example,
> > perf suffers from this and still reports wrong states as of this
> > writing.
> >
> > OTOH, some masqueraded states like TASK_REPORT_IDLE and
> > TASK_REPORT_MAX are also reported inadvertently, which confuses
> > things even more and most userspace tools do not even take them
> > into consideration.
> >
> > So add a new variable in company with the old raw value to
> > report task state in symbolic chars, which are self-explaining
> > and no further translation is needed. Of course this does not
> > break any userspace tool.
> >
> > Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
> > the old conventions for the rest.
>
> The above is actually good.
>
>
> >
> > Signed-off-by: Ze Gao <[email protected]>
> > Reviewed-by: Masami Hiramatsu (Google) <[email protected]>
> > Acked-by: Ian Rogers <[email protected]>
> > ---
> > include/trace/events/sched.h | 44 ++++++++++++++++++++++--------------
> > 1 file changed, 27 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > index 43492daefa6f..ae5b486cc969 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -6,6 +6,7 @@
> > #define _TRACE_SCHED_H
> >
> > #include <linux/kthread.h>
> > +#include <linux/sched.h>
> > #include <linux/sched/numa_balancing.h>
> > #include <linux/tracepoint.h>
> > #include <linux/binfmts.h>
> > @@ -214,6 +215,27 @@ static inline short __trace_sched_switch_state(bool preempt,
> >
> > return state ? (1 << (state - 1)) : state;
> > }
> > +
> > +static inline char __trace_sched_switch_state_char(bool preempt,
> > + unsigned int prev_state,
> > + struct task_struct *p)
> > +{
> > + long state;
> > +
> > +#ifdef CONFIG_SCHED_DEBUG
> > + WARN_ON_ONCE(p != current);
> > +#endif /* CONFIG_SCHED_DEBUG */
> > +
> > + /*
> > + * For PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
> > + * conventions for the rest.
> > + */
> > + if (preempt)
> > + return 'p';
> > +
> > + state = __task_state_index(prev_state, p->exit_state);
> > + return task_index_to_char(state);
> > +}
> > #endif /* CREATE_TRACE_POINTS */
> >
> > /*
> > @@ -236,6 +258,7 @@ TRACE_EVENT(sched_switch,
> > __array( char, prev_comm, TASK_COMM_LEN )
> > __array( char, next_comm, TASK_COMM_LEN )
> > __field( short, prev_state )
> > + __field( char, prev_state_char )
> > ),
> >
> > TP_fast_assign(
> > @@ -246,26 +269,13 @@ TRACE_EVENT(sched_switch,
> > memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
> > memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
> > __entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev);
> > + __entry->prev_state_char = __trace_sched_switch_state_char(preempt, prev_state, prev);
> > /* XXX SCHED_DEADLINE */
> > ),
> >
> > - TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
> > - __entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
> > -
> > - (__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
> > - __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
> > - { TASK_INTERRUPTIBLE, "S" },
> > - { TASK_UNINTERRUPTIBLE, "D" },
> > - { __TASK_STOPPED, "T" },
> > - { __TASK_TRACED, "t" },
> > - { EXIT_DEAD, "X" },
> > - { EXIT_ZOMBIE, "Z" },
> > - { TASK_PARKED, "P" },
> > - { TASK_DEAD, "I" }) :
> > - "R",
>
> I just realized, I have user space code that looks at this. As in the format file we have:
>
> print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d", REC->prev_comm, REC->prev_pid, REC->prev_prio, (REC->prev_state & ((((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) - 1)) ? __print_flags(REC->prev_state & ((((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) - 1), "|", { 0x00000001, "S" }, { 0x00000002, "D" }, { 0x00000004, "T" }, { 0x00000008, "t" }, { 0x00000010, "X" }, { 0x00000020, "Z" }, { 0x00000040, "P" }, { 0x00000080, "I" }) : "R", REC->prev_state & (((0x00000000 | 0x00000001 | 0x00000002 | 0x00000004 | 0x00000008 | 0x00000010 | 0x00000020 | 0x00000040) + 1) << 1) ? "+" : "", REC->next_comm, REC->next_pid, REC->next_prio
>
> And I have used this in applications to find out what values "S" and "D"
> are.
>
> So, we need to keep that still. But we can add the prev_state_char to the
> output too.
>
> "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s[%c] ==> next_comm=%s next_pid=%d next_prio=%d"

Good point!

But I see Peter has strong opinions over this change badly. So I'm not
sure if it's worth the effort to push this anymore :/ And apparently We
failed to convince each other over this problem.

How about we only keep all the fixing patches and throw away this
'prev_state_char' thing?

Again, I'm not meant to upset anyone here. But Tons of thanks for your
sage reviews on this.

Thoughts?

Thanks,
Ze