2023-08-02 13:10:12

by Ze Gao

[permalink] [raw]
Subject: [RFC PATCH v5 0/7] fix task state report from sched tracepoint

This is the 5th attempt to fix the report task state issue in sched
tracepint, you can check out previous discussions here:

v1: https://lore.kernel.org/linux-trace-kernel/[email protected]
v2: https://lore.kernel.org/linux-trace-kernel/[email protected]
v3: https://lore.kernel.org/linux-trace-kernel/[email protected]
v4: sorry for my stupid copy-paste, which makes patch broken and cannot be applied

Against v3, reorganize commits so that they can be cherry-picked
individually if people indeed have strong opinion over this change.

Note PATCH 1-4 are the normal fixes and cleanup whereas PATCH 5-7
introduce new changes.

--

FYI, this series are designed not to break anything now and still do the
1-1 correspondence int-char mapping for each distinct task state we want to
report, and thus will not lose any details intended for debug purposes. Of
course, this might be compromised because of bugs introduced due to my
stupidity. So your sage comments are very important and appreciated!

--

In the status quo, we should see three different outcomes of the reported
sched-out task state from perf-script, perf-sched-timehist, and Tp_printk
of tracepoint sched_switch. And it's not hard to figure out that the
former two are built upon the third one, and the reason why we see this
inconsistency is that the former two does not catch up with the internal
change of reported task state definitions as the kernel evolves.

IMHO, exporting internal representations of task state in the tracepoint
sched_switch is not a good practice and not encouraged at all, which can
easily break userspace tools that relies on it. Especially when tracepoints
are massively used in many observability tools nowadays due to its stable
nature, which makes them no longer used for debug only purpose and we
should be careful to decide what ought to be reported to userspace and what
ought not.

Therefore, to fix the issues mentioned above for good, I proposed to add
a new variable to report task state in sched_switch with a symbolic char
along with the old hardcoded value, and save the further processing of
userspace tools and spare them from knowing implementation details in the
kernel.

After this patch seires, we report 'RSDTtXZPI' the same as in procfs, plus
a 'p' which denotes PREEMP_ACTIVE and is used for sched_switch tracepoint
only.

Reviews welcome!

Regards,
Ze

--

Ze Gao (5):
perf sched: sync state char array with the kernel
perf sched: reorganize sched-out task state report code
sched, tracing: reorganize fields of switch event struct
sched, tracing: add to report task state in symbolic chars
perf sched: prefer to use prev_state_char introduced in sched_switch

include/trace/events/sched.h | 68 +++++++++++++++++-------------
tools/perf/builtin-sched.c | 82 ++++++++++++++++--------------------
2 files changed, 76 insertions(+), 74 deletions(-)

Ze Gao (2):
libtraceevent: sync state char array with the kernel
libtraceevent: prefer to use prev_state_char introduced in
sched_switch

plugins/plugin_sched_switch.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

--
2.41.0



2023-08-02 13:13:03

by Ze Gao

[permalink] [raw]
Subject: [RFC PATCH v5 2/7] perf sched: sync state char array with the kernel

Update state char array and then remove unused and stale
macros, which are kernel internal representations and not
encouraged to use anymore.

Signed-off-by: Ze Gao <[email protected]>
---
tools/perf/builtin-sched.c | 13 +------------
1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 9ab300b6f131..8dc8f071721c 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -92,23 +92,12 @@ struct sched_atom {
struct task_desc *wakee;
};

-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"

/* task state bitmask, copied from include/linux/sched.h */
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
-#define __TASK_STOPPED 4
-#define __TASK_TRACED 8
-/* in tsk->exit_state */
-#define EXIT_DEAD 16
-#define EXIT_ZOMBIE 32
-#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
-/* in tsk->state again */
-#define TASK_DEAD 64
-#define TASK_WAKEKILL 128
-#define TASK_WAKING 256
-#define TASK_PARKED 512

enum thread_state {
THREAD_SLEEPING = 0,
--
2.41.0


2023-08-02 13:32:50

by Ze Gao

[permalink] [raw]
Subject: [RFC PATCH v4 6/7] libtraceevent: prefer to use prev_state_char introduced in sched_switch

Since the sched_switch tracepoint introduces a new variable to
report sched-out task state in symbolic char, we prefer to use
it to spare from knowing internal implementations in kernel.

Also we keep the old parsing logic intact but sync the state char
array with the latest kernel.

Signed-off-by: Ze Gao <[email protected]>
---
plugins/plugin_sched_switch.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/plugins/plugin_sched_switch.c b/plugins/plugin_sched_switch.c
index e0986ac..4c57322 100644
--- a/plugins/plugin_sched_switch.c
+++ b/plugins/plugin_sched_switch.c
@@ -11,7 +11,7 @@

static void write_state(struct trace_seq *s, int val)
{
- const char states[] = "SDTtXZPI";
+ const char states[] = "SDTtXZPIp";
int found = 0;
int i;

@@ -99,7 +99,12 @@ static int sched_switch_handler(struct trace_seq *s,
if (tep_get_field_val(s, event, "prev_prio", record, &val, 1) == 0)
trace_seq_printf(s, "[%d] ", (int) val);

- if (tep_get_field_val(s, event, "prev_state", record, &val, 1) == 0)
+ //find if has prev_state_char, otherwise fallback to prev_state
+ if (tep_find_field(event, "prev_state_char")) {
+ if (tep_get_field_val(s, event, "prev_state_char", record, &val, 1) == 0)
+ trace_seq_putc(s, (char) val);
+ }
+ else if (tep_get_field_val(s, event, "prev_state", record, &val, 1) == 0)
write_state(s, val);

trace_seq_puts(s, " ==> ");
--
2.41.0


2023-08-02 13:33:51

by Ze Gao

[permalink] [raw]
Subject: [RFC PATCH v5 3/7] perf sched: reorganize sched-out task state report code

Mainly does housekeeping work and not introduce any
functional change.

Signed-off-by: Ze Gao <[email protected]>
---
tools/perf/builtin-sched.c | 57 ++++++++++++++++----------------------
1 file changed, 24 insertions(+), 33 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 8dc8f071721c..eb310d1a7625 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -94,11 +94,6 @@ struct sched_atom {

#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"

-/* task state bitmask, copied from include/linux/sched.h */
-#define TASK_RUNNING 0
-#define TASK_INTERRUPTIBLE 1
-#define TASK_UNINTERRUPTIBLE 2
-
enum thread_state {
THREAD_SLEEPING = 0,
THREAD_WAIT_CPU,
@@ -255,7 +250,7 @@ struct thread_runtime {
u64 total_preempt_time;
u64 total_delay_time;

- int last_state;
+ char last_state;

char shortname[3];
bool comm_changed;
@@ -425,7 +420,7 @@ static void add_sched_event_wakeup(struct perf_sched *sched, struct task_desc *t
}

static void add_sched_event_sleep(struct perf_sched *sched, struct task_desc *task,
- u64 timestamp, u64 task_state __maybe_unused)
+ u64 timestamp, char task_state __maybe_unused)
{
struct sched_atom *event = get_new_event(task, timestamp);

@@ -840,6 +835,20 @@ replay_wakeup_event(struct perf_sched *sched,
return 0;
}

+static inline char task_state_char(int state)
+{
+ static const char state_to_char[] = "RSDTtXZPI";
+ unsigned bit = state ? ffs(state) : 0;
+ return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
+}
+
+static inline char get_task_prev_state(struct evsel *evsel,
+ struct perf_sample *sample)
+{
+ const int prev_state = evsel__intval(evsel, sample, "prev_state");
+ return task_state_char(prev_state);
+}
+
static int replay_switch_event(struct perf_sched *sched,
struct evsel *evsel,
struct perf_sample *sample,
@@ -849,7 +858,7 @@ static int replay_switch_event(struct perf_sched *sched,
*next_comm = evsel__strval(evsel, sample, "next_comm");
const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
next_pid = evsel__intval(evsel, sample, "next_pid");
- const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+ const char prev_state = get_task_prev_state(evsel, sample);
struct task_desc *prev, __maybe_unused *next;
u64 timestamp0, timestamp = sample->time;
int cpu = sample->cpu;
@@ -1039,12 +1048,6 @@ static int thread_atoms_insert(struct perf_sched *sched, struct thread *thread)
return 0;
}

-static char sched_out_state(u64 prev_state)
-{
- const char *str = TASK_STATE_TO_CHAR_STR;
-
- return str[prev_state];
-}

static int
add_sched_out_event(struct work_atoms *atoms,
@@ -1121,7 +1124,7 @@ static int latency_switch_event(struct perf_sched *sched,
{
const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
next_pid = evsel__intval(evsel, sample, "next_pid");
- const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+ const char prev_state = get_task_prev_state(evsel, sample);
struct work_atoms *out_events, *in_events;
struct thread *sched_out, *sched_in;
u64 timestamp0, timestamp = sample->time;
@@ -1157,7 +1160,7 @@ static int latency_switch_event(struct perf_sched *sched,
goto out_put;
}
}
- if (add_sched_out_event(out_events, sched_out_state(prev_state), timestamp))
+ if (add_sched_out_event(out_events, prev_state, timestamp))
return -1;

in_events = thread_atoms_search(&sched->atom_root, sched_in, &sched->cmp_pid);
@@ -2022,24 +2025,12 @@ static void timehist_header(struct perf_sched *sched)
printf("\n");
}

-static char task_state_char(struct thread *thread, int state)
-{
- static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
- unsigned bit = state ? ffs(state) : 0;
-
- /* 'I' for idle */
- if (thread__tid(thread) == 0)
- return 'I';
-
- return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
-}
-
static void timehist_print_sample(struct perf_sched *sched,
struct evsel *evsel,
struct perf_sample *sample,
struct addr_location *al,
struct thread *thread,
- u64 t, int state)
+ u64 t, char state)
{
struct thread_runtime *tr = thread__priv(thread);
const char *next_comm = evsel__strval(evsel, sample, "next_comm");
@@ -2080,7 +2071,7 @@ static void timehist_print_sample(struct perf_sched *sched,
print_sched_time(tr->dt_run, 6);

if (sched->show_state)
- printf(" %5c ", task_state_char(thread, state));
+ printf(" %5c ", thread->tid == 0 ? 'I' : state);

if (sched->show_next) {
snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm, next_pid);
@@ -2152,9 +2143,9 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
else if (r->last_time) {
u64 dt_wait = tprev - r->last_time;

- if (r->last_state == TASK_RUNNING)
+ if (r->last_state == 'R')
r->dt_preempt = dt_wait;
- else if (r->last_state == TASK_UNINTERRUPTIBLE)
+ else if (r->last_state == 'D')
r->dt_iowait = dt_wait;
else
r->dt_sleep = dt_wait;
@@ -2579,7 +2570,7 @@ static int timehist_sched_change_event(struct perf_tool *tool,
struct thread_runtime *tr = NULL;
u64 tprev, t = sample->time;
int rc = 0;
- int state = evsel__intval(evsel, sample, "prev_state");
+ const char state = get_task_prev_state(evsel, sample);

addr_location__init(&al);
if (machine__resolve(machine, &al, sample) < 0) {
--
2.41.0


2023-08-02 14:07:30

by Ze Gao

[permalink] [raw]
Subject: [RFC PATCH v4 1/7] libtraceevent: sync state char array with the kernel

Update state char array to match the latest kernel
definitions.

Signed-off-by: Ze Gao <[email protected]>
---
plugins/plugin_sched_switch.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/plugins/plugin_sched_switch.c b/plugins/plugin_sched_switch.c
index 8752cae..e0986ac 100644
--- a/plugins/plugin_sched_switch.c
+++ b/plugins/plugin_sched_switch.c
@@ -11,7 +11,7 @@

static void write_state(struct trace_seq *s, int val)
{
- const char states[] = "SDTtZXxW";
+ const char states[] = "SDTtXZPI";
int found = 0;
int i;

--
2.41.0


2023-08-02 14:14:37

by Ze Gao

[permalink] [raw]
Subject: [RFC PATCH v5 4/7] sched, tracing: reorganize fields of switch event struct

Report priorities in 'short' and prev_state in 'int' to save
some buffer space. And also reorder the fields so that we take
struct alignment into consideration to make the record compact.

Suggested-by: Steven Rostedt (Google) <[email protected]>
Signed-off-by: Ze Gao <[email protected]>
---
include/trace/events/sched.h | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index fbb99a61f714..7d34db20b2c6 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -187,7 +187,7 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
TP_ARGS(p));

#ifdef CREATE_TRACE_POINTS
-static inline long __trace_sched_switch_state(bool preempt,
+static inline int __trace_sched_switch_state(bool preempt,
unsigned int prev_state,
struct task_struct *p)
{
@@ -229,23 +229,23 @@ TRACE_EVENT(sched_switch,
TP_ARGS(preempt, prev, next, prev_state),

TP_STRUCT__entry(
- __array( char, prev_comm, TASK_COMM_LEN )
__field( pid_t, prev_pid )
- __field( int, prev_prio )
- __field( long, prev_state )
- __array( char, next_comm, TASK_COMM_LEN )
__field( pid_t, next_pid )
- __field( int, next_prio )
+ __field( short, prev_prio )
+ __field( short, next_prio )
+ __field( int, prev_state )
+ __array( char, prev_comm, TASK_COMM_LEN )
+ __array( char, next_comm, TASK_COMM_LEN )
),

TP_fast_assign(
- memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
- __entry->prev_pid = prev->pid;
- __entry->prev_prio = prev->prio;
- __entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev);
+ __entry->prev_pid = prev->pid;
+ __entry->next_pid = next->pid;
+ __entry->prev_prio = (short) prev->prio;
+ __entry->next_prio = (short) next->prio;
+ __entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev);
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
- __entry->next_pid = next->pid;
- __entry->next_prio = next->prio;
+ memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
/* XXX SCHED_DEADLINE */
),

--
2.41.0


2023-08-02 14:21:51

by Ze Gao

[permalink] [raw]
Subject: [RFC PATCH v5 5/7] sched, tracing: add to report task state in symbolic chars

Internal representations of task state are likely to be changed
or ordered, and reporting them to userspace without exporting
them as part of API is basically wrong, which can easily break
a userspace observability tool as kernel evolves. For example,
perf suffers from this and still reports wrong states as of this
writing.

OTOH, some masqueraded states like TASK_REPORT_IDLE and
TASK_REPORT_MAX are also reported inadvertently, which confuses
things even more and most userspace tools do not even take them
into consideration.

So add a new variable in company with the old raw value to
report task state in symbolic chars, which are self-explaining
and no further translation is needed. Of course this does not
break any userspace tool.

Note for PREEMPT_ACTIVE, we introduce 'p' to report it and use
the old conventions for the rest.

Signed-off-by: Ze Gao <[email protected]>
Reviewed-by: Masami Hiramatsu (Google) <[email protected]>
Acked-by: Ian Rogers <[email protected]>
---
include/trace/events/sched.h | 44 ++++++++++++++++++++++--------------
1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7d34db20b2c6..1c7b94793495 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -6,6 +6,7 @@
#define _TRACE_SCHED_H

#include <linux/kthread.h>
+#include <linux/sched.h>
#include <linux/sched/numa_balancing.h>
#include <linux/tracepoint.h>
#include <linux/binfmts.h>
@@ -214,6 +215,27 @@ static inline int __trace_sched_switch_state(bool preempt,

return state ? (1 << (state - 1)) : state;
}
+
+static inline char __trace_sched_switch_state_char(bool preempt,
+ unsigned int prev_state,
+ struct task_struct *p)
+{
+ long state;
+
+#ifdef CONFIG_SCHED_DEBUG
+ BUG_ON(p != current);
+#endif /* CONFIG_SCHED_DEBUG */
+
+ /*
+ * For PREEMPT_ACTIVE, we introduce 'p' to report it and use the old
+ * conventions for the rest.
+ */
+ if (preempt)
+ return 'p';
+
+ state = __task_state_index(prev_state, p->exit_state);
+ return task_index_to_char(state);
+}
#endif /* CREATE_TRACE_POINTS */

/*
@@ -236,6 +258,7 @@ TRACE_EVENT(sched_switch,
__field( int, prev_state )
__array( char, prev_comm, TASK_COMM_LEN )
__array( char, next_comm, TASK_COMM_LEN )
+ __field( char, prev_state_char )
),

TP_fast_assign(
@@ -246,26 +269,13 @@ TRACE_EVENT(sched_switch,
__entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev);
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+ __entry->prev_state_char = __trace_sched_switch_state_char(preempt, prev_state, prev);
/* XXX SCHED_DEADLINE */
),

- TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
- __entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
-
- (__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
- __print_flags(__entry->prev_state & (TASK_REPORT_MAX - 1), "|",
- { TASK_INTERRUPTIBLE, "S" },
- { TASK_UNINTERRUPTIBLE, "D" },
- { __TASK_STOPPED, "T" },
- { __TASK_TRACED, "t" },
- { EXIT_DEAD, "X" },
- { EXIT_ZOMBIE, "Z" },
- { TASK_PARKED, "P" },
- { TASK_DEAD, "I" }) :
- "R",
-
- __entry->prev_state & TASK_REPORT_MAX ? "+" : "",
- __entry->next_comm, __entry->next_pid, __entry->next_prio)
+ TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%c ==> next_comm=%s next_pid=%d next_prio=%d",
+ __entry->prev_comm, __entry->prev_pid, __entry->prev_prio, __entry->prev_state_char, __entry->next_comm,
+ __entry->next_pid, __entry->next_prio)
);

/*
--
2.41.0


2023-08-02 14:28:19

by Ze Gao

[permalink] [raw]
Subject: [RFC PATCH v5 7/7] perf sched: prefer to use prev_state_char introduced in sched_switch

Since the sched_switch tracepoint introduces a new variable to
report sched-out task state in symbolic char, we prefer to use
it to spare from knowing internal implementations in kernel.

Also we keep the old parsing logic intact but sync the state char
array with the latest kernel.

Signed-off-by: Ze Gao <[email protected]>
---
tools/perf/builtin-sched.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index eb310d1a7625..7f76ba51e36d 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -837,7 +837,7 @@ replay_wakeup_event(struct perf_sched *sched,

static inline char task_state_char(int state)
{
- static const char state_to_char[] = "RSDTtXZPI";
+ static const char state_to_char[] = "RSDTtXZPIp";
unsigned bit = state ? ffs(state) : 0;
return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
}
@@ -845,8 +845,20 @@ static inline char task_state_char(int state)
static inline char get_task_prev_state(struct evsel *evsel,
struct perf_sample *sample)
{
- const int prev_state = evsel__intval(evsel, sample, "prev_state");
- return task_state_char(prev_state);
+ char prev_state_char;
+ int prev_state;
+
+ //prefer to use prev_state_char
+ if (evsel__field(evsel, "prev_state_char"))
+ prev_state_char = (char) evsel__intval(evsel,
+ sample, "prev_state_char");
+ else {
+ prev_state = (int) evsel__intval(evsel,
+ sample, "prev_state");
+ prev_state_char = task_state_char(prev_state);
+ }
+
+ return prev_state_char;
}

static int replay_switch_event(struct perf_sched *sched,
@@ -2143,7 +2155,7 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
else if (r->last_time) {
u64 dt_wait = tprev - r->last_time;

- if (r->last_state == 'R')
+ if (r->last_state == 'R' || r->last_state == 'p')
r->dt_preempt = dt_wait;
else if (r->last_state == 'D')
r->dt_iowait = dt_wait;
--
2.41.0