2023-08-03 09:42:28

by Ze Gao

[permalink] [raw]
Subject: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel

Update state char array and then remove unused and stale
macros, which are kernel internal representations and not
encouraged to use anymore.

Signed-off-by: Ze Gao <[email protected]>
---
tools/perf/builtin-sched.c | 13 +------------
1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 9ab300b6f131..8dc8f071721c 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -92,23 +92,12 @@ struct sched_atom {
struct task_desc *wakee;
};

-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"

/* task state bitmask, copied from include/linux/sched.h */
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
-#define __TASK_STOPPED 4
-#define __TASK_TRACED 8
-/* in tsk->exit_state */
-#define EXIT_DEAD 16
-#define EXIT_ZOMBIE 32
-#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
-/* in tsk->state again */
-#define TASK_DEAD 64
-#define TASK_WAKEKILL 128
-#define TASK_WAKING 256
-#define TASK_PARKED 512

enum thread_state {
THREAD_SLEEPING = 0,
--
2.41.0



2023-08-03 11:04:31

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel

On Thu, 3 Aug 2023 04:33:48 -0400
Ze Gao <[email protected]> wrote:

Hi Ze,

> Update state char array and then remove unused and stale
> macros, which are kernel internal representations and not
> encouraged to use anymore.
>

A couple of things.

First, the change logs of every commit need to specify the "why". The
subject can say "what", but the change log really needs to explain why this
patch is important. For example, this patch is really two changes (and thus
should actually be two patches). (I'll also comment on the other patches)

1. The update of the state char array. You should explain why it's being
updated. If it was wrong, it needs to state the commit that changed to make
that happen.

2. For the removing the stale macros, the change log can simply state that
the macros are unused in the code and are being removed.

Finally, I know you're eager to get this patch set in, but please hold off
sending a new version immediately after a comment or two. Some maintainers
prefer submitters to wait a week or so, otherwise you will tend to "spam"
their inboxes. There's more than one maintainer Cc'd on this series, and you
need to be courteous not to send too many emails in a short period of time.

-- Steve


> Signed-off-by: Ze Gao <[email protected]>
> ---
> tools/perf/builtin-sched.c | 13 +------------
> 1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index 9ab300b6f131..8dc8f071721c 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -92,23 +92,12 @@ struct sched_atom {
> struct task_desc *wakee;
> };
>
> -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> +#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
>
> /* task state bitmask, copied from include/linux/sched.h */
> #define TASK_RUNNING 0
> #define TASK_INTERRUPTIBLE 1
> #define TASK_UNINTERRUPTIBLE 2
> -#define __TASK_STOPPED 4
> -#define __TASK_TRACED 8
> -/* in tsk->exit_state */
> -#define EXIT_DEAD 16
> -#define EXIT_ZOMBIE 32
> -#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
> -/* in tsk->state again */
> -#define TASK_DEAD 64
> -#define TASK_WAKEKILL 128
> -#define TASK_WAKING 256
> -#define TASK_PARKED 512
>
> enum thread_state {
> THREAD_SLEEPING = 0,


2023-08-03 11:05:42

by Ze Gao

[permalink] [raw]
Subject: Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel

On Thu, Aug 3, 2023 at 5:09 PM Steven Rostedt <[email protected]> wrote:
>
> On Thu, 3 Aug 2023 04:33:48 -0400
> Ze Gao <[email protected]> wrote:
>
> Hi Ze,
>
> > Update state char array and then remove unused and stale
> > macros, which are kernel internal representations and not
> > encouraged to use anymore.
> >
>
> A couple of things.
>
> First, the change logs of every commit need to specify the "why". The
> subject can say "what", but the change log really needs to explain why this
> patch is important. For example, this patch is really two changes (and thus
> should actually be two patches). (I'll also comment on the other patches)

Thanks for the feedback! Will elaborate the changes in each changelog.

> 1. The update of the state char array. You should explain why it's being
> updated. If it was wrong, it needs to state the commit that changed to make
> that happen.
>
> 2. For the removing the stale macros, the change log can simply state that
> the macros are unused in the code and are being removed.
>
> Finally, I know you're eager to get this patch set in, but please hold off
> sending a new version immediately after a comment or two. Some maintainers
> prefer submitters to wait a week or so, otherwise you will tend to "spam"
> their inboxes. There's more than one maintainer Cc'd on this series, and you
> need to be courteous not to send too many emails in a short period of time.

Noted! Actually I'm in no rush and just to make sure people see the
latest patches so they do not have to waste time on the old series.

Will hold off to resolve all the comments in this thread.

And thanks for pointing this out.

Regards,
Ze

2023-08-03 13:21:15

by Ze Gao

[permalink] [raw]
Subject: Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel

> 2. For the removing the stale macros, the change log can simply state that
> the macros are unused in the code and are being removed.

I've split this part into a separate patch, and here is the changelog:

perf sched: cleanup to remove unused macros

The macros copied from kernel headers are unused and
stale in the code and are being removed to avoid confusions.

Signed-off-by: Ze Gao <[email protected]>

Regards,
Ze

2023-08-03 14:34:19

by Ze Gao

[permalink] [raw]
Subject: Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel

Hi,

THIS IS THE NEW CHANGELOG FOR THIS PATCH:

perf sched: sync state char array with the kernel

Since commit e936e8e459e14 ("perf tools: Adapt the
TASK_STATE_TO_CHAR_STR to new value in kernel space."),
the state char array that is used to interpret the
states of tasks being switched out have not synced
once with kernel definitions. Whereas the task report
logic is evolving over this time and the definition
of this state char array has been changed multiple
times. And this leads to inconsistency.

As of this writing, perf timehist --state still reports
the wrong states because TASK_STATE_TO_CHAR_STR is too
outdated to use.

So sync TASK_STATE_TO_CHAR_STR to match the latest kernel
definitions to fix it.

Signed-off-by: Ze Gao <[email protected]>

Regards,
Ze

On Thu, Aug 3, 2023 at 6:29 PM Ze Gao <[email protected]> wrote:
>
> On Thu, Aug 3, 2023 at 5:09 PM Steven Rostedt <[email protected]> wrote:
> >
> > On Thu, 3 Aug 2023 04:33:48 -0400
> > Ze Gao <[email protected]> wrote:
> >
> > Hi Ze,
> >
> > > Update state char array and then remove unused and stale
> > > macros, which are kernel internal representations and not
> > > encouraged to use anymore.
> > >
> >
> > A couple of things.
> >
> > First, the change logs of every commit need to specify the "why". The
> > subject can say "what", but the change log really needs to explain why this
> > patch is important. For example, this patch is really two changes (and thus
> > should actually be two patches). (I'll also comment on the other patches)
>
> Thanks for the feedback! Will elaborate the changes in each changelog.
>
> > 1. The update of the state char array. You should explain why it's being
> > updated. If it was wrong, it needs to state the commit that changed to make
> > that happen.
> >
> > 2. For the removing the stale macros, the change log can simply state that
> > the macros are unused in the code and are being removed.
> >
> > Finally, I know you're eager to get this patch set in, but please hold off
> > sending a new version immediately after a comment or two. Some maintainers
> > prefer submitters to wait a week or so, otherwise you will tend to "spam"
> > their inboxes. There's more than one maintainer Cc'd on this series, and you
> > need to be courteous not to send too many emails in a short period of time.
>
> Noted! Actually I'm in no rush and just to make sure people see the
> latest patches so they do not have to waste time on the old series.
>
> Will hold off to resolve all the comments in this thread.
>
> And thanks for pointing this out.
>
> Regards,
> Ze

2023-08-03 16:00:17

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel

On Thu, 3 Aug 2023 04:33:48 -0400
Ze Gao <[email protected]> wrote:

> Update state char array and then remove unused and stale
> macros, which are kernel internal representations and not
> encouraged to use anymore.
>
> Signed-off-by: Ze Gao <[email protected]>
> ---
> tools/perf/builtin-sched.c | 13 +------------
> 1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> index 9ab300b6f131..8dc8f071721c 100644
> --- a/tools/perf/builtin-sched.c
> +++ b/tools/perf/builtin-sched.c
> @@ -92,23 +92,12 @@ struct sched_atom {
> struct task_desc *wakee;
> };
>
> -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> +#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"

Thinking about this more, this will always be wrong. Changing it just works
for the kernel you made the change for, but if it is run on another kernel,
it's broken again.

I actually wrote code once that basically just did a:

struct trace_seq s;

trace_seq_init(&s);
tep_print_event(tep, &s, record, "%s", TEP_PRINT_INFO);

then searched s.buffer for "prev_state=%s ", to find the state character.

That's because the kernel should always be up to date (and why I said I
needed that string in the print_fmt).

As perf has a tep handle, this could be a helper function to extract the
state if needed, and get rind of relying on the above character array.

-- Steve


>
> /* task state bitmask, copied from include/linux/sched.h */
> #define TASK_RUNNING 0
> #define TASK_INTERRUPTIBLE 1
> #define TASK_UNINTERRUPTIBLE 2
> -#define __TASK_STOPPED 4
> -#define __TASK_TRACED 8
> -/* in tsk->exit_state */
> -#define EXIT_DEAD 16
> -#define EXIT_ZOMBIE 32
> -#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
> -/* in tsk->state again */
> -#define TASK_DEAD 64
> -#define TASK_WAKEKILL 128
> -#define TASK_WAKING 256
> -#define TASK_PARKED 512
>
> enum thread_state {
> THREAD_SLEEPING = 0,


2023-08-04 03:07:39

by Ze Gao

[permalink] [raw]
Subject: Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel

On Fri, Aug 4, 2023 at 10:21 AM Ze Gao <[email protected]> wrote:
>
> On Thu, Aug 3, 2023 at 11:10 PM Steven Rostedt <[email protected]> wrote:
> >
> > On Thu, 3 Aug 2023 04:33:48 -0400
> > Ze Gao <[email protected]> wrote:
> >
> > > Update state char array and then remove unused and stale
> > > macros, which are kernel internal representations and not
> > > encouraged to use anymore.
> > >
> > > Signed-off-by: Ze Gao <[email protected]>
> > > ---
> > > tools/perf/builtin-sched.c | 13 +------------
> > > 1 file changed, 1 insertion(+), 12 deletions(-)
> > >
> > > diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> > > index 9ab300b6f131..8dc8f071721c 100644
> > > --- a/tools/perf/builtin-sched.c
> > > +++ b/tools/perf/builtin-sched.c
> > > @@ -92,23 +92,12 @@ struct sched_atom {
> > > struct task_desc *wakee;
> > > };
> > >
> > > -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> > > +#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
> >
> > Thinking about this more, this will always be wrong. Changing it just works
> > for the kernel you made the change for, but if it is run on another kernel,
> > it's broken again.
>
> Indeed. There is no easy way to maintain backward compatibility unless
> we stop using this bizarre 'prev_state' field. Basically all its users suffer
> from this. That's why I believe this needs a fix to alert people does not
> use 'prev_state' anymore.
>
> > I actually wrote code once that basically just did a:
> >
> > struct trace_seq s;
> >
> > trace_seq_init(&s);
> > tep_print_event(tep, &s, record, "%s", TEP_PRINT_INFO);
> >
> > then searched s.buffer for "prev_state=%s ", to find the state character.
> >
> > That's because the kernel should always be up to date (and why I said I
> > needed that string in the print_fmt).
>
> Turing to building the state char array from print fmt string dynamically
> is a great idea. :)
>
> > As perf has a tep handle, this could be a helper function to extract the
> > state if needed, and get rind of relying on the above character array.
>
> I'll figure out how to make it happen.
>
> BTW, my last concern is that is there any better way to notice userspace to
> avoid interpreting task state out of 'prev_state'. Because the awkward thing
> happens again.

By userspace, I mean all tools consume 'prev_state' but don't have print fmt
available, taking bpf tracepoint for example.

Regards,
Ze

> Thanks,
> Ze
>
> > -- Steve
> >
> >
> > >
> > > /* task state bitmask, copied from include/linux/sched.h */
> > > #define TASK_RUNNING 0
> > > #define TASK_INTERRUPTIBLE 1
> > > #define TASK_UNINTERRUPTIBLE 2
> > > -#define __TASK_STOPPED 4
> > > -#define __TASK_TRACED 8
> > > -/* in tsk->exit_state */
> > > -#define EXIT_DEAD 16
> > > -#define EXIT_ZOMBIE 32
> > > -#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
> > > -/* in tsk->state again */
> > > -#define TASK_DEAD 64
> > > -#define TASK_WAKEKILL 128
> > > -#define TASK_WAKING 256
> > > -#define TASK_PARKED 512
> > >
> > > enum thread_state {
> > > THREAD_SLEEPING = 0,
> >

2023-08-04 03:32:02

by Ze Gao

[permalink] [raw]
Subject: Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel

On Thu, Aug 3, 2023 at 11:10 PM Steven Rostedt <[email protected]> wrote:
>
> On Thu, 3 Aug 2023 04:33:48 -0400
> Ze Gao <[email protected]> wrote:
>
> > Update state char array and then remove unused and stale
> > macros, which are kernel internal representations and not
> > encouraged to use anymore.
> >
> > Signed-off-by: Ze Gao <[email protected]>
> > ---
> > tools/perf/builtin-sched.c | 13 +------------
> > 1 file changed, 1 insertion(+), 12 deletions(-)
> >
> > diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> > index 9ab300b6f131..8dc8f071721c 100644
> > --- a/tools/perf/builtin-sched.c
> > +++ b/tools/perf/builtin-sched.c
> > @@ -92,23 +92,12 @@ struct sched_atom {
> > struct task_desc *wakee;
> > };
> >
> > -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> > +#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
>
> Thinking about this more, this will always be wrong. Changing it just works
> for the kernel you made the change for, but if it is run on another kernel,
> it's broken again.

Indeed. There is no easy way to maintain backward compatibility unless
we stop using this bizarre 'prev_state' field. Basically all its users suffer
from this. That's why I believe this needs a fix to alert people does not
use 'prev_state' anymore.

> I actually wrote code once that basically just did a:
>
> struct trace_seq s;
>
> trace_seq_init(&s);
> tep_print_event(tep, &s, record, "%s", TEP_PRINT_INFO);
>
> then searched s.buffer for "prev_state=%s ", to find the state character.
>
> That's because the kernel should always be up to date (and why I said I
> needed that string in the print_fmt).

Turing to building the state char array from print fmt string dynamically
is a great idea. :)

> As perf has a tep handle, this could be a helper function to extract the
> state if needed, and get rind of relying on the above character array.

I'll figure out how to make it happen.

BTW, my last concern is that is there any better way to notice userspace to
avoid interpreting task state out of 'prev_state'. Because the awkward thing
happens again.

Thanks,
Ze

> -- Steve
>
>
> >
> > /* task state bitmask, copied from include/linux/sched.h */
> > #define TASK_RUNNING 0
> > #define TASK_INTERRUPTIBLE 1
> > #define TASK_UNINTERRUPTIBLE 2
> > -#define __TASK_STOPPED 4
> > -#define __TASK_TRACED 8
> > -/* in tsk->exit_state */
> > -#define EXIT_DEAD 16
> > -#define EXIT_ZOMBIE 32
> > -#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
> > -/* in tsk->state again */
> > -#define TASK_DEAD 64
> > -#define TASK_WAKEKILL 128
> > -#define TASK_WAKING 256
> > -#define TASK_PARKED 512
> >
> > enum thread_state {
> > THREAD_SLEEPING = 0,
>

2023-08-04 03:56:42

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel

On Fri, 4 Aug 2023 11:19:06 +0800
Ze Gao <[email protected]> wrote:

> I realize this is not perfect as well after second thoughts, since this does not
> take offline use of perf into consideration. People might run perf on different
> machines than where the perf.data gets recorded, in which way what we get
> from /sys/kernel/debug/tracing/events/sched/sched_switch/format is likely
> different from the perf.data.

If perf data files does what trace.dat files do, it should save the
file formats in the data files. It should not be reading the kernel
when reading the data file.

With trace-cmd, you can do: trace-cmd dump --events

And it will show you all the formats of the events that it saved in the
file.

-- Steve

2023-08-04 05:17:54

by Ze Gao

[permalink] [raw]
Subject: Re: [RFC PATCH v6 1/5] perf sched: sync state char array with the kernel

On Fri, Aug 4, 2023 at 10:38 AM Ze Gao <[email protected]> wrote:
>
> On Fri, Aug 4, 2023 at 10:21 AM Ze Gao <[email protected]> wrote:
> >
> > On Thu, Aug 3, 2023 at 11:10 PM Steven Rostedt <[email protected]> wrote:
> > >
> > > On Thu, 3 Aug 2023 04:33:48 -0400
> > > Ze Gao <[email protected]> wrote:
> > >
> > > > Update state char array and then remove unused and stale
> > > > macros, which are kernel internal representations and not
> > > > encouraged to use anymore.
> > > >
> > > > Signed-off-by: Ze Gao <[email protected]>
> > > > ---
> > > > tools/perf/builtin-sched.c | 13 +------------
> > > > 1 file changed, 1 insertion(+), 12 deletions(-)
> > > >
> > > > diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
> > > > index 9ab300b6f131..8dc8f071721c 100644
> > > > --- a/tools/perf/builtin-sched.c
> > > > +++ b/tools/perf/builtin-sched.c
> > > > @@ -92,23 +92,12 @@ struct sched_atom {
> > > > struct task_desc *wakee;
> > > > };
> > > >
> > > > -#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
> > > > +#define TASK_STATE_TO_CHAR_STR "RSDTtXZPI"
> > >
> > > Thinking about this more, this will always be wrong. Changing it just works
> > > for the kernel you made the change for, but if it is run on another kernel,
> > > it's broken again.
> >
> > Indeed. There is no easy way to maintain backward compatibility unless
> > we stop using this bizarre 'prev_state' field. Basically all its users suffer
> > from this. That's why I believe this needs a fix to alert people does not
> > use 'prev_state' anymore.
> >
> > > I actually wrote code once that basically just did a:
> > >
> > > struct trace_seq s;
> > >
> > > trace_seq_init(&s);
> > > tep_print_event(tep, &s, record, "%s", TEP_PRINT_INFO);
> > >
> > > then searched s.buffer for "prev_state=%s ", to find the state character.
> > >
> > > That's because the kernel should always be up to date (and why I said I
> > > needed that string in the print_fmt).
> >
> > Turing to building the state char array from print fmt string dynamically
> > is a great idea. :)

I realize this is not perfect as well after second thoughts, since this does not
take offline use of perf into consideration. People might run perf on different
machines than where the perf.data gets recorded, in which way what we get
from /sys/kernel/debug/tracing/events/sched/sched_switch/format is likely
different from the perf.data.

So let's parse it from TEP_PRINT_INFO of each record instead of building
the state char array and rely on 'prev_state' again. At least this fix all tools
that have TEP_PRINT_INFO available.

Thanks,
Ze



> > > As perf has a tep handle, this could be a helper function to extract the
> > > state if needed, and get rind of relying on the above character array.
> >
> > I'll figure out how to make it happen.
> >
> > BTW, my last concern is that is there any better way to notice userspace to
> > avoid interpreting task state out of 'prev_state'. Because the awkward thing
> > happens again.
>
> By userspace, I mean all tools consume 'prev_state' but don't have print fmt
> available, taking bpf tracepoint for example.
>
> Regards,
> Ze
>
> > Thanks,
> > Ze
> >
> > > -- Steve
> > >
> > >
> > > >
> > > > /* task state bitmask, copied from include/linux/sched.h */
> > > > #define TASK_RUNNING 0
> > > > #define TASK_INTERRUPTIBLE 1
> > > > #define TASK_UNINTERRUPTIBLE 2
> > > > -#define __TASK_STOPPED 4
> > > > -#define __TASK_TRACED 8
> > > > -/* in tsk->exit_state */
> > > > -#define EXIT_DEAD 16
> > > > -#define EXIT_ZOMBIE 32
> > > > -#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
> > > > -/* in tsk->state again */
> > > > -#define TASK_DEAD 64
> > > > -#define TASK_WAKEKILL 128
> > > > -#define TASK_WAKING 256
> > > > -#define TASK_PARKED 512
> > > >
> > > > enum thread_state {
> > > > THREAD_SLEEPING = 0,
> > >

2023-08-10 07:35:22

by Ze Gao

[permalink] [raw]
Subject:

Hi Steven,

I managed to build task state char map dynamically by parsing
the tracepoint print format from data recorded by perf. And
likewise for libtraceevent.

FYI, I tried TEP_PRINT_INFO but no shot. It turns out TEP_PRINT_INFO
stills relies on libtraceevent (i.e., sched_switch_handler() in
plugin_sched_switch.c) and we need to parse the print format on our own.

Anyway, it works now and I've tested on some perf.data in old formats
but not cover all the kernel releases.

Thoughts?

Regards,
Ze


From 6b2035494952efb2963e6459ae4dbfce496c3b97 Mon Sep 17 00:00:00 2001
From: Ze Gao <[email protected]>
Date: Wed, 2 Aug 2023 08:19:54 -0400
Subject: [PATCH] perf sched: parse task state from tracepoint print format

As of this writing, we use prev_state to report task state,
which relies on both the task state macros and
TASK_STATE_TO_CHAR_STR in kernel to interpret its actual
meaning. In this way, perf gets broken literally each time
TASK_STATE_TO_CHAR_STR changes as kernel evolves. Counting
on TASK_STATE_TO_CHAR_STR gurantees no backward compatibilty.

To fix this, we build the state char map from the print
format parsed from perf.data on the fly and removes
dependencies on these internal kernel definitions.

Note that we provide an intended helper get_task_prev_state()
for extracting task state from perf record and passing task
state in char elsewhere, which helps to eliminate the need to
know task state macros further.

Signed-off-by: Ze Gao <[email protected]>
---
tools/perf/builtin-sched.c | 126 +++++++++++++++++++++++--------------
1 file changed, 80 insertions(+), 46 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 9ab300b6f131..9366bc0a991d 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -52,6 +52,10 @@
#define SYM_LEN 129
#define MAX_PID 1024000

+#define TASK_STATE_MAX 16
+static char state_to_char[TASK_STATE_MAX];
+static unsigned int num_sleep_states = 0;
+
static const char *cpu_list;
static DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);

@@ -92,24 +96,6 @@ struct sched_atom {
struct task_desc *wakee;
};

-#define TASK_STATE_TO_CHAR_STR "RSDTtZXxKWP"
-
-/* task state bitmask, copied from include/linux/sched.h */
-#define TASK_RUNNING 0
-#define TASK_INTERRUPTIBLE 1
-#define TASK_UNINTERRUPTIBLE 2
-#define __TASK_STOPPED 4
-#define __TASK_TRACED 8
-/* in tsk->exit_state */
-#define EXIT_DEAD 16
-#define EXIT_ZOMBIE 32
-#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
-/* in tsk->state again */
-#define TASK_DEAD 64
-#define TASK_WAKEKILL 128
-#define TASK_WAKING 256
-#define TASK_PARKED 512
-
enum thread_state {
THREAD_SLEEPING = 0,
THREAD_WAIT_CPU,
@@ -266,7 +252,7 @@ struct thread_runtime {
u64 total_preempt_time;
u64 total_delay_time;

- int last_state;
+ char last_state;

char shortname[3];
bool comm_changed;
@@ -436,7 +422,7 @@ static void add_sched_event_wakeup(struct perf_sched *sched, struct task_desc *t
}

static void add_sched_event_sleep(struct perf_sched *sched, struct task_desc *task,
- u64 timestamp, u64 task_state __maybe_unused)
+ u64 timestamp, char task_state __maybe_unused)
{
struct sched_atom *event = get_new_event(task, timestamp);

@@ -851,6 +837,72 @@ replay_wakeup_event(struct perf_sched *sched,
return 0;
}

+static struct tep_print_arg* task_state_print_flag(struct tep_event *event)
+{
+ struct tep_print_arg* args = event->print_fmt.args;
+
+ while(args)
+ {
+ if (args->type == TEP_PRINT_FLAGS)
+ return args;
+ if (args->type == TEP_PRINT_OP) {
+ args = args->op.right;
+ args = args->op.left;
+ continue;
+ }
+ args = args->next;
+ }
+ return NULL;
+}
+
+static void __parse_print_flag(struct tep_print_flag_sym *field)
+{
+
+ long val = strtol(field->value, NULL, 0);
+ unsigned int bit = val ? ffs(val) : 0;
+
+ state_to_char[bit] = field->str[0];
+ num_sleep_states++;
+ if(num_sleep_states > TASK_STATE_MAX - 1) {
+ pr_warning("too many states parsed, possibly bad format\n");
+ return;
+ }
+ if (field->next) {
+ __parse_print_flag(field->next);
+ }
+}
+
+static inline void parse_print_flag(struct tep_print_arg* args)
+{
+ __parse_print_flag(args->flags.flags);
+}
+
+static void build_task_state_arr(struct tep_event *event)
+{
+ struct tep_print_arg* args;
+
+ args = task_state_print_flag(event);
+ if (!args)
+ pr_warning("print flag not found, possibly bad format\n");
+ else
+ parse_print_flag(args);
+}
+
+static inline char get_prev_task_state(struct evsel *evsel,
+ struct perf_sample *sample)
+{
+ int prev_state = evsel__intval(evsel, sample, "prev_state");
+ unsigned int bit = prev_state ? ffs(prev_state) : 0;
+ char state;
+
+ if(!num_sleep_states)
+ build_task_state_arr(evsel->tp_format);
+
+ state = (!bit || bit > num_sleep_states) ? 'R' : state_to_char[bit];
+
+ return state;
+}
+
static int replay_switch_event(struct perf_sched *sched,
struct evsel *evsel,
struct perf_sample *sample,
@@ -860,7 +912,7 @@ static int replay_switch_event(struct perf_sched *sched,
*next_comm = evsel__strval(evsel, sample, "next_comm");
const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
next_pid = evsel__intval(evsel, sample, "next_pid");
- const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+ const char prev_state = get_prev_task_state(evsel, sample);
struct task_desc *prev, __maybe_unused *next;
u64 timestamp0, timestamp = sample->time;
int cpu = sample->cpu;
@@ -1050,12 +1102,6 @@ static int thread_atoms_insert(struct perf_sched *sched, struct thread *thread)
return 0;
}

-static char sched_out_state(u64 prev_state)
-{
- const char *str = TASK_STATE_TO_CHAR_STR;
-
- return str[prev_state];
-}

static int
add_sched_out_event(struct work_atoms *atoms,
@@ -1132,7 +1178,7 @@ static int latency_switch_event(struct perf_sched *sched,
{
const u32 prev_pid = evsel__intval(evsel, sample, "prev_pid"),
next_pid = evsel__intval(evsel, sample, "next_pid");
- const u64 prev_state = evsel__intval(evsel, sample, "prev_state");
+ const char prev_state = get_prev_task_state(evsel, sample);
struct work_atoms *out_events, *in_events;
struct thread *sched_out, *sched_in;
u64 timestamp0, timestamp = sample->time;
@@ -1168,7 +1214,7 @@ static int latency_switch_event(struct perf_sched *sched,
goto out_put;
}
}
- if (add_sched_out_event(out_events, sched_out_state(prev_state), timestamp))
+ if (add_sched_out_event(out_events, prev_state, timestamp))
return -1;

in_events = thread_atoms_search(&sched->atom_root, sched_in, &sched->cmp_pid);
@@ -2033,24 +2079,12 @@ static void timehist_header(struct perf_sched *sched)
printf("\n");
}

-static char task_state_char(struct thread *thread, int state)
-{
- static const char state_to_char[] = TASK_STATE_TO_CHAR_STR;
- unsigned bit = state ? ffs(state) : 0;
-
- /* 'I' for idle */
- if (thread__tid(thread) == 0)
- return 'I';
-
- return bit < sizeof(state_to_char) - 1 ? state_to_char[bit] : '?';
-}
-
static void timehist_print_sample(struct perf_sched *sched,
struct evsel *evsel,
struct perf_sample *sample,
struct addr_location *al,
struct thread *thread,
- u64 t, int state)
+ u64 t, char state)
{
struct thread_runtime *tr = thread__priv(thread);
const char *next_comm = evsel__strval(evsel, sample, "next_comm");
@@ -2091,7 +2125,7 @@ static void timehist_print_sample(struct perf_sched *sched,
print_sched_time(tr->dt_run, 6);

if (sched->show_state)
- printf(" %5c ", task_state_char(thread, state));
+ printf(" %5c ", thread->tid == 0 ? 'I' : state);

if (sched->show_next) {
snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm, next_pid);
@@ -2163,9 +2197,9 @@ static void timehist_update_runtime_stats(struct thread_runtime *r,
else if (r->last_time) {
u64 dt_wait = tprev - r->last_time;

- if (r->last_state == TASK_RUNNING)
+ if (r->last_state == 'R')
r->dt_preempt = dt_wait;
- else if (r->last_state == TASK_UNINTERRUPTIBLE)
+ else if (r->last_state == 'D')
r->dt_iowait = dt_wait;
else
r->dt_sleep = dt_wait;
@@ -2590,7 +2624,7 @@ static int timehist_sched_change_event(struct perf_tool *tool,
struct thread_runtime *tr = NULL;
u64 tprev, t = sample->time;
int rc = 0;
- int state = evsel__intval(evsel, sample, "prev_state");
+ const char state = get_prev_task_state(evsel, sample);

addr_location__init(&al);
if (machine__resolve(machine, &al, sample) < 0) {
--
2.41.0


2023-08-11 18:04:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] perf sched: parse task state from tracepoint print format

On Thu, 10 Aug 2023 01:50:24 -0400
Ze Gao <[email protected]> wrote:

> Hi Steven,
>
> I managed to build task state char map dynamically by parsing
> the tracepoint print format from data recorded by perf. And
> likewise for libtraceevent.
>
> FYI, I tried TEP_PRINT_INFO but no shot. It turns out TEP_PRINT_INFO
> stills relies on libtraceevent (i.e., sched_switch_handler() in
> plugin_sched_switch.c) and we need to parse the print format on our own.

There is a way to unload plugins:

tep_unload_plugins(t->plugin_list, tep);

Hmm, I should add:

tep_unload_plugin(tep, t->plugin_list, "plugin-name");

To unload a single plugin.

I it can also just override what the plugin does by calling:

static int sched_switch_handler(struct trace_seq *s,
struct tep_record *record,
struct tep_event *event, void *context)
{
// do whatever you want.
}

tep_register_event_handler(tep, -1, "sched", "sched_switch",
sched_switch_handler, NULL);

>
> Anyway, it works now and I've tested on some perf.data in old formats
> but not cover all the kernel releases.
>
> Thoughts?

I don't maintain the perf code. You'll have to talk with the perf
maintainers.

-- Steve

2023-08-14 03:11:59

by Ze Gao

[permalink] [raw]
Subject: Re: [PATCH] perf sched: parse task state from tracepoint print format

On Sat, Aug 12, 2023 at 1:28 AM Steven Rostedt <[email protected]> wrote:
>
> On Thu, 10 Aug 2023 01:50:24 -0400
> Ze Gao <[email protected]> wrote:
>
> > Hi Steven,
> >
> > I managed to build task state char map dynamically by parsing
> > the tracepoint print format from data recorded by perf. And
> > likewise for libtraceevent.
> >
> > FYI, I tried TEP_PRINT_INFO but no shot. It turns out TEP_PRINT_INFO
> > stills relies on libtraceevent (i.e., sched_switch_handler() in
> > plugin_sched_switch.c) and we need to parse the print format on our own.
>
> There is a way to unload plugins:
>
> tep_unload_plugins(t->plugin_list, tep);
>
> Hmm, I should add:
>
> tep_unload_plugin(tep, t->plugin_list, "plugin-name");
>
> To unload a single plugin.
>
> I it can also just override what the plugin does by calling:
>
> static int sched_switch_handler(struct trace_seq *s,
> struct tep_record *record,
> struct tep_event *event, void *context)
> {
> // do whatever you want.
> }
>
> tep_register_event_handler(tep, -1, "sched", "sched_switch",
> sched_switch_handler, NULL);
>
Yes, I chose to fix libtraceevent in a similar way, to not break
users of this plugin,
like perf script, I decided to build state char mapping dynamically
for both instead
of overriding sched_switch_handler in libtraceevent. Here is the patch:

From e4fae23d9538e60e75a9776fa7938102e7c26bbb Mon Sep 17 00:00:00 2001
From: Ze Gao <[email protected]>
Date: Wed, 2 Aug 2023 07:20:46 -0400
Subject: [PATCH] libtraceevent: parse task state from tracepoint print format

As of this writing, we use prev_state to report task state, which
relies on both the task state macros and TASK_STATE_TO_CHAR_STR
copied from the kernel to interpret its actual meaning. In this way,
libtraceevent gets broken literally each time TASK_STATE_TO_CHAR_STR
changes as the kernel evolves. Counting on hardcoded
TASK_STATE_TO_CHAR_STR gurantees no backward compatibilty.

To fix this, we build the state char map from the print format
parsed from date recorded on the fly and removes dependencies on
these internal kernel definitions.

Signed-off-by: Ze Gao <[email protected]>
---
plugins/plugin_sched_switch.c | 103 ++++++++++++++++++++++++++++------
1 file changed, 87 insertions(+), 16 deletions(-)

diff --git a/plugins/plugin_sched_switch.c b/plugins/plugin_sched_switch.c
index 8752cae..a0df138 100644
--- a/plugins/plugin_sched_switch.c
+++ b/plugins/plugin_sched_switch.c
@@ -6,28 +6,25 @@
#include <stdlib.h>
#include <string.h>

+#include "event-utils.h"
#include "event-parse.h"
#include "trace-seq.h"

-static void write_state(struct trace_seq *s, int val)
-{
- const char states[] = "SDTtZXxW";
- int found = 0;
- int i;
+#define TASK_STATE_MAX 16

- for (i = 0; i < (sizeof(states) - 1); i++) {
- if (!(val & (1 << i)))
- continue;
+static char state_to_char[TASK_STATE_MAX];
+static unsigned int num_sleeping_states = 0;

- if (found)
- trace_seq_putc(s, '|');

- found = 1;
- trace_seq_putc(s, states[i]);
- }
+static void write_state(struct trace_seq *s, int val)
+{
+ unsigned int bit = val ? ffs(val) : 0;
+ char state;

- if (!found)
- trace_seq_putc(s, 'R');
+ state = (!bit || bit > num_sleeping_states) ? 'R' : state_to_char[bit];
+ trace_seq_putc(s, state);
+ if(bit > num_sleeping_states)
+ trace_seq_putc(s, '+');
}

static void write_and_save_comm(struct tep_format_field *field,
@@ -79,6 +76,76 @@ static int sched_wakeup_handler(struct trace_seq *s,
return 0;
}

+static const struct tep_print_arg* task_state_print_flag(const struct
tep_event *event) {
+
+ struct tep_print_arg* args;
+
+ if (!event)
+ goto out;
+
+ args = event->print_fmt.args;
+ while(args)
+ {
+ if (args->type == TEP_PRINT_FLAGS)
+ return args;
+ if (args->type == TEP_PRINT_OP) {
+ args = args->op.right;
+ args = args->op.left;
+ continue;
+ }
+ args = args->next;
+ }
+out:
+ return NULL;
+}
+
+static int parse_task_state_arr(const char *value, const char *str)
+{
+ long val;
+ unsigned int bit;
+
+ if (!value || !str)
+ return -1;
+
+ val = strtol(value, NULL, 0);
+ bit = val ? ffs(val) : 0;
+ state_to_char[bit] = str[0];
+ num_sleeping_states++;
+ if (num_sleeping_states > TASK_STATE_MAX - 1) {
+ tep_warning("too many states parsed, possibly bad format\n");
+ return -1;
+ }
+ return 0;
+}
+
+static void parse_print_flag(const struct tep_print_flag_sym* field,
+ int (*parser)(const char *value, const char *str))
+{
+ int err;
+
+ if (!field || !parser)
+ return;
+ err = parser(field->value, field->str);
+ if (err){
+ tep_warning("parsing print flag failed, possibly bad format\n");
+ return;
+ }
+ if (field->next)
+ parse_print_flag(field->next, parser);
+
+}
+
+static void build_task_state_arr(const struct tep_event *event)
+{
+ const struct tep_print_arg* args;
+
+ args = task_state_print_flag(event);
+ if (!args)
+ tep_warning("print flag not found, possibly bad format\n");
+ else
+ parse_print_flag(args->flags.flags, parse_task_state_arr);
+}
+
static int sched_switch_handler(struct trace_seq *s,
struct tep_record *record,
struct tep_event *event, void *context)
@@ -99,8 +166,12 @@ static int sched_switch_handler(struct trace_seq *s,
if (tep_get_field_val(s, event, "prev_prio", record, &val, 1) == 0)
trace_seq_printf(s, "[%d] ", (int) val);

- if (tep_get_field_val(s, event, "prev_state", record, &val, 1) == 0)
+
+ if (tep_get_field_val(s, event, "prev_state", record, &val, 1) == 0) {
+ if (!num_sleeping_states)
+ build_task_state_arr(event);
write_state(s, val);
+ }

trace_seq_puts(s, " ==> ");

--
2.41.0



> > Anyway, it works now and I've tested on some perf.data in old formats
> > but not cover all the kernel releases.
> >
> > Thoughts?
>
> I don't maintain the perf code. You'll have to talk with the perf
> maintainers.

Roger that! Thank you very much!

Regards,
Ze

> -- Steve