2024-05-08 06:05:28

by Ravi Bangoria

Subject: [RFC 0/4] perf sched: Introduce schedstat tool

MOTIVATION
----------

Existing `perf sched` is quite exhaustive and provides a lot of insight
into scheduler behavior, but it quickly becomes impractical to use for
long-running or scheduler-intensive workloads. For example, `perf sched
record` has ~7.77% overhead on hackbench (with 25 groups each running
700K loops on a 2-socket, 128-core, 256-thread 3rd Generation EPYC
server), and it generates a huge 56G perf.data which perf takes ~137
minutes to prepare and write to disk [1].

Unlike `perf sched record`, which hooks onto a set of scheduler
tracepoints and generates a sample on every tracepoint hit, `perf sched
schedstat record` takes a snapshot of the /proc/schedstat file before
and after the workload, i.e. there is zero interference with the
workload run. Also, it takes very little time to parse /proc/schedstat,
convert it into perf samples and save those samples into the perf.data
file. The resulting perf.data file is much smaller. So, overall, `perf
sched schedstat record` is much more lightweight compared to `perf sched
record`.
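
The before/after snapshot idea can be sketched in a few lines of C. This
is only an illustration, not the code from this series: the names
`cpu_snapshot`, `parse_cpu_line()` and `diff_cpu_snapshot()` are
hypothetical, and it assumes the v15 "cpuN" line layout of nine counters.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CPU_FIELDS 9 /* counters on a v15 "cpuN" line (assumption) */

/* Hypothetical container for one parsed "cpuN" line. */
struct cpu_snapshot {
	unsigned int cpu;
	uint64_t field[NR_CPU_FIELDS];
};

/* Parse "cpuN f1 ... f9". Returns 0 on success, -1 if the line does not match. */
static int parse_cpu_line(const char *line, struct cpu_snapshot *s)
{
	int n;

	assert(line && s);
	n = sscanf(line,
		   "cpu%u %" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64
		   " %" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64,
		   &s->cpu, &s->field[0], &s->field[1], &s->field[2],
		   &s->field[3], &s->field[4], &s->field[5], &s->field[6],
		   &s->field[7], &s->field[8]);
	return n == 1 + NR_CPU_FIELDS ? 0 : -1;
}

/* delta = after - before, counter by counter. */
static void diff_cpu_snapshot(const struct cpu_snapshot *before,
			      const struct cpu_snapshot *after,
			      struct cpu_snapshot *delta)
{
	assert(before && after && delta && before->cpu == after->cpu);
	delta->cpu = after->cpu;
	for (int i = 0; i < NR_CPU_FIELDS; i++)
		delta->field[i] = after->field[i] - before->field[i];
}
```

A caller would parse the same cpu line from the snapshot taken before
the workload and from the one taken after it, then report the delta.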

We, internally at AMD, have been using this (a variant of this, known as
"sched-scoreboard"[2]) and found it to be very useful for analysing the
impact of scheduler code changes[3][4].

Please note that this is not a replacement for perf sched record/report.
The intended users of the new tool are scheduler developers, not regular
users.

USAGE
-----

# perf sched schedstat record
# perf sched schedstat report

Note: Although the perf schedstat tool supports the workload profiling
syntax (i.e. -- <workload> ), the recorded profile is still systemwide
since /proc/schedstat is a systemwide file.

HOW TO INTERPRET THE REPORT
---------------------------

The schedstat report starts with the total time profiling was active,
in jiffies:

----------------------------------------------------------------------------------------------------
Time elapsed (in jiffies) : 24537
----------------------------------------------------------------------------------------------------

Next are the cpu scheduling statistics. These are simple diffs of the
/proc/schedstat cpu lines along with a description. The report also
prints a percentage relative to the base stat.

In the example below, 98.19% of the schedule() calls on cpu0 left the
cpu idle. 16.54% of all try_to_wake_up() calls were to wake up the local
cpu. And the total waittime by tasks on cpu0 is 0.49% of the total
runtime by tasks on the same cpu.

----------------------------------------------------------------------------------------------------
cpu: cpu0
----------------------------------------------------------------------------------------------------
sched_yield() count : 0
Legacy counter can be ignored : 0
schedule() called : 17138
schedule() left the processor idle : 16827 ( 98.19% )
try_to_wake_up() was called : 508
try_to_wake_up() was called to wake up the local cpu : 84 ( 16.54% )
total runtime by tasks on this processor (in jiffies) : 2408959243
total waittime by tasks on this processor (in jiffies) : 11731825 ( 0.49% )
total timeslices run on this cpu : 311
----------------------------------------------------------------------------------------------------
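
The percentages in this block can be reproduced from the raw counters.
The sketch below is an assumption about how the relative values are
derived (value / base stat * 100), with hypothetical helper names; it is
not taken from the patches.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Percentage of 'value' relative to 'base'; 0.0 when the base is zero. */
static double pct_of(uint64_t value, uint64_t base)
{
	return base ? 100.0 * (double)value / (double)base : 0.0;
}

/* Format one report-style line, e.g. "desc : 84 ( 16.54% )". */
static int format_stat(char *buf, size_t len, const char *desc,
		       uint64_t value, uint64_t base)
{
	assert(buf && desc);
	return snprintf(buf, len, "%s : %llu ( %.2f%% )", desc,
			(unsigned long long)value, pct_of(value, base));
}
```

With the numbers from the cpu0 example, `pct_of(16827, 17138)`,
`pct_of(84, 508)` and `pct_of(11731825, 2408959243)` correspond to the
98.19%, 16.54% and 0.49% shown in the report.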

Next are the load balancing statistics. For each of the sched domains
(e.g. `SMT`, `MC`, `DIE`, ...), the scheduler computes statistics under
the following three categories:

1) Idle Load Balance: Load balancing performed on behalf of a long
idling cpu by some other cpu.
2) Busy Load Balance: Load balancing performed when the cpu was busy.
3) New Idle Balance : Load balancing performed when a cpu just became
idle.

Under each of these three categories, the schedstat report provides
different load balancing statistics. Along with the direct stats, the
report also contains derived metrics, prefixed with *. Example:

----------------------------------------------------------------------------------------------------
CPU 0 DOMAIN 0
----------------------------------------------------------------------------------------------------
------------------------------------------<Category idle>-------------------------------------------
load_balance() count on cpu idle : 50 $ 490.74 $
load_balance() found balanced on cpu idle : 42 $ 584.21 $
load_balance() move task failed on cpu idle : 8 $ 3067.12 $
imbalance sum on cpu idle : 8
pull_task() count on cpu idle : 0
pull_task() when target task was cache-hot on cpu idle : 0
load_balance() failed to find busier queue on cpu idle : 0 $ 0.00 $
load_balance() failed to find busier group on cpu idle : 42 $ 584.21 $
*load_balance() success count on cpu idle : 0
*avg task pulled per successful lb attempt (cpu idle) : 0.00
------------------------------------------<Category busy>-------------------------------------------
load_balance() count on cpu busy : 2 $ 12268.50 $
load_balance() found balanced on cpu busy : 2 $ 12268.50 $
load_balance() move task failed on cpu busy : 0 $ 0.00 $
imbalance sum on cpu busy : 0
pull_task() count on cpu busy : 0
pull_task() when target task was cache-hot on cpu busy : 0
load_balance() failed to find busier queue on cpu busy : 0 $ 0.00 $
load_balance() failed to find busier group on cpu busy : 1 $ 24537.00 $
*load_balance() success count on cpu busy : 0
*avg task pulled per successful lb attempt (cpu busy) : 0.00
-----------------------------------------<Category newidle>-----------------------------------------
load_balance() count on cpu newly idle : 427 $ 57.46 $
load_balance() found balanced on cpu newly idle : 382 $ 64.23 $
load_balance() move task failed on cpu newly idle : 45 $ 545.27 $
imbalance sum on cpu newly idle : 48
pull_task() count on cpu newly idle : 0
pull_task() when target task was cache-hot on cpu newly idle : 0
load_balance() failed to find busier queue on cpu newly idle : 0 $ 0.00 $
load_balance() failed to find busier group on cpu newly idle : 382 $ 64.23 $
*load_balance() success count on cpu newly idle : 0
*avg task pulled per successful lb attempt (cpu newly idle) : 0.00
----------------------------------------------------------------------------------------------------
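
The formulas behind the derived metrics (the lines prefixed with *) are
not spelled out in this cover letter. The sketch below shows one
plausible derivation that is consistent with the numbers above. It
assumes that a load_balance() attempt succeeded when it neither found
the domain balanced nor failed to move a task, and that the average is
pull_task() count divided by that success count. The struct and function
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the per-category load balancing counters. */
struct lb_stats {
	uint32_t count;    /* load_balance() count */
	uint32_t balanced; /* load_balance() found balanced */
	uint32_t failed;   /* load_balance() move task failed */
	uint32_t gained;   /* pull_task() count */
};

/* Assumed derivation: success = count - balanced - failed. */
static uint32_t lb_success(const struct lb_stats *s)
{
	uint32_t unsuccessful;

	assert(s);
	unsuccessful = s->balanced + s->failed;
	return s->count > unsuccessful ? s->count - unsuccessful : 0;
}

/* Assumed derivation: tasks pulled per successful attempt. */
static double lb_avg_pulled(const struct lb_stats *s)
{
	uint32_t ok = lb_success(s);

	return ok ? (double)s->gained / ok : 0.0;
}
```

For the newidle category above (427 attempts, 382 balanced, 45 failed),
this yields a success count of 0 and an average of 0.00, matching the
report.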

Consider the following line:

load_balance() found balanced on cpu newly idle : 382 $ 64.23 $

While profiling was active, load_balance() on a newly idle CPU 0 found
the load already balanced 382 times. The value enclosed in $ is the
average number of jiffies between two such events (24537 / 382 = 64.23).
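
The same arithmetic as a small C helper, for illustration only. The
function name is hypothetical, and returning 0.00 for a zero count is an
assumption based on the "$ 0.00 $" entries in the example report.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Average jiffies between two events; 0.0 when the event never happened. */
static double avg_jiffies_between(uint64_t elapsed_jiffies, uint64_t count)
{
	return count ? (double)elapsed_jiffies / (double)count : 0.0;
}

/* Print the value the way the report shows it: "$ 64.23 $". */
static int print_avg_jiffies(FILE *out, uint64_t elapsed_jiffies, uint64_t count)
{
	assert(out);
	return fprintf(out, "$ %.2f $\n",
		       avg_jiffies_between(elapsed_jiffies, count));
}
```

For example, `avg_jiffies_between(24537, 427)` gives the 57.46 shown for
"load_balance() count on cpu newly idle".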

Next are the active_load_balance() stats. Active load balancing did not
trigger while I ran a workload, so they are all 0s.

----------------------------------<Category active_load_balance()>----------------------------------
active_load_balance() count : 0
active_load_balance() move task failed : 0
active_load_balance() successfully moved a task : 0
----------------------------------------------------------------------------------------------------

Next are the sched_balance_exec() and sched_balance_fork() stats. They
are not used, but we kept them in the RFC for legacy purposes. Unless
opposed, we plan to remove them in the next revision.

Next are the wakeup statistics. For every domain, the report also shows
task-wakeup statistics. Example:

--------------------------------------------<Wakeup Info>-------------------------------------------
try_to_wake_up() awoke a task that last ran on a diff cpu : 12070
try_to_wake_up() moved task because cache-cold on own cpu : 3324
try_to_wake_up() started passive balancing : 0
----------------------------------------------------------------------------------------------------

The same set of stats is reported for each cpu and each domain level.


TODO:
- This RFC series supports the v15 layout of /proc/schedstat, but the
  v16 layout is already being pushed upstream. We are planning to
  include v16 as well in the next revision.
- Currently the schedstat tool provides statistics for only one run, but
  we are planning to add `perf sched schedstat diff`, which can compare
  the data of two different runs (possibly good and bad) and highlight
  where scheduler decisions are impacting workload performance.
- perf sched schedstat records /proc/schedstat, which is a cpu and
  domain level scheduler statistic. We are planning to add a taskstat
  tool which reads task stats from procfs and generates a scheduler
  statistic report at task granularity. This will probably be a
  standalone tool, something like `perf sched taskstat record/report`.
- /proc/schedstat shows a cpumask in the domain line to indicate the
  group of cpus that belong to the domain. Since we are not using the
  domain<->cpumask data anywhere, we are not capturing it as part of the
  perf sample. But we are planning to include it in the next revision.
- We have tested the patch only on AMD machines, not on other platforms.
- Some of the perf related features can be added to the schedstat
  subcommand as well: for example, pipe mode, a -p <pid> option to
  profile a running process, etc. We are not planning to add TUI support
  however.
- Code changes are for RFC and thus not optimal. Some of the places
  where code changes are not required for RFC are marked as /* FIXME */.
- Documenting details about the schedstat tool in the perf-sched man
  page will also be done in the next revision.
- Checkpatch warnings are ignored for now.

Patches are prepared on perf-tools-next/perf-tools-next (37862d6fdced).
Although all changes are under tools/, the running kernel matters since
the tool expects v15 of /proc/schedstat.
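
Because the layout is tied to the version, a reader of /proc/schedstat
would normally check the leading "version N" line first. The following
is a minimal sketch of such a check with hypothetical function names; it
is not the parsing code from this series.

```c
#include <assert.h>
#include <stdio.h>

#define SCHEDSTAT_SUPPORTED_VERSION 15 /* layout handled by this RFC */

/* Parse the "version N" header line; returns N, or -1 if it does not match. */
static int parse_schedstat_version(const char *line)
{
	int version;

	assert(line);
	if (sscanf(line, "version %d", &version) != 1 || version < 0)
		return -1;
	return version;
}

/* Non-zero when the header line announces the supported layout. */
static int schedstat_version_supported(const char *line)
{
	return parse_schedstat_version(line) == SCHEDSTAT_SUPPORTED_VERSION;
}
```

A tool could refuse to record, or fall back to a different field table,
when `schedstat_version_supported()` returns 0.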

[1] https://youtu.be/lg-9aG2ajA0?t=283
[2] https://github.com/AMDESE/sched-scoreboard
[3] https://lore.kernel.org/lkml/[email protected]/
[4] https://lore.kernel.org/lkml/[email protected]/

Ravi Bangoria (2):
perf sched: Make `struct perf_sched sched` a global variable
perf sched: Add template code for schedstat subcommand

Swapnil Sapkal (2):
perf sched schedstat: Add record and rawdump support
perf sched schedstat: Add support for report subcommand

tools/lib/perf/Documentation/libperf.txt | 2 +
tools/lib/perf/Makefile | 2 +-
tools/lib/perf/include/perf/event.h | 37 ++
.../lib/perf/include/perf/schedstat-cpu-v15.h | 22 +
.../perf/include/perf/schedstat-domain-v15.h | 121 ++++
tools/perf/builtin-inject.c | 2 +
tools/perf/builtin-sched.c | 558 ++++++++++++++++--
tools/perf/util/event.c | 54 ++
tools/perf/util/event.h | 2 +
tools/perf/util/session.c | 44 ++
tools/perf/util/synthetic-events.c | 170 ++++++
tools/perf/util/synthetic-events.h | 4 +
tools/perf/util/tool.h | 4 +-
13 files changed, 965 insertions(+), 57 deletions(-)
create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v15.h
create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v15.h

--
2.44.0



2024-05-08 06:05:45

by Ravi Bangoria

Subject: [RFC 1/4] perf sched: Make `struct perf_sched sched` a global variable

Currently it is function local. Follow-up changes will add new fields to
this structure, and those new fields will be used by callback functions
to which there is no way to pass a pointer to the `sched` variable. So
make it a global variable. Also, rename it to `perf_sched` to be
consistent with the nomenclature of other builtin-*.c subtools.

Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/perf/builtin-sched.c | 109 +++++++++++++++++++------------------
1 file changed, 55 insertions(+), 54 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 0fce7d8986c0..bc1317d7e106 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -3504,29 +3504,30 @@ static int __cmd_record(int argc, const char **argv)
return ret;
}

+static const char default_sort_order[] = "avg, max, switch, runtime";
+static struct perf_sched perf_sched = {
+ .tool = {
+ .sample = perf_sched__process_tracepoint_sample,
+ .comm = perf_sched__process_comm,
+ .namespaces = perf_event__process_namespaces,
+ .lost = perf_event__process_lost,
+ .fork = perf_sched__process_fork_event,
+ .ordered_events = true,
+ },
+ .cmp_pid = LIST_HEAD_INIT(perf_sched.cmp_pid),
+ .sort_list = LIST_HEAD_INIT(perf_sched.sort_list),
+ .sort_order = default_sort_order,
+ .replay_repeat = 10,
+ .profile_cpu = -1,
+ .next_shortname1 = 'A',
+ .next_shortname2 = '0',
+ .skip_merge = 0,
+ .show_callchain = 1,
+ .max_stack = 5,
+};
+
int cmd_sched(int argc, const char **argv)
{
- static const char default_sort_order[] = "avg, max, switch, runtime";
- struct perf_sched sched = {
- .tool = {
- .sample = perf_sched__process_tracepoint_sample,
- .comm = perf_sched__process_comm,
- .namespaces = perf_event__process_namespaces,
- .lost = perf_event__process_lost,
- .fork = perf_sched__process_fork_event,
- .ordered_events = true,
- },
- .cmp_pid = LIST_HEAD_INIT(sched.cmp_pid),
- .sort_list = LIST_HEAD_INIT(sched.sort_list),
- .sort_order = default_sort_order,
- .replay_repeat = 10,
- .profile_cpu = -1,
- .next_shortname1 = 'A',
- .next_shortname2 = '0',
- .skip_merge = 0,
- .show_callchain = 1,
- .max_stack = 5,
- };
const struct option sched_options[] = {
OPT_STRING('i', "input", &input_name, "file",
"input file name"),
@@ -3534,31 +3535,31 @@ int cmd_sched(int argc, const char **argv)
"be more verbose (show symbol address, etc)"),
OPT_BOOLEAN('D', "dump-raw-trace", &dump_trace,
"dump raw trace in ASCII"),
- OPT_BOOLEAN('f', "force", &sched.force, "don't complain, do it"),
+ OPT_BOOLEAN('f', "force", &perf_sched.force, "don't complain, do it"),
OPT_END()
};
const struct option latency_options[] = {
- OPT_STRING('s', "sort", &sched.sort_order, "key[,key2...]",
+ OPT_STRING('s', "sort", &perf_sched.sort_order, "key[,key2...]",
"sort by key(s): runtime, switch, avg, max"),
- OPT_INTEGER('C', "CPU", &sched.profile_cpu,
+ OPT_INTEGER('C', "CPU", &perf_sched.profile_cpu,
"CPU to profile on"),
- OPT_BOOLEAN('p', "pids", &sched.skip_merge,
+ OPT_BOOLEAN('p', "pids", &perf_sched.skip_merge,
"latency stats per pid instead of per comm"),
OPT_PARENT(sched_options)
};
const struct option replay_options[] = {
- OPT_UINTEGER('r', "repeat", &sched.replay_repeat,
+ OPT_UINTEGER('r', "repeat", &perf_sched.replay_repeat,
"repeat the workload replay N times (-1: infinite)"),
OPT_PARENT(sched_options)
};
const struct option map_options[] = {
- OPT_BOOLEAN(0, "compact", &sched.map.comp,
+ OPT_BOOLEAN(0, "compact", &perf_sched.map.comp,
"map output in compact mode"),
- OPT_STRING(0, "color-pids", &sched.map.color_pids_str, "pids",
+ OPT_STRING(0, "color-pids", &perf_sched.map.color_pids_str, "pids",
"highlight given pids in map"),
- OPT_STRING(0, "color-cpus", &sched.map.color_cpus_str, "cpus",
+ OPT_STRING(0, "color-cpus", &perf_sched.map.color_cpus_str, "cpus",
"highlight given CPUs in map"),
- OPT_STRING(0, "cpus", &sched.map.cpus_str, "cpus",
+ OPT_STRING(0, "cpus", &perf_sched.map.cpus_str, "cpus",
"display given CPUs in map"),
OPT_PARENT(sched_options)
};
@@ -3567,24 +3568,24 @@ int cmd_sched(int argc, const char **argv)
"file", "vmlinux pathname"),
OPT_STRING(0, "kallsyms", &symbol_conf.kallsyms_name,
"file", "kallsyms pathname"),
- OPT_BOOLEAN('g', "call-graph", &sched.show_callchain,
+ OPT_BOOLEAN('g', "call-graph", &perf_sched.show_callchain,
"Display call chains if present (default on)"),
- OPT_UINTEGER(0, "max-stack", &sched.max_stack,
+ OPT_UINTEGER(0, "max-stack", &perf_sched.max_stack,
"Maximum number of functions to display backtrace."),
OPT_STRING(0, "symfs", &symbol_conf.symfs, "directory",
"Look for files with symbols relative to this directory"),
- OPT_BOOLEAN('s', "summary", &sched.summary_only,
+ OPT_BOOLEAN('s', "summary", &perf_sched.summary_only,
"Show only syscall summary with statistics"),
- OPT_BOOLEAN('S', "with-summary", &sched.summary,
+ OPT_BOOLEAN('S', "with-summary", &perf_sched.summary,
"Show all syscalls and summary with statistics"),
- OPT_BOOLEAN('w', "wakeups", &sched.show_wakeups, "Show wakeup events"),
- OPT_BOOLEAN('n', "next", &sched.show_next, "Show next task"),
- OPT_BOOLEAN('M', "migrations", &sched.show_migrations, "Show migration events"),
- OPT_BOOLEAN('V', "cpu-visual", &sched.show_cpu_visual, "Add CPU visual"),
- OPT_BOOLEAN('I', "idle-hist", &sched.idle_hist, "Show idle events only"),
- OPT_STRING(0, "time", &sched.time_str, "str",
+ OPT_BOOLEAN('w', "wakeups", &perf_sched.show_wakeups, "Show wakeup events"),
+ OPT_BOOLEAN('n', "next", &perf_sched.show_next, "Show next task"),
+ OPT_BOOLEAN('M', "migrations", &perf_sched.show_migrations, "Show migration events"),
+ OPT_BOOLEAN('V', "cpu-visual", &perf_sched.show_cpu_visual, "Add CPU visual"),
+ OPT_BOOLEAN('I', "idle-hist", &perf_sched.idle_hist, "Show idle events only"),
+ OPT_STRING(0, "time", &perf_sched.time_str, "str",
"Time span for analysis (start,stop)"),
- OPT_BOOLEAN(0, "state", &sched.show_state, "Show task state when sched-out"),
+ OPT_BOOLEAN(0, "state", &perf_sched.show_state, "Show task state when sched-out"),
OPT_STRING('p', "pid", &symbol_conf.pid_list_str, "pid[,pid...]",
"analyze events only for given process id(s)"),
OPT_STRING('t', "tid", &symbol_conf.tid_list_str, "tid[,tid...]",
@@ -3645,31 +3646,31 @@ int cmd_sched(int argc, const char **argv)
} else if (strlen(argv[0]) > 2 && strstarts("record", argv[0])) {
return __cmd_record(argc, argv);
} else if (strlen(argv[0]) > 2 && strstarts("latency", argv[0])) {
- sched.tp_handler = &lat_ops;
+ perf_sched.tp_handler = &lat_ops;
if (argc > 1) {
argc = parse_options(argc, argv, latency_options, latency_usage, 0);
if (argc)
usage_with_options(latency_usage, latency_options);
}
- setup_sorting(&sched, latency_options, latency_usage);
- return perf_sched__lat(&sched);
+ setup_sorting(&perf_sched, latency_options, latency_usage);
+ return perf_sched__lat(&perf_sched);
} else if (!strcmp(argv[0], "map")) {
if (argc) {
argc = parse_options(argc, argv, map_options, map_usage, 0);
if (argc)
usage_with_options(map_usage, map_options);
}
- sched.tp_handler = &map_ops;
- setup_sorting(&sched, latency_options, latency_usage);
- return perf_sched__map(&sched);
+ perf_sched.tp_handler = &map_ops;
+ setup_sorting(&perf_sched, latency_options, latency_usage);
+ return perf_sched__map(&perf_sched);
} else if (strlen(argv[0]) > 2 && strstarts("replay", argv[0])) {
- sched.tp_handler = &replay_ops;
+ perf_sched.tp_handler = &replay_ops;
if (argc) {
argc = parse_options(argc, argv, replay_options, replay_usage, 0);
if (argc)
usage_with_options(replay_usage, replay_options);
}
- return perf_sched__replay(&sched);
+ return perf_sched__replay(&perf_sched);
} else if (!strcmp(argv[0], "timehist")) {
if (argc) {
argc = parse_options(argc, argv, timehist_options,
@@ -3677,13 +3678,13 @@ int cmd_sched(int argc, const char **argv)
if (argc)
usage_with_options(timehist_usage, timehist_options);
}
- if ((sched.show_wakeups || sched.show_next) &&
- sched.summary_only) {
+ if ((perf_sched.show_wakeups || perf_sched.show_next) &&
+ perf_sched.summary_only) {
pr_err(" Error: -s and -[n|w] are mutually exclusive.\n");
parse_options_usage(timehist_usage, timehist_options, "s", true);
- if (sched.show_wakeups)
+ if (perf_sched.show_wakeups)
parse_options_usage(NULL, timehist_options, "w", true);
- if (sched.show_next)
+ if (perf_sched.show_next)
parse_options_usage(NULL, timehist_options, "n", true);
return -EINVAL;
}
@@ -3691,7 +3692,7 @@ int cmd_sched(int argc, const char **argv)
if (ret)
return ret;

- return perf_sched__timehist(&sched);
+ return perf_sched__timehist(&perf_sched);
} else {
usage_with_options(sched_usage, sched_options);
}
--
2.44.0


2024-05-08 06:06:02

by Ravi Bangoria

Subject: [RFC 2/4] perf sched: Add template code for schedstat subcommand

perf sched schedstat will allow the user to record a profile with the
`record` subcommand and analyse it with the `report` subcommand.

Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/perf/builtin-sched.c | 48 +++++++++++++++++++++++++++++++++++++-
1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index bc1317d7e106..717bdf113241 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -3504,6 +3504,21 @@ static int __cmd_record(int argc, const char **argv)
return ret;
}

+/* perf.data or any other output file name used by schedstat subcommand (only). */
+const char *output_name;
+
+static int perf_sched__schedstat_record(struct perf_sched *sched __maybe_unused,
+ int argc __maybe_unused,
+ const char **argv __maybe_unused)
+{
+ return 0;
+}
+
+static int perf_sched__schedstat_report(struct perf_sched *sched __maybe_unused)
+{
+ return 0;
+}
+
static const char default_sort_order[] = "avg, max, switch, runtime";
static struct perf_sched perf_sched = {
.tool = {
@@ -3593,6 +3608,13 @@ int cmd_sched(int argc, const char **argv)
OPT_STRING('C', "cpu", &cpu_list, "cpu", "list of cpus to profile"),
OPT_PARENT(sched_options)
};
+ const struct option schedstat_options[] = {
+ OPT_STRING('i', "input", &input_name, "file",
+ "`schedstat report` with input filename"),
+ OPT_STRING('o', "output", &output_name, "file",
+ "`schedstat record` with output filename"),
+ OPT_END()
+ };

const char * const latency_usage[] = {
"perf sched latency [<options>]",
@@ -3610,9 +3632,13 @@ int cmd_sched(int argc, const char **argv)
"perf sched timehist [<options>]",
NULL
};
+ const char *schedstat_usage[] = {
+ "perf sched schedstat {record|report} [<options>]",
+ NULL
+ };
const char *const sched_subcommands[] = { "record", "latency", "map",
"replay", "script",
- "timehist", NULL };
+ "timehist", "schedstat", NULL };
const char *sched_usage[] = {
NULL,
NULL
@@ -3693,6 +3719,26 @@ int cmd_sched(int argc, const char **argv)
return ret;

return perf_sched__timehist(&perf_sched);
+ } else if (!strcmp(argv[0], "schedstat")) {
+ const char *const schedstat_subcommands[] = {"record", "report", NULL};
+
+ argc = parse_options_subcommand(argc, argv, schedstat_options,
+ schedstat_subcommands,
+ schedstat_usage, 0);
+
+ if (argv[0] && !strcmp(argv[0], "record")) {
+ if (argc)
+ argc = parse_options(argc, argv, schedstat_options,
+ schedstat_usage, 0);
+ ret = perf_sched__schedstat_record(&perf_sched, argc, argv);
+ } else if (argv[0] && !strcmp(argv[0], "report")) {
+ if (argc)
+ argc = parse_options(argc, argv, schedstat_options,
+ schedstat_usage, 0);
+ ret = perf_sched__schedstat_report(&perf_sched);
+ } else {
+ usage_with_options(schedstat_usage, schedstat_options);
+ }
} else {
usage_with_options(sched_usage, sched_options);
}
--
2.44.0


2024-05-08 06:06:18

by Ravi Bangoria

Subject: [RFC 3/4] perf sched schedstat: Add record and rawdump support

From: Swapnil Sapkal <[email protected]>

Define new, perf-tool-only sample types and their layouts. Add logic
to parse /proc/schedstat, convert it to perf sample format and save
the samples to a perf.data file with the `perf sched schedstat record`
subcommand. Also add logic to read the perf.data file, interpret the
schedstat samples and print a raw dump of the samples with
`perf script -D`.

Note that the /proc/schedstat file output is standardized with a version
number. The patch supports v15, but older or newer versions can be added
easily.
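
The per-version layouts in this patch are kept as field lists that are
expanded through a macro (CPU_FIELD / DOMAIN_FIELD). The sketch below
illustrates that general technique in a single file. It uses a list
macro instead of a re-included header, and all `demo_*` and
`DEMO_*` names are hypothetical, so treat it as an illustration of the
idea rather than the code in this series.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* One place that names every field of the (shortened, made-up) layout. */
#define DEMO_CPU_FIELDS(F)		\
	F(uint32_t, sched_count)	\
	F(uint32_t, sched_goidle)	\
	F(uint64_t, run_delay)

/* Expansion 1: turn the list into struct members. */
#define DEMO_AS_MEMBER(type, name) type name;
struct demo_cpu_stats {
	DEMO_CPU_FIELDS(DEMO_AS_MEMBER)
};
#undef DEMO_AS_MEMBER

/* Expansion 2: turn the same list into a raw dump; returns fields printed. */
static int demo_dump(const struct demo_cpu_stats *s, FILE *out)
{
	int n = 0;

	assert(s && out);
#define DEMO_AS_PRINT(type, name)					\
	n += (fprintf(out, "%s=%llu\n", #name,				\
		      (unsigned long long)s->name) > 0);
	DEMO_CPU_FIELDS(DEMO_AS_PRINT)
#undef DEMO_AS_PRINT
	return n;
}
```

Adding a field to the list updates both the struct and the dump, which
is what makes supporting another layout version mostly a matter of
adding another field list.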

Co-developed-by: Ravi Bangoria <[email protected]>
Signed-off-by: Swapnil Sapkal <[email protected]>
Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/lib/perf/Documentation/libperf.txt | 2 +
tools/lib/perf/Makefile | 2 +-
tools/lib/perf/include/perf/event.h | 37 ++++
.../lib/perf/include/perf/schedstat-cpu-v15.h | 13 ++
.../perf/include/perf/schedstat-domain-v15.h | 40 +++++
tools/perf/builtin-inject.c | 2 +
tools/perf/builtin-sched.c | 157 +++++++++++++++-
tools/perf/util/event.c | 54 ++++++
tools/perf/util/event.h | 2 +
tools/perf/util/session.c | 44 +++++
tools/perf/util/synthetic-events.c | 170 ++++++++++++++++++
tools/perf/util/synthetic-events.h | 4 +
tools/perf/util/tool.h | 4 +-
13 files changed, 525 insertions(+), 6 deletions(-)
create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v15.h
create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v15.h

diff --git a/tools/lib/perf/Documentation/libperf.txt b/tools/lib/perf/Documentation/libperf.txt
index fcfb9499ef9c..39c78682ad2e 100644
--- a/tools/lib/perf/Documentation/libperf.txt
+++ b/tools/lib/perf/Documentation/libperf.txt
@@ -211,6 +211,8 @@ SYNOPSIS
struct perf_record_time_conv;
struct perf_record_header_feature;
struct perf_record_compressed;
+ struct perf_record_schedstat_cpu;
+ struct perf_record_schedstat_domain;
--

DESCRIPTION
diff --git a/tools/lib/perf/Makefile b/tools/lib/perf/Makefile
index 3a9b2140aa04..ebbfea891a6a 100644
--- a/tools/lib/perf/Makefile
+++ b/tools/lib/perf/Makefile
@@ -187,7 +187,7 @@ install_lib: libs
$(call do_install_mkdir,$(libdir_SQ)); \
cp -fpR $(LIBPERF_ALL) $(DESTDIR)$(libdir_SQ)

-HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h
+HDRS := bpf_perf.h core.h cpumap.h threadmap.h evlist.h evsel.h event.h mmap.h schedstat-cpu-v15.h schedstat-domain-v15.h
INTERNAL_HDRS := cpumap.h evlist.h evsel.h lib.h mmap.h rc_check.h threadmap.h xyarray.h

INSTALL_HDRS_PFX := $(DESTDIR)$(prefix)/include/perf
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index ae64090184d3..835bb3e2dcbf 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -451,6 +451,39 @@ struct perf_record_compressed {
char data[];
};

+struct perf_record_schedstat_cpu_v15 {
+#define CPU_FIELD(type, name) type name;
+#include "schedstat-cpu-v15.h"
+#undef CPU_FIELD
+};
+
+struct perf_record_schedstat_cpu {
+ struct perf_event_header header;
+ __u16 version;
+ __u64 timestamp;
+ __u32 cpu;
+ union {
+ struct perf_record_schedstat_cpu_v15 v15;
+ };
+};
+
+struct perf_record_schedstat_domain_v15 {
+#define DOMAIN_FIELD(type, name) type name;
+#include "schedstat-domain-v15.h"
+#undef DOMAIN_FIELD
+};
+
+struct perf_record_schedstat_domain {
+ struct perf_event_header header;
+ __u16 version;
+ __u64 timestamp;
+ __u32 cpu;
+ __u16 domain;
+ union {
+ struct perf_record_schedstat_domain_v15 v15;
+ };
+};
+
enum perf_user_event_type { /* above any possible kernel type */
PERF_RECORD_USER_TYPE_START = 64,
PERF_RECORD_HEADER_ATTR = 64,
@@ -472,6 +505,8 @@ enum perf_user_event_type { /* above any possible kernel type */
PERF_RECORD_HEADER_FEATURE = 80,
PERF_RECORD_COMPRESSED = 81,
PERF_RECORD_FINISHED_INIT = 82,
+ PERF_RECORD_SCHEDSTAT_CPU = 83,
+ PERF_RECORD_SCHEDSTAT_DOMAIN = 84,
PERF_RECORD_HEADER_MAX
};

@@ -512,6 +547,8 @@ union perf_event {
struct perf_record_time_conv time_conv;
struct perf_record_header_feature feat;
struct perf_record_compressed pack;
+ struct perf_record_schedstat_cpu schedstat_cpu;
+ struct perf_record_schedstat_domain schedstat_domain;
};

#endif /* __LIBPERF_EVENT_H */
diff --git a/tools/lib/perf/include/perf/schedstat-cpu-v15.h b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
new file mode 100644
index 000000000000..8dca84b11902
--- /dev/null
+++ b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef CPU_FIELD
+CPU_FIELD(__u32, yld_count)
+CPU_FIELD(__u32, array_exp)
+CPU_FIELD(__u32, sched_count)
+CPU_FIELD(__u32, sched_goidle)
+CPU_FIELD(__u32, ttwu_count)
+CPU_FIELD(__u32, ttwu_local)
+CPU_FIELD(__u64, rq_cpu_time)
+CPU_FIELD(__u64, run_delay)
+CPU_FIELD(__u64, pcount)
+#endif
diff --git a/tools/lib/perf/include/perf/schedstat-domain-v15.h b/tools/lib/perf/include/perf/schedstat-domain-v15.h
new file mode 100644
index 000000000000..181b1c10395c
--- /dev/null
+++ b/tools/lib/perf/include/perf/schedstat-domain-v15.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef DOMAIN_FIELD
+DOMAIN_FIELD(__u32, idle_lb_count)
+DOMAIN_FIELD(__u32, idle_lb_balanced)
+DOMAIN_FIELD(__u32, idle_lb_failed)
+DOMAIN_FIELD(__u32, idle_lb_imbalance)
+DOMAIN_FIELD(__u32, idle_lb_gained)
+DOMAIN_FIELD(__u32, idle_lb_hot_gained)
+DOMAIN_FIELD(__u32, idle_lb_nobusyq)
+DOMAIN_FIELD(__u32, idle_lb_nobusyg)
+DOMAIN_FIELD(__u32, busy_lb_count)
+DOMAIN_FIELD(__u32, busy_lb_balanced)
+DOMAIN_FIELD(__u32, busy_lb_failed)
+DOMAIN_FIELD(__u32, busy_lb_imbalance)
+DOMAIN_FIELD(__u32, busy_lb_gained)
+DOMAIN_FIELD(__u32, busy_lb_hot_gained)
+DOMAIN_FIELD(__u32, busy_lb_nobusyq)
+DOMAIN_FIELD(__u32, busy_lb_nobusyg)
+DOMAIN_FIELD(__u32, newidle_lb_count)
+DOMAIN_FIELD(__u32, newidle_lb_balanced)
+DOMAIN_FIELD(__u32, newidle_lb_failed)
+DOMAIN_FIELD(__u32, newidle_lb_imbalance)
+DOMAIN_FIELD(__u32, newidle_lb_gained)
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg)
+DOMAIN_FIELD(__u32, alb_count)
+DOMAIN_FIELD(__u32, alb_failed)
+DOMAIN_FIELD(__u32, alb_pushed)
+DOMAIN_FIELD(__u32, sbe_count)
+DOMAIN_FIELD(__u32, sbe_balanced)
+DOMAIN_FIELD(__u32, sbe_pushed)
+DOMAIN_FIELD(__u32, sbf_count)
+DOMAIN_FIELD(__u32, sbf_balanced)
+DOMAIN_FIELD(__u32, sbf_pushed)
+DOMAIN_FIELD(__u32, ttwu_wake_remote)
+DOMAIN_FIELD(__u32, ttwu_move_affine)
+DOMAIN_FIELD(__u32, ttwu_move_balance)
+#endif
diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index a212678d47be..28b8c1366446 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -2204,6 +2204,8 @@ int cmd_inject(int argc, const char **argv)
.finished_init = perf_event__repipe_op2_synth,
.compressed = perf_event__repipe_op4_synth,
.auxtrace = perf_event__repipe_auxtrace,
+ .schedstat_cpu = perf_event__repipe_op2_synth,
+ .schedstat_domain = perf_event__repipe_op2_synth,
},
.input_name = "-",
.samples = LIST_HEAD_INIT(inject.samples),
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 717bdf113241..70bcd36fe1d3 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -19,6 +19,8 @@
#include "util/string2.h"
#include "util/callchain.h"
#include "util/time-utils.h"
+#include "util/synthetic-events.h"
+#include "util/target.h"

#include <subcmd/pager.h>
#include <subcmd/parse-options.h>
@@ -229,8 +231,13 @@ struct perf_sched {
struct perf_time_interval ptime;
struct perf_time_interval hist_time;
volatile bool thread_funcs_exit;
+
+ struct perf_session *session;
+ struct perf_data *data;
};

+static struct perf_sched perf_sched;
+
/* per thread run time data */
struct thread_runtime {
u64 last_time; /* time of previous sched in/out event */
@@ -3504,14 +3511,156 @@ static int __cmd_record(int argc, const char **argv)
return ret;
}

+static int process_synthesized_event(struct perf_tool *tool __maybe_unused,
+ union perf_event *event,
+ struct perf_sample *sample __maybe_unused,
+ struct machine *machine __maybe_unused)
+{
+ if (perf_data__write(perf_sched.data, event, event->header.size) <= 0) {
+ pr_err("failed to write perf data, error: %m\n");
+ return -1;
+ }
+
+ perf_sched.session->header.data_size += event->header.size;
+ return 0;
+}
+
+static void sighandler(int sig __maybe_unused)
+{
+}
+
/* perf.data or any other output file name used by schedstat subcommand (only). */
const char *output_name;
+static struct target target;

-static int perf_sched__schedstat_record(struct perf_sched *sched __maybe_unused,
- int argc __maybe_unused,
- const char **argv __maybe_unused)
+static int perf_sched__schedstat_record(struct perf_sched *sched,
+ int argc, const char **argv)
{
- return 0;
+ struct perf_session *session;
+ struct evlist *evlist;
+ int err = 0;
+ FILE *fp;
+ int flag = 0;
+ char ch;
+ int fd;
+ struct perf_data data = {
+ .path = output_name,
+ .mode = PERF_DATA_MODE_WRITE,
+ };
+
+ signal(SIGINT, sighandler);
+ signal(SIGCHLD, sighandler);
+ signal(SIGTERM, sighandler);
+
+ evlist = evlist__new();
+ if (!evlist)
+ return -ENOMEM;
+
+ session = perf_session__new(&data, &sched->tool);
+ if (IS_ERR(session)) {
+ pr_err("Perf session creation failed.\n");
+ return PTR_ERR(session);
+ }
+
+ session->evlist = evlist;
+
+ perf_sched.session = session;
+ perf_sched.data = &data;
+
+ fd = perf_data__fd(&data);
+
+ /*
+ * Capture all important metadata about the system. Although they
+ * are not used by schedstat tool directly, they provide useful
+ * information about profiled environment.
+ */
+ perf_header__set_feat(&session->header, HEADER_HOSTNAME);
+ perf_header__set_feat(&session->header, HEADER_OSRELEASE);
+ perf_header__set_feat(&session->header, HEADER_VERSION);
+ perf_header__set_feat(&session->header, HEADER_ARCH);
+ perf_header__set_feat(&session->header, HEADER_NRCPUS);
+ perf_header__set_feat(&session->header, HEADER_CPUDESC);
+ perf_header__set_feat(&session->header, HEADER_CPUID);
+ perf_header__set_feat(&session->header, HEADER_TOTAL_MEM);
+ perf_header__set_feat(&session->header, HEADER_CMDLINE);
+ perf_header__set_feat(&session->header, HEADER_CPU_TOPOLOGY);
+ perf_header__set_feat(&session->header, HEADER_NUMA_TOPOLOGY);
+ perf_header__set_feat(&session->header, HEADER_CACHE);
+ perf_header__set_feat(&session->header, HEADER_MEM_TOPOLOGY);
+ perf_header__set_feat(&session->header, HEADER_CPU_PMU_CAPS);
+ perf_header__set_feat(&session->header, HEADER_HYBRID_TOPOLOGY);
+ perf_header__set_feat(&session->header, HEADER_PMU_CAPS);
+
+ err = perf_session__write_header(session, evlist, fd, false);
+ if (err < 0)
+ goto out;
+
+ /* FIXME. Quirk for evlist__prepare_workload() */
+ target.system_wide = true;
+
+ /* FIXME: -p <pid> support */
+ if (argc) {
+ err = evlist__prepare_workload(evlist, &target, argv, false, NULL);
+ if (err)
+ goto out;
+ }
+
+ err = perf_event__synthesize_schedstat(&(sched->tool), session,
+ &session->machines.host /* machine */,
+ process_synthesized_event);
+ if (err < 0)
+ goto out;
+
+ fp = fopen("/proc/sys/kernel/sched_schedstats", "w+");
+ if (!fp) {
+ pr_err("Failed to open /proc/sys/kernel/sched_schedstats\n");
+ goto out;
+ }
+
+ ch = getc(fp);
+ if (ch == '0') {
+ flag = 1;
+ rewind(fp);
+ putc('1', fp);
+ }
+ fclose(fp);
+
+ if (argc)
+ evlist__start_workload(evlist);
+
+ /* wait for signal */
+ pause();
+
+ err = perf_event__synthesize_schedstat(&(sched->tool), session,
+ &session->machines.host /* machine */,
+ process_synthesized_event);
+
+ if (flag == 1) {
+ fp = fopen("/proc/sys/kernel/sched_schedstats", "w");
+ if (!fp) {
+ pr_err("Failed to open /proc/sys/kernel/sched_schedstats\n");
+ goto out;
+ }
+
+ putc('0', fp);
+ fclose(fp);
+ }
+
+ if (err < 0)
+ goto out;
+
+ err = perf_session__write_header(session, evlist, fd, true);
+
+ if (!err)
+ fprintf(stderr, "[ perf sched schedstat: Wrote samples to %s ]\n", data.path);
+ else
+ fprintf(stderr, "[ perf sched schedstat: Failed !! ]\n");
+
+out:
+ close(fd);
+ perf_session__delete(session);
+
+ return err;
}

static int perf_sched__schedstat_report(struct perf_sched *sched __maybe_unused)
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index f32f9abf6344..f2b10bc44a9e 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -77,6 +77,8 @@ static const char *perf_event__names[] = {
[PERF_RECORD_HEADER_FEATURE] = "FEATURE",
[PERF_RECORD_COMPRESSED] = "COMPRESSED",
[PERF_RECORD_FINISHED_INIT] = "FINISHED_INIT",
+ [PERF_RECORD_SCHEDSTAT_CPU] = "SCHEDSTAT_CPU",
+ [PERF_RECORD_SCHEDSTAT_DOMAIN] = "SCHEDSTAT_DOMAIN",
};

const char *perf_event__name(unsigned int id)
@@ -587,6 +589,58 @@ size_t perf_event__fprintf(union perf_event *event, struct machine *machine, FIL
return ret;
}

+static size_t __fprintf_schedstat_cpu_v15(union perf_event *event, FILE *fp)
+{
+ struct perf_record_schedstat_cpu_v15 *csv15;
+ size_t size = 0;
+
+ size = fprintf(fp, "\ncpu%u ", event->schedstat_cpu.cpu);
+ csv15 = &event->schedstat_cpu.v15;
+
+#define CPU_FIELD(type, name) \
+ size += fprintf(fp, "%" PRIu64 " ", (unsigned long)csv15->name);
+
+#include <perf/schedstat-cpu-v15.h>
+#undef CPU_FIELD
+
+ return size;
+}
+
+size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp)
+{
+ if (event->schedstat_cpu.version == 15)
+ return __fprintf_schedstat_cpu_v15(event, fp);
+
+ return fprintf(fp, "Unsupported /proc/schedstat version %d.\n",
+ event->schedstat_cpu.version);
+}
+
+static size_t __fprintf_schedstat_domain_v15(union perf_event *event, FILE *fp)
+{
+ struct perf_record_schedstat_domain_v15 *dsv15;
+ size_t size = 0;
+
+ size = fprintf(fp, "\ndomain%u ", event->schedstat_domain.domain);
+ dsv15 = &event->schedstat_domain.v15;
+
+#define DOMAIN_FIELD(type, name) \
+ size += fprintf(fp, "%" PRIu64 " ", (unsigned long)dsv15->name);
+
+#include <perf/schedstat-domain-v15.h>
+#undef DOMAIN_FIELD
+
+ return size;
+}
+
+size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp)
+{
+ if (event->schedstat_domain.version == 15)
+ return __fprintf_schedstat_domain_v15(event, fp);
+
+ return fprintf(fp, "Unsupported /proc/schedstat version %d.\n",
+ event->schedstat_domain.version);
+}
+
int perf_event__process(struct perf_tool *tool __maybe_unused,
union perf_event *event,
struct perf_sample *sample,
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index d8bcee2e9b93..c6ac391ce09c 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -356,6 +356,8 @@ size_t perf_event__fprintf_cgroup(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_ksymbol(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_bpf(union perf_event *event, FILE *fp);
size_t perf_event__fprintf_text_poke(union perf_event *event, struct machine *machine,FILE *fp);
+size_t perf_event__fprintf_schedstat_cpu(union perf_event *event, FILE *fp);
+size_t perf_event__fprintf_schedstat_domain(union perf_event *event, FILE *fp);
size_t perf_event__fprintf(union perf_event *event, struct machine *machine, FILE *fp);

int kallsyms__get_function_start(const char *kallsyms_filename,
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index a10343b9dcd4..676bfe022afd 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -480,6 +480,26 @@ static int perf_session__process_compressed_event_stub(struct perf_session *sess
return 0;
}

+static int
+process_schedstat_cpu_stub(struct perf_session *perf_session __maybe_unused,
+ union perf_event *event)
+{
+ if (dump_trace)
+ perf_event__fprintf_schedstat_cpu(event, stdout);
+ dump_printf(": unhandled!\n");
+ return 0;
+}
+
+static int
+process_schedstat_domain_stub(struct perf_session *perf_session __maybe_unused,
+ union perf_event *event)
+{
+ if (dump_trace)
+ perf_event__fprintf_schedstat_domain(event, stdout);
+ dump_printf(": unhandled!\n");
+ return 0;
+}
+
void perf_tool__fill_defaults(struct perf_tool *tool)
{
if (tool->sample == NULL)
@@ -562,6 +582,10 @@ void perf_tool__fill_defaults(struct perf_tool *tool)
tool->compressed = perf_session__process_compressed_event;
if (tool->finished_init == NULL)
tool->finished_init = process_event_op2_stub;
+ if (tool->schedstat_cpu == NULL)
+ tool->schedstat_cpu = process_schedstat_cpu_stub;
+ if (tool->schedstat_domain == NULL)
+ tool->schedstat_domain = process_schedstat_domain_stub;
}

static void swap_sample_id_all(union perf_event *event, void *data)
@@ -996,6 +1020,20 @@ static void perf_event__time_conv_swap(union perf_event *event,
}
}

+static void
+perf_event__schedstat_cpu_swap(union perf_event *event __maybe_unused,
+ bool sample_id_all __maybe_unused)
+{
+ /* FIXME */
+}
+
+static void
+perf_event__schedstat_domain_swap(union perf_event *event __maybe_unused,
+ bool sample_id_all __maybe_unused)
+{
+ /* FIXME */
+}
+
typedef void (*perf_event__swap_op)(union perf_event *event,
bool sample_id_all);

@@ -1034,6 +1072,8 @@ static perf_event__swap_op perf_event__swap_ops[] = {
[PERF_RECORD_STAT_ROUND] = perf_event__stat_round_swap,
[PERF_RECORD_EVENT_UPDATE] = perf_event__event_update_swap,
[PERF_RECORD_TIME_CONV] = perf_event__time_conv_swap,
+ [PERF_RECORD_SCHEDSTAT_CPU] = perf_event__schedstat_cpu_swap,
+ [PERF_RECORD_SCHEDSTAT_DOMAIN] = perf_event__schedstat_domain_swap,
[PERF_RECORD_HEADER_MAX] = NULL,
};

@@ -1744,6 +1784,10 @@ static s64 perf_session__process_user_event(struct perf_session *session,
return err;
case PERF_RECORD_FINISHED_INIT:
return tool->finished_init(session, event);
+ case PERF_RECORD_SCHEDSTAT_CPU:
+ return tool->schedstat_cpu(session, event);
+ case PERF_RECORD_SCHEDSTAT_DOMAIN:
+ return tool->schedstat_domain(session, event);
default:
return -EINVAL;
}
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index 5498048f56ea..b5cfd4797495 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -2430,3 +2430,173 @@ int parse_synth_opt(char *synth)

return ret;
}
+
+static bool read_schedstat_cpu_v15(struct io *io,
+ struct perf_record_schedstat_cpu *cs)
+{
+ char ch;
+
+ if (io__get_char(io) != 'p' || io__get_char(io) != 'u')
+ return false;
+
+ if (io__get_dec(io, (__u64 *) &cs->cpu) != ' ')
+ return false;
+
+#define CPU_FIELD(type, name) \
+ do { \
+ ch = io__get_dec(io, (__u64 *) &cs->v15.name); \
+ if (ch != ' ' && ch != '\n') \
+ return false; \
+ } while (0);
+
+#include <perf/schedstat-cpu-v15.h>
+#undef CPU_FIELD
+
+ return true;
+}
+
+static bool read_schedstat_domain_v15(struct io *io,
+ struct perf_record_schedstat_domain *ds)
+{
+ char ch;
+
+ if (io__get_char(io) != 'o' || io__get_char(io) != 'm' || io__get_char(io) != 'a' ||
+ io__get_char(io) != 'i' || io__get_char(io) != 'n')
+ return false;
+
+ if (io__get_dec(io, (__u64 *) &ds->domain) != ' ')
+ return false;
+
+ while (io__get_char(io) != ' ');
+
+#define DOMAIN_FIELD(type, name) \
+ do { \
+ ch = io__get_dec(io, (__u64 *) &ds->v15.name); \
+ if (ch != ' ' && ch != '\n') \
+ return false; \
+ } while (0);
+
+#include <perf/schedstat-domain-v15.h>
+#undef DOMAIN_FIELD
+
+ return true;
+}
+
+static union perf_event *__synthesize_schedstat_cpu_v15(struct io *io, __u16 *cpu,
+ __u64 timestamp,
+ struct machine *machine)
+{
+ union perf_event *event;
+ size_t size;
+
+ size = sizeof(struct perf_record_schedstat_cpu);
+ event = zalloc(size + machine->id_hdr_size);
+
+ if (!event)
+ return NULL;
+
+ event->schedstat_cpu.header.type = PERF_RECORD_SCHEDSTAT_CPU;
+ event->schedstat_cpu.version = 15;
+ event->schedstat_cpu.timestamp = timestamp;
+ event->schedstat_cpu.header.size = size + machine->id_hdr_size;
+
+ if (read_schedstat_cpu_v15(io, &event->schedstat_cpu)) {
+ *cpu = event->schedstat_cpu.cpu;
+ } else {
+ free(event);
+ return NULL;
+ }
+
+ return event;
+}
+
+static union perf_event *__synthesize_schedstat_domain_v15(struct io *io, __u16 cpu,
+ __u64 timestamp,
+ struct machine *machine)
+{
+ union perf_event *event;
+ size_t size;
+
+ size = sizeof(struct perf_record_schedstat_domain);
+ event = zalloc(size + machine->id_hdr_size);
+
+ if (!event)
+ return NULL;
+
+ event->schedstat_domain.header.type = PERF_RECORD_SCHEDSTAT_DOMAIN;
+ event->schedstat_domain.version = 15;
+ event->schedstat_domain.timestamp = timestamp;
+ event->schedstat_domain.header.size = size + machine->id_hdr_size;
+
+ if (read_schedstat_domain_v15(io, &event->schedstat_domain)) {
+ event->schedstat_domain.cpu = cpu;
+ } else {
+ free(event);
+ return NULL;
+ }
+
+ return event;
+}
+
+int perf_event__synthesize_schedstat(struct perf_tool *tool,
+ struct perf_session *session __maybe_unused,
+ struct machine *machine,
+ perf_event__handler_t process)
+{
+ union perf_event *event = NULL;
+ size_t line_len = 0;
+ char *line = NULL;
+ char bf[BUFSIZ];
+ __u64 timestamp;
+ __u16 cpu = -1;
+ struct io io;
+ int ret = -1;
+ char ch;
+
+ io.fd = open("/proc/schedstat", O_RDONLY, 0);
+ if (io.fd < 0) {
+ pr_debug("Failed to open /proc/schedstat\n");
+ return -1;
+ }
+ io__init(&io, io.fd, bf, sizeof(bf));
+
+ if (io__getline(&io, &line, &line_len) < 0 || !line_len)
+ goto out;
+ if (strcmp(line, "version 15\n")) {
+ pr_debug("Unsupported /proc/schedstat version\n");
+ goto out_free_line;
+ }
+
+ if (io__getline(&io, &line, &line_len) < 0 || !line_len)
+ goto out_free_ev;
+ timestamp = atol(line + 10);
+
+ ch = io__get_char(&io);
+
+ while (!io.eof) {
+ if (ch == 'c') {
+ event = __synthesize_schedstat_cpu_v15(&io, &cpu, timestamp,
+ machine);
+ } else if (ch == 'd') {
+ event = __synthesize_schedstat_domain_v15(&io, cpu, timestamp,
+ machine);
+ }
+ if (!event)
+ goto out_free_line;
+
+ if (process(tool, event, NULL, NULL) < 0)
+ goto out_free_ev;
+
+ ch = io__get_char(&io);
+ }
+
+ ret = 0;
+
+out_free_ev:
+ free(event);
+out_free_line:
+ free(line);
+out:
+ close(io.fd);
+ return ret;
+}
diff --git a/tools/perf/util/synthetic-events.h b/tools/perf/util/synthetic-events.h
index 53737d1619a4..31498b8426fc 100644
--- a/tools/perf/util/synthetic-events.h
+++ b/tools/perf/util/synthetic-events.h
@@ -122,4 +122,8 @@ int perf_event__synthesize_for_pipe(struct perf_tool *tool,
struct perf_data *data,
perf_event__handler_t process);

+int perf_event__synthesize_schedstat(struct perf_tool *tool,
+ struct perf_session *session,
+ struct machine *machine,
+ perf_event__handler_t process);
#endif // __PERF_SYNTHETIC_EVENTS_H
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index c957fb849ac6..fe763db8a7ad 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -77,7 +77,9 @@ struct perf_tool {
stat,
stat_round,
feature,
- finished_init;
+ finished_init,
+ schedstat_cpu,
+ schedstat_domain;
event_op4 compressed;
event_op3 auxtrace;
bool ordered_events;
--
2.44.0


2024-05-08 06:08:04

by Ravi Bangoria

Subject: [RFC 4/4] perf sched schedstat: Add support for report subcommand

From: Swapnil Sapkal <[email protected]>

`perf sched schedstat record` captures two sets of samples. For a workload
profile, the first set is taken right before the workload starts and the
second right after it finishes. For a systemwide profile, the first set is
taken when profiling begins and the second on receiving SIGINT.

Add a `perf sched schedstat report` subcommand that reads both sets of
samples, computes the difference between them and renders a final report.
The report prints scheduler stats at cpu granularity as well as at sched
domain granularity.

Usage example:

# perf sched schedstat record
# perf sched schedstat report
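
As a rough illustration of the before/after diff described above, here is a
minimal sketch in plain C. It is not the code from this patch: the
`cpu_snapshot` struct, `snapshot_diff()` and `pct()` are hypothetical names,
and only three of the v15 cpu fields are modelled.

```c
#include <stdint.h>

/*
 * Hypothetical, trimmed-down per-cpu snapshot. The real record carries
 * every field listed in schedstat-cpu-v15.h.
 */
struct cpu_snapshot {
	uint64_t timestamp;    /* jiffies when the snapshot was taken */
	uint32_t sched_count;  /* schedule() called */
	uint32_t sched_goidle; /* schedule() left the processor idle */
};

/* Subtract the "before" snapshot from the "after" one, field by field. */
static struct cpu_snapshot snapshot_diff(const struct cpu_snapshot *before,
					 const struct cpu_snapshot *after)
{
	struct cpu_snapshot d = {
		.timestamp    = after->timestamp - before->timestamp,
		.sched_count  = after->sched_count - before->sched_count,
		.sched_goidle = after->sched_goidle - before->sched_goidle,
	};

	return d;
}

/* Percentage of 'part' in 'whole'; 0.0 when 'whole' is zero. */
static double pct(uint64_t part, uint64_t whole)
{
	return whole ? ((double)part / whole) * 100.0 : 0.0;
}
```

The report step would then print each diffed counter and, where a counter is
a subset of another one (for example sched_goidle relative to sched_count),
the result of pct() next to it.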

Co-developed-by: Ravi Bangoria <[email protected]>
Signed-off-by: Swapnil Sapkal <[email protected]>
Signed-off-by: Ravi Bangoria <[email protected]>
---
tools/lib/perf/include/perf/event.h | 4 +-
.../lib/perf/include/perf/schedstat-cpu-v15.h | 29 +-
.../perf/include/perf/schedstat-domain-v15.h | 153 ++++++++---
tools/perf/builtin-sched.c | 254 +++++++++++++++++-
tools/perf/util/event.c | 4 +-
tools/perf/util/synthetic-events.c | 4 +-
6 files changed, 395 insertions(+), 53 deletions(-)

diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index 835bb3e2dcbf..53a58d4ebd17 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -452,7 +452,7 @@ struct perf_record_compressed {
};

struct perf_record_schedstat_cpu_v15 {
-#define CPU_FIELD(type, name) type name;
+#define CPU_FIELD(type, name, desc, format, is_pct, pct_of) type name;
#include "schedstat-cpu-v15.h"
#undef CPU_FIELD
};
@@ -468,7 +468,7 @@ struct perf_record_schedstat_cpu {
};

struct perf_record_schedstat_domain_v15 {
-#define DOMAIN_FIELD(type, name) type name;
+#define DOMAIN_FIELD(type, name, desc, format, is_jiffies) type name;
#include "schedstat-domain-v15.h"
#undef DOMAIN_FIELD
};
diff --git a/tools/lib/perf/include/perf/schedstat-cpu-v15.h b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
index 8dca84b11902..96a9aba3d2cf 100644
--- a/tools/lib/perf/include/perf/schedstat-cpu-v15.h
+++ b/tools/lib/perf/include/perf/schedstat-cpu-v15.h
@@ -1,13 +1,22 @@
/* SPDX-License-Identifier: GPL-2.0 */

#ifdef CPU_FIELD
-CPU_FIELD(__u32, yld_count)
-CPU_FIELD(__u32, array_exp)
-CPU_FIELD(__u32, sched_count)
-CPU_FIELD(__u32, sched_goidle)
-CPU_FIELD(__u32, ttwu_count)
-CPU_FIELD(__u32, ttwu_local)
-CPU_FIELD(__u64, rq_cpu_time)
-CPU_FIELD(__u64, run_delay)
-CPU_FIELD(__u64, pcount)
-#endif
+CPU_FIELD(__u32, yld_count, "sched_yield() count",
+ "%11u", false, yld_count)
+CPU_FIELD(__u32, array_exp, "Legacy counter can be ignored",
+ "%11u", false, array_exp)
+CPU_FIELD(__u32, sched_count, "schedule() called",
+ "%11u", false, sched_count)
+CPU_FIELD(__u32, sched_goidle, "schedule() left the processor idle",
+ "%11u", true, sched_count)
+CPU_FIELD(__u32, ttwu_count, "try_to_wake_up() was called",
+ "%11u", false, ttwu_count)
+CPU_FIELD(__u32, ttwu_local, "try_to_wake_up() was called to wake up the local cpu",
+ "%11u", true, ttwu_count)
+CPU_FIELD(__u64, rq_cpu_time, "total runtime by tasks on this processor (in jiffies)",
+ "%11llu", false, rq_cpu_time)
+CPU_FIELD(__u64, run_delay, "total waittime by tasks on this processor (in jiffies)",
+ "%11llu", true, rq_cpu_time)
+CPU_FIELD(__u64, pcount, "total timeslices run on this cpu",
+ "%11llu", false, pcount)
+#endif /* CPU_FIELD */
diff --git a/tools/lib/perf/include/perf/schedstat-domain-v15.h b/tools/lib/perf/include/perf/schedstat-domain-v15.h
index 181b1c10395c..5d8d65f79674 100644
--- a/tools/lib/perf/include/perf/schedstat-domain-v15.h
+++ b/tools/lib/perf/include/perf/schedstat-domain-v15.h
@@ -1,40 +1,121 @@
/* SPDX-License-Identifier: GPL-2.0 */

#ifdef DOMAIN_FIELD
-DOMAIN_FIELD(__u32, idle_lb_count)
-DOMAIN_FIELD(__u32, idle_lb_balanced)
-DOMAIN_FIELD(__u32, idle_lb_failed)
-DOMAIN_FIELD(__u32, idle_lb_imbalance)
-DOMAIN_FIELD(__u32, idle_lb_gained)
-DOMAIN_FIELD(__u32, idle_lb_hot_gained)
-DOMAIN_FIELD(__u32, idle_lb_nobusyq)
-DOMAIN_FIELD(__u32, idle_lb_nobusyg)
-DOMAIN_FIELD(__u32, busy_lb_count)
-DOMAIN_FIELD(__u32, busy_lb_balanced)
-DOMAIN_FIELD(__u32, busy_lb_failed)
-DOMAIN_FIELD(__u32, busy_lb_imbalance)
-DOMAIN_FIELD(__u32, busy_lb_gained)
-DOMAIN_FIELD(__u32, busy_lb_hot_gained)
-DOMAIN_FIELD(__u32, busy_lb_nobusyq)
-DOMAIN_FIELD(__u32, busy_lb_nobusyg)
-DOMAIN_FIELD(__u32, newidle_lb_count)
-DOMAIN_FIELD(__u32, newidle_lb_balanced)
-DOMAIN_FIELD(__u32, newidle_lb_failed)
-DOMAIN_FIELD(__u32, newidle_lb_imbalance)
-DOMAIN_FIELD(__u32, newidle_lb_gained)
-DOMAIN_FIELD(__u32, newidle_lb_hot_gained)
-DOMAIN_FIELD(__u32, newidle_lb_nobusyq)
-DOMAIN_FIELD(__u32, newidle_lb_nobusyg)
-DOMAIN_FIELD(__u32, alb_count)
-DOMAIN_FIELD(__u32, alb_failed)
-DOMAIN_FIELD(__u32, alb_pushed)
-DOMAIN_FIELD(__u32, sbe_count)
-DOMAIN_FIELD(__u32, sbe_balanced)
-DOMAIN_FIELD(__u32, sbe_pushed)
-DOMAIN_FIELD(__u32, sbf_count)
-DOMAIN_FIELD(__u32, sbf_balanced)
-DOMAIN_FIELD(__u32, sbf_pushed)
-DOMAIN_FIELD(__u32, ttwu_wake_remote)
-DOMAIN_FIELD(__u32, ttwu_move_affine)
-DOMAIN_FIELD(__u32, ttwu_move_balance)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY("<Category idle>")
#endif
+DOMAIN_FIELD(__u32, idle_lb_count,
+ "load_balance() count on cpu idle", "%11u", true)
+DOMAIN_FIELD(__u32, idle_lb_balanced,
+ "load_balance() found balanced on cpu idle", "%11u", true)
+DOMAIN_FIELD(__u32, idle_lb_failed,
+ "load_balance() move task failed on cpu idle", "%11u", true)
+DOMAIN_FIELD(__u32, idle_lb_imbalance,
+ "imbalance sum on cpu idle", "%11u", false)
+DOMAIN_FIELD(__u32, idle_lb_gained,
+ "pull_task() count on cpu idle", "%11u", false)
+DOMAIN_FIELD(__u32, idle_lb_hot_gained,
+ "pull_task() when target task was cache-hot on cpu idle", "%11u", false)
+DOMAIN_FIELD(__u32, idle_lb_nobusyq,
+ "load_balance() failed to find busier queue on cpu idle", "%11u", true)
+DOMAIN_FIELD(__u32, idle_lb_nobusyg,
+ "load_balance() failed to find busier group on cpu idle", "%11u", true)
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
+ idle_lb_count, idle_lb_balanced, idle_lb_failed)
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu idle)", "%11.2Lf",
+ idle_lb_count, idle_lb_balanced, idle_lb_failed, idle_lb_gained)
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY("<Category busy>")
+#endif
+DOMAIN_FIELD(__u32, busy_lb_count,
+ "load_balance() count on cpu busy", "%11u", true)
+DOMAIN_FIELD(__u32, busy_lb_balanced,
+ "load_balance() found balanced on cpu busy", "%11u", true)
+DOMAIN_FIELD(__u32, busy_lb_failed,
+ "load_balance() move task failed on cpu busy", "%11u", true)
+DOMAIN_FIELD(__u32, busy_lb_imbalance,
+ "imbalance sum on cpu busy", "%11u", false)
+DOMAIN_FIELD(__u32, busy_lb_gained,
+ "pull_task() count on cpu busy", "%11u", false)
+DOMAIN_FIELD(__u32, busy_lb_hot_gained,
+ "pull_task() when target task was cache-hot on cpu busy", "%11u", false)
+DOMAIN_FIELD(__u32, busy_lb_nobusyq,
+ "load_balance() failed to find busier queue on cpu busy", "%11u", true)
+DOMAIN_FIELD(__u32, busy_lb_nobusyg,
+ "load_balance() failed to find busier group on cpu busy", "%11u", true)
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
+ busy_lb_count, busy_lb_balanced, busy_lb_failed)
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu busy)", "%11.2Lf",
+ busy_lb_count, busy_lb_balanced, busy_lb_failed, busy_lb_gained)
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY("<Category newidle>")
+#endif
+DOMAIN_FIELD(__u32, newidle_lb_count,
+ "load_balance() count on cpu newly idle", "%11u", true)
+DOMAIN_FIELD(__u32, newidle_lb_balanced,
+ "load_balance() found balanced on cpu newly idle", "%11u", true)
+DOMAIN_FIELD(__u32, newidle_lb_failed,
+ "load_balance() move task failed on cpu newly idle", "%11u", true)
+DOMAIN_FIELD(__u32, newidle_lb_imbalance,
+ "imbalance sum on cpu newly idle", "%11u", false)
+DOMAIN_FIELD(__u32, newidle_lb_gained,
+ "pull_task() count on cpu newly idle", "%11u", false)
+DOMAIN_FIELD(__u32, newidle_lb_hot_gained,
+ "pull_task() when target task was cache-hot on cpu newly idle", "%11u", false)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
+ "load_balance() failed to find busier queue on cpu newly idle", "%11u", true)
+DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
+ "load_balance() failed to find busier group on cpu newly idle", "%11u", true)
+#ifdef DERIVED_CNT_FIELD
+DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
+ newidle_lb_count, newidle_lb_balanced, newidle_lb_failed)
+#endif
+#ifdef DERIVED_AVG_FIELD
+DERIVED_AVG_FIELD("avg task pulled per successful lb attempt (cpu newly idle)", "%11.2Lf",
+ newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, newidle_lb_gained)
+#endif
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY("<Category active_load_balance()>")
+#endif
+DOMAIN_FIELD(__u32, alb_count,
+ "active_load_balance() count", "%11u", false)
+DOMAIN_FIELD(__u32, alb_failed,
+ "active_load_balance() move task failed", "%11u", false)
+DOMAIN_FIELD(__u32, alb_pushed,
+ "active_load_balance() successfully moved a task", "%11u", false)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY("<Category sched_balance_exec()>")
+#endif
+DOMAIN_FIELD(__u32, sbe_count,
+ "sbe_count is not used", "%11u", false)
+DOMAIN_FIELD(__u32, sbe_balanced,
+ "sbe_balanced is not used", "%11u", false)
+DOMAIN_FIELD(__u32, sbe_pushed,
+ "sbe_pushed is not used", "%11u", false)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY("<Category sched_balance_fork()>")
+#endif
+DOMAIN_FIELD(__u32, sbf_count,
+ "sbf_count is not used", "%11u", false)
+DOMAIN_FIELD(__u32, sbf_balanced,
+ "sbf_balanced is not used", "%11u", false)
+DOMAIN_FIELD(__u32, sbf_pushed,
+ "sbf_pushed is not used", "%11u", false)
+#ifdef DOMAIN_CATEGORY
+DOMAIN_CATEGORY("<Wakeup Info>")
+#endif
+DOMAIN_FIELD(__u32, ttwu_wake_remote,
+ "try_to_wake_up() awoke a task that last ran on a diff cpu", "%11u", false)
+DOMAIN_FIELD(__u32, ttwu_move_affine,
+ "try_to_wake_up() moved task because cache-cold on own cpu", "%11u", false)
+DOMAIN_FIELD(__u32, ttwu_move_balance,
+ "try_to_wake_up() started passive balancing", "%11u", false)
+#endif /* DOMAIN_FIELD */
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 70bcd36fe1d3..26f7b2ee12e1 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -3663,11 +3663,263 @@ static int perf_sched__schedstat_record(struct perf_sched *sched,
return err;
}

-static int perf_sched__schedstat_report(struct perf_sched *sched __maybe_unused)
+struct schedstat_domain {
+ struct perf_record_schedstat_domain domain_data;
+ struct schedstat_domain *next;
+};
+
+struct schedstat_cpu {
+ struct perf_record_schedstat_cpu cpu_data;
+ struct schedstat_domain *domain_head;
+ struct schedstat_cpu *next;
+};
+
+struct schedstat_cpu *cpu_head = NULL, *cpu_tail = NULL, *cpu_second_pass = NULL;
+struct schedstat_domain *domain_tail = NULL, *domain_second_pass = NULL;
+
+static void store_schedtstat_cpu_diff(struct schedstat_cpu *after_workload)
+{
+ struct perf_record_schedstat_cpu *before = &cpu_second_pass->cpu_data;
+ struct perf_record_schedstat_cpu *after = &after_workload->cpu_data;
+
+#define CPU_FIELD(type, name, desc, format, is_pct, pct_of) \
+ (before->v15.name = after->v15.name - before->v15.name);
+
+#include <perf/schedstat-cpu-v15.h>
+#undef CPU_FIELD
+}
+
+static void store_schedstat_domain_diff(struct schedstat_domain *after_workload)
+{
+ struct perf_record_schedstat_domain *before = &domain_second_pass->domain_data;
+ struct perf_record_schedstat_domain *after = &after_workload->domain_data;
+
+#define DOMAIN_FIELD(type, name, desc, format, is_jiffies) \
+ (before->v15.name = after->v15.name - before->v15.name);
+
+#include <perf/schedstat-domain-v15.h>
+#undef DOMAIN_FIELD
+}
+
+static void print_separator(int pre_dash_cnt, const char *s, int post_dash_cnt)
{
+ int i;
+
+ for (i = 0; i < pre_dash_cnt; ++i)
+ printf("-");
+
+ printf("%s", s);
+
+ for (i = 0; i < post_dash_cnt; ++i)
+ printf("-");
+
+ printf("\n");
+}
+
+static inline void print_cpu_stats(struct perf_record_schedstat_cpu *cs)
+{
+#define CALC_PCT(x, y) ((y) ? ((double)(x) / (y)) * 100 : 0.0)
+
+#define CPU_FIELD(type, name, desc, format, is_pct, pct_of) \
+ do { \
+ if (is_pct) { \
+ printf("%-60s: " format " ( %3.2lf%% )\n", desc, \
+ cs->v15.name, \
+ CALC_PCT(cs->v15.name, cs->v15.pct_of)); \
+ } else { \
+ printf("%-60s: " format "\n", desc, cs->v15.name); \
+ } \
+ } while (0);
+
+#include <perf/schedstat-cpu-v15.h>
+
+#undef CPU_FIELD
+#undef CALC_PCT
+}
+
+static inline void print_domain_stats(struct perf_record_schedstat_domain *ds,
+ __u64 jiffies)
+{
+#define DOMAIN_CATEGORY(desc) \
+ do { \
+ int len = strlen(desc); \
+ int pre_dash_cnt = (100 - len) / 2; \
+ int post_dash_cnt = 100 - len - pre_dash_cnt; \
+ print_separator(pre_dash_cnt, desc, post_dash_cnt); \
+ } while (0);
+
+#define CALC_AVG(x, y) ((y) ? (long double)(x) / (y) : 0.0)
+
+#define DOMAIN_FIELD(type, name, desc, format, is_jiffies) \
+ do { \
+ if (is_jiffies) { \
+ printf("%-65s: " format " $ %11.2Lf $\n", desc, \
+ ds->v15.name, \
+ CALC_AVG(jiffies, ds->v15.name)); \
+ } else { \
+ printf("%-65s: " format "\n", desc, ds->v15.name); \
+ } \
+ } while (0);
+
+#define DERIVED_CNT_FIELD(desc, format, x, y, z) \
+ do { \
+ printf("*%-64s: " format "\n", desc, \
+ (ds->v15.x) - (ds->v15.y) - (ds->v15.z)); \
+ } while (0);
+
+#define DERIVED_AVG_FIELD(desc, format, x, y, z, w) \
+ do { \
+ printf("*%-64s: " format "\n", desc, CALC_AVG(ds->v15.w, \
+ ((ds->v15.x) - (ds->v15.y) - (ds->v15.z)))); \
+ } while (0);
+
+#include <perf/schedstat-domain-v15.h>
+
+#undef DERIVED_AVG_FIELD
+#undef DERIVED_CNT_FIELD
+#undef DOMAIN_FIELD
+#undef CALC_AVG
+#undef DOMAIN_CATEGORY
+}
+
+static void show_schedstat_data(void)
+{
+ struct perf_record_schedstat_domain *ds = NULL;
+ struct perf_record_schedstat_cpu *cs = NULL;
+ struct schedstat_cpu *cptr = cpu_head;
+ struct schedstat_domain *dptr = NULL;
+ __u64 jiffies = cptr->cpu_data.timestamp;
+
+ print_separator(100, "", 0);
+ printf("Time elapsed (in jiffies) : %11llu\n", jiffies);
+
+ while (cptr) {
+ cs = &cptr->cpu_data;
+
+ print_separator(100, "", 0);
+ printf("CPU %u\n", cs->cpu);
+ print_separator(100, "", 0);
+
+ print_cpu_stats(cs);
+ print_separator(100, "", 0);
+
+ dptr = cptr->domain_head;
+
+ while (dptr) {
+ ds = &dptr->domain_data;
+
+ printf("CPU %u DOMAIN %u\n", cs->cpu, ds->domain);
+ print_separator(100, "", 0);
+ print_domain_stats(ds, jiffies);
+ print_separator(100, "", 0);
+
+ dptr = dptr->next;
+ }
+ cptr = cptr->next;
+ }
+}
+
+static int perf_sched__process_schedstat(struct perf_session *session __maybe_unused,
+ union perf_event *event)
+{
+ static bool after_workload_flag;
+
+ if (event->header.type == PERF_RECORD_SCHEDSTAT_CPU) {
+ struct schedstat_cpu *temp =
+ (struct schedstat_cpu *)malloc(sizeof(struct schedstat_cpu));
+ temp->cpu_data = event->schedstat_cpu;
+ temp->next = NULL;
+ temp->domain_head = NULL;
+
+ if (cpu_head && temp->cpu_data.cpu == 0)
+ after_workload_flag = true;
+
+ if (!after_workload_flag) {
+ if (temp->cpu_data.cpu == 0)
+ cpu_head = temp;
+ else
+ cpu_tail->next = temp;
+
+ cpu_tail = temp;
+ } else {
+ if (temp->cpu_data.cpu == 0) {
+ cpu_second_pass = cpu_head;
+ cpu_head->cpu_data.timestamp =
+ temp->cpu_data.timestamp - cpu_second_pass->cpu_data.timestamp;
+ } else {
+ cpu_second_pass = cpu_second_pass->next;
+ }
+ domain_second_pass = cpu_second_pass->domain_head;
+ store_schedtstat_cpu_diff(temp);
+ }
+ } else if (event->header.type == PERF_RECORD_SCHEDSTAT_DOMAIN) {
+ struct schedstat_domain *temp =
+ (struct schedstat_domain *)malloc(sizeof(struct schedstat_domain));
+ temp->domain_data = event->schedstat_domain;
+ temp->next = NULL;
+
+ if (!after_workload_flag) {
+ if (cpu_tail->domain_head == NULL) {
+ cpu_tail->domain_head = temp;
+ domain_tail = temp;
+ } else {
+ domain_tail->next = temp;
+ domain_tail = temp;
+ }
+ } else {
+ store_schedstat_domain_diff(temp);
+ domain_second_pass = domain_second_pass->next;
+ }
+ }
+
return 0;
}

+static void free_schedstat(void)
+{
+ struct schedstat_cpu *cptr = cpu_head, *tmp_cptr;
+ struct schedstat_domain *dptr = NULL, *tmp_dptr;
+
+ while (cptr) {
+ tmp_cptr = cptr;
+ dptr = cptr->domain_head;
+
+ while (dptr) {
+ tmp_dptr = dptr;
+ dptr = dptr->next;
+ free(tmp_dptr);
+ }
+ cptr = cptr->next;
+ free(tmp_cptr);
+ }
+}
+
+static int perf_sched__schedstat_report(struct perf_sched *sched)
+{
+ struct perf_session *session;
+ struct perf_data data = {
+ .path = input_name,
+ .mode = PERF_DATA_MODE_READ,
+ };
+ int err;
+
+ sched->tool.schedstat_cpu = perf_sched__process_schedstat;
+ sched->tool.schedstat_domain = perf_sched__process_schedstat;
+
+ session = perf_session__new(&data, &sched->tool);
+ if (IS_ERR(session)) {
+ pr_err("Perf session creation failed.\n");
+ return PTR_ERR(session);
+ }
+
+ err = perf_session__process_events(session);
+
+ perf_session__delete(session);
+ show_schedstat_data();
+ free_schedstat();
+ return err;
+}
+
static const char default_sort_order[] = "avg, max, switch, runtime";
static struct perf_sched perf_sched = {
.tool = {
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index f2b10bc44a9e..ad2f5392f145 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -597,7 +597,7 @@ static size_t __fprintf_schedstat_cpu_v15(union perf_event *event, FILE *fp)
size = fprintf(fp, "\ncpu%u ", event->schedstat_cpu.cpu);
csv15 = &event->schedstat_cpu.v15;

-#define CPU_FIELD(type, name) \
+#define CPU_FIELD(type, name, desc, format, is_pct, pct_of) \
size += fprintf(fp, "%" PRIu64 " ", (unsigned long)csv15->name);

#include <perf/schedstat-cpu-v15.h>
@@ -623,7 +623,7 @@ static size_t __fprintf_schedstat_domain_v15(union perf_event *event, FILE *fp)
size = fprintf(fp, "\ndomain%u ", event->schedstat_domain.domain);
dsv15 = &event->schedstat_domain.v15;

-#define DOMAIN_FIELD(type, name) \
+#define DOMAIN_FIELD(type, name, desc, format, is_jiffies) \
size += fprintf(fp, "%" PRIu64 " ", (unsigned long)dsv15->name);

#include <perf/schedstat-domain-v15.h>
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index b5cfd4797495..1da0cc02e801 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -2442,7 +2442,7 @@ static bool read_schedstat_cpu_v15(struct io *io,
if (io__get_dec(io, (__u64 *) &cs->cpu) != ' ')
return false;

-#define CPU_FIELD(type, name) \
+#define CPU_FIELD(type, name, desc, format, is_pct, pct_of) \
do { \
ch = io__get_dec(io, (__u64 *) &cs->v15.name); \
if (ch != ' ' && ch != '\n') \
@@ -2469,7 +2469,7 @@ static bool read_schedstat_domain_v15(struct io *io,

while (io__get_char(io) != ' ');

-#define DOMAIN_FIELD(type, name) \
+#define DOMAIN_FIELD(type, name, desc, format, is_jiffies) \
do { \
ch = io__get_dec(io, (__u64 *) &ds->v15.name); \
if (ch != ' ' && ch != '\n') \
--
2.44.0
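
A minimal sketch of the pattern these hunks rely on: CPU_FIELD() and DOMAIN_FIELD() gain extra arguments for the report, while the raw printer and the parser keep using only `name` and ignore the rest. The example below is an illustration under stated assumptions, not the patch's code. DEMO_CPU_FIELDS, struct demo_cpu_stats, the demo_* functions and the three field entries are hypothetical stand-ins for <perf/schedstat-cpu-v15.h>, whose contents are not shown in this thread. The patch re-includes the header after each CPU_FIELD definition; the sketch passes the macro to a list instead so that it fits in one file.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical field list (stand-in for the real schedstat-cpu-v15.h).
 * Arguments: type, name, desc, format, is_pct, pct_of.
 */
#define DEMO_CPU_FIELDS(F)						\
	F(uint32_t, yld_count, "sched_yield() calls", "%11u", 0, yld_count)	\
	F(uint32_t, sched_count, "schedule() calls", "%11u", 0, yld_count)	\
	F(uint64_t, run_delay, "time waiting on runqueue", "%11llu", 0, yld_count)

/* Consumer 1: build the struct; only `type` and `name` are used. */
struct demo_cpu_stats {
#define CPU_FIELD(type, name, desc, format, is_pct, pct_of) type name;
	DEMO_CPU_FIELDS(CPU_FIELD)
#undef CPU_FIELD
};

/*
 * Consumer 2: raw dump. Only `name` is used; the extra arguments are
 * accepted and ignored, which is why widening the macro signature is
 * enough to keep this code working.
 * Returns the number of characters the full dump needs.
 */
static size_t demo_dump_raw(const struct demo_cpu_stats *cs, char *buf, size_t len)
{
	size_t size = 0;

	assert(cs && buf && len > 0);
	buf[0] = '\0';
#define CPU_FIELD(type, name, desc, format, is_pct, pct_of)		\
	if (size < len) {						\
		size += (size_t)snprintf(buf + size, len - size,	\
					 "%" PRIu64 " ", (uint64_t)cs->name); \
	}
	DEMO_CPU_FIELDS(CPU_FIELD)
#undef CPU_FIELD
	return size;
}

/* Consumer 3: report. Uses `desc` in addition to `name`. */
static size_t demo_report(const struct demo_cpu_stats *cs, char *buf, size_t len)
{
	size_t size = 0;

	assert(cs && buf && len > 0);
	buf[0] = '\0';
#define CPU_FIELD(type, name, desc, format, is_pct, pct_of)		\
	if (size < len) {						\
		size += (size_t)snprintf(buf + size, len - size,	\
					 "%-28s: %" PRIu64 "\n", desc,	\
					 (uint64_t)cs->name);		\
	}
	DEMO_CPU_FIELDS(CPU_FIELD)
#undef CPU_FIELD
	return size;
}
```

With this layout, adding a counter means adding one entry to the field list; the struct, the dump and the report pick it up without further changes.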


2024-05-09 05:08:40

by Namhyung Kim

Subject: Re: [RFC 0/4] perf sched: Introduce schedstat tool

Hi Ravi,

On Tue, May 7, 2024 at 11:05 PM Ravi Bangoria <[email protected]> wrote:
>
> MOTIVATION
> ----------
>
> Existing `perf sched` is quite exhaustive and provides lot of insights
> into scheduler behavior but it quickly becomes impractical to use for
> long running or scheduler intensive workload. For ex, `perf sched record`
> has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
> on a 2-socket 128 Cores 256 Threads 3rd Generation EPYC Server), and it
> generates huge 56G perf.data for which perf takes ~137 mins to prepare
> and write it to disk [1].

Right, this is painful.

>
> Unlike `perf sched record`, which hooks onto set of scheduler tracepoints
> and generates samples on a tracepoint hit, `perf sched schedstat record`
> takes snapshot of the /proc/schedstat file before and after the workload,
> i.e. there is zero interference on workload run. Also, it takes very
> minimal time to parse /proc/schedstat, convert it into perf samples and
> save those samples into perf.data file. Result perf.data file is much
> smaller. So, overall `perf sched schedstat record` is much more light-
> weight compare to `perf sched record`.

Nice work!

>
> We, internally at AMD, have been using this (a variant of this, known as
> "sched-scoreboard"[2]) and found it to be very useful to analyse impact
> of any scheduler code changes[3][4].
>
> Please note that, this is not a replacement of perf sched record/report.
> The intended users of the new tool are scheduler developers, not regular
> users.

Great, I think it's very useful.

>
> USAGE
> -----
>
> # perf sched schedstat record
> # perf sched schedstat report

Hmm. I think we can remove the duplication in 'sched'. :)
Given you are thinking of taskstat, how about making it
'cpustat' instead?

Also I think it'd be easier if you also provide a 'live' mode so that
users can skip the record + report steps and run the workload
directly, like uftrace does. :)

Something like this:

# perf sched cpustat myworkload
(result here ...)

Thanks,
Namhyung

>
> Note: Although perf schedstat tool supports workload profiling syntax
> (i.e. -- <workload> ), the recorded profile is still systemwide since
> the /proc/schedstat is a systemwide file.

2024-05-09 05:20:49

by Ravi Bangoria

Subject: Re: [RFC 0/4] perf sched: Introduce schedstat tool

>> TODO:
>> - This RFC series supports v15 layout of /proc/schedstat but v16 layout
>> is already being pushed upstream. We are planning to include v16 as
>> well in the next revision.
>> - Currently schedstat tool provides statistics of only one run but we
>> are planning to add `perf sched schedstat diff` which can compare
>> the data of two different runs (possibly good and bad) and highlight
>> where scheduler decisions are impacting workload performance.
>> - perf sched schedstat records /proc/schedstat which is a cpu and domain
>> level scheduler statistic. We are planning to add taskstat tool which
>> reads task stats from procfs and generates scheduler statistic report
>> at task granularity. This will probably be a standalone tool, something
>> like `perf sched taskstat record/report`.
>> - /proc/schedstat shows cpumask in domain line to indicate a group of
>> cpus that belong to the domain. Since we are not using domain<->cpumask
>> data anywhere, we are not capturing it as part of perf sample. But we
>> are planning to include it in the next revision.
>> - We have tested the patch only on AMD machines, not on other platforms.
>
> This is great! Is it possible to add some basic shell script testing:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/shell?h=perf-tools-next
> for the sake of coverage, other platform testing, etc. ?

Sure, I will think about adding some simple tests.

Thanks,
Ravi

2024-05-09 06:03:20

by Ravi Bangoria

Subject: Re: [RFC 0/4] perf sched: Introduce schedstat tool

>> USAGE
>> -----
>>
>> # perf sched schedstat record
>> # perf sched schedstat report
>
> Hmm. I think we can remove the duplication in 'sched'. :)

You mean `perf sched stat record/report` ?

> Given you are thinking of taskstat, how about making it
> 'cpustat' instead?

Sure. How about:

# perf sched stat --cpu --task record
# perf sched stat report

> Also I think it'd be easier if you also provide 'live' mode so that
> users can skip record + report steps and run the workload
> directly like uftrace does. :)
>
> Something like this
>
> # perf sched cpustat myworkload
> (result here ...)

Sure.

Thanks for the feedback,
Ravi

2024-05-09 22:33:50

by Namhyung Kim

Subject: Re: [RFC 0/4] perf sched: Introduce schedstat tool

On Wed, May 8, 2024 at 11:02 PM Ravi Bangoria <[email protected]> wrote:
>
> >> USAGE
> >> -----
> >>
> >> # perf sched schedstat record
> >> # perf sched schedstat report
> >
> > Hmm. I think we can remove the duplication in 'sched'. :)
>
> You mean `perf sched stat record/report` ?
>
> > Given you are thinking of taskstat, how about making it
> > 'cpustat' instead?
>
> Sure. How about:
>
> # perf sched stat --cpu --task record

If you plan to support both cpu and task at the same time,
then I'm ok with this. But if they're mutually exclusive, then
probably you want to have them as sub-commands.

Thanks,
Namhyung


> # perf sched stat report
>
> > Also I think it'd be easier if you also provide 'live' mode so that
> > users can skip record + report steps and run the workload
> > directly like uftrace does. :)
> >
> > Something like this
> >
> > # perf sched cpustat myworkload
> > (result here ...)
>
> Sure.
>
> Thanks for the feedback,
> Ravi

2024-05-10 04:54:59

by Ravi Bangoria

Subject: Re: [RFC 0/4] perf sched: Introduce schedstat tool

>>>> USAGE
>>>> -----
>>>>
>>>> # perf sched schedstat record
>>>> # perf sched schedstat report
>>>
>>> Hmm. I think we can remove the duplication in 'sched'. :)
>>
>> You mean `perf sched stat record/report` ?
>>
>>> Given you are thinking of taskstat, how about making it
>>> 'cpustat' instead?
>>
>> Sure. How about:
>>
>> # perf sched stat --cpu --task record
>
> If you plan to support both cpu and task at the same time,
> then I'm ok with this. But if they're mutually exclusive, then
> probably you want to have them as sub-commands.

Sure, I will think about it while preparing the next version.

Thanks,
Ravi

2024-05-11 07:46:26

by Chen Yu

Subject: Re: [RFC 4/4] perf sched schedstat: Add support for report subcommand

On 2024-05-08 at 11:34:27 +0530, Ravi Bangoria wrote:
> From: Swapnil Sapkal <[email protected]>
>
> `perf sched schedstat record` captures two sets of samples. For workload
> profile, first set right before workload starts and second set after
> workload finishes. For the systemwide profile, first set at the beginning
> of profile and second set on receiving SIGINT signal.
>
> Add `perf sched schedstat report` subcommand that will read both the set
> of samples, get the diff and render a final report. Final report prints
> scheduler stat at cpu granularity as well as sched domain granularity.
>
> Usage example:
>
> # perf sched schedstat record
> # perf sched schedstat report
>
> Co-developed-by: Ravi Bangoria <[email protected]>
> Signed-off-by: Swapnil Sapkal <[email protected]>
> Signed-off-by: Ravi Bangoria <[email protected]>
>

I've tested it on a 240-CPU Xeon system and it looks very useful. Thanks!

1. Just to confirm, if we want to add new fields for debugging purposes,
schedstat-domain-v1x.h and schedstat-cpu-v1x.h are the only files to
be touched, right?
2. Although we can filter the output, is it possible to track only some
CPUs? Like `perf sched schedstat -C 4 record`

thanks,
Chenyu
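
The record/report flow in the quoted commit message (one snapshot before the workload, one after, then a diff) can be sketched in a few lines of C. This is an illustration under assumptions, not the patch's implementation: the patch converts snapshots into perf samples and diffs those, whereas the sketch diffs two text snapshots directly. struct demo_cpu_line, demo_parse_cpu() and demo_diff_cpu() are hypothetical names. The only format assumption is that each per-CPU line of /proc/schedstat has the shape "cpu<N> <counter> <counter> ...".

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_FIELDS 16

struct demo_cpu_line {
	unsigned int cpu;
	int nr;				/* number of counters parsed */
	uint64_t val[DEMO_MAX_FIELDS];
};

/* Find the "cpu<N> ..." line for `cpu` in a snapshot and parse its counters. */
static int demo_parse_cpu(const char *snap, unsigned int cpu,
			  struct demo_cpu_line *out)
{
	char key[32];
	const char *p = snap;
	/* Trailing space keeps "cpu1 " from matching "cpu10 ". */
	size_t klen = (size_t)snprintf(key, sizeof(key), "cpu%u ", cpu);

	while (p && *p) {
		if (strncmp(p, key, klen) == 0) {
			p += klen;
			out->cpu = cpu;
			out->nr = 0;
			for (;;) {
				char *end;

				while (*p == ' ')
					p++;
				/* Stop at end of line, end of input, or a full array. */
				if (*p < '0' || *p > '9' || out->nr == DEMO_MAX_FIELDS)
					break;
				out->val[out->nr++] = strtoull(p, &end, 10);
				p = end;
			}
			return 0;
		}
		/* Advance to the start of the next line. */
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}

/* delta = after - before for every counter on one CPU line. */
static int demo_diff_cpu(const char *before, const char *after,
			 unsigned int cpu, struct demo_cpu_line *delta)
{
	struct demo_cpu_line b, a;
	int i;

	assert(before && after && delta);
	if (demo_parse_cpu(before, cpu, &b) || demo_parse_cpu(after, cpu, &a))
		return -1;
	if (a.nr != b.nr)
		return -1;	/* layout changed between snapshots */

	delta->cpu = cpu;
	delta->nr = a.nr;
	for (i = 0; i < a.nr; i++)
		delta->val[i] = a.val[i] - b.val[i];
	return 0;
}
```

A caller would read /proc/schedstat into a buffer before starting the workload, read it again after the workload exits (or on SIGINT for a systemwide run), and pass both buffers to demo_diff_cpu() for each CPU of interest.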


2024-05-13 03:14:04

by Ravi Bangoria

Subject: Re: [RFC 4/4] perf sched schedstat: Add support for report subcommand

On 11-May-24 1:15 PM, Chen Yu wrote:
> On 2024-05-08 at 11:34:27 +0530, Ravi Bangoria wrote:
>> From: Swapnil Sapkal <[email protected]>
>>
>> `perf sched schedstat record` captures two sets of samples. For workload
>> profile, first set right before workload starts and second set after
>> workload finishes. For the systemwide profile, first set at the beginning
>> of profile and second set on receiving SIGINT signal.
>>
>> Add `perf sched schedstat report` subcommand that will read both the set
>> of samples, get the diff and render a final report. Final report prints
>> scheduler stat at cpu granularity as well as sched domain granularity.
>>
>> Usage example:
>>
>> # perf sched schedstat record
>> # perf sched schedstat report
>>
>> Co-developed-by: Ravi Bangoria <[email protected]>
>> Signed-off-by: Swapnil Sapkal <[email protected]>
>> Signed-off-by: Ravi Bangoria <[email protected]>
>>
>
> I've tested it on a 240-CPU Xeon system and it looks very useful. Thanks!

Glad you found it useful!

> 1. Just to confirm, if we want to add new fields for debugging purposes,
> schedstat-domain-v1x.h and schedstat-cpu-v1x.h are the only files to
> be touched, right?

Correct.

> 2. Although we can filter the output, is it possible to track only some
> CPUs? Like `perf sched schedstat -C 4 record`

Yes, adding filtering capabilities should be possible at both record and report
time.

Thanks,
Ravi
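
One way the CPU filtering discussed above could work is to parse a CPU list such as "4" or "0-3,8" into a bitmap and check each per-CPU record against it, at record time or at report time. The sketch below is hypothetical: the `-C` option is only a suggestion in this thread, and struct demo_cpu_filter, demo_filter_parse() and demo_filter_match() are illustrative names, not perf APIs. The 1024-CPU limit is an arbitrary assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_CPUS		1024	/* arbitrary limit for the sketch */
#define DEMO_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

struct demo_cpu_filter {
	unsigned long bits[DEMO_MAX_CPUS / (sizeof(unsigned long) * CHAR_BIT)];
};

/* Parse a list like "4" or "0-3,8". Returns 0 on success, -1 on bad input. */
static int demo_filter_parse(struct demo_cpu_filter *f, const char *list)
{
	const char *p = list;

	assert(f && list);
	memset(f, 0, sizeof(*f));

	while (*p) {
		char *end;
		unsigned long lo, hi, cpu;

		if (*p < '0' || *p > '9')
			return -1;
		lo = strtoul(p, &end, 10);
		hi = lo;
		p = end;

		if (*p == '-') {		/* range "lo-hi" */
			p++;
			if (*p < '0' || *p > '9')
				return -1;
			hi = strtoul(p, &end, 10);
			p = end;
		}
		if (lo > hi || hi >= DEMO_MAX_CPUS)
			return -1;

		for (cpu = lo; cpu <= hi; cpu++)
			f->bits[cpu / DEMO_BITS_PER_WORD] |=
				1UL << (cpu % DEMO_BITS_PER_WORD);

		if (*p == ',') {
			p++;
			if (!*p)		/* trailing comma */
				return -1;
		} else if (*p) {
			return -1;		/* unexpected character */
		}
	}
	return 0;
}

/* True if `cpu` was selected by the parsed list. */
static bool demo_filter_match(const struct demo_cpu_filter *f, unsigned int cpu)
{
	if (cpu >= DEMO_MAX_CPUS)
		return false;
	return !!(f->bits[cpu / DEMO_BITS_PER_WORD] &
		  (1UL << (cpu % DEMO_BITS_PER_WORD)));
}
```

Filtering at record time would shrink perf.data further, while filtering at report time keeps the full systemwide snapshot available for later inspection; the check itself is the same in both cases.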