2023-03-22 15:39:28

by Lukasz Luba

Subject: [PATCH 0/3] Add basic tracing for uclamp and schedutil

Hi all,

The task scheduler feature Uclamp is beginning to take off. To better
understand the dynamics in the task scheduler and CPU frequency requests
we need better tracing.
In schedutil (the cpufreq governor) we allow the scheduler to enter and
make the frequency change. However, there is a limit on how often this
can happen: a minimum period provided by the cpufreq driver. Thus, some
of the cpufreq requests might be filtered out and the frequency won't
be changed (hopefully it will be set a bit later). We would like to know
about those situations, especially in the context of the user-space
hints made via Uclamp for particular tasks.
This patch set aims to add a base for our toolkits and post-processing
trace analysis.

Regards,
Lukasz Luba

Lukasz Luba (3):
sched/tp: Add new tracepoint to track uclamp set from user-space
cpufreq: schedutil: Refactor sugov_update_shared() internals
schedutil: trace: Add tracing to capture filter out requests

include/trace/events/sched.h | 4 ++++
include/trace/events/schedutil.h | 17 +++++++++++++++++
kernel/sched/core.c | 5 +++++
kernel/sched/cpufreq_schedutil.c | 32 ++++++++++++++++++++++----------
4 files changed, 48 insertions(+), 10 deletions(-)
create mode 100644 include/trace/events/schedutil.h

--
2.17.1


2023-03-22 15:39:33

by Lukasz Luba

Subject: [PATCH 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space

User-space can set the uclamp value for a given task. It impacts task
placement decisions made by the scheduler. This is very useful information
and helps to understand the system behavior or track improvements in
middleware and applications which start using uclamp mechanisms and report
better performance in tests.

Signed-off-by: Lukasz Luba <[email protected]>
---
include/trace/events/sched.h | 4 ++++
kernel/sched/core.c | 5 +++++
2 files changed, 9 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index fbb99a61f714..dbfb30809f15 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -735,6 +735,10 @@ DECLARE_TRACE(sched_update_nr_running_tp,
TP_PROTO(struct rq *rq, int change),
TP_ARGS(rq, change));

+DECLARE_TRACE(uclamp_update_tsk_tp,
+ TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value),
+ TP_ARGS(tsk, uclamp_id, value));
+
#endif /* _TRACE_SCHED_H */

/* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 488655f2319f..882c92e3ccf0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -110,6 +110,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(uclamp_update_tsk_tp);

DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

@@ -1936,12 +1937,16 @@ static void __setscheduler_uclamp(struct task_struct *p,
attr->sched_util_min != -1) {
uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
attr->sched_util_min, true);
+ trace_uclamp_update_tsk_tp(p, UCLAMP_MIN,
+ attr->sched_util_min);
}

if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
attr->sched_util_max != -1) {
uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
attr->sched_util_max, true);
+ trace_uclamp_update_tsk_tp(p, UCLAMP_MAX,
+ attr->sched_util_max);
}
}

--
2.17.1

2023-03-22 15:39:37

by Lukasz Luba

Subject: [PATCH 2/3] cpufreq: schedutil: Refactor sugov_update_shared() internals

Remove the if block. Use a simple check to bail out and jump to the
unlock label at the end. That makes the code more readable and ready
for future tracing.

Signed-off-by: Lukasz Luba <[email protected]>
---
kernel/sched/cpufreq_schedutil.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e3211455b203..f462496e5c07 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -446,17 +446,19 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)

ignore_dl_rate_limit(sg_cpu);

- if (sugov_should_update_freq(sg_policy, time)) {
- next_f = sugov_next_freq_shared(sg_cpu, time);
+ if (!sugov_should_update_freq(sg_policy, time))
+ goto unlock;

- if (!sugov_update_next_freq(sg_policy, time, next_f))
- goto unlock;
+ next_f = sugov_next_freq_shared(sg_cpu, time);
+
+ if (!sugov_update_next_freq(sg_policy, time, next_f))
+ goto unlock;
+
+ if (sg_policy->policy->fast_switch_enabled)
+ cpufreq_driver_fast_switch(sg_policy->policy, next_f);
+ else
+ sugov_deferred_update(sg_policy);

- if (sg_policy->policy->fast_switch_enabled)
- cpufreq_driver_fast_switch(sg_policy->policy, next_f);
- else
- sugov_deferred_update(sg_policy);
- }
unlock:
raw_spin_unlock(&sg_policy->update_lock);
}
--
2.17.1

2023-03-22 15:40:21

by Lukasz Luba

Subject: [PATCH 3/3] schedutil: trace: Add tracing to capture filter out requests

Some of the frequency update requests coming from the task scheduler
might be filtered out. It can happen when the previous request was served
not long ago (within a period smaller than the minimum frequency-update
period provided by the cpufreq driver). In such a case, we want to know
that some of the frequency updates did not make it through.
Export the new tracepoint as well. That allows it to be handled by a
toolkit for trace analysis.

Signed-off-by: Lukasz Luba <[email protected]>
---
include/trace/events/schedutil.h | 17 +++++++++++++++++
kernel/sched/cpufreq_schedutil.c | 14 ++++++++++++--
2 files changed, 29 insertions(+), 2 deletions(-)
create mode 100644 include/trace/events/schedutil.h

diff --git a/include/trace/events/schedutil.h b/include/trace/events/schedutil.h
new file mode 100644
index 000000000000..7f25122f7257
--- /dev/null
+++ b/include/trace/events/schedutil.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM schedutil
+
+#if !defined(_TRACE_SCHEDUTIL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_SCHEDUTIL_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_TRACE(schedutil_update_filtered_tp,
+ TP_PROTO(int cpu),
+ TP_ARGS(cpu));
+
+#endif /* _TRACE_SCHEDUTIL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index f462496e5c07..45c18559f3a8 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -6,6 +6,12 @@
* Author: Rafael J. Wysocki <[email protected]>
*/

+#define CREATE_TRACE_POINTS
+#include <trace/events/schedutil.h>
+#undef CREATE_TRACE_POINTS
+
+EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp);
+
#define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8)

struct sugov_tunables {
@@ -318,8 +324,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,

ignore_dl_rate_limit(sg_cpu);

- if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
+ if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) {
+ trace_schedutil_update_filtered_tp(sg_cpu->cpu);
return false;
+ }

sugov_get_util(sg_cpu);
sugov_iowait_apply(sg_cpu, time, max_cap);
@@ -446,8 +454,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)

ignore_dl_rate_limit(sg_cpu);

- if (!sugov_should_update_freq(sg_policy, time))
+ if (!sugov_should_update_freq(sg_policy, time)) {
+ trace_schedutil_update_filtered_tp(sg_cpu->cpu);
goto unlock;
+ }

next_f = sugov_next_freq_shared(sg_cpu, time);

--
2.17.1

2023-03-22 17:40:40

by kernel test robot

Subject: Re: [PATCH 3/3] schedutil: trace: Add tracing to capture filter out requests

Hi Lukasz,

I love your patch! Perhaps something to improve:

[auto build test WARNING on rafael-pm/linux-next]
[also build test WARNING on tip/sched/core linus/master v6.3-rc3 next-20230322]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Lukasz-Luba/sched-tp-Add-new-tracepoint-to-track-uclamp-set-from-user-space/20230322-232206
base: https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
patch link: https://lore.kernel.org/r/20230322151843.14390-4-lukasz.luba%40arm.com
patch subject: [PATCH 3/3] schedutil: trace: Add tracing to capture filter out requests
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20230323/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/9eba04b3431f67b8e0c18ff57ea2976111673bea
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Lukasz-Luba/sched-tp-Add-new-tracepoint-to-track-uclamp-set-from-user-space/20230322-232206
git checkout 9eba04b3431f67b8e0c18ff57ea2976111673bea
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 olddefconfig
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

In file included from include/trace/trace_events.h:27,
from include/trace/define_trace.h:102,
from include/trace/events/schedutil.h:17,
from kernel/sched/cpufreq_schedutil.c:10,
from kernel/sched/build_utility.c:69:
>> include/trace/stages/init.h:2:23: warning: 'str__schedutil__trace_system_name' defined but not used [-Wunused-const-variable=]
2 | #define __app__(x, y) str__##x##y
| ^~~~~
include/trace/stages/init.h:3:21: note: in expansion of macro '__app__'
3 | #define __app(x, y) __app__(x, y)
| ^~~~~~~
include/trace/stages/init.h:5:29: note: in expansion of macro '__app'
5 | #define TRACE_SYSTEM_STRING __app(TRACE_SYSTEM_VAR,__trace_system_name)
| ^~~~~
include/trace/stages/init.h:8:27: note: in expansion of macro 'TRACE_SYSTEM_STRING'
8 | static const char TRACE_SYSTEM_STRING[] = \
| ^~~~~~~~~~~~~~~~~~~
include/trace/stages/init.h:11:1: note: in expansion of macro 'TRACE_MAKE_SYSTEM_STR'
11 | TRACE_MAKE_SYSTEM_STR();
| ^~~~~~~~~~~~~~~~~~~~~


vim +/str__schedutil__trace_system_name +2 include/trace/stages/init.h

af6b9668e85ffd Steven Rostedt (Google 2022-03-03 @2) #define __app__(x, y) str__##x##y
af6b9668e85ffd Steven Rostedt (Google 2022-03-03 3) #define __app(x, y) __app__(x, y)
af6b9668e85ffd Steven Rostedt (Google 2022-03-03 4)

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-04-03 13:50:10

by Qais Yousef

Subject: Re: [PATCH 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space

Hi Lukasz!

On 03/22/23 15:18, Lukasz Luba wrote:
> The user-space can set uclamp value for a given task. It impacts task
> placement decisions made by the scheduler. This is very useful information
> and helps to understand the system behavior or track improvements in
> middleware and applications which start using uclamp mechanisms and report
> better performance in tests.

We do have uclamp trace events in sched_tp, why are they not sufficient?

https://github.com/qais-yousef/sched_tp/blob/main/sched_events.h#L233

Do you really want to know the exact time the value has changed?

Would it make sense to introduce a generic sched_setscheduler tracepoint
instead? Although this might not be necessary as I think we can use
register_kprobe() to register a callback and create a new event without any
additional tracepoint. sched_setscheduler() is not inlined so should be easy to
hook into and create events, no?


Thanks

--
Qais Yousef

>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> include/trace/events/sched.h | 4 ++++
> kernel/sched/core.c | 5 +++++
> 2 files changed, 9 insertions(+)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index fbb99a61f714..dbfb30809f15 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -735,6 +735,10 @@ DECLARE_TRACE(sched_update_nr_running_tp,
> TP_PROTO(struct rq *rq, int change),
> TP_ARGS(rq, change));
>
> +DECLARE_TRACE(uclamp_update_tsk_tp,
> + TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value),
> + TP_ARGS(tsk, uclamp_id, value));
> +
> #endif /* _TRACE_SCHED_H */
>
> /* This part must be outside protection */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 488655f2319f..882c92e3ccf0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -110,6 +110,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(uclamp_update_tsk_tp);
>
> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>
> @@ -1936,12 +1937,16 @@ static void __setscheduler_uclamp(struct task_struct *p,
> attr->sched_util_min != -1) {
> uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
> attr->sched_util_min, true);
> + trace_uclamp_update_tsk_tp(p, UCLAMP_MIN,
> + attr->sched_util_min);
> }
>
> if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
> attr->sched_util_max != -1) {
> uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
> attr->sched_util_max, true);
> + trace_uclamp_update_tsk_tp(p, UCLAMP_MAX,
> + attr->sched_util_max);
> }
> }
>
> --
> 2.17.1
>

2023-04-03 13:50:24

by Qais Yousef

Subject: Re: [PATCH 3/3] schedutil: trace: Add tracing to capture filter out requests

On 03/22/23 15:18, Lukasz Luba wrote:
> Some of the frequency update requests coming form the task scheduler
> might be filter out. It can happen when the previous request was served
> not that long ago (in a period smaller than provided by the cpufreq driver
> as minimum for frequency update). In such case, we want to know if some of
> the frequency updates cannot make through.
> Export the new tracepoint as well. That would allow to handle it by a
> toolkit for trace analyzes.

I think you can use register_kretprobe() for this one. When functions are not
inlined it should be easy to hook into them and you can get the return value of
the function too.

Check the usage in lib/test_kprobes.c. Creating the event in sched_tp
should work the same way as when registering for a tracepoint. They are
both essentially the same.

Patches to sched_tp module would be welcome ;-)


Cheers

--
Qais Yousef

>
> Signed-off-by: Lukasz Luba <[email protected]>
> ---
> include/trace/events/schedutil.h | 17 +++++++++++++++++
> kernel/sched/cpufreq_schedutil.c | 14 ++++++++++++--
> 2 files changed, 29 insertions(+), 2 deletions(-)
> create mode 100644 include/trace/events/schedutil.h
>
> diff --git a/include/trace/events/schedutil.h b/include/trace/events/schedutil.h
> new file mode 100644
> index 000000000000..7f25122f7257
> --- /dev/null
> +++ b/include/trace/events/schedutil.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM schedutil
> +
> +#if !defined(_TRACE_SCHEDUTIL_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_SCHEDUTIL_H
> +
> +#include <linux/tracepoint.h>
> +
> +DECLARE_TRACE(schedutil_update_filtered_tp,
> + TP_PROTO(int cpu),
> + TP_ARGS(cpu));
> +
> +#endif /* _TRACE_SCHEDUTIL_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index f462496e5c07..45c18559f3a8 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -6,6 +6,12 @@
> * Author: Rafael J. Wysocki <[email protected]>
> */
>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/schedutil.h>
> +#undef CREATE_TRACE_POINTS
> +
> +EXPORT_TRACEPOINT_SYMBOL_GPL(schedutil_update_filtered_tp);
> +
> #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8)
>
> struct sugov_tunables {
> @@ -318,8 +324,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>
> ignore_dl_rate_limit(sg_cpu);
>
> - if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> + if (!sugov_should_update_freq(sg_cpu->sg_policy, time)) {
> + trace_schedutil_update_filtered_tp(sg_cpu->cpu);
> return false;
> + }
>
> sugov_get_util(sg_cpu);
> sugov_iowait_apply(sg_cpu, time, max_cap);
> @@ -446,8 +454,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>
> ignore_dl_rate_limit(sg_cpu);
>
> - if (!sugov_should_update_freq(sg_policy, time))
> + if (!sugov_should_update_freq(sg_policy, time)) {
> + trace_schedutil_update_filtered_tp(sg_cpu->cpu);
> goto unlock;
> + }
>
> next_f = sugov_next_freq_shared(sg_cpu, time);
>
> --
> 2.17.1
>

2023-04-03 17:18:59

by Lukasz Luba

Subject: Re: [PATCH 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space

Hi Qais,

On 4/3/23 14:46, Qais Yousef wrote:
> Hi Lukasz!
>
> On 03/22/23 15:18, Lukasz Luba wrote:
>> The user-space can set uclamp value for a given task. It impacts task
>> placement decisions made by the scheduler. This is very useful information
>> and helps to understand the system behavior or track improvements in
>> middleware and applications which start using uclamp mechanisms and report
>> better performance in tests.
>
> We do have uclamp trace events in sched_tp, why are they not sufficient?
>
> https://github.com/qais-yousef/sched_tp/blob/main/sched_events.h#L233
>
> Do you really want to know the exact time the value has changed?

Yes, that's why these new tracepoints are triggered instantly after
user-space sets the new uclamp values. We are going to have a few
different uclamp implementations: one in mainline and X in Android
vendor kernels. The goal is to have only one... We will have to
experiment to find out how those algorithms behave and understand the
differences. Since uclamp is part of this 'control-chain' of CPU
frequency and also task placement, it would be really tricky to figure
out everything. Analysis of traces is crucial for this.

>
> Would it make sense to introduce a generic sched_setscheduler tracepoint
> instead? Although this might not be necessary as I think we can use
> register_kprobe() to register a callback and create a new event without any
> additional tracepoint. sched_setscheduler() is not inlined so should be easy to
> hook into and create events, no?

This looks very complex, and we already have our LISA tool with the
module that turns the tracepoints into trace events and builds them.
I wanted to be aligned with that design, which might look a bit
old-fashioned but is simple IMO.

The 'sched_setscheduler tracepoint' might be too big for this purpose.

Thanks for the comments :)
Lukasz

>
>
> Thanks
>
> --
> Qais Yousef
>
>>
>> Signed-off-by: Lukasz Luba <[email protected]>
>> ---
>> include/trace/events/sched.h | 4 ++++
>> kernel/sched/core.c | 5 +++++
>> 2 files changed, 9 insertions(+)
>>
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index fbb99a61f714..dbfb30809f15 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -735,6 +735,10 @@ DECLARE_TRACE(sched_update_nr_running_tp,
>> TP_PROTO(struct rq *rq, int change),
>> TP_ARGS(rq, change));
>>
>> +DECLARE_TRACE(uclamp_update_tsk_tp,
>> + TP_PROTO(struct task_struct *tsk, int uclamp_id, unsigned int value),
>> + TP_ARGS(tsk, uclamp_id, value));
>> +
>> #endif /* _TRACE_SCHED_H */
>>
>> /* This part must be outside protection */
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 488655f2319f..882c92e3ccf0 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -110,6 +110,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
>> EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
>> +EXPORT_TRACEPOINT_SYMBOL_GPL(uclamp_update_tsk_tp);
>>
>> DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>>
>> @@ -1936,12 +1937,16 @@ static void __setscheduler_uclamp(struct task_struct *p,
>> attr->sched_util_min != -1) {
>> uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
>> attr->sched_util_min, true);
>> + trace_uclamp_update_tsk_tp(p, UCLAMP_MIN,
>> + attr->sched_util_min);
>> }
>>
>> if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
>> attr->sched_util_max != -1) {
>> uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
>> attr->sched_util_max, true);
>> + trace_uclamp_update_tsk_tp(p, UCLAMP_MAX,
>> + attr->sched_util_max);
>> }
>> }
>>
>> --
>> 2.17.1
>>

2023-04-03 17:19:17

by Lukasz Luba

Subject: Re: [PATCH 3/3] schedutil: trace: Add tracing to capture filter out requests



On 4/3/23 14:46, Qais Yousef wrote:
> On 03/22/23 15:18, Lukasz Luba wrote:
>> Some of the frequency update requests coming form the task scheduler
>> might be filter out. It can happen when the previous request was served
>> not that long ago (in a period smaller than provided by the cpufreq driver
>> as minimum for frequency update). In such case, we want to know if some of
>> the frequency updates cannot make through.
>> Export the new tracepoint as well. That would allow to handle it by a
>> toolkit for trace analyzes.
>
> I think you can use register_kretprobe() for this one. When functions are not
> inlined it should be easy to hook into them and you can get the return value of
> the function too.

Could be possible, but that's a different pattern to what we have in
LISA. It's also a bit complicated for downstream folks to understand
when I ask about a trace with a particular name. The integration would
be a bit more complicated, and our tooling and CI might be impacted by
this different design approach. Therefore, I prefer to stay aligned
with the old-school way.

>
> Check the usage in lib/test_kprobes.c. Creating the event in sched_tp should be
> the same way when registering for a tracepoint. They both are essentially the
> same.
>
> Patches to sched_tp module would be welcome ;-)

I cannot promise, but I'll try to find some cycles for it :)

Cheers,
Lukasz

>
>
> Cheers
>
> --
> Qais Yousef
>

2023-04-04 17:18:23

by Qais Yousef

Subject: Re: [PATCH 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space

On 04/03/23 17:47, Lukasz Luba wrote:
> Hi Qais,
>
> On 4/3/23 14:46, Qais Yousef wrote:
> > Hi Lukasz!
> >
> > On 03/22/23 15:18, Lukasz Luba wrote:
> > > The user-space can set uclamp value for a given task. It impacts task
> > > placement decisions made by the scheduler. This is very useful information
> > > and helps to understand the system behavior or track improvements in
> > > middleware and applications which start using uclamp mechanisms and report
> > > better performance in tests.
> >
> > We do have uclamp trace events in sched_tp, why are they not sufficient?
> >
> > https://github.com/qais-yousef/sched_tp/blob/main/sched_events.h#L233
> >
> > Do you really want to know the exact time the value has changed?
>
> Yes, that's why these new are triggered instantly after userspace wanted
> to set the new uclamp values. We are going to have a few different
> uclamp implementations: one in mainline and X in Android vendor kernels.

This is not true. As you can see, everyone tries to push fixes for issues
they find, but this is not happening fast enough, and they end up forced
to carry out-of-tree stuff. Out-of-tree stuff for the broken bits, that is.

> The goal is to have only one... We will have to experiment to find

This statement sends a very strong message to everyone out here. And I'd
like to stress strongly that this is NOT true. Broken bits, yes. But
essential bits that work are used the same.

We can do a better job and fix things faster upstream. I am committed to
sorting all this stuff out anyway.

> the behavior of those algorithms and understand the differences. Since
> uclamp is part of this 'control-chain' of CPU frequency and also
> task placement - it would be really tricky to figure our everything.
> The analysis on traces are crucial for this.

I think the existing uclamp trace event is enough, to be honest. But it's
up to the maintainers whether they want to add this new specific one. The
two tps seem a bit of clutter to me. With kprobes and bpf a lot can be
done on the fly if you want to reverse-engineer some stuff.

>
> >
> > Would it make sense to introduce a generic sched_setscheduler tracepoint
> > instead? Although this might not be necessary as I think we can use
> > register_kprobe() to register a callback and create a new event without any
> > additional tracepoint. sched_setscheduler() is not inlined so should be easy to
> > hook into and create events, no?
>
> This looks very complex and we already have our LISA tool with the

It's not. Here's a PoC that only does a trace_printk(); it's simple. You
don't even need to do it in a module, see below.

https://github.com/qais-yousef/sched_tp/compare/main...kprobe-poc

It did highlight that sugov_should_update_freq() ends up actually being inlined
though :(. It should work for sched_setscheduler(). You'd want to use
register_kprobe() instead for that.

> module to change the tracepoints into trace events and build them.
> I wanted to be aligned with that design, which might look a bit
> old-fashion but is simple IMO.

trace-cmd, bpf and I believe perf, all can do the same; and they support
kprobes not just tracepoints.

And they all boil down to the same underlying mechanism

https://www.kernel.org/doc/html/v6.1/trace/kprobetrace.html

No need to re-invent a new wheel ;-)

> The 'sched_setscheduler tracepoint' might be a too big for this
> purpose.

Sorry, I am usually supportive of more tracepoints, but I feel skeptical
this time. That said, I really don't mind if the maintainers are okay
with it. So while I'm not convinced, I don't have any objection either.

> Thanks for the comments :)

Thanks!

--
Qais Yousef

2023-04-05 10:56:05

by Lukasz Luba

Subject: Re: [PATCH 1/3] sched/tp: Add new tracepoint to track uclamp set from user-space



On 4/4/23 18:17, Qais Yousef wrote:
> On 04/03/23 17:47, Lukasz Luba wrote:
>> Hi Qais,
>>
>> On 4/3/23 14:46, Qais Yousef wrote:
>>> Hi Lukasz!
>>>
>>> On 03/22/23 15:18, Lukasz Luba wrote:
>>>> The user-space can set uclamp value for a given task. It impacts task
>>>> placement decisions made by the scheduler. This is very useful information
>>>> and helps to understand the system behavior or track improvements in
>>>> middleware and applications which start using uclamp mechanisms and report
>>>> better performance in tests.
>>>
>>> We do have uclamp trace events in sched_tp, why are they not sufficient?
>>>
>>> https://github.com/qais-yousef/sched_tp/blob/main/sched_events.h#L233
>>>
>>> Do you really want to know the exact time the value has changed?
>>
>> Yes, that's why these new are triggered instantly after userspace wanted
>> to set the new uclamp values. We are going to have a few different
>> uclamp implementations: one in mainline and X in Android vendor kernels.
>
> This is not true. As you can see everyone tries to push fixes for issues they

Could you clarify which part is not true?
1. We have X vendor-hook implementations of how uclamp is interpreted,
plus one upstream
2. We are going to have such a situation for a while (until mainline
uclamp meets the requirements, so the hooks can be removed and
we would have the same, well-understood mechanism under the hood)
3. User-space sets the uclamp value via syscall


> find, but this not happening fast enough and they get forced to carry out of
> tree stuff at the end. Out of tree stuff for the broken bits that is.
>
>> The goal is to have only one... We will have to experiment to find
>
> This statement gives a very strong message to everyone out here. And I'd like
> to stress strongly that this is NOT true. Broken bits, yes. But essential bits
> that works are used the same.

Could you elaborate on this?

This is how I view the solution with uclamp, but I'm afraid we have
different views:
Uclamp is part of the control algorithms in a closed feedback loop. The
apps want to use uclamp (as stated in the latest kernel doc) in order to
get the desired performance (e.g. no frame drops) in an environment which:
1. can differ in CPU configuration from the one the app was installed on
(a single app from a repo wants to run efficiently on various devices;
those devices create different environments from the beginning)
2. has thermal conditions that can change at run-time, so the capacity of
the platform might change (even to the point that some CPUs are out)
3. has access to dedicated accelerators, which can handle some of the
heavy computations from the CPUs (e.g. those with neural networks)
4. has drivers and devices with different capabilities, e.g. in terms of
fast DVFS change, which can be an issue in this case

When the uclamp algorithm, which interprets the individual tasks' values
and the runqueue value, can differ, then this is a different solution.
In such a situation every platform might have a different solution.

We are going to focus on mainline uclamp development. We want to know
whether the frequency cannot be changed due to some slow HW, due to heavy
traffic in the system from various tasks/runqueues wanting to change the
frequency, or just due to a wrong setting in the schedutil filter. We are
also interested in measuring the 'delay' from the new uclamp request
until the HW actually gets it. All of this would shape the solution's
'quality'.

I don't know whether vendors justify their vendor-hook uclamp
implementations as a fix for 'broken bits' or as 'value-added'.

>
> We can do a better job and fix things faster upstream. I am committed to
> sorting all these stuff out anyway.
>
>> the behavior of those algorithms and understand the differences. Since
>> uclamp is part of this 'control-chain' of CPU frequency and also
>> task placement - it would be really tricky to figure our everything.
>> The analysis on traces are crucial for this.
>
> I think the existing uclamp trace event is enough to be honest. But up to the
> maintainer if they want to add this new specific one. The two tps seem a bit of
> a clutter to me. With kprobes and bpf a lot can be done on the fly if you want
> to reverse engineer some stuff.
>
>>
>>>
>>> Would it make sense to introduce a generic sched_setscheduler tracepoint
>>> instead? Although this might not be necessary as I think we can use
>>> register_kprobe() to register a callback and create a new event without any
>>> additional tracepoint. sched_setscheduler() is not inlined so should be easy to
>>> hook into and create events, no?
>>
>> This looks very complex and we already have our LISA tool with the
>
> It's not. Here's a PoC that only does a trace_printk(), it's simple. You don't
> need to do it in a module even, see below.
>
> https://github.com/qais-yousef/sched_tp/compare/main...kprobe-poc
>
> It did highlight that sugov_should_update_freq() ends up actually being inlined
> though :(. It should work for sched_setscheduler(). You'd want to use
> register_kprobe() instead for that.

Issues with that approach:
- as you said, it won't work with inlined functions, so in this patch set
example it would only work 'partially' ;)
- it uses trace_printk(), which is not aligned with our tool and, AFAIK,
with Perfetto
- it would create a mix of mechanisms that pushes complexity onto our
analysis tool, which we don't want. This is the complexity
that I had in mind from a SW engineering perspective.
Maintainability and understanding of the mechanics across teams
or even companies is not free. Then explaining to someone in a
different company that this approach won't work if the compiler
inlined a function, so they have to rewrite the code and use
the other approach (with a tracepoint in the code), doesn't sound
good.

>
>> module to change the tracepoints into trace events and build them.
>> I wanted to be aligned with that design, which might look a bit
>> old-fashion but is simple IMO.
>
> trace-cmd, bpf and I believe perf, all can do the same; and they support
> kprobes not just tracepoints.
>
> And they all boil down to the same underlying mechanism
>
> https://www.kernel.org/doc/html/v6.1/trace/kprobetrace.html
>
> No need to re-invent a new wheel ;-)

I don't understand why you are calling this 're-inventing the wheel'
when I said this is aligned with our SW design for trace analysis.
BTW, the LISA tool is open source and anyone can use it for free [1].
I'm not developing another framework for tracepoints here, I just use
this approach.

I agree, there are a few ways of doing this. The worst thing would
be for us to use a mix (and mess) of a few of them. As you know,
our tool takes the path with one of them: tracepoints explicitly
visible to everyone in the kernel code, so you can point engineers
at them. I know, you can basically trace almost any function dynamically,
which creates a huge potential space. A big space of opportunity isn't
good when you want to solve a problem and share with others the
knowledge of how to follow it. Look at the example of the tracepoints
which are used for deriving the task wake-up latency. It's easier for
engineers to see in the code, with the tracepoints, what parts
are important and where and why the flow is going, which is a nice
example of how the vast code space can be reduced.

>
>> The 'sched_setscheduler tracepoint' might be a too big for this
>> purpose.
>
> Sorry I am usually supportive for more tracepoints, but I feel skeptical this
> time. That said, I really don't mind if the maintainers are okay with it. So
> while I'm not convinced, but I don't have any objection either.

I hope these examples will convince you, so you will be less
skeptical.

Cheers,
Lukasz

[1]
https://lisa-linux-integrated-system-analysis.readthedocs.io/en/latest/setup.html