Add a trace event to capture the moment of the call that updates the thermal
pressure value. It is helpful for investigating how often those updates occur
in a system dealing with throttling. This trace event is needed because some
drivers might not use the old 'cdev_update' trace event. Also, the new trace
event reports a capacity value, not a cooling state.
We already have a tracepoint, 'pelt_thermal_tp', which, after being converted
to a trace event, can be paired with the new 'thermal_pressure_update' to
derive more insight into what is going on in a system under thermal pressure
(and why).
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Lukasz Luba <[email protected]>
---
Hi all,
This v2 solves the issue reported by the build robot.
After experiments and checks, I decided not to add
EXPORT_TRACEPOINT_SYMBOL_GPL(thermal_pressure_update).
The trace event shouldn't be used by modules directly, since they use
the exported update function:
EXPORT_SYMBOL_GPL(topology_update_thermal_pressure)
which calls the trace event internally.
Duplicating the thermal pressure update code in modules is not
recommended; avoiding that was the goal of introducing
topology_update_thermal_pressure().
Changes in v2:
- as suggested by Steve, I have swapped the fields
- a new trace header, so no conflicts with thermal.h
- added CREATE_TRACE_POINTS before header
The v1 and discussion can be found here at:
https://lore.kernel.org/lkml/[email protected]/
Regards,
Lukasz
drivers/base/arch_topology.c | 5 +++++
include/trace/events/thermal_pressure.h | 29 +++++++++++++++++++++++++
2 files changed, 34 insertions(+)
create mode 100644 include/trace/events/thermal_pressure.h
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1d6636ebaac5..20bdad18dccd 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -19,6 +19,9 @@
#include <linux/rcupdate.h>
#include <linux/sched.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/thermal_pressure.h>
+
static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
static struct cpumask scale_freq_counters_mask;
static bool scale_freq_invariant;
@@ -195,6 +198,8 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,
th_pressure = max_capacity - capacity;
+ trace_thermal_pressure_update(cpu, th_pressure);
+
for_each_cpu(cpu, cpus)
WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure);
}
diff --git a/include/trace/events/thermal_pressure.h b/include/trace/events/thermal_pressure.h
new file mode 100644
index 000000000000..b68680201360
--- /dev/null
+++ b/include/trace/events/thermal_pressure.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM thermal_pressure
+
+#if !defined(_TRACE_THERMAL_PRESSURE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_THERMAL_PRESSURE_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(thermal_pressure_update,
+ TP_PROTO(int cpu, unsigned long thermal_pressure),
+ TP_ARGS(cpu, thermal_pressure),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, thermal_pressure)
+ __field(int, cpu)
+ ),
+
+ TP_fast_assign(
+ __entry->thermal_pressure = thermal_pressure;
+ __entry->cpu = cpu;
+ ),
+
+ TP_printk("cpu=%d thermal_pressure=%lu", __entry->cpu, __entry->thermal_pressure)
+);
+#endif /* _TRACE_THERMAL_PRESSURE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--
2.17.1
On Mon, Apr 25, 2022 at 02:53:17PM +0100, Lukasz Luba wrote:
> Add trace event to capture the moment of the call for updating the thermal
> pressure value. It's helpful to investigate how often those events occurs
> in a system dealing with throttling. This trace event is needed since the
> old 'cdev_update' might not be used by some drivers. Also, the new trace
> event shows capacity value, not a cooling state.
Wait, why are thermal events not going through the thermal system so
that thermal_cdev_update() would catch them? Don't work around broken
systems, force them to use the correct apis.
thanks,
greg k-h
On 4/25/22 15:12, Greg KH wrote:
> On Mon, Apr 25, 2022 at 02:53:17PM +0100, Lukasz Luba wrote:
>> Add trace event to capture the moment of the call for updating the thermal
>> pressure value. It's helpful to investigate how often those events occurs
>> in a system dealing with throttling. This trace event is needed since the
>> old 'cdev_update' might not be used by some drivers. Also, the new trace
>> event shows capacity value, not a cooling state.
>
> Wait, why are thermal events not going through the thermal system so
> that thermal_cdev_update() would catch them? Don't work around broken
> systems, force them to use the correct apis.
On some platforms, thermal is controlled in FW. In Arm we have
recently implemented the logic of our Intelligent Power Allocation
(the kernel governor gov_power_allocator.c), and a bit more, in our
reference FW [1]. So the kernel would not be controlling thermal.
There is another platform which does something similar [2]. As you can
see in that driver, there is no cooling device created; it just handles
the notification about reduced frequency as an IRQ. Then it
propagates this information to the task scheduler using the thermal
pressure mechanism. Thanks to that, the scheduler can avoid mistakes
in task placement and not put more load on a CPU than its real
capacity allows.
For Arm FW thermal, I can do that, since we don't have it yet.
We are in the middle of the internal design of this FW/kernel glue and I
haven't sent the kernel patches yet. I will follow your recommendation.
For the driver [2] I have no device, so I cannot change it and prove that
the new approach works safely.
But we need this trace event to discuss the design for the RT task
scheduler and the impact of thermal pressure decaying delays. We have
Android systems which have suffered from RT issues known for years (like
audio apps), and we have started to tackle them. That was the motivation
for this new trace event - a helper which everyone can 'just use' in
their current code state to provide measurements.
We can ask the maintainers of that driver [2] to follow your guidance and
fix that cpufreq driver. That might require a split of the logic in
there and a new thermal driver which would handle the thermal-update
part. I am not sure that is feasible, since I have heard that some
platforms can generate a huge amount of noise (those interrupts arriving
at kHz rates)...
If that is true, I can see why someone would go the lightweight
update-thermal-pressure-only way and not via the thermal cooling
complexity.
IMO this trace event doesn't harm the design and also brings a benefit
compared to the 'cdev_update' trace event, which only provides a state
value in [0, n]. That state value then needs additional tools to translate
it: state -> freq -> capacity -> thermal pressure. As you can see,
the new event stores the proper value in the trace buffer directly; there
is no hassle, just 'cat' the trace file. If anyone would like to help
us and share their trace output, we would have everything we need.
Regards,
Lukasz
>
> thanks,
>
> greg k-h
[1] https://github.com/ARM-software/SCP-firmware
[2]
https://elixir.bootlin.com/linux/v5.17/source/drivers/cpufreq/qcom-cpufreq-hw.c#L300
On 4/26/22 09:12, Greg KH wrote:
> On Tue, Apr 26, 2022 at 08:26:40AM +0100, Lukasz Luba wrote:
>>
>>
>> On 4/25/22 15:12, Greg KH wrote:
>>> On Mon, Apr 25, 2022 at 02:53:17PM +0100, Lukasz Luba wrote:
>>>> Add trace event to capture the moment of the call for updating the thermal
>>>> pressure value. It's helpful to investigate how often those events occurs
>>>> in a system dealing with throttling. This trace event is needed since the
>>>> old 'cdev_update' might not be used by some drivers. Also, the new trace
>>>> event shows capacity value, not a cooling state.
>>>
>>> Wait, why are thermal events not going through the thermal system so
>>> that thermal_cdev_update() would catch them? Don't work around broken
>>> systems, force them to use the correct apis.
>>
>> In some platforms the thermal is controlled in FW. In Arm we have
>> recently implemented a logic of our Intelligent Power Allocation
>> (kernel governor gov_power_allocator.c) (and a bit more) in our
>> reference FW [1]. So kernel would not controlling thermal.
>> There is another platform which does similar [2]. As you can see
>> in that driver, there is no cooling device created, it just handles
>> the notification about reduced frequency as an IRQ. Then it
>> populates this information to the task scheduler using thermal
>> pressure mechanism. Thanks to that the scheduler can avoid mistakes
>> in the tasks placement and not put more that that CPU real capacity.
>>
>> For Arm FW thermal, I can make it, since we don't have it yet.
>> We are in the middle of internal design of this FW/kernel glue and I
>> haven't sent the kernel patches yet. I will follow your recommendation.
>>
>> For the driver [2] I have no device, so I cannot change it and prove the
>> new works safely.
>>
>> But we need this trace event to discuss the design for RT task scheduler
>> and impact of thermal pressure decaying delays. We have Android systems
>> which suffer the RT issues known for years (like audio apps) and we
>> started to tackle them. That was the motivation for this new trace
>> event - a helper which everyone can 'just use' in their current code
>> state and provide measurements.
>>
>> We can ask maintainers of that driver [2] to follow your guidance and
>> fix that cpufreq driver. That might require a split of the logic in
>> there and a new thermal driver which would handle the part of thermal
>> updates. If that is feasible, since I have heard that some platforms
>> can cause a huge noise 'KHz' of those interrupt...
>> If that is true, I might see the reason why someone goes lightweight way
>> update-thermal-pressure-only and not via thermal cooling complexity.
>>
>> IMO this trace event doesn't harm the design and brings also benefit
>> comparing to the 'cdev_update' trace event which only provides a state
>> value: [0, n]. That state value then needs additional tools to translate
>> it: state -> freq -> capacity -> thermal pressure. As you can see
>> this new event just stores proper value in the trace buffer, no need
>> to hassle, just 'cat' the trace file. If anyone would like to help
>> us and share it's trace output, we would have everything we need.
>
> Ok, can you put this type of explaination in the changelog to make it
> more obvious for us all going forward?
Yes, I will update that and send v3. Thank you for sharing your opinion
and guidance on this stuff for future platforms. I'll point Morten
and other Arm folks at this thread.
Regards,
Lukasz
On Tue, Apr 26, 2022 at 08:26:40AM +0100, Lukasz Luba wrote:
>
>
> On 4/25/22 15:12, Greg KH wrote:
> > On Mon, Apr 25, 2022 at 02:53:17PM +0100, Lukasz Luba wrote:
> > > Add trace event to capture the moment of the call for updating the thermal
> > > pressure value. It's helpful to investigate how often those events occurs
> > > in a system dealing with throttling. This trace event is needed since the
> > > old 'cdev_update' might not be used by some drivers. Also, the new trace
> > > event shows capacity value, not a cooling state.
> >
> > Wait, why are thermal events not going through the thermal system so
> > that thermal_cdev_update() would catch them? Don't work around broken
> > systems, force them to use the correct apis.
>
> In some platforms the thermal is controlled in FW. In Arm we have
> recently implemented a logic of our Intelligent Power Allocation
> (kernel governor gov_power_allocator.c) (and a bit more) in our
> reference FW [1]. So kernel would not controlling thermal.
> There is another platform which does similar [2]. As you can see
> in that driver, there is no cooling device created, it just handles
> the notification about reduced frequency as an IRQ. Then it
> populates this information to the task scheduler using thermal
> pressure mechanism. Thanks to that the scheduler can avoid mistakes
> in the tasks placement and not put more that that CPU real capacity.
>
> For Arm FW thermal, I can make it, since we don't have it yet.
> We are in the middle of internal design of this FW/kernel glue and I
> haven't sent the kernel patches yet. I will follow your recommendation.
>
> For the driver [2] I have no device, so I cannot change it and prove the
> new works safely.
>
> But we need this trace event to discuss the design for RT task scheduler
> and impact of thermal pressure decaying delays. We have Android systems
> which suffer the RT issues known for years (like audio apps) and we
> started to tackle them. That was the motivation for this new trace
> event - a helper which everyone can 'just use' in their current code
> state and provide measurements.
>
> We can ask maintainers of that driver [2] to follow your guidance and
> fix that cpufreq driver. That might require a split of the logic in
> there and a new thermal driver which would handle the part of thermal
> updates. If that is feasible, since I have heard that some platforms
> can cause a huge noise 'KHz' of those interrupt...
> If that is true, I might see the reason why someone goes lightweight way
> update-thermal-pressure-only and not via thermal cooling complexity.
>
> IMO this trace event doesn't harm the design and brings also benefit
> comparing to the 'cdev_update' trace event which only provides a state
> value: [0, n]. That state value then needs additional tools to translate
> it: state -> freq -> capacity -> thermal pressure. As you can see
> this new event just stores proper value in the trace buffer, no need
> to hassle, just 'cat' the trace file. If anyone would like to help
> us and share it's trace output, we would have everything we need.
Ok, can you put this type of explanation in the changelog to make it
more obvious for us all going forward?
thanks,
greg k-h