LinuxLists.cc - [RFC PATCH 0/4] Support for passing runtime state idle time to TF-A

2021-04-22 20:33:27

Subject: [RFC PATCH 0/4] Support for passing runtime state idle time to TF-A

Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs which is
in charge of deciding on state transition based on target state, state idle
time, and some other Tegra CPU core cluster states information.

Current PSCI specification don't have function defined for passing runtime
state idle time predicted by governor (based on next events and state target
residency) to ARM trusted firmware.

With the support of adding new PSCI function to allow passing runtime state
idle time from kernel to ARM trusted firmware, Tegra194 platforms can use
generic psci cpuidle driver rather than having Tegra specific cpuidle driver.

During Tegra specific cpuidle driver V1 review, Sudeep Holla from ARM also
suggested to use generic cpuidle driver by generalizing the need of runtime
state idle time.

So had internal discussion between ARM and NVIDIA on adding new PSCI function
to allow passing runtime state idle time from kernel to TF-A through PSCI and
once this implementation is accepted by upstream, ARM will look into further
to update PSCI specification for this new PSCI function.

So sending these patches as RFC as new PSCI function added in this series is
not part of PSCI specification and once this implementation is accepted by ARM
and upstream community, ARM can help to take this forward to add to PSCI
specification.

To keep the backward compatibility we can't update CPU_SUSPEND function to pass
state idle time argument. So added seperate function for passing state idle time
and serializing this with cpu suspend state enter.

Once this approach is agreed, we can either use this way of separate PSCI
function for passing state idle time or with PSCI specification update we can
use same CPU_SUSPEND function with extra argument for state idle time which can
be decided later for final patches based on discussion with ARM.

Sowjanya Komatineni (4):
firmware/psci: add support for PSCI function SET_STATE_IDLE_TIME
cpuidle: menu: add idle_time to cpuidle_state
cpuidle: psci: pass state idle time before state enter callback
arm64: dts: tegra194: Add CPU idle states

arch/arm64/boot/dts/nvidia/tegra194.dtsi | 19 +++++++++++++++++++
drivers/cpuidle/cpuidle-psci.c | 11 +++++++++++
drivers/cpuidle/governors/menu.c | 7 ++++++-
drivers/firmware/psci/psci.c | 9 +++++++++
include/linux/cpuidle.h | 1 +
include/linux/psci.h | 1 +
include/uapi/linux/psci.h | 2 ++
7 files changed, 49 insertions(+), 1 deletion(-)

--
2.7.4

2021-04-22 20:35:07

by Sowjanya Komatineni

[permalink] [raw]

Subject: [RFC PATCH 4/4] arm64: dts: tegra194: Add CPU idle states

This patch adds CPU core and cluster idle states to Tegra194
device tree

Signed-off-by: Sowjanya Komatineni <[email protected]>
---
arch/arm64/boot/dts/nvidia/tegra194.dtsi | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)

diff --git a/arch/arm64/boot/dts/nvidia/tegra194.dtsi b/arch/arm64/boot/dts/nvidia/tegra194.dtsi
index 9449156..c3b478e 100644
--- a/arch/arm64/boot/dts/nvidia/tegra194.dtsi
+++ b/arch/arm64/boot/dts/nvidia/tegra194.dtsi
@@ -2161,6 +2161,7 @@
device_type = "cpu";
reg = <0x000>;
enable-method = "psci";
+ cpu-idle-states = <&C6>;
i-cache-size = <131072>;
i-cache-line-size = <64>;
i-cache-sets = <512>;
@@ -2175,6 +2176,7 @@
device_type = "cpu";
reg = <0x001>;
enable-method = "psci";
+ cpu-idle-states = <&C6>;
i-cache-size = <131072>;
i-cache-line-size = <64>;
i-cache-sets = <512>;
@@ -2189,6 +2191,7 @@
device_type = "cpu";
reg = <0x100>;
enable-method = "psci";
+ cpu-idle-states = <&C6>;
i-cache-size = <131072>;
i-cache-line-size = <64>;
i-cache-sets = <512>;
@@ -2203,6 +2206,7 @@
device_type = "cpu";
reg = <0x101>;
enable-method = "psci";
+ cpu-idle-states = <&C6>;
i-cache-size = <131072>;
i-cache-line-size = <64>;
i-cache-sets = <512>;
@@ -2217,6 +2221,7 @@
device_type = "cpu";
reg = <0x200>;
enable-method = "psci";
+ cpu-idle-states = <&C6>;
i-cache-size = <131072>;
i-cache-line-size = <64>;
i-cache-sets = <512>;
@@ -2231,6 +2236,7 @@
device_type = "cpu";
reg = <0x201>;
enable-method = "psci";
+ cpu-idle-states = <&C6>;
i-cache-size = <131072>;
i-cache-line-size = <64>;
i-cache-sets = <512>;
@@ -2245,6 +2251,7 @@
device_type = "cpu";
reg = <0x300>;
enable-method = "psci";
+ cpu-idle-states = <&C6>;
i-cache-size = <131072>;
i-cache-line-size = <64>;
i-cache-sets = <512>;
@@ -2259,6 +2266,7 @@
device_type = "cpu";
reg = <0x301>;
enable-method = "psci";
+ cpu-idle-states = <&C6>;
i-cache-size = <131072>;
i-cache-line-size = <64>;
i-cache-sets = <512>;
@@ -2343,6 +2351,17 @@
cache-line-size = <64>;
cache-sets = <4096>;
};
+
+ idle-states {
+ entry-method = "arm,psci";
+ C6: c6 {
+ compatible = "arm,idle-state";
+ wakeup-latency-us = <2000>;
+ min-residency-us = <30000>;
+ arm,psci-suspend-param = <0x6>;
+ status = "okay";
+ };
+ };
};

psci {
--
2.7.4

2021-04-23 01:05:25

by Sowjanya Komatineni

[permalink] [raw]

Subject: Re: [RFC PATCH 0/4] Support for passing runtime state idle time to TF-A

kernel test robot reported below errors for patch-2 of this series. Will
fix in next version to use div_u64() for nsec to usec conversion along
with the other feedback I may be receiving.

drivers/cpuidle/governors/menu.c:398: undefined reference to `__udivdi3'
ld: drivers/cpuidle/governors/menu.c:398: undefined reference to `__udivdi3'
ld: drivers/cpuidle/governors/menu.c:398: undefined reference to `__udivdi3'
ld: drivers/cpuidle/governors/menu.c:398: undefined reference to `__udivdi3'

Thanks

Sowjanya

On 4/22/21 1:30 PM, Sowjanya Komatineni wrote:
> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs which is
> in charge of deciding on state transition based on target state, state idle
> time, and some other Tegra CPU core cluster states information.
>
> Current PSCI specification don't have function defined for passing runtime
> state idle time predicted by governor (based on next events and state target
> residency) to ARM trusted firmware.
>
> With the support of adding new PSCI function to allow passing runtime state
> idle time from kernel to ARM trusted firmware, Tegra194 platforms can use
> generic psci cpuidle driver rather than having Tegra specific cpuidle driver.
>
> During Tegra specific cpuidle driver V1 review, Sudeep Holla from ARM also
> suggested to use generic cpuidle driver by generalizing the need of runtime
> state idle time.
>
> So had internal discussion between ARM and NVIDIA on adding new PSCI function
> to allow passing runtime state idle time from kernel to TF-A through PSCI and
> once this implementation is accepted by upstream, ARM will look into further
> to update PSCI specification for this new PSCI function.
>
> So sending these patches as RFC as new PSCI function added in this series is
> not part of PSCI specification and once this implementation is accepted by ARM
> and upstream community, ARM can help to take this forward to add to PSCI
> specification.
>
> To keep the backward compatibility we can't update CPU_SUSPEND function to pass
> state idle time argument. So added seperate function for passing state idle time
> and serializing this with cpu suspend state enter.
>
> Once this approach is agreed, we can either use this way of separate PSCI
> function for passing state idle time or with PSCI specification update we can
> use same CPU_SUSPEND function with extra argument for state idle time which can
> be decided later for final patches based on discussion with ARM.
>
>
> Sowjanya Komatineni (4):
> firmware/psci: add support for PSCI function SET_STATE_IDLE_TIME
> cpuidle: menu: add idle_time to cpuidle_state
> cpuidle: psci: pass state idle time before state enter callback
> arm64: dts: tegra194: Add CPU idle states
>
> arch/arm64/boot/dts/nvidia/tegra194.dtsi | 19 +++++++++++++++++++
> drivers/cpuidle/cpuidle-psci.c | 11 +++++++++++
> drivers/cpuidle/governors/menu.c | 7 ++++++-
> drivers/firmware/psci/psci.c | 9 +++++++++
> include/linux/cpuidle.h | 1 +
> include/linux/psci.h | 1 +
> include/uapi/linux/psci.h | 2 ++
> 7 files changed, 49 insertions(+), 1 deletion(-)
>

2021-04-23 12:28:52

by Rafael J. Wysocki

[permalink] [raw]

Subject: Re: [RFC PATCH 0/4] Support for passing runtime state idle time to TF-A

On Thu, Apr 22, 2021 at 10:31 PM Sowjanya Komatineni
<[email protected]> wrote:
>
> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs which is
> in charge of deciding on state transition based on target state, state idle
> time, and some other Tegra CPU core cluster states information.
>
> Current PSCI specification don't have function defined for passing runtime
> state idle time predicted by governor (based on next events and state target
> residency) to ARM trusted firmware.

Presumably that's because this is not a good idea.

A basic design principle of cpuidle is that it should be possible to
use every governor with every driver and the changes in this series
make the platforms in question only work with menu AFAICS.

2021-04-23 18:36:34

by Sowjanya Komatineni

[permalink] [raw]

Subject: Re: [RFC PATCH 0/4] Support for passing runtime state idle time to TF-A

On 4/23/21 5:27 AM, Rafael J. Wysocki wrote:
> On Thu, Apr 22, 2021 at 10:31 PM Sowjanya Komatineni
> <[email protected]> wrote:
>> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs which is
>> in charge of deciding on state transition based on target state, state idle
>> time, and some other Tegra CPU core cluster states information.
>>
>> Current PSCI specification don't have function defined for passing runtime
>> state idle time predicted by governor (based on next events and state target
>> residency) to ARM trusted firmware.
> Presumably that's because this is not a good idea.
Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs.

MCE firmware handles CPU complex power management states entry/exit
based on its background tasks and uses expected wake time of the core to
decide on state transition.

Expected wake time is based on next event and allowed min residency of
the state which governor predicts in kernel which MCE is not aware of.

So need a way to pass this info from kernel through PSCI to TF-A and
TF-A will update this to MCE along with requested state of entry.

For example, When C6 core idle state is requested, MCE notes the time at
which core is likely to wake up. There could be background tasks running
on core which kernel is not aware of.

When those tasks are completed it will check the remaining time against
states crossover thresholds (programmed by ARM trusted FW) to determine
if its still have enough time to enter into C6 state.

While a core is power gates, it could be woken up for background tasks
and put back to power gated state again by MCE based on expected wake
time of the corresponding core.

So, Tegra194/Tegra186 CPU idle support, we need this runtime state
expected wake time predicted by governor to be passed from kernel to TF-A.

Thanks

Sowjanya

>
> A basic design principle of cpuidle is that it should be possible to
> use every governor with every driver and the changes in this series
> make the platforms in question only work with menu AFAICS.

2021-04-23 20:19:37

by Lukasz Luba

[permalink] [raw]

Subject: Re: [RFC PATCH 0/4] Support for passing runtime state idle time to TF-A

Hi Sowjanya,

On 4/22/21 9:30 PM, Sowjanya Komatineni wrote:
> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs which is
> in charge of deciding on state transition based on target state, state idle
> time, and some other Tegra CPU core cluster states information.
>
> Current PSCI specification don't have function defined for passing runtime
> state idle time predicted by governor (based on next events and state target
> residency) to ARM trusted firmware.

Do you have some numbers from experiments showing that these idle
governor prediction values, which are passed from kernel to MCE
firmware, are making a good 'guess'?
How much precision (1us? 1ms?) in the values do you need there?

IIRC (probably Rafael's presentations) predicting in the kernel
something like CPU idle time residency is not a trivial thing.

Another idea (depending on DT structure and PSCI bits):
Could this be solved differently, but just having a knowledge that if
the governor requested some C-state, this means governor 'predicted'
an idle residency to be greater that min_residency attached to this
C-state?
Then, when that request shows up in your FW, you know that it must be at
least min_residency because of this C-state id.
It would depend on number of available states, max_residency, scale
that you would choose while assigning values from [0, max_residency]
to each state.
IIRC there can be many state IDs for idle, so it would depend on
number of bits encoding this state, and your needs. Example of
linear scale:
4-bits encoding idle state and max predicted residency 10msec,
that means 10000us / 16 states = 625us/state.
The max_residency might be split differently, using different than
linear function, to have some rage more precised.

Open question is if these idle states must be all represented
in DT, or there is a way of describing a 'set of idle states'
automatically.

Regards,
Lukasz

2021-04-23 22:25:58

by Sowjanya Komatineni

[permalink] [raw]

Subject: Re: [RFC PATCH 0/4] Support for passing runtime state idle time to TF-A

On 4/23/21 1:16 PM, Lukasz Luba wrote:
> Hi Sowjanya,
>
> On 4/22/21 9:30 PM, Sowjanya Komatineni wrote:
>> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs
>> which is
>> in charge of deciding on state transition based on target state,
>> state idle
>> time, and some other Tegra CPU core cluster states information.
>>
>> Current PSCI specification don't have function defined for passing
>> runtime
>> state idle time predicted by governor (based on next events and state
>> target
>> residency) to ARM trusted firmware.
>
> Do you have some numbers from experiments showing that these idle
> governor prediction values, which are passed from kernel to MCE
> firmware, are making a good 'guess'?
> How much precision (1us? 1ms?) in the values do you need there?

it could also be in few ms depending on when next cpu event/activity
might happen which is not transparent to MCE firmware.

>
> IIRC (probably Rafael's presentations) predicting in the kernel
> something like CPU idle time residency is not a trivial thing.
>
> Another idea (depending on DT structure and PSCI bits):
> Could this be solved differently, but just having a knowledge that if
> the governor requested some C-state, this means governor 'predicted'
> an idle residency to be greater that min_residency attached to this
> C-state?
> Then, when that request shows up in your FW, you know that it must be at
> least min_residency because of this C-state id.
C6 is the only deepest state for Tegra194 Carmel CPU that we support in
addition to C1 (WFI) idle state.

MCE firmware gets state crossover thresholds for C1 to C6 transition
from TF-A and uses it along with state idle time to decide on C6 state
entry based on its background work.

Assuming for now if we use min_residency as state idle time which is
static value from DT, then it enters into deepest state C6 always as we
use min_residency value we use is always higher than state crossover
threshold.

But MCE firmware is not aware of when next cpu event can happen to
predict if next event can take longer than state min_residency time.

Using min residency in such case is very conservative where MCE firmware
exits C6 state early where we may not have better power saving.

But with MCE firmware being aware of when next event can happen it can
use that to stay in C6 state without early exit for better power savings.

> It would depend on number of available states, max_residency, scale
> that you would choose while assigning values from [0, max_residency]
> to each state.
> IIRC there can be many state IDs for idle, so it would depend on
> number of bits encoding this state, and your needs. Example of
> linear scale:
> 4-bits encoding idle state and max predicted residency 10msec,
> that means 10000us / 16 states = 625us/state.
> The max_residency might be split differently, using different than
> linear function, to have some rage more precised.
>
> Open question is if these idle states must be all represented
> in DT, or there is a way of describing a 'set of idle states'
> automatically.
We only support C6 state through DT as C6 is the only deepest state for
Tegra194 carmel CPU. WFI idle state is completely handled by kernel and
does not require MCE sequences for entry/exit.
>
> Regards,
> Lukasz

2021-04-26 10:13:36

by Souvik Chakravarty

[permalink] [raw]

Subject: RE: [RFC PATCH 0/4] Support for passing runtime state idle time to TF-A

Hi Sowjanya,

> From: Sowjanya Komatineni
> Sent: Friday, April 23, 2021 11:25 PM
>
> On 4/23/21 1:16 PM, Lukasz Luba wrote:
> > Hi Sowjanya,
> >
> > On 4/22/21 9:30 PM, Sowjanya Komatineni wrote:
> >> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs
> >> which is in charge of deciding on state transition based on target
> >> state, state idle time, and some other Tegra CPU core cluster states
> >> information.
> >>
> >> Current PSCI specification don't have function defined for passing
> >> runtime state idle time predicted by governor (based on next events
> >> and state target
> >> residency) to ARM trusted firmware.
> >
> > Do you have some numbers from experiments showing that these idle
> > governor prediction values, which are passed from kernel to MCE
> > firmware, are making a good 'guess'?
> > How much precision (1us? 1ms?) in the values do you need there?
>
> it could also be in few ms depending on when next cpu event/activity might
> happen which is not transparent to MCE firmware.
>
> >
> > IIRC (probably Rafael's presentations) predicting in the kernel
> > something like CPU idle time residency is not a trivial thing.
> >
> > Another idea (depending on DT structure and PSCI bits):
> > Could this be solved differently, but just having a knowledge that if
> > the governor requested some C-state, this means governor 'predicted'
> > an idle residency to be greater that min_residency attached to this
> > C-state?
> > Then, when that request shows up in your FW, you know that it must be
> > at least min_residency because of this C-state id.
> C6 is the only deepest state for Tegra194 Carmel CPU that we support in
> addition to C1 (WFI) idle state.
>
> MCE firmware gets state crossover thresholds for C1 to C6 transition from TF-
> A and uses it along with state idle time to decide on C6 state entry based on
> its background work.
>
> Assuming for now if we use min_residency as state idle time which is static
> value from DT, then it enters into deepest state C6 always as we use
> min_residency value we use is always higher than state crossover threshold.
>
> But MCE firmware is not aware of when next cpu event can happen to
> predict if next event can take longer than state min_residency time.
>
> Using min residency in such case is very conservative where MCE firmware
> exits C6 state early where we may not have better power saving.
>
> But with MCE firmware being aware of when next event can happen it can
> use that to stay in C6 state without early exit for better power savings.

This part confuses me. Are you saying that the firmware will forcefully wake up
the core, even if no wakeups are pending, when min residency for C6 expires?

Regards,
Souvik

>
> > It would depend on number of available states, max_residency, scale
> > that you would choose while assigning values from [0, max_residency]
> > to each state.
> > IIRC there can be many state IDs for idle, so it would depend on
> > number of bits encoding this state, and your needs. Example of linear
> > scale:
> > 4-bits encoding idle state and max predicted residency 10msec, that
> > means 10000us / 16 states = 625us/state.
> > The max_residency might be split differently, using different than
> > linear function, to have some rage more precised.
> >
> > Open question is if these idle states must be all represented in DT,
> > or there is a way of describing a 'set of idle states'
> > automatically.
> We only support C6 state through DT as C6 is the only deepest state for
> Tegra194 carmel CPU. WFI idle state is completely handled by kernel and
> does not require MCE sequences for entry/exit.
> >
> > Regards,
> > Lukasz

2021-04-26 13:14:04

by Morten Rasmussen

[permalink] [raw]

Subject: Re: [RFC PATCH 0/4] Support for passing runtime state idle time to TF-A

Hi,

On Fri, Apr 23, 2021 at 03:24:51PM -0700, Sowjanya Komatineni wrote:
> On 4/23/21 1:16 PM, Lukasz Luba wrote:
> > Hi Sowjanya,
> >
> > On 4/22/21 9:30 PM, Sowjanya Komatineni wrote:
> > > Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs
> > > which is
> > > in charge of deciding on state transition based on target state,
> > > state idle
> > > time, and some other Tegra CPU core cluster states information.
> > >
> > > Current PSCI specification don't have function defined for passing
> > > runtime
> > > state idle time predicted by governor (based on next events and
> > > state target
> > > residency) to ARM trusted firmware.
> >
> > Do you have some numbers from experiments showing that these idle
> > governor prediction values, which are passed from kernel to MCE
> > firmware, are making a good 'guess'?
> > How much precision (1us? 1ms?) in the values do you need there?
>
> it could also be in few ms depending on when next cpu event/activity might
> happen which is not transparent to MCE firmware.
>
> >
> > IIRC (probably Rafael's presentations) predicting in the kernel
> > something like CPU idle time residency is not a trivial thing.
> >
> > Another idea (depending on DT structure and PSCI bits):
> > Could this be solved differently, but just having a knowledge that if
> > the governor requested some C-state, this means governor 'predicted'
> > an idle residency to be greater that min_residency attached to this
> > C-state?
> > Then, when that request shows up in your FW, you know that it must be at
> > least min_residency because of this C-state id.
> C6 is the only deepest state for Tegra194 Carmel CPU that we support in
> addition to C1 (WFI) idle state.
>
> MCE firmware gets state crossover thresholds for C1 to C6 transition from
> TF-A and uses it along with state idle time to decide on C6 state entry
> based on its background work.
>
> Assuming for now if we use min_residency as state idle time which is static
> value from DT, then it enters into deepest state C6 always as we use
> min_residency value we use is always higher than state crossover threshold.
>
> But MCE firmware is not aware of when next cpu event can happen to predict
> if next event can take longer than state min_residency time.
>
> Using min residency in such case is very conservative where MCE firmware
> exits C6 state early where we may not have better power saving.
>
> But with MCE firmware being aware of when next event can happen it can use
> that to stay in C6 state without early exit for better power savings.
>
> > It would depend on number of available states, max_residency, scale
> > that you would choose while assigning values from [0, max_residency]
> > to each state.
> > IIRC there can be many state IDs for idle, so it would depend on
> > number of bits encoding this state, and your needs. Example of
> > linear scale:
> > 4-bits encoding idle state and max predicted residency 10msec,
> > that means 10000us / 16 states = 625us/state.
> > The max_residency might be split differently, using different than
> > linear function, to have some rage more precised.
> >
> > Open question is if these idle states must be all represented
> > in DT, or there is a way of describing a 'set of idle states'
> > automatically.
> We only support C6 state through DT as C6 is the only deepest state for
> Tegra194 carmel CPU. WFI idle state is completely handled by kernel and does
> not require MCE sequences for entry/exit.

I think Lukasz's point is that you can encode the predicted idle time by
having multiple idle_state entries with different min_residency mapping
to the same actual idle-state. So you would several variants of C6 with
different min_residencies and if the OS picks one with longer
min_residency firmware would have a better estimate of the predicted
idle residency.

I'm not convinced it is the right way to work around passing this
information on to firmware. I would rather see an example of how well
this works (best with numbers) and have a proper solution.

Morten