2022-09-15 08:52:06

by Peter Zijlstra

Subject: RCU vs NOHZ

Hi,

After watching Joel's talk about RCU and idle ticks I was wondering
about why RCU doesn't have NOHZ hooks -- that is regular NOHZ, not the
NOHZ_FULL stuff.

These deep idle states are only feasible during NOHZ idle, and the NOHZ
path is already relatively expensive (which is offset by then mostly
staying idle for a long while).

Specifically my thinking was that when a CPU goes NOHZ it can splice
its callback list onto a global list (cmpxchg), and then the
jiffy-updater CPU can look at and consume this global list (xchg).
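
Roughly, with made-up names (only cmpxchg()/xchg()/READ_ONCE() and
struct rcu_head are real here, and the real callback list is segmented,
which this sketch ignores):

static struct rcu_head *nohz_offload_list;	/* the global in question */

/* CPU going NOHZ idle: splice its (pre-linked) callback list onto the
 * global list with the usual cmpxchg() loop. */
static void nohz_offload_cbs(struct rcu_head *head, struct rcu_head *tail)
{
	struct rcu_head *old;

	do {
		old = READ_ONCE(nohz_offload_list);
		tail->next = old;
	} while (cmpxchg(&nohz_offload_list, old, head) != old);
}

/* Jiffy-updater CPU: take the whole list in one shot. */
static struct rcu_head *nohz_consume_cbs(void)
{
	return xchg(&nohz_offload_list, NULL);
}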

Before you say... but globals suck (they do), NOHZ already has a fair
amount of global state, and as said before, it's offset by the CPU then
staying idle for a fair while. If there is heavy contention on the NOHZ
data, the idle governor is doing a bad job by selecting deep idle states
whilst we're not actually idle for long.

The above would remove the reason for RCU to inhibit NOHZ.


Additionally; when the very last CPU goes idle (I think we know this
somewhere, but I can't readily remember where) we can insta-advance the
QS machinery and run the callbacks before going (NOHZ) idle.


Is there a reason this couldn't work? To me this seems like a much
simpler solution than the whole rcu-cb thing.


2022-09-15 15:47:50

by Joel Fernandes

Subject: Re: RCU vs NOHZ

On Thu, Sep 15, 2022 at 9:39 AM Peter Zijlstra <[email protected]> wrote:
>
> Hi,
>
> After watching Joel's talk about RCU and idle ticks I was wondering
> about why RCU doesn't have NOHZ hooks -- that is regular NOHZ, not the
> NOHZ_FULL stuff.


Glad the talk stirred up a discussion!

> These deep idle states are only feasible during NOHZ idle, and the NOHZ
> path is already relatively expensive (which is offset by then mostly
> staying idle for a long while).
>
> Specifically my thinking was that when a CPU goes NOHZ it can splice
> its callback list onto a global list (cmpxchg), and then the
> jiffy-updater CPU can look at and consume this global list (xchg).
>
> Before you say... but globals suck (they do), NOHZ already has a fair
> amount of global state, and as said before, it's offset by the CPU then
> staying idle for a fair while. If there is heavy contention on the NOHZ
> data, the idle governor is doing a bad job by selecting deep idle states
> whilst we're not actually idle for long.
>
> The above would remove the reason for RCU to inhibit NOHZ.
>
>
> Additionally; when the very last CPU goes idle (I think we know this
> somewhere, but I can't readily remember where) we can insta-advance the
> QS machinery and run the callbacks before going (NOHZ) idle.
>
>
> Is there a reason this couldn't work? To me this seems like a much
> simpler solution than the whole rcu-cb thing.

You mean the “whole rcu-nocb” thing? The nocb feature does not just
eliminate the need to keep idle ticks ON; that’s just one of the
reasons to use nocb. The other reason is that nocb invokes the callbacks
in a thread rather than in softirq on the queuing CPU, which lets the
energy-aware scheduler decide where to place those threads, and it seems
to do a really good job of saving power (Vlad Rezki, who works on
Android, checked this and I CC’d him).

Maybe your suggestion is more about how to keep the idle tick off on
systems without using nocb (note that nocb has additional overhead
due to thread wakeups and locking)? If we do the global thing you
mentioned, then that means we won’t get the per-CPU cache-locality
benefits on those systems that want the callback to execute on the same
CPU (a callback operating on a hot cache line), since your callback gets
invoked on the jiffy-updating CPU in your design.

Outside of RCU, I do remember back in my Android days that even with
nocb, we could see idle ticks present quite a bit (this is likely
because of the idle governor on Android, but I did not dig much
deeper). I need to go and investigate that again...

Thanks,

- Joel

2022-09-15 17:19:32

by Paul E. McKenney

Subject: Re: RCU vs NOHZ

On Thu, Sep 15, 2022 at 10:39:12AM +0200, Peter Zijlstra wrote:
> Hi,
>
> After watching Joel's talk about RCU and idle ticks I was wondering
> about why RCU doesn't have NOHZ hooks -- that is regular NOHZ, not the
> NOHZ_FULL stuff.

It actually does, but they have recently moved into the context-tracking
code, courtesy of Frederic's recent patch series.

> These deep idle states are only feasible during NOHZ idle, and the NOHZ
> path is already relatively expensive (which is offset by then mostly
> staying idle for a long while).
>
> Specifically my thinking was that when a CPU goes NOHZ it can splice
> > its callback list onto a global list (cmpxchg), and then the
> jiffy-updater CPU can look at and consume this global list (xchg).
>
> Before you say... but globals suck (they do), NOHZ already has a fair
> amount of global state, and as said before, it's offset by the CPU then
> staying idle for a fair while. If there is heavy contention on the NOHZ
> data, the idle governor is doing a bad job by selecting deep idle states
> whilst we're not actually idle for long.
>
> The above would remove the reason for RCU to inhibit NOHZ.
>
>
> Additionally; when the very last CPU goes idle (I think we know this
> > somewhere, but I can't readily remember where) we can insta-advance the
> QS machinery and run the callbacks before going (NOHZ) idle.
>
>
> Is there a reason this couldn't work? To me this seems like a much
> simpler solution than the whole rcu-cb thing.

To restate Joel's reply a bit...

Maybe.

Except that we need rcu_nocbs anyway for low latency and HPC applications.
Given that we have it, and given that it totally eliminates RCU-induced
idle ticks, how would it help to add cmpxchg-based global offloading?

Thanx, Paul

2022-09-15 19:04:37

by Peter Zijlstra

Subject: Re: RCU vs NOHZ

On Thu, Sep 15, 2022 at 09:06:00AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 15, 2022 at 10:39:12AM +0200, Peter Zijlstra wrote:
> > Hi,
> >
> > After watching Joel's talk about RCU and idle ticks I was wondering
> > about why RCU doesn't have NOHZ hooks -- that is regular NOHZ, not the
> > NOHZ_FULL stuff.
>
> It actually does, but they have recently moved into the context-tracking
> code, courtesy of Frederic's recent patch series.

afair that's idle and that is not nohz.

> > These deep idle states are only feasible during NOHZ idle, and the NOHZ
> > path is already relatively expensive (which is offset by then mostly
> > staying idle for a long while).
> >
> > Specifically my thinking was that when a CPU goes NOHZ it can splice
> > its callback list onto a global list (cmpxchg), and then the
> > jiffy-updater CPU can look at and consume this global list (xchg).
> >
> > Before you say... but globals suck (they do), NOHZ already has a fair
> > amount of global state, and as said before, it's offset by the CPU then
> > staying idle for a fair while. If there is heavy contention on the NOHZ
> > data, the idle governor is doing a bad job by selecting deep idle states
> > whilst we're not actually idle for long.
> >
> > The above would remove the reason for RCU to inhibit NOHZ.
> >
> >
> > Additionally; when the very last CPU goes idle (I think we know this
> > somewhere, but I can't readily remember where) we can insta-advance the
> > QS machinery and run the callbacks before going (NOHZ) idle.
> >
> >
> > Is there a reason this couldn't work? To me this seems like a much
> > simpler solution than the whole rcu-cb thing.
>
> To restate Joel's reply a bit...
>
> Maybe.
>
> Except that we need rcu_nocbs anyway for low latency and HPC applications.
> Given that we have it, and given that it totally eliminates RCU-induced
> idle ticks, how would it help to add cmpxchg-based global offloading?

Because that nocb stuff isn't default enabled?

2022-09-15 19:34:39

by Paul E. McKenney

Subject: Re: RCU vs NOHZ

On Thu, Sep 15, 2022 at 08:50:44PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 15, 2022 at 09:06:00AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 15, 2022 at 10:39:12AM +0200, Peter Zijlstra wrote:
> > > Hi,
> > >
> > > After watching Joel's talk about RCU and idle ticks I was wondering
> > > about why RCU doesn't have NOHZ hooks -- that is regular NOHZ, not the
> > > NOHZ_FULL stuff.
> >
> > It actually does, but they have recently moved into the context-tracking
> > code, courtesy of Frederic's recent patch series.
>
> afair that's idle and that is not nohz.

For nohz_full CPUs, it does both.

> > > These deep idle states are only feasible during NOHZ idle, and the NOHZ
> > > path is already relatively expensive (which is offset by then mostly
> > > staying idle for a long while).
> > >
> > > Specifically my thinking was that when a CPU goes NOHZ it can splice
> > > its callback list onto a global list (cmpxchg), and then the
> > > jiffy-updater CPU can look at and consume this global list (xchg).
> > >
> > > Before you say... but globals suck (they do), NOHZ already has a fair
> > > amount of global state, and as said before, it's offset by the CPU then
> > > staying idle for a fair while. If there is heavy contention on the NOHZ
> > > data, the idle governor is doing a bad job by selecting deep idle states
> > > whilst we're not actually idle for long.
> > >
> > > The above would remove the reason for RCU to inhibit NOHZ.
> > >
> > >
> > > Additionally; when the very last CPU goes idle (I think we know this
> > > somewhere, but I can't readily remember where) we can insta-advance the
> > > QS machinery and run the callbacks before going (NOHZ) idle.
> > >
> > >
> > > Is there a reason this couldn't work? To me this seems like a much
> > > simpler solution than the whole rcu-cb thing.
> >
> > To restate Joel's reply a bit...
> >
> > Maybe.
> >
> > Except that we need rcu_nocbs anyway for low latency and HPC applications.
> > Given that we have it, and given that it totally eliminates RCU-induced
> > idle ticks, how would it help to add cmpxchg-based global offloading?
>
> Because that nocb stuff isn't default enabled?

Last I checked, both RHEL and Fedora were built with CONFIG_RCU_NOCB_CPU=y.
And I checked Fedora just now.

Or am I missing your point?

Thanx, Paul

2022-09-15 22:40:37

by Peter Zijlstra

Subject: Re: RCU vs NOHZ

On Thu, Sep 15, 2022 at 12:14:27PM -0700, Paul E. McKenney wrote:
> On Thu, Sep 15, 2022 at 08:50:44PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 15, 2022 at 09:06:00AM -0700, Paul E. McKenney wrote:
> > > On Thu, Sep 15, 2022 at 10:39:12AM +0200, Peter Zijlstra wrote:
> > > > Hi,
> > > >
> > > > After watching Joel's talk about RCU and idle ticks I was wondering
> > > > about why RCU doesn't have NOHZ hooks -- that is regular NOHZ, not the
> > > > NOHZ_FULL stuff.
> > >
> > > It actually does, but they have recently moved into the context-tracking
> > > code, courtesy of Frederic's recent patch series.
> >
> > afair that's idle and that is not nohz.
>
> For nohz_full CPUs, it does both.

Normal people don't have nohz_full cpus (and shouldn't want any).

> > > > These deep idle states are only feasible during NOHZ idle, and the NOHZ
> > > > path is already relatively expensive (which is offset by then mostly
> > > > staying idle for a long while).
> > > >
> > > > Specifically my thinking was that when a CPU goes NOHZ it can splice
> > > > its callback list onto a global list (cmpxchg), and then the
> > > > jiffy-updater CPU can look at and consume this global list (xchg).
> > > >
> > > > Before you say... but globals suck (they do), NOHZ already has a fair
> > > > amount of global state, and as said before, it's offset by the CPU then
> > > > staying idle for a fair while. If there is heavy contention on the NOHZ
> > > > data, the idle governor is doing a bad job by selecting deep idle states
> > > > whilst we're not actually idle for long.
> > > >
> > > > The above would remove the reason for RCU to inhibit NOHZ.
> > > >
> > > >
> > > > Additionally; when the very last CPU goes idle (I think we know this
> > > > somewhere, but I can't readily remember where) we can insta-advance the
> > > > QS machinery and run the callbacks before going (NOHZ) idle.
> > > >
> > > >
> > > > Is there a reason this couldn't work? To me this seems like a much
> > > > simpler solution than the whole rcu-cb thing.
> > >
> > > To restate Joel's reply a bit...
> > >
> > > Maybe.
> > >
> > > Except that we need rcu_nocbs anyway for low latency and HPC applications.
> > > Given that we have it, and given that it totally eliminates RCU-induced
> > > idle ticks, how would it help to add cmpxchg-based global offloading?
> >
> > Because that nocb stuff isn't default enabled?
>
> Last I checked, both RHEL and Fedora were built with CONFIG_RCU_NOCB_CPU=y.
> And I checked Fedora just now.
>
> Or am I missing your point?

I might be missing the point; but why did Joel have a talk if it's all
default on?

2022-09-16 03:52:48

by Joel Fernandes

Subject: Re: RCU vs NOHZ

On 9/15/2022 6:30 PM, Peter Zijlstra wrote:
>>>>> These deep idle states are only feasible during NOHZ idle, and the NOHZ
>>>>> path is already relatively expensive (which is offset by then mostly
>>>>> staying idle for a long while).
>>>>>
>>>>> Specifically my thinking was that when a CPU goes NOHZ it can splice
>>>>> its callback list onto a global list (cmpxchg), and then the
>>>>> jiffy-updater CPU can look at and consume this global list (xchg).
>>>>>
>>>>> Before you say... but globals suck (they do), NOHZ already has a fair
>>>>> amount of global state, and as said before, it's offset by the CPU then
>>>>> staying idle for a fair while. If there is heavy contention on the NOHZ
>>>>> data, the idle governor is doing a bad job by selecting deep idle states
>>>>> whilst we're not actually idle for long.
>>>>>
>>>>> The above would remove the reason for RCU to inhibit NOHZ.
>>>>>
>>>>>
>>>>> Additionally; when the very last CPU goes idle (I think we know this
>>>>> somewhere, but I can't readily remember where) we can insta-advance the
>>>>> QS machinery and run the callbacks before going (NOHZ) idle.
>>>>>
>>>>>
>>>>> Is there a reason this couldn't work? To me this seems like a much
>>>>> simpler solution than the whole rcu-cb thing.
>>>> To restate Joel's reply a bit...
>>>>
>>>> Maybe.
>>>>
>>>> Except that we need rcu_nocbs anyway for low latency and HPC applications.
>>>> Given that we have it, and given that it totally eliminates RCU-induced
>>>> idle ticks, how would it help to add cmpxchg-based global offloading?
>>> Because that nocb stuff isn't default enabled?
>> Last I checked, both RHEL and Fedora were built with CONFIG_RCU_NOCB_CPU=y.
>> And I checked Fedora just now.
>>
>> Or am I missing your point?
> I might be missing the point; but why did Joel have a talk if it's all
> default on?

It was not default on until recently for Intel ChromeOS devices. Also, my talk
is much more than just idle-ticks/NOCB. I am talking about delaying grace
periods for long amounts of time using various techniques, with data to show
the improvements, and a new tool, rcutop, to show the problems :-) The
discussion of ticks which disturb idle was more for background information for
the audience (sorry if I was not clear about that).

thanks,

- Joel

2022-09-16 08:15:27

by Paul E. McKenney

Subject: Re: RCU vs NOHZ

On Fri, Sep 16, 2022 at 12:30:34AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 15, 2022 at 12:14:27PM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 15, 2022 at 08:50:44PM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 15, 2022 at 09:06:00AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Sep 15, 2022 at 10:39:12AM +0200, Peter Zijlstra wrote:
> > > > > Hi,
> > > > >
> > > > > After watching Joel's talk about RCU and idle ticks I was wondering
> > > > > about why RCU doesn't have NOHZ hooks -- that is regular NOHZ, not the
> > > > > NOHZ_FULL stuff.
> > > >
> > > > It actually does, but they have recently moved into the context-tracking
> > > > code, courtesy of Frederic's recent patch series.
> > >
> > > afair that's idle and that is not nohz.
> >
> > For nohz_full CPUs, it does both.
>
> Normal people don't have nohz_full cpus (and shouldn't want any).

To the best of my knowledge at this point in time, agreed. Who knows
what someone will come up with next week? But for people running certain
types of real-time and HPC workloads, context tracking really does handle
both idle and userspace transitions.

> > > > > These deep idle states are only feasible during NOHZ idle, and the NOHZ
> > > > > path is already relatively expensive (which is offset by then mostly
> > > > > staying idle for a long while).
> > > > >
> > > > > Specifically my thinking was that when a CPU goes NOHZ it can splice
> > > > > its callback list onto a global list (cmpxchg), and then the
> > > > > jiffy-updater CPU can look at and consume this global list (xchg).
> > > > >
> > > > > Before you say... but globals suck (they do), NOHZ already has a fair
> > > > > amount of global state, and as said before, it's offset by the CPU then
> > > > > staying idle for a fair while. If there is heavy contention on the NOHZ
> > > > > data, the idle governor is doing a bad job by selecting deep idle states
> > > > > whilst we're not actually idle for long.
> > > > >
> > > > > The above would remove the reason for RCU to inhibit NOHZ.
> > > > >
> > > > >
> > > > > Additionally; when the very last CPU goes idle (I think we know this
> > > > > somewhere, but I can't readily remember where) we can insta-advance the
> > > > > QS machinery and run the callbacks before going (NOHZ) idle.
> > > > >
> > > > >
> > > > > Is there a reason this couldn't work? To me this seems like a much
> > > > > simpler solution than the whole rcu-cb thing.
> > > >
> > > > To restate Joel's reply a bit...
> > > >
> > > > Maybe.
> > > >
> > > > Except that we need rcu_nocbs anyway for low latency and HPC applications.
> > > > Given that we have it, and given that it totally eliminates RCU-induced
> > > > idle ticks, how would it help to add cmpxchg-based global offloading?
> > >
> > > Because that nocb stuff isn't default enabled?
> >
> > Last I checked, both RHEL and Fedora were built with CONFIG_RCU_NOCB_CPU=y.
> > And I checked Fedora just now.
> >
> > Or am I missing your point?
>
> I might be missing the point; but why did Joel have a talk if it's all
> default on?

It wasn't enabled for ChromeOS.

When fully enabled, it gave them the energy-efficiency advantages Joel
described. And then Joel described some additional call_rcu_lazy()
changes that provided even better energy efficiency. Though I believe
that the application should also be changed to avoid incessantly opening
and closing that file while the device is idle, as this would remove
-all- RCU work when nearly idle. But some of the other call_rcu_lazy()
use cases would likely remain.

If someone believes that their workload would benefit similarly and they
are running Fedora or RHEL (and last I knew, the SUSE distros as well),
then they can boot with rcu_nocbs=0-N and try it out. No need to further
change RCU until proven otherwise.

Thanx, Paul

2022-09-16 09:44:31

by Peter Zijlstra

Subject: Re: RCU vs NOHZ

On Fri, Sep 16, 2022 at 12:58:17AM -0700, Paul E. McKenney wrote:

> To the best of my knowledge at this point in time, agreed. Who knows
> what someone will come up with next week? But for people running certain
> types of real-time and HPC workloads, context tracking really does handle
> both idle and userspace transitions.

Sure, but idle != nohz. Nohz is where we disable the tick, and currently
RCU can inhibit this -- rcu_needs_cpu().
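
For reference, the check in question is currently just this
(kernel/rcu/tree.c; these are the lines the patch below removes):

int rcu_needs_cpu(void)
{
	return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
	       !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
}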

AFAICT there really isn't an RCU hook for this, not through context
tracking nor through anything else.

> It wasn't enabled for ChromeOS.
>
> When fully enabled, it gave them the energy-efficiency advantages Joel
> described. And then Joel described some additional call_rcu_lazy()
> changes that provided even better energy efficiency. Though I believe
> that the application should also be changed to avoid incessantly opening
> and closing that file while the device is idle, as this would remove
> -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
> use cases would likely remain.

So I'm thinking the scheme I outlined gets you most if not all of what
lazy would get you without having to add the lazy thing. A CPU is never
refused deep idle when it passes off the callbacks.

The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term'
and do our utmost bestest to move work away from it. You *want* to break
affinity at this point.

If you hate on the global, push it to a per rcu_node offload list until
the whole node is idle and then push it up the next rcu_node level until
you reach the top.

Then when the top rcu_node is full idle; you can insta progress the QS
state and run the callbacks and go idle.
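
In completely made-up pseudo-code (rdp->mynode and rnp->parent are real
fields; ->offload_list, splice_cbs(), node_fully_idle() and
advance_qs_and_run_cbs() are not), the shape would be:

static void nohz_offload_to_node(struct rcu_data *rdp)
{
	struct rcu_node *rnp = rdp->mynode;

	/* Park this CPU's callbacks on its leaf rcu_node. */
	splice_cbs(&rnp->offload_list, &rdp->cblist);

	/* Each time a whole node goes idle, push its list one level up. */
	while (node_fully_idle(rnp) && rnp->parent) {
		splice_cbs(&rnp->parent->offload_list, &rnp->offload_list);
		rnp = rnp->parent;
	}

	/* Root fully idle: advance the QS machinery, run the callbacks,
	 * then go (NOHZ) idle. */
	if (!rnp->parent && node_fully_idle(rnp))
		advance_qs_and_run_cbs(rnp);
}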

2022-09-16 18:27:42

by Joel Fernandes

Subject: Re: RCU vs NOHZ

Hi Peter,

On Fri, Sep 16, 2022 at 5:20 AM Peter Zijlstra <[email protected]> wrote:
[...]
> > It wasn't enabled for ChromeOS.
> >
> > When fully enabled, it gave them the energy-efficiency advantages Joel
> > described. And then Joel described some additional call_rcu_lazy()
> > changes that provided even better energy efficiency. Though I believe
> > that the application should also be changed to avoid incessantly opening
> > and closing that file while the device is idle, as this would remove
> > -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
> > use cases would likely remain.
>
> So I'm thinking the scheme I outlined gets you most if not all of what
> lazy would get you without having to add the lazy thing. A CPU is never
> refused deep idle when it passes off the callbacks.
>
> The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term'
> and do our utmost bestest to move work away from it. You *want* to break
> affinity at this point.
>
> If you hate on the global, push it to a per rcu_node offload list until
> the whole node is idle and then push it up the next rcu_node level until
> you reach the top.
>
> Then when the top rcu_node is full idle; you can insta progress the QS
> state and run the callbacks and go idle.

In my opinion the speed brakes have to be applied before the GP and
other threads are even awakened. The issue Android and ChromeOS
observe is that even a single CB queued every few jiffies can cause
work that could otherwise be delayed or batched to be scheduled in. I am
not sure if your suggestion above addresses that. Does it?

Try this experiment on your ADL system (for fun). Boot to the login
screen on any distro, and before logging in, run turbostat over ssh
and observe PC8 percent residencies. Now increase
jiffies_till_first_fqs boot parameter value to 64 or so and try again.
You may be surprised how much the PC8 percentage increases by delaying RCU
and batching callbacks (via the jiffies boot option). Admittedly this is
more amplified on ADL because of package C-states, firmware and
whatnot, and isn’t as much of a problem on Android; but it still gives a nice
power improvement there.

thanks,

- Joel

2022-09-17 13:56:47

by Peter Zijlstra

Subject: Re: RCU vs NOHZ

On Fri, Sep 16, 2022 at 02:11:10PM -0400, Joel Fernandes wrote:
> Hi Peter,
>
> On Fri, Sep 16, 2022 at 5:20 AM Peter Zijlstra <[email protected]> wrote:
> [...]
> > > It wasn't enabled for ChromeOS.
> > >
> > > When fully enabled, it gave them the energy-efficiency advantages Joel
> > > described. And then Joel described some additional call_rcu_lazy()
> > > changes that provided even better energy efficiency. Though I believe
> > > that the application should also be changed to avoid incessantly opening
> > > and closing that file while the device is idle, as this would remove
> > > -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
> > > use cases would likely remain.
> >
> > So I'm thinking the scheme I outlined gets you most if not all of what
> > lazy would get you without having to add the lazy thing. A CPU is never
> > refused deep idle when it passes off the callbacks.
> >
> > The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term'
> > and do our utmost bestest to move work away from it. You *want* to break
> > affinity at this point.
> >
> > If you hate on the global, push it to a per rcu_node offload list until
> > the whole node is idle and then push it up the next rcu_node level until
> > you reach the top.
> >
> > Then when the top rcu_node is full idle; you can insta progress the QS
> > state and run the callbacks and go idle.
>
> In my opinion the speed brakes have to be applied before the GP and
> other threads are even awakened. The issue Android and ChromeOS
> observe is that even a single CB queued every few jiffies can cause
> work that could otherwise be delayed or batched to be scheduled in. I am
> not sure if your suggestion above addresses that. Does it?

Scheduled how? Are these callbacks doing queue_work() or something?

Anyway; the thinking is that by passing off the callbacks on NOHZ, the
idle CPUs stay idle. By running the callbacks before going full idle,
all work is done and you can stay idle longer.

> Try this experiment on your ADL system (for fun). Boot to the login
> screen on any distro,

All my dev boxes are headless :-) I don't think the ADL even has X or
wayland installed.

> and before logging in, run turbostat over ssh
> and observe PC8 percent residencies. Now increase
> jiffies_till_first_fqs boot parameter value to 64 or so and try again.
> You may be surprised how much the PC8 percentage increases by delaying RCU
> and batching callbacks (via the jiffies boot option). Admittedly this is
> more amplified on ADL because of package C-states, firmware and
> whatnot, and isn’t as much of a problem on Android; but it still gives a nice
> power improvement there.

I can try; but as of now turbostat doesn't seem to work on that thing at
all. I think localyesconfig might've stripped a required bit. I'll poke
at it later.

2022-09-17 14:48:45

by Paul E. McKenney

Subject: Re: RCU vs NOHZ

On Sat, Sep 17, 2022 at 09:52:49AM -0400, Joel Fernandes wrote:
> On 9/17/2022 9:35 AM, Peter Zijlstra wrote:
> > On Fri, Sep 16, 2022 at 02:11:10PM -0400, Joel Fernandes wrote:
> >> Hi Peter,
> >>
> >> On Fri, Sep 16, 2022 at 5:20 AM Peter Zijlstra <[email protected]> wrote:
> >> [...]
> >>>> It wasn't enabled for ChromeOS.
> >>>>
> >>>> When fully enabled, it gave them the energy-efficiency advantages Joel
> >>>> described. And then Joel described some additional call_rcu_lazy()
> >>>> changes that provided even better energy efficiency. Though I believe
> >>>> that the application should also be changed to avoid incessantly opening
> >>>> and closing that file while the device is idle, as this would remove
> >>>> -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
> >>>> use cases would likely remain.
> >>>
> >>> So I'm thinking the scheme I outlined gets you most if not all of what
> >>> lazy would get you without having to add the lazy thing. A CPU is never
> >>> refused deep idle when it passes off the callbacks.
> >>>
> >>> The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term'
> >>> and do our utmost bestest to move work away from it. You *want* to break
> >>> affinity at this point.
> >>>
> >>> If you hate on the global, push it to a per rcu_node offload list until
> >>> the whole node is idle and then push it up the next rcu_node level until
> >>> you reach the top.
> >>>
> >>> Then when the top rcu_node is full idle; you can insta progress the QS
> >>> state and run the callbacks and go idle.
> >>
> >> In my opinion the speed brakes have to be applied before the GP and
> >> other threads are even awakened. The issue Android and ChromeOS
> >> observe is that even a single CB queued every few jiffies can cause
> >> work that could otherwise be delayed or batched to be scheduled in. I am
> >> not sure if your suggestion above addresses that. Does it?
> >
> > Scheduled how? Are these callbacks doing queue_work() or something?
>
> Way before the callback is even ready to execute, you can have rcuog, rcuop,
> and rcu_preempt threads running to go through the grace period state machine.
>
> > Anyway; the thinking is that by passing off the callbacks on NOHZ, the
> > idle CPUs stay idle. By running the callbacks before going full idle,
> > all work is done and you can stay idle longer.
>
> But all CPUs being idle does not mean the grace period is over; you can have a
> task (at least on PREEMPT_RT) block in the middle of an RCU read-side critical
> section and then all CPUs go idle.
>
> Other than that, a typical flow could look like:
>
> 1. CPU queues a callback.
> 2. CPU then goes idle.
> 3. Another CPU is running the RCU threads waking up otherwise idle CPUs.
> 4. Grace period completes and an RCU thread runs a callback.
>
> >> Try this experiment on your ADL system (for fun). Boot to the login
> >> screen on any distro,
> >
> > All my dev boxes are headless :-) I don't think the ADL even has X or
> > wayland installed.
>
> Ah, OK. Maybe what you have (like daemons) is already requesting RCU for
> something. The Android folks had some logger requesting RCU all the time.
>
> >> and before logging in, run turbostat over ssh
> >> and observe PC8 percent residencies. Now increase
> >> jiffies_till_first_fqs boot parameter value to 64 or so and try again.
> >> You may be surprised how much the PC8 percentage increases by delaying RCU
> >> and batching callbacks (via the jiffies boot option). Admittedly this is
> >> more amplified on ADL because of package C-states, firmware and
> >> whatnot, and isn’t as much of a problem on Android; but it still gives a nice
> >> power improvement there.
> >
> > I can try; but as of now turbostat doesn't seem to work on that thing at
> > all. I think localyesconfig might've stripped a required bit. I'll poke
> > at it later.
>
> Cool! I believe Len Brown can help on that, or maybe there is another way you
> can read the counters to figure out the PC8% and RAPL power.

Whatever the evaluation scheme, it absolutely -must- measure real power
consumed by real hardware running some real-world workload compared to
Joel et al.'s scheme, or I will cheerfully ignore it. ;-)

Thanx, Paul

2022-09-17 14:50:57

by Joel Fernandes

Subject: Re: RCU vs NOHZ



On 9/17/2022 9:35 AM, Peter Zijlstra wrote:
> On Fri, Sep 16, 2022 at 02:11:10PM -0400, Joel Fernandes wrote:
>> Hi Peter,
>>
>> On Fri, Sep 16, 2022 at 5:20 AM Peter Zijlstra <[email protected]> wrote:
>> [...]
>>>> It wasn't enabled for ChromeOS.
>>>>
>>>> When fully enabled, it gave them the energy-efficiency advantages Joel
>>>> described. And then Joel described some additional call_rcu_lazy()
>>>> changes that provided even better energy efficiency. Though I believe
>>>> that the application should also be changed to avoid incessantly opening
>>>> and closing that file while the device is idle, as this would remove
>>>> -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
>>>> use cases would likely remain.
>>>
>>> So I'm thinking the scheme I outlined gets you most if not all of what
>>> lazy would get you without having to add the lazy thing. A CPU is never
>>> refused deep idle when it passes off the callbacks.
>>>
>>> The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term'
>>> and do our utmost bestest to move work away from it. You *want* to break
>>> affinity at this point.
>>>
>>> If you hate on the global, push it to a per rcu_node offload list until
>>> the whole node is idle and then push it up the next rcu_node level until
>>> you reach the top.
>>>
>>> Then when the top rcu_node is full idle; you can insta progress the QS
>>> state and run the callbacks and go idle.
>>
>> In my opinion the speed brakes have to be applied before the GP and
>> other threads are even awakened. The issue Android and ChromeOS
>> observe is that even a single CB queued every few jiffies can cause
>> work that could otherwise be delayed or batched to be scheduled in. I am
>> not sure if your suggestion above addresses that. Does it?
>
> Scheduled how? Are these callbacks doing queue_work() or something?

Way before the callback is even ready to execute, you can have rcuog, rcuop,
and rcu_preempt threads running to go through the grace period state machine.

> Anyway; the thinking is that by passing off the callbacks on NOHZ, the
> idle CPUs stay idle. By running the callbacks before going full idle,
> all work is done and you can stay idle longer.

But all CPUs being idle does not mean the grace period is over; you can have a
task (at least on PREEMPT_RT) block in the middle of an RCU read-side critical
section and then all CPUs go idle.
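
For instance (illustrative only; gp, foo_lock and do_something() are
made up):

rcu_read_lock();		/* preemptible RCU reader */
p = rcu_dereference(gp);	/* some RCU-protected pointer */
spin_lock(&foo_lock);		/* on PREEMPT_RT this may sleep, blocking
				 * the reader mid-critical-section... */
do_something(p);
spin_unlock(&foo_lock);
rcu_read_unlock();		/* ...so the GP cannot end before here,
				 * no matter how idle the CPUs are */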

Other than that, a typical flow could look like:

1. CPU queues a callback.
2. CPU then goes idle.
3. Another CPU is running the RCU threads waking up otherwise idle CPUs.
4. Grace period completes and an RCU thread runs a callback.

>> Try this experiment on your ADL system (for fun). Boot to the login
>> screen on any distro,
>
> All my dev boxes are headless :-) I don't think the ADL even has X or
> wayland installed.

Ah, OK. Maybe what you have (like daemons) is already requesting RCU for
something. The Android folks had some logger requesting RCU all the time.

>> and before logging in, run turbostat over ssh
>> and observe PC8 percent residencies. Now increase
>> jiffies_till_first_fqs boot parameter value to 64 or so and try again.
>> You may be surprised how much the PC8 percentage increases by delaying RCU
>> and batching callbacks (via the jiffies boot option). Admittedly this is
>> more amplified on ADL because of package C-states, firmware and
>> whatnot, and isn’t as much of a problem on Android; but it still gives a nice
>> power improvement there.
>
> I can try; but as of now turbostat doesn't seem to work on that thing at
> all. I think localyesconfig might've stripped a required bit. I'll poke
> at it later.

Cool! I believe Len Brown can help on that, or maybe there is another way you
can read the counters to figure out the PC8% and RAPL power.

thanks,

- Joel


2022-09-17 15:34:51

by Paul E. McKenney

Subject: Re: RCU vs NOHZ

On Fri, Sep 16, 2022 at 11:20:14AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 16, 2022 at 12:58:17AM -0700, Paul E. McKenney wrote:
>
> > To the best of my knowledge at this point in time, agreed. Who knows
> > what someone will come up with next week? But for people running certain
> > types of real-time and HPC workloads, context tracking really does handle
> > both idle and userspace transitions.
>
> Sure, but idle != nohz. Nohz is where we disable the tick, and currently
> RCU can inhibit this -- rcu_needs_cpu().

Exactly. For non-nohz userspace execution, the tick is still running
anyway, so RCU of course won't be inhibiting its disabling. And in that
case, RCU's hook is the tick interrupt itself. RCU's hook is passed a
flag saying whether the interrupt came from userspace or from kernel.

> AFAICT there really isn't an RCU hook for this, not through context
> tracking nor through anything else.

There is a directly invoked RCU hook for any transition that enables or
disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
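
For instance, the renaming maps (among others):

	rcu_idle_enter()  ->  ct_idle_enter()
	rcu_idle_exit()   ->  ct_idle_exit()
	rcu_irq_enter()   ->  ct_irq_enter()
	rcu_irq_exit()    ->  ct_irq_exit()
	rcu_nmi_enter()   ->  ct_nmi_enter()
	rcu_nmi_exit()    ->  ct_nmi_exit()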

> > It wasn't enabled for ChromeOS.
> >
> > When fully enabled, it gave them the energy-efficiency advantages Joel
> > described. And then Joel described some additional call_rcu_lazy()
> > changes that provided even better energy efficiency. Though I believe
> > that the application should also be changed to avoid incessantly opening
> > and closing that file while the device is idle, as this would remove
> > -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
> > use cases would likely remain.
>
> So I'm thinking the scheme I outlined gets you most if not all of what
> lazy would get you without having to add the lazy thing. A CPU is never
> refused deep idle when it passes off the callbacks.
>
> The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term'
> and do our utmost bestest to move work away from it. You *want* to break
> affinity at this point.
>
> If you hate on the global, push it to a per rcu_node offload list until
> the whole node is idle and then push it up the next rcu_node level until
> you reach the top.
>
> Then when the top rcu_node is full idle; you can insta progress the QS
> state and run the callbacks and go idle.

Unfortunately, the overhead of doing all that tracking along with
resolving all the resulting race conditions will -increase- power
consumption. With RCU, counting CPU wakeups is not as good a predictor
of power consumption as one might like. Sure, it is a nice heuristic
in some cases, but with RCU it is absolutely -not- a replacement for
actually measuring power consumption on real hardware. And yes, I did
learn this the hard way. Why do you ask? ;-)

And that is why the recently removed CONFIG_RCU_FAST_NO_HZ left the
callbacks in place and substituted a 4x slower timer for the tick.
-That- actually resulted in significant real measured power savings on
real hardware.

Except that everything that was building with CONFIG_RCU_FAST_NO_HZ
was also doing nohz_full on each and every CPU. Which meant that all
that CONFIG_RCU_FAST_NO_HZ was doing for them was adding an additional
useless check on each transition to and from idle. Which in turn is why
CONFIG_RCU_FAST_NO_HZ was removed. No one was using it in any way that
made any sense.

And more recent testing with rcu_nocbs on both ChromeOS and Android has
produced better savings than was produced by CONFIG_RCU_FAST_NO_HZ anyway.

Much of the additional savings from Joel et al.'s work is not so much
from reducing the number of ticks, but rather from reducing the number
of grace periods, which are of course much heavier weight.

And this of course means that any additional schemes to reduce RCU's
power consumption must be compared (with real measurements on real
hardware!) to Joel et al.'s work, whether in combination or as an
alternative. And either way, the power savings must of course justify
the added code and complexity.

Thanx, Paul

2022-09-21 21:52:12

by Paul E. McKenney

Subject: Re: RCU vs NOHZ

On Sat, Sep 17, 2022 at 07:25:08AM -0700, Paul E. McKenney wrote:
> On Fri, Sep 16, 2022 at 11:20:14AM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 16, 2022 at 12:58:17AM -0700, Paul E. McKenney wrote:
> >
> > > To the best of my knowledge at this point in time, agreed. Who knows
> > > what someone will come up with next week? But for people running certain
> > > types of real-time and HPC workloads, context tracking really does handle
> > > both idle and userspace transitions.
> >
> > Sure, but idle != nohz. Nohz is where we disable the tick, and currently
> > RCU can inhibit this -- rcu_needs_cpu().
>
> Exactly. For non-nohz userspace execution, the tick is still running
> anyway, so RCU of course won't be inhibiting its disabling. And in that
> case, RCU's hook is the tick interrupt itself. RCU's hook is passed a
> flag saying whether the interrupt came from userspace or from kernel.
>
> > AFAICT there really isn't an RCU hook for this, not through context
> > tracking nor through anything else.
>
> There is a directly invoked RCU hook for any transition that enables or
> disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
> that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
>
> > > It wasn't enabled for ChromeOS.
> > >
> > > When fully enabled, it gave them the energy-efficiency advantages Joel
> > > described. And then Joel described some additional call_rcu_lazy()
> > > changes that provided even better energy efficiency. Though I believe
> > > that the application should also be changed to avoid incessantly opening
> > > and closing that file while the device is idle, as this would remove
> > > -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
> > > use cases would likely remain.
> >
> > So I'm thinking the scheme I outlined gets you most if not all of what
> > lazy would get you without having to add the lazy thing. A CPU is never
> > refused deep idle when it passes off the callbacks.
> >
> > The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term'
> > and do our utmost bestest to move work away from it. You *want* to break
> > affinity at this point.
> >
> > If you hate on the global, push it to a per rcu_node offload list until
> > the whole node is idle and then push it up the next rcu_node level until
> > you reach the top.
> >
> > Then when the top rcu_node is full idle; you can insta progress the QS
> > state and run the callbacks and go idle.
>
> Unfortunately, the overhead of doing all that tracking along with
> resolving all the resulting race conditions will -increase- power
> consumption. With RCU, counting CPU wakeups is not as good a predictor
> of power consumption as one might like. Sure, it is a nice heuristic
> in some cases, but with RCU it is absolutely -not- a replacement for
> actually measuring power consumption on real hardware. And yes, I did
> learn this the hard way. Why do you ask? ;-)
>
> And that is why the recently removed CONFIG_RCU_FAST_NO_HZ left the
> callbacks in place and substituted a 4x slower timer for the tick.
> -That- actually resulted in significant real measured power savings on
> real hardware.
>
> Except that everything that was building with CONFIG_RCU_FAST_NO_HZ
> was also doing nohz_full on each and every CPU. Which meant that all
> that CONFIG_RCU_FAST_NO_HZ was doing for them was adding an additional
> useless check on each transition to and from idle. Which in turn is why
> CONFIG_RCU_FAST_NO_HZ was removed. No one was using it in any way that
> made any sense.
>
> And more recent testing with rcu_nocbs on both ChromeOS and Android has
> produced better savings than was produced by CONFIG_RCU_FAST_NO_HZ anyway.
>
> Much of the additional savings from Joel et al.'s work is not so much
> from reducing the number of ticks, but rather from reducing the number
> of grace periods, which are of course much heavier weight.
>
> And this of course means that any additional schemes to reduce RCU's
> power consumption must be compared (with real measurements on real
> hardware!) to Joel et al.'s work, whether in combination or as an
> alternative. And either way, the power savings must of course justify
> the added code and complexity.

And here is an untested patch that in theory might allow much of the
reduction in power with minimal complexity/overhead for kernels without
rcu_nocbs CPUs. On the off-chance you know of someone who would be
willing to do a realistic evaluation of it.

Thanx, Paul

------------------------------------------------------------------------

commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
Author: Paul E. McKenney <[email protected]>
Date: Wed Sep 21 13:30:24 2022 -0700

rcu: Let non-offloaded idle CPUs with callbacks defer tick

When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
not RCU needs the scheduler-clock tick to keep interrupting. Right now,
RCU keeps the tick on for a given idle CPU if there are any non-offloaded
callbacks queued on that CPU.

But if all of these callbacks are waiting for a grace period to finish,
there is no point in scheduling a tick before that grace period has any
reasonable chance of completing. This commit therefore delays the tick
in the case where all the callbacks are waiting for a specific grace
period to elapse. In theory, this should result in a 50-70% reduction in
RCU-induced scheduling-clock ticks on mostly-idle CPUs. In practice, TBD.

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 9bc025aa79a3..84e930c11065 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
 		rcu_tasks_qs(current, (preempt)); \
 	} while (0)
 
-static inline int rcu_needs_cpu(void)
+static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
 {
 	return 0;
 }
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 70795386b9ff..3066e0975022 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -19,7 +19,7 @@
 
 void rcu_softirq_qs(void);
 void rcu_note_context_switch(bool preempt);
-int rcu_needs_cpu(void);
+int rcu_needs_cpu(u64 basemono, u64 *nextevt);
 void rcu_cpu_stall_reset(void);
 
 /*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5ec97e3f7468..47cd3b0d2a07 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
  * scheduler-clock interrupt.
  *
  * Just check whether or not this CPU has non-offloaded RCU callbacks
- * queued.
+ * queued that need immediate attention.
  */
-int rcu_needs_cpu(void)
+int rcu_needs_cpu(u64 basemono, u64 *nextevt)
 {
-	return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
-	       !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
+	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+	struct rcu_segcblist *rsclp = &rdp->cblist;
+
+	// Disabled, empty, or offloaded means nothing to do.
+	if (!rcu_segcblist_is_enabled(rsclp) ||
+	    rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
+		*nextevt = KTIME_MAX;
+		return 0;
+	}
+
+	// Callbacks ready to invoke or that have not already been
+	// assigned a grace period need immediate attention.
+	if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
+	    !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
+		return 1;
+
+	// There are callbacks waiting for some later grace period.
+	// Wait for about a grace period or two for the next tick, at which
+	// point there is high probability that this CPU will need to do some
+	// work for RCU.
+	*nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) +
+					   READ_ONCE(jiffies_till_next_fqs) + 1);
+	return 0;
 }
 
 /*
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b0e3c9205946..303ea15cdb96 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -784,7 +784,7 @@ static inline bool local_timer_softirq_pending(void)
 
 static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 {
-	u64 basemono, next_tick, delta, expires;
+	u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
 	unsigned long basejiff;
 	unsigned int seq;
 
@@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 	 * minimal delta which brings us back to this place
 	 * immediately. Lather, rinse and repeat...
 	 */
-	if (rcu_needs_cpu() || arch_needs_cpu() ||
+	if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
 	    irq_work_needs_cpu() || local_timer_softirq_pending()) {
 		next_tick = basemono + TICK_NSEC;
 	} else {
@@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
 		 * disabled this also looks at the next expiring
 		 * hrtimer.
 		 */
-		next_tick = get_next_timer_interrupt(basejiff, basemono);
-		ts->next_timer = next_tick;
+		next_tmr = get_next_timer_interrupt(basejiff, basemono);
+		ts->next_timer = next_tmr;
+		/* Take the next rcu event into account */
+		next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
 	}
 
 	/*

2022-09-23 15:40:39

by Joel Fernandes

Subject: Re: RCU vs NOHZ



On 9/21/2022 5:36 PM, Paul E. McKenney wrote:
> On Sat, Sep 17, 2022 at 07:25:08AM -0700, Paul E. McKenney wrote:
>> On Fri, Sep 16, 2022 at 11:20:14AM +0200, Peter Zijlstra wrote:
>>> On Fri, Sep 16, 2022 at 12:58:17AM -0700, Paul E. McKenney wrote:
>>>
>>>> To the best of my knowledge at this point in time, agreed. Who knows
>>>> what someone will come up with next week? But for people running certain
>>>> types of real-time and HPC workloads, context tracking really does handle
>>>> both idle and userspace transitions.
>>>
>>> Sure, but idle != nohz. Nohz is where we disable the tick, and currently
>>> RCU can inhibit this -- rcu_needs_cpu().
>>
>> Exactly. For non-nohz userspace execution, the tick is still running
>> anyway, so RCU of course won't be inhibiting its disabling. And in that
>> case, RCU's hook is the tick interrupt itself. RCU's hook is passed a
>> flag saying whether the interrupt came from userspace or from kernel.
>>
>>> AFAICT there really isn't an RCU hook for this, not through context
>>> tracking nor through anything else.
>>
>> There is a directly invoked RCU hook for any transition that enables or
>> disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
>> that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
>>
>>>> It wasn't enabled for ChromeOS.
>>>>
>>>> When fully enabled, it gave them the energy-efficiency advantages Joel
>>>> described. And then Joel described some additional call_rcu_lazy()
>>>> changes that provided even better energy efficiency. Though I believe
>>>> that the application should also be changed to avoid incessantly opening
>>>> and closing that file while the device is idle, as this would remove
>>>> -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
>>>> use cases would likely remain.
>>>
>>> So I'm thinking the scheme I outlined gets you most if not all of what
>>> lazy would get you without having to add the lazy thing. A CPU is never
>>> refused deep idle when it passes off the callbacks.
>>>
>>> The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term'
>>> and do our utmost bestest to move work away from it. You *want* to break
>>> affinity at this point.
>>>
>>> If you hate on the global, push it to a per rcu_node offload list until
>>> the whole node is idle and then push it up the next rcu_node level until
>>> you reach the top.
>>>
>>> Then when the top rcu_node is full idle; you can insta progress the QS
>>> state and run the callbacks and go idle.
>>
>> Unfortunately, the overhead of doing all that tracking along with
>> resolving all the resulting race conditions will -increase- power
>> consumption. With RCU, counting CPU wakeups is not as good a predictor
>> of power consumption as one might like. Sure, it is a nice heuristic
>> in some cases, but with RCU it is absolutely -not- a replacement for
>> actually measuring power consumption on real hardware. And yes, I did
>> learn this the hard way. Why do you ask? ;-)
>>
>> And that is why the recently removed CONFIG_RCU_FAST_NO_HZ left the
>> callbacks in place and substituted a 4x slower timer for the tick.
>> -That- actually resulted in significant real measured power savings on
>> real hardware.
>>
>> Except that everything that was building with CONFIG_RCU_FAST_NO_HZ
>> was also doing nohz_full on each and every CPU. Which meant that all
>> that CONFIG_RCU_FAST_NO_HZ was doing for them was adding an additional
>> useless check on each transition to and from idle. Which in turn is why
>> CONFIG_RCU_FAST_NO_HZ was removed. No one was using it in any way that
>> made any sense.
>>
>> And more recent testing with rcu_nocbs on both ChromeOS and Android has
>> produced better savings than was produced by CONFIG_RCU_FAST_NO_HZ anyway.
>>
>> Much of the additional savings from Joel et al.'s work is not so much
>> from reducing the number of ticks, but rather from reducing the number
>> of grace periods, which are of course much heavier weight.
>>
>> And this of course means that any additional schemes to reduce RCU's
>> power consumption must be compared (with real measurements on real
>> hardware!) to Joel et al.'s work, whether in combination or as an
>> alternative. And either way, the power savings must of course justify
>> the added code and complexity.
>
> And here is an untested patch that in theory might allow much of the
> reduction in power with minimal complexity/overhead for kernels without
> rcu_nocbs CPUs. On the off-chance you know of someone who would be
> willing to do a realistic evaluation of it.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
> Author: Paul E. McKenney <[email protected]>
> Date: Wed Sep 21 13:30:24 2022 -0700
>
> rcu: Let non-offloaded idle CPUs with callbacks defer tick
>
> When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
> not RCU needs the scheduler-clock tick to keep interrupting. Right now,
> RCU keeps the tick on for a given idle CPU if there are any non-offloaded
> callbacks queued on that CPU.
>
> But if all of these callbacks are waiting for a grace period to finish,
> there is no point in scheduling a tick before that grace period has any
> reasonable chance of completing. This commit therefore delays the tick
> in the case where all the callbacks are waiting for a specific grace
> period to elapse. In theory, this should result in a 50-70% reduction in
> RCU-induced scheduling-clock ticks on mostly-idle CPUs. In practice, TBD.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
>
> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> index 9bc025aa79a3..84e930c11065 100644
> --- a/include/linux/rcutiny.h
> +++ b/include/linux/rcutiny.h
> @@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
> rcu_tasks_qs(current, (preempt)); \
> } while (0)
>
> -static inline int rcu_needs_cpu(void)
> +static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> {
> return 0;
> }
> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> index 70795386b9ff..3066e0975022 100644
> --- a/include/linux/rcutree.h
> +++ b/include/linux/rcutree.h
> @@ -19,7 +19,7 @@
>
> void rcu_softirq_qs(void);
> void rcu_note_context_switch(bool preempt);
> -int rcu_needs_cpu(void);
> +int rcu_needs_cpu(u64 basemono, u64 *nextevt);
> void rcu_cpu_stall_reset(void);
>
> /*
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 5ec97e3f7468..47cd3b0d2a07 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
> * scheduler-clock interrupt.
> *
> * Just check whether or not this CPU has non-offloaded RCU callbacks
> - * queued.
> + * queued that need immediate attention.
> */
> -int rcu_needs_cpu(void)
> +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> {
> - return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
> - !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
> + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> + struct rcu_segcblist *rsclp = &rdp->cblist;
> +
> + // Disabled, empty, or offloaded means nothing to do.
> + if (!rcu_segcblist_is_enabled(rsclp) ||
> + rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
> + *nextevt = KTIME_MAX;
> + return 0;
> + }
> +
> + // Callbacks ready to invoke or that have not already been
> + // assigned a grace period need immediate attention.
> + if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
> + !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
> + return 1;
> +
> + // There are callbacks waiting for some later grace period.
> + // Wait for about a grace period or two for the next tick, at which
> + // point there is high probability that this CPU will need to do some
> + // work for RCU.
> + *nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) +
> + READ_ONCE(jiffies_till_next_fqs) + 1);

Looks like a nice idea. Could this race with the main GP thread on another CPU
completing the grace period, such that this CPU actually has some work to do
but rcu_needs_cpu() returns 0?

I think that is plausible but not common, in which case the extra delay is
probably OK.

Also, if the RCU readers take a long time, then we'd still wake up the system
periodically, although with the above change much less often, which is a good
thing.

Thanks,

- Joel



> + return 0;
> }
>
> /*
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index b0e3c9205946..303ea15cdb96 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -784,7 +784,7 @@ static inline bool local_timer_softirq_pending(void)
>
> static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> {
> - u64 basemono, next_tick, delta, expires;
> + u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
> unsigned long basejiff;
> unsigned int seq;
>
> @@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> * minimal delta which brings us back to this place
> * immediately. Lather, rinse and repeat...
> */
> - if (rcu_needs_cpu() || arch_needs_cpu() ||
> + if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
> irq_work_needs_cpu() || local_timer_softirq_pending()) {
> next_tick = basemono + TICK_NSEC;
> } else {
> @@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> * disabled this also looks at the next expiring
> * hrtimer.
> */
> - next_tick = get_next_timer_interrupt(basejiff, basemono);
> - ts->next_timer = next_tick;
> + next_tmr = get_next_timer_interrupt(basejiff, basemono);
> + ts->next_timer = next_tmr;
> + /* Take the next rcu event into account */
> + next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> }
>
> /*

2022-09-23 16:07:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Fri, Sep 23, 2022 at 11:12:17AM -0400, Joel Fernandes wrote:
> On 9/21/2022 5:36 PM, Paul E. McKenney wrote:
> > On Sat, Sep 17, 2022 at 07:25:08AM -0700, Paul E. McKenney wrote:
> >> On Fri, Sep 16, 2022 at 11:20:14AM +0200, Peter Zijlstra wrote:
> >>> On Fri, Sep 16, 2022 at 12:58:17AM -0700, Paul E. McKenney wrote:
> >>>
> >>>> To the best of my knowledge at this point in time, agreed. Who knows
> >>>> what someone will come up with next week? But for people running certain
> >>>> types of real-time and HPC workloads, context tracking really does handle
> >>>> both idle and userspace transitions.
> >>>
> >>> Sure, but idle != nohz. Nohz is where we disable the tick, and currently
> >>> RCU can inhibit this -- rcu_needs_cpu().
> >>
> >> Exactly. For non-nohz userspace execution, the tick is still running
> >> anyway, so RCU of course won't be inhibiting its disabling. And in that
> >> case, RCU's hook is the tick interrupt itself. RCU's hook is passed a
> >> flag saying whether the interrupt came from userspace or from kernel.
> >>
> >>> AFAICT there really isn't an RCU hook for this, not through context
> >>> tracking not through anything else.
> >>
> >> There is a directly invoked RCU hook for any transition that enables or
> >> disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
> >> that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
> >>
> >>>> It wasn't enabled for ChromeOS.
> >>>>
> >>>> When fully enabled, it gave them the energy-efficiency advantages Joel
> >>>> described. And then Joel described some additional call_rcu_lazy()
> >>>> changes that provided even better energy efficiency. Though I believe
> >>>> that the application should also be changed to avoid incessantly opening
> >>>> and closing that file while the device is idle, as this would remove
> >>>> -all- RCU work when nearly idle. But some of the other call_rcu_lazy()
> >>>> use cases would likely remain.
> >>>
> >>> So I'm thinking the scheme I outlined gets you most if not all of what
> >>> lazy would get you without having to add the lazy thing. A CPU is never
> >>> refused deep idle when it passes off the callbacks.
> >>>
> >>> The NOHZ thing is a nice hook for 'this-cpu-wants-to-go-idle-long-term'
> >>> and do our utmost bestest to move work away from it. You *want* to break
> >>> affinity at this point.
> >>>
> >>> If you hate on the global, push it to a per rcu_node offload list until
> >>> the whole node is idle and then push it up the next rcu_node level until
> >>> you reach the top.
> >>>
> >>> Then when the top rcu_node is full idle; you can insta progress the QS
> >>> state and run the callbacks and go idle.
> >>
> >> Unfortunately, the overhead of doing all that tracking along with
> >> resolving all the resulting race conditions will -increase- power
> >> consumption. With RCU, counting CPU wakeups is not as good a predictor
> >> of power consumption as one might like. Sure, it is a nice heuristic
> >> in some cases, but with RCU it is absolutely -not- a replacement for
> >> actually measuring power consumption on real hardware. And yes, I did
> >> learn this the hard way. Why do you ask? ;-)
> >>
> >> And that is why the recently removed CONFIG_RCU_FAST_NO_HZ left the
> >> callbacks in place and substituted a 4x slower timer for the tick.
> >> -That- actually resulted in significant real measured power savings on
> >> real hardware.
> >>
> >> Except that everything that was building with CONFIG_RCU_FAST_NO_HZ
> >> was also doing nohz_full on each and every CPU. Which meant that all
> >> that CONFIG_RCU_FAST_NO_HZ was doing for them was adding an additional
> >> useless check on each transition to and from idle. Which in turn is why
> >> CONFIG_RCU_FAST_NO_HZ was removed. No one was using it in any way that
> >> made any sense.
> >>
> >> And more recent testing with rcu_nocbs on both ChromeOS and Android has
> >> produced better savings than was produced by CONFIG_RCU_FAST_NO_HZ anyway.
> >>
> >> Much of the additional savings from Joel et al.'s work is not so much
> >> from reducing the number of ticks, but rather from reducing the number
> >> of grace periods, which are of course much heavier weight.
> >>
> >> And this of course means that any additional schemes to reduce RCU's
> >> power consumption must be compared (with real measurements on real
> >> hardware!) to Joel et al.'s work, whether in combination or as an
> >> alternative. And either way, the power savings must of course justify
> >> the added code and complexity.
> >
> > And here is an untested patch that in theory might allow much of the
> > reduction in power with minimal complexity/overhead for kernels without
> > rcu_nocbs CPUs. On the off-chance you know of someone who would be
> > willing to do a realistic evaluation of it.
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
> > Author: Paul E. McKenney <[email protected]>
> > Date: Wed Sep 21 13:30:24 2022 -0700
> >
> > rcu: Let non-offloaded idle CPUs with callbacks defer tick
> >
> > When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
> > not RCU needs the scheduler-clock tick to keep interrupting. Right now,
> > RCU keeps the tick on for a given idle CPU if there are any non-offloaded
> > callbacks queued on that CPU.
> >
> > But if all of these callbacks are waiting for a grace period to finish,
> > there is no point in scheduling a tick before that grace period has any
> > reasonable chance of completing. This commit therefore delays the tick
> > in the case where all the callbacks are waiting for a specific grace
> > period to elapse. In theory, this should result in a 50-70% reduction in
> > RCU-induced scheduling-clock ticks on mostly-idle CPUs. In practice, TBD.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> >
> > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> > index 9bc025aa79a3..84e930c11065 100644
> > --- a/include/linux/rcutiny.h
> > +++ b/include/linux/rcutiny.h
> > @@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
> > rcu_tasks_qs(current, (preempt)); \
> > } while (0)
> >
> > -static inline int rcu_needs_cpu(void)
> > +static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> > {
> > return 0;
> > }
> > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> > index 70795386b9ff..3066e0975022 100644
> > --- a/include/linux/rcutree.h
> > +++ b/include/linux/rcutree.h
> > @@ -19,7 +19,7 @@
> >
> > void rcu_softirq_qs(void);
> > void rcu_note_context_switch(bool preempt);
> > -int rcu_needs_cpu(void);
> > +int rcu_needs_cpu(u64 basemono, u64 *nextevt);
> > void rcu_cpu_stall_reset(void);
> >
> > /*
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 5ec97e3f7468..47cd3b0d2a07 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
> > * scheduler-clock interrupt.
> > *
> > * Just check whether or not this CPU has non-offloaded RCU callbacks
> > - * queued.
> > + * queued that need immediate attention.
> > */
> > -int rcu_needs_cpu(void)
> > +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> > {
> > - return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
> > - !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
> > + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > + struct rcu_segcblist *rsclp = &rdp->cblist;
> > +
> > + // Disabled, empty, or offloaded means nothing to do.
> > + if (!rcu_segcblist_is_enabled(rsclp) ||
> > + rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
> > + *nextevt = KTIME_MAX;
> > + return 0;
> > + }
> > +
> > + // Callbacks ready to invoke or that have not already been
> > + // assigned a grace period need immediate attention.
> > + if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
> > + !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
> > + return 1;
> > +
> > + // There are callbacks waiting for some later grace period.
> > + // Wait for about a grace period or two for the next tick, at which
> > + // point there is high probability that this CPU will need to do some
> > + // work for RCU.
> > + *nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) +
> > + READ_ONCE(jiffies_till_next_fqs) + 1);
>
> Looks like a nice idea. Could this race with the main GP thread on another CPU
> completing the grace period, so that there is actually some work to do on this
> CPU but rcu_needs_cpu() returns 0?
>
> I think it is plausible but not common, in which case the extra delay is
> probably OK.

Glad you like it!

Yes, that race can happen, but it can also happen today.
A scheduling-clock interrupt might arrive at a CPU just as a grace
period finishes. Yes, the delay is longer with this patch. If this
proves to be a problem, then the delay heuristic might be expanded to
include the age of the current grace period.

But keeping it simple to start with.
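
For concreteness, a rough sketch of what such an age-based heuristic
might look like inside rcu_needs_cpu() -- purely illustrative, and
assuming rcu_state.gp_start is a good-enough proxy for the age of the
current grace period:

	// Sketch only: shorten the deferral by the age of the current
	// grace period, so that an already-old GP gets a tick sooner.
	unsigned long gp_age = jiffies - READ_ONCE(rcu_state.gp_start);
	unsigned long jwait = READ_ONCE(jiffies_till_first_fqs) +
			      READ_ONCE(jiffies_till_next_fqs) + 1;

	if (gp_age >= jwait)
		return 1;	// GP is already overdue, keep the tick on.
	*nextevt = basemono + TICK_NSEC * (jwait - gp_age);
	return 0;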

> Also, if the RCU readers take a long time, then we'd still wake up the system
> periodically although with the above change, much fewer times, which is a good
> thing.

And the delay heuristic could also be expanded to include a digitally
filtered estimate of grace-period duration. But again, keeping it simple
to start with. ;-)

My guess is that offloading gets you more power savings, but I don't
have a good way of testing this guess.

Thanx, Paul

> Thanks,
>
> - Joel
>
>
>
> > + return 0;
> > }
> >
> > /*
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index b0e3c9205946..303ea15cdb96 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -784,7 +784,7 @@ static inline bool local_timer_softirq_pending(void)
> >
> > static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> > {
> > - u64 basemono, next_tick, delta, expires;
> > + u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
> > unsigned long basejiff;
> > unsigned int seq;
> >
> > @@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> > * minimal delta which brings us back to this place
> > * immediately. Lather, rinse and repeat...
> > */
> > - if (rcu_needs_cpu() || arch_needs_cpu() ||
> > + if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
> > irq_work_needs_cpu() || local_timer_softirq_pending()) {
> > next_tick = basemono + TICK_NSEC;
> > } else {
> > @@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> > * disabled this also looks at the next expiring
> > * hrtimer.
> > */
> > - next_tick = get_next_timer_interrupt(basejiff, basemono);
> > - ts->next_timer = next_tick;
> > + next_tmr = get_next_timer_interrupt(basejiff, basemono);
> > + ts->next_timer = next_tmr;
> > + /* Take the next rcu event into account */
> > + next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> > }
> >
> > /*

2022-09-23 18:42:33

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Fri, Sep 23, 2022 at 01:47:40PM -0400, Joel Fernandes wrote:
> On Fri, Sep 23, 2022 at 12:01 PM Paul E. McKenney <[email protected]> wrote:
> [...]
> > > > And here is an untested patch that in theory might allow much of the
> > > > reduction in power with minimal complexity/overhead for kernels without
> > > > rcu_nocbs CPUs. On the off-chance you know of someone who would be
> > > > willing to do a realistic evaluation of it.
> > > >
> > > > Thanx, Paul
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
> > > > Author: Paul E. McKenney <[email protected]>
> > > > Date: Wed Sep 21 13:30:24 2022 -0700
> > > >
> > > > rcu: Let non-offloaded idle CPUs with callbacks defer tick
> > > >
> > > > When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
> > > > not RCU needs the scheduler-clock tick to keep interrupting. Right now,
> > > > RCU keeps the tick on for a given idle CPU if there are any non-offloaded
> > > > callbacks queued on that CPU.
> > > >
> > > > But if all of these callbacks are waiting for a grace period to finish,
> > > > there is no point in scheduling a tick before that grace period has any
> > > > reasonable chance of completing. This commit therefore delays the tick
> > > > in the case where all the callbacks are waiting for a specific grace
> > > > period to elapse. In theory, this should result in a 50-70% reduction in
> > > > RCU-induced scheduling-clock ticks on mostly-idle CPUs. In practice, TBD.
> > > >
> > > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > > Cc: Peter Zijlstra <[email protected]>
> > > >
> > > > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> > > > index 9bc025aa79a3..84e930c11065 100644
> > > > --- a/include/linux/rcutiny.h
> > > > +++ b/include/linux/rcutiny.h
> > > > @@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
> > > > rcu_tasks_qs(current, (preempt)); \
> > > > } while (0)
> > > >
> > > > -static inline int rcu_needs_cpu(void)
> > > > +static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> > > > {
> > > > return 0;
> > > > }
> > > > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> > > > index 70795386b9ff..3066e0975022 100644
> > > > --- a/include/linux/rcutree.h
> > > > +++ b/include/linux/rcutree.h
> > > > @@ -19,7 +19,7 @@
> > > >
> > > > void rcu_softirq_qs(void);
> > > > void rcu_note_context_switch(bool preempt);
> > > > -int rcu_needs_cpu(void);
> > > > +int rcu_needs_cpu(u64 basemono, u64 *nextevt);
> > > > void rcu_cpu_stall_reset(void);
> > > >
> > > > /*
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 5ec97e3f7468..47cd3b0d2a07 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
> > > > * scheduler-clock interrupt.
> > > > *
> > > > * Just check whether or not this CPU has non-offloaded RCU callbacks
> > > > - * queued.
> > > > + * queued that need immediate attention.
> > > > */
> > > > -int rcu_needs_cpu(void)
> > > > +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> > > > {
> > > > - return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
> > > > - !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
> > > > + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > > > + struct rcu_segcblist *rsclp = &rdp->cblist;
> > > > +
> > > > + // Disabled, empty, or offloaded means nothing to do.
> > > > + if (!rcu_segcblist_is_enabled(rsclp) ||
> > > > + rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
> > > > + *nextevt = KTIME_MAX;
> > > > + return 0;
> > > > + }
> > > > +
> > > > + // Callbacks ready to invoke or that have not already been
> > > > + // assigned a grace period need immediate attention.
> > > > + if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
> > > > + !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
> > > > + return 1;
> > > > +
> > > > + // There are callbacks waiting for some later grace period.
> > > > + // Wait for about a grace period or two for the next tick, at which
> > > > + // point there is high probability that this CPU will need to do some
> > > > + // work for RCU.
> > > > + *nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) +
> > > > + READ_ONCE(jiffies_till_next_fqs) + 1);
> > >
> > > Looks like a nice idea. Could this race with the main GP thread on another CPU
> > > completing the grace period, so that there is actually some work to do on this
> > > CPU but rcu_needs_cpu() returns 0?
> > >
> > > I think it is plausible but not common, in which case the extra delay is
> > > probably OK.
> >
> > Glad you like it!
> >
> > Yes, that race can happen, but it can also happen today.
> > A scheduling-clock interrupt might arrive at a CPU just as a grace
> > period finishes. Yes, the delay is longer with this patch. If this
> proves to be a problem, then the delay heuristic might be expanded to
> > include the age of the current grace period.
> >
> > But keeping it simple to start with.
>
> Sure, sounds good, and yes, I agree on the point about the existing issue,
> but the error is just one jiffy there, as you pointed out.

One jiffy currently, but it would typically be about seven jiffies with
the patch. Systems with smaller values of HZ would have fewer jiffies,
and systems with more than 128 CPUs would have more jiffies. Systems
booted with explicit values for the rcutree.jiffies_till_first_fqs and
rcutree.jiffies_till_next_fqs kernel boot parameters could have whatever
the administrator wanted. ;-)
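
For example, assuming HZ=1000 and the default of three jiffies for each
of the two fqs parameters on a smallish system, the deferral works out
to:

	nextevt = basemono + TICK_NSEC * (3 + 3 + 1)
	        = basemono + 7ms

that is, about seven jiffies instead of the one jiffy we get today.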

But the key point is that the grace period itself can be extended by
that value just due to timing and distribution of idle CPUs.

> > > Also, if the RCU readers take a long time, then we'd still wake up the system
> > > periodically although with the above change, much fewer times, which is a good
> > > thing.
> >
> > And the delay heuristic could also be expanded to include a digitally
> > filtered estimate of grace-period duration. But again, keeping it simple
> > to start with. ;-)
> >
> > My guess is that offloading gets you more power savings, but I don't
> > have a good way of testing this guess.
>
> I could try to run turbostat on Monday on our Intel SoCs, and see how
> it reacts, but I was thinking of tracing this first to see the
> behavior. Another thing I was thinking of was updating (the future)
> rcutop to see how many 'idle ticks' are RCU related, vs others; and
> then see how this patch affects that.

Such testing would be very welcome, thank you!

This patch might also need to keep track of the last tick on a given
CPU in order to prevent frequent short idle periods from indefinitely
delaying the tick.

Thanx, Paul

> thanks,
>
> - Joel
>
>
> > > > unsigned long basejiff;
> > > > unsigned int seq;
> > > >
> > > > @@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> > > > * minimal delta which brings us back to this place
> > > > * immediately. Lather, rinse and repeat...
> > > > */
> > > > - if (rcu_needs_cpu() || arch_needs_cpu() ||
> > > > + if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
> > > > irq_work_needs_cpu() || local_timer_softirq_pending()) {
> > > > next_tick = basemono + TICK_NSEC;
> > > > } else {
> > > > @@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> > > > * disabled this also looks at the next expiring
> > > > * hrtimer.
> > > > */
> > > > - next_tick = get_next_timer_interrupt(basejiff, basemono);
> > > > - ts->next_timer = next_tick;
> > > > + next_tmr = get_next_timer_interrupt(basejiff, basemono);
> > > > + ts->next_timer = next_tmr;
> > > > + /* Take the next rcu event into account */
> > > > + next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> > > > }
> > > >
> > > > /*

2022-09-23 18:42:36

by Joel Fernandes

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Fri, Sep 23, 2022 at 12:01 PM Paul E. McKenney <[email protected]> wrote:
[...]
> > > And here is an untested patch that in theory might allow much of the
> > > reduction in power with minimal complexity/overhead for kernels without
> > > rcu_nocbs CPUs. On the off-chance you know of someone who would be
> > > willing to do a realistic evaluation of it.
> > >
> > > Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > >
> > > commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
> > > Author: Paul E. McKenney <[email protected]>
> > > Date: Wed Sep 21 13:30:24 2022 -0700
> > >
> > > rcu: Let non-offloaded idle CPUs with callbacks defer tick
> > >
> > > When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
> > > not RCU needs the scheduler-clock tick to keep interrupting. Right now,
> > > RCU keeps the tick on for a given idle CPU if there are any non-offloaded
> > > callbacks queued on that CPU.
> > >
> > > But if all of these callbacks are waiting for a grace period to finish,
> > > there is no point in scheduling a tick before that grace period has any
> > > reasonable chance of completing. This commit therefore delays the tick
> > > in the case where all the callbacks are waiting for a specific grace
> > > period to elapse. In theory, this should result in a 50-70% reduction in
> > > RCU-induced scheduling-clock ticks on mostly-idle CPUs. In practice, TBD.
> > >
> > > Signed-off-by: Paul E. McKenney <[email protected]>
> > > Cc: Peter Zijlstra <[email protected]>
> > >
> > > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> > > index 9bc025aa79a3..84e930c11065 100644
> > > --- a/include/linux/rcutiny.h
> > > +++ b/include/linux/rcutiny.h
> > > @@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
> > > rcu_tasks_qs(current, (preempt)); \
> > > } while (0)
> > >
> > > -static inline int rcu_needs_cpu(void)
> > > +static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> > > {
> > > return 0;
> > > }
> > > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> > > index 70795386b9ff..3066e0975022 100644
> > > --- a/include/linux/rcutree.h
> > > +++ b/include/linux/rcutree.h
> > > @@ -19,7 +19,7 @@
> > >
> > > void rcu_softirq_qs(void);
> > > void rcu_note_context_switch(bool preempt);
> > > -int rcu_needs_cpu(void);
> > > +int rcu_needs_cpu(u64 basemono, u64 *nextevt);
> > > void rcu_cpu_stall_reset(void);
> > >
> > > /*
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 5ec97e3f7468..47cd3b0d2a07 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
> > > * scheduler-clock interrupt.
> > > *
> > > * Just check whether or not this CPU has non-offloaded RCU callbacks
> > > - * queued.
> > > + * queued that need immediate attention.
> > > */
> > > -int rcu_needs_cpu(void)
> > > +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> > > {
> > > - return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
> > > - !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
> > > + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > > + struct rcu_segcblist *rsclp = &rdp->cblist;
> > > +
> > > + // Disabled, empty, or offloaded means nothing to do.
> > > + if (!rcu_segcblist_is_enabled(rsclp) ||
> > > + rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
> > > + *nextevt = KTIME_MAX;
> > > + return 0;
> > > + }
> > > +
> > > + // Callbacks ready to invoke or that have not already been
> > > + // assigned a grace period need immediate attention.
> > > + if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
> > > + !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
> > > + return 1;
> > > +
> > > + // There are callbacks waiting for some later grace period.
> > > + // Wait for about a grace period or two for the next tick, at which
> > > + // point there is high probability that this CPU will need to do some
> > > + // work for RCU.
> > > + *nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) +
> > > + READ_ONCE(jiffies_till_next_fqs) + 1);
> >
> > Looks like a nice idea. Could this race with the main GP thread on another CPU
> > completing the grace period, so that there is actually some work to do on this
> > CPU but rcu_needs_cpu() returns 0?
> >
> > I think it is plausible but not common, in which case the extra delay is
> > probably OK.
>
> Glad you like it!
>
> Yes, that race can happen, but it can also happen today.
> A scheduling-clock interrupt might arrive at a CPU just as a grace
> period finishes. Yes, the delay is longer with this patch. If this
> proves to be a problem, then the delay heuristic might be expanded to
> include the age of the current grace period.
>
> But keeping it simple to start with.

Sure, sounds good, and yes, I agree on the point about the existing issue,
but the error is just one jiffy there, as you pointed out.

> > Also, if the RCU readers take a long time, then we'd still wake up the system
> > periodically although with the above change, much fewer times, which is a good
> > thing.
>
> And the delay heuristic could also be expanded to include a digitally
> filtered estimate of grace-period duration. But again, keeping it simple
> to start with. ;-)
>
> My guess is that offloading gets you more power savings, but I don't
> have a good way of testing this guess.

I could try to run turbostat on Monday on our Intel SoCs, and see how
it reacts, but I was thinking of tracing this first to see the
behavior. Another thing I was thinking of was updating (the future)
rcutop to see how many 'idle ticks' are RCU related, vs others; and
then see how this patch affects that.

thanks,

- Joel


> > > unsigned long basejiff;
> > > unsigned int seq;
> > >
> > > @@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> > > * minimal delta which brings us back to this place
> > > * immediately. Lather, rinse and repeat...
> > > */
> > > - if (rcu_needs_cpu() || arch_needs_cpu() ||
> > > + if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
> > > irq_work_needs_cpu() || local_timer_softirq_pending()) {
> > > next_tick = basemono + TICK_NSEC;
> > > } else {
> > > @@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> > > * disabled this also looks at the next expiring
> > > * hrtimer.
> > > */
> > > - next_tick = get_next_timer_interrupt(basejiff, basemono);
> > > - ts->next_timer = next_tick;
> > > + next_tmr = get_next_timer_interrupt(basejiff, basemono);
> > > + ts->next_timer = next_tmr;
> > > + /* Take the next rcu event into account */
> > > + next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> > > }
> > >
> > > /*

2022-09-23 20:06:41

by Joel Fernandes

[permalink] [raw]
Subject: Re: RCU vs NOHZ

Hi Paul,

On Fri, Sep 23, 2022 at 2:15 PM Paul E. McKenney <[email protected]> wrote:
>
> On Fri, Sep 23, 2022 at 01:47:40PM -0400, Joel Fernandes wrote:
>> On Fri, Sep 23, 2022 at 12:01 PM Paul E. McKenney <[email protected]> wrote:
>> [...]
>>>>> And here is an untested patch that in theory might allow much of the
>>>>> reduction in power with minimal complexity/overhead for kernels without
>>>>> rcu_nocbs CPUs. On the off-chance you know of someone who would be
>>>>> willing to do a realistic evaluation of it.
>>>>> Thanx, Paul
>>>>> ------------------------------------------------------------------------
>>>>> commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
>>>>> Author: Paul E. McKenney <[email protected]>
>>>>> Date: Wed Sep 21 13:30:24 2022 -0700
>>>>> rcu: Let non-offloaded idle CPUs with callbacks defer tick
>>>>> When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
>>>>> not RCU needs the scheduler-clock tick to keep interrupting. Right now,
>>>>> RCU keeps the tick on for a given idle CPU if there are any non-offloaded
>>>>> callbacks queued on that CPU.
>>>>> But if all of these callbacks are waiting for a grace period to finish,
>>>>> there is no point in scheduling a tick before that grace period has any
>>>>> reasonable chance of completing. This commit therefore delays the tick
>>>>> in the case where all the callbacks are waiting for a specific grace
>>>>> period to elapse. In theory, this should result in a 50-70% reduction in
>>>>> RCU-induced scheduling-clock ticks on mostly-idle CPUs. In practice, TBD.
>>>>> Signed-off-by: Paul E. McKenney <[email protected]>
>>>>> Cc: Peter Zijlstra <[email protected]>
>>>>> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
>>>>> index 9bc025aa79a3..84e930c11065 100644
>>>>> --- a/include/linux/rcutiny.h
>>>>> +++ b/include/linux/rcutiny.h
>>>>> @@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
>>>>> rcu_tasks_qs(current, (preempt)); \
>>>>> } while (0)
>>>>> -static inline int rcu_needs_cpu(void)
>>>>> +static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>>>>> {
>>>>> return 0;
>>>>> }
>>>>> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
>>>>> index 70795386b9ff..3066e0975022 100644
>>>>> --- a/include/linux/rcutree.h
>>>>> +++ b/include/linux/rcutree.h
>>>>> @@ -19,7 +19,7 @@
>>>>> void rcu_softirq_qs(void);
>>>>> void rcu_note_context_switch(bool preempt);
>>>>> -int rcu_needs_cpu(void);
>>>>> +int rcu_needs_cpu(u64 basemono, u64 *nextevt);
>>>>> void rcu_cpu_stall_reset(void);
>>>>> /*
>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>>> index 5ec97e3f7468..47cd3b0d2a07 100644
>>>>> --- a/kernel/rcu/tree.c
>>>>> +++ b/kernel/rcu/tree.c
>>>>> @@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
>>>>> * scheduler-clock interrupt.
>>>>> *
>>>>> * Just check whether or not this CPU has non-offloaded RCU callbacks
>>>>> - * queued.
>>>>> + * queued that need immediate attention.
>>>>> */
>>>>> -int rcu_needs_cpu(void)
>>>>> +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>>>>> {
>>>>> - return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
>>>>> - !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
>>>>> + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>>>>> + struct rcu_segcblist *rsclp = &rdp->cblist;
>>>>> +
>>>>> + // Disabled, empty, or offloaded means nothing to do.
>>>>> + if (!rcu_segcblist_is_enabled(rsclp) ||
>>>>> + rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
>>>>> + *nextevt = KTIME_MAX;
>>>>> + return 0;
>>>>> + }
>>>>> +
>>>>> + // Callbacks ready to invoke or that have not already been
>>>>> + // assigned a grace period need immediate attention.
>>>>> + if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
>>>>> + !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
>>>>> + return 1;
>>>>> +
>>>>> + // There are callbacks waiting for some later grace period.
>>>>> + // Wait for about a grace period or two for the next tick, at which
>>>>> + // point there is high probability that this CPU will need to do some
>>>>> + // work for RCU.
>>>>> + *nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) +
>>>>> + READ_ONCE(jiffies_till_next_fqs) + 1);
>>>> Looks like a nice idea. Could this race with the main GP thread on another CPU
>>>> completing the grace period, so that there is actually some work to do on this
>>>> CPU but rcu_needs_cpu() returns 0?
>>>> I think it is plausible but not common, in which case the extra delay is
>>>> probably OK.
>>> Glad you like it!
>>> Yes, that race can happen, but it can also happen today.
>>> A scheduling-clock interrupt might arrive at a CPU just as a grace
>>> period finishes. Yes, the delay is longer with this patch. If this
>>> proves to be a problem, then the delay heuristic might be expanded to
>>> include the age of the current grace period.
>>> But keeping it simple to start with.
>> Sure, sounds good, and yes, I agree on the point about the existing issue,
>> but the error is just one jiffy there, as you pointed out.
>
> One jiffy currently, but it would typically be about seven jiffies with
> the patch

Yes exactly, that’s what I meant.

> . Systems with smaller values of HZ would have fewer jiffies,
> and systems with more than 128 CPUs would have more jiffies. Systems
> booted with explicit values for the rcutree.jiffies_till_first_fqs and
> rcutree.jiffies_till_next_fqs kernel boot parameters could have whatever
> the administrator wanted. ;-)


Makes sense, thanks for clarifying.

> But the key point is that the grace period itself can be extended by
> that value just due to timing and distribution of idle CPUs.
>
>>>> Also, if the RCU readers take a long time, then we'd still wake up the system
>>>> periodically although with the above change, much fewer times, which is a good
>>>> thing.
>>> And the delay heuristic could also be expanded to include a digitally
>>> filtered estimate of grace-period duration. But again, keeping it simple
>>> to start with. ;-)
>>> My guess is that offloading gets you more power savings, but I don't
>>> have a good way of testing this guess.
>> I could try to run turbostat on Monday on our Intel SoCs, and see how
>> it reacts, but I was thinking of tracing this first to see the
>> behavior. Another thing I was thinking of was updating (the future)
>> rcutop to see how many 'idle ticks' are RCU related, vs others; and
>> then see how this patch affects that.
>
> Such testing would be very welcome, thank you!
>
> This patch might also need to keep track of the last tick on a given
> CPU in order to prevent frequent short idle periods from indefinitely
> delaying the tick.

I know what you mean! I had the exact same issue with the lazy timer initially; now the behavior is that any lazy enqueue piggybacks onto the existing timer if one is already running.
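
Roughly like this, as an illustrative sketch (the struct and
LAZY_FLUSH_JIFFIES are made-up names here, not the actual
call_rcu_lazy() code):

	struct lazy_queue {
		struct llist_head cbs;
		struct timer_list flush_timer;
	};

	static void lazy_enqueue(struct lazy_queue *lq, struct llist_node *cb)
	{
		llist_add(cb, &lq->cbs);
		/* Piggyback on an already-pending flush timer rather
		 * than re-arming it on every lazy enqueue. */
		if (!timer_pending(&lq->flush_timer))
			mod_timer(&lq->flush_timer,
				  jiffies + LAZY_FLUSH_JIFFIES);
	}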

- Joel


>
>
> Thanx, Paul
>
>> thanks,
>> - Joel
>>>>> unsigned long basejiff;
>>>>> unsigned int seq;
>>>>> @@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>>>>> * minimal delta which brings us back to this place
>>>>> * immediately. Lather, rinse and repeat...
>>>>> */
>>>>> - if (rcu_needs_cpu() || arch_needs_cpu() ||
>>>>> + if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
>>>>> irq_work_needs_cpu() || local_timer_softirq_pending()) {
>>>>> next_tick = basemono + TICK_NSEC;
>>>>> } else {
>>>>> @@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>>>>> * disabled this also looks at the next expiring
>>>>> * hrtimer.
>>>>> */
>>>>> - next_tick = get_next_timer_interrupt(basejiff, basemono);
>>>>> - ts->next_timer = next_tick;
>>>>> + next_tmr = get_next_timer_interrupt(basejiff, basemono);
>>>>> + ts->next_timer = next_tmr;
>>>>> + /* Take the next rcu event into account */
>>>>> + next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
>>>>> }
>>>>> /*

2022-09-29 11:32:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Wed, Sep 21, 2022 at 02:36:44PM -0700, Paul E. McKenney wrote:

> commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
> Author: Paul E. McKenney <[email protected]>
> Date: Wed Sep 21 13:30:24 2022 -0700
>
> rcu: Let non-offloaded idle CPUs with callbacks defer tick
>
> When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
> not RCU needs the scheduler-clock tick to keep interrupting. Right now,
> RCU keeps the tick on for a given idle CPU if there are any non-offloaded
> callbacks queued on that CPU.
>
> But if all of these callbacks are waiting for a grace period to finish,
> there is no point in scheduling a tick before that grace period has any
> reasonable chance of completing. This commit therefore delays the tick
> in the case where all the callbacks are waiting for a specific grace
> period to elapse. In theory, this should result in a 50-70% reduction in
> RCU-induced scheduling-clock ticks on mostly-idle CPUs. In practice, TBD.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>

> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 5ec97e3f7468..47cd3b0d2a07 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
> * scheduler-clock interrupt.
> *
> * Just check whether or not this CPU has non-offloaded RCU callbacks
> - * queued.
> + * queued that need immediate attention.
> */
> -int rcu_needs_cpu(void)
> +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> {
> - return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
> - !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
> + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> + struct rcu_segcblist *rsclp = &rdp->cblist;
> +
> + // Disabled, empty, or offloaded means nothing to do.
> + if (!rcu_segcblist_is_enabled(rsclp) ||
> + rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
> + *nextevt = KTIME_MAX;
> + return 0;
> + }

So far agreed; however, I was arguing to instead:

> +
> + // Callbacks ready to invoke or that have not already been
> + // assigned a grace period need immediate attention.
> + if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
> + !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
> + return 1;
> +
> + // There are callbacks waiting for some later grace period.
> + // Wait for about a grace period or two for the next tick, at which
> + // point there is high probability that this CPU will need to do some
> + // work for RCU.
> + *nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) +
> + READ_ONCE(jiffies_till_next_fqs) + 1);
> + return 0;
> }

force offload whatever you have in this case and always have it return
false.

Except I don't think this is quite the right place; there's too much
that can still get in the way of stopping the tick, so I would delay the
force offload to the place where we actually know we're going to stop
the tick.
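
As a sketch of where I'd put it (rcu_has_nonoffloaded_cbs() and
rcu_force_offload_cbs() being hypothetical helpers that don't exist
today and would have to do the actual splicing):

	/* In tick_nohz_stop_tick(), at the point where we have actually
	 * committed to stopping the tick on this CPU: */
	if (rcu_has_nonoffloaded_cbs())		/* hypothetical */
		rcu_force_offload_cbs();	/* hypothetical */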

2022-09-29 11:49:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Sat, Sep 17, 2022 at 07:25:08AM -0700, Paul E. McKenney wrote:
> On Fri, Sep 16, 2022 at 11:20:14AM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 16, 2022 at 12:58:17AM -0700, Paul E. McKenney wrote:
> >
> > > To the best of my knowledge at this point in time, agreed. Who knows
> > > what someone will come up with next week? But for people running certain
> > > types of real-time and HPC workloads, context tracking really does handle
> > > both idle and userspace transitions.
> >
> > Sure, but idle != nohz. Nohz is where we disable the tick, and currently
> > RCU can inhibit this -- rcu_needs_cpu().
>
> Exactly. For non-nohz userspace execution, the tick is still running
> anyway, so RCU of course won't be inhibiting its disabling. And in that
> case, RCU's hook is the tick interrupt itself. RCU's hook is passed a
> flag saying whether the interrupt came from userspace or from kernel.

I'm not sure how we ended up here; this is completely irrelevant and I'm
not disagreeing with it.

> > AFAICT there really isn't an RCU hook for this, not through context
> > tracking not through anything else.
>
> There is a directly invoked RCU hook for any transition that enables or
> disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
> that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().

Context tracking doesn't know about NOHZ, therefore RCU can't either.
Context tracking knows about IDLE, but not all IDLE is NOHZ-IDLE.

Specifically we have:

ct_{idle,irq,nmi,user,kernel}_enter()

And none of them are related to NOHZ in the slightest. So no, RCU does
not have a NOHZ callback.

I'm still thinking you're conflating NOHZ_FULL (stopping the tick when
in userspace) and regular NOHZ (stopping the tick when idle).

> And this of course means that any additional schemes to reduce RCU's
> power consumption must be compared (with real measurements on real
> hardware!) to Joel et al.'s work, whether in combination or as an
> alternative. And either way, the power savings must of course justify
> the added code and complexity.

Well, Joel's lazy scheme has the difficulty that you can wreck things by
improperly marking the callback as lazy when there's an explicit
dependency on it. The talk even called that out.

I was hoping to construct a scheme that doesn't need the whole lazy
approach.


To recap; we want the CPU to go into deeper idle states, no?

RCU can currently inhibit this by having callbacks pending for this CPU
-- in this case RCU inhibits NOHZ-IDLE and deep power states are not
selected or less effective.

Now, deep idle states actually purge the caches, so cache locality
cannot be an argument to keep the callbacks local.

We know when we're doing deep idle we stop the tick.

So why not, when stopping the tick, move the RCU pending crud elsewhere
and let the CPU get on with going idle instead of inhibiting the
stopping of the tick and wrecking deep idle?

2022-09-29 15:45:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Thu, Sep 29, 2022 at 12:55:58PM +0200, Peter Zijlstra wrote:
> On Sat, Sep 17, 2022 at 07:25:08AM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 16, 2022 at 11:20:14AM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 16, 2022 at 12:58:17AM -0700, Paul E. McKenney wrote:
> > >
> > > > To the best of my knowledge at this point in time, agreed. Who knows
> > > > what someone will come up with next week? But for people running certain
> > > > types of real-time and HPC workloads, context tracking really does handle
> > > > both idle and userspace transitions.
> > >
> > > Sure, but idle != nohz. Nohz is where we disable the tick, and currently
> > > RCU can inhibit this -- rcu_needs_cpu().
> >
> > Exactly. For non-nohz userspace execution, the tick is still running
> > anyway, so RCU of course won't be inhibiting its disabling. And in that
> > case, RCU's hook is the tick interrupt itself. RCU's hook is passed a
> > flag saying whether the interrupt came from userspace or from kernel.
>
> I'm not sure how we ended up here; this is completely irrelevant and I'm
> not disagreeing with it.
>
> > > AFAICT there really isn't an RCU hook for this, not through context
> > > tracking not through anything else.
> >
> > There is a directly invoked RCU hook for any transition that enables or
> > disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
> > that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
>
> Context tracking doesn't know about NOHZ, therefore RCU can't either.
> Context tracking knows about IDLE, but not all IDLE is NOHZ-IDLE.
>
> Specifically we have:
>
> ct_{idle,irq,nmi,user,kernel}_enter()
>
> And none of them are related to NOHZ in the slightest. So no, RCU does
> not have a NOHZ callback.
>
> I'm still thinking you're conflating NOHZ_FULL (stopping the tick when
> in userspace) and regular NOHZ (stopping the tick when idle).
>
> > And this of course means that any additional schemes to reduce RCU's
> > power consumption must be compared (with real measurements on real
> > hardware!) to Joel et al.'s work, whether in combination or as an
> > alternative. And either way, the power savings must of course justify
> > the added code and complexity.
>
> Well, Joel's lazy scheme has the difficulty that you can wreck things by
> improperly marking the callback as lazy when there's an explicit
> dependency on it. The talk even called that out.
>
> I was hoping to construct a scheme that doesn't need the whole lazy
> approach.
>
>
> To recap; we want the CPU to go into deeper idle states, no?
>
> RCU can currently inhibit this by having callbacks pending for this CPU
> -- in this case RCU inhibits NOHZ-IDLE and deep power states are not
> selected or less effective.
>
> Now, deep idle states actually purge the caches, so cache locality
> cannot be an argument to keep the callbacks local.
>
> We know when we're doing deep idle we stop the tick.
>
> So why not, when stopping the tick, move the RCU pending crud elsewhere
> and let the CPU get on with going idle instead of inhibiting the
> stopping of the tick and wrecking deep idle?

Because doing so in the past has cost more energy than is saved.

Thanx, Paul

2022-09-29 16:08:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Thu, Sep 29, 2022 at 08:20:44AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 29, 2022 at 12:55:58PM +0200, Peter Zijlstra wrote:
> > On Sat, Sep 17, 2022 at 07:25:08AM -0700, Paul E. McKenney wrote:
> > > On Fri, Sep 16, 2022 at 11:20:14AM +0200, Peter Zijlstra wrote:
> > > > On Fri, Sep 16, 2022 at 12:58:17AM -0700, Paul E. McKenney wrote:
> > > >
> > > > > To the best of my knowledge at this point in time, agreed. Who knows
> > > > > what someone will come up with next week? But for people running certain
> > > > > types of real-time and HPC workloads, context tracking really does handle
> > > > > both idle and userspace transitions.
> > > >
> > > > Sure, but idle != nohz. Nohz is where we disable the tick, and currently
> > > > RCU can inhibit this -- rcu_needs_cpu().
> > >
> > > Exactly. For non-nohz userspace execution, the tick is still running
> > > anyway, so RCU of course won't be inhibiting its disabling. And in that
> > > case, RCU's hook is the tick interrupt itself. RCU's hook is passed a
> > > flag saying whether the interrupt came from userspace or from kernel.
> >
> > I'm not sure how we ended up here; this is completely irrelevant and I'm
> > not disagreeing with it.
> >
> > > > AFAICT there really isn't an RCU hook for this, not through context
> > > > tracking not through anything else.
> > >
> > > There is a directly invoked RCU hook for any transition that enables or
> > > disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
> > > that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
> >
> > Context tracking doesn't know about NOHZ, therefore RCU can't either.
> > Context tracking knows about IDLE, but not all IDLE is NOHZ-IDLE.
> >
> > Specifically we have:
> >
> > ct_{idle,irq,nmi,user,kernel}_enter()
> >
> > And none of them are related to NOHZ in the slightest. So no, RCU does
> > not have a NOHZ callback.
> >
> > I'm still thinking you're conflating NOHZ_FULL (stopping the tick when
> > in userspace) and regular NOHZ (stopping the tick when idle).

Exactly how are ct_user_enter() and ct_user_exit() completely unrelated
to nohz_full CPUs?

> > > And this of course means that any additional schemes to reduce RCU's
> > > power consumption must be compared (with real measurements on real
> > > hardware!) to Joel et al.'s work, whether in combination or as an
> > > alternative. And either way, the power savings must of course justify
> > > the added code and complexity.
> >
> > Well, Joel's lazy scheme has the difficulty that you can wreck things by
> > improperly marking the callback as lazy when there's an explicit
> > dependency on it. The talk even called that out.
> >
> > I was hoping to construct a scheme that doesn't need the whole lazy
> > approach.

I agree that this is a risk that must be addressed.

> > To recap; we want the CPU to go into deeper idle states, no?
> >
> > RCU can currently inhibit this by having callbacks pending for this CPU
> > -- in this case RCU inhibits NOHZ-IDLE and deep power states are not
> > selected or less effective.
> >
> > Now, deep idle states actually purge the caches, so cache locality
> > cannot be an argument to keep the callbacks local.
> >
> > We know when we're doing deep idle we stop the tick.
> >
> > So why not, when stopping the tick, move the RCU pending crud elsewhere
> > and let the CPU get on with going idle instead of inhibiting the
> > stopping of the tick and wrecking deep idle?
>
> Because doing so in the past has cost more energy than is saved.

And I should hasten to add that I have no intention of sending this
commit upstream unless/until it is demonstrated to save real energy on
real hardware. In the meantime, please see below for an updated version
that avoids indefinitely postponing the tick on systems having CPUs that
enter and exit idle frequently.

Thanx, Paul

------------------------------------------------------------------------

commit e30960e87d58db50bbe4fd09a2ff1e5eeeaad754
Author: Paul E. McKenney <[email protected]>
Date: Wed Sep 21 13:30:24 2022 -0700

rcu: Let non-offloaded idle CPUs with callbacks defer tick

When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
not RCU needs the scheduler-clock tick to keep interrupting. Right now,
RCU keeps the tick on for a given idle CPU if there are any non-offloaded
callbacks queued on that CPU.

But if all of these callbacks are waiting for a grace period to finish,
there is no point in scheduling a tick before that grace period has any
reasonable chance of completing. This commit therefore delays the tick
in the case where all the callbacks are waiting for a specific grace
period to elapse. In theory, this should result in a 50-70% reduction in
RCU-induced scheduling-clock ticks on mostly-idle CPUs. In practice, TBD.

Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 9bc025aa79a3..84e930c11065 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
rcu_tasks_qs(current, (preempt)); \
} while (0)

-static inline int rcu_needs_cpu(void)
+static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
{
return 0;
}
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 70795386b9ff..3066e0975022 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -19,7 +19,7 @@

void rcu_softirq_qs(void);
void rcu_note_context_switch(bool preempt);
-int rcu_needs_cpu(void);
+int rcu_needs_cpu(u64 basemono, u64 *nextevt);
void rcu_cpu_stall_reset(void);

/*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5ec97e3f7468..1930cee1ccdb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -676,12 +676,40 @@ void __rcu_irq_enter_check_tick(void)
* scheduler-clock interrupt.
*
* Just check whether or not this CPU has non-offloaded RCU callbacks
- * queued.
+ * queued that need immediate attention.
*/
-int rcu_needs_cpu(void)
+int rcu_needs_cpu(u64 basemono, u64 *nextevt)
{
- return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
- !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
+ unsigned long j;
+ unsigned long jlast;
+ unsigned long jwait;
+ struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+ struct rcu_segcblist *rsclp = &rdp->cblist;
+
+ // Disabled, empty, or offloaded means nothing to do.
+ if (!rcu_segcblist_is_enabled(rsclp) ||
+ rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
+ *nextevt = KTIME_MAX;
+ return 0;
+ }
+
+ // Callbacks ready to invoke or that have not already been
+ // assigned a grace period need immediate attention.
+ if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
+ !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
+ return 1;
+
+ // There are callbacks waiting for some later grace period.
+ // Wait for about a grace period or two since the last tick, at which
+ // point there is high probability that this CPU will need to do some
+ // work for RCU.
+ j = jiffies;
+ jlast = __this_cpu_read(rcu_data.last_sched_clock);
+ jwait = READ_ONCE(jiffies_till_first_fqs) + READ_ONCE(jiffies_till_next_fqs) + 1;
+ if (time_after(j, jlast + jwait))
+ return 1;
+ *nextevt = basemono + TICK_NSEC * (jlast + jwait - j);
+ return 0;
}

/*
@@ -2324,11 +2352,9 @@ void rcu_sched_clock_irq(int user)
{
unsigned long j;

- if (IS_ENABLED(CONFIG_PROVE_RCU)) {
- j = jiffies;
- WARN_ON_ONCE(time_before(j, __this_cpu_read(rcu_data.last_sched_clock)));
- __this_cpu_write(rcu_data.last_sched_clock, j);
- }
+ j = jiffies;
+ WARN_ON_ONCE(time_before(j, __this_cpu_read(rcu_data.last_sched_clock)));
+ __this_cpu_write(rcu_data.last_sched_clock, j);
trace_rcu_utilization(TPS("Start scheduler-tick"));
lockdep_assert_irqs_disabled();
raw_cpu_inc(rcu_data.ticks_this_gp);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b0e3c9205946..303ea15cdb96 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -784,7 +784,7 @@ static inline bool local_timer_softirq_pending(void)

static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
{
- u64 basemono, next_tick, delta, expires;
+ u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
unsigned long basejiff;
unsigned int seq;

@@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
* minimal delta which brings us back to this place
* immediately. Lather, rinse and repeat...
*/
- if (rcu_needs_cpu() || arch_needs_cpu() ||
+ if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
irq_work_needs_cpu() || local_timer_softirq_pending()) {
next_tick = basemono + TICK_NSEC;
} else {
@@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
* disabled this also looks at the next expiring
* hrtimer.
*/
- next_tick = get_next_timer_interrupt(basejiff, basemono);
- ts->next_timer = next_tick;
+ next_tmr = get_next_timer_interrupt(basejiff, basemono);
+ ts->next_timer = next_tmr;
+ /* Take the next rcu event into account */
+ next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
}

/*

2022-09-29 16:32:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Thu, Sep 29, 2022 at 08:20:44AM -0700, Paul E. McKenney wrote:

> > To recap; we want the CPU to go into deeper idle states, no?
> >
> > RCU can currently inhibit this by having callbacks pending for this CPU
> > -- in this case RCU inhibits NOHZ-IDLE and deep power states are not
> > selected or less effective.
> >
> > Now, deep idle states actually purge the caches, so cache locality
> > cannot be an argument to keep the callbacks local.
> >
> > We know when we're doing deep idle we stop the tick.
> >
> > So why not, when stopping the tick, move the RCU pending crud elsewhere
> > and let the CPU get on with going idle instead of inhibiting the
> > stopping of the tick and wrecking deep idle?
>
> Because doing so in the past has cost more energy than is saved.

How has this been tried, and why did the energy cost go up? Is it
because the offload thread ends up waking up the CPU we just put to
sleep?

By default I think the offload stuff just doesn't work well for
!NOHZ_FULL situations; that is, NOHZ_FULL is the only case where there
are housekeeper CPUs that take care of the offload threads.

Hence my initial suggestion to force the pending work into the jiffy
owner CPU.
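
Something like the below is what I have in mind -- a sketch only, with
hypothetical function names; the point is the lock-free hand-off:

	static LLIST_HEAD(rcu_handoff_list);	/* global, hypothetical */

	/* Idle CPU, once committed to stopping the tick: donate its
	 * pending callbacks. */
	static void rcu_donate_cbs(struct llist_node *first,
				   struct llist_node *last)
	{
		llist_add_batch(first, last, &rcu_handoff_list);
	}

	/* Jiffy-owner CPU, from its tick: consume whatever was donated
	 * and process it locally. */
	static void rcu_consume_donated_cbs(void)
	{
		struct llist_node *cbs = llist_del_all(&rcu_handoff_list);

		/* ...assign grace periods and invoke when ready... */
	}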

2022-09-29 16:33:33

by Joel Fernandes

[permalink] [raw]
Subject: Re: RCU vs NOHZ



On 9/29/2022 11:46 AM, Paul E. McKenney wrote:
> On Thu, Sep 29, 2022 at 08:20:44AM -0700, Paul E. McKenney wrote:
>> On Thu, Sep 29, 2022 at 12:55:58PM +0200, Peter Zijlstra wrote:
>>> On Sat, Sep 17, 2022 at 07:25:08AM -0700, Paul E. McKenney wrote:
[..]
>>>> And this of course means that any additional schemes to reduce RCU's
>>>> power consumption must be compared (with real measurements on real
>>>> hardware!) to Joel et al.'s work, whether in combination or as an
>>>> alternative. And either way, the power savings must of course justify
>>>> the added code and complexity.
>>>
>>> Well, Joel's lazy scheme has the difficulty that you can wreck things by
>>> improperly marking the callback as lazy when there's an explicit
>>> dependency on it. The talk even called that out.
>>>
>>> I was hoping to construct a scheme that doesn't need the whole lazy
>>> approach.

Peter, when constructing such a scheme, please do consider that the power
savings need to be comparable to those seen in power testing with large
jiffies_till_first_fqs values. Otherwise, such a solution is 'not that good'
(IMO). In other words, the ideal savings are the ones you get by not having
to ask for RCU's services too soon (rather than by optimizing RCU itself).
Of course, the tick being turned off could/should also be optimized for the
case where you do need RCU's services.

> I agree that this is a risk that must be addressed.

Right, it is encouraging to see that we're making good progress on this. And
Thomas also mentioned at LPC that if call_rcu() users are expecting
time-bounded callback invocation, then _that_ needs to be fixed.

thanks,

- Joel


>
>>> To recap; we want the CPU to go into deeper idle states, no?
>>>
>>> RCU can currently inhibit this by having callbacks pending for this CPU
>>> -- in this case RCU inhibits NOHZ-IDLE and deep power states are not
>>> selected or less effective.
>>>
>>> Now, deep idle states actually purge the caches, so cache locality
>>> cannot be an argument to keep the callbacks local.
>>>
>>> We know when we're doing deep idle we stop the tick.
>>>
>>> So why not, when stopping the tick, move the RCU pending crud elsewhere
>>> and let the CPU get on with going idle instead of inhibiting the
>>> stopping of the tick and wrecking deep idle?
>>
>> Because doing so in the past has cost more energy than is saved.
>
> And I should hasten to add that I have no intention of sending this
> commit upstream unless/until it is demonstrated to save real energy on
> real hardware. In the meantime, please see below for an updated version
> that avoids indefinitely postponing the tick on systems having CPUs that
> enter and exit idle frequently.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit e30960e87d58db50bbe4fd09a2ff1e5eeeaad754
> Author: Paul E. McKenney <[email protected]>
> Date: Wed Sep 21 13:30:24 2022 -0700
>
> rcu: Let non-offloaded idle CPUs with callbacks defer tick
>
> When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
> not RCU needs the scheduler-clock tick to keep interrupting. Right now,
> RCU keeps the tick on for a given idle CPU if there are any non-offloaded
> callbacks queued on that CPU.
>
> But if all of these callbacks are waiting for a grace period to finish,
> there is no point in scheduling a tick before that grace period has any
> reasonable chance of completing. This commit therefore delays the tick
> in the case where all the callbacks are waiting for a specific grace
> period to elapse. In theory, this should result in a 50-70% reduction in
> RCU-induced scheduling-clock ticks on mostly-idle CPUs. In practice, TBD.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
>
> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> index 9bc025aa79a3..84e930c11065 100644
> --- a/include/linux/rcutiny.h
> +++ b/include/linux/rcutiny.h
> @@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
> rcu_tasks_qs(current, (preempt)); \
> } while (0)
>
> -static inline int rcu_needs_cpu(void)
> +static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> {
> return 0;
> }
> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> index 70795386b9ff..3066e0975022 100644
> --- a/include/linux/rcutree.h
> +++ b/include/linux/rcutree.h
> @@ -19,7 +19,7 @@
>
> void rcu_softirq_qs(void);
> void rcu_note_context_switch(bool preempt);
> -int rcu_needs_cpu(void);
> +int rcu_needs_cpu(u64 basemono, u64 *nextevt);
> void rcu_cpu_stall_reset(void);
>
> /*
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 5ec97e3f7468..1930cee1ccdb 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -676,12 +676,40 @@ void __rcu_irq_enter_check_tick(void)
> * scheduler-clock interrupt.
> *
> * Just check whether or not this CPU has non-offloaded RCU callbacks
> - * queued.
> + * queued that need immediate attention.
> */
> -int rcu_needs_cpu(void)
> +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
> {
> - return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
> - !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
> + unsigned long j;
> + unsigned long jlast;
> + unsigned long jwait;
> + struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> + struct rcu_segcblist *rsclp = &rdp->cblist;
> +
> + // Disabled, empty, or offloaded means nothing to do.
> + if (!rcu_segcblist_is_enabled(rsclp) ||
> + rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
> + *nextevt = KTIME_MAX;
> + return 0;
> + }
> +
> + // Callbacks ready to invoke or that have not already been
> + // assigned a grace period need immediate attention.
> + if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
> + !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
> + return 1;
> +
> + // There are callbacks waiting for some later grace period.
> + // Wait for about a grace period or two since the last tick, at which
> + // point there is high probability that this CPU will need to do some
> + // work for RCU.
> + j = jiffies;
> + jlast = __this_cpu_read(rcu_data.last_sched_clock);
> + jwait = READ_ONCE(jiffies_till_first_fqs) + READ_ONCE(jiffies_till_next_fqs) + 1;
> + if (time_after(j, jlast + jwait))
> + return 1;
> + *nextevt = basemono + TICK_NSEC * (jlast + jwait - j);
> + return 0;
> }
>
> /*
> @@ -2324,11 +2352,9 @@ void rcu_sched_clock_irq(int user)
> {
> unsigned long j;
>
> - if (IS_ENABLED(CONFIG_PROVE_RCU)) {
> - j = jiffies;
> - WARN_ON_ONCE(time_before(j, __this_cpu_read(rcu_data.last_sched_clock)));
> - __this_cpu_write(rcu_data.last_sched_clock, j);
> - }
> + j = jiffies;
> + WARN_ON_ONCE(time_before(j, __this_cpu_read(rcu_data.last_sched_clock)));
> + __this_cpu_write(rcu_data.last_sched_clock, j);
> trace_rcu_utilization(TPS("Start scheduler-tick"));
> lockdep_assert_irqs_disabled();
> raw_cpu_inc(rcu_data.ticks_this_gp);
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index b0e3c9205946..303ea15cdb96 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -784,7 +784,7 @@ static inline bool local_timer_softirq_pending(void)
>
> static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> {
> - u64 basemono, next_tick, delta, expires;
> + u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
> unsigned long basejiff;
> unsigned int seq;
>
> @@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> * minimal delta which brings us back to this place
> * immediately. Lather, rinse and repeat...
> */
> - if (rcu_needs_cpu() || arch_needs_cpu() ||
> + if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
> irq_work_needs_cpu() || local_timer_softirq_pending()) {
> next_tick = basemono + TICK_NSEC;
> } else {
> @@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> * disabled this also looks at the next expiring
> * hrtimer.
> */
> - next_tick = get_next_timer_interrupt(basejiff, basemono);
> - ts->next_timer = next_tick;
> + next_tmr = get_next_timer_interrupt(basejiff, basemono);
> + ts->next_timer = next_tmr;
> + /* Take the next rcu event into account */
> + next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
> }
>
> /*
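
To put rough numbers on the deferral window in rcu_needs_cpu() above
(fqs values assumed purely for illustration): with jiffies_till_first_fqs
and jiffies_till_next_fqs both resolving to 1,

	jwait   = 1 + 1 + 1 = 3 jiffies
	jlast   = 1000, j = 1001   (one jiffy since the last tick seen)
	nextevt = basemono + (jlast + jwait - j) * TICK_NSEC
	        = basemono + 2 * TICK_NSEC

so the RCU-side event is parked two ticks out and folded into the normal
next-timer calculation; only once jiffies passes 1003 does rcu_needs_cpu()
return 1 and demand an immediate tick.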

2022-09-29 16:37:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Thu, Sep 29, 2022 at 08:46:18AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 29, 2022 at 08:20:44AM -0700, Paul E. McKenney wrote:

> > > > There is a directly invoked RCU hook for any transition that enables or
> > > > disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
> > > > that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
> > >
> > > Context tracking doesn't know about NOHZ, therefore RCU can't either.
> > > Context tracking knows about IDLE, but not all IDLE is NOHZ-IDLE.
> > >
> > > Specifically we have:
> > >
> > > ct_{idle,irq,nmi,user,kernel}_enter()
> > >
> > > And none of them are related to NOHZ in the slightest. So no, RCU does
> > > not have a NOHZ callback.
> > >
> > > I'm still thinking you're conflating NOHZ_FULL (stopping the tick when
> > > in userspace) and regular NOHZ (stopping the tick when idle).
>
> Exactly how are ct_user_enter() and ct_user_exit() completely unrelated
> to nohz_full CPUs?

That's the thing; I'm not talking about nohz_full. I'm talking about
regular nohz. World of difference there.

nohz_full is a gimmick that shouldn't be used outside of very specific
cases. Regular nohz otoh is used by everybody always.

2022-09-29 16:47:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Thu, Sep 29, 2022 at 06:23:16PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 08:46:18AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 29, 2022 at 08:20:44AM -0700, Paul E. McKenney wrote:
>
> > > > > There is a directly invoked RCU hook for any transition that enables or
> > > > > disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
> > > > > that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
> > > >
> > > > Context tracking doesn't know about NOHZ, therefore RCU can't either.
> > > > Context tracking knows about IDLE, but not all IDLE is NOHZ-IDLE.
> > > >
> > > > Specifically we have:
> > > >
> > > > ct_{idle,irq,nmi,user,kernel}_enter()
> > > >
> > > > And none of them are related to NOHZ in the slightest. So no, RCU does
> > > > not have a NOHZ callback.
> > > >
> > > > I'm still thinking you're conflating NOHZ_FULL (stopping the tick when
> > > > in userspace) and regular NOHZ (stopping the tick when idle).
> >
> > Exactly how are ct_user_enter() and ct_user_exit() completely unrelated
> > to nohz_full CPUs?
>
> That's the thing; I'm not talking about nohz_full. I'm talking about
> regular nohz. World of difference there.

And indeed, for !nohz_full CPUs, the tick continues throughout userspace
execution. But you really did have ct_user_enter() and ct_user_exit()
on your list.

And for idle (as opposed to nohz_full userspace execution), there is still
ct_{idle,irq,nmi}_enter(). And RCU does pay attention to these.

So exactly what are you trying to tell me here? ;-)

> nohz_full is a gimmick that shouldn't be used outside of very specific
> cases. Regular nohz otoh is used by everybody always.

I will let you take that up with the people using it.

Thanx, Paul

2022-09-29 17:35:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Thu, Sep 29, 2022 at 06:18:32PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 08:20:44AM -0700, Paul E. McKenney wrote:
>
> > > To recap; we want the CPU to go into deeper idle states, no?
> > >
> > > RCU can currently inhibit this by having callbacks pending for this CPU
> > > -- in this case RCU inhibits NOHZ-IDLE and deep power states are not
> > > selected or less effective.
> > >
> > > Now, deep idle states actually purge the caches, so cache locality
> > > cannot be an argument to keep the callbacks local.
> > >
> > > We know when we're doing deep idle we stop the tick.
> > >
> > > So why not, when stopping the tick, move the RCU pending crud elsewhere
> > > and let the CPU get on with going idle instead of inhibiting the
> > > stopping of the tick and wrecking deep idle?
> >
> > Because doing so in the past has cost more energy than is saved.
>
> How has this been tried; and why did the energy cost go up? Is this
> because the offload thread ends up waking up the CPU we just put to
> sleep?

Because doing the additional work consumes energy. I am not clear on
exactly what you are asking for here, given the limitations of the tools
that measure energy consumption.

> By default I think the offload stuff just doesn't work well for
> !NOHZ_FULL situations; that is, NOHZ_FULL is the only case where there
> are housekeeper CPUs that take care of the offload threads.
>
> Hence my initial suggestion to force the pending work into the jiffy
> owner CPU.

By all means, please feel free to prove me wrong. But doing so requires
real code and real testing of real energy consumption by real hardware
running real workloads.

It also requires correctly handling races with all and sundry.

Thanx, Paul

2022-09-29 19:54:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Thu, Sep 29, 2022 at 09:42:04AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 29, 2022 at 06:23:16PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 29, 2022 at 08:46:18AM -0700, Paul E. McKenney wrote:
> > > On Thu, Sep 29, 2022 at 08:20:44AM -0700, Paul E. McKenney wrote:
> >
> > > > > > There is a directly invoked RCU hook for any transition that enables or
> > > > > > disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
> > > > > > that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
> > > > >
> > > > > Context tracking doesn't know about NOHZ, therefore RCU can't either.
> > > > > Context tracking knows about IDLE, but not all IDLE is NOHZ-IDLE.
> > > > >
> > > > > Specifically we have:
> > > > >
> > > > > ct_{idle,irq,nmi,user,kernel}_enter()
> > > > >
> > > > > And none of them are related to NOHZ in the slightest. So no, RCU does
> > > > > not have a NOHZ callback.
> > > > >
> > > > > I'm still thinking you're conflating NOHZ_FULL (stopping the tick when
> > > > > in userspace) and regular NOHZ (stopping the tick when idle).
> > >
> > > Exactly how are ct_user_enter() and ct_user_exit() completely unrelated
> > > to nohz_full CPUs?
> >
> > That's the thing; I'm not talking about nohz_full. I'm talking about
> > regular nohz. World of difference there.
>
> And indeed, for !nohz_full CPUs, the tick continues throughout userspace
> execution. But you really did have ct_user_enter() and ct_user_exit()
> on your list.
>
> And for idle (as opposed to nohz_full userspace execution), there is still
> ct_{idle,irq,nmi}_enter(). And RCU does pay attention to these.
>
> So exactly what are you trying to tell me here? ;-)

That RCU doesn't have a nohz callback -- you were arguing it does
through the ct_*_enter() things, and I said none of them are related to
nohz.

2022-09-29 20:01:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Thu, Sep 29, 2022 at 09:01:46PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 09:42:04AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 29, 2022 at 06:23:16PM +0200, Peter Zijlstra wrote:
> > > On Thu, Sep 29, 2022 at 08:46:18AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Sep 29, 2022 at 08:20:44AM -0700, Paul E. McKenney wrote:
> > >
> > > > > > > There is a directly invoked RCU hook for any transition that enables or
> > > > > > > disables the tick, namely the ct_*_enter() and ct_*_exit() functions,
> > > > > > > that is, those functions formerly known as rcu_*_enter() and rcu_*_exit().
> > > > > >
> > > > > > Context tracking doesn't know about NOHZ, therefore RCU can't either.
> > > > > > Context tracking knows about IDLE, but not all IDLE is NOHZ-IDLE.
> > > > > >
> > > > > > Specifically we have:
> > > > > >
> > > > > > ct_{idle,irq,nmi,user,kernel}_enter()
> > > > > >
> > > > > > And none of them are related to NOHZ in the slightest. So no, RCU does
> > > > > > not have a NOHZ callback.
> > > > > >
> > > > > > I'm still thinking you're conflating NOHZ_FULL (stopping the tick when
> > > > > > in userspace) and regular NOHZ (stopping the tick when idle).
> > > >
> > > > Exactly how are ct_user_enter() and ct_user_exit() completely unrelated
> > > > to nohz_full CPUs?
> > >
> > > That's the thing; I'm not talking about nohz_full. I'm talking about
> > > regular nohz. World of difference there.
> >
> > And indeed, for !nohz_full CPUs, the tick continues throughout userspace
> > execution. But you really did have ct_user_enter() and ct_user_exit()
> > on your list.
> >
> > And for idle (as opposed to nohz_full userspace execution), there is still
> > ct_{idle,irq,nmi}_enter(). And RCU does pay attention to these.
> >
> > So exactly what are you trying to tell me here? ;-)
>
> That RCU doesn't have a nohz callback -- you were arguing it does
> through the ct_*_enter() things, and I said none of them are related to
> nohz.

OK, once again, I will bite...

How are ct_idle_enter(), ct_irq_enter() from idle, and ct_nmi_enter()
again from idle unrelated to nohz?

Or, for that matter, rcu_needs_cpu()?

Thanx, Paul

2022-09-29 20:01:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Thu, Sep 29, 2022 at 09:08:42PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 09:36:24AM -0700, Paul E. McKenney wrote:
>
> > > How has this been tried; and why did the energy cost go up? Is this
> > > because the offload thread ends up waking up the CPU we just put to
> > > sleep?
> >
> > Because doing the additional work consumes energy. I am not clear on
> > exactly what you are asking for here, given the limitations of the tools
> > that measure energy consumption.
>
> What additional work? Splicing the cpu pending list onto another list
> with or without atomic op barely qualifies for work. The main point is
> making sure the pending list isn't in the way of going (deep) idle.

Very good. Send a patch.

After some time, its successor might correctly handle lock/memory
contention, CPU hotplug, presumed upcoming runtime changes in CPUs'
housekeeping status, frequent idle entry/exit, grace period begin/end,
quiet embedded systems, and so on.

Then we can see if it actually reduces power consumption.

Thanx, Paul

2022-09-29 20:20:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Thu, Sep 29, 2022 at 09:36:24AM -0700, Paul E. McKenney wrote:

> > How has this been tried; and why did the energy cost go up? Is this
> > because the offload thread ends up waking up the CPU we just put to
> > sleep?
>
> Because doing the additional work consumes energy. I am not clear on
> exactly what you are asking for here, given the limitations of the tools
> that measure energy consumption.

What additional work? Splicing the cpu pending list onto another list
with or without atomic op barely qualifies for work. The main point is
making sure the pending list isn't in the way of going (deep) idle.

2022-09-30 15:17:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: RCU vs NOHZ

On Thu, Sep 29, 2022 at 12:56:41PM -0700, Paul E. McKenney wrote:
> On Thu, Sep 29, 2022 at 09:08:42PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 29, 2022 at 09:36:24AM -0700, Paul E. McKenney wrote:
> >
> > > > How has this been tried; and why did the energy cost go up? Is this
> > > > because the offload thread ends up waking up the CPU we just put to
> > > > sleep?
> > >
> > > Because doing the additional work consumes energy. I am not clear on
> > > exactly what you are asking for here, given the limitations of the tools
> > > that measure energy consumption.
> >
> > What additional work? Splicing the cpu pending list onto another list
> > with or without atomic op barely qualifies for work. The main point is
> > making sure the pending list isn't in the way of going (deep) idle.
>
> Very good. Send a patch.
>
> After some time, its successor might correctly handle lock/memory
> contention, CPU hotplug, presumed upcoming runtime changes in CPUs'
> housekeeping status, frequent idle entry/exit, grace period begin/end,
> quiet embedded systems, and so on.
>
> Then we can see if it actually reduces power consumption.

Another approach is to runtime-offload CPUs that have been mostly idle,
and to de-offload them again during busy periods.
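
A minimal sketch of that direction -- untested, and assuming
CONFIG_RCU_NOCB_CPU with the CPUs made nocb-capable at boot (rcu_nocbs=)
so that the existing rcu_nocb_cpu_offload()/rcu_nocb_cpu_deoffload()
interfaces apply; the idleness heuristic, threshold, and scan period are
all invented for illustration:

#include <linux/cpu.h>
#include <linux/jiffies.h>
#include <linux/percpu.h>
#include <linux/rcupdate.h>
#include <linux/tick.h>
#include <linux/workqueue.h>

#define MOSTLY_IDLE_US	(900 * 1000)		/* 90% of a 1-second window */
#define SCAN_PERIOD	msecs_to_jiffies(1000)

static DEFINE_PER_CPU(u64, prev_idle_us);

static void rcu_offload_scan(struct work_struct *work);
static DECLARE_DELAYED_WORK(rcu_offload_work, rcu_offload_scan);

/* Kicked off once from an __init hook (not shown); reschedules itself. */
static void rcu_offload_scan(struct work_struct *work)
{
	int cpu;

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		/* Returns -1 when NOHZ is inactive; skip in that case. */
		u64 idle_us = get_cpu_idle_time_us(cpu, NULL);

		if (idle_us == (u64)-1)
			continue;
		/* Errors from the (de)offload calls are ignored for brevity. */
		if (idle_us - per_cpu(prev_idle_us, cpu) > MOSTLY_IDLE_US)
			rcu_nocb_cpu_offload(cpu);	/* mostly idle */
		else
			rcu_nocb_cpu_deoffload(cpu);	/* busy: keep cbs local */
		per_cpu(prev_idle_us, cpu) = idle_us;
	}
	cpus_read_unlock();
	schedule_delayed_work(&rcu_offload_work, SCAN_PERIOD);
}

Whether this saves real energy is of course the same question as above:
the rcuo kthread wakeups have to come out cheaper than the ticks and
local callback processing they displace.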

Thanx, Paul