LinuxLists.cc - Reconciling rcu_irq_enter()/rcu_nmi

2015-07-17 01:53:38

Subject: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

For reasons that mystify me a bit, we currently track context tracking
state separately from rcu's watching state. This results in strange
artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we
can nest exceptions inside the IRQ handler (an example would be
wrmsr_safe failing), and, in -next, we splat a warning:

https://gist.github.com/sashalevin/a006a44989312f6835e7

I'm trying to make context tracking more exact, which will fix this
issue (the particular splat that Sasha hit shouldn't be possible when
I'm done), but I think it would be nice to unify all of this stuff.
Would it be plausible for us to guarantee that RCU state is always in
sync with context tracking state? If so, we could maybe simplify
things and have fewer state variables.

Doing this for NMIs might be weird. Would it make sense to have a
CONTEXT_NMI that's somehow valid even if the NMI happened while
changing context tracking state?

Thoughts? As it stands, I think we might already be broken for real:

Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does
copy_from_user_nmi, which can fault, causing do_page_fault to get
called, which calls exception_enter(), which can't be a good thing.

RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.
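
A minimal userspace sketch of the sequence above, as a toy model only.
It assumes user_exit() tells RCU first and updates the context tracking
state second, and it reduces RCU's view to a single boolean. The real
RCU code uses nesting counters, and rcu_nmi_enter()/rcu_nmi_exit() are
not modeled, so the real outcome may differ. The struct and the two
user_exit_*() helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum ctx_state { CONTEXT_KERNEL, CONTEXT_USER };

/* Toy per-CPU model: one variable per subsystem. */
struct cpu_model {
	enum ctx_state ctx;  /* context tracking's view */
	bool rcu_watching;   /* RCU's view, reduced to a boolean (assumption) */
};

/* user_exit() modeled as two steps, so something can land between them. */
static void user_exit_tell_rcu(struct cpu_model *c)  { c->rcu_watching = true; }
static void user_exit_set_state(struct cpu_model *c) { c->ctx = CONTEXT_KERNEL; }

static void user_enter(struct cpu_model *c)
{
	c->rcu_watching = false;
	c->ctx = CONTEXT_USER;
}

/* Exception entry: leave user context if context tracking says we are in it. */
static enum ctx_state exception_enter(struct cpu_model *c)
{
	enum ctx_state prev = c->ctx;

	if (prev != CONTEXT_KERNEL) {
		user_exit_tell_rcu(c);
		user_exit_set_state(c);
	}
	return prev;
}

/* Exception exit: restore whatever context was recorded on entry. */
static void exception_exit(struct cpu_model *c, enum ctx_state prev)
{
	if (prev != CONTEXT_KERNEL)
		user_enter(c);
}

/* Syscall entry where a fault inside an NMI lands in the middle of user_exit(). */
static void syscall_entry_with_nmi_fault(struct cpu_model *c)
{
	enum ctx_state prev;

	assert(c->ctx == CONTEXT_USER);  /* the scenario starts in user mode */
	user_exit_tell_rcu(c);           /* first half of user_exit() */
	prev = exception_enter(c);       /* the fault still sees CONTEXT_USER */
	exception_exit(c, prev);         /* and "returns to user" on the way out */
	user_exit_set_state(c);          /* second half of the interrupted user_exit() */
}
```

In this toy model the CPU ends up in CONTEXT_KERNEL with rcu_watching
false, so the two views disagree once the syscall path continues.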

Thoughts? As it stands, I need to do something because -tip and thus
-next spews occasional warnings.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2015-07-17 04:29:20

by Paul E. McKenney

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote:
> For reasons that mystify me a bit, we currently track context tracking
> state separately from rcu's watching state. This results in strange
> artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we
> can nest exceptions inside the IRQ handler (an example would be
> wrmsr_safe failing), and, in -next, we splat a warning:
>
> https://gist.github.com/sashalevin/a006a44989312f6835e7
>
> I'm trying to make context tracking more exact, which will fix this
> issue (the particular splat that Sasha hit shouldn't be possible when
> I'm done), but I think it would be nice to unify all of this stuff.
> Would it be plausible for us to guarantee that RCU state is always in
> sync with context tracking state? If so, we could maybe simplify
> things and have fewer state variables.

A noble goal. Might even be possible, and maybe even advantageous.

But it is usually easier to say than to do. RCU really does need to make
some adjustments when the state changes, as do the other subsystems.
It might or might not be possible to do the transitions atomically.
And if the transitions are not atomic, there will still be weird code
paths where (say) the processor is considered non-idle, but RCU doesn't
realize it yet. Such a code path could not safely use rcu_read_lock(),
so you still need RCU to be able to scream if someone tries it.
Contrariwise, if there is a code path where the processor is considered
idle, but RCU thinks it is non-idle, that code path can stall
grace periods. (Yes, not a problem if the code path is short enough.
At least if the underlying VCPU is making progress...)

Still, I cannot prove that it is impossible, and if it is possible,
then as you say, there might well be benefits.

> Doing this for NMIs might be weird. Would it make sense to have a
> CONTEXT_NMI that's somehow valid even if the NMI happened while
> changing context tracking state?

Face it, NMIs are weird. ;-)

> Thoughts? As it stands, I think we might already be broken for real:
>
> Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does
> copy_from_user_nmi, which can fault, causing do_page_fault to get
> called, which calls exception_enter(), which can't be a good thing.
>
> RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.

Actually, I see more cases where people forget irq_enter() than
rcu_nmi_enter(). "We will just nip in quickly and do something without
actually letting the irq system know. Oh, and we want some event tracing
in that code path." Boom!

> Thoughts? As it stands, I need to do something because -tip and thus
> -next spews occasional warnings.

Tell me more?

Thanx, Paul

2015-07-17 04:49:35

by Paul E. McKenney

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote:
> > For reasons that mystify me a bit, we currently track context tracking
> > state separately from rcu's watching state. This results in strange
> > artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we
> > can nest exceptions inside the IRQ handler (an example would be
> > wrmsr_safe failing), and, in -next, we splat a warning:
> >
> > https://gist.github.com/sashalevin/a006a44989312f6835e7
> >
> > I'm trying to make context tracking more exact, which will fix this
> > issue (the particular splat that Sasha hit shouldn't be possible when
> > I'm done), but I think it would be nice to unify all of this stuff.
> > Would it be plausible for us to guarantee that RCU state is always in
> > sync with context tracking state? If so, we could maybe simplify
> > things and have fewer state variables.
>
> A noble goal. Might even be possible, and maybe even advantageous.
>
> But it is usually easier to say than to do. RCU really does need to make
> some adjustments when the state changes, as do the other subsystems.
> It might or might not be possible to do the transitions atomically.
> And if the transitions are not atomic, there will still be weird code
> paths where (say) the processor is considered non-idle, but RCU doesn't
> realize it yet. Such a code path could not safely use rcu_read_lock(),
> so you still need RCU to be able to scream if someone tries it.
> Contrariwise, if there is a code path where the processor is considered
> idle, but RCU thinks it is non-idle, that code path can stall
> grace periods. (Yes, not a problem if the code path is short enough.
> At least if the underlying VCPU is making progress...)
>
> Still, I cannot prove that it is impossible, and if it is possible,
> then as you say, there might well be benefits.
>
> > Doing this for NMIs might be weird. Would it make sense to have a
> > CONTEXT_NMI that's somehow valid even if the NMI happened while
> > changing context tracking state?
>
> Face it, NMIs are weird. ;-)
>
> > Thoughts? As it stands, I think we might already be broken for real:
> >
> > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does
> > copy_from_user_nmi, which can fault, causing do_page_fault to get
> > called, which calls exception_enter(), which can't be a good thing.
> >
> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.
>
> Actually, I see more cases where people forget irq_enter() than
> rcu_nmi_enter(). "We will just nip in quickly and do something without
> actually letting the irq system know. Oh, and we want some event tracing
> in that code path." Boom!
>
> > Thoughts? As it stands, I need to do something because -tip and thus
> > -next spews occasional warnings.
>
> Tell me more?

And for completeness, RCU also has the following requirements on the
state-transition mechanism:

1. It must be possible to reliably sample some other CPU's state.
This is an energy-efficiency requirement, as RCU is not normally
permitted to wake up idle CPUs. Nor nohz CPUs, for that matter.

2. RCU must be able to track passage through idle and nohz states.
In other words, if RCU samples at t=0 and finds that the CPU
is executing (say) in kernel mode, and RCU samples again at
t=10 and again finds that the CPU is executing in kernel mode,
RCU needs to be able to determine whether or not that CPU passed
through idle or nohz betweentimes.

3. In some configurations, RCU needs to be able to block entry into
nohz state, both for idle and userspace.

Probably others as well...
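
Requirements 1 and 2 can be sketched in userspace with a per-CPU
counter that is bumped on every entry to and exit from idle/nohz, so
that its parity gives the current state and any change reveals an
intervening idle period. This is a minimal sketch under stated
assumptions, not the kernel's implementation: the struct and function
names are hypothetical, "odd means non-idle" is a convention chosen
here, and requirement 3 (blocking entry into nohz) is not covered.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU record. Odd = non-idle, even = idle/nohz. */
struct cpu_eqs {
	atomic_uint counter;
};

static void cpu_eqs_init(struct cpu_eqs *c)
{
	atomic_init(&c->counter, 1);  /* start out non-idle */
}

/* Called by the CPU itself when it enters idle or nohz. */
static void eqs_enter(struct cpu_eqs *c)
{
	unsigned int old = atomic_fetch_add(&c->counter, 1);

	assert(old & 1u);  /* must have been non-idle */
	(void)old;
}

/* Called by the CPU itself when it leaves idle or nohz. */
static void eqs_exit(struct cpu_eqs *c)
{
	unsigned int old = atomic_fetch_add(&c->counter, 1);

	assert(!(old & 1u));  /* must have been idle */
	(void)old;
}

/* Requirement 1: another CPU reads the counter without waking this one. */
static unsigned int eqs_snapshot(struct cpu_eqs *c)
{
	return atomic_load(&c->counter);
}

static bool snapshot_is_idle(unsigned int snap)
{
	return (snap & 1u) == 0;
}

/*
 * Requirement 2: true if the CPU was idle at the snapshot or has passed
 * through idle since, even if it is non-idle at both sample times.
 */
static bool idle_at_or_since(struct cpu_eqs *c, unsigned int snap)
{
	return snapshot_is_idle(snap) || eqs_snapshot(c) != snap;
}
```

A sampler would call eqs_snapshot() at t=0 and idle_at_or_since() at
t=10; two "non-idle" readings with different counter values mean the
CPU went idle in between.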

Thanx, Paul

2015-07-17 18:59:41

by Andy Lutomirski

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Thu, Jul 16, 2015 at 9:49 PM, Paul E. McKenney wrote:
> On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote:
>> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote:
>> > For reasons that mystify me a bit, we currently track context tracking
>> > state separately from rcu's watching state. This results in strange
>> > artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we
>> > can nest exceptions inside the IRQ handler (an example would be
>> > wrmsr_safe failing), and, in -next, we splat a warning:
>> >
>> > https://gist.github.com/sashalevin/a006a44989312f6835e7
>> >
>> > I'm trying to make context tracking more exact, which will fix this
>> > issue (the particular splat that Sasha hit shouldn't be possible when
>> > I'm done), but I think it would be nice to unify all of this stuff.
>> > Would it be plausible for us to guarantee that RCU state is always in
>> > sync with context tracking state? If so, we could maybe simplify
>> > things and have fewer state variables.
>>
>> A noble goal. Might even be possible, and maybe even advantageous.
>>
>> But it is usually easier to say than to do. RCU really does need to make
>> some adjustments when the state changes, as do the other subsystems.
>> It might or might not be possible to do the transitions atomically.
>> And if the transitions are not atomic, there will still be weird code
>> paths where (say) the processor is considered non-idle, but RCU doesn't
>> realize it yet. Such a code path could not safely use rcu_read_lock(),
>> so you still need RCU to be able to scream if someone tries it.
>> Contrariwise, if there is a code path where the processor is considered
>> idle, but RCU thinks it is non-idle, that code path can stall
>> grace periods. (Yes, not a problem if the code path is short enough.
>> At least if the underlying VCPU is making progress...)
>>
>> Still, I cannot prove that it is impossible, and if it is possible,
>> then as you say, there might well be benefits.
>>
>> > Doing this for NMIs might be weird. Would it make sense to have a
>> > CONTEXT_NMI that's somehow valid even if the NMI happened while
>> > changing context tracking state?
>>
>> Face it, NMIs are weird. ;-)
>>
>> > Thoughts? As it stands, I think we might already be broken for real:
>> >
>> > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does
>> > copy_from_user_nmi, which can fault, causing do_page_fault to get
>> > called, which calls exception_enter(), which can't be a good thing.
>> >
>> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.
>>
>> Actually, I see more cases where people forget irq_enter() than
>> rcu_nmi_enter(). "We will just nip in quickly and do something without
>> actually letting the irq system know. Oh, and we want some event tracing
>> in that code path." Boom!
>>
>> > Thoughts? As it stands, I need to do something because -tip and thus
>> > -next spews occasional warnings.
>>
>> Tell me more?
>
> And for completeness, RCU also has the following requirements on the
> state-transition mechanism:
>
> 1. It must be possible to reliably sample some other CPU's state.
> This is an energy-efficiency requirement, as RCU is not normally
> permitted to wake up idle CPUs. Nor nohz CPUs, for that matter.

NOHZ needs this for vtime accounting, too. I think Rik might be
thinking about this. Maybe the underlying state could be shared?

>
> 2. RCU must be able to track passage through idle and nohz states.
> In other words, if RCU samples at t=0 and finds that the CPU
> is executing (say) in kernel mode, and RCU samples again at
> t=10 and again finds that the CPU is executing in kernel mode,
> RCU needs to be able to determine whether or not that CPU passed
> through idle or nohz betweentimes.

And RCU can do this for CONTEXT_KERNEL vs CONTEXT_USER because the
context tracking stuff notifies RCU. The thing I'm less than happy
with is that we can currently be CONTEXT_USER but still rcu-awake.
This is manageable, but it seems messy.

>
> 3. In some configurations, RCU needs to be able to block entry into
> nohz state, both for idle and userspace.
>

Hmm. I suppose we could be CONTEXT_USER but still have RCU awake,
although the tick would have to stay on.

Grumble.

--Andy

2015-07-17 20:12:23

by Paul E. McKenney

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 11:59:18AM -0700, Andy Lutomirski wrote:
> On Thu, Jul 16, 2015 at 9:49 PM, Paul E. McKenney wrote:
> > On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote:
> >> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote:
> >> > For reasons that mystify me a bit, we currently track context tracking
> >> > state separately from rcu's watching state. This results in strange
> >> > artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we
> >> > can nest exceptions inside the IRQ handler (an example would be
> >> > wrmsr_safe failing), and, in -next, we splat a warning:
> >> >
> >> > https://gist.github.com/sashalevin/a006a44989312f6835e7
> >> >
> >> > I'm trying to make context tracking more exact, which will fix this
> >> > issue (the particular splat that Sasha hit shouldn't be possible when
> >> > I'm done), but I think it would be nice to unify all of this stuff.
> >> > Would it be plausible for us to guarantee that RCU state is always in
> >> > sync with context tracking state? If so, we could maybe simplify
> >> > things and have fewer state variables.
> >>
> >> A noble goal. Might even be possible, and maybe even advantageous.
> >>
> >> But it is usually easier to say than to do. RCU really does need to make
> >> some adjustments when the state changes, as do the other subsystems.
> >> It might or might not be possible to do the transitions atomically.
> >> And if the transitions are not atomic, there will still be weird code
> >> paths where (say) the processor is considered non-idle, but RCU doesn't
> >> realize it yet. Such a code path could not safely use rcu_read_lock(),
> >> so you still need RCU to be able to scream if someone tries it.
> >> Contrariwise, if there is a code path where the processor is considered
> >> idle, but RCU thinks it is non-idle, that code path can stall
> >> grace periods. (Yes, not a problem if the code path is short enough.
> >> At least if the underlying VCPU is making progress...)
> >>
> >> Still, I cannot prove that it is impossible, and if it is possible,
> >> then as you say, there might well be benefits.
> >>
> >> > Doing this for NMIs might be weird. Would it make sense to have a
> >> > CONTEXT_NMI that's somehow valid even if the NMI happened while
> >> > changing context tracking state?
> >>
> >> Face it, NMIs are weird. ;-)
> >>
> >> > Thoughts? As it stands, I think we might already be broken for real:
> >> >
> >> > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does
> >> > copy_from_user_nmi, which can fault, causing do_page_fault to get
> >> > called, which calls exception_enter(), which can't be a good thing.
> >> >
> >> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.
> >>
> >> Actually, I see more cases where people forget irq_enter() than
> >> rcu_nmi_enter(). "We will just nip in quickly and do something without
> >> actually letting the irq system know. Oh, and we want some event tracing
> >> in that code path." Boom!
> >>
> >> > Thoughts? As it stands, I need to do something because -tip and thus
> >> > -next spews occasional warnings.
> >>
> >> Tell me more?
> >
> > And for completeness, RCU also has the following requirements on the
> > state-transition mechanism:
> >
> > 1. It must be possible to reliably sample some other CPU's state.
> > This is an energy-efficiency requirement, as RCU is not normally
> > permitted to wake up idle CPUs. Nor nohz CPUs, for that matter.
>
> NOHZ needs this for vtime accounting, too. I think Rik might be
> thinking about this. Maybe the underlying state could be shared?

From what I understand, what Rik is looking at is accounting information,
which is a different type of state. And a type of state where some
approximation is just fine. Try that with RCU, and you will approximate
yourself into a segfault.

> > 2. RCU must be able to track passage through idle and nohz states.
> > In other words, if RCU samples at t=0 and finds that the CPU
> > is executing (say) in kernel mode, and RCU samples again at
> > t=10 and again finds that the CPU is executing in kernel mode,
> > RCU needs to be able to determine whether or not that CPU passed
> > through idle or nohz betweentimes.
>
> And RCU can do this for CONTEXT_KERNEL vs CONTEXT_USER because the
> context tracking stuff notifies RCU. The thing I'm less than happy
> with is that we can currently be CONTEXT_USER but still rcu-awake.
> This is manageable, but it seems messy.

Well, if you don't have CONFIG_NO_HZ_FULL, there normally isn't context
tracking, so RCU cannot see CONTEXT_USER. Or are you thinking of making
context tracking unconditional? (The tinification guys might have some
opinions on this.)

> > 3. In some configurations, RCU needs to be able to block entry into
> > nohz state, both for idle and userspace.
>
> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake,
> although the tick would have to stay on.

Right, there are situations where RCU needs a given CPU to keep the tick
going, for example, when there are RCU callbacks queued on that CPU.
Failing to keep the tick going could result in a system hang, because
that callback might never be invoked. Of course, something or another
will normally eventually disturb the CPU, but the resulting huge delay
would not be good. And on deep embedded systems, it is quite possible
that the CPU would go for a good long time without being disturbed.
(This is not just a theoretical possibility, and I have the scars to
prove it.)
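
The rule described above can be sketched as a predicate over a per-CPU
callback list: the tick may stop only when nothing is queued. This is
a minimal userspace sketch; the struct and function names are
hypothetical, and a real kernel has additional reasons to keep the
tick running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queued callback. */
struct pending_cb {
	struct pending_cb *next;
	void (*func)(struct pending_cb *cb);
};

/* Hypothetical per-CPU list of callbacks waiting for a grace period. */
struct cpu_cblist {
	struct pending_cb *head;
	struct pending_cb **tail;
};

static void cblist_init(struct cpu_cblist *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

/* Append a callback to this CPU's list. */
static void cblist_enqueue(struct cpu_cblist *l, struct pending_cb *cb,
			   void (*func)(struct pending_cb *cb))
{
	assert(cb != NULL && func != NULL);
	cb->next = NULL;
	cb->func = func;
	*l->tail = cb;
	l->tail = &cb->next;
}

/*
 * With callbacks queued, stopping the tick could leave them waiting
 * until something else happens to disturb the CPU.
 */
static bool cpu_may_stop_tick(const struct cpu_cblist *l)
{
	return l->head == NULL;
}

/* Placeholder callback body for the usage example. */
static void noop_cb(struct pending_cb *cb)
{
	(void)cb;
}
```

An idle or nohz entry path would consult cpu_may_stop_tick() before
turning the tick off, and keep it on while the list is non-empty.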

And there is this one as well:

4. In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace
context differently than idle context, and still needs to be
able to take two samples and determine if the CPU ever went idle
(and only idle, not userspace) betweentimes.

> Grumble.

Welcome to my world! ;-)

Thanx, Paul

2015-07-17 20:32:50

by Andy Lutomirski

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 1:12 PM, Paul E. McKenney wrote:
> On Fri, Jul 17, 2015 at 11:59:18AM -0700, Andy Lutomirski wrote:
>> On Thu, Jul 16, 2015 at 9:49 PM, Paul E. McKenney wrote:
>> > On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote:
>> >> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote:
>> >> > For reasons that mystify me a bit, we currently track context tracking
>> >> > state separately from rcu's watching state. This results in strange
>> >> > artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we
>> >> > can nest exceptions inside the IRQ handler (an example would be
>> >> > wrmsr_safe failing), and, in -next, we splat a warning:
>> >> >
>> >> > https://gist.github.com/sashalevin/a006a44989312f6835e7
>> >> >
>> >> > I'm trying to make context tracking more exact, which will fix this
>> >> > issue (the particular splat that Sasha hit shouldn't be possible when
>> >> > I'm done), but I think it would be nice to unify all of this stuff.
>> >> > Would it be plausible for us to guarantee that RCU state is always in
>> >> > sync with context tracking state? If so, we could maybe simplify
>> >> > things and have fewer state variables.
>> >>
>> >> A noble goal. Might even be possible, and maybe even advantageous.
>> >>
>> >> But it is usually easier to say than to do. RCU really does need to make
>> >> some adjustments when the state changes, as do the other subsystems.
>> >> It might or might not be possible to do the transitions atomically.
>> >> And if the transitions are not atomic, there will still be weird code
>> >> paths where (say) the processor is considered non-idle, but RCU doesn't
>> >> realize it yet. Such a code path could not safely use rcu_read_lock(),
>> >> so you still need RCU to be able to scream if someone tries it.
>> >> Contrariwise, if there is a code path where the processor is considered
>> >> idle, but RCU thinks it is non-idle, that code path can stall
>> >> grace periods. (Yes, not a problem if the code path is short enough.
>> >> At least if the underlying VCPU is making progress...)
>> >>
>> >> Still, I cannot prove that it is impossible, and if it is possible,
>> >> then as you say, there might well be benefits.
>> >>
>> >> > Doing this for NMIs might be weird. Would it make sense to have a
>> >> > CONTEXT_NMI that's somehow valid even if the NMI happened while
>> >> > changing context tracking state?
>> >>
>> >> Face it, NMIs are weird. ;-)
>> >>
>> >> > Thoughts? As it stands, I think we might already be broken for real:
>> >> >
>> >> > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does
>> >> > copy_from_user_nmi, which can fault, causing do_page_fault to get
>> >> > called, which calls exception_enter(), which can't be a good thing.
>> >> >
>> >> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.
>> >>
>> >> Actually, I see more cases where people forget irq_enter() than
>> >> rcu_nmi_enter(). "We will just nip in quickly and do something without
>> >> actually letting the irq system know. Oh, and we want some event tracing
>> >> in that code path." Boom!
>> >>
>> >> > Thoughts? As it stands, I need to do something because -tip and thus
>> >> > -next spews occasional warnings.
>> >>
>> >> Tell me more?
>> >
>> > And for completeness, RCU also has the following requirements on the
>> > state-transition mechanism:
>> >
>> > 1. It must be possible to reliably sample some other CPU's state.
>> > This is an energy-efficiency requirement, as RCU is not normally
>> > permitted to wake up idle CPUs. Nor nohz CPUs, for that matter.
>>
>> NOHZ needs this for vtime accounting, too. I think Rik might be
>> thinking about this. Maybe the underlying state could be shared?
>
> From what I understand, what Rik is looking at is accounting information,
> which is a different type of state. And a type of state where some
> approximation is just fine. Try that with RCU, and you will approximate
> yourself into a segfault.

True. But context tracking wouldn't object to being exact. And I
think we need context tracking to treat user mode as quiescent, so
they're at least related.

>
>> > 2. RCU must be able to track passage through idle and nohz states.
>> > In other words, if RCU samples at t=0 and finds that the CPU
>> > is executing (say) in kernel mode, and RCU samples again at
>> > t=10 and again finds that the CPU is executing in kernel mode,
>> > RCU needs to be able to determine whether or not that CPU passed
>> > through idle or nohz betweentimes.
>>
>> And RCU can do this for CONTEXT_KERNEL vs CONTEXT_USER because the
>> context tracking stuff notifies RCU. The thing I'm less than happy
>> with is that we can currently be CONTEXT_USER but still rcu-awake.
>> This is manageable, but it seems messy.
>
> Well, if you don't have CONFIG_NO_HZ_FULL, there normally isn't context
> tracking, so RCU cannot see CONTEXT_USER. Or are you thinking of making
> context tracking unconditional? (The tinification guys might have some
> opinions on this.)

Without context tracking, user mode is not RCU idle, right? Instead
we have timer ticks. We could get away with a very minimal context
tracking implementation that just tracked CONTEXT_IDLE and
CONTEXT_KERNEL. (Hmm, there is no CONTEXT_IDLE right now. Further
grumbling.)

>
>> > 3. In some configurations, RCU needs to be able to block entry into
>> > nohz state, both for idle and userspace.
>>
>> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake,
>> although the tick would have to stay on.
>
> Right, there are situations where RCU needs a given CPU to keep the tick
> going, for example, when there are RCU callbacks queued on that CPU.
> Failing to keep the tick going could result in a system hang, because
> that callback might never be invoked.

Can't we just fire the callbacks right away? We should only go
RCU-idle from reasonable contexts. In fact, the nohz crowd likes the
idea of nohz userspace being absolutely no hz.

NMIs can't queue RCU callbacks, right? (I hope!) As of a couple
releases ago, on x86, we *always* have a clean state when
transitioning from any non-NMI kernel context to user mode.

> Of course, something or another
> will normally eventually disturb the CPU, but the resulting huge delay
> would not be good. And on deep embedded systems, it is quite possible
> that the CPU would go for a good long time without being disturbed.
> (This is not just a theoretical possibility, and I have the scars to
> prove it.)
>
> And there is this one as well:
>
> 4. In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace
> context differently than idle context, and still needs to be
> able to take two samples and determine if the CPU ever went idle
> (and only idle, not userspace) betweentimes.

If the context tracking code, or whatever the hook is, tracked the
number of transitions out of user mode, that would do it, right?
We're talking literally a single per-cpu increment on user entry
and/or exit, I think.
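
A rough userspace sketch of that idea, assuming RCU already keeps one
counter that is bumped on every entry to and exit from idle or user
mode, and that the extra increment goes into a second counter bumped
only on user entry and exit. The difference between the two then
counts idle transitions alone. All names are hypothetical, and the
sketch is single-threaded: a real remote sampler would have to cope
with reading the two counters at slightly different times.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU counters; both start at 0 with the CPU in kernel mode. */
struct cpu_counts {
	atomic_uint eqs_seq;   /* bumped on entry to and exit from idle OR user */
	atomic_uint user_seq;  /* bumped on entry to and exit from user mode only */
};

struct sample {
	unsigned int eqs_seq;
	unsigned int user_seq;
};

static void cpu_counts_init(struct cpu_counts *c)
{
	atomic_init(&c->eqs_seq, 0);
	atomic_init(&c->user_seq, 0);
}

static struct sample take_sample(struct cpu_counts *c)
{
	struct sample s = { atomic_load(&c->eqs_seq), atomic_load(&c->user_seq) };

	return s;
}

/* Idle-only transition count; odd means the CPU is idle at the sample. */
static unsigned int idle_seq(struct sample s)
{
	return s.eqs_seq - s.user_seq;  /* unsigned wraparound is fine here */
}

static void idle_enter(struct cpu_counts *c)
{
	assert((idle_seq(take_sample(c)) & 1u) == 0);  /* not already idle */
	atomic_fetch_add(&c->eqs_seq, 1);
}

static void idle_exit(struct cpu_counts *c)
{
	atomic_fetch_add(&c->eqs_seq, 1);
}

/* The proposed extra cost: one more per-CPU increment on each user transition. */
static void user_enter(struct cpu_counts *c)
{
	atomic_fetch_add(&c->user_seq, 1);
	atomic_fetch_add(&c->eqs_seq, 1);
}

static void user_exit(struct cpu_counts *c)
{
	atomic_fetch_add(&c->user_seq, 1);
	atomic_fetch_add(&c->eqs_seq, 1);
}

/* True if the CPU was idle (and only idle, not user) at "then" or in between. */
static bool idle_between(struct sample then, struct sample now)
{
	return (idle_seq(then) & 1u) || idle_seq(then) != idle_seq(now);
}
```

Trips through user mode move both counters together and leave
idle_seq() unchanged, so only real idle periods make idle_between()
return true.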

--Andy

2015-07-17 21:19:50

by Paul E. McKenney

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 17, 2015 at 1:12 PM, Paul E. McKenney wrote:
> > On Fri, Jul 17, 2015 at 11:59:18AM -0700, Andy Lutomirski wrote:
> >> On Thu, Jul 16, 2015 at 9:49 PM, Paul E. McKenney wrote:
> >> > On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote:
> >> >> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote:
> >> >> > For reasons that mystify me a bit, we currently track context tracking
> >> >> > state separately from rcu's watching state. This results in strange
> >> >> > artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we
> >> >> > can nest exceptions inside the IRQ handler (an example would be
> >> >> > wrmsr_safe failing), and, in -next, we splat a warning:
> >> >> >
> >> >> > https://gist.github.com/sashalevin/a006a44989312f6835e7
> >> >> >
> >> >> > I'm trying to make context tracking more exact, which will fix this
> >> >> > issue (the particular splat that Sasha hit shouldn't be possible when
> >> >> > I'm done), but I think it would be nice to unify all of this stuff.
> >> >> > Would it be plausible for us to guarantee that RCU state is always in
> >> >> > sync with context tracking state? If so, we could maybe simplify
> >> >> > things and have fewer state variables.
> >> >>
> >> >> A noble goal. Might even be possible, and maybe even advantageous.
> >> >>
> >> >> But it is usually easier to say than to do. RCU really does need to make
> >> >> some adjustments when the state changes, as do the other subsystems.
> >> >> It might or might not be possible to do the transitions atomically.
> >> >> And if the transitions are not atomic, there will still be weird code
> >> >> paths where (say) the processor is considered non-idle, but RCU doesn't
> >> >> realize it yet. Such a code path could not safely use rcu_read_lock(),
> >> >> so you still need RCU to be able to scream if someone tries it.
> >> >> Contrariwise, if there is a code path where the processor is considered
> >> >> idle, but RCU thinks it is non-idle, that code path can stall
> >> >> grace periods. (Yes, not a problem if the code path is short enough.
> >> >> At least if the underlying VCPU is making progress...)
> >> >>
> >> >> Still, I cannot prove that it is impossible, and if it is possible,
> >> >> then as you say, there might well be benefits.
> >> >>
> >> >> > Doing this for NMIs might be weird. Would it make sense to have a
> >> >> > CONTEXT_NMI that's somehow valid even if the NMI happened while
> >> >> > changing context tracking state?
> >> >>
> >> >> Face it, NMIs are weird. ;-)
> >> >>
> >> >> > Thoughts? As it stands, I think we might already be broken for real:
> >> >> >
> >> >> > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does
> >> >> > copy_from_user_nmi, which can fault, causing do_page_fault to get
> >> >> > called, which calls exception_enter(), which can't be a good thing.
> >> >> >
> >> >> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.
> >> >>
> >> >> Actually, I see more cases where people forget irq_enter() than
> >> >> rcu_nmi_enter(). "We will just nip in quickly and do something without
> >> >> actually letting the irq system know. Oh, and we want some event tracing
> >> >> in that code path." Boom!
> >> >>
> >> >> > Thoughts? As it stands, I need to do something because -tip and thus
> >> >> > -next spews occasional warnings.
> >> >>
> >> >> Tell me more?
> >> >
> >> > And for completeness, RCU also has the following requirements on the
> >> > state-transition mechanism:
> >> >
> >> > 1. It must be possible to reliably sample some other CPU's state.
> >> > This is an energy-efficiency requirement, as RCU is not normally
> >> > permitted to wake up idle CPUs. Nor nohz CPUs, for that matter.
> >>
> >> NOHZ needs this for vtime accounting, too. I think Rik might be
> >> thinking about this. Maybe the underlying state could be shared?
> >
> > From what I understand, what Rik is looking at is accounting information,
> > which is a different type of state. And a type of state where some
> > approximation is just fine. Try that with RCU, and you will approximate
> > yourself into a segfault.
>
> True. But context tracking wouldn't object to being exact. And I
> think we need context tracking to treat user mode as quiescent, so
> they're at least related.

And RCU would be happy to be able to always detect usermode execution.
But there are configurations and architectures that exclude context
tracking, which means that RCU has to roll its own in those cases.

> >> > 2. RCU must be able to track passage through idle and nohz states.
> >> > In other words, if RCU samples at t=0 and finds that the CPU
> >> > is executing (say) in kernel mode, and RCU samples again at
> >> > t=10 and again finds that the CPU is executing in kernel mode,
> >> > RCU needs to be able to determine whether or not that CPU passed
> >> > through idle or nohz betweentimes.
> >>
> >> And RCU can do this for CONTEXT_KERNEL vs CONTEXT_USER because the
> >> context tracking stuff notifies RCU. The thing I'm less than happy
> >> with is that we can currently be CONTEXT_USER but still rcu-awake.
> >> This is manageable, but it seems messy.
> >
> > Well, if you don't have CONFIG_NO_HZ_FULL, there normally isn't context
> > tracking, so RCU cannot see CONTEXT_USER. Or are you thinking of making
> > context tracking unconditional? (The tinification guys might have some
> > opinions on this.)
>
> Without context tracking, user mode is not RCU idle, right? Instead
> we have timer ticks. We could get away with a very minimal context
> tracking implementation that just tracked CONTEXT_IDLE and
> CONTEXT_KERNEL. (Hmm, there is no CONTEXT_IDLE right now. Further
> grumbling.)

Without context tracking, usermode is still RCU idle. However, in
that case, RCU detects idle using a hook in the timer tick handler.
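
A minimal sketch of such a tick hook, in userspace C with hypothetical
names: when the tick interrupted user mode or the idle loop, the CPU
cannot have been inside a kernel read-side critical section, so the
hook records a quiescent state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU bookkeeping for this sketch. */
struct cpu_tick_state {
	unsigned long quiescent_states;
};

/*
 * Called from the (hypothetical) timer tick handler. from_user and
 * from_idle say which context the tick interrupted. Returns true if
 * a quiescent state was recorded.
 */
static bool tick_note_quiescent(struct cpu_tick_state *c,
				bool from_user, bool from_idle)
{
	assert(!(from_user && from_idle));  /* one interrupted context at a time */

	if (from_user || from_idle) {
		c->quiescent_states++;
		return true;
	}
	return false;  /* interrupted kernel code: no conclusion possible */
}
```

The cost of this scheme is that it only works while the tick is
running, which is the conflict with nohz userspace raised in this
thread.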

> >> > 3. In some configurations, RCU needs to be able to block entry into
> >> > nohz state, both for idle and userspace.
> >>
> >> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake,
> >> although the tick would have to stay on.
> >
> > Right, there are situations where RCU needs a given CPU to keep the tick
> > going, for example, when there are RCU callbacks queued on that CPU.
> > Failing to keep the tick going could result in a system hang, because
> > that callback might never be invoked.
>
> Can't we just fire the callbacks right away?

Absolutely not!!!

At least not on multi-CPU systems. There might be an RCU read-side
critical section on some other CPU that we still have to wait for.

> We should only go
> RCU-idle from reasonable contexts. In fact, the nohz crowd likes the
> idea of nohz userspace being absolutely no hz.

Guilty to charges as read, and this is why RCU needs to be informed
about userspace execution for nohz userspace.

> NMIs can't queue RCU callbacks, right? (I hope!) As of a couple
> releases ago, on x86, we *always* have a clean state when
> transitioning from any non-NMI kernel context to user mode on x86.

You are correct, NMIs cannot queue RCU callbacks. That would be possible,
but it also would be a bit painful and not so good for energy efficiency.
So if someone wants that, they need to have an extremely good reason. ;-)

> > Of course, something or another
> > will normally eventually disturb the CPU, but the resulting huge delay
> > would not be good. And on deep embedded systems, it is quite possible
> > that the CPU would go for a good long time without being disturbed.
> > (This is not just a theoretical possibility, and I have the scars to
> > prove it.)
> >
> > And there is this one as well:
> >
> > 4. In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace
> > context differently than idle context, and still needs to be
> > able to take two samples and determine if the CPU ever went idle
> > (and only idle, not userspace) betweentimes.
>
> If the context tracking code, or whatever the hook is, tracked the
> number of transitions out of user mode, that would do it, right?
> We're talking literally a single per-cpu increment on user entry
> and/or exit, I think.

With memory barriers, because RCU has to accurately sample the counters
remotely. I currently use full-up atomic operations, possibly only due
to paranoia, but I need to further intensify testing to trust moving away
from full-up atomic operations. In addition, RCU currently relies on a
single counter counting idle-to-RCU transitions, both userspace and idle.
It might be possible to wean RCU of this habit, or maybe have the two
counters be combined into a single word. The checks would of course be
more complex in that case, but should be doable.
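
To make the single-word idea a bit more concrete, here is a rough userspace
sketch using C11 atomics. It is not what RCU does today. The names
(eqs_counter, eqs_bump(), and friends) and the 32/32 split are made up for
illustration, and the default sequentially consistent atomics stand in for
the full-up atomic operations mentioned above. Each half is bumped on both
entry and exit, so an odd value means the CPU is currently in that state.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: low 32 bits count idle transitions,
 * high 32 bits count user transitions. Odd half == currently there. */
struct eqs_counter {
	_Atomic uint64_t word;
};

static inline uint32_t idle_half(uint64_t snap) { return (uint32_t)snap; }
static inline uint32_t user_half(uint64_t snap) { return (uint32_t)(snap >> 32); }

/* Local CPU: record one transition (entry or exit) for user or idle.
 * The CAS loop keeps a wrap of one half from carrying into the other. */
static inline void eqs_bump(struct eqs_counter *c, bool user)
{
	uint64_t old = atomic_load(&c->word);
	uint64_t next;

	do {
		uint32_t idle = idle_half(old);
		uint32_t usr = user_half(old);

		if (user)
			usr++;
		else
			idle++;
		next = ((uint64_t)usr << 32) | idle;
	} while (!atomic_compare_exchange_weak(&c->word, &old, next));
}

/* Remote CPU: take a snapshot of both counters in one load. */
static inline uint64_t eqs_snap(struct eqs_counter *c)
{
	return atomic_load(&c->word);
}

static inline bool eqs_in_eqs(uint64_t snap)
{
	return (idle_half(snap) | user_half(snap)) & 1;
}

/* Grace-period check: user and idle both count. */
static inline bool eqs_passed_through_eqs(uint64_t then, uint64_t now)
{
	return eqs_in_eqs(now) || then != now;
}

/* Sysidle-style check: only idle counts, user does not. */
static inline bool eqs_passed_through_idle(uint64_t then, uint64_t now)
{
	return (idle_half(now) & 1) || idle_half(then) != idle_half(now);
}
```

The point of the sketch is only that two samples of one word can answer
both questions, at the cost of slightly more involved checks.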

In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to
rcu_prepare_for_idle() just before incrementing the counter when
transitioning to an idle-like state. Similarly, in CONFIG_RCU_NOCB_CPU
kernels, RCU needs a call to the following in the same place:

	for_each_rcu_flavor(rsp) {
		rdp = this_cpu_ptr(rsp->rda);
		do_nocb_deferred_wakeup(rdp);
	}

On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked
in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on
transition from an idle-like state.

There is also some debug that complains if something transitions
to/from an idle-like state that shouldn't be doing so, but that could be
pulled into context tracking. (Might already be there, for all I know.)
And there is event tracing, which might be subsumed into context tracking.
See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c
for the full story.

This could all be handled by an RCU hook being invoked just before
incrementing the counter on entry to an idle-like state and just after
incrementing the counter on exit from an idle-like state.
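
Roughly, the ordering would look like the following userspace sketch.
The hook names and the eqs_count variable are hypothetical, not existing
kernel symbols. The only thing being illustrated is that both hooks run
while the counter is even, that is, while RCU is still (or again) watching
this CPU.

```c
#include <stdatomic.h>

/* Hypothetical counter: even means RCU is watching, odd means idle-like. */
static _Atomic unsigned long eqs_count;

/* Recorded by the hooks so the ordering can be checked afterwards. */
static int entry_hook_saw_watching;
static int exit_hook_saw_watching;

/* Stand-in for things like rcu_prepare_for_idle() and the
 * deferred no-CBs wakeup: must run before the increment. */
static void rcu_eqs_entry_hook(void)
{
	entry_hook_saw_watching = !(atomic_load(&eqs_count) & 1);
}

/* Stand-in for things like rcu_cleanup_after_idle():
 * must run after the increment. */
static void rcu_eqs_exit_hook(void)
{
	exit_hook_saw_watching = !(atomic_load(&eqs_count) & 1);
}

static void enter_idle_like_state(void)
{
	rcu_eqs_entry_hook();              /* RCU still watching here */
	atomic_fetch_add(&eqs_count, 1);   /* now odd: idle-like */
}

static void exit_idle_like_state(void)
{
	atomic_fetch_add(&eqs_count, 1);   /* now even: watching again */
	rcu_eqs_exit_hook();               /* RCU watching here */
}
```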

Ah, yes, and interrupts to/from idle. Some architectures have
half-interrupts that never return. RCU uses a compound counter
that is zeroed upon process-level entry to an idle-like state to
deal with this. See kernel/rcu/rcu.h, the definitions starting with
DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus
the associated comment block. But maybe context tracking has some
other way of handling these beasts?
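
A much-simplified sketch of why zeroing helps, under the assumption that
the process-level nesting lives in high-order bits and the irq nesting in
low-order bits. The constant and function names below are invented for
illustration and are not the actual encoding in kernel/rcu/rcu.h.

```c
#include <stdbool.h>

/* Hypothetical: one unit of process-level nesting, far above any
 * plausible irq nesting depth kept in the low-order bits. */
#define TASK_NEST_ONE (1LL << 32)

/* Starts out running process-level kernel code, so nonzero. */
static long long nesting = TASK_NEST_ONE;

static void irq_enter_sketch(void) { nesting++; }
static void irq_exit_sketch(void)  { nesting--; }

/* Process-level entry to an idle-like state: zero the whole counter,
 * discarding any irq nesting left behind by a half-interrupt. */
static void idle_enter_sketch(void) { nesting = 0; }

/* Process-level exit from an idle-like state. */
static void idle_exit_sketch(void) { nesting = TASK_NEST_ONE; }

/* Nonzero nesting means RCU must pay attention to this CPU. */
static bool rcu_watching_sketch(void) { return nesting != 0; }
```

With a plain up/down counter, an irq entry that never sees its exit would
leave the count stuck above zero and the CPU would never look idle.
Zeroing on process-level idle entry discards that leftover count.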

And transitions to idle-like states are not atomic. In some cases
in some configurations, rcu_needs_cpu() says "OK" when asked about
stopping the tick, but by the time we get to rcu_prepare_for_idle(),
it is no longer OK. RCU raises softirq to force a replay in these
sorts of cases.

Hey, you asked!!! ;-)

Thanx, Paul

2015-07-17 21:22:53

by Paul E. McKenney

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

And please see attached for an in-process LWN article on RCU's requirements.
If you get a chance to look it over, I would value any feedback that you
might have.

Thanx, Paul

Attachments:

(No filename) (182.00 B)
Requirements.2015.07.17a.tgz (163.88 kB)

2015-07-17 22:13:59

by Andy Lutomirski

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 2:19 PM, Paul E. McKenney
<[email protected]> wrote:
> On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote:
>> True. But context tracking wouldn't object to being exact. And I
>> think we need context tracking to treat user mode as quiescent, so
>> they're at least related.
>
> And RCU would be happy to be able to always detect usermode execution.
> But there are configurations and architectures that exclude context
> tracking, which means that RCU has to roll its own in those cases.
>

We could slowly fix them, perhaps. I suspect that I'm half-way done
with accidentally enabling it for x86_32 :)

>
>> >> > 3. In some configurations, RCU needs to be able to block entry into
>> >> > nohz state, both for idle and userspace.
>> >>
>> >> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake,
>> >> although the tick would have to stay on.
>> >
>> > Right, there are situations where RCU needs a given CPU to keep the tick
>> > going, for example, when there are RCU callbacks queued on that CPU.
>> > Failing to keep the tick going could result in a system hang, because
>> > that callback might never be invoked.
>>
>> Can't we just fire the callbacks right away?
>
> Absolutely not!!!
>
> At least not on multi-CPU systems. There might be an RCU read-side
> critical section on some other CPU that we still have to wait for.

Oh, right, obviously. Could you kick them over to a different CPU, though?

>
>> We should only go
>> RCU-idle from reasonable contexts. In fact, the nohz crowd likes the
>> idea of nohz userspace being absolutely no hz.
>
> Guilty to charges as read, and this is why RCU needs to be informed
> about userspace execution for nohz userspace.
>
>> NMIs can't queue RCU callbacks, right? (I hope!) As of a couple
>> releases ago, on x86, we *always* have a clean state when
>> transitioning from any non-NMI kernel context to user mode on x86.
>
> You are correct, NMIs cannot queue RCU callbacks. That would be possible,
> but it also would be a bit painful and not so good for energy efficiency.
> So if someone wants that, they need to have an extremely good reason. ;-)

Eww, please no. NMI is already a terrifying scary disaster, and
keeping it simple would be for the best.

>
>> > Of course, something or another
>> > will normally eventually disturb the CPU, but the resulting huge delay
>> > would not be good. And on deep embedded systems, it is quite possible
>> > that the CPU would go for a good long time without being disturbed.
>> > (This is not just a theoretical possibility, and I have the scars to
>> > prove it.)
>> >
>> > And there is this one as well:
>> >
>> > 4. In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace
>> > context differently than idle context, and still needs to be
>> > able to take two samples and determine if the CPU ever went idle
>> > (and only idle, not userspace) betweentimes.
>>
>> If the context tracking code, or whatever the hook is, tracked the
>> number of transitions out of user mode, that would do it, right?
>> We're talking literally a single per-cpu increment on user entry
>> and/or exit, I think.
>
> With memory barriers, because RCU has to accurately sample the counters
> remotely. I currently use full-up atomic operations, possibly only due
> to paranoia, but I need to further intensify testing to trust moving away
> from full-up atomic operations. In addition, RCU currently relies on a
> single counter counting idle-to-RCU transitions, both userspace and idle.
> It might be possible to wean RCU of this habit, or maybe have the two
> counters be combined into a single word. The checks would of course be
> more complex in that case, but should be doable.

Why is idle-to-RCU different from user-to-RCU?

I feel like RCU and context tracking are implementing more or less the
same thing, and the fact that they're not shared makes life
complicated.

From my perspective, I want to be able to say "I'm transitioning to
user mode right now" and "I'm transitioning out of user mode right
now" and have it Just Work. In current -tip, on x86_64, we do that
for literally every non-NMI entry with one stupid racy exception, and
that racy exception is very much fixable. I'd prefer not to think
about whether I'm informing RCU about exiting user mode, informing
context tracking about exiting user mode, or both.

>
> In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to
> rcu_prepare_for_idle() just before incrementing the counter when
> transitioning to an idle-like state. Similarly, in CONFIG_RCU_NOCB_CPU
> kernels, RCU needs a call to the following in the same place:
>
> for_each_rcu_flavor(rsp) {
> rdp = this_cpu_ptr(rsp->rda);
> do_nocb_deferred_wakeup(rdp);
> }
>
> On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked
> in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on
> transition from an idle-like state.

If my hypothetical "I'm going to userspace now" function did that,
great! I call it from a context where percpu variables work and IRQs
are off. I further promise not to run any RCU code or enable IRQs
between calling that function and actually entering user mode.

>
> There is also some debug that complains if something transitions
> to/from an idle-like state that shouldn't be doing so, but that could be
> pulled into context tracking. (Might already be there, for all I know.)
> And there is event tracing, which might be subsumed into context tracking.
> See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c
> for the full story.
>
> This could all be handled by an RCU hook being invoked just before
> incrementing the counter on entry to an idle-like state and just after
> incrementing the counter on exit from an idle-like state.
>
> Ah, yes, and interrupts to/from idle. Some architectures have
> half-interrupts that never return.

WTF? Some architectures are clearly nuts.

x86 gets this right, at least in the intel_idle and acpi_idle cases.
There are no interrupts from RCU idle unless I badly misread the code.

> RCU uses a compound counter
> that is zeroed upon process-level entry to an idle-like state to
> deal with this. See kernel/rcu/rcu.h, the definitions starting with
> DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus
> the associated comment block. But maybe context tracking has some
> other way of handling these beasts?
>
> And transitions to idle-like states are not atomic. In some cases
> in some configurations, rcu_needs_cpu() says "OK" when asked about
> stopping the tick, but by the time we get to rcu_prepare_for_idle(),
> it is no longer OK. RCU raises softirq to force a replay in these
> sorts of cases.

Hmm. If pending callbacks got kicked to another CPU, would that help?
If that's impossible, when RCU finally detects that the grace period
is over, could it send an IPI rather than relying on the timer tick?

--Andy

2015-07-17 22:45:41

by Andy Lutomirski

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 2:22 PM, Paul E. McKenney
<[email protected]> wrote:
> And please see attached for an in-process LWN article on RCU's requirements.
> If you get a chance to look it over, I would value any feedback that you
> might have.
>

Sure. Nice article!

I found the add_gp_buggy thing a bit confusing. What's the rcu reader
doing? What's the rcu_access_pointer for? You have a spinlock for
the updater.

You reference remove_gp_synchronous before you define it.

You say:

Quick Quiz 4: Without the rcu_dereference() or the
rcu_access_pointer(), what destructive optimizations might the
compiler make use of?

Answer: It could reuse a value formerly fetched from this same
pointer. It could also fetch the pointer from gp in a byte-at-a-time
manner, resulting in load tearing, in turn resulting in a bytewise
mash-up of two distinct pointer values. It might even use
value-speculation optimizations, where it makes a wrong guess, but by
the time it gets around to checking the value, an update has changed
the pointer to match the wrong guess. Too bad about any dereferences
that returned pre-initialization garbage in the meantime!

Doesn't the spinlock protect against that?

Requirement #2: you say "Each CPU that has an RCU read-side critical
section that ends after synchronize_rcu() returns is guaranteed to
execute a full memory barrier between the time that synchronize_rcu()
begins and the time that the RCU read-side critical section begins."
Don't you mean an RCU read-side critical section that *starts* after
synchronize_rcu() returns?

Having read it: on x86 right now with nohz_full, a cpu can go into
user mode and come back without executing a full barrier (I think).
Certainly there's no such barrier in the entry code. I don't know
what user_enter and user_exit do. Is that okay? The issue here is
that a SYSRET/SYSCALL pair doesn't serialize or enforce any ordering
whatsoever. I got Intel to promise that SYSCALL will always force a
TSX abort [1], but that's about it.

For timer-driven idle/user detection, we're fine: IRET is serializing.
However, there are a couple of kernels in which we don't promise to
IRET after an IRQ.

This also reminds me: we really really need task switches to be full
barriers in general. Otherwise store forwarding can bite
otherwise-correct code on x86.

s/Guaranteed Unconditional/Guaranteed Unconditionally/

[1] Because of an amusing possible attack. If you attempt a TSX
transaction and it fails, you get to leak at least 8 bits out of the
aborted transaction. If you manage to read /dev/urandom in a
transaction, you can abort the read and still know the random number.
Whoops!

--Andy

2015-07-17 22:55:38

by Paul E. McKenney

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 03:13:36PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 17, 2015 at 2:19 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote:
> >> True. But context tracking wouldn't object to being exact. And I
> >> think we need context tracking to treat user mode as quiescent, so
> >> they're at least related.
> >
> > And RCU would be happy to be able to always detect usermode execution.
> > But there are configurations and architectures that exclude context
> > tracking, which means that RCU has to roll its own in those cases.
>
> We could slowly fix them, perhaps. I suspect that I'm half-way done
> with accidentally enabling it for x86_32 :)

If there was an appropriate Kconfig variable, I could do things one way
or the other, depending on what the architecture was doing.

So you are -unconditionally- enabling context tracking for x86_32?
Doesn't that increase kernel-user transition overhead?

> >> >> > 3. In some configurations, RCU needs to be able to block entry into
> >> >> > nohz state, both for idle and userspace.
> >> >>
> >> >> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake,
> >> >> although the tick would have to stay on.
> >> >
> >> > Right, there are situations where RCU needs a given CPU to keep the tick
> >> > going, for example, when there are RCU callbacks queued on that CPU.
> >> > Failing to keep the tick going could result in a system hang, because
> >> > that callback might never be invoked.
> >>
> >> Can't we just fire the callbacks right away?
> >
> > Absolutely not!!!
> >
> > At least not on multi-CPU systems. There might be an RCU read-side
> > critical section on some other CPU that we still have to wait for.
>
> Oh, right, obviously. Could you kick them over to a different CPU, though?

For CONFIG_RCU_NOCB_CPUS=y kernels, no problem, they will execute on
whatever CPU the scheduler or the sysadm chooses, as the case may be.
Otherwise, it gets really hard to make sure that a given CPU's callbacks
execute in order, which is required for rcu_barrier() to work properly.
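
The ordering dependency can be sketched as follows. This is a userspace
illustration only: the cb and cb_list types and the function names are
hypothetical, and real rcu_barrier() is considerably more involved. The
idea shown is just that a barrier callback queued at the tail of a CPU's
list proves all earlier callbacks have run only if that list is invoked
in FIFO order.

```c
#include <stddef.h>

/* Hypothetical callback and per-CPU FIFO list. */
struct cb {
	struct cb *next;
	void (*func)(struct cb *);
};

struct cb_list {
	struct cb *head;
	struct cb **tail;   /* points at the last ->next (or at head) */
};

static void cb_list_init(struct cb_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

/* Append at the tail so that invocation order matches queueing order. */
static void cb_enqueue(struct cb_list *l, struct cb *c,
		       void (*func)(struct cb *))
{
	c->func = func;
	c->next = NULL;
	*l->tail = c;
	l->tail = &c->next;
}

/* Invoke everything currently queued, oldest first. */
static int cb_invoke_all(struct cb_list *l)
{
	struct cb *c = l->head;
	int n = 0;

	cb_list_init(l);
	while (c) {
		struct cb *next = c->next;

		c->func(c);
		c = next;
		n++;
	}
	return n;
}

static int invoked;
static int invoked_before_barrier = -1;

static void ordinary_cb(struct cb *c) { (void)c; invoked++; }

/* Barrier callback: records how many callbacks ran ahead of it. */
static void barrier_cb(struct cb *c)
{
	(void)c;
	invoked_before_barrier = invoked;
}
```

If some of the earlier callbacks were moved to another CPU's list, the
barrier callback could run before them, which is the hard part.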

There was some thought of making CONFIG_RCU_NOCB_CPUS=y be the only
way that RCU callbacks were invoked, but the overhead is higher and
turns out to be all too noticeable on some workloads. :-(

> >> We should only go
> >> RCU-idle from reasonable contexts. In fact, the nohz crowd likes the
> >> idea of nohz userspace being absolutely no hz.
> >
> > Guilty to charges as read, and this is why RCU needs to be informed
> > about userspace execution for nohz userspace.
> >
> >> NMIs can't queue RCU callbacks, right? (I hope!) As of a couple
> >> releases ago, on x86, we *always* have a clean state when
> >> transitioning from any non-NMI kernel context to user mode on x86.
> >
> > You are correct, NMIs cannot queue RCU callbacks. That would be possible,
> > but it also would be a bit painful and not so good for energy efficiency.
> > So if someone wants that, they need to have an extremely good reason. ;-)
>
> Eww, please no. NMI is already a terrifying scary disaster, and
> keeping it simple would be for the best.

;-) ;-) ;-)

> >> > Of course, something or another
> >> > will normally eventually disturb the CPU, but the resulting huge delay
> >> > would not be good. And on deep embedded systems, it is quite possible
> >> > that the CPU would go for a good long time without being disturbed.
> >> > (This is not just a theoretical possibility, and I have the scars to
> >> > prove it.)
> >> >
> >> > And there is this one as well:
> >> >
> >> > 4. In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace
> >> > context differently than idle context, and still needs to be
> >> > able to take two samples and determine if the CPU ever went idle
> >> > (and only idle, not userspace) betweentimes.
> >>
> >> If the context tracking code, or whatever the hook is, tracked the
> >> number of transitions out of user mode, that would do it, right?
> >> We're talking literally a single per-cpu increment on user entry
> >> and/or exit, I think.
> >
> > With memory barriers, because RCU has to accurately sample the counters
> > remotely. I currently use full-up atomic operations, possibly only due
> > to paranoia, but I need to further intensify testing to trust moving away
> > from full-up atomic operations. In addition, RCU currently relies on a
> > single counter counting idle-to-RCU transitions, both userspace and idle.
> > It might be possible to wean RCU of this habit, or maybe have the two
> > counters be combined into a single word. The checks would of course be
> > more complex in that case, but should be doable.
>
> Why is idle-to-RCU different from user-to-RCU?

Full-system-idle checking that is supposed to someday allow CPU 0's
scheduling-clock interrupt to be turned off on CONFIG_NO_HZ_FULL
systems. In this case, RCU treats user as non-idle for the purpose
of determining whether or not CPU 0's scheduling-clock interrupt can
be stopped. But idle is still idle. And for purposes of determining
whether the grace period has ended, idle and user are both extended
quiescent states, regardless.
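
In other words, the two checks classify the same per-CPU state
differently. A small sketch, with an invented cpu_ctx enum and helper
names (nothing here is an existing kernel interface):

```c
#include <stdbool.h>

/* Hypothetical per-CPU context as seen by a remote checker. */
enum cpu_ctx { CTX_KERNEL, CTX_USER, CTX_IDLE };

/* Grace-period view: user and idle are both extended quiescent states. */
static bool ctx_is_rcu_eqs(enum cpu_ctx c)
{
	return c == CTX_USER || c == CTX_IDLE;
}

/* Sysidle view: only idle counts, user is treated as non-idle. */
static bool ctx_is_sysidle(enum cpu_ctx c)
{
	return c == CTX_IDLE;
}

/* True if every CPU is in an extended quiescent state. */
static bool all_cpus_in_eqs(const enum cpu_ctx *cpus, int n)
{
	for (int i = 0; i < n; i++)
		if (!ctx_is_rcu_eqs(cpus[i]))
			return false;
	return true;
}

/* True only if every CPU is idle, so that CPU 0's scheduling-clock
 * interrupt could be stopped in this simplified model. */
static bool system_fully_idle(const enum cpu_ctx *cpus, int n)
{
	for (int i = 0; i < n; i++)
		if (!ctx_is_sysidle(cpus[i]))
			return false;
	return true;
}
```

A system with one CPU in userspace and the rest idle passes the first
check but not the second.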

> I feel like RCU and context tracking are implementing more or less the
> same thing, and the fact that they're not shared makes life
> complicated.
>
> From my perspective, I want to be able to say "I'm transitioning to
> user mode right now" and "I'm transitioning out of user mode right
> now" and have it Just Work. In current -tip, on x86_64, we do that
> for literally every non-NMI entry with one stupid racy exception, and
> that racy exception is very much fixable. I'd prefer not to think
> about whether I'm informing RCU about exiting user mode, informing
> context tracking about exiting user mode, or both.

RCU's tracking was in place for many years before context tracking
appeared. If we can converge them, well and good, but it really does
have to fully work.

> > In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to
> > rcu_prepare_for_idle() just before incrementing the counter when
> > transitioning to an idle-like state. Similarly, in CONFIG_RCU_NOCB_CPU
> > kernels, RCU needs a call to the following in the same place:
> >
> > for_each_rcu_flavor(rsp) {
> > rdp = this_cpu_ptr(rsp->rda);
> > do_nocb_deferred_wakeup(rdp);
> > }
> >
> > On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked
> > in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on
> > transition from an idle-like state.
>
> If my hypothetical "I'm going to userspace now" function did that,
> great! I call it from a context where percpu variables work and IRQs
> are off. I further promise not to run any RCU code or enable IRQs
> between calling that function and actually entering user mode.

Probably need a pair of nonspecific RCU hook functions that do whatever
RCU eventually needs done, but that could hopefully work.

Of course, just having a pair of hook functions assumes that RCU can
be convinced to not care about the difference between irq and process
transitions, and the difference between idle and user (at transition,
sysidle will continue to care about the difference between idle and
user when remotely checking a given CPU's state).

> > There is also some debug that complains if something transitions
> > to/from an idle-like state that shouldn't be doing so, but that could be
> > pulled into context tracking. (Might already be there, for all I know.)
> > And there is event tracing, which might be subsumed into context tracking.
> > See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c
> > for the full story.
> >
> > This could all be handled by an RCU hook being invoked just before
> > incrementing the counter on entry to an idle-like state and just after
> > incrementing the counter on exit from an idle-like state.
> >
> > Ah, yes, and interrupts to/from idle. Some architectures have
> > half-interrupts that never return.
>
> WTF? Some architectures are clearly nuts.

Heh. That was -exactly- my reaction when I first ran into it.

> x86 gets this right, at least in the intel_idle and acpi_idle cases.
> There are no interrupts from RCU idle unless I badly misread the code.

These half-interrupts happen when running non-idle in kernel code.
Things like simulating exceptions and system calls from within the kernel.
I would not be sad to see it go, but while it is here, RCU must handle
it correctly. If it really truly cannot happen for x86, then x86
arch code of course need not worry about it.

> > RCU uses a compound counter
> > that is zeroed upon process-level entry to an idle-like state to
> > deal with this. See kernel/rcu/rcu.h, the definitions starting with
> > DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus
> > the associated comment block. But maybe context tracking has some
> > other way of handling these beasts?
> >
> > And transitions to idle-like states are not atomic. In some cases
> > in some configurations, rcu_needs_cpu() says "OK" when asked about
> > stopping the tick, but by the time we get to rcu_prepare_for_idle(),
> > it is no longer OK. RCU raises softirq to force a replay in these
> > sorts of cases.
>
> Hmm. If pending callbacks got kicked to another CPU, would that help?

Well, for CONFIG_RCU_NOCB_CPUS_ALL kernels, rcu_needs_cpu() unconditionally
says "OK", so for workloads where that is a reasonable strategy, it works
quite well. Otherwise, kicking callbacks to other CPUs is extremely
painful at best, and perhaps even impossible, at least if rcu_barrier()
is to work correctly.

> If that's impossible, when RCU finally detects that the grace period
> is over, could it send an IPI rather than relying on the timer tick?

In CONFIG_RCU_FAST_NO_HZ kernels, when the CPU is going idle, it sets
a timer to catch end of the grace period in the common case. This is
the point of the time passed back from rcu_needs_cpu(). Making the
end-of-grace-period code send IPIs to CPUs that might (or might not)
be in this mode involves too much rummaging through other CPU's states.
Plus people are not complaining that the grace-period kthread is using
too little CPU.

Besides, you have to tolerate a CPU catching an interrupt that does a
wakeup just as that CPU was trying to go idle, so the current approach
is not introducing any additional pain from what I can see.

Thanx, Paul

2015-07-17 23:21:20

by Andy Lutomirski

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 3:55 PM, Paul E. McKenney
<[email protected]> wrote:
> On Fri, Jul 17, 2015 at 03:13:36PM -0700, Andy Lutomirski wrote:
>> On Fri, Jul 17, 2015 at 2:19 PM, Paul E. McKenney
>> <[email protected]> wrote:
>> > On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote:
>> >> True. But context tracking wouldn't object to being exact. And I
>> >> think we need context tracking to treat user mode as quiescent, so
>> >> they're at least related.
>> >
>> > And RCU would be happy to be able to always detect usermode execution.
>> > But there are configurations and architectures that exclude context
>> > tracking, which means that RCU has to roll its own in those cases.
>>
>> We could slowly fix them, perhaps. I suspect that I'm half-way done
>> with accidentally enabling it for x86_32 :)
>
> If there was an appropriate Kconfig variable, I could do things one way
> or the other, depending on what the architecture was doing.

There's CONFIG_CONTEXT_TRACKING.

IMO it would be nice if there was a sort of clear spec for what
promises an arch needs to make for proper RCU operation and what
additional promises it needs to make for RCU idle during user mode.

>
> So you are -unconditionally- enabling context tracking for x86_32?
> Doesn't that increase kernel-user transition overhead?

No, I'm just adding the user_enter and user_exit calls. This might
let us enable HAVE_CONTEXT_TRACKING. Someone still needs to flip it
on.

>
>> >> >> > 3. In some configurations, RCU needs to be able to block entry into
>> >> >> > nohz state, both for idle and userspace.
>> >> >>
>> >> >> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake,
>> >> >> although the tick would have to stay on.
>> >> >
>> >> > Right, there are situations where RCU needs a given CPU to keep the tick
>> >> > going, for example, when there are RCU callbacks queued on that CPU.
>> >> > Failing to keep the tick going could result in a system hang, because
>> >> > that callback might never be invoked.
>> >>
>> >> Can't we just fire the callbacks right away?
>> >
>> > Absolutely not!!!
>> >
>> > At least not on multi-CPU systems. There might be an RCU read-side
>> > critical section on some other CPU that we still have to wait for.
>>
>> Oh, right, obviously. Could you kick them over to a different CPU, though?
>
> For CONFIG_RCU_NOCB_CPUS=y kernels, no problem, they will execute on
> whatever CPU the scheduler or the sysadm chooses, as the case may be.
> Otherwise, it gets really hard to make sure that a given CPU's callbacks
> execute in order, which is required for rcu_barrier() to work properly.
>
> There was some thought of making CONFIG_RCU_NOCB_CPUS=y be the only
> way that RCU callbacks were invoked, but the overhead is higher and
> turns out to be all too noticeable on some workloads. :-(
>

Yuck. What if we make CONFIG_RCU_NOCB_CPUS=y mandatory for NOHZ_FULL?
In any case, the people who want their systems to be really truly
quiescent in user space will have CONFIG_RCU_NOCB_CPUS=y.

>>
>> Why is idle-to-RCU different from user-to-RCU?
>
> Full-system-idle checking that is supposed to someday allow CPU 0's
> scheduling-clock interrupt to be turned off on CONFIG_NO_HZ_FULL
> systems. In this case, RCU treats user as non-idle for the purpose
> of determining whether or not CPU 0's scheduling-clock interrupt can
> be stopped.

I'm not quite sure I understand this. We can turn off the tick if
truly idle but not if in user mode? Why?

>
>> I feel like RCU and context tracking are implementing more or less the
>> same thing, and the fact that they're not shared makes life
>> complicated.
>>
>> From my perspective, I want to be able to say "I'm transitioning to
>> user mode right now" and "I'm transitioning out of user mode right
>> now" and have it Just Work. In current -tip, on x86_64, we do that
>> for literally every non-NMI entry with one stupid racy exception, and
>> that racy exception is very much fixable. I'd prefer not to think
>> about whether I'm informing RCU about exiting user mode, informing
>> context tracking about exiting user mode, or both.
>
> RCU's tracking was in place for many years before context tracking
> appeared. If we can converge them, well and good, but it really does
> have to fully work.

True.

One of my goals is to perfect coverage of the entering-usermode and
exiting-usermode callbacks on x86. What those callbacks are called
and what they do is up for debate, but I intend to call them. In
fact, I intend to promise that we will never execute non-NMI kernel
code with IRQs on at *any* point where I haven't called the
appropriate callback to tell the kernel that I'm executing kernel
code.

The splat that Sasha got was an assertion I added to help validate
this promise, but I'm asserting sort-of the wrong thing. Currently
there's a narrow window in which RCU knows we're non-idle but context
tracking still thinks we're in user mode, and Sasha got it to take an
interrupt there. I can change the assertion (it's currently
harmless), or I can fix it (needs a bit more work, and Ingo is
currently swamped so it'll be a couple weeks).

>
>> > In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to
>> > rcu_prepare_for_idle() just before incrementing the counter when
>> > transitioning to an idle-like state. Similarly, in CONFIG_RCU_NOCB_CPU
>> > kernels, RCU needs a call to the following in the same place:
>> >
>> > for_each_rcu_flavor(rsp) {
>> > rdp = this_cpu_ptr(rsp->rda);
>> > do_nocb_deferred_wakeup(rdp);
>> > }
>> >
>> > On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked
>> > in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on
>> > transition from an idle-like state.
>>
>> If my hypothetical "I'm going to userspace now" function did that,
>> great! I call it from a context where percpu variables work and IRQs
>> are off. I further promise not to run any RCU code or enable IRQs
>> between calling that function and actually entering user mode.
>
> Probably need a pair of nonspecific RCU hook functions that do whatever
> RCU eventually needs done, but that could hopefully work.
>
> Of course, just having a pair of hook functions assumes that RCU can
> be convinced to not care about the difference between irq and process
> transitions, and the difference between idle and user (at transition,
> sysidle will continue to care about the difference between idle and
> user when remotely checking a given CPU's state).

It can be more complex than just a pair of functions. I could call
idle_to_kernel() and user_to_kernel(), both with IRQs off and exactly
at the right times.

There may be architectures that deliver IRQs directly from idle, which
would mean that IRQ handlers would need care to choose the right hook,
but x86 is not one of those architectures.

>
>> > There is also some debug that complains if something transitions
>> > to/from an idle-like state that shouldn't be doing so, but that could be
>> > pulled into context tracking. (Might already be there, for all I know.)
>> > And there is event tracing, which might be subsumed into context tracking.
>> > See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c
>> > for the full story.
>> >
>> > This could all be handled by an RCU hook being invoked just before
>> > incrementing the counter on entry to an idle-like state and just after
>> > incrementing the counter on exit from an idle-like state.
>> >
>> > Ah, yes, and interrupts to/from idle. Some architectures have
>> > half-interrupts that never return.
>>
>> WTF? Some architectures are clearly nuts.
>
> Heh. That was -exactly- my reaction when I first ran into it.
>
>> x86 gets this right, at least in the intel_idle and acpi_idle cases.
>> There are no interrupts from RCU idle unless I badly misread the code.
>
> These half-interrupts happen when running non-idle in kernel code.
> Things like simulating exceptions and system calls from within the kernel.
> I would not be sad to see it go, but while it is here, RCU must handle
> it correctly. If it really truly cannot happen for x86, then x86
> arch code of course need not worry about it.

On x86, you can simulate a syscall by calling the syscall body. I
clearly don't understand this weirdness...

>
>> > RCU uses a compound counter
>> > that is zeroed upon process-level entry to an idle-like state to
>> > deal with this. See kernel/rcu/rcu.h, the definitions starting with
>> > DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus
>> > the associated comment block. But maybe context tracking has some
>> > other way of handling these beasts?
>> >
>> > And transitions to idle-like states are not atomic. In some cases
>> > in some configurations, rcu_needs_cpu() says "OK" when asked about
>> > stopping the tick, but by the time we get to rcu_prepare_for_idle(),
>> > it is no longer OK. RCU raises softirq to force a replay in these
>> > sorts of cases.
>>
>> Hmm. If pending callbacks got kicked to another CPU, would that help?
>
> Well, for CONFIG_RCU_NOCB_CPUS_ALL kernels, rcu_needs_cpu() unconditionally
> says "OK", so for workloads where that is a reasonable strategy, it works
> quite well. Otherwise, kicking callbacks to other CPUs is extremely
> painful at best, and perhaps even impossible, at least if rcu_barrier()
> is to work correctly.
>
>> If that's impossible, when RCU finally detects that the grace period
>> is over, could it send an IPI rather than relying on the timer tick?
>
> In CONFIG_RCU_FAST_NO_HZ kernels, when the CPU is going idle, it sets
> a timer to catch end of the grace period in the common case. This is
> the point of the time passed back from rcu_needs_cpu(). Making the
> end-of-grace-period code send IPIs to CPUs that might (or might not)
> be in this mode involves too much rummaging through other CPUs' states.
> Plus people are not complaining that the grace-period kthread is using
> too little CPU.
>
> Besides, you have to tolerate a CPU catching an interrupt that does a
> wakeup just as that CPU was trying to go idle, so the current approach
> is not introducing any additional pain from what I can see.

Except that, in that case, we defer idle exactly long enough to handle
the interrupt and then either schedule or go idle for real (or have
another interrupt). There's no weird state where we have no work to
do right now but still can't become idle.

--Andy

2015-07-18 00:07:19

by Paul E. McKenney

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 04:20:57PM -0700, Andy Lutomirski wrote:
> On Fri, Jul 17, 2015 at 3:55 PM, Paul E. McKenney wrote:
> > On Fri, Jul 17, 2015 at 03:13:36PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jul 17, 2015 at 2:19 PM, Paul E. McKenney wrote:
> >> > On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote:
> >> >> True. But context tracking wouldn't object to being exact. And I
> >> >> think we need context tracking to treat user mode as quiescent, so
> >> >> they're at least related.
> >> >
> >> > And RCU would be happy to be able to always detect usermode execution.
> >> > But there are configurations and architectures that exclude context
> >> > tracking, which means that RCU has to roll its own in those cases.
> >>
> >> We could slowly fix them, perhaps. I suspect that I'm half-way done
> >> with accidentally enabling it for x86_32 :)
> >
> > If there was an appropriate Kconfig variable, I could do things one way
> > or the other, depending on what the architecture was doing.
>
> There's CONFIG_CONTEXT_TRACKING.
>
> IMO it would be nice if there was a sort of clear spec for what
> promises an arch needs to make for proper RCU operation and what
> additional promises it needs to make for RCU idle during user mode.

Well, if RCU is going to delegate that responsibility to the architectures,
as you are suggesting, something will indeed be required. ;-)

If we come up with something workable, I will document it. Self-defense
and all that.

> > So you are -unconditionally- enabling context tracking for x86_32?
> > Doesn't that increase kernel-user transition overhead?
>
> No, I'm just adding the user_enter and user_exit calls. This might
> let us enable HAVE_CONTEXT_TRACKING. Someone still needs to flip it
> on.

Whew! ;-)

On the other hand, RCU will need to be set up to work either way.
Shouldn't be too hard, though.

> >> >> >> > 3. In some configurations, RCU needs to be able to block entry into
> >> >> >> > nohz state, both for idle and userspace.
> >> >> >>
> >> >> >> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake,
> >> >> >> although the tick would have to stay on.
> >> >> >
> >> >> > Right, there are situations where RCU needs a given CPU to keep the tick
> >> >> > going, for example, when there are RCU callbacks queued on that CPU.
> >> >> > Failing to keep the tick going could result in a system hang, because
> >> >> > that callback might never be invoked.
> >> >>
> >> >> Can't we just fire the callbacks right away?
> >> >
> >> > Absolutely not!!!
> >> >
> >> > At least not on multi-CPU systems. There might be an RCU read-side
> >> > critical section on some other CPU that we still have to wait for.
> >>
> >> Oh, right, obviously. Could you kick them over to a different CPU, though?
> >
> > For CONFIG_RCU_NOCB_CPUS=y kernels, no problem, they will execute on
> > whatever CPU the scheduler or the sysadm chooses, as the case may be.
> > Otherwise, it gets really hard to make sure that a given CPU's callbacks
> > execute in order, which is required for rcu_barrier() to work properly.
> >
> > There was some thought of making CONFIG_RCU_NOCB_CPUS=y be the only
> > way that RCU callbacks were invoked, but the overhead is higher and
> > turns out to be all too noticeable on some workloads. :-(
>
> Yuck. What if we make CONFIG_RCU_NOCB_CPUS=y mandatory for NOHZ_FULL?
> In any case, the people who want their systems to be really truly
> quiescent in user space will have CONFIG_RCU_NOCB_CPUS=y.

It is already mandatory: NO_HZ_FULL selects RCU_NOCB_CPUS. However,
the choice of which CPU will really do nohz (and thus nocbs as well)
happens at boot time. So a given NO_HZ_FULL system will typically have
at least one non-RCU_NOCB_CPUS CPU, namely the boot CPU, which cannot
be a nohz CPU. Thankfully, the choices are automated at Kconfig and
boot time, so the right thing should happen without any undue user
consternation.

> >> Why is idle-to-RCU different from user-to-RCU?
> >
> > Full-system-idle checking that is supposed to someday allow CPU 0's
> > scheduling-clock interrupt to be turned off on CONFIG_NO_HZ_FULL
> > systems. In this case, RCU treats user as non-idle for the purpose
> > of determining whether or not CPU 0's scheduling-clock interrupt can
> > be stopped.
>
> I'm not quite sure I understand this. We can turn off the tick if
> truly idle but not if in user mode? Why?

Because of the need to maintain time synchronization at all times that
at least one CPU is non-idle. User-mode code might access fine-grained
time, and might care about time synchronization. The scheduling-clock
interrupt handler takes care of this when needed.
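
To make the idle-versus-user distinction concrete, here is a minimal
user-space sketch. Everything in it (the enum, the function names) is
a hypothetical model for illustration, not kernel code. It assumes only
what is said above: for grace-period purposes RCU can treat both idle
and user as quiescent, but for full-system-idle checking a CPU running
user code still counts as non-idle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU states; these names are illustrative only. */
enum cpu_state { CPU_IDLE, CPU_USER, CPU_KERNEL };

/* For grace periods, idle and user execution are both quiescent. */
static bool rcu_can_ignore_cpu(enum cpu_state s)
{
	return s == CPU_IDLE || s == CPU_USER;
}

/*
 * For full-system-idle checking, only true idle counts: user code
 * might read fine-grained time, so the timekeeping tick must keep
 * running while any CPU is in user or kernel mode.
 */
static bool timekeeping_tick_can_stop(const enum cpu_state *cpus, size_t n)
{
	assert(cpus != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (cpus[i] != CPU_IDLE)
			return false;
	return true;
}
```

With one CPU in user mode and the rest idle, rcu_can_ignore_cpu() is
true for every CPU, yet timekeeping_tick_can_stop() is false.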

> >> I feel like RCU and context tracking are implementing more or less the
> >> same thing, and the fact that they're not shared makes life
> >> complicated.
> >>
> >> From my perspective, I want to be able to say "I'm transitioning to
> >> user mode right now" and "I'm transitioning out of user mode right
> >> now" and have it Just Work. In current -tip, on x86_64, we do that
> >> for literally every non-NMI entry with one stupid racy exception, and
> >> that racy exception is very much fixable. I'd prefer not to think
> >> about whether I'm informing RCU about exiting user mode, informing
> >> context tracking about exiting user mode, or both.
> >
> > RCU's tracking was in place for many years before context tracking
> > appeared. If we can converge them, well and good, but it really does
> > have to fully work.
>
> True.
>
> One of my goals is to perfect coverage of the entering-usermode and
> exiting-usermode callbacks on x86. What those callbacks are called
> and what they do is up for debate, but I intend to call them. In
> fact, I intend to promise that we will never execute non-NMI kernel
> code with IRQs on at *any* point where I haven't called the
> appropriate callback to tell the kernel that I'm executing kernel
> code.
>
> The splat that Sasha got was an assertion I added to help validate
> this promise, but I'm asserting sort-of the wrong thing. Currently
> there's a narrow window in which RCU knows we're non-idle but context
> tracking still thinks we're in user mode, and Sasha got it to take an
> interrupt there. I can change the assertion (it's currently
> harmless), or I can fix it (needs a bit more work, and Ingo is
> currently swamped so it'll be a couple weeks).

Agreed, false-positive splats are no fun either.

> >> > In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to
> >> > rcu_prepare_for_idle() just before incrementing the counter when
> >> > transitioning to an idle-like state. Similarly, in CONFIG_RCU_NOCB_CPU
> >> > kernels, RCU needs a call to the following in the same place:
> >> >
> >> > for_each_rcu_flavor(rsp) {
> >> > rdp = this_cpu_ptr(rsp->rda);
> >> > do_nocb_deferred_wakeup(rdp);
> >> > }
> >> >
> >> > On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked
> >> > in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on
> >> > transition from an idle-like state.
> >>
> >> If my hypothetical "I'm going to userspace now" function did that,
> >> great! I call it from a context where percpu variables work and IRQs
> >> are off. I further promise not to run any RCU code or enable IRQs
> >> between calling that function and actually entering user mode.
> >
> > Probably need a pair of nonspecific RCU hook functions that do whatever
> > RCU eventually needs done, but that could hopefully work.
> >
> > Of course, just having a pair of hook functions assumes that RCU can
> > be convinced to not care about the difference between irq and process
> > transitions, and the difference between idle and user (at transition,
> > sysidle will continue to care about the difference between idle and
> > user when remotely checking a given CPU's state).
>
> It can be more complex than just a pair of functions. I could call
> idle_to_kernel() and user_to_kernel(), both with IRQs off and exactly
> at the right times.

OK, that approach sounds reasonable. So rcu_idle_to_kernel(),
rcu_kernel_to_idle(), rcu_user_to_kernel(), and rcu_kernel_to_user()?
I expect that I would abstract them out of the current rcu_user_enter()
and friends, in the name of maintainability.

I would guess that the NMIs remain as they are, but one way or another
RCU needs to know about the NMIs.
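
A rough user-space sketch of what such a four-hook interface might look
like. This is a toy model under stated assumptions, not a proposed
patch: the struct, the enum, the cpu_model_init() helper and the extra
argument to each hook are all hypothetical. It assumes each hook runs
with IRQs off, and that a single counter is odd while RCU is watching
and even otherwise.

```c
#include <assert.h>

/* Hypothetical per-CPU model; none of these types exist in the kernel. */
enum cpu_ctx { CTX_KERNEL, CTX_USER, CTX_IDLE };

struct cpu_model {
	enum cpu_ctx ctx;       /* what context tracking would record */
	unsigned long dynticks; /* odd: RCU watching, even: not watching */
};

/* Start in kernel mode with RCU watching (odd counter). */
static void cpu_model_init(struct cpu_model *c)
{
	c->ctx = CTX_KERNEL;
	c->dynticks = 1;
}

static void leave_kernel(struct cpu_model *c, enum cpu_ctx to)
{
	assert(c->ctx == CTX_KERNEL); /* debug check on the transition */
	/* RCU's "just before the increment" work would go here. */
	c->dynticks++;                /* now even: idle-like state */
	c->ctx = to;
}

static void enter_kernel(struct cpu_model *c, enum cpu_ctx from)
{
	assert(c->ctx == from);
	c->dynticks++;                /* now odd: RCU is watching again */
	/* RCU's "just after the increment" work would go here. */
	c->ctx = CTX_KERNEL;
}

static void rcu_kernel_to_user(struct cpu_model *c) { leave_kernel(c, CTX_USER); }
static void rcu_user_to_kernel(struct cpu_model *c) { enter_kernel(c, CTX_USER); }
static void rcu_kernel_to_idle(struct cpu_model *c) { leave_kernel(c, CTX_IDLE); }
static void rcu_idle_to_kernel(struct cpu_model *c) { enter_kernel(c, CTX_IDLE); }

static int model_rcu_is_watching(const struct cpu_model *c)
{
	return (int)(c->dynticks & 1UL);
}
```

In this model the context state and the RCU counter can only change
together, which is the "always in sync" property under discussion.
NMIs are deliberately left out.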

> There may be architectures that deliver IRQs directly from idle, which
> would mean that IRQ handlers would need care to choose the right hook,
> but x86 is not one of those architectures.

There are indeed such architectures.

> >> > There is also some debug that complains if something transitions
> >> > to/from an idle-like state that shouldn't be doing so, but that could be
> >> > pulled into context tracking. (Might already be there, for all I know.)
> >> > And there is event tracing, which might be subsumed into context tracking.
> >> > See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c
> >> > for the full story.
> >> >
> >> > This could all be handled by an RCU hook being invoked just before
> >> > incrementing the counter on entry to an idle-like state and just after
> >> > incrementing the counter on exit from an idle-like state.
> >> >
> >> > Ah, yes, and interrupts to/from idle. Some architectures have
> >> > half-interrupts that never return.
> >>
> >> WTF? Some architectures are clearly nuts.
> >
> > Heh. That was -exactly- my reaction when I first ran into it.
> >
> >> x86 gets this right, at least in the intel_idle and acpi_idle cases.
> >> There are no interrupts from RCU idle unless I badly misread the code.
> >
> > These half-interrupts happen when running non-idle in kernel code.
> > Things like simulating exceptions and system calls from within the kernel.
> > I would not be sad to see it go, but while it is here, RCU must handle
> > it correctly. If it really truly cannot happen for x86, then x86
> > arch code of course need not worry about it.
>
> On x86, you can simulate a syscall by calling the syscall body. I
> clearly don't understand this weirdness...

If it could go away, that would be good.

> >> > RCU uses a compound counter
> >> > that is zeroed upon process-level entry to an idle-like state to
> >> > deal with this. See kernel/rcu/rcu.h, the definitions starting with
> >> > DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus
> >> > the associated comment block. But maybe context tracking has some
> >> > other way of handling these beasts?
> >> >
> >> > And transitions to idle-like states are not atomic. In some cases
> >> > in some configurations, rcu_needs_cpu() says "OK" when asked about
> >> > stopping the tick, but by the time we get to rcu_prepare_for_idle(),
> >> > it is no longer OK. RCU raises softirq to force a replay in these
> >> > sorts of cases.
> >>
> >> Hmm. If pending callbacks got kicked to another CPU, would that help?
> >
> > Well, for CONFIG_RCU_NOCB_CPUS_ALL kernels, rcu_needs_cpu() unconditionally
> > says "OK", so for workloads where that is a reasonable strategy, it works
> > quite well. Otherwise, kicking callbacks to other CPUs is extremely
> > painful at best, and perhaps even impossible, at least if rcu_barrier()
> > is to work correctly.
> >
> >> If that's impossible, when RCU finally detects that the grace period
> >> is over, could it send an IPI rather than relying on the timer tick?
> >
> > In CONFIG_RCU_FAST_NO_HZ kernels, when the CPU is going idle, it sets
> > a timer to catch end of the grace period in the common case. This is
> > the point of the time passed back from rcu_needs_cpu(). Making the
> > end-of-grace-period code send IPIs to CPUs that might (or might not)
> > be in this mode involves too much rummaging through other CPUs' states.
> > Plus people are not complaining that the grace-period kthread is using
> > too little CPU.
> >
> > Besides, you have to tolerate a CPU catching an interrupt that does a
> > wakeup just as that CPU was trying to go idle, so the current approach
> > is not introducing any additional pain from what I can see.
>
> Except that, in that case, we defer idle exactly long enough to handle
> the interrupt and then either schedule or go idle for real (or have
> another interrupt). There's no weird state where we have no work to
> do right now but still can't become idle.

So I have to actually wake up some kthread on that CPU to force
reconsideration of the type of transition to idle? Ouch...

Or will raising softirq have the same effect?

The reason I care is that the interrupt might have invoked call_rcu(),
which in some configurations means that RCU cannot allow the scheduling
clock interrupt to be turned off. The same thing can happen in a
softirq handler.
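
The non-atomic transition can be sketched as follows. This is a
hypothetical user-space model, not kernel code: model_rcu_needs_cpu()
merely stands in for rcu_needs_cpu(), and the struct and the other
function names are made up for illustration. The assumption is that
any callback queued on this CPU means the tick has to stay on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU trying to stop its tick. */
struct idle_cpu_model {
	int pending_callbacks; /* callbacks queued here via call_rcu() */
	bool tick_stopped;
};

/* Stand-in for rcu_needs_cpu(): true means the tick must stay on. */
static bool model_rcu_needs_cpu(const struct idle_cpu_model *c)
{
	return c->pending_callbacks > 0;
}

/* Stand-in for an irq or softirq handler that invokes call_rcu(). */
static void model_irq_queues_callback(struct idle_cpu_model *c)
{
	c->pending_callbacks++;
}

/*
 * Try to stop the tick. irq_in_window simulates an interrupt landing
 * between the first check and the point of no return. Returns true
 * only if the tick was actually stopped.
 */
static bool model_try_stop_tick(struct idle_cpu_model *c, bool irq_in_window)
{
	assert(!c->tick_stopped);
	if (model_rcu_needs_cpu(c))
		return false;
	if (irq_in_window)
		model_irq_queues_callback(c);
	/* The first answer may be stale by now, so ask again. */
	if (model_rcu_needs_cpu(c))
		return false;
	c->tick_stopped = true;
	return true;
}
```

The second check is the "replay" step: without it, a callback queued
in the window would sit on a CPU whose tick has been stopped.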

Thanx, Paul

2015-07-18 13:00:11

by Frederic Weisbecker

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote:
> For reasons that mystify me a bit, we currently track context tracking
> state separately from rcu's watching state. This results in strange
> artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we
> can nest exceptions inside the IRQ handler (an example would be
> wrmsr_safe failing), and, in -next, we splat a warning:
>
> https://gist.github.com/sashalevin/a006a44989312f6835e7

I don't know how it happened. But the context tracking code should
be able to handle exceptions in irqs. They are supposed to be simply
ignored, thanks to the in_interrupt() check in context_tracking_enter/exit().

>
> I'm trying to make context tracking more exact, which will fix this
> issue (the particular splat that Sasha hit shouldn't be possible when
> I'm done), but I think it would be nice to unify all of this stuff.
> Would it be plausible for us to guarantee that RCU state is always in
> sync with context tracking state? If so, we could maybe simplify
> things and have fewer state variables.

RCU uses the same variables for idle and user tracking whereas context
tracking only tracks user. So they are at least decoupled there. And we
probably don't want RCU to use a different variable due to the overhead
it brings on readers. But it could be a shifted count on the same variable.
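
One way to picture "a shifted count on the same variable" is the
user-space sketch below. The layout, the macro names and the helper
functions are all assumptions made up for illustration, not an
existing kernel interface: the low bits of one word hold the context
state, and the remaining bits hold a transition count, so a single
load gives a remote reader both.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout: low 2 bits = state, upper bits = sequence. */
#define CT_STATE_BITS 2u
#define CT_STATE_MASK ((1u << CT_STATE_BITS) - 1u)
#define CT_SEQ_ONE    (1u << CT_STATE_BITS)

enum ct_state { CT_KERNEL = 0, CT_USER = 1, CT_IDLE = 2 };

/* Switch state and bump the sequence count in one atomic update. */
static void ct_set_state(_Atomic unsigned int *w, enum ct_state s)
{
	unsigned int old = atomic_load(w);
	unsigned int next;

	assert((unsigned int)s <= CT_STATE_MASK);
	do {
		next = ((old & ~CT_STATE_MASK) + CT_SEQ_ONE) | (unsigned int)s;
	} while (!atomic_compare_exchange_weak(w, &old, next));
}

/* Decode a snapshot taken with a single atomic_load(). */
static enum ct_state ct_get_state(unsigned int snap)
{
	return (enum ct_state)(snap & CT_STATE_MASK);
}

static unsigned int ct_get_seq(unsigned int snap)
{
	return snap >> CT_STATE_BITS;
}

/* In this model RCU watches only while the CPU is in kernel context. */
static bool ct_rcu_watching(unsigned int snap)
{
	return ct_get_state(snap) == CT_KERNEL;
}
```

Context tracking would only care about CT_USER versus the rest, while
RCU would treat CT_USER and CT_IDLE alike. Both read the same word, so
readers do not pay for a second variable.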

>
> Doing this for NMIs might be weird. Would it make sense to have a
> CONTEXT_NMI that's somehow valid even if the NMI happened while
> changing context tracking state?
>
> Thoughts? As it stands, I think we might already be broken for real:
>
> Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does
> copy_from_user_nmi, which can fault, causing do_page_fault to get
> called, which calls exception_enter(), which can't be a good thing.

I think the in_interrupt() check handles that. Besides, NMI has its own counter.
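
As a rough illustration of that check, here is a user-space model. The
struct, the irq_nesting field and every model_* name are hypothetical;
only the idea of bailing out early when in_interrupt() is true comes
from the context tracking code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the context tracking states. */
enum model_ctx { MODEL_CONTEXT_KERNEL, MODEL_CONTEXT_USER };

struct ct_model {
	enum model_ctx state;
	int irq_nesting; /* stand-in for what in_interrupt() inspects */
	int transitions; /* how many real state changes happened */
};

static bool model_in_interrupt(const struct ct_model *ct)
{
	return ct->irq_nesting > 0;
}

static void model_irq_enter(struct ct_model *ct)
{
	ct->irq_nesting++;
}

static void model_irq_exit(struct ct_model *ct)
{
	assert(ct->irq_nesting > 0);
	ct->irq_nesting--;
}

/* Models context_tracking_exit(): leave user state for kernel state. */
static void model_context_tracking_exit(struct ct_model *ct)
{
	if (model_in_interrupt(ct))
		return; /* exception nested in irq/NMI: simply ignored */
	if (ct->state == MODEL_CONTEXT_USER) {
		ct->state = MODEL_CONTEXT_KERNEL;
		ct->transitions++;
	}
}
```

A fault taken inside an NMI or irq handler therefore leaves the
recorded state untouched, which is why the nested exception_enter()
in the perf scenario is expected to be a no-op.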

>
> RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.
>
> Thoughts? As it stands, I need to do something because -tip and thus
> -next spews occasional warnings.

But yeah, if we can, it would be nice to use context tracking as the sole
tracker that RCU can safely use.

2015-07-18 13:12:59

by Frederic Weisbecker

Subject: Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking

On Fri, Jul 17, 2015 at 11:59:18AM -0700, Andy Lutomirski wrote:
> On Thu, Jul 16, 2015 at 9:49 PM, Paul E. McKenney wrote:
> > On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote:
> >> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote:
> >> > For reasons that mystify me a bit, we currently track context tracking
> >> > state separately from rcu's watching state. This results in strange
> >> > artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we
> >> > can nest exceptions inside the IRQ handler (an example would be
> >> > wrmsr_safe failing), and, in -next, we splat a warning:
> >> >
> >> > https://gist.github.com/sashalevin/a006a44989312f6835e7
> >> >
> >> > I'm trying to make context tracking more exact, which will fix this
> >> > issue (the particular splat that Sasha hit shouldn't be possible when
> >> > I'm done), but I think it would be nice to unify all of this stuff.
> >> > Would it be plausible for us to guarantee that RCU state is always in
> >> > sync with context tracking state? If so, we could maybe simplify
> >> > things and have fewer state variables.
> >>
> >> A noble goal. Might even be possible, and maybe even advantageous.
> >>
> >> But it is usually easier to say than to do. RCU really does need to make
> >> some adjustments when the state changes, as do the other subsystems.
> >> It might or might not be possible to do the transitions atomically.
> >> And if the transitions are not atomic, there will still be weird code
> >> paths where (say) the processor is considered non-idle, but RCU doesn't
> >> realize it yet. Such a code path could not safely use rcu_read_lock(),
> >> so you still need RCU to be able to scream if someone tries it.
> >> Contrariwise, if there is a code path where the processor is considered
> >> idle, but RCU thinks it is non-idle, that code path can stall
> >> grace periods. (Yes, not a problem if the code path is short enough.
> >> At least if the underlying VCPU is making progress...)
> >>
> >> Still, I cannot prove that it is impossible, and if it is possible,
> >> then as you say, there might well be benefits.
> >>
> >> > Doing this for NMIs might be weird. Would it make sense to have a
> >> > CONTEXT_NMI that's somehow valid even if the NMI happened while
> >> > changing context tracking state?
> >>
> >> Face it, NMIs are weird. ;-)
> >>
> >> > Thoughts? As it stands, I think we might already be broken for real:
> >> >
> >> > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does
> >> > copy_from_user_nmi, which can fault, causing do_page_fault to get
> >> > called, which calls exception_enter(), which can't be a good thing.
> >> >
> >> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.
> >>
> >> Actually, I see more cases where people forget irq_enter() than
> >> rcu_nmi_enter(). "We will just nip in quickly and do something without
> >> actually letting the irq system know. Oh, and we want some event tracing
> >> in that code path." Boom!
> >>
> >> > Thoughts? As it stands, I need to do something because -tip and thus
> >> > -next spews occasional warnings.
> >>
> >> Tell me more?
> >
> > And for completeness, RCU also has the following requirements on the
> > state-transition mechanism:
> >
> > 1. It must be possible to reliably sample some other CPU's state.
> > This is an energy-efficiency requirement, as RCU is not normally
> > permitted to wake up idle CPUs. Nor nohz CPUs, for that matter.
>
> NOHZ needs this for vtime accounting, too. I think Rik might be
> thinking about this. Maybe the underlying state could be shared?
>
> >
> > 2. RCU must be able to track passage through idle and nohz states.
> > In other words, if RCU samples at t=0 and finds that the CPU
> > is executing (say) in kernel mode, and RCU samples again at
> > t=10 and again finds that the CPU is executing in kernel mode,
> > RCU needs to be able to determine whether or not that CPU passed
> > through idle or nohz betweentimes.
>
> And RCU can do this for CONTEXT_KERNEL vs CONTEXT_USER because the
> context tracking stuff notifies RCU. The thing I'm less than happy
> with is that we can currently be CONTEXT_USER but still rcu-awake.
> This is manageable, but it seems messy.

When we interrupt userspace, right? I don't see that as much of a problem,
at least until we use a unified tracker for both RCU and context tracking.
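
For reference, requirements 1 and 2 quoted above can be modeled with a
single per-CPU counter that is bumped on every transition. The sketch
below is a user-space toy: the struct and all dyn_* names are
hypothetical, and counter wrap-around is ignored. It assumes the
counter is odd while the CPU is non-idle and even while it is in an
idle-like state, so a remote CPU can sample it without waking anyone.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: odd = non-idle, even = idle-like state. */
struct dyn_model {
	atomic_ulong dynticks;
};

static void dyn_init_nonidle(struct dyn_model *d)
{
	atomic_init(&d->dynticks, 1UL);
}

static void dyn_enter_idle(struct dyn_model *d)
{
	unsigned long old = atomic_fetch_add(&d->dynticks, 1UL);

	assert(old & 1UL); /* must have been non-idle */
	(void)old;
}

static void dyn_exit_idle(struct dyn_model *d)
{
	unsigned long old = atomic_fetch_add(&d->dynticks, 1UL);

	assert(!(old & 1UL)); /* must have been idle */
	(void)old;
}

/* Requirement 1: a remote CPU can sample the state with a plain load. */
static unsigned long dyn_snapshot(struct dyn_model *d)
{
	return atomic_load(&d->dynticks);
}

/*
 * Requirement 2: report whether the CPU was idle at snapshot time or
 * has passed through an idle-like state since. Wrap-around is ignored.
 */
static bool dyn_quiescent_since(struct dyn_model *d, unsigned long snap)
{
	unsigned long cur = atomic_load(&d->dynticks);

	return (snap & 1UL) == 0 || cur != snap;
}
```

Two samples that both say "kernel mode" are told apart by the counter
value: if it moved, the CPU went through idle or user in between.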

>
> >
> > 3. In some configurations, RCU needs to be able to block entry into
> > nohz state, both for idle and userspace.
> >
>
> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake,
> although the tick would have to stay on.

Well, 3) is handled by the tick nohz code, so it's still external.