2016-11-05 04:19:08

by Chris Metcalf

Subject: task isolation discussion at Linux Plumbers

A bunch of people got together this week at the Linux Plumbers
Conference to discuss nohz_full, task isolation, and related stuff.
(Thanks to Thomas for getting everyone gathered at one place and time!)

Here are the notes I took; I welcome any corrections and follow-up.


== rcu_nocbs ==

We started out by discussing this option. It is automatically enabled
by nohz_full, but we spent a little while side-tracking on the
implementation of one kthread per rcu flavor per core. The suggestion
was made (by Peter or Andy; I forget) that each kthread could handle
all flavors per core by using a dedicated worklist. It certainly
seems like removing potentially dozens or hundreds of kthreads from
larger systems will be a win if this works out.
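
To make the shape of that idea concrete, here is a purely illustrative
sketch. The nocb_* helpers and NR_RCU_FLAVORS below are made up, and the
real no-CBs machinery is more involved; the point is just a single
per-CPU kthread servicing every flavor's queue of callbacks whose grace
periods have already completed:

/* Hypothetical: one no-CBs kthread per CPU instead of one per flavor. */
static int rcu_nocb_combined_kthread(void *arg)
{
        int cpu = (long)arg;
        int flavor;

        while (!kthread_should_stop()) {
                for (flavor = 0; flavor < NR_RCU_FLAVORS; flavor++) {
                        /* Pull only callbacks whose grace period has
                         * already elapsed onto one local worklist. */
                        struct rcu_head *ready;

                        ready = nocb_steal_ready_cbs(cpu, flavor);
                        nocb_invoke_cbs(ready);
                }
                /* Sleep until more callbacks become ready. */
                nocb_wait_for_cbs(cpu);
        }
        return 0;
}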

Paul said he would look into this possibility.


== Remote statistics ==

We discussed the possibility of remote statistics gathering, i.e. load
average etc. The idea would be that we could have housekeeping
core(s) periodically iterate over the nohz cores to load their rq
remotely and do update_current etc. Presumably it should be possible
for a single housekeeping core to handle doing this for all the
nohz_full cores, as we only need to do it quite infrequently.
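
As a very rough sketch of what the housekeeping side might look like
(update_rq_stats_remote() below is a hypothetical helper; only
tick_nohz_full_mask and the workqueue calls are existing interfaces):

static void remote_stats_update(struct work_struct *work);
static DECLARE_DELAYED_WORK(remote_stats_work, remote_stats_update);

/* Runs periodically on a housekeeping CPU only. */
static void remote_stats_update(struct work_struct *work)
{
        int cpu;

        /* Walk the nohz_full CPUs and do their accounting remotely. */
        for_each_cpu(cpu, tick_nohz_full_mask)
                update_rq_stats_remote(cpu);    /* hypothetical helper */

        /* A low rate is fine; this is what would let us drop the
         * current 1 Hz fallback tick on the isolated CPUs. */
        schedule_delayed_work(&remote_stats_work, HZ);
}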

Thomas suggested that this might be the last remaining thing that
needed to be done to allow disabling the current behavior of falling
back to a 1 Hz clock in nohz_full.

I believe Thomas said he had a patch to do this already.


== Remote LRU cache drain ==

One of the issues with task isolation currently is that the LRU cache
drain must be done prior to entering userspace, but it requires
interrupts enabled and thus can't be done atomically. My previous
patch series have handled this by checking with interrupts disabled,
but then looping around with interrupts enabled to try to drain the
LRU pagevecs. Experimentally this works, but it's not provable that
it terminates, which is worrisome. Andy suggested adding a percpu
flag to disable creation of deferred work like LRU cache pages.

Thomas suggested using an RT "local lock" to guard the LRU cache
flush; he is planning on bringing the concept to mainline in any case.
However, after some discussion we converged on simply using a spinlock
to guard the appropriate resources. As a result, the
lru_add_drain_all() code that currently queues work on each remote cpu
to drain it, can instead simply acquire the lock and drain it remotely.
This means that a task isolation task no longer needs to worry about
being interrupted by SMP function call IPIs, so we don't have to deal
with this in the task isolation framework any more.
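
Roughly, the intent is the sketch below. lru_add_drain_cpu() exists
today, but the per-cpu lock is hypothetical, and a real patch would also
need the local pagevec-add fast path to take the same lock:

/* Hypothetical per-cpu lock guarding that CPU's LRU pagevecs
 * (assumed to be initialized with spin_lock_init() at boot). */
static DEFINE_PER_CPU(spinlock_t, lru_pvec_lock);

void lru_add_drain_all(void)
{
        int cpu;

        for_each_online_cpu(cpu) {
                spinlock_t *lock = &per_cpu(lru_pvec_lock, cpu);

                /* Drain the remote CPU's pagevecs from here rather than
                 * queueing work on it (and thus interrupting it). */
                spin_lock(lock);
                lru_add_drain_cpu(cpu);
                spin_unlock(lock);
        }
}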

I don't recall anyone else volunteering to tackle this, so I will plan
to look at it. The patch to do that should be orthogonal to the
revised task isolation patch series.


== Quiescing vmstat ==

Another issue that task isolation handles is ensuring that the vmstat
worker is quiesced before returning to user space. Currently we
cancel the vmstat delayed work, then invoke refresh_cpu_vm_stats().
Currently neither of these things appears safe to do in the
interrupts-disabled context just before return to userspace, because
they both can call schedule(): refresh_cpu_vm_stats() via a
cond_resched() under CONFIG_NUMA, and cancel_delayed_work_sync() via a
schedule() in __cancel_work_timer().

Christoph offered to work with me to make sure that we could do the
appropriate quiescing with interrupts disabled, and seemed confident
it should be doable.
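
To make the goal concrete, the end state we want is roughly the sketch
below, run with interrupts disabled just before the return to userspace.
The function name is hypothetical; quiet_vmstat() is an existing helper
that ends up calling refresh_cpu_vm_stats(false) without sleeping:

/* Hypothetical task-isolation hook; must not sleep. */
static void task_isolation_quiesce_vmstat(void)
{
        /*
         * Fold this CPU's pending per-cpu counter deltas without the
         * cond_resched()/schedule() that the paths above can take.
         * Stopping the delayed vmstat work itself would additionally
         * need a non-_sync cancel done inside mm/vmstat.c.
         */
        quiet_vmstat();
}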


== Remote kernel TLB flush ==

Andy then brought up the issue of remote kernel TLB flush, which I've
been trying to sweep under the rug for the initial task isolation
series. Remote TLB flush causes an interrupt on many systems (x86 and
tile, for example, although not arm64), so to the extent that it
occurs frequently, it becomes important to handle for task isolation.
With the recent addition of vmap kernel stacks, this becomes suddenly
much more important than it used to be, to the point where we now
really have to handle it for task isolation.

The basic insight here is that you can safely skip interrupting
userspace cores when you are sending remote kernel TLB flushes, since
by definition they can't touch the kernel pages in question anyway.
Then you just need to guarantee to flush the kernel TLB space next
time the userspace task re-enters the kernel.

The original Tilera dataplane code handled this by tracking task state
(kernel, user, or user-flushed) and manipulating the state atomically
at TLB flush time and kernel entry time. After some discussion of the
overheads of such atomics, Andy pointed out that there is already an
atomic increment being done in the RCU code, and we should be able to
leverage that word to achieve this effect. The idea is that remote
cores would do a compare-exchange of 0 to 1, which if it succeeded
would indicate that the remote core was in userspace and thus didn't
need to be IPI'd, but that it was now tagged for a kernel flush next
time the remote task entered the kernel. Then, when the remote task
enters the kernel, it does an atomic update of its own dynticks and,
if it discovers the low bit set, does a kernel TLB flush before
continuing.
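
In other words, something shaped roughly like the sketch below. This is
conceptual only: the real plan is to fold the flag into the low bit of
the RCU ->dynticks word rather than use a separate per-cpu atomic, and
all the names here are made up:

/* Tri-state per-cpu word, as in the Tilera scheme described above. */
enum { CPU_IN_USER = 0, CPU_IN_USER_FLUSH_PENDING = 1, CPU_IN_KERNEL = 2 };
static DEFINE_PER_CPU(atomic_t, ktlb_state);

/* Remote flusher: true if @cpu can be skipped, i.e. it was in userspace
 * and is now tagged to flush on its next kernel entry. */
static bool defer_kernel_tlb_flush(int cpu)
{
        atomic_t *p = &per_cpu(ktlb_state, cpu);

        return atomic_cmpxchg(p, CPU_IN_USER,
                              CPU_IN_USER_FLUSH_PENDING) == CPU_IN_USER;
}

/* Kernel entry on an isolated CPU. */
static void ktlb_note_kernel_entry(void)
{
        if (atomic_xchg(this_cpu_ptr(&ktlb_state), CPU_IN_KERNEL) ==
            CPU_IN_USER_FLUSH_PENDING)
                flush_tlb_all();        /* heavier than strictly needed */
}

/* Kernel exit, just before returning to userspace. */
static void ktlb_note_user_entry(void)
{
        atomic_set(this_cpu_ptr(&ktlb_state), CPU_IN_USER);
}

Reusing the existing ->dynticks increment, as Andy suggested, gets the
same effect without adding any atomics to the kernel entry/exit paths.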

It was agreed that this makes sense to do unconditionally, since it's
not just helpful for nohz_full and task isolation, but also for idle,
since interrupting an idle core periodically just to do repeated
kernel tlb flushes isn't good for power consumption.

One open question is whether we discover the low bit set early enough
in kernel entry that we can trust that we haven't tried to touch any
pages that have been invalidated in the TLB.

Paul agreed to take a look at implementing this.


== Optimizing vfree via RCU ==

An orthogonal issue was also brought up, which is whether we could use
RCU to handle the kernel TLB flush from freeing vmaps; presumably if
we have enough vmap space, we can arrange to return the freed VA space
via RCU, and simply defer the TLB flush until the next grace period.
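
If that pans out, the shape would presumably be something like the
sketch below. This is very rough: none of it exists,
remove_vm_area_and_flush() is a made-up placeholder, and the accounting
of outstanding VA space is glossed over entirely:

/* Defer the unmap and kernel TLB flush of a freed vmap area to an RCU
 * callback, so flushes batch up per grace period instead of
 * interrupting every CPU on every vfree(). */
struct deferred_vmap {
        struct rcu_head rcu;
        struct vm_struct *area;
};

static void vmap_free_rcu(struct rcu_head *head)
{
        struct deferred_vmap *d =
                container_of(head, struct deferred_vmap, rcu);

        /* A grace period has elapsed; one flush can cover many areas. */
        remove_vm_area_and_flush(d->area);      /* hypothetical */
        kfree(d);
}

/* Called from vfree() instead of flushing immediately. */
static void vfree_defer(struct vm_struct *area)
{
        struct deferred_vmap *d = kmalloc(sizeof(*d), GFP_ATOMIC);

        if (!d)
                return;         /* fall back to the synchronous path (not shown) */
        d->area = area;
        call_rcu(&d->rcu, vmap_free_rcu);
}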

I'm not sure if this is practical if we encounter a high volume of
vfrees, but I don't think we really reached a definitive agreement on
it during the discussion either.


== Disabling the dyn tick ==

One issue that the current task isolation patch series encounters is
when we request disabling the dyntick, but it doesn't happen. At the
moment we just wait until the tick is properly disabled, by
busy-waiting in the kernel (calling schedule etc as needed). No one
is particularly fond of this scheme. The consensus seems to be to try
harder to figure out what is going on, fix whatever problems exist,
then consider it a regression going forward if something causes the
dyntick to become difficult to disable again in the future. I will
take a look at this and try to gather more data on if and when this is
happening in 4.9.


== Missing oneshot_stopped callbacks ==

I raised the issue that various clock_event_device sources don't
always support oneshot_stopped, which can cause an additional
final interrupt to occur after the timer infrastructure believes the
interrupt has been stopped. I have patches to fix this for tile and
arm64 in my patch series; Thomas volunteered to look at adding
equivalent support for x86.


Many thanks to all those who participated in the discussion.
Frederic, we wished you had been there!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com


Subject: Re: task isolation discussion at Linux Plumbers

On Sat, 5 Nov 2016, Chris Metcalf wrote:

> Here are the notes I took; I welcome any corrections and follow-up.

Thank you for writing this up. I hope we can now move forward further on
these issues.

> == Remote statistics ==
>
> We discussed the possibility of remote statistics gathering, i.e. load
> average etc. The idea would be that we could have housekeeping
> core(s) periodically iterate over the nohz cores to load their rq
> remotely and do update_current etc. Presumably it should be possible
> for a single housekeeping core to handle doing this for all the
> nohz_full cores, as we only need to do it quite infrequently.
>
> Thomas suggested that this might be the last remaining thing that
> needed to be done to allow disabling the current behavior of falling
> back to a 1 Hz clock in nohz_full.
>
> I believe Thomas said he had a patch to do this already.


Note that the vmstat_shepherd already scans idle cpus every 2 seconds to
see if there are new updates to vm statistics that require the
reactivation of the vmstat updater. It would be possible to
opportunistically update the thread statistics should the remote cpu be in
user space (if one figures out the synchronization issues for remote per
cpu updates).

> == Quiescing vmstat ==
>
> Another issue that task isolation handles is ensuring that the vmstat
> worker is quiesced before returning to user space. Currently we
> cancel the vmstat delayed work, then invoke refresh_cpu_vm_stats().
> Currently neither of these things appears safe to do in the
> interrupts-disabled context just before return to userspace, because
> they both can call schedule(): refresh_cpu_vm_stats() via a
> cond_resched() under CONFIG_NUMA, and cancel_delayed_work_sync() via a
> schedule() in __cancel_work_timer().
>
> Christoph offered to work with me to make sure that we could do the
> appropriate quiescing with interrupts disabled, and seemed confident
> it should be doable.

This is already implemented. You can call

refresh_cpu_vm_stats(false)

to do the quiescing with interrupts disabled.

2016-11-07 16:59:18

by Thomas Gleixner

Subject: Re: task isolation discussion at Linux Plumbers

On Sat, 5 Nov 2016, Chris Metcalf wrote:
> == Remote statistics ==
>
> We discussed the possibility of remote statistics gathering, i.e. load
> average etc. The idea would be that we could have housekeeping
> core(s) periodically iterate over the nohz cores to load their rq
> remotely and do update_current etc. Presumably it should be possible
> for a single housekeeping core to handle doing this for all the
> nohz_full cores, as we only need to do it quite infrequently.
>
> Thomas suggested that this might be the last remaining thing that
> needed to be done to allow disabling the current behavior of falling
> back to a 1 Hz clock in nohz_full.
>
> I believe Thomas said he had a patch to do this already.

No, Rik was working on that.

> == Remote LRU cache drain ==
>
> One of the issues with task isolation currently is that the LRU cache
> drain must be done prior to entering userspace, but it requires
> interrupts enabled and thus can't be done atomically. My previous
> patch series have handled this by checking with interrupts disabled,
> but then looping around with interrupts enabled to try to drain the
> LRU pagevecs. Experimentally this works, but it's not provable that
> it terminates, which is worrisome. Andy suggested adding a percpu
> flag to disable creation of deferred work like LRU cache pages.
>
> Thomas suggested using an RT "local lock" to guard the LRU cache
> flush; he is planning on bringing the concept to mainline in any case.
> However, after some discussion we converged on simply using a spinlock
> to guard the appropriate resources. As a result, the
> lru_add_drain_all() code that currently queues work on each remote cpu
> to drain it, can instead simply acquire the lock and drain it remotely.
> This means that a task isolation task no longer needs to worry about
> being interrupted by SMP function call IPIs, so we don't have to deal
> with this in the task isolation framework any more.
>
> I don't recall anyone else volunteering to tackle this, so I will plan
> to look at it. The patch to do that should be orthogonal to the
> revised task isolation patch series.

I offered to clean up the patch from RT. I'll do that in the next days.

> == Missing oneshot_stopped callbacks ==
>
> I raised the issue that various clock_event_device sources don't
> always support oneshot_stopped, which can cause an additional
> final interrupt to occur after the timer infrastructure believes the
> interrupt has been stopped. I have patches to fix this for tile and
> arm64 in my patch series; Thomas volunteered to look at adding
> equivalent support for x86.

Right.

Thanks,

tglx

2016-11-07 18:41:30

by Thomas Gleixner

Subject: Re: task isolation discussion at Linux Plumbers

On Mon, 7 Nov 2016, Thomas Gleixner wrote:
> > == Missing oneshot_stopped callbacks ==
> >
> > I raised the issue that various clock_event_device sources don't
> > always support oneshot_stopped, which can cause an additional
> > final interrupt to occur after the timer infrastructure believes the
> > interrupt has been stopped. I have patches to fix this for tile and
> > arm64 in my patch series; Thomas volunteered to look at adding
> > equivalent support for x86.
>
> Right.

Untested patch below should fix that.

Thanks,

tglx

8<---------------
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -530,18 +530,20 @@ static void lapic_timer_broadcast(const
* The local apic timer can be used for any function which is CPU local.
*/
static struct clock_event_device lapic_clockevent = {
- .name = "lapic",
- .features = CLOCK_EVT_FEAT_PERIODIC |
- CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_C3STOP
- | CLOCK_EVT_FEAT_DUMMY,
- .shift = 32,
- .set_state_shutdown = lapic_timer_shutdown,
- .set_state_periodic = lapic_timer_set_periodic,
- .set_state_oneshot = lapic_timer_set_oneshot,
- .set_next_event = lapic_next_event,
- .broadcast = lapic_timer_broadcast,
- .rating = 100,
- .irq = -1,
+ .name = "lapic",
+ .features = CLOCK_EVT_FEAT_PERIODIC |
+ CLOCK_EVT_FEAT_ONESHOT |
+ CLOCK_EVT_FEAT_C3STOP |
+ CLOCK_EVT_FEAT_DUMMY,
+ .shift = 32,
+ .set_state_shutdown = lapic_timer_shutdown,
+ .set_state_periodic = lapic_timer_set_periodic,
+ .set_state_oneshot = lapic_timer_set_oneshot,
+ .set_state_oneshot_stopped = lapic_timer_shutdown,
+ .set_next_event = lapic_next_event,
+ .broadcast = lapic_timer_broadcast,
+ .rating = 100,
+ .irq = -1,
};
static DEFINE_PER_CPU(struct clock_event_device, lapic_events);


2016-11-07 19:17:02

by Will Deacon

Subject: Re: task isolation discussion at Linux Plumbers

On Mon, Nov 07, 2016 at 02:12:13PM -0500, Rik van Riel wrote:
> On Mon, 2016-11-07 at 19:36 +0100, Thomas Gleixner wrote:
> > On Mon, 7 Nov 2016, Thomas Gleixner wrote:
> > >
> > > >
> > > > == Missing oneshot_stopped callbacks ==
> > > >
> > > > I raised the issue that various clock_event_device sources don't
> > > > always support oneshot_stopped, which can cause an additional
> > > > final interrupt to occur after the timer infrastructure believes
> > > > the
> > > > interrupt has been stopped.  I have patches to fix this for tile
> > > > and
> > > > arm64 in my patch series; Thomas volunteered to look at adding
> > > > equivalent support for x86.
> > >
> > > Right.
> >
> > Untested patch below should fix that.
> >
>
> That whitespace cleanup looks awesome, but I am not
> optimistic about its chances to bring about functional
> change.
>
> What am I overlooking?

It hooks up .set_state_oneshot_stopped?

Will

2016-11-07 19:18:08

by Rik van Riel

Subject: Re: task isolation discussion at Linux Plumbers

On Mon, 2016-11-07 at 19:16 +0000, Will Deacon wrote:
> On Mon, Nov 07, 2016 at 02:12:13PM -0500, Rik van Riel wrote:
> >
> > On Mon, 2016-11-07 at 19:36 +0100, Thomas Gleixner wrote:
> > >
> > > On Mon, 7 Nov 2016, Thomas Gleixner wrote:
> > > >
> > > >
> > > > >
> > > > >
> > > > > == Missing oneshot_stopped callbacks ==
> > > > >
> > > > > I raised the issue that various clock_event_device sources
> > > > > don't
> > > > > always support oneshot_stopped, which can cause an additional
> > > > > final interrupt to occur after the timer infrastructure
> > > > > believes
> > > > > the
> > > > > interrupt has been stopped.  I have patches to fix this for
> > > > > tile
> > > > > and
> > > > > arm64 in my patch series; Thomas volunteered to look at
> > > > > adding
> > > > > equivalent support for x86.
> > > >
> > > > Right.
> > >
> > > Untested patch below should fix that.
> > >  
> >
> > That whitespace cleanup looks awesome, but I am not
> > optimistic about its chances to bring about functional
> > change.
> >
> > What am I overlooking?
>
> It hooks up .set_state_oneshot_stopped?

Gah, indeed. Never mind :)

--
All Rights Reversed.



2016-11-07 19:21:52

by Rik van Riel

Subject: Re: task isolation discussion at Linux Plumbers

On Mon, 2016-11-07 at 19:36 +0100, Thomas Gleixner wrote:
> On Mon, 7 Nov 2016, Thomas Gleixner wrote:
> >
> > >
> > > == Missing oneshot_stopped callbacks ==
> > >
> > > I raised the issue that various clock_event_device sources don't
> > > always support oneshot_stopped, which can cause an additional
> > > final interrupt to occur after the timer infrastructure believes
> > > the
> > > interrupt has been stopped.  I have patches to fix this for tile
> > > and
> > > arm64 in my patch series; Thomas volunteered to look at adding
> > > equivalent support for x86.
> >
> > Right.
>
> Untested patch below should fix that.


That whitespace cleanup looks awesome, but I am not
optimistic about its chances to bring about functional
change.

What am I overlooking?

-- 
All Rights Reversed.



2016-11-09 01:40:55

by Paul E. McKenney

Subject: Re: task isolation discussion at Linux Plumbers

On Sat, Nov 05, 2016 at 12:04:45AM -0400, Chris Metcalf wrote:
> A bunch of people got together this week at the Linux Plumbers
> Conference to discuss nohz_full, task isolation, and related stuff.
> (Thanks to Thomas for getting everyone gathered at one place and time!)
>
> Here are the notes I took; I welcome any corrections and follow-up.

[ . . . ]

> == Remote kernel TLB flush ==
>
> Andy then brought up the issue of remote kernel TLB flush, which I've
> been trying to sweep under the rug for the initial task isolation
> series. Remote TLB flush causes an interrupt on many systems (x86 and
> tile, for example, although not arm64), so to the extent that it
> occurs frequently, it becomes important to handle for task isolation.
> With the recent addition of vmap kernel stacks, this becomes suddenly
> much more important than it used to be, to the point where we now
> really have to handle it for task isolation.
>
> The basic insight here is that you can safely skip interrupting
> userspace cores when you are sending remote kernel TLB flushes, since
> by definition they can't touch the kernel pages in question anyway.
> Then you just need to guarantee to flush the kernel TLB space next
> time the userspace task re-enters the kernel.
>
> The original Tilera dataplane code handled this by tracking task state
> (kernel, user, or user-flushed) and manipulating the state atomically
> at TLB flush time and kernel entry time. After some discussion of the
> overheads of such atomics, Andy pointed out that there is already an
> atomic increment being done in the RCU code, and we should be able to
> leverage that word to achieve this effect. The idea is that remote
> cores would do a compare-exchange of 0 to 1, which if it succeeded
> would indicate that the remote core was in userspace and thus didn't
> need to be IPI'd, but that it was now tagged for a kernel flush next
> time the remote task entered the kernel. Then, when the remote task
> enters the kernel, it does an atomic update of its own dynticks and,
> if it discovers the low bit set, does a kernel TLB flush before
> continuing.
>
> It was agreed that this makes sense to do unconditionally, since it's
> not just helpful for nohz_full and task isolation, but also for idle,
> since interrupting an idle core periodically just to do repeated
> kernel tlb flushes isn't good for power consumption.
>
> One open question is whether we discover the low bit set early enough
> in kernel entry that we can trust that we haven't tried to touch any
> pages that have been invalidated in the TLB.
>
> Paul agreed to take a look at implementing this.

Please see a prototype at 49961e272333 at:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

Or see below for the patch.

As discussed earlier, once I get this working, I'll hand it off to you
to add your code.

Thoughts?

Thanx, Paul

------------------------------------------------------------------------

commit 49961e272333ac720ac4ccbaba45521bfea259ae
Author: Paul E. McKenney <[email protected]>
Date: Tue Nov 8 14:25:21 2016 -0800

rcu: Maintain special bits at bottom of ->dynticks counter

Currently, IPIs are used to force other CPUs to invalidate their TLBs
in response to a kernel virtual-memory mapping change. This works, but
degrades both battery lifetime (for idle CPUs) and real-time response
(for nohz_full CPUs), and in addition results in unnecessary IPIs due to
the fact that CPUs executing in usermode are unaffected by stale kernel
mappings. It would be better to cause a CPU executing in usermode to
wait until it is entering kernel mode to

This commit therefore reserves a bit at the bottom of the ->dynticks
counter, which is checked upon exit from extended quiescent states. If it
is set, it is cleared and then a new rcu_dynticks_special_exit() macro
is invoked, which, if not supplied, is an empty single-pass do-while loop.
If this bottom bit is set on -entry- to an extended quiescent state,
then a WARN_ON_ONCE() triggers.

This bottom bit may be set using a new rcu_dynticks_special_set()
function, which returns true if the bit was set, or false if the CPU
turned out to not be in an extended quiescent state. Please note that
this function refuses to set the bit for a non-nohz_full CPU when that
CPU is executing in usermode because usermode execution is tracked by
RCU as a dyntick-idle extended quiescent state only for nohz_full CPUs.

Reported-by: Andy Lutomirski <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 4f9b2fa2173d..130d911e4ba1 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
return 0;
}

+static inline bool rcu_dynticks_special_set(int cpu)
+{
+ return false; /* Never flag non-existent other CPUs! */
+}
+
static inline unsigned long get_state_synchronize_rcu(void)
{
return 0;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dbf20b058f48..8de83830e86b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
};

/*
+ * Steal a bit from the bottom of ->dynticks for idle entry/exit
+ * control. Initially this is for TLB flushing.
+ */
+#define RCU_DYNTICK_CTRL_MASK 0x1
+#define RCU_DYNTICK_CTRL_CTR (RCU_DYNTICK_CTRL_MASK + 1)
+#ifndef rcu_dynticks_special_exit
+#define rcu_dynticks_special_exit() do { } while (0)
+#endif
+
+/*
* Record entry into an extended quiescent state. This is only to be
* called when not already in an extended quiescent state.
*/
static void rcu_dynticks_eqs_enter(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ int seq;

/*
- * CPUs seeing atomic_inc() must see prior RCU read-side critical
- * sections, and we also must force ordering with the next idle
- * sojourn.
+ * CPUs seeing atomic_inc_return() must see prior RCU read-side
+ * critical sections, and we also must force ordering with the
+ * next idle sojourn.
*/
- smp_mb__before_atomic(); /* See above. */
- atomic_inc(&rdtp->dynticks);
- smp_mb__after_atomic(); /* See above. */
+ seq = atomic_inc_return(&rdtp->dynticks);
+ /* Better be in an extended quiescent state! */
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+ (seq & RCU_DYNTICK_CTRL_CTR));
+ /* Better not have special action (TLB flush) pending! */
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
- atomic_read(&rdtp->dynticks) & 0x1);
+ (seq & RCU_DYNTICK_CTRL_MASK));
}

/*
@@ -305,17 +318,21 @@ static void rcu_dynticks_eqs_enter(void)
static void rcu_dynticks_eqs_exit(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ int seq;

/*
- * CPUs seeing atomic_inc() must see prior idle sojourns,
+ * CPUs seeing atomic_inc_return() must see prior idle sojourns,
* and we also must force ordering with the next RCU read-side
* critical section.
*/
- smp_mb__before_atomic(); /* See above. */
- atomic_inc(&rdtp->dynticks);
- smp_mb__after_atomic(); /* See above. */
+ seq = atomic_inc_return(&rdtp->dynticks);
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
- !(atomic_read(&rdtp->dynticks) & 0x1));
+ !(seq & RCU_DYNTICK_CTRL_CTR));
+ if (seq & RCU_DYNTICK_CTRL_MASK) {
+ atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
+ smp_mb__after_atomic(); /* Clear bits before acting on them */
+ rcu_dynticks_special_exit();
+ }
}

/*
@@ -326,7 +343,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
{
int snap = atomic_add_return(0, &rdtp->dynticks);

- return snap;
+ return snap & ~RCU_DYNTICK_CTRL_MASK;
}

/*
@@ -335,7 +352,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
*/
static bool rcu_dynticks_in_eqs(int snap)
{
- return !(snap & 0x1);
+ return !(snap & RCU_DYNTICK_CTRL_CTR);
}

/*
@@ -355,10 +372,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
static void rcu_dynticks_momentary_idle(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
- int special = atomic_add_return(2, &rdtp->dynticks);
+ int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
+ &rdtp->dynticks);

/* It is illegal to call this from idle state. */
- WARN_ON_ONCE(!(special & 0x1));
+ WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
+}
+
+/*
+ * Set the special (bottom) bit of the specified CPU so that it
+ * will take special action (such as flushing its TLB) on the
+ * next exit from an extended quiescent state. Returns true if
+ * the bit was successfully set, or false if the CPU was not in
+ * an extended quiescent state.
+ */
+bool rcu_dynticks_special_set(int cpu)
+{
+ int old;
+ int new;
+ struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+
+ do {
+ old = atomic_read(&rdtp->dynticks);
+ if (old & RCU_DYNTICK_CTRL_CTR)
+ return false;
+ new = old | ~RCU_DYNTICK_CTRL_MASK;
+ } while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
+ return true;
}

DEFINE_PER_CPU_SHARED_ALIGNED(unsigned long, rcu_qs_ctr);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3b953dcf6afc..c444787a3bdc 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -596,6 +596,7 @@ extern struct rcu_state rcu_preempt_state;
#endif /* #ifdef CONFIG_PREEMPT_RCU */

int rcu_dynticks_snap(struct rcu_dynticks *rdtp);
+bool rcu_dynticks_special_set(int cpu);

#ifdef CONFIG_RCU_BOOST
DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_status);

2016-11-09 11:08:10

by Frederic Weisbecker

Subject: Re: task isolation discussion at Linux Plumbers

2016-11-05 4:04 GMT+00:00 Chris Metcalf <[email protected]>:
> A bunch of people got together this week at the Linux Plumbers
> Conference to discuss nohz_full, task isolation, and related stuff.
> (Thanks to Thomas for getting everyone gathered at one place and time!)
>
> Here are the notes I took; I welcome any corrections and follow-up.
>

Thanks for that report Chris!

> == rcu_nocbs ==
>
> We started out by discussing this option. It is automatically enabled
> by nohz_full, but we spent a little while side-tracking on the
> implementation of one kthread per rcu flavor per core. The suggestion
> was made (by Peter or Andy; I forget) that each kthread could handle
> all flavors per core by using a dedicated worklist. It certainly
> seems like removing potentially dozens or hundreds of kthreads from
> larger systems will be a win if this works out.
>
> Paul said he would look into this possibility.

Sounds good.

>
>
> == Remote statistics ==
>
> We discussed the possibility of remote statistics gathering, i.e. load
> average etc. The idea would be that we could have housekeeping
> core(s) periodically iterate over the nohz cores to load their rq
> remotely and do update_current etc. Presumably it should be possible
> for a single housekeeping core to handle doing this for all the
> nohz_full cores, as we only need to do it quite infrequently.
>
> Thomas suggested that this might be the last remaining thing that
> needed to be done to allow disabling the current behavior of falling
> back to a 1 Hz clock in nohz_full.
>
> I believe Thomas said he had a patch to do this already.
>

There are also some other details around update_curr to take care of,
but that's certainly a big piece of it.
I had wished we could find a solution that doesn't involve remote
accounting, but at least it could be a first step.
I have let that idea rot for too long; I need to get my hands into
it for good.

> == Disabling the dyn tick ==
>
> One issue that the current task isolation patch series encounters is
> when we request disabling the dyntick, but it doesn't happen. At the
> moment we just wait until the tick is properly disabled, by
> busy-waiting in the kernel (calling schedule etc as needed). No one
> is particularly fond of this scheme. The consensus seems to be to try
> harder to figure out what is going on, fix whatever problems exist,
> then consider it a regression going forward if something causes the
> dyntick to become difficult to disable again in the future. I will
> take a look at this and try to gather more data on if and when this is
> happening in 4.9.
>

We could enhance dynticks tracing, for example by expanding the tick-stop
failure codes, in order to report more details about what's going on.

> == Missing oneshot_stopped callbacks ==
>
> I raised the issue that various clock_event_device sources don't
> always support oneshot_stopped, which can cause an additional
> final interrupt to occur after the timer infrastructure believes the
> interrupt has been stopped. I have patches to fix this for tile and
> arm64 in my patch series; Thomas volunteered to look at adding
> equivalent support for x86.
>
>
> Many thanks to all those who participated in the discussion.
> Frederic, we wished you had been there!

I wish I had too!

2016-11-09 11:15:07

by Andy Lutomirski

Subject: Re: task isolation discussion at Linux Plumbers

On Tue, Nov 8, 2016 at 5:40 PM, Paul E. McKenney
<[email protected]> wrote:

> commit 49961e272333ac720ac4ccbaba45521bfea259ae
> Author: Paul E. McKenney <[email protected]>
> Date: Tue Nov 8 14:25:21 2016 -0800
>
> rcu: Maintain special bits at bottom of ->dynticks counter
>
> Currently, IPIs are used to force other CPUs to invalidate their TLBs
> in response to a kernel virtual-memory mapping change. This works, but
> degrades both battery lifetime (for idle CPUs) and real-time response
> (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
> the fact that CPUs executing in usermode are unaffected by stale kernel
> mappings. It would be better to cause a CPU executing in usermode to
> wait until it is entering kernel mode to

missing words here?

>
> This commit therefore reserves a bit at the bottom of the ->dynticks
> counter, which is checked upon exit from extended quiescent states. If it
> is set, it is cleared and then a new rcu_dynticks_special_exit() macro
> is invoked, which, if not supplied, is an empty single-pass do-while loop.
> If this bottom bit is set on -entry- to an extended quiescent state,
> then a WARN_ON_ONCE() triggers.
>
> This bottom bit may be set using a new rcu_dynticks_special_set()
> function, which returns true if the bit was set, or false if the CPU
> turned out to not be in an extended quiescent state. Please note that
> this function refuses to set the bit for a non-nohz_full CPU when that
> CPU is executing in usermode because usermode execution is tracked by
> RCU as a dyntick-idle extended quiescent state only for nohz_full CPUs.

I'm inclined to suggest s/dynticks/eqs/ in the public API. To me,
"dynticks" is a feature, whereas "eqs" means "extended quiescent
state" and means something concrete about the CPU state

>
> Reported-by: Andy Lutomirski <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> index 4f9b2fa2173d..130d911e4ba1 100644
> --- a/include/linux/rcutiny.h
> +++ b/include/linux/rcutiny.h
> @@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
> return 0;
> }
>
> +static inline bool rcu_dynticks_special_set(int cpu)
> +{
> + return false; /* Never flag non-existent other CPUs! */
> +}
> +
> static inline unsigned long get_state_synchronize_rcu(void)
> {
> return 0;
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index dbf20b058f48..8de83830e86b 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
> };
>
> /*
> + * Steal a bit from the bottom of ->dynticks for idle entry/exit
> + * control. Initially this is for TLB flushing.
> + */
> +#define RCU_DYNTICK_CTRL_MASK 0x1
> +#define RCU_DYNTICK_CTRL_CTR (RCU_DYNTICK_CTRL_MASK + 1)
> +#ifndef rcu_dynticks_special_exit
> +#define rcu_dynticks_special_exit() do { } while (0)
> +#endif
> +

> /*
> @@ -305,17 +318,21 @@ static void rcu_dynticks_eqs_enter(void)
> static void rcu_dynticks_eqs_exit(void)
> {
> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> + int seq;
>
> /*
> - * CPUs seeing atomic_inc() must see prior idle sojourns,
> + * CPUs seeing atomic_inc_return() must see prior idle sojourns,
> * and we also must force ordering with the next RCU read-side
> * critical section.
> */
> - smp_mb__before_atomic(); /* See above. */
> - atomic_inc(&rdtp->dynticks);
> - smp_mb__after_atomic(); /* See above. */
> + seq = atomic_inc_return(&rdtp->dynticks);
> WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> - !(atomic_read(&rdtp->dynticks) & 0x1));
> + !(seq & RCU_DYNTICK_CTRL_CTR));
> + if (seq & RCU_DYNTICK_CTRL_MASK) {
> + atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
> + smp_mb__after_atomic(); /* Clear bits before acting on them */
> + rcu_dynticks_special_exit();

I think this needs to be reversed for NMI safety: do the callback and
then clear the bits.

> +/*
> + * Set the special (bottom) bit of the specified CPU so that it
> + * will take special action (such as flushing its TLB) on the
> + * next exit from an extended quiescent state. Returns true if
> + * the bit was successfully set, or false if the CPU was not in
> + * an extended quiescent state.
> + */
> +bool rcu_dynticks_special_set(int cpu)
> +{
> + int old;
> + int new;
> + struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> +
> + do {
> + old = atomic_read(&rdtp->dynticks);
> + if (old & RCU_DYNTICK_CTRL_CTR)
> + return false;
> + new = old | ~RCU_DYNTICK_CTRL_MASK;

Shouldn't this be old | RCU_DYNTICK_CTRL_MASK?

> + } while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
> + return true;
> }

--Andy

2016-11-09 17:38:27

by Paul E. McKenney

Subject: Re: task isolation discussion at Linux Plumbers

On Wed, Nov 09, 2016 at 03:14:35AM -0800, Andy Lutomirski wrote:
> On Tue, Nov 8, 2016 at 5:40 PM, Paul E. McKenney
> <[email protected]> wrote:

Thank you for the review and comments!

> > commit 49961e272333ac720ac4ccbaba45521bfea259ae
> > Author: Paul E. McKenney <[email protected]>
> > Date: Tue Nov 8 14:25:21 2016 -0800
> >
> > rcu: Maintain special bits at bottom of ->dynticks counter
> >
> > Currently, IPIs are used to force other CPUs to invalidate their TLBs
> > in response to a kernel virtual-memory mapping change. This works, but
> > degrades both battery lifetime (for idle CPUs) and real-time response
> > (for nohz_full CPUs), and in addition results in unnecessary IPIs due to
> > the fact that CPUs executing in usermode are unaffected by stale kernel
> > mappings. It would be better to cause a CPU executing in usermode to
> > wait until it is entering kernel mode to
>
> missing words here?

Just a few, added more. ;-)

> > This commit therefore reserves a bit at the bottom of the ->dynticks
> > counter, which is checked upon exit from extended quiescent states. If it
> > is set, it is cleared and then a new rcu_dynticks_special_exit() macro
> > is invoked, which, if not supplied, is an empty single-pass do-while loop.
> > If this bottom bit is set on -entry- to an extended quiescent state,
> > then a WARN_ON_ONCE() triggers.
> >
> > This bottom bit may be set using a new rcu_dynticks_special_set()
> > function, which returns true if the bit was set, or false if the CPU
> > turned out to not be in an extended quiescent state. Please note that
> > this function refuses to set the bit for a non-nohz_full CPU when that
> > CPU is executing in usermode because usermode execution is tracked by
> > RCU as a dyntick-idle extended quiescent state only for nohz_full CPUs.
>
> I'm inclined to suggest s/dynticks/eqs/ in the public API. To me,
> "dynticks" is a feature, whereas "eqs" means "extended quiescent
> state" and means something concrete about the CPU state

OK, I have changed rcu_dynticks_special_exit() to rcu_eqs_special_exit().
I also changed rcu_dynticks_special_set() to rcu_eqs_special_set().

I left rcu_dynticks_snap() as is because it is internal to RCU. (External
only for the benefit of kernel/rcu/tree_trace.c.)

Any others?

Current state of patch below.

> > Reported-by: Andy Lutomirski <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> >
> > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> > index 4f9b2fa2173d..130d911e4ba1 100644
> > --- a/include/linux/rcutiny.h
> > +++ b/include/linux/rcutiny.h
> > @@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
> > return 0;
> > }
> >
> > +static inline bool rcu_dynticks_special_set(int cpu)
> > +{
> > + return false; /* Never flag non-existent other CPUs! */
> > +}
> > +
> > static inline unsigned long get_state_synchronize_rcu(void)
> > {
> > return 0;
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index dbf20b058f48..8de83830e86b 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
> > };
> >
> > /*
> > + * Steal a bit from the bottom of ->dynticks for idle entry/exit
> > + * control. Initially this is for TLB flushing.
> > + */
> > +#define RCU_DYNTICK_CTRL_MASK 0x1
> > +#define RCU_DYNTICK_CTRL_CTR (RCU_DYNTICK_CTRL_MASK + 1)
> > +#ifndef rcu_dynticks_special_exit
> > +#define rcu_dynticks_special_exit() do { } while (0)
> > +#endif
> > +
>
> > /*
> > @@ -305,17 +318,21 @@ static void rcu_dynticks_eqs_enter(void)
> > static void rcu_dynticks_eqs_exit(void)
> > {
> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > + int seq;
> >
> > /*
> > - * CPUs seeing atomic_inc() must see prior idle sojourns,
> > + * CPUs seeing atomic_inc_return() must see prior idle sojourns,
> > * and we also must force ordering with the next RCU read-side
> > * critical section.
> > */
> > - smp_mb__before_atomic(); /* See above. */
> > - atomic_inc(&rdtp->dynticks);
> > - smp_mb__after_atomic(); /* See above. */
> > + seq = atomic_inc_return(&rdtp->dynticks);
> > WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> > - !(atomic_read(&rdtp->dynticks) & 0x1));
> > + !(seq & RCU_DYNTICK_CTRL_CTR));
> > + if (seq & RCU_DYNTICK_CTRL_MASK) {
> > + atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
> > + smp_mb__after_atomic(); /* Clear bits before acting on them */
> > + rcu_dynticks_special_exit();
>
> I think this needs to be reversed for NMI safety: do the callback and
> then clear the bits.

OK. Ah, the race that I was worried about can't happen due to the
fact that rdtp->dynticks gets incremented before the call to
rcu_dynticks_special_exit().

Good catch, fixed.

And the other thing I forgot is that I cannot clear the bottom bits if
this is an NMI handler. But now I cannot construct a case where this
is a problem. The only way this could matter is if an NMI is taken in
an extended quiescent state. In that case, the code flushes and clears
the bit, and any later remote-flush request to this CPU will set the
bit again. And any races between the NMI handler and the other CPU look
the same as between IRQ handlers and process entry.

Yes, this one needs some formal verification, doesn't it?

In the meantime, if you can reproduce the race that led us to believe
that NMI handlers should not clear the bottom bits, please let me know.

> > +/*
> > + * Set the special (bottom) bit of the specified CPU so that it
> > + * will take special action (such as flushing its TLB) on the
> > + * next exit from an extended quiescent state. Returns true if
> > + * the bit was successfully set, or false if the CPU was not in
> > + * an extended quiescent state.
> > + */
> > +bool rcu_dynticks_special_set(int cpu)
> > +{
> > + int old;
> > + int new;
> > + struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> > +
> > + do {
> > + old = atomic_read(&rdtp->dynticks);
> > + if (old & RCU_DYNTICK_CTRL_CTR)
> > + return false;
> > + new = old | ~RCU_DYNTICK_CTRL_MASK;
>
> Shouldn't this be old | RCU_DYNTICK_CTRL_MASK?

Indeed it should! (What -was- I thinking?) Fixed.

> > + } while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
> > + return true;
> > }

Thank you again, please see update below.

Thanx, Paul

------------------------------------------------------------------------

commit 7bbb80d5f612e7f0ffc826b11d292a3616150b34
Author: Paul E. McKenney <[email protected]>
Date: Tue Nov 8 14:25:21 2016 -0800

rcu: Maintain special bits at bottom of ->dynticks counter

Currently, IPIs are used to force other CPUs to invalidate their TLBs
in response to a kernel virtual-memory mapping change. This works, but
degrades both battery lifetime (for idle CPUs) and real-time response
(for nohz_full CPUs), and in addition results in unnecessary IPIs due to
the fact that CPUs executing in usermode are unaffected by stale kernel
mappings. It would be better to cause a CPU executing in usermode to
wait until it is entering kernel mode to do the flush, first to avoid
interrupting usermode tasks and second to handle multiple flush requests
with a single flush in the case of a long-running user task.

This commit therefore reserves a bit at the bottom of the ->dynticks
counter, which is checked upon exit from extended quiescent states.
If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is
invoked, which, if not supplied, is an empty single-pass do-while loop.
If this bottom bit is set on -entry- to an extended quiescent state,
then a WARN_ON_ONCE() triggers.

This bottom bit may be set using a new rcu_eqs_special_set() function,
which returns true if the bit was set, or false if the CPU turned
out to not be in an extended quiescent state. Please note that this
function refuses to set the bit for a non-nohz_full CPU when that CPU
is executing in usermode because usermode execution is tracked by RCU
as a dyntick-idle extended quiescent state only for nohz_full CPUs.

Reported-by: Andy Lutomirski <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 4f9b2fa2173d..7232d199a81c 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
return 0;
}

+static inline bool rcu_eqs_special_set(int cpu)
+{
+ return false; /* Never flag non-existent other CPUs! */
+}
+
static inline unsigned long get_state_synchronize_rcu(void)
{
return 0;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dbf20b058f48..342c8ee402d6 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
};

/*
+ * Steal a bit from the bottom of ->dynticks for idle entry/exit
+ * control. Initially this is for TLB flushing.
+ */
+#define RCU_DYNTICK_CTRL_MASK 0x1
+#define RCU_DYNTICK_CTRL_CTR (RCU_DYNTICK_CTRL_MASK + 1)
+#ifndef rcu_eqs_special_exit
+#define rcu_eqs_special_exit() do { } while (0)
+#endif
+
+/*
* Record entry into an extended quiescent state. This is only to be
* called when not already in an extended quiescent state.
*/
static void rcu_dynticks_eqs_enter(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ int seq;

/*
- * CPUs seeing atomic_inc() must see prior RCU read-side critical
- * sections, and we also must force ordering with the next idle
- * sojourn.
+ * CPUs seeing atomic_inc_return() must see prior RCU read-side
+ * critical sections, and we also must force ordering with the
+ * next idle sojourn.
*/
- smp_mb__before_atomic(); /* See above. */
- atomic_inc(&rdtp->dynticks);
- smp_mb__after_atomic(); /* See above. */
+ seq = atomic_inc_return(&rdtp->dynticks);
+ /* Better be in an extended quiescent state! */
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+ (seq & RCU_DYNTICK_CTRL_CTR));
+ /* Better not have special action (TLB flush) pending! */
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
- atomic_read(&rdtp->dynticks) & 0x1);
+ (seq & RCU_DYNTICK_CTRL_MASK));
}

/*
@@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
static void rcu_dynticks_eqs_exit(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ int seq;

/*
- * CPUs seeing atomic_inc() must see prior idle sojourns,
+ * CPUs seeing atomic_inc_return() must see prior idle sojourns,
* and we also must force ordering with the next RCU read-side
* critical section.
*/
- smp_mb__before_atomic(); /* See above. */
- atomic_inc(&rdtp->dynticks);
- smp_mb__after_atomic(); /* See above. */
+ seq = atomic_inc_return(&rdtp->dynticks);
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
- !(atomic_read(&rdtp->dynticks) & 0x1));
+ !(seq & RCU_DYNTICK_CTRL_CTR));
+ if (seq & RCU_DYNTICK_CTRL_MASK) {
+ rcu_eqs_special_exit();
+ /* Prefer duplicate flushes to losing a flush. */
+ smp_mb__before_atomic(); /* NMI safety. */
+ atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
+ }
}

/*
@@ -326,7 +344,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
{
int snap = atomic_add_return(0, &rdtp->dynticks);

- return snap;
+ return snap & ~RCU_DYNTICK_CTRL_MASK;
}

/*
@@ -335,7 +353,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
*/
static bool rcu_dynticks_in_eqs(int snap)
{
- return !(snap & 0x1);
+ return !(snap & RCU_DYNTICK_CTRL_CTR);
}

/*
@@ -355,10 +373,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
static void rcu_dynticks_momentary_idle(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
- int special = atomic_add_return(2, &rdtp->dynticks);
+ int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
+ &rdtp->dynticks);

/* It is illegal to call this from idle state. */
- WARN_ON_ONCE(!(special & 0x1));
+ WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
+}
+
+/*
+ * Set the special (bottom) bit of the specified CPU so that it
+ * will take special action (such as flushing its TLB) on the
+ * next exit from an extended quiescent state. Returns true if
+ * the bit was successfully set, or false if the CPU was not in
+ * an extended quiescent state.
+ */
+bool rcu_eqs_special_set(int cpu)
+{
+ int old;
+ int new;
+ struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+
+ do {
+ old = atomic_read(&rdtp->dynticks);
+ if (old & RCU_DYNTICK_CTRL_CTR)
+ return false;
+ new = old | RCU_DYNTICK_CTRL_MASK;
+ } while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
+ return true;
}

DEFINE_PER_CPU_SHARED_ALIGNED(unsigned long, rcu_qs_ctr);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3b953dcf6afc..7dcdd59d894c 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -596,6 +596,7 @@ extern struct rcu_state rcu_preempt_state;
#endif /* #ifdef CONFIG_PREEMPT_RCU */

int rcu_dynticks_snap(struct rcu_dynticks *rdtp);
+bool rcu_eqs_special_set(int cpu);

#ifdef CONFIG_RCU_BOOST
DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_status);
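
For reference (not part of the patch), the flush-side caller on top of
this would presumably look something like the sketch below;
do_kernel_tlb_flush_ipi() is a hypothetical handler, and exactly where
an architecture would define the rcu_eqs_special_exit() hook so that
kernel/rcu/tree.c picks it up is not settled here:

/* Arch hook run by rcu_dynticks_eqs_exit() when the special bit is set. */
#define rcu_eqs_special_exit()  flush_tlb_all()

/* Remote side of a kernel TLB flush: tag CPUs that are in an extended
 * quiescent state, IPI the rest as before. */
static void kernel_tlb_flush_cpu(int cpu)
{
        if (!rcu_eqs_special_set(cpu))
                smp_call_function_single(cpu, do_kernel_tlb_flush_ipi,
                                         NULL, 1);
}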

2016-11-09 18:57:42

by Will Deacon

Subject: Re: task isolation discussion at Linux Plumbers

Hi Paul,

Just a couple of comments, but they may be more suited to Andy.

On Wed, Nov 09, 2016 at 09:38:08AM -0800, Paul E. McKenney wrote:
> @@ -355,10 +373,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
> static void rcu_dynticks_momentary_idle(void)
> {
> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> - int special = atomic_add_return(2, &rdtp->dynticks);
> + int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
> + &rdtp->dynticks);
>
> /* It is illegal to call this from idle state. */
> - WARN_ON_ONCE(!(special & 0x1));
> + WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
> +}
> +
> +/*
> + * Set the special (bottom) bit of the specified CPU so that it
> + * will take special action (such as flushing its TLB) on the
> + * next exit from an extended quiescent state. Returns true if
> + * the bit was successfully set, or false if the CPU was not in
> + * an extended quiescent state.
> + */

Given that TLB maintenance on arm is handled in hardware (no need for IPI),
I'd like to avoid this work if at all possible. However, without seeing the
call site I can't tell if it's optional.

> +bool rcu_eqs_special_set(int cpu)
> +{
> + int old;
> + int new;
> + struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> +
> + do {
> + old = atomic_read(&rdtp->dynticks);
> + if (old & RCU_DYNTICK_CTRL_CTR)
> + return false;
> + new = old | RCU_DYNTICK_CTRL_MASK;
> + } while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
> + return true;
> }

Can this be a cmpxchg_relaxed? What is it attempting to order?

Will

2016-11-09 19:11:44

by Paul E. McKenney

Subject: Re: task isolation discussion at Linux Plumbers

On Wed, Nov 09, 2016 at 06:57:43PM +0000, Will Deacon wrote:
> Hi Paul,
>
> Just a couple of comments, but they may be more suited to Andy.
>
> On Wed, Nov 09, 2016 at 09:38:08AM -0800, Paul E. McKenney wrote:
> > @@ -355,10 +373,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
> > static void rcu_dynticks_momentary_idle(void)
> > {
> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > - int special = atomic_add_return(2, &rdtp->dynticks);
> > + int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
> > + &rdtp->dynticks);
> >
> > /* It is illegal to call this from idle state. */
> > - WARN_ON_ONCE(!(special & 0x1));
> > + WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
> > +}
> > +
> > +/*
> > + * Set the special (bottom) bit of the specified CPU so that it
> > + * will take special action (such as flushing its TLB) on the
> > + * next exit from an extended quiescent state. Returns true if
> > + * the bit was successfully set, or false if the CPU was not in
> > + * an extended quiescent state.
> > + */
>
> Given that TLB maintenance on arm is handled in hardware (no need for IPI),
> I'd like to avoid this work if at all possible. However, without seeing the
> call site I can't tell if it's optional.

For this, I must defer to Andy.

> > +bool rcu_eqs_special_set(int cpu)
> > +{
> > + int old;
> > + int new;
> > + struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
> > +
> > + do {
> > + old = atomic_read(&rdtp->dynticks);
> > + if (old & RCU_DYNTICK_CTRL_CTR)
> > + return false;
> > + new = old | RCU_DYNTICK_CTRL_MASK;
> > + } while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
> > + return true;
> > }
>
> Can this be a cmpxchg_relaxed? What is it attempting to order?

It is attempting to order my paranoia. ;-)

If Andy shows me that less ordering is possible, I can weaken it.

Thanx, Paul

2016-11-10 01:44:25

by Andy Lutomirski

Subject: Re: task isolation discussion at Linux Plumbers

On Wed, Nov 9, 2016 at 9:38 AM, Paul E. McKenney
<[email protected]> wrote:

Are you planning on changing rcu_nmi_enter()? It would make it easier
to figure out how they interact if I could see the code.

> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index dbf20b058f48..342c8ee402d6 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c


> /*
> @@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
> static void rcu_dynticks_eqs_exit(void)
> {
> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> + int seq;
>
> /*
> - * CPUs seeing atomic_inc() must see prior idle sojourns,
> + * CPUs seeing atomic_inc_return() must see prior idle sojourns,
> * and we also must force ordering with the next RCU read-side
> * critical section.
> */
> - smp_mb__before_atomic(); /* See above. */
> - atomic_inc(&rdtp->dynticks);
> - smp_mb__after_atomic(); /* See above. */
> + seq = atomic_inc_return(&rdtp->dynticks);
> WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> - !(atomic_read(&rdtp->dynticks) & 0x1));
> + !(seq & RCU_DYNTICK_CTRL_CTR));

I think there's still a race here. Suppose we're running this code on
cpu n and...

> + if (seq & RCU_DYNTICK_CTRL_MASK) {
> + rcu_eqs_special_exit();
> + /* Prefer duplicate flushes to losing a flush. */
> + smp_mb__before_atomic(); /* NMI safety. */

... another CPU changes the page tables and calls rcu_eqs_special_set(n) here.

That CPU expects that we will flush prior to continuing, but we won't.
Admittedly it's highly unlikely that any stale TLB entries would be
created yet, but nothing rules it out.

> + atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
> + }

Maybe the way to handle it is something like:

this_cpu_write(rcu_nmi_needs_eqs_special, 1);
barrier();

/* NMI here will call rcu_eqs_special_exit() regardless of the value
in dynticks */

atomic_and(...);
smp_mb__after_atomic();
rcu_eqs_special_exit();

barrier();
this_cpu_write(rcu_nmi_needs_eqs_special, 0);


Then rcu_nmi_enter() would call rcu_eqs_special_exit() if the dynticks
bit is set *or* rcu_nmi_needs_eqs_special is set.

Does that make sense?

--Andy

2016-11-10 04:52:23

by Paul E. McKenney

Subject: Re: task isolation discussion at Linux Plumbers

On Wed, Nov 09, 2016 at 05:44:02PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 9, 2016 at 9:38 AM, Paul E. McKenney
> <[email protected]> wrote:
>
> Are you planning on changing rcu_nmi_enter()? It would make it easier
> to figure out how they interact if I could see the code.

It already calls rcu_dynticks_eqs_exit(), courtesy of the earlier
consolidation patches.

> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index dbf20b058f48..342c8ee402d6 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
>
>
> > /*
> > @@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
> > static void rcu_dynticks_eqs_exit(void)
> > {
> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > + int seq;
> >
> > /*
> > - * CPUs seeing atomic_inc() must see prior idle sojourns,
> > + * CPUs seeing atomic_inc_return() must see prior idle sojourns,
> > * and we also must force ordering with the next RCU read-side
> > * critical section.
> > */
> > - smp_mb__before_atomic(); /* See above. */
> > - atomic_inc(&rdtp->dynticks);
> > - smp_mb__after_atomic(); /* See above. */
> > + seq = atomic_inc_return(&rdtp->dynticks);
> > WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> > - !(atomic_read(&rdtp->dynticks) & 0x1));
> > + !(seq & RCU_DYNTICK_CTRL_CTR));
>
> I think there's still a race here. Suppose we're running this code on
> cpu n and...
>
> > + if (seq & RCU_DYNTICK_CTRL_MASK) {
> > + rcu_eqs_special_exit();
> > + /* Prefer duplicate flushes to losing a flush. */
> > + smp_mb__before_atomic(); /* NMI safety. */
>
> ... another CPU changes the page tables and calls rcu_eqs_special_set(n) here.

But then rcu_eqs_special_set() will return false because we already
exited the extended quiescent state at the atomic_inc_return() above.

That should tell the caller to send an IPI.

> That CPU expects that we will flush prior to continuing, but we won't.
> Admittedly it's highly unlikely that any stale TLB entries would be
> created yet, but nothing rules it out.

That said, 0day is having some heartburn from this, so I must have broken
something somewhere. My own tests of course complete just fine...

> > + atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
> > + }
>
> Maybe the way to handle it is something like:
>
> this_cpu_write(rcu_nmi_needs_eqs_special, 1);
> barrier();
>
> /* NMI here will call rcu_eqs_special_exit() regardless of the value
> in dynticks */
>
> atomic_and(...);
> smp_mb__after_atomic();
> rcu_eqs_special_exit();
>
> barrier();
> this_cpu_write(rcu_nmi_needs_eqs_special, 0);
>
>
> Then rcu_nmi_enter() would call rcu_eqs_special_exit() if the dynticks
> bit is set *or* rcu_nmi_needs_eqs_special is set.
>
> Does that make sense?

I believe that rcu_eqs_special_set() returning false covers this, but
could easily be missing something.

Thanx, Paul

2016-11-10 05:10:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: task isolation discussion at Linux Plumbers

On Wed, Nov 09, 2016 at 08:52:13PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 09, 2016 at 05:44:02PM -0800, Andy Lutomirski wrote:
> > On Wed, Nov 9, 2016 at 9:38 AM, Paul E. McKenney
> > <[email protected]> wrote:
> >
> > Are you planning on changing rcu_nmi_enter()? It would make it easier
> > to figure out how they interact if I could see the code.
>
> It already calls rcu_dynticks_eqs_exit(), courtesy of the earlier
> consolidation patches.
>
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index dbf20b058f48..342c8ee402d6 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> >
> >
> > > /*
> > > @@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
> > > static void rcu_dynticks_eqs_exit(void)
> > > {
> > > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > > + int seq;
> > >
> > > /*
> > > - * CPUs seeing atomic_inc() must see prior idle sojourns,
> > > + * CPUs seeing atomic_inc_return() must see prior idle sojourns,
> > > * and we also must force ordering with the next RCU read-side
> > > * critical section.
> > > */
> > > - smp_mb__before_atomic(); /* See above. */
> > > - atomic_inc(&rdtp->dynticks);
> > > - smp_mb__after_atomic(); /* See above. */
> > > + seq = atomic_inc_return(&rdtp->dynticks);
> > > WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> > > - !(atomic_read(&rdtp->dynticks) & 0x1));
> > > + !(seq & RCU_DYNTICK_CTRL_CTR));
> >
> > I think there's still a race here. Suppose we're running this code on
> > cpu n and...
> >
> > > + if (seq & RCU_DYNTICK_CTRL_MASK) {
> > > + rcu_eqs_special_exit();
> > > + /* Prefer duplicate flushes to losing a flush. */
> > > + smp_mb__before_atomic(); /* NMI safety. */
> >
> > ... another CPU changes the page tables and calls rcu_eqs_special_set(n) here.
>
> But then rcu_eqs_special_set() will return false because we already
> exited the extended quiescent state at the atomic_inc_return() above.
>
> That should tell the caller to send an IPI.
>
> > That CPU expects that we will flush prior to continuing, but we won't.
> > Admittedly it's highly unlikely that any stale TLB entries would be
> > created yet, but nothing rules it out.
>
> That said, 0day is having some heartburn from this, so I must have broken
> something somewhere. My own tests of course complete just fine...
>
> > > + atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
> > > + }
> >
> > Maybe the way to handle it is something like:
> >
> > this_cpu_write(rcu_nmi_needs_eqs_special, 1);
> > barrier();
> >
> > /* NMI here will call rcu_eqs_special_exit() regardless of the value
> > in dynticks */
> >
> > atomic_and(...);
> > smp_mb__after_atomic();
> > rcu_eqs_special_exit();
> >
> > barrier();
> > this_cpu_write(rcu_nmi_needs_eqs_special, 0);
> >
> >
> > Then rcu_nmi_enter() would call rcu_eqs_special_exit() if the dynticks
> > bit is set *or* rcu_nmi_needs_eqs_special is set.
> >
> > Does that make sense?
>
> I believe that rcu_eqs_special_set() returning false covers this, but
> could easily be missing something.

And fixing a couple of stupid errors might help. Probably more where
those came from...

Thanx, Paul

------------------------------------------------------------------------

commit 0ea13930e9a89b1741897f5af308692eab08d9e8
Author: Paul E. McKenney <[email protected]>
Date: Tue Nov 8 14:25:21 2016 -0800

rcu: Maintain special bits at bottom of ->dynticks counter

Currently, IPIs are used to force other CPUs to invalidate their TLBs
in response to a kernel virtual-memory mapping change. This works, but
degrades both battery lifetime (for idle CPUs) and real-time response
(for nohz_full CPUs), and in addition results in unnecessary IPIs due to
the fact that CPUs executing in usermode are unaffected by stale kernel
mappings. It would be better to cause a CPU executing in usermode to
wait until it is entering kernel mode to do the flush, first to avoid
interrupting usermode tasks and second to handle multiple flush requests
with a single flush in the case of a long-running user task.

This commit therefore reserves a bit at the bottom of the ->dynticks
counter, which is checked upon exit from extended quiescent states.
If it is set, it is cleared and then a new rcu_eqs_special_exit() macro is
invoked, which, if not supplied, is an empty single-pass do-while loop.
If this bottom bit is set on -entry- to an extended quiescent state,
then a WARN_ON_ONCE() triggers.

This bottom bit may be set using a new rcu_eqs_special_set() function,
which returns true if the bit was set, or false if the CPU turned
out to not be in an extended quiescent state. Please note that this
function refuses to set the bit for a non-nohz_full CPU when that CPU
is executing in usermode because usermode execution is tracked by RCU
as a dyntick-idle extended quiescent state only for nohz_full CPUs.

Reported-by: Andy Lutomirski <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 4f9b2fa2173d..7232d199a81c 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -33,6 +33,11 @@ static inline int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
return 0;
}

+static inline bool rcu_eqs_special_set(int cpu)
+{
+ return false; /* Never flag non-existent other CPUs! */
+}
+
static inline unsigned long get_state_synchronize_rcu(void)
{
return 0;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dbf20b058f48..c1e3d4a333b2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -279,23 +279,36 @@ static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
};

/*
+ * Steal a bit from the bottom of ->dynticks for idle entry/exit
+ * control. Initially this is for TLB flushing.
+ */
+#define RCU_DYNTICK_CTRL_MASK 0x1
+#define RCU_DYNTICK_CTRL_CTR (RCU_DYNTICK_CTRL_MASK + 1)
+#ifndef rcu_eqs_special_exit
+#define rcu_eqs_special_exit() do { } while (0)
+#endif
+
+/*
* Record entry into an extended quiescent state. This is only to be
* called when not already in an extended quiescent state.
*/
static void rcu_dynticks_eqs_enter(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ int seq;

/*
- * CPUs seeing atomic_inc() must see prior RCU read-side critical
- * sections, and we also must force ordering with the next idle
- * sojourn.
+ * CPUs seeing atomic_inc_return() must see prior RCU read-side
+ * critical sections, and we also must force ordering with the
+ * next idle sojourn.
*/
- smp_mb__before_atomic(); /* See above. */
- atomic_inc(&rdtp->dynticks);
- smp_mb__after_atomic(); /* See above. */
+ seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdtp->dynticks);
+ /* Better be in an extended quiescent state! */
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+ (seq & RCU_DYNTICK_CTRL_CTR));
+ /* Better not have special action (TLB flush) pending! */
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
- atomic_read(&rdtp->dynticks) & 0x1);
+ (seq & RCU_DYNTICK_CTRL_MASK));
}

/*
@@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
static void rcu_dynticks_eqs_exit(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ int seq;

/*
- * CPUs seeing atomic_inc() must see prior idle sojourns,
+ * CPUs seeing atomic_inc_return() must see prior idle sojourns,
* and we also must force ordering with the next RCU read-side
* critical section.
*/
- smp_mb__before_atomic(); /* See above. */
- atomic_inc(&rdtp->dynticks);
- smp_mb__after_atomic(); /* See above. */
+ seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdtp->dynticks);
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
- !(atomic_read(&rdtp->dynticks) & 0x1));
+ !(seq & RCU_DYNTICK_CTRL_CTR));
+ if (seq & RCU_DYNTICK_CTRL_MASK) {
+ rcu_eqs_special_exit();
+ /* Prefer duplicate flushes to losing a flush. */
+ smp_mb__before_atomic(); /* NMI safety. */
+ atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
+ }
}

/*
@@ -326,7 +344,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
{
int snap = atomic_add_return(0, &rdtp->dynticks);

- return snap;
+ return snap & ~RCU_DYNTICK_CTRL_MASK;
}

/*
@@ -335,7 +353,7 @@ int rcu_dynticks_snap(struct rcu_dynticks *rdtp)
*/
static bool rcu_dynticks_in_eqs(int snap)
{
- return !(snap & 0x1);
+ return !(snap & RCU_DYNTICK_CTRL_CTR);
}

/*
@@ -355,10 +373,33 @@ static bool rcu_dynticks_in_eqs_since(struct rcu_dynticks *rdtp, int snap)
static void rcu_dynticks_momentary_idle(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
- int special = atomic_add_return(2, &rdtp->dynticks);
+ int special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
+ &rdtp->dynticks);

/* It is illegal to call this from idle state. */
- WARN_ON_ONCE(!(special & 0x1));
+ WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
+}
+
+/*
+ * Set the special (bottom) bit of the specified CPU so that it
+ * will take special action (such as flushing its TLB) on the
+ * next exit from an extended quiescent state. Returns true if
+ * the bit was successfully set, or false if the CPU was not in
+ * an extended quiescent state.
+ */
+bool rcu_eqs_special_set(int cpu)
+{
+ int old;
+ int new;
+ struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
+
+ do {
+ old = atomic_read(&rdtp->dynticks);
+ if (old & RCU_DYNTICK_CTRL_CTR)
+ return false;
+ new = old | RCU_DYNTICK_CTRL_MASK;
+ } while (atomic_cmpxchg(&rdtp->dynticks, old, new) != old);
+ return true;
}

DEFINE_PER_CPU_SHARED_ALIGNED(unsigned long, rcu_qs_ctr);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3b953dcf6afc..7dcdd59d894c 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -596,6 +596,7 @@ extern struct rcu_state rcu_preempt_state;
#endif /* #ifdef CONFIG_PREEMPT_RCU */

int rcu_dynticks_snap(struct rcu_dynticks *rdtp);
+bool rcu_eqs_special_set(int cpu);

#ifdef CONFIG_RCU_BOOST
DECLARE_PER_CPU(unsigned int, rcu_cpu_kthread_status);
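
As a usage sketch (not part of the patch above), an architecture that
wanted to defer its kernel TLB flushes this way would presumably supply
its own rcu_eqs_special_exit() where the RCU code can see it; the
helper name below is hypothetical:

/*
 * Hypothetical arch-side hook: perform any deferred kernel TLB flush
 * when this CPU exits an extended quiescent state.
 */
#define rcu_eqs_special_exit() arch_flush_deferred_kernel_tlb()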

2016-11-11 17:01:02

by Andy Lutomirski

[permalink] [raw]
Subject: Re: task isolation discussion at Linux Plumbers

On Wed, Nov 9, 2016 at 8:52 PM, Paul E. McKenney
<[email protected]> wrote:
> On Wed, Nov 09, 2016 at 05:44:02PM -0800, Andy Lutomirski wrote:
>> On Wed, Nov 9, 2016 at 9:38 AM, Paul E. McKenney
>> <[email protected]> wrote:
>>
>> Are you planning on changing rcu_nmi_enter()? It would make it easier
>> to figure out how they interact if I could see the code.
>
> It already calls rcu_dynticks_eqs_exit(), courtesy of the earlier
> consolidation patches.
>
>> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> > index dbf20b058f48..342c8ee402d6 100644
>> > --- a/kernel/rcu/tree.c
>> > +++ b/kernel/rcu/tree.c
>>
>>
>> > /*
>> > @@ -305,17 +318,22 @@ static void rcu_dynticks_eqs_enter(void)
>> > static void rcu_dynticks_eqs_exit(void)
>> > {
>> > struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > + int seq;
>> >
>> > /*
>> > - * CPUs seeing atomic_inc() must see prior idle sojourns,
>> > + * CPUs seeing atomic_inc_return() must see prior idle sojourns,
>> > * and we also must force ordering with the next RCU read-side
>> > * critical section.
>> > */
>> > - smp_mb__before_atomic(); /* See above. */
>> > - atomic_inc(&rdtp->dynticks);
>> > - smp_mb__after_atomic(); /* See above. */
>> > + seq = atomic_inc_return(&rdtp->dynticks);
>> > WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
>> > - !(atomic_read(&rdtp->dynticks) & 0x1));
>> > + !(seq & RCU_DYNTICK_CTRL_CTR));
>>
>> I think there's still a race here. Suppose we're running this code on
>> cpu n and...
>>
>> > + if (seq & RCU_DYNTICK_CTRL_MASK) {
>> > + rcu_eqs_special_exit();
>> > + /* Prefer duplicate flushes to losing a flush. */
>> > + smp_mb__before_atomic(); /* NMI safety. */
>>
>> ... another CPU changes the page tables and calls rcu_eqs_special_set(n) here.
>
> But then rcu_eqs_special_set() will return false because we already
> exited the extended quiescent state at the atomic_inc_return() above.
>
> That should tell the caller to send an IPI.
>
>> That CPU expects that we will flush prior to continuing, but we won't.
>> Admittedly it's highly unlikely that any stale TLB entries would be
>> created yet, but nothing rules it out.
>
> That said, 0day is having some heartburn from this, so I must have broken
> something somewhere. My own tests of course complete just fine...
>
>> > + atomic_and(~RCU_DYNTICK_CTRL_MASK, &rdtp->dynticks);
>> > + }
>>
>> Maybe the way to handle it is something like:
>>
>> this_cpu_write(rcu_nmi_needs_eqs_special, 1);
>> barrier();
>>
>> /* NMI here will call rcu_eqs_special_exit() regardless of the value
>> in dynticks */
>>
>> atomic_and(...);
>> smp_mb__after_atomic();
>> rcu_eqs_special_exit();
>>
>> barrier();
>> this_cpu_write(rcu_nmi_needs_eqs_special, 0);
>>
>>
>> Then rcu_nmi_enter() would call rcu_eqs_special_exit() if the dynticks
>> bit is set *or* rcu_nmi_needs_eqs_special is set.
>>
>> Does that make sense?
>
> I believe that rcu_eqs_special_set() returning false covers this, but
> could easily be missing something.

I think you're right. I'll stare at it some more when I do the actual
TLB flush patch.

--Andy

2016-11-11 20:55:03

by Luiz Capitulino

[permalink] [raw]
Subject: Re: task isolation discussion at Linux Plumbers

On Mon, 7 Nov 2016 17:55:47 +0100 (CET)
Thomas Gleixner <[email protected]> wrote:

> On Sat, 5 Nov 2016, Chris Metcalf wrote:
> > == Remote statistics ==
> >
> > We discussed the possibility of remote statistics gathering, i.e. load
> > average etc. The idea would be that we could have housekeeping
> > core(s) periodically iterate over the nohz cores to load their rq
> > remotely and do update_current etc. Presumably it should be possible
> > for a single housekeeping core to handle doing this for all the
> > nohz_full cores, as we only need to do it quite infrequently.
> >
> > Thomas suggested that this might be the last remaining thing that
> > needed to be done to allow disabling the current behavior of falling
> > back to a 1 Hz clock in nohz_full.
> >
> > I believe Thomas said he had a patch to do this already.
>
> No, Rik was working on that.

Rik's series made it possible to do remote tick sampling. That is,
calling account_process_tick(cpu) from a housekeeping CPU, so that
tick accounting is done on behalf of a remote "cpu". The series was
intended to reduce accounting overhead on nohz_full CPUs.

However, to get rid of the 1Hz tick, we need to do the same thing
for scheduler_tick(). I'm not sure whether this has been attempted
yet, or whether it's even possible.
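
The end state would presumably be a housekeeping-side loop along these
lines (sketch only; run_remote_tick() is a placeholder for whatever
combination of remote accounting and scheduler_tick() work turns out
to be safe to run remotely):

/*
 * Periodically do the tick bookkeeping on behalf of each nohz_full
 * CPU from a housekeeping CPU.
 */
static void housekeeping_remote_ticks(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		if (tick_nohz_full_cpu(cpu))
			run_remote_tick(cpu);	/* hypothetical remote tick worker */
}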

> > == Remote LRU cache drain ==
> >
> > One of the issues with task isolation currently is that the LRU cache
> > drain must be done prior to entering userspace, but it requires
> > interrupts enabled and thus can't be done atomically. My previous
> > patch series have handled this by checking with interrupts disabled,
> > but then looping around with interrupts enabled to try to drain the
> > LRU pagevecs. Experimentally this works, but it's not provable that
> > it terminates, which is worrisome. Andy suggested adding a percpu
> > flag to disable creation of deferred work like LRU cache pages.
> >
> > Thomas suggested using an RT "local lock" to guard the LRU cache
> > flush; he is planning on bringing the concept to mainline in any case.
> > However, after some discussion we converged on simply using a spinlock
> > to guard the appropriate resources. As a result, the
> > lru_add_drain_all() code that currently queues work on each remote cpu
> > to drain it, can instead simply acquire the lock and drain it remotely.
> > This means that a task isolation task no longer needs to worry about
> > being interrupted by SMP function call IPIs, so we don't have to deal
> > with this in the task isolation framework any more.
> >
> > I don't recall anyone else volunteering to tackle this, so I will plan
> > to look at it. The patch to do that should be orthogonal to the
> > revised task isolation patch series.
>
> I offered to clean up the patch from RT. I'll do that in the next days.

Yes, the RT kernel got a patch that fixes this. I wonder if the same
idea could work for vm_stat (that is, gathering those stats from
housekeeping CPUs vs. having to queue deferrable work to do it).

2016-12-19 14:37:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: task isolation discussion at Linux Plumbers

On Sat, Nov 05, 2016 at 12:04:45AM -0400, Chris Metcalf wrote:
> A bunch of people got together this week at the Linux Plumbers
> Conference to discuss nohz_full, task isolation, and related stuff.
> (Thanks to Thomas for getting everyone gathered at one place and time!)

Which reminds me...

One spirited side discussion at Santa Fe involved RCU's rcu_node tree
and CPU numbering. Several people insisted that RCU should remap CPU
numbers to allow for interesting CPU-numbering schemes. I resisted
this added complexity rather strenuously, but eventually said that I
would consider adding such complexity if someone provided me with valid
system-level performance data showing a need for this sort of remapping.

I haven't heard anything since, so I figured that I should follow up.
How are things going collecting this data?

Thanx, Paul