2010-01-31 21:12:47

by Mathieu Desnoyers

Subject: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

Depends on:
"Create spin lock/spin unlock with distinct memory barrier"

A full memory barrier is wanted before and after runqueue data structure
modifications so these can be read safely by sys_membarrier without holding the
rq lock.

Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
lock/unlock already imply a full memory barrier. Combines the spin lock
acquire/release barriers with the full memory barrier to diminish the
performance impact on other architectures. (per-architecture spinlock-mb.h
should be gradually implemented to replace the generic version)
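
For context, a minimal sketch of what a generic spinlock-mb.h fallback could
look like. The primitive names come from the diff below; the bodies here are an
assumption rather than the actual prerequisite patch, which per-architecture
versions would optimize by folding the full barrier into the lock/unlock:

	#ifndef _LINUX_SPINLOCK_MB_H
	#define _LINUX_SPINLOCK_MB_H

	/* Generic fallback: pay an explicit smp_mb() next to a regular lock/unlock. */
	#define smp_mb__after_spin_lock()	smp_mb()
	#define smp_mb__before_spin_unlock()	smp_mb()

	/*
	 * With the barrier issued explicitly, the "no acquire/release" variants
	 * can simply map to the ordinary operations in the generic version.
	 */
	#define raw_spin_lock_irq__no_acquire(lock)	raw_spin_lock_irq(lock)
	#define raw_spin_unlock_irq__no_release(lock)	raw_spin_unlock_irq(lock)
	#define raw_spin_unlock__no_release(lock)	raw_spin_unlock(lock)

	#endif /* _LINUX_SPINLOCK_MB_H */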

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Nicholas Miell <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
---
kernel/sched.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)

Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c 2010-01-31 14:59:42.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c 2010-01-31 15:09:51.000000000 -0500
@@ -893,7 +893,12 @@ static inline void finish_lock_switch(st
*/
spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);

- raw_spin_unlock_irq(&rq->lock);
+ /*
+ * Order mm_cpumask and rq->curr updates before following memory
+ * accesses. Required by sys_membarrier().
+ */
+ smp_mb__before_spin_unlock();
+ raw_spin_unlock_irq__no_release(&rq->lock);
}

#else /* __ARCH_WANT_UNLOCKED_CTXSW */
@@ -916,10 +921,15 @@ static inline void prepare_lock_switch(s
*/
next->oncpu = 1;
#endif
+ /*
+ * Order mm_cpumask and rq->curr updates before following memory
+ * accesses. Required by sys_membarrier().
+ */
+ smp_mb__before_spin_unlock();
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
- raw_spin_unlock_irq(&rq->lock);
+ raw_spin_unlock_irq__no_release(&rq->lock);
#else
- raw_spin_unlock(&rq->lock);
+ raw_spin_unlock__no_release(&rq->lock);
#endif
}

@@ -5490,7 +5500,13 @@ need_resched_nonpreemptible:
if (sched_feat(HRTICK))
hrtick_clear(rq);

- raw_spin_lock_irq(&rq->lock);
+ raw_spin_lock_irq__no_acquire(&rq->lock);
+ /*
+ * Order memory accesses before mm_cpumask and rq->curr updates.
+ * Required by sys_membarrier() when prev != next. We only learn about
+ * next later, so we issue this mb() unconditionally.
+ */
+ smp_mb__after_spin_lock();
update_rq_clock(rq);
clear_tsk_need_resched(prev);


--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2010-02-01 08:30:42

by Nick Piggin

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Sun, Jan 31, 2010 at 03:52:56PM -0500, Mathieu Desnoyers wrote:
> Depends on:
> "Create spin lock/spin unlock with distinct memory barrier"
>
> A full memory barrier is wanted before and after runqueue data structure
> modifications so these can be read safely by sys_membarrier without holding the
> rq lock.
>
> Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> lock/unlock already imply a full memory barrier. Combines the spin lock
> acquire/release barriers with the full memory barrier to diminish the
> performance impact on other architectures. (per-architecture spinlock-mb.h
> should be gradually implemented to replace the generic version)

It does add overhead on x86, as well as most other architectures.

This really seems like the wrong optimisation to make, especially
given that there's not likely to be much using librcu yet, right?

I'd go with the simpler and safer version of sys_membarrier that does
not do tricky synchronisation or add overhead to the ctxsw fastpath.
Then if you see some actual improvement in a real program using librcu
one day we can discuss making it faster.

As it is right now, the change will definitely slow down everybody
not using librcu (ie. nearly everything).

>
> Signed-off-by: Mathieu Desnoyers <[email protected]>
> CC: KOSAKI Motohiro <[email protected]>
> CC: Steven Rostedt <[email protected]>
> CC: "Paul E. McKenney" <[email protected]>
> CC: Nicholas Miell <[email protected]>
> CC: Linus Torvalds <[email protected]>
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> CC: [email protected]
> ---
> kernel/sched.c | 24 ++++++++++++++++++++----
> 1 file changed, 20 insertions(+), 4 deletions(-)
>
> Index: linux-2.6-lttng/kernel/sched.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/sched.c 2010-01-31 14:59:42.000000000 -0500
> +++ linux-2.6-lttng/kernel/sched.c 2010-01-31 15:09:51.000000000 -0500
> @@ -893,7 +893,12 @@ static inline void finish_lock_switch(st
> */
> spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_);
>
> - raw_spin_unlock_irq(&rq->lock);
> + /*
> + * Order mm_cpumask and rq->curr updates before following memory
> + * accesses. Required by sys_membarrier().
> + */
> + smp_mb__before_spin_unlock();
> + raw_spin_unlock_irq__no_release(&rq->lock);
> }
>
> #else /* __ARCH_WANT_UNLOCKED_CTXSW */
> @@ -916,10 +921,15 @@ static inline void prepare_lock_switch(s
> */
> next->oncpu = 1;
> #endif
> + /*
> + * Order mm_cpumask and rq->curr updates before following memory
> + * accesses. Required by sys_membarrier().
> + */
> + smp_mb__before_spin_unlock();
> #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> - raw_spin_unlock_irq(&rq->lock);
> + raw_spin_unlock_irq__no_release(&rq->lock);
> #else
> - raw_spin_unlock(&rq->lock);
> + raw_spin_unlock__no_release(&rq->lock);
> #endif
> }
>
> @@ -5490,7 +5500,13 @@ need_resched_nonpreemptible:
> if (sched_feat(HRTICK))
> hrtick_clear(rq);
>
> - raw_spin_lock_irq(&rq->lock);
> + raw_spin_lock_irq__no_acquire(&rq->lock);
> + /*
> + * Order memory accesses before mm_cpumask and rq->curr updates.
> + * Required by sys_membarrier() when prev != next. We only learn about
> + * next later, so we issue this mb() unconditionally.
> + */
> + smp_mb__after_spin_lock();
> update_rq_clock(rq);
> clear_tsk_need_resched(prev);
>
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-02-01 09:43:26

by Peter Zijlstra

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 18:33 +1100, Nick Piggin wrote:
> > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> > lock/unlock already imply a full memory barrier. Combines the spin lock
> > acquire/release barriers with the full memory barrier to diminish the
> > performance impact on other architectures. (per-architecture spinlock-mb.h
> > should be gradually implemented to replace the generic version)
>
> It does add overhead on x86, as well as most other architectures.
>
> This really seems like the wrong optimisation to make, especially
> given that there's not likely to be much using librcu yet, right?
>
> I'd go with the simpler and safer version of sys_membarrier that does
> not do tricky synchronisation or add overhead to the ctxsw fastpath.
> Then if you see some actual improvement in a real program using librcu
> one day we can discuss making it faster.
>
> As it is right now, the change will definitely slow down everybody
> not using librcu (ie. nearly everything).

Right, so the problem with the 'slow'/'safe' version is that it takes
rq->lock for all relevant rqs. This turns a while (1) sys_membarrier()
loop into a quite effective DoS.

Now, I'm not quite charmed by all this. Esp. this patch seems wrong. The
fact is on x86 we have all the required membarriers in place.

There's a number of LOCK insns before we set rq->curr and we have them
after. Adding more, like this patch effectively does
(smp_mb__{before,after}_unlock should be a full mb as Nick pointed out)
doesn't seem like a good idea at all.

And then there's !x86 to consider.

2010-02-01 10:12:03

by Nick Piggin

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, Feb 01, 2010 at 10:42:30AM +0100, Peter Zijlstra wrote:
> On Mon, 2010-02-01 at 18:33 +1100, Nick Piggin wrote:
> > > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> > > lock/unlock already imply a full memory barrier. Combines the spin lock
> > > acquire/release barriers with the full memory barrier to diminish the
> > > performance impact on other architectures. (per-architecture spinlock-mb.h
> > > should be gradually implemented to replace the generic version)
> >
> > It does add overhead on x86, as well as most other architectures.
> >
> > This really seems like the wrong optimisation to make, especially
> > given that there's not likely to be much using librcu yet, right?
> >
> > I'd go with the simpler and safer version of sys_membarrier that does
> > not do tricky synchronisation or add overhead to the ctxsw fastpath.
> > Then if you see some actual improvement in a real program using librcu
> > one day we can discuss making it faster.
> >
> > As it is right now, the change will definitely slow down everybody
> > not using librcu (ie. nearly everything).
>
> Right, so the problem with the 'slow'/'safe' version is that it takes
> rq->lock for all relevant rqs. This renders while (1) sys_membarrier()
> in a quite effective DoS.

All, but one at a time, no? How much of a DoS really is taking these
locks for a handful of cycles each, per syscall?

I mean, we have LOTS of syscalls that take locks, and for a lot longer,
(look at dcache_lock).

I think we basically just have to say that locking primitives should be
somewhat fair and not be held for too long; then it should more or less
work.

If the locks are getting contended, then the threads calling
sys_membarrier are going to be spinning longer too, using more CPU time,
and will get scheduled away...

If there is some particular problem on -rt because of the rq locks,
then I guess you could consider whether to add more overhead to your
ctxsw path to reduce the problem, or simply not support sys_membarrier
for unprived users in the first place.


> Now, I'm not quite charmed by all this. Esp. this patch seems wrong. The
> fact is on x86 we have all the required membarriers in place.
>
> There's a number of LOCK ins before we set rq->curr and we have them
> after. Adding more, like this patch effectively does
> (smp_mb__{before,after}_unlock should be a full mb as Nick pointed out)
> doesn't seem like a good idea at all.
>
> And then there's !x86 to consider.

Yep.

2010-02-01 10:36:42

by Peter Zijlstra

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 21:11 +1100, Nick Piggin wrote:
> On Mon, Feb 01, 2010 at 10:42:30AM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-02-01 at 18:33 +1100, Nick Piggin wrote:
> > > > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> > > > lock/unlock already imply a full memory barrier. Combines the spin lock
> > > > acquire/release barriers with the full memory barrier to diminish the
> > > > performance impact on other architectures. (per-architecture spinlock-mb.h
> > > > should be gradually implemented to replace the generic version)
> > >
> > > It does add overhead on x86, as well as most other architectures.
> > >
> > > This really seems like the wrong optimisation to make, especially
> > > given that there's not likely to be much using librcu yet, right?
> > >
> > > I'd go with the simpler and safer version of sys_membarrier that does
> > > not do tricky synchronisation or add overhead to the ctxsw fastpath.
> > > Then if you see some actual improvement in a real program using librcu
> > > one day we can discuss making it faster.
> > >
> > > As it is right now, the change will definitely slow down everybody
> > > not using librcu (ie. nearly everything).
> >
> > Right, so the problem with the 'slow'/'safe' version is that it takes
> > rq->lock for all relevant rqs. This renders while (1) sys_membarrier()
> > in a quite effective DoS.
>
> All, but one at a time, no? How much of a DoS really is taking these
> locks for a handful of cycles each, per syscall?

I was more worried about the cacheline thrashing than lock hold times
there.

> I mean, we have LOTS of syscalls that take locks, and for a lot longer,
> (look at dcache_lock).

Yeah, and dcache is a massive pain, isn't it ;-)

> I think we basically just have to say that locking primitives should be
> somewhat fair, and not be held for too long, it should more or less
> work.

Sure, it'll more or less work, but he's basically making rq->lock a
global lock instead of a per-cpu lock.

> If the locks are getting contended, then the threads calling
> sys_membarrier are going to be spinning longer too, using more CPU time,
> and will get scheduled away...

Sure, and increased spinning reduces the total throughput.

> If there is some particular problem on -rt because of the rq locks,
> then I guess you could consider whether to add more overhead to your
> ctxsw path to reduce the problem, or simply not support sys_membarrier
> for unprived users in the first place.

Right, for -rt we might need to do that, but it's just that rq->lock is a
very hot lock, and adding basically unlimited thrashing to it didn't seem
like a good idea.

Also, I'm thinking making it a priv syscall basically renders it useless
for Mathieu.

Anyway, it might be I'm just paranoid... but archs with large core count
and lazy tlb flush seem particularly vulnerable.

2010-02-01 10:49:10

by Nick Piggin

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, Feb 01, 2010 at 11:36:01AM +0100, Peter Zijlstra wrote:
> On Mon, 2010-02-01 at 21:11 +1100, Nick Piggin wrote:
> > All, but one at a time, no? How much of a DoS really is taking these
> > locks for a handful of cycles each, per syscall?
>
> I was more worrying about the cacheline trashing than lock hold times
> there.

Well, same issue really. Look at all the unprived files in /proc
for example that can look through all per-cpu cachelines. It just
takes a single read syscall to do a lot of them too.


> > I mean, we have LOTS of syscalls that take locks, and for a lot longer,
> > (look at dcache_lock).
>
> Yeah, and dcache is a massive pain, isn't it ;-)

My point is, I don't think it is something we can realistically
care much about and it is nowhere near a new or unique problem
being added by this one patch.

It is really a RoS, reduction of service, rather than a DoS. And
any time we allow an unpriv user on our system, we have RoS potential :)


> > I think we basically just have to say that locking primitives should be
> > somewhat fair, and not be held for too long, it should more or less
> > work.
>
> Sure, it'll more of less work, but he's basically making rq->lock a
> global lock instead of a per-cpu lock.
>
> > If the locks are getting contended, then the threads calling
> > sys_membarrier are going to be spinning longer too, using more CPU time,
> > and will get scheduled away...
>
> Sure, and increased spinning reduces the total throughput.
>
> > If there is some particular problem on -rt because of the rq locks,
> > then I guess you could consider whether to add more overhead to your
> > ctxsw path to reduce the problem, or simply not support sys_membarrier
> > for unprived users in the first place.
>
> Right, for -rt we might need to do that, but its just that rq->lock is a
> very hot lock, and adding basically unlimited trashing to it didn't seem
> like a good idea.
>
> Also, I'm thinking making it a priv syscall basically renders it useless
> for Mathieu.

Well I just mean that it's something for -rt to work out. Apps can
still work if the call is unsupported completely.


> Anyway, it might be I'm just paranoid... but archs with large core count
> and lazy tlb flush seem particularly vulnerable.

2010-02-01 14:48:07

by Mathieu Desnoyers

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

* Nick Piggin ([email protected]) wrote:
> On Mon, Feb 01, 2010 at 11:36:01AM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-02-01 at 21:11 +1100, Nick Piggin wrote:
> > > All, but one at a time, no? How much of a DoS really is taking these
> > > locks for a handful of cycles each, per syscall?
> >
> > I was more worrying about the cacheline trashing than lock hold times
> > there.
>
> Well, same issue really. Look at all the unprived files in /proc
> for example that can look through all per-cpu cachelines. It just
> takes a single read syscall to do a lot of them too.
>
>
> > > I mean, we have LOTS of syscalls that take locks, and for a lot longer,
> > > (look at dcache_lock).
> >
> > Yeah, and dcache is a massive pain, isn't it ;-)
>
> My point is, I don't think it is something we can realistically
> care much about and it is nowhere near a new or unique problem
> being added by this one patch.
>
> It is really a RoS, reduction of service, rather than a DoS. And
> any time we allow an unpriv user on our system, we have RoS potential :)
>
>
> > > I think we basically just have to say that locking primitives should be
> > > somewhat fair, and not be held for too long, it should more or less
> > > work.
> >
> > Sure, it'll more of less work, but he's basically making rq->lock a
> > global lock instead of a per-cpu lock.
> >
> > > If the locks are getting contended, then the threads calling
> > > sys_membarrier are going to be spinning longer too, using more CPU time,
> > > and will get scheduled away...
> >
> > Sure, and increased spinning reduces the total throughput.
> >
> > > If there is some particular problem on -rt because of the rq locks,
> > > then I guess you could consider whether to add more overhead to your
> > > ctxsw path to reduce the problem, or simply not support sys_membarrier
> > > for unprived users in the first place.
> >
> > Right, for -rt we might need to do that, but its just that rq->lock is a
> > very hot lock, and adding basically unlimited trashing to it didn't seem
> > like a good idea.
> >
> > Also, I'm thinking making it a priv syscall basically renders it useless
> > for Mathieu.
>
> Well I just mean that it's something for -rt to work out. Apps can
> still work if the call is unsupported completely.

OK, so we seem to be settling for the spinlock-based sys_membarrier()
this time, which is much less intrusive in terms of scheduler
fast path modification, but adds more system overhead each time
sys_membarrier() is called. This trade-off makes sense to me, as we
expect the scheduler to execute _much_ more often than sys_membarrier().

When I get confirmation that's the route to follow from both of you,
I'll go back to the spinlock-based scheme for v9.

Thanks,

Mathieu

>
>
> > Anyway, it might be I'm just paranoid... but archs with large core count
> > and lazy tlb flush seem particularly vulnerable.

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-02-01 14:58:40

by Nick Piggin

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, Feb 01, 2010 at 09:47:59AM -0500, Mathieu Desnoyers wrote:
> * Nick Piggin ([email protected]) wrote:
> > Well I just mean that it's something for -rt to work out. Apps can
> > still work if the call is unsupported completely.
>
> OK, so we seem to be settling for the spinlock-based sys_membarrier()
> this time, which is much less intrusive in terms of scheduler
> fast path modification, but adds more system overhead each time
> sys_membarrier() is called. This trade-off makes sense to me, as we
> expect the scheduler to execute _much_ more often than sys_membarrier().
>
> When I get confirmation that's the route to follow from both of you,
> I'll go back to the spinlock-based scheme for v9.

I think locking or cacheline bouncing DoS is just something we can't
realistically worry too much about in the standard kernel, beyond the
generally good practice of good scalability, avoiding starvation and
long lock hold times, etc.

So I would prefer the simpler version that doesn't add overhead to
ctxsw, at least for the first implementation.

Thanks,
Nick

2010-02-01 15:23:41

by Steven Rostedt

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Tue, 2010-02-02 at 01:58 +1100, Nick Piggin wrote:
> On Mon, Feb 01, 2010 at 09:47:59AM -0500, Mathieu Desnoyers wrote:
> > * Nick Piggin ([email protected]) wrote:
> > > Well I just mean that it's something for -rt to work out. Apps can
> > > still work if the call is unsupported completely.
> >
> > OK, so we seem to be settling for the spinlock-based sys_membarrier()
> > this time, which is much less intrusive in terms of scheduler
> > fast path modification, but adds more system overhead each time
> > sys_membarrier() is called. This trade-off makes sense to me, as we
> > expect the scheduler to execute _much_ more often than sys_membarrier().
> >
> > When I get confirmation that's the route to follow from both of you,
> > I'll go back to the spinlock-based scheme for v9.
>
> I think locking or cacheline bouncing DoS is just something we can't
> realistically worry too much about in the standard kernel. No further
> than just generally good practice of good scalability, avoiding
> starvations and long lock hold times etc.
>
> So I would prefer the simpler version that doesn't add overhead to
> ctxsw, at least for the first implementation.

Acked-by: Steven Rostedt <[email protected]>

;-)

-- Steve

2010-02-01 15:28:51

by Linus Torvalds

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock



On Sun, 31 Jan 2010, Mathieu Desnoyers wrote:
>
> Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> lock/unlock already imply a full memory barrier.

.. and as Nick pointed out, you're fundamentally incorrect on this.
unlock on x86 is no memory barrier at all, since the x86 memory ordering
rules are such that a regular store always has release consistency.

But more importantly, you don't even explain why the added smp_mb()
helps.

Why does a smp_mb() at the lock/unlock even matter? Reading accesses by
the same CPU sure as hell do _not_ matter, so the whole concept seems
totally broken. There is no way in _hell_ that whatever unlocked thing
can ever write the variables protected by the lock, only read them. So a
full memory barrier makes zero sense to begin with.

So what are these magical memory barriers all about?

Linus

2010-02-01 15:44:19

by Steven Rostedt

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 10:23 -0500, Steven Rostedt wrote:
> On Tue, 2010-02-02 at 01:58 +1100, Nick Piggin wrote:
> > On Mon, Feb 01, 2010 at 09:47:59AM -0500, Mathieu Desnoyers wrote:
> > > * Nick Piggin ([email protected]) wrote:
> > > > Well I just mean that it's something for -rt to work out. Apps can
> > > > still work if the call is unsupported completely.
> > >
> > > OK, so we seem to be settling for the spinlock-based sys_membarrier()
> > > this time, which is much less intrusive in terms of scheduler
> > > fast path modification, but adds more system overhead each time
> > > sys_membarrier() is called. This trade-off makes sense to me, as we
> > > expect the scheduler to execute _much_ more often than sys_membarrier().
> > >
> > > When I get confirmation that's the route to follow from both of you,
> > > I'll go back to the spinlock-based scheme for v9.
> >
> > I think locking or cacheline bouncing DoS is just something we can't
> > realistically worry too much about in the standard kernel. No further
> > than just generally good practice of good scalability, avoiding
> > starvations and long lock hold times etc.
> >
> > So I would prefer the simpler version that doesn't add overhead to
> > ctxsw, at least for the first implementation.
>
> Acked-by: Steven Rostedt <[email protected]>

I'm in agreement with Nick on this issue. Peter, I think you are being a
bit paranoid here. But that's a good thing. Anyone working in the
scheduler must be paranoid. That's what brings out issues that people
normally do not think about. :-)

-- Steve

2010-02-01 16:00:12

by Mike Galbraith

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 10:23 -0500, Steven Rostedt wrote:
> On Tue, 2010-02-02 at 01:58 +1100, Nick Piggin wrote:

> > So I would prefer the simpler version that doesn't add overhead to
> > ctxsw, at least for the first implementation.
>
> Acked-by: Steven Rostedt <[email protected]>
>
> ;-)

Yup, fast path needs to _lose_ some weight.

-Mike

2010-02-01 16:09:34

by Mathieu Desnoyers

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

* Linus Torvalds ([email protected]) wrote:
>
>
> On Sun, 31 Jan 2010, Mathieu Desnoyers wrote:
> >
> > Adds no overhead on x86, because LOCK-prefixed atomic operations of the spin
> > lock/unlock already imply a full memory barrier.
>
> .. and as Nick pointed out, you're fundamentally incorrect on this.
> unlock on x86 is no memory barrier at all, since the x86 memory ordering
> rules are such that a regular store always has release consistency.
>
> But more importantly, you don't even explain why the addded smp_mb()
> helps.
>
> Why does a smp_mb() at the lock/unlock even matter? Reading accesses by
> the same CPU sure as hell do _not_ matter, so the whole concept seems
> totally broken. There is no way in _hell_ that whatever unlocked thing
> can ever write the variables protected by the lock, only read them. So a
> full memory barrier makes zero sense to begin with.
>
> So what are these magical memory barriers all about?
>
> Linus

To show the ordering required by sys_membarrier, let's first express
sys_membarrier() succinctly:

	smp_mb();
	rcu_read_lock();	/* ensures validity of cpu_curr(cpu) tasks */
	for_each_cpu(cpu, mm_cpumask(current->mm))
		if (current->mm == cpu_curr(cpu)->mm)
			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
	rcu_read_unlock();
	smp_mb();

We need to guarantee that between the two memory barriers issued by
sys_membarrier(), we run a membarrier_ipi on every other running thread
belonging to the same process.
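
For reference, a minimal sketch of the membarrier_ipi handler assumed in the
loop above; the exact implementation may differ, but all it needs to do is
issue a full barrier on the interrupted CPU:

	static void membarrier_ipi(void *unused)
	{
		smp_mb();	/* force a full memory barrier on the target CPU */
	}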

Where we have to be careful here is that the context switch does not
imply memory barriers both before and after:

- mm_cpumask update
- rq->curr update

Not having memory barriers between user-space instruction execution and
these updates performed by the scheduler leaves room for a race between
sys_membarrier() and the scheduler, where we would not deliver a
membarrier_ipi to a thread which either starts running or is about to be
context switched.

One option is to surround the context switch by memory barriers. If we
choose not to do that, we are left with these solutions:

We can deal with the rq->curr update by holding the rq lock in each
iteration of the for_each_cpu(cpu, mm_cpumask(current->mm)) loop. This
ensures that if rq->curr is updated, we have an associated memory barrier
issued (e.g. on x86, implied by writing to cr3 while the rq lock is held).

However, this does not deal with the mm_cpumask update, and we cannot use
the per-cpu rq lock for that, as mm_cpumask is a process-wide data structure
updated with clear_bit/set_bit in switch_mm(). So at the very least, we would have to
add memory barriers in switch_mm() on some architectures to deal with
this.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-02-01 16:11:18

by Steven Rostedt

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 07:27 -0800, Linus Torvalds wrote:

> So what are these magical memory barriers all about?

Mathieu is implementing userspace RCU. In order to make the
rcu_read_lock()s fast, the readers cannot issue memory barriers. What is
needed is that, in synchronize_rcu(), the writer forces an mb() on all
CPUs running one of the readers.
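
To make the read-side cost concrete, here is a hedged, heavily simplified
sketch of such a barrier-free reader (not the actual liburcu code, which also
handles nesting; urcu_gp_ctr/urcu_reader_ctr are names borrowed from the
scenario later in this thread). The point is that only plain stores and a
compiler barrier are used; the writer's synchronize_rcu() supplies the real
smp_mb() on each reader CPU via sys_membarrier() IPIs:

	#define barrier()	__asm__ __volatile__("" ::: "memory")

	extern unsigned long urcu_gp_ctr;		/* global grace-period phase */
	extern __thread unsigned long urcu_reader_ctr;	/* per-reader state */

	static inline void rcu_read_lock(void)
	{
		urcu_reader_ctr = urcu_gp_ctr;	/* plain store, no smp_mb() */
		barrier();			/* compiler ordering only */
	}

	static inline void rcu_read_unlock(void)
	{
		barrier();
		urcu_reader_ctr = 0;		/* plain store, no smp_mb() */
	}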

The first simple approach that Mathieu took was to simply send an IPI
to all CPUs and force the mb() to be executed there. But this lets one process
interfere with other processes needlessly, and we Real-Time folks balked
at the idea since it would allow any process to mess with the running of
a real-time thread.

The next approach was to use the mm_cpumask of the thread and only send
IPIs to the CPUs that are running the thread. But there's a race between
the update of the mm_cpumask and the scheduling of the task. If we send
an IPI to a CPU that is not running the process's thread, it may cause a
little interference with the other thread but nothing to worry about.

The issue is if we miss sending an IPI to one of the process's threads. Then
the reader could be accessing a stale pointer that the writer is modifying
after the userspace synchronize_rcu() call.

The taking of the rq locks was a way to make sure that the update of the
mm_cpumask and the scheduling are in sync. That way we know we are
sending IPIs only to CPUs that are running the process's threads, and not
missing any.

But all this got a bit ugly when we tried to avoid grabbing the run
queue locks in the loop to send out IPIs.

Note, I believe that x86 is not affected, since the act of doing the
schedule is in itself a mb(). But this may not be the case on all archs.


-- Steve

2010-02-01 16:24:42

by Steven Rostedt

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 11:09 -0500, Mathieu Desnoyers wrote:

> We can deal with the rq->cur update by holding the rq lock in each
> iteration of the for_each_cpu(cpu, mm_cpumask(current->mm)) loop. This
> ensures that if rq->cur is updated, we have an associated memory barrier
> issued (e.g. on x86, implied by writing to cr3 while the rq lock is held).
>
> However, this does not deal with mm_cpumask update, and we cannot use
> the per-cpu rq lock, as it's a process-wide data structure updated with
> clear_bit/set_bit in switch_mm(). So at the very least, we would have to
> add memory barriers in switch_mm() on some architectures to deal with
> this.
>


Doesn't set_bit imply a wmb()? If so, couldn't we do something like the
following (taking a snapshot of the mask and re-checking it afterwards)?

again:
	cpumask_copy(tmp_mask, mm_cpumask(current->mm));
	smp_mb();
	rcu_read_lock();	/* ensures validity of cpu_curr(cpu) tasks */
	for_each_cpu(cpu, tmp_mask) {
		spin_lock_irq(&cpu_rq(cpu)->lock);
		ret = current->mm == cpu_curr(cpu)->mm;
		spin_unlock_irq(&cpu_rq(cpu)->lock);
		if (ret)
			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
	}
	rcu_read_unlock();
	smp_mb();
	if (!cpumask_equal(tmp_mask, mm_cpumask(current->mm))) {
		/* do check for signals here */
		goto again;
	}

Would the above work?

-- Steve

2010-02-01 16:25:18

by Linus Torvalds

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock



On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
>
> However, this does not deal with mm_cpumask update, and we cannot use
> the per-cpu rq lock, as it's a process-wide data structure updated with
> clear_bit/set_bit in switch_mm(). So at the very least, we would have to
> add memory barriers in switch_mm() on some architectures to deal with
> this.

I'd much rather have a "switch_mm() is a guaranteed memory barrier" logic,
because quite frankly, I don't see how it ever couldn't be one anyway. It
fundamentally needs to do at least a TLB context switch (which may be just
switching an ASI around, not flushing the whole TLB, of course), and I bet
that for 99% of all architectures, that is already pretty much guaranteed
to be equivalent to a memory barrier.

It certainly is for x86. "mov to cr0" is serializing (setting any control
register except cr8 is serializing). And I strongly suspect other
architectures will be too.

Btw, one reason to strongly prefer "switch_mm()" over any random context
switch is that at least it won't affect inter-thread (kernel or user-land)
switching, including switching to/from the idle thread.

So I'd be _much_ more open to a "let's guarantee that 'switch_mm()' always
implies a memory barrier" model than to playing clever games with
spinlocks.

Linus

2010-02-01 16:30:18

by Peter Zijlstra

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 11:24 -0500, Steven Rostedt wrote:
>
> Doesn't set_bit imply a wmb()? If so couldn't we do:

Nope, that's what we have smp_mb__{before,after}_set_bit() for.

On x86 it's a LOCK'ed op, so sure it implies a full membarrier, but not
in generic.

on x86 we have plenty of serializing instructions before and after rq->curr
is set, so none of the craziness is needed at all. The only thing left is
!x86.


2010-02-01 16:46:39

by Steven Rostedt

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

[ Added linux-arch ]

On Mon, 2010-02-01 at 17:29 +0100, Peter Zijlstra wrote:
> On Mon, 2010-02-01 at 11:24 -0500, Steven Rostedt wrote:
> >
> > Doesn't set_bit imply a wmb()? If so couldn't we do:
>
> Nope, that's what we have smp_mb__{before,after}_set_bit() for.
>
> On x86 its a LOCK'ed op, so sure it implies a full membarrier, but not
> in generic.
>
> on x86 we have plenty serializing instructions before and after rq->curr
> is set so none of the crazyness is needed at all. The only thing is !
> x86.
>

So if we go with Linus's approach and make all archs guarantee that
switch_mm() implies a full mb(), then we can simplify the
sys_membarrier() code?

That looks like the best approach.

-- Steve


2010-02-01 16:49:00

by Mathieu Desnoyers

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

* Linus Torvalds ([email protected]) wrote:
>
>
> On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> >
> > However, this does not deal with mm_cpumask update, and we cannot use
> > the per-cpu rq lock, as it's a process-wide data structure updated with
> > clear_bit/set_bit in switch_mm(). So at the very least, we would have to
> > add memory barriers in switch_mm() on some architectures to deal with
> > this.
>
> I'd much rather have a "switch_mm()" is a guaranteed memory barrier logic,
> because quite frankly, I don't see how it ever couldn't be one anyway. It
> fundamentally needs to do at least a TLB context switch (which may be just
> switching an ASI around, not flushing the whole TLB, of course), and I bet
> that for 99% of all architectures, that is already pretty much guaranteed
> to be equivalent to a memory barrier.
>
> It certainly is for x86. "mov to cr0" is serializing (setting any control
> register except cr8 is serializing). And I strongly suspect other
> architectures will be too.

What we have to be careful about here is that it's not enough to just
rely on switch_mm() containing a memory barrier. What we really need to
enforce is that switch_mm() issues memory barriers both _before_ and
_after_ mm_cpumask modification. The "after" part is usually dealt with
by the TLB context switch, but the "before" part usually isn't.

>
> Btw, one reason to strongly prefer "switch_mm()" over any random context
> switch is that at least it won't affect inter-thread (kernel or user-land)
> switching, including switching to/from the idle thread.
>
> So I'd be _much_ more open to a "let's guarantee that 'switch_mm()' always
> implies a memory barrier" model than to playing clever games with
> spinlocks.

If we really want to make this patch less intrusive, we can consider
iterating on each online cpu in sys_membarrier() rather than on the
mm_cpumask. But it comes at the cost of useless cache-line bouncing on
large machines with few threads running in the process, as we would grab
the rq locks one by one for all cpus.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-02-01 16:58:16

by Linus Torvalds

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock



On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
>
> What we have to be careful about here is that it's not enough to just
> rely on switch_mm() containing a memory barrier. What we really need to
> enforce is that switch_mm() issues memory barriers both _before_ and
> _after_ mm_cpumask modification. The "after" part is usually dealt with
> by the TLB context switch, but the "before" part usually isn't.

Ok, whatever. I vote for not doing anything at all, because this just all
sounds like some totally crazy crap. You haven't explained the actual
races, you haven't explained anything at all, you're apparently just
randomly sprinkling smp_mb's around until you think it's all fine.

Show the actual memory ordering constraints as they relate to the kernel
data structures. I'm totally uninterested in general handwaving and "we
must have smp_mb's here and here" without very explicit explanations of
exactly WHAT the memory orderings are.

Linus

2010-02-01 17:13:16

by Steven Rostedt

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 11:48 -0500, Mathieu Desnoyers wrote:

> What we have to be careful about here is that it's not enough to just
> rely on switch_mm() containing a memory barrier. What we really need to
> enforce is that switch_mm() issues memory barriers both _before_ and
> _after_ mm_cpumask modification. The "after" part is usually dealt with
> by the TLB context switch, but the "before" part usually isn't.

Then we add a smp_mb__before_clear_bit() in switch_mm() on all archs
where clear_bit does not imply a smp_mb().

>
> >
> > Btw, one reason to strongly prefer "switch_mm()" over any random context
> > switch is that at least it won't affect inter-thread (kernel or user-land)
> > switching, including switching to/from the idle thread.
> >
> > So I'd be _much_ more open to a "let's guarantee that 'switch_mm()' always
> > implies a memory barrier" model than to playing clever games with
> > spinlocks.
>
> If we really want to make this patch less intrusive, we can consider
> iterating on each online cpu in sys_membarrier() rather than on the
> mm_cpumask. But it comes at the cost of useless cache-line bouncing on
> large machines with few threads running in the process, as we would grab
> the rq locks one by one for all cpus.

I still think modifying the switch_mm() is better than the full
iteration.

-- Steve

2010-02-01 17:35:27

by Linus Torvalds

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock



On Mon, 1 Feb 2010, Steven Rostedt wrote:

> On Mon, 2010-02-01 at 11:48 -0500, Mathieu Desnoyers wrote:
>
> > What we have to be careful about here is that it's not enough to just
> > rely on switch_mm() containing a memory barrier. What we really need to
> > enforce is that switch_mm() issues memory barriers both _before_ and
> > _after_ mm_cpumask modification. The "after" part is usually dealt with
> > by the TLB context switch, but the "before" part usually isn't.
>
> Then we add a smp_mb__before_clear_bit() in the switch_mm() on all archs
> that do not have clear_bit imply a smp_mb().

No. I want to get an _explanation_ first.

And no, "I want to do user-level RCU and I'm looking at mm_cpumask" is not
an explanation.

In order to explain memory ordering rules, you need to actually show the
actual accesses on the different cpu's and the ordering requirements
between them, and explain them. IOW, you need to show

thread1                                   thread2
-------                                   -------

some-particular-memop
mm_cpumask access
memory barrier
                                          memory barrier
                                          mm_cpumask access
                                          some-particular-memop
memory barrier

some-other-particular-memop

and explain exactly why the memory barriers are needed.

In particular, now I've heard the INSANE statement that we need

"memory barriers both _before_ and _after_ mm_cpumask modification."

and people need to realize that such statements are totally worthless.

The fact is, a memory barrier with regards to a single location
modification makes no sense. Not before, not after. Putting a barrier
around a single access (even if that "single" access is then a
read-modify-write one) is a meaningless operation - it has _zero_ semantic
information.

Memory barriers need to be _between_ two operations, and even then they
never make sense on their own - you need to have another actor that also
has a memory barrier, and that you are ordering things with regards to.

Saying that they are "around" one operation is crazy talk. It's a
nonsensical statement. It shows a total lack of understanding of what a
barrier is about. You can't put a barrier 'around' anything at all.

So before we go any further, I want to see the graph of barriers AND THE
THINGS THEY ARE BARRIERS BETWEEN. On both sides. I want to know that you
guys even know _what_ you are protecting against, and I want it documented
so that when people say "this would solve it", they can show that yes, you
actually need barriers at both points.

Linus

2010-02-01 17:45:05

by Mathieu Desnoyers

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

* Linus Torvalds ([email protected]) wrote:
>
>
> On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> >
> > What we have to be careful about here is that it's not enough to just
> > rely on switch_mm() containing a memory barrier. What we really need to
> > enforce is that switch_mm() issues memory barriers both _before_ and
> > _after_ mm_cpumask modification. The "after" part is usually dealt with
> > by the TLB context switch, but the "before" part usually isn't.
>
> Ok, whatever. I vote for not doing anything at all, because this just all
> sounds like some totally crazy crap. You haven't explained the actual
> races, you haven't explained anything at all, you're apparently just
> randomly sprinkling smp_mb's around until you think it's all fine.
>
> Show the actual memory ordering constraints as it is related to the kernel
> data structures. I'm totally uninterested in general handwaving and "we
> must have smp_mb's here and here" without very explicit explanations of
> exactly WHAT the memory orderings are.

It's probably worthwhile to summarize here the mb-related discussions we
had in the previous RFC rounds.

Here is the detailed execution scenario showing the race. The race with
unlock shows that we need to order user-space loads before the mm_cpumask
store. It is an execution sequence involving the userspace RCU
library and the Linux kernel with sys_membarrier(). As we see below, the
lack of ordering between "cpumask_clear" and user-space execution
creates a race with the scheduler where sys_membarrier() incorrectly
considers CPU 1 as not needing a membarrier_ipi.


CPU 0                                     CPU 1
--------------                            --------------
<user space>                              <user space> (already in read-side C.S.)
                                          obj = rcu_dereference(list->next);
                                            -> load list->next
                                          copy = obj->foo;
                                          rcu_read_unlock();
                                            -> store to urcu_reader.ctr
                                          <urcu_reader.ctr store is globally visible>
list_del(obj);
                                          <preempt>
                                          <kernel space>

                                          schedule();
                                          cpumask_clear(mm_cpumask, cpu);

sys_membarrier();
<kernel space>
smp_mb()
iterates on mm_cpumask
smp_mb()
<user space>
set global g.p. (urcu_gp_ctr) phase to 1
wait for all urcu_reader.ctr in phase 0
set global g.p. (urcu_gp_ctr) phase to 0
wait for all urcu_reader.ctr in phase 1
sys_membarrier();
<kernel space>
smp_mb()
iterates on mm_cpumask
smp_mb()
<user space>
free(obj);
                                          <list->next load hits memory>
                                          <obj->foo load hits memory> <- corruption


There is a similar race scenario (symmetric to the one above) between
cpumask_set and a following user-space rcu_read_lock(), which I could
provide upon request.

This results in the following comments around smp_mb() in
sys_membarrier:

First sys_membarrier smp_mb():

	/*
	 * Memory barrier on the caller thread before sending first
	 * IPI. Orders all memory accesses prior to sys_membarrier() before
	 * mm_cpumask read and membarrier_ipi executions.
	 * Matches memory barriers before and after mm_cpumask updates.
	 */

Second sys_membarrier smp_mb():

	/*
	 * Memory barrier on the caller thread after we finished
	 * waiting for the last IPI. Orders mm_cpumask read and membarrier_ipi
	 * executions before memory accesses following sys_membarrier()
	 * execution.
	 * Matches memory barriers before and after mm_cpumask updates.
	 */

Thanks,

Mathieu


>
> Linus

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-02-01 18:00:10

by Steven Rostedt

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 12:45 -0500, Mathieu Desnoyers wrote:
> * Linus Torvalds ([email protected]) wrote:

> It's probably worthwhile to summarize here the mb-related discussions we
> had in the previous RFC rounds.

Actually it is probably worthwhile to include this explanation in the
change log. Can't expect everyone to have read a kernel thread that
contains hundreds of replies.

>
> Here is the detailed execution scenario showing the race. The race with
> unlock showing that we need to order user-space loads before mm_cpumask
> store. This is showing an execution sequence involving the userspace RCU
> library and the Linux kernel with sys_membarrier(). As we see below, the
> lack of ordering between "cpumask_clear" and user-space execution
> creates a race with the scheduler where sys_membarrier() incorrectly
> considers CPU 1 as not needing a membarrier_ipi.
>
>
> CPU 0 CPU 1
> -------------- --------------
> <user space> <user space> (already in read-side C.S.)
> obj = rcu_dereference(list->next);
> -> load list->next
> copy = obj->foo;
> rcu_read_unlock();
> -> store to urcu_reader.ctr
> <urcu_reader.ctr store is globally visible>
> list_del(obj);
> <preempt>
> <kernel space>
>
> schedule();
> cpumask_clear(mm_cpumask, cpu);
>
> sys_membarrier();
> <kernel space>
> smp_mb()
> iterates on mm_cpumask
> smp_mb()
> <user space>
> set global g.p. (urcu_gp_ctr) phase to 1
> wait for all urcu_reader.ctr in phase 0
> set global g.p. (urcu_gp_ctr) phase to 0
> wait for all urcu_reader.ctr in phase 1
> sys_membarrier();
> <kernel space>
> smp_mb()
> iterates on mm_cpumask
> smp_mb()
> <user space>
> free(obj);
> <list->next load hits memory>
> <obj->foo load hits memory> <- corruption
>
>
> There is a similar race scenario (symmetric to the one above) between
> cpumask_set and a following user-space rcu_read_lock(), which I could
> provide upon request.

If it would help others to understand, it would be worthwhile in
explaining that too. So, I'm requesting it ;-)

>
> This results in the following comments around smp_mb() in
> sys_membarrier:
>
> First sys_membarrier smp_mb():
>
> /*
> * Memory barrier on the caller thread before sending first
> * IPI. Orders all memory accesses prior to sys_membarrier() before
> * mm_cpumask read and membarrier_ipi executions.
> * Matches memory barriers before and after mm_cpumask updates.
> */

The above comment does not really explain what it is protecting. It just
states that it serializes memory access. Well, duh, that's what smp_mb()
does ;-) It's the equivalent of i++; /* increment i */

Basically, the comment should explain "why" the memory barrier is
needed.

>
> Second sys_membarrier smp_mb():
>
> /*
> * Memory barrier on the caller thread after we finished
> * waiting for the last IPI. Orders mm_cpumask read and membarrier_ipi
> * executions before memory accesses following sys_membarrier()
> * execution.
> * Matches memory barriers before and after mm_cpumask updates.
> */

Ditto.

-- Steve

2010-02-01 18:38:16

by Linus Torvalds

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock



On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
>
> Here is the detailed execution scenario showing the race.

No. You've added random smp_mb() calls, but you don't actually show what
the f*ck they are protecting against.

For example

> First sys_membarrier smp_mb():

I'm not AT ALL interested in the sys_membarrier() parts. You can have a
million memory barriers there, and I won't care. I'm interested in what
you think the memory barriers elsewhere protect against. It's a barrier
between _which_ two operations?

You can't say it's a barrier "around" the

cpumask_clear(mm_cpumask, cpu);

because a barrier is between two things. So if you want to add two
barriers around that mm_cpumask acces, you need to describe the _three_
events you're barriers between in that call-path (with mm_cpumask being
just one of them)

And then, once you've described _those_ three events, you describe what
the sys_membarrier interaction is, and how mm_cpumask is involved there.

I'm not interested in the user-space code. Don't even quote it. It's
irrelevant apart from the actual semantics you want to guarantee for the
new membarrier() system call. So don't quote the code, just explain what
the actual barriers are.

Linus

2010-02-01 19:56:32

by Mathieu Desnoyers

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

* Linus Torvalds ([email protected]) wrote:
>
>
> On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> >
> > Here is the detailed execution scenario showing the race.
>
> No. You've added random smp_mb() calls, but you don't actually show what
> the f*ck they are protecting against.
>
> For example
>
> > First sys_membarrier smp_mb():
>
> I'm not AT ALL interested in the sys_membarrier() parts. You can hav ea
> million memory barriers there, and I won't care. I'm interested in what
> you think the memory barriers elsewhere protect against. It's a barrier
> between _which_ two operations?
>
> You can't say it's a barrier "around" the
>
> cpumask_clear(mm_cpumask, cpu);
>
> because a barrier is between two things. So if you want to add two
> barriers around that mm_cpumask acces, you need to describe the _three_
> events you're barriers between in that call-path (with mm_cpumask being
> just one of them)
>
> And then, once you've described _those_ three events, you describe what
> the sys_membarrier interaction is, and how mm_cpumask is involved there.
>
> I'm not interested in the user-space code. Don't even quote it. It's
> irrelevant apart from the actual semantics you want to guarantee for the
> new membarrier() system call. So don't quote the code, just explain what
> the actual barriers are.
>

The two event pairs we are looking at are:

Pair 1)

* memory accesses (load/stores) performed by user-space thread before
context switch.
* cpumask_clear_cpu(cpu, mm_cpumask(prev));

Pair 2)

* cpumask_set_cpu(cpu, mm_cpumask(next));
* memory accesses (load/stores) performed by user-space thread after
context switch.

I can see two ways to add memory barriers in switch_mm that would
provide ordering for these two memory access pairs:

Either A)

	switch_mm()
		smp_mb__before_clear_bit();
		cpumask_clear_cpu(cpu, mm_cpumask(prev));
		cpumask_set_cpu(cpu, mm_cpumask(next));
		smp_mb__after_set_bit();

or B)

	switch_mm()
		cpumask_set_cpu(cpu, mm_cpumask(next));
		smp_mb__before_clear_bit();
		cpumask_clear_cpu(cpu, mm_cpumask(prev));

(B) seems like a clear win, as we get the ordering right for both pairs
with a single memory barrier, but I don't know if changing the set/clear
bit order could have nasty side-effects on other mm_cpumask users.
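
To make option (B) concrete, a hedged sketch (an illustration, not actual
arch code) of how it could look in a simplified switch_mm(); the single
barrier orders the previous task's user-space accesses before the mm_cpumask
clear, and the mm_cpumask set before the next task's user-space accesses:

	static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
				     struct task_struct *tsk)
	{
		unsigned int cpu = smp_processor_id();

		if (likely(prev != next)) {
			cpumask_set_cpu(cpu, mm_cpumask(next));
			/* Full barrier on archs where clear_bit does not imply one. */
			smp_mb__before_clear_bit();
			cpumask_clear_cpu(cpu, mm_cpumask(prev));
			/* arch-specific page-table/ASID switch would go here */
		}
	}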

sys_membarrier uses the mm_cpumask to iterate on all CPUs on which the
current process's mm is in use, so it can issue a smp_mb() through an
IPI on all CPUs that need it. Without appropriate ordering of pairs 1-2
detailed above, we could miss a CPU that actually needs a memory
barrier.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-02-01 20:33:14

by Steven Rostedt

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 10:36 -0800, Linus Torvalds wrote:
>

> I'm not interested in the user-space code. Don't even quote it. It's
> irrelevant apart from the actual semantics you want to guarantee for
> the
> new membarrier() system call. So don't quote the code, just explain
> what
> the actual barriers are.


OK, but first we must establish that the sys_membarrier() system call
guarantees that all running threads of this process have an mb()
performed on them before this syscall returns.

The simplest implementation would be to just do an IPI on all CPUs and
have that IPI perform the mb(). But that would interfere with other
tasks, so we want to limit it to only sending the mb()'s to the threads
of the process that are currently running. We use the mm_cpumask to find
out what threads are associated with this task, and only send the IPI to
the CPUs running threads of the current process.

From the kernel's point of view, the goal is to make sure an mb() happens
on all running threads of the calling process.

The code does the following:

	for_each_cpu(cpu, mm_cpumask(current->mm)) {
		if (current->mm == cpu_curr(cpu)->mm)
			send_ipi();
	}

But a race exists between the reading of the mm_cpumask and sending the
IPI. There are in fact two different problems with this race. One is that
a thread scheduled away but never issued an mb(); the other is that a
running task just came in and we never saw it.

Here:

CPU 0                                     CPU 1
-----------                               -----------
                                          < same thread >
                                          schedule()
                                          clear_bit();

current->mm == cpu_curr(1)->mm <<< failed
return sys_membarrier();

                                          context_switch();


The above scenario fails because we missed our thread before it
actually switched to another task. This breaks the guarantee that the
sys_membarrier() syscall implies.


Second scenario, for non-x86 archs where switch_mm() does not imply an
mb():

CPU 0                                     CPU 1
-----------                               -----------
                                          < different thread >
                                          schedule();
                                          clear_bit();
                                          set_bit();
                                          schedule();
                                          < same thread >

sys_membarrier();
current->mm == cpu_curr(1)->mm <<<<< failed


This scenario happens if the switch_mm() does not imply a mb(). That is,
the syscall sys_membarrier() was called after CPU 1 scheduled a thread
of the same process, but the switch_mm() did not force the mb() causing
CPU 0 to see the old value of the mm_cpumask.



The above does not take any user-space into account. It only tries to
fulfill the kernel's obligation of sys_membarrier to ensure that all
threads of the calling process have an mb() performed on them.

Mathieu, from this point of view, you can explain the necessary mb()s
that are within the kernel proper.

-- Steve

2010-02-01 20:44:37

by Linus Torvalds

Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock



On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
>
> The two event pairs we are looking at are:
>
> Pair 1)
>
> * memory accesses (load/stores) performed by user-space thread before
> context switch.
> * cpumask_clear_cpu(cpu, mm_cpumask(prev));
>
> Pair 2)
>
> * cpumask_set_cpu(cpu, mm_cpumask(next));
> * memory accessses (load/stores) performed by user-space thread after
> context switch.

So explain: why does that smp_mb() in between the two _help_?

The user of this will do a

	for_each_cpu(mm_cpumask)
		send_IPI(cpu, smp_mb);

but that's not an atomic op _anyway_. So you're reading mm_cpumask
somewhere earlier, and doing the send_IPI later. So look at the whole
scenario 2:

	cpumask_set_cpu(cpu, mm_cpumask(next));
	memory accesses performed by user-space

and think about it from the perspective of another CPU. What does an
smp_mb() in between the two do?

I'll tell you - it does NOTHING. Because it doesn't matter. I see no
possible way another CPU can care, because let's assume that the other CPU
is doing that

	for_each_cpu(mm_cpumask)
		send_ipi(smp_mb);

and you have to realize that the other CPU needs to read that mm_cpumask
early in order to do that.

So you have this situation:

   CPU1                          CPU2
   ----                          ----

   cpumask_set_cpu
                                 read mm_cpumask
   smp_mb
                                 smp_mb
   user memory accesses
                                 send_ipi

and exactly _what_ is that "smp_mb" on CPU1 protecting against?

Realize that CPU2 is not ordered (because you wanted to avoid the
locking), so the "read mm_cpumask" can happen before or after that
cpumask_set_cpu. And it can happen before or after REGARDLESS of that
smp_mb. The smp_mb doesn't make any difference to CPU2 that I can see.

So the question becomes one of "How can CPU2 care about whether CPU1 is in
the mask"? Considering that CPU2 doesn't do any locking, I don't see any
way you can get a "consistent" CPU mask _regardless_ of any smp_mb's in
there. When it does the "read mm_cpumask()" it might get the value
_before_ the cpumask_set_cpu, and it might get the value _after_, and
that's true regardless of whether there is a smp_mb there or not.

See what I'm asking for? I'm asking for why it matters that we have a
memory barrier, and why that mm_cpumask is so magical that _that_ access
matters so much.

Maybe I'm dense. But if somebody puts memory barriers in the code, I want
to know exactly what the reason for the barrier is. Memory ordering is too
subtle and non-intuitive to go by gut feel.

Linus

2010-02-01 20:53:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock



On Mon, 1 Feb 2010, Steven Rostedt wrote:
>
> But a race exists between the reading of the mm_cpumask and sending the
> IPI. There is in fact two different problems with this race. One is that
> a thread scheduled away, but never issued an mb(), the other is that a
> running task just came in and we never saw it.

I get it. But the thing I object to here is that Mathieu claims that we
need _two_ memory barriers in the switch_mm() code.

And I'm still not seeing it.

You claim that the rule is that "you have to do a mb on all threads", and
that there is a race if a thread switches away just as we're about to do
that.

Fine. But why _two_? And what's so magical about the mm_cpumask that the
barriers need to be around it?

If the rule is that we do a memory barrier as we switch an mm, then why
does that single one not just handle it? Either the CPU kept running that
mm (and the IPI will do the memory barrier), or the CPU didn't (and the
switch_mm had a memory barrier).

Without locking, I don't see how you can really have any stronger
guarantees, and as per my previous email, I don't see how the smp_mb()
around mm_cpumask accesses helps - because the other CPU is still not going
to atomically "see the mask and IPI". It's going to see one value or the
other, and the smp_mb() around the access doesn't seem to have anything to
do with which value it sees.

So I can kind of understand the "We want to guarantee that switching MM's
around wants to be a memory barrier". Quite frankly, I haven't thought even
that through entirely, so who knows... But the "we need to have memory
barriers on both sides of the bit setting/clearing" I don't get.

IOW, show me why that cpumask is _so_ important that the placement of the
memory barriers around it matters, to the point where you want to have it
on both sides.

Maybe you've really thought about this very deeply, but the explanations
aren't getting through to me. Educate me.

Linus

2010-02-01 22:39:39

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 12:52 -0800, Linus Torvalds wrote:
>
> On Mon, 1 Feb 2010, Steven Rostedt wrote:
> >
> > But a race exists between the reading of the mm_cpumask and sending the
> > IPI. There is in fact two different problems with this race. One is that
> > a thread scheduled away, but never issued an mb(), the other is that a
> > running task just came in and we never saw it.
>
> I get it. But the thing I object to here is that Mathieu claims that we
> need _two_ memory barriers in the switch_mm() code.

Note, on x86, we don't need two memory barriers, because the cr3 load
acts as one, and the set_bit() acts as the other.

>
> And I'm still not seeing it.
>
> You claim that the rule is that "you have to do a mb on all threads", and
> that there is a race if a threads switches away just as we're about to do
> that.
>
> Fine. But why _two_? And what's so magical about the mm_cpumask that it
> needs to be around it?
>
> If the rule is that we do a memory barrier as we switch an mm, then why
> does that single one not just handle it? Either the CPU kept running that
> mm (and the IPI will do the memory barrier), or the CPU didn't (and the
> switch_mm had a memory barrier).

The mb() before the mm_cpumask clear actually performs the mb() that we
want to send, in case we miss it. We use mm_cpumask just to limit which
CPUs we send mb()s to. In the unlikely case that we miss a thread, this
mb() will perform the needed mb() for us.

   CPU 0                                  CPU 1
   -----------                            -----------
                                          < same thread >
                                          clear_bit(mm_cpumask)

   sys_membarrier();
   <miss sending CPU 1 mb()>
   return back to user-space

   modify data
                                          < CPU 0 data is read >  <<- corruption

                                          schedule()



Earlier you asked what the three things are that the two mb()'s are
protecting. Actually, it is really just two things, accessed in this
order:

	A
	mb();
	B
	mb();
	A

Where we access A, then B, and then A again, and need to worry about that
ordering. The two things are the mm_cpumask and the user-space data that
sys_membarrier() guarantees will be flushed when it returns.
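
Mapped back onto the syscall itself (again only a sketch, with
membarrier_ipi() being the hypothetical smp_mb()-only IPI handler),
A and B sit around the loop like this:

SYSCALL_DEFINE0(membarrier)
{
	int cpu;

	/* A: the caller's user-space accesses before the syscall */
	smp_mb();	/* first mb(): order A before reading B */
	for_each_cpu(cpu, mm_cpumask(current->mm)) {	/* B */
		if (current->mm == cpu_curr(cpu)->mm)
			smp_call_function_single(cpu, membarrier_ipi,
						 NULL, 1);
	}
	smp_mb();	/* second mb(): order B before the caller's later A */
	return 0;	/* A: the caller's accesses after the syscall returns */
}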


If we add an mb() before the clearing of the cpu bit in the mm_cpumask(),
we don't care if we read an old value of the mask that is missing a cpu
which had been running one of the process's threads, because that mb()
would already have made all the necessary data of the thread visible, and
no other mb() is needed.


The mb() after the set bit makes sure the mask is visible for this:

   CPU 0                                  CPU 1
   -----------                            -----------
                                          set_bit();
                                          schedule();
                                          return to userspace and access
                                          critical data

   sys_membarrier();
   miss sending IPI.
   return;

   update critical data



I guess the simple answer is that the first mb() is doing the mb() we
missed. The second mb() makes sure that a thread does not get back to
userspace without the sys_membarrier() seeing it.

>
> Without locking, I don't see how you can really have any stronger
> guarantees, and as per my previous email, I don't see what the smp_mb()
> around mm_cpumask accesses help - because the other CPU is still not going
> to atomically "see the mask and IPI". It's going to see one value or the
> other, and the smp_mb() around the access doesn't seem to have anything to
> do with which value it sees.

I guess you can say we are cheating here. If we miss the mm_cpumask
update, the only time it is an issue for us is if one of our threads
scheduled out. We use the mb() there to perform the mb() that we missed.
We really don't care about the mm_cpumask, but we do care that a mb() is
performed.


>
> So I can kind of understand the "We want to guarantee that switching MM's
> around wants to be a memory barrier". Quite frankly, I haven't though even
> that through entirely, so who knows... But the "we need to have memory
> barriers on both sides of the bit setting/clearing" I don't get.
>
> IOW, show me why that cpumask is _so_ important that the placement of the
> memory barriers around it matters, to the point where you want to have it
> on both sides.

Actually the cpumask is not as important as the mb(). We use the cpumask
to find out which CPUs we need to make do an mb(), and the mb() before
the cpumask update is there for the case where we miss one because of
the race.

>
> Maybe you've really thought about this very deeply, but the explanations
> aren't getting through to me. Educate me.

The issue is that sys_membarrier() is guaranteeing that all the other
threads of the process will have an mb() performed before
sys_membarrier() returns. We only care about the user-space data of the
threads, thus the mb() here is quite unique. We are using the mm_cpumask
as a shortcut to send an mb() to only the CPUs that are running our
threads.

The mb() before the mm_cpumask() update is to make sure that the thread
does an mb() in case we miss it when reading the mm_cpumask().

The mb() after the mm_cpumask() update is to make sure that the cpumask
is visible before the thread reads any data in user-space, so that
sys_membarrier() will send out the IPI to that thread.

For x86 these are not issues because clear_bit() is an mb() and so is
the load_cr3().

Now if other arch maintainers do not want to add two mb()s in
switch_mm(), then perhaps, if they just guarantee that one mb() exists
after the clear/set bits, the following code may work:

(posted in a previous email)

sys_membarrier(void) {

again:
	tmp_mask = mm_cpumask(current->mm);
	smp_mb();
	rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */
	for_each_cpu(cpu, tmp_mask) {
		spin_lock_irq(&cpu_rq(cpu)->lock);
		ret = current->mm == cpu_curr(cpu)->mm;
		spin_unlock_irq(&cpu_rq(cpu)->lock);
		if (ret)
			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
	}
	rcu_read_unlock();
	smp_mb();
	if (tmp_mask != mm_cpumask(current->mm)) {
		/* do check for signals here */
		goto again;
	}
}

Maybe even the locks are not needed.
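
One nit with the sketch as written: mm_cpumask() returns a pointer into
the mm, so tmp_mask would just be that pointer and the final comparison
never differs. Spelling out the copy explicitly (again only a sketch,
assuming the same membarrier_ipi() handler and the same locking as
above), it might look like:

SYSCALL_DEFINE0(membarrier)
{
	cpumask_var_t tmp_mask;
	int cpu, ret;

	if (!alloc_cpumask_var(&tmp_mask, GFP_KERNEL))
		return -ENOMEM;
again:
	/* Snapshot the mask by value so the recheck below means something. */
	cpumask_copy(tmp_mask, mm_cpumask(current->mm));
	smp_mb();
	rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */
	for_each_cpu(cpu, tmp_mask) {
		spin_lock_irq(&cpu_rq(cpu)->lock);
		ret = current->mm == cpu_curr(cpu)->mm;
		spin_unlock_irq(&cpu_rq(cpu)->lock);
		if (ret)
			smp_call_function_single(cpu, membarrier_ipi,
						 NULL, 1);
	}
	rcu_read_unlock();
	smp_mb();
	if (!cpumask_equal(tmp_mask, mm_cpumask(current->mm))) {
		/* check for signals here before retrying */
		goto again;
	}
	free_cpumask_var(tmp_mask);
	return 0;
}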

-- Steve

2010-02-01 22:42:11

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

* Linus Torvalds ([email protected]) wrote:
>
>
> On Mon, 1 Feb 2010, Mathieu Desnoyers wrote:
> >
> > The two event pairs we are looking at are:
> >
> > Pair 1)
> >
> > * memory accesses (load/stores) performed by user-space thread before
> > context switch.
> > * cpumask_clear_cpu(cpu, mm_cpumask(prev));
> >
> > Pair 2)
> >
> > * cpumask_set_cpu(cpu, mm_cpumask(next));
> > * memory accessses (load/stores) performed by user-space thread after
> > context switch.
>
> So explain why does that smp_mb() in between the two _help_?
>
> The user of this will do a
>
> for_each_cpu(mm_cpumask)
> send_IPI(cpu, smp_mb);
>
> but that's not an atomic op _anyway_. So you're reading mm_cpumask
> somewhere earlier, and doing the send_IPI later. So look at the whole
> scenario 2:
>
> cpumask_set_cpu(cpu, mm_cpumask(next));
> memory accessses performed by user-space
>
> and think about it from the perspective of another CPU. What does an
> smp_mb() in between the two do?
>
> I'll tell you - it does NOTHING. Because it doesn't matter. I see no
> possible way another CPU can care, because let's assume that the other CPU
> is doing that
>
> for_each_cpu(mm_cpumask)
> send_ipi(smp_mb);
>
> and you have to realize that the other CPU needs to read that mm_cpumask
> early in order to do that.
>
> So you have this situation:
>
>    CPU1                          CPU2
>    ----                          ----
>
>    cpumask_set_cpu
>                                  read mm_cpumask
>    smp_mb
>                                  smp_mb
>    user memory accesses
>                                  send_ipi
>
> and exactly _what_ is that "smp_mb" on CPU1 protecting against?
>
> Realize that CPU2 is not ordered (because you wanted to avoid the
> locking), so the "read mm_cpumask" can happen before or after that
> cpumask_set_cpu. And it can happen before or after REGARDLESS of that
> smp_mb. The smp_mb doesn't make any difference to CPU2 that I can see.
>
> So the question becomes one of "How can CPU2 care about whether CPU1 is in
> the mask"? Considering that CPU2 doesn't do any locking, I don't see any
> way you can get a "consistent" CPU mask _regardless_ of any smp_mb's in
> there. When it does the "read mm_cpumask()" it might get the value
> _before_ the cpumask_set_cpu, and it might get the value _after_, and
> that's true regardless of whether there is a smp_mb there or not.
>
> See what I'm asking for? I'm asking for why it matters that we have a
> memory barrier, and why that mm_cpumask is so magical that _that_ access
> matters so much.
>
> Maybe I'm dense. But If somebody puts memory barriers in the code, I want
> to know exactly what the reason for the barrier is. Memory ordering is too
> subtle and non-intuitive to go by gut feel.
>

I am 100% fine with your request for explanation. I'll do my best to
explain the requirements and ordering as clearly as I can.

Let's define the semantics of this system call clearly; this will help us
below: it orders memory accesses performed by a thread before and after
calling sys_membarrier with respect to _all_ memory accesses (in program
order) performed by all other threads of the same process.

The barriers are not there to "protect" against anything; they merely
enforce ordering of memory accesses. sys_membarrier ensures that a
memory barrier will be issued on all running threads of the current
process executing on other cpus, between two memory barriers issued
locally. A very important point is that the memory barriers _need_ to be
issued on the other running threads after sys_membarrier's first mb
and before its second mb. The second important aspect is the definition
of a running vs non-running thread. Here, any CPU that could still have
unflushed write buffers (or pending reordered read operations) affecting
a thread which belongs to our process should be considered "running",
because it needs to have memory barriers issued.

Now why is mm_cpumask so important here? It is the only non-lock-protected
data structure we access to find out which CPUs are running threads
belonging to our process.

I'm modifying your scenario to match the requirements expressed above.
In this first scenario, we send an IPI guaranteeing that the newly
scheduled thread (which happens to have set its cpu in the mm_cpumask
before we read it) issues a memory barrier. We end up issuing one extra
memory barrier, as both the scheduler and IPI issue the barriers:

   CPU1                          CPU2
   ----                          ----
                                 smp_mb
   cpumask_set_cpu
                                 read mm_cpumask
                                 send_ipi
                                        |
   smp_mb                               |
   user memory accesses                 |
                                        |
   smp_mb  <----------------------------/
                                 smp_mb

Now, let's see what happens if the read mm_cpumask occurs right before
the cpumask_set_cpu. In this scenario, CPU1 is considered as "not
running", because it cannot have any user-space memory accesses
reordered prior to the first smp_mb on CPU2.

   CPU1                          CPU2
   ----                          ----
                                 smp_mb
                                 read mm_cpumask
                                 (no ipi to send)
   cpumask_set_cpu
   smp_mb
                                 smp_mb
   user memory accesses

As we see, the smp_mb issued on CPU1 between the cpumask set and the user
memory accesses ensures that no user memory accesses spill over the
barrier. So if sys_membarrier encounters an mm_cpumask with a cleared cpu
bit, it knows for sure that there are no unflushed write buffers
containing user-space data, nor reordered read or write accesses to
user-space data. Removing this barrier could let CPU1 reorder the user
memory accesses before the cpumask_set_cpu, which would fail to satisfy
the sys_membarrier guarantees.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2010-02-01 23:09:47

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch 2/3] scheduler: add full memory barriers upon task switch at runqueue lock/unlock

On Mon, 2010-02-01 at 17:39 -0500, Steven Rostedt wrote:

>
>    CPU 0                                  CPU 1
>    -----------                            -----------
>                                           < same thread >
>                                           clear_bit(mm_cpumask)
>
>    sys_membarrier();
>    <miss sending CPU 1 mb()>
>    return back to user-space
>
>    modify data
>                                           < CPU 0 data is read >  <<- corruption
>
>                                           schedule()
>
>

> (posted in a previous email)
>
> sys_membarrier(void) {
>
> again:
> 	tmp_mask = mm_cpumask(current->mm);
> 	smp_mb();
> 	rcu_read_lock(); /* ensures validity of cpu_curr(cpu) tasks */
> 	for_each_cpu(cpu, tmp_mask) {
> 		spin_lock_irq(&cpu_rq(cpu)->lock);
> 		ret = current->mm == cpu_curr(cpu)->mm;
> 		spin_unlock_irq(&cpu_rq(cpu)->lock);
> 		if (ret)
> 			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> 	}
> 	rcu_read_unlock();
> 	smp_mb();
> 	if (tmp_mask != mm_cpumask(current->mm)) {
> 		/* do check for signals here */
> 		goto again;
> 	}

Doh, this misses the above case if an mb() is not performed beforehand.
Never mind, it looks like the only other option is to iterate over all
CPUs.

-- Steve