Hello!
Rough notes from our discussion last Thursday. Please reply to the
group with any needed elaborations or corrections.
Adding Andy and Michael on CC since this most closely affects their
architectures. Also adding Dave Watson and Maged Michael because
the preferred approach requires that processes wanting to use the
lightweight sys_membarrier() do a registration step.
Thanx, Paul
------------------------------------------------------------------------
Problem:
1. The current sys_membarrier() introduces an smp_mb() that
is not otherwise required on powerpc.
2. The envisioned JIT variant of sys_membarrier() assumes that
the return-to-user instruction sequence handles any change
to the usermode instruction stream, and Andy Lutomirski's
upcoming changes invalidate this assumption. It is believed
that powerpc has a similar issue.
Here are diagrams indicating the memory-ordering requirements:
Scenario 1: Access preceding sys_membarrier() must see changes
from thread that concurrently switches in.
----------------------------------------------------------------
Scheduler sys_membarrier()
--------- ----------------
smp_mb();
usermode load or store to Y
/* begin system call */
sys_membarrier()
smp_mb();
Check rq->curr
rq->curr = new_thread;
smp_mb(); /* not powerpc! */
/* return to user */
usermode load or store to X
smp_mb();
----------------------------------------------------------------
Due to the fr link from the check of rq->curr to the scheduler's
write, we need full memory barriers on both sides. However,
we don't want to lose the powerpc optimization, at least not in
the common case.
Scenario 2: Access following sys_membarrier() must see changes
from thread that concurrently switches out.
----------------------------------------------------------------
Scheduler sys_membarrier()
--------- ----------------
/* begin system call */
sys_membarrier()
smp_mb();
usermode load or store to X
/* Schedule from user */
smp_mb();
rq->curr = new_thread;
Check rq->curr
smp_mb();
smp_mb(); /* not powerpc! */
/* return to user */
usermode load or store to Y
----------------------------------------------------------------
Here, less ordering is required, given that the read returns the
value previously written. Weaker barriers could be used,
but full memory barriers are in place in any case.
Potential resolutions, including known stupid ones:
A. IPI all CPUs all the time. Not so good for real-time workloads,
and a usermode-induced set of IPIs could potentially be used for
a denial-of-service (DoS) attack.
B. Lock all runqueues all the time. This could potentially also be
used in a usermode-induced DoS attack.
C. Explicitly interact with all threads rather than with CPUs.
This can be quite expensive for the surprisingly common case
where applications have very large numbers of threads. (Java,
we are looking at you!!!)
D. Just keep the redundant smp_mb() and just say "no" to Andy's
x86 optimizations. We would like to avoid the performance
degradation in both cases.
E. Require that threads register before using sys_membarrier() for
private or JIT usage. (The historical implementation using
synchronize_sched() would continue to -not- require registration,
both for compatibility and because there is no need to do so.)
For x86 and powerpc, this registration would set a TIF flag
on all of the current process's threads. This flag would be
inherited by any later thread creation within that process, and
would be cleared by fork() and exec(). When this TIF flag is set,
the return-to-user path would execute additional code that would
ensure that ordering and newly JITed code was handled correctly.
We believe that checks for these TIF flags could be combined with
existing checks to avoid adding any overhead in the common case
where the process was not using these sys_membarrier() features.
For all other architectures, the registration step would be
a no-op.
Does anyone have any better solution? If so, please don't keep it
a secret!
Thanx, Paul
On Sun, 17 Sep 2017, Paul E. McKenney wrote:
> Hello!
>
> Rough notes from our discussion last Thursday. Please reply to the
> group with any needed elaborations or corrections.
>
> Adding Andy and Michael on CC since this most closely affects their
> architectures. Also adding Dave Watson and Maged Michael because
> the preferred approach requires that processes wanting to use the
> lightweight sys_membarrier() do a registration step.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> Problem:
>
> 1. The current sys_membarrier() introduces an smp_mb() that
> is not otherwise required on powerpc.
>
> 2. The envisioned JIT variant of sys_membarrier() assumes that
> the return-to-user instruction sequence handles any change
> to the usermode instruction stream, and Andy Lutomirski's
> upcoming changes invalidate this assumption. It is believed
> that powerpc has a similar issue.
> E. Require that threads register before using sys_membarrier() for
> private or JIT usage. (The historical implementation using
> synchronize_sched() would continue to -not- require registration,
> both for compatibility and because there is no need to do so.)
>
> For x86 and powerpc, this registration would set a TIF flag
> on all of the current process's threads. This flag would be
> inherited by any later thread creation within that process, and
> would be cleared by fork() and exec(). When this TIF flag is set,
Why a TIF flag, and why clear it during fork()? If a process registers
to use private expedited sys_membarrier, shouldn't that apply to
threads it will create in the future just as much as to threads it has
already created?
> the return-to-user path would execute additional code that would
> ensure that ordering and newly JITed code was handled correctly.
> We believe that checks for these TIF flags could be combined with
> existing checks to avoid adding any overhead in the common case
> where the process was not using these sys_membarrier() features.
>
> For all other architectures, the registration step would be
> a no-op.
Don't we want to fail private expedited sys_membarrier calls if the
process hasn't registered for them? This requires the registration
call to set a flag for the process, even on architectures where no
additional memory barriers are actually needed. It can't be a no-op.
Alan Stern
----- On Sep 18, 2017, at 3:04 PM, Alan Stern [email protected] wrote:
> On Sun, 17 Sep 2017, Paul E. McKenney wrote:
>
>> Hello!
>>
>> Rough notes from our discussion last Thursday. Please reply to the
>> group with any needed elaborations or corrections.
>>
>> Adding Andy and Michael on CC since this most closely affects their
>> architectures. Also adding Dave Watson and Maged Michael because
>> the preferred approach requires that processes wanting to use the
>> lightweight sys_membarrier() do a registration step.
>>
>> Thanx, Paul
>>
>> ------------------------------------------------------------------------
>>
>> Problem:
>>
>> 1. The current sys_membarrier() introduces an smp_mb() that
>> is not otherwise required on powerpc.
>>
>> 2. The envisioned JIT variant of sys_membarrier() assumes that
>> the return-to-user instruction sequence handles any change
>> to the usermode instruction stream, and Andy Lutomirski's
>> upcoming changes invalidate this assumption. It is believed
>> that powerpc has a similar issue.
>
>> E. Require that threads register before using sys_membarrier() for
>> private or JIT usage. (The historical implementation using
>> synchronize_sched() would continue to -not- require registration,
>> both for compatibility and because there is no need to do so.)
>>
>> For x86 and powerpc, this registration would set a TIF flag
>> on all of the current process's threads. This flag would be
>> inherited by any later thread creation within that process, and
>> would be cleared by fork() and exec(). When this TIF flag is set,
>
> Why a TIF flag, and why clear it during fork()? If a process registers
> to use private expedited sys_membarrier, shouldn't that apply to
> threads it will create in the future just as much as to threads it has
> already created?
In my implementation posted today, I'm not clearing it on fork. The child
inherits from the parent.
Why a TIF flag? It appears to be a convenient way to add architecture-specific
single-bit state for each thread. We also don't want to do too much pointer
chasing on the scheduler fast path (current->mm->..).
>
>> the return-to-user path would execute additional code that would
>> ensure that ordering and newly JITed code was handled correctly.
>> We believe that checks for these TIF flags could be combined with
>> existing checks to avoid adding any overhead in the common case
>> where the process was not using these sys_membarrier() features.
>>
>> For all other architectures, the registration step would be
>> a no-op.
>
> Don't we want to fail private expedited sys_membarrier calls if the
> process hasn't registered for them? This requires the registration
> call to set a flag for the process, even on architectures where no
> additional memory barriers are actually needed. It can't be a no-op.
My implementation posted today fails the private expedited command
if the process is not registered yet. We indeed add a new flag in
mm_struct for all architectures to do so.
So why not re-use this flag instead of the TIF flag on powerpc? See my
pointer-chasing-on-the-fast-path argument above.
Thanks,
Mathieu
>
> Alan Stern
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Mon, Sep 18, 2017 at 03:04:21PM -0400, Alan Stern wrote:
> On Sun, 17 Sep 2017, Paul E. McKenney wrote:
>
> > Hello!
> >
> > Rough notes from our discussion last Thursday. Please reply to the
> > group with any needed elaborations or corrections.
> >
> > Adding Andy and Michael on CC since this most closely affects their
> > architectures. Also adding Dave Watson and Maged Michael because
> > the preferred approach requires that processes wanting to use the
> > lightweight sys_membarrier() do a registration step.
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > Problem:
> >
> > 1. The current sys_membarrier() introduces an smp_mb() that
> > is not otherwise required on powerpc.
> >
> > 2. The envisioned JIT variant of sys_membarrier() assumes that
> > the return-to-user instruction sequence handles any change
> > to the usermode instruction stream, and Andy Lutomirski's
> > upcoming changes invalidate this assumption. It is believed
> > that powerpc has a similar issue.
>
> > E. Require that threads register before using sys_membarrier() for
> > private or JIT usage. (The historical implementation using
> > synchronize_sched() would continue to -not- require registration,
> > both for compatibility and because there is no need to do so.)
> >
> > For x86 and powerpc, this registration would set a TIF flag
> > on all of the current process's threads. This flag would be
> > inherited by any later thread creation within that process, and
> > would be cleared by fork() and exec(). When this TIF flag is set,
>
> Why a TIF flag, and why clear it during fork()? If a process registers
> to use private expedited sys_membarrier, shouldn't that apply to
> threads it will create in the future just as much as to threads it has
> already created?
The reason for a TIF flag is to keep this per-architecture, as only
powerpc and x86 need it.
The reason for clearing it during fork() is that fork() creates a new
process initially having but a single thread, which might or might
not use sys_membarrier(). Usually not, as most instances of fork()
are quickly followed by exec(). In addition, if we give an error for
unregistered use of private sys_membarrier(), clearing on fork() gets an
unambiguous error instead of a silent likely failure (due to libraries
being confused by the fork()).
That said, pthread_create() should preserve the flag, as the new thread
will be part of this same process.
> > the return-to-user path would execute additional code that would
> > ensure that ordering and newly JITed code was handled correctly.
> > We believe that checks for these TIF flags could be combined with
> > existing checks to avoid adding any overhead in the common case
> > where the process was not using these sys_membarrier() features.
> >
> > For all other architectures, the registration step would be
> > a no-op.
>
> Don't we want to fail private expedited sys_membarrier calls if the
> process hasn't registered for them? This requires the registration
> call to set a flag for the process, even on architectures where no
> additional memory barriers are actually needed. It can't be a no-op.
Good point, and we did discuss that. Color me forgetful!!!
Thanx, Paul
----- On Sep 18, 2017, at 3:29 PM, Paul E. McKenney [email protected] wrote:
> On Mon, Sep 18, 2017 at 03:04:21PM -0400, Alan Stern wrote:
>> On Sun, 17 Sep 2017, Paul E. McKenney wrote:
>>
>> > Hello!
>> >
>> > Rough notes from our discussion last Thursday. Please reply to the
>> > group with any needed elaborations or corrections.
>> >
>> > Adding Andy and Michael on CC since this most closely affects their
>> > architectures. Also adding Dave Watson and Maged Michael because
>> > the preferred approach requires that processes wanting to use the
>> > lightweight sys_membarrier() do a registration step.
>> >
>> > Thanx, Paul
>> >
>> > ------------------------------------------------------------------------
>> >
>> > Problem:
>> >
>> > 1. The current sys_membarrier() introduces an smp_mb() that
>> > is not otherwise required on powerpc.
>> >
>> > 2. The envisioned JIT variant of sys_membarrier() assumes that
>> > the return-to-user instruction sequence handles any change
>> > to the usermode instruction stream, and Andy Lutomirski's
>> > upcoming changes invalidate this assumption. It is believed
>> > that powerpc has a similar issue.
>>
>> > E. Require that threads register before using sys_membarrier() for
>> > private or JIT usage. (The historical implementation using
>> > synchronize_sched() would continue to -not- require registration,
>> > both for compatibility and because there is no need to do so.)
>> >
>> > For x86 and powerpc, this registration would set a TIF flag
>> > on all of the current process's threads. This flag would be
>> > inherited by any later thread creation within that process, and
>> > would be cleared by fork() and exec(). When this TIF flag is set,
>>
>> Why a TIF flag, and why clear it during fork()? If a process registers
>> to use private expedited sys_membarrier, shouldn't that apply to
>> threads it will create in the future just as much as to threads it has
>> already created?
>
> The reason for a TIF flag is to keep this per-architecture, as only
> powerpc and x86 need it.
>
> The reason for clearing it during fork() is that fork() creates a new
> process initially having but a single thread, which might or might
> not use sys_membarrier(). Usually not, as most instances of fork()
> are quickly followed by exec(). In addition, if we give an error for
> unregistered use of private sys_membarrier(), clearing on fork() gets an
> unambiguous error instead of a silent likely failure (due to libraries
> being confused by the fork()).
I think clearing that state on fork() would be unexpected. The child process
inherits the flag from the parent in my current implementation. Clearing the
flag is only provided through exec().
Libraries don't get re-initialized on fork, only on exec(). Therefore, it
makes sense for the child process to inherit the state from its parent.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Mon, Sep 18, 2017 at 07:37:22PM +0000, Mathieu Desnoyers wrote:
> ----- On Sep 18, 2017, at 3:29 PM, Paul E. McKenney [email protected] wrote:
>
> > On Mon, Sep 18, 2017 at 03:04:21PM -0400, Alan Stern wrote:
> >> On Sun, 17 Sep 2017, Paul E. McKenney wrote:
> >>
> >> > Hello!
> >> >
> >> > Rough notes from our discussion last Thursday. Please reply to the
> >> > group with any needed elaborations or corrections.
> >> >
> >> > Adding Andy and Michael on CC since this most closely affects their
> >> > architectures. Also adding Dave Watson and Maged Michael because
> >> > the preferred approach requires that processes wanting to use the
> >> > lightweight sys_membarrier() do a registration step.
> >> >
> >> > Thanx, Paul
> >> >
> >> > ------------------------------------------------------------------------
> >> >
> >> > Problem:
> >> >
> >> > 1. The current sys_membarrier() introduces an smp_mb() that
> >> > is not otherwise required on powerpc.
> >> >
> >> > 2. The envisioned JIT variant of sys_membarrier() assumes that
> >> > the return-to-user instruction sequence handles any change
> >> > to the usermode instruction stream, and Andy Lutomirski's
> >> > upcoming changes invalidate this assumption. It is believed
> >> > that powerpc has a similar issue.
> >>
> >> > E. Require that threads register before using sys_membarrier() for
> >> > private or JIT usage. (The historical implementation using
> >> > synchronize_sched() would continue to -not- require registration,
> >> > both for compatibility and because there is no need to do so.)
> >> >
> >> > For x86 and powerpc, this registration would set a TIF flag
> >> > on all of the current process's threads. This flag would be
> >> > inherited by any later thread creation within that process, and
> >> > would be cleared by fork() and exec(). When this TIF flag is set,
> >>
> >> Why a TIF flag, and why clear it during fork()? If a process registers
> >> to use private expedited sys_membarrier, shouldn't that apply to
> >> threads it will create in the future just as much as to threads it has
> >> already created?
> >
> > The reason for a TIF flag is to keep this per-architecture, as only
> > powerpc and x86 need it.
> >
> > The reason for clearing it during fork() is that fork() creates a new
> > process initially having but a single thread, which might or might
> > not use sys_membarrier(). Usually not, as most instances of fork()
> > are quickly followed by exec(). In addition, if we give an error for
> > unregistered use of private sys_membarrier(), clearing on fork() gets an
> > unambiguous error instead of a silent likely failure (due to libraries
> > being confused by the fork()).
>
> I think clearing that state on fork() would be unexpected. The child process
> inherits the flag from the parent in my current implementation. Clearing the
> flag is only provided through exec().
>
> Libraries don't get re-initialized on fork, only on exec(). Therefore, it
> makes sense for the child process to inherit the state from its parent.
Fair enough!
Thanx, Paul
On Sun, Sep 17, 2017 at 3:36 PM, Paul E. McKenney
<[email protected]> wrote:
> Hello!
>
> Rough notes from our discussion last Thursday. Please reply to the
> group with any needed elaborations or corrections.
>
> Adding Andy and Michael on CC since this most closely affects their
> architectures. Also adding Dave Watson and Maged Michael because
> the preferred approach requires that processes wanting to use the
> lightweight sys_membarrier() do a registration step.
Not to be too much of a curmudgeon, but I think that there should be a
real implementation of the isync membarrier before this gets merged.
This series purports to solve two problems, ppc barriers and x86
exit-without-isync, but it's very hard to evaluate whether it actually
solves the latter problem given the complete lack of x86 or isync code
in the current RFC.
It still seems to me that you won't get any particular advantage for
using this registration mechanism on x86 even when you implement
isync. Unless I've misunderstood, the only real issue on x86 is that
you need a helper like arch_force_isync_before_usermode(), and that
helper doesn't presently exist. That means that this whole patchset
is standing on very dangerous ground: you'll end up with an efficient
implementation that works just fine without even requesting
registration on every architecture except ppc. That way lies
userspace bugs.
Also, can you elaborate on the PPC issue? PPC appears to track
mm_cpumask more or less just like x86. Is the issue just that this
tracking has no implied barriers? If so, how does TLB flush on ppc
work? It really does seem impressive to me that an architecture can
efficiently support munmap() but not an expedited private membarrier.
--Andy
----- On Sep 20, 2017, at 12:02 PM, Andy Lutomirski [email protected] wrote:
> On Sun, Sep 17, 2017 at 3:36 PM, Paul E. McKenney
> <[email protected]> wrote:
>> Hello!
>>
>> Rough notes from our discussion last Thursday. Please reply to the
>> group with any needed elaborations or corrections.
>>
>> Adding Andy and Michael on CC since this most closely affects their
>> architectures. Also adding Dave Watson and Maged Michael because
>> the preferred approach requires that processes wanting to use the
>> lightweight sys_membarrier() do a registration step.
>
> Not to be too much of a curmudgeon, but I think that there should be a
> real implementation of the isync membarrier before this gets merged.
> This series purports to solve two problems, ppc barriers and x86
> exit-without-isync, but it's very hard to evaluate whether it actually
> solves the latter problem given the complete lack of x86 or isync code
> in the current RFC.
>
> It still seems to me that you won't get any particular advantage for
> using this registration mechanism on x86 even when you implement
> isync. Unless I've misunderstood, the only real issue on x86 is that
> you need a helper like arch_force_isync_before_usermode(), and that
> helper doesn't presently exist. That means that this whole patchset
> is standing on very dangerous ground: you'll end up with an efficient
> implementation that works just fine without even requesting
> registration on every architecture except ppc. That way lies
> userspace bugs.
My proposed RFC for private expedited membarrier enforces that all
architectures perform the registration step. Using the "PRIVATE_EXPEDITED"
command without prior process registration returns an error on all
architectures. The goal here is to make all architectures behave in the
same way, and it allows us to rely on process registration to deal
with future arch-specific optimizations.
Adding the "core_sync" behavior could then be done for the next kernel
merge window. I'm currently foreseeing two possible ABI approaches to
expose it:
Approach 1:
Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE and
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE commands. This
allows us to return their availability through MEMBARRIER_CMD_QUERY.
Approach 2:
Add a "MEMBARRIER_FLAG_SYNC_CORE" as a flag parameter. It could be set
when issuing both MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED and
MEMBARRIER_CMD_PRIVATE_EXPEDITED, thus ensuring core serializing
behavior. Querying whether core serialization is supported could
be done by issuing the MEMBARRIER_CMD_QUERY command with the
MEMBARRIER_FLAG_SYNC_CORE flag set.
Any other ideas? Does one approach seem better?
>
> Also, can you elaborate on the PPC issue? PPC appears to track
> mm_cpumask more or less just like x86. Is the issue just that this
> tracking has no implied barriers? If so, how does TLB flush on ppc
> work? It really does seem impressive to me that an architecture can
> efficiently support munmap() but not an expedited private membarrier.
I'll leave this question to the PPC experts :)
Thanks,
Mathieu
>
> --Andy
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Wed, Sep 20, 2017 at 11:13 AM, Mathieu Desnoyers
<[email protected]> wrote:
>
> ----- On Sep 20, 2017, at 12:02 PM, Andy Lutomirski [email protected] wrote:
>
> > On Sun, Sep 17, 2017 at 3:36 PM, Paul E. McKenney
> > <[email protected]> wrote:
> >> Hello!
> >>
> >> Rough notes from our discussion last Thursday. Please reply to the
> >> group with any needed elaborations or corrections.
> >>
> >> Adding Andy and Michael on CC since this most closely affects their
> >> architectures. Also adding Dave Watson and Maged Michael because
> >> the preferred approach requires that processes wanting to use the
> >> lightweight sys_membarrier() do a registration step.
> >
> > Not to be too much of a curmudgeon, but I think that there should be a
> > real implementation of the isync membarrier before this gets merged.
> > This series purports to solve two problems, ppc barriers and x86
> > exit-without-isync, but it's very hard to evaluate whether it actually
> > solves the latter problem given the complete lack of x86 or isync code
> > in the current RFC.
> >
> > It still seems to me that you won't get any particular advantage for
> > using this registration mechanism on x86 even when you implement
> > isync. Unless I've misunderstood, the only real issue on x86 is that
> > you need a helper like arch_force_isync_before_usermode(), and that
> > helper doesn't presently exist. That means that this whole patchset
> > is standing on very dangerous ground: you'll end up with an efficient
> > implementation that works just fine without even requesting
> > registration on every architecture except ppc. That way lies
> > userspace bugs.
>
> My proposed RFC for private expedited membarrier enforces that all
> architectures perform the registration step. Using the "PRIVATE_EXPEDITED"
> command without prior process registration returns an error on all
> architectures. The goal here is to make all architectures behave in the
> same way, and it allows us to rely on process registration to deal
> with future arch-specific optimizations.
Fair enough.
That being said, on some architectures (which may well be all but
PPC), it might be nice if the registration call literally just sets a
flag in the mm saying that it happened so that the registration
enforcement can be done.
>
>
> Adding the "core_sync" behavior could then be done for the next kernel
> merge window. I'm currently foreseeing two possible ABI approaches to
> expose it:
>
> Approach 1:
>
> Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE and
> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE commands. This
> allows us to return their availability through MEMBARRIER_CMD_QUERY.
>
> Approach 2:
>
> Add a "MEMBARRIER_FLAG_SYNC_CORE" as flag parameter. It could be set
> when issuing both MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED and
> MEMBARRIER_CMD_PRIVATE_EXPEDITED, thus ensuring core serializing
> behavior. Querying whether core serialization is supported could
> be done by issuing the MEMBARRIER_CMD_QUERY command with the
> MEMBARRIER_FLAG_SYNC_CORE flag set.
>
> Any other ideas? Does one approach seem better?
It doesn't seem to make much difference to me.
----- On Sep 20, 2017, at 2:18 PM, Andy Lutomirski [email protected] wrote:
> On Wed, Sep 20, 2017 at 11:13 AM, Mathieu Desnoyers
> <[email protected]> wrote:
>>
>> ----- On Sep 20, 2017, at 12:02 PM, Andy Lutomirski [email protected] wrote:
>>
>> > On Sun, Sep 17, 2017 at 3:36 PM, Paul E. McKenney
>> > <[email protected]> wrote:
>> >> Hello!
>> >>
>> >> Rough notes from our discussion last Thursday. Please reply to the
>> >> group with any needed elaborations or corrections.
>> >>
>> >> Adding Andy and Michael on CC since this most closely affects their
>> >> architectures. Also adding Dave Watson and Maged Michael because
>> >> the preferred approach requires that processes wanting to use the
>> >> lightweight sys_membarrier() do a registration step.
>> >
>> > Not to be too much of a curmudgeon, but I think that there should be a
>> > real implementation of the isync membarrier before this gets merged.
>> > This series purports to solve two problems, ppc barriers and x86
>> > exit-without-isync, but it's very hard to evaluate whether it actually
>> > solves the latter problem given the complete lack of x86 or isync code
>> > in the current RFC.
>> >
>> > It still seems to me that you won't get any particular advantage for
>> > using this registration mechanism on x86 even when you implement
>> > isync. Unless I've misunderstood, the only real issue on x86 is that
>> > you need a helper like arch_force_isync_before_usermode(), and that
>> > helper doesn't presently exist. That means that this whole patchset
>> > is standing on very dangerous ground: you'll end up with an efficient
>> > implementation that works just fine without even requesting
>> > registration on every architecture except ppc. That way lies
>> > userspace bugs.
>>
>> My proposed RFC for private expedited membarrier enforces that all
>> architectures perform the registration step. Using the "PRIVATE_EXPEDITED"
>> command without prior process registration returns an error on all
>> architectures. The goal here is to make all architectures behave in the
>> same way, and it allows us to rely on process registration to deal
>> with future arch-specific optimizations.
>
> Fair enough.
>
> That being said, on some architectures (which may well be all but
> PPC), it might be nice if the registration call literally just sets a
> flag in the mm saying that it happened so that the registration
> enforcement can be done.
My RFC patch does exactly that. :-)
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Wed, Sep 20, 2017 at 06:13:50PM +0000, Mathieu Desnoyers wrote:
> > Also, can you elaborate on the PPC issue? PPC appears to track
> > mm_cpumask more or less just like x86. Is the issue just that this
> > tracking has no implied barriers? If so, how does TLB flush on ppc
> > work? It really does seem impressive to me that an architecture can
> > efficiently support munmap() but not an expedited private membarrier.
>
> I'll leave this question to the PPC experts :)
IIRC PPC does not keep a tight mm_cpumask: it only sets bits, it never
clears them. The atomic op required to set bits does not imply any
memory barrier on PPC.
TLB invalidation is a TLBI instruction, it sends TLBI broadcast packets
over the interconnect, it doesn't require IPIs like x86.
The only optimization PPC does is that if the mm_cpumask has only a
single bit set, it uses a TLBI instruction without broadcast, which is
cheaper.
On Wed, Sep 20, 2017 at 06:13:50PM +0000, Mathieu Desnoyers wrote:
> My proposed RFC for private expedited membarrier enforces that all
> architectures perform the registration step. Using the "PRIVATE_EXPEDITED"
> command without prior process registration returns an error on all
> architectures. The goal here is to make all architectures behave in the
> same way, and it allows us to rely on process registration to deal
> with future arch-specific optimizations.
>
> Adding the "core_sync" behavior could then be done for the next kernel
> merge window. I'm currently foreseeing two possible ABI approaches to
> expose it:
>
> Approach 1:
>
> Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE and
> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE commands. This
> allows us to return their availability through MEMBARRIER_CMD_QUERY.
>
> Approach 2:
>
> Add a "MEMBARRIER_FLAG_SYNC_CORE" as flag parameter. It could be set
> when issuing both MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED and
> MEMBARRIER_CMD_PRIVATE_EXPEDITED, thus ensuring core serializing
> behavior. Querying whether core serialization is supported could
> be done by issuing the MEMBARRIER_CMD_QUERY command with the
> MEMBARRIER_FLAG_SYNC_CORE flag set.
>
> Any other ideas? Does one approach seem better?
So we really need another FLAG for that? AFAICT the current
PRIVATE_EXPEDITED is already sufficient for the cross modifying code,
since the IPI triggers an exception return on all currently running CPUs
and the future running CPUs will have the return to userspace doing the
exception return.
The only issue is Andy fudging our x86 ret-to-userspace to not use IRET,
which we can fix by forcing it into the slowpath (that needs to exist
anyway) using that new TIF flag.
On Thu, 2017-09-21 at 15:09 +0200, Peter Zijlstra wrote:
> On Wed, Sep 20, 2017 at 06:13:50PM +0000, Mathieu Desnoyers wrote:
>
> >
> > > Also, can you elaborate on the PPC issue? PPC appears to track
> > > mm_cpumask more or less just like x86. Is the issue just that this
> > > tracking has no implied barriers? If so, how does TLB flush on ppc
> > > work? It really does seem impressive to me that an architecture can
> > > efficiently support munmap() but not an expedited private membarrier.
> >
> > I'll leave this question to the PPC experts :)
>
> IIRC PPC does not keep a tight mm_cpumask, it only sets bits, it never
> clears them. The atomic op required to set bits does not imply any
> memory barrier on PPC.
>
> TLB invalidation is a TLBI instruction, it sends TLBI broadcast
> packets over the interconnect, it doesn't require IPIs like x86.
I believe this to be true for all SMP RISC systems ... it's certainly
true for PA-RISC as well. There are so many RISC coherency issues that
the CPUs pretty much have to have a private bus to broadcast and
interlock coherency operations. We have one system that locks up if
multiple CPUs have outstanding coherency operations on the private bus,
but that's only one annoying CPU (which we manage with a special lock
inside the PA-RISC mmu code).
James
----- On Sep 21, 2017, at 9:15 AM, Peter Zijlstra [email protected] wrote:
> On Wed, Sep 20, 2017 at 06:13:50PM +0000, Mathieu Desnoyers wrote:
>> My proposed RFC for private expedited membarrier enforces that all
>> architectures perform the registration step. Using the "PRIVATE_EXPEDITED"
>> command without prior process registration returns an error on all
>> architectures. The goal here is to make all architectures behave in the
>> same way, and it allows us to rely on process registration to deal
>> with future arch-specific optimizations.
>>
>> Adding the "core_sync" behavior could then be done for the next kernel
>> merge window. I'm currently foreseeing two possible ABI approaches to
>> expose it:
>>
>> Approach 1:
>>
>> Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE and
>> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE commands. This
>> allows us to return their availability through MEMBARRIER_CMD_QUERY.
>>
>> Approach 2:
>>
>> Add a "MEMBARRIER_FLAG_SYNC_CORE" as flag parameter. It could be set
>> when issuing both MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED and
>> MEMBARRIER_CMD_PRIVATE_EXPEDITED, thus ensuring core serializing
>> behavior. Querying whether core serialization is supported could
>> be done by issuing the MEMBARRIER_CMD_QUERY command with the
>> MEMBARRIER_FLAG_SYNC_CORE flag set.
>>
>> Any other ideas? Does one approach seem better?
>
> So we really need another FLAG for that? AFAICT the current
> PRIVATE_EXPEDITED is already sufficient for the cross modifying code,
> since the IPI triggers an exception return on all currently running CPUs
> and the future running CPUs will have the return to userspace doing the
> exception return.
>
> The only issue is Andy fudging our x86 ret-to-userspace to not use IRET,
> which we can fix by forcing it into the slowpath (that needs to exist
> anyway) using that new TIF flag.
I agree that x86, as it stands today, would provide core serialization
with the private expedited membarrier command. And we can deal with
future optimization of ret-to-userspace using the TIF flag set on
registration.
I'm wondering whether all architectures guarantee core serialization
on return from the interrupt triggered by the IPI, and on ret-to-userspace?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Peter Zijlstra <[email protected]> writes:
> On Wed, Sep 20, 2017 at 06:13:50PM +0000, Mathieu Desnoyers wrote:
>
>> > Also, can you elaborate on the PPC issue? PPC appears to track
>> > mm_cpumask more or less just like x86. Is the issue just that this
>> > tracking has no implied barriers? If so, how does TLB flush on ppc
>> > work? It really does seem impressive to me that an architecture can
>> > efficiently support munmap() but not an expedited private membarrier.
>>
>> I'll leave this question to the PPC experts :)
>
> IIRC PPC does not keep a tight mm_cpumask, it only sets bits, it never
> clears them. The atomic op required to set bits does not imply any
> memory barrier on PPC.
Yep.
We do have a full barrier now when we set a bit, but not if the bit was
already set.
> TLB invalidation is a TLBI instruction, it sends TLBI broadcast packets
> over the interconnect, it doesn't require IPIs like x86.
Yep.
> The only optimization PPC does is that if the mm_cpumask has only a
> single bit set, it uses a TLBI instruction without broadcast, which is
> cheaper.
Yep.
We would like to trim the mm_cpumask, but it's one of those hairy
optimisations we have never quite found time to do.
I've been away for two weeks so I've not been able to keep up with the
membarrier discussions. Will try and page it back in next week.
cheers
On Thu, Sep 21, 2017 at 10:23:39AM -0700, James Bottomley wrote:
> We have one system that locks up if
> multiple CPUs have outstanding coherency operations on the private bus,
> but that's only one annoying CPU (which we manage with a special lock
> inside the PA-RISC mmu code).
Hehe, yeah. I saw that code recently and had a WTF moment ;-)