LinuxLists.cc - Regression in SYS

2019-02-17 19:01:19

Subject: Regression in SYS_membarrier expedited

commit a961e40917fb14614d368d8bc9782ca4d6a8cd11 made it so that the
MEMBARRIER_CMD_PRIVATE_EXPEDITED command cannot be used without first
registering intent to use it. However, registration is an expensive
operation since commit 3ccfebedd8cf54e291c809c838d8ad5cc00f5688, which
added synchronize_sched() to it; this means it's no longer possible to
lazily register intent at first use, and it's unreasonably expensive
to preemptively register intent for possibly extremely-short-lived
processes that will never use it. (My usage case is in libc (musl),
where I can't know if the process will be short- or long-lived;
unnecessary and potentially expensive syscalls can't be made
preemptively, only lazily at first use.)

Can we restore the functionality of MEMBARRIER_CMD_PRIVATE_EXPEDITED
to work even without registration? The motivation of requiring
registration seems to be:

"Registering at this time removes the need to interrupt each and
every thread in that process at the first expedited
sys_membarrier() system call."

but interrupting every thread in the process is exactly what I expect,
and is not a problem. What does seem like a big problem is waiting for
synchronize_sched() to synchronize with an unboundedly large number of
cores (vs only a few threads in the process), especially in the
presence of full_nohz, where it seems like latency would be at least a
few ms and possibly unbounded.

Short of a working SYS_membarrier that doesn't require expensive
pre-registration, I'm stuck just implementing it in userspace with
signals...

Rich

2019-02-17 21:35:52

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: Regression in SYS_membarrier expedited

----- On Feb 17, 2019, at 1:48 PM, Rich Felker [email protected] wrote:

> commit a961e40917fb14614d368d8bc9782ca4d6a8cd11 made it so that the
> MEMBARRIER_CMD_PRIVATE_EXPEDITED command cannot be used without first
> registering intent to use it. However, registration is an expensive
> operation since commit 3ccfebedd8cf54e291c809c838d8ad5cc00f5688, which
> added synchronize_sched() to it; this means it's no longer possible to
> lazily register intent at first use, and it's unreasonably expensive
> to preemptively register intent for possibly extremely-short-lived
> processes that will never use it. (My usage case is in libc (musl),
> where I can't know if the process will be short- or long-lived;
> unnecessary and potentially expensive syscalls can't be made
> preemptively, only lazily at first use.)
>
> Can we restore the functionality of MEMBARRIER_CMD_PRIVATE_EXPEDITED
> to work even without registration? The motivation of requiring
> registration seems to be:
>
> "Registering at this time removes the need to interrupt each and
> every thread in that process at the first expedited
> sys_membarrier() system call."
>
> but interrupting every thread in the process is exactly what I expect,
> and is not a problem. What does seem like a big problem is waiting for
> synchronize_sched() to synchronize with an unboundedly large number of
> cores (vs only a few threads in the process), especially in the
> presence of full_nohz, where it seems like latency would be at least a
> few ms and possibly unbounded.
>
> Short of a working SYS_membarrier that doesn't require expensive
> pre-registration, I'm stuck just implementing it in userspace with
> signals...

Hi Rich,

Let me try to understand the scenario first.

musl libc support for using membarrier private expedited
would require to first register membarrier private expedited for
the process at musl library init (typically after exec). At that stage, the
process is still single-threaded, right ? So there is no reason
to issue a synchronize_sched() (or now synchronize_rcu() in newer
kernels):

membarrier_register_private_expedited()

if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
/*
* Ensure all future scheduler executions will observe the
* new thread flag state for this process.
*/
synchronize_rcu();
}

So considering that pre-registration carefully done before the process
becomes multi-threaded just costs a system call (and not a synchronize_sched()),
does it make the pre-registration approach more acceptable ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-02-17 21:53:10

by Rich Felker

[permalink] [raw]

Subject: Re: Regression in SYS_membarrier expedited

On Sun, Feb 17, 2019 at 04:34:45PM -0500, Mathieu Desnoyers wrote:
> ----- On Feb 17, 2019, at 1:48 PM, Rich Felker [email protected] wrote:
>
> > commit a961e40917fb14614d368d8bc9782ca4d6a8cd11 made it so that the
> > MEMBARRIER_CMD_PRIVATE_EXPEDITED command cannot be used without first
> > registering intent to use it. However, registration is an expensive
> > operation since commit 3ccfebedd8cf54e291c809c838d8ad5cc00f5688, which
> > added synchronize_sched() to it; this means it's no longer possible to
> > lazily register intent at first use, and it's unreasonably expensive
> > to preemptively register intent for possibly extremely-short-lived
> > processes that will never use it. (My usage case is in libc (musl),
> > where I can't know if the process will be short- or long-lived;
> > unnecessary and potentially expensive syscalls can't be made
> > preemptively, only lazily at first use.)
> >
> > Can we restore the functionality of MEMBARRIER_CMD_PRIVATE_EXPEDITED
> > to work even without registration? The motivation of requiring
> > registration seems to be:
> >
> > "Registering at this time removes the need to interrupt each and
> > every thread in that process at the first expedited
> > sys_membarrier() system call."
> >
> > but interrupting every thread in the process is exactly what I expect,
> > and is not a problem. What does seem like a big problem is waiting for
> > synchronize_sched() to synchronize with an unboundedly large number of
> > cores (vs only a few threads in the process), especially in the
> > presence of full_nohz, where it seems like latency would be at least a
> > few ms and possibly unbounded.
> >
> > Short of a working SYS_membarrier that doesn't require expensive
> > pre-registration, I'm stuck just implementing it in userspace with
> > signals...
>
> Hi Rich,
>
> Let me try to understand the scenario first.
>
> musl libc support for using membarrier private expedited
> would require to first register membarrier private expedited for
> the process at musl library init (typically after exec). At that stage, the
> process is still single-threaded, right ? So there is no reason
> to issue a synchronize_sched() (or now synchronize_rcu() in newer
> kernels):
>
> membarrier_register_private_expedited()
>
> if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
> /*
> * Ensure all future scheduler executions will observe the
> * new thread flag state for this process.
> */
> synchronize_rcu();
> }
>
> So considering that pre-registration carefully done before the process
> becomes multi-threaded just costs a system call (and not a synchronize_sched()),
> does it make the pre-registration approach more acceptable ?

It does get rid of the extreme cost, but I don't think it would be
well-received by users who don't like random unnecessary syscalls at
init time (each adding a few us of startup time cost). If it's so
cheap, why isn't it just the default at kernel-side process creation?
Why is there any requirement of registration to begin with? Reading
the code, it looks like all it does is set a flag, and all this flag
is used for is erroring-out if it's not set.

Rich

2019-02-17 22:09:02

by Rich Felker

[permalink] [raw]

Subject: Re: Regression in SYS_membarrier expedited

On Sun, Feb 17, 2019 at 04:52:35PM -0500, Rich Felker wrote:
> On Sun, Feb 17, 2019 at 04:34:45PM -0500, Mathieu Desnoyers wrote:
> > ----- On Feb 17, 2019, at 1:48 PM, Rich Felker [email protected] wrote:
> >
> > > commit a961e40917fb14614d368d8bc9782ca4d6a8cd11 made it so that the
> > > MEMBARRIER_CMD_PRIVATE_EXPEDITED command cannot be used without first
> > > registering intent to use it. However, registration is an expensive
> > > operation since commit 3ccfebedd8cf54e291c809c838d8ad5cc00f5688, which
> > > added synchronize_sched() to it; this means it's no longer possible to
> > > lazily register intent at first use, and it's unreasonably expensive
> > > to preemptively register intent for possibly extremely-short-lived
> > > processes that will never use it. (My usage case is in libc (musl),
> > > where I can't know if the process will be short- or long-lived;
> > > unnecessary and potentially expensive syscalls can't be made
> > > preemptively, only lazily at first use.)
> > >
> > > Can we restore the functionality of MEMBARRIER_CMD_PRIVATE_EXPEDITED
> > > to work even without registration? The motivation of requiring
> > > registration seems to be:
> > >
> > > "Registering at this time removes the need to interrupt each and
> > > every thread in that process at the first expedited
> > > sys_membarrier() system call."
> > >
> > > but interrupting every thread in the process is exactly what I expect,
> > > and is not a problem. What does seem like a big problem is waiting for
> > > synchronize_sched() to synchronize with an unboundedly large number of
> > > cores (vs only a few threads in the process), especially in the
> > > presence of full_nohz, where it seems like latency would be at least a
> > > few ms and possibly unbounded.
> > >
> > > Short of a working SYS_membarrier that doesn't require expensive
> > > pre-registration, I'm stuck just implementing it in userspace with
> > > signals...
> >
> > Hi Rich,
> >
> > Let me try to understand the scenario first.
> >
> > musl libc support for using membarrier private expedited
> > would require to first register membarrier private expedited for
> > the process at musl library init (typically after exec). At that stage, the
> > process is still single-threaded, right ? So there is no reason
> > to issue a synchronize_sched() (or now synchronize_rcu() in newer
> > kernels):
> >
> > membarrier_register_private_expedited()
> >
> > if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
> > /*
> > * Ensure all future scheduler executions will observe the
> > * new thread flag state for this process.
> > */
> > synchronize_rcu();
> > }
> >
> > So considering that pre-registration carefully done before the process
> > becomes multi-threaded just costs a system call (and not a synchronize_sched()),
> > does it make the pre-registration approach more acceptable ?
>
> It does get rid of the extreme cost, but I don't think it would be
> well-received by users who don't like random unnecessary syscalls at
> init time (each adding a few us of startup time cost). If it's so
> cheap, why isn't it just the default at kernel-side process creation?
> Why is there any requirement of registration to begin with? Reading
> the code, it looks like all it does is set a flag, and all this flag
> is used for is erroring-out if it's not set.

On further thought, pre-registration could be done at first
pthread_create rather than process entry, which would probably be
acceptable. But the question remains why it's needed at all, and
neither of these approaches is available to code that doesn't have the
privilege of being part of libc. For example, library code that might
be loaded via dlopen can't safely use SYS_membarrier without
introducing unbounded latency before the first use.

Rich

2019-02-18 18:42:25

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: Regression in SYS_membarrier expedited

----- On Feb 17, 2019, at 5:08 PM, Rich Felker [email protected] wrote:

> On Sun, Feb 17, 2019 at 04:52:35PM -0500, Rich Felker wrote:
>> On Sun, Feb 17, 2019 at 04:34:45PM -0500, Mathieu Desnoyers wrote:
>> > ----- On Feb 17, 2019, at 1:48 PM, Rich Felker [email protected] wrote:
>> >
>> > > commit a961e40917fb14614d368d8bc9782ca4d6a8cd11 made it so that the
>> > > MEMBARRIER_CMD_PRIVATE_EXPEDITED command cannot be used without first
>> > > registering intent to use it. However, registration is an expensive
>> > > operation since commit 3ccfebedd8cf54e291c809c838d8ad5cc00f5688, which
>> > > added synchronize_sched() to it; this means it's no longer possible to
>> > > lazily register intent at first use, and it's unreasonably expensive
>> > > to preemptively register intent for possibly extremely-short-lived
>> > > processes that will never use it. (My usage case is in libc (musl),
>> > > where I can't know if the process will be short- or long-lived;
>> > > unnecessary and potentially expensive syscalls can't be made
>> > > preemptively, only lazily at first use.)
>> > >
>> > > Can we restore the functionality of MEMBARRIER_CMD_PRIVATE_EXPEDITED
>> > > to work even without registration? The motivation of requiring
>> > > registration seems to be:
>> > >
>> > > "Registering at this time removes the need to interrupt each and
>> > > every thread in that process at the first expedited
>> > > sys_membarrier() system call."
>> > >
>> > > but interrupting every thread in the process is exactly what I expect,
>> > > and is not a problem. What does seem like a big problem is waiting for
>> > > synchronize_sched() to synchronize with an unboundedly large number of
>> > > cores (vs only a few threads in the process), especially in the
>> > > presence of full_nohz, where it seems like latency would be at least a
>> > > few ms and possibly unbounded.
>> > >
>> > > Short of a working SYS_membarrier that doesn't require expensive
>> > > pre-registration, I'm stuck just implementing it in userspace with
>> > > signals...
>> >
>> > Hi Rich,
>> >
>> > Let me try to understand the scenario first.
>> >
>> > musl libc support for using membarrier private expedited
>> > would require to first register membarrier private expedited for
>> > the process at musl library init (typically after exec). At that stage, the
>> > process is still single-threaded, right ? So there is no reason
>> > to issue a synchronize_sched() (or now synchronize_rcu() in newer
>> > kernels):
>> >
>> > membarrier_register_private_expedited()
>> >
>> > if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
>> > /*
>> > * Ensure all future scheduler executions will observe the
>> > * new thread flag state for this process.
>> > */
>> > synchronize_rcu();
>> > }
>> >
>> > So considering that pre-registration carefully done before the process
>> > becomes multi-threaded just costs a system call (and not a synchronize_sched()),
>> > does it make the pre-registration approach more acceptable ?
>>
>> It does get rid of the extreme cost, but I don't think it would be
>> well-received by users who don't like random unnecessary syscalls at
>> init time (each adding a few us of startup time cost). If it's so
>> cheap, why isn't it just the default at kernel-side process creation?
>> Why is there any requirement of registration to begin with? Reading
>> the code, it looks like all it does is set a flag, and all this flag
>> is used for is erroring-out if it's not set.
>
> On further thought, pre-registration could be done at first
> pthread_create rather than process entry, which would probably be
> acceptable. But the question remains why it's needed at all, and
> neither of these approaches is available to code that doesn't have the
> privilege of being part of libc. For example, library code that might
> be loaded via dlopen can't safely use SYS_membarrier without
> introducing unbounded latency before the first use.

For membarrier private expedited, the need for pre-registration is currently
there because of powerpc not wanting to slow down switch_mm() for processes
not needing that command.

That's the only reason I see for it. If we would have accepted to add
a smp_mb() to the powerpc switch_mm() scheduler path, we could have done
so without registration for the private expedited membarrier command.

commit a961e40917fb hints at the sync_core private expedited membarrier
commands (which was being actively designed at that time) which may
require pre-registration. However, things did not turn out that way: we
ended up adding the required core serializing barriers unconditionally
into each architecture.

Considering that sync_core private expedited membarrier ended up not needing
pre-registration, I think this pre-registration optimization may have been
somewhat counter-productive, since I doubt the overhead of smp_mb() in a
switch_mm() for powerpc is that high, but I don't have the hardware handy
to provide numbers. So we end up slowing down everyone by requiring a
registration system call after exec. :-(

One possible way out of this would be to make MEMBARRIER_CMD_PRIVATE_EXPEDITED
and MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE work fine without pre-registration
in future kernels. Therefore, the application could try using them without
registration. If it works, all is fine, else it would treat the error how it
sees fit, either through explicit registration and trying again, or returning
the error to the caller.

The only change I see we would require to make this work is to turn
arch/powerpc/include/asm/membarrier.h membarrier_arch_switch_mm() into
an unconditional smp_mb().

Thoughts ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-02-19 02:38:14

by Rich Felker

[permalink] [raw]

Subject: Re: Regression in SYS_membarrier expedited

On Mon, Feb 18, 2019 at 10:22:32AM -0500, Mathieu Desnoyers wrote:
> ----- On Feb 17, 2019, at 5:08 PM, Rich Felker [email protected] wrote:
>
> > On Sun, Feb 17, 2019 at 04:52:35PM -0500, Rich Felker wrote:
> >> On Sun, Feb 17, 2019 at 04:34:45PM -0500, Mathieu Desnoyers wrote:
> >> > ----- On Feb 17, 2019, at 1:48 PM, Rich Felker [email protected] wrote:
> >> >
> >> > > commit a961e40917fb14614d368d8bc9782ca4d6a8cd11 made it so that the
> >> > > MEMBARRIER_CMD_PRIVATE_EXPEDITED command cannot be used without first
> >> > > registering intent to use it. However, registration is an expensive
> >> > > operation since commit 3ccfebedd8cf54e291c809c838d8ad5cc00f5688, which
> >> > > added synchronize_sched() to it; this means it's no longer possible to
> >> > > lazily register intent at first use, and it's unreasonably expensive
> >> > > to preemptively register intent for possibly extremely-short-lived
> >> > > processes that will never use it. (My usage case is in libc (musl),
> >> > > where I can't know if the process will be short- or long-lived;
> >> > > unnecessary and potentially expensive syscalls can't be made
> >> > > preemptively, only lazily at first use.)
> >> > >
> >> > > Can we restore the functionality of MEMBARRIER_CMD_PRIVATE_EXPEDITED
> >> > > to work even without registration? The motivation of requiring
> >> > > registration seems to be:
> >> > >
> >> > > "Registering at this time removes the need to interrupt each and
> >> > > every thread in that process at the first expedited
> >> > > sys_membarrier() system call."
> >> > >
> >> > > but interrupting every thread in the process is exactly what I expect,
> >> > > and is not a problem. What does seem like a big problem is waiting for
> >> > > synchronize_sched() to synchronize with an unboundedly large number of
> >> > > cores (vs only a few threads in the process), especially in the
> >> > > presence of full_nohz, where it seems like latency would be at least a
> >> > > few ms and possibly unbounded.
> >> > >
> >> > > Short of a working SYS_membarrier that doesn't require expensive
> >> > > pre-registration, I'm stuck just implementing it in userspace with
> >> > > signals...
> >> >
> >> > Hi Rich,
> >> >
> >> > Let me try to understand the scenario first.
> >> >
> >> > musl libc support for using membarrier private expedited
> >> > would require to first register membarrier private expedited for
> >> > the process at musl library init (typically after exec). At that stage, the
> >> > process is still single-threaded, right ? So there is no reason
> >> > to issue a synchronize_sched() (or now synchronize_rcu() in newer
> >> > kernels):
> >> >
> >> > membarrier_register_private_expedited()
> >> >
> >> > if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
> >> > /*
> >> > * Ensure all future scheduler executions will observe the
> >> > * new thread flag state for this process.
> >> > */
> >> > synchronize_rcu();
> >> > }
> >> >
> >> > So considering that pre-registration carefully done before the process
> >> > becomes multi-threaded just costs a system call (and not a synchronize_sched()),
> >> > does it make the pre-registration approach more acceptable ?
> >>
> >> It does get rid of the extreme cost, but I don't think it would be
> >> well-received by users who don't like random unnecessary syscalls at
> >> init time (each adding a few us of startup time cost). If it's so
> >> cheap, why isn't it just the default at kernel-side process creation?
> >> Why is there any requirement of registration to begin with? Reading
> >> the code, it looks like all it does is set a flag, and all this flag
> >> is used for is erroring-out if it's not set.
> >
> > On further thought, pre-registration could be done at first
> > pthread_create rather than process entry, which would probably be
> > acceptable. But the question remains why it's needed at all, and
> > neither of these approaches is available to code that doesn't have the
> > privilege of being part of libc. For example, library code that might
> > be loaded via dlopen can't safely use SYS_membarrier without
> > introducing unbounded latency before the first use.
>
> For membarrier private expedited, the need for pre-registration is currently
> there because of powerpc not wanting to slow down switch_mm() for processes
> not needing that command.
>
> That's the only reason I see for it. If we would have accepted to add
> a smp_mb() to the powerpc switch_mm() scheduler path, we could have done
> so without registration for the private expedited membarrier command.

I don't understand why the barrier is needed at all; the ipi ping
should suffice to execute a barrier instruction on all cores on which
a thread of the process is running, and if any other core subsequently
picks up a thread of the process to run, it must necessarily perform a
barrier just to synchronize with whatever core the thread was
previously running on (not to mention synchronizing the handoff
itself). Is this just to optimize out ipi pinging cores that threads
of the process are not currently running on, but were last running on
and could become running on again without migration?

> commit a961e40917fb hints at the sync_core private expedited membarrier
> commands (which was being actively designed at that time) which may
> require pre-registration. However, things did not turn out that way: we
> ended up adding the required core serializing barriers unconditionally
> into each architecture.
>
> Considering that sync_core private expedited membarrier ended up not needing
> pre-registration, I think this pre-registration optimization may have been
> somewhat counter-productive, since I doubt the overhead of smp_mb() in a
> switch_mm() for powerpc is that high, but I don't have the hardware handy
> to provide numbers. So we end up slowing down everyone by requiring a
> registration system call after exec. :-(

I'm surprised it's even possible to do switch_mm correctly with no
barrier...

> One possible way out of this would be to make MEMBARRIER_CMD_PRIVATE_EXPEDITED
> and MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE work fine without pre-registration
> in future kernels. Therefore, the application could try using them without
> registration. If it works, all is fine, else it would treat the error how it
> sees fit, either through explicit registration and trying again, or returning
> the error to the caller.

This would be nice. I think the register-at-first-pthread_create works
fine for my case now, but as noted it doesn't work when you can't
control libc internals.

> The only change I see we would require to make this work is to turn
> arch/powerpc/include/asm/membarrier.h membarrier_arch_switch_mm() into
> an unconditional smp_mb().

I finally found this just now; before I was mistakenly grepping for
MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY rather than
MEMBARRIER_STATE_PRIVATE_EXPEDITED, and couldn't find anywhere it was
actually used.

> Thoughts ?

I'm all for making it work without pre-registration. I don't care
whether this is via heavier barriers in ppc or via pinging more cores,
or some other mechanism, but maybe others have opinions on this.

One alternative might be auto-registration, ipi pinging all cores the
process has threads on (even not currently running) for the first
membarrier, so that they see the membarrier_state updated without
having to synchronize with the rcu/scheduler, then doing the same as
now after the first call. I'm not sure if this creates opportunities
for abuse; probably no worse than the process could do just by
actually running lots of threads.

Rich

2019-02-19 16:03:44

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: Regression in SYS_membarrier expedited

----- On Feb 18, 2019, at 4:55 PM, Rich Felker [email protected] wrote:

> On Mon, Feb 18, 2019 at 10:22:32AM -0500, Mathieu Desnoyers wrote:
>> ----- On Feb 17, 2019, at 5:08 PM, Rich Felker [email protected] wrote:
>>
>> > On Sun, Feb 17, 2019 at 04:52:35PM -0500, Rich Felker wrote:
>> >> On Sun, Feb 17, 2019 at 04:34:45PM -0500, Mathieu Desnoyers wrote:
>> >> > ----- On Feb 17, 2019, at 1:48 PM, Rich Felker [email protected] wrote:
>> >> >
>> >> > > commit a961e40917fb14614d368d8bc9782ca4d6a8cd11 made it so that the
>> >> > > MEMBARRIER_CMD_PRIVATE_EXPEDITED command cannot be used without first
>> >> > > registering intent to use it. However, registration is an expensive
>> >> > > operation since commit 3ccfebedd8cf54e291c809c838d8ad5cc00f5688, which
>> >> > > added synchronize_sched() to it; this means it's no longer possible to
>> >> > > lazily register intent at first use, and it's unreasonably expensive
>> >> > > to preemptively register intent for possibly extremely-short-lived
>> >> > > processes that will never use it. (My usage case is in libc (musl),
>> >> > > where I can't know if the process will be short- or long-lived;
>> >> > > unnecessary and potentially expensive syscalls can't be made
>> >> > > preemptively, only lazily at first use.)
>> >> > >
>> >> > > Can we restore the functionality of MEMBARRIER_CMD_PRIVATE_EXPEDITED
>> >> > > to work even without registration? The motivation of requiring
>> >> > > registration seems to be:
>> >> > >
>> >> > > "Registering at this time removes the need to interrupt each and
>> >> > > every thread in that process at the first expedited
>> >> > > sys_membarrier() system call."
>> >> > >
>> >> > > but interrupting every thread in the process is exactly what I expect,
>> >> > > and is not a problem. What does seem like a big problem is waiting for
>> >> > > synchronize_sched() to synchronize with an unboundedly large number of
>> >> > > cores (vs only a few threads in the process), especially in the
>> >> > > presence of full_nohz, where it seems like latency would be at least a
>> >> > > few ms and possibly unbounded.
>> >> > >
>> >> > > Short of a working SYS_membarrier that doesn't require expensive
>> >> > > pre-registration, I'm stuck just implementing it in userspace with
>> >> > > signals...
>> >> >
>> >> > Hi Rich,
>> >> >
>> >> > Let me try to understand the scenario first.
>> >> >
>> >> > musl libc support for using membarrier private expedited
>> >> > would require to first register membarrier private expedited for
>> >> > the process at musl library init (typically after exec). At that stage, the
>> >> > process is still single-threaded, right ? So there is no reason
>> >> > to issue a synchronize_sched() (or now synchronize_rcu() in newer
>> >> > kernels):
>> >> >
>> >> > membarrier_register_private_expedited()
>> >> >
>> >> > if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
>> >> > /*
>> >> > * Ensure all future scheduler executions will observe the
>> >> > * new thread flag state for this process.
>> >> > */
>> >> > synchronize_rcu();
>> >> > }
>> >> >
>> >> > So considering that pre-registration carefully done before the process
>> >> > becomes multi-threaded just costs a system call (and not a synchronize_sched()),
>> >> > does it make the pre-registration approach more acceptable ?
>> >>
>> >> It does get rid of the extreme cost, but I don't think it would be
>> >> well-received by users who don't like random unnecessary syscalls at
>> >> init time (each adding a few us of startup time cost). If it's so
>> >> cheap, why isn't it just the default at kernel-side process creation?
>> >> Why is there any requirement of registration to begin with? Reading
>> >> the code, it looks like all it does is set a flag, and all this flag
>> >> is used for is erroring-out if it's not set.
>> >
>> > On further thought, pre-registration could be done at first
>> > pthread_create rather than process entry, which would probably be
>> > acceptable. But the question remains why it's needed at all, and
>> > neither of these approaches is available to code that doesn't have the
>> > privilege of being part of libc. For example, library code that might
>> > be loaded via dlopen can't safely use SYS_membarrier without
>> > introducing unbounded latency before the first use.
>>
>> For membarrier private expedited, the need for pre-registration is currently
>> there because of powerpc not wanting to slow down switch_mm() for processes
>> not needing that command.
>>
>> That's the only reason I see for it. If we would have accepted to add
>> a smp_mb() to the powerpc switch_mm() scheduler path, we could have done
>> so without registration for the private expedited membarrier command.
>
> I don't understand why the barrier is needed at all; the ipi ping
> should suffice to execute a barrier instruction on all cores on which
> a thread of the process is running, and if any other core subsequently
> picks up a thread of the process to run, it must necessarily perform a
> barrier just to synchronize with whatever core the thread was
> previously running on (not to mention synchronizing the handoff
> itself). Is this just to optimize out ipi pinging cores that threads
> of the process are not currently running on, but were last running on
> and could become running on again without migration?

See this comment in context_switch():

* If mm is non-NULL, we pass through switch_mm(). If mm is
* NULL, we will pass through mmdrop() in finish_task_switch().
* Both of these contain the full memory barrier required by
* membarrier after storing to rq->curr, before returning to
* user-space.

So the full memory barrier we are discussing here in switch_mm() orders
the store to rq->curr before following memory accesses (including those
performed by user-space).

This pairs with the first smp_mb() in membarrier_private_expedited(), which
orders memory accesses before the membarrier system call before the following
loads of each cpu_rq(cpu)->curr.

This guarantees that if we happen to skip the IPI for a given CPU that
is just about to schedule in a thread belonging to our process (say, just before
storing rq->curr in the scheduler), the memory accesses performed by that thread
after it starts running are ordered after the memory accesses performed prior
to membarrier.

As we are not grabbing the runqueue locks for each CPU in membarrier because
it would be too costly, we need to ensure the proper barriers are there within
the scheduler. We cannot just rely on the memory barrier present at the very
start of the scheduler code to order with respect to memory accesses happening
in the newly scheduled-in thread _after_ scheduler execution. This is why we
need to ensure the proper barriers are present both before and after the
scheduler stores to rq->curr.

>
>> commit a961e40917fb hints at the sync_core private expedited membarrier
>> commands (which was being actively designed at that time) which may
>> require pre-registration. However, things did not turn out that way: we
>> ended up adding the required core serializing barriers unconditionally
>> into each architecture.
>>
>> Considering that sync_core private expedited membarrier ended up not needing
>> pre-registration, I think this pre-registration optimization may have been
>> somewhat counter-productive, since I doubt the overhead of smp_mb() in a
>> switch_mm() for powerpc is that high, but I don't have the hardware handy
>> to provide numbers. So we end up slowing down everyone by requiring a
>> registration system call after exec. :-(
>
> I'm surprised it's even possible to do switch_mm correctly with no
> barrier...

There is a barrier at the beginning of the scheduler code, which is typically
enough for all use-cases that rely on the runqueue lock to synchronize with the
scheduler. Membarrier does not use rq lock for performance considerations, so
we end up having to make sure the proper barriers are in place both before/after
store to rq->curr. Those pair with the 2 barriers at the end/beginning of the
membarrier system call (e.g. in membarrier_private_expedited())

>
>> One possible way out of this would be to make MEMBARRIER_CMD_PRIVATE_EXPEDITED
>> and MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE work fine without
>> pre-registration
>> in future kernels. Therefore, the application could try using them without
>> registration. If it works, all is fine, else it would treat the error how it
>> sees fit, either through explicit registration and trying again, or returning
>> the error to the caller.
>
> This would be nice. I think the register-at-first-pthread_create works
> fine for my case now, but as noted it doesn't work when you can't
> control libc internals.

It seems to be a transient problem then: once all libc start supporting this,
there will be fewer and fewer "early adopter" libraries that need to register
while multithreaded. Therefore, I wonder if it's really worthwhile to introduce
a change at this point, considering that it will take a while before newer
kernels gets rolled out, and people will have to support legacy behavior anyway.

>
>> The only change I see we would require to make this work is to turn
>> arch/powerpc/include/asm/membarrier.h membarrier_arch_switch_mm() into
>> an unconditional smp_mb().
>
> I finally found this just now; before I was mistakenly grepping for
> MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY rather than
> MEMBARRIER_STATE_PRIVATE_EXPEDITED, and couldn't find anywhere it was
> actually used.
>
>> Thoughts ?
>
> I'm all for making it work without pre-registration. I don't care
> whether this is via heavier barriers in ppc or via pinging more cores,
> or some other mechanism, but maybe others have opinions on this.

If from a libc perspective the current pre-registration scheme is acceptable,
I don't see a strong point for introducing a behavior change to an already
exposed system call (not requiring pre-registration). It may introduce user
confusion, and would require documenting per-kernel-version-ranges system call
error behavior, which seems to add an unwanted amount of complexity for
users of the system call.

>
> One alternative might be auto-registration, ipi pinging all cores the
> process has threads on (even not currently running) for the first
> membarrier, so that they see the membarrier_state updated without
> having to synchronize with the rcu/scheduler, then doing the same as
> now after the first call. I'm not sure if this creates opportunities
> for abuse; probably no worse than the process could do just by
> actually running lots of threads.

We try really hard not to IPI a core that happens to be running
a realtime thread at high priority. I think this scheme would not
meet this requirement.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2019-02-19 16:37:19

by Rich Felker

[permalink] [raw]

Subject: Re: Regression in SYS_membarrier expedited

On Tue, Feb 19, 2019 at 11:02:41AM -0500, Mathieu Desnoyers wrote:
> ----- On Feb 18, 2019, at 4:55 PM, Rich Felker [email protected] wrote:
>
> > On Mon, Feb 18, 2019 at 10:22:32AM -0500, Mathieu Desnoyers wrote:
> >> ----- On Feb 17, 2019, at 5:08 PM, Rich Felker [email protected] wrote:
> >>
> >> > On Sun, Feb 17, 2019 at 04:52:35PM -0500, Rich Felker wrote:
> >> >> On Sun, Feb 17, 2019 at 04:34:45PM -0500, Mathieu Desnoyers wrote:
> >> >> > ----- On Feb 17, 2019, at 1:48 PM, Rich Felker [email protected] wrote:
> >> >> >
> >> >> > > commit a961e40917fb14614d368d8bc9782ca4d6a8cd11 made it so that the
> >> >> > > MEMBARRIER_CMD_PRIVATE_EXPEDITED command cannot be used without first
> >> >> > > registering intent to use it. However, registration is an expensive
> >> >> > > operation since commit 3ccfebedd8cf54e291c809c838d8ad5cc00f5688, which
> >> >> > > added synchronize_sched() to it; this means it's no longer possible to
> >> >> > > lazily register intent at first use, and it's unreasonably expensive
> >> >> > > to preemptively register intent for possibly extremely-short-lived
> >> >> > > processes that will never use it. (My usage case is in libc (musl),
> >> >> > > where I can't know if the process will be short- or long-lived;
> >> >> > > unnecessary and potentially expensive syscalls can't be made
> >> >> > > preemptively, only lazily at first use.)
> >> >> > >
> >> >> > > Can we restore the functionality of MEMBARRIER_CMD_PRIVATE_EXPEDITED
> >> >> > > to work even without registration? The motivation of requiring
> >> >> > > registration seems to be:
> >> >> > >
> >> >> > > "Registering at this time removes the need to interrupt each and
> >> >> > > every thread in that process at the first expedited
> >> >> > > sys_membarrier() system call."
> >> >> > >
> >> >> > > but interrupting every thread in the process is exactly what I expect,
> >> >> > > and is not a problem. What does seem like a big problem is waiting for
> >> >> > > synchronize_sched() to synchronize with an unboundedly large number of
> >> >> > > cores (vs only a few threads in the process), especially in the
> >> >> > > presence of full_nohz, where it seems like latency would be at least a
> >> >> > > few ms and possibly unbounded.
> >> >> > >
> >> >> > > Short of a working SYS_membarrier that doesn't require expensive
> >> >> > > pre-registration, I'm stuck just implementing it in userspace with
> >> >> > > signals...
> >> >> >
> >> >> > Hi Rich,
> >> >> >
> >> >> > Let me try to understand the scenario first.
> >> >> >
> >> >> > musl libc support for using membarrier private expedited
> >> >> > would require to first register membarrier private expedited for
> >> >> > the process at musl library init (typically after exec). At that stage, the
> >> >> > process is still single-threaded, right ? So there is no reason
> >> >> > to issue a synchronize_sched() (or now synchronize_rcu() in newer
> >> >> > kernels):
> >> >> >
> >> >> > membarrier_register_private_expedited()
> >> >> >
> >> >> > if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
> >> >> > /*
> >> >> > * Ensure all future scheduler executions will observe the
> >> >> > * new thread flag state for this process.
> >> >> > */
> >> >> > synchronize_rcu();
> >> >> > }
> >> >> >
> >> >> > So considering that pre-registration carefully done before the process
> >> >> > becomes multi-threaded just costs a system call (and not a synchronize_sched()),
> >> >> > does it make the pre-registration approach more acceptable ?
> >> >>
> >> >> It does get rid of the extreme cost, but I don't think it would be
> >> >> well-received by users who don't like random unnecessary syscalls at
> >> >> init time (each adding a few us of startup time cost). If it's so
> >> >> cheap, why isn't it just the default at kernel-side process creation?
> >> >> Why is there any requirement of registration to begin with? Reading
> >> >> the code, it looks like all it does is set a flag, and all this flag
> >> >> is used for is erroring-out if it's not set.
> >> >
> >> > On further thought, pre-registration could be done at first
> >> > pthread_create rather than process entry, which would probably be
> >> > acceptable. But the question remains why it's needed at all, and
> >> > neither of these approaches is available to code that doesn't have the
> >> > privilege of being part of libc. For example, library code that might
> >> > be loaded via dlopen can't safely use SYS_membarrier without
> >> > introducing unbounded latency before the first use.
> >>
> >> For membarrier private expedited, the need for pre-registration is currently
> >> there because of powerpc not wanting to slow down switch_mm() for processes
> >> not needing that command.
> >>
> >> That's the only reason I see for it. If we would have accepted to add
> >> a smp_mb() to the powerpc switch_mm() scheduler path, we could have done
> >> so without registration for the private expedited membarrier command.
> >
> > I don't understand why the barrier is needed at all; the ipi ping
> > should suffice to execute a barrier instruction on all cores on which
> > a thread of the process is running, and if any other core subsequently
> > picks up a thread of the process to run, it must necessarily perform a
> > barrier just to synchronize with whatever core the thread was
> > previously running on (not to mention synchronizing the handoff
> > itself). Is this just to optimize out ipi pinging cores that threads
> > of the process are not currently running on, but were last running on
> > and could become running on again without migration?
>
> See this comment in context_switch():
>
> * If mm is non-NULL, we pass through switch_mm(). If mm is
> * NULL, we will pass through mmdrop() in finish_task_switch().
> * Both of these contain the full memory barrier required by
> * membarrier after storing to rq->curr, before returning to
> * user-space.
>
> So the full memory barrier we are discussing here in switch_mm() orders
> the store to rq->curr before following memory accesses (including those
> performed by user-space).
>
> This pairs with the first smp_mb() in membarrier_private_expedited(), which
> orders memory accesses before the membarrier system call before the following
> loads of each cpu_rq(cpu)->curr.
>
> This guarantees that if we happen to skip the IPI for a given CPU that
> is just about to schedule in a thread belonging to our process (say, just before
> storing rq->curr in the scheduler), the memory accesses performed by that thread
> after it starts running are ordered after the memory accesses performed prior
> to membarrier.
>
> As we are not grabbing the runqueue locks for each CPU in membarrier because
> it would be too costly, we need to ensure the proper barriers are there within
> the scheduler. We cannot just rely on the memory barrier present at the very
> start of the scheduler code to order with respect to memory accesses happening
> in the newly scheduled-in thread _after_ scheduler execution. This is why we
> need to ensure the proper barriers are present both before and after the
> scheduler stores to rq->curr.

OK, this all makes sense.

> >> commit a961e40917fb hints at the sync_core private expedited membarrier
> >> commands (which was being actively designed at that time) which may
> >> require pre-registration. However, things did not turn out that way: we
> >> ended up adding the required core serializing barriers unconditionally
> >> into each architecture.
> >>
> >> Considering that sync_core private expedited membarrier ended up not needing
> >> pre-registration, I think this pre-registration optimization may have been
> >> somewhat counter-productive, since I doubt the overhead of smp_mb() in a
> >> switch_mm() for powerpc is that high, but I don't have the hardware handy
> >> to provide numbers. So we end up slowing down everyone by requiring a
> >> registration system call after exec. :-(
> >
> > I'm surprised it's even possible to do switch_mm correctly with no
> > barrier...
>
> There is a barrier at the beginning of the scheduler code, which is typically
> enough for all use-cases that rely on the runqueue lock to synchronize with the
> scheduler. Membarrier does not use rq lock for performance considerations, so
> we end up having to make sure the proper barriers are in place both before/after
> store to rq->curr. Those pair with the 2 barriers at the end/beginning of the
> membarrier system call (e.g. in membarrier_private_expedited())

This too.

> >> One possible way out of this would be to make MEMBARRIER_CMD_PRIVATE_EXPEDITED
> >> and MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE work fine without
> >> pre-registration
> >> in future kernels. Therefore, the application could try using them without
> >> registration. If it works, all is fine, else it would treat the error how it
> >> sees fit, either through explicit registration and trying again, or returning
> >> the error to the caller.
> >
> > This would be nice. I think the register-at-first-pthread_create works
> > fine for my case now, but as noted it doesn't work when you can't
> > control libc internals.
>
> It seems to be a transient problem then: once all libc start supporting this,
> there will be fewer and fewer "early adopter" libraries that need to register
> while multithreaded. Therefore, I wonder if it's really worthwhile to introduce
> a change at this point, considering that it will take a while before newer
> kernels gets rolled out, and people will have to support legacy behavior anyway.

It's slightly more complicated than that. At present, the code I have
queued for push to musl only performs the syscall in pthread_create if
the code using membarrier is linked (it's a weak reference, or rather
as we implement those, a weak alias for a dummy function), and the
only code using it is in the dynamic linker, so static-linked programs
never register. This would be fixed for applications if we provide a
public membarrier() function declared in some header, and applications
use it rather than doing syscall(SYS_membarrier, ...) themselves,
which I'd kinda like to do -- it would also let us offer a fallback on
old kernels via the signaling mechanism musl already has as a
fallback. But I don't think we have any consensus with glibc about
whether or where to declare such a function, and we usually try to get
that before adding new syscall wrappers. I can work on this.

> >> The only change I see we would require to make this work is to turn
> >> arch/powerpc/include/asm/membarrier.h membarrier_arch_switch_mm() into
> >> an unconditional smp_mb().
> >
> > I finally found this just now; before I was mistakenly grepping for
> > MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY rather than
> > MEMBARRIER_STATE_PRIVATE_EXPEDITED, and couldn't find anywhere it was
> > actually used.
> >
> >> Thoughts ?
> >
> > I'm all for making it work without pre-registration. I don't care
> > whether this is via heavier barriers in ppc or via pinging more cores,
> > or some other mechanism, but maybe others have opinions on this.
>
> If from a libc perspective the current pre-registration scheme is acceptable,
> I don't see a strong point for introducing a behavior change to an already
> exposed system call (not requiring pre-registration). It may introduce user
> confusion, and would require documenting per-kernel-version-ranges system call
> error behavior, which seems to add an unwanted amount of complexity for
> users of the system call.
>
> >
> > One alternative might be auto-registration, ipi pinging all cores the
> > process has threads on (even not currently running) for the first
> > membarrier, so that they see the membarrier_state updated without
> > having to synchronize with the rcu/scheduler, then doing the same as
> > now after the first call. I'm not sure if this creates opportunities
> > for abuse; probably no worse than the process could do just by
> > actually running lots of threads.
>
> We try really hard not to IPI a core that happens to be running
> a realtime thread at high priority. I think this scheme would not
> meet this requirement.

That's a very reasonable requirement.

Rich