LinuxLists.cc - [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

* Chris Friesen ([email protected]) wrote:
> On 02/12/2010 04:46 PM, Mathieu Desnoyers wrote:
>
> > Editorial question:
> >
> > This synchronization only takes care of threads using the current process memory
> > map. It should not be used to synchronize accesses performed on memory maps
> > shared between different processes. Is that a limitation we can live with ?
>
> It makes sense for an initial version. It would be unfortunate if this
> were a permanent limitation, since using separate processes with
> explicit shared memory is a useful way to mitigate memory trampler issues.
>
> If we were going to allow that, it might make sense to add an address
> range such that only those processes which have mapped that range would
> execute the barrier. Come to think of it, it might be possible to use
> this somehow to avoid having to execute the barrier on *all* threads
> within a process.

The extensible system call mandatory and optional flags will allow this kind of
improvement later on if this appears to be needed. It will also allow user-space
to detect if later kernels support these new features or not. But meanwhile I
think it's good to start with this implementation that covers 99.99% of
use-cases I can currently think of (ok, well, maybe I'm just unimaginative) ;)

Thanks,

Mathieu

>
> Chris

2010-02-24 09:11:09

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

On Mon, Feb 22, 2010 at 04:23:21PM -0500, Mathieu Desnoyers wrote:
> * Chris Friesen ([email protected]) wrote:
> > On 02/12/2010 04:46 PM, Mathieu Desnoyers wrote:
> >
> > > Editorial question:
> > >
> > > This synchronization only takes care of threads using the current process memory
> > > map. It should not be used to synchronize accesses performed on memory maps
> > > shared between different processes. Is that a limitation we can live with ?
> >
> > It makes sense for an initial version. It would be unfortunate if this
> > were a permanent limitation, since using separate processes with
> > explicit shared memory is a useful way to mitigate memory trampler issues.
> >
> > If we were going to allow that, it might make sense to add an address
> > range such that only those processes which have mapped that range would
> > execute the barrier. Come to think of it, it might be possible to use
> > this somehow to avoid having to execute the barrier on *all* threads
> > within a process.
>
> The extensible system call mandatory and optional flags will allow this kind of
> improvement later on if this appears to be needed. It will also allow user-space
> to detect if later kernels support these new features or not. But meanwhile I
> think it's good to start with this implementation that covers 99.99% of
> use-cases I can currently think of (ok, well, maybe I'm just unimaginative) ;)

It's a good point, I think having at least the ability to do
process-shared or process-private in the first version of the API might
be a good idea. That matches glibc's synchronisation routines so it
would probably be a desirable feature even if you don't implement it in
your library initially.

When writing multiprocessor scalable software, threads should often be
avoided. They share so much state that it is easy to run into
scalability issues in the kernel. So yes it would be really nice to
have userspace RCU available in a process-shared mode.

2010-02-24 15:22:56

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

* Nick Piggin ([email protected]) wrote:
> On Mon, Feb 22, 2010 at 04:23:21PM -0500, Mathieu Desnoyers wrote:
> > * Chris Friesen ([email protected]) wrote:
> > > On 02/12/2010 04:46 PM, Mathieu Desnoyers wrote:
> > >
> > > > Editorial question:
> > > >
> > > > This synchronization only takes care of threads using the current process memory
> > > > map. It should not be used to synchronize accesses performed on memory maps
> > > > shared between different processes. Is that a limitation we can live with ?
> > >
> > > It makes sense for an initial version. It would be unfortunate if this
> > > were a permanent limitation, since using separate processes with
> > > explicit shared memory is a useful way to mitigate memory trampler issues.
> > >
> > > If we were going to allow that, it might make sense to add an address
> > > range such that only those processes which have mapped that range would
> > > execute the barrier. Come to think of it, it might be possible to use
> > > this somehow to avoid having to execute the barrier on *all* threads
> > > within a process.
> >
> > The extensible system call mandatory and optional flags will allow this kind of
> > improvement later on if this appears to be needed. It will also allow user-space
> > to detect if later kernels support these new features or not. But meanwhile I
> > think it's good to start with this implementation that covers 99.99% of
> > use-cases I can currently think of (ok, well, maybe I'm just unimaginative) ;)
>
> It's a good point, I think having at least the ability to do
> process-shared or process-private in the first version of the API might
> be a good idea. That matches glibc's synchronisation routines so it
> would probably be a desirable feature even if you don't implement it in
> your library initially.

I am tempted to say that we should probably wait for users of this API feature
to manifest themselves before we go on and implement it. This will ensure that
we don't end up maintaining an unused feature and this provides a minimum
testability. For now, returning -EINVAL seems like an appropriate response for
this system call feature.

As I said above, given the exensible nature of the sys_membarrier flags, we can
assign a MEMBARRIER_SHARED_MEM or something like that to a mandatory flag bit
later on. So when userspace start using this flag on old kernels that do not
support it, -EINVAL will be returned, and then the application will know it must
use a fallback. So, basically, we don't even need to define this flag now.

>
> When writing multiprocessor scalable software, threads should often be
> avoided. They share so much state that it is easy to run into
> scalability issues in the kernel. So yes it would be really nice to
> have userspace RCU available in a process-shared mode.
>

Agreed, although some major modifications would also be needed in the userspace
RCU library to do that, because it currently rely on being able to access other
thread's TLS.

Thanks,

Mathieu

2010-02-24 17:30:17

by Darren Hart

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

Nick Piggin wrote:

> When writing multiprocessor scalable software, threads should often be
> avoided. They share so much state that it is easy to run into
> scalability issues in the kernel. So yes it would be really nice to
> have userspace RCU available in a process-shared mode.

A bit off topic, but I'm interested in what you feel some of these
scalability issues are. Is it mostly bouncing this shared context from
one CPU to the next and the related cache effects, or is there something
more you are referring to?

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-02-25 05:23:10

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

On Wed, Feb 24, 2010 at 09:29:46AM -0800, Darren Hart wrote:
> Nick Piggin wrote:
>
> >When writing multiprocessor scalable software, threads should often be
> >avoided. They share so much state that it is easy to run into
> >scalability issues in the kernel. So yes it would be really nice to
> >have userspace RCU available in a process-shared mode.
>
> A bit off topic, but I'm interested in what you feel some of these
> scalability issues are. Is it mostly bouncing this shared context
> from one CPU to the next and the related cache effects, or is there
> something more you are referring to?

Just in general shared state is almost always going to be more costly in
SMP than non-shared.

>From VM to files and fs state to signals and timers and process
accounting. And this also carries up to libc, and critical user code
like the heap allocator.

Linux is usually pretty good, a lot due to RCU, but there are still
contention points.

Andrew had investigated this a lot (in relation to samba) and had a good
talk on it, but the slides don't really do it justice.
http://www.samba.org/~tridge/talks/threads.pdf

2010-02-25 05:33:20

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

On Wed, Feb 24, 2010 at 10:22:52AM -0500, Mathieu Desnoyers wrote:
> * Nick Piggin ([email protected]) wrote:
> > On Mon, Feb 22, 2010 at 04:23:21PM -0500, Mathieu Desnoyers wrote:
> > > * Chris Friesen ([email protected]) wrote:
> > > > On 02/12/2010 04:46 PM, Mathieu Desnoyers wrote:
> > > >
> > > > > Editorial question:
> > > > >
> > > > > This synchronization only takes care of threads using the current process memory
> > > > > map. It should not be used to synchronize accesses performed on memory maps
> > > > > shared between different processes. Is that a limitation we can live with ?
> > > >
> > > > It makes sense for an initial version. It would be unfortunate if this
> > > > were a permanent limitation, since using separate processes with
> > > > explicit shared memory is a useful way to mitigate memory trampler issues.
> > > >
> > > > If we were going to allow that, it might make sense to add an address
> > > > range such that only those processes which have mapped that range would
> > > > execute the barrier. Come to think of it, it might be possible to use
> > > > this somehow to avoid having to execute the barrier on *all* threads
> > > > within a process.
> > >
> > > The extensible system call mandatory and optional flags will allow this kind of
> > > improvement later on if this appears to be needed. It will also allow user-space
> > > to detect if later kernels support these new features or not. But meanwhile I
> > > think it's good to start with this implementation that covers 99.99% of
> > > use-cases I can currently think of (ok, well, maybe I'm just unimaginative) ;)
> >
> > It's a good point, I think having at least the ability to do
> > process-shared or process-private in the first version of the API might
> > be a good idea. That matches glibc's synchronisation routines so it
> > would probably be a desirable feature even if you don't implement it in
> > your library initially.
>
> I am tempted to say that we should probably wait for users of this API feature
> to manifest themselves before we go on and implement it. This will ensure that
> we don't end up maintaining an unused feature and this provides a minimum
> testability. For now, returning -EINVAL seems like an appropriate response for
> this system call feature.

It would be very trivial compared to the process-private case. Just IPI
all CPUs. It would allow older kernels to work with newer process based
apps as they get implemented. But... not a really big deal I suppose.

> As I said above, given the exensible nature of the sys_membarrier flags, we can
> assign a MEMBARRIER_SHARED_MEM or something like that to a mandatory flag bit
> later on. So when userspace start using this flag on old kernels that do not
> support it, -EINVAL will be returned, and then the application will know it must
> use a fallback. So, basically, we don't even need to define this flag now.
>
> >
> > When writing multiprocessor scalable software, threads should often be
> > avoided. They share so much state that it is easy to run into
> > scalability issues in the kernel. So yes it would be really nice to
> > have userspace RCU available in a process-shared mode.
> >
>
> Agreed, although some major modifications would also be needed in the userspace
> RCU library to do that, because it currently rely on being able to access other
> thread's TLS.

OK. It would be a good feature to keep in mind, I believe.

2010-02-25 16:53:11

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

* Nick Piggin ([email protected]) wrote:
> On Wed, Feb 24, 2010 at 10:22:52AM -0500, Mathieu Desnoyers wrote:
> > * Nick Piggin ([email protected]) wrote:
> > > On Mon, Feb 22, 2010 at 04:23:21PM -0500, Mathieu Desnoyers wrote:
> > > > * Chris Friesen ([email protected]) wrote:
> > > > > On 02/12/2010 04:46 PM, Mathieu Desnoyers wrote:
> > > > >
> > > > > > Editorial question:
> > > > > >
> > > > > > This synchronization only takes care of threads using the current process memory
> > > > > > map. It should not be used to synchronize accesses performed on memory maps
> > > > > > shared between different processes. Is that a limitation we can live with ?
> > > > >
> > > > > It makes sense for an initial version. It would be unfortunate if this
> > > > > were a permanent limitation, since using separate processes with
> > > > > explicit shared memory is a useful way to mitigate memory trampler issues.
> > > > >
> > > > > If we were going to allow that, it might make sense to add an address
> > > > > range such that only those processes which have mapped that range would
> > > > > execute the barrier. Come to think of it, it might be possible to use
> > > > > this somehow to avoid having to execute the barrier on *all* threads
> > > > > within a process.
> > > >
> > > > The extensible system call mandatory and optional flags will allow this kind of
> > > > improvement later on if this appears to be needed. It will also allow user-space
> > > > to detect if later kernels support these new features or not. But meanwhile I
> > > > think it's good to start with this implementation that covers 99.99% of
> > > > use-cases I can currently think of (ok, well, maybe I'm just unimaginative) ;)
> > >
> > > It's a good point, I think having at least the ability to do
> > > process-shared or process-private in the first version of the API might
> > > be a good idea. That matches glibc's synchronisation routines so it
> > > would probably be a desirable feature even if you don't implement it in
> > > your library initially.
> >
> > I am tempted to say that we should probably wait for users of this API feature
> > to manifest themselves before we go on and implement it. This will ensure that
> > we don't end up maintaining an unused feature and this provides a minimum
> > testability. For now, returning -EINVAL seems like an appropriate response for
> > this system call feature.
>
> It would be very trivial compared to the process-private case. Just IPI
> all CPUs. It would allow older kernels to work with newer process based
> apps as they get implemented. But... not a really big deal I suppose.

This is actually what I did in v1 of the patch, but this implementation met
resistance from the RT people, who were concerned about the impact on RT tasks
of a lower priority process doing lots of sys_membarrier() calls. So if we want
to do other-process-aware sys_membarrier(), we would have to iterate on all
cpus, for every running process shared memory maps and see if there is something
shared with all shm of the current process. This is clearly not as trivial as
just broadcasting the IPI to all cpus.

>
>
> > As I said above, given the exensible nature of the sys_membarrier flags, we can
> > assign a MEMBARRIER_SHARED_MEM or something like that to a mandatory flag bit
> > later on. So when userspace start using this flag on old kernels that do not
> > support it, -EINVAL will be returned, and then the application will know it must
> > use a fallback. So, basically, we don't even need to define this flag now.
> >
> > >
> > > When writing multiprocessor scalable software, threads should often be
> > > avoided. They share so much state that it is easy to run into
> > > scalability issues in the kernel. So yes it would be really nice to
> > > have userspace RCU available in a process-shared mode.
> > >
> >
> > Agreed, although some major modifications would also be needed in the userspace
> > RCU library to do that, because it currently rely on being able to access other
> > thread's TLS.
>
> OK. It would be a good feature to keep in mind, I believe.
>

Sure.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com

2010-02-25 17:25:42

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

On Thu, 2010-02-25 at 11:53 -0500, Mathieu Desnoyers wrote:

> > It would be very trivial compared to the process-private case. Just IPI
> > all CPUs. It would allow older kernels to work with newer process based
> > apps as they get implemented. But... not a really big deal I suppose.
>
> This is actually what I did in v1 of the patch, but this implementation met
> resistance from the RT people, who were concerned about the impact on RT tasks
> of a lower priority process doing lots of sys_membarrier() calls. So if we want
> to do other-process-aware sys_membarrier(), we would have to iterate on all
> cpus, for every running process shared memory maps and see if there is something
> shared with all shm of the current process. This is clearly not as trivial as
> just broadcasting the IPI to all cpus.

Right, it may require another syscall or parameter to let the tasks
register a shared page. Then have some mechanism to find a way to
quickly check if a CPU is running a process with that page.

-- Steve

2010-02-25 17:51:25

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

* Steven Rostedt ([email protected]) wrote:
> On Thu, 2010-02-25 at 11:53 -0500, Mathieu Desnoyers wrote:
>
> > > It would be very trivial compared to the process-private case. Just IPI
> > > all CPUs. It would allow older kernels to work with newer process based
> > > apps as they get implemented. But... not a really big deal I suppose.
> >
> > This is actually what I did in v1 of the patch, but this implementation met
> > resistance from the RT people, who were concerned about the impact on RT tasks
> > of a lower priority process doing lots of sys_membarrier() calls. So if we want
> > to do other-process-aware sys_membarrier(), we would have to iterate on all
> > cpus, for every running process shared memory maps and see if there is something
> > shared with all shm of the current process. This is clearly not as trivial as
> > just broadcasting the IPI to all cpus.
>
> Right, it may require another syscall or parameter to let the tasks
> register a shared page. Then have some mechanism to find a way to
> quickly check if a CPU is running a process with that page.

Well, either we explicitly require the task to register its shared pages, which
could be error-prone in terms of API, or simply consider all pages that are
shared between the current process and every process running on other CPUs. That
would be much simpler to use from a user-level perspective I think. The
downside is that it may generate a few IPIs to processes that happen not to need
them, but we are talking of a relatively small overhead to processes that we are
interacting with anyway. It's not like we would be interrupting completely
unrelated RT threads. I'm just not sure if it would be valid to exclude COW and
RO shared pages from that check. For instance, if a pages is mapped as RO on one
process and RW on another, then we have to synchronize these processes. Similar
weird cases could happen if a memory map is changed from RW to RO right after
the content is modified, and then we need to execute sys_membarrier: we might
miss a memory map that actually needs to be synchronized.

And yes, as you say, we'd have to find a way to quickly compare shared-memory
maps from two processes. The dumb approach, O(n^2), would be to compare these
entries element by element. Assuming a relatively low amount of shared mmaps,
this could make sense, otherwise we'd have to construct a lookup hash table to
accelerate the lookup, but it adds either a basic runtime overhead if we
construct it within sys_membarrier() or a memory overhead if we choose to add it
to the task struct (which I'd really like to avoid).

But... either way we chose, we can extend the system call flags and parameters
as needed, so I think it really should not be part of this initial
implementation.

Thanks,

Mathieu

>
> -- Steve
>
>

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com

2010-02-25 18:00:21

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

* Mathieu Desnoyers ([email protected]) wrote:
[...]
> But... either way we chose, we can extend the system call flags and parameters
> as needed, so I think it really should not be part of this initial
> implementation.

So... considering all this discussion is about future enhancements that are not
required by anyone at this stage, and that it will be possible to add these
later on thanks to the extensible sys_membarrier() flags, I propose to merge v9
of this patch for 2.6.34. I think the logical path for this patch is to go
through Ingo's tree, as it sits mostly along with the scheduler, but I have not
heard anything from him yet. Am I taking the correct path ?

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency Consultant
EfficiOS Inc.
http://www.efficios.com

2010-02-25 18:09:03

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

On Thu, 2010-02-25 at 12:51 -0500, Mathieu Desnoyers wrote:

> But... either way we chose, we can extend the system call flags and parameters
> as needed, so I think it really should not be part of this initial
> implementation.

I agree here too.

If you have two different tasks doing lockless RCU or what not on shared
memory, it's best to stick with the mb() on the reader side. Yeah, it
makes the performance go down, but heck, I'm really worried about the
crazy complexity that wound need to go into the kernel to prevent this.

-- Steve

2010-02-25 18:20:53

[permalink] [raw]

Subject: Add this to tip (was: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9))

On Thu, 2010-02-25 at 13:00 -0500, Mathieu Desnoyers wrote:

> So... considering all this discussion is about future enhancements that are not
> required by anyone at this stage, and that it will be possible to add these
> later on thanks to the extensible sys_membarrier() flags, I propose to merge v9
> of this patch for 2.6.34. I think the logical path for this patch is to go
> through Ingo's tree, as it sits mostly along with the scheduler, but I have not
> heard anything from him yet. Am I taking the correct path ?

I agree this should probably go through tip. I'm sure Ingo is busy
working through the merge window now too, and is not focusing on this
thread.

Anyway, this thread still has RFC in it. Send out a new patch (new
thread) with the Subject:

[PATCH -tip] introduce sys_membarrier(): process-wide memory barrier (v9))

With all Acked-by's given and state that it is ready for inclusion in
v2.6.34. (make this statement at the top of the email) It may still not
make 2.6.34, but at least it will be on its way to 2.6.25 (or 3.0
*wish*)

-- Steve

2010-02-26 05:08:24

[permalink] [raw]

Subject: Re: [RFC patch] introduce sys_membarrier(): process-wide memory barrier (v9)

On Thu, Feb 25, 2010 at 11:53:01AM -0500, Mathieu Desnoyers wrote:
> * Nick Piggin ([email protected]) wrote:
> > On Wed, Feb 24, 2010 at 10:22:52AM -0500, Mathieu Desnoyers wrote:
> > > * Nick Piggin ([email protected]) wrote:
> > > > On Mon, Feb 22, 2010 at 04:23:21PM -0500, Mathieu Desnoyers wrote:
> > > > > * Chris Friesen ([email protected]) wrote:
> > > > > > On 02/12/2010 04:46 PM, Mathieu Desnoyers wrote:
> > > > > >
> > > > > > > Editorial question:
> > > > > > >
> > > > > > > This synchronization only takes care of threads using the current process memory
> > > > > > > map. It should not be used to synchronize accesses performed on memory maps
> > > > > > > shared between different processes. Is that a limitation we can live with ?
> > > > > >
> > > > > > It makes sense for an initial version. It would be unfortunate if this
> > > > > > were a permanent limitation, since using separate processes with
> > > > > > explicit shared memory is a useful way to mitigate memory trampler issues.
> > > > > >
> > > > > > If we were going to allow that, it might make sense to add an address
> > > > > > range such that only those processes which have mapped that range would
> > > > > > execute the barrier. Come to think of it, it might be possible to use
> > > > > > this somehow to avoid having to execute the barrier on *all* threads
> > > > > > within a process.
> > > > >
> > > > > The extensible system call mandatory and optional flags will allow this kind of
> > > > > improvement later on if this appears to be needed. It will also allow user-space
> > > > > to detect if later kernels support these new features or not. But meanwhile I
> > > > > think it's good to start with this implementation that covers 99.99% of
> > > > > use-cases I can currently think of (ok, well, maybe I'm just unimaginative) ;)
> > > >
> > > > It's a good point, I think having at least the ability to do
> > > > process-shared or process-private in the first version of the API might
> > > > be a good idea. That matches glibc's synchronisation routines so it
> > > > would probably be a desirable feature even if you don't implement it in
> > > > your library initially.
> > >
> > > I am tempted to say that we should probably wait for users of this API feature
> > > to manifest themselves before we go on and implement it. This will ensure that
> > > we don't end up maintaining an unused feature and this provides a minimum
> > > testability. For now, returning -EINVAL seems like an appropriate response for
> > > this system call feature.
> >
> > It would be very trivial compared to the process-private case. Just IPI
> > all CPUs. It would allow older kernels to work with newer process based
> > apps as they get implemented. But... not a really big deal I suppose.
>
> This is actually what I did in v1 of the patch, but this implementation met
> resistance from the RT people, who were concerned about the impact on RT tasks
> of a lower priority process doing lots of sys_membarrier() calls. So if we want
> to do other-process-aware sys_membarrier(), we would have to iterate on all
> cpus, for every running process shared memory maps and see if there is something
> shared with all shm of the current process. This is clearly not as trivial as
> just broadcasting the IPI to all cpus.

I don't see how this is fundamentally worse than your existing approach,
because on some architectures with asids, the mm_cpumask isn't cleared
when a process is scheduled off the CPU then you could effectively just
cause IPIs to lots of CPUs anyway.

x86 may also one day implement ASIDS in the same way.

So if we are worried about this then we need to solve it properly IMO.
Rate-limiting it might work.

2010-02-26 05:37:08