2009-09-11 04:38:24

by Roland Dreier

[permalink] [raw]
Subject: [GIT PULL] please pull ummunotify

Linus, please consider pulling from

master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify

This tree is also available from kernel.org mirrors at:

git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify

This will get "ummunotify," a new character device that allows a
userspace library to register for MMU notifications; this is
particularly useful for MPI implementions (message passing libraries
used in HPC) to be able to keep track of what wacky things consumers
do to their memory mappings. My colleague Jeff Squyres from the Open
MPI project posted a blog entry about why MPI wants this:

http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/

His summary of ummunotify:

"It’s elegant, doesn’t require strange linker tricks, and seems to
work in all cases. Yay!"

This code went through several review iterations on lkml and was in
-mm and -next for quite a few weeks. Andrew is OK with merging it (I
think -- Andrew please correct me if I misunderstood you).

Roland Dreier (1):
ummunotify: Userspace support for MMU notifications

Documentation/Makefile | 3 +-
Documentation/ummunotify/Makefile | 7 +
Documentation/ummunotify/ummunotify.txt | 150 ++++++++
Documentation/ummunotify/umn-test.c | 200 +++++++++++
drivers/char/Kconfig | 12 +
drivers/char/Makefile | 1 +
drivers/char/ummunotify.c | 566 +++++++++++++++++++++++++++++++
include/linux/Kbuild | 1 +
include/linux/ummunotify.h | 121 +++++++
9 files changed, 1060 insertions(+), 1 deletions(-)
create mode 100644 Documentation/ummunotify/Makefile
create mode 100644 Documentation/ummunotify/ummunotify.txt
create mode 100644 Documentation/ummunotify/umn-test.c
create mode 100644 drivers/char/ummunotify.c
create mode 100644 include/linux/ummunotify.h


2009-09-11 05:56:16

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [GIT PULL] please pull ummunotify

Hi Roland,

> Linus, please consider pulling from
>
> master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This tree is also available from kernel.org mirrors at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementions (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings. My colleague Jeff Squyres from the Open
> MPI project posted a blog entry about why MPI wants this:
>
> http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
>
> His summary of ummunotify:
>
> "It’s elegant, doesn’t require strange linker tricks, and seems to
> work in all cases. Yay!"
>
> This code went through several review iterations on lkml and was in
> -mm and -next for quite a few weeks. Andrew is OK with merging it (I
> think -- Andrew please correct me if I misunderstood you).

I'm sorry. I haven't review this code and I didn't track this discussion
carefully. but I have one stupid question. May I ask?
Can I this version already solved fork() + COW issue? if so, could you
please explain what happen at fork. Obviously RDMA point to either parent
or child page, not both. but Corrent COW rule is, first touch process
get copyed page and other process still own original page. I think it's
unpecected behavior form RDMA.

2009-09-11 06:03:33

by Roland Dreier

[permalink] [raw]
Subject: Re: [GIT PULL] please pull ummunotify


> Can I this version already solved fork() + COW issue? if so, could you
> please explain what happen at fork. Obviously RDMA point to either parent
> or child page, not both. but Corrent COW rule is, first touch process
> get copyed page and other process still own original page. I think it's
> unpecected behavior form RDMA.

No, ummunotify doesn't really help that much with fork() + COW. If a
parent forks and then touches pages that are actively in use for RDMA,
then of course they get COWed and RDMA goes to the wrong memory (from
the point of view of the parent).

ummunotify does deal with the case where a process forks and touches
memory that was used for RDMA but no longer is -- in that case, the MPI
library has a chance to flush its registration cache because it will get
a ummunotify event invalidating the old mapping.

The real purpose of ummunotify is to allow MPI implementations to cache
registrations, even when the MPI library is used with an application
that does funny things for allocation (mmap()/munmap() or brk(), etc).

- Roland

2009-09-11 06:11:38

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [GIT PULL] please pull ummunotify

Hi

Thank you explanation.

>
> > Can I this version already solved fork() + COW issue? if so, could you
> > please explain what happen at fork. Obviously RDMA point to either parent
> > or child page, not both. but Corrent COW rule is, first touch process
> > get copyed page and other process still own original page. I think it's
> > unpecected behavior form RDMA.
>
> No, ummunotify doesn't really help that much with fork() + COW. If a
> parent forks and then touches pages that are actively in use for RDMA,
> then of course they get COWed and RDMA goes to the wrong memory (from
> the point of view of the parent).

So, Can we assume OpenMPI user process doesn't such thing?

Parhaps, madvise(DONTFORK) or vfork() avoid this issue. but I'm not
sure all program in the world do that.

> ummunotify does deal with the case where a process forks and touches
> memory that was used for RDMA but no longer is -- in that case, the MPI
> library has a chance to flush its registration cache because it will get
> a ummunotify event invalidating the old mapping.
>
> The real purpose of ummunotify is to allow MPI implementations to cache
> registrations, even when the MPI library is used with an application
> that does funny things for allocation (mmap()/munmap() or brk(), etc).

Yup, that's very worth.


2009-09-11 06:15:42

by Brice Goglin

[permalink] [raw]
Subject: Re: [GIT PULL] please pull ummunotify

Roland Dreier wrote:
> > Can I this version already solved fork() + COW issue? if so, could you
> > please explain what happen at fork. Obviously RDMA point to either parent
> > or child page, not both. but Corrent COW rule is, first touch process
> > get copyed page and other process still own original page. I think it's
> > unpecected behavior form RDMA.
>
> No, ummunotify doesn't really help that much with fork() + COW. If a
> parent forks and then touches pages that are actively in use for RDMA,
> then of course they get COWed and RDMA goes to the wrong memory (from
> the point of view of the parent).
>

My understanding of the code is that fork will end-up calling
copy_page_range() on all VMA, and copy_page_range() calls
mmu_notifier_invalidate_range_start() if is_cow_mapping() is true,
which should be the case here. So you should get some invalidate events
on fork.

Brice

2009-09-11 06:21:37

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [GIT PULL] please pull ummunotify

> Roland Dreier wrote:
> > > Can I this version already solved fork() + COW issue? if so, could you
> > > please explain what happen at fork. Obviously RDMA point to either parent
> > > or child page, not both. but Corrent COW rule is, first touch process
> > > get copyed page and other process still own original page. I think it's
> > > unpecected behavior form RDMA.
> >
> > No, ummunotify doesn't really help that much with fork() + COW. If a
> > parent forks and then touches pages that are actively in use for RDMA,
> > then of course they get COWed and RDMA goes to the wrong memory (from
> > the point of view of the parent).
> >
>
> My understanding of the code is that fork will end-up calling
> copy_page_range() on all VMA, and copy_page_range() calls
> mmu_notifier_invalidate_range_start() if is_cow_mapping() is true,
> which should be the case here. So you should get some invalidate events
> on fork.

Worried...
Anybody haven't test fork() case yet???


2009-09-11 06:22:19

by Roland Dreier

[permalink] [raw]
Subject: Re: [GIT PULL] please pull ummunotify


> My understanding of the code is that fork will end-up calling
> copy_page_range() on all VMA, and copy_page_range() calls
> mmu_notifier_invalidate_range_start() if is_cow_mapping() is true,
> which should be the case here. So you should get some invalidate events
> on fork.

Yes, I agree (that's what the second half of my email tried to say).

However, that doesn't help if the parent process is actively doing RDMA
on the range being invalidated -- the MPI library or whatever will get
the invalidate event via ummunotify, but what can it do? The event is
basically saying "your data is going to the wrong place" and I don't see
what useful thing MPI could do with that.

As I said, it does mean that MPI can invalidate cached registrations for
COWed memory, which might be useful in case a parent forks and then
touches memory it used to use for RDMA, but I think that's the easier
part of the fork/COW problem.

- R.

2009-09-11 06:40:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Thu, Sep 10, 2009 at 11:22:20PM -0700, Roland Dreier wrote:

> As I said, it does mean that MPI can invalidate cached registrations for
> COWed memory, which might be useful in case a parent forks and then
> touches memory it used to use for RDMA, but I think that's the easier
> part of the fork/COW problem.

What happens to all the other IB resources (PD, CQ, QP, etc) on fork?

AFAIK, pretty much by design the IB stack cannot/does not duplicate
these objects.

The natural consequence is that a PD is always associated with a
single process at a time, thus a memory registration which is
associated with a PD must also be associated with a single process.

So.. What is the problem with fork? The semantics of what should
happen seem natural enough to me, the PD doesn't get copied to the
child, so the MR stays with the parent. COW events on the pinned
region must be resolved so that the physical page stays with the
process that has pinned it - the pin is logically released in the
child because the MR doesn't exist because the PD doesn't exist.

Is this a general problem with the MR mechanism? If I
mmap(MAP_SHARED|MAP_READONLY) and someone mmaps(MAP_PRIVATE|MAP_WRITE)
on the same file I can generate COW events - will this make RDMAs go
randomly too??

Jason

2009-09-11 16:42:54

by Gleb Natapov

[permalink] [raw]
Subject: Re: [GIT PULL] please pull ummunotify

On Fri, Sep 11, 2009 at 03:11:36PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> Thank you explanation.
>
> >
> > > Can I this version already solved fork() + COW issue? if so, could you
> > > please explain what happen at fork. Obviously RDMA point to either parent
> > > or child page, not both. but Corrent COW rule is, first touch process
> > > get copyed page and other process still own original page. I think it's
> > > unpecected behavior form RDMA.
> >
> > No, ummunotify doesn't really help that much with fork() + COW. If a
> > parent forks and then touches pages that are actively in use for RDMA,
> > then of course they get COWed and RDMA goes to the wrong memory (from
> > the point of view of the parent).
>
> So, Can we assume OpenMPI user process doesn't such thing?
>
> Parhaps, madvise(DONTFORK) or vfork() avoid this issue. but I'm not
> sure all program in the world do that.
>
MPI (or is it libibverbs?) marks all registered memory as DONTFORK.

--
Gleb.

2009-09-11 16:58:15

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


> So.. What is the problem with fork? The semantics of what should
> happen seem natural enough to me, the PD doesn't get copied to the
> child, so the MR stays with the parent. COW events on the pinned
> region must be resolved so that the physical page stays with the
> process that has pinned it - the pin is logically released in the
> child because the MR doesn't exist because the PD doesn't exist.

This is getting away from the problem that ummunotify is solving, but
handling a COW fault generated by the parent by doing the copy in the
child seems like a pretty major, tricky change to make. The child may
have forked 100 more times in the meantime, meaning we now have to
change 101 memory maps ... the cost of page faults goes through the roof
probably...

- R.

2009-09-15 07:03:09

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

>
> > So.. What is the problem with fork? The semantics of what should
> > happen seem natural enough to me, the PD doesn't get copied to the
> > child, so the MR stays with the parent. COW events on the pinned
> > region must be resolved so that the physical page stays with the
> > process that has pinned it - the pin is logically released in the
> > child because the MR doesn't exist because the PD doesn't exist.
>
> This is getting away from the problem that ummunotify is solving, but
> handling a COW fault generated by the parent by doing the copy in the
> child seems like a pretty major, tricky change to make. The child may
> have forked 100 more times in the meantime, meaning we now have to
> change 101 memory maps ... the cost of page faults goes through the roof
> probably...

Ummm...
Perhaps my first question was wrong. I'm not intent to NAK your patch.
I merely want to know your patch detail...

ok, I ask you again as another word.

- I guess you have your MPI implementaion w/ ummunotify, right?
- I guess you have test sevaral pattern, right?
if so, can we see your test result?
- I think you can explain your MPI advantage/disadvantage against
current OpenMPI (or mpich et al).
- I guess your patch dramatically improve MPI implementaion, but
it's not free. it request some limitation to MPI application, right?
- I imagine multi thread and fork. Is there another linmitaion?
- In past discuttion, you said ummunotify user should not use
multi threading. you also think user should not fork?


2009-09-15 08:28:04

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


> - I guess you have your MPI implementaion w/ ummunotify, right?

Yes, Jeff Squyres (cc'ed) has an Open MPI prototype (mercurial tree at
http://bitbucket.org/jsquyres/ummunot/).

> - I guess you have test sevaral pattern, right?
> if so, can we see your test result?

Open MPI has a pretty extensive automated test fabric -- I don't have a
link handy but I believe all the tests that pass with unmodified Open
MPI currently still pass with ummunotify. Maybe Jeff has a link.

> - I think you can explain your MPI advantage/disadvantage against
> current OpenMPI (or mpich et al).

The advantage is as Jeff explained in his blog post
(http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/),
namely the performance improvement of memory registration caching
without the reliability problems caused by previous approaches to
caching such as trying to hook malloc etc (which are fragile because the
great diversity of MPI-using codes find ways to mess up all previous
userspace-only approaches).

> - I guess your patch dramatically improve MPI implementaion, but
> it's not free. it request some limitation to MPI application, right?

Not that I know of, beyond already existing limitations.

> - I imagine multi thread and fork. Is there another linmitaion?

There are no new limitations on multi-threaded codes or on use of fork
that I know of. Of course, buggy code that does something like passing
a buffer to MPI in one thread and then freeing that buffer from another
thread before MPI is done with it is still buggy; but ummunotify
actually increases the ability of the MPI implementation to detect such
bugs and give useful diagnostic information.

> - In past discuttion, you said ummunotify user should not use
> multi threading. you also think user should not fork?

I don't recall where I said ummunotify users should not be
multithreaded. I don't know of any problem with that.

Also code using ummunotify can fork -- ummunotify simply does not fix
issues with copy-on-write for buffers that are in use, just as it does
not fix multithreaded code that has a race between using a buffer and
freeing the same buffer.

Hope this clarifies things.

- Roland

2009-09-15 11:34:46

by Pavel Machek

[permalink] [raw]
Subject: Re: [GIT PULL] please pull ummunotify

Hi!

> Linus, please consider pulling from
>
> master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This tree is also available from kernel.org mirrors at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementions (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings. My colleague Jeff Squyres from the Open
> MPI project posted a blog entry about why MPI wants this:
>
> http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
>
> His summary of ummunotify:
>
> "It???s elegant, doesn???t require strange linker tricks, and seems to
> work in all cases. Yay!"
>
> This code went through several review iterations on lkml and was in
> -mm and -next for quite a few weeks. Andrew is OK with merging it (I
> think -- Andrew please correct me if I misunderstood you).

I don't remember seeing discussion of this on lkml. Yes it is in
-next...

Basically it allows app to 'trace itself'? ...with interesting mmap()
interface, exporting int to userspace, hoping it behaves atomically...?

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-15 12:38:14

by Jeff Squyres

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Sep 15, 2009, at 3:03 AM, KOSAKI Motohiro wrote:

> - I guess you have your MPI implementaion w/ ummunotify, right?
> - I guess you have test sevaral pattern, right?
> if so, can we see your test result?
>


Roland's answers to the rest of these questions were spot-on, so I
thought I'd just throw in a quick reply to the above questions: yes,
we have a prototype Open MPI implementation with code that uses
ummunotify (http://bitbucket.org/jsquyres/ummunot/). I just finished
fixing a high-priority (but unrelated) bug in Open MPI, so merging the
prototype ummunotify code into the upstream Open MPI repository is now
at the top of my priority list.

We have done quite a bit of testing with ummunotify, but since the
code is not yet in the Open MPI mainline, most of the testing has been
manual (not through our automated testing system). As far as we can
tell, everything is working properly with Open MPI + ummunotify. We
also anticipate that other MPI implementations will be able to use
ummunotify, potentially using Open MPI as a reference ummunotify
implementation.

FWIW: we went through a bunch of design and implementation iterations
with Roland to get code that everyone was happy with:

- Roland likes it (and anticipated that the kernel community would be
receptive to)
- we like it
- performs correctly

Hope that helps.

--
Jeff Squyres
[email protected]

2009-09-15 14:58:26

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


> I don't remember seeing discussion of this on lkml. Yes it is in
> -next...

eg http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
and earlier patches.

> Basically it allows app to 'trace itself'? ...with interesting mmap()
> interface, exporting int to userspace, hoping it behaves atomically...?

Yes, it allows app to trace what the kernel does to memory mappings. I
don't believe there's any real issue to atomicity of mmap'ed memory,
since userspace really just tests whether read value is == to old read
value or not.

- R.

2009-09-16 16:30:16

by Roland Dreier

[permalink] [raw]
Subject: Re: [GIT PULL] please pull ummunotify

Hi Linus,

Sorry to hassle you about this, but I would like to know where things
stand. I know (from the reflink discussion if nothing else) that you're
definitely not bashful about telling people when their code sucks, so
this silent treatment has me really flustered. I've been showering and
brushing my teeth and everything, honest!

Seriously, this code solves a problem that the MPI/HPC people have been
complaining about for quite a while, and if possible I'd like to get
this upstream. Or if you have a better idea, I'm all ears...

Thanks,
Roland

> Linus, please consider pulling from
>
> master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This tree is also available from kernel.org mirrors at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementions (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings. My colleague Jeff Squyres from the Open
> MPI project posted a blog entry about why MPI wants this:
>
> http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
>
> His summary of ummunotify:
>
> "It’s elegant, doesn’t require strange linker tricks, and seems to
> work in all cases. Yay!"
>
> This code went through several review iterations on lkml and was in
> -mm and -next for quite a few weeks. Andrew is OK with merging it (I
> think -- Andrew please correct me if I misunderstood you).
>
> Roland Dreier (1):
> ummunotify: Userspace support for MMU notifications
>
> Documentation/Makefile | 3 +-
> Documentation/ummunotify/Makefile | 7 +
> Documentation/ummunotify/ummunotify.txt | 150 ++++++++
> Documentation/ummunotify/umn-test.c | 200 +++++++++++
> drivers/char/Kconfig | 12 +
> drivers/char/Makefile | 1 +
> drivers/char/ummunotify.c | 566 +++++++++++++++++++++++++++++++
> include/linux/Kbuild | 1 +
> include/linux/ummunotify.h | 121 +++++++
> 9 files changed, 1060 insertions(+), 1 deletions(-)
> create mode 100644 Documentation/ummunotify/Makefile
> create mode 100644 Documentation/ummunotify/ummunotify.txt
> create mode 100644 Documentation/ummunotify/umn-test.c
> create mode 100644 drivers/char/ummunotify.c
> create mode 100644 include/linux/ummunotify.h

2009-09-16 16:41:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] please pull ummunotify



On Wed, 16 Sep 2009, Roland Dreier wrote:
>
> Sorry to hassle you about this, but I would like to know where things
> stand. I know (from the reflink discussion if nothing else) that you're
> definitely not bashful about telling people when their code sucks, so
> this silent treatment has me really flustered. I've been showering and
> brushing my teeth and everything, honest!

I just haven't had time to look into the issue, so I'm merging the code
that I know I need to merge, and hopefully I'll have a breather later when
I can actually look at code and the thread it all spawned.,

Linus

2009-09-17 11:30:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [GIT PULL] please pull ummunotify

On Thu, 2009-09-10 at 21:38 -0700, Roland Dreier wrote:
> Linus, please consider pulling from
>
> master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This tree is also available from kernel.org mirrors at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementions (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings. My colleague Jeff Squyres from the Open
> MPI project posted a blog entry about why MPI wants this:
>
> http://blogs.cisco.com/ciscotalk/performance/comments/better_linux_memory_tracking/
>
> His summary of ummunotify:
>
> "It’s elegant, doesn’t require strange linker tricks, and seems to
> work in all cases. Yay!"
>
> This code went through several review iterations on lkml and was in
> -mm and -next for quite a few weeks. Andrew is OK with merging it (I
> think -- Andrew please correct me if I misunderstood you).

Anton Blanchard suggested a while back that this might be integrated
with perf-counters, since perf-counters already does mmap() tracking and
also provides events through an mmap()'ed buffer.

Has anybody looked into this?

If someone did and I missed the discussion on why it isn't appropriate,
kindly point me in the right direction ;-)

2009-09-17 14:24:45

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


> Anton Blanchard suggested a while back that this might be integrated
> with perf-counters, since perf-counters already does mmap() tracking and
> also provides events through an mmap()'ed buffer.
>
> Has anybody looked into this?

I didn't see the original suggestion. Certainly hooking in to existing
infrastructure for user/kernel communication would be good.

The fit doesn't seem great to me, although I am rather naive about perf
counters. The problem that ummunotify is trying to solve is to let an
app say 'for these 1000 address ranges (that possibly only cover a small
part of my total address space), please let me know when the mappings
are invalidated for any reason'.

So getting those events in the kernel is no problem -- we have the MMU
notifier hooks that tell us exactly what we need to know. The issue is
purely the way userspace registers interest in address ranges, and how
to kernel returns the events.

For perf counters it seems that one would have to create a new counter
for each address range... is that correct? And also I don't know if
perf counter has an analog for the fast path optimization that
ummunotify provides via a mmap'ed generation counter (a quick way for
userspace to see 'nothing happened since last time you checked').

- R.

2009-09-17 14:32:54

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


> So getting those events in the kernel is no problem -- we have the MMU
> notifier hooks that tell us exactly what we need to know. The issue is
> purely the way userspace registers interest in address ranges, and how
> to kernel returns the events.
>
> For perf counters it seems that one would have to create a new counter
> for each address range... is that correct? And also I don't know if
> perf counter has an analog for the fast path optimization that
> ummunotify provides via a mmap'ed generation counter (a quick way for
> userspace to see 'nothing happened since last time you checked').

Oh I forgot... ummunotify also preallocates everything etc. so that
there is no way for events to be lost. Which saves userspace from
having to trash everything cached and start over, which it would have to
do if it misses an invalidate event.

And AFAIK, pref counters does have the possibility of overflowing a
buffer and losing an event, right?

- R.

2009-09-17 14:43:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Thu, 2009-09-17 at 07:24 -0700, Roland Dreier wrote:
> > Anton Blanchard suggested a while back that this might be integrated
> > with perf-counters, since perf-counters already does mmap() tracking and
> > also provides events through an mmap()'ed buffer.
> >
> > Has anybody looked into this?
>
> I didn't see the original suggestion. Certainly hooking in to existing
> infrastructure for user/kernel communication would be good.
>
> The fit doesn't seem great to me, although I am rather naive about perf
> counters. The problem that ummunotify is trying to solve is to let an
> app say 'for these 1000 address ranges (that possibly only cover a small
> part of my total address space), please let me know when the mappings
> are invalidated for any reason'.
>
> So getting those events in the kernel is no problem -- we have the MMU
> notifier hooks that tell us exactly what we need to know. The issue is
> purely the way userspace registers interest in address ranges, and how
> to kernel returns the events.
>
> For perf counters it seems that one would have to create a new counter
> for each address range... is that correct? And also I don't know if
> perf counter has an analog for the fast path optimization that
> ummunotify provides via a mmap'ed generation counter (a quick way for
> userspace to see 'nothing happened since last time you checked').

You're right in that perf-counter currently doesn't provide a way to
specify these ranges, we simply track all mmap() traffic.

The thing is that mmap() data is basically a side channel. For your
usage you'd basically have to open a NOP counter and only observe the
mmap data.

We could look at ways of adding ranges..

We do have a means of detecting if new data is available, we keep a data
head index. If that moves, you've got new stuff.


2009-09-17 14:49:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Thu, 2009-09-17 at 07:32 -0700, Roland Dreier wrote:
> > So getting those events in the kernel is no problem -- we have the MMU
> > notifier hooks that tell us exactly what we need to know. The issue is
> > purely the way userspace registers interest in address ranges, and how
> > to kernel returns the events.
> >
> > For perf counters it seems that one would have to create a new counter
> > for each address range... is that correct? And also I don't know if
> > perf counter has an analog for the fast path optimization that
> > ummunotify provides via a mmap'ed generation counter (a quick way for
> > userspace to see 'nothing happened since last time you checked').
>
> Oh I forgot... ummunotify also preallocates everything etc. so that
> there is no way for events to be lost. Which saves userspace from
> having to trash everything cached and start over, which it would have to
> do if it misses an invalidate event.
>
> And AFAIK, pref counters does have the possibility of overflowing a
> buffer and losing an event, right?

Well, you cannot pre-allocate everything, either you get back-logged
evens in kernel space leading to a kernel DoS, or you loose events.

Perf counters have two modes, a RO mmap() and a RW mmap(). The RO mode
will automagically overwrite its tail data without regard for userspace
having observed it.

In the RW mode userspace has to advance the tail, the kernel will drop
events when full and insert a PERF_EVENT_LOST event once there is room
again.

Hmm, or are you saying you can only get 1 event per registered range and
allocate the thing on registration? That'd need some registration limit
to avoid DoS scenarios.

2009-09-17 15:03:25

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


> Hmm, or are you saying you can only get 1 event per registered range and
> allocate the thing on registration? That'd need some registration limit
> to avoid DoS scenarios.

Yes, that's what I do. You're right, I should add a limit... although
their are lots of ways for userspace to consume arbitrary amounts of
kernel resources already.

- R.

2009-09-17 15:22:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Thu, 2009-09-17 at 08:03 -0700, Roland Dreier wrote:
> > Hmm, or are you saying you can only get 1 event per registered range and
> > allocate the thing on registration? That'd need some registration limit
> > to avoid DoS scenarios.
>
> Yes, that's what I do. You're right, I should add a limit... although
> their are lots of ways for userspace to consume arbitrary amounts of
> kernel resources already.

I'd be good to work at reducing that number, not adding to it ;-)

But yeah, I currently don't see a very nice match to perf counters.

2009-09-17 15:45:28

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


> > > Hmm, or are you saying you can only get 1 event per registered range and
> > > allocate the thing on registration? That'd need some registration limit
> > > to avoid DoS scenarios.
> >
> > Yes, that's what I do. You're right, I should add a limit... although
> > their are lots of ways for userspace to consume arbitrary amounts of
> > kernel resources already.
>
> I'd be good to work at reducing that number, not adding to it ;-)

Yes, definitely. I'll add a quick ummunotify module parameter that
limits the number of registrations per process.

> But yeah, I currently don't see a very nice match to perf counters.

OK. It would be nice to tie into something more general, but I think I
agree -- perf counters are missing the filtering and the "no lost
events" that ummunotify does have. And I'm not sure it's worth messing
up the perf counters design just to jam one more not totally related
thing in.

- R.

2009-09-18 11:51:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


* Roland Dreier <[email protected]> wrote:

> > But yeah, I currently don't see a very nice match to perf counters.
>
> OK. It would be nice to tie into something more general, but I think
> I agree -- perf counters are missing the filtering and the "no lost
> events" that ummunotify does have. And I'm not sure it's worth
> messing up the perf counters design just to jam one more not totally
> related thing in.

The filtering can be done and has been done - see Li Zefan's patchset
that uses filter expressions to do per event in-kernel filtering.

The OOM DoS is a bug in your patches i think, which perfcounters solves
;-)

Ingo

2009-09-28 20:49:26

by Pavel Machek

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Tue 2009-09-15 07:57:56, Roland Dreier wrote:
>
> > I don't remember seeing discussion of this on lkml. Yes it is in
> > -next...
>
> eg http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
> and earlier patches.

Well... it seems little overspecialized. Just modifying libc to
provide hooks you want looks like better solution.

> > Basically it allows app to 'trace itself'? ...with interesting mmap()
> > interface, exporting int to userspace, hoping it behaves atomically...?
>
> Yes, it allows app to trace what the kernel does to memory mappings. I
> don't believe there's any real issue to atomicity of mmap'ed memory,
> since userspace really just tests whether read value is == to old read
> value or not.

That still needs memory barriers etc.. to ensure reliable operation,
no?
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-28 21:40:59

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Mon, Sep 28, 2009 at 10:49:23PM +0200, Pavel Machek wrote:

> > > I don't remember seeing discussion of this on lkml. Yes it is in
> > > -next...
> >
> > eg http://lkml.org/lkml/2009/7/31/197 and followups, or search for v2
> > and earlier patches.

> Well... it seems little overspecialized. Just modifying libc to
> provide hooks you want looks like better solution.

That is what MPI people are doing today and their feedback is that it
doesn't work - there are a lot of ways to mess with memory and no good
choices to hook the raw syscalls and keep sensible performance.

The main focus of this is high performance MPI apps, so lower overhead
on critical paths like memory allocation is part of the point. It is
ment to go hand-in-hand with the specialized RDMA memory pinning
interfaces..

> > > Basically it allows app to 'trace itself'? ...with interesting mmap()
> > > interface, exporting int to userspace, hoping it behaves atomically...?
> >
> > Yes, it allows app to trace what the kernel does to memory mappings. I
> > don't believe there's any real issue to atomicity of mmap'ed memory,
> > since userspace really just tests whether read value is == to old read
> > value or not.
>
> That still needs memory barriers etc.. to ensure reliable operation,
> no?

No, I don't think so..

The application is expected to provide sequencing of some sort between
the memory call (mmap/munmap/brk/etc) and the int check - usually just
by running in the same thread, or through some kind of locking scheme.

As long as the mmu notifiers run immediately in the same context as
the mmap/etc then it should be fine.

For example, the most common problem to solve looks like this:

x = mmap(...)
do RDMA with x
[..]
mmunmap(x);

[..]
y = mmap(..);
do RDMA with y
if by chance x == y things explode.

So this API puts the int test directly before 'do RDMA with'.

Due to the above kind of argument the net requirement is either to
completely synchronously (and with low overhead) hook every
mmap/munmap/brk/etc call into the kernel and do the accounting work,
or have a very low over head check every time the memory region is
about to be used.

Jason

2009-09-29 17:13:40

by Pavel Machek

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Thu 2009-09-17 08:45:29, Roland Dreier wrote:
>
> > > > Hmm, or are you saying you can only get 1 event per registered range and
> > > > allocate the thing on registration? That'd need some registration limit
> > > > to avoid DoS scenarios.
> > >
> > > Yes, that's what I do. You're right, I should add a limit... although
> > > their are lots of ways for userspace to consume arbitrary amounts of
> > > kernel resources already.
> >
> > I'd be good to work at reducing that number, not adding to it ;-)
>
> Yes, definitely. I'll add a quick ummunotify module parameter that
> limits the number of registrations per process.
>
> > But yeah, I currently don't see a very nice match to perf counters.
>
> OK. It would be nice to tie into something more general, but I think I
> agree -- perf counters are missing the filtering and the "no lost
> events" that ummunotify does have. And I'm not sure it's worth messing
> up the perf counters design just to jam one more not totally related
> thing in.

I believe that extending perf counters to do what you want is better
than adding one more, very strange, user<->kernel interface.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-30 09:45:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


* Pavel Machek <[email protected]> wrote:

> On Thu 2009-09-17 08:45:29, Roland Dreier wrote:
> >
> >
[...]
> > OK. It would be nice to tie into something more general, but I
> > think I agree -- perf counters are missing the filtering and the "no
> > lost events" that ummunotify does have. [...]

Performance events filtering is being worked on and now with the proper
non-DoS limit you've added you can lose events too, dont you? So it's
all a question of how much buffering to add - and with perf events too
you can buffer arbitrary large amount of events.

> > [...] And I'm not sure it's worth messing up the perf counters
> > design just to jam one more not totally related thing in.

Nobody suggested details for any redesign yet (so far it seems like a
perfect match, to me at least) so i'm wondering what messup you are
referring to.

> I believe that extending perf counters to do what you want is better
> than adding one more, very strange, user<->kernel interface.

Agreed.

Lemme react to the original description of the code:

> git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git ummunotify
>
> This will get "ummunotify," a new character device that allows a
> userspace library to register for MMU notifications; this is
> particularly useful for MPI implementions (message passing libraries
> used in HPC) to be able to keep track of what wacky things consumers
> do to their memory mappings.

I test-pulled this code and had a look at it.

I think this could be done in a simpler, less limited, more generic,
more useful form by using some variation of perf events.

You should be able to get all that you want by adding two TRACE_EVENT()
tracepoints and using the existing perf event syscall to get the events
to user-space.

Meaning that this:

9 files changed, 1060 insertions(+), 1 deletions(-)

Would be replaced with something like:

2 files changed, 100 insertions(+), 0 deletions(-)

[ the +100 lines would (roughly) would add tracepoints to
invalidate_page and invalidate_range_start. (possibly via
mmu_notifier_register() like the ummunotify code does) Most of that
linecount would be comments. ]

Another upside, beyond the reduction in complexity is that we'd have one
less special char driver based ABI. Which is a big plus in my opinion,
especially if this goes towards HPC folks and if it's used for real. Why
should such a MM capability hidden behind a character device and an
ioctl?

The perf event approach is beneficial to non-HPC as well: MM
instrumentation for example - page range invalidates are interesting to
all sorts of modi of analysis.

A question: what is the typical size/scope of the rbtree of the watched
regions of memory in practical (test) deployments of the ummunofity
code?

Per tracepoint filtering is possible via the perf event patches Li Zefan
has posted to lkml recently, under this subject:

[PATCH 0/6] perf trace: Add filter support

They are still being worked on but it's very clear that flexible
in-kernel filtering support will be a natural part of the perf event
design in the very near future, so if that alone is your reason not to
use it it would be better if you helped us complete/test the filter
support and use that, instead of a parallel framework.

Or if that's not desirable or not possible, or if there's any other
technical roadblock, i'd like to know the particulars of that.

Thanks,

Ingo

2009-09-30 16:02:50

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Wed, Sep 30, 2009 at 11:44:56AM +0200, Ingo Molnar wrote:
> > > OK. It would be nice to tie into something more general, but I
> > > think I agree -- perf counters are missing the filtering and the "no
> > > lost events" that ummunotify does have. [...]
>
> Performance events filtering is being worked on and now with the proper
> non-DoS limit you've added you can lose events too, dont you? So it's
> all a question of how much buffering to add - and with perf events too
> you can buffer arbitrary large amount of events.

No, the ummunotify does not loose events, that is the fundamental
difference between it and all tracing schemes.

Every call to ibv_reg_mr is paired with a call to ummunotify to create
a matching watcher. Both calls allocate some kernel memory, if one
fails the entire operation fails and userspace can do whatever it does
on memory allocation failure.

After that point the scheme is perfectly lossless.

Performance event filtering would use the same kind of kernel memory,
call ibv_reg_mr, then install a filter, both allocate kernel memory,
if one fails the op fails. But then when the ring buffer overflows
you've lost events.

All the tracing schemes are lossy - since they loose events when the
ring buffer fills up. So to do that we either need to make a recovery
scheme of some sort, or make trace points that are blocking..

So, here is a concrete proposal how ummunotify could be absorbed by
perf events tracing, with filters.
- The filter expression must be able to trigger on a MMU event,
triggering on the intersection of the MMU event address range and
filter expression address range.
- The traces must be choosen so that there is exactly one filter
expression per ibv_reg_mr region
- Each filter has a clearable saturating counter that increments every
time the filter matches an event
- Each filter has a 64 bit user space assigned tag.
- An API similar to ummunotify exists:
struct perf_filter_tag foo[100]
int rc = perf_filters_read_and_clear_non_zero_counters(foo,100);
- Optionally - the mmap ring would contain only 64 bit user space
filter tags, not trace events.

This would then duplicate the functions of ummunotify, including the
lossless collection of events. The flow would more or less be the
same:

struct my_data *ptr = calloc()
ptr->reg_handle = ibv_reg_mr(base,len)
ptr->filter_handle = perf_filter_register("string matching base->len",ptr)

[..]
// fast path
if (atomically(perf_map->head) != last_perf_map_head) {
struct perf_filter_tag foo[100]
int rc = perf_filters_read_and_clear_non_zero_counters(foo,100);
for (unsigned int i = 0; i != rc; i++)
((struct my_data *)foo[i])->invalid = 1;

perf_empty_mmap_ring(perf_map);
}

If 'optionally' is done then the app can trundle through the mmap
and only use the above syscall loop if the mmap overflows. That would
be quite ideal.

It also must be guarenteed that when a trace point is hit the mmap
atomics are updated and visible to another user space thread before
the trace point returns - otherwise it is not synchronous enough and
will be racey.

> A question: what is the typical size/scope of the rbtree of the watched
> regions of memory in practical (test) deployments of the ummunofity
> code?

Jeff can you comment?

IIRC it is many tens (hundreds?) of thousands of watches.

> Per tracepoint filtering is possible via the perf event patches Li Zefan
> has posted to lkml recently, under this subject:

Performance of the filter add is probably a bit of a concern..

Regards,
Jason

2009-09-30 17:06:31

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


> Performance events filtering is being worked on and now with the proper
> non-DoS limit you've added you can lose events too, dont you? So it's
> all a question of how much buffering to add - and with perf events too
> you can buffer arbitrary large amount of events.

No, the idea for non-DoS for ummunotify is that we would limit the
number of regions the application can register; so an application might
hit the limit up front but no runtime loss of events once a region was
registered successfully.

> I think this could be done in a simpler, less limited, more generic,
> more useful form by using some variation of perf events.
>
> You should be able to get all that you want by adding two TRACE_EVENT()
> tracepoints and using the existing perf event syscall to get the events
> to user-space.

Yes, I would like to use perf events too. Would it be plausible to
create a way for userspace to create a "counter" for each address range
being watched? Then events would not be lost, because those counters
would become non-zero.

> Meaning that this:
> 9 files changed, 1060 insertions(+), 1 deletions(-)

Note that lots/ of the files touched here are in Documentation or are
one-line changes to Makefiles etc.

- R.

2009-10-02 16:34:48

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


> Per tracepoint filtering is possible via the perf event patches Li Zefan
> has posted to lkml recently, under this subject:
>
> [PATCH 0/6] perf trace: Add filter support
>
> They are still being worked on but it's very clear that flexible
> in-kernel filtering support will be a natural part of the perf event
> design in the very near future, so if that alone is your reason not to
> use it it would be better if you helped us complete/test the filter
> support and use that, instead of a parallel framework.
>
> Or if that's not desirable or not possible, or if there's any other
> technical roadblock, i'd like to know the particulars of that.

So I looked a little deeper into this, and I don't think (even with the
filtering extensions) that perf events are directly applicable to this
problem. The first issue is that, assuming I'm understanding the
comment in perf_event.c:

/*
* Raw tracepoint data is a severe data leak, only allow root to
* have these.
*/

currently tracepoints can only be used by privileged processes. A key
feature of ummunotify is that ordinary unprivileged processes can use it.

So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
as a way of letting unprivileged userspace get access to just MMU events
for their own process? Clearly this touches core infrastructure and is
not as simple as just adding two tracepoints.

Then, assuming we have some way to create an "MMU notifier" perf event,
we need a way for userspace to specify which address ranges it would
like events for (I don't think the string filter expression used by
existing trace filtering works, because if userspace is looking at a few
hundred regions, then the size of the filtering expression explodes, and
adding or removing a single range becomes a pain). So I guess a new
ioctl() to add/remove ranges for MMU_NOTIFIER perf events?

I think filtering is needed, because otherwise events for ranges that
are not of interest are just a waste of resources to generate and
process, and make losing good events because of overflow much more
likely.

We still have the problem of lost events if the mmap buffer overflows,
but userspace should be able to size the buffer so that such events are
rare I guess.

In the end this seems to just take the ummunotify code I have, and make
it be a new type of perf counter instead of a character special device.
I'd actually be OK with that, since having an oddball new char dev
interface is not particularly nice. But on the other hand just
multiplexing a new type of thing under perf events is not all that much
better. What do you think?

Thanks,
Roland

2009-10-02 20:45:33

by Pavel Machek

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

Hi!

> In the end this seems to just take the ummunotify code I have, and make
> it be a new type of perf counter instead of a character special device.
> I'd actually be OK with that, since having an oddball new char dev
> interface is not particularly nice. But on the other hand just
> multiplexing a new type of thing under perf events is not all that much
> better. What do you think?

I really hate the strange character device. So if you can hide it in
tracing infrastructure, I'd certainly like that.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-10-07 22:35:28

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


> So I looked a little deeper into this, and I don't think (even with the
> filtering extensions) that perf events are directly applicable to this
> problem. The first issue is that, assuming I'm understanding the
> comment in perf_event.c:
>
> /*
> * Raw tracepoint data is a severe data leak, only allow root to
> * have these.
> */
>
> currently tracepoints can only be used by privileged processes. A key
> feature of ummunotify is that ordinary unprivileged processes can use it.
>
> So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
> as a way of letting unprivileged userspace get access to just MMU events
> for their own process? Clearly this touches core infrastructure and is
> not as simple as just adding two tracepoints.
>
> Then, assuming we have some way to create an "MMU notifier" perf event,
> we need a way for userspace to specify which address ranges it would
> like events for (I don't think the string filter expression used by
> existing trace filtering works, because if userspace is looking at a few
> hundred regions, then the size of the filtering expression explodes, and
> adding or removing a single range becomes a pain). So I guess a new
> ioctl() to add/remove ranges for MMU_NOTIFIER perf events?
>
> I think filtering is needed, because otherwise events for ranges that
> are not of interest are just a waste of resources to generate and
> process, and make losing good events because of overflow much more
> likely.
>
> We still have the problem of lost events if the mmap buffer overflows,
> but userspace should be able to size the buffer so that such events are
> rare I guess.
>
> In the end this seems to just take the ummunotify code I have, and make
> it be a new type of perf counter instead of a character special device.
> I'd actually be OK with that, since having an oddball new char dev
> interface is not particularly nice. But on the other hand just
> multiplexing a new type of thing under perf events is not all that much
> better. What do you think?

Ingo/Peter/<anyone suggesting perf events> -- can you comment on this
plan of creating PERF_TYPE_MMU_NOTIFIER for perf events to implement
ummunotify? To me it looks like a wash -- the main difference is how
userspace gets the magic ummunotify file descriptor, either by
open("/dev/ummunotify") or by perf_event_open(...PERF_TYPE_MMU_NOTIFIER...),
but pretty much everything else stays pretty much the same in terms of
how much kernel code is involved. We do reuse the perf events mmap
buffer code but I think that ends up being more complicated than
returning events via read().

Anyway, before I spend the time converting over to the new
infrastructure and causing the MPI guys to churn their code, I'd like to
make sure that this is what you guys have in mind.

(By the way, after thinking about this more, I really do think that
filtering events by address range is a must-have -- with filtering,
userspace can map sufficient buffer space to avoid losing events for a
given number of regions; without filtering, events might get lost just
because of invalidate events for ranges userspace didn't even care about)

Thanks,
Roland

2009-10-12 17:34:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Wed, 2009-10-07 at 15:34 -0700, Roland Dreier wrote:
> > So I looked a little deeper into this, and I don't think (even with the
> > filtering extensions) that perf events are directly applicable to this
> > problem. The first issue is that, assuming I'm understanding the
> > comment in perf_event.c:
> >
> > /*
> > * Raw tracepoint data is a severe data leak, only allow root to
> > * have these.
> > */
> >
> > currently tracepoints can only be used by privileged processes. A key
> > feature of ummunotify is that ordinary unprivileged processes can use it.
> >
> > So would it be acceptable to add something like PERF_TYPE_MMU_NOTIFIER
> > as a way of letting unprivileged userspace get access to just MMU events
> > for their own process? Clearly this touches core infrastructure and is
> > not as simple as just adding two tracepoints.
> >
> > Then, assuming we have some way to create an "MMU notifier" perf event,
> > we need a way for userspace to specify which address ranges it would
> > like events for (I don't think the string filter expression used by
> > existing trace filtering works, because if userspace is looking at a few
> > hundred regions, then the size of the filtering expression explodes, and
> > adding or removing a single range becomes a pain). So I guess a new
> > ioctl() to add/remove ranges for MMU_NOTIFIER perf events?
> >
> > I think filtering is needed, because otherwise events for ranges that
> > are not of interest are just a waste of resources to generate and
> > process, and make losing good events because of overflow much more
> > likely.
> >
> > We still have the problem of lost events if the mmap buffer overflows,
> > but userspace should be able to size the buffer so that such events are
> > rare I guess.
> >
> > In the end this seems to just take the ummunotify code I have, and make
> > it be a new type of perf counter instead of a character special device.
> > I'd actually be OK with that, since having an oddball new char dev
> > interface is not particularly nice. But on the other hand just
> > multiplexing a new type of thing under perf events is not all that much
> > better. What do you think?
>
> Ingo/Peter/<anyone suggesting perf events> -- can you comment on this
> plan of creating PERF_TYPE_MMU_NOTIFIER for perf events to implement
> ummunotify? To me it looks like a wash -- the main difference is how
> userspace gets the magic ummunotify file descriptor, either by
> open("/dev/ummunotify") or by perf_event_open(...PERF_TYPE_MMU_NOTIFIER...),
> but pretty much everything else stays pretty much the same in terms of
> how much kernel code is involved. We do reuse the perf events mmap
> buffer code but I think that ends up being more complicated than
> returning events via read().
>
> Anyway, before I spend the time converting over to the new
> infrastructure and causing the MPI guys to churn their code, I'd like to
> make sure that this is what you guys have in mind.
>
> (By the way, after thinking about this more, I really do think that
> filtering events by address range is a must-have -- with filtering,
> userspace can map sufficient buffer space to avoid losing events for a
> given number of regions; without filtering, events might get lost just
> because of invalidate events for ranges userspace didn't even care about)

I think something like

PERF_TYPE_SOFTWARE, PERF_COUNT_SW_MUNMAP + $filter

or

PERF_TYPE_TRACEPOINT, //events/vm/munmap/id + $filter

As for the read/poll issue, I think we can do something like
PERF_FORMAT_BLOCK which would make read() block when ->count hasn't
changed, and make poll() work without requiring a mmap().

As to filter, we can do two things, add a simple single range filter to
perf_event_attr, which is something ia64 has hardware support for IIRC,
or we can possibly use this trace filter muck.

Would something like that be sufficient? With such events only
generating a wakeup (poll) when the unmap actually happens, you'd not
even need an mmap() buffer to keep up with that.

2009-10-12 18:21:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


* Jason Gunthorpe <[email protected]> wrote:

> On Wed, Sep 30, 2009 at 11:44:56AM +0200, Ingo Molnar wrote:
> > > > OK. It would be nice to tie into something more general, but I
> > > > think I agree -- perf counters are missing the filtering and the "no
> > > > lost events" that ummunotify does have. [...]
> >
> > Performance events filtering is being worked on and now with the
> > proper non-DoS limit you've added you can lose events too, dont you?
> > So it's all a question of how much buffering to add - and with perf
> > events too you can buffer arbitrary large amount of events.
>
> No, the ummunotify does not loose events, that is the fundamental
> difference between it and all tracing schemes.
>
> Every call to ibv_reg_mr is paired with a call to ummunotify to create
> a matching watcher. Both calls allocate some kernel memory, if one
> fails the entire operation fails and userspace can do whatever it does
> on memory allocation failure.

We already support signal notification for perf events, and we also
support two modi of perf ring-buffer overflow notification. Adding a
third one that sends a signal when events are lost would be in line with
that.

This would allow you to have the OOM semantics of requesting a SIGBUS -
or user-space could do other things: like print a warning in the app or
ignore the event overflow.

Which are all interesting things to do. (If you do that you might want
to add that to 'perf top' or 'perf record' as well.)

> After that point the scheme is perfectly lossless.

Well if it can OOM it's not lossless, obviously. You just define "event
loss" to be equivalent to "Destruction of the universe." ;-)

Ingo

2009-10-12 19:31:44

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:

> > After that point the scheme is perfectly lossless.
>
> Well if it can OOM it's not lossless, obviously. You just define "event
> loss" to be equivalent to "Destruction of the universe." ;-)

It can't OOM once the ummunotify registration is done - when an event
occurs it doesn't allocate any memory and it doesn't loose events.

It has the same problem as perf - you either bound the number/size of
filters, or let user space allocate filters until the box OOMs. perf
has the additonal problem that even with filters you can still loose
events if the event ring overflows.

Jason

2009-10-12 20:21:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


* Jason Gunthorpe <[email protected]> wrote:

> On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:
>
> > > After that point the scheme is perfectly lossless.
> >
> > Well if it can OOM it's not lossless, obviously. You just define
> > "event loss" to be equivalent to "Destruction of the universe." ;-)
>
> It can't OOM once the ummunotify registration is done - when an event
> occurs it doesn't allocate any memory and it doesn't loose events.

Well, it has built-in event loss via the UMMUNOTIFY_FLAG_HINT mechanism:
any double events on the same range will cause an imprecise event to be
recorded and cause the loss of information.

Is that loss of information more acceptable than the loss of information
via the loss of events?

It might be more acceptable because the flag-hint mechanism can at most
cause over-flushing - while with perf events we might miss to invalidate
a range altogether.

Ingo

2009-10-13 04:06:05

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Mon, Oct 12, 2009 at 10:20:46PM +0200, Ingo Molnar wrote:
> It might be more acceptable because the flag-hint mechanism can at most
> cause over-flushing - while with perf events we might miss to invalidate
> a range altogether.

Right. Overflushing is not important, but missing an event entirely is not
recoverable (at least within the current kernel APIs).

Jason

2009-10-13 05:45:14

by Brice Goglin

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

Ingo Molnar wrote:
> * Jason Gunthorpe <[email protected]> wrote:
>
>
>> On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:
>>
>>
>>>> After that point the scheme is perfectly lossless.
>>>>
>>> Well if it can OOM it's not lossless, obviously. You just define
>>> "event loss" to be equivalent to "Destruction of the universe." ;-)
>>>
>> It can't OOM once the ummunotify registration is done - when an event
>> occurs it doesn't allocate any memory and it doesn't loose events.
>>
>
> Well, it has built-in event loss via the UMMUNOTIFY_FLAG_HINT mechanism:
> any double events on the same range will cause an imprecise event to be
> recorded and cause the loss of information.
>

The target (MPI) application doesn't care about how many events are
coming here. It just needs to know whether something has been
invalidated in the range. If so, it destroy the whole RDMA window
anyway. So it's actually _good_ that multiple events are merged into a
single one: the application only has to process a single event per
partially-invalidated range.

Brice

2009-10-13 06:39:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


* Brice Goglin <[email protected]> wrote:

> Ingo Molnar wrote:
> > * Jason Gunthorpe <[email protected]> wrote:
> >
> >
> >> On Mon, Oct 12, 2009 at 08:19:44PM +0200, Ingo Molnar wrote:
> >>
> >>
> >>>> After that point the scheme is perfectly lossless.
> >>>>
> >>> Well if it can OOM it's not lossless, obviously. You just define
> >>> "event loss" to be equivalent to "Destruction of the universe." ;-)
> >>>
> >> It can't OOM once the ummunotify registration is done - when an event
> >> occurs it doesn't allocate any memory and it doesn't loose events.
> >>
> >
> > Well, it has built-in event loss via the UMMUNOTIFY_FLAG_HINT mechanism:
> > any double events on the same range will cause an imprecise event to be
> > recorded and cause the loss of information.
> >
>
> The target (MPI) application doesn't care about how many events are
> coming here. It just needs to know whether something has been
> invalidated in the range. If so, it destroy the whole RDMA window
> anyway. So it's actually _good_ that multiple events are merged into a
> single one: the application only has to process a single event per
> partially-invalidated range.

it's not unconditionally good as the fuzzy-merge-events rule:

events[n].flags = UMMUNOTIFY_EVENT_FLAG_HINT;
events[n].hint_start = max(reg->start, reg->hint_start);
events[n].hint_end = min(reg->end, reg->hint_end);

in essence merges flushes into a single interval - which inevitably
might include areas of memory that were not flushed at all.

For example these two flushes:

[...] [...]

Would be merged into:

[..................]

Btw., isnt the above max/min logic buggy, causing lost events? Shouldnt
it be:

events[n].hint_start = min(reg->start, reg->hint_start);
events[n].hint_end = max(reg->end, reg->hint_end);

?

Ingo

2009-10-13 06:40:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify


* Jason Gunthorpe <[email protected]> wrote:

> On Mon, Oct 12, 2009 at 10:20:46PM +0200, Ingo Molnar wrote:
> > It might be more acceptable because the flag-hint mechanism can at most
> > cause over-flushing - while with perf events we might miss to invalidate
> > a range altogether.
>
> Right. Overflushing is not important, but missing an event entirely is
> not recoverable (at least within the current kernel APIs).

So if we detect event loss in the perf event case (should not happen
with sufficient buffering but it is a possibility the code should be
prepared for) then we can just flush the [0,-1ULL] range, right?

Ingo

2009-10-13 16:28:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

On Tue, Oct 13, 2009 at 08:40:06AM +0200, Ingo Molnar wrote:
>
> * Jason Gunthorpe <[email protected]> wrote:
>
> > On Mon, Oct 12, 2009 at 10:20:46PM +0200, Ingo Molnar wrote:
> > > It might be more acceptable because the flag-hint mechanism can at most
> > > cause over-flushing - while with perf events we might miss to invalidate
> > > a range altogether.
> >
> > Right. Overflushing is not important, but missing an event entirely is
> > not recoverable (at least within the current kernel APIs).
>
> So if we detect event loss in the perf event case (should not happen
> with sufficient buffering but it is a possibility the code should be
> prepared for) then we can just flush the [0,-1ULL] range, right?

No, the reason overflushing within a registration is OK is because of
how the MPI APIs are defined and typically used. The map and
registration window will typically be 1:1 ie you malloc something and
then register it. It is an error to register beyond your malloced
space. So, in truth, the hint stuff isn't really essential for MPI.

flushing all ranges would result in data loss since ranges may be in
use at the time, and 'flush' is actually unregister/reregister - the
hardware cannot do in place atomic modify.

Jason