2008-02-08 22:07:58

by Christoph Lameter

[permalink] [raw]
Subject: [patch 0/6] MMU Notifiers V6

This is a patchset implementing MMU notifier callbacks based on Andrea's
earlier work. These are needed if Linux pages are referenced by something
other than what is tracked by the kernel's rmaps (an external MMU). MMU
notifiers allow us to get rid of the page pinning for RDMA and various
other purposes. They also get rid of the broken use of mlock for page
pinning. (mlock really does *not* pin pages....)

More information on the rationale and the technical details can be found in
the first patch and the README provided by that patch in
Documentation/mmu_notifiers.
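
For readers new to the interface, here is a rough sketch of what a
driver-side consumer of the notifiers looks like. The callback prototypes
follow the shape the API had when it was eventually merged into mainline;
this V6 patchset uses slightly different names (invalidate_range_begin/end)
and locking rules, and the external-TLB helpers are made-up stand-ins, so
treat everything below as illustrative rather than as code from the patch.

/*
 * Illustrative sketch only: mirror a process address space into an
 * external TLB via mmu notifiers. Prototypes follow the mainline API
 * as it later stabilized, not this exact patchset; the my_external_*
 * helpers are empty stand-ins for real driver code.
 */
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct my_mirror {
	struct mmu_notifier mn;
	/* driver-private external TLB state would live here */
};

static void my_external_tlb_flush(struct my_mirror *m,
				  unsigned long start, unsigned long end)
{
	/* stand-in: shoot down the device TLB for [start, end) */
}

static void my_external_tlb_flush_all(struct my_mirror *m)
{
	/* stand-in: shoot down the entire device TLB for this mm */
}

/* The VM is about to unmap/reclaim [start, end); drop external references. */
static void my_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end)
{
	my_external_tlb_flush(container_of(mn, struct my_mirror, mn),
			      start, end);
}

/* The address space is going away entirely (process exit). */
static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	my_external_tlb_flush_all(container_of(mn, struct my_mirror, mn));
}

static const struct mmu_notifier_ops my_ops = {
	.invalidate_range_start	= my_invalidate_range_start,
	.release		= my_release,
};

/* Called once per mm that the device wants to mirror. */
static int my_mirror_register(struct my_mirror *m, struct mm_struct *mm)
{
	m->mn.ops = &my_ops;
	return mmu_notifier_register(&m->mn, mm);
}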

The known immediate users are

KVM
- Establishes a refcount on the page via get_user_pages().
- External references are called sptes.
- Has page tables to track pages whose refcount was elevated, but
no reverse maps.

GRU
- Simple additional hardware TLB (possibly covering multiple instances of
Linux)
- Needs TLB shootdown when the VM unmaps pages.
- Determines page address via follow_page (from interrupt context) but can
fall back to get_user_pages().
- No page reference possible since no page status is kept.

XPmem
- Allows use of a process's memory by remote instances of Linux.
- Provides its own reverse mappings to track remote ptes.
- Establishes refcounts on the exported pages.
- Must sleep in order to wait for remote acks of ptes that are being
cleared.

Andrea's mmu_notifier #4 -> RFC V1

- Merge the subsystem rmap based approach with the Linux rmap based approach
- Move the Linux rmap based notifiers out of the macro
- Try to account for what locks are held while the notifiers are
called.
- Develop a patch sequence that separates out the different types of
hooks so that we can review their use.
- Avoid adding include to linux/mm_types.h
- Integrate RCU logic suggested by Peter.

V1->V2:
- Improve RCU support
- Use mmap_sem for mmu_notifier register / unregister
- Drop invalidate_page from COW, mm/fremap.c and mm/rmap.c since we
already have invalidate_range() callbacks there.
- Clean compile for !MMU_NOTIFIER
- Isolate filemap_xip strangeness into its own diff
- Pass a flag to invalidate_range() to indicate whether a spinlock
is held.
- Add invalidate_all()

V2->V3:
- Further RCU fixes
- Fixes from Andrea to fix up aging and to move invalidate_range() in do_wp_page()
and sys_remap_file_pages() after the pte clearing.

V3->V4:
- Drop locking and synchronize_rcu() on ->release since we know on release that
we are the only executing thread. This is also true for invalidate_all() so
we could drop off the mmu_notifier there early. Use hlist_del_init instead
of hlist_del_rcu.
- Do the invalidation as begin/end pairs with the requirement that the driver
holds off new references in between.
- Fixup filemap_xip.c
- Figure out a potential way in which XPmem can deal with locks that are held.
- Robin's patches to make the mmu_notifier logic manage the PageRmapExported bit.
- Strip cc list down a bit.
- Drop Peter's new RCU list macro
- Add description to the core patch

V4->V5:
- Provide missing callouts for mremap.
- Provide missing callouts for copy_page_range.
- Reduce mm_struct space to zero if !MMU_NOTIFIER by #ifdeffing out
structure contents.
- Get rid of the invalidate_all() callback by moving ->release in place
of invalidate_all.
- Require holding mmap_sem on register/unregister instead of acquiring it
ourselves. In some contexts where we want to register/unregister we are
already holding mmap_sem.
- Split out the rmap support patch so that there is no need to apply
all patches for KVM and GRU.

V5->V6:
- Provide missing range callouts for mprotect
- Fix do_wp_page control path sequencing
- Clarify locking conventions
- GRU and XPmem confirmed to work with this patchset.
- Provide skeleton code for GRU/KVM type callback and for XPmem type.
- Rework documentation and put it into Documentation/mmu_notifier.

--


2008-02-08 22:24:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/6] MMU Notifiers V6

On Fri, 08 Feb 2008 14:06:16 -0800
Christoph Lameter <[email protected]> wrote:

> This is a patchset implementing MMU notifier callbacks based on Andrea's
> earlier work. These are needed if Linux pages are referenced from something
> else than tracked by the rmaps of the kernel (an external MMU). MMU
> notifiers allow us to get rid of the page pinning for RDMA and various
> other purposes. It gets rid of the broken use of mlock for page pinning.
> (mlock really does *not* pin pages....)
>
> More information on the rationale and the technical details can be found in
> the first patch and the README provided by that patch in
> Documentation/mmu_notifiers.
>
> The known immediate users are
>
> KVM
> - Establishes a refcount to the page via get_user_pages().
> - External references are called spte.
> - Has page tables to track pages whose refcount was elevated but
> no reverse maps.
>
> GRU
> - Simple additional hardware TLB (possibly covering multiple instances of
> Linux)
> - Needs TLB shootdown when the VM unmaps pages.
> - Determines page address via follow_page (from interrupt context) but can
> fall back to get_user_pages().
> - No page reference possible since no page status is kept..
>
> XPmem
> - Allows use of a processes memory by remote instances of Linux.
> - Provides its own reverse mappings to track remote pte.
> - Established refcounts on the exported pages.
> - Must sleep in order to wait for remote acks of ptes that are being
> cleared.
>

What about ib_umem_get()?

2008-02-08 23:32:31

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 0/6] MMU Notifiers V6

On Fri, 8 Feb 2008, Andrew Morton wrote:

> What about ib_umem_get()?

Ok. It pins using an elevated refcount. Same as XPmem right now. With that
we effectively pin a page (page migration will fail) but the VM will
continually try to reclaim the page and may repeatedly try to move it. We
have issues with XPmem causing too many pages to be pinned and thus the
OOM killer getting into weird behavior modes (OOM kills, or LRU scanning
stopping because all_unreclaimable gets set).

An elevated refcount will also not be noticed by any of the schemes under
consideration to improve LRU scanning performance.
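
For reference, the pin-by-refcount pattern being discussed looks roughly
like the sketch below, written against the get_user_pages() interface of
the 2.6.24 era. The real ib_umem_get() additionally does locked-memory
accounting, chunking and DMA mapping, all omitted here, so this is only
an illustration of the pinning itself.

/*
 * Minimal sketch of pinning user memory by elevating page refcounts,
 * in the style of ib_umem_get()/XPmem at the time. Error handling,
 * accounting and DMA mapping are omitted; get_user_pages() is shown
 * with its 2.6.24-era signature.
 */
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/pagemap.h>

static int pin_user_range(unsigned long start, int npages,
			  struct page **pages, int write)
{
	struct mm_struct *mm = current->mm;
	int got;

	down_read(&mm->mmap_sem);
	/* Each returned page holds an extra refcount until put_page(). */
	got = get_user_pages(current, mm, start, npages, write,
			     0 /* force */, pages, NULL);
	up_read(&mm->mmap_sem);

	return got;
}

static void unpin_user_range(struct page **pages, int npages, int dirty)
{
	int i;

	for (i = 0; i < npages; i++) {
		if (dirty)
			set_page_dirty_lock(pages[i]);
		put_page(pages[i]);
	}
}

As long as those extra refcounts are held, reclaim cannot free the pages
and migration cannot move them, which is exactly the behavior under
discussion.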

2008-02-08 23:36:46

by Robin Holt

[permalink] [raw]
Subject: Re: [patch 0/6] MMU Notifiers V6

On Fri, Feb 08, 2008 at 03:32:19PM -0800, Christoph Lameter wrote:
> On Fri, 8 Feb 2008, Andrew Morton wrote:
>
> > What about ib_umem_get()?
>
> Ok. It pins using an elevated refcount. Same as XPmem right now. With that
> we effectively pin a page (page migration will fail) but we will
> continually be reclaiming the page and may repeatedly try to move it. We
> have issues with XPmem causing too many pages to be pinned and thus the
> OOM getting into weird behavior modes (OOM or stop lru scanning due to
> all_reclaimable set).
>
> An elevated refcount will also not be noticed by any of the schemes under
> consideration to improve LRU scanning performance.

Christoph, I am not sure what you are saying here. With v4 and later,
I thought we were able to use the rmap invalidation to remove the ref
count that XPMEM was holding and therefore be able to swap out. Did I miss
something? I agree the existing XPMEM does pin. I hope we are not saying
that the XPMEM based upon these patches will not be able to swap/migrate.

Thanks,
Robin

2008-02-08 23:41:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 0/6] MMU Notifiers V6

On Fri, 8 Feb 2008, Robin Holt wrote:

> > > What about ib_umem_get()?
> >
> > Ok. It pins using an elevated refcount. Same as XPmem right now. With that
> > we effectively pin a page (page migration will fail) but we will
> > continually be reclaiming the page and may repeatedly try to move it. We
> > have issues with XPmem causing too many pages to be pinned and thus the
> > OOM getting into weird behavior modes (OOM or stop lru scanning due to
> > all_reclaimable set).
> >
> > An elevated refcount will also not be noticed by any of the schemes under
> > consideration to improve LRU scanning performance.
>
> Christoph, I am not sure what you are saying here. With v4 and later,
> I thought we were able to use the rmap invalidation to remove the ref
> count that XPMEM was holding and therefore be able to swapout. Did I miss
> something? I agree the existing XPMEM does pin. I hope we are not saying
> the XPMEM based upon these patches will not be able to swap/migrate.

Correct.

You missed the turn of the conversation to how ib_umem_get() works.
Currently it seems to pin the same way that the SLES10 XPmem works.


2008-02-08 23:43:21

by Robin Holt

[permalink] [raw]
Subject: Re: [patch 0/6] MMU Notifiers V6

On Fri, Feb 08, 2008 at 03:41:24PM -0800, Christoph Lameter wrote:
> On Fri, 8 Feb 2008, Robin Holt wrote:
>
> > > > What about ib_umem_get()?
>
> Correct.
>
> You missed the turn of the conversation to how ib_umem_get() works.
> Currently it seems to pin the same way that the SLES10 XPmem works.

Ah. I took Andrew's question as more of a probe about whether we had
worked with the IB folks to ensure this fits the ib_umem_get needs
as well.

Thanks,
Robin

2008-02-08 23:58:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/6] MMU Notifiers V6

On Fri, 8 Feb 2008 17:43:02 -0600 Robin Holt <[email protected]> wrote:

> On Fri, Feb 08, 2008 at 03:41:24PM -0800, Christoph Lameter wrote:
> > On Fri, 8 Feb 2008, Robin Holt wrote:
> >
> > > > > What about ib_umem_get()?
> >
> > Correct.
> >
> > You missed the turn of the conversation to how ib_umem_get() works.
> > Currently it seems to pin the same way that the SLES10 XPmem works.
>
> Ah. I took Andrew's question as more of a probe about whether we had
> worked with the IB folks to ensure this fits the ib_umem_get needs
> as well.
>

You took it correctly, and I didn't understand the answer ;)

2008-02-09 00:05:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 0/6] MMU Notifiers V6

On Fri, 8 Feb 2008, Andrew Morton wrote:

> You took it correctly, and I didn't understand the answer ;)

We have done several rounds of discussion on linux-kernel about this so
far and the IB folks have not shown up to join in. I have tried to make
this as general as possible.

2008-02-09 00:13:00

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6

> We have done several rounds of discussion on linux-kernel about this so
> far and the IB folks have not shown up to join in. I have tried to make
> this as general as possible.

Sorry, this has been on my "things to look at" list for a while, but I
haven't gotten a chance to really understand where things are yet.

In general, this MMU notifier stuff will only be useful to a subset of
InfiniBand/RDMA hardware. Some adapters are smart enough to handle
changing the IO virtual -> bus/physical mapping on the fly, but some
aren't. For the dumb adapters, I think the current ib_umem_get() is
pretty close to as good as we can get: we have to keep the physical
pages pinned for as long as the adapter is allowed to DMA into the
memory region.

For the smart adapters, we just need a chance to change the adapter's
page table when the kernel/CPU's mapping changes, and naively, this
stuff looks like it would work.

Andrew, does that help?

- R.

2008-02-09 00:13:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 0/6] MMU Notifiers V6

On Fri, 8 Feb 2008 16:05:00 -0800 (PST) Christoph Lameter <[email protected]> wrote:

> On Fri, 8 Feb 2008, Andrew Morton wrote:
>
> > You took it correctly, and I didn't understand the answer ;)
>
> We have done several rounds of discussion on linux-kernel about this so
> far and the IB folks have not shown up to join in. I have tried to make
> this as general as possible.

infiniband would appear to be the major present in-kernel client of this new
interface. So as a part of proving its usefulness, correctness, etc we
should surely work on converting infiniband to use it, and prove its
goodness.

Quite possibly none of the infiniband developers even know about it.

2008-02-09 00:16:43

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6

On Fri, 8 Feb 2008, Roland Dreier wrote:

> In general, this MMU notifier stuff will only be useful to a subset of
> InfiniBand/RDMA hardware. Some adapters are smart enough to handle
> changing the IO virtual -> bus/physical mapping on the fly, but some
> aren't. For the dumb adapters, I think the current ib_umem_get() is
> pretty close to as good as we can get: we have to keep the physical
> pages pinned for as long as the adapter is allowed to DMA into the
> memory region.

I thought the adapter can always remove the mapping by renegotiating
with the remote side? Even if it's dumb, a callback could notify the
driver that it may be required to tear down the mapping. We then hold the
pages until we get an okay from the driver that the mapping has been removed.

We could also let the unmapping fail if the driver indicates that the
mapping must stay.

2008-02-09 00:18:47

by Christoph Lameter

[permalink] [raw]
Subject: Re: [patch 0/6] MMU Notifiers V6

On Fri, 8 Feb 2008, Andrew Morton wrote:

> Quite possibly none of the infiniband developers even know about it..

Well Andrea's initial approach was even featured on LWN a couple of
weeks back.

2008-02-09 00:22:53

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6

> I thought the adaptor can always remove the mapping by renegotiating
> with the remote side? Even if its dumb then a callback could notify the
> driver that it may be required to tear down the mapping. We then hold the
> pages until we get okay by the driver that the mapping has been removed.

Of course we can always destroy the memory region but that would break
the semantics that applications expect. Basically an application can
register some chunk of its memory and get a key that it can pass to a
remote peer to let the remote peer operate on its memory via RDMA.
And that memory region/key is expected to stay valid until there is an
application-level operation to destroy it (or until the app crashes or
gets killed, etc).

> We could also let the unmapping fail if the driver indicates that the
> mapping must stay.

That would of course work -- dumb adapters would just always fail,
which might be inefficient.

2008-02-09 00:36:30

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6

On Fri, 8 Feb 2008, Roland Dreier wrote:

> That would of course work -- dumb adapters would just always fail,
> which might be inefficient.

Hmmmm.. that means we need something that actually pins pages for good so
that the VM can avoid reclaiming them and so that page migration can avoid
trying to migrate them. Something like yet another page flag.

CCing Rik.

2008-02-09 01:24:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6

On Fri, Feb 08, 2008 at 04:36:16PM -0800, Christoph Lameter wrote:
> On Fri, 8 Feb 2008, Roland Dreier wrote:
>
> > That would of course work -- dumb adapters would just always fail,
> > which might be inefficient.
>
> Hmmmm.. that means we need something that actually pins pages for good so
> that the VM can avoid reclaiming it and so that page migration can avoid
> trying to migrate them. Something like yet another page flag.

What's wrong with pinning with the page count like now? Dumb adapters
would simply not register themselves in the mmu notifier list, no?

>
> Ccing Rik.

2008-02-09 01:27:19

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6

On Sat, 9 Feb 2008, Andrea Arcangeli wrote:

> > Hmmmm.. that means we need something that actually pins pages for good so
> > that the VM can avoid reclaiming it and so that page migration can avoid
> > trying to migrate them. Something like yet another page flag.
>
> What's wrong with pinning with the page count like now? Dumb adapters
> would simply not register themself in the mmu notifier list no?

Pages will still be on the LRU and cycle through rmap again and again.
If page migration is used on those pages then the code may make repeated
attempts to migrate the page, thinking that the page count must at some
point drop.

I do not think that the page count was intended to be used to pin pages
permanently. If we had a marker on such pages then we could take them off
the LRU and not try to migrate them.

2008-02-09 01:57:17

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6

On Fri, Feb 08, 2008 at 05:27:03PM -0800, Christoph Lameter wrote:
> Pages will still be on the LRU and cycle through rmap again and again.
> If page migration is used on those pages then the code may make repeated
> attempt to migrate the page thinking that the page count must at some
> point drop.
>
> I do not think that the page count was intended to be used to pin pages
> permanently. If we had a marker on such pages then we could take them off
> the LRU and not try to migrate them.

The VM shouldn't break if try_to_unmap doesn't actually make the page
freeable for whatever reason. Permanent pins shouldn't happen anyway,
so defining an ad-hoc API for that doesn't sound too appealing. Not
sure if old hardware deserves those special lru-size-reduction
optimizations but it's not my call (certainly swapoff/mlock would get
higher priority in that lru-size-reduction area).

2008-02-09 02:16:30

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6

On Sat, 9 Feb 2008, Andrea Arcangeli wrote:

> The VM shouldn't break if try_to_unmap doesn't actually make the page
> freeable for whatever reason. Permanent pins shouldn't happen anyway,

The VM is livelocking if too many pages are pinned that way right now. The
more processors per node, the higher the risk of livelock, because
more processors are in the process of cycling through pages that have an
elevated refcount.

> so defining an ad-hoc API for that doesn't sound too appealing. Not
> sure if old hardware deserves those special lru-size-reduction
> optimizations but it's not my call (certainly swapoff/mlock would get
> higher priority in that lru-size-reduction area).

Rik has a patchset under development that addresses issues like this. The
elevated refcount pin problem is not really relevant to the patchset we
are discussing here.

2008-02-09 12:58:30

by Rik van Riel

[permalink] [raw]
Subject: Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6

On Fri, 8 Feb 2008 18:16:16 -0800 (PST)
Christoph Lameter <[email protected]> wrote:
> On Sat, 9 Feb 2008, Andrea Arcangeli wrote:
>
> > The VM shouldn't break if try_to_unmap doesn't actually make the page
> > freeable for whatever reason. Permanent pins shouldn't happen anyway,
>
> VM is livelocking if too many page are pinned that way right now.

> Rik has a patchset under development that addresses issues like this

PG_mlock is on the way and can easily be reused for this, too.

--
All rights reversed.

2008-02-09 21:46:48

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: [patch 0/6] MMU Notifiers V6

On Sat, 9 Feb 2008, Rik van Riel wrote:

> PG_mlock is on the way and can easily be reused for this, too.

Note that a pinned page is different from an mlocked page. An mlocked page
can be moved through page migration and/or memory hotplug. A pinned page
must make both fail.

2008-02-11 22:41:08

by Roland Dreier

[permalink] [raw]
Subject: Demand paging for memory regions (was Re: MMU Notifiers V6)

[Adding [email protected] to get the IB/RDMA people involved]

This thread has patches that add support for notifying drivers when a
process's memory map changes. The hope is that this is useful for
letting RDMA devices handle registered memory without pinning the
underlying pages, by updating the RDMA device's translation tables
whenever the host kernel's tables change.

Is anyone interested in working on using this for drivers/infiniband?
I am interested in participating, but I don't think I have enough time
to do this by myself.

Also, at least naively it seems that this is only useful for hardware
that has support for this type of demand paging, and can handle
not-present pages, generating interrupts for page faults, etc. I know
that Mellanox HCAs should have this support; are there any other
devices that can do this?

The beginning of this thread is at <http://lkml.org/lkml/2008/2/8/458>.

- R.

2008-02-12 22:02:44

by Steve Wise

[permalink] [raw]
Subject: Re: Demand paging for memory regions (was Re: MMU Notifiers V6)

Roland Dreier wrote:
> [Adding [email protected] to get the IB/RDMA people involved]
>
> This thread has patches that add support for notifying drivers when a
> process's memory map changes. The hope is that this is useful for
> letting RDMA devices handle registered memory without pinning the
> underlying pages, by updating the RDMA device's translation tables
> whenever the host kernel's tables change.
>
> Is anyone interested in working on using this for drivers/infiniband?
> I am interested in participating, but I don't think I have enough time
> to do this by myself.

I don't have time, although it would be interesting work!

>
> Also, at least naively it seems that this is only useful for hardware
> that has support for this type of demand paging, and can handle
> not-present pages, generating interrupts for page faults, etc. I know
> that Mellanox HCAs should have this support; are there any other
> devices that can do this?
>

Chelsio's T3 HW doesn't support this.


Steve.

2008-02-12 22:11:00

by Christoph Lameter

[permalink] [raw]
Subject: Re: Demand paging for memory regions (was Re: MMU Notifiers V6)

On Tue, 12 Feb 2008, Steve Wise wrote:

> Chelsio's T3 HW doesn't support this.

Not so far, I guess, but it could be equipped with these features, right?

Having the VM manage the memory area for InfiniBand allows more reliable
system operation and enables the sharing of large memory areas via
InfiniBand without the risk of livelocks or OOMs.

2008-02-12 22:42:23

by Roland Dreier

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

> > Chelsio's T3 HW doesn't support this.

> Not so far I guess but it could be equipped with these features right?

I don't know anything about the T3 internals, but it's not clear that
you could do this without a new chip design in general. Lots of RDMA
devices were designed expecting that when a packet arrives, the HW can
look up the bus address for a given memory region/offset and place the
packet immediately. It seems like a major change to be able to
generate a "page fault" interrupt when a page isn't present, or even
just wait to scatter some data until the host finishes updating page
tables when the HW needs the translation.

- R.

2008-02-12 23:52:39

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, Feb 12, 2008 at 02:41:48PM -0800, Roland Dreier wrote:
> > > Chelsio's T3 HW doesn't support this.
>
> > Not so far I guess but it could be equipped with these features right?
>
> I don't know anything about the T3 internals, but it's not clear that
> you could do this without a new chip design in general. Lot's of RDMA
> devices were designed expecting that when a packet arrives, the HW can
> look up the bus address for a given memory region/offset and place
> the

Well, certainly today the memfree IB devices store the page tables in
host memory, so they are already designed to hang onto packets during
the page lookup over PCIe; adding in faulting makes this time
larger.

But this is not a good thing at all: IB's congestion model is based on
the notion that end ports can always accept packets without making
input contingent on output. If you take a software interrupt to fill in
the page pointer then you could potentially deadlock on the
fabric. For example, using this mechanism to allow swap-in of RDMA target
pages and then putting the storage over IB would be deadlock
prone. Even without deadlock, slowing down the input path will cause
network congestion and poor performance for other nodes. It is not a
desirable thing to do.

I expect that iwarp running over flow-controlled Ethernet has similar
kinds of problems for similar reasons.

In general, the best I think you can hope for with RDMA hardware is
page migration using some atomic operations with the adapter and a CPU
page copy with a retry sort of scheme - but is pure page migration
interesting at all?

Jason

2008-02-13 00:56:57

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, 12 Feb 2008, Roland Dreier wrote:

> I don't know anything about the T3 internals, but it's not clear that
> you could do this without a new chip design in general. Lot's of RDMA
> devices were designed expecting that when a packet arrives, the HW can
> look up the bus address for a given memory region/offset and place the
> packet immediately. It seems like a major change to be able to
> generate a "page fault" interrupt when a page isn't present, or even
> just wait to scatter some data until the host finishes updating page
> tables when the HW needs the translation.

Well if the VM wants to invalidate a page then the remote end first has to
remove its mapping.

If a page has been removed then the remote end would encounter a fault and
then would have to wait for the local end to reestablish its mapping
before proceeding.

So the packet would only be generated when both ends are in sync.

2008-02-13 01:01:30

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

> Well, certainly today the memfree IB devices store the page tables in
> host memory so they are already designed to hang onto packets during
> the page lookup over PCIE, adding in faulting makes this time
> larger.

You really do not need a page table to use it. What needs to be maintained
is knowledge on both sides about what pages are currently shared across
RDMA. If the VM decides to reclaim a page then the notification is used to
remove the remote entry. If the remote side then tries to access the page
again then the page fault on the remote side will stall until the local
page has been brought back. RDMA can proceed after both sides again agree
that the page is sharable.

2008-02-13 01:27:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, Feb 12, 2008 at 05:01:17PM -0800, Christoph Lameter wrote:
> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > Well, certainly today the memfree IB devices store the page tables in
> > host memory so they are already designed to hang onto packets during
> > the page lookup over PCIE, adding in faulting makes this time
> > larger.
>
> You really do not need a page table to use it. What needs to be maintained
> is knowledge on both side about what pages are currently shared across
> RDMA. If the VM decides to reclaim a page then the notification is used to
> remove the remote entry. If the remote side then tries to access the page
> again then the page fault on the remote side will stall until the local
> page has been brought back. RDMA can proceed after both sides again agree
> on that page now being sharable.

The problem is that the existing wire protocols do not have a
provision for doing an 'are you ready' or 'I am not ready' exchange
and they are not designed to store page tables on both sides as you
propose. The remote side can send RDMA WRITE traffic at any time after
the RDMA region is established. The local side must be able to handle
it. There is no way to signal that a page is not ready and the remote
should not send.

This means the only possible implementation is to stall/discard at the
local adapter when an RDMA WRITE is received for a page that has been
reclaimed. This is what leads to deadlock/poor performance.

Jason

2008-02-13 01:46:18

by Steve Wise

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

Jason Gunthorpe wrote:
> On Tue, Feb 12, 2008 at 05:01:17PM -0800, Christoph Lameter wrote:
>> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>>
>>> Well, certainly today the memfree IB devices store the page tables in
>>> host memory so they are already designed to hang onto packets during
>>> the page lookup over PCIE, adding in faulting makes this time
>>> larger.
>> You really do not need a page table to use it. What needs to be maintained
>> is knowledge on both side about what pages are currently shared across
>> RDMA. If the VM decides to reclaim a page then the notification is used to
>> remove the remote entry. If the remote side then tries to access the page
>> again then the page fault on the remote side will stall until the local
>> page has been brought back. RDMA can proceed after both sides again agree
>> on that page now being sharable.
>
> The problem is that the existing wire protocols do not have a
> provision for doing an 'are you ready' or 'I am not ready' exchange
> and they are not designed to store page tables on both sides as you
> propose. The remote side can send RDMA WRITE traffic at any time after
> the RDMA region is established. The local side must be able to handle
> it. There is no way to signal that a page is not ready and the remote
> should not send.
>
> This means the only possible implementation is to stall/discard at the
> local adaptor when a RDMA WRITE is recieved for a page that has been
> reclaimed. This is what leads to deadlock/poor performance..
>

If the events are few and far between then this model is probably ok.
For iWARP, it means TCP retransmit and slow start and all that, but if
it's an infrequent event, then it's ok if it helps the host better manage
memory.

Maybe... ;-)


Steve.

2008-02-13 02:01:51

by Christian Bell

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, 12 Feb 2008, Christoph Lameter wrote:

> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > Well, certainly today the memfree IB devices store the page tables in
> > host memory so they are already designed to hang onto packets during
> > the page lookup over PCIE, adding in faulting makes this time
> > larger.
>
> You really do not need a page table to use it. What needs to be maintained
> is knowledge on both side about what pages are currently shared across
> RDMA. If the VM decides to reclaim a page then the notification is used to
> remove the remote entry. If the remote side then tries to access the page
> again then the page fault on the remote side will stall until the local
> page has been brought back. RDMA can proceed after both sides again agree
> on that page now being sharable.

HPC environments won't be amenable to a pessimistic approach of
synchronizing before every data transfer. RDMA is assumed to be a
low-level data movement mechanism that has no implied
synchronization. In some parallel programming models, it's not
uncommon to use RDMA to send 8-byte messages. It can be difficult to
make and hold guarantees about in-memory pages when many concurrent
RDMA operations are in flight (not uncommon in reasonably large
machines). Some of the in-memory page information could be shared
with some form of remote caching strategy but then it's a different
problem with its own scalability challenges.

I think there are very real potential clients of the interface when an
optimistic approach is used. Part of the trick, however, has to do
with being able to re-start transfers instead of buffering the data
or making guarantees about delivery that could cause deadlock (as was
alluded to earlier in this thread). InfiniBand is constrained in
this regard since it requires message ordering between endpoints (or
queue pairs). One could argue that this is still possible with IB,
at the cost of throwing more packets away when a referenced page is
not in memory. With this approach, the worst-case demand paging
scenario is met when the active working set of referenced pages is
larger than the amount of physical memory -- but HPC applications are
already bound by this anyway.

You'll find that Quadrics has the most experience in this area and
that their entire architecture is adapted to being optimistic about
demand paging in RDMA transfers -- they've been maintaining a patchset
to do this for years.

. . christian

2008-02-13 02:19:21

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, 12 Feb 2008, Christian Bell wrote:

> I think there are very potential clients of the interface when an
> optimistic approach is used. Part of the trick, however, has to do
> with being able to re-start transfers instead of buffering the data
> or making guarantees about delivery that could cause deadlock (as was
> alluded to earlier in this thread). InfiniBand is constrained in
> this regard since it requires message-ordering between endpoints (or
> queue pairs). One could argue that this is still possible with IB,
> at the cost of throwing more packets away when a referenced page is
> not in memory. With this approach, the worse case demand paging
> scenario is met when the active working set of referenced pages is
> larger than the amount physical memory -- but HPC applications are
> already bound by this anyway.
>
> You'll find that Quadrics has the most experience in this area and
> that their entire architecture is adapted to being optimistic about
> demand paging in RDMA transfers -- they've been maintaining a patchset
> to do this for years.

The notifier patchset that we are discussing here was mostly inspired by
their work.

There is no need to restart transfers that you have never started in the
first place. The remote side would never start a transfer if the page
reference has been torn down. In order to start the transfer a fault
handler on the remote side would have to set up the association between the
memory on both ends again.


2008-02-13 02:35:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

> The problem is that the existing wire protocols do not have a
> provision for doing an 'are you ready' or 'I am not ready' exchange
> and they are not designed to store page tables on both sides as you
> propose. The remote side can send RDMA WRITE traffic at any time after
> the RDMA region is established. The local side must be able to handle
> it. There is no way to signal that a page is not ready and the remote
> should not send.
>
> This means the only possible implementation is to stall/discard at the
> local adaptor when a RDMA WRITE is recieved for a page that has been
> reclaimed. This is what leads to deadlock/poor performance..

You would only use the wire protocols *after* having established the RDMA
region. The notifier chains allow an RDMA region (or parts thereof) to be
torn down on demand by the VM. The region can be reestablished if one of
the sides accesses it. I hope I got that right. Not much exposure to
Infiniband so far.

Let's say you have two systems A and B. Each has its memory region, MemA
and MemB. Each side also has page tables for this region, PtA and PtB.

Now you establish an RDMA connection between both sides. The pages in both
MemB and MemA are present and so are the entries in PtA and PtB. RDMA
traffic can proceed.

The VM on system A now gets into a situation in which memory becomes
heavily used by another (maybe non-RDMA) process and, after checking that
there was no recent reference to MemA and MemB (via a notifier aging
callback), decides to reclaim the memory from MemA.

In that case it will notify the RDMA subsystem on A that it is trying to
reclaim a certain page.

The RDMA subsystem on A will then send a message to B notifying it that
the memory will be going away. B now has to remove its corresponding page
from memory (and drop the entry in PtB) and confirm to A that this has
happened. RDMA traffic is then stopped for this page. Then A can also
remove its page and the corresponding entry in PtA, and the page is reclaimed
or pushed out to swap, completing the page reclaim.

If either side then accesses the page again, the reverse process
happens. If B accesses the page then it will first of all incur a page
fault because the entry in PtB is missing. The fault will then cause a
message to be sent to A to establish the page again. A will create an
entry in PtA and will then confirm to B that the page was established. At
that point RDMA operations can occur again.

So the whole scheme does not really need a hardware page table in the RDMA
hardware. The page tables of the two systems A and B are sufficient.

The scheme can also be applied to a larger range than only a single page.
The RDMA subsystem could tear down a large section when reclaim is
pushing on it and then reestablish it as needed.

Swapping and page reclaim are certainly not something that improves the
speed of the application affected by swapping and page reclaim, but they
allow the VM to manage memory effectively if multiple loads are running on
a system.
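
To compress the sequence above into code form, the sketch below shows the
two sides of the handshake. Every helper here (send_msg(), wait_for_ack()
and friends) is a hypothetical stand-in for a driver-private control
channel; nothing in the existing IB/iwarp verbs provides this exchange.

/*
 * Pseudocode for the teardown/re-establish handshake described above.
 * All helpers are hypothetical stand-ins; they do not correspond to any
 * existing RDMA API.
 */
enum msg_type { MSG_UNMAP, MSG_MAP };

static void send_msg(int node, enum msg_type t,
		     unsigned long start, unsigned long end)   { /* stand-in */ }
static void wait_for_ack(int node)                             { /* stand-in */ }
static void clear_local_ptes(unsigned long s, unsigned long e) { /* stand-in */ }
static void fill_remote_pte(unsigned long addr)                { /* stand-in */ }

#define NODE_A		0
#define NODE_B		1
#define RDMA_PAGE_SIZE	4096UL

/* A's notifier callback: the VM wants to reclaim [start, end) of MemA. */
static void memA_invalidate(unsigned long start, unsigned long end)
{
	send_msg(NODE_B, MSG_UNMAP, start, end); /* "this memory is going away" */
	wait_for_ack(NODE_B);                    /* B has dropped its PtB entries */
	clear_local_ptes(start, end);            /* A drops PtA; the pages can now
						    be reclaimed or swapped out */
}

/* B's fault handler: B touched a page whose PtB entry was removed. */
static void memB_fault(unsigned long addr)
{
	send_msg(NODE_A, MSG_MAP, addr, addr + RDMA_PAGE_SIZE);
	wait_for_ack(NODE_A);   /* A faulted the page back in and refilled PtA */
	fill_remote_pte(addr);  /* refill PtB; RDMA to this page may resume */
}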

2008-02-13 03:26:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, Feb 12, 2008 at 06:35:09PM -0800, Christoph Lameter wrote:
> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > The problem is that the existing wire protocols do not have a
> > provision for doing an 'are you ready' or 'I am not ready' exchange
> > and they are not designed to store page tables on both sides as you
> > propose. The remote side can send RDMA WRITE traffic at any time after
> > the RDMA region is established. The local side must be able to handle
> > it. There is no way to signal that a page is not ready and the remote
> > should not send.
> >
> > This means the only possible implementation is to stall/discard at the
> > local adaptor when a RDMA WRITE is recieved for a page that has been
> > reclaimed. This is what leads to deadlock/poor performance..
>
> You would only use the wire protocols *after* having established the RDMA
> region. The notifier chains allows a RDMA region (or parts thereof) to be
> down on demand by the VM. The region can be reestablished if one of
> the side accesses it. I hope I got that right. Not much exposure to
> Infiniband so far.

[clip explanation]

But this isn't how IB or iwarp work at all. What you describe is a
significant change to the general RDMA operation and requires changes to
both sides of the connection and the wire protocol.

A few comments on RDMA operation that might clarify things a little
bit more:
- In RDMA (iwarp and IB versions) the hardware page tables exist to
linearize the local memory so the remote does not need to be aware
of non-linearities in the physical address space. The main
motivation for this is kernel bypass where the user space app wants
to instruct the remote side to DMA into memory using user space
addresses. Hardware provides the page tables to switch from
incoming user space virtual addresses to physical addresses.

This greatly simplifies the user space programming model since you
don't need to pass around or create s/g lists for memory that is
already virtually contiguous.

Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables
for access control and enforcing the lifetime of the mapping.

The page tables in the RDMA hardware exist primarily to support
this, and not for other reasons. The pinning of pages is one part
to support the HW page tables and one part to support the RDMA
lifetime rules; the lifetime rules are what cause problems for
the VM.
- The wire protocol consists of packets that say 'Write XXX bytes to
offset YY in Region RRR'. Creating a region produces the RRR label
and currently pins the pages. So long as the RRR label is valid the
remote side can issue write packets at any time without any
further synchronization. There are no wire-level events associated
with creating RRR. You can pass RRR to the other machine in any
fashion, even using carrier pigeons :) (see the short example after this list)
- The RDMA layer is very general (ala TCP), useful protocols (like SCSI)
are built on top of it and they specify the lifetime rules and
protocol for exchanging RRR.

Every protocol is different. In-kernel protocols like SRP and NFS
RDMA seem to have very short lifetimes for RRR and work more like
pci_map_* in real SCSI hardware.
- HPC userspace apps, like MPI apps, have different lifetime rules
and tend to be really long lived. These people will not want
anything that makes their OPs more expensive and also probably
don't care too much about the VM problems you are looking at (?)
- There is no protocol support to exchange RRR. This is all done
by upper level protocols (ala HTTP vs TCP). You cannot assert
and revoke RRR in a general way. Every protocol is different
and optimized.

This is your step 'A will then send a message to B notifying..'.
It simply does not exist in the protocol specifications.
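
To make the RRR label concrete, region creation from user space with
libibverbs looks like the sketch below; the rkey returned by ibv_reg_mr()
is the label the remote peer embeds in its RDMA WRITEs, and the underlying
pages stay pinned until ibv_dereg_mr(). The protection domain is assumed
to have been allocated already with ibv_alloc_pd().

/*
 * Registering a memory region with libibverbs. The returned rkey is
 * the "RRR" label discussed above; the pages stay pinned until
 * ibv_dereg_mr(). Assumes pd was obtained earlier via ibv_alloc_pd().
 */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static struct ibv_mr *export_region(struct ibv_pd *pd, size_t len)
{
	void *buf;
	struct ibv_mr *mr;

	if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), len))
		return NULL;

	/* Pins the pages and programs the HCA's translation tables. */
	mr = ibv_reg_mr(pd, buf, len,
			IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
	if (!mr) {
		free(buf);
		return NULL;
	}

	/* addr + rkey are handed to the peer out of band ("carrier pigeon"). */
	printf("region at %p, length %zu, rkey 0x%x\n",
	       buf, len, (unsigned int)mr->rkey);
	return mr;
}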

I don't know much about Quadrics, but I would be hesitant to lump it
in too much with these RDMA semantics. Christian's comments sound like
they operate closer to what you described and that is why they have an
existing patch set. I don't know :)

What it boils down to is that to implement true removal of pages in a
general way the kernel and HCA must either drop packets or stall
incoming packets, both of which are big performance problems - and I can't see
many users wanting this. Enterprise-style people using SCSI, NFS, etc.
already have short pin periods and HPC MPI users probably won't care
about the VM issues enough to warrant the performance overhead.

Regards,
Jason

2008-02-13 03:59:26

by Patrick Geoffray

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

Jason,

Jason Gunthorpe wrote:
> I don't know much about Quadrics, but I would be hesitant to lump it
> in too much with these RDMA semantics. Christian's comments sound like
> they operate closer to what you described and that is why the have an
> existing patch set. I don't know :)

The Quadrics folks have been doing RDMA for 10 years; there is a reason
why they have maintained a patch.

> What it boils down to is that to implement true removal of pages in a
> general way the kernel and HCA must either drop packets or stall
> incoming packets, both are big performance problems - and I can't see
> many users wanting this. Enterprise style people using SCSI, NFS, etc
> already have short pin periods and HPC MPI users probably won't care
> about the VM issues enough to warrent the performance overhead.

This is not true; HPC people do care about the VM issues a lot. Memory
registration (pinning and translating) is usually too expensive to be
performed in the critical path before and after each send or receive. So
they factor it out by registering a buffer the first time it is used,
and keeping it registered in a registration cache. However, the
application may free() a buffer that is in the registration cache, so
HPC people provide their own malloc to catch free(). They also try to
catch sbrk() and munmap() to deregister memory before it is released to
the OS. This is a major pain that a VM notifier would easily solve.
Being able to swap registered pages to disk or migrate them in a NUMA
system is a welcome bonus.
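
A toy version of the registration cache described above is sketched
below. net_register()/net_deregister() are hypothetical stand-ins for the
interconnect's real (expensive) registration calls; in real MPI stacks the
evict path is driven by interposed free()/munmap()/sbrk(), which is
exactly the hook a VM notifier would make unnecessary.

/*
 * Toy registration cache: register a buffer on first use, reuse the
 * handle afterwards. net_register()/net_deregister() are hypothetical
 * stand-ins, not a real API.
 */
#include <stdlib.h>

extern void *net_register(void *addr, size_t len); /* expensive: pin + translate */
extern void net_deregister(void *handle);

struct reg_entry {
	char *addr;
	size_t len;
	void *handle;
	struct reg_entry *next;
};

static struct reg_entry *cache;

void *reg_cache_lookup(void *addr, size_t len)
{
	struct reg_entry *e;

	for (e = cache; e; e = e->next)
		if ((char *)addr >= e->addr &&
		    (char *)addr + len <= e->addr + e->len)
			return e->handle;	/* hit: no registration cost */

	e = malloc(sizeof(*e));
	if (!e)
		return NULL;
	e->addr = addr;
	e->len = len;
	e->handle = net_register(addr, len);
	e->next = cache;
	cache = e;
	return e->handle;
}

/* Today this must be called from interposed free()/munmap()/sbrk(). */
void reg_cache_evict(void *addr, size_t len)
{
	struct reg_entry **pp = &cache, *e;

	while ((e = *pp) != NULL) {
		if (e->addr >= (char *)addr &&
		    e->addr + e->len <= (char *)addr + len) {
			*pp = e->next;
			net_deregister(e->handle);
			free(e);
		} else {
			pp = &e->next;
		}
	}
}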

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

2008-02-13 04:09:19

by Christian Bell

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, 12 Feb 2008, Christoph Lameter wrote:

> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > The problem is that the existing wire protocols do not have a
> > provision for doing an 'are you ready' or 'I am not ready' exchange
> > and they are not designed to store page tables on both sides as you
> > propose. The remote side can send RDMA WRITE traffic at any time after
> > the RDMA region is established. The local side must be able to handle
> > it. There is no way to signal that a page is not ready and the remote
> > should not send.
> >
> > This means the only possible implementation is to stall/discard at the
> > local adaptor when a RDMA WRITE is recieved for a page that has been
> > reclaimed. This is what leads to deadlock/poor performance..

You're arguing that a HW page table is not needed by describing a use
case that is essentially what all RDMA solutions already do above the
wire protocols (all solutions except Quadrics, of course).

> You would only use the wire protocols *after* having established the RDMA
> region. The notifier chains allows a RDMA region (or parts thereof) to be
> down on demand by the VM. The region can be reestablished if one of
> the side accesses it. I hope I got that right. Not much exposure to
> Infiniband so far.

RDMA is already always used *after* memory regions are set up --
they are set up out-of-band w.r.t RDMA but essentially this is the
"before" part.

> Lets say you have a two systems A and B. Each has their memory region MemA
> and MemB. Each side also has page tables for this region PtA and PtB.
>
> Now you establish a RDMA connection between both side. The pages in both
> MemB and MemA are present and so are entries in PtA and PtB. RDMA
> traffic can proceed.
>
> The VM on system A now gets into a situation in which memory becomes
> heavily used by another (maybe non RDMA process) and after checking that
> there was no recent reference to MemA and MemB (via a notifier aging
> callback) decides to reclaim the memory from MemA.
>
> In that case it will notify the RDMA subsystem on A that it is trying to
> reclaim a certain page.
>
> The RDMA subsystem on A will then send a message to B notifying it that
> the memory will be going away. B now has to remove its corresponding page
> from memory (and drop the entry in PtB) and confirm to A that this has
> happened. RDMA traffic is then stopped for this page. Then A can also
> remove its page, the corresponding entry in PtA and the page is reclaimed
> or pushed out to swap completing the page reclaim.
>
> If either side then accesses the page again then the reverse process
> happens. If B accesses the page then it wil first of all incur a page
> fault because the entry in PtB is missing. The fault will then cause a
> message to be send to A to establish the page again. A will create an
> entry in PtA and will then confirm to B that the page was established. At
> that point RDMA operations can occur again.

The notifier-reclaim cycle you describe is akin to the out-of-band
pin-unpin control messages used by existing communication libraries.
Also, I think what you are proposing can have problems at scale -- A
must keep track of all of the (potentially many) systems that map MemA and
cooperatively get an agreement from all these systems before reclaiming
the page.

When messages are sufficiently large, the control messaging necessary
to setup/teardown the regions is relatively small. This is not
always the case however -- in programming models that employ smaller
messages, the one-sided nature of RDMA is the most attractive part of
it.

> So the whole scheme does not really need a hardware page table in the RDMA
> hardware. The page tables of the two systems A and B are sufficient.
>
> The scheme can also be applied to a larger range than only a single page.
> The RDMA subsystem could tear down a large section when reclaim is
> pushing on it and then reestablish it as needed.

Nothing any communication/runtime system can't already do today. The
point of RDMA demand paging is enabling the possibility of using RDMA
without the implied synchronization -- the optimistic part. Using
the notifiers to duplicate existing memory region handling for RDMA
hardware that doesn't have HW page tables is possible but undermines
the more important consumer of your patches in my opinion.

One other area that has not been brought up yet (I think) is the
applicability of notifiers in letting users know when pinned memory
is reclaimed by the kernel. This is useful when a lower-level
library employs lazy deregistration strategies on memory regions that
are subsequently released to the kernel via the application's use of
munmap or sbrk. Ohio Supercomputing Center has work in this area but
a generalized approach in the kernel would certainly be welcome.


. . christian

--
[email protected]
(QLogic Host Solutions Group, formerly Pathscale)

2008-02-13 04:26:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

[mangled CC list trimmed]
On Tue, Feb 12, 2008 at 10:56:26PM -0500, Patrick Geoffray wrote:

> Jason Gunthorpe wrote:
>> I don't know much about Quadrics, but I would be hesitant to lump it
>> in too much with these RDMA semantics. Christian's comments sound like
>> they operate closer to what you described and that is why the have an
>> existing patch set. I don't know :)
>
> The Quadrics folks have been doing RDMA for 10 years, there is a reason why
> they maintained a patch.

This wasn't meant as a slight against Quadrics, only to point out that
the specific wire protocols used by IB and iwarp are what cause this
limitation; it would be easy to imagine that Quadrics has some
additional twist that can make this easier.

>> What it boils down to is that to implement true removal of pages in a
>> general way the kernel and HCA must either drop packets or stall
>> incoming packets, both are big performance problems - and I can't see
>> many users wanting this. Enterprise style people using SCSI, NFS, etc
>> already have short pin periods and HPC MPI users probably won't care
>> about the VM issues enough to warrent the performance overhead.
>
> This is not true, HPC people do care about the VM issues a lot. Memory
> registration (pinning and translating) is usually too expensive to

I meant that HPC users are unlikely to want to swap active RDMA pages
if this causes a performance cost on normal operations. None of my
comments are meant to imply that lazy de-registration or page migration
are not good things.

Regards,
Jason

2008-02-13 04:47:54

by Patrick Geoffray

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

Jason Gunthorpe wrote:
> [mangled CC list trimmed]
Thanks, noticed that afterwards.

> This wasn't ment as a slight against Quadrics, only to point out that
> the specific wire protcols used by IB and iwarp are what cause this
> limitation, it would be easy to imagine that Quadrics has some
> additional twist that can make this easier..

The wire protocols are similar, nothing fancy. The specificity of
Quadrics (and many others) is that they can change the behavior of the
NIC in firmware, so they adapt to what the OS offers. They had the VM
notifier support in Tru64 back in the day; they just ported the
functionality to Linux.

> I ment that HPC users are unlikely to want to swap active RDMA pages
> if this causes a performance cost on normal operations. None of my

Swapping to disk is not a normal operation in HPC; it's going to be
slow anyway. The main problem for HPC users is not swapping, it's that
they do not know when a registered page is released to the OS through
free(), sbrk() or munmap(). Like swapping, they don't expect that it
will happen often, but they have to handle it gracefully.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

2008-02-13 14:31:29

by Jack Steiner

[permalink] [raw]
Subject: Re: [patch 0/6] MMU Notifiers V6

> GRU
> - Simple additional hardware TLB (possibly covering multiple instances of
> Linux)
> - Needs TLB shootdown when the VM unmaps pages.
> - Determines page address via follow_page (from interrupt context) but can
> fall back to get_user_pages().
> - No page reference possible since no page status is kept..

I applied the latest mmuops patch to a 2.6.24 kernel & updated the
GRU driver to use it. As far as I can tell, everything works ok.
Although more testing is needed, all current tests of driver functionality
are working on both a system simulator and a hardware simulator.

The driver itself is still a few weeks from being ready to post but I can
send code fragments of the portions related to mmuops or external TLB
management if anyone is interested.


--- jack

2008-02-13 16:13:04

by Christoph Raisch

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions


> > > Chelsio's T3 HW doesn't support this.


For ehca we currently can't modify a large MR once it has been allocated.
EHCA hardware expects the pages to be there (MRs must not have "holes").
This is also true for the global MR covering all kernel space.
Therefore we still need the memory to be "pinned" if ib_umem_get() is
called.

So with the current implementation we don't have much use for a notifier.


"It is difficult to make predictions, especially about the future"
Gruss / Regards
Christoph Raisch + Hoang-Nam Nguyen


2008-02-13 18:52:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

> But this isn't how IB or iwarp work at all. What you describe is a
> significant change to the general RDMA operation and requires changes to
> both sides of the connection and the wire protocol.

Yes, it may require a separate connection between both sides where a
kind of VM notification protocol is established to tear these things down and
set them up again. That is, if there is nothing in the RDMA protocol that
allows a notification to the other side that the mapping is being torn
down.

> - In RDMA (iwarp and IB versions) the hardware page tables exist to
> linearize the local memory so the remote does not need to be aware
> of non-linearities in the physical address space. The main
> motivation for this is kernel bypass where the user space app wants
> to instruct the remote side to DMA into memory using user space
> addresses. Hardware provides the page tables to switch from
> incoming user space virtual addresses to physical addresess.

s/switch/translate I guess. That is good and those page tables could be
used for the notification scheme to enable reclaim. But they are optional
and maintain driver state. The linearization could be
reconstructed from the kernel page tables on demand.

> Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables
> for access control and enforcing the liftime of the mapping.

Well the mapping would have to be on demand to avoid the issues that we
currently have with pinning. The user API could stay the same. If the
driver tracks the mappings using the notifier then the VM can make sure
that the right things happen on exit etc etc.

> The page tables in the RDMA hardware exist primarily to support
> this, and not for other reasons. The pinning of pages is one part
> to support the HW page tables and one part to support the RDMA
> lifetime rules, the liftime rules are what cause problems for
> the VM.

So the driver software can tear down and establish page table
entries at will? I do not see the problem. The RDMA hardware is one thing,
the way things are visible to the user another. If the driver can
establish and remove mappings as needed via RDMA then the user can have
the illusion of persistent RDMA memory. This is the same as virtual memory
providing the illusion of a process having lots of memory all for itself.


> - The wire protocol consists of packets that say 'Write XXX bytes to
> offset YY in Region RRR'. Creating a region produces the RRR label
> and currently pins the pages. So long as the RRR label is valid the
> remote side can issue write packets at any time without any
> further synchronization. There is no wire level events associated
> with creating RRR. You can pass RRR to the other machine in any
> fashion, even using carrier pigeons :)
> - The RDMA layer is very general (ala TCP), useful protocols (like SCSI)
> are built on top of it and they specify the lifetime rules and
> protocol for exchanging RRR.

Well yes of course. What is proposed here is an additional notification
mechanism (could even be via tcp/udp to simplify things) that would manage
the mappings at a higher level. The writes would not occur if the mapping
has not been established.

> This is your step 'A will then send a message to B notifying..'.
> It simply does not exist in the protocol specifications

Of course. You need to create an additional communication layer to get
that.

> What it boils down to is that to implement true removal of pages in a
> general way the kernel and HCA must either drop packets or stall
> incoming packets, both are big performance problems - and I can't see
> many users wanting this. Enterprise style people using SCSI, NFS, etc
> already have short pin periods and HPC MPI users probably won't care
> about the VM issues enough to warrent the performance overhead.

True, maybe you cannot do this by staying within the bounds of an RDMA
protocol that is based on page pinning, if that protocol has no way of
notifying the other side that a mapping is going away.

If RDMA cannot do this then you would need additional ways of notifying
the remote side that pages/mappings have been invalidated.

2008-02-13 19:00:22

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Tue, 12 Feb 2008, Christian Bell wrote:

> You're arguing that a HW page table is not needed by describing a use
> case that is essentially what all RDMA solutions already do above the
> wire protocols (all solutions except Quadrics, of course).

The HW page table is not essential to the notification scheme. That RDMA
uses the page table for linearization is a separate issue. A chip could
just have a TLB cache and look up the entries using the OS page tables f.e.

> > Lets say you have a two systems A and B. Each has their memory region MemA
> > and MemB. Each side also has page tables for this region PtA and PtB.
> > If either side then accesses the page again then the reverse process
> > happens. If B accesses the page then it will first of all incur a page
> > fault because the entry in PtB is missing. The fault will then cause a
> > message to be sent to A to establish the page again. A will create an
> > entry in PtA and will then confirm to B that the page was established. At
> > that point RDMA operations can occur again.
>
> The notifier-reclaim cycle you describe is akin to the out-of-band
> pin-unpin control messages used by existing communication libraries.
> Also, I think what you are proposing can have problems at scale -- A
> must keep track of all of the (potentially many systems) of memA and
> cooperatively get an agreement from all these systems before reclaiming
> the page.

Right. We (SGI) have done something like this for a long time with XPmem
and it scales ok.

> When messages are sufficiently large, the control messaging necessary
> to setup/teardown the regions is relatively small. This is not
> always the case however -- in programming models that employ smaller
> messages, the one-sided nature of RDMA is the most attractive part of
> it.

The messaging would only be needed if a process comes under memory
pressure. As long as there is enough memory nothing like this will occur.

> Nothing any communication/runtime system can't already do today. The
> point of RDMA demand paging is enabling the possibility of using RDMA
> without the implied synchronization -- the optimistic part. Using
> the notifiers to duplicate existing memory region handling for RDMA
> hardware that doesn't have HW page tables is possible but undermines
> the more important consumer of your patches in my opinion.

The notifier scheme should integrate into existing memory region
handling and not cause a duplication. If you already have library layers
that do this then it should be possible to integrate it.

> One other area that has not been brought up yet (I think) is the
> applicability of notifiers in letting users know when pinned memory
> is reclaimed by the kernel. This is useful when a lower-level
> library employs lazy deregistration strategies on memory regions that
> are subsequently released to the kernel via the application's use of
> munmap or sbrk. Ohio Supercomputing Center has work in this area but
> a generalized approach in the kernel would certainly be welcome.

The driver gets the notifications about memory being reclaimed. The driver
could then notify user code about the release as well.

Pinned memory currently *cannot* be reclaimed by the kernel. The refcount
is elevated. This means that the VM tries to remove the mappings and then
sees that it was not able to remove all references. Then it gives up and
tries again and again and again.... Thus the potential for livelock.
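
For reference, this is roughly how a driver pins pages today (hypothetical
mydrv_* wrappers; the exact get_user_pages() prototype differs between
kernel versions):

    /* Sketch only: each successfully returned page carries an extra
     * reference, which is what keeps reclaim from ever succeeding. */
    static int mydrv_pin(unsigned long start, int npages, struct page **pages)
    {
            int got;

            down_read(&current->mm->mmap_sem);
            got = get_user_pages(current, current->mm, start, npages,
                                 1 /* write */, 0 /* force */, pages, NULL);
            up_read(&current->mm->mmap_sem);
            return got;
    }

    static void mydrv_unpin(struct page **pages, int npages)
    {
            int i;

            for (i = 0; i < npages; i++)
                    put_page(pages[i]);     /* drop the extra reference */
    }

Until mydrv_unpin() runs, reclaim keeps finding that extra reference,
puts the page back, and scans it again - the retry loop described above.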

2008-02-13 19:02:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Wed, 13 Feb 2008, Christoph Raisch wrote:

> For ehca we currently can't modify a large MR when it has been allocated.
> EHCA Hardware expects the pages to be there (MRs must not have "holes").
> This is also true for the global MR covering all kernel space.
> Therefore we still need the memory to be "pinned" if ib_umem_get() is
> called.

It cannot be freed and then reallocated? What happens when a process
exits?

2008-02-13 19:46:32

by Christian Bell

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Wed, 13 Feb 2008, Christoph Lameter wrote:

> Right. We (SGI) have done something like this for a long time with XPmem
> and it scales ok.

I'd dispute this based on experience developing PGAS language support
on the Altix but more importantly (and less subjectively), I think
that "scales ok" refers to a very specific case. Sure, pages (and/or
regions) can be large on some systems and the number of systems may
not always be in the thousands but you're still claiming scalability
for a mechanism that essentially logs who accesses the regions. Then
there's the fact that reclaim becomes a collective communication
operation over all region accessors. Makes me nervous.

> > When messages are sufficiently large, the control messaging necessary
> > to setup/teardown the regions is relatively small. This is not
> > always the case however -- in programming models that employ smaller
> > messages, the one-sided nature of RDMA is the most attractive part of
> > it.
>
> The messaging would only be needed if a process comes under memory
> pressure. As long as there is enough memory nothing like this will occur.
>
> > Nothing any communication/runtime system can't already do today. The
> > point of RDMA demand paging is enabling the possibility of using RDMA
> > without the implied synchronization -- the optimistic part. Using
> > the notifiers to duplicate existing memory region handling for RDMA
> > hardware that doesn't have HW page tables is possible but undermines
> > the more important consumer of your patches in my opinion.
>

> The notifier scheme should integrate into existing memory region
> handling and not cause a duplication. If you already have library layers
> that do this then it should be possible to integrate it.

I appreciate that you're trying to make a general case for the
applicability of notifiers to all types of existing RDMA hardware and
wire protocols. Also, I'm not disputing whether a HW page table
is required or not: clearly it's not required to make *some* use of
the notifier scheme.

However, short of providing user-level notifications for pinned pages
that are inadvertently released to the O/S, I don't believe that the
patchset provides any significant added value for the HPC community
that can't optimistically do RDMA demand paging.


. . christian

2008-02-13 19:52:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Wed, Feb 13, 2008 at 10:51:58AM -0800, Christoph Lameter wrote:
> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > But this isn't how IB or iwarp work at all. What you describe is a
> > significant change to the general RDMA operation and requires changes to
> > both sides of the connection and the wire protocol.
>
> Yes it may require a separate connection between both sides where a
> kind of VM notification protocol is established to tear these things down and
> set them up again. That is if there is nothing in the RDMA protocol that
> allows a notification to the other side that the mapping is being down
> down.

Well, yes, you could build this thing you are describing on top of the
RDMA protocol and get some support from some of the hardware - but it
is a new set of protocols and they would need to be implemented in
several places. It is not transparent to userspace and it is not
compatible with existing implementations.

Unfortunately it really has little to do with the drivers - changes,
for instance, need to be made to support this in the user space MPI
libraries. The RDMA ops do not pass through the kernel, userspace
talks directly to the hardware which complicates building any sort of
abstraction.

That is where I think you run into trouble: if you ask the MPI people
to add code to their critical path to support swapping they probably
will not be too interested. At a minimum, to support your idea you need
to check on every RDMA if the remote page is mapped... Plus the
overheads Christian was talking about in the OOB channel(s).

Jason

2008-02-13 20:32:24

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Wed, 13 Feb 2008, Christian Bell wrote:

> not always be in the thousands but you're still claiming scalability
> for a mechanism that essentially logs who accesses the regions. Then
> there's the fact that reclaim becomes a collective communication
> operation over all region accessors. Makes me nervous.

Well, reclaim is not a very fast process (and we usually try to avoid it
as much as possible for our HPC). Essentially it is only there to allow
shifts of processing loads and to allow efficient caching of application
data.

> However, short of providing user-level notifications for pinned pages
> that are inadvertently released to the O/S, I don't believe that the
> patchset provides any significant added value for the HPC community
> that can't optimistically do RDMA demand paging.

We currently also run XPmem with pinning. It is great as long as you just
run one load on the system. No reclaim ever occurs.

However, if you do things that require lots of allocations etc. then
the page pinning can easily lead to livelock once reclaim is finally
triggered, and also to strange OOM situations since the VM cannot free any
pages. So the main issue that is addressed here is the reliability of pinned
page operations. Better VM integration avoids these issues because we can
unpin on request to deal with memory shortages.

2008-02-13 20:36:55

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Wed, 13 Feb 2008, Jason Gunthorpe wrote:

> Unfortunately it really has little to do with the drivers - changes,
> for instance, need to be made to support this in the user space MPI
> libraries. The RDMA ops do not pass through the kernel, userspace
> talks directly to the hardware which complicates building any sort of
> abstraction.

Ok so the notifiers have to be handed over to the user space library that
has the function of the device driver here...

> That is where I think you run into trouble, if you ask the MPI people
> to add code to their critical path to support swapping they probably
> will not be too interested. At a minimum to support your idea you need
> to check on every RDMA if the remote page is mapped... Plus the
> overheads Christian was talking about in the OOB channel(s).

You only need to check if a handle has received invalidates. If not,
then you can just go ahead as now. You can use the notifier to take down
the whole region if any reclaim occurs against it (probably the best and
simplest approach to implement). Then you mark the handle so that the
mapping is reestablished before the next operation.
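
In a userspace library that could be as simple as a flag check on the
fast path. A sketch (the reg_handle structure, the invalidated flag and
the channel that sets it are all hypothetical; only the
ibv_reg_mr()/ibv_dereg_mr() verbs calls are real):

    #include <infiniband/verbs.h>

    struct reg_handle {
            void            *addr;
            size_t           len;
            struct ibv_mr   *mr;            /* libibverbs memory region    */
            volatile int     invalidated;   /* set when a reclaim notice
                                               arrives for this region     */
    };

    /* Called before every RDMA operation that uses the handle. */
    static int reg_handle_revalidate(struct ibv_pd *pd, struct reg_handle *h)
    {
            if (!h->invalidated)
                    return 0;               /* fast path: go ahead as now */

            /* Reclaim took the region down: re-register before use. */
            if (h->mr)
                    ibv_dereg_mr(h->mr);
            h->mr = ibv_reg_mr(pd, h->addr, h->len,
                               IBV_ACCESS_LOCAL_WRITE |
                               IBV_ACCESS_REMOTE_READ |
                               IBV_ACCESS_REMOTE_WRITE);
            if (!h->mr)
                    return -1;
            h->invalidated = 0;
            return 0;
    }

Note that re-registering produces a new key, so the remote side would
still have to be told about it - which is where the out-of-band update
discussed earlier comes in.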

2008-02-13 22:51:48

by Kanoj Sarcar

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions


--- Christoph Lameter <[email protected]> wrote:

> On Wed, 13 Feb 2008, Christian Bell wrote:
>
> > not always be in the thousands but you're still claiming scalability
> > for a mechanism that essentially logs who accesses the regions. Then
> > there's the fact that reclaim becomes a collective communication
> > operation over all region accessors. Makes me nervous.
>
> Well, reclaim is not a very fast process (and we usually try to avoid it
> as much as possible for our HPC). Essentially it is only there to allow
> shifts of processing loads and to allow efficient caching of application
> data.
>
> > However, short of providing user-level notifications for pinned pages
> > that are inadvertently released to the O/S, I don't believe that the
> > patchset provides any significant added value for the HPC community
> > that can't optimistically do RDMA demand paging.
>
> We currently also run XPmem with pinning. It is great as long as you just
> run one load on the system. No reclaim ever occurs.
>
> However, if you do things that require lots of allocations etc. then
> the page pinning can easily lead to livelock once reclaim is finally
> triggered, and also to strange OOM situations since the VM cannot free any
> pages. So the main issue that is addressed here is the reliability of pinned
> page operations. Better VM integration avoids these issues because we can
> unpin on request to deal with memory shortages.

I have a question on the basic need for the mmu
notifier stuff wrt rdma hardware and pinning memory.

It seems that the need is to solve potential memory
shortage and overcommit issues by being able to
reclaim pages pinned by rdma driver/hardware. Is my
understanding correct?

If I do understand correctly, then why is rdma page
pinning any different than eg mlock pinning? I imagine
Oracle pins lots of memory (using mlock), how come
they do not run into vm overcommit issues?

Are we up against some kind of breaking c-o-w issue
here that is different between mlock and rdma pinning?

Asked another way, why should effort be spent on a
notifier scheme, rather than on fixing any memory
accounting problems and unifying how pinned pages
are accounted for, whether they get pinned via
mlock() or rdma drivers?

Startup benefits are well understood with the notifier
scheme (ie, not all pages need to be faulted in at
memory region creation time), especially when most of
the memory region is not accessed at all. I would
imagine most of HPC does not work this way though.
Then again, as rdma hardware is applied
(increasingly?) towards apps with short lived
connections, the notifier scheme will help with
startup times.

Kanoj




2008-02-13 23:02:36

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Wed, 13 Feb 2008, Kanoj Sarcar wrote:

> It seems that the need is to solve potential memory
> shortage and overcommit issues by being able to
> reclaim pages pinned by rdma driver/hardware. Is my
> understanding correct?

Correct.

> If I do understand correctly, then why is rdma page
> pinning any different than eg mlock pinning? I imagine
> Oracle pins lots of memory (using mlock), how come
> they do not run into vm overcommit issues?

Mlocked pages are not pinned. They are movable by f.e. page migration and
will potentially be moved by future memory defrag approaches. Currently
we have the same issues with mlocked pages as with pinned pages. There is
work in progress to put mlocked pages onto a different LRU so that reclaim
exempts these pages, and more work on limiting the percentage of memory
that can be mlocked.

> Are we up against some kind of breaking c-o-w issue
> here that is different between mlock and rdma pinning?

Not that I know.

> Asked another way, why should effort be spent on a
> notifier scheme, and rather not on fixing any memory
> accounting problems and unifying how pin pages are
> accounted for that get pinned via mlock() or rdma
> drivers?

There are efforts underway to account for and limit mlocked pages as
described above. Page pinning the way it is done by Infiniband through
increasing the page refcount is treated by the VM as a temporary
condition, not as a permanent pin. The VM will continually try to reclaim
these pages thinking that the temporary usage of the page must cease
soon. This is why the use of large amounts of pinned pages can lead to
livelock situations.

If we want to have pinning behavior then we could mark pinned pages
specially so that the VM will not continually try to evict these pages. We
could manage them similarly to mlocked pages but just not allow page
migration, memory unplug and defrag to occur on pinned memory. All of
these would have to fail. With the notifier scheme the device driver
could be told to get rid of the pinned memory. This would make these 3
techniques work despite having an RDMA memory section.

> Startup benefits are well understood with the notifier
> scheme (ie, not all pages need to be faulted in at
> memory region creation time), specially when most of
> the memory region is not accessed at all. I would
> imagine most of HPC does not work this way though.

No, for optimal performance you would want to prefault all pages as
is done now. The notifier scheme would only become relevant in memory
shortage situations.

> Then again, as rdma hardware is applied (increasingly?) towards apps
> with short lived connections, the notifier scheme will help with startup
> times.

The main use of the notifier scheme is for stability and reliability. The
"pinned" pages become unpinnable on request by the VM. So the VM can work
itself out of memory shortage situations in cooperation with the
RDMA logic instead of simply failing.

2008-02-13 23:23:37

by Pete Wyckoff

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

[email protected] wrote on Tue, 12 Feb 2008 20:09 -0800:
> One other area that has not been brought up yet (I think) is the
> applicability of notifiers in letting users know when pinned memory
> is reclaimed by the kernel. This is useful when a lower-level
> library employs lazy deregistration strategies on memory regions that
> are subsequently released to the kernel via the application's use of
> munmap or sbrk. Ohio Supercomputing Center has work in this area but
> a generalized approach in the kernel would certainly be welcome.

The whole need for memory registration is a giant pain. There is no
motivating application need for it---it is simply a hack around
virtual memory and the lack of full VM support in current hardware.
There are real hardware issues that interact poorly with virtual
memory, as discussed previously in this thread.

The way a messaging cycle goes in IB is:

register buf
post send from buf
wait for completion
deregister buf

This tends to get hidden via userspace software libraries into
a single call:

MPI_send(buf)

Now if you actually do the reg/dereg every time, things are very
slow. So userspace library writers came up with the idea of caching
registrations:

if buf is not registered:
    register buf
post send from buf
wait for completion

The second time that the app happens to do a send from the same
buffer, it proceeds much faster. Spatial locality applies here, and
this caching is generally worth it. Some libraries have schemes to
limit the size of the registration cache too.
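
A stripped-down version of such a cache (hypothetical library code; a
real one would bound its size, handle overlapping ranges and be thread
safe):

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    struct cache_entry {
            void               *addr;
            size_t              len;
            struct ibv_mr      *mr;
            struct cache_entry *next;
    };

    static struct cache_entry *reg_cache;

    /* Return a registration for buf, registering only on a cache miss. */
    static struct ibv_mr *cached_reg(struct ibv_pd *pd, void *buf, size_t len)
    {
            struct cache_entry *e;

            for (e = reg_cache; e; e = e->next)
                    if (e->addr == buf && e->len >= len)
                            return e->mr;           /* hit: skip reg/dereg */

            e = malloc(sizeof(*e));
            if (!e)
                    return NULL;
            e->mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
            if (!e->mr) {
                    free(e);
                    return NULL;
            }
            e->addr = buf;
            e->len  = len;
            e->next = reg_cache;
            reg_cache = e;
            return e->mr;
    }

Nothing here ever deregisters, which is exactly how the problems below
creep in.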

But there are plenty of ways to hurt yourself with such a scheme.
The first is a huge pool of unused but registered memory, as the
library doesn't know the app's patterns, and it doesn't know the VM
pressure level in the kernel.

There are plenty of subtle ways that this breaks too. If the
registered buf is removed from the address space via munmap() or
sbrk() or other ways, the mapping and registration are gone, but the
library has no way of knowing that the app just did this. Sure the
physical page is still there and pinned, but the app cannot get at
it. Later if new address space arrives at the same virtual address
but a different physical page, the library will mistakenly think it
already has it registered properly, and data is transferred from
this old now-unmapped physical page.

The whole situation is rather ridiculous, but we are quite stuck
with it for current generation IB and iWarp hardware. If we can't
have the kernel interact with the device directly, we could at least
manage state in these multiple userspace registration caches. The
VM could ask for certain (or any) pages to be released, and the
library would respond if they are indeed not in use by the device.
The app itself does not know about pinned regions, and the library
is aware of exactly which regions are potentially in use.

Since the great majority of userspace messaging over IB goes through
middleware like MPI or PGAS languages, and they all have the same
approach to registration caching, this approach could fix the
problem for a big segment of use cases.

More text on the registration caching problem is here:

http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf

with an approach using vm_ops open and close operations in a kernel
module here:

http://www.osc.edu/~pw/dreg/

There is a place for VM notifiers in RDMA messaging, but not in
talking to devices, at least not the current set. If you can define
a reasonable userspace interface for VM notifiers, libraries can
manage registration caches more efficiently, letting the kernel
unmap pinned pages as it likes.
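
Building on the cache sketch above, the library side of such an interface
might be no more than this (entirely hypothetical; the delivery mechanism
and entry_in_use() are exactly the pieces that do not exist today):

    /* Run when the kernel asks for [start, start+len) to be released.
     * Reuses struct cache_entry / reg_cache from the sketch above. */
    static void cache_invalidate_range(void *start, size_t len)
    {
            struct cache_entry **pp = &reg_cache, *e;
            char *lo = start, *hi = lo + len;

            while ((e = *pp) != NULL) {
                    char *a = e->addr;

                    if (a < hi && a + e->len > lo && !entry_in_use(e)) {
                            *pp = e->next;          /* unlink entry */
                            ibv_dereg_mr(e->mr);    /* drop the pin */
                            free(e);
                    } else {
                            pp = &e->next;
                    }
            }
    }

The library would answer back which ranges it actually released, and the
kernel could then unmap those pages.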

-- Pete

2008-02-13 23:43:38

by Kanoj Sarcar

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions


--- Christoph Lameter <[email protected]> wrote:

> On Wed, 13 Feb 2008, Kanoj Sarcar wrote:
>
> > It seems that the need is to solve potential memory
> > shortage and overcommit issues by being able to
> > reclaim pages pinned by rdma driver/hardware. Is my
> > understanding correct?
>
> Correct.
>
> > If I do understand correctly, then why is rdma page
> > pinning any different than eg mlock pinning? I imagine
> > Oracle pins lots of memory (using mlock), how come
> > they do not run into vm overcommit issues?
>
> Mlocked pages are not pinned. They are movable by f.e. page migration and
> will potentially be moved by future memory defrag approaches. Currently
> we have the same issues with mlocked pages as with pinned pages. There is
> work in progress to put mlocked pages onto a different LRU so that reclaim
> exempts these pages, and more work on limiting the percentage of memory
> that can be mlocked.
>
> > Are we up against some kind of breaking c-o-w issue
> > here that is different between mlock and rdma pinning?
>
> Not that I know.
>
> > Asked another way, why should effort be spent on a
> > notifier scheme, rather than on fixing any memory
> > accounting problems and unifying how pinned pages
> > are accounted for, whether they get pinned via
> > mlock() or rdma drivers?
>
> There are efforts underway to account for and limit mlocked pages as
> described above. Page pinning the way it is done by Infiniband through
> increasing the page refcount is treated by the VM as a temporary
> condition, not as a permanent pin. The VM will continually try to reclaim
> these pages thinking that the temporary usage of the page must cease
> soon. This is why the use of large amounts of pinned pages can lead to
> livelock situations.

Oh ok, yes, I did see the discussion on this; sorry I
missed it. I do see what notifiers bring to the table
now (without endorsing it :-)).

An orthogonal question is this: is IB/rdma the only
"culprit" that elevates page refcounts? Are there no
other subsystems which do a similar thing?

The example I am thinking about is rawio (Oracle's
mlock'ed SHM regions are handed to rawio, aren't
they?). My understanding of how rawio works in Linux
is quite dated though ...

Kanoj

>
> If we want to have pinning behavior then we could mark pinned pages
> specially so that the VM will not continually try to evict these pages. We
> could manage them similarly to mlocked pages but just not allow page
> migration, memory unplug and defrag to occur on pinned memory. All of
> these would have to fail. With the notifier scheme the device driver
> could be told to get rid of the pinned memory. This would make these 3
> techniques work despite having an RDMA memory section.
>
> > Startup benefits are well understood with the notifier
> > scheme (ie, not all pages need to be faulted in at
> > memory region creation time), especially when most of
> > the memory region is not accessed at all. I would
> > imagine most of HPC does not work this way though.
>
> No, for optimal performance you would want to prefault all pages as
> is done now. The notifier scheme would only become relevant in memory
> shortage situations.
>
> > Then again, as rdma hardware is applied (increasingly?) towards apps
> > with short lived connections, the notifier scheme will help with startup
> > times.
>
> The main use of the notifier scheme is for stability and reliability. The
> "pinned" pages become unpinnable on request by the VM. So the VM can work
> itself out of memory shortage situations in cooperation with the
> RDMA logic instead of simply failing.




2008-02-14 00:01:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Wed, Feb 13, 2008 at 06:23:08PM -0500, Pete Wyckoff wrote:
> [email protected] wrote on Tue, 12 Feb 2008 20:09 -0800:
> > One other area that has not been brought up yet (I think) is the
> > applicability of notifiers in letting users know when pinned memory
> > is reclaimed by the kernel. This is useful when a lower-level
> > library employs lazy deregistration strategies on memory regions that
> > are subsequently released to the kernel via the application's use of
> > munmap or sbrk. Ohio Supercomputing Center has work in this area but
> > a generalized approach in the kernel would certainly be welcome.
>
> The whole need for memory registration is a giant pain. There is no
> motivating application need for it---it is simply a hack around
> virtual memory and the lack of full VM support in current hardware.
> There are real hardware issues that interact poorly with virtual
> memory, as discussed previously in this thread.

Well, the registrations also exist to provide protection against
rogue/faulty remotes, but for the purposes of MPI that is probably not
important.

Here is a thought.. Some RDMA hardware can change the page tables on
the fly. What if the kernel had a mechanism to dynamically maintain a
full registration of the process's entire address space ('mlocked' but
able to be migrated)? MPI would never need to register a buffer, and
all the messy cases with munmap/sbrk/etc go away - the risk is that
other MPI nodes can randomly scribble all over the process :)

Christoph: It seemed to me you were first talking about
freeing/swapping/faulting RDMA'able pages - but would pure migration
as a special hardware supported case be useful like Catilan suggested?

Regards,
Jason

2008-02-14 00:14:08

by Jesse Barnes

[permalink] [raw]
Subject: Re: Demand paging for memory regions

On Wednesday, February 13, 2008 3:43 pm Kanoj Sarcar wrote:
> Oh ok, yes, I did see the discussion on this; sorry I
> missed it. I do see what notifiers bring to the table
> now (without endorsing it :-)).
>
> An orthogonal question is this: is IB/rdma the only
> "culprit" that elevates page refcounts? Are there no
> other subsystems which do a similar thing?
>
> The example I am thinking about is rawio (Oracle's
> mlock'ed SHM regions are handed to rawio, isn't it?).
> My understanding of how rawio works in Linux is quite
> dated though ...

We're doing something similar in the DRM these days... We need big chunks of
memory to be pinned so that the GPU can operate on them, but when the
operation completes we can allow them to be swappable again. I think with
the current implementation, allocations are always pinned, but we'll
definitely want to change that soon.

Dave?

Jesse

2008-02-14 00:57:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

Hi Kanoj,

On Wed, Feb 13, 2008 at 03:43:17PM -0800, Kanoj Sarcar wrote:
> Oh ok, yes, I did see the discussion on this; sorry I
> missed it. I do see what notifiers bring to the table
> now (without endorsing it :-)).

I'm not really sure livelocks are the big issue here.

I'm running N 1G VMs on a 1G ram system, with (N-1)G swapped
out. Combining this with auto-ballooning, rss limiting, and ksm ram
sharing provides really advanced and lowlevel virtualization VM
capabilities to the linux kernel, while at the same time guaranteeing
no oom failures as long as the guest pages are lower than ram+swap
(just slower runtime if too many pages are unshared or if the balloons
are deflated etc.).

Swapping the virtual machine in the host may be more efficient than
having the guest swap over a paravirt virtual swap storage, for
example. As more management features are added admins will gain more
experience in handling those new features and they'll find what's best
for them. mmu notifiers and real reliable swapping are the enablers for
those more advanced VM features.

oom livelocks wouldn't happen anyway with KVM as long as the maximal
amount of guest physical memory is lower than RAM.

> An orthogonal question is this: is IB/rdma the only
> "culprit" that elevates page refcounts? Are there no
> other subsystems which do a similar thing?
>
> The example I am thinking about is rawio (Oracle's
> mlock'ed SHM regions are handed to rawio, isn't it?).
> My understanding of how rawio works in Linux is quite
> dated though ...

In-flight rawio I/O shall be limited. As long as each task can't pin
more than X ram, and the ram is released when the task is oom killed,
and the first get_user_pages/alloc_pages/slab_alloc that returns
-ENOMEM takes an oom fail path that returns failure to userland,
everything is ok.

Even with IB deadlock could only happen if IB would allow unlimited
memory to be pinned down by unprivileged users.

If IB is insecure and DoSable without mmu notifiers, then I'm not sure
how enabling swapping of the IB memory could be enough to fix the
DoS. Keep in mind that even tmpfs can't be safe allowing all ram+swap
to be allocated in a tmpfs file (despite the tmpfs file storage
includes swap and not only ram). Pinning the whole ram+swap with tmpfs
livelocks the same way as pinning the whole ram with ramfs. So if you
add mmu notifier support to IB, you only need to RDMA an area as large
as ram+swap to livelock again as before... no difference at all.

I don't think livelocks have anything to do with mmu notifiers (other
than to deferring the livelock to the "swap+ram" point of no return
instead of the current "ram" point of no return). Livelocks have to be
solved the usual way: handling alloc_pages/get_user_pages/slab
allocation failures with a fail path that returns to userland and
allows the ram to be released if the task was selected for
oom-killage.

The real benefit of the mmu notifiers for IB would be to allow the
rdma region to be larger than RAM without triggering the oom
killer (or without triggering a livelock if it's DoSable, but then the
livelock would need fixing to be converted into a regular oom-killing by
some other means not related to the mmu-notifier; it's really an
orthogonal problem).

So suppose you have an MPI simulation that requires a 10G array and
you have only 1G of ram; then you can rdma over 10G as if you had 10G
of ram. Things will perform ok only if there's some huge locality in
the computations. For virtualization it's orders of magnitude more
useful than for computer clusters, but certain simulations really swap,
so I don't exclude that certain RDMA apps will also need this (dunno
about IB).

2008-02-14 19:35:48

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Wed, 13 Feb 2008, Kanoj Sarcar wrote:

> Oh ok, yes, I did see the discussion on this; sorry I
> missed it. I do see what notifiers bring to the table
> now (without endorsing it :-)).
>
> An orthogonal question is this: is IB/rdma the only
> "culprit" that elevates page refcounts? Are there no
> other subsystems which do a similar thing?

Yes, there are actually two projects by SGI that also ran into the same
issue that motivated this work. One is XPmem, which allows sharing of
process memory between different Linux instances, and then there is the
GRU, which is a kind of DMA engine. Then there is KVM and probably
multiple other drivers.

2008-02-27 22:11:38

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Wed, 13 Feb 2008, Jason Gunthorpe wrote:

> Christoph: It seemed to me you were first talking about
> freeing/swapping/faulting RDMA'able pages - but would pure migration
> as a special hardware supported case be useful like Catilan suggested?

That is a special case of the proposed solution. You could mlock the
regions of interest. Those can then only be migrated but not swapped out.

However, I think we need some limit on the number of pages one can mlock.
Otherwise the VM can get into a situation where reclaim is not possible
because the majority of memory is either mlocked or pinned by I/O etc.