2008-02-12 23:32:24

by Felix Marti

[permalink] [raw]
Subject: RE: [ofa-general] Re: Demand paging for memory regions



> -----Original Message-----
> From: [email protected] [mailto:general-
> [email protected]] On Behalf Of Roland Dreier
> Sent: Tuesday, February 12, 2008 2:42 PM
> To: Christoph Lameter
> Cc: Rik van Riel; [email protected]; Andrea Arcangeli;
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; Robin Holt; [email protected];
> Andrew Morton; [email protected]
> Subject: Re: [ofa-general] Re: Demand paging for memory regions
>
> > > Chelsio's T3 HW doesn't support this.
>
> > Not so far I guess but it could be equipped with these features
> right?
>
> I don't know anything about the T3 internals, but it's not clear that
> you could do this without a new chip design in general. Lot's of RDMA
> devices were designed expecting that when a packet arrives, the HW can
> look up the bus address for a given memory region/offset and place the
> packet immediately. It seems like a major change to be able to
> generate a "page fault" interrupt when a page isn't present, or even
> just wait to scatter some data until the host finishes updating page
> tables when the HW needs the translation.

That is correct; it is not a change we can make for T3. We could, in theory,
deal with changing mappings though. The change would need to be
synchronized: the VM would need to tell us which mappings were
about to change, and the driver would then need to disable DMA to/from
them, make the change, and resume DMA.
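
A minimal sketch of the synchronization sequence described above, in plain C;
the structure and helper names are hypothetical stand-ins for driver
internals, not Chelsio code:

/*
 * Hypothetical stand-ins for driver internals: the VM announces that a
 * mapping is about to change, the driver fences off DMA to the page,
 * the mapping is updated, and DMA resumes against the new bus address.
 */
#include <stdint.h>
#include <stdio.h>

struct hw_mapping {
    uint64_t va;        /* virtual address the HW translates from  */
    uint64_t bus_addr;  /* bus address the HW currently DMAs to    */
    int      dma_live;  /* nonzero while the HW may touch the page */
};

/* Device-specific operations, stubbed out for the sketch. */
static void hw_block_dma(struct hw_mapping *m)  { m->dma_live = 0; }
static void hw_resume_dma(struct hw_mapping *m) { m->dma_live = 1; }

/* VM -> driver: "this mapping is about to change". */
static void mapping_about_to_change(struct hw_mapping *m)
{
    hw_block_dma(m);            /* fence off new DMA, drain in-flight DMA */
}

/* VM -> driver: "the page now lives at new_bus_addr". */
static void mapping_changed(struct hw_mapping *m, uint64_t new_bus_addr)
{
    m->bus_addr = new_bus_addr; /* reprogram the HW translation entry */
    hw_resume_dma(m);           /* DMA may target the new location now */
}

int main(void)
{
    struct hw_mapping m = { .va = 0x1000, .bus_addr = 0xa000, .dma_live = 1 };

    mapping_about_to_change(&m);
    mapping_changed(&m, 0xb000);

    printf("va=%#llx bus=%#llx live=%d\n",
           (unsigned long long)m.va, (unsigned long long)m.bus_addr,
           m.dma_live);
    return 0;
}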

>
> - R.
>


2008-02-13 00:57:56

by Christoph Lameter

[permalink] [raw]
Subject: RE: [ofa-general] Re: Demand paging for memory regions

On Tue, 12 Feb 2008, Felix Marti wrote:

> > I don't know anything about the T3 internals, but it's not clear that
> > you could do this without a new chip design in general. Lot's of RDMA
> > devices were designed expecting that when a packet arrives, the HW can
> > look up the bus address for a given memory region/offset and place the
> > packet immediately. It seems like a major change to be able to
> > generate a "page fault" interrupt when a page isn't present, or even
> > just wait to scatter some data until the host finishes updating page
> > tables when the HW needs the translation.
>
> That is correct, not a change we can make for T3. We could, in theory,
> deal with changing mappings though. The change would need to be
> synchronized though: the VM would need to tell us which mapping were
> about to change and the driver would then need to disable DMA to/from
> it, do the change and resume DMA.

Right. That is the intent of the patchset.

2008-02-14 15:10:21

by Steve Wise

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

Felix Marti wrote:

>
> That is correct, not a change we can make for T3. We could, in theory,
> deal with changing mappings though. The change would need to be
> synchronized though: the VM would need to tell us which mapping were
> about to change and the driver would then need to disable DMA to/from
> it, do the change and resume DMA.
>

Note that for T3, this involves suspending _all_ rdma connections that
are in the same PD as the MR being remapped. This is because the driver
doesn't know who the application advertised the rkey/stag to. So
without that knowledge, all connections that _might_ rdma into the MR
must be suspended. If the MR was only setup for local access, then the
driver could track the connections with references to the MR and only
quiesce those connections.

Point being, it will probably stop all connections that an application
is using (assuming the application uses a single PD).
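
A hypothetical illustration of that scope problem (all names are invented
for the sketch): a remotely accessible MR may have had its rkey/stag
advertised to anyone, so every connection in the PD has to be quiesced,
while a local-only MR can be narrowed to the QPs known to reference it.

#include <stdbool.h>
#include <stddef.h>

struct demo_qp { struct demo_qp *next; };

struct demo_mr {
    bool remote_access;          /* rkey/stag may have been advertised */
    struct demo_qp *local_users; /* QPs known to reference this MR     */
};

struct demo_pd {
    struct demo_qp *all_qps;     /* every connection under the PD      */
};

static void quiesce_qp(struct demo_qp *qp) { (void)qp; /* stop HW access */ }

static void quiesce_for_remap(struct demo_pd *pd, struct demo_mr *mr)
{
    /* Remote access: any QP in the PD *might* RDMA into the MR. */
    struct demo_qp *qp = mr->remote_access ? pd->all_qps : mr->local_users;

    for (; qp; qp = qp->next)
        quiesce_qp(qp);
}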


Steve.

2008-02-14 15:53:47

by Robin Holt

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote:
> Note that for T3, this involves suspending _all_ rdma connections that are
> in the same PD as the MR being remapped. This is because the driver
> doesn't know who the application advertised the rkey/stag to. So without

Is there a reason the driver cannot track these?

> Point being, it will stop probably all connections that an application is
> using (assuming the application uses a single PD).

It seems like the need to not stop all would be a compelling enough reason
to modify the driver to track which processes have received the rkey/stag.

Thanks,
Robin

2008-02-14 16:24:19

by Steve Wise

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

Robin Holt wrote:
> On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote:
>> Note that for T3, this involves suspending _all_ rdma connections that are
>> in the same PD as the MR being remapped. This is because the driver
>> doesn't know who the application advertised the rkey/stag to. So without
>
> Is there a reason the driver can not track these.
>

Because advertising of an MR (i.e. telling the peer about your rkey/stag,
offset and length) is application-specific and can be done out of band,
or in band as simple SEND/RECV payload. Either way, the driver has no
way of tracking this because the protocol used is application-specific.

>> Point being, it will stop probably all connections that an application is
>> using (assuming the application uses a single PD).
>
> It seems like the need to not stop all would be a compelling enough reason
> to modify the driver to track which processes have received the rkey/stag.
>

Yes, _if_ the driver could track this.

And _if_ the RDMA API and paradigm were such that the kernel/driver could
keep track, then remote revocations of MR tags could be supported.

Stevo

2008-02-14 17:49:09

by Caitlin Bestler

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Thu, Feb 14, 2008 at 8:23 AM, Steve Wise <[email protected]> wrote:
> Robin Holt wrote:
> > On Thu, Feb 14, 2008 at 09:09:08AM -0600, Steve Wise wrote:
> >> Note that for T3, this involves suspending _all_ rdma connections that are
> >> in the same PD as the MR being remapped. This is because the driver
> >> doesn't know who the application advertised the rkey/stag to. So without
> >
> > Is there a reason the driver can not track these.
> >
>
> Because advertising of a MR (ie telling the peer about your rkey/stag,
> offset and length) is application-specific and can be done out of band,
> or in band as simple SEND/RECV payload. Either way, the driver has no
> way of tracking this because the protocol used is application-specific.
>
>

I fully agree. If there is one important thing about RDMA and other fastpath
solutions that must be understood, it is that the driver does not see the
payload. This is a fundamental strength, but it means that you have
to identify in advance what intercept points, if any, there are.

You also raise a good point on the scope of any suspend/resume API.
Device reporting of this capability would not be a simple boolean, but
more of a suspend/resume scope. A minimal scope would be any
connection that actually attempts to use the suspended MR. Slightly
wider would be any connection *allowed* to use the MR, which could
expand all the way to any connection under the same PD. Conceivably
I could imagine an RDMA device reporting that it could support
suspend/resume, but only at the scope of the entire device.

But even at such a wide scope, suspend/resume could be useful to
a Memory Manager. The pages could be fully migrated to the new
location, and the only work that was still required during the critical
suspend/resume region was to actually shift to the new map. That
might be short enough that not accepting *any* incoming RDMA
packet would be acceptable.

And if the goal is to replace a memory card the alternative might
be migrating the applications to other physical servers, which would
mean a much longer period of not accepting incoming RDMA packets.

But the broader question is what the goal is here. Allowing memory to
be shuffled is valuable, and perhaps even ultimately a requirement for
high availability systems. RDMA and other direct-access APIs should
be evolving their interfaces to accommodate these needs.

Oversubscribing memory is a totally different matter. If an application
is working with memory that is oversubscribed by a factor of 2 or more,
can it really benefit from zero-copy direct placement? At first glance I
can't see what value RDMA could bring when the overhead of
swapping is going to be that large.

If it really does make sense, then explicitly registering the portion of
memory that should be enabled to receive incoming traffic while the
application is swapped out actually makes sense.

Current Memory Registration methods force applications to either
register too much or too often. They register too much when the cost
of registration is high, and the application responds by registering its
entire buffer pool permanently. This is a problem when it overstates
the amount of memory that the application needs to have resident,
or when the device imposes limits on the size of memory maps that
it can know. The alternative is to register too often, that is on a
per-operation basis.

To me that suggests the solutions lie in making it more reasonable
to register more memory, or in making it practical to register memory
on-the-fly on a per-operation basis with low enough overhead that
applications don't feel the need to build elaborate registration caching
schemes.
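
For concreteness, the two patterns look roughly like this with the
libibverbs userspace API (ibv_reg_mr/ibv_dereg_mr are real calls; the
helper names, error handling and surrounding connection setup are
simplified):

/*
 * "Register too much": pin the whole pool once and keep the MR for the
 * life of the application.  Cheap per operation, but the pinned footprint
 * can far exceed what is actually in flight.
 */
#include <infiniband/verbs.h>
#include <stddef.h>

struct ibv_mr *register_pool_once(struct ibv_pd *pd, void *pool, size_t len)
{
    return ibv_reg_mr(pd, pool, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}

/*
 * "Register too often": pin only this transfer's buffer and release it
 * right afterwards.  Minimal pinned memory, but the register/deregister
 * cost is paid on every operation.
 */
int transfer_with_per_op_registration(struct ibv_pd *pd, void *buf, size_t len)
{
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);

    if (!mr)
        return -1;

    /* ... post the work request using mr->lkey, wait for completion ... */

    return ibv_dereg_mr(mr);
}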

As has been pointed out a few times in this thread, the RDMA and
transport layers simply do not have enough information to know which
portion of registered memory *really* had to be registered. So any
back-pressure scheme where the Memory Manager is asking for
pinned memory to be "given back" would have to go all the way to
the application. Only the application knows what it is "really" using.

I also suspect that most applications that are interested in using
RDMA would rather be told they can allocate 200M indefinitely
(and with real memory backing it) than be given 1GB of virtual
memory that is backed by 200-300M of physical memory,
especially if it meant dealing with memory pressure upcalls.

> >> Point being, it will stop probably all connections that an application is
> >> using (assuming the application uses a single PD).
> >
> > It seems like the need to not stop all would be a compelling enough reason
> > to modify the driver to track which processes have received the rkey/stag.
> >
>
> Yes, _if_ the driver could track this.
>
> And _if_ the rdma API and paradigm was such that the kernel/driver could
> keep track, then remote revokations of MR tags could be supported.
>
> Stevo
>
>

2008-02-14 19:39:40

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Thu, 14 Feb 2008, Steve Wise wrote:

> Note that for T3, this involves suspending _all_ rdma connections that are in
> the same PD as the MR being remapped. This is because the driver doesn't know
> who the application advertised the rkey/stag to. So without that knowledge,
> all connections that _might_ rdma into the MR must be suspended. If the MR
> was only setup for local access, then the driver could track the connections
> with references to the MR and only quiesce those connections.
>
> Point being, it will stop probably all connections that an application is
> using (assuming the application uses a single PD).

Right, but if the system starts reclaiming pages of the application then we
have a memory shortage, so the user should address that by not running
other apps concurrently. Stopping all connections is still better
than the VM getting into major trouble. And stopping connections in
order to move the process memory to a more advantageous memory location
(e.g. using page migration), or in order to move the process memory out
of a range of failing memory, is certainly good.

2008-02-14 20:17:35

by Caitlin Bestler

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Thu, Feb 14, 2008 at 11:39 AM, Christoph Lameter <[email protected]> wrote:
> On Thu, 14 Feb 2008, Steve Wise wrote:
>
> > Note that for T3, this involves suspending _all_ rdma connections that are in
> > the same PD as the MR being remapped. This is because the driver doesn't know
> > who the application advertised the rkey/stag to. So without that knowledge,
> > all connections that _might_ rdma into the MR must be suspended. If the MR
> > was only setup for local access, then the driver could track the connections
> > with references to the MR and only quiesce those connections.
> >
> > Point being, it will stop probably all connections that an application is
> > using (assuming the application uses a single PD).
>
> Right but if the system starts reclaiming pages of the application then we
> have a memory shortage. So the user should address that by not running
> other apps concurrently. The stopping of all connections is still better
> than the VM getting into major trouble. And the stopping of connections in
> order to move the process memory into a more advantageous memory location
> (f.e. using page migration) or stopping of connections in order to be able
> to move the process memory out of a range of failing memory is certainly
> good.
>

In that spirit, there are two important aspects of a suspend/resume API that
would enable the memory manager to solve problems most effectively:

1) The device should be allowed flexibility to extend the scope of the
suspend to what it is capable of implementing -- rather than being forced
to say that it does not support suspend/resume merely because it does so
at a different granularity.

2) It is very important that users of this API understand that it is only
the RDMA device's handling of incoming packets and WQEs that is being
suspended. The peers are not suspended by this API, or even told that this
end is suspending. Unless the suspend is kept *extremely* short there will
be adverse impacts. And "short" here is measured in network terms, not
human terms. The blink of an eye is *way* too long. Any external
dependencies between "suspend" and "resume" will probably mean that things
will not work, especially if the external entities involve a disk drive.

So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover
swapping out pages so they can be reallocated is an exercise in futility. By the
time you resume the connections will be broken or at the minimum damaged.

2008-02-14 20:20:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Thu, 14 Feb 2008, Caitlin Bestler wrote:

> So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover
> swapping out pages so they can be reallocated is an exercise in futility. By the
> time you resume the connections will be broken or at the minimum damaged.

The connections would then have to be torn down before swap out and would
have to be reestablished after the pages have been brought back from swap.

2008-02-14 22:44:11

by Caitlin Bestler

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Thu, Feb 14, 2008 at 12:20 PM, Christoph Lameter <[email protected]> wrote:
> On Thu, 14 Feb 2008, Caitlin Bestler wrote:
>
> > So suspend/resume to re-arrange pages is one thing. Suspend/resume to cover
> > swapping out pages so they can be reallocated is an exercise in futility. By the
> > time you resume the connections will be broken or at the minimum damaged.
>
> The connections would then have to be torn down before swap out and would
> have to be reestablished after the pages have been brought back from swap.
>
>
I have no problem with that, as long as the application layer is responsible
for tearing down and re-establishing the connections. The RDMA/transport
layers are incapable of tearing down and re-establishing a connection
transparently because connections need to be approved above the RDMA layer.

Further, the teardown will have visible artifacts that the application must
deal with, such as flushed Recv WQEs.

This is still a case of: the RDMA device will do X and will not worry about
Y. The reasons for not worrying about Y could be that the suspend will be
very short, or that other mechanisms have taken care of all the Ys
independently.

For example, an HPC cluster that suspended the *entire* cluster would not
have to worry about dropped packets.

2008-02-14 22:49:23

by Christoph Lameter

[permalink] [raw]
Subject: Re: [ofa-general] Re: Demand paging for memory regions

On Thu, 14 Feb 2008, Caitlin Bestler wrote:

> I have no problem with that, as long as the application layer is responsible for
> tearing down and re-establishing the connections. The RDMA/transport layers
> are incapable of tearing down and re-establishing a connection transparently
> because connections need to be approved above the RDMA layer.

I am not that familiar with the RDMA layers, but it seems that RDMA has
a library that does device-driver-like things, right? So the logic would
best fit in there, I guess.

If you combine mlock with the mmu notifier then you can actually
guarantee that a certain memory range will not be swapped out. The
notifier will then only be called if the memory range needs to be
moved for page migration, memory unplug, etc. There may be a limit on
the percentage of memory that you can mlock in the future. This may be
done to guarantee that the VM still has memory to work with.
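
A kernel-side sketch of that combination: the driver subscribes to the mmu
notifier for the owning address space and quiesces/resumes hardware access
around a mapping change. The callback and registration names follow the
mmu_notifier interface this thread is discussing (as it later appeared in
mainline); rdma_quiesce_range() and rdma_remap_and_resume() are hypothetical
driver helpers stubbed out here:

/*
 * Sketch only: rdma_quiesce_range()/rdma_remap_and_resume() are invented
 * driver helpers, stubbed out so the example is self-contained.
 */
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static void rdma_quiesce_range(struct mmu_notifier *mn,
                               unsigned long start, unsigned long end) { }
static void rdma_remap_and_resume(struct mmu_notifier *mn,
                                  unsigned long start, unsigned long end) { }

static void demo_invalidate_range_start(struct mmu_notifier *mn,
                                        struct mm_struct *mm,
                                        unsigned long start,
                                        unsigned long end)
{
    /* Stop HW access to [start, end) before the VM changes the mapping. */
    rdma_quiesce_range(mn, start, end);
}

static void demo_invalidate_range_end(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start,
                                      unsigned long end)
{
    /* Pick up the new translations and let DMA proceed again. */
    rdma_remap_and_resume(mn, start, end);
}

static const struct mmu_notifier_ops demo_ops = {
    .invalidate_range_start = demo_invalidate_range_start,
    .invalidate_range_end   = demo_invalidate_range_end,
};

static struct mmu_notifier demo_mn = { .ops = &demo_ops };

/* Subscribe for the mm that owns the (mlocked) registered region. */
static int demo_subscribe(struct mm_struct *mm)
{
    return mmu_notifier_register(&demo_mn, mm);
}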

2008-02-15 01:29:19

by Caitlin Bestler

[permalink] [raw]
Subject: RE: [ofa-general] Re: Demand paging for memory regions



> -----Original Message-----
> From: Christoph Lameter [mailto:[email protected]]
> Sent: Thursday, February 14, 2008 2:49 PM
> To: Caitlin Bestler
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [ofa-general] Re: Demand paging for memory regions
>
> On Thu, 14 Feb 2008, Caitlin Bestler wrote:
>
> > I have no problem with that, as long as the application layer is responsible for
> > tearing down and re-establishing the connections. The RDMA/transport layers
> > are incapable of tearing down and re-establishing a connection transparently
> > because connections need to be approved above the RDMA layer.
>
> I am not that familiar with the RDMA layers but it seems that RDMA has
> a library that does device driver like things right? So the logic
would
> best fit in there I guess.
>
> If you combine mlock with the mmu notifier then you can actually
> guarantee that a certain memory range will not be swapped out. The
> notifier will then only be called if the memory range will need to be
> moved for page migration, memory unplug etc etc. There may be a limit
> on
> the percentage of memory that you can mlock in the future. This may be
> done to guarantee that the VM still has memory to work with.
>

The problem is that with existing APIs, or even slightly modified APIs,
the RDMA layer will not be able to figure out which connections need to
be "interrupted" in order to deal with which memory suspensions.

Further, because any request for a new connection will be handled by
the remote *application layer* peer, there is no way for the two RDMA
layers to agree to covertly tear down and re-establish the connection.
Nor really should there be; connections should be approved by OS-layer
networking controls. RDMA should not be able to tell the network stack,
"trust me, you don't have to check if this connection is legitimate".

Another example: if you terminate a connection, pending receive operations
complete *to the user* in a Completion Queue. Those completions are NOT
seen by the RDMA layer, and especially not by the Connection Manager. It
has absolutely no way to repost them transparently to the same connection
when the connection is re-established.

Even worse, some portions of a receive operation might have been placed
in the receive buffer and acknowledged to the remote peer. But there is
no mechanism to report this fact in the CQE. A receive operation that is
aborted is aborted. There is no concept of partial success. Therefore you
cannot covertly terminate a connection mid-operation and covertly
re-establish it later. Data will be lost, it will no longer be a reliable
connection, and therefore it needs to be torn down anyway.
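
In libibverbs terms, this is what the application sees after a teardown:
the posted receives complete to it with a flush status, and nothing below
the application can transparently repost them (ibv_poll_cq() and
IBV_WC_WR_FLUSH_ERR are real; everything else is omitted):

/*
 * Every flushed receive surfaces here, to the application, which is the
 * only layer that can decide what to do with it next.
 */
#include <infiniband/verbs.h>
#include <stdio.h>

void drain_after_teardown(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status == IBV_WC_WR_FLUSH_ERR)
            fprintf(stderr, "recv wr_id=%llu flushed by connection teardown\n",
                    (unsigned long long)wc.wr_id);
    }
}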

The RDMA layers also cannot tell the other side not to transmit. Flow
control is the responsibility of the application layer, not RDMA.

What the RDMA layer could do is this: once you tell it to suspend a given
memory region it can either tell you that it doesn't know how to do that,
or it can instruct the device to stop processing a set of connections so
as to cease all access to the given Memory Region. When you resume, it
can guarantee that it is no longer using any cached older mappings
for the memory region (assuming it was capable of doing the suspend),
and then, because RDMA connections are reliable, everything will recover
unless the connection timed out. The chance that it will time out is
probably low, but the chance that the underlying connection will be in
slow start or equivalent is much higher.
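
A rough sketch of what such an interface might look like; none of these
names are an existing verbs or kernel API, and the scope enumeration simply
mirrors the granularities discussed earlier in the thread:

/*
 * Hypothetical interface shape -- not an existing API.  The device reports
 * the granularity at which it can quiesce, and refuses the request when it
 * cannot honor it at all.
 */
#include <errno.h>

enum suspend_scope {
    SUSPEND_SCOPE_NONE,     /* device cannot suspend at all            */
    SUSPEND_SCOPE_MR,       /* only connections actually using the MR  */
    SUSPEND_SCOPE_PD,       /* everything under the protection domain  */
    SUSPEND_SCOPE_DEVICE,   /* the whole adapter                       */
};

struct demo_device {
    enum suspend_scope scope;
};

/* Quiesce whatever scope the device supports for this MR; on resume the
 * device must have dropped any cached translations for the region. */
static int demo_suspend_mr(struct demo_device *dev, unsigned int mr_handle)
{
    (void)mr_handle;
    if (dev->scope == SUSPEND_SCOPE_NONE)
        return -EOPNOTSUPP;     /* caller must fall back to pinning */
    /* ... stop processing on the affected connections/WQEs ... */
    return 0;
}

static int demo_resume_mr(struct demo_device *dev, unsigned int mr_handle)
{
    (void)dev;
    (void)mr_handle;
    /* ... reload translations, restart the suspended connections ... */
    return 0;
}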

So any solution that requires the upper layers to suspend operations
for a brief bit will require explicit interaction with those layers.
No RDMA layer can perform the sleight of hand tricks that you seem
to want it to perform.

At the RDMA layer the best you could get is very brief suspensions
for the purpose of *re-arranging* memory, not of reducing the amount
of registered memory. If you need to reduce the amount of registered
memory then you have to talk to the application. Discussions on making
it easier for the application to trim a memory region dynamically might
be in order, but you will not work around the fact that the application
layer needs to determine what pages are registered. And applications would
really prefer just to be told how much memory they can have up front; they
can figure out how to deal with that amount of memory on their own.

2008-02-15 02:38:13

by Christoph Lameter

[permalink] [raw]
Subject: RE: [ofa-general] Re: Demand paging for memory regions

On Thu, 14 Feb 2008, Caitlin Bestler wrote:

> So any solution that requires the upper layers to suspend operations
> for a brief bit will require explicit interaction with those layers.
> No RDMA layer can perform the sleight of hand tricks that you seem
> to want it to perform.

Looks like it has to be up there, right.

> AT the RDMA layer the best you could get is very brief suspensions for
> the purpose of *re-arranging* memory, not of reducing the amount of
> registered memory. If you need to reduce the amount of registered memory
> then you have to talk to the application. Discussions on making it
> easier for the application to trim a memory region dynamically might be
> in order, but you will not work around the fact that the application
> layer needs to determine what pages are registered. And they would
> really prefer just to be told how much memory they can have up front,
> they can figure out how to deal with that amount of memory on their own.

What does it mean that the "application layer has to determine what
pages are registered"? The application does not know which of its pages
are currently in memory. It can only force these pages to stay in memory
if they are mlocked.

2008-02-15 18:11:51

by Caitlin Bestler

[permalink] [raw]
Subject: RE: [ofa-general] Re: Demand paging for memory regions

Christoph Lameter asked:
>
> What does it mean that the "application layer has to be determine what
> pages are registered"? The application does not know which of its pages
> are currently in memory. It can only force these pages to stay in
> memory if their are mlocked.
>

An application that advertises an RDMA accessible buffer
to a remote peer *does* have to know that its pages *are*
currently in memory.

The application does *not* need for the virtual-to-physical
mapping of those pages to be frozen for the lifespan of the
Memory Region. But it is issuing an invitation to its peer
to perform direct writes to the advertised buffer. When the
peer decides to exercise that invitation the pages have to
be there.

An analogy: when you write a check for $100 you do not have
to identify the serial numbers of ten $10 bills, but you are
expected to have the funds in your account.

Issuing a buffer advertisement for memory you do not have
is the network equivalent of writing a check that you do
not have funds for.

Now, just as your bank may offer overdraft protection, an
RDMA device could merely report a page fault rather than
tearing down the connection itself. But that does not grant
permission for applications to advertise buffer space that
they do not have committed, it merely helps recovery from
a programming fault.

A suspend/resume interface between the Virtual Memory Manager
and the RDMA layer allows pages to be re-arranged at the
convenience of the Virtual Memory Manager without breaking
the application layer peer-to-peer contract. The current
interfaces that pin exact pages are really the equivalent
of having to tell the bank that when Joe cashes this $100
check that you should give him *these* ten $10 bills. It
works, but it adds too much overhead and is very inflexible.
So there are a lot of good reasons to evolve this interface
to better deal with these issues. Other areas of possible
evolution include allowing growing or trimming of Memory
Regions without invalidating their advertised handles.

But the more fundamental issue is recognizing that applications
that use direct interfaces need to know that buffers that they
enable truly have committed resources. They need a way to
ask for twenty *real* pages, not twenty pages of address
space. And they need to do it in a way that allows memory
to be rearranged or even migrated with them to a new host.

2008-02-15 18:46:13

by Christoph Lameter

[permalink] [raw]
Subject: RE: [ofa-general] Re: Demand paging for memory regions

On Fri, 15 Feb 2008, Caitlin Bestler wrote:

> > What does it mean that the "application layer has to be determine what
> > pages are registered"? The application does not know which of its pages
> > are currently in memory. It can only force these pages to stay in
> > memory if their are mlocked.
> >
>
> An application that advertises an RDMA accessible buffer
> to a remote peer *does* have to know that its pages *are*
> currently in memory.

Ok that would mean it needs to inform the VM of that issue by mlocking
these pages.

> But the more fundamental issue is recognizing that applications
> that use direct interfaces need to know that buffers that they
> enable truly have committed resources. They need a way to
> ask for twenty *real* pages, not twenty pages of address
> space. And they need to do it in a way that allows memory
> to be rearranged or even migrated with them to a new host.

mlock will force the pages to stay in memory without requiring the OS to
keep them where they are.

2008-02-15 18:56:05

by Caitlin Bestler

[permalink] [raw]
Subject: RE: [ofa-general] Re: Demand paging for memory regions



> -----Original Message-----
> From: Christoph Lameter [mailto:[email protected]]
> Sent: Friday, February 15, 2008 10:46 AM
> To: Caitlin Bestler
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: RE: [ofa-general] Re: Demand paging for memory regions
>
> On Fri, 15 Feb 2008, Caitlin Bestler wrote:
>
> > > What does it mean that the "application layer has to be determine what
> > > pages are registered"? The application does not know which of its pages
> > > are currently in memory. It can only force these pages to stay in
> > > memory if their are mlocked.
> > >
> >
> > An application that advertises an RDMA accessible buffer
> > to a remote peer *does* have to know that its pages *are*
> > currently in memory.
>
> Ok that would mean it needs to inform the VM of that issue by mlocking
> these pages.
>
> > But the more fundamental issue is recognizing that applications
> > that use direct interfaces need to know that buffers that they
> > enable truly have committed resources. They need a way to
> > ask for twenty *real* pages, not twenty pages of address
> > space. And they need to do it in a way that allows memory
> > to be rearranged or even migrated with them to a new host.
>
> mlock will force the pages to stay in memory without requiring the OS
> to keep them where they are.

So that would mean that mlock is used by the application before it
registers memory for direct access, and then it is up to the RDMA
layer and the OS to negotiate actual pinning of the addresses for
whatever duration is required.
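
A small userspace sketch of that ordering, assuming a hypothetical helper
name; mlock() and ibv_reg_mr() are real calls, and error handling is
abbreviated:

/*
 * The 4 KiB page-size assumption is illustrative only.
 */
#include <infiniband/verbs.h>
#include <sys/mman.h>
#include <stdlib.h>

struct ibv_mr *lock_then_register(struct ibv_pd *pd, size_t len)
{
    void *buf = NULL;

    if (posix_memalign(&buf, 4096, len))    /* page-aligned buffer */
        return NULL;

    if (mlock(buf, len)) {                  /* keep the pages resident ... */
        free(buf);
        return NULL;
    }

    /* ... but leave the VM free to rearrange them, provided the RDMA layer
     * is told (e.g. via an mmu notifier) before a mapping changes. */
    return ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
}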

There is no *protocol* barrier to replacing pages within a Memory
Region as long as it is done in a way that keeps the content of
those pages coherent. But existing devices have their own ideas
on how this is done and existing devices are notoriously poor at
learning new tricks.

Merely mlocking pages deals with the end-to-end RDMA semantics.
What still needs to be addressed is how a fastpath interface
would dynamically pin and unpin. Yielding pins for short-term
suspensions (and flushing cached translations) deals with the
rest. Understanding the range of support that existing devices
could provide with software updates would be the next step if
you wanted to pursue this.

2008-02-15 20:02:22

by Christoph Lameter

[permalink] [raw]
Subject: RE: [ofa-general] Re: Demand paging for memory regions

On Fri, 15 Feb 2008, Caitlin Bestler wrote:

> So that would mean that mlock is used by the application before it
> registers memory for direct access, and then it is up to the RDMA
> layer and the OS to negotiate actual pinning of the addresses for
> whatever duration is required.

Right.

> There is no *protocol* barrier to replacing pages within a Memory
> Region as long as it is done in a way that keeps the content of
> those page coherent. But existing devices have their own ideas
> on how this is done and existing devices are notoriously poor at
> learning new tricks.

Hmmmm.. Okay. But that is mainly a device driver maintenance issue.

> Merely mlocking pages deals with the end-to-end RDMA semantics.
> What still needs to be addressed is how a fastpath interface
> would dynamically pin and unpin. Yielding pins for short-term
> suspensions (and flushing cached translations) deals with the
> rest. Understanding the range of support that existing devices
> could provide with software updates would be the next step if
> you wanted to pursue this.

That is addressed on the VM level by the mmu_notifier which started this
whole thread. The RDMA layers need to subscribe to this notifier and then
do whatever the hardware requires to unpin and pin memory. I can only go
as far as dealing with the VM layer. If you have any issues there I'd be
glad to help.

2008-02-15 20:17:12

by Caitlin Bestler

[permalink] [raw]
Subject: RE: [ofa-general] Re: Demand paging for memory regions

Christoph Lameter wrote
>
> > Merely mlocking pages deals with the end-to-end RDMA semantics.
> > What still needs to be addressed is how a fastpath interface
> > would dynamically pin and unpin. Yielding pins for short-term
> > suspensions (and flushing cached translations) deals with the
> > rest. Understanding the range of support that existing devices
> > could provide with software updates would be the next step if
> > you wanted to pursue this.
>
> That is addressed on the VM level by the mmu_notifier which started
> this whole thread. The RDMA layers need to subscribe to this notifier
> and then do whatever the hardware requires to unpin and pin memory.
> I can only go as far as dealing with the VM layer. If you have any
> issues there I'd be glad to help.

There isn't much point in the RDMA layer subscribing to mmu notifications
if the specific RDMA device will not be able to react appropriately when
the notification occurs. I don't see how you get around needing to know
which devices are capable of supporting page migration (via suspend/resume
or other mechanisms) and which can only respond to a page migration by
aborting connections.

2008-02-15 22:58:35

by Christoph Lameter

[permalink] [raw]
Subject: RE: [ofa-general] Re: Demand paging for memory regions

On Fri, 15 Feb 2008, Caitlin Bestler wrote:

> There isn't much point in the RDMA layer subscribing to mmu notifications
> if the specific RDMA device will not be able to react appropriately when
> the notification occurs. I don't see how you get around needing to know
> which devices are capable of supporting page migration (via suspend/resume
> or other mechanisms) and which can only respond to a page migration by
> aborting connections.

You either register callbacks if the device can react properly, or you
don't. If you don't, then the device will continue to have the problem with
page pinning etc. until someone comes around and implements the
mmu callbacks to fix these issues.

I have doubts regarding the claim that some devices just cannot be made to
suspend and resume appropriately. They obviously can be shut down, and so
it's a matter of sequencing things the right way, i.e. stop the app,
wait for a quiet period, then release resources, etc.


2008-02-15 23:52:25

by Caitlin Bestler

[permalink] [raw]
Subject: RE: [ofa-general] Re: Demand paging for memory regions



> -----Original Message-----
> From: Christoph Lameter [mailto:[email protected]]
> Sent: Friday, February 15, 2008 2:50 PM
> To: Caitlin Bestler
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: RE: [ofa-general] Re: Demand paging for memory regions
>
> On Fri, 15 Feb 2008, Caitlin Bestler wrote:
>
> > There isn't much point in the RDMA layer subscribing to mmu notifications
> > if the specific RDMA device will not be able to react appropriately when
> > the notification occurs. I don't see how you get around needing to know
> > which devices are capable of supporting page migration (via suspend/resume
> > or other mechanisms) and which can only respond to a page migration by
> > aborting connections.
>
> You either register callbacks if the device can react properly or you
> dont. If you dont then the device will continue to have the problem
> with
> page pinning etc until someone comes around and implements the
> mmu callbacks to fix these issues.
>
> I have doubts regarding the claim that some devices just cannot be made
> to suspend and resume appropriately. They obviously can be shutdown and
> so its a matter of sequencing the things the right way. I.e. stop the app
> wait for a quiet period then release resources etc.
>
>

That is true. What some devices will be unable to do is suspend
and resume in a manner that is transparent to the application.
However, for the duration required to re-arrange pages it is
definitely feasible to do so transparently to the application.

Presumably the Virtual Memory Manager would be more willing to
take an action that is transparent to the user than one that is
disruptive, although obviously as the owner of the physical memory
it has the right to do either.