I don't think that we should jump to the conclusion that in the long
term HPC users cannot benefit from support of mechanisms such as
hotremoval of memory or other forms of page migration in physical
memory. In an earlier exchange on the openib-general list Mike Krause
sent the message quoted below on very much the same topic. On the other
hand I am willing to accept that there is practical value to
implementations which are not (yet) sophisticated enough to support
the migration functions.

Steve Langdon

> Michael Krause wrote:
>
> At 05:35 PM 3/14/2005, Caitlin Bestler wrote:
>
>>
>>
>> > -----Original Message-----
>> > From: Troy Benjegerdes [ mailto:[email protected]]
>> > Sent: Monday, March 14, 2005 5:06 PM
>> > To: Caitlin Bestler
>> > Cc: [email protected]
>> > Subject: Re: [openib-general] Getting rid of pinned memory requirement
>> >
>> > >
>> > > The key is that the entire operation either has to be fast
>> > > enough so that no connection or application session layer
>> > > time-outs occur, or an end-to-end agreement to suspend the
>> > > connection is a requirement. The first option seems more
>> > > plausible to me; the second essentially requires extending
>> > > the CM protocol. That's a tall order even for InfiniBand,
>> > > and it's even worse for iWARP, where the CM functionality
>> > > typically ends when the connection is established.
>> >
>> > I'll buy the good network design argument.
>
>
> I and others designed InfiniBand RNR (Receiver not ready) operations
> to allow one to adjust V-to-P mappings (not change the address that
> was advertised) in order to allow an OS to safely play some games with
> memory and not drop a connection. The time values associated with RNR
> allow a solution to tolerate up to infinite amount of time to perform
> such operations but the envisioned goal was to do this on the order of
> a handful of milliseconds in the worst case. For iWARP, there was no
> support for defining RNR functionality as indeed many people claimed
> one could just drop in-bound segments and allow the retransmission
> protocol to deal with the delay (even if this has performance
> implications due to back-off algorithms though some claim SACK would
> minimize this to a large extent). Again, the idea was to minimize the
> worst case to milliseconds of down time. BTW, all of this assumed
> that the OS would not perform these types of changes that often so the
> long-term impact on an application would be minimal.
>
>> >
>> > I suppose if the kernel wants to revoke a card's pinned
>> > memory, we should be able to guarantee that it gets new
>> > pinned memory within a bounded time. What sort of timing do
>> > we need? Milliseconds?
>> > Microseconds?
>> >
>> > In the case of iWarp, isn't this just TCP underneath? If so,
>> > can't we just drop any packets in the pipe on the floor and
>> > let them get retransmitted? (I suppose the same argument goes
>> > for infiniband... what sort of a time window do we have for
>> > retransmission?)
>> >
>> > What are the limits on end-to-end flow control in IB and iWarp?
>> >
>>
>> From the RDMA Provider's perspective, the short answer is "quick
>> enough so that I don't have to do anything heroic to keep the
>> connection alive."
>
>
> It should not require anything heroic. What it does require is a
> local method to suspend the local QP(s) so that it cannot place or
> read memory in the affected area. That can take some time depending
> upon the implementation. There is then the time to overwrite the
> mappings which again depending upon the implementation and the number
> of mappings could be milliseconds in length.
>
>> With TCP you also have to add "and healthy". If you've ever had a
>> long download that got effectively stalled by a burst of noise and
>> you just hit the 'reload' button on your browser then you know what
>> I'm talking about.
>>
>> But in transport neutral terms I would think that one RTT is
>> definitely safe -- that much data could have been dropped by
>> one switch failure or one nasty spike in inbound noise.
>>
>> > >
>> > > Yes, there are limits on how much memory you can mlock, or even
>> > > allocate. Applications are required to register memory precisely
>> > > because the required guarantees are not there by default.
>> > > Eliminating those guarantees *is* effectively rewriting every
>> > > RDMA application without even letting them know.
>> >
>> > Some of this argument is a policy issue, which I would argue
>> > shouldn't be hard-coded in the code or in the network hardware.
>> >
>> > At least in my view, the guarantees are only there to make
>> > applications go fast. We are getting low latency and high
>> > performance with infiniband by making memory registration go
>> > really really slow. If, to make big HPC simulation
>> > applications work, we wind up doing memcpy() to put the data
>> > into a registered buffer because we can't register half of
>> > physical memory, the application isn't going very fast.
>> >
>>
>> What you are looking for is a distinction between registering
>> memory to *enable* the RNIC to optimize local access and
>> registering memory to enable its being advertised to the
>> remote end.
>>
>> Early implementations of RDMA, both IB and iWARP, have not
>> distinguished between the two. But theoretically *applications*
>> do not need memory regions that are not enabled for remote
>> access to be pinned. That is an RNIC requirement that could
>> evolve. But applications themselves *do* need remotely
>> accessible memory regions, portions of which they intend
>> to advertise with RKeys, to be truly available (i.e., pinned).
>>
>> You are also making a policy assumption that an application
>> that actually needs half of physical memory should be using
>> paged memory. Memory is cheap, and if performance is critical
>> why should this memory be swapped out to disk?
>>
>> Is the limitation on not being able to register half of
>> physical memory based upon some assumption that swapping
>> is a requirement? Or is it a limitation in the memory region
>> size? If it's the latter, you need to get the OS to support
>> larger page sizes.
>
>
> For some OS, you can pin very large areas. I've seen 15/16 of memory
> being able to be pinned with no adverse impacts on the applications.
> For these OS, kernel memory is effectively pinned memory. As such,
> depending upon the mix of services being provided, the system may
> operate quite nicely with such large amounts of memory being pinned.
> As more services are "ported" to operate over RDMA technologies,
> memory management isn't necessarily any harder; it just becomes
> something people have to think more about. Today's VM designs have
> allowed people to get sloppy as they assume that swapping will occur
> and since many platforms are not that loaded, they don't see any real
> adverse impacts. User-space RDMA applications require people to
> think once again about memory management and that swapping isn't a
> get-out-of-jail card. One needs to develop resource management tools
> to determine who obtains specified amounts of resources and their
> priorities. For the most part, this is somewhat a re-invention of
> some thinking that went into the micro-kernel work in past years.
> These problems are not intractable; they are only constrained by the
> legacy inertia inherent in all technologies today.
>
> Mike
>
>
>
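
The timing argument in Mike's message can be put into a rough
back-of-the-envelope model. The sketch below is only an illustration:
the function names, the per-mapping cost and the timer values are all
made up, and the assumption that each retry waits exactly one RNR timer
interval is a simplification. The one InfiniBand fact it relies on is
that an RNR retry count of 7 means "retry indefinitely".

```python
import math

# In InfiniBand, an RNR retry count of 7 means "retry indefinitely".
RNR_RETRY_INFINITE = 7


def tolerated_stall_ms(rnr_timer_ms, rnr_retry):
    """Rough upper bound on how long the receiver may stay 'not ready'.

    Simplification (an assumption of this sketch): each retry waits one
    full RNR timer interval and nothing else contributes to the budget.
    """
    if rnr_retry == RNR_RETRY_INFINITE:
        return math.inf
    return rnr_timer_ms * rnr_retry


def remap_time_ms(suspend_ms, n_mappings, per_mapping_us):
    """Time to suspend the QP(s) plus time to overwrite the mappings."""
    return suspend_ms + n_mappings * per_mapping_us / 1000.0


def connection_survives(remap_ms, rnr_timer_ms, rnr_retry):
    """True if the remap fits inside the modelled RNR retry budget."""
    return remap_ms <= tolerated_stall_ms(rnr_timer_ms, rnr_retry)


if __name__ == "__main__":
    # All numbers below are hypothetical.
    remap = remap_time_ms(suspend_ms=1.0, n_mappings=1000, per_mapping_us=2.0)
    print("remap takes", remap, "ms")
    print("3 retries:", connection_survives(remap, rnr_timer_ms=10.0, rnr_retry=3))
    print("infinite retries:", connection_survives(remap, rnr_timer_ms=10.0, rnr_retry=7))
```

The point of the model is only that a remap lasting a few milliseconds
fits comfortably inside a finite retry budget, while the infinite retry
setting removes the bound altogether.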

IWAMOTO Toshihiro wrote:

>At Mon, 25 Apr 2005 16:58:03 -0700,
>Roland Dreier wrote:
>
>
>> Andrew> It would be better to obtain this memory via a mmap() of
>> Andrew> some special device node, so we can perform appropriate
>> Andrew> permission checking and clean everything up on unclean
>> Andrew> application exit.
>>
>>This seems to interact poorly with how applications want to use RDMA,
>>ie typically through a library interface such as MPI. People doing
>>HPC don't want to recode their apps to use a new allocator, they just
>>want to link to a new MPI library and have the app go fast.
>>
>>
>
>Such HPC users cannot use the memory hotremoval feature, and something
>needs to be implemented so that the NUMA migration can handle such
>memory properly, but I see your point.
>
>If such memory were allocated by a driver, the memory could be placed
>in non-hotremovable areas to avoid the above problems.
>
>--
>IWAMOTO Toshihiro

2005-04-26 03:27:10

by Andrew Morton

Subject: Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

Timur Tabi <[email protected]> wrote:
>
> Andrew Morton wrote:
>
> > RLIMIT_MEMLOCK sounds like the appropriate mechanism. We cannot rely upon
> > userspace running mlock(), so perhaps it is appropriate to run sys_mlock()
> > in-kernel because that gives us the appropriate RLIMIT_MEMLOCK checking.
>
> I don't see what's wrong with relying on userspace to call mlock(). First of all, all RDMA
> apps call a third-party API, like DAPL or MPI, to register memory. The memory needs to be
> registered in order for the driver and adapter to know where it is. During this
> registration, the memory is also pinned. That's when we call mlock().

All the above refers to well-behaved applications.

Now think about how the syscalls which you provide may be used by
applications which are *designed* to cripple or to compromise the machine.

> >
> > However a hostile app can just go and run munlock() and then allocate
> > some more pinned-by-get_user_pages() memory.
>
> Isn't mlock() on a per-process basis anyway? How can one process call munlock() on
> another process' memory?

I'm referring to an application which uses your syscalls to obtain pinned
memory and uses munlock() so that it may then use your syscalls to obtain
even more pinned memory, with the objective of taking the machine down.
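
The attack can be shown with a toy model. This is not kernel code: the
class, its methods and the page counts are invented for illustration,
and it assumes a registration path that charges pinned pages only
through mlock-style accounting which userspace is free to undo.

```python
class NaiveAccounting:
    """Toy model: pins are charged via a lock count that munlock() can undo."""

    def __init__(self, limit_pages):
        self.limit = limit_pages  # stands in for RLIMIT_MEMLOCK, in pages
        self.locked = 0           # pages currently charged against the limit
        self.pinned = 0           # pages actually pinned by the driver

    def register(self, pages):
        """Hypothetical registration call: check the limit, then pin."""
        if self.locked + pages > self.limit:
            return False
        self.locked += pages
        self.pinned += pages
        return True

    def munlock(self, pages):
        """Userspace drops the lock charge; the driver's pin remains."""
        self.locked -= min(pages, self.locked)


def exhaust(limit_pages, chunk, total_pages):
    """Register, munlock, repeat, until physical memory is used up."""
    acct = NaiveAccounting(limit_pages)
    while acct.pinned + chunk <= total_pages and acct.register(chunk):
        acct.munlock(chunk)
    return acct.pinned


if __name__ == "__main__":
    # Hypothetical machine: 1024 pages of RAM, a limit of 16 locked pages.
    print("pages pinned by hostile app:", exhaust(16, 16, 1024))
```

In this model the limit is never exceeded at any single check, yet the
total pinned memory grows until nothing is left.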

> > umm, how about we
> >
> > - force the special pages into a separate vma
> >
> > - run get_user_pages() against it all
> >
> > - use RLIMIT_MEMLOCK accounting to check whether the user is allowed to
> > do this thing
> >
> > - undo the RLIMIT_MEMLOCK accounting in ->release
>
> Isn't this kinda what mlock() does already? Create a new VMA and then VM_LOCK it?

kinda. But applications can undo the mlock which the kernel did.

> > This will all interact with user-initiated mlock/munlock in messy ways.
> > Maybe a new kernel-internal vma->vm_flag which works like VM_LOCKED but is
> > unaffected by mlock/munlock activity is needed.
> >
> > A bit of generalisation in do_mlock() should suit?
>
> Yes, but do_mlock() needs to prevent pages from being moved during memory hotswap.

I haven't even thought about memory hotswap. Surely it'll fail if the
pages are pinned by get_user_pages()?
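
As a rough sketch of the accounting idea discussed above (a toy model,
not kernel code): driver pins are charged against the same limit as
mlock() but tracked separately, so munlock() cannot release them, and
the charge is only dropped on release. The class and method names are
hypothetical, and sharing one limit between user locks and driver pins
is an assumption of this sketch.

```python
class PinAccounting:
    """Toy model of a pin charge that mlock()/munlock() cannot touch."""

    def __init__(self, limit_pages):
        self.limit = limit_pages  # stands in for RLIMIT_MEMLOCK, in pages
        self.user_locked = 0      # pages locked through mlock()
        self.kernel_pinned = 0    # pages pinned by the driver (the separate vma)

    def _charged(self):
        return self.user_locked + self.kernel_pinned

    def mlock(self, pages):
        if self._charged() + pages > self.limit:
            return False
        self.user_locked += pages
        return True

    def munlock(self, pages):
        # Only user locks are affected; driver pins stay charged.
        self.user_locked -= min(pages, self.user_locked)

    def register(self, pages):
        """Hypothetical driver registration: check the limit, then pin."""
        if self._charged() + pages > self.limit:
            return False
        self.kernel_pinned += pages
        return True

    def release(self):
        """Models ->release: undo the accounting when the device is closed."""
        self.kernel_pinned = 0


if __name__ == "__main__":
    acct = PinAccounting(limit_pages=16)
    print("first registration:", acct.register(16))
    acct.munlock(16)  # has no effect on the driver's charge
    print("after munlock:", acct.register(1))
    acct.release()
    print("after release:", acct.register(16))
```

With this split, the register-then-munlock loop no longer gains
anything, because the charge for pinned pages survives until release.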

2005-04-26 03:34:36

On Wed, May 11, 2005 at 05:53:36PM -0500, Timur Tabi wrote:
> Andrea Arcangeli wrote:
>
> >If the problem appears again even after the last fix for the COW I did
> >last year, then it means we have yet another bug to fix.
>
> All of my memory pinning test cases pass when I use get_user_pages() with
> kernels 2.6.7 and later.

Well then your problem was the cow bug, that was corrupting userland
with O_DIRECT too...