2005-04-22 13:11:11

by Bodo Eggert

Subject: Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

Andy Isaacson <[email protected]> wrote:
> On Wed, Apr 20, 2005 at 10:07:45PM -0500, Timur Tabi wrote:

>> I don't know if VM_REGISTERED is a good idea or not, but it should be
>> absolutely impossible for the kernel to reclaim "registered" (aka pinned)
>> memory, no matter what. For RDMA services (such as Infiniband, iWARP, etc),
>> it's normal for non-root processes to pin hundreds of megabytes of memory,
>> and that memory better be locked to those physical pages until the
>> application deregisters them.
>
> If you take the hardline position that "the app is the only thing that
> matters", your code is unlikely to get merged. Linux is a
> general-purpose OS.

All userspace hardware drivers that use DMA will require pinned pages (and
some of them will require physically contiguous memory). Since this memory
may be scheduled for DMA access, reclaiming those pages may (read: will)
result in "random" memory corruption unless the driver itself does the
reclaiming.

You can't even set a time limit: the driver may have allocated all DMA
memory to queued transfers, and some medium still needs to be loaded by
the lazy robot. As soon as the robot arrives - boom. (For the same reason,
this memory MUST NOT be freed if the application terminates abnormally,
e.g. when killed by the OOM killer.)

In other words, you need to make this memory as inaccessible as the
framebuffer on a graphics card. If that causes a lockup, you had better
have prevented it at allocation time.

> In a Linux context, I doubt that fullblown SA is necessary or
> appropriate. Rather, I'd suggest two new signals, SIGMEMLOW and
> SIGMEMCRIT. The userland comms library registers handlers for both.
> When the kernel decides that it needs to reclaim some memory from the
> app, it sends SIGMEMLOW. The comms library then has the responsibility
> to un-reserve some memory in an orderly fashion. If a reasonable [1]
> time has expired since SIGMEMLOW and the kernel is still hungry, the
> kernel sends SIGMEMCRIT. At this point, the comms lib *must* unregister
> some memory [2] even if it has to drop state to do so; if it returns
> from the signal handler without having unregistered the memory, the
> kernel will SIGKILL.

Choosing data loss over a system that stalls for a finite time may
sometimes be a bad decision.

If I designed an application that might get a "gimme memory or die",
I'd reserve an extra bunch of memory with the sole purpose of releasing
it in this situation. If the kernel had done that instead, this part of
memory could have been used e.g. as a read-only disk cache in the
meantime (provided, of course, that somebody cared to implement that).

> [2] Is there a way for the kernel to pass down to userspace how many
> pages it wants, maybe in the sigcontext?

Then you'd need only one signal.

I think this interface is useful; it would, e.g., allow a picture viewer
to cache as many decoded and scaled pictures as RAM permits, freeing
them when RAM gets full and swap would otherwise have to be used.
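
The picture viewer side might look roughly like this. It is only a sketch:
the fixed-size table, the names cache_put() and cache_shrink(), and the
least-recently-used policy are all assumptions made for illustration.
cache_shrink() is what the application would call once it has been told how
much memory the kernel wants back.

```c
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SLOTS 64

struct cache_entry {
    void *pixels;             /* decoded picture, NULL if slot is empty */
    size_t bytes;
    unsigned long last_use;   /* logical timestamp of the last access */
};

static struct cache_entry cache[CACHE_SLOTS];
static unsigned long cache_clock;

/* Reserve space for a decoded picture of `bytes` bytes. Returns the slot
 * index, or -1 if the table is full or malloc() fails. */
int cache_put(size_t bytes)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].pixels == NULL) {
            cache[i].pixels = malloc(bytes);
            if (cache[i].pixels == NULL)
                return -1;
            cache[i].bytes = bytes;
            cache[i].last_use = ++cache_clock;
            return i;
        }
    }
    return -1;
}

/* Free least recently used pictures until at least `wanted` bytes have
 * been released or the cache is empty. Returns the bytes released. */
size_t cache_shrink(size_t wanted)
{
    size_t freed = 0;

    while (freed < wanted) {
        int victim = -1;

        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].pixels != NULL &&
                (victim < 0 || cache[i].last_use < cache[victim].last_use))
                victim = i;
        }
        if (victim < 0)
            break;            /* nothing left to give back */
        free(cache[victim].pixels);
        cache[victim].pixels = NULL;
        freed += cache[victim].bytes;
    }
    return freed;
}
```

Nothing is lost by evicting here, since the pictures can be decoded again
from disk; that is what makes a cache a good candidate for this interface.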

--
"When the pin is pulled, Mr. Grenade is not our friend.
-U.S. Marine Corps


2005-04-22 17:01:44

by Fab Tillier

Subject: RE: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

> From: Bodo Eggert <[email protected]>
> Sent: Friday, April 22, 2005 6:10 AM
>
> All userspace hardware drivers with DMA will require pinned pages (and
> some of them will require continuous memory). Since this memory may be
> scheduled to be accessed by DMA, reclaiming those pages may (aka. will)
> result in "random" memory corruption unless done by the driver itself.

Any reclaim must involve the driver. That doesn't mean that it must involve
the application. That said, this isn't trivial to implement.

>
> You can't even set a time limit, the driver may have allocated all DMA
> memory to queued transfers, and some media needs to get plugged in by
> the lazy robot. As soon as the robot arrives - boom. (For the same reason,
> this memory MUST NOT be freed if the application terminates abnormally,
> e.g. killed by OOM).

InfiniBand provides support for deregistering memory that might be
referenced at some future time by an RDMA operation. The only side effect
is that the queue pairs (QPs) on both sides of the connection transition
to the error state.

Upon abnormal termination, all registrations must be undone and the memory
unpinned. This must be synchronized with the hardware so that there are no
races. The IB deregistration semantics provide such synchronization. I'd
venture that any HW design that does not do this is broken.

Requiring that the memory never be freed upon abnormal termination amounts
to a serious memory leak: what leaks is physical memory, not just virtual
address space.

- Fab

2005-04-22 22:02:25

by Bodo Eggert

Subject: RE: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

On Fri, 22 Apr 2005, Fab Tillier wrote:
> > From: Bodo Eggert <[email protected]>
> > Sent: Friday, April 22, 2005 6:10 AM

> > You can't even set a time limit, the driver may have allocated all DMA
> > memory to queued transfers, and some media needs to get plugged in by
> > the lazy robot. As soon as the robot arrives - boom. (For the same reason,
> > this memory MUST NOT be freed if the application terminates abnormally,
> > e.g. killed by OOM).
>
> InfiniBand provides support for deregistering memory that might be
> referenced at some future time by an RDMA operation. The only side effect
> this has is that the QP on both sides of the connection transition to an
> error state.
>
> Upon abnormal termination, all registrations must be undone and the memory
> unpinned. This must be synchronized with the hardware so that there are no
> races.

Only if you know the hardware. With userspace drivers this will be
impossible, and even with kernel drivers you'll need to know which of
them is responsible for each part of the pinned memory.

This doesn't imply that the affected memory is lost. The same application
that created the pinned memory can reset the hardware (provided nobody
changed the configuration), then reconnect to the shared memory segment
you'd use for that purpose and either use it or free it.
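
The reconnect step could be sketched with System V shared memory, whose
segments outlive the process that created them. This only shows the
userspace bookkeeping; the hardware reset and the pinning itself are not
modelled, and the helper names and the idea of a well-known key are
assumptions of this example.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Attach to the segment identified by `key`, creating it if it does not
 * exist yet. A restarted application calling this with the same key gets
 * the old segment back, contents included. Returns NULL on failure. */
void *segment_attach(key_t key, size_t bytes, int *shmid_out)
{
    int id = shmget(key, bytes, IPC_CREAT | 0600);
    void *p;

    if (id < 0)
        return NULL;
    p = shmat(id, NULL, 0);
    if (p == (void *)-1)
        return NULL;
    if (shmid_out != NULL)
        *shmid_out = id;
    return p;
}

/* Detach without destroying: the segment and its contents persist. */
int segment_detach(void *p)
{
    return shmdt(p);
}

/* Mark the segment for destruction; the memory is released once the last
 * attachment is gone. Only safe after the hardware has been reset. */
int segment_destroy(int shmid)
{
    return shmctl(shmid, IPC_RMID, NULL);
}
```

After an abnormal exit the segment simply stays around until the restarted
application attaches again, resets the device, and then decides whether to
keep using the memory or call segment_destroy().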

--
To iterate is human; to recurse, divine.