LinuxLists.cc - locking user space memory in kernel

2004-03-21 11:17:07

Subject: locking user space memory in kernel

Hi,
I need to be able to lock memory allocated in user space and passed to
my driver, in order to pass it to a dma controller that can maintain a
translation table for each process. The obvious thing is to use
sys_mlock() (and sys_munlock() for unlocking) but this function is not
exported anymore, nor is sys_call_table. I considered marking the
relevant vma->vm_flags with VM_LOCKED and calling get_user_pages but
that could be overkill if I want to lock just a portion of the VMA.
Currently I do some hacking to find the addresses of sys_mlock/sys_munlock.
I also need to maintain a reference count on the locking/unlocking such
that a region that has been locked twice will really be unlocked after
unlocking twice. This needs to support partly overlapping regions. To
cope with this I have implemented some code on top of calls to
sys_mlock/sys_munlock to provide this functionality.
Are there more standard ways to get this functionality from the kernel?
Any help is appreciated.

Thanks
Eli
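
[Editor's sketch] The reference counting described above can be illustrated in user space by keeping one counter per page, which handles partly overlapping regions without tracking the regions themselves. This is only a sketch under assumptions: 4 KiB pages, a small fixed window of tracked pages, and the hypothetical names region_lock(), region_unlock(), lock_count and MAX_PAGES, none of which come from the thread. A driver would do the real pin or unpin only for pages whose counter moves between 0 and 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PG_SHIFT  12      /* assumption: 4 KiB pages */
#define MAX_PAGES 1024    /* hypothetical size of the tracked window */

/* One lock counter per page; page index = address >> PG_SHIFT. */
static unsigned int lock_count[MAX_PAGES];

/* Convert [addr, addr + len) into an inclusive range of page indexes. */
static void page_span(uintptr_t addr, size_t len, size_t *first, size_t *last)
{
    assert(len > 0);
    *first = (size_t)(addr >> PG_SHIFT);
    *last = (size_t)((addr + len - 1) >> PG_SHIFT);
    assert(*last < MAX_PAGES);
}

/* Returns how many pages went from 0 to 1, i.e. really need locking. */
size_t region_lock(uintptr_t addr, size_t len)
{
    size_t first, last, i, newly_locked = 0;

    page_span(addr, len, &first, &last);
    for (i = first; i <= last; i++)
        if (lock_count[i]++ == 0)
            newly_locked++;
    return newly_locked;
}

/* Returns how many pages dropped to 0, i.e. can really be unlocked. */
size_t region_unlock(uintptr_t addr, size_t len)
{
    size_t first, last, i, released = 0;

    page_span(addr, len, &first, &last);
    for (i = first; i <= last; i++) {
        assert(lock_count[i] > 0);   /* unlock without a matching lock */
        if (--lock_count[i] == 0)
            released++;
    }
    return released;
}
```

Locking 0x1000..0x3fff and then 0x2800..0x47ff touches pages 1-3 and 2-4; pages 2 and 3 stay locked until both regions have been unlocked.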

2004-03-21 11:32:08

by Manfred Spraul

Subject: Re: locking user space memory in kernel

Hi Eli,

I think just get_user_pages() should be sufficient: the pages won't be
swapped out. You don't need to set VM_LOCKED in vma->vm_flags to prevent
the swap out. In the worst case, the pte is cleared and that will cause a
soft page fault, but the physical address won't change. Multiple
get_user_pages() calls on overlapping regions are ok, the page count is
an atomic_t, at least 24-bit large.

--
Manfred

2004-03-21 11:35:14

by Arjan van de Ven

Subject: Re: locking user space memory in kernel

On Sun, 2004-03-21 at 12:18, Eli Cohen wrote:
> Hi,
> I need to be able to lock memory allocated in user space and passed to
> my driver, in order to pass it to a dma controller that can maintain a
> translation table for each process. The obvious thing is to use

The Linux way is to do it the other way around: provide a device that
userspace can then mmap.

2004-03-21 14:12:35

by Manfred Spraul

Subject: Re: locking user space memory in kernel

Arjan wrote:

>On Sun, 2004-03-21 at 12:18, Eli Cohen wrote:
>> Hi,
>> I need to be able to lock memory allocated in user space and passed to
>> my driver, in order to pass it to a dma controller that can maintain a
>> translation table for each process. The obvious thing is to use
>
>the linux way is to do it the other way around, provide a device that
>userspace then can mmap......
>
That's definitely the preferred method, but unfortunately there are
existing APIs that work the other way around. I think the main MPI
transfer functions must read/write to arbitrary addresses, and I'm sure
there are other examples.

--
Manfred

2004-03-21 16:40:40

by Roland Dreier

Subject: Re: locking user space memory in kernel

2004-03-21 17:15:32

by Manfred Spraul

Subject: Re: locking user space memory in kernel

Roland Dreier wrote:

> Manfred> I think just get_user_pages() should be sufficient: the
> Manfred> pages won't be swapped out. You don't need to set
> Manfred> VM_LOCKED in vma->vm_flags to prevent the swap out. In
> Manfred> the worst case, the pte is cleared and that will cause a
> Manfred> soft page fault, but the physical address won't
> Manfred> change. Multiple get_user_pages() calls on overlapping
> Manfred> regions are ok, the page count is an atomic_t, at least
> Manfred> 24-bit large.
>
>There is one case that we ran into where the physical address can
>change: if a process does a fork() and then triggers COW.
>
You are right.
What should happen if there are registered transfers during fork()? Copy
the pages during the fork() syscall?

--
Manfred
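
[Editor's sketch] The fork()/COW case can be observed from user space, at least at the level of what each process sees. The sketch below assumes Linux with private anonymous mappings; the function name child_view_after_parent_write() is hypothetical. After fork(), the parent writes to the page; the child still reads the old byte, so the two processes can no longer be backed by the same physical page. User space cannot tell which process kept the original page, which is exactly the part that matters to a device holding the old physical address.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the byte the child reads from a private anonymous page after
 * the parent has written to that page post-fork, or -1 on error.
 */
int child_view_after_parent_write(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    unsigned char *buf;
    char token = 'x';
    int go[2], status, wrote;
    pid_t pid;

    if (psz <= 0 || pipe(go) != 0)
        return -1;
    buf = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        close(go[0]);
        close(go[1]);
        return -1;
    }
    buf[0] = 'A';                 /* populate the page before fork() */

    pid = fork();
    if (pid < 0) {
        close(go[0]);
        close(go[1]);
        munmap(buf, (size_t)psz);
        return -1;
    }
    if (pid == 0) {
        close(go[1]);
        if (read(go[0], &token, 1) != 1)   /* wait for the parent's write */
            _exit(255);
        _exit(buf[0]);            /* report the byte via the exit status */
    }

    close(go[0]);
    buf[0] = 'B';                 /* write fault: the page is un-shared */
    wrote = (write(go[1], &token, 1) == 1);
    close(go[1]);

    if (waitpid(pid, &status, 0) < 0 || !wrote || !WIFEXITED(status)) {
        munmap(buf, (size_t)psz);
        return -1;
    }
    munmap(buf, (size_t)psz);
    return WEXITSTATUS(status);
}
```

The child is expected to report 'A' even though the parent now reads 'B' at the same virtual address.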

2004-03-21 18:18:43

by Roland Dreier

Subject: Re: locking user space memory in kernel

Manfred> I think just get_user_pages() should be sufficient: the
Manfred> pages won't be swapped out. You don't need to set
Manfred> VM_LOCKED in vma->vm_flags to prevent the swap out. In
Manfred> the worst case, the pte is cleared and that will cause a
Manfred> soft page fault, but the physical address won't
Manfred> change. Multiple get_user_pages() calls on overlapping
Manfred> regions are ok, the page count is an atomic_t, at least
Manfred> 24-bit large.

Roland> There is one case that we ran into where the physical
Roland> address can change: if a process does a fork() and then
Roland> triggers COW.

Manfred> You are right. What should happen if there are
Manfred> registered transfers during fork()? Copy the pages
Manfred> during the fork() syscall?

The current Mellanox InfiniBand driver goes to some trouble to mark
the memory being registered with VM_DONTCOPY. This means the vmas
don't get copied into the child of a fork(), so the COW doesn't
happen. However, this certainly leads to some quirks in semantics.
In particular, an application using fork() has to be careful that
registered memory doesn't share a page with something the child
process wants to use.

I don't think copying all the registered memory on fork() is feasible,
because it's going to kill performance (especially since exec() is
likely to immediately follow the fork() in the child). Also, there
may not be enough memory around to copy everything.

Out of curiosity, what happens if I fork with pending AIO in the
current kernel?

- Roland
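
[Editor's sketch] The VM_DONTCOPY semantics described above, including the quirk that the child simply lacks the memory, can be seen from user space on later kernels. This relies on an assumption that is not part of the thread: madvise(MADV_DONTFORK), available since Linux 2.6.16, which marks a range so that it is not inherited across fork(). The function name child_faults_on_dontfork_region() is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a forked child is killed by SIGSEGV when it touches a
 * region marked MADV_DONTFORK, 0 if the child could read it, -1 on error.
 */
int child_faults_on_dontfork_region(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    unsigned char *buf;
    int status;
    pid_t pid;

    if (psz <= 0)
        return -1;
    buf = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 1;                   /* populate the page */

    /* Assumption: Linux >= 2.6.16; the range is not copied into children. */
    if (madvise(buf, (size_t)psz, MADV_DONTFORK) != 0) {
        munmap(buf, (size_t)psz);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        munmap(buf, (size_t)psz);
        return -1;
    }
    if (pid == 0) {
        /* The mapping is absent in the child, so this read should fault. */
        unsigned char c = *(volatile unsigned char *)buf;
        _exit(c ? 0 : 2);
    }

    if (waitpid(pid, &status, 0) < 0) {
        munmap(buf, (size_t)psz);
        return -1;
    }
    munmap(buf, (size_t)psz);

    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
        return 1;
    return WIFEXITED(status) ? 0 : -1;
}
```

Because the flag applies to whole pages, anything else the child needs that happens to live in the same page disappears too, which is the sharing hazard mentioned above.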

2004-03-22 13:14:08

by Eli Cohen

Subject: Re: locking user space memory in kernel

Roland Dreier wrote:

>I don't think copying all the registered memory on fork() is feasible,
>because it's going to kill performance (especially since exec() is
>likely to immediately follow the fork() in the child). Also, there
>may not be enough memory around to copy everything.
>
>
>
Suppose a new vma flag is introduced, VM_NOCOW, and an API to apply this
flag on a range of addresses, splitting or unifying vmas as necessary. A
driver which registers memory with hardware would call this function.
When fork takes place, the ptes of the parent belonging to such vmas
will not be changed to read only, thus they will not undergo COW. The
kernel will copy the first and last pages of these vmas to the child.
All the pages in between will be marked read only and will undergo COW
when written to. One problem would be that the child can read pages of
the parent after the parent modifies them, but that could be avoided if
the address space of the child does not inherit the range of the middle
pages. ???
Eli

2004-03-22 19:34:29

by Manfred Spraul

Subject: Re: locking user space memory in kernel

Eli Cohen wrote:

> Roland Dreier wrote:
>
>> I don't think copying all the registered memory on fork() is feasible,
>> because it's going to kill performance (especially since exec() is
>> likely to immediately follow the fork() in the child). Also, there
>> may not be enough memory around to copy everything.
>>
>>
>>
> Suppose a new vma flag is introduced, VM_NOCOW, and an API to apply
> this flag on a range of addresses, splitting or unifying vmas as necessary.

Something like that, but it should be hidden within a suitable
abstraction: create_page_mapping/free_page_mapping, or something along
those lines. get_user_pages followed by put_page is not stateful enough.
Actually it's fundamentally broken for platforms that need cache flush
calls.

And I still think that the initial implementation should copy the
affected pages within fork() - it might be slow, but at least it's
simple and correct. _If_ it's too slow, then it can be fixed later.

--
Manfred

2004-04-08 00:46:07

by Libor Michalek

Subject: Re: locking user space memory in kernel

----- Forwarded message from Manfred Spraul -----
>
> Date: Sun, 21 Mar 2004 12:31:59 +0100
> From: Manfred Spraul
> To: Eli Cohen
> Subject: Re: locking user space memory in kernel
>
> Hi Eli,
>
> I think just get_user_pages() should be sufficient: the pages won't be
> swapped out. You don't need to set VM_LOCKED in vma->vm_flags to prevent
> the swap out. In the worst case, the pte is cleared and that will cause a
> soft page fault, but the physical address won't change. Multiple
> get_user_pages() calls on overlapping regions are ok, the page count is
> an atomic_t, at least 24-bit large.

The soft page fault is a problem if the device is going to write data
into the buffer and then notify the user that the buffer now contains
valid data. If the soft page fault occurs before the device has written
to the page list, once the user is notified of the write and reads the
buffer, it will no longer be the same pages as the ones to which the
device wrote. Is setting VM_LOCKED the only way to prevent the soft
page fault and this issue?

-Libor

2004-04-08 05:23:13

by Manfred Spraul

Subject: Re: locking user space memory in kernel

Libor Michalek wrote:

>----- Forwarded message from Manfred Spraul -----
>
>>Date: Sun, 21 Mar 2004 12:31:59 +0100
>>From: Manfred Spraul
>>To: Eli Cohen
>>Subject: Re: locking user space memory in kernel
>>
>>Hi Eli,
>>
>>I think just get_user_pages() should be sufficient: the pages won't be
>>swapped out. You don't need to set VM_LOCKED in vma->vm_flags to prevent
>>the swap out. In the worst case, the pte is cleared and that will cause a
>>soft page fault, but the physical address won't change. Multiple
>>get_user_pages() calls on overlapping regions are ok, the page count is
>>an atomic_t, at least 24-bit large.
>
> The soft page fault is a problem if the device is going to write data
>into the buffer and then notify the user that the buffer now contains
>valid data. If the soft page fault occurs before the device has written
>to the page list, once the user is notified of the write and reads the
>buffer, it will no longer be the same pages as the ones to which the
>device wrote.
>
No. The physical addresses do not change due to a soft page fault.
A soft fault means that the page table entry is cleared, but that the
physical page is still in the system memory. do_swap_page does a swap
cache lookup and finds the original physical page in memory and maps it
back to the virtual address. The physical page can't be dropped from the
swap cache because your driver still holds one reference to the page -
no swapout.

But fork() is a problem for get_user_pages(): you probably have to write
an improved function (create_user_mapping/destroy_user_mapping) that
handles fork correctly, and add arch hooks into the new function. They
are required for archs with incoherent CPU caches. Right now O_DIRECT
doesn't flush the CPU caches, because it's impossible to implement that
with get_user_pages(). It works because the data cache is usually
coherent and no one loads libraries with O_DIRECT.

--
Manfred

2004-04-08 06:17:13

by Ross Dickson
