2023-09-05 16:33:58

by Christoph Hellwig

[permalink] [raw]
Subject: getting rid of the last memory modifitions through gup(FOLL_GET)

Hi all,

we've made some nice progress on converting code that modifies user
memory to the pin_user_pages interface, especially though the work
from David Howells on iov_iter_extract_pages. This thread tries to
coordinate on how to finish off this work.

The obvious next step is the remaining users of iov_iter_get_pages2
and iov_iter_get_pages_alloc2. We have three file system direct I/O
users of those left: ceph, fuse and nfs. Lei Huang has sent patches
to convert fuse to iov_iter_extract_pages which I'd love to see merged,
and we'd need equivalent work for ceph and nfs.

The non-file system uses are in the vmsplice code, which only reads
from the pages (but would still benefit from an iov_iter_extract_pages
conversion), and in net. Out of the users in net, all but the 9p code
appear to be for reads from memory, so they don't pin even if a
conversion would be nice to retire iov_iter_get_pages* APIs.

After that we might have to do an audit of the raw get_user_pages APIs,
but there probably aren't many that modify file backed memory.

I'm also wondering what a good debug aid would be for detecting
writes to file backed memory without a previous reservation, but
everything either involves a page flag or file system code. But if
someone has an idea I'm all ear as something mechanical to catch these
uses would be quite helpful.


2023-09-08 10:19:34

by David Howells

[permalink] [raw]
Subject: Re: getting rid of the last memory modifitions through gup(FOLL_GET)

Christoph Hellwig <[email protected]> wrote:

> we've made some nice progress on converting code that modifies user
> memory to the pin_user_pages interface, especially though the work
> from David Howells on iov_iter_extract_pages. This thread tries to
> coordinate on how to finish off this work.

Right at this moment, I'm writing some kunit tests for iov_iter and I've found
at least one bug.

David

2023-09-08 12:44:28

by Christoph Hellwig

[permalink] [raw]
Subject: Re: getting rid of the last memory modifitions through gup(FOLL_GET)

On Wed, Sep 06, 2023 at 11:42:33AM +0200, David Hildenbrand wrote:
>> and iov_iter_get_pages_alloc2. We have three file system direct I/O
>> users of those left: ceph, fuse and nfs. Lei Huang has sent patches
>> to convert fuse to iov_iter_extract_pages which I'd love to see merged,
>> and we'd need equivalent work for ceph and nfs.
>>
>> The non-file system uses are in the vmsplice code, which only reads
>
> vmsplice really has to be fixed to specify FOLL_PIN|FOLL_LONGTERM for good;
> I recall that David Howells had patches for that at one point. (at least to
> use FOLL_PIN)

Hmm, unless I'm misreading the code vmsplace is only using
iov_iter_get_pages2 for reading from the user address space anyway.
Or am I missing something?

>> After that we might have to do an audit of the raw get_user_pages APIs,
>> but there probably aren't many that modify file backed memory.
>
> ptrace should apply that ends up doing a FOLL_GET|FOLL_WRITE.

Yes, if that ends up on file backed shared mappings we also need a pin.

> Further, KVM ends up using FOLL_GET|FOLL_WRITE to populate the second-level
> page tables for VMs, and uses MMU notifiers to synchronize the second-level
> page tables with process page table changes. So once a PTE goes from
> writable -> r/o in the process page table, the second level page tables for
> the VM will get updated. Such MMU users are quite different from ordinary
> GUP users.

Can KVM page tables use file backed shared mappings?

> Converting ptrace might not be desired/required as well (the reference is
> dropped immediately after the read/write access).

But the pin is needed to make sure the file system can account for
dirtying the pages. Something we fundamentally can't do with get.

> The end goal as discussed a couple of times would be the to limit FOLL_GET
> in general only to a couple of users that can be audited and keep using it
> for a good reason. Arbitrary drivers that perform DMA should stop using it
> (and ideally be prevented from using it) and switch to FOLL_PIN.

Agreed, that's where I'd like to get to. Preferably with the non-pin
API not even beeing epxorted to modules.

2023-09-09 20:18:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: getting rid of the last memory modifitions through gup(FOLL_GET)

On Fri, Sep 08, 2023 at 06:48:05PM +0200, David Hildenbrand wrote:
> vmsplice_to_pipe() -> iter_to_pipe() -> iov_iter_get_pages2()
>
> So it ends up calling get_user_pages_fast()
>
> ... and not using FOLL_PIN|FOLL_LONGTERM
>
> Why FOLL_LONGTERM? Because it's a longterm pin, where unprivileged users
> can grab a reference on a page for all eternity, breaking CMA and memory
> hotunplug (well, and harming compaction).
>
> Why FOLL_PIN? Well FOLL_LONGTERM only applies to FOLL_PIN. But for
> anonymous memory, this will also take care of the last remaining hugetlb
> COW test (trigger COW unsharing) as commented back in:
>
> https://lore.kernel.org/all/[email protected]/

Well, I'm not against it. It just isn't required for deadling with
file system writeback vs GUP modification race this thread was started
for.

>> Can KVM page tables use file backed shared mappings?
>
> Yes, usually shmem and hugetlb. But with things like emulated
> NVDIMMs/virtio-pmem for VMs, easily also ordinary files.
>
> But it's really not ordinary write access through GUP. It's write access
> via a secondary page table (secondary MMU), that's synchronized to the
> process page table -- just like if the CPU would be writing to the page
> using the process page tables (primary MMU).

Writing through the process page tables takes a write faul when first
writing, which calls into ->page_mkwrite in the file system. Does the
synchronization take care of that? If not we need to add or emulate it.

> ptrace will find the pagecache page writable in the page table (PTE write
> bit set), if it intends to write to the page (FOLL_WRITE). If it is not
> writable, it will trigger a page fault that informs the file system.

Yes, that case is (mostly) fine.

>
> With an FS that wants writenotify, we will not map a page writable (PTE
> write bit not set) unless it is dirty (PTE dirty bit set) IIRC.
>
> So are we concerned about a race between the filesystem removing the PTE
> write bit (to catch next write access before it gets dirtied again) and
> ptrace marking the page dirty?

Yes. This is the race that we've run into with various GUP users.

> Yes. However, secondary MMU users (like KVM) would need some way to keep
> making use of that; ideally, using a proper separate interface instead of
> (ab)using plain GUP and confusing people :)

I'mm all for that.