LinuxLists.cc - find_get_page() VS pin_user

2023-04-11 19:51:29

Subject: find_get_page() VS pin_user_pages()

Hello folks,

I work with an application which aims to share memory in the userspace and
interact with the NIC DMA. The memory allocation workflow begins in the
userspace, which creates a new file backed by 2MiB hugepages with
memfd_create(MFD_HUGETLB, MFD_HUGE_2MB) and fallocate(). Then the userspace
makes an IOCTL to the kernel module with the file descriptor and size so that
the kernel module can get the struct page with find_get_page(). Then the kernel
module calls dma_map_single(page_address(page)) for NIC, which concludes the
datapath. The allocated memory may (significantly) outlive the originating
userspace application. The hugepages stay mapped with NIC, and the kernel
module wants to continue using them and map to other applications that come and
go with vm_mmap().

I am studying the pin_user_pages*() family of functions, and I wonder if the
outlined workflow requires it. The hugepages do not page out, but they can move
as they may be allocated with GFP_HIGHUSER_MOVABLE. However, find_get_page()
must increment the page reference counter without mapping and prevent it from
moving. In particular, https://docs.kernel.org/mm/page_migration.html:

> How migrate_pages() works
> ...
> Steps:
> ...
> 4. All the page table references to the page are converted to migration
> entries. This decreases the mapcount of a page. If the resulting mapcount
> is not zero then we do not migrate the page.

Does find_get_page() achieve that condition or does the outlined workflow
still requires pin_user_pages*() for safe DMA?

Thanks in advance,
Ivan

2023-04-12 00:14:21

by Alistair Popple

[permalink] [raw]

Subject: Re: find_get_page() VS pin_user_pages()

"Teterevkov, Ivan" <[email protected]> writes:

> Hello folks,
>
> I work with an application which aims to share memory in the userspace and
> interact with the NIC DMA. The memory allocation workflow begins in the
> userspace, which creates a new file backed by 2MiB hugepages with
> memfd_create(MFD_HUGETLB, MFD_HUGE_2MB) and fallocate(). Then the userspace
> makes an IOCTL to the kernel module with the file descriptor and size so that
> the kernel module can get the struct page with find_get_page(). Then the kernel
> module calls dma_map_single(page_address(page)) for NIC, which concludes the
> datapath. The allocated memory may (significantly) outlive the originating
> userspace application. The hugepages stay mapped with NIC, and the kernel
> module wants to continue using them and map to other applications that come and
> go with vm_mmap().
>
> I am studying the pin_user_pages*() family of functions, and I wonder if the
> outlined workflow requires it. The hugepages do not page out, but they can move
> as they may be allocated with GFP_HIGHUSER_MOVABLE. However, find_get_page()
> must increment the page reference counter without mapping and prevent it from
> moving. In particular, https://docs.kernel.org/mm/page_migration.html:

I'm not super familiar with the memfd_create()/find_get_page() workflow
but is there some reason you're not using pin_user_pages*(FOLL_LONGTERM)
to get the struct page initially? You're description above sounds
exactly the use case pin_user_pages() was designed for because it marks
the page as being writen to by DMA, makes sure it's not in a movable
zone, etc.

>> How migrate_pages() works
>> ...
>> Steps:
>> ...
>> 4. All the page table references to the page are converted to migration
>> entries. This decreases the mapcount of a page. If the resulting mapcount
>> is not zero then we do not migrate the page.
>
> Does find_get_page() achieve that condition or does the outlined workflow
> still requires pin_user_pages*() for safe DMA?

Yes. The extra page reference will prevent the migration regardless of
mapcount being zero or not. See folio_expected_refs() for how the extra
reference is detected.

> Thanks in advance,
> Ivan

2023-04-12 08:28:15

by David Hildenbrand

[permalink] [raw]

Subject: Re: find_get_page() VS pin_user_pages()

On 11.04.23 21:43, Teterevkov, Ivan wrote:
> Hello folks,
>
> I work with an application which aims to share memory in the userspace and
> interact with the NIC DMA. The memory allocation workflow begins in the
> userspace, which creates a new file backed by 2MiB hugepages with
> memfd_create(MFD_HUGETLB, MFD_HUGE_2MB) and fallocate(). Then the userspace
> makes an IOCTL to the kernel module with the file descriptor and size so that
> the kernel module can get the struct page with find_get_page(). Then the kernel
> module calls dma_map_single(page_address(page)) for NIC, which concludes the
> datapath. The allocated memory may (significantly) outlive the originating
> userspace application. The hugepages stay mapped with NIC, and the kernel
> module wants to continue using them and map to other applications that come and
> go with vm_mmap().
>
> I am studying the pin_user_pages*() family of functions, and I wonder if the
> outlined workflow requires it. The hugepages do not page out, but they can move
> as they may be allocated with GFP_HIGHUSER_MOVABLE. However, find_get_page()
> must increment the page reference counter without mapping and prevent it from
> moving. In particular, https://docs.kernel.org/mm/page_migration.html:

I suspect that find_get_page() is not the kind of interface you want to
use for the purpose you describe. find_get_page() is a wrapper around
pagecache_get_page() and seems more like a helper for implementing an fs
(looking at the users and the fact that it only considers pages that are
in the pagecache).

Instead, you might want to mmap the memfd and pass the user space
address range to the ioctl. There, you'd call pin_user_pages_*().

In general, for long-term pinning a page (possibly keeping the page
pinned forever, controlled by user space, which seems to be what you are
doing) you want so use pin_user_pages() with FOLL_LONGTERM. That will
try migrating the page off of e.g., ZONE_MOVABLE or MIGRATE_CMA, where
movability has to be guaranteed.

But I am no fs expert, so I'll cc some people that might know better if
this would be an abuse of find_get_page().

--
Thanks,

David / dhildenb

2023-04-12 09:23:13

by David Howells

[permalink] [raw]

Subject: Re: find_get_page() VS pin_user_pages()

David Hildenbrand <[email protected]> wrote:

> I suspect that find_get_page() is not the kind of interface you want to use
> for the purpose you describe. find_get_page() is a wrapper around
> pagecache_get_page() and seems more like a helper for implementing an fs
> (looking at the users and the fact that it only considers pages that are in
> the pagecache).

Btw, at some point we're going to need public functions to get extra pins on
pages. vmsplice() should be pinning the pages it pushes into a pipe - so all
pages in a pipe should probably be pinned - and anyone who splices a page out
of a pipe and retains it (skbuffs spring strongly to mind) should also get a
pin on the page.

So should all pages held by an skbuff be pinned rather than ref'd? I have a
patch to use the bottom two bits of an skb frag's page pointer to keep track
of whether the page it points to is ref'd, pinned or neither, but if we can
make it pin/not-pin them, I only need one bit for that.

David

2023-04-12 09:24:24

by Teterevkov, Ivan

[permalink] [raw]

Subject: RE: find_get_page() VS pin_user_pages()

From: Alistair Popple <[email protected]>

> "Teterevkov, Ivan" <[email protected]> writes:
>
> > Hello folks,
> >
> > I work with an application which aims to share memory in the userspace and
> > interact with the NIC DMA. The memory allocation workflow begins in the
> > userspace, which creates a new file backed by 2MiB hugepages with
> > memfd_create(MFD_HUGETLB, MFD_HUGE_2MB) and fallocate(). Then the userspace
> > makes an IOCTL to the kernel module with the file descriptor and size so that
> > the kernel module can get the struct page with find_get_page(). Then the kernel
> > module calls dma_map_single(page_address(page)) for NIC, which concludes the
> > datapath. The allocated memory may (significantly) outlive the originating
> > userspace application. The hugepages stay mapped with NIC, and the kernel
> > module wants to continue using them and map to other applications that come and
> > go with vm_mmap().
> >
> > I am studying the pin_user_pages*() family of functions, and I wonder if the
> > outlined workflow requires it. The hugepages do not page out, but they can move
> > as they may be allocated with GFP_HIGHUSER_MOVABLE. However, find_get_page()
> > must increment the page reference counter without mapping and prevent it from
> > moving. In particular, https://docs.kernel.org/mm/page_migration.html:
>
> I'm not super familiar with the memfd_create()/find_get_page() workflow
> but is there some reason you're not using pin_user_pages*(FOLL_LONGTERM)
> to get the struct page initially? You're description above sounds
> exactly the use case pin_user_pages() was designed for because it marks
> the page as being writen to by DMA, makes sure it's not in a movable
> zone, etc.
>

The biggest obstacle with the application workflow is that the memory
allocation is mostly kernel-driven. The kernel module may want to tell DMA
about the hugepages before the userspace application maps it into its address
space, so the kernel module does not have the starting user address at hand.
I believe one kernel-side workaround would be to vm_mmap(),
pin_user_pages(FOLL_LONGTERM) and possibly vm_munmap() shortly after if we do
not want to keep them mapped in the originating application. This would have a
side effect, but the pinning would stay in place until the kernel module unpins
the pages with unpin_user_page().

The pin_user_pages*() operating on behalf of the userspace application made me
think that the pinning was not designed to outlive the application, but perhaps
that is what FOLL_LONGTERM for in comparison with FOLL_PIN?

> >> How migrate_pages() works
> >> ...
> >> Steps:
> >> ...
> >> 4. All the page table references to the page are converted to migration
> >> entries. This decreases the mapcount of a page. If the resulting mapcount
> >> is not zero then we do not migrate the page.
> >
> > Does find_get_page() achieve that condition or does the outlined workflow
> > still requires pin_user_pages*() for safe DMA?
>
> Yes. The extra page reference will prevent the migration regardless of
> mapcount being zero or not. See folio_expected_refs() for how the extra
> reference is detected.
>

Thank you for pointing out folio_expected_refs(). I see that as soon as the
reference counter exceeds the number returned by folio_expected_refs(), the
page becomes pinned, but it reduces the mobility for the pages coming from
ZONE_MOVABLE making pin_user_pages*() preferable.

Thanks,
Ivan

2023-04-12 10:02:57

by Teterevkov, Ivan

[permalink] [raw]

Subject: RE: find_get_page() VS pin_user_pages()

From: David Hildenbrand <[email protected]>

> On 11.04.23 21:43, Teterevkov, Ivan wrote:
> >
> > I am studying the pin_user_pages*() family of functions, and I wonder if the
> > outlined workflow requires it. The hugepages do not page out, but they can move
> > as they may be allocated with GFP_HIGHUSER_MOVABLE. However, find_get_page()
> > must increment the page reference counter without mapping and prevent it from
> > moving. In particular, https://docs.kernel.org/mm/page_migration.html:
>
> I suspect that find_get_page() is not the kind of interface you want to
> use for the purpose you describe. find_get_page() is a wrapper around
> pagecache_get_page() and seems more like a helper for implementing an fs
> (looking at the users and the fact that it only considers pages that are
> in the pagecache).
>
> Instead, you might want to mmap the memfd and pass the user space
> address range to the ioctl. There, you'd call pin_user_pages_*().
>
> In general, for long-term pinning a page (possibly keeping the page
> pinned forever, controlled by user space, which seems to be what you are
> doing) you want so use pin_user_pages() with FOLL_LONGTERM. That will
> try migrating the page off of e.g., ZONE_MOVABLE or MIGRATE_CMA, where
> movability has to be guaranteed.
>

Thanks, David. As I am trying to outline to Alistair in another thread, our
application is designed with the kernel module driving memory allocation,
preferably backed by mappable hugepages. The kernel module might tell DMA about
the pages before the originating userspace application maps them so that the
kernel module does not have the start address for pin_user_pages*() at hand.

I believe that find_get_page() happens to work for the application because it
is backed by a file created by hugetlb_file_setup(). However, we should
consider aligning it with the API, as you explained above and stop abusing
find_get_page().

Thanks,
Ivan

2023-04-12 10:45:39

by Jan Kara

[permalink] [raw]

Subject: Re: find_get_page() VS pin_user_pages()

On Wed 12-04-23 09:04:33, Teterevkov, Ivan wrote:
> From: Alistair Popple <[email protected]>
>
> > "Teterevkov, Ivan" <[email protected]> writes:
> >
> > > Hello folks,
> > >
> > > I work with an application which aims to share memory in the userspace and
> > > interact with the NIC DMA. The memory allocation workflow begins in the
> > > userspace, which creates a new file backed by 2MiB hugepages with
> > > memfd_create(MFD_HUGETLB, MFD_HUGE_2MB) and fallocate(). Then the userspace
> > > makes an IOCTL to the kernel module with the file descriptor and size so that
> > > the kernel module can get the struct page with find_get_page(). Then the kernel
> > > module calls dma_map_single(page_address(page)) for NIC, which concludes the
> > > datapath. The allocated memory may (significantly) outlive the originating
> > > userspace application. The hugepages stay mapped with NIC, and the kernel
> > > module wants to continue using them and map to other applications that come and
> > > go with vm_mmap().
> > >
> > > I am studying the pin_user_pages*() family of functions, and I wonder if the
> > > outlined workflow requires it. The hugepages do not page out, but they can move
> > > as they may be allocated with GFP_HIGHUSER_MOVABLE. However, find_get_page()
> > > must increment the page reference counter without mapping and prevent it from
> > > moving. In particular, https://docs.kernel.org/mm/page_migration.html:
> >
> > I'm not super familiar with the memfd_create()/find_get_page() workflow
> > but is there some reason you're not using pin_user_pages*(FOLL_LONGTERM)
> > to get the struct page initially? You're description above sounds
> > exactly the use case pin_user_pages() was designed for because it marks
> > the page as being writen to by DMA, makes sure it's not in a movable
> > zone, etc.
> >
>
> The biggest obstacle with the application workflow is that the memory
> allocation is mostly kernel-driven. The kernel module may want to tell DMA
> about the hugepages before the userspace application maps it into its address
> space, so the kernel module does not have the starting user address at hand.

I'm a bit confused. Above you write that:

"The memory allocation workflow begins in the userspace, which creates a new
file backed by 2MiB hugepages with memfd_create(MFD_HUGETLB, MFD_HUGE_2MB)
and fallocate(). Then the userspace makes an IOCTL to the kernel module
with the file descriptor and size so that the kernel module can get the
struct page with find_get_page()."

So the memory allocation actually does happen from fallocate(2) as far as I
can tell. What guys are suggesting is that instead of passing the prepared
'fd' to ioctl(2), your application should mmap the file and pass the
address of the mmapped area. That's how things are usually done and it also
gives userspace more freedom over how it prepares buffers for DMA. Also then
pin_user_pages() comes as a natural API to use in the driver.

Now I'm not sure whether changing the ioctl(2) is still an option for you.
If not, then you have to resort to some kind of workaround as you
mentioned. But still pin_user_pages(FOLL_LONGTERM) is definitely the API
you should be using for telling the kernel you are going to DMA into these
pages and want to hold onto them for a long time.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-04-12 12:28:45

by Teterevkov, Ivan

[permalink] [raw]

Subject: RE: find_get_page() VS pin_user_pages()

From: Jan Kara <[email protected]>
> I'm a bit confused. Above you write that:
>
> "The memory allocation workflow begins in the userspace, which creates a new
> file backed by 2MiB hugepages with memfd_create(MFD_HUGETLB, MFD_HUGE_2MB)
> and fallocate(). Then the userspace makes an IOCTL to the kernel module
> with the file descriptor and size so that the kernel module can get the
> struct page with find_get_page()."
>
> So the memory allocation actually does happen from fallocate(2) as far as I
> can tell. What guys are suggesting is that instead of passing the prepared
> 'fd' to ioctl(2), your application should mmap the file and pass the
> address of the mmapped area. That's how things are usually done and it also
> gives userspace more freedom over how it prepares buffers for DMA. Also then
> pin_user_pages() comes as a natural API to use in the driver.
>

I failed to explain that the kernel module might call vfs_fallocate() to
allocate hugepages, then find_get_page() and finally dma_map_single(), all
before the userspace maps it. Sorry for the confusion.

> Now I'm not sure whether changing the ioctl(2) is still an option for you.
> If not, then you have to resort to some kind of workaround as you
> mentioned. But still pin_user_pages(FOLL_LONGTERM) is definitely the API
> you should be using for telling the kernel you are going to DMA into these
> pages and want to hold onto them for a long time.
>

Changing the application workflow and then doing ioctl() with the address is
what I ideally want with either find_get_page() alone or vm_mmap() with
pin_user_pages() as a workaround, and the latter is preferred.

Thanks,
Ivan

2023-04-13 12:47:42

by David Hildenbrand

[permalink] [raw]

Subject: Re: find_get_page() VS pin_user_pages()

On 12.04.23 10:41, David Howells wrote:
> David Hildenbrand <[email protected]> wrote:
>
>> I suspect that find_get_page() is not the kind of interface you want to use
>> for the purpose you describe. find_get_page() is a wrapper around
>> pagecache_get_page() and seems more like a helper for implementing an fs
>> (looking at the users and the fact that it only considers pages that are in
>> the pagecache).
>
> Btw, at some point we're going to need public functions to get extra pins on
> pages. vmsplice() should be pinning the pages it pushes into a pipe - so all
> pages in a pipe should probably be pinned - and anyone who splices a page out
> of a pipe and retains it (skbuffs spring strongly to mind) should also get a
> pin on the page.

As discussed, vmsplice() is a bit special, because it has
longterm-pinning semantics: we'd want to migrate the page out of
ZONE_MOVABLE/MIGRATE_CMA/... because the page might remain pinned in the
pipe possibly forever, controlled by user space.
pin_user_pages(FOLL_LONGTERM) would do the right thing, but we might
ahve to be careful with extra pins.

I guess it depends on what we want to achieve. Let's discuss what would
happen when we want to pin some page (and not going via pin_user_page())
that's definitely not an anon page -- so let's assume a pagecache page:

(a) Short-term pinning when already pinned (extra pins): easy.
(b) Short-term pinning when not pinned yet: should be fairly easy
(pin_user_pages() doesn't do anything special for pagecache pages
either).
(c) Long-term pinning when already long-term pinned (extra long-term
pinnings): easy
(d) Long-term pinning when already short-term pinned: problematic,
because we might have to migrate the page first, but it's already
pinned ... and if we obtained the page via pin_user_page() from a
MAP_PRIVATE VMA, we'd have to do another
pin_user_page(FOLL_LONGTERM) that would properly break COW and give
us an anon page ...
(e) Long-term pinning when not pinned yet: fairly easy, but we might
have to migrate the page first (like FOLL_LONGTERM would).

Regarding anon pages, we should pin only via pin_user_page(), so the
"not pinned" case does not apply. Replicating pins -- (a) and (c) -- is
usually easy, but (d) is similarly problematic.

Focusing again on !anon pages: if it's just "get another short-term pin
on an already pinned page", it's easy (and I recall John H. had
patches). If it's "get a long-term pin on an already pinned page", it
can be problematic.

Any pages that will never have to be migrated when long-term pinning
(just some allocated kernel page without MOVABLE semantics) are super
easy to pin, and to add extra pins to.

>
> So should all pages held by an skbuff be pinned rather than ref'd? I have a
> patch to use the bottom two bits of an skb frag's page pointer to keep track
> of whether the page it points to is ref'd, pinned or neither, but if we can
> make it pin/not-pin them, I only need one bit for that.

It might possibly be the right thing. But ref'd vs. pinned really only
makes a difference to (a) pages mapped into user space or (b) pages in
the pageache. Of course, in any case, long-term semantics have to be
respected if the page to pin might have been allocated with MOVABLE
semantics.

--
Thanks,

David / dhildenb