2023-05-24 15:52:14

by David Howells

Subject: [PATCH net-next 00/12] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 3

Here's the third tranche of patches towards providing a MSG_SPLICE_PAGES
internal sendmsg flag that is intended to replace the ->sendpage() op with
calls to sendmsg(). MSG_SPLICE_PAGES is a hint that tells the protocol
that it should splice the pages supplied if it can and copy them if not.

The primary focus of this tranche is to allow data passed in the slab to be
copied into page fragments (appending it to existing free space within an
sk_buff would also be possible), thereby allowing a single sendmsg() to mix
data held in the slab (such as higher-level protocol pieces) and data held
in pages (such as content for a network filesystem). This puts the copying
in (mostly) one place: skb_splice_from_iter().
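
As a rough illustration (a hedged sketch, not code from this series;
send_mixed() and its parameters are hypothetical), a kernel-side caller
might mix the two kinds of data in a single sendmsg() like this:

        static int send_mixed(struct socket *sock, void *hdr, size_t hdr_len,
                              struct page *page, size_t off, size_t len)
        {
                struct bio_vec bv[2];
                struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };

                /* Slab-backed protocol header: the protocol copies this. */
                bvec_set_virt(&bv[0], hdr, hdr_len);
                /* Page-backed payload: the protocol may splice this. */
                bvec_set_page(&bv[1], page, len, off);
                iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, 2, hdr_len + len);
                return sock_sendmsg(sock, &msg);
        }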

To make this work, some sort of locking is needed with the allocator. I've
chosen to make the allocator internally have a separate bucket per cpu, as
the netdev and napi allocators already do - and then share the allocated
pages amongst those services that were using their own allocators. I'm not
sure that the existing usage of the allocator is completely thread safe.

TLS is also converted here because that does things differently and uses
sk_msg rather than sk_buff - and so can't use skb_splice_from_iter().

So, firstly the page_frag_alloc_align() allocator is overhauled:

(1) Split it out from mm/page_alloc.c into its own file,
mm/page_frag_alloc.c.

(2) Add a common function to clear an allocator.

(3) Make the alignment specification consistent with some of the wrapper
functions.

(4) Make it use multipage folios rather than compound pages.

(5) Make it handle __GFP_ZERO, rather than devolving this to the page
allocator.

Note that the current behaviour is potentially broken as the page may
get reused if all refs have been dropped, but it doesn't then get
cleared. This might mean that the NVMe over TCP driver, for example,
will malfunction under some circumstances.

(6) Give it per-cpu buckets to allocate from to avoid the need for locking
against users on other cpus.

(7) The netdev_alloc_cache and the napi fragment cache are then recast
in terms of this and some private allocators are removed.
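
To make the resulting API concrete, here's a minimal usage sketch of the
allocator as it stands after (1)-(5); the per-cpu bucket handling from (6)
is internal to the allocator and not shown:

        struct page_frag_cache cache = {};
        void *a, *b;

        a = page_frag_alloc_align(&cache, 128, GFP_KERNEL, 64); /* 64-byte aligned */
        b = page_frag_alloc_align(&cache, 256, GFP_KERNEL, 1);  /* no alignment */
        /* ... each fragment holds a ref on the backing folio ... */
        page_frag_free(a);
        page_frag_free(b);
        page_frag_cache_clear(&cache);  /* from (2): drop the cache's own refs */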

We can then make use of the page fragment allocator to copy data that is
resident in the slab rather than returning EIO:

(8) Make skb_splice_from_iter() copy data provided in the slab to page
fragments.

(9) Implement MSG_SPLICE_PAGES support in the AF_TLS-sw sendmsg and make
tls_sw_sendpage() just a wrapper around sendmsg().

(10) Implement MSG_SPLICE_PAGES support in AF_TLS-device and make
tls_device_sendpage() just a wrapper around sendmsg().
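
For (9) and (10), the wrapper ends up with roughly the following shape (a
hedged sketch of the tls_sw case; the actual patch may differ in detail,
e.g. in how sendpage-only flags such as MSG_SENDPAGE_NOTLAST get remapped):

        int tls_sw_sendpage(struct sock *sk, struct page *page,
                            int offset, size_t size, int flags)
        {
                struct bio_vec bvec;
                struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES | flags };

                bvec_set_page(&bvec, page, size, offset);
                iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
                return tls_sw_sendmsg(sk, &msg, size);
        }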

I've pushed the patches here also:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=sendpage-3

David

Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=51c78a4d532efe9543a4df019ff405f05c6157f6 # part 1

David Howells (12):
mm: Move the page fragment allocator from page_alloc.c into its own
file
mm: Provide a page_frag_cache allocator cleanup function
mm: Make the page_frag_cache allocator alignment param a pow-of-2
mm: Make the page_frag_cache allocator use multipage folios
mm: Make the page_frag_cache allocator handle __GFP_ZERO itself
mm: Make the page_frag_cache allocator use per-cpu buckets
net: Clean up users of netdev_alloc_cache and napi_frag_cache
net: Copy slab data for sendmsg(MSG_SPLICE_PAGES)
tls/sw: Support MSG_SPLICE_PAGES
tls/sw: Convert tls_sw_sendpage() to use MSG_SPLICE_PAGES
tls/device: Support MSG_SPLICE_PAGES
tls/device: Convert tls_device_sendpage() to use MSG_SPLICE_PAGES

drivers/net/ethernet/google/gve/gve.h | 1 -
drivers/net/ethernet/google/gve/gve_main.c | 16 --
drivers/net/ethernet/google/gve/gve_rx.c | 2 +-
drivers/net/ethernet/mediatek/mtk_wed_wo.c | 19 +-
drivers/net/ethernet/mediatek/mtk_wed_wo.h | 2 -
drivers/nvme/host/tcp.c | 19 +-
drivers/nvme/target/tcp.c | 22 +-
include/linux/gfp.h | 17 +-
include/linux/mm_types.h | 13 +-
include/linux/skbuff.h | 28 +--
mm/Makefile | 2 +-
mm/page_alloc.c | 126 ------------
mm/page_frag_alloc.c | 206 +++++++++++++++++++
net/core/skbuff.c | 94 +++++----
net/tls/tls_device.c | 93 ++++-----
net/tls/tls_sw.c | 221 ++++++++-------------
16 files changed, 418 insertions(+), 463 deletions(-)
create mode 100644 mm/page_frag_alloc.c



2023-05-24 15:54:50

by David Howells

Subject: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios

Change the page_frag_cache allocator to use multipage folios rather than
groups of pages. This reduces page_frag_free to just a folio_put() or
put_page().

Signed-off-by: David Howells <[email protected]>
cc: "David S. Miller" <[email protected]>
cc: Eric Dumazet <[email protected]>
cc: Jakub Kicinski <[email protected]>
cc: Paolo Abeni <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Jeroen de Borst <[email protected]>
cc: Catherine Sullivan <[email protected]>
cc: Shailend Chand <[email protected]>
cc: Felix Fietkau <[email protected]>
cc: John Crispin <[email protected]>
cc: Sean Wang <[email protected]>
cc: Mark Lee <[email protected]>
cc: Lorenzo Bianconi <[email protected]>
cc: Matthias Brugger <[email protected]>
cc: AngeloGioacchino Del Regno <[email protected]>
cc: Keith Busch <[email protected]>
cc: Jens Axboe <[email protected]>
cc: Christoph Hellwig <[email protected]>
cc: Sagi Grimberg <[email protected]>
cc: Chaitanya Kulkarni <[email protected]>
cc: Andrew Morton <[email protected]>
cc: Matthew Wilcox <[email protected]>
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
---
include/linux/mm_types.h | 13 ++----
mm/page_frag_alloc.c | 99 +++++++++++++++++++---------------------
2 files changed, 52 insertions(+), 60 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..d7c52a5979cc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -420,18 +420,13 @@ static inline void *folio_get_private(struct folio *folio)
}

struct page_frag_cache {
- void * va;
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- __u16 offset;
- __u16 size;
-#else
- __u32 offset;
-#endif
+ struct folio *folio;
+ unsigned int offset;
/* we maintain a pagecount bias, so that we dont dirty cache line
* containing page->_refcount every time we allocate a fragment.
*/
- unsigned int pagecnt_bias;
- bool pfmemalloc;
+ unsigned int pagecnt_bias;
+ bool pfmemalloc;
};

typedef unsigned long vm_flags_t;
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
index 9d3f6fbd9a07..ffd68bfb677d 100644
--- a/mm/page_frag_alloc.c
+++ b/mm/page_frag_alloc.c
@@ -16,33 +16,34 @@
#include <linux/init.h>
#include <linux/mm.h>

-static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
- gfp_t gfp_mask)
+/*
+ * Allocate a new folio for the frag cache.
+ */
+static struct folio *page_frag_cache_refill(struct page_frag_cache *nc,
+ gfp_t gfp_mask)
{
- struct page *page = NULL;
+ struct folio *folio = NULL;
gfp_t gfp = gfp_mask;

#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
- __GFP_NOMEMALLOC;
- page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
- PAGE_FRAG_CACHE_MAX_ORDER);
- nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
+ gfp_mask |= __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
+ folio = folio_alloc(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER);
#endif
- if (unlikely(!page))
- page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+ if (unlikely(!folio))
+ folio = folio_alloc(gfp, 0);

- nc->va = page ? page_address(page) : NULL;
-
- return page;
+ if (folio)
+ nc->folio = folio;
+ return folio;
}

void __page_frag_cache_drain(struct page *page, unsigned int count)
{
- VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+ struct folio *folio = page_folio(page);
+
+ VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);

- if (page_ref_sub_and_test(page, count - 1))
- __free_pages(page, compound_order(page));
+ folio_put_refs(folio, count);
}
EXPORT_SYMBOL(__page_frag_cache_drain);

@@ -54,11 +55,12 @@ EXPORT_SYMBOL(__page_frag_cache_drain);
*/
void page_frag_cache_clear(struct page_frag_cache *nc)
{
- if (nc->va) {
- struct page *page = virt_to_head_page(nc->va);
+ struct folio *folio = nc->folio;

- __page_frag_cache_drain(page, nc->pagecnt_bias);
- nc->va = NULL;
+ if (folio) {
+ VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
+ folio_put_refs(folio, nc->pagecnt_bias);
+ nc->folio = NULL;
}
}
EXPORT_SYMBOL(page_frag_cache_clear);
@@ -67,56 +69,51 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask,
unsigned int align)
{
- unsigned int size = PAGE_SIZE;
- struct page *page;
- int offset;
+ struct folio *folio = nc->folio;
+ size_t offset;

WARN_ON_ONCE(!is_power_of_2(align));

- if (unlikely(!nc->va)) {
+ if (unlikely(!folio)) {
refill:
- page = __page_frag_cache_refill(nc, gfp_mask);
- if (!page)
+ folio = page_frag_cache_refill(nc, gfp_mask);
+ if (!folio)
return NULL;

-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
/* Even if we own the page, we do not use atomic_set().
* This would break get_page_unless_zero() users.
*/
- page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
+ folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);

/* reset page count bias and offset to start of new frag */
- nc->pfmemalloc = page_is_pfmemalloc(page);
+ nc->pfmemalloc = folio_is_pfmemalloc(folio);
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- nc->offset = size;
+ nc->offset = folio_size(folio);
}

- offset = nc->offset - fragsz;
- if (unlikely(offset < 0)) {
- page = virt_to_page(nc->va);
-
- if (page_ref_count(page) != nc->pagecnt_bias)
+ offset = nc->offset;
+ if (unlikely(fragsz > offset)) {
+ /* Reuse the folio if everyone we gave it to has finished with
+ * it.
+ */
+ if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias)) {
+ nc->folio = NULL;
goto refill;
+ }
+
if (unlikely(nc->pfmemalloc)) {
- page_ref_sub(page, nc->pagecnt_bias - 1);
- __free_pages(page, compound_order(page));
+ __folio_put(folio);
+ nc->folio = NULL;
goto refill;
}

-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
/* OK, page count is 0, we can safely set it */
- set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+ folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);

/* reset page count bias and offset to start of new frag */
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- offset = size - fragsz;
- if (unlikely(offset < 0)) {
+ offset = folio_size(folio);
+ if (unlikely(fragsz > offset)) {
/*
* The caller is trying to allocate a fragment
* with fragsz > PAGE_SIZE but the cache isn't big
@@ -126,15 +123,17 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
* it could make memory pressure worse
* so we simply return NULL here.
*/
+ nc->offset = offset;
return NULL;
}
}

nc->pagecnt_bias--;
+ offset -= fragsz;
offset &= ~(align - 1);
nc->offset = offset;

- return nc->va + offset;
+ return folio_address(folio) + offset;
}
EXPORT_SYMBOL(page_frag_alloc_align);

@@ -143,8 +142,6 @@ EXPORT_SYMBOL(page_frag_alloc_align);
*/
void page_frag_free(void *addr)
{
- struct page *page = virt_to_head_page(addr);
-
- __free_pages(page, compound_order(page));
+ folio_put(virt_to_folio(addr));
}
EXPORT_SYMBOL(page_frag_free);


2023-05-26 12:54:17

by David Howells

Subject: Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios

Yunsheng Lin <[email protected]> wrote:

> > Change the page_frag_cache allocator to use multipage folios rather than
> > groups of pages. This reduces page_frag_free to just a folio_put() or
> > put_page().
>
> put_page() is not used in this patch, perhaps remove it to avoid
> the confusion?

Will do if I need to respin the patches.

> Also, is there any significant difference between __free_pages()
> and folio_put()? IOW, what does the 'reduces' part mean here?

I meant that the folio code handles page compounding for us and we don't need
to work out how big the page is for ourselves.

If you look at __free_pages(), you can see a PageHead() call. folio_put()
doesn't need that.
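
Roughly, comparing the two (simplified from mm/page_alloc.c and
include/linux/mm.h around this kernel version):

        void __free_pages(struct page *page, unsigned int order)
        {
                /* get PageHead before we drop reference */
                int head = PageHead(page);

                if (put_page_testzero(page))
                        free_the_page(page, order);
                else if (!head)
                        while (order-- > 0)
                                free_the_page(page + (1 << order), order);
        }

        static inline void folio_put(struct folio *folio)
        {
                if (folio_put_testzero(folio))
                        __folio_put(folio);
        }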

> I followed some discussion about folios before, but have not really
> understood the real difference between 'multipage folios' and
> 'groups of pages' yet. Is folio mostly used to avoid the confusion
> about whether a page is a 'headpage of a compound page', a 'base page'
> or a 'tailpage of a compound page'? Or is there any obvious benefit
> of folios that I missed?

There is a benefit: a folio pointer always points to the head page and so we
never need to do "is this compound? where's the head?" logic to find it. When
going from a page pointer, we still have to find the head.

Ultimately, the aim is to reduce struct page to a typed pointer to massively
reduce the amount of space consumed by mem_map[]. A page struct will then
point at a folio or a slab struct or one of a number of different types. But
to get to that point, we have to stop a whole lot of things from using page
structs and have them use some other type, such as a folio, instead.

Eventually, there won't be a need for head pages and tail pages per se - just
memory objects of different sizes.

> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 306a3d1a0fa6..d7c52a5979cc 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -420,18 +420,13 @@ static inline void *folio_get_private(struct folio *folio)
> > }
> >
> > struct page_frag_cache {
> > - void * va;
> > -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > - __u16 offset;
> > - __u16 size;
> > -#else
> > - __u32 offset;
> > -#endif
> > + struct folio *folio;
> > + unsigned int offset;
> > /* we maintain a pagecount bias, so that we dont dirty cache line
> > * containing page->_refcount every time we allocate a fragment.
> > */
> > - unsigned int pagecnt_bias;
> > - bool pfmemalloc;
> > + unsigned int pagecnt_bias;
> > + bool pfmemalloc;
> > };
>
> It seems the 'va' and 'size' fields were used to avoid touching 'struct
> page', to avoid possible cache bouncing when more frags can be allocated
> from the page while other frags are freed at the same time, before this
> patch?

Hmmm... fair point, though va is calculated from the page pointer on most
arches without the need to dereference struct page (only arc, m68k and sparc
define WANT_PAGE_VIRTUAL).
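
For reference, with !WANT_PAGE_VIRTUAL and !CONFIG_HIGHMEM, page_address()
reduces to pure arithmetic on the page pointer (simplified from
include/linux/mm.h and the generic page_to_virt()):

        #define page_address(page) lowmem_page_address(page)

        static __always_inline void *lowmem_page_address(const struct page *page)
        {
                return page_to_virt(page); /* __va(PFN_PHYS(page_to_pfn(page))) */
        }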

David


2023-05-26 14:44:28

by Mika Penttilä

Subject: Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios

Hi,

On 26.5.2023 15.47, David Howells wrote:
> Yunsheng Lin <[email protected]> wrote:
>
>>> Change the page_frag_cache allocator to use multipage folios rather than
>>> groups of pages. This reduces page_frag_free to just a folio_put() or
>>> put_page().
>>
>> put_page() is not used in this patch, perhaps remove it to avoid
>> the confusion?
>
> Will do if I need to respin the patches.
>
>> Also, is there any significant difference between __free_pages()
>> and folio_put()? IOW, what does the 'reduces' part mean here?
>
> I meant that the folio code handles page compounding for us and we don't need
> to work out how big the page is for ourselves.
>
> If you look at __free_pages(), you can see a PageHead() call. folio_put()
> doesn't need that.
>
>> I followed some discussion about folios before, but have not really
>> understood the real difference between 'multipage folios' and
>> 'groups of pages' yet. Is folio mostly used to avoid the confusion
>> about whether a page is a 'headpage of a compound page', a 'base page'
>> or a 'tailpage of a compound page'? Or is there any obvious benefit
>> of folios that I missed?
>
> There is a benefit: a folio pointer always points to the head page and so we
> never need to do "is this compound? where's the head?" logic to find it. When
> going from a page pointer, we still have to find the head.
>


But page_frag_free() uses folio_put(virt_to_folio(addr)) and
virt_to_folio() depends on the compound infrastructure to get the head
page and folio.
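
As currently defined (simplified), the compound_head() lookup is hidden
inside page_folio():

        static inline struct folio *virt_to_folio(const void *x)
        {
                struct page *page = virt_to_page(x);

                return page_folio(page); /* compound_head() lookup here */
        }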


> Ultimately, the aim is to reduce struct page to a typed pointer to massively
> reduce the amount of space consumed by mem_map[]. A page struct will then
> point at a folio or a slab struct or one of a number of different types. But
> to get to that point, we have to stop a whole lot of things from using page
> structs and have them use some other type, such as a folio, instead.
>
> Eventually, there won't be a need for head pages and tail pages per se - just
> memory objects of different sizes.
>
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 306a3d1a0fa6..d7c52a5979cc 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -420,18 +420,13 @@ static inline void *folio_get_private(struct folio *folio)
>>> }
>>>
>>> struct page_frag_cache {
>>> - void * va;
>>> -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>> - __u16 offset;
>>> - __u16 size;
>>> -#else
>>> - __u32 offset;
>>> -#endif
>>> + struct folio *folio;
>>> + unsigned int offset;
>>> /* we maintain a pagecount bias, so that we dont dirty cache line
>>> * containing page->_refcount every time we allocate a fragment.
>>> */
>>> - unsigned int pagecnt_bias;
>>> - bool pfmemalloc;
>>> + unsigned int pagecnt_bias;
>>> + bool pfmemalloc;
>>> };
>>
>> It seems the 'va' and 'size' fields were used to avoid touching 'struct
>> page', to avoid possible cache bouncing when more frags can be allocated
>> from the page while other frags are freed at the same time, before this
>> patch?
>
> Hmmm... fair point, though va is calculated from the page pointer on most
> arches without the need to dereference struct page (only arc, m68k and sparc
> define WANT_PAGE_VIRTUAL).
>
> David
>

--Mika


2023-05-27 01:11:16

by Jakub Kicinski

Subject: Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios

On Wed, 24 May 2023 16:33:03 +0100 David Howells wrote:
> - offset = nc->offset - fragsz;
> - if (unlikely(offset < 0)) {
> - page = virt_to_page(nc->va);
> -
> - if (page_ref_count(page) != nc->pagecnt_bias)
> + offset = nc->offset;
> + if (unlikely(fragsz > offset)) {
> + /* Reuse the folio if everyone we gave it to has finished with
> + * it.
> + */
> + if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias)) {
> + nc->folio = NULL;
> goto refill;
> + }
> +
> if (unlikely(nc->pfmemalloc)) {
> - page_ref_sub(page, nc->pagecnt_bias - 1);
> - __free_pages(page, compound_order(page));
> + __folio_put(folio);

This is not a pure 1:1 page -> folio conversion.
Why mix conversion with other code changes?

2023-05-27 16:13:04

by Alexander Duyck

Subject: Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios

On Fri, 2023-05-26 at 19:56 +0800, Yunsheng Lin wrote:
> On 2023/5/24 23:33, David Howells wrote:
> > Change the page_frag_cache allocator to use multipage folios rather than
> > groups of pages. This reduces page_frag_free to just a folio_put() or
> > put_page().
>
> Hi, David
>
> put_page() is not used in this patch, perhaps remove it to avoid
> the confusion?
> Also, is there any significant difference between __free_pages()
> and folio_put()? IOW, what does the 'reduces' part mean here?
>
> I followed some discussion about folios before, but have not really
> understood the real difference between 'multipage folios' and
> 'groups of pages' yet. Is folio mostly used to avoid the confusion
> about whether a page is a 'headpage of a compound page', a 'base page'
> or a 'tailpage of a compound page'? Or is there any obvious benefit
> of folios that I missed?
>
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 306a3d1a0fa6..d7c52a5979cc 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -420,18 +420,13 @@ static inline void *folio_get_private(struct folio *folio)
> > }
> >
> > struct page_frag_cache {
> > - void * va;
> > -#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> > - __u16 offset;
> > - __u16 size;
> > -#else
> > - __u32 offset;
> > -#endif
> > + struct folio *folio;
> > + unsigned int offset;
> > /* we maintain a pagecount bias, so that we dont dirty cache line
> > * containing page->_refcount every time we allocate a fragment.
> > */
> > - unsigned int pagecnt_bias;
> > - bool pfmemalloc;
> > + unsigned int pagecnt_bias;
> > + bool pfmemalloc;
> > };
>
> It seems the 'va' and 'size' fields were used to avoid touching 'struct
> page', to avoid possible cache bouncing when more frags can be allocated
> from the page while other frags are freed at the same time, before this
> patch?
> It might be worth calling that out in the commit log or splitting it into
> another patch to make it clearer and easier to review?

Yes, there is a cost for going from page to virtual address. That is
why we only use the page when we finally get to freeing or resetting
the pagecnt_bias.
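
That is, the pre-patch hot path touches only nc->va and nc->offset and
never dereferences struct page (a sketch, slightly simplified from the
code this patch removes):

        offset = nc->offset - fragsz;
        if (likely(offset >= 0)) {
                nc->pagecnt_bias--;
                nc->offset = offset;
                return nc->va + offset; /* no virt_to_page()/page_address() */
        }
        page = virt_to_page(nc->va);    /* slow path only: refill/recycle */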

Also I have some concerns about going from page to folio as it seems
like folio_alloc() sets up the transparent hugepage destructor instead
of using the compound page destructor. I would think that would slow
down most users as it looks like there is a spinlock that is taken in
the hugepage destructor that isn't there in the compound page
destructor.

2023-06-06 09:00:44

by David Howells

Subject: Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios

Alexander H Duyck <[email protected]> wrote:

> Also I have some concerns about going from page to folio as it seems
> like folio_alloc() sets up the transparent hugepage destructor instead
> of using the compound page destructor. I would think that would slow
> down most users as it looks like there is a spinlock that is taken in
> the hugepage destructor that isn't there in the compound page
> destructor.

Note that this code is going to have to move to folios[*] at some point.
"Old-style" compound pages are going to go away, I believe. Matthew Wilcox
and the mm folks are on a drive towards simplifying memory management,
formalising chunks larger than a single page - with the ultimate aim of
reducing the page struct to a single, typed pointer.

So, take, for example, a folio: As I understand it, this will no longer
overlay struct page, but rather will become a single, dynamically-allocated
struct that covers a pow-of-2 number of pages. A contiguous subset of page
structs will point at it.

However, rather than using a folio, we could define a "page fragment" memory
type. Rather than having all the flags and fields to be found in struct
folio, it could have just the set to be found in page_frag_cache.
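
Purely as an illustration, such a type might carry no more than this
(hypothetical, not an existing kernel struct):

        struct page_frag_mem {
                unsigned long   flags;          /* type tag for the typed-pointer world */
                atomic_t        refcount;
                unsigned int    order;          /* size of the backing allocation */
                unsigned int    offset;         /* current carve-out position */
                bool            pfmemalloc;
        };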

David

[*] It will be possible to have some other type than "folio". See "struct
slab" in mm/slab.h for example. struct slab corresponds to a set of pages
and, in the future, a number of struct pages will point at it.


2023-06-06 15:31:05

by Alexander Duyck

Subject: Re: [PATCH net-next 04/12] mm: Make the page_frag_cache allocator use multipage folios

On Tue, Jun 6, 2023 at 1:25 AM David Howells <[email protected]> wrote:
>
> Alexander H Duyck <[email protected]> wrote:
>
> > Also I have some concerns about going from page to folio as it seems
> > like folio_alloc() sets up the transparent hugepage destructor instead
> > of using the compound page destructor. I would think that would slow
> > down most users as it looks like there is a spinlock that is taken in
> > the hugepage destructor that isn't there in the compound page
> > destructor.
>
> Note that this code is going to have to move to folios[*] at some point.
> "Old-style" compound pages are going to go away, I believe. Matthew Wilcox
> and the mm folks are on a drive towards simplifying memory management,
> formalising chunks larger than a single page - with the ultimate aim of
> reducing the page struct to a single, typed pointer.

I'm not against making the move, but as others have pointed out this
is getting into unrelated things. One of those is that, to
transition to using folios, we don't need to get rid of the use of the
virtual address. The idea behind using the virtual address here is
that we can avoid a bunch of address translation overhead since we
only need to use the folio if we are going to allocate, retire, or
recycle a page/folio. If we are using an order-3 page, that shouldn't
be very often.

> So, take, for example, a folio: As I understand it, this will no longer
> overlay struct page, but rather will become a single, dynamically-allocated
> struct that covers a pow-of-2 number of pages. A contiguous subset of page
> structs will point at it.
>
> However, rather than using a folio, we could define a "page fragment" memory
> type. Rather than having all the flags and fields to be found in struct
> folio, it could have just the set to be found in page_frag_cache.

I don't think we need a new memory type. For the most part the page
fragment code is really more a subset of something like
__get_free_pages(), where the requester provides the size and is just
given a virtual address; we shouldn't need to allocate a new page very
often since, ideally, the allocations are 2K or less in size.

Also one thing I would want to avoid is adding complexity to the
freeing path. The general idea with page frags is that they are meant
to be lightweight in terms of freeing as well. So just as they are
similar to __get_free_pages() in terms of allocation, the freeing is
meant to be similar to free_pages().
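
In other words, the usage model they were built for looks like this (a
sketch using the existing helpers):

        void *buf = netdev_alloc_frag(1024); /* just get a virtual address */
        if (buf) {
                /* ... fill it, attach it to an skb, etc ... */
                skb_free_frag(buf);          /* whole free path: page_frag_free() */
        }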

> David
>
> [*] It will be possible to have some other type than "folio". See "struct
> slab" in mm/slab.h for example. struct slab corresponds to a set of pages
> and, in the future, a number of struct pages will point at it.

I want to avoid getting anywhere near the complexity of a slab
allocator. The whole point of this was to keep it simple so that
drivers could use it and get decent performance. When I
implemented it in the Intel drivers back in the day, this approach was
essentially just a reference count/page offset hack that allowed us to
split a page in two and use the pages as a sort of Möbius strip within
the ring buffer.
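
(For reference, that trick looked roughly like this; the type and field
names are illustrative, from memory, not the actual driver code:)

        static void rx_buf_flip(struct rx_buffer *rx_buf)
        {
                rx_buf->page_offset ^= 2048;    /* switch to the other half */
                page_ref_inc(rx_buf->page);     /* one ref per half in flight */
        }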