2024-01-09 01:15:13

by Mina Almasry

[permalink] [raw]
Subject: [RFC PATCH net-next v5 0/2] Abstract page from net stack

Changes in RFC v5:
- RFC due to merge window
- Changed netmem to __bitwise unsigned long.

-----------

Changes in v4:
- Forked off the trivial fixes to skb_frag_t field access to their own
patches and changed this to RFC that depends on these fixes:

https://lore.kernel.org/netdev/[email protected]/T/#u
https://lore.kernel.org/netdev/[email protected]/T/#u

- Use an empty struct for netmem instead of void* __bitwise as that's
not a correct use of __bitwise.

-----------

Changes in v3:

- Replaced the struct netmem union with an opaque netmem_ref type.
- Added func docs to the netmem helpers and type.
- Renamed the skb_frag_t fields since it's no longer a bio_vec

-----------

Changes in v2:
- Reverted changes to the page_pool. The page pool now retains the same
API, so that we don't have to touch many existing drivers. The devmem
TCP series will include the changes to the page pool.

- Addressed comments.

This series is a prerequisite to the devmem TCP series. For a full
snapshot of the code which includes these changes, feel free to check:

https://github.com/mina/linux/commits/tcpdevmem-rfcv5/

-----------

Currently these components in the net stack use the struct page
directly:

1. Drivers.
2. Page pool.
3. skb_frag_t.

To add support for new (non struct page) memory types to the net stack, we
must first abstract the current memory type.

Originally the plan was to reuse struct page* for the new memory types,
and to set the LSB on the page* to indicate it's not really a page.
However, for safe compiler type checking we need to introduce a new type.

struct netmem is introduced to abstract the underlying memory type.
Currently it's a no-op abstraction that is always a struct page underneath.
In parallel there is an undergoing effort to add support for devmem to the
net stack:

https://lore.kernel.org/netdev/[email protected]/

Cc: Jason Gunthorpe <[email protected]>
Cc: Christian König <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Yunsheng Lin <[email protected]>
Cc: Willem de Bruijn <[email protected]>

Mina Almasry (2):
net: introduce abstraction for network memory
net: add netmem to skb_frag_t

include/linux/skbuff.h | 90 +++++++++++++++++++++++++++++-------------
include/net/netmem.h | 41 +++++++++++++++++++
net/core/skbuff.c | 22 ++++++++---
net/kcm/kcmsock.c | 9 ++++-
4 files changed, 127 insertions(+), 35 deletions(-)
create mode 100644 include/net/netmem.h

--
2.43.0.472.g3155946c3a-goog



2024-01-09 01:15:34

by Mina Almasry

[permalink] [raw]
Subject: [RFC PATCH net-next v5 1/2] net: introduce abstraction for network memory

Add the netmem_ref type, an abstraction for network memory.

To add support for new memory types to the net stack, we must first
abstract the current memory type. Currently parts of the net stack
use struct page directly:

- page_pool
- drivers
- skb_frag_t

Originally the plan was to reuse struct page* for the new memory types,
and to set the LSB on the page* to indicate it's not really a page.
However, for compiler type checking we need to introduce a new type.

netmem_ref is introduced to abstract the underlying memory type.
Currently it's a no-op abstraction that is always a struct page
underneath. In parallel there is an undergoing effort to add support
for devmem to the net stack:

https://lore.kernel.org/netdev/[email protected]/

netmem_ref can be pointers to different underlying memory types, and the
low bits are set to indicate the memory type. Helpers are provided
to convert netmem pointers to the underlying memory type (currently only
struct page). In the devmem series helpers are provided so that calling
code can use netmem without worrying about the underlying memory type
unless absolutely necessary.

Signed-off-by: Mina Almasry <[email protected]>

---

rfc v5:
- RFC due to merge window.
- Change to 'typedef unsigned long __bitwise netmem_ref;'
- Fixed commit message (Shakeel).
- Did not apply Shakeel's reviewed-by since the code changed
significantly.

v4:
- use 'struct netmem;' instead of 'typedef void *__bitwise netmem_ref;'

Using __bitwise with a non-integer type was wrong and triggered many
patchwork bot errors/warnings.

Using an integer type causes the compiler to warn when casting NULL to
the integer type.

Attempt to use an empty struct for our opaque network memory.

v3:

- Modify struct netmem from a union of struct page + new types to an opaque
netmem_ref type. I went with:

+typedef void *__bitwise netmem_ref;

rather than this that Jakub recommended:

+typedef unsigned long __bitwise netmem_ref;

Because with the latter the compiler issues warnings to cast NULL to
netmem_ref. I hope that's ok.

- Add some function docs.

v2:

- Use container_of instead of a type cast (David).
---
include/net/netmem.h | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
create mode 100644 include/net/netmem.h

diff --git a/include/net/netmem.h b/include/net/netmem.h
new file mode 100644
index 000000000000..9f327d964782
--- /dev/null
+++ b/include/net/netmem.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Network memory
+ *
+ * Author: Mina Almasry <[email protected]>
+ */
+
+#ifndef _NET_NETMEM_H
+#define _NET_NETMEM_H
+
+/**
+ * netmem_ref - a nonexistent type marking a reference to generic network
+ * memory.
+ *
+ * A netmem_ref currently is always a reference to a struct page. This
+ * abstraction is introduced so support for new memory types can be added.
+ *
+ * Use the supplied helpers to obtain the underlying memory pointer and fields.
+ */
+typedef unsigned long __bitwise netmem_ref;
+
+/* This conversion fails (returns NULL) if the netmem_ref is not struct page
+ * backed.
+ *
+ * Currently struct page is the only possible netmem, and this helper never
+ * fails.
+ */
+static inline struct page *netmem_to_page(netmem_ref netmem)
+{
+ return (__force struct page *)netmem;
+}
+
+/* Converting from page to netmem is always safe, because a page can always be
+ * a netmem.
+ */
+static inline netmem_ref page_to_netmem(struct page *page)
+{
+ return (__force netmem_ref)page;
+}
+
+#endif /* _NET_NETMEM_H */
--
2.43.0.472.g3155946c3a-goog


2024-01-09 01:15:51

by Mina Almasry

[permalink] [raw]
Subject: [RFC PATCH net-next v5 2/2] net: add netmem to skb_frag_t

Use struct netmem* instead of page in skb_frag_t. Currently struct
netmem* is always a struct page underneath, but the abstraction
allows efforts to add support for skb frags not backed by pages.

There is unfortunately 1 instance where the skb_frag_t is assumed to be
a exactly a bio_vec in kcm. For this case, WARN_ON_ONCE and return error
before doing a cast.

Add skb[_frag]_fill_netmem_*() and skb_add_rx_frag_netmem() helpers so
that the API can be used to create netmem skbs.

Signed-off-by: Mina Almasry <[email protected]>

---

v4:
- Handle error in kcm_write_msgs() instead of only warning (Willem)

v3:
- Renamed the fields in skb_frag_t.

v2:
- Add skb frag filling helpers.

---
include/linux/skbuff.h | 90 +++++++++++++++++++++++++++++-------------
net/core/skbuff.c | 22 ++++++++---
net/kcm/kcmsock.c | 9 ++++-
3 files changed, 86 insertions(+), 35 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a5ae952454c8..e59f76151628 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -37,6 +37,7 @@
#endif
#include <net/net_debug.h>
#include <net/dropreason-core.h>
+#include <net/netmem.h>

/**
* DOC: skb checksums
@@ -359,7 +360,11 @@ extern int sysctl_max_skb_frags;
*/
#define GSO_BY_FRAGS 0xFFFF

-typedef struct bio_vec skb_frag_t;
+typedef struct skb_frag {
+ netmem_ref netmem;
+ unsigned int len;
+ unsigned int offset;
+} skb_frag_t;

/**
* skb_frag_size() - Returns the size of a skb fragment
@@ -367,7 +372,7 @@ typedef struct bio_vec skb_frag_t;
*/
static inline unsigned int skb_frag_size(const skb_frag_t *frag)
{
- return frag->bv_len;
+ return frag->len;
}

/**
@@ -377,7 +382,7 @@ static inline unsigned int skb_frag_size(const skb_frag_t *frag)
*/
static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
{
- frag->bv_len = size;
+ frag->len = size;
}

/**
@@ -387,7 +392,7 @@ static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
*/
static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
{
- frag->bv_len += delta;
+ frag->len += delta;
}

/**
@@ -397,7 +402,7 @@ static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
*/
static inline void skb_frag_size_sub(skb_frag_t *frag, int delta)
{
- frag->bv_len -= delta;
+ frag->len -= delta;
}

/**
@@ -417,7 +422,7 @@ static inline bool skb_frag_must_loop(struct page *p)
* skb_frag_foreach_page - loop over pages in a fragment
*
* @f: skb frag to operate on
- * @f_off: offset from start of f->bv_page
+ * @f_off: offset from start of f->netmem
* @f_len: length from f_off to loop over
* @p: (temp var) current page
* @p_off: (temp var) offset from start of current page,
@@ -2429,22 +2434,37 @@ static inline unsigned int skb_pagelen(const struct sk_buff *skb)
return skb_headlen(skb) + __skb_pagelen(skb);
}

+static inline void skb_frag_fill_netmem_desc(skb_frag_t *frag,
+ netmem_ref netmem, int off,
+ int size)
+{
+ frag->netmem = netmem;
+ frag->offset = off;
+ skb_frag_size_set(frag, size);
+}
+
static inline void skb_frag_fill_page_desc(skb_frag_t *frag,
struct page *page,
int off, int size)
{
- frag->bv_page = page;
- frag->bv_offset = off;
- skb_frag_size_set(frag, size);
+ skb_frag_fill_netmem_desc(frag, page_to_netmem(page), off, size);
+}
+
+static inline void __skb_fill_netmem_desc_noacc(struct skb_shared_info *shinfo,
+ int i, netmem_ref netmem,
+ int off, int size)
+{
+ skb_frag_t *frag = &shinfo->frags[i];
+
+ skb_frag_fill_netmem_desc(frag, netmem, off, size);
}

static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
int i, struct page *page,
int off, int size)
{
- skb_frag_t *frag = &shinfo->frags[i];
-
- skb_frag_fill_page_desc(frag, page, off, size);
+ __skb_fill_netmem_desc_noacc(shinfo, i, page_to_netmem(page), off,
+ size);
}

/**
@@ -2460,10 +2480,10 @@ static inline void skb_len_add(struct sk_buff *skb, int delta)
}

/**
- * __skb_fill_page_desc - initialise a paged fragment in an skb
+ * __skb_fill_netmem_desc - initialise a fragment in an skb
* @skb: buffer containing fragment to be initialised
- * @i: paged fragment index to initialise
- * @page: the page to use for this fragment
+ * @i: fragment index to initialise
+ * @netmem: the netmem to use for this fragment
* @off: the offset to the data with @page
* @size: the length of the data
*
@@ -2472,10 +2492,12 @@ static inline void skb_len_add(struct sk_buff *skb, int delta)
*
* Does not take any additional reference on the fragment.
*/
-static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
- struct page *page, int off, int size)
+static inline void __skb_fill_netmem_desc(struct sk_buff *skb, int i,
+ netmem_ref netmem, int off, int size)
{
- __skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
+ struct page *page = netmem_to_page(netmem);
+
+ __skb_fill_netmem_desc_noacc(skb_shinfo(skb), i, netmem, off, size);

/* Propagate page pfmemalloc to the skb if we can. The problem is
* that not all callers have unique ownership of the page but rely
@@ -2483,7 +2505,20 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
*/
page = compound_head(page);
if (page_is_pfmemalloc(page))
- skb->pfmemalloc = true;
+ skb->pfmemalloc = true;
+}
+
+static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
+ struct page *page, int off, int size)
+{
+ __skb_fill_netmem_desc(skb, i, page_to_netmem(page), off, size);
+}
+
+static inline void skb_fill_netmem_desc(struct sk_buff *skb, int i,
+ netmem_ref netmem, int off, int size)
+{
+ __skb_fill_netmem_desc(skb, i, netmem, off, size);
+ skb_shinfo(skb)->nr_frags = i + 1;
}

/**
@@ -2503,8 +2538,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
static inline void skb_fill_page_desc(struct sk_buff *skb, int i,
struct page *page, int off, int size)
{
- __skb_fill_page_desc(skb, i, page, off, size);
- skb_shinfo(skb)->nr_frags = i + 1;
+ skb_fill_netmem_desc(skb, i, page_to_netmem(page), off, size);
}

/**
@@ -2530,6 +2564,8 @@ static inline void skb_fill_page_desc_noacc(struct sk_buff *skb, int i,

void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
int size, unsigned int truesize);
+void skb_add_rx_frag_netmem(struct sk_buff *skb, int i, netmem_ref netmem,
+ int off, int size, unsigned int truesize);

void skb_coalesce_rx_frag(struct sk_buff *skb, int i, int size,
unsigned int truesize);
@@ -3378,7 +3414,7 @@ static inline void skb_propagate_pfmemalloc(const struct page *page,
*/
static inline unsigned int skb_frag_off(const skb_frag_t *frag)
{
- return frag->bv_offset;
+ return frag->offset;
}

/**
@@ -3388,7 +3424,7 @@ static inline unsigned int skb_frag_off(const skb_frag_t *frag)
*/
static inline void skb_frag_off_add(skb_frag_t *frag, int delta)
{
- frag->bv_offset += delta;
+ frag->offset += delta;
}

/**
@@ -3398,7 +3434,7 @@ static inline void skb_frag_off_add(skb_frag_t *frag, int delta)
*/
static inline void skb_frag_off_set(skb_frag_t *frag, unsigned int offset)
{
- frag->bv_offset = offset;
+ frag->offset = offset;
}

/**
@@ -3409,7 +3445,7 @@ static inline void skb_frag_off_set(skb_frag_t *frag, unsigned int offset)
static inline void skb_frag_off_copy(skb_frag_t *fragto,
const skb_frag_t *fragfrom)
{
- fragto->bv_offset = fragfrom->bv_offset;
+ fragto->offset = fragfrom->offset;
}

/**
@@ -3420,7 +3456,7 @@ static inline void skb_frag_off_copy(skb_frag_t *fragto,
*/
static inline struct page *skb_frag_page(const skb_frag_t *frag)
{
- return frag->bv_page;
+ return netmem_to_page(frag->netmem);
}

/**
@@ -3524,7 +3560,7 @@ static inline void *skb_frag_address_safe(const skb_frag_t *frag)
static inline void skb_frag_page_copy(skb_frag_t *fragto,
const skb_frag_t *fragfrom)
{
- fragto->bv_page = fragfrom->bv_page;
+ fragto->netmem = fragfrom->netmem;
}

bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 12d22c0b8551..4fdc33c81969 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -845,16 +845,24 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
}
EXPORT_SYMBOL(__napi_alloc_skb);

-void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
- int size, unsigned int truesize)
+void skb_add_rx_frag_netmem(struct sk_buff *skb, int i, netmem_ref netmem,
+ int off, int size, unsigned int truesize)
{
DEBUG_NET_WARN_ON_ONCE(size > truesize);

- skb_fill_page_desc(skb, i, page, off, size);
+ skb_fill_netmem_desc(skb, i, netmem, off, size);
skb->len += size;
skb->data_len += size;
skb->truesize += truesize;
}
+EXPORT_SYMBOL(skb_add_rx_frag_netmem);
+
+void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
+ int size, unsigned int truesize)
+{
+ skb_add_rx_frag_netmem(skb, i, page_to_netmem(page), off, size,
+ truesize);
+}
EXPORT_SYMBOL(skb_add_rx_frag);

void skb_coalesce_rx_frag(struct sk_buff *skb, int i, int size,
@@ -1904,10 +1912,11 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)

/* skb frags point to kernel buffers */
for (i = 0; i < new_frags - 1; i++) {
- __skb_fill_page_desc(skb, i, head, 0, psize);
+ __skb_fill_netmem_desc(skb, i, page_to_netmem(head), 0, psize);
head = (struct page *)page_private(head);
}
- __skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
+ __skb_fill_netmem_desc(skb, new_frags - 1, page_to_netmem(head), 0,
+ d_off);
skb_shinfo(skb)->nr_frags = new_frags;

release:
@@ -3645,7 +3654,8 @@ skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
if (plen) {
page = virt_to_head_page(from->head);
offset = from->data - (unsigned char *)page_address(page);
- __skb_fill_page_desc(to, 0, page, offset, plen);
+ __skb_fill_netmem_desc(to, 0, page_to_netmem(page),
+ offset, plen);
get_page(page);
j = 1;
len -= plen;
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 1184d40167b8..145ef22b2b35 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -636,9 +636,14 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
msize += skb_frag_size(&skb_shinfo(skb)->frags[i]);

+ if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
+ ret = -EINVAL;
+ goto out;
+ }
+
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE,
- skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
- msize);
+ (const struct bio_vec *)skb_shinfo(skb)->frags,
+ skb_shinfo(skb)->nr_frags, msize);
iov_iter_advance(&msg.msg_iter, txm->frag_offset);

do {
--
2.43.0.472.g3155946c3a-goog


2024-01-10 02:03:33

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [RFC PATCH net-next v5 0/2] Abstract page from net stack

On Mon, 8 Jan 2024 17:14:50 -0800 Mina Almasry wrote:
> Changes in RFC v5:
> - RFC due to merge window
> - Changed netmem to __bitwise unsigned long.


LGTM, thanks!

2024-01-10 18:45:29

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC PATCH net-next v5 1/2] net: introduce abstraction for network memory

On Mon, Jan 8, 2024 at 5:15 PM Mina Almasry <[email protected]> wrote:
>
> Add the netmem_ref type, an abstraction for network memory.
>
> To add support for new memory types to the net stack, we must first
> abstract the current memory type. Currently parts of the net stack
> use struct page directly:
>
> - page_pool
> - drivers
> - skb_frag_t
>
> Originally the plan was to reuse struct page* for the new memory types,
> and to set the LSB on the page* to indicate it's not really a page.
> However, for compiler type checking we need to introduce a new type.
>
> netmem_ref is introduced to abstract the underlying memory type.
> Currently it's a no-op abstraction that is always a struct page
> underneath. In parallel there is an undergoing effort to add support
> for devmem to the net stack:
>
> https://lore.kernel.org/netdev/[email protected]/
>
> netmem_ref can be pointers to different underlying memory types, and the
> low bits are set to indicate the memory type. Helpers are provided
> to convert netmem pointers to the underlying memory type (currently only
> struct page). In the devmem series helpers are provided so that calling
> code can use netmem without worrying about the underlying memory type
> unless absolutely necessary.
>
> Signed-off-by: Mina Almasry <[email protected]>
>

Reviewed-by: Shakeel Butt <[email protected]>

2024-01-12 00:35:19

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH net-next v5 2/2] net: add netmem to skb_frag_t

On Thu, Jan 11, 2024 at 4:45 AM Yunsheng Lin <[email protected]> wrote:
>
> On 2024/1/9 9:14, Mina Almasry wrote:
>
> ...
>
> >
> > + if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
>
> I am really hate to bring it up again.
> If you are not willing to introduce a new helper,

I'm actually more than happy to add a new helper like:

static inline netmem_ref skb_frag_netmem();

For future callers to obtain frag->netmem to use the netmem_ref directly.

What I'm hung up on is really your follow up request:

"Is it possible to introduce something like skb_frag_netmem() for
netmem? so that we can keep most existing users of skb_frag_page()
unchanged and avoid adding additional checking overhead for existing
users."

With this patchseries, skb_frag_t no longer has a page pointer inside
of it, it only has a netmem_ref. The netmem_ref is currently always a
page, but in the future may not be a page. Can you clarify how we keep
skb_frag_page() unchanged and without checks? What do you expect
skb_frag_page() and its callers to do? We can not assume netmem_ref is
always a struct page. I'm happy to implement a change but I need to
understand it a bit better.

> do you care to use some
> existing API like skb_frag_address_safe()?
>

skb_frag_address_safe() checks that the page is mapped. In this case,
we are not checking if the frag page is mapped; we would like to make
sure that the skb_frag has a page inside of it in the first place.
Seems like a different check from skb_frag_address_safe().

In fact, skb_frag_address[_safe]() actually assume that the skb frag
is always a page currently, I think I need to squash this fix:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e59f76151628..bc8b107d0235 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3533,7 +3533,9 @@ static inline void skb_frag_unref(struct sk_buff
*skb, int f)
*/
static inline void *skb_frag_address(const skb_frag_t *frag)
{
- return page_address(skb_frag_page(frag)) + skb_frag_off(frag);
+ return skb_frag_page(frag) ?
+ page_address(skb_frag_page(frag)) + skb_frag_off(frag) :
+ NULL;
}

/**
@@ -3545,7 +3547,14 @@ static inline void *skb_frag_address(const
skb_frag_t *frag)
*/
static inline void *skb_frag_address_safe(const skb_frag_t *frag)
{
- void *ptr = page_address(skb_frag_page(frag));
+ struct page *page;
+ void *ptr;
+
+ page = skb_frag_page(frag);
+ if (!page)
+ return NULL;
+
+ ptr = page_address(skb_frag_page(frag));
if (unlikely(!ptr))
return NULL;

> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > iov_iter_bvec(&msg.msg_iter, ITER_SOURCE,
> > - skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
> > - msize);
> > + (const struct bio_vec *)skb_shinfo(skb)->frags,
> > + skb_shinfo(skb)->nr_frags, msize);
>
> I think we need to use some built-time checking to ensure some consistency
> between skb_frag_t and bio_vec.
>

I can add static_assert() that bio_vec->bv_len & bio_vec->bv_offset
are aligned with skb_frag_t->len & skb_frag_t->offset.

I can also maybe add a helper skb_frag_bvec() to do the cast instead
of doing it at the calling site. That may be a bit cleaner.

> > iov_iter_advance(&msg.msg_iter, txm->frag_offset);
> >
> > do {
> >



--
Thanks,
Mina

2024-01-12 15:36:08

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH net-next v5 2/2] net: add netmem to skb_frag_t

On Fri, Jan 12, 2024 at 3:51 AM Yunsheng Lin <[email protected]> wrote:
>
> On 2024/1/12 8:34, Mina Almasry wrote:
> > On Thu, Jan 11, 2024 at 4:45 AM Yunsheng Lin <[email protected]> wrote:
> >>
> >> On 2024/1/9 9:14, Mina Almasry wrote:
> >>
> >> ...
> >>
> >>>
> >>> + if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
> >>
> >> I am really hate to bring it up again.
> >> If you are not willing to introduce a new helper,
> >
> > I'm actually more than happy to add a new helper like:
> >
> > static inline netmem_ref skb_frag_netmem();
> >
> > For future callers to obtain frag->netmem to use the netmem_ref directly.
> >
> > What I'm hung up on is really your follow up request:
> >
> > "Is it possible to introduce something like skb_frag_netmem() for
> > netmem? so that we can keep most existing users of skb_frag_page()
> > unchanged and avoid adding additional checking overhead for existing
> > users."
> >
> > With this patchseries, skb_frag_t no longer has a page pointer inside
> > of it, it only has a netmem_ref. The netmem_ref is currently always a
> > page, but in the future may not be a page. Can you clarify how we keep
> > skb_frag_page() unchanged and without checks? What do you expect
> > skb_frag_page() and its callers to do? We can not assume netmem_ref is
> > always a struct page. I'm happy to implement a change but I need to
> > understand it a bit better.


You did not answer my question that I asked here, and ignoring this
question is preventing us from making any forward progress on this
discussion. What do you expect or want skb_frag_page() to do when
there is no page in the frag?

>
> There are still many existing places still not expecting or handling
> skb_frag_page() returning NULL, mostly those are in the drivers not
> supporting devmem,

As of this series skb_frag_page() cannot return NULL.

In the devmem series, all core networking stack places where
skb_frag_page() may return NULL are audited.

skb_frag_page() returning NULL in a driver that doesn't support devmem
is not possible. The driver is the one that creates the devmem frags
in the first place. When the driver author adds devmem support, they
should also add support for skb_frag_page() returning NULL.

> what's the point of adding the extral overhead for
> those driver?
>

There is no overhead with static branches. The checks are no-op unless
the user enables devmem, at which point the devmem connections see no
overhead and non-devmem connections will see minimal overhead that I
suspect will not reproduce any perf issue. If the user is not fine
with that they can simply not enable devmem and continue to not
experience any overhead.

> The networking stack should forbid skb with devmem frag being forwarded
> to drivers not supporting devmem yet. I am sure if this is done properly
> in your patchset yet? if not, I think you might to audit every existing
> drivers handling skb_frag_page() returning NULL correctly.
>

There is no audit required. The devmem frags are generated by the
driver and forwarded to the TCP stack, not the other way around.

>
> >
> >> do you care to use some
> >> existing API like skb_frag_address_safe()?
> >>
> >
> > skb_frag_address_safe() checks that the page is mapped. In this case,
> > we are not checking if the frag page is mapped; we would like to make
> > sure that the skb_frag has a page inside of it in the first place.
> > Seems like a different check from skb_frag_address_safe().
> >
> > In fact, skb_frag_address[_safe]() actually assume that the skb frag
> > is always a page currently, I think I need to squash this fix:
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index e59f76151628..bc8b107d0235 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -3533,7 +3533,9 @@ static inline void skb_frag_unref(struct sk_buff
> > *skb, int f)
> > */
> > static inline void *skb_frag_address(const skb_frag_t *frag)
> > {
> > - return page_address(skb_frag_page(frag)) + skb_frag_off(frag);
> > + return skb_frag_page(frag) ?
> > + page_address(skb_frag_page(frag)) + skb_frag_off(frag) :
> > + NULL;
> > }
> >
> > /**
> > @@ -3545,7 +3547,14 @@ static inline void *skb_frag_address(const
> > skb_frag_t *frag)
> > */
> > static inline void *skb_frag_address_safe(const skb_frag_t *frag)
> > {
> > - void *ptr = page_address(skb_frag_page(frag));
> > + struct page *page;
> > + void *ptr;
> > +
> > + page = skb_frag_page(frag);
> > + if (!page)
> > + return NULL;
> > +
> > + ptr = page_address(skb_frag_page(frag));
> > if (unlikely(!ptr))
> > return NULL;
> >
> >>> + ret = -EINVAL;
> >>> + goto out;
> >>> + }
> >>> +
> >>> iov_iter_bvec(&msg.msg_iter, ITER_SOURCE,
> >>> - skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
> >>> - msize);
> >>> + (const struct bio_vec *)skb_shinfo(skb)->frags,
> >>> + skb_shinfo(skb)->nr_frags, msize);
> >>
> >> I think we need to use some built-time checking to ensure some consistency
> >> between skb_frag_t and bio_vec.
> >>
> >
> > I can add static_assert() that bio_vec->bv_len & bio_vec->bv_offset
> > are aligned with skb_frag_t->len & skb_frag_t->offset.
> >
> > I can also maybe add a helper skb_frag_bvec() to do the cast instead
> > of doing it at the calling site. That may be a bit cleaner.
>
> I think the more generic way to do is to add something like iov_iter_netmem()
> instead of doing any cast, so that netmem can be decoupled from bvec completely.
>
> >
> >>> iov_iter_advance(&msg.msg_iter, txm->frag_offset);
> >>>
> >>> do {
> >>>
> >
> >
> >
> > --
> > Thanks,
> > Mina
> > .
> >



--
Thanks,
Mina

2024-01-15 23:24:03

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH net-next v5 2/2] net: add netmem to skb_frag_t

On Mon, Jan 15, 2024 at 1:37 AM Yunsheng Lin <[email protected]> wrote:
>
> On 2024/1/12 23:35, Mina Almasry wrote:
> > On Fri, Jan 12, 2024 at 3:51 AM Yunsheng Lin <[email protected]> wrote:
> >>
> >> On 2024/1/12 8:34, Mina Almasry wrote:
> >>> On Thu, Jan 11, 2024 at 4:45 AM Yunsheng Lin <[email protected]> wrote:
> >>>>
> >>>> On 2024/1/9 9:14, Mina Almasry wrote:
> >>>>
> >>>> ...
> >>>>
> >>>>>
> >>>>> + if (WARN_ON_ONCE(!skb_frag_page(&skb_shinfo(skb)->frags[0]))) {
> >>>>
> >>>> I am really hate to bring it up again.
> >>>> If you are not willing to introduce a new helper,
> >>>
> >>> I'm actually more than happy to add a new helper like:
> >>>
> >>> static inline netmem_ref skb_frag_netmem();
> >>>
> >>> For future callers to obtain frag->netmem to use the netmem_ref directly.
> >>>
> >>> What I'm hung up on is really your follow up request:
> >>>
> >>> "Is it possible to introduce something like skb_frag_netmem() for
> >>> netmem? so that we can keep most existing users of skb_frag_page()
> >>> unchanged and avoid adding additional checking overhead for existing
> >>> users."
> >>>
> >>> With this patchseries, skb_frag_t no longer has a page pointer inside
> >>> of it, it only has a netmem_ref. The netmem_ref is currently always a
> >>> page, but in the future may not be a page. Can you clarify how we keep
> >>> skb_frag_page() unchanged and without checks? What do you expect
> >>> skb_frag_page() and its callers to do? We can not assume netmem_ref is
> >>> always a struct page. I'm happy to implement a change but I need to
> >>> understand it a bit better.
> >
> >
> > You did not answer my question that I asked here, and ignoring this
> > question is preventing us from making any forward progress on this
> > discussion. What do you expect or want skb_frag_page() to do when
> > there is no page in the frag?
>
> I would expect it to do nothing.

I don't understand. skb_frag_page() with an empty implementation just
results in a compiler error as the function needs to return a page
pointer. Do you actually expect skb_frag_page() to unconditionally
cast frag->netmem to a page pointer? That was explained as
unacceptable over and over again by Jason and Christian as it risks
casting devmem to page; completely unacceptable and will get nacked.
Do you have a suggestion of what skb_frag_page() should do that will
not get nacked by mm?

> IMHO, the caller is expected to only call skb_frag_page() for the frag
> with normal page, for example, doing some 'readable' checking before
> callingskb_frag_page(), or not doing any checking at all if it is called
> in some existing driver not support devmem.
>

You did not specify what you actually want skb_frag_page() to do, but
IMO leaving it up to the programmer to guess/decide if it's safe to
use with no checking in the function is very error prone and hacky, as
the programmer is obviously liable to assume the wrong thing.

All skb_frag_page() callers (which are mainly core networking
actually) would need to do a check anyway to ensure the compiler type
safety, set as a hard requirement by mm & dmabuf maintainers, and at
that point having the check in the function makes sense.

> >
> >>
> >> There are still many existing places still not expecting or handling
> >> skb_frag_page() returning NULL, mostly those are in the drivers not
> >> supporting devmem,
> >
> > As of this series skb_frag_page() cannot return NULL.
> >
> > In the devmem series, all core networking stack places where
> > skb_frag_page() may return NULL are audited.
> >
> > skb_frag_page() returning NULL in a driver that doesn't support devmem
> > is not possible. The driver is the one that creates the devmem frags
>
> Is it possible a netdev supporting devmen and a netdev not supporting
> devmen are put into the same bridge, and a rx skb from netdev supporting
> devmen is forwarded to netdev not supporting devmem?
>
> br_forward() or ip_forward() may be the place that might do this kind
> of forwarding?
>

I'd need to check, but even at that point it sounds to me like this
forwarding code would need to check handle devmem or disabled for
devmem, not a global audit of drivers.

> > in the first place. When the driver author adds devmem support, they
> > should also add support for skb_frag_page() returning NULL.
> >
> >> what's the point of adding the extral overhead for
> >> those driver?
> >>
> >
> > There is no overhead with static branches. The checks are no-op unless
> > the user enables devmem, at which point the devmem connections see no
>
> no-op is still some cpu instruction that will be replaced by some jumping
> instruction when devmem is enabled, right?
>

No, there is no overhead with static branches. There seems to be a
misunderstanding of how they work causing this massive overestimation
of the 'overhead' of the checks. Branches in code are free or almost
free when they are correctly predicted by the compiler:

if (unlikely(check()))
do_work();
continue();

Is free or almost free when check() is false, as the compiler assumes
check() is false and starts executing continue() immediately. There is
a minor perf hit when the branch is mispredicted as the CPU needs to
start executing do_work() after the check() result is available.
Static branches enable us to at runtime decide which branch the
compiler predicts. We could put devmem processing to always be in the
unlikely path without static branches, but that's not necessary.
Optimizing devmem processing with static branches when devmem is
enabled seems the right choice here. That's my understanding at least.

> Maybe Alexander had better words for those kinds of overhead:
> "The overhead may not seem like much, but when you are having to deal
> with it per page and you are processing pages at millions per second it
> can quickly start to add up."
>
>

This IMO would be relevant if we're adding (significant, or
measureable) overhead to page processing, but with static branches or
putting devmem in the unlikely path we are not.

> > overhead and non-devmem connections will see minimal overhead that I
> > suspect will not reproduce any perf issue. If the user is not fine
> > with that they can simply not enable devmem and continue to not
> > experience any overhead.
> >
> >> The networking stack should forbid skb with devmem frag being forwarded
> >> to drivers not supporting devmem yet. I am sure if this is done properly
> >> in your patchset yet? if not, I think you might to audit every existing
> >> drivers handling skb_frag_page() returning NULL correctly.
> >>
> >
> > There is no audit required. The devmem frags are generated by the
> > driver and forwarded to the TCP stack, not the other way around.
> >
> >>
> >>>
>


--
Thanks,
Mina

2024-01-16 00:01:50

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH net-next v5 2/2] net: add netmem to skb_frag_t

On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > > You did not answer my question that I asked here, and ignoring this
> > > question is preventing us from making any forward progress on this
> > > discussion. What do you expect or want skb_frag_page() to do when
> > > there is no page in the frag?
> >
> > I would expect it to do nothing.
>
> I don't understand. skb_frag_page() with an empty implementation just
> results in a compiler error as the function needs to return a page
> pointer. Do you actually expect skb_frag_page() to unconditionally
> cast frag->netmem to a page pointer? That was explained as
> unacceptable over and over again by Jason and Christian as it risks
> casting devmem to page; completely unacceptable and will get nacked.
> Do you have a suggestion of what skb_frag_page() should do that will
> not get nacked by mm?

WARN_ON and return NULL seems reasonable?

Jason

2024-01-16 12:16:32

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC PATCH net-next v5 2/2] net: add netmem to skb_frag_t

On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> >>>> You did not answer my question that I asked here, and ignoring this
> >>>> question is preventing us from making any forward progress on this
> >>>> discussion. What do you expect or want skb_frag_page() to do when
> >>>> there is no page in the frag?
> >>>
> >>> I would expect it to do nothing.
> >>
> >> I don't understand. skb_frag_page() with an empty implementation just
> >> results in a compiler error as the function needs to return a page
> >> pointer. Do you actually expect skb_frag_page() to unconditionally
> >> cast frag->netmem to a page pointer? That was explained as
> >> unacceptable over and over again by Jason and Christian as it risks
> >> casting devmem to page; completely unacceptable and will get nacked.
> >> Do you have a suggestion of what skb_frag_page() should do that will
> >> not get nacked by mm?
> >
> > WARN_ON and return NULL seems reasonable?
>
> While I am agreed that it may be a nightmare to debug the case of passing
> a false page into the mm system, but I am not sure what's the point of
> returning NULL to caller if the caller is not expecting or handling
> the

You have to return something and NULL will largely reliably crash the
thread. The WARN_ON explains in detail why your thread just crashed.

> NULL returning[for example, most of mm API called by the networking does not
> seems to handling NULL as input page], isn't the NULL returning will make
> the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> As returning NULL seems to be causing a confusion for the caller of
> skb_frag_page() as whether to or how to handle the NULL returning case.

Possibly, though Linus doesn't like BUG_ON on principle..

I think the bigger challenge is convincing people that this devmem
stuff doesn't just open a bunch of holes in the kernel where userspace
can crash it.

The fact you all are debating what to do with skb_frag_page() suggests
to me there isn't confidence...

Jason

2024-01-17 18:01:28

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH net-next v5 2/2] net: add netmem to skb_frag_t

On Tue, Jan 16, 2024 at 4:16 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> > On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > >>>> You did not answer my question that I asked here, and ignoring this
> > >>>> question is preventing us from making any forward progress on this
> > >>>> discussion. What do you expect or want skb_frag_page() to do when
> > >>>> there is no page in the frag?
> > >>>
> > >>> I would expect it to do nothing.
> > >>
> > >> I don't understand. skb_frag_page() with an empty implementation just
> > >> results in a compiler error as the function needs to return a page
> > >> pointer. Do you actually expect skb_frag_page() to unconditionally
> > >> cast frag->netmem to a page pointer? That was explained as
> > >> unacceptable over and over again by Jason and Christian as it risks
> > >> casting devmem to page; completely unacceptable and will get nacked.
> > >> Do you have a suggestion of what skb_frag_page() should do that will
> > >> not get nacked by mm?
> > >
> > > WARN_ON and return NULL seems reasonable?
> >

That's more or less what I'm thinking.

> > While I am agreed that it may be a nightmare to debug the case of passing
> > a false page into the mm system, but I am not sure what's the point of
> > returning NULL to caller if the caller is not expecting or handling
> > the
>
> You have to return something and NULL will largely reliably crash the
> thread. The WARN_ON explains in detail why your thread just crashed.
>

Agreed.

> > NULL returning[for example, most of mm API called by the networking does not
> > seems to handling NULL as input page], isn't the NULL returning will make
> > the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> > depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> > As returning NULL seems to be causing a confusion for the caller of
> > skb_frag_page() as whether to or how to handle the NULL returning case.
>
> Possibly, though Linus doesn't like BUG_ON on principle..
>
> I think the bigger challenge is convincing people that this devmem
> stuff doesn't just open a bunch of holes in the kernel where userspace
> can crash it.
>

It does not, and as of right now there are no pending concerns from
any netdev maintainers regarding mishandled devmem checks at least.
This is because the devmem series comes with a full audit of
skb_frag_page() callers [1] and all areas in the net stack attempting
to access the skb [2].

[1] https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
[2] https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

> The fact you all are debating what to do with skb_frag_page() suggests
> to me there isn't confidence...
>

The debate raging on is related to the performance of skb_frag_page(),
not correctness (and even then, I don't think it's related to
perf...). Yunsheng would like us to optimize skb_frag_page() using an
unconditional cast from netmem to page. This in Yunsheng's mind is a
performance optimization as we don't need to add an if statement
checking if the netmem is a page. I'm resistant to implement that
change so far because:

(a) unconditionally casting from netmem to page negates the compiler
type safety that you and Christian are laying out as a requirement for
the devmem stuff.
(b) With likely/unlikely or static branches the check to make sure
netmem is page is a no-op for existing use cases anyway, so AFAIU,
there is no perf gain from optimizing it out anyway.

But none of this is related to correctness. Code calling
skb_frag_page() will fail or crash if it's not handled correctly
regardless of the implementation details of skb_frag_page(). In the
devmem series we add support to handle it correctly via [1] & [2].

--
Thanks,
Mina

2024-01-17 18:34:42

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH net-next v5 2/2] net: add netmem to skb_frag_t

On Wed, Jan 17, 2024 at 10:00 AM Mina Almasry <[email protected]> wrote:
>
> On Tue, Jan 16, 2024 at 4:16 AM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> > > On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > > > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > > >>>> You did not answer my question that I asked here, and ignoring this
> > > >>>> question is preventing us from making any forward progress on this
> > > >>>> discussion. What do you expect or want skb_frag_page() to do when
> > > >>>> there is no page in the frag?
> > > >>>
> > > >>> I would expect it to do nothing.
> > > >>
> > > >> I don't understand. skb_frag_page() with an empty implementation just
> > > >> results in a compiler error as the function needs to return a page
> > > >> pointer. Do you actually expect skb_frag_page() to unconditionally
> > > >> cast frag->netmem to a page pointer? That was explained as
> > > >> unacceptable over and over again by Jason and Christian as it risks
> > > >> casting devmem to page; completely unacceptable and will get nacked.
> > > >> Do you have a suggestion of what skb_frag_page() should do that will
> > > >> not get nacked by mm?
> > > >
> > > > WARN_ON and return NULL seems reasonable?
> > >
>
> That's more or less what I'm thinking.
>
> > > While I am agreed that it may be a nightmare to debug the case of passing
> > > a false page into the mm system, but I am not sure what's the point of
> > > returning NULL to caller if the caller is not expecting or handling
> > > the
> >
> > You have to return something and NULL will largely reliably crash the
> > thread. The WARN_ON explains in detail why your thread just crashed.
> >
>
> Agreed.
>
> > > NULL returning[for example, most of mm API called by the networking does not
> > > seems to handling NULL as input page], isn't the NULL returning will make
> > > the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> > > depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> > > As returning NULL seems to be causing a confusion for the caller of
> > > skb_frag_page() as whether to or how to handle the NULL returning case.
> >
> > Possibly, though Linus doesn't like BUG_ON on principle..
> >
> > I think the bigger challenge is convincing people that this devmem
> > stuff doesn't just open a bunch of holes in the kernel where userspace
> > can crash it.
> >
>
> It does not, and as of right now there are no pending concerns from
> any netdev maintainers regarding mishandled devmem checks at least.
> This is because the devmem series comes with a full audit of
> skb_frag_page() callers [1] and all areas in the net stack attempting
> to access the skb [2].
>
> [1] https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
> [2] https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
>
> > The fact you all are debating what to do with skb_frag_page() suggests
> > to me there isn't confidence...
> >
>
> The debate raging on is related to the performance of skb_frag_page(),
> not correctness (and even then, I don't think it's related to
> perf...). Yunsheng would like us to optimize skb_frag_page() using an
> unconditional cast from netmem to page. This in Yunsheng's mind is a
> performance optimization as we don't need to add an if statement
> checking if the netmem is a page. I'm resistant to implement that
> change so far because:
>
> (a) unconditionally casting from netmem to page negates the compiler
> type safety that you and Christian are laying out as a requirement for
> the devmem stuff.
> (b) With likely/unlikely or static branches the check to make sure
> netmem is page is a no-op for existing use cases anyway, so AFAIU,
> there is no perf gain from optimizing it out anyway.
>

Another thought, if anyone else is concerned about the devmem checks
performance, potentially we could introduce CONFIG_NET_DEVMEM which
when disabled prevents the user from using devmem at all (disables the
netlink API).

When that is disabled, skb_frag_page(), netmem_to_page() & friends can
assume netmem is always a page and do a straight cast between netmem &
page. When it's enabled, it will check that netmem == page before
doing a cast, and return NULL if it is not a page.

I think this is technically viable and I think preserves the compiler
type safety requirements set by mm folks. From my POV though, bloating
the kernel with a a new CONFIG just to optimize out no-op checks seems
unnecessary, but if there is agreement that the checks are a concern,
adding CONFIG_NET_DEVMEM should address it while being acceptable to
mm maintainers.

> But none of this is related to correctness. Code calling
> skb_frag_page() will fail or crash if it's not handled correctly
> regardless of the implementation details of skb_frag_page(). In the
> devmem series we add support to handle it correctly via [1] & [2].
>
> --
> Thanks,
> Mina



--
Thanks,
Mina

2024-01-17 18:54:25

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC PATCH net-next v5 2/2] net: add netmem to skb_frag_t

Mina Almasry wrote:
> On Wed, Jan 17, 2024 at 10:00 AM Mina Almasry <[email protected]> wrote:
> >
> > On Tue, Jan 16, 2024 at 4:16 AM Jason Gunthorpe <[email protected]> wrote:
> > >
> > > On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> > > > On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > > > > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > > > >>>> You did not answer my question that I asked here, and ignoring this
> > > > >>>> question is preventing us from making any forward progress on this
> > > > >>>> discussion. What do you expect or want skb_frag_page() to do when
> > > > >>>> there is no page in the frag?
> > > > >>>
> > > > >>> I would expect it to do nothing.
> > > > >>
> > > > >> I don't understand. skb_frag_page() with an empty implementation just
> > > > >> results in a compiler error as the function needs to return a page
> > > > >> pointer. Do you actually expect skb_frag_page() to unconditionally
> > > > >> cast frag->netmem to a page pointer? That was explained as
> > > > >> unacceptable over and over again by Jason and Christian as it risks
> > > > >> casting devmem to page; completely unacceptable and will get nacked.
> > > > >> Do you have a suggestion of what skb_frag_page() should do that will
> > > > >> not get nacked by mm?
> > > > >
> > > > > WARN_ON and return NULL seems reasonable?
> > > >
> >
> > That's more or less what I'm thinking.
> >
> > > > While I am agreed that it may be a nightmare to debug the case of passing
> > > > a false page into the mm system, but I am not sure what's the point of
> > > > returning NULL to caller if the caller is not expecting or handling
> > > > the
> > >
> > > You have to return something and NULL will largely reliably crash the
> > > thread. The WARN_ON explains in detail why your thread just crashed.
> > >
> >
> > Agreed.
> >
> > > > NULL returning[for example, most of mm API called by the networking does not
> > > > seems to handling NULL as input page], isn't the NULL returning will make
> > > > the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> > > > depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> > > > As returning NULL seems to be causing a confusion for the caller of
> > > > skb_frag_page() as whether to or how to handle the NULL returning case.
> > >
> > > Possibly, though Linus doesn't like BUG_ON on principle..
> > >
> > > I think the bigger challenge is convincing people that this devmem
> > > stuff doesn't just open a bunch of holes in the kernel where userspace
> > > can crash it.
> > >
> >
> > It does not, and as of right now there are no pending concerns from
> > any netdev maintainers regarding mishandled devmem checks at least.
> > This is because the devmem series comes with a full audit of
> > skb_frag_page() callers [1] and all areas in the net stack attempting
> > to access the skb [2].
> >
> > [1] https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
> > [2] https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
> >
> > > The fact you all are debating what to do with skb_frag_page() suggests
> > > to me there isn't confidence...
> > >
> >
> > The debate raging on is related to the performance of skb_frag_page(),
> > not correctness (and even then, I don't think it's related to
> > perf...). Yunsheng would like us to optimize skb_frag_page() using an
> > unconditional cast from netmem to page. This in Yunsheng's mind is a
> > performance optimization as we don't need to add an if statement
> > checking if the netmem is a page. I'm resistant to implement that
> > change so far because:
> >
> > (a) unconditionally casting from netmem to page negates the compiler
> > type safety that you and Christian are laying out as a requirement for
> > the devmem stuff.
> > (b) With likely/unlikely or static branches the check to make sure
> > netmem is page is a no-op for existing use cases anyway, so AFAIU,
> > there is no perf gain from optimizing it out anyway.
> >
>
> Another thought, if anyone else is concerned about the devmem checks
> performance, potentially we could introduce CONFIG_NET_DEVMEM which
> when disabled prevents the user from using devmem at all (disables the
> netlink API).
>
> When that is disabled, skb_frag_page(), netmem_to_page() & friends can
> assume netmem is always a page and do a straight cast between netmem &
> page. When it's enabled, it will check that netmem == page before
> doing a cast, and return NULL if it is not a page.
>
> I think this is technically viable and I think preserves the compiler
> type safety requirements set by mm folks. From my POV though, bloating
> the kernel with a a new CONFIG just to optimize out no-op checks seems
> unnecessary, but if there is agreement that the checks are a concern,
> adding CONFIG_NET_DEVMEM should address it while being acceptable to
> mm maintainers.

I agree. A concern with CONFIGs is that what matters in practice is
which default the distros compile with. In this case, adding hurdles
to using the feature likely for no real reason.

Static branches are used throughout the kernel in performance
sensitive paths, exactly because they allow optional paths effectively
for free. I'm quite surprised that this issue is being raised so
strongly here, as they are hardly new or controversial.

But perhaps best is to show with data. Is there a representative page
pool benchmark that exercises the most sensitive case (XDP_DROP?) that
we can run with and without a static branch to demonstrate that any
diff is within the noise?

> > But none of this is related to correctness. Code calling
> > skb_frag_page() will fail or crash if it's not handled correctly
> > regardless of the implementation details of skb_frag_page(). In the
> > devmem series we add support to handle it correctly via [1] & [2].
> >
> > --
> > Thanks,
> > Mina
>
>
>
> --
> Thanks,
> Mina



2024-01-18 13:57:15

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH net-next v5 2/2] net: add netmem to skb_frag_t

On Wed, Jan 17, 2024 at 10:54 AM Willem de Bruijn
<[email protected]> wrote:
>
> Mina Almasry wrote:
> > On Wed, Jan 17, 2024 at 10:00 AM Mina Almasry <[email protected]> wrote:
> > >
> > > On Tue, Jan 16, 2024 at 4:16 AM Jason Gunthorpe <[email protected]> wrote:
> > > >
> > > > On Tue, Jan 16, 2024 at 07:04:13PM +0800, Yunsheng Lin wrote:
> > > > > On 2024/1/16 8:01, Jason Gunthorpe wrote:
> > > > > > On Mon, Jan 15, 2024 at 03:23:33PM -0800, Mina Almasry wrote:
> > > > > >>>> You did not answer my question that I asked here, and ignoring this
> > > > > >>>> question is preventing us from making any forward progress on this
> > > > > >>>> discussion. What do you expect or want skb_frag_page() to do when
> > > > > >>>> there is no page in the frag?
> > > > > >>>
> > > > > >>> I would expect it to do nothing.
> > > > > >>
> > > > > >> I don't understand. skb_frag_page() with an empty implementation just
> > > > > >> results in a compiler error as the function needs to return a page
> > > > > >> pointer. Do you actually expect skb_frag_page() to unconditionally
> > > > > >> cast frag->netmem to a page pointer? That was explained as
> > > > > >> unacceptable over and over again by Jason and Christian as it risks
> > > > > >> casting devmem to page; completely unacceptable and will get nacked.
> > > > > >> Do you have a suggestion of what skb_frag_page() should do that will
> > > > > >> not get nacked by mm?
> > > > > >
> > > > > > WARN_ON and return NULL seems reasonable?
> > > > >
> > >
> > > That's more or less what I'm thinking.
> > >
> > > > > While I am agreed that it may be a nightmare to debug the case of passing
> > > > > a false page into the mm system, but I am not sure what's the point of
> > > > > returning NULL to caller if the caller is not expecting or handling
> > > > > the
> > > >
> > > > You have to return something and NULL will largely reliably crash the
> > > > thread. The WARN_ON explains in detail why your thread just crashed.
> > > >
> > >
> > > Agreed.
> > >
> > > > > NULL returning[for example, most of mm API called by the networking does not
> > > > > seems to handling NULL as input page], isn't the NULL returning will make
> > > > > the kernel panic anyway? Doesn't it make more sense to just add a BUG_ON()
> > > > > depending on some configuration like CONFIG_DEBUG_NET or CONFIG_DEVMEM?
> > > > > As returning NULL seems to be causing a confusion for the caller of
> > > > > skb_frag_page() as whether to or how to handle the NULL returning case.
> > > >
> > > > Possibly, though Linus doesn't like BUG_ON on principle..
> > > >
> > > > I think the bigger challenge is convincing people that this devmem
> > > > stuff doesn't just open a bunch of holes in the kernel where userspace
> > > > can crash it.
> > > >
> > >
> > > It does not, and as of right now there are no pending concerns from
> > > any netdev maintainers regarding mishandled devmem checks at least.
> > > This is because the devmem series comes with a full audit of
> > > skb_frag_page() callers [1] and all areas in the net stack attempting
> > > to access the skb [2].
> > >
> > > [1] https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
> > > [2] https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
> > >
> > > > The fact you all are debating what to do with skb_frag_page() suggests
> > > > to me there isn't confidence...
> > > >
> > >
> > > The debate raging on is related to the performance of skb_frag_page(),
> > > not correctness (and even then, I don't think it's related to
> > > perf...). Yunsheng would like us to optimize skb_frag_page() using an
> > > unconditional cast from netmem to page. This in Yunsheng's mind is a
> > > performance optimization as we don't need to add an if statement
> > > checking if the netmem is a page. I'm resistant to implement that
> > > change so far because:
> > >
> > > (a) unconditionally casting from netmem to page negates the compiler
> > > type safety that you and Christian are laying out as a requirement for
> > > the devmem stuff.
> > > (b) With likely/unlikely or static branches the check to make sure
> > > netmem is page is a no-op for existing use cases anyway, so AFAIU,
> > > there is no perf gain from optimizing it out anyway.
> > >
> >
> > Another thought, if anyone else is concerned about the devmem checks
> > performance, potentially we could introduce CONFIG_NET_DEVMEM which
> > when disabled prevents the user from using devmem at all (disables the
> > netlink API).
> >
> > When that is disabled, skb_frag_page(), netmem_to_page() & friends can
> > assume netmem is always a page and do a straight cast between netmem &
> > page. When it's enabled, it will check that netmem == page before
> > doing a cast, and return NULL if it is not a page.
> >
> > I think this is technically viable and I think preserves the compiler
> > type safety requirements set by mm folks. From my POV though, bloating
> > the kernel with a a new CONFIG just to optimize out no-op checks seems
> > unnecessary, but if there is agreement that the checks are a concern,
> > adding CONFIG_NET_DEVMEM should address it while being acceptable to
> > mm maintainers.
>
> I agree. A concern with CONFIGs is that what matters in practice is
> which default the distros compile with. In this case, adding hurdles
> to using the feature likely for no real reason.
>
> Static branches are used throughout the kernel in performance
> sensitive paths, exactly because they allow optional paths effectively
> for free. I'm quite surprised that this issue is being raised so
> strongly here, as they are hardly new or controversial.
>
> But perhaps best is to show with data. Is there a representative page
> pool benchmark that exercises the most sensitive case (XDP_DROP?) that
> we can run with and without a static branch to demonstrate that any
> diff is within the noise?
>

Yes, Jesper has a page_pool benchmark that he pointed me to in RFC v2:

https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

When running the results on RFCv2, the results were:

net-next @ b44693495af8
https://pastebin.com/raw/JuU7UQXe

+ devmem TCP changes:
https://pastebin.com/raw/mY1L6U4r

On RFC v2 the benchmark showed only a single instruction regression in
the page_pool fast path & the change deemed acceptable to Jesper from
a performance POV [1].

I have not run the benchmark continually on follow up iterations of
the RFC, but I think I'll start doing so in the future.

[1] https://lore.kernel.org/netdev/[email protected]/

> > > But none of this is related to correctness. Code calling
> > > skb_frag_page() will fail or crash if it's not handled correctly
> > > regardless of the implementation details of skb_frag_page(). In the
> > > devmem series we add support to handle it correctly via [1] & [2].
> > >
> > > --
> > > Thanks,
> > > Mina
> >
> >
> >
> > --
> > Thanks,
> > Mina
>
>


--
Thanks,
Mina