2023-11-06 02:46:41

by Mina Almasry

[permalink] [raw]
Subject: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

For device memory TCP, we expect the skb headers to be available in host
memory for access, and we expect the skb frags to be in device memory
and unaccessible to the host. We expect there to be no mixing and
matching of device memory frags (unaccessible) with host memory frags
(accessible) in the same skb.

Add a skb->devmem flag which indicates whether the frags in this skb
are device memory frags or not.

__skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
and marks the skb as skb->devmem accordingly.

Add checks through the network stack to avoid accessing the frags of
devmem skbs and avoid coalescing devmem skbs with non devmem skbs.

Signed-off-by: Willem de Bruijn <[email protected]>
Signed-off-by: Kaiyuan Zhang <[email protected]>
Signed-off-by: Mina Almasry <[email protected]>

---
include/linux/skbuff.h | 14 +++++++-
include/net/tcp.h | 5 +--
net/core/datagram.c | 6 ++++
net/core/gro.c | 5 ++-
net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
net/ipv4/tcp.c | 6 ++++
net/ipv4/tcp_input.c | 13 +++++--
net/ipv4/tcp_output.c | 5 ++-
net/packet/af_packet.c | 4 +--
9 files changed, 115 insertions(+), 20 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1fae276c1353..8fb468ff8115 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
* @csum_level: indicates the number of consecutive checksums found in
* the packet minus one that have been verified as
* CHECKSUM_UNNECESSARY (max 3)
+ * @devmem: indicates that all the fragments in this skb are backed by
+ * device memory.
* @dst_pending_confirm: need to confirm neighbour
* @decrypted: Decrypted SKB
* @slow_gro: state present at GRO time, slower prepare step required
@@ -991,7 +993,7 @@ struct sk_buff {
#if IS_ENABLED(CONFIG_IP_SCTP)
__u8 csum_not_inet:1;
#endif
-
+ __u8 devmem:1;
#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
__u16 tc_index; /* traffic control index */
#endif
@@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
__skb_zcopy_downgrade_managed(skb);
}

+/* Return true if frags in this skb are not readable by the host. */
+static inline bool skb_frags_not_readable(const struct sk_buff *skb)
+{
+ return skb->devmem;
+}
+
static inline void skb_mark_not_on_list(struct sk_buff *skb)
{
skb->next = NULL;
@@ -2468,6 +2476,10 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
struct page *page, int off, int size)
{
__skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
+ if (page_is_page_pool_iov(page)) {
+ skb->devmem = true;
+ return;
+ }

/* Propagate page pfmemalloc to the skb if we can. The problem is
* that not all callers have unique ownership of the page but rely
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 39b731c900dd..1ae62d1e284b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1012,7 +1012,7 @@ static inline int tcp_skb_mss(const struct sk_buff *skb)

static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb)
{
- return likely(!TCP_SKB_CB(skb)->eor);
+ return likely(!TCP_SKB_CB(skb)->eor && !skb_frags_not_readable(skb));
}

static inline bool tcp_skb_can_collapse(const struct sk_buff *to,
@@ -1020,7 +1020,8 @@ static inline bool tcp_skb_can_collapse(const struct sk_buff *to,
{
return likely(tcp_skb_can_collapse_to(to) &&
mptcp_skb_can_collapse(to, from) &&
- skb_pure_zcopy_same(to, from));
+ skb_pure_zcopy_same(to, from) &&
+ skb_frags_not_readable(to) == skb_frags_not_readable(from));
}

/* Events passed to congestion control interface */
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 176eb5834746..cdd4fb129968 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -425,6 +425,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
return 0;
}

+ if (skb_frags_not_readable(skb))
+ goto short_copy;
+
/* Copy paged appendix. Hmm... why does this look so complicated? */
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
int end;
@@ -616,6 +619,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
{
int frag;

+ if (skb_frags_not_readable(skb))
+ return -EFAULT;
+
if (msg && msg->msg_ubuf && msg->sg_from_iter)
return msg->sg_from_iter(sk, skb, from, length);

diff --git a/net/core/gro.c b/net/core/gro.c
index 42d7f6755f32..56046d65386a 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -390,6 +390,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
{
struct skb_shared_info *pinfo = skb_shinfo(skb);

+ if (WARN_ON_ONCE(skb_frags_not_readable(skb)))
+ return;
+
BUG_ON(skb->end - skb->tail < grow);

memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
@@ -411,7 +414,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
{
int grow = skb_gro_offset(skb) - skb_headlen(skb);

- if (grow > 0)
+ if (grow > 0 && !skb_frags_not_readable(skb))
gro_pull_from_frag0(skb, grow);
}

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 13eca4fd25e1..f01673ed2eff 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1230,6 +1230,14 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt)
struct page *p;
u8 *vaddr;

+ if (skb_frag_is_page_pool_iov(frag)) {
+ printk("%sskb frag %d: not readable\n", level, i);
+ len -= frag->bv_len;
+ if (!len)
+ break;
+ continue;
+ }
+
skb_frag_foreach_page(frag, skb_frag_off(frag),
skb_frag_size(frag), p, p_off, p_len,
copied) {
@@ -1807,6 +1815,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
return -EINVAL;

+ if (skb_frags_not_readable(skb))
+ return -EFAULT;
+
if (!num_frags)
goto release;

@@ -1977,8 +1988,12 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
{
int headerlen = skb_headroom(skb);
unsigned int size = skb_end_offset(skb) + skb->data_len;
- struct sk_buff *n = __alloc_skb(size, gfp_mask,
- skb_alloc_rx_flag(skb), NUMA_NO_NODE);
+ struct sk_buff *n;
+
+ if (skb_frags_not_readable(skb))
+ return NULL;
+
+ n = __alloc_skb(size, gfp_mask, skb_alloc_rx_flag(skb), NUMA_NO_NODE);

if (!n)
return NULL;
@@ -2304,14 +2319,16 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
int newheadroom, int newtailroom,
gfp_t gfp_mask)
{
- /*
- * Allocate the copy buffer
- */
- struct sk_buff *n = __alloc_skb(newheadroom + skb->len + newtailroom,
- gfp_mask, skb_alloc_rx_flag(skb),
- NUMA_NO_NODE);
int oldheadroom = skb_headroom(skb);
int head_copy_len, head_copy_off;
+ struct sk_buff *n;
+
+ if (skb_frags_not_readable(skb))
+ return NULL;
+
+ /* Allocate the copy buffer */
+ n = __alloc_skb(newheadroom + skb->len + newtailroom, gfp_mask,
+ skb_alloc_rx_flag(skb), NUMA_NO_NODE);

if (!n)
return NULL;
@@ -2650,6 +2667,9 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
*/
int i, k, eat = (skb->tail + delta) - skb->end;

+ if (skb_frags_not_readable(skb))
+ return NULL;
+
if (eat > 0 || skb_cloned(skb)) {
if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0,
GFP_ATOMIC))
@@ -2803,6 +2823,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
to += copy;
}

+ if (skb_frags_not_readable(skb))
+ goto fault;
+
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
int end;
skb_frag_t *f = &skb_shinfo(skb)->frags[i];
@@ -2991,6 +3014,9 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
/*
* then map the fragments
*/
+ if (skb_frags_not_readable(skb))
+ return false;
+
for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) {
const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];

@@ -3214,6 +3240,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len)
from += copy;
}

+ if (skb_frags_not_readable(skb))
+ goto fault;
+
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
int end;
@@ -3293,6 +3322,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len,
pos = copy;
}

+ if (skb_frags_not_readable(skb))
+ return 0;
+
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
int end;
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
@@ -3393,6 +3425,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,
pos = copy;
}

+ if (skb_frags_not_readable(skb))
+ return 0;
+
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
int end;

@@ -3883,7 +3918,9 @@ static inline void skb_split_inside_header(struct sk_buff *skb,
skb_shinfo(skb1)->frags[i] = skb_shinfo(skb)->frags[i];

skb_shinfo(skb1)->nr_frags = skb_shinfo(skb)->nr_frags;
+ skb1->devmem = skb->devmem;
skb_shinfo(skb)->nr_frags = 0;
+ skb->devmem = 0;
skb1->data_len = skb->data_len;
skb1->len += skb1->data_len;
skb->data_len = 0;
@@ -3897,6 +3934,7 @@ static inline void skb_split_no_header(struct sk_buff *skb,
{
int i, k = 0;
const int nfrags = skb_shinfo(skb)->nr_frags;
+ const int devmem = skb->devmem;

skb_shinfo(skb)->nr_frags = 0;
skb1->len = skb1->data_len = skb->len - len;
@@ -3930,6 +3968,16 @@ static inline void skb_split_no_header(struct sk_buff *skb,
pos += size;
}
skb_shinfo(skb1)->nr_frags = k;
+
+ if (skb_shinfo(skb)->nr_frags)
+ skb->devmem = devmem;
+ else
+ skb->devmem = 0;
+
+ if (skb_shinfo(skb1)->nr_frags)
+ skb1->devmem = devmem;
+ else
+ skb1->devmem = 0;
}

/**
@@ -4165,6 +4213,9 @@ unsigned int skb_seq_read(unsigned int consumed, const u8 **data,
return block_limit - abs_offset;
}

+ if (skb_frags_not_readable(st->cur_skb))
+ return 0;
+
if (st->frag_idx == 0 && !st->frag_data)
st->stepped_offset += skb_headlen(st->cur_skb);

@@ -5779,7 +5830,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
(from->pp_recycle && skb_cloned(from)))
return false;

- if (len <= skb_tailroom(to)) {
+ if (skb_frags_not_readable(from) != skb_frags_not_readable(to))
+ return false;
+
+ if (len <= skb_tailroom(to) && !skb_frags_not_readable(from)) {
if (len)
BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len));
*delta_truesize = 0;
@@ -5954,6 +6008,9 @@ int skb_ensure_writable(struct sk_buff *skb, unsigned int write_len)
if (!pskb_may_pull(skb, write_len))
return -ENOMEM;

+ if (skb_frags_not_readable(skb))
+ return -EFAULT;
+
if (!skb_cloned(skb) || skb_clone_writable(skb, write_len))
return 0;

@@ -6608,7 +6665,7 @@ void skb_condense(struct sk_buff *skb)
{
if (skb->data_len) {
if (skb->data_len > skb->end - skb->tail ||
- skb_cloned(skb))
+ skb_cloned(skb) || skb_frags_not_readable(skb))
return;

/* Nice, we can free page frag(s) right now */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 23b29dc49271..5c6fed52ed0e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2138,6 +2138,9 @@ static int tcp_zerocopy_receive(struct sock *sk,
skb = tcp_recv_skb(sk, seq, &offset);
}

+ if (skb_frags_not_readable(skb))
+ break;
+
if (TCP_SKB_CB(skb)->has_rxtstamp) {
tcp_update_recv_tstamps(skb, tss);
zc->msg_flags |= TCP_CMSG_TS;
@@ -4411,6 +4414,9 @@ int tcp_md5_hash_skb_data(struct tcp_md5sig_pool *hp,
if (crypto_ahash_update(req))
return 1;

+ if (skb_frags_not_readable(skb))
+ return 1;
+
for (i = 0; i < shi->nr_frags; ++i) {
const skb_frag_t *f = &shi->frags[i];
unsigned int offset = skb_frag_off(f);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 18b858597af4..64643dad5e1a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5264,6 +5264,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) {
n = tcp_skb_next(skb, list);

+ if (skb_frags_not_readable(skb))
+ goto skip_this;
+
/* No new bits? It is possible on ofo queue. */
if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
skb = tcp_collapse_one(sk, skb, list, root);
@@ -5284,17 +5287,20 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
break;
}

- if (n && n != tail && mptcp_skb_can_collapse(skb, n) &&
+ if (n && n != tail && !skb_frags_not_readable(n) &&
+ mptcp_skb_can_collapse(skb, n) &&
TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) {
end_of_skbs = false;
break;
}

+skip_this:
/* Decided to skip this, advance start seq. */
start = TCP_SKB_CB(skb)->end_seq;
}
if (end_of_skbs ||
- (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
+ (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) ||
+ skb_frags_not_readable(skb))
return;

__skb_queue_head_init(&tmp);
@@ -5338,7 +5344,8 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
if (!skb ||
skb == tail ||
!mptcp_skb_can_collapse(nskb, skb) ||
- (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
+ (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) ||
+ skb_frags_not_readable(skb))
goto end;
#ifdef CONFIG_TLS_DEVICE
if (skb->decrypted != nskb->decrypted)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2866ccbccde0..60df27f6c649 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2309,7 +2309,8 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len)

if (unlikely(TCP_SKB_CB(skb)->eor) ||
tcp_has_tx_tstamp(skb) ||
- !skb_pure_zcopy_same(skb, next))
+ !skb_pure_zcopy_same(skb, next) ||
+ skb_frags_not_readable(skb) != skb_frags_not_readable(next))
return false;

len -= skb->len;
@@ -3193,6 +3194,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
return false;
if (skb_cloned(skb))
return false;
+ if (skb_frags_not_readable(skb))
+ return false;
/* Some heuristics for collapsing over SACK'd could be invented */
if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
return false;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index a84e00b5904b..8f6cca683939 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2156,7 +2156,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
}
}

- snaplen = skb->len;
+ snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;

res = run_filter(skb, sk, snaplen);
if (!res)
@@ -2279,7 +2279,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
}
}

- snaplen = skb->len;
+ snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;

res = run_filter(skb, sk, snaplen);
if (!res)
--
2.42.0.869.gea05f2083d-goog


2023-11-06 18:48:19

by Stanislav Fomichev

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On 11/05, Mina Almasry wrote:
> For device memory TCP, we expect the skb headers to be available in host
> memory for access, and we expect the skb frags to be in device memory
> and unaccessible to the host. We expect there to be no mixing and
> matching of device memory frags (unaccessible) with host memory frags
> (accessible) in the same skb.
>
> Add a skb->devmem flag which indicates whether the frags in this skb
> are device memory frags or not.
>
> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> and marks the skb as skb->devmem accordingly.
>
> Add checks through the network stack to avoid accessing the frags of
> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
>
> Signed-off-by: Willem de Bruijn <[email protected]>
> Signed-off-by: Kaiyuan Zhang <[email protected]>
> Signed-off-by: Mina Almasry <[email protected]>
>
> ---
> include/linux/skbuff.h | 14 +++++++-
> include/net/tcp.h | 5 +--
> net/core/datagram.c | 6 ++++
> net/core/gro.c | 5 ++-
> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> net/ipv4/tcp.c | 6 ++++
> net/ipv4/tcp_input.c | 13 +++++--
> net/ipv4/tcp_output.c | 5 ++-
> net/packet/af_packet.c | 4 +--
> 9 files changed, 115 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 1fae276c1353..8fb468ff8115 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> * @csum_level: indicates the number of consecutive checksums found in
> * the packet minus one that have been verified as
> * CHECKSUM_UNNECESSARY (max 3)
> + * @devmem: indicates that all the fragments in this skb are backed by
> + * device memory.
> * @dst_pending_confirm: need to confirm neighbour
> * @decrypted: Decrypted SKB
> * @slow_gro: state present at GRO time, slower prepare step required
> @@ -991,7 +993,7 @@ struct sk_buff {
> #if IS_ENABLED(CONFIG_IP_SCTP)
> __u8 csum_not_inet:1;
> #endif
> -
> + __u8 devmem:1;
> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> __u16 tc_index; /* traffic control index */
> #endif
> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> __skb_zcopy_downgrade_managed(skb);
> }
>
> +/* Return true if frags in this skb are not readable by the host. */
> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> +{
> + return skb->devmem;

bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
It better communicates the fact that the stack shouldn't dereference the
frags (because it has 'devmem' fragments or for some other potential
future reason).

2023-11-06 19:35:08

by David Ahern

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> On 11/05, Mina Almasry wrote:
>> For device memory TCP, we expect the skb headers to be available in host
>> memory for access, and we expect the skb frags to be in device memory
>> and unaccessible to the host. We expect there to be no mixing and
>> matching of device memory frags (unaccessible) with host memory frags
>> (accessible) in the same skb.
>>
>> Add a skb->devmem flag which indicates whether the frags in this skb
>> are device memory frags or not.
>>
>> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
>> and marks the skb as skb->devmem accordingly.
>>
>> Add checks through the network stack to avoid accessing the frags of
>> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
>>
>> Signed-off-by: Willem de Bruijn <[email protected]>
>> Signed-off-by: Kaiyuan Zhang <[email protected]>
>> Signed-off-by: Mina Almasry <[email protected]>
>>
>> ---
>> include/linux/skbuff.h | 14 +++++++-
>> include/net/tcp.h | 5 +--
>> net/core/datagram.c | 6 ++++
>> net/core/gro.c | 5 ++-
>> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
>> net/ipv4/tcp.c | 6 ++++
>> net/ipv4/tcp_input.c | 13 +++++--
>> net/ipv4/tcp_output.c | 5 ++-
>> net/packet/af_packet.c | 4 +--
>> 9 files changed, 115 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 1fae276c1353..8fb468ff8115 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
>> * @csum_level: indicates the number of consecutive checksums found in
>> * the packet minus one that have been verified as
>> * CHECKSUM_UNNECESSARY (max 3)
>> + * @devmem: indicates that all the fragments in this skb are backed by
>> + * device memory.
>> * @dst_pending_confirm: need to confirm neighbour
>> * @decrypted: Decrypted SKB
>> * @slow_gro: state present at GRO time, slower prepare step required
>> @@ -991,7 +993,7 @@ struct sk_buff {
>> #if IS_ENABLED(CONFIG_IP_SCTP)
>> __u8 csum_not_inet:1;
>> #endif
>> -
>> + __u8 devmem:1;
>> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>> __u16 tc_index; /* traffic control index */
>> #endif
>> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
>> __skb_zcopy_downgrade_managed(skb);
>> }
>>
>> +/* Return true if frags in this skb are not readable by the host. */
>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
>> +{
>> + return skb->devmem;
>
> bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> It better communicates the fact that the stack shouldn't dereference the
> frags (because it has 'devmem' fragments or for some other potential
> future reason).

+1.

Also, the flag on the skb is an optimization - a high level signal that
one or more frags is in unreadable memory. There is no requirement that
all of the frags are in the same memory type.

2023-11-06 20:32:20

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On Mon, Nov 6, 2023 at 11:34 AM David Ahern <[email protected]> wrote:
>
> On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > On 11/05, Mina Almasry wrote:
> >> For device memory TCP, we expect the skb headers to be available in host
> >> memory for access, and we expect the skb frags to be in device memory
> >> and unaccessible to the host. We expect there to be no mixing and
> >> matching of device memory frags (unaccessible) with host memory frags
> >> (accessible) in the same skb.
> >>
> >> Add a skb->devmem flag which indicates whether the frags in this skb
> >> are device memory frags or not.
> >>
> >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> >> and marks the skb as skb->devmem accordingly.
> >>
> >> Add checks through the network stack to avoid accessing the frags of
> >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> >>
> >> Signed-off-by: Willem de Bruijn <[email protected]>
> >> Signed-off-by: Kaiyuan Zhang <[email protected]>
> >> Signed-off-by: Mina Almasry <[email protected]>
> >>
> >> ---
> >> include/linux/skbuff.h | 14 +++++++-
> >> include/net/tcp.h | 5 +--
> >> net/core/datagram.c | 6 ++++
> >> net/core/gro.c | 5 ++-
> >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> >> net/ipv4/tcp.c | 6 ++++
> >> net/ipv4/tcp_input.c | 13 +++++--
> >> net/ipv4/tcp_output.c | 5 ++-
> >> net/packet/af_packet.c | 4 +--
> >> 9 files changed, 115 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >> index 1fae276c1353..8fb468ff8115 100644
> >> --- a/include/linux/skbuff.h
> >> +++ b/include/linux/skbuff.h
> >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> >> * @csum_level: indicates the number of consecutive checksums found in
> >> * the packet minus one that have been verified as
> >> * CHECKSUM_UNNECESSARY (max 3)
> >> + * @devmem: indicates that all the fragments in this skb are backed by
> >> + * device memory.
> >> * @dst_pending_confirm: need to confirm neighbour
> >> * @decrypted: Decrypted SKB
> >> * @slow_gro: state present at GRO time, slower prepare step required
> >> @@ -991,7 +993,7 @@ struct sk_buff {
> >> #if IS_ENABLED(CONFIG_IP_SCTP)
> >> __u8 csum_not_inet:1;
> >> #endif
> >> -
> >> + __u8 devmem:1;
> >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> >> __u16 tc_index; /* traffic control index */
> >> #endif
> >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> >> __skb_zcopy_downgrade_managed(skb);
> >> }
> >>
> >> +/* Return true if frags in this skb are not readable by the host. */
> >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> >> +{
> >> + return skb->devmem;
> >
> > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > It better communicates the fact that the stack shouldn't dereference the
> > frags (because it has 'devmem' fragments or for some other potential
> > future reason).
>
> +1.
>
> Also, the flag on the skb is an optimization - a high level signal that
> one or more frags is in unreadable memory. There is no requirement that
> all of the frags are in the same memory type.

The flag indicates that the skb contains all devmem dma-buf memory
specifically, not generic 'not_readable' frags as the comment says:

+ * @devmem: indicates that all the fragments in this skb are backed by
+ * device memory.

The reason it's not a generic 'not_readable' flag is because handing
off a generic not_readable skb to the userspace is semantically not
what we're doing. recvmsg() is augmented in this patch series to
return a devmem skb to the user via a cmsg_devmem struct which refers
specifically to the memory in the dma-buf. recvmsg() in this patch
series is not augmented to give any 'not_readable' skb to the
userspace.

IMHO skb->devmem + an skb_frags_not_readable() as implemented is
correct. If a new type of unreadable skbs are introduced to the stack,
I imagine the stack would implement:

1. new header flag: skb->newmem
2.

static inline bool skb_frags_not_readable(const struct skb_buff *skb)
{
return skb->devmem || skb->newmem;
}

3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.

--
Thanks,
Mina

2023-11-06 20:57:00

by Stanislav Fomichev

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On 11/05, Mina Almasry wrote:
> For device memory TCP, we expect the skb headers to be available in host
> memory for access, and we expect the skb frags to be in device memory
> and unaccessible to the host. We expect there to be no mixing and
> matching of device memory frags (unaccessible) with host memory frags
> (accessible) in the same skb.
>
> Add a skb->devmem flag which indicates whether the frags in this skb
> are device memory frags or not.
>
> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> and marks the skb as skb->devmem accordingly.
>
> Add checks through the network stack to avoid accessing the frags of
> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
>
> Signed-off-by: Willem de Bruijn <[email protected]>
> Signed-off-by: Kaiyuan Zhang <[email protected]>
> Signed-off-by: Mina Almasry <[email protected]>

[..]

> - snaplen = skb->len;
> + snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;
>
> res = run_filter(skb, sk, snaplen);
> if (!res)
> @@ -2279,7 +2279,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> }
> }
>
> - snaplen = skb->len;
> + snaplen = skb_frags_not_readable(skb) ? skb_headlen(skb) : skb->len;
>
> res = run_filter(skb, sk, snaplen);
> if (!res)

Not sure it covers 100% of bpf. We might need to double-check bpf_xdp_copy_buf
which is having its own, non-skb shinfo and frags. And in general, xdp
can reference those shinfo frags early... (xdp part happens
before we create an skb with all devmem association)

2023-11-06 21:59:47

by Stanislav Fomichev

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On 11/06, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 11:34 AM David Ahern <[email protected]> wrote:
> >
> > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > On 11/05, Mina Almasry wrote:
> > >> For device memory TCP, we expect the skb headers to be available in host
> > >> memory for access, and we expect the skb frags to be in device memory
> > >> and unaccessible to the host. We expect there to be no mixing and
> > >> matching of device memory frags (unaccessible) with host memory frags
> > >> (accessible) in the same skb.
> > >>
> > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > >> are device memory frags or not.
> > >>
> > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > >> and marks the skb as skb->devmem accordingly.
> > >>
> > >> Add checks through the network stack to avoid accessing the frags of
> > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > >>
> > >> Signed-off-by: Willem de Bruijn <[email protected]>
> > >> Signed-off-by: Kaiyuan Zhang <[email protected]>
> > >> Signed-off-by: Mina Almasry <[email protected]>
> > >>
> > >> ---
> > >> include/linux/skbuff.h | 14 +++++++-
> > >> include/net/tcp.h | 5 +--
> > >> net/core/datagram.c | 6 ++++
> > >> net/core/gro.c | 5 ++-
> > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > >> net/ipv4/tcp.c | 6 ++++
> > >> net/ipv4/tcp_input.c | 13 +++++--
> > >> net/ipv4/tcp_output.c | 5 ++-
> > >> net/packet/af_packet.c | 4 +--
> > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > >>
> > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > >> index 1fae276c1353..8fb468ff8115 100644
> > >> --- a/include/linux/skbuff.h
> > >> +++ b/include/linux/skbuff.h
> > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > >> * @csum_level: indicates the number of consecutive checksums found in
> > >> * the packet minus one that have been verified as
> > >> * CHECKSUM_UNNECESSARY (max 3)
> > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > >> + * device memory.
> > >> * @dst_pending_confirm: need to confirm neighbour
> > >> * @decrypted: Decrypted SKB
> > >> * @slow_gro: state present at GRO time, slower prepare step required
> > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > >> __u8 csum_not_inet:1;
> > >> #endif
> > >> -
> > >> + __u8 devmem:1;
> > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > >> __u16 tc_index; /* traffic control index */
> > >> #endif
> > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > >> __skb_zcopy_downgrade_managed(skb);
> > >> }
> > >>
> > >> +/* Return true if frags in this skb are not readable by the host. */
> > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > >> +{
> > >> + return skb->devmem;
> > >
> > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > It better communicates the fact that the stack shouldn't dereference the
> > > frags (because it has 'devmem' fragments or for some other potential
> > > future reason).
> >
> > +1.
> >
> > Also, the flag on the skb is an optimization - a high level signal that
> > one or more frags is in unreadable memory. There is no requirement that
> > all of the frags are in the same memory type.

David: maybe there should be such a requirement (that they all are
unreadable)? Might be easier to support initially; we can relax later
on.

> The flag indicates that the skb contains all devmem dma-buf memory
> specifically, not generic 'not_readable' frags as the comment says:
>
> + * @devmem: indicates that all the fragments in this skb are backed by
> + * device memory.
>
> The reason it's not a generic 'not_readable' flag is because handing
> off a generic not_readable skb to the userspace is semantically not
> what we're doing. recvmsg() is augmented in this patch series to
> return a devmem skb to the user via a cmsg_devmem struct which refers
> specifically to the memory in the dma-buf. recvmsg() in this patch
> series is not augmented to give any 'not_readable' skb to the
> userspace.
>
> IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> correct. If a new type of unreadable skbs are introduced to the stack,
> I imagine the stack would implement:
>
> 1. new header flag: skb->newmem
> 2.
>
> static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> {
> return skb->devmem || skb->newmem;
> }
>
> 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.

You copy it to the userspace in a special way because your frags
are page_is_page_pool_iov(). I agree with David, the skb bit is
just and optimization.

For most of the core stack, it doesn't matter why your skb is not
readable. For a few places where it matters (recvmsg?), you can
double-check your frags (all or some) with page_is_page_pool_iov.

Unrelated: we probably need socket to dmabuf association as well (via
netlink or something).
We are fundamentally receiving into and sending from a dmabuf (devmem ==
dmabuf).
And once you have this association, recvmsg shouldn't need any new
special flags.

2023-11-06 22:19:15

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <[email protected]> wrote:
>
> On 11/06, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <[email protected]> wrote:
> > >
> > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > On 11/05, Mina Almasry wrote:
> > > >> For device memory TCP, we expect the skb headers to be available in host
> > > >> memory for access, and we expect the skb frags to be in device memory
> > > >> and unaccessible to the host. We expect there to be no mixing and
> > > >> matching of device memory frags (unaccessible) with host memory frags
> > > >> (accessible) in the same skb.
> > > >>
> > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > >> are device memory frags or not.
> > > >>
> > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > >> and marks the skb as skb->devmem accordingly.
> > > >>
> > > >> Add checks through the network stack to avoid accessing the frags of
> > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > >>
> > > >> Signed-off-by: Willem de Bruijn <[email protected]>
> > > >> Signed-off-by: Kaiyuan Zhang <[email protected]>
> > > >> Signed-off-by: Mina Almasry <[email protected]>
> > > >>
> > > >> ---
> > > >> include/linux/skbuff.h | 14 +++++++-
> > > >> include/net/tcp.h | 5 +--
> > > >> net/core/datagram.c | 6 ++++
> > > >> net/core/gro.c | 5 ++-
> > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > >> net/ipv4/tcp.c | 6 ++++
> > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > >> net/ipv4/tcp_output.c | 5 ++-
> > > >> net/packet/af_packet.c | 4 +--
> > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > >>
> > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > >> index 1fae276c1353..8fb468ff8115 100644
> > > >> --- a/include/linux/skbuff.h
> > > >> +++ b/include/linux/skbuff.h
> > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > >> * the packet minus one that have been verified as
> > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > >> + * device memory.
> > > >> * @dst_pending_confirm: need to confirm neighbour
> > > >> * @decrypted: Decrypted SKB
> > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > >> __u8 csum_not_inet:1;
> > > >> #endif
> > > >> -
> > > >> + __u8 devmem:1;
> > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > >> __u16 tc_index; /* traffic control index */
> > > >> #endif
> > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > >> __skb_zcopy_downgrade_managed(skb);
> > > >> }
> > > >>
> > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > >> +{
> > > >> + return skb->devmem;
> > > >
> > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > It better communicates the fact that the stack shouldn't dereference the
> > > > frags (because it has 'devmem' fragments or for some other potential
> > > > future reason).
> > >
> > > +1.
> > >
> > > Also, the flag on the skb is an optimization - a high level signal that
> > > one or more frags is in unreadable memory. There is no requirement that
> > > all of the frags are in the same memory type.
>
> David: maybe there should be such a requirement (that they all are
> unreadable)? Might be easier to support initially; we can relax later
> on.
>

Currently devmem == not_readable, and the restriction is that all the
frags in the same skb must be either all readable or all unreadable
(all devmem or all non-devmem).

> > The flag indicates that the skb contains all devmem dma-buf memory
> > specifically, not generic 'not_readable' frags as the comment says:
> >
> > + * @devmem: indicates that all the fragments in this skb are backed by
> > + * device memory.
> >
> > The reason it's not a generic 'not_readable' flag is because handing
> > off a generic not_readable skb to the userspace is semantically not
> > what we're doing. recvmsg() is augmented in this patch series to
> > return a devmem skb to the user via a cmsg_devmem struct which refers
> > specifically to the memory in the dma-buf. recvmsg() in this patch
> > series is not augmented to give any 'not_readable' skb to the
> > userspace.
> >
> > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > correct. If a new type of unreadable skbs are introduced to the stack,
> > I imagine the stack would implement:
> >
> > 1. new header flag: skb->newmem
> > 2.
> >
> > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > {
> > return skb->devmem || skb->newmem;
> > }
> >
> > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
>
> You copy it to the userspace in a special way because your frags
> are page_is_page_pool_iov(). I agree with David, the skb bit is
> just and optimization.
>
> For most of the core stack, it doesn't matter why your skb is not
> readable. For a few places where it matters (recvmsg?), you can
> double-check your frags (all or some) with page_is_page_pool_iov.
>

I see, we can do that then. I.e. make the header flag 'not_readable'
and check the frags to decide to delegate to tcp_recvmsg_devmem() or
something else. We can even assume not_readable == devmem because
currently devmem is the only type of unreadable frag currently.

> Unrelated: we probably need socket to dmabuf association as well (via
> netlink or something).

Not sure this is possible. The dma-buf is bound to the rx-queue, and
any packets that land on that rx-queue are bound to that dma-buf,
regardless of which socket that packet belongs to. So the association
IMO must be rx-queue to dma-buf, not socket to dma-buf.

> We are fundamentally receiving into and sending from a dmabuf (devmem ==
> dmabuf).
> And once you have this association, recvmsg shouldn't need any new
> special flags.


--
Thanks,
Mina

2023-11-06 23:00:08

by Stanislav Fomichev

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On 11/06, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <[email protected]> wrote:
> >
> > On 11/06, Mina Almasry wrote:
> > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <[email protected]> wrote:
> > > >
> > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > On 11/05, Mina Almasry wrote:
> > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > >> (accessible) in the same skb.
> > > > >>
> > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > >> are device memory frags or not.
> > > > >>
> > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > >> and marks the skb as skb->devmem accordingly.
> > > > >>
> > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > >>
> > > > >> Signed-off-by: Willem de Bruijn <[email protected]>
> > > > >> Signed-off-by: Kaiyuan Zhang <[email protected]>
> > > > >> Signed-off-by: Mina Almasry <[email protected]>
> > > > >>
> > > > >> ---
> > > > >> include/linux/skbuff.h | 14 +++++++-
> > > > >> include/net/tcp.h | 5 +--
> > > > >> net/core/datagram.c | 6 ++++
> > > > >> net/core/gro.c | 5 ++-
> > > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > > >> net/ipv4/tcp.c | 6 ++++
> > > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > > >> net/ipv4/tcp_output.c | 5 ++-
> > > > >> net/packet/af_packet.c | 4 +--
> > > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > > >>
> > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > >> --- a/include/linux/skbuff.h
> > > > >> +++ b/include/linux/skbuff.h
> > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > > >> * the packet minus one that have been verified as
> > > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > >> + * device memory.
> > > > >> * @dst_pending_confirm: need to confirm neighbour
> > > > >> * @decrypted: Decrypted SKB
> > > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > >> __u8 csum_not_inet:1;
> > > > >> #endif
> > > > >> -
> > > > >> + __u8 devmem:1;
> > > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > >> __u16 tc_index; /* traffic control index */
> > > > >> #endif
> > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > >> __skb_zcopy_downgrade_managed(skb);
> > > > >> }
> > > > >>
> > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > >> +{
> > > > >> + return skb->devmem;
> > > > >
> > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > future reason).
> > > >
> > > > +1.
> > > >
> > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > one or more frags is in unreadable memory. There is no requirement that
> > > > all of the frags are in the same memory type.
> >
> > David: maybe there should be such a requirement (that they all are
> > unreadable)? Might be easier to support initially; we can relax later
> > on.
> >
>
> Currently devmem == not_readable, and the restriction is that all the
> frags in the same skb must be either all readable or all unreadable
> (all devmem or all non-devmem).
>
> > > The flag indicates that the skb contains all devmem dma-buf memory
> > > specifically, not generic 'not_readable' frags as the comment says:
> > >
> > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > + * device memory.
> > >
> > > The reason it's not a generic 'not_readable' flag is because handing
> > > off a generic not_readable skb to the userspace is semantically not
> > > what we're doing. recvmsg() is augmented in this patch series to
> > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > series is not augmented to give any 'not_readable' skb to the
> > > userspace.
> > >
> > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > I imagine the stack would implement:
> > >
> > > 1. new header flag: skb->newmem
> > > 2.
> > >
> > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > {
> > > return skb->devmem || skb->newmem;
> > > }
> > >
> > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> >
> > You copy it to the userspace in a special way because your frags
> > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > just and optimization.
> >
> > For most of the core stack, it doesn't matter why your skb is not
> > readable. For a few places where it matters (recvmsg?), you can
> > double-check your frags (all or some) with page_is_page_pool_iov.
> >
>
> I see, we can do that then. I.e. make the header flag 'not_readable'
> and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> something else. We can even assume not_readable == devmem because
> currently devmem is the only type of unreadable frag currently.
>
> > Unrelated: we probably need socket to dmabuf association as well (via
> > netlink or something).
>
> Not sure this is possible. The dma-buf is bound to the rx-queue, and
> any packets that land on that rx-queue are bound to that dma-buf,
> regardless of which socket that packet belongs to. So the association
> IMO must be rx-queue to dma-buf, not socket to dma-buf.

But there is still always 1 dmabuf to 1 socket association (on rx), right?
Because otherwise, there is no way currently to tell, at recvmsg, which
dmabuf the received token belongs to.

So why not have a separate control channel action to say: this socket fd
is supposed to receive into this dmabuf fd? This action would put
the socket into permanent 'MSG_SOCK_DEVMEM' mode. Maybe you can also
put some checks at the lower level to to enforce this dmabuf
association. (to avoid any potential issues with flow steering)

We'll still have dmabuf to rx-queue association because of various reasons..

2023-11-06 23:29:51

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <[email protected]> wrote:
>
> On 11/06, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <[email protected]> wrote:
> > >
> > > On 11/06, Mina Almasry wrote:
> > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <[email protected]> wrote:
> > > > >
> > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > On 11/05, Mina Almasry wrote:
> > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > >> (accessible) in the same skb.
> > > > > >>
> > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > >> are device memory frags or not.
> > > > > >>
> > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > >>
> > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > >>
> > > > > >> Signed-off-by: Willem de Bruijn <[email protected]>
> > > > > >> Signed-off-by: Kaiyuan Zhang <[email protected]>
> > > > > >> Signed-off-by: Mina Almasry <[email protected]>
> > > > > >>
> > > > > >> ---
> > > > > >> include/linux/skbuff.h | 14 +++++++-
> > > > > >> include/net/tcp.h | 5 +--
> > > > > >> net/core/datagram.c | 6 ++++
> > > > > >> net/core/gro.c | 5 ++-
> > > > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > >> net/ipv4/tcp.c | 6 ++++
> > > > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > > > >> net/ipv4/tcp_output.c | 5 ++-
> > > > > >> net/packet/af_packet.c | 4 +--
> > > > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > >>
> > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > >> --- a/include/linux/skbuff.h
> > > > > >> +++ b/include/linux/skbuff.h
> > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > > > >> * the packet minus one that have been verified as
> > > > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > >> + * device memory.
> > > > > >> * @dst_pending_confirm: need to confirm neighbour
> > > > > >> * @decrypted: Decrypted SKB
> > > > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > >> __u8 csum_not_inet:1;
> > > > > >> #endif
> > > > > >> -
> > > > > >> + __u8 devmem:1;
> > > > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > >> __u16 tc_index; /* traffic control index */
> > > > > >> #endif
> > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > >> __skb_zcopy_downgrade_managed(skb);
> > > > > >> }
> > > > > >>
> > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > >> +{
> > > > > >> + return skb->devmem;
> > > > > >
> > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > future reason).
> > > > >
> > > > > +1.
> > > > >
> > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > all of the frags are in the same memory type.
> > >
> > > David: maybe there should be such a requirement (that they all are
> > > unreadable)? Might be easier to support initially; we can relax later
> > > on.
> > >
> >
> > Currently devmem == not_readable, and the restriction is that all the
> > frags in the same skb must be either all readable or all unreadable
> > (all devmem or all non-devmem).
> >
> > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > specifically, not generic 'not_readable' frags as the comment says:
> > > >
> > > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > > + * device memory.
> > > >
> > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > off a generic not_readable skb to the userspace is semantically not
> > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > series is not augmented to give any 'not_readable' skb to the
> > > > userspace.
> > > >
> > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > I imagine the stack would implement:
> > > >
> > > > 1. new header flag: skb->newmem
> > > > 2.
> > > >
> > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > {
> > > > return skb->devmem || skb->newmem;
> > > > }
> > > >
> > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > >
> > > You copy it to the userspace in a special way because your frags
> > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > just and optimization.
> > >
> > > For most of the core stack, it doesn't matter why your skb is not
> > > readable. For a few places where it matters (recvmsg?), you can
> > > double-check your frags (all or some) with page_is_page_pool_iov.
> > >
> >
> > I see, we can do that then. I.e. make the header flag 'not_readable'
> > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > something else. We can even assume not_readable == devmem because
> > currently devmem is the only type of unreadable frag currently.
> >
> > > Unrelated: we probably need socket to dmabuf association as well (via
> > > netlink or something).
> >
> > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > any packets that land on that rx-queue are bound to that dma-buf,
> > regardless of which socket that packet belongs to. So the association
> > IMO must be rx-queue to dma-buf, not socket to dma-buf.
>
> But there is still always 1 dmabuf to 1 socket association (on rx), right?
> Because otherwise, there is no way currently to tell, at recvmsg, which
> dmabuf the received token belongs to.
>

Yes, but this 1 dma-buf to 1 socket association happens because the
user binds the dma-buf to an rx-queue and configures flow steering of
the socket to that rx-queue.

> So why not have a separate control channel action to say: this socket fd
> is supposed to receive into this dmabuf fd?
> This action would put
> the socket into permanent 'MSG_SOCK_DEVMEM' mode. Maybe you can also
> put some checks at the lower level to to enforce this dmabuf
> association. (to avoid any potential issues with flow steering)
>

setsockopt(SO_DEVMEM_ASSERT_DMA_BUF, dmabuf_fd)? Sounds interesting,
but maybe a bit of a weird API to me. Because the API can't enforce
the socket to receive packets on a dma-buf (rx-queue binding + flow
steering does that), but the API can assert that incoming packets are
received on said dma-buf. I guess it would check packets before they
are acked and would drop packets that landed on the wrong queue.

I'm a bit unsure about defensively programming features (and uapi no
less) to 'avoid any potential issues with flow steering'. Flow
steering is supposed to work.

Also if we wanted to defensively program something to avoid flow
steering issues, then I'd suggest adding to cmsg_devmem the dma-buf fd
that the data is on, not this setsockopt() that asserts. IMO it's a
weird API for the userspace to ask the kernel to assert some condition
(at least I haven't seen it before or commonly).

But again, in general, I'm a bit unsure about defensively designing
uapi around a feature like flow steering that's supposed to work.

> We'll still have dmabuf to rx-queue association because of various reasons..

--
Thanks,
Mina

2023-11-06 23:37:28

by David Ahern

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On 11/6/23 3:18 PM, Mina Almasry wrote:
>>>>>> @@ -991,7 +993,7 @@ struct sk_buff {
>>>>>> #if IS_ENABLED(CONFIG_IP_SCTP)
>>>>>> __u8 csum_not_inet:1;
>>>>>> #endif
>>>>>> -
>>>>>> + __u8 devmem:1;
>>>>>> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>>>>>> __u16 tc_index; /* traffic control index */
>>>>>> #endif
>>>>>> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
>>>>>> __skb_zcopy_downgrade_managed(skb);
>>>>>> }
>>>>>>
>>>>>> +/* Return true if frags in this skb are not readable by the host. */
>>>>>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
>>>>>> +{
>>>>>> + return skb->devmem;
>>>>>
>>>>> bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
>>>>> It better communicates the fact that the stack shouldn't dereference the
>>>>> frags (because it has 'devmem' fragments or for some other potential
>>>>> future reason).
>>>>
>>>> +1.
>>>>
>>>> Also, the flag on the skb is an optimization - a high level signal that
>>>> one or more frags is in unreadable memory. There is no requirement that
>>>> all of the frags are in the same memory type.
>>
>> David: maybe there should be such a requirement (that they all are
>> unreadable)? Might be easier to support initially; we can relax later
>> on.
>>
>
> Currently devmem == not_readable, and the restriction is that all the
> frags in the same skb must be either all readable or all unreadable
> (all devmem or all non-devmem).

What requires that restriction? In all of the uses of skb->devmem and
skb_frags_not_readable() what matters is if any frag is not readable,
then frag list walk or collapse is avoided.

2023-11-06 23:56:38

by Stanislav Fomichev

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <[email protected]> wrote:
>
> On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <[email protected]> wrote:
> >
> > On 11/06, Mina Almasry wrote:
> > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <[email protected]> wrote:
> > > >
> > > > On 11/06, Mina Almasry wrote:
> > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <[email protected]> wrote:
> > > > > >
> > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > >> (accessible) in the same skb.
> > > > > > >>
> > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > >> are device memory frags or not.
> > > > > > >>
> > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > >>
> > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > >>
> > > > > > >> Signed-off-by: Willem de Bruijn <[email protected]>
> > > > > > >> Signed-off-by: Kaiyuan Zhang <[email protected]>
> > > > > > >> Signed-off-by: Mina Almasry <[email protected]>
> > > > > > >>
> > > > > > >> ---
> > > > > > >> include/linux/skbuff.h | 14 +++++++-
> > > > > > >> include/net/tcp.h | 5 +--
> > > > > > >> net/core/datagram.c | 6 ++++
> > > > > > >> net/core/gro.c | 5 ++-
> > > > > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > >> net/ipv4/tcp.c | 6 ++++
> > > > > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > > > > >> net/ipv4/tcp_output.c | 5 ++-
> > > > > > >> net/packet/af_packet.c | 4 +--
> > > > > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > >>
> > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > >> --- a/include/linux/skbuff.h
> > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > > > > >> * the packet minus one that have been verified as
> > > > > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > >> + * device memory.
> > > > > > >> * @dst_pending_confirm: need to confirm neighbour
> > > > > > >> * @decrypted: Decrypted SKB
> > > > > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > >> __u8 csum_not_inet:1;
> > > > > > >> #endif
> > > > > > >> -
> > > > > > >> + __u8 devmem:1;
> > > > > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > >> __u16 tc_index; /* traffic control index */
> > > > > > >> #endif
> > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > >> __skb_zcopy_downgrade_managed(skb);
> > > > > > >> }
> > > > > > >>
> > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > >> +{
> > > > > > >> + return skb->devmem;
> > > > > > >
> > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > future reason).
> > > > > >
> > > > > > +1.
> > > > > >
> > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > all of the frags are in the same memory type.
> > > >
> > > > David: maybe there should be such a requirement (that they all are
> > > > unreadable)? Might be easier to support initially; we can relax later
> > > > on.
> > > >
> > >
> > > Currently devmem == not_readable, and the restriction is that all the
> > > frags in the same skb must be either all readable or all unreadable
> > > (all devmem or all non-devmem).
> > >
> > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > >
> > > > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > + * device memory.
> > > > >
> > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > userspace.
> > > > >
> > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > I imagine the stack would implement:
> > > > >
> > > > > 1. new header flag: skb->newmem
> > > > > 2.
> > > > >
> > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > {
> > > > > return skb->devmem || skb->newmem;
> > > > > }
> > > > >
> > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > >
> > > > You copy it to the userspace in a special way because your frags
> > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > just and optimization.
> > > >
> > > > For most of the core stack, it doesn't matter why your skb is not
> > > > readable. For a few places where it matters (recvmsg?), you can
> > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > >
> > >
> > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > something else. We can even assume not_readable == devmem because
> > > currently devmem is the only type of unreadable frag currently.
> > >
> > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > netlink or something).
> > >
> > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > any packets that land on that rx-queue are bound to that dma-buf,
> > > regardless of which socket that packet belongs to. So the association
> > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> >
> > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > Because otherwise, there is no way currently to tell, at recvmsg, which
> > dmabuf the received token belongs to.
> >
>
> Yes, but this 1 dma-buf to 1 socket association happens because the
> user binds the dma-buf to an rx-queue and configures flow steering of
> the socket to that rx-queue.

It's still fixed and won't change during the socket lifetime, right?
And the socket has to know this association; otherwise those tokens
are useless since they don't carry anything to identify the dmabuf.

I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
it somehow implies that I have an option of passing or not passing it
for an individual system call.
If we know that we're going to use dmabuf with the socket, maybe we
should move this flag to the socket() syscall?

fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);

?

> > So why not have a separate control channel action to say: this socket fd
> > is supposed to receive into this dmabuf fd?
> > This action would put
> > the socket into permanent 'MSG_SOCK_DEVMEM' mode. Maybe you can also
> > put some checks at the lower level to to enforce this dmabuf
> > association. (to avoid any potential issues with flow steering)
> >
>
> setsockopt(SO_DEVMEM_ASSERT_DMA_BUF, dmabuf_fd)? Sounds interesting,
> but maybe a bit of a weird API to me. Because the API can't enforce
> the socket to receive packets on a dma-buf (rx-queue binding + flow
> steering does that), but the API can assert that incoming packets are
> received on said dma-buf. I guess it would check packets before they
> are acked and would drop packets that landed on the wrong queue.
>
> I'm a bit unsure about defensively programming features (and uapi no
> less) to 'avoid any potential issues with flow steering'. Flow
> steering is supposed to work.
>
> Also if we wanted to defensively program something to avoid flow
> steering issues, then I'd suggest adding to cmsg_devmem the dma-buf fd
> that the data is on, not this setsockopt() that asserts. IMO it's a
> weird API for the userspace to ask the kernel to assert some condition
> (at least I haven't seen it before or commonly).
>
> But again, in general, I'm a bit unsure about defensively designing
> uapi around a feature like flow steering that's supposed to work.

2023-11-07 00:04:11

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On Mon, Nov 6, 2023 at 3:37 PM David Ahern <[email protected]> wrote:
>
> On 11/6/23 3:18 PM, Mina Almasry wrote:
> >>>>>> @@ -991,7 +993,7 @@ struct sk_buff {
> >>>>>> #if IS_ENABLED(CONFIG_IP_SCTP)
> >>>>>> __u8 csum_not_inet:1;
> >>>>>> #endif
> >>>>>> -
> >>>>>> + __u8 devmem:1;
> >>>>>> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> >>>>>> __u16 tc_index; /* traffic control index */
> >>>>>> #endif
> >>>>>> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> >>>>>> __skb_zcopy_downgrade_managed(skb);
> >>>>>> }
> >>>>>>
> >>>>>> +/* Return true if frags in this skb are not readable by the host. */
> >>>>>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> >>>>>> +{
> >>>>>> + return skb->devmem;
> >>>>>
> >>>>> bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> >>>>> It better communicates the fact that the stack shouldn't dereference the
> >>>>> frags (because it has 'devmem' fragments or for some other potential
> >>>>> future reason).
> >>>>
> >>>> +1.
> >>>>
> >>>> Also, the flag on the skb is an optimization - a high level signal that
> >>>> one or more frags is in unreadable memory. There is no requirement that
> >>>> all of the frags are in the same memory type.
> >>
> >> David: maybe there should be such a requirement (that they all are
> >> unreadable)? Might be easier to support initially; we can relax later
> >> on.
> >>
> >
> > Currently devmem == not_readable, and the restriction is that all the
> > frags in the same skb must be either all readable or all unreadable
> > (all devmem or all non-devmem).
>
> What requires that restriction? In all of the uses of skb->devmem and
> skb_frags_not_readable() what matters is if any frag is not readable,
> then frag list walk or collapse is avoided.
>
>

Currently only tcp_recvmsg_devmem(), I think. tcp_recvmsg_locked()
delegates to tcp_recvmsg_devmem() if skb->devmem, and
tcp_recvmsg_devmem() net_err's if it finds a non-iov frag in the skb.
This is done for some simplicity, because iov's are given to the user
via cmsg, but pages are copied into the linear buffer. I think it
would be confusing for the user if we simultaneously copied some data
to the linear buffer and gave them a devmem cmsgs in the same
recvmsg() call.

So, my simplicity is:

1. in a single skb, all frags must be devmem or non-devmem, no mixing.
2. In a single recvmsg() call, we only process devmem or non-devmem
skbs, no mixing.

--
Thanks,
Mina

2023-11-07 00:08:13

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <[email protected]> wrote:
>
> On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <[email protected]> wrote:
> >
> > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <[email protected]> wrote:
> > >
> > > On 11/06, Mina Almasry wrote:
> > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <[email protected]> wrote:
> > > > >
> > > > > On 11/06, Mina Almasry wrote:
> > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <[email protected]> wrote:
> > > > > > >
> > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > >> (accessible) in the same skb.
> > > > > > > >>
> > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > >> are device memory frags or not.
> > > > > > > >>
> > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > >>
> > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > >>
> > > > > > > >> Signed-off-by: Willem de Bruijn <[email protected]>
> > > > > > > >> Signed-off-by: Kaiyuan Zhang <[email protected]>
> > > > > > > >> Signed-off-by: Mina Almasry <[email protected]>
> > > > > > > >>
> > > > > > > >> ---
> > > > > > > >> include/linux/skbuff.h | 14 +++++++-
> > > > > > > >> include/net/tcp.h | 5 +--
> > > > > > > >> net/core/datagram.c | 6 ++++
> > > > > > > >> net/core/gro.c | 5 ++-
> > > > > > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > >> net/ipv4/tcp.c | 6 ++++
> > > > > > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > > > > > >> net/ipv4/tcp_output.c | 5 ++-
> > > > > > > >> net/packet/af_packet.c | 4 +--
> > > > > > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > >>
> > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > > > > > >> * the packet minus one that have been verified as
> > > > > > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > > > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > >> + * device memory.
> > > > > > > >> * @dst_pending_confirm: need to confirm neighbour
> > > > > > > >> * @decrypted: Decrypted SKB
> > > > > > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > >> __u8 csum_not_inet:1;
> > > > > > > >> #endif
> > > > > > > >> -
> > > > > > > >> + __u8 devmem:1;
> > > > > > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > >> __u16 tc_index; /* traffic control index */
> > > > > > > >> #endif
> > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > >> __skb_zcopy_downgrade_managed(skb);
> > > > > > > >> }
> > > > > > > >>
> > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > >> +{
> > > > > > > >> + return skb->devmem;
> > > > > > > >
> > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > future reason).
> > > > > > >
> > > > > > > +1.
> > > > > > >
> > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > all of the frags are in the same memory type.
> > > > >
> > > > > David: maybe there should be such a requirement (that they all are
> > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > on.
> > > > >
> > > >
> > > > Currently devmem == not_readable, and the restriction is that all the
> > > > frags in the same skb must be either all readable or all unreadable
> > > > (all devmem or all non-devmem).
> > > >
> > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > >
> > > > > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > + * device memory.
> > > > > >
> > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > userspace.
> > > > > >
> > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > > I imagine the stack would implement:
> > > > > >
> > > > > > 1. new header flag: skb->newmem
> > > > > > 2.
> > > > > >
> > > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > > {
> > > > > > return skb->devmem || skb->newmem;
> > > > > > }
> > > > > >
> > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > >
> > > > > You copy it to the userspace in a special way because your frags
> > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > just and optimization.
> > > > >
> > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > >
> > > >
> > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > something else. We can even assume not_readable == devmem because
> > > > currently devmem is the only type of unreadable frag currently.
> > > >
> > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > netlink or something).
> > > >
> > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > regardless of which socket that packet belongs to. So the association
> > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > >
> > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > dmabuf the received token belongs to.
> > >
> >
> > Yes, but this 1 dma-buf to 1 socket association happens because the
> > user binds the dma-buf to an rx-queue and configures flow steering of
> > the socket to that rx-queue.
>
> It's still fixed and won't change during the socket lifetime, right?
> And the socket has to know this association; otherwise those tokens
> are useless since they don't carry anything to identify the dmabuf.
>
> I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> it somehow implies that I have an option of passing or not passing it
> for an individual system call.
> If we know that we're going to use dmabuf with the socket, maybe we
> should move this flag to the socket() syscall?
>
> fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
>
> ?

I think it should then be a setsockopt called before any data is
exchanged, with no change of modifying mode later. We generally use
setsockopts for the mode of a socket. This use of the protocol field
in socket() for setting a mode would be novel. Also, it might miss
passively opened connections, or be overly restrictive: one approach
for all accepted child sockets.

2023-11-07 00:16:32

by Stanislav Fomichev

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On 11/06, Willem de Bruijn wrote:
> On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <[email protected]> wrote:
> >
> > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <[email protected]> wrote:
> > >
> > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <[email protected]> wrote:
> > > >
> > > > On 11/06, Mina Almasry wrote:
> > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <[email protected]> wrote:
> > > > > >
> > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > > >> (accessible) in the same skb.
> > > > > > > > >>
> > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > > >> are device memory frags or not.
> > > > > > > > >>
> > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > >>
> > > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > > >>
> > > > > > > > >> Signed-off-by: Willem de Bruijn <[email protected]>
> > > > > > > > >> Signed-off-by: Kaiyuan Zhang <[email protected]>
> > > > > > > > >> Signed-off-by: Mina Almasry <[email protected]>
> > > > > > > > >>
> > > > > > > > >> ---
> > > > > > > > >> include/linux/skbuff.h | 14 +++++++-
> > > > > > > > >> include/net/tcp.h | 5 +--
> > > > > > > > >> net/core/datagram.c | 6 ++++
> > > > > > > > >> net/core/gro.c | 5 ++-
> > > > > > > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > > >> net/ipv4/tcp.c | 6 ++++
> > > > > > > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > > > > > > >> net/ipv4/tcp_output.c | 5 ++-
> > > > > > > > >> net/packet/af_packet.c | 4 +--
> > > > > > > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > >>
> > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > > > > > > >> * the packet minus one that have been verified as
> > > > > > > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > >> + * device memory.
> > > > > > > > >> * @dst_pending_confirm: need to confirm neighbour
> > > > > > > > >> * @decrypted: Decrypted SKB
> > > > > > > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > >> __u8 csum_not_inet:1;
> > > > > > > > >> #endif
> > > > > > > > >> -
> > > > > > > > >> + __u8 devmem:1;
> > > > > > > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > >> __u16 tc_index; /* traffic control index */
> > > > > > > > >> #endif
> > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > >> __skb_zcopy_downgrade_managed(skb);
> > > > > > > > >> }
> > > > > > > > >>
> > > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > > >> +{
> > > > > > > > >> + return skb->devmem;
> > > > > > > > >
> > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > > future reason).
> > > > > > > >
> > > > > > > > +1.
> > > > > > > >
> > > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > > all of the frags are in the same memory type.
> > > > > >
> > > > > > David: maybe there should be such a requirement (that they all are
> > > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > > on.
> > > > > >
> > > > >
> > > > > Currently devmem == not_readable, and the restriction is that all the
> > > > > frags in the same skb must be either all readable or all unreadable
> > > > > (all devmem or all non-devmem).
> > > > >
> > > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > > >
> > > > > > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > + * device memory.
> > > > > > >
> > > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > > userspace.
> > > > > > >
> > > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > > > I imagine the stack would implement:
> > > > > > >
> > > > > > > 1. new header flag: skb->newmem
> > > > > > > 2.
> > > > > > >
> > > > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > > > {
> > > > > > > return skb->devmem || skb->newmem;
> > > > > > > }
> > > > > > >
> > > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > > >
> > > > > > You copy it to the userspace in a special way because your frags
> > > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > > just and optimization.
> > > > > >
> > > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > > >
> > > > >
> > > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > > something else. We can even assume not_readable == devmem because
> > > > > currently devmem is the only type of unreadable frag currently.
> > > > >
> > > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > > netlink or something).
> > > > >
> > > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > > regardless of which socket that packet belongs to. So the association
> > > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > > >
> > > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > > dmabuf the received token belongs to.
> > > >
> > >
> > > Yes, but this 1 dma-buf to 1 socket association happens because the
> > > user binds the dma-buf to an rx-queue and configures flow steering of
> > > the socket to that rx-queue.
> >
> > It's still fixed and won't change during the socket lifetime, right?
> > And the socket has to know this association; otherwise those tokens
> > are useless since they don't carry anything to identify the dmabuf.
> >
> > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > it somehow implies that I have an option of passing or not passing it
> > for an individual system call.
> > If we know that we're going to use dmabuf with the socket, maybe we
> > should move this flag to the socket() syscall?
> >
> > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> >
> > ?
>
> I think it should then be a setsockopt called before any data is
> exchanged, with no change of modifying mode later. We generally use
> setsockopts for the mode of a socket. This use of the protocol field
> in socket() for setting a mode would be novel. Also, it might miss
> passively opened connections, or be overly restrictive: one approach
> for all accepted child sockets.

I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
are plenty of bits we can grab. But setsockopt works as well!

2023-11-07 00:16:40

by David Ahern

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 176eb5834746..cdd4fb129968 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -425,6 +425,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
> return 0;
> }
>
> + if (skb_frags_not_readable(skb))
> + goto short_copy;
> +
> /* Copy paged appendix. Hmm... why does this look so complicated? */
> for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> int end;
> @@ -616,6 +619,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
> {
> int frag;
>
> + if (skb_frags_not_readable(skb))
> + return -EFAULT;

This check ....
> +
> if (msg && msg->msg_ubuf && msg->sg_from_iter)
> return msg->sg_from_iter(sk, skb, from, length);


... should go here. That allows custome sg_from_iter to have access to
the skb. What matters is not expecting struct page (e.g., refcounting);
if the custom iter does not do that then all is well. io_uring's iter
does not look at the pages, so all good.

>
> diff --git a/net/core/gro.c b/net/core/gro.c
> index 42d7f6755f32..56046d65386a 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -390,6 +390,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
> {
> struct skb_shared_info *pinfo = skb_shinfo(skb);
>
> + if (WARN_ON_ONCE(skb_frags_not_readable(skb)))
> + return;
> +
> BUG_ON(skb->end - skb->tail < grow);
>
> memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
> @@ -411,7 +414,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
> {
> int grow = skb_gro_offset(skb) - skb_headlen(skb);
>
> - if (grow > 0)
> + if (grow > 0 && !skb_frags_not_readable(skb))
> gro_pull_from_frag0(skb, grow);
> }
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 13eca4fd25e1..f01673ed2eff 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1230,6 +1230,14 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt)
> struct page *p;
> u8 *vaddr;
>
> + if (skb_frag_is_page_pool_iov(frag)) {

Why skb_frag_is_page_pool_iov here vs skb_frags_not_readable?

2023-11-07 00:20:55

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On Mon, Nov 6, 2023 at 4:08 PM Willem de Bruijn
<[email protected]> wrote:
>
> On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <[email protected]> wrote:
> >
> > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <[email protected]> wrote:
> > >
> > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <[email protected]> wrote:
> > > >
> > > > On 11/06, Mina Almasry wrote:
> > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <[email protected]> wrote:
> > > > > >
> > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > > >> (accessible) in the same skb.
> > > > > > > > >>
> > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > > >> are device memory frags or not.
> > > > > > > > >>
> > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > >>
> > > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > > >>
> > > > > > > > >> Signed-off-by: Willem de Bruijn <[email protected]>
> > > > > > > > >> Signed-off-by: Kaiyuan Zhang <[email protected]>
> > > > > > > > >> Signed-off-by: Mina Almasry <[email protected]>
> > > > > > > > >>
> > > > > > > > >> ---
> > > > > > > > >> include/linux/skbuff.h | 14 +++++++-
> > > > > > > > >> include/net/tcp.h | 5 +--
> > > > > > > > >> net/core/datagram.c | 6 ++++
> > > > > > > > >> net/core/gro.c | 5 ++-
> > > > > > > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > > >> net/ipv4/tcp.c | 6 ++++
> > > > > > > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > > > > > > >> net/ipv4/tcp_output.c | 5 ++-
> > > > > > > > >> net/packet/af_packet.c | 4 +--
> > > > > > > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > >>
> > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > > > > > > >> * the packet minus one that have been verified as
> > > > > > > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > >> + * device memory.
> > > > > > > > >> * @dst_pending_confirm: need to confirm neighbour
> > > > > > > > >> * @decrypted: Decrypted SKB
> > > > > > > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > >> __u8 csum_not_inet:1;
> > > > > > > > >> #endif
> > > > > > > > >> -
> > > > > > > > >> + __u8 devmem:1;
> > > > > > > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > >> __u16 tc_index; /* traffic control index */
> > > > > > > > >> #endif
> > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > >> __skb_zcopy_downgrade_managed(skb);
> > > > > > > > >> }
> > > > > > > > >>
> > > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > > >> +{
> > > > > > > > >> + return skb->devmem;
> > > > > > > > >
> > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > > future reason).
> > > > > > > >
> > > > > > > > +1.
> > > > > > > >
> > > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > > all of the frags are in the same memory type.
> > > > > >
> > > > > > David: maybe there should be such a requirement (that they all are
> > > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > > on.
> > > > > >
> > > > >
> > > > > Currently devmem == not_readable, and the restriction is that all the
> > > > > frags in the same skb must be either all readable or all unreadable
> > > > > (all devmem or all non-devmem).
> > > > >
> > > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > > >
> > > > > > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > + * device memory.
> > > > > > >
> > > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > > userspace.
> > > > > > >
> > > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > > > I imagine the stack would implement:
> > > > > > >
> > > > > > > 1. new header flag: skb->newmem
> > > > > > > 2.
> > > > > > >
> > > > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > > > {
> > > > > > > return skb->devmem || skb->newmem;
> > > > > > > }
> > > > > > >
> > > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > > >
> > > > > > You copy it to the userspace in a special way because your frags
> > > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > > just and optimization.
> > > > > >
> > > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > > >
> > > > >
> > > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > > something else. We can even assume not_readable == devmem because
> > > > > currently devmem is the only type of unreadable frag currently.
> > > > >
> > > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > > netlink or something).
> > > > >
> > > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > > regardless of which socket that packet belongs to. So the association
> > > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > > >
> > > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > > dmabuf the received token belongs to.
> > > >
> > >
> > > Yes, but this 1 dma-buf to 1 socket association happens because the
> > > user binds the dma-buf to an rx-queue and configures flow steering of
> > > the socket to that rx-queue.
> >
> > It's still fixed and won't change during the socket lifetime, right?

Technically, no.

The user is free to modify or delete flow steering rules outside of
the lifetime of the socket. Technically it's possible for the user to
reconfigure flow steering while the socket is simultaneously
receiving, and the result will be packets switching
from devmem to non-devmem. For a reasonably correctly configured
application the application would probably want to steer 1 flow to 1
dma-buf and never change it, but this is not something we enforce, but
rather the user orchestrates. In theory someone can find a use case
for configuring and unconfigure flow steering during a connection.

> > And the socket has to know this association; otherwise those tokens
> > are useless since they don't carry anything to identify the dmabuf.
> >
> > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > it somehow implies that I have an option of passing or not passing it
> > for an individual system call.

You do have the option of passing it or not passing it per system
call. The MSG_SOCK_DEVMEM says the application is willing to receive
devmem cmsgs - that's all. The application doesn't get to decide
whether it's actually going to receive a devmem cmsg or not, because
that's dictated by the type of skb that is present in the receive
queue, and not up to the application. I should explain this in the
commit message...

> > If we know that we're going to use dmabuf with the socket, maybe we
> > should move this flag to the socket() syscall?
> >
> > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> >
> > ?
>
> I think it should then be a setsockopt called before any data is
> exchanged, with no change of modifying mode later. We generally use
> setsockopts for the mode of a socket. This use of the protocol field
> in socket() for setting a mode would be novel. Also, it might miss
> passively opened connections, or be overly restrictive: one approach
> for all accepted child sockets.

We can definitely move SOCK_DEVMEM to a setsockopt(). Seems more than
reasonable.

--
Thanks,
Mina

2023-11-07 00:24:25

by Mina Almasry

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On Mon, Nov 6, 2023 at 4:16 PM David Ahern <[email protected]> wrote:
>
> On 11/5/23 7:44 PM, Mina Almasry wrote:
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index 176eb5834746..cdd4fb129968 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -425,6 +425,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
> > return 0;
> > }
> >
> > + if (skb_frags_not_readable(skb))
> > + goto short_copy;
> > +
> > /* Copy paged appendix. Hmm... why does this look so complicated? */
> > for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> > int end;
> > @@ -616,6 +619,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
> > {
> > int frag;
> >
> > + if (skb_frags_not_readable(skb))
> > + return -EFAULT;
>
> This check ....
> > +
> > if (msg && msg->msg_ubuf && msg->sg_from_iter)
> > return msg->sg_from_iter(sk, skb, from, length);
>
>
> ... should go here. That allows custome sg_from_iter to have access to
> the skb. What matters is not expecting struct page (e.g., refcounting);
> if the custom iter does not do that then all is well. io_uring's iter
> does not look at the pages, so all good.
>
> >
> > diff --git a/net/core/gro.c b/net/core/gro.c
> > index 42d7f6755f32..56046d65386a 100644
> > --- a/net/core/gro.c
> > +++ b/net/core/gro.c
> > @@ -390,6 +390,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
> > {
> > struct skb_shared_info *pinfo = skb_shinfo(skb);
> >
> > + if (WARN_ON_ONCE(skb_frags_not_readable(skb)))
> > + return;
> > +
> > BUG_ON(skb->end - skb->tail < grow);
> >
> > memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
> > @@ -411,7 +414,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
> > {
> > int grow = skb_gro_offset(skb) - skb_headlen(skb);
> >
> > - if (grow > 0)
> > + if (grow > 0 && !skb_frags_not_readable(skb))
> > gro_pull_from_frag0(skb, grow);
> > }
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 13eca4fd25e1..f01673ed2eff 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -1230,6 +1230,14 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt)
> > struct page *p;
> > u8 *vaddr;
> >
> > + if (skb_frag_is_page_pool_iov(frag)) {
>
> Why skb_frag_is_page_pool_iov here vs skb_frags_not_readable?

Seems like a silly choice on my end. I should probably check
skb_frags_not_readable() and not kmap any frags in that case. Will do.

--
Thanks,
Mina

2023-11-07 01:01:28

by Stanislav Fomichev

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On 11/06, Stanislav Fomichev wrote:
> On 11/06, Willem de Bruijn wrote:
> > On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <[email protected]> wrote:
> > >
> > > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <[email protected]> wrote:
> > > >
> > > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <[email protected]> wrote:
> > > > >
> > > > > On 11/06, Mina Almasry wrote:
> > > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <[email protected]> wrote:
> > > > > > >
> > > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > > > >> (accessible) in the same skb.
> > > > > > > > > >>
> > > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > > > >> are device memory frags or not.
> > > > > > > > > >>
> > > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > > >>
> > > > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > > > >>
> > > > > > > > > >> Signed-off-by: Willem de Bruijn <[email protected]>
> > > > > > > > > >> Signed-off-by: Kaiyuan Zhang <[email protected]>
> > > > > > > > > >> Signed-off-by: Mina Almasry <[email protected]>
> > > > > > > > > >>
> > > > > > > > > >> ---
> > > > > > > > > >> include/linux/skbuff.h | 14 +++++++-
> > > > > > > > > >> include/net/tcp.h | 5 +--
> > > > > > > > > >> net/core/datagram.c | 6 ++++
> > > > > > > > > >> net/core/gro.c | 5 ++-
> > > > > > > > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > > > >> net/ipv4/tcp.c | 6 ++++
> > > > > > > > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > > > > > > > >> net/ipv4/tcp_output.c | 5 ++-
> > > > > > > > > >> net/packet/af_packet.c | 4 +--
> > > > > > > > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > > >>
> > > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > > > > > > > >> * the packet minus one that have been verified as
> > > > > > > > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > > >> + * device memory.
> > > > > > > > > >> * @dst_pending_confirm: need to confirm neighbour
> > > > > > > > > >> * @decrypted: Decrypted SKB
> > > > > > > > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > > >> __u8 csum_not_inet:1;
> > > > > > > > > >> #endif
> > > > > > > > > >> -
> > > > > > > > > >> + __u8 devmem:1;
> > > > > > > > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > > >> __u16 tc_index; /* traffic control index */
> > > > > > > > > >> #endif
> > > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > > >> __skb_zcopy_downgrade_managed(skb);
> > > > > > > > > >> }
> > > > > > > > > >>
> > > > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > > > >> +{
> > > > > > > > > >> + return skb->devmem;
> > > > > > > > > >
> > > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > > > future reason).
> > > > > > > > >
> > > > > > > > > +1.
> > > > > > > > >
> > > > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > > > all of the frags are in the same memory type.
> > > > > > >
> > > > > > > David: maybe there should be such a requirement (that they all are
> > > > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > > > on.
> > > > > > >
> > > > > >
> > > > > > Currently devmem == not_readable, and the restriction is that all the
> > > > > > frags in the same skb must be either all readable or all unreadable
> > > > > > (all devmem or all non-devmem).
> > > > > >
> > > > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > > > >
> > > > > > > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > + * device memory.
> > > > > > > >
> > > > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > > > userspace.
> > > > > > > >
> > > > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > > > > I imagine the stack would implement:
> > > > > > > >
> > > > > > > > 1. new header flag: skb->newmem
> > > > > > > > 2.
> > > > > > > >
> > > > > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > > > > {
> > > > > > > > return skb->devmem || skb->newmem;
> > > > > > > > }
> > > > > > > >
> > > > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > > > >
> > > > > > > You copy it to the userspace in a special way because your frags
> > > > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > > > just and optimization.
> > > > > > >
> > > > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > > > >
> > > > > >
> > > > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > > > something else. We can even assume not_readable == devmem because
> > > > > > currently devmem is the only type of unreadable frag currently.
> > > > > >
> > > > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > > > netlink or something).
> > > > > >
> > > > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > > > regardless of which socket that packet belongs to. So the association
> > > > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > > > >
> > > > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > > > dmabuf the received token belongs to.
> > > > >
> > > >
> > > > Yes, but this 1 dma-buf to 1 socket association happens because the
> > > > user binds the dma-buf to an rx-queue and configures flow steering of
> > > > the socket to that rx-queue.
> > >
> > > It's still fixed and won't change during the socket lifetime, right?
> > > And the socket has to know this association; otherwise those tokens
> > > are useless since they don't carry anything to identify the dmabuf.
> > >
> > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > it somehow implies that I have an option of passing or not passing it
> > > for an individual system call.
> > > If we know that we're going to use dmabuf with the socket, maybe we
> > > should move this flag to the socket() syscall?
> > >
> > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > >
> > > ?
> >
> > I think it should then be a setsockopt called before any data is
> > exchanged, with no change of modifying mode later. We generally use
> > setsockopts for the mode of a socket. This use of the protocol field
> > in socket() for setting a mode would be novel. Also, it might miss
> > passively opened connections, or be overly restrictive: one approach
> > for all accepted child sockets.
>
> I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> are plenty of bits we can grab. But setsockopt works as well!

To follow up: if we have this flag on a socket, not on a per-message
basis, can we also use recvmsg for the recycling part maybe?

while (true) {
memset(msg, 0, ...);

/* receive the tokens */
ret = recvmsg(fd, &msg, 0);

/* recycle the tokens from the above recvmsg() */
ret = recvmsg(fd, &msg, MSG_RECYCLE);
}

recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
to recycle a range.

Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
Or is it more confusing?

2023-11-07 01:06:39

by Stanislav Fomichev

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On 11/06, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 4:08 PM Willem de Bruijn
> <[email protected]> wrote:
> >
> > On Mon, Nov 6, 2023 at 3:55 PM Stanislav Fomichev <[email protected]> wrote:
> > >
> > > On Mon, Nov 6, 2023 at 3:27 PM Mina Almasry <[email protected]> wrote:
> > > >
> > > > On Mon, Nov 6, 2023 at 2:59 PM Stanislav Fomichev <[email protected]> wrote:
> > > > >
> > > > > On 11/06, Mina Almasry wrote:
> > > > > > On Mon, Nov 6, 2023 at 1:59 PM Stanislav Fomichev <[email protected]> wrote:
> > > > > > >
> > > > > > > On 11/06, Mina Almasry wrote:
> > > > > > > > On Mon, Nov 6, 2023 at 11:34 AM David Ahern <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > > > > > > > > > On 11/05, Mina Almasry wrote:
> > > > > > > > > >> For device memory TCP, we expect the skb headers to be available in host
> > > > > > > > > >> memory for access, and we expect the skb frags to be in device memory
> > > > > > > > > >> and unaccessible to the host. We expect there to be no mixing and
> > > > > > > > > >> matching of device memory frags (unaccessible) with host memory frags
> > > > > > > > > >> (accessible) in the same skb.
> > > > > > > > > >>
> > > > > > > > > >> Add a skb->devmem flag which indicates whether the frags in this skb
> > > > > > > > > >> are device memory frags or not.
> > > > > > > > > >>
> > > > > > > > > >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> > > > > > > > > >> and marks the skb as skb->devmem accordingly.
> > > > > > > > > >>
> > > > > > > > > >> Add checks through the network stack to avoid accessing the frags of
> > > > > > > > > >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> > > > > > > > > >>
> > > > > > > > > >> Signed-off-by: Willem de Bruijn <[email protected]>
> > > > > > > > > >> Signed-off-by: Kaiyuan Zhang <[email protected]>
> > > > > > > > > >> Signed-off-by: Mina Almasry <[email protected]>
> > > > > > > > > >>
> > > > > > > > > >> ---
> > > > > > > > > >> include/linux/skbuff.h | 14 +++++++-
> > > > > > > > > >> include/net/tcp.h | 5 +--
> > > > > > > > > >> net/core/datagram.c | 6 ++++
> > > > > > > > > >> net/core/gro.c | 5 ++-
> > > > > > > > > >> net/core/skbuff.c | 77 ++++++++++++++++++++++++++++++++++++------
> > > > > > > > > >> net/ipv4/tcp.c | 6 ++++
> > > > > > > > > >> net/ipv4/tcp_input.c | 13 +++++--
> > > > > > > > > >> net/ipv4/tcp_output.c | 5 ++-
> > > > > > > > > >> net/packet/af_packet.c | 4 +--
> > > > > > > > > >> 9 files changed, 115 insertions(+), 20 deletions(-)
> > > > > > > > > >>
> > > > > > > > > >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > > > > > > >> index 1fae276c1353..8fb468ff8115 100644
> > > > > > > > > >> --- a/include/linux/skbuff.h
> > > > > > > > > >> +++ b/include/linux/skbuff.h
> > > > > > > > > >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> > > > > > > > > >> * @csum_level: indicates the number of consecutive checksums found in
> > > > > > > > > >> * the packet minus one that have been verified as
> > > > > > > > > >> * CHECKSUM_UNNECESSARY (max 3)
> > > > > > > > > >> + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > > >> + * device memory.
> > > > > > > > > >> * @dst_pending_confirm: need to confirm neighbour
> > > > > > > > > >> * @decrypted: Decrypted SKB
> > > > > > > > > >> * @slow_gro: state present at GRO time, slower prepare step required
> > > > > > > > > >> @@ -991,7 +993,7 @@ struct sk_buff {
> > > > > > > > > >> #if IS_ENABLED(CONFIG_IP_SCTP)
> > > > > > > > > >> __u8 csum_not_inet:1;
> > > > > > > > > >> #endif
> > > > > > > > > >> -
> > > > > > > > > >> + __u8 devmem:1;
> > > > > > > > > >> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> > > > > > > > > >> __u16 tc_index; /* traffic control index */
> > > > > > > > > >> #endif
> > > > > > > > > >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> > > > > > > > > >> __skb_zcopy_downgrade_managed(skb);
> > > > > > > > > >> }
> > > > > > > > > >>
> > > > > > > > > >> +/* Return true if frags in this skb are not readable by the host. */
> > > > > > > > > >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> > > > > > > > > >> +{
> > > > > > > > > >> + return skb->devmem;
> > > > > > > > > >
> > > > > > > > > > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > > > > > > > > > It better communicates the fact that the stack shouldn't dereference the
> > > > > > > > > > frags (because it has 'devmem' fragments or for some other potential
> > > > > > > > > > future reason).
> > > > > > > > >
> > > > > > > > > +1.
> > > > > > > > >
> > > > > > > > > Also, the flag on the skb is an optimization - a high level signal that
> > > > > > > > > one or more frags is in unreadable memory. There is no requirement that
> > > > > > > > > all of the frags are in the same memory type.
> > > > > > >
> > > > > > > David: maybe there should be such a requirement (that they all are
> > > > > > > unreadable)? Might be easier to support initially; we can relax later
> > > > > > > on.
> > > > > > >
> > > > > >
> > > > > > Currently devmem == not_readable, and the restriction is that all the
> > > > > > frags in the same skb must be either all readable or all unreadable
> > > > > > (all devmem or all non-devmem).
> > > > > >
> > > > > > > > The flag indicates that the skb contains all devmem dma-buf memory
> > > > > > > > specifically, not generic 'not_readable' frags as the comment says:
> > > > > > > >
> > > > > > > > + * @devmem: indicates that all the fragments in this skb are backed by
> > > > > > > > + * device memory.
> > > > > > > >
> > > > > > > > The reason it's not a generic 'not_readable' flag is because handing
> > > > > > > > off a generic not_readable skb to the userspace is semantically not
> > > > > > > > what we're doing. recvmsg() is augmented in this patch series to
> > > > > > > > return a devmem skb to the user via a cmsg_devmem struct which refers
> > > > > > > > specifically to the memory in the dma-buf. recvmsg() in this patch
> > > > > > > > series is not augmented to give any 'not_readable' skb to the
> > > > > > > > userspace.
> > > > > > > >
> > > > > > > > IMHO skb->devmem + an skb_frags_not_readable() as implemented is
> > > > > > > > correct. If a new type of unreadable skbs are introduced to the stack,
> > > > > > > > I imagine the stack would implement:
> > > > > > > >
> > > > > > > > 1. new header flag: skb->newmem
> > > > > > > > 2.
> > > > > > > >
> > > > > > > > static inline bool skb_frags_not_readable(const struct skb_buff *skb)
> > > > > > > > {
> > > > > > > > return skb->devmem || skb->newmem;
> > > > > > > > }
> > > > > > > >
> > > > > > > > 3. tcp_recvmsg_devmem() would handle skb->devmem skbs is in this patch
> > > > > > > > series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.
> > > > > > >
> > > > > > > You copy it to the userspace in a special way because your frags
> > > > > > > are page_is_page_pool_iov(). I agree with David, the skb bit is
> > > > > > > just and optimization.
> > > > > > >
> > > > > > > For most of the core stack, it doesn't matter why your skb is not
> > > > > > > readable. For a few places where it matters (recvmsg?), you can
> > > > > > > double-check your frags (all or some) with page_is_page_pool_iov.
> > > > > > >
> > > > > >
> > > > > > I see, we can do that then. I.e. make the header flag 'not_readable'
> > > > > > and check the frags to decide to delegate to tcp_recvmsg_devmem() or
> > > > > > something else. We can even assume not_readable == devmem because
> > > > > > currently devmem is the only type of unreadable frag currently.
> > > > > >
> > > > > > > Unrelated: we probably need socket to dmabuf association as well (via
> > > > > > > netlink or something).
> > > > > >
> > > > > > Not sure this is possible. The dma-buf is bound to the rx-queue, and
> > > > > > any packets that land on that rx-queue are bound to that dma-buf,
> > > > > > regardless of which socket that packet belongs to. So the association
> > > > > > IMO must be rx-queue to dma-buf, not socket to dma-buf.
> > > > >
> > > > > But there is still always 1 dmabuf to 1 socket association (on rx), right?
> > > > > Because otherwise, there is no way currently to tell, at recvmsg, which
> > > > > dmabuf the received token belongs to.
> > > > >
> > > >
> > > > Yes, but this 1 dma-buf to 1 socket association happens because the
> > > > user binds the dma-buf to an rx-queue and configures flow steering of
> > > > the socket to that rx-queue.
> > >
> > > It's still fixed and won't change during the socket lifetime, right?
>
> Technically, no.
>
> The user is free to modify or delete flow steering rules outside of
> the lifetime of the socket. Technically it's possible for the user to
> reconfigure flow steering while the socket is simultaneously
> receiving, and the result will be packets switching
> from devmem to non-devmem. For a reasonably correctly configured
> application the application would probably want to steer 1 flow to 1
> dma-buf and never change it, but this is not something we enforce, but
> rather the user orchestrates. In theory someone can find a use case
> for configuring and unconfigure flow steering during a connection.

If we do want to support this flexible configuration then we also
should export some dmabuf id along with the token?

> > > And the socket has to know this association; otherwise those tokens
> > > are useless since they don't carry anything to identify the dmabuf.
> > >
> > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > it somehow implies that I have an option of passing or not passing it
> > > for an individual system call.
>
> You do have the option of passing it or not passing it per system
> call. The MSG_SOCK_DEVMEM says the application is willing to receive
> devmem cmsgs - that's all. The application doesn't get to decide
> whether it's actually going to receive a devmem cmsg or not, because
> that's dictated by the type of skb that is present in the receive
> queue, and not up to the application. I should explain this in the
> commit message...

What would be the case of passing it or not passing it? Some fallback to
the host memory after flow steering update? Yeah, would be useful to
document those constrains. I'd lean on starting stricter and relaxing
those conditions if we find the use-cases.

> > > If we know that we're going to use dmabuf with the socket, maybe we
> > > should move this flag to the socket() syscall?
> > >
> > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > >
> > > ?
> >
> > I think it should then be a setsockopt called before any data is
> > exchanged, with no change of modifying mode later. We generally use
> > setsockopts for the mode of a socket. This use of the protocol field
> > in socket() for setting a mode would be novel. Also, it might miss
> > passively opened connections, or be overly restrictive: one approach
> > for all accepted child sockets.
>
> We can definitely move SOCK_DEVMEM to a setsockopt(). Seems more than
> reasonable.

SG, added another suggestion for SO_DEVMEM_DONTNEED on another thread
with Willem. LMK what you think.

2023-11-07 01:09:31

by David Ahern

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

On 11/6/23 5:20 PM, Mina Almasry wrote:
> The user is free to modify or delete flow steering rules outside of the
> lifetime of the socket. Technically it's possible for the user to
> reconfigure flow steering while the socket is simultaneously receiving,
> and the result will be packets switching from devmem to non-devmem.

generically, from one page pool to another (ie., devmem piece of that
statement is not relevant).

2023-11-07 02:24:43

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

> > > > I think my other issue with MSG_SOCK_DEVMEM being on recvmsg is that
> > > > it somehow implies that I have an option of passing or not passing it
> > > > for an individual system call.
> > > > If we know that we're going to use dmabuf with the socket, maybe we
> > > > should move this flag to the socket() syscall?
> > > >
> > > > fd = socket(AF_INET6, SOCK_STREAM, SOCK_DEVMEM);
> > > >
> > > > ?
> > >
> > > I think it should then be a setsockopt called before any data is
> > > exchanged, with no change of modifying mode later. We generally use
> > > setsockopts for the mode of a socket. This use of the protocol field
> > > in socket() for setting a mode would be novel. Also, it might miss
> > > passively opened connections, or be overly restrictive: one approach
> > > for all accepted child sockets.
> >
> > I was thinking this is similar to SOCK_CLOEXEC or SOCK_NONBLOCK? There
> > are plenty of bits we can grab. But setsockopt works as well!
>
> To follow up: if we have this flag on a socket, not on a per-message
> basis, can we also use recvmsg for the recycling part maybe?
>
> while (true) {
> memset(msg, 0, ...);
>
> /* receive the tokens */
> ret = recvmsg(fd, &msg, 0);
>
> /* recycle the tokens from the above recvmsg() */
> ret = recvmsg(fd, &msg, MSG_RECYCLE);
> }
>
> recvmsg + MSG_RECYCLE can parse the same format that regular recvmsg
> exports (SO_DEVMEM_OFFSET) and we can also add extra cmsg option
> to recycle a range.
>
> Will this be more straightforward than a setsockopt(SO_DEVMEM_DONTNEED)?
> Or is it more confusing?

It would have to be sendmsg, as recvmsg is a copy_to_user operation.

I am not aware of any precedent in multiplexing the data stream and a
control operation stream in this manner. It would also require adding
a branch in the sendmsg hot path.

The memory is associated with the socket, freed when the socket is
closed as well as on SO_DEVMEM_DONTNEED. Fundamentally it is a socket
state operation, for which setsockopt is the socket interface.

Is your request purely a dislike, or is there some technical concern
with BPF and setsockopt?