2024-05-30 21:49:05

by Yan Zhai

[permalink] [raw]
Subject: [RFC net-next 0/6] net: pass receive socket to drop tracepoint

Greeting!

We set up our production packet drop monitoring around the kfree_skb
tracepoint. While this tracepoint is extremely valuable for diagnosing
critical problems, we find some limitation with drops on the local
receive path: this tracepoint can only inspect the dropped skb itself,
but such skb might not carry enough information to:

1. determine in which netns/container this skb gets dropped
2. determine by which socket/service this skb oughts to be received

The 1st issue is because skb->dev is the only member field with valid
netns reference. But skb->dev can get cleared or reused. For example,
tcp_v4_rcv will clear skb->dev and in later processing it might be reused
for OFO tree.

The 2nd issue is because there is no reference on an skb that reliably
points to a receiving socket. skb->sk usually points to the local
sending socket, and it only points to a receive socket briefly after
early demux stage, yet the socket can get stolen later. For certain drop
reason like TCP OFO_MERGE, Zerowindow, UDP at PROTO_MEM error, etc, it
is hard to infer which receiving socket is impacted. This cannot be
overcome by simply looking at the packet header, because of
complications like sk lookup programs. In the past, single purpose
tracepoints like trace_udp_fail_queue_rcv_skb, trace_sock_rcvqueue_full,
etc are added as needed to provide more visibility. This could be
handled in a more generic way.

In this change set we propose a new 'kfree_skb_for_sk' call as a drop-in
replacement for kfree_skb_reason at various local input path. It accepts
an extra receiving socket argument, and places the socket in skb->cb for
tracepoint consumption. With an rx socket, it can easily deal with both
issues above. Using cb field is more of a concern that a tracepoint
signature might be a part of stable ABI, but please advise if otherwise.

Yan Zhai (6):
net: add kfree_skb_for_sk function
ping: pass rx socket on rcv drops
net: raw: pass rx socket on rcv drops
tcp: pass rx socket on rcv drops
udp: pass rx socket on rcv drops
af_packet: pass rx socket on rcv drops

include/linux/skbuff.h | 48 ++++++++++++++++++++++++++++++++++++++++--
net/core/dev.c | 21 +++++++-----------
net/core/skbuff.c | 29 +++++++++++++------------
net/ipv4/ping.c | 2 +-
net/ipv4/raw.c | 4 ++--
net/ipv4/syncookies.c | 2 +-
net/ipv4/tcp_input.c | 2 +-
net/ipv4/tcp_ipv4.c | 4 ++--
net/ipv4/udp.c | 6 +++---
net/ipv6/raw.c | 8 +++----
net/ipv6/syncookies.c | 2 +-
net/ipv6/tcp_ipv6.c | 4 ++--
net/ipv6/udp.c | 6 +++---
net/packet/af_packet.c | 6 +++---
14 files changed, 93 insertions(+), 51 deletions(-)

--
2.30.2




2024-05-30 21:49:22

by Yan Zhai

[permalink] [raw]
Subject: [RFC net-next 6/6] af_packet: pass rx socket on rcv drops

Use kfree_skb_for_sk call to pass on the rx socket

Signed-off-by: Yan Zhai <[email protected]>
---
net/packet/af_packet.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index fce390887591..30a6447b4fc4 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2226,7 +2226,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
skb->len = skb_len;
}
drop:
- kfree_skb_reason(skb, drop_reason);
+ kfree_skb_for_sk(skb, sk, drop_reason);
return 0;
}

@@ -2494,7 +2494,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
skb->len = skb_len;
}
drop:
- kfree_skb_reason(skb, drop_reason);
+ kfree_skb_for_sk(skb, sk, drop_reason);
return 0;

drop_n_account:
@@ -2503,7 +2503,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
drop_reason = SKB_DROP_REASON_PACKET_SOCK_ERROR;

sk->sk_data_ready(sk);
- kfree_skb_reason(copy_skb, drop_reason);
+ kfree_skb_for_sk(copy_skb, sk, drop_reason);
goto drop_n_restore;
}

--
2.30.2



2024-05-30 21:49:36

by Yan Zhai

[permalink] [raw]
Subject: [RFC net-next 1/6] net: add kfree_skb_for_sk function

Implement a new kfree_skb_for_sk to replace kfree_skb_reason on a few
local receive path. The function accepts an extra receiving socket
argument, which will be set in skb->cb for kfree_skb/consume_skb
tracepoint consumption. With this extra bit of information, it will be
easier to attribute dropped packets to netns/containers and
sockets/services for performance and error monitoring purpose.

Signed-off-by: Yan Zhai <[email protected]>
---
include/linux/skbuff.h | 48 ++++++++++++++++++++++++++++++++++++++++--
net/core/dev.c | 21 +++++++-----------
net/core/skbuff.c | 29 +++++++++++++------------
3 files changed, 70 insertions(+), 28 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fe7d8dbef77e..66f5b06798f2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1251,8 +1251,52 @@ static inline bool skb_data_unref(const struct sk_buff *skb,
return true;
}

-void __fix_address
-kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason);
+/*
+ * Some protocol will clear or reuse skb->dev field for other purposes. For
+ * example, TCP stack would reuse the pointer for out of order packet handling.
+ * This caused some problem for drop monitoring on kfree_skb tracepoint, since
+ * no other fields of an skb provides netns information. In addition, it is
+ * also complicated to recover receive socket information for dropped packets,
+ * because the socket lookup can be an sk-lookup BPF program.
+ *
+ * This can be addressed by just passing the rx socket to the tracepoint,
+ * because it also has valid netns binding.
+ */
+struct kfree_skb_cb {
+ enum skb_drop_reason reason; /* used only by dev_kfree_skb_irq */
+ struct sock *rx_sk;
+};
+
+#define KFREE_SKB_CB(skb) ((struct kfree_skb_cb *)(skb)->cb)
+
+/* Save cb->rx_sk before calling kfree_skb/consume_skb tracepoint, and restore
+ * after the tracepoint. This is necessary because some skb destructor might
+ * rely on values in skb->cb, e.g. unix_destruct_scm.
+ */
+#define _call_trace_kfree_skb(action, skb, sk, ...) do { \
+ if (trace_##action##_skb_enabled()) { \
+ struct kfree_skb_cb saved; \
+ saved.rx_sk = KFREE_SKB_CB(skb)->rx_sk; \
+ KFREE_SKB_CB(skb)->rx_sk = sk; \
+ trace_##action##_skb((skb), ## __VA_ARGS__); \
+ KFREE_SKB_CB(skb)->rx_sk = saved.rx_sk; \
+ } \
+} while (0)
+
+#define call_trace_kfree_skb(skb, sk, ...) \
+ _call_trace_kfree_skb(kfree, skb, sk, ## __VA_ARGS__)
+
+#define call_trace_consume_skb(skb, sk, ...) \
+ _call_trace_kfree_skb(consume, skb, sk, ## __VA_ARGS__)
+
+void __fix_address kfree_skb_for_sk(struct sk_buff *skb, struct sock *rx_sk,
+ enum skb_drop_reason reason);
+
+static inline void
+kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
+{
+ kfree_skb_for_sk(skb, NULL, reason);
+}

/**
* kfree_skb - free an sk_buff with 'NOT_SPECIFIED' reason
diff --git a/net/core/dev.c b/net/core/dev.c
index 85fe8138f3e4..17516f26be92 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3135,15 +3135,6 @@ void __netif_schedule(struct Qdisc *q)
}
EXPORT_SYMBOL(__netif_schedule);

-struct dev_kfree_skb_cb {
- enum skb_drop_reason reason;
-};
-
-static struct dev_kfree_skb_cb *get_kfree_skb_cb(const struct sk_buff *skb)
-{
- return (struct dev_kfree_skb_cb *)skb->cb;
-}
-
void netif_schedule_queue(struct netdev_queue *txq)
{
rcu_read_lock();
@@ -3182,7 +3173,11 @@ void dev_kfree_skb_irq_reason(struct sk_buff *skb, enum skb_drop_reason reason)
} else if (likely(!refcount_dec_and_test(&skb->users))) {
return;
}
- get_kfree_skb_cb(skb)->reason = reason;
+
+ /* There is no need to save the old cb since we are the only user. */
+ KFREE_SKB_CB(skb)->reason = reason;
+ KFREE_SKB_CB(skb)->rx_sk = NULL;
+
local_irq_save(flags);
skb->next = __this_cpu_read(softnet_data.completion_queue);
__this_cpu_write(softnet_data.completion_queue, skb);
@@ -5229,17 +5224,17 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
clist = clist->next;

WARN_ON(refcount_read(&skb->users));
- if (likely(get_kfree_skb_cb(skb)->reason == SKB_CONSUMED))
+ if (likely(KFREE_SKB_CB(skb)->reason == SKB_CONSUMED))
trace_consume_skb(skb, net_tx_action);
else
trace_kfree_skb(skb, net_tx_action,
- get_kfree_skb_cb(skb)->reason);
+ KFREE_SKB_CB(skb)->reason);

if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
__kfree_skb(skb);
else
__napi_kfree_skb(skb,
- get_kfree_skb_cb(skb)->reason);
+ KFREE_SKB_CB(skb)->reason);
}
}

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 466999a7515e..5ce6996512a1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1190,7 +1190,8 @@ void __kfree_skb(struct sk_buff *skb)
EXPORT_SYMBOL(__kfree_skb);

static __always_inline
-bool __kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
+bool __kfree_skb_reason(struct sk_buff *skb, struct sock *rx_sk,
+ enum skb_drop_reason reason)
{
if (unlikely(!skb_unref(skb)))
return false;
@@ -1201,28 +1202,30 @@ bool __kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
SKB_DROP_REASON_SUBSYS_NUM);

if (reason == SKB_CONSUMED)
- trace_consume_skb(skb, __builtin_return_address(0));
+ call_trace_consume_skb(skb, rx_sk, __builtin_return_address(0));
else
- trace_kfree_skb(skb, __builtin_return_address(0), reason);
+ call_trace_kfree_skb(skb, rx_sk, __builtin_return_address(0), reason);
+
return true;
}

/**
- * kfree_skb_reason - free an sk_buff with special reason
+ * kfree_skb_for_sk - free an sk_buff with special reason and receiving socket
* @skb: buffer to free
+ * @rx_sk: the socket to receive the buffer, or NULL if not applicable
* @reason: reason why this skb is dropped
*
* Drop a reference to the buffer and free it if the usage count has
- * hit zero. Meanwhile, pass the drop reason to 'kfree_skb'
+ * hit zero. Meanwhile, pass the drop reason and rx socket to 'kfree_skb'
* tracepoint.
*/
-void __fix_address
-kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
+void __fix_address kfree_skb_for_sk(struct sk_buff *skb, struct sock *rx_sk,
+ enum skb_drop_reason reason)
{
- if (__kfree_skb_reason(skb, reason))
+ if (__kfree_skb_reason(skb, rx_sk, reason))
__kfree_skb(skb);
}
-EXPORT_SYMBOL(kfree_skb_reason);
+EXPORT_SYMBOL(kfree_skb_for_sk);

#define KFREE_SKB_BULK_SIZE 16

@@ -1261,7 +1264,7 @@ kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
while (segs) {
struct sk_buff *next = segs->next;

- if (__kfree_skb_reason(segs, reason)) {
+ if (__kfree_skb_reason(segs, NULL, reason)) {
skb_poison_list(segs);
kfree_skb_add_bulk(segs, &sa, reason);
}
@@ -1405,7 +1408,7 @@ void consume_skb(struct sk_buff *skb)
if (!skb_unref(skb))
return;

- trace_consume_skb(skb, __builtin_return_address(0));
+ call_trace_consume_skb(skb, NULL, __builtin_return_address(0));
__kfree_skb(skb);
}
EXPORT_SYMBOL(consume_skb);
@@ -1420,7 +1423,7 @@ EXPORT_SYMBOL(consume_skb);
*/
void __consume_stateless_skb(struct sk_buff *skb)
{
- trace_consume_skb(skb, __builtin_return_address(0));
+ call_trace_consume_skb(skb, NULL, __builtin_return_address(0));
skb_release_data(skb, SKB_CONSUMED);
kfree_skbmem(skb);
}
@@ -1478,7 +1481,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
return;

/* if reaching here SKB is ready to free */
- trace_consume_skb(skb, __builtin_return_address(0));
+ call_trace_consume_skb(skb, NULL, __builtin_return_address(0));

/* if SKB is a clone, don't handle this case */
if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
--
2.30.2



2024-05-30 21:51:01

by Yan Zhai

[permalink] [raw]
Subject: [RFC net-next 4/6] tcp: pass rx socket on rcv drops

Use kfree_skb_for_sk call to pass on the rx socket

Signed-off-by: Yan Zhai <[email protected]>
---
net/ipv4/syncookies.c | 2 +-
net/ipv4/tcp_input.c | 2 +-
net/ipv4/tcp_ipv4.c | 4 ++--
net/ipv6/syncookies.c | 2 +-
net/ipv6/tcp_ipv6.c | 4 ++--
5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index b61d36810fe3..fd0ccf0a0439 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -496,6 +496,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
out_free:
reqsk_free(req);
out_drop:
- kfree_skb_reason(skb, reason);
+ kfree_skb_for_sk(skb, sk, reason);
return NULL;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9c04a9c8be9d..6c5f6fe7d9fa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4849,7 +4849,7 @@ static void tcp_drop_reason(struct sock *sk, struct sk_buff *skb,
enum skb_drop_reason reason)
{
sk_drops_add(sk, skb);
- kfree_skb_reason(skb, reason);
+ kfree_skb_for_sk(skb, sk, reason);
}

/* This one checks to see if we can put data from the
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 8f70b8d1d1e5..cff9e0c7daec 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1944,7 +1944,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
reset:
tcp_v4_send_reset(rsk, skb, sk_rst_convert_drop_reason(reason));
discard:
- kfree_skb_reason(skb, reason);
+ kfree_skb_for_sk(skb, sk, reason);
/* Be careful here. If this function gets more complicated and
* gcc suffers from register pressure on the x86, sk (in %ebx)
* might be destroyed here. This current version compiles correctly,
@@ -2381,7 +2381,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
discard_it:
SKB_DR_OR(drop_reason, NOT_SPECIFIED);
/* Discard frame. */
- kfree_skb_reason(skb, drop_reason);
+ kfree_skb_for_sk(skb, sk, drop_reason);
return 0;

discard_and_relse:
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index bfad1e89b6a6..d09fec4bb0c0 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -275,6 +275,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
out_free:
reqsk_free(req);
out_drop:
- kfree_skb_reason(skb, reason);
+ kfree_skb_for_sk(skb, sk, reason);
return NULL;
}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 750aa681779c..c7a7f339ca74 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1682,7 +1682,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
discard:
if (opt_skb)
__kfree_skb(opt_skb);
- kfree_skb_reason(skb, reason);
+ kfree_skb_for_sk(skb, sk, reason);
return 0;
csum_err:
reason = SKB_DROP_REASON_TCP_CSUM;
@@ -1948,7 +1948,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)

discard_it:
SKB_DR_OR(drop_reason, NOT_SPECIFIED);
- kfree_skb_reason(skb, drop_reason);
+ kfree_skb_for_sk(skb, sk, drop_reason);
return 0;

discard_and_relse:
--
2.30.2



2024-05-30 21:51:26

by Yan Zhai

[permalink] [raw]
Subject: [RFC net-next 5/6] udp: pass rx socket on rcv drops

Use kfree_skb_for_sk call to pass on the rx socket

Signed-off-by: Yan Zhai <[email protected]>
---
net/ipv4/udp.c | 6 +++---
net/ipv6/udp.c | 6 +++---
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 189c9113fe9a..e5dbd1cbad50 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2074,7 +2074,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
}
UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
trace_udp_fail_queue_rcv_skb(rc, sk, skb);
- kfree_skb_reason(skb, drop_reason);
+ kfree_skb_for_sk(skb, sk, drop_reason);
return -1;
}

@@ -2196,7 +2196,7 @@ static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
drop:
__UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
atomic_inc(&sk->sk_drops);
- kfree_skb_reason(skb, drop_reason);
+ kfree_skb_for_sk(skb, sk, drop_reason);
return -1;
}

@@ -2485,7 +2485,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
__UDP_INC_STATS(net, UDP_MIB_CSUMERRORS, proto == IPPROTO_UDPLITE);
drop:
__UDP_INC_STATS(net, UDP_MIB_INERRORS, proto == IPPROTO_UDPLITE);
- kfree_skb_reason(skb, drop_reason);
+ kfree_skb_for_sk(skb, sk, drop_reason);
return 0;
}

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index c81a07ac0463..97a327c759b8 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -673,7 +673,7 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
}
UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
trace_udp_fail_queue_rcv_skb(rc, sk, skb);
- kfree_skb_reason(skb, drop_reason);
+ kfree_skb_for_sk(skb, sk, drop_reason);
return -1;
}

@@ -776,7 +776,7 @@ static int udpv6_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
drop:
__UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
atomic_inc(&sk->sk_drops);
- kfree_skb_reason(skb, drop_reason);
+ kfree_skb_for_sk(skb, sk, drop_reason);
return -1;
}

@@ -1054,7 +1054,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
__UDP6_INC_STATS(net, UDP_MIB_CSUMERRORS, proto == IPPROTO_UDPLITE);
discard:
__UDP6_INC_STATS(net, UDP_MIB_INERRORS, proto == IPPROTO_UDPLITE);
- kfree_skb_reason(skb, reason);
+ kfree_skb_for_sk(skb, sk, reason);
return 0;
}

--
2.30.2



2024-05-31 16:58:38

by Yan Zhai

[permalink] [raw]
Subject: Re: [RFC net-next 1/6] net: add kfree_skb_for_sk function

Hi Eric,

Thanks for the feedback.

On Fri, May 31, 2024 at 1:51 AM Eric Dumazet <[email protected]> wrote:
>
> On Thu, May 30, 2024 at 11:46 PM Yan Zhai <[email protected]> wrote:
> >
> > Implement a new kfree_skb_for_sk to replace kfree_skb_reason on a few
> > local receive path. The function accepts an extra receiving socket
> > argument, which will be set in skb->cb for kfree_skb/consume_skb
> > tracepoint consumption. With this extra bit of information, it will be
> > easier to attribute dropped packets to netns/containers and
> > sockets/services for performance and error monitoring purposes.
>
> This is a lot of code churn...
>
> I have to ask : Why not simply adding an sk parameter to an existing
> trace point ?
>
Modifying a signature of the current tracepoint seems like a breaking
change, that's why I was saving the context inside skb->cb, hoping to
not impact any existing programs watching this tracepoint. But
thinking it twice, it might not cause a problem if the signature
becomes:

trace_kfree_skb(const struct sk_buff *skb, void *location, enum
skb_drop_reason reason, const struct sock *sk)

As return values are usually not a thing for tracepoints, it is
probably still compatible. The cons is that the last "sk" still breaks
the integrity of naming. How about making a "kfree_skb_context"
internal struct and putting it as the last argument to "hide" the
naming confusion?

> If this not possible, I would rather add new tracepoints, adding new classes,
> because it will ease your debugging :
>
> When looking for TCP drops, simply use a tcp_event_sk_skb_reason instance,
> and voila, no distractions caused by RAW/ICMP/ICMPv6/af_packet drops.
>
> DECLARE_EVENT_CLASS(tcp_event_sk_skb_reason,
>
> TP_PROTO(const struct sock *sk, const struct sk_buff *skb, enum
> skb_drop_reason reason),
> ...
> );

The alternative of adding another tracepoint could indeed work, we had
a few cases like that in the past, e.g.

https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/netdev/[email protected]/

But it does feel like a whack-a-mole thing. The problems are solvable
if we extend the kfree_skb tracepoint, so I would prefer to not add a
new tracepoint.

>
> Also, the name ( kfree_skb_for_sk) and order of parameters is confusing.
>
> I always prefer this kind of ordering/names :
>
> void sk_skb_reason_drop( [struct net *net ] // not relevant here, but
> to expand the rationale
> struct sock *sk, struct sk_buff *skb, enum skb_drop_reason reason)
>
> Looking at the name, we immediately see the parameter order.
>
> The consume one (no @reason there) would be called
>
> void sk_skb_consume(struct sock *sk, struct sk_buff *skb);

I was intending to keep the "kfree_skb" prefix initially since it
would appear less surprising to kernel developers who used kfree_skb
and kfree_skb_reason. But your points do make good sense. How about
"kfree_sk_skb_reason" and "consume_sk_skb" here?

thanks
Yan

2024-05-31 17:33:25

by Eric Dumazet

[permalink] [raw]
Subject: Re: [RFC net-next 1/6] net: add kfree_skb_for_sk function

On Fri, May 31, 2024 at 6:58 PM Yan Zhai <[email protected]> wrote:
>
> Hi Eric,
>
> Thanks for the feedback.
>
> On Fri, May 31, 2024 at 1:51 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Thu, May 30, 2024 at 11:46 PM Yan Zhai <[email protected]> wrote:
> > >
> > > Implement a new kfree_skb_for_sk to replace kfree_skb_reason on a few
> > > local receive path. The function accepts an extra receiving socket
> > > argument, which will be set in skb->cb for kfree_skb/consume_skb
> > > tracepoint consumption. With this extra bit of information, it will be
> > > easier to attribute dropped packets to netns/containers and
> > > sockets/services for performance and error monitoring purposes.
> >
> > This is a lot of code churn...
> >
> > I have to ask : Why not simply adding an sk parameter to an existing
> > trace point ?
> >
> Modifying a signature of the current tracepoint seems like a breaking
> change, that's why I was saving the context inside skb->cb, hoping to
> not impact any existing programs watching this tracepoint. But
> thinking it twice, it might not cause a problem if the signature
> becomes:
>
> trace_kfree_skb(const struct sk_buff *skb, void *location, enum
> skb_drop_reason reason, const struct sock *sk)
>
> As return values are usually not a thing for tracepoints, it is
> probably still compatible. The cons is that the last "sk" still breaks
> the integrity of naming. How about making a "kfree_skb_context"
> internal struct and putting it as the last argument to "hide" the
> naming confusion?
>
> > If this not possible, I would rather add new tracepoints, adding new classes,
> > because it will ease your debugging :
> >
> > When looking for TCP drops, simply use a tcp_event_sk_skb_reason instance,
> > and voila, no distractions caused by RAW/ICMP/ICMPv6/af_packet drops.
> >
> > DECLARE_EVENT_CLASS(tcp_event_sk_skb_reason,
> >
> > TP_PROTO(const struct sock *sk, const struct sk_buff *skb, enum
> > skb_drop_reason reason),
> > ...
> > );
>
> The alternative of adding another tracepoint could indeed work, we had
> a few cases like that in the past, e.g.
>
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/netdev/[email protected]/
>
> But it does feel like a whack-a-mole thing. The problems are solvable
> if we extend the kfree_skb tracepoint, so I would prefer to not add a
> new tracepoint.

Solvable with many future merge conflicts for stable teams.


>
> >
> > Also, the name ( kfree_skb_for_sk) and order of parameters is confusing.
> >
> > I always prefer this kind of ordering/names :
> >
> > void sk_skb_reason_drop( [struct net *net ] // not relevant here, but
> > to expand the rationale
> > struct sock *sk, struct sk_buff *skb, enum skb_drop_reason reason)
> >
> > Looking at the name, we immediately see the parameter order.
> >
> > The consume one (no @reason there) would be called
> >
> > void sk_skb_consume(struct sock *sk, struct sk_buff *skb);
>
> I was intending to keep the "kfree_skb" prefix initially since it
> would appear less surprising to kernel developers who used kfree_skb
> and kfree_skb_reason. But your points do make good sense. How about
> "kfree_sk_skb_reason" and "consume_sk_skb" here?
>

IMO kfree_skb() and consume_skb() were a wrong choice. We have to live
with them.

It should have been skb_free(), skb_consume(), skb_alloc(),
to be consistent.

Following (partial) list was much better:

skb_add_rx_frag_netmem, skb_coalesce_rx_frag, skb_pp_cow_data,
skb_cow_data_for_xdp,
skb_dump, skb_tx_error, skb_morph, skb_zerocopy_iter_stream, skb_copy_ubufs,
skb_clone, skb_headers_offset_update, skb_copy_header, skb_copy,
skb_realloc_headroom, skb_expand_head, skb_copy_expand, skb_put,
skb_push, skb_pull, skb_pull_data, skb_trim, skb_copy_bits,
skb_splice_bits, skb_send_sock_locked, skb_store_bits,
skb_checksum, skb_copy_and_csum_bits, skb_zerocopy_headlen,
skb_zerocopy, skb_copy_and_csum_dev, skb_dequeue,
skb_dequeue_tail, skb_queue_purge_reason, skb_errqueue_purge,
skb_queue_head, skb_queue_tail, skb_unlink, skb_append,
skb_split, skb_prepare_seq_read, skb_seq_read, skb_abort_seq_read,
skb_find_text, skb_append_pagefrags, skb_pull_rcsum, skb_segment_list,
skb_segment, skb_to_sgvec, skb_to_sgvec_nomark, skb_cow_data, skb_clone_sk,
skb_complete_tx_timestamp, skb_tstamp_tx, skb_complete_wifi_ack,
skb_partial_csum_set, skb_checksum_setup, skb_checksum_trimmed,
skb_try_coalesce, skb_scrub_packet, skb_vlan_untag, skb_ensure_writable,
skb_ensure_writable_head_tail, skb_vlan_pop, skb_vlan_push, skb_eth_pop,
skb_eth_push, skb_mpls_push, skb_mpls_pop, skb_mpls_update_lse,
skb_mpls_dec_ttl, skb_condense, skb_ext_add, skb_splice_from_iter

(just to make my point very very clear)

Instead we have a myriad of functions with illogical parameter
ordering vs their names.

I see no reason to add more confusion for new helpers.

2024-05-31 19:05:45

by Yan Zhai

[permalink] [raw]
Subject: Re: [RFC net-next 1/6] net: add kfree_skb_for_sk function

On Fri, May 31, 2024 at 12:32 PM Eric Dumazet <[email protected]> wrote:
>
> On Fri, May 31, 2024 at 6:58 PM Yan Zhai <[email protected]> wrote:
> >
> > Hi Eric,
> >
> > Thanks for the feedback.
> >
> > On Fri, May 31, 2024 at 1:51 AM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Thu, May 30, 2024 at 11:46 PM Yan Zhai <[email protected]> wrote:
> > > >
> > > > Implement a new kfree_skb_for_sk to replace kfree_skb_reason on a few
> > > > local receive path. The function accepts an extra receiving socket
> > > > argument, which will be set in skb->cb for kfree_skb/consume_skb
> > > > tracepoint consumption. With this extra bit of information, it will be
> > > > easier to attribute dropped packets to netns/containers and
> > > > sockets/services for performance and error monitoring purposes.
> > >
> > > This is a lot of code churn...
> > >
> > > I have to ask : Why not simply adding an sk parameter to an existing
> > > trace point ?
> > >
> > Modifying a signature of the current tracepoint seems like a breaking
> > change, that's why I was saving the context inside skb->cb, hoping to
> > not impact any existing programs watching this tracepoint. But
> > thinking it twice, it might not cause a problem if the signature
> > becomes:
> >
> > trace_kfree_skb(const struct sk_buff *skb, void *location, enum
> > skb_drop_reason reason, const struct sock *sk)
> >
> > As return values are usually not a thing for tracepoints, it is
> > probably still compatible. The cons is that the last "sk" still breaks
> > the integrity of naming. How about making a "kfree_skb_context"
> > internal struct and putting it as the last argument to "hide" the
> > naming confusion?
> >
> > > If this not possible, I would rather add new tracepoints, adding new classes,
> > > because it will ease your debugging :
> > >
> > > When looking for TCP drops, simply use a tcp_event_sk_skb_reason instance,
> > > and voila, no distractions caused by RAW/ICMP/ICMPv6/af_packet drops.
> > >
> > > DECLARE_EVENT_CLASS(tcp_event_sk_skb_reason,
> > >
> > > TP_PROTO(const struct sock *sk, const struct sk_buff *skb, enum
> > > skb_drop_reason reason),
> > > ...
> > > );
> >
> > The alternative of adding another tracepoint could indeed work, we had
> > a few cases like that in the past, e.g.
> >
> > https://lore.kernel.org/lkml/[email protected]/
> > https://lore.kernel.org/netdev/[email protected]/
> >
> > But it does feel like a whack-a-mole thing. The problems are solvable
> > if we extend the kfree_skb tracepoint, so I would prefer to not add a
> > new tracepoint.
>
> Solvable with many future merge conflicts for stable teams.
>
I don't quite follow it. I think this specific commit using skb->cb is
unnecessary so I am going to re-work it. As you initially mentioned,
maybe I should just extend kfree_skb tracepoint. I saw a similar
change dd1b527831a3("net: add location to trace_consume_skb()"), is it
something I might follow, or do you specifically mean changes like
this can annoy stable teams?

>
> >
> > >
> > > Also, the name ( kfree_skb_for_sk) and order of parameters is confusing.
> > >
> > > I always prefer this kind of ordering/names :
> > >
> > > void sk_skb_reason_drop( [struct net *net ] // not relevant here, but
> > > to expand the rationale
> > > struct sock *sk, struct sk_buff *skb, enum skb_drop_reason reason)
> > >
> > > Looking at the name, we immediately see the parameter order.
> > >
> > > The consume one (no @reason there) would be called
> > >
> > > void sk_skb_consume(struct sock *sk, struct sk_buff *skb);
> >
> > I was intending to keep the "kfree_skb" prefix initially since it
> > would appear less surprising to kernel developers who used kfree_skb
> > and kfree_skb_reason. But your points do make good sense. How about
> > "kfree_sk_skb_reason" and "consume_sk_skb" here?
> >
>
> IMO kfree_skb() and consume_skb() were a wrong choice. We have to live
> with them.
>
> It should have been skb_free(), skb_consume(), skb_alloc(),
> to be consistent.
>
> Following (partial) list was much better:
>
> skb_add_rx_frag_netmem, skb_coalesce_rx_frag, skb_pp_cow_data,
> skb_cow_data_for_xdp,
> skb_dump, skb_tx_error, skb_morph, skb_zerocopy_iter_stream, skb_copy_ubufs,
> skb_clone, skb_headers_offset_update, skb_copy_header, skb_copy,
> skb_realloc_headroom, skb_expand_head, skb_copy_expand, skb_put,
> skb_push, skb_pull, skb_pull_data, skb_trim, skb_copy_bits,
> skb_splice_bits, skb_send_sock_locked, skb_store_bits,
> skb_checksum, skb_copy_and_csum_bits, skb_zerocopy_headlen,
> skb_zerocopy, skb_copy_and_csum_dev, skb_dequeue,
> skb_dequeue_tail, skb_queue_purge_reason, skb_errqueue_purge,
> skb_queue_head, skb_queue_tail, skb_unlink, skb_append,
> skb_split, skb_prepare_seq_read, skb_seq_read, skb_abort_seq_read,
> skb_find_text, skb_append_pagefrags, skb_pull_rcsum, skb_segment_list,
> skb_segment, skb_to_sgvec, skb_to_sgvec_nomark, skb_cow_data, skb_clone_sk,
> skb_complete_tx_timestamp, skb_tstamp_tx, skb_complete_wifi_ack,
> skb_partial_csum_set, skb_checksum_setup, skb_checksum_trimmed,
> skb_try_coalesce, skb_scrub_packet, skb_vlan_untag, skb_ensure_writable,
> skb_ensure_writable_head_tail, skb_vlan_pop, skb_vlan_push, skb_eth_pop,
> skb_eth_push, skb_mpls_push, skb_mpls_pop, skb_mpls_update_lse,
> skb_mpls_dec_ttl, skb_condense, skb_ext_add, skb_splice_from_iter
>
> (just to make my point very very clear)
>
> Instead we have a myriad of functions with illogical parameter
> ordering vs their names.
>
> I see no reason to add more confusion for new helpers.

ACK. Thanks for clarifying.

Yan

2024-05-31 19:18:54

by Eric Dumazet

[permalink] [raw]
Subject: Re: [RFC net-next 1/6] net: add kfree_skb_for_sk function

On Fri, May 31, 2024 at 9:05 PM Yan Zhai <[email protected]> wrote:
>

> I don't quite follow it. I think this specific commit using skb->cb is
> unnecessary so I am going to re-work it. As you initially mentioned,
> maybe I should just extend kfree_skb tracepoint. I saw a similar
> change dd1b527831a3("net: add location to trace_consume_skb()"), is it
> something I might follow, or do you specifically mean changes like
> this can annoy stable teams?
>

I do not think trace points arguments are put in stone.

If they were, I would nack the addition of new tracepoints, to prevent
ossification.