2024-06-04 21:47:48

by Yan Zhai

[permalink] [raw]
Subject: [RFC v3 net-next 0/7] net: pass receive socket to drop tracepoint

We set up our production packet drop monitoring around the kfree_skb
tracepoint. While this tracepoint is extremely valuable for diagnosing
critical problems, it also has some limitation with drops on the local
receive path: this tracepoint can only inspect the dropped skb itself,
but such skb might not carry enough information to:

1. determine in which netns/container this skb gets dropped
2. determine by which socket/service this skb oughts to be received

The 1st issue is because skb->dev is the only member field with valid
netns reference. But skb->dev can get cleared or reused. For example,
tcp_v4_rcv will clear skb->dev and in later processing it might be reused
for OFO tree.

The 2nd issue is because there is no reference on an skb that reliably
points to a receiving socket. skb->sk usually points to the local
sending socket, and it only points to a receive socket briefly after
early demux stage, yet the socket can get stolen later. For certain drop
reason like TCP OFO_MERGE, Zerowindow, UDP at PROTO_MEM error, etc, it
is hard to infer which receiving socket is impacted. This cannot be
overcome by simply looking at the packet header, because of
complications like sk lookup programs. In the past, single purpose
tracepoints like trace_udp_fail_queue_rcv_skb, trace_sock_rcvqueue_full,
etc are added as needed to provide more visibility. This could be
handled in a more generic way.

In this change set we propose a new 'sk_skb_reason_drop' call as a drop-in
replacement for kfree_skb_reason at various local input path. It accepts
an extra receiving socket argument. Both issues above can be resolved
via this new argument.

V2->V3: fixed drop_monitor function signatures; fixed a few uninitialized sks;
Added a few missing report tags from test bots (also noticed by Dan
Carpenter and Simon Horman).

V1->V2: instead of using skb->cb, directly add the needed argument to
trace_kfree_skb tracepoint. Also renamed functions as Eric Dumazet
suggested.

V2: https://lore.kernel.org/linux-kernel/[email protected]/
V1: https://lore.kernel.org/netdev/[email protected]/

Yan Zhai (7):
net: add rx_sk to trace_kfree_skb
net: introduce sk_skb_reason_drop function
ping: use sk_skb_reason_drop to free rx packets
net: raw: use sk_skb_reason_drop to free rx packets
tcp: use sk_skb_reason_drop to free rx packets
udp: use sk_skb_reason_drop to free rx packets
af_packet: use sk_skb_reason_drop to free rx packets

include/linux/skbuff.h | 10 ++++++++--
include/trace/events/skb.h | 11 +++++++----
net/core/dev.c | 2 +-
net/core/drop_monitor.c | 9 ++++++---
net/core/skbuff.c | 22 ++++++++++++----------
net/ipv4/ping.c | 2 +-
net/ipv4/raw.c | 4 ++--
net/ipv4/syncookies.c | 2 +-
net/ipv4/tcp_input.c | 2 +-
net/ipv4/tcp_ipv4.c | 6 +++---
net/ipv4/udp.c | 10 +++++-----
net/ipv6/raw.c | 8 ++++----
net/ipv6/syncookies.c | 2 +-
net/ipv6/tcp_ipv6.c | 6 +++---
net/ipv6/udp.c | 10 +++++-----
net/packet/af_packet.c | 10 +++++-----
16 files changed, 65 insertions(+), 51 deletions(-)

--
2.30.2




2024-06-04 21:48:08

by Yan Zhai

[permalink] [raw]
Subject: [RFC v3 net-next 1/7] net: add rx_sk to trace_kfree_skb

skb does not include enough information to find out receiving
sockets/services and netns/containers on packet drops. In theory
skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
stack for OOO packet lookup. Similarly, skb->sk often identifies a local
sender, and tells nothing about a receiver.

Allow passing an extra receiving socket to the tracepoint to improve
the visibility on receiving drops.

Signed-off-by: Yan Zhai <[email protected]>
---
v2->v3: fixed drop_monitor function prototype
---
include/trace/events/skb.h | 11 +++++++----
net/core/dev.c | 2 +-
net/core/drop_monitor.c | 9 ++++++---
net/core/skbuff.c | 2 +-
4 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 07e0715628ec..aa6b46b6172c 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -24,15 +24,16 @@ DEFINE_DROP_REASON(FN, FN)
TRACE_EVENT(kfree_skb,

TP_PROTO(struct sk_buff *skb, void *location,
- enum skb_drop_reason reason),
+ enum skb_drop_reason reason, struct sock *rx_sk),

- TP_ARGS(skb, location, reason),
+ TP_ARGS(skb, location, reason, rx_sk),

TP_STRUCT__entry(
__field(void *, skbaddr)
__field(void *, location)
__field(unsigned short, protocol)
__field(enum skb_drop_reason, reason)
+ __field(void *, rx_skaddr)
),

TP_fast_assign(
@@ -40,12 +41,14 @@ TRACE_EVENT(kfree_skb,
__entry->location = location;
__entry->protocol = ntohs(skb->protocol);
__entry->reason = reason;
+ __entry->rx_skaddr = rx_sk;
),

- TP_printk("skbaddr=%p protocol=%u location=%pS reason: %s",
+ TP_printk("skbaddr=%p protocol=%u location=%pS reason: %s rx_skaddr=%p",
__entry->skbaddr, __entry->protocol, __entry->location,
__print_symbolic(__entry->reason,
- DEFINE_DROP_REASON(FN, FNe)))
+ DEFINE_DROP_REASON(FN, FNe)),
+ __entry->rx_skaddr)
);

#undef FN
diff --git a/net/core/dev.c b/net/core/dev.c
index 85fe8138f3e4..7844227ecbfd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5233,7 +5233,7 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
trace_consume_skb(skb, net_tx_action);
else
trace_kfree_skb(skb, net_tx_action,
- get_kfree_skb_cb(skb)->reason);
+ get_kfree_skb_cb(skb)->reason, NULL);

if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
__kfree_skb(skb);
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 430ed18f8584..2e0ae3328232 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -109,7 +109,8 @@ static u32 net_dm_queue_len = 1000;
struct net_dm_alert_ops {
void (*kfree_skb_probe)(void *ignore, struct sk_buff *skb,
void *location,
- enum skb_drop_reason reason);
+ enum skb_drop_reason reason,
+ struct sock *rx_sk);
void (*napi_poll_probe)(void *ignore, struct napi_struct *napi,
int work, int budget);
void (*work_item_func)(struct work_struct *work);
@@ -264,7 +265,8 @@ static void trace_drop_common(struct sk_buff *skb, void *location)

static void trace_kfree_skb_hit(void *ignore, struct sk_buff *skb,
void *location,
- enum skb_drop_reason reason)
+ enum skb_drop_reason reason,
+ struct sock *rx_sk)
{
trace_drop_common(skb, location);
}
@@ -491,7 +493,8 @@ static const struct net_dm_alert_ops net_dm_alert_summary_ops = {
static void net_dm_packet_trace_kfree_skb_hit(void *ignore,
struct sk_buff *skb,
void *location,
- enum skb_drop_reason reason)
+ enum skb_drop_reason reason,
+ struct sock *rx_sk)
{
ktime_t tstamp = ktime_get_real();
struct per_cpu_dm_data *data;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 466999a7515e..2854afdd713f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1203,7 +1203,7 @@ bool __kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
if (reason == SKB_CONSUMED)
trace_consume_skb(skb, __builtin_return_address(0));
else
- trace_kfree_skb(skb, __builtin_return_address(0), reason);
+ trace_kfree_skb(skb, __builtin_return_address(0), reason, NULL);
return true;
}

--
2.30.2



2024-06-04 21:48:26

by Yan Zhai

[permalink] [raw]
Subject: [RFC v3 net-next 2/7] net: introduce sk_skb_reason_drop function

Long used destructors kfree_skb and kfree_skb_reason do not pass
receiving socket to packet drop tracepoints trace_kfree_skb.
This makes it hard to track packet drops of a certain netns (container)
or a socket (user application).

The naming of these destructors are also not consistent with most sk/skb
operating functions, i.e. functions named "sk_xxx" or "skb_xxx".
Introduce a new functions sk_skb_reason_drop as drop-in replacement for
kfree_skb_reason on local receiving path. Callers can now pass receiving
sockets to the tracepoints.

kfree_skb and kfree_skb_reason are still usable but they are now just
inline helpers that call sk_skb_reason_drop.

Note it is not feasible to do the same to consume_skb. Packets not
dropped can flow through multiple receive handlers, and have multiple
receiving sockets. Leave it untouched for now.

Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Yan Zhai <[email protected]>
---
v1->v2: changes function names to be more consistent with common sk/skb
operations
---
include/linux/skbuff.h | 10 ++++++++--
net/core/skbuff.c | 22 ++++++++++++----------
2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fe7d8dbef77e..c479a2515a62 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1251,8 +1251,14 @@ static inline bool skb_data_unref(const struct sk_buff *skb,
return true;
}

-void __fix_address
-kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason);
+void __fix_address sk_skb_reason_drop(struct sock *sk, struct sk_buff *skb,
+ enum skb_drop_reason reason);
+
+static inline void
+kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
+{
+ sk_skb_reason_drop(NULL, skb, reason);
+}

/**
* kfree_skb - free an sk_buff with 'NOT_SPECIFIED' reason
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2854afdd713f..9def11fe42c4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1190,7 +1190,8 @@ void __kfree_skb(struct sk_buff *skb)
EXPORT_SYMBOL(__kfree_skb);

static __always_inline
-bool __kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
+bool __sk_skb_reason_drop(struct sock *sk, struct sk_buff *skb,
+ enum skb_drop_reason reason)
{
if (unlikely(!skb_unref(skb)))
return false;
@@ -1203,26 +1204,27 @@ bool __kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
if (reason == SKB_CONSUMED)
trace_consume_skb(skb, __builtin_return_address(0));
else
- trace_kfree_skb(skb, __builtin_return_address(0), reason, NULL);
+ trace_kfree_skb(skb, __builtin_return_address(0), reason, sk);
return true;
}

/**
- * kfree_skb_reason - free an sk_buff with special reason
+ * sk_skb_reason_drop - free an sk_buff with special reason
+ * @sk: the socket to receive @skb, or NULL if not applicable
* @skb: buffer to free
* @reason: reason why this skb is dropped
*
- * Drop a reference to the buffer and free it if the usage count has
- * hit zero. Meanwhile, pass the drop reason to 'kfree_skb'
- * tracepoint.
+ * Drop a reference to the buffer and free it if the usage count has hit
+ * zero. Meanwhile, pass the receiving socket and drop reason to
+ * 'kfree_skb' tracepoint.
*/
void __fix_address
-kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
+sk_skb_reason_drop(struct sock *sk, struct sk_buff *skb, enum skb_drop_reason reason)
{
- if (__kfree_skb_reason(skb, reason))
+ if (__sk_skb_reason_drop(sk, skb, reason))
__kfree_skb(skb);
}
-EXPORT_SYMBOL(kfree_skb_reason);
+EXPORT_SYMBOL(sk_skb_reason_drop);

#define KFREE_SKB_BULK_SIZE 16

@@ -1261,7 +1263,7 @@ kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
while (segs) {
struct sk_buff *next = segs->next;

- if (__kfree_skb_reason(segs, reason)) {
+ if (__sk_skb_reason_drop(NULL, segs, reason)) {
skb_poison_list(segs);
kfree_skb_add_bulk(segs, &sa, reason);
}
--
2.30.2



2024-06-04 21:48:43

by Yan Zhai

[permalink] [raw]
Subject: [RFC v3 net-next 3/7] ping: use sk_skb_reason_drop to free rx packets

Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.

Signed-off-by: Yan Zhai <[email protected]>
---
net/ipv4/ping.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 823306487a82..619ddc087957 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -946,7 +946,7 @@ static enum skb_drop_reason __ping_queue_rcv_skb(struct sock *sk,
pr_debug("ping_queue_rcv_skb(sk=%p,sk->num=%d,skb=%p)\n",
inet_sk(sk), inet_sk(sk)->inet_num, skb);
if (sock_queue_rcv_skb_reason(sk, skb, &reason) < 0) {
- kfree_skb_reason(skb, reason);
+ sk_skb_reason_drop(sk, skb, reason);
pr_debug("ping_queue_rcv_skb -> failed\n");
return reason;
}
--
2.30.2



2024-06-04 21:49:00

by Yan Zhai

[permalink] [raw]
Subject: [RFC v3 net-next 4/7] net: raw: use sk_skb_reason_drop to free rx packets

Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.

Signed-off-by: Yan Zhai <[email protected]>
---
net/ipv4/raw.c | 4 ++--
net/ipv6/raw.c | 8 ++++----
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 1a0953650356..474dfd263c8b 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -301,7 +301,7 @@ static int raw_rcv_skb(struct sock *sk, struct sk_buff *skb)

ipv4_pktinfo_prepare(sk, skb, true);
if (sock_queue_rcv_skb_reason(sk, skb, &reason) < 0) {
- kfree_skb_reason(skb, reason);
+ sk_skb_reason_drop(sk, skb, reason);
return NET_RX_DROP;
}

@@ -312,7 +312,7 @@ int raw_rcv(struct sock *sk, struct sk_buff *skb)
{
if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb)) {
atomic_inc(&sk->sk_drops);
- kfree_skb_reason(skb, SKB_DROP_REASON_XFRM_POLICY);
+ sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_XFRM_POLICY);
return NET_RX_DROP;
}
nf_reset_ct(skb);
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index f838366e8256..608fa9d05b55 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -362,14 +362,14 @@ static inline int rawv6_rcv_skb(struct sock *sk, struct sk_buff *skb)
if ((raw6_sk(sk)->checksum || rcu_access_pointer(sk->sk_filter)) &&
skb_checksum_complete(skb)) {
atomic_inc(&sk->sk_drops);
- kfree_skb_reason(skb, SKB_DROP_REASON_SKB_CSUM);
+ sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_SKB_CSUM);
return NET_RX_DROP;
}

/* Charge it to the socket. */
skb_dst_drop(skb);
if (sock_queue_rcv_skb_reason(sk, skb, &reason) < 0) {
- kfree_skb_reason(skb, reason);
+ sk_skb_reason_drop(sk, skb, reason);
return NET_RX_DROP;
}

@@ -390,7 +390,7 @@ int rawv6_rcv(struct sock *sk, struct sk_buff *skb)

if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb)) {
atomic_inc(&sk->sk_drops);
- kfree_skb_reason(skb, SKB_DROP_REASON_XFRM_POLICY);
+ sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_XFRM_POLICY);
return NET_RX_DROP;
}
nf_reset_ct(skb);
@@ -415,7 +415,7 @@ int rawv6_rcv(struct sock *sk, struct sk_buff *skb)
if (inet_test_bit(HDRINCL, sk)) {
if (skb_checksum_complete(skb)) {
atomic_inc(&sk->sk_drops);
- kfree_skb_reason(skb, SKB_DROP_REASON_SKB_CSUM);
+ sk_skb_reason_drop(sk, skb, SKB_DROP_REASON_SKB_CSUM);
return NET_RX_DROP;
}
}
--
2.30.2



2024-06-04 21:49:21

by Yan Zhai

[permalink] [raw]
Subject: [RFC v3 net-next 5/7] tcp: use sk_skb_reason_drop to free rx packets

Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/r/[email protected]/
Signed-off-by: Yan Zhai <[email protected]>
---
v2->v3: added missing report tags
---
net/ipv4/syncookies.c | 2 +-
net/ipv4/tcp_input.c | 2 +-
net/ipv4/tcp_ipv4.c | 6 +++---
net/ipv6/syncookies.c | 2 +-
net/ipv6/tcp_ipv6.c | 6 +++---
5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index b61d36810fe3..1948d15f1f28 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -496,6 +496,6 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
out_free:
reqsk_free(req);
out_drop:
- kfree_skb_reason(skb, reason);
+ sk_skb_reason_drop(sk, skb, reason);
return NULL;
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5aadf64e554d..bedb079de1f0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4859,7 +4859,7 @@ static void tcp_drop_reason(struct sock *sk, struct sk_buff *skb,
enum skb_drop_reason reason)
{
sk_drops_add(sk, skb);
- kfree_skb_reason(skb, reason);
+ sk_skb_reason_drop(sk, skb, reason);
}

/* This one checks to see if we can put data from the
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 041c7eda9abe..f7a046bc4b27 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1939,7 +1939,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
reset:
tcp_v4_send_reset(rsk, skb, sk_rst_convert_drop_reason(reason));
discard:
- kfree_skb_reason(skb, reason);
+ sk_skb_reason_drop(sk, skb, reason);
/* Be careful here. If this function gets more complicated and
* gcc suffers from register pressure on the x86, sk (in %ebx)
* might be destroyed here. This current version compiles correctly,
@@ -2176,8 +2176,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
int dif = inet_iif(skb);
const struct iphdr *iph;
const struct tcphdr *th;
+ struct sock *sk = NULL;
bool refcounted;
- struct sock *sk;
int ret;
u32 isn;

@@ -2376,7 +2376,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
discard_it:
SKB_DR_OR(drop_reason, NOT_SPECIFIED);
/* Discard frame. */
- kfree_skb_reason(skb, drop_reason);
+ sk_skb_reason_drop(sk, skb, drop_reason);
return 0;

discard_and_relse:
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index bfad1e89b6a6..9d83eadd308b 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -275,6 +275,6 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
out_free:
reqsk_free(req);
out_drop:
- kfree_skb_reason(skb, reason);
+ sk_skb_reason_drop(sk, skb, reason);
return NULL;
}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 1ac7502e1bf5..93967accc35d 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1678,7 +1678,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
discard:
if (opt_skb)
__kfree_skb(opt_skb);
- kfree_skb_reason(skb, reason);
+ sk_skb_reason_drop(sk, skb, reason);
return 0;
csum_err:
reason = SKB_DROP_REASON_TCP_CSUM;
@@ -1751,8 +1751,8 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
int dif = inet6_iif(skb);
const struct tcphdr *th;
const struct ipv6hdr *hdr;
+ struct sock *sk = NULL;
bool refcounted;
- struct sock *sk;
int ret;
u32 isn;
struct net *net = dev_net(skb->dev);
@@ -1944,7 +1944,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)

discard_it:
SKB_DR_OR(drop_reason, NOT_SPECIFIED);
- kfree_skb_reason(skb, drop_reason);
+ sk_skb_reason_drop(sk, skb, drop_reason);
return 0;

discard_and_relse:
--
2.30.2



2024-06-04 21:49:38

by Yan Zhai

[permalink] [raw]
Subject: [RFC v3 net-next 6/7] udp: use sk_skb_reason_drop to free rx packets

Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/r/[email protected]/
Signed-off-by: Yan Zhai <[email protected]>
---
v2->v3: added missing report tags
---
net/ipv4/udp.c | 10 +++++-----
net/ipv6/udp.c | 10 +++++-----
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 189c9113fe9a..ecafb1695999 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2074,7 +2074,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
}
UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
trace_udp_fail_queue_rcv_skb(rc, sk, skb);
- kfree_skb_reason(skb, drop_reason);
+ sk_skb_reason_drop(sk, skb, drop_reason);
return -1;
}

@@ -2196,7 +2196,7 @@ static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
drop:
__UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
atomic_inc(&sk->sk_drops);
- kfree_skb_reason(skb, drop_reason);
+ sk_skb_reason_drop(sk, skb, drop_reason);
return -1;
}

@@ -2383,7 +2383,7 @@ static int udp_unicast_rcv_skb(struct sock *sk, struct sk_buff *skb,
int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
int proto)
{
- struct sock *sk;
+ struct sock *sk = NULL;
struct udphdr *uh;
unsigned short ulen;
struct rtable *rt = skb_rtable(skb);
@@ -2460,7 +2460,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
* Hmm. We got an UDP packet to a port to which we
* don't wanna listen. Ignore it.
*/
- kfree_skb_reason(skb, drop_reason);
+ sk_skb_reason_drop(sk, skb, drop_reason);
return 0;

short_packet:
@@ -2485,7 +2485,7 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
__UDP_INC_STATS(net, UDP_MIB_CSUMERRORS, proto == IPPROTO_UDPLITE);
drop:
__UDP_INC_STATS(net, UDP_MIB_INERRORS, proto == IPPROTO_UDPLITE);
- kfree_skb_reason(skb, drop_reason);
+ sk_skb_reason_drop(sk, skb, drop_reason);
return 0;
}

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index c81a07ac0463..b56f0b9f4307 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -673,7 +673,7 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
}
UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
trace_udp_fail_queue_rcv_skb(rc, sk, skb);
- kfree_skb_reason(skb, drop_reason);
+ sk_skb_reason_drop(sk, skb, drop_reason);
return -1;
}

@@ -776,7 +776,7 @@ static int udpv6_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
drop:
__UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
atomic_inc(&sk->sk_drops);
- kfree_skb_reason(skb, drop_reason);
+ sk_skb_reason_drop(sk, skb, drop_reason);
return -1;
}

@@ -940,8 +940,8 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
enum skb_drop_reason reason = SKB_DROP_REASON_NOT_SPECIFIED;
const struct in6_addr *saddr, *daddr;
struct net *net = dev_net(skb->dev);
+ struct sock *sk = NULL;
struct udphdr *uh;
- struct sock *sk;
bool refcounted;
u32 ulen = 0;

@@ -1033,7 +1033,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
__UDP6_INC_STATS(net, UDP_MIB_NOPORTS, proto == IPPROTO_UDPLITE);
icmpv6_send(skb, ICMPV6_DEST_UNREACH, ICMPV6_PORT_UNREACH, 0);

- kfree_skb_reason(skb, reason);
+ sk_skb_reason_drop(sk, skb, reason);
return 0;

short_packet:
@@ -1054,7 +1054,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
__UDP6_INC_STATS(net, UDP_MIB_CSUMERRORS, proto == IPPROTO_UDPLITE);
discard:
__UDP6_INC_STATS(net, UDP_MIB_INERRORS, proto == IPPROTO_UDPLITE);
- kfree_skb_reason(skb, reason);
+ sk_skb_reason_drop(sk, skb, reason);
return 0;
}

--
2.30.2



2024-06-04 21:49:55

by Yan Zhai

[permalink] [raw]
Subject: [RFC v3 net-next 7/7] af_packet: use sk_skb_reason_drop to free rx packets

Replace kfree_skb_reason with sk_skb_reason_drop and pass the receiving
socket to the tracepoint.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/r/[email protected]/
Signed-off-by: Yan Zhai <[email protected]>
---
v2->v3: fixed uninitialized sk, added missing report tags.
---
net/packet/af_packet.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index fce390887591..42d29b8a84fc 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2121,7 +2121,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt, struct net_device *orig_dev)
{
enum skb_drop_reason drop_reason = SKB_CONSUMED;
- struct sock *sk;
+ struct sock *sk = NULL;
struct sockaddr_ll *sll;
struct packet_sock *po;
u8 *skb_head = skb->data;
@@ -2226,7 +2226,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev,
skb->len = skb_len;
}
drop:
- kfree_skb_reason(skb, drop_reason);
+ sk_skb_reason_drop(sk, skb, drop_reason);
return 0;
}

@@ -2234,7 +2234,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
struct packet_type *pt, struct net_device *orig_dev)
{
enum skb_drop_reason drop_reason = SKB_CONSUMED;
- struct sock *sk;
+ struct sock *sk = NULL;
struct packet_sock *po;
struct sockaddr_ll *sll;
union tpacket_uhdr h;
@@ -2494,7 +2494,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
skb->len = skb_len;
}
drop:
- kfree_skb_reason(skb, drop_reason);
+ sk_skb_reason_drop(sk, skb, drop_reason);
return 0;

drop_n_account:
@@ -2503,7 +2503,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
drop_reason = SKB_DROP_REASON_PACKET_SOCK_ERROR;

sk->sk_data_ready(sk);
- kfree_skb_reason(copy_skb, drop_reason);
+ sk_skb_reason_drop(sk, copy_skb, drop_reason);
goto drop_n_restore;
}

--
2.30.2



2024-06-05 23:58:01

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC v3 net-next 1/7] net: add rx_sk to trace_kfree_skb

On Tue, 4 Jun 2024 14:47:38 -0700
Yan Zhai <[email protected]> wrote:

> skb does not include enough information to find out receiving
> sockets/services and netns/containers on packet drops. In theory
> skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
> stack for OOO packet lookup. Similarly, skb->sk often identifies a local
> sender, and tells nothing about a receiver.
>
> Allow passing an extra receiving socket to the tracepoint to improve
> the visibility on receiving drops.
>
> Signed-off-by: Yan Zhai <[email protected]>
> ---
> v2->v3: fixed drop_monitor function prototype
> ---
> include/trace/events/skb.h | 11 +++++++----
> net/core/dev.c | 2 +-
> net/core/drop_monitor.c | 9 ++++++---
> net/core/skbuff.c | 2 +-
> 4 files changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
> index 07e0715628ec..aa6b46b6172c 100644
> --- a/include/trace/events/skb.h
> +++ b/include/trace/events/skb.h
> @@ -24,15 +24,16 @@ DEFINE_DROP_REASON(FN, FN)
> TRACE_EVENT(kfree_skb,
>
> TP_PROTO(struct sk_buff *skb, void *location,
> - enum skb_drop_reason reason),
> + enum skb_drop_reason reason, struct sock *rx_sk),
>
> - TP_ARGS(skb, location, reason),
> + TP_ARGS(skb, location, reason, rx_sk),
>
> TP_STRUCT__entry(
> __field(void *, skbaddr)
> __field(void *, location)
> __field(unsigned short, protocol)
> __field(enum skb_drop_reason, reason)
> + __field(void *, rx_skaddr)

Please add the pointer after the other pointers:

__field(void *, skbaddr)
__field(void *, location)
+ __field(void *, rx_skaddr)
__field(unsigned short, protocol)
__field(enum skb_drop_reason, reason)

otherwise you are adding holes in the ring buffer event.

The TP_STRUCT__entry() is a structure that is saved in the ring buffer. We
want to avoid alignment holes. I also question having a short before the
enum, if the emum is 4 bytes. The short should be at the end.

In fact, looking at the format file, there is a 2 byte hole:

# cat /sys/kernel/tracing/events/skb/kfree_skb/format

name: kfree_skb
ID: 1799
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:void * skbaddr; offset:8; size:8; signed:0;
field:void * location; offset:16; size:8; signed:0;
field:unsigned short protocol; offset:24; size:2; signed:0;
field:enum skb_drop_reason reason; offset:28; size:4; signed:0;

Notice that "protocol" is 2 bytes in size at offset 24, but "reason" starts
at offset 28. This means at offset 26, there's a 2 byte hole.

-- Steve



> ),
>
> TP_fast_assign(
> @@ -40,12 +41,14 @@ TRACE_EVENT(kfree_skb,
> __entry->location = location;
> __entry->protocol = ntohs(skb->protocol);
> __entry->reason = reason;
> + __entry->rx_skaddr = rx_sk;
> ),
>


2024-06-06 15:38:23

by Yan Zhai

[permalink] [raw]
Subject: Re: [RFC v3 net-next 1/7] net: add rx_sk to trace_kfree_skb

On Wed, Jun 5, 2024 at 6:57 PM Steven Rostedt <[email protected]> wrote:
>
> On Tue, 4 Jun 2024 14:47:38 -0700
> Yan Zhai <[email protected]> wrote:
>
> > skb does not include enough information to find out receiving
> > sockets/services and netns/containers on packet drops. In theory
> > skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
> > stack for OOO packet lookup. Similarly, skb->sk often identifies a local
> > sender, and tells nothing about a receiver.
> >
> > Allow passing an extra receiving socket to the tracepoint to improve
> > the visibility on receiving drops.
> >
> > Signed-off-by: Yan Zhai <[email protected]>
> > ---
> > v2->v3: fixed drop_monitor function prototype
> > ---
> > include/trace/events/skb.h | 11 +++++++----
> > net/core/dev.c | 2 +-
> > net/core/drop_monitor.c | 9 ++++++---
> > net/core/skbuff.c | 2 +-
> > 4 files changed, 15 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
> > index 07e0715628ec..aa6b46b6172c 100644
> > --- a/include/trace/events/skb.h
> > +++ b/include/trace/events/skb.h
> > @@ -24,15 +24,16 @@ DEFINE_DROP_REASON(FN, FN)
> > TRACE_EVENT(kfree_skb,
> >
> > TP_PROTO(struct sk_buff *skb, void *location,
> > - enum skb_drop_reason reason),
> > + enum skb_drop_reason reason, struct sock *rx_sk),
> >
> > - TP_ARGS(skb, location, reason),
> > + TP_ARGS(skb, location, reason, rx_sk),
> >
> > TP_STRUCT__entry(
> > __field(void *, skbaddr)
> > __field(void *, location)
> > __field(unsigned short, protocol)
> > __field(enum skb_drop_reason, reason)
> > + __field(void *, rx_skaddr)
>
> Please add the pointer after the other pointers:
>
> __field(void *, skbaddr)
> __field(void *, location)
> + __field(void *, rx_skaddr)
> __field(unsigned short, protocol)
> __field(enum skb_drop_reason, reason)
>
> otherwise you are adding holes in the ring buffer event.
>
> The TP_STRUCT__entry() is a structure that is saved in the ring buffer. We
> want to avoid alignment holes. I also question having a short before the
> enum, if the emum is 4 bytes. The short should be at the end.
>
> In fact, looking at the format file, there is a 2 byte hole:
>
> # cat /sys/kernel/tracing/events/skb/kfree_skb/format
>
> name: kfree_skb
> ID: 1799
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
>
> field:void * skbaddr; offset:8; size:8; signed:0;
> field:void * location; offset:16; size:8; signed:0;
> field:unsigned short protocol; offset:24; size:2; signed:0;
> field:enum skb_drop_reason reason; offset:28; size:4; signed:0;
>
> Notice that "protocol" is 2 bytes in size at offset 24, but "reason" starts
> at offset 28. This means at offset 26, there's a 2 byte hole.
>
The reason I added the pointer as the last argument is trying to
minimize the surprise to existing TP users, because for common ABIs
it's fine to omit later arguments when defining a function, but it
needs change and recompilation if the order of arguments changed.

Looking at the actual format after the change, it does not add a new
hole since protocol and reason are already packed into the same 8-byte
block, so rx_skaddr starts at 8-byte aligned offset:

# cat /sys/kernel/debug/tracing/events/skb/kfree_skb/format
name: kfree_skb
ID: 2260
format:
field:unsigned short common_type; offset:0;
size:2; signed:0;
field:unsigned char common_flags; offset:2;
size:1; signed:0;
field:unsigned char common_preempt_count; offset:3;
size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:void * skbaddr; offset:8; size:8; signed:0;
field:void * location; offset:16; size:8; signed:0;
field:unsigned short protocol; offset:24; size:2; signed:0;
field:enum skb_drop_reason reason; offset:28;
size:4; signed:0;
field:void * rx_skaddr; offset:32; size:8; signed:0;

Do you think we still need to change the order?

Yan


> -- Steve
>
>
>
> > ),
> >
> > TP_fast_assign(
> > @@ -40,12 +41,14 @@ TRACE_EVENT(kfree_skb,
> > __entry->location = location;
> > __entry->protocol = ntohs(skb->protocol);
> > __entry->reason = reason;
> > + __entry->rx_skaddr = rx_sk;
> > ),
> >
>

2024-06-07 14:29:25

by David Ahern

[permalink] [raw]
Subject: Re: [RFC v3 net-next 1/7] net: add rx_sk to trace_kfree_skb

On 6/6/24 9:37 AM, Yan Zhai wrote:
> # cat /sys/kernel/debug/tracing/events/skb/kfree_skb/format
> name: kfree_skb
> ID: 2260
> format:
> field:unsigned short common_type; offset:0;
> size:2; signed:0;
> field:unsigned char common_flags; offset:2;
> size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3;
> size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
>
> field:void * skbaddr; offset:8; size:8; signed:0;
> field:void * location; offset:16; size:8; signed:0;
> field:unsigned short protocol; offset:24; size:2; signed:0;
> field:enum skb_drop_reason reason; offset:28;
> size:4; signed:0;
> field:void * rx_skaddr; offset:32; size:8; signed:0;
>
> Do you think we still need to change the order?
>

yes. Keeping proper order now ensures no holes creep in with later changes.


2024-06-10 17:07:54

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC v3 net-next 1/7] net: add rx_sk to trace_kfree_skb

On Thu, 6 Jun 2024 10:37:46 -0500
Yan Zhai <[email protected]> wrote:

> > name: kfree_skb
> > ID: 1799
> > format:
> > field:unsigned short common_type; offset:0; size:2; signed:0;
> > field:unsigned char common_flags; offset:2; size:1; signed:0;
> > field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> > field:int common_pid; offset:4; size:4; signed:1;
> >
> > field:void * skbaddr; offset:8; size:8; signed:0;
> > field:void * location; offset:16; size:8; signed:0;
> > field:unsigned short protocol; offset:24; size:2; signed:0;
> > field:enum skb_drop_reason reason; offset:28; size:4; signed:0;
> >
> > Notice that "protocol" is 2 bytes in size at offset 24, but "reason" starts
> > at offset 28. This means at offset 26, there's a 2 byte hole.
> >
> The reason I added the pointer as the last argument is trying to
> minimize the surprise to existing TP users, because for common ABIs
> it's fine to omit later arguments when defining a function, but it
> needs change and recompilation if the order of arguments changed.

Nothing should be hard coding the offsets of the fields. This is
exported to user space so that tools can see where the fields are.
That's the purpose of libtraceevent. The fields should be movable and
not affect anything. There should be no need to recompile.

>
> Looking at the actual format after the change, it does not add a new
> hole since protocol and reason are already packed into the same 8-byte
> block, so rx_skaddr starts at 8-byte aligned offset:
>
> # cat /sys/kernel/debug/tracing/events/skb/kfree_skb/format
> name: kfree_skb
> ID: 2260
> format:
> field:unsigned short common_type; offset:0;
> size:2; signed:0;
> field:unsigned char common_flags; offset:2;
> size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3;
> size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
>
> field:void * skbaddr; offset:8; size:8; signed:0;
> field:void * location; offset:16; size:8; signed:0;
> field:unsigned short protocol; offset:24; size:2; signed:0;
> field:enum skb_drop_reason reason; offset:28;
> size:4; signed:0;
> field:void * rx_skaddr; offset:32; size:8; signed:0;
>
> Do you think we still need to change the order?

Up to you, just wanted to point it out.

-- Steve


2024-06-10 20:25:58

by Yan Zhai

[permalink] [raw]
Subject: Re: [RFC v3 net-next 1/7] net: add rx_sk to trace_kfree_skb

On Mon, Jun 10, 2024 at 11:54 AM Steven Rostedt <[email protected]> wrote:
>
> On Thu, 6 Jun 2024 10:37:46 -0500
> Yan Zhai <[email protected]> wrote:
>
> > > name: kfree_skb
> > > ID: 1799
> > > format:
> > > field:unsigned short common_type; offset:0; size:2; signed:0;
> > > field:unsigned char common_flags; offset:2; size:1; signed:0;
> > > field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> > > field:int common_pid; offset:4; size:4; signed:1;
> > >
> > > field:void * skbaddr; offset:8; size:8; signed:0;
> > > field:void * location; offset:16; size:8; signed:0;
> > > field:unsigned short protocol; offset:24; size:2; signed:0;
> > > field:enum skb_drop_reason reason; offset:28; size:4; signed:0;
> > >
> > > Notice that "protocol" is 2 bytes in size at offset 24, but "reason" starts
> > > at offset 28. This means at offset 26, there's a 2 byte hole.
> > >
> > The reason I added the pointer as the last argument is trying to
> > minimize the surprise to existing TP users, because for common ABIs
> > it's fine to omit later arguments when defining a function, but it
> > needs change and recompilation if the order of arguments changed.
>
> Nothing should be hard coding the offsets of the fields. This is
> exported to user space so that tools can see where the fields are.
> That's the purpose of libtraceevent. The fields should be movable and
> not affect anything. There should be no need to recompile.
>
Oh I misunderstood previously. I was also thinking about the argument
order in TP_PROTO, but what you mentioned is just the order in
TP_STRUCT__entry, correct? I'd prefer to not change the argument order
but the struct field order can definitely be aligned better here.

Yan

> >
> > Looking at the actual format after the change, it does not add a new
> > hole since protocol and reason are already packed into the same 8-byte
> > block, so rx_skaddr starts at 8-byte aligned offset:
> >
> > # cat /sys/kernel/debug/tracing/events/skb/kfree_skb/format
> > name: kfree_skb
> > ID: 2260
> > format:
> > field:unsigned short common_type; offset:0;
> > size:2; signed:0;
> > field:unsigned char common_flags; offset:2;
> > size:1; signed:0;
> > field:unsigned char common_preempt_count; offset:3;
> > size:1; signed:0;
> > field:int common_pid; offset:4; size:4; signed:1;
> >
> > field:void * skbaddr; offset:8; size:8; signed:0;
> > field:void * location; offset:16; size:8; signed:0;
> > field:unsigned short protocol; offset:24; size:2; signed:0;
> > field:enum skb_drop_reason reason; offset:28;
> > size:4; signed:0;
> > field:void * rx_skaddr; offset:32; size:8; signed:0;
> >
> > Do you think we still need to change the order?
>
> Up to you, just wanted to point it out.
>
> -- Steve
>