From: Peilin Ye <[email protected]>
Traffic redirected by bpf_redirect_peer() (used by recent CNIs like
Cilium) is not accounted for in the RX stats of veth devices, confusing
user space metrics collectors such as cAdvisor [1], as reported by
Youlun.
Currently veth devices use the @lstats per-CPU counters, which only
cover TX traffic. veth_get_stats64() actually collects RX stats of a
veth device from its peer's TX (@lstats) counters, based on the
assumption that a veth device can _only_ receive packets from its peer,
which is no longer true.
Instead, use @tstats to maintain both per-CPU RX and TX traffic counters
for each veth device, and count bpf_redirect_peer() traffic in
skb_do_redirect().
veth_stats_rx() might need a name change (perhaps to "veth_stats_xdp()")
for less confusion, but let's leave it to a separate patch to keep this
fix minimal.
[1] Specifically, the "container_network_receive_{byte,packet}s_total"
counters are affected.
Reported-by: Youlun Zhang <[email protected]>
Fixes: 9aa1206e8f48 ("bpf: Add redirect_peer helper")
Cc: Jiang Wang <[email protected]>
Signed-off-by: Peilin Ye <[email protected]>
---
drivers/net/veth.c | 36 ++++++++++++++----------------------
net/core/filter.c | 1 +
2 files changed, 15 insertions(+), 22 deletions(-)
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9980517ed8b0..df7a7c21a46d 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -373,7 +373,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
skb_tx_timestamp(skb);
if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
if (!use_napi)
- dev_lstats_add(dev, length);
+ dev_sw_netstats_tx_add(dev, 1, length);
else
__veth_xdp_flush(rq);
} else {
@@ -387,14 +387,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
return ret;
}
-static u64 veth_stats_tx(struct net_device *dev, u64 *packets, u64 *bytes)
-{
- struct veth_priv *priv = netdev_priv(dev);
-
- dev_lstats_read(dev, packets, bytes);
- return atomic64_read(&priv->dropped);
-}
-
static void veth_stats_rx(struct veth_stats *result, struct net_device *dev)
{
struct veth_priv *priv = netdev_priv(dev);
@@ -432,24 +424,24 @@ static void veth_get_stats64(struct net_device *dev,
struct veth_priv *priv = netdev_priv(dev);
struct net_device *peer;
struct veth_stats rx;
- u64 packets, bytes;
- tot->tx_dropped = veth_stats_tx(dev, &packets, &bytes);
- tot->tx_bytes = bytes;
- tot->tx_packets = packets;
+ tot->tx_dropped = atomic64_read(&priv->dropped);
+ dev_fetch_sw_netstats(tot, dev->tstats);
veth_stats_rx(&rx, dev);
tot->tx_dropped += rx.xdp_tx_err;
tot->rx_dropped = rx.rx_drops + rx.peer_tq_xdp_xmit_err;
- tot->rx_bytes = rx.xdp_bytes;
- tot->rx_packets = rx.xdp_packets;
+ tot->rx_bytes += rx.xdp_bytes;
+ tot->rx_packets += rx.xdp_packets;
rcu_read_lock();
peer = rcu_dereference(priv->peer);
if (peer) {
- veth_stats_tx(peer, &packets, &bytes);
- tot->rx_bytes += bytes;
- tot->rx_packets += packets;
+ struct rtnl_link_stats64 tot_peer = {};
+
+ dev_fetch_sw_netstats(&tot_peer, peer->tstats);
+ tot->rx_bytes += tot_peer.tx_bytes;
+ tot->rx_packets += tot_peer.tx_packets;
veth_stats_rx(&rx, peer);
tot->tx_dropped += rx.peer_tq_xdp_xmit_err;
@@ -1508,13 +1500,13 @@ static int veth_dev_init(struct net_device *dev)
{
int err;
- dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
- if (!dev->lstats)
+ dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+ if (!dev->tstats)
return -ENOMEM;
err = veth_alloc_queues(dev);
if (err) {
- free_percpu(dev->lstats);
+ free_percpu(dev->tstats);
return err;
}
@@ -1524,7 +1516,7 @@ static int veth_dev_init(struct net_device *dev)
static void veth_dev_free(struct net_device *dev)
{
veth_free_queues(dev);
- free_percpu(dev->lstats);
+ free_percpu(dev->tstats);
}
#ifdef CONFIG_NET_POLL_CONTROLLER
diff --git a/net/core/filter.c b/net/core/filter.c
index 21d75108c2e9..7aca28b7d0fd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
net_eq(net, dev_net(dev))))
goto out_drop;
skb->dev = dev;
+ dev_sw_netstats_rx_add(dev, skb->len);
return -EAGAIN;
}
return flags & BPF_F_NEIGH ?
--
2.20.1
Hi all,
On Fri, Oct 27, 2023 at 06:46:57PM +0000, Peilin Ye wrote:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 21d75108c2e9..7aca28b7d0fd 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
> net_eq(net, dev_net(dev))))
> goto out_drop;
> skb->dev = dev;
> + dev_sw_netstats_rx_add(dev, skb->len);
This assumes that all devices that support BPF_F_PEER (currently only
veth) use tstats (instead of lstats, or dstats) - is that okay?
If not, should I add another NDO e.g. ->ndo_stats_rx_add()?
Thanks,
Peilin Ye
Hi Peilin,
On 10/27/23 9:02 PM, Peilin Ye wrote:
> On Fri, Oct 27, 2023 at 06:46:57PM +0000, Peilin Ye wrote:
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 21d75108c2e9..7aca28b7d0fd 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
>> net_eq(net, dev_net(dev))))
>> goto out_drop;
>> skb->dev = dev;
>> + dev_sw_netstats_rx_add(dev, skb->len);
>
> This assumes that all devices that support BPF_F_PEER (currently only
> veth) use tstats (instead of lstats, or dstats) - is that okay?
Dumb question, but why all this change and not simply just call ...
dev_lstats_add(dev, skb->len)
... on the host dev ?
> If not, should I add another NDO e.g. ->ndo_stats_rx_add()?
Definitely no new stats ndo resp indirect call in fast path.
Thanks,
Daniel
Hi Daniel,
Thanks for taking a look!
On Sat, Oct 28, 2023 at 09:06:44AM +0200, Daniel Borkmann wrote:
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 21d75108c2e9..7aca28b7d0fd 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
> > > net_eq(net, dev_net(dev))))
> > > goto out_drop;
> > > skb->dev = dev;
> > > + dev_sw_netstats_rx_add(dev, skb->len);
> >
> > This assumes that all devices that support BPF_F_PEER (currently only
> > veth) use tstats (instead of lstats, or dstats) - is that okay?
>
> Dumb question, but why all this change and not simply just call ...
>
> dev_lstats_add(dev, skb->len)
>
> ... on the host dev ?
Since I didn't want to update host-veth's TX counters. If we
bpf_redirect_peer()ed a packet from NIC TC ingress to Pod-veth TC ingress,
I think it means we've bypassed host-veth TX?
> > If not, should I add another NDO e.g. ->ndo_stats_rx_add()?
>
> Definitely no new stats ndo resp indirect call in fast path.
Yeah, I think I'll put a comment saying that all devices that support
BPF_F_PEER must use tstats (or must use lstats), then.
Thanks,
Peilin Ye
On 10/29/23 1:11 AM, Peilin Ye wrote:
> On Sat, Oct 28, 2023 at 09:06:44AM +0200, Daniel Borkmann wrote:
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index 21d75108c2e9..7aca28b7d0fd 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
>>>> net_eq(net, dev_net(dev))))
>>>> goto out_drop;
>>>> skb->dev = dev;
>>>> + dev_sw_netstats_rx_add(dev, skb->len);
>>>
>>> This assumes that all devices that support BPF_F_PEER (currently only
>>> veth) use tstats (instead of lstats, or dstats) - is that okay?
>>
>> Dumb question, but why all this change and not simply just call ...
>>
>> dev_lstats_add(dev, skb->len)
>>
>> ... on the host dev ?
>
> Since I didn't want to update host-veth's TX counters. If we
> bpf_redirect_peer()ed a packet from NIC TC ingress to Pod-veth TC ingress,
> I think it means we've bypassed host-veth TX?
Yes. So the idea is to transition to tstats replace the location where
we used to bump lstats with tstat's tx counter, and only the peer redirect
would bump the rx counter.. then upon stats traversal we fold the latter into
the rx stats which was populated by the opposite's tx counters. Makes sense.
OT: does cadvisor run inside the Pod to collect the device stats? Just
curious how it gathers them.
>>> If not, should I add another NDO e.g. ->ndo_stats_rx_add()?
>>
>> Definitely no new stats ndo resp indirect call in fast path.
>
> Yeah, I think I'll put a comment saying that all devices that support
> BPF_F_PEER must use tstats (or must use lstats), then.
sgtm.
Thanks,
Daniel
On Mon, Oct 30, 2023 at 03:19:26PM +0100, Daniel Borkmann wrote:
> OT: does cadvisor run inside the Pod to collect the device stats? Just
> curious how it gathers them.
Seems like it reads from /proc/$PID/net/dev:
https://github.com/google/cadvisor/blob/d19df4f3b8e73c7e807d4b9894c252b54852fa8a/container/libcontainer/handler.go#L461
Thanks,
Peilin Ye
On Mon, 30 Oct 2023 15:19:26 +0100 Daniel Borkmann wrote:
> > Since I didn't want to update host-veth's TX counters. If we
> > bpf_redirect_peer()ed a packet from NIC TC ingress to Pod-veth TC ingress,
> > I think it means we've bypassed host-veth TX?
>
> Yes. So the idea is to transition to tstats replace the location where
> we used to bump lstats with tstat's tx counter, and only the peer redirect
> would bump the rx counter.. then upon stats traversal we fold the latter into
> the rx stats which was populated by the opposite's tx counters. Makes sense.
>
> OT: does cadvisor run inside the Pod to collect the device stats? Just
> curious how it gathers them.
Somewhat related - where does netkit count stats?
> >> Definitely no new stats ndo resp indirect call in fast path.
> >
> > Yeah, I think I'll put a comment saying that all devices that support
> > BPF_F_PEER must use tstats (or must use lstats), then.
>
> sgtm.
Is comment good enough? Can we try to do something more robust?
Move the allocation of stats into the core at registration based
on some u8 assigned in the driver? (I haven't looked at the code TBH)
On 10/31/23 8:53 PM, Jakub Kicinski wrote:
> On Mon, 30 Oct 2023 15:19:26 +0100 Daniel Borkmann wrote:
>>> Since I didn't want to update host-veth's TX counters. If we
>>> bpf_redirect_peer()ed a packet from NIC TC ingress to Pod-veth TC ingress,
>>> I think it means we've bypassed host-veth TX?
>>
>> Yes. So the idea is to transition to tstats replace the location where
>> we used to bump lstats with tstat's tx counter, and only the peer redirect
>> would bump the rx counter.. then upon stats traversal we fold the latter into
>> the rx stats which was populated by the opposite's tx counters. Makes sense.
>>
>> OT: does cadvisor run inside the Pod to collect the device stats? Just
>> curious how it gathers them.
>
> Somewhat related - where does netkit count stats?
Yeap, it needs it as well, I have a local branch here where I pushed all
of it - coming out soon; I was planning to add some selftests in addition
till end of this week:
https://github.com/cilium/linux/commits/pr/ndo_peer
>>>> Definitely no new stats ndo resp indirect call in fast path.
>>>
>>> Yeah, I think I'll put a comment saying that all devices that support
>>> BPF_F_PEER must use tstats (or must use lstats), then.
>>
>> sgtm.
>
> Is comment good enough? Can we try to do something more robust?
> Move the allocation of stats into the core at registration based
> on some u8 assigned in the driver? (I haven't looked at the code TBH)
Hm, not sure. One thing that comes to mind is lazy one-time allocation
like in case of netdev_core_stats_inc(), so whenever one of the helpers
like dev_sw_netstats_{rx,tx}_add() are called and dev->tstats are still
NULL, the core knows about the driver's intent, but tbh that doesn't
feel overly clean and in case of netdev_core_stats_inc() it's more in
the exception case rather than fast-path.
Other option could be to have two small helpers in the core which then
set a flag as well:
static inline int netdev_tstats_alloc(struct net_device *dev)
{
dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
if (!dev->tstats)
return -ENOMEM;
dev->priv_flags |= IFF_USES_TSTATS;
return 0;
}
static inline void netdev_tstats_free(struct net_device *dev)
{
free_percpu(dev->tstats);
}
They can then be used from .ndo_init/uninit - not sure if this would
be overall nicer.. or just leaving it at the .ndo callback comment for
the time being until really more users show up (which I doubt tbh).
Thanks,
Daniel