2024-03-07 14:52:38

by Shradha Gupta

[permalink] [raw]
Subject: [PATCH] net :mana : Add per-cpu stats for MANA device

Extend 'ethtool -S' output for mana devices to include per-CPU packet
stats

Built-on: Ubuntu22
Tested-on: Ubuntu22
Signed-off-by: Shradha Gupta <[email protected]>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++++++++++
.../ethernet/microsoft/mana/mana_ethtool.c | 40 ++++++++++++++++++-
include/net/mana/mana.h | 12 ++++++
3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 59287c6e6cee..b27ee6684936 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -224,6 +224,7 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
int gso_hs = 0; /* zero for non-GSO pkts */
u16 txq_idx = skb_get_queue_mapping(skb);
struct gdma_dev *gd = apc->ac->gdma_dev;
+ struct mana_pcpu_stats *pcpu_stats;
bool ipv4 = false, ipv6 = false;
struct mana_tx_package pkg = {};
struct netdev_queue *net_txq;
@@ -234,6 +235,8 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
struct mana_cq *cq;
int err, len;

+ pcpu_stats = this_cpu_ptr(apc->pcpu_stats);
+
if (unlikely(!apc->port_is_up))
goto tx_drop;

@@ -412,6 +415,12 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
tx_stats->bytes += len;
u64_stats_update_end(&tx_stats->syncp);

+ /* Also update the per-CPU stats */
+ u64_stats_update_begin(&pcpu_stats->syncp);
+ pcpu_stats->tx_packets++;
+ pcpu_stats->tx_bytes += len;
+ u64_stats_update_end(&pcpu_stats->syncp);
+
tx_busy:
if (netif_tx_queue_stopped(net_txq) && mana_can_tx(gdma_sq)) {
netif_tx_wake_queue(net_txq);
@@ -425,6 +434,9 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
kfree(pkg.sgl_ptr);
tx_drop_count:
ndev->stats.tx_dropped++;
+ u64_stats_update_begin(&pcpu_stats->syncp);
+ pcpu_stats->tx_dropped++;
+ u64_stats_update_end(&pcpu_stats->syncp);
tx_drop:
dev_kfree_skb_any(skb);
return NETDEV_TX_OK;
@@ -1505,6 +1517,8 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
struct mana_stats_rx *rx_stats = &rxq->stats;
struct net_device *ndev = rxq->ndev;
uint pkt_len = cqe->ppi[0].pkt_len;
+ struct mana_pcpu_stats *pcpu_stats;
+ struct mana_port_context *apc;
u16 rxq_idx = rxq->rxq_idx;
struct napi_struct *napi;
struct xdp_buff xdp = {};
@@ -1512,6 +1526,9 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
u32 hash_value;
u32 act;

+ apc = netdev_priv(ndev);
+ pcpu_stats = this_cpu_ptr(apc->pcpu_stats);
+
rxq->rx_cq.work_done++;
napi = &rxq->rx_cq.napi;

@@ -1570,6 +1587,11 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
rx_stats->xdp_tx++;
u64_stats_update_end(&rx_stats->syncp);

+ u64_stats_update_begin(&pcpu_stats->syncp);
+ pcpu_stats->rx_packets++;
+ pcpu_stats->rx_bytes += pkt_len;
+ u64_stats_update_end(&pcpu_stats->syncp);
+
if (act == XDP_TX) {
skb_set_queue_mapping(skb, rxq_idx);
mana_xdp_tx(skb, ndev);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index ab2413d71f6c..e3aa47ead601 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -83,8 +83,9 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
if (stringset != ETH_SS_STATS)
return -EINVAL;

- return ARRAY_SIZE(mana_eth_stats) + num_queues *
- (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+ return ARRAY_SIZE(mana_eth_stats) +
+ (num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT)) +
+ (num_present_cpus() * (MANA_STATS_RX_PCPU + MANA_STATS_TX_PCPU));
}

static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
@@ -139,6 +140,19 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
sprintf(p, "tx_%d_mana_map_err", i);
p += ETH_GSTRING_LEN;
}
+
+ for (i = 0; i < num_present_cpus(); i++) {
+ sprintf(p, "cpu%d_rx_packets", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "cpu%d_rx_bytes", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "cpu%d_tx_packets", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "cpu%d_tx_bytes", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "cpu%d_tx_dropped", i);
+ p += ETH_GSTRING_LEN;
+ }
}

static void mana_get_ethtool_stats(struct net_device *ndev,
@@ -222,6 +236,28 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
data[i++] = csum_partial;
data[i++] = mana_map_err;
}
+
+ for_each_possible_cpu(q) {
+ const struct mana_pcpu_stats *pcpu_stats =
+ per_cpu_ptr(apc->pcpu_stats, q);
+ u64 rx_packets, rx_bytes, tx_packets, tx_bytes, tx_dropped;
+ unsigned int start;
+
+ do {
+ start = u64_stats_fetch_begin(&pcpu_stats->syncp);
+ rx_packets = pcpu_stats->rx_packets;
+ tx_packets = pcpu_stats->tx_packets;
+ rx_bytes = pcpu_stats->rx_bytes;
+ tx_bytes = pcpu_stats->tx_bytes;
+ tx_dropped = pcpu_stats->tx_dropped;
+ } while (u64_stats_fetch_retry(&pcpu_stats->syncp, start));
+
+ data[i++] = rx_packets;
+ data[i++] = rx_bytes;
+ data[i++] = tx_packets;
+ data[i++] = tx_bytes;
+ data[i++] = tx_dropped;
+ }
}

static int mana_get_rxnfc(struct net_device *ndev, struct ethtool_rxnfc *cmd,
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 76147feb0d10..9a2414ee7f02 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -51,6 +51,8 @@ enum TRI_STATE {
/* Update this count whenever the respective structures are changed */
#define MANA_STATS_RX_COUNT 5
#define MANA_STATS_TX_COUNT 11
+#define MANA_STATS_RX_PCPU 2
+#define MANA_STATS_TX_PCPU 3

struct mana_stats_rx {
u64 packets;
@@ -386,6 +388,15 @@ struct mana_ethtool_stats {
u64 rx_cqe_unknown_type;
};

+struct mana_pcpu_stats {
+ u64 rx_packets;
+ u64 rx_bytes;
+ u64 tx_packets;
+ u64 tx_bytes;
+ u64 tx_dropped;
+ struct u64_stats_sync syncp;
+};
+
struct mana_context {
struct gdma_dev *gdma_dev;

@@ -449,6 +460,7 @@ struct mana_port_context {
bool port_st_save; /* Saved port state */

struct mana_ethtool_stats eth_stats;
+ struct mana_pcpu_stats __percpu *pcpu_stats;
};

netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev);
--
2.34.1



2024-03-07 15:29:38

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Thu, 7 Mar 2024 06:52:12 -0800 Shradha Gupta wrote:
> Extend 'ethtool -S' output for mana devices to include per-CPU packet
> stats

But why? You already have per queue stats.

> Built-on: Ubuntu22
> Tested-on: Ubuntu22

I understand when people mention testing driver changes on particular
SKUs of supported devices, that adds meaningful info. I don't understand
what "built" and "tested" on a distro is supposed to tell us.
Honest question, what is the value of this?

2024-03-07 15:49:36

by Haiyang Zhang

[permalink] [raw]
Subject: RE: [PATCH] net :mana : Add per-cpu stats for MANA device



> -----Original Message-----
> From: Jakub Kicinski <[email protected]>
> Sent: Thursday, March 7, 2024 10:29 AM
> To: Shradha Gupta <[email protected]>
> Cc: [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Eric Dumazet
> <[email protected]>; Paolo Abeni <[email protected]>; Ajay Sharma
> <[email protected]>; Leon Romanovsky <[email protected]>; Thomas
> Gleixner <[email protected]>; Sebastian Andrzej Siewior
> <[email protected]>; KY Srinivasan <[email protected]>; Haiyang Zhang
> <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui
> <[email protected]>; Long Li <[email protected]>; Michael Kelley
> <[email protected]>; Shradha Gupta <[email protected]>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> On Thu, 7 Mar 2024 06:52:12 -0800 Shradha Gupta wrote:
> > Extend 'ethtool -S' output for mana devices to include per-CPU packet
> > stats
>
> But why? You already have per queue stats.
Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat
to analyze the CPU usage by counting the packets and bytes on each CPU.

>
> > Built-on: Ubuntu22
> > Tested-on: Ubuntu22
>
> I understand when people mention testing driver changes on particular
> SKUs of supported devices, that adds meaningful info. I don't understand
> what "built" and "tested" on a distro is supposed to tell us.
> Honest question, what is the value of this?
@Shradha Gupta Please add the tested SKU and test case here. Or, you don't
have to include test info with the patch.

Thanks,
- Haiyang




2024-03-07 16:18:24

by Haiyang Zhang

[permalink] [raw]
Subject: RE: [PATCH] net :mana : Add per-cpu stats for MANA device



> -----Original Message-----
> From: Shradha Gupta <[email protected]>
> Sent: Thursday, March 7, 2024 9:52 AM
> To: [email protected]; [email protected]; linux-
> [email protected]; [email protected]
> Cc: Shradha Gupta <[email protected]>; Eric Dumazet
> <[email protected]>; Jakub Kicinski <[email protected]>; Paolo Abeni
> <[email protected]>; Ajay Sharma <[email protected]>; Leon
> Romanovsky <[email protected]>; Thomas Gleixner <[email protected]>;
> Sebastian Andrzej Siewior <[email protected]>; KY Srinivasan
> <[email protected]>; Haiyang Zhang <[email protected]>; Wei Liu
> <[email protected]>; Dexuan Cui <[email protected]>; Long Li
> <[email protected]>; Michael Kelley <[email protected]>; Shradha
> Gupta <[email protected]>
> Subject: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> Extend 'ethtool -S' output for mana devices to include per-CPU packet
> stats
>
> Built-on: Ubuntu22
> Tested-on: Ubuntu22
> Signed-off-by: Shradha Gupta <[email protected]>
> ---
> diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
> index 76147feb0d10..9a2414ee7f02 100644
> --- a/include/net/mana/mana.h
> +++ b/include/net/mana/mana.h
> @@ -449,6 +460,7 @@ struct mana_port_context {
> bool port_st_save; /* Saved port state */
>
> struct mana_ethtool_stats eth_stats;
> + struct mana_pcpu_stats __percpu *pcpu_stats;

Where is pcpu_stats allocated?
I cannot see any allocation in the patch.
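[Aside, for illustration only: a minimal sketch of the kind of per-port
allocation and cleanup the question is pointing at, using the standard
netdev_alloc_pcpu_stats()/free_percpu() helpers. The placement in the port
init/teardown paths is an assumption, not something taken from the posted
patch.]

/* Hypothetical placement, e.g. in the port init path: */
apc->pcpu_stats = netdev_alloc_pcpu_stats(struct mana_pcpu_stats);
if (!apc->pcpu_stats)
        return -ENOMEM;

/* ...with a matching release in the teardown path: */
free_percpu(apc->pcpu_stats);
apc->pcpu_stats = NULL;

netdev_alloc_pcpu_stats() also runs u64_stats_init() on each CPU's syncp, so
no separate init loop would be needed for the struct as defined in the patch.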

Thanks,
- Haiyang

2024-03-07 17:01:59

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Thu, 7 Mar 2024 15:49:15 +0000 Haiyang Zhang wrote:
> > > Extend 'ethtool -S' output for mana devices to include per-CPU packet
> > > stats
> >
> > But why? You already have per queue stats.
> Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat
> to analyze the CPU usage by counting the packets and bytes on each CPU.

Dynamic is a bit of an exaggeration, right? On a well-configured system
each CPU should use a single queue assigned thru XPS. And for manual
debug bpftrace should serve the purpose quite well.

Please note that you can't use num_present_cpus() to size stats in
ethtool -S , you have to use possible_cpus(), because the retrieval
of the stats is done in a multi-syscall fashion and there are no
explicit lengths in the API. So you must always report all possible
stats, not just currently active :(
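[Sketch of the sizing change being suggested here, for illustration only:
count, strings and values all keyed to possible CPUs so the separate ethtool
syscalls can never disagree. The surrounding context (netdev_priv(),
num_queues) is assumed, not copied from the driver.]

static int mana_get_sset_count(struct net_device *ndev, int stringset)
{
        struct mana_port_context *apc = netdev_priv(ndev);
        unsigned int num_queues = apc->num_queues;

        if (stringset != ETH_SS_STATS)
                return -EINVAL;

        return ARRAY_SIZE(mana_eth_stats) +
               num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT) +
               num_possible_cpus() * (MANA_STATS_RX_PCPU + MANA_STATS_TX_PCPU);
}

The string loop would then also run once per possible CPU, matching the
for_each_possible_cpu() loop already used on the values side.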

2024-03-08 05:31:16

by Shradha Gupta

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

Thanks for the comments, I am sending out a newer version with these fixes.

On Thu, Mar 07, 2024 at 09:01:45AM -0800, Jakub Kicinski wrote:
> On Thu, 7 Mar 2024 15:49:15 +0000 Haiyang Zhang wrote:
> > > > Extend 'ethtool -S' output for mana devices to include per-CPU packet
> > > > stats
> > >
> > > But why? You already have per queue stats.
> > Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat
> > to analyze the CPU usage by counting the packets and bytes on each CPU.
>
> Dynamic is a bit of an exaggeration, right? On a well-configured system
> each CPU should use a single queue assigned thru XPS. And for manual
> debug bpftrace should serve the purpose quite well.
>
> Please note that you can't use num_present_cpus() to size stats in
> ethtool -S , you have to use possible_cpus(), because the retrieval
> of the stats is done in a multi-syscall fashion and there are no
> explicit lengths in the API. So you must always report all possible
> stats, not just currently active :(

2024-03-08 18:52:17

by Haiyang Zhang

[permalink] [raw]
Subject: RE: [PATCH] net :mana : Add per-cpu stats for MANA device



> -----Original Message-----
> From: Jakub Kicinski <[email protected]>
> Sent: Thursday, March 7, 2024 12:02 PM
> To: Haiyang Zhang <[email protected]>
> Cc: Shradha Gupta <[email protected]>; Shradha Gupta
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected];
> [email protected]; Eric Dumazet <[email protected]>; Paolo Abeni
> <[email protected]>; Ajay Sharma <[email protected]>; Leon
> Romanovsky <[email protected]>; Thomas Gleixner <[email protected]>;
> Sebastian Andrzej Siewior <[email protected]>; KY Srinivasan
> <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui
> <[email protected]>; Long Li <[email protected]>; Michael Kelley
> <[email protected]>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> On Thu, 7 Mar 2024 15:49:15 +0000 Haiyang Zhang wrote:
> > > > Extend 'ethtool -S' output for mana devices to include per-CPU
> packet
> > > > stats
> > >
> > > But why? You already have per queue stats.
> > Yes. But the q to cpu binding is dynamic, we also want the per-CPU stat
> > to analyze the CPU usage by counting the packets and bytes on each CPU.
>
> Dynamic is a bit of an exaggeration, right? On a well-configured system
> each CPU should use a single queue assigned thru XPS. And for manual
> debug bpftrace should serve the purpose quite well.

Some programs, like irqbalancer can dynamically change the CPU affinity,
so we want to add the per-CPU counters for better understanding of the CPU
usage.

Thanks,
- Haiyang


2024-03-08 19:23:00

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > each CPU should use a single queue assigned thru XPS. And for manual
> > debug bpftrace should serve the purpose quite well.
>
> Some programs, like irqbalancer can dynamically change the CPU affinity,
> so we want to add the per-CPU counters for better understanding of the CPU
> usage.

Do you have experimental data showing this making a difference
in production?

Seems unlikely, but if it does work we should enable it for all
devices, no driver by driver.

2024-03-08 19:44:16

by Haiyang Zhang

[permalink] [raw]
Subject: RE: [PATCH] net :mana : Add per-cpu stats for MANA device



> -----Original Message-----
> From: Jakub Kicinski <[email protected]>
> Sent: Friday, March 8, 2024 2:23 PM
> To: Haiyang Zhang <[email protected]>
> Cc: Shradha Gupta <[email protected]>; Shradha Gupta
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected];
> [email protected]; Eric Dumazet <[email protected]>; Paolo Abeni
> <[email protected]>; Ajay Sharma <[email protected]>; Leon
> Romanovsky <[email protected]>; Thomas Gleixner <[email protected]>;
> Sebastian Andrzej Siewior <[email protected]>; KY Srinivasan
> <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui
> <[email protected]>; Long Li <[email protected]>; Michael Kelley
> <[email protected]>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > > Dynamic is a bit of an exaggeration, right? On a well-configured
> system
> > > each CPU should use a single queue assigned thru XPS. And for manual
> > > debug bpftrace should serve the purpose quite well.
> >
> > Some programs, like irqbalancer can dynamically change the CPU
> affinity,
> > so we want to add the per-CPU counters for better understanding of the
> CPU
> > usage.
>
> Do you have experimental data showing this making a difference
> in production?
Shradha, could you please add some data before / after enabling irqbalancer
which changes cpu affinity?

>
> Seems unlikely, but if it does work we should enable it for all
> devices, no driver by driver.
There are some existing drivers, like mlx, rmnet, netvsc, etc. using percpu
counters. Are you suggesting we add a common API for all drivers?

Thanks,
- Haiyang


2024-03-08 19:55:23

by Rahul Rameshbabu

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device


On Fri, 08 Mar, 2024 19:43:57 +0000 Haiyang Zhang <[email protected]> wrote:
>> -----Original Message-----
>> From: Jakub Kicinski <[email protected]>
>> Sent: Friday, March 8, 2024 2:23 PM
>> To: Haiyang Zhang <[email protected]>
>> Cc: Shradha Gupta <[email protected]>; Shradha Gupta
>> <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected];
>> [email protected]; Eric Dumazet <[email protected]>; Paolo Abeni
>> <[email protected]>; Ajay Sharma <[email protected]>; Leon
>> Romanovsky <[email protected]>; Thomas Gleixner <[email protected]>;
>> Sebastian Andrzej Siewior <[email protected]>; KY Srinivasan
>> <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui
>> <[email protected]>; Long Li <[email protected]>; Michael Kelley
>> <[email protected]>
>> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>>
>> On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
>> > > Dynamic is a bit of an exaggeration, right? On a well-configured
>> system
>> > > each CPU should use a single queue assigned thru XPS. And for manual
>> > > debug bpftrace should serve the purpose quite well.
>> >
>> > Some programs, like irqbalancer can dynamically change the CPU
>> affinity,
>> > so we want to add the per-CPU counters for better understanding of the
>> CPU
>> > usage.
>>
>> Do you have experimental data showing this making a difference
>> in production?
> Shradha, could you please add some data before / after enabling irqbalancer
> which changes cpu affinity?
>
>>
>> Seems unlikely, but if it does work we should enable it for all
>> devices, no driver by driver.
> There are some existing drivers, like mlx, rmnet, netvsc, etc. using percpu
> counters. Are you suggesting we add a common API for all drivers?

Wanted to chime in with regards to mlx. You might be conflating per-cpu
with per-queue. When we run ethtool -S, we present counters per netdev
queue rather than per-cpu. The number of queues we instantiate is
related to the number of CPUs, but it is not always 1:1.

Jakub just recently supported a proper interface for per-queue stats
counters that we are interested in supporting.

https://lore.kernel.org/netdev/[email protected]/

--
Thanks,

Rahul Rameshbabu

2024-03-08 20:28:04

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Fri, 8 Mar 2024 19:43:57 +0000 Haiyang Zhang wrote:
> > Seems unlikely, but if it does work we should enable it for all
> > devices, no driver by driver.
> There are some existing drivers, like mlx, rmnet, netvsc, etc. using percpu
> counters. Are you suggesting we add a common API for all drivers?

Hm, I don't think mlx does. The typical use case for pcpu stats so far
has been software devices which don't have queues, and implement
lockless Tx. In that case instead of recording stats on a local queue
struct we can count per-cpu and not worry about locking.
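[A generic sketch of that software-device pattern, for illustration; it is not
taken from any particular driver, and sw_pcpu_stats/swdev_priv/swdev_xmit are
made-up names.]

struct sw_pcpu_stats {
        u64 tx_packets;
        u64 tx_bytes;
        struct u64_stats_sync syncp;
};

struct swdev_priv {
        struct sw_pcpu_stats __percpu *pcpu_stats;
};

/* Lockless Tx path: only the local CPU's counters are touched, so no
 * per-queue struct and no lock are needed.
 */
static netdev_tx_t swdev_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct swdev_priv *priv = netdev_priv(dev);
        struct sw_pcpu_stats *stats = this_cpu_ptr(priv->pcpu_stats);

        u64_stats_update_begin(&stats->syncp);
        stats->tx_packets++;
        stats->tx_bytes += skb->len;
        u64_stats_update_end(&stats->syncp);

        /* ...actual delivery of the skb elided... */
        consume_skb(skb);
        return NETDEV_TX_OK;
}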

by Sebastian Andrzej Siewior

Subject: Re: RE: [PATCH] net :mana : Add per-cpu stats for MANA device

On 2024-03-08 19:43:57 [+0000], Haiyang Zhang wrote:
> > Do you have experimental data showing this making a difference
> > in production?
> Shradha, could you please add some data before / after enabling irqbalancer
> which changes cpu affinity?

So you have one queue and one interrupt, and then irqbalance pushes the
interrupt from one CPU to another. What kind of information do you gain
from per-CPU counters here?

> Thanks,
> - Haiyang

Sebastian

2024-03-11 04:20:03

by Shradha Gupta

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Fri, Mar 08, 2024 at 11:22:44AM -0800, Jakub Kicinski wrote:
> On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > > each CPU should use a single queue assigned thru XPS. And for manual
> > > debug bpftrace should serve the purpose quite well.
> >
> > Some programs, like irqbalancer can dynamically change the CPU affinity,
> > so we want to add the per-CPU counters for better understanding of the CPU
> > usage.
>
> Do you have experimental data showing this making a difference
> in production?
Sure, will try to get that data for this discussion
>
> Seems unlikely, but if it does work we should enable it for all
> devices, no driver by driver.
You mean, if the usecase seems valid we should try to extend the framework
mentioned by Rahul (https://lore.kernel.org/lkml/[email protected]/)
to include these stats as well?
Will explore this a bit more and update. Thanks.

2024-03-11 15:51:38

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Sun, 10 Mar 2024 21:19:50 -0700 Shradha Gupta wrote:
> > Seems unlikely, but if it does work we should enable it for all
> > devices, no driver by driver.
> You mean, if the usecase seems valid we should try to extend the framework
> mentioned by Rahul (https://lore.kernel.org/lkml/[email protected]/)
> to include these stats as well?

"framework" is a big word, but yes, add a netlink command to get
pcpu stats. Let's focus on the usefulness before investing time
in a rewrite, tho.

2024-03-11 15:57:09

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Sun, 10 Mar 2024 21:19:50 -0700
Shradha Gupta <[email protected]> wrote:

> On Fri, Mar 08, 2024 at 11:22:44AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > > > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > > > each CPU should use a single queue assigned thru XPS. And for manual
> > > > debug bpftrace should serve the purpose quite well.
> > >
> > > Some programs, like irqbalancer can dynamically change the CPU affinity,
> > > so we want to add the per-CPU counters for better understanding of the CPU
> > > usage.
> >
> > Do you have experimental data showing this making a difference
> > in production?
> Sure, will try to get that data for this discussion
> >
> > Seems unlikely, but if it does work we should enable it for all
> > devices, no driver by driver.
> You mean, if the usecase seems valid we should try to extend the framework
> mentioned by Rahul (https://lore.kernel.org/lkml/[email protected]/)
> to include these stats as well?
> Will explore this a bit more and update. Thanks.
>

Remember, statistics aren't free, and even per-cpu stats end up taking cache space.
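[For scale, a rough back-of-the-envelope figure: the struct proposed in this
patch is five u64 counters, i.e. about 40 bytes per possible CPU per port on
64-bit builds where struct u64_stats_sync is empty, so on the order of 20 KB
for a machine with 512 possible CPUs.]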

2024-03-11 16:42:00

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Mon, 11 Mar 2024 08:51:26 -0700
Jakub Kicinski <[email protected]> wrote:

> On Sun, 10 Mar 2024 21:19:50 -0700 Shradha Gupta wrote:
> > > Seems unlikely, but if it does work we should enable it for all
> > > devices, no driver by driver.
> > You mean, if the usecase seems valid we should try to extend the framework
> > mentioned by Rahul (https://lore.kernel.org/lkml/[email protected]/)
> > to include these stats as well?
>
> "framework" is a big word, but yes, add a netlink command to get
> pcpu stats. Let's focus on the usefulness before investing time
> in a rewrite, tho.

Couldn't userspace do the same thing by looking at /proc/interrupts
and PCI info?
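[An illustrative userspace-only sketch of that idea, not code from the thread:
sum the per-CPU interrupt counts of the mana_q* vectors straight out of
/proc/interrupts, which gives the interrupt-side view without any new kernel
counters.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CPUS 1024

int main(void)
{
        unsigned long long totals[MAX_CPUS] = { 0 };
        char line[8192];
        int ncpus = 0, i;
        FILE *f;

        f = fopen("/proc/interrupts", "r");
        if (!f) {
                perror("/proc/interrupts");
                return 1;
        }

        /* Header row has one "CPUn" token per online CPU column. */
        if (fgets(line, sizeof(line), f)) {
                char *tok;

                for (tok = strtok(line, " \t\n"); tok; tok = strtok(NULL, " \t\n"))
                        if (!strncmp(tok, "CPU", 3))
                                ncpus++;
        }
        if (ncpus > MAX_CPUS)
                ncpus = MAX_CPUS;

        /* Accumulate the per-CPU counts of every mana_q* interrupt line. */
        while (fgets(line, sizeof(line), f)) {
                char *p;

                if (!strstr(line, "mana_q"))
                        continue;
                p = strchr(line, ':');
                if (!p)
                        continue;
                p++;
                for (i = 0; i < ncpus; i++)
                        totals[i] += strtoull(p, &p, 10);
        }
        fclose(f);

        for (i = 0; i < ncpus; i++)
                printf("CPU%-4d %llu mana_q interrupts\n", i, totals[i]);
        return 0;
}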

2024-03-14 02:57:36

by Shradha Gupta

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Sun, Mar 10, 2024 at 09:19:50PM -0700, Shradha Gupta wrote:
> On Fri, Mar 08, 2024 at 11:22:44AM -0800, Jakub Kicinski wrote:
> > On Fri, 8 Mar 2024 18:51:58 +0000 Haiyang Zhang wrote:
> > > > Dynamic is a bit of an exaggeration, right? On a well-configured system
> > > > each CPU should use a single queue assigned thru XPS. And for manual
> > > > debug bpftrace should serve the purpose quite well.
> > >
> > > Some programs, like irqbalancer can dynamically change the CPU affinity,
> > > so we want to add the per-CPU counters for better understanding of the CPU
> > > usage.
> >
> > Do you have experimental data showing this making a difference
> > in production?
> Sure, will try to get that data for this discussion
> >
> > Seems unlikely, but if it does work we should enable it for all
> > devices, no driver by driver.
> You mean, if the usecase seems valid we should try to extend the framework
> mentioned by Rahul (https://lore.kernel.org/lkml/[email protected]/)
> to include these stats as well?
> Will explore this a bit more and update. Thanks.

Following is the data we can share:

Default interrupts affinity for each queue:

25:         1       103         0   2989138   Hyper-V PCIe MSI 4138200989697-edge   mana_q0@pci:7870:00:00.0
26:         0         1   4005360         0   Hyper-V PCIe MSI 4138200989698-edge   mana_q1@pci:7870:00:00.0
27:         0         0         1   2997584   Hyper-V PCIe MSI 4138200989699-edge   mana_q2@pci:7870:00:00.0
28:   3565461         0         0         1   Hyper-V PCIe MSI 4138200989700-edge   mana_q3@pci:7870:00:00.0

As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both mapped
to cpu3. From this knowledge we can figure out the total RX stats processed by
each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3. But if
this data changes dynamically using irqbalance or smp_affinity file edits, the
above assumption fails.
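[Worked example from the numbers above and the per-queue stats below: with the
default affinities, the RX packets handled on CPU 3 would be estimated as
rx_0_packets + rx_2_packets = 4,230,152 + 4,458,906 = 8,689,058. Once the
mana_q2 affinity changes, that sum no longer describes CPU 3.]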

Interrupt affinity for mana_q2 changes and the affinity table looks as follows
25:         1       103         0   3038084   Hyper-V PCIe MSI 4138200989697-edge   mana_q0@pci:7870:00:00.0
26:         0         1   4012447         0   Hyper-V PCIe MSI 4138200989698-edge   mana_q1@pci:7870:00:00.0
27:    157181        10         1   3007990   Hyper-V PCIe MSI 4138200989699-edge   mana_q2@pci:7870:00:00.0
28:   3593858         0         0         1   Hyper-V PCIe MSI 4138200989700-edge   mana_q3@pci:7870:00:00.0

And during this time we might end up calculating the per-CPU stats incorrectly,
messing up the understanding of CPU usage by MANA driver that is consumed by
monitoring services.


Also sharing the existing per-queue stats during this experiment, in case needed

Per-queue stats before changing CPU-affinities:
tx_cq_err: 0
tx_cqe_unknown_type: 0
rx_coalesced_err: 0
rx_cqe_unknown_type: 0
rx_0_packets: 4230152
rx_0_bytes: 289545167
rx_0_xdp_drop: 0
rx_0_xdp_tx: 0
rx_0_xdp_redirect: 0
rx_1_packets: 4113017
rx_1_bytes: 314552601
rx_1_xdp_drop: 0
rx_1_xdp_tx: 0
rx_1_xdp_redirect: 0
rx_2_packets: 4458906
rx_2_bytes: 305117506
rx_2_xdp_drop: 0
rx_2_xdp_tx: 0
rx_2_xdp_redirect: 0
rx_3_packets: 4619589
rx_3_bytes: 315445084
rx_3_xdp_drop: 0
rx_3_xdp_tx: 0
rx_3_xdp_redirect: 0
hc_tx_err_vport_disabled: 0
hc_tx_err_inval_vportoffset_pkt: 0
hc_tx_err_vlan_enforcement: 0
hc_tx_err_eth_type_enforcement: 0
hc_tx_err_sa_enforcement: 0
hc_tx_err_sqpdid_enforcement: 0
hc_tx_err_cqpdid_enforcement: 0
hc_tx_err_mtu_violation: 0
hc_tx_err_inval_oob: 0
hc_tx_err_gdma: 0
hc_tx_bytes: 126336708121
hc_tx_ucast_pkts: 86748013
hc_tx_ucast_bytes: 126336703775
hc_tx_bcast_pkts: 37
hc_tx_bcast_bytes: 2842
hc_tx_mcast_pkts: 7
hc_tx_mcast_bytes: 1504
tx_0_packets: 5995507
tx_0_bytes: 28749696408
tx_0_xdp_xmit: 0
tx_0_tso_packets: 4719840
tx_0_tso_bytes: 26873844525
tx_0_tso_inner_packets: 0
tx_0_tso_inner_bytes: 0
tx_0_long_pkt_fmt: 0
tx_0_short_pkt_fmt: 5995507
tx_0_csum_partial: 1275621
tx_0_mana_map_err: 0
tx_1_packets: 6653598
tx_1_bytes: 38318341475
tx_1_xdp_xmit: 0
tx_1_tso_packets: 5330921
tx_1_tso_bytes: 36210150488
tx_1_tso_inner_packets: 0
tx_1_tso_inner_bytes: 0
tx_1_long_pkt_fmt: 0
tx_1_short_pkt_fmt: 6653598
tx_1_csum_partial: 1322643
tx_1_mana_map_err: 0
tx_2_packets: 5715246
tx_2_bytes: 25662283686
tx_2_xdp_xmit: 0
tx_2_tso_packets: 4619118
tx_2_tso_bytes: 23829680267
tx_2_tso_inner_packets: 0
tx_2_tso_inner_bytes: 0
tx_2_long_pkt_fmt: 0
tx_2_short_pkt_fmt: 5715246
tx_2_csum_partial: 1096092
tx_2_mana_map_err: 0
tx_3_packets: 6175860
tx_3_bytes: 29500667904
tx_3_xdp_xmit: 0
tx_3_tso_packets: 4951591
tx_3_tso_bytes: 27446937448
tx_3_tso_inner_packets: 0
tx_3_tso_inner_bytes: 0
tx_3_long_pkt_fmt: 0
tx_3_short_pkt_fmt: 6175860
tx_3_csum_partial: 1224213
tx_3_mana_map_err: 0

Per-queue stats after changing CPU-affinities:
rx_0_packets: 4781895
rx_0_bytes: 326478061
rx_0_xdp_drop: 0
rx_0_xdp_tx: 0
rx_0_xdp_redirect: 0
rx_1_packets: 4116990
rx_1_bytes: 315439234
rx_1_xdp_drop: 0
rx_1_xdp_tx: 0
rx_1_xdp_redirect: 0
rx_2_packets: 4528800
rx_2_bytes: 310312337
rx_2_xdp_drop: 0
rx_2_xdp_tx: 0
rx_2_xdp_redirect: 0
rx_3_packets: 4622622
rx_3_bytes: 316282431
rx_3_xdp_drop: 0
rx_3_xdp_tx: 0
rx_3_xdp_redirect: 0
tx_0_packets: 5999379
tx_0_bytes: 28750864476
tx_0_xdp_xmit: 0
tx_0_tso_packets: 4720027
tx_0_tso_bytes: 26874344494
tx_0_tso_inner_packets: 0
tx_0_tso_inner_bytes: 0
tx_0_long_pkt_fmt: 0
tx_0_short_pkt_fmt: 5999379
tx_0_csum_partial: 1279296
tx_0_mana_map_err: 0
tx_1_packets: 6656913
tx_1_bytes: 38319355168
tx_1_xdp_xmit: 0
tx_1_tso_packets: 5331086
tx_1_tso_bytes: 36210592040
tx_1_tso_inner_packets: 0
tx_1_tso_inner_bytes: 0
tx_1_long_pkt_fmt: 0
tx_1_short_pkt_fmt: 6656913
tx_1_csum_partial: 1325785
tx_1_mana_map_err: 0
tx_2_packets: 5906172
tx_2_bytes: 36758032245
tx_2_xdp_xmit: 0
tx_2_tso_packets: 4806348
tx_2_tso_bytes: 34912213258
tx_2_tso_inner_packets: 0
tx_2_tso_inner_bytes: 0
tx_2_long_pkt_fmt: 0
tx_2_short_pkt_fmt: 5906172
tx_2_csum_partial: 1099782
tx_2_mana_map_err: 0
tx_3_packets: 6202399
tx_3_bytes: 30840325531
tx_3_xdp_xmit: 0
tx_3_tso_packets: 4973730
tx_3_tso_bytes: 28784371532
tx_3_tso_inner_packets: 0
tx_3_tso_inner_bytes: 0
tx_3_long_pkt_fmt: 0
tx_3_short_pkt_fmt: 6202399
tx_3_csum_partial: 1228603
tx_3_mana_map_err: 0


2024-03-14 03:06:12

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Wed, 13 Mar 2024 19:57:20 -0700
Shradha Gupta <[email protected]> wrote:

> Following is the data we can share:
>
> Default interrupts affinity for each queue:
>
> 25: 1 103 0 2989138 Hyper-V PCIe MSI 4138200989697-edge mana_q0@pci:7870:00:00.0
> 26: 0 1 4005360 0 Hyper-V PCIe MSI 4138200989698-edge mana_q1@pci:7870:00:00.0
> 27: 0 0 1 2997584 Hyper-V PCIe MSI 4138200989699-edge mana_q2@pci:7870:00:00.0
> 28: 3565461 0 0 1 Hyper-V PCIe MSI 4138200989700-edge mana_q3@pci:7870:00:00.0
>
> As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both mapped
> to cpu3. From this knowledge we can figure out the total RX stats processed by
> each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3. But if
> this data changes dynamically using irqbalance or smp_affinity file edits, the
> above assumption fails.


irqbalance is often a bad idea.
In the past, doing one shot balancing at startup was a better plan.

2024-03-14 18:47:40

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Wed, 13 Mar 2024 19:57:20 -0700 Shradha Gupta wrote:
> Default interrupts affinity for each queue:
>
> 25: 1 103 0 2989138 Hyper-V PCIe MSI 4138200989697-edge mana_q0@pci:7870:00:00.0
> 26: 0 1 4005360 0 Hyper-V PCIe MSI 4138200989698-edge mana_q1@pci:7870:00:00.0
> 27: 0 0 1 2997584 Hyper-V PCIe MSI 4138200989699-edge mana_q2@pci:7870:00:00.0
> 28: 3565461 0 0 1 Hyper-V PCIe MSI 4138200989700-edge mana_q3@pci:7870:00:00.0
>
> As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both mapped
> to cpu3. From this knowledge we can figure out the total RX stats processed by
> each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3. But if
> this data changes dynamically using irqbalance or smp_affinity file edits, the
> above assumption fails.
>
> Interrupt affinity for mana_q2 changes and the affinity table looks as follows
> 25: 1 103 0 3038084 Hyper-V PCIe MSI 4138200989697-edge mana_q0@pci:7870:00:00.0
> 26: 0 1 4012447 0 Hyper-V PCIe MSI 4138200989698-edge mana_q1@pci:7870:00:00.0
> 27: 157181 10 1 3007990 Hyper-V PCIe MSI 4138200989699-edge mana_q2@pci:7870:00:00.0
> 28: 3593858 0 0 1 Hyper-V PCIe MSI 4138200989700-edge mana_q3@pci:7870:00:00.0
>
> And during this time we might end up calculating the per-CPU stats incorrectly,
> messing up the understanding of CPU usage by MANA driver that is consumed by
> monitoring services.

Like Stephen said, forget about irqbalance for networking.

Assume that the IRQs are affinitized and XPS set, correctly.

Now, presumably you can use your pcpu stats to "trade queues",
e.g. 4 CPUs / 4 queues, if CPU 0 insists on using queue 1
instead of queue 0, you can swap the 0 <> 1 assignment.
That's just an example of an "algorithm", maybe you have other
use cases. But if the problem is "user runs broken irqbalance"
the solution is not in the kernel...

2024-03-14 19:03:13

by Haiyang Zhang

[permalink] [raw]
Subject: RE: [PATCH] net :mana : Add per-cpu stats for MANA device



> -----Original Message-----
> From: Jakub Kicinski <[email protected]>
> Sent: Thursday, March 14, 2024 2:28 PM
> To: Shradha Gupta <[email protected]>
> Cc: Haiyang Zhang <[email protected]>; Shradha Gupta
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected];
> [email protected]; Eric Dumazet <[email protected]>; Paolo Abeni
> <[email protected]>; Ajay Sharma <[email protected]>; Leon
> Romanovsky <[email protected]>; Thomas Gleixner <[email protected]>;
> Sebastian Andrzej Siewior <[email protected]>; KY Srinivasan
> <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui
> <[email protected]>; Long Li <[email protected]>; Michael Kelley
> <[email protected]>
> Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> On Wed, 13 Mar 2024 19:57:20 -0700 Shradha Gupta wrote:
> > Default interrupts affinity for each queue:
> >
> > 25: 1 103 0 2989138 Hyper-V PCIe MSI 4138200989697-edge mana_q0@pci:7870:00:00.0
> > 26: 0 1 4005360 0 Hyper-V PCIe MSI 4138200989698-edge mana_q1@pci:7870:00:00.0
> > 27: 0 0 1 2997584 Hyper-V PCIe MSI 4138200989699-edge mana_q2@pci:7870:00:00.0
> > 28: 3565461 0 0 1 Hyper-V PCIe MSI 4138200989700-edge mana_q3@pci:7870:00:00.0
> >
> > As seen the CPU-queue mapping is not 1:1, Queue 0 and Queue 2 are both
> mapped
> > to cpu3. From this knowledge we can figure out the total RX stats
> processed by
> > each CPU by adding the values of mana_q0 and mana_q2 stats for cpu3.
> But if
> > this data changes dynamically using irqbalance or smp_affinity file
> edits, the
> > above assumption fails.
> >
> > Interrupt affinity for mana_q2 changes and the affinity table looks as
> follows
> > 25: 1 103 0 3038084 Hyper-V PCIe MSI 4138200989697-edge mana_q0@pci:7870:00:00.0
> > 26: 0 1 4012447 0 Hyper-V PCIe MSI 4138200989698-edge mana_q1@pci:7870:00:00.0
> > 27: 157181 10 1 3007990 Hyper-V PCIe MSI 4138200989699-edge mana_q2@pci:7870:00:00.0
> > 28: 3593858 0 0 1 Hyper-V PCIe MSI 4138200989700-edge mana_q3@pci:7870:00:00.0
> >
> > And during this time we might end up calculating the per-CPU stats
> incorrectly,
> > messing up the understanding of CPU usage by MANA driver that is
> consumed by
> > monitoring services.
>
> Like Stephen said, forget about irqbalance for networking.
>
> Assume that the IRQs are affinitized and XPS set, correctly.
>
> Now, presumably you can use your pcpu stats to "trade queues",
> e.g. 4 CPUs / 4 queues, if CPU 0 insists on using queue 1
> instead of queue 0, you can swap the 0 <> 1 assignment.
> That's just an example of an "algorithm", maybe you have other
> use cases. But if the problem is "user runs broken irqbalance"
> the solution is not in the kernel...

We understand irqbalance may be a "bad idea", and recommended some
customers to disable it when having problems with it... But it's
still enabled by default, and we cannot let all distro vendors
and custom image makers to disable the irqbalance. So, our host-
networking team is eager to have per-CPU stats for analyzing CPU
usage related to irqbalance or other issues.

Thanks,
- Haiyang


2024-03-14 19:05:45

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Thu, 14 Mar 2024 18:54:31 +0000 Haiyang Zhang wrote:
> We understand irqbalance may be a "bad idea", and recommended some
> customers to disable it when having problems with it... But it's
> still enabled by default, and we cannot let all distro vendors
> and custom image makers to disable the irqbalance. So, our host-
> networking team is eager to have per-CPU stats for analyzing CPU
> usage related to irqbalance or other issues.

You need a use case to get the stats upstream.
"CPU usage related to irqbalance or other issues" is both too vague,
and irqbalance is a user space problem.

2024-03-14 20:01:49

by Alireza Dabagh

[permalink] [raw]
Subject: RE: [EXTERNAL] Re: [PATCH] net :mana : Add per-cpu stats for MANA device

Per processor network stats have been supported on netvsc and Mellanox NICs for years. They are also supported on Windows.

I routinely use these stats to rule in/out the issues with RSS imbalance due to either NIC not handling RSS correctly, incorrect MSI-X affinity, or not having enough entropy in flows.

And yes, I perfectly understand that there are cases that packets received on processor x are processed (in tcp/ip stack) on processor y. But in many cases, there is a correlation between individual processor utilization and the processor where these packets are received on by the NIC.

-thanks, ali
(some suggested that I do mention on this thread that I am one of the original inventors of RSS. So there you have it. Personally I don't think that telling people "I invented the damn thing and I know what I am talking about" is the right way to handle this.)
-----Original Message-----
From: Jakub Kicinski <[email protected]>
Sent: Thursday, March 14, 2024 12:05 PM
To: Haiyang Zhang <[email protected]>
Cc: Shradha Gupta <[email protected]>; Shradha Gupta <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Eric Dumazet <[email protected]>; Paolo Abeni <[email protected]>; Ajay Sharma <[email protected]>; Leon Romanovsky <[email protected]>; Thomas Gleixner <[email protected]>; Sebastian Andrzej Siewior <[email protected]>; KY Srinivasan <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui <[email protected]>; Long Li <[email protected]>; Michael Kelley <[email protected]>; Alireza Dabagh <[email protected]>; Paul Rosswurm <[email protected]>
Subject: [EXTERNAL] Re: [PATCH] net :mana : Add per-cpu stats for MANA device

On Thu, 14 Mar 2024 18:54:31 +0000 Haiyang Zhang wrote:
> We understand irqbalance may be a "bad idea", and recommended some
> customers to disable it when having problems with it... But it's still
> enabled by default, and we cannot let all distro vendors and custom
> image makers to disable the irqbalance. So, our host- networking team
> is eager to have per-CPU stats for analyzing CPU usage related to
> irqbalance or other issues.

You need a use case to get the stats upstream.
"CPU usage related to irqbalance or other issues" is both too vague, and irqbalance is a user space problem.

2024-04-03 05:44:44

by Shradha Gupta

[permalink] [raw]
Subject: Re: [EXTERNAL] Re: [PATCH] net :mana : Add per-cpu stats for MANA device

Hi all,
I have a newer version of this patch ready with a netlink implementation, to support the use case of per-CPU stats for RSS handling and imbalance investigations. Please let me know if we can proceed with that.

Thanks and regards,
Shradha.

On Thu, Mar 14, 2024 at 08:01:24PM +0000, Alireza Dabagh wrote:
> Per processor network stats have been supported on netvsc and Mellanox NICs for years. They are also supported on Windows.
>
> I routinely use these stats to rule in/out the issues with RSS imbalance due to either NIC not handling RSS correctly, incorrect MSI-X affinity, or not having enough entropy in flows.
>
> And yes, I perfectly understand that there are cases that packets received on processor x are processed (in tcp/ip stack) on processor y. But in many cases, there is a correlation between individual processor utilization and the processor where these packets are received on by the NIC.
>
> -thanks, ali
> (some suggested that I do mention on this thread that I am one of the original inventors of RSS. So there you have it. Personally I don't think that telling people "I invented the damn thing and I know what I am talking about" is the right way to handle this.)
> -----Original Message-----
> From: Jakub Kicinski <[email protected]>
> Sent: Thursday, March 14, 2024 12:05 PM
> To: Haiyang Zhang <[email protected]>
> Cc: Shradha Gupta <[email protected]>; Shradha Gupta <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Eric Dumazet <[email protected]>; Paolo Abeni <[email protected]>; Ajay Sharma <[email protected]>; Leon Romanovsky <[email protected]>; Thomas Gleixner <[email protected]>; Sebastian Andrzej Siewior <[email protected]>; KY Srinivasan <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui <[email protected]>; Long Li <[email protected]>; Michael Kelley <[email protected]>; Alireza Dabagh <[email protected]>; Paul Rosswurm <[email protected]>
> Subject: [EXTERNAL] Re: [PATCH] net :mana : Add per-cpu stats for MANA device
>
> On Thu, 14 Mar 2024 18:54:31 +0000 Haiyang Zhang wrote:
> > We understand irqbalance may be a "bad idea", and recommended some
> > customers to disable it when having problems with it... But it's still
> > enabled by default, and we cannot let all distro vendors and custom
> > image makers to disable the irqbalance. So, our host- networking team
> > is eager to have per-CPU stats for analyzing CPU usage related to
> > irqbalance or other issues.
>
> You need a use case to get the stats upstream.
> "CPU usage related to irqbalance or other issues" is both too vague, and irqbalance is a user space problem.