LinuxLists.cc - [PATCH v2 0/2] tg3: Fix the TX ring stall

[permalink] [raw]

Subject: [PATCH v2 2/2] tg3: Fix the TX ring stall

From: Alex Pakhunov <[email protected]>

The TX ring maintained by the tg3 driver can end up in the state, when it
has packets queued for sending but the NIC hardware is not informed, so no
progress is made. This leads to a multi-second interruption in network
traffic followed by dev_watchdog() firing and resetting the queue.

The specific sequence of steps is:

1. tg3_start_xmit() is called at least once and queues packet(s) without
updating tnapi->prodmbox (netdev_xmit_more() returns true)
2. tg3_start_xmit() is called with an SKB which causes tg3_tso_bug() to be
called.
3. tg3_tso_bug() determines that the SKB is too large, ...

if (unlikely(tg3_tx_avail(tnapi) <= frag_cnt_est)) {

... stops the queue, and returns NETDEV_TX_BUSY:

netif_tx_stop_queue(txq);
...
if (tg3_tx_avail(tnapi) <= frag_cnt_est)
return NETDEV_TX_BUSY;

4. Since all tg3_tso_bug() call sites directly return, the code updating
tnapi->prodmbox is skipped.

5. The queue is stuck now. tg3_start_xmit() is not called while the queue
is stopped. The NIC is not processing new packets because
tnapi->prodmbox wasn't updated. tg3_tx() is not called by
tg3_poll_work() because the all TX descriptions that could be freed has
been freed:

/* run TX completion thread */
if (tnapi->hw_status->idx[0].tx_consumer != tnapi->tx_cons) {
tg3_tx(tnapi);

6. Eventually, dev_watchdog() fires triggering a reset of the queue.

This fix makes sure that the tnapi->prodmbox update happens regardless of
the reason tg3_start_xmit() returned.

Signed-off-by: Alex Pakhunov <[email protected]>
Signed-off-by: Vincent Wong <[email protected]>
---
v2: Sort Order the local variables in tg3_start_xmit() in the RCS order
v1: https://lore.kernel.org/netdev/[email protected]/T/#t
---
drivers/net/ethernet/broadcom/tg3.c | 53 +++++++++++++++++++++++------
1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 99638e6c9e16..f7680d3e46da 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -6603,9 +6603,9 @@ static void tg3_tx(struct tg3_napi *tnapi)

tnapi->tx_cons = sw_idx;

- /* Need to make the tx_cons update visible to tg3_start_xmit()
+ /* Need to make the tx_cons update visible to __tg3_start_xmit()
* before checking for netif_queue_stopped(). Without the
- * memory barrier, there is a small possibility that tg3_start_xmit()
+ * memory barrier, there is a small possibility that __tg3_start_xmit()
* will miss it and cause the queue to be stopped forever.
*/
smp_mb();
@@ -7845,7 +7845,7 @@ static bool tg3_tso_bug_gso_check(struct tg3_napi *tnapi, struct sk_buff *skb)
return skb_shinfo(skb)->gso_segs < tnapi->tx_pending / 3;
}

-static netdev_tx_t tg3_start_xmit(struct sk_buff *, struct net_device *);
+static netdev_tx_t __tg3_start_xmit(struct sk_buff *, struct net_device *);

/* Use GSO to workaround all TSO packets that meet HW bug conditions
* indicated in tg3_tx_frag_set()
@@ -7881,7 +7881,7 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,

skb_list_walk_safe(segs, seg, next) {
skb_mark_not_on_list(seg);
- tg3_start_xmit(seg, tp->dev);
+ __tg3_start_xmit(seg, tp->dev);
}

tg3_tso_bug_end:
@@ -7891,7 +7891,7 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
}

/* hard_start_xmit for all devices */
-static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+static netdev_tx_t __tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct tg3 *tp = netdev_priv(dev);
u32 len, entry, base_flags, mss, vlan = 0;
@@ -8135,11 +8135,6 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
netif_tx_wake_queue(txq);
}

- if (!netdev_xmit_more() || netif_xmit_stopped(txq)) {
- /* Packets are ready, update Tx producer idx on card. */
- tw32_tx_mbox(tnapi->prodmbox, entry);
- }
-
return NETDEV_TX_OK;

dma_error:
@@ -8152,6 +8147,42 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
return NETDEV_TX_OK;
}

+static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+ struct netdev_queue *txq;
+ u16 skb_queue_mapping;
+ netdev_tx_t ret;
+
+ skb_queue_mapping = skb_get_queue_mapping(skb);
+ txq = netdev_get_tx_queue(dev, skb_queue_mapping);
+
+ ret = __tg3_start_xmit(skb, dev);
+
+ /* Notify the hardware that packets are ready by updating the TX ring
+ * tail pointer. We respect netdev_xmit_more() thus avoiding poking
+ * the hardware for every packet. To guarantee forward progress the TX
+ * ring must be drained when it is full as indicated by
+ * netif_xmit_stopped(). This needs to happen even when the current
+ * skb was dropped or rejected with NETDEV_TX_BUSY. Otherwise packets
+ * queued by previous __tg3_start_xmit() calls might get stuck in
+ * the queue forever.
+ */
+ if (!netdev_xmit_more() || netif_xmit_stopped(txq)) {
+ struct tg3_napi *tnapi;
+ struct tg3 *tp;
+
+ tp = netdev_priv(dev);
+ tnapi = &tp->napi[skb_queue_mapping];
+
+ if (tg3_flag(tp, ENABLE_TSS))
+ tnapi++;
+
+ tw32_tx_mbox(tnapi->prodmbox, tnapi->tx_prod);
+ }
+
+ return ret;
+}
+
static void tg3_mac_loopback(struct tg3 *tp, bool enable)
{
if (enable) {
@@ -17682,7 +17713,7 @@ static int tg3_init_one(struct pci_dev *pdev,
* device behind the EPB cannot support DMA addresses > 40-bit.
* On 64-bit systems with IOMMU, use 40-bit dma_mask.
* On 64-bit systems without IOMMU, use 64-bit dma_mask and
- * do DMA address check in tg3_start_xmit().
+ * do DMA address check in __tg3_start_xmit().
*/
if (tg3_flag(tp, IS_5788))
persist_dma_mask = dma_mask = DMA_BIT_MASK(32);
--
2.39.3

2023-11-02 17:26:36

[permalink] [raw]

Subject: [PATCH v2 1/2] tg3: Increment tx_dropped in tg3_tso_bug()

From: Alex Pakhunov <[email protected]>

tg3_tso_bug() drops a packet if it cannot be segmented for any reason.
The number of discarded frames should be incremeneted accordingly.

Signed-off-by: Alex Pakhunov <[email protected]>
Signed-off-by: Vincent Wong <[email protected]>
---
drivers/net/ethernet/broadcom/tg3.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 14b311196b8f..99638e6c9e16 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -7874,8 +7874,10 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,

segs = skb_gso_segment(skb, tp->dev->features &
~(NETIF_F_TSO | NETIF_F_TSO6));
- if (IS_ERR(segs) || !segs)
+ if (IS_ERR(segs) || !segs) {
+ tp->tx_dropped++;
goto tg3_tso_bug_end;
+ }

skb_list_walk_safe(segs, seg, next) {
skb_mark_not_on_list(seg);
--
2.39.3

2023-11-02 20:05:12

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] tg3: Increment tx_dropped in tg3_tso_bug()

On Thu, Nov 2, 2023 at 10:25 AM <[email protected]> wrote:
>
> From: Alex Pakhunov <[email protected]>
>
> tg3_tso_bug() drops a packet if it cannot be segmented for any reason.
> The number of discarded frames should be incremeneted accordingly.
>
> Signed-off-by: Alex Pakhunov <[email protected]>
> Signed-off-by: Vincent Wong <[email protected]>
> ---
> drivers/net/ethernet/broadcom/tg3.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 14b311196b8f..99638e6c9e16 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -7874,8 +7874,10 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
>
> segs = skb_gso_segment(skb, tp->dev->features &
> ~(NETIF_F_TSO | NETIF_F_TSO6));
> - if (IS_ERR(segs) || !segs)
> + if (IS_ERR(segs) || !segs) {
> + tp->tx_dropped++;

This is prone to race conditions if we have more than one TX queue.
The original driver code only supported one TX queue and the counters
were never modified properly to support multiple queues. We should
convert them to per queue counters by moving tx_dropped and rx_dropped
to the tg3_napi struct.

> goto tg3_tso_bug_end;
> + }
>
> skb_list_walk_safe(segs, seg, next) {
> skb_mark_not_on_list(seg);
> --
> 2.39.3
>

Attachments:

2023-11-02 20:11:00

[permalink] [raw]

Subject: Re: [PATCH v2 2/2] tg3: Fix the TX ring stall

On Thu, Nov 2, 2023 at 10:25 AM <[email protected]> wrote:
>
> From: Alex Pakhunov <[email protected]>
>
> The TX ring maintained by the tg3 driver can end up in the state, when it
> has packets queued for sending but the NIC hardware is not informed, so no
> progress is made. This leads to a multi-second interruption in network
> traffic followed by dev_watchdog() firing and resetting the queue.
>
> The specific sequence of steps is:
>
> 1. tg3_start_xmit() is called at least once and queues packet(s) without
> updating tnapi->prodmbox (netdev_xmit_more() returns true)
> 2. tg3_start_xmit() is called with an SKB which causes tg3_tso_bug() to be
> called.
> 3. tg3_tso_bug() determines that the SKB is too large, ...
>
> if (unlikely(tg3_tx_avail(tnapi) <= frag_cnt_est)) {
>
> ... stops the queue, and returns NETDEV_TX_BUSY:
>
> netif_tx_stop_queue(txq);
> ...
> if (tg3_tx_avail(tnapi) <= frag_cnt_est)
> return NETDEV_TX_BUSY;
>
> 4. Since all tg3_tso_bug() call sites directly return, the code updating
> tnapi->prodmbox is skipped.
>
> 5. The queue is stuck now. tg3_start_xmit() is not called while the queue
> is stopped. The NIC is not processing new packets because
> tnapi->prodmbox wasn't updated. tg3_tx() is not called by
> tg3_poll_work() because the all TX descriptions that could be freed has
> been freed:
>
> /* run TX completion thread */
> if (tnapi->hw_status->idx[0].tx_consumer != tnapi->tx_cons) {
> tg3_tx(tnapi);
>
> 6. Eventually, dev_watchdog() fires triggering a reset of the queue.
>
> This fix makes sure that the tnapi->prodmbox update happens regardless of
> the reason tg3_start_xmit() returned.
>
> Signed-off-by: Alex Pakhunov <[email protected]>
> Signed-off-by: Vincent Wong <[email protected]>
> ---
> v2: Sort Order the local variables in tg3_start_xmit() in the RCS order
> v1: https://lore.kernel.org/netdev/[email protected]/T/#t
> ---

Thanks.

Reviewed-by: Michael Chan <[email protected]>

Attachments:

2023-11-03 17:08:38

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] tg3: Increment tx_dropped in tg3_tso_bug()

> This is prone to race conditions if we have more than one TX queue.

Yes, indeed.

> The original driver code only supported one TX queue and the counters
> were never modified properly to support multiple queues. We should
> convert them to per queue counters by moving tx_dropped and rx_dropped
> to the tg3_napi struct.

I'm not super familiar with the recommended approach for handling locks in
network drivers, so I spent a bit of tme looking at what tg3 does.

It seems that there are a few ways to remove the race condition when
working with these counters:

1. Use atomic increments. It is easy but every update is more expensive
than it needs to be. We might be able to say that there specific
counters are updated rarely, so maybe we don't care too much.
2. netif_tx_lock is already taken when tx_droped is incremented - wrap
rx_dropped increment and reading both counters in netif_tx_lock. This
seems legal since tg3_tx() can take netif_tx_lock. I'm not sure how to
order netif_tx_lock and tp->lock, since tg3_get_stats64() takes
the latter. Should netif_tx_lock be takes inside tp->lock? Should they
be not nested?
3. Using tp->lock to protect rx_dropped (tg3_poll_link() already takes it
so it must be legal) and netif_tx_lock to protect tx_dropped.

There are probably other options. Can you recommend an aproach?

Also, this seems like a larger change that should be done separately from
fixing the TX stall. Should we land just "[PATCH v2 2/2]"? Should we land
the whole patch (since it does not make race condition much worse) and fix
the race condition separately?

Alex.

2023-11-03 23:03:23

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] tg3: Increment tx_dropped in tg3_tso_bug()

On Fri, Nov 3, 2023 at 10:07 AM Alex Pakhunov
<[email protected]> wrote:
> I'm not super familiar with the recommended approach for handling locks in
> network drivers, so I spent a bit of tme looking at what tg3 does.
>
> It seems that there are a few ways to remove the race condition when
> working with these counters:
>
> 1. Use atomic increments. It is easy but every update is more expensive
> than it needs to be. We might be able to say that there specific
> counters are updated rarely, so maybe we don't care too much.
> 2. netif_tx_lock is already taken when tx_droped is incremented - wrap
> rx_dropped increment and reading both counters in netif_tx_lock. This
> seems legal since tg3_tx() can take netif_tx_lock. I'm not sure how to
> order netif_tx_lock and tp->lock, since tg3_get_stats64() takes
> the latter. Should netif_tx_lock be takes inside tp->lock? Should they
> be not nested?
> 3. Using tp->lock to protect rx_dropped (tg3_poll_link() already takes it
> so it must be legal) and netif_tx_lock to protect tx_dropped.
>
> There are probably other options. Can you recommend an aproach?

I recommend using per queue counters as briefly mentioned in my
earlier reply. Move the tx_dropped and rx_dropped counters to the per
queue tg3_napi struct. Incrementing tnapi->tx_dropped in
tg3_start_xmit() is serialized by the netif_tx_lock held by the stack.

Similarly, incrementing tnapi->rx_dropped in the tg3_rx() is serialized by NAPI.

tg3_get_stats64() can just loop and sum all the tx_dropped and
rx_dropped counters in each tg3_napi struct. We don't worry about
locks here since we are just reading.

>
> Also, this seems like a larger change that should be done separately from
> fixing the TX stall. Should we land just "[PATCH v2 2/2]"? Should we land
> the whole patch (since it does not make race condition much worse) and fix
> the race condition separately?
>

Yes, we can merge patch #2 first which fixes the stall. Please repost
just patch #2 standalone if you want to do that. Thanks.

Attachments:

2023-11-05 19:27:07

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] tg3: Increment tx_dropped in tg3_tso_bug()

> I recommend using per queue counters as briefly mentioned in my
> earlier reply...
> tg3_get_stats64() can just loop and sum all the tx_dropped and
> rx_dropped counters in each tg3_napi struct. We don't worry about
> locks here since we are just reading.

Got it. So the core idea is to make sure there is a single writer for each
counter which will make updating the counter race-free. It does not keep
reading the counters from multiple queues completely race free, but, I
guess, the assumption is that computing the aggregate counter to be
slightly wrong is acceptable - it will be recomputed correctly next time.

There is still some gotchas on 32 bit machines though. 64 bit reads are not
atomic there, so we have to make the counters 32bit to compensate:

====
@@ -11895,6 +11898,9 @@ static void tg3_get_nstats(struct tg3 *tp, struct rtnl_link_stats64 *stats)
{
struct rtnl_link_stats64 *old_stats = &tp->net_stats_prev;
struct tg3_hw_stats *hw_stats = tp->hw_stats;
+ uintptr_t rx_dropped = 0;
+ uintptr_t tx_dropped = 0;
+ int i;

stats->rx_packets = old_stats->rx_packets +
get_stat64(&hw_stats->rx_ucast_packets) +
@@ -11941,8 +11947,27 @@ static void tg3_get_nstats(struct tg3 *tp, struct rtnl_link_stats64 *stats)
stats->rx_missed_errors = old_stats->rx_missed_errors +
get_stat64(&hw_stats->rx_discards);

- stats->rx_dropped = tp->rx_dropped;
- stats->tx_dropped = tp->tx_dropped;
+ /* Aggregate per-queue counters. Each per-queue counter is updated by
+ * a single writer, race-free. The aggregare counters might be not
+ * completely accurate (if an update happens in the middle of the loop)
+ * but they will be recomputed correctly the next time this function is
+ * called. This avoids explicit synchronization between this function
+ * and tg3_rx()/tg3_start_xmit().
+ **/
+ for (i = 0; i < tp->irq_cnt; i++) {
+ struct tg3_napi *tnapi = &tp->napi[i];
+
+ rx_dropped += tnapi->rx_dropped;
+ tx_dropped += tnapi->tx_dropped;
+ }
+
+ /* Since we are using uintptr_t, these counters wrap around at 4G on
+ * a 32bit machine. This seems like an acceptable price for being
+ * able to read them atomically in the loop above.
+ */
+ stats->rx_dropped = rx_dropped;
+ stats->tx_dropped = tx_dropped;
+
}
====

An alternative implementation would use atomic64_add to update
tg3::[rt]x_dropped. It would allow the counters to be 64 bit even on 32 bit
machines. The downside is that updating the counter will be slightly more
expensive. There counters are not updated often, so the cost is negligible.
ALthough it also means that preactically speaking we don't care if
the counters are effectively 32 bits wide.

I'll assume you prefer the former implementation for now, but let me know
if this not the case.

> Yes, we can merge patch #2 first which fixes the stall. Please repost
> just patch #2 standalone if you want to do that.

OK, I posted "[PATCH v3] tg3: Fix the TX ring stall".

Alex.

2023-11-06 03:59:29

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] tg3: Increment tx_dropped in tg3_tso_bug()

On Sun, Nov 5, 2023 at 11:26 AM Alex Pakhunov
<[email protected]> wrote:
>
> > I recommend using per queue counters as briefly mentioned in my
> > earlier reply...
> > tg3_get_stats64() can just loop and sum all the tx_dropped and
> > rx_dropped counters in each tg3_napi struct. We don't worry about
> > locks here since we are just reading.
>
> Got it. So the core idea is to make sure there is a single writer for each
> counter which will make updating the counter race-free. It does not keep
> reading the counters from multiple queues completely race free, but, I
> guess, the assumption is that computing the aggregate counter to be
> slightly wrong is acceptable - it will be recomputed correctly next time.

Correct.

>
> There is still some gotchas on 32 bit machines though. 64 bit reads are not
> atomic there, so we have to make the counters 32bit to compensate:

These counters are currently defined as unsigned long which is 32-bit
on 32-bit CPUs and 64-bit on 64-bit CPUs. We can just keep them
unchanged. Thanks.

Attachments: