Although there is watchdog_timeo to let users know when the transmit queue
begin stall, but dev_watchdog() is called with an interval. The jiffies
will always be greater than watchdog_timeo.
To let users know the exact time the stall started, print jiffies when
the transmit queue time out.
Signed-off-by: Yajun Deng <[email protected]>
---
net/sched/sch_generic.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index a9aadc4e6858..67b78e260d28 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -502,7 +502,7 @@ static void dev_watchdog(struct timer_list *t)
if (netif_device_present(dev) &&
netif_running(dev) &&
netif_carrier_ok(dev)) {
- int some_queue_timedout = 0;
+ unsigned long some_queue_timedout = 0;
unsigned int i;
unsigned long trans_start;
@@ -514,7 +514,7 @@ static void dev_watchdog(struct timer_list *t)
if (netif_xmit_stopped(txq) &&
time_after(jiffies, (trans_start +
dev->watchdog_timeo))) {
- some_queue_timedout = 1;
+ some_queue_timedout = jiffies - trans_start;
atomic_long_inc(&txq->trans_timeout);
break;
}
@@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t)
if (unlikely(some_queue_timedout)) {
trace_net_dev_xmit_timeout(dev, i);
- WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
- dev->name, netdev_drivername(dev), i);
+ WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): \
+ transmit queue %u timed out %lu jiffies\n",
+ dev->name, netdev_drivername(dev), i, some_queue_timedout);
netif_freeze_queues(dev);
dev->netdev_ops->ndo_tx_timeout(dev, i);
netif_unfreeze_queues(dev);
--
2.25.1
On Wed, Apr 19, 2023 at 1:56 PM Yajun Deng <[email protected]> wrote:
>
> Although there is watchdog_timeo to let users know when the transmit queue
> begin stall, but dev_watchdog() is called with an interval. The jiffies
> will always be greater than watchdog_timeo.
>
> To let users know the exact time the stall started, print jiffies when
> the transmit queue time out.
>
> Signed-off-by: Yajun Deng <[email protected]>
> ---
> atomic_long_inc(&txq->trans_timeout);
> break;
> }
> @@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t)
>
> if (unlikely(some_queue_timedout)) {
> trace_net_dev_xmit_timeout(dev, i);
> - WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
> - dev->name, netdev_drivername(dev), i);
> + WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): \
> + transmit queue %u timed out %lu jiffies\n",
> + dev->name, netdev_drivername(dev), i, some_queue_timedout);
If we really want this, I suggest we export a time in ms units, using
jiffies_to_msecs()
On Wed, 19 Apr 2023 19:56:32 +0800 Yajun Deng wrote:
> Although there is watchdog_timeo to let users know when the transmit queue
> begin stall, but dev_watchdog() is called with an interval. The jiffies
> will always be greater than watchdog_timeo.
>
> To let users know the exact time the stall started, print jiffies when
> the transmit queue time out.
Please add an explanation of how this information is useful in practice.
April 20, 2023 9:27 AM, "Jakub Kicinski" <[email protected]> wrote:
> On Wed, 19 Apr 2023 19:56:32 +0800 Yajun Deng wrote:
>
>> Although there is watchdog_timeo to let users know when the transmit queue
>> begin stall, but dev_watchdog() is called with an interval. The jiffies
>> will always be greater than watchdog_timeo.
>>
>> To let users know the exact time the stall started, print jiffies when
>> the transmit queue time out.
>
> Please add an explanation of how this information is useful in practice.
We found some cases with several warnings. We want to confirm which happened first.
First warning:
16:37:57 kernel: [ 7100.097547] ------------[ cut here ]------------
16:37:57 kernel: [ 7100.097550] NETDEV WATCHDOG: eno2 (i40e): transmit queue 8 timed out
16:37:57 kernel: [ 7100.097571] WARNING: CPU: 8 PID: 0 at net/sched/sch_generic.c:467 dev_watchdog+0x260/0x270
...
Second warning:
16:38:44 kernel: [ 7147.756952] rcu: INFO: rcu_preempt self-detected stall on CPU
16:38:44 kernel: [ 7147.756958] rcu: 24-....: (59999 ticks this GP) idle=546/1/0x4000000000000000 softirq=367 3137/3673146 fqs=13844
16:38:44 kernel: [ 7147.756960] (t=60001 jiffies g=4322709 q=133381)
16:38:44 kernel: [ 7147.756962] NMI backtrace for cpu 24
...
As we can see, the transmit queue start stall should be before 16:37:52, the rcu start stall is 16:37:44.
These two times are closer, we want to confirm which happened first.
April 19, 2023 8:02 PM, "Eric Dumazet" <[email protected]> wrote:
> On Wed, Apr 19, 2023 at 1:56 PM Yajun Deng <[email protected]> wrote:
>
>> Although there is watchdog_timeo to let users know when the transmit queue
>> begin stall, but dev_watchdog() is called with an interval. The jiffies
>> will always be greater than watchdog_timeo.
>>
>> To let users know the exact time the stall started, print jiffies when
>> the transmit queue time out.
>>
>> Signed-off-by: Yajun Deng <[email protected]>
>> ---
>>
>> atomic_long_inc(&txq->trans_timeout);
>> break;
>> }
>> @@ -522,8 +522,9 @@ static void dev_watchdog(struct timer_list *t)
>>
>> if (unlikely(some_queue_timedout)) {
>> trace_net_dev_xmit_timeout(dev, i);
>> - WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
>> - dev->name, netdev_drivername(dev), i);
>> + WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): \
>> + transmit queue %u timed out %lu jiffies\n",
>> + dev->name, netdev_drivername(dev), i, some_queue_timedout);
>
> If we really want this, I suggest we export a time in ms units, using
> jiffies_to_msecs()
OK.