Hello.
The following woe happened on my home router running as a Wi-Fi AP with
MT7612:
```
[16592.157962] divide error: 0000 [#1] PREEMPT SMP PTI
[16592.163169] CPU: 2 PID: 683 Comm: mt76-tx phy0 Tainted: G C 5.12.0-pf4 #1
[16592.171745] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J3710-ITX, BIOS P1.50 04/16/2018
[16592.182155] RIP: 0010:minstrel_ht_get_tp_avg+0xb1/0x100 [mac80211]
[16592.188795] Code: 04 00 00 00 7f 1c 31 c9 81 fa f1 49 02 00 0f 9c c1 8d 0c cd 08 00 00 00 eb 08 8b 47 30 b9 01 00 00 00 69 c0 e8 03 00 00 31 d2 <f7> f1 ba 66 0e 00 00 39 d6 0f 4f f2 49 63 d1 48 8d 0c 52 48 8d 0c
[16592.208796] RSP: 0018:ffffb6a601293be8 EFLAGS: 00010246
[16592.214292] RAX: 000000000001a5e0 RBX: ffff9c658ee95170 RCX: 0000000000000000
[16592.222054] RDX: 0000000000000000 RSI: 0000000000000585 RDI: ffff9c658ee94000
[16592.229620] RBP: 0000000000000006 R08: 0000000000000006 R09: 0000000000000012
[16592.237254] R10: 0000000000000007 R11: 0000000000000000 R12: ffff9c658ee94000
[16592.244828] R13: 0000000000000012 R14: ffff9c658ee9534c R15: 0000000000000585
[16592.252635] FS: 0000000000000000(0000) GS:ffff9c66f8500000(0000) knlGS:0000000000000000
[16592.261086] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16592.267326] CR2: 0000555d81abe7b8 CR3: 0000000116c10000 CR4: 00000000001006e0
[16592.274904] Call Trace:
[16592.277549] minstrel_ht_update_stats+0x1fe/0x1320 [mac80211]
[16592.283730] minstrel_ht_tx_status+0x67f/0x710 [mac80211]
[16592.289442] rate_control_tx_status+0x6e/0xb0 [mac80211]
[16592.295267] ieee80211_tx_status_ext+0x22e/0xb00 [mac80211]
[16592.311290] ieee80211_tx_status+0x7d/0xa0 [mac80211]
[16592.316714] mt76_tx_status_unlock+0x83/0xa0 [mt76]
[16592.321999] mt76x02_send_tx_status+0x1b7/0x400 [mt76x02_lib]
[16592.334516] mt76x02_tx_worker+0x8f/0xd0 [mt76x02_lib]
[16592.340091] __mt76_worker_fn+0x78/0xb0 [mt76]
[16592.345067] kthread+0x183/0x1b0
[16592.353378] ret_from_fork+0x22/0x30
```
`faddr2line` says it is this:
```
430 int
431 minstrel_ht_get_tp_avg(struct minstrel_ht_sta *mi, int group, int rate,
432 int prob_avg)
433 {
…
435 unsigned int ampdu_len = 1;
…
441 if (minstrel_ht_is_legacy_group(group))
…
443 else
444 ampdu_len = minstrel_ht_avg_ampdu_len(mi);
…
446 nsecs = 1000 * overhead / ampdu_len;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
So, it seems `minstrel_ht_avg_ampdu_len()` can return 0, which is not
really legitimate.
Looking at `minstrel_ht_avg_ampdu_len()`, I see the following:
```
16:#define MINSTREL_SCALE 12
…
18:#define MINSTREL_TRUNC(val) ((val) >> MINSTREL_SCALE)
```
```
401 static unsigned int
402 minstrel_ht_avg_ampdu_len(struct minstrel_ht_sta *mi)
403 {
…
406 if (mi->avg_ampdu_len)
407 return MINSTREL_TRUNC(mi->avg_ampdu_len);
```
So, likely, `mi->avg_ampdu_len` is non-zero, but it's too small, hence
right bitshift makes it zero.
Should there be some protection against such a situation?
Felix, could you maybe check this? Looks like you were the last one to
overhaul that part of code.
Thanks!
--
Oleksandr Natalenko (post-factum)
On Sat, Aug 21, 2021 at 01:14:34PM +0200, Oleksandr Natalenko wrote:
> > > So, it seems `minstrel_ht_avg_ampdu_len()` can return 0, which is not
> > > really legitimate.
> > >
> > > Looking at `minstrel_ht_avg_ampdu_len()`, I see the following:
> > >
> > > ```
> > > 16:#define MINSTREL_SCALE 12
> > > ...
> > > 18:#define MINSTREL_TRUNC(val) ((val) >> MINSTREL_SCALE)
> > > ```
> > >
> > > ```
> > >
> > > 401 static unsigned int
> > > 402 minstrel_ht_avg_ampdu_len(struct minstrel_ht_sta *mi)
> > > 403 {
> > >
> > > ...
> > >
> > > 406 if (mi->avg_ampdu_len)
> > > 407 return MINSTREL_TRUNC(mi->avg_ampdu_len);
> > >
> > > ```
> > >
> > > So, likely, `mi->avg_ampdu_len` is non-zero, but it's too small, hence
> > > right bitshift makes it zero.
(...)
> I've also found out that this happens exactly at midnight, IOW, at 00:00:00.
> Not every midnight, though.
>
> Does it have something to do with timekeeping? This is strange, I wouldn't
> expect kernel to act like that. Probably, some client sends malformed frame?
> How to find out?
Well, in minstrel_ht_update_stats() at line 1006 avg_ampdu_len is
explicitly set to zero. And this seems to be called based on timing
criteria from minstrel_ht_tx_status() so this could confirm your
experience. Thus there is some inconsistency there.
Willy