2020-12-08 11:30:45

by Rui Salvaterra

[permalink] [raw]
Subject: [BUG] iwlwifi: card unusable after firmware crash

Hi, everyone,

I'm running Linux 5.10-rc7 with this firmware/hardware:

[ 1.431878] iwlwifi 0000:02:00.0: loaded firmware version
29.198743027.0 7265D-29.ucode op_mode iwlmvm
[ 1.431899] iwlwifi 0000:02:00.0: Detected Intel(R) Dual Band
Wireless AC 3165, REV=0x210

Most of the time, after disabling/enabling Wi-Fi (or, say, rebooting
the router), the card dies and only comes back after reboot. This is
the relevant dmesg output:

[ 3174.003910] iwlwifi 0000:02:00.0: RF_KILL bit toggled to disable radio.
[ 3174.003913] iwlwifi 0000:02:00.0: reporting RF_KILL (radio disabled)
[ 3174.040788] iwlwifi 0000:02:00.0: iwlwifi transaction failed,
dumping registers
[ 3174.040790] iwlwifi 0000:02:00.0: iwlwifi device config registers:
[ 3174.041008] iwlwifi 0000:02:00.0: 00000000: 31668086 00100406
02800099 00000010 c1100004 00000000 00000000 00000000
[ 3174.041010] iwlwifi 0000:02:00.0: 00000020: 00000000 00000000
00000000 42108086 00000000 000000c8 00000000 000001ff
[ 3174.041011] iwlwifi 0000:02:00.0: 00000040: 00020010 10008ec0
00100c10 0046e811 10110142 00000000 00000000 00000000
[ 3174.041013] iwlwifi 0000:02:00.0: 00000060: 00000000 00080812
00000005 00000000 00010001 00000000 00000000 00000000
[ 3174.041014] iwlwifi 0000:02:00.0: 00000080: 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041016] iwlwifi 0000:02:00.0: 000000a0: 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041017] iwlwifi 0000:02:00.0: 000000c0: 00000000 00000000
c823d001 0d000000 00814005 fee00318 00000000 00000000
[ 3174.041019] iwlwifi 0000:02:00.0: 000000e0: 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041020] iwlwifi 0000:02:00.0: 00000100: 14010001 00000000
00000000 00462031 00000000 00002000 00000000 00000000
[ 3174.041022] iwlwifi 0000:02:00.0: 00000120: 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041023] iwlwifi 0000:02:00.0: 00000140: 14c10003 ff75a50d
e09467ff 15410018 10031003 0001001e 00f01e1f 6003000b
[ 3174.041024] iwlwifi 0000:02:00.0: iwlwifi device memory mapped registers:
[ 3174.041075] iwlwifi 0000:02:00.0: 00000000: 00800000 00000000
00000080 00000000 00000000 00000000 00000000 00000000
[ 3174.041076] iwlwifi 0000:02:00.0: 00000020: 00000001 080003d8
00000210 d55555d5 00000000 d55555d5 80008040 00080042
[ 3174.041088] iwlwifi 0000:02:00.0: iwlwifi device AER capability structure:
[ 3174.041112] iwlwifi 0000:02:00.0: 00000000: 14010001 00000000
00000000 00462031 00000000 00002000 00000000 00000000
[ 3174.041113] iwlwifi 0000:02:00.0: 00000020: 00000000 00000000 00000000
[ 3174.041114] iwlwifi 0000:02:00.0: iwlwifi parent port
(0000:00:1c.5) config registers:
[ 3174.041198] iwlwifi 0000:00:1c.5: 00000000: 9d158086 00100407
060400f1 00810010 00000000 00000000 00020200 200000f0
[ 3174.041200] iwlwifi 0000:00:1c.5: 00000020: c110c110 0001fff1
00000000 00000000 00000000 00000040 00000000 000202ff
[ 3174.041201] iwlwifi 0000:00:1c.5: 00000040: 01428010 00008001
00110000 06724813 70110042 002cb200 01400000 00000000
[ 3174.041203] iwlwifi 0000:00:1c.5: 00000060: 00000000 00000037
00000000 0000000e 00010003 00000000 00000000 00000000
[ 3174.041204] iwlwifi 0000:00:1c.5: 00000080: 00019005 fee00258
00000000 00000000 0000a00d 382a17aa 00000000 00000000
[ 3174.041206] iwlwifi 0000:00:1c.5: 000000a0: c8030001 00000000
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041207] iwlwifi 0000:00:1c.5: 000000c0: 00000000 00000000
00000000 00000000 07001001 00001842 8b1e0008 00000000
[ 3174.041209] iwlwifi 0000:00:1c.5: 000000e0: 00630300 00000000
00100006 00000000 00000050 4c000000 08230fb3 02000004
[ 3174.041210] iwlwifi 0000:00:1c.5: 00000100: 14000000 00000000
00000000 00060011 00000001 00002000 00000000 00000000
[ 3174.041211] iwlwifi 0000:00:1c.5: 00000120: 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041213] iwlwifi 0000:00:1c.5: 00000140: 2001000d 0000000f
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041214] iwlwifi 0000:00:1c.5: 00000160: 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041216] iwlwifi 0000:00:1c.5: 00000180: 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041217] iwlwifi 0000:00:1c.5: 000001a0: 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041218] iwlwifi 0000:00:1c.5: 000001c0: 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041220] iwlwifi 0000:00:1c.5: 000001e0: 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000
[ 3174.041221] iwlwifi 0000:00:1c.5: 00000200: 2201001e 0028281f 6003280b

I also had this problem with the previous firmware version. Any ideas
about what could be causing it?

Thanks,
Rui


2020-12-08 16:12:35

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [BUG] iwlwifi: card unusable after firmware crash

On Tue, 8 Dec 2020 11:27:50 +0000 Rui Salvaterra wrote:
> Hi, everyone,
>
> I'm running Linux 5.10-rc7 with this firmware/hardware:
>
> [ 1.431878] iwlwifi 0000:02:00.0: loaded firmware version
> 29.198743027.0 7265D-29.ucode op_mode iwlmvm
> [ 1.431899] iwlwifi 0000:02:00.0: Detected Intel(R) Dual Band
> Wireless AC 3165, REV=0x210
>
> Most of the time, after disabling/enabling Wi-Fi (or, say, rebooting
> the router), the card dies and only comes back after reboot. This is
> the relevant dmesg output:

Just to confirm - is this a regression in 5.10-rc? Does 5.9 work
smoothly in the problematic scenario?

2020-12-09 17:16:38

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [BUG] iwlwifi: card unusable after firmware crash

On Tue, 8 Dec 2020 23:17:48 +0000 Rui Salvaterra wrote:
> Hi, Luca,
>
> On Tue, 8 Dec 2020 at 16:27, Coelho, Luciano <[email protected]> wrote:
> > On Tue, 2020-12-08 at 11:27 +0000, Rui Salvaterra wrote:
> > >
> > > [ 3174.003910] iwlwifi 0000:02:00.0: RF_KILL bit toggled to disable radio.
> > > [ 3174.003913] iwlwifi 0000:02:00.0: reporting RF_KILL (radio disabled)
> >
> > It looks like your machine is reporting RF-Kill to the WiFi device.
>
> Yes, that's an artifact of how I tested: I rebooted the router, the
> Wi-Fi interface disassociated and the dmesg was clean. However, after
> the router came up, the laptop didn't reconnect (and the connection
> had completely disappeared from nmtui). Afterwards, I did the rfkill
> cycle you see, and only then I got the register dump.
>
> > There seems to be some sort of race there that is causing us to still
> > try to communicate with the device (and thus you see the transaction
> > failed dump), but that will obviously fail when RF-Kill is enabled.
>
> I'm not sure about that, the card was already dead before the rfkill cycle.

Any luck figuring this out, Luca? If this is a 5.10 regression we need
to let Linus know tomorrow, so the time is ticking :(

2020-12-09 20:40:21

by Luciano Coelho

[permalink] [raw]
Subject: Re: [BUG] iwlwifi: card unusable after firmware crash

Hi Jakub et al,

On Wed, 2020-12-09 at 09:13 -0800, Jakub Kicinski wrote:
> On Tue, 8 Dec 2020 23:17:48 +0000 Rui Salvaterra wrote:
> > Hi, Luca,
> >
> > On Tue, 8 Dec 2020 at 16:27, Coelho, Luciano <[email protected]> wrote:
> > > On Tue, 2020-12-08 at 11:27 +0000, Rui Salvaterra wrote:
> > > >
> > > > [ 3174.003910] iwlwifi 0000:02:00.0: RF_KILL bit toggled to disable radio.
> > > > [ 3174.003913] iwlwifi 0000:02:00.0: reporting RF_KILL (radio disabled)
> > >
> > > It looks like your machine is reporting RF-Kill to the WiFi device.
> >
> > Yes, that's an artifact of how I tested: I rebooted the router, the
> > Wi-Fi interface disassociated and the dmesg was clean. However, after
> > the router came up, the laptop didn't reconnect (and the connection
> > had completely disappeared from nmtui). Afterwards, I did the rfkill
> > cycle you see, and only then I got the register dump.
> >
> > > There seems to be some sort of race there that is causing us to still
> > > try to communicate with the device (and thus you see the transaction
> > > failed dump), but that will obviously fail when RF-Kill is enabled.
> >
> > I'm not sure about that, the card was already dead before the rfkill cycle.
>
> Any luck figuring this out, Luca? If this is a 5.10 regression we need
> to let Linus know tomorrow, so the time is ticking :(

I just checked all the commits in iwlwifi between v5.9 and v5.10 and I
don't see anything that could affect how RF-Kill works.

Emmanuel, do you remember anything that went in that could have
affected RF-Kill?

--
Cheers,
Luca.

2020-12-09 20:41:26

by Rui Salvaterra

[permalink] [raw]
Subject: Re: [BUG] iwlwifi: card unusable after firmware crash

Hi, guys,

On Wed, 9 Dec 2020 at 17:13, Jakub Kicinski <[email protected]> wrote:
>
> Any luck figuring this out, Luca? If this is a 5.10 regression we need
> to let Linus know tomorrow, so the time is ticking :(

I don't have the possibility to test other kernels at the moment, but
I will do so in a few days (at least to find a working version to
bisect). Meanwhile, I don't know if this is relevant or not, but I'm
using WPA3 PSK.

Thanks,
Rui

2020-12-09 20:53:59

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: [BUG] iwlwifi: card unusable after firmware crash

On Wed, Dec 9, 2020 at 7:19 PM Jakub Kicinski <[email protected]> wrote:
>
> On Tue, 8 Dec 2020 23:17:48 +0000 Rui Salvaterra wrote:
> > Hi, Luca,
> >
> > On Tue, 8 Dec 2020 at 16:27, Coelho, Luciano <[email protected]> wrote:
> > > On Tue, 2020-12-08 at 11:27 +0000, Rui Salvaterra wrote:
> > > >
> > > > [ 3174.003910] iwlwifi 0000:02:00.0: RF_KILL bit toggled to disable radio.
> > > > [ 3174.003913] iwlwifi 0000:02:00.0: reporting RF_KILL (radio disabled)
> > >
> > > It looks like your machine is reporting RF-Kill to the WiFi device.
> >
> > Yes, that's an artifact of how I tested: I rebooted the router, the
> > Wi-Fi interface disassociated and the dmesg was clean. However, after
> > the router came up, the laptop didn't reconnect (and the connection
> > had completely disappeared from nmtui). Afterwards, I did the rfkill
> > cycle you see, and only then I got the register dump.
> >
> > > There seems to be some sort of race there that is causing us to still
> > > try to communicate with the device (and thus you see the transaction
> > > failed dump), but that will obviously fail when RF-Kill is enabled.
> >
> > I'm not sure about that, the card was already dead before the rfkill cycle.
>
> Any luck figuring this out, Luca? If this is a 5.10 regression we need
> to let Linus know tomorrow, so the time is ticking :(

Rui, I looked at the register dump and looks like you're using AMT on
your system?
Can you confirm?

2020-12-09 20:57:35

by Rui Salvaterra

[permalink] [raw]
Subject: Re: [BUG] iwlwifi: card unusable after firmware crash

Hi, again,

On Wed, 9 Dec 2020 at 20:40, Emmanuel Grumbach <[email protected]> wrote:
>
> Besides, don't you get a stack dump in the vicinity of this register
> dump? That's be helpful to see.

Nope. No stack trace at all. Only the register dump.

Thanks,
Rui

2020-12-09 20:58:41

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: [BUG] iwlwifi: card unusable after firmware crash

On Wed, Dec 9, 2020 at 10:40 PM Rui Salvaterra <[email protected]> wrote:
>
> Hi, Emmanuel,
>
> On Wed, 9 Dec 2020 at 20:32, Emmanuel Grumbach <[email protected]> wrote:
> >
> > Rui, I looked at the register dump and looks like you're using AMT on
> > your system?
> > Can you confirm?
>
> AMT? You mean Intel Active Management? Heavens, no, not that I know
> of! This is a personal laptop (Lenovo B51-80). (And I'd personally
> kill the ME with fire, if I could.)

Yes, I mean that thing. No VPRO sticker on the laptop?
Weird... So apparently I was wrong about the register value.

>
> Thanks,
> Rui

2020-12-09 21:12:22

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: [BUG] iwlwifi: card unusable after firmware crash

On Wed, Dec 9, 2020 at 10:47 PM Emmanuel Grumbach <[email protected]> wrote:
>
> On Wed, Dec 9, 2020 at 10:40 PM Rui Salvaterra <[email protected]> wrote:
> >
> > Hi, Emmanuel,
> >
> > On Wed, 9 Dec 2020 at 20:32, Emmanuel Grumbach <[email protected]> wrote:
> > >
> > > Rui, I looked at the register dump and looks like you're using AMT on
> > > your system?
> > > Can you confirm?
> >
> > AMT? You mean Intel Active Management? Heavens, no, not that I know
> > of! This is a personal laptop (Lenovo B51-80). (And I'd personally
> > kill the ME with fire, if I could.)
>
> Yes, I mean that thing. No VPRO sticker on the laptop?
> Weird... So apparently I was wrong about the register value.

Indeed, the bit is reverse logic. So we can put that aside.
Frankly, I have no clue. You can try our backport tree to bisect,
should be easier..
What I see here is that your GP_CTRL value is 080003d8

#define CSR_GP_CNTRL_REG_FLAG_HW_RF_KILL_SW (0x08000000)
which means sense since apparently, HW RF-Kill was asserted.
#define CSR_GP_CNTRL_REG_FLAG_GOING_TO_SLEEP (0x00000010)
Which means that the device is going to sleep... And that's the problem:

iwl_trans_pcie_grab_nic_access:
ret = iwl_poll_bit(trans, CSR_GP_CNTRL,
CSR_GP_CNTRL_REG_VAL_MAC_ACCESS_EN,
(CSR_GP_CNTRL_REG_FLAG_MAC_CLOCK_READY |
CSR_GP_CNTRL_REG_FLAG_GOING_TO_SLEEP), 15000);
if (unlikely(ret < 0)) {
u32 cntrl = iwl_read32(trans, CSR_GP_CNTRL);

WARN_ONCE(1,
"Timeout waiting for hardware access
(CSR_GP_CNTRL 0x%08x)\n",
cntrl);

but I'd expect the splat in your log...
Or maybe you can't load the firmware?

Can you try this:
diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
index 2fffbbc8462f..748300752630 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
@@ -2121,6 +2121,7 @@ static bool
iwl_trans_pcie_grab_nic_access(struct iwl_trans *trans,
* track nic_access anyway.
*/
__release(&trans_pcie->reg_lock);
+ mdelay(1);
return true;
}

If that helps, then... I'd have no clue why it helps, but this
specific device caused us trouble like bad timing after
grab_nic_access..


>
> >
> > Thanks,
> > Rui

2020-12-09 21:20:25

by Rui Salvaterra

[permalink] [raw]
Subject: Re: [BUG] iwlwifi: card unusable after firmware crash

Hi again, Emmanuel,

On Wed, 9 Dec 2020 at 21:07, Emmanuel Grumbach <[email protected]> wrote:
>
> Indeed, the bit is reverse logic. So we can put that aside.
> Frankly, I have no clue. You can try our backport tree to bisect,
> should be easier..
> What I see here is that your GP_CTRL value is 080003d8
>
> #define CSR_GP_CNTRL_REG_FLAG_HW_RF_KILL_SW (0x08000000)
> which means sense since apparently, HW RF-Kill was asserted.
> #define CSR_GP_CNTRL_REG_FLAG_GOING_TO_SLEEP (0x00000010)
> Which means that the device is going to sleep... And that's the problem:
>
> iwl_trans_pcie_grab_nic_access:
> ret = iwl_poll_bit(trans, CSR_GP_CNTRL,
> CSR_GP_CNTRL_REG_VAL_MAC_ACCESS_EN,
> (CSR_GP_CNTRL_REG_FLAG_MAC_CLOCK_READY |
> CSR_GP_CNTRL_REG_FLAG_GOING_TO_SLEEP), 15000);
> if (unlikely(ret < 0)) {
> u32 cntrl = iwl_read32(trans, CSR_GP_CNTRL);
>
> WARN_ONCE(1,
> "Timeout waiting for hardware access
> (CSR_GP_CNTRL 0x%08x)\n",
> cntrl);
>
> but I'd expect the splat in your log...
> Or maybe you can't load the firmware?

Well, my kernel doesn't have any modules, it's all built-in. The
firmware is obviously loading fine, otherwise the card wouldn't work,
but yeah, that WARN_ONCE hasn't triggered at all.

> Can you try this:
> diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> index 2fffbbc8462f..748300752630 100644
> --- a/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> +++ b/drivers/net/wireless/intel/iwlwifi/pcie/trans.c
> @@ -2121,6 +2121,7 @@ static bool
> iwl_trans_pcie_grab_nic_access(struct iwl_trans *trans,
> * track nic_access anyway.
> */
> __release(&trans_pcie->reg_lock);
> + mdelay(1);
> return true;
> }
>
> If that helps, then... I'd have no clue why it helps, but this
> specific device caused us trouble like bad timing after
> grab_nic_access..

I'll give it a spin. Nasty hack, but if it works, it works. :)

Thanks,
Rui

2020-12-10 19:02:35

by Emmanuel Grumbach

[permalink] [raw]
Subject: Re: [BUG] iwlwifi: card unusable after firmware crash

Hi,

> Hi, again,
>
> I haven't tested any patch or bisected, but I have another data point.
> I built and tested Linux 5.8.18, with the same firmware, and it is
> working correctly. I reduced the test case to just rfkilling the
> connection, which showed the register dump immediately (before that I
> was using the airplane toggle on the keyboard, which isn't working
> correctly, it disables and immediately reenables the radio, for some
> unfathomable reason).
> So, now I'm inclined to believe this is some sort of race condition
> between rfkill and pending transactions.

Which also means it's not a regression.
You can add a dump_stack() in the function that dumps the registers to
get a clue.