2021-09-08 04:59:11

by Dominique Martinet

Subject: mwifiex cmd timeout on one pci variant

Hi,

This is probably more a question for the maker.. But maybe someone will
know.
I've got a mwifiex M.2 pci module that won't take any wifi command and
hangs right away (dmesg below). Bluetooth through serial works.


Context:
I've got a board with an i.MX8MP chip, and three different marvell W8997
M.2 modules -- one from laird which works fine, and two from azurewave
which are labeled exactly the same AW-CM276MA 2276MA PCIE-UART except
one works and not the other.
The inscriptions on the chips themselves are slightly different, one saying
it's a W8997-M1216 from marvell (works) and the other having an AW-CM276NF
azurewave mark. The electronics around them are also different.

I could say it's just a bad chip, but I've actually got two of each
(samples) which act the same... And I've tried it in another device
where it works with the same kernel/firmware, so there must be something
wrong with the board as well, since the wifi card works elsewhere.


Anyway, if someone knows how to get around to debugging this, I'd
appreciate a pointer! I can't see anything wrong with the tools I have
here.
If nothing else, I can't make sense of /sys/class/devcoredump/devcd*/data,
which I saw Amitkumar Karwar request somewhere else, so just deciphering
this would be a great help.
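
For anyone who wants to keep such a dump around to pass on to the vendor, here
is a minimal Python sketch that copies every devcd*/data file somewhere
persistent. The function name and the destination layout are made up for
illustration; the only thing taken from this mail is the
/sys/class/devcoredump/devcd*/data path. As far as I know the kernel discards
a dump after a timeout, so it may be worth copying it promptly.

```python
import shutil
from pathlib import Path


def save_devcoredumps(sysfs_dir: str, dest_dir: str) -> list[Path]:
    """Copy every devcd*/data file found under sysfs_dir into dest_dir.

    save_devcoredumps is a hypothetical helper, not part of any tool.
    Returns the list of files written, one per dump found.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for data in sorted(Path(sysfs_dir).glob("devcd*/data")):
        # Name the copy after the devcdN directory it came from.
        target = dest / f"{data.parent.name}.data"
        with data.open("rb") as src, target.open("wb") as out:
            shutil.copyfileobj(src, out)
        saved.append(target)
    return saved
```

Calling save_devcoredumps("/sys/class/devcoredump", "/tmp/dumps") as root would
be the intended use. It only preserves the data; it does not decode it.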


dmesg looks like this on failure:
[ 108.513028] mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout cmd id = 0x10, act = 0x1
[ 108.522388] mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0
[ 108.528310] mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0
[ 108.534143] mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1
[ 108.539631] mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0
[ 108.545029] mwifiex_pcie 0000:01:00.0: last_cmd_index = 0
[ 108.550431] mwifiex_pcie 0000:01:00.0: last_cmd_id: 10 00 28 00 16 00 cd 00 1e 00
[ 108.557913] mwifiex_pcie 0000:01:00.0: last_cmd_act: 01 00 13 00 01 00 01 00 00 00
[ 108.565484] mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 4
[ 108.571318] mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: df 80 28 80 16 80 cd 80 1e 80
[ 108.579237] mwifiex_pcie 0000:01:00.0: last_event_index = 2
[ 108.584810] mwifiex_pcie 0000:01:00.0: last_event: 00 00 0b 00 0a 00 00 00 00 00
[ 108.592206] mwifiex_pcie 0000:01:00.0: data_sent=0 cmd_sent=1
[ 108.597954] mwifiex_pcie 0000:01:00.0: ps_mode=1 ps_state=0
[ 108.604085] mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump start===
[ 108.613552] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (16.68.1.p179)
[ 108.621748] mwifiex_pcie 0000:01:00.0: PCIE register dump start
[ 108.627676] mwifiex_pcie 0000:01:00.0: pcie scratch register:
[ 108.633441] mwifiex_pcie 0000:01:00.0: reg:0xcf0, value=0xfedcba00
reg:0xcf8, value=0x8260049
reg:0xcfc, value=0x1282820

[ 108.648584] mwifiex_pcie 0000:01:00.0: PCIE register dump end
[ 108.654411] mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump end===
[ 108.661119] mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump start ==
[ 110.560689] mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -110
[ 148.127107] mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump end ==
[ 148.134552] mwifiex_pcie 0000:01:00.0: == mwifiex dump information to /sys/class/devcoredump start
[ 148.143669] mwifiex_pcie 0000:01:00.0: == mwifiex dump information to /sys/class/devcoredump end
[ 148.152485] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
[ 148.158915] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
[ 148.165829] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
[ 148.443761] mwifiex_pcie 0000:01:00.0: info: dnld wifi firmware from 169340 bytes
[ 149.511193] mwifiex_pcie 0000:01:00.0: info: FW download over, size 632240 bytes
[ 150.163677] mwifiex_pcie 0000:01:00.0: WLAN FW is active
[ 150.231583] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (16.68.1.p179)
[ 150.239814] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (16.68.1.p179)
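
The last_cmd_id and last_cmd_resp_id lines in the dump above can be lined up
with a few lines of Python. This is only a sketch under two assumptions that
are not stated in the log itself: each entry is a little-endian 16-bit value,
and a response carries the id of its command with bit 0x8000 set (I believe
the driver calls this HostCmd_RET_BIT). The helper names are made up.

```python
# Assumption: responses are the command id OR'ed with this bit.
RET_BIT = 0x8000


def decode_u16_le(hex_bytes: str) -> list[int]:
    """Turn a dump such as '10 00 28 00' into [0x0010, 0x0028]."""
    raw = bytes.fromhex(hex_bytes)
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def strip_ret_bit(resp_ids: list[int]) -> list[int]:
    """Clear the assumed response bit so ids can be compared with commands."""
    return [r & ~RET_BIT & 0xFFFF for r in resp_ids]


# Values copied from the dmesg output above.
last_cmd_id = decode_u16_le("10 00 28 00 16 00 cd 00 1e 00")
last_cmd_resp_id = decode_u16_le("df 80 28 80 16 80 cd 80 1e 80")
last_cmd_index = 0

timed_out = last_cmd_id[last_cmd_index]
answered = strip_ret_bit(last_cmd_resp_id)

print(f"timed-out command: {timed_out:#06x}")
print("commands:", [f"{c:#06x}" for c in last_cmd_id])
print("answered:", [f"{c:#06x}" for c in answered])
```

With the values from this dump, slots 1 to 4 of the response list match the
command list once the 0x8000 bit is stripped, while slot 0 holds 0x0010 on the
command side and an older 0x00df on the response side. That is consistent with
command 0x10 having been sent and never answered, which is what the timeout
message says.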

I tried with two different firmwares, full dmesg and data.txt are here:
hang on `ip link set mlan0 up`:
https://codewreck.org/tmp/16.68.1.p179-data.txt
https://codewreck.org/tmp/16.68.1.p179-dmesg

hang on `iw mlan0 scan` after successful link up:
https://codewreck.org/tmp/16.68.1.p179-2-data.txt
https://codewreck.org/tmp/16.68.1.p179-2-dmesg

other firmware (dmesg truncated to just timeout message):
https://codewreck.org/tmp/16.68.10.p16-data.txt
https://codewreck.org/tmp/16.68.10.p16-dmesg



Extra info:
- it doesn't always fail at the same place, so this looks like a
tolerance problem? e.g. sometimes transmission works and sometimes
a message is garbled?

- on the working azurewave module I can keep the card maxed at ~300 Mbps
in or ~100 Mbps out without problem for a while with iperf so signals
can't be that bad...? Or that could just be wishful thinking!



Thanks,
--
Dominique Martinet


2021-09-08 05:39:26

by Dominique Martinet

Subject: Re: mwifiex cmd timeout on one pci variant

(+cc Jonas Dreßler, sorry for two mails in a row for others)

Dominique MARTINET wrote on Wed, Sep 08, 2021 at 01:43:43PM +0900:
> I've got a board with an i.MX8MP chip, and three different marvell W8997
> M.2 modules

(I just noticed Jonas' patches "mwifiex: Work around firmware bugs on
88W8897 chip" on linux-wireless, but it doesn't seem to change anything
for me, so my problem isn't related to pci post or interrupt wake
apparently. Was worth a try...)

I'm surprised though that he says the latest firmware is 15.68.19.p21, as I
can't find it anywhere -- linux-firmware only has up to 16.68.1.p179 and
I got 16.68.10.p16 from NXP dependencies. Now that I'm searching a bit
harder I also found 16.92.10.p124 !? (note 16.92 instead of 16.68, also
NXP) but I have no idea where to find anything 'official' from marvell
as git.marvell.com/mwifiex-firmware.git disappeared.

Where could I find this version you speak of?


Thanks,

> -- one from laird which works fine, and two from azurewave
> which are labeled exactly the same AW-CM276MA 2276MA PCIE-UART except
> one works and not the other.
> The inscription on the chip itself are slightly different, one saying
> it's a W8997-M1216 from marvell (works) and the other having AW-CM276NF
> azurewave mark. The electronics around are also different.
>
> I could say it's just a bad chip, but I've actually got two of each
> (samples) which act the same... And I've tried it in another device
> where it works with the same kernel/firmware, so there must be something
> wrong on the board as well as the wifi card works elsewhere.
>
>
> Anyway, if someone knows how to get around to debugging this, I'd
> appreciate a pointer! I can't see anything wrong with the tools I have
> here.
> If nothing else, I can't read /sys/class/devcoredump/devcd*/data that I
> saw Amitkumar Karwar request somewhere else, so just deciphering this
> would be great help.
>
>
> dmesg looks like this on failure:
> [ 108.513028] mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout cmd id = 0x10, act = 0x1
> [ 108.522388] mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0
> [ 108.528310] mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0
> [ 108.534143] mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1
> [ 108.539631] mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0
> [ 108.545029] mwifiex_pcie 0000:01:00.0: last_cmd_index = 0
> [ 108.550431] mwifiex_pcie 0000:01:00.0: last_cmd_id: 10 00 28 00 16 00 cd 00 1e 00
> [ 108.557913] mwifiex_pcie 0000:01:00.0: last_cmd_act: 01 00 13 00 01 00 01 00 00 00
> [ 108.565484] mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 4
> [ 108.571318] mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: df 80 28 80 16 80 cd 80 1e 80
> [ 108.579237] mwifiex_pcie 0000:01:00.0: last_event_index = 2
> [ 108.584810] mwifiex_pcie 0000:01:00.0: last_event: 00 00 0b 00 0a 00 00 00 00 00
> [ 108.592206] mwifiex_pcie 0000:01:00.0: data_sent=0 cmd_sent=1
> [ 108.597954] mwifiex_pcie 0000:01:00.0: ps_mode=1 ps_state=0
> [ 108.604085] mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump start===
> [ 108.613552] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (16.68.1.p179)
> [ 108.621748] mwifiex_pcie 0000:01:00.0: PCIE register dump start
> [ 108.627676] mwifiex_pcie 0000:01:00.0: pcie scratch register:
> [ 108.633441] mwifiex_pcie 0000:01:00.0: reg:0xcf0, value=0xfedcba00
> reg:0xcf8, value=0x8260049
> reg:0xcfc, value=0x1282820
>
> [ 108.648584] mwifiex_pcie 0000:01:00.0: PCIE register dump end
> [ 108.654411] mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump end===
> [ 108.661119] mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump start ==
> [ 110.560689] mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -110
> [ 148.127107] mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump end ==
> [ 148.134552] mwifiex_pcie 0000:01:00.0: == mwifiex dump information to /sys/class/devcoredump start
> [ 148.143669] mwifiex_pcie 0000:01:00.0: == mwifiex dump information to /sys/class/devcoredump end
> [ 148.152485] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> [ 148.158915] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
> [ 148.165829] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> [ 148.443761] mwifiex_pcie 0000:01:00.0: info: dnld wifi firmware from 169340 bytes
> [ 149.511193] mwifiex_pcie 0000:01:00.0: info: FW download over, size 632240 bytes
> [ 150.163677] mwifiex_pcie 0000:01:00.0: WLAN FW is active
> [ 150.231583] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (16.68.1.p179)
> [ 150.239814] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (16.68.1.p179)
>
> I tried with two different firmwares, full dmesg and data.txt are here:
> hang on `ip link set mlan0 up`:
> https://codewreck.org/tmp/16.68.1.p179-data.txt
> https://codewreck.org/tmp/16.68.1.p179-dmesg
>
> hang on `iw mlan0 scan` after successful link up:
> https://codewreck.org/tmp/16.68.1.p179-2-data.txt
> https://codewreck.org/tmp/16.68.1.p179-2-dmesg
>
> other firmware (dmesg truncated to just timeout message):
> https://codewreck.org/tmp/16.68.10.p16-data.txt
> https://codewreck.org/tmp/16.68.10.p16-dmesg
>
>
>
> Extra info:
> - it doesn't always fail at the same place, so this looks like a
> tolerance problem? e.g. sometimes transmission works and sometimes
> a message is garbled?
>
> - on the working azurewave module I can keep the card maxed at ~300mbps
> in or ~100mbps out without problem for a while with iperf so signals
> can't be that bad...? Or that could just be wishful thinking!
--
Dominique Martinet

2021-09-08 05:48:50

by Sharvari Harisangam

Subject: RE: [EXT] Re: mwifiex cmd timeout on one pci variant

Hi Dominique,

Use firmware from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/mrvl
for mwifiex driver.


Thanks,
Sharvari

> -----Original Message-----
> From: Dominique MARTINET <[email protected]>
> Sent: Wednesday, September 8, 2021 11:06 AM
> To: [email protected]; Amitkumar Karwar
> <[email protected]>; Jonas Dreßler <[email protected]>
> Cc: Takashi Iwai <[email protected]>; Tsuchiya Yuto <[email protected]>; Geert
> Uytterhoeven <[email protected]>; Arnd Bergmann <[email protected]>;
> Lee Jones <[email protected]>; Kalle Valo <[email protected]>; Xinming
> Hu <[email protected]>; Sharvari Harisangam
> <[email protected]>; Ganapathi Bhat <[email protected]>
> Subject: [EXT] Re: mwifiex cmd timeout on one pci variant
>
> (+cc Jonas Dreßler, sorry for two mails in a row for others)
>
> Dominique MARTINET wrote on Wed, Sep 08, 2021 at 01:43:43PM +0900:
> > I've got a board with an i.MX8MP chip, and three different marvell W8997
> > M.2 modules
>
> (I just noticed Jonas' patches "mwifiex: Work around firmware bugs on
> 88W8897 chip" on linux-wireless, but it doesn't seem to change anything for me,
> so my problem isn't related to pci post or interrupt wake apparently. Was worth
> a try...)
>
> I'm surprised though he says the latest firmware is 15.68.19.p21, but I can't find
> it anywhere -- linux-firmware only has up to 16.68.1.p179 and I got 16.68.10.p16
> from NXP dependencies, and now I'm searching a bit harder i also found
> 16.92.10.p124 !? (note 16.92 instead of 16.68, also
> NXP) but I have no idea where to find anything 'official' from marvell as
> git.marvell.com/mwifiex-firmware.git disappeared.
>
> Where could I find this version you speak of?
>
>
> Thanks,
>
> > -- one from laird which works fine, and two from azurewave which are
> > labeled exactly the same AW-CM276MA 2276MA PCIE-UART except one works
> > and not the other.
> > The inscription on the chip itself are slightly different, one saying
> > it's a W8997-M1216 from marvell (works) and the other having
> > AW-CM276NF azurewave mark. The electronics around are also different.
> >
> > I could say it's just a bad chip, but I've actually got two of each
> > (samples) which act the same... And I've tried it in another device
> > where it works with the same kernel/firmware, so there must be
> > something wrong on the board as well as the wifi card works elsewhere.
> >
> >
> > Anyway, if someone knows how to get around to debugging this, I'd
> > appreciate a pointer! I can't see anything wrong with the tools I have
> > here.
> > If nothing else, I can't read /sys/class/devcoredump/devcd*/data that
> > I saw Amitkumar Karwar request somewhere else, so just deciphering
> > this would be great help.
> >
> >
> > dmesg looks like this on failure:
> > [ 108.513028] mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout cmd id = 0x10, act = 0x1
> > [ 108.522388] mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0
> > [ 108.528310] mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0
> > [ 108.534143] mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1
> > [ 108.539631] mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0
> > [ 108.545029] mwifiex_pcie 0000:01:00.0: last_cmd_index = 0
> > [ 108.550431] mwifiex_pcie 0000:01:00.0: last_cmd_id: 10 00 28 00 16 00 cd 00 1e 00
> > [ 108.557913] mwifiex_pcie 0000:01:00.0: last_cmd_act: 01 00 13 00 01 00 01 00 00 00
> > [ 108.565484] mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 4
> > [ 108.571318] mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: df 80 28 80 16 80 cd 80 1e 80
> > [ 108.579237] mwifiex_pcie 0000:01:00.0: last_event_index = 2
> > [ 108.584810] mwifiex_pcie 0000:01:00.0: last_event: 00 00 0b 00 0a 00 00 00 00 00
> > [ 108.592206] mwifiex_pcie 0000:01:00.0: data_sent=0 cmd_sent=1
> > [ 108.597954] mwifiex_pcie 0000:01:00.0: ps_mode=1 ps_state=0
> > [ 108.604085] mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump start===
> > [ 108.613552] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (16.68.1.p179)
> > [ 108.621748] mwifiex_pcie 0000:01:00.0: PCIE register dump start
> > [ 108.627676] mwifiex_pcie 0000:01:00.0: pcie scratch register:
> > [ 108.633441] mwifiex_pcie 0000:01:00.0: reg:0xcf0, value=0xfedcba00
> > reg:0xcf8, value=0x8260049
> > reg:0xcfc, value=0x1282820
> >
> > [ 108.648584] mwifiex_pcie 0000:01:00.0: PCIE register dump end
> > [ 108.654411] mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump end===
> > [ 108.661119] mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump start ==
> > [ 110.560689] mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -110
> > [ 148.127107] mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump end ==
> > [ 148.134552] mwifiex_pcie 0000:01:00.0: == mwifiex dump information to /sys/class/devcoredump start
> > [ 148.143669] mwifiex_pcie 0000:01:00.0: == mwifiex dump information to /sys/class/devcoredump end
> > [ 148.152485] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state
> > [ 148.158915] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex...
> > [ 148.165829] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed
> > [ 148.443761] mwifiex_pcie 0000:01:00.0: info: dnld wifi firmware from 169340 bytes
> > [ 149.511193] mwifiex_pcie 0000:01:00.0: info: FW download over, size 632240 bytes
> > [ 150.163677] mwifiex_pcie 0000:01:00.0: WLAN FW is active
> > [ 150.231583] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (16.68.1.p179)
> > [ 150.239814] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (16.68.1.p179)
> >
> > I tried with two different firmwares, full dmesg and data.txt are here:
> > hang on `ip link set mlan0 up`:
> > https://codewreck.org/tmp/16.68.1.p179-data.txt
> > https://codewreck.org/tmp/16.68.1.p179-dmesg
> >
> > hang on `iw mlan0 scan` after successful link up:
> > https://codewreck.org/tmp/16.68.1.p179-2-data.txt
> > https://codewreck.org/tmp/16.68.1.p179-2-dmesg
> >
> > other firmware (dmesg truncated to just timeout message):
> > https://codewreck.org/tmp/16.68.10.p16-data.txt
> > https://codewreck.org/tmp/16.68.10.p16-dmesg
> >
> >
> >
> > Extra info:
> > - it doesn't always fail at the same place, so this looks like a
> > tolerance problem? e.g. sometimes transmission works and sometimes a
> > message is garbled?
> >
> > - on the working azurewave module I can keep the card maxed at
> > ~300mbps in or ~100mbps out without problem for a while with iperf so
> > signals can't be that bad...? Or that could just be wishful thinking!
> --
> Dominique Martinet

2021-09-08 05:59:47

by Dominique Martinet

Subject: Re: [EXT] Re: mwifiex cmd timeout on one pci variant

Sharvari Harisangam wrote on Wed, Sep 08, 2021 at 05:45:53AM +0000:
> Use firmware from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/mrvl
> for mwifiex driver.

Thanks, that's the first firmware I was using; it's currently at
16.68.1.p179 which is why I'm surprised Jonas said the latest would be
15.68.19.p21.

I think it's just a different variant of the driver now though:
a binary grep matches 15.68.19.p21 for pcie8897_uapsta.bin but my
driver loads pcieuart8997_combo_v4.bin.
I hadn't noticed the first number didn't match, but that likely confirms
it.
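
In case it is useful to others, the "binary grep" can be sketched in Python as
below. It assumes version strings look like 16.68.1.p179 and are stored as
plain ASCII in the blob, which is what the match on pcie8897_uapsta.bin
suggests but which I have not checked for every file. The function names and
the regular expression are mine.

```python
import re
from pathlib import Path

# Assumption: versions look like "16.68.1.p179" and are stored as ASCII.
VERSION_RE = re.compile(rb"\d{1,3}\.\d{1,3}\.\d{1,3}\.p\d{1,4}")


def find_versions(blob: bytes) -> list[str]:
    """Return the distinct version-like strings in blob, in order of appearance."""
    seen = []
    for match in VERSION_RE.finditer(blob):
        text = match.group().decode("ascii")
        if text not in seen:
            seen.append(text)
    return seen


def find_versions_in_file(path: str) -> list[str]:
    """Same as find_versions, reading the firmware image from disk."""
    return find_versions(Path(path).read_bytes())
```

Running find_versions_in_file() on each file under mrvl/ in linux-firmware
would show which numbering scheme each image uses. A pattern this loose can
also match unrelated byte sequences, so the output needs a sanity check.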

Sorry for the noise on firmware version, I'm still interested in
understanding why the command times out.
--
Dominique Martinet

2021-09-08 16:58:08

by Brian Norris

Subject: Re: mwifiex cmd timeout on one pci variant

On Tue, Sep 7, 2021 at 9:52 PM Dominique MARTINET
<[email protected]> wrote:
> This is probably more a question for the maker.. But maybe someone will
> know.

Last I knew, at least one of the CC'd is still employed by the owner
of this IP (NXP now), but I don't know that much. But then, they
haven't been giving out a lot of free support lately, AFAIK.

> I've got a mwifiex M.2 pci module that won't take any wifi command and
> hang right away (dmesg below). Bluetooth through serial works.
>
>
> Context:
> I've got a board with an i.MX8MP chip, and three different marvell W8997
> M.2 modules -- one from laird which works fine, and two from azurewave
> which are labeled exactly the same AW-CM276MA 2276MA PCIE-UART except
> one works and not the other.
> The inscription on the chip itself are slightly different, one saying
> it's a W8997-M1216 from marvell (works) and the other having AW-CM276NF
> azurewave mark. The electronics around are also different.

FWIW, I've only ever worked with the PCIe-USB variant of this chip.
And it had tons of bugs that resulted in "command timeouts" along the
way, and it took a lot of co-working with Marvell to get the firmware
fixed. I don't know their release model well enough to know whether
the PCIe-UART variant will have all the same bugs (and bugfixes). But
in case it helps, here's our firmware history:
https://chromium.googlesource.com/chromiumos/third_party/linux-firmware/+log/HEAD/mrvl/pcieusb8997_combo_v4.bin
*Most* of those should align pretty closely with what was published to
linux-firmware.git, but it's not guaranteed, since Marvell didn't
always follow our upstream-first guidelines there.

> I could say it's just a bad chip, but I've actually got two of each
> (samples) which act the same... And I've tried it in another device
> where it works with the same kernel/firmware, so there must be something
> wrong on the board as well as the wifi card works elsewhere.

I've seen something as small as the "wrong" kind of noise cause the
firmware to grind to a halt, so there could be a firmware bug that
gets tickled by the particular layout or antenna configuration of the
board in question. I suppose that's not a very helpful guess, but at
least it might validate your observations.

> Anyway, if someone knows how to get around to debugging this, I'd
> appreciate a pointer! I can't see anything wrong with the tools I have
> here.
> If nothing else, I can't read /sys/class/devcoredump/devcd*/data that I
> saw Amitkumar Karwar request somewhere else, so just deciphering this
> would be great help.

I've never had success with that, but I haven't tried very hard. My
understanding is that it's something equivalent to a register and
state dump of the proprietary firmware, so you won't really learn much
from it without proprietary knowledge.

Good luck,
Brian

2021-09-08 23:51:28

by Dominique Martinet

Subject: Re: mwifiex cmd timeout on one pci variant

Brian Norris wrote on Wed, Sep 08, 2021 at 09:56:57AM -0700:
> On Tue, Sep 7, 2021 at 9:52 PM Dominique MARTINET
> <[email protected]> wrote:
> > This is probably more a question for the maker.. But maybe someone will
> > know.
>
> Last I knew, at least one of the CC'd is still employed by the owner
> of this IP (NXP now), but I don't know that much. But then, they
> haven't been giving out a lot of free support lately, AFAIK.

Thanks for the information.
I've also tried reaching out through our reseller before but it'll
likely take some time to trickle up, if it ever does...

> > Context:
> > I've got a board with an i.MX8MP chip, and three different marvell W8997
> > M.2 modules -- one from laird which works fine, and two from azurewave
> > which are labeled exactly the same AW-CM276MA 2276MA PCIE-UART except
> > one works and not the other.
> > The inscription on the chip itself are slightly different, one saying
> > it's a W8997-M1216 from marvell (works) and the other having AW-CM276NF
> > azurewave mark. The electronics around are also different.
>
> FWIW, I've only ever worked with the PCIe-USB variant of this chip.
> And it had tons of bugs that resulted in "command timeouts" along the
> way, and it took a lot of co-working with Marvell to get the firmware
> fixed. I don't know their release model well enough to know whether
> the PCIe-UART variant will have all the same bugs (and bugfixes). But
> in case it helps, here's our firmware history:
> https://chromium.googlesource.com/chromiumos/third_party/linux-firmware/+log/HEAD/mrvl/pcieusb8997_combo_v4.bin
> *Most* of those should align pretty closely with what was published to
> linux-firmware.git, but it's not guaranteed, since Marvell didn't
> always follow our upstream-first guidelines there.

That indeed shows more updates for pcieusb than pcieuart..
The upstream repo https://github.com/NXP/mwifiex-firmware/ listed in the
commit message also only seems to have ever updated pcieusb, so I guess
that's where the efforts have been poured.

It also made me try to force loading 'wrong' firmwares to see what happens,
and pcie8997_wlan_v4.bin actually loads! But it seems to have similar
problems. Others fail to start, but that's not really surprising.


> > I could say it's just a bad chip, but I've actually got two of each
> > (samples) which act the same... And I've tried it in another device
> > where it works with the same kernel/firmware, so there must be something
> > wrong on the board as well as the wifi card works elsewhere.
>
> I've seen something as small as the "wrong" kind of noise cause the
> firmware to grind to a halt, so there could be a firmware bug that
> gets tickled by the particular layout or antenna configuration of the
> board in question. I suppose that's not a very helpful guess, but at
> least it might validate your observations.

Every bit helps! :)


> > Anyway, if someone knows how to get around to debugging this, I'd
> > appreciate a pointer! I can't see anything wrong with the tools I have
> > here.
> > If nothing else, I can't read /sys/class/devcoredump/devcd*/data that I
> > saw Amitkumar Karwar request somewhere else, so just deciphering this
> > would be great help.
>
> I've never had success with that, but I haven't tried very hard. My
> understanding is that it's something equivalent to a register and
> state dump of the proprietary firmware, so you won't really learn much
> from it without proprietary knowledge.

Yes, that was my guess as well (hence the many Ccs with hopefully one
working there :-D), but I also understand free support isn't something
all companies let or encourage their employees to do.


Thanks for the reply anyway, if nothing else it might make the
difference for paying a bit more for a card with less troubling
history if my words weigh enough.

--
Dominique

2021-09-14 10:12:37

by Jonas Dreßler

Subject: Re: [EXT] Re: mwifiex cmd timeout on one pci variant

Hi Dominique,

regarding the firmware version, as you can see in the commit updating
the firmware binaries
(https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/mrvl/pcie8897_uapsta.bin?id=1a5773c0c89ee44cee51a285d5c7c1063cdb0891),
indeed the version numbering differs between the different versions of
the card (usb/usb, pcie/usb, pcie/uart(?)).

Anyway, if you manage to find newer firmware for any of those versions,
I'd be happy if you could point me to it. Apparently they just fixed a
critical vulnerability in the Windows firmware again (see
https://support.microsoft.com/en-us/surface/surface-pro-5th-gen-update-history-5203144a-90c1-63df-ce0b-7ec7ff32ff10),
and I wouldn't be surprised if our firmware is also affected by that.

About the command timeout, I have no idea why the fix isn't working for
you, but well, my analysis of the issue is also just a (not exactly
educated) guess, so it might as well be a completely different problem
and my fix is just a lucky hack.

I'd kinda hope though that my proposed patches finally wake up some
people at NXP and motivate them to take a look at that firmware repo again.

Jonas

On 9/8/21 7:56 AM, Dominique MARTINET wrote:
> Sharvari Harisangam wrote on Wed, Sep 08, 2021 at 05:45:53AM +0000:
>> Use firmware from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/mrvl
>> for mwifiex driver.
>
> Thanks, that's the first firmware I was using; it's currently at
> 16.68.1.p179 which is why I'm surprised Jonas said the latest would be
> 15.68.19.p21.
>
> I think it's just a different variant of the driver now though,
> a binary grep matches 15.68.19.p21 for pcie8897_uapsta.bin but I my
> driver loads pcieuart8997_combo_v4.bin
> I hadn't noticed the first number didn't match, but that likely confirms
> it.
>
> Sorry for the noise on firmware version, I'm still interested in
> understanding why the command timeouts.
>

2021-09-15 01:45:33

by Dominique Martinet

Subject: Re: [EXT] Re: mwifiex cmd timeout on one pci variant

Hi Jonas,

Jonas Dreßler wrote on Tue, Sep 14, 2021 at 12:11:46PM +0200:
> regarding the firmware version, as you can see in the commit updating the
> firmware binaries (https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/mrvl/pcie8897_uapsta.bin?id=1a5773c0c89ee44cee51a285d5c7c1063cdb0891),
> indeed the version numbering differs between the different versions of the
> card (usb/usb, pcie/usb, pcie/uart(?)).

Right. The update frequency is also quite different, so I'm assuming the
pcie/uart version I'm using has a lot of vulnerabilities left open as
well...


> Anyway, if you manage to find newer firmware for any of those versions, I'd
> be happy if you could point me to that, apparently they just fixed a
> critical vulnerability in the Windows firmware again (see https://support.microsoft.com/en-us/surface/surface-pro-5th-gen-update-history-5203144a-90c1-63df-ce0b-7ec7ff32ff10),
> I wouldn't be surprised if our firmware is also affected by that.

That sounds like a safe bet..
I assume the firmwares are not compatible and we can't just load these?


> About the command timeout, I have no idea why the fix isn't working for you,
> but well, my analysis of the issue is also just a (not exactly educated)
> guess, so it might as well be a completely different problem and my fix is
> just a lucky hack.

Right, it really depends on why the firmware crashed, but we have no way
of investigating that at the moment.

> I'd kinda hope though that my proposed patches finally wake up some people
> at NXP and motivate them to take a look at that firmware repo again.

If it works well enough it could be a reason not to bother :D
Alternatively if they can't spend time on it maybe open the firmware
code (under NDA? my company probably already has one with NXP..), but
my problem will need more time to reach them through regular channels.

--
Dominique

2021-09-15 10:07:38

by Jonas Dreßler

Subject: Re: [EXT] Re: mwifiex cmd timeout on one pci variant

On 9/15/21 3:43 AM, Dominique MARTINET wrote:
> Hi Jonas,
>
> Jonas Dreßler wrote on Tue, Sep 14, 2021 at 12:11:46PM +0200:
>> regarding the firmware version, as you can see in the commit updating the
>> firmware binaries (https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/mrvl/pcie8897_uapsta.bin?id=1a5773c0c89ee44cee51a285d5c7c1063cdb0891),
>> indeed the version numbering differs between the different versions of the
>> card (usb/usb, pcie/usb, pcie/uart(?)).
>
> Right. The update frequency is also quite different, so I'm assuming the
> pcie/uart version I'm using has a lot of vulnerabilities left open as
> well...
>
>
>> Anyway, if you manage to find newer firmware for any of those versions, I'd
>> be happy if you could point me to that, apparently they just fixed a
>> critical vulnerability in the Windows firmware again (see https://support.microsoft.com/en-us/surface/surface-pro-5th-gen-update-history-5203144a-90c1-63df-ce0b-7ec7ff32ff10),
>> I wouldn't be surprised if our firmware is also affected by that.
>
> That sounds like a safe bet..
> I assume the firmwares are not compatible and we can't just load these?

Yeah, they're quite similar and seem to descend from the same codebase,
but the APIs between the kernel driver and the firmware are very different.

>
>
>> About the command timeout, I have no idea why the fix isn't working for you,
>> but well, my analysis of the issue is also just a (not exactly educated)
>> guess, so it might as well be a completely different problem and my fix is
>> just a lucky hack.
>
> Right, it really depends on why the firmware crashed, but we have no way
> of investigating that at the moment.

One more thing that comes to mind after reading this discussion
https://lore.kernel.org/linux-wireless/[email protected]/
is that maybe the read-back is really only serving the purpose of a
udelay(), so if you want you can try playing around with that a bit
instead of the read-back.

>
>> I'd kinda hope though that my proposed patches finally wake up some people
>> at NXP and motivate them to take a look at that firmware repo again.
>
> If it works well enough it could be a reason not to bother :D
> Alternatively if they can't spend time on it maybe open the firmware
> code (under NDA? my company probably already has one with NXP..), but
> my problem will need more time to reach them through regular channels.
>