2022-02-01 08:25:57

by Felipe Contreras

Subject: Regression in 5.16.3 with mt7921e

Hello,

I've always had trouble with this driver on my Asus Zephyrus laptop,
but I was always able to get it working eventually, that is, until
5.16.3 landed.

This version completely broke it. I'm unable to bring the interface
up, no matter what I try.

Before, sometimes I was able to make the chip work by suspending the
laptop, but in 5.16.3 the machine doesn't wake up (which is probably
another issue).

Reverting back to 5.16.2 makes it work.

Let me know if you need more information, or if you would like me to
bisect the issue.

Cheers.

--
Felipe Contreras


2022-02-01 08:27:42

by James

Subject: Re: Regression in 5.16.3 with mt7921e

Does dmesg show anything?

2022-02-01 08:43:46

by Felipe Contreras

Subject: Re: Regression in 5.16.3 with mt7921e

On Sat, Jan 29, 2022 at 1:12 PM James <[email protected]> wrote:
>
> Does dmesg show anything?

It's hard to tell because there seem to be multiple conflated
issues. I booted into 5.16.3 again, and this time I experienced a
different problem. So far I've seen these two:

1. The device appears, but I'm unable to bring it up
2. The device doesn't even appear

For issue #2 I see this interesting error:

[ 0.325945] Freeing initrd memory: 8768K
[ 0.331968] ------------[ cut here ]------------
[ 0.331969] WARNING: CPU: 4 PID: 1 at drivers/iommu/amd/init.c:839 amd_iommu_enable_interrupts+0x352/0x430
[ 0.331975] Modules linked in:
[ 0.331977] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 5.16.3-arch1-1 #1 ca51a3fe35922d501638d513dc9548a2c4fed987
[ 0.331980] Hardware name: ASUSTeK COMPUTER INC. ROG Zephyrus G14 GA401QM_GA401QM/GA401QM, BIOS GA401QM.410 12/13/2021
[ 0.331980] RIP: 0010:amd_iommu_enable_interrupts+0x352/0x430
[ 0.331982] Code: ff ff 48 8b 7b 18 89 04 24 e8 2a 3a ed ff 8b 04 24 e9 45 fd ff ff 0f 0b 48 8b 1b 48 81 fb 70 09 b6 99 0f 85 00 fd ff ff eb 96 <0f> 0b 48 8b 1b 48 81 fb 70 09 b6 99 0f 85 ec fc ff ff eb 82 31 f6
[ 0.331983] RSP: 0018:ffffa17a00087db8 EFLAGS: 00010246
[ 0.331985] RAX: 0000000000000018 RBX: ffff89af0004b000 RCX: ffffa17a00100000
[ 0.331986] RDX: 0000000000000000 RSI: ffffa17a00100000 RDI: 0000000000000000
[ 0.331986] RBP: 0000000080000000 R08: 0000000000000000 R09: 0000000000000000
[ 0.331987] R10: 0000000000000000 R11: 0000000000000000 R12: 000ffffffffffff8
[ 0.331988] R13: 0800000000000000 R14: ffffa17a00087dc0 R15: ffff89af013323c0
[ 0.331988] FS: 0000000000000000(0000) GS:ffff89b1de700000(0000) knlGS:0000000000000000
[ 0.331989] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.331990] CR2: 0000000000000000 CR3: 000000020a410000 CR4: 0000000000750ee0
[ 0.331991] PKRU: 55555554
[ 0.331991] Call Trace:
[ 0.331992] <TASK>
[ 0.331995] iommu_go_to_state+0x1164/0x1458
[ 0.331999] ? e820__memblock_setup+0x7d/0x7d
[ 0.332002] amd_iommu_init+0xf/0x29
[ 0.332003] pci_iommu_init+0x16/0x3f
[ 0.332005] do_one_initcall+0x57/0x220
[ 0.332008] kernel_init_freeable+0x1e8/0x242
[ 0.332010] ? rest_init+0xd0/0xd0
[ 0.332013] kernel_init+0x16/0x130
[ 0.332014] ret_from_fork+0x22/0x30
[ 0.332016] </TASK>
[ 0.332018] ---[ end trace 99de2ba3e793f5cf ]---
[ 0.332018] software IO TLB: tearing down default memory pool

Even more interesting is that I rebooted into 5.16.2 and the same
warning appeared, and the same issue happened: I didn't see the
driver. I turned off the laptop (as opposed to rebooting), and then
turned it on, and now the wireless works fine (in 5.16.2).

The reason I turned off the laptop is that I read in some forums that
turning off the computer and waiting 10 seconds makes the chip work
again (although that was for yet another issue, which I've not
experienced lately and which happened even in Windows).

Here's the whole dmesg: https://dpaste.org/0sj3

I'll try to disable the proprietary nvidia driver to see if there's
any difference.

--
Felipe Contreras

2022-02-01 08:44:14

by Felipe Contreras

Subject: Re: Regression in 5.16.3 with mt7921e

On Sat, Jan 29, 2022 at 1:50 PM Felipe Contreras
<[email protected]> wrote:
>
> On Sat, Jan 29, 2022 at 1:12 PM James <[email protected]> wrote:
> >
> > Does dmesg show anything?
>
> It's hard to tell because it seems there are multiple conflating
> issues. I booted into 5.16.3 again, and this time I experienced a
> different problem, so far I've seen these two:
>
> 1. The device appears, but I'm unable to bring it up
> 2. The device doesn't even appear

I removed the nvidia driver and I was still able to reproduce issue #1.

Here are the interesting bits:

[ 2.295614] mt7921e 0000:02:00.0: enabling device (0000 -> 0002)
[ 2.295810] mt7921e 0000:02:00.0: ASIC revision: 79610010
[ 2.377578] mt7921e 0000:02:00.0: HW/SW Version: 0x8a108a10, Build Time: 20220110230855a
[ 2.846987] mt7921e 0000:02:00.0: WM Firmware Version: ____010000, Build Time: 20220110230951
[ 2.874395] mt7921e 0000:02:00.0: Firmware init done
[ 7.374118] mt7921e 0000:02:00.0: Message 00020001 (seq 4) timeout
[ 7.374180] mt7921e 0000:02:00.0: chip reset
[ 13.773763] mt7921e 0000:02:00.0: Message 000046ed (seq 5) timeout
[ 13.887279] mt7921e 0000:02:00.0: HW/SW Version: 0x8a108a10, Build Time: 20220110230855a
[ 13.958763] mt7921e 0000:02:00.0: WM Firmware Version: ____010000, Build Time: 20220110230951
[ 13.989292] mt7921e 0000:02:00.0: Firmware init done
[ 54.093979] mt7921e 0000:02:00.0: Message 00020001 (seq 10) timeout
[ 54.094010] mt7921e 0000:02:00.0: chip reset
[ 60.493981] mt7921e 0000:02:00.0: Message 000046ed (seq 11) timeout
[ 60.600757] mt7921e 0000:02:00.0: HW/SW Version: 0x8a108a10, Build Time: 20220110230855a
[ 60.672805] mt7921e 0000:02:00.0: WM Firmware Version: ____010000, Build Time: 20220110230951
[ 60.704784] mt7921e 0000:02:00.0: Firmware init done

The last "Firmware init done" happened after I did "ip link set wlan0
up" which failed.
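
The pattern in the excerpt above (a command timeout, a chip reset, then
another "Firmware init done") can be counted mechanically when comparing
boots. The following is only a rough sketch: the regular expressions are
assumptions based on the lines quoted above, and summarize_mt7921e is a
hypothetical helper name, not something provided by the driver or by dmesg.

```python
import re

# Patterns are assumptions derived from the mt7921e lines quoted above.
TIMEOUT_RE = re.compile(r"mt7921e \S+: Message ([0-9a-f]+) \(seq (\d+)\) timeout")
RESET_RE = re.compile(r"mt7921e \S+: chip reset")
INIT_DONE_RE = re.compile(r"mt7921e \S+: Firmware init done")


def summarize_mt7921e(log_text):
    """Collect timeouts and count chip resets and firmware inits in a log."""
    summary = {"timeouts": [], "resets": 0, "init_done": 0}
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            # Keep the message id (hex string) and the sequence number.
            summary["timeouts"].append((match.group(1), int(match.group(2))))
        elif RESET_RE.search(line):
            summary["resets"] += 1
        elif INIT_DONE_RE.search(line):
            summary["init_done"] += 1
    return summary


# A few lines copied from the excerpt above.
SAMPLE = """\
[ 2.874395] mt7921e 0000:02:00.0: Firmware init done
[ 7.374118] mt7921e 0000:02:00.0: Message 00020001 (seq 4) timeout
[ 7.374180] mt7921e 0000:02:00.0: chip reset
[ 13.773763] mt7921e 0000:02:00.0: Message 000046ed (seq 5) timeout
[ 13.989292] mt7921e 0000:02:00.0: Firmware init done
"""

if __name__ == "__main__":
    print(summarize_mt7921e(SAMPLE))
```

Feeding it the saved output of dmesg from a good boot and a bad boot
gives a quick side-by-side of how often the chip was reset.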

Here's the full dmesg: https://dpaste.org/PVTE

Yet another issue (#3) is that the kernel sometimes crashes when
starting up the system. I've mostly ignored this issue, but looking at
the log when that happens, it seems to be related to the mt7921
driver:

Jan 29 14:21:58 chronos kernel: mt7921e 0000:02:00.0: Timeout for driver own
Jan 29 14:21:58 chronos kernel: BUG: Bad page state in process systemd-udevd pfn:103328
Jan 29 14:21:58 chronos kernel: page:00000000128101f9 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x103328
Jan 29 14:21:58 chronos kernel: flags: 0x2ffff0000000000(node=0|zone=2|lastcpupid=0xffff)
Jan 29 14:21:58 chronos kernel: raw: 02ffff0000000000 dead000000000100 dead000000000122 0000000000000000
Jan 29 14:21:58 chronos kernel: raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
Jan 29 14:21:58 chronos kernel: page dumped because: nonzero _refcount
Jan 29 14:21:58 chronos kernel: Modules linked in: bnep btusb btrtl
btbcm ccm algif_aead cbc btintel des_generic libdes ecb bluetooth
iptable_nat ecdh_generic hid_asus nf_nat algif_skcipher nf_conntrack
cmac nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c md4 algif_hash
iptable_mangle iptable_filter af_alg intel_rapl_msr intel_rapl_common
joydev mousedev mt7921e(+) mt7921_common snd_hda_codec_realtek
mt76_connac_lib snd_hda_codec_generic edac_mce_amd ledtrig_audio mt76
snd_hda_codec_hdmi hid_multitouch snd_hda_intel mac80211 kvm_amd
snd_intel_dspcfg vfat asus_nb_wmi snd_intel_sdw_acpi libarc4 fat
amdgpu kvm irqbypass snd_hda_codec asus_wmi snd_pci_acp6x
crct10dif_pclmul cfg80211 snd_hda_core sparse_keymap crc32_pclmul
snd_hwdep ghash_clmulni_intel platform_profile snd_pcm aesni_intel
i8042 snd_timer crypto_simd gpu_sched snd_pci_acp5x cryptd serio
ucsi_acpi sp5100_tco snd drm_ttm_helper snd_rn_pci_acp3x rapl
typec_ucsi pcspkr wmi_bmof rfkill ttm soundcore ccp snd_pci_acp3x
i2c_piix4 k10temp tpm_crb typec tpm_tis
Jan 29 14:21:58 chronos kernel: roles mac_hid tpm_tis_core
i2c_hid_acpi tpm i2c_hid amd_pmc rng_core acpi_cpufreq asus_wireless
pinctrl_amd pkcs8_key_parser crypto_user fuse bpf_preload ip_tables
x_tables usbhid ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci
crc32c_intel xhci_pci_renesas wmi video
Jan 29 14:21:58 chronos kernel: CPU: 12 PID: 396 Comm: systemd-udevd Tainted: G W 5.16.3-arch1-1 #1 ca51a3fe35922d501638d513dc9548a2c4fed987
Jan 29 14:21:58 chronos kernel: Hardware name: ASUSTeK COMPUTER INC. ROG Zephyrus G14 GA401QM_GA401QM/GA401QM, BIOS GA401QM.410 12/13/2021
Jan 29 14:21:58 chronos kernel: Call Trace:
Jan 29 14:21:58 chronos kernel: <TASK>
Jan 29 14:21:58 chronos kernel: dump_stack_lvl+0x48/0x66
Jan 29 14:21:58 chronos kernel: bad_page.cold+0x63/0x94
Jan 29 14:21:58 chronos kernel: free_pcppages_bulk+0x1f2/0x380
Jan 29 14:21:58 chronos kernel: free_unref_page+0xbd/0x140
Jan 29 14:21:58 chronos kernel: mt76_dma_rx_cleanup+0x94/0x120 [mt76 d94b4c9690089b7441d9b3262ec58606565d1b82]
Jan 29 14:21:58 chronos kernel: mt7921_wpdma_reset+0xbc/0x1c0 [mt7921e 7e95012acfae7cc199e541d3b3dbe15de0128110]
Jan 29 14:21:58 chronos kernel: mt7921_register_device+0x32b/0x5e0 [mt7921_common 19fe4291bf468cdc820d57b91bfc4be907d53377]
Jan 29 14:21:58 chronos kernel: mt7921_pci_probe+0x1f1/0x230 [mt7921e 7e95012acfae7cc199e541d3b3dbe15de0128110]
Jan 29 14:21:58 chronos kernel: ? __pm_runtime_resume+0x58/0x80
Jan 29 14:21:58 chronos kernel: local_pci_probe+0x45/0x90
Jan 29 14:21:58 chronos kernel: ? pci_match_device+0xdf/0x140
Jan 29 14:21:58 chronos kernel: pci_device_probe+0xcf/0x1c0
Jan 29 14:21:58 chronos kernel: really_probe+0x203/0x400
Jan 29 14:21:58 chronos kernel: __driver_probe_device+0x112/0x190
Jan 29 14:21:58 chronos kernel: driver_probe_device+0x1e/0x90
Jan 29 14:21:58 chronos kernel: __driver_attach+0xc8/0x1e0
Jan 29 14:21:58 chronos kernel: ? __device_attach_driver+0xf0/0xf0
Jan 29 14:21:58 chronos kernel: ? __device_attach_driver+0xf0/0xf0
Jan 29 14:21:58 chronos kernel: bus_for_each_dev+0x8d/0xe0
Jan 29 14:21:58 chronos kernel: bus_add_driver+0x154/0x200
Jan 29 14:21:58 chronos kernel: driver_register+0x8f/0xf0
Jan 29 14:21:58 chronos kernel: ? 0xffffffffc0753000
Jan 29 14:21:58 chronos kernel: do_one_initcall+0x57/0x220
Jan 29 14:21:58 chronos kernel: do_init_module+0x5c/0x270
Jan 29 14:21:58 chronos kernel: load_module+0x25d7/0x27a0
Jan 29 14:21:58 chronos kernel: ? __alloc_pages_bulk+0x5e7/0x740
Jan 29 14:21:58 chronos kernel: ? __do_sys_init_module+0x12e/0x1b0
Jan 29 14:21:58 chronos kernel: __do_sys_init_module+0x12e/0x1b0
Jan 29 14:21:58 chronos kernel: do_syscall_64+0x5c/0x90
Jan 29 14:21:58 chronos kernel: ? ksys_read+0x67/0xf0
Jan 29 14:21:58 chronos kernel: ? exc_page_fault+0x72/0x180
Jan 29 14:21:58 chronos kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Jan 29 14:21:58 chronos kernel: RIP: 0033:0x7f261332632e
Jan 29 14:21:58 chronos kernel: Code: 48 8b 0d 45 0b 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 12 0b 0c 00 f7 d8 64 89 01 48
Jan 29 14:21:58 chronos kernel: RSP: 002b:00007ffd2b6d7fd8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
Jan 29 14:21:58 chronos kernel: RAX: ffffffffffffffda RBX: 000056487d08e8b0 RCX: 00007f261332632e
Jan 29 14:21:58 chronos kernel: RDX: 00007f261347aa9d RSI: 000000000002b3af RDI: 000056487d408720
Jan 29 14:21:58 chronos kernel: RBP: 000056487d408720 R08: 27d4eb2f165667c5 R09: 0000000000000000
Jan 29 14:21:58 chronos kernel: R10: 000056487d0c3830 R11: 0000000000000246 R12: 00007f261347aa9d
Jan 29 14:21:58 chronos kernel: R13: 0000000000000001 R14: 000056487d0e5bd0 R15: 000056487d08e8b0
Jan 29 14:21:58 chronos kernel: </TASK>
Jan 29 14:21:58 chronos kernel: Disabling lock debugging due to kernel taint

This is not a regression though, as it happens in 5.16.2 too.

Here's a full dmesg of the crash: http://dpaste.org/xtt5

--
Felipe Contreras

2022-02-01 09:58:41

by Greg KH

Subject: Re: Regression in 5.16.3 with mt7921e

On Sat, Jan 29, 2022 at 01:05:50PM -0600, Felipe Contreras wrote:
> Hello,
>
> I've always had trouble with this driver in my Asus Zephyrus laptop,
> but I was able to use it eventually, that's until 5.16.3 landed.
>
> This version completely broke it. I'm unable to bring the interface
> up, no matter what I try.
>
> Before, sometimes I was able to make the chip work by suspending the
> laptop, but in 5.16.3 the machine doesn't wake up (which is probably
> another issue).
>
> Reverting back to 5.16.2 makes it work.
>
> Let me know if you need more information, or if you would like me to
> bisect the issue.

Using 'git bisect' would be best, so we know what commit exactly causes
the problems.
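
For readers unfamiliar with it, 'git bisect' is essentially a binary
search over the commits between a known-good and a known-bad revision.
The sketch below only illustrates that idea in Python on a linear list
of commits; bisect_first_bad and the commit labels are hypothetical, and
the real tool additionally copes with non-linear history and with
commits that have to be skipped.

```python
def bisect_first_bad(commits, is_bad):
    """Return (first_bad_commit, number_of_tests).

    Assumes commits are ordered oldest to newest, commits[0] is known
    good, commits[-1] is known bad, and every commit after the first bad
    one is bad as well.
    """
    good, bad = 0, len(commits) - 1  # indexes of last known good / first known bad
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1  # one kernel build and boot per test in practice
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad], tests


if __name__ == "__main__":
    # Hypothetical range of 201 commits in which commit "c137" broke things.
    history = ["c%d" % i for i in range(201)]
    culprit, tests = bisect_first_bad(history, lambda c: int(c[1:]) >= 137)
    print(culprit, tests)
```

The number of builds grows with the logarithm of the range, so even a
range of a few hundred commits needs fewer than ten test boots.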

thanks,

greg k-h

2022-02-01 09:59:59

by Abhijeet Viswa

Subject: Re: Regression in 5.16.3 with mt7921e

I can confirm this regression. I have an Asus TUF laptop.

I've also tried "resetting" the chip by holding down the power button
for 60 seconds. This usually helped in previous versions.

If it helps, this issue does not exist in Windows (I dual-boot).

This is the dmesg log when trying to bring the interface up:
https://dpaste.org/Wouy

Happy to help diagnose this further.

Thanks
Abhijeet

On 30/01/22 00:35, Felipe Contreras wrote:
> Hello,
>
> I've always had trouble with this driver in my Asus Zephyrus laptop,
> but I was able to use it eventually, that's until 5.16.3 landed.
>
> This version completely broke it. I'm unable to bring the interface
> up, no matter what I try.
>
> Before, sometimes I was able to make the chip work by suspending the
> laptop, but in 5.16.3 the machine doesn't wake up (which is probably
> another issue).
>
> Reverting back to 5.16.2 makes it work.
>
> Let me know if you need more information, or if you would like me to
> bisect the issue.
>
> Cheers.
>

2022-02-01 10:06:43

by Felipe Contreras

Subject: Re: Regression in 5.16.3 with mt7921e

On Sun, Jan 30, 2022 at 1:28 AM Greg KH <[email protected]> wrote:
>
> On Sat, Jan 29, 2022 at 01:05:50PM -0600, Felipe Contreras wrote:
> > Hello,
> >
> > I've always had trouble with this driver in my Asus Zephyrus laptop,
> > but I was able to use it eventually, that's until 5.16.3 landed.
> >
> > This version completely broke it. I'm unable to bring the interface
> > up, no matter what I try.
> >
> > Before, sometimes I was able to make the chip work by suspending the
> > laptop, but in 5.16.3 the machine doesn't wake up (which is probably
> > another issue).
> >
> > Reverting back to 5.16.2 makes it work.
> >
> > Let me know if you need more information, or if you would like me to
> > bisect the issue.
>
> Using 'git bisect' would be best, so we know what commit exactly causes
> the problems.

I know, but it has been a while since I've created a decent config
file to build a kernel.

Either way, I pushed forward and the offending commit is a38b94c43943.

Upstream commit 547224024579 introduced a regression that was fixed by
the next commit 680a2ead741a, but the second commit was never merged
to stable.

I've sent the second commit to fix the regression.
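
A situation like this (one upstream commit backported, its follow-up fix
not) can be spotted by scanning the stable changelog. Stable backports
conventionally cite the mainline commit in their message, either as
"commit <sha> upstream." or as "[ Upstream commit <sha> ]". The sketch
below relies on that convention; missing_backports and git_log_range are
hypothetical helper names, and the revision range in the comment is an
assumption about how one would invoke it.

```python
import re
import subprocess

# Matches the two conventional ways a stable commit cites its mainline origin.
UPSTREAM_RE = re.compile(
    r"(?:^\s*commit ([0-9a-f]{12,40}) upstream\.?\s*$"
    r"|\[ Upstream commit ([0-9a-f]{12,40}) \])",
    re.IGNORECASE | re.MULTILINE,
)


def backported_commits(log_text):
    """Return the set of upstream commit ids cited in a git log dump."""
    found = set()
    for match in UPSTREAM_RE.finditer(log_text):
        found.add((match.group(1) or match.group(2)).lower())
    return found


def missing_backports(log_text, upstream_ids):
    """Return the ids (possibly abbreviated) that no stable commit cites."""
    found = backported_commits(log_text)
    return [
        wanted
        for wanted in upstream_ids
        if not any(full.startswith(wanted.lower()) for full in found)
    ]


def git_log_range(repo_path, rev_range):
    """Return the full commit messages of rev_range in the given repository."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", rev_range],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Assumed usage inside a linux-stable checkout:
#   log = git_log_range(".", "v5.16.2..v5.16.3")
#   print(missing_backports(log, ["547224024579", "680a2ead741a"]))
```

If the second id were reported as missing while the first is not, that
would match the situation described above.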

--
Felipe Contreras

2022-02-01 10:37:06

by Greg KH

Subject: Re: Regression in 5.16.3 with mt7921e

On Sun, Jan 30, 2022 at 02:07:32AM -0600, Felipe Contreras wrote:
> On Sun, Jan 30, 2022 at 1:28 AM Greg KH <[email protected]> wrote:
> >
> > On Sat, Jan 29, 2022 at 01:05:50PM -0600, Felipe Contreras wrote:
> > > Hello,
> > >
> > > I've always had trouble with this driver in my Asus Zephyrus laptop,
> > > but I was able to use it eventually, that's until 5.16.3 landed.
> > >
> > > This version completely broke it. I'm unable to bring the interface
> > > up, no matter what I try.
> > >
> > > Before, sometimes I was able to make the chip work by suspending the
> > > laptop, but in 5.16.3 the machine doesn't wake up (which is probably
> > > another issue).
> > >
> > > Reverting back to 5.16.2 makes it work.
> > >
> > > Let me know if you need more information, or if you would like me to
> > > bisect the issue.
> >
> > Using 'git bisect' would be best, so we know what commit exactly causes
> > the problems.
>
> I know, but it has been a while since I've created a decent config
> file to build a kernel.
>
> Either way, I pushed forward and the commit is a38b94c43943.
>
> Upstream commit 547224024579 introduced a regression that was fixed by
> the next commit 680a2ead741a, but the second commit was never merged
> to stable.
>
> I've sent the second commit to fix the regression.

Wonderful, thanks for figuring this out and sending the fix.

greg k-h