2023-06-14 21:17:31

by Brian Geffon

[permalink] [raw]
Subject: thunderbolt: resume from hibernation CPUs racing in tb_ring_start

Hi,
On ChromeOS we're running a 5.15 kernel patched up to 6.4-rc6 w.r.t to
drivers/thunderbolt code and we're seeing a similar issue to the one
discussed in https://lore.kernel.org/lkml/[email protected]/T/#
/ https://bugzilla.kernel.org/show_bug.cgi?id=217343 where when
resuming from hibernation you'll see warnings along the lines of

[ 126.292769] thunderbolt 0000:00:0d.3: interrupt for RX ring 0 is
already enabled

The thing that's odd is it appears three CPUs are racing through this code path:

[ 126.292076] ------------[ cut here ]------------
[ 126.292077] thunderbolt 0000:00:0d.2: interrupt for TX ring 0 is
already enabled
[ 126.292080] proc_thermal_pci 0000:00:04.0: PM:
pci_pm_thaw_noirq+0x0/0x7c returned 0 after 606 usecs
[ 126.292086] ------------[ cut here ]------------
[ 126.292087] thunderbolt 0000:00:0d.3: interrupt for TX ring 0 is
already enabled
[ 126.292089] WARNING: CPU: 6 PID: 7879 at
drivers/thunderbolt/nhi.c:138 ring_interrupt_active+0x1cd/0x225
[ 126.292092] Modules linked in:
[ 126.292091] WARNING: CPU: 0 PID: 175 at
drivers/thunderbolt/nhi.c:138 ring_interrupt_active+0x1cd/0x225
[ 126.292157] CPU: 0 PID: 175 Comm: kworker/u24:2 Tainted: G U
5.15.116-19568-g766d8095041b #24
fdadcb2517d1d37363ad385ffddbc1ad5dc72550
[ 126.292158] lzo_rle zram joydev
[ 126.292159] Hardware name: Google Anahera/Anahera, BIOS
Google_Anahera.14505.143.0 06/22/2022
[ 126.292159]
[ 126.292160] Workqueue: events_unbound async_run_entry_fn
[ 126.292160] CPU: 6 PID: 7879 Comm: kworker/u24:13 Tainted: G U
5.15.116-19568-g766d8095041b #24
fdadcb2517d1d37363ad385ffddbc1ad5dc72550
[ 126.292162]
[ 126.292162] RIP: 0010:ring_interrupt_active+0x1cd/0x225
[ 126.292163] Hardware name: Google Anahera/Anahera, BIOS
Google_Anahera.14505.143.0 06/22/2022
[ 126.292163] Code: 0f 44 ce 45 8b 46 14 48 c7 c6 98 a9 75 82 49 c7
c1 9f ee 72 82 84 db 4c 0f 45 ce 48 c7 c7 3e d8 6f 82 48 89 c6 e8 96
cb 53 ff <0f> 0b 49 8c
[ 126.292164] Workqueue: events_unbound async_run_entry_fn
[ 126.292164] RSP: 0000:ffffc90000777d18 EFLAGS: 00010046
[ 126.292166]
[ 126.292166] RAX: c7428cb6704ffd00 RBX: 0000000000000001 RCX: ffffffff82a5eff8
[ 126.292166] RIP: 0010:ring_interrupt_active+0x1cd/0x225
[ 126.292167] RDX: c7428cb6704ffd00 RSI: 0000000000000086 RDI: 0000000000000001
[ 126.292168] RBP: ffffc90000777d48 R08: 000000000000070c R09: ffffffff83167704
[ 126.292169] R10: 0000000200000000 R11: ffffffffffffffff R12: 0000000000038200
[ 126.292168] Code: 0f 44 ce 45 8b 46 14 48 c7 c6 98 a9 75 82 49 c7
c1 9f ee 72 82 84 db 4c 0f 45 ce 48 c7 c7 3e d8 6f 82 48 89 c6 e8 96
cb 53 ff <0f> 0b 49 8c
[ 126.292169] R13: 0000000000001001 R14: ffff888103348600 R15: 0000000000000001
[ 126.292170] RSP: 0000:ffffc9000ca17d18 EFLAGS: 00010046
[ 126.292170] FS: 0000000000000000(0000) GS:ffff88846f800000(0000)
knlGS:0000000000000000
[ 126.292171]
[ 126.292171] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 126.292171] RAX: d9637f716c437500 RBX: 0000000000000001 RCX: ffffffff82a5eff8
[ 126.292172] CR2: 00005b6c529532b8 CR3: 0000000286340000 CR4: 0000000000752ef0
[ 126.292172] RDX: d9637f716c437500 RSI: 0000000000000086 RDI: 0000000000000001
[ 126.292173] PKRU: 55555554
[ 126.292173] Call Trace:
[ 126.292174] RBP: ffffc9000ca17d48 R08: 0000000000000709 R09: ffffffff8316761c
[ 126.292175] <TASK>
[ 126.292175] R10: 0000000200000000 R11: ffffffffffffffff R12: 0000000000038200
[ 126.292176] R13: 0000000000001001 R14: ffff8881016a4840 R15: 0000000000000001
[ 126.292176] FS: 0000000000000000(0000) GS:ffff88846f980000(0000)
knlGS:0000000000000000
[ 126.292176] ? __warn+0xa3/0x131
[ 126.292178] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 126.292178] ? ring_interrupt_active+0x1cd/0x225
[ 126.292179] CR2: 0000000000000000 CR3: 0000000002a0c000 CR4: 0000000000752ee0
[ 126.292180] PKRU: 55555554
[ 126.292179] ? report_bug+0x97/0xfa
[ 126.292180] Call Trace:
[ 126.292182] <TASK>
[ 126.292181] ? handle_bug+0x41/0x66
[ 126.292183] ? exc_invalid_op+0x1b/0x4b
[ 126.292183] ? asm_exc_invalid_op+0x16/0x20
[ 126.292183] ? __warn+0xa3/0x131
[ 126.292185] ? ring_interrupt_active+0x1cd/0x225
[ 126.292186] ? ring_interrupt_active+0x1cd/0x225
[ 126.292186] ? ring_interrupt_active+0x1cd/0x225
[ 126.292187] tb_ring_start+0x1ba/0x1d7
[ 126.292187] ? report_bug+0x97/0xfa
[ 126.292188] tb_ctl_start+0x18/0x49
[ 126.292189] tb_domain_thaw_noirq+0x2a/0x5e
[ 126.292189] ? handle_bug+0x41/0x66
[ 126.292190] ? pci_pm_freeze_noirq+0xb8/0xb8
[ 126.292191] ? exc_invalid_op+0x1b/0x4b
[ 126.292192] dpm_run_callback+0x3c/0x12d
[ 126.292192] ? asm_exc_invalid_op+0x16/0x20
[ 126.292193] device_resume_noirq+0x10e/0x1dc
[ 126.292194] ? ring_interrupt_active+0x1cd/0x225
[ 126.292194] async_resume_noirq+0x1f/0x5b
[ 126.292195] ? ring_interrupt_active+0x1cd/0x225
[ 126.292195] async_run_entry_fn+0x2b/0xc4
[ 126.292196] tb_ring_start+0x1ba/0x1d7
[ 126.292197] process_one_work+0x1b7/0x469
[ 126.292198] worker_thread+0x11a/0x288
[ 126.292198] tb_ctl_start+0x18/0x49
[ 126.292199] kthread+0x13a/0x152
[ 126.292200] tb_domain_thaw_noirq+0x2a/0x5e
[ 126.292201] ? process_one_work+0x469/0x469
[ 126.292202] ? kthread_blkcg+0x31/0x31
[ 126.292201] ? pci_pm_freeze_noirq+0xb8/0xb8
[ 126.292202] ret_from_fork+0x1f/0x30
[ 126.292203] dpm_run_callback+0x3c/0x12d
[ 126.292205] </TASK>
[ 126.292205] device_resume_noirq+0x10e/0x1dc
[ 126.292206] ---[ end trace 349544d25453a0c7 ]---
[ 126.292206] async_resume_noirq+0x1f/0x5b
[ 126.292208] async_run_entry_fn+0x2b/0xc4
[ 126.292209] process_one_work+0x1b7/0x469
[ 126.292210] worker_thread+0x11a/0x288
[ 126.292212] kthread+0x13a/0x152
[ 126.292213] ? process_one_work+0x469/0x469
[ 126.292214] ? kthread_blkcg+0x31/0x31
[ 126.292215] ret_from_fork+0x1f/0x30
[ 126.292218] </TASK>
[ 126.292218] ---[ end trace 349544d25453a0c8 ]---
[ 126.292225] ------------[ cut here ]------------
[ 126.292226] thunderbolt 0000:00:0d.2: interrupt for RX ring 0 is
already enabled
[ 126.292231] WARNING: CPU: 2 PID: 7879 at
drivers/thunderbolt/nhi.c:138 ring_interrupt_active+0x1cd/0x225
[ 126.292233] Modules linked in: snd_seq_dummy snd_seq snd_seq_device
bridge stp llc tun vhost_vsock vhost vhost_iotlb
vmw_vsock_virtio_transport_common vsockg
[ 126.292253] videobuf2_memops btintel cdc_ether usbnet btmtk
videobuf2_common btbcm snd_intel_sdw_acpi kfifo_buf mei_hdcp
cros_ec_typec industrialio snd_socv
[ 126.292261] CPU: 2 PID: 7879 Comm: kworker/u24:13 Tainted: G U
W 5.15.116-19568-g766d8095041b #24
fdadcb2517d1d37363ad385ffddbc1ad5dc72550
[ 126.292263] Hardware name: Google Anahera/Anahera, BIOS
Google_Anahera.14505.143.0 06/22/2022
[ 126.292264] Workqueue: events_unbound async_run_entry_fn
[ 126.292266] RIP: 0010:ring_interrupt_active+0x1cd/0x225
[ 126.292267] Code: 0f 44 ce 45 8b 46 14 48 c7 c6 98 a9 75 82 49 c7
c1 9f ee 72 82 84 db 4c 0f 45 ce 48 c7 c7 3e d8 6f 82 48 89 c6 e8 96
cb 53 ff <0f> 0b 49 8c
[ 126.292268] RSP: 0000:ffffc9000ca17d18 EFLAGS: 00010046
[ 126.292269] RAX: d9637f716c437500 RBX: 0000000000000001 RCX: ffffffff82a5eff8
[ 126.292270] RDX: d9637f716c437500 RSI: 0000000000000086 RDI: 0000000000000001
[ 126.292270] RBP: ffffc9000ca17d48 R08: 000000000000081a R09: ffffffff8316a02c
[ 126.292271] R10: 0000000200000000 R11: ffffffffffffffff R12: 0000000000038200
[ 126.292272] R13: 0000000000001001 R14: ffff8881016a4780 R15: 0000000000001000
[ 126.292272] FS: 0000000000000000(0000) GS:ffff88846f880000(0000)
knlGS:0000000000000000
[ 126.292273] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 126.292274] CR2: 0000000000000000 CR3: 0000000002a0c000 CR4: 0000000000752ee0
[ 126.292274] PKRU: 55555554
[ 126.292275] Call Trace:
[ 126.292275] <TASK>
[ 126.292276] ? __warn+0xa3/0x131
[ 126.292278] ? ring_interrupt_active+0x1cd/0x225
[ 126.292279] ? report_bug+0x97/0xfa
[ 126.292281] ? handle_bug+0x41/0x66
[ 126.292281] ? exc_invalid_op+0x1b/0x4b
[ 126.292282] ? asm_exc_invalid_op+0x16/0x20
[ 126.292284] ? ring_interrupt_active+0x1cd/0x225
[ 126.292284] ? ring_interrupt_active+0x1cd/0x225
[ 126.292285] tb_ring_start+0x1ba/0x1d7
[ 126.292286] tb_ctl_start+0x21/0x49
[ 126.292287] tb_domain_thaw_noirq+0x2a/0x5e
[ 126.292288] ? pci_pm_freeze_noirq+0xb8/0xb8
[ 126.292289] dpm_run_callback+0x3c/0x12d
[ 126.292290] device_resume_noirq+0x10e/0x1dc
[ 126.292291] async_resume_noirq+0x1f/0x5b
[ 126.292292] async_run_entry_fn+0x2b/0xc4
[ 126.292293] process_one_work+0x1b7/0x469
[ 126.292295] worker_thread+0x11a/0x288
[ 126.292295] kthread+0x13a/0x152
[ 126.292296] ? process_one_work+0x469/0x469
[ 126.292297] ? kthread_blkcg+0x31/0x31
[ 126.292298] ret_from_fork+0x1f/0x30
[ 126.292301] </TASK>
[ 126.292302] ---[ end trace 349544d25453a0c9 ]---
[ 126.292306] thunderbolt 0000:00:0d.2: PM:
pci_pm_thaw_noirq+0x0/0x7c returned 0 after 825 usecs
[ 126.292311] pcieport 0000:00:07.0: PM: calling
pci_pm_thaw_noirq+0x0/0x7c @ 9, parent: pci0000:00
[ 126.292320] i915 0000:00:02.0: PM: calling
pci_pm_thaw_noirq+0x0/0x7c @ 7909, parent: pci0000:00
[ 126.292768] ------------[ cut here ]------------
[ 126.292769] thunderbolt 0000:00:0d.3: interrupt for RX ring 0 is
already enabled
[ 126.292777] WARNING: CPU: 6 PID: 175 at
drivers/thunderbolt/nhi.c:138 ring_interrupt_active+0x1cd/0x225
[ 126.292779] Modules linked in: snd_seq_dummy snd_seq snd_seq_device
bridge stp llc tun vhost_vsock vhost vhost_iotlb
vmw_vsock_virtio_transport_common vsockg
[ 126.292798] videobuf2_memops btintel cdc_ether usbnet btmtk
videobuf2_common btbcm snd_intel_sdw_acpi kfifo_buf mei_hdcp
cros_ec_typec industrialio snd_socv
[ 126.292807] CPU: 6 PID: 175 Comm: kworker/u24:2 Tainted: G U W
5.15.116-19568-g766d8095041b #24
fdadcb2517d1d37363ad385ffddbc1ad5dc72550
[ 126.292809] Hardware name: Google Anahera/Anahera, BIOS
Google_Anahera.14505.143.0 06/22/2022
[ 126.292810] Workqueue: events_unbound async_run_entry_fn
[ 126.292811] RIP: 0010:ring_interrupt_active+0x1cd/0x225
[ 126.292812] Code: 0f 44 ce 45 8b 46 14 48 c7 c6 98 a9 75 82 49 c7
c1 9f ee 72 82 84 db 4c 0f 45 ce 48 c7 c7 3e d8 6f 82 48 89 c6 e8 96
cb 53 ff <0f> 0b 49 8c
[ 126.292814] RSP: 0000:ffffc90000777d18 EFLAGS: 00010046
[ 126.292815] RAX: c7428cb6704ffd00 RBX: 0000000000000001 RCX: ffffffff82a5eff8
[ 126.292815] RDX: c7428cb6704ffd00 RSI: 0000000000000086 RDI: 0000000000000001
[ 126.292816] RBP: ffffc90000777d48 R08: 000000000000084b R09: ffffffff8316b08c
[ 126.292817] R10: 0000000200000000 R11: ffffffffffffffff R12: 0000000000038200
[ 126.292818] R13: 0000000000001001 R14: ffff888103348e40 R15: 0000000000001000
[ 126.292818] FS: 0000000000000000(0000) GS:ffff88846f980000(0000)
knlGS:0000000000000000
[ 126.292819] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 126.292820] CR2: 0000000000000000 CR3: 0000000002a0c000 CR4: 0000000000752ee0
[ 126.292821] PKRU: 55555554
[ 126.292822] Call Trace:
[ 126.292822] <TASK>
[ 126.292823] ? __warn+0xa3/0x131
[ 126.292825] ? ring_interrupt_active+0x1cd/0x225
[ 126.292826] ? report_bug+0x97/0xfa
[ 126.292828] ? handle_bug+0x41/0x66
[ 126.292829] ? exc_invalid_op+0x1b/0x4b
[ 126.292830] ? asm_exc_invalid_op+0x16/0x20
[ 126.292831] ? ring_interrupt_active+0x1cd/0x225
[ 126.292832] ? ring_interrupt_active+0x1cd/0x225
[ 126.292833] tb_ring_start+0x1ba/0x1d7
[ 126.292834] tb_ctl_start+0x21/0x49
[ 126.292836] tb_domain_thaw_noirq+0x2a/0x5e
[ 126.292837] ? pci_pm_freeze_noirq+0xb8/0xb8
[ 126.292838] dpm_run_callback+0x3c/0x12d
[ 126.292840] device_resume_noirq+0x10e/0x1dc
[ 126.292841] async_resume_noirq+0x1f/0x5b
[ 126.292842] async_run_entry_fn+0x2b/0xc4
[ 126.292843] process_one_work+0x1b7/0x469
[ 126.292844] worker_thread+0x11a/0x288
[ 126.292845] kthread+0x13a/0x152
[ 126.292847] ? process_one_work+0x469/0x469
[ 126.292847] ? kthread_blkcg+0x31/0x31
[ 126.292849] ret_from_fork+0x1f/0x30
[ 126.292851] </TASK>
[ 126.292851] ---[ end trace 349544d25453a0ca ]---


2023-06-15 04:58:29

by Mika Westerberg

[permalink] [raw]
Subject: Re: thunderbolt: resume from hibernation CPUs racing in tb_ring_start

Hi,

On Wed, Jun 14, 2023 at 04:26:49PM -0400, Brian Geffon wrote:
> Hi,
> On ChromeOS we're running a 5.15 kernel patched up to 6.4-rc6 w.r.t to
> drivers/thunderbolt code and we're seeing a similar issue to the one
> discussed in https://lore.kernel.org/lkml/[email protected]/T/#
> / https://bugzilla.kernel.org/show_bug.cgi?id=217343 where when
> resuming from hibernation you'll see warnings along the lines of
>
> [ 126.292769] thunderbolt 0000:00:0d.3: interrupt for RX ring 0 is
> already enabled
>
> The thing that's odd is it appears three CPUs are racing through this code path:
>
> [ 126.292076] ------------[ cut here ]------------
> [ 126.292077] thunderbolt 0000:00:0d.2: interrupt for TX ring 0 is
> already enabled
> [ 126.292080] proc_thermal_pci 0000:00:04.0: PM:
> pci_pm_thaw_noirq+0x0/0x7c returned 0 after 606 usecs
> [ 126.292086] ------------[ cut here ]------------
> [ 126.292087] thunderbolt 0000:00:0d.3: interrupt for TX ring 0 is
> already enabled
> [ 126.292089] WARNING: CPU: 6 PID: 7879 at
> drivers/thunderbolt/nhi.c:138 ring_interrupt_active+0x1cd/0x225
> [ 126.292092] Modules linked in:
> [ 126.292091] WARNING: CPU: 0 PID: 175 at
> drivers/thunderbolt/nhi.c:138 ring_interrupt_active+0x1cd/0x225
> [ 126.292157] CPU: 0 PID: 175 Comm: kworker/u24:2 Tainted: G U
> 5.15.116-19568-g766d8095041b #24
> fdadcb2517d1d37363ad385ffddbc1ad5dc72550
> [ 126.292158] lzo_rle zram joydev
> [ 126.292159] Hardware name: Google Anahera/Anahera, BIOS
> Google_Anahera.14505.143.0 06/22/2022
> [ 126.292159]
> [ 126.292160] Workqueue: events_unbound async_run_entry_fn
> [ 126.292160] CPU: 6 PID: 7879 Comm: kworker/u24:13 Tainted: G U
> 5.15.116-19568-g766d8095041b #24

Do you have this one?

https://git.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt.git/commit/?h=fixes&id=9f9666e65359d5047089aef97ac87c50f624ecb0

2023-06-15 10:59:56

by Brian Geffon

[permalink] [raw]
Subject: Re: thunderbolt: resume from hibernation CPUs racing in tb_ring_start

On Thu, Jun 15, 2023 at 12:49 AM Mika Westerberg
<[email protected]> wrote:
>
> Hi,
>
> On Wed, Jun 14, 2023 at 04:26:49PM -0400, Brian Geffon wrote:
> > Hi,
> > On ChromeOS we're running a 5.15 kernel patched up to 6.4-rc6 w.r.t to
> > drivers/thunderbolt code and we're seeing a similar issue to the one
> > discussed in https://lore.kernel.org/lkml/[email protected]/T/#
> > / https://bugzilla.kernel.org/show_bug.cgi?id=217343 where when
> > resuming from hibernation you'll see warnings along the lines of
> >
> > [ 126.292769] thunderbolt 0000:00:0d.3: interrupt for RX ring 0 is
> > already enabled
> >
> > The thing that's odd is it appears three CPUs are racing through this code path:
> >
> > [ 126.292076] ------------[ cut here ]------------
> > [ 126.292077] thunderbolt 0000:00:0d.2: interrupt for TX ring 0 is
> > already enabled
> > [ 126.292080] proc_thermal_pci 0000:00:04.0: PM:
> > pci_pm_thaw_noirq+0x0/0x7c returned 0 after 606 usecs
> > [ 126.292086] ------------[ cut here ]------------
> > [ 126.292087] thunderbolt 0000:00:0d.3: interrupt for TX ring 0 is
> > already enabled
> > [ 126.292089] WARNING: CPU: 6 PID: 7879 at
> > drivers/thunderbolt/nhi.c:138 ring_interrupt_active+0x1cd/0x225
> > [ 126.292092] Modules linked in:
> > [ 126.292091] WARNING: CPU: 0 PID: 175 at
> > drivers/thunderbolt/nhi.c:138 ring_interrupt_active+0x1cd/0x225
> > [ 126.292157] CPU: 0 PID: 175 Comm: kworker/u24:2 Tainted: G U
> > 5.15.116-19568-g766d8095041b #24
> > fdadcb2517d1d37363ad385ffddbc1ad5dc72550
> > [ 126.292158] lzo_rle zram joydev
> > [ 126.292159] Hardware name: Google Anahera/Anahera, BIOS
> > Google_Anahera.14505.143.0 06/22/2022
> > [ 126.292159]
> > [ 126.292160] Workqueue: events_unbound async_run_entry_fn
> > [ 126.292160] CPU: 6 PID: 7879 Comm: kworker/u24:13 Tainted: G U
> > 5.15.116-19568-g766d8095041b #24
>
> Do you have this one?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt.git/commit/?h=fixes&id=9f9666e65359d5047089aef97ac87c50f624ecb0

Thank you for pointing me to this, we do not have that in our tree yet
since it's not upstream. I'll apply it and see if it resolved the
issue.

Thanks!
Brian