On Wed, May 05, 2021 at 02:01:17PM +0200, Pali Rohár wrote:
> Hello!
>
> During debugging of pci-aardvark.c driver I got following synchronous
> external abort 96000210 which I can reproduce with VIA XHCI controller
> when PCIe hot plug support is enabled in kernel and PCIe Root Bridge
> triggers link down event via PCIe hot plug interrupt.
>
> [ 71.773033] pcieport 0000:00:00.0: pciehp: Slot(0): Link Down
> [ 71.779120] xhci_hcd 0000:01:00.0: remove, state 4
> [ 71.784113] usb usb5: USB disconnect, device number 1
> [ 71.790398] xhci_hcd 0000:01:00.0: USB bus 5 deregistered
> [ 72.511899] Internal error: synchronous external abort: 96000210 [#1] SMP
> [ 72.518918] Modules linked in:
> [ 72.522074] CPU: 1 PID: 988 Comm: irq/53-pciehp Not tainted 5.12.0-dirty #949
> [ 72.536983] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
> [ 72.543182] pc : xhci_irq+0x70/0x17b8
> [ 72.546972] lr : xhci_irq+0x28/0x17b8
> [ 72.550752] sp : ffffffc012b8bab0
> [ 72.554167] x29: ffffffc012b8bab0 x28: 00000000000000a0
> [ 72.559652] x27: 0000000000000060 x26: ffffff8000af2250
> [ 72.565135] x25: ffffffc0100b0d48 x24: ffffffc0100b0be0
> [ 72.570620] x23: ffffff80003be028 x22: ffffff8000af229c
> [ 72.576104] x21: 0000000000000080 x20: ffffff8000af2000
> [ 72.581587] x19: ffffff8000af2000 x18: 0000000000000004
> [ 72.587071] x17: 0000000000000000 x16: 0000000000000000
> [ 72.592553] x15: ffffffc01154cc70 x14: ffffff8001751df8
> [ 72.598037] x13: 0000000000000000 x12: 0000000000000000
> [ 72.603519] x11: ffffff8001751da8 x10: ffffffc01154cc78
> [ 72.609001] x9 : ffffffc01087c238 x8 : 0000000000000000
> [ 72.614485] x7 : ffffffc01162c4e0 x6 : 0000000000000000
> [ 72.619967] x5 : fffffffe00085000 x4 : fffffffe00085000
> [ 72.625451] x3 : 0000000000000000 x2 : 0000000000000001
> [ 72.630933] x1 : ffffffc0118bd024 x0 : 0000000000000000
> [ 72.636415] Call trace:
> [ 72.638936] xhci_irq+0x70/0x17b8
> [ 72.642360] usb_hcd_irq+0x34/0x50
> [ 72.645876] usb_hcd_pci_remove+0x78/0x138
> [ 72.650106] xhci_pci_remove+0x6c/0xa8
> [ 72.653978] pci_device_remove+0x44/0x108
> [ 72.658122] device_release_driver_internal+0x110/0x1e0
> [ 72.663521] device_release_driver+0x1c/0x28
> [ 72.667931] pci_stop_bus_device+0x84/0xc0
> [ 72.672162] pci_stop_and_remove_bus_device+0x1c/0x30
> [ 72.677373] pciehp_unconfigure_device+0x98/0xf8
> [ 72.682138] pciehp_disable_slot+0x60/0x118
> [ 72.686457] pciehp_handle_presence_or_link_change+0xec/0x3b0
> [ 72.692386] pciehp_ist+0x170/0x1a0
> [ 72.695984] irq_thread_fn+0x30/0x90
> [ 72.699674] irq_thread+0x13c/0x200
> [ 72.703271] kthread+0x12c/0x130
> [ 72.706603] ret_from_fork+0x10/0x1c
> [ 72.710299] Code: 35ffff83 35002741 f9400f41 91001021 (b9400021)
> [ 72.716586] ---[ end trace 20ce3e30ff292c93 ]---
> [ 72.721453] genirq: exiting task "irq/53-pciehp" (988) is an active IRQ thread (irq 53)
> [ 72.730068] sched: RT throttling activated
>
> And after that kernel is in some semi-broken state. Some functionality
> works, but some other (like reboot) does not.
>
> I can reproduce it also when I manually inject/fake this link down PCIe
> hot plug interrupt with setting corresponding bits in PCIe Root Status
> registers, so pciehp driver thinks that link down event occurred.
>
> I suspect that the issue is in the usb_hcd_pci_remove() function, which
> calls local_irq_disable()+usb_hcd_irq()+local_irq_enable() but does not
> take into account that the whole usb_hcd_pci_remove() function may be
> called from interrupt context.

usb_hcd_pci_remove() should NOT be called from interrupt context.

What is causing that to happen? No PCI driver can handle that,
especially USB ones.

> Can you look at this issue if it is really safe to call usb_hcd_irq()
> from interrupt context? Or rather if it is safe to call functions like
> pciehp_disable_slot() or device_release_driver() from interrupt context
> like it can be seen in call trace?

What is removing devices from an irq? That is wrong, pci hotplug never
used to do that, what recently changed?

thanks,

greg k-h

On Wednesday 05 May 2021 14:09:17 Greg KH wrote:
> On Wed, May 05, 2021 at 02:01:17PM +0200, Pali Rohár wrote:
> > Hello!
> >
> > During debugging of pci-aardvark.c driver I got following synchronous
> > external abort 96000210 which I can reproduce with VIA XHCI controller
> > when PCIe hot plug support is enabled in kernel and PCIe Root Bridge
> > triggers link down event via PCIe hot plug interrupt.
> >
> > [ 71.773033] pcieport 0000:00:00.0: pciehp: Slot(0): Link Down
> > [ 71.779120] xhci_hcd 0000:01:00.0: remove, state 4
> > [ 71.784113] usb usb5: USB disconnect, device number 1
> > [ 71.790398] xhci_hcd 0000:01:00.0: USB bus 5 deregistered
> > [ 72.511899] Internal error: synchronous external abort: 96000210 [#1] SMP
> > [ 72.518918] Modules linked in:
> > [ 72.522074] CPU: 1 PID: 988 Comm: irq/53-pciehp Not tainted 5.12.0-dirty #949
> > [ 72.536983] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
> > [ 72.543182] pc : xhci_irq+0x70/0x17b8
> > [ 72.546972] lr : xhci_irq+0x28/0x17b8
> > [ 72.550752] sp : ffffffc012b8bab0
> > [ 72.554167] x29: ffffffc012b8bab0 x28: 00000000000000a0
> > [ 72.559652] x27: 0000000000000060 x26: ffffff8000af2250
> > [ 72.565135] x25: ffffffc0100b0d48 x24: ffffffc0100b0be0
> > [ 72.570620] x23: ffffff80003be028 x22: ffffff8000af229c
> > [ 72.576104] x21: 0000000000000080 x20: ffffff8000af2000
> > [ 72.581587] x19: ffffff8000af2000 x18: 0000000000000004
> > [ 72.587071] x17: 0000000000000000 x16: 0000000000000000
> > [ 72.592553] x15: ffffffc01154cc70 x14: ffffff8001751df8
> > [ 72.598037] x13: 0000000000000000 x12: 0000000000000000
> > [ 72.603519] x11: ffffff8001751da8 x10: ffffffc01154cc78
> > [ 72.609001] x9 : ffffffc01087c238 x8 : 0000000000000000
> > [ 72.614485] x7 : ffffffc01162c4e0 x6 : 0000000000000000
> > [ 72.619967] x5 : fffffffe00085000 x4 : fffffffe00085000
> > [ 72.625451] x3 : 0000000000000000 x2 : 0000000000000001
> > [ 72.630933] x1 : ffffffc0118bd024 x0 : 0000000000000000
> > [ 72.636415] Call trace:
> > [ 72.638936] xhci_irq+0x70/0x17b8
> > [ 72.642360] usb_hcd_irq+0x34/0x50
> > [ 72.645876] usb_hcd_pci_remove+0x78/0x138
> > [ 72.650106] xhci_pci_remove+0x6c/0xa8
> > [ 72.653978] pci_device_remove+0x44/0x108
> > [ 72.658122] device_release_driver_internal+0x110/0x1e0
> > [ 72.663521] device_release_driver+0x1c/0x28
> > [ 72.667931] pci_stop_bus_device+0x84/0xc0
> > [ 72.672162] pci_stop_and_remove_bus_device+0x1c/0x30
> > [ 72.677373] pciehp_unconfigure_device+0x98/0xf8
> > [ 72.682138] pciehp_disable_slot+0x60/0x118
> > [ 72.686457] pciehp_handle_presence_or_link_change+0xec/0x3b0
> > [ 72.692386] pciehp_ist+0x170/0x1a0
> > [ 72.695984] irq_thread_fn+0x30/0x90
> > [ 72.699674] irq_thread+0x13c/0x200
> > [ 72.703271] kthread+0x12c/0x130
> > [ 72.706603] ret_from_fork+0x10/0x1c
> > [ 72.710299] Code: 35ffff83 35002741 f9400f41 91001021 (b9400021)
> > [ 72.716586] ---[ end trace 20ce3e30ff292c93 ]---
> > [ 72.721453] genirq: exiting task "irq/53-pciehp" (988) is an active IRQ thread (irq 53)
> > [ 72.730068] sched: RT throttling activated
> >
> > And after that kernel is in some semi-broken state. Some functionality
> > works, but some other (like reboot) does not.
> >
> > I can reproduce it also when I manually inject/fake this link down PCIe
> > hot plug interrupt with setting corresponding bits in PCIe Root Status
> > registers, so pciehp driver thinks that link down event occurred.
> >
> > I suspect that issue is in usb_hcd_pci_remove() function which calls
> > local_irq_disable()+usb_hcd_irq()+local_irq_enable() functions but do
> > not take into care that whole usb_hcd_pci_remove() function may be
> > called from interrupt context.
>
> usb_hcd_pci_remove() should NOT be called from interrupt context.
>
> What is causing that to happen?

A PCIe Hot Plug interrupt with the PCI_EXP_SLTSTA_DLLSC status bit set.

I can reproduce it by issuing a PCIe Hot Reset to the PCIe controller
(via setpci from userspace). This results in a link down event, as
expected, and the PCIe controller then triggers the link down interrupt.

> No PCI driver can handle that, especially USB ones.
>
> > Can you look at this issue if it is really safe to call usb_hcd_irq()
> > from interrupt context? Or rather if it is safe to call functions like
> > pciehp_disable_slot() or device_release_driver() from interrupt context
> > like it can be seen in call trace?
>
> What is removing devices from an irq?

It can be seen in the call trace above: pciehp_disable_slot() followed
by pciehp_unconfigure_device().

> That is wrong, pci hotplug never used to do that, what recently changed?

I really do not know what changed recently. I hope that other people
on the linux-pci ML know the history better.

I just spotted this crash while debugging the PCIe controller driver
pci-aardvark.c, when trying to expose its link down events via the "hot
plug" interrupt and the corresponding link layer state flags.

Because the whole call trace shows only generic PCIe and USB code
paths without any driver-specific parts, I suspect that this is not a
PCIe controller-specific issue but rather something "wrong" in generic
PCIe (or USB) code. That is why I sent this email, so that maybe
somebody else finds something suspicious here.

But there is still a chance that the issue is in the pci-aardvark.c
driver, which somehow masked its own problem and propagated it into the
generic PCIe hot plug code path.

> thanks,
>
> greg k-h
On Wed, May 05, 2021 at 02:33:46PM +0200, Pali Rohár wrote:
> I just spotted this crash during debugging PCIe controller driver
> pci-aardvark.c with trying to expose its link down events via "hot plug"
> interrupt and corresponding link layer state flags.
>
> And because in whole call trace I see only generic PCIe and USB code
> path without any driver specific parts, I suspect that this is not PCIe
> controller-specific issue but rather something "wrong" in generic PCIe
> (or USB) code. That is why I sent this email, so maybe somebody else
> find something suspicious here.
>
> But still there is a chance that issue can be also in pci-aardvark.c
> driver and somehow it masked its issue and propagated it into generic
> PCIe hot plug code path.

If you hot-remove the XHCI controller, accesses to its MMIO space
will fail. xhci_irq() seems to perform such MMIO accesses.

Normally this should happen silently and MMIO accesses just return
with a fabricated "all ones" response. Chances are however that the
Aardvark controller raises a synchronous external abort instead.

Perhaps you can teach it not to do that.

Thanks,
Lukas
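
To illustrate the "all ones" behaviour described above: on host bridges
that fabricate a response, a driver can notice a vanished device by
checking the value it read back. The sketch below is a user-space model,
not kernel code. struct fake_hc, fake_readl(), hc_is_gone(), fake_irq()
and demo_hot_remove() are hypothetical names made up for illustration.
It only covers the case where the read completes at all, which, per the
above, may not be what the Aardvark controller does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a controller's MMIO register window. */
struct fake_hc {
    bool present;        /* false once the device has been hot-removed */
    uint32_t status_reg; /* value the device returns while present */
};

/*
 * Models a 32-bit read from a PCIe device. A read from a device that is
 * gone is modelled as completing with a fabricated all-ones value.
 */
static uint32_t fake_readl(const struct fake_hc *hc)
{
    return hc->present ? hc->status_reg : UINT32_MAX;
}

/* Returns true if the read-back value suggests the device has vanished. */
static bool hc_is_gone(uint32_t status)
{
    return status == UINT32_MAX;
}

/* Interrupt-handler-style check: bail out early if the device is gone. */
static int fake_irq(const struct fake_hc *hc)
{
    uint32_t status = fake_readl(hc);

    if (hc_is_gone(status))
        return -1; /* treat the controller as dead, touch nothing else */
    return 0;      /* normal event processing would continue here */
}

/* Usage sketch: the same handler before and after hot-removal. */
static void demo_hot_remove(void)
{
    struct fake_hc hc = { .present = true, .status_reg = 0x1 };

    assert(fake_irq(&hc) == 0);
    hc.present = false;
    assert(fake_irq(&hc) == -1);
}
```

The limitation is that a legitimate register value of 0xffffffff cannot
be told apart from a dead device, so this check only suits registers
that never read as all ones in normal operation.
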
On Wednesday 05 May 2021 14:44:02 Lukas Wunner wrote:
> On Wed, May 05, 2021 at 02:33:46PM +0200, Pali Rohár wrote:
> > I just spotted this crash during debugging PCIe controller driver
> > pci-aardvark.c with trying to expose its link down events via "hot plug"
> > interrupt and corresponding link layer state flags.
> >
> > And because in whole call trace I see only generic PCIe and USB code
> > path without any driver specific parts, I suspect that this is not PCIe
> > controller-specific issue but rather something "wrong" in generic PCIe
> > (or USB) code. That is why I sent this email, so maybe somebody else
> > find something suspicious here.
> >
> > But still there is a chance that issue can be also in pci-aardvark.c
> > driver and somehow it masked its issue and propagated it into generic
> > PCIe hot plug code path.
>
> If you hot-remove the XHCI controller, accesses to its MMIO space
> will fail. xhci_irq() seems to perform such MMIO accesses.

That abort happens at offset 4d00, here is part of objdump:

if (!arch_irqs_disabled_flags(flags))
    4ccc:   340014a0    cbz     w0, 4f60 <xhci_irq+0x2d0>
    4cd0:   d2800000    mov     x0, #0x0        // #0
    4cd4:   910a7276    add     x22, x19, #0x29c
    4cd8:   52800022    mov     w2, #0x1        // #1
    4cdc:   f98002d1    prfm    pstl1strm, [x22]
    4ce0:   885ffec1    ldaxr   w1, [x22]
    4ce4:   4a000023    eor     w3, w1, w0
    4ce8:   35000063    cbnz    w3, 4cf4 <xhci_irq+0x64>
    4cec:   88037ec2    stxr    w3, w2, [x22]
    4cf0:   35ffff83    cbnz    w3, 4ce0 <xhci_irq+0x50>
    4cf4:   35002741    cbnz    w1, 51dc <xhci_irq+0x54c>
status = readl(&xhci->op_regs->status);
    4cf8:   f9400f41    ldr     x1, [x26, #24]
    4cfc:   91001021    add     x1, x1, #0x4
    4d00:   b9400021    ldr     w1, [x1]

So it looks like it is that MMIO access, right?
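
As a cross-check of the reading above, the ESR value 96000210 from the
oops can be decoded by hand. The sketch below assumes the standard
AArch64 ESR_ELx layout (EC in bits 31:26, IL in bit 25, WnR in bit 6,
DFSC in bits 5:0). struct esr_fields, decode_esr() and demo_decode() are
hypothetical helpers written for this illustration, not kernel
functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The ESR_ELx fields that matter for a data abort. */
struct esr_fields {
    unsigned ec;   /* exception class, bits [31:26] */
    bool il;       /* instruction length is 32 bits, bit 25 */
    bool wnr;      /* write-not-read, bit 6 (data aborts only) */
    unsigned dfsc; /* data fault status code, bits [5:0] */
};

/* Split a raw ESR value into the fields above. */
static struct esr_fields decode_esr(uint32_t esr)
{
    struct esr_fields f = {
        .ec   = (esr >> 26) & 0x3f,
        .il   = (esr >> 25) & 1,
        .wnr  = (esr >> 6) & 1,
        .dfsc = esr & 0x3f,
    };
    return f;
}

/* Usage sketch: decode the value printed in the oops. */
static void demo_decode(void)
{
    struct esr_fields f = decode_esr(0x96000210u);

    assert(f.ec == 0x25);   /* data abort, no change in exception level */
    assert(!f.wnr);         /* caused by a read */
    assert(f.dfsc == 0x10); /* synchronous external abort */
}
```

For 0x96000210 this gives EC 0x25 (data abort taken without a change in
exception level), WnR 0 (a read) and DFSC 0x10 (synchronous external
abort, not on a translation table walk), which fits a faulting ldr. The
offset lines up as well: the objdump shows 4f60 as xhci_irq+0x2d0, so
xhci_irq starts at 4c90, and xhci_irq+0x70 from the oops is 4d00.
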
> Normally this should happen silently and MMIO accesses just return
> with a fabricated "all ones" response. Chances are however that the
> Aardvark controller raises a synchronous external abort instead.

This makes sense. Good catch also on the fact that it is called from
threaded context!

> Perhaps you can teach it not to do that.

No :-( I read all the documentation available for this PCIe controller,
which is part of the Marvell A3720 SoC, and I have not found anything
that allows me to configure the raising of external aborts.

I already figured out that the CPU also receives an external abort when
trying to issue a new PIO transfer for accessing PCI config space while
the previous transfer has not finished yet. There is also no way (at
least in the documentation) to "mask" this external abort. But that
issue can be fixed in the pci-aardvark.c driver by disallowing access to
config space while the previous transfer is still running (I will send a
patch for this one).

So it seems that the PCIe controller HW triggers these external aborts
when a device on the PCIe bus is not accessible anymore.

If this issue is really caused by an MMIO access from the xhci driver
when the device is no longer accessible on the bus, can we do something
to prevent this kernel crash? Somehow mask that external abort in the
kernel for the duration of the MMIO access?

> Thanks,
>
> Lukas
On Wed, May 05, 2021 at 02:33:46PM +0200, Pali Rohár wrote:
> On Wednesday 05 May 2021 14:09:17 Greg KH wrote:
> > On Wed, May 05, 2021 at 02:01:17PM +0200, Pali Rohár wrote:
> > > Hello!
> > >
> > > During debugging of pci-aardvark.c driver I got following synchronous
> > > external abort 96000210 which I can reproduce with VIA XHCI controller
> > > when PCIe hot plug support is enabled in kernel and PCIe Root Bridge
> > > triggers link down event via PCIe hot plug interrupt.
> > >
> > > [ 71.773033] pcieport 0000:00:00.0: pciehp: Slot(0): Link Down
> > > [ 71.779120] xhci_hcd 0000:01:00.0: remove, state 4
> > > [ 71.784113] usb usb5: USB disconnect, device number 1
> > > [ 71.790398] xhci_hcd 0000:01:00.0: USB bus 5 deregistered
> > > [ 72.511899] Internal error: synchronous external abort: 96000210 [#1] SMP
> > > [ 72.518918] Modules linked in:
> > > [ 72.522074] CPU: 1 PID: 988 Comm: irq/53-pciehp Not tainted 5.12.0-dirty #949
> > > [ 72.536983] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
> > > [ 72.543182] pc : xhci_irq+0x70/0x17b8
> > > [ 72.546972] lr : xhci_irq+0x28/0x17b8
> > > [ 72.550752] sp : ffffffc012b8bab0
> > > [ 72.554167] x29: ffffffc012b8bab0 x28: 00000000000000a0
> > > [ 72.559652] x27: 0000000000000060 x26: ffffff8000af2250
> > > [ 72.565135] x25: ffffffc0100b0d48 x24: ffffffc0100b0be0
> > > [ 72.570620] x23: ffffff80003be028 x22: ffffff8000af229c
> > > [ 72.576104] x21: 0000000000000080 x20: ffffff8000af2000
> > > [ 72.581587] x19: ffffff8000af2000 x18: 0000000000000004
> > > [ 72.587071] x17: 0000000000000000 x16: 0000000000000000
> > > [ 72.592553] x15: ffffffc01154cc70 x14: ffffff8001751df8
> > > [ 72.598037] x13: 0000000000000000 x12: 0000000000000000
> > > [ 72.603519] x11: ffffff8001751da8 x10: ffffffc01154cc78
> > > [ 72.609001] x9 : ffffffc01087c238 x8 : 0000000000000000
> > > [ 72.614485] x7 : ffffffc01162c4e0 x6 : 0000000000000000
> > > [ 72.619967] x5 : fffffffe00085000 x4 : fffffffe00085000
> > > [ 72.625451] x3 : 0000000000000000 x2 : 0000000000000001
> > > [ 72.630933] x1 : ffffffc0118bd024 x0 : 0000000000000000
> > > [ 72.636415] Call trace:
> > > [ 72.638936] xhci_irq+0x70/0x17b8
> > > [ 72.642360] usb_hcd_irq+0x34/0x50
> > > [ 72.645876] usb_hcd_pci_remove+0x78/0x138
> > > [ 72.650106] xhci_pci_remove+0x6c/0xa8
> > > [ 72.653978] pci_device_remove+0x44/0x108
> > > [ 72.658122] device_release_driver_internal+0x110/0x1e0
> > > [ 72.663521] device_release_driver+0x1c/0x28
> > > [ 72.667931] pci_stop_bus_device+0x84/0xc0
> > > [ 72.672162] pci_stop_and_remove_bus_device+0x1c/0x30
> > > [ 72.677373] pciehp_unconfigure_device+0x98/0xf8
> > > [ 72.682138] pciehp_disable_slot+0x60/0x118
> > > [ 72.686457] pciehp_handle_presence_or_link_change+0xec/0x3b0
> > > [ 72.692386] pciehp_ist+0x170/0x1a0
> > > [ 72.695984] irq_thread_fn+0x30/0x90
> > > [ 72.699674] irq_thread+0x13c/0x200
> > > [ 72.703271] kthread+0x12c/0x130
> > > [ 72.706603] ret_from_fork+0x10/0x1c
> > > [ 72.710299] Code: 35ffff83 35002741 f9400f41 91001021 (b9400021)
> > > [ 72.716586] ---[ end trace 20ce3e30ff292c93 ]---
> > > [ 72.721453] genirq: exiting task "irq/53-pciehp" (988) is an active IRQ thread (irq 53)
> > > [ 72.730068] sched: RT throttling activated
> > >
> > > And after that kernel is in some semi-broken state. Some functionality
> > > works, but some other (like reboot) does not.
> > >
> > > I can reproduce it also when I manually inject/fake this link down PCIe
> > > hot plug interrupt with setting corresponding bits in PCIe Root Status
> > > registers, so pciehp driver thinks that link down event occurred.
> > >
> > > I suspect that issue is in usb_hcd_pci_remove() function which calls
> > > local_irq_disable()+usb_hcd_irq()+local_irq_enable() functions but do
> > > not take into care that whole usb_hcd_pci_remove() function may be
> > > called from interrupt context.
> >
> > usb_hcd_pci_remove() should NOT be called from interrupt context.
> >
> > What is causing that to happen?
>
> PCIe Hot Plug interrupt with PCI_EXP_SLTSTA_DLLSC status bit set.
>
> I can reproduce it by issuing PCIe Hot Reset to PCIe controller (via
> setpci from userspace) which resulted in link down event (which is
> obvious) and PCIe controller then triggered link down interrupt.
>
> > No PCI driver can handle that, especially USB ones.
> >
> > > Can you look at this issue if it is really safe to call usb_hcd_irq()
> > > from interrupt context? Or rather if it is safe to call functions like
> > > pciehp_disable_slot() or device_release_driver() from interrupt context
> > > like it can be seen in call trace?
> >
> > What is removing devices from an irq?
>
> It can be seen in above call trace. It is pciehp_disable_slot() followed
> by pciehp_unconfigure_device().

But pciehp_disable_slot() is called under protection of a mutex, so we
"know" it can't be called from an irq. The trace might be wrong there,
or someone moved to using a threaded irq handler somehow?

I would focus on the "synchronous external abort". Are you sure that is
not just a platform error being hit somehow that is independent of the
xhci driver?

> > That is wrong, pci hotplug never used to do that, what recently changed?
>
> I really do not know what was changed recently. I hope that other people
> in linux-pci ML would know history details better.
>
> I just spotted this crash during debugging PCIe controller driver
> pci-aardvark.c with trying to expose its link down events via "hot plug"
> interrupt and corresponding link layer state flags.
>
> And because in whole call trace I see only generic PCIe and USB code
> path without any driver specific parts, I suspect that this is not PCIe
> controller-specific issue but rather something "wrong" in generic PCIe
> (or USB) code. That is why I sent this email, so maybe somebody else
> find something suspicious here.
>
> But still there is a chance that issue can be also in pci-aardvark.c
> driver and somehow it masked its issue and propagated it into generic
> PCIe hot plug code path.

Any chance you can use 'git bisect' to track down where this showed up?

thanks,

greg k-h

On Wed, May 05, 2021 at 02:09:17PM +0200, Greg KH wrote:
> On Wed, May 05, 2021 at 02:01:17PM +0200, Pali Rohár wrote:
> > [ 72.511899] Internal error: synchronous external abort: 96000210 [#1] SMP
[...]
> > [ 72.636415] Call trace:
> > [ 72.638936] xhci_irq+0x70/0x17b8
> > [ 72.642360] usb_hcd_irq+0x34/0x50
> > [ 72.645876] usb_hcd_pci_remove+0x78/0x138
> > [ 72.650106] xhci_pci_remove+0x6c/0xa8
> > [ 72.653978] pci_device_remove+0x44/0x108
> > [ 72.658122] device_release_driver_internal+0x110/0x1e0
> > [ 72.663521] device_release_driver+0x1c/0x28
> > [ 72.667931] pci_stop_bus_device+0x84/0xc0
> > [ 72.672162] pci_stop_and_remove_bus_device+0x1c/0x30
> > [ 72.677373] pciehp_unconfigure_device+0x98/0xf8
> > [ 72.682138] pciehp_disable_slot+0x60/0x118
> > [ 72.686457] pciehp_handle_presence_or_link_change+0xec/0x3b0
> > [ 72.692386] pciehp_ist+0x170/0x1a0
> > [ 72.695984] irq_thread_fn+0x30/0x90
                 ^^^^^^^^^^^^^
[...]
> > I suspect that issue is in usb_hcd_pci_remove() function which calls
> > local_irq_disable()+usb_hcd_irq()+local_irq_enable() functions but do
> > not take into care that whole usb_hcd_pci_remove() function may be
> > called from interrupt context.
>
> usb_hcd_pci_remove() should NOT be called from interrupt context.
>
> What is causing that to happen?
Nothing. It's called from an IRQ *thread*, i.e. task context, see above.
> > Can you look at this issue if it is really safe to call usb_hcd_irq()
> > from interrupt context? Or rather if it is safe to call functions like
> > pciehp_disable_slot() or device_release_driver() from interrupt context
> > like it can be seen in call trace?
>
> What is removing devices from an irq? That is wrong, pci hotplug never
> used to do that, what recently changed?

Nothing changed, the allegation that something is called from interrupt
context is wrong.

Thanks,

Lukas

From: Pali Rohár
> Sent: 05 May 2021 14:03
...
> I already figured out that CPU receive external abort also when trying
> to issue a new PIO transfer for accessing PCI config space while
> previous transfer has not finished yet. And also there is no way (at
> least in documentation) which allows to "mask" this external abort. But
> this issue can be fixed in pci-aardvark.c driver to disallow access to
> config space while previous transfer is still running (I will send patch
> for this one).

By the sound of the above you need to put a global spinlock around
all PCIe config space accesses.

Is this the horrid hardware that can't do a 'normal' PCIe transfer
while a config space access is in progress?
If that is true then you have bigger problems.
Especially if it is an SMP system.

> So seems that PCIe controller HW triggers these external aborts when
> device on PCIe bus is not accessible anymore.
>
> If this issue is really caused by MMIO access from xhci driver when
> device is not accessible on the bus anymore, can we do something to
> prevent this kernel crash? Somehow mask that external abort in kernel
> for a time during MMIO access?

If it is a cycle abort then the interrupted address is probably
that of the MMIO instruction.
So you need to catch the abort, emulate the instruction and
then return to the next one.
This probably requires an exception table containing the address
of every readb/w/l() instruction.

If you get a similar error on writes it is likely to be a few
instructions after the actual writeb/w/l() instruction.
Writes are normally 'posted' and asynchronous.

If you are really lucky you can get enough state out of the
abort handler to fixup/ignore the cycle without an
exception table.

David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
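
The "catch the abort and substitute a value" idea can be tried out in
user space. The sketch below is only an analogue under stated
assumptions: a PROT_NONE page stands in for the MMIO window of a removed
device, and SIGSEGV/SIGBUS stand in for the external abort. Instead of
an exception table keyed by instruction address, it arms a single
sigsetjmp() resume point around the guarded load. The kernel cannot use
signals, and whether arm64 can recover from this particular abort is not
established in this thread. All function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

static sigjmp_buf fault_resume;           /* where to continue after a fault */
static volatile sig_atomic_t fault_armed; /* set only inside guarded_read32() */

static void fault_handler(int sig)
{
    if (fault_armed)
        siglongjmp(fault_resume, 1); /* skip the faulting load */
    signal(sig, SIG_DFL);            /* unexpected fault: die as usual */
    raise(sig);
}

static int install_fault_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = fault_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGBUS, &sa, NULL);
}

/* Read *addr; if the access faults, return all ones instead of crashing. */
static uint32_t guarded_read32(const volatile uint32_t *addr)
{
    uint32_t val;

    if (sigsetjmp(fault_resume, 1) != 0) {
        fault_armed = 0;
        return UINT32_MAX; /* fabricated "all ones" response */
    }
    fault_armed = 1;
    val = *addr;
    fault_armed = 0;
    return val;
}

/* Stand-in for the window of a device that is gone: any access faults. */
static volatile uint32_t *map_dead_window(void)
{
    void *p = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Usage sketch: one normal read, one read that faults and is fixed up. */
static void demo_guarded_read(void)
{
    uint32_t ok = 0x12345678u;
    volatile uint32_t *dead = map_dead_window();

    assert(install_fault_handler() == 0);
    assert(dead != NULL);
    assert(guarded_read32(&ok) == 0x12345678u);
    assert(guarded_read32(dead) == UINT32_MAX);
}
```

This only works because the fault is synchronous and reported at the
load itself. As noted above, posted writes would fault some instructions
later, so a single resume point around the access would not be enough
for them.
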
On Wednesday 05 May 2021 15:20:11 David Laight wrote:
> From: Pali Rohár
> > Sent: 05 May 2021 14:03
> ...
> > I already figured out that CPU receive external abort also when trying
> > to issue a new PIO transfer for accessing PCI config space while
> > previous transfer has not finished yet. And also there is no way (at
> > least in documentation) which allows to "mask" this external abort. But
> > this issue can be fixed in pci-aardvark.c driver to disallow access to
> > config space while previous transfer is still running (I will send patch
> > for this one).
>
> By the sound of the above you need to put a global spinlock around
> all PCIe config space accesses.

The kernel already uses raw_spin_lock_irqsave(); see the
pci_lock_config() macro in pci/access.c, which implements this global
lock for config space access.

But the issue is that pci-driver.c does not wait for the transfer to
finish before returning from the function, after which this spin lock
is unlocked...

A week ago I fixed this issue in U-Boot, and a similar fix would also be
needed for the kernel: https://source.denx.de/u-boot/u-boot/-/commit/eccbd4ad8e4e

But this issue is not related to my original report about XHCI & PCI.
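
The config space PIO problem described above (a transfer must finish
before the caller returns and drops the lock) can be sketched with a
small model. This is not the pci-aardvark.c code and not the U-Boot fix
linked above. struct fake_pio and the pio_*() functions are hypothetical,
and the "abort" is just a flag, where the real hardware is reported to
raise an external abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical controller that handles one config PIO transfer at a time. */
struct fake_pio {
    bool busy;           /* a transfer is still in flight */
    unsigned ticks_left; /* polls until the in-flight transfer completes */
    bool aborted;        /* a transfer was started while another was busy */
};

/* Starting a transfer while another is in flight is the fatal case. */
static void pio_start(struct fake_pio *pio, unsigned duration)
{
    if (pio->busy) {
        pio->aborted = true; /* models the external abort */
        return;
    }
    pio->busy = true;
    pio->ticks_left = duration;
}

/* One poll of the modelled status register; true once the transfer is done. */
static bool pio_poll_done(struct fake_pio *pio)
{
    if (pio->busy) {
        if (pio->ticks_left > 0)
            pio->ticks_left--;
        if (pio->ticks_left == 0)
            pio->busy = false;
    }
    return !pio->busy;
}

/*
 * Config access that does not return (so the caller cannot drop its lock)
 * until the transfer has finished or the poll budget is used up.
 * Returns 0 on completion, -1 on abort or timeout.
 */
static int pio_access_and_wait(struct fake_pio *pio, unsigned duration,
                               unsigned max_polls)
{
    pio_start(pio, duration);
    if (pio->aborted)
        return -1;
    while (max_polls-- > 0) {
        if (pio_poll_done(pio))
            return 0;
    }
    return -1;
}

/* Usage sketch: back-to-back accesses with and without waiting. */
static void demo_pio(void)
{
    struct fake_pio racy = {0}, safe = {0};

    pio_start(&racy, 3);
    pio_start(&racy, 3); /* second access issued too early */
    assert(racy.aborted);

    assert(pio_access_and_wait(&safe, 3, 10) == 0);
    assert(pio_access_and_wait(&safe, 3, 10) == 0);
    assert(!safe.aborted);
}
```

Note that in this model a timed-out access leaves the transfer in
flight, so the next access would still hit the busy controller. A real
driver would need a policy for that case.
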
> Is this the horrid hardware that can't do a 'normal' PCIe transfer
> while a config space access is in progress?

The issue is different. You cannot do a config space PIO transfer while
another config space PIO transfer is in progress.

> If that is true then you have bigger problems.
> Especially if it is an SMP system.

I really hope that a memory read or write transfer can be initiated
while a config transfer is in progress. The Marvell A3720 platform, on
which this pci aardvark controller is found, is a 2-core CPU SoC.

At least I have not seen any abort when the PCIe link is up, a card is
connected and the previous config access transfer has finished.

> > So seems that PCIe controller HW triggers these external aborts when
> > device on PCIe bus is not accessible anymore.
> >
> > If this issue is really caused by MMIO access from xhci driver when
> > device is not accessible on the bus anymore, can we do something to
> > prevent this kernel crash? Somehow mask that external abort in kernel
> > for a time during MMIO access?
>
> If it is a cycle abort then the interrupted address is probably
> that of the MMIO instruction.
> So you need to catch the abort, emulate the instruction and
> then return to the next one.

Does the kernel have an API and infrastructure for catching these aborts
and executing a driver's own handler when an abort happens?

> This probably requires an exception table containing the address
> of every readb/w/l() instruction.
>
> If you get a similar error on writes it is likely to be a few
> instructions after the actual writeb/w/l() instruction.
> Writes are normally 'posted' and asynchronous.
>
> If you are really lucky you can get enough state out of the
> abort handler to fixup/ignore the cycle without an
> exception table.
>
> David
On Wed, May 05, 2021 at 05:39:42PM +0200, Pali Rohár wrote:
> On Wednesday 05 May 2021 15:20:11 David Laight wrote:
> > From: Pali Rohár
> > Sent: 05 May 2021 14:03
> > > So seems that PCIe controller HW triggers these external aborts when
> > > device on PCIe bus is not accessible anymore.
> > >
> > > If this issue is really caused by MMIO access from xhci driver when
> > > device is not accessible on the bus anymore, can we do something to
> > > prevent this kernel crash? Somehow mask that external abort in kernel
> > > for a time during MMIO access?
> >
> > If it is a cycle abort then the interrupted address is probably
> > that of the MMIO instruction.
> > So you need to catch the abort, emulate the instruction and
> > then return to the next one.
>
> Has kernel API & infrastructure for catching these aborts and executing
> own driver handler when abort happens?
Yes, see here for an example:
https://lore.kernel.org/linux-pci/[email protected]/
On Saturday 19 June 2021 09:53:58 Lukas Wunner wrote:
> On Wed, May 05, 2021 at 05:39:42PM +0200, Pali Rohár wrote:
> > On Wednesday 05 May 2021 15:20:11 David Laight wrote:
> > > From: Pali Rohár
> > > Sent: 05 May 2021 14:03
> > > > So seems that PCIe controller HW triggers these external aborts when
> > > > device on PCIe bus is not accessible anymore.
> > > >
> > > > If this issue is really caused by MMIO access from xhci driver when
> > > > device is not accessible on the bus anymore, can we do something to
> > > > prevent this kernel crash? Somehow mask that external abort in kernel
> > > > for a time during MMIO access?
> > >
> > > If it is a cycle abort then the interrupted address is probably
> > > that of the MMIO instruction.
> > > So you need to catch the abort, emulate the instruction and
> > > then return to the next one.
> >
> > Has kernel API & infrastructure for catching these aborts and executing
> > own driver handler when abort happens?
>
> Yes, see here for an example:
>
> https://lore.kernel.org/linux-pci/[email protected]/

What I do not see here is how to catch and recover from the error. It
looks like the above patch just implements catching the error, printing
more verbose output and letting the kernel continue rebooting / crashing.

At least I do not see how to "catch the abort, emulate the instruction
and then return to the next one" as David wrote.