2024-05-03 19:26:53

by Nam Cao

[permalink] [raw]
Subject: [PATCH 2/4] PCI: pciehp: bail out if pci_hp_add_bridge() fails

If there is no bus number available for the downstream bus of the
hot-plugged bridge, pci_hp_add_bridge() will fail. The driver proceeds
regardless, and the kernel crashes.

Abort if pci_hp_add_bridge() fails.

Fixes: 0eb3bcfd088e ("[PATCH] pciehp: allow bridged card hotplug")
Signed-off-by: Nam Cao <[email protected]>
Cc: Rajesh Shah <[email protected]>
Cc: <[email protected]>
---
drivers/pci/hotplug/pciehp_pci.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
index ad12515a4a12..faf4fcf2fbdf 100644
--- a/drivers/pci/hotplug/pciehp_pci.c
+++ b/drivers/pci/hotplug/pciehp_pci.c
@@ -58,8 +58,13 @@ int pciehp_configure_device(struct controller *ctrl)
goto out;
}

- for_each_pci_bridge(dev, parent)
- pci_hp_add_bridge(dev);
+ for_each_pci_bridge(dev, parent) {
+ if (pci_hp_add_bridge(dev)) {
+ pci_stop_and_remove_bus_device(dev);
+ ret = -EINVAL;
+ goto out;
+ }
+ }

pci_assign_unassigned_bridge_resources(bridge);
pcie_bus_configure_settings(parent);
--
2.39.2



2024-05-03 21:25:53

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 2/4] PCI: pciehp: bail out if pci_hp_add_bridge() fails

[+cc Lukas]

On Fri, May 03, 2024 at 09:23:20PM +0200, Nam Cao wrote:
> If there is no bus number available for the downstream bus of the
> hot-plugged bridge, pci_hp_add_bridge() will fail. The driver proceeds
> regardless, and the kernel crashes.
>
> Abort if pci_hp_add_bridge() fails.

Thanks for this and for the details in the cover letter. The cover
letter doesn't get directly preserved and connected to the git commit,
so please include some of the details here in the commit log.

I don't think we need *everything* from the cover letter; just enough
of the messages to show what went wrong and how the kernel crashed, so
somebody who trips over this can connect the crash with this fix. And
the timestamps are not relevant, so you can strip them out. The qemu
repro case is useful too, thanks for that!

Same for the shpchp patch.

And use "git log --oneline drivers/pci/hotplug/pciehp_pci.c" and match
the formatting (in particular, the capitalization) of your subject
lines.

> Fixes: 0eb3bcfd088e ("[PATCH] pciehp: allow bridged card hotplug")
> Signed-off-by: Nam Cao <[email protected]>
> Cc: Rajesh Shah <[email protected]>
> Cc: <[email protected]>
> ---
> drivers/pci/hotplug/pciehp_pci.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c
> index ad12515a4a12..faf4fcf2fbdf 100644
> --- a/drivers/pci/hotplug/pciehp_pci.c
> +++ b/drivers/pci/hotplug/pciehp_pci.c
> @@ -58,8 +58,13 @@ int pciehp_configure_device(struct controller *ctrl)
> goto out;
> }
>
> - for_each_pci_bridge(dev, parent)
> - pci_hp_add_bridge(dev);
> + for_each_pci_bridge(dev, parent) {
> + if (pci_hp_add_bridge(dev)) {
> + pci_stop_and_remove_bus_device(dev);
> + ret = -EINVAL;
> + goto out;
> + }
> + }
>
> pci_assign_unassigned_bridge_resources(bridge);
> pcie_bus_configure_settings(parent);
> --
> 2.39.2
>

2024-05-03 21:47:47

by Nam Cao

[permalink] [raw]
Subject: Re: [PATCH 2/4] PCI: pciehp: bail out if pci_hp_add_bridge() fails

On Fri, May 03, 2024 at 04:23:41PM -0500, Bjorn Helgaas wrote:
> On Fri, May 03, 2024 at 09:23:20PM +0200, Nam Cao wrote:
> > If there is no bus number available for the downstream bus of the
> > hot-plugged bridge, pci_hp_add_bridge() will fail. The driver proceeds
> > regardless, and the kernel crashes.
> >
> > Abort if pci_hp_add_bridge() fails.
>
> Thanks for this and for the details in the cover letter. The cover
> letter doesn't get directly preserved and connected to the git commit,
> so please include some of the details here in the commit log.
>
> I don't think we need *everything* from the cover letter; just enough
> of the messages to show what went wrong and how the kernel crashed, so
> somebody who trips over this can connect the crash with this fix. And
> the timestamps are not relevant, so you can strip them out. The qemu
> repro case is useful too, thanks for that!
>
> Same for the shpchp patch.
>
> And use "git log --oneline drivers/pci/hotplug/pciehp_pci.c" and match
> the formatting (in particular, the capitalization) of your subject
> lines.

Thanks for the detailed instructions. I will send v2 next week.

Best regards,
Nam

2024-05-04 08:54:40

by Lukas Wunner

[permalink] [raw]
Subject: Re: [PATCH 2/4] PCI: pciehp: bail out if pci_hp_add_bridge() fails

On Fri, May 03, 2024 at 09:23:20PM +0200, Nam Cao wrote:
> If there is no bus number available for the downstream bus of the
> hot-plugged bridge, pci_hp_add_bridge() will fail. The driver proceeds
> regardless, and the kernel crashes.
>
> Abort if pci_hp_add_bridge() fails.
[...]
> --- a/drivers/pci/hotplug/pciehp_pci.c
> +++ b/drivers/pci/hotplug/pciehp_pci.c
> @@ -58,8 +58,13 @@ int pciehp_configure_device(struct controller *ctrl)
> goto out;
> }
>
> - for_each_pci_bridge(dev, parent)
> - pci_hp_add_bridge(dev);
> + for_each_pci_bridge(dev, parent) {
> + if (pci_hp_add_bridge(dev)) {
> + pci_stop_and_remove_bus_device(dev);
> + ret = -EINVAL;
> + goto out;
> + }
> + }

Is the pci_stop_and_remove_bus_device() really necessary here?
Why not just leave the bridge as is, without any child devices?

Thanks,

Lukas

2024-05-04 09:35:48

by Nam Cao

[permalink] [raw]
Subject: Re: [PATCH 2/4] PCI: pciehp: bail out if pci_hp_add_bridge() fails

On Sat, May 04, 2024 at 10:54:15AM +0200, Lukas Wunner wrote:
> On Fri, May 03, 2024 at 09:23:20PM +0200, Nam Cao wrote:
> > If there is no bus number available for the downstream bus of the
> > hot-plugged bridge, pci_hp_add_bridge() will fail. The driver proceeds
> > regardless, and the kernel crashes.
> >
> > Abort if pci_hp_add_bridge() fails.
> [...]
> > --- a/drivers/pci/hotplug/pciehp_pci.c
> > +++ b/drivers/pci/hotplug/pciehp_pci.c
> > @@ -58,8 +58,13 @@ int pciehp_configure_device(struct controller *ctrl)
> > goto out;
> > }
> >
> > - for_each_pci_bridge(dev, parent)
> > - pci_hp_add_bridge(dev);
> > + for_each_pci_bridge(dev, parent) {
> > + if (pci_hp_add_bridge(dev)) {
> > + pci_stop_and_remove_bus_device(dev);
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > + }
>
> Is the pci_stop_and_remove_bus_device() really necessary here?
> Why not just leave the bridge as is, without any child devices?

pci_stop_and_remove_bus_device() is not necessary to prevent kernel
crashing. But without this, we cannot hot-plug any other devices to this
slot afterward, despite the bridge has already been removed. Below is what
happens without pci_stop_and_remove_bus_device().

First, we hotplug a bridge. That fails, so QEMU removes this bridge:
(qemu) device_add pci-bridge,id=br2,bus=br1,chassis_nr=19,addr=1
[ 9.289609] shpchp 0000:01:00.0: Latch close on Slot(1-1)
[ 9.291145] shpchp 0000:01:00.0: Button pressed on Slot(1-1)
[ 9.292705] shpchp 0000:01:00.0: Card present on Slot(1-1)
[ 9.294369] shpchp 0000:01:00.0: PCI slot #1-1 - powering on due to button press
[ 15.529997] pci 0000:02:01.0: [1b36:0001] type 01 class 0x060400 conventional PCI bridge
[ 15.533907] pci 0000:02:01.0: BAR 0 [mem 0x00000000-0x000000ff 64bit]
[ 15.535802] pci 0000:02:01.0: PCI bridge to [bus 00]
[ 15.538519] pci 0000:02:01.0: bridge window [io 0x0000-0x0fff]
[ 15.540261] pci 0000:02:01.0: bridge window [mem 0x00000000-0x000fffff]
[ 15.543486] pci 0000:02:01.0: bridge window [mem 0x00000000-0x000fffff 64bit pref]
[ 15.547151] pci 0000:02:01.0: No bus number available for hot-added bridge
[ 15.549067] shpchp 0000:01:00.0: Cannot add device at 0000:02:01
[ 15.553104] shpchp 0000:01:00.0: Latch open on Slot(1-1)
[ 15.555246] shpchp 0000:01:00.0: Card not present on Slot(1-1)

Then, hot-plug an ethernet device. But the kernel still incorrectly
thought the bridge is still there, and refuses this new ethernet device:
(qemu) device_add e1000,bus=br1,addr=1
[ 58.163529] shpchp 0000:01:00.0: Latch close on Slot(1-1)
[ 58.165076] shpchp 0000:01:00.0: Button pressed on Slot(1-1)
[ 58.166650] shpchp 0000:01:00.0: Card present on Slot(1-1)
[ 58.168287] shpchp 0000:01:00.0: PCI slot #1-1 - powering on due to button press
[ 64.677492] shpchp 0000:01:00.0: Device 0000:02:01.0 already exists at 0000:02:01, cannot hot-add
[ 64.680007] shpchp 0000:01:00.0: Cannot add device at 0000:02:01
[ 64.682802] shpchp 0000:01:00.0: Latch open on Slot(1-1)
[ 64.684353] shpchp 0000:01:00.0: Card not present on Slot(1-1)

Best regards,
Nam

2024-05-04 09:52:11

by Lukas Wunner

[permalink] [raw]
Subject: Re: [PATCH 2/4] PCI: pciehp: bail out if pci_hp_add_bridge() fails

On Sat, May 04, 2024 at 11:35:29AM +0200, Nam Cao wrote:
> On Sat, May 04, 2024 at 10:54:15AM +0200, Lukas Wunner wrote:
> > On Fri, May 03, 2024 at 09:23:20PM +0200, Nam Cao wrote:
> > > If there is no bus number available for the downstream bus of the
> > > hot-plugged bridge, pci_hp_add_bridge() will fail. The driver proceeds
> > > regardless, and the kernel crashes.
> > >
> > > Abort if pci_hp_add_bridge() fails.
> > [...]
> > > --- a/drivers/pci/hotplug/pciehp_pci.c
> > > +++ b/drivers/pci/hotplug/pciehp_pci.c
> > > @@ -58,8 +58,13 @@ int pciehp_configure_device(struct controller *ctrl)
> > > goto out;
> > > }
> > >
> > > - for_each_pci_bridge(dev, parent)
> > > - pci_hp_add_bridge(dev);
> > > + for_each_pci_bridge(dev, parent) {
> > > + if (pci_hp_add_bridge(dev)) {
> > > + pci_stop_and_remove_bus_device(dev);
> > > + ret = -EINVAL;
> > > + goto out;
> > > + }
> > > + }
> >
> > Is the pci_stop_and_remove_bus_device() really necessary here?
> > Why not just leave the bridge as is, without any child devices?
>
> pci_stop_and_remove_bus_device() is not necessary to prevent kernel
> crashing. But without this, we cannot hot-plug any other devices to this
> slot afterward, despite the bridge has already been removed. Below is what
> happens without pci_stop_and_remove_bus_device().
>
> First, we hotplug a bridge. That fails, so QEMU removes this bridge:
> (qemu) device_add pci-bridge,id=br2,bus=br1,chassis_nr=19,addr=1
> [ 9.289609] shpchp 0000:01:00.0: Latch close on Slot(1-1)
> [ 9.291145] shpchp 0000:01:00.0: Button pressed on Slot(1-1)
> [ 9.292705] shpchp 0000:01:00.0: Card present on Slot(1-1)
> [ 9.294369] shpchp 0000:01:00.0: PCI slot #1-1 - powering on due to button press
> [ 15.529997] pci 0000:02:01.0: [1b36:0001] type 01 class 0x060400 conventional PCI bridge
> [ 15.533907] pci 0000:02:01.0: BAR 0 [mem 0x00000000-0x000000ff 64bit]
> [ 15.535802] pci 0000:02:01.0: PCI bridge to [bus 00]
> [ 15.538519] pci 0000:02:01.0: bridge window [io 0x0000-0x0fff]
> [ 15.540261] pci 0000:02:01.0: bridge window [mem 0x00000000-0x000fffff]
> [ 15.543486] pci 0000:02:01.0: bridge window [mem 0x00000000-0x000fffff 64bit pref]
> [ 15.547151] pci 0000:02:01.0: No bus number available for hot-added bridge
> [ 15.549067] shpchp 0000:01:00.0: Cannot add device at 0000:02:01
> [ 15.553104] shpchp 0000:01:00.0: Latch open on Slot(1-1)
> [ 15.555246] shpchp 0000:01:00.0: Card not present on Slot(1-1)

I'm not familiar with shpchp, I don't understand why it's thinking
that there's no card after it failed to find a bus number.

Could you reproduce with pciehp instead of shpchp please?

Thanks,

Lukas

2024-05-04 10:56:54

by Nam Cao

[permalink] [raw]
Subject: Re: [PATCH 2/4] PCI: pciehp: bail out if pci_hp_add_bridge() fails

On Sat, May 04, 2024 at 11:51:54AM +0200, Lukas Wunner wrote:
> On Sat, May 04, 2024 at 11:35:29AM +0200, Nam Cao wrote:
> > pci_stop_and_remove_bus_device() is not necessary to prevent kernel
> > crashing. But without this, we cannot hot-plug any other devices to this
> > slot afterward, despite the bridge has already been removed. Below is what
> > happens without pci_stop_and_remove_bus_device().
> >
> > First, we hotplug a bridge. That fails, so QEMU removes this bridge:
> > (qemu) device_add pci-bridge,id=br2,bus=br1,chassis_nr=19,addr=1
> > [ 9.289609] shpchp 0000:01:00.0: Latch close on Slot(1-1)
> > [ 9.291145] shpchp 0000:01:00.0: Button pressed on Slot(1-1)
> > [ 9.292705] shpchp 0000:01:00.0: Card present on Slot(1-1)
> > [ 9.294369] shpchp 0000:01:00.0: PCI slot #1-1 - powering on due to button press
> > [ 15.529997] pci 0000:02:01.0: [1b36:0001] type 01 class 0x060400 conventional PCI bridge
> > [ 15.533907] pci 0000:02:01.0: BAR 0 [mem 0x00000000-0x000000ff 64bit]
> > [ 15.535802] pci 0000:02:01.0: PCI bridge to [bus 00]
> > [ 15.538519] pci 0000:02:01.0: bridge window [io 0x0000-0x0fff]
> > [ 15.540261] pci 0000:02:01.0: bridge window [mem 0x00000000-0x000fffff]
> > [ 15.543486] pci 0000:02:01.0: bridge window [mem 0x00000000-0x000fffff 64bit pref]
> > [ 15.547151] pci 0000:02:01.0: No bus number available for hot-added bridge
> > [ 15.549067] shpchp 0000:01:00.0: Cannot add device at 0000:02:01
> > [ 15.553104] shpchp 0000:01:00.0: Latch open on Slot(1-1)
> > [ 15.555246] shpchp 0000:01:00.0: Card not present on Slot(1-1)
>
> I'm not familiar with shpchp, I don't understand why it's thinking
> that there's no card after it failed to find a bus number.

Sorry, I got mixed up between the two.

> Could you reproduce with pciehp instead of shpchp please?

Same thing for pciehp below. I think the problem is because without
pci_stop_and_remove_bus_device(), no one cleans up the device added in
pci_scan_slot(). When another device get hot-added, pci_get_slot() wrongly
thinks another device is already there, so the hot-plug fails.

Best regards,
Nam

(qemu) device_add pcie-pci-bridge,id=br1,bus=rp1
[ 19.840550] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0009 from Slot Status
[ 19.842843] pcieport 0000:00:03.0: pciehp: Slot(1): Button press: will power on in 5 sec
[ 19.845289] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0010 from Slot Status
[ 19.847502] pcieport 0000:00:03.0: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 2c0
[ 19.849876] pcieport 0000:00:03.0: pciehp: pciehp_check_link_active: lnk_status = 2011
[ 19.852094] pcieport 0000:00:03.0: pciehp: Slot(1): Card present
[ 19.853809] pcieport 0000:00:03.0: pciehp: Slot(1): Link Up
[ 19.855412] pcieport 0000:00:03.0: pciehp: pciehp_get_power_status: SLOTCTRL 6c value read 6f1
[ 19.857975] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0010 from Slot Status
[ 19.860199] pcieport 0000:00:03.0: pciehp: pciehp_power_on_slot: SLOTCTRL 6c write cmd 0
[ 19.862586] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0010 from Slot Status
[ 19.864806] pcieport 0000:00:03.0: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 200
[ 20.994936] pcieport 0000:00:03.0: pciehp: pciehp_check_link_status: lnk_status = 2011
[ 20.997463] pci 0000:01:00.0: [1b36:000e] type 01 class 0x060400 PCIe to PCI/PCI-X bridge
[ 21.001131] pci 0000:01:00.0: BAR 0 [mem 0x00000000-0x000000ff 64bit]
[ 21.003071] pci 0000:01:00.0: PCI bridge to [bus 00]
[ 21.005417] pci 0000:01:00.0: bridge window [io 0x0000-0x0fff]
[ 21.007181] pci 0000:01:00.0: bridge window [mem 0x00000000-0x000fffff]
[ 21.010084] pci 0000:01:00.0: bridge window [mem 0x00000000-0x000fffff 64bit pref]
[ 21.014162] pci 0000:01:00.0: vgaarb: pci_notify
[ 21.015900] pci 0000:01:00.0: No bus number available for hot-added bridge
[ 21.017865] pcieport 0000:00:03.0: pciehp: Cannot add device at 0000:01:00
[ 21.019931] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0010 from Slot Status
[ 21.022178] pcieport 0000:00:03.0: pciehp: pciehp_power_off_slot: SLOTCTRL 6c write cmd 400
[ 22.084607] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0018 from Slot Status
[ 22.086845] pcieport 0000:00:03.0: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 340
[ 22.089323] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0010 from Slot Status
[ 22.091539] pcieport 0000:00:03.0: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 300
[ 22.093913] pcieport 0000:00:03.0: pciehp: pciehp_check_link_active: lnk_status = 11

(qemu) device_add e1000,bus=rp1,id=eth1
[ 58.389527] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0009 from Slot Status
[ 58.391789] pcieport 0000:00:03.0: pciehp: Slot(1): Button press: will power on in 5 sec
[ 58.394175] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0010 from Slot Status
[ 58.396365] pcieport 0000:00:03.0: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 2c0
[ 58.398681] pcieport 0000:00:03.0: pciehp: pciehp_check_link_active: lnk_status = 2011
[ 58.400871] pcieport 0000:00:03.0: pciehp: Slot(1): Card present
[ 58.402542] pcieport 0000:00:03.0: pciehp: Slot(1): Link Up
[ 58.404154] pcieport 0000:00:03.0: pciehp: pciehp_get_power_status: SLOTCTRL 6c value read 6f1
[ 58.406627] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0010 from Slot Status
[ 58.408798] pcieport 0000:00:03.0: pciehp: pciehp_power_on_slot: SLOTCTRL 6c write cmd 0
[ 58.411213] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0010 from Slot Status
[ 58.413386] pcieport 0000:00:03.0: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 200
[ 59.523011] pcieport 0000:00:03.0: pciehp: pciehp_check_link_status: lnk_status = 2011
[ 59.525256] pcieport 0000:00:03.0: pciehp: Device 0000:01:00.0 already exists at 0000:01:00, skipping hot-add
[ 59.528139] pcieport 0000:00:03.0: pciehp: pending interrupts 0x0010 from Slot Status
[ 59.530325] pcieport 0000:00:03.0: pciehp: pciehp_set_indicators: SLOTCTRL 6c write cmd 1c0

2024-05-04 15:02:51

by Lukas Wunner

[permalink] [raw]
Subject: Re: [PATCH 2/4] PCI: pciehp: bail out if pci_hp_add_bridge() fails

On Sat, May 04, 2024 at 12:56:30PM +0200, Nam Cao wrote:
> On Sat, May 04, 2024 at 11:51:54AM +0200, Lukas Wunner wrote:
> > Could you reproduce with pciehp instead of shpchp please?
>
> Same thing for pciehp below. I think the problem is because without
> pci_stop_and_remove_bus_device(), no one cleans up the device added in
> pci_scan_slot(). When another device get hot-added, pci_get_slot() wrongly
> thinks another device is already there, so the hot-plug fails.

pciehp powers down the slot because you're returning a negative errno
from pciehp_configure_device(). Please return 0 instead if
pci_hp_add_bridge() fails.

Thanks,

Lukas

2024-05-04 15:48:32

by Nam Cao

[permalink] [raw]
Subject: Re: [PATCH 2/4] PCI: pciehp: bail out if pci_hp_add_bridge() fails

On Sat, May 04, 2024 at 05:02:34PM +0200, Lukas Wunner wrote:
> On Sat, May 04, 2024 at 12:56:30PM +0200, Nam Cao wrote:
> > On Sat, May 04, 2024 at 11:51:54AM +0200, Lukas Wunner wrote:
> > > Could you reproduce with pciehp instead of shpchp please?
> >
> > Same thing for pciehp below. I think the problem is because without
> > pci_stop_and_remove_bus_device(), no one cleans up the device added in
> > pci_scan_slot(). When another device get hot-added, pci_get_slot() wrongly
> > thinks another device is already there, so the hot-plug fails.
>
> pciehp powers down the slot because you're returning a negative errno
> from pciehp_configure_device(). Please return 0 instead if
> pci_hp_add_bridge() fails.

Thanks for the suggestion. This is applicable to shpchp as well.

Best regards,
Nam