2023-04-20 02:41:51

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH v3 0/6] pci-hyper: Fix race condition bugs for fast device hotplug

Before the guest finishes probing a device, the host may be already starting
to remove the device. Currently there are multiple race condition bugs in the
pci-hyperv driver, which can cause the guest to panic. The patchset fixes
the crashes.

The patchset also does some cleanup work: patch 3 removes the useless
hv_pcichild_state, and patch 4 reverts an old patch which is not really
useful (without patch 4, it would be hard to make patch 5 clean).

Patch 6 removes the use of a global mutex lock, and enables async-probing
to allow concurrent device probing for faster boot.

v3 is based on v6.3-rc5. No code change since v2. I just added Michael's
and Long Li's Reviewed-by.

The patchset is also availsble in my github branch:
https://github.com/dcui/tdx/commits/decui/vpci/v6.3-rc5-v3

v2 can be found here:
https://lwn.net/ml/linux-kernel/[email protected]/

Please review. Thanks!


Dexuan Cui (6):
PCI: hv: Fix a race condition bug in hv_pci_query_relations()
PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
Revert "PCI: hv: Fix a timing issue which causes kdump to fail
occasionally"
PCI: hv: Add a per-bus mutex state_lock
PCI: hv: Use async probing to reduce boot time

drivers/pci/controller/pci-hyperv.c | 145 +++++++++++++++++-----------
1 file changed, 86 insertions(+), 59 deletions(-)

--
2.25.1


2023-04-20 02:42:01

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH v3 1/6] PCI: hv: Fix a race condition bug in hv_pci_query_relations()

Fix the longstanding race between hv_pci_query_relations() and
survey_child_resources() by flushing the workqueue before we exit from
hv_pci_query_relations().

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Cc: [email protected]
---

v2:
Removed the "debug code".
No change to the patch body.
Added Cc:stable

v3:
Added Michael's Reviewed-by.

drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index f33370b756283..b82c7cde19e66 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device *hdev)
if (!ret)
ret = wait_for_response(hdev, &comp);

+ /*
+ * In the case of fast device addition/removal, it's possible that
+ * vmbus_sendpacket() or wait_for_response() returns -ENODEV but we
+ * already got a PCI_BUS_RELATIONS* message from the host and the
+ * channel callback already scheduled a work to hbus->wq, which can be
+ * running survey_child_resources() -> complete(&hbus->survey_event),
+ * even after hv_pci_query_relations() exits and the stack variable
+ * 'comp' is no longer valid. This can cause a strange hang issue
+ * or sometimes a page fault. Flush hbus->wq before we exit from
+ * hv_pci_query_relations() to avoid the issues.
+ */
+ flush_workqueue(hbus->wq);
+
return ret;
}

--
2.25.1

2023-04-20 02:42:03

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH v3 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic

When the host tries to remove a PCI device, the host first sends a
PCI_EJECT message to the guest, and the guest is supposed to gracefully
remove the PCI device and send a PCI_EJECTION_COMPLETE message to the host;
the host then sends a VMBus message CHANNELMSG_RESCIND_CHANNELOFFER to
the guest (when the guest receives this message, the device is already
unassigned from the guest) and the guest can do some final cleanup work;
if the guest fails to respond to the PCI_EJECT message within one minute,
the host sends the VMBus message CHANNELMSG_RESCIND_CHANNELOFFER and
removes the PCI device forcibly.

In the case of fast device addition/removal, it's possible that the PCI
device driver is still configuring MSI-X interrupts when the guest receives
the PCI_EJECT message; the channel callback calls hv_pci_eject_device(),
which sets hpdev->state to hv_pcichild_ejecting, and schedules a work
hv_eject_device_work(); if the PCI device driver is calling
pci_alloc_irq_vectors() -> ... -> hv_compose_msi_msg(), we can break the
while loop in hv_compose_msi_msg() due to the updated hpdev->state, and
leave data->chip_data with its default value of NULL; later, when the PCI
device driver calls request_irq() -> ... -> hv_irq_unmask(), the guest
crashes in hv_arch_irq_unmask() due to data->chip_data being NULL.

Fix the issue by not testing hpdev->state in the while loop: when the
guest receives PCI_EJECT, the device is still assigned to the guest, and
the guest has one minute to finish the device removal gracefully. We don't
really need to (and we should not) test hpdev->state in the loop.

Fixes: de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")
Signed-off-by: Dexuan Cui <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Cc: [email protected]
---

v2:
Removed the "debug code".
No change to the patch body.
Added Cc:stable

v3:
Added Michael's Reviewed-by.

drivers/pci/controller/pci-hyperv.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index b82c7cde19e66..1b11cf7391933 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -643,6 +643,11 @@ static void hv_arch_irq_unmask(struct irq_data *data)
pbus = pdev->bus;
hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
int_desc = data->chip_data;
+ if (!int_desc) {
+ dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
+ __func__, data->irq);
+ return;
+ }

spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);

@@ -1911,12 +1916,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
hv_pci_onchannelcallback(hbus);
spin_unlock_irqrestore(&channel->sched_lock, flags);

- if (hpdev->state == hv_pcichild_ejecting) {
- dev_err_once(&hbus->hdev->device,
- "the device is being ejected\n");
- goto enable_tasklet;
- }
-
udelay(100);
}

--
2.25.1

2023-04-20 02:42:20

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH v3 3/6] PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev

The hpdev->state is never really useful. The only use in
hv_pci_eject_device() and hv_eject_device_work() is not really necessary.

Signed-off-by: Dexuan Cui <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Cc: [email protected]
---

v2:
No change to the patch body.
Added Cc:stable

v3:
Added Michael's Reviewed-by.

drivers/pci/controller/pci-hyperv.c | 12 ------------
1 file changed, 12 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 1b11cf7391933..46df6d093d683 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -553,19 +553,10 @@ struct hv_dr_state {
struct hv_pcidev_description func[];
};

-enum hv_pcichild_state {
- hv_pcichild_init = 0,
- hv_pcichild_requirements,
- hv_pcichild_resourced,
- hv_pcichild_ejecting,
- hv_pcichild_maximum
-};
-
struct hv_pci_dev {
/* List protected by pci_rescan_remove_lock */
struct list_head list_entry;
refcount_t refs;
- enum hv_pcichild_state state;
struct pci_slot *pci_slot;
struct hv_pcidev_description desc;
bool reported_missing;
@@ -2750,8 +2741,6 @@ static void hv_eject_device_work(struct work_struct *work)
hpdev = container_of(work, struct hv_pci_dev, wrk);
hbus = hpdev->hbus;

- WARN_ON(hpdev->state != hv_pcichild_ejecting);
-
/*
* Ejection can come before or after the PCI bus has been set up, so
* attempt to find it and tear down the bus state, if it exists. This
@@ -2808,7 +2797,6 @@ static void hv_pci_eject_device(struct hv_pci_dev *hpdev)
return;
}

- hpdev->state = hv_pcichild_ejecting;
get_pcichild(hpdev);
INIT_WORK(&hpdev->wrk, hv_eject_device_work);
queue_work(hbus->wq, &hpdev->wrk);
--
2.25.1

2023-04-20 02:42:43

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH v3 5/6] PCI: hv: Add a per-bus mutex state_lock

In the case of fast device addition/removal, it's possible that
hv_eject_device_work() can start to run before create_root_hv_pci_bus()
starts to run; as a result, the pci_get_domain_bus_and_slot() in
hv_eject_device_work() can return a 'pdev' of NULL, and
hv_eject_device_work() can remove the 'hpdev', and immediately send a
message PCI_EJECTION_COMPLETE to the host, and the host immediately
unassigns the PCI device from the guest; meanwhile,
create_root_hv_pci_bus() and the PCI device driver can be probing the
dead PCI device and reporting timeout errors.

Fix the issue by adding a per-bus mutex 'state_lock' and grabbing the
mutex before powering on the PCI bus in hv_pci_enter_d0(): when
hv_eject_device_work() starts to run, it's able to find the 'pdev' and call
pci_stop_and_remove_bus_device(pdev): if the PCI device driver has
loaded, the PCI device driver's probe() function is already called in
create_root_hv_pci_bus() -> pci_bus_add_devices(), and now
hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able
to call the PCI device driver's remove() function and remove the device
reliably; if the PCI device driver hasn't loaded yet, the function call
hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able to
remove the PCI device reliably and the PCI device driver's probe()
function won't be called; if the PCI device driver's probe() is already
running (e.g., systemd-udev is loading the PCI device driver), it must
be holding the per-device lock, and after the probe() finishes and releases
the lock, hv_eject_device_work() -> pci_stop_and_remove_bus_device() is
able to proceed to remove the device reliably.

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Cc: [email protected]
---

v2:
Removed the "debug code".
Fixed the "goto out" in hv_pci_resume() [Michael Kelley]
Added Cc:stable

v3:
Added Michael's Reviewed-by.

drivers/pci/controller/pci-hyperv.c | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 48feab095a144..3ae2f99dea8c2 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -489,7 +489,10 @@ struct hv_pcibus_device {
struct fwnode_handle *fwnode;
/* Protocol version negotiated with the host */
enum pci_protocol_version_t protocol_version;
+
+ struct mutex state_lock;
enum hv_pcibus_state state;
+
struct hv_device *hdev;
resource_size_t low_mmio_space;
resource_size_t high_mmio_space;
@@ -2512,6 +2515,8 @@ static void pci_devices_present_work(struct work_struct *work)
if (!dr)
return;

+ mutex_lock(&hbus->state_lock);
+
/* First, mark all existing children as reported missing. */
spin_lock_irqsave(&hbus->device_list_lock, flags);
list_for_each_entry(hpdev, &hbus->children, list_entry) {
@@ -2593,6 +2598,8 @@ static void pci_devices_present_work(struct work_struct *work)
break;
}

+ mutex_unlock(&hbus->state_lock);
+
kfree(dr);
}

@@ -2741,6 +2748,8 @@ static void hv_eject_device_work(struct work_struct *work)
hpdev = container_of(work, struct hv_pci_dev, wrk);
hbus = hpdev->hbus;

+ mutex_lock(&hbus->state_lock);
+
/*
* Ejection can come before or after the PCI bus has been set up, so
* attempt to find it and tear down the bus state, if it exists. This
@@ -2777,6 +2786,8 @@ static void hv_eject_device_work(struct work_struct *work)
put_pcichild(hpdev);
put_pcichild(hpdev);
/* hpdev has been freed. Do not use it any more. */
+
+ mutex_unlock(&hbus->state_lock);
}

/**
@@ -3562,6 +3573,7 @@ static int hv_pci_probe(struct hv_device *hdev,
return -ENOMEM;

hbus->bridge = bridge;
+ mutex_init(&hbus->state_lock);
hbus->state = hv_pcibus_init;
hbus->wslot_res_allocated = -1;

@@ -3670,9 +3682,11 @@ static int hv_pci_probe(struct hv_device *hdev,
if (ret)
goto free_irq_domain;

+ mutex_lock(&hbus->state_lock);
+
ret = hv_pci_enter_d0(hdev);
if (ret)
- goto free_irq_domain;
+ goto release_state_lock;

ret = hv_pci_allocate_bridge_windows(hbus);
if (ret)
@@ -3690,12 +3704,15 @@ static int hv_pci_probe(struct hv_device *hdev,
if (ret)
goto free_windows;

+ mutex_unlock(&hbus->state_lock);
return 0;

free_windows:
hv_pci_free_bridge_windows(hbus);
exit_d0:
(void) hv_pci_bus_exit(hdev, true);
+release_state_lock:
+ mutex_unlock(&hbus->state_lock);
free_irq_domain:
irq_domain_remove(hbus->irq_domain);
free_fwnode:
@@ -3945,20 +3962,26 @@ static int hv_pci_resume(struct hv_device *hdev)
if (ret)
goto out;

+ mutex_lock(&hbus->state_lock);
+
ret = hv_pci_enter_d0(hdev);
if (ret)
- goto out;
+ goto release_state_lock;

ret = hv_send_resources_allocated(hdev);
if (ret)
- goto out;
+ goto release_state_lock;

prepopulate_bars(hbus);

hv_pci_restore_msi_state(hbus);

hbus->state = hv_pcibus_installed;
+ mutex_unlock(&hbus->state_lock);
return 0;
+
+release_state_lock:
+ mutex_unlock(&hbus->state_lock);
out:
vmbus_close(hdev->channel);
return ret;
--
2.25.1

2023-04-20 02:42:53

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH v3 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"

This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195.

The statement "the hv_pci_bus_exit() call releases structures of all its
child devices" in commit d6af2ed29c7c is not true: in the path
hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true): the
parameter "keep_devs" is true, so hv_pci_bus_exit() does *not* release the
child "struct hv_pci_dev *hpdev" that is created earlier in
pci_devices_present_work() -> new_pcichild_device().

The commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7,
where the old version of hv_pci_bus_exit() was used; when the commit was
rebased and merged into the upstream, people didn't notice that it's
not really necessary. The commit itself doesn't cause any issue, but it
makes hv_pci_probe() more complicated. Revert it to facilitate some
upcoming changes to hv_pci_probe().

Signed-off-by: Dexuan Cui <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Acked-by: Wei Hu <[email protected]>
Cc: [email protected]
---

v2:
No change to the patch body.
Added Wei Hu's Acked-by.
Added Cc:stable

v3:
Added Michael's Reviewed-by.

drivers/pci/controller/pci-hyperv.c | 71 ++++++++++++++---------------
1 file changed, 34 insertions(+), 37 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 46df6d093d683..48feab095a144 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3225,8 +3225,10 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
struct pci_bus_d0_entry *d0_entry;
struct hv_pci_compl comp_pkt;
struct pci_packet *pkt;
+ bool retry = true;
int ret;

+enter_d0_retry:
/*
* Tell the host that the bus is ready to use, and moved into the
* powered-on state. This includes telling the host which region
@@ -3253,6 +3255,38 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
if (ret)
goto exit;

+ /*
+ * In certain case (Kdump) the pci device of interest was
+ * not cleanly shut down and resource is still held on host
+ * side, the host could return invalid device status.
+ * We need to explicitly request host to release the resource
+ * and try to enter D0 again.
+ */
+ if (comp_pkt.completion_status < 0 && retry) {
+ retry = false;
+
+ dev_err(&hdev->device, "Retrying D0 Entry\n");
+
+ /*
+ * Hv_pci_bus_exit() calls hv_send_resource_released()
+ * to free up resources of its child devices.
+ * In the kdump kernel we need to set the
+ * wslot_res_allocated to 255 so it scans all child
+ * devices to release resources allocated in the
+ * normal kernel before panic happened.
+ */
+ hbus->wslot_res_allocated = 255;
+
+ ret = hv_pci_bus_exit(hdev, true);
+
+ if (ret == 0) {
+ kfree(pkt);
+ goto enter_d0_retry;
+ }
+ dev_err(&hdev->device,
+ "Retrying D0 failed with ret %d\n", ret);
+ }
+
if (comp_pkt.completion_status < 0) {
dev_err(&hdev->device,
"PCI Pass-through VSP failed D0 Entry with status %x\n",
@@ -3493,7 +3527,6 @@ static int hv_pci_probe(struct hv_device *hdev,
struct hv_pcibus_device *hbus;
u16 dom_req, dom;
char *name;
- bool enter_d0_retry = true;
int ret;

/*
@@ -3633,47 +3666,11 @@ static int hv_pci_probe(struct hv_device *hdev,
if (ret)
goto free_fwnode;

-retry:
ret = hv_pci_query_relations(hdev);
if (ret)
goto free_irq_domain;

ret = hv_pci_enter_d0(hdev);
- /*
- * In certain case (Kdump) the pci device of interest was
- * not cleanly shut down and resource is still held on host
- * side, the host could return invalid device status.
- * We need to explicitly request host to release the resource
- * and try to enter D0 again.
- * Since the hv_pci_bus_exit() call releases structures
- * of all its child devices, we need to start the retry from
- * hv_pci_query_relations() call, requesting host to send
- * the synchronous child device relations message before this
- * information is needed in hv_send_resources_allocated()
- * call later.
- */
- if (ret == -EPROTO && enter_d0_retry) {
- enter_d0_retry = false;
-
- dev_err(&hdev->device, "Retrying D0 Entry\n");
-
- /*
- * Hv_pci_bus_exit() calls hv_send_resources_released()
- * to free up resources of its child devices.
- * In the kdump kernel we need to set the
- * wslot_res_allocated to 255 so it scans all child
- * devices to release resources allocated in the
- * normal kernel before panic happened.
- */
- hbus->wslot_res_allocated = 255;
- ret = hv_pci_bus_exit(hdev, true);
-
- if (ret == 0)
- goto retry;
-
- dev_err(&hdev->device,
- "Retrying D0 failed with ret %d\n", ret);
- }
if (ret)
goto free_irq_domain;

--
2.25.1

2023-04-20 02:44:33

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH v3 6/6] PCI: hv: Use async probing to reduce boot time

Commit 414428c5da1c ("PCI: hv: Lock PCI bus on device eject") added
pci_lock_rescan_remove() and pci_unlock_rescan_remove() in
create_root_hv_pci_bus() and in hv_eject_device_work() to address the
race between create_root_hv_pci_bus() and hv_eject_device_work(), but it
turns that grabing the pci_rescan_remove_lock mutex is not enough:
refer to the earlier fix "PCI: hv: Add a per-bus mutex state_lock".

Now with hbus->state_lock and other fixes, the race is resolved, so
remove pci_{lock,unlock}_rescan_remove() in create_root_hv_pci_bus():
this removes the serialization in hv_pci_probe() and hence allows
async-probing (PROBE_PREFER_ASYNCHRONOUS) to work.

Add the async-probing flag to hv_pci_drv.

pci_{lock,unlock}_rescan_remove() in hv_eject_device_work() and in
hv_pci_remove() are still kept: according to the comment before
drivers/pci/probe.c: static DEFINE_MUTEX(pci_rescan_remove_lock),
"PCI device removal routines should always be executed under this mutex".

Signed-off-by: Dexuan Cui <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Reviewed-by: Long Li <[email protected]>
Cc: [email protected]
---

v2:
No change to the patch body.
Improved the commit message [Michael Kelley]
Added Cc:stable

v3:
Added Michael's and Long Li's Reviewed-by.
Fixed a typo in the commit message: grubing -> grabing [Thanks, Michael!]

drivers/pci/controller/pci-hyperv.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 3ae2f99dea8c2..2ea2b1b8a4c9a 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -2312,12 +2312,16 @@ static int create_root_hv_pci_bus(struct hv_pcibus_device *hbus)
if (error)
return error;

- pci_lock_rescan_remove();
+ /*
+ * pci_lock_rescan_remove() and pci_unlock_rescan_remove() are
+ * unnecessary here, because we hold the hbus->state_lock, meaning
+ * hv_eject_device_work() and pci_devices_present_work() can't race
+ * with create_root_hv_pci_bus().
+ */
hv_pci_assign_numa_node(hbus);
pci_bus_assign_resources(bridge->bus);
hv_pci_assign_slots(hbus);
pci_bus_add_devices(bridge->bus);
- pci_unlock_rescan_remove();
hbus->state = hv_pcibus_installed;
return 0;
}
@@ -4003,6 +4007,9 @@ static struct hv_driver hv_pci_drv = {
.remove = hv_pci_remove,
.suspend = hv_pci_suspend,
.resume = hv_pci_resume,
+ .driver = {
+ .probe_type = PROBE_PREFER_ASYNCHRONOUS,
+ },
};

static void __exit exit_hv_pci_drv(void)
--
2.25.1

2023-04-21 02:08:45

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v3 0/6] pci-hyper: Fix race condition bugs for fast device hotplug

> From: Dexuan Cui <[email protected]>
> Sent: Wednesday, April 19, 2023 7:41 PM
> ...
> Before the guest finishes probing a device, the host may be already starting
> to remove the device. Currently there are multiple race condition bugs in the
> pci-hyperv driver, which can cause the guest to panic. The patchset fixes
> the crashes.
>
> The patchset also does some cleanup work: patch 3 removes the useless
> hv_pcichild_state, and patch 4 reverts an old patch which is not really
> useful (without patch 4, it would be hard to make patch 5 clean).
>
> Patch 6 removes the use of a global mutex lock, and enables async-probing
> to allow concurrent device probing for faster boot.
>
> v3 is based on v6.3-rc5. No code change since v2. I just added Michael's
> and Long Li's Reviewed-by.
> ...
>
> Dexuan Cui (6):
> PCI: hv: Fix a race condition bug in hv_pci_query_relations()
> PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
> PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
> Revert "PCI: hv: Fix a timing issue which causes kdump to fail
> occasionally"
> PCI: hv: Add a per-bus mutex state_lock
> PCI: hv: Use async probing to reduce boot time
>
> drivers/pci/controller/pci-hyperv.c | 145 +++++++++++++++++-----------
> 1 file changed, 86 insertions(+), 59 deletions(-)

Hi Bjorn, Lorenzo, since basically this patchset is Hyper-V stuff, I would
like it to go through the hyper-v tree if you have no objection.

The hyper-v tree already has one big PCI patch from Michael:
https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/commit/?h=hyperv-next&id=2c6ba4216844ca7918289b49ed5f3f7138ee2402

Thanks,
Dexuan

2023-04-21 22:25:30

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v3 0/6] pci-hyper: Fix race condition bugs for fast device hotplug

> From: Dexuan Cui
> Sent: Thursday, April 20, 2023 7:04 PM
> > ...
> >
> > Dexuan Cui (6):
> > PCI: hv: Fix a race condition bug in hv_pci_query_relations()
> > PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
> > PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
> > Revert "PCI: hv: Fix a timing issue which causes kdump to fail
> > occasionally"
> > PCI: hv: Add a per-bus mutex state_lock
> > PCI: hv: Use async probing to reduce boot time
> >
> > drivers/pci/controller/pci-hyperv.c | 145 +++++++++++++++++-----------
> > 1 file changed, 86 insertions(+), 59 deletions(-)
>
> Hi Bjorn, Lorenzo, since basically this patchset is Hyper-V stuff, I would
> like it to go through the hyper-v tree if you have no objection.
>
> The hyper-v tree already has one big PCI patch from Michael:
> https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/commit/?h=
> hyperv-next&id=2c6ba4216844ca7918289b49ed5f3f7138ee2402
>
> Thanks,
> Dexuan

Hi Lorenzo, thanks for Ack'ing the patch:
Re: [PATCH v2] PCI: hv: Replace retarget_msi_interrupt_params with hyperv_pcpu_input_arg

It would be great if you and/or Bjorn can Ack this patchset as well :-)

v1 of this patchset was posted on 3/28:
https://lwn.net/ml/linux-kernel/20230328045122.25850-1-decui%40microsoft.com/
and v3 got Michael Kelley's and Long Li's Reviewed-by.

I have done a long-haul testing against the patchset and it worked
reliably without causing any issue: without the patchset, usually the VM
can crash within 1~2 days; with the patchset, the VM is still running fine
after 2 weeks.

Thanks,
Dexuan

2023-04-23 19:32:02

by Simon Horman

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] PCI: hv: Use async probing to reduce boot time

On Wed, Apr 19, 2023 at 07:40:37PM -0700, Dexuan Cui wrote:
> Commit 414428c5da1c ("PCI: hv: Lock PCI bus on device eject") added
> pci_lock_rescan_remove() and pci_unlock_rescan_remove() in
> create_root_hv_pci_bus() and in hv_eject_device_work() to address the
> race between create_root_hv_pci_bus() and hv_eject_device_work(), but it
> turns that grabing the pci_rescan_remove_lock mutex is not enough:

nit: s/grabing/grabbing/g

> refer to the earlier fix "PCI: hv: Add a per-bus mutex state_lock".

...

2023-04-24 20:58:35

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v3 6/6] PCI: hv: Use async probing to reduce boot time

> From: Simon Horman <[email protected]>
> Sent: Sunday, April 23, 2023 12:11 PM
> To: Dexuan Cui <[email protected]>
> ...
> On Wed, Apr 19, 2023 at 07:40:37PM -0700, Dexuan Cui wrote:
> > Commit 414428c5da1c ("PCI: hv: Lock PCI bus on device eject") added
> > pci_lock_rescan_remove() and pci_unlock_rescan_remove() in
> > create_root_hv_pci_bus() and in hv_eject_device_work() to address the
> > race between create_root_hv_pci_bus() and hv_eject_device_work(), but it
> > turns that grabing the pci_rescan_remove_lock mutex is not enough:
>
> nit: s/grabing/grabbing/g
Thanks for spotting this! I suppose the maintainer(s) would help fix this.

2023-05-08 16:58:48

by Wei Liu

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] pci-hyper: Fix race condition bugs for fast device hotplug

On Fri, Apr 21, 2023 at 10:23:03PM +0000, Dexuan Cui wrote:
> > From: Dexuan Cui
> > Sent: Thursday, April 20, 2023 7:04 PM
> > > ...
> > >
> > > Dexuan Cui (6):
> > > PCI: hv: Fix a race condition bug in hv_pci_query_relations()
> > > PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
> > > PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
> > > Revert "PCI: hv: Fix a timing issue which causes kdump to fail
> > > occasionally"
> > > PCI: hv: Add a per-bus mutex state_lock
> > > PCI: hv: Use async probing to reduce boot time
> > >
> > > drivers/pci/controller/pci-hyperv.c | 145 +++++++++++++++++-----------
> > > 1 file changed, 86 insertions(+), 59 deletions(-)
> >
> > Hi Bjorn, Lorenzo, since basically this patchset is Hyper-V stuff, I would
> > like it to go through the hyper-v tree if you have no objection.
> >
> > The hyper-v tree already has one big PCI patch from Michael:
> > https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/commit/?h=
> > hyperv-next&id=2c6ba4216844ca7918289b49ed5f3f7138ee2402
> >
> > Thanks,
> > Dexuan
>
> Hi Lorenzo, thanks for Ack'ing the patch:
> Re: [PATCH v2] PCI: hv: Replace retarget_msi_interrupt_params with hyperv_pcpu_input_arg
>
> It would be great if you and/or Bjorn can Ack this patchset as well :-)
>

Lorenzo and Bjorn, are you happy with these patches? I can collect them
via the hyperv-fixes tree.

Thanks,
Wei.

> v1 of this patchset was posted on 3/28:
> https://lwn.net/ml/linux-kernel/20230328045122.25850-1-decui%40microsoft.com/
> and v3 got Michael Kelley's and Long Li's Reviewed-by.
>
> I have done a long-haul testing against the patchset and it worked
> reliably without causing any issue: without the patchset, usually the VM
> can crash within 1~2 days; with the patchset, the VM is still running fine
> after 2 weeks.
>
> Thanks,
> Dexuan

2023-05-10 08:45:25

by Lorenzo Pieralisi

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] PCI: hv: Use async probing to reduce boot time

On Wed, Apr 19, 2023 at 07:40:37PM -0700, Dexuan Cui wrote:
> Commit 414428c5da1c ("PCI: hv: Lock PCI bus on device eject") added
> pci_lock_rescan_remove() and pci_unlock_rescan_remove() in
> create_root_hv_pci_bus() and in hv_eject_device_work() to address the
> race between create_root_hv_pci_bus() and hv_eject_device_work(), but it
> turns that grabing the pci_rescan_remove_lock mutex is not enough:
> refer to the earlier fix "PCI: hv: Add a per-bus mutex state_lock".

This is meaningless for a commit log reader, there is nothing to
refer to.

> Now with hbus->state_lock and other fixes, the race is resolved, so

"other fixes" is meaningless too.

Explain the problem and how you fix it (this patch should be split
because the Subject does not represent what you are doing precisely,
see below).

> remove pci_{lock,unlock}_rescan_remove() in create_root_hv_pci_bus():
> this removes the serialization in hv_pci_probe() and hence allows
> async-probing (PROBE_PREFER_ASYNCHRONOUS) to work.
>
> Add the async-probing flag to hv_pci_drv.

Adding the asynchronous probing should be a separate patch and
I don't think you should send it to stable kernels straight away
because a) it is not a fix b) it can trigger further regressions.

> pci_{lock,unlock}_rescan_remove() in hv_eject_device_work() and in
> hv_pci_remove() are still kept: according to the comment before
> drivers/pci/probe.c: static DEFINE_MUTEX(pci_rescan_remove_lock),
> "PCI device removal routines should always be executed under this mutex".

This patch should be split, first thing is to fix and document what
you are changing for pci_{lock,unlock}_rescan_remove() then add
asynchronous probing.

Lorenzo

> Signed-off-by: Dexuan Cui <[email protected]>
> Reviewed-by: Michael Kelley <[email protected]>
> Reviewed-by: Long Li <[email protected]>
> Cc: [email protected]
> ---
>
> v2:
> No change to the patch body.
> Improved the commit message [Michael Kelley]
> Added Cc:stable
>
> v3:
> Added Michael's and Long Li's Reviewed-by.
> Fixed a typo in the commit message: grubing -> grabing [Thanks, Michael!]
>
> drivers/pci/controller/pci-hyperv.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 3ae2f99dea8c2..2ea2b1b8a4c9a 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -2312,12 +2312,16 @@ static int create_root_hv_pci_bus(struct hv_pcibus_device *hbus)
> if (error)
> return error;
>
> - pci_lock_rescan_remove();
> + /*
> + * pci_lock_rescan_remove() and pci_unlock_rescan_remove() are
> + * unnecessary here, because we hold the hbus->state_lock, meaning
> + * hv_eject_device_work() and pci_devices_present_work() can't race
> + * with create_root_hv_pci_bus().
> + */
> hv_pci_assign_numa_node(hbus);
> pci_bus_assign_resources(bridge->bus);
> hv_pci_assign_slots(hbus);
> pci_bus_add_devices(bridge->bus);
> - pci_unlock_rescan_remove();
> hbus->state = hv_pcibus_installed;
> return 0;
> }
> @@ -4003,6 +4007,9 @@ static struct hv_driver hv_pci_drv = {
> .remove = hv_pci_remove,
> .suspend = hv_pci_suspend,
> .resume = hv_pci_resume,
> + .driver = {
> + .probe_type = PROBE_PREFER_ASYNCHRONOUS,
> + },
> };
>
> static void __exit exit_hv_pci_drv(void)
> --
> 2.25.1
>

2023-05-10 17:28:06

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v3 6/6] PCI: hv: Use async probing to reduce boot time

> From: Lorenzo Pieralisi <[email protected]>
> Sent: Wednesday, May 10, 2023 1:23 AM
> To: Dexuan Cui <[email protected]>
> ...
> On Wed, Apr 19, 2023 at 07:40:37PM -0700, Dexuan Cui wrote:
> > Commit 414428c5da1c ("PCI: hv: Lock PCI bus on device eject") added
> > pci_lock_rescan_remove() and pci_unlock_rescan_remove() in
> > create_root_hv_pci_bus() and in hv_eject_device_work() to address the
> > race between create_root_hv_pci_bus() and hv_eject_device_work(), but it
> > turns that grabing the pci_rescan_remove_lock mutex is not enough:
> > refer to the earlier fix "PCI: hv: Add a per-bus mutex state_lock".
>
> This is meaningless for a commit log reader, there is nothing to
> refer to.
Correct. Because patch 5
[PATCH v3 5/6] PCI: hv: Add a per-bus mutex state_lock
has not been in any upstream tree, so I don't have a commit id yet.

> > Now with hbus->state_lock and other fixes, the race is resolved, so
>
> "other fixes" is meaningless too.
Ditto.

> Explain the problem and how you fix it (this patch should be split
> because the Subject does not represent what you are doing precisely,
> see below).
Ok, I will better explain the boot time issue.

> > remove pci_{lock,unlock}_rescan_remove() in create_root_hv_pci_bus():
> > this removes the serialization in hv_pci_probe() and hence allows
> > async-probing (PROBE_PREFER_ASYNCHRONOUS) to work.
> >
> > Add the async-probing flag to hv_pci_drv.
>
> Adding the asynchronous probing should be a separate patch and
> I don't think you should send it to stable kernels straight away
> because a) it is not a fix b) it can trigger further regressions.
Agreed. I'll remove the line "Cc: stable".

> > pci_{lock,unlock}_rescan_remove() in hv_eject_device_work() and in
> > hv_pci_remove() are still kept: according to the comment before
> > drivers/pci/probe.c: static DEFINE_MUTEX(pci_rescan_remove_lock),
> > "PCI device removal routines should always be executed under this mutex".
>
> This patch should be split, first thing is to fix and document what
> you are changing for pci_{lock,unlock}_rescan_remove() then add
> asynchronous probing.
>
> Lorenzo
Ok, I'll split this patch into two.

Thanks for reviewing the patch.
Can you please give an "Acked-by" or "Reviewed-by" to patch 1~5
if they look good to you? The first 5 patches have been there for a
while, and they already got Michael's Reviewed-by.

I hope the first 5 patches can go through the hyperv-fixes branch in
the hyperv tree
https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes
since they are specific to Hyper-V.

After the first 5 patches are in, I can refer to the commit IDs, and I
will split this patch (patch 6).

Thanks,
Dexuan


2023-05-17 00:25:31

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v3 6/6] PCI: hv: Use async probing to reduce boot time

> From: Dexuan Cui
> Sent: Wednesday, May 10, 2023 10:12 AM
> To: Lorenzo Pieralisi <[email protected]>
> > ...
> > This patch should be split, first thing is to fix and document what
> > you are changing for pci_{lock,unlock}_rescan_remove() then add
> > asynchronous probing.
> >
> > Lorenzo
> Ok, I'll split this patch into two.
>
> Thanks for reviewing the patch.
> Can you please give an "Acked-by" or "Reviewed-by" to patch 1~5
> if they look good to you? The first 5 patches have been there for a
> while, and they already got Michael's Reviewed-by.

Hi Lorenzo, Bjorn and all,
Ping -- it would be great to have your Acked-by or Reviewed-by for
patch 1 to 5.

> I hope the first 5 patches can go through the hyperv-fixes branch in
> the hyperv tree
> https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyp
> erv-fixes
> since they are specific to Hyper-V.
>
> After the first 5 patches are in, I can refer to the commit IDs, and I
> will split this patch (patch 6).
>
> Thanks,
> Dexuan


2023-05-23 19:38:01

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v3 6/6] PCI: hv: Use async probing to reduce boot time

> From: Dexuan Cui <[email protected]>
> Sent: Tuesday, May 16, 2023 5:03 PM
> ...
> > From: Dexuan Cui
> > Sent: Wednesday, May 10, 2023 10:12 AM
> > To: Lorenzo Pieralisi <[email protected]>
> > > ...
> > > This patch should be split, first thing is to fix and document what
> > > you are changing for pci_{lock,unlock}_rescan_remove() then add
> > > asynchronous probing.
> > >
> > > Lorenzo
> > Ok, I'll split this patch into two.
> >
> > Thanks for reviewing the patch.
> > Can you please give an "Acked-by" or "Reviewed-by" to patch 1~5
> > if they look good to you? The first 5 patches have been there for a
> > while, and they already got Michael's Reviewed-by.
>
> Hi Lorenzo, Bjorn and all,
> Ping -- it would be great to have your Acked-by or Reviewed-by for
> patch 1 to 5.

Gentle ping .

> > I hope the first 5 patches can go through the hyperv-fixes branch in
> > the hyperv tree
> >
> https://git.ke/
> rnel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Fhyperv%2Flinux.git%2F
> log%2F%3Fh%3Dhyp&data=05%7C01%7Cdecui%40microsoft.com%7C65c86f
> fe8d8542dbae0708db566a1607%7C72f988bf86f141af91ab2d7cd011db47%
> 7C1%7C0%7C638198785892993948%7CUnknown%7CTWFpbGZsb3d8eyJWIj
> oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3
> 000%7C%7C%7C&sdata=h9vxjnGo6oYzare%2FqqcXndg2NZZ0Ap%2BH33q0i
> Mtf7D4%3D&reserved=0
> > erv-fixes
> > since they are specific to Hyper-V.
> >
> > After the first 5 patches are in, I can refer to the commit IDs, and I
> > will split this patch (patch 6).
> >
> > Thanks,
> > Dexuan


2023-05-25 08:19:44

by Lorenzo Pieralisi

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] PCI: hv: Fix a race condition bug in hv_pci_query_relations()

On Wed, Apr 19, 2023 at 07:40:32PM -0700, Dexuan Cui wrote:
> Fix the longstanding race between hv_pci_query_relations() and
> survey_child_resources() by flushing the workqueue before we exit from
> hv_pci_query_relations().

"Fix the longstanding race" is vague. Please describe the race
succinctly at least to give an idea of what the problem is.

> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> Signed-off-by: Dexuan Cui <[email protected]>
> Reviewed-by: Michael Kelley <[email protected]>
> Cc: [email protected]
> ---
>
> v2:
> Removed the "debug code".
> No change to the patch body.
> Added Cc:stable
>
> v3:
> Added Michael's Reviewed-by.
>
> drivers/pci/controller/pci-hyperv.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index f33370b756283..b82c7cde19e66 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3308,6 +3308,19 @@ static int hv_pci_query_relations(struct hv_device *hdev)
> if (!ret)
> ret = wait_for_response(hdev, &comp);
>
> + /*
> + * In the case of fast device addition/removal, it's possible that
> + * vmbus_sendpacket() or wait_for_response() returns -ENODEV but we
> + * already got a PCI_BUS_RELATIONS* message from the host and the
> + * channel callback already scheduled a work to hbus->wq, which can be
> + * running survey_child_resources() -> complete(&hbus->survey_event),
> + * even after hv_pci_query_relations() exits and the stack variable
> + * 'comp' is no longer valid. This can cause a strange hang issue

"A strange hang" sounds like we don't understand what's happening, it
does not feel like it is a solid understanding of the issue.

I would remove it - given that you already explain that comp is no
longer valid - that is already a bug that needs fixing.

Acked-by: Lorenzo Pieralisi <[email protected]>

> + * or sometimes a page fault. Flush hbus->wq before we exit from
> + * hv_pci_query_relations() to avoid the issues.
> + */
> + flush_workqueue(hbus->wq);
> +
> return ret;
> }
>
> --
> 2.25.1
>

2023-05-25 08:22:36

by Lorenzo Pieralisi

[permalink] [raw]
Subject: Re: [PATCH v3 3/6] PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev

On Wed, Apr 19, 2023 at 07:40:34PM -0700, Dexuan Cui wrote:
> The hpdev->state is never really useful. The only use in
> hv_pci_eject_device() and hv_eject_device_work() is not really necessary.
>
> Signed-off-by: Dexuan Cui <[email protected]>
> Reviewed-by: Michael Kelley <[email protected]>
> Cc: [email protected]
> ---
>
> v2:
> No change to the patch body.
> Added Cc:stable
>
> v3:
> Added Michael's Reviewed-by.
>
> drivers/pci/controller/pci-hyperv.c | 12 ------------
> 1 file changed, 12 deletions(-)

Is this patch _required_ for subsequent fixes ? It is not a fix itself
so I am asking.

Acked-by: Lorenzo Pieralisi <[email protected]>

> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 1b11cf7391933..46df6d093d683 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -553,19 +553,10 @@ struct hv_dr_state {
> struct hv_pcidev_description func[];
> };
>
> -enum hv_pcichild_state {
> - hv_pcichild_init = 0,
> - hv_pcichild_requirements,
> - hv_pcichild_resourced,
> - hv_pcichild_ejecting,
> - hv_pcichild_maximum
> -};
> -
> struct hv_pci_dev {
> /* List protected by pci_rescan_remove_lock */
> struct list_head list_entry;
> refcount_t refs;
> - enum hv_pcichild_state state;
> struct pci_slot *pci_slot;
> struct hv_pcidev_description desc;
> bool reported_missing;
> @@ -2750,8 +2741,6 @@ static void hv_eject_device_work(struct work_struct *work)
> hpdev = container_of(work, struct hv_pci_dev, wrk);
> hbus = hpdev->hbus;
>
> - WARN_ON(hpdev->state != hv_pcichild_ejecting);
> -
> /*
> * Ejection can come before or after the PCI bus has been set up, so
> * attempt to find it and tear down the bus state, if it exists. This
> @@ -2808,7 +2797,6 @@ static void hv_pci_eject_device(struct hv_pci_dev *hpdev)
> return;
> }
>
> - hpdev->state = hv_pcichild_ejecting;
> get_pcichild(hpdev);
> INIT_WORK(&hpdev->wrk, hv_eject_device_work);
> queue_work(hbus->wq, &hpdev->wrk);
> --
> 2.25.1
>

2023-05-25 08:33:29

by Lorenzo Pieralisi

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"

On Wed, Apr 19, 2023 at 07:40:35PM -0700, Dexuan Cui wrote:
> This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195.
>
> The statement "the hv_pci_bus_exit() call releases structures of all its
> child devices" in commit d6af2ed29c7c is not true: in the path
> hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true): the
> parameter "keep_devs" is true, so hv_pci_bus_exit() does *not* release the
> child "struct hv_pci_dev *hpdev" that is created earlier in
> pci_devices_present_work() -> new_pcichild_device().
>
> The commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7,
> where the old version of hv_pci_bus_exit() was used; when the commit was
> rebased and merged into the upstream, people didn't notice that it's
> not really necessary. The commit itself doesn't cause any issue, but it
> makes hv_pci_probe() more complicated. Revert it to facilitate some
> upcoming changes to hv_pci_probe().

If d6af2ed29c7c does not cause any issue this is not a fix and should be
merged only with subsequent changes.

> Signed-off-by: Dexuan Cui <[email protected]>
> Reviewed-by: Michael Kelley <[email protected]>
> Acked-by: Wei Hu <[email protected]>
> Cc: [email protected]
> ---
>
> v2:
> No change to the patch body.
> Added Wei Hu's Acked-by.
> Added Cc:stable
>
> v3:
> Added Michael's Reviewed-by.
>
> drivers/pci/controller/pci-hyperv.c | 71 ++++++++++++++---------------
> 1 file changed, 34 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 46df6d093d683..48feab095a144 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3225,8 +3225,10 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
> struct pci_bus_d0_entry *d0_entry;
> struct hv_pci_compl comp_pkt;
> struct pci_packet *pkt;
> + bool retry = true;
> int ret;
>
> +enter_d0_retry:
> /*
> * Tell the host that the bus is ready to use, and moved into the
> * powered-on state. This includes telling the host which region
> @@ -3253,6 +3255,38 @@ static int hv_pci_enter_d0(struct hv_device *hdev)
> if (ret)
> goto exit;
>
> + /*
> + * In certain case (Kdump) the pci device of interest was
> + * not cleanly shut down and resource is still held on host
> + * side, the host could return invalid device status.
> + * We need to explicitly request host to release the resource
> + * and try to enter D0 again.
> + */
> + if (comp_pkt.completion_status < 0 && retry) {
> + retry = false;
> +
> + dev_err(&hdev->device, "Retrying D0 Entry\n");
> +
> + /*
> + * Hv_pci_bus_exit() calls hv_send_resource_released()
> + * to free up resources of its child devices.
> + * In the kdump kernel we need to set the
> + * wslot_res_allocated to 255 so it scans all child
> + * devices to release resources allocated in the
> + * normal kernel before panic happened.
> + */
> + hbus->wslot_res_allocated = 255;
> +
> + ret = hv_pci_bus_exit(hdev, true);
> +
> + if (ret == 0) {
> + kfree(pkt);
> + goto enter_d0_retry;
> + }
> + dev_err(&hdev->device,
> + "Retrying D0 failed with ret %d\n", ret);
> + }
> +
> if (comp_pkt.completion_status < 0) {
> dev_err(&hdev->device,
> "PCI Pass-through VSP failed D0 Entry with status %x\n",
> @@ -3493,7 +3527,6 @@ static int hv_pci_probe(struct hv_device *hdev,
> struct hv_pcibus_device *hbus;
> u16 dom_req, dom;
> char *name;
> - bool enter_d0_retry = true;
> int ret;
>
> /*
> @@ -3633,47 +3666,11 @@ static int hv_pci_probe(struct hv_device *hdev,
> if (ret)
> goto free_fwnode;
>
> -retry:
> ret = hv_pci_query_relations(hdev);
> if (ret)
> goto free_irq_domain;
>
> ret = hv_pci_enter_d0(hdev);
> - /*
> - * In certain case (Kdump) the pci device of interest was
> - * not cleanly shut down and resource is still held on host
> - * side, the host could return invalid device status.
> - * We need to explicitly request host to release the resource
> - * and try to enter D0 again.
> - * Since the hv_pci_bus_exit() call releases structures
> - * of all its child devices, we need to start the retry from
> - * hv_pci_query_relations() call, requesting host to send
> - * the synchronous child device relations message before this
> - * information is needed in hv_send_resources_allocated()
> - * call later.
> - */
> - if (ret == -EPROTO && enter_d0_retry) {
> - enter_d0_retry = false;
> -
> - dev_err(&hdev->device, "Retrying D0 Entry\n");
> -
> - /*
> - * Hv_pci_bus_exit() calls hv_send_resources_released()
> - * to free up resources of its child devices.
> - * In the kdump kernel we need to set the
> - * wslot_res_allocated to 255 so it scans all child
> - * devices to release resources allocated in the
> - * normal kernel before panic happened.
> - */
> - hbus->wslot_res_allocated = 255;
> - ret = hv_pci_bus_exit(hdev, true);
> -
> - if (ret == 0)
> - goto retry;
> -
> - dev_err(&hdev->device,
> - "Retrying D0 failed with ret %d\n", ret);
> - }
> if (ret)
> goto free_irq_domain;
>
> --
> 2.25.1
>

2023-05-25 08:51:06

by Lorenzo Pieralisi

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] PCI: hv: Add a per-bus mutex state_lock

On Wed, Apr 19, 2023 at 07:40:36PM -0700, Dexuan Cui wrote:
> In the case of fast device addition/removal, it's possible that
> hv_eject_device_work() can start to run before create_root_hv_pci_bus()
> starts to run; as a result, the pci_get_domain_bus_and_slot() in
> hv_eject_device_work() can return a 'pdev' of NULL, and
> hv_eject_device_work() can remove the 'hpdev', and immediately send a
> message PCI_EJECTION_COMPLETE to the host, and the host immediately
> unassigns the PCI device from the guest; meanwhile,
> create_root_hv_pci_bus() and the PCI device driver can be probing the
> dead PCI device and reporting timeout errors.
>
> Fix the issue by adding a per-bus mutex 'state_lock' and grabbing the
> mutex before powering on the PCI bus in hv_pci_enter_d0(): when
> hv_eject_device_work() starts to run, it's able to find the 'pdev' and call
> pci_stop_and_remove_bus_device(pdev): if the PCI device driver has
> loaded, the PCI device driver's probe() function is already called in
> create_root_hv_pci_bus() -> pci_bus_add_devices(), and now
> hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able
> to call the PCI device driver's remove() function and remove the device
> reliably; if the PCI device driver hasn't loaded yet, the function call
> hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able to
> remove the PCI device reliably and the PCI device driver's probe()
> function won't be called; if the PCI device driver's probe() is already
> running (e.g., systemd-udev is loading the PCI device driver), it must
> be holding the per-device lock, and after the probe() finishes and releases
> the lock, hv_eject_device_work() -> pci_stop_and_remove_bus_device() is
> able to proceed to remove the device reliably.
>
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> Signed-off-by: Dexuan Cui <[email protected]>
> Reviewed-by: Michael Kelley <[email protected]>
> Cc: [email protected]
> ---
>
> v2:
> Removed the "debug code".
> Fixed the "goto out" in hv_pci_resume() [Michael Kelley]
> Added Cc:stable
>
> v3:
> Added Michael's Reviewed-by.
>
> drivers/pci/controller/pci-hyperv.c | 29 ++++++++++++++++++++++++++---
> 1 file changed, 26 insertions(+), 3 deletions(-)

Acked-by: Lorenzo Pieralisi <[email protected]>

> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 48feab095a144..3ae2f99dea8c2 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -489,7 +489,10 @@ struct hv_pcibus_device {
> struct fwnode_handle *fwnode;
> /* Protocol version negotiated with the host */
> enum pci_protocol_version_t protocol_version;
> +
> + struct mutex state_lock;
> enum hv_pcibus_state state;
> +
> struct hv_device *hdev;
> resource_size_t low_mmio_space;
> resource_size_t high_mmio_space;
> @@ -2512,6 +2515,8 @@ static void pci_devices_present_work(struct work_struct *work)
> if (!dr)
> return;
>
> + mutex_lock(&hbus->state_lock);
> +
> /* First, mark all existing children as reported missing. */
> spin_lock_irqsave(&hbus->device_list_lock, flags);
> list_for_each_entry(hpdev, &hbus->children, list_entry) {
> @@ -2593,6 +2598,8 @@ static void pci_devices_present_work(struct work_struct *work)
> break;
> }
>
> + mutex_unlock(&hbus->state_lock);
> +
> kfree(dr);
> }
>
> @@ -2741,6 +2748,8 @@ static void hv_eject_device_work(struct work_struct *work)
> hpdev = container_of(work, struct hv_pci_dev, wrk);
> hbus = hpdev->hbus;
>
> + mutex_lock(&hbus->state_lock);
> +
> /*
> * Ejection can come before or after the PCI bus has been set up, so
> * attempt to find it and tear down the bus state, if it exists. This
> @@ -2777,6 +2786,8 @@ static void hv_eject_device_work(struct work_struct *work)
> put_pcichild(hpdev);
> put_pcichild(hpdev);
> /* hpdev has been freed. Do not use it any more. */
> +
> + mutex_unlock(&hbus->state_lock);
> }
>
> /**
> @@ -3562,6 +3573,7 @@ static int hv_pci_probe(struct hv_device *hdev,
> return -ENOMEM;
>
> hbus->bridge = bridge;
> + mutex_init(&hbus->state_lock);
> hbus->state = hv_pcibus_init;
> hbus->wslot_res_allocated = -1;
>
> @@ -3670,9 +3682,11 @@ static int hv_pci_probe(struct hv_device *hdev,
> if (ret)
> goto free_irq_domain;
>
> + mutex_lock(&hbus->state_lock);
> +
> ret = hv_pci_enter_d0(hdev);
> if (ret)
> - goto free_irq_domain;
> + goto release_state_lock;
>
> ret = hv_pci_allocate_bridge_windows(hbus);
> if (ret)
> @@ -3690,12 +3704,15 @@ static int hv_pci_probe(struct hv_device *hdev,
> if (ret)
> goto free_windows;
>
> + mutex_unlock(&hbus->state_lock);
> return 0;
>
> free_windows:
> hv_pci_free_bridge_windows(hbus);
> exit_d0:
> (void) hv_pci_bus_exit(hdev, true);
> +release_state_lock:
> + mutex_unlock(&hbus->state_lock);
> free_irq_domain:
> irq_domain_remove(hbus->irq_domain);
> free_fwnode:
> @@ -3945,20 +3962,26 @@ static int hv_pci_resume(struct hv_device *hdev)
> if (ret)
> goto out;
>
> + mutex_lock(&hbus->state_lock);
> +
> ret = hv_pci_enter_d0(hdev);
> if (ret)
> - goto out;
> + goto release_state_lock;
>
> ret = hv_send_resources_allocated(hdev);
> if (ret)
> - goto out;
> + goto release_state_lock;
>
> prepopulate_bars(hbus);
>
> hv_pci_restore_msi_state(hbus);
>
> hbus->state = hv_pcibus_installed;
> + mutex_unlock(&hbus->state_lock);
> return 0;
> +
> +release_state_lock:
> + mutex_unlock(&hbus->state_lock);
> out:
> vmbus_close(hdev->channel);
> return ret;
> --
> 2.25.1
>

2023-05-25 10:17:17

by Lorenzo Pieralisi

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic

On Wed, Apr 19, 2023 at 07:40:33PM -0700, Dexuan Cui wrote:
> When the host tries to remove a PCI device, the host first sends a
> PCI_EJECT message to the guest, and the guest is supposed to gracefully
> remove the PCI device and send a PCI_EJECTION_COMPLETE message to the host;
> the host then sends a VMBus message CHANNELMSG_RESCIND_CHANNELOFFER to
> the guest (when the guest receives this message, the device is already
> unassigned from the guest) and the guest can do some final cleanup work;
> if the guest fails to respond to the PCI_EJECT message within one minute,
> the host sends the VMBus message CHANNELMSG_RESCIND_CHANNELOFFER and
> removes the PCI device forcibly.
>
> In the case of fast device addition/removal, it's possible that the PCI
> device driver is still configuring MSI-X interrupts when the guest receives
> the PCI_EJECT message; the channel callback calls hv_pci_eject_device(),
> which sets hpdev->state to hv_pcichild_ejecting, and schedules a work
> hv_eject_device_work(); if the PCI device driver is calling
> pci_alloc_irq_vectors() -> ... -> hv_compose_msi_msg(), we can break the
> while loop in hv_compose_msi_msg() due to the updated hpdev->state, and
> leave data->chip_data with its default value of NULL; later, when the PCI
> device driver calls request_irq() -> ... -> hv_irq_unmask(), the guest
> crashes in hv_arch_irq_unmask() due to data->chip_data being NULL.
>
> Fix the issue by not testing hpdev->state in the while loop: when the
> guest receives PCI_EJECT, the device is still assigned to the guest, and
> the guest has one minute to finish the device removal gracefully. We don't
> really need to (and we should not) test hpdev->state in the loop.
>
> Fixes: de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")
> Signed-off-by: Dexuan Cui <[email protected]>
> Reviewed-by: Michael Kelley <[email protected]>
> Cc: [email protected]
> ---
>
> v2:
> Removed the "debug code".
> No change to the patch body.
> Added Cc:stable
>
> v3:
> Added Michael's Reviewed-by.
>
> drivers/pci/controller/pci-hyperv.c | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index b82c7cde19e66..1b11cf7391933 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -643,6 +643,11 @@ static void hv_arch_irq_unmask(struct irq_data *data)
> pbus = pdev->bus;
> hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> int_desc = data->chip_data;
> + if (!int_desc) {
> + dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
> + __func__, data->irq);
> + return;
> + }

That's a check that should be there regardless ?

> spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
>
> @@ -1911,12 +1916,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> hv_pci_onchannelcallback(hbus);
> spin_unlock_irqrestore(&channel->sched_lock, flags);
>
> - if (hpdev->state == hv_pcichild_ejecting) {
> - dev_err_once(&hbus->hdev->device,
> - "the device is being ejected\n");
> - goto enable_tasklet;
> - }
> -
> udelay(100);
> }

I don't understand why this code is in hv_compose_msi_msg() in the first
place (and why only in that function ?) to me this looks like you are
adding plasters in the code that can turn out to be problematic while
ejecting a device, this does not seem robust at all - that's my opinion.

Feel free to merge this code, I can't ACK it, sorry.

Lorenzo

2023-06-15 04:23:52

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v3 1/6] PCI: hv: Fix a race condition bug in hv_pci_query_relations()

> From: Lorenzo Pieralisi <[email protected]>
> Sent: Thursday, May 25, 2023 1:14 AM
> To: Dexuan Cui <[email protected]>
>
> On Wed, Apr 19, 2023 at 07:40:32PM -0700, Dexuan Cui wrote:
> > Fix the longstanding race between hv_pci_query_relations() and
> > survey_child_resources() by flushing the workqueue before we exit from
> > hv_pci_query_relations().
>
> "Fix the longstanding race" is vague. Please describe the race
> succinctly at least to give an idea of what the problem is.
Hi Lozenzo, I'm sorry for the late response -- finally I'm back on the patchset...

I'll use the below commit message in v4:

Since day 1 of the driver, there has been a race between
hv_pci_query_relations() and survey_child_resources(): during fast
device hotplug, hv_pci_query_relations() may error out due to
device-remove and the stack variable 'comp' is no longer valid;
however, pci_devices_present_work() -> survey_child_resources() ->
complete() may be running on another CPU and accessing the no-longer-valid
'comp'. Fix the race by flushing the workqueue before we exit from
hv_pci_query_relations().

I'll also update the comment to share more details of the race:

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3401,6 +3401,24 @@ static int hv_pci_query_relations(struct hv_device *hdev)
if (!ret)
ret = wait_for_response(hdev, &comp);

+ /*
+ * In the case of fast device addition/removal, it's possible that
+ * vmbus_sendpacket() or wait_for_response() returns -ENODEV but we
+ * already got a PCI_BUS_RELATIONS* message from the host and the
+ * channel callback already scheduled a work to hbus->wq, which can be
+ * running pci_devices_present_work() -> survey_child_resources() ->
+ * complete(&hbus->survey_event), even after hv_pci_query_relations()
+ * exits and the stack variable 'comp' is no longer valid; as a result,
+ * a hang or a page fault may happen when the complete() calls
+ * raw_spin_lock_irqsave(). Flush hbus->wq before we exit from
+ * hv_pci_query_relations() to avoid the issues. Note: if 'ret' is
+ * -ENODEV, there can't be any more work item scheduled to hbus->wq
+ * after the flush_workqueue(): see vmbus_onoffer_rescind() ->
+ * vmbus_reset_channel_cb(), vmbus_rescind_cleanup() ->
+ * channel->rescind = true.
+ */
+ flush_workqueue(hbus->wq);
+
return ret;
}

> > + * In the case of fast device addition/removal, it's possible that
> > + * vmbus_sendpacket() or wait_for_response() returns -ENODEV but we
> > + * already got a PCI_BUS_RELATIONS* message from the host and the
> > + * channel callback already scheduled a work to hbus->wq, which can
> > be
> > + * running survey_child_resources() -> complete(&hbus->survey_event),
> > + * even after hv_pci_query_relations() exits and the stack variable
> > + * 'comp' is no longer valid. This can cause a strange hang issue
>
> "A strange hang" sounds like we don't understand what's happening, it
> does not feel like it is a solid understanding of the issue.
>
> I would remove it - given that you already explain that comp is no
> longer valid - that is already a bug that needs fixing.
I share more details in the comment (see the above).

> Acked-by: Lorenzo Pieralisi <[email protected]>
Thanks!

2023-06-15 04:46:31

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v3 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic

> From: Lorenzo Pieralisi <[email protected]>
> Sent: Thursday, May 25, 2023 3:16 AM
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -643,6 +643,11 @@ static void hv_arch_irq_unmask(struct irq_data
> > *data)
> > pbus = pdev->bus;
> > hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> > int_desc = data->chip_data;
> > + if (!int_desc) {
> > + dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
> > + __func__, data->irq);
> > + return;
> > + }
>
> That's a check that should be there regardless ?
Yes. Normally data->chip_data is set at the end of hv_compose_msi_msg(),
and it should not be NULL. However, in rare circumstances, we might see a
failure in hv_compose_msi_msg(), e.g. the hypervisor/host might return an
error in comp.comp_pkt.completion_status (at least this is possible in theory).

In case we see a failure in hv_compose_msi_msg(), data->chip_data stays
with its default value of NULL; because the return type of
hv_compose_msi_msg() is "void", we can not return an error to the upper
layer; later, when the upper layer calls request_irq() -> ... -> hv_irq_unmask(),
hv_arch_irq_unmask() crashes because data->chip_data is NULL -- with this
check, we're able to error out gracefully, and the user can better understand
what goes wrong.

> > spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
> >
> > @@ -1911,12 +1916,6 @@ static void hv_compose_msi_msg(struct
> irq_data *data, struct msi_msg *msg)
> > hv_pci_onchannelcallback(hbus);
> > spin_unlock_irqrestore(&channel->sched_lock, flags);
> >
> > - if (hpdev->state == hv_pcichild_ejecting) {
> > - dev_err_once(&hbus->hdev->device,
> > - "the device is being ejected\n");
> > - goto enable_tasklet;
> > - }
> > -
> > udelay(100);
> > }
>
> I don't understand why this code is in hv_compose_msi_msg() in the first
> place (and why only in that function ?) to me this looks like you are
> adding plasters in the code that can turn out to be problematic while
> ejecting a device, this does not seem robust at all - that's my opinion.

The code was incorrectly added by
de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")

de0aa7b2f97d says

"
2. If the host is ejecting the VF device before we reach
hv_compose_msi_msg(), in a UP VM, we can hang in hv_compose_msi_msg()
forever, because at this time the host doesn't respond to the
CREATE_INTERRUPT request.
"

de0aa7b2f97d implies that the host doesn't respond to the guest's
CREATE_INTERRUPT request once the guest receives the PCI_EJECT message -- this
is incorrect: after the guest receives the PCI_EJECT message, actually the host
still responds to the guest's request, as long as the guest sends the request within
1 minute AND the guest doesn't send a PCI_EJECTION_COMPLETE message to
the host in hv_eject_device_work(). The real issue is that currently the guest
can send PCI_EJECTION_COMPLETE to the host before the guest finishes the
device-probing/removing handling -- once the guest sends PCI_EJECTION_COMPLETE,
the host unassigns the PCI device from the guest and ignores any request
from the guest.

So here the check "hpdev->state == hv_pcichild_ejecting" is incorrect. We
should remove the check since it can cause a panic (see the commit messsage
for the detailed explanation)

The "premature PCI_EJECTION_COMPLETE" issue is resolved by:
[PATCH v3 5/6] PCI: hv: Add a per-bus mutex state_lock

> Feel free to merge this code, I can't ACK it, sorry.
>
> Lorenzo
Thanks for sharing the thougths!

2023-06-15 05:10:26

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v3 4/6] Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"

> From: Lorenzo Pieralisi <[email protected]>
> Sent: Thursday, May 25, 2023 1:23 AM
> ...
> On Wed, Apr 19, 2023 at 07:40:35PM -0700, Dexuan Cui wrote:
> > This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195.
> >
> > The statement "the hv_pci_bus_exit() call releases structures of all its
> > child devices" in commit d6af2ed29c7c is not true: in the path
> > hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true): the
> > parameter "keep_devs" is true, so hv_pci_bus_exit() does *not* release the
> > child "struct hv_pci_dev *hpdev" that is created earlier in
> > pci_devices_present_work() -> new_pcichild_device().
> >
> > The commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7,
> > where the old version of hv_pci_bus_exit() was used; when the commit
> > was
> > rebased and merged into the upstream, people didn't notice that it's
> > not really necessary. The commit itself doesn't cause any issue, but it
> > makes hv_pci_probe() more complicated. Revert it to facilitate some
> > upcoming changes to hv_pci_probe().
>
> If d6af2ed29c7c does not cause any issue this is not a fix and should be
> merged only with subsequent changes.

d6af2ed29c7c does not cause any functional issue, but it makes the code
less readable, and so I'd like to not merge this patch with patch 5 -- this way
people can easily know what the real change is in patch 5.

2023-06-15 05:10:47

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v3 3/6] PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev

> From: Lorenzo Pieralisi <[email protected]>
> Sent: Thursday, May 25, 2023 1:16 AM
> To: Dexuan Cui <[email protected]>
>
> Is this patch _required_ for subsequent fixes ?
It's not required. IMO it's good to have this patch with patch 2 since
patch 2 also touches hv_pcichild_ejecting. I'm OK if people think
it's better to merge this patch through the "next" branch rather than
the "fixes" branch.

> It is not a fix itself so I am asking.

> Acked-by: Lorenzo Pieralisi <[email protected]>
Thanks!