2018-03-03 03:54:03

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH 1/3] PCI: hv: fix a comment typo in _hv_pcifront_read_config()

No functional change.

Signed-off-by: Dexuan Cui <[email protected]>
Fixes: bdd74440d9e8 ("PCI: hv: Add explicit barriers to config space access")
Cc: Vitaly Kuznetsov <[email protected]>
Cc: [email protected]
Cc: Stephen Hemminger <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
---
drivers/pci/host/pci-hyperv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index 2faf38eab785..1233300f41c6 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -653,7 +653,7 @@ static void _hv_pcifront_read_config(struct hv_pci_dev *hpdev, int where,
break;
}
/*
- * Make sure the write was done before we release the spinlock
+ * Make sure the read was done before we release the spinlock
* allowing consecutive reads/writes.
*/
mb();
--
2.7.4


2018-03-03 03:54:36

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH 3/3] PCI: hv: fix 2 hang issues in hv_compose_msi_msg()

1. With the patch "x86/vector/msi: Switch to global reservation mode"
(4900be8360), the recent v4.15 and newer kernels always hang for 1-vCPU
Hyper-V VM with SR-IOV. This is because when we reach hv_compose_msi_msg()
by request_irq() -> request_threaded_irq() -> __setup_irq()->irq_startup()
-> __irq_startup() -> irq_domain_activate_irq() -> ... ->
msi_domain_activate() -> ... -> hv_compose_msi_msg(), local irq is
disabled in __setup_irq().

Fix this by polling the channel.

2. If the host is ejecting the VF device before we reach
hv_compose_msi_msg(), in a UP VM, we can hang in hv_compose_msi_msg()
forever, because at this time the host doesn't respond to the
CREATE_INTERRUPT request. This issue also happens to old kernels like
v4.14, v4.13, etc.

Fix this by polling the channel for the PCI_EJECT message and
hpdev->state, and by checking the PCI vendor ID.

Note: actually the above issues also happen to a SMP VM, if
"hbus->hdev->channel->target_cpu == smp_processor_id()" is true.

Signed-off-by: Dexuan Cui <[email protected]>
Tested-by: Adrian Suhov <[email protected]>
Tested-by: Chris Valean <[email protected]>
Cc: [email protected]
Cc: Stephen Hemminger <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Jack Morgenstein <[email protected]>
---
drivers/pci/host/pci-hyperv.c | 58 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index 57b1fb3ebdb9..32d6b03cdd40 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -522,6 +522,8 @@ struct hv_pci_compl {
s32 completion_status;
};

+static void hv_pci_onchannelcallback(void *context);
+
/**
* hv_pci_generic_compl() - Invoked for a completion packet
* @context: Set up by the sender of the packet.
@@ -666,6 +668,31 @@ static void _hv_pcifront_read_config(struct hv_pci_dev *hpdev, int where,
}
}

+static u16 hv_pcifront_get_vendor_id(struct hv_pci_dev *hpdev)
+{
+ u16 ret;
+ unsigned long flags;
+ void __iomem *addr = hpdev->hbus->cfg_addr + CFG_PAGE_OFFSET +
+ PCI_VENDOR_ID;
+
+ spin_lock_irqsave(&hpdev->hbus->config_lock, flags);
+
+ /* Choose the function to be read. (See comment above) */
+ writel(hpdev->desc.win_slot.slot, hpdev->hbus->cfg_addr);
+ /* Make sure the function was chosen before we start reading. */
+ mb();
+ /* Read from that function's config space. */
+ ret = readw(addr);
+ /*
+ * mb() is not required here, because the spin_unlock_irqrestore()
+ * is a barrier.
+ */
+
+ spin_unlock_irqrestore(&hpdev->hbus->config_lock, flags);
+
+ return ret;
+}
+
/**
* _hv_pcifront_write_config() - Internal PCI config write
* @hpdev: The PCI driver's representation of the device
@@ -1108,8 +1135,37 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
* Since this function is called with IRQ locks held, can't
* do normal wait for completion; instead poll.
*/
- while (!try_wait_for_completion(&comp.comp_pkt.host_event))
+ while (!try_wait_for_completion(&comp.comp_pkt.host_event)) {
+ /* 0xFFFF means an invalid PCI VENDOR ID. */
+ if (hv_pcifront_get_vendor_id(hpdev) == 0xFFFF) {
+ dev_err_once(&hbus->hdev->device,
+ "the device has gone\n");
+ goto free_int_desc;
+ }
+
+ /*
+ * When the higher level interrupt code calls us with
+ * interrupt disabled, we must poll the channel by calling
+ * the channel callback directly when channel->target_cpu is
+ * the current CPU. When the higher level interrupt code
+ * calls us with interrupt enabled, let's add the
+ * local_bh_disable()/enable() to avoid race.
+ */
+ local_bh_disable();
+
+ if (hbus->hdev->channel->target_cpu == smp_processor_id())
+ hv_pci_onchannelcallback(hbus);
+
+ local_bh_enable();
+
+ if (hpdev->state == hv_pcichild_ejecting) {
+ dev_err_once(&hbus->hdev->device,
+ "the device is being ejected\n");
+ goto free_int_desc;
+ }
+
udelay(100);
+ }

if (comp.comp_pkt.completion_status < 0) {
dev_err(&hbus->hdev->device,
--
2.7.4

2018-03-03 03:54:45

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH 2/3] PCI: hv: serialize the present/eject work items

When we hot-remove the device, we first receive a PCI_EJECT message and
then receive a PCI_BUS_RELATIONS message with bus_rel->device_count == 0.

The first message is offloaded to hv_eject_device_work(), and the second
is offloaded to pci_devices_present_work(). Both the paths can be running
list_del(&hpdev->list_entry), causing general protection fault, because
system_wq can run them concurrently.

The patch eliminates the race condition.

Signed-off-by: Dexuan Cui <[email protected]>
Tested-by: Adrian Suhov <[email protected]>
Tested-by: Chris Valean <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Jack Morgenstein <[email protected]>
Cc: [email protected]
Cc: Stephen Hemminger <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
---
drivers/pci/host/pci-hyperv.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index 1233300f41c6..57b1fb3ebdb9 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -461,6 +461,8 @@ struct hv_pcibus_device {
struct retarget_msi_interrupt retarget_msi_interrupt_params;

spinlock_t retarget_msi_interrupt_lock;
+
+ struct workqueue_struct *wq;
};

/*
@@ -1770,7 +1772,7 @@ static void hv_pci_devices_present(struct hv_pcibus_device *hbus,
spin_unlock_irqrestore(&hbus->device_list_lock, flags);

get_hvpcibus(hbus);
- schedule_work(&dr_wrk->wrk);
+ queue_work(hbus->wq, &dr_wrk->wrk);
}

/**
@@ -1848,7 +1850,7 @@ static void hv_pci_eject_device(struct hv_pci_dev *hpdev)
get_pcichild(hpdev, hv_pcidev_ref_pnp);
INIT_WORK(&hpdev->wrk, hv_eject_device_work);
get_hvpcibus(hpdev->hbus);
- schedule_work(&hpdev->wrk);
+ queue_work(hpdev->hbus->wq, &hpdev->wrk);
}

/**
@@ -2463,11 +2465,17 @@ static int hv_pci_probe(struct hv_device *hdev,
spin_lock_init(&hbus->retarget_msi_interrupt_lock);
sema_init(&hbus->enum_sem, 1);
init_completion(&hbus->remove_event);
+ hbus->wq = alloc_ordered_workqueue("hv_pci_%x", 0,
+ hbus->sysdata.domain);
+ if (!hbus->wq) {
+ ret = -ENOMEM;
+ goto free_bus;
+ }

ret = vmbus_open(hdev->channel, pci_ring_size, pci_ring_size, NULL, 0,
hv_pci_onchannelcallback, hbus);
if (ret)
- goto free_bus;
+ goto destroy_wq;

hv_set_drvdata(hdev, hbus);

@@ -2536,6 +2544,9 @@ static int hv_pci_probe(struct hv_device *hdev,
hv_free_config_window(hbus);
close:
vmbus_close(hdev->channel);
+destroy_wq:
+ drain_workqueue(hbus->wq);
+ destroy_workqueue(hbus->wq);
free_bus:
free_page((unsigned long)hbus);
return ret;
@@ -2615,6 +2626,8 @@ static int hv_pci_remove(struct hv_device *hdev)
irq_domain_free_fwnode(hbus->sysdata.fwnode);
put_hvpcibus(hbus);
wait_for_completion(&hbus->remove_event);
+ drain_workqueue(hbus->wq);
+ destroy_workqueue(hbus->wq);
free_page((unsigned long)hbus);
return 0;
}
--
2.7.4

2018-03-03 16:11:03

by Michael Kelley (EOSG)

[permalink] [raw]
Subject: RE: [PATCH 2/3] PCI: hv: serialize the present/eject work items

> -----Original Message-----
> From: [email protected] <[email protected]> On Behalf
> Of Dexuan Cui
> Sent: Friday, March 2, 2018 4:21 PM
> To: [email protected]; [email protected]; KY Srinivasan <[email protected]>;
> Stephen Hemminger <[email protected]>
> Cc: [email protected]; [email protected]; Haiyang Zhang
> <[email protected]>; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; Dexuan Cui <[email protected]>;
> Jack Morgenstein <[email protected]>; [email protected]
> Subject: [PATCH 2/3] PCI: hv: serialize the present/eject work items
>
> When we hot-remove the device, we first receive a PCI_EJECT message and
> then receive a PCI_BUS_RELATIONS message with bus_rel->device_count == 0.
>
> The first message is offloaded to hv_eject_device_work(), and the second
> is offloaded to pci_devices_present_work(). Both the paths can be running
> list_del(&hpdev->list_entry), causing general protection fault, because
> system_wq can run them concurrently.
>
> The patch eliminates the race condition.

With this patch, the enum_sem field in struct hv_pcibus_device
is no longer needed. The semaphore serializes execution in
hv_pci_devices_present_work(), and that serialization is now done
with the ordered workqueue. Also, the last paragraph of the top level
comment for hv_pci_devices_present_work() should be updated to
reflect the new ordering assumptions.

Separately, an unrelated bug: At the top of hv_eject_device_work(),
the first test may do a put_pcichild() and return. This exit path also
needs to do put_hvpcibus() to balance the ref counts, or do a goto
the last two lines at the bottom of the function.

>
> Signed-off-by: Dexuan Cui <[email protected]>
> Tested-by: Adrian Suhov <[email protected]>
> Tested-by: Chris Valean <[email protected]>
> Cc: Vitaly Kuznetsov <[email protected]>
> Cc: Jack Morgenstein <[email protected]>
> Cc: [email protected]
> Cc: Stephen Hemminger <[email protected]>
> Cc: K. Y. Srinivasan <[email protected]>
> ---
> drivers/pci/host/pci-hyperv.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
> index 1233300f41c6..57b1fb3ebdb9 100644
> --- a/drivers/pci/host/pci-hyperv.c
> +++ b/drivers/pci/host/pci-hyperv.c
> @@ -461,6 +461,8 @@ struct hv_pcibus_device {
> struct retarget_msi_interrupt retarget_msi_interrupt_params;
>
> spinlock_t retarget_msi_interrupt_lock;
> +
> + struct workqueue_struct *wq;
> };
>
> /*
> @@ -1770,7 +1772,7 @@ static void hv_pci_devices_present(struct hv_pcibus_device *hbus,
> spin_unlock_irqrestore(&hbus->device_list_lock, flags);
>
> get_hvpcibus(hbus);
> - schedule_work(&dr_wrk->wrk);
> + queue_work(hbus->wq, &dr_wrk->wrk);

This invocation of get_hvpcibus() and queue_work() could be made
conditional on whether the preceding list_add_tail() transitioned
the list from empty to non-empty. If the list was already non-empty,
a previously queued invocation of hv_pci_devices_present_work()
will find the new entry and process it. But this is an
optimization in a non-perf sensitive code path, so may not be
worth it.

> }
>
> /**
> @@ -1848,7 +1850,7 @@ static void hv_pci_eject_device(struct hv_pci_dev *hpdev)
> get_pcichild(hpdev, hv_pcidev_ref_pnp);
> INIT_WORK(&hpdev->wrk, hv_eject_device_work);
> get_hvpcibus(hpdev->hbus);
> - schedule_work(&hpdev->wrk);
> + queue_work(hpdev->hbus->wq, &hpdev->wrk);
> }
>
> /**
> @@ -2463,11 +2465,17 @@ static int hv_pci_probe(struct hv_device *hdev,
> spin_lock_init(&hbus->retarget_msi_interrupt_lock);
> sema_init(&hbus->enum_sem, 1);
> init_completion(&hbus->remove_event);
> + hbus->wq = alloc_ordered_workqueue("hv_pci_%x", 0,
> + hbus->sysdata.domain);
> + if (!hbus->wq) {
> + ret = -ENOMEM;
> + goto free_bus;
> + }
>
> ret = vmbus_open(hdev->channel, pci_ring_size, pci_ring_size, NULL, 0,
> hv_pci_onchannelcallback, hbus);
> if (ret)
> - goto free_bus;
> + goto destroy_wq;
>
> hv_set_drvdata(hdev, hbus);
>
> @@ -2536,6 +2544,9 @@ static int hv_pci_probe(struct hv_device *hdev,
> hv_free_config_window(hbus);
> close:
> vmbus_close(hdev->channel);
> +destroy_wq:
> + drain_workqueue(hbus->wq);

The drain_workqueue() call isn't necessary. destroy_workqueue() calls
drain_workqueue() and there better not be anything in the workqueue
anyway since all the ref counts are zero.

> + destroy_workqueue(hbus->wq);
> free_bus:
> free_page((unsigned long)hbus);
> return ret;
> @@ -2615,6 +2626,8 @@ static int hv_pci_remove(struct hv_device *hdev)
> irq_domain_free_fwnode(hbus->sysdata.fwnode);
> put_hvpcibus(hbus);
> wait_for_completion(&hbus->remove_event);
> + drain_workqueue(hbus->wq);

Same here -- drain_workqueue() isn't needed. The workqueue
must be empty anyway since the remove_event has completed
and the ref counts will all be zero.

> + destroy_workqueue(hbus->wq);
> free_page((unsigned long)hbus);
> return 0;
> }
> --
> 2.7.4

2018-03-05 05:14:46

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH 2/3] PCI: hv: serialize the present/eject work items

> From: Michael Kelley (EOSG)
> Sent: Saturday, March 3, 2018 08:10
> > From: [email protected] <linux-kernel-
> [email protected]> On Behalf Of Dexuan Cui
> > Sent: Friday, March 2, 2018 4:21 PM
> > When we hot-remove the device, we first receive a PCI_EJECT message and
> > then receive a PCI_BUS_RELATIONS message with bus_rel->device_count ==
> 0.
> >
> > The first message is offloaded to hv_eject_device_work(), and the second
> > is offloaded to pci_devices_present_work(). Both the paths can be running
> > list_del(&hpdev->list_entry), causing general protection fault, because
> > system_wq can run them concurrently.
> >
> > The patch eliminates the race condition.
>
> With this patch, the enum_sem field in struct hv_pcibus_device
> is no longer needed. The semaphore serializes execution in
> hv_pci_devices_present_work(), and that serialization is now done
> with the ordered workqueue. Also, the last paragraph of the top level
> comment for hv_pci_devices_present_work() should be updated to
> reflect the new ordering assumptions.

Thanks! I'll make an extra patch for this.

> Separately, an unrelated bug: At the top of hv_eject_device_work(),
> the first test may do a put_pcichild() and return. This exit path also
> needs to do put_hvpcibus() to balance the ref counts, or do a goto
> the last two lines at the bottom of the function.

When we're in hv_eject_device_work(), IMO hpdev->state must be
hv_pcichild_ejecting, so I'm going to make a patch like this:

@@ -1867,10 +1867,7 @@ static void hv_eject_device_work(struct work_struct *work)

hpdev = container_of(work, struct hv_pci_dev, wrk);

- if (hpdev->state != hv_pcichild_ejecting) {
- put_pcichild(hpdev, hv_pcidev_ref_pnp);
- return;
- }
+ WARN_ON(hpdev->state != hv_pcichild_ejecting);

/*
* Ejection can come before or after the PCI bus has been set up, so

> > @@ -1770,7 +1772,7 @@ static void hv_pci_devices_present(struct
> hv_pcibus_device *hbus,
> > spin_unlock_irqrestore(&hbus->device_list_lock, flags);
> >
> > get_hvpcibus(hbus);
> > - schedule_work(&dr_wrk->wrk);
> > + queue_work(hbus->wq, &dr_wrk->wrk);
>
> This invocation of get_hvpcibus() and queue_work() could be made
> conditional on whether the preceding list_add_tail() transitioned
> the list from empty to non-empty. If the list was already non-empty,
> a previously queued invocation of hv_pci_devices_present_work()
> will find the new entry and process it. But this is an
> optimization in a non-perf sensitive code path, so may not be
> worth it.

Exactly. I'll add the the optimization.

> > @@ -1848,7 +1850,7 @@ static void hv_pci_eject_device(struct hv_pci_dev
> *hpdev)
> > get_pcichild(hpdev, hv_pcidev_ref_pnp);
> > INIT_WORK(&hpdev->wrk, hv_eject_device_work);
> > get_hvpcibus(hpdev->hbus);
> > - schedule_work(&hpdev->wrk);
> > + queue_work(hpdev->hbus->wq, &hpdev->wrk);
> > }
> >
> > /**
> > @@ -2463,11 +2465,17 @@ static int hv_pci_probe(struct hv_device *hdev,
> > spin_lock_init(&hbus->retarget_msi_interrupt_lock);
> > sema_init(&hbus->enum_sem, 1);
> > init_completion(&hbus->remove_event);
> > + hbus->wq = alloc_ordered_workqueue("hv_pci_%x", 0,
> > + hbus->sysdata.domain);
> > + if (!hbus->wq) {
> > + ret = -ENOMEM;
> > + goto free_bus;
> > + }
> >
> > ret = vmbus_open(hdev->channel, pci_ring_size, pci_ring_size, NULL, 0,
> > hv_pci_onchannelcallback, hbus);
> > if (ret)
> > - goto free_bus;
> > + goto destroy_wq;
> >
> > hv_set_drvdata(hdev, hbus);
> >
> > @@ -2536,6 +2544,9 @@ static int hv_pci_probe(struct hv_device *hdev,
> > hv_free_config_window(hbus);
> > close:
> > vmbus_close(hdev->channel);
> > +destroy_wq:
> > + drain_workqueue(hbus->wq);
>
> The drain_workqueue() call isn't necessary. destroy_workqueue() calls
> drain_workqueue() and there better not be anything in the workqueue
> anyway since all the ref counts are zero.

OK. Will remove it.

> > + destroy_workqueue(hbus->wq);
> > free_bus:
> > free_page((unsigned long)hbus);
> > return ret;
> > @@ -2615,6 +2626,8 @@ static int hv_pci_remove(struct hv_device *hdev)
> > irq_domain_free_fwnode(hbus->sysdata.fwnode);
> > put_hvpcibus(hbus);
> > wait_for_completion(&hbus->remove_event);
> > + drain_workqueue(hbus->wq);
>
> Same here -- drain_workqueue() isn't needed. The workqueue
> must be empty anyway since the remove_event has completed
> and the ref counts will all be zero.

Will remove it.

> > + destroy_workqueue(hbus->wq);

I'm going to post a v2 patchset tomorrow.

Thanks,
-- Dexuan