2018-03-05 19:26:27

by Dexuan Cui

Subject: [PATCH v2 0/6] some fixes to the pci-hyperv driver.

Changes since v1 are:

Patches 1 and 6: no change since v1.

Patches 2, 4 and 5: new patches, added as suggested by Michael Kelley.

Patch 3: removed the unnecessary drain_workqueue(), as suggested by Michael Kelley.


Dexuan Cui (6):
PCI: hv: fix a comment typo in _hv_pcifront_read_config()
PCI: hv: hv_eject_device_work(): remove the bogus test
PCI: hv: serialize the present/eject work items
PCI: hv: remove hbus->enum_sem
PCI: hv: hv_pci_devices_present(): only queue a new work when
necessary
PCI: hv: fix 2 hang issues in hv_compose_msi_msg()

drivers/pci/host/pci-hyperv.c | 116 ++++++++++++++++++++++++++++++++----------
1 file changed, 90 insertions(+), 26 deletions(-)

--
2.7.4


2018-03-05 19:23:53

by Dexuan Cui

Subject: [PATCH v2 6/6] PCI: hv: fix 2 hang issues in hv_compose_msi_msg()

1. With the patch "x86/vector/msi: Switch to global reservation mode"
(4900be8360), v4.15 and newer kernels always hang in a 1-vCPU Hyper-V VM
with SR-IOV. This is because, when we reach hv_compose_msi_msg() via
request_irq() -> request_threaded_irq() -> __setup_irq() -> irq_startup()
-> __irq_startup() -> irq_domain_activate_irq() -> ... ->
msi_domain_activate() -> ... -> hv_compose_msi_msg(), local interrupts
have been disabled in __setup_irq().

Fix this by polling the channel.

2. If the host is ejecting the VF device before we reach
hv_compose_msi_msg(), then in a UP VM we can hang in hv_compose_msi_msg()
forever, because at this point the host does not respond to the
CREATE_INTERRUPT request. This issue also happens on older kernels such
as v4.14 and v4.13.

Fix this by polling the channel for the PCI_EJECT message and
hpdev->state, and by checking the PCI vendor ID.

Note: the above issues can also happen in an SMP VM, if
"hbus->hdev->channel->target_cpu == smp_processor_id()" happens to be true.

Signed-off-by: Dexuan Cui <[email protected]>
Tested-by: Adrian Suhov <[email protected]>
Tested-by: Chris Valean <[email protected]>
Cc: [email protected]
Cc: Stephen Hemminger <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Jack Morgenstein <[email protected]>
---
drivers/pci/host/pci-hyperv.c | 58 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index d3aa6736a9bb..114624dfbd97 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -521,6 +521,8 @@ struct hv_pci_compl {
s32 completion_status;
};

+static void hv_pci_onchannelcallback(void *context);
+
/**
* hv_pci_generic_compl() - Invoked for a completion packet
* @context: Set up by the sender of the packet.
@@ -665,6 +667,31 @@ static void _hv_pcifront_read_config(struct hv_pci_dev *hpdev, int where,
}
}

+static u16 hv_pcifront_get_vendor_id(struct hv_pci_dev *hpdev)
+{
+ u16 ret;
+ unsigned long flags;
+ void __iomem *addr = hpdev->hbus->cfg_addr + CFG_PAGE_OFFSET +
+ PCI_VENDOR_ID;
+
+ spin_lock_irqsave(&hpdev->hbus->config_lock, flags);
+
+ /* Choose the function to be read. (See comment above) */
+ writel(hpdev->desc.win_slot.slot, hpdev->hbus->cfg_addr);
+ /* Make sure the function was chosen before we start reading. */
+ mb();
+ /* Read from that function's config space. */
+ ret = readw(addr);
+ /*
+ * mb() is not required here, because the spin_unlock_irqrestore()
+ * is a barrier.
+ */
+
+ spin_unlock_irqrestore(&hpdev->hbus->config_lock, flags);
+
+ return ret;
+}
+
/**
* _hv_pcifront_write_config() - Internal PCI config write
* @hpdev: The PCI driver's representation of the device
@@ -1107,8 +1134,37 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
* Since this function is called with IRQ locks held, can't
* do normal wait for completion; instead poll.
*/
- while (!try_wait_for_completion(&comp.comp_pkt.host_event))
+ while (!try_wait_for_completion(&comp.comp_pkt.host_event)) {
+ /* 0xFFFF means an invalid PCI VENDOR ID. */
+ if (hv_pcifront_get_vendor_id(hpdev) == 0xFFFF) {
+ dev_err_once(&hbus->hdev->device,
+ "the device has gone\n");
+ goto free_int_desc;
+ }
+
+ /*
+ * When the higher level interrupt code calls us with
+ * interrupt disabled, we must poll the channel by calling
+ * the channel callback directly when channel->target_cpu is
+ * the current CPU. When the higher level interrupt code
+ * calls us with interrupt enabled, let's add the
+ * local_bh_disable()/enable() to avoid race.
+ */
+ local_bh_disable();
+
+ if (hbus->hdev->channel->target_cpu == smp_processor_id())
+ hv_pci_onchannelcallback(hbus);
+
+ local_bh_enable();
+
+ if (hpdev->state == hv_pcichild_ejecting) {
+ dev_err_once(&hbus->hdev->device,
+ "the device is being ejected\n");
+ goto free_int_desc;
+ }
+
udelay(100);
+ }

if (comp.comp_pkt.completion_status < 0) {
dev_err(&hbus->hdev->device,
--
2.7.4

2018-03-05 19:24:43

by Dexuan Cui

Subject: [PATCH v2 5/6] PCI: hv: hv_pci_devices_present(): only queue a new work when necessary

If a work item is already pending, we only need to add the new dr to the
dr_list: the pending work item will see and process it.

This is suggested by Michael Kelley.
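
This works because pci_devices_present_work() drains the whole dr_list
when it runs and only processes the newest entry, roughly (paraphrasing
the existing loop):

    spin_lock_irqsave(&hbus->device_list_lock, flags);
    while (!list_empty(&hbus->dr_list)) {
        dr = list_first_entry(&hbus->dr_list, struct hv_dr_state,
                              list_entry);
        list_del(&dr->list_entry);

        /* Throw this away if the list still has stuff in it. */
        if (!list_empty(&hbus->dr_list)) {
            kfree(dr);
            continue;
        }
    }
    spin_unlock_irqrestore(&hbus->device_list_lock, flags);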

Signed-off-by: Dexuan Cui <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Jack Morgenstein <[email protected]>
Cc: [email protected]
Cc: Stephen Hemminger <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Michael Kelley (EOSG) <[email protected]>
---
drivers/pci/host/pci-hyperv.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index 3a385212f666..d3aa6736a9bb 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -1733,6 +1733,7 @@ static void hv_pci_devices_present(struct hv_pcibus_device *hbus,
struct hv_dr_state *dr;
struct hv_dr_work *dr_wrk;
unsigned long flags;
+ bool pending_dr;

dr_wrk = kzalloc(sizeof(*dr_wrk), GFP_NOWAIT);
if (!dr_wrk)
@@ -1756,11 +1757,23 @@ static void hv_pci_devices_present(struct hv_pcibus_device *hbus,
}

spin_lock_irqsave(&hbus->device_list_lock, flags);
+
+ /*
+ * If pending_dr is true, we have already queued a work,
+ * which will see the new dr. Otherwise, we need to
+ * queue a new work.
+ */
+ pending_dr = !list_empty(&hbus->dr_list);
list_add_tail(&dr->list_entry, &hbus->dr_list);
- spin_unlock_irqrestore(&hbus->device_list_lock, flags);

- get_hvpcibus(hbus);
- queue_work(hbus->wq, &dr_wrk->wrk);
+ if (pending_dr) {
+ kfree(dr_wrk);
+ } else {
+ get_hvpcibus(hbus);
+ queue_work(hbus->wq, &dr_wrk->wrk);
+ }
+
+ spin_unlock_irqrestore(&hbus->device_list_lock, flags);
}

/**
--
2.7.4

2018-03-05 19:24:57

by Dexuan Cui

Subject: [PATCH v2 2/6] PCI: hv: hv_eject_device_work(): remove the bogus test

When we reach this function, hpdev->state must already be
hv_pcichild_ejecting: see hv_pci_eject_device(). Replace the redundant
test with a WARN_ON().
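
For reference, hv_pci_eject_device() sets the state just before queueing
the work item, roughly:

    /* hv_pci_eject_device(), simplified: */
    hpdev->state = hv_pcichild_ejecting;
    get_pcichild(hpdev, hv_pcidev_ref_pnp);
    INIT_WORK(&hpdev->wrk, hv_eject_device_work);
    get_hvpcibus(hpdev->hbus);
    schedule_work(&hpdev->wrk);   /* becomes queue_work() in patch 3 */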

Signed-off-by: Dexuan Cui <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Jack Morgenstein <[email protected]>
Cc: [email protected]
Cc: Stephen Hemminger <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Michael Kelley (EOSG) <[email protected]>
---
drivers/pci/host/pci-hyperv.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index 1233300f41c6..04edb24c92ee 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -1796,10 +1796,7 @@ static void hv_eject_device_work(struct work_struct *work)

hpdev = container_of(work, struct hv_pci_dev, wrk);

- if (hpdev->state != hv_pcichild_ejecting) {
- put_pcichild(hpdev, hv_pcidev_ref_pnp);
- return;
- }
+ WARN_ON(hpdev->state != hv_pcichild_ejecting);

/*
* Ejection can come before or after the PCI bus has been set up, so
--
2.7.4

2018-03-05 19:25:20

by Dexuan Cui

Subject: [PATCH v2 4/6] PCI: hv: remove hbus->enum_sem

Now that the present/eject work items are serialized on the ordered
workqueue introduced in the previous patch, the semaphore is no longer
needed.

This is suggested by Michael Kelley.

Signed-off-by: Dexuan Cui <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Jack Morgenstein <[email protected]>
Cc: [email protected]
Cc: Stephen Hemminger <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Cc: Michael Kelley (EOSG) <[email protected]>
---
drivers/pci/host/pci-hyperv.c | 17 ++---------------
1 file changed, 2 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index aaee41faf55f..3a385212f666 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -447,7 +447,6 @@ struct hv_pcibus_device {
spinlock_t device_list_lock; /* Protect lists below */
void __iomem *cfg_addr;

- struct semaphore enum_sem;
struct list_head resources_for_children;

struct list_head children;
@@ -1592,12 +1591,8 @@ static struct hv_pci_dev *get_pcichild_wslot(struct hv_pcibus_device *hbus,
* It must also treat the omission of a previously observed device as
* notification that the device no longer exists.
*
- * Note that this function is a work item, and it may not be
- * invoked in the order that it was queued. Back to back
- * updates of the list of present devices may involve queuing
- * multiple work items, and this one may run before ones that
- * were sent later. As such, this function only does something
- * if is the last one in the queue.
+ * Note that this function is serialized with hv_eject_device_work(),
+ * because both are pushed to the ordered workqueue hbus->wq.
*/
static void pci_devices_present_work(struct work_struct *work)
{
@@ -1618,11 +1613,6 @@ static void pci_devices_present_work(struct work_struct *work)

INIT_LIST_HEAD(&removed);

- if (down_interruptible(&hbus->enum_sem)) {
- put_hvpcibus(hbus);
- return;
- }
-
/* Pull this off the queue and process it if it was the last one. */
spin_lock_irqsave(&hbus->device_list_lock, flags);
while (!list_empty(&hbus->dr_list)) {
@@ -1639,7 +1629,6 @@ static void pci_devices_present_work(struct work_struct *work)
spin_unlock_irqrestore(&hbus->device_list_lock, flags);

if (!dr) {
- up(&hbus->enum_sem);
put_hvpcibus(hbus);
return;
}
@@ -1726,7 +1715,6 @@ static void pci_devices_present_work(struct work_struct *work)
break;
}

- up(&hbus->enum_sem);
put_hvpcibus(hbus);
kfree(dr);
}
@@ -2460,7 +2448,6 @@ static int hv_pci_probe(struct hv_device *hdev,
spin_lock_init(&hbus->config_lock);
spin_lock_init(&hbus->device_list_lock);
spin_lock_init(&hbus->retarget_msi_interrupt_lock);
- sema_init(&hbus->enum_sem, 1);
init_completion(&hbus->remove_event);
hbus->wq = alloc_ordered_workqueue("hv_pci_%x", 0,
hbus->sysdata.domain);
--
2.7.4

2018-03-05 19:26:47

by Dexuan Cui

Subject: [PATCH v2 1/6] PCI: hv: fix a comment typo in _hv_pcifront_read_config()

No functional change.

Signed-off-by: Dexuan Cui <[email protected]>
Fixes: bdd74440d9e8 ("PCI: hv: Add explicit barriers to config space access")
Cc: Vitaly Kuznetsov <[email protected]>
Cc: [email protected]
Cc: Stephen Hemminger <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
---
drivers/pci/host/pci-hyperv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index 2faf38eab785..1233300f41c6 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -653,7 +653,7 @@ static void _hv_pcifront_read_config(struct hv_pci_dev *hpdev, int where,
break;
}
/*
- * Make sure the write was done before we release the spinlock
+ * Make sure the read was done before we release the spinlock
* allowing consecutive reads/writes.
*/
mb();
--
2.7.4

2018-03-05 19:27:38

by Dexuan Cui

Subject: [PATCH v2 3/6] PCI: hv: serialize the present/eject work items

When we hot-remove the device, we first receive a PCI_EJECT message and
then receive a PCI_BUS_RELATIONS message with bus_rel->device_count == 0.

The first message is offloaded to hv_eject_device_work(), and the second
to pci_devices_present_work(). Both paths can run
list_del(&hpdev->list_entry), causing a general protection fault, because
system_wq can run the two work items concurrently.

The patch eliminates the race condition.
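
It does this by queueing both work items on a dedicated per-bus ordered
workqueue instead of system_wq, so they execute one at a time. Condensed
from the diff below:

    /* hv_pci_probe(): one ordered (single-threaded) workqueue per bus. */
    hbus->wq = alloc_ordered_workqueue("hv_pci_%x", 0,
                                       hbus->sysdata.domain);
    if (!hbus->wq) {
        ret = -ENOMEM;
        goto free_bus;
    }

    /* hv_pci_devices_present(): was schedule_work(&dr_wrk->wrk); */
    queue_work(hbus->wq, &dr_wrk->wrk);

    /* hv_pci_eject_device(): was schedule_work(&hpdev->wrk); */
    queue_work(hpdev->hbus->wq, &hpdev->wrk);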

Signed-off-by: Dexuan Cui <[email protected]>
Tested-by: Adrian Suhov <[email protected]>
Tested-by: Chris Valean <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Jack Morgenstein <[email protected]>
Cc: [email protected]
Cc: Stephen Hemminger <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
---
drivers/pci/host/pci-hyperv.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
index 04edb24c92ee..aaee41faf55f 100644
--- a/drivers/pci/host/pci-hyperv.c
+++ b/drivers/pci/host/pci-hyperv.c
@@ -461,6 +461,8 @@ struct hv_pcibus_device {
struct retarget_msi_interrupt retarget_msi_interrupt_params;

spinlock_t retarget_msi_interrupt_lock;
+
+ struct workqueue_struct *wq;
};

/*
@@ -1770,7 +1772,7 @@ static void hv_pci_devices_present(struct hv_pcibus_device *hbus,
spin_unlock_irqrestore(&hbus->device_list_lock, flags);

get_hvpcibus(hbus);
- schedule_work(&dr_wrk->wrk);
+ queue_work(hbus->wq, &dr_wrk->wrk);
}

/**
@@ -1845,7 +1847,7 @@ static void hv_pci_eject_device(struct hv_pci_dev *hpdev)
get_pcichild(hpdev, hv_pcidev_ref_pnp);
INIT_WORK(&hpdev->wrk, hv_eject_device_work);
get_hvpcibus(hpdev->hbus);
- schedule_work(&hpdev->wrk);
+ queue_work(hpdev->hbus->wq, &hpdev->wrk);
}

/**
@@ -2460,11 +2462,17 @@ static int hv_pci_probe(struct hv_device *hdev,
spin_lock_init(&hbus->retarget_msi_interrupt_lock);
sema_init(&hbus->enum_sem, 1);
init_completion(&hbus->remove_event);
+ hbus->wq = alloc_ordered_workqueue("hv_pci_%x", 0,
+ hbus->sysdata.domain);
+ if (!hbus->wq) {
+ ret = -ENOMEM;
+ goto free_bus;
+ }

ret = vmbus_open(hdev->channel, pci_ring_size, pci_ring_size, NULL, 0,
hv_pci_onchannelcallback, hbus);
if (ret)
- goto free_bus;
+ goto destroy_wq;

hv_set_drvdata(hdev, hbus);

@@ -2533,6 +2541,8 @@ static int hv_pci_probe(struct hv_device *hdev,
hv_free_config_window(hbus);
close:
vmbus_close(hdev->channel);
+destroy_wq:
+ destroy_workqueue(hbus->wq);
free_bus:
free_page((unsigned long)hbus);
return ret;
@@ -2612,6 +2622,7 @@ static int hv_pci_remove(struct hv_device *hdev)
irq_domain_free_fwnode(hbus->sysdata.fwnode);
put_hvpcibus(hbus);
wait_for_completion(&hbus->remove_event);
+ destroy_workqueue(hbus->wq);
free_page((unsigned long)hbus);
return 0;
}
--
2.7.4

2018-03-05 23:48:59

by Michael Kelley (EOSG)

Subject: RE: [PATCH v2 5/6] PCI: hv: hv_pci_devices_present(): only queue a new work when necessary

> -----Original Message-----
> From: Dexuan Cui
> Sent: Monday, March 5, 2018 11:22 AM
> To: [email protected]; [email protected]; KY Srinivasan <[email protected]>;
> Stephen Hemminger <[email protected]>; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected]; Haiyang Zhang
> <[email protected]>; [email protected]; [email protected]; Michael
> Kelley (EOSG) <[email protected]>; Dexuan Cui <[email protected]>; Jack
> Morgenstein <[email protected]>; [email protected]
> Subject: [PATCH v2 5/6] PCI: hv: hv_pci_devices_present(): only queue a new work when
> necessary
>
> If there is a pending work, we just need to add the new dr into
> the dr_list.
>
> This is suggested by Michael Kelley.
>
> Signed-off-by: Dexuan Cui <[email protected]>
> Cc: Vitaly Kuznetsov <[email protected]>
> Cc: Jack Morgenstein <[email protected]>
> Cc: [email protected]
> Cc: Stephen Hemminger <[email protected]>
> Cc: K. Y. Srinivasan <[email protected]>
> Cc: Michael Kelley (EOSG) <[email protected]>
> ---
> drivers/pci/host/pci-hyperv.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/host/pci-hyperv.c b/drivers/pci/host/pci-hyperv.c
> index 3a385212f666..d3aa6736a9bb 100644
> --- a/drivers/pci/host/pci-hyperv.c
> +++ b/drivers/pci/host/pci-hyperv.c
> @@ -1733,6 +1733,7 @@ static void hv_pci_devices_present(struct hv_pcibus_device *hbus,
> struct hv_dr_state *dr;
> struct hv_dr_work *dr_wrk;
> unsigned long flags;
> + bool pending_dr;
>
> dr_wrk = kzalloc(sizeof(*dr_wrk), GFP_NOWAIT);
> if (!dr_wrk)
> @@ -1756,11 +1757,23 @@ static void hv_pci_devices_present(struct hv_pcibus_device
> *hbus,
> }
>
> spin_lock_irqsave(&hbus->device_list_lock, flags);
> +
> + /*
> + * If pending_dr is true, we have already queued a work,
> + * which will see the new dr. Otherwise, we need to
> + * queue a new work.
> + */
> + pending_dr = !list_empty(&hbus->dr_list);
> list_add_tail(&dr->list_entry, &hbus->dr_list);
> - spin_unlock_irqrestore(&hbus->device_list_lock, flags);

A minor point: The spin_unlock_irqrestore() call can
stay here. Once we have the list status in a local variable
and the new entry is added to the list, nothing bad can
happen if we drop the spin lock. At worst, and very unlikely,
we'll queue work when some other thread has already queued
work to process the list entry, but that's no big deal. I'd argue
for keeping the code covered by a spin lock as small as possible.

Michael

>
> - get_hvpcibus(hbus);
> - queue_work(hbus->wq, &dr_wrk->wrk);
> + if (pending_dr) {
> + kfree(dr_wrk);
> + } else {
> + get_hvpcibus(hbus);
> + queue_work(hbus->wq, &dr_wrk->wrk);
> + }
> +
> + spin_unlock_irqrestore(&hbus->device_list_lock, flags);
> }
>
> /**
> --
> 2.7.4

2018-03-06 00:18:52

by Dexuan Cui

Subject: RE: [PATCH v2 5/6] PCI: hv: hv_pci_devices_present(): only queue a new work when necessary

> From: Michael Kelley (EOSG)
> Sent: Monday, March 5, 2018 15:48
> > @@ -1756,11 +1757,23 @@ static void hv_pci_devices_present(struct
> hv_pcibus_device
> > *hbus,
> > }
> >
> > spin_lock_irqsave(&hbus->device_list_lock, flags);
> > +
> > + /*
> > + * If pending_dr is true, we have already queued a work,
> > + * which will see the new dr. Otherwise, we need to
> > + * queue a new work.
> > + */
> > + pending_dr = !list_empty(&hbus->dr_list);
> > list_add_tail(&dr->list_entry, &hbus->dr_list);
> > - spin_unlock_irqrestore(&hbus->device_list_lock, flags);
>
> A minor point: The spin_unlock_irqrestore() call can
> stay here. Once we have the list status in a local variable
> and the new entry is added to the list, nothing bad can
> happen if we drop the spin lock. At worst, and very unlikely,
> we'll queue work when some other thread has already queued
> work to process the list entry, but that's no big deal. I'd argue
> for keeping the code covered by a spin lock as small as possible.
>
> Michael

I agree. Will fix this in v3.
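
Something like this (untested sketch, with the unlock moved up to just
after the list update):

    spin_lock_irqsave(&hbus->device_list_lock, flags);
    pending_dr = !list_empty(&hbus->dr_list);
    list_add_tail(&dr->list_entry, &hbus->dr_list);
    spin_unlock_irqrestore(&hbus->device_list_lock, flags);

    if (pending_dr) {
        kfree(dr_wrk);
    } else {
        get_hvpcibus(hbus);
        queue_work(hbus->wq, &dr_wrk->wrk);
    }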

> >
> > - get_hvpcibus(hbus);
> > - queue_work(hbus->wq, &dr_wrk->wrk);
> > + if (pending_dr) {
> > + kfree(dr_wrk);
> > + } else {
> > + get_hvpcibus(hbus);
> > + queue_work(hbus->wq, &dr_wrk->wrk);
> > + }
> > +
> > + spin_unlock_irqrestore(&hbus->device_list_lock, flags);
> > }

To allow time for more comments from others, I'll hold off on v3 until tomorrow.

Thanks,
-- Dexuan