2019-03-04 21:36:32

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH 0/3] pci-hyperv: fix memory leak and add pci_destroy_slot()

Patch #1 fixes a memory leak caused by incorrectly-maintained hpdev->refs.

Patch #2 and #3 make sure the "slot" is removed in all the scenarios.
Without them, in the quick hot-add/hot-remove test, systemd-dev may easily
crash when trying to access a dangling sys file in /sys/bus/pci/slots/:
"BUG: unable to handle kernel paging request".

BTW, Patch #2 was posted on Feb 7, 2019, and this is the v2: the change
to hv_eject_device_work() in v1 is removed, as the change is only needed
when we hot-remove the device and remove the pci-hyperv driver at the
same time. It looks more work is required to make this scenaro work
correctly, and since removing the driver is not really a "usual" usage,
we can address this scenario in the future.

Please review the patchset.

Dexuan Cui (3):
PCI: hv: Fix a memory leak in hv_eject_device_work()
PCI: hv: Add hv_pci_remove_slots() when we unload the driver
PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if
necessary

drivers/pci/controller/pci-hyperv.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

--
2.19.1



2019-03-04 21:35:43

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH 2/3] PCI: hv: Add hv_pci_remove_slots() when we unload the driver

When we unload pci-hyperv, the host doesn't send us a PCI_EJECT message.
In this case we also need to make sure the sysfs pci slot directory
is removed, otherwise "cat /sys/bus/pci/slots/2/address" will trigger
"BUG: unable to handle kernel paging request" (I noticed the issue when
systemd-dev crashed for me when I unloaded the driver). And, if we
unload/reload the driver several times, we'll have multiple pci slot
directories in /sys/bus/pci/slots/ like this:

root@localhost:~# ls -rtl /sys/bus/pci/slots/
total 0
drwxr-xr-x 2 root root 0 Feb 7 10:49 2
drwxr-xr-x 2 root root 0 Feb 7 10:49 2-1
drwxr-xr-x 2 root root 0 Feb 7 10:51 2-2

The patch adds the missing code.

Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
Signed-off-by: Dexuan Cui <[email protected]>
Acked-by: Stephen Hemminger <[email protected]>
Cc: [email protected]
---
drivers/pci/controller/pci-hyperv.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 30f16b882746..b489412e3502 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1486,6 +1486,21 @@ static void hv_pci_assign_slots(struct hv_pcibus_device *hbus)
}
}

+/*
+ * Remove entries in sysfs pci slot directory.
+ */
+static void hv_pci_remove_slots(struct hv_pcibus_device *hbus)
+{
+ struct hv_pci_dev *hpdev;
+
+ list_for_each_entry(hpdev, &hbus->children, list_entry) {
+ if (!hpdev->pci_slot)
+ continue;
+ pci_destroy_slot(hpdev->pci_slot);
+ hpdev->pci_slot = NULL;
+ }
+}
+
/**
* create_root_hv_pci_bus() - Expose a new root PCI bus
* @hbus: Root PCI bus, as understood by this driver
@@ -2680,6 +2695,7 @@ static int hv_pci_remove(struct hv_device *hdev)
pci_lock_rescan_remove();
pci_stop_root_bus(hbus->pci_bus);
pci_remove_root_bus(hbus->pci_bus);
+ hv_pci_remove_slots(hbus);
pci_unlock_rescan_remove();
hbus->state = hv_pcibus_removed;
}
--
2.19.1


2019-03-04 21:35:45

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()

After a device is just created in new_pcichild_device(), hpdev->refs is set
to 2 (i.e. the initial value of 1 plus the get_pcichild()).

When we hot remove the device from the host, in Linux VM we first call
hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
(let's ignore the paired get/put_pcichild() in other places). But in
hv_eject_device_work(), currently we only call put_pcichild() twice,
meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
adds one put_pcichild() to fix the memory leak.

BTW, the device can also be removed when we run "rmmod pci-hyperv". On this
path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
hpdev->refs is 2, and we do correctly call put_pcichild() twice in
pci_devices_present_work().

Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
Signed-off-by: Dexuan Cui <[email protected]>
Cc: <[email protected]>
---
drivers/pci/controller/pci-hyperv.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 95441a35eceb..30f16b882746 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1900,6 +1900,9 @@ static void hv_eject_device_work(struct work_struct *work)
sizeof(*ejct_pkt), (unsigned long)&ctxt.pkt,
VM_PKT_DATA_INBAND, 0);

+ /* For the get_pcichild() in hv_pci_eject_device() */
+ put_pcichild(hpdev);
+ /* For the two refs got in new_pcichild_device() */
put_pcichild(hpdev);
put_pcichild(hpdev);
put_hvpcibus(hpdev->hbus);
--
2.19.1


2019-03-04 21:35:50

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary

When we hot-remove a device, usually the host sends us a PCI_EJECT message,
and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But when
we do the quick hot-add/hot-remove test, the host may not send us the
PCI_EJECT message, if the guest has not fully finished the initialization
by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's
potentially unsafe to only depend on the pci_destroy_slot() in
hv_eject_device_work(), though create_root_hv_pci_bus() ->
hv_pci_assign_slots() is not called in this case. Note: in this case, the
host still sends the guest a PCI_BUS_RELATIONS message with
bus_rel->device_count == 0.

And, in the quick hot-add/hot-remove test, we can have such a race: before
pci_devices_present_work() -> new_pcichild_device() adds the new device
into hbus->children, we may have already received the PCI_EJECT message,
and hence the taklet handler hv_pci_onchannelcallback() may fail to find
the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so
hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() ->
hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message
with bus_rel->device_count == 0 removes the device from hbus->children, and
we end up being unable to remove the slot in hv_pci_remove() ->
hv_pci_remove_slots().

The patch removes the slot in pci_devices_present_work() when the device
is removed. This can address the above race. Note 1:
pci_devices_present_work() and hv_eject_device_work() run in the
singled-threaded hbus->wq, so there is not a double-remove issue for the
slot. Note 2: we can't offload hv_pci_eject_device() from
hv_pci_onchannelcallback() to the workqueue, because we need
hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to
poll the channel's ringbuffer to work around the
"hangs in hv_compose_msi_msg()" issue: see
commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")

Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
Signed-off-by: Dexuan Cui <[email protected]>
Cc: [email protected]
---
drivers/pci/controller/pci-hyperv.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index b489412e3502..82acd6155adf 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct work_struct *work)
hpdev = list_first_entry(&removed, struct hv_pci_dev,
list_entry);
list_del(&hpdev->list_entry);
+
+ if (hpdev->pci_slot)
+ pci_destroy_slot(hpdev->pci_slot);
+
put_pcichild(hpdev);
}

--
2.19.1


2019-03-05 19:05:06

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [PATCH 0/3] pci-hyperv: fix memory leak and add pci_destroy_slot()

On Mon, 4 Mar 2019 21:34:47 +0000
Dexuan Cui <[email protected]> wrote:

> Patch #1 fixes a memory leak caused by incorrectly-maintained hpdev->refs.
>
> Patch #2 and #3 make sure the "slot" is removed in all the scenarios.
> Without them, in the quick hot-add/hot-remove test, systemd-dev may easily
> crash when trying to access a dangling sys file in /sys/bus/pci/slots/:
> "BUG: unable to handle kernel paging request".
>
> BTW, Patch #2 was posted on Feb 7, 2019, and this is the v2: the change
> to hv_eject_device_work() in v1 is removed, as the change is only needed
> when we hot-remove the device and remove the pci-hyperv driver at the
> same time. It looks more work is required to make this scenaro work
> correctly, and since removing the driver is not really a "usual" usage,
> we can address this scenario in the future.
>
> Please review the patchset.
>
> Dexuan Cui (3):
> PCI: hv: Fix a memory leak in hv_eject_device_work()
> PCI: hv: Add hv_pci_remove_slots() when we unload the driver
> PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if
> necessary
>
> drivers/pci/controller/pci-hyperv.c | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)


Thanks for fixing this.

Reviewed-by: Stephen Hemminger <[email protected]>

2019-03-20 21:38:47

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()

From: Dexuan Cui <[email protected]>
>
> After a device is just created in new_pcichild_device(), hpdev->refs is set
> to 2 (i.e. the initial value of 1 plus the get_pcichild()).
>
> When we hot remove the device from the host, in Linux VM we first call
> hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> (let's ignore the paired get/put_pcichild() in other places). But in
> hv_eject_device_work(), currently we only call put_pcichild() twice,
> meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> adds one put_pcichild() to fix the memory leak.
>
> BTW, the device can also be removed when we run "rmmod pci-hyperv". On this
> path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> pci_devices_present_work().
>
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> Signed-off-by: Dexuan Cui <[email protected]>
> Cc: <[email protected]>

Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
There is the reference in the hbus->children list, and there is the reference that
is returned to the caller. But what is strange is that pci_devices_present_work()
overwrites the reference returned in local variable hpdev without doing a
put_pcichild(). It seems like the "normal" reference count should be 1 when the
child device is not being manipulated, not 2. The fix would be to add a call to
put_pcichild() when the return value from new_pcichild_device() is overwritten.
Then remove the call to put_pcichild() in pci_device_present_work() when missing
children are moved to the local list. The children have been moved from one list
to another, so there's no need to decrement the reference count. Then when
everything in the local list is deleted, the reference is correctly decremented,
presumably freeing the memory.

With this approach, the code in hv_eject_device_work() is correct. There's
one call to put_pcichild() to reflect removing the child device from the hbus->
children list, and one call to put_pcichild() to pair with the get_pcichild() in
hv_pci_eject_device().

Your patch works, but to me it leaves the ref count in an unnatural state
most of the time.

Michael

2019-03-20 21:41:25

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH 2/3] PCI: hv: Add hv_pci_remove_slots() when we unload the driver

From: Dexuan Cui <[email protected]> Sent: Monday, March 4, 2019 1:35 PM
>
> When we unload pci-hyperv, the host doesn't send us a PCI_EJECT message.
> In this case we also need to make sure the sysfs pci slot directory
> is removed, otherwise "cat /sys/bus/pci/slots/2/address" will trigger
> "BUG: unable to handle kernel paging request" (I noticed the issue when
> systemd-dev crashed for me when I unloaded the driver). And, if we
> unload/reload the driver several times, we'll have multiple pci slot
> directories in /sys/bus/pci/slots/ like this:
>
> root@localhost:~# ls -rtl /sys/bus/pci/slots/
> total 0
> drwxr-xr-x 2 root root 0 Feb 7 10:49 2
> drwxr-xr-x 2 root root 0 Feb 7 10:49 2-1
> drwxr-xr-x 2 root root 0 Feb 7 10:51 2-2
>
> The patch adds the missing code.
>
> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
> Signed-off-by: Dexuan Cui <[email protected]>
> Acked-by: Stephen Hemminger <[email protected]>
> Cc: [email protected]

Reviewed-by: Michael Kelley <[email protected]>

2019-03-20 21:46:22

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary

From: Dexuan Cui <[email protected]> Sent: Monday, March 4, 2019 1:35 PM
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index b489412e3502..82acd6155adf 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct work_struct *work)
> hpdev = list_first_entry(&removed, struct hv_pci_dev,
> list_entry);
> list_del(&hpdev->list_entry);
> +
> + if (hpdev->pci_slot)
> + pci_destroy_slot(hpdev->pci_slot);

The code is inconsistent in whether hpdev->pci_slot is set to NULL after calling
pci_destory_slot(). Patch 2 in this series does set it to NULL, but this code does not.
And the code in hv_eject_device_work() does not set it to NULL.

It looks like all the places that test the value of hpdev->pci_slot or call
pci_destroy_slot() are serialized, so it looks like it really doesn't matter. But when
the code is inconsistent about setting to NULL, it always makes me wonder if there
is a reason.

Michael


> +
> put_pcichild(hpdev);
> }
>
> --
> 2.19.1


2019-03-21 00:14:32

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()

> From: Michael Kelley <[email protected]>
> Sent: Wednesday, March 20, 2019 2:38 PM
>
> From: Dexuan Cui <[email protected]>
> >
> > After a device is just created in new_pcichild_device(), hpdev->refs is set
> > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> >
> > When we hot remove the device from the host, in Linux VM we first call
> > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> > (let's ignore the paired get/put_pcichild() in other places). But in
> > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > adds one put_pcichild() to fix the memory leak.
> >
> > BTW, the device can also be removed when we run "rmmod pci-hyperv". On
> this
> > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > pci_devices_present_work().
>
> Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> There is the reference in the hbus->children list, and there is the reference that
> is returned to the caller.
So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev
device is about to be destroyed, its reference count can drop to less than 2,
i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from
hbus->children), and then drop to zero (meaning kfree(hpdev) is called).

> But what is strange is that pci_devices_present_work()
> overwrites the reference returned in local variable hpdev without doing a
> put_pcichild().
I suppose you mean:

/* First, mark all existing children as reported missing. */
spin_lock_irqsave(&hbus->device_list_lock, flags);
list_for_each_entry(hpdev, &hbus->children, list_entry) {
hpdev->reported_missing = true;
}
spin_unlock_irqrestore(&hbus->device_list_lock, flags)

This is not strange to me, because, in pci_devices_present_work(), at first we
don't know which devices are about to disappear, so we pre-mark all devices to
be potentially missing like that; if a device is still on the bus, we'll mark its
hpdev->reported_missing to false later; only after we know exactly which
devices are missing, we should call put_pcichild() against them. All these
seem natural to me.

> It seems like the "normal" reference count should be 1 when the
> child device is not being manipulated, not 2.
What does "not being manipulated" mean?

> The fix would be to add a call to
> put_pcichild() when the return value from new_pcichild_device() is
> overwritten.
In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned
from new_pcichild_device(): the "reported_missing" field of the new hpdev
is implicitly initialized to false in new_pcichild_device().

> Then remove the call to put_pcichild() in pci_device_present_work() when
> missing
> children are moved to the local list. The children have been moved from one
> list
> to another, so there's no need to decrement the reference count. Then when
> everything in the local list is deleted, the reference is correctly decremented,
> presumably freeing the memory.
>
> With this approach, the code in hv_eject_device_work() is correct. There's
> one call to put_pcichild() to reflect removing the child device from the hbus->
> children list, and one call to put_pcichild() to pair with the get_pcichild() in
> hv_pci_eject_device().
Please refer to my replies above. IMO we should fix
hv_eject_device_work() rather than pci_devices_present_work().

Thanks
-- Dexuan

> Your patch works, but to me it leaves the ref count in an unnatural state
> most of the time.
>
> Michael


2019-03-21 00:37:05

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary

> From: Michael Kelley <[email protected]>
> Sent: Wednesday, March 20, 2019 2:44 PM
> To: Dexuan Cui <[email protected]>; [email protected];
> [email protected]; [email protected]; KY Srinivasan
> > ...
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> > @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct
> work_struct *work)
> > hpdev = list_first_entry(&removed, struct hv_pci_dev,
> > list_entry);
> > list_del(&hpdev->list_entry);
> > +
> > + if (hpdev->pci_slot)
> > + pci_destroy_slot(hpdev->pci_slot);
>
> The code is inconsistent in whether hpdev->pci_slot is set to NULL after calling
> pci_destory_slot().
Here, in pci_devices_present_work(), it's unnecessary to set it to NULL,
Because:
1) the "hpdev" is removed from hbus->children and it can not be seen
elsewhere;
2) the "hpdev" struct is freed in the below put_pcichild():

while (!list_empty(&removed)) {
hpdev = list_first_entry(&removed, struct hv_pci_dev,
list_entry);
list_del(&hpdev->list_entry);

if (hpdev->pci_slot)
pci_destroy_slot(hpdev->pci_slot);

put_pcichild(hpdev);
}

> Patch 2 in this series does set it to NULL, but this code does not.
In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(),
we must set hpdev->pci_slot to NULL, otherwise, later, due to
hv_pci_remove() -> hv_pci_bus_exit() ->
hv_pci_devices_present() with the zero "relations", we'll double-free the
"hpdev" struct in pci_devices_present_work() -- see the above.

> And the code in hv_eject_device_work() does not set it to NULL.
It's unnecessary to set hpdev->pci_slot to NULL in hv_eject_device_work(),
Because in hv_eject_device_work():
1) the "hpdev" is removed from hbus->children and it can not be seen
elsewhere;
2) the "hpdev" struct is freed at the end of hv_eject_device_work() with my
first patch: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work().

> It looks like all the places that test the value of hpdev->pci_slot or call
> pci_destroy_slot() are serialized, so it looks like it really doesn't matter. But
> when
> the code is inconsistent about setting to NULL, it always makes me wonder if
> there
> is a reason.
>
> Michael

Thanks,
-- Dexuan

2019-03-21 00:43:15

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary

> From: [email protected]
> <[email protected]> On Behalf Of Dexuan Cui
> > ...
> > Patch 2 in this series does set it to NULL, but this code does not.
> In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(),
> we must set hpdev->pci_slot to NULL, otherwise, later, due to
> hv_pci_remove() -> hv_pci_bus_exit() ->
> hv_pci_devices_present() with the zero "relations", we'll double-free the
> "hpdev" struct in pci_devices_present_work() -- see the above.
A minor correction: I meant we'll call pci_destroy_slot(hpdev->pci_slot)
twice, not "double-free hpdev".

Thanks,
-- Dexuan

2019-03-26 17:10:20

by Lorenzo Pieralisi

[permalink] [raw]
Subject: Re: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()

On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > From: Michael Kelley <[email protected]>
> > Sent: Wednesday, March 20, 2019 2:38 PM
> >
> > From: Dexuan Cui <[email protected]>
> > >
> > > After a device is just created in new_pcichild_device(), hpdev->refs is set
> > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > >
> > > When we hot remove the device from the host, in Linux VM we first call
> > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> > > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > adds one put_pcichild() to fix the memory leak.
> > >
> > > BTW, the device can also be removed when we run "rmmod pci-hyperv". On
> > this
> > > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > pci_devices_present_work().
> >
> > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > There is the reference in the hbus->children list, and there is the reference that
> > is returned to the caller.
> So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev
> device is about to be destroyed, its reference count can drop to less than 2,
> i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from
> hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
>
> > But what is strange is that pci_devices_present_work()
> > overwrites the reference returned in local variable hpdev without doing a
> > put_pcichild().
> I suppose you mean:
>
> /* First, mark all existing children as reported missing. */
> spin_lock_irqsave(&hbus->device_list_lock, flags);
> list_for_each_entry(hpdev, &hbus->children, list_entry) {
> hpdev->reported_missing = true;
> }
> spin_unlock_irqrestore(&hbus->device_list_lock, flags)
>
> This is not strange to me, because, in pci_devices_present_work(), at first we
> don't know which devices are about to disappear, so we pre-mark all devices to
> be potentially missing like that; if a device is still on the bus, we'll mark its
> hpdev->reported_missing to false later; only after we know exactly which
> devices are missing, we should call put_pcichild() against them. All these
> seem natural to me.
>
> > It seems like the "normal" reference count should be 1 when the
> > child device is not being manipulated, not 2.
> What does "not being manipulated" mean?
>
> > The fix would be to add a call to
> > put_pcichild() when the return value from new_pcichild_device() is
> > overwritten.
> In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned
> from new_pcichild_device(): the "reported_missing" field of the new hpdev
> is implicitly initialized to false in new_pcichild_device().
>
> > Then remove the call to put_pcichild() in pci_device_present_work() when
> > missing
> > children are moved to the local list. The children have been moved from one
> > list
> > to another, so there's no need to decrement the reference count. Then when
> > everything in the local list is deleted, the reference is correctly decremented,
> > presumably freeing the memory.
> >
> > With this approach, the code in hv_eject_device_work() is correct. There's
> > one call to put_pcichild() to reflect removing the child device from the hbus->
> > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > hv_pci_eject_device().
> Please refer to my replies above. IMO we should fix
> hv_eject_device_work() rather than pci_devices_present_work().

Have we reached a conclusion on this ? I would like to merge this series
given that it is fixing bugs and it has hung in the balance for quite
a while but it looks like Michael is not too happy about these patches
and I need a maintainer ACK to merge them.

Thanks,
Lorenzo

> Thanks
> -- Dexuan
>
> > Your patch works, but to me it leaves the ref count in an unnatural state
> > most of the time.
> >
> > Michael
>

2019-03-26 17:49:32

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()

From: Lorenzo Pieralisi <[email protected]> Sent: Tuesday, March 26, 2019 10:09 AM
> On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > > From: Michael Kelley <[email protected]>
> > > Sent: Wednesday, March 20, 2019 2:38 PM
> > >
> > > From: Dexuan Cui <[email protected]>
> > > >
> > > > After a device is just created in new_pcichild_device(), hpdev->refs is set
> > > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > > >
> > > > When we hot remove the device from the host, in Linux VM we first call
> > > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild() and
> > > > then schedules a work of hv_eject_device_work(), so hpdev->refs becomes 3
> > > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > > adds one put_pcichild() to fix the memory leak.
> > > >
> > > > BTW, the device can also be removed when we run "rmmod pci-hyperv". On
> > > this
> > > > path (hv_pci_remove() -> hv_pci_bus_exit() -> hv_pci_devices_present()),
> > > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > > pci_devices_present_work().
> > >
> > > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > > There is the reference in the hbus->children list, and there is the reference that
> > > is returned to the caller.
> > So IMO the "normal" reference count should be 2. :-) IMO only when a hv_pci_dev
> > device is about to be destroyed, its reference count can drop to less than 2,
> > i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed from
> > hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
> >
> > > But what is strange is that pci_devices_present_work()
> > > overwrites the reference returned in local variable hpdev without doing a
> > > put_pcichild().
> > I suppose you mean:
> >
> > /* First, mark all existing children as reported missing. */
> > spin_lock_irqsave(&hbus->device_list_lock, flags);
> > list_for_each_entry(hpdev, &hbus->children, list_entry) {
> > hpdev->reported_missing = true;
> > }
> > spin_unlock_irqrestore(&hbus->device_list_lock, flags)
> >
> > This is not strange to me, because, in pci_devices_present_work(), at first we
> > don't know which devices are about to disappear, so we pre-mark all devices to
> > be potentially missing like that; if a device is still on the bus, we'll mark its
> > hpdev->reported_missing to false later; only after we know exactly which
> > devices are missing, we should call put_pcichild() against them. All these
> > seem natural to me.
> >
> > > It seems like the "normal" reference count should be 1 when the
> > > child device is not being manipulated, not 2.
> > What does "not being manipulated" mean?
> >
> > > The fix would be to add a call to
> > > put_pcichild() when the return value from new_pcichild_device() is
> > > overwritten.
> > In pci_devices_present_work(), we NEVER "overwrite" the "hpdev" returned
> > from new_pcichild_device(): the "reported_missing" field of the new hpdev
> > is implicitly initialized to false in new_pcichild_device().
> >
> > > Then remove the call to put_pcichild() in pci_device_present_work() when
> > > missing
> > > children are moved to the local list. The children have been moved from one
> > > list
> > > to another, so there's no need to decrement the reference count. Then when
> > > everything in the local list is deleted, the reference is correctly decremented,
> > > presumably freeing the memory.
> > >
> > > With this approach, the code in hv_eject_device_work() is correct. There's
> > > one call to put_pcichild() to reflect removing the child device from the hbus->
> > > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > > hv_pci_eject_device().
> > Please refer to my replies above. IMO we should fix
> > hv_eject_device_work() rather than pci_devices_present_work().
>
> Have we reached a conclusion on this ? I would like to merge this series
> given that it is fixing bugs and it has hung in the balance for quite
> a while but it looks like Michael is not too happy about these patches
> and I need a maintainer ACK to merge them.
>
> Thanks,
> Lorenzo

Dexuan and I have discussed the topic extensively offline. The patch works
in its current form, and I'll agree to it.

Reviewed-by: Michael Kelley <[email protected]>

2019-03-26 17:52:27

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary

From: Dexuan Cui <[email protected]> Sent: Wednesday, March 20, 2019 5:36 PM
>
> > From: Michael Kelley <[email protected]>
> > > ...
> > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct
> > work_struct *work)
> > > hpdev = list_first_entry(&removed, struct hv_pci_dev,
> > > list_entry);
> > > list_del(&hpdev->list_entry);
> > > +
> > > + if (hpdev->pci_slot)
> > > + pci_destroy_slot(hpdev->pci_slot);
> >
> > The code is inconsistent in whether hpdev->pci_slot is set to NULL after calling
> > pci_destory_slot().
> Here, in pci_devices_present_work(), it's unnecessary to set it to NULL,
> Because:
> 1) the "hpdev" is removed from hbus->children and it can not be seen
> elsewhere;
> 2) the "hpdev" struct is freed in the below put_pcichild():
>
> while (!list_empty(&removed)) {
> hpdev = list_first_entry(&removed, struct hv_pci_dev,
> list_entry);
> list_del(&hpdev->list_entry);
>
> if (hpdev->pci_slot)
> pci_destroy_slot(hpdev->pci_slot);
>
> put_pcichild(hpdev);
> }
>
> > Patch 2 in this series does set it to NULL, but this code does not.
> In Patch2, i.e. in the code path hv_pci_remove() -> hv_pci_remove_slots(),
> we must set hpdev->pci_slot to NULL, otherwise, later, due to
> hv_pci_remove() -> hv_pci_bus_exit() ->
> hv_pci_devices_present() with the zero "relations", we'll double-free the
> "hpdev" struct in pci_devices_present_work() -- see the above.
>
> > And the code in hv_eject_device_work() does not set it to NULL.
> It's unnecessary to set hpdev->pci_slot to NULL in hv_eject_device_work(),
> Because in hv_eject_device_work():
> 1) the "hpdev" is removed from hbus->children and it can not be seen
> elsewhere;
> 2) the "hpdev" struct is freed at the end of hv_eject_device_work() with my
> first patch: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work().
>
> > It looks like all the places that test the value of hpdev->pci_slot or call
> > pci_destroy_slot() are serialized, so it looks like it really doesn't matter. But
> > when
> > the code is inconsistent about setting to NULL, it always makes me wonder if
> > there
> > is a reason.
> >
> > Michael
>

Reviewed-by: Michael Kelley <[email protected]>

2019-03-26 18:02:43

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()

> From: Michael Kelley <[email protected]>
> Sent: Tuesday, March 26, 2019 10:47 AM
> To: Lorenzo Pieralisi <[email protected]>; Dexuan Cui
> <[email protected]>
> Cc: [email protected]; [email protected]; KY Srinivasan
> <[email protected]>; Stephen Hemminger <[email protected]>;
> Sasha Levin <[email protected]>; [email protected];
> [email protected]; [email protected]; Haiyang
> Zhang <[email protected]>; [email protected]; [email protected];
> [email protected]; vkuznets <[email protected]>;
> [email protected]; [email protected]; [email protected]
> Subject: RE: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()
>
> From: Lorenzo Pieralisi <[email protected]> Sent: Tuesday, March 26,
> 2019 10:09 AM
> > On Thu, Mar 21, 2019 at 12:12:03AM +0000, Dexuan Cui wrote:
> > > > From: Michael Kelley <[email protected]>
> > > > Sent: Wednesday, March 20, 2019 2:38 PM
> > > >
> > > > From: Dexuan Cui <[email protected]>
> > > > >
> > > > > After a device is just created in new_pcichild_device(), hpdev->refs is
> set
> > > > > to 2 (i.e. the initial value of 1 plus the get_pcichild()).
> > > > >
> > > > > When we hot remove the device from the host, in Linux VM we first call
> > > > > hv_pci_eject_device(), which increases hpdev->refs by get_pcichild()
> and
> > > > > then schedules a work of hv_eject_device_work(), so hpdev->refs
> becomes 3
> > > > > (let's ignore the paired get/put_pcichild() in other places). But in
> > > > > hv_eject_device_work(), currently we only call put_pcichild() twice,
> > > > > meaning the 'hpdev' struct can't be freed in put_pcichild(). This patch
> > > > > adds one put_pcichild() to fix the memory leak.
> > > > >
> > > > > BTW, the device can also be removed when we run "rmmod pci-hyperv".
> On
> > > > this
> > > > > path (hv_pci_remove() -> hv_pci_bus_exit() ->
> hv_pci_devices_present()),
> > > > > hpdev->refs is 2, and we do correctly call put_pcichild() twice in
> > > > > pci_devices_present_work().
> > > >
> > > > Exiting new_pcichild_device() with hpdev->refs set to 2 seems OK to me.
> > > > There is the reference in the hbus->children list, and there is the
> reference that
> > > > is returned to the caller.
> > > So IMO the "normal" reference count should be 2. :-) IMO only when a
> hv_pci_dev
> > > device is about to be destroyed, its reference count can drop to less than 2,
> > > i.e. first temporarily drop to 1 (meaning the hv_pci_dev device is removed
> from
> > > hbus->children), and then drop to zero (meaning kfree(hpdev) is called).
> > >
> > > > But what is strange is that pci_devices_present_work()
> > > > overwrites the reference returned in local variable hpdev without doing a
> > > > put_pcichild().
> > > I suppose you mean:
> > >
> > > /* First, mark all existing children as reported missing. */
> > > spin_lock_irqsave(&hbus->device_list_lock, flags);
> > > list_for_each_entry(hpdev, &hbus->children, list_entry) {
> > > hpdev->reported_missing = true;
> > > }
> > > spin_unlock_irqrestore(&hbus->device_list_lock, flags)
> > >
> > > This is not strange to me, because, in pci_devices_present_work(), at first
> we
> > > don't know which devices are about to disappear, so we pre-mark all
> devices to
> > > be potentially missing like that; if a device is still on the bus, we'll mark its
> > > hpdev->reported_missing to false later; only after we know exactly which
> > > devices are missing, we should call put_pcichild() against them. All these
> > > seem natural to me.
> > >
> > > > It seems like the "normal" reference count should be 1 when the
> > > > child device is not being manipulated, not 2.
> > > What does "not being manipulated" mean?
> > >
> > > > The fix would be to add a call to
> > > > put_pcichild() when the return value from new_pcichild_device() is
> > > > overwritten.
> > > In pci_devices_present_work(), we NEVER "overwrite" the "hpdev"
> returned
> > > from new_pcichild_device(): the "reported_missing" field of the new hpdev
> > > is implicitly initialized to false in new_pcichild_device().
> > >
> > > > Then remove the call to put_pcichild() in pci_device_present_work()
> when
> > > > missing
> > > > children are moved to the local list. The children have been moved from
> one
> > > > list
> > > > to another, so there's no need to decrement the reference count. Then
> when
> > > > everything in the local list is deleted, the reference is correctly
> decremented,
> > > > presumably freeing the memory.
> > > >
> > > > With this approach, the code in hv_eject_device_work() is correct.
> There's
> > > > one call to put_pcichild() to reflect removing the child device from the
> hbus->
> > > > children list, and one call to put_pcichild() to pair with the get_pcichild() in
> > > > hv_pci_eject_device().
> > > Please refer to my replies above. IMO we should fix
> > > hv_eject_device_work() rather than pci_devices_present_work().
> >
> > Have we reached a conclusion on this ? I would like to merge this series
> > given that it is fixing bugs and it has hung in the balance for quite
> > a while but it looks like Michael is not too happy about these patches
> > and I need a maintainer ACK to merge them.
> >
> > Thanks,
> > Lorenzo
>
> Dexuan and I have discussed the topic extensively offline. The patch works
> in its current form, and I'll agree to it.
>
> Reviewed-by: Michael Kelley <[email protected]>

Thanks, Michael!

Hi Lorenzo,
All the 3 patches have got Michael's Reviewed-by.

Previously, Stephen Hemminger, one of the Hyper-V driver maintainers,
provided his Reviewed-by in the " [PATCH 0/3]" mail:
https://lkml.org/lkml/2019/3/5/521

Thanks,
--Dexuan

2019-03-26 18:13:32

by Lorenzo Pieralisi

[permalink] [raw]
Subject: Re: [PATCH 1/3] PCI: hv: Fix a memory leak in hv_eject_device_work()

On Tue, Mar 26, 2019 at 06:01:32PM +0000, Dexuan Cui wrote:

[...]

> > > Have we reached a conclusion on this ? I would like to merge this series
> > > given that it is fixing bugs and it has hung in the balance for quite
> > > a while but it looks like Michael is not too happy about these patches
> > > and I need a maintainer ACK to merge them.
> > >
> > > Thanks,
> > > Lorenzo
> >
> > Dexuan and I have discussed the topic extensively offline. The patch works
> > in its current form, and I'll agree to it.
> >
> > Reviewed-by: Michael Kelley <[email protected]>
>
> Thanks, Michael!
>
> Hi Lorenzo,
> All the 3 patches have got Michael's Reviewed-by.

Good, I asked because I do not want to merge patches with review
questions open, thanks for the clarifications.

> Previously, Stephen Hemminger, one of the Hyper-V driver maintainers,
> provided his Reviewed-by in the " [PATCH 0/3]" mail:
> https://lkml.org/lkml/2019/3/5/521

I will reformat/rewrite the logs and notify you when queued.

Thanks,
Lorenzo

2019-03-26 19:55:36

by Lorenzo Pieralisi

[permalink] [raw]
Subject: Re: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary

On Mon, Mar 04, 2019 at 09:34:49PM +0000, Dexuan Cui wrote:
> When we hot-remove a device, usually the host sends us a PCI_EJECT message,
> and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But when
> we do the quick hot-add/hot-remove test, the host may not send us the
> PCI_EJECT message, if the guest has not fully finished the initialization
> by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's
> potentially unsafe to only depend on the pci_destroy_slot() in
> hv_eject_device_work(), though create_root_hv_pci_bus() ->
> hv_pci_assign_slots() is not called in this case. Note: in this case, the
> host still sends the guest a PCI_BUS_RELATIONS message with
> bus_rel->device_count == 0.
>
> And, in the quick hot-add/hot-remove test, we can have such a race: before
> pci_devices_present_work() -> new_pcichild_device() adds the new device
> into hbus->children, we may have already received the PCI_EJECT message,
> and hence the taklet handler hv_pci_onchannelcallback() may fail to find
> the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so
> hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() ->
> hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message
> with bus_rel->device_count == 0 removes the device from hbus->children, and
> we end up being unable to remove the slot in hv_pci_remove() ->
> hv_pci_remove_slots().
>
> The patch removes the slot in pci_devices_present_work() when the device
> is removed. This can address the above race. Note 1:
> pci_devices_present_work() and hv_eject_device_work() run in the
> singled-threaded hbus->wq, so there is not a double-remove issue for the
> slot. Note 2: we can't offload hv_pci_eject_device() from
> hv_pci_onchannelcallback() to the workqueue, because we need
> hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to
> poll the channel's ringbuffer to work around the
> "hangs in hv_compose_msi_msg()" issue: see
> commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")

This commit log is unreadable, sorry. Indentation, punctuation and
formatting are just a mess, try to read it, you will notice by
yourself.

I basically reformatted it completely and pushed the series to
pci/controller-fixes but that's the last time I do it since I am not an
editor, next time I won't merge it.

More importantly, these patches are marked for stable, given the series
of fixes that triggered this series please ensure it was tested
thoroughly because it is honestly complicate to understand and I do not
want to backport further fixes to stable kernels on top of this.

Please have a look and report back.

Thanks,
Lorenzo

> Fixes: a15f2c08c708 ("PCI: hv: support reporting serial number as slot information")
> Signed-off-by: Dexuan Cui <[email protected]>
> Cc: [email protected]
> ---
> drivers/pci/controller/pci-hyperv.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index b489412e3502..82acd6155adf 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1776,6 +1776,10 @@ static void pci_devices_present_work(struct work_struct *work)
> hpdev = list_first_entry(&removed, struct hv_pci_dev,
> list_entry);
> list_del(&hpdev->list_entry);
> +
> + if (hpdev->pci_slot)
> + pci_destroy_slot(hpdev->pci_slot);
> +
> put_pcichild(hpdev);
> }
>
> --
> 2.19.1
>

2019-03-27 00:23:53

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH 3/3] PCI: hv: Add pci_destroy_slot() in pci_devices_present_work(), if necessary

> From: Lorenzo Pieralisi <[email protected]>
> Sent: Tuesday, March 26, 2019 12:55 PM
> On Mon, Mar 04, 2019 at 09:34:49PM +0000, Dexuan Cui wrote:
> > When we hot-remove a device, usually the host sends us a PCI_EJECT
> message,
> > and a PCI_BUS_RELATIONS message with bus_rel->device_count == 0. But
> when
> > we do the quick hot-add/hot-remove test, the host may not send us the
> > PCI_EJECT message, if the guest has not fully finished the initialization
> > by sending the PCI_RESOURCES_ASSIGNED* message to the host, so it's
> > potentially unsafe to only depend on the pci_destroy_slot() in
> > hv_eject_device_work(), though create_root_hv_pci_bus() ->
> > hv_pci_assign_slots() is not called in this case. Note: in this case, the
> > host still sends the guest a PCI_BUS_RELATIONS message with
> > bus_rel->device_count == 0.
> >
> > And, in the quick hot-add/hot-remove test, we can have such a race: before
> > pci_devices_present_work() -> new_pcichild_device() adds the new device
> > into hbus->children, we may have already received the PCI_EJECT message,
> > and hence the taklet handler hv_pci_onchannelcallback() may fail to find
> > the "hpdev" by get_pcichild_wslot(hbus, dev_message->wslot.slot), so
> > hv_pci_eject_device() is NOT called; later create_root_hv_pci_bus() ->
> > hv_pci_assign_slots() creates the slot, and the PCI_BUS_RELATIONS message
> > with bus_rel->device_count == 0 removes the device from hbus->children,
> and
> > we end up being unable to remove the slot in hv_pci_remove() ->
> > hv_pci_remove_slots().
> >
> > The patch removes the slot in pci_devices_present_work() when the device
> > is removed. This can address the above race. Note 1:
> > pci_devices_present_work() and hv_eject_device_work() run in the
> > singled-threaded hbus->wq, so there is not a double-remove issue for the
> > slot. Note 2: we can't offload hv_pci_eject_device() from
> > hv_pci_onchannelcallback() to the workqueue, because we need
> > hv_pci_onchannelcallback() synchronously call hv_pci_eject_device() to
> > poll the channel's ringbuffer to work around the
> > "hangs in hv_compose_msi_msg()" issue: see
> > commit de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in
> hv_compose_msi_msg()")
>
> This commit log is unreadable, sorry. Indentation, punctuation and
> formatting are just a mess, try to read it, you will notice by
> yourself.
>
> I basically reformatted it completely and pushed the series to
> pci/controller-fixes but that's the last time I do it since I am not an
> editor, next time I won't merge it.

Hi Lorenzo,
Thank you for helping improve my changelogs! I did learn a lot after
carefully comparing the improved version with my original version. :-)

I'll try my best to write a good changelog for my future patches.

> More importantly, these patches are marked for stable, given the series
> of fixes that triggered this series please ensure it was tested
> thoroughly because it is honestly complicate to understand and I do not
> want to backport further fixes to stable kernels on top of this.

I did the hot-add/hot-remove test in a loop for several thousand times,
and the patchset worked as expected and didn't show any issue.

> Please have a look and report back.
>
> Thanks,
> Lorenzo

Thanks again!

Thanks,
-- Dexuan