2013-05-08 08:18:49

by Zhenzhong Duan

Subject: [PATCH] xen: reuse the same pirq allocated when driver load first time

When a driver is loaded and unloaded in a loop, pirqs are eventually exhausted.
Try to reuse the same pirq that was already mapped and bound the first time
the driver was loaded.

Read the pirq back from the MSI-X entry and test whether the data field is
XEN_PIRQ_MSI_DATA. The xen_irq_from_pirq(pirq) < 0 check is wrong because the
irq is freed when the driver unloads, so the check is always true on the second load.

Signed-off-by: Zhenzhong Duan <[email protected]>
Tested-by: Yuval Shaia <[email protected]>
---
arch/x86/pci/xen.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 94e7662..4289eef 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -230,8 +230,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 		__read_msi_msg(msidesc, &msg);
 		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
 			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
-		if (msg.data != XEN_PIRQ_MSI_DATA ||
-		    xen_irq_from_pirq(pirq) < 0) {
+		if (msg.data != XEN_PIRQ_MSI_DATA) {
 			pirq = xen_allocate_pirq_msi(dev, msidesc);
 			if (pirq < 0) {
 				irq = -ENODEV;
--
1.7.3
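For context, a sketch of the whole function with this change applied. Only the modified condition is taken verbatim from the hunk above; the rest is an approximate reconstruction of xen_hvm_setup_msi_irqs() from arch/x86/pci/xen.c of that era, so names and details outside the hunk should be treated as approximations:

static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
	int irq, pirq;
	struct msi_desc *msidesc;
	struct msi_msg msg;

	list_for_each_entry(msidesc, &dev->msi_list, list) {
		/* Read back what was programmed into the MSI/MSI-X entry. */
		__read_msi_msg(msidesc, &msg);
		pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
			((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
		/* With the patch: reuse the pirq whenever the magic cookie is
		 * present, even though the irq was freed on driver unload. */
		if (msg.data != XEN_PIRQ_MSI_DATA) {
			pirq = xen_allocate_pirq_msi(dev, msidesc);
			if (pirq < 0) {
				irq = -ENODEV;
				goto error;
			}
			xen_msi_compose_msg(dev, pirq, &msg);
			__write_msi_msg(msidesc, &msg);
		}
		irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, 0,
					       (type == PCI_CAP_ID_MSIX) ?
					       "msi-x" : "msi",
					       DOMID_SELF);
		if (irq < 0)
			goto error;
	}
	return 0;

error:
	dev_err(&dev->dev,
		"Xen PCI frontend has not registered MSI/MSI-X support!\n");
	return -ENODEV;
}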


2013-05-10 18:54:03

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
> When driver load and unload in a loop, pirq will exhaust finally.
> Try to use the same pirq which was already mapped and binded at first time

So what happens if I unload and reload two drivers in random order?

> when driver loaded.
>
> Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
> xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
> when driver unload, it's always true in second load.

If my understanding is right, the issue at hand is that the cached
information about the pirq disappears once the driver has been
unloaded, b/c the event's irq-info is removed (as the driver is
unloaded and free_irq is called).

Stefano,
Is there a specific write to the MSI structure that would cause the
hypervisor to drop the PIRQ? Or a nice hypercall to "free" a
PIRQ that is in use?

I presume that on the first load the msg.data value is
ALWAYS != XEN_PIRQ_MSI_DATA, right? Is that something we can ALWAYS
depend on? Or is there a possibility that it could differ when the
PCI device is unplugged from one guest and plugged into another?

Could we use PHYSDEVOP_irq_status_query to figure out if the PIRQ
has been already allocated? Ah, does not look like it.

I am a bit worried about depending on just msg.data to tell us
whether the device had been initialized before.

Perhaps a different solution could be to utilize the
'xen_device_domain_owner' code. See the attached patch (not
compile tested) which retains whether we had initialized this
device in the past so we can skip the hypercall.

On second thought, it could also use some extra checks
to make sure that the pirq used on the second
run is the same as on the first one.
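The use-xen-device-domain-owner.patch attachment is not reproduced in this archive. Purely to illustrate the idea, and assuming the existing xen_register_device_domain_owner()/xen_find_device_domain_owner() helpers in arch/x86/pci/xen.c were made available outside the dom0-only build, a per-device "already initialized" marker might look roughly like this; it is a sketch, not the attached patch:

/* Sketch only, not the attached patch: use the device-domain-owner list as a
 * per-pci_dev marker that MSI setup already ran, so a reload can skip the
 * PHYSDEVOP_get_free_pirq/mapping step.  Assumes the helpers are usable from
 * an HVM guest (today they are built for dom0 only). */
static bool xen_hvm_msi_already_setup(struct pci_dev *dev)
{
	/* Returns the recorded domain, or -ENODEV if the device is unknown. */
	return xen_find_device_domain_owner(dev) >= 0;
}

static void xen_hvm_msi_mark_setup(struct pci_dev *dev)
{
	/* DOMID_SELF is used purely as a "seen before" marker here. */
	xen_register_device_domain_owner(dev, DOMID_SELF);
}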

>
> Signed-off-by: Zhenzhong Duan <[email protected]>
> Tested-by: Yuval Shaia <[email protected]>
> ---
> arch/x86/pci/xen.c | 3 +--
> 1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> index 94e7662..4289eef 100644
> --- a/arch/x86/pci/xen.c
> +++ b/arch/x86/pci/xen.c
> @@ -230,8 +230,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> __read_msi_msg(msidesc, &msg);
> pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> - if (msg.data != XEN_PIRQ_MSI_DATA ||
> - xen_irq_from_pirq(pirq) < 0) {
> + if (msg.data != XEN_PIRQ_MSI_DATA) {

> if (pirq < 0) {
> irq = -ENODEV;
> --
> 1.7.3
>


Attachments:
use-xen-device-domain-owner.patch (1.81 kB)

2013-05-13 07:45:22

by Zhenzhong Duan

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time


On 2013-05-11 02:53, Konrad Rzeszutek Wilk wrote:
> On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
>> When driver load and unload in a loop, pirq will exhaust finally.
>> Try to use the same pirq which was already mapped and binded at first time
> So what happens if I unload and reload two drivers in random order?
Same result. In fact, it's not only the pirq exhaustion issue; the driver also
can't be reloaded successfully every time.
When trying to load a driver which needs to set up MSI-X (e.g. mlx4_core), it
alternately succeeds and fails (3rd time succeeds, 4th fails, etc).

zduan

2013-05-13 11:06:47

by Stefano Stabellini

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
> On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
> > When driver load and unload in a loop, pirq will exhaust finally.
> > Try to use the same pirq which was already mapped and binded at first time
>
> So what happens if I unload and reload two drivers in random order?
>
> > when driver loaded.
> >
> > Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
> > xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
> > when driver unload, it's always true in second load.
>
> If my understanding is right the issue at hand is that the caching
> information about the pirq disappears once the driver has been
> unloaded b/c the event's irq-info is removed (as the driver is
> unloaded and free_irq is called).
>
> Stefano,
> Is there a specific write to the MSI structure that would cause the
> hypervisor to drop the PIRQ? Or a nice hypercall to "free" an
> PIRQ in usage?

We already have a "free PIRQ" hypercall, it's called
PHYSDEVOP_unmap_pirq and should be called by QEMU.
Linux should disable the MSI bit in the PCI config options of the
device:

pci_disable_msi -> pci_msi_shutdown -> msi_set_enable(0)

That should cause QEMU to issue a xc_physdev_unmap_pirq that actually
unmaps the pirq. If it doesn't, it's a bug :)


> I presume that on the first load the msg.data value is
> ALWAYS != XEN_PIRQ_MSI_DATA right? Is that somethign we can ALWAYS
> depend? Or is there a possibility that it could be that when the
> PCI device is unplugged from one guest and plugged in another?
>
> Could we use PHYSDEVOP_irq_status_query to figure out if the PIRQ
> has been already allocated? Ah, does not look like it.
>
> I am a bit worried about depending on just msg.data to tells us
> whether the device had been initialized before.

Data comes from QEMU, I think it is zeroed at init time.
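As a rough illustration of the pci_disable_msi -> pci_msi_shutdown -> msi_set_enable(0) chain mentioned above: the end result is a config-space write that clears the MSI enable bit, and it is that emulated write which QEMU can trap and turn into the pirq unmap. The helper below is a simplified sketch, not the actual drivers/pci/msi.c code:

#include <linux/pci.h>

/* Simplified sketch of what msi_set_enable(dev, 0) boils down to. */
static void msi_disable_sketch(struct pci_dev *dev)
{
	int pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
	u16 control;

	if (!pos)
		return;

	pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &control);
	control &= ~PCI_MSI_FLAGS_ENABLE;	/* the bit the device model sees */
	pci_write_config_word(dev, pos + PCI_MSI_FLAGS, control);
}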

2013-05-13 14:08:13

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini wrote:
> On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
> > On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
> > > When driver load and unload in a loop, pirq will exhaust finally.
> > > Try to use the same pirq which was already mapped and binded at first time
> >
> > So what happens if I unload and reload two drivers in random order?
> >
> > > when driver loaded.
> > >
> > > Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
> > > xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
> > > when driver unload, it's always true in second load.
> >
> > If my understanding is right the issue at hand is that the caching
> > information about the pirq disappears once the driver has been
> > unloaded b/c the event's irq-info is removed (as the driver is
> > unloaded and free_irq is called).
> >
> > Stefano,
> > Is there a specific write to the MSI structure that would cause the
> > hypervisor to drop the PIRQ? Or a nice hypercall to "free" an
> > PIRQ in usage?
>
> We already have a "free PIRQ" hypercall, it's called
> PHYSDEVOP_unmap_pirq and should be called by QEMU.

Considering that we call the function that allocates it (PHYSDEVOP_get_free_pirq)
in the Linux kernel (and not in QEMU), perhaps the unmap should also be done in the
Linux kernel as part of xen_destroy_irq()? Or would that confuse QEMU?

It looks like QEMU only does that hypercall (via xc_physdev_unmap_pirq)
in unregister_real_device, which is only called during PCI unplug?

> Linux should disable the MSI bit in the PCI config options of the
> device:
>
> pci_disable_msi -> pci_msi_shutdown -> msi_set_enable(0)

Zhenzhong, does it do that? Looking at the driver it certainly seems that way.
>
> That should cause QEMU to issue a xc_physdev_unmap_pirq that actually
> unmaps the pirq. If it doesn't, it's a bug :)

<sigh> It doesn't do that. So two bugs:
- QEMU doing that
- Linux kernel doing the hypercall as well.

>
>
> > I presume that on the first load the msg.data value is
> > ALWAYS != XEN_PIRQ_MSI_DATA right? Is that somethign we can ALWAYS
> > depend? Or is there a possibility that it could be that when the
> > PCI device is unplugged from one guest and plugged in another?
> >
> > Could we use PHYSDEVOP_irq_status_query to figure out if the PIRQ
> > has been already allocated? Ah, does not look like it.
> >
> > I am a bit worried about depending on just msg.data to tells us
> > whether the device had been initialized before.
>
> Data comes from QEMU, I think it is zeroed at init time.

2013-05-13 14:50:57

by Stefano Stabellini

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini wrote:
> > On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
> > > On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
> > > > When driver load and unload in a loop, pirq will exhaust finally.
> > > > Try to use the same pirq which was already mapped and binded at first time
> > >
> > > So what happens if I unload and reload two drivers in random order?
> > >
> > > > when driver loaded.
> > > >
> > > > Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
> > > > xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
> > > > when driver unload, it's always true in second load.
> > >
> > > If my understanding is right the issue at hand is that the caching
> > > information about the pirq disappears once the driver has been
> > > unloaded b/c the event's irq-info is removed (as the driver is
> > > unloaded and free_irq is called).
> > >
> > > Stefano,
> > > Is there a specific write to the MSI structure that would cause the
> > > hypervisor to drop the PIRQ? Or a nice hypercall to "free" an
> > > PIRQ in usage?
> >
> > We already have a "free PIRQ" hypercall, it's called
> > PHYSDEVOP_unmap_pirq and should be called by QEMU.
>
> Considering that we call function that allocates (PHYSDEVOP_get_free_pirq)
> it in the Linux kernel (and not in QEMU), perhaps that should be done in the
> Linux kernel as part of xen_destroy_irq()? Or would that confuse QEMU?

I think it would confuse QEMU. It is probably better to let the unmap
be handled by it.


> It looks like QEMU only does that hypercall (via xc_physdev_unmap_pirq)
> unregister_real_device which is only called during pci unplug?

You are right! I would think that this behaviour is erroneous unless it
was done on purpose to avoid allocating MSIs twice.
If that is the case we would need to do something similar in Linux too.

I think that the issue is the mismatch between QEMU's and Linux's
behaviours: either both should be allocating MSIs once, or they should
both be allocating and deallocating MSIs every time the driver is loaded
and unloaded.


> > Linux should disable the MSI bit in the PCI config options of the
> > device:
> >
> > pci_disable_msi -> pci_msi_shutdown -> msi_set_enable(0)
>
> Zhenzhong, does it do that? Looking at the driver it certainly seems that way.
> >
> > That should cause QEMU to issue a xc_physdev_unmap_pirq that actually
> > unmaps the pirq. If it doesn't, it's a bug :)
>
> <sigh> It doesn't do that. So two bugs:
> - QEMU doing that
> - Linux kernel doing the hypercall as well.

At first sight I also thought that the Linux kernel was issuing
PHYSDEVOP_unmap_pirq too but actually Linux is only doing it if
xen_initial_domain().

2013-05-13 16:17:36

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Mon, May 13, 2013 at 03:50:52PM +0100, Stefano Stabellini wrote:
> On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini wrote:
> > > On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
> > > > > When driver load and unload in a loop, pirq will exhaust finally.
> > > > > Try to use the same pirq which was already mapped and binded at first time
> > > >
> > > > So what happens if I unload and reload two drivers in random order?
> > > >
> > > > > when driver loaded.
> > > > >
> > > > > Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
> > > > > xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
> > > > > when driver unload, it's always true in second load.
> > > >
> > > > If my understanding is right the issue at hand is that the caching
> > > > information about the pirq disappears once the driver has been
> > > > unloaded b/c the event's irq-info is removed (as the driver is
> > > > unloaded and free_irq is called).
> > > >
> > > > Stefano,
> > > > Is there a specific write to the MSI structure that would cause the
> > > > hypervisor to drop the PIRQ? Or a nice hypercall to "free" an
> > > > PIRQ in usage?
> > >
> > > We already have a "free PIRQ" hypercall, it's called
> > > PHYSDEVOP_unmap_pirq and should be called by QEMU.
> >
> > Considering that we call function that allocates (PHYSDEVOP_get_free_pirq)
> > it in the Linux kernel (and not in QEMU), perhaps that should be done in the
> > Linux kernel as part of xen_destroy_irq()? Or would that confuse QEMU?
>
> I think it would confuse QEMU. It is probably better to let the unmap
> being handled by it.
>
>
> > It looks like QEMU only does that hypercall (via xc_physdev_unmap_pirq)
> > unregister_real_device which is only called during pci unplug?
>
> You are right! I would think that this behaviour is erroneous unless it
> was done on purpose to avoid allocating MSIs twice.
> If that is the case we would need to do something similar in Linux too.
>
> I think that the issue is the mismatch between QEMU's and Linux's
> behaviours: either both should be allocating MSIs once, or they should
> both be allocating and deallocating MSIs every time the driver is loaded
> and unloaded.

Right. But we also have the scenario that QEMU and Linux are going to
be out of sync. So we need fixes in both places - I think.

>
>
> > > Linux should disable the MSI bit in the PCI config options of the
> > > device:
> > >
> > > pci_disable_msi -> pci_msi_shutdown -> msi_set_enable(0)
> >
> > Zhenzhong, does it do that? Looking at the driver it certainly seems that way.
> > >
> > > That should cause QEMU to issue a xc_physdev_unmap_pirq that actually
> > > unmaps the pirq. If it doesn't, it's a bug :)
> >
> > <sigh> It doesn't do that. So two bugs:
> > - QEMU doing that
> > - Linux kernel doing the hypercall as well.
>
> At first sight I also thought that the Linux kernel was issuing
> PHYSDEVOP_unmap_pirq too but actually Linux is only doing it if
> xen_initial_domain().

That seems like an easy fix. Just do 'if (xen_initial_domain()
|| xen_hvm_domain())'. I think the only one we cannot do it for
is 'xen_pv_domain()' (so PCI in PV guests) as the "owner" of the PIRQ is
actually dom0.
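A sketch of the change suggested here, applied to the PHYSDEVOP_unmap_pirq call in xen_destroy_irq() (drivers/xen/events.c); the surrounding lines are condensed and paraphrased:

int xen_destroy_irq(int irq)
{
	struct physdev_unmap_pirq unmap_irq;
	struct irq_info *info = info_for_irq(irq);
	int rc = -ENOENT;

	/* ... locking and sanity checks elided ... */

	/* Proposed: also issue the unmap for HVM guests, not only for the
	 * initial domain.  PV guests stay excluded because their pirqs are
	 * owned by dom0. */
	if (xen_initial_domain() || xen_hvm_domain()) {
		unmap_irq.pirq = info->u.pirq.pirq;
		unmap_irq.domid = info->u.pirq.domid;
		rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
	}

	/* ... free the irq and unlock ... */
	return rc;
}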

2013-05-13 17:24:51

by Stefano Stabellini

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> On Mon, May 13, 2013 at 03:50:52PM +0100, Stefano Stabellini wrote:
> > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini wrote:
> > > > On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
> > > > > > When driver load and unload in a loop, pirq will exhaust finally.
> > > > > > Try to use the same pirq which was already mapped and binded at first time
> > > > >
> > > > > So what happens if I unload and reload two drivers in random order?
> > > > >
> > > > > > when driver loaded.
> > > > > >
> > > > > > Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
> > > > > > xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
> > > > > > when driver unload, it's always true in second load.
> > > > >
> > > > > If my understanding is right the issue at hand is that the caching
> > > > > information about the pirq disappears once the driver has been
> > > > > unloaded b/c the event's irq-info is removed (as the driver is
> > > > > unloaded and free_irq is called).
> > > > >
> > > > > Stefano,
> > > > > Is there a specific write to the MSI structure that would cause the
> > > > > hypervisor to drop the PIRQ? Or a nice hypercall to "free" an
> > > > > PIRQ in usage?
> > > >
> > > > We already have a "free PIRQ" hypercall, it's called
> > > > PHYSDEVOP_unmap_pirq and should be called by QEMU.
> > >
> > > Considering that we call function that allocates (PHYSDEVOP_get_free_pirq)
> > > it in the Linux kernel (and not in QEMU), perhaps that should be done in the
> > > Linux kernel as part of xen_destroy_irq()? Or would that confuse QEMU?
> >
> > I think it would confuse QEMU. It is probably better to let the unmap
> > being handled by it.
> >
> >
> > > It looks like QEMU only does that hypercall (via xc_physdev_unmap_pirq)
> > > unregister_real_device which is only called during pci unplug?
> >
> > You are right! I would think that this behaviour is erroneous unless it
> > was done on purpose to avoid allocating MSIs twice.
> > If that is the case we would need to do something similar in Linux too.
> >
> > I think that the issue is the mismatch between QEMU's and Linux's
> > behaviours: either both should be allocating MSIs once, or they should
> > both be allocating and deallocating MSIs every time the driver is loaded
> > and unloaded.
>
> Right. But we also have the scenario that QEMU and Linux are going to
> be out of sync. So we need fixes in both places - I think.

QEMU is the owner of the pirq, in fact it is the one that creates and
destroys the mapping. I think that the right place to fix this problem
is in QEMU, the ABI would be much cleaner as a result. As a side effect
we don't need to make any changes in Linux.

> >
> > > > Linux should disable the MSI bit in the PCI config options of the
> > > > device:
> > > >
> > > > pci_disable_msi -> pci_msi_shutdown -> msi_set_enable(0)
> > >
> > > Zhenzhong, does it do that? Looking at the driver it certainly seems that way.
> > > >
> > > > That should cause QEMU to issue a xc_physdev_unmap_pirq that actually
> > > > unmaps the pirq. If it doesn't, it's a bug :)
> > >
> > > <sigh> It doesn't do that. So two bugs:
> > > - QEMU doing that
> > > - Linux kernel doing the hypercall as well.
> >
> > At first sight I also thought that the Linux kernel was issuing
> > PHYSDEVOP_unmap_pirq too but actually Linux is only doing it if
> > xen_initial_domain().
>
> That seems like an easy fix. Just do 'if (xen_initial_domain()
> || xen_hvm_domain())'. I think the only one we cannot do it for
> is 'xen_pv_domain()' (so PCI in PV guests) as the "owner" of the PIRQ is
> actually dom0.

We could do that and it might be easier than changing QEMU, but I think
it would be more clear and consistent if QEMU was the one to do the
unmapping.

2013-05-13 18:21:16

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Mon, May 13, 2013 at 06:24:46PM +0100, Stefano Stabellini wrote:
> On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > On Mon, May 13, 2013 at 03:50:52PM +0100, Stefano Stabellini wrote:
> > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini wrote:
> > > > > On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
> > > > > > > When driver load and unload in a loop, pirq will exhaust finally.
> > > > > > > Try to use the same pirq which was already mapped and binded at first time
> > > > > >
> > > > > > So what happens if I unload and reload two drivers in random order?
> > > > > >
> > > > > > > when driver loaded.
> > > > > > >
> > > > > > > Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
> > > > > > > xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
> > > > > > > when driver unload, it's always true in second load.
> > > > > >
> > > > > > If my understanding is right the issue at hand is that the caching
> > > > > > information about the pirq disappears once the driver has been
> > > > > > unloaded b/c the event's irq-info is removed (as the driver is
> > > > > > unloaded and free_irq is called).
> > > > > >
> > > > > > Stefano,
> > > > > > Is there a specific write to the MSI structure that would cause the
> > > > > > hypervisor to drop the PIRQ? Or a nice hypercall to "free" an
> > > > > > PIRQ in usage?
> > > > >
> > > > > We already have a "free PIRQ" hypercall, it's called
> > > > > PHYSDEVOP_unmap_pirq and should be called by QEMU.
> > > >
> > > > Considering that we call function that allocates (PHYSDEVOP_get_free_pirq)
> > > > it in the Linux kernel (and not in QEMU), perhaps that should be done in the
> > > > Linux kernel as part of xen_destroy_irq()? Or would that confuse QEMU?
> > >
> > > I think it would confuse QEMU. It is probably better to let the unmap
> > > being handled by it.
> > >
> > >
> > > > It looks like QEMU only does that hypercall (via xc_physdev_unmap_pirq)
> > > > unregister_real_device which is only called during pci unplug?
> > >
> > > You are right! I would think that this behaviour is erroneous unless it
> > > was done on purpose to avoid allocating MSIs twice.
> > > If that is the case we would need to do something similar in Linux too.
> > >
> > > I think that the issue is the mismatch between QEMU's and Linux's
> > > behaviours: either both should be allocating MSIs once, or they should
> > > both be allocating and deallocating MSIs every time the driver is loaded
> > > and unloaded.
> >
> > Right. But we also have the scenario that QEMU and Linux are going to
> > be out of sync. So we need fixes in both places - I think.
>
> QEMU is the owner of the pirq, in fact it is the one that creates and
> destroys the mapping. I think that the right place to fix this problem
> is in QEMU, the ABI would be much cleaner as a result. As a side effect
> we don't need to make any changes in Linux.

You do. You need to remove the PHYSDEVOP_get_free_pirq call in that case.
>
> > >
> > > > > Linux should disable the MSI bit in the PCI config options of the
> > > > > device:
> > > > >
> > > > > pci_disable_msi -> pci_msi_shutdown -> msi_set_enable(0)
> > > >
> > > > Zhenzhong, does it do that? Looking at the driver it certainly seems that way.
> > > > >
> > > > > That should cause QEMU to issue a xc_physdev_unmap_pirq that actually
> > > > > unmaps the pirq. If it doesn't, it's a bug :)
> > > >
> > > > <sigh> It doesn't do that. So two bugs:
> > > > - QEMU doing that
> > > > - Linux kernel doing the hypercall as well.
> > >
> > > At first sight I also thought that the Linux kernel was issuing
> > > PHYSDEVOP_unmap_pirq too but actually Linux is only doing it if
> > > xen_initial_domain().
> >
> > That seems like an easy fix. Just do 'if (xen_initial_domain()
> > || xen_hvm_domain())'. I think the only one we cannot do it for
> > is 'xen_pv_domain()' (so PCI in PV guests) as the "owner" of the PIRQ is
> > actually dom0.
>
> We could do that and it might be easier than changing QEMU, but I think
> it would be more clear and consistent if QEMU was the one to do the
> unmapping.

And allocating the PIRQ?

2013-05-14 13:49:56

by Stefano Stabellini

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> On Mon, May 13, 2013 at 06:24:46PM +0100, Stefano Stabellini wrote:
> > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > On Mon, May 13, 2013 at 03:50:52PM +0100, Stefano Stabellini wrote:
> > > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini wrote:
> > > > > > On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > > On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
> > > > > > > > When driver load and unload in a loop, pirq will exhaust finally.
> > > > > > > > Try to use the same pirq which was already mapped and binded at first time
> > > > > > >
> > > > > > > So what happens if I unload and reload two drivers in random order?
> > > > > > >
> > > > > > > > when driver loaded.
> > > > > > > >
> > > > > > > > Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
> > > > > > > > xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
> > > > > > > > when driver unload, it's always true in second load.
> > > > > > >
> > > > > > > If my understanding is right the issue at hand is that the caching
> > > > > > > information about the pirq disappears once the driver has been
> > > > > > > unloaded b/c the event's irq-info is removed (as the driver is
> > > > > > > unloaded and free_irq is called).
> > > > > > >
> > > > > > > Stefano,
> > > > > > > Is there a specific write to the MSI structure that would cause the
> > > > > > > hypervisor to drop the PIRQ? Or a nice hypercall to "free" an
> > > > > > > PIRQ in usage?
> > > > > >
> > > > > > We already have a "free PIRQ" hypercall, it's called
> > > > > > PHYSDEVOP_unmap_pirq and should be called by QEMU.
> > > > >
> > > > > Considering that we call function that allocates (PHYSDEVOP_get_free_pirq)
> > > > > it in the Linux kernel (and not in QEMU), perhaps that should be done in the
> > > > > Linux kernel as part of xen_destroy_irq()? Or would that confuse QEMU?
> > > >
> > > > I think it would confuse QEMU. It is probably better to let the unmap
> > > > being handled by it.
> > > >
> > > >
> > > > > It looks like QEMU only does that hypercall (via xc_physdev_unmap_pirq)
> > > > > unregister_real_device which is only called during pci unplug?
> > > >
> > > > You are right! I would think that this behaviour is erroneous unless it
> > > > was done on purpose to avoid allocating MSIs twice.
> > > > If that is the case we would need to do something similar in Linux too.
> > > >
> > > > I think that the issue is the mismatch between QEMU's and Linux's
> > > > behaviours: either both should be allocating MSIs once, or they should
> > > > both be allocating and deallocating MSIs every time the driver is loaded
> > > > and unloaded.
> > >
> > > Right. But we also have the scenario that QEMU and Linux are going to
> > > be out of sync. So we need fixes in both places - I think.
> >
> > QEMU is the owner of the pirq, in fact it is the one that creates and
> > destroys the mapping. I think that the right place to fix this problem
> > is in QEMU, the ABI would be much cleaner as a result. As a side effect
> > we don't need to make any changes in Linux.
>
> You do. You need to remove the PHYSDEVOP_get_free_pirq call in that case.

PHYSDEVOP_get_free_pirq needs to stay, because Linux needs to know the
pirq that QEMU is going to use.

However I would let QEMU handle the mapping (it already does that in
pt_msi_setup calling xc_physdev_map_pirq_msi) and unmapping (that is
done by calling xc_domain_unbind_msi_irq from pt_msi_disable).
I think the problem is that pt_msi_disable is only called on
unregister_real_device and pt_reset_interrupt_and_io_mapping, not when
the guest disables MSIs.
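Sketched below is the missing QEMU-side path being described. Only pt_msi_disable() and the xc_* calls are names taken from the discussion above; the hook name, the struct usage, and the control-word plumbing are assumptions for illustration:

/* Hypothetical hook in qemu-xen's pass-through code, called when the guest
 * writes the MSI control word in the emulated config space.  Illustration
 * of the idea only, not actual QEMU code. */
static void pt_msi_ctrl_write_hook(struct pt_dev *assigned_device,
                                   uint16_t new_ctrl)
{
    if (!(new_ctrl & PCI_MSI_FLAGS_ENABLE))
        /* Tear the mapping down now (ends up in xc_domain_unbind_msi_irq /
         * xc_physdev_unmap_pirq) instead of waiting for
         * unregister_real_device() at PCI unplug time. */
        pt_msi_disable(assigned_device);
}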

2013-05-14 14:20:49

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Tue, May 14, 2013 at 02:49:50PM +0100, Stefano Stabellini wrote:
> On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > On Mon, May 13, 2013 at 06:24:46PM +0100, Stefano Stabellini wrote:
> > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > On Mon, May 13, 2013 at 03:50:52PM +0100, Stefano Stabellini wrote:
> > > > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini wrote:
> > > > > > > On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > > > On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
> > > > > > > > > When driver load and unload in a loop, pirq will exhaust finally.
> > > > > > > > > Try to use the same pirq which was already mapped and binded at first time
> > > > > > > >
> > > > > > > > So what happens if I unload and reload two drivers in random order?
> > > > > > > >
> > > > > > > > > when driver loaded.
> > > > > > > > >
> > > > > > > > > Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
> > > > > > > > > xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
> > > > > > > > > when driver unload, it's always true in second load.
> > > > > > > >
> > > > > > > > If my understanding is right the issue at hand is that the caching
> > > > > > > > information about the pirq disappears once the driver has been
> > > > > > > > unloaded b/c the event's irq-info is removed (as the driver is
> > > > > > > > unloaded and free_irq is called).
> > > > > > > >
> > > > > > > > Stefano,
> > > > > > > > Is there a specific write to the MSI structure that would cause the
> > > > > > > > hypervisor to drop the PIRQ? Or a nice hypercall to "free" an
> > > > > > > > PIRQ in usage?
> > > > > > >
> > > > > > > We already have a "free PIRQ" hypercall, it's called
> > > > > > > PHYSDEVOP_unmap_pirq and should be called by QEMU.
> > > > > >
> > > > > > Considering that we call function that allocates (PHYSDEVOP_get_free_pirq)
> > > > > > it in the Linux kernel (and not in QEMU), perhaps that should be done in the
> > > > > > Linux kernel as part of xen_destroy_irq()? Or would that confuse QEMU?
> > > > >
> > > > > I think it would confuse QEMU. It is probably better to let the unmap
> > > > > being handled by it.
> > > > >
> > > > >
> > > > > > It looks like QEMU only does that hypercall (via xc_physdev_unmap_pirq)
> > > > > > unregister_real_device which is only called during pci unplug?
> > > > >
> > > > > You are right! I would think that this behaviour is erroneous unless it
> > > > > was done on purpose to avoid allocating MSIs twice.
> > > > > If that is the case we would need to do something similar in Linux too.
> > > > >
> > > > > I think that the issue is the mismatch between QEMU's and Linux's
> > > > > behaviours: either both should be allocating MSIs once, or they should
> > > > > both be allocating and deallocating MSIs every time the driver is loaded
> > > > > and unloaded.
> > > >
> > > > Right. But we also have the scenario that QEMU and Linux are going to
> > > > be out of sync. So we need fixes in both places - I think.
> > >
> > > QEMU is the owner of the pirq, in fact it is the one that creates and
> > > destroys the mapping. I think that the right place to fix this problem
> > > is in QEMU, the ABI would be much cleaner as a result. As a side effect
> > > we don't need to make any changes in Linux.
> >
> > You do. You need to remove the PHYSDEVOP_get_free_pirq call in that case.
>
> PHYSDEVOP_get_free_pirq needs to stay, because Linux needs to know the
> pirq that QEMU is going to use.

That looks like an API violation. We have a hypercall that
allocates the PIRQ in Linux, then two hypercalls in the QEMU
layer - one to map, and the other to unmap and free.


>
> However I would let QEMU handle the mapping (it already does that in
> pt_msi_setup calling xc_physdev_map_pirq_msi) and unmapping (that is
> done by calling xc_domain_unbind_msi_irq from pt_msi_disable).
> I think the problem is that pt_msi_disable is only called on
> unregister_real_device and pt_reset_interrupt_and_io_mapping, not when
> the guest disables MSIs.

Sure, I am not disputing that. I think the fix in QEMU to call the
unmap is correct.

But I am also wondering whether it makes sense to do that in the Linux
kernel - as it does the alloc in the first place. Seems like a bit of
duct-tape has been used to connect this plumbing together.

2013-05-15 09:41:14

by Stefano Stabellini

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Tue, 14 May 2013, Konrad Rzeszutek Wilk wrote:
> On Tue, May 14, 2013 at 02:49:50PM +0100, Stefano Stabellini wrote:
> > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > On Mon, May 13, 2013 at 06:24:46PM +0100, Stefano Stabellini wrote:
> > > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > On Mon, May 13, 2013 at 03:50:52PM +0100, Stefano Stabellini wrote:
> > > > > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > > On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini wrote:
> > > > > > > > On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > > > > On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
> > > > > > > > > > When driver load and unload in a loop, pirq will exhaust finally.
> > > > > > > > > > Try to use the same pirq which was already mapped and binded at first time
> > > > > > > > >
> > > > > > > > > So what happens if I unload and reload two drivers in random order?
> > > > > > > > >
> > > > > > > > > > when driver loaded.
> > > > > > > > > >
> > > > > > > > > > Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
> > > > > > > > > > xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
> > > > > > > > > > when driver unload, it's always true in second load.
> > > > > > > > >
> > > > > > > > > If my understanding is right the issue at hand is that the caching
> > > > > > > > > information about the pirq disappears once the driver has been
> > > > > > > > > unloaded b/c the event's irq-info is removed (as the driver is
> > > > > > > > > unloaded and free_irq is called).
> > > > > > > > >
> > > > > > > > > Stefano,
> > > > > > > > > Is there a specific write to the MSI structure that would cause the
> > > > > > > > > hypervisor to drop the PIRQ? Or a nice hypercall to "free" an
> > > > > > > > > PIRQ in usage?
> > > > > > > >
> > > > > > > > We already have a "free PIRQ" hypercall, it's called
> > > > > > > > PHYSDEVOP_unmap_pirq and should be called by QEMU.
> > > > > > >
> > > > > > > Considering that we call function that allocates (PHYSDEVOP_get_free_pirq)
> > > > > > > it in the Linux kernel (and not in QEMU), perhaps that should be done in the
> > > > > > > Linux kernel as part of xen_destroy_irq()? Or would that confuse QEMU?
> > > > > >
> > > > > > I think it would confuse QEMU. It is probably better to let the unmap
> > > > > > being handled by it.
> > > > > >
> > > > > >
> > > > > > > It looks like QEMU only does that hypercall (via xc_physdev_unmap_pirq)
> > > > > > > unregister_real_device which is only called during pci unplug?
> > > > > >
> > > > > > You are right! I would think that this behaviour is erroneous unless it
> > > > > > was done on purpose to avoid allocating MSIs twice.
> > > > > > If that is the case we would need to do something similar in Linux too.
> > > > > >
> > > > > > I think that the issue is the mismatch between QEMU's and Linux's
> > > > > > behaviours: either both should be allocating MSIs once, or they should
> > > > > > both be allocating and deallocating MSIs every time the driver is loaded
> > > > > > and unloaded.
> > > > >
> > > > > Right. But we also have the scenario that QEMU and Linux are going to
> > > > > be out of sync. So we need fixes in both places - I think.
> > > >
> > > > QEMU is the owner of the pirq, in fact it is the one that creates and
> > > > destroys the mapping. I think that the right place to fix this problem
> > > > is in QEMU, the ABI would be much cleaner as a result. As a side effect
> > > > we don't need to make any changes in Linux.
> > >
> > > You do. You need to remove the PHYSDEVOP_get_free_pirq call in that case.
> >
> > PHYSDEVOP_get_free_pirq needs to stay, because Linux needs to know the
> > pirq that QEMU is going to use.
>
> That looks like an API violation. We have an hypercall that
> allocates the PIRQ in the Linux, then two hypercalls in the QEMU
> layer - one to map, and the other to unmap and free.
>
> > However I would let QEMU handle the mapping (it already does that in
> > pt_msi_setup calling xc_physdev_map_pirq_msi) and unmapping (that is
> > done by calling xc_domain_unbind_msi_irq from pt_msi_disable).
> > I think the problem is that pt_msi_disable is only called on
> > unregister_real_device and pt_reset_interrupt_and_io_mapping, not when
> > the guest disables MSIs.
>
> Sure, I am not disputing that. I think the fix in QEMU to call the
> unmap is correct.
>
> But I am also wondering whether it makes sense to do that in the Linux
> kernel - as it does the alloc in the first place. Seems like a bit of
> duct-tape has been used to connect this plumbing together.


I admit that it is not a great interface.
I would be open to options that move the entire setup/freeing in Linux,
but keep in mind that we need to retain the pirq code in QEMU for pure
HVM guests.

2013-05-15 14:19:00

by Zhenzhong Duan

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time


On 2013-05-15 17:41, Stefano Stabellini wrote:
> On Tue, 14 May 2013, Konrad Rzeszutek Wilk wrote:
>> On Tue, May 14, 2013 at 02:49:50PM +0100, Stefano Stabellini wrote:
>>> On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
>>>> On Mon, May 13, 2013 at 06:24:46PM +0100, Stefano Stabellini wrote:
>>>>> On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
>>>>>> On Mon, May 13, 2013 at 03:50:52PM +0100, Stefano Stabellini wrote:
>>>>>>> On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
>>>>>>>> On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini wrote:
>>>>>>>>> On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
>>>>>>>>>> On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
>>>>>>>>>>> When driver load and unload in a loop, pirq will exhaust finally.
>>>>>>>>>>> Try to use the same pirq which was already mapped and binded at first time
>>>>>>>>>> So what happens if I unload and reload two drivers in random order?
>>>>>>>>>>
>>>>>>>>>>> when driver loaded.
>>>>>>>>>>>
>>>>>>>>>>> Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
>>>>>>>>>>> xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
>>>>>>>>>>> when driver unload, it's always true in second load.
>>>>>>>>>> If my understanding is right the issue at hand is that the caching
>>>>>>>>>> information about the pirq disappears once the driver has been
>>>>>>>>>> unloaded b/c the event's irq-info is removed (as the driver is
>>>>>>>>>> unloaded and free_irq is called).
>>>>>>>>>>
>>>>>>>>>> Stefano,
>>>>>>>>>> Is there a specific write to the MSI structure that would cause the
>>>>>>>>>> hypervisor to drop the PIRQ? Or a nice hypercall to "free" an
>>>>>>>>>> PIRQ in usage?
>>>>>>>>> We already have a "free PIRQ" hypercall, it's called
>>>>>>>>> PHYSDEVOP_unmap_pirq and should be called by QEMU.
>>>>>>>> Considering that we call function that allocates (PHYSDEVOP_get_free_pirq)
>>>>>>>> it in the Linux kernel (and not in QEMU), perhaps that should be done in the
>>>>>>>> Linux kernel as part of xen_destroy_irq()? Or would that confuse QEMU?
>>>>>>> I think it would confuse QEMU. It is probably better to let the unmap
>>>>>>> being handled by it.
>>>>>>>
>>>>>>>
>>>>>>>> It looks like QEMU only does that hypercall (via xc_physdev_unmap_pirq)
>>>>>>>> unregister_real_device which is only called during pci unplug?
>>>>>>> You are right! I would think that this behaviour is erroneous unless it
>>>>>>> was done on purpose to avoid allocating MSIs twice.
>>>>>>> If that is the case we would need to do something similar in Linux too.
>>>>>>>
>>>>>>> I think that the issue is the mismatch between QEMU's and Linux's
>>>>>>> behaviours: either both should be allocating MSIs once, or they should
>>>>>>> both be allocating and deallocating MSIs every time the driver is loaded
>>>>>>> and unloaded.
>>>>>> Right. But we also have the scenario that QEMU and Linux are going to
>>>>>> be out of sync. So we need fixes in both places - I think.
>>>>> QEMU is the owner of the pirq, in fact it is the one that creates and
>>>>> destroys the mapping. I think that the right place to fix this problem
>>>>> is in QEMU, the ABI would be much cleaner as a result. As a side effect
>>>>> we don't need to make any changes in Linux.
>>>> You do. You need to remove the PHYSDEVOP_get_free_pirq call in that case.
>>>
>>> PHYSDEVOP_get_free_pirq needs to stay, because Linux needs to know the
>>> pirq that QEMU is going to use.
>> That looks like an API violation. We have an hypercall that
>> allocates the PIRQ in the Linux, then two hypercalls in the QEMU
>> layer - one to map, and the other to unmap and free.
>>
>>> However I would let QEMU handle the mapping (it already does that in
>>> pt_msi_setup calling xc_physdev_map_pirq_msi) and unmapping (that is
>>> done by calling xc_domain_unbind_msi_irq from pt_msi_disable).
>>> I think the problem is that pt_msi_disable is only called on
>>> unregister_real_device and pt_reset_interrupt_and_io_mapping, not when
>>> the guest disables MSIs.
>> Sure, I am not disputing that. I think the fix in QEMU to call the
>> unmap is correct.
>>
>> But I am also wondering whether it makes sense to do that in the Linux
>> kernel - as it does the alloc in the first place. Seems like a bit of
>> duct-tape has been used to connect this plumbing together.
>
> I admit that it is not a great interface.
> I would be open to options that move the entire setup/freeing in Linux,
> but keep in mind that we need to retain the pirq code in QEMU for pure
> HVM guests.
It seems MAP_PIRQ_TYPE_MSI is not supported in physdev_hvm_map_pirq yet.
Moving the setup/freeing into Linux would need that support to be added first.

zduan
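Roughly what is being pointed out on the hypervisor side, as a paraphrased sketch (not verbatim Xen source; the signature and details are approximations): the HVM map path only handles GSIs, so an MSI-type map request from an HVM guest is rejected:

/* Paraphrased sketch of the HVM map handler in xen/arch/x86/physdev.c. */
static int physdev_hvm_map_pirq(struct domain *d, struct physdev_map_pirq *map)
{
    int ret;

    switch ( map->type )
    {
    case MAP_PIRQ_TYPE_GSI:
        /* ... map the emulated GSI to a pirq ... */
        ret = 0;
        break;
    default:
        /* MAP_PIRQ_TYPE_MSI falls through here today. */
        ret = -EINVAL;
        break;
    }

    return ret;
}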

2013-05-17 02:22:16

by Zhenzhong Duan

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time


On 2013-05-15 17:41, Stefano Stabellini wrote:
> On Tue, 14 May 2013, Konrad Rzeszutek Wilk wrote:
>> On Tue, May 14, 2013 at 02:49:50PM +0100, Stefano Stabellini wrote:
>>> On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
>>>> On Mon, May 13, 2013 at 06:24:46PM +0100, Stefano Stabellini wrote:
>>>>> On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
>>>>>> On Mon, May 13, 2013 at 03:50:52PM +0100, Stefano Stabellini wrote:
>>>>>>> On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
>>>>>>>> On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini wrote:
>>>>>>>>> On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
>>>>>>>>>> On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan wrote:
>>>>>>>>>>> When driver load and unload in a loop, pirq will exhaust finally.
>>>>>>>>>>> Try to use the same pirq which was already mapped and binded at first time
>>>>>>>>>> So what happens if I unload and reload two drivers in random order?
>>>>>>>>>>
>>>>>>>>>>> when driver loaded.
>>>>>>>>>>>
>>>>>>>>>>> Read pirq from msix entry and test if data is XEN_PIRQ_MSI_DATA
>>>>>>>>>>> xen_irq_from_pirq(pirq) < 0 checking is wrong as irq will be freed
>>>>>>>>>>> when driver unload, it's always true in second load.
>>>>>>>>>> If my understanding is right the issue at hand is that the caching
>>>>>>>>>> information about the pirq disappears once the driver has been
>>>>>>>>>> unloaded b/c the event's irq-info is removed (as the driver is
>>>>>>>>>> unloaded and free_irq is called).
>>>>>>>>>>
>>>>>>>>>> Stefano,
>>>>>>>>>> Is there a specific write to the MSI structure that would cause the
>>>>>>>>>> hypervisor to drop the PIRQ? Or a nice hypercall to "free" an
>>>>>>>>>> PIRQ in usage?
>>>>>>>>> We already have a "free PIRQ" hypercall, it's called
>>>>>>>>> PHYSDEVOP_unmap_pirq and should be called by QEMU.
>>>>>>>> Considering that we call function that allocates (PHYSDEVOP_get_free_pirq)
>>>>>>>> it in the Linux kernel (and not in QEMU), perhaps that should be done in the
>>>>>>>> Linux kernel as part of xen_destroy_irq()? Or would that confuse QEMU?
>>>>>>> I think it would confuse QEMU. It is probably better to let the unmap
>>>>>>> being handled by it.
>>>>>>>
>>>>>>>
>>>>>>>> It looks like QEMU only does that hypercall (via xc_physdev_unmap_pirq)
>>>>>>>> unregister_real_device which is only called during pci unplug?
>>>>>>> You are right! I would think that this behaviour is erroneous unless it
>>>>>>> was done on purpose to avoid allocating MSIs twice.
>>>>>>> If that is the case we would need to do something similar in Linux too.
>>>>>>>
>>>>>>> I think that the issue is the mismatch between QEMU's and Linux's
>>>>>>> behaviours: either both should be allocating MSIs once, or they should
>>>>>>> both be allocating and deallocating MSIs every time the driver is loaded
>>>>>>> and unloaded.
>>>>>> Right. But we also have the scenario that QEMU and Linux are going to
>>>>>> be out of sync. So we need fixes in both places - I think.
>>>>> QEMU is the owner of the pirq, in fact it is the one that creates and
>>>>> destroys the mapping. I think that the right place to fix this problem
>>>>> is in QEMU, the ABI would be much cleaner as a result. As a side effect
>>>>> we don't need to make any changes in Linux.
>>>> You do. You need to remove the PHYSDEVOP_get_free_pirq call in that case.
>>>
>>> PHYSDEVOP_get_free_pirq needs to stay, because Linux needs to know the
>>> pirq that QEMU is going to use.
>> That looks like an API violation. We have an hypercall that
>> allocates the PIRQ in the Linux, then two hypercalls in the QEMU
>> layer - one to map, and the other to unmap and free.
>>
>>> However I would let QEMU handle the mapping (it already does that in
>>> pt_msi_setup calling xc_physdev_map_pirq_msi) and unmapping (that is
>>> done by calling xc_domain_unbind_msi_irq from pt_msi_disable).
>>> I think the problem is that pt_msi_disable is only called on
>>> unregister_real_device and pt_reset_interrupt_and_io_mapping, not when
>>> the guest disables MSIs.
>> Sure, I am not disputing that. I think the fix in QEMU to call the
>> unmap is correct.
>>
>> But I am also wondering whether it makes sense to do that in the Linux
>> kernel - as it does the alloc in the first place. Seems like a bit of
>> duct-tape has been used to connect this plumbing together.
>
> I admit that it is not a great interface.
> I would be open to options that move the entire setup/freeing in Linux,
> but keep in mind that we need to retain the pirq code in QEMU for pure
> HVM guests.
Hi Stefano,

could you work out a patch for me to test?

thanks
zduan

2013-05-20 10:24:17

by Stefano Stabellini

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Fri, 17 May 2013, Zhenzhong Duan wrote:
> On 2013-05-15 17:41, Stefano Stabellini wrote:
> > On Tue, 14 May 2013, Konrad Rzeszutek Wilk wrote:
> > > On Tue, May 14, 2013 at 02:49:50PM +0100, Stefano Stabellini wrote:
> > > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > On Mon, May 13, 2013 at 06:24:46PM +0100, Stefano Stabellini wrote:
> > > > > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > > On Mon, May 13, 2013 at 03:50:52PM +0100, Stefano Stabellini
> > > > > > > wrote:
> > > > > > > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > > > > On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini
> > > > > > > > > wrote:
> > > > > > > > > > On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > > > > > > On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan
> > > > > > > > > > > wrote:
> > > > > > > > > > > > When driver load and unload in a loop, pirq will exhaust
> > > > > > > > > > > > finally.
> > > > > > > > > > > > Try to use the same pirq which was already mapped and
> > > > > > > > > > > > binded at first time
> > > > > > > > > > > So what happens if I unload and reload two drivers in
> > > > > > > > > > > random order?
> > > > > > > > > > >
> > > > > > > > > > > > when driver loaded.
> > > > > > > > > > > >
> > > > > > > > > > > > Read pirq from msix entry and test if data is
> > > > > > > > > > > > XEN_PIRQ_MSI_DATA
> > > > > > > > > > > > xen_irq_from_pirq(pirq) < 0 checking is wrong as irq
> > > > > > > > > > > > will be freed
> > > > > > > > > > > > when driver unload, it's always true in second load.
> > > > > > > > > > > If my understanding is right the issue at hand is that the
> > > > > > > > > > > caching
> > > > > > > > > > > information about the pirq disappears once the driver has
> > > > > > > > > > > been
> > > > > > > > > > > unloaded b/c the event's irq-info is removed (as the
> > > > > > > > > > > driver is
> > > > > > > > > > > unloaded and free_irq is called).
> > > > > > > > > > >
> > > > > > > > > > > Stefano,
> > > > > > > > > > > Is there a specific write to the MSI structure that would
> > > > > > > > > > > cause the
> > > > > > > > > > > hypervisor to drop the PIRQ? Or a nice hypercall to "free"
> > > > > > > > > > > an
> > > > > > > > > > > PIRQ in usage?
> > > > > > > > > > We already have a "free PIRQ" hypercall, it's called
> > > > > > > > > > PHYSDEVOP_unmap_pirq and should be called by QEMU.
> > > > > > > > > Considering that we call function that allocates
> > > > > > > > > (PHYSDEVOP_get_free_pirq)
> > > > > > > > > it in the Linux kernel (and not in QEMU), perhaps that should
> > > > > > > > > be done in the
> > > > > > > > > Linux kernel as part of xen_destroy_irq()? Or would that
> > > > > > > > > confuse QEMU?
> > > > > > > > I think it would confuse QEMU. It is probably better to let the
> > > > > > > > unmap
> > > > > > > > being handled by it.
> > > > > > > >
> > > > > > > >
> > > > > > > > > It looks like QEMU only does that hypercall (via
> > > > > > > > > xc_physdev_unmap_pirq)
> > > > > > > > > unregister_real_device which is only called during pci unplug?
> > > > > > > > You are right! I would think that this behaviour is erroneous
> > > > > > > > unless it
> > > > > > > > was done on purpose to avoid allocating MSIs twice.
> > > > > > > > If that is the case we would need to do something similar in
> > > > > > > > Linux too.
> > > > > > > >
> > > > > > > > I think that the issue is the mismatch between QEMU's and
> > > > > > > > Linux's
> > > > > > > > behaviours: either both should be allocating MSIs once, or they
> > > > > > > > should
> > > > > > > > both be allocating and deallocating MSIs every time the driver
> > > > > > > > is loaded
> > > > > > > > and unloaded.
> > > > > > > Right. But we also have the scenario that QEMU and Linux are going
> > > > > > > to
> > > > > > > be out of sync. So we need fixes in both places - I think.
> > > > > > QEMU is the owner of the pirq, in fact it is the one that creates
> > > > > > and
> > > > > > destroys the mapping. I think that the right place to fix this
> > > > > > problem
> > > > > > is in QEMU, the ABI would be much cleaner as a result. As a side
> > > > > > effect
> > > > > > we don't need to make any changes in Linux.
> > > > > You do. You need to remove the PHYSDEVOP_get_free_pirq call in that
> > > > > case.
> > > > PHYSDEVOP_get_free_pirq needs to stay, because Linux needs to know the
> > > > pirq that QEMU is going to use.
> > > That looks like an API violation. We have an hypercall that
> > > allocates the PIRQ in the Linux, then two hypercalls in the QEMU
> > > layer - one to map, and the other to unmap and free.
> > >
> > > > However I would let QEMU handle the mapping (it already does that in
> > > > pt_msi_setup calling xc_physdev_map_pirq_msi) and unmapping (that is
> > > > done by calling xc_domain_unbind_msi_irq from pt_msi_disable).
> > > > I think the problem is that pt_msi_disable is only called on
> > > > unregister_real_device and pt_reset_interrupt_and_io_mapping, not when
> > > > the guest disables MSIs.
> > > Sure, I am not disputing that. I think the fix in QEMU to call the
> > > unmap is correct.
> > >
> > > But I am also wondering whether it makes sense to do that in the Linux
> > > kernel - as it does the alloc in the first place. Seems like a bit of
> > > duct-tape has been used to connect this plumbing together.
> >
> > I admit that it is not a great interface.
> > I would be open to options that move the entire setup/freeing in Linux,
> > but keep in mind that we need to retain the pirq code in QEMU for pure
> > HVM guests.
> Hi Stefano,
>
> do you work out a patch for me to test?

I'll be traveling/busy for a few weeks, maybe it's best if someone else
picks up this work item.

2013-05-20 15:24:41

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

On Mon, May 20, 2013 at 11:24:04AM +0100, Stefano Stabellini wrote:
> On Fri, 17 May 2013, Zhenzhong Duan wrote:
> > On 2013-05-15 17:41, Stefano Stabellini wrote:
> > > On Tue, 14 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > On Tue, May 14, 2013 at 02:49:50PM +0100, Stefano Stabellini wrote:
> > > > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > On Mon, May 13, 2013 at 06:24:46PM +0100, Stefano Stabellini wrote:
> > > > > > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > > > On Mon, May 13, 2013 at 03:50:52PM +0100, Stefano Stabellini
> > > > > > > > wrote:
> > > > > > > > > On Mon, 13 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > > > > > On Mon, May 13, 2013 at 12:06:43PM +0100, Stefano Stabellini
> > > > > > > > > > wrote:
> > > > > > > > > > > On Fri, 10 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > > > > > > > > On Wed, May 08, 2013 at 04:18:24PM +0800, Zhenzhong Duan
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > When driver load and unload in a loop, pirq will exhaust
> > > > > > > > > > > > > finally.
> > > > > > > > > > > > > Try to use the same pirq which was already mapped and
> > > > > > > > > > > > > binded at first time
> > > > > > > > > > > > So what happens if I unload and reload two drivers in
> > > > > > > > > > > > random order?
> > > > > > > > > > > >
> > > > > > > > > > > > > when driver loaded.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Read pirq from msix entry and test if data is
> > > > > > > > > > > > > XEN_PIRQ_MSI_DATA
> > > > > > > > > > > > > xen_irq_from_pirq(pirq) < 0 checking is wrong as irq
> > > > > > > > > > > > > will be freed
> > > > > > > > > > > > > when driver unload, it's always true in second load.
> > > > > > > > > > > > If my understanding is right the issue at hand is that the
> > > > > > > > > > > > caching
> > > > > > > > > > > > information about the pirq disappears once the driver has
> > > > > > > > > > > > been
> > > > > > > > > > > > unloaded b/c the event's irq-info is removed (as the
> > > > > > > > > > > > driver is
> > > > > > > > > > > > unloaded and free_irq is called).
> > > > > > > > > > > >
> > > > > > > > > > > > Stefano,
> > > > > > > > > > > > Is there a specific write to the MSI structure that would
> > > > > > > > > > > > cause the
> > > > > > > > > > > > hypervisor to drop the PIRQ? Or a nice hypercall to "free"
> > > > > > > > > > > > an
> > > > > > > > > > > > PIRQ in usage?
> > > > > > > > > > > We already have a "free PIRQ" hypercall, it's called
> > > > > > > > > > > PHYSDEVOP_unmap_pirq and should be called by QEMU.
> > > > > > > > > > Considering that we call function that allocates
> > > > > > > > > > (PHYSDEVOP_get_free_pirq)
> > > > > > > > > > it in the Linux kernel (and not in QEMU), perhaps that should
> > > > > > > > > > be done in the
> > > > > > > > > > Linux kernel as part of xen_destroy_irq()? Or would that
> > > > > > > > > > confuse QEMU?
> > > > > > > > > I think it would confuse QEMU. It is probably better to let the
> > > > > > > > > unmap
> > > > > > > > > being handled by it.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > It looks like QEMU only does that hypercall (via
> > > > > > > > > > xc_physdev_unmap_pirq)
> > > > > > > > > > unregister_real_device which is only called during pci unplug?
> > > > > > > > > You are right! I would think that this behaviour is erroneous
> > > > > > > > > unless it
> > > > > > > > > was done on purpose to avoid allocating MSIs twice.
> > > > > > > > > If that is the case we would need to do something similar in
> > > > > > > > > Linux too.
> > > > > > > > >
> > > > > > > > > I think that the issue is the mismatch between QEMU's and
> > > > > > > > > Linux's
> > > > > > > > > behaviours: either both should be allocating MSIs once, or they
> > > > > > > > > should
> > > > > > > > > both be allocating and deallocating MSIs every time the driver
> > > > > > > > > is loaded
> > > > > > > > > and unloaded.
> > > > > > > > Right. But we also have the scenario that QEMU and Linux are going
> > > > > > > > to
> > > > > > > > be out of sync. So we need fixes in both places - I think.
> > > > > > > QEMU is the owner of the pirq, in fact it is the one that creates
> > > > > > > and
> > > > > > > destroys the mapping. I think that the right place to fix this
> > > > > > > problem
> > > > > > > is in QEMU, the ABI would be much cleaner as a result. As a side
> > > > > > > effect
> > > > > > > we don't need to make any changes in Linux.
> > > > > > You do. You need to remove the PHYSDEVOP_get_free_pirq call in that
> > > > > > case.
> > > > > PHYSDEVOP_get_free_pirq needs to stay, because Linux needs to know the
> > > > > pirq that QEMU is going to use.
> > > > That looks like an API violation. We have an hypercall that
> > > > allocates the PIRQ in the Linux, then two hypercalls in the QEMU
> > > > layer - one to map, and the other to unmap and free.
> > > >
> > > > > However I would let QEMU handle the mapping (it already does that in
> > > > > pt_msi_setup calling xc_physdev_map_pirq_msi) and unmapping (that is
> > > > > done by calling xc_domain_unbind_msi_irq from pt_msi_disable).
> > > > > I think the problem is that pt_msi_disable is only called on
> > > > > unregister_real_device and pt_reset_interrupt_and_io_mapping, not when
> > > > > the guest disables MSIs.
> > > > Sure, I am not disputing that. I think the fix in QEMU to call the
> > > > unmap is correct.
> > > >
> > > > But I am also wondering whether it makes sense to do that in the Linux
> > > > kernel - as it does the alloc in the first place. Seems like a bit of
> > > > duct-tape has been used to connect this plumbing together.
> > >
> > > I admit that it is not a great interface.
> > > I would be open to options that move the entire setup/freeing in Linux,
> > > but keep in mind that we need to retain the pirq code in QEMU for pure
> > > HVM guests.
> > Hi Stefano,
> >
> > do you work out a patch for me to test?

I did the patch which calls PHYSDEVOP_unmap_pirq from within the
Linux kernel, and it did not fix a thing. Here is the patch (sorry about
the copy-n-paste):
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
index 6a6bbe4..d122ca9 100644
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -789,21 +789,28 @@ error_irq:
return ret;
}
#endif
-
+static bool unmap_pirq = false;
+static int __init unmap_pirq_setup(char *s)
+{
+ unmap_pirq = true;
+ return 1;
+}
+__setup("unmap_pirq", unmap_pirq_setup);
int xen_destroy_irq(int irq)
{
struct irq_desc *desc;
struct physdev_unmap_pirq unmap_irq;
struct irq_info *info = info_for_irq(irq);
int rc = -ENOENT;
-
mutex_lock(&irq_mapping_update_lock);

desc = irq_to_desc(irq);
if (!desc)
goto out;

- if (xen_initial_domain()) {
+ if (xen_initial_domain() || unmap_pirq) {
+ printk(KERN_INFO "irq: %d, pirq:%d unmapping\n",
+ irq, info->u.pirq.pirq);
unmap_irq.pirq = info->u.pirq.pirq;
unmap_irq.domid = info->u.pirq.domid;
rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);


Meaning that the later call to xen_allocate_pirq_msi would still get new
PIRQ values.

On the other hand, the original patch proposed by Zhenzhong works nicely.


.. I am wondering what is going wrong here.
>
> I'll be traveling/busy for a few weeks, maybe it's best if someone else
> picks up this work item.

2013-05-20 17:57:28

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

> > Hi Stefano,
> >
> > do you work out a patch for me to test?
>
> I'll be traveling/busy for a few weeks, maybe it's best if someone else
> picks up this work item.

This little test-case below should have worked:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/pagemap.h>
#include <linux/init.h>
#include <xen/xen.h>
#include <xen/page.h>
#include <asm/xen/hypervisor.h>
#include <xen/features.h>
#include <xen/events.h>

MODULE_AUTHOR("Konrad Rzeszutek Wilk <[email protected]>");
MODULE_DESCRIPTION("alloc_and_unmap");
MODULE_LICENSE("GPL");
MODULE_VERSION("0.1");

static int do_it(void)
{
struct physdev_get_free_pirq op_get_free_pirq;
struct physdev_unmap_pirq unmap_irq;
int rc, pirq;

op_get_free_pirq.type = MAP_PIRQ_TYPE_MSI;
rc = HYPERVISOR_physdev_op(PHYSDEVOP_get_free_pirq, &op_get_free_pirq);
if (rc) {
printk(KERN_WARNING "%s:%d rc:%d\n", __func__, __LINE__, rc);
return rc;
}
pirq = op_get_free_pirq.pirq;
unmap_irq.pirq = pirq;
unmap_irq.domid = DOMID_SELF;
rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
if (rc) {
printk(KERN_WARNING "unmap irq failed %d\n", rc);
return rc;
}
printk("PIRQ: %d\n", pirq);
return 0;
}
static int __init alloc_and_unmap_init(void)
{
int i;

for (i = 0; i < 10; i++)
if (do_it())
break;
return 0;
}
static void __exit alloc_and_unmap_exit(void)
{
}
module_init(alloc_and_unmap_init);
module_exit(alloc_and_unmap_exit);


But I get:

# insmod /alloc_and_unmap.ko
[ 34.899277] PIRQ: 55
[ 34.901846] PIRQ: 54
[ 34.904351] PIRQ: 53
[ 34.906921] PIRQ: 52
[ 34.909451] PIRQ: 51
[ 34.912038] PIRQ: 50
[ 34.914650] PIRQ: 49
[ 34.917205] PIRQ: 48
[ 34.919776] PIRQ: 47
[ 34.922339] PIRQ: 46

Which means there is some bug in the hypervisor as well (This is with Xen 4.3
and traditional QEMU - not that it matters as I am just doing these hypercalls).


At this point I think the upstream option is to save the PIRQ value and re-use it.
Will post a patch for it.

2013-05-20 20:39:17

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH] xen: reuse the same pirq allocated when driver load first time

> At this point I think that upstream option is to save the PIRQ value and re-use it.
> Will post a patch for it.

Here is the patch. It works for me when passing in a NIC driver.

From 509499568d1cdf1f2a3fb53773c991f4b063eb56 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <[email protected]>
Date: Mon, 20 May 2013 16:08:16 -0400
Subject: [PATCH] xen/pci: Track PVHVM PIRQs.

The PIRQs that the hypervisor provides for the guest are a limited
resource. They are acquired via PHYSDEVOP_get_free_pirq and in
theory should be returned back to the hypervisor via PHYSDEVOP_unmap_pirq
hypercall. Unfortunately that is not the case.

This means that if there is a PCI device that has been passed through
to the guest and a loop of 'rmmod <driver>; modprobe <driver>' is run,
we end up exhausting all of the PIRQs that are available.

For example (with kernel built as debug), we get this:
00:05.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
[ 152.659396] e1000e 0000:00:05.0: xen: msi bound to pirq=53
[ 152.665856] e1000e 0000:00:05.0: xen: msi --> pirq=53 --> irq=73
.. snip
[ 188.276835] e1000e 0000:00:05.0: xen: msi bound to pirq=51
[ 188.283194] e1000e 0000:00:05.0: xen: msi --> pirq=51 --> irq=73

.. and so on, until the pirq value is zero. This is an acute problem
when many PCI devices with many MSI-X entries are passed through to the guest.

There is an alternative solution where we assume that on PCI
initialization (so when the user passes in the PCI device) QEMU will
init the MSI and MSI-X entries to zero. Based on that assumption, and
on the fact that the Linux MSI API writes the PIRQ value into the
MSI/MSI-X entry (which QEMU then uses throughout the life-cycle of the
PCI device), we could simply re-use the PIRQ value whenever the MSI
(or MSI-X) entries read back as non-zero. However this patch guards
against future changes in QEMU in case that assumption is incorrect.

Reported-by: Zhenzhong Duan <[email protected]>
CC: Stefano Stabellini <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
drivers/xen/events.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/events.c b/drivers/xen/events.c
index 6a6bbe4..8aae21a 100644
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -112,6 +112,27 @@ struct irq_info {
#define PIRQ_NEEDS_EOI (1 << 0)
#define PIRQ_SHAREABLE (1 << 1)

+/*
+ * The PHYSDEVOP_get_free_pirq allocates a set of PIRQs for the guest and
+ * the PHYSDEVOP_unmap_pirq is supposed to return them to the hypervisor.
+ * Unfortunately that is not the case and we exhaust all of the PIRQs that are
+ * allocated for the domain if a driver is loaded/unloaded in a loop.
+ * The pirq_info serves as a cache of the allocated PIRQs so that we can
+ * reuse them for drivers. Note, it is only used by the MSI and MSI-X routines.
+ */
+#ifdef CONFIG_PCI_MSI
+static LIST_HEAD(xen_pirq_list_head);
+#endif
+enum xen_pirq_status {
+ PIRQ_FREE = 1,
+ PIRQ_BUSY
+};
+struct pirq_info {
+ struct list_head list;
+ int pirq;
+ enum xen_pirq_status status;
+};
+
static int *evtchn_to_irq;
#ifdef CONFIG_X86
static unsigned long *pirq_eoi_map;
@@ -145,6 +166,86 @@ static struct irq_chip xen_pirq_chip;
static void enable_dynirq(struct irq_data *data);
static void disable_dynirq(struct irq_data *data);

+#ifdef CONFIG_PCI_MSI
+static void xen_pirq_add(int pirq)
+{
+ struct pirq_info *p_info;
+
+ if (!xen_hvm_domain())
+ return;
+
+ p_info = kzalloc(sizeof(*p_info), GFP_KERNEL);
+ if (!p_info)
+ return;
+
+ p_info->pirq = pirq;
+ p_info->status = PIRQ_FREE;
+
+ mutex_lock(&irq_mapping_update_lock);
+ list_add_tail(&p_info->list, &xen_pirq_list_head);
+ mutex_unlock(&irq_mapping_update_lock);
+}
+static void xen_update_pirq_status(int pirq, enum xen_pirq_status status)
+{
+ struct pirq_info *p_info = NULL;
+ bool found = false;
+
+ if (!xen_hvm_domain())
+ return;
+
+ /*
+ * The irq_mapping_update_lock is being held while this is called.
+ */
+ list_for_each_entry(p_info, &xen_pirq_list_head, list) {
+
+ if (p_info->pirq != pirq)
+ continue;
+ found = true;
+ break;
+ }
+ if (found && p_info)
+ p_info->status = status;
+}
+static int xen_pirq_get_free(void)
+{
+ int pirq = -EBUSY;
+ struct pirq_info *p_info = NULL;
+
+ if (!xen_hvm_domain())
+ return -ENODEV;
+
+ mutex_lock(&irq_mapping_update_lock);
+ list_for_each_entry(p_info, &xen_pirq_list_head, list) {
+ if (p_info->status == PIRQ_BUSY)
+ continue;
+ pirq = p_info->pirq;
+ break;
+ }
+ mutex_unlock(&irq_mapping_update_lock);
+ return pirq;
+}
+static void xen_pirq_clear_all(void)
+{
+ struct pirq_info *p_info = NULL, *tmp;
+
+ if (!xen_hvm_domain())
+ return;
+
+ list_for_each_entry_safe(p_info, tmp, &xen_pirq_list_head, list) {
+ /*
+ * When migrating to a new host, the PT PCI devices _should_
+ * be unplugged!
+ */
+ WARN_ON(p_info->status == PIRQ_BUSY);
+ list_del(&p_info->list);
+ kfree(p_info);
+ }
+}
+#else
+static void xen_update_pirq_status(int pirq, enum xen_pirq_status status) { }
+static void xen_pirq_clear_all(void) { }
+#endif
+
/* Get info for IRQ */
static struct irq_info *info_for_irq(unsigned irq)
{
@@ -222,6 +323,8 @@ static void xen_irq_info_pirq_init(unsigned irq,
info->u.pirq.gsi = gsi;
info->u.pirq.domid = domid;
info->u.pirq.flags = flags;
+
+ xen_update_pirq_status(pirq, PIRQ_BUSY);
}

/*
@@ -520,6 +622,12 @@ static void xen_free_irq(unsigned irq)
if (WARN_ON(!info))
return;

+ /*
+ * Update the list of PIRQs (for PVHVM guests) to record that this one is free.
+ */
+ if (info->type == IRQT_PIRQ)
+ xen_update_pirq_status(info->u.pirq.pirq, PIRQ_FREE);
+
list_del(&info->list);

irq_set_handler_data(irq, NULL);
@@ -750,15 +858,22 @@ out:
#ifdef CONFIG_PCI_MSI
int xen_allocate_pirq_msi(struct pci_dev *dev, struct msi_desc *msidesc)
{
- int rc;
+ int rc, pirq;
struct physdev_get_free_pirq op_get_free_pirq;

+ pirq = xen_pirq_get_free();
+ if (pirq >= 0)
+ return pirq;
+
op_get_free_pirq.type = MAP_PIRQ_TYPE_MSI;
rc = HYPERVISOR_physdev_op(PHYSDEVOP_get_free_pirq, &op_get_free_pirq);

WARN_ONCE(rc == -ENOSYS,
"hypervisor does not support the PHYSDEVOP_get_free_pirq interface\n");

+ if (rc == 0)
+ xen_pirq_add(op_get_free_pirq.pirq);
+
return rc ? -1 : op_get_free_pirq.pirq;
}

@@ -1631,6 +1746,11 @@ static void restore_pirqs(void)

__startup_pirq(irq);
}
+
+ /* For PVHVM guests with PCI passthrough devices (that got unplugged
+ * before the migrate) we MUST clear our cache.
+ */
+ xen_pirq_clear_all();
}

static void restore_cpu_virqs(unsigned int cpu)
--
1.7.7.6
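
As an illustration only (not kernel code), a small userspace model of the
caching idea in the patch above; the names merely echo the patch, the locking,
the xen_hvm_domain() checks and the migration clear-all are left out, and
PHYSDEVOP_get_free_pirq is simulated by a counter:

/* pirq_cache_model.c - standalone sketch of the free-list above.
 * Build with: cc -Wall -o pirq_cache_model pirq_cache_model.c */
#include <stdio.h>
#include <stdlib.h>

struct pirq_entry {
	int pirq;
	int busy;			/* PIRQ_BUSY / PIRQ_FREE in the patch */
	struct pirq_entry *next;
};

static struct pirq_entry *cache;	/* xen_pirq_list_head in the patch */
static int fresh_pirq = 55;		/* pretend hypervisor pool, counts down */

/* stands in for PHYSDEVOP_get_free_pirq */
static int fake_get_free_pirq(void)
{
	return fresh_pirq--;
}

/* mirrors xen_allocate_pirq_msi(): try the cache first, then the "hypercall" */
static int alloc_pirq(void)
{
	struct pirq_entry *e;

	for (e = cache; e; e = e->next)
		if (!e->busy) {
			e->busy = 1;
			return e->pirq;	/* reuse, no new PIRQ consumed */
		}

	e = calloc(1, sizeof(*e));
	if (!e)
		return -1;
	e->pirq = fake_get_free_pirq();
	e->busy = 1;			/* simplified: the patch marks it busy later, at irq-info init */
	e->next = cache;
	cache = e;
	return e->pirq;
}

/* mirrors xen_free_irq() marking the entry PIRQ_FREE instead of dropping it */
static void release_pirq(int pirq)
{
	struct pirq_entry *e;

	for (e = cache; e; e = e->next)
		if (e->pirq == pirq)
			e->busy = 0;
}

int main(void)
{
	int i, pirq;

	for (i = 0; i < 5; i++) {	/* the rmmod/modprobe loop */
		pirq = alloc_pirq();
		printf("load %d -> pirq %d\n", i, pirq);
		release_pirq(pirq);
	}
	return 0;			/* prints pirq 55 five times: no exhaustion */
}

The only point of the model is that the second and later loads are satisfied
from the cache instead of the hypervisor pool, which is what keeps the PIRQ
count from shrinking.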

2013-05-21 10:07:15

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

On 20/05/13 21:38, Konrad Rzeszutek Wilk wrote:
>> At this point I think that upstream option is to save the PIRQ value and re-use it.
>> Will post a patch for it.
>
> Here is the patch. It works for me when passing in a NIC driver.
>
>>From 509499568d1cdf1f2a3fb53773c991f4b063eb56 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <[email protected]>
> Date: Mon, 20 May 2013 16:08:16 -0400
> Subject: [PATCH] xen/pci: Track PVHVM PIRQs.
>
> The PIRQs that the hypervisor provides for the guest are a limited
> resource. They are acquired via PHYSDEVOP_get_free_pirq and in
> theory should be returned back to the hypervisor via PHYSDEVOP_unmap_pirq
> hypercall. Unfortunatly that is not the case.
>
> This means that if there is a PCI device that has been passed in
> the guest and does a loop of 'rmmod <driver>;modprobe <driver>"
> we end up exhausting all of the PIRQs that are available.
>
> For example (with kernel built as debug), we get this:
> 00:05.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> [ 152.659396] e1000e 0000:00:05.0: xen: msi bound to pirq=53
> [ 152.665856] e1000e 0000:00:05.0: xen: msi --> pirq=53 --> irq=73
> .. snip
> [ 188.276835] e1000e 0000:00:05.0: xen: msi bound to pirq=51
> [ 188.283194] e1000e 0000:00:05.0: xen: msi --> pirq=51 --> irq=73
>
> .. and so on, until the pirq value is zero. This is an acute problem
> when many PCI devices with many MSI-X entries are passed in the guest.
>
> There is an alternative solution where we assume that on PCI
> initialization (so when user passes in the PCI device) QEMU will init
> the MSI and MSI-X entries to zero. Based on that assumptions and
> that the Linux MSI API will write the PIRQ value to the MSI/MSI-X
> (and used by QEMU through the life-cycle of the PCI device), we can
> also depend on that. That means if MSI (or MSI-X entries) are read back
> and are not 0, we can re-use that PIRQ value. However this patch
> guards against future changes in QEMU in case that assumption
> is incorrect.
>
> Reported-by: Zhenzhong Duan <[email protected]>
> CC: Stefano Stabellini <[email protected]>
> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
> ---
> drivers/xen/events.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 files changed, 122 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/xen/events.c b/drivers/xen/events.c
> index 6a6bbe4..8aae21a 100644
> --- a/drivers/xen/events.c
> +++ b/drivers/xen/events.c
> @@ -112,6 +112,27 @@ struct irq_info {
> #define PIRQ_NEEDS_EOI (1 << 0)
> #define PIRQ_SHAREABLE (1 << 1)
>
> +/*
> + * The PHYSDEVOP_get_free_pirq allocates a set of PIRQs for the guest and
> + * the PHYSDEVOP_unmap_pirq is suppose to return them to the hypervisor.
> + * Unfortunatly that is not the case and we exhaust all of the PIRQs that are
> + * allocated for the domain if a driver is loaded/unloaded in a loop.
> + * The pirq_info serves a cache of the allocated PIRQs so that we can reuse
> + * for drivers. Note, it is only used by the MSI, MSI-X routines.
> + */

Ick. Let's fix the bug in the hypervisor instead of hacking up the
kernel like this.

Looking at the hypervisor code I couldn't see anything obviously wrong.
I do note that Xen doesn't free the pirq until it has been unbound by
the guest. Xen will warn if the guest unmaps a pirq that is still bound
("domD: forcing unbind of pirq P"). Is this what is happening? If so,
that would suggest a bug in the guest rather than the hypervisor.

David

2013-05-21 13:41:28

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

> Looking at the hypervisor code I couldn't see anything obviously wrong.

I think the culprit is "physdev_unmap_pirq":

if ( is_hvm_domain(d) )
{
spin_lock(&d->event_lock);
gdprintk(XENLOG_WARNING,"d%d, pirq: %d is %x %s, irq: %d\n",
d->domain_id, pirq, domain_pirq_to_emuirq(d, pirq),
domain_pirq_to_emuirq(d, pirq) == IRQ_UNBOUND ? "unbound" : "",
domain_pirq_to_irq(d, pirq));

if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
ret = unmap_domain_pirq_emuirq(d, pirq);
spin_unlock(&d->event_lock);
if ( domid == DOMID_SELF || ret )
goto free_domain;

It always tells me unbound:

(XEN) physdev.c:237:d14 14, pirq: 54 is ffffffff
(XEN) irq.c:1873:d14 14, nr_pirqs: 56
(XEN) physdev.c:237:d14 14, pirq: 53 is ffffffff
(XEN) irq.c:1873:d14 14, nr_pirqs: 56
(XEN) physdev.c:237:d14 14, pirq: 52 is ffffffff
(XEN) irq.c:1873:d14 14, nr_pirqs: 56
(XEN) physdev.c:237:d14 14, pirq: 51 is ffffffff
(XEN) irq.c:1873:d14 14, nr_pirqs: 56
(XEN) physdev.c:237:d14 14, pirq: 50 is ffffffff
(XEN) irq.c:1873:d14 14, nr_pirqs: 56
(a bit older debug code, so the 'unbound' does not show up here).

Which means that the call to unmap_domain_pirq_emuirq does not happen.
The checks in unmap_domain_pirq_emuirq also look to depend
on the emuirq value being IRQ_UNBOUND.

In other words, all of that code looks to only clear things when
they are !IRQ_UNBOUND.

But the other logic (the IRQ_UNBOUND case) looks to be missing a removal
from the radix tree:

if ( emuirq != IRQ_PT )
radix_tree_delete(&d->arch.hvm_domain.emuirq_pirq, emuirq);

And I think that is what is causing the leak - the radix tree
needs to be pruned? Or perhaps the allocate_pirq should check
the radix tree for IRQ_UNBOUND ones and re-use them?

> I do note that Xen doesn't free the pirq until it has been unbound by
> the guest. Xen will warn if the guest unmaps a pirq that is still bound
> ("domD: forcing unbind of pirq P"). Is this what is happening? If so,

No. It does not b/c of this check in physdev_unmap_pirq:

if ( domid == DOMID_SELF || ret )
goto free_domain

which jumps over the call to unmap_domain_pirq.

> that would suggest a bug in the guest rather than the hypervisor.

No. But then I am not even using it. See attached module.

>
> David


Attachments:
(No filename) (2.91 kB)
alloc_and_unmap.c (1.20 kB)

2013-05-21 16:51:15

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

On Tue, 21 May 2013, Konrad Rzeszutek Wilk wrote:
> > Looking at the hypervisor code I couldn't see anything obviously wrong.
>
> I think the culprit is "physdev_unmap_pirq":
>
> if ( is_hvm_domain(d) )
> {
> spin_lock(&d->event_lock);
> gdprintk(XENLOG_WARNING,"d%d, pirq: %d is %x %s, irq: %d\n",
> d->domain_id, pirq, domain_pirq_to_emuirq(d, pirq),
> domain_pirq_to_emuirq(d, pirq) == IRQ_UNBOUND ? "unbound" : "",
> domain_pirq_to_irq(d, pirq));
>
> if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
> ret = unmap_domain_pirq_emuirq(d, pirq);
> spin_unlock(&d->event_lock);
> if ( domid == DOMID_SELF || ret )
> goto free_domain;
>
> It always tells me unbound:
>
> (XEN) physdev.c:237:d14 14, pirq: 54 is ffffffff
> (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> (XEN) physdev.c:237:d14 14, pirq: 53 is ffffffff
> (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> (XEN) physdev.c:237:d14 14, pirq: 52 is ffffffff
> (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> (XEN) physdev.c:237:d14 14, pirq: 51 is ffffffff
> (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> (XEN) physdev.c:237:d14 14, pirq: 50 is ffffffff
> (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> (a bit older debug code, so the 'unbound' does not show up here).
>
> Which means that the call to unmap_domain_pirq_emuirq does not happen.
> The checks in unmap_domain_pirq_emuirq also look to be depend
> on the code being IRQ_UNBOUND.
>
> In other words, all of that code looks to only clear things when
> they are !IRQ_UNBOUND.
>
> But the other logic (IRQ_UNBOUND) looks to be missing a removal
> in the radix tree:
>
> if ( emuirq != IRQ_PT )
> radix_tree_delete(&d->arch.hvm_domain.emuirq_pirq, emuirq);
>
> And I think that is what is causing the leak - the radix tree
> needs to be pruned? Or perhaps the allocate_pirq should check
> the radix tree for IRQ_UNBOUND ones and re-use them?

I think that you are looking in the wrong place.
The issue is that QEMU doesn't call pt_msi_disable in
pt_msgctrl_reg_write if (!val & PCI_MSI_FLAGS_ENABLE).

The code above is correct as is because it is trying to handle emulated
IRQs and MSIs, not real passthrough MSIs. The latter are not added to
that radix tree, see physdev_hvm_map_pirq and physdev_map_pirq.

2013-05-21 20:43:17

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

On Tue, May 21, 2013 at 05:51:02PM +0100, Stefano Stabellini wrote:
> On Tue, 21 May 2013, Konrad Rzeszutek Wilk wrote:
> > > Looking at the hypervisor code I couldn't see anything obviously wrong.
> >
> > I think the culprit is "physdev_unmap_pirq":
> >
> > if ( is_hvm_domain(d) )
> > {
> > spin_lock(&d->event_lock);
> > gdprintk(XENLOG_WARNING,"d%d, pirq: %d is %x %s, irq: %d\n",
> > d->domain_id, pirq, domain_pirq_to_emuirq(d, pirq),
> > domain_pirq_to_emuirq(d, pirq) == IRQ_UNBOUND ? "unbound" : "",
> > domain_pirq_to_irq(d, pirq));
> >
> > if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
> > ret = unmap_domain_pirq_emuirq(d, pirq);
> > spin_unlock(&d->event_lock);
> > if ( domid == DOMID_SELF || ret )
> > goto free_domain;
> >
> > It always tells me unbound:
> >
> > (XEN) physdev.c:237:d14 14, pirq: 54 is ffffffff
> > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > (XEN) physdev.c:237:d14 14, pirq: 53 is ffffffff
> > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > (XEN) physdev.c:237:d14 14, pirq: 52 is ffffffff
> > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > (XEN) physdev.c:237:d14 14, pirq: 51 is ffffffff
> > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > (XEN) physdev.c:237:d14 14, pirq: 50 is ffffffff
> > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > (a bit older debug code, so the 'unbound' does not show up here).
> >
> > Which means that the call to unmap_domain_pirq_emuirq does not happen.
> > The checks in unmap_domain_pirq_emuirq also look to be depend
> > on the code being IRQ_UNBOUND.
> >
> > In other words, all of that code looks to only clear things when
> > they are !IRQ_UNBOUND.
> >
> > But the other logic (IRQ_UNBOUND) looks to be missing a removal
> > in the radix tree:
> >
> > if ( emuirq != IRQ_PT )
> > radix_tree_delete(&d->arch.hvm_domain.emuirq_pirq, emuirq);
> >
> > And I think that is what is causing the leak - the radix tree
> > needs to be pruned? Or perhaps the allocate_pirq should check
> > the radix tree for IRQ_UNBOUND ones and re-use them?
>
> I think that you are looking in the wrong place.
> The issue is that QEMU doesn't call pt_msi_disable in
> pt_msgctrl_reg_write if (!val & PCI_MSI_FLAGS_ENABLE).

In my test-case I am not even calling QEMU. I am just doing two
hypercalls - get_free_pirq and unmap.
>
> The code above is correct as is because it is trying to handle emulated
> IRQs and MSIs, not real passthrough MSIs. They latter are not added to
> that radix tree, see physdev_hvm_map_pirq and physdev_map_pirq.

The bug is in the hypervisor. This little patch solves the test-case
(I hadn't tried to do the PCI passthrough yet)


diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index b0b0c65..b78717a 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -1851,8 +1851,8 @@ static int pirq_guest_force_unbind(struct domain *d, struct pirq *pirq)
static inline bool_t is_free_pirq(const struct domain *d,
const struct pirq *pirq)
{
- return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
- pirq->arch.hvm.emuirq == IRQ_UNBOUND));
+ return !pirq || ((pirq->arch.irq == 0 || (pirq->arch.irq == PIRQ_ALLOCATED)) &&
+ (!is_hvm_domain(d) || pirq->arch.hvm.emuirq == IRQ_UNBOUND));
}

int get_free_pirq(struct domain *d, int type)


The reason is that pirq->arch.irq in PHYSDEVOP_get_free_pirq is set
from the value of zero to -1 (PIRQ_ALLOCATED). Then in map_domain_pirq
we check it first:

904 old_irq = domain_pirq_to_irq(d, pirq);
.. snip..
1907 if ( (old_irq > 0 && (old_irq != irq) ) ||

and since the 'old_irq' is -1 (or zero), and the irq passed in
is different, then all checks pass and the value is over-written:

1988 set_domain_irq_pirq(d, irq, info);

And that is it.
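
To make that concrete, here is a toy userspace model of the interaction just
described (not Xen code; the names only echo is_free_pirq() and
PIRQ_ALLOCATED, and the downward scan is just to mimic the decreasing pirq
numbers in the log above):

/* pirq_leak_model.c - why get_free_pirq() never hands the same slot out twice
 * once unmap skips the reset.  Build with: cc -Wall -o leak pirq_leak_model.c */
#include <stdio.h>

#define NR_PIRQS	8
#define PIRQ_ALLOCATED	(-1)

/* arch.irq per slot: 0 == free, -1 == allocated-but-never-mapped */
static int arch_irq[NR_PIRQS];

/* models PHYSDEVOP_get_free_pirq: only a slot with arch.irq == 0 is "free" */
static int get_free_pirq(void)
{
	int pirq;

	for (pirq = NR_PIRQS - 1; pirq >= 0; pirq--)
		if (arch_irq[pirq] == 0) {
			arch_irq[pirq] = PIRQ_ALLOCATED;
			return pirq;
		}
	return -1;			/* pool exhausted */
}

/* models PHYSDEVOP_unmap_pirq for a never-mapped pirq: emuirq is IRQ_UNBOUND,
 * so the code jumps straight to free_domain and arch.irq is never reset */
static void unmap_pirq(int pirq)
{
	(void)pirq;
}

int main(void)
{
	int i, pirq;

	for (i = 0; i < NR_PIRQS + 2; i++) {
		pirq = get_free_pirq();
		printf("round %d: pirq %d\n", i, pirq);
		if (pirq < 0)
			break;		/* exhausted, like the module output above */
		unmap_pirq(pirq);
	}
	return 0;
}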

2013-05-21 21:50:23

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

On Tue, 21 May 2013, Konrad Rzeszutek Wilk wrote:
> On Tue, May 21, 2013 at 05:51:02PM +0100, Stefano Stabellini wrote:
> > On Tue, 21 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > Looking at the hypervisor code I couldn't see anything obviously wrong.
> > >
> > > I think the culprit is "physdev_unmap_pirq":
> > >
> > > if ( is_hvm_domain(d) )
> > > {
> > > spin_lock(&d->event_lock);
> > > gdprintk(XENLOG_WARNING,"d%d, pirq: %d is %x %s, irq: %d\n",
> > > d->domain_id, pirq, domain_pirq_to_emuirq(d, pirq),
> > > domain_pirq_to_emuirq(d, pirq) == IRQ_UNBOUND ? "unbound" : "",
> > > domain_pirq_to_irq(d, pirq));
> > >
> > > if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
> > > ret = unmap_domain_pirq_emuirq(d, pirq);
> > > spin_unlock(&d->event_lock);
> > > if ( domid == DOMID_SELF || ret )
> > > goto free_domain;
> > >
> > > It always tells me unbound:
> > >
> > > (XEN) physdev.c:237:d14 14, pirq: 54 is ffffffff
> > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > (XEN) physdev.c:237:d14 14, pirq: 53 is ffffffff
> > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > (XEN) physdev.c:237:d14 14, pirq: 52 is ffffffff
> > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > (XEN) physdev.c:237:d14 14, pirq: 51 is ffffffff
> > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > (XEN) physdev.c:237:d14 14, pirq: 50 is ffffffff
> > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > (a bit older debug code, so the 'unbound' does not show up here).
> > >
> > > Which means that the call to unmap_domain_pirq_emuirq does not happen.
> > > The checks in unmap_domain_pirq_emuirq also look to be depend
> > > on the code being IRQ_UNBOUND.
> > >
> > > In other words, all of that code looks to only clear things when
> > > they are !IRQ_UNBOUND.
> > >
> > > But the other logic (IRQ_UNBOUND) looks to be missing a removal
> > > in the radix tree:
> > >
> > > if ( emuirq != IRQ_PT )
> > > radix_tree_delete(&d->arch.hvm_domain.emuirq_pirq, emuirq);
> > >
> > > And I think that is what is causing the leak - the radix tree
> > > needs to be pruned? Or perhaps the allocate_pirq should check
> > > the radix tree for IRQ_UNBOUND ones and re-use them?
> >
> > I think that you are looking in the wrong place.
> > The issue is that QEMU doesn't call pt_msi_disable in
> > pt_msgctrl_reg_write if (!val & PCI_MSI_FLAGS_ENABLE).
>
> In my test-case I am not even calling QEMU. I am just doing two hypercalls
> hypercall - get_free_pirq and unmap.
> >
> > The code above is correct as is because it is trying to handle emulated
> > IRQs and MSIs, not real passthrough MSIs. They latter are not added to
> > that radix tree, see physdev_hvm_map_pirq and physdev_map_pirq.
>
> The bug is in the hypervisor. This little patch solves the test-case
> (I hadn't tried to do the PCI passthrough yet)
>
>
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index b0b0c65..b78717a 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -1851,8 +1851,8 @@ static int pirq_guest_force_unbind(struct domain *d, struct pirq *pirq)
> static inline bool_t is_free_pirq(const struct domain *d,
> const struct pirq *pirq)
> {
> - return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
> - pirq->arch.hvm.emuirq == IRQ_UNBOUND));
> + return !pirq || ((pirq->arch.irq == 0 || (pirq->arch.irq == PIRQ_ALLOCATED)) &&
> + (!is_hvm_domain(d) || pirq->arch.hvm.emuirq == IRQ_UNBOUND));
> }
>
> int get_free_pirq(struct domain *d, int type)
>
>
> The reason is that pirq->arch.irq in PHYSDEVOP_get_free_pirq is set to
> from the value of zero to -1 (PIRQ_ALLOCATED). Then in map_domain_pirq
> we check it first:
>
> 904 old_irq = domain_pirq_to_irq(d, pirq);
> .. snip..
> 1907 if ( (old_irq > 0 && (old_irq != irq) ) ||
>
> and since the 'old_irq' is -1 (or zero), and the irq passed in
> is different, then all checks pass and the value is over-written:
>
> 1988 set_domain_irq_pirq(d, irq, info);
>
> And that is it.


We have to be careful about this: the point of PHYSDEVOP_get_free_pirq is
that Linux can know for sure the pirq that is going to be used to map the
MSI by QEMU. If you modify is_free_pirq that way, suddenly the pirq
could be allocated for something else after Linux called
PHYSDEVOP_get_free_pirq and before QEMU called xc_physdev_map_pirq_msi.

Right now the unmap is supposed to be done by QEMU, not Linux. So I
think that it is "normal" (although counterintuitive) that your little
test works that way.

pirq allocated via PHYSDEVOP_get_free_pirq should be passed to QEMU,
mapped by QEMU, unmapped by QEMU and eventually freed by QEMU.

This is not the bestest interface ever written of course but that's how
it works now.
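
As a rough sketch of that split of responsibilities (a standalone model, no
real hypercalls; the function names are only descriptive of who is expected
to act at each step):

/* pirq_lifecycle_model.c - who is expected to drive each transition */
#include <assert.h>
#include <stdio.h>

enum pirq_state { P_FREE, P_ALLOCATED, P_MAPPED };

/* guest: PHYSDEVOP_get_free_pirq, then advertises the pirq in the MSI message */
static void guest_allocate(enum pirq_state *s)
{
	assert(*s == P_FREE);
	*s = P_ALLOCATED;
}

/* QEMU: traps the MSI write and maps the pirq (xc_physdev_map_pirq_msi) */
static void qemu_map(enum pirq_state *s)
{
	assert(*s == P_ALLOCATED);
	*s = P_MAPPED;
}

/* QEMU: on MSI disable or unplug, unmaps and frees the pirq */
static void qemu_unmap_and_free(enum pirq_state *s)
{
	assert(*s == P_MAPPED);
	*s = P_FREE;
}

int main(void)
{
	enum pirq_state s = P_FREE;

	guest_allocate(&s);
	qemu_map(&s);
	qemu_unmap_and_free(&s);	/* the step the thread says is being skipped */
	printf("pirq back in the pool: %s\n", s == P_FREE ? "yes" : "no");
	return 0;
}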

2013-05-21 22:41:49

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

On Tue, May 21, 2013 at 10:50:09PM +0100, Stefano Stabellini wrote:
> On Tue, 21 May 2013, Konrad Rzeszutek Wilk wrote:
> > On Tue, May 21, 2013 at 05:51:02PM +0100, Stefano Stabellini wrote:
> > > On Tue, 21 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > > Looking at the hypervisor code I couldn't see anything obviously wrong.
> > > >
> > > > I think the culprit is "physdev_unmap_pirq":
> > > >
> > > > if ( is_hvm_domain(d) )
> > > > {
> > > > spin_lock(&d->event_lock);
> > > > gdprintk(XENLOG_WARNING,"d%d, pirq: %d is %x %s, irq: %d\n",
> > > > d->domain_id, pirq, domain_pirq_to_emuirq(d, pirq),
> > > > domain_pirq_to_emuirq(d, pirq) == IRQ_UNBOUND ? "unbound" : "",
> > > > domain_pirq_to_irq(d, pirq));
> > > >
> > > > if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
> > > > ret = unmap_domain_pirq_emuirq(d, pirq);
> > > > spin_unlock(&d->event_lock);
> > > > if ( domid == DOMID_SELF || ret )
> > > > goto free_domain;
> > > >
> > > > It always tells me unbound:
> > > >
> > > > (XEN) physdev.c:237:d14 14, pirq: 54 is ffffffff
> > > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > > (XEN) physdev.c:237:d14 14, pirq: 53 is ffffffff
> > > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > > (XEN) physdev.c:237:d14 14, pirq: 52 is ffffffff
> > > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > > (XEN) physdev.c:237:d14 14, pirq: 51 is ffffffff
> > > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > > (XEN) physdev.c:237:d14 14, pirq: 50 is ffffffff
> > > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > > (a bit older debug code, so the 'unbound' does not show up here).
> > > >
> > > > Which means that the call to unmap_domain_pirq_emuirq does not happen.
> > > > The checks in unmap_domain_pirq_emuirq also look to be depend
> > > > on the code being IRQ_UNBOUND.
> > > >
> > > > In other words, all of that code looks to only clear things when
> > > > they are !IRQ_UNBOUND.
> > > >
> > > > But the other logic (IRQ_UNBOUND) looks to be missing a removal
> > > > in the radix tree:
> > > >
> > > > if ( emuirq != IRQ_PT )
> > > > radix_tree_delete(&d->arch.hvm_domain.emuirq_pirq, emuirq);
> > > >
> > > > And I think that is what is causing the leak - the radix tree
> > > > needs to be pruned? Or perhaps the allocate_pirq should check
> > > > the radix tree for IRQ_UNBOUND ones and re-use them?
> > >
> > > I think that you are looking in the wrong place.
> > > The issue is that QEMU doesn't call pt_msi_disable in
> > > pt_msgctrl_reg_write if (!val & PCI_MSI_FLAGS_ENABLE).
> >
> > In my test-case I am not even calling QEMU. I am just doing two hypercalls
> > hypercall - get_free_pirq and unmap.
> > >
> > > The code above is correct as is because it is trying to handle emulated
> > > IRQs and MSIs, not real passthrough MSIs. They latter are not added to
> > > that radix tree, see physdev_hvm_map_pirq and physdev_map_pirq.
> >
> > The bug is in the hypervisor. This little patch solves the test-case
> > (I hadn't tried to do the PCI passthrough yet)
> >
> >
> > diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> > index b0b0c65..b78717a 100644
> > --- a/xen/arch/x86/irq.c
> > +++ b/xen/arch/x86/irq.c
> > @@ -1851,8 +1851,8 @@ static int pirq_guest_force_unbind(struct domain *d, struct pirq *pirq)
> > static inline bool_t is_free_pirq(const struct domain *d,
> > const struct pirq *pirq)
> > {
> > - return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
> > - pirq->arch.hvm.emuirq == IRQ_UNBOUND));
> > + return !pirq || ((pirq->arch.irq == 0 || (pirq->arch.irq == PIRQ_ALLOCATED)) &&
> > + (!is_hvm_domain(d) || pirq->arch.hvm.emuirq == IRQ_UNBOUND));
> > }
> >
> > int get_free_pirq(struct domain *d, int type)

To be fair, this diff is just to demonstrate that the pirq->arch.irq is
the one that seems to gate things. I am not suggesting that this is the
final patch - just the 'aha, this is what is happening!'.
> >
> >
> > The reason is that pirq->arch.irq in PHYSDEVOP_get_free_pirq is set to
> > from the value of zero to -1 (PIRQ_ALLOCATED). Then in map_domain_pirq
> > we check it first:
> >
> > 904 old_irq = domain_pirq_to_irq(d, pirq);
> > .. snip..
> > 1907 if ( (old_irq > 0 && (old_irq != irq) ) ||
> >
> > and since the 'old_irq' is -1 (or zero), and the irq passed in
> > is different, then all checks pass and the value is over-written:
> >
> > 1988 set_domain_irq_pirq(d, irq, info);
> >
> > And that is it.
>
>
> We have to be careful about this: the point of PHYSDEVOP_get_free_pirq is
> that Linux can know for sure the pirq that is going to be used to map the
> MSI by QEMU. If you modify is_free_pirq that way, suddenly the pirq
> could be allocated for something else after Linux called
> PHYSDEVOP_get_free_pirq and before QEMU called xc_physdev_map_pirq_msi.

Yes. And I think the 'is_free_pirq' modification above is incorrect.

I think the fix should be in the unmap_pirq code (hypervisor) to check
if the arch.irq is either zero or PIRQ_ALLOCATED. Right now it only
checks for zero.

Then as you say there is also xc_physdev_map_pirq_msi, but the call
chain looks to be INTx-MSIx. The pt_msi_update is the call that is
used when the guest writes the PIRQ (xc_domain_update_msi_irq -> XEN_DOMCTL_bind_pt_irq).

And that looks to be it. The pt_msi_update can be called multiple times if
the guest decides to use a different PIRQ.

>
> Right now the unmap is supposed to be done by QEMU, not Linux. So I
> think that it is "normal" (although counterintuitive) that your little
> test works that way.

Yes, the test-case is flawed.
>
> pirq allocated via PHYSDEVOP_get_free_pirq should be passed to QEMU,
> mapped by QEMU, unmapped by QEMU and eventually freed by QEMU.
>
> This is not the bestest interface ever written of course but that's how
> it works now.

That is kindly said :-)

2013-05-22 09:37:48

by Jan Beulich

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

>>> On 22.05.13 at 00:41, Konrad Rzeszutek Wilk <[email protected]> wrote:
> On Tue, May 21, 2013 at 10:50:09PM +0100, Stefano Stabellini wrote:
>> We have to be careful about this: the point of PHYSDEVOP_get_free_pirq is
>> that Linux can know for sure the pirq that is going to be used to map the
>> MSI by QEMU. If you modify is_free_pirq that way, suddenly the pirq
>> could be allocated for something else after Linux called
>> PHYSDEVOP_get_free_pirq and before QEMU called xc_physdev_map_pirq_msi.
>
> Yes. And I think the 'is_free_pirq' modification above is incorrect.
>
> I think the fix should be in the unmap_pirq code (hypervisor) to check
> if the arch.irq is either zero or PIRQ_ALLOCATED. Right now it only
> checks for zero.

Which check are you talking about? Looking at physdev_unmap_pirq()
I see none at all, unmap_domain_pirq() has a <= 0 check, and
unmap_domain_pirq_emuirq() again doesn't appear to have any.

If you're talking about unmap_domain_pirq(), then you'll need to
be careful: A negative value here doesn't necessarily mean
PIRQ_ALLOCATED, but could also come from another run that
found it necessary to force the unbind. Hence the definition of
PIRQ_ALLOCATED would then collide with the (unlikely?) case of
IRQ1 having got assigned to a guest. To be on the safe side, we
should therefore redefine PIRQ_ALLOCATED to, say, INT_MIN.

Jan
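
A tiny standalone illustration of the collision being described, under the
assumption (inferred from the mail above, not checked against
xen/arch/x86/irq.c) that a force-unbound pirq records its former IRQ as the
negated IRQ number:

/* pirq_marker_model.c - why PIRQ_ALLOCATED == -1 can collide with IRQ 1 */
#include <limits.h>
#include <stdio.h>

#define PIRQ_ALLOCATED_OLD	(-1)		/* current definition */
#define PIRQ_ALLOCATED_NEW	INT_MIN		/* suggested redefinition */

/* assumed encoding of a pirq whose binding had to be force-unbound */
static int force_unbind_marker(int irq)
{
	return -irq;
}

int main(void)
{
	int marker = force_unbind_marker(1);	/* guest happened to own IRQ 1 */

	/* with -1, "only allocated" and "force-unbound from IRQ 1" look the
	 * same; with INT_MIN they do not */
	printf("collides with -1:      %s\n",
	       marker == PIRQ_ALLOCATED_OLD ? "yes" : "no");
	printf("collides with INT_MIN: %s\n",
	       marker == PIRQ_ALLOCATED_NEW ? "yes" : "no");
	return 0;
}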

2013-05-22 15:15:01

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

On Wed, May 22, 2013 at 10:37:39AM +0100, Jan Beulich wrote:
> >>> On 22.05.13 at 00:41, Konrad Rzeszutek Wilk <[email protected]> wrote:
> > On Tue, May 21, 2013 at 10:50:09PM +0100, Stefano Stabellini wrote:
> >> We have to be careful about this: the point of PHYSDEVOP_get_free_pirq is
> >> that Linux can know for sure the pirq that is going to be used to map the
> >> MSI by QEMU. If you modify is_free_pirq that way, suddenly the pirq
> >> could be allocated for something else after Linux called
> >> PHYSDEVOP_get_free_pirq and before QEMU called xc_physdev_map_pirq_msi.
> >
> > Yes. And I think the 'is_free_pirq' modification above is incorrect.
> >
> > I think the fix should be in the unmap_pirq code (hypervisor) to check
> > if the arch.irq is either zero or PIRQ_ALLOCATED. Right now it only
> > checks for zero.
>
> Which check are you talking about? Looking at physdev_unmap_pirq()

Sorry about being so haphazard here. I am still digging in the code
and trying to get a sense of how QEMU and the hypervisor are supposed to
dance together.

The check was on the PHYSDEVOP_get_free_pirq, which calls get_free_pirq
and uses the is_free_pirq check. After the get_free_pirq call, the logic
in PHYSDEVOP_get_free_pirq sets info->arch.pirq = PIRQ_ALLOCATED to
protect itself from giving the same PIRQ twice.

The physdev_unmap_pirq (from PHYSDEVOP_unmap_pirq), only has this
check:
if (domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND)

and since the arch.hvm.emuirq is IRQ_UNBOUND (-1), it does not
call unmap_domain_pirq_emuirq. It probably shouldn't, but it should
at least remove the info->arch.pirq = PIRQ_ALLOCATED as we are
telling the hypervisor: "hey, I am done with this, return to the
pool." But since that is not cleared, the PHYSDEVOP_get_free_pirq
will skip this pirq as arch.pirq is still set to PIRQ_ALLOCATED.

> I see none at all, unmap_domain_pirq() has a <= 0 check, and
> unmap_domain_pirq_emuirq() again doesn't appear to have any.

The 'unmap_domain_pirq' path would be if dom0 (so QEMU) did the
unmap for the guest. That is via the PHYSDEVOP_unmap_pirq. And
I think if that path was taken (as Stefano suggests QEMU should
do when an MSI or MSI-X driver is unloaded and zero is written as
the PIRQ), we would end up calling clear_domain_irq_pirq, which
would set arch.pirq = 0.

Or to a negative value as you pointed out later. Which then
means we won't ever be able to re-use the PIRQ (as
PHYSDEVOP_get_free_pirq or rather get_free_pirq would skip over it
as arch.pirq != 0).
>
> If you're talking about unmap_domain_pirq(), then you'll need to
> be careful: A negative value here doesn't necessarily mean
> PIRQ_ALLOCATED, but could also come from another run that
> found it necessary to force the unbind. Hence the definition of
> PIRQ_ALLOCATED would then collide with the (unlikely?) case of
> IRQ1 having got assigned to a guest. To be on the safe side, we
> should therefore redefine PIRQ_ALLOCATED to say INT_MIN.

You are right about being cautious - this is a complex bit of
code interaction between Xen, QEMU, and the Linux kernel.

2013-05-22 15:25:17

by Jan Beulich

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

>>> On 22.05.13 at 17:14, Konrad Rzeszutek Wilk <[email protected]> wrote:
> The physdev_unmap_pirq (from PHYSDEVOP_unmap_pirq), only has this
> check:
> if (domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND)
>
> and since the arch.hvm.emuirq is IRQ_UNBOUND (-1), it does not
> call unmap_domain_pirq_emuirq. It probably shouldn't, but it should
> at least remove the info->arch.pirq = PIRQ_ALLOCATED as we are
> telling the hypervisor: "hey, I am done with this, return to the
> pool." But since that is not cleared, the PHYSDEVOP_get_free_pirq
> will skip this pirq as arch.pirq is still set to PIRQ_ALLOCATED.

Okay, that clarifies it quite a bit. For one, I'll leave any of the
emuirq stuff to Stefano, who wrote this originally. And then, from
the beginning of this thread, I'm not convinced that freeing a pirq
is really the right thing here: unmap_pirq() is the counterpart of
map_pirq(), not get_free_pirq(). I would think that is a guest
allocates a pirq and then unmaps it without first mapping it, it's
the guest's fault that it now lost one pirq resource. It should not
have allocated one in the first place if it didn't mean to use it for
anything.

>> I see none at all, unmap_domain_pirq() has a <= 0 check, and
>> unmap_domain_pirq_emuirq() again doesn't appear to have any.
>
> The 'unmap_domain_pirq' path would be if dom0 (so QEMU) did the
> unmap for the guest. That is via the PHYSDEVOP_unmap_pirq. And
> I think if that path was taken (as Stefano suggests QEMU should
> do when a MSI or MSI-X driver is unloaded and zero is writen as
> an PIRQ), we would end up calling clear_domain_irq_pirq, which
> would set arch.pirq = 0.
>
> Or to a negative value as you pointed out later. Which then
> means we won't be ever able to re-use the PIRQ (as
> PHYSDEVOP_get_free_pirq or rather get_free_pirq would skip over it
> as arch.pirq != 0).

That setting to a negative value is not causing the slot to be
permanently lost, it merely defers its freeing. It was the only
way I could find back then to reasonably handle an unmap
being done before the matched unbind.

Jan

2013-05-22 16:41:50

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

On Wed, May 22, 2013 at 04:25:10PM +0100, Jan Beulich wrote:
> >>> On 22.05.13 at 17:14, Konrad Rzeszutek Wilk <[email protected]> wrote:
> > The physdev_unmap_pirq (from PHYSDEVOP_unmap_pirq), only has this
> > check:
> > if (domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND)
> >
> > and since the arch.hvm.emuirq is IRQ_UNBOUND (-1), it does not
> > call unmap_domain_pirq_emuirq. It probably shouldn't, but it should
> > at least remove the info->arch.pirq = PIRQ_ALLOCATED as we are
> > telling the hypervisor: "hey, I am done with this, return to the
> > pool." But since that is not cleared, the PHYSDEVOP_get_free_pirq
> > will skip this pirq as arch.pirq is still set to PIRQ_ALLOCATED.
>
> Okay, that clarifies it quite a bit. For one, I'll leave any of the
> emuirq stuff to Stefano, who wrote this originally. And then, from
> the beginning of this thread, I'm not convinced that freeing a pirq
> is really the right thing here: unmap_pirq() is the counterpart of
> map_pirq(), not get_free_pirq(). I would think that is a guest
> allocates a pirq and then unmaps it without first mapping it, it's
> the guest's fault that it now lost one pirq resource. It should not
> have allocated one in the first place if it didn't mean to use it for
> anything.

It does use it, but if you do run this in a loop:
rmmod e1000e;modprobe e1000e

it ends up doing these three hypercalls: PHYSDEVOP_get_free_pirq,
PHYSDEVOP_map_pirq, PHYSDEVOP_unmap_pirq and so on. The reason is that
drivers/xen/events.c keeps track of the Linux IRQ <-> PIRQ mapping just as long
as needed - if the driver does a free_irq, well, then the mapping is
de-allocated and lost.

One patch I posted (for Linux) keeps track of the PIRQ so that if
free_irq is called and we remove the Linux IRQ <-> PIRQ association,
we still have the PIRQ saved away and can re-use it.

In other words, the loop ends up doing:
PHYSDEVOP_map_pirq, PHYSDEVOP_unmap_pirq

>
> >> I see none at all, unmap_domain_pirq() has a <= 0 check, and
> >> unmap_domain_pirq_emuirq() again doesn't appear to have any.
> >
> > The 'unmap_domain_pirq' path would be if dom0 (so QEMU) did the
> > unmap for the guest. That is via the PHYSDEVOP_unmap_pirq. And
> > I think if that path was taken (as Stefano suggests QEMU should
> > do when a MSI or MSI-X driver is unloaded and zero is writen as
> > an PIRQ), we would end up calling clear_domain_irq_pirq, which
> > would set arch.pirq = 0.
> >
> > Or to a negative value as you pointed out later. Which then
> > means we won't be ever able to re-use the PIRQ (as
> > PHYSDEVOP_get_free_pirq or rather get_free_pirq would skip over it
> > as arch.pirq != 0).
>
> That setting to a negative value is not causing the slot to be
> permanently lost, it merely defers its freeing. It was the only
> way I could find back then to reasonably handle an unmap
> being done before the matched unbind.

Ah, so pt_irq_destroy_bind (XEN_DOMCTL_unbind_pt_irq) is the counterpart
to PHYSDEVOP_get_free_pirq in some form. On the QEMU side it looks to be
called only when the PCI device is put to sleep or pulled out of the guest.

It probably shouldn't be called when the device is merely de-activated.

2013-05-23 06:31:56

by Jan Beulich

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

>>> On 22.05.13 at 18:41, Konrad Rzeszutek Wilk <[email protected]> wrote:
> On Wed, May 22, 2013 at 04:25:10PM +0100, Jan Beulich wrote:
>> Okay, that clarifies it quite a bit. For one, I'll leave any of the
>> emuirq stuff to Stefano, who wrote this originally. And then, from
>> the beginning of this thread, I'm not convinced that freeing a pirq
>> is really the right thing here: unmap_pirq() is the counterpart of
>> map_pirq(), not get_free_pirq(). I would think that is a guest
>> allocates a pirq and then unmaps it without first mapping it, it's
>> the guest's fault that it now lost one pirq resource. It should not
>> have allocated one in the first place if it didn't mean to use it for
>> anything.
>
> It does use it, but if you do run this in a loop:
> rmmod e1000e;modprobe e1000e
>
> it ends up doing thse three hypercalls: PHYSDEVOP_get_free_pirq,
> PHYSDEVOP_map_pirq, PHYSDEVOP_unmap_pirq and so on.

Yeah, my comment was more towards your simplified repro
module...

> The reason is that
> drivers/xen/events.c keeps track of the Linux IRQ <-> PIRQ just as long
> as needed - if the driver does a free_irq, well, then the mapping is
> de-allocated and lost.
>
> One patch I posted (for Linux) keeps track of the PIRQ so that if
> free_irq is called and we remove the Linux IRQ <-> PIRQ association,
> we still have the PIRQ saved away and can re-use it.
>
> In other words, the loop ends up doing:
> PHYSDEVOP_map_pirq, PHYSDEVOP_unmap_pirq

And that's the right solution here imo. Remember that
get_free_pirq was introduced only for the pv-ops kernel, the
classic one never needed it (and hence couldn't leak pIRQ-s).

>> That setting to a negative value is not causing the slot to be
>> permanently lost, it merely defers its freeing. It was the only
>> way I could find back then to reasonably handle an unmap
>> being done before the matched unbind.
>
> Ah, so pt_irq_destroy_bind (XEN_DOMCTL_unbind_pt_irq) is the counterpart
> to PHYSDEVOP_get_free_pirq in some form. Which looks to be on QEMU side
> only called when the PCI device is put in sleep or pulled out of the guest.
>
> It probably shouldn't be called when the device is merely de-activated.

Right.

Jan

2013-05-29 17:51:07

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

On Tue, 21 May 2013, Stefano Stabellini wrote:
> On Tue, 21 May 2013, Konrad Rzeszutek Wilk wrote:
> > > Looking at the hypervisor code I couldn't see anything obviously wrong.
> >
> > I think the culprit is "physdev_unmap_pirq":
> >
> > if ( is_hvm_domain(d) )
> > {
> > spin_lock(&d->event_lock);
> > gdprintk(XENLOG_WARNING,"d%d, pirq: %d is %x %s, irq: %d\n",
> > d->domain_id, pirq, domain_pirq_to_emuirq(d, pirq),
> > domain_pirq_to_emuirq(d, pirq) == IRQ_UNBOUND ? "unbound" : "",
> > domain_pirq_to_irq(d, pirq));
> >
> > if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
> > ret = unmap_domain_pirq_emuirq(d, pirq);
> > spin_unlock(&d->event_lock);
> > if ( domid == DOMID_SELF || ret )
> > goto free_domain;
> >
> > It always tells me unbound:
> >
> > (XEN) physdev.c:237:d14 14, pirq: 54 is ffffffff
> > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > (XEN) physdev.c:237:d14 14, pirq: 53 is ffffffff
> > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > (XEN) physdev.c:237:d14 14, pirq: 52 is ffffffff
> > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > (XEN) physdev.c:237:d14 14, pirq: 51 is ffffffff
> > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > (XEN) physdev.c:237:d14 14, pirq: 50 is ffffffff
> > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > (a bit older debug code, so the 'unbound' does not show up here).
> >
> > Which means that the call to unmap_domain_pirq_emuirq does not happen.
> > The checks in unmap_domain_pirq_emuirq also look to be depend
> > on the code being IRQ_UNBOUND.
> >
> > In other words, all of that code looks to only clear things when
> > they are !IRQ_UNBOUND.
> >
> > But the other logic (IRQ_UNBOUND) looks to be missing a removal
> > in the radix tree:
> >
> > if ( emuirq != IRQ_PT )
> > radix_tree_delete(&d->arch.hvm_domain.emuirq_pirq, emuirq);
> >
> > And I think that is what is causing the leak - the radix tree
> > needs to be pruned? Or perhaps the allocate_pirq should check
> > the radix tree for IRQ_UNBOUND ones and re-use them?
>
> I think that you are looking in the wrong place.
> The issue is that QEMU doesn't call pt_msi_disable in
> pt_msgctrl_reg_write if (!val & PCI_MSI_FLAGS_ENABLE).
>
> The code above is correct as is because it is trying to handle emulated
> IRQs and MSIs, not real passthrough MSIs. They latter are not added to
> that radix tree, see physdev_hvm_map_pirq and physdev_map_pirq.


This patch fixes the issue; I have only tested MSI (MSI-X is completely
untested).


diff --git a/hw/pass-through.c b/hw/pass-through.c
index 304c438..079e465 100644
--- a/hw/pass-through.c
+++ b/hw/pass-through.c
@@ -3866,7 +3866,11 @@ static int pt_msgctrl_reg_write(struct pt_dev *ptdev,
ptdev->msi->flags |= PCI_MSI_FLAGS_ENABLE;
}
else
- ptdev->msi->flags &= ~PCI_MSI_FLAGS_ENABLE;
+ {
+ if (ptdev->msi->flags & PT_MSI_MAPPED) {
+ pt_msi_disable(ptdev);
+ }
+ }

/* pass through MSI_ENABLE bit when no MSI-INTx translation */
if (!ptdev->msi_trans_en) {
@@ -4013,6 +4017,8 @@ static int pt_msixctrl_reg_write(struct pt_dev *ptdev,
pt_disable_msi_translate(ptdev);
}
pt_msix_update(ptdev);
+ } else if (!(*value & PCI_MSIX_ENABLE) && ptdev->msix->enabled) {
+ pt_msix_delete(ptdev);
}

ptdev->msix->enabled = !!(*value & PCI_MSIX_ENABLE);
diff --git a/hw/pt-msi.c b/hw/pt-msi.c
index b03b989..65fa7d6 100644
--- a/hw/pt-msi.c
+++ b/hw/pt-msi.c
@@ -213,7 +213,8 @@ void pt_msi_disable(struct pt_dev *dev)

out:
/* clear msi info */
- dev->msi->flags &= ~(MSI_FLAG_UNINIT | PT_MSI_MAPPED | PCI_MSI_FLAGS_ENABLE);
+ dev->msi->flags &= ~(PT_MSI_MAPPED | PCI_MSI_FLAGS_ENABLE);
+ dev->msi->flags |= MSI_FLAG_UNINIT;
dev->msi->pirq = -1;
dev->msi_trans_en = 0;
}

2013-05-30 17:49:11

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] xen: reuse the same pirq allocated when driver load first time

On Wed, May 29, 2013 at 06:50:41PM +0100, Stefano Stabellini wrote:
> On Tue, 21 May 2013, Stefano Stabellini wrote:
> > On Tue, 21 May 2013, Konrad Rzeszutek Wilk wrote:
> > > > Looking at the hypervisor code I couldn't see anything obviously wrong.
> > >
> > > I think the culprit is "physdev_unmap_pirq":
> > >
> > > if ( is_hvm_domain(d) )
> > > {
> > > spin_lock(&d->event_lock);
> > > gdprintk(XENLOG_WARNING,"d%d, pirq: %d is %x %s, irq: %d\n",
> > > d->domain_id, pirq, domain_pirq_to_emuirq(d, pirq),
> > > domain_pirq_to_emuirq(d, pirq) == IRQ_UNBOUND ? "unbound" : "",
> > > domain_pirq_to_irq(d, pirq));
> > >
> > > if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
> > > ret = unmap_domain_pirq_emuirq(d, pirq);
> > > spin_unlock(&d->event_lock);
> > > if ( domid == DOMID_SELF || ret )
> > > goto free_domain;
> > >
> > > It always tells me unbound:
> > >
> > > (XEN) physdev.c:237:d14 14, pirq: 54 is ffffffff
> > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > (XEN) physdev.c:237:d14 14, pirq: 53 is ffffffff
> > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > (XEN) physdev.c:237:d14 14, pirq: 52 is ffffffff
> > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > (XEN) physdev.c:237:d14 14, pirq: 51 is ffffffff
> > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > (XEN) physdev.c:237:d14 14, pirq: 50 is ffffffff
> > > (XEN) irq.c:1873:d14 14, nr_pirqs: 56
> > > (a bit older debug code, so the 'unbound' does not show up here).
> > >
> > > Which means that the call to unmap_domain_pirq_emuirq does not happen.
> > > The checks in unmap_domain_pirq_emuirq also look to be depend
> > > on the code being IRQ_UNBOUND.
> > >
> > > In other words, all of that code looks to only clear things when
> > > they are !IRQ_UNBOUND.
> > >
> > > But the other logic (IRQ_UNBOUND) looks to be missing a removal
> > > in the radix tree:
> > >
> > > if ( emuirq != IRQ_PT )
> > > radix_tree_delete(&d->arch.hvm_domain.emuirq_pirq, emuirq);
> > >
> > > And I think that is what is causing the leak - the radix tree
> > > needs to be pruned? Or perhaps the allocate_pirq should check
> > > the radix tree for IRQ_UNBOUND ones and re-use them?
> >
> > I think that you are looking in the wrong place.
> > The issue is that QEMU doesn't call pt_msi_disable in
> > pt_msgctrl_reg_write if (!val & PCI_MSI_FLAGS_ENABLE).
> >
> > The code above is correct as is because it is trying to handle emulated
> > IRQs and MSIs, not real passthrough MSIs. They latter are not added to
> > that radix tree, see physdev_hvm_map_pirq and physdev_map_pirq.
>
>
> This patch fixes the issue, I have only tested MSI (MSI-X completely
> untested).

I tested it on my NIC which has MSI and it worked nicely. The other box
with MSI-X is giving me a bit of trouble, so that will have to wait.

Duan,

Could you - when you get a moment - also try this with the PCI device
that was having issues please?
>
>
> diff --git a/hw/pass-through.c b/hw/pass-through.c
> index 304c438..079e465 100644
> --- a/hw/pass-through.c
> +++ b/hw/pass-through.c
> @@ -3866,7 +3866,11 @@ static int pt_msgctrl_reg_write(struct pt_dev *ptdev,
> ptdev->msi->flags |= PCI_MSI_FLAGS_ENABLE;
> }
> else
> - ptdev->msi->flags &= ~PCI_MSI_FLAGS_ENABLE;
> + {
> + if (ptdev->msi->flags & PT_MSI_MAPPED) {
> + pt_msi_disable(ptdev);
> + }
> + }
>
> /* pass through MSI_ENABLE bit when no MSI-INTx translation */
> if (!ptdev->msi_trans_en) {
> @@ -4013,6 +4017,8 @@ static int pt_msixctrl_reg_write(struct pt_dev *ptdev,
> pt_disable_msi_translate(ptdev);
> }
> pt_msix_update(ptdev);
> + } else if (!(*value & PCI_MSIX_ENABLE) && ptdev->msix->enabled) {
> + pt_msix_delete(ptdev);
> }
>
> ptdev->msix->enabled = !!(*value & PCI_MSIX_ENABLE);
> diff --git a/hw/pt-msi.c b/hw/pt-msi.c
> index b03b989..65fa7d6 100644
> --- a/hw/pt-msi.c
> +++ b/hw/pt-msi.c
> @@ -213,7 +213,8 @@ void pt_msi_disable(struct pt_dev *dev)
>
> out:
> /* clear msi info */
> - dev->msi->flags &= ~(MSI_FLAG_UNINIT | PT_MSI_MAPPED | PCI_MSI_FLAGS_ENABLE);
> + dev->msi->flags &= ~(PT_MSI_MAPPED | PCI_MSI_FLAGS_ENABLE);
> + dev->msi->flags |= MSI_FLAG_UNINIT;
> dev->msi->pirq = -1;
> dev->msi_trans_en = 0;
> }