2021-06-22 10:17:14

by Tian, Kevin

Subject: Virtualizing MSI-X on IMS via VFIO

Hi, Alex,

Need your help to understand the current MSI-X virtualization flow in
VFIO. Some background info first.

Recently we have been discussing how to virtualize MSI-X with Interrupt
Message Storage (IMS) on mdev:
https://lore.kernel.org/kvm/[email protected]/

IMS is device-specific interrupt storage, providing an optimized and
scalable way to generate interrupts. The idxd mdev exposes a virtual
MSI-X capability to the guest but physically uses IMS entries to
generate interrupts.

Thomas has helped implement a generic ims irqchip driver:
https://lore.kernel.org/linux-hyperv/[email protected]/

The idxd device allows software to specify an IMS entry (for triggering
a completion interrupt) when submitting a descriptor. To prevent one
mdev from triggering a malicious interrupt into another mdev (by
specifying an arbitrary entry), the idxd IMS entry includes a PASID field
for validation - only a matching PASID in the executed descriptor can
trigger an interrupt via this entry. The idxd driver is expected to
program IMS entries with PASIDs that are allocated to the mdev which owns
those entries.

Other devices may use a different ID and format to isolate IMS entries,
so we need to abstract a generic means of programming a vendor-specific
ID into a vendor-specific IMS entry, without violating the layering model.
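
For illustration, a device-specific IMS entry might look roughly like the
following; the layout and all field placements here are assumptions for
the discussion, not the real idxd hardware format:

    #include <linux/bits.h>
    #include <linux/types.h>

    /* One IMS entry: an MSI address/data pair plus a device-defined
     * control word carrying the isolation ID (a PASID for idxd). */
    struct ims_slot {
        u32 address_lo;
        u32 address_hi;
        u32 data;
        u32 ctrl;
    } __packed;

    /* Assumed bit placement within ctrl, for illustration only. */
    #define IMS_CTRL_PASID      GENMASK(19, 0)
    #define IMS_CTRL_PASID_EN   BIT(23)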

Thomas suggested that the vendor driver first register the ID information
(possibly plus the location to write the ID to) in msi_desc when
allocating irqs (by extending the existing alloc function or via a new
helper function), and then have the generic IMS irqchip driver write the
ID to the IMS entry when it is started up by request_irq().
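
A rough sketch of that suggestion, reusing the ims_slot layout above (the
device_cookie field, the per-desc slot pointer and the helper names are
placeholders for the interface still to be defined):

    #include <linux/bitfield.h>
    #include <linux/io.h>
    #include <linux/irq.h>
    #include <linux/msi.h>

    /* Vendor driver side: stash the isolation ID when the irqs are
     * allocated; device_cookie is an assumed new msi_desc field. */
    static void idxd_register_pasid(struct msi_desc *desc, u32 pasid)
    {
        desc->device_cookie = pasid;
    }

    /* Generic IMS irqchip side: .irq_startup() runs from request_irq()
     * and writes the registered ID into the backing IMS entry (the
     * per-desc slot pointer is also an assumption). */
    static unsigned int ims_array_irq_startup(struct irq_data *data)
    {
        struct msi_desc *desc = irq_data_get_msi_desc(data);
        struct ims_slot __iomem *slot = desc->device_priv;

        iowrite32(FIELD_PREP(IMS_CTRL_PASID, desc->device_cookie) |
                  IMS_CTRL_PASID_EN, &slot->ctrl);
        return 0;
    }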

Then there are two questions to be answered:

1) How does vendor driver decide the ID to be registered to msi_desc?
2) How is Thomas's model mapped to the MSI-X virtualization flow in VFIO?

For the 1st open question, there are two types of PASIDs on an idxd mdev:

1) default PASID: one per mdev and allocated when mdev is created;
2) sva PASIDs: multiple per mdev and allocated on-demand (via vIOMMU);

If a vIOMMU is not exposed, all IMS entries of this mdev should be
programmed with the default PASID, which is always available for the
mdev's lifetime.

If a vIOMMU is exposed and guest sva is enabled, entries used for sva
should be tagged with sva PASIDs, leaving the others tagged with the
default PASID. To help achieve intra-guest interrupt isolation, the guest
idxd driver needs to program guest sva PASIDs into the virtual MSIX_PERM
registers (one per MSI-X entry) for validation. Access to MSIX_PERM is
trapped and emulated by the host idxd driver, which then figures out
which PASID to register in msi_desc (this requires PASID translation info
via the new /dev/iommu proposal).

The guest driver is expected to update MSIX_PERM before request_irq().
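
To make the trap path concrete, the host-side handling could look roughly
like the sketch below; the MSIX_PERM layout, the idxd_vdev state and the
translation helper are all assumptions for illustration:

    #include <linux/bitfield.h>
    #include <linux/bits.h>
    #include <linux/types.h>

    #define MSIX_PERM_PASID_EN    BIT(3)           /* assumed layout */
    #define MSIX_PERM_PASID       GENMASK(31, 12)
    #define IDXD_VDEV_MAX_VECTORS 9                /* illustrative */

    struct idxd_vdev {                             /* hypothetical state */
        u32 default_pasid;
        u32 ims_pasid[IDXD_VDEV_MAX_VECTORS];
    };

    /* Assumed lookup backed by the /dev/iommu translation info. */
    u32 idxd_vdev_gpasid_to_hpasid(struct idxd_vdev *vdev, u32 gpasid);

    /* Runs when the guest's write to virtual MSIX_PERM[index] traps. */
    static void idxd_vdev_write_msix_perm(struct idxd_vdev *vdev,
                                          unsigned int index, u32 val)
    {
        u32 pasid = vdev->default_pasid;

        if (val & MSIX_PERM_PASID_EN)
            pasid = idxd_vdev_gpasid_to_hpasid(vdev,
                                FIELD_GET(MSIX_PERM_PASID, val));

        /* Remembered here; registered into msi_desc at irq allocation. */
        vdev->ims_pasid[index] = pasid;
    }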

Now the 2nd open question requires your help. Below is what I learned
from the current vfio/qemu code (for vfio-pci devices):

0) Qemu doesn't attempt to allocate all irqs as reported by msix->
table_size. It is done in a dynamic and incremental way.

1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for
allocating/enabling irqs, given a set of vMSI-X vectors [start, count]
(a userspace sketch of this command follows the flow below):

a) if irqs are not allocated, allocate start+count irqs and enable the
specified vectors [start, count] via request_irq();
b) if irqs are already allocated, enable irqs for the specified vectors;
c) if an irq is already enabled, disable and re-enable irqs for the
specified vectors, because the user may specify a different eventfd;

2) When the guest enables the virtual MSI-X capability, Qemu calls VFIO_
DEVICE_SET_IRQS to enable vector#0, even though it's currently
masked by the guest. Interrupts are received by Qemu but blocked
from the guest via mask/pending bit emulation. The main intention is
to enable physical MSI-X;

3) When the guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
DEVICE_SET_IRQS to enable vector#0 again, with an eventfd different
from the one provided in 2);

4) When the guest unmasks vector#1, Qemu finds it's outside of the
allocated vectors (only vector#0 so far):

a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the
irq for vector#0;

b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
irqs for both vector#0 and vector#1;

5) When guest unmasks vector#2, the same flow as in 4) continues.

....
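
For reference, a minimal userspace sketch of the SET_IRQS usage behind
this flow; the ioctl and flags are the real vfio uAPI from <linux/vfio.h>,
while the helpers only mirror what Qemu does conceptually, not its actual
code:

    #include <linux/vfio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Enable MSI-X vectors [start, start+count) with one eventfd each. */
    static int vfio_msix_set(int device_fd, uint32_t start, uint32_t count,
                             const int32_t *eventfds)
    {
        size_t sz = sizeof(struct vfio_irq_set) + count * sizeof(int32_t);
        struct vfio_irq_set *set = calloc(1, sz);
        int ret;

        set->argsz = sz;
        set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
        set->index = VFIO_PCI_MSIX_IRQ_INDEX;
        set->start = start;
        set->count = count;
        memcpy(set->data, eventfds, count * sizeof(int32_t));
        ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
        free(set);
        return ret;
    }

    /* Disable the whole MSI-X index (step 4.a): count == 0 tears it down. */
    static int vfio_msix_disable(int device_fd)
    {
        struct vfio_irq_set set = {
            .argsz = sizeof(set),
            .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
            .index = VFIO_PCI_MSIX_IRQ_INDEX,
        };

        return ioctl(device_fd, VFIO_DEVICE_SET_IRQS, &set);
    }

    /* Step 4: the guest unmasked a vector beyond the enabled range, so
     * tear everything down and re-enable vectors [0, nr] in one call. */
    static int vfio_msix_grow(int device_fd, uint32_t nr,
                              const int32_t *eventfds)
    {
        vfio_msix_disable(device_fd);                         /* 4.a */
        return vfio_msix_set(device_fd, 0, nr + 1, eventfds); /* 4.b */
    }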

If the above understanding is correct, how are lost interrupts avoided
between 4.a) and 4.b), given that the irq has been torn down for vector#0
in the middle while from the guest's p.o.v. this vector is actually
unmasked? There must be a mechanism in place, but I just didn't figure it
out...

Assuming the above flow is robust, mapping Thomas's model to it is
straightforward. Assume the idxd mdev has two vectors: vector#0 for
misc/error interrupts and vector#1 as the completion interrupt for guest
sva. VFIO_DEVICE_SET_IRQS is handled by the idxd mdev driver:

2) When the guest enables the virtual MSI-X capability, Qemu calls VFIO_
DEVICE_SET_IRQS to enable vector#0. Because vector#0 is not
used for sva, MSIX_PERM#0 has PASID disabled. The host idxd driver
knows to register the default PASID in msi_desc#0 when allocating irqs.
Then the .startup() callback of the IMS irqchip is called at request_irq()
time to program the default PASID saved in msi_desc#0 into the target
IMS entry.

3) When the guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
DEVICE_SET_IRQS to enable vector#0 again. Following the same logic
as vfio-pci, the idxd driver first disables irq#0 via free_irq() and then
re-enables irq#0 via request_irq(). It's still the default PASID being
used, according to msi_desc#0.

4) When the guest unmasks vector#1, Qemu finds it's outside of the
allocated vectors (only vector#0 so far):

a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the
irq for vector#0. msi_desc#0 is also freed.

b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
irqs for both vector#0 and vector#1. At this point, MSIX_PERM#0
has PASID disabled while MSIX_PERM#1 has a valid guest PASID1
for sva. The idxd driver registers the default PASID in msi_desc#0 and
host PASID2 (translated from guest PASID1) in msi_desc#1 when
allocating irqs. Later, when both irqs are enabled via request_irq(), the
IMS irqchip driver updates the target IMS entries according to
msi_desc#0 and msi_desc#1 respectively.

But this is specific to how Qemu virtualizes MSI-X today. What if it
changes (or another device model chooses) to allocate all table_size irqs
when the guest enables the MSI-X capability? At that point we don't have
valid MSIX_PERM content from which to register PASID info in msi_desc.
Possibly what we really require is a separate helper function allowing the
driver to update msi_desc after irq allocation, e.g. when the guest
unmasks a vector...
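
Such a helper might look roughly like this; it is entirely hypothetical -
neither the function nor the device_cookie field exists today:

    #include <linux/device.h>
    #include <linux/msi.h>

    /*
     * Hypothetical helper: refresh the isolation ID stored in an
     * already-allocated msi_desc so that a later .irq_startup()
     * writes the new value to the IMS entry.
     */
    int pci_update_msi_desc_id(struct device *dev, unsigned int index,
                               u32 id)
    {
        struct msi_desc *desc;

        for_each_msi_entry(desc, dev) {
            if (index-- == 0) {
                desc->device_cookie = id;  /* assumed field, see above */
                return 0;
            }
        }
        return -ENOENT;
    }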

And do you see any other facets that have been overlooked here?

Thanks
Kevin


2021-06-22 15:52:33

by Dave Jiang

Subject: Re: Virtualizing MSI-X on IMS via VFIO


On 6/22/2021 3:16 AM, Tian, Kevin wrote:
> [...]
>
> 3) When the guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
> DEVICE_SET_IRQS to enable vector#0 again. Following the same logic
> as vfio-pci, the idxd driver first disables irq#0 via free_irq() and then
> re-enables irq#0 via request_irq(). It's still the default PASID being
> used, according to msi_desc#0.

Hi Kevin, a slight correction here. Because vector#0 is emulated for the
idxd vdev, it has no IMS backing, so there is no msi_desc#0 for that
vector. msi_desc#0 actually starts at vector#1, where an IMS entry is
allocated to back it. vector#0 does not go through request_irq(); it only
has the eventfd part. Everything you say is correct but starts at
vector#1.



2021-06-22 19:13:33

by Alex Williamson

Subject: Re: Virtualizing MSI-X on IMS via VFIO

On Tue, 22 Jun 2021 10:16:15 +0000
"Tian, Kevin" <[email protected]> wrote:

> Hi, Alex,
>
> [...]
>
> Now the 2nd open question requires your help. Below is what I learned
> from the current vfio/qemu code (for vfio-pci devices):
>
> 0) Qemu doesn't attempt to allocate all irqs as reported by msix->
> table_size. It is done in a dynamic and incremental way.

Not by table_size; our expectation is that the device's interrupt
behavior can be implicitly affected by the enabled vectors, and the
table size may support far more vectors than the driver might actually
use. It's also easier if we never need to get into the scenario of
pci_alloc_irq_vectors() returning a smaller than requested number of
vectors and needing to fall back to a vector negotiation that doesn't
exist via MSI-X.

FWIW, more recent QEMU will scan the vector table for the highest
non-masked vector to initially enable that number of vectors in the
host, both to improve restore behavior after migration and avoid
overhead for guests that write the vector table before setting the
MSI-X capability enable bit (Windows?).
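
That scan boils down to checking the per-entry mask bit in Vector
Control; a simplified sketch of the idea (not QEMU's actual code; the
entry layout constants follow the PCI spec):

    #include <stdint.h>
    #include <string.h>

    #define MSIX_ENTRY_SIZE     16  /* each MSI-X table entry is 16 bytes */
    #define MSIX_VECTOR_CTRL    12  /* Vector Control dword offset */
    #define MSIX_VECTOR_MASKBIT (1u << 0)

    /* Return highest unmasked entry + 1, i.e. vectors to enable. */
    static int msix_vectors_in_use(const uint8_t *table, int table_size)
    {
        int nr = 0;

        for (int i = 0; i < table_size; i++) {
            uint32_t ctrl;

            memcpy(&ctrl, table + i * MSIX_ENTRY_SIZE + MSIX_VECTOR_CTRL,
                   sizeof(ctrl));
            if (!(ctrl & MSIX_VECTOR_MASKBIT))
                nr = i + 1;
        }
        return nr;
    }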

> 1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for
> allocating/enabling irqs, given a set of vMSI-X vectors [start, count]:
>
> a) if irqs are not allocated, allocate start+count irqs and enable the
> specified vectors [start, count] via request_irq();
> b) if irqs are already allocated, enable irqs for the specified vectors;
> c) if an irq is already enabled, disable and re-enable irqs for the
> specified vectors, because the user may specify a different eventfd;
>
> 2) When the guest enables the virtual MSI-X capability, Qemu calls VFIO_
> DEVICE_SET_IRQS to enable vector#0, even though it's currently
> masked by the guest. Interrupts are received by Qemu but blocked
> from the guest via mask/pending bit emulation. The main intention is
> to enable physical MSI-X;

Yes, this is a bit awkward since the interrupt API is via SET_IRQS and
we don't allow writes to the MSI-X enable bit via config space.

> 3) When the guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
> DEVICE_SET_IRQS to enable vector#0 again, with an eventfd different
> from the one provided in 2);
>
> 4) When the guest unmasks vector#1, Qemu finds it's outside of the
> allocated vectors (only vector#0 so far):
>
> a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the
> irq for vector#0;
>
> b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
> irqs for both vector#0 and vector#1;
>
> 5) When guest unmasks vector#2, the same flow as in 4) continues.
>
> ....
>
> If the above understanding is correct, how are lost interrupts avoided
> between 4.a) and 4.b), given that the irq has been torn down for vector#0
> in the middle while from the guest's p.o.v. this vector is actually
> unmasked? There must be a mechanism in place, but I just didn't figure
> it out...

In practice unmasking new vectors is rare and done only at
initialization. Risk from lost interrupts at this time is low. When
masking and unmasking vectors that are already in use, we're only
changing the signaling eventfd between KVM and QEMU such that QEMU can
set emulated pending bits in response to interrupts (given our lack of
interfaces to handle the mask/unmask at the host). I believe that
locking in the vfio-pci driver prevents an interrupt from being lost
during the eventfd switch.
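
In terms of the uAPI, that eventfd switch is just another single-vector
SET_IRQS call; a usage sketch reusing the vfio_msix_set() helper from the
first mail (the eventfd arguments are placeholders):

    #include <stdint.h>

    /* vfio_msix_set() is the helper sketched in the first mail. */
    int vfio_msix_set(int device_fd, uint32_t start, uint32_t count,
                      const int32_t *eventfds);

    /* Guest masks vector nr: point the trigger at QEMU's eventfd so the
     * mask/pending bits can be emulated; on unmask, point it back at
     * the KVM irqfd for direct injection. */
    static int vfio_msix_switch_trigger(int device_fd, uint32_t nr,
                                        int32_t eventfd)
    {
        return vfio_msix_set(device_fd, nr, 1, &eventfd);
    }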

> Assuming the above flow is robust, mapping Thomas's model to it is
> straightforward. Assume the idxd mdev has two vectors: vector#0 for
> misc/error interrupts and vector#1 as the completion interrupt for guest
> sva. VFIO_DEVICE_SET_IRQS is handled by the idxd mdev driver:
>
> 2) When the guest enables the virtual MSI-X capability, Qemu calls VFIO_
> DEVICE_SET_IRQS to enable vector#0. Because vector#0 is not
> used for sva, MSIX_PERM#0 has PASID disabled. The host idxd driver
> knows to register the default PASID in msi_desc#0 when allocating irqs.
> Then the .startup() callback of the IMS irqchip is called at request_irq()
> time to program the default PASID saved in msi_desc#0 into the target
> IMS entry.
>
> 3) When the guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
> DEVICE_SET_IRQS to enable vector#0 again. Following the same logic
> as vfio-pci, the idxd driver first disables irq#0 via free_irq() and then
> re-enables irq#0 via request_irq(). It's still the default PASID being
> used, according to msi_desc#0.
>
> 4) When the guest unmasks vector#1, Qemu finds it's outside of the
> allocated vectors (only vector#0 so far):
>
> a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the
> irq for vector#0. msi_desc#0 is also freed.
>
> b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
> irqs for both vector#0 and vector#1. At this point, MSIX_PERM#0
> has PASID disabled while MSIX_PERM#1 has a valid guest PASID1
> for sva. The idxd driver registers the default PASID in msi_desc#0 and
> host PASID2 (translated from guest PASID1) in msi_desc#1 when
> allocating irqs. Later, when both irqs are enabled via request_irq(), the
> IMS irqchip driver updates the target IMS entries according to
> msi_desc#0 and msi_desc#1 respectively.
>
> But this is specific to how Qemu virtualizes MSI-X today. What if it
> changes (or another device model chooses) to allocate all table_size irqs
> when the guest enables the MSI-X capability? At that point we don't have
> valid MSIX_PERM content from which to register PASID info in msi_desc.
> Possibly what we really require is a separate helper function allowing the
> driver to update msi_desc after irq allocation, e.g. when the guest
> unmasks a vector...

I think you're basically asking how you can guarantee that you always
have a mechanism to update your device-specific MSIX_PERM table before
or after the vector tables. You're trapping and emulating the
MSIX_PERM table, so every write traps to your driver. Therefore
shouldn't the driver be able to set up any vector using the default
PASID if MSIX_PERM is not configured, and update it with the correct
PASID translation based on the trapped write to the register? Logically
I think you'd want your guest driver to mask and unmask the affected
vector around modifying the MSIX_PERM entry as well, so another option
would be to reevaluate MSIX_PERM on unmask, which triggers a SET_IRQS
ioctl into the idxd host driver. Either entry point could trigger a
descriptor update to the host irqchip driver.
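
Pulling that together as host idxd driver logic, reusing the hypothetical
sketches from the first mail (vdev->dev is an assumed struct device
pointer):

    /* Re-register the PASID for one vector; both the MSIX_PERM write
     * trap and the SET_IRQS (unmask) path can call this after updating
     * ims_pasid[], using the hypothetical helper from the first mail. */
    static int idxd_vdev_sync_pasid(struct idxd_vdev *vdev,
                                    unsigned int index)
    {
        return pci_update_msi_desc_id(vdev->dev, index,
                                      vdev->ims_pasid[index]);
    }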

> And do you see any other facets that have been overlooked here?

AIUI, the default with no sva PASID should always work, the host driver
initializes the device with a virtual MSIX_PERM table with all the
entries disabled, no special guest/host driver coordination is
required. In the case of setting a PASID for a vector, you have entry
points into the driver either by the virtualization of the MSIX_PERM
table or likely also at the unmasking of the vector, so it seems fully
contained. Thanks,

Alex

2021-06-23 00:02:15

by Thomas Gleixner

Subject: Re: Virtualizing MSI-X on IMS via VFIO

Alex,

[ Resend due to a delivery issue here. Sorry if you got this twice. ]

On Tue, Jun 22 2021 at 13:12, Alex Williamson wrote:
> On Tue, 22 Jun 2021 10:16:15 +0000 "Tian, Kevin" <[email protected]> wrote:
>
>> Now the 2nd open question requires your help. Below is what I learned
>> from the current vfio/qemu code (for vfio-pci devices):
>>
>> 0) Qemu doesn't attempt to allocate all irqs as reported by msix->
>> table_size. It is done in a dynamic and incremental way.
>
> Not by table_size; our expectation is that the device's interrupt
> behavior can be implicitly affected by the enabled vectors, and the
> table size may support far more vectors than the driver might actually
> use. It's also easier if we never need to get into the scenario of
> pci_alloc_irq_vectors() returning a smaller than requested number of
> vectors and needing to fall back to a vector negotiation that doesn't
> exist via MSI-X.
>
> FWIW, more recent QEMU will scan the vector table for the highest
> non-masked vector to initially enable that number of vectors in the
> host, both to improve restore behavior after migration and avoid
> overhead for guests that write the vector table before setting the
> MSI-X capability enable bit (Windows?).
>
>> 1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for
>> allocating/enabling irqs, given a set of vMSI-X vectors [start, count]:
>> a) if irqs are not allocated, allocate start+count irqs and enable the
>> specified vectors [start, count] via request_irq();
>> b) if irqs are already allocated, enable irqs for the specified vectors;
>> c) if an irq is already enabled, disable and re-enable irqs for the
>> specified vectors, because the user may specify a different eventfd;
>>
>> 2) When the guest enables the virtual MSI-X capability, Qemu calls VFIO_
>> DEVICE_SET_IRQS to enable vector#0, even though it's currently
>> masked by the guest. Interrupts are received by Qemu but blocked
>> from the guest via mask/pending bit emulation. The main intention is
>> to enable physical MSI-X;
>
> Yes, this is a bit awkward since the interrupt API is via SET_IRQS and
> we don't allow writes to the MSI-X enable bit via config space.
>
>> 3) When the guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
>> DEVICE_SET_IRQS to enable vector#0 again, with an eventfd different
>> from the one provided in 2);
>>
>> 4) When the guest unmasks vector#1, Qemu finds it's outside of the
>> allocated vectors (only vector#0 so far):
>>
>> a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the
>> irq for vector#0;
>>
>> b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
>> irqs for both vector#0 and vector#1;
>>
>> 5) When guest unmasks vector#2, the same flow as in 4) continues.

That's dangerous and makes weird assumptions about interrupts being
requested early in the driver's init() function. But that's wishful
thinking, really. There are enough drivers, especially networking, which
request interrupts on device open() and not on device init(). Some
special functions only request the irq way later, i.e. only when someone
uses that special function, and at that point the other irqs of that very
same device are already in use.

>> If the above understanding is correct, how are lost interrupts avoided
>> between 4.a) and 4.b), given that the irq has been torn down for vector#0
>> in the middle while from the guest's p.o.v. this vector is actually
>> unmasked? There must be a mechanism in place, but I just didn't figure
>> it out...
>
> In practice unmasking new vectors is rare and done only at
> initialization. Risk from lost interrupts at this time is low.

See above. Wishful thinking.

OMG, I really don't want to be the one to debug _WHY_ a device lost
interrupts just because it did a late request_irq() when the device is
operational already and has other interrupts active.

> When masking and unmasking vectors that are already in use, we're only
> changing the signaling eventfd between KVM and QEMU such that QEMU can
> set emulated pending bits in response to interrupts (given our lack of
> interfaces to handle the mask/unmask at the host). I believe that
> locking in the vfio-pci driver prevents an interrupt from being lost
> during the eventfd switch.

Let's look at this from a driver perspective:

1) Driver checks how many entries are possible in the MSI-X table

2) Driver invokes pci_msix_enable[_range]() which returns the number
of vectors the system is willing to hand out to the driver.

3) Driver assigns the vectors to the different resources in the
hardware

4) Later on these interrupts are requested, but not necessarily during
device init.

Yes, request_irq() can fail, and today it can also fail due to CPU
vector exhaustion. That's perfectly fine as the driver can handle
the failure and act accordingly.

All of this is consistent and well defined.
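
As a sketch of that sequence for an imaginary device (real PCI/irq APIs,
abbreviated error handling):

    #include <linux/interrupt.h>
    #include <linux/pci.h>

    struct foo_dev {                       /* imaginary device */
        struct pci_dev *pdev;
        int nvec;
    };

    static irqreturn_t foo_handler(int irq, void *data)
    {
        return IRQ_HANDLED;
    }

    static int foo_init_irqs(struct foo_dev *foo)
    {
        /* 1) + 2): ask for up to the table size, accept what we get. */
        foo->nvec = pci_alloc_irq_vectors(foo->pdev, 1,
                                          pci_msix_vec_count(foo->pdev),
                                          PCI_IRQ_MSIX);
        if (foo->nvec < 0)
            return foo->nvec;

        /* 3): assign the vectors to queues/resources here ... */
        return 0;
    }

    static int foo_open(struct foo_dev *foo)
    {
        /* 4): possibly long after init; a failure (e.g. CPU vector
         * exhaustion) is returned to the driver, which handles it. */
        return request_irq(pci_irq_vector(foo->pdev, 0), foo_handler, 0,
                           "foo", foo);
    }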

Now let's look at the guest. This VFIO mechanism introduces two brand new
failure modes because of this:

guest::unmask()
trapped_by_host()
free_irqs();
pci_free_irq_vectors();
pci_disable_msix();
pci_alloc_irq_vectors();
pci_enable_msix();
request_irqs();

#1 What happens if the host side allocation or the host side request_irq()
fails?

a) Guest is killed?
b) Failure is silently ignored and guest just does not receive
interrupts?
c) Any other mechanism?

Whatever it is, it simply _cannot_ make request_irq() in the
guest fail, because the guest has already passed all failure
points and is in the low level function of unmasking the
interrupt for the first time, after which it will return 0, aka
success.

So if you answer #a, fine with me. It's well defined.

#2 What happens to already active interrupts on that device which might
fire during that time?

They get lost or are redirected to the legacy PCI interrupt and
there is absolutely nothing you can do about that.

Simply because to prevent that, the host side would have to
disable the interrupt source at the device level, i.e. fiddle
with actual device registers to shut up interrupt delivery and
re-enable it afterwards, thereby racing against some
other vCPU of the same guest which fiddles with those very same
registers.

IOW, undefined behaviour, which the "VFIO design" shrugged off on
completely unjustified assumptions.

No matter how much you argue about this being unlikely, this is just
broken. Unlikely simply does not exist at cloud scale.

Aside of that. How did you come to the conclusion that #2 does not
matter? By analyzing _every_ open and closed source driver for their
usage patterns and documenting that all drivers which want to work in
VFIO-PCI space have to follow specific rules vs. interrupt setup and
usage? I'm pretty sure that you have a mechanism in place which
monitors closely whether a driver violates those well documented rules.

Yes, I know that I'm dreaming and the reality is that this is based on
interesting assumptions and just works by chance.

I have no idea _why_ this has been done that way. The changelogs of the
relevant commits are void of useful content and lack links to the
possibly existing discussions about this.

I can only assume that back then the main problem was vector exhaustion
on the host and the desire to avoid allocating memory for interrupt
descriptors etc., right?

The host vector exhaustion problem was that each MSIX vector consumed a
real CPU vector which is a limited resource on X86. This is no longer
the case today:

1) pci_msix_enable[range]() consumes exactly zero CPU vectors from
the allocatable range independent of the number of MSIX vectors
it allocates, unless it is in multi-queue managed mode where it
will actually reserve a vector (maybe two) per CPU.

But for devices which are not using that mode, they just
opportunistically "reserve" vectors.

All entries are initialized with a special system vector which
when raised will emit a nastigram in dmesg.

2) request_irq() actually allocates a CPU vector from the
allocatable vector space which can obviously still fail, which is
perfectly fine.

So the only downside today of allocating more MSI-X vectors than
necessary is memory consumption for the irq descriptors.

Though for virtualization there is still another problem:

Even if all possible MSI-X vectors for a passthrough PCI device would
be allocated upfront independent of the actual usage in the guest,
then there is still the issue of request_irq() failing on the host
once the guest decides to use any of those interrupts.

It's a halfways reasonable argumentation by some definition of
reasonable, that this case would be a host system configuration problem
and the admin who overcommitted is responsible for the consequence.

Where the only reasonable consequence is to kill the guest right there
because there is no mechanism to actually tell it that the host ran out
of resources.

Not at all a pretty solution, but contrary to the status quo it is well
defined. The most important aspect is that it is well defined for the
case of success:

If it succeeds then there is no way that already requested interrupts
can be lost or end up being redirected to the legacy PCI irq due to
clearing the MSIX enable bit, which is a gazillion times better than
the "let's hope it works" based tinkerware we have now.

So, aside of the existing VFIO/PCI/MSIX thing being just half thought
out, even thinking about proliferating this with IMS is bonkers.

IMS is meant to avoid the problem of MSI-X which needs to disable MSI-X
in order to expand the number of vectors. The use cases are going to be
even more dynamic than the usual device drivers, so the lost interrupt
issue will be much more likely to trigger.

So no, we are not going to proliferate this complete ignorance of how
MSI-X actually works and just cram another "feature" into code which is
known to be incorrect.

Thanks,

tglx

2021-06-23 06:13:58

by Tian, Kevin

Subject: RE: Virtualizing MSI-X on IMS via VFIO

> From: Thomas Gleixner <[email protected]>
> Sent: Wednesday, June 23, 2021 7:59 AM
>
[...]
> I can only assume that back then the main problem was vector exhaustion
> on the host and the desire to avoid allocating memory for interrupt
> descriptors etc., right?
>
> The host vector exhaustion problem was that each MSIX vector consumed a
> real CPU vector which is a limited resource on X86. This is no longer
> the case today:
>
> 1) pci_msix_enable[range]() consumes exactly zero CPU vectors from
> the allocatable range independent of the number of MSIX vectors
> it allocates, unless it is in multi-queue managed mode where it
> will actually reserve a vector (maybe two) per CPU.
>
> But for devices which are not using that mode, they just
> opportunistically "reserve" vectors.
>
> All entries are initialized with a special system vector which
> when raised will emit a nastigram in dmesg.
>
> 2) request_irq() actually allocates a CPU vector from the
> allocatable vector space which can obviously still fail, which is
> perfectly fine.
>
> So the only downside today of allocating more MSI-X vectors than
> necessary is memory consumption for the irq descriptors.

Curious about the IRTE when IRQ remapping is enabled. Is it also
allocated at request_irq()?

>
> Though for virtualization there is still another problem:
>
> Even if all possible MSI-X vectors for a passthrough PCI device would
> be allocated upfront independent of the actual usage in the guest,
> then there is still the issue of request_irq() failing on the host
> once the guest decides to use any of those interrupts.
>
> It's a halfways reasonable argumentation by some definition of
> reasonable, that this case would be a host system configuration problem
> and the admin who overcommitted is responsible for the consequence.
>
> Where the only reasonable consequence is to kill the guest right there
> because there is no mechanism to actually tell it that the host ran out
> of resources.
>
> Not at all a pretty solution, but contrary to the status quo it is well
> defined. The most important aspect is that it is well defined for the
> case of success:
>
> If it succeeds then there is no way that already requested interrupts
> can be lost or end up being redirected to the legacy PCI irq due to
> clearing the MSIX enable bit, which is a gazillion times better than
> the "let's hope it works" based tinkerware we have now.

fair enough.

>
> So, aside of the existing VFIO/PCI/MSIX thing being just half thought
> out, even thinking about proliferating this with IMS is bonkers.
>
> IMS is meant to avoid the problem of MSI-X which needs to disable MSI-X
> in order to expand the number of vectors. The use cases are going to be
> even more dynamic than the usual device drivers, so the lost interrupt
> issue will be much more likely to trigger.
>
> So no, we are not going to proliferate this complete ignorance of how
> MSI-X actually works and just cram another "feature" into code which is
> known to be incorrect.
>

So the correct flow is like below:

guest::enable_msix()
trapped_by_host()
pci_alloc_irq_vectors(); // for all possible vMSI-X entries
pci_enable_msix();

guest::unmask()
trapped_by_host()
request_irqs();

The first trap calls a new VFIO ioctl, e.g. VFIO_DEVICE_ALLOC_IRQS.

The 2nd trap can reuse the existing VFIO_DEVICE_SET_IRQS, which just
does request_irq() if the specified irqs have been allocated.
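
A possible shape for the proposed ioctl; purely illustrative, since this
ioctl does not exist - the number and layout are placeholders:

    #include <linux/vfio.h>

    /* Proposed uAPI: allocate host irqs for the whole index up front,
     * without calling request_irq() on any of them yet. */
    struct vfio_irq_alloc {
        __u32 argsz;
        __u32 flags;
        __u32 index;    /* e.g. VFIO_PCI_MSIX_IRQ_INDEX */
        __u32 count;    /* all possible vMSI-X entries */
    };

    /* Placeholder ioctl number, for illustration only. */
    #define VFIO_DEVICE_ALLOC_IRQS    _IO(VFIO_TYPE, VFIO_BASE + 21)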

Then map ims to this flow:

guest::enable_msix()
trapped_by_host()
msi_domain_alloc_irqs(); // for all possible vMSI-X entries
for_all_allocated_irqs(i)
pci_update_msi_desc_id(i, default_pasid); // a new helper func

guest::unmask(entry#0)
trapped_by_host()
request_irqs();
ims_array_irq_startup(); // write msi_desc.id (default_pasid) to ims entry

guest::set_msix_perm(entry#1, guest_sva_pasid)
trapped_by_host()
pci_update_msi_desc_id(1, host_sva_pasid);

guest::unmask(entry#1)
trapped_by_host()
request_irqs();
ims_array_irq_startup(); // write msi_desc.id (host_sva_pasid) to ims entry

Does above match your thoughts?

Thanks
Kevin

2021-06-23 06:20:10

by Tian, Kevin

Subject: RE: Virtualizing MSI-X on IMS via VFIO

> From: Jiang, Dave <[email protected]>
> Sent: Tuesday, June 22, 2021 11:51 PM
>
> On 6/22/2021 3:16 AM, Tian, Kevin wrote:
> > [...]
>
> Hi Kevin, a slight correction here. Because vector#0 is emulated for the
> idxd vdev, it has no IMS backing, so there is no msi_desc#0 for that
> vector. msi_desc#0 actually starts at vector#1, where an IMS entry is
> allocated to back it. vector#0 does not go through request_irq(); it only
> has the eventfd part. Everything you say is correct but starts at
> vector#1.

You are right. But for simplicity of illustration, let's still assume both
vector#0 and vector#1 are backed by IMS in the following discussion, since
a purely emulated vector is outside of this context anyway. :)

Thanks
Kevin

2021-06-23 15:20:24

by Alex Williamson

Subject: Re: Virtualizing MSI-X on IMS via VFIO

On Wed, 23 Jun 2021 01:59:24 +0200
Thomas Gleixner <[email protected]> wrote:

> Alex,
>
> [ Resend due to a delivery issue here. Sorry if you got this twice. ]
>
> On Tue, Jun 22 2021 at 13:12, Alex Williamson wrote:
> > On Tue, 22 Jun 2021 10:16:15 +0000 "Tian, Kevin" <[email protected]> wrote:
> >
> >> Now the 2nd open question requires your help. Below is what I learned
> >> from the current vfio/qemu code (for vfio-pci devices):
> >>
> >> 0) Qemu doesn't attempt to allocate all irqs as reported by msix->
> >> table_size. It is done in a dynamic and incremental way.
> >
> > Not by table_size; our expectation is that the device's interrupt
> > behavior can be implicitly affected by the enabled vectors, and the
> > table size may support far more vectors than the driver might actually
> > use. It's also easier if we never need to get into the scenario of
> > pci_alloc_irq_vectors() returning a smaller than requested number of
> > vectors and needing to fall back to a vector negotiation that doesn't
> > exist via MSI-X.
> >
> > FWIW, more recent QEMU will scan the vector table for the highest
> > non-masked vector to initially enable that number of vectors in the
> > host, both to improve restore behavior after migration and avoid
> > overhead for guests that write the vector table before setting the
> > MSI-X capability enable bit (Windows?).
> >
> >> 1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for
> >> allocating/enabling irqs, given a set of vMSI-X vectors [start, count]:
> >> a) if irqs are not allocated, allocate start+count irqs and enable the
> >> specified vectors [start, count] via request_irq();
> >> b) if irqs are already allocated, enable irqs for the specified vectors;
> >> c) if an irq is already enabled, disable and re-enable irqs for the
> >> specified vectors, because the user may specify a different eventfd;
> >>
> >> 2) When the guest enables the virtual MSI-X capability, Qemu calls VFIO_
> >> DEVICE_SET_IRQS to enable vector#0, even though it's currently
> >> masked by the guest. Interrupts are received by Qemu but blocked
> >> from the guest via mask/pending bit emulation. The main intention is
> >> to enable physical MSI-X;
> >
> > Yes, this is a bit awkward since the interrupt API is via SET_IRQS and
> > we don't allow writes to the MSI-X enable bit via config space.
> >
> >> 3) When the guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
> >> DEVICE_SET_IRQS to enable vector#0 again, with an eventfd different
> >> from the one provided in 2);
> >>
> >> 4) When the guest unmasks vector#1, Qemu finds it's outside of the
> >> allocated vectors (only vector#0 so far):
> >>
> >> a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the
> >> irq for vector#0;
> >>
> >> b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
> >> irqs for both vector#0 and vector#1;
> >>
> >> 5) When guest unmasks vector#2, the same flow as in 4) continues.
>
> That's dangerous and makes weird assumptions about interrupts being
> requested early in the driver's init() function. But that's wishful
> thinking, really. There are enough drivers, especially networking, which
> request interrupts on device open() and not on device init(). Some
> special functions only request the irq way later, i.e. only when someone
> uses that special function, and at that point the other irqs of that very
> same device are already in use.

I'd consider allocating vectors on open() to be setup time, device
initialization. Chances of in-flight interrupts are low. If you have
examples of drivers where the irq request comes way later, please let
me know; maybe there's some way we can identify that use case.

> >> If the above understanding is correct, how are lost interrupts avoided
> >> between 4.a) and 4.b), given that the irq has been torn down for vector#0
> >> in the middle while from the guest's p.o.v. this vector is actually
> >> unmasked? There must be a mechanism in place, but I just didn't figure
> >> it out...
> >
> > In practice unmasking new vectors is rare and done only at
> > initialization. Risk from lost interrupts at this time is low.
>
> See above. Wishful thinking.
>
> OMG, I really don't want to be the one to debug _WHY_ a device lost
> interrupts just because it did a late request_irq() when the device is
> operational already and has other interrupts active.
>
> > When masking and unmasking vectors that are already in use, we're only
> > changing the signaling eventfd between KVM and QEMU such that QEMU can
> > set emulated pending bits in response to interrupts (given our lack of
> > interfaces to handle the mask/unmask at the host). I believe that
> > locking in the vfio-pci driver prevents an interrupt from being lost
> > during the eventfd switch.
>
> Let's look at this from a driver perspective:
>
> 1) Driver checks how many entries are possible in the MSI-X table
>
> 2) Driver invokes pci_msix_enable[_range]() which returns the number
> of vectors the system is willing to hand out to the driver.
>
> 3) Driver assigns the vectors to the different resources in the
> hardware
>
> 4) Later on these interrupts are requested, but not necessarily during
> device init.
>
> Yes, request_irq() can fail, and today it can also fail due to CPU
> vector exhaustion. That's perfectly fine as the driver can handle
> the failure and act accordingly.
>
> All of this is consistent and well defined.

We can't know in the hypervisor how many vectors the guest driver asked
for; we only know that it's somewhere between one and the number
supported by the vector table. We'd contribute to vector exhaustion if
we always assumed the max.

> Now let's look at the guest. This VFIO mechanism introduces two brand new
> failure modes because of this:
>
> guest::unmask()
> trapped_by_host()
> free_irqs();
> pci_free_irq_vectors();
> pci_disable_msix();
> pci_alloc_irq_vectors();
> pci_enable_msix();
> request_irqs();
>
> #1 What happens if the host side allocation or the host side request_irq()
> fails?
>
> a) Guest is killed?
> b) Failure is silently ignored and guest just does not receive
> interrupts?
> c) Any other mechanism?

QEMU generates a log on failure. At the vfio-pci level we're emulating
the MSI-X capability and vector table; there is no mechanism in the
PCI-spec-defined protocol by which manipulating a vector table entry can
fail, but clearly it can fail on the host. That's just the world we live
in: either we overestimate vector usage on the host or we do the best we
can to infer how the device is actually being used.

Within the vfio API we have a mechanism to describe that the above
behavior is required, i.e. that we can't dynamically resize the number
of MSI-X vectors. If we had kernel interfaces to dynamically change
our vector allocation after pci_alloc_irq_vectors(), we could remove
that flag and QEMU wouldn't need to go through this process.
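
That flag is queryable from userspace today (real uAPI; minimal sketch):

    #include <linux/vfio.h>
    #include <stdbool.h>
    #include <sys/ioctl.h>

    /* Does this device require full disable/re-enable to grow MSI-X? */
    static bool msix_needs_full_reenable(int device_fd)
    {
        struct vfio_irq_info info = {
            .argsz = sizeof(info),
            .index = VFIO_PCI_MSIX_IRQ_INDEX,
        };

        if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info) < 0)
            return true;    /* be conservative on error */

        return info.flags & VFIO_IRQ_INFO_NORESIZE;
    }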

>
> Whatever it is, it simply _cannot_ make request_irq() in the
> guest fail, because the guest has already passed all failure
> points and is in the low level function of unmasking the
> interrupt for the first time, after which it will return 0, aka
> success.
>
> So if you answer #a, fine with me. It's well defined.

As far as the guest is concerned, #b. As you say, we're at the point
in the guest where the guest interrupt code is interacting with MSI-X
directly and there is no way it can fail on hardware. It's not
entirely silent though; the hypervisor logs an error. Whether "the
device I hot-added to my VM doesn't work" or "I hot-added a device to
my VM and it crashed" is the better failure mode is a matter of
perspective.

> #2 What happens to already active interrupts on that device which might
> fire during that time?
>
> They get lost or are redirected to the legacy PCI interrupt and
> there is absolutely nothing you can do about that.
>
> Simply because to prevent that, the host side would have to
> disable the interrupt source at the device level, i.e. fiddle
> with actual device registers to shut up interrupt delivery and
> re-enable it afterwards, thereby racing against some
> other vCPU of the same guest which fiddles with those very same
> registers.
>
> IOW, undefined behaviour, which the "VFIO design" shrugged off on
> completely unjustified assumptions.

The vfio design signals to userspace that this behavior is required via
a flag on the IRQ info structure. Should host interfaces become
available that allow us to re-size the number of vectors for a device,
we can remove the flag and userspace would no longer need this
disable/re-enable process.

> No matter how much you argue about this being unlikely, this is just
> broken. Unlikely simply does not exist at cloud scale.
>
> Aside of that. How did you come to the conclusion that #2 does not
> matter? By analyzing _every_ open and closed source driver for their
> usage patterns and documenting that all drivers which want to work in
> VFIO-PCI space have to follow specific rules vs. interrupt setup and
> usage? I'm pretty sure that you have a mechanism in place which
> monitors closely whether a driver violates those well documented rules.
>
> Yes, I know that I'm dreaming and the reality is that this is based on
> interesting assumptions and just works by chance.
>
> I have no idea _why_ this has been done that way. The changelogs of the
> relevant commits are void of useful content and lack links to the
> possibly existing discussions about this.
>
> I can only assume that back then the main problem was vector exhaustion
> on the host and the desire to avoid allocating memory for interrupt
> descriptors etc., right?

Essentially yes, and it's largely userspace policy, i.e. QEMU. QEMU
owns the device; it has the right to allocate an entire vector table's
worth of interrupts if it wants.

> The host vector exhaustion problem was that each MSIX vector consumed a
> real CPU vector which is a limited resource on X86. This is no longer
> the case today:
>
> 1) pci_msix_enable[range]() consumes exactly zero CPU vectors from
> the allocatable range independent of the number of MSIX vectors
> it allocates, unless it is in multi-queue managed mode where it
> will actually reserve a vector (maybe two) per CPU.
>
> But for devices which are not using that mode, they just
> opportunistically "reserve" vectors.
>
> All entries are initialized with a special system vector which
> when raised will emit a nastigram in dmesg.
>
> 2) request_irq() actually allocates a CPU vector from the
> allocatable vector space which can obviously still fail, which is
> perfectly fine.

request_irq() is where the vector table entry actually gets unmasked,
though. That act of unmasking the vector table entry in hardware cannot
fail, but the virtualization of that act is backed by a request_irq() in
the host, which can fail.

> So the only downside today of allocating more MSI-X vectors than
> necessary is memory consumption for the irq descriptors.

As above, this is a QEMU policy of essentially trying to be a good
citizen and allocate only what we can infer the guest is using. What's
a good way for QEMU, or any userspace, to know it's running on a host
where vector exhaustion is not an issue?

> Though for virtualization there is still another problem:
>
> Even if all possible MSI-X vectors for a passthrough PCI device would
> be allocated upfront independent of the actual usage in the guest,
> then there is still the issue of request_irq() failing on the host
> once the guest decides to use any of those interrupts.

Yup.

> It's a halfways reasonable argumentation by some definition of
> reasonable, that this case would be a host system configuration problem
> and the admin who overcommitted is responsible for the consequence.
>
> Where the only reasonable consequence is to kill the guest right there
> because there is no mechanism to actually tell it that the host ran out
> of resources.

People tend not to like their VM getting killed and potentially losing
data, so kill-the-guest vs device-doesn't-work is still debatable imo.

> Not at all a pretty solution, but contrary to the status quo it is well
> defined. The most important aspect is that it is well defined for the
> case of success:
>
> If it succeeds then there is no way that already requested interrupts
> can be lost or end up being redirected to the legacy PCI irq due to
> clearing the MSIX enable bit, which is a gazillion times better than
> the "let's hope it works" based tinkerware we have now.
>
> So, aside of the existing VFIO/PCI/MSIX thing being just half thought
> out, even thinking about proliferating this with IMS is bonkers.
>
> IMS is meant to avoid the problem of MSI-X which needs to disable MSI-X
> in order to expand the number of vectors. The use cases are going to be
> even more dynamic than the usual device drivers, so the lost interrupt
> issue will be much more likely to trigger.
>
> So no, we are not going to proliferate this complete ignorance of how
> MSI-X actually works and just cram another "feature" into code which is
> known to be incorrect.

Some of the issues of virtualizing MSI-X are unsolvable without
creating a new paravirtual interface, but obviously we want to work
with existing drivers and unmodified guests, so that's not an option.

To work with what we've got, the vfio API describes the limitation of
the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then
makes a choice in an attempt to better reflect what we can infer of the
guest programming of the device to incrementally enable vectors. We
could a) work to provide host kernel interfaces that allow us to remove
that noresize flag and b) decide whether QEMU's usage policy can be
improved on kernels where vector exhaustion is no longer an issue.
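
For concreteness, a minimal sketch of the userspace side of that
incremental scheme; vfio_msix_enable() and its error handling are
invented for illustration, and this is not QEMU's actual code, but the
ioctl, the flags and the index are real vfio uAPI:

/*
 * Enable MSI-X vectors [0, nvec) as one set. The eventfds array holds
 * one eventfd per vector; VFIO signals the eventfd when the vector
 * fires.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vfio_msix_enable(int device_fd, const int32_t *eventfds, int nvec)
{
        struct vfio_irq_set *set;
        size_t sz = sizeof(*set) + nvec * sizeof(int32_t);
        int ret;

        set = calloc(1, sz);
        if (!set)
                return -1;

        set->argsz = sz;
        set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
        set->index = VFIO_PCI_MSIX_IRQ_INDEX;
        set->start = 0;
        set->count = nvec;      /* growing this later means disable + re-enable */
        memcpy(set->data, eventfds, nvec * sizeof(int32_t));

        ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
        free(set);
        return ret;
}

Re-running this with a larger count after tearing the index down is
exactly the resize step that the noresize limitation describes.
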
Thanks,

Alex

2021-06-23 16:34:17

by Thomas Gleixner

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

On Wed, Jun 23 2021 at 06:12, Kevin Tian wrote:
>> From: Thomas Gleixner <[email protected]>
>> So the only downside today of allocating more MSI-X vectors than
>> necessary is memory consumption for the irq descriptors.
>
> Curious about irte entry when IRQ remapping is enabled. Is it also
> allocated at request_irq()?

Good question. No, it has to be allocated right away. We stick the
shutdown vector into the IRTE and then request_irq() will update it with
the real one.

> So the correct flow is like below:
>
> guest::enable_msix()
>     trapped_by_host()
>         pci_alloc_irq_vectors();    // for all possible vMSI-X entries
>         pci_enable_msix();
>
> guest::unmask()
>     trapped_by_host()
>         request_irqs();
>
> the first trap calls a new VFIO ioctl e.g. VFIO_DEVICE_ALLOC_IRQS.
>
> the 2nd trap can reuse existing VFIO_DEVICE_SET_IRQS which just
> does request_irq() if specified irqs have been allocated.
>
> Then map ims to this flow:
>
> guest::enable_msix()
>     trapped_by_host()
>         msi_domain_alloc_irqs();    // for all possible vMSI-X entries
>         for_all_allocated_irqs(i)
>             pci_update_msi_desc_id(i, default_pasid);    // a new helper func
>
> guest::unmask(entry#0)
>     trapped_by_host()
>         request_irqs();
>             ims_array_irq_startup();    // write msi_desc.id (default_pasid) to ims entry
>
> guest::set_msix_perm(entry#1, guest_sva_pasid)
>     trapped_by_host()
>         pci_update_msi_desc_id(1, host_sva_pasid);
>
> guest::unmask(entry#1)
>     trapped_by_host()
>         request_irqs();
>             ims_array_irq_startup();    // write msi_desc.id (host_sva_pasid) to ims entry

That's one way to do that, but that still has the same problem that the
request_irq() in the guest succeeds even if the host side fails.

As this is really new stuff there is no real good reason to force that
into the existing VFIO/MSIX stuff with all its known downsides and
limitations.

The point is that IMS can just add another interrupt to a device on the
fly without doing any of the PCI/MSIX nasties. So why not take advantage
of that?

I can see the point of using PCI to expose the device to the guest
because it's trivial to enumerate, but contrary to VF devices there is
no legacy, and the mechanism for setting up the device interrupts can be
completely different from PCI/MSIX.

Exposing some trappable "IMS" storage in a separate PCI bar won't cut it
because this still has the same problem that the allocation or
request_irq() on the host can fail w/o feedback.

So IMO creating a proper paravirt interface is the right approach. It
avoids _all_ of the trouble and will be necessary anyway once you want
to support devices which store the message/pasid in system memory and
not in on-device memory.

Thanks,

tglx




2021-06-23 16:42:39

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Virtualizing MSI-X on IMS via VFIO

On Wed, Jun 23, 2021 at 06:31:34PM +0200, Thomas Gleixner wrote:

> So IMO creating a proper paravirt interface is the right approach. It
> avoids _all_ of the trouble and will be necessary anyway once you want
> to support devices which store the message/pasid in system memory and
> not in on-device memory.

I think this is basically where we got to in the other earlier
discussion with using IMS natively in VMs - it can't be done
generically without a new paravirt interface.

The guest needs a paravirt interface to program the IOMMU to route MSI
vectors to the guest's vAPIC and then the guest itself can deliver an
addr/data pair directly to the HW.

In this mode qemu would not emulate MSI at all, so it will avoid all
the problems you identified.

How to build that and provide backwards compat is an open
question. Instead that thread went into blocking IMS on VM situations..

Jason

2021-06-23 23:39:05

by Tian, Kevin

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

> From: Thomas Gleixner <[email protected]>
> Sent: Thursday, June 24, 2021 12:32 AM
>
> On Wed, Jun 23 2021 at 06:12, Kevin Tian wrote:
> >> From: Thomas Gleixner <[email protected]>
> >> So the only downside today of allocating more MSI-X vectors than
> >> necessary is memory consumption for the irq descriptors.
> >
> > Curious about irte entry when IRQ remapping is enabled. Is it also
> > allocated at request_irq()?
>
> Good question. No, it has to be allocated right away. We stick the
> shutdown vector into the IRTE and then request_irq() will update it with
> the real one.

There are at most 64K IRTE entries per Intel VT-d unit. Do we consider
that a limited resource in this new model, even though it's much more
than the number of CPU vectors?

>
> > So the correct flow is like below:
> >
> > guest::enable_msix()
> >     trapped_by_host()
> >         pci_alloc_irq_vectors();    // for all possible vMSI-X entries
> >         pci_enable_msix();
> >
> > guest::unmask()
> >     trapped_by_host()
> >         request_irqs();
> >
> > the first trap calls a new VFIO ioctl e.g. VFIO_DEVICE_ALLOC_IRQS.
> >
> > the 2nd trap can reuse existing VFIO_DEVICE_SET_IRQS which just
> > does request_irq() if specified irqs have been allocated.
> >
> > Then map ims to this flow:
> >
> > guest::enable_msix()
> >     trapped_by_host()
> >         msi_domain_alloc_irqs();    // for all possible vMSI-X entries
> >         for_all_allocated_irqs(i)
> >             pci_update_msi_desc_id(i, default_pasid);    // a new helper func
> >
> > guest::unmask(entry#0)
> >     trapped_by_host()
> >         request_irqs();
> >             ims_array_irq_startup();    // write msi_desc.id (default_pasid) to ims entry
> >
> > guest::set_msix_perm(entry#1, guest_sva_pasid)
> >     trapped_by_host()
> >         pci_update_msi_desc_id(1, host_sva_pasid);
> >
> > guest::unmask(entry#1)
> >     trapped_by_host()
> >         request_irqs();
> >             ims_array_irq_startup();    // write msi_desc.id (host_sva_pasid) to ims entry
>
> That's one way to do that, but that still has the same problem that the
> request_irq() in the guest succeeds even if the host side fails.

yes

>
> As this is really new stuff there is no real good reason to force that
> into the existing VFIO/MSIX stuff with all its known downsides and
> limitations.
>
> The point is that IMS can just add another interrupt to a device on the
> fly without doing any of the PCI/MSIX nasties. So why not take advantage
> of that?
>
> I can see the point of using PCI to expose the device to the guest
> because it's trivial to enumerate, but contrary to VF devices there is

It's also about compatibility, since PCI is supported by almost all OSes.

> no legacy, and the mechanism for setting up the device interrupts can be
> completely different from PCI/MSIX.
>
> Exposing some trappable "IMS" storage in a separate PCI bar won't cut it
> because this still has the same problem that the allocation or
> request_irq() on the host can fail w/o feedback.

Yes, to fully fix the said nastiness some feedback mechanism is required.

>
> So IMO creating a proper paravirt interface is the right approach. It
> avoids _all_ of the trouble and will be necessary anyway once you want
> to support devices which store the message/pasid in system memory and
> not in on-device memory.
>

While I agree a paravirt interface is definitely cleaner, I wonder
whether this should be done orthogonally or tied to all new ims-capable
devices. Back to the earlier discussion about guest ims support, you
explained a layered model where the paravirt interface sits between the
msi domain and the vector domain to get the addr/data pair from the
host. In this way it could provide a feedback mechanism for both msi
and ims devices, thus not specific to ims only. Then considering the
transition window where not all guest OSes may support the paravirt
interface at the same time (or there are multiple paravirt interfaces
which take time for the host to support), would the staging approach
below still make sense?

1) Fix the lost interrupt issue in existing MSI virtualization flow;
2) Virtualize MSI-X on IMS, bearing the same request_irq() problem;
3) Develop a paravirt interface to solve request_irq() problem for
both msi and ims devices;

Thanks
Kevin

2021-06-23 23:44:58

by Tian, Kevin

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, June 24, 2021 12:41 AM
>
> On Wed, Jun 23, 2021 at 06:31:34PM +0200, Thomas Gleixner wrote:
>
> > So IMO creating a proper paravirt interface is the right approach. It
> > avoids _all_ of the trouble and will be necessary anyway once you want
> > to support devices which store the message/pasid in system memory and
> > not in on-device memory.
>
> I think this is basically where we got to in the other earlier
> discussion with using IMS natively in VMs - it can't be done
> generically without a new paravirt interface.
>
> The guest needs a paravirt interface to program the IOMMU to route MSI
> vectors to the guest's vAPIC and then the guest itself can deliver an
> addr/data pair directly to the HW.
>
> In this mode qemu would not emulate MSI at all, so it will avoid all
> the problems you identified.

No emulation for PF/VF.

But emulation might be required for mdev for two reasons:

1) the ims entries for mdevs are collapsed together;
2) there are other fields in the ims entry which the guest cannot be
allowed to control, e.g. PASID;

>
> How to build that and provide backwards compat is an open
> question. Instead that thread went into blocking IMS on VM situations..
>
> Jason

2021-06-24 00:06:17

by Tian, Kevin

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

> From: Alex Williamson <[email protected]>
> Sent: Wednesday, June 23, 2021 11:20 PM
>
[...]
> > So the only downside today of allocating more MSI-X vectors than
> > necessary is memory consumption for the irq descriptors.
>
> As above, this is a QEMU policy of essentially trying to be a good
> citizen and allocate only what we can infer the guest is using. What's
> a good way for QEMU, or any userspace, to know it's running on a host
> where vector exhaustion is not an issue?

In my proposal a new command (VFIO_DEVICE_ALLOC_IRQS) is
introduced to separate allocation from enabling. Could the
availability of this command serve as the indicator that vector
exhaustion is no longer an issue?

> >
> > So no, we are not going to proliferate this complete ignorance of how
> > MSI-X actually works and just cram another "feature" into code which is
> > known to be incorrect.
>
> Some of the issues of virtualizing MSI-X are unsolvable without
> creating a new paravirtual interface, but obviously we want to work
> with existing drivers and unmodified guests, so that's not an option.
>
> To work with what we've got, the vfio API describes the limitation of
> the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then
> makes a choice in an attempt to better reflect what we can infer of the
> guest programming of the device to incrementally enable vectors. We

It's a surprise to me that QEMU doesn't even look at this flag today;
I found nothing after searching its code...

> could a) work to provide host kernel interfaces that allow us to remove
> that noresize flag and b) decide whether QEMU's usage policy can be
> improved on kernels where vector exhaustion is no longer an issue.

Thomas can help confirm, but it looks like the noresize limitation is
still there. b) makes more sense since Thomas thinks vector exhaustion
is no longer an issue (except one minor open question about IRTEs).

Thanks
Kevin

2021-06-24 00:45:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Virtualizing MSI-X on IMS via VFIO

Alex!

On Wed, Jun 23 2021 at 09:19, Alex Williamson wrote:
> On Wed, 23 Jun 2021 01:59:24 +0200
> Thomas Gleixner <[email protected]> wrote:
>> That's dangerous and makes weird assumptions about interrupts being
>> requested early in the driver init() function. But that's wishful
>> thinking, really. There are enough drivers, especially networking which
>> request interrupts on device open() and not on device init(). Some
>> special functions only request the irq way later, i.e. only when someone
>> uses that special function and at this point the other irqs of that very
>> same device are already in use.
>
> I'd consider allocating vectors on open() to be setup time, device
> initialization. Chances of inflight interrupts are low. If you have

It does not matter whether chances are low or not. Again: At datacenter
scale 'unlikely' does not exist except in wishful thinking.

It does not matter whether it's unlikely or not. What matters is that
this problem exists.

If only one of one million resize operations results in a lost interrupt
and subsequently in a stale device then it is exactly _one_ fail too
many, especially for cases where it could have been avoided.

If you had argued that you can't do anything about it for unmodified,
i.e. unaware guests, but otherwise acknowledged that this _is_ a
problem which could and should have been fixed long ago, then we could
just move on and have a conversation about technical ways to fix it,
but handwaving it away is definitely _not_ an option.

It's not even documented anywhere. It's just implemented that way to
prove the theorem that virtualization creates more problems than it
solves.

> examples of drivers where the irq request comes way later please let
> me know, maybe there's some way we can identify that use case.

IOW, please tell me where I need to add more heuristics in order to
"fix" the symptom instead of the root cause?

Here we go nevertheless:

1) Multi-queue devices with managed interrupts.

A queue and its interrupt(s) are not started up when the associated
CPU(s) is/are not online or not present. But a later hotplug brings
them up and that's the point where such interrupts are unmasked the
first time.

All the other queues are active already at that point. Now you just
have to ensure that there is load on the already online queues and
the chances of getting a stale queue or running into a hopefully
recoverable timeout are no longer low. That fail is pretty much
guaranteed. I'm sure Joe Admin will appreciate all the helpful tools
to debug that and the extensive documentation about this kind of
failure.

Surely you will tell me that these kinds of devices are already
covered by weird heuristics in the hypervisor which allocate all
vectors upfront based on PCI ID or whatever, but I still have to tell
you that this is beyond wrong and could have been avoided long ago at
least for the not so uncommon case of an up-to-date hypervisor and
up-to-date host and guest kernels.

2) Some drivers request interrupts with IRQF_NO_AUTOEN (or via the older
mechanism of marking the irq as NOAUTOEN before request_irq()), which
does not start it up. It's fully initialized, but it gets unmasked only
when the driver later invokes enable_irq(). And for some drivers that
can be very late (see the sketch after this list).

3) Multi-function devices where a subfunction gets initialized only when
needed.

4) We have at least one case here with a custom FPGA driver where a
certain power hungry IP block is only initialized when the program
loaded by the end-user uses a particular library which enables it.

At that point the interrupt gets requested and unmasked but there is
already other functionality enabled in the FPGA with heavy interrupt
load, so ripping MSI-X temporarily off for resizing would be a
guarantee for losing interrupts in hard to debug ways. That stuff
runs today only on bare metal fortunately, but there are plans to
have it exposed to guests...

Good thing we talked about this well documented VFIO 'feature' now,
so we can spare us the head scratching later.

There are certainly others, but that's from the top of my head.
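
To illustrate the pattern in 2), a minimal sketch; struct demo_dev and
demo_handler are invented for illustration, while IRQF_NO_AUTOEN and
enable_irq() are the real kernel interfaces:

#include <linux/interrupt.h>

struct demo_dev {
        int irq;
};

static irqreturn_t demo_handler(int irq, void *data)
{
        return IRQ_HANDLED;
}

static int demo_init_irq(struct demo_dev *dd)
{
        /* Fully initialized, but not started up / unmasked yet */
        return request_irq(dd->irq, demo_handler, IRQF_NO_AUTOEN,
                           "demo", dd);
}

static void demo_start_special_function(struct demo_dev *dd)
{
        /* The first unmask happens here, potentially long after other
         * irqs of the same device are already in use. */
        enable_irq(dd->irq);
}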

>> Let's look at this from a driver perspective:
>>
>> 1) Driver checks how many entries are possible in the MSI-X table
>>
>> 2) Driver invokes pci_msix_enable[_range]() which returns the number
>> of vectors the system is willing to hand out to the driver.
>>
>> 3) Driver assigns the vectors to the different resources in the
>> hardware
>>
>> 4) Later on these interrupts are requested, but not necessarily during
>> device init.
>>
>> Yes, request_irq() can fail and today it can fail also due to CPU
>> vector exhaustion. That's perfectly fine as the driver can handle
>> the fail and act accordingly.
>>
>> All of this is consistent and well defined.
>
> We can't know in the hypervisor how many vectors the guest driver asked
> for; we only know that it's somewhere between one and the number
> supported by the vector table. We'd contribute to vector exhaustion if
> we always assume the max.

I understand that, but that does not make it more consistent.

>> Now lets look at the guest. This VFIO mechanism introduces two brand new
>> failure modes because of this:
>>
>> guest::unmask()
>> trapped_by_host()
>> free_irqs();
>> pci_free_irq_vectors();
>> pci_disable_msix();
>> pci_alloc_irq_vectors();
>> pci_enable_msix();
>> request_irqs();
>>
>> #1 What happens if the host side allocation or the host side request_irq()
>> fails?
>>
>> a) Guest is killed?
>> b) Failure is silently ignored and guest just does not receive
>> interrupts?
>> c) Any other mechanism?
>
> QEMU generates a log message on failure. At vfio-pci we're emulating
> the MSI-X capability and vector table; there is no mechanism in the
> PCI-spec-defined protocol by which manipulating a vector table entry
> can fail, but clearly it can fail on the host. That's just the world
> we live in: either we overestimate vector usage on the host or we do
> the best we can to infer how the device is actually being used.
>
> Within the vfio API we have a mechanism to describe that the above
> behavior is required, i.e. that we can't dynamically resize the number
> of MSI-X vectors. If we had kernel interfaces to dynamically change
> our vector allocation after pci_alloc_irq_vectors() we could remove
> that flag and QEMU wouldn't need to go through this process.
>
>>
>> Whatever it is, it simply _cannot_ make request_irq() in the
>> guest fail because the guest has already passed all failure
>> points and is in the low level function of unmasking the
>> interrupt for the first time after which is will return 0, aka
>> success.
>>
>> So if you answer #a, fine with me. It's well defined.
>
> As far as the guest is concerned, #b. As you say, we're at the point
> in the guest where the guest interrupt code is interacting with MSI-X
> directly and there is no way it can fail on hardware. It's not
> entirely silent though; the hypervisor logs an error. Whether "the
> device I hot-added to my VM doesn't work" or "I hot-added a device to
> my VM and it crashed" is the better failure mode is a matter of
> perspective.

Yes, it's a matter of perspective, but crashing the thing is at least
well defined. Silently breaking the device is an interesting choice and
completely undefined whether you log it or not.

Even if it does not break because the host side request_irq() succeeds,
risking that it can lose interrupts is daft, and if that happens then
it still malfunctions, and this is _not_ going to be logged in any
hypervisor log. It's going to be a nightmare to debug.

>> #2 What happens to already active interrupts on that device which might
>> fire during that time?
>>
>> They get lost or are redirected to the legacy PCI interrupt and
>> there is absolutely nothing you can do about that.
>>
>> Simply because to prevent that the host side would have to
>> disable the interrupt source at the device level, i.e. fiddle
>> with actual device registers to shut up interrupt delivery and
>> reenable it afterwards again and thereby racing against some
>> other VCPU of the same guest which fiddles with that very same
>> registers.
>>
>> IOW, undefined behaviour, which the "VFIO design" shrugged off on
>> completely unjustified assumptions.
>
> The vfio design signals to userspace that this behavior is required via
> a flag on the IRQ info structure. Should host interfaces become
> available that allow us to re-size the number of vectors for a device,
> we can remove the flag and userspace would no longer need this
> disable/re-enable process.

You cannot resize MSI-X without clearing the MSI-X enable bit. And
that's not going to change anytime soon. It's how MSI-X is specified.

So what are you trying to tell me? That your design is great because it
allows removing a flag which cannot be removed for the particular class
of devices utilizing true MSI-X which is affected by this problem?

I'm amazed how you emphasize design all over the place. I'm truly
impressed that this very design does not even document that the
re-sizing business can cause the loss of already active interrupts
on the same device.

>> 2) request_irq() actually allocates a CPU vector from the
>> allocatable vector space which can obviously still fail, which is
>> perfectly fine.
>
> request_irq() is where the vector table entry actually gets unmasked,
> though; that act of unmasking the vector table entry in hardware cannot
> fail, but virtualization of that act is backed by a request_irq() in
> the host, which can fail.

Again: request_irq() does not necessarily unmask. Just because for a
large part of the drivers the first unmask happens from request_irq()
does not make it a universal law. The unmask can be completely detached
from request_irq().

What this mechanism observes is the first unmask() operation on a
particular MSI-X vector, nothing else.

You _cannot_ make assumptions about when and where that unmask happens
especially not with closed source guests, but your assumptions do not
even hold with open source guests.

Can we pretty please use the proper and precise terminology and not some
blurb based on random assumptions about how things might work or might
have worked 20 years ago?

>> So the only downside today of allocating more MSI-X vectors than
>> necessary is memory consumption for the irq descriptors.
>
> As above, this is a QEMU policy of essentially trying to be a good
> citizen and allocate only what we can infer the guest is using. What's
> a good way for QEMU, or any userspace, to know it's running on a host
> where vector exhaustion is not an issue?

If the kernel is halfway recent (4.15+) then it uses reservation mode
on x86:

- It does not consume CPU vectors for not requested interrupts (except
for multi-queue devices with managed interrupts).

- It consumes an IRTE entry per vector, which might or might not be an
issue. There are 64k entries per IOMMU on Intel and a sizeable number
of entries per PCI function on AMD.

- It consumes host memory obviously

Of course this still does not solve the problem that the actual
request_irq() can fail due to vector exhaustion (or other reasons), but
it at least prevents losing interrupts for already activated vectors due
to resizing.

>> Not at all a pretty solution, but it is contrary to the status quo well
>> defined. The most important aspect is that it is well defined for the
>> case of success:
>>
>> If it succeeds then there is no way that already requested interrupts
>> can be lost or end up being redirected to the legacy PCI irq due to
>> clearing the MSIX enable bit, which is a gazillion times better than
>> the "let's hope it works" based tinkerware we have now.
>>
>> So, aside of the existing VFIO/PCI/MSIX thing being just half thought
>> out, even thinking about proliferating this with IMS is bonkers.
>>
>> IMS is meant to avoid the problem of MSI-X which needs to disable MSI-X
>> in order to expand the number of vectors. The use cases are going to be
>> even more dynamic than the usual device drivers, so the lost interrupt
>> issue will be much more likely to trigger.
>>
>> So no, we are not going to proliferate this complete ignorance of how
>> MSI-X actually works and just cram another "feature" into code which is
>> known to be incorrect.
>
> Some of the issues of virtualizing MSI-X are unsolvable without
> creating a new paravirtual interface, but obviously we want to work
> with existing drivers and unmodified guests, so that's not an option.

I grant you that unmodified, i.e. unaware guests are an issue which is
pretty much impossible to fix, but at least it could have been
documented.

So this has been known for almost _ten_ years now and the only answer
you have is 'not an option' for everything and not just for unaware
guests?

Sorry, but that's hilarious and it's absolutely not rocket science to
fix that.

You do not even have to modify _one_ single PCI device driver if you put
the hypercall into the obvious place, i.e. pci_enable_msix[_range](),
which is invoked when the driver initializes and sizes and negotiates
the resources.

Existing and completely unmodified drivers handle the situation
perfectly fine when this function tells them: No, you can't get 1000
vectors, I grant you only one.

Those which don't are broken even w/o virt.

So the only situation where you really can by some means justify the
'try and pray' approach is on guests which are unaware of that mechanism
for whatever reason. That's it. And that wants proper documentation of
the potential consequences.

> To work with what we've got, the vfio API describes the limitation of
> the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then
> makes a choice in an attempt to better reflect what we can infer of the
> guest programming of the device to incrementally enable vectors. We
> could a) work to provide host kernel interfaces that allow us to remove
> that noresize flag and b) decide whether QEMU's usage policy can be
> improved on kernels where vector exhaustion is no longer an issue.

For IMS "work with what we've got" is just not a reasonable answer and
IMS needs hypercalls sooner rather than later anyway for:

1) Message store in guest system RAM which obviously cannot be
trapped.

That's the case where a single subdevice (not a VF) of a PCI device
managed by the host driver is wrapped and exposed as a "PCI" device
to the guest, because that makes it simple to enumerate and probe.

Obviously this just covers the device-memory storage of IDXD as
well because that's just an implementation detail.

2) Passthrough of a full IMS capable device into the guest which has
different requirements.

What we are looking at right now is #1. #2 is a different story and
needs way more than #1.

So the question is where such a hypercall should be located.

A) IMS device specific which is going to end up with code duplication
and slightly different bugs all over the place. Not really
desirable.

B) Part of PCI/MSI[X] is the right thing as it can be used to address
the existing issues as well in a sane way.

So #B would look like this:

device_init(dev)
    pci_enable_msix_range(dev, msidescs, minvec, maxvec)
        ....
        ret = hypercall(MSIX_ENABLE, &alloc_info);

We need to find the right spot for that call and the alloc_info
structure needs some thought obviously, but none of this is rocket
science.

The hypervisor can then:

1) Grant the request and do the required allocation/setup for
the vectors which the guest device driver asked for

2) Grant and setup only $N vectors

3) Refuse the request and return a proper error code

and everything just falls in place including sane error handling on all
ends.

Now add the reverse operation for MSIX_DISABLE and suddenly all of this
becomes a design.
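
To make the shape concrete, a purely hypothetical sketch of the guest
side; struct msix_alloc_info, MSIX_ENABLE/MSIX_DISABLE and
hv_msix_hypercall() are invented names, not an existing interface:

/*
 * Hypothetical sketch only. The point is the return contract between
 * guest driver and hypervisor, not the details.
 */
#include <linux/errno.h>

struct msix_alloc_info {
        unsigned int requested;         /* vectors the guest driver asked for */
        unsigned int granted;           /* filled in by the hypervisor */
};

enum { MSIX_ENABLE, MSIX_DISABLE };

int hv_msix_hypercall(int op, struct msix_alloc_info *info);

static int negotiate_msix_vectors(unsigned int minvec, unsigned int maxvec)
{
        struct msix_alloc_info info = { .requested = maxvec };
        int ret;

        ret = hv_msix_hypercall(MSIX_ENABLE, &info);
        if (ret)
                return ret;                     /* 3) refused: proper error code */

        if (info.granted < minvec) {
                hv_msix_hypercall(MSIX_DISABLE, &info);
                return -ENOSPC;                 /* 2) granted too few for this driver */
        }

        return info.granted;                    /* 1) or 2) with enough vectors */
}

Unmodified drivers already cope with pci_enable_msix_range() granting
fewer vectors than requested, so the partial grant in 2) maps naturally
onto the existing return contract.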

I'm really amazed that this does not exist today because it's so bloody
obvious.

Maybe it's just bloody obvious to me, but I very well remember that I
explained on several occasions in great length why resizing MSI-X with
already active interrupts is bloody wrong.

Both in email and face to face, the last face to face was at a table
during Plumbers 2019 in Lisbon where IMS was discussed for the first
time in public.

Thanks,

tglx

2021-06-24 01:21:40

by Thomas Gleixner

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

Kevin!

On Wed, Jun 23 2021 at 23:37, Kevin Tian wrote:
>> From: Thomas Gleixner <[email protected]>
>> > Curious about irte entry when IRQ remapping is enabled. Is it also
>> > allocated at request_irq()?
>>
>> Good question. No, it has to be allocated right away. We stick the
>> shutdown vector into the IRTE and then request_irq() will update it with
>> the real one.
>
> There are at most 64K IRTE entries per Intel VT-d unit. Do we consider
> that a limited resource in this new model, even though it's much more
> than the number of CPU vectors?

It's surely a limited resource. For me 64k entries seem to be plenty,
but what do I know. I'm not a virtualization wizard.

> Back to the earlier discussion about guest ims support, you explained
> a layered model where the paravirt interface sits between the msi
> domain and the vector domain to get the addr/data pair from the host.
> In this way it could provide a feedback mechanism for both msi and ims
> devices, thus not specific to ims only. Then considering the transition
> window where not all guest OSes may support the paravirt interface at
> the same time (or there are multiple paravirt interfaces which take
> time for the host to support), would the staging approach below still
> make sense?
>
> 1) Fix the lost interrupt issue in existing MSI virtualization flow;

That _cannot_ be fixed without a hypercall. See my reply to Alex.

> 2) Virtualize MSI-X on IMS, bearing the same request_irq() problem;

That solves what? Maybe your perceived roadmap problem, but certainly
not any technical problem in the right way. Again: See my reply to Alex.

> 3) Develop a paravirt interface to solve request_irq() problem for
> both msi and ims devices;

First of all it's not a request_irq() problem: It's a plain resource
management problem which requires proper interaction between host and
guest.

And yes, it _is_ the correct answer to the problem and as I outlined in
my reply to Alex already it is _not_ rocket science and it won't make a
significant difference on your timeline because it's straightforward
and solves the problem properly, with the added benefit of solving existing
problems which should and could have been solved long ago.

I don't care at all about the time you are wasting with half-baked
thoughts about avoiding doing the right thing, but I very much care
about my time wasted to debunk them.

Thanks,

tglx

2021-06-24 01:38:05

by Thomas Gleixner

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

Kevin,

On Thu, Jun 24 2021 at 00:00, Kevin Tian wrote:
>> From: Alex Williamson <[email protected]>
>> Sent: Wednesday, June 23, 2021 11:20 PM
>>
> [...]
> > > So the only downside today of allocating more MSI-X vectors than
>> > necessary is memory consumption for the irq descriptors.
>>
>> As above, this is a QEMU policy of essentially trying to be a good
>> citizen and allocate only what we can infer the guest is using. What's
>> a good way for QEMU, or any userspace, to know it's running on a host
>> where vector exhaustion is not an issue?
>
> In my proposal a new command (VFIO_DEVICE_ALLOC_IRQS) is
> introduced to separate allocation from enabling. Could the
> availability of this command serve as the indicator that vector
> exhaustion is no longer an issue?

Your proposal still does not address the fundamental issue of missing
feedback to the guest, and you can invent a gazillion more IOCTL commands
and none of them will solve that issue. A hypercall/paravirt interface is
the only reasonable solution.

The time you are wasting to come up with non-solutions would have surely
been better spent implementing the already known and obvious proper
solution. You might be halfway done with that already.

Thanks,

tglx

2021-06-24 02:21:32

by Thomas Gleixner

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

Kevin,

thank you very much for digging into this! You made my day!

On Thu, Jun 24 2021 at 00:00, Kevin Tian wrote:
>> From: Alex Williamson <[email protected]>
>> To work with what we've got, the vfio API describes the limitation of
>> the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then
>> makes a choice in an attempt to better reflect what we can infer of the
>> guest programming of the device to incrementally enable vectors. We
>
> It's a surprise to me that QEMU doesn't even look at this flag today;
> I found nothing after searching its code...

Indeed.

git clone https://github.com/qemu/qemu.git
cd qemu
git log -p | grep NORESIZE
+ * The NORESIZE flag indicates that the interrupt lines within the index
+#define VFIO_IRQ_INFO_NORESIZE (1 << 3)

According to the git history of QEMU this was never used at all and I
don't care about the magic muck which might be in some RHT repository
which might make use of that.

Find below the proper fix for this nonsense which just wasted everyone's
time. I'll post it officially with a proper changelog tomorrow unless
Kevin beats me to it; he actually unearthed this and surely earns the
credit.

Alex, I seriously have to ask what you were trying to tell us about this
flag and its great value and the design related to this.

I'm sure you can submit the corresponding fix to qemu yourself.

And once you are back from lala land, can you please explain how
VFIO/PCI/MSIX is supposed to work in reality?

Thanks,

tglx
---
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1644,8 +1644,6 @@ static long intel_vgpu_ioctl(struct mdev
if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
info.flags |= (VFIO_IRQ_INFO_MASKABLE |
VFIO_IRQ_INFO_AUTOMASKED);
- else
- info.flags |= VFIO_IRQ_INFO_NORESIZE;

return copy_to_user((void __user *)arg, &info, minsz) ?
-EFAULT : 0;
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1018,8 +1018,6 @@ static long vfio_pci_ioctl(struct vfio_d
if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
info.flags |= (VFIO_IRQ_INFO_MASKABLE |
VFIO_IRQ_INFO_AUTOMASKED);
- else
- info.flags |= VFIO_IRQ_INFO_NORESIZE;

return copy_to_user((void __user *)arg, &info, minsz) ?
-EFAULT : 0;
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -693,16 +693,6 @@ struct vfio_region_info_cap_nvlink2_lnks
* automatically masked by VFIO and the user needs to unmask the line
* to receive new interrupts. This is primarily intended to distinguish
* level triggered interrupts.
- *
- * The NORESIZE flag indicates that the interrupt lines within the index
- * are setup as a set and new subindexes cannot be enabled without first
- * disabling the entire index. This is used for interrupts like PCI MSI
- * and MSI-X where the driver may only use a subset of the available
- * indexes, but VFIO needs to enable a specific number of vectors
- * upfront. In the case of MSI-X, where the user can enable MSI-X and
- * then add and unmask vectors, it's up to userspace to make the decision
- * whether to allocate the maximum supported number of vectors or tear
- * down setup and incrementally increase the vectors as each is enabled.
*/
struct vfio_irq_info {
__u32 argsz;
@@ -710,7 +700,6 @@ struct vfio_irq_info {
#define VFIO_IRQ_INFO_EVENTFD (1 << 0)
#define VFIO_IRQ_INFO_MASKABLE (1 << 1)
#define VFIO_IRQ_INFO_AUTOMASKED (1 << 2)
-#define VFIO_IRQ_INFO_NORESIZE (1 << 3)
__u32 index; /* IRQ index */
__u32 count; /* Number of IRQs within this index */
};
--- a/samples/vfio-mdev/mtty.c
+++ b/samples/vfio-mdev/mtty.c
@@ -1092,9 +1092,6 @@ static int mtty_get_irq_info(struct mdev
if (irq_info->index == VFIO_PCI_INTX_IRQ_INDEX)
irq_info->flags |= (VFIO_IRQ_INFO_MASKABLE |
VFIO_IRQ_INFO_AUTOMASKED);
- else
- irq_info->flags |= VFIO_IRQ_INFO_NORESIZE;
-
return 0;
}

2021-06-24 02:42:23

by Tian, Kevin

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

> From: Thomas Gleixner <[email protected]>
> Sent: Thursday, June 24, 2021 9:19 AM
>
> Kevin!
>
> On Wed, Jun 23 2021 at 23:37, Kevin Tian wrote:
> >> From: Thomas Gleixner <[email protected]>
> >> > Curious about irte entry when IRQ remapping is enabled. Is it also
> >> > allocated at request_irq()?
> >>
> >> Good question. No, it has to be allocated right away. We stick the
> >> shutdown vector into the IRTE and then request_irq() will update it with
> >> the real one.
> >
> > There are at most 64K IRTE entries per Intel VT-d unit. Do we consider
> > that a limited resource in this new model, even though it's much more
> > than the number of CPU vectors?
>
> It's surely a limited resource. For me 64k entries seem to be plenty,
> but what do I know. I'm not a virtualization wizard.
>
> > Back to the earlier discussion about guest ims support, you explained
> > a layered model where the paravirt interface sits between the msi
> > domain and the vector domain to get the addr/data pair from the host.
> > In this way it could provide a feedback mechanism for both msi and ims
> > devices, thus not specific to ims only. Then considering the transition
> > window where not all guest OSes may support the paravirt interface at
> > the same time (or there are multiple paravirt interfaces which take
> > time for the host to support), would the staging approach below still
> > make sense?
> >
> > 1) Fix the lost interrupt issue in existing MSI virtualization flow;
>
> That _cannot_ be fixed without a hypercall. See my reply to Alex.

The lost interrupt issue was caused by resizing based on a stale
impression of vector exhaustion.

With your explanation this issue can be partially fixed by having Qemu
allocate all possible irqs when the guest enables msi-x and never
resize before the guest disables msi-x.

The remaining problem is that there is no feedback to block guest
request_irq() in case of vector shortage. This has to be solved via a
paravirt interface, but fixing the lost interrupt alone is still a step
forward for guests which don't implement the paravirt interface.

>
> > 2) Virtualize MSI-X on IMS, bearing the same request_irq() problem;
>
> That solves what? Maybe your perceived roadmap problem, but certainly
> not any technical problem in the right way. Again: See my reply to Alex.

Not about roadmap. See explanation below.

>
> > 3) Develop a paravirt interface to solve request_irq() problem for
> > both msi and ims devices;
>
> First of all it's not a request_irq() problem: It's a plain resource
> management problem which requires proper interaction between host and
> guest.

sure.

>
> And yes, it _is_ the correct answer to the problem and as I outlined in
> my reply to Alex already it is _not_ rocket science and it won't make a
> significant difference on your timeline because it's straightforward
> and solves the problem properly, with the added benefit of solving existing
> problems which should and could have been solved long ago.
>
> I don't care at all about the time you are wasting with half-baked
> thoughts about avoiding doing the right thing, but I very much care
> about my time wasted to debunk them.
>

I'm really not thinking from any roadmap angle, and I actually very
much appreciate all of your comments on the right direction.

All my comments are purely based on possible usage scenarios. I will
give more explanation below and hope you can consider it as a thought
exercise to compose the full picture based on your guidance, instead of
half-baked ideas meant to waste your time.

At any time guest OSes can be categorized into three classes:

a) doesn't implement any paravirt interface for vector allocation;

b) implements a paravirt interface that is supported by KVM;

c) implements a paravirt interface that is not yet supported by KVM;

The transition phase from c) to b) is undefined, but it does exist more
or less. For example a Windows guest will never implement the interface
defined between a Linux guest and a Linux host. It will have its own
hyperv variation, which likely takes time for KVM to emulate and claim
support for.

Transition from a) to b) or a) to c) is a guest-side choice. It's not controlled
by the host world.

Here I didn't further differentiate whether a guest OS supports ims,
since once a supported paravirt interface is in place both msi and ims
can get the necessary feedback info from the host side.

Then let's look at the host side:

1) kernel versions before we conduct any discussed change:

This is a known broken world as you explained. irq resizing could
lead to lost interrupts in all three guest classes. The only mitigation
is to document this limitation somewhere.

We'll not enable ims based on this broken framework.

2) kernel versions after we make a clean refactoring:

a) For guest OS which doesn't implement a paravirt interface:
c) For guest OS which implements a paravirt interface not
supported by KVM:

You confirmed that recent kernels (4.15+) all use
reservation mode to avoid vector exhaustion. So VFIO can
define a new protocol asking its userspace to disable resizing
by allocating all possible irqs when guest msix is enabled. This
is one step forward, fixing the lost interrupt issue, and is what
step-1) in my proposal tries to achieve.

But there remains a limitation: no feedback is provided to the
guest to block it when host vectors are in shortage. But
that's the reality we have to bear for such guests. VFIO
returns the error info to userspace and lets it decide how to
react.

It's not elegant but improved over the status quo, and we do
see value in enabling ims-capable devices/subdevices within
such guests, though the guest will just fall back to using msix.
This is about step-2) in my proposal;

b) For guest OS which implements a paravirt interface supported
by KVM:

This is the right framework that you just described. With such
an interface in place, the guest needs to proactively claim
resources from the host side before it can actually enable a
specific msi/ims entry. Everything is well set with cooperation
between host and guest.

If you agree with the above, then my proposal is not about making
half-baked stuff. It's really about splitting tasks: doing the
conservative stuff first, which works for most guests, and then
optimizing things for new guests. And strictly speaking we don't want
to do the paravirt stuff very late, since it's conceptually the much
cleaner approach. We do plan to find some resource to initiate a
separate design discussion in parallel with fixing the lost interrupt
issue for a) and c).

Does this rationale sound good to you?

Thanks
Kevin

2021-06-24 02:49:37

by Alex Williamson

[permalink] [raw]
Subject: Re: Virtualizing MSI-X on IMS via VFIO

On Thu, 24 Jun 2021 04:20:31 +0200
Thomas Gleixner <[email protected]> wrote:

> Kevin,
>
> thank you very much for digging into this! You made my day!
>
> On Thu, Jun 24 2021 at 00:00, Kevin Tian wrote:
> >> From: Alex Williamson <[email protected]>
> >> To work with what we've got, the vfio API describes the limitation of
> >> the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then
> >> makes a choice in an attempt to better reflect what we can infer of the
> >> guest programming of the device to incrementally enable vectors. We
> >
> > It's a surprise to me that QEMU doesn't even look at this flag today;
> > I found nothing after searching its code...
>
> Indeed.
>
> git clone https://github.com/qemu/qemu.git
> cd qemu
> git log -p | grep NORESIZE
> + * The NORESIZE flag indicates that the interrupt lines within the index
> +#define VFIO_IRQ_INFO_NORESIZE (1 << 3)
>
> According to the git history of QEMU this was never used at all and I
> don't care about the magic muck which might be in some RHT repository
> which might make use of that.
>
> Find below the proper fix for this nonsense which just wasted everyone's
> time. I'll post it officially with a proper changelog tomorrow unless
> Kevin beats me to it; he actually unearthed this and surely earns the
> credit.
>
> Alex, I seriously have to ask what you were trying to tell us about this
> flag and its great value and the design related to this.
>
> I'm sure you can submit the corresponding fix to qemu yourself.
>
> And once you are back from lala land, can you please explain how
> VFIO/PCI/MSIX is supposed to work in reality?

It's part of the spec; there's never been a case of !NORESIZE, assuming
NORESIZE is the safe behavior. Sorry, there's no smoking gun here. NAK

> ---
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -1644,8 +1644,6 @@ static long intel_vgpu_ioctl(struct mdev
> if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
> info.flags |= (VFIO_IRQ_INFO_MASKABLE |
> VFIO_IRQ_INFO_AUTOMASKED);
> - else
> - info.flags |= VFIO_IRQ_INFO_NORESIZE;
>
> return copy_to_user((void __user *)arg, &info, minsz) ?
> -EFAULT : 0;
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1018,8 +1018,6 @@ static long vfio_pci_ioctl(struct vfio_d
> if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
> info.flags |= (VFIO_IRQ_INFO_MASKABLE |
> VFIO_IRQ_INFO_AUTOMASKED);
> - else
> - info.flags |= VFIO_IRQ_INFO_NORESIZE;
>
> return copy_to_user((void __user *)arg, &info, minsz) ?
> -EFAULT : 0;
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -693,16 +693,6 @@ struct vfio_region_info_cap_nvlink2_lnks
> * automatically masked by VFIO and the user needs to unmask the line
> * to receive new interrupts. This is primarily intended to distinguish
> * level triggered interrupts.
> - *
> - * The NORESIZE flag indicates that the interrupt lines within the index
> - * are setup as a set and new subindexes cannot be enabled without first
> - * disabling the entire index. This is used for interrupts like PCI MSI
> - * and MSI-X where the driver may only use a subset of the available
> - * indexes, but VFIO needs to enable a specific number of vectors
> - * upfront. In the case of MSI-X, where the user can enable MSI-X and
> - * then add and unmask vectors, it's up to userspace to make the decision
> - * whether to allocate the maximum supported number of vectors or tear
> - * down setup and incrementally increase the vectors as each is enabled.
> */
> struct vfio_irq_info {
> __u32 argsz;
> @@ -710,7 +700,6 @@ struct vfio_irq_info {
> #define VFIO_IRQ_INFO_EVENTFD (1 << 0)
> #define VFIO_IRQ_INFO_MASKABLE (1 << 1)
> #define VFIO_IRQ_INFO_AUTOMASKED (1 << 2)
> -#define VFIO_IRQ_INFO_NORESIZE (1 << 3)
> __u32 index; /* IRQ index */
> __u32 count; /* Number of IRQs within this index */
> };
> --- a/samples/vfio-mdev/mtty.c
> +++ b/samples/vfio-mdev/mtty.c
> @@ -1092,9 +1092,6 @@ static int mtty_get_irq_info(struct mdev
> if (irq_info->index == VFIO_PCI_INTX_IRQ_INDEX)
> irq_info->flags |= (VFIO_IRQ_INFO_MASKABLE |
> VFIO_IRQ_INFO_AUTOMASKED);
> - else
> - irq_info->flags |= VFIO_IRQ_INFO_NORESIZE;
> -
> return 0;
> }
>
>

2021-06-24 12:07:03

by Thomas Gleixner

[permalink] [raw]
Subject: [PATCH] vfio/pci: Document the MSI[X] resize side effects properly

The documentation of VFIO_IRQ_INFO_NORESIZE is inaccurate as it suggests
that it is safe to dynamically add new MSI-X vectors even when
previously allocated vectors are already in use and enabled.

Enabling additional vectors is possible according to the MSI-X
specification, but the kernel does not have any mechanism today to do
that safely.

The only available mechanism is to tear down the already active vectors
and to set up the full vector set afterwards.

This requires temporarily disabling MSI-X, which redirects any interrupt
raised by the device during this time to the legacy PCI/INTx, which is
not handled, so the interrupt is lost.

Update the documentation of VFIO_IRQ_INFO_NORESIZE accordingly.

Signed-off-by: Thomas Gleixner <[email protected]>
---
include/uapi/linux/vfio.h | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -699,10 +699,19 @@ struct vfio_region_info_cap_nvlink2_lnks
* disabling the entire index. This is used for interrupts like PCI MSI
* and MSI-X where the driver may only use a subset of the available
* indexes, but VFIO needs to enable a specific number of vectors
- * upfront. In the case of MSI-X, where the user can enable MSI-X and
- * then add and unmask vectors, it's up to userspace to make the decision
- * whether to allocate the maximum supported number of vectors or tear
- * down setup and incrementally increase the vectors as each is enabled.
+ * upfront.
+ *
+ * MSI cannot be resized safely when interrupts are in use already because
+ * resizing requires temporary disablement of MSI for updating the relevant
+ * PCI config space entries. Disabling MSI redirects an interrupt raised by
+ * the device during this time to the unhandled legacy PCI/INTX, which
+ * means the interrupt is lost.
+ *
+ * Enabling additional vectors for MSI-X is possible at least from the
+ * perspective of the MSI-X specification, but not supported by the
+ * existing PCI/MSI-X mechanisms in the kernel. The kernel currently
+ * provides only a full teardown/setup cycle which requires disabling
+ * MSI-X temporarily with the same side effects as for MSI.
*/
struct vfio_irq_info {
__u32 argsz;

2021-06-24 15:16:30

by Thomas Gleixner

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

Kevin!

On Thu, Jun 24 2021 at 02:41, Kevin Tian wrote:
>> From: Thomas Gleixner <[email protected]>
>> On Wed, Jun 23 2021 at 23:37, Kevin Tian wrote:

>> > 1) Fix the lost interrupt issue in existing MSI virtualization flow;
>>
>> That _cannot_ be fixed without a hypercall. See my reply to Alex.
>
> The lost interrupt issue was caused by resizing based on a stale
> impression of vector exhaustion.
>
> With your explanation this issue can be partially fixed by having Qemu
> allocate all possible irqs when the guest enables msi-x and never
> resize before the guest disables msi-x.

Yes, that works with all the downsides attached to it.

> The remaining problem is that there is no feedback to block guest
> request_irq() in case of vector shortage. This has to be solved via a
> paravirt interface, but fixing the lost interrupt alone is still a step
> forward for guests which don't implement the paravirt interface.

Fair enough.

> At any time guest OSes can be categorized into three classes:
>
> a) doesn't implement any paravirt interface for vector allocation;
>
> b) implements a paravirt interface that is supported by KVM;
>
> c) implements a paravirt interface that is not yet supported by KVM;
>
> The transition phase from c) to b) is undefined, but it does exist more
> or less. For example a Windows guest will never implement the interface
> defined between a Linux guest and a Linux host. It will have its own
> hyperv variation, which likely takes time for KVM to emulate and claim
> support for.
>
> Transition from a) to b) or a) to c) is a guest-side choice. It's not
> controlled by the host world.

That's correct.

> Here I didn't further differentiate whether a guest OS supports ims,
> since once a supported paravirt interface is in place both msi and ims
> can get the necessary feedback info from the host side.
>
> Then let's look at the host side:
>
> 1) kernel versions before we conduct any discussed change:
>
> This is a known broken world as you explained. irq resizing could
> lead to lost interrupts in all three guest classes. The only mitigation
> is to document this limitation somewhere.
>
> We'll not enable ims based on this broken framework.
>
> 2) kernel versions after we make a clean refactoring:
>
> a) For guest OS which doesn't implement a paravirt interface:
> c) For guest OS which implements a paravirt interface not
> supported by KVM:
>
> You confirmed that recent kernels (4.15+) all use
> reservation mode to avoid vector exhaustion. So VFIO can
> define a new protocol asking its userspace to disable resizing
> by allocating all possible irqs when guest msix is enabled. This
> is one step forward, fixing the lost interrupt issue, and is what
> step-1) in my proposal tries to achieve.

After studying the MSI-X specification again, I think there is another
option to solve this for MSI-X, i.e. the dynamic sizing part:

MSI requires disabling MSI in order to update the number of enabled
vectors in the control word.

MSI-X does not have that requirement as there is no 'number of used
vectors' control field. MSI-X provides a fixed sized vector table and
enabling MSI-X "activates" the full table.

System software has to set proper messages in the table and eventually
associate the table entries to device (sub)functions if that's not
hardwired in the device and controlled by queue enablement etc.

According to the specification there is no requirement for masked table
entries to contain a valid message:

"Mask Bit: ... When this bit is set, the function is prohibited from
sending a message using this MSI-X Table entry."

which means that the function must reread the table entry when the mask
bit in the vector control word is cleared.

So instead of tearing down the whole set and then bringing it up again,
which is wrong, the kernel could allocate an interrupt descriptor and
append an MSI entry, write the table entry and unmask it afterwards.

There are a couple of things to get there:

1) MSI descriptor list handling.

The msi descriptor list of a device is assumed to be immutable after
pci_enable_msix() has successfully enabled all of it, which means
that there is no serialization in place.

IIRC, the original attempt to glue IMS into the existing MSI
management had a patch to add the required protections. That part
could be dusted off.

2) Provide the required functionality in the MSI irq domain
infrastructure

3) Expose this functionality through a proper interface.

That's append only of course.

It's not clear to me whether this is worth the effort, but at least it
is a viable solution if memory consumption or IRTE consumption is an
actual concern.
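
For illustration only, a rough sketch of what the append-only operation
could look like once 1) and 2) existed; every function name below is
invented, nothing like this is in the kernel today:

#include <linux/pci.h>

/* Invented helpers, for illustration only */
int msix_alloc_append_desc(struct pci_dev *dev, unsigned int index);
void msix_write_table_entry(struct pci_dev *dev, unsigned int index);
void msix_unmask_entry(struct pci_dev *dev, unsigned int index);

static int msix_append_vector(struct pci_dev *dev, unsigned int index)
{
        int virq;

        /* Append a new MSI descriptor and allocate its interrupt
         * descriptor, under the serialization added in 1). */
        virq = msix_alloc_append_desc(dev, index);
        if (virq < 0)
                return virq;

        /* Write the message into the still masked table entry ... */
        msix_write_table_entry(dev, index);

        /* ... then clear the mask bit in the vector control word. Per
         * the quoted spec language the function rereads the entry, so
         * the new message takes effect only now. */
        msix_unmask_entry(dev, index);

        return virq;
}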

Thanks,

tglx

2021-06-24 17:01:33

by Jacob Pan

[permalink] [raw]
Subject: Re: Virtualizing MSI-X on IMS via VFIO

Hi Kevin,

On Wed, 23 Jun 2021 19:41:24 -0700, "Tian, Kevin" <[email protected]>
wrote:

> > > 1) Fix the lost interrupt issue in existing MSI virtualization flow;
> > >
> >
> > That _cannot_ be fixed without a hypercall. See my reply to Alex.
>
> The lost interrupt issue was caused due to resizing based on stale
> impression of vector exhaustion.

Is it possible to mitigate the lost interrupt by always injecting an IRQ
after unmask? Either in the VFIO layer, or let QEMU do that after the
second VFIO_DEVICE_SET_IRQS in step 4.b of your original email.

After all, spurious interrupts should be tolerated and unmasking MSI-X
should be rare. I am not suggesting this as an alternative to the real
fix, just a stopgap.
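
A sketch of what that stopgap could look like, assuming the userspace
side keeps the eventfd it registered for the vector; the helper name is
invented for illustration:

#include <stdint.h>
#include <unistd.h>

/* Kick the vector's eventfd once, right after the unmask completes,
 * so a potentially lost edge is replayed as a spurious interrupt. */
static void replay_possibly_lost_irq(int vector_eventfd)
{
        uint64_t one = 1;

        (void)write(vector_eventfd, &one, sizeof(one));
}

Since that eventfd is typically wired up as a KVM irqfd, kicking it
from userspace injects the vector into the guest just like a
device-originated interrupt would.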

Thanks,

Jacob

2021-06-24 17:53:26

by Alex Williamson

[permalink] [raw]
Subject: Re: Virtualizing MSI-X on IMS via VFIO

On Thu, 24 Jun 2021 00:00:37 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson <[email protected]>
> > Sent: Wednesday, June 23, 2021 11:20 PM
> >
> [...]
> > > So the only downside today of allocating more MSI-X vectors than
> > > necessary is memory consumption for the irq descriptors.
> >
> > As above, this is a QEMU policy of essentially trying to be a good
> > citizen and allocate only what we can infer the guest is using. What's
> > a good way for QEMU, or any userspace, to know it's running on a host
> > where vector exhaustion is not an issue?
>
> In my proposal a new command (VFIO_DEVICE_ALLOC_IRQS) is
> introduced to separate allocation from enabling. Could the
> availability of this command serve as the indicator that vector
> exhaustion is no longer an issue?

We have options with existing interfaces if we want to provide some
programmatic means through vfio to hint to userspace about vector
usage. Otherwise I don't see much justification for this new ioctl; it
can largely be done with SET_IRQS, or certainly with extensions of
flags.

> > > So no, we are not going to proliferate this complete ignorance of how
> > > MSI-X actually works and just cram another "feature" into code which is
> > > known to be incorrect.
> >
> > Some of the issues of virtualizing MSI-X are unsolvable without
> > creating a new paravirtual interface, but obviously we want to work
> > with existing drivers and unmodified guests, so that's not an option.
> >
> > To work with what we've got, the vfio API describes the limitation of
> > the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then
> > makes a choice in an attempt to better reflect what we can infer of the
> > guest programming of the device to incrementally enable vectors. We
>
> It's a surprise to me that Qemu doesn't even look at this flag today,
> after searching its code...

There are no examples of the alternative; it would be dead, untested
code. The flag exists in the uAPI to indicate a limitation of the
underlying implementation that has always existed. Should we remove
that limitation, as Thomas now sees as possible, then QEMU wouldn't
need to make a choice whether to fully allocate the vector table or
incrementally tear-down and re-init.

> > could a) work to provide host kernel interfaces that allow us to remove
> > that noresize flag and b) decide whether QEMU's usage policy can be
> > improved on kernels where vector exhaustion is no longer an issue.
>
> Thomas can help confirm, but it looks like the noresize limitation is
> still there. b) makes more sense since Thomas thinks vector exhaustion
> is not an issue now (except one minor open question about irte).

As noted elsewhere, a) is indeed a limitation of the host interfaces,
not inherent to MSI-X. Obviously we can look at different QEMU
policies, including generating hardware faults to the VM on exhaustion
or unmask failures, interrupt injection or better inferring potential
vector usage. Thanks,

Alex

2021-06-24 21:46:07

by Alex Williamson

[permalink] [raw]
Subject: Re: Virtualizing MSI-X on IMS via VFIO

On Thu, 24 Jun 2021 17:14:39 +0200
Thomas Gleixner <[email protected]> wrote:

> After studying the MSI-X specification again, I think there is another
> option to solve this for MSI-X, i.e. the dynamic sizing part:
>
> MSI requires to disable MSI in order to update the number of enabled
> vectors in the control word.

Exactly what part of the spec requires this? This is generally the
convention I expect too, and there are complications around contiguous
vectors and data field alignment, but I'm not actually able to find a
requirement in the spec that MSI Enable must be 0 when modifying other
writable fields or that writable fields are latched when MSI Enable is
set.

> MSI-X does not have that requirement as there is no 'number of used
> vectors' control field. MSI-X provides a fixed sized vector table and
> enabling MSI-X "activates" the full table.
>
> System software has to set proper messages in the table and eventually
> associate the table entries to device (sub)functions if that's not
> hardwired in the device and controlled by queue enablement etc.
>
> According to the specification there is no requirement for masked table
> entries to contain a valid message:
>
> "Mask Bit: ... When this bit is set, the function is prohibited from
> sending a message using this MSI-X Table entry."
>
> which means that the function must reread the table entry when the mask
> bit in the vector control word is cleared.

What is a "valid" message as far as the device is concerned? "Valid"
is meaningful to system software and hardware, the device doesn't care.

Like MSI above, I think the real question is when the data is latched
by the hardware. For MSI-X this seems to be addressed in (PCIe 5.0
spec) 6.1.4.2 MSI-X Configuration:

Software must not modify the Address, Data, or Steering Tag fields of
an entry while it is unmasked.

Followed by 6.1.4.5 Per-vector Masking and Function Masking:

For MSI-X, a Function is permitted to cache Address and Data values
from unmasked MSI-X Table entries. However, anytime software unmasks
a currently masked MSI-X Table entry either by Clearing its Mask bit
or by Clearing the Function Mask bit, the Function must update any
Address or Data values that it cached from that entry. If software
changes the Address or Data value of an entry while the entry is
unmasked, the result is undefined.

So caching/latching occurs on unmask for MSI-X, but I can't find
similar statements for MSI. If you have, please note them. It's
possible MSI is per interrupt.
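
In other words the sanctioned sequence for changing an entry is mask,
write, unmask; a sketch modelled on the kernel's mask/write/unmask path
in drivers/pci/msi.c (base, index and msg assumed given):

    void __iomem *entry = base + index * PCI_MSIX_ENTRY_SIZE;
    u32 ctrl = readl(entry + PCI_MSIX_ENTRY_VECTOR_CTRL);

    /* Mask first; the function may have cached the old message. */
    writel(ctrl | PCI_MSIX_ENTRY_CTRL_MASKBIT,
           entry + PCI_MSIX_ENTRY_VECTOR_CTRL);

    writel(msg->address_lo, entry + PCI_MSIX_ENTRY_LOWER_ADDR);
    writel(msg->address_hi, entry + PCI_MSIX_ENTRY_UPPER_ADDR);
    writel(msg->data, entry + PCI_MSIX_ENTRY_DATA);

    /* Unmask: per 6.1.4.5 the function must now drop any cached
     * address/data and reread the entry. */
    writel(ctrl & ~PCI_MSIX_ENTRY_CTRL_MASKBIT,
           entry + PCI_MSIX_ENTRY_VECTOR_CTRL);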

Anyway, at least MSI-X if not also MSI could have a !NORESIZE
implementation, which is why this flag exists in vfio. Thanks,

Alex

2021-06-24 22:24:02

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH] vfio/pci: Document the MSI[X] resize side effects properly

On Thu, 24 Jun 2021 14:06:09 +0200
Thomas Gleixner <[email protected]> wrote:

> The documentation of VFIO_IRQ_INFO_NORESIZE is inaccurate as it suggests
> that it is safe to dynamically add new MSI-X vectors even when
> previously allocated vectors are already in use and enabled.
>
> Enabling additional vectors is possible according to the MSI-X
> specification, but the kernel does not have any mechanisms today to do
> that safely.
>
> The only available mechanism is to tear down the already active vectors
> and to set up the full vector set afterwards.
>
> This requires temporarily disabling MSI-X, which redirects any interrupt
> raised by the device during this time to legacy PCI/INTX, which is not
> handled; the interrupt is therefore lost.
>
> Update the documentation of VFIO_IRQ_INFO_NORESIZE accordingly.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> include/uapi/linux/vfio.h | 17 +++++++++++++----
> 1 file changed, 13 insertions(+), 4 deletions(-)
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -699,10 +699,19 @@ struct vfio_region_info_cap_nvlink2_lnks
> * disabling the entire index. This is used for interrupts like PCI MSI
> * and MSI-X where the driver may only use a subset of the available
> * indexes, but VFIO needs to enable a specific number of vectors
> - * upfront. In the case of MSI-X, where the user can enable MSI-X and
> - * then add and unmask vectors, it's up to userspace to make the decision
> - * whether to allocate the maximum supported number of vectors or tear
> - * down setup and incrementally increase the vectors as each is enabled.
> + * upfront.
> + *
> + * MSI cannot be resized safely when interrupts are in use already because
> + * resizing requires temporary disablement of MSI for updating the relevant
> + * PCI config space entries. Disabling MSI redirects an interrupt raised by
> + * the device during this time to the unhandled legacy PCI/INTX, which
> + * means the interrupt is lost.
> + *
> + * Enabling additional vectors for MSI-X is possible at least from the
> + * perspective of the MSI-X specification, but not supported by the
> + * existing PCI/MSI-X mechanisms in the kernel. The kernel currently
> + * provides only a full teardown/setup cycle, which requires disabling
> + * MSI-X temporarily with the same side effects as for MSI.
> */
> struct vfio_irq_info {
> __u32 argsz;
>

There's good information here, but as per my other reply I think
NORESIZE might be only a host implementation issue for both MSI and
MSI/X.

I'd also rather not focus on that existing implementation in this
header, which is essentially the uAPI spec, because that implementation
can change and we're unlikely to remember to update the description
here. We might even be describing a device that emulates MSI/X in such
a way that it's not bound by this limitation. For example maybe Intel's
emulation of MSI-X backed by IMS wouldn't need this flag and we could
update QEMU to finally have a branch that avoids the teardown/setup.
We have a flag to indicate this behavior, consequences should be
relative to the presence of that flag.

Finally, a nit: I don't really see a strong case that the existing text
is actually inaccurate or implying some safety against lost interrupts.
It's actually making note of the issue here already, though the more
explicit description is welcome. Thanks,

Alex

2021-06-25 05:22:05

by Tian, Kevin

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

> From: Alex Williamson <[email protected]>
> Sent: Friday, June 25, 2021 5:45 AM
>
> On Thu, 24 Jun 2021 17:14:39 +0200
> Thomas Gleixner <[email protected]> wrote:
>
> > After studying the MSI-X specification again, I think there is another
> > option to solve this for MSI-X, i.e. the dynamic sizing part:
> >
> > MSI requires to disable MSI in order to update the number of enabled
> > vectors in the control word.
>
> Exactly what part of the spec requires this? This is generally the
> convention I expect too, and there are complications around contiguous
> vectors and data field alignment, but I'm not actually able to find a
> requirement in the spec that MSI Enable must be 0 when modifying other
> writable fields or that writable fields are latched when MSI Enable is
> set.
>
> > MSI-X does not have that requirement as there is no 'number of used
> > vectors' control field. MSI-X provides a fixed sized vector table and
> > enabling MSI-X "activates" the full table.
> >
> > System software has to set proper messages in the table and eventually
> > associate the table entries to device (sub)functions if that's not
> > hardwired in the device and controlled by queue enablement etc.
> >
> > According to the specification there is no requirement for masked table
> > entries to contain a valid message:
> >
> > "Mask Bit: ... When this bit is set, the function is prohibited from
> > sending a message using this MSI-X Table entry."
> >
> > which means that the function must reread the table entry when the mask
> > bit in the vector control word is cleared.
>
> What is a "valid" message as far as the device is concerned? "Valid"
> is meaningful to system software and hardware, the device doesn't care.
>
> Like MSI above, I think the real question is when the data is latched
> by the hardware. For MSI-X this seems to be addressed in (PCIe 5.0
> spec) 6.1.4.2 MSI-X Configuration:
>
> Software must not modify the Address, Data, or Steering Tag fields of
> an entry while it is unmasked.
>
> Followed by 6.1.4.5 Per-vector Masking and Function Masking:
>
> For MSI-X, a Function is permitted to cache Address and Data values
> from unmasked MSI-X Table entries. However, anytime software unmasks
> a currently masked MSI-X Table entry either by Clearing its Mask bit
> or by Clearing the Function Mask bit, the Function must update any
> Address or Data values that it cached from that entry. If software
> changes the Address or Data value of an entry while the entry is
> unmasked, the result is undefined.
>
> So caching/latching occurs on unmask for MSI-X, but I can't find
> similar statements for MSI. If you have, please note them. It's
> possible MSI is per interrupt.

I checked PCI Local Bus Specification rev3.0. At that time MSI and
MSI-X were described/compared together in almost every paragraph
in 6.8.3.4 (Per-vector Masking and Function Masking). The paragraph
that you cited is the last one in that section. It's a pity that MSI is
not clarified in this paragraph, but it gives me the impression that an
MSI function is not permitted to cache address and data values.
Later, after the MSI and MSI-X descriptions were split into separate
sections in the PCIe spec, this impression was weakened a lot.

If true, this even implies that software is free to change data/addr
when MSI is unmasked, which is sort of counter-intuitive to most
people.

Then I further found the below thread:

https://lore.kernel.org/lkml/[email protected]/

It identified an MSI-capable device which does latch the message
content, forcing the kernel to start up the irq early, before
enabling the MSI capability.

So, no answer; let's see whether Thomas can help identify
a better proof.

>
> Anyway, at least MSI-X if not also MSI could have a !NORESIZE
> implementation, which is why this flag exists in vfio. Thanks,
>

For MSI we can still mitigate the lost interrupt issue by having
Qemu allocate all possible MSI vectors at the start, when the guest
enables the MSI capability, and never free them before disable.
Anyway there are just up to 32 vectors per device, and the total
number of vectors of all MSI devices in a platform should be limited.
This won't be a big problem after CPU vector exhaustion is relaxed.
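
A sketch of that policy (hypothetical change to QEMU's
vfio_msi_enable(); the field names are QEMU's, the policy is not):

    /*
     * On guest MSI enable, size the host allocation to the device's
     * full Multiple Message Capable count (at most 32) instead of
     * the guest's current enable count, and keep it until the guest
     * disables MSI.
     */
    uint16_t flags = pci_get_word(vdev->pdev.config +
                                  vdev->pdev.msi_cap + PCI_MSI_FLAGS);
    int max_vectors = 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);

    vdev->nr_vectors = max_vectors;
    ret = vfio_enable_vectors(vdev, false /* msi, not msix */);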

p.s. one question to Thomas. As Alex cited above, software must
not modify the Address, Data, or Steering Tag fields of an MSI-X
entry while it is unmasked. However, this rule might be violated
today in the flow below:

request_irq()
__setup_irq()
irq_startup()
__irq_startup()
irq_enable()
unmask_irq() <<<<<<<<<<<<<
irq_setup_affinity()
irq_do_set_affinity()
msi_set_affinity() // when IR is disabled
irq_msi_update_msg()
pci_msi_domain_write_msg() <<<<<<<<<<<<<<

Doesn't the above update the MSI-X entry after it's unmasked?

I may have overlooked something though, since there are many branches
in the middle which may make the above flow invalid...

Thanks
Kevin

2021-06-25 08:30:15

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Virtualizing MSI-X on IMS via VFIO

Alex!

On Thu, Jun 24 2021 at 15:44, Alex Williamson wrote:
> On Thu, 24 Jun 2021 17:14:39 +0200
> Thomas Gleixner <[email protected]> wrote:
>
>> After studying the MSI-X specification again, I think there is another
>> option to solve this for MSI-X, i.e. the dynamic sizing part:
>>
>> MSI requires to disable MSI in order to update the number of enabled
>> vectors in the control word.
>
> Exactly what part of the spec requires this? This is generally the
> convention I expect too, and there are complications around contiguous
> vectors and data field alignment, but I'm not actually able to find a
> requirement in the spec that MSI Enable must be 0 when modifying other
> writable fields or that writable fields are latched when MSI Enable is
> set.

There is nothing in the spec which mandates that, but based on
experience I know that devices latch the number of vectors field when
the enable bit goes from 0 to 1, which makes sense. Devices derive their
internal interrupt routing from that.

>> which means that the function must reread the table entry when the mask
>> bit in the vector control word is cleared.
>
> What is a "valid" message as far as the device is concerned? "Valid"
> is meaningful to system software and hardware, the device doesn't
> care.

That's correct; it uses whatever is there.

> So caching/latching occurs on unmask for MSI-X, but I can't find
> similar statements for MSI. If you have, please note them. It's
> possible MSI is per interrupt.

MSI is mostly implementation defined due to the blurry specification.

Also the fact that MSI masking is optional does not make it any
better. Most devices (even new ones) do not have MSI masking.
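
Whether a function implements it at all is advertised in the Message
Control word, e.g. (sketch):

    u16 ctrl;

    pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &ctrl);
    if (ctrl & PCI_MSI_FLAGS_MASKBIT) {
        /* Mask/pending register pairs exist, safe updates are
         * possible. */
    } else {
        /* No masking: message updates race with the device. */
    }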

> Anyway, at least MSI-X if not also MSI could have a !NORESIZE
> implementation, which is why this flag exists in vfio.

MSI-X yes, with pretty major surgery.

MSI, no way. Contrary to MSI-X you cannot just update the $Nth entry in
a table because there is no table. MSI has a base message and derives
the $Nth vector message from it by modifying the lower bits of the data
word.
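
For a function enabled with nvec (a power of two) vectors the
derivation is roughly (helper name made up for illustration):

    /* The device ORs the vector index into the low log2(nvec) bits
     * of the single data word, which is why the base message data
     * must be aligned to nvec. */
    u32 msi_msg_data(u32 base_data, unsigned int n, unsigned int nvec)
    {
        return (base_data & ~(nvec - 1)) | n;
    }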

So without masking, updating the base message for multi-MSI is close
to impossible. Look at the dance we have to do in msi_set_affinity().

But even with masking there is still the issue of the 'number of
vectors' field, and you can't set that to maximum at init time either,
because some devices derive how interrupts are routed from it, and you
surely don't want to change that behaviour while devices are active.

Even if that were possible, we'd need to allocate the full IRTE
space, which would be just another corner case and require extra
handling.

MSI is a badly specified trainwreck and we already have enough horrible
code dealing with it. No need to add more of that which is going to
cause more problems than it solves.

The sad thing is that, despite the problems of MSI being known for
more than a decade, MSI is still widely used in new silicon, most of
the time even without masking.

> Anyway, at least MSI-X if not also MSI could have a !NORESIZE
> implementation, which is why this flag exists in vfio.

Right, it's there to be ignored for MSI-X in the current implementation
of QEMU and VFIO/PCI.

Thanks,

tglx

2021-06-25 08:44:05

by Thomas Gleixner

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

On Fri, Jun 25 2021 at 05:21, Kevin Tian wrote:
>> From: Alex Williamson <[email protected]>
>> So caching/latching occurs on unmask for MSI-X, but I can't find
>> similar statements for MSI. If you have, please note them. It's
>> possible MSI is per interrupt.
>
> I checked PCI Local Bus Specification rev3.0. At that time MSI and
> MSI-X were described/compared together in almost every paragraph
> in 6.8.3.4 (Per-vector Masking and Function Masking). The paragraph
> that you cited is the last one in that section. It's a pity that MSI is
> not clarified in this paragraph, but it gives me the impression that an
> MSI function is not permitted to cache address and data values.
> Later, after the MSI and MSI-X descriptions were split into separate
> sections in the PCIe spec, this impression was weakened a lot.
>
> If true, this even implies that software is free to change data/addr
> when MSI is unmasked, which is sort of counter-intuitive to most
> people.

Yes, software is free to do that and it has to deal with the
consequences. See arch/x86/kernel/apic/msi.c::msi_set_affinity().

> Then I further found the below thread:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> It identified an MSI-capable device which does latch the message
> content, forcing the kernel to start up the irq early, before
> enabling the MSI capability.
>
> So, no answer; let's see whether Thomas can help identify
> a better proof.

As I said to Alex: The MSI specification is and always was blurry and the
behaviour in detail is implementation defined. IOW, what might work on
device A is not guaranteed to work on device B.

> p.s. one question to Thomas. As Alex cited above, software must
> not modify the Address, Data, or Steering Tag fields of an MSI-X
> entry while it is unmasked. However, this rule might be violated
> today in the flow below:
>
> request_irq()
> __setup_irq()
> irq_startup()
> __irq_startup()
> irq_enable()
> unmask_irq() <<<<<<<<<<<<<
> irq_setup_affinity()
> irq_do_set_affinity()
> msi_set_affinity() // when IR is disabled
> irq_msi_update_msg()
> pci_msi_domain_write_msg() <<<<<<<<<<<<<<
>
> Doesn't the above update the MSI-X entry after it's unmasked?

Dammit, I could swear that we had masking at the core or PCI level at
some point. Let me dig into this.

Thanks,

tglx

2021-06-25 12:44:43

by Thomas Gleixner

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

On Fri, Jun 25 2021 at 10:43, Thomas Gleixner wrote:
> On Fri, Jun 25 2021 at 05:21, Kevin Tian wrote:
>> p.s. one question to Thomas. As Alex cited above, software must
>> not modify the Address, Data, or Steering Tag fields of an MSI-X
>> entry while it is unmasked. However, this rule might be violated
>> today in the flow below:
>>
>> request_irq()
>> __setup_irq()
>> irq_startup()
>> __irq_startup()
>> irq_enable()
>> unmask_irq() <<<<<<<<<<<<<
>> irq_setup_affinity()
>> irq_do_set_affinity()
>> msi_set_affinity() // when IR is disabled
>> irq_msi_update_msg()
>> pci_msi_domain_write_msg() <<<<<<<<<<<<<<
>>
>> Doesn't the above update the MSI-X entry after it's unmasked?
>
> Dammit, I could swear that we had masking at the core or PCI level at
> some point. Let me dig into this.

Indeed, that code path does not check irq_can_move_pcntxt(). It only
avoids blowing up in our face by chance, because of this:

__setup_irq()
irq_activate()
unmask()
irq_setup_affinity()

irq_activate() assigns a vector based on the affinity mask so
irq_setup_affinity() ends up writing the same data again pointlessly.

The ordering of startup/setup_affinity is the way it is for stupid
historical reasons. I tried to reorder it at some point, but that
caused failures on !x86, so I went back to the status quo.

All other affinity settings happen with the interrupt masked because we
do that from actual interrupt context via irq_move_masked_irq() which
does the right thing.
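
Probably something along these lines (untested sketch using the
existing kernel/irq helpers; the real fix may well look different):

    /* In the startup path: if the interrupt cannot be moved safely
     * in process context, do the initial affinity write while the
     * interrupt is still masked. */
    if (!irq_can_move_pcntxt(&desc->irq_data)) {
        mask_irq(desc);
        irq_setup_affinity(desc);
        unmask_irq(desc);
    } else {
        irq_setup_affinity(desc);
    }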

Let me fix that properly for the startup case.

Thanks,

tglx

2021-06-25 21:20:33

by Thomas Gleixner

[permalink] [raw]
Subject: RE: Virtualizing MSI-X on IMS via VFIO

On Fri, Jun 25 2021 at 10:43, Thomas Gleixner wrote:
> On Fri, Jun 25 2021 at 05:21, Kevin Tian wrote:
>>> From: Alex Williamson <[email protected]>
>>> So caching/latching occurs on unmask for MSI-X, but I can't find
>>> similar statements for MSI. If you have, please note them. It's
>>> possible MSI is per interrupt.
>>
>> I checked PCI Local Bus Specification rev3.0. At that time MSI and
>> MSI-X were described/compared together in almost every paragraph
>> in 6.8.3.4 (Per-vector Masking and Function Masking). The paragraph
>> that you cited is the last one in that section. It's a pity that MSI is
>> not clarified in this paragraph, but it gives me the impression that an
>> MSI function is not permitted to cache address and data values.
>> Later, after the MSI and MSI-X descriptions were split into separate
>> sections in the PCIe spec, this impression was weakened a lot.
>>
>> If true, this even implies that software is free to change data/addr
>> when MSI is unmasked, which is sort of counter-intuitive to most
>> people.
>
> Yes, software is free to do that and it has to deal with the
> consequences. See arch/x86/kernel/apic/msi.c::msi_set_affinity().

Well, it's actually forced to do so when the MSI implementation does not
provide masking. If masking is available then it should be used also for
MSI, as it prevents the nasty update problem where the hardware can
observe inconsistent state. x86 does this correctly except for that
startup issue you spotted, but I'm not at all sure about the rest.

The startup issue is halfway trivial to fix, I think, but there are
other issues with the PCI/MSI-X handling which I discovered while
staring at that code for a while. Still working on it.

Thanks,

tglx