2021-11-27 01:30:55

by Thomas Gleixner

[permalink] [raw]
Subject: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Replace the about to vanish iterators, make use of the filtering and take
the descriptor lock around the iteration.

Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jon Mason <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Allen Hubbe <[email protected]>
Cc: [email protected]
---
drivers/ntb/msi.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)

--- a/drivers/ntb/msi.c
+++ b/drivers/ntb/msi.c
@@ -108,8 +108,10 @@ int ntb_msi_setup_mws(struct ntb_dev *nt
 	if (!ntb->msi)
 		return -EINVAL;
 
-	desc = first_msi_entry(&ntb->pdev->dev);
+	msi_lock_descs(&ntb->pdev->dev);
+	desc = msi_first_desc(&ntb->pdev->dev);
 	addr = desc->msg.address_lo + ((uint64_t)desc->msg.address_hi << 32);
+	msi_unlock_descs(&ntb->pdev->dev);
 
 	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
 		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
@@ -281,13 +283,15 @@ int ntbm_msi_request_threaded_irq(struct
 				  const char *name, void *dev_id,
 				  struct ntb_msi_desc *msi_desc)
 {
+	struct device *dev = &ntb->pdev->dev;
 	struct msi_desc *entry;
 	int ret;
 
 	if (!ntb->msi)
 		return -EINVAL;
 
-	for_each_pci_msi_entry(entry, ntb->pdev) {
+	msi_lock_descs(dev);
+	msi_for_each_desc(entry, dev, MSI_DESC_ASSOCIATED) {
 		if (irq_has_action(entry->irq))
 			continue;
 
@@ -304,14 +308,17 @@ int ntbm_msi_request_threaded_irq(struct
 		ret = ntbm_msi_setup_callback(ntb, entry, msi_desc);
 		if (ret) {
 			devm_free_irq(&ntb->dev, entry->irq, dev_id);
-			return ret;
+			goto unlock;
 		}
 
-
-		return entry->irq;
+		ret = entry->irq;
+		goto unlock;
 	}
+	ret = -ENODEV;
 
-	return -ENODEV;
+unlock:
+	msi_unlock_descs(dev);
+	return ret;
 }
 EXPORT_SYMBOL(ntbm_msi_request_threaded_irq);




2021-11-29 18:24:03

by Logan Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()



On 2021-11-26 6:23 p.m., Thomas Gleixner wrote:
> Replace the about to vanish iterators, make use of the filtering and take
> the descriptor lock around the iteration.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Cc: Jon Mason <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Allen Hubbe <[email protected]>
> Cc: [email protected]

This patch looks good to me:

Reviewed-by: Logan Gunthorpe <[email protected]>

Thanks,

Logan

2021-11-29 22:28:02

by Logan Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()



On 2021-11-29 1:51 p.m., Thomas Gleixner wrote:
> Logan,
>
> On Mon, Nov 29 2021 at 11:21, Logan Gunthorpe wrote:
>> On 2021-11-26 6:23 p.m., Thomas Gleixner wrote:
>>> Replace the about to vanish iterators, make use of the filtering and take
>>> the descriptor lock around the iteration.
>>
>> This patch looks good to me:
>>
>> Reviewed-by: Logan Gunthorpe <[email protected]>
>
> thanks for having a look at this. While I have your attention, I have a
> question related to NTB.
>
> The switchtec driver is the only one which uses PCI_IRQ_VIRTUAL in order
> to allocate non-hardware backed MSI-X descriptors.
>
> AFAIU these descriptors are not MSI-X descriptors in the regular sense
> of PCI/MSI-X. They are allocated via the PCI/MSI mechanism but their
> usage is somewhere in NTB which has nothing to do with the way how the
> real MSI-X interrupts of a device work which explains why we have those
> pci.msi_attrib.is_virtual checks all over the place.
>
> I assume that there are other variants feeding into NTB which can handle
> that without this PCI_IRQ_VIRTUAL quirk, but TBH, I got completely lost
> in that code.
>
> Could you please shed some light on the larger picture of this?

Yes, of course. I'll try to explain:

The NTB code here is trying to create an MSI interrupt that is not
triggered by the PCI device itself but by a peer behind the
Non-Transparent Bridge (or, more precisely: from the CPU's perspective
the interrupt will appear to come from the PCI device, but nothing in the
PCI device's firmware or hardware will have triggered it).

In most cases, the NTB code needs more interrupts than the hardware
actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
for: it allows the driver to request more interrupts than the hardware
advertises (i.e. more than pci_msix_vec_count() reports). These extra
interrupts are created, but get flagged with msi_attrib.is_virtual, which
ensures that the functions programming the MSI-X table don't try to write
past the end of the hardware's table.

The NTB code in drivers/ntb/msi.c uses these virtual MSI-X interrupts.
(Or rather it can use any interrupt: it doesn't care whether it's virtual
or not, and a spare hardware interrupt would work just as well, but in
practice it will usually be a virtual one.) The code uses these
interrupts by setting up a memory window in the bridge to cover the MMIO
addresses of the MSI-X interrupts. It communicates the offsets of the
interrupts (and the MSI message data) to the peer so that the peer can
trigger an interrupt simply by writing the message data to its side of
the memory window. (In the code: ntbm_msi_request_threaded_irq() is
called to request an interrupt and fills in the ntb_msi_desc with the
offset and data, which is transferred to the peer, which then uses
ntb_msi_peer_trigger() to trigger the interrupt.)

Existing NTB hardware does already have what's called a doorbell, which
provides the same functionality as the above technique. However, existing
hardware implementations of doorbells have significant latency and thus
slow down performance substantially. Implementing the MSI interrupts as
described above increased the performance of ntb_transport by more than
three times[1].

There aren't really other "variants". In theory, IDT hardware would also
require the same quirk but the drivers in the tree aren't quite up to
snuff and don't even support ntb_transport (so nobody has added
support). Intel and AMD drivers could probably do this as well (provided
they have extra memory windows) but I don't know that anyone has tried.

Let me know if anything is still unclear or you have further questions.
You can also read the last posting of the patch series[2] if you'd like
more information.

Logan

[1] 2b0569b3b7e6 ("NTB: Add MSI interrupt support to ntb_transport")
[2]
https://lore.kernel.org/all/[email protected]/T/#u





2021-11-29 22:41:02

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Logan,

On Mon, Nov 29 2021 at 11:21, Logan Gunthorpe wrote:
> On 2021-11-26 6:23 p.m., Thomas Gleixner wrote:
>> Replace the about to vanish iterators, make use of the filtering and take
>> the descriptor lock around the iteration.
>
> This patch looks good to me:
>
> Reviewed-by: Logan Gunthorpe <[email protected]>

thanks for having a look at this. While I have your attention, I have a
question related to NTB.

The switchtec driver is the only one which uses PCI_IRQ_VIRTUAL in order
to allocate non-hardware backed MSI-X descriptors.

AFAIU these descriptors are not MSI-X descriptors in the regular sense
of PCI/MSI-X. They are allocated via the PCI/MSI mechanism but their
usage is somewhere in NTB which has nothing to do with the way how the
real MSI-X interrupts of a device work which explains why we have those
pci.msi_attrib.is_virtual checks all over the place.

I assume that there are other variants feeding into NTB which can handle
that without this PCI_IRQ_VIRTUAL quirk, but TBH, I got completely lost
in that code.

Could you please shed some light on the larger picture of this?

Thanks,

tglx

2021-11-29 22:52:30

by Dave Jiang

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()


On 11/29/2021 3:27 PM, Logan Gunthorpe wrote:
>
> On 2021-11-29 1:51 p.m., Thomas Gleixner wrote:
>> Logan,
>>
>> On Mon, Nov 29 2021 at 11:21, Logan Gunthorpe wrote:
>>> On 2021-11-26 6:23 p.m., Thomas Gleixner wrote:
>>>> Replace the about to vanish iterators, make use of the filtering and take
>>>> the descriptor lock around the iteration.
>>> This patch looks good to me:
>>>
>>> Reviewed-by: Logan Gunthorpe <[email protected]>
>> thanks for having a look at this. While I have your attention, I have a
>> question related to NTB.
>>
>> The switchtec driver is the only one which uses PCI_IRQ_VIRTUAL in order
>> to allocate non-hardware backed MSI-X descriptors.
>>
>> AFAIU these descriptors are not MSI-X descriptors in the regular sense
>> of PCI/MSI-X. They are allocated via the PCI/MSI mechanism but their
>> usage is somewhere in NTB which has nothing to do with the way how the
>> real MSI-X interrupts of a device work which explains why we have those
>> pci.msi_attrib.is_virtual checks all over the place.
>>
>> I assume that there are other variants feeding into NTB which can handle
>> that without this PCI_IRQ_VIRTUAL quirk, but TBH, I got completely lost
>> in that code.
>>
>> Could you please shed some light on the larger picture of this?
> Yes, of course. I'll try to explain:
>
> The NTB code here is trying to create an MSI interrupt that is not
> triggered by the PCI device itself but from a peer behind the
> Non-Transparent Bridge (or, more carefully: from the CPU's perspective
> the interrupt will come from the PCI device, but nothing in the PCI
> device's firmware or hardware will have triggered the interrupt).
>
> In most cases, the NTB code needs more interrupts than the hardware
> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> for: it allows the driver to request more interrupts than the hardware
> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> created, but get flagged with msi_attrib.is_virtual which ensures
> functions that program the MSI-X table don't try to write past the end
> of the hardware's table.
>
> The NTB code in drivers/ntb/msi.c uses these virtual MSI-X interrupts.
> (Or rather it can use any interrupt: it doesn't care whether it's virtual
> or not, it would be fine if it is just a spare interrupt in hardware,
> but in practice, it will usually be a virtual one). The code uses these
> interrupts by setting up a memory window in the bridge to cover the MMIO
> addresses of MSI-X interrupts. It communicates the offsets of the
> interrupts (and the MSI message data) to the peer so that the peer can
> trigger the interrupt simply by writing the message data to its side of
> the memory window. (In the code: ntbm_msi_request_threaded_irq() is
> called to request an interrupt, which fills in the ntb_msi_desc with the
> offset and data, which is transferred to the peer which would then use
> ntb_msi_peer_trigger() to trigger the interrupt.)
>
> Existing NTB hardware does already have what's called a doorbell which
> provides the same functionality as the above technique. However, existing
> hardware implementations of doorbells have significant latency and thus
> slow down performance substantially. Implementing the MSI interrupts as
> described above increased the performance of ntb_transport by more than
> three times[1].
>
> There aren't really other "variants". In theory, IDT hardware would also
> require the same quirk but the drivers in the tree aren't quite up to
> snuff and don't even support ntb_transport (so nobody has added
> support). Intel and AMD drivers could probably do this as well (provided
> they have extra memory windows) but I don't know that anyone has tried.

The Intel driver used to do something similar to bypass a doorbell
hardware erratum on pre-Skylake Xeon hardware, though that code never
went upstream. I'd like to get this working for the performance reasons
you mentioned. I just need to find some time to test it with the second
memory window Intel NTB has.


>
> Let me know if anything is still unclear or you have further questions.
> You can also read the last posting of the patch series[2] if you'd like
> more information.
>
> Logan
>
> [1] 2b0569b3b7e6 ("NTB: Add MSI interrupt support to ntb_transport")
> [2]
> https://lore.kernel.org/all/[email protected]/T/#u
>
>
>
>

2021-11-29 23:31:52

by Jason Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Mon, Nov 29, 2021 at 03:27:20PM -0700, Logan Gunthorpe wrote:

> In most cases, the NTB code needs more interrupts than the hardware
> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> for: it allows the driver to request more interrupts than the hardware
> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> created, but get flagged with msi_attrib.is_virtual which ensures
> functions that program the MSI-X table don't try to write past the end
> of the hardware's table.

AFAICT what you've described is what Intel is calling IMS in other
contexts.

IMS is fundamentally a way to control MSI interrupt descriptors that
are not accessed through PCI SIG compliant means. In this case the NTB
driver has to do its magic to relay the addr/data pairs to the real
MSI storage in the hidden devices.

PCI_IRQ_VIRTUAL should probably be fully replaced by the new dynamic
APIs in the fullness of time..

> Existing NTB hardware does already have what's called a doorbell which
> provides the same functionality as the above technique. However, existing
> hardware implementations of doorbells have significant latency and thus
> slow down performance substantially. Implementing the MSI interrupts as
> described above increased the performance of ntb_transport by more than
> three times[1].

Does the doorbell scheme allow as many interrupts?

Jason

2021-11-29 23:52:50

by Logan Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()



On 2021-11-29 4:31 p.m., Jason Gunthorpe wrote:
> On Mon, Nov 29, 2021 at 03:27:20PM -0700, Logan Gunthorpe wrote:
>
>> In most cases, the NTB code needs more interrupts than the hardware
>> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
>> for: it allows the driver to request more interrupts than the hardware
>> advertises (ie. pci_msix_vec_count()). These extra interrupts are
>> created, but get flagged with msi_attrib.is_virtual which ensures
>> functions that program the MSI-X table don't try to write past the end
>> of the hardware's table.
>
> AFAICT what you've described is what Intel is calling IMS in other
> contexts.
>
> IMS is fundamentally a way to control MSI interrupt descriptors that
> are not accessed through PCI SIG compliant means. In this case the NTB
> driver has to do its magic to relay the addr/data pairs to the real
> MSI storage in the hidden devices.

With current applications, it isn't that there is real "MSI storage"
anywhere; the device on the other side of the bridge is always another
Linux host which holds the address (or rather mw offset) and data in
memory to use when it needs to trigger the interrupt of the other
machine. There are many prototypes and proprietary messes that try to
have other PCI devices (ie NVMe, etc) behind the non-transparent bridge;
but the Linux subsystem has no support for this.

> PCI_IRQ_VIRTUAL should probably be fully replaced by the new dynamic
> APIs in the fullness of time..

Perhaps, I don't really know much about IMS or how close a match it is.

>> Existing NTB hardware does already have what's called a doorbell which
>> provides the same functionality as the above technique. However, existing
>> hardware implementations of doorbells have significant latency and thus
>> slow down performance substantially. Implementing the MSI interrupts as
>> described above increased the performance of ntb_transport by more than
>> three times[1].
>
> Does the doorbell scheme allow as many interrupts?

No, but for current applications there are plenty of doorbells.

Switchtec hardware (and I think other hardware) typically has 64
doorbells for the entire network (they must be split among the hosts in
the network; a two-host system could have 32 per host). The NTB
subsystem in Linux currently supports only 2 hosts, but switchtec
hardware supports up to 48 hosts, in which case you might have only 1
doorbell per host, and that might be limiting depending on the application.

Logan

2021-11-30 00:01:55

by Jason Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Mon, Nov 29, 2021 at 04:52:35PM -0700, Logan Gunthorpe wrote:
>
>
> On 2021-11-29 4:31 p.m., Jason Gunthorpe wrote:
> > On Mon, Nov 29, 2021 at 03:27:20PM -0700, Logan Gunthorpe wrote:
> >
> >> In most cases, the NTB code needs more interrupts than the hardware
> >> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> >> for: it allows the driver to request more interrupts than the hardware
> >> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> >> created, but get flagged with msi_attrib.is_virtual which ensures
> >> functions that program the MSI-X table don't try to write past the end
> >> of the hardware's table.
> >
> > AFAICT what you've described is what Intel is calling IMS in other
> > contexts.
> >
> > IMS is fundamentally a way to control MSI interrupt descriptors that
> > are not accessed through PCI SIG compliant means. In this case the NTB
> > driver has to do its magic to relay the addr/data pairs to the real
> > MSI storage in the hidden devices.
>
> With current applications, it isn't that there is real "MSI storage"
> anywhere; the device on the other side of the bridge is always another
> Linux host which holds the address (or rather mw offset) and data in
> memory to use when it needs to trigger the interrupt of the other
> machine.

Sure, that is fine "MSI Storage". The triggering device only needs to
store the addr/data pair someplace to be "MSI Storage".

Jason

2021-11-30 00:29:46

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Logan,

On Mon, Nov 29 2021 at 15:27, Logan Gunthorpe wrote:
> On 2021-11-29 1:51 p.m., Thomas Gleixner wrote:
>> The switchtec driver is the only one which uses PCI_IRQ_VIRTUAL in order
>> to allocate non-hardware backed MSI-X descriptors.
>>
>> AFAIU these descriptors are not MSI-X descriptors in the regular sense
>> of PCI/MSI-X. They are allocated via the PCI/MSI mechanism but their
>> usage is somewhere in NTB which has nothing to do with the way how the
>> real MSI-X interrupts of a device work which explains why we have those
>> pci.msi_attrib.is_virtual checks all over the place.
>>
>> I assume that there are other variants feeding into NTB which can handle
>> that without this PCI_IRQ_VIRTUAL quirk, but TBH, I got completely lost
>> in that code.
>>
>> Could you please shed some light on the larger picture of this?
>
> Yes, of course. I'll try to explain:
>
> The NTB code here is trying to create an MSI interrupt that is not
> triggered by the PCI device itself but from a peer behind the
> Non-Transparent Bridge (or, more carefully: from the CPU's perspective
> the interrupt will come from the PCI device, but nothing in the PCI
> device's firmware or hardware will have triggered the interrupt).

That far I got.

> In most cases, the NTB code needs more interrupts than the hardware
> actually provides for in its MSI-X table. That's what PCI_IRQ_VIRTUAL is
> for: it allows the driver to request more interrupts than the hardware
> advertises (ie. pci_msix_vec_count()). These extra interrupts are
> created, but get flagged with msi_attrib.is_virtual which ensures
> functions that program the MSI-X table don't try to write past the end
> of the hardware's table.
>
> The NTB code in drivers/ntb/msi.c uses these virtual MSI-X interrupts.
> (Or rather it can use any interrupt: it doesn't care whether it's virtual
> or not, it would be fine if it is just a spare interrupt in hardware,
> but in practice, it will usually be a virtual one). The code uses these
> interrupts by setting up a memory window in the bridge to cover the MMIO
> addresses of MSI-X interrupts. It communicates the offsets of the
> interrupts (and the MSI message data) to the peer so that the peer can
> trigger the interrupt simply by writing the message data to its side of
> the memory window. (In the code: ntbm_msi_request_threaded_irq() is
> called to request an interrupt, which fills in the ntb_msi_desc with the
> offset and data, which is transferred to the peer which would then use
> ntb_msi_peer_trigger() to trigger the interrupt.)

So the whole thing looks like this:

                 PCIe
                  |
                  |    ________________________
                  |   |                        |
                  |   |         NTB            |
                  |   |                        |
                  |   |   PCI config space     |
                  |   |   MSI-X space          |
                  |   |________________________|
                  |---|                        |
                      |   Memory window A      |
                      |   Memory window ..     |
                      |   Memory window N      |
                      |________________________|

The peers can inject an interrupt through the associated memory window
like the NTB device itself does via the real MSI-X interrupts by writing
the associated MSI message data to the associated address (either
directly to the targeted APIC or to the IOMMU table).

As this message goes through the NTB device it's tagged as originating
from the NTB PCIe device as seen by the IOMMU/Interrupt remapping unit.

So far so good.

I completely understand why you went down the road to add this
PCI_IRQ_VIRTUAL "feature", but TBH this should have never happened.

Why?

These interrupts have absolutely nothing to do with PCI/MSI-X as defined
by the spec and as handled by the PCI/MSI core.

The fact that the MSI message is transported by PCIe does not change
that at all, neither does it matter that from an IOMMU/Interrupt
remapping POV these messages are tagged to come from that particular
PCIe device.

At the conceptual level these interrupts are in separate irq domains:

    |    ________________________
    |   |                        |
    |   |         NTB            |
    |   |                        |
    |   |   PCI config space     |
    |   |   MSI-X space          |  <- #1 Global or per IOMMU zone PCI/MSI domain
    |   |________________________|
    |---|                        |
        |   Memory window A      |
        |   Memory window ..     |  <- #2 Per device NTB domain
        |   Memory window N      |
        |________________________|

You surely can see the distinction between #1 and #2, right?

And because they are in different domains, they simply cannot share the
interrupt chip implementation taking care of the particular oddities of
the "hardware". I know, you made it 'shared' by adding these
'is_virtual' conditionals all over the place, but that pretty much
defeats the purpose of having separate domains.

This is also reflected in drivers/ntb/msi.c::ntbm_msi_request_threaded_irq():

	for_each_pci_msi_entry(entry, ntb->pdev) {
		if (irq_has_action(entry->irq))
			continue;

What on earth guarantees that this check has any relevance? Just because
an entry does not have an interrupt requested on it does not tell you
anything.

That might be an actual entry which belongs to the physical PCI NTB
device MSI-X space which is not requested yet. Just because that
swichtec device does not fall into that category does not make it any
more correct.

Seriously, infrastructure code which relies on undocumented assumptions
based on a particular hardware device is broken by definition.

I'm way too tired to come up with a proper solution for that, but that
PCI_IRQ_VIRTUAL has to die ASAP.

Thanks,

tglx


2021-11-30 19:21:55

by Logan Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()



On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
> At the conceptual level these interrupts are in separate irq domains:
>
>     |    ________________________
>     |   |                        |
>     |   |         NTB            |
>     |   |                        |
>     |   |   PCI config space     |
>     |   |   MSI-X space          |  <- #1 Global or per IOMMU zone PCI/MSI domain
>     |   |________________________|
>     |---|                        |
>         |   Memory window A      |
>         |   Memory window ..     |  <- #2 Per device NTB domain
>         |   Memory window N      |
>         |________________________|
>
> You surely can see the distinction between #1 and #2, right?

I wouldn't say that's entirely obvious or even irrefutable. However, I'm
certainly open to this approach if it improves the code.

> I'm way too tired to come up with a proper solution for that, but that
> PCI_IRQ_VIRTUAL has to die ASAP.

I'm willing to volunteer a bit of my time to clean this up, but I'd need
a bit more direction on what a proper solution would look like. The MSI
domain code is far from well documented nor is it easy to understand.

Logan

2021-11-30 19:48:09

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Logan,

On Tue, Nov 30 2021 at 12:21, Logan Gunthorpe wrote:
> On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
>> I'm way too tired to come up with a proper solution for that, but that
>> PCI_IRQ_VIRTUAL has to die ASAP.
>
> I'm willing to volunteer a bit of my time to clean this up, but I'd need
> a bit more direction on what a proper solution would look like. The MSI
> domain code is far from well documented nor is it easy to understand.

Fair enough. I'm struggling with finding time to document that properly.

I've not yet made my mind up what the best way forward for this is, but
I have a few ideas which I want to explore deeper.

But the most important question is right now on which architectures
these switchtec virtual interrupt things are used.

If it's used on any architecture which does not use hierarchical
irqdomains for MSI (x86, arm, arm64, power64), then using irqdomains is
obviously a non-starter unless falling back to a single interrupt would
not be considered a regression :)

Thanks,

tglx

2021-11-30 20:14:55

by Logan Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()



On 2021-11-30 12:48 p.m., Thomas Gleixner wrote:
> On Tue, Nov 30 2021 at 12:21, Logan Gunthorpe wrote:
>> On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
>>> I'm way too tired to come up with a proper solution for that, but that
>>> PCI_IRQ_VIRTUAL has to die ASAP.
>>
>> I'm willing to volunteer a bit of my time to clean this up, but I'd need
>> a bit more direction on what a proper solution would look like. The MSI
>> domain code is far from well documented nor is it easy to understand.
>
> Fair enough. I'm struggling with finding time to document that properly.
>
> I've not yet made my mind up what the best way forward for this is, but
> I have a few ideas which I want to explore deeper.
>
> But the most important question is right now on which architectures
> these switchtec virtual interrupt things are used.
>
> If it's used on any architecture which does not use hierarchical
> irqdomains for MSI (x86, arm, arm64, power64), then using irqdomains is
> obviously a non-starter unless falling back to a single interrupt would
> not be considered a regression :)

Well that's a hard question to answer. The switchtec hardware is a very
generic PCI switch that can technically be used on any architecture that
supports PCI. However, NTB is a very specialized use case and only a
handful of companies have attempted to use it for anything, as is. I
can't say I know for sure, but my gut says the vast majority (if not
all) are using x86. Maybe some are trying with arm64 or power64, but the
only architecture not in that list that I'd conceivably think someone
might care about down the road might be riscv.

Most other NTB hardware (specifically AMD and Intel) is built into x86
CPUs, so it should be fine with this restriction.

I personally expect it would be fine to add a dependency on
CONFIG_IRQ_DOMAIN_HIERARCHY to CONFIG_NTB_MSI. However, I've copied Doug
from GigaIO who's the only user I know that might have a better informed
opinion on this.

Logan

2021-11-30 20:28:09

by Jason Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Tue, Nov 30, 2021 at 08:48:03PM +0100, Thomas Gleixner wrote:
> Logan,
>
> On Tue, Nov 30 2021 at 12:21, Logan Gunthorpe wrote:
> > On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
> >> I'm way too tired to come up with a proper solution for that, but that
> >> PCI_IRQ_VIRTUAL has to die ASAP.
> >
> > I'm willing to volunteer a bit of my time to clean this up, but I'd need
> > a bit more direction on what a proper solution would look like. The MSI
> > domain code is far from well documented nor is it easy to understand.
>
> Fair enough. I'm struggling with finding time to document that properly.
>
> I've not yet made my mind up what the best way forward for this is, but
> I have a few ideas which I want to explore deeper.

I may have lost the plot in all of these patches, but I thought the
direction was moving toward the msi_domain_alloc_irqs() approach IDXD
demo'd here:

https://lore.kernel.org/kvm/162164243591.261970.3439987543338120797.stgit@djiang5-desk3.ch.intel.com/

I'd expect all the descriptor handling code in drivers/ntb/msi.c to
get wrapped in an irq_chip instead of inserting a single-use callback
to the pci core code's implementation:

void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
	if (entry->write_msi_msg)
		entry->write_msi_msg(entry, entry->write_msi_msg_data);

If this doesn't become an irq_chip what other way is there to properly
program the addr/data pair as drivers/ntb/msi.c is doing?

Jason

2021-11-30 21:23:54

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Tue, Nov 30 2021 at 16:28, Jason Gunthorpe wrote:
> On Tue, Nov 30, 2021 at 08:48:03PM +0100, Thomas Gleixner wrote:
>> On Tue, Nov 30 2021 at 12:21, Logan Gunthorpe wrote:
>> > On 2021-11-29 5:29 p.m., Thomas Gleixner wrote:
>> >> I'm way too tired to come up with a proper solution for that, but that
>> >> PCI_IRQ_VIRTUAL has to die ASAP.
>> >
>> > I'm willing to volunteer a bit of my time to clean this up, but I'd need
>> > a bit more direction on what a proper solution would look like. The MSI
>> > domain code is far from well documented nor is it easy to understand.
>>
>> Fair enough. I'm struggling with finding time to document that properly.
>>
>> I've not yet made my mind up what the best way forward for this is, but
>> I have a few ideas which I want to explore deeper.
>
> I may have lost the plot in all of these patches, but I thought the
> direction was moving toward the msi_domain_alloc_irqs() approach IDXD
> demo'd here:
>
> https://lore.kernel.org/kvm/162164243591.261970.3439987543338120797.stgit@djiang5-desk3.ch.intel.com/

Yes, that's something I have in mind. Though this patch series is not
really required to support IDXD, it makes things simpler.

The main point of this is to cure the VFIO issue of tearing down MSI-X
of passed through devices in order to expand the MSI-X vector space on
the host.

> I'd expect all the descriptor handling code in drivers/ntb/msi.c to
> get wrapped in an irq_chip instead of inserting a single-use callback
> to the pci core code's implementation:
>
> void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
> {
> 	if (entry->write_msi_msg)
> 		entry->write_msi_msg(entry, entry->write_msi_msg_data);
>
> If this doesn't become an irq_chip what other way is there to properly
> program the addr/data pair as drivers/ntb/msi.c is doing?

That's not the question. This surely will be a separate irq chip and a
separate irqdomain.

The real problem is where to store the MSI descriptors because the PCI
device has its own real PCI/MSI-X interrupts which means it still shares
the storage space.

IDXD is different in that regard because IDXD creates subdevices which
have their own struct device and they just store the MSI descriptors in
the msi data of that device.

I'm currently tending to partition the index space in the xarray:

0x00000000 - 0x0000ffff PCI/MSI-X
0x00010000 - 0x0001ffff NTB

which is feasible now with the range modifications and way simpler to do
with xarray than with the linked list.

Thanks,

tglx



2021-12-01 00:18:00

by Jason Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:

> > If this doesn't become an irq_chip what other way is there to properly
> > program the addr/data pair as drivers/ntb/msi.c is doing?
>
> That's not the question. This surely will be a separate irq chip and a
> separate irqdomain.

OK

> The real problem is where to store the MSI descriptors because the PCI
> device has its own real PCI/MSI-X interrupts which means it still shares
> the storage space.

Er.. I never realized that just looking at the patches :|

That is relevant to all real "IMS" users. IDXD escaped this because
it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
a messy artifact of VFIO, it should not be required to make the IRQ
layers work.

I don't think it makes sense that the msi_desc would point to a mdev,
the iommu layer consumes the msi_desc_to_dev(), it really should point
to the physical device that originates the message with a proper
iommu ops/data/etc.

> I'm currently tending to partition the index space in the xarray:
>
> 0x00000000 - 0x0000ffff PCI/MSI-X
> 0x00010000 - 0x0001ffff NTB

It is OK, with some xarray work it can be range allocating & reserving
so that the msi_domain_alloc_irqs() flows can carve out chunks of the
number space..

Another view is the msi_domain_alloc_irqs() flows should have their
own xarrays..

> which is feasible now with the range modifications and way simpler to do
> with xarray than with the linked list.

Indeed!

Regards,
Jason

2021-12-01 10:16:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Jason,

CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>> The real problem is where to store the MSI descriptors because the PCI
>> device has its own real PCI/MSI-X interrupts which means it still shares
>> the storage space.
>
> Er.. I never realized that just looking at the patches :|
>
> That is relevant to all real "IMS" users. IDXD escaped this because
> it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
> a messy artifact of VFIO, it should not be required to make the IRQ
> layers work.

> I don't think it makes sense that the msi_desc would point to a mdev,
> the iommu layer consumes the msi_desc_to_dev(), it really should point
> to the physical device that originates the message with a proper
> iommu ops/data/etc.

Looking at the device slices as subdevices with their own struct device
makes a lot of sense from the conceptual level. That makes it pretty
much obvious to manage the MSIs of those devices at this level, like we
do for any other device.

Whether mdev is the right encapsulation for these subdevices is an
orthogonal problem.

I surely agree that msi_desc::dev is an interesting question, but we
already have this disconnect of msi_desc::dev and DMA today due to DMA
aliasing. I haven't looked at that in detail yet, but of course the
alias handling is substantially different across the various IOMMU
implementations.

Though I fear there is also a use case for MSI-X and IMS tied to the
same device. That network card you are talking about might end up using
MSI-X for a control block and then IMS for the actual network queues
when it is used as a physical function device as a whole, but that's
conceptually a different case.

>> I'm currently tending to partition the index space in the xarray:
>>
>> 0x00000000 - 0x0000ffff PCI/MSI-X
>> 0x00010000 - 0x0001ffff NTB
>
> It is OK, with some xarray work it can be range allocating & reserving
> so that the msi_domain_alloc_irqs() flows can carve out chunks of the
> number space..
>
> Another view is the msi_domain_alloc_irqs() flows should have their
> own xarrays..

Yes, I was thinking about that as well. The trivial way would be:

struct xarray store[MSI_MAX_STORES];

and then have a store index for each allocation domain. With the
proposed encapsulation of the xarray handling that's definitely
feasible. Whether that buys much is a different question. Let me think
about it some more.
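The trivial variant could look like this (editor's kernel-only sketch, not compilable stand-alone; MSI_MAX_STORES, msi_device_data's layout and the storeid parameter are made-up names, not settled API):

```c
/* Editor's sketch, kernel-only: names are illustrative. */
#define MSI_MAX_STORES	2	/* e.g. 0: PCI/MSI-X, 1: device-specific IMS */

struct msi_device_data {
	/* ... existing per-device MSI state ... */
	struct xarray		store[MSI_MAX_STORES];
};

/* Each allocation domain gets its own xarray instead of a carved-out
 * range in one shared index space.
 */
static int msi_insert_desc(struct msi_device_data *md, unsigned int storeid,
			   unsigned long index, struct msi_desc *desc)
{
	if (storeid >= MSI_MAX_STORES)
		return -EINVAL;
	return xa_insert(&md->store[storeid], index, desc, GFP_KERNEL);
}
```

Whether separate stores buy anything over range-partitioning a single xarray is exactly the open question here.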

>> which is feasible now with the range modifications and way simpler to do
>> with xarray than with the linked list.
>
> Indeed!

I'm glad you like the approach.

Thanks,

tglx



2021-12-01 13:00:32

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:

> Looking at the device slices as subdevices with their own struct device
> makes a lot of sense from the conceptual level.

Except IMS is not just for subdevices, it should be usable for any
driver in any case as a general interrupt mechanism, as you alluded to
below about ethernet queues. ntb seems to be the current example of
this need..

If it works properly on the physical PCI dev there is no reason to try
to also make it work on the mdev and add complexity in iommu land.

> Though I fear there is also a use case for MSI-X and IMS tied to the
> same device. That network card you are talking about might end up using
> MSI-X for a control block and then IMS for the actual network queues
> when it is used as physical function device as a whole, but that's
> conceptually a different case.

Is it? Why?

IDXD is not so much making device "slices", but virtualizing and
sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
described and the VFIO driver is simply allocating queues from a PCI
device for specific usages and assigning them interrupts.

There is already a char dev interface that equally allocates queues
from the same IDXD device, why shouldn't it be able to access IMS
interrupt pools too?

IMHO a well designed IDXD driver should put all the PCI MMIO, queue
management, interrupts, etc. in the PCI driver layer, and the VFIO
driver layer should only deal with the MMIO trapping and VFIO
interfacing.

From this view it is conceptually wrong to have the VFIO driver
directly talking to MMIO registers in the PCI device or owning the
irq_chip. It would be very odd for the PCI driver to allocate
interrupts from some random external struct device when it is creating
queues on the PCI device.

> > Another view is the msi_domain_alloc_irqs() flows should have their
> > own xarrays..
>
> Yes, I was thinking about that as well. The trivial way would be:
>
> struct xarray store[MSI_MAX_STORES];
>
> and then have a store index for each allocation domain. With the
> proposed encapsulation of the xarray handling that's definitely
> feasible. Whether that buys much is a different question. Let me think
> about it some more.

Any possibility that the 'IMS' xarray could be outside the struct
device?

Thanks,
Jason

2021-12-01 14:52:18

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Tue, Nov 30 2021 at 22:23, Thomas Gleixner wrote:
> On Tue, Nov 30 2021 at 16:28, Jason Gunthorpe wrote:
>
> The real problem is where to store the MSI descriptors because the PCI
> device has its own real PCI/MSI-X interrupts which means it still shares
> the storage space.

Bah. I confused myself by staring at the existing code instead of
looking at how this NTB stuff actually works.

So if I understand it correctly then the end result looks like this:

1) PCIe device (switchtec)

The device has 4 MSI[X] interrupts: event, dma_rpc, message,
doorbell. The event and dma_rpc interrupts are requested by the
switchtec PCI driver itself.

2) Switchtec character device

The switchtec PCI driver creates a character device which is exposed
for device specific IOCTLs

The device belongs to the switchtec_class device class.

3) Switchtec NTB device

The ntb_hw_switchtec driver registers the switchtec_class class
interface.

So when #2 is registered with the driver core the switchtec class
interface add_dev() function is invoked. That function creates a NTB
device, requests the message and the doorbell interrupts which have
been allocated by the underlying PCIe device driver (#1) and
registers the NTB device with the NTB core.

4) The NTB core then tries to use the virtual MSI vectors which have
been allocated by the switchtec driver in #1 and requires the msg
write intercept to actually expose it to the peers.

So we really can go and create a MSI irqdomain and stick the pointer
into stdev->dev.irqdomain. The parent domain of this irqdomain is

stdev->pdev.dev.irqdomain->parent

which is either the irq remapping domain or the vector domain. Which is
pretty much what I proposed as general facility for IMS/IDXD. I need to
go back and polish that up on top of the current pile.

Along with that have an irq chip implementation which exposes:

static struct irq_chip ntb_chip = {
	.name			= "ntb",
	.irq_ack		= irq_chip_ack_parent,
	.irq_write_msi_msg	= ntb_msi_write_msg,
#ifdef CONFIG_SMP
	.irq_set_affinity	= irq_chip_set_affinity_parent,
#endif
};
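Hooking that chip up might then look roughly like this (editor's kernel-only sketch; the msi_domain_info flags, function name and error handling are guesses, not a worked-out implementation):

```c
/* Editor's sketch, kernel-only: details are illustrative. */
static struct msi_domain_info ntb_msi_domain_info = {
	.flags	= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
	.chip	= &ntb_chip,
};

static int switchtec_create_ntb_msi_domain(struct switchtec_dev *stdev)
{
	/* Parent is the remapping or vector domain above the PCIe device */
	struct irq_domain *parent = dev_get_msi_domain(&stdev->pdev->dev)->parent;
	struct irq_domain *domain;

	domain = msi_create_irq_domain(dev_fwnode(&stdev->dev),
				       &ntb_msi_domain_info, parent);
	if (!domain)
		return -ENOMEM;

	/* Stick the pointer into stdev->dev.irqdomain */
	dev_set_msi_domain(&stdev->dev, domain);
	return 0;
}
```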

We just need some reasonable solution for the DMA/remap problem Jason
mentioned vs. msi_desc::dev, but that wants to be cleaned up in any
case for all the aliasing muck.

Thanks,

tglx



2021-12-01 15:11:29

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Wed, Dec 01, 2021 at 03:52:02PM +0100, Thomas Gleixner wrote:
> So we really can go and create a MSI irqdomain and stick the pointer
> into stdev->dev.irqdomain. The parent domain of this irqdomain is
>
> stdev->pdev.dev.irqdomain->parent

It can work (pending some solution to the iommu stuff), but IMHO it is
strange/hacky to put HW objects like irqdomain on what is a character
struct device with a set major/minor in dev->devt and associated
struct cdev.

Conceptually it makes no sense to me, cdevs are software constructs,
they should never go into HW areas..

Jason

2021-12-01 16:29:06

by Dave Jiang

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()


On 12/1/2021 3:16 AM, Thomas Gleixner wrote:
> Jason,
>
> CC+ IOMMU folks
>
> On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
>> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>>> The real problem is where to store the MSI descriptors because the PCI
>>> device has its own real PCI/MSI-X interrupts which means it still shares
>>> the storage space.
>> Er.. I never realized that just looking at the patches :|
>>
>> That is relevant to all real "IMS" users. IDXD escaped this because
>> it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
>> a messy artifact of VFIO, it should not be required to make the IRQ
>> layers work.
>> I don't think it makes sense that the msi_desc would point to a mdev,
>> the iommu layer consumes the msi_desc_to_dev(), it really should point
>> to the physical device that originates the message with a proper
>> iommu ops/data/etc.
> Looking at the device slices as subdevices with their own struct device
> makes a lot of sense from the conceptual level. That makes it pretty
> much obvious to manage the MSIs of those devices at this level like we
> do for any other device.
>
> Whether mdev is the right encapsulation for these subdevices is an
> orthogonal problem.
>
> I surely agree that msi_desc::dev is an interesting question, but we
> already have this disconnect of msi_desc::dev and DMA today due to DMA
> aliasing. I haven't looked at that in detail yet, but of course the
> alias handling is substantially different across the various IOMMU
> implementations.
>
> Though I fear there is also a use case for MSI-X and IMS tied to the
> same device. That network card you are talking about might end up using
> MSI-X for a control block and then IMS for the actual network queues
> when it is used as physical function device as a whole, but that's
> conceptually a different case.

Hi Thomas. This is actually the IDXD usage for a mediated device passed
to a guest kernel when we plumb the pass-through of IMS to the guest,
rather than the previous implementation of having an MSIX vector on the
guest backed by IMS. The control block for the mediated device is
emulated, and therefore an emulated MSIX vector will be surfaced as
vector 0. However, the queues will be backed by IMS vectors. So we end
up needing MSIX and IMS to coexist on the guest kernel for the same
device.

DJ

>
>>> I'm currently tending to partition the index space in the xarray:
>>>
>>> 0x00000000 - 0x0000ffff PCI/MSI-X
>>> 0x00010000 - 0x0001ffff NTB
>> It is OK, with some xarray work it can be range allocating & reserving
>> so that the msi_domain_alloc_irqs() flows can carve out chunks of the
>> number space..
>>
>> Another view is the msi_domain_alloc_irqs() flows should have their
>> own xarrays..
> Yes, I was thinking about that as well. The trivial way would be:
>
> struct xarray store[MSI_MAX_STORES];
>
> and then have a store index for each allocation domain. With the
> proposed encapsulation of the xarray handling that's definitely
> feasible. Whether that buys much is a different question. Let me think
> about it some more.
>
>>> which is feasible now with the range modifications and way simpler to do
>>> with xarray than with the linked list.
>> Indeed!
> I'm glad you like the approach.
>
> Thanks,
>
> tglx
>
>

2021-12-01 17:35:45

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
>> Looking at the device slices as subdevices with their own struct device
>> makes a lot of sense from the conceptual level.
>
> Except IMS is not just for subdevices, it should be usable for any
> driver in any case as a general interrupt mechanism, as you alluded to
> below about ethernet queues. ntb seems to be the current example of
> this need..

But NTB is operating through an abstraction layer and is not a direct
PCIe device driver.

> IDXD is not so much making device "slices", but virtualizing and
> sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
> described and the VFIO driver is simply allocating queues from a PCI
> device for specific usages and assigning them interrupts.

Right.

But what is the representation for that resulting device? Some VFIO
specific homebrewn muck or something which is based on struct device?

Right now with VF passthrough, I can see the interrupts which are
associated to the device.

How is that going to be with something which is just made up? Does that
expose its own msi properties then somewhere hidden in the VFIO layer?

See below.

> There is already a char dev interface that equally allocates queues
> from the same IDXD device, why shouldn't it be able to access IMS
> interrupt pools too?

Why wouldn't it be able to do so?

> IMHO a well designed IDXD driver should put all the PCI MMIO, queue
> management, interrupts, etc. in the PCI driver layer, and the VFIO
> driver layer should only deal with the MMIO trapping and VFIO
> interfacing.
>
> From this view it is conceptually wrong to have the VFIO driver
> directly talking to MMIO registers in the PCI device or owning the
> irq_chip.

The VFIO driver does not own the irq chip ever. The irq chip is of
course part of the underlying infrastructure. I never asked for that.

PCIe driver

	Owns the PCI/MSI[x] interrupts for the control block.

	Has a control mechanism which handles queues or whatever the
	device is partitioned into; that's what I called a slice.

	The irqdomain is part of that control mechanism and the irqchip
	implementation belongs to that as well. It has to, because the
	message store is device specific.

	Whether the domain and chip implementation is in a separate
	drivers/irqchip/foo.c file for sharing or directly built into the
	PCIe driver itself does not matter.

	When it allocates a slice for whatever usage then it also
	allocates the IMS interrupts (though the VFIO people want to
	have only one and do the allocations later on demand).

	That allocation cannot be part of the PCI/MSIx interrupt
	domain, as we already agreed on.

We have several problems to solve. Let me look at it from both points of
view:

1) Storage

   A) Having "subdevices" solves the storage problem nicely and
      makes everything just fall in place. Even for a purely
      physical multiqueue device one can argue that each queue is a
      "subdevice" of the physical device. The fact that we lump them
      all together today is not an argument against that.

   B) Requires extra storage in the PCIe device and extra storage
      per subdevice/queue to keep track of the interrupts which
      are associated to it.

2) Exposure of VFIO interrupts via sysfs

   A) Just works

   B) Requires extra mechanisms to expose it

3) On demand expansion of the vectors for VFIO

   A) Just works because the device has an irqdomain assigned.

   B) Requires extra indirections to do that

4) IOMMU msi_desc::dev

   A) I think that's reasonably simple to address by having the
      relationship to the underlying PCIe device stored in struct
      device, which is not really adding any complexity to the IOMMU
      code.

      Quite the contrary, it could be used to make the DMA aliasing a
      problem of device setup time and not a lookup per interrupt
      allocation in the IOMMU code.

   B) No change required.

5) PASID

   While an Intel IDXD specific issue, it wants to be solved
   without any nasty hacks.

   A) Having a "subdevice" allows to associate the PASID with the
      underlying struct device, which makes IOMMU integration trivial.

   B) Needs some other custom hackery to get that solved.

So both variants come with their ups and downs.

IMO A) is the right thing to do when looking at all the involved moving
pieces.

> It would be very odd for the PCI driver to allocate interrupts from
> some random external struct device when it is creating queues on the
> PCI device.

No. That's not what I want.

>> and then have a store index for each allocation domain. With the
>> proposed encapsulation of the xarray handling that's definitely
>> feasible. Whether that buys much is a different question. Let me think
>> about it some more.
>
> Any possibility that the 'IMS' xarray could be outside the struct
> device?

We could, but we really want to keep things tied to devices which is the
right thing to do.

Thanks,

tglx

2021-12-01 18:14:51

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
> > On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
> >> Looking at the device slices as subdevices with their own struct device
> >> makes a lot of sense from the conceptual level.
> >
> > Except IMS is not just for subdevices, it should be usable for any
> > driver in any case as a general interrupt mechanism, as you alluded to
> > below about ethernet queues. ntb seems to be the current example of
> > this need..
>
> But NTB is operating through an abstraction layer and is not a direct
> PCIe device driver.

I'm not sure exactly how NTB seems to be split between switchtec and
the ntb code, but since the ntb code seems to be doing MMIO touches,
it feels like part of a PCIe driver?

> > IDXD is not so much making device "slices", but virtualizing and
> > sharing a PCI device. The IDXD hardware is multi-queue like the NIC I
> > described and the VFIO driver is simply allocating queues from a PCI
> > device for specific usages and assigning them interrupts.
>
> Right.
>
> But what is the representation for that resulting device? Some VFIO
> specific homebrewn muck or something which is based on struct device?

Why is there a device? A queue is a queue, not a device.

If the task is to make some struct device (eg mdev, or cdev, or
whatever) then queues may be allocated to do this, but the queue is
logically a resource handed out by the PCIe driver and there should
not be a requirement to have an external struct device just to create
a queue.

> Right now with VF passthrough, I can see the interrupts which are
> associated to the device.
>
> How is that going to be with something which is just made up? Does that
> expose its own msi properties then somewhere hidden in the VFIO layer?

For sysfs, I think all interrupts should be on the PCI directory.

> > There is already a char dev interface that equally allocates queues
> > from the same IDXD device, why shouldn't it be able to access IMS
> > interrupt pools too?
>
> Why wouldn't it be able to do so?

The only 'struct device' there is a cdev and I really don't think
cdevs should have interrupts. It is a bit hacky as an in-kernel thing
and downright wrong as a sysfs ABI.

> The VFIO driver does not own the irq chip ever. The irq chip is of
> course part of the underlying infrastructure. I never asked for that.

That isn't quite what I meant.. I meant the PCIe driver cannot create
the domain or make use of the irq_chip until the VFIO layer comes
along and provides the struct device. To me this is backwards
layering, the interrupts come from the PCIe layer and should exist
independently from VFIO.

> When it allocates a slice for whatever usage then it also
> allocates the IMS interrupts (though the VFIO people want to
> have only one and do the allocations later on demand).
>
> That allocation cannot be part of the PCI/MSIx interrupt
> domain as we already agreed on.

Yes, it is just an open question of where the new irq_domain need to
reside

> 1) Storage
>
> A) Having "subdevices" solves the storage problem nicely and
> makes everything just fall in place. Even for a purely
> physical multiqueue device one can argue that each queue is a
> "subdevice" of the physical device. The fact that we lump them
> all together today is not an argument against that.

I don't like the idea that a queue is a device; that is trying to force
a struct device centric world onto a queue which doesn't really want
it..

> B) Requires extra storage in the PCIe device and extra storage
> per subdevice, queue to keep track of the interrupts which
> are associated to it.

Yes

> 2) Exposure of VFIO interrupts via sysfs
>
> A) Just works

I would say this is flawed: in sysfs I expect all the interrupts for
the PCIe device to be in the PCIe sysfs, not strewn over
subsystem-owned sub-directories.

For instance, today in mlx5, when a subdevice allocates a queue for a
slice (which is modeled as an aux device) the queue's assigned MSI-X
interrupt shows up on the PCIe sysfs, not the aux.

It should be uniform, if I assign a queue a legacy INT, MSI or an IMS
it should show in sysfs in the same way. Leaking this kernel
implementation detail as sysfs ABI does not seem good.

> 3) On demand expansion of the vectors for VFIO
>
> A) Just works because the device has an irqdomain assigned.
>
> B) Requires extra indirections to do that

Yes.

> 5) PASID
>
> While an Intel IDXD specific issue, it wants to be solved
> without any nasty hacks.
>
> A) Having a "subdevice" allows to associate the PASID with the
> underlying struct device which makes IOMMU integration trivial
>
> B) Needs some other custom hackery to get that solved

Yes

> > Any possibility that the 'IMS' xarray could be outside the struct
> > device?
>
> We could, but we really want to keep things tied to devices which is the
> right thing to do.

I see the sysfs issue makes this a poor idea as well, as where would
the sysfs live if there was no struct device?

I'm inclined to think either of your ideas with the xarray are good
directions, primarily because it keeps HW data out of non-HW struct
devices and maintains a consistent sysfs representation for all the
different interrupt allocation methods.

Regards,
Jason

2021-12-01 18:38:02

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Wed, Dec 01 2021 at 11:11, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 03:52:02PM +0100, Thomas Gleixner wrote:
>> So we really can go and create a MSI irqdomain and stick the pointer
>> into stdev->dev.irqdomain. The parent domain of this irqdomain is
>>
>> stdev->pdev.dev.irqdomain->parent
>
> It can work (pending some solution to the iommu stuff), but IMHO it is
> strange/hacky to put HW objects like irqdomain on what is a character
> struct device with a set major/minor in dev->devt and associated
> struct cdev.
>
> Conceptually it makes no sense to me, cdevs are software constructs,
> they should never go into HW areas..

I picked that because it _is_ already used to establish the connection
to the switchtec_class NTB driver which is beyond the usual cdev muck.

Thanks,

tglx

2021-12-01 18:41:13

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Dave,

please trim your replies.

On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote:

> On 12/1/2021 3:16 AM, Thomas Gleixner wrote:
>> Jason,
>>
>> CC+ IOMMU folks
>>
>> On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
>>> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>>
>> Though I fear there is also a use case for MSI-X and IMS tied to the
>> same device. That network card you are talking about might end up using
>> MSI-X for a control block and then IMS for the actual network queues
>> when it is used as physical function device as a whole, but that's
>> conceptually a different case.
>
> Hi Thomas. This is actually the IDXD usage for a mediated device passed
> to a guest kernel when we plumb the pass through of IMS to the guest
> rather than doing previous implementation of having a MSIX vector on
> guest backed by IMS.

Which makes a lot of sense.

> The control block for the mediated device is emulated and therefore an
> emulated MSIX vector will be surfaced as vector 0. However the queues
> will be backed by IMS vectors. So we end up needing MSIX and IMS to
> coexist on the guest kernel for the same device.

Why? What's wrong with using straight MSI-X for all of them?

Thanks,

tglx

2021-12-01 18:46:50

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()



On 2021-12-01 11:14 a.m., 'Jason Gunthorpe' via linux-ntb wrote:
> On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
>>> On Wed, Dec 01, 2021 at 11:16:47AM +0100, Thomas Gleixner wrote:
>>>> Looking at the device slices as subdevices with their own struct device
>>>> makes a lot of sense from the conceptual level.
>>>
>>> Except IMS is not just for subdevices, it should be usable for any
>>> driver in any case as a general interrupt mechanism, as you alluded to
>>> below about ethernet queues. ntb seems to be the current example of
>>> this need..
>>
>> But NTB is operating through an abstraction layer and is not a direct
>> PCIe device driver.
>
> I'm not sure exactly how NTB seems to be split between switchtec and
> the ntb code, but since the ntb code seems to be doing MMIO touches,
> it feels like part of a PCIe driver?

Eh, sort of. NTB has lots of layers. At the top you'll find ntb_netdev
which is an network interface. Below that is ntb_transport() which is a
generic queue pair. Below that is the hardware driver itself (ie
switchtec) through the abstraction layer. The switchtec driver is split
in two: the main driver which just allows for information and
administration of the switch itself and switchtec_ntb which is the
module that provides the NTB abstractions to twiddle its registers.

ntb_transport() doesn't directly do MMIO touches (as it doesn't know
what the underlying hardware registers are). Except for the memory
windows which are usually setup to be a specific BAR (or parts of a
BAR). ntb_transport will do MMIO writes to those specific BAR address
which correspond to writing into buffers on the peer.

Logan

2021-12-01 18:47:31

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Wed, Dec 01, 2021 at 07:37:32PM +0100, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 11:11, Jason Gunthorpe wrote:
> > On Wed, Dec 01, 2021 at 03:52:02PM +0100, Thomas Gleixner wrote:
> >> So we really can go and create a MSI irqdomain and stick the pointer
> >> into stdev->dev.irqdomain. The parent domain of this irqdomain is
> >>
> >> stdev->pdev.dev.irqdomain->parent
> >
> > It can work (pending some solution to the iommu stuff), but IMHO it is
> > strange/hacky to put HW objects like irqdomain on what is a character
> > struct device with a set major/minor in dev->devt and associated
> > struct cdev.
> >
> > Conceptually it makes no sense to me, cdevs are software constructs,
> > they should never go into HW areas..
>
> I picked that because it _is_ already used to establish the connection
> to the switchtec_class NTB driver which is beyond the usual cdev muck.

IMHO that is also a misuse. These days two drivers should be hooked
together using an aux device, not a cdev and the obscure
class_interface stuff. An aux device supports auto-probing and module
auto-loading, for instance.

An interrupt on an aux device is at least somewhat conceptually
parallel to an interrupt on a mdev as both are usually representing
some slice of a device.

Thanks,
Jason

2021-12-01 18:47:57

by Dave Jiang

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()


On 12/1/2021 11:41 AM, Thomas Gleixner wrote:
> Dave,
>
> please trim your replies.
>
> On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote:
>
>> On 12/1/2021 3:16 AM, Thomas Gleixner wrote:
>>> Jason,
>>>
>>> CC+ IOMMU folks
>>>
>>> On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
>>>> On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:
>>> Though I fear there is also a use case for MSI-X and IMS tied to the
>>> same device. That network card you are talking about might end up using
>>> MSI-X for a control block and then IMS for the actual network queues
>>> when it is used as physical function device as a whole, but that's
>>> conceptually a different case.
>> Hi Thomas. This is actually the IDXD usage for a mediated device passed
>> to a guest kernel when we plumb the pass through of IMS to the guest
>> rather than doing previous implementation of having a MSIX vector on
>> guest backed by IMS.
> Which makes a lot of sense.
>
>> The control block for the mediated device is emulated and therefore an
>> emulated MSIX vector will be surfaced as vector 0. However the queues
>> will be backed by IMS vectors. So we end up needing MSIX and IMS to
>> coexist on the guest kernel for the same device.
> Why? What's wrong with using straight MSI-X for all of them?

The hardware implementation does not have enough MSIX vectors for
guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
vectors. So if we are to do MSI-X for all of them, then we need to do
the IMS-backed MSIX scheme rather than passing IMS through to the guests.


>
> Thanks,
>
> tglx
>

2021-12-01 20:21:36

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Jason,

On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 06:35:35PM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 09:00, Jason Gunthorpe wrote:
>> But NTB is operating through an abstraction layer and is not a direct
>> PCIe device driver.
>
> I'm not sure exactly how NTB seems to be split between switchtec and
> the ntb code, but since the ntb code seems to be doing MMIO touches,
> it feels like part of a PCIe driver?

It's a maze: a magic PCIe driver with callbacks left and right into the
underlying NTB hardware driver, which itself does some stuff on its own
and also calls back into the underlying PCIe driver.

Decomposing that thing feels like being trapped in the Labyrinth of
Knossos. But let's ignore that for a moment.

>> The VFIO driver does not own the irq chip ever. The irq chip is of
>> course part of the underlying infrastructure. I never asked for that.
>
> That isn't quite what I meant.. I meant the PCIe driver cannot create
> the domain or make use of the irq_chip until the VFIO layer comes
> along and provides the struct device. To me this is backwards
> layering, the interrupts come from the PCIe layer and should exist
> independently from VFIO.

See below.

>> When it allocates a slice for whatever usage then it also
>> allocates the IMS interrupts (though the VFIO people want to
>> have only one and do the allocations later on demand).
>>
>> That allocation cannot be part of the PCI/MSIx interrupt
>> domain as we already agreed on.
>
> Yes, it is just an open question of where the new irq_domain need to
> reside

The irqdomain is created by and part of the physical device. But that
does not mean that the interrupts which are allocated from that irq
domain are stored in the physical device representation.

Going by that logic, the PCI/MSI domain would store all MSI[X]
interrupts which are allocated on some root bridge or wherever.

They are obviously stored per PCI device, but the irqdomain is owned
e.g. by the underlying IOMMU zone.

The irqdomain is managing and handing out resources. Like any other
resource manager does.

See?

>> 1) Storage
>>
>> A) Having "subdevices" solves the storage problem nicely and
>> makes everything just fall in place. Even for a purely
>> physical multiqueue device one can argue that each queue is a
>> "subdevice" of the physical device. The fact that we lump them
>> all together today is not an argument against that.
>
> I don't like the idea that queue is a device, that is trying to force
> a struct device centric world onto a queue which doesn't really want
> it..

Here we are at the point where we agree to disagree.

Of course a queue is a resource, but what prevents us from representing a
queue as a carved-out subdevice of the physical device?

Look at it from the VF point of view. If VFs are disabled then all
resources belong to the physical device. If VFs are enabled then the
hardware/firmware splits the resources into separate subdevices which
have their own device representation, interrupt storage etc.

As VFs have a scalability limitation due to the underlying PCIe
restrictions the step we are talking about now is to split the queues up
in software which means nothing else than creating a software
representation of finer grained and more scalable subdevices.

So why would we want to pretend that these are not devices?

They are from a conceptual and topology view a subdevice of the physical
device. Just because they are named queues does not make it any
different.

Let's look at VFIO again. If VFIO passes a VF through then it builds a
wrapper around the VF device.

So if a queue is represented as a subdevice, then VFIO can just build
a wrapper around that subdevice.

But with your model, VFIO has to create a device, request a queue,
wrestle the interrupts in place, etc. Which is exactly the opposite of
how VFs are handled.

So again, why would we want to make software managed subdevices look
exactly the opposite way like hardware/firmware managed subdevices?

Let me also look at the cdev which is exposed by the physical device.

The cdev is nothing else than a software vehicle to create a /dev/
node. (ignore the switchtec case which (ab)uses the cdev to connect to
NTB). The cdev allows user space to allocate/release a resource.

Of course it can just allocate a queue and stick the necessary
interrupts into the physical device MSI descriptor storage.

But there is no reason why the queue allocation cannot allocate a
subdevice representing the queue.

That makes the VFIO and the cdev case just using the same underlying
representation which can expose its properties via the underlying
device.

Which in turn is consistent all over the place and does not require any
special case for anything. Neither for interrupts nor for anything else.

Thanks,

tglx




2021-12-01 20:26:10

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote:
> On 12/1/2021 11:41 AM, Thomas Gleixner wrote:
>>> Hi Thomas. This is actually the IDXD usage for a mediated device passed
>>> to a guest kernel when we plumb the pass through of IMS to the guest
>>> rather than doing previous implementation of having a MSIX vector on
>>> guest backed by IMS.
>> Which makes a lot of sense.
>>
>>> The control block for the mediated device is emulated and therefore an
>>> emulated MSIX vector will be surfaced as vector 0. However the queues
>>> will backed by IMS vectors. So we end up needing MSIX and IMS coexist
>>> running on the guest kernel for the same device.
>> Why? What's wrong with using straight MSI-X for all of them?
>
> The hardware implementation does not have enough MSIX vectors for
> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
> vectors. So if we are to do MSI-X for all of them, then we need to do
> the IMS backed MSIX scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest
or about passing a carved out subdevice, aka. queue?

Thanks,

tglx

2021-12-01 20:27:34

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Wed, Dec 01 2021 at 14:47, Jason Gunthorpe wrote:
> On Wed, Dec 01, 2021 at 07:37:32PM +0100, Thomas Gleixner wrote:
>> I picked that because it _is_ already used to establish the connection
>> to the switchtec_class NTB driver which is beyond the usual cdev muck.
>
> IMHO that is also a misuse. These days two drivers should be hooked
> together using an aux device, not a cdev and the obscure
> class_interface stuff. Aux device supports auto probing and module
> auto loading for instance.
>
> An interrupt on an aux device is at least somewhat conceptually
> parallel to an interrupt on a mdev as both are usually representing
> some slice of a device.

No argument about that.

2021-12-01 21:21:26

by Dave Jiang

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()


On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote:
>> On 12/1/2021 11:41 AM, Thomas Gleixner wrote:
>>>> Hi Thomas. This is actually the IDXD usage for a mediated device passed
>>>> to a guest kernel when we plumb the pass through of IMS to the guest
>>>> rather than doing previous implementation of having a MSIX vector on
>>>> guest backed by IMS.
>>> Which makes a lot of sense.
>>>
>>>> The control block for the mediated device is emulated and therefore an
>>>> emulated MSIX vector will be surfaced as vector 0. However the queues
>>>> will backed by IMS vectors. So we end up needing MSIX and IMS coexist
>>>> running on the guest kernel for the same device.
>>> Why? What's wrong with using straight MSI-X for all of them?
>> The hardware implementation does not have enough MSIX vectors for
>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
>> vectors. So if we are to do MSI-X for all of them, then we need to do
>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
> Confused. Are you talking about passing a full IDXD device to the guest
> or about passing a carved out subdevice, aka. queue?

I'm talking about carving out a subdevice. I had the impression of you
wanting IMS passed through for all variations. But it sounds like for a
sub-device, you are ok with the implementation of MSIX backed by IMS?


>
> Thanks,
>
> tglx
>

2021-12-01 21:46:23

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
> On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
>>> The hardware implementation does not have enough MSIX vectors for
>>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
>>> vectors. So if we are to do MSI-X for all of them, then we need to do
>>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
>> Confused. Are you talking about passing a full IDXD device to the guest
>> or about passing a carved out subdevice, aka. queue?
>
> I'm talking about carving out a subdevice. I had the impression of you
> wanting IMS passed through for all variations. But it sounds like for a
> sub-device, you are ok with the implementation of MSIX backed by IMS?

I don't see anything wrong with that. A subdevice is its own entity and
VFIO can obviously choose the most convenient representation of it for
the guest.

How that is backed on the host does not really matter. You can expose
MSI-X to the guest with an INTx backing as well.

I'm still failing to see the connection between the 9 MSIX vectors and
the 2048 IMS vectors which I assume that this is the limitation of the
physical device, right?

What needs a subdevice to expose?

Thanks,

tglx




2021-12-01 21:49:53

by Dave Jiang

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()


On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
>> On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
>>>> The hardware implementation does not have enough MSIX vectors for
>>>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
>>>> vectors. So if we are to do MSI-X for all of them, then we need to do
>>>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
>>> Confused. Are you talking about passing a full IDXD device to the guest
>>> or about passing a carved out subdevice, aka. queue?
>> I'm talking about carving out a subdevice. I had the impression of you
>> wanting IMS passed through for all variations. But it sounds like for a
>> sub-device, you are ok with the implementation of MSIX backed by IMS?
> I don't see anything wrong with that. A subdevice is it's own entity and
> VFIO can chose the most conveniant representation of it to the guest
> obviously.
>
> How that is backed on the host does not really matter. You can expose
> MSI-X to the guest with a INTx backing as well.
>
> I'm still failing to see the connection between the 9 MSIX vectors and
> the 2048 IMS vectors which I assume that this is the limitation of the
> physical device, right?

I think I was confused by what you were asking: I thought you were
saying we should just have MSI-X on the guest backed by MSI-X on the
physical device, and that there would not be enough vectors to service
the many guests. I think I understand your position now with the
clarification above.


>
> What needs a subdevice to expose?
>
> Thanks,
>
> tglx
>
>
>

2021-12-01 22:03:15

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:
> On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
>> How that is backed on the host does not really matter. You can expose
>> MSI-X to the guest with a INTx backing as well.
>>
>> I'm still failing to see the connection between the 9 MSIX vectors and
>> the 2048 IMS vectors which I assume that this is the limitation of the
>> physical device, right?
>
> I think I was confused with what you were asking and was thinking you
> are saying why can't we just have MSIX on guest backed by the MSIX on
> the physical device and thought there would not be enough vectors to
> service the many guests. I think I understand what your position is now
> with the clarification above.

This still depends on how this overall discussion about representation
of all of this stuff is resolved.

>> What needs a subdevice to expose?

Can you answer that too please?

Thanks,

tglx

2021-12-01 22:55:14

by Dave Jiang

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()


On 12/1/2021 3:03 PM, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:
>> On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
>>> How that is backed on the host does not really matter. You can expose
>>> MSI-X to the guest with a INTx backing as well.
>>>
>>> I'm still failing to see the connection between the 9 MSIX vectors and
>>> the 2048 IMS vectors which I assume that this is the limitation of the
>>> physical device, right?
>> I think I was confused with what you were asking and was thinking you
>> are saying why can't we just have MSIX on guest backed by the MSIX on
>> the physical device and thought there would not be enough vectors to
>> service the many guests. I think I understand what your position is now
>> with the clarification above.
> This still depends on how this overall discussion about representation
> of all of this stuff is resolved.
>
>>> What needs a subdevice to expose?
> Can you answer that too please?

Sorry. So the initial version of the IDXD sub-device is represented with a
single queue. It needs a command irq (emulated) and a completion irq
that is backed by a device vector (currently IMS).


>
> Thanks,
>
> tglx

2021-12-01 23:57:18

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

dave,

On Wed, Dec 01 2021 at 15:53, Dave Jiang wrote:
> On 12/1/2021 3:03 PM, Thomas Gleixner wrote:
>> This still depends on how this overall discussion about representation
>> of all of this stuff is resolved.
>>
>>>> What needs a subdevice to expose?
>> Can you answer that too please?
>
> Sorry. So initial version of the IDXD sub-device is represented with a
> single queue. It needs a command irq (emulated) and a completion irq
> that is backed by a device vector (currently IMS).

Thanks for the clarification! Let me think about it some more.

tglx

2021-12-02 00:03:46

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Jason,

On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> Which in turn is consistent all over the place and does not require any
> special case for anything. Neither for interrupts nor for anything else.

that said, feel free to tell me that I'm getting it all wrong.

The reason I'm harping on this is that we are creating ABIs on several
ends and we all know that getting that wrong is a major pain.

Thanks,

tglx

2021-12-02 13:55:11

by Jason Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
> Jason,
>
> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> > Which in turn is consistent all over the place and does not require any
> > special case for anything. Neither for interrupts nor for anything else.
>
> that said, feel free to tell me that I'm getting it all wrong.
>
> The reason I'm harping on this is that we are creating ABIs on several
> ends and we all know that getting that wrong is a major pain.

I don't really like coupling the method to fetch IRQs with needing
special struct devices. Struct devices have a sysfs presence and it is
not always appropriate to create sysfs stuff just to allocate some
IRQs.

A queue is simply not a device, it doesn't make any sense. A queue is
more like a socket().

That said, we often have enough struct devices floating about to make
this work. Between netdev/ib_device/aux device/mdev we can use them to
do this.

I think it is conceptual nonsense to attach an IMS IRQ domain to a
netdev or a cdev, but it will solve this problem.

However, I would really prefer that there be no uAPI here.

I looked at the msi_irqs/ stuff and could not find a user. Debian code
search found nothing, and Greg redid the entire uAPI in 2013
(1c51b50c2995), so I think it is just dead. (Maybe delete it?)

So lets not create any sysfs for IMS with the msi_irqs/ dir. We can
revise the in-kernel mechanism someday if it turns out to be a problem.

As to your question:

> So again, why would we want to make software managed subdevices look
> exactly the opposite way like hardware/firmware managed subdevices?

That isn't my thinking at all.

Something like mlx5 has a hw/fw managed VF and there is an RPC call
from driver to device to 'create a queue'. The device has no hard
division inside along a VF, the device simply checks resource limits
and security properties and returns one of the >>10k queues. Again
think more like socket() than a hard partitioning.

It is the same as I suggest for IDXD & VFIO where the PCIe IDXD layer
takes the place of hw/fw and has a 'create a queue' API call for the
VFIO layer to use. Instead of using a VF as the security identity, it
uses a PASID.

This is a logical partitioning and it matches the partitioning we'd have
if it was a real device.

> So if a queue is represented as a subdevice, then VFIO can just build
> a wrapper around that subdevice.

I think that oversimplifies the picture.

IDXD is a multi queue device that uses PASID as a security context. It
has a cdev /dev/idxd interface where userspace can use an IOCTL and
get a queue to use. The queue is bound to a PASID that is linked to an
IO Page table that mirrors the process page table. Userspace operates
the queue and does whatever with it.

VFIO is just another interface that should logically be considered a
peer of the cdev. Using VFIO userspace can get a queue, bind it to a
PASID and operate it. The primary difference between the cdev and the
VFIO mdev is user programming API - VFIO uses IOCTLs that carry
emulated MMIO read/write operations.

I consider *neither* to be a subdevice. They are just a user API,
however convoluted, to create a queue, associate it with a PASID
security context and allow userspace to operate the queue. It is much
closer to socket() than a PCI VF subdevice.

Internally the driver should be built so that the PCI driver is doing
all the device operations and the two uAPI layers are only concerned
with translating their respective uAPIs to the internal device API.

Further, there is no reason why IMS should be reserved exclusively for
VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
just a feature of the PCI device like MSI. If the queue has a PASID it
can use IDXD's IMS.

If we really need a 2nd struct device to turn on IMS then, I'd suggest
picking the cdev, as it keeps IMS and its allocator inside the IDXD
PCIe driver and not in the VFIO world.

Regards,
Jason

2021-12-02 14:23:51

by Greg Kroah-Hartman

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 02, 2021 at 09:55:02AM -0400, Jason Gunthorpe wrote:
> Further, there is no reason why IMS should be reserved exclusively for
> VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
> just a feature of the PCI device like MSI. If the queue has a PASID it
> can use IDXD's IMS.

No, sorry, but a cdev is not for anything resembling any real resource
at all.

It is ONLY for the /dev/NODE interface that controls the character
device api to userspace. The struct device involved in it is ONLY for
that, nothing else. Any attempt to add things to it will be gleefully
rejected.

The cdev api today (in the kernel) exposes too much mess and there's at
least 4 or 5 different ways to use it. It's on my long-term TODO list
to fix this up to not even allow abuses like you are considering here,
so please don't do that.

> If we really need a 2nd struct device to turn on IMS then, I'd suggest
> picking the cdev, as it keeps IMS and its allocator inside the IDXD
> PCIe driver and not in the VFIO world.

No! Again, a cdev is to control the lifespan/lifecycle of the /dev/NODE
only. Anything other than that is not ok to do at all.

thanks,

greg k-h

2021-12-02 14:45:55

by Jason Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 02, 2021 at 03:23:38PM +0100, Greg Kroah-Hartman wrote:
> On Thu, Dec 02, 2021 at 09:55:02AM -0400, Jason Gunthorpe wrote:
> > Further, there is no reason why IMS should be reserved exclusively for
> > VFIO! Why shouldn't the cdev be able to use IMS vectors too? It is
> > just a feature of the PCI device like MSI. If the queue has a PASID it
> > can use IDXD's IMS.
>
> No, sorry, but a cdev is not for anything resembling any real resource
> at all.

My point is that when the user asks the driver to allocate a queue
through a cdev ioctl it should be able to get the queue attached to an
IMS; today it can only get a queue attached to an MSI.

> It is ONLY for the /dev/NODE interface that controls the character
> device api to userspace. The struct device involved in it is ONLY for
> that, nothing else. Any attempt to add things to it will be gleefully
> rejected.

I agree with you!

> > If we really need a 2nd struct device to turn on IMS then, I'd suggest
> > picking the cdev, as it keeps IMS and its allocator inside the IDXD
> > PCIe driver and not in the VFIO world.
>
> No! Again, a cdev is to control the lifespan/lifecycle of the /dev/NODE
> only. Anything other than that is not ok to do at all.

I said the same thing in a prior email, which is why I think the only
logical choice here is to make IMS work on the pci_device.

FWIW I feel the same way about the VFIO mdev - its *ONLY* purpose is
to control the lifecycle and we are close to stripping away all the
other abuses using it for other things.

Jason

2021-12-02 19:25:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Jason,

On Thu, Dec 02 2021 at 09:55, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
>> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
>> > Which in turn is consistent all over the place and does not require any
>> > special case for anything. Neither for interrupts nor for anything else.
>>
>> that said, feel free to tell me that I'm getting it all wrong.
>>
>> The reason I'm harping on this is that we are creating ABIs on several
>> ends and we all know that getting that wrong is a major pain.
>
> I don't really like coupling the method to fetch IRQs with needing
> special struct devices. Struct devices have a sysfs presence and it is
> not always appropriate to create sysfs stuff just to allocate some
> IRQs.
>
> A queue is simply not a device, it doesn't make any sense. A queue is
> more like a socket().

Let's put the term 'device' aside for a bit, please.

> That said, we often have enough struct devices floating about to make
> this work. Between netdev/ib_device/aux device/mdev we can use them to
> do this.
>
> I think it is conceptual nonsense to attach an IMS IRQ domain to a
> netdev or a cdev, but it will solve this problem.

The IMS irqdomain is _NOT_ part of the netdev or cdev or whatever. I
explained that several times now.

We seem to have a serious problem of terminology and the understanding
of topology which is why we continue to talk past each other forever.

Let's take a big step back and clarify these things first.

An irq domain is an entity which has two parts:

1) The resource management part

2) An irqchip which knows how to access the actual hardware
implementation for the various operations related to irq chips
(mask, unmask, ack, eoi, ....)

irq domains are used for uniform resource spaces which have a uniform
hardware interrupt chip. Modern computer systems have several resource
levels which form a hierarchy of several irqdomain levels.

The root of that hierarchy is at the CPU level:

   |---------------|
   | CPU   |  BUS  |
   |       |-------|
   |       |  IRQ  |
   |---------------|

The CPU exposes the bus and a mechanism to raise interrupts in the CPU
from external components. Whether those interrupts are raised by wire or
by writing a message to a given address is an implementation detail.

So the root IRQ domain is the one which handles the CPU level interrupt
controller and manages the associated resources: The per CPU vector
space on x86, the global vector space on ARM64.

If there's an IOMMU in the system then the IOMMU is the next level
interrupt controller:

1) It manages the translation table resources

2) Provides an interrupt chip which knows how to access the
table entries

Of course there can be several IOMMUs depending on the system topology,
but from an irq domain view they are at the same level.

On top of that we have on x86 the next level irq domains:

IO/APIC, HPET and PCI/MSI

which again do resource management and provide an irqchip which
handles the underlying hardware implementation.

So the full hierarchy so far looks like this on X86

   VECTOR --|
            |-- IOMMU --|
                        |-- IO/APIC
                        |-- HPET
                        |-- PCI/MSI

So allocating an interrupt on the outer level which is exposed to
devices requires allocation through the hierarchy down to the root
domain (VECTOR). The reason why we have this hierarchy is that this
makes it trivial to support systems with and without IOMMU because on a
system w/o IOMMU the outer domains have the vector domain as their
parent. ARM(64) has even deeper hierarchies, but operates on the same
principle.

Let's look at the PCI/MSI domain more detailed:

It's either one irqdomain per system or one irqdomain per IOMMU domain

The PCI/MSI irqdomain does not really do resource allocations; it's
all delegated down the hierarchy.

It also provides an interrupt chip which knows how to access the MSI[X]
entries and can handle masking etc.

Now you might ask how the MSI descriptors come into play. The MSI
descriptors are allocated by the PCI/MSI interface functions in order to
enable the PCI/MSI irqdomain to actually allocate resources.

They could be allocated as part of the irqdomain allocation mechanism
itself. That's what I do now for simple irq domains which only have a
linear MSI descriptor space where the only information in the descriptor
is the index and not all the PCI/MSI tidbits.

It's kinda blurry, but that's mostly due to the fact that we still have
to support the legacy PCI/MSI implementations on architectures which do
not use hierarchical irq domains.

The MSI descriptors are stored in device->msi.data.store. That makes
sense because they are allocated on behalf of that device. But there is
no requirement to store them there. We could just store a cookie in the
device and have some central storage. It's an implementation detail,
which I kept that way so far.

But note, these MSI descriptors are not part of the PCI/MSI
irqdomain. They are just resources which the irqdomain hands out to the
device on allocation and collects on free.

From a topology view the location of the PCI/MSI domain is at the PCI
level which is correct because if you look at a PCIe card then it
consists of two parts:

   |----------------|----------------------------|
   | PCIe interface |                            |
   |----------------|     Physical function      |
   | Config space   |                            |
   |----------------|----------------------------|
   |            Bus bridge                       |
   |---------------------------------------------|

The PCIe interface is _not_ managed by the device driver. It's managed
by the PCI layer including the content of the MSI-X table.

The physical function or some interrupt translation unit reads these
entries to actually raise an interrupt down through

PCIe -> IOMMU -> CPU

by writing the message data to the message address.

If you look at the above irqdomain hierarchy then you'll notice that the
irqdomain hierarchy is modeled exactly that way.

Now ideally we should do the same modeling for IMS:

CPU -> IOMMU -> IMS

and have that global or per IOMMU domain. But we can't do that because
IMS is not uniform in terms of the interrupt chip.

IDXD has this table in device memory and you want to store the message
in system memory and the next fancy card will do something else.

Megha tried to do that via this platform MSI indirection, but that's
just a nasty hack. The platform MSI domain is already questionable, but
bolting IMS support into it would make it a total trainwreck.

Aside of that this would not solve the allocation problem for devices
which have a IMS table in device memory because then every device driver
would have to have its own table allocation mechanism.

Creating a IMS-IDXD irq domain per IOMMU domain would be a possible
solution, but then dealing with different incarnations of table
layouts will create a zoo of crap. Also it would not solve the problem
of IMS storage in system memory.

That's why I suggested to create a device specific IMS irq domain with a
device specific irqchip. The implementation can be purely per device or
a shared implementation in drivers/irqchip, which is preferable because
otherwise we won't ever get code reuse even if the device memory table
or the system memory storage mechanism are the same.

So the device driver for the PCIe device has to do:

probe()
...
mydev->ims_domain = ims_$type_create_msi_domain(....)

Now allocations from that domain for usage by the device driver do:

ret = msi_domain_alloc_irqs(mydev->ims_domain, ...);

Now that's where the problem starts:

The allocation creates MSI descriptors and has to store them
somewhere.

Of course we can store them in pci_dev.dev.msi.data.store. Either with a
dedicated xarray or by partitioning the xarray space. Both have their
pros and cons.

Ok?

But what I really struggle with is how this is further represented when
the queues are allocated for VFIO, cdev, whatever usage.

The point is that what you call a queue is not a single resource entity,
it's a container for which several resource types are required to become
functional:

1) The device queue
2) Interrupts
3) ....

Of those resources only one, the device queue, is managed by the
device driver.

The fact that the irqdomain is instantiated by the device driver does
not make it any different. As explained above it's just an
implementation detail which makes it possible to handle the device
specific message storage requirement in a sane way. The actual interrupt
resources come still from the underlying irqdomains as for PCI/MSI.

Now I was looking for a generic representation of such a container and
my initial reaction was to bind it to a struct device, which also makes
it trivial to store these MSI descriptors in that struct device.

I can understand your resistance against that to some extent, but I
think we really have to come up with a proper abstract representation
for these.

Why?

Simply because I want to avoid yet another non-uniform representation of
these things which are making it hard to expose them in a topology view
and require non-uniform wrappers for the same type of information all
over the place.

We seriously have enough of this adhoc stuff already and there is no
need to open another can of worms.

After thinking more about this, I came up with a name for it: Logical function.

See, I was carefully avoiding the term 'device' and even more so did not
fall for the Software-Defined buzzword space. :)

Such a logical function would be the entity to hand out for VFIO or
cdev.

But it also can be exported e.g. via sysfs

bus/pci/.../lfunc/0/

to expose information about these. From a practical POV exposing the
interrupts there too seems to be genuinely useful in order to do proper
affinity settings etc. Retrieving that information from /proc/interrupts
is a complete nightmare.

While you did not find anything in Debian, I use msi_irqs/ as an
easy way to find out which interrupts belong to which device.

I have no strong opinion about those entries, as one can surely figure
it out by other means, but from a consistency view it makes sense to me.

Using such a uniform representation for storing the associated MSI
descriptors makes sense to me not only from the irq layer but also for
practical purposes.

If this is not split out, then every driver and wrapper has to come up
with its own representation of this instead of being able to do:

request_irq(msi_get_virq(lfunc, idx=0), handler0, ...);
request_irq(msi_get_virq(lfunc, idx=1), handler1, ...);

I went through a ton of usage sites while I was working on this and the
amount of crap which is done there differently is amazing.

Of course that needs some thoughts vs. the MSI domain interfaces, but as
we are modifying them anyway, it's a good opportunity to introduce
a common representation for these things.

Thanks,

tglx

2021-12-02 20:00:28

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 02, 2021 at 08:25:48PM +0100, Thomas Gleixner wrote:
> Jason,
>
> On Thu, Dec 02 2021 at 09:55, Jason Gunthorpe wrote:
> > On Thu, Dec 02, 2021 at 01:01:42AM +0100, Thomas Gleixner wrote:
> >> On Wed, Dec 01 2021 at 21:21, Thomas Gleixner wrote:
> >> > On Wed, Dec 01 2021 at 14:14, Jason Gunthorpe wrote:
> >> > Which in turn is consistent all over the place and does not require any
> >> > special case for anything. Neither for interrupts nor for anything else.
> >>
> >> that said, feel free to tell me that I'm getting it all wrong.
> >>
> >> The reason I'm harping on this is that we are creating ABIs on several
> >> ends and we all know that getting that wrong is a major pain.
> >
> > I don't really like coupling the method to fetch IRQs with needing
> > special struct devices. Struct devices have a sysfs presence and it is
> > not always appropriate to create sysfs stuff just to allocate some
> > IRQs.
> >
> > A queue is simply not a device, it doesn't make any sense. A queue is
> > more like a socket().
>
> Let's put the term 'device' aside for a bit please.
>
> > That said, we often have enough struct devices floating about to make
> > this work. Between netdev/ib_device/aux device/mdev we can use them to
> > do this.
> >
> > I think it is conceptual nonsense to attach an IMS IRQ domain to a
> > netdev or a cdev, but it will solve this problem.
>
> The IMS irqdomain is _NOT_ part of the netdev or cdev or whatever. I
> explained that several times now.
>
> We seem to have a serious problem of terminology and the understanding
> of topology which is why we continue to talk past each other forever.

I think I understand and agree with everything you said below.

The point we diverge is where to put the vector storage:

> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> dedicated xarray or by partitioning the xarray space. Both have their
> pro and cons.

This decision seems to drive the question of how many 'struct devices'
do we need, and where do we get them..

> But what I really struggle with is how this is further represented when
> the queues are allocated for VFIO, cdev, whatever usage.

Yes, this seems to be the primary question

> The fact that the irqdomain is instantiated by the device driver does
> not make it any different. As explained above it's just an
> implementation detail which makes it possible to handle the device
> specific message storage requirement in a sane way. The actual interrupt
> resources come still from the underlying irqdomains as for PCI/MSI.

Yes! This is not under debate

> Now I was looking for a generic representation of such a container and
> my initial reaction was to bind it to a struct device, which also makes
> it trivial to store these MSI descriptors in that struct device.

Right, I've been trying to argue around just this choice.

> I can understand your resistance against that to some extent, but I
> think we really have to come up with a proper abstract representation
> for these.

Ok

> Such a logical function would be the entity to hand out for VFIO or
> cdev.

What is a logical function, concretely?

Does it have struct device?

Can I instead suggest a name like 'message interrupt table' ?

Ie a device has two linearly indexed message interrupt tables - the
PCI SIG defined MSI/MSI-X one created by the PCI core and the IMS one
created by the driver.

Both start at 0 index and they have different irq_domains.

Instead of asking the driver to create a domain we ask the driver to
create a new 'message interrupt table'. The driver provides the
irq_chip to program the messages and the pci_device. The core code
manages the irq domain setup.

Using what you say below:

> If this is not split out, then every driver and wrapper has to come up
> with its own representation of this instead of being able to do:
>
> request_irq(msi_get_virq(lfunc, idx=0), handler0, ...);
> request_irq(msi_get_virq(lfunc, idx=1), handler1, ...);

We could say:
msi_get_virq(device.pci_msi_table, index=0)

Is the 0th PCI SIG MSI vector

Something like:

ims_table = pci_create_msi_table(pci_dev, my_irq_chip,..)
msi_get_virq(ims_table, index=0)

Is the 0th IMS vector

Is it close to what you are thinking with lfunc?

Regards,
Jason

2021-12-02 22:31:17

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Jason,

On Thu, Dec 02 2021 at 16:00, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 08:25:48PM +0100, Thomas Gleixner wrote:
>> We seem to have a serious problem of terminology and the understanding
>> of topology which is why we continue to talk past each other forever.
>
> I think I understand and agree with everything you said below.

Good!

> The point we diverge is where to put the vector storage:

Kinda. The vector, i.e. message storage is either:

- MSI entry in the PCI config space
- MSI-X table in the PCI config space
- Device specific IMS storage

The software representation aka struct msi_desc is a different
story. That's what we are debating.

>> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
>> dedicated xarray or by partitioning the xarray space. Both have their
>> pro and cons.
>
> This decision seems to drive the question of how many 'struct devices'
> do we need, and where do we get them..

Not really. There is nothing what enforces to make the MSI irqdomain
storage strictly hang off struct device. There has to be a connection to
a struct device in some way obviously to make IOMMU happy.

>> Such a logical function would be the entity to hand out for VFIO or
>> cdev.
>
> What is a logical function, concretely?

That's a name I came up with for an abstract representation of such a
queue container. I came up with that as an obvious consequence of my
previous reasoning about PF -> VF -> XF.

> Does it have struct device?

It does not necessarily have to.

> Can I instead suggest a name like 'message interrupt table' ?

Well yes, but that's not what I meant. See below.

> Ie a device has two linearly indexed message interrupt tables - the
> PCI SIG defined MSI/MSI-X one created by the PCI core and the IMS one
> created by the driver.
>
> Both start at 0 index and they have different irq_domains.
>
> Instead of asking the driver to create a domain we ask the driver to
> create a new 'message interrupt table'. The driver provides the
> irq_chip to program the messages and the pci_device. The core code
> manages the irq domain setup.
>
> Using what you say below:
>
>> If this is not split out, then every driver and wrapper has to come up
>> with its own representation of this instead of being able to do:
>>
>> request_irq(msi_get_virq(lfunc, idx=0), handler0, ...);
>> request_irq(msi_get_virq(lfunc, idx=1), handler1, ...);
>
> We could say:
> msi_get_virq(device.pci_msi_table, index=0)
>
> Is the 0th PCI SIG MSI vector
>
> Something like:
>
> ims_table = pci_create_msi_table(pci_dev, my_irq_chip,..)
> msi_get_virq(ims_table, index=0)

Which is pretty much a wrapper around two different irqdomains for the
device and either partitioned index space in the xarray or two xarrays.

Just badly named because the table itself is where the resulting message
is stored, which is composed with the help of the relevant MSI
descriptor. See above.

We really should not try to make up an artificial table representation
for something which does not necessarily have a table at all, i.e. the
devices you talk about which store the message in queue specific system
memory. Pretending that this is a table is just silly.

Also I disagree that this has to be tied to a PCI specific interface,
except for creating a PCI specific wrapper for it to not make a driver
developer have to write '&pdev->dev', which is the very least of our
problems.

IMS as a technical concept is absolutely not PCI specific at all and not
invented by PCI/SIG. It's a marketing brandname for something which
existed way before they thought about it: Message signaled interrupts.

Aside of that 'my_irq_chip' does not cut it at all because of the way
how the resulting messages are stored. IDXD has IOMEM storage and a
storage space limitation while your device uses system memory storage
and has other limitations, i.e. system memory and the number of queues
the device can provide.

An irqchip is just a set of functions to talk to hardware either
directly or via some indirect transport (I2C, SPI, MLX queue management
magic...). These functions require irqdomain and/or device specific
information to function.

Trying to create a universal pci_create_foo() wrapper around this is
going to be like the 13th Herculean task.

Seriously, you cannot make something uniform which is by definition
non-uniform.

Let's not even try to pretend that it is possible.

> Is the 0th IMS vector
>
> Is it close to what you are thinking with lfunc?

Not really. I was really reasoning about an abstract representation for
a functional queue, which is more than just a queue allocated from the
PF or VF device.

I really meant a container like this:

struct logical_function {
/* Pointer to the physical device */
struct device *phys_device;
/* MSI descriptor storage */
struct msi_data msi;
/* The queue number */
unsigned int queue_nr;
/* Add more information which is common to these things */
};

Now the real queue, which is obviously not generic:

struct myqueue_function {
struct logical_function lfunc;
struct myqueue queue;
};

The idea is to have a common representation for these type of things
which allows:

1) Have common code for exposing queues to VFIO, cdev, sysfs...

You still need myqueue specific code, but the common stuff which is
in struct logical_function can be generic and device independent.

2) Having the MSI storage per logical function (queue) allows to have
a queue relative 0 based MSI index space.

The actual index in the physical table (think IMS) would be held in
the msi descriptor itself.

Which then allows queue relative addressing without extra device/queue
specific meta storage.

i.e.

msi_get_virq(&myqueue->lfunc.msi, idx = 0)

v.s.

idx = myqueue->msidx[0];
msi_get_virq(pcidev->dev, idx);

where the queue management code has to set up myqueue->msidx[]
and stick the index of the underlying device storage into it.

3) Setup and teardown would be simply per logical function for
all of the related resources which are required.

Interrupt teardown would look like this:

msi_domain_free_all_irqs(irqdomain, &lfunc->msi);

vs.

for (i = 0; i < myqueue->nrirqs; i++)
msi_domain_free_irq(irqdomain, &pcidev->dev, myqueue->msidx[i]);


Now change struct logical_function to:

struct logical_function {
- /* Pointer to the physical device */
- struct device *phys_device;

+ /* Pseudo device to allow using devres */
+ struct pseudo_device pseudo_device;

/* MSI descriptor storage */
struct msi_data msi;
/* The queue number */
unsigned int queue_nr;
/* Add more information which is common to these things */
};

where struct pseudo_device holds the phys_device pointer and then you
can utilize the devres infrastructure like you do for any other device
and do:

pseudo_device_add(&myqueue->lfunc.pseudo_device);

at setup time and

pseudo_device_remove(&myqueue->lfunc.pseudo_device);

on teardown and let all the resources including MSI interrupts be
released automatically.

Needs some infrastructure obviously, but to me that makes a lot of
sense.

And I named it pseudo_device on purpose as it is just a vehicle to make
existing infrastructure which is device specific usable for this kind of
thing.

I might be completely off track. Feel free to tell me so :)

Thanks,

tglx

2021-12-03 00:38:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:

> The software representation aka struct msi_desc is a different
> story. That's what we are debating.

Okay, I did mean msi_desc storage, so we are talking about the same things

> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> >> dedicated xarray or by partitioning the xarray space. Both have their
> >> pro and cons.
> >
> > This decision seems to drive the question of how many 'struct devices'
> > do we need, and where do we get them..
>
> Not really. There is nothing what enforces to make the MSI irqdomain
> storage strictly hang off struct device. There has to be a connection to
> a struct device in some way obviously to make IOMMU happy.

Yes, I thought this too, OK we agree

> Just badly named because the table itself is where the resulting message
> is stored, which is composed with the help of the relevant MSI
> descriptor. See above.

I picked the name because it looks like it will primarily contain an
xarray and the API you suggested to query it is idx based. To me that
is a table. A table of msi_desc storage to be specific.

It seems we have some agreement here as your lfunc also included the
same xarray and uses an idx as part of the API, right?

> We really should not try to make up an artifical table representation
> for something which does not necessarily have a table at all, i.e. the
> devices you talk about which store the message in queue specific system
> memory. Pretending that this is a table is just silly.

Now I'm a bit confused, what is the idx in the lfunc?

I think this is language again, I would call idx an artificial table
representation?

> Also I disagree that this has to be tied to a PCI specific interface,
> except for creating a PCI specific wrapper for it to not make a driver
> developer have to write '&pdev->dev', which is the very least of our
> problems.

Agreed, just trying to be illustrative at a higher level

> Aside of that 'my_irq_chip' does not cut it at all because of the way
> how the resulting messages are stored. IDXD has IOMEM storage and a
> storage space limitation while your device uses system memory storage
> and has other limitations, i.e. system memory and the number of queues
> the device can provide.

Okay, so the device must setup the domain because it must provide
information during setup. Makes sense

> Not really. I was really reasoning about an abstract representation for
> a functional queue, which is more than just a queue allocated from the
> PF or VF device.
>
> I really meant a container like this:
>
> struct logical_function {
> /* Pointer to the physical device */
> struct device *phys_device;
> /* MSI descriptor storage */
> struct msi_data msi;

Up to here is basically what I thought the "message table" would
contain. Possibly a pointer to the iommu_domain too?

> /* The queue number */
> unsigned int queue_nr;
> /* Add more information which is common to these things */

Not sure why do we need this?

Lets imagine a simple probe function for a simple timer device. It has
no cdev, vfio, queues, etc. However, the device only supports IMS. No
MSI, MSI-X, nothing else.

What does the probe function look like to get to request_irq()?

Why does this simple driver create something called a 'logical
function' to access its only interrupt?

I think you have the right idea here, just forget about VFIO, the IDXD
use case, all of it. Provide a way to use IMS cleanly and concurrently
with MSI.

Do that and everything else should have sane solutions too.

> The idea is to have a common representation for these type of things
> which allows:
>
> 1) Have common code for exposing queues to VFIO, cdev, sysfs...
>
> You still need myqueue specific code, but the common stuff which is
> in struct logical_function can be generic and device independent.

I will quote you: "Seriously, you cannot make something uniform which
is by definition non-uniform." :)

We will find there is no common stuff here, we did this exercise
already when we designed struct auxiliary_device, and came up
empty.

> 2) Having the MSI storage per logical function (queue) allows to have
> a queue relative 0 based MSI index space.

Can you explain a bit what you see 0 meaning in this idx numbering?

I also don't understand what 'queue relative' means?

> The actual index in the physical table (think IMS) would be held in
> the msi descriptor itself.

This means that a device like IDXD would store the physical IMS table
entry # in the msi descriptor? What is idx then?

For a device like IDXD with a simple linear table, how does the driver
request a specific entry in the IMS table? eg maybe IMS table entry #0
is special like we often see in MSI?

> msi_get_virq(&myqueue->lfunc.msi, idx = 0)
>
> v.s.
>
> idx = myqueue->msidx[0];
> msi_get_virq(pcidev->dev, idx);

> where the queue management code has to set up myqueue->msidx[]
> and stick the index of the underlying device storage into it.

For the devices I know about there are two approaches for allocating
interrupts.

Many devices have a few special interrupts at fixed table
indexes. These can only be used for their single purpose. Eg MSI 0 and
1. An example is IRQ 1 signals the device had an error, read the error
reporting MMIO registers.

Then the device has a broad pool of fungible vectors. These are all
interchangeable, any vector can be used with any queue. eg MSI 2->128

A common mode operation is to setup the special vectors first, and
then assign on demand the pool vectors as CPUs are allocated for
use. Basically each CPU gets a vector and then many device objects are
attached to that vector. This scheme gives CPU locality and doesn't
exhaust CPU vectors. As example nvme uses 1 queue per CPU and 1 vector
per queue and userspace tells it how many queues (and thus CPUs) to
consume.

A device like mlx5 may have 1000's of queues and hundreds of sub
devices sharing a single CPU interrupt. This is desirable because we
can quickly run out of interrupts when we are talking about 100's of
concurrent subdevices.

When a "VFIO mdev" gets involved (mlx5 has this already, it is just
called VDPA, don't ask) the same allocator is used by the VFIO part
except VFIO has to request a dedicated MSI. Dedicated is because the
Intel VTx is used to deliver that MSI addr/data directly to the
guest's vAPIC.

When I imagine mlx5 using IMS, I see IMS as a simple extension of the
existing MSI-X vector pool. Every IMS vector is interchangeable and the
main PCI driver will apply exactly the same allocation algorithm we
already have today for MSI, just with even more vectors. When VFIO
wants a vector it may get a MSI or it may get an IMS, I don't want to
care.

All of this about logical functions just confuses
responsibilities. The IRQ layer should be worrying about configuring
IRQs and not dictating how the device will design its IRQ assignment
policy or subdevice scheme.

> 3) Setup and teardown would be simply per logical function for
> all of the related resources which are required.
>
> Interrupt teardown would look like this:
>
> msi_domain_free_all_irqs(irqdomain, &lfunc->msi);

Looks nice

> Now change struct logical_function to:
>
> struct logical_function {
> - /* Pointer to the physical device */
> - struct device *phys_device;
>
> + /* Pseudo device to allow using devres */
> + struct pseudo_device pseudo_device;
>
> /* MSI descriptor storage */
> struct msi_data msi;
> /* The queue number */
> unsigned int queue_nr;
> /* Add more information which is common to these things */
> };
>
> where struct pseudo_device holds the phys_device pointer and then you
> can utilize the devres infrastructure like you do for any other device
> and do:
>
> pseudo_device_add(&myqueue->lfunc.pseudo_device);
>
> at setup time and
>
> pseudo_device_remove(&myqueue->lfunc.pseudo_device);
>
> on teardown and let all the resources including MSI interrupts be
> released automatically.

We are already doing this with struct auxiliary_device and other
similar things. I don't see the value to couple the msi_data into the
lifetime model?

> I might be completely off track. Feel free to tell me so :)

I think you have the right core idea, just shrink your thinking of
logical_function to only cover msi stuff and leave the device model to
the other places that already have good mechanisms. eg auxiliary device

IMHO this has become hyper focused on the special IDXD VFIO use case -
step back and think about my timer example above - a simple pci_driver
that just wants to use IMS for itself. No queues, no VFIO, no
mess. Just how does it configure and use the interrupt source.

Is it helpful feedback?

Regards,
Jason

2021-12-03 15:08:03

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Jason,

On Thu, Dec 02 2021 at 20:37, Jason Gunthorpe wrote:
> On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:
>> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
>> >> dedicated xarray or by partitioning the xarray space. Both have their
>> >> pro and cons.
>> >
>> > This decision seems to drive the question of how many 'struct devices'
>> > do we need, and where do we get them..
>>
>> Not really. There is nothing what enforces to make the MSI irqdomain
>> storage strictly hang off struct device. There has to be a connection to
>> a struct device in some way obviously to make IOMMU happy.
>
> Yes, I thought this too, OK we agree

One thing which just occured to me and I missed so far is the
association of the irqdomain.

Right now we know for each device which irqdomain it belongs to. That's
part of device discovery (PCI, device tree or whatever) which sets up

device->irqdomain

That obviously cannot work for that device specific IMS domain. Why?

Because the PCIe device provides MSI[x] which obviously requires that
device->irqdomain is pointing to the PCI/MSI irqdomain. Otherwise the
PCI/MSI allocation mechanism won't work.

Sure each driver can have

mystruct->ims_irqdomain

and then do the allocation via

msi_irq_domain_alloc(mystruct->ims_irqdomain....)

but I really need to look at all of the MSI infrastructure code whether
it makes assumptions about this:

msi_desc->dev->irqdomain

being the correct one. Which would obviously just be correct when we'd
have a per queue struct device :)

>> Just badly named because the table itself is where the resulting message
>> is stored, which is composed with the help of the relevant MSI
>> descriptor. See above.
>
> I picked the name because it looks like it will primarily contain an
> xarray and the API you suggested to query it is idx based. To me that
> is a table. A table of msi_desc storage to be specific.
>
> It seems we have some agreement here as your lfunc also included the
> same xarray and uses an idx as part of the API, right?

It's a per lfunc xarray, yes.

>> We really should not try to make up an artifical table representation
>> for something which does not necessarily have a table at all, i.e. the
>> devices you talk about which store the message in queue specific system
>> memory. Pretending that this is a table is just silly.
>
> Now I'm a bit confused, what is the idx in the lfunc?
>
> I think this is language again, I would call idx an artificial table
> representation?

Ok. I prefer the term indexed storage, but yeah.

>> I really meant a container like this:
>>
>> struct logical_function {
>> /* Pointer to the physical device */
>> struct device *phys_device;
>> /* MSI descriptor storage */
>> struct msi_data msi;
>
> Up to here is basically what I thought the "message table" would
> contain. Possibly a pointer to the iommu_domain too?
>
>> /* The queue number */
>> unsigned int queue_nr;
>> /* Add more information which is common to these things */
>
> Not sure why do we need this?
>
> Lets imagine a simple probe function for a simple timer device. It has
> no cdev, vfio, queues, etc. However, the device only supports IMS. No
> MSI, MSI-X, nothing else.
>
> What does the probe function look like to get to request_irq()?

The timer device would be represented by what? A struct device?

If so then something needs to set device->irqdomain and then you can
just do:

msi_irq_domain_alloc_irqs(dev->irqdomain, dev, ...);

> Why does this simple driver create something called a 'logical
> function' to access its only interrupt?

It does not have to when it's struct device based.

> I think you have the right idea here, just forget about VFIO, the IDXD
> use case, all of it. Provide a way to use IMS cleanly and concurrently
> with MSI.
>
> Do that and everything else should have sane solutions too.

To do that, I need to understand the bigger picture and the intended
usage because otherwise we end up with an utter mess.

>> The idea is to have a common representation for these type of things
>> which allows:
>>
>> 1) Have common code for exposing queues to VFIO, cdev, sysfs...
>>
>> You still need myqueue specific code, but the common stuff which is
>> in struct logical_function can be generic and device independent.
>
> I will quote you: "Seriously, you cannot make something uniform which
> is by definition non-uniform." :)
>
> We will find there is no common stuff here, we did this exercise
> already when we designed struct auxiliary_device, and came up
> empty.

Really?

>> 2) Having the MSI storage per logical function (queue) allows to have
>> a queue relative 0 based MSI index space.
>
> Can you explain a bit what you see 0 meaning in this idx numbering?
>
> I also don't understand what 'queue relative' means?
>
>> The actual index in the physical table (think IMS) would be held in
>> the msi descriptor itself.
>
> > This means that a device like IDXD would store the physical IMS table
> entry # in the msi descriptor? What is idx then?
>
> For a device like IDXD with a simple linear table, how does the driver
> request a specific entry in the IMS table? eg maybe IMS table entry #0
> is special like we often see in MSI?

If there is a hardwired entry then this hardwired entry belongs to the
controlblock (status, error or whatever), but certainly not to a
dynamically allocatable queue as that would defeat the whole purpose.

That hardwired entry could surely exist in a IDXD type setup where the
message storage is in a defined array on the device.

But with the use case you described where you store the message at some
place in per queue system memory, the MSI index is not relevant at all,
because there is no table. So it's completely meaningless for the device
and just useful for finding the related MSI descriptor again.

If the queue or the controlblock have an internal small array of
messages in their queue or controlblock specific memory, then how would
a per device global index help?

Or has each queue and controlblock and whatever access to a shared large
array where the messages are stored and the indices are handed out to
the queues and controlblocks?

If each of them have their own small array, then queue relative indexing
makes a ton of sense, no?

> For the devices I know about there are two approaches for allocating
> interrupts.

I'm aware of that.

> When I imagine mlx5 using IMS, I see IMS as a simple extension of the
> existing MSI-X vector pool. Every IMS vector is interchangable and the
> main PCI driver will apply exactly the same allocation algorithm we
> already have today for MSI, just with even more vectors. When VFIO
> wants a vector it may get a MSI or it may get an IMS, I don't want to
> care.
>
> All of this about logical functions just confuses
> responsibilities. The IRQ layer should be worrying about configuring
> IRQs and not dictating how the device will design its IRQ assignment
> policy or subdevice scheme.

The IRQ layer _has_ to worry about it because there is a strict
relationship between device and irqdomain today.

Now with a PCIe device having PCI/MSI[X] it's a strict 1:1
relationship. Now add IMS to the picture and we end up with a 1:2
relationship, which does not work today.

That's why my initial reaction was to partition the device into
subdevices which would have solved the problem nicely.

> IMHO this has become hyper focused on the special IDXD VFIO use case -
> step back and think about my timer example above - a simple pci_driver
> that just wants to use IMS for itself. No queues, no VFIO, no
> mess. Just how does it configure and use the interrupt source.

That would mean that the PCI device would not use MSI-X at all, right?
So it could have pci_dev.dev.irqdomain = IMSdomain.

But that sucks too because it's yet another quirk as the PCIe discovery
cannot set that up simply because the device specific IMS domain does
not exist yet.

> Is it helpful feedback?

Yes, definitely. It helps me to understand the bigger picture and to
find a proper representation of these things so that we don't end up
with the usual mess in drivers/* which makes it a fricking nightmare to
change anything at the core code at all.

Thanks,

tglx

2021-12-03 16:41:12

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Fri, Dec 03, 2021 at 04:07:58PM +0100, Thomas Gleixner wrote:
> Jason,
>
> On Thu, Dec 02 2021 at 20:37, Jason Gunthorpe wrote:
> > On Thu, Dec 02, 2021 at 11:31:11PM +0100, Thomas Gleixner wrote:
> >> >> Of course we can store them in pci_dev.dev.msi.data.store. Either with a
> >> >> dedicated xarray or by partitioning the xarray space. Both have their
> >> >> pro and cons.
> >> >
> >> > This decision seems to drive the question of how many 'struct devices'
> >> > do we need, and where do we get them..
> >>
> >> Not really. There is nothing what enforces to make the MSI irqdomain
> >> storage strictly hang off struct device. There has to be a connection to
> >> a struct device in some way obviously to make IOMMU happy.
> >
> > Yes, I thought this too, OK we agree
>
> One thing which just occured to me and I missed so far is the
> association of the irqdomain.

Oh? I thought this was part of what you were thinking when talking
about using the mdev struct device.

> but I really need to look at all of the MSI infrastructure code whether
> it makes assumptions about this:
>
> msi_desc->dev->irqdomain
>
> being the correct one. Which would obviously just be correct when we'd
> have a per queue struct device :)

Right

I'd half expect the msi_desc to gain a pointer to the 'lfunc'

Then msi_desc->dev becomes msi_desc->origin_physical_device and is
only used by IOMMU code to figure out MSI remapping/etc. Maybe that
allows solving your aliasing problem (Is this the RID aliasing we see
with iommu_groups too?)

> > Lets imagine a simple probe function for a simple timer device. It has
> > no cdev, vfio, queues, etc. However, the device only supports IMS. No
> > MSI, MSI-X, nothing else.
> >
> > What does the probe function look like to get to request_irq()?
>
> The timer device would be represented by what? A struct device?

Lets say it is a pci_device and probe is a pci_driver probe function

> If so then something needs to set device->irqdomain and then you can
> just do:
>
> msi_irq_domain_alloc_irqs(dev->irqdomain, dev, ...);

Okay, but that doesn't work if it is a PCI device and dev->irqdomain
is already the PCI MSI and INTx irqdomain?

> To do that, I need to understand the bigger picture and the intended
> usage because otherwise we end up with an utter mess.

Sure

> >> The idea is to have a common representation for these type of things
> >> which allows:
> >>
> >> 1) Have common code for exposing queues to VFIO, cdev, sysfs...
> >>
> >> You still need myqueue specific code, but the common stuff which is
> >> in struct logical_function can be generic and device independent.
> >
> > I will quote you: "Seriously, you cannot make something uniform which
> > is by definition non-uniform." :)
> >
> > We will find there is no common stuff here, we did this exercise
> > already when we designed struct auxiliary_device, and came up
> > empty.
>
> Really?

Yes. It was a long, drawn-out affair, and auxiliary_device is just a
simple container

> >> 2) Having the MSI storage per logical function (queue) allows to have
> >> a queue relative 0 based MSI index space.
> >
> > Can you explain a bit what you see 0 meaning in this idx numbering?
> >
> > I also don't understand what 'queue relative' means?
> >
> >> The actual index in the physical table (think IMS) would be held in
> >> the msi descriptor itself.
> >
> > This means that a device like IDXD would store the phsical IMS table
> > entry # in the msi descriptor? What is idx then?
> >
> > For a device like IDXD with a simple linear table, how does the driver
> > request a specific entry in the IMS table? eg maybe IMS table entry #0
> > is special like we often see in MSI?
>
> If there is a hardwired entry then this hardwired entry belongs to the
> controlblock (status, error or whatever), but certainly not to a
> dynamically allocatable queue as that would defeat the whole purpose.
>
> That hardwired entry could surely exist in a IDXD type setup where the
> message storage is in an defined array on the device.

Agree - so how does it get accessed?

> But with the use case you described where you store the message at some
> place in per queue system memory, the MSI index is not relevant at all,
> because there is no table. So it's completely meaningless for the device
> and just useful for finding the related MSI descriptor again.

Yes

What I don't see is how exactly we make this situation work. The big
picture sequence would be:

- Device allocates a queue and gets a queue handle
- Device finds a free msi_desc and associates that msi_desc with the
queue handle
- Device specific irq_chip uses the msi_desc to get the queue handle
and programs the addr/data

In this regard the msi_desc is just an artifact of the API; the real
work is in the irq_chip.
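The three-step sequence above can be sketched as a small user-space model. All the names here (my_queue, model_msi_desc, model_associate, ...) are hypothetical stand-ins, not kernel APIs; the point is only the ownership chain: the queue owns the message storage, the descriptor carries a cookie back to the queue handle, and the device-specific chip callback follows that cookie to program the addr/data pair.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the sequence above; none of these names are
 * real kernel APIs. A queue owns its message storage, the allocator
 * ties a descriptor to the queue handle via a cookie, and the
 * device-specific "irq_chip" callback follows that cookie to program
 * the addr/data pair.
 */

struct msi_msg { uint64_t addr; uint32_t data; };

struct my_queue {
	struct msi_msg msg_store;	/* per-queue message storage */
	int irq;			/* modeled Linux irq number */
};

struct model_msi_desc {
	void *cookie;			/* points back at the queue handle */
	int irq;
};

/* device allocates a queue, then associates a free descriptor with it */
static void model_associate(struct my_queue *q, struct model_msi_desc *desc,
			    int irq)
{
	desc->cookie = q;
	desc->irq = irq;
	q->irq = irq;
}

/* "irq_chip" op: use the desc to find the queue and write the message */
static void model_write_msg(struct model_msi_desc *desc,
			    const struct msi_msg *msg)
{
	struct my_queue *q = desc->cookie;

	q->msg_store = *msg;
}
```

In this toy version the msi_desc really is just plumbing: everything the chip callback needs comes from the queue handle it recovers from the cookie.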

> Or has each queue and controlblock and whatever access to a shared large
> array where the messages are stored and the indices are handed out to
> the queues and controlblocks?

> If each of them have their own small array, then queue relative indexing
> makes a ton of sense, no?

Okay, I see.

I don't know of any use case for more than one interrupt on a queue,
and if it did come up I'd probably approach it by making the queue
handle above also specify the 'queue relative HW index'

> > All of this about logical functions just confuses
> > responsibilities. The IRQ layer should be worrying about configuring
> > IRQs and not dictating how the device will design its IRQ assignment
> > policy or subdevice scheme.
>
> The IRQ layer _has_ to worry about it because there is a strict
> relationship between device and irqdomain today.

Yes. Absolutely.

This is what I've been thinking for a few messages now :)

I liked your lfunc/"msi_table" (without the queue stuff) because it
breaks that 1:1 without bringing in struct device to do it.

If that seems reasonable I think it is the best way to go.

Lets do a thought experiment, lets say we forget about the current PCI
MSI API.

What if it worked more like this:

probe()
// Access the real PCI SIG MSI/MSI-X table
mystruct->msi_table = pci_allocate_msi_table(pci_dev);

// Special interrupt 0
mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
request_irq(mystruct->desc0->irq, ..)

// Special interrupt 1
mystruct->desc1 = msi_table_alloc_vector(mystruct->msi_table, hw_label=1);
request_irq(mystruct->desc1->irq, ..)

// Fungible queue interrupts are 3->MAX
for (nqueues) {
queue[i]->desc = msi_table_alloc_vector_range(mystruct->msi_table, lowest=3, highest=MAX);
request_irq(queue[i]->desc->irq, ..)
}

remove()
for (nqueues)
msi_table_free_vector(mystruct->msi_table, queue[i]->desc)
msi_table_free_vector(mystruct->msi_table, mystruct->desc1);
msi_table_free_vector(mystruct->msi_table, mystruct->desc0);
msi_table_free(mystruct->msi_table);

(please take some latitude when reading this, I am just trying to
sketch)

Here I am trying to show:

- msi_table is a general API for accessing MSIs. Each bus type
would provide some kind of logical creation function for their
bus standardized MSI table type. eg MSI/MSI-X/etc

It is a logical handle for the physical resource that holds the MSI
addr/data pairs.

- Use a pointer instead of idx as the API for referring to a MSI
irq. Most drivers probably only use msi_desc->irq?

- We do not have device->msi, it is held in the driver instead.
dev->irqdomain is always the bus/platform originated irqdomain of
the physical device.

Thus we break the 1:1 of the device and irqdomain. A device can
spawn many msi_tables, but only one would be on the dev->irqdomain

- Rely on dynamic allocation instead of the awkward batch interfaces.
This means a msi_table starts out with 0 msi_descs and as we go
they are allocated, installed in the xarray, and connected to the
HW.

- hw_label is defined by whatever type of msi_table it is:
for PCI MSI this is the MSI table index
for IDXD IMS this is the IMS table index
for mlx5 memory IMS it is -1 (?) and the msi_desc is just placed
anywhere in the xarray. msi_desc->xarray_index can store its
location

- 'msi_table_alloc_vector_range' allows a simple API to get any
free entry from a group of fungible entries to make it clear
to readers the typical special/pooled HW design pattern

Is it sane? What really needs device->msi anyhow?

IMS is a logical progression of this concept:

probe()
mystruct->msi_table = pci_allocate_msi_table(pci_dev);
mystruct->ims_irqdomain = <....>
mystruct->ims_table = msi_allocate_table(pci_dev->dev, mystruct->ims_irqdomain ..)

// Use MSI
mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
// Use IMS
mystruct->desc1 = msi_table_alloc_vector(mystruct->ims_table, hw_label=0);

Not saying we can/should go out and just do something so radical, but
as a thought experiment, can it guide toward a direction like this?

> That's why my initial reaction was to partition the device into
> subdevices which would have solved the problem nicely.

Yes, I see that logic.

> > IMHO this has become hyper focused on the special IDXD VFIO use case -
> > step back and think about my timer example above - a simple pci_driver
> > that just wants to use IMS for itself. No queues, no VFIO, no
> > mess. Just how does it configure and use the interrupt source.
>
> That would mean that the PCI device would not use MSI-X at all, right?
> So it could have pci_dev.dev.irqdomain = IMSdomain.

Yes, but I'm not trying to get that as a solution. I fully expect all
real PCI devices to have MSI tables to support OS's without IMS
capability (notably old Linux)

So IMS and MSI must co-exist.

> > Is it helpful feedback?
>
> Yes, definitely. It helps me to understand the bigger picture and to
> find a proper representation of these things so that we don't end up
> with the usual mess in drivers/* which makes it a fricking nightmare to
> change anything at the core code at all.

Great!

I agree alot with the other email you wrote that IMS == MSI - I think
that is key.

So much of these conversations have been warped by thinking of IMS as
some wonky thing for virtualization, or trying to focus on VFIO as the
only use case.

I think the other sort of tangential issue when you think on IMS ==
MSI is how pci_msi_create_irq_domain() is basically a special case for
PCI MSI by providing chip ops that only work for PCI to the irqdomain.

Not only that, but instead of cleanly having two irq_chip ops for the
very different HW programming under the MSI and MSI-X cases, it is
handled with an 'if (msi_desc->msix)' on each op.
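The structural alternative implied here can be illustrated with a toy model: give each message-store type its own ops table instead of branching on the descriptor type inside shared ops. The types below are invented for illustration (not the kernel's struct irq_chip), and the return codes merely mark which path ran.

```c
#include <assert.h>

/* Toy model: per-store ops tables instead of 'if (desc->msix)' branches. */

struct model_desc { int hwidx; };

struct model_chip_ops {
	int (*mask)(struct model_desc *d);
};

/* MSI: per-function mask bit in config space (modeled as return code 1) */
static int msi_mask(struct model_desc *d) { (void)d; return 1; }

/* MSI-X: per-vector mask bit in the vector table (modeled as return code 2) */
static int msix_mask(struct model_desc *d) { (void)d; return 2; }

static const struct model_chip_ops msi_ops  = { .mask = msi_mask };
static const struct model_chip_ops msix_ops = { .mask = msix_mask };

/*
 * Each domain is created with the ops matching its hardware, so the
 * call site never needs to test the descriptor type.
 */
static int do_mask(const struct model_chip_ops *ops, struct model_desc *d)
{
	return ops->mask(d);
}
```

The branch moves from every op invocation to a single decision at domain-creation time, which is exactly the shape an IMS domain would add a third ops table to.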

If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
completely unable to handle IMS/MSI/MSI-X at all, and instead, when
the driver asks for PCI MSI access we create a new hierarchical
irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
you outlined for IMS? (again, not saying to do this, but let's ask if
that makes more sense than the current configuration)

Jason

2021-12-04 14:20:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Jason,

On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> On Fri, Dec 03, 2021 at 04:07:58PM +0100, Thomas Gleixner wrote:
> Lets do a thought experiment, lets say we forget about the current PCI
> MSI API.
>
> What if it worked more like this:
>
> probe()
> // Access the real PCI SIG MSI/MSI-X table
> mystruct->msi_table = pci_allocate_msi_table(pci_dev);
>
> // Special interrupt 0
> mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
> request_irq(mystruct->desc0->irq, ..)

A device driver should not even know about msi_desc. Period.

The code which gets the msi_desc pointer handed in is the irqdomain code
and eventually the irqchip callbacks which need to deal with it.

I just cleaned up quite some places again which fiddled with msi_desc
for the very wrong reasons. We've had subtle bugs due to that in the
past.

The time I wasted in the past to fix drivers which fiddled with msi_desc
and even with irq descriptors just because they can, makes me think hard
about encapsulations because I have absolutely no interest to do full
tree cleanups every time I need to change something in the core
code. Neither do I have interest to deal with the resulting bugs which
end up on my desk because the usual consequence is that the abuse makes
the core go belly up.

If I need to change a dozen irq chip implementations, that's a limited
scope, but chasing the abuse all over the place is insane. I'm dead
tired of it.

The driver can get some info struct which has the index and the linux
irq number in it and that is all it needs, seriously.

If C would provide access protection scopes in objects then the msi
descriptor pointer could become a cookie. But we only can document that
the driver writer has to treat it like a cookie which is equally
effective to not documenting it at all. That's unfortunately the sad
reality. So the goal is to make it harder to screw up and not to make it
easier. The only way to do that is strict encapsulation.

> - msi_table is a general API for accessing MSIs. Each bus type
> would provide some kind of logical creation function for their
> bus standardized MSI table type. eg MSI/MSI-X/etc

You can't give up on that table thing, right? We just established that
for devices like yours IMS is not necessarily a table and does not even
need the index. The fact that MSI-X uses a table does not make
everything a nail^Wtable. :)

> It is a logical handle for the physical resource that holds the MSI
> addr/data pairs.
>
> - Use a pointer instead of idx as the API for referring to a MSI
> irq. Most drivers probably only use msi_desc->irq?

No. pci_irq_vector() translates the hw index to linux irq number. The
hardware index is known by the driver when it is doing a batched
allocation, right?

There is also that magic interface which hands in an array of struct
msi_entry. msi_entry contains the hardware index and the linux irq
number. The hardware index is filled by the driver and the linux irq
number by the core code when the allocation was successful.
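A minimal user-space model of that contract (driver fills in the hardware index, core fills in the Linux irq number on success) looks like this. It mirrors the shape of the kernel's struct msix_entry without being kernel code; the irq numbers handed out here are arbitrary.

```c
#include <assert.h>

/*
 * Model of the interface described above. The driver sets .entry (the
 * hardware table index); the "core" fills .vector (the Linux irq
 * number) when the allocation succeeds. The base-100 numbering is a
 * made-up stand-in for whatever the core would really assign.
 */

struct model_entry {
	unsigned int vector;	/* filled by the core */
	unsigned short entry;	/* filled by the driver: hw table index */
};

/* pretend core allocator: returns the number of vectors allocated */
static int model_enable(struct model_entry *entries, int nvec)
{
	int i;

	for (i = 0; i < nvec; i++)
		entries[i].vector = 100 + entries[i].entry;
	return nvec;
}
```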

Yes, it's awful from an interface point of view, but that has nothing to
do with the underlying infrastructure and the problem at hand.

I'm fine with using the info struct I described above as the reference
cookie.

> - We do not have device->msi, it is held in the driver instead.

No. First of all drivers have no business with that, really.

It's infrastructure which the driver utilizes and that infrastructure
manages the data it requires and uses it within the infrastructure, in
this case irqdomains and the associated irq chips.

Aside from that, even if it were a good idea, someone would have to
rewrite the PCI/MSI backends of powerpc, mips-octeon, s390, xen and some
more first. Along with the MSI and the PCI/MSI core.

Then create either nasty wrappers around the existing interfaces which
create those magic tables automatically and store them where? In struct
device->msi perhaps? :)

Alternatively cleanup ~360 usage sites throughout drivers and add the
same silly boilerplate code all over the place, which is wrong to begin
with. We want less boilerplate not more.

I'm sure you're volunteering or can provide the engineering capacity to
do that. Depending on the capacity this might be done in 1-10 years from
now.

> dev->irqdomain is always the bus/platform originated irqdomain of
> the physical device.

This is a guarantee for subtle bugs and I'm not even remotely interested
going there. See also below.

> Thus we break the 1:1 of the device and irqdomain. A device can
> spawn many msi_tables, but only one would be on the dev->irqdomain

Why are you trying brute force to push things into device drivers?
That's really the wrong direction. We want common infrastructure to be
managed by generic code. And all of this is about common infrastructure.

The only device specific thing which is required due to IMS is the
irqchip plus its data. Whether that irq chip implementation is part of
the device driver itself or is preferably an irq chip implementation
which can be shared and instantiated by several device types is not
making any difference here at all. It's from an encapsulation POV still
an irq chip which is handled and managed by the interrupt subsystem.

We don't need that for PCI/MSI[-X] at all, so why would we go and give
up on central managed storage and add pointless crap to drivers so that
driver writers can screw up even more?

> - Rely on dynamic allocation instead of the awkward batch interfaces.
> This means a msi_table starts out with 0 msi_descs and as we go
> they are allocated, installed in the xarray, and connected to the
> HW.

We need to preserve multi-vector batch allocations unless you have the
engineering capacity to fixup the stuff I mentioned above.

And no, you can't just make the batch allocation a wrapper loop around
single vector allocation as that will break the world as well.

There is a very good reason why I made runtime expansion of MSI-X
interrupts opt-in by the underlying architecture specific MSI
backend. IMS support has to be opt-in for the very same reason.

> - hw_label is defined by whatever type of msi_table it is:
> for PCI MSI this is the MSI table index
> for IDXD IMS this is the IMS table index
> for mlx5 memory IMS it is -1 (?) and the msi_desc is just placed
> anywhere in the xarray. msi_desc->xarray_index can store its
> location
>
> - 'msi_table_alloc_vector_range' allows a simple API to get any
> free entry from a group of fungible entries to make it clear
> to readers the typical special/pooled HW design pattern
>
> Is it sane? What really needs device->msi anyhow?

device->msi is a container for interrupt management and that's done by
the interrupt code and not by random device drivers. Again, struct
msi_desc is a software construct which the interrupt core code and the
irqdomains and irq chip implementation use for various purposes. Nothing
outside of this realm has to even know about the fact that this data
structure exists in the first place. See below.

> IMS is a logical progression of this concept:
>
> probe()
> mystruct->msi_table = pci_allocate_msi_table(pci_dev);
> mystruct->ims_irqdomain = <....>
> mystruct->ims_table = msi_allocate_table(pci_dev->dev, mystruct->ims_irqdomain ..)
>
> // Use MSI
> mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
> // Use IMS
> mystruct->desc1 = msi_table_alloc_vector(mystruct->ims_table, hw_label=0);
>
> Not saying we can/should go out and just do something so radical, but
> as a thought experiment, can it guide toward a direction like this?

What I agree on is that the interface should be in a way that the driver
can:

1) Allocate vectors at a given hardware index

2) Allocate vectors within a given hardware index range

3) The core returns the hardware index and the Linux irq number in
a separate info structure which is independent of the interrupt
stack internal data representations.
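Those three points can be sketched as a toy allocator. Everything here is hypothetical (the names, the fixed-size pool, the fake irq numbering); it only demonstrates the shape of the agreed interface: allocate at a fixed hardware index, allocate within a range, and hand back a small info struct that exposes nothing of the interrupt stack's internals.

```c
#include <assert.h>

/* Toy allocator modeling points 1-3 above; all names are made up. */

#define MODEL_NVEC 8

struct model_irq_info {
	int hwidx;	/* hardware index */
	int irq;	/* modeled Linux irq number */
};

static int model_used[MODEL_NVEC];

/* 1) allocate a vector at a given hardware index */
static int model_alloc_at(int hwidx, struct model_irq_info *info)
{
	if (hwidx < 0 || hwidx >= MODEL_NVEC || model_used[hwidx])
		return -1;
	model_used[hwidx] = 1;
	info->hwidx = hwidx;
	info->irq = 200 + hwidx;	/* arbitrary stand-in numbering */
	return 0;
}

/* 2) allocate a vector within a given hardware index range */
static int model_alloc_range(int lo, int hi, struct model_irq_info *info)
{
	int i;

	for (i = lo; i <= hi && i < MODEL_NVEC; i++)
		if (!model_used[i])
			return model_alloc_at(i, info);
	return -1;
}
```

The driver only ever sees the info struct (point 3); the pool bookkeeping stays on the core side.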

But the driver interface is _not_ the problem at all.

Interrupts are fundamentally a resource pool which roots at the
CPU. There are layers on top, e.g. IOMMU, which are involved in
routing. On top of that there is the outermost interrupt chip which is
handling the device specifics.

Interrupt delivery roots at the CPU too. So the interrupt core code has
to have a consistent view of the whole chain. That requires to have
proper data representations and a consistent relationship between these
representations.

That brings me back to device->msi. This is part of the data
representation and disconnecting it makes things just harder and more
exposed to the whim of driver writers. Thanks, no.

Sure the driver can get a cookie of some sort to do allocation/free from
different resource pools, e.g. PCI-MSI[X] and IMS. But the internal data
representation management has to be done at the core level.

That means that a driver which utilizes device specific IMS should not
even try to make the irqchip/domain code mangled into the other device
code. It should create the irqdomain with the associated chip and that
creation process returns a cookie which is passed to the actual device
specific code. Allocation then can use that cookie and not the irqdomain
pointer itself.

But the driver API is the least of the problems at this point,
really. What's important is to know what kind of functionality the
drivers need.

Then you build the thing up from bottom to top, layer by layer
considering the functional requirements.

The final driver API is a convenience wrapper based on the underlying
infrastructure which hides common tasks from the driver level.

> I agree alot with the other email you wrote that IMS == MSI - I think
> that is key.

Yes, that was clear to me from the very beginning, that all of them just
raise interrupts by writing msg.data to msg.address.

> So much of these conversations have been warped by thinking of IMS as
> some wonky thing for virtualization, or trying to focus on VFIO as the
> only use case.

Well, I was always thinking about all use cases.

The important thing about this discussion was that I had to understand
how this is used and that my naive assumption that this new approach is
just a logical consequence of splitting the devices up further into
smaller slices is not really correct.

The Physical function -> Virtual function -> Logical function view made
a lot of sense to me. And that's how some people explained it in the
earlier discussions. From that point of view having a struct device to
represent it was not that absurd, right?

So thanks for being patient in educating me here.

The fact that we will be able to support this at all is based on years
of cleanups, consolidation and restructuring of the infrastructure. The
result of doing this is that my trust in driver developers in that
regard is very close to zero. The cleanup patches I had to do for this
series just to solve the single irqdomain case did not help to elevate
the trust level either.

The common approach is: My device is special and therefore I need
special X. The core does not provide special X so I hack something up
which violates all known rules of encapsulation and layering in one go.

Reality is that this is in 99+% of the cases not true at all. If "my
device is special" would be true, then writing and maintaining an OS
which can handle this hardware zoo would be close to impossible.

That's why I'm insisting on understanding the use cases and the
underlying concepts of representation deeply. That's why I'm more
interested in data representation and the relationship of data than in
function interfaces and code.

Code and function interfaces are defined by data representation and the
relationship of data. Doing it the other way round is doomed.

> I think the other sort of tangential issue when you think on IMS ==
> MSI is how pci_msi_create_irq_domain() is basically a special case for
> PCI MSI by providing chip ops that only work for PCI to the irqdomain.
>
> Not only that, but instead of cleanly having two irq_chip ops for the
> very different HW programming under the MSI and MSI-X cases it is
> handled with a 'if (msi_desc->msix)' on each op.

I'm well aware of that. That's historical ballast which is also required
to keep the legacy architecture code alive. Same problem as explained
above.

> If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
> completely unable to handle IMS/MSI/MSI-X at all, and instead, when
> the driver asks for PCI MSI access we create a new hierarchical
> irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
> you outlined for IMS? (again, not saying to do this, but let's ask if
> that makes more sense than the current configuration)

That's not really a good idea because dev->irqdomain is a generic
mechanism and not restricted to the PCI use case. Special casing it for
PCI is just wrong. Special casing it for all use cases just to please
PCI is equally wrong. There is a world outside of PCI and x86.

But yes, dev->irqdomain _is_ part of the problem and even the root of
it. I definitely stared at it before, but for some stupid reason I
stopped staring at it. I blame it on too much email. :)

The point is that up to now we had a strict 1:1 device irqdomain
mapping, which starts at dev->irqdomain. The MSI descriptor storage is
just a consequence of that. And yes, it took a while to understand that
there is a use case for 1:2.

So I need to break that up in a way which caters for both cases, but
does neither create a special case for PCI nor for the rest of the
universe, i.e. the 1:1 case has to be a subset of the 1:2 case which
means all of it is common case. With that solved the storage question
becomes a nobrainer.

When I looked at it again yesterday while writing mail, I went back to
my notes and found the loose ends where I left off. Let me go back and
start over from there.

Thanks,

tglx

2021-12-05 14:16:47

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Sat, Dec 04 2021 at 15:20, Thomas Gleixner wrote:
> On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> So I need to break that up in a way which caters for both cases, but
> does neither create a special case for PCI nor for the rest of the
> universe, i.e. the 1:1 case has to be a subset of the 1:2 case which
> means all of it is common case. With that solved the storage question
> becomes a nobrainer.
>
> When I looked at it again yesterday while writing mail, I went back to
> my notes and found the loose ends where I left off. Let me go back and
> start over from there.

I found out why I stopped looking at it. I came from a similar point of
view what you were suggesting:

>> If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
>> completely unable to handle IMS/MSI/MSI-X at all, and instead, when
>> the driver asks for PCI MSI access we create a new hierarchical
>> irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
>> you outlined for IMS? (again, not saying to do this, but let's ask if
>> that makes more sense than the current configuration)

Which I shot down with:

> That's not really a good idea because dev->irqdomain is a generic
> mechanism and not restricted to the PCI use case. Special casing it for
> PCI is just wrong. Special casing it for all use cases just to please
> PCI is equally wrong. There is a world outside of PCI and x86.

That argument is actually only partially correct.

After studying my notes and some more code (sigh), it looks feasible
under certain assumptions, constraints and consequences.

Assumptions:

1) The irqdomain pointer of PCI devices which is set up during device
discovery is not used by anything else than infrastructure which
knows how to handle it.

Of course there is no guarantee, but I'm not that horrified about
it anymore after chasing the exposure with yet more coccinelle
scripts.

Constraints:

1) This is strictly opt-in and depends on hierarchical irqdomains.

If an architecture/subarchitecture wants to support it then it
needs to rework their PCI/MSI backend to hierarchical irqdomains or
make their PCI/MSI irqdomain ready for the task.

From my inspection of 30+ PCI/MSI irqdomains, most of them should
be trivial to convert. The hard ones are powerpc, XEN and VMD,
where XEN is definitely the most convoluted one.

That means that devices which depend on IMS won't work on anything
which is not up to date.

2) Guest support is strictly opt-in

The underlying architecture/subarchitecture specific irqdomain has
to detect at setup time (eventually early boot), whether the
underlying hypervisor supports it.

The only reasonable way to support that is the availability of
interrupt remapping via vIOMMU, as we discussed before.

3) IOMMU/Interrupt remapping dependency

While IMS just works without interrupt remapping on bare metal, the
fact that there is no reliable way to figure out whether the kernel
runs on bare metal or not, makes it pretty much mandatory, at least
on x86.

That's not a hardcoded constraint. It's a decision made during the
setup of the underlying architecture/subarchitecture specific
irqdomain.

4) The resulting irqdomain hierarchy would ideally look like this:

VECTOR -> [IOMMU, ROUTING, ...] -> PCI/[MSI/MSI-X/IMS] domains

That does not work in all cases due to architecture and host
controller constraints, so we might end up with:

VECTOR -> IOMMU -> SHIM -> PCI/[MSI/MSI-X/IMS] domains

The nice thing about the irqdomain hierarchy concept is that this
does not create any runtime special cases as the base hierarchy is
established at boot or device detection time. It's just another
layer of indirection.

5) The design rules for the device specific IMS irqdomains have to be
documented and enforced to the extent possible.

Rules which I have in my notes as of today:

- The device specific IMS irq chip / irqdomain has to be strictly
separated from the rest of the driver code and can only
interact via the irq chip data which is either per interrupt or
per device.

I have some ideas how to enforce these things to go into
drivers/irqchip/ so they are exposed to scrutiny and not
burried in some "my device is special" driver code and applied
by subsystem maintainers before anyone can even look at it.

- The irqchip callbacks which can be implemented by these top
level domains are going to be restricted.

- For the irqchip callbacks which are allowed/required the rules
vs. following down the hierarchy need to be defined and
enforced.

- To achieve that the registration interface will not be based on
struct irq_chip. This will be a new representation and the core
will convert that into a proper irq chip which fits into the
hierarchy. This provides one central place where the hierarchy
requirements can be handled as they depend on the underlying
MSI domain (IOMMU, SHIM, etc.). Otherwise any change on that
would require to chase the IMS irqchips all over the place.
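The hierarchy from constraint 4 (VECTOR at the root, IOMMU and the per-device PCI/MSI/IMS domains stacked on top) can be modeled with a few lines of C: allocation recurses to the root first and each layer then adds its own part. This is a sketch of the concept only, not the kernel's struct irq_domain.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of a hierarchical irqdomain: each domain points at its
 * parent, allocation walks to the root (VECTOR) first, and each layer
 * records that it participated. Hypothetical types, for illustration.
 */

struct model_domain {
	const char *name;
	struct model_domain *parent;
	int (*alloc)(struct model_domain *d, int virq);	/* optional layer hook */
	int participated;				/* set when this layer ran */
};

static int model_alloc(struct model_domain *d, int virq)
{
	if (d->parent) {
		int ret = model_alloc(d->parent, virq);

		if (ret)
			return ret;
	}
	d->participated = 1;
	return d->alloc ? d->alloc(d, virq) : 0;
}
```

Inserting a SHIM layer is then literally just another node in the parent chain; nothing at the leaf domain changes, which is the "no runtime special cases" property described above.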

Consequences:

1) A more radical split between legacy and hierarchical irqdomain
code in drivers/pci/msi/ into:

- api
- legacy
- irqdomain
- shared

That means that we are going to end up with duplicated code for
some of the mechanisms up to the point where the stuck-in-the-mud
parts either get converted or deleted.

2) The device centric storage concept will stay as it does not make
any sense to push it towards drivers and what's worse it would be a
major pain vs. the not yet up to the task irqdomains and the legacy
architecture backends to change that. I really have no interest in
reworking the legacy code just for that.

It also makes sense because the interrupts are strictly tied to the
device. They cannot originate from some disconnected layer of thin
air.

Sorry Jason, no tables for you. :)

How to get there:

1) I'm going to post part 1-3 of the series once more with the fallout
and review comments addressed.

2) If nobody objects, I'll merge that into tip irq/msi and work on top
of that.

The consolidation makes sense on its own and is required anyway. I
might need to revisit some of the already touched places, but that
should be a fairly limited number. I'd rather do that step by step
on top than going back to start and mixing the new concepts in from
the very beginning.

But I'll drop part 4 in its current form because that's going to be
part of the new infrastructure.

3) I'll work on that bottom up towards a driver exposable API as that
is going to be a result of the final requirements of the underlying
infrastructure.

The final driver visible interfaces can be bikeshedded on top to
make them palatable for driver writers.

4) Convert x86 PCI/MSI[x] to the new scheme

5) Implement an IMS user.

The obvious candidate which should be halfway accessible is the
ath11k PCI driver which falls into that category.

It uses horrendous hackery to make it "work" by abusing MSI. It's a
wonder that it works at all, by some definition of "works".

I'm pretty sure I know how to make it fall apart without touching a single
line of code.

With a small code change I can make it fail hard without blowing up
any other MSI/MSI-X user except the switchtec NTB.

That's a prime example for the way how driver writers behave.

Instead of talking to the people who are responsible for the
interrupt subsystem, they go off and do their own thing. It does
not explode on their test machine, but it's not even remotely close
to the requirements for PCI drivers to work independent of the
underlying platform.

Of course the responsible maintainer does not even notice and waves
it through without questioning it.

Thoughts?

Thanks,

tglx

2021-12-06 14:19:28

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Sat, Dec 04, 2021 at 03:20:36PM +0100, Thomas Gleixner wrote:
> Jason,
>
> On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> > On Fri, Dec 03, 2021 at 04:07:58PM +0100, Thomas Gleixner wrote:
> > Lets do a thought experiment, lets say we forget about the current PCI
> > MSI API.

I've read both your emails, and I'll make a few small remarks on this
one, mainly to try to clearly communicate what I was thinking

Just want to call out the above paragraph, not suggesting we do it,
but the thought exercise of what should we do if not weighed down by
endless engineering cruft is usually informative.

> > What if it worked more like this:
> >
> > probe()
> > // Access the real PCI SIG MSI/MSI-X table
> > mystruct->msi_table = pci_allocate_msi_table(pci_dev);
> >
> > // Special interrupt 0
> > mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
> > request_irq(mystruct->desc0->irq, ..)
>
> A device driver should not even know about msi_desc. Period.

Here, this wasn't really about the msi_desc, it was just about some
pointer-based handle. Call it what you will, and force it to be opaque
to the drivers.

eg this 'msi table' layer can just have a

struct msi_handle;

In headers

and in a C file it can be:

struct msi_handle {
struct msi_desc desc;
};

And achieve what you've asked for
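That pattern is easy to demonstrate in plain C: the "header" side only forward-declares the type, so callers can do nothing but pass the pointer around, while the "implementation" side keeps the layout private. In a real build the two halves would live in separate files; the function names below are made up for the sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* header side: opaque to drivers, no layout visible */
struct msi_handle;
struct msi_handle *model_handle_alloc(int irq);
int model_handle_irq(const struct msi_handle *h);
void model_handle_free(struct msi_handle *h);

/* implementation side: the layout stays private to this "file" */
struct msi_handle {
	int irq;
};

struct msi_handle *model_handle_alloc(int irq)
{
	struct msi_handle *h = malloc(sizeof(*h));

	if (h)
		h->irq = irq;
	return h;
}

int model_handle_irq(const struct msi_handle *h)
{
	return h->irq;
}

void model_handle_free(struct msi_handle *h)
{
	free(h);
}
```

C enforces the cookie property here at compile time: a driver holding only the forward declaration cannot dereference the handle, which is the encapsulation asked for in the previous mail.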

> > - msi_table is a general API for accessing MSIs. Each bus type
> > would provide some kind of logical creation function for their
> > bus standardized MSI table type. eg MSI/MSI-X/etc
>
> You can't give up on that table thing, right? We just established that
> for devices like yours IMS is not necessary a table and does not even
> need the index. The fact that MSI-X uses a table does not make
> everything a nail^Wtable. :)

It is just a name for the idea of a handle to a device's MSI
mechanism.

Call it 'msi_hw' or something.

My concept is that each device, integral to the device, has some ways
to operate the device's MSI storage (MSI/MSI-X/IMS/Platform), so lets
give that a name and give it a struct and an API. It is all still core
code.

Think of it as a clean separation between 'determining the addr/data
pair' and 'writing the addr/data pair to HW'.

Today you are thinking about this object as an irqdomain, irqchip, and
msi xarray - I think?

> > It is a logical handle for the physical resource that holds the MSI
> > addr/data pairs.
> >
> > - Use a pointer instead of idx as the API for referring to a MSI
> > irq. Most drivers probably only use msi_desc->irq?
>
> No. pci_irq_vector() translates the hw index to linux irq number. The
> hardware index is known by the driver when it is doing a batched
> allocation, right?

Yes

> I'm fine with using the info struct I described above as the reference
> cookie.

IMHO an opaque pointer to refer to the MSI is cleaner

> > - We do not have device->msi, it is held in the driver instead.
>
> No. First of all drivers have no business with that, really.

I'm reading your second email and I'm still not entirely clear what
your thinking is for device->msi?

> > dev->irqdomain is always the bus/platform originated irqdomain of
> > the physical device.
>
> This is a guarantee for subtle bugs and I'm not even remotely interested
> going there. See also below.

Not sure I follow this? My suggestion is that it is either as-is today
or NULL and we don't try to use eg mdev->irqdomain for anything.

> > Thus we break the 1:1 of the device and irqdomain. A device can
> > spawn many msi_tables, but only one would be on the dev->irqdomain
>
> Why are you trying brute force to push things into device drivers?
> That's really the wrong direction. We want common infrastructure to be
> managed by generic code. And all of this is about common infrastructure.

The driver needs some kind of handle to tell the core code what MSI
'namespace' it is talking about in every API - eg MSI or IMS.

Pointers are the usual way to make such a handle.

Going along the IMS == MSI principle that says there should be a
driver facing API with a pointer handle for the MSI and a pointer
handle for the IMS.

Yes, this means the existing API is some compat wrapper around a
pointer API and would have to store the pointer handle in the device,
but the idea would be to make that pci->dev ONLY for the compat API,
not used in the IRQ infrastructure.

> > Is it sane? What really needs device->msi anyhow?
>
> device->msi is a container for interrupt management and that's done by
> the interrupt code and not by random device drivers. Again, struct
> msi_desc is a software construct which the interrupt core code and the
> irqdomains and irq chip implementation use for various purposes. Nothing
> outside of this realm has to even know about the fact that this data
> structure exists in the first place. See below.

I imagined that msi_desc remains as part of the interrupt core code
and the 'msi_table' object is another interrupt core code object for
dealing with device's MSI

> > IMS is a logical progression of this concept:
> >
> > probe()
> > mystruct->msi_table = pci_allocate_msi_table(pci_dev);
> > mystruct->ims_irqdomain = <....>
> > mystruct->ims_table = msi_allocate_table(pci_dev->dev, mystruct->ims_irqdomain ..)
> >
> > // Use MSI
> > mystruct->desc0 = msi_table_alloc_vector(mystruct->msi_table, hw_label=0);
> > // Use IMS
> > mystruct->desc1 = msi_table_alloc_vector(mystruct->ims_table, hw_label=0);
> >
> > Not saying we can/should go out and just do something so radical, but
> > as a thought experiment, can it guide toward a direction like this?
>
> What I agree on is that the interface should be in a way that the driver
> can:
>
> 1) Allocate vectors at a given hardware index
>
> 2) Allocate vectors within a given hardware index range
>
> 3) The core returns the hardware index and the Linux irq number in
> a separate info structure which is independent of the interrupt
> stack internal data representations.

I still like an opaque pointer slightly better than a two-entry struct -
but compat is compat..
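The three requirements above (allocate at a fixed hardware index, allocate within a range, hand back hw index and Linux irq number in a separate info struct) can be modeled in a few lines of user-space C. Everything here is an illustrative sketch with invented names, not a proposed kernel interface:

```c
#include <assert.h>

/*
 * Sketch of the requested driver-facing result: the core returns the
 * hardware index and the Linux irq number without exposing any
 * internal msi_desc.
 */
struct msi_irq_info {
	unsigned int hwindex;
	int irq;
};

#define NVEC 8

struct msi_pool {
	int used[NVEC];		/* 1 if the hw index is occupied */
	int next_irq;		/* stand-in Linux irq number allocator */
};

/* Allocate the first free vector in [first, last] and fill *info. */
int msi_alloc_in_range(struct msi_pool *p, unsigned int first,
		       unsigned int last, struct msi_irq_info *info)
{
	for (unsigned int i = first; i <= last && i < NVEC; i++) {
		if (!p->used[i]) {
			p->used[i] = 1;
			info->hwindex = i;
			info->irq = p->next_irq++;
			return 0;
		}
	}
	return -1;
}

/* Allocating at a fixed hw index is the degenerate one-entry range. */
int msi_alloc_at(struct msi_pool *p, unsigned int idx,
		 struct msi_irq_info *info)
{
	return msi_alloc_in_range(p, idx, idx, info);
}
```

The point of the info struct is that it is independent of the interrupt stack's internal data representations, exactly as requirement 3 asks.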

> Sure the driver can get a cookie of some sort to do allocation/free from
> different resource pools, e.g. PCI-MSI[X] and IMS. But the internal data
> representation management has to be done at the core level.

And here I still see it in the core level - this 'msi table' is a core
level data-structure and the opaque 'msi handle' is a core level
data-structure

We are just exposing a handle that the driver can use. You are
calling it a cookie, but IMHO using a cookie instead of a pointer
seems a bit obfuscating?

> even try to make the irqchip/domain code mangled into the other device
> code. It should create the irqdomain with the associated chip and that
> creation process returns a cookie which is passed to the actual device
> specific code. Allocation then can use that cookie and not the irqdomain
> pointer itself.

Sounds like your cookie == my msi_table? Maybe we are all agreeing at
this point then?

> So thanks for being patient in educating me here.

I'm happy you got something out of all these words!

> The fact that we will be able to support this at all is based on years
> of cleanups, consolidation and restructuring of the infrastructure. The
> result of doing this is that my trust in driver developers in that
> regard is very close to zero. The cleanup patches I had to do for this
> series just to solve the single irqdomain case did not help to elevate
> the trust level either.

Yes, it is amazing how many patches this is at already.

Jason

2021-12-06 14:44:19

by Jason Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Sun, Dec 05, 2021 at 03:16:40PM +0100, Thomas Gleixner wrote:
> On Sat, Dec 04 2021 at 15:20, Thomas Gleixner wrote:
> > On Fri, Dec 03 2021 at 12:41, Jason Gunthorpe wrote:
> > So I need to break that up in a way which caters for both cases, but
> > does neither create a special case for PCI nor for the rest of the
> > universe, i.e. the 1:1 case has to be a subset of the 1:2 case which
> > means all of it is common case. With that solved the storage question
> > becomes a nobrainer.
> >
> > When I looked at it again yesterday while writing mail, I went back to
> > my notes and found the loose ends where I left off. Let me go back and
> > start over from there.
>
> I found out why I stopped looking at it. I came from a similar point of
> view what you were suggesting:
>
> >> If IMS == MSI, then couldn't we conceptually have the dev->irqdomain
> >> completely unable to handle IMS/MSI/MSI-X at all, and instead, when
> >> the driver asks for PCI MSI access we create a new hierarchical
> >> irqdomain and link it to a MSI chip_ops or a MSI-X chip_ops - just as
> >> you outlined for IMS? (again, not saying to do this, but let's ask if
> >> that makes more sense than the current configuration)
>
> Which I shot down with:
>
> > That's not really a good idea because dev->irqdomain is a generic
> > mechanism and not restricted to the PCI use case. Special casing it for
> > PCI is just wrong. Special casing it for all use cases just to please
> > PCI is equally wrong. There is a world outside of PCI and x86.
>
> That argument is actually only partially correct.

I'm not sure I understood your reply? I think we are both agreeing
that dev->irqdomain wants to be a generic mechanism?

I'd say that today we've special cased it to handle PCI. IMHO that is
exactly what pci_msi_create_irq_domain() is doing - it replaces the
chip ops with ops that can *ONLY* do PCI MSI and so dev->irqdomain
becomes PCI only and non-generic.

> After studying my notes and some more code (sigh), it looks feasible
> under certain assumptions, constraints and consequences.
>
> Assumptions:
>
> 1) The irqdomain pointer of PCI devices which is set up during device
> discovery is not used by anything else than infrastructure which
> knows how to handle it.
>
> Of course there is no guarantee, but I'm not that horrified about
> it anymore after chasing the exposure with yet more coccinelle
> scripts.

OK


> Constraints:
>
> 1) This is strictly opt-in and depends on hierarchical irqdomains.

OK

> That means that devices which depend on IMS won't work on anything
> which is not up to date.

OK - I think any pressure to get platforms to update their code is
only positive.

> 2) Guest support is strictly opt-in
>
> The underlying architecture/subarchitecture specific irqdomain has
> to detect at setup time (eventually early boot), whether the
> underlying hypervisor supports it.
>
> The only reasonable way to support that is the availability of
> interrupt remapping via vIOMMU, as we discussed before.

This is talking about IMS specifically because of the legacy issue
where the MSI addr/data pair inside a guest is often completely fake?

> 4) The resulting irqdomain hierarchy would ideally look like this:
>
> VECTOR -> [IOMMU, ROUTING, ...] -> PCI/[MSI/MSI-X/IMS] domains

OK, this matches where I have come from as well

> That does not work in all cases due to architecture and host
> controller constraints, so we might end up with:
>
> VECTOR -> IOMMU -> SHIM -> PCI/[MSI/MSI-X/IMS] domains

OK - I don't know enough about the architecture/controller details to
imagine what SHIM is, but if it allows keeping the PCI code as purely
PCI code, then great

> 5) The design rules for the device specific IMS irqdomains have to be
> documented and enforced to the extent possible.
>
> Rules which I have in my notes as of today:
>
> - The device specific IMS irq chip / irqdomain has to be strictly
> separated from the rest of the driver code and can only
> interact via the irq chip data which is either per interrupt or
> per device.

It seems OK, with the observation that IDXD and mlx5 will both need to
set 'per-interrupt' data before provisioning the IRQ

> I have some ideas how to enforce these things to go into
> drivers/irqchip/ so they are exposed to scrutiny and not
> burried in some "my device is special" driver code and applied
> by subsystem maintainers before anyone can even look at it.

Means more modules, but OK

> - The irqchip callbacks which can be implemented by these top
> level domains are going to be restricted.

OK - I think it is great that the driver will see a special ops struct
that is 'ops for device's MSI addr/data pair storage'. It makes it
really clear what it is

> - For the irqchip callbacks which are allowed/required the rules
> vs. following down the hierarchy need to be defined and
> enforced.

The driver should be the ultimate origin of the interrupt so it is
always the end-point in the hierarchy, opposite the CPU?

I would hope the driver doesn't have an exposure to hierarchy?

> - To achieve that the registration interface will not be based on
> struct irq_chip. This will be a new representation and the core
> will convert that into a proper irq chip which fits into the
> hierarchy. This provides one central place where the hierarchy
> requirements can be handled as they depend on the underlying
> MSI domain (IOMMU, SHIM, etc.). Otherwise any change on that
> would require to chase the IMS irqchips all over the place.

OK, I like this too.

So we have a new concept: 'device MSI storage ops'

Put them along with the xarray holding the msi_descs and you've got my
msi_table :)


> 2) The device centric storage concept will stay as it does not make
> any sense to push it towards drivers and what's worse it would be a
> major pain vs. the not yet up to the task irqdomains and the legacy
> architecture backends to change that. I really have no interest to
> make the legacy code
>
> It also makes sense because the interrupts are strictly tied to the
> device. They cannot originate from some disconnected layer of thin
> air.
>
> Sorry Jason, no tables for you. :)

How does the driver select which 'device MSI storage ops' it is
requesting an MSI for?

> 1) I'm going to post part 1-3 of the series once more with the fallout
> and review comments addressed.

OK, I didn't see anything in there that was making anything harder in
this direction

> 5) Implement an IMS user.
>
> The obvious candidate which should be halfways accessible is the
> ath11k PCI driver which falls into that category.

Aiiee:

drivers/net/wireless/ath/ath11k/pci.c: ab_pci->msi_ep_base_data = msi_desc->msg.data;

So, we already have two in-tree PCI IMS devices!!

Agree this makes a lot of sense to focus on some first steps

Along with NTB which is in the same general camp

Thanks,
Jason

2021-12-06 15:09:25

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Jason,

On Mon, Dec 06 2021 at 10:19, Jason Gunthorpe wrote:
> On Sat, Dec 04, 2021 at 03:20:36PM +0100, Thomas Gleixner wrote:
>> even try to make the irqchip/domain code mangled into the other device
>> code. It should create the irqdomain with the associated chip and that
>> creation process returns a cookie which is passed to the actual device
>> specific code. Allocation then can use that cookie and not the irqdomain
>> pointer itself.
>
> Sounds like your cookie == my msi_table? Maybe we are all agreeing at
> this point then?

I think so. It's going to be something the driver can use as a
reference. Same for the actual interrupt allocated through this domain
reference.

>> So thanks for being patient in educating me here.
>
> I'm happy you got something out of all these words!

Definitely so. That's why we are having these discussions, right?

The shiny irq subsystem is not so shiny when the drivers cannot use
it. OTOH, it's often enough the case that driver folks want to use it
the wrong way just because.

> Yes, it is amazing how many patches this is at already.

Don't worry. You'll get a few more patch bombs in your inbox until IMS
is supported, unless you want to be "unsubscribed".

Thanks,

tglx

2021-12-06 16:09:21

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Jason,

On Mon, Dec 06 2021 at 10:43, Jason Gunthorpe wrote:
> On Sun, Dec 05, 2021 at 03:16:40PM +0100, Thomas Gleixner wrote:
>> > That's not really a good idea because dev->irqdomain is a generic
>> > mechanism and not restricted to the PCI use case. Special casing it for
>> > PCI is just wrong. Special casing it for all use cases just to please
>> > PCI is equally wrong. There is a world outside of PCI and x86.
>>
>> That argument is actually only partially correct.
>
> I'm not sure I understood your reply? I think we are both agreeing
> that dev->irqdomain wants to be a generic mechanism?

Yes. I managed to confuse myself there by being too paranoid about how
to distinguish things on platforms which need to support both ways, i.e.
x86 when XEN is enabled.

> I'd say that today we've special cased it to handle PCI. IMHO that is
> exactly what pci_msi_create_irq_domain() is doing - it replaces the
> chip ops with ops that can *ONLY* do PCI MSI and so dev->irqdomain
> becomes PCI only and non-generic.

Right. See above. That's why I went back to my notes, did some more
research ...

>> 2) Guest support is strictly opt-in
>>
>> The underlying architecture/subarchitecture specific irqdomain has
>> to detect at setup time (eventually early boot), whether the
>> underlying hypervisor supports it.
>>
>> The only reasonable way to support that is the availability of
>> interrupt remapping via vIOMMU, as we discussed before.
>
> This is talking about IMS specifically because of the legacy issue
> where the MSI addr/data pair inside a guest is often completely fake?

This is about IMS, right. PCI/MSI[x] is handled today because the writes
to the MSI/MSI-X message store can be trapped.

>> That does not work in all cases due to architecture and host
>> controller constraints, so we might end up with:
>>
>> VECTOR -> IOMMU -> SHIM -> PCI/[MSI/MSI-X/IMS] domains
>
> OK - I don't know enough about the architecture/controller details to
> imagine what SHIM is, but if it allows keeping the PCI code as purely
> PCI code, then great

It's today part of the arch/subarch specific PCI/MSI domain to deal with
quirks above the IOMMU level. As we can't proliferate that into the new
endpoint domain, that needs to be done as a shim layer in between which
has no real other functionality than applying the quirks. Yes, it's all
pretty. Welcome to my wonderful world.

>> - The irqchip callbacks which can be implemented by these top
>> level domains are going to be restricted.
>
> OK - I think it is great that the driver will see a special ops struct
> that is 'ops for device's MSI addr/data pair storage'. It makes it
> really clear what it is

It will need some more than that, e.g. mask/unmask and as we discussed
quite some time ago something like the irq_buslock/unlock pair, so you
can handle updates to the state from thread context via a command queue
(IIRC).

>> - For the irqchip callbacks which are allowed/required the rules
>> vs. following down the hierarchy need to be defined and
>> enforced.
>
> The driver should be the ultimate origin of the interrupt so it is
> always end-point in the hierarchy, opposite the CPU?
>
> I would hope the driver doesn't have an exposure to hierarchy?

No.

> So we have a new concept: 'device MSI storage ops'
>
> Put them along with the xarray holding the msi_descs and you've got my
> msi_table :)

Hehe.

>> Sorry Jason, no tables for you. :)
>
> How does the driver select which 'device MSI storage ops' it is
> requesting an MSI for?

Via some cookie, reference whatever as discussed in the other
mail. We'll bikeshed the naming once I get there :)

>> 1) I'm going to post part 1-3 of the series once more with the fallout
>> and review comments addressed.
>
> OK, I didn't see anything in there that was making anything harder in
> this direction

It's helping to keep the existing stuff including the !irqdomain parts
sufficiently self contained so I can actually change the inner workings
of msi domains without going back to any of these places (hopefully).

>> 5) Implement an IMS user.
>>
>> The obvious candidate which should be halfways accessible is the
>> ath11k PCI driver which falls into that category.
>
> Aiiee:

Yes.

> drivers/net/wireless/ath/ath11k/pci.c: ab_pci->msi_ep_base_data = msi_desc->msg.data;

That's only one part of it. Look how the address is retrieved.

Thanks,

tglx

2021-12-06 17:05:46

by Jason Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Mon, Dec 06, 2021 at 04:47:58PM +0100, Thomas Gleixner wrote:

> >> - The irqchip callbacks which can be implemented by these top
> >> level domains are going to be restricted.
> >
> > OK - I think it is great that the driver will see a special ops struct
> > that is 'ops for device's MSI addr/data pair storage'. It makes it
> > really clear what it is
>
> It will need some more than that, e.g. mask/unmask and as we discussed
> quite some time ago something like the irq_buslock/unlock pair, so you
> can handle updates to the state from thread context via a command queue
> (IIRC).

Yes, I was thinking about all of that in here.

Let me ask a slightly different question

pci_msi_create_irq_domain() hooks into the platform's irq_chip as an
alternative to hierarchical irq domains (?)

eg:

chip->irq_write_msi_msg = pci_msi_domain_write_msg;

And we see:

void pci_msi_domain_write_msg(struct irq_data *irq_data, struct msi_msg *msg)
{
struct msi_desc *desc = irq_data_get_msi_desc(irq_data);

Now, if we have your idea to have some:

struct msi_storage_ops {
write_msg;
mask;
unmask;
lock;
unlock;
};

Which is how a driver plugs in its storage operations.

In almost all cases 'ops' will come along with a 'state', so let's
create one:

struct msi_storage { // Look, I avoided the word table!
const struct msi_storage_ops *ops
};

ie:

struct msi_storage_ops {
void (*write_msg)(struct msi_storage *msi, struct msi_desc *desc, struct msi_msg *msg);


Now, what if we made a 'generic_msi_create_irq_domain()' that hooks
the irq_chip with something like this:

void generic_msi_domain_write_msg(struct irq_data *irq_data, struct msi_msg *msg)
{
struct msi_desc *desc = irq_data_get_msi_desc(irq_data);
struct msi_storage *msi = desc->msi;

msi->ops->write_msg(msi, desc, msg);
}

And then have what pci_msi_domain_write_msg() did be accomplished by
setting desc->storage to a pci_msi_storage or pci_msix_storage whose
msi_storage_ops point at pci_msi_domain_write_msg/etc

Then we can transform:

void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
struct pci_dev *dev = msi_desc_to_pci_dev(entry);

into

struct pci_msi_storage {
struct msi_storage storage;
struct pci_dev *dev;
unsigned int msi_cap_off;
};

void pci_write_msi64_msg(struct msi_storage *storage, struct msi_desc *desc, struct msi_msg *msg)
{
struct pci_msi_storage *msi = container_of(storage, struct pci_msi_storage, storage);
unsigned int pos = msi->msi_cap_off;
struct pci_dev *dev = msi->dev;
u16 msgctl;

pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
msgctl &= ~PCI_MSI_FLAGS_QSIZE;
msgctl |= desc->msi_attrib.multiple << 4;
pci_write_config_word(dev, pos + PCI_MSI_FLAGS, msgctl);

pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_LO, msg->address_lo);
pci_write_config_dword(dev, pos + PCI_MSI_ADDRESS_HI, msg->address_hi);
pci_write_config_word(dev, pos + PCI_MSI_DATA_64, msg->data);
/* Ensure that the writes are visible in the device */
pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
desc->msg = *msg; // in core code instead ?
}

And others for the different cases. Look, no ifs!

OK?
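The dispatch pattern sketched above (generic write hook -> storage ops -> container_of back to the device-private struct) can be modeled in self-contained user-space C. The names mirror the sketch, but the desc argument is dropped for brevity and a fake register array stands in for config space; none of this is real kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal container_of, as in the kernel. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct msi_msg { unsigned int address_lo, address_hi, data; };

struct msi_storage;
struct msi_storage_ops {
	void (*write_msg)(struct msi_storage *msi, struct msi_msg *msg);
};
struct msi_storage { const struct msi_storage_ops *ops; };

/* A "device" embeds the generic storage next to its own state. */
struct fake_pci_storage {
	struct msi_storage storage;
	unsigned int regs[3];	/* stands in for config space */
};

static void fake_pci_write_msg(struct msi_storage *storage,
			       struct msi_msg *msg)
{
	struct fake_pci_storage *dev =
		container_of(storage, struct fake_pci_storage, storage);

	dev->regs[0] = msg->address_lo;
	dev->regs[1] = msg->address_hi;
	dev->regs[2] = msg->data;
}

static const struct msi_storage_ops fake_pci_ops = {
	.write_msg = fake_pci_write_msg,
};

/* The core-code side only ever sees the generic object. */
void generic_write_msg(struct msi_storage *msi, struct msi_msg *msg)
{
	msi->ops->write_msg(msi, msg);
}
```

The key property is that `generic_write_msg()` contains no ifs and no PCI knowledge; the device-specific behavior lives entirely behind the ops pointer.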

Now, we have some duplication between the struct msi_storage_ops and
the struct irq_chip. Let's see what that is about:

arch/x86/kernel/apic/msi.c: .irq_write_msi_msg = dmar_msi_write_msg,
arch/x86/kernel/hpet.c: .irq_write_msi_msg = hpet_msi_write_msg,

Surprised! These are actually IMS. The HPET and DMAR devices both have
device-specific message storage! So these could use
msi_storage_ops. And WTF is IOMMU DMAR driver code doing in
apic/msi.c ???

arch/powerpc/platforms/pseries/msi.c: .irq_write_msi_msg = pseries_msi_write_msg,

AFAICT this is really like virtualization? The hypervisor is
controlling the real MSI table and what the OS sees is faked out
somewhat.

This is more of a quirk in the PCI MSI implementation (do not touch the
storage) and a block on non-PCI uses of MSI, similar to what x86 needs?

drivers/irqchip/irq-gic-v2m.c: .irq_write_msi_msg = pci_msi_domain_write_msg,
drivers/irqchip/irq-gic-v3-its-pci-msi.c: .irq_write_msi_msg = pci_msi_domain_write_msg,
drivers/irqchip/irq-gic-v3-mbi.c: .irq_write_msi_msg = pci_msi_domain_write_msg,

ARM seems to be replacing the 'mask at source' with 'mask at
destination' - I wonder why?

Should this really be hierarchical where we mask *both* the MSI
originating device (storage_ops->mask) and at the CPU IRQ controller?
(gicv2m_mask_msi_irq ?) if it can?

drivers/base/platform-msi.c: chip->irq_write_msi_msg = platform_msi_write_msg;

Oh! this is doing what I kind of just suggested, just non-generically
and hacked into platform bus drivers the same as PCI does:

static void platform_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
{
struct msi_desc *desc = irq_data_get_msi_desc(data);
struct platform_msi_priv_data *priv_data;

priv_data = desc->platform.msi_priv_data;

priv_data->write_msg(desc, msg);
}

platform_msi entirely gets deleted. Instead all platform drivers using
it will use IMS - set up a msi_storage_ops with
storage_ops->write_msg == platform_msi_priv_data::write_msg and
allocate a msi_storage someplace.

So, at least to first order, we could have a world where the
irq_chip does not overlap struct msi_storage_ops - ie we move the
MSI-related things from irq_chip to msi_storage_ops.

Then, all the core code places calling into chip->[msi] would instead
find the msi_storage_op from the irq_data. ie like (or better)

irq_data->desc->storage->ops

Now we have a real clear split of responsibility.

The irq_domain hierarchy and irq_chip are all about classic wired
interrupts and the interrupt controller path after the device launches
the message. For MSI its job is to determine the addr/data.

msi_storage_ops is all about interrupt origin points that can hold an
addr/data pair. Its job is to store the addr/data into the
device-specific registers, do mask-at-source, locking, etc.

PCI, HPET, DMAR move to msi_storage_ops instead of using irq_chip

Then you get what you asked at the start, the platform's irq_domain is
now fully generic and can be 1:N to different MSI storage providers.

For API compat every pci struct device will have to instantiate a
msi_storage someplace, but that seems easy enough.

Seems like a nice uniform solution?

Jason

2021-12-06 20:28:53

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Jason,

On Mon, Dec 06 2021 at 13:00, Jason Gunthorpe wrote:
> On Mon, Dec 06, 2021 at 04:47:58PM +0100, Thomas Gleixner wrote:
>> It will need some more than that, e.g. mask/unmask and as we discussed
>> quite some time ago something like the irq_buslock/unlock pair, so you
>> can handle updates to the state from thread context via a command queue
>> (IIRC).
>
> pci_msi_create_irq_domain() hooks into the platform's irq_chip as an
> alternative to hierarchical irq domains (?)

It's based on hierarchical irqdomains. It's the outermost domain and irq
chip. On X86:

[VECTOR chip=apic]->[PCI chip=PCI]
or
[VECTOR chip=apic]->[IOMMU chip=IR]->[PCI chip=PCI-IR]

The chips are different because of quirks. See below :)

> chip->irq_write_msi_msg = pci_msi_domain_write_msg;
> In almost all cases 'ops' will come along with a 'state', so let's
> create one:
>
> struct msi_storage { // Look, I avoided the word table!

I already made a note that I need to smuggle a table in somewhere :)

> And others for the different cases. Look no ifs!
>
> OK?

That's already the plan in some form, but there's a long way towards
that. See below.

Also there will be a question of how many different callbacks we're
going to create just to avoid one conditional. At some point this might
become silly.

> Now, we have some duplication between the struct msi_storage_ops and
> the struct irq_chip. Let's see what that is about:
>
> arch/x86/kernel/apic/msi.c: .irq_write_msi_msg = dmar_msi_write_msg,
> arch/x86/kernel/hpet.c: .irq_write_msi_msg = hpet_msi_write_msg,
>
> Surprised! These are actually IMS. The HPET and DMAR devices both have
> device-specific message storage! So these could use
> msi_storage_ops. And WTF is IOMMU DMAR driver code doing in
> apic/msi.c ???

Historical reasons coming from the pre-irqdomain era. Also DMAR needs
direct access to the x86 low level composer which we didn't want to
expose. Plus DMAR is shared with ia64 to make it more interesting.

Yes, they can be converted. But that's the least of my worries. Those
are straightforward and not really relevant for the design.

> arch/powerpc/platforms/pseries/msi.c: .irq_write_msi_msg = pseries_msi_write_msg,
>
> AFAICT this is really like virtualization? The hypervisor is
> controlling the real MSI table and what the OS sees is faked out
> somewhat.
>
> This is more of a quirk in the PCI MSI implementation (do not touch the
> storage) and a block on non-PCI uses of MSI, similar to what x86 needs?

There is an underlying hypervisor of some sorts and that stuff needs to
deal with it. I leave that to the powerpc wizards to sort out.

> drivers/irqchip/irq-gic-v2m.c: .irq_write_msi_msg = pci_msi_domain_write_msg,
> drivers/irqchip/irq-gic-v3-its-pci-msi.c: .irq_write_msi_msg = pci_msi_domain_write_msg,
> drivers/irqchip/irq-gic-v3-mbi.c: .irq_write_msi_msg = pci_msi_domain_write_msg,
>
> ARM seems to be replacing the 'mask at source' with 'mask at
> destination' - I wonder why?

Because the majority of PCI/MSI endpoint implementations do not provide
masking...

We're telling hardware people for 15+ years that this is a horrible
idea, but it's as effective as talking to a wall. Sure, the spec allows
them to make my life miserable...

> Should this really be hierarchical where we mask *both* the MSI
> originating device (storage_ops->mask) and at the CPU IRQ controller?
> (gicv2m_mask_msi_irq ?) if it can?

I wish I could mask underneath for some stuff on x86. Though that would
not help with the worst problem vs. affinity settings. See the horrible
dance in:

x86/kernel/apic/msi.c::msi_set_affinity()

So this will end up with a shim as the base domain for !IOMMU systems:

                                             |--[HPET]
[VECTOR chip=apic]-|--[x86-msi chip=x86-msi]-|--[PCI/MSI]
                   |--[DMAR]                 |--[PCI/MSI-X]

That nonsense can't move into the apic domain set_affinity() callback as
this is not required when interrupt remapping is enabled.

With IOMMU this looks then:

                                      |--[HPET]
[VECTOR chip=apic]-|--[IOMMU chip=IR]-|--[PCI/MSI]
                   |--[DMAR]          |--[PCI/MSI-X]
                                      |--[PCI/IMS]

> drivers/base/platform-msi.c: chip->irq_write_msi_msg = platform_msi_write_msg;
>
> Oh! this is doing what I kind of just suggested, just non-generically
> and hacked into platform bus drivers the same as PCI does:

Correct. It's a hack and it's on the list of things which need to
vanish. I was already discussing that with Marc on the side for quite a
while.

> PCI, HPET, DMAR move to msi_storage_ops instead of using irq_chip

With different parent domains. DMAR always hangs directly off the vector
domain. HPET has its own IOMMU zone.

You forgot IO/APIC which is a MSI endpoint too, just more convoluted but
it's not using MSI domains so it's not in the way. I'm not going to
touch that with a ten foot pole. :)

There's also VMD, HyperV and the XEN crime which is a horrible shim to
make device->msi_domain consistent on x86. For fixing XEN properly I'm
not masochistic enough.

> For API compat every pci struct device will have to instantiate a
> msi_storage someplace, but that seems easy enough.

That's easy to hide in the existing driver interfaces for PCI/MSI and
PCI/MSI-X.

> Seems like a nice uniform solution?

That's where I'm heading.

I have a full inventory of the various horrors involved, so I have a
pretty good picture what kind of oddities are involved, where a shim
domain is required and which underlying platforms require the MSI irq
chip to do:

irq_chip_mask()
    msi_ops->mask()
    parent->chip->mask()

and so forth. I need to figure out how the parent irq chip / irqdomain
transports that requirement.
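The mask cascade sketched above can be modeled as a toy two-level chain, where masking the MSI endpoint also masks the parent controller. All names are purely illustrative, not kernel irq_chip API:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a two-level irq chip chain. */
struct toy_chip {
	struct toy_chip *parent;
	int masked;
	void (*mask)(struct toy_chip *chip);
};

/* Parent chip: masks at the CPU/interrupt-controller side. */
static void parent_mask(struct toy_chip *chip)
{
	chip->masked = 1;
}

/*
 * Outer MSI chip: masks at the originating device first, then
 * propagates the mask down the hierarchy to the parent, mirroring
 * the irq_chip_mask() -> msi_ops->mask() -> parent->chip->mask()
 * sequence described above.
 */
static void msi_endpoint_mask(struct toy_chip *chip)
{
	chip->masked = 1;
	if (chip->parent && chip->parent->mask)
		chip->parent->mask(chip->parent);
}
```

The interesting design question is exactly the one raised in the thread: which layers in a given hierarchy actually need this propagation, and how the parent domain advertises that requirement.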

But that part is not where the real work is. I'll get there eventually
once I sorted the underlying parts:

- Building up the infrastructure in kernel/irq/

- Decomposing pci/msi/* further

- Make use of the infrastructure for an alternate pci/msi
implemention.

- Have a transition mechanism to convert one part at a time to keep
the patch sizes reviewable and the whole mess bisectable.

Doing all this while keeping the full universe of legacy/arch, current
PCI/MSI domains alive makes that interesting. I broke the world in the
past, so I'm not afraid of doing it again. Though I try to avoid it to
the extent possible. :)

I have a pretty good picture in my head and notes already, which needs
to be dumped into code. But let me post part 1-3 V2 first, so that pile
gets out of the way. Not having to juggle 90 patches makes life easier.

Thanks,

tglx

2021-12-06 21:06:15

by Jason Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Mon, Dec 06, 2021 at 09:28:47PM +0100, Thomas Gleixner wrote:

> That's already the plan in some form, but there's a long way towards
> that. See below.

Okay, then I think we are thinking the same sorts of things; it is
good to see.

> Also there will be a question of how many different callbacks we're
> going to create just to avoid one conditional. At some point this might
> become silly.

Yes, but I think the overall point is that we could have ops
that do exactly what they need, and we can choose which makes the
most sense when we get there. ie the 64 vs 32 split is probably silly.

> > Surprised! These are actually IMS. The HPET and DMAR devices both
> > have device-specific message storage! So these could use
> > msi_storage_ops. And WTF is IOMMU DMAR driver code doing in
> > apic/msi.c ???
>
> Historical reasons coming from the pre-irqdomain era. Also DMAR needs
> direct access to the x86 low level composer which we didn't want to
> expose. Plus DMAR is shared with ia64 to make it more interesting.
>
> Yes, they can be converted. But that's the least of my worries. Those
> are straightforward and not really relevant for the design.
>
> > arch/powerpc/platforms/pseries/msi.c: .irq_write_msi_msg = pseries_msi_write_msg,
> >
> > AFAICT this is really like virtualization? The hypervisor is
> > controlling the real MSI table and what the OS sees is faked out
> > somewhat.
> >
> > This is more of quirk in the PCI MSI implementation (do not touch the
> > storage) and a block on non-PCI uses of MSI similar to what x86 needs?
>
> There is an underlying hypervisor of some sorts and that stuff needs to
> deal with it. I leave that to the powerpc wizards to sort out.
>
> > drivers/irqchip/irq-gic-v2m.c: .irq_write_msi_msg = pci_msi_domain_write_msg,
> > drivers/irqchip/irq-gic-v3-its-pci-msi.c: .irq_write_msi_msg = pci_msi_domain_write_msg,
> > drivers/irqchip/irq-gic-v3-mbi.c: .irq_write_msi_msg = pci_msi_domain_write_msg,
> >
> > ARM seems to be replacing the 'mask at source' with 'mask at
> > destination' - I wonder why?
>
> Because the majority of PCI/MSI endpoint implementations do not provide
> masking...
>
> We're telling hardware people for 15+ years that this is a horrible
> idea, but it's as effective as talking to a wall. Sure the spec grants
> them to make my life miserable...
>
> > Should this really be hierarchical where we mask *both* the MSI
> > originating device (storage_ops->mask) and at the CPU IRQ controller?
> > (gicv2m_mask_msi_irq ?) if it can?
>
> I wish I could mask underneath for some stuff on x86. Though that would
> not help with the worst problem vs. affinity settings. See the horrible
> dance in:

My thinking here is that this stuff in ARM is one of the different
cases (ie not using MSI_FLAG_USE_DEF_CHIP_OPS), and I guess we can
just handle it cleanly by having the core call both the irq_chip->mask
and the msi_storage_ops->mask and we don't need ARM to be different,
x86 just won't provide a mask at destination op.

> x86/kernel/apic/msi.c::msi_set_affinity()

Okay, so it is complicated, but it is just calling
irq_data_get_irq_chip(irqd)->irq_write_msi_msg(irqd, msg);

So, from a msi_storage_ops perspective, things are still clean.

> > PCI, HPET, DMAR move to msi_storage_ops instead of using irq_chip
>
> With different parent domains. DMAR hangs always directly off the vector
> domain. HPET has its own IOMMU zone.
>
> You forgot IO/APIC which is a MSI endpoint too, just more convoluted but
> it's not using MSI domains so it's not in the way. I'm not going to
> touch that with a ten foot pole. :)

I left off IOAPIC because I view it as conceptually different. I used
the phrase "device that originated the interrupt" deliberately;
IOAPIC is just a middle box that converts from a physical interrupt
line to a message world, it belongs with the physical interrupt
infrastructure.

Possibly the IOAPIC considerations are what motivated some of this to
look the way it does today, because it really was trying to hide MSI
under normal PCI INTx physical pins with full compatibility. We kind
of kept doing that as MSI grew into its own thing.

> > Seems like a nice uniform solution?
>
> That's where I'm heading.

Okay, it looks like a lot TBH, but I think the direction is clean and
general.

I'm curious to see if you end up with irq_domains and irq_chips along
with what I labeled as the msi_storage above, or if those turn out to
be unnecessary for the driver to provide MSI programming.

Also, if msi_storage_ops can be robust enough you'd be comfortable
with it in a driver .c file and just a regex match in the MAINTAINERS
file :)

> - Have a transition mechanism to convert one part at a time to keep
> the patch sizes reviewable and the whole mess bisectable.

This seems difficult all on its own..

> I have a pretty good picture in my head and notes already, which needs
> to be dumped into code. But let me post part 1-3 V2 first, so that pile
> gets out of the way. Not having to juggle 90 patches makes life easier.

Okay! Thanks

Jason

2021-12-06 22:21:25

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Mon, Dec 06 2021 at 17:06, Jason Gunthorpe wrote:
> On Mon, Dec 06, 2021 at 09:28:47PM +0100, Thomas Gleixner wrote:
>> I wish I could mask underneath for some stuff on x86. Though that would
>> not help with the worst problem vs. affinity settings. See the horrible
>> dance in:
>
> My thinking here is that this stuff in ARM is one of the different
> cases (ie not using MSI_FLAG_USE_DEF_CHIP_OPS), and I guess we can
> just handle it cleanly by having the core call both the irq_chip->mask
> and the msi_storage_ops->mask and we don't need ARM to be different,
> x86 just won't provide a mask at destination op.
>
>> x86/kernel/apic/msi.c::msi_set_affinity()
>
> Okay, so it is complicated, but it is just calling
> irq_data_get_irq_chip(irqd)->irq_write_msi_msg(irqd, msg);
>
> So, from a msi_storage_ops perspective, things are still clean.

Yes.

>> You forgot IO/APIC which is a MSI endpoint too, just more convoluted but
>> it's not using MSI domains so it's not in the way. I'm not going to
>> touch that with a ten foot pole. :)
>
> I left off IOAPIC because I view it as conceptually different. I used
> the phrase "device that originated the interrupt" deliberately;
> IOAPIC is just a middle box that converts from a physical interrupt
> line to a message world, it belongs with the physical interrupt
> infrastructure.

I mentioned it because there is mbigen on arm64 which is the same thing,
translates hundreds of wire inputs into MSI. It's even worse than
IO/APIC. There is a horrible hack to make it "work" which Marc and I are
looking at whether we can kill it on the way.

> Possibly the IOAPIC considerations are what motivated some of this to
> look the way it does today, because it really was trying to hide MSI
> under normal PCI INTx physical pins with full compatibility. We kind
> of kept doing that as MSI grew into its own thing.

Not really. It was more to avoid having a completely separate
infrastructure for irqdomain-based MSI[X]. Laziness and lack of time
added the rest of the non-motivation :)

> I'm curious to see if you end up with irq_domains and irq_chips along
> with what I labeled as the msi_storage above, or if those turn out to
> be unnecesary for the driver to provide MSI programming.

I can't avoid irq chips because from the interrupt handling point of view
they're unavoidable unless we create a duplicate zoo there. What I have
in mind is to convert the msi ops provided by the device driver into a
real chip, as that just falls into place without further changes.

The irqdomain will be real as well just to make things consistent and to
share as much code as possible.

> Also, if msi_storage_ops can be robust enough you'd be comfortable
> with it in a driver .c file and just a regex match in the MAINTAINERS
> file :)

That might work. Let's see when we are there.

>> - Have a transition mechanism to convert one part at a time to keep
>> the patch sizes reviewable and the whole mess bisectable.
>
> This seems difficult all on its own..

I've done that before. It just needs some thought.

Thanks,

tglx

2021-12-09 05:23:48

by Tian, Kevin

Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Thomas Gleixner <[email protected]>
> Sent: Thursday, December 2, 2021 5:45 AM
>
> On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
> > On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
> >>> The hardware implementation does not have enough MSIX vectors for
> >>> guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
> >>> vectors. So if we are to do MSI-X for all of them, then we need to do
> >>> the IMS backed MSIX scheme rather than passthrough IMS to guests.
> >> Confused. Are you talking about passing a full IDXD device to the guest
> >> or about passing a carved out subdevice, aka. queue?
> >
> > I'm talking about carving out a subdevice. I had the impression of you
> > wanting IMS passed through for all variations. But it sounds like for a
> > sub-device, you are ok with the implementation of MSIX backed by IMS?
>
> I don't see anything wrong with that. A subdevice is its own entity and
> VFIO can choose the most convenient representation of it to the guest
> obviously.
>
> How that is backed on the host does not really matter. You can expose
> MSI-X to the guest with a INTx backing as well.
>

Agree with this point. How the interrupts are represented to the guest
is orthogonal to how the backend resource is allocated. Physically, MSI-X
and IMS can be enabled simultaneously on an IDXD device. Once
dynamic allocation is allowed for both, either one can be allocated for
a subdevice (the only difference being the number of supported subdevices).

When an interrupt resource is exposed to the guest with the same type
(e.g. MSI-on-MSI or IMS-on-IMS), it can also be passed through to the
guest as long as hypercall machinery is in place to get the addr/data pair
from the host (as you suggested earlier).

Thanks
Kevin

2021-12-09 05:41:20

by Tian, Kevin

Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, December 2, 2021 9:55 PM
>
> Further, there is no reason why IMS should be reserved exclusively for
> VFIO!

This is correct. Just as you agreed with Thomas, the only difference
between IMS and MSI is where the messages are stored. Physically
it is unreasonable for the HW to check whether an interrupt is used for
a specific software purpose (e.g. virtualization) since it doesn't have
such knowledge. At most an entry is associated with a PASID, but again
the PASID can be used anywhere.

The wording in current IDXD spec is not clear on this part. We'll talk to
the spec owner to have it fixed.

Thanks
Kevin

2021-12-09 05:47:34

by Jason Wang

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 9, 2021 at 1:41 PM Tian, Kevin <[email protected]> wrote:
>
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, December 2, 2021 9:55 PM
> >
> > Further, there is no reason why IMS should be reserved exclusively for
> > VFIO!
>
> This is correct. Just as what you agreed with Thomas, the only difference
> between IMS and MSI is on where the messages are stored. Physically
> it is unreasonable for the HW to check whether an interrupt is used for
> a specific software usage (e.g. virtualization) since it doesn't have such
> knowledge. At most an entry is associated to PASID, but again the PASID
> can be used anywhere.

Note that vDPA had the plan to use IMS for the parent that can have a
huge amount of instances.

Thanks

>
> The wording in current IDXD spec is not clear on this part. We'll talk to
> the spec owner to have it fixed.
>
> Thanks
> Kevin
>


2021-12-09 06:26:50

by Tian, Kevin

Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, December 4, 2021 12:41 AM
>
> > Or has each queue and controlblock and whatever access to a shared large
> > array where the messages are stored and the indices are handed out to
> > the queues and controlblocks?
>
> > If each of them have their own small array, then queue relative indexing
> > makes a ton of sense, no?
>
> Okay, I see.
>
> I don't know of any use case for more than one interrupt on a queue,
> and if it did come up I'd probably approach it by making the queue
> handle above also specify the 'queue relative HW index'
>

We have such a use case with IDXD.

Basically the IDXD queue allows software to put an interrupt handle
(the index of an MSI-X or IMS entry) in the submitted descriptor. Upon
completion of the descriptor the hardware finds the specified entry
and then generates an interrupt to notify software.

Conceptually, descriptors submitted to the same queue can use different
handles, implying one queue can be associated with multiple interrupts.

One example is the shared work queue usage, which allows multiple
clients to directly and simultaneously submit descriptors to the
same queue using the ENQCMD(pasid, descriptor) instruction. In this
case each client can be allocated an interrupt entry (including the
information about the client's pasid for the permission check when the
HW generates the completion interrupt) and then use this entry for
all descriptors submitted by that client.

I haven't finished reading this thread, but would like to point
out this usage so it is not ignored in the final rework. It basically
means one queue might be associated with multiple interrupt entries
and multiple pasids.

Thanks
Kevin

2021-12-09 08:37:13

by Thomas Gleixner

Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 09 2021 at 05:23, Kevin Tian wrote:
>> From: Thomas Gleixner <[email protected]>
>> I don't see anything wrong with that. A subdevice is its own entity and
>> VFIO can choose the most convenient representation of it to the guest
>> obviously.
>>
>> How that is backed on the host does not really matter. You can expose
>> MSI-X to the guest with a INTx backing as well.
>>
>
> Agree with this point. How the interrupts are represented to the guest
> is orthogonal to how the backend resource is allocated. Physically MSI-X
> and IMS can be enabled simultaneously on an IDXD device. Once
> dynamic allocation is allowed for both, either one can be allocated for
> a subdevice (with only difference on supported #subdevices).
>
> When an interrupt resource is exposed to the guest with the same type
> (e.g. MSI-on-MSI or IMS-on-IMS), it can be also passed through to the
> guest as long as a hypercall machinery is in place to get addr/data pair
> from the host (as you suggested earlier).

As I pointed out in the conclusion of this thread, IMS is only going to
be supported with interrupt remapping in place on both host and guest.

As these devices require a vIOMMU on the guest anyway (PASID, User
IO page tables), the required hypercalls are part of the vIOMMU/IR
implementation. If you look at it from the irqdomain hierarchy view:

                       |- PCI-MSI
VECTOR -- [v]IOMMU/IR -|- PCI-MSI-X
                       |- PCI-IMS

So host and guest use just the same representation which makes a ton of
sense.

There are two places where this matters:

1) The activate() callback of the IR domain

2) The irq_set_affinity() callback of the irqchip associated with the
IR domain

Both callbacks are allowed to fail and the error code is handed back to
the originating call site.

If you look at the above hierarchy view then MSI/MSI-X/IMS are all
treated in exactly the same way. It all becomes the common case.

No?

Thanks,

tglx



2021-12-09 09:03:06

by Thomas Gleixner

Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Kevin,

On Thu, Dec 09 2021 at 06:26, Kevin Tian wrote:
>> From: Jason Gunthorpe <[email protected]>
>> I don't know of any use case for more than one interrupt on a queue,
>> and if it did come up I'd probably approach it by making the queue
>> handle above also specify the 'queue relative HW index'
>
> We have such use case with IDXD.
>
> Basically the IDXD queue allows software to put an interrupt handle
> (the index of MSI-X or IMS entry) in the submitted descriptor. Upon
> completion of the descriptor the hardware finds the specified entry
> and then generate interrupt to notify software.
>
> Conceptually descriptors submitted to a same queue can use different
> handles, implying one queue can be associated to multiple interrupts.

I think you are looking at that from the internal implementation details
of IDXD. But you can just model it in an IDXD implementation agnostic
way:

ENQCMD(PASID, IMS-ENTRY,.....)

implies an on demand allocation of a virtual queue, which is deallocated
when the command completes. The PASID and IMS-ENTRY act as the 'queue'
identifier.

The implementation detail of IDXD that it executes these computations on
an internal shared workqueue does not change that.

Such a workqueue can only execute one enqueued command at a time, which
means that during the execution of a particular command that IDXD
internal workqueue represents the 'virtual queue' which is identified by
the unique PASID/IMS-ENTRY pair.

No?

Thanks,

tglx

2021-12-09 12:17:43

by Tian, Kevin

Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Thomas Gleixner <[email protected]>
> Sent: Thursday, December 9, 2021 5:03 PM
>
> Kevin,
>
> On Thu, Dec 09 2021 at 06:26, Kevin Tian wrote:
> >> From: Jason Gunthorpe <[email protected]>
> >> I don't know of any use case for more than one interrupt on a queue,
> >> and if it did come up I'd probably approach it by making the queue
> >> handle above also specify the 'queue relative HW index'
> >
> > We have such use case with IDXD.
> >
> > Basically the IDXD queue allows software to put an interrupt handle
> > (the index of MSI-X or IMS entry) in the submitted descriptor. Upon
> > completion of the descriptor the hardware finds the specified entry
> > and then generate interrupt to notify software.
> >
> > Conceptually descriptors submitted to a same queue can use different
> > handles, implying one queue can be associated to multiple interrupts.
>
> I think you are looking at that from the internal implementation details
> of IDXD. But you can just model it in an IDXD implementation agnostic
> way:
>
> ENQCMD(PASID, IMS-ENTRY,.....)

Not exactly IMS-ENTRY. MSI-ENTRY also works.

>
> implies an on demand allocation of a virtual queue, which is deallocated
> when the command completes. The PASID and IMS-ENTRY act as the 'queue'
> identifier.
>
> The implementation detail of IDXD that it executes these computations on
> an internal shared workqueue does not change that.
>
> Such a workqueue can only execute one enqueued command at a time,
> which
> means that during the execution of a particular command that IDXD
> internal workqueue represents the 'virtual queue' which is identified by
> the unique PASID/IMS-ENTRY pair.
>
> No?
>

While that's one way of looking at this model, do we want to actually
create some objects to represent this 'virtual queue' concept? That
implies each ENQCMD must be mediated to create such short-lifespan
objects, and I'm not sure of the benefit of doing so.

If not, then from the driver's p.o.v. it's still one queue resource, and
the driver needs to manage its association with multiple interrupt
entries and PASIDs when it's connected to multiple clients.

Thanks
Kevin

2021-12-09 12:31:12

by Tian, Kevin

Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Thomas Gleixner <[email protected]>
> Sent: Thursday, December 9, 2021 4:37 PM
>
> On Thu, Dec 09 2021 at 05:23, Kevin Tian wrote:
> >> From: Thomas Gleixner <[email protected]>
> >> I don't see anything wrong with that. A subdevice is its own entity and
> >> VFIO can choose the most convenient representation of it to the guest
> >> obviously.
> >>
> >> How that is backed on the host does not really matter. You can expose
> >> MSI-X to the guest with a INTx backing as well.
> >>
> >
> > Agree with this point. How the interrupts are represented to the guest
> > is orthogonal to how the backend resource is allocated. Physically MSI-X
> > and IMS can be enabled simultaneously on an IDXD device. Once
> > dynamic allocation is allowed for both, either one can be allocated for
> > a subdevice (with only difference on supported #subdevices).
> >
> > When an interrupt resource is exposed to the guest with the same type
> > (e.g. MSI-on-MSI or IMS-on-IMS), it can be also passed through to the
> > guest as long as a hypercall machinery is in place to get addr/data pair
> > from the host (as you suggested earlier).
>
> As I pointed out in the conclusion of this thread, IMS is only going to
> be supported with interrupt remapping in place on both host and guest.

I still need to read the last few mails but thanks for pointing it out now.

>
> As these devices are requiring a vIOMMU on the guest anyway (PASID, User
> IO page tables), the required hypercalls are part of the vIOMMU/IR
> implementation. If you look at it from the irqdomain hierarchy view:
>
>                        |- PCI-MSI
> VECTOR -- [v]IOMMU/IR -|- PCI-MSI-X
>                        |- PCI-IMS
>
> So host and guest use just the same representation which makes a ton of
> sense.
>
> There are two places where this matters:
>
> 1) The activate() callback of the IR domain
>
> 2) The irq_set_affinity() callback of the irqchip associated with the
> IR domain
>
> Both callbacks are allowed to fail and the error code is handed back to
> the originating call site.
>
> If you look at the above hierarchy view then MSI/MSI-X/IMS are all
> treated in exactly the same way. It all becomes the common case.
>
> No?
>

Yes, I think above makes sense.

For a new guest OS which supports this enlightened hierarchy the same
machinery works for all types of interrupt storage, and we have a
failure path from host to guest in case of host-side resource shortage.
And no trap is required on guest access to the interrupt storage.

A legacy guest OS which doesn't support the enlightened hierarchy
can only use MSI/MSI-X which is still trapped. But with vector
reallocation support from your work the situation already improves
a lot over the current awkward way in VFIO (free all previous vectors
and then re-allocate).

Overall I think this is a good modeling.

Thanks
Kevin

2021-12-09 15:57:45

by Thomas Gleixner

Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 09 2021 at 12:17, Kevin Tian wrote:
>> From: Thomas Gleixner <[email protected]>
>> I think you are looking at that from the internal implementation details
>> of IDXD. But you can just model it in an IDXD implementation agnostic
>> way:
>>
>> ENQCMD(PASID, IMS-ENTRY,.....)
>
> Not exactly IMS-ENTRY. MSI-ENTRY also works.

Sure.

>>
>> implies an on demand allocation of a virtual queue, which is deallocated
>> when the command completes. The PASID and IMS-ENTRY act as the 'queue'
>> identifier.
>>
>> The implementation detail of IDXD that it executes these computations on
>> an internal shared workqueue does not change that.
>>
>> Such a workqueue can only execute one enqueued command at a time,
>> which
>> means that during the execution of a particular command that IDXD
>> internal workqueue represents the 'virtual queue' which is identified by
>> the unique PASID/IMS-ENTRY pair.
>
> While it's one way of looking at this model do we want to actually
> create some objects to represent this 'virtual queue' concept? that
> implies each ENQCMD must be moderated to create such short-lifespan
> objects and I'm not sure the benefit of doing so.

You don't have to create anything. The PASID/ENTRY pair represents that
'virtual queue', right?

> If not then from driver p.o.v it's still one queue resource and driver
> needs to manage its association with multiple interrupt entries and
> PASIDs when it's connected to multiple clients.

That's correct, but there is nothing problematic with it. It's like
allocating multiple interrupts for any other hardware device or
subdevice, right?

What's probably more interesting is how the PASID/interrupt/RID
relations are managed.

Thanks,

tglx


2021-12-09 16:21:34

by Jason Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> On Thu, Dec 09 2021 at 05:23, Kevin Tian wrote:
> >> From: Thomas Gleixner <[email protected]>
> >> I don't see anything wrong with that. A subdevice is its own entity and
> >> VFIO can choose the most convenient representation of it to the guest
> >> obviously.
> >>
> >> How that is backed on the host does not really matter. You can expose
> >> MSI-X to the guest with a INTx backing as well.
> >>
> >
> > Agree with this point. How the interrupts are represented to the guest
> > is orthogonal to how the backend resource is allocated. Physically MSI-X
> > and IMS can be enabled simultaneously on an IDXD device. Once
> > dynamic allocation is allowed for both, either one can be allocated for
> > a subdevice (with only difference on supported #subdevices).
> >
> > When an interrupt resource is exposed to the guest with the same type
> > (e.g. MSI-on-MSI or IMS-on-IMS), it can be also passed through to the
> > guest as long as a hypercall machinery is in place to get addr/data pair
> > from the host (as you suggested earlier).
>
> As I pointed out in the conclusion of this thread, IMS is only going to
> be supported with interrupt remapping in place on both host and guest.
>
> As these devices are requiring a vIOMMU on the guest anyway (PASID, User
> IO page tables), the required hypercalls are part of the vIOMMU/IR
> implementation. If you look at it from the irqdomain hierarchy view:

It is true for IDXD, but mlx5 will work without a PASID or vIOMMU in a
guest today, and there is no reason to imagine some future IMS would
have any different device requirements from MSI-X.

Though, vIOMMU operating in bypass mode seems like it is fine if it
helps the solution.

> If you look at the above hierarchy view then MSI/MSI-X/IMS are all
> treated in exactly the same way. It all becomes the common case.

Unfortunately in a guest they are not all the same - it is like the
PPC code I saw messing with MSI today - MSI setup is a hypercall,
either explicitly or implicitly by trapping device registers.

So MSI is special compared to everything else because MSI has that
hypervisor intrusion.

My ideal outcome would be to have the guest negotiate some new
capability with the hypervisor where the guest promises to use new
hypercalls to get addr/data pairs and the hypervisor removes all the
emulation around the PCI MSI. Then MSI == IMS again and we can have
everything treated in exactly the same way. Hypervisor doesn't care
how the guest tells the origin device what the addr/data pair is.

This moves the hypervisor trap to setup the interrupt remapping from
the MSI emulation to the new hypercall.

If we keep the MSI emulation in the hypervisor then MSI != IMS. The
MSI code needs to write an addr/data pair compatible with the emulation
and the IMS code needs to write an addr/data pair from the
hypercall. Seems like this scenario is best avoided!

From this perspective I haven't connected how virtual interrupt
remapping helps in the guest? Is this a way to provide the hypercall
I'm imagining above?

Jason

2021-12-09 20:32:48

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:
> On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> If we keep the MSI emulation in the hypervisor then MSI != IMS. The
> MSI code needs to write a addr/data pair compatible with the emulation
> and the IMS code needs to write an addr/data pair from the
> hypercall. Seems like this scenario is best avoided!
>
> From this perspective I haven't connected how virtual interrupt
> remapping helps in the guest? Is this a way to provide the hypercall
> I'm imagining above?

That was my thought to avoid having different mechanisms.

The address/data pair is computed in two places:

1) Activation of an interrupt
2) Affinity setting on an interrupt

Both configure the IRTE when interrupt remapping is in place.

In both cases a vector is allocated in the vector domain and based on
the resulting target APIC / vector number pair the IRTE is
(re)configured.

So putting the hypercall into the vIRTE update is the obvious
place. Both activation and affinity setting can fail and propagate an
error code down to the originating caller.

Hmm?

Thanks,

tglx



2021-12-09 20:58:48

by Jason Gunthorpe

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:
> > On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> > If we keep the MSI emulation in the hypervisor then MSI != IMS. The
> > MSI code needs to write a addr/data pair compatible with the emulation
> > and the IMS code needs to write an addr/data pair from the
> > hypercall. Seems like this scenario is best avoided!
> >
> > From this perspective I haven't connected how virtual interrupt
> > remapping helps in the guest? Is this a way to provide the hypercall
> > I'm imagining above?
>
> That was my thought to avoid having different mechanisms.
>
> The address/data pair is computed in two places:
>
> 1) Activation of an interrupt
> 2) Affinity setting on an interrupt
>
> Both configure the IRTE when interrupt remapping is in place.
>
> In both cases a vector is allocated in the vector domain and based on
> the resulting target APIC / vector number pair the IRTE is
> (re)configured.
>
> So putting the hypercall into the vIRTE update is the obvious
> place. Both activation and affinity setting can fail and propagate an
> error code down to the originating caller.
>
> Hmm?

Okay, I think I get it. Would be nice to have someone from intel
familiar with the vIOMMU protocols and qemu code remark on what the
hypervisor side can look like.

There is a bit more work here, we'd have to change VFIO to somehow
entirely disconnect the kernel IRQ logic from the MSI table and
directly pass control of it to the guest after the hypervisor IOMMU IR
secures it. ie directly mmap the msi-x table into the guest

Jason

2021-12-09 22:09:58

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 09 2021 at 16:58, Jason Gunthorpe wrote:
> On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:
>> That was my thought to avoid having different mechanisms.
>>
>> The address/data pair is computed in two places:
>>
>> 1) Activation of an interrupt
>> 2) Affinity setting on an interrupt
>>
>> Both configure the IRTE when interrupt remapping is in place.
>>
>> In both cases a vector is allocated in the vector domain and based on
>> the resulting target APIC / vector number pair the IRTE is
>> (re)configured.
>>
>> So putting the hypercall into the vIRTE update is the obvious
>> place. Both activation and affinity setting can fail and propagate an
>> error code down to the originating caller.
>>
>> Hmm?
>
> Okay, I think I get it. Would be nice to have someone from intel
> familiar with the vIOMMU protocols and qemu code remark what the
> hypervisor side can look like.
>
> There is a bit more work here, we'd have to change VFIO to somehow
> entirely disconnect the kernel IRQ logic from the MSI table and
> directly pass control of it to the guest after the hypervisor IOMMU IR
> secures it. ie directly mmap the msi-x table into the guest

That makes everything consistent and a clear cut on all levels, right?

Thanks,

tglx


2021-12-10 00:26:23

by Thomas Gleixner

Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Dec 09 2021 at 23:09, Thomas Gleixner wrote:
> On Thu, Dec 09 2021 at 16:58, Jason Gunthorpe wrote:
>> Okay, I think I get it. Would be nice to have someone from intel
>> familiar with the vIOMMU protocols and qemu code remark what the
>> hypervisor side can look like.
>>
>> There is a bit more work here, we'd have to change VFIO to somehow
>> entirely disconnect the kernel IRQ logic from the MSI table and
>> directly pass control of it to the guest after the hypervisor IOMMU IR
>> secures it. ie directly mmap the msi-x table into the guest
>
> That makes everything consistent and a clear cut on all levels, right?

Let me give a bit more rationale here, why I think this is the right
thing to do. There are several problems with IMS both on the host and on
the guest side:

1) Contrary to MSI/MSI-X the address/data pair is not completely
managed by the core. It's handed off to driver writers in the
hope they get it right.

2) Without interrupt remapping there is a fundamental issue on x86
for the affinity setting case, as there is no guarantee that
the magic protocol which we came up with (see msi_set_affinity()
in the x86 code) is correctly implemented at the driver level or
that the update is truly atomic so that the problem does not
arise. My interest in chasing these things is exactly zero.

With interrupt remapping the affinity change happens at the IRTE
level and not at the device level. It's a one time setup for the
device.

Just for the record:

The ATH11 thing does not have that problem by pure luck because
multi-vector MSI is not supported on X86 unless interrupt
remapping is enabled.

The switchtec NTB thing will fall apart w/o remapping AFAICT.

3) With remapping the message for the device is constructed at
allocation time. It does not change after that because the affinity
change happens at the remapping level, which eliminates #2 above.

That has another advantage for IMS because it does not require any
synchronization with the queue or whatever is involved. The next
interrupt after the change at the remapping level ends up on the
new target.

4) For the guest side we agreed that we need a hypercall because the
host can't trap the write to the MSI[-X] entry anymore.

Aside of the fact that this creates a special case for IMS which is
undesirable in my opinion, it's not really obvious where the
hypercall should be placed to work for all scenarios so that it can
also solve the existing issue of silent failures.

5) It's not possible for the kernel to reliably detect whether it is
running on bare metal or not. Yes we talked about heuristics, but
that's something I really want to avoid.

When looking at the above I came to the conclusion that the consistent
way is to make IMS depend on IR both on the host and the guest as this
solves all of the above in one go.

How would that work? With IR the irqdomain hierarchy looks like this:

               |--IO/APIC
               |--MSI
vector -- IR --|--MSI-X
               |--IMS

There are several contexts where this matters:

1) Allocation of an interrupt, e.g. pci_alloc_irq_vectors().

2) Activation of an interrupt which happens during allocation and/or
at request_irq() time

3) Interrupt affinity setting

#1 Allocation

That allocates an IRTE, which can fail

#2 Activation

That's the step where actually a CPU vector is allocated, where the
IRTE is updated and where the device message is composed to target
the IRTE.

On X86 activation is happening twice:

1) During allocation it allocates a special CPU vector which is
handed out to all allocated interrupts. That's called reservation
mode. This was introduced to prevent vector exhaustion for two
cases:

- Devices allocating tons of MSI-X vectors without using
them. That obviously needs to be fixed at the device driver
level, but due to the fact that post probe() allocation is not
supported, that's not always possible

- CPU hotunplug

All vectors targeting the outgoing CPU need to be migrated to a
new target CPU, which can result in exhaustion of the vector
space.

Reservation mode avoids that because it just uses a unique
vector for all interrupts which are allocated but not
requested.

2) On request_irq()

As the vector assigned during allocation is just a place holder
to make the MSI hardware happy it needs to be replaced by a
real vector.

Both can fail and the error is propagated through the call chain

#3 Changing the interrupt affinity

This obviously needs to allocate a new target CPU vector and update
the IRTE.

Allocating a new target CPU vector can fail.

When looking at it from the host side, the host needs to do the
same things:

1) Allocate an IRTE for #1

2) Update the IRTE for #2 and #3

But that does not necessarily mean that we need two hypercalls. We can
get away with one in the code which updates the IRTE and that would be
the point where the host side has to allocate the backing host
interrupt, which would replace that allocate on unmask mechanism which
is used today.

It might look awkward at first sight that an IRTE update can fail, but
it's not that awkward when put into context:

The first update happens during activation and activation can fail for
various reasons.

The charm is that this works for everything from INTx to IMS because all
of them go through the same procedure, except that INTx (IO/APIC) does
not support the reservation mode dance.

Thoughts?

Thanks,

tglx

2021-12-10 07:29:07

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Thomas Gleixner <[email protected]>
> Sent: Friday, December 10, 2021 8:26 AM
>
> On Thu, Dec 09 2021 at 23:09, Thomas Gleixner wrote:
> > On Thu, Dec 09 2021 at 16:58, Jason Gunthorpe wrote:
> >> Okay, I think I get it. Would be nice to have someone from intel
> >> familiar with the vIOMMU protocols and qemu code remark what the
> >> hypervisor side can look like.
> >>
> >> There is a bit more work here, we'd have to change VFIO to somehow
> >> entirely disconnect the kernel IRQ logic from the MSI table and
> >> directly pass control of it to the guest after the hypervisor IOMMU IR
> >> secures it. ie directly mmap the msi-x table into the guest
> >
> > That makes everything consistent and a clear cut on all levels, right?
>
> Let me give a bit more rationale here, why I think this is the right
> thing to do. There are several problems with IMS both on the host and on
> the guest side:
>
> 1) Contrary to MSI/MSI-X the address/data pair is not completely
> managed by the core. It's handed off to driver writers in the
> hope they get it right.
>
> 2) Without interrupt remapping there is a fundamental issue on x86
> for the affinity setting case, as there is no guarantee that
> the magic protocol which we came up with (see msi_set_affinity()
> in the x86 code) is correctly implemented at the driver level or
> that the update is truly atomic so that the problem does not
> arise. My interest in chasing these things is exactly zero.
>
> With interrupt remapping the affinity change happens at the IRTE
> level and not at the device level. It's a one time setup for the
> device.

This is a good rationale for making IMS depend on IR on native.

>
> Just for the record:
>
> The ATH11 thing does not have that problem by pure luck because
> multi-vector MSI is not supported on X86 unless interrupt
> remapping is enabled.
>
> The switchtec NTB thing will fall apart w/o remapping AFAICT.
>
> 3) With remapping the message for the device is constructed at
> allocation time. It does not change after that because the affinity
> change happens at the remapping level, which eliminates #2 above.
>
> That has another advantage for IMS because it does not require any
> synchronization with the queue or whatever is involved. The next
> interrupt after the change at the remapping level ends up on the
> new target.

Yes

>
> 4) For the guest side we agreed that we need a hypercall because the
> host can't trap the write to the MSI[-X] entry anymore.

To be accurate I'd rather not call it "can't trap". The host still traps the
MSI/MSI-X entry if the hypercall is not used. This is for current guest
OSes which don't have this hypercall mechanism. For a future guest OS
which supports this machinery, a handshake process from the guest will
disable the trap for MSI-X and map it for direct guest access on the fly.

MSI always has to be trapped even though the guest has acquired the right
data/addr pair via the hypercall, since it's a PCI config capability.

>
> Aside of the fact that this creates a special case for IMS which is
> undesirable in my opinion, it's not really obvious where the
> hypercall should be placed to work for all scenarios so that it can
> also solve the existing issue of silent failures.
>
> 5) It's not possible for the kernel to reliably detect whether it is
> running on bare metal or not. Yes we talked about heuristics, but
> that's something I really want to avoid.

How would the hypercall mechanism avoid such heuristics?

>
> When looking at the above I came to the conclusion that the consistent
> way is to make IMS depend on IR both on the host and the guest as this
> solves all of the above in one go.
>
> How would that work? With IR the irqdomain hierarchy looks like this:
>
>                |--IO/APIC
>                |--MSI
> vector -- IR --|--MSI-X
>                |--IMS
>
> There are several contexts where this matters:
>
> 1) Allocation of an interrupt, e.g. pci_alloc_irq_vectors().
>
> 2) Activation of an interrupt which happens during allocation and/or
> at request_irq() time
>
> 3) Interrupt affinity setting
>
> #1 Allocation
>
> That allocates an IRTE, which can fail
>
> #2 Activation
>
> That's the step where actually a CPU vector is allocated, where the
> IRTE is updated and where the device message is composed to target
> the IRTE.
>
> On X86 activation is happening twice:
>
> 1) During allocation it allocates a special CPU vector which is
> handed out to all allocated interrupts. That's called reservation
> mode. This was introduced to prevent vector exhaustion for two
> cases:
>
> - Devices allocating tons of MSI-X vectors without using
> them. That obviously needs to be fixed at the device driver
> level, but due to the fact that post probe() allocation is not
> supported, that's not always possible
>
> - CPU hotunplug
>
> All vectors targeting the outgoing CPU need to be migrated to a
> new target CPU, which can result in exhaustion of the vector
> space.
>
> Reservation mode avoids that because it just uses a unique
> vector for all interrupts which are allocated but not
> requested.
>
> 2) On request_irq()
>
> As the vector assigned during allocation is just a place holder
> to make the MSI hardware happy it needs to be replaced by a
> real vector.
>
> Both can fail and the error is propagated through the call chain
>
> #3 Changing the interrupt affinity
>
> This obviously needs to allocate a new target CPU vector and update
> the IRTE.
>
> Allocating a new target CPU vector can fail.
>
> When looking at it from the host side, then the host needs to do the
> same things:
>
> 1) Allocate an IRTE for #1
>
> 2) Update the IRTE for #2 and #3
>
> But that does not necessarily mean that we need two hypercalls. We can
> get away with one in the code which updates the IRTE and that would be
> the point where the host side has to allocate the backing host
> interrupt, which would replace that allocate on unmask mechanism which
> is used today.
>
> It might look awkward on first sight that an IRTE update can fail, but
> it's not that awkward when put into context:
>
> The first update happens during activation and activation can fail for
> various reasons.
>
> The charm is that this works for everything from INTx to IMS because all
> of them go through the same procedure, except that INTx (IO/APIC) does
> not support the reservation mode dance.
>
> Thoughts?
>

The above sounds like the right direction to me, and here are some more
thoughts down the road.

First let's look at how the interrupt is delivered to the guest today.

With IR the physical interrupt is first delivered to the host, then
converted into a virtual interrupt and finally injected to the guest.
Let's put posted interrupt aside since it's an optional platform capability.

  HW interrupt
      vfio_msihandler(): ->irqfd
          kvm_arch_set_irq_inatomic()
              kvm_set_msi_irq()
                  kvm_irq_delivery_to_apic_fast()

Virtual interrupt injection is based on the virtual routing information
registered by Qemu, via KVM_SET_GSI_ROUTING(gsi, routing_info).

GSI is later associated to irqfd via KVM_IRQFD(irqfd, gsi).

Qemu composes the virtual routing information based on trapping
of various interrupt storages (INTx, MSI, MSI-X, etc.). When IR is
enabled in the vIOMMU, the routing info is composed by combining
the storage entry and vIRTE entry.

Now adding the new hypercall machinery to the above flow, obviously
we also want the hypercall to carry the virtual routing information
(vIRTE) to the host besides acquiring the data/addr pair from the host.
Without the trap this information is now hidden from Qemu.

Then Qemu needs to find out the GSI number for the vIRTE handle.
Again Qemu doesn't have such information since it doesn't know
which MSI[-X] entry points to this handle due to no trap.

This implies that we may also need to carry device ID, #msi entry, etc.
in the hypercall, so Qemu can associate the virtual routing info
to the right [irqfd, gsi].

In your model the hypercall is raised by the IR domain. Do you see
any problem finding that information within the IR domain?

Thanks
Kevin

2021-12-10 07:36:19

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, December 10, 2021 4:59 AM
>
> On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:
> > On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:
> > > On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
> > > If we keep the MSI emulation in the hypervisor then MSI != IMS. The
> > > MSI code needs to write a addr/data pair compatible with the emulation
> > > and the IMS code needs to write an addr/data pair from the
> > > hypercall. Seems like this scenario is best avoided!
> > >
> > > From this perspective I haven't connected how virtual interrupt
> > > remapping helps in the guest? Is this a way to provide the hypercall
> > > I'm imagining above?
> >
> > That was my thought to avoid having different mechanisms.
> >
> > The address/data pair is computed in two places:
> >
> > 1) Activation of an interrupt
> > 2) Affinity setting on an interrupt
> >
> > Both configure the IRTE when interrupt remapping is in place.
> >
> > In both cases a vector is allocated in the vector domain and based on
> > the resulting target APIC / vector number pair the IRTE is
> > (re)configured.
> >
> > So putting the hypercall into the vIRTE update is the obvious
> > place. Both activation and affinity setting can fail and propagate an
> > error code down to the originating caller.
> >
> > Hmm?
>
> Okay, I think I get it. Would be nice to have someone from intel
> familiar with the vIOMMU protocols and qemu code remark what the
> hypervisor side can look like.
>
> There is a bit more work here, we'd have to change VFIO to somehow
> entirely disconnect the kernel IRQ logic from the MSI table and
> directly pass control of it to the guest after the hypervisor IOMMU IR
> secures it. ie directly mmap the msi-x table into the guest
>

It's supported already:

/*
* The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
* which allows direct access to non-MSIX registers which happened to be within
* the same system page.
*
* Even though the userspace gets direct access to the MSIX data, the existing
* VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
*/
#define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3

IIRC this was introduced for PPC when a device has MSI-X in the same BAR as
other MMIO registers. Trapping MSI-X leads to performance downgrade on
accesses to adjacent registers. MSI-X can be mapped by userspace because
PPC already uses a hypercall mechanism for interrupts. Though I'm unclear
about the details, it sounds like a similar usage to the one proposed here.

Thanks
Kevin

2021-12-10 07:38:09

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Thomas Gleixner <[email protected]>
> Sent: Thursday, December 9, 2021 11:58 PM
>
> On Thu, Dec 09 2021 at 12:17, Kevin Tian wrote:
> >> From: Thomas Gleixner <[email protected]>
> >> I think you are looking at that from the internal implementation details
> >> of IDXD. But you can just model it in an IDXD implementation agnostic
> >> way:
> >>
> >> ENQCMD(PASID, IMS-ENTRY,.....)
> >
> > Not exactly IMS-ENTRY. MSI-ENTRY also works.
>
> Sure.
>
> >>
> >> implies an on demand allocation of a virtual queue, which is deallocated
> >> when the command completes. The PASID and IMS-ENTRY act as the
> 'queue'
> >> identifier.
> >>
> >> The implementation detail of IDXD that it executes these computations on
> >> an internal shared workqueue does not change that.
> >>
> >> Such a workqueue can only execute one enqueued command at a time,
> >> which
> >> means that during the execution of a particular command that IDXD
> >> internal workqueue represents the 'virtual queue' which is identified by
> >> the unique PASID/IMS-ENTRY pair.
> >
> > While it's one way of looking at this model do we want to actually
> > create some objects to represent this 'virtual queue' concept? that
> > implies each ENQCMD must be moderated to create such short-lifespan
> > objects and I'm not sure the benefit of doing so.
>
> You don't have to create anything. The PASID/ENTRY pair represents that
> 'virtual queue', right?

Yes

>
> > If not then from driver p.o.v it's still one queue resource and driver
> > needs to manage its association with multiple interrupt entries and
> > PASIDs when it's connected to multiple clients.
>
> That's correct, but there is nothing problematic with it. It's like
> allocating multiple interrupts for any other hardware device or
> subdevice, right?

No question on this. Just want to point out such usage example
since Jason mentioned it. :)

>
> What's probably more interresting is how the PASID/interrupt/RID
> relations are managed.
>

yes, that's something we need further think of.

Thanks
Kevin

2021-12-10 12:13:28

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Kevin,

On Fri, Dec 10 2021 at 07:29, Kevin Tian wrote:
>> From: Thomas Gleixner <[email protected]>
>> 4) For the guest side we agreed that we need a hypercall because the
>> host can't trap the write to the MSI[-X] entry anymore.
>
> To be accurate I'd rather not call it "can't trap". The host still traps the
> MSI/MSI-X entry if the hypercall is not used. This is for current guest
> OSes which don't have this hypercall mechanism. For a future guest OS
> which supports this machinery, a handshake process from the guest will
> disable the trap for MSI-X and map it for direct guest access on the fly.

Right. What I'm suggesting is a clear cut between the current approach,
which obviously needs to be preserved, and a consistent new approach
which handles MSI, MSI-X and IMS in the exactly same way.

> MSI has to be always trapped although the guest has acquired the right
> data/addr pair via the hypercall, since it's a PCI config capability.
>
>>
>> Aside of the fact that this creates a special case for IMS which is
>> undesirable in my opinion, it's not really obvious where the
>> hypercall should be placed to work for all scenarios so that it can
>> also solve the existing issue of silent failures.
>>
>> 5) It's not possible for the kernel to reliably detect whether it is
>> running on bare metal or not. Yes we talked about heuristics, but
>> that's something I really want to avoid.
>
> How would the hypercall mechanism avoid such heuristics?

The availability of IR remapping where the irqdomain which is provided
by the remapping unit signals that it supports this new scheme:

               |--IO/APIC
               |--MSI
vector -- IR --|--MSI-X
               |--IMS

while the current scheme is:

               |--IO/APIC
vector -- IR --|--PCI/MSI[-X]

or

               |--IO/APIC
vector --------|--PCI/MSI[-X]

So in the new scheme the IR domain will advertise new features which are
not available on older kernels. The availability of these new features
is the indicator for the interrupt subsystem and subsequently for PCI
whether IMS is supported or not.

Bootup either finds an IR unit or not. In the bare metal case that's the
usual hardware/firmware detection. In the guest case it's the
availability of vIR including the required hypercall protocol.

So for the guest case the initialization would look like this:

  qemu starts guest
      Method of interrupt management defaults to current scheme
      restricted to MSI/MSI-X

  guest initializes
      older guests do not check for the hypercall so everything
      continues as of today

      new guest initializes vIR, detects hypercall and requests
      from the hypervisor to switch over to the new scheme.

      The hypervisor grants or rejects the request, i.e. either both
      switch to the new scheme or stay with the old one.

The new scheme means, that all variants, MSI, MSI-X, IMS are handled in
the same way. Staying on the old scheme means that IMS is not available
to the guest.

Having that clear separation is in my opinion way better than trying to
make all of that a big maze of conditionals.

I'm going to make that clear cut in the PCI/MSI management layer as
well. Trying to do that hybrid we do today for irqdomain and non
irqdomain based backends is just not feasible. The decision which way to
go will be very close to the driver exposed API, i.e.:

pci_alloc_ims_vector()
	if (new_scheme())
		return new_scheme_alloc_ims();
	else
		return -ENOTSUPP;

and new_scheme_alloc_ims() will have:

new_scheme_alloc_ims()
	if (!system_supports_ims())
		return -ENOTSUPP;
	....

system_supports_ims() makes its decision based on the feature flags
exposed by the underlying base MSI irqdomain, i.e. either vector or IR
on x86.

Vector domain will not have that feature flag set, but IR will have on
bare metal and eventually in the guest when the vIR setup and hypercall
detection and negotiation succeeds.

> Then Qemu needs to find out the GSI number for the vIRTE handle.
> Again Qemu doesn't have such information since it doesn't know
> which MSI[-X] entry points to this handle due to no trap.
>
> This implies that we may also need to carry device ID, #msi entry, etc.
> in the hypercall, so Qemu can associate the virtual routing info
> to the right [irqfd, gsi].
>
> In your model the hypercall is raised by the IR domain. Do you see
> any problem finding that information within the IR domain?

IR has the following information available:

Interrupt type
- MSI: Device, index and number of vectors
- MSI-X: Device, index
- IMS: Device, index

Target APIC/vector pair

IMS: The index depends on the storage type:

For storage in device memory, e.g. IDXD, it's the array index.

For storage in system memory, the index is a software artifact.

Does that answer your question?

Thanks,

tglx

2021-12-10 12:30:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Fri, Dec 10, 2021 at 07:36:12AM +0000, Tian, Kevin wrote:

> /*
> * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> * which allows direct access to non-MSIX registers which happened to be within
> * the same system page.
> *
> * Even though the userspace gets direct access to the MSIX data, the existing
> * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
^^^^^^^^^^^^^^^^^^^^^

It is this I think we don't want, there should be no hypervisor
involvment on installing the addr/data pair

Guessing from what I saw in PPC, there is a hypercall to install an MSI
and the guest never touches the page even though it is
mapped. Presumably there is some IR-like thing securing this.

PPC has the same basic problem, I think they have a hypercall to
install an MSI, not a hypercall to provision a generic addr/data pair.

Jason

2021-12-10 12:39:44

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Fri, Dec 10, 2021 at 07:29:01AM +0000, Tian, Kevin wrote:
> > 5) It's not possible for the kernel to reliably detect whether it is
> > running on bare metal or not. Yes we talked about heuristics, but
> > that's something I really want to avoid.
>
> How would the hypercall mechanism avoid such heuristics?

It is clever, we don't have a vIOMMU that supplies vIR today, so by
definition all guests are excluded and only bare metal works.

> > The charm is that this works for everything from INTx to IMS because all
> > of them go through the same procedure, except that INTx (IO/APIC) does
> > not support the reservation mode dance.

Do we even have vIOAPIC?

> > Thoughts?

It seems reasonable - do you have any idea how this all would work on
ARM too? IMS on baremetal ARM is surely interesting. I assume they
have a similar issue with trapping the MSI

> Then Qemu needs to find out the GSI number for the vIRTE handle.
> Again Qemu doesn't have such information since it doesn't know
> which MSI[-X] entry points to this handle due to no trap.

No this is already going wrong. qemu *cannot* know the MSI information
because there is no MSI information for IMS.

All qemu should get is the origin device information and data about
how the guest wants the interrupt setup.

Forget about guests and all of this complexity, design how to make
VFIO work with IMS in pure userspace like DPDK.

We must have a VFIO ioctl to acquire a addr/data pair and link it to
an event fd.

I'm not sure exactly how this should be done, it is 90% of what IMS
is, except the VFIO irq_chip cannot touch any real HW and certainly
cannot do mask/unmask..

Maybe that is OK now that it requires IR?

Jason

2021-12-10 19:00:23

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Jason,

On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:

> On Fri, Dec 10, 2021 at 07:29:01AM +0000, Tian, Kevin wrote:
>> > 5) It's not possible for the kernel to reliably detect whether it is
>> > running on bare metal or not. Yes we talked about heuristics, but
>> > that's something I really want to avoid.
>>
>> How would the hypercall mechanism avoid such heuristics?
>
> It is clever, we don't have a vIOMMU that supplies vIR today, so by
> definition all guests are excluded and only bare metal works.

Dammit. Now you spilled the beans. :)

>> > The charm is that this works for everything from INTx to IMS because all
>> > of them go through the same procedure, except that INTx (IO/APIC) does
>> > not support the reservation mode dance.
>
> Do we even have vIOAPIC?

It does not matter much. INTx via IOAPIC is different anyway because
INTx is shared so it's unclear from which device it originates.

> It seems reasonable - do you have any idea how this all would work on
> ARM too? IMS on baremetal ARM is surely interesting. I assume they
> have a similar issue with trapping the MSI

On baremetal it should just work once ARM is converted over. No idea
about guests. Marc should know.

>> Then Qemu needs to find out the GSI number for the vIRTE handle.
>> Again Qemu doesn't have such information since it doesn't know
>> which MSI[-X] entry points to this handle due to no trap.
>
> No this is already going wrong. qemu *cannot* know the MSI information
> because there is no MSI information for IMS.
>
> All qemu should get is the origin device information and data about
> how the guest wants the interrupt setup.
>
> Forget about guests and all of this complexity, design how to make
> VFIO work with IMS in pure userspace like DPDK.
>
> We must have a VFIO ioctl to acquire a addr/data pair and link it to
> an event fd.
>
> I'm not sure exactly how this should be done, it is 90% of what IMS
> is, except the VFIO irq_chip cannot touch any real HW and certainly
> cannot do mask/unmask..
>
> Maybe that is OK now that it requires IR?

IR unfortunately does not allow masking, but we surely can come up with
some emergency button to press when e.g. an interrupt storm is detected.

Thanks,

tglx

2021-12-11 07:44:16

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Thomas Gleixner <[email protected]>
> Sent: Saturday, December 11, 2021 3:00 AM
>
> Jason,
>
> On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:
>
> > On Fri, Dec 10, 2021 at 07:29:01AM +0000, Tian, Kevin wrote:
> >> > 5) It's not possible for the kernel to reliably detect whether it is
> >> > running on bare metal or not. Yes we talked about heuristics, but
> >> > that's something I really want to avoid.
> >>
> >> How would the hypercall mechanism avoid such heuristics?
> >
> > It is clever, we don't have a vIOMMU that supplies vIR today, so by
> > definition all guests are excluded and only bare metal works.
>
> Dammit. Now you spilled the beans. :)
>

Unfortunately we do have that today. Qemu supports IR for
both AMD and Intel vIOMMU.

Thanks
Kevin

2021-12-11 07:53:09

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, December 10, 2021 8:40 PM
>
>
> > Then Qemu needs to find out the GSI number for the vIRTE handle.
> > Again Qemu doesn't have such information since it doesn't know
> > which MSI[-X] entry points to this handle due to no trap.
>
> No this is already going wrong. qemu *cannot* know the MSI information
> because there is no MSI information for IMS.

I haven't thought of IMS at this step. The IR approach applies to
all types of interrupt storages, thus I'm more interested in how it
affects the storages which are already virtualized today (MSI[-X]
in my thought exercise).

>
> All qemu should get is the origin device information and data about
> how the guest wants the interrupt setup.

Yes, this is what I concluded in the previous mail. The hypercall should
provide that information to the host and then get the addr/data pair
back from the host.

>
> Forget about guests and all of this complexity, design how to make
> VFIO work with IMS in pure userspace like DPDK.
>
> We must have a VFIO ioctl to acquire a addr/data pair and link it to
> an event fd.

No question on it. Still need more thought down this road.

Thanks
Kevin

2021-12-11 08:06:43

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Thomas Gleixner <[email protected]>
> Sent: Friday, December 10, 2021 8:13 PM
>
> >> 5) It's not possible for the kernel to reliably detect whether it is
> >> running on bare metal or not. Yes we talked about heuristics, but
> >> that's something I really want to avoid.
> >
> > How would the hypercall mechanism avoid such heuristics?
>
> The availability of IR remapping where the irqdomain which is provided
> by the remapping unit signals that it supports this new scheme:
>
>                |--IO/APIC
>                |--MSI
> vector -- IR --|--MSI-X
>                |--IMS
>
> while the current scheme is:
>
>                |--IO/APIC
> vector -- IR --|--PCI/MSI[-X]
>
> or
>
>                |--IO/APIC
> vector --------|--PCI/MSI[-X]
>
> So in the new scheme the IR domain will advertise new features which are
> not available on older kernels. The availability of these new features
> is the indicator for the interrupt subsystem and subsequently for PCI
> whether IMS is supported or not.
>
> Bootup either finds an IR unit or not. In the bare metal case that's the
> usual hardware/firmware detection. In the guest case it's the
> availability of vIR including the required hypercall protocol.

Given we have vIR already, there are three scenarios:

1) Bare metal: IR (no hypercall, for sure)
2) VM: vIR (no hypercall, today)
3) VM: vIR (hypercall, tomorrow)

IMS should be allowed only for 1) and 3).

But how to differentiate 2) from 1) if no guest heuristics?

btw I checked Qemu history to find vIR was introduced in 2016:

commit 1121e0afdcfa0cd40e36bd3acff56a3fac4f70fd
Author: Peter Xu <[email protected]>
Date: Thu Jul 14 13:56:13 2016 +0800

x86-iommu: introduce "intremap" property

Adding one property for intel-iommu devices to specify whether we should
support interrupt remapping. By default, IR is disabled. To enable it,
we should use (take Intel IOMMU as example):

-device intel_iommu,intremap=on

This property can be shared by Intel and future AMD IOMMUs.

Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

>
> > Then Qemu needs to find out the GSI number for the vIRTE handle.
> > Again Qemu doesn't have such information since it doesn't know
> > which MSI[-X] entry points to this handle due to no trap.
> >
> > This implies that we may also need carry device ID, #msi entry, etc.
> > in the hypercall, so Qemu can associate the virtual routing info
> > to the right [irqfd, gsi].
> >
> > In your model the hypercall is raised by IR domain. Do you see
> > any problem of finding those information within IR domain?
>
> IR has the following information available:
>
> Interrupt type
> - MSI: Device, index and number of vectors
> - MSI-X: Device, index
> - IMS: Device, index
>
> Target APIC/vector pair
>
> IMS: The index depends on the storage type:
>
> For storage in device memory, e.g. IDXD, it's the array index.
>
> For storage in system memory, the index is a software artifact.
>
> Does that answer your question?
>

Yes.

Thanks
Kevin

2021-12-11 13:04:59

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Kevin,

On Sat, Dec 11 2021 at 07:44, Kevin Tian wrote:
>> From: Thomas Gleixner <[email protected]>
>> On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:
>> > It is clever, we don't have an vIOMMU that supplies vIR today, so by
>> > definition all guests are excluded and only bare metal works.
>>
>> Dammit. Now you spilled the beans. :)
>
> Unfortunately we do have that today. Qemu supports IR for
> both AMD and Intel vIOMMU.

can you point me to the code?

All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
vIR related there.

Thanks,

tglx

2021-12-12 00:12:12

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Kevin,

On Sat, Dec 11 2021 at 07:52, Kevin Tian wrote:
>> From: Jason Gunthorpe <[email protected]>
>> > Then Qemu needs to find out the GSI number for the vIRTE handle.
>> > Again Qemu doesn't have such information since it doesn't know
>> > which MSI[-X] entry points to this handle due to no trap.
>>
>> No this is already going wrong. qemu *cannot* know the MSI information
>> because there is no MSI information for IMS.
>
> I haven't thought of IMS at this step. The IR approach applies to
> all types of interrupt storages, thus I'm more interested in how it
> affect the storages which are already virtualized today (MSI[-X]
> in my thought practice).

They are not any different. As I explained several times now, IMS is
nothing new at all. It has existed since the invention of Message Signaled
Interrupts. Why?

The principle behind Message Signaled Interrupts is:

Device writes DATA to ADDRESS which raises an interrupt in a CPU

Message Signaled Interrupts obviously need some place to store the
ADDRESS/DATA pair so that the device can use it for raising an
interrupt, i.e. an

Interrupt Message Store, short IMS.

PCI/MSI was the first implementation of this and the storage was defined
to be at a specified and therefore uniform and device independent place.

PCI/MSI-X followed the same approach. While it solved quite some of the
shortcomings of PCI/MSI, it still has a specified, uniform and device
independent place to store the message (ADDRESS/DATA pair).

Now the PCI wizards figured out that PCI/MSI[-X] is no longer up to the
task for various reasons and came up with the revolutionary new concept
of IMS, aka Interrupt Message Store, where the device defines where the
message is stored.

IOW, this is coming back full circle to the original problem of where to
store the message, i.e. the ADDRESS/DATA pair so that the device can
raise an interrupt in a CPU, which requires - drum roll - an

Interrupt Message Store, short IMS.

So you simply have to look at it from a pure MSI (not PCI/MSI) point
of view:

MSI at the conceptual level requires storage for the ADDRESS/DATA
pair at some place so that the device or the compute unit embedded in
the device can write DATA to ADDRESS.

That's it. Not more, not less.

When you look at it from this perspective, then you'll realize that

PCI/MSI and PCI/MSI-X are just implementations of IMS

Not more, not less. The fact that they have very strict rules about the
storage space and the fact that they are mutually exclusive does not
change that at all.

That's where a lot of the confusion comes from. If you go back to all
the IDXD/IMS discussions which happened over time then you'll figure out
that _all_ of us were coming from the same wrong assumption:

IMS is new and it's just another exclusive variant of PCI/MSI and
PCI/MSI-X.

It took _all_ of us quite some time to realize that we need to look at
it from the other way around.

There was surely some other conceptual confusion vs. subdevices, queues
and whatever involved which contributed to that. Water under the bridge.

Coming back to your initial question:

> I haven't thought of IMS at this step. The IR approach applies to
> all types of interrupt storages, thus I'm more interested in how it
> affect the storages which are already virtualized today (MSI[-X]
> in my thought practice).

Stop focussing on implementation details. Focus on the general concept
instead. See above.

Thanks,

tglx


2021-12-12 02:00:00

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Thomas Gleixner <[email protected]>
> Sent: Saturday, December 11, 2021 9:05 PM
>
> Kevin,
>
> On Sat, Dec 11 2021 at 07:44, Kevin Tian wrote:
> >> From: Thomas Gleixner <[email protected]>
> >> On Fri, Dec 10 2021 at 08:39, Jason Gunthorpe wrote:
> >> > It is clever, we don't have an vIOMMU that supplies vIR today, so by
> >> > definition all guests are excluded and only bare metal works.
> >>
> >> Dammit. Now you spilled the beans. :)
> >
> > Unfortunately we do have that today. Qemu supports IR for
> > both AMD and Intel vIOMMU.
>
> can you point me to the code?
>
> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> vIR related there.
>

Well, virtio-iommu is a para-virtualized vIOMMU implementation.

In reality there are also fully emulated vIOMMU implementations (e.g.
Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
the IR logic in the existing iommu drivers just applies:

drivers/iommu/intel/irq_remapping.c
drivers/iommu/amd/iommu.c
...

As I replied in another mail, the 1st vIR implementation was introduced
into Qemu back in 2016:

commit 1121e0afdcfa0cd40e36bd3acff56a3fac4f70fd
Author: Peter Xu <[email protected]>
Date: Thu Jul 14 13:56:13 2016 +0800

x86-iommu: introduce "intremap" property

Adding one property for intel-iommu devices to specify whether we should
support interrupt remapping. By default, IR is disabled. To enable it,
we should use (take Intel IOMMU as example):

-device intel_iommu,intremap=on

This property can be shared by Intel and future AMD IOMMUs.

Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>

Thanks
Kevin

2021-12-12 02:14:22

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Hi, Thomas,

> From: Thomas Gleixner <[email protected]>
> Sent: Sunday, December 12, 2021 8:12 AM
>
> Kevin,
>
> On Sat, Dec 11 2021 at 07:52, Kevin Tian wrote:
> >> From: Jason Gunthorpe <[email protected]>
> >> > Then Qemu needs to find out the GSI number for the vIRTE handle.
> >> > Again Qemu doesn't have such information since it doesn't know
> >> > which MSI[-X] entry points to this handle due to no trap.
> >>
> >> No this is already going wrong. qemu *cannot* know the MSI information
> >> because there is no MSI information for IMS.
> >
> > I haven't thought of IMS at this step. The IR approach applies to
> > all types of interrupt storages, thus I'm more interested in how it
> > affect the storages which are already virtualized today (MSI[-X]
> > in my thought practice).
>
> They are not any different. As I explained several times now IMS is
> nothing new at all. It existed since the invention of Message Signaled
> interrupts. Why?
>
> The principle behind Message Signaled Interrupts is:
>
> Device writes DATA to ADDRESS which raises an interrupt in a CPU
>
> Message Signaled Interrupts obviously need some place to store the
> ADDRESS/DATA pair so that the device can use it for raising an
> interrupt, i.e. an
>
> Interrupt Message Store, short IMS.
>
> PCI/MSI was the first implementation of this and the storage was defined
> to be at a specified and therefore uniform and device independent place.
>
> PCI/MSI-X followed the same approach. While it solved quite some of the
> shortcomings of PCI/MSI, it still has a specified, uniform and device
> independent place to store the message (ADDRESS/DATA pair).
>
> Now the PCI wizards figured out that PCI/MSI[-X] is no longer up to the
> task for various reasons and came up with the revolutionary new concept
> of IMS, aka Interrupt Message Store, where the device defines where the
> message is stored.
>
> IOW, this is coming back full circle to the original problem of where to
> store the message, i.e. the ADDRESS/DATA pair so that the device can
> raise an interrupt in a CPU, which requires - drum roll - an
>
> Interrupt Message Store, short IMS.
>
> So you simply have to look at it from a pure MSI (not PCI/MSI) point
> of view:
>
> MSI at the conceptual level requires storage for the ADDRESS/DATA
> pair at some place so that the device or the compute unit embedded in
> the device can write DATA to ADDRESS.
>
> That's it. Not more, not less.
>
> When you look at it from this perspective, then you'll realize that
>
> PCI/MSI and PCI/MSI-X are just implementations of IMS
>
> Not more, not less. The fact that they have very strict rules about the
> storage space and the fact that they are mutually exclusive does not
> change that at all.
>
> That's where a lot of the confusion comes from. If you go back to all
> the IDXD/IMS discussions which happened over time then you'll figure out
> that _all_ of us were coming from the same wrong assumption:
>
> IMS is new and it's just another exclusive variant of PCI/MSI and
> PCI/MSI-X.
>
> It took _all_ of us quite some time to realize that we need to look at
> it from the other way around.
>
> There was surely some other conceptual confusion vs. subdevices, queues
> and whatever involved which contributed to that. Water under the bridge.
>
> Coming back to your initial question:
>
> > I haven't thought of IMS at this step. The IR approach applies to
> > all types of interrupt storages, thus I'm more interested in how it
> > affect the storages which are already virtualized today (MSI[-X]
> > in my thought practice).
>
> Stop focussing on implementation details. Focus on the general concept
> instead. See above.
>

I have no problem understanding the above relation. I have acked it
multiple times in my previous replies.

I just continued the thought practice along that direction to see what
the host flow will be like (step by step). Looking at the current
implementation is just one necessary step in that practice to help
refine the picture. When I found something which may be worth aligning
on, I shared it to avoid following a wrong direction too far.

If both of you think it simply adds noise to this discussion, I can
surely hold back and focus on the 'concept' only.

Thanks
Kevin

2021-12-12 06:44:59

by Mika Penttilä

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()



On 10.12.2021 9.36, Tian, Kevin wrote:
>> From: Jason Gunthorpe <[email protected]>
>> Sent: Friday, December 10, 2021 4:59 AM
>>
>> On Thu, Dec 09, 2021 at 09:32:42PM +0100, Thomas Gleixner wrote:
>>> On Thu, Dec 09 2021 at 12:21, Jason Gunthorpe wrote:
>>>> On Thu, Dec 09, 2021 at 09:37:06AM +0100, Thomas Gleixner wrote:
>>>> If we keep the MSI emulation in the hypervisor then MSI != IMS. The
>>>> MSI code needs to write a addr/data pair compatible with the emulation
>>>> and the IMS code needs to write an addr/data pair from the
>>>> hypercall. Seems like this scenario is best avoided!
>>>>
>>>> From this perspective I haven't connected how virtual interrupt
>>>> remapping helps in the guest? Is this a way to provide the hypercall
>>>> I'm imagining above?
>>> That was my thought to avoid having different mechanisms.
>>>
>>> The address/data pair is computed in two places:
>>>
>>> 1) Activation of an interrupt
>>> 2) Affinity setting on an interrupt
>>>
>>> Both configure the IRTE when interrupt remapping is in place.
>>>
>>> In both cases a vector is allocated in the vector domain and based on
>>> the resulting target APIC / vector number pair the IRTE is
>>> (re)configured.
>>>
>>> So putting the hypercall into the vIRTE update is the obvious
>>> place. Both activation and affinity setting can fail and propagate an
>>> error code down to the originating caller.
>>>
>>> Hmm?
>> Okay, I think I get it. Would be nice to have someone from intel
>> familiar with the vIOMMU protocols and qemu code remark what the
>> hypervisor side can look like.
>>
>> There is a bit more work here, we'd have to change VFIO to somehow
>> entirely disconnect the kernel IRQ logic from the MSI table and
>> directly pass control of it to the guest after the hypervisor IOMMU IR
>> secures it. ie directly mmap the msi-x table into the guest
>>
> It's supported already:
>
> /*
> * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> * which allows direct access to non-MSIX registers which happened to be within
> * the same system page.
> *
> * Even though the userspace gets direct access to the MSIX data, the existing
> * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
> */
> #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3
>
> IIRC this was introduced for PPC when a device has MSI-X in the same BAR as
> other MMIO registers. Trapping MSI-X leads to performance downgrade on
> accesses to adjacent registers. MSI-X can be mapped by userspace because
> PPC already uses a hypercall mechanism for interrupt. Though unclear about
> the detail it sounds a similar usage as proposed here.
>
> Thanks
> Kevin

I see VFIO_REGION_INFO_CAP_MSIX_MAPPABLE is always set, so if the MSI-X
table is in its own BAR, qemu never traps/emulates the access. On the other
hand, qemu is said to depend on emulating masking. So how is this
supposed to work in case the table is not in the config BAR?

Thanks,
Mika



2021-12-12 20:50:47

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Kevin,

On Sun, Dec 12 2021 at 02:14, Kevin Tian wrote:
>> From: Thomas Gleixner <[email protected]>
> I just continue the thought practice along that direction to see what
> the host flow will be like (step by step). Looking at the current
> implementation is just one necessary step in my thought practice to
> help refine the picture. When I found something which may be
> worth being aligned then I shared to avoid follow a wrong direction
> too far.
>
> If both of your think it simply adds noise to this discussion, I can
> surely hold back and focus on 'concept' only.

All good. We _want_ your participation for sure. Comparing and
contrasting it to the existing flow is fine.

Thanks,

tglx

2021-12-12 20:55:37

by Thomas Gleixner

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Kevin,

On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
>> From: Thomas Gleixner <[email protected]>
>> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
>> vIR related there.
>
> Well, virtio-iommu is a para-virtualized vIOMMU implementations.
>
> In reality there are also fully emulated vIOMMU implementations (e.g.
> Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> the IR logic in existing iommu drivers just apply:
>
> drivers/iommu/intel/irq_remapping.c
> drivers/iommu/amd/iommu.c

thanks for the explanation. So that's a full IOMMU emulation. I was more
expecting a paravirtualized lightweight one.

Thanks,

tglx

2021-12-12 23:28:07

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Sun, Dec 12, 2021 at 08:44:46AM +0200, Mika Penttilä wrote:

> > /*
> > * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> > * which allows direct access to non-MSIX registers which happened to be within
> > * the same system page.
> > *
> > * Even though the userspace gets direct access to the MSIX data, the existing
> > * VFIO_DEVICE_SET_IRQS interface must still be used for MSIX configuration.
> > */
> > #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3
> >
> > IIRC this was introduced for PPC when a device has MSI-X in the same BAR as
> > other MMIO registers. Trapping MSI-X leads to performance downgrade on
> > accesses to adjacent registers. MSI-X can be mapped by userspace because
> > PPC already uses a hypercall mechanism for interrupt. Though unclear about
> > the detail it sounds a similar usage as proposed here.
> >
> > Thanks
> > Kevin
>
> I see  VFIO_REGION_INFO_CAP_MSIX_MAPPABLE is always set so if msix table is
> in its own bar, qemu never traps/emulates the access.

It is some backwards compat; the kernel always sets it to indicate a
new kernel. That doesn't mean qemu doesn't trap.

As the comment says, "VFIO_DEVICE_SET_IRQS interface must still be
used for MSIX configuration", so there is no way qemu can meet that
without either trapping the MSI page or using a special hypercall
(ppc).

Jason

2021-12-12 23:37:20

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Sun, Dec 12, 2021 at 09:55:32PM +0100, Thomas Gleixner wrote:
> Kevin,
>
> On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
> >> From: Thomas Gleixner <[email protected]>
> >> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> >> vIR related there.
> >
> > Well, virtio-iommu is a para-virtualized vIOMMU implementations.
> >
> > In reality there are also fully emulated vIOMMU implementations (e.g.
> > Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> > the IR logic in existing iommu drivers just apply:
> >
> > drivers/iommu/intel/irq_remapping.c
> > drivers/iommu/amd/iommu.c
>
> thanks for the explanation. So that's a full IOMMU emulation. I was more
> expecting a paravirtualized lightweight one.

Kevin, can you explain what on earth vIR is for and how it works??

Obviously we don't expose the IR machinery to userspace, so at best
this is somehow changing what the MSI trap does?

Jason

2021-12-12 23:42:59

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Sun, Dec 12, 2021 at 01:12:05AM +0100, Thomas Gleixner wrote:

> PCI/MSI and PCI/MSI-X are just implementations of IMS
>
> Not more, not less. The fact that they have very strict rules about the
> storage space and the fact that they are mutually exclusive does not
> change that at all.

And the mess we have is that virtualization broke this
design. Virtualization made MSI/MSI-X special!

I am wondering if we just need to bite the bullet and force the
introduction of a new ACPI flag for the APIC that says one of:
- message addr/data pairs work correctly (baremetal)
- creating message addr/data pairs need to use a hypercall protocol
- property not defined so assume only MSI/MSI-X/etc work.

Intel was originally trying to do this with the 'IMS enabled' PCI
Capability block, but a per PCI device capability is in the wrong
layer.

Jason

2021-12-13 07:50:59

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Jason Gunthorpe <[email protected]>
> Sent: Monday, December 13, 2021 7:37 AM
>
> On Sun, Dec 12, 2021 at 09:55:32PM +0100, Thomas Gleixner wrote:
> > Kevin,
> >
> > On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
> > >> From: Thomas Gleixner <[email protected]>
> > >> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> > >> vIR related there.
> > >
> > > Well, virtio-iommu is a para-virtualized vIOMMU implementations.
> > >
> > > In reality there are also fully emulated vIOMMU implementations (e.g.
> > > Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> > > the IR logic in existing iommu drivers just apply:
> > >
> > > drivers/iommu/intel/irq_remapping.c
> > > drivers/iommu/amd/iommu.c
> >
> > thanks for the explanation. So that's a full IOMMU emulation. I was more
> > expecting a paravirtualized lightweight one.
>
> Kevin can you explain what on earth vIR is for and how does it work??
>
> Obviously we don't expose the IR machinery to userspace, so at best
> this is somehow changing what the MSI trap does?
>

Initially it was introduced to support more than 255 vCPUs. Being a
full emulation, this capability can certainly support other vIR usages
as observed on bare metal.

vIR doesn't rely on the physical IR presence.

First, if the guest doesn't have a vfio device, then the physical
capability doesn't matter.

Even with a vfio device, IR by definition is just about remapping instead
of injection (more on this part later). The interrupts are always routed
to the host handler first (vfio_msihandler() in this case), which then
triggers the irqfd to call the virtual interrupt injection handler
(irqfd_wakeup()) in kvm.

This suggests a clear role split between vfio and kvm:

- vfio is responsible for irq allocation/startup as it is the device driver;
- kvm takes care of virtual interrupt injection, being a VMM;

The two are connected via irqfd.

Following this split, vIR information is completely hidden in userspace.
Qemu calculates the routing information between vGSI and vCPU
(with or without vIR, and for whatever trapped interrupt storages)
and then registers it to kvm.

When kvm receives a notification via irqfd, it follows irqfd->vGSI->vCPU
and injects a virtual interrupt into the target vCPU.

Then comes an interesting scenario about IOMMU posted interrupts (PI).
This capability allows the IR engine to directly convert a physical
interrupt into a virtual one and then inject it into the guest, in effect
offloading the virtual routing information into the hardware.

This is currently achieved via IRQ bypass manager, which helps connect
vfio (IRQ producer) to kvm (IRQ consumer) around a specific Linux irq
number. Once the connection is established, kvm calls
irq_set_vcpu_affinity() to update IRTE with virtual routing information
for that irq number.

With that design Qemu doesn't know whether IR or PI is enabled
physically. It always talks to vfio for having IRQ resource allocated and
to kvm for registering virtual routing information.

Then adding the new hypercall machinery into this picture:

1) The hypercall needs to carry all the necessary virtual routing
information due to no-trap;

2) Before querying the IRTE addr/data pair, Qemu needs to complete the
necessary operations as of today to have the IRTE ready:

a) Register the irqfd and related GSI routing info with kvm;
b) Allocate/start up IRQs via vfio;

When PI is enabled, the IRTE is ready only after both are completed.

3) Qemu gets the IRTE addr/data pair from the kernel and returns it to
the guest.

Thanks
Kevin

2022-09-15 09:45:02

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

Hi, Thomas,

> From: Thomas Gleixner <[email protected]>
> Sent: Monday, December 13, 2021 4:56 AM
>
> Kevin,
>
> On Sun, Dec 12 2021 at 01:56, Kevin Tian wrote:
> >> From: Thomas Gleixner <[email protected]>
> >> All I can find is drivers/iommu/virtio-iommu.c but I can't find anything
> >> vIR related there.
> >
> > Well, virtio-iommu is a para-virtualized vIOMMU implementations.
> >
> > In reality there are also fully emulated vIOMMU implementations (e.g.
> > Qemu fully emulates Intel/AMD/ARM IOMMUs). In those configurations
> > the IR logic in existing iommu drivers just apply:
> >
> > drivers/iommu/intel/irq_remapping.c
> > drivers/iommu/amd/iommu.c
>
> thanks for the explanation. So that's a full IOMMU emulation. I was more
> expecting a paravirtualized lightweight one.
>

Resuming this old thread, as I realized this open issue was not closed, after
discussing with Reinette who will pick up this work.

In practice emulated IOMMUs are still widely used and many new features
(pasid, sva, etc.) will come to them first instead of to virtio-iommu. So it'd
be good to make the new scheme work on emulated IOMMUs too.

Following this direction, probably one feasible option is to introduce a
certain PV facility in the spec of emulated IOMMUs as a contract to
differentiate from their existing vIR logic, if we don't want to go back
to heuristics.

Intel-IOMMU already defines a VCMD interface in the spec, in particular for
exchanging information between host and guest. It can be easily extended to
indicate and allow exchanging the interrupt addr/data pair between host
and guest.

Does this sound like the right direction for other IOMMU vendors to follow if
they want to benefit from the new scheme instead of using virtio-iommu?

And with that we don't need define CPU/VMM specific hypercalls. Just
rely on vIOMMUs to claim the capability and enable the new vIR scheme.

--

btw another open issue is about VM live migration.

After migration the IRTE index could change hence the addr/data pair
acquired before migration becomes stale and must be fixed.

And the stale content is both programmed into the interrupt storage and
cached in the msi_desc.

The host migration driver may be able to help fix the addr/data in MMIO-based
IMS when restoring the device state, but not in memory-based IMS or in the
guest's cached content.

This then requires guest cooperation to get it right, e.g. notify the guest to
mark previously acquired addr/data as stale before stopping the VM on the src
and then notify it to reacquire addr/data after resuming the VM on the dest.

But I'm not sure how to handle interrupts lost in the window between resuming
the VM and notifying it to reacquire...

Thanks
Kevin

2022-09-20 14:44:06

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Sep 15, 2022 at 09:24:07AM +0000, Tian, Kevin wrote:

> After migration the IRTE index could change hence the addr/data pair
> acquired before migration becomes stale and must be fixed.

The migration has to keep this stuff stable somehow, it seems
infeasible to fix it after the fact.

Jason

2022-09-21 08:11:24

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, September 20, 2022 10:10 PM
>
> On Thu, Sep 15, 2022 at 09:24:07AM +0000, Tian, Kevin wrote:
>
> > After migration the IRTE index could change hence the addr/data pair
> > acquired before migration becomes stale and must be fixed.
>
> The migration has to keep this stuff stable somehow, it seems
> infeasible to fix it after the fact.
>

This is not how live migration typically works, i.e. we don't try to
freeze the same host resource across migration. It's pretty fragile
and the chance of failure is high.

btw isn't it the same reason why we don't want to expose the host
PASID to the guest in the iommufd discussion?

My overall impression is that any such exposure of a host resource
requires certain guest cooperation to do an after-the-fact fix to
enable migration (though it's obviously difficult for this interrupt
case), otherwise the actual usage would be very limited...

2022-09-21 12:55:34

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Wed, Sep 21, 2022 at 07:57:00AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Tuesday, September 20, 2022 10:10 PM
> >
> > On Thu, Sep 15, 2022 at 09:24:07AM +0000, Tian, Kevin wrote:
> >
> > > After migration the IRTE index could change hence the addr/data pair
> > > acquired before migration becomes stale and must be fixed.
> >
> > The migration has to keep this stuff stable somehow, it seems
> > infeasible to fix it after the fact.
> >
>
> This is not how live migration typically works, i.e. we don't try to
> freeze the same host resource cross migration. It's pretty fragile
> and the chance of failure is high.

Thinking of the interrupt routing as a host resource is the problem -
it is a device resource and just like PASID the ideal design would
have each RID have its own stable numberspace scoped within it. The
entire RID and all its stable numberspace would be copied over.

This includes any pPASIDs and MSI routing.

> btw isn't it the same reason why we don't want to expose host
> PASID into guest in iommufd discussion?

Not quite; the only reason not to expose the physical PASID is ENQCMD
and other mdev-based approaches.

If IMS is being used with a mdev then perhaps the mdev driver can do
the vIMS/pIMS translation much like we want to do for pPASID/vPASID

But if a RID is being used with IMS then the MSI under the RID should
be stable, IMHO.

Jason

2022-09-22 05:21:53

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, September 21, 2022 8:48 PM
>
> On Wed, Sep 21, 2022 at 07:57:00AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Tuesday, September 20, 2022 10:10 PM
> > >
> > > On Thu, Sep 15, 2022 at 09:24:07AM +0000, Tian, Kevin wrote:
> > >
> > > > After migration the IRTE index could change hence the addr/data pair
> > > > acquired before migration becomes stale and must be fixed.
> > >
> > > The migration has to keep this stuff stable somehow, it seems
> > > infeasible to fix it after the fact.
> > >
> >
> > This is not how live migration typically works, i.e. we don't try to
> > freeze the same host resource cross migration. It's pretty fragile
> > and the chance of failure is high.
>
> Thinking of the interrupt routing as a host resource is the problem -
> it is a device resource and just like PASID the ideal design would
> have each RID have its own stable numberspace scoped within it. The
> entire RID and all its stable numberspace would be copied over.
>

Unfortunately it is not the case, at least on Intel VT-d: the interrupt
remapping table is per VT-d unit instead of per RID.

2022-09-22 12:44:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Sep 22, 2022 at 05:11:00AM +0000, Tian, Kevin wrote:

> > Thinking of the interrupt routing as a host resource is the problem -
> > it is a device resource and just like PASID the ideal design would
> > have each RID have its own stable numberspace scoped within it. The
> > entire RID and all its stable numberspace would be copied over.
> >
>
> Unfortunately it is not the case at least on Intel VT-d. The interrupt
> remapping table is per VT-d instead of per RID.

Doesn't that just turn it into a SW problem how it manages it?

Jason

2022-09-22 22:58:06

by Tian, Kevin

[permalink] [raw]
Subject: RE: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, September 22, 2022 8:14 PM
>
> On Thu, Sep 22, 2022 at 05:11:00AM +0000, Tian, Kevin wrote:
>
> > > Thinking of the interrupt routing as a host resource is the problem -
> > > it is a device resource and just like PASID the ideal design would
> > > have each RID have its own stable numberspace scoped within it. The
> > > entire RID and all its stable numberspace would be copied over.
> > >
> >
> > Unfortunately it is not the case at least on Intel VT-d. The interrupt
> > remapping table is per VT-d instead of per RID.
>
> Doesn't that just turn it into a SW problem how it manages it?
>

When IRTEs are shared across RIDs I'm not sure how to make the
allocation stable between the src and the dest. They are allocated
dynamically and hierarchically in request_irq().

I'm not sure what SW mechanism can cleanly ensure a stable
allocation scheme between two machines.

Can you elaborate?

2022-09-23 13:36:45

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

On Thu, Sep 22, 2022 at 10:42:46PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, September 22, 2022 8:14 PM
> >
> > On Thu, Sep 22, 2022 at 05:11:00AM +0000, Tian, Kevin wrote:
> >
> > > > Thinking of the interrupt routing as a host resource is the problem -
> > > > it is a device resource and just like PASID the ideal design would
> > > > have each RID have its own stable numberspace scoped within it. The
> > > > entire RID and all its stable numberspace would be copied over.
> > > >
> > >
> > > Unfortunately it is not the case at least on Intel VT-d. The interrupt
> > > remapping table is per VT-d instead of per RID.
> >
> > Doesn't that just turn it into a SW problem how it manages it?
> >
>
> When IRTEs are shared across RIDs I'm not sure how to make the
> allocation stable between the src and the dest. They are allocated
> dynamically and hierarchically in request_irq().

So I suppose the issue is labeling, and we can't standardize where in
the shared space each RID will fall, and the HW can't virtualize this
for us.

Maybe the answer is that IMS just isn't supported on this HW if live
migration is also turned on.

Jason
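The allocation-stability problem discussed in this subthread can be sketched with a toy userspace model: when IRTEs come out of a remapping table shared across RIDs, the index a given RID ends up with depends on the order in which request_irq() happens to run on each host, so nothing keeps it stable across migration. All names and numbers here are illustrative assumptions, not VT-d or kernel internals:

```c
#include <assert.h>

/* Toy model: a bump allocator over a remapping table shared by all
 * RIDs on a host.  The IRTE index handed to a RID is just "whatever
 * was free when its request_irq() ran", so two hosts that service
 * the requests in different orders hand out different indices.
 */
#define IRTE_TABLE_SIZE 16

struct toy_host {
	int next_free;			/* next free slot in the shared table */
	int irte_of_rid[4];		/* IRTE index each RID was handed */
};

static void toy_request_irq(struct toy_host *h, int rid)
{
	h->irte_of_rid[rid] = h->next_free++;
}
```

Running the same two RIDs through this model in a different order on the source and destination gives the same RID different IRTE indices, which is exactly the guest-visible state that cannot be kept stable without a per-RID numberspace.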

Subject: [tip: irq/core] PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X

The following commit has been merged into the irq/core branch of tip:

Commit-ID: 9e8abb30dddff204aa61c90a326cc9793da63efb
Gitweb: https://git.kernel.org/tip/9e8abb30dddff204aa61c90a326cc9793da63efb
Author: Thomas Gleixner <[email protected]>
AuthorDate: Fri, 25 Nov 2022 00:26:24 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Mon, 05 Dec 2022 19:21:04 +01:00

PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X

MSI-X vectors can be allocated after the initial MSI-X enablement, but this
needs explicit support of the underlying interrupt domains.

Provide a function to query the ability and functions to allocate/free
individual vectors post-enable.

The allocation can either request a specific index in the MSI-X table or
with the index argument MSI_ANY_INDEX it allocates the next free vector.

The return value is a struct msi_map which on success contains both index
and the Linux interrupt number. In case of failure index is negative and
the Linux interrupt number is 0.

The allocation function handles a single MSI-X index at a time, as that's
sufficient for the most urgent use case: letting VFIO get rid of the
'disable MSI-X, reallocate, enable MSI-X' cycle, which is prone to lost
interrupts and redirections to the legacy and obviously unhandled INTx.

Single index allocation is also sufficient for the use cases Jason
Gunthorpe pointed out: allocation of an MSI-X or IMS vector for a network
queue. See Link below.

Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/r/[email protected]

---
drivers/pci/msi/api.c | 67 ++++++++++++++++++++++++++++++++++++-
drivers/pci/msi/irqdomain.c | 3 +-
include/linux/pci.h | 6 +++-
3 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 2d46a0c..c8816db 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -113,6 +113,73 @@ int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
EXPORT_SYMBOL(pci_enable_msix_range);

/**
+ * pci_msix_can_alloc_dyn - Query whether dynamic allocation after enabling
+ * MSI-X is supported
+ *
+ * @dev: PCI device to operate on
+ *
+ * Return: True if supported, false otherwise
+ */
+bool pci_msix_can_alloc_dyn(struct pci_dev *dev)
+{
+ if (!dev->msix_cap)
+ return false;
+
+ return pci_msi_domain_supports(dev, MSI_FLAG_PCI_MSIX_ALLOC_DYN, DENY_LEGACY);
+}
+EXPORT_SYMBOL_GPL(pci_msix_can_alloc_dyn);
+
+/**
+ * pci_msix_alloc_irq_at - Allocate an MSI-X interrupt after enabling MSI-X
+ * at a given MSI-X vector index or any free vector index
+ *
+ * @dev: PCI device to operate on
+ * @index: Index to allocate. If @index == MSI_ANY_INDEX this allocates
+ * the next free index in the MSI-X table
+ * @affdesc: Optional pointer to an affinity descriptor structure. NULL otherwise
+ *
+ * Return: A struct msi_map
+ *
+ * On success msi_map::index contains the allocated index (>= 0) and
+ * msi_map::virq contains the allocated Linux interrupt number (> 0).
+ *
+ * On fail msi_map::index contains the error code and msi_map::virq
+ * is set to 0.
+ */
+struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index,
+ const struct irq_affinity_desc *affdesc)
+{
+ struct msi_map map = { .index = -ENOTSUPP };
+
+ if (!dev->msix_enabled)
+ return map;
+
+ if (!pci_msix_can_alloc_dyn(dev))
+ return map;
+
+ return msi_domain_alloc_irq_at(&dev->dev, MSI_DEFAULT_DOMAIN, index, affdesc, NULL);
+}
+EXPORT_SYMBOL_GPL(pci_msix_alloc_irq_at);
+
+/**
+ * pci_msix_free_irq - Free an interrupt on a PCI/MSIX interrupt domain
+ * which was allocated via pci_msix_alloc_irq_at()
+ *
+ * @dev: The PCI device to operate on
+ * @map: A struct msi_map describing the interrupt to free
+ * as returned from the allocation function.
+ */
+void pci_msix_free_irq(struct pci_dev *dev, struct msi_map map)
+{
+ if (WARN_ON_ONCE(map.index < 0 || map.virq <= 0))
+ return;
+ if (WARN_ON_ONCE(!pci_msix_can_alloc_dyn(dev)))
+ return;
+ msi_domain_free_irqs_range(&dev->dev, MSI_DEFAULT_DOMAIN, map.index, map.index);
+}
+EXPORT_SYMBOL_GPL(pci_msix_free_irq);
+
+/**
* pci_disable_msix() - Disable MSI-X interrupt mode on device
* @dev: the PCI device to operate on
*
diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c
index 8afaef1..deb1930 100644
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -225,7 +225,8 @@ static const struct msi_domain_template pci_msix_template = {
},

.info = {
- .flags = MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX,
+ .flags = MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX |
+ MSI_FLAG_PCI_MSIX_ALLOC_DYN,
.bus_token = DOMAIN_BUS_PCI_DEVICE_MSIX,
},
};
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 243e48f..68b14ba 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -38,6 +38,7 @@
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/resource_ext.h>
+#include <linux/msi_api.h>
#include <uapi/linux/pci.h>

#include <linux/pci_ids.h>
@@ -1559,6 +1560,11 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
unsigned int max_vecs, unsigned int flags,
struct irq_affinity *affd);

+bool pci_msix_can_alloc_dyn(struct pci_dev *dev);
+struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index,
+ const struct irq_affinity_desc *affdesc);
+void pci_msix_free_irq(struct pci_dev *pdev, struct msi_map map);
+
void pci_free_irq_vectors(struct pci_dev *dev);
int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev, int vec);
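The struct msi_map contract described in the changelog (on success index >= 0 and virq > 0, on failure a negative error code in index and virq == 0, with MSI_ANY_INDEX picking the next free slot) can be modelled as a small userspace sketch. The table size, fake virq numbering and the MSI_ANY_INDEX/ENOTSUPP values below are illustrative assumptions, not the kernel's implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the msi_map return convention of the interfaces above.
 * On success: index >= 0 and virq > 0.  On failure: index holds a
 * negative error code and virq is 0.
 */
#define TOY_TABLE_SIZE	8
#define MSI_ANY_INDEX	(~0U)		/* "allocate the next free index" */
#define TOY_ENOTSUPP	524		/* illustrative, mirrors -ENOTSUPP */

struct msi_map {
	int index;
	int virq;
};

static bool toy_table_used[TOY_TABLE_SIZE];

static struct msi_map toy_msix_alloc_irq_at(unsigned int index)
{
	struct msi_map map = { .index = -TOY_ENOTSUPP, .virq = 0 };
	unsigned int i;

	if (index == MSI_ANY_INDEX) {
		for (i = 0; i < TOY_TABLE_SIZE; i++) {
			if (!toy_table_used[i]) {
				index = i;
				break;
			}
		}
		if (index == MSI_ANY_INDEX)
			return map;		/* table exhausted */
	} else if (index >= TOY_TABLE_SIZE || toy_table_used[index]) {
		return map;			/* invalid or busy slot */
	}

	toy_table_used[index] = true;
	map.index = index;
	map.virq = 100 + index;			/* fake Linux irq number */
	return map;
}

static void toy_msix_free_irq(struct msi_map map)
{
	/* Mirrors the WARN_ON_ONCE() sanity checks in pci_msix_free_irq(). */
	assert(map.index >= 0 && map.virq > 0);
	toy_table_used[map.index] = false;
}
```

A caller checks `map.index < 0` for failure and otherwise uses `map.virq` with request_irq() and hands `map` back to the free function, which is the pattern the VFIO use case in the changelog relies on.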

Subject: [tip: irq/core] PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X

The following commit has been merged into the irq/core branch of tip:

Commit-ID: 34026364df8eca05ee32e706a2c014511a19af02
Gitweb: https://git.kernel.org/tip/34026364df8eca05ee32e706a2c014511a19af02
Author: Thomas Gleixner <[email protected]>
AuthorDate: Fri, 25 Nov 2022 00:26:24 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Mon, 05 Dec 2022 22:22:34 +01:00

PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X

MSI-X vectors can be allocated after the initial MSI-X enablement, but this
needs explicit support of the underlying interrupt domains.

Provide a function to query the ability and functions to allocate/free
individual vectors post-enable.

The allocation can either request a specific index in the MSI-X table or
with the index argument MSI_ANY_INDEX it allocates the next free vector.

The return value is a struct msi_map which on success contains both index
and the Linux interrupt number. In case of failure index is negative and
the Linux interrupt number is 0.

The allocation function handles a single MSI-X index at a time, as that's
sufficient for the most urgent use case: letting VFIO get rid of the
'disable MSI-X, reallocate, enable MSI-X' cycle, which is prone to lost
interrupts and redirections to the legacy and obviously unhandled INTx.

Single index allocation is also sufficient for the use cases Jason
Gunthorpe pointed out: allocation of an MSI-X or IMS vector for a network
queue. See Link below.

Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Link: https://lore.kernel.org/r/[email protected]

---
drivers/pci/msi/api.c | 67 ++++++++++++++++++++++++++++++++++++-
drivers/pci/msi/irqdomain.c | 3 +-
include/linux/pci.h | 6 +++-
3 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/msi/api.c b/drivers/pci/msi/api.c
index 2d46a0c..c8816db 100644
--- a/drivers/pci/msi/api.c
+++ b/drivers/pci/msi/api.c
@@ -113,6 +113,73 @@ int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
EXPORT_SYMBOL(pci_enable_msix_range);

/**
+ * pci_msix_can_alloc_dyn - Query whether dynamic allocation after enabling
+ * MSI-X is supported
+ *
+ * @dev: PCI device to operate on
+ *
+ * Return: True if supported, false otherwise
+ */
+bool pci_msix_can_alloc_dyn(struct pci_dev *dev)
+{
+ if (!dev->msix_cap)
+ return false;
+
+ return pci_msi_domain_supports(dev, MSI_FLAG_PCI_MSIX_ALLOC_DYN, DENY_LEGACY);
+}
+EXPORT_SYMBOL_GPL(pci_msix_can_alloc_dyn);
+
+/**
+ * pci_msix_alloc_irq_at - Allocate an MSI-X interrupt after enabling MSI-X
+ * at a given MSI-X vector index or any free vector index
+ *
+ * @dev: PCI device to operate on
+ * @index: Index to allocate. If @index == MSI_ANY_INDEX this allocates
+ * the next free index in the MSI-X table
+ * @affdesc: Optional pointer to an affinity descriptor structure. NULL otherwise
+ *
+ * Return: A struct msi_map
+ *
+ * On success msi_map::index contains the allocated index (>= 0) and
+ * msi_map::virq contains the allocated Linux interrupt number (> 0).
+ *
+ * On fail msi_map::index contains the error code and msi_map::virq
+ * is set to 0.
+ */
+struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index,
+ const struct irq_affinity_desc *affdesc)
+{
+ struct msi_map map = { .index = -ENOTSUPP };
+
+ if (!dev->msix_enabled)
+ return map;
+
+ if (!pci_msix_can_alloc_dyn(dev))
+ return map;
+
+ return msi_domain_alloc_irq_at(&dev->dev, MSI_DEFAULT_DOMAIN, index, affdesc, NULL);
+}
+EXPORT_SYMBOL_GPL(pci_msix_alloc_irq_at);
+
+/**
+ * pci_msix_free_irq - Free an interrupt on a PCI/MSIX interrupt domain
+ * which was allocated via pci_msix_alloc_irq_at()
+ *
+ * @dev: The PCI device to operate on
+ * @map: A struct msi_map describing the interrupt to free
+ * as returned from the allocation function.
+ */
+void pci_msix_free_irq(struct pci_dev *dev, struct msi_map map)
+{
+ if (WARN_ON_ONCE(map.index < 0 || map.virq <= 0))
+ return;
+ if (WARN_ON_ONCE(!pci_msix_can_alloc_dyn(dev)))
+ return;
+ msi_domain_free_irqs_range(&dev->dev, MSI_DEFAULT_DOMAIN, map.index, map.index);
+}
+EXPORT_SYMBOL_GPL(pci_msix_free_irq);
+
+/**
* pci_disable_msix() - Disable MSI-X interrupt mode on device
* @dev: the PCI device to operate on
*
diff --git a/drivers/pci/msi/irqdomain.c b/drivers/pci/msi/irqdomain.c
index 8afaef1..deb1930 100644
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -225,7 +225,8 @@ static const struct msi_domain_template pci_msix_template = {
},

.info = {
- .flags = MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX,
+ .flags = MSI_COMMON_FLAGS | MSI_FLAG_PCI_MSIX |
+ MSI_FLAG_PCI_MSIX_ALLOC_DYN,
.bus_token = DOMAIN_BUS_PCI_DEVICE_MSIX,
},
};
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 243e48f..68b14ba 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -38,6 +38,7 @@
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/resource_ext.h>
+#include <linux/msi_api.h>
#include <uapi/linux/pci.h>

#include <linux/pci_ids.h>
@@ -1559,6 +1560,11 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
unsigned int max_vecs, unsigned int flags,
struct irq_affinity *affd);

+bool pci_msix_can_alloc_dyn(struct pci_dev *dev);
+struct msi_map pci_msix_alloc_irq_at(struct pci_dev *dev, unsigned int index,
+ const struct irq_affinity_desc *affdesc);
+void pci_msix_free_irq(struct pci_dev *pdev, struct msi_map map);
+
void pci_free_irq_vectors(struct pci_dev *dev);
int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
const struct cpumask *pci_irq_get_affinity(struct pci_dev *pdev, int vec);