2020-08-26 12:19:00

by Thomas Gleixner

Subject: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

This is the second version of providing a base to support device MSI (non
PCI based) and on top of that support for IMS (Interrupt Message Storm)
based devices in a halfways architecture independent way.

The first version can be found here:

https://lore.kernel.org/r/[email protected]

It's still a mixed bag of bug fixes, cleanups and general improvements
which are worthwhile independent of device MSI.

There are quite a bunch of issues to solve:

- X86 does not use the device::msi_domain pointer for historical reasons
and due to XEN, which makes it impossible to create an architecture
agnostic device MSI infrastructure (see the sketch after this list).

- X86 has its own msi_alloc_info data type which is pointlessly
different from the generic version and does not allow sharing code.

- The logic of composing MSI messages in a hierarchy is busted at the
core level and of course some (x86) drivers depend on that.

- A few minor shortcomings as usual
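
To make the first issue concrete: the PCI/MSI core chooses the allocation
path roughly as sketched below (simplified from drivers/pci/msi.c of this
era, quoted from memory rather than verbatim). Without a per-device
msi_domain pointer, x86 always ends up in the legacy arch fallback:

static int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
	struct irq_domain *domain;

	domain = dev_get_msi_domain(&dev->dev);
	if (domain && irq_domain_is_hierarchy(domain))
		/* Generic, architecture agnostic MSI domain path */
		return msi_domain_alloc_irqs(domain, &dev->dev, nvec);

	/* Legacy per architecture fallback */
	return arch_setup_msi_irqs(dev, nvec, type);
}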

This series addresses that in several steps:

1) Accidental bug fixes

iommu/amd: Prevent NULL pointer dereference

2) Janitoring

x86/init: Remove unused init ops
PCI: vmd: Dont abuse vector irqomain as parent
x86/msi: Remove pointless vcpu_affinity callback

3) Sanitizing the composition of MSI messages in a hierarchy

genirq/chip: Use the first chip in irq_chip_compose_msi_msg()
x86/msi: Move compose message callback where it belongs
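
The rule enforced here: walk from the interrupt's own irq_data towards the
root of the hierarchy and let the first chip which implements
irq_compose_msi_msg() build the message. Roughly (a sketch of the core
helper, from memory, not the exact code):

int irq_chip_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
	struct irq_data *pos = NULL;

	/* Stop at the first chip in the hierarchy which can compose */
	for (; data; data = data->parent_data) {
		if (data->chip && data->chip->irq_compose_msi_msg) {
			pos = data;
			break;
		}
	}
	if (!pos)
		return -ENOSYS;

	pos->chip->irq_compose_msi_msg(pos, msg);
	return 0;
}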

4) Simplification of the x86 specific interrupt allocation mechanism

x86/irq: Rename X86_IRQ_ALLOC_TYPE_MSI* to reflect PCI dependency
x86/irq: Add allocation type for parent domain retrieval
iommu/vt-d: Consolidate irq domain getter
iommu/amd: Consolidate irq domain getter
iommu/irq_remapping: Consolidate irq domain lookup

5) Consolidation of the X86 specific interrupt allocation mechanism to be as close
as possible to the generic MSI allocation mechanism, which allows getting rid
of quite a bunch of pointless x86'isms

x86/irq: Prepare consolidation of irq_alloc_info
x86/msi: Consolidate HPET allocation
x86/ioapic: Consolidate IOAPIC allocation
x86/irq: Consolidate DMAR irq allocation
x86/irq: Consolidate UV domain allocation
PCI/MSI: Rework pci_msi_domain_calc_hwirq()
x86/msi: Consolidate MSI allocation
x86/msi: Use generic MSI domain ops

6) x86 specific cleanups to remove the dependency on arch_*_msi_irqs()

x86/irq: Move apic_post_init() invocation to one place
x86/pci: Reducde #ifdeffery in PCI init code
x86/irq: Initialize PCI/MSI domain at PCI init time
irqdomain/msi: Provide DOMAIN_BUS_VMD_MSI
PCI: vmd: Mark VMD irqdomain with DOMAIN_BUS_VMD_MSI
PCI/MSI: Provide pci_dev_has_special_msi_domain() helper
x86/xen: Make xen_msi_init() static and rename it to xen_hvm_msi_init()
x86/xen: Rework MSI teardown
x86/xen: Consolidate XEN-MSI init
irqdomain/msi: Allow to override msi_domain_alloc/free_irqs()
x86/xen: Wrap XEN MSI management into irqdomain
iommm/vt-d: Store irq domain in struct device
iommm/amd: Store irq domain in struct device
x86/pci: Set default irq domain in pcibios_add_device()
PCI/MSI: Make arch_.*_msi_irq[s] fallbacks selectable
x86/irq: Cleanup the arch_*_msi_irqs() leftovers
x86/irq: Make most MSI ops XEN private
iommu/vt-d: Remove domain search for PCI/MSI[X]
iommu/amd: Remove domain search for PCI/MSI
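
For reference, "fallbacks selectable" means the weak arch_*_msi_irq[s]()
hooks end up behind a Kconfig switch which only architectures that still
need them select, roughly like the sketch below. The symbol name
PCI_MSI_ARCH_FALLBACKS is the one referenced later in this thread; the
exact declarations are from memory:

/* include/linux/msi.h (sketch) */
#ifdef CONFIG_PCI_MSI_ARCH_FALLBACKS
int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
void arch_teardown_msi_irq(unsigned int irq);
int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
void arch_teardown_msi_irqs(struct pci_dev *dev);
#endif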

7) X86 specific preparation for device MSI

x86/irq: Add DEV_MSI allocation type
x86/msi: Rename and rework pci_msi_prepare() to cover non-PCI MSI

8) Generic device MSI infrastructure
platform-msi: Provide default irq_chip:: Ack
genirq/proc: Take buslock on affinity write
genirq/msi: Provide and use msi_domain_set_default_info_flags()
platform-msi: Add device MSI infrastructure
irqdomain/msi: Provide msi_alloc/free_store() callbacks

9) POC of IMS (Interrupt Message Storm) irq domain and irqchip
implementations for both device array and queue storage.

irqchip: Add IMS (Interrupt Message Storm) driver - NOT FOR MERGING

Changes vs. V1:

- Addressed various review comments and the 0day fallout.
- Corrected the XEN logic (Jürgen)
- Make the arch fallback in PCI/MSI opt-in not opt-out (Bjorn)

- Fixed the compose MSI message inconsistency

- Ensure that the necessary flags are set for device SMI

- Make the irq bus logic work for affinity setting to prepare
support for IMS storage in queue memory. It turned out to be
less scary than I feared.

- Remove leftovers in iommu/intel|amd

- Reworked the IMS POC driver to cover queue storage so Jason can have a
look whether that fits the needs of MLX devices.

The whole lot is also available from git:

git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git device-msi

This has been tested on Intel/AMD/KVM but lacks testing on:

- HYPERV (-ENODEV)
- VMD enabled systems (-ENODEV)
- XEN (-ENOCLUE)
- IMS (-ENODEV)

- Any non-X86 code which might depend on the broken compose MSI message
logic. Marc does not expect much fallout, but agrees that we need to fix
it anyway.

#1 - #3 should be applied unconditionally for obvious reasons
#4 - #6 are worthwhile cleanups which should be done independent of device MSI

#7 - #8 look promising to clean up the platform MSI implementation
independent of #8, but I neither had cycles nor the stomach to
tackle that.

#9 is obviously just for the folks interested in IMS

Thanks,

tglx


2020-08-28 11:45:06

by Joerg Roedel

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)
> based devices in a halfways architecture independent way.
>
> The first version can be found here:
>
> https://lore.kernel.org/r/[email protected]
>
> It's still a mixed bag of bug fixes, cleanups and general improvements
> which are worthwhile independent of device MSI.
>
> There are quite a bunch of issues to solve:
>
> - X86 does not use the device::msi_domain pointer for historical reasons
> and due to XEN, which makes it impossible to create an architecture
> agnostic device MSI infrastructure.
>
> - X86 has it's own msi_alloc_info data type which is pointlessly
> different from the generic version and does not allow to share code.
>
> - The logic of composing MSI messages in an hierarchy is busted at the
> core level and of course some (x86) drivers depend on that.
>
> - A few minor shortcomings as usual
>
> This series addresses that in several steps:

For all IOMMU changes:

Acked-by: Joerg Roedel <[email protected]>

2020-08-31 01:00:32

by Baolu Lu

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

Hi Thomas,

On 8/26/20 7:16 PM, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)
> based devices in a halfways architecture independent way.

After applying this patch series, the dmar_alloc_hwirq() helper doesn't
work anymore during boot. This causes the IOMMU driver to fail to
register the DMA fault handler and abort the IOMMU probe processing.
Is this a known issue?

Best regards,
baolu

>
> The first version can be found here:
>
> https://lore.kernel.org/r/[email protected]
>
> It's still a mixed bag of bug fixes, cleanups and general improvements
> which are worthwhile independent of device MSI.
>
> There are quite a bunch of issues to solve:
>
> - X86 does not use the device::msi_domain pointer for historical reasons
> and due to XEN, which makes it impossible to create an architecture
> agnostic device MSI infrastructure.
>
> - X86 has it's own msi_alloc_info data type which is pointlessly
> different from the generic version and does not allow to share code.
>
> - The logic of composing MSI messages in an hierarchy is busted at the
> core level and of course some (x86) drivers depend on that.
>
> - A few minor shortcomings as usual
>
> This series addresses that in several steps:
>
> 1) Accidental bug fixes
>
> iommu/amd: Prevent NULL pointer dereference
>
> 2) Janitoring
>
> x86/init: Remove unused init ops
> PCI: vmd: Dont abuse vector irqomain as parent
> x86/msi: Remove pointless vcpu_affinity callback
>
> 3) Sanitizing the composition of MSI messages in a hierarchy
>
> genirq/chip: Use the first chip in irq_chip_compose_msi_msg()
> x86/msi: Move compose message callback where it belongs
>
> 4) Simplification of the x86 specific interrupt allocation mechanism
>
> x86/irq: Rename X86_IRQ_ALLOC_TYPE_MSI* to reflect PCI dependency
> x86/irq: Add allocation type for parent domain retrieval
> iommu/vt-d: Consolidate irq domain getter
> iommu/amd: Consolidate irq domain getter
> iommu/irq_remapping: Consolidate irq domain lookup
>
> 5) Consolidation of the X86 specific interrupt allocation mechanism to be as close
> as possible to the generic MSI allocation mechanism which allows to get rid
> of quite a bunch of x86'isms which are pointless
>
> x86/irq: Prepare consolidation of irq_alloc_info
> x86/msi: Consolidate HPET allocation
> x86/ioapic: Consolidate IOAPIC allocation
> x86/irq: Consolidate DMAR irq allocation
> x86/irq: Consolidate UV domain allocation
> PCI/MSI: Rework pci_msi_domain_calc_hwirq()
> x86/msi: Consolidate MSI allocation
> x86/msi: Use generic MSI domain ops
>
> 6) x86 specific cleanups to remove the dependency on arch_*_msi_irqs()
>
> x86/irq: Move apic_post_init() invocation to one place
> x86/pci: Reducde #ifdeffery in PCI init code
> x86/irq: Initialize PCI/MSI domain at PCI init time
> irqdomain/msi: Provide DOMAIN_BUS_VMD_MSI
> PCI: vmd: Mark VMD irqdomain with DOMAIN_BUS_VMD_MSI
> PCI/MSI: Provide pci_dev_has_special_msi_domain() helper
> x86/xen: Make xen_msi_init() static and rename it to xen_hvm_msi_init()
> x86/xen: Rework MSI teardown
> x86/xen: Consolidate XEN-MSI init
> irqdomain/msi: Allow to override msi_domain_alloc/free_irqs()
> x86/xen: Wrap XEN MSI management into irqdomain
> iommm/vt-d: Store irq domain in struct device
> iommm/amd: Store irq domain in struct device
> x86/pci: Set default irq domain in pcibios_add_device()
> PCI/MSI: Make arch_.*_msi_irq[s] fallbacks selectable
> x86/irq: Cleanup the arch_*_msi_irqs() leftovers
> x86/irq: Make most MSI ops XEN private
> iommu/vt-d: Remove domain search for PCI/MSI[X]
> iommu/amd: Remove domain search for PCI/MSI
>
> 7) X86 specific preparation for device MSI
>
> x86/irq: Add DEV_MSI allocation type
> x86/msi: Rename and rework pci_msi_prepare() to cover non-PCI MSI
>
> 8) Generic device MSI infrastructure
> platform-msi: Provide default irq_chip:: Ack
> genirq/proc: Take buslock on affinity write
> genirq/msi: Provide and use msi_domain_set_default_info_flags()
> platform-msi: Add device MSI infrastructure
> irqdomain/msi: Provide msi_alloc/free_store() callbacks
>
> 9) POC of IMS (Interrupt Message Storm) irq domain and irqchip
> implementations for both device array and queue storage.
>
> irqchip: Add IMS (Interrupt Message Storm) driver - NOT FOR MERGING
>
> Changes vs. V1:
>
> - Addressed various review comments and addressed the 0day fallout.
> - Corrected the XEN logic (Jürgen)
> - Make the arch fallback in PCI/MSI opt-in not opt-out (Bjorn)
>
> - Fixed the compose MSI message inconsistency
>
> - Ensure that the necessary flags are set for device SMI
>
> - Make the irq bus logic work for affinity setting to prepare
> support for IMS storage in queue memory. It turned out to be
> less scary than I feared.
>
> - Remove leftovers in iommu/intel|amd
>
> - Reworked the IMS POC driver to cover queue storage so Jason can have a
> look whether that fits the needs of MLX devices.
>
> The whole lot is also available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git device-msi
>
> This has been tested on Intel/AMD/KVM but lacks testing on:
>
> - HYPERV (-ENODEV)
> - VMD enabled systems (-ENODEV)
> - XEN (-ENOCLUE)
> - IMS (-ENODEV)
>
> - Any non-X86 code which might depend on the broken compose MSI message
> logic. Marc excpects not much fallout, but agrees that we need to fix
> it anyway.
>
> #1 - #3 should be applied unconditionally for obvious reasons
> #4 - #6 are wortwhile cleanups which should be done independent of device MSI
>
> #7 - #8 look promising to cleanup the platform MSI implementation
> independent of #8, but I neither had cycles nor the stomach to
> tackle that.
>
> #9 is obviously just for the folks interested in IMS
>
> Thanks,
>
> tglx
>

2020-08-31 07:13:55

by Thomas Gleixner

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

On Mon, Aug 31 2020 at 08:51, Lu Baolu wrote:
> On 8/26/20 7:16 PM, Thomas Gleixner wrote:
>> This is the second version of providing a base to support device MSI (non
>> PCI based) and on top of that support for IMS (Interrupt Message Storm)
>> based devices in a halfways architecture independent way.
>
> After applying this patch series, the dmar_alloc_hwirq() helper doesn't
> work anymore during boot. This causes the IOMMU driver to fail to
> register the DMA fault handler and abort the IOMMU probe processing.
> Is this a known issue?

See replies to patch 15/46 or pull the git tree. It has the issue fixed.

Thanks,

tglx

2020-08-31 07:32:08

by Baolu Lu

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

Hi Thomas,

On 2020/8/31 15:10, Thomas Gleixner wrote:
> On Mon, Aug 31 2020 at 08:51, Lu Baolu wrote:
>> On 8/26/20 7:16 PM, Thomas Gleixner wrote:
>>> This is the second version of providing a base to support device MSI (non
>>> PCI based) and on top of that support for IMS (Interrupt Message Storm)
>>> based devices in a halfways architecture independent way.
>>
>> After applying this patch series, the dmar_alloc_hwirq() helper doesn't
>> work anymore during boot. This causes the IOMMU driver to fail to
>> register the DMA fault handler and abort the IOMMU probe processing.
>> Is this a known issue?
>
> See replies to patch 15/46 or pull the git tree. It has the issue fixed.

Ah! Yes. Sorry for the noise.

Best regards,
baolu

2020-09-01 09:07:28

by Boqun Feng

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

Hi Thomas,

On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
[...]
>
> The whole lot is also available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git device-msi
>
> This has been tested on Intel/AMD/KVM but lacks testing on:
>
> - HYPERV (-ENODEV)

FWIW, I did a build and boot test in a hyperv guest with your
development branch, the latest commit is 71cbf478eb6f ("irqchip: Add
IMS (Interrupt Message Storm) driver - NOT FOR MERGING"). And everything
seemed to be working fine.

If you want me to set/unset a particular CONFIG option or run some
command for testing purposes, please let me know ;-)

Regards,
Boqun

> - VMD enabled systems (-ENODEV)
> - XEN (-ENOCLUE)
> - IMS (-ENODEV)
>
> - Any non-X86 code which might depend on the broken compose MSI message
> logic. Marc excpects not much fallout, but agrees that we need to fix
> it anyway.
>
> #1 - #3 should be applied unconditionally for obvious reasons
> #4 - #6 are wortwhile cleanups which should be done independent of device MSI
>
> #7 - #8 look promising to cleanup the platform MSI implementation
> independent of #8, but I neither had cycles nor the stomach to
> tackle that.
>
> #9 is obviously just for the folks interested in IMS
>
> Thanks,
>
> tglx

2020-09-03 16:36:20

by Raj, Ashok

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

Hi Thomas,

Thanks a ton for jumping in and helping to straighten this out for IMS!!!


On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)

s/Storm/Store

maybe pun intended :-)

> based devices in a halfways architecture independent way.

You mean "halfways" because the message addr and data follow guidelines
per arch (x86 or such), but the location of the storage isn't dictated
by architecture? Or did you have something else in mind?

>
> The first version can be found here:
>
> https://lore.kernel.org/r/[email protected]
>

[snip]

>
> Changes vs. V1:
>
> - Addressed various review comments and addressed the 0day fallout.
> - Corrected the XEN logic (Jürgen)
> - Make the arch fallback in PCI/MSI opt-in not opt-out (Bjorn)
>
> - Fixed the compose MSI message inconsistency
>
> - Ensure that the necessary flags are set for device SMI

is that supposed to be MSI?

Cheers,
Ashok

2020-09-03 18:15:48

by Thomas Gleixner

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

Ashok,

On Thu, Sep 03 2020 at 09:35, Ashok Raj wrote:
> On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
>> This is the second version of providing a base to support device MSI (non
>> PCI based) and on top of that support for IMS (Interrupt Message Storm)
>
> s/Storm/Store
>
> maybe pun intended :-)

Maybe? :)

>> based devices in a halfways architecture independent way.
>
> You mean "halfways" because the message addr and data follow guidelines
> per arch (x86 or such), but the location of the storage isn't dictated
> by architecture? or did you have something else in mind?

Yes, the actual message address and data format are architecture
specific, but we also have an x86-specific allocation info format which
unfortunately needs an arch callback.
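
For illustration, an x86 MSI message for fixed delivery, edge trigger and
physical destination mode boils down to the following (a simplified sketch
per the SDM; apic_id and vector are placeholder inputs here, not actual
kernel variables):

	struct msi_msg msg;

	msg.address_hi = 0;
	/* 0xFEExxxxx window, destination APIC ID in bits 19:12 */
	msg.address_lo = 0xfee00000 | (apic_id << 12);
	/* Fixed delivery mode, edge trigger, target vector */
	msg.data = vector;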

>> - Ensure that the necessary flags are set for device SMI
>
> is that supposed to be MSI?

Of course, but SMI is a better match for Message Storm :)

Thanks,

tglx

2020-09-08 03:41:18

by Russ Anderson

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)
> based devices in a halfways architecture independent way.

Booted with quick testing on a 32 socket, 1536 CPU, 12 TB memory
Cascade Lake system and an 8 socket, 144 CPU, 3 TB memory
Cooper Lake system without any obvious regression.


--
Russ Anderson, SuperDome Flex Linux Kernel Group Manager
HPE - Hewlett Packard Enterprise (formerly SGI) [email protected]

2020-09-25 15:30:55

by Qian Cai

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

On Wed, 2020-08-26 at 13:16 +0200, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)
> based devices in a halfways architecture independent way.
>
> The first version can be found here:
>
> https://lore.kernel.org/r/[email protected]
>
> It's still a mixed bag of bug fixes, cleanups and general improvements
> which are worthwhile independent of device MSI.

Reverting the part of this patchset on top of today's linux-next fixed a
boot issue on HPE ProLiant DL560 Gen10, i.e.,

$ git revert --no-edit 13b90cadfc29..bc95fd0d7c42

.config: https://gitlab.com/cailca/linux-mm/-/blob/master/x86.config

It looks like the crashes happen in the interrupt remapping code where they are
only able to generate partial call traces.

[ 1.912386][ T0] ACPI: X2APIC_NMI (uid[0xf5] high level 9983][ T0] ... MAX_LOCK_DEPTH: 48
[ 7.914876][ T0] ... MAX_LOCKDEP_KEYS: 8192
[ 7.919942][ T0] ... CLASSHASH_SIZE: 4096
[ 7.925009][ T0] ... MAX_LOCKDEP_ENTRIES: 32768
[ 7.930163][ T0] ... MAX_LOCKDEP_CHAINS: 65536
[ 7.935318][ T0] ... CHAINHASH_SIZE: 32768
[ 7.940473][ T0] memory used by lock dependency info: 6301 kB
[ 7.946586][ T0] memory used for stack traces: 4224 kB
[ 7.952088][ T0] per task-struct memory footprint: 1920 bytes
[ 7.968312][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 7.980281][ T0] ACPI: Core revision 20200717
[ 7.993343][ T0] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
[ 8.003270][ T0] APIC: Switch to symmetric I/O mode setup
[ 8.008951][ T0] DMAR: Host address width 46
[ 8.013512][ T0] DMAR: DRHD base: 0x000000e5ffc000 flags: 0x0
[ 8.019680][ T0] DMAR: dmar0: reg_base_addr e5ffc000 ver 1:0 cap 8d2078c106f0466 [ T0] DMAR-IR: IOAPIC id 15 under DRHD base 0xe5ffc000 IOMMU 0
[ 8.420990][ T0] DMAR-IR: IOAPIC id 8 under DRHD base 0xddffc000 IOMMU 15
[ 8.428166][ T0] DMAR-IR: IOAPIC id 9 under DRHD base 0xddffc000 IOMMU 15
[ 8.435341][ T0] DMAR-IR: HPET id 0 under DRHD base 0xddffc000
[ 8.441456][ T0] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 8.457911][ T0] DMAR-IR: Enabled IRQ remapping in x2apic mode
[ 8.466614][ T0] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 8.474295][ T0] #PF: supervisor instruction fetch in kernel mode
[ 8.480669][ T0] #PF: error_code(0x0010) - not-present page
[ 8.486518][ T0] PGD 0 P4D 0
[ 8.489757][ T0] Oops: 0010 [#1] SMP KASAN PTI
[ 8.494476][ T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G I 5.9.0-rc6-next-20200925 #2
[ 8.503987][ T0] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 11/13/2019
[ 8.513238][ T0] RIP: 0010:0x0
[ 8.516562][ T0] Code: Bad RIP v

or

[ 2.906744][ T0] ACPI: X2API32, address 0xfec68000, GSI 128-135
[ 2.907063][ T0] IOAPIC[15]: apic_id 29, version 32, address 0xfec70000, GSI 136-143
[ 2.907071][ T0] IOAPIC[16]: apic_id 30, version 32, address 0xfec78000, GSI 144-151
[ 2.907079][ T0] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[ 2.907084][ T0] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[ 2.907100][ T0] Using ACPI (MADT) for SMP configuration information
[ 2.907105][ T0] ACPI: HPET id: 0x8086a701 base: 0xfed00000
[ 2.907116][ T0] ACPI: SPCR: console: uart,mmio,0x0,115200
[ 2.907121][ T0] TSC deadline timer available
[ 2.907126][ T0] smpboot: Allowing 144 CPUs, 0 hotplug CPUs
[ 2.907163][ T0] [mem 0xd0000000-0xfdffffff] available for PCI devices
[ 2.907175][ T0] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[ 2.914541][ T0] setup_percpu: NR_CPUS:256 nr_cpumask_bits:144 nr_cpu_ids:144 nr_node_ids:4
[ 2.926109][ 466 ecap f020df
[ 9.134709][ T0] DMAR: DRHD base: 0x000000f5ffc000 flags: 0x0
[ 9.140867][ T0] DMAR: dmar8: reg_base_addr f5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.149610][ T0] DMAR: DRHD base: 0x000000f7ffc000 flags: 0x0
[ 9.155762][ T0] DMAR: dmar9: reg_base_addr f7ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.164491][ T0] DMAR: DRHD base: 0x000000f9ffc000 flags: 0x0
[ 9.170645][ T0] DMAR: dmar10: reg_base_addr f9ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.179476][ T0] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[ 9.185626][ T0] DMAR: dmar11: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.194442][ T0] DMAR: DRHD base: 0x000000dfffc000 flags: 0x0
[ 9.200587][ T0] DMAR: dmar12: reg_base_addr dfffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.209418][ T0] DMAR: DRHD base: 0x000000e1ffc000 flags: 0x0
[ 9.215551][ T0] DMAR: dmar13: reg_base_addr e1ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.224367][ T0] DMAR: DRHD base: 0x000000e3ffc83][ T0] msi_domain_alloc+0x8e/0x280
[ 9.615015][ T0] __irq_domain_a8992cd
[ 9.711906][ T0] R10: ffffffff85407d78 R11: fffffbfff18992cc R12: ffffffff8546ffc0
[ 9.719761][ T0] R13: 0000000000000098 R14: ffff888106e63a40 R15: 0000000000000001
[ 9.727617][ T0] FS: 0000000000000000(0000) GS:ffff8887df800000(0000) knlGS:0000000000000000
[ 9.736431][ T0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.742892][ T0] CR2: ffffffffffffffd6 CR3: 0000001ba7814001 CR4: 00000000000606b0
[ 9.750747][ T0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9.758601][ T0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9.766456][ T0] Kernel panic - not syncing: Fatal exception
[ 9.772547][ T0] ---[ end Kernel panic - not syncing: Fatal exception ]---

The working boot (without those patches) looks like this:

[ 1.913963][ T0] ACPI: X2APIC_NMI (uid[0xf4] high level lint[0x1])
[ 1.913967][ T0] ACPI: X2APIC_NMI (uid[0xf5] high level lint[0x1])
[ 1.913970][ T0] ACPI: X2APIC_NMI (uid[0xf6] high level lint[0x1])
[ 1.913974][ T0] ACPI: X2APIC_NMI (uid[0xf7] high level lint[0x1])
[ 1.914017][ T0] IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
[ 1.914032][ T0] IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-31
[ 1.914039][ T0] IOAPIC[2]: apic_id 10, version 32, address 0xfec08000, GSI 32-39
[ 1.914047][ T0] IOAPIC[3]: apic_id 11, version 32, address 0xfec10000, GSI 40-47
[ 1.914054][ T0] IOAPIC[4]: apic_id 12, version 32, address 0xfec18000, GSI 48-55
[ 1.914062][ T0] IOAPIC[5]: apic_id 15, version 32, address 0xfec20000, GSI 56-63
[ 1.[ 7.994567][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 8.006541][ T0] ACPI: Core revision 20200717
[ 8.019713][ T0] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
[ 8.029672][ T0] APIC: Switch to symmetric I/O mode setup
[ 8.035354][ T0] DMAR: Host address width 46
[ 8.039915][ T0] DMAR: DRHD base: 0x000000e5ffc000 flags: 0x0
[ 8.046095][ T0] DMAR: dmar0: reg_base_addr e5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 8.054840][ T0] DMAR: DRHD base: 0x000000e7ffc000 flags: 0x0
[ 8.060997][ T0] DMAR: dmar1: reg_base_addr e7ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 8.069740][ T0] DMAR: DRHD base: 0x000000e9ffc000 flags: 0x0
[ 8.075872][ T0] DMAR: dmar2: reg_base_addr e9ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 8.084615][ T0] DMAR: DRHD base: 0x000000ebffc000 flags: 0x0
[ 8.090761][ T0] DMAR: dmar3: reg_base_addr ebffc000 ver 1:0 cap 8d2078c106f0466 ecap fMAR-IR: Enabled IRQ remapping in x2apic mode
[ 8.513491][ T0] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 8.568289][ T0] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2b3e459bf4c, max_idle_ns: 440795289890 ns
[ 8.579576][ T0] Calibrating delay loop (skipped), value calculated using timer frequency.. 6000.00 BogoMIPS (lpj=30000000)
[ 8.589574][ T0] pid_max: default: 147456 minimum: 1152
[ 8.714025][ T0] efi: memattr: Entry attributes invalid: RO and XP bits both cleared
[ 8.719577][ T0] efi: memattr: ! 0x0000a057a000-0x0000a05b4fff [Runtime Code |RUN| | | | | | | | | | | | ]
[ 8.775355][ T0] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes, vmalloc)
[ 8.798868][ T0] Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes, vmalloc)
[ 8.811550][ T0] Mount-cache hash table entries: 131072 (order: 8, 1048576 bytes, vmalloc)
[ 8.820076][ T0] Mountpoint-cache hash table entries: 131072 (order: 8, 1048576 bytes, vmalloc)
[ 8.879327][ T0] mce: CPU0: Thermal mo[ 8.996916][ T1] Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
[ 8.999591][ T1] ... version: 4
[ 9.004310][ T1] ... bit width: 48
[ 9.009118][ T1] ... generic registers: 4
[ 9.009574][ T1] ... value mask: 0000ffffffffffff
[ 9.015601][ T1] ... max period: 00007fffffffffff
[ 9.019574][ T1] ... fixed-purpose events: 3
[ 9.024294][ T1] ... event mask: 000000070000000f
[ 9.034357][ T1] rcu: Hierarchical SRCU implementation.
[ 9.062516][ T5] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.

>
> There are quite a bunch of issues to solve:
>
> - X86 does not use the device::msi_domain pointer for historical reasons
> and due to XEN, which makes it impossible to create an architecture
> agnostic device MSI infrastructure.
>
> - X86 has it's own msi_alloc_info data type which is pointlessly
> different from the generic version and does not allow to share code.
>
> - The logic of composing MSI messages in an hierarchy is busted at the
> core level and of course some (x86) drivers depend on that.
>
> - A few minor shortcomings as usual
>
> This series addresses that in several steps:
>
> 1) Accidental bug fixes
>
> iommu/amd: Prevent NULL pointer dereference
>
> 2) Janitoring
>
> x86/init: Remove unused init ops
> PCI: vmd: Dont abuse vector irqomain as parent
> x86/msi: Remove pointless vcpu_affinity callback
>
> 3) Sanitizing the composition of MSI messages in a hierarchy
>
> genirq/chip: Use the first chip in irq_chip_compose_msi_msg()
> x86/msi: Move compose message callback where it belongs
>
> 4) Simplification of the x86 specific interrupt allocation mechanism
>
> x86/irq: Rename X86_IRQ_ALLOC_TYPE_MSI* to reflect PCI dependency
> x86/irq: Add allocation type for parent domain retrieval
> iommu/vt-d: Consolidate irq domain getter
> iommu/amd: Consolidate irq domain getter
> iommu/irq_remapping: Consolidate irq domain lookup
>
> 5) Consolidation of the X86 specific interrupt allocation mechanism to be as
> close
> as possible to the generic MSI allocation mechanism which allows to get
> rid
> of quite a bunch of x86'isms which are pointless
>
> x86/irq: Prepare consolidation of irq_alloc_info
> x86/msi: Consolidate HPET allocation
> x86/ioapic: Consolidate IOAPIC allocation
> x86/irq: Consolidate DMAR irq allocation
> x86/irq: Consolidate UV domain allocation
> PCI/MSI: Rework pci_msi_domain_calc_hwirq()
> x86/msi: Consolidate MSI allocation
> x86/msi: Use generic MSI domain ops
>
> 6) x86 specific cleanups to remove the dependency on arch_*_msi_irqs()
>
> x86/irq: Move apic_post_init() invocation to one place
> x86/pci: Reducde #ifdeffery in PCI init code
> x86/irq: Initialize PCI/MSI domain at PCI init time
> irqdomain/msi: Provide DOMAIN_BUS_VMD_MSI
> PCI: vmd: Mark VMD irqdomain with DOMAIN_BUS_VMD_MSI
> PCI/MSI: Provide pci_dev_has_special_msi_domain() helper
> x86/xen: Make xen_msi_init() static and rename it to xen_hvm_msi_init()
> x86/xen: Rework MSI teardown
> x86/xen: Consolidate XEN-MSI init
> irqdomain/msi: Allow to override msi_domain_alloc/free_irqs()
> x86/xen: Wrap XEN MSI management into irqdomain
> iommm/vt-d: Store irq domain in struct device
> iommm/amd: Store irq domain in struct device
> x86/pci: Set default irq domain in pcibios_add_device()
> PCI/MSI: Make arch_.*_msi_irq[s] fallbacks selectable
> x86/irq: Cleanup the arch_*_msi_irqs() leftovers
> x86/irq: Make most MSI ops XEN private
> iommu/vt-d: Remove domain search for PCI/MSI[X]
> iommu/amd: Remove domain search for PCI/MSI
>
> 7) X86 specific preparation for device MSI
>
> x86/irq: Add DEV_MSI allocation type
> x86/msi: Rename and rework pci_msi_prepare() to cover non-PCI MSI
>
> 8) Generic device MSI infrastructure
> platform-msi: Provide default irq_chip:: Ack
> genirq/proc: Take buslock on affinity write
> genirq/msi: Provide and use msi_domain_set_default_info_flags()
> platform-msi: Add device MSI infrastructure
> irqdomain/msi: Provide msi_alloc/free_store() callbacks
>
> 9) POC of IMS (Interrupt Message Storm) irq domain and irqchip
> implementations for both device array and queue storage.
>
> irqchip: Add IMS (Interrupt Message Storm) driver - NOT FOR MERGING
>
> Changes vs. V1:
>
> - Addressed various review comments and addressed the 0day fallout.
> - Corrected the XEN logic (Jürgen)
> - Make the arch fallback in PCI/MSI opt-in not opt-out (Bjorn)
>
> - Fixed the compose MSI message inconsistency
>
> - Ensure that the necessary flags are set for device SMI
>
> - Make the irq bus logic work for affinity setting to prepare
> support for IMS storage in queue memory. It turned out to be
> less scary than I feared.
>
> - Remove leftovers in iommu/intel|amd
>
> - Reworked the IMS POC driver to cover queue storage so Jason can have a
> look whether that fits the needs of MLX devices.
>
> The whole lot is also available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git device-msi
>
> This has been tested on Intel/AMD/KVM but lacks testing on:
>
> - HYPERV (-ENODEV)
> - VMD enabled systems (-ENODEV)
> - XEN (-ENOCLUE)
> - IMS (-ENODEV)
>
> - Any non-X86 code which might depend on the broken compose MSI message
> logic. Marc excpects not much fallout, but agrees that we need to fix
> it anyway.
>
> #1 - #3 should be applied unconditionally for obvious reasons
> #4 - #6 are wortwhile cleanups which should be done independent of device MSI
>
> #7 - #8 look promising to cleanup the platform MSI implementation
> independent of #8, but I neither had cycles nor the stomach to
> tackle that.
>
> #9 is obviously just for the folks interested in IMS
>
> Thanks,
>
> tglx

2020-09-25 15:54:05

by Peter Zijlstra

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

On Fri, Sep 25, 2020 at 11:29:13AM -0400, Qian Cai wrote:

> It looks like the crashes happen in the interrupt remapping code where they are
> only able to to generate partial call traces.

> [ 8.466614][ T0] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 8.474295][ T0] #PF: supervisor instruction fetch in kernel mode
> [ 8.480669][ T0] #PF: error_code(0x0010) - not-present page
> [ 8.486518][ T0] PGD 0 P4D 0
> [ 8.489757][ T0] Oops: 0010 [#1] SMP KASAN PTI
> [ 8.494476][ T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G I 5.9.0-rc6-next-20200925 #2
> [ 8.503987][ T0] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 11/13/2019
> [ 8.513238][ T0] RIP: 0010:0x0
> [ 8.516562][ T0] Code: Bad RIP v

Here it looks like this:

[ 1.830276] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1.838043] #PF: supervisor instruction fetch in kernel mode
[ 1.844357] #PF: error_code(0x0010) - not-present page
[ 1.850090] PGD 0 P4D 0
[ 1.852915] Oops: 0010 [#1] SMP
[ 1.856419] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.0-rc6-00700-g0248dedd12d4 #419
[ 1.865447] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[ 1.876902] RIP: 0010:0x0
[ 1.879824] Code: Bad RIP value.
[ 1.883423] RSP: 0000:ffffffff82803da0 EFLAGS: 00010282
[ 1.889251] RAX: 0000000000000000 RBX: ffffffff8282b980 RCX: ffffffff82803e40
[ 1.897241] RDX: 0000000000000001 RSI: ffffffff82803e40 RDI: ffffffff8282b980
[ 1.905201] RBP: ffff88842f331000 R08: 00000000ffffffff R09: 0000000000000001
[ 1.913162] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000048
[ 1.921123] R13: ffffffff82803e40 R14: ffffffff8282b9c0 R15: 0000000000000000
[ 1.929085] FS: 0000000000000000(0000) GS:ffff88842f400000(0000) knlGS:0000000000000000
[ 1.938113] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.944524] CR2: ffffffffffffffd6 CR3: 0000000002811001 CR4: 00000000000606b0
[ 1.952484] Call Trace:
[ 1.955214] msi_domain_alloc+0x36/0x130
[ 1.959594] __irq_domain_alloc_irqs+0x165/0x380
[ 1.964748] dmar_alloc_hwirq+0x9a/0x120
[ 1.969127] dmar_set_interrupt.part.0+0x1c/0x60
[ 1.974281] enable_drhd_fault_handling+0x2c/0x6c
[ 1.979532] apic_intr_mode_init+0xfa/0x100
[ 1.984191] x86_late_time_init+0x20/0x30
[ 1.988662] start_kernel+0x723/0x7e6
[ 1.992748] secondary_startup_64_no_verify+0xa6/0xab
[ 1.998386] Modules linked in:
[ 2.001794] CR2: 0000000000000000
[ 2.005510] ---[ end trace 837dc60d7c66efa2 ]---

2020-09-25 23:18:05

by Thomas Gleixner

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

On Fri, Sep 25 2020 at 17:49, Peter Zijlstra wrote:
> Here it looks like this:
>
> [ 1.830276] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 1.838043] #PF: supervisor instruction fetch in kernel mode
> [ 1.844357] #PF: error_code(0x0010) - not-present page
> [ 1.850090] PGD 0 P4D 0
> [ 1.852915] Oops: 0010 [#1] SMP
> [ 1.856419] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.0-rc6-00700-g0248dedd12d4 #419
> [ 1.865447] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
> [ 1.876902] RIP: 0010:0x0
> [ 1.879824] Code: Bad RIP value.
> [ 1.883423] RSP: 0000:ffffffff82803da0 EFLAGS: 00010282
> [ 1.889251] RAX: 0000000000000000 RBX: ffffffff8282b980 RCX: ffffffff82803e40
> [ 1.897241] RDX: 0000000000000001 RSI: ffffffff82803e40 RDI: ffffffff8282b980
> [ 1.905201] RBP: ffff88842f331000 R08: 00000000ffffffff R09: 0000000000000001
> [ 1.913162] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000048
> [ 1.921123] R13: ffffffff82803e40 R14: ffffffff8282b9c0 R15: 0000000000000000
> [ 1.929085] FS: 0000000000000000(0000) GS:ffff88842f400000(0000) knlGS:0000000000000000
> [ 1.938113] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.944524] CR2: ffffffffffffffd6 CR3: 0000000002811001 CR4: 00000000000606b0
> [ 1.952484] Call Trace:
> [ 1.955214] msi_domain_alloc+0x36/0x130

Hrm. That looks like an uninitialized mandatory callback. Confused.

Is this on -next and if so, does this happen on tip:x86/irq as well?

Can you provide your config please?

Thanks,

tglx

2020-09-27 08:48:03

by Thomas Gleixner

Subject: [PATCH] x86/apic/msi: Unbreak DMAR and HPET MSI

Switching the DMAR and HPET MSI code to use the generic MSI domain ops
missed to add the flag which tells the core code to update the domain
operations with the defaults. As a consequence the core code crashes
when an interrupt in one of those domains is allocated.

Add the missing flags.

Fixes: 9006c133a422 ("x86/msi: Use generic MSI domain ops")
Reported-by: Qian Cai <[email protected]>
Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/apic/msi.c | 2 ++
1 file changed, 2 insertions(+)

--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -309,6 +309,7 @@ static struct msi_domain_ops dmar_msi_do
static struct msi_domain_info dmar_msi_domain_info = {
.ops = &dmar_msi_domain_ops,
.chip = &dmar_msi_controller,
+ .flags = MSI_FLAG_USE_DEF_DOM_OPS,
};

static struct irq_domain *dmar_get_irq_domain(void)
@@ -408,6 +409,7 @@ static struct msi_domain_ops hpet_msi_do
static struct msi_domain_info hpet_msi_domain_info = {
.ops = &hpet_msi_domain_ops,
.chip = &hpet_msi_controller,
+ .flags = MSI_FLAG_USE_DEF_DOM_OPS,
};

struct irq_domain *hpet_create_irq_domain(int hpet_id)
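
For context on why the missing flag ends up as a jump through a NULL
pointer: the MSI core only fills in the default domain ops when
MSI_FLAG_USE_DEF_DOM_OPS is set, roughly like this (sketch of
kernel/irq/msi.c from memory, simplified):

struct irq_domain *msi_create_irq_domain(struct fwnode_handle *fwnode,
					 struct msi_domain_info *info,
					 struct irq_domain *parent)
{
	if (info->flags & MSI_FLAG_USE_DEF_DOM_OPS)
		msi_domain_update_dom_ops(info);	/* fill NULL callbacks */
	if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS)
		msi_domain_update_chip_ops(info);

	return irq_domain_create_hierarchy(parent, IRQ_DOMAIN_FLAG_MSI, 0,
					   fwnode, &msi_domain_ops, info);
}

Without the flag, callbacks such as ops->get_hwirq stay NULL and the first
allocation in such a domain calls straight into address 0, which matches
the "RIP: 0010:0x0" in the reports above.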

Subject: [tip: x86/irq] x86/apic/msi: Unbreak DMAR and HPET MSI

The following commit has been merged into the x86/irq branch of tip:

Commit-ID: d27e623ace6af259075b6e0437380ee8d6268c5d
Gitweb: https://git.kernel.org/tip/d27e623ace6af259075b6e0437380ee8d6268c5d
Author: Thomas Gleixner <[email protected]>
AuthorDate: Sun, 27 Sep 2020 10:46:44 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Sun, 27 Sep 2020 21:53:41 +02:00

x86/apic/msi: Unbreak DMAR and HPET MSI

Switching the DMAR and HPET MSI code to use the generic MSI domain ops
missed to add the flag which tells the core code to update the domain
operations with the defaults. As a consequence the core code crashes
when an interrupt in one of those domains is allocated.

Add the missing flags.

Fixes: 9006c133a422 ("x86/msi: Use generic MSI domain ops")
Reported-by: Qian Cai <[email protected]>
Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

---
arch/x86/kernel/apic/msi.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 3b522b0..6313f0a 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -309,6 +309,7 @@ static struct msi_domain_ops dmar_msi_domain_ops = {
static struct msi_domain_info dmar_msi_domain_info = {
.ops = &dmar_msi_domain_ops,
.chip = &dmar_msi_controller,
+ .flags = MSI_FLAG_USE_DEF_DOM_OPS,
};

static struct irq_domain *dmar_get_irq_domain(void)
@@ -408,6 +409,7 @@ static struct msi_domain_ops hpet_msi_domain_ops = {
static struct msi_domain_info hpet_msi_domain_info = {
.ops = &hpet_msi_domain_ops,
.chip = &hpet_msi_controller,
+ .flags = MSI_FLAG_USE_DEF_DOM_OPS,
};

struct irq_domain *hpet_create_irq_domain(int hpet_id)

2020-09-29 23:06:22

by Megha Dey

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

Hi Thomas,

On 8/26/2020 4:16 AM, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)
> based devices in a halfways architecture independent way.
>
> The first version can be found here:
>
> https://lore.kernel.org/r/[email protected]
>
> It's still a mixed bag of bug fixes, cleanups and general improvements
> which are worthwhile independent of device MSI.
>
> There are quite a bunch of issues to solve:
>
> - X86 does not use the device::msi_domain pointer for historical reasons
> and due to XEN, which makes it impossible to create an architecture
> agnostic device MSI infrastructure.
>
> - X86 has it's own msi_alloc_info data type which is pointlessly
> different from the generic version and does not allow to share code.
>
> - The logic of composing MSI messages in an hierarchy is busted at the
> core level and of course some (x86) drivers depend on that.
>
> - A few minor shortcomings as usual
>
> This series addresses that in several steps:
>
> 1) Accidental bug fixes
>
> iommu/amd: Prevent NULL pointer dereference
>
> 2) Janitoring
>
> x86/init: Remove unused init ops
> PCI: vmd: Dont abuse vector irqomain as parent
> x86/msi: Remove pointless vcpu_affinity callback
>
> 3) Sanitizing the composition of MSI messages in a hierarchy
>
> genirq/chip: Use the first chip in irq_chip_compose_msi_msg()
> x86/msi: Move compose message callback where it belongs
>
> 4) Simplification of the x86 specific interrupt allocation mechanism
>
> x86/irq: Rename X86_IRQ_ALLOC_TYPE_MSI* to reflect PCI dependency
> x86/irq: Add allocation type for parent domain retrieval
> iommu/vt-d: Consolidate irq domain getter
> iommu/amd: Consolidate irq domain getter
> iommu/irq_remapping: Consolidate irq domain lookup
>
> 5) Consolidation of the X86 specific interrupt allocation mechanism to be as close
> as possible to the generic MSI allocation mechanism which allows to get rid
> of quite a bunch of x86'isms which are pointless
>
> x86/irq: Prepare consolidation of irq_alloc_info
> x86/msi: Consolidate HPET allocation
> x86/ioapic: Consolidate IOAPIC allocation
> x86/irq: Consolidate DMAR irq allocation
> x86/irq: Consolidate UV domain allocation
> PCI/MSI: Rework pci_msi_domain_calc_hwirq()
> x86/msi: Consolidate MSI allocation
> x86/msi: Use generic MSI domain ops
>
> 6) x86 specific cleanups to remove the dependency on arch_*_msi_irqs()
>
> x86/irq: Move apic_post_init() invocation to one place
> x86/pci: Reducde #ifdeffery in PCI init code
> x86/irq: Initialize PCI/MSI domain at PCI init time
> irqdomain/msi: Provide DOMAIN_BUS_VMD_MSI
> PCI: vmd: Mark VMD irqdomain with DOMAIN_BUS_VMD_MSI
> PCI/MSI: Provide pci_dev_has_special_msi_domain() helper
> x86/xen: Make xen_msi_init() static and rename it to xen_hvm_msi_init()
> x86/xen: Rework MSI teardown
> x86/xen: Consolidate XEN-MSI init
> irqdomain/msi: Allow to override msi_domain_alloc/free_irqs()
> x86/xen: Wrap XEN MSI management into irqdomain
> iommm/vt-d: Store irq domain in struct device
> iommm/amd: Store irq domain in struct device
> x86/pci: Set default irq domain in pcibios_add_device()
> PCI/MSI: Make arch_.*_msi_irq[s] fallbacks selectable
> x86/irq: Cleanup the arch_*_msi_irqs() leftovers
> x86/irq: Make most MSI ops XEN private
> iommu/vt-d: Remove domain search for PCI/MSI[X]
> iommu/amd: Remove domain search for PCI/MSI
>
> 7) X86 specific preparation for device MSI
>
> x86/irq: Add DEV_MSI allocation type
> x86/msi: Rename and rework pci_msi_prepare() to cover non-PCI MSI
>
> 8) Generic device MSI infrastructure
> platform-msi: Provide default irq_chip:: Ack
> genirq/proc: Take buslock on affinity write
> genirq/msi: Provide and use msi_domain_set_default_info_flags()
> platform-msi: Add device MSI infrastructure
> irqdomain/msi: Provide msi_alloc/free_store() callbacks
>
> 9) POC of IMS (Interrupt Message Storm) irq domain and irqchip
> implementations for both device array and queue storage.
>
> irqchip: Add IMS (Interrupt Message Storm) driver - NOT FOR MERGING
>
> Changes vs. V1:
>
> - Addressed various review comments and addressed the 0day fallout.
> - Corrected the XEN logic (Jürgen)
> - Make the arch fallback in PCI/MSI opt-in not opt-out (Bjorn)
>
> - Fixed the compose MSI message inconsistency
>
> - Ensure that the necessary flags are set for device SMI
>
> - Make the irq bus logic work for affinity setting to prepare
> support for IMS storage in queue memory. It turned out to be
> less scary than I feared.
>
> - Remove leftovers in iommu/intel|amd
>
> - Reworked the IMS POC driver to cover queue storage so Jason can have a
> look whether that fits the needs of MLX devices.
>
> The whole lot is also available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git device-msi
>
> This has been tested on Intel/AMD/KVM but lacks testing on:
>
> - HYPERV (-ENODEV)
> - VMD enabled systems (-ENODEV)
> - XEN (-ENOCLUE)
> - IMS (-ENODEV)
>
> - Any non-X86 code which might depend on the broken compose MSI message
> logic. Marc excpects not much fallout, but agrees that we need to fix
> it anyway.
>
> #1 - #3 should be applied unconditionally for obvious reasons
> #4 - #6 are wortwhile cleanups which should be done independent of device MSI
>
> #7 - #8 look promising to cleanup the platform MSI implementation
> independent of #8, but I neither had cycles nor the stomach to
> tackle that.
>
> #9 is obviously just for the folks interested in IMS
>
> Thanks,
>
> tglx

I see that the tip tree (as of 9/29) has most of these patches but
notice that the DEV_MSI related patches haven't made it. I have tested
the tip tree (x86/irq branch) with your DEV_MSI infra patches and our
IMS patches with the IDXD driver and was wondering if we should push
out those patches as part of our patchset?

Thanks,

Megha

2020-09-30 06:43:24

by Thomas Gleixner

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

On Tue, Sep 29 2020 at 16:03, Megha Dey wrote:
> On 8/26/2020 4:16 AM, Thomas Gleixner wrote:
>> #9 is obviously just for the folks interested in IMS
>>
>
> I see that the tip tree (as of 9/29) has most of these patches but
> notice that the DEV_MSI related patches
>
> haven't made it. I have tested the tip tree(x86/irq branch) with your
> DEV_MSI infra patches and our IMS patches with the IDXD driver and was

Your IMS patches? Why do you need something special again?

> wondering if we should push out those patches as part of our patchset?

As I don't have any hardware to test that, I was waiting for you and
Jason to confirm that this actually works for the two different IMS
implementations.

Thanks,

tglx

2020-09-30 11:46:56

by Jason Gunthorpe

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

On Wed, Sep 30, 2020 at 08:41:48AM +0200, Thomas Gleixner wrote:
> On Tue, Sep 29 2020 at 16:03, Megha Dey wrote:
> > On 8/26/2020 4:16 AM, Thomas Gleixner wrote:
> >> #9 is obviously just for the folks interested in IMS
> >>
> >
> > I see that the tip tree (as of 9/29) has most of these patches but
> > notice that the DEV_MSI related patches
> >
> > haven't made it. I have tested the tip tree(x86/irq branch) with your
> > DEV_MSI infra patches and our IMS patches with the IDXD driver and was
>
> Your IMS patches? Why do you need something special again?
>
> > wondering if we should push out those patches as part of our patchset?
>
> As I don't have any hardware to test that, I was waiting for you and
> Jason to confirm that this actually works for the two different IMS
> implementations.

How urgently do you need this? The code looked good from what I
understood. It will be a while before we have all the parts to send an
actual patch though.

We might be able to put together a mockup just to prove it

Jason

2020-09-30 15:22:37

by Thomas Gleixner

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

On Wed, Sep 30 2020 at 08:43, Jason Gunthorpe wrote:
> On Wed, Sep 30, 2020 at 08:41:48AM +0200, Thomas Gleixner wrote:
>> On Tue, Sep 29 2020 at 16:03, Megha Dey wrote:
>> > On 8/26/2020 4:16 AM, Thomas Gleixner wrote:
>> >> #9 is obviously just for the folks interested in IMS
>> >>
>> >
>> > I see that the tip tree (as of 9/29) has most of these patches but
>> > notice that the DEV_MSI related patches
>> >
>> > haven't made it. I have tested the tip tree(x86/irq branch) with your
>> > DEV_MSI infra patches and our IMS patches with the IDXD driver and was
>>
>> Your IMS patches? Why do you need something special again?
>>
>> > wondering if we should push out those patches as part of our patchset?
>>
>> As I don't have any hardware to test that, I was waiting for you and
>> Jason to confirm that this actually works for the two different IMS
>> implementations.
>
> How urgently do you need this? The code looked good from what I
> understood. It will be a while before we have all the parts to send an
> actual patch though.

I personally do not need it at all :) Megha might have different
thoughts...

> We might be able to put together a mockup just to prove it

If that makes Megha's stuff going that would of course be appreciated,
but we can defer the IMS_QUEUE part for later. It's orthogonal to the
IMS_ARRAY stuff.

Thanks,

tglx

2020-09-30 17:28:57

by Megha Dey

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

Hi Thomas/Jason,

On 9/30/2020 8:20 AM, Thomas Gleixner wrote:
> On Wed, Sep 30 2020 at 08:43, Jason Gunthorpe wrote:
>> On Wed, Sep 30, 2020 at 08:41:48AM +0200, Thomas Gleixner wrote:
>>> On Tue, Sep 29 2020 at 16:03, Megha Dey wrote:
>>>> On 8/26/2020 4:16 AM, Thomas Gleixner wrote:
>>>>> #9 is obviously just for the folks interested in IMS
>>>>>
>>>> I see that the tip tree (as of 9/29) has most of these patches but
>>>> notice that the DEV_MSI related patches
>>>>
>>>> haven't made it. I have tested the tip tree(x86/irq branch) with your
>>>> DEV_MSI infra patches and our IMS patches with the IDXD driver and was
>>> Your IMS patches? Why do you need something special again?

By IMS patches, I meant your IMS driver patch that was updated (as it
was untested, it had some compile errors and we removed the IMS_QUEUE
parts):

https://lore.kernel.org/lkml/160021246221.67751.16280230469654363209.stgit@djiang5-desk3.ch.intel.com/

and some iommu related changes required by IMS.

https://lore.kernel.org/lkml/160021246905.67751.1674517279122764758.stgit@djiang5-desk3.ch.intel.com/

The whole patchset can be found here:

https://lore.kernel.org/lkml/[email protected]/

It would be great if you could review the IMS patches :)

>>>
>>>> wondering if we should push out those patches as part of our patchset?
>>> As I don't have any hardware to test that, I was waiting for you and
>>> Jason to confirm that this actually works for the two different IMS
>>> implementations.
>> How urgently do you need this? The code looked good from what I
>> understood. It will be a while before we have all the parts to send an
>> actual patch though.
> I personally do not need it at all :) Megha might have different
> thoughts...

I have tested these patches and they work fine (I had to add a couple of
EXPORT_SYMBOLS).

We were hoping to get IMS in the 5.10 merge window :)

>
>> We might be able to put together a mockup just to prove it
> If that makes Megha's stuff going that would of course be appreciated,
> but we can defer the IMS_QUEUE part for later. It's orthogonal to the
> IMS_ARRAY stuff.

In our patch series, we have removed the IMS_QUEUE stuff and retained
only the IMS_ARRAY parts, as that was sufficient for us.

>
> Thanks,
>
> tglx

2020-09-30 18:15:52

by Thomas Gleixner

Subject: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

Megha,

On Wed, Sep 30 2020 at 10:25, Megha Dey wrote:
> On 9/30/2020 8:20 AM, Thomas Gleixner wrote:
>>>> Your IMS patches? Why do you need something special again?
>
> By IMS patches, I meant your IMS driver patch that was updated (as it
> was untested, it had some compile errors and we removed the IMS_QUEUE
> parts) :

Ok.

> The whole patchset can be found here:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> It would be great if you could review the IMS patches :)

It somehow slipped through the cracks. I'll have a look.

> We were hoping to get IMS in the 5.10 merge window :)

Hope dies last, right?

>>> We might be able to put together a mockup just to prove it
>> If that makes Megha's stuff going that would of course be appreciated,
>> but we can defer the IMS_QUEUE part for later. It's orthogonal to the
>> IMS_ARRAY stuff.
>
> In our patch series, we have removed the IMS_QUEUE stuff and retained
> only the IMS_ARRAY parts > as that was sufficient for us.

That works. We can add that back when Jason has his puzzle pieces
sorted.

Thanks,

tglx

2020-11-12 12:57:38

by Jason Gunthorpe

Subject: REGRESSION: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)
> based devices in a halfways architecture independent way.

Hi Thomas,

Our test team has been struggling with a regression on bare metal
SRIOV VFs since -rc1 that they were able to bisect to this series.

This commit tests good:

5712c3ed549e ("Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc")

This commit tests bad:

981aa1d366bf ("PCI: MSI: Fix Kconfig dependencies for PCI_MSI_ARCH_FALLBACKS")

They were unable to bisect further into the series because some of the
interior commits don't boot :(

When we try to load the mlx5 driver on a bare metal VF it gets this:

[Thu Oct 22 08:54:51 2020] DMAR: DRHD: handling fault status reg 2
[Thu Oct 22 08:54:51 2020] DMAR: [INTR-REMAP] Request device [42:00.2] fault index 1600 [fault reason 37] Blocked a compatibility format interrupt request
[Thu Oct 22 08:55:04 2020] mlx5_core 0000:42:00.1 eth4: Link down
[Thu Oct 22 08:55:11 2020] mlx5_core 0000:42:00.1 eth4: Link up
[Thu Oct 22 08:55:54 2020] mlx5_core 0000:42:00.2: mlx5_cmd_eq_recover:264:(pid 3390): Recovered 1 EQEs on cmd_eq
[Thu Oct 22 08:55:54 2020] mlx5_core 0000:42:00.2: wait_func_handle_exec_timeout:1051:(pid 3390): cmd0: CREATE_EQ(0x301) recovered after timeout
[Thu Oct 22 08:55:54 2020] DMAR: DRHD: handling fault status reg 102
[Thu Oct 22 08:55:54 2020] DMAR: [INTR-REMAP] Request device [42:00.2] fault index 1600 [fault reason 37] Blocked a compatibility format interrupt request

If you have any idea Ziyad and Itay can run any debugging you like.

I suppose it is because this series is handing out compatibility
addr/data pairs while the IOMMU is set up to only accept remapped ones
from SRIOV VFs?

Thanks,
Jason

2020-11-12 14:17:17

by Thomas Gleixner

Subject: Re: REGRESSION: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

Jason,

(trimmed CC list a bit)

On Thu, Nov 12 2020 at 08:55, Jason Gunthorpe wrote:
> On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
> They were unable to bisect further into the series because some of the
> interior commits don't boot :(
>
> When we try to load the mlx5 driver on a bare metal VF it gets this:
>
> [Thu Oct 22 08:54:51 2020] DMAR: DRHD: handling fault status reg 2
> [Thu Oct 22 08:54:51 2020] DMAR: [INTR-REMAP] Request device [42:00.2] fault index 1600 [fault reason 37] Blocked a compatibility format interrupt request
> [Thu Oct 22 08:55:04 2020] mlx5_core 0000:42:00.1 eth4: Link down
> [Thu Oct 22 08:55:11 2020] mlx5_core 0000:42:00.1 eth4: Link up
> [Thu Oct 22 08:55:54 2020] mlx5_core 0000:42:00.2: mlx5_cmd_eq_recover:264:(pid 3390): Recovered 1 EQEs on cmd_eq
> [Thu Oct 22 08:55:54 2020] mlx5_core 0000:42:00.2: wait_func_handle_exec_timeout:1051:(pid 3390): cmd0: CREATE_EQ(0x301) recovered after timeout
> [Thu Oct 22 08:55:54 2020] DMAR: DRHD: handling fault status reg 102
> [Thu Oct 22 08:55:54 2020] DMAR: [INTR-REMAP] Request device [42:00.2] fault index 1600 [fault reason 37] Blocked a compatibility format interrupt request
>
> If you have any ideas, Ziyad and Itay can run any debugging you like.
>
> I suppose it is because this series is handing out compatibility
> addr/data pairs while the IOMMU is set up to only accept remapped ones
> from SR-IOV VFs?

So the issue seems to be that the VF device has the default irq domain
assigned and not the remapping domain. Let me stare into the code to see
how these VF devices are set up and registered with the IOMMU/remap
unit.

Thanks,

tglx

2020-11-12 15:22:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: REGRESSION: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

On Thu, Nov 12 2020 at 15:15, Thomas Gleixner wrote:
> On Thu, Nov 12 2020 at 08:55, Jason Gunthorpe wrote:
>> On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
>> They were unable to bisect further into the series because some of the
>> interior commits don't boot :(
>>
>> When we try to load the mlx5 driver on a bare metal VF it gets this:
>>
>> [Thu Oct 22 08:54:51 2020] DMAR: DRHD: handling fault status reg 2
>> [Thu Oct 22 08:54:51 2020] DMAR: [INTR-REMAP] Request device [42:00.2] fault index 1600 [fault reason 37] Blocked a compatibility format interrupt request
>> [Thu Oct 22 08:55:04 2020] mlx5_core 0000:42:00.1 eth4: Link down
>> [Thu Oct 22 08:55:11 2020] mlx5_core 0000:42:00.1 eth4: Link up
>> [Thu Oct 22 08:55:54 2020] mlx5_core 0000:42:00.2: mlx5_cmd_eq_recover:264:(pid 3390): Recovered 1 EQEs on cmd_eq
>> [Thu Oct 22 08:55:54 2020] mlx5_core 0000:42:00.2: wait_func_handle_exec_timeout:1051:(pid 3390): cmd0: CREATE_EQ(0x301) recovered after timeout
>> [Thu Oct 22 08:55:54 2020] DMAR: DRHD: handling fault status reg 102
>> [Thu Oct 22 08:55:54 2020] DMAR: [INTR-REMAP] Request device [42:00.2] fault index 1600 [fault reason 37] Blocked a compatibility format interrupt request
>>
>> If you have any ideas, Ziyad and Itay can run any debugging you like.
>>
>> I suppose it is because this series is handing out compatibility
>> addr/data pairs while the IOMMU is set up to only accept remapped ones
>> from SR-IOV VFs?
>
> So the issue seems to be that the VF device has the default irq domain
> assigned and not the remapping domain. Let me stare into the code to see
> how these VF devices are set up and registered with the IOMMU/remap
> unit.

Found the reason. Will fix it after walking the dogs. Brain needs some
fresh air.

Thanks,

tglx

2020-11-12 19:17:56

by Thomas Gleixner

[permalink] [raw]
Subject: iommu/vt-d: Cure VF irqdomain hickup

The recent changes to store the MSI irqdomain pointer in struct device
missed that Intel DMAR does not register virtual function devices. Due to
that a VF device gets the plain PCI-MSI domain assigned and then issues
compat MSI messages which get caught by the interrupt remapping unit.

Cure that by inheriting the irq domain from the physical function
device.

That's a temporary workaround. The correct fix is to inherit the irq domain
from the bus, but that's a larger effort which needs quite some other
changes to the way how x86 manages PCI and MSI domains.

Fixes: 85a8dfc57a0b ("iommm/vt-d: Store irq domain in struct device")
Reported-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
drivers/iommu/intel/dmar.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)

--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -333,6 +333,11 @@ static void dmar_pci_bus_del_dev(struct
dmar_iommu_notify_scope_dev(info);
}

+static inline void vf_inherit_msi_domain(struct pci_dev *pdev)
+{
+ dev_set_msi_domain(&pdev->dev, dev_get_msi_domain(&pdev->physfn->dev));
+}
+
static int dmar_pci_bus_notifier(struct notifier_block *nb,
unsigned long action, void *data)
{
@@ -342,8 +347,20 @@ static int dmar_pci_bus_notifier(struct
/* Only care about add/remove events for physical functions.
* For VFs we actually do the lookup based on the corresponding
* PF in device_to_iommu() anyway. */
- if (pdev->is_virtfn)
+ if (pdev->is_virtfn) {
+ /*
+ * Note: This is a horrible hack and needs to be cleaned
+ * up by assigning the domain to the bus, but that's too
+ * big of a change for post rc3.
+ *
+ * Ensure that the VF device inherits the irq domain of the
+ * PF device:
+ */
+ if (action == BUS_NOTIFY_ADD_DEVICE)
+ vf_inherit_msi_domain(pdev);
return NOTIFY_DONE;
+ }
+
if (action != BUS_NOTIFY_ADD_DEVICE &&
action != BUS_NOTIFY_REMOVED_DEVICE)
return NOTIFY_DONE;

2020-11-12 21:37:40

by Thomas Gleixner

[permalink] [raw]
Subject: Re: iommu/vt-d: Cure VF irqdomain hickup

On Thu, Nov 12 2020 at 20:15, Thomas Gleixner wrote:
> The recent changes to store the MSI irqdomain pointer in struct device
> missed that Intel DMAR does not register virtual function devices. Due to
> that a VF device gets the plain PCI-MSI domain assigned and then issues
> compat MSI messages which get caught by the interrupt remapping unit.
>
> Cure that by inheriting the irq domain from the physical function
> device.
>
> That's a temporary workaround. The correct fix is to inherit the irq domain
> from the bus, but that's a larger effort which needs quite some other
> changes to the way how x86 manages PCI and MSI domains.

Bah, that's not really going to work with the way irq remapping works
on x86, because at least Intel/DMAR can have more than one DMAR unit on
a bus.

So the alternative solution would be to assign the domain per device,
but the current ordering creates a chicken-and-egg problem. Looking the
domain up in pci_set_msi_domain() does not work because at that point
the device is not yet registered with the IOMMU. That registration
happens from device_add().

Marc, is there any problem with reordering the calls in pci_device_add():

device_add();
pci_set_msi_domain();

That would allow adding an irq_find_matching_fwspec() based lookup to
pci_msi_get_device_domain().
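
Roughly like this (untested sketch, assuming the tail of pci_device_add()
in drivers/pci/probe.c otherwise keeps its current shape):

	/* pci_device_add(): register the device first, then pick the MSI domain */
	dev->match_driver = false;	/* keep: prevents early driver binding */
	ret = device_add(&dev->dev);	/* IOMMU/remap notifiers run here */
	WARN_ON(ret < 0);

	/*
	 * The device is known to the remapping unit now, so the lookup in
	 * pci_msi_get_device_domain() (plus an irq_find_matching_fwspec()
	 * based fallback) could pick the proper parent domain.
	 */
	pci_set_msi_domain(dev);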

Though I'm not yet convinced that the outcome would be less horrible
than the hack in the DMAR driver when I'm taking all the other horrors
of x86 (including XEN) into account :)

Thanks,

tglx

2020-11-13 07:23:15

by Baolu Lu

[permalink] [raw]
Subject: Re: iommu/vt-d: Cure VF irqdomain hickup

Hi Thomas,

On 2020/11/13 3:15, Thomas Gleixner wrote:
> The recent changes to store the MSI irqdomain pointer in struct device
> missed that Intel DMAR does not register virtual function devices. Due to
> that a VF device gets the plain PCI-MSI domain assigned and then issues
> compat MSI messages which get caught by the interrupt remapping unit.
>
> Cure that by inheriting the irq domain from the physical function
> device.
>
> That's a temporary workaround. The correct fix is to inherit the irq domain
> from the bus, but that's a larger effort which needs quite some other
> changes to the way how x86 manages PCI and MSI domains.
>
> Fixes: 85a8dfc57a0b ("iommm/vt-d: Store irq domain in struct device")
> Reported-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> drivers/iommu/intel/dmar.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> --- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -333,6 +333,11 @@ static void dmar_pci_bus_del_dev(struct
> dmar_iommu_notify_scope_dev(info);
> }
>
> +static inline void vf_inherit_msi_domain(struct pci_dev *pdev)
> +{
> + dev_set_msi_domain(&pdev->dev, dev_get_msi_domain(&pdev->physfn->dev));
> +}
> +
> static int dmar_pci_bus_notifier(struct notifier_block *nb,
> unsigned long action, void *data)
> {
> @@ -342,8 +347,20 @@ static int dmar_pci_bus_notifier(struct
> /* Only care about add/remove events for physical functions.
> * For VFs we actually do the lookup based on the corresponding
> * PF in device_to_iommu() anyway. */
> - if (pdev->is_virtfn)
> + if (pdev->is_virtfn) {
> + /*
> + * Note: This is a horrible hack and needs to be cleaned
> + * up by assigning the domain to the bus, but that's too
> + * big of a change for post rc3.
> + *
> + * Ensure that the VF device inherits the irq domain of the
> + * PF device:
> + */
> + if (action == BUS_NOTIFY_ADD_DEVICE)
> + vf_inherit_msi_domain(pdev);
> return NOTIFY_DONE;
> + }
> +
> if (action != BUS_NOTIFY_ADD_DEVICE &&
> action != BUS_NOTIFY_REMOVED_DEVICE)
> return NOTIFY_DONE;

We also encountered this problem in internal testing. This patch
solves the problem for us.

Acked-by: Lu Baolu <[email protected]>

Best regards,
baolu

2020-11-13 09:23:59

by Marc Zyngier

[permalink] [raw]
Subject: Re: iommu/vt-d: Cure VF irqdomain hickup

On 2020-11-12 21:34, Thomas Gleixner wrote:
> On Thu, Nov 12 2020 at 20:15, Thomas Gleixner wrote:
>> The recent changes to store the MSI irqdomain pointer in struct device
>> missed that Intel DMAR does not register virtual function devices.
>> Due to
>> that a VF device gets the plain PCI-MSI domain assigned and then
>> issues
>> compat MSI messages which get caught by the interrupt remapping unit.
>>
>> Cure that by inheriting the irq domain from the physical function
>> device.
>>
>> That's a temporary workaround. The correct fix is to inherit the irq
>> domain
>> from the bus, but that's a larger effort which needs quite some other
>> changes to the way how x86 manages PCI and MSI domains.
>
> Bah, that's not really going to work with the way irq remapping works
> on x86, because at least Intel/DMAR can have more than one DMAR unit on
> a bus.
>
> So the alternative solution would be to assign the domain per device,
> but the current ordering creates a chicken-and-egg problem. Looking the
> domain up in pci_set_msi_domain() does not work because at that point
> the device is not yet registered with the IOMMU. That registration
> happens from device_add().
>
> Marc, is there any problem with reordering the calls in pci_device_add():
>
> device_add();
> pci_set_msi_domain();

I *think* it works as long as we keep the "match_driver = false" hack.
Otherwise, we risk binding to a driver early, and game over.

> That would allow adding an irq_find_matching_fwspec() based lookup to
> pci_msi_get_device_domain().

Just so that I understand the issue: is the core of the problem that
there is no 1:1 mapping between a PCI bus and a DMAR unit, and no
firmware topology information to indicate which one to pick?

> Though I'm not yet convinced that the outcome would be less horrible
> than the hack in the DMAR driver when I'm taking all the other horrors
> of x86 (including XEN) into account :)

I tried to follow the notifier into the DMAR driver, ended up in the
IRQ remapping code, and lost the will to live. I have a question though:

In the bus notifier callback, you end up in dmar_pci_bus_add_dev(),
which calls intel_irq_remap_add_device(), which tries to set the
MSI domain. Why isn't that enough? Are we still missing any information
at that stage?

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2020-11-13 13:55:38

by Thomas Gleixner

[permalink] [raw]
Subject: Re: iommu/vt-d: Cure VF irqdomain hickup

On Fri, Nov 13 2020 at 09:19, Marc Zyngier wrote:
> On 2020-11-12 21:34, Thomas Gleixner wrote:
>> That would allow adding an irq_find_matching_fwspec() based lookup to
>> pci_msi_get_device_domain().
>
> Just so that I understand the issue: is the core of the problem that
> there is no 1:1 mapping between a PCI bus and a DMAR unit, and no
> firmware topology information to indicate which one to pick?

Yes, we don't have a 1:1 mapping and there is some info, but that's all
a horrible mess.

>> Though I'm not yet convinced that the outcome would be less horrible
>> than the hack in the DMAR driver when I'm taking all the other horrors
>> of x86 (including XEN) into account :)
>
> I tried to follow the notifier into the DMAR driver, ended up in the
> IRQ remapping code, and lost the will to live.

Please just don't look at that and stay alive :)

> I have a question though:
>
> In the bus notifier callback, you end up in dmar_pci_bus_add_dev(),
> which calls intel_irq_remap_add_device(), which tries to set the MSI
> domain. Why isn't that enough? Are we still missing any information at
> that stage?

That works, but this code is not reached for VF devices ... See the
patch which cures that.
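
i.e. for a VF the notifier returns before dmar_pci_bus_add_dev() and
therefore intel_irq_remap_add_device() are ever reached (pre-fix code,
see the hunk above):

	if (pdev->is_virtfn)
		return NOTIFY_DONE;

	if (action != BUS_NOTIFY_ADD_DEVICE &&
	    action != BUS_NOTIFY_REMOVED_DEVICE)
		return NOTIFY_DONE;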

If we want to get rid of that mess we'd need to rewrite the DMAR IOMMU
device registration completely. I'll leave it as is for now. My will to
live is more important :)

Thanks

tglx

2020-11-16 09:51:38

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: iommu/vt-d: Cure VF irqdomain hickup

Hi Thomas,

On Thu, Nov 12, 2020 at 8:16 PM Thomas Gleixner <[email protected]> wrote:
> The recent changes to store the MSI irqdomain pointer in struct device
> missed that Intel DMAR does not register virtual function devices. Due to
> that a VF device gets the plain PCI-MSI domain assigned and then issues
> compat MSI messages which get caught by the interrupt remapping unit.
>
> Cure that by inheriting the irq domain from the physical function
> device.
>
> That's a temporary workaround. The correct fix is to inherit the irq domain
> from the bus, but that's a larger effort which needs quite some other
> changes to the way how x86 manages PCI and MSI domains.
>
> Fixes: 85a8dfc57a0b ("iommm/vt-d: Store irq domain in struct device")
> Reported-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> drivers/iommu/intel/dmar.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> --- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -333,6 +333,11 @@ static void dmar_pci_bus_del_dev(struct
> dmar_iommu_notify_scope_dev(info);
> }
>
> +static inline void vf_inherit_msi_domain(struct pci_dev *pdev)
> +{
> + dev_set_msi_domain(&pdev->dev, dev_get_msi_domain(&pdev->physfn->dev));

If CONFIG_PCI_ATS is not set:

error: 'struct pci_dev' has no member named 'physfn'

http://kisskb.ellerman.id.au/kisskb/buildresult/14400927/
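
For reference, physfn only exists in struct pci_dev when CONFIG_PCI_ATS
is enabled (abridged from include/linux/pci.h, exact layout may differ):

	#ifdef CONFIG_PCI_ATS
		union {
			struct pci_sriov *sriov;	/* PF: SR-IOV info */
			struct pci_dev	*physfn;	/* VF: related PF */
		};
	#endif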

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2020-11-16 12:53:39

by Thomas Gleixner

[permalink] [raw]
Subject: Re: iommu/vt-d: Cure VF irqdomain hickup

Geert,

On Mon, Nov 16 2020 at 10:47, Geert Uytterhoeven wrote:
> On Thu, Nov 12, 2020 at 8:16 PM Thomas Gleixner <[email protected]> wrote:
>> The recent changes to store the MSI irqdomain pointer in struct device
>> missed that Intel DMAR does not register virtual function devices. Due to
>> that a VF device gets the plain PCI-MSI domain assigned and then issues
>> compat MSI messages which get caught by the interrupt remapping unit.
>>
>> Cure that by inheriting the irq domain from the physical function
>> device.
>>
>> That's a temporary workaround. The correct fix is to inherit the irq domain
>> from the bus, but that's a larger effort which needs quite some other
>> changes to the way how x86 manages PCI and MSI domains.
>>
>> Fixes: 85a8dfc57a0b ("iommm/vt-d: Store irq domain in struct device")
>> Reported-by: Jason Gunthorpe <[email protected]>
>> Signed-off-by: Thomas Gleixner <[email protected]>
>> ---
>> drivers/iommu/intel/dmar.c | 19 ++++++++++++++++++-
>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> --- a/drivers/iommu/intel/dmar.c
>> +++ b/drivers/iommu/intel/dmar.c
>> @@ -333,6 +333,11 @@ static void dmar_pci_bus_del_dev(struct
>> dmar_iommu_notify_scope_dev(info);
>> }
>>
>> +static inline void vf_inherit_msi_domain(struct pci_dev *pdev)
>> +{
>> + dev_set_msi_domain(&pdev->dev, dev_get_msi_domain(&pdev->physfn->dev));
>
> If CONFIG_PCI_ATS is not set:
>
> error: 'struct pci_dev' has no member named 'physfn'
>

thanks for pointing that out. Yet moar ifdeffery, oh well...

Thanks,

tglx

2020-11-16 20:10:56

by Baolu Lu

[permalink] [raw]
Subject: Re: iommu/vt-d: Cure VF irqdomain hickup

On 2020/11/16 17:47, Geert Uytterhoeven wrote:
> Hi Thomas,
>
> On Thu, Nov 12, 2020 at 8:16 PM Thomas Gleixner <[email protected]> wrote:
>> The recent changes to store the MSI irqdomain pointer in struct device
>> missed that Intel DMAR does not register virtual function devices. Due to
>> that a VF device gets the plain PCI-MSI domain assigned and then issues
>> compat MSI messages which get caught by the interrupt remapping unit.
>>
>> Cure that by inheriting the irq domain from the physical function
>> device.
>>
>> That's a temporary workaround. The correct fix is to inherit the irq domain
>> from the bus, but that's a larger effort which needs quite some other
>> changes to the way how x86 manages PCI and MSI domains.
>>
>> Fixes: 85a8dfc57a0b ("iommm/vt-d: Store irq domain in struct device")
>> Reported-by: Jason Gunthorpe <[email protected]>
>> Signed-off-by: Thomas Gleixner <[email protected]>
>> ---
>> drivers/iommu/intel/dmar.c | 19 ++++++++++++++++++-
>> 1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> --- a/drivers/iommu/intel/dmar.c
>> +++ b/drivers/iommu/intel/dmar.c
>> @@ -333,6 +333,11 @@ static void dmar_pci_bus_del_dev(struct
>> dmar_iommu_notify_scope_dev(info);
>> }
>>
>> +static inline void vf_inherit_msi_domain(struct pci_dev *pdev)
>> +{
>> + dev_set_msi_domain(&pdev->dev, dev_get_msi_domain(&pdev->physfn->dev));
>
> If CONFIG_PCI_ATS is not set:
>
> error: 'struct pci_dev' has no member named 'physfn'
>
> http://kisskb.ellerman.id.au/kisskb/buildresult/14400927/

Maybe the pci_physfn() helper should be used here, something like the
sketch below.
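
Untested sketch; pci_physfn() returns the PF for a VF when
CONFIG_PCI_ATS is enabled and the device itself otherwise, so the
#ifdef stays out of the DMAR code:

	static inline void vf_inherit_msi_domain(struct pci_dev *pdev)
	{
		struct pci_dev *physfn = pci_physfn(pdev);

		dev_set_msi_domain(&pdev->dev,
				   dev_get_msi_domain(&physfn->dev));
	}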

Best regards,
baolu

2020-11-16 23:25:32

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: iommu/vt-d: Cure VF irqdomain hickup

On Thu, Nov 12, 2020 at 08:15:02PM +0100, Thomas Gleixner wrote:
> The recent changes to store the MSI irqdomain pointer in struct device
> missed that Intel DMAR does not register virtual function devices. Due to
> that a VF device gets the plain PCI-MSI domain assigned and then issues
> compat MSI messages which get caught by the interrupt remapping unit.
>
> Cure that by inheriting the irq domain from the physical function
> device.
>
> That's a temporary workaround. The correct fix is to inherit the irq domain
> from the bus, but that's a larger effort which needs quite some other
> changes to the way how x86 manages PCI and MSI domains.
>
> Fixes: 85a8dfc57a0b ("iommm/vt-d: Store irq domain in struct device")
> Reported-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> drivers/iommu/intel/dmar.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)

Our QA says it solves the issue:

Tested-by: Itay Aveksis <[email protected]>

Thanks,
Jason