2024-02-29 21:11:51

by Dimitri Sivanich

[permalink] [raw]
Subject: [PATCH] Allocate DMAR fault interrupts locally

The Intel IOMMU code currently tries to allocate all DMAR fault interrupt
vectors on the boot cpu. On large systems with high DMAR counts this
results in vector exhaustion, and most of the vectors are not initially
allocated socket local.

Instead, have a cpu on each node do the vector allocation for the DMARs on
that node. The boot cpu still does the allocation for its node during its
boot sequence.

Signed-off-by: Dimitri Sivanich <[email protected]>
---
drivers/iommu/intel/dmar.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 23cb80d62a9a..41ef72ba7509 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -2108,8 +2108,12 @@ int __init enable_drhd_fault_handling(void)
*/
for_each_iommu(iommu, drhd) {
u32 fault_status;
- int ret = dmar_set_interrupt(iommu);
+ int ret;

+ if (iommu->node != cpu_to_node(smp_processor_id()))
+ continue;
+
+ ret = dmar_set_interrupt(iommu);
if (ret) {
pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
(unsigned long long)drhd->reg_base_addr, ret);
@@ -2192,6 +2196,34 @@ static int __init dmar_free_unused_resources(void)

late_initcall(dmar_free_unused_resources);

+#ifdef CONFIG_X86_LOCAL_APIC
+static void __init irq_remap_enable_fault_handling_thr(struct work_struct *work)
+{
+ irq_remap_enable_fault_handling();
+}
+
+static int __init assign_dmar_vectors(void)
+{
+ struct work_struct irq_remap_work;
+ int nid;
+
+ INIT_WORK(&irq_remap_work, irq_remap_enable_fault_handling_thr);
+ cpus_read_lock();
+ for_each_online_node(nid) {
+ /* Boot cpu dmar vectors are assigned before the rest */
+ if (nid == cpu_to_node(get_boot_cpu_id()))
+ continue;
+ schedule_work_on(cpumask_first(cpumask_of_node(nid)),
+ &irq_remap_work);
+ flush_work(&irq_remap_work);
+ }
+ cpus_read_unlock();
+ return 0;
+}
+
+arch_initcall(assign_dmar_vectors);
+#endif /* CONFIG_X86_LOCAL_APIC */
+
/*
* DMAR Hotplug Support
* For more details, please refer to Intel(R) Virtualization Technology
--
2.35.3



2024-02-29 22:18:53

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] Allocate DMAR fault interrupts locally

Dimitri!

On Thu, Feb 29 2024 at 14:07, Dimitri Sivanich wrote:

The subject lacks a subsystem prefix. You're doing this for how many
decades now?

> The Intel IOMMU code currently tries to allocate all DMAR fault interrupt
>
> +#ifdef CONFIG_X86_LOCAL_APIC

I seriously doubt that this code can ever be compiled w/o X86_LOCAL_APIC:

obj-$(CONFIG_DMAR_TABLE) += dmar.o

config DMAR_TABLE
bool

config INTEL_IOMMU
depends on PCI_MSI && ACPI && X86
select DMAR_TABLE

config IRQ_REMAP
depends on X86_64 && X86_IO_APIC && PCI_MSI && ACPI
select DMAR_TABLE

config X86_LOCAL_APIC
def_bool y
depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC || PCI_MSI

What are you trying to achieve here other than #ifdef voodoo?

> +static void __init irq_remap_enable_fault_handling_thr(struct work_struct *work)
> +{
> + irq_remap_enable_fault_handling();

because if INTEL_IOMMU=y and IRQ_REMAP=n then X86_LOCAL_APIC=y and this
muck gets invoked for nothing. 'git grep irq_remap_enable_fault_handling
include/' might give you a hint.

> +}
> +
> +static int __init assign_dmar_vectors(void)
> +{
> + struct work_struct irq_remap_work;
> + int nid;
> +
> + INIT_WORK(&irq_remap_work, irq_remap_enable_fault_handling_thr);
> + cpus_read_lock();
> + for_each_online_node(nid) {
> + /* Boot cpu dmar vectors are assigned before the rest */
> + if (nid == cpu_to_node(get_boot_cpu_id()))
> + continue;
> + schedule_work_on(cpumask_first(cpumask_of_node(nid)),
> + &irq_remap_work);
> + flush_work(&irq_remap_work);
> + }
> + cpus_read_unlock();
> + return 0;
> +}
> +
> +arch_initcall(assign_dmar_vectors);

Stray newline before arch_initcall(), but that's not the problem.

The real problems are:

1) This approach only works when _ALL_ APs have been brought up during
boot. With 'maxcpus=N' on the command line this will fail to enable
fault handling when the APs which have not been brought up initially
are onlined later on.

This might be working in practice because intel_iommu_init() will
enable the interrupts later on via init_dmars() unconditionally, but
that's far from correct because IRQ_REMAP does not depend on
INTEL_IOMMU.

2) It leaves a gap where the reporting is not working between bringing
up the APs during boot and this initcall. Mostly theoretical, but
that does not make it more correct either.

What you really want is a cpu hotplug state in the CPUHP_BP_PREPARE_DYN
space which enables the interrupt for the node _before_ the first AP of
the node is brought up. That will solve the problem nicely w/o any of
the above issues.

Thanks,

tglx

2024-03-01 19:44:34

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH] Allocate DMAR fault interrupts locally

Hi Thomas,

On Thu, 29 Feb 2024 23:18:37 +0100, Thomas Gleixner <[email protected]>
wrote:

> Dimitri!
>
> On Thu, Feb 29 2024 at 14:07, Dimitri Sivanich wrote:
>
> The subject lacks a subsystem prefix. You're doing this for how many
> decades now?
>
> > The Intel IOMMU code currently tries to allocate all DMAR fault
> > interrupt
> > +#ifdef CONFIG_X86_LOCAL_APIC
>
> I seriously doubt that this code can ever be compiled w/o X86_LOCAL_APIC:
>
> obj-$(CONFIG_DMAR_TABLE) += dmar.o
>
> config DMAR_TABLE
> bool
>
> config INTEL_IOMMU
> depends on PCI_MSI && ACPI && X86
> select DMAR_TABLE
>
> config IRQ_REMAP
> depends on X86_64 && X86_IO_APIC && PCI_MSI && ACPI
> select DMAR_TABLE
>
> config X86_LOCAL_APIC
> def_bool y
> depends on X86_64 || SMP || X86_32_NON_STANDARD || X86_UP_APIC ||
> PCI_MSI
>
> What are you trying to achieve here other than #ifdef voodoo?
>
> > +static void __init irq_remap_enable_fault_handling_thr(struct
> > work_struct *work) +{
> > + irq_remap_enable_fault_handling();
>
> because if INTEL_IOMMU=y and IRQ_REMAP=n then X86_LOCAL_APIC=y and this
> muck gets invoked for nothing. 'git grep irq_remap_enable_fault_handling
> include/' might give you a hint.
>
> > +}
> > +
> > +static int __init assign_dmar_vectors(void)
> > +{
> > + struct work_struct irq_remap_work;
> > + int nid;
> > +
> > + INIT_WORK(&irq_remap_work,
> > irq_remap_enable_fault_handling_thr);
> > + cpus_read_lock();
> > + for_each_online_node(nid) {
> > + /* Boot cpu dmar vectors are assigned before the rest
> > */
> > + if (nid == cpu_to_node(get_boot_cpu_id()))
> > + continue;
> > + schedule_work_on(cpumask_first(cpumask_of_node(nid)),
> > + &irq_remap_work);
> > + flush_work(&irq_remap_work);
> > + }
> > + cpus_read_unlock();
> > + return 0;
> > +}
> > +
> > +arch_initcall(assign_dmar_vectors);
>
> Stray newline before arch_initcall(), but that's not the problem.
>
> The real problems are:
>
> 1) This approach only works when _ALL_ APs have been brought up during
> boot. With 'maxcpus=N' on the command line this will fail to enable
> fault handling when the APs which have not been brought up initially
> are onlined later on.
>
> This might be working in practice because intel_iommu_init() will
> enable the interrupts later on via init_dmars() unconditionally, but
> that's far from correct because IRQ_REMAP does not depend on
> INTEL_IOMMU.
The dmar fault interrupt is VT-d's own interrupt, not subject to IRQ_REMAP.
So this set up has nothing to do with IR, right?

Maybe we should not call it irq_remap_work, call it dmar_fault_irq_work
instead?

> 2) It leaves a gap where the reporting is not working between bringing
> up the APs during boot and this initcall. Mostly theoretical, but
> that does not make it more correct either.
>
> What you really want is a cpu hotplug state in the CPUHP_BP_PREPARE_DYN
> space which enables the interrupt for the node _before_ the first AP of
> the node is brought up. That will solve the problem nicely w/o any of
> the above issues.
>
> Thanks,
>
> tglx
>


Thanks,

Jacob

2024-03-01 19:59:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] Allocate DMAR fault interrupts locally

Jacob!

On Fri, Mar 01 2024 at 11:50, Jacob Pan wrote:
> On Thu, 29 Feb 2024 23:18:37 +0100, Thomas Gleixner <[email protected]>
> wrote:
>> On Thu, Feb 29 2024 at 14:07, Dimitri Sivanich wrote:
>> The real problems are:
>>
>> 1) This approach only works when _ALL_ APs have been brought up during
>> boot. With 'maxcpus=N' on the command line this will fail to enable
>> fault handling when the APs which have not been brought up initially
>> are onlined later on.
>>
>> This might be working in practice because intel_iommu_init() will
>> enable the interrupts later on via init_dmars() unconditionally, but
>> that's far from correct because IRQ_REMAP does not depend on
>> INTEL_IOMMU.
> The dmar fault interrupt is VT-d's own interrupt, not subject to IRQ_REMAP.
> So this set up has nothing to do with IR, right?

Both interrupt remap and the IOMMU use DMAR and both set the DMAR's
fault interrupt up, whatever comes first. If IR is enabled then IR does
it, if not then the IOMMU init handles it.

But yes, the interrupt is part of DMAR.

> Maybe we should not call it irq_remap_work, call it dmar_fault_irq_work
> instead?

This work thing does not work at all as I explained in detail. No matter
how you name it. The only sane way to do that is with a hotplug state.

Thanks,

tglx


2024-03-11 20:40:09

by Dimitri Sivanich

[permalink] [raw]
Subject: Re: [PATCH] Allocate DMAR fault interrupts locally

Thomas!

On Thu, Feb 29, 2024 at 11:18:37PM +0100, Thomas Gleixner wrote:
> Dimitri!
>
> On Thu, Feb 29 2024 at 14:07, Dimitri Sivanich wrote:
>
> > +}
> > +
> > +static int __init assign_dmar_vectors(void)
> > +{
> > + struct work_struct irq_remap_work;
> > + int nid;
> > +
> > + INIT_WORK(&irq_remap_work, irq_remap_enable_fault_handling_thr);
> > + cpus_read_lock();
> > + for_each_online_node(nid) {
> > + /* Boot cpu dmar vectors are assigned before the rest */
> > + if (nid == cpu_to_node(get_boot_cpu_id()))
> > + continue;
> > + schedule_work_on(cpumask_first(cpumask_of_node(nid)),
> > + &irq_remap_work);
> > + flush_work(&irq_remap_work);
> > + }
> > + cpus_read_unlock();
> > + return 0;
> > +}
> > +
> > +arch_initcall(assign_dmar_vectors);
>
> Stray newline before arch_initcall(), but that's not the problem.
>
> The real problems are:
>
> 1) This approach only works when _ALL_ APs have been brought up during
> boot. With 'maxcpus=N' on the command line this will fail to enable
> fault handling when the APs which have not been brought up initially
> are onlined later on.
>
> This might be working in practice because intel_iommu_init() will
> enable the interrupts later on via init_dmars() unconditionally, but
> that's far from correct because IRQ_REMAP does not depend on
> INTEL_IOMMU.
>
> 2) It leaves a gap where the reporting is not working between bringing
> up the APs during boot and this initcall. Mostly theoretical, but
> that does not make it more correct either.
>
> What you really want is a cpu hotplug state in the CPUHP_BP_PREPARE_DYN
> space which enables the interrupt for the node _before_ the first AP of
> the node is brought up. That will solve the problem nicely w/o any of
> the above issues.
>

Initially this sounds like a good approach. As things currently stand, however,
there are (at least) several problems with attempting to allocate interrupts on
cpus that are not running yet via the existing dmar_set_interrupt path.

- The code relies on node_to_cpumask_map (cpumask_of_node()), which has been
allocated, but not populated at the CPUHP_BP_PREPARE_DYN stage.

- The irq_matrix cpumaps do not indicate being online or initialized yet, except
for the boot cpu instance, of course.

So things still revert to boot cpu allocation, until we exhaust the vectors.

Of course, running the dmar_set_interrupt code from a CPUHP_AP_ONLINE_DYN state
does work (although I believe there is a concurrency issue that could show up
with the current dmar_set_interrupt path).

So the code seems to have been designed based on the assumption that it will be
run on an already active (though not necessarily fully onlined?) cpu. To make
this work, any code based on that assumption would need to be fixed. Otherwise,
a different approach is needed.


2024-03-21 22:11:53

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH] Allocate DMAR fault interrupts locally

Hi Dimitri,

On Mon, 11 Mar 2024 15:38:59 -0500, Dimitri Sivanich <[email protected]>
wrote:

> Thomas!
>
> On Thu, Feb 29, 2024 at 11:18:37PM +0100, Thomas Gleixner wrote:
> > Dimitri!
> >
> > On Thu, Feb 29 2024 at 14:07, Dimitri Sivanich wrote:
> >
> > > +}
> > > +
> > > +static int __init assign_dmar_vectors(void)
> > > +{
> > > + struct work_struct irq_remap_work;
> > > + int nid;
> > > +
> > > + INIT_WORK(&irq_remap_work,
> > > irq_remap_enable_fault_handling_thr);
> > > + cpus_read_lock();
> > > + for_each_online_node(nid) {
> > > + /* Boot cpu dmar vectors are assigned before the
> > > rest */
> > > + if (nid == cpu_to_node(get_boot_cpu_id()))
> > > + continue;
> > > + schedule_work_on(cpumask_first(cpumask_of_node(nid)),
> > > + &irq_remap_work);
> > > + flush_work(&irq_remap_work);
> > > + }
> > > + cpus_read_unlock();
> > > + return 0;
> > > +}
> > > +
> > > +arch_initcall(assign_dmar_vectors);
> >
> > Stray newline before arch_initcall(), but that's not the problem.
> >
> > The real problems are:
> >
> > 1) This approach only works when _ALL_ APs have been brought up during
> > boot. With 'maxcpus=N' on the command line this will fail to enable
> > fault handling when the APs which have not been brought up initially
> > are onlined later on.
> >
> > This might be working in practice because intel_iommu_init() will
> > enable the interrupts later on via init_dmars() unconditionally, but
> > that's far from correct because IRQ_REMAP does not depend on
> > INTEL_IOMMU.
> >
> > 2) It leaves a gap where the reporting is not working between bringing
> > up the APs during boot and this initcall. Mostly theoretical, but
> > that does not make it more correct either.
> >
> > What you really want is a cpu hotplug state in the CPUHP_BP_PREPARE_DYN
> > space which enables the interrupt for the node _before_ the first AP of
> > the node is brought up. That will solve the problem nicely w/o any of
> > the above issues.
> >
>
> Initially this sounds like a good approach. As things currently stand,
> however, there are (at least) several problems with attempting to
> allocate interrupts on cpus that are not running yet via the existing
> dmar_set_interrupt path.
>
> - The code relies on node_to_cpumask_map (cpumask_of_node()), which has
> been allocated, but not populated at the CPUHP_BP_PREPARE_DYN stage.
>
> - The irq_matrix cpumaps do not indicate being online or initialized yet,
> except for the boot cpu instance, of course.
>
> So things still revert to boot cpu allocation, until we exhaust the
> vectors.
>
> Of course, running the dmar_set_interrupt code from a CPUHP_AP_ONLINE_DYN
> state does work (although I believe there is a concurrency issue that
> could show up with the current dmar_set_interrupt path).
>
> So the code seems to have been designed based on the assumption that it
> will be run on an already active (though not necessarily fully onlined?)
> cpu. To make this work, any code based on that assumption would need to
> be fixed. Otherwise, a different approach is needed.
This may not be pretty but since DMAR fault is for unrecoverable faults,
they are rare and infrequent, should never happen on a healthy system. Can
we share one BSP vector for all DMARs? i.e. let dmar_fault() handler search
for the offending DMAR for fault reasons.


Thanks,

Jacob

2024-03-24 20:05:55

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] Allocate DMAR fault interrupts locally

Dimitri!

On Mon, Mar 11 2024 at 15:38, Dimitri Sivanich wrote:
> On Thu, Feb 29, 2024 at 11:18:37PM +0100, Thomas Gleixner wrote:
>> What you really want is a cpu hotplug state in the CPUHP_BP_PREPARE_DYN
>> space which enables the interrupt for the node _before_ the first AP of
>> the node is brought up. That will solve the problem nicely w/o any of
>> the above issues.
>>
>
> Initially this sounds like a good approach. As things currently stand, however,
> there are (at least) several problems with attempting to allocate interrupts on
> cpus that are not running yet via the existing dmar_set_interrupt path.
>
> - The code relies on node_to_cpumask_map (cpumask_of_node()), which has been
> allocated, but not populated at the CPUHP_BP_PREPARE_DYN stage.
>
> - The irq_matrix cpumaps do not indicate being online or initialized yet, except
> for the boot cpu instance, of course.
>
> So things still revert to boot cpu allocation, until we exhaust the
> vectors.

I thought about the following:

CPUHP_BP_PREPARE_DYN allocates the hardware interrupt on the control
CPU (the boot CPU during early boot).

CPUHP_AP_ONLINE_DYN moves it over to the AP. This needs to set
affinity and then retrigger the interrupt so that the horrible
non-remapped MSI migration logic is invoked.

Though that does not work for parallel bringup as then the prepare stage
is invoked for all CPUs before any of them gets to the online phase,
which obviously ends up with the same problem.

> Of course, running the dmar_set_interrupt code from a CPUHP_AP_ONLINE_DYN state
> does work (although I believe there is a concurrency issue that could show up
> with the current dmar_set_interrupt path).

Which concurrency issue? CPU hotplug is fully serialized.

> So the code seems to have been designed based on the assumption that it will be
> run on an already active (though not necessarily fully onlined?) cpu. To make
> this work, any code based on that assumption would need to be fixed. Otherwise,
> a different approach is needed.

Yes, the interrupt vector code it is designed that way and for the
general case this is absolutely the right thing to do.

Thanks,

tglx

2024-03-24 21:05:52

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] Allocate DMAR fault interrupts locally

Jacob!

On Thu, Mar 21 2024 at 15:13, Jacob Pan wrote:
> On Mon, 11 Mar 2024 15:38:59 -0500, Dimitri Sivanich <[email protected]>
> wrote:
>> So the code seems to have been designed based on the assumption that it
>> will be run on an already active (though not necessarily fully onlined?)
>> cpu. To make this work, any code based on that assumption would need to
>> be fixed. Otherwise, a different approach is needed.
>
> This may not be pretty but since DMAR fault is for unrecoverable faults,
> they are rare and infrequent, should never happen on a healthy system. Can
> we share one BSP vector for all DMARs? i.e. let dmar_fault() handler search
> for the offending DMAR for fault reasons.

It might not be pretty on the first thought, but it has a charm.

You are right. If DMAR faults happen then there is an issue which is
going to affect the machine badly anyway, so the extra search through
the iommu list is not necessarily horrible and it does not matter at all
whether the access is local or not.

But there are two interrupts involved. The DMAR one and the SVM one.

And all of that DMAR/SVM code is built around the assumption that
everything can be set up at boot time on the BSP.

The DMAR case definitely is solvable by sharing the interrupt, see the
uncompiled below. I still need to wrap my brain around the SVM part, but
feel free to beat me to it or preferrably explain to me that it's not
necessary at all :)

Thanks,

tglx
---
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1182,7 +1182,6 @@ static void free_iommu(struct intel_iomm
iommu->pr_irq = 0;
}
free_irq(iommu->irq, iommu);
- dmar_free_hwirq(iommu->irq);
iommu->irq = 0;
}

@@ -1956,17 +1955,21 @@ void dmar_msi_mask(struct irq_data *data
raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
}

-void dmar_msi_write(int irq, struct msi_msg *msg)
+static void dmar_iommu_msi_write(struct intel_iommu *iommu, struct msi_msg *msg)
{
- struct intel_iommu *iommu = irq_get_handler_data(irq);
int reg = dmar_msi_reg(iommu, irq);
- unsigned long flag;
+ unsigned long flags;

- raw_spin_lock_irqsave(&iommu->register_lock, flag);
+ raw_spin_lock_irqsave(&iommu->register_lock, flags);
writel(msg->data, iommu->reg + reg + 4);
writel(msg->address_lo, iommu->reg + reg + 8);
writel(msg->address_hi, iommu->reg + reg + 12);
- raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
+ raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
+}
+
+void dmar_msi_write(int irq, struct msi_msg *msg)
+{
+ dmar_iommu_msi_write(irq_get_handler_data(irq), msg);
}

void dmar_msi_read(int irq, struct msi_msg *msg)
@@ -2100,23 +2103,37 @@ irqreturn_t dmar_fault(int irq, void *de

int dmar_set_interrupt(struct intel_iommu *iommu)
{
- int irq, ret;
+ static int dmar_irq;
+ int ret;

- /*
- * Check if the fault interrupt is already initialized.
- */
+ /* Don't initialize it twice for a given iommu */
if (iommu->irq)
return 0;

- irq = dmar_alloc_hwirq(iommu->seq_id, iommu->node, iommu);
- if (irq > 0) {
- iommu->irq = irq;
+ /*
+ * There is one shared interrupt for all IOMMUs to prevent vector
+ * exhaustion.
+ */
+ if (!dmar_irq) {
+ int irq = dmar_alloc_hwirq(iommu->seq_id, iommu->node, iommu);
+
+ if (irq <= 0) {
+ pr_err("No free IRQ vectors\n");
+ return -EINVAL;
+ }
+ dmar_irq = irq;
} else {
- pr_err("No free IRQ vectors\n");
- return -EINVAL;
+ struct msi_msg msg;
+
+ /*
+ * Get the MSI message from the shared interrupt and write
+ * it to the iommu MSI registers.
+ */
+ dmar_msi_read(dmar_irq, &msg);
+ dmar_iommu_msi_write(iommu, &msg);
}

- ret = request_irq(irq, dmar_fault, IRQF_NO_THREAD, iommu->name, iommu);
+ ret = request_irq(dmar_irq, dmar_fault, IRQF_NO_THREAD | IRQF_SHARED, iommu->name, iommu);
if (ret)
pr_err("Can't request irq\n");
return ret;

2024-03-25 19:21:28

by Dimitri Sivanich

[permalink] [raw]
Subject: Re: [PATCH] Allocate DMAR fault interrupts locally

Thomas!

On Sun, Mar 24, 2024 at 09:05:35PM +0100, Thomas Gleixner wrote:
> Dimitri!
>
> On Mon, Mar 11 2024 at 15:38, Dimitri Sivanich wrote:
> > On Thu, Feb 29, 2024 at 11:18:37PM +0100, Thomas Gleixner wrote:
> >> What you really want is a cpu hotplug state in the CPUHP_BP_PREPARE_DYN
> >> space which enables the interrupt for the node _before_ the first AP of
> >> the node is brought up. That will solve the problem nicely w/o any of
> >> the above issues.
> >>
> >
> > Initially this sounds like a good approach. As things currently stand, however,
> > there are (at least) several problems with attempting to allocate interrupts on
> > cpus that are not running yet via the existing dmar_set_interrupt path.
> >
> > - The code relies on node_to_cpumask_map (cpumask_of_node()), which has been
> > allocated, but not populated at the CPUHP_BP_PREPARE_DYN stage.
> >
> > - The irq_matrix cpumaps do not indicate being online or initialized yet, except
> > for the boot cpu instance, of course.
> >
> > So things still revert to boot cpu allocation, until we exhaust the
> > vectors.
>
> I thought about the following:
>
> CPUHP_BP_PREPARE_DYN allocates the hardware interrupt on the control
> CPU (the boot CPU during early boot).
>
> CPUHP_AP_ONLINE_DYN moves it over to the AP. This needs to set
> affinity and then retrigger the interrupt so that the horrible
> non-remapped MSI migration logic is invoked.
>
> Though that does not work for parallel bringup as then the prepare stage
> is invoked for all CPUs before any of them gets to the online phase,
> which obviously ends up with the same problem.
>
> > Of course, running the dmar_set_interrupt code from a CPUHP_AP_ONLINE_DYN state
> > does work (although I believe there is a concurrency issue that could show up
> > with the current dmar_set_interrupt path).
>
> Which concurrency issue? CPU hotplug is fully serialized.

Yes, which allowed a simpler iplementation for V2 of this patch than I first
thought.

>
> > So the code seems to have been designed based on the assumption that it will be
> > run on an already active (though not necessarily fully onlined?) cpu. To make
> > this work, any code based on that assumption would need to be fixed. Otherwise,
> > a different approach is needed.
>
> Yes, the interrupt vector code it is designed that way and for the
> general case this is absolutely the right thing to do.
>
> Thanks,
>
> tglx

2024-03-25 22:32:09

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH] Allocate DMAR fault interrupts locally

Hi Thomas,

On Sun, 24 Mar 2024 22:05:35 +0100, Thomas Gleixner <[email protected]>
wrote:

> Jacob!
>
> On Thu, Mar 21 2024 at 15:13, Jacob Pan wrote:
> > On Mon, 11 Mar 2024 15:38:59 -0500, Dimitri Sivanich <[email protected]>
> > wrote:
> >> So the code seems to have been designed based on the assumption that it
> >> will be run on an already active (though not necessarily fully
> >> onlined?) cpu. To make this work, any code based on that assumption
> >> would need to be fixed. Otherwise, a different approach is needed.
> >
> > This may not be pretty but since DMAR fault is for unrecoverable faults,
> > they are rare and infrequent, should never happen on a healthy system.
> > Can we share one BSP vector for all DMARs? i.e. let dmar_fault()
> > handler search for the offending DMAR for fault reasons.
>
> It might not be pretty on the first thought, but it has a charm.
>
> You are right. If DMAR faults happen then there is an issue which is
> going to affect the machine badly anyway, so the extra search through
> the iommu list is not necessarily horrible and it does not matter at all
> whether the access is local or not.
>
> But there are two interrupts involved. The DMAR one and the SVM one.
Actually, there are three. perfmon for the newer SoCs with enhanced command
interface.

> And all of that DMAR/SVM code is built around the assumption that
> everything can be set up at boot time on the BSP.
>
> The DMAR case definitely is solvable by sharing the interrupt, see the
> uncompiled below. I still need to wrap my brain around the SVM part, but
> feel free to beat me to it or preferrably explain to me that it's not
> necessary at all :)
>
I don't think it is necessary for SVM (page request interrupts) nor PERMON
IRQs. Their affinity already set to their local node and spread out.
Unlike DMAR fault IRQ, they are performance critical. SVM PRQ needs to have
access to local queues (see potential bug below).

So on my dual socket system. IRQ 40-47 are DMAR-prq (SVM). I see:

root@984fee003c4f:~/jacob/irqstat# for i in {40..47}; do cat /sys/kernel/debug/irq/irqs/$i | grep -e eff;done
effectiv: 2
effectiv: 3
effectiv: 4
effectiv: 5
effectiv: 49
effectiv: 50
effectiv: 51
effectiv: 52
(reverse-i-search)`nom': sudo apt-get install g^Cme-settings-daemon
root@984fee003c4f:~/jacob/irqstat# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 1
node 0 size: 64198 MB
node 0 free: 60458 MB
node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 8
7 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
90 191
node 1 size: 64437 MB
node 1 free: 62793 MB
node distances:
node 0 1
0: 10 21
1: 21 10

I will give the code below a try, perhaps use a built-in device to
trigger DMAR faults.

On a separate note, just found out that page request queues are not
allocated locally, probably should be:

--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -67,7 +67,8 @@ int intel_svm_enable_prq(struct intel_iommu *iommu)
struct page *pages;
int irq, ret;

- pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, PRQ_ORDER);
+ pages = alloc_pages_node(iommu->node, GFP_KERNEL | __GFP_ZERO, PRQ_ORDER);
+
> ---
> --- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -1182,7 +1182,6 @@ static void free_iommu(struct intel_iomm
> iommu->pr_irq = 0;
> }
> free_irq(iommu->irq, iommu);
> - dmar_free_hwirq(iommu->irq);
> iommu->irq = 0;
> }
>
> @@ -1956,17 +1955,21 @@ void dmar_msi_mask(struct irq_data *data
> raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
> }
>
> -void dmar_msi_write(int irq, struct msi_msg *msg)
> +static void dmar_iommu_msi_write(struct intel_iommu *iommu, struct
> msi_msg *msg) {
> - struct intel_iommu *iommu = irq_get_handler_data(irq);
> int reg = dmar_msi_reg(iommu, irq);
> - unsigned long flag;
> + unsigned long flags;
>
> - raw_spin_lock_irqsave(&iommu->register_lock, flag);
> + raw_spin_lock_irqsave(&iommu->register_lock, flags);
> writel(msg->data, iommu->reg + reg + 4);
> writel(msg->address_lo, iommu->reg + reg + 8);
> writel(msg->address_hi, iommu->reg + reg + 12);
> - raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
> + raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> +}
> +
> +void dmar_msi_write(int irq, struct msi_msg *msg)
> +{
> + dmar_iommu_msi_write(irq_get_handler_data(irq), msg);
> }
>
> void dmar_msi_read(int irq, struct msi_msg *msg)
> @@ -2100,23 +2103,37 @@ irqreturn_t dmar_fault(int irq, void *de
>
> int dmar_set_interrupt(struct intel_iommu *iommu)
> {
> - int irq, ret;
> + static int dmar_irq;
> + int ret;
>
> - /*
> - * Check if the fault interrupt is already initialized.
> - */
> + /* Don't initialize it twice for a given iommu */
> if (iommu->irq)
> return 0;
>
> - irq = dmar_alloc_hwirq(iommu->seq_id, iommu->node, iommu);
> - if (irq > 0) {
> - iommu->irq = irq;
> + /*
> + * There is one shared interrupt for all IOMMUs to prevent vector
> + * exhaustion.
> + */
> + if (!dmar_irq) {
> + int irq = dmar_alloc_hwirq(iommu->seq_id, iommu->node,
> iommu); +
> + if (irq <= 0) {
> + pr_err("No free IRQ vectors\n");
> + return -EINVAL;
> + }
> + dmar_irq = irq;
> } else {
> - pr_err("No free IRQ vectors\n");
> - return -EINVAL;
> + struct msi_msg msg;
> +
> + /*
> + * Get the MSI message from the shared interrupt and
> write
> + * it to the iommu MSI registers.
> + */
> + dmar_msi_read(dmar_irq, &msg);
> + dmar_iommu_msi_write(iommu, &msg);
> }
>
> - ret = request_irq(irq, dmar_fault, IRQF_NO_THREAD, iommu->name,
> iommu);
> + ret = request_irq(dmar_irq, dmar_fault, IRQF_NO_THREAD |
> IRQF_SHARED, iommu->name, iommu); if (ret)
> pr_err("Can't request irq\n");
> return ret;


Thanks,

Jacob

2024-04-03 23:56:39

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH] Allocate DMAR fault interrupts locally

Hi Thomas,

On Sun, 24 Mar 2024 22:05:35 +0100, Thomas Gleixner <[email protected]>
wrote:

I sent out a tested patch based on your code, I made a few changes below.

> -- a/drivers/iommu/intel/dmar.c
> +++ b/drivers/iommu/intel/dmar.c
> @@ -1182,7 +1182,6 @@ static void free_iommu(struct intel_iomm
> iommu->pr_irq = 0;
> }
> free_irq(iommu->irq, iommu);
> - dmar_free_hwirq(iommu->irq);
> iommu->irq = 0;
> }
>
> @@ -1956,17 +1955,21 @@ void dmar_msi_mask(struct irq_data *data
> raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
> }
>
> -void dmar_msi_write(int irq, struct msi_msg *msg)
> +static void dmar_iommu_msi_write(struct intel_iommu *iommu, struct
> msi_msg *msg) {
> - struct intel_iommu *iommu = irq_get_handler_data(irq);
> int reg = dmar_msi_reg(iommu, irq);
> - unsigned long flag;
> + unsigned long flags;
>
> - raw_spin_lock_irqsave(&iommu->register_lock, flag);
> + raw_spin_lock_irqsave(&iommu->register_lock, flags);
> writel(msg->data, iommu->reg + reg + 4);
> writel(msg->address_lo, iommu->reg + reg + 8);
> writel(msg->address_hi, iommu->reg + reg + 12);
> - raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
> + raw_spin_unlock_irqrestore(&iommu->register_lock, flags);
> +}
> +
> +void dmar_msi_write(int irq, struct msi_msg *msg)
> +{
> + dmar_iommu_msi_write(irq_get_handler_data(irq), msg);
page request and perfmon irqs also use this function, where irq !=
iommu->irq. In this case, fault irq got overwritten by other dmar irq
sources. So we need to pass in irq as well.

> }
>
> void dmar_msi_read(int irq, struct msi_msg *msg)
> @@ -2100,23 +2103,37 @@ irqreturn_t dmar_fault(int irq, void *de
>
> int dmar_set_interrupt(struct intel_iommu *iommu)
> {
> - int irq, ret;
> + static int dmar_irq;
> + int ret;
>
> - /*
> - * Check if the fault interrupt is already initialized.
> - */
> + /* Don't initialize it twice for a given iommu */
> if (iommu->irq)
> return 0;
>
> - irq = dmar_alloc_hwirq(iommu->seq_id, iommu->node, iommu);
> - if (irq > 0) {
> - iommu->irq = irq;
> + /*
> + * There is one shared interrupt for all IOMMUs to prevent vector
> + * exhaustion.
> + */
> + if (!dmar_irq) {
> + int irq = dmar_alloc_hwirq(iommu->seq_id, iommu->node,
> iommu); +
> + if (irq <= 0) {
> + pr_err("No free IRQ vectors\n");
> + return -EINVAL;
> + }
> + dmar_irq = irq;
Need to store the fault irq.
iommu->irq = irq

> } else {
> - pr_err("No free IRQ vectors\n");
> - return -EINVAL;
> + struct msi_msg msg;
> +
> + /*
> + * Get the MSI message from the shared interrupt and
> write
> + * it to the iommu MSI registers.
> + */
> + dmar_msi_read(dmar_irq, &msg);
> + dmar_iommu_msi_write(iommu, &msg);
> }
>
> - ret = request_irq(irq, dmar_fault, IRQF_NO_THREAD, iommu->name,
> iommu);
> + ret = request_irq(dmar_irq, dmar_fault, IRQF_NO_THREAD |
> IRQF_SHARED, iommu->name, iommu); if (ret)
I added IRQF_NOBALANCING, ok? it makes things simple by not reprogramming
all the DMAR msi controls.

> pr_err("Can't request irq\n");
> return ret;


Thanks,

Jacob