When interrupt remapping is enabled, MSI interrupt messages must follow a
special format that the IOMMU can understand. Hence, when the HPET hard
lockup detector is used with interrupt remapping, it must also follow this
special format.
The IOMMU, given the information about a particular interrupt, already
knows how to populate the MSI message with this special format and the
corresponding entry in the interrupt remapping table. Given that this is a
special interrupt case, we want to avoid the interrupt subsystem. Add two
functions to create an entry for the HPET hard lockup detector. Perform
this process in two steps as described below.
When initializing the lockup detector, the function
hld_hpet_intremap_alloc_irq() permanently allocates a new entry in the
interrupt remapping table and populates it with the information the
IOMMU driver needs. In order to populate the table, the IOMMU needs to
know the HPET block ID as described in the ACPI table. Hence, add such
ID to the data of the hardlockup detector.
When the hardlockup detector is enabled, the function
hld_hpet_intremapactivate_irq() activates the recently created entry
in the interrupt remapping table via the modify_irte() functions. While
doing this, it specifies which CPU the interrupt must target via its APIC
ID. This function can be called every time the destination iD of the
interrupt needs to be updated; there is no need to allocate or remove
entries in the interrupt remapping table.
Cc: Ashok Raj <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jacob Pan <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Wincy Van <[email protected]>
Cc: Kate Stewart <[email protected]>
Cc: Philippe Ombredanne <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Jan Kiszka <[email protected]>
Cc: Lu Baolu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Suravee Suthikulpanit <[email protected]>
Cc: "Ravi V. Shankar" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
arch/x86/include/asm/hpet.h | 11 +++++++
arch/x86/kernel/hpet.c | 1 +
drivers/iommu/intel_irq_remapping.c | 49 +++++++++++++++++++++++++++++
3 files changed, 61 insertions(+)
diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index a82cbe17479d..811051fa7ade 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -119,6 +119,8 @@ struct hpet_hld_data {
u64 tsc_ticks_per_cpu;
u32 handling_cpu;
u32 enabled_cpus;
+ u8 blockid;
+ void *intremap_data;
struct msi_msg msi_msg;
unsigned long cpu_monitored_mask[0];
};
@@ -129,6 +131,15 @@ extern void hardlockup_detector_hpet_stop(void);
extern void hardlockup_detector_hpet_enable(unsigned int cpu);
extern void hardlockup_detector_hpet_disable(unsigned int cpu);
extern void hardlockup_detector_switch_to_perf(void);
+#ifdef CONFIG_IRQ_REMAP
+extern int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata);
+extern int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata);
+#else
+static inline int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
+{ return -ENODEV; }
+static inline int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata)
+{ return -ENODEV; }
+#endif /* CONFIG_IRQ_REMAP */
#else
static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
{ return NULL; }
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index dd3bb664a188..ddc9be81a075 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -202,6 +202,7 @@ struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
*/
temp = (u64)cfg << HPET_COUNTER_CLK_PERIOD_SHIFT;
hdata->ticks_per_second = hpet_get_ticks_per_sec(temp);
+ hdata->blockid = hpet_blockid;
return hdata;
}
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 2e61eaca7d7e..256466dd30cb 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -20,6 +20,7 @@
#include <asm/irq_remapping.h>
#include <asm/pci-direct.h>
#include <asm/msidef.h>
+#include <asm/hpet.h>
#include "irq_remapping.h"
@@ -1516,3 +1517,51 @@ int dmar_ir_hotplug(struct dmar_drhd_unit *dmaru, bool insert)
return ret;
}
+
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
+{
+ u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
+ struct intel_ir_data *data;
+
+ data = (struct intel_ir_data *)hdata->intremap_data;
+ data->irte_entry.dest_id = IRTE_DEST(destid);
+ return modify_irte(&data->irq_2_iommu, &data->irte_entry);
+}
+
+int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata)
+{
+ struct intel_ir_data *data;
+ struct irq_alloc_info info;
+ struct intel_iommu *iommu;
+ struct irq_cfg irq_cfg;
+ int index;
+
+ iommu = map_hpet_to_ir(hdata->blockid);
+ if (!iommu)
+ return -ENODEV;
+
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
+ if (!data)
+ return -ENOMEM;
+
+ down_read(&dmar_global_lock);
+ index = alloc_irte(iommu, 0, &data->irq_2_iommu, 1);
+ up_read(&dmar_global_lock);
+ if (index < 0)
+ return index;
+
+ info.type = X86_IRQ_ALLOC_TYPE_HPET;
+ info.hpet_id = hdata->blockid;
+
+ /* Vector is not relevant if NMI is the delivery mode */
+ irq_cfg.vector = 0;
+ irq_cfg.delivery_mode = dest_NMI;
+ intel_irq_remapping_prepare_irte(data, &irq_cfg, &info, index, 0);
+
+ hdata->intremap_data = data;
+ memcpy(&hdata->msi_msg, &data->msi_entry, sizeof(hdata->msi_msg));
+
+ return 0;
+}
+#endif /* CONFIG_HARDLOCKUP_DETECTOR_HPET */
--
2.17.1
On Thu, 23 May 2019, Ricardo Neri wrote:
> When interrupt remapping is enabled, MSI interrupt messages must follow a
> special format that the IOMMU can understand. Hence, when the HPET hard
> lockup detector is used with interrupt remapping, it must also follow this
> special format.
>
> The IOMMU, given the information about a particular interrupt, already
> knows how to populate the MSI message with this special format and the
> corresponding entry in the interrupt remapping table. Given that this is a
> special interrupt case, we want to avoid the interrupt subsystem. Add two
> functions to create an entry for the HPET hard lockup detector. Perform
> this process in two steps as described below.
>
> When initializing the lockup detector, the function
> hld_hpet_intremap_alloc_irq() permanently allocates a new entry in the
> interrupt remapping table and populates it with the information the
> IOMMU driver needs. In order to populate the table, the IOMMU needs to
> know the HPET block ID as described in the ACPI table. Hence, add such
> ID to the data of the hardlockup detector.
>
> When the hardlockup detector is enabled, the function
> hld_hpet_intremapactivate_irq() activates the recently created entry
> in the interrupt remapping table via the modify_irte() functions. While
> doing this, it specifies which CPU the interrupt must target via its APIC
> ID. This function can be called every time the destination iD of the
> interrupt needs to be updated; there is no need to allocate or remove
> entries in the interrupt remapping table.
And except for a few details all of this functionality exists today. There
is no need to hack HPET NMI specific knowledge into the irq remapping
driver. And of course to unbreak AMD you'd have to copy the same hackery
into the AMD interrupt remapping code.
More lines of duplicated duct tape code are better to maintain, right?
Sigh.
tglx
On Thu, 23 May 2019, Ricardo Neri wrote:
> When the hardlockup detector is enabled, the function
> hld_hpet_intremapactivate_irq() activates the recently created entry
> in the interrupt remapping table via the modify_irte() functions. While
> doing this, it specifies which CPU the interrupt must target via its APIC
> ID. This function can be called every time the destination iD of the
> interrupt needs to be updated; there is no need to allocate or remove
> entries in the interrupt remapping table.
Brilliant.
> +int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
> +{
> + u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
> + struct intel_ir_data *data;
> +
> + data = (struct intel_ir_data *)hdata->intremap_data;
> + data->irte_entry.dest_id = IRTE_DEST(destid);
> + return modify_irte(&data->irq_2_iommu, &data->irte_entry);
This calls modify_irte() which does at the very beginning:
raw_spin_lock_irqsave(&irq_2_ir_lock, flags);
How is that supposed to work from NMI context? Not to talk about the
other spinlocks which are taken in the subsequent call chain.
You cannot call in any of that code from NMI context.
The only reason why this never deadlocked in your testing is that nothing
else touched that particular iommu where the HPET hangs off concurrently.
But that's just pure luck and not design.
Thanks,
tglx
On Sun, 16 Jun 2019, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> > When the hardlockup detector is enabled, the function
> > hld_hpet_intremapactivate_irq() activates the recently created entry
> > in the interrupt remapping table via the modify_irte() functions. While
> > doing this, it specifies which CPU the interrupt must target via its APIC
> > ID. This function can be called every time the destination iD of the
> > interrupt needs to be updated; there is no need to allocate or remove
> > entries in the interrupt remapping table.
>
> Brilliant.
>
> > +int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
> > +{
> > + u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
> > + struct intel_ir_data *data;
> > +
> > + data = (struct intel_ir_data *)hdata->intremap_data;
> > + data->irte_entry.dest_id = IRTE_DEST(destid);
> > + return modify_irte(&data->irq_2_iommu, &data->irte_entry);
>
> This calls modify_irte() which does at the very beginning:
>
> raw_spin_lock_irqsave(&irq_2_ir_lock, flags);
>
> How is that supposed to work from NMI context? Not to talk about the
> other spinlocks which are taken in the subsequent call chain.
>
> You cannot call in any of that code from NMI context.
>
> The only reason why this never deadlocked in your testing is that nothing
> else touched that particular iommu where the HPET hangs off concurrently.
>
> But that's just pure luck and not design.
And just for the record. I warned you about that problem during the review
of an earlier version and told you to talk to IOMMU folks whether there is
a way to update the entry w/o running into that lock problem.
Can you tell my why am I actually reviewing patches and spending time on
this when the result is ignored anyway?
I also tried to figure out why you went away from the IPI broadcast
design. The only information I found is:
Changes vs. v1:
* Brought back the round-robin mechanism proposed in v1 (this time not
using the interrupt subsystem). This also requires to compute
expiration times as in v1 (Andi Kleen, Stephane Eranian).
Great that there is no trace of any mail from Andi or Stephane about this
on LKML. There is no problem with talking offlist about this stuff, but
then you should at least provide a rationale for those who were not part of
the private conversation.
Thanks,
tglcx
Hi,
On Mon, Jun 17, 2019 at 1:25 AM Thomas Gleixner <[email protected]> wrote:
>
> On Sun, 16 Jun 2019, Thomas Gleixner wrote:
> > On Thu, 23 May 2019, Ricardo Neri wrote:
> > > When the hardlockup detector is enabled, the function
> > > hld_hpet_intremapactivate_irq() activates the recently created entry
> > > in the interrupt remapping table via the modify_irte() functions. While
> > > doing this, it specifies which CPU the interrupt must target via its APIC
> > > ID. This function can be called every time the destination iD of the
> > > interrupt needs to be updated; there is no need to allocate or remove
> > > entries in the interrupt remapping table.
> >
> > Brilliant.
> >
> > > +int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
> > > +{
> > > + u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
> > > + struct intel_ir_data *data;
> > > +
> > > + data = (struct intel_ir_data *)hdata->intremap_data;
> > > + data->irte_entry.dest_id = IRTE_DEST(destid);
> > > + return modify_irte(&data->irq_2_iommu, &data->irte_entry);
> >
> > This calls modify_irte() which does at the very beginning:
> >
> > raw_spin_lock_irqsave(&irq_2_ir_lock, flags);
> >
> > How is that supposed to work from NMI context? Not to talk about the
> > other spinlocks which are taken in the subsequent call chain.
> >
> > You cannot call in any of that code from NMI context.
> >
> > The only reason why this never deadlocked in your testing is that nothing
> > else touched that particular iommu where the HPET hangs off concurrently.
> >
> > But that's just pure luck and not design.
>
> And just for the record. I warned you about that problem during the review
> of an earlier version and told you to talk to IOMMU folks whether there is
> a way to update the entry w/o running into that lock problem.
>
> Can you tell my why am I actually reviewing patches and spending time on
> this when the result is ignored anyway?
>
> I also tried to figure out why you went away from the IPI broadcast
> design. The only information I found is:
>
> Changes vs. v1:
>
> * Brought back the round-robin mechanism proposed in v1 (this time not
> using the interrupt subsystem). This also requires to compute
> expiration times as in v1 (Andi Kleen, Stephane Eranian).
>
> Great that there is no trace of any mail from Andi or Stephane about this
> on LKML. There is no problem with talking offlist about this stuff, but
> then you should at least provide a rationale for those who were not part of
> the private conversation.
>
Let me add some context to this whole patch series. The pressure on
the core PMU counters
is increasing as more people want to use them to measure always more
events. When the PMU
is overcommitted, i.e., more events than counters for them, there is
multiplexing. It comes
with an overhead that is too high for certain applications. One way to
avoid this is to lower the
multiplexing frequency, which is by default 1ms, but that comes with
loss of accuracy. Another approach is
to measure only a small number of events at a time and use multiple
runs, but then you lose consistent event
view. Another approach is to push for increasing the number of
counters. But getting new hardware
counters takes time. Short term, we can investigate what it would take
to free one cycle-capable
counter which is commandeered by the hard lockup detector on all X86
processors today. The functionality
of the watchdog, being able to get a crash dump on kernel deadlocks,
is important and we cannot simply
disable it. At scale, many bugs are exposed and thus machines
deadlock. Therefore, we want to investigate
what it would take to move the detector to another NMI-capable source,
such as the HPET because the
detector does not need high low granularity timer and interrupts only every 2s.
Furthermore, recent Intel erratum, e.g., the TSX issue forcing the TFA
code in perf_events, have increased the pressure
even more with only 3 generic counters left. Thus, it is time to look
at alternative ways of getting a hard lockup detector
(NMI watchdog) from another NMI source than the PMU. To that extent, I
have been discussing about alternatives.
Intel suggested using the HPET and Ricardo has been working on
producing this patch series. It is clear from your review
that the patches have issues, but I am hoping that they can be
resolved with constructive feedback knowing what the end goal is.
As for the round-robin changes, yes, we discussed this as an
alternative to avoid overloading CPU0 with handling
all of the work to broadcasting IPI to 100+ other CPUs.
Thanks.
Stephane,
On Mon, 17 Jun 2019, Stephane Eranian wrote:
> On Mon, Jun 17, 2019 at 1:25 AM Thomas Gleixner <[email protected]> wrote:
> > Great that there is no trace of any mail from Andi or Stephane about this
> > on LKML. There is no problem with talking offlist about this stuff, but
> > then you should at least provide a rationale for those who were not part of
> > the private conversation.
> >
> Let me add some context to this whole patch series. The pressure on the
> core PMU counters is increasing as more people want to use them to
> measure always more events. When the PMU is overcommitted, i.e., more
> events than counters for them, there is multiplexing. It comes with an
> overhead that is too high for certain applications. One way to avoid this
> is to lower the multiplexing frequency, which is by default 1ms, but that
> comes with loss of accuracy. Another approach is to measure only a small
> number of events at a time and use multiple runs, but then you lose
> consistent event view. Another approach is to push for increasing the
> number of counters. But getting new hardware counters takes time. Short
> term, we can investigate what it would take to free one cycle-capable
> counter which is commandeered by the hard lockup detector on all X86
> processors today. The functionality of the watchdog, being able to get a
> crash dump on kernel deadlocks, is important and we cannot simply disable
> it. At scale, many bugs are exposed and thus machines
> deadlock. Therefore, we want to investigate what it would take to move
> the detector to another NMI-capable source, such as the HPET because the
> detector does not need high low granularity timer and interrupts only
> every 2s.
I'm well aware about the reasons for this.
> Furthermore, recent Intel erratum, e.g., the TSX issue forcing the TFA
> code in perf_events, have increased the pressure even more with only 3
> generic counters left. Thus, it is time to look at alternative ways of
> getting a hard lockup detector (NMI watchdog) from another NMI source
> than the PMU. To that extent, I have been discussing about alternatives.
>
> Intel suggested using the HPET and Ricardo has been working on
> producing this patch series. It is clear from your review
> that the patches have issues, but I am hoping that they can be
> resolved with constructive feedback knowing what the end goal is.
Well, I gave constructive feedback from the very first version on. But
essential parts of that feedback have been ignored for whatever reasons.
> As for the round-robin changes, yes, we discussed this as an alternative
> to avoid overloading CPU0 with handling all of the work to broadcasting
> IPI to 100+ other CPUs.
I can understand the reason why you don't want to do that, but again, I
said way before this was tried that changing affinity from NMI context with
the IOMMU cannot work by just calling into the iommu code and it needs some
deep investigation with the IOMMU wizards whether a preallocated entry can
be used lockless (including the subsequently required flush).
The outcome is that the change was implemented by simply calling into
functions which I told that they cannot be called from NMI context.
Unless this problem is not solved and I doubt it can be solved after
talking to IOMMU people and studying manuals, the round robin mechanics in
the current form are not going to happen. We'd need a SMI based lockup
detector to debug the resulting livelock wreckage.
There are two possible options:
1) Back to the IPI approach
The probem with broadcast is that it sends IPIs one by one to each
online CPU, which sums up with a large number of CPUs.
The interesting question is why the kernel does not utilize the all
excluding self destination shorthand for this. The SDM is not giving
any information.
But there is a historic commit which is related and gives a hint:
commit e77deacb7b078156fcadf27b838a4ce1a65eda04
Author: Keith Owens <[email protected]>
Date: Mon Jun 26 13:59:56 2006 +0200
[PATCH] x86_64: Avoid broadcasting NMI IPIs
On some i386/x86_64 systems, sending an NMI IPI as a broadcast will
reset the system. This seems to be a BIOS bug which affects
machines where one or more cpus are not under OS control. It
occurs on HT systems with a version of the OS that is not compiled
without HT support. It also occurs when a system is booted with
max_cpus=n where 2 <= n < cpus known to the BIOS. The fix is to
always send NMI IPI as a mask instead of as a broadcast.
I can see the issue with max_cpus and that'd be trivial to solve by
disabling the HPET watchdog when maxcpus < num_present_cpus is on the
command line (That's broken anyway with respect to MCEs. See the stupid
dance we need to do for 'nosmt').
Though the HT part of the changelog is unparseable garbage but might be
a cryptic hint to the 'nosmt' mess. Though back then we did not have
a way to disable the siblings (or did we?). Weird...
It definitely would be worthwhile to experiment with that. if we could
use shorthands (also for regular IPIs) that would be a great
improvement in general and would nicely solve that NMI issue. Beware of
the dragons though.
2) Delegate round robin to irq_work
Use the same mechanism as perf for stuff which needs to be done outside
of NMI context.
That can solve the issue, but the drawback is that in case the NMI hits
a locked up interrupt disabled region the affinity stays on the same
CPU as the regular IPI which kicks the irq work is not going to be
handled. Might not be a big issue as we could detect the situation
that the IPI comes back to the same CPU. Not pretty and lots of nasty
corner case and race condition handling.
There is another issue with that round robin scheme as I pointed out to
Ricardo:
With a small watchdog threshold and tons of CPUs the time to switch
the affinity becomes short. That brings the HPET reprogramming (in
case of oneshot) into the SMI endagered zone and aside of that it
will eat performance as well because with lets say 1 second threshold
and 1000 CPUs we are going to flush the interrupt remapping
table/entry once per millisecond. No idea how big the penalty is, but
it's certainly not free.
One possible way out would be to use a combined approach of building
CPU groups (lets say 8) where one of the CPUs gets the NMI and IPIs the
other 7 and then round robins to the next group. Whether that's any
better, I can't tell.
Sorry that I can't come up with the magic cure and just can provide more
questions than answers.
Thanks,
tglx
On Mon, Jun 17, 2019 at 10:25:35AM +0200, Thomas Gleixner wrote:
> On Sun, 16 Jun 2019, Thomas Gleixner wrote:
> > On Thu, 23 May 2019, Ricardo Neri wrote:
> > > When the hardlockup detector is enabled, the function
> > > hld_hpet_intremapactivate_irq() activates the recently created entry
> > > in the interrupt remapping table via the modify_irte() functions. While
> > > doing this, it specifies which CPU the interrupt must target via its APIC
> > > ID. This function can be called every time the destination iD of the
> > > interrupt needs to be updated; there is no need to allocate or remove
> > > entries in the interrupt remapping table.
> >
> > Brilliant.
> >
> > > +int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
> > > +{
> > > + u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
> > > + struct intel_ir_data *data;
> > > +
> > > + data = (struct intel_ir_data *)hdata->intremap_data;
> > > + data->irte_entry.dest_id = IRTE_DEST(destid);
> > > + return modify_irte(&data->irq_2_iommu, &data->irte_entry);
> >
> > This calls modify_irte() which does at the very beginning:
> >
> > raw_spin_lock_irqsave(&irq_2_ir_lock, flags);
> >
> > How is that supposed to work from NMI context? Not to talk about the
> > other spinlocks which are taken in the subsequent call chain.
> >
> > You cannot call in any of that code from NMI context.
> >
> > The only reason why this never deadlocked in your testing is that nothing
> > else touched that particular iommu where the HPET hangs off concurrently.
> >
> > But that's just pure luck and not design.
>
> And just for the record. I warned you about that problem during the review
> of an earlier version and told you to talk to IOMMU folks whether there is
> a way to update the entry w/o running into that lock problem.
I think I misunderstood your feedback. You did mention issues on locking
between NMI and !NMI contexts. However, that was in the context of using the
generic irq code to do things such as set the affinity of the interrupt and
requesting an irq. I understood that I should instead program things directly.
I extrapolated this to the IOMMU driver in which I also added code directly
instead of using the existing layering.
Also, at the time, the question regarding the IOMMU, as I understood, was
whether it was posible to reserve a IOMMU remapping entry upfront. I believe
my patches achieve that, even if they are hacky and ugly, and have locking
issues. I see now that the locking issues are also part of the IOMMU
discussion. Perhaps that was also implicit.
>
> Can you tell my why am I actually reviewing patches and spending time on
> this when the result is ignored anyway?
Yes, Thomas, I should have checked first with the IOMMU maintainers
first on the issues in the paragraph above. It is not my intention to
waste your time; your feedback has been valuable and has contributed to
improve the code.
>
> I also tried to figure out why you went away from the IPI broadcast
> design. The only information I found is:
>
> Changes vs. v1:
>
> * Brought back the round-robin mechanism proposed in v1 (this time not
> using the interrupt subsystem). This also requires to compute
> expiration times as in v1 (Andi Kleen, Stephane Eranian).
>
> Great that there is no trace of any mail from Andi or Stephane about this
> on LKML. There is no problem with talking offlist about this stuff, but
> then you should at least provide a rationale for those who were not part of
> the private conversation.
Stephane has already commented the rationale.
Thanks and BR,
Ricardo
On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
Thomas Gleixner <[email protected]> wrote:
> Stephane,
>
> On Mon, 17 Jun 2019, Stephane Eranian wrote:
> > On Mon, Jun 17, 2019 at 1:25 AM Thomas Gleixner
> > <[email protected]> wrote:
> > > Great that there is no trace of any mail from Andi or Stephane
> > > about this on LKML. There is no problem with talking offlist
> > > about this stuff, but then you should at least provide a
> > > rationale for those who were not part of the private conversation.
> > >
> > Let me add some context to this whole patch series. The pressure on
> > the core PMU counters is increasing as more people want to use them
> > to measure always more events. When the PMU is overcommitted, i.e.,
> > more events than counters for them, there is multiplexing. It comes
> > with an overhead that is too high for certain applications. One way
> > to avoid this is to lower the multiplexing frequency, which is by
> > default 1ms, but that comes with loss of accuracy. Another approach
> > is to measure only a small number of events at a time and use
> > multiple runs, but then you lose consistent event view. Another
> > approach is to push for increasing the number of counters. But
> > getting new hardware counters takes time. Short term, we can
> > investigate what it would take to free one cycle-capable counter
> > which is commandeered by the hard lockup detector on all X86
> > processors today. The functionality of the watchdog, being able to
> > get a crash dump on kernel deadlocks, is important and we cannot
> > simply disable it. At scale, many bugs are exposed and thus
> > machines deadlock. Therefore, we want to investigate what it would
> > take to move the detector to another NMI-capable source, such as
> > the HPET because the detector does not need high low granularity
> > timer and interrupts only every 2s.
>
> I'm well aware about the reasons for this.
>
> > Furthermore, recent Intel erratum, e.g., the TSX issue forcing the
> > TFA code in perf_events, have increased the pressure even more with
> > only 3 generic counters left. Thus, it is time to look at
> > alternative ways of getting a hard lockup detector (NMI watchdog)
> > from another NMI source than the PMU. To that extent, I have been
> > discussing about alternatives.
> >
> > Intel suggested using the HPET and Ricardo has been working on
> > producing this patch series. It is clear from your review
> > that the patches have issues, but I am hoping that they can be
> > resolved with constructive feedback knowing what the end goal is.
>
> Well, I gave constructive feedback from the very first version on. But
> essential parts of that feedback have been ignored for whatever
> reasons.
>
> > As for the round-robin changes, yes, we discussed this as an
> > alternative to avoid overloading CPU0 with handling all of the work
> > to broadcasting IPI to 100+ other CPUs.
>
> I can understand the reason why you don't want to do that, but again,
> I said way before this was tried that changing affinity from NMI
> context with the IOMMU cannot work by just calling into the iommu
> code and it needs some deep investigation with the IOMMU wizards
> whether a preallocated entry can be used lockless (including the
> subsequently required flush).
>
> The outcome is that the change was implemented by simply calling into
> functions which I told that they cannot be called from NMI context.
>
> Unless this problem is not solved and I doubt it can be solved after
> talking to IOMMU people and studying manuals,
I agree. modify irte might be done with cmpxchg_double() but the queued
invalidation interface for IRTE cache flush is shared with DMA and
requires holding a spinlock for enque descriptors, QI tail update etc.
Also, reserving & manipulating IRTE slot for hpet via backdoor might not
be needed if the HPET PCI BDF (found in ACPI) can be utilized. But it
might need more work to add a fake PCI device for HPET.
> the round robin
> mechanics in the current form are not going to happen. We'd need a
> SMI based lockup detector to debug the resulting livelock wreckage.
>
> There are two possible options:
>
> 1) Back to the IPI approach
>
> The probem with broadcast is that it sends IPIs one by one to
> each online CPU, which sums up with a large number of CPUs.
>
> The interesting question is why the kernel does not utilize the
> all excluding self destination shorthand for this. The SDM is not
> giving any information.
>
> But there is a historic commit which is related and gives a hint:
>
> commit e77deacb7b078156fcadf27b838a4ce1a65eda04
> Author: Keith Owens <[email protected]>
> Date: Mon Jun 26 13:59:56 2006 +0200
>
> [PATCH] x86_64: Avoid broadcasting NMI IPIs
>
> On some i386/x86_64 systems, sending an NMI IPI as a
> broadcast will reset the system. This seems to be a BIOS bug which
> affects machines where one or more cpus are not under OS control. It
> occurs on HT systems with a version of the OS that is not
> compiled without HT support. It also occurs when a system is booted
> with max_cpus=n where 2 <= n < cpus known to the BIOS. The fix is to
> always send NMI IPI as a mask instead of as a broadcast.
>
> I can see the issue with max_cpus and that'd be trivial to solve
> by disabling the HPET watchdog when maxcpus < num_present_cpus is on
> the command line (That's broken anyway with respect to MCEs. See the
> stupid dance we need to do for 'nosmt').
>
> Though the HT part of the changelog is unparseable garbage but
> might be a cryptic hint to the 'nosmt' mess. Though back then we did
> not have a way to disable the siblings (or did we?). Weird...
>
> It definitely would be worthwhile to experiment with that. if we
> could use shorthands (also for regular IPIs) that would be a great
> improvement in general and would nicely solve that NMI issue.
> Beware of the dragons though.
>
> 2) Delegate round robin to irq_work
>
> Use the same mechanism as perf for stuff which needs to be done
> outside of NMI context.
>
> That can solve the issue, but the drawback is that in case the
> NMI hits a locked up interrupt disabled region the affinity stays on
> the same CPU as the regular IPI which kicks the irq work is not going
> to be handled. Might not be a big issue as we could detect the
> situation that the IPI comes back to the same CPU. Not pretty and
> lots of nasty corner case and race condition handling.
>
> There is another issue with that round robin scheme as I pointed
> out to Ricardo:
>
> With a small watchdog threshold and tons of CPUs the time to
> switch the affinity becomes short. That brings the HPET reprogramming
> (in case of oneshot) into the SMI endagered zone and aside of that it
> will eat performance as well because with lets say 1 second
> threshold and 1000 CPUs we are going to flush the interrupt remapping
> table/entry once per millisecond. No idea how big the penalty
> is, but it's certainly not free.
>
> One possible way out would be to use a combined approach of
> building CPU groups (lets say 8) where one of the CPUs gets the NMI
> and IPIs the other 7 and then round robins to the next group. Whether
> that's any better, I can't tell.
>
> Sorry that I can't come up with the magic cure and just can provide
> more questions than answers.
>
> Thanks,
>
> tglx
>
>
>
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
[Jacob Pan]
On Wed, 19 Jun 2019, Jacob Pan wrote:
> On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
> Thomas Gleixner <[email protected]> wrote:
> >
> > Unless this problem is not solved and I doubt it can be solved after
> > talking to IOMMU people and studying manuals,
>
> I agree. modify irte might be done with cmpxchg_double() but the queued
> invalidation interface for IRTE cache flush is shared with DMA and
> requires holding a spinlock for enque descriptors, QI tail update etc.
>
> Also, reserving & manipulating IRTE slot for hpet via backdoor might not
> be needed if the HPET PCI BDF (found in ACPI) can be utilized. But it
> might need more work to add a fake PCI device for HPET.
What would PCI/BDF solve?
Thanks,
tglx
On Fri, 21 Jun 2019 17:33:28 +0200 (CEST)
Thomas Gleixner <[email protected]> wrote:
> On Wed, 19 Jun 2019, Jacob Pan wrote:
> > On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
> > Thomas Gleixner <[email protected]> wrote:
> > >
> > > Unless this problem is not solved and I doubt it can be solved
> > > after talking to IOMMU people and studying manuals,
> >
> > I agree. modify irte might be done with cmpxchg_double() but the
> > queued invalidation interface for IRTE cache flush is shared with
> > DMA and requires holding a spinlock for enque descriptors, QI tail
> > update etc.
> >
> > Also, reserving & manipulating IRTE slot for hpet via backdoor
> > might not be needed if the HPET PCI BDF (found in ACPI) can be
> > utilized. But it might need more work to add a fake PCI device for
> > HPET.
>
> What would PCI/BDF solve?
I was thinking if HPET is a PCI device then it can naturally
gain slots in IOMMU remapping table IRTEs via PCI MSI code. Then perhaps
it can use the IRQ subsystem to set affinity etc. w/o directly adding
additional helper functions in IRQ remapping code. I have not followed
all the discussions, just a thought.
On Fri, 21 Jun 2019 10:31:26 -0700
Jacob Pan <[email protected]> wrote:
> On Fri, 21 Jun 2019 17:33:28 +0200 (CEST)
> Thomas Gleixner <[email protected]> wrote:
>
> > On Wed, 19 Jun 2019, Jacob Pan wrote:
> > > On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
> > > Thomas Gleixner <[email protected]> wrote:
> > > >
> > > > Unless this problem is not solved and I doubt it can be solved
> > > > after talking to IOMMU people and studying manuals,
> > >
> > > I agree. modify irte might be done with cmpxchg_double() but the
> > > queued invalidation interface for IRTE cache flush is shared with
> > > DMA and requires holding a spinlock for enque descriptors, QI tail
> > > update etc.
> > >
> > > Also, reserving & manipulating IRTE slot for hpet via backdoor
> > > might not be needed if the HPET PCI BDF (found in ACPI) can be
> > > utilized. But it might need more work to add a fake PCI device for
> > > HPET.
> >
> > What would PCI/BDF solve?
> I was thinking if HPET is a PCI device then it can naturally
> gain slots in IOMMU remapping table IRTEs via PCI MSI code. Then
> perhaps it can use the IRQ subsystem to set affinity etc. w/o
> directly adding additional helper functions in IRQ remapping code. I
> have not followed all the discussions, just a thought.
>
I looked at the code again, seems the per cpu HPET code already taken
care of HPET MSI management. Why can't we use IR-HPET-MSI chip and
domain to allocate and set affinity etc.?
Most APIC timer has ARAT not enough per cpu HPET, so per cpu HPET is
not used mostly.
Jacob
On Fri, 21 Jun 2019, Jacob Pan wrote:
> On Fri, 21 Jun 2019 10:31:26 -0700
> Jacob Pan <[email protected]> wrote:
>
> > On Fri, 21 Jun 2019 17:33:28 +0200 (CEST)
> > Thomas Gleixner <[email protected]> wrote:
> >
> > > On Wed, 19 Jun 2019, Jacob Pan wrote:
> > > > On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
> > > > Thomas Gleixner <[email protected]> wrote:
> > > > >
> > > > > Unless this problem is not solved and I doubt it can be solved
> > > > > after talking to IOMMU people and studying manuals,
> > > >
> > > > I agree. modify irte might be done with cmpxchg_double() but the
> > > > queued invalidation interface for IRTE cache flush is shared with
> > > > DMA and requires holding a spinlock for enque descriptors, QI tail
> > > > update etc.
> > > >
> > > > Also, reserving & manipulating IRTE slot for hpet via backdoor
> > > > might not be needed if the HPET PCI BDF (found in ACPI) can be
> > > > utilized. But it might need more work to add a fake PCI device for
> > > > HPET.
> > >
> > > What would PCI/BDF solve?
> > I was thinking if HPET is a PCI device then it can naturally
> > gain slots in IOMMU remapping table IRTEs via PCI MSI code. Then
> > perhaps it can use the IRQ subsystem to set affinity etc. w/o
> > directly adding additional helper functions in IRQ remapping code. I
> > have not followed all the discussions, just a thought.
> >
> I looked at the code again, seems the per cpu HPET code already taken
> care of HPET MSI management. Why can't we use IR-HPET-MSI chip and
> domain to allocate and set affinity etc.?
> Most APIC timer has ARAT not enough per cpu HPET, so per cpu HPET is
> not used mostly.
Sure, we can use that, but that does not allow to move the affinity from
NMI context either. Same issue with the IOMMU as with the other hack.
Thanks,
tglx
On Fri, Jun 21, 2019 at 10:05:01PM +0200, Thomas Gleixner wrote:
> On Fri, 21 Jun 2019, Jacob Pan wrote:
> > On Fri, 21 Jun 2019 10:31:26 -0700
> > Jacob Pan <[email protected]> wrote:
> >
> > > On Fri, 21 Jun 2019 17:33:28 +0200 (CEST)
> > > Thomas Gleixner <[email protected]> wrote:
> > >
> > > > On Wed, 19 Jun 2019, Jacob Pan wrote:
> > > > > On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
> > > > > Thomas Gleixner <[email protected]> wrote:
> > > > > >
> > > > > > Unless this problem is not solved and I doubt it can be solved
> > > > > > after talking to IOMMU people and studying manuals,
> > > > >
> > > > > I agree. modify irte might be done with cmpxchg_double() but the
> > > > > queued invalidation interface for IRTE cache flush is shared with
> > > > > DMA and requires holding a spinlock for enque descriptors, QI tail
> > > > > update etc.
> > > > >
> > > > > Also, reserving & manipulating IRTE slot for hpet via backdoor
> > > > > might not be needed if the HPET PCI BDF (found in ACPI) can be
> > > > > utilized. But it might need more work to add a fake PCI device for
> > > > > HPET.
> > > >
> > > > What would PCI/BDF solve?
> > > I was thinking if HPET is a PCI device then it can naturally
> > > gain slots in IOMMU remapping table IRTEs via PCI MSI code. Then
> > > perhaps it can use the IRQ subsystem to set affinity etc. w/o
> > > directly adding additional helper functions in IRQ remapping code. I
> > > have not followed all the discussions, just a thought.
> > >
> > I looked at the code again, seems the per cpu HPET code already taken
> > care of HPET MSI management. Why can't we use IR-HPET-MSI chip and
> > domain to allocate and set affinity etc.?
> > Most APIC timer has ARAT not enough per cpu HPET, so per cpu HPET is
> > not used mostly.
>
> Sure, we can use that, but that does not allow to move the affinity from
> NMI context either. Same issue with the IOMMU as with the other hack.
If I understand Thomas' point correctly, the problem is having to take
lock in NMI context to update the IRTE for the HPET; both as in my hack
and in the generic irq code. The problem is worse when using the generic
irq code as there are several layers and several locks that need to be
handled.
Thanks and BR,
Ricardo
On Fri, 21 Jun 2019, Ricardo Neri wrote:
> On Fri, Jun 21, 2019 at 10:05:01PM +0200, Thomas Gleixner wrote:
> > On Fri, 21 Jun 2019, Jacob Pan wrote:
> > > >
> > > I looked at the code again, seems the per cpu HPET code already taken
> > > care of HPET MSI management. Why can't we use IR-HPET-MSI chip and
> > > domain to allocate and set affinity etc.?
> > > Most APIC timer has ARAT not enough per cpu HPET, so per cpu HPET is
> > > not used mostly.
> >
> > Sure, we can use that, but that does not allow to move the affinity from
> > NMI context either. Same issue with the IOMMU as with the other hack.
>
> If I understand Thomas' point correctly, the problem is having to take
> lock in NMI context to update the IRTE for the HPET; both as in my hack
> and in the generic irq code. The problem is worse when using the generic
> irq code as there are several layers and several locks that need to be
> handled.
It does not matter how many locks are involved. One is enough to wedge the
machine.
Thanks,
tglx
On Tue, Jun 18, 2019 at 01:08:06AM +0200, Thomas Gleixner wrote:
> Stephane,
>
> On Mon, 17 Jun 2019, Stephane Eranian wrote:
> > On Mon, Jun 17, 2019 at 1:25 AM Thomas Gleixner <[email protected]> wrote:
> > > Great that there is no trace of any mail from Andi or Stephane about this
> > > on LKML. There is no problem with talking offlist about this stuff, but
> > > then you should at least provide a rationale for those who were not part of
> > > the private conversation.
> > >
> > Let me add some context to this whole patch series. The pressure on the
> > core PMU counters is increasing as more people want to use them to
> > measure always more events. When the PMU is overcommitted, i.e., more
> > events than counters for them, there is multiplexing. It comes with an
> > overhead that is too high for certain applications. One way to avoid this
> > is to lower the multiplexing frequency, which is by default 1ms, but that
> > comes with loss of accuracy. Another approach is to measure only a small
> > number of events at a time and use multiple runs, but then you lose
> > consistent event view. Another approach is to push for increasing the
> > number of counters. But getting new hardware counters takes time. Short
> > term, we can investigate what it would take to free one cycle-capable
> > counter which is commandeered by the hard lockup detector on all X86
> > processors today. The functionality of the watchdog, being able to get a
> > crash dump on kernel deadlocks, is important and we cannot simply disable
> > it. At scale, many bugs are exposed and thus machines
> > deadlock. Therefore, we want to investigate what it would take to move
> > the detector to another NMI-capable source, such as the HPET because the
> > detector does not need high low granularity timer and interrupts only
> > every 2s.
>
> I'm well aware about the reasons for this.
>
> > Furthermore, recent Intel erratum, e.g., the TSX issue forcing the TFA
> > code in perf_events, have increased the pressure even more with only 3
> > generic counters left. Thus, it is time to look at alternative ways of
> > getting a hard lockup detector (NMI watchdog) from another NMI source
> > than the PMU. To that extent, I have been discussing about alternatives.
> >
> > Intel suggested using the HPET and Ricardo has been working on
> > producing this patch series. It is clear from your review
> > that the patches have issues, but I am hoping that they can be
> > resolved with constructive feedback knowing what the end goal is.
>
> Well, I gave constructive feedback from the very first version on. But
> essential parts of that feedback have been ignored for whatever reasons.
>
> > As for the round-robin changes, yes, we discussed this as an alternative
> > to avoid overloading CPU0 with handling all of the work to broadcasting
> > IPI to 100+ other CPUs.
>
> I can understand the reason why you don't want to do that, but again, I
> said way before this was tried that changing affinity from NMI context with
> the IOMMU cannot work by just calling into the iommu code and it needs some
> deep investigation with the IOMMU wizards whether a preallocated entry can
> be used lockless (including the subsequently required flush).
>
> The outcome is that the change was implemented by simply calling into
> functions which I told that they cannot be called from NMI context.
>
> Unless this problem is not solved and I doubt it can be solved after
> talking to IOMMU people and studying manuals, the round robin mechanics in
> the current form are not going to happen. We'd need a SMI based lockup
> detector to debug the resulting livelock wreckage.
>
> There are two possible options:
>
> 1) Back to the IPI approach
>
> The probem with broadcast is that it sends IPIs one by one to each
> online CPU, which sums up with a large number of CPUs.
>
> The interesting question is why the kernel does not utilize the all
> excluding self destination shorthand for this. The SDM is not giving
> any information.
>
> But there is a historic commit which is related and gives a hint:
>
> commit e77deacb7b078156fcadf27b838a4ce1a65eda04
> Author: Keith Owens <[email protected]>
> Date: Mon Jun 26 13:59:56 2006 +0200
>
> [PATCH] x86_64: Avoid broadcasting NMI IPIs
>
> On some i386/x86_64 systems, sending an NMI IPI as a broadcast will
> reset the system. This seems to be a BIOS bug which affects
> machines where one or more cpus are not under OS control. It
> occurs on HT systems with a version of the OS that is not compiled
> without HT support. It also occurs when a system is booted with
> max_cpus=n where 2 <= n < cpus known to the BIOS. The fix is to
> always send NMI IPI as a mask instead of as a broadcast.
>
> I can see the issue with max_cpus and that'd be trivial to solve by
> disabling the HPET watchdog when maxcpus < num_present_cpus is on the
> command line (That's broken anyway with respect to MCEs. See the stupid
> dance we need to do for 'nosmt').
>
> Though the HT part of the changelog is unparseable garbage but might be
> a cryptic hint to the 'nosmt' mess. Though back then we did not have
> a way to disable the siblings (or did we?). Weird...
>
> It definitely would be worthwhile to experiment with that. if we could
> use shorthands (also for regular IPIs) that would be a great
> improvement in general and would nicely solve that NMI issue. Beware of
> the dragons though.
>
> 2) Delegate round robin to irq_work
>
> Use the same mechanism as perf for stuff which needs to be done outside
> of NMI context.
>
> That can solve the issue, but the drawback is that in case the NMI hits
> a locked up interrupt disabled region the affinity stays on the same
> CPU as the regular IPI which kicks the irq work is not going to be
> handled. Might not be a big issue as we could detect the situation
> that the IPI comes back to the same CPU. Not pretty and lots of nasty
> corner case and race condition handling.
>
> There is another issue with that round robin scheme as I pointed out to
> Ricardo:
>
> With a small watchdog threshold and tons of CPUs the time to switch
> the affinity becomes short. That brings the HPET reprogramming (in
> case of oneshot) into the SMI endagered zone and aside of that it
> will eat performance as well because with lets say 1 second threshold
> and 1000 CPUs we are going to flush the interrupt remapping
> table/entry once per millisecond. No idea how big the penalty is, but
> it's certainly not free.
>
> One possible way out would be to use a combined approach of building
> CPU groups (lets say 8) where one of the CPUs gets the NMI and IPIs the
> other 7 and then round robins to the next group. Whether that's any
> better, I can't tell.
>
> Sorry that I can't come up with the magic cure and just can provide more
> questions than answers.
I am sorry it took this long to reply to this thread. I'd like to kindly
ask for your feedback on my proposed changes for the following iteration.
* Jacob and Ashok, who work on IOMMU stuff, agreed that updating the
interrupt remapping table in NMI context is not possible as we would
always fall into locking issues. Hence, as you suggest, I will defer
this to an irq_work (and add checks in case the interrupt affinity did
not change).
* You have said in the past that you would not like to have a
request_nmi() interface for x86. For !IRQ_REMAP this is not a problem
as the MSI message can be programmed directly in the hardlockup
detector. However, for IRQ_REMAP, I would need to use request_irq()
and possibly have an interrupt handler that does nothing. The NMI
handler is still needed. This looks a little awkward but it does
allow to use the existing IRQ subsystem and hierarchy.
* Since in x86 there is not a IRQF_NMI flag, the interrupt remapping
driver needs a way to identify the special case in which the entity
requesting the interrupt is the HPET hardlockup detector. This can be
done in the interrupt remapping driver .alloc() function. It would
look at the irq_alloc_info to determine if it is of type
IRQ_ALLOC_TYPE_HPET. If this is the case, the data of the associated
HPET channel is available. Such channel had been previously reserved
to be used by the HPET initalization code and added to the
IR-HPET-MSI domain. Once identified, the interrupt remapping driver
changes the delivery mode to NMI before creating the interrupt
remapping table entry. Furthermore, this will not break the AMD
driver as all it has to do is implement the same one-line change.
* In order to avoid HPET NMI interrupts that are too close together, as
you described earlier, a mixed mechanism can be used in which groups
of N CPUs receive IPIs (using your updated shorthands). The affinity of
the HPET interrupt would only target every (N+1)'th CPU.
I have a quick-and-dirty implementation of this but wanted to check with
you first if this sounds reasonable.
Thanks and BR,
Ricardo