2022-05-09 07:50:00

by Ricardo Neri

Subject: [PATCH v6 00/29] x86: Implement an HPET-based hardlockup detector

Hi,

This is the sixth version of this patchset. It again took me a while to
post this version as I have been working on other projects and implemented
major reworks in the series.

This work has gone through several versions. I have incorporated feedback
from Thomas Gleixner and others. Many of the patches have review tags (the
patches for arch/x86 have review tags from the trusted x86 reviewers).
I believe it is ready for review by the x86 maintainers. Thus, I have
dropped the RFC qualifier.

I would especially appreciate review from the IOMMU experts on the patches
for the AMD IOMMU (patches 11-13) and from the powerpc experts on the
refactoring of the generic hardlockup detector (patches 18-20, 24, 28).

Also, I observed that the error in the expected TSC value is largest right
after the system boots; afterwards, the error becomes very small. This
version therefore allows a larger error than is necessary in steady state.
I would also appreciate feedback on this particular point. Perhaps the
system could switch to this detector only after boot and free up the PMU
counters. Please see the Testing section for details.

Previous versions of the patches can be found in [1], [2], [3], [4], and
[5]. Also, thanks to Thomas' feedback, this version handles the changes to
the interrupt subsystem more cleanly, and therefore it is no longer
necessary to handle interrupt remapping separately [6] as in the last
version.

For easier review, please refer to the changelog below for a list of
unchanged, updated, and new patches.

== Problem statement ==

In CPU architectures that do not have a non-maskable interrupt (NMI)
watchdog, one can be constructed using any resource capable of issuing
NMIs. On x86, this is done using a counter (one per CPU) of the
Performance Monitoring Unit (PMU). PMU counters, however, are scarce and
are used mainly to profile workloads. Using them for the NMI watchdog is
overkill.

Moreover, since the counter that the NMI watchdog uses cannot be shared
with any other perf users, the current perf-event subsystem may return a
false positive when validating certain metric groups. Certain metric groups
may never get a chance to be scheduled [7], [8].

== Description of the detector ==

Unlike the perf-based hardlockup detector, this implementation is driven by
a single source of NMIs: an HPET channel. One of the CPUs that the
hardlockup detector monitors handles the HPET NMI. Upon reception, it
issues a shorthand IPI to the rest of the CPUs. All CPUs will get the NMI,
but only those online and being monitored will look for hardlockups.
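
In rough form, the NMI handler flow looks as follows (a simplified sketch
for illustration only; names such as is_hpet_hld_interrupt(), kick_timer(),
and inspect_for_hardlockups() follow the patches in this series, and NMI
registration details are omitted):

static int hld_nmi_handler(unsigned int type, struct pt_regs *regs)
{
	/* Only the handling CPU receives the HPET NMI directly. */
	if (is_hpet_hld_interrupt(hld_data)) {
		/* Forward the check to the other CPUs via a shorthand NMI IPI. */
		apic->send_IPI_allbutself(NMI_VECTOR);
		kick_timer(hld_data, !hld_data->has_periodic);
	}

	/* Every CPU, handling or IPI'd, looks for hardlockups if monitored. */
	if (cpumask_test_cpu(smp_processor_id(), hld_data->monitored_cpumask))
		inspect_for_hardlockups(regs);

	return NMI_HANDLED;
}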

The time-stamp counter (TSC) is used to determine whether the HPET caused
the NMI. When the HPET channel fires, we compute the value the TSC is
expected to have the next time it fires. Once it does, we read the actual
TSC value. If it falls within a small error window, we determine that the
HPET channel issued the NMI. I have found experimentally that the expected
TSC value consistently has an error of less than 0.2% (see further details
in the Testing section).

A side effect of using this detector is that, in a kernel configured with
NO_HZ_FULL, the CPUs specified in the nohz_full command-line argument would
also receive the NMI IPI (but they would not look for hardlockups). Thus,
this hardlockup detector is an opt-in feature.

This detector cannot be used on systems that disable the HPET, unless it is
force-enabled.
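
For example, opting in on a desktop system whose firmware disables the
HPET would look like this on the kernel command line (parameter spellings
as proposed in this series):

	nmi_watchdog=hpet hpet=force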

On AMD systems with interrupt remapping enabled, the detector can only be
used in APIC physical mode (see patch 12).

== Main updates in this version ==

Thomas Gleixner pointed out several design issues:

1) In previous versions, the HPET channel targeted individual CPUs in a
round-robin manner. This required changing the affinity of the NMI.
Doing this with interrupt remapping is not feasible since there is
not an atomic, lock-less way of updating an IRTE in the interrupt
remapping domain.
+ Solution: In this version, the affinity of the HPET timer never
changes while it is in operation. When a change in affinity is
needed (e.g., the handling CPU is going offline or stops being
monitored), the detector is disabled first.

2) In previous versions, I issued apic::send_IPI_mask_allbutself() to
trigger a hardlockup check on the monitored CPUs. This does not work
as the target cpumask and the contents of the APIC ICR would be
corrupted [9].
+ Solution: Only use this detector when IPI shorthands [10] are
enabled in the system.

3) In previous versions, I added checks in the interrupt remapping
drivers to identify the HPET hardlockup detector IRQ and changed its
delivery mode. This looked hacky [11].
+ Solution: As per suggestion from Thomas, I added an
X86_IRQ_ALLOC_AS_NMI to be used when allocating IRQs. Every
irq_domain in the IRQ hierarchy will configure the delivery mode
of the IRQ as specified in the IRQ config info.
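
With this in place, requesting an NMI for the HPET channel reduces to
passing the new flag when the IRQ is allocated. A sketch modeled on the
existing hpet_assign_irq() flow (field names may differ slightly from
patch 16):

static int hpet_hld_assign_nmi_irq(struct irq_domain *domain,
				   struct hpet_channel *hc, int dev_num)
{
	struct irq_alloc_info info;

	init_irq_alloc_info(&info, NULL);
	info.type	= X86_IRQ_ALLOC_TYPE_HPET;
	info.flags	= X86_IRQ_ALLOC_AS_NMI;
	info.data	= hc;
	info.devid	= hpet_dev_id(domain);
	info.hpet_index	= dev_num;

	return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &info);
}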

== Parts of this series ==

For clarity, patches are grouped as follows:

1) IRQ updates: Patches 1-13 refactor parts of the interrupt subsystem
and irq_chips to support a separate delivery mode for each IRQ. Thus,
specific IRQs can now be configured as NMIs as needed.

2) HPET updates: Patches 14-17 prepare the HPET code to accommodate the
new detector: they rework periodic programming, reserve and configure a
timer for the detector, and expose a few existing functions.

3) NMI watchdog: Patches 18-21 update the existing hardlockup detector
to uncouple it from perf, and introduce a new NMI handler category
intended to run after the NMI_LOCAL handlers.

4) New HPET-based hardlockup detector: Patches 22-25 implement the new
detector, including the heuristics to determine whether the HPET channel
caused an NMI and the boot parameter that enables the detector.

5) Hardlockup detector management: Patches 26-29 are a collection of
miscellaneous patches to determine when to use the HPET hardlockup
detector and stop it if necessary. They include an x86-specific shim
hardlockup detector that selects between the perf- and HPET-based
implementations (sketched below), and that switches back to the perf
implementation if the TSC becomes unstable.
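
As an illustration of the shim mentioned in 5), its dispatch looks roughly
like this (a simplified sketch; the names approximate patch 26 and are not
verbatim):

enum x86_hardlockup_detector {
	X86_HARDLOCKUP_DETECTOR_PERF,
	X86_HARDLOCKUP_DETECTOR_HPET,
};

static enum x86_hardlockup_detector detector_type;

int watchdog_nmi_enable(unsigned int cpu)
{
	/* Selected at boot; may fall back to perf if the TSC is unstable. */
	switch (detector_type) {
	case X86_HARDLOCKUP_DETECTOR_PERF:
		hardlockup_detector_perf_enable();
		break;
	case X86_HARDLOCKUP_DETECTOR_HPET:
		hardlockup_detector_hpet_enable(cpu);
		break;
	}

	return 0;
}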

== Testing ==

Tests were conducted on the master branch of the tip tree; the patches,
as tested, are available here:

[email protected]:ricardon/tip.git rneri/hpet-wdt-v6

+++ Tests for functional correctness

I tested this series in a variety of server parts: Intel's Sapphire Rapids,
Cooperlake, Icelake, Cascadelake, Snowridge, Skylake, Haswell, Broadwell,
and IvyTown as well as AMD's Rome.

I also tested the series in desktop parts such as Alderlake and Haswell,
but had to use hpet=force in the command line.

On these systems, the detector works with and without interrupt remapping,
in both APIC physical and logical destination modes.

I used the test_lockup kernel module to ensure that the hardlockups were
detected:

$ echo 10 > /proc/sys/kernel/watchdog_thresh
$ modprobe test_lockup time_secs=20 iterations=1 state=R disable_irq

Warnings in dmesg indicate that the hardlockup was detected.

+++ Characterization of the actual TSC value with respect to its expected value

(NOTE: This same data is also presented in patch 23)

Let tsc_delta be the difference between the value the TSC has now and the
value it will have when the next HPET channel interrupt happens. Define the
error window as a percentage of tsc_delta.
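
For example, with watchdog_thresh = 10s and an illustrative 2 GHz TSC
(tsc_khz = 2,000,000), and using the ~0.2% half-window that patch 23
implements as a right shift by 9 bits:

	tsc_delta      = 10 * 2,000,000 * 1000 = 2 * 10^10 TSC ticks
	tsc_delta >> 9 ~= 3.9 * 10^7 ticks     ~= 19.5 ms (~0.195% of 10s)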

Below is a table that characterizes the error in the expected TSC value
when the HPET channel fires in a subset of the systems I tested. It
presents the error as a percentage of tsc_delta and in microseconds.

The table summarizes the error of 4096 interrupts of the HPET channel
collected after the system has been up for 5 minutes as well as since boot.

The maximum observed error on any system is 0.045%. When the error since
boot is considered, the maximum observed error is 0.198%.

To find the most common error value (i.e., the mode), the collected data
is grouped into buckets of 0.000001 percentage points of the error and
10ns, respectively. The most common error on any system is 0.01317%.

                      watchdog_thresh=1s    watchdog_thresh=10s   watchdog_thresh=60s
Error wrt expected
TSC value                %        us            %        us           %        us

AMD EPYC 7742 64-Core Processor
Abs max since boot    0.04517   451.74      0.00171   171.04      0.00034    201.89
Abs max               0.04517   451.74      0.00171   171.04      0.00034    201.89
Mode                  0.00002     0.18      0.00002     2.07     -0.00003    -19.20

Intel(R) Xeon(R) CPU E7-8890 - INTEL_FAM6_HASWELL_X
Abs max since boot    0.00811    81.15      0.00462   462.40      0.00014     81.65
Abs max               0.00811    81.15      0.00084    84.31      0.00014     81.65
Mode                 -0.00422   -42.16     -0.00043   -42.50     -0.00007    -40.40

Intel(R) Xeon(R) Platinum 8170M CPU @ 2.10GHz - INTEL_FAM6_SKYLAKE_X
Abs max since boot    0.10530  1053.04      0.01324  1324.27      0.00407   2443.25
Abs max               0.01166   116.59      0.00114   114.11      0.00024    143.47
Mode                 -0.01023  -102.32     -0.00103  -102.44     -0.00022   -132.38

Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz - INTEL_FAM6_BROADWELL_X
Abs max since boot    0.00010    99.34      0.00099    98.83      0.00016     97.50
Abs max               0.00010    99.34      0.00099    98.83      0.00016     97.50
Mode                 -0.00007   -74.29     -0.00074   -73.99     -0.00012    -73.12

Intel(R) Xeon(R) Gold 5318H CPU - INTEL_FAM6_COOPERLAKE_X
Abs max since boot    0.11262  1126.17      0.01109  1109.17      0.00409   2455.73
Abs max               0.01073   107.31      0.00109   109.02      0.00019    115.34
Mode                 -0.00953   -95.26     -0.00094   -93.63     -0.00015    -90.42

Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz - INTEL_FAM6_ICELAKE_X
Abs max since boot    0.19853  1985.30      0.00784   783.53     -0.00017   -104.77
Abs max               0.01550   155.02      0.00158   157.56      0.00020    117.74
Mode                 -0.01317  -131.65     -0.00136  -136.42     -0.00018   -105.06

Thanks and BR,
Ricardo

== Changelog ==

Changes since v5:
+ Unchanged patches: 14, 15, 18, 19, 24, 25, 28
+ Updated patches: 2, 8, 17, 21-23, 26, 29
+ New patches: 1, 3-7, 9-16, 20, 27

* Added support in the interrupt subsystem to allocate IRQs with NMI mode.
(patches 1-13)
* Only enable the detector when IPI shorthands are enabled in the system.
(patch 22)
* Delayed the initialization of the hardlockup detector until after
calling smp_init(). Only then do we know whether IPI shorthands have been
enabled. (patch 20)
* Removed code to periodically update the affinity of the HPET NMI to
rotate among packages or groups of packages.
* Removed logic to partition the monitored CPUs into groups. All CPUs in
all packages receive IPIs.
* Removed irq_work to change the affinity of the HPET channel interrupt.
* Updated the redirection hint in the Intel IOMMU IRTE to match the
destination mode. (patch 7)
* Correctly added support for NMI delivery mode in the AMD IOMMU.
(patches 11-13)
* Restart the NMI watchdog after refining tsc_khz. (patch 27)
* Added a check for the allowed maximum frequency of the HPET. (patch 17)
* Added hpet_hld_free_timer() to properly free the reserved HPET channel
if the initialization is not completed. (patch 17)
* Reworked the periodic setting of the HPET channel. Rather than changing
it every time the channel is disabled or enabled, do it only once. While
at it, wrap the code in an initial setup function. (patch 22)
* Implemented hardlockup_detector_hpet_start() to be called when tsc_khz
is refined. (patch 22)
* Reduced the error window of the expected TSC value at the time of the
HPET channel expiration. (patch 23)
* Described better the heuristics used to determine if the HPET channel
caused the NMI. (patch 23)
* Added a table to characterize the error in the expected TSC value when
the HPET channel fires. (patch 23)
* Added watchdog_nmi_start() to be used when tsc_khz is recalibrated.
(patch 26)
* Always build the x86-specific hardlockup detector shim; not only
when the HPET-based detector is selected. (patch 26)
* Relocated the declaration of hardlockup_detector_switch_to_perf() to
x86/nmi.h. It does not depend on HPET.
* Use time_in_range64() to compare the actual TSC value vs the expected
value. This makes it more readable. (patch 22)
* Dropped pointless X86_64 || X86_32 check in Kconfig. (patch 26)

Changes since v4:
* Added commentary on the tests performed on this feature. (Andi)
* Added a stub version of hardlockup_detector_switch_to_perf() for
!CONFIG_HPET_TIMER. (lkp)
* Use switch to select the type of x86 hardlockup detector. (Andi)
* Renamed a local variable in update_ticks_per_group(). (Andi)
* Made this hardlockup detector available to X86_32.
* Reworked logic to kick the HPET timer to remove a local variable.
(Andi)
* Added a comment on what type of timer channel will be assigned to the
detector. (Andi)
* Reworded help comment in the X86_HARDLOCKUP_DETECTOR_HPET Kconfig
option. (Andi)
* Removed unnecessary switch to level interrupt mode when disabling the
timer. (Andi)
* Disabled the HPET timer to avoid a race between an incoming interrupt
and an update of the MSI destination ID. (Ashok)
* Renamed hpet_hardlockup_detector_get_timer() as hpet_hld_get_timer()
* Added commentary on an undocumented udelay() when programming an
HPET channel in periodic mode. (Ashok)
* Reworked code to use new enumeration apic_delivery_modes and reworked
MSI message composition fields [13].
* Partitioned monitored CPUs into groups. Each CPU in the group is
inspected for hardlockups using an IPI.
* Use a round-robin mechanism to update the affinity of the HPET timer.
Affinity is updated every watchdog_thresh seconds to target the
handling CPU of the group.
* Moved update of the HPET interrupt affinity to an irq_work. (Thomas
Gleixner).
* Updated expiration of the HPET timer and the expected value of the
TSC based on the number of groups of monitored CPUs.
* Renamed hpet_set_comparator() to hpet_set_comparator_periodic() to
remove decision logic for periodic case. (Thomas Gleixner)
* Reworked timer reservation to use Thomas' rework on HPET channel
management [13].
* Removed hard-coded channel number for the hardlockup detector.
* Provided more details on the sequence of HPET channel reservation.
(Thomas Gleixner)
* Only reserve a channel for the hardlockup detector if enabled via
kernel command line. The function reserving the channel is called from
hardlockup detector. (Thomas Gleixner)
* Dropped hpet_hld_data::enabled_cpus and instead use cpumask_weight().
* Renamed hpet_hld_data::cpu_monitored_mask to
hld_data_data.cpu_monitored_mask and converted it to cpumask_var_t.
* Flushed out any outstanding interrupt before enabling the HPET channel.
* Removed unnecessary MSI_DATA_LEVEL_ASSERT from the MSI message.
* Added comments in hardlockup_detector_nmi_handler() to explain how
CPUs are targeted for an IPI.
* Updated code to only issue an IPI when needed (i.e., there are CPUs in
the group other than the handling CPU).
* Reworked hardlockup_detector_hpet_init() for readability.
* Now reserve the cpumasks in the hardlockup detector code and not in the
generic HPET code.
* Handle the case of watchdog_thresh = 0 when disabling the detector.

Changes since v3:
* Fixed yet another bug in periodic programming of the HPET timer that
prevented the system from booting.
* Fixed computation of HPET frequency to use hpet_readl() only.
* Added a missing #include in the watchdog_hld_hpet.c
* Fixed various typos and grammar errors (Randy Dunlap)

Changes since v2:
* Added functionality to switch to the perf-based hardlockup
detector if the TSC becomes unstable (Thomas Gleixner).
* Brought back the round-robin mechanism proposed in v1 (this time not
using the interrupt subsystem). This also requires computing
expiration times as in v1 (Andi Kleen, Stephane Eranian).
* Fixed a bug in which using a periodic timer was not working (thanks
to Suravee Suthikulpanit!).
* In this version, I incorporate support for interrupt remapping in the
last 4 patches so that they can be reviewed separately if needed.
* Removed redundant documentation of functions (Thomas Gleixner).
* Added a new category of NMI handler, NMI_WATCHDOG, which executes after
NMI_LOCAL handlers (Andi Kleen).
* Updated handling of "nmi_watchdog" to support comma-separated
arguments.
* Undid split of the generic hardlockup detector into a separate file
(Thomas Gleixner).
* Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
select generic parts of the detector (Paul E. McKenney,
Thomas Gleixner).
* Removed use of struct cpumask in favor of a variable length array in
conjunction with kzalloc (Peter Zijlstra).
* Added CPU as argument to hardlockup_detector_hpet_enable()/disable()
(Thomas Gleixner).
* Removed unnecessary exports of function declarations, flags, and bit
fields (Thomas Gleixner).
* Removed unnecessary check for FSB support when reserving timer for the
detector (Thomas Gleixner).
* Separated TSC code from HPET code in kick_timer() (Thomas Gleixner).
* Reworked condition to check if the expected TSC value is within the
error margin to avoid conditional (Peter Zijlstra).
* Removed TSC error margin from struct hld_data; use global variable
instead (Peter Zijlstra).
* Removed previously introduced watchdog_get_allowed_cpumask*() and
reworked hardlockup_detector_hpet_enable()/disable() to not need
access to watchdog_allowed_mask (Thomas Gleixner).

Changes since v1:
* Removed reads to HPET registers at every NMI. Instead use the time-stamp
counter to infer the interrupt source (Thomas Gleixner, Andi Kleen).
* Do not target CPUs in a round-robin manner. Instead, the HPET timer
always targets the same CPU; other CPUs are monitored via an
interprocessor interrupt.
* Removed use of generic irq code to set interrupt affinity and NMI
delivery. Instead, configure the interrupt directly in HPET registers
(Thomas Gleixner).
* Removed the proposed ops structure for NMI watchdogs. Instead, split
the existing implementation into a generic library and perf-specific
infrastructure (Thomas Gleixner, Nicholas Piggin).
* Added an x86-specific shim hardlockup detector that selects between
HPET and perf infrastructures as needed (Nicholas Piggin).
* Removed locks taken in NMI and !NMI context. This was wrong and is no
longer needed (Thomas Gleixner).
* Fixed unconditional return NMI_HANDLED when the HPET timer is programmed
for FSB/MSI delivery (Peter Zijlstra).

[1]. https://lore.kernel.org/lkml/1528851463-21140-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[2]. https://lore.kernel.org/lkml/1551283518-18922-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[3]. https://lore.kernel.org/lkml/1557842534-4266-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[4]. https://lore.kernel.org/lkml/1558660583-28561-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[5]. https://lore.kernel.org/lkml/[email protected]/T/
[6]. https://lore.kernel.org/linux-iommu/[email protected]/T/
[7]. https://lore.kernel.org/lkml/[email protected]/
[8]. https://lore.kernel.org/lkml/[email protected]/
[9]. https://lore.kernel.org/lkml/[email protected]/T/#me7de1b4ff4a91166c1610a2883b1f77ffe8b6ddf
[10]. https://lore.kernel.org/all/[email protected]/t/
[11]. https://lore.kernel.org/linux-iommu/[email protected]/T/#mde9be6aca9119602e90e9293df9995aa056dafce

[12]. https://lore.kernel.org/r/[email protected]
      https://lore.kernel.org/all/[email protected]/t/
[13]. https://lore.kernel.org/lkml/[email protected]/

Ricardo Neri (29):
irq/matrix: Expose functions to allocate the best CPU for new vectors
x86/apic: Add irq_cfg::delivery_mode
x86/apic/msi: Set the delivery mode individually for each IRQ
x86/apic: Add the X86_IRQ_ALLOC_AS_NMI irq allocation flag
x86/apic/vector: Do not allocate vectors for NMIs
x86/apic/vector: Implement support for NMI delivery mode
iommu/vt-d: Clear the redirection hint when the destination mode is
physical
iommu/vt-d: Rework prepare_irte() to support per-IRQ delivery mode
iommu/vt-d: Set the IRTE delivery mode individually for each IRQ
iommu/vt-d: Implement minor tweaks for NMI irqs
iommu/amd: Expose [set|get]_dev_entry_bit()
iommu/amd: Enable NMIPass when allocating an NMI irq
iommu/amd: Compose MSI messages for NMI irqs in non-IR format
x86/hpet: Expose hpet_writel() in header
x86/hpet: Add helper function hpet_set_comparator_periodic()
x86/hpet: Prepare IRQ assignments to use the X86_ALLOC_AS_NMI flag
x86/hpet: Reserve an HPET channel for the hardlockup detector
watchdog/hardlockup: Define a generic function to detect hardlockups
watchdog/hardlockup: Decouple the hardlockup detector from perf
init/main: Delay initialization of the lockup detector after
smp_init()
x86/nmi: Add an NMI_WATCHDOG NMI handler category
x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"
watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot
parameter
x86/watchdog: Add a shim hardlockup detector
watchdog: Expose lockup_detector_reconfigure()
x86/tsc: Restart NMI watchdog after refining tsc_khz
x86/tsc: Switch to perf-based hardlockup detector if TSC become
unstable

.../admin-guide/kernel-parameters.txt | 8 +-
arch/x86/Kconfig.debug | 13 +
arch/x86/include/asm/hpet.h | 49 ++
arch/x86/include/asm/hw_irq.h | 5 +-
arch/x86/include/asm/irqdomain.h | 1 +
arch/x86/include/asm/nmi.h | 7 +
arch/x86/kernel/Makefile | 3 +
arch/x86/kernel/apic/apic.c | 2 +-
arch/x86/kernel/apic/vector.c | 48 ++
arch/x86/kernel/hpet.c | 162 ++++++-
arch/x86/kernel/nmi.c | 10 +
arch/x86/kernel/tsc.c | 8 +
arch/x86/kernel/watchdog_hld.c | 91 ++++
arch/x86/kernel/watchdog_hld_hpet.c | 458 ++++++++++++++++++
drivers/iommu/amd/amd_iommu.h | 3 +
drivers/iommu/amd/init.c | 4 +-
drivers/iommu/amd/iommu.c | 41 +-
drivers/iommu/intel/irq_remapping.c | 32 +-
include/linux/irq.h | 4 +
include/linux/nmi.h | 8 +-
init/main.c | 4 +-
kernel/Makefile | 2 +-
kernel/irq/matrix.c | 32 +-
kernel/watchdog.c | 12 +-
kernel/watchdog_hld.c | 50 +-
lib/Kconfig.debug | 4 +
26 files changed, 994 insertions(+), 67 deletions(-)
create mode 100644 arch/x86/kernel/watchdog_hld.c
create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

--
2.17.1



2022-05-09 08:19:46

by Ricardo Neri

Subject: [PATCH v6 08/29] iommu/vt-d: Rework prepare_irte() to support per-IRQ delivery mode

struct irq_cfg::delivery_mode specifies the delivery mode of each IRQ
separately. Configuring the delivery mode of an IRTE would require adding
a third argument to prepare_irte(). Instead, simply take a pointer to the
irq_cfg of the IRQ for which an IRTE is being configured. No functional
changes.

Cc: Andi Kleen <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: "Ravi V. Shankar" <[email protected]>
Cc: Lu Baolu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v5:
* Only change the signature of prepare_irte(). A separate patch changes
the setting of the delivery_mode.

Changes since v4:
* None

Changes since v3:
* None

Changes since v2:
* None

Changes since v1:
* Introduced this patch.
---
drivers/iommu/intel/irq_remapping.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index d2764a71f91a..66d37186ec28 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1111,7 +1111,7 @@ void intel_irq_remap_add_device(struct dmar_pci_notify_info *info)
 	dev_set_msi_domain(&info->dev->dev, map_dev_to_ir(info->dev));
 }
 
-static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
+static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
 {
 	memset(irte, 0, sizeof(*irte));
 
@@ -1126,8 +1126,8 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
 	 */
 	irte->trigger_mode = 0;
 	irte->dlvry_mode = apic->delivery_mode;
-	irte->vector = vector;
-	irte->dest_id = IRTE_DEST(dest);
+	irte->vector = irq_cfg->vector;
+	irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
 
 	/*
 	 * When using the destination mode of physical APICID, only the
@@ -1278,8 +1278,7 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 {
 	struct irte *irte = &data->irte_entry;
 
-	prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
-
+	prepare_irte(irte, irq_cfg);
 	switch (info->type) {
 	case X86_IRQ_ALLOC_TYPE_IOAPIC:
 		/* Set source-id of interrupt request */
--
2.17.1


2022-05-09 08:30:08

by Ricardo Neri

Subject: [PATCH v6 11/29] iommu/amd: Expose [set|get]_dev_entry_bit()

These functions are used to check and set specific bits in a Device Table
Entry. For instance, they can be used to modify the setting of the NMIPass
field.

Currently, these functions are used only for ACPI-specified devices.
However, when an interrupt is allocated with NMI as its delivery mode, the
Device Table Entry needs to be modified accordingly in
irq_remapping_alloc().

As a first step, expose these two functions. No functional changes.
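
A sketch of the intended use in the allocation path (simplified for
illustration from patch 12; DEV_ENTRY_NMI_PASS is the existing bit index
defined in drivers/iommu/amd/init.c):

	/* In irq_remapping_alloc(), when the IRQ is requested as NMI: */
	if (info->flags & X86_IRQ_ALLOC_AS_NMI)
		set_dev_entry_bit(devid, DEV_ENTRY_NMI_PASS);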

Cc: Andi Kleen <[email protected]>
Cc: "Ravi V. Shankar" <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Suravee Suthikulpanit <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v5:
* Introduced this patch

Changes since v4:
* N/A

Changes since v3:
* N/A

Changes since v2:
* N/A

Changes since v1:
* N/A
---
drivers/iommu/amd/amd_iommu.h | 3 +++
drivers/iommu/amd/init.c | 4 ++--
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 1ab31074f5b3..9f3d1564c84e 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -128,4 +128,7 @@ static inline void amd_iommu_apply_ivrs_quirks(void) { }
 
 extern void amd_iommu_domain_set_pgtable(struct protection_domain *domain,
 					 u64 *root, int mode);
+
+extern void set_dev_entry_bit(u16 devid, u8 bit);
+extern int get_dev_entry_bit(u16 devid, u8 bit);
 #endif
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index b4a798c7b347..823e76b284f1 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -914,7 +914,7 @@ static void iommu_enable_gt(struct amd_iommu *iommu)
 }
 
 /* sets a specific bit in the device table entry. */
-static void set_dev_entry_bit(u16 devid, u8 bit)
+void set_dev_entry_bit(u16 devid, u8 bit)
 {
 	int i = (bit >> 6) & 0x03;
 	int _bit = bit & 0x3f;
@@ -922,7 +922,7 @@ static void set_dev_entry_bit(u16 devid, u8 bit)
 	amd_iommu_dev_table[devid].data[i] |= (1UL << _bit);
 }
 
-static int get_dev_entry_bit(u16 devid, u8 bit)
+int get_dev_entry_bit(u16 devid, u8 bit)
 {
 	int i = (bit >> 6) & 0x03;
 	int _bit = bit & 0x3f;
--
2.17.1


2022-05-09 09:55:17

by Ricardo Neri

Subject: [PATCH v6 23/29] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI

It is not possible to determine the source of a non-maskable interrupt
(NMI) on x86. When dealing with an HPET channel, the only direct method to
determine whether it caused an NMI would be to read the Interrupt Status
register.

However, reading HPET registers is slow and, therefore, not to be done
while in NMI context. Furthermore, status is not available if the HPET
channel is programmed to deliver an MSI interrupt.

An indirect manner to infer if an incoming NMI was caused by the HPET
channel of the detector is to use the time-stamp counter (TSC). Compute the
value that the TSC is expected to have at the next interrupt of the HPET
channel and compare it with the value it has when the interrupt does
happen. If the actual value falls within a small error window, assume
that the HPET channel of the detector is the source of the NMI.

Let tsc_delta be the difference between the value the TSC has now and the
value it will have when the next HPET channel interrupt happens. Define the
error window as a percentage of tsc_delta.

Below is a table that characterizes the error in the expected TSC value
when the HPET channel fires on a variety of systems. It presents the error
as a percentage of tsc_delta and in microseconds.

The table summarizes the error of 4096 interrupts of the HPET channel
collected after the system has been up for 5 minutes as well as since boot.

The maximum observed error on any system is 0.045%. When the error since
boot is considered, the maximum observed error is 0.198%.

To find the most common error value, the collected data is grouped into
buckets of 0.000001 percentage points of the error and 10ns, respectively.
The most common error on any system is 0.01317%.

Allow a maximum error that is twice as big as the maximum error observed
in these experiments: 0.4%.

                      watchdog_thresh=1s    watchdog_thresh=10s   watchdog_thresh=60s
Error wrt expected
TSC value                %        us            %        us           %        us

AMD EPYC 7742 64-Core Processor
Abs max since boot    0.04517   451.74      0.00171   171.04      0.00034    201.89
Abs max               0.04517   451.74      0.00171   171.04      0.00034    201.89
Mode                  0.00002     0.18      0.00002     2.07     -0.00003    -19.20

Intel(R) Xeon(R) CPU E7-8890 - INTEL_FAM6_HASWELL_X
Abs max since boot    0.00811    81.15      0.00462   462.40      0.00014     81.65
Abs max               0.00811    81.15      0.00084    84.31      0.00014     81.65
Mode                 -0.00422   -42.16     -0.00043   -42.50     -0.00007    -40.40

Intel(R) Xeon(R) Platinum 8170M - INTEL_FAM6_SKYLAKE_X
Abs max since boot    0.10530  1053.04      0.01324  1324.27      0.00407   2443.25
Abs max               0.01166   116.59      0.00114   114.11      0.00024    143.47
Mode                 -0.01023  -102.32     -0.00103  -102.44     -0.00022   -132.38

Intel(R) Xeon(R) CPU E5-2699A v4 - INTEL_FAM6_BROADWELL_X
Abs max since boot    0.00010    99.34      0.00099    98.83      0.00016     97.50
Abs max               0.00010    99.34      0.00099    98.83      0.00016     97.50
Mode                 -0.00007   -74.29     -0.00074   -73.99     -0.00012    -73.12

Intel(R) Xeon(R) Gold 5318H - INTEL_FAM6_COOPERLAKE_X
Abs max since boot    0.11262  1126.17      0.01109  1109.17      0.00409   2455.73
Abs max               0.01073   107.31      0.00109   109.02      0.00019    115.34
Mode                 -0.00953   -95.26     -0.00094   -93.63     -0.00015    -90.42

Intel(R) Xeon(R) Platinum 8360Y - INTEL_FAM6_ICELAKE_X
Abs max since boot    0.19853  1985.30      0.00784   783.53     -0.00017   -104.77
Abs max               0.01550   155.02      0.00158   157.56      0.00020    117.74
Mode                 -0.01317  -131.65     -0.00136  -136.42     -0.00018   -105.06

Cc: Andi Kleen <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: "Ravi V. Shankar" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Suggested-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
NOTE: The error characterization data is repeated here from the cover
letter.
---
Changes since v5:
* Reworked is_hpet_hld_interrupt() to reduce indentation.
* Use time_in_range64() to compare the actual TSC value vs the expected
value. This makes it more readable. (Tony)
* Reduced the error window of the expected TSC value at the time of the
HPET channel expiration.
* Described better the heuristics used to determine if the HPET channel
caused the NMI. (Tony)
* Added a table to characterize the error in the expected TSC value when
the HPET channel fires.
* Removed references to groups of monitored CPUs. Instead, use tsc_khz
directly.

Changes since v4:
* Compute the TSC expected value at the next HPET interrupt based on the
number of monitored packages and not the number of monitored CPUs.

Changes since v3:
* None

Changes since v2:
* Reworked condition to check if the expected TSC value is within the
error margin to avoid an unnecessary conditional. (Peter Zijlstra)
* Removed TSC error margin from struct hld_data; use a global variable
instead. (Peter Zijlstra)

Changes since v1:
* Introduced this patch.
---
arch/x86/include/asm/hpet.h | 3 ++
arch/x86/kernel/watchdog_hld_hpet.c | 54 +++++++++++++++++++++++++++--
2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index c88901744848..af0a504b5cff 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -113,6 +113,8 @@ static inline int is_hpet_enabled(void) { return 0; }
  * @channel:			HPET channel assigned to the detector
  * @channe_priv:		Private data of the assigned channel
  * @ticks_per_second:		Frequency of the HPET timer
+ * @tsc_next:			Estimated value of the TSC at the next
+ *				HPET timer interrupt
  * @irq:			IRQ number assigned to the HPET channel
  * @handling_cpu:		CPU handling the HPET interrupt
  * @monitored_cpumask:		CPUs monitored by the hardlockup detector
@@ -124,6 +126,7 @@ struct hpet_hld_data {
 	u32			channel;
 	struct hpet_channel	*channel_priv;
 	u64			ticks_per_second;
+	u64			tsc_next;
 	int			irq;
 	u32			handling_cpu;
 	cpumask_var_t		monitored_cpumask;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 9fc7ac2c5059..3effdbf29095 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -34,6 +34,7 @@
 
 static struct hpet_hld_data *hld_data;
 static bool hardlockup_use_hpet;
+static u64 tsc_next_error;
 
 extern struct static_key_false apic_use_ipi_shorthand;
 
@@ -59,10 +60,39 @@ static void __init setup_hpet_channel(struct hpet_hld_data *hdata)
  * Reprogram the timer to expire within watchdog_thresh seconds in the future.
  * If the timer supports periodic mode, it is not kicked unless @force is
  * true.
+ *
+ * Also, compute the expected value of the time-stamp counter at the time of
+ * expiration as well as a deviation from the expected value.
  */
 static void kick_timer(struct hpet_hld_data *hdata, bool force)
 {
-	u64 new_compare, count, period = 0;
+	u64 tsc_curr, tsc_delta, new_compare, count, period = 0;
+
+	tsc_curr = rdtsc();
+
+	/*
+	 * Compute the delta between the value of the TSC now and the value
+	 * it will have the next time the HPET channel fires.
+	 */
+	tsc_delta = watchdog_thresh * tsc_khz * 1000L;
+	hdata->tsc_next = tsc_curr + tsc_delta;
+
+	/*
+	 * Define an error window between the expected TSC value and the actual
+	 * value it will have the next time the HPET channel fires. Define this
+	 * error as percentage of tsc_delta.
+	 *
+	 * The systems that have been tested so far exhibit an error of 0.05%
+	 * of the expected TSC value once the system is up and running. Systems
+	 * that refine tsc_khz exhibit a larger initial error up to 0.2%.
+	 *
+	 * To be safe, allow a maximum error of ~0.4%. This error value results
+	 * from right-shifting tsc_delta by 8 positions. Shift by 9 positions
+	 * instead to calculate half the error. When the HPET channel fires,
+	 * check if the actual TSC value is in the range
+	 * [tsc_next - tsc_next_error, tsc_next + tsc_next_error]
+	 */
+	tsc_next_error = tsc_delta >> 9;
 
 	/* Kick the timer only when needed. */
 	if (!force && hdata->has_periodic)
@@ -126,12 +156,32 @@ static void enable_timer(struct hpet_hld_data *hdata)
  * is_hpet_hld_interrupt() - Check if an HPET timer caused the interrupt
  * @hdata:	A data structure with the timer instance to enable
  *
+ * It is not possible to determine directly the source of an NMI. Worse, the
+ * HPET channel of the detector is programmed for MSI delivery, which does
+ * not have a status bit that could be checked. Also, reading HPET registers
+ * is slow and should not be done while in NMI context.
+ *
+ * Instead, we just assume that any NMI delivered within a time window
+ * of when the HPET was expected to fire probably came from the HPET.
+ *
+ * The window is estimated using the TSC counter. Check the comments in
+ * kick_timer() for details on the size of the time window.
+ *
  * Returns:
  * True if the HPET watchdog timer caused the interrupt. False otherwise.
  */
 static bool is_hpet_hld_interrupt(struct hpet_hld_data *hdata)
 {
-	return false;
+	u64 tsc_curr, tsc_curr_min, tsc_curr_max;
+
+	if (smp_processor_id() != hdata->handling_cpu)
+		return false;
+
+	tsc_curr = rdtsc();
+	tsc_curr_min = tsc_curr - tsc_next_error;
+	tsc_curr_max = tsc_curr + tsc_next_error;
+
+	return time_in_range64(hdata->tsc_next, tsc_curr_min, tsc_curr_max);
 }
 
 /**
--
2.17.1


2022-05-09 10:00:11

by Ricardo Neri

Subject: [PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs

Vectors are meaningless when allocating IRQs with NMI as the delivery mode.
In such cases, skip the reservation of IRQ vectors. Do it in the lowest-
level functions, where the actual vector reservation takes place.

Since NMIs target specific CPUs, keep the functionality to find the best
CPU.

Cc: Andi Kleen <[email protected]>
Cc: "Ravi V. Shankar" <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Reviewed-by: Tony Luck <[email protected]>
Signed-off-by: Ricardo Neri <[email protected]>
---
Changes since v5:
* Introduced this patch.

Changes since v4:
* N/A

Changes since v3:
* N/A

Changes since v2:
* N/A

Changes since v1:
* N/A
---
arch/x86/kernel/apic/vector.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 838e220e8860..11f881f45cec 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -245,11 +245,20 @@ assign_vector_locked(struct irq_data *irqd, const struct cpumask *dest)
 	if (apicd->move_in_progress || !hlist_unhashed(&apicd->clist))
 		return -EBUSY;
 
+	if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
+		cpu = irq_matrix_find_best_cpu(vector_matrix, dest);
+		apicd->cpu = cpu;
+		vector = 0;
+		goto no_vector;
+	}
+
 	vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu);
 	trace_vector_alloc(irqd->irq, vector, resvd, vector);
 	if (vector < 0)
 		return vector;
 	apic_update_vector(irqd, vector, cpu);
+
+no_vector:
 	apic_update_irq_cfg(irqd, vector, cpu);
 
 	return 0;
@@ -321,12 +330,22 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
 	/* set_affinity might call here for nothing */
 	if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask))
 		return 0;
+
+	if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
+		cpu = irq_matrix_find_best_cpu_managed(vector_matrix, dest);
+		apicd->cpu = cpu;
+		vector = 0;
+		goto no_vector;
+	}
+
 	vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask,
 					  &cpu);
 	trace_vector_alloc_managed(irqd->irq, vector, vector);
 	if (vector < 0)
 		return vector;
 	apic_update_vector(irqd, vector, cpu);
+
+no_vector:
 	apic_update_irq_cfg(irqd, vector, cpu);
 	return 0;
 }
@@ -376,6 +395,10 @@ static void x86_vector_deactivate(struct irq_domain *dom, struct irq_data *irqd)
 	if (apicd->has_reserved)
 		return;
 
+	/* NMI IRQs do not have associated vectors; nothing to do. */
+	if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI)
+		return;
+
 	raw_spin_lock_irqsave(&vector_lock, flags);
 	clear_irq_vector(irqd);
 	if (apicd->can_reserve)
@@ -472,6 +495,10 @@ static void vector_free_reserved_and_managed(struct irq_data *irqd)
 	trace_vector_teardown(irqd->irq, apicd->is_managed,
 			      apicd->has_reserved);
 
+	/* NMI IRQs do not have associated vectors; nothing to do. */
+	if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI)
+		return;
+
 	if (apicd->has_reserved)
 		irq_matrix_remove_reserved(vector_matrix);
 	if (apicd->is_managed)
--
2.17.1


2022-05-14 02:35:56

by Thomas Gleixner

Subject: Re: [PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs

On Fri, May 13 2022 at 11:03, Ricardo Neri wrote:
> On Fri, May 06, 2022 at 11:12:20PM +0200, Thomas Gleixner wrote:
>> Why would a NMI ever end up in this code? There is no vector management
>> required and this find cpu exercise is pointless.
>
> But even if the NMI has a fixed vector, it is still necessary to determine
> which CPU will get the NMI. It is still necessary to determine what to
> write in the Destination ID field of the MSI message.
>
> irq_matrix_find_best_cpu() would find the CPU with the lowest number of
> managed vectors so that the NMI is directed to that CPU.

What's the point to send it to the CPU with the lowest number of
interrupts. It's not that this NMI happens every 50 microseconds.
We pick one online CPU and are done.

> In today's code, an NMI would end up here because we rely on the existing
> interrupt management infrastructure... Unless, the check is done the entry
> points as you propose.

Correct. We don't want to call into functions which are not designed for
NMIs.

>> > +
>> > + if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
>> > + cpu = irq_matrix_find_best_cpu_managed(vector_matrix, dest);
>> > + apicd->cpu = cpu;
>> > + vector = 0;
>> > + goto no_vector;
>> > + }
>>
>> This code can never be reached for a NMI delivery. If so, then it's a
>> bug.
>>
>> This all is special purpose for that particular HPET NMI watchdog use
>> case and we are not exposing this to anything else at all.
>>
>> So why are you sprinkling this NMI nonsense all over the place? Just
>> because? There are well defined entry points to all of this where this
>> can be fenced off.
>
> I put the NMI checks in these points because assign_vector_locked() and
> assign_managed_vector() are reached through multiple paths and these are
> the two places where the allocation of the vector is requested and the
> destination CPU is determined.
>
> I do observe this code being reached for an NMI, but that is because this
> code still does not know about NMIs... Unless the checks for NMI are put
> in the entry points as you pointed out.
>
> The intent was to refactor the code in a generic manner and not to focus
> only in the NMI watchdog. That would have looked hacky IMO.

We don't want to have more of this really. Supporting NMIs on x86 in a
broader way is simply not reasonable because there is only one NMI
vector and we have no sensible way to get to the cause of the NMI
without a massive overhead.

Even if we get multiple NMI vectors some shiny day, this will be
fundamentally different than regular interrupts and certainly not
exposed broadly. There will be 99.99% fixed vectors for simplicity sake.

>> + if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
>> + /*
>> + * NMIs have a fixed vector and need their own
>> + * interrupt chip so nothing can end up in the
>> + * regular local APIC management code except the
>> + * MSI message composing callback.
>> + */
>> + irqd->chip = &lapic_nmi_controller;
>> + /*
>> + * Don't allow affinity setting attempts for NMIs.
>> + * This cannot work with the regular affinity
>> + * mechanisms and for the intended HPET NMI
>> + * watchdog use case it's not required.
>
> But we do need the ability to set affinity, right? As stated above, we need
> to know what Destination ID to write in the MSI message or in the interrupt
> remapping table entry.
>
> It cannot be any CPU because only one specific CPU is supposed to handle the
> NMI from the HPET channel.
>
> We cannot hard-code a CPU for that because it may go offline (and ignore NMIs)
> or not be part of the monitored CPUs.
>
> Also, if lapic_nmi_controller.irq_set_affinity() is NULL, then irq_chips
> INTEL-IR, AMD-IR, those using msi_domain_set_affinity() need to check for NULL.
> They currently unconditionally call the parent irq_chip's irq_set_affinity().
> I see that there is a irq_chip_set_affinity_parent() function. Perhaps it can
> be used for this check?

Yes, this lacks obviously a NMI specific set_affinity callback and this
can be very trivial and does not have any of the complexity of interrupt
affinity assignment. First online CPU in the mask with a fallback to any
online CPU.
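
Something like the below should be sufficient (completely untested and for
illustration only; it assumes the fixed NMI vector handling from your
patch):

static int lapic_nmi_set_affinity(struct irq_data *irqd,
				  const struct cpumask *dest, bool force)
{
	unsigned int cpu;

	/* First online CPU in the mask, fallback to any online CPU. */
	cpu = cpumask_first_and(dest, cpu_online_mask);
	if (cpu >= nr_cpu_ids)
		cpu = cpumask_first(cpu_online_mask);

	apic_update_irq_cfg(irqd, 0, cpu);
	irq_data_update_effective_affinity(irqd, cpumask_of(cpu));

	return IRQ_SET_MASK_OK;
}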

I did not claim that this is complete. This was for illustration.

>> + */
>> + irqd_set_no_balance(irqd);
>
> This code does not set apicd->hw_irq_cfg.delivery_mode as NMI, right?
> I had to add that to make it work.

I assumed you can figure that out on your own :)

Thanks,

tglx

2022-05-14 04:27:07

by Ricardo Neri

Subject: Re: [PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs

On Fri, May 13, 2022 at 10:50:09PM +0200, Thomas Gleixner wrote:
> On Fri, May 13 2022 at 11:03, Ricardo Neri wrote:
> > On Fri, May 06, 2022 at 11:12:20PM +0200, Thomas Gleixner wrote:
> >> Why would a NMI ever end up in this code? There is no vector management
> >> required and this find cpu exercise is pointless.
> >
> > But even if the NMI has a fixed vector, it is still necessary to determine
> > which CPU will get the NMI. It is still necessary to determine what to
> > write in the Destination ID field of the MSI message.
> >
> > irq_matrix_find_best_cpu() would find the CPU with the lowest number of
> > managed vectors so that the NMI is directed to that CPU.
>
> What's the point to send it to the CPU with the lowest number of
> interrupts. It's not that this NMI happens every 50 microseconds.
> We pick one online CPU and are done.

Indeed, that is sensible.

>
> > In today's code, an NMI would end up here because we rely on the existing
> > interrupt management infrastructure... Unless, the check is done the entry
> > points as you propose.
>
> Correct. We don't want to call into functions which are not designed for
> NMIs.

Agreed.

>
> >> > +
> >> > + if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
> >> > + cpu = irq_matrix_find_best_cpu_managed(vector_matrix, dest);
> >> > + apicd->cpu = cpu;
> >> > + vector = 0;
> >> > + goto no_vector;
> >> > + }
> >>
> >> This code can never be reached for a NMI delivery. If so, then it's a
> >> bug.
> >>
> >> This all is special purpose for that particular HPET NMI watchdog use
> >> case and we are not exposing this to anything else at all.
> >>
> >> So why are you sprinkling this NMI nonsense all over the place? Just
> >> because? There are well defined entry points to all of this where this
> >> can be fenced off.
> >
> > I put the NMI checks in these points because assign_vector_locked() and
> > assign_managed_vector() are reached through multiple paths and these are
> > the two places where the allocation of the vector is requested and the
> > destination CPU is determined.
> >
> > I do observe this code being reached for an NMI, but that is because this
> > code still does not know about NMIs... Unless the checks for NMI are put
> > in the entry points as you pointed out.
> >
> > The intent was to refactor the code in a generic manner and not to focus
> > only in the NMI watchdog. That would have looked hacky IMO.
>
> We don't want to have more of this really. Supporting NMIs on x86 in a
> broader way is simply not reasonable because there is only one NMI
> vector and we have no sensible way to get to the cause of the NMI
> without a massive overhead.
>
> Even if we get multiple NMI vectors some shiny day, this will be
> fundamentally different than regular interrupts and certainly not
> exposed broadly. There will be 99.99% fixed vectors for simplicity sake.

Understood.

>
> >> + if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
> >> + /*
> >> + * NMIs have a fixed vector and need their own
> >> + * interrupt chip so nothing can end up in the
> >> + * regular local APIC management code except the
> >> + * MSI message composing callback.
> >> + */
> >> + irqd->chip = &lapic_nmi_controller;
> >> + /*
> >> + * Don't allow affinity setting attempts for NMIs.
> >> + * This cannot work with the regular affinity
> >> + * mechanisms and for the intended HPET NMI
> >> + * watchdog use case it's not required.
> >
> > But we do need the ability to set affinity, right? As stated above, we need
> > to know what Destination ID to write in the MSI message or in the interrupt
> > remapping table entry.
> >
> > It cannot be any CPU because only one specific CPU is supposed to handle the
> > NMI from the HPET channel.
> >
> > We cannot hard-code a CPU for that because it may go offline (and ignore NMIs)
> > or not be part of the monitored CPUs.
> >
> > Also, if lapic_nmi_controller.irq_set_affinity() is NULL, then irq_chips
> > INTEL-IR, AMD-IR, those using msi_domain_set_affinity() need to check for NULL.
> > They currently unconditionally call the parent irq_chip's irq_set_affinity().
> > I see that there is a irq_chip_set_affinity_parent() function. Perhaps it can
> > be used for this check?
>
> Yes, this lacks obviously a NMI specific set_affinity callback and this
> can be very trivial and does not have any of the complexity of interrupt
> affinity assignment. First online CPU in the mask with a fallback to any
> online CPU.

Why would we need a fallback to any online CPU? Shouldn't it fail if it cannot
find an online CPU in the mask?

>
> I did not claim that this is complete. This was for illustration.

In the reworked patch, may I add a Co-developed-by with your name and your SOB?

>
> >> + */
> >> + irqd_set_no_balance(irqd);
> >
> > This code does not set apicd->hw_irq_cfg.delivery_mode as NMI, right?
> > I had to add that to make it work.
>
> I assumed you can figure that out on your own :)

:)

Thanks and BR,
Ricardo

2022-05-14 08:37:17

by Thomas Gleixner

Subject: Re: [PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs

On Fri, May 13 2022 at 16:45, Ricardo Neri wrote:
> On Fri, May 13, 2022 at 10:50:09PM +0200, Thomas Gleixner wrote:
>> > Also, if lapic_nmi_controller.irq_set_affinity() is NULL, then irq_chips
>> > INTEL-IR, AMD-IR, those using msi_domain_set_affinity() need to check for NULL.
>> > They currently unconditionally call the parent irq_chip's irq_set_affinity().
>> > I see that there is a irq_chip_set_affinity_parent() function. Perhaps it can
>> > be used for this check?
>>
>> Yes, this lacks obviously a NMI specific set_affinity callback and this
>> can be very trivial and does not have any of the complexity of interrupt
>> affinity assignment. First online CPU in the mask with a fallback to any
>> online CPU.
>
> Why would we need a fallback to any online CPU? Shouldn't it fail if it cannot
> find an online CPU in the mask?

Might as well fail. Let me think about it.

>> I did not claim that this is complete. This was for illustration.
>
> In the reworked patch, may I add a Co-developed-by with your name and your SOB?

Suggested-by is good enough.

Thanks,

tglx