Hi Thomas and all,
This patch set aims to improve IRQ throughput on Intel Xeon by making use of
posted interrupts.
I presented this topic in a session at the LPC2023 IOMMU/VFIO/PCI MC:
https://lpc.events/event/17/sessions/172/#20231115
Background
==========
On modern x86 server SoCs, interrupt remapping (IR) is required and turned
on by default to support X2APIC. Two interrupt remapping modes can be supported
by IOMMU/VT-d:
- Remappable (host)
- Posted (guest only so far)
With remappable mode, the device-MSI-to-CPU process is a HW flow without system
software touch points. It roughly goes as follows:
1. Devices issue interrupt requests with writes to 0xFEEx_xxxx
2. The system agent accepts and remaps/translates the IRQ
3. Upon receiving the translation response, the system agent notifies the
destination CPU with the translated MSI
4. CPU's local APIC accepts interrupts into its IRR/ISR registers
5. Interrupt delivered through IDT (MSI vector)
The above process can be inefficient under high IRQ rates. The notifications in
step #3 are often unnecessary when the destination CPU is already overwhelmed
with handling bursts of IRQs. On some architectures, such as Intel Xeon, step #3
is also expensive and requires strong ordering w.r.t DMA. As a result, slower
IRQ rates can become a limiting factor for DMA I/O performance.
For example, on Intel Xeon Sapphire Rapids SoC, as more NVMe disks are attached
to the same socket, FIO (libaio engine) 4K block random read performance
per-disk drops quickly.
# of disks      2       4       8
-------------------------------------
IOPS(million)   1.991   1.136   0.834
(NVMe Gen 5 Samsung PM174x)
With posted mode enabled in interrupt remapping, the interrupt flow is divided
into two parts: posting (storing pending IRQ vector information in memory) and
CPU notification.
The above remappable IRQ flow becomes the following (1 and 2 unchanged):
3. Notifies the destination CPU with a notification vector
- IOMMU suppresses CPU notification
- IOMMU atomic swap/store IRQ status to memory-resident posted interrupt
descriptor (PID)
4. CPU's local APIC accepts the notification interrupt into its IRR/ISR
registers
5. Interrupt delivered through IDT (notification vector handler)
System SW allows new notifications by clearing the outstanding notification
(ON) bit in the PID.
(The above flow is not in Linux today since posted mode is only used for VMs.)
Note that the system software can now suppress CPU notifications at runtime as
needed. This allows the system software to coalesce the expensive CPU
notifications and in turn, improve IRQ throughput and DMA performance.
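The ON-bit handshake described above can be modeled in user space. The following is a minimal sketch, not the hardware or kernel implementation: `struct pid_model`, `model_post_msi()` and `model_clear_on()` are hypothetical names, and C11 atomics stand in for the IOMMU's atomic update of the descriptor.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of a posted interrupt descriptor (PID). */
struct pid_model {
	_Atomic uint64_t pir[4];	/* 256 pending-vector bits */
	_Atomic uint64_t control;	/* bit 0: ON, bit 1: SN */
};

#define MODEL_PID_ON	(1ULL << 0)
#define MODEL_PID_SN	(1ULL << 1)

/*
 * Model of the IOMMU side for one MSI: record the vector in PIR, then
 * notify the CPU only if no notification is outstanding (ON) or
 * suppressed (SN). Returns true when a notification would be sent.
 */
static bool model_post_msi(struct pid_model *pid, unsigned int vector)
{
	uint64_t old;

	atomic_fetch_or(&pid->pir[vector / 64], 1ULL << (vector % 64));

	old = atomic_load(&pid->control);
	do {
		if (old & (MODEL_PID_ON | MODEL_PID_SN))
			return false;	/* coalesced: CPU is not interrupted */
	} while (!atomic_compare_exchange_weak(&pid->control, &old,
					       old | MODEL_PID_ON));
	return true;
}

/* Model of the CPU side: re-arm notifications once the PIR was drained. */
static void model_clear_on(struct pid_model *pid)
{
	atomic_fetch_and(&pid->control, ~MODEL_PID_ON);
}
```

In this model, three MSIs posted back to back produce a single notification; the next notification is only generated after `model_clear_on()` has run.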
Consider the following scenario when MSIs arrive at a CPU in high-frequency
bursts:
Time ----------------------------------------------------------------------->
        ^  ^  ^        ^  ^  ^  ^        ^     ^
MSIs    A  B  C        D  E  F  G        H     I
RI      N  N' N'       N  N' N' N'       N     N
PI      N              N                 N     N
RI: remappable interrupt; PI: posted interrupt;
N: interrupt notification, N': superfluous interrupt notification
With remappable interrupt (row titled RI), every MSI generates a notification
event to the CPU.
With posted interrupts enabled in this patch set (row titled PI), CPU
notifications are coalesced during IRQ bursts. The superfluous notifications
(N') are eliminated in the flow above. We refer to this mechanism as Coalesced
Interrupt Delivery (CID).
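The timeline above can be replayed with a small counting model. This is only a sketch under a stated assumption: the `cpu_caught_up` flag (a hypothetical name) marks the MSIs before which the CPU has finished demultiplexing and cleared ON, which is inferred from where the RI row shows N rather than N'.

```c
#include <stdbool.h>
#include <stddef.h>

struct msi_event {
	char name;
	bool cpu_caught_up;	/* CPU cleared ON before this MSI arrived */
};

/* Bursts assumed from the diagram: A-C, D-G, then H and I on their own. */
static const struct msi_event model_timeline[9] = {
	{ 'A', true }, { 'B', false }, { 'C', false },
	{ 'D', true }, { 'E', false }, { 'F', false }, { 'G', false },
	{ 'H', true }, { 'I', true },
};

/* Remappable mode: every MSI is delivered as its own CPU notification. */
static size_t remappable_notifications(const struct msi_event *ev, size_t n)
{
	(void)ev;
	return n;
}

/* Posted mode: a notification is sent only while ON is clear. */
static size_t posted_notifications(const struct msi_event *ev, size_t n)
{
	bool on = false;
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (ev[i].cpu_caught_up)
			on = false;
		if (!on) {
			on = true;
			count++;
		}
	}
	return count;
}
```

For the nine MSIs in the diagram, the model counts nine notifications in remappable mode and four in posted mode, matching the RI and PI rows.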
Posted interrupts have existed for a long time. They have been used for
virtualization, where MSIs from directly assigned devices can be delivered to
the guest kernel without VMM intervention. On x86 Intel platforms, posted
interrupts can be used on the host as well. Only the host physical address of
the posted interrupt descriptor (PID) is used.
This patch set enables a new usage of posted interrupts on existing (and
new) hardware for host kernel device MSIs. It is referred to as Posted MSIs
throughout this patch set.
Performance (with this patch set):
==================================
Test #1. NVMe FIO
FIO libaio (million IOPS/sec/disk) Gen 5 NVMe Samsung PM174x disks on a single
socket, Intel Xeon Sapphire Rapids. Random read with 4k block size. NVMe IRQ
affinity is managed by the kernel with one vector per CPU.
#disks Before After %Gain
---------------------------------------------
8 0.834 1.943 132%
4 1.136 2.023 78%
Other observations:
- Increased block sizes show diminishing benefits, e.g. with 4 NVMe disks on
one x16 PCIe slot, the combined IOPS looks like:
Block Size Baseline PostedMSI
-------------------------------------
4K 6475 8778
8K 5727 5896
16k 2864 2900
32k 1546 1520
128k 397 398
- Submission/Completion latency (usec) also improved at 4K block size only
FIO report SLAT
---------------------------------------
Block Size Baseline postedMSI
4k 2177 2282
8k 4416 3967
16k 2950 3053
32k 3453 3505
128k 5911 5801
FIO report CLAT
---------------------------------------
Block Size Baseline postedMSI
4k 313 230
8k 352 343
16k 711 702
32k 1320 1343
128k 5146 5137
Test #2. Intel Data Streaming Accelerator
Two dedicated workqueues from two PCI root complex integrated endpoint
(RCIEP) devices, pin IRQ affinity of the two interrupts to a single CPU.
Before After %Gain
-------------------------------------
DSA memfill (mil IRQs/sec) 5.157 8.987 74%
DMA throughput has similar improvements.
At lower IRQ rates (< 1 million/second), neither performance benefits nor
regressions have been observed so far.
No-harm tests were also performed to ensure no performance regression on
workloads that do not have a high interrupt rate. These tests include:
- kernel compile time
- file copy
- FIO NVME random writes
Implementation choices:
======================
- Transparent to the device drivers
- System-wide option instead of per-device or per-IRQ opt-in, i.e. once enabled
all device MSIs are posted. The benefit is that we only need to change IR
irq_chip and domain layer. No change to PCI MSI.
Exceptions are: IOAPIC, HPET, and VT-d's own IRQs
- Limit the number of polling/demuxing loops per CPU notification event
- Only change Intel-IR in IRQ domain hierarchy VECTOR->INTEL-IR->PCI-MSI
- X86 Intel only so far, can be extended to other architectures with posted
interrupt support (ARM and AMD), RFC.
- Bare metal only, no posted interrupt capable virtual IOMMU.
Changes and implications (moving from remappable to posted mode)
===============================
1. All MSI vectors are multiplexed into a single notification vector for each
CPU. MSI vectors are then de-multiplexed by SW; there is no IDT delivery for MSIs.
2. Losing the following features compared to the remappable mode (AFAIK, none of
the below matters for device MSIs)
- Control of delivery mode, e.g. NMI for MSIs
- No logical destinations, posted interrupt destination is x2APIC
physical APIC ID
- No per vector stack, since all MSI vectors are multiplexed into one
Runtime changes
===============
The IRQ runtime behavior has changed with this patch. Here is a pseudo trace
comparison for 3 MSIs of different vectors arriving in a burst on the same CPU.
A system vector interrupt (e.g. timer) arrives randomly.
BEFORE:
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(timer)
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
AFTER:
interrupt /* Posted MSI notification vector */
    irq_enter()
        atomic_xchg(PIR)
        handler()
        handler()
        handler()
        pi_clear_on()
    apic_eoi()
    irq_exit()
interrupt(timer)
        process_softirq()
With posted MSI (as pointed out by Thomas Gleixner), both high-priority
interrupts (system interrupt vectors) and softIRQs are blocked during MSI vector
demux loop. Some can be timing sensitive.
Here are the options I have attempted or am still working on:
1. Use self-IPI to invoke the MSI vector handler, but that took away the
majority of the performance benefits.
2. Limit the # of demuxing loops, this is implemented in this patch. Note that
today, we already allow one low priority MSI to block system interrupts. System
vector can preempt MSI vectors without waiting for EOI but we have IRQ disabled
in the ISR.
Performance data (on DSA with MEMFILL) also shows that coalescing more than 3
loops yields diminishing benefits. Therefore, the max loops for coalescing is
set to 3 in this patch.
MaxLoop         IRQ/sec         Bandwidth (Mbps)
-------------------------------------------------------------------------
2               6157107         25219
3               6226611         25504
4               6557081         26857
5               6629683         27155
6               6662425         27289
3. Limit the time that system interrupts can be blocked (WIP).
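Option 2 above (bounding the number of demux passes per notification) can be sketched as a user-space model. This is a simplified illustration, not the code in the patches: `struct pid_model`, `model_handle_pending_pir()`, `model_posted_msi_notification()` and `MODEL_MAX_COALESCING_LOOPS` are hypothetical names, and `__builtin_ctzll()` assumes a GCC- or Clang-compatible compiler.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_COALESCING_LOOPS 3	/* limit quoted in the cover letter */

struct pid_model {
	_Atomic uint64_t pir[4];	/* 256 pending-vector bits */
	_Atomic uint64_t control;	/* bit 0: ON */
};

typedef void (*vector_handler_t)(unsigned int vector, void *ctx);

/* Example handler: counts how many vectors were dispatched. */
static void model_count_handler(unsigned int vector, void *ctx)
{
	(void)vector;
	(*(unsigned int *)ctx)++;
}

/* One demux pass: grab each PIR word atomically and dispatch its bits. */
static bool model_handle_pending_pir(struct pid_model *pid,
				     vector_handler_t handler, void *ctx)
{
	bool handled = false;

	for (unsigned int i = 0; i < 4; i++) {
		uint64_t pending;

		if (!atomic_load(&pid->pir[i]))
			continue;	/* skip the xchg on empty words */

		pending = atomic_exchange(&pid->pir[i], 0);
		while (pending) {
			unsigned int bit = (unsigned int)__builtin_ctzll(pending);

			pending &= pending - 1;	/* clear lowest set bit */
			handler(i * 64 + bit, ctx);
			handled = true;
		}
	}
	return handled;
}

/* Notification handler model: bounded coalescing, then re-arm. */
static void model_posted_msi_notification(struct pid_model *pid,
					   vector_handler_t handler, void *ctx)
{
	int loops = 0;

	while (loops++ < MODEL_MAX_COALESCING_LOOPS) {
		if (!model_handle_pending_pir(pid, handler, ctx))
			break;
	}

	atomic_fetch_and(&pid->control, ~1ULL);	/* clear ON: allow new notifications */

	/* Final pass for vectors posted just before ON was cleared. */
	model_handle_pending_pir(pid, handler, ctx);
}
```

The bound keeps a sustained burst from holding off other interrupts indefinitely in this model, while the final pass after clearing ON avoids stranding a vector that was posted while notifications were still coalesced.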
In addition, posted MSI uses atomic xchg from both CPU and IOMMU. Compared to
remappable mode, there may be additional cache line ownership contention over
PID. However, we have not observed performance regression at lower IRQ rates.
At high interrupt rate, posted mode always wins.
Testing:
========
The following tests have been performed and continue to be evaluated.
- IRQ affinity change, migration
- CPU offlining
- Multi vector coalescing
- Low IRQ rate, general no-harm test (performance regressions have not been
observed for low IRQ rate workloads)
- VM device assignment via VFIO
With the patch, a new entry in /proc/interrupts is added.
cat /proc/interrupts | grep PMN
PMN: 13868907 Posted MSI notification event
No change to the device MSI accounting.
A new INTEL-IR-POST irq_chip is visible at IRQ debugfs, e.g.
domain: IR-PCI-MSIX-0000:6f:01.0-12
hwirq: 0x8
chip: IR-PCI-MSIX-0000:6f:01.0
flags: 0x430
IRQCHIP_SKIP_SET_WAKE
IRQCHIP_ONESHOT_SAFE
parent:
domain: INTEL-IR-12-13
hwirq: 0x90000
chip: INTEL-IR-POST /* For posted MSIs */
flags: 0x0
parent:
domain: VECTOR
hwirq: 0x65
chip: APIC
Acknowledgment
==============
- Rajesh Sankaran and Ashok Raj for the original idea
- Thomas Gleixner for reviewing and guiding the upstream direction of PoC
patches, and for helping correct my many misunderstandings of the IRQ subsystem.
- Jie J Yan(Jeff), Sebastien Lemarie, and Dan Liang for performance evaluation
with NVMe and network workload
- Bernice Zhang and Scott Morris for functional validation
- Michael Prinke for helping me understand how VT-d HW works
- Sanjay Kumar for providing the DSA IRQ test suite
Changelogs (details in each patch):
V3:
- Add Intel flexible return and event delivery (FRED) support
- Fix potential double EOI bug
- Fix a bug in removing posted interrupt descriptor bitfields
V2:
- Code change logs are in individual patches.
- Use "Originally-by" and "Suggested-by" tags to clarify
credits/responsibilities.
- More performance evaluation done on FIO
4K rand read test. Four Samsung PM174x NVMe drives on a single x16 PCIe gen5
lane. Fixed CPU frequency at 2.7GHz (p1, highest non-turbo).
                 IOPS*  CPU%  sys%  user%  Ints/sec  IOPS/CPU     LAT**
AIO (before)     6231   55.5  39.7  15.8   5714721   112.2702703  328
AIO (after)      8936   71.5  51.5  20     7397543   124.979021   229
IOURING(before)  6880   43.7  30.3  13.4   6512402   157.4370709  149
IOURING(after)   8688   58.3  41.3  17     7625158   149.0222985  118
IOURING POLLEDQ  13100  100   85.1  14.9   8000      131          156
* x1000 4 drives combined
** 95% usec.
This patchset improves IOPS, IRQ throughput, and reduces latency for non-polled
queues.
V1 (since RFC)
- Removed mentioning of wishful features, IRQ preemption, separate and
full MSI vector space
- Refined MSI handler de-multiplexing loop based on suggestions from
Peter and Thomas. Reduced xchg() usage and code duplication
- Assign the new posted IR irq_chip only to device MSI/x, avoid changing
IO-APIC code
- Extract and use common code for preventing lost interrupt during
affinity change
- Added more test results to the cover letter
Thanks,
Jacob
Jacob Pan (12):
KVM: VMX: Move posted interrupt descriptor out of vmx code
x86/irq: Unionize PID.PIR for 64bit access w/o casting
x86/irq: Remove bitfields in posted interrupt descriptor
x86/irq: Add a Kconfig option for posted MSI
x86/irq: Reserve a per CPU IDT vector for posted MSIs
x86/irq: Set up per host CPU posted interrupt descriptors
x86/irq: Factor out calling ISR from common_interrupt
x86/irq: Install posted MSI notification handler
x86/irq: Factor out common code for checking pending interrupts
x86/irq: Extend checks for pending vectors to posted interrupts
iommu/vt-d: Make posted MSI an opt-in cmdline option
iommu/vt-d: Enable posted mode for device MSIs
.../admin-guide/kernel-parameters.txt | 1 +
arch/x86/Kconfig | 11 ++
arch/x86/entry/entry_fred.c | 2 +
arch/x86/include/asm/apic.h | 12 ++
arch/x86/include/asm/hardirq.h | 6 +
arch/x86/include/asm/idtentry.h | 6 +
arch/x86/include/asm/irq_remapping.h | 11 ++
arch/x86/include/asm/irq_vectors.h | 8 +-
arch/x86/include/asm/posted_intr.h | 118 ++++++++++++
arch/x86/kernel/apic/vector.c | 5 +-
arch/x86/kernel/cpu/common.c | 3 +
arch/x86/kernel/idt.c | 3 +
arch/x86/kernel/irq.c | 172 ++++++++++++++++--
arch/x86/kvm/vmx/posted_intr.c | 4 +-
arch/x86/kvm/vmx/posted_intr.h | 93 +---------
arch/x86/kvm/vmx/vmx.c | 3 +-
arch/x86/kvm/vmx/vmx.h | 2 +-
drivers/iommu/intel/irq_remapping.c | 113 +++++++++++-
drivers/iommu/irq_remapping.c | 9 +-
19 files changed, 463 insertions(+), 119 deletions(-)
create mode 100644 arch/x86/include/asm/posted_intr.h
--
2.25.1
To prepare for native usage of posted interrupts, move the PID declarations out
of VMX code so that they can be shared.
Acked-by: Sean Christopherson <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
arch/x86/include/asm/posted_intr.h | 88 ++++++++++++++++++++++++++++
arch/x86/kvm/vmx/posted_intr.h | 93 +-----------------------------
arch/x86/kvm/vmx/vmx.c | 1 +
arch/x86/kvm/vmx/vmx.h | 2 +-
4 files changed, 91 insertions(+), 93 deletions(-)
create mode 100644 arch/x86/include/asm/posted_intr.h
diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
new file mode 100644
index 000000000000..f0324c56f7af
--- /dev/null
+++ b/arch/x86/include/asm/posted_intr.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_POSTED_INTR_H
+#define _X86_POSTED_INTR_H
+
+#define POSTED_INTR_ON 0
+#define POSTED_INTR_SN 1
+
+#define PID_TABLE_ENTRY_VALID 1
+
+/* Posted-Interrupt Descriptor */
+struct pi_desc {
+ u32 pir[8]; /* Posted interrupt requested */
+ union {
+ struct {
+ /* bit 256 - Outstanding Notification */
+ u16 on : 1,
+ /* bit 257 - Suppress Notification */
+ sn : 1,
+ /* bit 271:258 - Reserved */
+ rsvd_1 : 14;
+ /* bit 279:272 - Notification Vector */
+ u8 nv;
+ /* bit 287:280 - Reserved */
+ u8 rsvd_2;
+ /* bit 319:288 - Notification Destination */
+ u32 ndst;
+ };
+ u64 control;
+ };
+ u32 rsvd[6];
+} __aligned(64);
+
+static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
+{
+ return test_and_set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
+{
+ return test_and_clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
+{
+ return test_and_clear_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
+{
+ return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
+}
+
+static inline bool pi_is_pir_empty(struct pi_desc *pi_desc)
+{
+ return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
+}
+
+static inline void pi_set_sn(struct pi_desc *pi_desc)
+{
+ set_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_set_on(struct pi_desc *pi_desc)
+{
+ set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_clear_on(struct pi_desc *pi_desc)
+{
+ clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_clear_sn(struct pi_desc *pi_desc)
+{
+ clear_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_on(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_sn(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+#endif /* _X86_POSTED_INTR_H */
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 26992076552e..6b2a0226257e 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -1,98 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __KVM_X86_VMX_POSTED_INTR_H
#define __KVM_X86_VMX_POSTED_INTR_H
-
-#define POSTED_INTR_ON 0
-#define POSTED_INTR_SN 1
-
-#define PID_TABLE_ENTRY_VALID 1
-
-/* Posted-Interrupt Descriptor */
-struct pi_desc {
- u32 pir[8]; /* Posted interrupt requested */
- union {
- struct {
- /* bit 256 - Outstanding Notification */
- u16 on : 1,
- /* bit 257 - Suppress Notification */
- sn : 1,
- /* bit 271:258 - Reserved */
- rsvd_1 : 14;
- /* bit 279:272 - Notification Vector */
- u8 nv;
- /* bit 287:280 - Reserved */
- u8 rsvd_2;
- /* bit 319:288 - Notification Destination */
- u32 ndst;
- };
- u64 control;
- };
- u32 rsvd[6];
-} __aligned(64);
-
-static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
-{
- return test_and_set_bit(POSTED_INTR_ON,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
-{
- return test_and_clear_bit(POSTED_INTR_ON,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
-{
- return test_and_clear_bit(POSTED_INTR_SN,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
-{
- return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
-}
-
-static inline bool pi_is_pir_empty(struct pi_desc *pi_desc)
-{
- return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
-}
-
-static inline void pi_set_sn(struct pi_desc *pi_desc)
-{
- set_bit(POSTED_INTR_SN,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_set_on(struct pi_desc *pi_desc)
-{
- set_bit(POSTED_INTR_ON,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_clear_on(struct pi_desc *pi_desc)
-{
- clear_bit(POSTED_INTR_ON,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_clear_sn(struct pi_desc *pi_desc)
-{
- clear_bit(POSTED_INTR_SN,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_on(struct pi_desc *pi_desc)
-{
- return test_bit(POSTED_INTR_ON,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_sn(struct pi_desc *pi_desc)
-{
- return test_bit(POSTED_INTR_SN,
- (unsigned long *)&pi_desc->control);
-}
+#include <asm/posted_intr.h>
void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c37a89eda90f..d94bb069bac9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -70,6 +70,7 @@
#include "x86.h"
#include "smm.h"
#include "vmx_onhyperv.h"
+#include "posted_intr.h"
MODULE_AUTHOR("Qumranet");
MODULE_LICENSE("GPL");
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 65786dbe7d60..e133e8077e6d 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -7,10 +7,10 @@
#include <asm/kvm.h>
#include <asm/intel_pt.h>
#include <asm/perf_event.h>
+#include <asm/posted_intr.h>
#include "capabilities.h"
#include "../kvm_cache_regs.h"
-#include "posted_intr.h"
#include "vmcs.h"
#include "vmx_ops.h"
#include "../cpuid.h"
--
2.25.1
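The bit positions quoted in the comments of `struct pi_desc` in the patch above (bit 256 for ON, bits 279:272 for the notification vector, and so on) can be checked against the C layout with a user-space mirror. This is a sketch under assumptions: `struct pi_desc_model` and `pi_desc_model_control_mask()` are hypothetical names, fixed-width `<stdint.h>` types replace the kernel's `u16`/`u32`/`u64`, and the bitfield layout relies on GCC or Clang behavior on a little-endian target.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the descriptor layout, for illustration only. */
struct pi_desc_model {
	uint32_t pir[8];			/* bits 255:0 */
	union {
		struct {
			uint16_t on : 1,	/* bit 256 */
				 sn : 1,	/* bit 257 */
				 rsvd_1 : 14;	/* bits 271:258 */
			uint8_t  nv;		/* bits 279:272 */
			uint8_t  rsvd_2;	/* bits 287:280 */
			uint32_t ndst;		/* bits 319:288 */
		};
		uint64_t control;
	};
	uint32_t rsvd[6];
} __attribute__((aligned(64)));

/* The descriptor is expected to occupy exactly one 64-byte cache line. */
_Static_assert(sizeof(struct pi_desc_model) == 64, "PID must be 64 bytes");
_Static_assert(offsetof(struct pi_desc_model, control) == 32, "control at bit 256");
_Static_assert(offsetof(struct pi_desc_model, nv) == 34, "nv at bit 272");
_Static_assert(offsetof(struct pi_desc_model, ndst) == 36, "ndst at bit 288");

/* Translate a descriptor-wide bit number (256..319) into a control mask. */
static inline uint64_t pi_desc_model_control_mask(unsigned int desc_bit)
{
	return 1ULL << (desc_bit - 256);
}
```

The static assertions fail at compile time if the mirror drifts from the documented bit positions, which is a cheap way to sanity-check a header move like this one.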
During interrupt affinity change, it is possible to have interrupts delivered
to the old CPU after the affinity has changed to the new one. To prevent lost
interrupts, local APIC IRR is checked on the old CPU. Similar checks must be
done for posted MSIs given the same reason.
Consider the following scenario:
    Device     system agent        iommu             memory      CPU/LAPIC
1   FEEX_XXXX
2              Interrupt request
3                                  Fetch IRTE   ->
4                                  ->Atomic Swap PID.PIR(vec)
                                   Push to Global Observable(GO)
5                                  if (ON*)
                                       done;*
                                   else
6                                      send a notification ->
* ON: outstanding notification, 1 will suppress new notifications
If the affinity change happens between 3 and 5 in IOMMU, the old CPU's posted
interrupt request (PIR) could have pending bit set for the vector being moved.
This patch adds a helper function to check individual vector status.
Then use the helper to check for pending interrupts on the source CPU's
PID.
Signed-off-by: Jacob Pan <[email protected]>
---
v3: Fix a stray letter in the comment, no code change
v2: Fold in helper function patch.
---
arch/x86/include/asm/apic.h | 3 ++-
arch/x86/include/asm/posted_intr.h | 18 ++++++++++++++++++
2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 50f9781fa3ed..5644c396713e 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -14,6 +14,7 @@
#include <asm/msr.h>
#include <asm/hardirq.h>
#include <asm/io.h>
+#include <asm/posted_intr.h>
#define ARCH_APICTIMER_STOPS_ON_C3 1
@@ -508,7 +509,7 @@ static inline bool is_vector_pending(unsigned int vector)
if (irr & (1 << (vector % 32)))
return true;
- return false;
+ return pi_pending_this_cpu(vector);
}
/*
diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index 6f84f6739d99..de788b400fba 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _X86_POSTED_INTR_H
#define _X86_POSTED_INTR_H
+#include <asm/irq_vectors.h>
#define POSTED_INTR_ON 0
#define POSTED_INTR_SN 1
@@ -92,8 +93,25 @@ static inline void __pi_clear_sn(struct pi_desc *pi_desc)
}
#ifdef CONFIG_X86_POSTED_MSI
+/*
+ * Not all external vectors are subject to interrupt remapping, e.g. IOMMU's
+ * own interrupts. Here we do not distinguish them since those vector bits in
+ * PIR will always be zero.
+ */
+static inline bool pi_pending_this_cpu(unsigned int vector)
+{
+ struct pi_desc *pid = this_cpu_ptr(&posted_msi_pi_desc);
+
+ if (WARN_ON_ONCE(vector > NR_VECTORS || vector < FIRST_EXTERNAL_VECTOR))
+ return false;
+
+ return test_bit(vector, (unsigned long *)pid->pir);
+}
+
extern void intel_posted_msi_init(void);
#else
+static inline bool pi_pending_this_cpu(unsigned int vector) { return false; }
+
static inline void intel_posted_msi_init(void) {};
#endif /* X86_POSTED_MSI */
--
2.25.1
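The combined pending check from the patch above (local APIC IRR first, then the CPU's PIR) can be modeled in user space as follows. This is a sketch, not the kernel code: `struct cpu_irq_model` and the `model_*` functions are hypothetical, plain arrays stand in for the APIC register reads and the per-CPU descriptor, and the model uses a `>=` bound so that the bit index always stays inside the 256-bit bitmaps.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_VECTORS		256
#define MODEL_FIRST_EXTERNAL_VECTOR	0x20

/* Hypothetical per-CPU state: APIC IRR plus the posted MSI PIR. */
struct cpu_irq_model {
	uint32_t irr[8];	/* local APIC IRR, 256 bits */
	uint64_t pir[4];	/* PID.PIR of this CPU, 256 bits */
};

/* Is the vector still posted in this CPU's PIR? */
static bool model_pi_pending(const struct cpu_irq_model *cpu,
			     unsigned int vector)
{
	if (vector >= MODEL_NR_VECTORS || vector < MODEL_FIRST_EXTERNAL_VECTOR)
		return false;

	return cpu->pir[vector / 64] & (1ULL << (vector % 64));
}

/* Pending means: latched in the IRR, or posted but not yet demuxed. */
static bool model_is_vector_pending(const struct cpu_irq_model *cpu,
				    unsigned int vector)
{
	if (vector >= MODEL_NR_VECTORS)
		return false;

	if (cpu->irr[vector / 32] & (1U << (vector % 32)))
		return true;

	return model_pi_pending(cpu, vector);
}
```

During an affinity change, a check like this on the old CPU catches an MSI that the IOMMU posted to the old descriptor after the move was requested but before the new target took effect.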
Mixture of bitfields and types is weird and really not intuitive, remove
bitfields and use typed data exclusively. Bitfields often result in
inferior machine code.
Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
v3:
- Fix a bug where SN bit position was used as the mask, reported by
Oliver Sang.
- Add and use non-atomic helpers to manipulate SN bit
- Use pi_test_sn() instead of open coding
v2:
- Replace bitfields, no more mix.
---
arch/x86/include/asm/posted_intr.h | 21 ++++++++++++---------
arch/x86/kvm/vmx/posted_intr.c | 4 ++--
arch/x86/kvm/vmx/vmx.c | 2 +-
3 files changed, 15 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index acf237b2882e..20e31891de15 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -15,17 +15,9 @@ struct pi_desc {
};
union {
struct {
- /* bit 256 - Outstanding Notification */
- u16 on : 1,
- /* bit 257 - Suppress Notification */
- sn : 1,
- /* bit 271:258 - Reserved */
- rsvd_1 : 14;
- /* bit 279:272 - Notification Vector */
+ u16 notifications; /* Suppress and outstanding bits */
u8 nv;
- /* bit 287:280 - Reserved */
u8 rsvd_2;
- /* bit 319:288 - Notification Destination */
u32 ndst;
};
u64 control;
@@ -88,4 +80,15 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
return test_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
}
+/* Non-atomic helpers */
+static inline void __pi_set_sn(struct pi_desc *pi_desc)
+{
+ pi_desc->notifications |= BIT(POSTED_INTR_SN);
+}
+
+static inline void __pi_clear_sn(struct pi_desc *pi_desc)
+{
+ pi_desc->notifications &= ~BIT(POSTED_INTR_SN);
+}
+
#endif /* _X86_POSTED_INTR_H */
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index af662312fd07..ec08fa3caf43 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -107,7 +107,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
* handle task migration (@cpu != vcpu->cpu).
*/
new.ndst = dest;
- new.sn = 0;
+ __pi_clear_sn(&new);
/*
* Restore the notification vector; in the blocking case, the
@@ -157,7 +157,7 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
&per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
- WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
+ WARN(pi_test_sn(pi_desc), "PI descriptor SN field set before blocking");
old.control = READ_ONCE(pi_desc->control);
do {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d94bb069bac9..f505745913c8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4843,7 +4843,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
* or POSTED_INTR_WAKEUP_VECTOR.
*/
vmx->pi_desc.nv = POSTED_INTR_VECTOR;
- vmx->pi_desc.sn = 1;
+ __pi_set_sn(&vmx->pi_desc);
}
static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
--
2.25.1
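The v3 changelog of the patch above mentions a bug where the SN bit position was used as the mask. The difference can be shown with a small user-space model. This is a sketch: `struct pi_ctrl_model` and the `model_*` helpers are hypothetical, and `model_set_sn_buggy()` is only an assumption about what such a mistake could look like, not a quote of earlier code.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_POSTED_INTR_ON	0	/* bit position, not a mask */
#define MODEL_POSTED_INTR_SN	1	/* bit position, not a mask */
#define MODEL_BIT(n)		(1U << (n))

/* Mirror of the typed control fields that replace the bitfields. */
struct pi_ctrl_model {
	uint16_t notifications;	/* ON and SN bits */
	uint8_t  nv;
	uint8_t  rsvd_2;
	uint32_t ndst;
};

/* Non-atomic helpers: convert the bit position to a mask first. */
static void model_set_sn(struct pi_ctrl_model *c)
{
	c->notifications |= MODEL_BIT(MODEL_POSTED_INTR_SN);
}

static void model_clear_sn(struct pi_ctrl_model *c)
{
	c->notifications &= (uint16_t)~MODEL_BIT(MODEL_POSTED_INTR_SN);
}

static bool model_test_sn(const struct pi_ctrl_model *c)
{
	return c->notifications & MODEL_BIT(MODEL_POSTED_INTR_SN);
}

static bool model_test_on(const struct pi_ctrl_model *c)
{
	return c->notifications & MODEL_BIT(MODEL_POSTED_INTR_ON);
}

/* Assumed form of the mistake: position 1 used as mask sets bit 0 (ON). */
static void model_set_sn_buggy(struct pi_ctrl_model *c)
{
	c->notifications |= MODEL_POSTED_INTR_SN;
}
```

With the position used as a mask, the model sets ON instead of SN, so notifications would be coalesced rather than suppressed. Named helpers built on a `BIT()`-style macro make that class of error harder to write.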
The following commit has been merged into the x86/irq branch of tip:
Commit-ID: ce0a92871179f8ca58ae8e3cf50e726a163bf831
Gitweb: https://git.kernel.org/tip/ce0a92871179f8ca58ae8e3cf50e726a163bf831
Author: Jacob Pan <[email protected]>
AuthorDate: Tue, 23 Apr 2024 10:41:12 -07:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Tue, 30 Apr 2024 00:54:43 +02:00
x86/irq: Extend checks for pending vectors to posted interrupts
During interrupt affinity change, it is possible to have interrupts delivered
to the old CPU after the affinity has changed to the new one. To prevent lost
interrupts, local APIC IRR is checked on the old CPU. Similar checks must be
done for posted MSIs given the same reason.
Consider the following scenario:
    Device     system agent        iommu             memory      CPU/LAPIC
1   FEEX_XXXX
2              Interrupt request
3                                  Fetch IRTE   ->
4                                  ->Atomic Swap PID.PIR(vec)
                                   Push to Global Observable(GO)
5                                  if (ON*)
                                       done;*
                                   else
6                                      send a notification ->
* ON: outstanding notification, 1 will suppress new notifications
If the affinity change happens between 3 and 5 in the IOMMU, the old CPU's
posted interrupt request (PIR) could have the pending bit set for the
vector being moved.
Add a helper function to check individual vector status. Then use the
helper to check for pending interrupts on the source CPU's PID.
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/apic.h | 3 ++-
arch/x86/include/asm/posted_intr.h | 18 ++++++++++++++++++
2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 50f9781..5644c39 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -14,6 +14,7 @@
#include <asm/msr.h>
#include <asm/hardirq.h>
#include <asm/io.h>
+#include <asm/posted_intr.h>
#define ARCH_APICTIMER_STOPS_ON_C3 1
@@ -508,7 +509,7 @@ static inline bool is_vector_pending(unsigned int vector)
if (irr & (1 << (vector % 32)))
return true;
- return false;
+ return pi_pending_this_cpu(vector);
}
/*
diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index 6f84f67..de788b4 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _X86_POSTED_INTR_H
#define _X86_POSTED_INTR_H
+#include <asm/irq_vectors.h>
#define POSTED_INTR_ON 0
#define POSTED_INTR_SN 1
@@ -92,8 +93,25 @@ static inline void __pi_clear_sn(struct pi_desc *pi_desc)
}
#ifdef CONFIG_X86_POSTED_MSI
+/*
+ * Not all external vectors are subject to interrupt remapping, e.g. IOMMU's
+ * own interrupts. Here we do not distinguish them since those vector bits in
+ * PIR will always be zero.
+ */
+static inline bool pi_pending_this_cpu(unsigned int vector)
+{
+ struct pi_desc *pid = this_cpu_ptr(&posted_msi_pi_desc);
+
+ if (WARN_ON_ONCE(vector > NR_VECTORS || vector < FIRST_EXTERNAL_VECTOR))
+ return false;
+
+ return test_bit(vector, (unsigned long *)pid->pir);
+}
+
extern void intel_posted_msi_init(void);
#else
+static inline bool pi_pending_this_cpu(unsigned int vector) { return false; }
+
static inline void intel_posted_msi_init(void) {};
#endif /* X86_POSTED_MSI */
The following commit has been merged into the x86/irq branch of tip:
Commit-ID: 2254808b53d92c9fe7b645b2f43acc55f22cdce6
Gitweb: https://git.kernel.org/tip/2254808b53d92c9fe7b645b2f43acc55f22cdce6
Author: Jacob Pan <[email protected]>
AuthorDate: Tue, 23 Apr 2024 10:41:05 -07:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Tue, 30 Apr 2024 00:54:42 +02:00
x86/irq: Remove bitfields in posted interrupt descriptor
Mixture of bitfields and types is weird and really not intuitive, remove
bitfields and use typed data exclusively. Bitfields often result in
inferior machine code.
Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
---
arch/x86/include/asm/posted_intr.h | 21 ++++++++++++---------
arch/x86/kvm/vmx/posted_intr.c | 4 ++--
arch/x86/kvm/vmx/vmx.c | 2 +-
3 files changed, 15 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
index acf237b..20e3189 100644
--- a/arch/x86/include/asm/posted_intr.h
+++ b/arch/x86/include/asm/posted_intr.h
@@ -15,17 +15,9 @@ struct pi_desc {
};
union {
struct {
- /* bit 256 - Outstanding Notification */
- u16 on : 1,
- /* bit 257 - Suppress Notification */
- sn : 1,
- /* bit 271:258 - Reserved */
- rsvd_1 : 14;
- /* bit 279:272 - Notification Vector */
+ u16 notifications; /* Suppress and outstanding bits */
u8 nv;
- /* bit 287:280 - Reserved */
u8 rsvd_2;
- /* bit 319:288 - Notification Destination */
u32 ndst;
};
u64 control;
@@ -88,4 +80,15 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
return test_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
}
+/* Non-atomic helpers */
+static inline void __pi_set_sn(struct pi_desc *pi_desc)
+{
+ pi_desc->notifications |= BIT(POSTED_INTR_SN);
+}
+
+static inline void __pi_clear_sn(struct pi_desc *pi_desc)
+{
+ pi_desc->notifications &= ~BIT(POSTED_INTR_SN);
+}
+
#endif /* _X86_POSTED_INTR_H */
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index af66231..ec08fa3 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -107,7 +107,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
* handle task migration (@cpu != vcpu->cpu).
*/
new.ndst = dest;
- new.sn = 0;
+ __pi_clear_sn(&new);
/*
* Restore the notification vector; in the blocking case, the
@@ -157,7 +157,7 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
&per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
- WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
+ WARN(pi_test_sn(pi_desc), "PI descriptor SN field set before blocking");
old.control = READ_ONCE(pi_desc->control);
do {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 273d264..becefaf 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4845,7 +4845,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
* or POSTED_INTR_WAKEUP_VECTOR.
*/
vmx->pi_desc.nv = POSTED_INTR_VECTOR;
- vmx->pi_desc.sn = 1;
+ __pi_set_sn(&vmx->pi_desc);
}
static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
The following commit has been merged into the x86/irq branch of tip:
Commit-ID: 699f67512f04cbaee965fad872702c06eaf440f6
Gitweb: https://git.kernel.org/tip/699f67512f04cbaee965fad872702c06eaf440f6
Author: Jacob Pan <[email protected]>
AuthorDate: Tue, 23 Apr 2024 10:41:03 -07:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Tue, 30 Apr 2024 00:54:42 +02:00
KVM: VMX: Move posted interrupt descriptor out of VMX code
To prepare for native usage of posted interrupts, move the posted interrupt
descriptor (PID) declarations out of the VMX code so that they can be shared.
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/posted_intr.h | 88 +++++++++++++++++++++++++++-
arch/x86/kvm/vmx/posted_intr.h | 93 +-----------------------------
arch/x86/kvm/vmx/vmx.c | 1 +-
arch/x86/kvm/vmx/vmx.h | 2 +-
4 files changed, 91 insertions(+), 93 deletions(-)
create mode 100644 arch/x86/include/asm/posted_intr.h
diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
new file mode 100644
index 0000000..f0324c5
--- /dev/null
+++ b/arch/x86/include/asm/posted_intr.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_POSTED_INTR_H
+#define _X86_POSTED_INTR_H
+
+#define POSTED_INTR_ON 0
+#define POSTED_INTR_SN 1
+
+#define PID_TABLE_ENTRY_VALID 1
+
+/* Posted-Interrupt Descriptor */
+struct pi_desc {
+ u32 pir[8]; /* Posted interrupt requested */
+ union {
+ struct {
+ /* bit 256 - Outstanding Notification */
+ u16 on : 1,
+ /* bit 257 - Suppress Notification */
+ sn : 1,
+ /* bit 271:258 - Reserved */
+ rsvd_1 : 14;
+ /* bit 279:272 - Notification Vector */
+ u8 nv;
+ /* bit 287:280 - Reserved */
+ u8 rsvd_2;
+ /* bit 319:288 - Notification Destination */
+ u32 ndst;
+ };
+ u64 control;
+ };
+ u32 rsvd[6];
+} __aligned(64);
+
+static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
+{
+ return test_and_set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
+{
+ return test_and_clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
+{
+ return test_and_clear_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
+{
+ return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
+}
+
+static inline bool pi_is_pir_empty(struct pi_desc *pi_desc)
+{
+ return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
+}
+
+static inline void pi_set_sn(struct pi_desc *pi_desc)
+{
+ set_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_set_on(struct pi_desc *pi_desc)
+{
+ set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_clear_on(struct pi_desc *pi_desc)
+{
+ clear_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline void pi_clear_sn(struct pi_desc *pi_desc)
+{
+ clear_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_on(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
+}
+
+static inline bool pi_test_sn(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
+}
+
+#endif /* _X86_POSTED_INTR_H */
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 2699207..6b2a022 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -1,98 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __KVM_X86_VMX_POSTED_INTR_H
#define __KVM_X86_VMX_POSTED_INTR_H
-
-#define POSTED_INTR_ON 0
-#define POSTED_INTR_SN 1
-
-#define PID_TABLE_ENTRY_VALID 1
-
-/* Posted-Interrupt Descriptor */
-struct pi_desc {
- u32 pir[8]; /* Posted interrupt requested */
- union {
- struct {
- /* bit 256 - Outstanding Notification */
- u16 on : 1,
- /* bit 257 - Suppress Notification */
- sn : 1,
- /* bit 271:258 - Reserved */
- rsvd_1 : 14;
- /* bit 279:272 - Notification Vector */
- u8 nv;
- /* bit 287:280 - Reserved */
- u8 rsvd_2;
- /* bit 319:288 - Notification Destination */
- u32 ndst;
- };
- u64 control;
- };
- u32 rsvd[6];
-} __aligned(64);
-
-static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
-{
- return test_and_set_bit(POSTED_INTR_ON,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
-{
- return test_and_clear_bit(POSTED_INTR_ON,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
-{
- return test_and_clear_bit(POSTED_INTR_SN,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
-{
- return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
-}
-
-static inline bool pi_is_pir_empty(struct pi_desc *pi_desc)
-{
- return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
-}
-
-static inline void pi_set_sn(struct pi_desc *pi_desc)
-{
- set_bit(POSTED_INTR_SN,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_set_on(struct pi_desc *pi_desc)
-{
- set_bit(POSTED_INTR_ON,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_clear_on(struct pi_desc *pi_desc)
-{
- clear_bit(POSTED_INTR_ON,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline void pi_clear_sn(struct pi_desc *pi_desc)
-{
- clear_bit(POSTED_INTR_SN,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_on(struct pi_desc *pi_desc)
-{
- return test_bit(POSTED_INTR_ON,
- (unsigned long *)&pi_desc->control);
-}
-
-static inline bool pi_test_sn(struct pi_desc *pi_desc)
-{
- return test_bit(POSTED_INTR_SN,
- (unsigned long *)&pi_desc->control);
-}
+#include <asm/posted_intr.h>
void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 22411f4..273d264 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -70,6 +70,7 @@
#include "x86.h"
#include "smm.h"
#include "vmx_onhyperv.h"
+#include "posted_intr.h"
MODULE_AUTHOR("Qumranet");
MODULE_LICENSE("GPL");
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 90f9e44..7e48336 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -7,10 +7,10 @@
#include <asm/kvm.h>
#include <asm/intel_pt.h>
#include <asm/perf_event.h>
+#include <asm/posted_intr.h>
#include "capabilities.h"
#include "../kvm_cache_regs.h"
-#include "posted_intr.h"
#include "vmcs.h"
#include "vmx_ops.h"
#include "../cpuid.h"
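As a cross-check of the bit positions given in the struct comments of the moved header, the layout can be verified at compile time in user space. This is a sketch only: `struct pi_desc_model` and `model_test_and_set_pir()` are hypothetical names, the offsets assume GCC or Clang bitfield layout on x86-64, and the PIR helper is deliberately non-atomic, unlike the kernel's `test_and_set_bit()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space copy of the posted-interrupt descriptor. */
struct pi_desc_model {
	uint32_t pir[8];		/* 256 posted interrupt request bits */
	union {
		struct {
			uint16_t on : 1,	/* bit 256 */
				 sn : 1,	/* bit 257 */
				 rsvd_1 : 14;	/* bits 271:258 */
			uint8_t  nv;		/* bits 279:272 */
			uint8_t  rsvd_2;	/* bits 287:280 */
			uint32_t ndst;		/* bits 319:288 */
		};
		uint64_t control;
	};
	uint32_t rsvd[6];
} __attribute__((aligned(64)));

/* Bit 256 is byte 32, bit 272 is byte 34, bit 288 is byte 36. */
static_assert(sizeof(struct pi_desc_model) == 64, "one cache line");
static_assert(offsetof(struct pi_desc_model, control) == 32, "control");
static_assert(offsetof(struct pi_desc_model, nv) == 34, "nv");
static_assert(offsetof(struct pi_desc_model, ndst) == 36, "ndst");

/*
 * Non-atomic model of posting a vector: set the bit in PIR and report
 * whether it was already pending. The caller must pass vector < 256.
 */
static inline int model_test_and_set_pir(unsigned int vector,
					 struct pi_desc_model *d)
{
	uint32_t mask = UINT32_C(1) << (vector % 32);
	uint32_t *word = &d->pir[vector / 32];
	int was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}
```

If any of the static assertions failed to compile, the model would no longer match the offsets documented in the header comments, which is the property a shared declaration has to preserve for both the VMX and native users.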
Hi Jacob Pan,
On Tue, Apr 23, 2024 at 10:41:05AM -0700, Jacob Pan wrote:
> Mixing bitfields and typed members is unintuitive. Remove the bitfields and
> use typed data exclusively. Bitfields also often result in inferior machine
> code.
>
> Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
> Suggested-by: Sean Christopherson <[email protected]>
> Suggested-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
>
> ---
> v3:
> - Fix a bug where SN bit position was used as the mask, reported by
> Oliver Sang.
We tested this version and confirmed that the issue is gone.
Tested-by: kernel test robot <[email protected]>
The details below are for reference only.
With the previous version, we noticed a
"WARNING:at_arch/x86/kvm/vmx/posted_intr.c:#pi_enable_wakeup_handler[kvm_intel]"
in the kernel-selftests test case.
The call trace was as follows:
[ 399.225452][ T8098] ------------[ cut here ]------------
[ 399.232475][ T8098] PI descriptor SN field set before blocking
[ 399.232514][ T8098] WARNING: CPU: 184 PID: 8098 at arch/x86/kvm/vmx/posted_intr.c:160 pi_enable_wakeup_handler+0x421/0x5f0 [kvm_intel]
[ 399.254685][ T8098] Modules linked in: openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 intel_rapl_msr intel_rapl_common btrfs
+blake2b_generic x86_pkg_temp_thermal xor intel_powerclamp zstd_compress coretemp raid6_pq libcrc32c kvm_intel kvm nvme crct10dif_pclmul crc32_pclmul
+nvme_core crc32c_intel t10_pi ghash_clmulni_intel ast sha512_ssse3 rapl intel_cstate dax_hmem crc64_rocksoft_generic drm_shmem_helper mei_me i2c_i801
+crc64_rocksoft mei drm_kms_helper i2c_smbus crc64 i2c_ismt wmi ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter joydev
+binfmt_misc loop fuse drm dm_mod ip_tables
[ 399.321529][ T8098] CPU: 184 PID: 8098 Comm: xapic_ipi_test Not tainted 6.9.0-rc1-00008-g037ccaed5dc5 #1
[ 399.333325][ T8098] RIP: 0010:pi_enable_wakeup_handler+0x421/0x5f0 [kvm_intel]
[ 399.342631][ T8098] Code: e8 a4 5a 7b c0 e9 b2 fc ff ff e8 5a 5c 7b c0 fb eb 93 bf f1 00 00 00 e8 dd 9c 29 c0 eb 82 48 c7 c7 e0 c6 ee c0 e8 0f 2f 37 c0
+<0f> 0b e9 e7 fe ff ff 4c 89 f7 e8 00 6a c8 c0 e9 cd fe ff ff 4c 89
[ 399.365742][ T8098] RSP: 0018:ffa000003a527780 EFLAGS: 00010086
[ 399.373626][ T8098] RAX: 0000000000000000 RBX: ff1100247c23c740 RCX: 0000000000000027
[ 399.383668][ T8098] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ff11003fc3030c08
[ 399.393759][ T8098] RBP: ffa000003a527908 R08: 0000000000000001 R09: ffe21c07f8606181
[ 399.403834][ T8098] R10: ff11003fc3030c0b R11: 0000000000000000 R12: 1ff40000074a4ef4
[ 399.413901][ T8098] R13: 0000000000000000 R14: ff1100247c23e4a0 R15: 00000000000000b8
[ 399.423969][ T8098] FS: 00007f10de2006c0(0000) GS:ff11003fc3000000(0000) knlGS:0000000000000000
[ 399.435126][ T8098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 399.443651][ T8098] CR2: 0000000000000000 CR3: 000000210b958004 CR4: 0000000000f73ef0
[ 399.453753][ T8098] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 399.463835][ T8098] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 399.473912][ T8098] PKRU: 55555554
[ 399.479005][ T8098] Call Trace:
[ 399.483812][ T8098] <TASK>
[ 399.488224][ T8098] ? __warn+0xcc/0x2d0
[ 399.493903][ T8098] ? pi_enable_wakeup_handler+0x421/0x5f0 [kvm_intel]
[ 399.502643][ T8098] ? report_bug+0x261/0x2c0
[ 399.508825][ T8098] ? handle_bug+0x3a/0x90
[ 399.514817][ T8098] ? exc_invalid_op+0x17/0x40
[ 399.521191][ T8098] ? asm_exc_invalid_op+0x1a/0x20
[ 399.527938][ T8098] ? pi_enable_wakeup_handler+0x421/0x5f0 [kvm_intel]
[ 399.536677][ T8098] ? __pfx_pi_enable_wakeup_handler+0x10/0x10 [kvm_intel]
[ 399.545747][ T8098] ? __pfx_lock_repin_lock+0x10/0x10
[ 399.552740][ T8098] ? newidle_balance+0xc85/0x1300
[ 399.559408][ T8098] ? __pfx_perf_event_context_sched_out+0x10/0x10
[ 399.567652][ T8098] ? lock_acquire+0x432/0x4e0
[ 399.573959][ T8098] ? vmx_get_rflags+0x26/0x2c0 [kvm_intel]
[ 399.581527][ T8098] vmx_vcpu_pi_put+0x1d3/0x230 [kvm_intel]
[ 399.589119][ T8098] vmx_vcpu_put+0x12/0x20 [kvm_intel]
[ 399.596205][ T8098] kvm_arch_vcpu_put+0x49e/0x7b0 [kvm]
[ 399.603463][ T8098] kvm_sched_out+0xb2/0xe0 [kvm]
[ 399.610100][ T8098] prepare_task_switch+0x321/0xc40
[ 399.616863][ T8098] ? lock_release+0x1bf/0x240
[ 399.623146][ T8098] __schedule+0x5a6/0x20b0
[ 399.629125][ T8098] ? __pfx___schedule+0x10/0x10
[ 399.635581][ T8098] ? __pfx_lock_acquire+0x10/0x10
[ 399.642228][ T8098] ? kvm_apic_has_interrupt+0x9c/0x160 [kvm]
[ 399.650034][ T8098] ? lock_acquire+0x432/0x4e0
[ 399.656267][ T8098] ? __pfx_lock_acquire+0x10/0x10
[ 399.662905][ T8098] schedule+0xe2/0x2a0
[ 399.668478][ T8098] kvm_vcpu_block+0xd1/0x1c0 [kvm]
[ 399.675308][ T8098] kvm_vcpu_halt+0xee/0x900 [kvm]
[ 399.682018][ T8098] vcpu_run+0x50a/0x9d0 [kvm]
[ 399.688346][ T8098] kvm_arch_vcpu_ioctl_run+0x377/0x1430 [kvm]
[ 399.696254][ T8098] ? lock_release+0xe5/0x240
[ 399.702428][ T8098] kvm_vcpu_ioctl+0x34c/0xc40 [kvm]
[ 399.709341][ T8098] ? __pfx_kvm_vcpu_ioctl+0x10/0x10 [kvm]
[ 399.716821][ T8098] ? __lock_release+0x103/0x440
[ 399.723960][ T8098] ? __fget_files+0x1c7/0x330
[ 399.730219][ T8098] ? __pfx___lock_release+0x10/0x10
[ 399.737750][ T8098] ? __fget_files+0x1cc/0x330
[ 399.744028][ T8098] ? __fget_files+0x1c7/0x330
[ 399.750289][ T8098] ? lock_release+0xe5/0x240
[ 399.756461][ T8098] ? __fget_files+0x1cc/0x330
[ 399.762719][ T8098] __x64_sys_ioctl+0x134/0x1b0
[ 399.769077][ T8098] do_syscall_64+0x93/0x170
[ 399.775146][ T8098] ? do_user_addr_fault+0x477/0xcb0
[ 399.781989][ T8098] ? lockdep_hardirqs_on_prepare+0x279/0x3e0
[ 399.789654][ T8098] entry_SYSCALL_64_after_hwframe+0x6c/0x74
[ 399.797192][ T8098] RIP: 0033:0x7f10de88bc5b
[ 399.803079][ T8098] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05
+<89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 399.826105][ T8098] RSP: 002b:00007f10de1ff9c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 399.836565][ T8098] RAX: ffffffffffffffda RBX: 000000001a5418c0 RCX: 00007f10de88bc5b
[ 399.846521][ T8098] R
> - Add and use non-atomic helpers to manipulate SN bit
> - Use pi_test_sn() instead of open coding
> v2:
> - Replace bitfields, no more mix.
> ---
> arch/x86/include/asm/posted_intr.h | 21 ++++++++++++---------
> arch/x86/kvm/vmx/posted_intr.c | 4 ++--
> arch/x86/kvm/vmx/vmx.c | 2 +-
> 3 files changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/include/asm/posted_intr.h b/arch/x86/include/asm/posted_intr.h
> index acf237b2882e..20e31891de15 100644
> --- a/arch/x86/include/asm/posted_intr.h
> +++ b/arch/x86/include/asm/posted_intr.h
> @@ -15,17 +15,9 @@ struct pi_desc {
> };
> union {
> struct {
> - /* bit 256 - Outstanding Notification */
> - u16 on : 1,
> - /* bit 257 - Suppress Notification */
> - sn : 1,
> - /* bit 271:258 - Reserved */
> - rsvd_1 : 14;
> - /* bit 279:272 - Notification Vector */
> + u16 notifications; /* Suppress and outstanding bits */
> u8 nv;
> - /* bit 287:280 - Reserved */
> u8 rsvd_2;
> - /* bit 319:288 - Notification Destination */
> u32 ndst;
> };
> u64 control;
> @@ -88,4 +80,15 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
> return test_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
> }
>
> +/* Non-atomic helpers */
> +static inline void __pi_set_sn(struct pi_desc *pi_desc)
> +{
> + pi_desc->notifications |= BIT(POSTED_INTR_SN);
> +}
> +
> +static inline void __pi_clear_sn(struct pi_desc *pi_desc)
> +{
> + pi_desc->notifications &= ~BIT(POSTED_INTR_SN);
> +}
> +
> #endif /* _X86_POSTED_INTR_H */
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index af662312fd07..ec08fa3caf43 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -107,7 +107,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
> * handle task migration (@cpu != vcpu->cpu).
> */
> new.ndst = dest;
> - new.sn = 0;
> + __pi_clear_sn(&new);
>
> /*
> * Restore the notification vector; in the blocking case, the
> @@ -157,7 +157,7 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
> &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
> raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
>
> - WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
> + WARN(pi_test_sn(pi_desc), "PI descriptor SN field set before blocking");
>
> old.control = READ_ONCE(pi_desc->control);
> do {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d94bb069bac9..f505745913c8 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4843,7 +4843,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
> * or POSTED_INTR_WAKEUP_VECTOR.
> */
> vmx->pi_desc.nv = POSTED_INTR_VECTOR;
> - vmx->pi_desc.sn = 1;
> + __pi_set_sn(&vmx->pi_desc);
> }
>
> static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> --
> 2.25.1
>
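To illustrate the v3 changelog entry quoted above ("SN bit position was used as the mask"), here is a minimal sketch of that class of bug. The `buggy_*` functions are a hypothetical reconstruction for illustration and are not the code from the earlier posting; only the `fixed_*` forms mirror the helpers in this patch.

```c
#include <stdint.h>

#define POSTED_INTR_ON 0
#define POSTED_INTR_SN 1
#define BIT(n) (1U << (n))

/*
 * Hypothetical buggy variants: POSTED_INTR_SN is the bit position (1),
 * so using it directly as a mask touches bit 0, which is ON.
 */
static inline uint16_t buggy_set_sn(uint16_t n)
{
	return n | POSTED_INTR_SN;	/* sets ON, leaves SN untouched */
}

static inline uint16_t buggy_clear_sn(uint16_t n)
{
	return n & ~POSTED_INTR_SN;	/* clears ON, leaves SN untouched */
}

/* Fixed variants: convert the position into a mask with BIT(). */
static inline uint16_t fixed_set_sn(uint16_t n)
{
	return n | BIT(POSTED_INTR_SN);
}

static inline uint16_t fixed_clear_sn(uint16_t n)
{
	return n & ~BIT(POSTED_INTR_SN);
}
```

With a mistake of this shape, a clear of SN would leave SN set and drop ON instead, which would be consistent with the "PI descriptor SN field set before blocking" warning in the trace, although the trace alone does not prove that this was the exact path.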