VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.
You can find the VT-d Posted-Interrtups Spec. in the following URL:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
v1->v2:
* Use VFIO framework to enable this feature, the VFIO part of this series is
base on Eric's patch "[PATCH v3 0/8] KVM-VFIO IRQ forward control"
* Rebase this patchset on git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git,
then revise some irq logic based on the new hierarchy irqdomain patches provided
by Jiang Liu <[email protected]>
v2->v3:
* Adjust the Posted-interrupts Descriptor updating logic when vCPU is
preempted or blocked.
* KVM_DEV_VFIO_DEVICE_POSTING_IRQ --> KVM_DEV_VFIO_DEVICE_POST_IRQ
* __KVM_HAVE_ARCH_KVM_VFIO_POSTING --> __KVM_HAVE_ARCH_KVM_VFIO_POST
* Add KVM_DEV_VFIO_DEVICE_UNPOST_IRQ attribute for VFIO irq, which
can be used to change back to remapping mode.
* Fix typo
This patch series is made of the following groups:
1-6: Some preparation changes in iommu and irq component, this is based on the
new hierarchy irqdomain logic.
7-9, 26: IOMMU changes for VT-d Posted-Interrupts, such as, feature detection,
command line parameter.
10-17, 22-25: Changes related to KVM itself.
18-20: Changes in VFIO component, this part was previously sent out as
"[RFC PATCH v2 0/2] kvm-vfio: implement the vfio skeleton for VT-d Posted-Interrupts"
21: x86 irq related changes
Feng Wu (26):
genirq: Introduce irq_set_vcpu_affinity() to target an interrupt to a
VCPU
iommu: Add new member capability to struct irq_remap_ops
iommu, x86: Define new irte structure for VT-d Posted-Interrupts
iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip
x86, irq: Implement irq_set_vcpu_affinity for pci_msi_ir_controller
iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
iommu, x86: Add cap_pi_support() to detect VT-d PI capability
iommu, x86: Add intel_irq_remapping_capability() for Intel
iommu, x86: define irq_remapping_cap()
KVM: change struct pi_desc for VT-d Posted-Interrupts
KVM: Add some helper functions for Posted-Interrupts
KVM: Initialize VT-d Posted-Interrupts Descriptor
KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
KVM: Get Posted-Interrupts descriptor address from struct kvm_vcpu
KVM: add interfaces to control PI outside vmx
KVM: Make struct kvm_irq_routing_table accessible
KVM: make kvm_set_msi_irq() public
KVM: kvm-vfio: User API for VT-d Posted-Interrupts
KVM: kvm-vfio: implement the VFIO skeleton for VT-d Posted-Interrupts
KVM: x86: kvm-vfio: VT-d posted-interrupts setup
x86, irq: Define a global vector for VT-d Posted-Interrupts
KVM: Define a wakeup worker thread for vCPU
KVM: Update Posted-Interrupts Descriptor when vCPU is preempted
KVM: Update Posted-Interrupts Descriptor when vCPU is blocked
KVM: Suppress posted-interrupt when 'SN' is set
iommu/vt-d: Add a command line parameter for VT-d posted-interrupts
Documentation/kernel-parameters.txt | 1 +
Documentation/virtual/kvm/devices/vfio.txt | 9 ++
arch/x86/include/asm/entry_arch.h | 2 +
arch/x86/include/asm/hardirq.h | 1 +
arch/x86/include/asm/hw_irq.h | 2 +
arch/x86/include/asm/irq_remapping.h | 11 ++
arch/x86/include/asm/irq_vectors.h | 1 +
arch/x86/include/asm/kvm_host.h | 12 ++
arch/x86/kernel/apic/msi.c | 1 +
arch/x86/kernel/entry_64.S | 2 +
arch/x86/kernel/irq.c | 27 ++++
arch/x86/kernel/irqinit.c | 2 +
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/kvm_vfio_x86.c | 77 +++++++++
arch/x86/kvm/vmx.c | 244 ++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 22 ++-
drivers/iommu/intel_irq_remapping.c | 68 +++++++-
drivers/iommu/irq_remapping.c | 24 ++-
drivers/iommu/irq_remapping.h | 8 +
include/linux/dmar.h | 32 ++++
include/linux/intel-iommu.h | 1 +
include/linux/irq.h | 7 +
include/linux/kvm_host.h | 46 ++++++
include/uapi/linux/kvm.h | 11 ++
kernel/irq/chip.c | 14 ++
kernel/irq/manage.c | 20 +++
virt/kvm/irq_comm.c | 43 ++++-
virt/kvm/irqchip.c | 11 --
virt/kvm/kvm_main.c | 15 ++
virt/kvm/vfio.c | 107 +++++++++++++
30 files changed, 795 insertions(+), 28 deletions(-)
create mode 100644 arch/x86/kvm/kvm_vfio_x86.c
--
1.9.1
With Posted-Interrupts support in Intel CPU and IOMMU, an external
interrupt from assigned-devices could be directly delivered to a
virtual CPU in a virtual machine. Instead of hacking KVM and Intel
IOMMU drivers, we propose a platform independent interface to target
an interrupt to a specific virtual CPU in a virtual machine, or set
virtual CPU affinity for an interrupt.
By adopting this new interface and the hierarchy irqdomain, we could
easily support posted-interrupts on Intel platforms, and also provide
flexible enough interfaces for other platforms to support similar
features.
We may also cooperate between set_affinity() and set_vcpu_affinity()
in IRQ core or irq chip drivers.
Here is the usage scenario for this interface:
Guest update MSI/MSI-X interrupt configuration
-->QEMU and KVM handle this
-->KVM call this interface (passing posted interrupts descriptor
and guest vector)
-->irq core will transfer the control to IOMMU
-->IOMMU will do the real work of updating IRTE (IRTE has new
format for VT-d Posted-Interrupts)
Signed-off-by: Jiang Liu <[email protected]>
Signed-off-by: Feng Wu <[email protected]>
---
include/linux/irq.h | 4 ++++
kernel/irq/chip.c | 14 ++++++++++++++
kernel/irq/manage.c | 20 ++++++++++++++++++++
3 files changed, 38 insertions(+)
diff --git a/include/linux/irq.h b/include/linux/irq.h
index f26e736..83abafc 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -324,6 +324,8 @@ static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
* irq_request_resources
* @irq_compose_msi_msg: optional to compose message content for MSI
* @irq_write_msi_msg: optional to write message content for MSI
+ * @irq_set_vcpu_affinity: optional to target a virtual CPU in a virtual
+ * machine
* @flags: chip specific flags
*/
struct irq_chip {
@@ -362,6 +364,7 @@ struct irq_chip {
void (*irq_compose_msi_msg)(struct irq_data *data, struct msi_msg *msg);
void (*irq_write_msi_msg)(struct irq_data *data, struct msi_msg *msg);
+ int (*irq_set_vcpu_affinity)(struct irq_data *data, void *vcpu_info);
unsigned long flags;
};
@@ -416,6 +419,7 @@ extern void irq_cpu_online(void);
extern void irq_cpu_offline(void);
extern int irq_set_affinity_locked(struct irq_data *data,
const struct cpumask *cpumask, bool force);
+extern int irq_set_vcpu_affinity(unsigned int irq, void *vcpu_info);
#if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_PENDING_IRQ)
void irq_move_irq(struct irq_data *data);
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 6f1c7a5..fe0908f 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -948,6 +948,20 @@ int irq_chip_retrigger_hierarchy(struct irq_data *data)
return -ENOSYS;
}
+
+/**
+ * irq_chip_set_vcpu_affinity_parent - Set vcpu affinity on the parent interrupt
+ * @data: Pointer to interrupt specific data
+ * @dest: The vcpu affinity information
+ */
+int irq_chip_set_vcpu_affinity_parent(struct irq_data *data, void *vcpu_info)
+{
+ data = data->parent_data;
+ if (data->chip->irq_set_vcpu_affinity)
+ return data->chip->irq_set_vcpu_affinity(data, vcpu_info);
+
+ return -ENOSYS;
+}
#endif
/**
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 8069237..bd3a1ba 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -247,6 +247,26 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
}
EXPORT_SYMBOL_GPL(irq_set_affinity_hint);
+int irq_set_vcpu_affinity(unsigned int irq, void *vcpu_info)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+ struct irq_chip *chip;
+ unsigned long flags;
+ int ret = -ENOSYS;
+
+ if (!desc)
+ return -EINVAL;
+
+ raw_spin_lock_irqsave(&desc->lock, flags);
+ chip = desc->irq_data.chip;
+ if (chip && chip->irq_set_vcpu_affinity)
+ ret = chip->irq_set_vcpu_affinity(irq_desc_get_irq_data(desc),
+ vcpu_info);
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(irq_set_vcpu_affinity);
+
static void irq_affinity_notify(struct work_struct *work)
{
struct irq_affinity_notify *notify =
--
1.9.1
Implement irq_set_vcpu_affinity for intel_ir_chip.
Signed-off-by: Feng Wu <[email protected]>
Reviewed-by: Jiang Liu <[email protected]>
---
arch/x86/include/asm/irq_remapping.h | 5 +++++
drivers/iommu/intel_irq_remapping.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 40 insertions(+)
diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index f67ae08..f87ac70 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -60,6 +60,11 @@ static inline struct irq_domain *arch_get_ir_parent_domain(void)
return x86_vector_domain;
}
+struct vcpu_data {
+ u64 pi_desc_addr; /* Physical address of PI Descriptor */
+ u32 vector; /* Guest vector of the interrupt */
+};
+
#else /* CONFIG_IRQ_REMAP */
static inline void setup_irq_remapping_ops(void) { }
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index f6da3b2..48c2051 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -42,6 +42,7 @@ struct irq_2_iommu {
struct intel_ir_data {
struct irq_2_iommu irq_2_iommu;
struct irte irte_entry;
+ struct irte_pi irte_pi_entry;
union {
struct msi_msg msi_entry;
};
@@ -1010,10 +1011,44 @@ static void intel_ir_compose_msi_msg(struct irq_data *irq_data,
*msg = ir_data->msi_entry;
}
+static int intel_ir_set_vcpu_affinity(struct irq_data *data, void *vcpu_info)
+{
+ struct intel_ir_data *ir_data = data->chip_data;
+ struct irte_pi *irte_pi = &ir_data->irte_pi_entry;
+ struct vcpu_data *vcpu_pi_info;
+
+ /* stop posting interrupts, back to remapping mode */
+ if (!vcpu_info)
+ modify_irte(&ir_data->irq_2_iommu, &ir_data->irte_entry);
+ else {
+ vcpu_pi_info = (struct vcpu_data *)vcpu_info;
+ memcpy(irte_pi, &ir_data->irte_entry, sizeof(struct irte));
+
+ irte_pi->urg = 0;
+ irte_pi->vector = vcpu_pi_info->vector;
+ irte_pi->pda_l = (vcpu_pi_info->pi_desc_addr >>
+ (32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
+ irte_pi->pda_h = (vcpu_pi_info->pi_desc_addr >> 32) &
+ ~(-1UL << PDA_HIGH_BIT);
+
+ irte_pi->__reserved_1 = 0;
+ irte_pi->__reserved_2 = 0;
+ irte_pi->__reserved_3 = 0;
+ irte_pi->__reserved_4 = 0;
+
+ irte_pi->pst = 1;
+
+ modify_irte(&ir_data->irq_2_iommu, (struct irte *)irte_pi);
+ }
+
+ return 0;
+}
+
static struct irq_chip intel_ir_chip = {
.irq_ack = ir_ack_apic_edge,
.irq_set_affinity = intel_ir_set_affinity,
.irq_compose_msi_msg = intel_ir_compose_msi_msg,
+ .irq_set_vcpu_affinity = intel_ir_set_vcpu_affinity,
};
static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
--
1.9.1
This patch adds a new member capability to struct irq_remap_ops,
this new function ops can be used to check whether some
features are supported, such as VT-d Posted-Interrupts.
Signed-off-by: Feng Wu <[email protected]>
Reviewed-by: Jiang Liu <[email protected]>
---
arch/x86/include/asm/irq_remapping.h | 4 ++++
drivers/iommu/irq_remapping.h | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index 6ba2431..f67ae08 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -31,6 +31,10 @@ struct irq_alloc_info;
#ifdef CONFIG_IRQ_REMAP
+enum irq_remap_cap {
+ IRQ_POSTING_CAP = 0,
+};
+
extern void setup_irq_remapping_ops(void);
extern int irq_remapping_supported(void);
extern void set_irq_remapping_broken(void);
diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index 4bd791d..2d991b2 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -28,6 +28,7 @@ struct irq_data;
struct msi_msg;
struct irq_domain;
struct irq_alloc_info;
+enum irq_remap_cap;
extern int disable_irq_remap;
extern int irq_remap_broken;
@@ -39,6 +40,9 @@ struct irq_remap_ops {
/* Check whether Interrupt Remapping is supported */
int (*supported)(void);
+ /* Check some capability is supported */
+ bool (*capability)(enum irq_remap_cap);
+
/* Initializes hardware and makes it ready for remapping interrupts */
int (*prepare)(void);
--
1.9.1
Add helper function to detect VT-d Posted-Interrupts capability.
Signed-off-by: Feng Wu <[email protected]>
Reviewed-by: Jiang Liu <[email protected]>
---
include/linux/intel-iommu.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index ecaf3a9..8174ae8 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -87,6 +87,7 @@ static inline void dmar_writeq(void __iomem *addr, u64 val)
/*
* Decoding Capability Register
*/
+#define cap_pi_support(c) (((c) >> 59) & 1)
#define cap_read_drain(c) (((c) >> 55) & 1)
#define cap_write_drain(c) (((c) >> 54) & 1)
#define cap_max_amask_val(c) (((c) >> 48) & 0x3f)
--
1.9.1
Add the Intel side implementation for capability in
struct irq_remap_ops.
Signed-off-by: Feng Wu <[email protected]>
Reviewed-by: Jiang Liu <[email protected]>
---
drivers/iommu/intel_irq_remapping.c | 27 +++++++++++++++++++++++++++
drivers/iommu/irq_remapping.c | 2 ++
drivers/iommu/irq_remapping.h | 4 ++++
3 files changed, 33 insertions(+)
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index ab9057a..08a7c39 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -652,6 +652,32 @@ error:
return -1;
}
+static bool intel_irq_remapping_capability(enum irq_remap_cap cap)
+{
+ struct dmar_drhd_unit *drhd;
+ struct intel_iommu *iommu;
+
+ switch (cap) {
+ case IRQ_POSTING_CAP:
+ /*
+ * If 1) posted-interrupts is disabled by user
+ * or 2) irq remapping is disabled, posted-interrupts
+ * is not supported.
+ */
+ if (disable_irq_post || !irq_remapping_enabled)
+ return 0;
+
+ for_each_iommu(iommu, drhd)
+ if (!cap_pi_support(iommu->cap))
+ return 0;
+
+ return 1;
+ default:
+ pr_warn("Unknown irq remapping capability.\n");
+ return 0;
+ }
+}
+
static int ir_parse_one_hpet_scope(struct acpi_dmar_device_scope *scope,
struct intel_iommu *iommu,
struct acpi_dmar_hardware_unit *drhd)
@@ -948,6 +974,7 @@ static struct irq_domain *intel_get_irq_domain(struct irq_alloc_info *info)
struct irq_remap_ops intel_irq_remap_ops = {
.supported = intel_irq_remapping_supported,
+ .capability = intel_irq_remapping_capability,
.prepare = dmar_table_init,
.enable = intel_enable_irq_remapping,
.disable = disable_irq_remapping,
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 3c3da04d..e63e969 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -24,6 +24,8 @@ int irq_remap_broken;
int disable_sourceid_checking;
int no_x2apic_optout;
+int disable_irq_post = 1;
+
static struct irq_remap_ops *remap_ops;
static void irq_remapping_disable_io_apic(void)
diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index 2d991b2..cb1f46d 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -36,6 +36,8 @@ extern int disable_sourceid_checking;
extern int no_x2apic_optout;
extern int irq_remapping_enabled;
+extern int disable_irq_post;
+
struct irq_remap_ops {
/* Check whether Interrupt Remapping is supported */
int (*supported)(void);
@@ -76,6 +78,8 @@ extern void ir_ack_apic_edge(struct irq_data *data);
#define disable_irq_remap 1
#define irq_remap_broken 0
+#define disable_irq_post 1
+
#endif /* CONFIG_IRQ_REMAP */
#endif /* __IRQ_REMAPPING_H */
--
1.9.1
This patch defines a new interface kvm_find_dest_vcpu for
VT-d PI, which can returns the destination vCPU of the
interrupt for guests.
Since VT-d PI cannot handle broadcast/multicast interrupt,
Here we only handle Fixed and Lowest priority interrupts.
The current method of handling guest lowest priority interrtups
is to use a counter 'apic_arb_prio' for each vCPU, we choose the
vCPU with smallest 'apic_arb_prio' and then increase it by 1.
However, for VT-d PI, we cannot re-use this, since we no longer
have control to 'apic_arb_prio' with posted interrupt direct
delivery by Hardware.
Here, we introduce a similar way with 'apic_arb_prio' to handle
guest lowest priority interrtups when VT-d PI is used. Here is the
ideas:
- Each vCPU has a counter 'round_robin_counter'.
- When guests sets an interrupts to lowest priority, we choose
the vCPU with smallest 'round_robin_counter' as the destination,
then increase it.
Signed-off-by: Feng Wu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++++
virt/kvm/irq_comm.c | 41 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6ed0c30..7a41808 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -358,6 +358,7 @@ struct kvm_vcpu_arch {
struct kvm_lapic *apic; /* kernel irqchip context */
unsigned long apic_attention;
int32_t apic_arb_prio;
+ int32_t round_robin_counter;
int mp_state;
u64 ia32_misc_enable_msr;
bool tpr_access_reporting;
@@ -1093,4 +1094,7 @@ int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
+bool kvm_find_dest_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
+ struct kvm_vcpu **dest_vcpu);
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 963b899..f3c5d69 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -317,6 +317,47 @@ out:
return r;
}
+int kvm_compare_rr_counter(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
+{
+ return vcpu1->arch.round_robin_counter -
+ vcpu2->arch.round_robin_counter;
+}
+
+bool kvm_find_dest_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
+ struct kvm_vcpu **dest_vcpu)
+{
+ int i, r = 0;
+ struct kvm_vcpu *vcpu, *dest = NULL;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (!kvm_apic_present(vcpu))
+ continue;
+
+ if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
+ irq->dest_id, irq->dest_mode))
+ continue;
+
+ if (!kvm_is_dm_lowest_prio(irq)) {
+ r++;
+ *dest_vcpu = vcpu;
+ } else if (kvm_lapic_enabled(vcpu)) {
+ if (!dest)
+ dest = vcpu;
+ else if (kvm_compare_rr_counter(vcpu, dest) < 0)
+ dest = vcpu;
+ }
+ }
+
+ if (dest) {
+ dest->arch.round_robin_counter++;
+ *dest_vcpu = dest;
+ return true;
+ } else if (r == 1)
+ return true;
+
+ return false;
+}
+
#define IOAPIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
.u.irqchip = { .irqchip = KVM_IRQCHIP_IOAPIC, .pin = (irq) } }
--
1.9.1
Define a interface to get PI descriptor address from the vCPU structure.
Signed-off-by: Feng Wu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx.c | 12 ++++++++++++
2 files changed, 13 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7a41808..9b45b78 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -772,6 +772,7 @@ struct kvm_x86_ops {
int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
+ u64 (*get_pi_desc_addr)(struct kvm_vcpu *vcpu);
};
struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 66ca275..81f239b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -562,6 +562,11 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
return container_of(vcpu, struct vcpu_vmx, vcpu);
}
+struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
+{
+ return &(to_vmx(vcpu)->pi_desc);
+}
+
#define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
#define FIELD(number, name) [number] = VMCS12_OFFSET(name)
#define FIELD64(number, name) [number] = VMCS12_OFFSET(name), \
@@ -4298,6 +4303,11 @@ static void vmx_sync_pir_to_irr_dummy(struct kvm_vcpu *vcpu)
return;
}
+static u64 vmx_get_pi_desc_addr(struct kvm_vcpu *vcpu)
+{
+ return __pa((u64)vcpu_to_pi_desc(vcpu));
+}
+
/*
* Set up the vmcs's constant host-state fields, i.e., host-state fields that
* will not change in the lifetime of the guest.
@@ -9244,6 +9254,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
.check_nested_events = vmx_check_nested_events,
.sched_in = vmx_sched_in,
+
+ .get_pi_desc_addr = vmx_get_pi_desc_addr,
};
static int __init vmx_init(void)
--
1.9.1
This patch adds pi_clear_sn and pi_set_sn to struct kvm_x86_ops,
so we can set/clear SN outside vmx.
Signed-off-by: Feng Wu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/vmx.c | 13 +++++++++++++
2 files changed, 16 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9b45b78..cd4b174 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -773,6 +773,9 @@ struct kvm_x86_ops {
void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
u64 (*get_pi_desc_addr)(struct kvm_vcpu *vcpu);
+
+ void (*pi_clear_sn)(struct kvm_vcpu *vcpu);
+ void (*pi_set_sn)(struct kvm_vcpu *vcpu);
};
struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 81f239b..ee3b735 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -567,6 +567,16 @@ struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
return &(to_vmx(vcpu)->pi_desc);
}
+static void vmx_pi_clear_sn(struct kvm_vcpu *vcpu)
+{
+ pi_clear_sn(vcpu_to_pi_desc(vcpu));
+}
+
+static void vmx_pi_set_sn(struct kvm_vcpu *vcpu)
+{
+ pi_set_sn(vcpu_to_pi_desc(vcpu));
+}
+
#define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
#define FIELD(number, name) [number] = VMCS12_OFFSET(name)
#define FIELD64(number, name) [number] = VMCS12_OFFSET(name), \
@@ -9256,6 +9266,9 @@ static struct kvm_x86_ops vmx_x86_ops = {
.sched_in = vmx_sched_in,
.get_pi_desc_addr = vmx_get_pi_desc_addr,
+
+ .pi_clear_sn = vmx_pi_clear_sn,
+ .pi_set_sn = vmx_pi_set_sn,
};
static int __init vmx_init(void)
--
1.9.1
Move struct kvm_irq_routing_table from irqchip.c to kvm_host.h,
so we can use it outside of irqchip.c.
Signed-off-by: Feng Wu <[email protected]>
---
include/linux/kvm_host.h | 19 +++++++++++++++++++
virt/kvm/irqchip.c | 11 -----------
2 files changed, 19 insertions(+), 11 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0b9659d..cfa85ac 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -335,6 +335,25 @@ struct kvm_kernel_irq_routing_entry {
struct hlist_node link;
};
+#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
+
+struct kvm_irq_routing_table {
+ int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
+ struct kvm_kernel_irq_routing_entry *rt_entries;
+ u32 nr_rt_entries;
+ /*
+ * Array indexed by gsi. Each entry contains list of irq chips
+ * the gsi is connected to.
+ */
+ struct hlist_head map[0];
+};
+
+#else
+
+struct kvm_irq_routing_table {};
+
+#endif
+
#ifndef KVM_PRIVATE_MEM_SLOTS
#define KVM_PRIVATE_MEM_SLOTS 0
#endif
diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 7f256f3..cdf29a6 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -31,17 +31,6 @@
#include <trace/events/kvm.h>
#include "irq.h"
-struct kvm_irq_routing_table {
- int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
- struct kvm_kernel_irq_routing_entry *rt_entries;
- u32 nr_rt_entries;
- /*
- * Array indexed by gsi. Each entry contains list of irq chips
- * the gsi is connected to.
- */
- struct hlist_head map[0];
-};
-
int kvm_irq_map_gsi(struct kvm *kvm,
struct kvm_kernel_irq_routing_entry *entries, int gsi)
{
--
1.9.1
Make kvm_set_msi_irq() public, we can use this function outside.
Signed-off-by: Feng Wu <[email protected]>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/irq_comm.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cfa85ac..5cd4420 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -785,6 +785,8 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
struct kvm_irq_ack_notifier *kian);
int kvm_request_irq_source_id(struct kvm *kvm);
void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
+void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
+ struct kvm_lapic_irq *irq);
#ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index f3c5d69..231671a 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -106,7 +106,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
return r;
}
-static inline void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
+void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
struct kvm_lapic_irq *irq)
{
trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data);
--
1.9.1
Currently, we use a global vector as the Posted-Interrupts
Notification Event for all the vCPUs in the system. We need
to introduce another global vector for VT-d Posted-Interrtups,
which will be used to wakeup the sleep vCPU when an external
interrupt from a direct-assigned device happens for that vCPU.
Signed-off-by: Feng Wu <[email protected]>
---
arch/x86/include/asm/entry_arch.h | 2 ++
arch/x86/include/asm/hardirq.h | 1 +
arch/x86/include/asm/hw_irq.h | 2 ++
arch/x86/include/asm/irq_vectors.h | 1 +
arch/x86/kernel/entry_64.S | 2 ++
arch/x86/kernel/irq.c | 27 +++++++++++++++++++++++++++
arch/x86/kernel/irqinit.c | 2 ++
7 files changed, 37 insertions(+)
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index dc5fa66..27ca0af 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -23,6 +23,8 @@ BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
#ifdef CONFIG_HAVE_KVM
BUILD_INTERRUPT3(kvm_posted_intr_ipi, POSTED_INTR_VECTOR,
smp_kvm_posted_intr_ipi)
+BUILD_INTERRUPT3(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR,
+ smp_kvm_posted_intr_wakeup_ipi)
#endif
/*
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 0f5fb6b..9866065 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -14,6 +14,7 @@ typedef struct {
#endif
#ifdef CONFIG_HAVE_KVM
unsigned int kvm_posted_intr_ipis;
+ unsigned int kvm_posted_intr_wakeup_ipis;
#endif
unsigned int x86_platform_ipis; /* arch dependent */
unsigned int apic_perf_irqs;
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index e7ae6eb..38fac9b 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -29,6 +29,7 @@
extern asmlinkage void apic_timer_interrupt(void);
extern asmlinkage void x86_platform_ipi(void);
extern asmlinkage void kvm_posted_intr_ipi(void);
+extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
extern asmlinkage void error_interrupt(void);
extern asmlinkage void irq_work_interrupt(void);
@@ -92,6 +93,7 @@ extern void trace_call_function_single_interrupt(void);
#define trace_irq_move_cleanup_interrupt irq_move_cleanup_interrupt
#define trace_reboot_interrupt reboot_interrupt
#define trace_kvm_posted_intr_ipi kvm_posted_intr_ipi
+#define trace_kvm_posted_intr_wakeup_ipi kvm_posted_intr_wakeup_ipi
#endif /* CONFIG_TRACING */
struct irq_domain;
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index b26cb12..dca94f2 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -105,6 +105,7 @@
/* Vector for KVM to deliver posted interrupt IPI */
#ifdef CONFIG_HAVE_KVM
#define POSTED_INTR_VECTOR 0xf2
+#define POSTED_INTR_WAKEUP_VECTOR 0xf1
#endif
/*
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e61c14a..a598447 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -960,6 +960,8 @@ apicinterrupt X86_PLATFORM_IPI_VECTOR \
#ifdef CONFIG_HAVE_KVM
apicinterrupt3 POSTED_INTR_VECTOR \
kvm_posted_intr_ipi smp_kvm_posted_intr_ipi
+apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR \
+ kvm_posted_intr_wakeup_ipi smp_kvm_posted_intr_wakeup_ipi
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 922d285..47408c3 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -237,6 +237,9 @@ __visible void smp_x86_platform_ipi(struct pt_regs *regs)
}
#ifdef CONFIG_HAVE_KVM
+void (*wakeup_handler_callback)(void) = NULL;
+EXPORT_SYMBOL_GPL(wakeup_handler_callback);
+
/*
* Handler for POSTED_INTERRUPT_VECTOR.
*/
@@ -256,6 +259,30 @@ __visible void smp_kvm_posted_intr_ipi(struct pt_regs *regs)
set_irq_regs(old_regs);
}
+
+/*
+ * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ */
+__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+
+ ack_APIC_irq();
+
+ irq_enter();
+
+ exit_idle();
+
+ inc_irq_stat(kvm_posted_intr_wakeup_ipis);
+
+ if (wakeup_handler_callback)
+ wakeup_handler_callback();
+
+ irq_exit();
+
+ set_irq_regs(old_regs);
+}
+
#endif
__visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index 70e181e..844673c 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -144,6 +144,8 @@ static void __init apic_intr_init(void)
#ifdef CONFIG_HAVE_KVM
/* IPI for KVM to deliver posted interrupt */
alloc_intr_gate(POSTED_INTR_VECTOR, kvm_posted_intr_ipi);
+ /* IPI for KVM to deliver interrupt to wake up tasks */
+ alloc_intr_gate(POSTED_INTR_WAKEUP_VECTOR, kvm_posted_intr_wakeup_ipi);
#endif
/* IPI vectors for APIC spurious and error interrupts */
--
1.9.1
This patch updates the Posted-Interrupts Descriptor when vCPU
is blocked.
pre-block:
- Add the vCPU to the blocked per-CPU list
- Clear 'SN'
- Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
post-block:
- Remove the vCPU from the per-CPU list
Signed-off-by: Feng Wu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/vmx.c | 96 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 22 +++++++---
include/linux/kvm_host.h | 4 ++
virt/kvm/kvm_main.c | 6 +++
5 files changed, 123 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 13e3e40..32c110a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -101,6 +101,8 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
#define ASYNC_PF_PER_VCPU 64
+extern void (*wakeup_handler_callback)(void);
+
enum kvm_reg {
VCPU_REGS_RAX = 0,
VCPU_REGS_RCX = 1,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bf2e6cd..a1c83a2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -832,6 +832,13 @@ static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
+/*
+ * We maintian a per-CPU linked-list of vCPU, so in wakeup_handler() we
+ * can find which vCPU should be waken up.
+ */
+static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
+static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
+
static unsigned long *vmx_io_bitmap_a;
static unsigned long *vmx_io_bitmap_b;
static unsigned long *vmx_msr_bitmap_legacy;
@@ -1921,6 +1928,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
struct pi_desc old, new;
unsigned int dest;
+ unsigned long flags;
memset(&old, 0, sizeof(old));
memset(&new, 0, sizeof(new));
@@ -1942,6 +1950,20 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
new.nv = POSTED_INTR_VECTOR;
} while (cmpxchg(&pi_desc->control, old.control,
new.control) != old.control);
+
+ /*
+ * Delete the vCPU from the related wakeup queue
+ * if we are resuming from blocked state
+ */
+ if (vcpu->blocked) {
+ vcpu->blocked = false;
+ spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
+ vcpu->wakeup_cpu), flags);
+ list_del(&vcpu->blocked_vcpu_list);
+ spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
+ vcpu->wakeup_cpu), flags);
+ vcpu->wakeup_cpu = -1;
+ }
}
}
@@ -1950,6 +1972,9 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
if (irq_remapping_cap(IRQ_POSTING_CAP)) {
struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
struct pi_desc old, new;
+ unsigned long flags;
+ int cpu;
+ struct cpumask cpu_others_mask;
memset(&old, 0, sizeof(old));
memset(&new, 0, sizeof(new));
@@ -1961,6 +1986,54 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
pi_set_sn(&new);
} while (cmpxchg(&pi_desc->control, old.control,
new.control) != old.control);
+ } else if (vcpu->blocked) {
+ /*
+ * The vcpu is blocked on the wait queue.
+ * Store the blocked vCPU on the list of the
+ * vcpu->wakeup_cpu, which is the destination
+ * of the wake-up notification event.
+ */
+ vcpu->wakeup_cpu = vcpu->cpu;
+ spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
+ vcpu->wakeup_cpu), flags);
+ list_add_tail(&vcpu->blocked_vcpu_list,
+ &per_cpu(blocked_vcpu_on_cpu,
+ vcpu->wakeup_cpu));
+ spin_unlock_irqrestore(
+ &per_cpu(blocked_vcpu_on_cpu_lock,
+ vcpu->wakeup_cpu), flags);
+
+ do {
+ old.control = new.control = pi_desc->control;
+
+ /*
+ * We should not block the vCPU if
+ * an interrupt is posted for it.
+ */
+ if (pi_test_on(pi_desc) == 1) {
+ /*
+ * We need schedule the wakeup worker
+ * on a different cpu other than
+ * vcpu->cpu, because in some case,
+ * schedule_work() will call
+ * try_to_wake_up() which needs acquire
+ * the rq lock. This can cause deadlock.
+ */
+ cpumask_copy(&cpu_others_mask,
+ cpu_online_mask);
+ cpu_clear(vcpu->cpu, cpu_others_mask);
+ cpu = any_online_cpu(cpu_others_mask);
+
+ schedule_work_on(cpu,
+ &vcpu->wakeup_worker);
+ }
+
+ pi_clear_sn(&new);
+
+ /* set 'NV' to 'wakeup vector' */
+ new.nv = POSTED_INTR_WAKEUP_VECTOR;
+ } while (cmpxchg(&pi_desc->control, old.control,
+ new.control) != old.control);
}
}
@@ -2842,6 +2915,8 @@ static int hardware_enable(void)
return -EBUSY;
INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+ INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
+ spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
/*
* Now we can enable the vmclear operation in kdump
@@ -9315,6 +9390,25 @@ static struct kvm_x86_ops vmx_x86_ops = {
.pi_set_sn = vmx_pi_set_sn,
};
+/*
+ * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ */
+void wakeup_handler(void)
+{
+ struct kvm_vcpu *vcpu;
+ int cpu = smp_processor_id();
+
+ spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+ list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
+ blocked_vcpu_list) {
+ struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+
+ if (pi_test_on(pi_desc) == 1)
+ kvm_vcpu_kick(vcpu);
+ }
+ spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+}
+
static int __init vmx_init(void)
{
int r, i, msr;
@@ -9429,6 +9523,8 @@ static int __init vmx_init(void)
update_ple_window_actual_max();
+ wakeup_handler_callback = wakeup_handler;
+
return 0;
out7:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0033df3..1551a46 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_vcpu_reload_apic_access_page(vcpu);
}
+ /*
+ * Since posted-interrupts can be set by VT-d HW now, in this
+ * case, KVM_REQ_EVENT is not set. We move the following
+ * operations out of the if statement.
+ */
+ if (kvm_lapic_enabled(vcpu)) {
+ /*
+ * Update architecture specific hints for APIC
+ * virtual interrupt delivery.
+ */
+ if (kvm_x86_ops->hwapic_irr_update)
+ kvm_x86_ops->hwapic_irr_update(vcpu,
+ kvm_lapic_find_highest_irr(vcpu));
+ }
+
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
kvm_apic_accept_events(vcpu);
if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
@@ -6168,13 +6183,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_x86_ops->enable_irq_window(vcpu);
if (kvm_lapic_enabled(vcpu)) {
- /*
- * Update architecture specific hints for APIC
- * virtual interrupt delivery.
- */
- if (kvm_x86_ops->hwapic_irr_update)
- kvm_x86_ops->hwapic_irr_update(vcpu,
- kvm_lapic_find_highest_irr(vcpu));
update_cr8_intercept(vcpu);
kvm_lapic_sync_to_vapic(vcpu);
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3d7242c..d981d16 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -239,6 +239,9 @@ struct kvm_vcpu {
unsigned long requests;
unsigned long guest_debug;
+ int wakeup_cpu;
+ struct list_head blocked_vcpu_list;
+
struct mutex mutex;
struct kvm_run *run;
@@ -282,6 +285,7 @@ struct kvm_vcpu {
} spin_loop;
#endif
bool preempted;
+ bool blocked;
struct kvm_vcpu_arch arch;
};
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ba53fd6..6deb994 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -233,6 +233,9 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
INIT_WORK(&vcpu->wakeup_worker, wakeup_thread);
+ vcpu->wakeup_cpu = -1;
+ INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
+
page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page) {
r = -ENOMEM;
@@ -243,6 +246,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
kvm_vcpu_set_in_spin_loop(vcpu, false);
kvm_vcpu_set_dy_eligible(vcpu, false);
vcpu->preempted = false;
+ vcpu->blocked = false;
r = kvm_arch_vcpu_init(vcpu);
if (r < 0)
@@ -1752,6 +1756,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
DEFINE_WAIT(wait);
for (;;) {
+ vcpu->blocked = true;
prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
if (kvm_arch_vcpu_runnable(vcpu)) {
@@ -1767,6 +1772,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
}
finish_wait(&vcpu->wq, &wait);
+ vcpu->blocked = false;
}
EXPORT_SYMBOL_GPL(kvm_vcpu_block);
--
1.9.1
This patch adds the kvm-vfio interface for VT-d Posted-Interrrupts.
When guests update MSI/MSI-x information for an assigned-device,
QEMU will use KVM_DEV_VFIO_DEVICE_POST_IRQ attribute to setup
IRTE for VT-d PI. Userspace program can also use
KVM_DEV_VFIO_DEVICE_UNPOST_IRQ to change back to irq remapping mode.
This patch implements these IRQ attributes.
Signed-off-by: Feng Wu <[email protected]>
---
include/linux/kvm_host.h | 20 +++++++++
virt/kvm/vfio.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 127 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5cd4420..ca9a393 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1134,6 +1134,26 @@ static inline int kvm_arch_vfio_set_forward(struct kvm_fwd_irq *fwd_irq,
}
#endif
+#ifdef __KVM_HAVE_ARCH_KVM_VFIO_POST
+/*
+ * kvm_arch_vfio_update_pi_irte - set IRTE for Posted-Interrupts
+ *
+ * @kvm: kvm
+ * @host_irq: host irq of the interrupt
+ * @guest_irq: gsi of the interrupt
+ * @set: set or unset PI
+ * returns 0 on success, < 0 on failure
+ */
+int kvm_arch_vfio_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
+ uint32_t guest_irq, bool set);
+#else
+static int kvm_arch_vfio_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
+ uint32_t guest_irq, bool set)
+{
+ return 0;
+}
+#endif
+
#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 6bc7001..dbc6c3b 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -446,6 +446,99 @@ out:
return ret;
}
+static int kvm_vfio_pci_get_irq_count(struct pci_dev *pdev, int irq_type)
+{
+ if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
+ u8 pin;
+
+ pci_read_config_byte(pdev, PCI_INTERRUPT_PIN, &pin);
+ if (pin)
+ return 1;
+ } else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX)
+ return pci_msi_vec_count(pdev);
+ else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+ return pci_msix_vec_count(pdev);
+
+ return 0;
+}
+
+static int kvm_vfio_control_pi(struct kvm_device *kdev,
+ int32_t __user *argp, bool set)
+{
+ struct kvm_vfio_dev_irq pi_info;
+ uint32_t *gsi;
+ unsigned long minsz;
+ struct vfio_device *vdev;
+ struct msi_desc *entry;
+ struct device *dev;
+ struct pci_dev *pdev;
+ int i, max, ret;
+
+ minsz = offsetofend(struct kvm_vfio_dev_irq, count);
+
+ if (copy_from_user(&pi_info, (void __user *)argp, minsz))
+ return -EFAULT;
+
+ if (pi_info.argsz < minsz || pi_info.index >= VFIO_PCI_NUM_IRQS)
+ return -EINVAL;
+
+ vdev = kvm_vfio_get_vfio_device(pi_info.fd);
+ if (IS_ERR(vdev))
+ return PTR_ERR(vdev);
+
+ dev = kvm_vfio_external_base_device(vdev);
+ if (!dev || !dev_is_pci(dev)) {
+ ret = -EFAULT;
+ goto put_vfio_device;
+ }
+
+ pdev = to_pci_dev(dev);
+
+ max = kvm_vfio_pci_get_irq_count(pdev, pi_info.index);
+ if (max <= 0) {
+ ret = -EFAULT;
+ goto put_vfio_device;
+ }
+
+ if (pi_info.argsz - minsz < pi_info.count * sizeof(u32) ||
+ pi_info.start >= max || pi_info.start + pi_info.count > max) {
+ ret = -EINVAL;
+ goto put_vfio_device;
+ }
+
+ gsi = memdup_user((void __user *)((unsigned long)argp + minsz),
+ pi_info.count * sizeof(u32));
+ if (IS_ERR(gsi)) {
+ ret = PTR_ERR(gsi);
+ goto put_vfio_device;
+ }
+
+#ifdef CONFIG_PCI_MSI
+ for (i = 0; i < pi_info.count; i++) {
+ list_for_each_entry(entry, &pdev->msi_list, list) {
+ if (entry->msi_attrib.entry_nr != pi_info.start+i)
+ continue;
+
+ ret = kvm_arch_vfio_update_pi_irte(kdev->kvm,
+ entry->irq,
+ gsi[i],
+ set);
+ if (ret)
+ goto free_gsi;
+ }
+ }
+#endif
+
+ ret = 0;
+
+free_gsi:
+ kfree(gsi);
+
+put_vfio_device:
+ kvm_vfio_put_vfio_device(vdev);
+ return ret;
+}
+
static int kvm_vfio_set_device(struct kvm_device *kdev, long attr, u64 arg)
{
int32_t __user *argp = (int32_t __user *)(unsigned long)arg;
@@ -456,6 +549,14 @@ static int kvm_vfio_set_device(struct kvm_device *kdev, long attr, u64 arg)
case KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ:
ret = kvm_vfio_control_irq_forward(kdev, attr, argp);
break;
+#ifdef __KVM_HAVE_ARCH_KVM_VFIO_POST
+ case KVM_DEV_VFIO_DEVICE_POST_IRQ:
+ ret = kvm_vfio_control_pi(kdev, argp, 1);
+ break;
+ case KVM_DEV_VFIO_DEVICE_UNPOST_IRQ:
+ ret = kvm_vfio_control_pi(kdev, argp, 0);
+ break;
+#endif
default:
ret = -ENXIO;
}
@@ -511,6 +612,12 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
case KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ:
return 0;
#endif
+#ifdef __KVM_HAVE_ARCH_KVM_VFIO_POST
+ case KVM_DEV_VFIO_DEVICE_POST_IRQ:
+ case KVM_DEV_VFIO_DEVICE_UNPOST_IRQ:
+ return 0;
+#endif
+
}
break;
}
--
1.9.1
Define a wakeup worker thread for a vCPU.
Signed-off-by: Feng Wu <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 9 +++++++++
2 files changed, 10 insertions(+)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ca9a393..3d7242c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -249,6 +249,7 @@ struct kvm_vcpu {
int sigset_active;
sigset_t sigset;
struct kvm_vcpu_stat stat;
+ struct work_struct wakeup_worker;
#ifdef CONFIG_HAS_IOMEM
int mmio_needed;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25ffac9..ba53fd6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -211,6 +211,13 @@ void kvm_make_scan_ioapic_request(struct kvm *kvm)
kvm_make_all_cpus_request(kvm, KVM_REQ_SCAN_IOAPIC);
}
+static void wakeup_thread(struct work_struct *work)
+{
+ struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
+ wakeup_worker);
+ kvm_vcpu_kick(vcpu);
+}
+
int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
{
struct page *page;
@@ -224,6 +231,8 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
init_waitqueue_head(&vcpu->wq);
kvm_async_pf_vcpu_init(vcpu);
+ INIT_WORK(&vcpu->wakeup_worker, wakeup_thread);
+
page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page) {
r = -ENOMEM;
--
1.9.1
Enable VT-d Posted-Interrtups and add a command line
parameter for it.
Signed-off-by: Feng Wu <[email protected]>
---
Documentation/kernel-parameters.txt | 1 +
drivers/iommu/irq_remapping.c | 12 ++++++++----
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 838f377..324b790 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1453,6 +1453,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
nosid disable Source ID checking
no_x2apic_optout
BIOS x2APIC opt-out request will be ignored
+ nopost disable Interrupt Posting
iomem= Disable strict checking of access to MMIO memory
strict regions from userspace.
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index b008663..aa3cd23 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -24,7 +24,7 @@ int irq_remap_broken;
int disable_sourceid_checking;
int no_x2apic_optout;
-int disable_irq_post = 1;
+int disable_irq_post = 0;
static struct irq_remap_ops *remap_ops;
@@ -59,14 +59,18 @@ static __init int setup_irqremap(char *str)
return -EINVAL;
while (*str) {
- if (!strncmp(str, "on", 2))
+ if (!strncmp(str, "on", 2)) {
disable_irq_remap = 0;
- else if (!strncmp(str, "off", 3))
+ disable_irq_post = 0;
+ } else if (!strncmp(str, "off", 3)) {
disable_irq_remap = 1;
- else if (!strncmp(str, "nosid", 5))
+ disable_irq_post = 1;
+ } else if (!strncmp(str, "nosid", 5))
disable_sourceid_checking = 1;
else if (!strncmp(str, "no_x2apic_optout", 16))
no_x2apic_optout = 1;
+ else if (!strncmp(str, "nopost", 6))
+ disable_irq_post = 1;
str += strcspn(str, ",");
while (*str == ',')
--
1.9.1
Currently, we don't support urgent interrupt, all interrupts
are recognized as non-urgent interrupt, so we cannot send
posted-interrupt when 'SN' is set.
Signed-off-by: Feng Wu <[email protected]>
---
arch/x86/kvm/vmx.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a1c83a2..0aee151 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4401,15 +4401,22 @@ static int vmx_vm_has_apicv(struct kvm *kvm)
static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- int r;
+ int r, sn;
if (pi_test_and_set_pir(vector, &vmx->pi_desc))
return;
+ /*
+ * Currently, we don't support urgent interrupt, all interrupts
+ * are recognized as non-urgent interrupt, so we cannot send
+ * posted-interrupt when 'SN' is set.
+ */
+ sn = pi_test_sn(&vmx->pi_desc);
+
r = pi_test_and_set_on(&vmx->pi_desc);
kvm_make_request(KVM_REQ_EVENT, vcpu);
#ifdef CONFIG_SMP
- if (!r && (vcpu->mode == IN_GUEST_MODE))
+ if (!r && !sn && (vcpu->mode == IN_GUEST_MODE))
apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
POSTED_INTR_VECTOR);
else
--
1.9.1
This patch updates the Posted-Interrupts Descriptor when vCPU
is preempted.
sched out:
- Set 'SN' to suppress furture non-urgent interrupts posted for
the vCPU.
sched in:
- Clear 'SN'
- Change NDST if vCPU is scheduled to a different CPU
- Set 'NV' to POSTED_INTR_VECTOR
Signed-off-by: Feng Wu <[email protected]>
---
arch/x86/kvm/vmx.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ee3b735..bf2e6cd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1916,10 +1916,54 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
vmx->loaded_vmcs->cpu = cpu;
}
+
+ if (irq_remapping_cap(IRQ_POSTING_CAP)) {
+ struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+ struct pi_desc old, new;
+ unsigned int dest;
+
+ memset(&old, 0, sizeof(old));
+ memset(&new, 0, sizeof(new));
+
+ do {
+ old.control = new.control = pi_desc->control;
+ if (vcpu->cpu != cpu) {
+ dest = cpu_physical_id(cpu);
+
+ if (x2apic_enabled())
+ new.ndst = dest;
+ else
+ new.ndst = (dest << 8) & 0xFF00;
+ }
+
+ pi_clear_sn(&new);
+
+ /* set 'NV' to 'notification vector' */
+ new.nv = POSTED_INTR_VECTOR;
+ } while (cmpxchg(&pi_desc->control, old.control,
+ new.control) != old.control);
+ }
}
static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
{
+ if (irq_remapping_cap(IRQ_POSTING_CAP)) {
+ struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+ struct pi_desc old, new;
+
+ memset(&old, 0, sizeof(old));
+ memset(&new, 0, sizeof(new));
+
+ /* Set SN when the vCPU is preempted */
+ if (vcpu->preempted) {
+ do {
+ old.control = new.control = pi_desc->control;
+ pi_set_sn(&new);
+ } while (cmpxchg(&pi_desc->control, old.control,
+ new.control) != old.control);
+ }
+ }
+
__vmx_load_host_state(to_vmx(vcpu));
if (!vmm_exclusive) {
__loaded_vmcs_clear(to_vmx(vcpu)->loaded_vmcs);
--
1.9.1
This patch defines macro __KVM_HAVE_ARCH_KVM_VFIO_POST and
implement kvm_arch_vfio_update_pi_irte for x86 architecture.
Signed-off-by: Feng Wu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/kvm_vfio_x86.c | 77 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 80 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kvm/kvm_vfio_x86.c
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cd4b174..13e3e40 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -82,6 +82,8 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
(base_gfn >> KVM_HPAGE_GFN_SHIFT(level));
}
+#define __KVM_HAVE_ARCH_KVM_VFIO_POST
+
#define SELECTOR_TI_MASK (1 << 2)
#define SELECTOR_RPL_MASK 0x03
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 25d22b2..8809d58 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -14,7 +14,7 @@ kvm-$(CONFIG_KVM_DEVICE_ASSIGNMENT) += $(KVM)/assigned-dev.o $(KVM)/iommu.o
kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
- i8254.o cpuid.o pmu.o
+ i8254.o cpuid.o pmu.o kvm_vfio_x86.o
kvm-intel-y += vmx.o
kvm-amd-y += svm.o
diff --git a/arch/x86/kvm/kvm_vfio_x86.c b/arch/x86/kvm/kvm_vfio_x86.c
new file mode 100644
index 0000000..2ba618e
--- /dev/null
+++ b/arch/x86/kvm/kvm_vfio_x86.c
@@ -0,0 +1,77 @@
+/*
+ * Copyright (C) 2014 Intel Corporation.
+ * Authors: Feng Wu <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kvm_host.h>
+#include <asm/irq_remapping.h>
+
+/*
+ * kvm_arch_vfio_update_pi_irte - set IRTE for Posted-Interrupts
+ *
+ * @kvm: kvm
+ * @host_irq: host irq of the interrupt
+ * @guest_irq: gsi of the interrupt
+ * @set: set or unset PI
+ * returns 0 on success, < 0 on failure
+ */
+int kvm_arch_vfio_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
+ uint32_t guest_irq, bool set)
+{
+ struct kvm_kernel_irq_routing_entry *e;
+ struct kvm_irq_routing_table *irq_rt;
+ struct kvm_lapic_irq irq;
+ struct kvm_vcpu *vcpu;
+ struct vcpu_data vcpu_info;
+ int idx, ret = -EINVAL;
+
+ idx = srcu_read_lock(&kvm->irq_srcu);
+ irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
+ BUG_ON(guest_irq >= irq_rt->nr_rt_entries);
+
+ hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
+ if (e->type != KVM_IRQ_ROUTING_MSI)
+ continue;
+ /*
+ * VT-d PI cannot support posting multicast/broadcast
+ * interrupts to a VCPU, we still use interrupt remapping
+ * for these kind of interrupts.
+ */
+
+ kvm_set_msi_irq(e, &irq);
+ if (!kvm_find_dest_vcpu(kvm, &irq, &vcpu))
+ continue;
+
+ vcpu_info.pi_desc_addr = kvm_x86_ops->get_pi_desc_addr(vcpu);
+ vcpu_info.vector = irq.vector;
+
+ if (set)
+ ret = irq_set_vcpu_affinity(host_irq, &vcpu_info);
+ else {
+ /* suppress notification event before unposting */
+ kvm_x86_ops->pi_set_sn(vcpu);
+ ret = irq_set_vcpu_affinity(host_irq, NULL);
+ kvm_x86_ops->pi_clear_sn(vcpu);
+ }
+
+ if (ret < 0) {
+ printk(KERN_INFO "%s: failed to update PI IRTE\n",
+ __func__);
+ goto out;
+ }
+ }
+
+ ret = 0;
+out:
+ srcu_read_unlock(&kvm->irq_srcu, idx);
+ return ret;
+}
--
1.9.1
This patch adds and documents two new attributes
KVM_DEV_VFIO_DEVICE_POST_IRQ and KVM_DEV_VFIO_DEVICE_UNPOST_IRQ
in KVM_DEV_VFIO_DEVICE group. The new attributes are used for
VT-d Posted-Interrupts.
When guest OS changes the interrupt configuration for an
assigned device, such as, MSI/MSIx data/address fields,
QEMU will use this IRQ attribute to tell KVM to update the
related IRTE according the VT-d Posted-Interrrupts Specification,
such as, the guest vector should be updated in the related IRTE.
Signed-off-by: Feng Wu <[email protected]>
---
Documentation/virtual/kvm/devices/vfio.txt | 9 +++++++++
include/uapi/linux/kvm.h | 11 +++++++++++
2 files changed, 20 insertions(+)
diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index f7aff29..ecfbf61 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt
@@ -42,3 +42,12 @@ activated before VFIO_DEVICE_SET_IRQS has been called to trigger the IRQ
or associate an eventfd to it. Unforwarding can only be called while the
signaling has been disabled with VFIO_DEVICE_SET_IRQS. If this condition is
not satisfied, the command returns an -EBUSY.
+
+ KVM_DEV_VFIO_DEVICE_POST_IRQ: set a VFIO device IRQ as posted
+ KVM_DEV_VFIO_DEVICE_UNPOST_IRQ: set a VFIO device IRQ as remapped
+For this attribute, kvm_device_attr.addr points to a kvm_vfio_dev_irq struct.
+
+When guest OS changes the interrupt configuration for an assigned device,
+such as, MSI/MSIx data/address fields, QEMU will use this IRQ attribute
+to tell KVM to update the related IRTE according the VT-d Posted-Interrrupts
+Specification, such as, the guest vector should be updated in the related IRTE.
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a269a42..8f51487 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -949,6 +949,8 @@ struct kvm_device_attr {
#define KVM_DEV_VFIO_DEVICE 2
#define KVM_DEV_VFIO_DEVICE_FORWARD_IRQ 1
#define KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ 2
+#define KVM_DEV_VFIO_DEVICE_POST_IRQ 3
+#define KVM_DEV_VFIO_DEVICE_UNPOST_IRQ 4
enum kvm_device_type {
KVM_DEV_TYPE_FSL_MPIC_20 = 1,
@@ -973,6 +975,15 @@ struct kvm_arch_forwarded_irq {
__u32 gsi; /* gsi, ie. virtual IRQ number */
};
+struct kvm_vfio_dev_irq {
+ __u32 argsz;
+ __u32 fd; /* file descriptor of the VFIO device */
+ __u32 index; /* VFIO device IRQ index */
+ __u32 start;
+ __u32 count;
+ __u32 gsi[]; /* gsi, ie. virtual IRQ number */
+};
+
/*
* ioctls for VM fds
*/
--
1.9.1
This patch adds some helper functions to manipulate the
Posted-Interrupts Descriptor.
Signed-off-by: Feng Wu <[email protected]>
---
arch/x86/kvm/vmx.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index abdb84f..0b1383e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -408,6 +408,8 @@ struct nested_vmx {
};
#define POSTED_INTR_ON 0
+#define POSTED_INTR_SN 1
+
/* Posted-Interrupt Descriptor */
struct pi_desc {
u32 pir[8]; /* Posted interrupt requested */
@@ -443,6 +445,30 @@ static int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
}
+static void pi_clear_sn(struct pi_desc *pi_desc)
+{
+ return clear_bit(POSTED_INTR_SN,
+ (unsigned long *)&pi_desc->control);
+}
+
+static void pi_set_sn(struct pi_desc *pi_desc)
+{
+ return set_bit(POSTED_INTR_SN,
+ (unsigned long *)&pi_desc->control);
+}
+
+static int pi_test_on(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_ON,
+ (unsigned long *)&pi_desc->control);
+}
+
+static int pi_test_sn(struct pi_desc *pi_desc)
+{
+ return test_bit(POSTED_INTR_SN,
+ (unsigned long *)&pi_desc->control);
+}
+
struct vcpu_vmx {
struct kvm_vcpu vcpu;
unsigned long host_rsp;
--
1.9.1
Change struct pi_desc for VT-d Posted-Interrupts.
Signed-off-by: Feng Wu <[email protected]>
---
arch/x86/kvm/vmx.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3e556c6..abdb84f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -411,8 +411,19 @@ struct nested_vmx {
/* Posted-Interrupt Descriptor */
struct pi_desc {
u32 pir[8]; /* Posted interrupt requested */
- u32 control; /* bit 0 of control is outstanding notification bit */
- u32 rsvd[7];
+ union {
+ struct {
+ u64 on : 1,
+ sn : 1,
+ rsvd_1 : 13,
+ ndm : 1,
+ nv : 8,
+ rsvd_2 : 8,
+ ndst : 32;
+ };
+ u64 control;
+ };
+ u32 rsvd[6];
} __aligned(64);
static bool pi_test_and_set_on(struct pi_desc *pi_desc)
--
1.9.1
This patch initializes the VT-d Posted-Interrupts Descriptor.
Signed-off-by: Feng Wu <[email protected]>
---
arch/x86/kvm/vmx.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0b1383e..66ca275 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -45,6 +45,7 @@
#include <asm/perf_event.h>
#include <asm/debugreg.h>
#include <asm/kexec.h>
+#include <asm/irq_remapping.h>
#include "trace.h"
@@ -4433,6 +4434,30 @@ static void ept_set_mmio_spte_mask(void)
kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull);
}
+static void pi_desc_init(struct vcpu_vmx *vmx)
+{
+ unsigned int dest;
+
+ if (!irq_remapping_cap(IRQ_POSTING_CAP))
+ return;
+
+ /*
+ * Initialize Posted-Interrupt Descriptor
+ */
+
+ pi_clear_sn(&vmx->pi_desc);
+ vmx->pi_desc.nv = POSTED_INTR_VECTOR;
+
+ /* Physical mode for Notificaiton Event */
+ vmx->pi_desc.ndm = 0;
+ dest = cpu_physical_id(vmx->vcpu.cpu);
+
+ if (x2apic_enabled())
+ vmx->pi_desc.ndst = dest;
+ else
+ vmx->pi_desc.ndst = (dest << 8) & 0xFF00;
+}
+
/*
* Sets up the vmcs for emulated real mode.
*/
@@ -4476,6 +4501,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
+
+ pi_desc_init(vmx);
}
if (ple_gap) {
--
1.9.1
This patch adds a new interface irq_remapping_cap() to detect
whether irq remapping supports new features, such as VT-d
Posted-Interrupts. We export this function out, so that KVM
code can check this and use this mechanism properly.
Signed-off-by: Feng Wu <[email protected]>
Reviewed-by: Jiang Liu <[email protected]>
---
arch/x86/include/asm/irq_remapping.h | 2 ++
drivers/iommu/irq_remapping.c | 12 ++++++++++++
2 files changed, 14 insertions(+)
diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index f87ac70..b3ad067 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -37,6 +37,7 @@ enum irq_remap_cap {
extern void setup_irq_remapping_ops(void);
extern int irq_remapping_supported(void);
+extern bool irq_remapping_cap(enum irq_remap_cap cap);
extern void set_irq_remapping_broken(void);
extern int irq_remapping_prepare(void);
extern int irq_remapping_enable(void);
@@ -69,6 +70,7 @@ struct vcpu_data {
static inline void setup_irq_remapping_ops(void) { }
static inline int irq_remapping_supported(void) { return 0; }
+static bool irq_remapping_cap(enum irq_remap_cap cap) { return 0; }
static inline void set_irq_remapping_broken(void) { }
static inline int irq_remapping_prepare(void) { return -ENODEV; }
static inline int irq_remapping_enable(void) { return -ENODEV; }
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index e63e969..b008663 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -103,6 +103,18 @@ int irq_remapping_supported(void)
return remap_ops->supported();
}
+bool irq_remapping_cap(enum irq_remap_cap cap)
+{
+ if (disable_irq_post)
+ return 0;
+
+ if (!remap_ops || !remap_ops->capability)
+ return 0;
+
+ return remap_ops->capability(cap);
+}
+EXPORT_SYMBOL_GPL(irq_remapping_cap);
+
int __init irq_remapping_prepare(void)
{
if (!remap_ops || !remap_ops->prepare)
--
1.9.1
We don't need to migrate the irqs for VT-d Posted-Interrupts here.
When 'pst' is set in IRTE, the associated irq will be posted to
guests instead of interrupt remapping. The destination of the
interrupt is set in Posted-Interrupts Descriptor, and the migration
happens during vCPU scheduling.
However, we still update the cached irte here, which can be used
when changing back to remapping mode.
Signed-off-by: Feng Wu <[email protected]>
Reviewed-by: Jiang Liu <[email protected]>
---
drivers/iommu/intel_irq_remapping.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 48c2051..ab9057a 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -977,6 +977,7 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
{
struct intel_ir_data *ir_data = data->chip_data;
struct irte *irte = &ir_data->irte_entry;
+ struct irte_pi *irte_pi = (struct irte_pi *)irte;
struct irq_cfg *cfg = irqd_cfg(data);
struct irq_data *parent = data->parent_data;
int ret;
@@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
*/
irte->vector = cfg->vector;
irte->dest_id = IRTE_DEST(cfg->dest_apicid);
- modify_irte(&ir_data->irq_2_iommu, irte);
+
+ /* We don't need to modify irte if the interrupt is for posting. */
+ if (irte_pi->pst != 1)
+ modify_irte(&ir_data->irq_2_iommu, irte);
/*
* After this point, all the interrupts will start arriving
--
1.9.1
Implement irq_set_vcpu_affinity for pci_msi_ir_controller.
Signed-off-by: Feng Wu <[email protected]>
Reviewed-by: Jiang Liu <[email protected]>
---
arch/x86/kernel/apic/msi.c | 1 +
include/linux/irq.h | 3 +++
2 files changed, 4 insertions(+)
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index da163da..b0ed073 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -152,6 +152,7 @@ static struct irq_chip pci_msi_ir_controller = {
.irq_mask = pci_msi_mask_irq,
.irq_ack = irq_chip_ack_parent,
.irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_set_vcpu_affinity = irq_chip_set_vcpu_affinity_parent,
.flags = IRQCHIP_SKIP_SET_WAKE,
};
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 83abafc..5dcaa7f 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -464,6 +464,9 @@ extern void irq_chip_eoi_parent(struct irq_data *data);
extern int irq_chip_set_affinity_parent(struct irq_data *data,
const struct cpumask *dest,
bool force);
+extern int irq_chip_set_vcpu_affinity_parent(struct irq_data *data,
+ void *vcpu_info);
+
#endif
static inline void irq_chip_write_msi_msg(struct irq_data *data,
--
1.9.1
Add a new irte_pi structure for VT-d Posted-Interrupts.
Signed-off-by: Feng Wu <[email protected]>
Reviewed-by: Jiang Liu <[email protected]>
---
include/linux/dmar.h | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index 8473756..c7f9cda 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -212,6 +212,38 @@ struct irte {
};
};
+struct irte_pi {
+ union {
+ struct {
+ __u64 present : 1,
+ fpd : 1,
+ __reserved_1 : 6,
+ avail : 4,
+ __reserved_2 : 2,
+ urg : 1,
+ pst : 1,
+ vector : 8,
+ __reserved_3 : 14,
+ pda_l : 26;
+ };
+ __u64 low;
+ };
+
+ union {
+ struct {
+ __u64 sid : 16,
+ sq : 2,
+ svt : 2,
+ __reserved_4 : 12,
+ pda_h : 32;
+ };
+ __u64 high;
+ };
+};
+
+#define PDA_LOW_BIT 26
+#define PDA_HIGH_BIT 32
+
enum {
IRQ_REMAP_XAPIC_MODE,
IRQ_REMAP_X2APIC_MODE,
--
1.9.1
Hi Paolo,
Could you please have a look at this series? Thanks a lot!
Thanks,
Feng
> -----Original Message-----
> From: Wu, Feng
> Sent: Friday, December 12, 2014 11:15 PM
> To: [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Wu, Feng
> Subject: [v3 00/26] Add VT-d Posted-Interrupts support
>
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
>
> You can find the VT-d Posted-Interrtups Spec. in the following URL:
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technolog
> y/vt-directed-io-spec.html
>
> v1->v2:
> * Use VFIO framework to enable this feature, the VFIO part of this series is
> base on Eric's patch "[PATCH v3 0/8] KVM-VFIO IRQ forward control"
> * Rebase this patchset on
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git,
> then revise some irq logic based on the new hierarchy irqdomain patches
> provided
> by Jiang Liu <[email protected]>
>
> v2->v3:
> * Adjust the Posted-interrupts Descriptor updating logic when vCPU is
> preempted or blocked.
> * KVM_DEV_VFIO_DEVICE_POSTING_IRQ -->
> KVM_DEV_VFIO_DEVICE_POST_IRQ
> * __KVM_HAVE_ARCH_KVM_VFIO_POSTING -->
> __KVM_HAVE_ARCH_KVM_VFIO_POST
> * Add KVM_DEV_VFIO_DEVICE_UNPOST_IRQ attribute for VFIO irq, which
> can be used to change back to remapping mode.
> * Fix typo
>
> This patch series is made of the following groups:
> 1-6: Some preparation changes in iommu and irq component, this is based on
> the
> new hierarchy irqdomain logic.
> 7-9, 26: IOMMU changes for VT-d Posted-Interrupts, such as, feature
> detection,
> command line parameter.
> 10-17, 22-25: Changes related to KVM itself.
> 18-20: Changes in VFIO component, this part was previously sent out as
> "[RFC PATCH v2 0/2] kvm-vfio: implement the vfio skeleton for VT-d
> Posted-Interrupts"
> 21: x86 irq related changes
>
> Feng Wu (26):
> genirq: Introduce irq_set_vcpu_affinity() to target an interrupt to a
> VCPU
> iommu: Add new member capability to struct irq_remap_ops
> iommu, x86: Define new irte structure for VT-d Posted-Interrupts
> iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip
> x86, irq: Implement irq_set_vcpu_affinity for pci_msi_ir_controller
> iommu, x86: No need to migrating irq for VT-d Posted-Interrupts
> iommu, x86: Add cap_pi_support() to detect VT-d PI capability
> iommu, x86: Add intel_irq_remapping_capability() for Intel
> iommu, x86: define irq_remapping_cap()
> KVM: change struct pi_desc for VT-d Posted-Interrupts
> KVM: Add some helper functions for Posted-Interrupts
> KVM: Initialize VT-d Posted-Interrupts Descriptor
> KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
> KVM: Get Posted-Interrupts descriptor address from struct kvm_vcpu
> KVM: add interfaces to control PI outside vmx
> KVM: Make struct kvm_irq_routing_table accessible
> KVM: make kvm_set_msi_irq() public
> KVM: kvm-vfio: User API for VT-d Posted-Interrupts
> KVM: kvm-vfio: implement the VFIO skeleton for VT-d Posted-Interrupts
> KVM: x86: kvm-vfio: VT-d posted-interrupts setup
> x86, irq: Define a global vector for VT-d Posted-Interrupts
> KVM: Define a wakeup worker thread for vCPU
> KVM: Update Posted-Interrupts Descriptor when vCPU is preempted
> KVM: Update Posted-Interrupts Descriptor when vCPU is blocked
> KVM: Suppress posted-interrupt when 'SN' is set
> iommu/vt-d: Add a command line parameter for VT-d posted-interrupts
>
> Documentation/kernel-parameters.txt | 1 +
> Documentation/virtual/kvm/devices/vfio.txt | 9 ++
> arch/x86/include/asm/entry_arch.h | 2 +
> arch/x86/include/asm/hardirq.h | 1 +
> arch/x86/include/asm/hw_irq.h | 2 +
> arch/x86/include/asm/irq_remapping.h | 11 ++
> arch/x86/include/asm/irq_vectors.h | 1 +
> arch/x86/include/asm/kvm_host.h | 12 ++
> arch/x86/kernel/apic/msi.c | 1 +
> arch/x86/kernel/entry_64.S | 2 +
> arch/x86/kernel/irq.c | 27 ++++
> arch/x86/kernel/irqinit.c | 2 +
> arch/x86/kvm/Makefile | 2 +-
> arch/x86/kvm/kvm_vfio_x86.c | 77 +++++++++
> arch/x86/kvm/vmx.c | 244
> ++++++++++++++++++++++++++++-
> arch/x86/kvm/x86.c | 22 ++-
> drivers/iommu/intel_irq_remapping.c | 68 +++++++-
> drivers/iommu/irq_remapping.c | 24 ++-
> drivers/iommu/irq_remapping.h | 8 +
> include/linux/dmar.h | 32 ++++
> include/linux/intel-iommu.h | 1 +
> include/linux/irq.h | 7 +
> include/linux/kvm_host.h | 46 ++++++
> include/uapi/linux/kvm.h | 11 ++
> kernel/irq/chip.c | 14 ++
> kernel/irq/manage.c | 20 +++
> virt/kvm/irq_comm.c | 43 ++++-
> virt/kvm/irqchip.c | 11 --
> virt/kvm/kvm_main.c | 15 ++
> virt/kvm/vfio.c | 107 +++++++++++++
> 30 files changed, 795 insertions(+), 28 deletions(-)
> create mode 100644 arch/x86/kvm/kvm_vfio_x86.c
>
> --
> 1.9.1
On 12/12/2014 16:14, Feng Wu wrote:
> Move struct kvm_irq_routing_table from irqchip.c to kvm_host.h,
> so we can use it outside of irqchip.c.
>
> Signed-off-by: Feng Wu <[email protected]>
> ---
> include/linux/kvm_host.h | 19 +++++++++++++++++++
> virt/kvm/irqchip.c | 11 -----------
> 2 files changed, 19 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 0b9659d..cfa85ac 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -335,6 +335,25 @@ struct kvm_kernel_irq_routing_entry {
> struct hlist_node link;
> };
>
> +#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
> +
> +struct kvm_irq_routing_table {
> + int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
> + struct kvm_kernel_irq_routing_entry *rt_entries;
> + u32 nr_rt_entries;
> + /*
> + * Array indexed by gsi. Each entry contains list of irq chips
> + * the gsi is connected to.
> + */
> + struct hlist_head map[0];
> +};
> +
> +#else
> +
> +struct kvm_irq_routing_table {};
If possible, just make this "struct kvm_irq_routing_table;" and pull
this line to include/linux/kvm_types.h.
Paolo
> +
> +#endif
> +
> #ifndef KVM_PRIVATE_MEM_SLOTS
> #define KVM_PRIVATE_MEM_SLOTS 0
> #endif
> diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
> index 7f256f3..cdf29a6 100644
> --- a/virt/kvm/irqchip.c
> +++ b/virt/kvm/irqchip.c
> @@ -31,17 +31,6 @@
> #include <trace/events/kvm.h>
> #include "irq.h"
>
> -struct kvm_irq_routing_table {
> - int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
> - struct kvm_kernel_irq_routing_entry *rt_entries;
> - u32 nr_rt_entries;
> - /*
> - * Array indexed by gsi. Each entry contains list of irq chips
> - * the gsi is connected to.
> - */
> - struct hlist_head map[0];
> -};
> -
> int kvm_irq_map_gsi(struct kvm *kvm,
> struct kvm_kernel_irq_routing_entry *entries, int gsi)
> {
>
On 12/12/2014 16:14, Feng Wu wrote:
> This patch updates the Posted-Interrupts Descriptor when vCPU
> is blocked.
>
> pre-block:
> - Add the vCPU to the blocked per-CPU list
> - Clear 'SN'
Should SN be already clear (and NV set to POSTED_INTR_VECTOR)? Can it
happen that you go from sched-out to blocked without doing a sched-in first?
In fact, if this is possible, what happens if vcpu->preempted &&
vcpu->blocked?
> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
>
> post-block:
> - Remove the vCPU from the per-CPU list
Paolo
> Signed-off-by: Feng Wu <[email protected]>
On 12/12/2014 16:14, Feng Wu wrote:
> + if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> + struct pi_desc old, new;
> + unsigned int dest;
> +
> + memset(&old, 0, sizeof(old));
> + memset(&new, 0, sizeof(new));
This is quite expensive. Just use an u64 for old_control and
new_control, instead of a full struct.
>
> + pi_clear_sn(&new);
This can be simply new.sn = 0. It does not need atomic operations.
Same in patch 24 (if needed at all there---see the reply there).
>
> + if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> + struct pi_desc old, new;
> +
> + memset(&old, 0, sizeof(old));
> + memset(&new, 0, sizeof(new));
> +
Here you do not need old/new at all because...
> + if (vcpu->preempted) {
> + do {
> + old.control = new.control = pi_desc->control;
> + pi_set_sn(&new);
> + } while (cmpxchg(&pi_desc->control, old.control,
> + new.control) != old.control);
this can do pi_set_sn directly on pi_desc, without the cmpxchg.
Paolo
On 12/12/2014 16:14, Feng Wu wrote:
> Make kvm_set_msi_irq() public, we can use this function outside.
>
> Signed-off-by: Feng Wu <[email protected]>
> ---
> include/linux/kvm_host.h | 2 ++
> virt/kvm/irq_comm.c | 2 +-
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index cfa85ac..5cd4420 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -785,6 +785,8 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
> struct kvm_irq_ack_notifier *kian);
> int kvm_request_irq_source_id(struct kvm *kvm);
> void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
> +void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
> + struct kvm_lapic_irq *irq);
This function is now in arch/x86, so please add to
arch/x86/include/asm/kvm_host.h instead.
> #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
> int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot);
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index f3c5d69..231671a 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -106,7 +106,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> return r;
> }
>
> -static inline void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
> +void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
> struct kvm_lapic_irq *irq)
> {
> trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data);
>
On 12/12/2014 16:14, Feng Wu wrote:
> Currently, we don't support urgent interrupt, all interrupts
> are recognized as non-urgent interrupt, so we cannot send
> posted-interrupt when 'SN' is set.
Can this happen? If the vcpu is in guest mode, it cannot have been
scheduled out, and that's the only case when SN is set.
Paolo
> Signed-off-by: Feng Wu <[email protected]>
> ---
> arch/x86/kvm/vmx.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index a1c83a2..0aee151 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4401,15 +4401,22 @@ static int vmx_vm_has_apicv(struct kvm *kvm)
> static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> - int r;
> + int r, sn;
>
> if (pi_test_and_set_pir(vector, &vmx->pi_desc))
> return;
>
> + /*
> + * Currently, we don't support urgent interrupt, all interrupts
> + * are recognized as non-urgent interrupt, so we cannot send
> + * posted-interrupt when 'SN' is set.
> + */
> + sn = pi_test_sn(&vmx->pi_desc);
> +
> r = pi_test_and_set_on(&vmx->pi_desc);
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> #ifdef CONFIG_SMP
> - if (!r && (vcpu->mode == IN_GUEST_MODE))
> + if (!r && !sn && (vcpu->mode == IN_GUEST_MODE))
> apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
> POSTED_INTR_VECTOR);
> else
> --
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Paolo Bonzini
> Sent: Thursday, December 18, 2014 1:43 AM
> To: Wu, Feng; Thomas Gleixner; Ingo Molnar; H. Peter Anvin; [email protected];
> Gleb Natapov; Paolo Bonzini; [email protected]; [email protected]; Alex
> Williamson; Jiang Liu
> Cc: [email protected]; [email protected]; KVM list;
> Eric Auger
> Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
>
>
>
> On 12/12/2014 16:14, Feng Wu wrote:
> > Currently, we don't support urgent interrupt, all interrupts
> > are recognized as non-urgent interrupt, so we cannot send
> > posted-interrupt when 'SN' is set.
>
> Can this happen? If the vcpu is in guest mode, it cannot have been
> scheduled out, and that's the only case when SN is set.
>
> Paolo
Currently, the only place where SN is set is vCPU is preempted and waiting for
the next scheduling in the runqueue. But I am not sure whether we need to
set SN for other purpose in future. Adding SN checking here is just to follow
the Spec. non-urgent interrupts are suppressed when SN is set.
Thanks,
Feng
>
> > Signed-off-by: Feng Wu
> <[email protected]>
> > ---
> > arch/x86/kvm/vmx.c | 11 +++++++++--
> > 1 file changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index a1c83a2..0aee151 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -4401,15 +4401,22 @@ static int vmx_vm_has_apicv(struct kvm *kvm)
> > static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int
> vector)
> > {
> > struct vcpu_vmx *vmx = to_vmx(vcpu);
> > - int r;
> > + int r, sn;
> >
> > if (pi_test_and_set_pir(vector, &vmx->pi_desc))
> > return;
> >
> > + /*
> > + * Currently, we don't support urgent interrupt, all interrupts
> > + * are recognized as non-urgent interrupt, so we cannot send
> > + * posted-interrupt when 'SN' is set.
> > + */
> > + sn = pi_test_sn(&vmx->pi_desc);
> > +
> > r = pi_test_and_set_on(&vmx->pi_desc);
> > kvm_make_request(KVM_REQ_EVENT, vcpu);
> > #ifdef CONFIG_SMP
> > - if (!r && (vcpu->mode == IN_GUEST_MODE))
> > + if (!r && !sn && (vcpu->mode == IN_GUEST_MODE))
> > apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
> > POSTED_INTR_VECTOR);
> > else
> > --
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
> -----Original Message-----
> From: Paolo Bonzini [mailto:[email protected]] On Behalf Of Paolo
> Bonzini
> Sent: Thursday, December 18, 2014 1:11 AM
> To: Wu, Feng; Thomas Gleixner; Ingo Molnar; H. Peter Anvin; [email protected];
> Gleb Natapov; Paolo Bonzini; [email protected]; [email protected]; Alex
> Williamson; jiang.liu-VuQAYsv1563Yd54FQh9/[email protected]
> Cc: [email protected]; [email protected]; KVM list;
> Eric Auger
> Subject: Re: [v3 23/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is preempted
>
>
>
> On 12/12/2014 16:14, Feng Wu wrote:
> > + if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> > + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > + struct pi_desc old, new;
> > + unsigned int dest;
> > +
> > + memset(&old, 0, sizeof(old));
> > + memset(&new, 0, sizeof(new));
>
> This is quite expensive. Just use an u64 for old_control and
> new_control, instead of a full struct.
>
> >
> > + pi_clear_sn(&new);
>
> This can be simply new.sn = 0. It does not need atomic operations.
Thanks for your comments, Paolo!
If we use u64 new_control, we cannot use new.sn any more.
Maybe we can change the struct pi_desc {} like this:
typedef struct pid_control{
u64 on : 1,
sn : 1,
rsvd_1 : 13,
ndm : 1,
nv : 8,
rsvd_2 : 8,
ndst : 32;
}pid_control_t;
struct pi_desc {
u32 pir[8]; /* Posted interrupt requested */
pid_control_t control;
u32 rsvd[6];
} __aligned(64);
Then we can define pid_control_t new_control, old_control. And use new_control.sn = 0.
What is your opinon?
Thanks,
Feng
>
> Same in patch 24 (if needed at all there---see the reply there).
>
> >
> > + if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> > + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> > + struct pi_desc old, new;
> > +
> > + memset(&old, 0, sizeof(old));
> > + memset(&new, 0, sizeof(new));
> > +
>
> Here you do not need old/new at all because...
>
> > + if (vcpu->preempted) {
> > + do {
> > + old.control = new.control = pi_desc->control;
> > + pi_set_sn(&new);
> > + } while (cmpxchg(&pi_desc->control, old.control,
> > + new.control) != old.control);
>
> this can do pi_set_sn directly on pi_desc, without the cmpxchg.
>
> Paolo
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Paolo Bonzini
> Sent: Thursday, December 18, 2014 1:10 AM
> To: Wu, Feng; Thomas Gleixner; Ingo Molnar; H. Peter Anvin; [email protected];
> Gleb Natapov; Paolo Bonzini; [email protected]; [email protected]; Alex
> Williamson
> Cc: [email protected]; [email protected]; KVM list;
> Eric Auger
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
>
>
>
> On 12/12/2014 16:14, Feng Wu wrote:
> > This patch updates the Posted-Interrupts Descriptor when vCPU
> > is blocked.
> >
> > pre-block:
> > - Add the vCPU to the blocked per-CPU list
> > - Clear 'SN'
>
> Should SN be already clear (and NV set to POSTED_INTR_VECTOR)?
I think the SN bit should be clear here, Adding it here is just to make sure
SN is clear when vCPU is blocked, so it can receive wakeup notification event later.
> Can it
> happen that you go from sched-out to blocked without doing a sched-in first?
>
I cannot imagine this scenario, can you please be more specific? Thanks a lot!
> In fact, if this is possible, what happens if vcpu->preempted &&
> vcpu->blocked?
In fact, vcpu->preempted && vcpu->blocked happens sometimes, but I think there is
no issues. Please refer to the following case:
kvm_vcpu_block()
-> vcpu->blocked = true;
-> prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
before schedule() is called, this vcpu is woken up by another guy, so
the state of the vcpu associated thread is changed to TASK_RUNNING,
then preemption happens after interrupts or the following schedule() is
hit, this will call kvm_sched_out(), in which current->state == TASK_RUNNING
and vcpu->preempted is set to true. So now vcpu->preempted and vcpu->blocked
are both true. In vmx_vcpu_put(), we will check vcpu->preempted first, so
the vCPU will not be blocked, and the vcpu->blocked will be set the false in
vmx_vcpu_load().
But maybe I need do a little change to the vmx_vcpu_load() like below:
/*
* Delete the vCPU from the related wakeup queue
* if we are resuming from blocked state
*/
if (vcpu->blocked) {
vcpu->blocked = false;
+ /* if wakeup_cpu == -1, the vcpu is currently not blocked on any
+ pCPU, don't need dequeue here */
+ if (vcpu->wakeup_cpu != -1) {
spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
vcpu->wakeup_cpu), flags);
list_del(&vcpu->blocked_vcpu_list);
spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
vcpu->wakeup_cpu), flags);
vcpu->wakeup_cpu = -1;
+ }
}
Any ideas about this? Thanks a lot!
Thanks,
Feng
-> schedule();
>
> > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> >
> > post-block:
> > - Remove the vCPU from the per-CPU list
>
> Paolo
>
> > Signed-off-by: Feng Wu <[email protected]>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 18/12/2014 04:15, Wu, Feng wrote:
> Thanks for your comments, Paolo!
>
> If we use u64 new_control, we cannot use new.sn any more.
> Maybe we can change the struct pi_desc {} like this:
>
> typedef struct pid_control{
> u64 on : 1,
> sn : 1,
> rsvd_1 : 13,
> ndm : 1,
> nv : 8,
> rsvd_2 : 8,
> ndst : 32;
> }pid_control_t;
>
> struct pi_desc {
> u32 pir[8]; /* Posted interrupt requested */
> pid_control_t control;
Probably something like this to keep the union:
typedef union pid_control {
u64 full;
struct {
u64 on : 1,
...
} fields;
};
> u32 rsvd[6];
> } __aligned(64);
>
>
> Then we can define pid_control_t new_control, old_control. And use new_control.sn = 0.
>
> What is your opinon?
Sure. Alternatively, keep using struct pi_desc new; just
do not zero it, nor access any field outide the control word.
Paolo
On 18/12/2014 04:16, Wu, Feng wrote:
>>> pre-block:
>>> - Add the vCPU to the blocked per-CPU list
>>> - Clear 'SN'
>>
>> Should SN be already clear (and NV set to POSTED_INTR_VECTOR)?
>
> I think the SN bit should be clear here, Adding it here is just to make sure
> SN is clear when vCPU is blocked, so it can receive wakeup notification event later.
Then, please, WARN if the SN bit is set inside the if (vcpu->blocked).
Inside that if you can just add the vCPU to the blocked list on vcpu_put.
>> Can it
>> happen that you go from sched-out to blocked without doing a sched-in first?
>>
>
> I cannot imagine this scenario, can you please be more specific? Thanks a lot!
I cannot either. :) But it would be the case where SN is not cleared.
So we agree that it cannot happen.
>> In fact, if this is possible, what happens if vcpu->preempted &&
>> vcpu->blocked?
>
> In fact, vcpu->preempted && vcpu->blocked happens sometimes, but I think there is
> no issues. Please refer to the following case:
I agree that there should be no issues. But if it can happen, it's better:
1) to separate the handling of preemption and blocking: preemption
handles SN/NV/NDST, blocking handles the wakeup list.
2) to change this
+ } else if (vcpu->blocked) {
+ /*
+ * The vcpu is blocked on the wait queue.
+ * Store the blocked vCPU on the list of the
+ * vcpu->wakeup_cpu, which is the destination
+ * of the wake-up notification event.
to just
}
if (vcpu->blocked) {
...
}
> kvm_vcpu_block()
> -> vcpu->blocked = true;
> -> prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
>
> before schedule() is called, this vcpu is woken up by another guy, so
> the state of the vcpu associated thread is changed to TASK_RUNNING,
> then preemption happens after interrupts or the following schedule() is
> hit, this will call kvm_sched_out(), in which current->state == TASK_RUNNING
> and vcpu->preempted is set to true. So now vcpu->preempted and vcpu->blocked
> are both true. In vmx_vcpu_put(), we will check vcpu->preempted first, so
> the vCPU will not be blocked, and the vcpu->blocked will be set the false in
> vmx_vcpu_load().
>
> But maybe I need do a little change to the vmx_vcpu_load() like below:
>
> /*
> * Delete the vCPU from the related wakeup queue
> * if we are resuming from blocked state
> */
> if (vcpu->blocked) {
> vcpu->blocked = false;
> + /* if wakeup_cpu == -1, the vcpu is currently not blocked on any
> + pCPU, don't need dequeue here */
> + if (vcpu->wakeup_cpu != -1) {
> spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> vcpu->wakeup_cpu), flags);
> list_del(&vcpu->blocked_vcpu_list);
> spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
> vcpu->wakeup_cpu), flags);
> vcpu->wakeup_cpu = -1;
> + }
> }
Good idea.
Paolo
> Any ideas about this? Thanks a lot!
>
> Thanks,
> Feng
>
>
> -> schedule();
>
>
>>
>>> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
>>>
>>> post-block:
>>> - Remove the vCPU from the per-CPU list
>>
>> Paolo
>>
>>> Signed-off-by: Feng Wu <[email protected]>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 18/12/2014 04:14, Wu, Feng wrote:
>
>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Paolo Bonzini
>> Sent: Thursday, December 18, 2014 1:43 AM
>> To: Wu, Feng; Thomas Gleixner; Ingo Molnar; H. Peter Anvin; [email protected];
>> Gleb Natapov; Paolo Bonzini; [email protected]; [email protected]; Alex
>> Williamson; Jiang Liu
>> Cc: [email protected]; [email protected]; KVM list;
>> Eric Auger
>> Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
>>
>>
>>
>> On 12/12/2014 16:14, Feng Wu wrote:
>>> Currently, we don't support urgent interrupt, all interrupts
>>> are recognized as non-urgent interrupt, so we cannot send
>>> posted-interrupt when 'SN' is set.
>>
>> Can this happen? If the vcpu is in guest mode, it cannot have been
>> scheduled out, and that's the only case when SN is set.
>>
>> Paolo
>
> Currently, the only place where SN is set is vCPU is preempted and waiting for
> the next scheduling in the runqueue. But I am not sure whether we need to
> set SN for other purpose in future. Adding SN checking here is just to follow
> the Spec. non-urgent interrupts are suppressed when SN is set.
I would change that to a WARN_ON_ONCE then.
Paolo
> Thanks,
> Feng
>
>>
>>> Signed-off-by: Feng Wu
>> <[email protected]>
>>> ---
>>> arch/x86/kvm/vmx.c | 11 +++++++++--
>>> 1 file changed, 9 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index a1c83a2..0aee151 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -4401,15 +4401,22 @@ static int vmx_vm_has_apicv(struct kvm *kvm)
>>> static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int
>> vector)
>>> {
>>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>>> - int r;
>>> + int r, sn;
>>>
>>> if (pi_test_and_set_pir(vector, &vmx->pi_desc))
>>> return;
>>>
>>> + /*
>>> + * Currently, we don't support urgent interrupt, all interrupts
>>> + * are recognized as non-urgent interrupt, so we cannot send
>>> + * posted-interrupt when 'SN' is set.
>>> + */
>>> + sn = pi_test_sn(&vmx->pi_desc);
>>> +
>>> r = pi_test_and_set_on(&vmx->pi_desc);
>>> kvm_make_request(KVM_REQ_EVENT, vcpu);
>>> #ifdef CONFIG_SMP
>>> - if (!r && (vcpu->mode == IN_GUEST_MODE))
>>> + if (!r && !sn && (vcpu->mode == IN_GUEST_MODE))
>>> apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
>>> POSTED_INTR_VECTOR);
>>> else
>>> --
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
Feng Wu wrote on 2014-12-12:
> We don't need to migrate the irqs for VT-d Posted-Interrupts here.
> When 'pst' is set in IRTE, the associated irq will be posted to guests
> instead of interrupt remapping. The destination of the interrupt is
> set in Posted-Interrupts Descriptor, and the migration happens during
> vCPU scheduling.
>
> However, we still update the cached irte here, which can be used when
> changing back to remapping mode.
>
> Signed-off-by: Feng Wu <[email protected]>
> Reviewed-by: Jiang Liu <[email protected]>
> ---
> drivers/iommu/intel_irq_remapping.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
> diff --git a/drivers/iommu/intel_irq_remapping.c
> b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a 100644 ---
> a/drivers/iommu/intel_irq_remapping.c +++
> b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@
> intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
> {
> struct intel_ir_data *ir_data = data->chip_data; struct irte *irte =
> &ir_data->irte_entry; + struct irte_pi *irte_pi = (struct irte_pi
> *)irte; struct irq_cfg *cfg = irqd_cfg(data); struct irq_data *parent
> = data->parent_data; int ret;
> @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data,
> const struct cpumask *mask,
> */
> irte->vector = cfg->vector;
> irte->dest_id = IRTE_DEST(cfg->dest_apicid);
> - modify_irte(&ir_data->irq_2_iommu, irte);
> +
> + /* We don't need to modify irte if the interrupt is for posting. */
> + if (irte_pi->pst != 1)
> + modify_irte(&ir_data->irq_2_iommu, irte);
What happens if user changes the IRQ affinity manually?
>
> /*
> * After this point, all the interrupts will start arriving
Best regards,
Yang
Feng Wu wrote on 2014-12-12:
> This patch defines a new interface kvm_find_dest_vcpu for
> VT-d PI, which can returns the destination vCPU of the
> interrupt for guests.
>
> Since VT-d PI cannot handle broadcast/multicast interrupt,
> Here we only handle Fixed and Lowest priority interrupts.
>
> The current method of handling guest lowest priority interrtups
> is to use a counter 'apic_arb_prio' for each vCPU, we choose the
> vCPU with smallest 'apic_arb_prio' and then increase it by 1.
> However, for VT-d PI, we cannot re-use this, since we no longer
> have control to 'apic_arb_prio' with posted interrupt direct
> delivery by Hardware.
>
> Here, we introduce a similar way with 'apic_arb_prio' to handle guest
> lowest priority interrtups when VT-d PI is used. Here is the ideas: -
> Each vCPU has a counter 'round_robin_counter'. - When guests sets an
> interrupts to lowest priority, we choose the vCPU with smallest
> 'round_robin_counter' as the destination, then increase it.
How this can work well? All subsequent interrupts are delivered to one vCPU? It shouldn't be the best solution, need more consideration. Also, I think you should take the apic_arb_prio into consider since the priority is for the whole vCPU not for one interrupt.
Best regards,
Yang
Feng Wu wrote on 2014-12-12:
> Currently, we use a global vector as the Posted-Interrupts
> Notification Event for all the vCPUs in the system. We need to
> introduce another global vector for VT-d Posted-Interrtups, which will
> be used to wakeup the sleep vCPU when an external interrupt from a direct-assigned device happens for that vCPU.
>
Hi Feng,
Since the idea of two global vectors mechanism is from me, please add me to the comments.
> Signed-off-by: Feng Wu <[email protected]>
> ---
> arch/x86/include/asm/entry_arch.h | 2 ++
> arch/x86/include/asm/hardirq.h | 1 +
> arch/x86/include/asm/hw_irq.h | 2 ++
> arch/x86/include/asm/irq_vectors.h | 1 +
> arch/x86/kernel/entry_64.S | 2 ++
> arch/x86/kernel/irq.c | 27 +++++++++++++++++++++++++++
> arch/x86/kernel/irqinit.c | 2 ++
> 7 files changed, 37 insertions(+)
> diff --git a/arch/x86/include/asm/entry_arch.h
> b/arch/x86/include/asm/entry_arch.h index dc5fa66..27ca0af 100644 ---
> a/arch/x86/include/asm/entry_arch.h +++
> b/arch/x86/include/asm/entry_arch.h @@ -23,6 +23,8 @@
> BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR) #ifdef
> CONFIG_HAVE_KVM BUILD_INTERRUPT3(kvm_posted_intr_ipi, POSTED_INTR_VECTOR,
> smp_kvm_posted_intr_ipi)
> +BUILD_INTERRUPT3(kvm_posted_intr_wakeup_ipi, POSTED_INTR_WAKEUP_VECTOR,
> + smp_kvm_posted_intr_wakeup_ipi)
> #endif
>
> /*
> diff --git a/arch/x86/include/asm/hardirq.h
> b/arch/x86/include/asm/hardirq.h index 0f5fb6b..9866065 100644
> --- a/arch/x86/include/asm/hardirq.h
> +++ b/arch/x86/include/asm/hardirq.h
> @@ -14,6 +14,7 @@ typedef struct {
> #endif #ifdef CONFIG_HAVE_KVM unsigned int kvm_posted_intr_ipis;
> + unsigned int kvm_posted_intr_wakeup_ipis; #endif unsigned int
> x86_platform_ipis; /* arch dependent */ unsigned int apic_perf_irqs;
> diff --git a/arch/x86/include/asm/hw_irq.h
> b/arch/x86/include/asm/hw_irq.h index e7ae6eb..38fac9b 100644
> --- a/arch/x86/include/asm/hw_irq.h
> +++ b/arch/x86/include/asm/hw_irq.h
> @@ -29,6 +29,7 @@
> extern asmlinkage void apic_timer_interrupt(void); extern asmlinkage
> void x86_platform_ipi(void); extern asmlinkage void
> kvm_posted_intr_ipi(void); +extern asmlinkage void
> kvm_posted_intr_wakeup_ipi(void);
> extern asmlinkage void error_interrupt(void); extern asmlinkage void
> irq_work_interrupt(void);
>
> @@ -92,6 +93,7 @@ extern void
> trace_call_function_single_interrupt(void);
> #define trace_irq_move_cleanup_interrupt irq_move_cleanup_interrupt
> #define trace_reboot_interrupt reboot_interrupt #define
> trace_kvm_posted_intr_ipi kvm_posted_intr_ipi
> +#define trace_kvm_posted_intr_wakeup_ipi kvm_posted_intr_wakeup_ipi
> #endif /* CONFIG_TRACING */
>
> struct irq_domain;
> diff --git a/arch/x86/include/asm/irq_vectors.h
> b/arch/x86/include/asm/irq_vectors.h index b26cb12..dca94f2 100644 ---
> a/arch/x86/include/asm/irq_vectors.h +++
> b/arch/x86/include/asm/irq_vectors.h @@ -105,6 +105,7 @@
> /* Vector for KVM to deliver posted interrupt IPI */ #ifdef
> CONFIG_HAVE_KVM #define POSTED_INTR_VECTOR 0xf2 +#define
> POSTED_INTR_WAKEUP_VECTOR 0xf1 #endif
>
> /*
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index e61c14a..a598447 100644 --- a/arch/x86/kernel/entry_64.S +++
> b/arch/x86/kernel/entry_64.S @@ -960,6 +960,8 @@ apicinterrupt
> X86_PLATFORM_IPI_VECTOR \ #ifdef CONFIG_HAVE_KVM
> apicinterrupt3 POSTED_INTR_VECTOR \
> kvm_posted_intr_ipi smp_kvm_posted_intr_ipi
> +apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR \
> + kvm_posted_intr_wakeup_ipi smp_kvm_posted_intr_wakeup_ipi
> #endif
>
> #ifdef CONFIG_X86_MCE_THRESHOLD
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index
> 922d285..47408c3 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -237,6 +237,9 @@ __visible void smp_x86_platform_ipi(struct pt_regs
> *regs) }
>
> #ifdef CONFIG_HAVE_KVM
> +void (*wakeup_handler_callback)(void) = NULL;
> +EXPORT_SYMBOL_GPL(wakeup_handler_callback); +
> /*
> * Handler for POSTED_INTERRUPT_VECTOR.
> */
> @@ -256,6 +259,30 @@ __visible void smp_kvm_posted_intr_ipi(struct
> pt_regs *regs)
>
> set_irq_regs(old_regs);
> }
> +
> +/*
> + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
> + */
> +__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs) {
> + struct pt_regs *old_regs = set_irq_regs(regs);
> +
> + ack_APIC_irq();
> +
> + irq_enter();
> +
> + exit_idle();
> +
> + inc_irq_stat(kvm_posted_intr_wakeup_ipis);
> +
> + if (wakeup_handler_callback)
> + wakeup_handler_callback();
> +
> + irq_exit();
> +
> + set_irq_regs(old_regs);
> +}
> +
> #endif
>
> __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs) diff
> --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index
> 70e181e..844673c 100644 --- a/arch/x86/kernel/irqinit.c +++
> b/arch/x86/kernel/irqinit.c @@ -144,6 +144,8 @@ static void __init
> apic_intr_init(void) #ifdef CONFIG_HAVE_KVM
> /* IPI for KVM to deliver posted interrupt */
> alloc_intr_gate(POSTED_INTR_VECTOR, kvm_posted_intr_ipi);
> + /* IPI for KVM to deliver interrupt to wake up tasks */
> + alloc_intr_gate(POSTED_INTR_WAKEUP_VECTOR,
> +kvm_posted_intr_wakeup_ipi);
> #endif
>
> /* IPI vectors for APIC spurious and error interrupts */
Best regards,
Yang
Paolo Bonzini wrote on 2014-12-18:
>
>
> On 18/12/2014 04:14, Wu, Feng wrote:
>>
>>
>> [email protected] wrote on mailto:[email protected]] On Behalf Of Paolo:
>>> [email protected]; Gleb Natapov; Paolo Bonzini; [email protected];
>>> [email protected]; Alex Williamson;
>>> joro-zLv9SwRftAIdnm+Jiang
>>> Liu
>>> Cc: [email protected];
>>> [email protected]; KVM list;
>>> Eric Auger
>>> Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
>>> set
>>>
>>>
>>>
>>> On 12/12/2014 16:14, Feng Wu wrote:
>>>> Currently, we don't support urgent interrupt, all interrupts are
>>>> recognized as non-urgent interrupt, so we cannot send
>>>> posted-interrupt when 'SN' is set.
>>>
>>> Can this happen? If the vcpu is in guest mode, it cannot have been
>>> scheduled out, and that's the only case when SN is set.
>>>
>>> Paolo
>>
>> Currently, the only place where SN is set is vCPU is preempted and
If the vCPU is preempted, shouldn't the subsequent be ignored? What happens if a PI is occurs when vCPU is preempted?
>> waiting for the next scheduling in the runqueue. But I am not sure
>> whether we need to set SN for other purpose in future. Adding SN
>> checking here is just to follow the Spec. non-urgent interrupts are
>> suppressed
> when SN is set.
>
> I would change that to a WARN_ON_ONCE then.
Best regards,
Yang
Feng Wu wrote on 2014-12-12:
> This patch initializes the VT-d Posted-Interrupts Descriptor.
>
> Signed-off-by: Feng Wu <[email protected]>
> ---
> arch/x86/kvm/vmx.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index
> 0b1383e..66ca275 100644 --- a/arch/x86/kvm/vmx.c +++
> b/arch/x86/kvm/vmx.c @@ -45,6 +45,7 @@
> #include <asm/perf_event.h>
> #include <asm/debugreg.h>
> #include <asm/kexec.h>
> +#include <asm/irq_remapping.h>
>
> #include "trace.h"
> @@ -4433,6 +4434,30 @@ static void ept_set_mmio_spte_mask(void)
> kvm_mmu_set_mmio_spte_mask((0x3ull << 62) | 0x6ull); }
> +static void pi_desc_init(struct vcpu_vmx *vmx) {
> + unsigned int dest;
> +
> + if (!irq_remapping_cap(IRQ_POSTING_CAP))
> + return;
> +
> + /*
> + * Initialize Posted-Interrupt Descriptor
> + */
> +
> + pi_clear_sn(&vmx->pi_desc);
> + vmx->pi_desc.nv = POSTED_INTR_VECTOR;
Here.
> +
> + /* Physical mode for Notificaiton Event */
> + vmx->pi_desc.ndm = 0;
And from here..
> + dest = cpu_physical_id(vmx->vcpu.cpu);
> +
> + if (x2apic_enabled())
> + vmx->pi_desc.ndst = dest;
> + else
> + vmx->pi_desc.ndst = (dest << 8) & 0xFF00; }
> +
..to here are useless. The right place to update PI descriptor is where vcpu got loaded not in initialization.
> /*
> * Sets up the vmcs for emulated real mode.
> */
> @@ -4476,6 +4501,8 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
>
> vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
> vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
> +
> + pi_desc_init(vmx);
> }
>
> if (ple_gap) {
Best regards,
Yang
On 18/12/2014 15:49, Zhang, Yang Z wrote:
>>> Here, we introduce a similar way with 'apic_arb_prio' to handle
>>> guest lowest priority interrtups when VT-d PI is used. Here is
>>> the ideas: - Each vCPU has a counter 'round_robin_counter'. -
>>> When guests sets an interrupts to lowest priority, we choose the
>>> vCPU with smallest 'round_robin_counter' as the destination, then
>>> increase it.
>
> How this can work well? All subsequent interrupts are delivered to
> one vCPU? It shouldn't be the best solution, need more consideration.
Well, it's a hardware limitation. The alternative (which is easy to
implement) is to only do PI for single-CPU interrupts. This should work
well for multiqueue NICs (and of course for UP guests :)), so perhaps
it's a good idea to only support that as a first attempt.
Paolo
> Also, I think you should take the apic_arb_prio into consider since
> the priority is for the whole vCPU not for one interrupt.
> -----Original Message-----
> From: Zhang, Yang Z
> Sent: Thursday, December 18, 2014 10:55 PM
> To: Wu, Feng; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Wu, Feng
> Subject: RE: [v3 21/26] x86, irq: Define a global vector for VT-d
> Posted-Interrupts
>
> Feng Wu wrote on 2014-12-12:
> > Currently, we use a global vector as the Posted-Interrupts
> > Notification Event for all the vCPUs in the system. We need to
> > introduce another global vector for VT-d Posted-Interrtups, which will
> > be used to wakeup the sleep vCPU when an external interrupt from a
> direct-assigned device happens for that vCPU.
> >
>
> Hi Feng,
>
> Since the idea of two global vectors mechanism is from me, please add me to
> the comments.
No problem, Yang, I will add a "suggested-by Yang Zhang <[email protected]>"
in this patch. Thanks a lot!
Thanks,
Feng
>
> > Signed-off-by: Feng Wu <[email protected]>
> > ---
> > arch/x86/include/asm/entry_arch.h | 2 ++
> > arch/x86/include/asm/hardirq.h | 1 +
> > arch/x86/include/asm/hw_irq.h | 2 ++
> > arch/x86/include/asm/irq_vectors.h | 1 +
> > arch/x86/kernel/entry_64.S | 2 ++
> > arch/x86/kernel/irq.c | 27
> +++++++++++++++++++++++++++
> > arch/x86/kernel/irqinit.c | 2 ++
> > 7 files changed, 37 insertions(+)
> > diff --git a/arch/x86/include/asm/entry_arch.h
> > b/arch/x86/include/asm/entry_arch.h index dc5fa66..27ca0af 100644 ---
> > a/arch/x86/include/asm/entry_arch.h +++
> > b/arch/x86/include/asm/entry_arch.h @@ -23,6 +23,8 @@
> > BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR) #ifdef
> > CONFIG_HAVE_KVM BUILD_INTERRUPT3(kvm_posted_intr_ipi,
> POSTED_INTR_VECTOR,
> > smp_kvm_posted_intr_ipi)
> > +BUILD_INTERRUPT3(kvm_posted_intr_wakeup_ipi,
> POSTED_INTR_WAKEUP_VECTOR,
> > + smp_kvm_posted_intr_wakeup_ipi)
> > #endif
> >
> > /*
> > diff --git a/arch/x86/include/asm/hardirq.h
> > b/arch/x86/include/asm/hardirq.h index 0f5fb6b..9866065 100644
> > --- a/arch/x86/include/asm/hardirq.h
> > +++ b/arch/x86/include/asm/hardirq.h
> > @@ -14,6 +14,7 @@ typedef struct {
> > #endif #ifdef CONFIG_HAVE_KVM unsigned int kvm_posted_intr_ipis;
> > + unsigned int kvm_posted_intr_wakeup_ipis; #endif unsigned int
> > x86_platform_ipis; /* arch dependent */ unsigned int apic_perf_irqs;
> > diff --git a/arch/x86/include/asm/hw_irq.h
> > b/arch/x86/include/asm/hw_irq.h index e7ae6eb..38fac9b 100644
> > --- a/arch/x86/include/asm/hw_irq.h
> > +++ b/arch/x86/include/asm/hw_irq.h
> > @@ -29,6 +29,7 @@
> > extern asmlinkage void apic_timer_interrupt(void); extern asmlinkage
> > void x86_platform_ipi(void); extern asmlinkage void
> > kvm_posted_intr_ipi(void); +extern asmlinkage void
> > kvm_posted_intr_wakeup_ipi(void);
> > extern asmlinkage void error_interrupt(void); extern asmlinkage void
> > irq_work_interrupt(void);
> >
> > @@ -92,6 +93,7 @@ extern void
> > trace_call_function_single_interrupt(void);
> > #define trace_irq_move_cleanup_interrupt irq_move_cleanup_interrupt
> > #define trace_reboot_interrupt reboot_interrupt #define
> > trace_kvm_posted_intr_ipi kvm_posted_intr_ipi
> > +#define trace_kvm_posted_intr_wakeup_ipi kvm_posted_intr_wakeup_ipi
> > #endif /* CONFIG_TRACING */
> >
> > struct irq_domain;
> > diff --git a/arch/x86/include/asm/irq_vectors.h
> > b/arch/x86/include/asm/irq_vectors.h index b26cb12..dca94f2 100644 ---
> > a/arch/x86/include/asm/irq_vectors.h +++
> > b/arch/x86/include/asm/irq_vectors.h @@ -105,6 +105,7 @@
> > /* Vector for KVM to deliver posted interrupt IPI */ #ifdef
> > CONFIG_HAVE_KVM #define POSTED_INTR_VECTOR 0xf2 +#define
> > POSTED_INTR_WAKEUP_VECTOR 0xf1 #endif
> >
> > /*
> > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> > index e61c14a..a598447 100644 --- a/arch/x86/kernel/entry_64.S +++
> > b/arch/x86/kernel/entry_64.S @@ -960,6 +960,8 @@ apicinterrupt
> > X86_PLATFORM_IPI_VECTOR \ #ifdef CONFIG_HAVE_KVM
> > apicinterrupt3 POSTED_INTR_VECTOR \
> > kvm_posted_intr_ipi smp_kvm_posted_intr_ipi
> > +apicinterrupt3 POSTED_INTR_WAKEUP_VECTOR \
> > + kvm_posted_intr_wakeup_ipi smp_kvm_posted_intr_wakeup_ipi
> > #endif
> >
> > #ifdef CONFIG_X86_MCE_THRESHOLD
> > diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index
> > 922d285..47408c3 100644
> > --- a/arch/x86/kernel/irq.c
> > +++ b/arch/x86/kernel/irq.c
> > @@ -237,6 +237,9 @@ __visible void smp_x86_platform_ipi(struct pt_regs
> > *regs) }
> >
> > #ifdef CONFIG_HAVE_KVM
> > +void (*wakeup_handler_callback)(void) = NULL;
> > +EXPORT_SYMBOL_GPL(wakeup_handler_callback); +
> > /*
> > * Handler for POSTED_INTERRUPT_VECTOR.
> > */
> > @@ -256,6 +259,30 @@ __visible void smp_kvm_posted_intr_ipi(struct
> > pt_regs *regs)
> >
> > set_irq_regs(old_regs);
> > }
> > +
> > +/*
> > + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
> > + */
> > +__visible void smp_kvm_posted_intr_wakeup_ipi(struct pt_regs *regs) {
> > + struct pt_regs *old_regs = set_irq_regs(regs);
> > +
> > + ack_APIC_irq();
> > +
> > + irq_enter();
> > +
> > + exit_idle();
> > +
> > + inc_irq_stat(kvm_posted_intr_wakeup_ipis);
> > +
> > + if (wakeup_handler_callback)
> > + wakeup_handler_callback();
> > +
> > + irq_exit();
> > +
> > + set_irq_regs(old_regs);
> > +}
> > +
> > #endif
> >
> > __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs) diff
> > --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index
> > 70e181e..844673c 100644 --- a/arch/x86/kernel/irqinit.c +++
> > b/arch/x86/kernel/irqinit.c @@ -144,6 +144,8 @@ static void __init
> > apic_intr_init(void) #ifdef CONFIG_HAVE_KVM
> > /* IPI for KVM to deliver posted interrupt */
> > alloc_intr_gate(POSTED_INTR_VECTOR, kvm_posted_intr_ipi);
> > + /* IPI for KVM to deliver interrupt to wake up tasks */
> > + alloc_intr_gate(POSTED_INTR_WAKEUP_VECTOR,
> > +kvm_posted_intr_wakeup_ipi);
> > #endif
> >
> > /* IPI vectors for APIC spurious and error interrupts */
>
>
> Best regards,
> Yang
>
Paolo Bonzini wrote on 2014-12-19:
>
>
> On 18/12/2014 15:49, Zhang, Yang Z wrote:
>>>> Here, we introduce a similar way with 'apic_arb_prio' to handle
>>>> guest lowest priority interrtups when VT-d PI is used. Here is the
>>>> ideas: - Each vCPU has a counter 'round_robin_counter'. - When
>>>> guests sets an interrupts to lowest priority, we choose the vCPU
>>>> with smallest 'round_robin_counter' as the destination, then
>>>> increase it.
>>
>> How this can work well? All subsequent interrupts are delivered to
>> one vCPU? It shouldn't be the best solution, need more consideration.
>
> Well, it's a hardware limitation. The alternative (which is easy to
Agree, it is limited by hardware. But lowest priority distributes the interrupt more efficient than fixed mode. And current implementation more likes to switch the lowest priority mode to fixed mode. In case of interrupt intensive environment, this may be a bottleneck and VM may not benefit greatly from VT-d PI. But agree again, it is really a hardware limitation.
> implement) is to only do PI for single-CPU interrupts. This should
> work well for multiqueue NICs (and of course for UP guests :)), so
> perhaps it's a good idea to only support that as a first attempt.
The more easy way is to deliver the interrupt to the first matched VCPU we find. The round_robin_counter really helps nothing here since the interrupt is delivered by hardware directly.
>
> Paolo
>
>> Also, I think you should take the apic_arb_prio into consider since
>> the priority is for the whole vCPU not for one interrupt.
Best regards,
Yang
> -----Original Message-----
> From: Zhang, Yang Z
> Sent: Friday, December 19, 2014 9:14 AM
> To: Paolo Bonzini; Wu, Feng; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: RE: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for
> VT-d PI
>
> Paolo Bonzini wrote on 2014-12-19:
> >
> >
> > On 18/12/2014 15:49, Zhang, Yang Z wrote:
> >>>> Here, we introduce a similar way with 'apic_arb_prio' to handle
> >>>> guest lowest priority interrtups when VT-d PI is used. Here is the
> >>>> ideas: - Each vCPU has a counter 'round_robin_counter'. - When
> >>>> guests sets an interrupts to lowest priority, we choose the vCPU
> >>>> with smallest 'round_robin_counter' as the destination, then
> >>>> increase it.
> >>
> >> How this can work well? All subsequent interrupts are delivered to
> >> one vCPU? It shouldn't be the best solution, need more consideration.
> >
> > Well, it's a hardware limitation. The alternative (which is easy to
>
> Agree, it is limited by hardware. But lowest priority distributes the interrupt
> more efficient than fixed mode. And current implementation more likes to
> switch the lowest priority mode to fixed mode. In case of interrupt intensive
> environment, this may be a bottleneck and VM may not benefit greatly from
> VT-d PI. But agree again, it is really a hardware limitation.
>
> > implement) is to only do PI for single-CPU interrupts. This should
> > work well for multiqueue NICs (and of course for UP guests :)), so
> > perhaps it's a good idea to only support that as a first attempt.
>
> The more easy way is to deliver the interrupt to the first matched VCPU we find.
> The round_robin_counter really helps nothing here since the interrupt is
> delivered by hardware directly.
>
> >
> > Paolo
> >
> >> Also, I think you should take the apic_arb_prio into consider since
> >> the priority is for the whole vCPU not for one interrupt.
>
>
> Best regards,
> Yang
In fact, the current solution was discussed with Rajesh in the cc List, here is Rajesh's original words:
"When you see a guest requesting a lowest priority interrupts (by programming the virtual IOAPIC, or by programming the virtual MSI/MSI-X registers), have KVM associate it to a vCPU. Or, put another way, use the 'apic_arb_prio' method you describe below, but instead of using it at time of interrupt (which you no longer have control with posted interrupt direct delivery), do it at time of initializing the interrupt resource. This way, if the guest asks for 4 lowest priority interrupts, and say you a guest with two vCPUs, the first interrupt request will be serviced by KVM by assigning it through posting to vCPU0, the next one goes to vCPU1, the next one would go back to vCPU0, and so forth.. You could also choose to do this based on vector hashing instead of round-robin."
Thanks,
Feng
>
> -----Original Message-----
> From: Paolo Bonzini [mailto:[email protected]]
> Sent: Friday, December 19, 2014 12:58 AM
> To: Zhang, Yang Z; Wu, Feng; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for
> VT-d PI
>
>
>
> On 18/12/2014 15:49, Zhang, Yang Z wrote:
> >>> Here, we introduce a similar way with 'apic_arb_prio' to handle
> >>> guest lowest priority interrtups when VT-d PI is used. Here is
> >>> the ideas: - Each vCPU has a counter 'round_robin_counter'. -
> >>> When guests sets an interrupts to lowest priority, we choose the
> >>> vCPU with smallest 'round_robin_counter' as the destination, then
> >>> increase it.
> >
> > How this can work well? All subsequent interrupts are delivered to
> > one vCPU? It shouldn't be the best solution, need more consideration.
>
> Well, it's a hardware limitation. The alternative (which is easy to
> implement) is to only do PI for single-CPU interrupts. This should work
> well for multiqueue NICs (and of course for UP guests :)), so perhaps
> it's a good idea to only support that as a first attempt.
>
> Paolo
Paolo, what do you mean by "single-CPU interrupts"? Do you mean we don't
support lowest priority interrupts for PI? But Linux OS uses lowest priority
for most of the case? If so, we can hardly get benefit from this feature for
Linux guest OS.
Thanks,
Feng
>
> > Also, I think you should take the apic_arb_prio into consider since
> > the priority is for the whole vCPU not for one interrupt.
> -----Original Message-----
> From: Zhang, Yang Z
> Sent: Thursday, December 18, 2014 10:26 PM
> To: Wu, Feng; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Wu, Feng
> Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d
> Posted-Interrupts
>
> Feng Wu wrote on 2014-12-12:
> > We don't need to migrate the irqs for VT-d Posted-Interrupts here.
> > When 'pst' is set in IRTE, the associated irq will be posted to guests
> > instead of interrupt remapping. The destination of the interrupt is
> > set in Posted-Interrupts Descriptor, and the migration happens during
> > vCPU scheduling.
> >
> > However, we still update the cached irte here, which can be used when
> > changing back to remapping mode.
> >
> > Signed-off-by: Feng Wu <[email protected]>
> > Reviewed-by: Jiang Liu <[email protected]>
> > ---
> > drivers/iommu/intel_irq_remapping.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> > diff --git a/drivers/iommu/intel_irq_remapping.c
> > b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a 100644 ---
> > a/drivers/iommu/intel_irq_remapping.c +++
> > b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@
> > intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
> > {
> > struct intel_ir_data *ir_data = data->chip_data; struct irte *irte =
> > &ir_data->irte_entry; + struct irte_pi *irte_pi = (struct irte_pi
> > *)irte; struct irq_cfg *cfg = irqd_cfg(data); struct irq_data *parent
> > = data->parent_data; int ret;
> > @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data,
> > const struct cpumask *mask,
> > */
> > irte->vector = cfg->vector;
> > irte->dest_id = IRTE_DEST(cfg->dest_apicid);
> > - modify_irte(&ir_data->irq_2_iommu, irte);
> > +
> > + /* We don't need to modify irte if the interrupt is for posting. */
> > + if (irte_pi->pst != 1)
> > + modify_irte(&ir_data->irq_2_iommu, irte);
>
> What happens if user changes the IRQ affinity manually?
If the IRQ is posted, its affinity is controlled by guest (irq <---> vCPU <----> pCPU),
it has no effect when host changes its affinity.
Thanks,
Feng
>
> >
> > /*
> > * After this point, all the interrupts will start arriving
>
>
> Best regards,
> Yang
>
Wu, Feng wrote on 2014-12-19:
>
>
> Zhang, Yang Z wrote on 2014-12-18:
>> [email protected]
>> Cc: [email protected]; [email protected];
>> [email protected]; [email protected]; Wu, Feng
>> Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for
>> VT-d Posted-Interrupts
>>
>> Feng Wu wrote on 2014-12-12:
>>> We don't need to migrate the irqs for VT-d Posted-Interrupts here.
>>> When 'pst' is set in IRTE, the associated irq will be posted to
>>> guests instead of interrupt remapping. The destination of the
>>> interrupt is set in Posted-Interrupts Descriptor, and the
>>> migration happens during vCPU scheduling.
>>>
>>> However, we still update the cached irte here, which can be used
>>> when changing back to remapping mode.
>>>
>>> Signed-off-by: Feng Wu <[email protected]>
>>> Reviewed-by: Jiang Liu <[email protected]>
>>> ---
>>> drivers/iommu/intel_irq_remapping.c | 6 +++++-
>>> 1 file changed, 5 insertions(+), 1 deletion(-) diff --git
>>> a/drivers/iommu/intel_irq_remapping.c
>>> b/drivers/iommu/intel_irq_remapping.c index 48c2051..ab9057a
>>> 100644
>>> --- a/drivers/iommu/intel_irq_remapping.c +++
>>> b/drivers/iommu/intel_irq_remapping.c @@ -977,6 +977,7 @@
>>> intel_ir_set_affinity(struct irq_data *data, const struct cpumask
>>> *mask, {
>>> struct intel_ir_data *ir_data = data->chip_data; struct irte *irte =
>>> &ir_data->irte_entry; + struct irte_pi *irte_pi = (struct irte_pi
>>> *)irte; struct irq_cfg *cfg = irqd_cfg(data); struct irq_data *parent
>>> = data->parent_data; int ret;
>>> @@ -991,7 +992,10 @@ intel_ir_set_affinity(struct irq_data *data,
>>> const struct cpumask *mask,
>>> */
>>> irte->vector = cfg->vector;
>>> irte->dest_id = IRTE_DEST(cfg->dest_apicid);
>>> - modify_irte(&ir_data->irq_2_iommu, irte);
>>> +
>>> + /* We don't need to modify irte if the interrupt is for posting. */
>>> + if (irte_pi->pst != 1)
>>> + modify_irte(&ir_data->irq_2_iommu, irte);
>>
>> What happens if user changes the IRQ affinity manually?
>
> If the IRQ is posted, its affinity is controlled by guest (irq <--->
> vCPU <----> pCPU), it has no effect when host changes its affinity.
That's the problem: User is able to changes it in host but it never takes effect since it is actually controlled by guest. I guess it will break the IRQ balance too.
>
> Thanks,
> Feng
>
>>
>>>
>>> /*
>>> * After this point, all the interrupts will start arriving
>>
>>
>> Best regards,
>> Yang
>>
Best regards,
Yang
Wu, Feng wrote on 2014-12-19:
>
>
> Paolo Bonzini wrote on 2014-12-19:
>> [email protected]
>> Cc: [email protected]; [email protected];
>> [email protected]; [email protected]
>> Subject: Re: [v3 13/26] KVM: Define a new interface
>> kvm_find_dest_vcpu() for VT-d PI
>>
>>
>>
>> On 18/12/2014 15:49, Zhang, Yang Z wrote:
>>>>> Here, we introduce a similar way with 'apic_arb_prio' to handle
>>>>> guest lowest priority interrtups when VT-d PI is used. Here is
>>>>> the
>>>>> ideas: - Each vCPU has a counter 'round_robin_counter'. - When
>>>>> guests sets an interrupts to lowest priority, we choose the vCPU
>>>>> with smallest 'round_robin_counter' as the destination, then
>>>>> increase it.
>>>
>>> How this can work well? All subsequent interrupts are delivered to
>>> one vCPU? It shouldn't be the best solution, need more consideration.
>>
>> Well, it's a hardware limitation. The alternative (which is easy to
>> implement) is to only do PI for single-CPU interrupts. This should
>> work well for multiqueue NICs (and of course for UP guests :)), so
>> perhaps it's a good idea to only support that as a first attempt.
>>
>> Paolo
>
> Paolo, what do you mean by "single-CPU interrupts"? Do you mean we
It should be same idea as I mentioned on another thread: deliver the interrupt to a single CPU(maybe the first matched VCPU?)
> don't support lowest priority interrupts for PI? But Linux OS uses
> lowest priority for most of the case? If so, we can hardly get benefit
> from this feature for Linux guest OS.
>
> Thanks,
> Feng
>
>>
>>> Also, I think you should take the apic_arb_prio into consider
>>> since the priority is for the whole vCPU not for one interrupt.
Best regards,
Yang
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Paolo Bonzini
> Sent: Thursday, December 18, 2014 4:32 PM
> To: [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [v3 23/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is preempted
>
>
>
> On 18/12/2014 04:15, Wu, Feng wrote:
> > Thanks for your comments, Paolo!
> >
> > If we use u64 new_control, we cannot use new.sn any more.
> > Maybe we can change the struct pi_desc {} like this:
> >
> > typedef struct pid_control{
> > u64 on : 1,
> > sn : 1,
> > rsvd_1 : 13,
> > ndm : 1,
> > nv : 8,
> > rsvd_2 : 8,
> > ndst : 32;
> > }pid_control_t;
> >
> > struct pi_desc {
> > u32 pir[8]; /* Posted interrupt requested */
> > pid_control_t control;
>
> Probably something like this to keep the union:
>
> typedef union pid_control {
> u64 full;
> struct {
> u64 on : 1,
> ...
> } fields;
> };
>
> > u32 rsvd[6];
> > } __aligned(64);
> >
> >
> > Then we can define pid_control_t new_control, old_control. And use
> new_control.sn = 0.
> >
> > What is your opinon?
>
> Sure. Alternatively, keep using struct pi_desc new; just
> do not zero it, nor access any field outide the control word.
>
> Paolo
Yes, this is also a good idea. Thanks!
Thanks,
Feng
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Paolo Bonzini
> Sent: Thursday, December 18, 2014 12:18 AM
> To: [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [v3 16/26] KVM: Make struct kvm_irq_routing_table accessible
>
>
>
> On 12/12/2014 16:14, Feng Wu wrote:
> > Move struct kvm_irq_routing_table from irqchip.c to kvm_host.h,
> > so we can use it outside of irqchip.c.
> >
> > Signed-off-by: Feng Wu <[email protected]>
> > ---
> > include/linux/kvm_host.h | 19 +++++++++++++++++++
> > virt/kvm/irqchip.c | 11 -----------
> > 2 files changed, 19 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 0b9659d..cfa85ac 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -335,6 +335,25 @@ struct kvm_kernel_irq_routing_entry {
> > struct hlist_node link;
> > };
> >
> > +#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
> > +
> > +struct kvm_irq_routing_table {
> > + int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
> > + struct kvm_kernel_irq_routing_entry *rt_entries;
> > + u32 nr_rt_entries;
> > + /*
> > + * Array indexed by gsi. Each entry contains list of irq chips
> > + * the gsi is connected to.
> > + */
> > + struct hlist_head map[0];
> > +};
> > +
> > +#else
> > +
> > +struct kvm_irq_routing_table {};
>
> If possible, just make this "struct kvm_irq_routing_table;" and pull
> this line to include/linux/kvm_types.h.
>
> Paolo
Do you mean move the definition of struct kvm_irq_routing_table
to include/linux/kvm_types.h and add a declaration here?
Thanks,
Feng
>
> > +
> > +#endif
> > +
> > #ifndef KVM_PRIVATE_MEM_SLOTS
> > #define KVM_PRIVATE_MEM_SLOTS 0
> > #endif
> > diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
> > index 7f256f3..cdf29a6 100644
> > --- a/virt/kvm/irqchip.c
> > +++ b/virt/kvm/irqchip.c
> > @@ -31,17 +31,6 @@
> > #include <trace/events/kvm.h>
> > #include "irq.h"
> >
> > -struct kvm_irq_routing_table {
> > - int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
> > - struct kvm_kernel_irq_routing_entry *rt_entries;
> > - u32 nr_rt_entries;
> > - /*
> > - * Array indexed by gsi. Each entry contains list of irq chips
> > - * the gsi is connected to.
> > - */
> > - struct hlist_head map[0];
> > -};
> > -
> > int kvm_irq_map_gsi(struct kvm *kvm,
> > struct kvm_kernel_irq_routing_entry *entries, int gsi)
> > {
> >
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Paolo Bonzini
> Sent: Thursday, December 18, 2014 4:37 PM
> To: [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
>
>
>
> On 18/12/2014 04:16, Wu, Feng wrote:
> >>> pre-block:
> >>> - Add the vCPU to the blocked per-CPU list
> >>> - Clear 'SN'
> >>
> >> Should SN be already clear (and NV set to POSTED_INTR_VECTOR)?
> >
> > I think the SN bit should be clear here, Adding it here is just to make sure
> > SN is clear when vCPU is blocked, so it can receive wakeup notification event
> later.
>
> Then, please, WARN if the SN bit is set inside the if (vcpu->blocked).
> Inside that if you can just add the vCPU to the blocked list on vcpu_put.
>
> >> Can it
> >> happen that you go from sched-out to blocked without doing a sched-in
> first?
> >>
> >
> > I cannot imagine this scenario, can you please be more specific? Thanks a lot!
>
> I cannot either. :) But it would be the case where SN is not cleared.
> So we agree that it cannot happen.
>
> >> In fact, if this is possible, what happens if vcpu->preempted &&
> >> vcpu->blocked?
> >
> > In fact, vcpu->preempted && vcpu->blocked happens sometimes, but I think
> there is
> > no issues. Please refer to the following case:
>
> I agree that there should be no issues. But if it can happen, it's better:
>
> 1) to separate the handling of preemption and blocking: preemption
> handles SN/NV/NDST, blocking handles the wakeup list.
>
Sorry, I don't quite understand this.
I think handling of preemption and blocking is separated in vmx_vcpu_put().
For vmx_vcpu_load(), the handling of SN/NV/NDST is common for preemption
and blocking.
Thanks,
Feng
> 2) to change this
>
> + } else if (vcpu->blocked) {
> + /*
> + * The vcpu is blocked on the wait queue.
> + * Store the blocked vCPU on the list of the
> + * vcpu->wakeup_cpu, which is the destination
> + * of the wake-up notification event.
>
> to just
>
> }
> if (vcpu->blocked) {
> ...
> }
> > kvm_vcpu_block()
> > -> vcpu->blocked = true;
> > -> prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
> >
> > before schedule() is called, this vcpu is woken up by another guy, so
> > the state of the vcpu associated thread is changed to TASK_RUNNING,
> > then preemption happens after interrupts or the following schedule() is
> > hit, this will call kvm_sched_out(), in which current->state ==
> TASK_RUNNING
> > and vcpu->preempted is set to true. So now vcpu->preempted and
> vcpu->blocked
> > are both true. In vmx_vcpu_put(), we will check vcpu->preempted first, so
> > the vCPU will not be blocked, and the vcpu->blocked will be set the false in
> > vmx_vcpu_load().
> >
> > But maybe I need do a little change to the vmx_vcpu_load() like below:
> >
> > /*
> > * Delete the vCPU from the related wakeup queue
> > * if we are resuming from blocked state
> > */
> > if (vcpu->blocked) {
> > vcpu->blocked = false;
> > + /* if wakeup_cpu == -1, the vcpu is currently not
> blocked on any
> > + pCPU, don't need dequeue here */
> > + if (vcpu->wakeup_cpu != -1) {
> >
> spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> > vcpu->wakeup_cpu), flags);
> > list_del(&vcpu->blocked_vcpu_list);
> >
> spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
> > vcpu->wakeup_cpu), flags);
> > vcpu->wakeup_cpu = -1;
> > + }
> > }
>
> Good idea.
>
> Paolo
>
> > Any ideas about this? Thanks a lot!
> >
> > Thanks,
> > Feng
> >
> >
> > -> schedule();
> >
> >
> >>
> >>> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> >>>
> >>> post-block:
> >>> - Remove the vCPU from the per-CPU list
> >>
> >> Paolo
> >>
> >>> Signed-off-by: Feng Wu <[email protected]>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> the body of a message to [email protected]
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Zhang, Yang Z
> Sent: Thursday, December 18, 2014 11:10 PM
> To: Paolo Bonzini; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]
> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
>
> Paolo Bonzini wrote on 2014-12-18:
> >
> >
> > On 18/12/2014 04:14, Wu, Feng wrote:
> >>
> >>
> >> [email protected] wrote on
> mailto:[email protected]] On Behalf Of Paolo:
> >>> [email protected]; Gleb Natapov; Paolo Bonzini; [email protected];
> >>> [email protected]; Alex Williamson;
> >>> joro-zLv9SwRftAIdnm+Jiang
> >>> Liu
> >>> Cc: [email protected];
> >>> [email protected]; KVM list;
> >>> Eric Auger
> >>> Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
> >>> set
> >>>
> >>>
> >>>
> >>> On 12/12/2014 16:14, Feng Wu wrote:
> >>>> Currently, we don't support urgent interrupt, all interrupts are
> >>>> recognized as non-urgent interrupt, so we cannot send
> >>>> posted-interrupt when 'SN' is set.
> >>>
> >>> Can this happen? If the vcpu is in guest mode, it cannot have been
> >>> scheduled out, and that's the only case when SN is set.
> >>>
> >>> Paolo
> >>
> >> Currently, the only place where SN is set is vCPU is preempted and
>
> If the vCPU is preempted, shouldn't the subsequent be ignored? What happens
> if a PI is occurs when vCPU is preempted?
If a vCPU is preempted, the 'SN' bit is set, the subsequent interrupts are
suppressed for posting.
Thanks,
Feng
>
> >> waiting for the next scheduling in the runqueue. But I am not sure
> >> whether we need to set SN for other purpose in future. Adding SN
> >> checking here is just to follow the Spec. non-urgent interrupts are
> >> suppressed
> > when SN is set.
> >
> > I would change that to a WARN_ON_ONCE then.
>
>
> Best regards,
> Yang
>
>
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
Wu, Feng wrote on 2014-12-19:
>
>
> [email protected] wrote on mailto:[email protected]] On Behalf Of:
>> Cc: [email protected]; [email protected];
>> [email protected]
>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
>> set
>>
>> Paolo Bonzini wrote on 2014-12-18:
>>>
>>>
>>> On 18/12/2014 04:14, Wu, Feng wrote:
>>>>
>>>>
>>>> [email protected] wrote on
>> mailto:[email protected]] On Behalf Of Paolo:
>>>>> [email protected]; Gleb Natapov; Paolo Bonzini;
>>>>> [email protected];
>>>>> [email protected]; Alex Williamson;
>>>>> joro-zLv9SwRftAIdnm+Jiang
>>>>> Liu
>>>>> Cc: [email protected];
>>>>> [email protected]; KVM list;
>>>>> Eric Auger
>>>>> Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
>>>>> is set
>>>>>
>>>>>
>>>>>
>>>>> On 12/12/2014 16:14, Feng Wu wrote:
>>>>>> Currently, we don't support urgent interrupt, all interrupts
>>>>>> are recognized as non-urgent interrupt, so we cannot send
>>>>>> posted-interrupt when 'SN' is set.
>>>>>
>>>>> Can this happen? If the vcpu is in guest mode, it cannot have
>>>>> been scheduled out, and that's the only case when SN is set.
>>>>>
>>>>> Paolo
>>>>
>>>> Currently, the only place where SN is set is vCPU is preempted
>>>> and
>>
>> If the vCPU is preempted, shouldn't the subsequent be ignored? What
>> happens if a PI is occurs when vCPU is preempted?
>
> If a vCPU is preempted, the 'SN' bit is set, the subsequent interrupts
> are suppressed for posting.
I mean what happens if we don't set SN bit. From my point, if preempter already disabled the interrupt, it is ok to leave SN bit as zero. But if preempter enabled the interrupt, doesn't this mean he allow interrupt to happen? BTW, since there already has ON bit, so this means there only have one interrupt arrived at most and it doesn't hurt performance. Do we really need to set SN bit?
>
> Thanks,
> Feng
>
>>
>>>> waiting for the next scheduling in the runqueue. But I am not
>>>> sure whether we need to set SN for other purpose in future.
>>>> Adding SN checking here is just to follow the Spec. non-urgent
>>>> interrupts are suppressed
>>> when SN is set.
>>>
>>> I would change that to a WARN_ON_ONCE then.
>>
>>
>> Best regards,
>> Yang
>>
>>
>> _______________________________________________
>> iommu mailing list
>> [email protected]
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
Best regards,
Yang
> -----Original Message-----
> From: Zhang, Yang Z
> Sent: Friday, December 19, 2014 11:33 AM
> To: Wu, Feng; Paolo Bonzini; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]
> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
>
> Wu, Feng wrote on 2014-12-19:
> >
> >
> > [email protected] wrote on
> mailto:[email protected]] On Behalf Of:
> >> Cc: [email protected]; [email protected];
> >> [email protected]
> >> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
> >> set
> >>
> >> Paolo Bonzini wrote on 2014-12-18:
> >>>
> >>>
> >>> On 18/12/2014 04:14, Wu, Feng wrote:
> >>>>
> >>>>
> >>>> [email protected] wrote on
> >> mailto:[email protected]] On Behalf Of Paolo:
> >>>>> [email protected]; Gleb Natapov; Paolo Bonzini;
> >>>>> [email protected];
> >>>>> [email protected]; Alex Williamson;
> >>>>> joro-zLv9SwRftAIdnm+Jiang
> >>>>> Liu
> >>>>> Cc: [email protected];
> >>>>> [email protected]; KVM list;
> >>>>> Eric Auger
> >>>>> Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
> >>>>> is set
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 12/12/2014 16:14, Feng Wu wrote:
> >>>>>> Currently, we don't support urgent interrupt, all interrupts
> >>>>>> are recognized as non-urgent interrupt, so we cannot send
> >>>>>> posted-interrupt when 'SN' is set.
> >>>>>
> >>>>> Can this happen? If the vcpu is in guest mode, it cannot have
> >>>>> been scheduled out, and that's the only case when SN is set.
> >>>>>
> >>>>> Paolo
> >>>>
> >>>> Currently, the only place where SN is set is vCPU is preempted
> >>>> and
> >>
> >> If the vCPU is preempted, shouldn't the subsequent be ignored? What
> >> happens if a PI is occurs when vCPU is preempted?
> >
> > If a vCPU is preempted, the 'SN' bit is set, the subsequent interrupts
> > are suppressed for posting.
>
> I mean what happens if we don't set SN bit. From my point, if preempter
> already disabled the interrupt, it is ok to leave SN bit as zero. But if preempter
> enabled the interrupt, doesn't this mean he allow interrupt to happen? BTW,
> since there already has ON bit, so this means there only have one interrupt
> arrived at most and it doesn't hurt performance. Do we really need to set SN
> bit?
See this scenario:
vCPU0 is running on pCPU0
--> vCPU0 is preempted by vCPU1
--> Then vCPU1 is running on pCPU0 and vCPU0 is waiting for schedule in runqueue
If the we don't set SN for vCPU0, then all subsequent interrupts for vCPU0 is posted
to vCPU1, this will consume hardware and software efforts and in fact it is not needed
at all. If SN is set for vCPU0, VT-d hardware will not issue Notification Event for vCPU0
when an interrupt is for it, but just setting the related PIR bit.
Thanks,
Feng
>
> >
> > Thanks,
> > Feng
> >
> >>
> >>>> waiting for the next scheduling in the runqueue. But I am not
> >>>> sure whether we need to set SN for other purpose in future.
> >>>> Adding SN checking here is just to follow the Spec. non-urgent
> >>>> interrupts are suppressed
> >>> when SN is set.
> >>>
> >>> I would change that to a WARN_ON_ONCE then.
> >>
> >>
> >> Best regards,
> >> Yang
> >>
> >>
> >> _______________________________________________
> >> iommu mailing list
> >> [email protected]
> >> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
>
> Best regards,
> Yang
>
Wu, Feng wrote on 2014-12-19:
>
>
> Zhang, Yang Z wrote on 2014-12-19:
>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
>> set
>>
>> Wu, Feng wrote on 2014-12-19:
>>>
>>>
>>> [email protected] wrote on
>> mailto:[email protected]] On Behalf Of:
>>>> Cc: [email protected];
>>>> [email protected]; [email protected]
>>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
>>>> is set
>>>>
>>>> Paolo Bonzini wrote on 2014-12-18:
>>>>>
>>>>>
>>>>> On 18/12/2014 04:14, Wu, Feng wrote:
>>>>>>
>>>>>>
>>>>>> [email protected] wrote on
>>>> mailto:[email protected]] On Behalf Of Paolo:
>>>>>>> [email protected]; Gleb Natapov; Paolo Bonzini; [email protected];
>>>>>>> [email protected]; Alex Williamson;
>>>>>>> joro-zLv9SwRftAIdnm+Jiang Liu Cc:
>>>>>>> [email protected];
>>>>>>> [email protected]; KVM list;
>>>>>>> Eric Auger Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt
>>>>>>> when 'SN' is set
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 12/12/2014 16:14, Feng Wu wrote:
>>>>>>>> Currently, we don't support urgent interrupt, all interrupts
>>>>>>>> are recognized as non-urgent interrupt, so we cannot send
>>>>>>>> posted-interrupt when 'SN' is set.
>>>>>>>
>>>>>>> Can this happen? If the vcpu is in guest mode, it cannot have
>>>>>>> been scheduled out, and that's the only case when SN is set.
>>>>>>>
>>>>>>> Paolo
>>>>>>
>>>>>> Currently, the only place where SN is set is vCPU is preempted
>>>>>> and
>>>>
>>>> If the vCPU is preempted, shouldn't the subsequent be ignored?
>>>> What happens if a PI is occurs when vCPU is preempted?
>>>
>>> If a vCPU is preempted, the 'SN' bit is set, the subsequent
>>> interrupts are suppressed for posting.
>>
>> I mean what happens if we don't set SN bit. From my point, if
>> preempter already disabled the interrupt, it is ok to leave SN bit
>> as zero. But if preempter enabled the interrupt, doesn't this mean
>> he allow interrupt to happen? BTW, since there already has ON bit,
>> so this means there only have one interrupt arrived at most and it
>> doesn't hurt performance. Do we really need to set SN bit?
>
>
> See this scenario:
> vCPU0 is running on pCPU0
> --> vCPU0 is preempted by vCPU1
> --> Then vCPU1 is running on pCPU0 and vCPU0 is waiting for schedule
> --> in runqueue
>
> If the we don't set SN for vCPU0, then all subsequent interrupts for
> vCPU0 is posted to vCPU1, this will consume hardware and software
The PI vector for vCPU1 is notification vector, but the PI vector for vCPU0 should be wakeup vector. Why vCPU1 will consume this PI event?
> efforts and in fact it is not needed at all. If SN is set for vCPU0,
> VT-d hardware will not issue Notification Event for vCPU0 when an
> interrupt is for it, but just setting the related PIR bit.
>
> Thanks,
> Feng
>
>>
>>>
>>> Thanks,
>>> Feng
>>>
>>>>
>>>>>> waiting for the next scheduling in the runqueue. But I am not
>>>>>> sure whether we need to set SN for other purpose in future.
>>>>>> Adding SN checking here is just to follow the Spec. non-urgent
>>>>>> interrupts are suppressed
>>>>> when SN is set.
>>>>>
>>>>> I would change that to a WARN_ON_ONCE then.
>>>>
>>>>
>>>> Best regards,
>>>> Yang
>>>>
>>>>
>>>> _______________________________________________
>>>> iommu mailing list
>>>> [email protected]
>>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>
>>
>> Best regards,
>> Yang
>>
Best regards,
Yang
> -----Original Message-----
> From: Zhang, Yang Z
> Sent: Friday, December 19, 2014 12:44 PM
> To: Wu, Feng; Paolo Bonzini; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]
> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
>
> Wu, Feng wrote on 2014-12-19:
> >
> >
> > Zhang, Yang Z wrote on 2014-12-19:
> >> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
> >> set
> >>
> >> Wu, Feng wrote on 2014-12-19:
> >>>
> >>>
> >>> [email protected] wrote on
> >> mailto:[email protected]] On Behalf Of:
> >>>> Cc: [email protected];
> >>>> [email protected]; [email protected]
> >>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
> >>>> is set
> >>>>
> >>>> Paolo Bonzini wrote on 2014-12-18:
> >>>>>
> >>>>>
> >>>>> On 18/12/2014 04:14, Wu, Feng wrote:
> >>>>>>
> >>>>>>
> >>>>>> [email protected] wrote on
> >>>> mailto:[email protected]] On Behalf Of Paolo:
> >>>>>>> [email protected]; Gleb Natapov; Paolo Bonzini;
> [email protected];
> >>>>>>> [email protected]; Alex Williamson;
> >>>>>>> joro-zLv9SwRftAIdnm+Jiang Liu Cc:
> >>>>>>> [email protected];
> >>>>>>> [email protected]; KVM
> list;
> >>>>>>> Eric Auger Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt
> >>>>>>> when 'SN' is set
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 12/12/2014 16:14, Feng Wu wrote:
> >>>>>>>> Currently, we don't support urgent interrupt, all interrupts
> >>>>>>>> are recognized as non-urgent interrupt, so we cannot send
> >>>>>>>> posted-interrupt when 'SN' is set.
> >>>>>>>
> >>>>>>> Can this happen? If the vcpu is in guest mode, it cannot have
> >>>>>>> been scheduled out, and that's the only case when SN is set.
> >>>>>>>
> >>>>>>> Paolo
> >>>>>>
> >>>>>> Currently, the only place where SN is set is vCPU is preempted
> >>>>>> and
> >>>>
> >>>> If the vCPU is preempted, shouldn't the subsequent be ignored?
> >>>> What happens if a PI is occurs when vCPU is preempted?
> >>>
> >>> If a vCPU is preempted, the 'SN' bit is set, the subsequent
> >>> interrupts are suppressed for posting.
> >>
> >> I mean what happens if we don't set SN bit. From my point, if
> >> preempter already disabled the interrupt, it is ok to leave SN bit
> >> as zero. But if preempter enabled the interrupt, doesn't this mean
> >> he allow interrupt to happen? BTW, since there already has ON bit,
> >> so this means there only have one interrupt arrived at most and it
> >> doesn't hurt performance. Do we really need to set SN bit?
> >
> >
> > See this scenario:
> > vCPU0 is running on pCPU0
> > --> vCPU0 is preempted by vCPU1
> > --> Then vCPU1 is running on pCPU0 and vCPU0 is waiting for schedule
> > --> in runqueue
> >
> > If the we don't set SN for vCPU0, then all subsequent interrupts for
> > vCPU0 is posted to vCPU1, this will consume hardware and software
>
> The PI vector for vCPU1 is notification vector, but the PI vector for vCPU0
> should be wakeup vector. Why vCPU1 will consume this PI event?
Wakeup vector is only used for blocking case, when vCPU is preempted
and waiting in the runqueue, the NV is the notification vector.
Thanks,
Feng
>
> > efforts and in fact it is not needed at all. If SN is set for vCPU0,
> > VT-d hardware will not issue Notification Event for vCPU0 when an
> > interrupt is for it, but just setting the related PIR bit.
> >
> > Thanks,
> > Feng
> >
> >>
> >>>
> >>> Thanks,
> >>> Feng
> >>>
> >>>>
> >>>>>> waiting for the next scheduling in the runqueue. But I am not
> >>>>>> sure whether we need to set SN for other purpose in future.
> >>>>>> Adding SN checking here is just to follow the Spec. non-urgent
> >>>>>> interrupts are suppressed
> >>>>> when SN is set.
> >>>>>
> >>>>> I would change that to a WARN_ON_ONCE then.
> >>>>
> >>>>
> >>>> Best regards,
> >>>> Yang
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> iommu mailing list
> >>>> [email protected]
> >>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> >>
> >>
> >> Best regards,
> >> Yang
> >>
>
>
> Best regards,
> Yang
>
Wu, Feng wrote on 2014-12-19:
>
>
> Zhang, Yang Z wrote on 2014-12-19:
>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
>> set
>>
>> Wu, Feng wrote on 2014-12-19:
>>>
>>>
>>> Zhang, Yang Z wrote on 2014-12-19:
>>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
>>>> is set
>>>>
>>>> Wu, Feng wrote on 2014-12-19:
>>>>>
>>>>>
>>>>> [email protected] wrote on
>>>> mailto:[email protected]] On Behalf Of:
>>>>>> Cc: [email protected];
>>>>>> [email protected]; [email protected]
>>>>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
>>>>>> is set
>>>>>>
>>>>>> Paolo Bonzini wrote on 2014-12-18:
>>>>>>>
>>>>>>>
>>>>>>> On 18/12/2014 04:14, Wu, Feng wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> [email protected] wrote on
>>>>>> mailto:[email protected]] On Behalf Of Paolo:
>>>>>>>>> [email protected]; Gleb Natapov; Paolo Bonzini;
>>>>>>>>> [email protected];
>>>>>>>>> [email protected]; Alex Williamson;
>>>>>>>>> joro-zLv9SwRftAIdnm+Jiang Liu Cc:
>>>>>>>>> [email protected];
>>>>>>>>> [email protected]; KVM list;
>>>>>>>>> Eric Auger Subject: Re: [v3 25/26] KVM: Suppress
>>>>>>>>> posted-interrupt when 'SN' is set
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 12/12/2014 16:14, Feng Wu wrote:
>>>>>>>>>> Currently, we don't support urgent interrupt, all
>>>>>>>>>> interrupts are recognized as non-urgent interrupt, so we
>>>>>>>>>> cannot send posted-interrupt when 'SN' is set.
>>>>>>>>>
>>>>>>>>> Can this happen? If the vcpu is in guest mode, it cannot
>>>>>>>>> have been scheduled out, and that's the only case when SN is set.
>>>>>>>>>
>>>>>>>>> Paolo
>>>>>>>>
>>>>>>>> Currently, the only place where SN is set is vCPU is
>>>>>>>> preempted and
>>>>>>
>>>>>> If the vCPU is preempted, shouldn't the subsequent be ignored?
>>>>>> What happens if a PI is occurs when vCPU is preempted?
>>>>>
>>>>> If a vCPU is preempted, the 'SN' bit is set, the subsequent
>>>>> interrupts are suppressed for posting.
>>>>
>>>> I mean what happens if we don't set SN bit. From my point, if
>>>> preempter already disabled the interrupt, it is ok to leave SN
>>>> bit as zero. But if preempter enabled the interrupt, doesn't this
>>>> mean he allow interrupt to happen? BTW, since there already has
>>>> ON bit, so this means there only have one interrupt arrived at
>>>> most and it doesn't hurt performance. Do we really need to set SN bit?
>>>
>>>
>>> See this scenario:
>>> vCPU0 is running on pCPU0
>>> --> vCPU0 is preempted by vCPU1
>>> --> Then vCPU1 is running on pCPU0 and vCPU0 is waiting for
>>> --> schedule in runqueue
>>>
>>> If the we don't set SN for vCPU0, then all subsequent interrupts
>>> for
>>> vCPU0 is posted to vCPU1, this will consume hardware and software
>>
>> The PI vector for vCPU1 is notification vector, but the PI vector
>> for
>> vCPU0 should be wakeup vector. Why vCPU1 will consume this PI event?
>
> Wakeup vector is only used for blocking case, when vCPU is preempted
> and waiting in the runqueue, the NV is the notification vector.
I see your point. But from performance point, if we can schedule the vCPU to another PCPU to handle the interrupt, it would helpful. But I remember current KVM will not schedule the vCPU in run queue (even though it got preempted) to another pCPU to run(Am I right?). So it may hard to do it.
>
> Thanks,
> Feng
>
>>
>>> efforts and in fact it is not needed at all. If SN is set for
>>> vCPU0, VT-d hardware will not issue Notification Event for vCPU0
>>> when an interrupt is for it, but just setting the related PIR bit.
>>>
>>> Thanks,
>>> Feng
>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Feng
>>>>>
>>>>>>
>>>>>>>> waiting for the next scheduling in the runqueue. But I am not
>>>>>>>> sure whether we need to set SN for other purpose in future.
>>>>>>>> Adding SN checking here is just to follow the Spec.
>>>>>>>> non-urgent interrupts are suppressed
>>>>>>> when SN is set.
>>>>>>>
>>>>>>> I would change that to a WARN_ON_ONCE then.
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Yang
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> iommu mailing list
>>>>>> [email protected]
>>>>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>>>
>>>>
>>>> Best regards,
>>>> Yang
>>>>
>>
>>
>> Best regards,
>> Yang
>>
Best regards,
Yang
> -----Original Message-----
> From: Zhang, Yang Z
> Sent: Friday, December 19, 2014 1:26 PM
> To: Wu, Feng; Paolo Bonzini; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]
> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
>
> Wu, Feng wrote on 2014-12-19:
> >
> >
> > Zhang, Yang Z wrote on 2014-12-19:
> >> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
> >> set
> >>
> >> Wu, Feng wrote on 2014-12-19:
> >>>
> >>>
> >>> Zhang, Yang Z wrote on 2014-12-19:
> >>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
> >>>> is set
> >>>>
> >>>> Wu, Feng wrote on 2014-12-19:
> >>>>>
> >>>>>
> >>>>> [email protected] wrote on
> >>>> mailto:[email protected]] On Behalf Of:
> >>>>>> Cc: [email protected];
> >>>>>> [email protected]; [email protected]
> >>>>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
> >>>>>> is set
> >>>>>>
> >>>>>> Paolo Bonzini wrote on 2014-12-18:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 18/12/2014 04:14, Wu, Feng wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> [email protected] wrote on
> >>>>>> mailto:[email protected]] On Behalf Of Paolo:
> >>>>>>>>> [email protected]; Gleb Natapov; Paolo Bonzini;
> >>>>>>>>> [email protected];
> >>>>>>>>> [email protected]; Alex
> Williamson;
> >>>>>>>>> joro-zLv9SwRftAIdnm+Jiang Liu Cc:
> >>>>>>>>> [email protected];
> >>>>>>>>> [email protected]; KVM
> list;
> >>>>>>>>> Eric Auger Subject: Re: [v3 25/26] KVM: Suppress
> >>>>>>>>> posted-interrupt when 'SN' is set
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 12/12/2014 16:14, Feng Wu wrote:
> >>>>>>>>>> Currently, we don't support urgent interrupt, all
> >>>>>>>>>> interrupts are recognized as non-urgent interrupt, so we
> >>>>>>>>>> cannot send posted-interrupt when 'SN' is set.
> >>>>>>>>>
> >>>>>>>>> Can this happen? If the vcpu is in guest mode, it cannot
> >>>>>>>>> have been scheduled out, and that's the only case when SN is set.
> >>>>>>>>>
> >>>>>>>>> Paolo
> >>>>>>>>
> >>>>>>>> Currently, the only place where SN is set is vCPU is
> >>>>>>>> preempted and
> >>>>>>
> >>>>>> If the vCPU is preempted, shouldn't the subsequent be ignored?
> >>>>>> What happens if a PI is occurs when vCPU is preempted?
> >>>>>
> >>>>> If a vCPU is preempted, the 'SN' bit is set, the subsequent
> >>>>> interrupts are suppressed for posting.
> >>>>
> >>>> I mean what happens if we don't set SN bit. From my point, if
> >>>> preempter already disabled the interrupt, it is ok to leave SN
> >>>> bit as zero. But if preempter enabled the interrupt, doesn't this
> >>>> mean he allow interrupt to happen? BTW, since there already has
> >>>> ON bit, so this means there only have one interrupt arrived at
> >>>> most and it doesn't hurt performance. Do we really need to set SN bit?
> >>>
> >>>
> >>> See this scenario:
> >>> vCPU0 is running on pCPU0
> >>> --> vCPU0 is preempted by vCPU1
> >>> --> Then vCPU1 is running on pCPU0 and vCPU0 is waiting for
> >>> --> schedule in runqueue
> >>>
> >>> If the we don't set SN for vCPU0, then all subsequent interrupts
> >>> for
> >>> vCPU0 is posted to vCPU1, this will consume hardware and software
> >>
> >> The PI vector for vCPU1 is notification vector, but the PI vector
> >> for
> >> vCPU0 should be wakeup vector. Why vCPU1 will consume this PI event?
> >
> > Wakeup vector is only used for blocking case, when vCPU is preempted
> > and waiting in the runqueue, the NV is the notification vector.
>
> I see your point. But from performance point, if we can schedule the vCPU to
> another PCPU to handle the interrupt, it would helpful. But I remember current
> KVM will not schedule the vCPU in run queue (even though it got preempted) to
> another pCPU to run(Am I right?). So it may hard to do it.
>
KVM is using the Linux scheduler, when the preempted vCPU (in runqueue) is
scheduled again depends on the scheduling algorithm itself, I think it is a little
hard for us to get involved.
I think what you mentioned is a little like the urgent interrupt in VT-d PI Spec.
For this kind of interrupts, if an interrupt is coming for an preempted vCPU
(waiting in the run queue), we need to schedule the vCPU immediately. This
is some real time things. And we don't support urgent interrupt so far.
Thanks,
Feng
> >
> > Thanks,
> > Feng
> >
> >>
> >>> efforts and in fact it is not needed at all. If SN is set for
> >>> vCPU0, VT-d hardware will not issue Notification Event for vCPU0
> >>> when an interrupt is for it, but just setting the related PIR bit.
> >>>
> >>> Thanks,
> >>> Feng
> >>>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Feng
> >>>>>
> >>>>>>
> >>>>>>>> waiting for the next scheduling in the runqueue. But I am not
> >>>>>>>> sure whether we need to set SN for other purpose in future.
> >>>>>>>> Adding SN checking here is just to follow the Spec.
> >>>>>>>> non-urgent interrupts are suppressed
> >>>>>>> when SN is set.
> >>>>>>>
> >>>>>>> I would change that to a WARN_ON_ONCE then.
> >>>>>>
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Yang
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> iommu mailing list
> >>>>>> [email protected]
> >>>>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> >>>>
> >>>>
> >>>> Best regards,
> >>>> Yang
> >>>>
> >>
> >>
> >> Best regards,
> >> Yang
> >>
>
>
> Best regards,
> Yang
>
Wu, Feng wrote on 2014-12-19:
>
>
> Zhang, Yang Z wrote on 2014-12-19:
>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is
>> set
>>
>> Wu, Feng wrote on 2014-12-19:
>>>
>>>
>>> Zhang, Yang Z wrote on 2014-12-19:
>>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
>>>> is set
>>>>
>>>> Wu, Feng wrote on 2014-12-19:
>>>>>
>>>>>
>>>>> Zhang, Yang Z wrote on 2014-12-19:
>>>>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
>>>>>> is set
>>>>>>
>>>>>> Wu, Feng wrote on 2014-12-19:
>>>>>>>
>>>>>>>
>>>>>>> [email protected] wrote on
>>>>>> mailto:[email protected]] On Behalf Of:
>>>>>>>> Cc: [email protected];
>>>>>>>> [email protected]; [email protected]
>>>>>>>> Subject: RE: [v3 25/26] KVM: Suppress posted-interrupt when 'SN'
>>>>>>>> is set
>>>>>>>>
>>>>>>>> Paolo Bonzini wrote on 2014-12-18:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 18/12/2014 04:14, Wu, Feng wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [email protected] wrote on
>>>>>>>> mailto:[email protected]] On Behalf Of Paolo:
>>>>>>>>>>> [email protected]; Gleb Natapov; Paolo Bonzini;
>>>>>>>>>>> [email protected];
>>>>>>>>>>> [email protected]; Alex Williamson;
>>>>>>>>>>> joro-zLv9SwRftAIdnm+Jiang Liu Cc:
>>>>>>>>>>> [email protected];
>>>>>>>>>>> [email protected];
> KVM
>> list;
>>>>>>>>>>> Eric Auger Subject: Re: [v3 25/26] KVM: Suppress
>>>>>>>>>>> posted-interrupt when 'SN' is set
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 12/12/2014 16:14, Feng Wu wrote:
>>>>>>>>>>>> Currently, we don't support urgent interrupt, all
>>>>>>>>>>>> interrupts are recognized as non-urgent interrupt, so we
>>>>>>>>>>>> cannot send posted-interrupt when 'SN' is set.
>>>>>>>>>>>
>>>>>>>>>>> Can this happen? If the vcpu is in guest mode, it cannot
>>>>>>>>>>> have been scheduled out, and that's the only case when SN is set.
>>>>>>>>>>>
>>>>>>>>>>> Paolo
>>>>>>>>>>
>>>>>>>>>> Currently, the only place where SN is set is vCPU is
>>>>>>>>>> preempted and
>>>>>>>>
>>>>>>>> If the vCPU is preempted, shouldn't the subsequent be ignored?
>>>>>>>> What happens if a PI is occurs when vCPU is preempted?
>>>>>>>
>>>>>>> If a vCPU is preempted, the 'SN' bit is set, the subsequent
>>>>>>> interrupts are suppressed for posting.
>>>>>>
>>>>>> I mean what happens if we don't set SN bit. From my point, if
>>>>>> preempter already disabled the interrupt, it is ok to leave SN
>>>>>> bit as zero. But if preempter enabled the interrupt, doesn't
>>>>>> this mean he allow interrupt to happen? BTW, since there
>>>>>> already has ON bit, so this means there only have one interrupt
>>>>>> arrived at most and it doesn't hurt performance. Do we really need to set SN bit?
>>>>>
>>>>>
>>>>> See this scenario:
>>>>> vCPU0 is running on pCPU0
>>>>> --> vCPU0 is preempted by vCPU1
>>>>> --> Then vCPU1 is running on pCPU0 and vCPU0 is waiting for
>>>>> --> schedule in runqueue
>>>>>
>>>>> If the we don't set SN for vCPU0, then all subsequent interrupts
>>>>> for
>>>>> vCPU0 is posted to vCPU1, this will consume hardware and
>>>>> software
>>>>
>>>> The PI vector for vCPU1 is notification vector, but the PI vector
>>>> for
>>>> vCPU0 should be wakeup vector. Why vCPU1 will consume this PI event?
>>>
>>> Wakeup vector is only used for blocking case, when vCPU is
>>> preempted and waiting in the runqueue, the NV is the notification vector.
>>
>> I see your point. But from performance point, if we can schedule the
>> vCPU to another PCPU to handle the interrupt, it would helpful. But I
>> remember current KVM will not schedule the vCPU in run queue (even
>> though it got preempted) to another pCPU to run(Am I right?). So it may
>> hard to do it.
>>
>
> KVM is using the Linux scheduler, when the preempted vCPU (in
> runqueue) is scheduled again depends on the scheduling algorithm
> itself, I think it is a little hard for us to get involved.
>
> I think what you mentioned is a little like the urgent interrupt in VT-d PI Spec.
> For this kind of interrupts, if an interrupt is coming for an
> preempted vCPU (waiting in the run queue), we need to schedule the
> vCPU immediately. This is some real time things. And we don't support urgent interrupt so far.
Yes. IIRC, if we use two global vectors mechanism properly, there should no need to use hardware urgent interrupt mechanism. :)
>
> Thanks,
> Feng
>
>>>
>>> Thanks,
>>> Feng
>>>
>>>>
>>>>> efforts and in fact it is not needed at all. If SN is set for
>>>>> vCPU0, VT-d hardware will not issue Notification Event for vCPU0
>>>>> when an interrupt is for it, but just setting the related PIR bit.
>>>>>
>>>>> Thanks,
>>>>> Feng
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Feng
>>>>>>>
>>>>>>>>
>>>>>>>>>> waiting for the next scheduling in the runqueue. But I am
>>>>>>>>>> not sure whether we need to set SN for other purpose in future.
>>>>>>>>>> Adding SN checking here is just to follow the Spec.
>>>>>>>>>> non-urgent interrupts are suppressed
>>>>>>>>> when SN is set.
>>>>>>>>>
>>>>>>>>> I would change that to a WARN_ON_ONCE then.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> iommu mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Yang
>>>>>>
>>>>
>>>>
>>>> Best regards,
>>>> Yang
>>>>
>>
>>
>> Best regards,
>> Yang
>>
Best regards,
Yang
On 19/12/2014 02:30, Wu, Feng wrote:
>>> How this can work well? All subsequent interrupts are delivered to
>>> one vCPU? It shouldn't be the best solution, need more consideration.
>>
>> Well, it's a hardware limitation. The alternative (which is easy to
>> implement) is to only do PI for single-CPU interrupts. This should work
>> well for multiqueue NICs (and of course for UP guests :)), so perhaps
>> it's a good idea to only support that as a first attempt.
>>
>> Paolo
>
> Paolo, what do you mean by "single-CPU interrupts"? Do you mean we don't
> support lowest priority interrupts for PI? But Linux OS uses lowest priority
> for most of the case? If so, we can hardly get benefit from this feature for
> Linux guest OS.
You can post lowest priority interrupts if they are delivered to a
single CPU, in which case they are effectively fixed priority.
If they are broadcast to multiple CPUs, do not post them.
Paolo
On 19/12/2014 03:19, Wu, Feng wrote:
>>> > >
>>> > > +#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
>>> > > +
>>> > > +struct kvm_irq_routing_table {
>>> > > + int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
>>> > > + struct kvm_kernel_irq_routing_entry *rt_entries;
>>> > > + u32 nr_rt_entries;
>>> > > + /*
>>> > > + * Array indexed by gsi. Each entry contains list of irq chips
>>> > > + * the gsi is connected to.
>>> > > + */
>>> > > + struct hlist_head map[0];
>>> > > +};
>>> > > +
>>> > > +#else
>>> > > +
>>> > > +struct kvm_irq_routing_table {};
>> >
>> > If possible, just make this "struct kvm_irq_routing_table;" and pull
>> > this line to include/linux/kvm_types.h.
>> >
>> > Paolo
> Do you mean move the definition of struct kvm_irq_routing_table
> to include/linux/kvm_types.h and add a declaration here?
Move
struct kvm_irq_routing_table;
to include/linux/kvm_types.h. In kvm_host.h, leave the #ifdef with the
full definition but drop the #else.
Paolo
On 19/12/2014 02:46, Zhang, Yang Z wrote:
>> If the IRQ is posted, its affinity is controlled by guest (irq <--->
>> vCPU <----> pCPU), it has no effect when host changes its affinity.
>
> That's the problem: User is able to changes it in host but it never
> takes effect since it is actually controlled by guest. I guess it
> will break the IRQ balance too.
I don't think that's a problem.
Controlling the affinity in the host affects which CPU in the host takes
care of signaling the guest.
If this signaling is done directly by the chipset, there is no need to
do anything in the host and thus the host affinity can be bypassed.
Paolo
On 19/12/2014 06:25, Zhang, Yang Z wrote:
> I see your point. But from performance point, if we can schedule the
> vCPU to another PCPU to handle the interrupt, it would helpful. But I
> remember current KVM will not schedule the vCPU in run queue (even
> though it got preempted) to another pCPU to run(Am I right?). So it
> may hard to do it.
Yes. If the vCPU is in the run queue, it means it exhausted its
quantum. As Feng said, the scheduler can decide to migrate it to
another pCPU, or it can decide to leave it runnable but not start it.
KVM doesn't try to force the scheduler one way or the other.
If the vCPU is I/O bound, it will not exhaust its quantum and will not
be preempted. It will block, and the wakeup vector will restart it.
I don't think urgent notifications are interesting. If you want to do
real time work, pin the vCPU to a physical CPU, and isolate the pCPU
with isolcpus. Then the vCPU will always be running.
Paolo
> -----Original Message-----
> From: Paolo Bonzini [mailto:[email protected]] On Behalf Of Paolo
> Bonzini
> Sent: Friday, December 19, 2014 8:01 PM
> To: Zhang, Yang Z; Wu, Feng; Paolo Bonzini; KVM list
> Cc: [email protected]; [email protected]
> Subject: Re: [v3 25/26] KVM: Suppress posted-interrupt when 'SN' is set
>
>
>
> On 19/12/2014 06:25, Zhang, Yang Z wrote:
> > I see your point. But from performance point, if we can schedule the
> > vCPU to another PCPU to handle the interrupt, it would helpful. But I
> > remember current KVM will not schedule the vCPU in run queue (even
> > though it got preempted) to another pCPU to run(Am I right?). So it
> > may hard to do it.
>
> Yes. If the vCPU is in the run queue, it means it exhausted its
> quantum. As Feng said, the scheduler can decide to migrate it to
> another pCPU, or it can decide to leave it runnable but not start it.
> KVM doesn't try to force the scheduler one way or the other.
>
> If the vCPU is I/O bound, it will not exhaust its quantum and will not
> be preempted. It will block, and the wakeup vector will restart it.
>
> I don't think urgent notifications are interesting. If you want to do
> real time work, pin the vCPU to a physical CPU, and isolate the pCPU
> with isolcpus. Then the vCPU will always be running.
>
> Paolo
I agree, thanks Paolo!
Thanks,
Feng
> -----Original Message-----
> From: Paolo Bonzini [mailto:[email protected]] On Behalf Of Paolo
> Bonzini
> Sent: Friday, December 19, 2014 8:00 PM
> To: Wu, Feng; [email protected]
> Cc: [email protected]; [email protected]
> Subject: Re: [v3 16/26] KVM: Make struct kvm_irq_routing_table accessible
>
>
>
> On 19/12/2014 03:19, Wu, Feng wrote:
> >>> > >
> >>> > > +#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
> >>> > > +
> >>> > > +struct kvm_irq_routing_table {
> >>> > > + int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
> >>> > > + struct kvm_kernel_irq_routing_entry *rt_entries;
> >>> > > + u32 nr_rt_entries;
> >>> > > + /*
> >>> > > + * Array indexed by gsi. Each entry contains list of irq chips
> >>> > > + * the gsi is connected to.
> >>> > > + */
> >>> > > + struct hlist_head map[0];
> >>> > > +};
> >>> > > +
> >>> > > +#else
> >>> > > +
> >>> > > +struct kvm_irq_routing_table {};
> >> >
> >> > If possible, just make this "struct kvm_irq_routing_table;" and pull
> >> > this line to include/linux/kvm_types.h.
> >> >
> >> > Paolo
> > Do you mean move the definition of struct kvm_irq_routing_table
> > to include/linux/kvm_types.h and add a declaration here?
>
> Move
>
> struct kvm_irq_routing_table;
>
> to include/linux/kvm_types.h. In kvm_host.h, leave the #ifdef with the
> full definition but drop the #else.
>
> Paolo
Paolo, Thanks for the explanation. I notice that " struct kvm_irq_routing_table;"
is already in include/linux/kvm_types.h.
Thanks,
Feng
> -----Original Message-----
> From: Paolo Bonzini [mailto:[email protected]] On Behalf Of Paolo
> Bonzini
> Sent: Friday, December 19, 2014 7:59 PM
> To: Wu, Feng; Paolo Bonzini; Zhang, Yang Z; Thomas Gleixner; Ingo Molnar; H.
> Peter Anvin; [email protected]; Gleb Natapov; [email protected];
> [email protected]; Alex Williamson; Jiang Liu
> Cc: [email protected]; [email protected]; KVM list;
> Eric Auger
> Subject: Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for
> VT-d PI
>
>
>
> On 19/12/2014 02:30, Wu, Feng wrote:
> >>> How this can work well? All subsequent interrupts are delivered to
> >>> one vCPU? It shouldn't be the best solution, need more consideration.
> >>
> >> Well, it's a hardware limitation. The alternative (which is easy to
> >> implement) is to only do PI for single-CPU interrupts. This should work
> >> well for multiqueue NICs (and of course for UP guests :)), so perhaps
> >> it's a good idea to only support that as a first attempt.
> >>
> >> Paolo
> >
> > Paolo, what do you mean by "single-CPU interrupts"? Do you mean we don't
> > support lowest priority interrupts for PI? But Linux OS uses lowest priority
> > for most of the case? If so, we can hardly get benefit from this feature for
> > Linux guest OS.
>
> You can post lowest priority interrupts if they are delivered to a
> single CPU, in which case they are effectively fixed priority.
>
> If they are broadcast to multiple CPUs, do not post them.
>
> Paolo
In my understanding, lowest priority interrupts are always delivered to a
Single CPU, we need to find the right destination CPU from the cpumask.
This is what I do in this patch. Did I misunderstanding something in your
Comments? Thanks a lot!
Actually, we don't support posting broadcast/multicast interrupts, because
the interrupt is associated with one Posted-interrupts descriptor, then one
vCPU.
Thanks,
Feng
On 20/12/2014 00:48, Wu, Feng wrote:
> In my understanding, lowest priority interrupts are always delivered to a
> Single CPU, we need to find the right destination CPU from the cpumask.
Yes, but which CPU however differs every time the interrupt is
delivered. So the emulation here is a bit poor. For now, please limit
PI to fixed interrupts.
> Actually, we don't support posting broadcast/multicast interrupts, because
> the interrupt is associated with one Posted-interrupts descriptor, then one
> vCPU.
Understood.
Paolo
> -----Original Message-----
> From: Paolo Bonzini [mailto:[email protected]] On Behalf Of Paolo
> Bonzini
> Sent: Saturday, December 20, 2014 9:17 PM
> To: Wu, Feng; Paolo Bonzini; Zhang, Yang Z; Thomas Gleixner; Ingo Molnar; H.
> Peter Anvin; [email protected]; Gleb Natapov; [email protected];
> [email protected]; Alex Williamson; Jiang Liu
> Cc: [email protected]; [email protected]; KVM list;
> Eric Auger
> Subject: Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for
> VT-d PI
>
>
>
> On 20/12/2014 00:48, Wu, Feng wrote:
> > In my understanding, lowest priority interrupts are always delivered to a
> > Single CPU, we need to find the right destination CPU from the cpumask.
>
> Yes, but which CPU however differs every time the interrupt is
> delivered. So the emulation here is a bit poor. For now, please limit
> PI to fixed interrupts.
Do you mean we don't support Lowest priority interrupts? As I mentioned before,
Lowest priority interrupts is widely used in Linux, so I think supporting lowest priority
interrupts is very important for Linux guest OS. Do you have any ideas/suggestions about
how to support Lowest priority interrupts for PI? Thanks a lot!
Thanks,
Feng
>
> > Actually, we don't support posting broadcast/multicast interrupts, because
> > the interrupt is associated with one Posted-interrupts descriptor, then one
> > vCPU.
>
> Understood.
>
> Paolo
On 22/12/2014 05:48, Wu, Feng wrote:
> Do you mean we don't support Lowest priority interrupts? As I mentioned before,
> Lowest priority interrupts is widely used in Linux, so I think supporting lowest priority
> interrupts is very important for Linux guest OS. Do you have any ideas/suggestions about
> how to support Lowest priority interrupts for PI? Thanks a lot!
Can you support them only if the destination is a single CPU?
Paolo
> -----Original Message-----
> From: Paolo Bonzini [mailto:[email protected]]
> Sent: Monday, December 22, 2014 5:28 PM
> To: Wu, Feng; Zhang, Yang Z; Thomas Gleixner; Ingo Molnar; H. Peter Anvin;
> [email protected]; Gleb Natapov; [email protected]; [email protected];
> Alex Williamson; Jiang Liu
> Cc: [email protected]; [email protected]; KVM list;
> Eric Auger
> Subject: Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for
> VT-d PI
>
>
>
> On 22/12/2014 05:48, Wu, Feng wrote:
> > Do you mean we don't support Lowest priority interrupts? As I mentioned
> before,
> > Lowest priority interrupts is widely used in Linux, so I think supporting lowest
> priority
> > interrupts is very important for Linux guest OS. Do you have any
> ideas/suggestions about
> > how to support Lowest priority interrupts for PI? Thanks a lot!
>
> Can you support them only if the destination is a single CPU?
Sorry, I am not quite understand this. I still don't understand the "single CPU" here.
Lowest priority interrupts always have a cpumask which contains multiple CPU.
Thanks,
Feng
>
> Paolo
On 22/12/2014 12:04, Wu, Feng wrote:
> > Can you support them only if the destination is a single CPU?
>
> Sorry, I am not quite understand this. I still don't understand the "single CPU" here.
> Lowest priority interrupts always have a cpumask which contains multiple CPU.
Yes, and those need not be accelerated. But what if you set affinity to
a single CPU?
Paolo
> -----Original Message-----
> From: Paolo Bonzini [mailto:[email protected]]
> Sent: Monday, December 22, 2014 7:07 PM
> To: Wu, Feng; Zhang, Yang Z; Thomas Gleixner; Ingo Molnar; H. Peter Anvin;
> [email protected]; Gleb Natapov; [email protected]; [email protected];
> Alex Williamson; Jiang Liu
> Cc: [email protected]; [email protected]; KVM list;
> Eric Auger
> Subject: Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for
> VT-d PI
>
>
>
> On 22/12/2014 12:04, Wu, Feng wrote:
> > > Can you support them only if the destination is a single CPU?
> >
> > Sorry, I am not quite understand this. I still don't understand the "single CPU"
> here.
> > Lowest priority interrupts always have a cpumask which contains multiple
> CPU.
>
> Yes, and those need not be accelerated. But what if you set affinity to
> a single CPU?
How do I set affinity to a single CPU if guest configure a lowest priority interrupt? Thanks a lot!
Thanks,
Feng
>
> Paolo
On 22/12/2014 12:17, Wu, Feng wrote:
>> Yes, and those need not be accelerated. But what if you set
>> affinity to a single CPU?
>
> How do I set affinity to a single CPU if guest configure a lowest
> priority interrupt? Thanks a lot!
I mean if the guest (via irqbalance and /proc/irq/) configures affinity
to a single vCPU. In that case, you can use PI.
Paolo
On Mon, Dec 22, 2014 at 12:23:36PM +0100, Paolo Bonzini wrote:
>
>
> On 22/12/2014 12:17, Wu, Feng wrote:
> >> Yes, and those need not be accelerated. But what if you set
> >> affinity to a single CPU?
> >
> > How do I set affinity to a single CPU if guest configure a lowest
> > priority interrupt? Thanks a lot!
>
> I mean if the guest (via irqbalance and /proc/irq/) configures affinity
> to a single vCPU. In that case, you can use PI.
>
The problem is we still need to support PI with lowest priority delivery mode
even if guest does not configure irq affinity via /proc/irq/. Don't we?
Thanks
-Yong
Paolo Bonzini wrote on 2014-12-19:
>
>
> On 19/12/2014 02:46, Zhang, Yang Z wrote:
>>> If the IRQ is posted, its affinity is controlled by guest (irq
>>> <---> vCPU <----> pCPU), it has no effect when host changes its affinity.
>>
>> That's the problem: User is able to changes it in host but it never
>> takes effect since it is actually controlled by guest. I guess it
>> will break the IRQ balance too.
>
> I don't think that's a problem.
>
> Controlling the affinity in the host affects which CPU in the host
> takes care of signaling the guest.
>
> If this signaling is done directly by the chipset, there is no need to
> do anything in the host and thus the host affinity can be bypassed.
I don't quite understand it. If user set an interrupt's affinity to a CPU, but he still see the interrupt delivers to other CPUs in host. Do you think it is a right behavior?
>
> Paolo
Best regards,
Yang
Paolo Bonzini wrote on 2014-12-23:
>> The problem is we still need to support PI with lowest priority
>> delivery mode
> even if guest does not configure irq affinity via /proc/irq/. Don't we?
>
> Yes, but we can get the basic support working first.
>
> I and Feng talked on irc and agreed to start with a simple
> implementation. Then he can add support for vector hashing for all
> interrupts, both vt-d pi and normal LAPIC interrupts.
Agree. Obviously, current PI has some limitations to achieve highest performance. We can start with a simple implementation but must ensure we don't change hardware's behavior(From guest's point). Feng's current implementation or use the way I suggested are both ok since both of them cannot solve the problem well.
>
> Paolo
>
Best regards,
Yang
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
On 23/12/2014 01:37, Zhang, Yang Z wrote:
> I don't quite understand it. If user set an interrupt's affinity to a
> CPU, but he still see the interrupt delivers to other CPUs in host.
> Do you think it is a right behavior?
No, the interrupt is not delivered at all in the host. Normally you'd have:
- interrupt delivered to CPU from host affinity
- VFIO interrupt handler writes to irqfd
- interrupt delivered to vCPU from guest affinity
Here, you just skip the first two steps. The interrupt is delivered to
the thread that is running the vCPU directly, so the host affinity is
bypassed entirely.
... unless you are considering the case where the vCPU is blocked and
the host is processing the posted interrupt wakeup vector. In that case
yes, it would be better to set NDST to a CPU matching the host affinity.
But it would be handled in patch 24. We also have the same problem
with lowest-priority interrupts; likely the host has configured the
interrupt affinity for any CPU. So we can do it later when we add
vector hashing support. In the meanwhile, Feng, please add a FIXME comment.
Does this make sense?
Paolo
> -----Original Message-----
> From: Paolo Bonzini [mailto:[email protected]] On Behalf Of Paolo
> Bonzini
> Sent: Tuesday, December 23, 2014 4:48 PM
> To: Zhang, Yang Z; Paolo Bonzini; Wu, Feng; Thomas Gleixner; Ingo Molnar; H.
> Peter Anvin; [email protected]; Gleb Natapov; [email protected];
> [email protected]; Alex Williamson; Jiang Liu
> Cc: [email protected]; [email protected]; KVM list;
> Eric Auger
> Subject: Re: [v3 06/26] iommu, x86: No need to migrating irq for VT-d
> Posted-Interrupts
>
>
>
> On 23/12/2014 01:37, Zhang, Yang Z wrote:
> > I don't quite understand it. If user set an interrupt's affinity to a
> > CPU, but he still see the interrupt delivers to other CPUs in host.
> > Do you think it is a right behavior?
>
> No, the interrupt is not delivered at all in the host. Normally you'd have:
>
> - interrupt delivered to CPU from host affinity
>
> - VFIO interrupt handler writes to irqfd
>
> - interrupt delivered to vCPU from guest affinity
>
> Here, you just skip the first two steps. The interrupt is delivered to
> the thread that is running the vCPU directly, so the host affinity is
> bypassed entirely.
>
> ... unless you are considering the case where the vCPU is blocked and
> the host is processing the posted interrupt wakeup vector. In that case
> yes, it would be better to set NDST to a CPU matching the host affinity.
In my understanding, wakeup vector should have no relationship with the
host affinity of the irq. Wakeup notification event should be delivered to
the pCPU which the vCPU was blocked on. And in kernel's point of view,
the irq is not associated with the wakeup vector, right?
Thanks,
Feng
> But it would be handled in patch 24. We also have the same problem
> with lowest-priority interrupts; likely the host has configured the
> interrupt affinity for any CPU. So we can do it later when we add
> vector hashing support. In the meanwhile, Feng, please add a FIXME
> comment.
>
> Does this make sense?
>
> Paolo
On 23/12/2014 10:07, Wu, Feng wrote:
>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>> I don't quite understand it. If user set an interrupt's affinity to a
>>> CPU, but he still see the interrupt delivers to other CPUs in host.
>>> Do you think it is a right behavior?
>>
>> No, the interrupt is not delivered at all in the host. Normally you'd have:
>>
>> - interrupt delivered to CPU from host affinity
>>
>> - VFIO interrupt handler writes to irqfd
>>
>> - interrupt delivered to vCPU from guest affinity
>>
>> Here, you just skip the first two steps. The interrupt is delivered to
>> the thread that is running the vCPU directly, so the host affinity is
>> bypassed entirely.
>>
>> ... unless you are considering the case where the vCPU is blocked and
>> the host is processing the posted interrupt wakeup vector. In that case
>> yes, it would be better to set NDST to a CPU matching the host affinity.
>
> In my understanding, wakeup vector should have no relationship with the
> host affinity of the irq. Wakeup notification event should be delivered to
> the pCPU which the vCPU was blocked on. And in kernel's point of view,
> the irq is not associated with the wakeup vector, right?
That is correct indeed. It is not associated to the wakeup vector,
hence this patch is right, I think.
However, the wakeup vector has the same function as the VFIO interrupt
handler, so you could argue that it is tied to the host affinity rather
than the guest. Let's wait for Yang to answer.
Paolo
Paolo Bonzini wrote on 2014-12-23:
>
>
> On 23/12/2014 10:07, Wu, Feng wrote:
>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>>> I don't quite understand it. If user set an interrupt's affinity
>>>> to a CPU, but he still see the interrupt delivers to other CPUs in host.
>>>> Do you think it is a right behavior?
>>>
>>> No, the interrupt is not delivered at all in the host. Normally you'd have:
>>>
>>> - interrupt delivered to CPU from host affinity
>>>
>>> - VFIO interrupt handler writes to irqfd
>>>
>>> - interrupt delivered to vCPU from guest affinity
>>>
>>> Here, you just skip the first two steps. The interrupt is
>>> delivered to the thread that is running the vCPU directly, so the
>>> host affinity is bypassed entirely.
>>>
>>> ... unless you are considering the case where the vCPU is blocked
>>> and the host is processing the posted interrupt wakeup vector. In
>>> that case yes, it would be better to set NDST to a CPU matching the host affinity.
>>
>> In my understanding, wakeup vector should have no relationship with
>> the host affinity of the irq. Wakeup notification event should be
>> delivered to the pCPU which the vCPU was blocked on. And in kernel's
>> point of view, the irq is not associated with the wakeup vector, right?
>
> That is correct indeed. It is not associated to the wakeup vector,
> hence this patch is right, I think.
>
> However, the wakeup vector has the same function as the VFIO interrupt
> handler, so you could argue that it is tied to the host affinity
> rather than the guest. Let's wait for Yang to answer.
Actually, that's my original question too. I am wondering what happens if the user changes the assigned device's affinity in host's /proc/irq/? If ignore it is acceptable, then this patch is ok. But it seems the discussion out of my scope, need some experts to tell us their idea since it will impact the user experience.
>
> Paolo
Best regards,
Yang
On 2014/12/24 9:38, Zhang, Yang Z wrote:
> Paolo Bonzini wrote on 2014-12-23:
>>
>>
>> On 23/12/2014 10:07, Wu, Feng wrote:
>>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>>>> I don't quite understand it. If user set an interrupt's affinity
>>>>> to a CPU, but he still see the interrupt delivers to other CPUs in host.
>>>>> Do you think it is a right behavior?
>>>>
>>>> No, the interrupt is not delivered at all in the host. Normally you'd have:
>>>>
>>>> - interrupt delivered to CPU from host affinity
>>>>
>>>> - VFIO interrupt handler writes to irqfd
>>>>
>>>> - interrupt delivered to vCPU from guest affinity
>>>>
>>>> Here, you just skip the first two steps. The interrupt is
>>>> delivered to the thread that is running the vCPU directly, so the
>>>> host affinity is bypassed entirely.
>>>>
>>>> ... unless you are considering the case where the vCPU is blocked
>>>> and the host is processing the posted interrupt wakeup vector. In
>>>> that case yes, it would be better to set NDST to a CPU matching the host affinity.
>>>
>>> In my understanding, wakeup vector should have no relationship with
>>> the host affinity of the irq. Wakeup notification event should be
>>> delivered to the pCPU which the vCPU was blocked on. And in kernel's
>>> point of view, the irq is not associated with the wakeup vector, right?
>>
>> That is correct indeed. It is not associated to the wakeup vector,
>> hence this patch is right, I think.
>>
>> However, the wakeup vector has the same function as the VFIO interrupt
>> handler, so you could argue that it is tied to the host affinity
>> rather than the guest. Let's wait for Yang to answer.
>
> Actually, that's my original question too. I am wondering what happens if the user changes the assigned device's affinity in host's /proc/irq/? If ignore it is acceptable, then this patch is ok. But it seems the discussion out of my scope, need some experts to tell us their idea since it will impact the user experience.
Hi Yang,
Originally we have a proposal to return failure when user
sets IRQ affinity through native OS interfaces if an IRQ is in PI
mode. But that proposal will break CPU hot-removal because OS needs
to migrate away all IRQs binding to the CPU to be offlined. Then we
propose saving user IRQ affinity setting without changing hardware
configuration (keeping PI configuration). Later when PI mode is
disabled, the cached affinity setting will be used to setup IRQ
destination for native OS. On the other hand, for IRQ in PI mode,
it won't be delivered to native OS, so user may not sense that
the IRQ is delivered to CPUs other than those in the affinity set.
In that aspect, I think it's acceptable:)
Regards!
Gerry
>
>>
>> Paolo
>
>
> Best regards,
> Yang
>
>
Jiang Liu wrote on 2014-12-24:
> On 2014/12/24 9:38, Zhang, Yang Z wrote:
>> Paolo Bonzini wrote on 2014-12-23:
>>>
>>>
>>> On 23/12/2014 10:07, Wu, Feng wrote:
>>>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>>>>> I don't quite understand it. If user set an interrupt's affinity
>>>>>> to a CPU, but he still see the interrupt delivers to other CPUs in host.
>>>>>> Do you think it is a right behavior?
>>>>>
>>>>> No, the interrupt is not delivered at all in the host. Normally you'd have:
>>>>>
>>>>> - interrupt delivered to CPU from host affinity
>>>>>
>>>>> - VFIO interrupt handler writes to irqfd
>>>>>
>>>>> - interrupt delivered to vCPU from guest affinity
>>>>>
>>>>> Here, you just skip the first two steps. The interrupt is
>>>>> delivered to the thread that is running the vCPU directly, so the
>>>>> host affinity is bypassed entirely.
>>>>>
>>>>> ... unless you are considering the case where the vCPU is blocked
>>>>> and the host is processing the posted interrupt wakeup vector.
>>>>> In that case yes, it would be better to set NDST to a CPU
>>>>> matching the host
> affinity.
>>>>
>>>> In my understanding, wakeup vector should have no relationship
>>>> with the host affinity of the irq. Wakeup notification event
>>>> should be delivered to the pCPU which the vCPU was blocked on. And
>>>> in kernel's point of view, the irq is not associated with the wakeup vector, right?
>>>
>>> That is correct indeed. It is not associated to the wakeup vector,
>>> hence this patch is right, I think.
>>>
>>> However, the wakeup vector has the same function as the VFIO
>>> interrupt handler, so you could argue that it is tied to the host
>>> affinity rather than the guest. Let's wait for Yang to answer.
>>
>> Actually, that's my original question too. I am wondering what
>> happens if the
> user changes the assigned device's affinity in host's /proc/irq/? If
> ignore it is acceptable, then this patch is ok. But it seems the
> discussion out of my scope, need some experts to tell us their idea since it will impact the user experience.
> Hi Yang,
Hi Jiang,
> Originally we have a proposal to return failure when user sets IRQ
> affinity through native OS interfaces if an IRQ is in PI mode. But
> that proposal will break CPU hot-removal because OS needs to migrate
> away all IRQs binding to the CPU to be offlined. Then we propose
> saving user IRQ affinity setting without changing hardware
> configuration (keeping PI configuration). Later when PI mode is
> disabled, the cached affinity setting will be used to setup IRQ
> destination for native OS. On the other hand, for IRQ in PI mode, it
> won't be delivered to native OS, so user may not sense that the IRQ is delivered to CPUs other than those in the affinity set.
The IRQ is still there but will be delivered to host in the form of PI event(if the VCPU is running in root-mode). I am not sure whether those interrupts should be reflected in /proc/interrupts? If the answer is yes, then which entries should be used, a new PI entry or use the original IRQ entry?
> In that aspect, I think it's acceptable:) Regards!
Yes, if all of you guys(especially the IRQ maintainer) are think it is acceptable then we can follow current implementation and document it.
> Gerry
>>
>>>
>>> Paolo
>>
>>
>> Best regards,
>> Yang
>>
>>
Best regards,
Yang
> -----Original Message-----
> From: Zhang, Yang Z
> Sent: Wednesday, December 24, 2014 10:33 AM
> To: Jiang Liu; Paolo Bonzini; Wu, Feng; Thomas Gleixner; Ingo Molnar; H. Peter
> Anvin; [email protected]; Gleb Natapov; [email protected];
> [email protected]; Alex Williamson
> Cc: [email protected]; [email protected]; KVM list;
> Eric Auger
> Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for VT-d
> Posted-Interrupts
>
> Jiang Liu wrote on 2014-12-24:
> > On 2014/12/24 9:38, Zhang, Yang Z wrote:
> >> Paolo Bonzini wrote on 2014-12-23:
> >>>
> >>>
> >>> On 23/12/2014 10:07, Wu, Feng wrote:
> >>>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
> >>>>>> I don't quite understand it. If user set an interrupt's affinity
> >>>>>> to a CPU, but he still see the interrupt delivers to other CPUs in host.
> >>>>>> Do you think it is a right behavior?
> >>>>>
> >>>>> No, the interrupt is not delivered at all in the host. Normally you'd
> have:
> >>>>>
> >>>>> - interrupt delivered to CPU from host affinity
> >>>>>
> >>>>> - VFIO interrupt handler writes to irqfd
> >>>>>
> >>>>> - interrupt delivered to vCPU from guest affinity
> >>>>>
> >>>>> Here, you just skip the first two steps. The interrupt is
> >>>>> delivered to the thread that is running the vCPU directly, so the
> >>>>> host affinity is bypassed entirely.
> >>>>>
> >>>>> ... unless you are considering the case where the vCPU is blocked
> >>>>> and the host is processing the posted interrupt wakeup vector.
> >>>>> In that case yes, it would be better to set NDST to a CPU
> >>>>> matching the host
> > affinity.
> >>>>
> >>>> In my understanding, wakeup vector should have no relationship
> >>>> with the host affinity of the irq. Wakeup notification event
> >>>> should be delivered to the pCPU which the vCPU was blocked on. And
> >>>> in kernel's point of view, the irq is not associated with the wakeup vector,
> right?
> >>>
> >>> That is correct indeed. It is not associated to the wakeup vector,
> >>> hence this patch is right, I think.
> >>>
> >>> However, the wakeup vector has the same function as the VFIO
> >>> interrupt handler, so you could argue that it is tied to the host
> >>> affinity rather than the guest. Let's wait for Yang to answer.
> >>
> >> Actually, that's my original question too. I am wondering what
> >> happens if the
> > user changes the assigned device's affinity in host's /proc/irq/? If
> > ignore it is acceptable, then this patch is ok. But it seems the
> > discussion out of my scope, need some experts to tell us their idea since it will
> impact the user experience.
> > Hi Yang,
>
> Hi Jiang,
>
> > Originally we have a proposal to return failure when user sets IRQ
> > affinity through native OS interfaces if an IRQ is in PI mode. But
> > that proposal will break CPU hot-removal because OS needs to migrate
> > away all IRQs binding to the CPU to be offlined. Then we propose
> > saving user IRQ affinity setting without changing hardware
> > configuration (keeping PI configuration). Later when PI mode is
> > disabled, the cached affinity setting will be used to setup IRQ
> > destination for native OS. On the other hand, for IRQ in PI mode, it
> > won't be delivered to native OS, so user may not sense that the IRQ is
> delivered to CPUs other than those in the affinity set.
>
> The IRQ is still there but will be delivered to host in the form of PI event(if the
> VCPU is running in root-mode). I am not sure whether those interrupts should
> be reflected in /proc/interrupts? If the answer is yes, then which entries should
> be used, a new PI entry or use the original IRQ entry?
Even though, setting the affinity of the IRQ in host should not affect the destination of the
PI event (normal notification event of wakeup notification event), because the destination
of the PI event is determined in NDST field of Posted-interrupts descriptor and PI notification
vector is global. Just had a discussion with Jiang offline, maybe we can add the statistics
information for the notification vector in /proc/interrupts just like any other global
interrupts.
Thanks,
Feng
>
> > In that aspect, I think it's acceptable:) Regards!
>
> Yes, if all of you guys(especially the IRQ maintainer) are think it is acceptable
> then we can follow current implementation and document it.
>
> > Gerry
> >>
> >>>
> >>> Paolo
> >>
> >>
> >> Best regards,
> >> Yang
> >>
> >>
>
>
> Best regards,
> Yang
>
Wu, Feng wrote on 2014-12-24:
>
>
> Zhang, Yang Z wrote on 2014-12-24:
>> Cc: [email protected]; [email protected];
>> KVM list; Eric Auger
>> Subject: RE: [v3 06/26] iommu, x86: No need to migrating irq for
>> VT-d Posted-Interrupts
>>
>> Jiang Liu wrote on 2014-12-24:
>>> On 2014/12/24 9:38, Zhang, Yang Z wrote:
>>>> Paolo Bonzini wrote on 2014-12-23:
>>>>>
>>>>>
>>>>> On 23/12/2014 10:07, Wu, Feng wrote:
>>>>>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>>>>>>> I don't quite understand it. If user set an interrupt's affinity
>>>>>>>> to a CPU, but he still see the interrupt delivers to other CPUs
>>>>>>>> in host. Do you think it is a right behavior?
>>>>>>>
>>>>>>> No, the interrupt is not delivered at all in the host. Normally
>>>>>>> you'd have:
>>>>>>>
>>>>>>> - interrupt delivered to CPU from host affinity
>>>>>>>
>>>>>>> - VFIO interrupt handler writes to irqfd
>>>>>>>
>>>>>>> - interrupt delivered to vCPU from guest affinity
>>>>>>>
>>>>>>> Here, you just skip the first two steps. The interrupt is
>>>>>>> delivered to the thread that is running the vCPU directly, so
>>>>>>> the host affinity is bypassed entirely.
>>>>>>>
>>>>>>> ... unless you are considering the case where the vCPU is
>>>>>>> blocked and the host is processing the posted interrupt wakeup vector.
>>>>>>> In that case yes, it would be better to set NDST to a CPU
>>>>>>> matching the host
>>> affinity.
>>>>>>
>>>>>> In my understanding, wakeup vector should have no relationship
>>>>>> with the host affinity of the irq. Wakeup notification event
>>>>>> should be delivered to the pCPU which the vCPU was blocked on.
>>>>>> And in kernel's point of view, the irq is not associated with
>>>>>> the wakeup vector,
>> right?
>>>>>
>>>>> That is correct indeed. It is not associated to the wakeup
>>>>> vector, hence this patch is right, I think.
>>>>>
>>>>> However, the wakeup vector has the same function as the VFIO
>>>>> interrupt handler, so you could argue that it is tied to the
>>>>> host affinity rather than the guest. Let's wait for Yang to answer.
>>>>
>>>> Actually, that's my original question too. I am wondering what
>>>> happens if the
>>> user changes the assigned device's affinity in host's /proc/irq/? If
>>> ignore it is acceptable, then this patch is ok. But it seems the
>>> discussion out of my scope, need some experts to tell us their idea
>>> since it will impact the user experience. Hi Yang,
>>
>> Hi Jiang,
>>
>>> Originally we have a proposal to return failure when user sets
>>> IRQ affinity through native OS interfaces if an IRQ is in PI mode.
>>> But that proposal will break CPU hot-removal because OS needs to
>>> migrate away all IRQs binding to the CPU to be offlined. Then we
>>> propose saving user IRQ affinity setting without changing hardware
>>> configuration (keeping PI configuration). Later when PI mode is
>>> disabled, the cached affinity setting will be used to setup IRQ
>>> destination for native OS. On the other hand, for IRQ in PI mode,
>>> it won't be delivered to native OS, so user may not sense that the
>>> IRQ is
>> delivered to CPUs other than those in the affinity set.
>>
>> The IRQ is still there but will be delivered to host in the form of
>> PI event(if the VCPU is running in root-mode). I am not sure whether
>> those interrupts should be reflected in /proc/interrupts? If the
>> answer is yes, then which entries should be used, a new PI entry or
>> use the
> original IRQ entry?
>
> Even though, setting the affinity of the IRQ in host should not affect
> the destination of the PI event (normal notification event of wakeup
This is your implementation. To me, disable PI if the VCPU is going to
run in the CPU out of IRQ affinity bitmap also is acceptable. And it will
keep the user interface looks the same as before.
Hi Thomas, Ingo, Peter
Can you guys help to review this patch? Really appreciate if you can give
some feedbacks.
> notification event), because the destination of the PI event is
> determined in NDST field of Posted-interrupts descriptor and PI
> notification vector is global. Just had a discussion with Jiang
> offline, maybe we can add the statistics information for the notification vector in /proc/interrupts just like any other global interrupts.
>
> Thanks,
> Feng
>
>>
>>> In that aspect, I think it's acceptable:) Regards!
>>
>> Yes, if all of you guys(especially the IRQ maintainer) are think it
>> is acceptable then we can follow current implementation and document it.
>>
>>> Gerry
>>>>
>>>>>
>>>>> Paolo
>>>>
>>>>
>>>> Best regards,
>>>> Yang
>>>>
>>>>
>>
>>
>> Best regards,
>> Yang
>>
Best regards,
Yang
On 2014/12/24 10:32, Zhang, Yang Z wrote:
> Jiang Liu wrote on 2014-12-24:
>> On 2014/12/24 9:38, Zhang, Yang Z wrote:
>>> Paolo Bonzini wrote on 2014-12-23:
>>>>
>>>>
>>>> On 23/12/2014 10:07, Wu, Feng wrote:
>>>>>> On 23/12/2014 01:37, Zhang, Yang Z wrote:
>>>>>>> I don't quite understand it. If user set an interrupt's affinity
>>>>>>> to a CPU, but he still see the interrupt delivers to other CPUs in host.
>>>>>>> Do you think it is a right behavior?
>>>>>>
>>>>>> No, the interrupt is not delivered at all in the host. Normally you'd have:
>>>>>>
>>>>>> - interrupt delivered to CPU from host affinity
>>>>>>
>>>>>> - VFIO interrupt handler writes to irqfd
>>>>>>
>>>>>> - interrupt delivered to vCPU from guest affinity
>>>>>>
>>>>>> Here, you just skip the first two steps. The interrupt is
>>>>>> delivered to the thread that is running the vCPU directly, so the
>>>>>> host affinity is bypassed entirely.
>>>>>>
>>>>>> ... unless you are considering the case where the vCPU is blocked
>>>>>> and the host is processing the posted interrupt wakeup vector.
>>>>>> In that case yes, it would be better to set NDST to a CPU
>>>>>> matching the host
>> affinity.
>>>>>
>>>>> In my understanding, wakeup vector should have no relationship
>>>>> with the host affinity of the irq. Wakeup notification event
>>>>> should be delivered to the pCPU which the vCPU was blocked on. And
>>>>> in kernel's point of view, the irq is not associated with the wakeup vector, right?
>>>>
>>>> That is correct indeed. It is not associated to the wakeup vector,
>>>> hence this patch is right, I think.
>>>>
>>>> However, the wakeup vector has the same function as the VFIO
>>>> interrupt handler, so you could argue that it is tied to the host
>>>> affinity rather than the guest. Let's wait for Yang to answer.
>>>
>>> Actually, that's my original question too. I am wondering what
>>> happens if the
>> user changes the assigned device's affinity in host's /proc/irq/? If
>> ignore it is acceptable, then this patch is ok. But it seems the
>> discussion out of my scope, need some experts to tell us their idea since it will impact the user experience.
>> Hi Yang,
>
> Hi Jiang,
>
>> Originally we have a proposal to return failure when user sets IRQ
>> affinity through native OS interfaces if an IRQ is in PI mode. But
>> that proposal will break CPU hot-removal because OS needs to migrate
>> away all IRQs binding to the CPU to be offlined. Then we propose
>> saving user IRQ affinity setting without changing hardware
>> configuration (keeping PI configuration). Later when PI mode is
>> disabled, the cached affinity setting will be used to setup IRQ
>> destination for native OS. On the other hand, for IRQ in PI mode, it
>> won't be delivered to native OS, so user may not sense that the IRQ is delivered to CPUs other than those in the affinity set.
>
> The IRQ is still there but will be delivered to host in the form of PI event(if the VCPU is running in root-mode). I am not sure whether those interrupts should be reflected in /proc/interrupts? If the answer is yes, then which entries should be used, a new PI entry or use the original IRQ entry?
You are right, the native interrupt statistics will become inaccurate.
Maybe some document about this behavior is preferred.
>
>> In that aspect, I think it's acceptable:) Regards!
>
> Yes, if all of you guys(especially the IRQ maintainer) are think it is acceptable then we can follow current implementation and document it.
Good suggestion, we will send an email to Thomas for advice after New
Year.
>
>> Gerry
>>>
>>>>
>>>> Paolo
>>>
>>>
>>> Best regards,
>>> Yang
>>>
>>>
>
>
> Best regards,
> Yang
>
>
> -----Original Message-----
> From: H. Peter Anvin [mailto:[email protected]]
> Sent: Saturday, January 31, 2015 2:19 AM
> To: Wu, Feng; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [v3 21/26] x86, irq: Define a global vector for VT-d
> Posted-Interrupts
>
> On 12/12/2014 07:14 AM, Feng Wu wrote:
> > Currently, we use a global vector as the Posted-Interrupts
> > Notification Event for all the vCPUs in the system. We need
> > to introduce another global vector for VT-d Posted-Interrtups,
> > which will be used to wakeup the sleep vCPU when an external
> > interrupt from a direct-assigned device happens for that vCPU.
> >
> > Signed-off-by: Feng Wu <[email protected]>
> >
>
> > #ifdef CONFIG_HAVE_KVM
> > +void (*wakeup_handler_callback)(void) = NULL;
> > +EXPORT_SYMBOL_GPL(wakeup_handler_callback);
> > +
>
> Stylistic nitpick: we generally don't explicitly initialize
> global/static pointer variables to NULL (that happens automatically anyway.)
>
> Other than that,
>
> Acked-by: H. Peter Anvin <[email protected]>
Thanks a lot for your review, Peter!
Thanks,
Feng
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
On Fri, Dec 12, 2014 at 11:14:55PM +0800, Feng Wu wrote:
> Currently, we use a global vector as the Posted-Interrupts
> Notification Event for all the vCPUs in the system. We need
> to introduce another global vector for VT-d Posted-Interrtups,
> which will be used to wakeup the sleep vCPU when an external
> interrupt from a direct-assigned device happens for that vCPU.
>
> Signed-off-by: Feng Wu <[email protected]>
Why an additional vector is necessary?
Can't you simply wakeup the vcpu from kvm_posted_intr_ipi, the posted
interrupt vector handler ?
On Fri, Dec 12, 2014 at 11:14:57PM +0800, Feng Wu wrote:
> This patch updates the Posted-Interrupts Descriptor when vCPU
> is preempted.
>
> sched out:
> - Set 'SN' to suppress furture non-urgent interrupts posted for
> the vCPU.
What wakes the vcpu in the case of a non-urgent interrupt, then?
I wonder how is software suppose to configure the urgent/non-urgent
flag. Can you give examples of (hypothetical) urgent and non-urgent
interrupts.
> sched in:
> - Clear 'SN'
> - Change NDST if vCPU is scheduled to a different CPU
> - Set 'NV' to POSTED_INTR_VECTOR
What about:
POSTED_INTR_VECTOR interrupt handler:
- Wakeup vcpu.
- Set 'SN' to suppress future interrupts.
HLT emulation entry:
- Clear 'SN' to receive VT-d interrupt notification.
> Signed-off-by: Feng Wu <[email protected]>
> ---
> arch/x86/kvm/vmx.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index ee3b735..bf2e6cd 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1916,10 +1916,54 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
> vmx->loaded_vmcs->cpu = cpu;
> }
> +
> + if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> + struct pi_desc old, new;
> + unsigned int dest;
> +
> + memset(&old, 0, sizeof(old));
> + memset(&new, 0, sizeof(new));
> +
> + do {
> + old.control = new.control = pi_desc->control;
> + if (vcpu->cpu != cpu) {
> + dest = cpu_physical_id(cpu);
> +
> + if (x2apic_enabled())
> + new.ndst = dest;
> + else
> + new.ndst = (dest << 8) & 0xFF00;
> + }
> +
> + pi_clear_sn(&new);
> +
> + /* set 'NV' to 'notification vector' */
> + new.nv = POSTED_INTR_VECTOR;
> + } while (cmpxchg(&pi_desc->control, old.control,
> + new.control) != old.control);
> + }
> }
>
> static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> {
> + if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> + struct pi_desc old, new;
> +
> + memset(&old, 0, sizeof(old));
> + memset(&new, 0, sizeof(new));
> +
> + /* Set SN when the vCPU is preempted */
> + if (vcpu->preempted) {
> + do {
> + old.control = new.control = pi_desc->control;
> + pi_set_sn(&new);
> + } while (cmpxchg(&pi_desc->control, old.control,
> + new.control) != old.control);
> + }
> + }
> +
> __vmx_load_host_state(to_vmx(vcpu));
> if (!vmm_exclusive) {
> __loaded_vmcs_clear(to_vmx(vcpu)->loaded_vmcs);
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> This patch updates the Posted-Interrupts Descriptor when vCPU
> is blocked.
>
> pre-block:
> - Add the vCPU to the blocked per-CPU list
> - Clear 'SN'
> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
>
> post-block:
> - Remove the vCPU from the per-CPU list
>
> Signed-off-by: Feng Wu <[email protected]>
> ---
Don't see this is needed, can use the existing POSTED_INTR_VECTOR:
If in guest mode, IPI will be handled in VMX non-root by performed
PIR->IRR transfer.
If outside guest mode, POSTED_INTR_VECTOR IPI will be handled by host
which can wakeup the guest (in case it is halted).
> -----Original Message-----
> From: Marcelo Tosatti [mailto:[email protected]]
> Sent: Thursday, February 26, 2015 5:50 AM
> To: Wu, Feng
> Cc: [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
>
> On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> > This patch updates the Posted-Interrupts Descriptor when vCPU
> > is blocked.
> >
> > pre-block:
> > - Add the vCPU to the blocked per-CPU list
> > - Clear 'SN'
> > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> >
> > post-block:
> > - Remove the vCPU from the per-CPU list
> >
> > Signed-off-by: Feng Wu <[email protected]>
> > ---
>
> Don't see this is needed, can use the existing POSTED_INTR_VECTOR:
>
> If in guest mode, IPI will be handled in VMX non-root by performed
> PIR->IRR transfer.
>
> If outside guest mode, POSTED_INTR_VECTOR IPI will be handled by host
> which can wakeup the guest (in case it is halted).
Please see the following scenario:
1. vCPU0 is running on pCPU0
2. vCPU0 is halted and vCPU1 is currently running on pCPU0
3. An interrupt occurs for vCPU0, if we still use POSTED_INTR_VECTOR
for vCPU0, the notification event for vCPU0 (the event will go to pCPU1)
will be consumed by vCPU1 incorrectly. The worst case is that vCPU0
will never be woken up again since the wakeup event for it is always
consumed by other vCPUs incorrectly.
Thanks,
Feng
On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> This patch updates the Posted-Interrupts Descriptor when vCPU
> is blocked.
>
> pre-block:
> - Add the vCPU to the blocked per-CPU list
> - Clear 'SN'
> - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
>
> post-block:
> - Remove the vCPU from the per-CPU list
>
> Signed-off-by: Feng Wu <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +
> arch/x86/kvm/vmx.c | 96 +++++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/x86.c | 22 +++++++---
> include/linux/kvm_host.h | 4 ++
> virt/kvm/kvm_main.c | 6 +++
> 5 files changed, 123 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 13e3e40..32c110a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -101,6 +101,8 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
>
> #define ASYNC_PF_PER_VCPU 64
>
> +extern void (*wakeup_handler_callback)(void);
> +
> enum kvm_reg {
> VCPU_REGS_RAX = 0,
> VCPU_REGS_RCX = 1,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index bf2e6cd..a1c83a2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -832,6 +832,13 @@ static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
> static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
>
> +/*
> + * We maintian a per-CPU linked-list of vCPU, so in wakeup_handler() we
> + * can find which vCPU should be waken up.
> + */
> +static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
> +static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
> +
> static unsigned long *vmx_io_bitmap_a;
> static unsigned long *vmx_io_bitmap_b;
> static unsigned long *vmx_msr_bitmap_legacy;
> @@ -1921,6 +1928,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> struct pi_desc old, new;
> unsigned int dest;
> + unsigned long flags;
>
> memset(&old, 0, sizeof(old));
> memset(&new, 0, sizeof(new));
> @@ -1942,6 +1950,20 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> new.nv = POSTED_INTR_VECTOR;
> } while (cmpxchg(&pi_desc->control, old.control,
> new.control) != old.control);
> +
> + /*
> + * Delete the vCPU from the related wakeup queue
> + * if we are resuming from blocked state
> + */
> + if (vcpu->blocked) {
> + vcpu->blocked = false;
> + spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> + vcpu->wakeup_cpu), flags);
> + list_del(&vcpu->blocked_vcpu_list);
> + spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
> + vcpu->wakeup_cpu), flags);
> + vcpu->wakeup_cpu = -1;
> + }
> }
> }
>
> @@ -1950,6 +1972,9 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> if (irq_remapping_cap(IRQ_POSTING_CAP)) {
> struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> struct pi_desc old, new;
> + unsigned long flags;
> + int cpu;
> + struct cpumask cpu_others_mask;
>
> memset(&old, 0, sizeof(old));
> memset(&new, 0, sizeof(new));
> @@ -1961,6 +1986,54 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> pi_set_sn(&new);
> } while (cmpxchg(&pi_desc->control, old.control,
> new.control) != old.control);
> + } else if (vcpu->blocked) {
> + /*
> + * The vcpu is blocked on the wait queue.
> + * Store the blocked vCPU on the list of the
> + * vcpu->wakeup_cpu, which is the destination
> + * of the wake-up notification event.
> + */
> + vcpu->wakeup_cpu = vcpu->cpu;
> + spin_lock_irqsave(&per_cpu(blocked_vcpu_on_cpu_lock,
> + vcpu->wakeup_cpu), flags);
> + list_add_tail(&vcpu->blocked_vcpu_list,
> + &per_cpu(blocked_vcpu_on_cpu,
> + vcpu->wakeup_cpu));
> + spin_unlock_irqrestore(
> + &per_cpu(blocked_vcpu_on_cpu_lock,
> + vcpu->wakeup_cpu), flags);
> +
> + do {
> + old.control = new.control = pi_desc->control;
> +
> + /*
> + * We should not block the vCPU if
> + * an interrupt is posted for it.
> + */
> + if (pi_test_on(pi_desc) == 1) {
> + /*
> + * We need schedule the wakeup worker
> + * on a different cpu other than
> + * vcpu->cpu, because in some case,
> + * schedule_work() will call
> + * try_to_wake_up() which needs acquire
> + * the rq lock. This can cause deadlock.
> + */
> + cpumask_copy(&cpu_others_mask,
> + cpu_online_mask);
> + cpu_clear(vcpu->cpu, cpu_others_mask);
> + cpu = any_online_cpu(cpu_others_mask);
> +
> + schedule_work_on(cpu,
> + &vcpu->wakeup_worker);
> + }
> +
> + pi_clear_sn(&new);
> +
> + /* set 'NV' to 'wakeup vector' */
> + new.nv = POSTED_INTR_WAKEUP_VECTOR;
> + } while (cmpxchg(&pi_desc->control, old.control,
> + new.control) != old.control);
> }
This can be done exclusively on HLT emulation, correct? (that is, on
entry to HLT and exit from HLT).
If the vcpu is scheduled out for any other reason (transition to
userspace or transition to other thread), it will eventually resume
execution. And in that case, continuation of execution does not depend
on the event (VT-d interrupt) notification.
There is a race window with the code above, I believe.
> }
>
> @@ -2842,6 +2915,8 @@ static int hardware_enable(void)
> return -EBUSY;
>
> INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> + INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> + spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
>
> /*
> * Now we can enable the vmclear operation in kdump
> @@ -9315,6 +9390,25 @@ static struct kvm_x86_ops vmx_x86_ops = {
> .pi_set_sn = vmx_pi_set_sn,
> };
>
> +/*
> + * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
> + */
> +void wakeup_handler(void)
> +{
> + struct kvm_vcpu *vcpu;
> + int cpu = smp_processor_id();
> +
> + spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> + list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
> + blocked_vcpu_list) {
> + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> +
> + if (pi_test_on(pi_desc) == 1)
> + kvm_vcpu_kick(vcpu);
> + }
> + spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> +}
Looping through all blocked vcpus does not scale:
Can you allocate more vectors and then multiplex those
vectors amongst the HLT'ed vcpus?
It seems there is a bunch free:
commit 52aec3308db85f4e9f5c8b9f5dc4fbd0138c6fa4
Author: Alex Shi <[email protected]>
Date: Thu Jun 28 09:02:23 2012 +0800
x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
Can you add only vcpus which have posted IRTEs that point to this pCPU
to the HLT'ed vcpu lists? (so for example, vcpus without assigned
devices are not part of the list).
> +
> static int __init vmx_init(void)
> {
> int r, i, msr;
> @@ -9429,6 +9523,8 @@ static int __init vmx_init(void)
>
> update_ple_window_actual_max();
>
> + wakeup_handler_callback = wakeup_handler;
> +
> return 0;
>
> out7:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0033df3..1551a46 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6152,6 +6152,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> kvm_vcpu_reload_apic_access_page(vcpu);
> }
>
> + /*
> + * Since posted-interrupts can be set by VT-d HW now, in this
> + * case, KVM_REQ_EVENT is not set. We move the following
> + * operations out of the if statement.
> + */
> + if (kvm_lapic_enabled(vcpu)) {
> + /*
> + * Update architecture specific hints for APIC
> + * virtual interrupt delivery.
> + */
> + if (kvm_x86_ops->hwapic_irr_update)
> + kvm_x86_ops->hwapic_irr_update(vcpu,
> + kvm_lapic_find_highest_irr(vcpu));
> + }
> +
This is a hot fast path. You can set KVM_REQ_EVENT from wakeup_handler.
> if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> kvm_apic_accept_events(vcpu);
> if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
> @@ -6168,13 +6183,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> kvm_x86_ops->enable_irq_window(vcpu);
>
> if (kvm_lapic_enabled(vcpu)) {
> - /*
> - * Update architecture specific hints for APIC
> - * virtual interrupt delivery.
> - */
> - if (kvm_x86_ops->hwapic_irr_update)
> - kvm_x86_ops->hwapic_irr_update(vcpu,
> - kvm_lapic_find_highest_irr(vcpu));
> update_cr8_intercept(vcpu);
> kvm_lapic_sync_to_vapic(vcpu);
> }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3d7242c..d981d16 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -239,6 +239,9 @@ struct kvm_vcpu {
> unsigned long requests;
> unsigned long guest_debug;
>
> + int wakeup_cpu;
> + struct list_head blocked_vcpu_list;
> +
> struct mutex mutex;
> struct kvm_run *run;
>
> @@ -282,6 +285,7 @@ struct kvm_vcpu {
> } spin_loop;
> #endif
> bool preempted;
> + bool blocked;
> struct kvm_vcpu_arch arch;
> };
Please remove blocked and wakeup_cpu, they should not be necessary.
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ba53fd6..6deb994 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -233,6 +233,9 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>
> INIT_WORK(&vcpu->wakeup_worker, wakeup_thread);
>
> + vcpu->wakeup_cpu = -1;
> + INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
> +
> page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> if (!page) {
> r = -ENOMEM;
> @@ -243,6 +246,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
> kvm_vcpu_set_in_spin_loop(vcpu, false);
> kvm_vcpu_set_dy_eligible(vcpu, false);
> vcpu->preempted = false;
> + vcpu->blocked = false;
>
> r = kvm_arch_vcpu_init(vcpu);
> if (r < 0)
> @@ -1752,6 +1756,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> DEFINE_WAIT(wait);
>
> for (;;) {
> + vcpu->blocked = true;
> prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
>
> if (kvm_arch_vcpu_runnable(vcpu)) {
> @@ -1767,6 +1772,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> }
>
> finish_wait(&vcpu->wq, &wait);
> + vcpu->blocked = false;
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_block);
>
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 26, 2015 at 08:08:15AM +0000, Wu, Feng wrote:
>
>
> > -----Original Message-----
> > From: Marcelo Tosatti [mailto:[email protected]]
> > Sent: Thursday, February 26, 2015 5:50 AM
> > To: Wu, Feng
> > Cc: [email protected]; [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]
> > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> > is blocked
> >
> > On Fri, Dec 12, 2014 at 11:14:58PM +0800, Feng Wu wrote:
> > > This patch updates the Posted-Interrupts Descriptor when vCPU
> > > is blocked.
> > >
> > > pre-block:
> > > - Add the vCPU to the blocked per-CPU list
> > > - Clear 'SN'
> > > - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR
> > >
> > > post-block:
> > > - Remove the vCPU from the per-CPU list
> > >
> > > Signed-off-by: Feng Wu <[email protected]>
> > > ---
> >
> > Don't see this is needed, can use the existing POSTED_INTR_VECTOR:
> >
> > If in guest mode, IPI will be handled in VMX non-root by performed
> > PIR->IRR transfer.
> >
> > If outside guest mode, POSTED_INTR_VECTOR IPI will be handled by host
> > which can wakeup the guest (in case it is halted).
>
> Please see the following scenario:
>
> 1. vCPU0 is running on pCPU0
> 2. vCPU0 is halted and vCPU1 is currently running on pCPU0
> 3. An interrupt occurs for vCPU0, if we still use POSTED_INTR_VECTOR
> for vCPU0, the notification event for vCPU0 (the event will go to pCPU1)
> will be consumed by vCPU1 incorrectly. The worst case is that vCPU0
> will never be woken up again since the wakeup event for it is always
> consumed by other vCPUs incorrectly.
>
> Thanks,
> Feng
Ouch, yes.
> -----Original Message-----
> From: Marcelo Tosatti [mailto:[email protected]]
> Sent: Tuesday, March 31, 2015 7:56 AM
> To: Wu, Feng
> Cc: [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
>
> On Mon, Mar 30, 2015 at 04:46:55AM +0000, Wu, Feng wrote:
> >
> >
> > > -----Original Message-----
> > > From: Marcelo Tosatti [mailto:[email protected]]
> > > Sent: Saturday, March 28, 2015 3:30 AM
> > > To: Wu, Feng
> > > Cc: [email protected]; [email protected]; [email protected];
> [email protected];
> > > [email protected]; [email protected]; [email protected];
> > > [email protected]; [email protected]; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]
> > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> vCPU
> > > is blocked
> > >
> > > On Fri, Mar 27, 2015 at 06:34:14AM +0000, Wu, Feng wrote:
> > > > > > Currently, the following code is executed before local_irq_disable() is
> > > called,
> > > > > > so do you mean 1)moving local_irq_disable() to the place before it. 2)
> after
> > > > > interrupt
> > > > > > is disabled, set KVM_REQ_EVENT in case the ON bit is set?
> > > > >
> > > > > 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
> > > > > is set.
> > > >
> > > > Here is my understanding about your comments here:
> > > > - Disable interrupts
> > > > - Check 'ON'
> > > > - Set KVM_REQ_EVENT if 'ON' is set
> > > >
> > > > Then we can put the above code inside " if
> > > (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) "
> > > > just like it used to be. However, I still have some questions about this
> > > comment:
> > > >
> > > > 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(),
> or
> > > other places?
> > >
> > > See below:
> > >
> > > > If in vcpu_enter_guest(), since currently local_irq_disable() is called after
> > > 'KVM_REQ_EVENT'
> > > > is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is
> > > called?
> > >
> > > local_irq_disable();
> > >
> > > *** add code here ***
> >
> > So we need add code like the following here, right?
> >
> > if ('ON' is set)
> > kvm_make_request(KVM_REQ_EVENT, vcpu);
>
Hi Marcelo,
I changed the code as above, then I found that the ping latency was extremely big, (70ms - 400ms).
I digged into it and got the root cause. We cannot use "checking-on" as the judgment, since 'ON'
can be cleared by hypervisor software in lots of places. In this case, KVM_REQ_EVENT cannot be
set when we check 'ON' bit, hence the interrupts are not injected to the guest in time.
Please refer to the following code, in which 'ON' bit can be cleared:
apic_find_highest_irr () --> vmx_sync_pir_to_irr () --> pi_test_and_clear_on()
Searching from the code step by step, apic_find_highest_irr() can be called by many other guys.
Thanks,
Feng
> Yes.
>
> > > if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
> > > ^^^^^^^^^^^^^^
>
> Point *1.
>
> > > || need_resched() || signal_pending(current)) {
> > > vcpu->mode = OUTSIDE_GUEST_MODE;
> > > smp_wmb();
> > > local_irq_enable();
> > > preempt_enable();
> > > vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > r = 1;
> > > goto cancel_injection;
> > > }
> > >
> > > > 2. 'ON' is set by VT-d hardware, it can be set even when interrupt is
> disabled
> > > (the related bit in PIR is also set).
> > >
> > > Yes, we are checking if the HW has set an interrupt in PIR while
> > > outside VM (which requires PIR->VIRR transfer by software).
> > >
> > > If the interrupt it set by hardware after local_irq_disable(),
> > > VMX-entry will handle the interrupt and perform the PIR->VIRR
> > > transfer and reevaluate interrupts, injecting to guest
> > > if necessary, is that correct ?
> > >
> > > > So does it make sense to check 'ON' and set KVM_REQ_EVENT accordingly
> > > after interrupt is disabled?
> > >
> > > To replace the costly
> > >
> > > + */
> > > + if (kvm_x86_ops->hwapic_irr_update)
> > > + kvm_x86_ops->hwapic_irr_update(vcpu,
> > > + kvm_lapic_find_highest_irr(vcpu));
> > >
> > > Yes, i think so.
> >
> > After adding the "checking ON and setting KVM_REQ_EVENT" operations
> listed in my
> > comments above, do you mean we still need to keep the costly code above
> > inside "if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {}" in
> function
> > vcpu_enter_guest() as it used to be? If yes, my question is what is the exact
> purpose
> > of "checking ON and setting KVM_REQ_EVENT" operations? Here is the code
> flow in
> > vcpu_enter_guest():
> >
> > 1. Check KVM_REQ_EVENT, if it is set, sync pir->virr
> > 2. Disable interrupts
> > 3. Check ON and set KVM_REQ_EVENT -- Here, we set KVM_REQ_EVENT, but
> it is
> > checked in the step 1, which means, we cannot get any benefits even we set
> it here,
> > since the "pir->virr" sync operation was done in step 1, between step 3 and
> VM-Entry,
> > we don't synchronize the pir to virr. So even we set KVM_REQ_EVENT here,
> the interrupts
> > remaining in PIR cannot be delivered to guest during this VM-Entry, right?
>
> Please check point *1 above. The code will go back to
>
> "if (kvm_check_request(KVM_REQ_EVENT, vcpu)"
>
> And perform the pir->virr sync.
On Tue, Apr 14, 2015 at 07:37:44AM +0000, Wu, Feng wrote:
>
>
> > -----Original Message-----
> > From: Marcelo Tosatti [mailto:[email protected]]
> > Sent: Tuesday, March 31, 2015 7:56 AM
> > To: Wu, Feng
> > Cc: [email protected]; [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]
> > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> > is blocked
> >
> > On Mon, Mar 30, 2015 at 04:46:55AM +0000, Wu, Feng wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Marcelo Tosatti [mailto:[email protected]]
> > > > Sent: Saturday, March 28, 2015 3:30 AM
> > > > To: Wu, Feng
> > > > Cc: [email protected]; [email protected]; [email protected];
> > [email protected];
> > > > [email protected]; [email protected]; [email protected];
> > > > [email protected]; [email protected]; [email protected];
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected]
> > > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> > vCPU
> > > > is blocked
> > > >
> > > > On Fri, Mar 27, 2015 at 06:34:14AM +0000, Wu, Feng wrote:
> > > > > > > Currently, the following code is executed before local_irq_disable() is
> > > > called,
> > > > > > > so do you mean 1)moving local_irq_disable() to the place before it. 2)
> > after
> > > > > > interrupt
> > > > > > > is disabled, set KVM_REQ_EVENT in case the ON bit is set?
> > > > > >
> > > > > > 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON bit
> > > > > > is set.
> > > > >
> > > > > Here is my understanding about your comments here:
> > > > > - Disable interrupts
> > > > > - Check 'ON'
> > > > > - Set KVM_REQ_EVENT if 'ON' is set
> > > > >
> > > > > Then we can put the above code inside " if
> > > > (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) "
> > > > > just like it used to be. However, I still have some questions about this
> > > > comment:
> > > > >
> > > > > 1. Where should I set KVM_REQ_EVENT? In function vcpu_enter_guest(),
> > or
> > > > other places?
> > > >
> > > > See below:
> > > >
> > > > > If in vcpu_enter_guest(), since currently local_irq_disable() is called after
> > > > 'KVM_REQ_EVENT'
> > > > > is checked, is it helpful to set KVM_REQ_EVENT after local_irq_disable() is
> > > > called?
> > > >
> > > > local_irq_disable();
> > > >
> > > > *** add code here ***
> > >
> > > So we need add code like the following here, right?
> > >
> > > if ('ON' is set)
> > > kvm_make_request(KVM_REQ_EVENT, vcpu);
> >
>
> Hi Marcelo,
>
> I changed the code as above, then I found that the ping latency was extremely big, (70ms - 400ms).
> I digged into it and got the root cause. We cannot use "checking-on" as the judgment, since 'ON'
> can be cleared by hypervisor software in lots of places. In this case, KVM_REQ_EVENT cannot be
> set when we check 'ON' bit, hence the interrupts are not injected to the guest in time.
>
> Please refer to the following code, in which 'ON' bit can be cleared:
>
> apic_find_highest_irr () --> vmx_sync_pir_to_irr () --> pi_test_and_clear_on()
>
> Searching from the code step by step, apic_find_highest_irr() can be called by many other guys.
>
> Thanks,
Ok then, ignore my suggestion.
Can you resend the latest version please ?
> -----Original Message-----
> From: Marcelo Tosatti [mailto:[email protected]]
> Sent: Saturday, June 06, 2015 5:59 AM
> To: Wu, Feng
> Cc: [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked
>
> On Tue, Apr 14, 2015 at 07:37:44AM +0000, Wu, Feng wrote:
> >
> >
> > > -----Original Message-----
> > > From: Marcelo Tosatti [mailto:[email protected]]
> > > Sent: Tuesday, March 31, 2015 7:56 AM
> > > To: Wu, Feng
> > > Cc: [email protected]; [email protected]; [email protected];
> [email protected];
> > > [email protected]; [email protected]; [email protected];
> > > [email protected]; [email protected]; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]
> > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> vCPU
> > > is blocked
> > >
> > > On Mon, Mar 30, 2015 at 04:46:55AM +0000, Wu, Feng wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Marcelo Tosatti [mailto:[email protected]]
> > > > > Sent: Saturday, March 28, 2015 3:30 AM
> > > > > To: Wu, Feng
> > > > > Cc: [email protected]; [email protected]; [email protected];
> > > [email protected];
> > > > > [email protected]; [email protected]; [email protected];
> > > > > [email protected]; [email protected];
> [email protected];
> > > > > [email protected]; [email protected];
> > > > > [email protected]; [email protected]
> > > > > Subject: Re: [v3 24/26] KVM: Update Posted-Interrupts Descriptor when
> > > vCPU
> > > > > is blocked
> > > > >
> > > > > On Fri, Mar 27, 2015 at 06:34:14AM +0000, Wu, Feng wrote:
> > > > > > > > Currently, the following code is executed before local_irq_disable()
> is
> > > > > called,
> > > > > > > > so do you mean 1)moving local_irq_disable() to the place before it.
> 2)
> > > after
> > > > > > > interrupt
> > > > > > > > is disabled, set KVM_REQ_EVENT in case the ON bit is set?
> > > > > > >
> > > > > > > 2) after interrupt is disabled, set KVM_REQ_EVENT in case the ON
> bit
> > > > > > > is set.
> > > > > >
> > > > > > Here is my understanding about your comments here:
> > > > > > - Disable interrupts
> > > > > > - Check 'ON'
> > > > > > - Set KVM_REQ_EVENT if 'ON' is set
> > > > > >
> > > > > > Then we can put the above code inside " if
> > > > > (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) "
> > > > > > just like it used to be. However, I still have some questions about this
> > > > > comment:
> > > > > >
> > > > > > 1. Where should I set KVM_REQ_EVENT? In function
> vcpu_enter_guest(),
> > > or
> > > > > other places?
> > > > >
> > > > > See below:
> > > > >
> > > > > > If in vcpu_enter_guest(), since currently local_irq_disable() is called
> after
> > > > > 'KVM_REQ_EVENT'
> > > > > > is checked, is it helpful to set KVM_REQ_EVENT after
> local_irq_disable() is
> > > > > called?
> > > > >
> > > > > local_irq_disable();
> > > > >
> > > > > *** add code here ***
> > > >
> > > > So we need add code like the following here, right?
> > > >
> > > > if ('ON' is set)
> > > > kvm_make_request(KVM_REQ_EVENT, vcpu);
> > >
> >
> > Hi Marcelo,
> >
> > I changed the code as above, then I found that the ping latency was
> extremely big, (70ms - 400ms).
> > I digged into it and got the root cause. We cannot use "checking-on" as the
> judgment, since 'ON'
> > can be cleared by hypervisor software in lots of places. In this case,
> KVM_REQ_EVENT cannot be
> > set when we check 'ON' bit, hence the interrupts are not injected to the guest
> in time.
> >
> > Please refer to the following code, in which 'ON' bit can be cleared:
> >
> > apic_find_highest_irr () --> vmx_sync_pir_to_irr () --> pi_test_and_clear_on()
> >
> > Searching from the code step by step, apic_find_highest_irr() can be called by
> many other guys.
> >
> > Thanks,
>
> Ok then, ignore my suggestion.
>
> Can you resend the latest version please ?
Thanks for your review, I will send the new version soon.
Thanks,
Feng
>