With this patch series I'm trying to address several issues with kexec on pvhvm:
- shared_info issue (1st patch, just sending Olaf's work with Konrad's fix)
- create specific pvhvm shutdown handler for kexec (2nd patch)
- GSI PIRQ issue (3rd patch, I'm pretty confident that it does the right thing)
- MSI PIRQ issue (4th patch, and I'm not sure it doesn't break anything -> RFC)
This patch series can be tested on single vCPU guest. We still have SMP issues with
pvhvm guests and kexec which require additional fixes.
Olaf Hering (1):
xen PVonHVM: use E820_Reserved area for shared_info
Vitaly Kuznetsov (3):
xen/pvhvm: Introduce xen_pvhvm_kexec_shutdown()
xen/pvhvm: Unmap all PIRQs on startup and shutdown
xen/pvhvm: Make MSI IRQs work after kexec
arch/x86/pci/xen.c | 3 +-
arch/x86/xen/enlighten.c | 83 +++++++++++++++++++++++++++++++---------
arch/x86/xen/smp.c | 10 +++++
arch/x86/xen/smp.h | 1 +
arch/x86/xen/suspend.c | 2 +-
arch/x86/xen/xen-ops.h | 2 +-
drivers/xen/events/events_base.c | 76 ++++++++++++++++++++++++++++++++++++
include/xen/events.h | 3 ++
8 files changed, 158 insertions(+), 22 deletions(-)
--
1.9.3
When kexec was peformed MSI IRQs for passthrough-ed devices were already
mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq()
fails as we have no IRQ mapping information for that. Requesting for new
mapping with __write_msi_msg() does not result in MSI IRQ being remapped so
we don't recieve these IRQs.
RFC: I wasn't able to understand why commit af42b8d1 which introduced
xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that instead
of checking pirq > 0 as if the mapping was already done (and we have pirq>0 here)
we don't need to request for a new pirq. We're loosing existing PIRQ and I'm also
not sure when __write_msi_msg() with new PIRQ will result in new mapping.
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/pci/xen.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 905956f..685e8f1 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
__read_msi_msg(msidesc, &msg);
pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
- if (msg.data != XEN_PIRQ_MSI_DATA ||
- xen_irq_from_pirq(pirq) < 0) {
+ if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
pirq = xen_allocate_pirq_msi(dev, msidesc);
if (pirq < 0) {
irq = -ENODEV;
--
1.9.3
From: Olaf Hering <[email protected]>
This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
("xen PVonHVM: move shared_info to MMIO before kexec").
Currently kexec in a PVonHVM guest fails with a triple fault because the
new kernel overwrites the shared info page. The exact failure depends on
the size of the kernel image. This patch moves the pfn from RAM into an
E820 reserved memory area.
The pfn containing the shared_info is located somewhere in RAM. This will
cause trouble if the current kernel is doing a kexec boot into a new
kernel. The new kernel (and its startup code) can not know where the pfn
is, so it can not reserve the page. The hypervisor will continue to update
the pfn, and as a result memory corruption occours in the new kernel.
The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
will usually not allocate areas up to FE700000, so FE700000 is expected
to work also with older toolstacks.
In Xen3 there is no reserved area at a fixed location. If the guest is
started on such old hosts the shared_info page will be placed in RAM. As
a result kexec can not be used.
Signed-off-by: Olaf Hering <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
(cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f)
[On resume we need to reset the xen_vcpu_info, which the original
patch did not do]
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/xen/enlighten.c | 74 ++++++++++++++++++++++++++++++++++++------------
arch/x86/xen/suspend.c | 2 +-
arch/x86/xen/xen-ops.h | 2 +-
3 files changed, 58 insertions(+), 20 deletions(-)
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index ffb101e..a11af62 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1726,23 +1726,29 @@ asmlinkage __visible void __init xen_start_kernel(void)
#endif
}
-void __ref xen_hvm_init_shared_info(void)
+#ifdef CONFIG_XEN_PVHVM
+#define HVM_SHARED_INFO_ADDR 0xFE700000UL
+static struct shared_info *xen_hvm_shared_info;
+static unsigned long xen_hvm_sip_phys;
+static int xen_major, xen_minor;
+
+static void xen_hvm_connect_shared_info(unsigned long pfn)
{
- int cpu;
struct xen_add_to_physmap xatp;
- static struct shared_info *shared_info_page = 0;
- if (!shared_info_page)
- shared_info_page = (struct shared_info *)
- extend_brk(PAGE_SIZE, PAGE_SIZE);
xatp.domid = DOMID_SELF;
xatp.idx = 0;
xatp.space = XENMAPSPACE_shared_info;
- xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
+ xatp.gpfn = pfn;
if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
BUG();
- HYPERVISOR_shared_info = (struct shared_info *)shared_info_page;
+}
+static void __init xen_hvm_set_shared_info(struct shared_info *sip)
+{
+ int cpu;
+
+ HYPERVISOR_shared_info = sip;
/* xen_vcpu is a pointer to the vcpu_info struct in the shared_info
* page, we use it in the event channel upcall and in some pvclock
@@ -1760,20 +1766,39 @@ void __ref xen_hvm_init_shared_info(void)
}
}
-#ifdef CONFIG_XEN_PVHVM
+/* Reconnect the shared_info pfn to a (new) mfn */
+void xen_hvm_resume_shared_info(void)
+{
+ xen_hvm_connect_shared_info(xen_hvm_sip_phys >> PAGE_SHIFT);
+ xen_hvm_set_shared_info(xen_hvm_shared_info);
+}
+
+/* Xen tools prior to Xen 4 do not provide a E820_Reserved area for guest usage.
+ * On these old tools the shared info page will be placed in E820_Ram.
+ * Xen 4 provides a E820_Reserved area at 0xFC000000, and this code expects
+ * that nothing is mapped up to HVM_SHARED_INFO_ADDR.
+ * Xen 4.3+ provides an explicit 1MB area at HVM_SHARED_INFO_ADDR which is used
+ * here for the shared info page. */
+static void __init xen_hvm_init_shared_info(void)
+{
+ if (xen_major < 4) {
+ xen_hvm_shared_info = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ xen_hvm_sip_phys = __pa(xen_hvm_shared_info);
+ } else {
+ xen_hvm_sip_phys = HVM_SHARED_INFO_ADDR;
+ set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_hvm_sip_phys);
+ xen_hvm_shared_info =
+ (struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
+ }
+ xen_hvm_resume_shared_info();
+}
+
static void __init init_hvm_pv_info(void)
{
- int major, minor;
- uint32_t eax, ebx, ecx, edx, pages, msr, base;
+ uint32_t ecx, edx, pages, msr, base;
u64 pfn;
base = xen_cpuid_base();
- cpuid(base + 1, &eax, &ebx, &ecx, &edx);
-
- major = eax >> 16;
- minor = eax & 0xffff;
- printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
-
cpuid(base + 2, &pages, &msr, &ecx, &edx);
pfn = __pa(hypercall_page);
@@ -1828,10 +1853,23 @@ static void __init xen_hvm_guest_init(void)
static uint32_t __init xen_hvm_platform(void)
{
+ uint32_t eax, ebx, ecx, edx, base;
+
if (xen_pv_domain())
return 0;
- return xen_cpuid_base();
+ base = xen_cpuid_base();
+ if (!base)
+ return 0;
+
+ cpuid(base + 1, &eax, &ebx, &ecx, &edx);
+
+ xen_major = eax >> 16;
+ xen_minor = eax & 0xffff;
+
+ printk(KERN_INFO "Xen version %d.%d.\n", xen_major, xen_minor);
+
+ return 1;
}
bool xen_hvm_need_lapic(void)
diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index c4df9db..8d6793e 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -32,7 +32,7 @@ static void xen_hvm_post_suspend(int suspend_cancelled)
{
#ifdef CONFIG_XEN_PVHVM
int cpu;
- xen_hvm_init_shared_info();
+ xen_hvm_resume_shared_info();
xen_callback_vector();
xen_unplug_emulated_devices();
if (xen_feature(XENFEAT_hvm_safe_pvclock)) {
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 97d8765..d083e82 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -43,7 +43,7 @@ void xen_enable_syscall(void);
void xen_vcpu_restore(void);
void xen_callback_vector(void);
-void xen_hvm_init_shared_info(void);
+void xen_hvm_resume_shared_info(void);
void xen_unplug_emulated_devices(void);
void __init xen_build_dynamic_phys_to_machine(void);
--
1.9.3
PVHVM guest requires special actions before kexec. Register specific
xen_pvhvm_kexec_shutdown() handler for machine_ops.shutdown().
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/xen/enlighten.c | 9 +++++++++
arch/x86/xen/smp.c | 9 +++++++++
arch/x86/xen/smp.h | 1 +
3 files changed, 19 insertions(+)
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index a11af62..8074e4a 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1833,6 +1833,12 @@ static struct notifier_block xen_hvm_cpu_notifier = {
.notifier_call = xen_hvm_cpu_notify,
};
+static void xen_pvhvm_kexec_shutdown(void)
+{
+ xen_kexec_shutdown();
+ native_machine_shutdown();
+}
+
static void __init xen_hvm_guest_init(void)
{
init_hvm_pv_info();
@@ -1849,6 +1855,9 @@ static void __init xen_hvm_guest_init(void)
x86_init.irqs.intr_init = xen_init_IRQ;
xen_hvm_init_time_ops();
xen_hvm_init_mmu_ops();
+#ifdef CONFIG_KEXEC
+ machine_ops.shutdown = xen_pvhvm_kexec_shutdown;
+#endif
}
static uint32_t __init xen_hvm_platform(void)
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 7005974..35dcf39 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -18,6 +18,7 @@
#include <linux/smp.h>
#include <linux/irq_work.h>
#include <linux/tick.h>
+#include <linux/kexec.h>
#include <asm/paravirt.h>
#include <asm/desc.h>
@@ -762,6 +763,14 @@ static void xen_hvm_cpu_die(unsigned int cpu)
native_cpu_die(cpu);
}
+void xen_kexec_shutdown(void)
+{
+#ifdef CONFIG_KEXEC
+ if (!kexec_in_progress)
+ return;
+#endif
+}
+
void __init xen_hvm_smp_init(void)
{
if (!xen_have_vector_callback)
diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
index c7c2d89..1af0493 100644
--- a/arch/x86/xen/smp.h
+++ b/arch/x86/xen/smp.h
@@ -8,4 +8,5 @@ extern void xen_send_IPI_allbutself(int vector);
extern void xen_send_IPI_all(int vector);
extern void xen_send_IPI_self(int vector);
+extern void xen_kexec_shutdown(void);
#endif
--
1.9.3
When kexec is being run PIRQs from Qemu-emulated devices are still
mapped to old event channels and new kernel has no information about
that. Trying to map them twice results in the following in Xen's dmesg:
(XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
(XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
(XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
...
and the following in new kernel's dmesg:
[ 92.286796] xen:events: Failed to obtain physical IRQ 4
The result is that the new kernel doesn't recieve IRQs for Qemu-emulated
devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
when kexec was requested and on every kernel startup. We need to do this
twice to deal with the following issues:
- startup-time unmapping is required to make kdump work;
- shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
- shutdown-time unmapping is required to make Qemu-emulated NICs work after
kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq
is being performed).
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/xen/smp.c | 1 +
drivers/xen/events/events_base.c | 76 ++++++++++++++++++++++++++++++++++++++++
include/xen/events.h | 3 ++
3 files changed, 80 insertions(+)
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 35dcf39..e2b4deb 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
#ifdef CONFIG_KEXEC
if (!kexec_in_progress)
return;
+ xen_unmap_all_pirqs();
#endif
}
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index c919d3d..7701c7f 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
static bool fifo_events = true;
module_param(fifo_events, bool, 0);
+void xen_unmap_all_pirqs(void)
+{
+ int pirq, rc, gsi, irq, evtchn;
+ struct physdev_unmap_pirq unmap_irq;
+ struct irq_info *info;
+ struct evtchn_close close;
+
+ mutex_lock(&irq_mapping_update_lock);
+
+ list_for_each_entry(info, &xen_irq_list_head, list) {
+ if (info->type != IRQT_PIRQ)
+ continue;
+
+ pirq = info->u.pirq.pirq;
+ gsi = info->u.pirq.gsi;
+ evtchn = info->evtchn;
+ irq = info->irq;
+
+ pr_debug("unmapping pirq gsi=%d pirq=%d irq=%d evtchn=%d\n",
+ gsi, pirq, irq, evtchn);
+
+ if (evtchn > 0) {
+ close.port = evtchn;
+ if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
+ &close) != 0)
+ pr_warn("close evtchn %d failed\n", evtchn);
+ }
+
+ unmap_irq.pirq = pirq;
+ unmap_irq.domid = DOMID_SELF;
+
+ rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
+ if (rc)
+ pr_warn("unmap pirq failed gsi=%d pirq=%d irq=%d rc=%d\n",
+ gsi, pirq, irq, rc);
+ }
+
+ mutex_unlock(&irq_mapping_update_lock);
+}
+EXPORT_SYMBOL_GPL(xen_unmap_all_pirqs);
+
+static void xen_startup_unmap_pirqs(void)
+{
+ struct evtchn_status status;
+ int port, rc = -ENOENT;
+ struct physdev_unmap_pirq unmap_irq;
+ struct evtchn_close close;
+
+ memset(&status, 0, sizeof(status));
+ for (port = 0; port < xen_evtchn_max_channels(); port++) {
+ status.dom = DOMID_SELF;
+ status.port = port;
+ rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
+ if (rc < 0)
+ continue;
+ if (status.status == EVTCHNSTAT_pirq) {
+ close.port = port;
+ if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
+ &close) != 0)
+ pr_warn("xen: failed to close evtchn %d\n",
+ port);
+ unmap_irq.pirq = status.u.pirq;
+ unmap_irq.domid = DOMID_SELF;
+ pr_warn("xen: unmapping previously mapped pirq %d\n",
+ unmap_irq.pirq);
+ if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq,
+ &unmap_irq) != 0)
+ pr_warn("xen: failed to unmap pirq %d\n",
+ unmap_irq.pirq);
+ }
+ }
+}
+
+
void __init xen_init_IRQ(void)
{
int ret = -EINVAL;
@@ -1671,6 +1745,8 @@ void __init xen_init_IRQ(void)
xen_callback_vector();
if (xen_hvm_domain()) {
+ xen_startup_unmap_pirqs();
+
native_init_IRQ();
/* pci_xen_hvm_init must be called after native_init_IRQ so that
* __acpi_register_gsi can point at the right function */
diff --git a/include/xen/events.h b/include/xen/events.h
index 8bee7a7..3f9f428 100644
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -122,6 +122,9 @@ int xen_irq_from_gsi(unsigned gsi);
/* Determine whether to ignore this IRQ if it is passed to a guest. */
int xen_test_irq_shared(int irq);
+/* Unmap all PIRQs and close event channels */
+void xen_unmap_all_pirqs(void);
+
/* initialize Xen IRQ subsystem */
void xen_init_IRQ(void);
#endif /* _XEN_EVENTS_H */
--
1.9.3
On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote:
> From: Olaf Hering <[email protected]>
>
> This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
> ("xen PVonHVM: move shared_info to MMIO before kexec").
>
> Currently kexec in a PVonHVM guest fails with a triple fault because the
> new kernel overwrites the shared info page. The exact failure depends on
> the size of the kernel image. This patch moves the pfn from RAM into an
> E820 reserved memory area.
>
> The pfn containing the shared_info is located somewhere in RAM. This will
> cause trouble if the current kernel is doing a kexec boot into a new
> kernel. The new kernel (and its startup code) can not know where the pfn
> is, so it can not reserve the page. The hypervisor will continue to update
> the pfn, and as a result memory corruption occours in the new kernel.
>
> The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
> E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
> starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
> will usually not allocate areas up to FE700000, so FE700000 is expected
> to work also with older toolstacks.
>
> In Xen3 there is no reserved area at a fixed location. If the guest is
> started on such old hosts the shared_info page will be placed in RAM. As
> a result kexec can not be used.
So this looks right, the one thing that we really need to check
is e9daff24a266307943457086533041bd971d0ef9
This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f.
We are doing this b/c on 32-bit PVonHVM with older hypervisors
(Xen 4.1) it ends up bothing up the start_info. This is bad b/c
we use it for the time keeping, and the timekeeping code loops
forever - as the version field never changes. Olaf says to
revert it, so lets do that.
Could you kindly test that the migration on 32-bit PVHVM guests
on older hypervisors works?
>
> Signed-off-by: Olaf Hering <[email protected]>
> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
> (cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f)
>
> [On resume we need to reset the xen_vcpu_info, which the original
> patch did not do]
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/xen/enlighten.c | 74 ++++++++++++++++++++++++++++++++++++------------
> arch/x86/xen/suspend.c | 2 +-
> arch/x86/xen/xen-ops.h | 2 +-
> 3 files changed, 58 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index ffb101e..a11af62 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -1726,23 +1726,29 @@ asmlinkage __visible void __init xen_start_kernel(void)
> #endif
> }
>
> -void __ref xen_hvm_init_shared_info(void)
> +#ifdef CONFIG_XEN_PVHVM
> +#define HVM_SHARED_INFO_ADDR 0xFE700000UL
> +static struct shared_info *xen_hvm_shared_info;
> +static unsigned long xen_hvm_sip_phys;
> +static int xen_major, xen_minor;
> +
> +static void xen_hvm_connect_shared_info(unsigned long pfn)
> {
> - int cpu;
> struct xen_add_to_physmap xatp;
> - static struct shared_info *shared_info_page = 0;
>
> - if (!shared_info_page)
> - shared_info_page = (struct shared_info *)
> - extend_brk(PAGE_SIZE, PAGE_SIZE);
> xatp.domid = DOMID_SELF;
> xatp.idx = 0;
> xatp.space = XENMAPSPACE_shared_info;
> - xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
> + xatp.gpfn = pfn;
> if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
> BUG();
>
> - HYPERVISOR_shared_info = (struct shared_info *)shared_info_page;
> +}
> +static void __init xen_hvm_set_shared_info(struct shared_info *sip)
> +{
> + int cpu;
> +
> + HYPERVISOR_shared_info = sip;
>
> /* xen_vcpu is a pointer to the vcpu_info struct in the shared_info
> * page, we use it in the event channel upcall and in some pvclock
> @@ -1760,20 +1766,39 @@ void __ref xen_hvm_init_shared_info(void)
> }
> }
>
> -#ifdef CONFIG_XEN_PVHVM
> +/* Reconnect the shared_info pfn to a (new) mfn */
> +void xen_hvm_resume_shared_info(void)
> +{
> + xen_hvm_connect_shared_info(xen_hvm_sip_phys >> PAGE_SHIFT);
> + xen_hvm_set_shared_info(xen_hvm_shared_info);
> +}
> +
> +/* Xen tools prior to Xen 4 do not provide a E820_Reserved area for guest usage.
> + * On these old tools the shared info page will be placed in E820_Ram.
> + * Xen 4 provides a E820_Reserved area at 0xFC000000, and this code expects
> + * that nothing is mapped up to HVM_SHARED_INFO_ADDR.
> + * Xen 4.3+ provides an explicit 1MB area at HVM_SHARED_INFO_ADDR which is used
> + * here for the shared info page. */
> +static void __init xen_hvm_init_shared_info(void)
> +{
> + if (xen_major < 4) {
> + xen_hvm_shared_info = extend_brk(PAGE_SIZE, PAGE_SIZE);
> + xen_hvm_sip_phys = __pa(xen_hvm_shared_info);
> + } else {
> + xen_hvm_sip_phys = HVM_SHARED_INFO_ADDR;
> + set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_hvm_sip_phys);
> + xen_hvm_shared_info =
> + (struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
> + }
> + xen_hvm_resume_shared_info();
> +}
> +
> static void __init init_hvm_pv_info(void)
> {
> - int major, minor;
> - uint32_t eax, ebx, ecx, edx, pages, msr, base;
> + uint32_t ecx, edx, pages, msr, base;
> u64 pfn;
>
> base = xen_cpuid_base();
> - cpuid(base + 1, &eax, &ebx, &ecx, &edx);
> -
> - major = eax >> 16;
> - minor = eax & 0xffff;
> - printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
> -
> cpuid(base + 2, &pages, &msr, &ecx, &edx);
>
> pfn = __pa(hypercall_page);
> @@ -1828,10 +1853,23 @@ static void __init xen_hvm_guest_init(void)
>
> static uint32_t __init xen_hvm_platform(void)
> {
> + uint32_t eax, ebx, ecx, edx, base;
> +
> if (xen_pv_domain())
> return 0;
>
> - return xen_cpuid_base();
> + base = xen_cpuid_base();
> + if (!base)
> + return 0;
> +
> + cpuid(base + 1, &eax, &ebx, &ecx, &edx);
> +
> + xen_major = eax >> 16;
> + xen_minor = eax & 0xffff;
> +
> + printk(KERN_INFO "Xen version %d.%d.\n", xen_major, xen_minor);
> +
> + return 1;
> }
>
> bool xen_hvm_need_lapic(void)
> diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
> index c4df9db..8d6793e 100644
> --- a/arch/x86/xen/suspend.c
> +++ b/arch/x86/xen/suspend.c
> @@ -32,7 +32,7 @@ static void xen_hvm_post_suspend(int suspend_cancelled)
> {
> #ifdef CONFIG_XEN_PVHVM
> int cpu;
> - xen_hvm_init_shared_info();
> + xen_hvm_resume_shared_info();
> xen_callback_vector();
> xen_unplug_emulated_devices();
> if (xen_feature(XENFEAT_hvm_safe_pvclock)) {
> diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
> index 97d8765..d083e82 100644
> --- a/arch/x86/xen/xen-ops.h
> +++ b/arch/x86/xen/xen-ops.h
> @@ -43,7 +43,7 @@ void xen_enable_syscall(void);
> void xen_vcpu_restore(void);
>
> void xen_callback_vector(void);
> -void xen_hvm_init_shared_info(void);
> +void xen_hvm_resume_shared_info(void);
> void xen_unplug_emulated_devices(void);
>
> void __init xen_build_dynamic_phys_to_machine(void);
> --
> 1.9.3
>
On Tue, Jul 15, 2014 at 03:40:38PM +0200, Vitaly Kuznetsov wrote:
> PVHVM guest requires special actions before kexec. Register specific
> xen_pvhvm_kexec_shutdown() handler for machine_ops.shutdown().
>
This looks close to what I had sent as an RFC to you?
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/xen/enlighten.c | 9 +++++++++
> arch/x86/xen/smp.c | 9 +++++++++
> arch/x86/xen/smp.h | 1 +
> 3 files changed, 19 insertions(+)
>
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index a11af62..8074e4a 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -1833,6 +1833,12 @@ static struct notifier_block xen_hvm_cpu_notifier = {
> .notifier_call = xen_hvm_cpu_notify,
> };
>
> +static void xen_pvhvm_kexec_shutdown(void)
> +{
> + xen_kexec_shutdown();
> + native_machine_shutdown();
> +}
> +
> static void __init xen_hvm_guest_init(void)
> {
> init_hvm_pv_info();
> @@ -1849,6 +1855,9 @@ static void __init xen_hvm_guest_init(void)
> x86_init.irqs.intr_init = xen_init_IRQ;
> xen_hvm_init_time_ops();
> xen_hvm_init_mmu_ops();
> +#ifdef CONFIG_KEXEC
> + machine_ops.shutdown = xen_pvhvm_kexec_shutdown;
> +#endif
> }
>
> static uint32_t __init xen_hvm_platform(void)
> diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
> index 7005974..35dcf39 100644
> --- a/arch/x86/xen/smp.c
> +++ b/arch/x86/xen/smp.c
> @@ -18,6 +18,7 @@
> #include <linux/smp.h>
> #include <linux/irq_work.h>
> #include <linux/tick.h>
> +#include <linux/kexec.h>
>
> #include <asm/paravirt.h>
> #include <asm/desc.h>
> @@ -762,6 +763,14 @@ static void xen_hvm_cpu_die(unsigned int cpu)
> native_cpu_die(cpu);
> }
>
> +void xen_kexec_shutdown(void)
> +{
> +#ifdef CONFIG_KEXEC
> + if (!kexec_in_progress)
> + return;
> +#endif
> +}
> +
> void __init xen_hvm_smp_init(void)
> {
> if (!xen_have_vector_callback)
> diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
> index c7c2d89..1af0493 100644
> --- a/arch/x86/xen/smp.h
> +++ b/arch/x86/xen/smp.h
> @@ -8,4 +8,5 @@ extern void xen_send_IPI_allbutself(int vector);
> extern void xen_send_IPI_all(int vector);
> extern void xen_send_IPI_self(int vector);
>
> +extern void xen_kexec_shutdown(void);
> #endif
> --
> 1.9.3
>
On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:
> When kexec was peformed MSI IRQs for passthrough-ed devices were already
> mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq()
> fails as we have no IRQ mapping information for that. Requesting for new
> mapping with __write_msi_msg() does not result in MSI IRQ being remapped so
> we don't recieve these IRQs.
receive
How come '__write_msi_msg' does not result in new MSI IRQs?
Is it fair to state that your code ends up reading the MSI IRQ (PIRQ)
from the device and updating the internal PIRQ<->IRQ code to match
with the reality?
>
> RFC: I wasn't able to understand why commit af42b8d1 which introduced
> xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that instead
> of checking pirq > 0 as if the mapping was already done (and we have pirq>0 here)
> we don't need to request for a new pirq. We're loosing existing PIRQ and I'm also
> not sure when __write_msi_msg() with new PIRQ will result in new mapping.
We don't request a new pirq. We end up returning before we call xen_allocate_pirq_msi.
At least that is how the commit you mentioned worked.
In regards to why using 'xen_irq_from_pirq' instead of just checking the PIRQ - is
that we might be called twice by a buggy driver. As such we want to check
our PIRQ<->IRQ to figure this out.
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/pci/xen.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> index 905956f..685e8f1 100644
> --- a/arch/x86/pci/xen.c
> +++ b/arch/x86/pci/xen.c
> @@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> __read_msi_msg(msidesc, &msg);
> pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> - if (msg.data != XEN_PIRQ_MSI_DATA ||
> - xen_irq_from_pirq(pirq) < 0) {
> + if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
> pirq = xen_allocate_pirq_msi(dev, msidesc);
> if (pirq < 0) {
> irq = -ENODEV;
> --
> 1.9.3
>
On Tue, Jul 15, 2014 at 03:40:39PM +0200, Vitaly Kuznetsov wrote:
> When kexec is being run PIRQs from Qemu-emulated devices are still
> mapped to old event channels and new kernel has no information about
> that. Trying to map them twice results in the following in Xen's dmesg:
>
> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
> ...
>
> and the following in new kernel's dmesg:
>
> [ 92.286796] xen:events: Failed to obtain physical IRQ 4
>
> The result is that the new kernel doesn't recieve IRQs for Qemu-emulated
> devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
> when kexec was requested and on every kernel startup. We need to do this
> twice to deal with the following issues:
> - startup-time unmapping is required to make kdump work;
> - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
> - shutdown-time unmapping is required to make Qemu-emulated NICs work after
> kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq
> is being performed).
How does this work when you boot the guest under Xen 4.4 where the FIFO events
are used? Does it still work correctly?
Thanks.
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/xen/smp.c | 1 +
> drivers/xen/events/events_base.c | 76 ++++++++++++++++++++++++++++++++++++++++
> include/xen/events.h | 3 ++
> 3 files changed, 80 insertions(+)
>
> diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
> index 35dcf39..e2b4deb 100644
> --- a/arch/x86/xen/smp.c
> +++ b/arch/x86/xen/smp.c
> @@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
> #ifdef CONFIG_KEXEC
> if (!kexec_in_progress)
> return;
> + xen_unmap_all_pirqs();
> #endif
> }
>
> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> index c919d3d..7701c7f 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
> static bool fifo_events = true;
> module_param(fifo_events, bool, 0);
>
> +void xen_unmap_all_pirqs(void)
> +{
> + int pirq, rc, gsi, irq, evtchn;
> + struct physdev_unmap_pirq unmap_irq;
> + struct irq_info *info;
> + struct evtchn_close close;
> +
> + mutex_lock(&irq_mapping_update_lock);
> +
> + list_for_each_entry(info, &xen_irq_list_head, list) {
> + if (info->type != IRQT_PIRQ)
> + continue;
> +
> + pirq = info->u.pirq.pirq;
> + gsi = info->u.pirq.gsi;
> + evtchn = info->evtchn;
> + irq = info->irq;
> +
> + pr_debug("unmapping pirq gsi=%d pirq=%d irq=%d evtchn=%d\n",
> + gsi, pirq, irq, evtchn);
> +
> + if (evtchn > 0) {
> + close.port = evtchn;
> + if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
> + &close) != 0)
> + pr_warn("close evtchn %d failed\n", evtchn);
> + }
> +
> + unmap_irq.pirq = pirq;
> + unmap_irq.domid = DOMID_SELF;
> +
> + rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
> + if (rc)
> + pr_warn("unmap pirq failed gsi=%d pirq=%d irq=%d rc=%d\n",
> + gsi, pirq, irq, rc);
> + }
> +
> + mutex_unlock(&irq_mapping_update_lock);
> +}
> +EXPORT_SYMBOL_GPL(xen_unmap_all_pirqs);
Why the EXPORT? Is this used by modules?
> +
> +static void xen_startup_unmap_pirqs(void)
> +{
> + struct evtchn_status status;
> + int port, rc = -ENOENT;
> + struct physdev_unmap_pirq unmap_irq;
> + struct evtchn_close close;
> +
> + memset(&status, 0, sizeof(status));
> + for (port = 0; port < xen_evtchn_max_channels(); port++) {
> + status.dom = DOMID_SELF;
> + status.port = port;
> + rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
> + if (rc < 0)
> + continue;
> + if (status.status == EVTCHNSTAT_pirq) {
> + close.port = port;
> + if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
> + &close) != 0)
> + pr_warn("xen: failed to close evtchn %d\n",
> + port);
> + unmap_irq.pirq = status.u.pirq;
> + unmap_irq.domid = DOMID_SELF;
> + pr_warn("xen: unmapping previously mapped pirq %d\n",
> + unmap_irq.pirq);
> + if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq,
> + &unmap_irq) != 0)
> + pr_warn("xen: failed to unmap pirq %d\n",
> + unmap_irq.pirq);
> + }
> + }
> +}
> +
> +
> void __init xen_init_IRQ(void)
> {
> int ret = -EINVAL;
> @@ -1671,6 +1745,8 @@ void __init xen_init_IRQ(void)
> xen_callback_vector();
>
> if (xen_hvm_domain()) {
> + xen_startup_unmap_pirqs();
> +
> native_init_IRQ();
> /* pci_xen_hvm_init must be called after native_init_IRQ so that
> * __acpi_register_gsi can point at the right function */
> diff --git a/include/xen/events.h b/include/xen/events.h
> index 8bee7a7..3f9f428 100644
> --- a/include/xen/events.h
> +++ b/include/xen/events.h
> @@ -122,6 +122,9 @@ int xen_irq_from_gsi(unsigned gsi);
> /* Determine whether to ignore this IRQ if it is passed to a guest. */
> int xen_test_irq_shared(int irq);
>
> +/* Unmap all PIRQs and close event channels */
> +void xen_unmap_all_pirqs(void);
> +
> /* initialize Xen IRQ subsystem */
> void xen_init_IRQ(void);
> #endif /* _XEN_EVENTS_H */
> --
> 1.9.3
>
Konrad Rzeszutek Wilk <[email protected]> writes:
> On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote:
>> From: Olaf Hering <[email protected]>
>>
>> This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
>> ("xen PVonHVM: move shared_info to MMIO before kexec").
>>
>> Currently kexec in a PVonHVM guest fails with a triple fault because the
>> new kernel overwrites the shared info page. The exact failure depends on
>> the size of the kernel image. This patch moves the pfn from RAM into an
>> E820 reserved memory area.
>>
>> The pfn containing the shared_info is located somewhere in RAM. This will
>> cause trouble if the current kernel is doing a kexec boot into a new
>> kernel. The new kernel (and its startup code) can not know where the pfn
>> is, so it can not reserve the page. The hypervisor will continue to update
>> the pfn, and as a result memory corruption occours in the new kernel.
>>
>> The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
>> E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
>> starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
>> will usually not allocate areas up to FE700000, so FE700000 is expected
>> to work also with older toolstacks.
>>
>> In Xen3 there is no reserved area at a fixed location. If the guest is
>> started on such old hosts the shared_info page will be placed in RAM. As
>> a result kexec can not be used.
>
> So this looks right, the one thing that we really need to check
> is e9daff24a266307943457086533041bd971d0ef9
>
> This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f.
>
> We are doing this b/c on 32-bit PVonHVM with older hypervisors
> (Xen 4.1) it ends up bothing up the start_info. This is bad b/c
> we use it for the time keeping, and the timekeeping code loops
> forever - as the version field never changes. Olaf says to
> revert it, so lets do that.
>
> Could you kindly test that the migration on 32-bit PVHVM guests
> on older hypervisors works?
>
Sure, will do! Was there anything special about the setup or any 32-bit
pvhvm guest migration (on 64-bit hypervisor I suppose) would fail? I can
try checking both current and old versions to make sure the issue was
acutually fixed.
>>
>> Signed-off-by: Olaf Hering <[email protected]>
>> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
>> (cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f)
>>
>> [On resume we need to reset the xen_vcpu_info, which the original
>> patch did not do]
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> arch/x86/xen/enlighten.c | 74 ++++++++++++++++++++++++++++++++++++------------
>> arch/x86/xen/suspend.c | 2 +-
>> arch/x86/xen/xen-ops.h | 2 +-
>> 3 files changed, 58 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index ffb101e..a11af62 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -1726,23 +1726,29 @@ asmlinkage __visible void __init xen_start_kernel(void)
>> #endif
>> }
>>
>> -void __ref xen_hvm_init_shared_info(void)
>> +#ifdef CONFIG_XEN_PVHVM
>> +#define HVM_SHARED_INFO_ADDR 0xFE700000UL
>> +static struct shared_info *xen_hvm_shared_info;
>> +static unsigned long xen_hvm_sip_phys;
>> +static int xen_major, xen_minor;
>> +
>> +static void xen_hvm_connect_shared_info(unsigned long pfn)
>> {
>> - int cpu;
>> struct xen_add_to_physmap xatp;
>> - static struct shared_info *shared_info_page = 0;
>>
>> - if (!shared_info_page)
>> - shared_info_page = (struct shared_info *)
>> - extend_brk(PAGE_SIZE, PAGE_SIZE);
>> xatp.domid = DOMID_SELF;
>> xatp.idx = 0;
>> xatp.space = XENMAPSPACE_shared_info;
>> - xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
>> + xatp.gpfn = pfn;
>> if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
>> BUG();
>>
>> - HYPERVISOR_shared_info = (struct shared_info *)shared_info_page;
>> +}
>> +static void __init xen_hvm_set_shared_info(struct shared_info *sip)
>> +{
>> + int cpu;
>> +
>> + HYPERVISOR_shared_info = sip;
>>
>> /* xen_vcpu is a pointer to the vcpu_info struct in the shared_info
>> * page, we use it in the event channel upcall and in some pvclock
>> @@ -1760,20 +1766,39 @@ void __ref xen_hvm_init_shared_info(void)
>> }
>> }
>>
>> -#ifdef CONFIG_XEN_PVHVM
>> +/* Reconnect the shared_info pfn to a (new) mfn */
>> +void xen_hvm_resume_shared_info(void)
>> +{
>> + xen_hvm_connect_shared_info(xen_hvm_sip_phys >> PAGE_SHIFT);
>> + xen_hvm_set_shared_info(xen_hvm_shared_info);
>> +}
>> +
>> +/* Xen tools prior to Xen 4 do not provide a E820_Reserved area for guest usage.
>> + * On these old tools the shared info page will be placed in E820_Ram.
>> + * Xen 4 provides a E820_Reserved area at 0xFC000000, and this code expects
>> + * that nothing is mapped up to HVM_SHARED_INFO_ADDR.
>> + * Xen 4.3+ provides an explicit 1MB area at HVM_SHARED_INFO_ADDR which is used
>> + * here for the shared info page. */
>> +static void __init xen_hvm_init_shared_info(void)
>> +{
>> + if (xen_major < 4) {
>> + xen_hvm_shared_info = extend_brk(PAGE_SIZE, PAGE_SIZE);
>> + xen_hvm_sip_phys = __pa(xen_hvm_shared_info);
>> + } else {
>> + xen_hvm_sip_phys = HVM_SHARED_INFO_ADDR;
>> + set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_hvm_sip_phys);
>> + xen_hvm_shared_info =
>> + (struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
>> + }
>> + xen_hvm_resume_shared_info();
>> +}
>> +
>> static void __init init_hvm_pv_info(void)
>> {
>> - int major, minor;
>> - uint32_t eax, ebx, ecx, edx, pages, msr, base;
>> + uint32_t ecx, edx, pages, msr, base;
>> u64 pfn;
>>
>> base = xen_cpuid_base();
>> - cpuid(base + 1, &eax, &ebx, &ecx, &edx);
>> -
>> - major = eax >> 16;
>> - minor = eax & 0xffff;
>> - printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
>> -
>> cpuid(base + 2, &pages, &msr, &ecx, &edx);
>>
>> pfn = __pa(hypercall_page);
>> @@ -1828,10 +1853,23 @@ static void __init xen_hvm_guest_init(void)
>>
>> static uint32_t __init xen_hvm_platform(void)
>> {
>> + uint32_t eax, ebx, ecx, edx, base;
>> +
>> if (xen_pv_domain())
>> return 0;
>>
>> - return xen_cpuid_base();
>> + base = xen_cpuid_base();
>> + if (!base)
>> + return 0;
>> +
>> + cpuid(base + 1, &eax, &ebx, &ecx, &edx);
>> +
>> + xen_major = eax >> 16;
>> + xen_minor = eax & 0xffff;
>> +
>> + printk(KERN_INFO "Xen version %d.%d.\n", xen_major, xen_minor);
>> +
>> + return 1;
>> }
>>
>> bool xen_hvm_need_lapic(void)
>> diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
>> index c4df9db..8d6793e 100644
>> --- a/arch/x86/xen/suspend.c
>> +++ b/arch/x86/xen/suspend.c
>> @@ -32,7 +32,7 @@ static void xen_hvm_post_suspend(int suspend_cancelled)
>> {
>> #ifdef CONFIG_XEN_PVHVM
>> int cpu;
>> - xen_hvm_init_shared_info();
>> + xen_hvm_resume_shared_info();
>> xen_callback_vector();
>> xen_unplug_emulated_devices();
>> if (xen_feature(XENFEAT_hvm_safe_pvclock)) {
>> diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
>> index 97d8765..d083e82 100644
>> --- a/arch/x86/xen/xen-ops.h
>> +++ b/arch/x86/xen/xen-ops.h
>> @@ -43,7 +43,7 @@ void xen_enable_syscall(void);
>> void xen_vcpu_restore(void);
>>
>> void xen_callback_vector(void);
>> -void xen_hvm_init_shared_info(void);
>> +void xen_hvm_resume_shared_info(void);
>> void xen_unplug_emulated_devices(void);
>>
>> void __init xen_build_dynamic_phys_to_machine(void);
>> --
>> 1.9.3
>>
--
Vitaly
On Tue, Jul 15, 2014 at 05:43:17PM +0200, Vitaly Kuznetsov wrote:
> Konrad Rzeszutek Wilk <[email protected]> writes:
>
> > On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote:
> >> From: Olaf Hering <[email protected]>
> >>
> >> This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
> >> ("xen PVonHVM: move shared_info to MMIO before kexec").
> >>
> >> Currently kexec in a PVonHVM guest fails with a triple fault because the
> >> new kernel overwrites the shared info page. The exact failure depends on
> >> the size of the kernel image. This patch moves the pfn from RAM into an
> >> E820 reserved memory area.
> >>
> >> The pfn containing the shared_info is located somewhere in RAM. This will
> >> cause trouble if the current kernel is doing a kexec boot into a new
> >> kernel. The new kernel (and its startup code) can not know where the pfn
> >> is, so it can not reserve the page. The hypervisor will continue to update
> >> the pfn, and as a result memory corruption occours in the new kernel.
> >>
> >> The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
> >> E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
> >> starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
> >> will usually not allocate areas up to FE700000, so FE700000 is expected
> >> to work also with older toolstacks.
> >>
> >> In Xen3 there is no reserved area at a fixed location. If the guest is
> >> started on such old hosts the shared_info page will be placed in RAM. As
> >> a result kexec can not be used.
> >
> > So this looks right, the one thing that we really need to check
> > is e9daff24a266307943457086533041bd971d0ef9
> >
> > This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f.
> >
> > We are doing this b/c on 32-bit PVonHVM with older hypervisors
> > (Xen 4.1) it ends up bothing up the start_info. This is bad b/c
> > we use it for the time keeping, and the timekeeping code loops
> > forever - as the version field never changes. Olaf says to
> > revert it, so lets do that.
> >
> > Could you kindly test that the migration on 32-bit PVHVM guests
> > on older hypervisors works?
> >
>
> Sure, will do! Was there anything special about the setup or any 32-bit
> pvhvm guest migration (on 64-bit hypervisor I suppose) would fail? I can
> try checking both current and old versions to make sure the issue was
> acutually fixed.
Nothing fancy (well, it was SMP, so 4 CPUs). I did the 'save'/'restore' and the
guest would not restore properly.
Thank you!
>
> >>
> >> Signed-off-by: Olaf Hering <[email protected]>
> >> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
> >> (cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f)
> >>
> >> [On resume we need to reset the xen_vcpu_info, which the original
> >> patch did not do]
> >>
> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> >> ---
> >> arch/x86/xen/enlighten.c | 74 ++++++++++++++++++++++++++++++++++++------------
> >> arch/x86/xen/suspend.c | 2 +-
> >> arch/x86/xen/xen-ops.h | 2 +-
> >> 3 files changed, 58 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> >> index ffb101e..a11af62 100644
> >> --- a/arch/x86/xen/enlighten.c
> >> +++ b/arch/x86/xen/enlighten.c
> >> @@ -1726,23 +1726,29 @@ asmlinkage __visible void __init xen_start_kernel(void)
> >> #endif
> >> }
> >>
> >> -void __ref xen_hvm_init_shared_info(void)
> >> +#ifdef CONFIG_XEN_PVHVM
> >> +#define HVM_SHARED_INFO_ADDR 0xFE700000UL
> >> +static struct shared_info *xen_hvm_shared_info;
> >> +static unsigned long xen_hvm_sip_phys;
> >> +static int xen_major, xen_minor;
> >> +
> >> +static void xen_hvm_connect_shared_info(unsigned long pfn)
> >> {
> >> - int cpu;
> >> struct xen_add_to_physmap xatp;
> >> - static struct shared_info *shared_info_page = 0;
> >>
> >> - if (!shared_info_page)
> >> - shared_info_page = (struct shared_info *)
> >> - extend_brk(PAGE_SIZE, PAGE_SIZE);
> >> xatp.domid = DOMID_SELF;
> >> xatp.idx = 0;
> >> xatp.space = XENMAPSPACE_shared_info;
> >> - xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
> >> + xatp.gpfn = pfn;
> >> if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
> >> BUG();
> >>
> >> - HYPERVISOR_shared_info = (struct shared_info *)shared_info_page;
> >> +}
> >> +static void __init xen_hvm_set_shared_info(struct shared_info *sip)
> >> +{
> >> + int cpu;
> >> +
> >> + HYPERVISOR_shared_info = sip;
> >>
> >> /* xen_vcpu is a pointer to the vcpu_info struct in the shared_info
> >> * page, we use it in the event channel upcall and in some pvclock
> >> @@ -1760,20 +1766,39 @@ void __ref xen_hvm_init_shared_info(void)
> >> }
> >> }
> >>
> >> -#ifdef CONFIG_XEN_PVHVM
> >> +/* Reconnect the shared_info pfn to a (new) mfn */
> >> +void xen_hvm_resume_shared_info(void)
> >> +{
> >> + xen_hvm_connect_shared_info(xen_hvm_sip_phys >> PAGE_SHIFT);
> >> + xen_hvm_set_shared_info(xen_hvm_shared_info);
> >> +}
> >> +
> >> +/* Xen tools prior to Xen 4 do not provide a E820_Reserved area for guest usage.
> >> + * On these old tools the shared info page will be placed in E820_Ram.
> >> + * Xen 4 provides a E820_Reserved area at 0xFC000000, and this code expects
> >> + * that nothing is mapped up to HVM_SHARED_INFO_ADDR.
> >> + * Xen 4.3+ provides an explicit 1MB area at HVM_SHARED_INFO_ADDR which is used
> >> + * here for the shared info page. */
> >> +static void __init xen_hvm_init_shared_info(void)
> >> +{
> >> + if (xen_major < 4) {
> >> + xen_hvm_shared_info = extend_brk(PAGE_SIZE, PAGE_SIZE);
> >> + xen_hvm_sip_phys = __pa(xen_hvm_shared_info);
> >> + } else {
> >> + xen_hvm_sip_phys = HVM_SHARED_INFO_ADDR;
> >> + set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_hvm_sip_phys);
> >> + xen_hvm_shared_info =
> >> + (struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
> >> + }
> >> + xen_hvm_resume_shared_info();
> >> +}
> >> +
> >> static void __init init_hvm_pv_info(void)
> >> {
> >> - int major, minor;
> >> - uint32_t eax, ebx, ecx, edx, pages, msr, base;
> >> + uint32_t ecx, edx, pages, msr, base;
> >> u64 pfn;
> >>
> >> base = xen_cpuid_base();
> >> - cpuid(base + 1, &eax, &ebx, &ecx, &edx);
> >> -
> >> - major = eax >> 16;
> >> - minor = eax & 0xffff;
> >> - printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
> >> -
> >> cpuid(base + 2, &pages, &msr, &ecx, &edx);
> >>
> >> pfn = __pa(hypercall_page);
> >> @@ -1828,10 +1853,23 @@ static void __init xen_hvm_guest_init(void)
> >>
> >> static uint32_t __init xen_hvm_platform(void)
> >> {
> >> + uint32_t eax, ebx, ecx, edx, base;
> >> +
> >> if (xen_pv_domain())
> >> return 0;
> >>
> >> - return xen_cpuid_base();
> >> + base = xen_cpuid_base();
> >> + if (!base)
> >> + return 0;
> >> +
> >> + cpuid(base + 1, &eax, &ebx, &ecx, &edx);
> >> +
> >> + xen_major = eax >> 16;
> >> + xen_minor = eax & 0xffff;
> >> +
> >> + printk(KERN_INFO "Xen version %d.%d.\n", xen_major, xen_minor);
> >> +
> >> + return 1;
> >> }
> >>
> >> bool xen_hvm_need_lapic(void)
> >> diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
> >> index c4df9db..8d6793e 100644
> >> --- a/arch/x86/xen/suspend.c
> >> +++ b/arch/x86/xen/suspend.c
> >> @@ -32,7 +32,7 @@ static void xen_hvm_post_suspend(int suspend_cancelled)
> >> {
> >> #ifdef CONFIG_XEN_PVHVM
> >> int cpu;
> >> - xen_hvm_init_shared_info();
> >> + xen_hvm_resume_shared_info();
> >> xen_callback_vector();
> >> xen_unplug_emulated_devices();
> >> if (xen_feature(XENFEAT_hvm_safe_pvclock)) {
> >> diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
> >> index 97d8765..d083e82 100644
> >> --- a/arch/x86/xen/xen-ops.h
> >> +++ b/arch/x86/xen/xen-ops.h
> >> @@ -43,7 +43,7 @@ void xen_enable_syscall(void);
> >> void xen_vcpu_restore(void);
> >>
> >> void xen_callback_vector(void);
> >> -void xen_hvm_init_shared_info(void);
> >> +void xen_hvm_resume_shared_info(void);
> >> void xen_unplug_emulated_devices(void);
> >>
> >> void __init xen_build_dynamic_phys_to_machine(void);
> >> --
> >> 1.9.3
> >>
>
> --
> Vitaly
Konrad Rzeszutek Wilk <[email protected]> writes:
> On Tue, Jul 15, 2014 at 03:40:38PM +0200, Vitaly Kuznetsov wrote:
>> PVHVM guest requires special actions before kexec. Register specific
>> xen_pvhvm_kexec_shutdown() handler for machine_ops.shutdown().
>>
>
> This looks close to what I had sent as an RFC to you?
>
Yes, I stole that part from your "RFC: VCPU_reset_cpu_info" patch to
call xen_unmap_all_pirqs() on shutdown. I'm looking at SMP issues your
patches address as well.
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> arch/x86/xen/enlighten.c | 9 +++++++++
>> arch/x86/xen/smp.c | 9 +++++++++
>> arch/x86/xen/smp.h | 1 +
>> 3 files changed, 19 insertions(+)
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index a11af62..8074e4a 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -1833,6 +1833,12 @@ static struct notifier_block xen_hvm_cpu_notifier = {
>> .notifier_call = xen_hvm_cpu_notify,
>> };
>>
>> +static void xen_pvhvm_kexec_shutdown(void)
>> +{
>> + xen_kexec_shutdown();
>> + native_machine_shutdown();
>> +}
>> +
>> static void __init xen_hvm_guest_init(void)
>> {
>> init_hvm_pv_info();
>> @@ -1849,6 +1855,9 @@ static void __init xen_hvm_guest_init(void)
>> x86_init.irqs.intr_init = xen_init_IRQ;
>> xen_hvm_init_time_ops();
>> xen_hvm_init_mmu_ops();
>> +#ifdef CONFIG_KEXEC
>> + machine_ops.shutdown = xen_pvhvm_kexec_shutdown;
>> +#endif
>> }
>>
>> static uint32_t __init xen_hvm_platform(void)
>> diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
>> index 7005974..35dcf39 100644
>> --- a/arch/x86/xen/smp.c
>> +++ b/arch/x86/xen/smp.c
>> @@ -18,6 +18,7 @@
>> #include <linux/smp.h>
>> #include <linux/irq_work.h>
>> #include <linux/tick.h>
>> +#include <linux/kexec.h>
>>
>> #include <asm/paravirt.h>
>> #include <asm/desc.h>
>> @@ -762,6 +763,14 @@ static void xen_hvm_cpu_die(unsigned int cpu)
>> native_cpu_die(cpu);
>> }
>>
>> +void xen_kexec_shutdown(void)
>> +{
>> +#ifdef CONFIG_KEXEC
>> + if (!kexec_in_progress)
>> + return;
>> +#endif
>> +}
>> +
>> void __init xen_hvm_smp_init(void)
>> {
>> if (!xen_have_vector_callback)
>> diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
>> index c7c2d89..1af0493 100644
>> --- a/arch/x86/xen/smp.h
>> +++ b/arch/x86/xen/smp.h
>> @@ -8,4 +8,5 @@ extern void xen_send_IPI_allbutself(int vector);
>> extern void xen_send_IPI_all(int vector);
>> extern void xen_send_IPI_self(int vector);
>>
>> +extern void xen_kexec_shutdown(void);
>> #endif
>> --
>> 1.9.3
>>
--
Vitaly
On Tue, Jul 15, 2014 at 05:52:33PM +0200, Vitaly Kuznetsov wrote:
> Konrad Rzeszutek Wilk <[email protected]> writes:
>
> > On Tue, Jul 15, 2014 at 03:40:38PM +0200, Vitaly Kuznetsov wrote:
> >> PVHVM guest requires special actions before kexec. Register specific
> >> xen_pvhvm_kexec_shutdown() handler for machine_ops.shutdown().
> >>
> >
> > This looks close to what I had sent as an RFC to you?
> >
>
> Yes, I stole that part from your "RFC: VCPU_reset_cpu_info" patch to
> call xen_unmap_all_pirqs() on shutdown. I'm looking at SMP issues your
> patches address as well.
It is customary to attribute that in the patch. Such as:
"Original idea/patch from Konrad Rzeszutek Wilk" is suffice.
Thanks!
>
> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> >> ---
> >> arch/x86/xen/enlighten.c | 9 +++++++++
> >> arch/x86/xen/smp.c | 9 +++++++++
> >> arch/x86/xen/smp.h | 1 +
> >> 3 files changed, 19 insertions(+)
> >>
> >> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> >> index a11af62..8074e4a 100644
> >> --- a/arch/x86/xen/enlighten.c
> >> +++ b/arch/x86/xen/enlighten.c
> >> @@ -1833,6 +1833,12 @@ static struct notifier_block xen_hvm_cpu_notifier = {
> >> .notifier_call = xen_hvm_cpu_notify,
> >> };
> >>
> >> +static void xen_pvhvm_kexec_shutdown(void)
> >> +{
> >> + xen_kexec_shutdown();
> >> + native_machine_shutdown();
> >> +}
> >> +
> >> static void __init xen_hvm_guest_init(void)
> >> {
> >> init_hvm_pv_info();
> >> @@ -1849,6 +1855,9 @@ static void __init xen_hvm_guest_init(void)
> >> x86_init.irqs.intr_init = xen_init_IRQ;
> >> xen_hvm_init_time_ops();
> >> xen_hvm_init_mmu_ops();
> >> +#ifdef CONFIG_KEXEC
> >> + machine_ops.shutdown = xen_pvhvm_kexec_shutdown;
> >> +#endif
> >> }
> >>
> >> static uint32_t __init xen_hvm_platform(void)
> >> diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
> >> index 7005974..35dcf39 100644
> >> --- a/arch/x86/xen/smp.c
> >> +++ b/arch/x86/xen/smp.c
> >> @@ -18,6 +18,7 @@
> >> #include <linux/smp.h>
> >> #include <linux/irq_work.h>
> >> #include <linux/tick.h>
> >> +#include <linux/kexec.h>
> >>
> >> #include <asm/paravirt.h>
> >> #include <asm/desc.h>
> >> @@ -762,6 +763,14 @@ static void xen_hvm_cpu_die(unsigned int cpu)
> >> native_cpu_die(cpu);
> >> }
> >>
> >> +void xen_kexec_shutdown(void)
> >> +{
> >> +#ifdef CONFIG_KEXEC
> >> + if (!kexec_in_progress)
> >> + return;
> >> +#endif
> >> +}
> >> +
> >> void __init xen_hvm_smp_init(void)
> >> {
> >> if (!xen_have_vector_callback)
> >> diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
> >> index c7c2d89..1af0493 100644
> >> --- a/arch/x86/xen/smp.h
> >> +++ b/arch/x86/xen/smp.h
> >> @@ -8,4 +8,5 @@ extern void xen_send_IPI_allbutself(int vector);
> >> extern void xen_send_IPI_all(int vector);
> >> extern void xen_send_IPI_self(int vector);
> >>
> >> +extern void xen_kexec_shutdown(void);
> >> #endif
> >> --
> >> 1.9.3
> >>
>
> --
> Vitaly
On 07/15/2014 09:40 AM, Vitaly Kuznetsov wrote:
> PVHVM guest requires special actions before kexec. Register specific
> xen_pvhvm_kexec_shutdown() handler for machine_ops.shutdown().
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/xen/enlighten.c | 9 +++++++++
> arch/x86/xen/smp.c | 9 +++++++++
> arch/x86/xen/smp.h | 1 +
> 3 files changed, 19 insertions(+)
>
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index a11af62..8074e4a 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -1833,6 +1833,12 @@ static struct notifier_block xen_hvm_cpu_notifier = {
> .notifier_call = xen_hvm_cpu_notify,
> };
>
> +static void xen_pvhvm_kexec_shutdown(void)
> +{
> + xen_kexec_shutdown();
> + native_machine_shutdown();
> +}
> +
This should probably be under ifdef CONFIG_KEXEC ?
> static void __init xen_hvm_guest_init(void)
> {
> init_hvm_pv_info();
> @@ -1849,6 +1855,9 @@ static void __init xen_hvm_guest_init(void)
> x86_init.irqs.intr_init = xen_init_IRQ;
> xen_hvm_init_time_ops();
> xen_hvm_init_mmu_ops();
> +#ifdef CONFIG_KEXEC
> + machine_ops.shutdown = xen_pvhvm_kexec_shutdown;
> +#endif
> }
>
> static uint32_t __init xen_hvm_platform(void)
> diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
> index 7005974..35dcf39 100644
> --- a/arch/x86/xen/smp.c
> +++ b/arch/x86/xen/smp.c
> @@ -18,6 +18,7 @@
> #include <linux/smp.h>
> #include <linux/irq_work.h>
> #include <linux/tick.h>
> +#include <linux/kexec.h>
>
> #include <asm/paravirt.h>
> #include <asm/desc.h>
> @@ -762,6 +763,14 @@ static void xen_hvm_cpu_die(unsigned int cpu)
> native_cpu_die(cpu);
> }
>
> +void xen_kexec_shutdown(void)
> +{
> +#ifdef CONFIG_KEXEC
> + if (!kexec_in_progress)
> + return;
> +#endif
> +}
> +
and this one too (whole routine)? And "#include" above as well.
> void __init xen_hvm_smp_init(void)
> {
> if (!xen_have_vector_callback)
> diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h
> index c7c2d89..1af0493 100644
> --- a/arch/x86/xen/smp.h
> +++ b/arch/x86/xen/smp.h
> @@ -8,4 +8,5 @@ extern void xen_send_IPI_allbutself(int vector);
> extern void xen_send_IPI_all(int vector);
> extern void xen_send_IPI_self(int vector);
>
> +extern void xen_kexec_shutdown(void);
> #endif
Konrad Rzeszutek Wilk <[email protected]> writes:
> On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:
>> When kexec was peformed MSI IRQs for passthrough-ed devices were already
>> mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq()
>> fails as we have no IRQ mapping information for that. Requesting for new
>> mapping with __write_msi_msg() does not result in MSI IRQ being remapped so
>> we don't recieve these IRQs.
>
> receive
>
Thanks for your comments!
> How come '__write_msi_msg' does not result in new MSI IRQs?
>
Actually that was the hidden question in my RFC :-)
Let me describe what I see. When normal boot is performed we have the
following in xen_hvm_setup_msi_irqs():
__read_msi_msg()
pirq -> 0
then we allocate new pirq with
pirq = xen_allocate_pirq_msi()
pirq -> 54
and we have the following mapping:
xen: msi --> pirq=54 --> irq=72
in 'xl debug-keys i':
(XEN) IRQ: 29 affinity:04 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(----),
After kexec we see the following:
__read_msi_msg()
pirq -> 54
but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:
pirq = xen_allocate_pirq_msi()
pirq -> 55
and we have the following mapping:
xen: msi --> pirq=55 --> irq=75
However (afaict) mapping in xen wasn't updated:
in 'xl debug-keys i':
(XEN) IRQ: 29 affinity:02 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(--M-),
> Is it fair to state that your code ends up reading the MSI IRQ (PIRQ)
> from the device and updating the internal PIRQ<->IRQ code to match
> with the reality?
>
Yea, 'always trust the device'.
>>
>> RFC: I wasn't able to understand why commit af42b8d1 which introduced
>> xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that instead
>> of checking pirq > 0 as if the mapping was already done (and we have pirq>0 here)
>> we don't need to request for a new pirq. We're loosing existing PIRQ and I'm also
>> not sure when __write_msi_msg() with new PIRQ will result in new mapping.
>
> We don't request a new pirq. We end up returning before we call xen_allocate_pirq_msi.
> At least that is how the commit you mentioned worked.
>
I meant to say that in case we have pirq > 0 from __read_msi_msg() but
xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do
xen_allocate_pirq_msi() which brings us new pirq.
> In regards to why using 'xen_irq_from_pirq' instead of just checking the PIRQ - is
> that we might be called twice by a buggy driver. As such we want to check
> our PIRQ<->IRQ to figure this out.
But if we're called twice we'll see the same pirq, right? Or there are
some cases when we see 'crap' instead of pirq here?
I think it would be nice to use the same pirq after kexec instead of
allocating a new one even in case we can make remapping work.
Thanks for your comments again!
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> arch/x86/pci/xen.c | 3 +--
>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>> index 905956f..685e8f1 100644
>> --- a/arch/x86/pci/xen.c
>> +++ b/arch/x86/pci/xen.c
>> @@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>> __read_msi_msg(msidesc, &msg);
>> pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
>> ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
>> - if (msg.data != XEN_PIRQ_MSI_DATA ||
>> - xen_irq_from_pirq(pirq) < 0) {
>> + if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
>> pirq = xen_allocate_pirq_msi(dev, msidesc);
>> if (pirq < 0) {
>> irq = -ENODEV;
>> --
>> 1.9.3
>>
--
Vitaly
Konrad Rzeszutek Wilk <[email protected]> writes:
> On Tue, Jul 15, 2014 at 03:40:39PM +0200, Vitaly Kuznetsov wrote:
>> When kexec is being run PIRQs from Qemu-emulated devices are still
>> mapped to old event channels and new kernel has no information about
>> that. Trying to map them twice results in the following in Xen's dmesg:
>>
>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
>> ...
>>
>> and the following in new kernel's dmesg:
>>
>> [ 92.286796] xen:events: Failed to obtain physical IRQ 4
>>
>> The result is that the new kernel doesn't recieve IRQs for Qemu-emulated
>> devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
>> when kexec was requested and on every kernel startup. We need to do this
>> twice to deal with the following issues:
>> - startup-time unmapping is required to make kdump work;
>> - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
>> - shutdown-time unmapping is required to make Qemu-emulated NICs work after
>> kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq
>> is being performed).
>
> How does this work when you boot the guest under Xen 4.4 where the FIFO events
> are used? Does it still work correctly?
Thanks for pointing that out! I've checked and it doesn't. However
patches make no difference - guest kernel gets stuck on boot with and
without them. Will try to investigate...
>
> Thanks.
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> arch/x86/xen/smp.c | 1 +
>> drivers/xen/events/events_base.c | 76 ++++++++++++++++++++++++++++++++++++++++
>> include/xen/events.h | 3 ++
>> 3 files changed, 80 insertions(+)
>>
>> diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
>> index 35dcf39..e2b4deb 100644
>> --- a/arch/x86/xen/smp.c
>> +++ b/arch/x86/xen/smp.c
>> @@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
>> #ifdef CONFIG_KEXEC
>> if (!kexec_in_progress)
>> return;
>> + xen_unmap_all_pirqs();
>> #endif
>> }
>>
>> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
>> index c919d3d..7701c7f 100644
>> --- a/drivers/xen/events/events_base.c
>> +++ b/drivers/xen/events/events_base.c
>> @@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
>> static bool fifo_events = true;
>> module_param(fifo_events, bool, 0);
>>
>> +void xen_unmap_all_pirqs(void)
>> +{
>> + int pirq, rc, gsi, irq, evtchn;
>> + struct physdev_unmap_pirq unmap_irq;
>> + struct irq_info *info;
>> + struct evtchn_close close;
>> +
>> + mutex_lock(&irq_mapping_update_lock);
>> +
>> + list_for_each_entry(info, &xen_irq_list_head, list) {
>> + if (info->type != IRQT_PIRQ)
>> + continue;
>> +
>> + pirq = info->u.pirq.pirq;
>> + gsi = info->u.pirq.gsi;
>> + evtchn = info->evtchn;
>> + irq = info->irq;
>> +
>> + pr_debug("unmapping pirq gsi=%d pirq=%d irq=%d evtchn=%d\n",
>> + gsi, pirq, irq, evtchn);
>> +
>> + if (evtchn > 0) {
>> + close.port = evtchn;
>> + if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
>> + &close) != 0)
>> + pr_warn("close evtchn %d failed\n", evtchn);
>> + }
>> +
>> + unmap_irq.pirq = pirq;
>> + unmap_irq.domid = DOMID_SELF;
>> +
>> + rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
>> + if (rc)
>> + pr_warn("unmap pirq failed gsi=%d pirq=%d irq=%d rc=%d\n",
>> + gsi, pirq, irq, rc);
>> + }
>> +
>> + mutex_unlock(&irq_mapping_update_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(xen_unmap_all_pirqs);
>
> Why the EXPORT? Is this used by modules?
>> +
>> +static void xen_startup_unmap_pirqs(void)
>> +{
>> + struct evtchn_status status;
>> + int port, rc = -ENOENT;
>> + struct physdev_unmap_pirq unmap_irq;
>> + struct evtchn_close close;
>> +
>> + memset(&status, 0, sizeof(status));
>> + for (port = 0; port < xen_evtchn_max_channels(); port++) {
>> + status.dom = DOMID_SELF;
>> + status.port = port;
>> + rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
>> + if (rc < 0)
>> + continue;
>> + if (status.status == EVTCHNSTAT_pirq) {
>> + close.port = port;
>> + if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
>> + &close) != 0)
>> + pr_warn("xen: failed to close evtchn %d\n",
>> + port);
>> + unmap_irq.pirq = status.u.pirq;
>> + unmap_irq.domid = DOMID_SELF;
>> + pr_warn("xen: unmapping previously mapped pirq %d\n",
>> + unmap_irq.pirq);
>> + if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq,
>> + &unmap_irq) != 0)
>> + pr_warn("xen: failed to unmap pirq %d\n",
>> + unmap_irq.pirq);
>> + }
>> + }
>> +}
>> +
>> +
>> void __init xen_init_IRQ(void)
>> {
>> int ret = -EINVAL;
>> @@ -1671,6 +1745,8 @@ void __init xen_init_IRQ(void)
>> xen_callback_vector();
>>
>> if (xen_hvm_domain()) {
>> + xen_startup_unmap_pirqs();
>> +
>> native_init_IRQ();
>> /* pci_xen_hvm_init must be called after native_init_IRQ so that
>> * __acpi_register_gsi can point at the right function */
>> diff --git a/include/xen/events.h b/include/xen/events.h
>> index 8bee7a7..3f9f428 100644
>> --- a/include/xen/events.h
>> +++ b/include/xen/events.h
>> @@ -122,6 +122,9 @@ int xen_irq_from_gsi(unsigned gsi);
>> /* Determine whether to ignore this IRQ if it is passed to a guest. */
>> int xen_test_irq_shared(int irq);
>>
>> +/* Unmap all PIRQs and close event channels */
>> +void xen_unmap_all_pirqs(void);
>> +
>> /* initialize Xen IRQ subsystem */
>> void xen_init_IRQ(void);
>> #endif /* _XEN_EVENTS_H */
>> --
>> 1.9.3
>>
--
Vitaly
On Wed, Jul 16, 2014 at 11:01:55AM +0200, Vitaly Kuznetsov wrote:
> Konrad Rzeszutek Wilk <[email protected]> writes:
>
> > On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:
> >> When kexec was peformed MSI IRQs for passthrough-ed devices were already
> >> mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq()
> >> fails as we have no IRQ mapping information for that. Requesting for new
> >> mapping with __write_msi_msg() does not result in MSI IRQ being remapped so
> >> we don't recieve these IRQs.
> >
> > receive
> >
>
> Thanks for your comments!
Thank you for quick turnaround with the answers!
>
> > How come '__write_msi_msg' does not result in new MSI IRQs?
> >
>
> Actually that was the hidden question in my RFC :-)
>
> Let me describe what I see. When normal boot is performed we have the
> following in xen_hvm_setup_msi_irqs():
>
> __read_msi_msg()
> pirq -> 0
>
> then we allocate new pirq with
> pirq = xen_allocate_pirq_msi()
> pirq -> 54
>
> and we have the following mapping:
> xen: msi --> pirq=54 --> irq=72
>
> in 'xl debug-keys i':
> (XEN) IRQ: 29 affinity:04 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(----),
>
> After kexec we see the following:
> __read_msi_msg()
> pirq -> 54
>
> but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:
> pirq = xen_allocate_pirq_msi()
> pirq -> 55
>
> and we have the following mapping:
> xen: msi --> pirq=55 --> irq=75
>
> However (afaict) mapping in xen wasn't updated:
>
> in 'xl debug-keys i':
> (XEN) IRQ: 29 affinity:02 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(--M-),
I am wondering if that is related to in QEMU traditional:
qemu-xen-trad: free all the pirqs for msi/msix when driver unloads
(which in the upstream QEMU is 1d4fd4f0e2fc5dcae0c60e00cc9af95f52988050)
If you have that patch in, is the PIRQ value correctly updated?
>
> > Is it fair to state that your code ends up reading the MSI IRQ (PIRQ)
> > from the device and updating the internal PIRQ<->IRQ code to match
> > with the reality?
> >
>
> Yea, 'always trust the device'.
>
> >>
> >> RFC: I wasn't able to understand why commit af42b8d1 which introduced
> >> xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that instead
> >> of checking pirq > 0 as if the mapping was already done (and we have pirq>0 here)
> >> we don't need to request for a new pirq. We're loosing existing PIRQ and I'm also
> >> not sure when __write_msi_msg() with new PIRQ will result in new mapping.
> >
> > We don't request a new pirq. We end up returning before we call xen_allocate_pirq_msi.
> > At least that is how the commit you mentioned worked.
> >
>
> I meant to say that in case we have pirq > 0 from __read_msi_msg() but
> xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do
> xen_allocate_pirq_msi() which brings us new pirq.
>
> > In regards to why using 'xen_irq_from_pirq' instead of just checking the PIRQ - is
> > that we might be called twice by a buggy driver. As such we want to check
> > our PIRQ<->IRQ to figure this out.
>
> But if we're called twice we'll see the same pirq, right? Or there are
Good point.
> some cases when we see 'crap' instead of pirq here?
For PCI passthrough devices they will be zero until they are enabled.
But I am not sure about the emulated devices, such as e1000 or such, which
would also go through this path (I think - do we have MSI devices that
we emulate in QEMU?)
>
> I think it would be nice to use the same pirq after kexec instead of
> allocating a new one even in case we can make remapping work.
I concur.
Stefano, do you recall why you used xen_irq_from_pirq instead of just
trusting the 'pirq' value? Was it to workaround broken QEMU?
>
> Thanks for your comments again!
>
> >>
> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> >> ---
> >> arch/x86/pci/xen.c | 3 +--
> >> 1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> >> index 905956f..685e8f1 100644
> >> --- a/arch/x86/pci/xen.c
> >> +++ b/arch/x86/pci/xen.c
> >> @@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> >> __read_msi_msg(msidesc, &msg);
> >> pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> >> ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> >> - if (msg.data != XEN_PIRQ_MSI_DATA ||
> >> - xen_irq_from_pirq(pirq) < 0) {
> >> + if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
> >> pirq = xen_allocate_pirq_msi(dev, msidesc);
> >> if (pirq < 0) {
> >> irq = -ENODEV;
> >> --
> >> 1.9.3
> >>
>
> --
> Vitaly
On Wed, Jul 16, 2014 at 11:37:10AM +0200, Vitaly Kuznetsov wrote:
> Konrad Rzeszutek Wilk <[email protected]> writes:
>
> > On Tue, Jul 15, 2014 at 03:40:39PM +0200, Vitaly Kuznetsov wrote:
> >> When kexec is being run PIRQs from Qemu-emulated devices are still
> >> mapped to old event channels and new kernel has no information about
> >> that. Trying to map them twice results in the following in Xen's dmesg:
> >>
> >> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
> >> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
> >> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
> >> ...
> >>
> >> and the following in new kernel's dmesg:
> >>
> >> [ 92.286796] xen:events: Failed to obtain physical IRQ 4
> >>
> >> The result is that the new kernel doesn't recieve IRQs for Qemu-emulated
> >> devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
> >> when kexec was requested and on every kernel startup. We need to do this
> >> twice to deal with the following issues:
> >> - startup-time unmapping is required to make kdump work;
> >> - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
> >> - shutdown-time unmapping is required to make Qemu-emulated NICs work after
> >> kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq
> >> is being performed).
> >
> > How does this work when you boot the guest under Xen 4.4 where the FIFO events
> > are used? Does it still work correctly?
>
> Thanks for pointing that out! I've checked and it doesn't. However
> patches make no difference - guest kernel gets stuck on boot with and
> without them. Will try to investigate...
I think for FIFO events we can't do much right now - it would need some
new hypercalls to de-allocate or such.
But I was thinking that your code logic could just return out when
it detects that it is running with FIFO events (with a TODO comment)
- and also spit out some information to this effect?
Say: "Use xen.fifo=0 in your launching kernel"
(don't know the right name for the kernel in which you do 'kexec -e' in ?
Is that launching? Original? Bootstrap kernel?)
>
> >
> > Thanks.
> >>
> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> >> ---
> >> arch/x86/xen/smp.c | 1 +
> >> drivers/xen/events/events_base.c | 76 ++++++++++++++++++++++++++++++++++++++++
> >> include/xen/events.h | 3 ++
> >> 3 files changed, 80 insertions(+)
> >>
> >> diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
> >> index 35dcf39..e2b4deb 100644
> >> --- a/arch/x86/xen/smp.c
> >> +++ b/arch/x86/xen/smp.c
> >> @@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
> >> #ifdef CONFIG_KEXEC
> >> if (!kexec_in_progress)
> >> return;
> >> + xen_unmap_all_pirqs();
> >> #endif
> >> }
> >>
> >> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> >> index c919d3d..7701c7f 100644
> >> --- a/drivers/xen/events/events_base.c
> >> +++ b/drivers/xen/events/events_base.c
> >> @@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
> >> static bool fifo_events = true;
> >> module_param(fifo_events, bool, 0);
> >>
> >> +void xen_unmap_all_pirqs(void)
> >> +{
> >> + int pirq, rc, gsi, irq, evtchn;
> >> + struct physdev_unmap_pirq unmap_irq;
> >> + struct irq_info *info;
> >> + struct evtchn_close close;
> >> +
> >> + mutex_lock(&irq_mapping_update_lock);
> >> +
> >> + list_for_each_entry(info, &xen_irq_list_head, list) {
> >> + if (info->type != IRQT_PIRQ)
> >> + continue;
> >> +
> >> + pirq = info->u.pirq.pirq;
> >> + gsi = info->u.pirq.gsi;
> >> + evtchn = info->evtchn;
> >> + irq = info->irq;
> >> +
> >> + pr_debug("unmapping pirq gsi=%d pirq=%d irq=%d evtchn=%d\n",
> >> + gsi, pirq, irq, evtchn);
> >> +
> >> + if (evtchn > 0) {
> >> + close.port = evtchn;
> >> + if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
> >> + &close) != 0)
> >> + pr_warn("close evtchn %d failed\n", evtchn);
> >> + }
> >> +
> >> + unmap_irq.pirq = pirq;
> >> + unmap_irq.domid = DOMID_SELF;
> >> +
> >> + rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
> >> + if (rc)
> >> + pr_warn("unmap pirq failed gsi=%d pirq=%d irq=%d rc=%d\n",
> >> + gsi, pirq, irq, rc);
> >> + }
> >> +
> >> + mutex_unlock(&irq_mapping_update_lock);
> >> +}
> >> +EXPORT_SYMBOL_GPL(xen_unmap_all_pirqs);
> >
> > Why the EXPORT? Is this used by modules?
> >> +
> >> +static void xen_startup_unmap_pirqs(void)
> >> +{
> >> + struct evtchn_status status;
> >> + int port, rc = -ENOENT;
> >> + struct physdev_unmap_pirq unmap_irq;
> >> + struct evtchn_close close;
> >> +
> >> + memset(&status, 0, sizeof(status));
> >> + for (port = 0; port < xen_evtchn_max_channels(); port++) {
> >> + status.dom = DOMID_SELF;
> >> + status.port = port;
> >> + rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
> >> + if (rc < 0)
> >> + continue;
> >> + if (status.status == EVTCHNSTAT_pirq) {
> >> + close.port = port;
> >> + if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
> >> + &close) != 0)
> >> + pr_warn("xen: failed to close evtchn %d\n",
> >> + port);
> >> + unmap_irq.pirq = status.u.pirq;
> >> + unmap_irq.domid = DOMID_SELF;
> >> + pr_warn("xen: unmapping previously mapped pirq %d\n",
> >> + unmap_irq.pirq);
> >> + if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq,
> >> + &unmap_irq) != 0)
> >> + pr_warn("xen: failed to unmap pirq %d\n",
> >> + unmap_irq.pirq);
> >> + }
> >> + }
> >> +}
> >> +
> >> +
> >> void __init xen_init_IRQ(void)
> >> {
> >> int ret = -EINVAL;
> >> @@ -1671,6 +1745,8 @@ void __init xen_init_IRQ(void)
> >> xen_callback_vector();
> >>
> >> if (xen_hvm_domain()) {
> >> + xen_startup_unmap_pirqs();
> >> +
> >> native_init_IRQ();
> >> /* pci_xen_hvm_init must be called after native_init_IRQ so that
> >> * __acpi_register_gsi can point at the right function */
> >> diff --git a/include/xen/events.h b/include/xen/events.h
> >> index 8bee7a7..3f9f428 100644
> >> --- a/include/xen/events.h
> >> +++ b/include/xen/events.h
> >> @@ -122,6 +122,9 @@ int xen_irq_from_gsi(unsigned gsi);
> >> /* Determine whether to ignore this IRQ if it is passed to a guest. */
> >> int xen_test_irq_shared(int irq);
> >>
> >> +/* Unmap all PIRQs and close event channels */
> >> +void xen_unmap_all_pirqs(void);
> >> +
> >> /* initialize Xen IRQ subsystem */
> >> void xen_init_IRQ(void);
> >> #endif /* _XEN_EVENTS_H */
> >> --
> >> 1.9.3
> >>
>
> --
> Vitaly
Konrad Rzeszutek Wilk <[email protected]> writes:
> On Wed, Jul 16, 2014 at 11:37:10AM +0200, Vitaly Kuznetsov wrote:
>> Konrad Rzeszutek Wilk <[email protected]> writes:
>>
>> > On Tue, Jul 15, 2014 at 03:40:39PM +0200, Vitaly Kuznetsov wrote:
>> >> When kexec is being run PIRQs from Qemu-emulated devices are still
>> >> mapped to old event channels and new kernel has no information about
>> >> that. Trying to map them twice results in the following in Xen's dmesg:
>> >>
>> >> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
>> >> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
>> >> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
>> >> ...
>> >>
>> >> and the following in new kernel's dmesg:
>> >>
>> >> [ 92.286796] xen:events: Failed to obtain physical IRQ 4
>> >>
>> >> The result is that the new kernel doesn't recieve IRQs for Qemu-emulated
>> >> devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
>> >> when kexec was requested and on every kernel startup. We need to do this
>> >> twice to deal with the following issues:
>> >> - startup-time unmapping is required to make kdump work;
>> >> - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
>> >> - shutdown-time unmapping is required to make Qemu-emulated NICs work after
>> >> kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq
>> >> is being performed).
>> >
>> > How does this work when you boot the guest under Xen 4.4 where the FIFO events
>> > are used? Does it still work correctly?
>>
>> Thanks for pointing that out! I've checked and it doesn't. However
>> patches make no difference - guest kernel gets stuck on boot with and
>> without them. Will try to investigate...
>
> I think for FIFO events we can't do much right now - it would need some
> new hypercalls to de-allocate or such.
>
Yeah, you're probably right. I tried wrapping evtchn_fifo_destroy() into
'EVTCHNOP_fifo_destroy' hypercall but it seems some other actions are
required as well..
> But I was thinking that your code logic could just return out when
> it detects that it is running with FIFO events (with a TODO comment)
> - and also spit out some information to this effect?
Sure, having TODO here is a good idea.
>
> Say: "Use xen.fifo=0 in your launching kernel"
s,xen.fifo,xen.fifo_events,
>
> (don't know the right name for the kernel in which you do 'kexec -e' in ?
> Is that launching? Original? Bootstrap kernel?)
Yes, if under Xen-4.4 I boot original kernel with "xen.fifo_events=0"
I'm able to do kexec with "xen.fifo_events=0" and even without it (but
only once :-). Once kernel is booted with FIFO-based event channels
enabled no kexec is possible, new kernel gets stuck (I guess vcpuop
timer doesn't work..). My patch series brings no difference here..
Thanks,
>
>>
>> >
>> > Thanks.
>> >>
>> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> >> ---
>> >> arch/x86/xen/smp.c | 1 +
>> >> drivers/xen/events/events_base.c | 76 ++++++++++++++++++++++++++++++++++++++++
>> >> include/xen/events.h | 3 ++
>> >> 3 files changed, 80 insertions(+)
>> >>
>> >> diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
>> >> index 35dcf39..e2b4deb 100644
>> >> --- a/arch/x86/xen/smp.c
>> >> +++ b/arch/x86/xen/smp.c
>> >> @@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
>> >> #ifdef CONFIG_KEXEC
>> >> if (!kexec_in_progress)
>> >> return;
>> >> + xen_unmap_all_pirqs();
>> >> #endif
>> >> }
>> >>
>> >> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
>> >> index c919d3d..7701c7f 100644
>> >> --- a/drivers/xen/events/events_base.c
>> >> +++ b/drivers/xen/events/events_base.c
>> >> @@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
>> >> static bool fifo_events = true;
>> >> module_param(fifo_events, bool, 0);
>> >>
>> >> +void xen_unmap_all_pirqs(void)
>> >> +{
>> >> + int pirq, rc, gsi, irq, evtchn;
>> >> + struct physdev_unmap_pirq unmap_irq;
>> >> + struct irq_info *info;
>> >> + struct evtchn_close close;
>> >> +
>> >> + mutex_lock(&irq_mapping_update_lock);
>> >> +
>> >> + list_for_each_entry(info, &xen_irq_list_head, list) {
>> >> + if (info->type != IRQT_PIRQ)
>> >> + continue;
>> >> +
>> >> + pirq = info->u.pirq.pirq;
>> >> + gsi = info->u.pirq.gsi;
>> >> + evtchn = info->evtchn;
>> >> + irq = info->irq;
>> >> +
>> >> + pr_debug("unmapping pirq gsi=%d pirq=%d irq=%d evtchn=%d\n",
>> >> + gsi, pirq, irq, evtchn);
>> >> +
>> >> + if (evtchn > 0) {
>> >> + close.port = evtchn;
>> >> + if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
>> >> + &close) != 0)
>> >> + pr_warn("close evtchn %d failed\n", evtchn);
>> >> + }
>> >> +
>> >> + unmap_irq.pirq = pirq;
>> >> + unmap_irq.domid = DOMID_SELF;
>> >> +
>> >> + rc = HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq, &unmap_irq);
>> >> + if (rc)
>> >> + pr_warn("unmap pirq failed gsi=%d pirq=%d irq=%d rc=%d\n",
>> >> + gsi, pirq, irq, rc);
>> >> + }
>> >> +
>> >> + mutex_unlock(&irq_mapping_update_lock);
>> >> +}
>> >> +EXPORT_SYMBOL_GPL(xen_unmap_all_pirqs);
>> >
>> > Why the EXPORT? Is this used by modules?
>> >> +
>> >> +static void xen_startup_unmap_pirqs(void)
>> >> +{
>> >> + struct evtchn_status status;
>> >> + int port, rc = -ENOENT;
>> >> + struct physdev_unmap_pirq unmap_irq;
>> >> + struct evtchn_close close;
>> >> +
>> >> + memset(&status, 0, sizeof(status));
>> >> + for (port = 0; port < xen_evtchn_max_channels(); port++) {
>> >> + status.dom = DOMID_SELF;
>> >> + status.port = port;
>> >> + rc = HYPERVISOR_event_channel_op(EVTCHNOP_status, &status);
>> >> + if (rc < 0)
>> >> + continue;
>> >> + if (status.status == EVTCHNSTAT_pirq) {
>> >> + close.port = port;
>> >> + if (HYPERVISOR_event_channel_op(EVTCHNOP_close,
>> >> + &close) != 0)
>> >> + pr_warn("xen: failed to close evtchn %d\n",
>> >> + port);
>> >> + unmap_irq.pirq = status.u.pirq;
>> >> + unmap_irq.domid = DOMID_SELF;
>> >> + pr_warn("xen: unmapping previously mapped pirq %d\n",
>> >> + unmap_irq.pirq);
>> >> + if (HYPERVISOR_physdev_op(PHYSDEVOP_unmap_pirq,
>> >> + &unmap_irq) != 0)
>> >> + pr_warn("xen: failed to unmap pirq %d\n",
>> >> + unmap_irq.pirq);
>> >> + }
>> >> + }
>> >> +}
>> >> +
>> >> +
>> >> void __init xen_init_IRQ(void)
>> >> {
>> >> int ret = -EINVAL;
>> >> @@ -1671,6 +1745,8 @@ void __init xen_init_IRQ(void)
>> >> xen_callback_vector();
>> >>
>> >> if (xen_hvm_domain()) {
>> >> + xen_startup_unmap_pirqs();
>> >> +
>> >> native_init_IRQ();
>> >> /* pci_xen_hvm_init must be called after native_init_IRQ so that
>> >> * __acpi_register_gsi can point at the right function */
>> >> diff --git a/include/xen/events.h b/include/xen/events.h
>> >> index 8bee7a7..3f9f428 100644
>> >> --- a/include/xen/events.h
>> >> +++ b/include/xen/events.h
>> >> @@ -122,6 +122,9 @@ int xen_irq_from_gsi(unsigned gsi);
>> >> /* Determine whether to ignore this IRQ if it is passed to a guest. */
>> >> int xen_test_irq_shared(int irq);
>> >>
>> >> +/* Unmap all PIRQs and close event channels */
>> >> +void xen_unmap_all_pirqs(void);
>> >> +
>> >> /* initialize Xen IRQ subsystem */
>> >> void xen_init_IRQ(void);
>> >> #endif /* _XEN_EVENTS_H */
>> >> --
>> >> 1.9.3
>> >>
>>
>> --
>> Vitaly
--
Vitaly
Konrad Rzeszutek Wilk <[email protected]> writes:
> On Wed, Jul 16, 2014 at 11:01:55AM +0200, Vitaly Kuznetsov wrote:
>> Konrad Rzeszutek Wilk <[email protected]> writes:
>>
>> > On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:
>> >> When kexec was peformed MSI IRQs for passthrough-ed devices were already
>> >> mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq()
>> >> fails as we have no IRQ mapping information for that. Requesting for new
>> >> mapping with __write_msi_msg() does not result in MSI IRQ being remapped so
>> >> we don't recieve these IRQs.
>> >
>> > receive
>> >
>>
>> Thanks for your comments!
>
> Thank you for quick turnaround with the answers!
>>
>> > How come '__write_msi_msg' does not result in new MSI IRQs?
>> >
>>
>> Actually that was the hidden question in my RFC :-)
>>
>> Let me describe what I see. When normal boot is performed we have the
>> following in xen_hvm_setup_msi_irqs():
>>
>> __read_msi_msg()
>> pirq -> 0
>>
>> then we allocate new pirq with
>> pirq = xen_allocate_pirq_msi()
>> pirq -> 54
>>
>> and we have the following mapping:
>> xen: msi --> pirq=54 --> irq=72
>>
>> in 'xl debug-keys i':
>> (XEN) IRQ: 29 affinity:04 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(----),
>>
>> After kexec we see the following:
>> __read_msi_msg()
>> pirq -> 54
>>
>> but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:
>> pirq = xen_allocate_pirq_msi()
>> pirq -> 55
>>
>> and we have the following mapping:
>> xen: msi --> pirq=55 --> irq=75
>>
>> However (afaict) mapping in xen wasn't updated:
>>
>> in 'xl debug-keys i':
>> (XEN) IRQ: 29 affinity:02 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(--M-),
>
> I am wondering if that is related to in QEMU traditional:
>
> qemu-xen-trad: free all the pirqs for msi/msix when driver unloads
>
> (which in the upstream QEMU is 1d4fd4f0e2fc5dcae0c60e00cc9af95f52988050)
>
> If you have that patch in, is the PIRQ value correctly updated?
>
Thanks, that really works! I tested both kexec -e / kdump cases. I'm
wondering if we although need my commit to workaround non-fixed qemus?
>>
>> > Is it fair to state that your code ends up reading the MSI IRQ (PIRQ)
>> > from the device and updating the internal PIRQ<->IRQ code to match
>> > with the reality?
>> >
>>
>> Yea, 'always trust the device'.
>>
>> >>
>> >> RFC: I wasn't able to understand why commit af42b8d1 which introduced
>> >> xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that instead
>> >> of checking pirq > 0 as if the mapping was already done (and we have pirq>0 here)
>> >> we don't need to request for a new pirq. We're loosing existing PIRQ and I'm also
>> >> not sure when __write_msi_msg() with new PIRQ will result in new mapping.
>> >
>> > We don't request a new pirq. We end up returning before we call xen_allocate_pirq_msi.
>> > At least that is how the commit you mentioned worked.
>> >
>>
>> I meant to say that in case we have pirq > 0 from __read_msi_msg() but
>> xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do
>> xen_allocate_pirq_msi() which brings us new pirq.
>>
>> > In regards to why using 'xen_irq_from_pirq' instead of just checking the PIRQ - is
>> > that we might be called twice by a buggy driver. As such we want to check
>> > our PIRQ<->IRQ to figure this out.
>>
>> But if we're called twice we'll see the same pirq, right? Or there are
>
> Good point.
>> some cases when we see 'crap' instead of pirq here?
>
> For PCI passthrough devices they will be zero until they are enabled.
> But I am not sure about the emulated devices, such as e1000 or such, which
> would also go through this path (I think - do we have MSI devices that
> we emulate in QEMU?)
AFAICT emulated e1000 doesn't use MSI (at least with qemu-tradidtional)
and with my patch series it works after kexec.
>
>>
>> I think it would be nice to use the same pirq after kexec instead of
>> allocating a new one even in case we can make remapping work.
>
> I concur.
>
> Stefano, do you recall why you used xen_irq_from_pirq instead of just
> trusting the 'pirq' value? Was it to workaround broken QEMU?
>
>>
>> Thanks for your comments again!
>>
>> >>
>> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> >> ---
>> >> arch/x86/pci/xen.c | 3 +--
>> >> 1 file changed, 1 insertion(+), 2 deletions(-)
>> >>
>> >> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>> >> index 905956f..685e8f1 100644
>> >> --- a/arch/x86/pci/xen.c
>> >> +++ b/arch/x86/pci/xen.c
>> >> @@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>> >> __read_msi_msg(msidesc, &msg);
>> >> pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
>> >> ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
>> >> - if (msg.data != XEN_PIRQ_MSI_DATA ||
>> >> - xen_irq_from_pirq(pirq) < 0) {
>> >> + if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
>> >> pirq = xen_allocate_pirq_msi(dev, msidesc);
>> >> if (pirq < 0) {
>> >> irq = -ENODEV;
>> >> --
>> >> 1.9.3
>> >>
>>
>> --
>> Vitaly
--
Vitaly
On Wed, Jul 16, 2014 at 07:20:39PM +0200, Vitaly Kuznetsov wrote:
> Konrad Rzeszutek Wilk <[email protected]> writes:
>
> > On Wed, Jul 16, 2014 at 11:01:55AM +0200, Vitaly Kuznetsov wrote:
> >> Konrad Rzeszutek Wilk <[email protected]> writes:
> >>
> >> > On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:
> >> >> When kexec was peformed MSI IRQs for passthrough-ed devices were already
> >> >> mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq()
> >> >> fails as we have no IRQ mapping information for that. Requesting for new
> >> >> mapping with __write_msi_msg() does not result in MSI IRQ being remapped so
> >> >> we don't recieve these IRQs.
> >> >
> >> > receive
> >> >
> >>
> >> Thanks for your comments!
> >
> > Thank you for quick turnaround with the answers!
> >>
> >> > How come '__write_msi_msg' does not result in new MSI IRQs?
> >> >
> >>
> >> Actually that was the hidden question in my RFC :-)
> >>
> >> Let me describe what I see. When normal boot is performed we have the
> >> following in xen_hvm_setup_msi_irqs():
> >>
> >> __read_msi_msg()
> >> pirq -> 0
> >>
> >> then we allocate new pirq with
> >> pirq = xen_allocate_pirq_msi()
> >> pirq -> 54
> >>
> >> and we have the following mapping:
> >> xen: msi --> pirq=54 --> irq=72
> >>
> >> in 'xl debug-keys i':
> >> (XEN) IRQ: 29 affinity:04 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(----),
> >>
> >> After kexec we see the following:
> >> __read_msi_msg()
> >> pirq -> 54
> >>
> >> but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:
> >> pirq = xen_allocate_pirq_msi()
> >> pirq -> 55
> >>
> >> and we have the following mapping:
> >> xen: msi --> pirq=55 --> irq=75
> >>
> >> However (afaict) mapping in xen wasn't updated:
> >>
> >> in 'xl debug-keys i':
> >> (XEN) IRQ: 29 affinity:02 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(--M-),
> >
> > I am wondering if that is related to in QEMU traditional:
> >
> > qemu-xen-trad: free all the pirqs for msi/msix when driver unloads
> >
> > (which in the upstream QEMU is 1d4fd4f0e2fc5dcae0c60e00cc9af95f52988050)
> >
> > If you have that patch in, is the PIRQ value correctly updated?
> >
>
> Thanks, that really works! I tested both kexec -e / kdump cases. I'm
> wondering if we although need my commit to workaround non-fixed qemus?
Without your patch on older QEMU's with PCI passthrough we won't get
any more interrupts after we kexec in the guest right?
As in, this issue happens _only_ with PCI passthrough devices that use
MSI or MSI-X?
Still need to get Stefano's view on this.
>
> >>
> >> > Is it fair to state that your code ends up reading the MSI IRQ (PIRQ)
> >> > from the device and updating the internal PIRQ<->IRQ code to match
> >> > with the reality?
> >> >
> >>
> >> Yea, 'always trust the device'.
> >>
> >> >>
> >> >> RFC: I wasn't able to understand why commit af42b8d1 which introduced
> >> >> xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that instead
> >> >> of checking pirq > 0 as if the mapping was already done (and we have pirq>0 here)
> >> >> we don't need to request for a new pirq. We're loosing existing PIRQ and I'm also
> >> >> not sure when __write_msi_msg() with new PIRQ will result in new mapping.
> >> >
> >> > We don't request a new pirq. We end up returning before we call xen_allocate_pirq_msi.
> >> > At least that is how the commit you mentioned worked.
> >> >
> >>
> >> I meant to say that in case we have pirq > 0 from __read_msi_msg() but
> >> xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do
> >> xen_allocate_pirq_msi() which brings us new pirq.
> >>
> >> > In regards to why using 'xen_irq_from_pirq' instead of just checking the PIRQ - is
> >> > that we might be called twice by a buggy driver. As such we want to check
> >> > our PIRQ<->IRQ to figure this out.
> >>
> >> But if we're called twice we'll see the same pirq, right? Or there are
> >
> > Good point.
> >> some cases when we see 'crap' instead of pirq here?
> >
> > For PCI passthrough devices they will be zero until they are enabled.
> > But I am not sure about the emulated devices, such as e1000 or such, which
> > would also go through this path (I think - do we have MSI devices that
> > we emulate in QEMU?)
>
> AFAICT emulated e1000 doesn't use MSI (at least with qemu-tradidtional)
> and with my patch series it works after kexec.
>
> >
> >>
> >> I think it would be nice to use the same pirq after kexec instead of
> >> allocating a new one even in case we can make remapping work.
> >
> > I concur.
> >
> > Stefano, do you recall why you used xen_irq_from_pirq instead of just
> > trusting the 'pirq' value? Was it to workaround broken QEMU?
> >
> >>
> >> Thanks for your comments again!
> >>
> >> >>
> >> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> >> >> ---
> >> >> arch/x86/pci/xen.c | 3 +--
> >> >> 1 file changed, 1 insertion(+), 2 deletions(-)
> >> >>
> >> >> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
> >> >> index 905956f..685e8f1 100644
> >> >> --- a/arch/x86/pci/xen.c
> >> >> +++ b/arch/x86/pci/xen.c
> >> >> @@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> >> >> __read_msi_msg(msidesc, &msg);
> >> >> pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
> >> >> ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
> >> >> - if (msg.data != XEN_PIRQ_MSI_DATA ||
> >> >> - xen_irq_from_pirq(pirq) < 0) {
> >> >> + if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
> >> >> pirq = xen_allocate_pirq_msi(dev, msidesc);
> >> >> if (pirq < 0) {
> >> >> irq = -ENODEV;
> >> >> --
> >> >> 1.9.3
> >> >>
> >>
> >> --
> >> Vitaly
>
> --
> Vitaly
Konrad Rzeszutek Wilk <[email protected]> writes:
> On Wed, Jul 16, 2014 at 07:20:39PM +0200, Vitaly Kuznetsov wrote:
>> Konrad Rzeszutek Wilk <[email protected]> writes:
>>
>> > On Wed, Jul 16, 2014 at 11:01:55AM +0200, Vitaly Kuznetsov wrote:
>> >> Konrad Rzeszutek Wilk <[email protected]> writes:
>> >>
>> >> > On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:
>> >> >> When kexec was peformed MSI IRQs for passthrough-ed devices were already
>> >> >> mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq()
>> >> >> fails as we have no IRQ mapping information for that. Requesting for new
>> >> >> mapping with __write_msi_msg() does not result in MSI IRQ being remapped so
>> >> >> we don't recieve these IRQs.
>> >> >
>> >> > receive
>> >> >
>> >>
>> >> Thanks for your comments!
>> >
>> > Thank you for quick turnaround with the answers!
>> >>
>> >> > How come '__write_msi_msg' does not result in new MSI IRQs?
>> >> >
>> >>
>> >> Actually that was the hidden question in my RFC :-)
>> >>
>> >> Let me describe what I see. When normal boot is performed we have the
>> >> following in xen_hvm_setup_msi_irqs():
>> >>
>> >> __read_msi_msg()
>> >> pirq -> 0
>> >>
>> >> then we allocate new pirq with
>> >> pirq = xen_allocate_pirq_msi()
>> >> pirq -> 54
>> >>
>> >> and we have the following mapping:
>> >> xen: msi --> pirq=54 --> irq=72
>> >>
>> >> in 'xl debug-keys i':
>> >> (XEN) IRQ: 29 affinity:04 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(----),
>> >>
>> >> After kexec we see the following:
>> >> __read_msi_msg()
>> >> pirq -> 54
>> >>
>> >> but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:
>> >> pirq = xen_allocate_pirq_msi()
>> >> pirq -> 55
>> >>
>> >> and we have the following mapping:
>> >> xen: msi --> pirq=55 --> irq=75
>> >>
>> >> However (afaict) mapping in xen wasn't updated:
>> >>
>> >> in 'xl debug-keys i':
>> >> (XEN) IRQ: 29 affinity:02 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(--M-),
>> >
>> > I am wondering if that is related to in QEMU traditional:
>> >
>> > qemu-xen-trad: free all the pirqs for msi/msix when driver unloads
>> >
>> > (which in the upstream QEMU is 1d4fd4f0e2fc5dcae0c60e00cc9af95f52988050)
>> >
>> > If you have that patch in, is the PIRQ value correctly updated?
>> >
>>
>> Thanks, that really works! I tested both kexec -e / kdump cases. I'm
>> wondering if we although need my commit to workaround non-fixed qemus?
>
> Without your patch on older QEMU's with PCI passthrough we won't get
> any more interrupts after we kexec in the guest right?
>
Correct.
> As in, this issue happens _only_ with PCI passthrough devices that use
> MSI or MSI-X?
I haven't tested MSI-X but in theory yes, only MSI and MSI-X
passthrough-ed devices are affected.
>
> Still need to get Stefano's view on this.
>
Sure, thanks!
>>
>> >>
>> >> > Is it fair to state that your code ends up reading the MSI IRQ (PIRQ)
>> >> > from the device and updating the internal PIRQ<->IRQ code to match
>> >> > with the reality?
>> >> >
>> >>
>> >> Yea, 'always trust the device'.
>> >>
>> >> >>
>> >> >> RFC: I wasn't able to understand why commit af42b8d1 which introduced
>> >> >> xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that instead
>> >> >> of checking pirq > 0 as if the mapping was already done (and we have pirq>0 here)
>> >> >> we don't need to request for a new pirq. We're loosing existing PIRQ and I'm also
>> >> >> not sure when __write_msi_msg() with new PIRQ will result in new mapping.
>> >> >
>> >> > We don't request a new pirq. We end up returning before we call xen_allocate_pirq_msi.
>> >> > At least that is how the commit you mentioned worked.
>> >> >
>> >>
>> >> I meant to say that in case we have pirq > 0 from __read_msi_msg() but
>> >> xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do
>> >> xen_allocate_pirq_msi() which brings us new pirq.
>> >>
>> >> > In regards to why using 'xen_irq_from_pirq' instead of just checking the PIRQ - is
>> >> > that we might be called twice by a buggy driver. As such we want to check
>> >> > our PIRQ<->IRQ to figure this out.
>> >>
>> >> But if we're called twice we'll see the same pirq, right? Or there are
>> >
>> > Good point.
>> >> some cases when we see 'crap' instead of pirq here?
>> >
>> > For PCI passthrough devices they will be zero until they are enabled.
>> > But I am not sure about the emulated devices, such as e1000 or such, which
>> > would also go through this path (I think - do we have MSI devices that
>> > we emulate in QEMU?)
>>
>> AFAICT emulated e1000 doesn't use MSI (at least with qemu-tradidtional)
>> and with my patch series it works after kexec.
>>
>> >
>> >>
>> >> I think it would be nice to use the same pirq after kexec instead of
>> >> allocating a new one even in case we can make remapping work.
>> >
>> > I concur.
>> >
>> > Stefano, do you recall why you used xen_irq_from_pirq instead of just
>> > trusting the 'pirq' value? Was it to workaround broken QEMU?
>> >
>> >>
>> >> Thanks for your comments again!
>> >>
>> >> >>
>> >> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> >> >> ---
>> >> >> arch/x86/pci/xen.c | 3 +--
>> >> >> 1 file changed, 1 insertion(+), 2 deletions(-)
>> >> >>
>> >> >> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>> >> >> index 905956f..685e8f1 100644
>> >> >> --- a/arch/x86/pci/xen.c
>> >> >> +++ b/arch/x86/pci/xen.c
>> >> >> @@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>> >> >> __read_msi_msg(msidesc, &msg);
>> >> >> pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
>> >> >> ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
>> >> >> - if (msg.data != XEN_PIRQ_MSI_DATA ||
>> >> >> - xen_irq_from_pirq(pirq) < 0) {
>> >> >> + if (msg.data != XEN_PIRQ_MSI_DATA || pirq <= 0) {
>> >> >> pirq = xen_allocate_pirq_msi(dev, msidesc);
>> >> >> if (pirq < 0) {
>> >> >> irq = -ENODEV;
>> >> >> --
>> >> >> 1.9.3
>> >> >>
>> >>
>> >> --
>> >> Vitaly
>>
>> --
>> Vitaly
--
Vitaly
Konrad Rzeszutek Wilk <[email protected]> writes:
> On Tue, Jul 15, 2014 at 05:43:17PM +0200, Vitaly Kuznetsov wrote:
>> Konrad Rzeszutek Wilk <[email protected]> writes:
>>
>> > On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote:
>> >> From: Olaf Hering <[email protected]>
>> >>
>> >> This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
>> >> ("xen PVonHVM: move shared_info to MMIO before kexec").
>> >>
>> >> Currently kexec in a PVonHVM guest fails with a triple fault because the
>> >> new kernel overwrites the shared info page. The exact failure depends on
>> >> the size of the kernel image. This patch moves the pfn from RAM into an
>> >> E820 reserved memory area.
>> >>
>> >> The pfn containing the shared_info is located somewhere in RAM. This will
>> >> cause trouble if the current kernel is doing a kexec boot into a new
>> >> kernel. The new kernel (and its startup code) can not know where the pfn
>> >> is, so it can not reserve the page. The hypervisor will continue to update
>> >> the pfn, and as a result memory corruption occours in the new kernel.
>> >>
>> >> The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
>> >> E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
>> >> starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
>> >> will usually not allocate areas up to FE700000, so FE700000 is expected
>> >> to work also with older toolstacks.
>> >>
>> >> In Xen3 there is no reserved area at a fixed location. If the guest is
>> >> started on such old hosts the shared_info page will be placed in RAM. As
>> >> a result kexec can not be used.
>> >
>> > So this looks right, the one thing that we really need to check
>> > is e9daff24a266307943457086533041bd971d0ef9
>> >
>> > This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f.
>> >
>> > We are doing this b/c on 32-bit PVonHVM with older hypervisors
>> > (Xen 4.1) it ends up bothing up the start_info. This is bad b/c
>> > we use it for the time keeping, and the timekeeping code loops
>> > forever - as the version field never changes. Olaf says to
>> > revert it, so lets do that.
>> >
>> > Could you kindly test that the migration on 32-bit PVHVM guests
>> > on older hypervisors works?
>> >
>>
>> Sure, will do! Was there anything special about the setup or any 32-bit
>> pvhvm guest migration (on 64-bit hypervisor I suppose) would fail? I can
>> try checking both current and old versions to make sure the issue was
>> acutually fixed.
>
> Nothing fancy (well, it was SMP, so 4 CPUs). I did the 'save'/'restore' and the
> guest would not restore properly.
>
The symptoms you saw were: after the resume guest appears to be frozen,
all vcpus except for the first one spin at 100%? I was able to reproduce
that on old patch version and everything works fine with your fix
(calling xen_hvm_set_shared_info() in addition to
xen_hvm_connect_shared_info() on resume). We're probably safe to apply
it now, thanks!
However I'd like to suggest we remove '__init' from
xen_hvm_set_shared_info() as now we call it on resume.
> Thank you!
>>
>> >>
>> >> Signed-off-by: Olaf Hering <[email protected]>
>> >> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
>> >> (cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f)
>> >>
>> >> [On resume we need to reset the xen_vcpu_info, which the original
>> >> patch did not do]
>> >>
>> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> >> ---
>> >> arch/x86/xen/enlighten.c | 74 ++++++++++++++++++++++++++++++++++++------------
>> >> arch/x86/xen/suspend.c | 2 +-
>> >> arch/x86/xen/xen-ops.h | 2 +-
>> >> 3 files changed, 58 insertions(+), 20 deletions(-)
>> >>
>> >> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> >> index ffb101e..a11af62 100644
>> >> --- a/arch/x86/xen/enlighten.c
>> >> +++ b/arch/x86/xen/enlighten.c
>> >> @@ -1726,23 +1726,29 @@ asmlinkage __visible void __init xen_start_kernel(void)
>> >> #endif
>> >> }
>> >>
>> >> -void __ref xen_hvm_init_shared_info(void)
>> >> +#ifdef CONFIG_XEN_PVHVM
>> >> +#define HVM_SHARED_INFO_ADDR 0xFE700000UL
>> >> +static struct shared_info *xen_hvm_shared_info;
>> >> +static unsigned long xen_hvm_sip_phys;
>> >> +static int xen_major, xen_minor;
>> >> +
>> >> +static void xen_hvm_connect_shared_info(unsigned long pfn)
>> >> {
>> >> - int cpu;
>> >> struct xen_add_to_physmap xatp;
>> >> - static struct shared_info *shared_info_page = 0;
>> >>
>> >> - if (!shared_info_page)
>> >> - shared_info_page = (struct shared_info *)
>> >> - extend_brk(PAGE_SIZE, PAGE_SIZE);
>> >> xatp.domid = DOMID_SELF;
>> >> xatp.idx = 0;
>> >> xatp.space = XENMAPSPACE_shared_info;
>> >> - xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
>> >> + xatp.gpfn = pfn;
>> >> if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
>> >> BUG();
>> >>
>> >> - HYPERVISOR_shared_info = (struct shared_info *)shared_info_page;
>> >> +}
>> >> +static void __init xen_hvm_set_shared_info(struct shared_info *sip)
>> >> +{
>> >> + int cpu;
>> >> +
>> >> + HYPERVISOR_shared_info = sip;
>> >>
>> >> /* xen_vcpu is a pointer to the vcpu_info struct in the shared_info
>> >> * page, we use it in the event channel upcall and in some pvclock
>> >> @@ -1760,20 +1766,39 @@ void __ref xen_hvm_init_shared_info(void)
>> >> }
>> >> }
>> >>
>> >> -#ifdef CONFIG_XEN_PVHVM
>> >> +/* Reconnect the shared_info pfn to a (new) mfn */
>> >> +void xen_hvm_resume_shared_info(void)
>> >> +{
>> >> + xen_hvm_connect_shared_info(xen_hvm_sip_phys >> PAGE_SHIFT);
>> >> + xen_hvm_set_shared_info(xen_hvm_shared_info);
>> >> +}
>> >> +
>> >> +/* Xen tools prior to Xen 4 do not provide a E820_Reserved area for guest usage.
>> >> + * On these old tools the shared info page will be placed in E820_Ram.
>> >> + * Xen 4 provides a E820_Reserved area at 0xFC000000, and this code expects
>> >> + * that nothing is mapped up to HVM_SHARED_INFO_ADDR.
>> >> + * Xen 4.3+ provides an explicit 1MB area at HVM_SHARED_INFO_ADDR which is used
>> >> + * here for the shared info page. */
>> >> +static void __init xen_hvm_init_shared_info(void)
>> >> +{
>> >> + if (xen_major < 4) {
>> >> + xen_hvm_shared_info = extend_brk(PAGE_SIZE, PAGE_SIZE);
>> >> + xen_hvm_sip_phys = __pa(xen_hvm_shared_info);
>> >> + } else {
>> >> + xen_hvm_sip_phys = HVM_SHARED_INFO_ADDR;
>> >> + set_fixmap(FIX_PARAVIRT_BOOTMAP, xen_hvm_sip_phys);
>> >> + xen_hvm_shared_info =
>> >> + (struct shared_info *)fix_to_virt(FIX_PARAVIRT_BOOTMAP);
>> >> + }
>> >> + xen_hvm_resume_shared_info();
>> >> +}
>> >> +
>> >> static void __init init_hvm_pv_info(void)
>> >> {
>> >> - int major, minor;
>> >> - uint32_t eax, ebx, ecx, edx, pages, msr, base;
>> >> + uint32_t ecx, edx, pages, msr, base;
>> >> u64 pfn;
>> >>
>> >> base = xen_cpuid_base();
>> >> - cpuid(base + 1, &eax, &ebx, &ecx, &edx);
>> >> -
>> >> - major = eax >> 16;
>> >> - minor = eax & 0xffff;
>> >> - printk(KERN_INFO "Xen version %d.%d.\n", major, minor);
>> >> -
>> >> cpuid(base + 2, &pages, &msr, &ecx, &edx);
>> >>
>> >> pfn = __pa(hypercall_page);
>> >> @@ -1828,10 +1853,23 @@ static void __init xen_hvm_guest_init(void)
>> >>
>> >> static uint32_t __init xen_hvm_platform(void)
>> >> {
>> >> + uint32_t eax, ebx, ecx, edx, base;
>> >> +
>> >> if (xen_pv_domain())
>> >> return 0;
>> >>
>> >> - return xen_cpuid_base();
>> >> + base = xen_cpuid_base();
>> >> + if (!base)
>> >> + return 0;
>> >> +
>> >> + cpuid(base + 1, &eax, &ebx, &ecx, &edx);
>> >> +
>> >> + xen_major = eax >> 16;
>> >> + xen_minor = eax & 0xffff;
>> >> +
>> >> + printk(KERN_INFO "Xen version %d.%d.\n", xen_major, xen_minor);
>> >> +
>> >> + return 1;
>> >> }
>> >>
>> >> bool xen_hvm_need_lapic(void)
>> >> diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
>> >> index c4df9db..8d6793e 100644
>> >> --- a/arch/x86/xen/suspend.c
>> >> +++ b/arch/x86/xen/suspend.c
>> >> @@ -32,7 +32,7 @@ static void xen_hvm_post_suspend(int suspend_cancelled)
>> >> {
>> >> #ifdef CONFIG_XEN_PVHVM
>> >> int cpu;
>> >> - xen_hvm_init_shared_info();
>> >> + xen_hvm_resume_shared_info();
>> >> xen_callback_vector();
>> >> xen_unplug_emulated_devices();
>> >> if (xen_feature(XENFEAT_hvm_safe_pvclock)) {
>> >> diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
>> >> index 97d8765..d083e82 100644
>> >> --- a/arch/x86/xen/xen-ops.h
>> >> +++ b/arch/x86/xen/xen-ops.h
>> >> @@ -43,7 +43,7 @@ void xen_enable_syscall(void);
>> >> void xen_vcpu_restore(void);
>> >>
>> >> void xen_callback_vector(void);
>> >> -void xen_hvm_init_shared_info(void);
>> >> +void xen_hvm_resume_shared_info(void);
>> >> void xen_unplug_emulated_devices(void);
>> >>
>> >> void __init xen_build_dynamic_phys_to_machine(void);
>> >> --
>> >> 1.9.3
>> >>
>>
>> --
>> Vitaly
--
Vitaly
On Fri, Jul 18, 2014 at 01:05:46PM +0200, Vitaly Kuznetsov wrote:
> Konrad Rzeszutek Wilk <[email protected]> writes:
>
> > On Tue, Jul 15, 2014 at 05:43:17PM +0200, Vitaly Kuznetsov wrote:
> >> Konrad Rzeszutek Wilk <[email protected]> writes:
> >>
> >> > On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote:
> >> >> From: Olaf Hering <[email protected]>
> >> >>
> >> >> This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
> >> >> ("xen PVonHVM: move shared_info to MMIO before kexec").
> >> >>
> >> >> Currently kexec in a PVonHVM guest fails with a triple fault because the
> >> >> new kernel overwrites the shared info page. The exact failure depends on
> >> >> the size of the kernel image. This patch moves the pfn from RAM into an
> >> >> E820 reserved memory area.
> >> >>
> >> >> The pfn containing the shared_info is located somewhere in RAM. This will
> >> >> cause trouble if the current kernel is doing a kexec boot into a new
> >> >> kernel. The new kernel (and its startup code) can not know where the pfn
> >> >> is, so it can not reserve the page. The hypervisor will continue to update
> >> >> the pfn, and as a result memory corruption occours in the new kernel.
> >> >>
> >> >> The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
> >> >> E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
> >> >> starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
> >> >> will usually not allocate areas up to FE700000, so FE700000 is expected
> >> >> to work also with older toolstacks.
> >> >>
> >> >> In Xen3 there is no reserved area at a fixed location. If the guest is
> >> >> started on such old hosts the shared_info page will be placed in RAM. As
> >> >> a result kexec can not be used.
> >> >
> >> > So this looks right, the one thing that we really need to check
> >> > is e9daff24a266307943457086533041bd971d0ef9
> >> >
> >> > This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f.
> >> >
> >> > We are doing this b/c on 32-bit PVonHVM with older hypervisors
> >> > (Xen 4.1) it ends up bothing up the start_info. This is bad b/c
> >> > we use it for the time keeping, and the timekeeping code loops
> >> > forever - as the version field never changes. Olaf says to
> >> > revert it, so lets do that.
> >> >
> >> > Could you kindly test that the migration on 32-bit PVHVM guests
> >> > on older hypervisors works?
> >> >
> >>
> >> Sure, will do! Was there anything special about the setup or any 32-bit
> >> pvhvm guest migration (on 64-bit hypervisor I suppose) would fail? I can
> >> try checking both current and old versions to make sure the issue was
> >> acutually fixed.
> >
> > Nothing fancy (well, it was SMP, so 4 CPUs). I did the 'save'/'restore' and the
> > guest would not restore properly.
> >
>
> The symptoms you saw were: after the resume guest appears to be frozen,
> all vcpus except for the first one spin at 100%? I was able to reproduce
Yes, that is it.
> that on old patch version and everything works fine with your fix
> (calling xen_hvm_set_shared_info() in addition to
> xen_hvm_connect_shared_info() on resume). We're probably safe to apply
> it now, thanks!
Woot! Could you include that tidbit of information in
the commit please?
>
> However I'd like to suggest we remove '__init' from
> xen_hvm_set_shared_info() as now we call it on resume.
Good idea.
Lets wait until Stefano responds (for the MSI PIRQ one), and
if he does not have anything special to say, then repost the
whole patchset including this tiny __init fix and the updated
comment?
Konrad Rzeszutek Wilk <[email protected]> writes:
> On Fri, Jul 18, 2014 at 01:05:46PM +0200, Vitaly Kuznetsov wrote:
>> Konrad Rzeszutek Wilk <[email protected]> writes:
>>
>> > On Tue, Jul 15, 2014 at 05:43:17PM +0200, Vitaly Kuznetsov wrote:
>> >> Konrad Rzeszutek Wilk <[email protected]> writes:
>> >>
>> >> > On Tue, Jul 15, 2014 at 03:40:37PM +0200, Vitaly Kuznetsov wrote:
>> >> >> From: Olaf Hering <[email protected]>
>> >> >>
>> >> >> This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
>> >> >> ("xen PVonHVM: move shared_info to MMIO before kexec").
>> >> >>
>> >> >> Currently kexec in a PVonHVM guest fails with a triple fault because the
>> >> >> new kernel overwrites the shared info page. The exact failure depends on
>> >> >> the size of the kernel image. This patch moves the pfn from RAM into an
>> >> >> E820 reserved memory area.
>> >> >>
>> >> >> The pfn containing the shared_info is located somewhere in RAM. This will
>> >> >> cause trouble if the current kernel is doing a kexec boot into a new
>> >> >> kernel. The new kernel (and its startup code) can not know where the pfn
>> >> >> is, so it can not reserve the page. The hypervisor will continue to update
>> >> >> the pfn, and as a result memory corruption occours in the new kernel.
>> >> >>
>> >> >> The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
>> >> >> E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
>> >> >> starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
>> >> >> will usually not allocate areas up to FE700000, so FE700000 is expected
>> >> >> to work also with older toolstacks.
>> >> >>
>> >> >> In Xen3 there is no reserved area at a fixed location. If the guest is
>> >> >> started on such old hosts the shared_info page will be placed in RAM. As
>> >> >> a result kexec can not be used.
>> >> >
>> >> > So this looks right, the one thing that we really need to check
>> >> > is e9daff24a266307943457086533041bd971d0ef9
>> >> >
>> >> > This reverts commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f.
>> >> >
>> >> > We are doing this b/c on 32-bit PVonHVM with older hypervisors
>> >> > (Xen 4.1) it ends up bothing up the start_info. This is bad b/c
>> >> > we use it for the time keeping, and the timekeeping code loops
>> >> > forever - as the version field never changes. Olaf says to
>> >> > revert it, so lets do that.
>> >> >
>> >> > Could you kindly test that the migration on 32-bit PVHVM guests
>> >> > on older hypervisors works?
>> >> >
>> >>
>> >> Sure, will do! Was there anything special about the setup or any 32-bit
>> >> pvhvm guest migration (on 64-bit hypervisor I suppose) would fail? I can
>> >> try checking both current and old versions to make sure the issue was
>> >> acutually fixed.
>> >
>> > Nothing fancy (well, it was SMP, so 4 CPUs). I did the 'save'/'restore' and the
>> > guest would not restore properly.
>> >
>>
>> The symptoms you saw were: after the resume guest appears to be frozen,
>> all vcpus except for the first one spin at 100%? I was able to reproduce
>
> Yes, that is it.
>> that on old patch version and everything works fine with your fix
>> (calling xen_hvm_set_shared_info() in addition to
>> xen_hvm_connect_shared_info() on resume). We're probably safe to apply
>> it now, thanks!
>
> Woot! Could you include that tidbit of information in
> the commit please?
>
Sure,
>>
>> However I'd like to suggest we remove '__init' from
>> xen_hvm_set_shared_info() as now we call it on resume.
>
> Good idea.
>
> Lets wait until Stefano responds (for the MSI PIRQ one), and
> if he does not have anything special to say, then repost the
> whole patchset including this tiny __init fix and the updated
> comment?
Deal :-) Please take a look at my '[PATCH RFC] evtchn: introduce
EVTCHNOP_fifo_destroy hypercall'. In case that works we can fix FIFO
case at the same time, no TODO required.
I'll be able to return to this work at the end of next week.
--
Vitaly
On Wed, 16 Jul 2014, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 16, 2014 at 11:01:55AM +0200, Vitaly Kuznetsov wrote:
> > Konrad Rzeszutek Wilk <[email protected]> writes:
> >
> > > On Tue, Jul 15, 2014 at 03:40:40PM +0200, Vitaly Kuznetsov wrote:
> > >> When kexec was peformed MSI IRQs for passthrough-ed devices were already
> > >> mapped and we see non-zero pirq extracted from MSI msg. xen_irq_from_pirq()
> > >> fails as we have no IRQ mapping information for that. Requesting for new
> > >> mapping with __write_msi_msg() does not result in MSI IRQ being remapped so
> > >> we don't recieve these IRQs.
> > >
> > > receive
> > >
> >
> > Thanks for your comments!
>
> Thank you for quick turnaround with the answers!
> >
> > > How come '__write_msi_msg' does not result in new MSI IRQs?
> > >
> >
> > Actually that was the hidden question in my RFC :-)
> >
> > Let me describe what I see. When normal boot is performed we have the
> > following in xen_hvm_setup_msi_irqs():
> >
> > __read_msi_msg()
> > pirq -> 0
> >
> > then we allocate new pirq with
> > pirq = xen_allocate_pirq_msi()
> > pirq -> 54
> >
> > and we have the following mapping:
> > xen: msi --> pirq=54 --> irq=72
> >
> > in 'xl debug-keys i':
> > (XEN) IRQ: 29 affinity:04 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(----),
> >
> > After kexec we see the following:
> > __read_msi_msg()
> > pirq -> 54
> >
> > but as xen_irq_from_pirq() fails we follow the same path allocating new pirq:
> > pirq = xen_allocate_pirq_msi()
> > pirq -> 55
> >
> > and we have the following mapping:
> > xen: msi --> pirq=55 --> irq=75
> >
> > However (afaict) mapping in xen wasn't updated:
> >
> > in 'xl debug-keys i':
> > (XEN) IRQ: 29 affinity:02 vec:b9 type=PCI-MSI status=00000030 in-flight=0 domain-list=7: 54(--M-),
>
> I am wondering if that is related to in QEMU traditional:
>
> qemu-xen-trad: free all the pirqs for msi/msix when driver unloads
>
> (which in the upstream QEMU is 1d4fd4f0e2fc5dcae0c60e00cc9af95f52988050)
>
> If you have that patch in, is the PIRQ value correctly updated?
>
> >
> > > Is it fair to state that your code ends up reading the MSI IRQ (PIRQ)
> > > from the device and updating the internal PIRQ<->IRQ code to match
> > > with the reality?
> > >
> >
> > Yea, 'always trust the device'.
> >
> > >>
> > >> RFC: I wasn't able to understand why commit af42b8d1 which introduced
> > >> xen_irq_from_pirq() check in xen_hvm_setup_msi_irqs() is checking that instead
> > >> of checking pirq > 0 as if the mapping was already done (and we have pirq>0 here)
> > >> we don't need to request for a new pirq. We're loosing existing PIRQ and I'm also
> > >> not sure when __write_msi_msg() with new PIRQ will result in new mapping.
> > >
> > > We don't request a new pirq. We end up returning before we call xen_allocate_pirq_msi.
> > > At least that is how the commit you mentioned worked.
> > >
> >
> > I meant to say that in case we have pirq > 0 from __read_msi_msg() but
> > xen_irq_from_pirq(pirq) fails (kexec-only case?) we always do
> > xen_allocate_pirq_msi() which brings us new pirq.
> >
> > > In regards to why using 'xen_irq_from_pirq' instead of just checking the PIRQ - is
> > > that we might be called twice by a buggy driver. As such we want to check
> > > our PIRQ<->IRQ to figure this out.
> >
> > But if we're called twice we'll see the same pirq, right? Or there are
>
> Good point.
> > some cases when we see 'crap' instead of pirq here?
>
> For PCI passthrough devices they will be zero until they are enabled.
> But I am not sure about the emulated devices, such as e1000 or such, which
> would also go through this path (I think - do we have MSI devices that
> we emulate in QEMU?)
>
> >
> > I think it would be nice to use the same pirq after kexec instead of
> > allocating a new one even in case we can make remapping work.
>
> I concur.
>
> Stefano, do you recall why you used xen_irq_from_pirq instead of just
> trusting the 'pirq' value? Was it to workaround broken QEMU?
If I recall correctly the problem is that pirq == 0 is a valid pirq
number. So the check pirq <= 0 is wrong.
Can we rely on the fact that msg.data is always 0 on first read? If so,
then we could simply:
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 905956f..d824743 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -231,8 +231,7 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
__read_msi_msg(msidesc, &msg);
pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
- if (msg.data != XEN_PIRQ_MSI_DATA ||
- xen_irq_from_pirq(pirq) < 0) {
+ if (msg.data != XEN_PIRQ_MSI_DATA) {
pirq = xen_allocate_pirq_msi(dev, msidesc);
if (pirq < 0) {
irq = -ENODEV;
On 15/07/14 14:40, Vitaly Kuznetsov wrote:
> With this patch series I'm trying to address several issues with kexec on pvhvm:
> - shared_info issue (1st patch, just sending Olaf's work with Konrad's fix)
> - create specific pvhvm shutdown handler for kexec (2nd patch)
> - GSI PIRQ issue (3rd patch, I'm pretty confident that it does the right thing)
> - MSI PIRQ issue (4th patch, and I'm not sure it doesn't break anything -> RFC)
>
> This patch series can be tested on single vCPU guest. We still have SMP issues with
> pvhvm guests and kexec which require additional fixes.
In addition to the fixes for multi-VCPU guests, what else remains?
What's the plan for handling granted pages?
I don't think we want to accept a partial solution unless the known
non-working configurations fail fast on kexec load.
David
On 15/07/14 14:40, Vitaly Kuznetsov wrote:
> From: Olaf Hering <[email protected]>
>
> This is a respin of 00e37bdb0113a98408de42db85be002f21dbffd3
> ("xen PVonHVM: move shared_info to MMIO before kexec").
>
> Currently kexec in a PVonHVM guest fails with a triple fault because the
> new kernel overwrites the shared info page. The exact failure depends on
> the size of the kernel image. This patch moves the pfn from RAM into an
> E820 reserved memory area.
>
> The pfn containing the shared_info is located somewhere in RAM. This will
> cause trouble if the current kernel is doing a kexec boot into a new
> kernel. The new kernel (and its startup code) can not know where the pfn
> is, so it can not reserve the page. The hypervisor will continue to update
> the pfn, and as a result memory corruption occours in the new kernel.
>
> The toolstack marks the memory area FC000000-FFFFFFFF as reserved in the
> E820 map. Within that range newer toolstacks (4.3+) will keep 1MB
> starting from FE700000 as reserved for guest use. Older Xen4 toolstacks
> will usually not allocate areas up to FE700000, so FE700000 is expected
> to work also with older toolstacks.
>
> In Xen3 there is no reserved area at a fixed location. If the guest is
> started on such old hosts the shared_info page will be placed in RAM. As
> a result kexec can not be used.
>
> Signed-off-by: Olaf Hering <[email protected]>
> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
> (cherry picked from commit 9d02b43dee0d7fb18dfb13a00915550b1a3daa9f)
Is this a useful commit to give? Isn't it one of Konrad's random trees?
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -1726,23 +1726,29 @@ asmlinkage __visible void __init xen_start_kernel(void)
> #endif
> }
>
> -void __ref xen_hvm_init_shared_info(void)
> +#ifdef CONFIG_XEN_PVHVM
> +#define HVM_SHARED_INFO_ADDR 0xFE700000UL
I would like clearer documentation for this magic address. Perhaps
referring to some toolstack documentation?
> @@ -1828,10 +1853,23 @@ static void __init xen_hvm_guest_init(void)
>
> static uint32_t __init xen_hvm_platform(void)
> {
> + uint32_t eax, ebx, ecx, edx, base;
> +
> if (xen_pv_domain())
> return 0;
>
> - return xen_cpuid_base();
> + base = xen_cpuid_base();
> + if (!base)
> + return 0;
> +
> + cpuid(base + 1, &eax, &ebx, &ecx, &edx);
> +
> + xen_major = eax >> 16;
> + xen_minor = eax & 0xffff;
> +
> + printk(KERN_INFO "Xen version %d.%d.\n", xen_major, xen_minor);
> +
> + return 1;
You appear to have changed to return value here? Did you mean to return
base here?
David
On 15/07/14 14:40, Vitaly Kuznetsov wrote:
> PVHVM guest requires special actions before kexec. Register specific
> xen_pvhvm_kexec_shutdown() handler for machine_ops.shutdown().
[...]
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -1833,6 +1833,12 @@ static struct notifier_block xen_hvm_cpu_notifier = {
> .notifier_call = xen_hvm_cpu_notify,
> };
>
> +static void xen_pvhvm_kexec_shutdown(void)
> +{
> + xen_kexec_shutdown();
I think it would be preferrable to have
if (kexec_in_progress)
xen_kexec_shutdown();
David
On 15/07/14 14:40, Vitaly Kuznetsov wrote:
> When kexec is being run PIRQs from Qemu-emulated devices are still
> mapped to old event channels and new kernel has no information about
> that. Trying to map them twice results in the following in Xen's dmesg:
>
> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
> ...
>
> and the following in new kernel's dmesg:
>
> [ 92.286796] xen:events: Failed to obtain physical IRQ 4
>
> The result is that the new kernel doesn't recieve IRQs for Qemu-emulated
> devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
> when kexec was requested and on every kernel startup. We need to do this
> twice to deal with the following issues:
> - startup-time unmapping is required to make kdump work;
> - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
> - shutdown-time unmapping is required to make Qemu-emulated NICs work after
> kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq
> is being performed).
I think this should be done only in one place -- just prior to exec'ing
the new kernel (including kdump kernels).
> --- a/arch/x86/xen/smp.c
> +++ b/arch/x86/xen/smp.c
> @@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
> #ifdef CONFIG_KEXEC
> if (!kexec_in_progress)
> return;
> + xen_unmap_all_pirqs();
> #endif
> }
>
> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> index c919d3d..7701c7f 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
> static bool fifo_events = true;
> module_param(fifo_events, bool, 0);
>
> +void xen_unmap_all_pirqs(void)
> +{
> + int pirq, rc, gsi, irq, evtchn;
> + struct physdev_unmap_pirq unmap_irq;
> + struct irq_info *info;
> + struct evtchn_close close;
> +
> + mutex_lock(&irq_mapping_update_lock);
> +
> + list_for_each_entry(info, &xen_irq_list_head, list) {
> + if (info->type != IRQT_PIRQ)
> + continue;
I think you need to do this by querying Xen state rather than relying on
potentially bad guest state. Particularly since you may crash while
holding irq_mapping_update_lock.
EVTCHNOP_status gets you the info you need I think.
David
On 16/07/14 18:20, Vitaly Kuznetsov wrote:
>
> Thanks, that really works! I tested both kexec -e / kdump cases. I'm
> wondering if we although need my commit to workaround non-fixed qemus?
I don't think so.
David
David Vrabel <[email protected]> writes:
> On 15/07/14 14:40, Vitaly Kuznetsov wrote:
>> When kexec is being run PIRQs from Qemu-emulated devices are still
>> mapped to old event channels and new kernel has no information about
>> that. Trying to map them twice results in the following in Xen's dmesg:
>>
>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
>> ...
>>
>> and the following in new kernel's dmesg:
>>
>> [ 92.286796] xen:events: Failed to obtain physical IRQ 4
>>
>> The result is that the new kernel doesn't recieve IRQs for Qemu-emulated
>> devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
>> when kexec was requested and on every kernel startup. We need to do this
>> twice to deal with the following issues:
>> - startup-time unmapping is required to make kdump work;
>> - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
>> - shutdown-time unmapping is required to make Qemu-emulated NICs work after
>> kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq
>> is being performed).
>
> I think this should be done only in one place -- just prior to exec'ing
> the new kernel (including kdump kernels).
>
Thank you for your comments!
The problem I'm fighting wiht atm is: with FIFO-based event channels we
need to call evtchn_fifo_destroy() so next EVTCHNOP_init_control won't
fail. I was intended to put evtchn_fifo_destroy() in
EVTCHNOP_reset. That introduces a problem: we need to deal with
store/console channels. It is possible to remap those from guest with
EVTCHNOP_bind_interdomain (if we remember where they were mapped before)
but we can't do it after we did evtchn_fifo_destroy() and we can't
rebind them after kexec and performing EVTCHNOP_init_control as
we can't remember where these channels were mapped to after kexec/kdump.
I see the following possible solutions:
1) We put evtchn_fifo_destroy() in EVTCHNOP_init_control so
EVTCHNOP_init_control can be called twice. No EVTCHNOP_reset is required
in that case.
2) Introduce special (e.g. 'EVTCHNOP_fifo_destroy') hypercall to do
evtchn_fifo_destroy() without closing all channels. Alternatively we can
avoid closing all channels in EVTCHNOP_reset when called with DOMID_SELF
(as this mode is not being used atm) -- but that would look unobvious.
3) Keep evtchn_fifo_destroy() in EVTCHNOP_reset but keep console/store
channels -- I saw your concerns it is not safe, some sort of additional
blocking will be required.
4) Do the remapping boot time (query for store/console channels ->
perform EVTCHNOP_reset -> rebind with EVTCHNOP_bind_interdomain).
There is an additional problem: EVTCHNOP_bind_interdomain operation has
local port as OUT parameter so we can't guarantee that remapping
store/console channels will remap them to the same local channel they
were mapped before EVTCHNOP_reset (and we have this information in hvm
info: HVM_PARAM_CONSOLE_EVTCHN/HVM_PARAM_STORE_EVTCHN, ...). Not sure
how to deal with that in case we go with remapping.
Your thoughts would be very appreciated. Thank you again,
>> --- a/arch/x86/xen/smp.c
>> +++ b/arch/x86/xen/smp.c
>> @@ -768,6 +768,7 @@ void xen_kexec_shutdown(void)
>> #ifdef CONFIG_KEXEC
>> if (!kexec_in_progress)
>> return;
>> + xen_unmap_all_pirqs();
>> #endif
>> }
>>
>> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
>> index c919d3d..7701c7f 100644
>> --- a/drivers/xen/events/events_base.c
>> +++ b/drivers/xen/events/events_base.c
>> @@ -1643,6 +1643,80 @@ void xen_callback_vector(void) {}
>> static bool fifo_events = true;
>> module_param(fifo_events, bool, 0);
>>
>> +void xen_unmap_all_pirqs(void)
>> +{
>> + int pirq, rc, gsi, irq, evtchn;
>> + struct physdev_unmap_pirq unmap_irq;
>> + struct irq_info *info;
>> + struct evtchn_close close;
>> +
>> + mutex_lock(&irq_mapping_update_lock);
>> +
>> + list_for_each_entry(info, &xen_irq_list_head, list) {
>> + if (info->type != IRQT_PIRQ)
>> + continue;
>
> I think you need to do this by querying Xen state rather than relying on
> potentially bad guest state. Particularly since you may crash while
> holding irq_mapping_update_lock.
>
> EVTCHNOP_status gets you the info you need I think.
>
> David
--
Vitaly
On 29/07/14 14:50, Vitaly Kuznetsov wrote:
> David Vrabel <[email protected]> writes:
>
>> On 15/07/14 14:40, Vitaly Kuznetsov wrote:
>>> When kexec is being run PIRQs from Qemu-emulated devices are still
>>> mapped to old event channels and new kernel has no information about
>>> that. Trying to map them twice results in the following in Xen's dmesg:
>>>
>>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
>>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
>>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
>>> ...
>>>
>>> and the following in new kernel's dmesg:
>>>
>>> [ 92.286796] xen:events: Failed to obtain physical IRQ 4
>>>
>>> The result is that the new kernel doesn't recieve IRQs for Qemu-emulated
>>> devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
>>> when kexec was requested and on every kernel startup. We need to do this
>>> twice to deal with the following issues:
>>> - startup-time unmapping is required to make kdump work;
>>> - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
>>> - shutdown-time unmapping is required to make Qemu-emulated NICs work after
>>> kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq
>>> is being performed).
>>
>> I think this should be done only in one place -- just prior to exec'ing
>> the new kernel (including kdump kernels).
>>
>
> Thank you for your comments!
>
> The problem I'm fighting wiht atm is: with FIFO-based event channels we
> need to call evtchn_fifo_destroy() so next EVTCHNOP_init_control won't
> fail. I was intended to put evtchn_fifo_destroy() in
> EVTCHNOP_reset. That introduces a problem: we need to deal with
> store/console channels. It is possible to remap those from guest with
> EVTCHNOP_bind_interdomain (if we remember where they were mapped before)
> but we can't do it after we did evtchn_fifo_destroy() and we can't
> rebind them after kexec and performing EVTCHNOP_init_control as
> we can't remember where these channels were mapped to after kexec/kdump.
>
> I see the following possible solutions:
> 1) We put evtchn_fifo_destroy() in EVTCHNOP_init_control so
> EVTCHNOP_init_control can be called twice. No EVTCHNOP_reset is required
> in that case.
EVTCHNOP_init_control is called for each VCPU so I can't see how this
would work.
> 2) Introduce special (e.g. 'EVTCHNOP_fifo_destroy') hypercall to do
> evtchn_fifo_destroy() without closing all channels. Alternatively we can
> avoid closing all channels in EVTCHNOP_reset when called with DOMID_SELF
> (as this mode is not being used atm) -- but that would look unobvious.
I would try this. The guest prior to kexec would then:
1. Use EVTCHNOP_status to query remote end of console and xenstore event
channels.
2. Loop for all event channels:
a. unmap pirq (if required)
b. EVTCHNOP_close
3. EVTCHNOP_fifo_destroy (the implementation of which must verify that
no channels are bound).
4. EVTCHNOP_bind_interdomain to rebind the console and xenstore channels.
David
David Vrabel <[email protected]> writes:
> On 29/07/14 14:50, Vitaly Kuznetsov wrote:
>> David Vrabel <[email protected]> writes:
>>
>>> On 15/07/14 14:40, Vitaly Kuznetsov wrote:
>>>> When kexec is being run PIRQs from Qemu-emulated devices are still
>>>> mapped to old event channels and new kernel has no information about
>>>> that. Trying to map them twice results in the following in Xen's dmesg:
>>>>
>>>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
>>>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
>>>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
>>>> ...
>>>>
>>>> and the following in new kernel's dmesg:
>>>>
>>>> [ 92.286796] xen:events: Failed to obtain physical IRQ 4
>>>>
>>>> The result is that the new kernel doesn't recieve IRQs for Qemu-emulated
>>>> devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
>>>> when kexec was requested and on every kernel startup. We need to do this
>>>> twice to deal with the following issues:
>>>> - startup-time unmapping is required to make kdump work;
>>>> - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
>>>> - shutdown-time unmapping is required to make Qemu-emulated NICs work after
>>>> kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq
>>>> is being performed).
>>>
>>> I think this should be done only in one place -- just prior to exec'ing
>>> the new kernel (including kdump kernels).
>>>
>>
>> Thank you for your comments!
>>
>> The problem I'm fighting wiht atm is: with FIFO-based event channels we
>> need to call evtchn_fifo_destroy() so next EVTCHNOP_init_control won't
>> fail. I was intended to put evtchn_fifo_destroy() in
>> EVTCHNOP_reset. That introduces a problem: we need to deal with
>> store/console channels. It is possible to remap those from guest with
>> EVTCHNOP_bind_interdomain (if we remember where they were mapped before)
>> but we can't do it after we did evtchn_fifo_destroy() and we can't
>> rebind them after kexec and performing EVTCHNOP_init_control as
>> we can't remember where these channels were mapped to after kexec/kdump.
>>
>> I see the following possible solutions:
>> 1) We put evtchn_fifo_destroy() in EVTCHNOP_init_control so
>> EVTCHNOP_init_control can be called twice. No EVTCHNOP_reset is required
>> in that case.
>
> EVTCHNOP_init_control is called for each VCPU so I can't see how this
> would work.
Right, forgot about that...
>
>> 2) Introduce special (e.g. 'EVTCHNOP_fifo_destroy') hypercall to do
>> evtchn_fifo_destroy() without closing all channels. Alternatively we can
>> avoid closing all channels in EVTCHNOP_reset when called with DOMID_SELF
>> (as this mode is not being used atm) -- but that would look unobvious.
>
> I would try this. The guest prior to kexec would then:
>
> 1. Use EVTCHNOP_status to query remote end of console and xenstore event
> channels.
>
> 2. Loop for all event channels:
>
> a. unmap pirq (if required)
> b. EVTCHNOP_close
>
> 3. EVTCHNOP_fifo_destroy (the implementation of which must verify that
> no channels are bound).
>
> 4. EVTCHNOP_bind_interdomain to rebind the console and xenstore channels.
>
Yea, that's what I have now when I put evtchn_fifo_destroy() in
EVTCHNOP_reset. The problem here is: we can't do
EVTCHNOP_bind_interdomain after we did evtchn_fifo_destroy(), we need to
call EVTCHNOP_init_control first. And we'll do that only after kexec so
we won't remember what we need to remap.. The second issue is the fact
that EVTCHNOP_bind_interdomain will remap store/storage channels to
*some* local ports, not necessary matching hvm info
(HVM_PARAM_CONSOLE_EVTCHN/HVM_PARAM_STORE_EVTCHN)..
Would it be safe is instead of closing interdomain channels on
EVTCHNOP_fifo_destroy we switch evtchn_port_ops to evtchn_port_ops_2l
(so on EVTCHNOP_init_control after kexec we switch back)? I'll try
prototyping this.
Thank you for your comments,
> David
--
Vitaly
On 29/07/14 18:06, Vitaly Kuznetsov wrote:
> David Vrabel <[email protected]> writes:
>
>> On 29/07/14 14:50, Vitaly Kuznetsov wrote:
>>> David Vrabel <[email protected]> writes:
>>>
>>>> On 15/07/14 14:40, Vitaly Kuznetsov wrote:
>>>>> When kexec is being run PIRQs from Qemu-emulated devices are still
>>>>> mapped to old event channels and new kernel has no information about
>>>>> that. Trying to map them twice results in the following in Xen's dmesg:
>>>>>
>>>>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 8 already mapped
>>>>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 12 already mapped
>>>>> (XEN) irq.c:2278: dom7: pirq 24 or emuirq 1 already mapped
>>>>> ...
>>>>>
>>>>> and the following in new kernel's dmesg:
>>>>>
>>>>> [ 92.286796] xen:events: Failed to obtain physical IRQ 4
>>>>>
>>>>> The result is that the new kernel doesn't recieve IRQs for Qemu-emulated
>>>>> devices. Address the issue by unmapping all mapped PIRQs on kernel shutdown
>>>>> when kexec was requested and on every kernel startup. We need to do this
>>>>> twice to deal with the following issues:
>>>>> - startup-time unmapping is required to make kdump work;
>>>>> - shutdown-time unmapping is required to support kexec-ing non-fixed kernels;
>>>>> - shutdown-time unmapping is required to make Qemu-emulated NICs work after
>>>>> kexec (event channel is being closed on shutdown but no PHYSDEVOP_unmap_pirq
>>>>> is being performed).
>>>>
>>>> I think this should be done only in one place -- just prior to exec'ing
>>>> the new kernel (including kdump kernels).
>>>>
>>>
>>> Thank you for your comments!
>>>
>>> The problem I'm fighting wiht atm is: with FIFO-based event channels we
>>> need to call evtchn_fifo_destroy() so next EVTCHNOP_init_control won't
>>> fail. I was intended to put evtchn_fifo_destroy() in
>>> EVTCHNOP_reset. That introduces a problem: we need to deal with
>>> store/console channels. It is possible to remap those from guest with
>>> EVTCHNOP_bind_interdomain (if we remember where they were mapped before)
>>> but we can't do it after we did evtchn_fifo_destroy() and we can't
>>> rebind them after kexec and performing EVTCHNOP_init_control as
>>> we can't remember where these channels were mapped to after kexec/kdump.
>>>
>>> I see the following possible solutions:
>>> 1) We put evtchn_fifo_destroy() in EVTCHNOP_init_control so
>>> EVTCHNOP_init_control can be called twice. No EVTCHNOP_reset is required
>>> in that case.
>>
>> EVTCHNOP_init_control is called for each VCPU so I can't see how this
>> would work.
>
> Right, forgot about that...
>
>>
>>> 2) Introduce special (e.g. 'EVTCHNOP_fifo_destroy') hypercall to do
>>> evtchn_fifo_destroy() without closing all channels. Alternatively we can
>>> avoid closing all channels in EVTCHNOP_reset when called with DOMID_SELF
>>> (as this mode is not being used atm) -- but that would look unobvious.
>>
>> I would try this. The guest prior to kexec would then:
>>
>> 1. Use EVTCHNOP_status to query remote end of console and xenstore event
>> channels.
>>
>> 2. Loop for all event channels:
>>
>> a. unmap pirq (if required)
>> b. EVTCHNOP_close
>>
>> 3. EVTCHNOP_fifo_destroy (the implementation of which must verify that
>> no channels are bound).
>>
>> 4. EVTCHNOP_bind_interdomain to rebind the console and xenstore channels.
>>
>
> Yea, that's what I have now when I put evtchn_fifo_destroy() in
> EVTCHNOP_reset. The problem here is: we can't do
> EVTCHNOP_bind_interdomain after we did evtchn_fifo_destroy(), we need to
> call EVTCHNOP_init_control first. And we'll do that only after kexec so
> we won't remember what we need to remap.. The second issue is the fact
> that EVTCHNOP_bind_interdomain will remap store/storage channels to
> *some* local ports, not necessary matching hvm info
> (HVM_PARAM_CONSOLE_EVTCHN/HVM_PARAM_STORE_EVTCHN)..
You can set the HVM params to match.
> Would it be safe is instead of closing interdomain channels on
> EVTCHNOP_fifo_destroy we switch evtchn_port_ops to evtchn_port_ops_2l
> (so on EVTCHNOP_init_control after kexec we switch back)? I'll try
> prototyping this.
Switching the ops back to the 2l ones is an essential part of
fifo_destroy (the execed kernel might not have fifo support). I assumed
you were already doing this.
David