In the firmware-first RAS solution, corrupt data is detected in a
memory location when guest OS application software executing at EL0
or guest OS kernel El1 software are reading from the memory. The
memory node records errors in an error record accessible using
system registers.
Because SCR_EL3.EA is 1, then CPU will trap to El3 firmware, EL3
firmware records the error to APEI table through reading system
register.
Because the error was taken from a lower Exception level, if the
exception is SEA/SEI and HCR_EL2.TEA/HCR_EL2.AMO is 1, firmware
sets ESR_EL2/FAR_EL2 to fake a exception trap to EL2, then
transfers to hypervisor.
For the synchronous external abort(SEA), Hypervisor calls the
ghes_handle_memory_failure() to deal with this error,
ghes_handle_memory_failure() function reads the APEI table and
callls memory_failure() to decide whether it needs to deliver
SIGBUS signal to user space, the advantage of using SIGBUS signal
to notify user space is that it can be compatible with Non-Kvm users.
For the SError Interrupt(SEI),KVM firstly classified the error.
Not call memory_failure() to handle it. Because the error address recorded
by APEI is not accurated, so can not identify the address to hwpoison
memory. If the SError error comes from guest user mode and is not propagated,
then signal user space to handle it, otherwise, directly injects virtual
SError, or panic if the error is fatal. when user space handles the error,
it will specify syndrome for the injected virtual SError. This syndrome value
is set to the VSESR_EL2. VSESR_EL2 is a new ARMv8.2 RAS extensions register
which provides the syndrome value reported to software on taking a virtual
SError interrupt exception.
Dongjiu Geng (5):
acpi: apei: remove the unused code
arm64: kvm: support user space to query RAS extension feature
arm64: kvm: route synchronous external abort exceptions to el2
KVM: arm64: allow get exception information from userspace
arm64: kvm: handle SEI notification and pass the virtual syndrome
James Morse (1):
KVM: arm64: Save ESR_EL2 on guest SError
Xie XiuQi (1):
arm64: cpufeature: Detect CPU RAS Extentions
arch/arm/include/asm/kvm_host.h | 2 ++
arch/arm/kvm/guest.c | 9 ++++++
arch/arm64/Kconfig | 16 +++++++++++
arch/arm64/include/asm/barrier.h | 1 +
arch/arm64/include/asm/cpucaps.h | 3 +-
arch/arm64/include/asm/esr.h | 11 +++++++
arch/arm64/include/asm/kvm_arm.h | 2 ++
arch/arm64/include/asm/kvm_emulate.h | 17 +++++++++++
arch/arm64/include/asm/kvm_host.h | 2 ++
arch/arm64/include/asm/sysreg.h | 5 ++++
arch/arm64/include/asm/system_misc.h | 1 +
arch/arm64/include/uapi/asm/kvm.h | 5 ++++
arch/arm64/kernel/cpufeature.c | 13 +++++++++
arch/arm64/kernel/process.c | 3 ++
arch/arm64/kvm/guest.c | 50 ++++++++++++++++++++++++++++++++
arch/arm64/kvm/handle_exit.c | 56 ++++++++++++++++++++++++++++++++----
arch/arm64/kvm/hyp/switch.c | 29 +++++++++++++++++--
arch/arm64/kvm/reset.c | 3 ++
arch/arm64/mm/fault.c | 34 ++++++++++++++++++++++
drivers/acpi/apei/ghes.c | 14 ---------
include/uapi/linux/kvm.h | 3 ++
virt/kvm/arm/arm.c | 7 +++++
22 files changed, 263 insertions(+), 23 deletions(-)
--
2.14.1
In current code logic, the two functions ghes_sea_add() and
ghes_sea_remove() are only called when CONFIG_ACPI_APEI_SEA
is defined. If not, it will return errors in the ghes_probe()
and not contiue. Hence, remove the unnecessary handling when
CONFIG_ACPI_APEI_SEI is not defined.
change since v5:
1. remove the SEI notification type handling, because the SEI is
asynchronous exception and the address is not accurate. so
not call memory_failure() to handle it.
Signed-off-by: Dongjiu Geng <[email protected]>
---
drivers/acpi/apei/ghes.c | 14 --------------
1 file changed, 14 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d661d452b238..c15a08db2c7c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -813,7 +813,6 @@ static struct notifier_block ghes_notifier_hed = {
.notifier_call = ghes_notify_hed,
};
-#ifdef CONFIG_ACPI_APEI_SEA
static LIST_HEAD(ghes_sea);
/*
@@ -848,19 +847,6 @@ static void ghes_sea_remove(struct ghes *ghes)
mutex_unlock(&ghes_list_mutex);
synchronize_rcu();
}
-#else /* CONFIG_ACPI_APEI_SEA */
-static inline void ghes_sea_add(struct ghes *ghes)
-{
- pr_err(GHES_PFX "ID: %d, trying to add SEA notification which is not supported\n",
- ghes->generic->header.source_id);
-}
-
-static inline void ghes_sea_remove(struct ghes *ghes)
-{
- pr_err(GHES_PFX "ID: %d, trying to remove SEA notification which is not supported\n",
- ghes->generic->header.source_id);
-}
-#endif /* CONFIG_ACPI_APEI_SEA */
#ifdef CONFIG_HAVE_ACPI_APEI_NMI
/*
--
2.14.1
From: Xie XiuQi <[email protected]>
ARM's v8.2 Extentions add support for Reliability, Availability and
Serviceability (RAS). On CPUs with these extensions system software
can use additional barriers to isolate errors and determine if faults
are pending.
Add cpufeature detection and a barrier in the context-switch code.
There is no need to use alternatives for this as CPUs that don't
support this feature will treat the instruction as a nop.
Platform level RAS support may require additional firmware support.
Signed-off-by: Xie XiuQi <[email protected]>
[Rebased, added esb and config option, reworded commit message]
Signed-off-by: James Morse <[email protected]>
---
arch/arm64/Kconfig | 16 ++++++++++++++++
arch/arm64/include/asm/barrier.h | 1 +
arch/arm64/include/asm/cpucaps.h | 3 ++-
arch/arm64/include/asm/sysreg.h | 2 ++
arch/arm64/kernel/cpufeature.c | 13 +++++++++++++
arch/arm64/kernel/process.c | 3 +++
6 files changed, 37 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index dfd908630631..4d87aa963d83 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -960,6 +960,22 @@ config ARM64_UAO
regular load/store instructions if the cpu does not implement the
feature.
+config ARM64_RAS_EXTN
+ bool "Enable support for RAS CPU Extensions"
+ default y
+ help
+ CPUs that support the Reliability, Availability and Serviceability
+ (RAS) Extensions, part of ARMv8.2 are able to track faults and
+ errors, classify them and report them to software.
+
+ On CPUs with these extensions system software can use additional
+ barriers to determine if faults are pending and read the
+ classification from a new set of registers.
+
+ Selecting this feature will allow the kernel to use these barriers
+ and access the new registers if the system supports the extension.
+ Platform RAS features may additionally depend on firmware support.
+
endmenu
config ARM64_MODULE_CMODEL_LARGE
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 0fe7e43b7fbc..8b0a0eb67625 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -30,6 +30,7 @@
#define isb() asm volatile("isb" : : : "memory")
#define dmb(opt) asm volatile("dmb " #opt : : : "memory")
#define dsb(opt) asm volatile("dsb " #opt : : : "memory")
+#define esb() asm volatile("hint #16" : : : "memory")
#define mb() dsb(sy)
#define rmb() dsb(ld)
diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index 8d2272c6822c..f93bf77f1f74 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -39,7 +39,8 @@
#define ARM64_WORKAROUND_QCOM_FALKOR_E1003 18
#define ARM64_WORKAROUND_858921 19
#define ARM64_WORKAROUND_CAVIUM_30115 20
+#define ARM64_HAS_RAS_EXTN 21
-#define ARM64_NCAPS 21
+#define ARM64_NCAPS 22
#endif /* __ASM_CPUCAPS_H */
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 248339e4aaf5..35b786b43ee4 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -331,6 +331,7 @@
#define ID_AA64ISAR1_JSCVT_SHIFT 12
/* id_aa64pfr0 */
+#define ID_AA64PFR0_RAS_SHIFT 28
#define ID_AA64PFR0_GIC_SHIFT 24
#define ID_AA64PFR0_ASIMD_SHIFT 20
#define ID_AA64PFR0_FP_SHIFT 16
@@ -339,6 +340,7 @@
#define ID_AA64PFR0_EL1_SHIFT 4
#define ID_AA64PFR0_EL0_SHIFT 0
+#define ID_AA64PFR0_RAS_V1 0x1
#define ID_AA64PFR0_FP_NI 0xf
#define ID_AA64PFR0_FP_SUPPORTED 0x0
#define ID_AA64PFR0_ASIMD_NI 0xf
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 9f9e0064c8c1..a807ab55ee10 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -124,6 +124,7 @@ static const struct arm64_ftr_bits ftr_id_aa64isar1[] = {
};
static const struct arm64_ftr_bits ftr_id_aa64pfr0[] = {
+ ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_EXACT, ID_AA64PFR0_RAS_SHIFT, 4, 0),
ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_EXACT, ID_AA64PFR0_GIC_SHIFT, 4, 0),
S_ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR0_ASIMD_SHIFT, 4, ID_AA64PFR0_ASIMD_NI),
S_ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR0_FP_SHIFT, 4, ID_AA64PFR0_FP_NI),
@@ -888,6 +889,18 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
.min_field_value = 0,
.matches = has_no_fpsimd,
},
+#ifdef CONFIG_ARM64_RAS_EXTN
+ {
+ .desc = "RAS Extension Support",
+ .capability = ARM64_HAS_RAS_EXTN,
+ .def_scope = SCOPE_SYSTEM,
+ .matches = has_cpuid_feature,
+ .sys_reg = SYS_ID_AA64PFR0_EL1,
+ .sign = FTR_UNSIGNED,
+ .field_pos = ID_AA64PFR0_RAS_SHIFT,
+ .min_field_value = ID_AA64PFR0_RAS_V1,
+ },
+#endif /* CONFIG_ARM64_RAS_EXTN */
{},
};
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index c845c8c04d95..7a17b4a1bd9e 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -370,6 +370,9 @@ __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
*/
dsb(ish);
+ /* Deliver any pending SError from prev */
+ esb();
+
/* the actual thread switch */
last = cpu_switch_to(prev, next);
--
2.14.1
After received SError, KVM firstly classified the error.
Not call memory_failure() to handle it. Because the address recorded
by APEI is not accurated, so can not identify the address to hwpoison
memory.
If the SError error come from guest user mode and is not
propagated, then signal user space to handle it, otherwise,
directly injects virtual SError, or panic if the error is fatal.
user space will specify syndrome for the injected virtual SError.
This syndrome value is set to the VSESR_EL2, VSESR_EL2 is a new
ARMv8.2 RAS extensions register which provides the syndrome value
reported to software on taking a virtual SError interrupt exception.
change since v5:
1. not call the memory_failure() to handle the SEI error
2. kvm classify the SError and decide how to do.
3. add code to deliver signal compatible to Non-KVM user.
4. correct some typo errors
Signed-off-by: Dongjiu Geng <[email protected]>
Signed-off-by: Quanming Wu <[email protected]>
---
arch/arm/include/asm/kvm_host.h | 2 ++
arch/arm/kvm/guest.c | 9 ++++++
arch/arm64/include/asm/esr.h | 11 +++++++
arch/arm64/include/asm/kvm_emulate.h | 10 +++++++
arch/arm64/include/asm/kvm_host.h | 2 ++
arch/arm64/include/asm/sysreg.h | 3 ++
arch/arm64/include/asm/system_misc.h | 1 +
arch/arm64/kvm/guest.c | 14 +++++++++
arch/arm64/kvm/handle_exit.c | 56 ++++++++++++++++++++++++++++++++----
arch/arm64/kvm/hyp/switch.c | 14 +++++++++
arch/arm64/mm/fault.c | 34 ++++++++++++++++++++++
include/uapi/linux/kvm.h | 2 ++
virt/kvm/arm/arm.c | 7 +++++
13 files changed, 160 insertions(+), 5 deletions(-)
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 127e2dd2e21c..bdb6ea690257 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -244,6 +244,8 @@ int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *);
int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
int exception_index);
+int kvm_vcpu_ioctl_sei(struct kvm_vcpu *vcpu, u64 *syndrome);
+
static inline void __cpu_init_hyp_mode(phys_addr_t pgd_ptr,
unsigned long hyp_stack_ptr,
unsigned long vector_ptr)
diff --git a/arch/arm/kvm/guest.c b/arch/arm/kvm/guest.c
index 1e0784ebbfd6..e120e7458c30 100644
--- a/arch/arm/kvm/guest.c
+++ b/arch/arm/kvm/guest.c
@@ -248,6 +248,15 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
return -EINVAL;
}
+/*
+ * we only support SEI injection with specified synchronous
+ * in ARM64, not support it in ARM32.
+ */
+int kvm_vcpu_ioctl_sei(struct kvm_vcpu *vcpu, u64 *syndrome)
+{
+ return -EINVAL;
+}
+
int __attribute_const__ kvm_target_cpu(void)
{
switch (read_cpuid_part()) {
diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index 8cabd57b6348..fe4a543add8f 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -77,6 +77,7 @@
#define ESR_ELx_EC_MASK (UL(0x3F) << ESR_ELx_EC_SHIFT)
#define ESR_ELx_EC(esr) (((esr) & ESR_ELx_EC_MASK) >> ESR_ELx_EC_SHIFT)
+
#define ESR_ELx_IL (UL(1) << 25)
#define ESR_ELx_ISS_MASK (ESR_ELx_IL - 1)
@@ -95,6 +96,7 @@
#define ESR_ELx_FSC_ACCESS (0x08)
#define ESR_ELx_FSC_FAULT (0x04)
#define ESR_ELx_FSC_PERM (0x0C)
+#define ESR_ELx_FSC_SERROR (0x11)
/* ISS field definitions for Data Aborts */
#define ESR_ELx_ISV (UL(1) << 24)
@@ -107,6 +109,15 @@
#define ESR_ELx_AR (UL(1) << 14)
#define ESR_ELx_CM (UL(1) << 8)
+/* ISS field definitions for SError interrupt */
+#define ESR_ELx_AET_SHIFT (10)
+#define ESR_ELx_AET (UL(0x7) << ESR_ELx_AET_SHIFT)
+#define ESR_ELx_AET_UC (UL(0) << ESR_ELx_AET_SHIFT) /* Uncontainable */
+#define ESR_ELx_AET_UEU (UL(1) << ESR_ELx_AET_SHIFT) /* Uncorrected Unrecoverable */
+#define ESR_ELx_AET_UEO (UL(2) << ESR_ELx_AET_SHIFT) /* Uncorrected Restartable */
+#define ESR_ELx_AET_UER (UL(3) << ESR_ELx_AET_SHIFT) /* Uncorrected Recoverable */
+#define ESR_ELx_AET_CE (UL(6) << ESR_ELx_AET_SHIFT) /* Corrected */
+
/* ISS field definitions for exceptions taken in to Hyp */
#define ESR_ELx_CV (UL(1) << 24)
#define ESR_ELx_COND_SHIFT (20)
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index 47983db27de2..74213bd539dc 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -155,6 +155,16 @@ static inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)
return vcpu->arch.fault.esr_el2;
}
+static inline u32 kvm_vcpu_get_vsesr(const struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.fault.vsesr_el2;
+}
+
+static inline void kvm_vcpu_set_vsesr(struct kvm_vcpu *vcpu, unsigned long val)
+{
+ vcpu->arch.fault.vsesr_el2 = val;
+}
+
static inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
{
u32 esr = kvm_vcpu_get_hsr(vcpu);
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index d68630007b14..57b011261597 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -88,6 +88,7 @@ struct kvm_vcpu_fault_info {
u32 esr_el2; /* Hyp Syndrom Register */
u64 far_el2; /* Hyp Fault Address Register */
u64 hpfar_el2; /* Hyp IPA Fault Address Register */
+ u32 vsesr_el2; /* Virtual SError Exception Syndrome Register */
};
/*
@@ -381,6 +382,7 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
struct kvm_device_attr *attr);
int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
struct kvm_device_attr *attr);
+int kvm_vcpu_ioctl_sei(struct kvm_vcpu *vcpu, u64 *syndrome);
static inline void __cpu_init_stage2(void)
{
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 35b786b43ee4..06059eef0f5d 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -86,6 +86,9 @@
#define REG_PSTATE_PAN_IMM sys_reg(0, 0, 4, 0, 4)
#define REG_PSTATE_UAO_IMM sys_reg(0, 0, 4, 0, 3)
+/* virtual SError exception syndrome register in armv8.2 */
+#define REG_VSESR_EL2 sys_reg(3, 4, 5, 2, 3)
+
#define SET_PSTATE_PAN(x) __emit_inst(0xd5000000 | REG_PSTATE_PAN_IMM | \
(!!x)<<8 | 0x1f)
#define SET_PSTATE_UAO(x) __emit_inst(0xd5000000 | REG_PSTATE_UAO_IMM | \
diff --git a/arch/arm64/include/asm/system_misc.h b/arch/arm64/include/asm/system_misc.h
index 07aa8e3c5630..90bea60cfca3 100644
--- a/arch/arm64/include/asm/system_misc.h
+++ b/arch/arm64/include/asm/system_misc.h
@@ -57,6 +57,7 @@ extern void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);
})
int handle_guest_sea(phys_addr_t addr, unsigned int esr);
+int handle_guest_sei(unsigned int esr);
#endif /* __ASSEMBLY__ */
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 020a644b20d7..4e09b6c0d041 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -313,6 +313,20 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
return -EINVAL;
}
+int kvm_vcpu_ioctl_sei(struct kvm_vcpu *vcpu, u64 *syndrome)
+{
+ u64 reg = *syndrome;
+
+ if (cpus_have_const_cap(ARM64_HAS_RAS_EXTN) && reg)
+ /* set vsesr_el2[24:0] with value that user space specified */
+ kvm_vcpu_set_vsesr(vcpu, reg & ESR_ELx_ISS_MASK);
+
+ /* inject virtual system Error or asynchronous abort */
+ kvm_inject_vabt(vcpu);
+
+ return 0;
+}
+
int __attribute_const__ kvm_target_cpu(void)
{
unsigned long implementor = read_cpuid_implementor();
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 17d8a1677a0b..bc55666c3c7f 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -28,6 +28,7 @@
#include <asm/kvm_emulate.h>
#include <asm/kvm_mmu.h>
#include <asm/kvm_psci.h>
+#include <asm/system_misc.h>
#define CREATE_TRACE_POINTS
#include "trace.h"
@@ -178,8 +179,55 @@ static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu)
return arm_exit_handlers[hsr_ec];
}
+/**
+ * kvm_handle_guest_sei - handles SError interrupt or asynchronous aborts
+ * @vcpu: the VCPU pointer
+ *
+ * For RAS SError interrupt, if the AET is [ESR_ELx_AET_UEU] and error
+ * is guest user space, then let host user space do the recovery, otherwise,
+ * directly inject virtual SError to guest or panic.
+ */
+static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu)
+{
+ unsigned int esr = kvm_vcpu_get_hsr(vcpu);
+ bool impdef_syndrome = esr & ESR_ELx_ISV; /* aka IDS */
+ unsigned int aet = esr & ESR_ELx_AET;
+
+ /*
+ * In below three conditions, it will directly inject the virtual SError.
+ * 1. Not support RAS extension; the Syndrome is IMPLEMENTATION DEFINED;
+ * AET is RES0 if 'the value returned in the DFSC field is not
+ * [ESR_ELx_FSC_SERROR]'
+ */
+ if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN) || impdef_syndrome ||
+ ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) {
+ kvm_inject_vabt(vcpu);
+ return 1;
+ }
+
+ switch (aet) {
+ case ESR_ELx_AET_CE: /* corrected error */
+ case ESR_ELx_AET_UEO: /* restartable error, not yet consumed */
+ return 0; /* continue processing the guest exit */
+ case ESR_ELx_AET_UEU: /* The error has not been propagated */
+ /*
+ * Only handle the guest user mode SEI if the error has not been propagated
+ */
+ if ((!vcpu_mode_priv(vcpu)) && !handle_guest_sei(kvm_vcpu_get_hsr(vcpu)))
+ return 1;
+
+ /* If SError handling is failed, continue run */
+ default:
+ /*
+ * Until now, the CPU supports RAS and SEI is fatal, or user space
+ * does not support to handle the SError.
+ */
+ panic("This Asynchronous SError interrupt is dangerous, panic");
+ }
+}
+
/*
- * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
+ * return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
* proper exit to userspace.
*/
int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
@@ -201,8 +249,7 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
*vcpu_pc(vcpu) -= adj;
}
- kvm_inject_vabt(vcpu);
- return 1;
+ return kvm_handle_guest_sei(vcpu);
}
exception_index = ARM_EXCEPTION_CODE(exception_index);
@@ -211,8 +258,7 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
case ARM_EXCEPTION_IRQ:
return 1;
case ARM_EXCEPTION_EL1_SERROR:
- kvm_inject_vabt(vcpu);
- return 1;
+ return kvm_handle_guest_sei(vcpu);
case ARM_EXCEPTION_TRAP:
/*
* See ARM ARM B1.14.1: "Hyp traps on instructions
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index c6f17c7675ad..a73346141cf3 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -42,6 +42,13 @@ bool __hyp_text __fpsimd_enabled(void)
return __fpsimd_is_enabled()();
}
+static void __hyp_text __sysreg_set_vsesr(struct kvm_vcpu *vcpu)
+{
+ if (cpus_have_const_cap(ARM64_HAS_RAS_EXTN) &&
+ (vcpu->arch.hcr_el2 & HCR_VSE))
+ write_sysreg_s(vcpu->arch.fault.vsesr_el2, REG_VSESR_EL2);
+}
+
static void __hyp_text __activate_traps_vhe(void)
{
u64 val;
@@ -86,6 +93,13 @@ static void __hyp_text __activate_traps(struct kvm_vcpu *vcpu)
isb();
}
write_sysreg(val, hcr_el2);
+
+ /*
+ * If the virtual SError interrupt is taken to EL1 using AArch64,
+ * then VSESR_EL2 provides the syndrome value reported in ESR_EL1.
+ */
+ __sysreg_set_vsesr(vcpu);
+
/* Trap on AArch32 cp15 c15 accesses (EL1 or EL0) */
write_sysreg(1 << 15, hstr_el2);
/*
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 2509e4fe6992..4407ca89d791 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -687,6 +687,40 @@ int handle_guest_sea(phys_addr_t addr, unsigned int esr)
return ret;
}
+/*
+ * Handle SError interrupt that occur in guest OS.
+ *
+ * The return value will be zero if the SEI was successfully handled
+ * and non-zero if handling is failed.
+ */
+int handle_guest_sei(unsigned int esr)
+{
+ int ret = 0;
+
+ struct siginfo info;
+
+ pr_alert("Asynchronous SError interrupt detected on CPU: %d, esr: %x\n",
+ smp_processor_id(), esr);
+
+ info.si_signo = SIGBUS;
+ info.si_errno = 0;
+ info.si_code = BUS_MCEERR_AR;
+ /*
+ * Because the address is not accurate for Asynchronous Aborts, so set
+ * NULL for the fault address
+ */
+ info.si_addr = NULL;
+
+ ret = force_sig_info(SIGBUS, &info, current);
+ if (ret < 0)
+ pr_info("Handle guest SEI: Error sending signal to %s:%d: %d\n",
+ current->comm, current->pid, ret);
+ else
+ ret = 0;
+
+ return ret;
+}
+
/*
* Dispatch a data abort to the relevant handler.
*/
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5a2a338cae57..d3fa4c60c9dc 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1356,6 +1356,8 @@ struct kvm_s390_ucas_mapping {
/* Available with KVM_CAP_S390_CMMA_MIGRATION */
#define KVM_S390_GET_CMMA_BITS _IOWR(KVMIO, 0xb8, struct kvm_s390_cmma_log)
#define KVM_S390_SET_CMMA_BITS _IOW(KVMIO, 0xb9, struct kvm_s390_cmma_log)
+#define KVM_ARM_SEI _IO(KVMIO, 0xb10)
+
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index a39a1e161e63..dbaaf206ace2 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -1022,6 +1022,13 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
return -EFAULT;
return kvm_arm_vcpu_has_attr(vcpu, &attr);
}
+ case KVM_ARM_SEI: {
+ u64 syndrome;
+
+ if (copy_from_user(&syndrome, argp, sizeof(syndrome)))
+ return -EFAULT;
+ return kvm_vcpu_ioctl_sei(vcpu, &syndrome);
+ }
default:
return -EINVAL;
}
--
2.14.1
From: James Morse <[email protected]>
When we exit a guest due to an SError the vcpu fault info isn't updated
with the ESR. Today this is only done for traps.
The v8.2 RAS Extensions define ISS values for SError. Update the vcpu's
fault_info with the ESR on SError so that handle_exit() can determine
if this was a RAS SError and decode its severity.
Signed-off-by: James Morse <[email protected]>
---
arch/arm64/kvm/hyp/switch.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index 945e79c641c4..c6f17c7675ad 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -226,13 +226,20 @@ static bool __hyp_text __translate_far_to_hpfar(u64 far, u64 *hpfar)
return true;
}
+static void __hyp_text __populate_fault_info_esr(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.fault.esr_el2 = read_sysreg_el2(esr);
+}
+
static bool __hyp_text __populate_fault_info(struct kvm_vcpu *vcpu)
{
- u64 esr = read_sysreg_el2(esr);
- u8 ec = ESR_ELx_EC(esr);
+ u8 ec;
+ u64 esr;
u64 hpfar, far;
- vcpu->arch.fault.esr_el2 = esr;
+ __populate_fault_info_esr(vcpu);
+ esr = vcpu->arch.fault.esr_el2;
+ ec = ESR_ELx_EC(esr);
if (ec != ESR_ELx_EC_DABT_LOW && ec != ESR_ELx_EC_IABT_LOW)
return true;
@@ -321,6 +328,8 @@ int __hyp_text __kvm_vcpu_run(struct kvm_vcpu *vcpu)
*/
if (exit_code == ARM_EXCEPTION_TRAP && !__populate_fault_info(vcpu))
goto again;
+ else if (ARM_EXCEPTION_CODE(exit_code) == ARM_EXCEPTION_EL1_SERROR)
+ __populate_fault_info_esr(vcpu);
if (static_branch_unlikely(&vgic_v2_cpuif_trap) &&
exit_code == ARM_EXCEPTION_TRAP) {
--
2.14.1
In ARMV8.2 RAS extension, a virtual SError exception syndrome
register(VSESR_EL2) is added. This value may be specified from
userspace. Userspace will want to check if the CPU has the RAS
extension. If it has, it wil specify the virtual SError syndrome
value, otherwise it will not be set. This patch adds support for
querying the availability of this extension.
change since v5:
1. modify some patch description
Signed-off-by: Dongjiu Geng <[email protected]>
---
arch/arm64/kvm/reset.c | 3 +++
include/uapi/linux/kvm.h | 1 +
2 files changed, 4 insertions(+)
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 3256b9228e75..b7313ee028e9 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -77,6 +77,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ARM_PMU_V3:
r = kvm_arm_support_pmu_v3();
break;
+ case KVM_CAP_ARM_RAS_EXTENSION:
+ r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
+ break;
case KVM_CAP_SET_GUEST_DEBUG:
case KVM_CAP_VCPU_ATTRIBUTES:
r = 1;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6cd63c18708a..5a2a338cae57 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -929,6 +929,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_PPC_SMT_POSSIBLE 147
#define KVM_CAP_HYPERV_SYNIC2 148
#define KVM_CAP_HYPERV_VP_INDEX 149
+#define KVM_CAP_ARM_RAS_EXTENSION 150
#ifdef KVM_CAP_IRQ_ROUTING
--
2.14.1
ARMv8.2 adds a new bit HCR_EL2.TEA which controls to
route synchronous external aborts to EL2, and adds a
trap control bit HCR_EL2.TERR which controls to
trap all Non-secure EL1&0 error record accesses to EL2.
This patch enables the two bits for the guest OS.
when an synchronous abort is generated in the guest OS,
it will trap to EL3 firmware, EL3 firmware will check the
HCR_EL2.TEA value to decide to jump to hypervisor or host
OS. Enabling HCR_EL2.TERR makes error record access
from guest trap to EL2.
change since v5:
1. modify some patch description
Signed-off-by: Dongjiu Geng <[email protected]>
---
arch/arm64/include/asm/kvm_arm.h | 2 ++
arch/arm64/include/asm/kvm_emulate.h | 7 +++++++
2 files changed, 9 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 61d694c2eae5..1188272003c4 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -23,6 +23,8 @@
#include <asm/types.h>
/* Hyp Configuration Register (HCR) bits */
+#define HCR_TEA (UL(1) << 37)
+#define HCR_TERR (UL(1) << 36)
#define HCR_E2H (UL(1) << 34)
#define HCR_ID (UL(1) << 33)
#define HCR_CD (UL(1) << 32)
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index fe39e6841326..47983db27de2 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -47,6 +47,13 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
vcpu->arch.hcr_el2 = HCR_GUEST_FLAGS;
if (is_kernel_in_hyp_mode())
vcpu->arch.hcr_el2 |= HCR_E2H;
+ if (cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {
+ /* route synchronous external abort exceptions to EL2 */
+ vcpu->arch.hcr_el2 |= HCR_TEA;
+ /* trap error record accesses */
+ vcpu->arch.hcr_el2 |= HCR_TERR;
+ }
+
if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
vcpu->arch.hcr_el2 &= ~HCR_RW;
}
--
2.14.1
when userspace gets SIGBUS signal, it does not know whether
this is a synchronous external abort or SError, so needs
to get the exception syndrome. This patch allows userspace
can get this values. For syndrome, only give userspace
syndrome EC and ISS.
Now we move the synchronous external abort injection logic to
userspace, when userspace injects the SEA exception to guest
OS, it needs to specify the far_el1 value, so this patch gives
the exception virtual address to userspace.
change since v5:
1. modify some patch description
Signed-off-by: Dongjiu Geng <[email protected]>
Signed-off-by: Quanming Wu <[email protected]>
---
arch/arm64/include/uapi/asm/kvm.h | 5 +++++
arch/arm64/kvm/guest.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 9f3ca24bbcc6..514261f682b8 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -181,6 +181,11 @@ struct kvm_arch_memory_slot {
#define KVM_REG_ARM64_SYSREG_OP2_MASK 0x0000000000000007
#define KVM_REG_ARM64_SYSREG_OP2_SHIFT 0
+/* AArch64 fault registers */
+#define KVM_REG_ARM64_FAULT (0x0014 << KVM_REG_ARM_COPROC_SHIFT)
+#define KVM_REG_ARM64_FAULT_ESR_EC_ISS (0)
+#define KVM_REG_ARM64_FAULT_FAR (1)
+
#define ARM64_SYS_REG_SHIFT_MASK(x,n) \
(((x) << KVM_REG_ARM64_SYSREG_ ## n ## _SHIFT) & \
KVM_REG_ARM64_SYSREG_ ## n ## _MASK)
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 5c7f657dd207..020a644b20d7 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -128,6 +128,39 @@ static int set_core_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
out:
return err;
}
+static int get_fault_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
+{
+ void __user *uaddr = (void __user *)(unsigned long)reg->addr;
+ u32 value;
+ u32 id = reg->id & ~(KVM_REG_ARCH_MASK |
+ KVM_REG_SIZE_MASK | KVM_REG_ARM64_FAULT);
+
+ switch (id) {
+ case KVM_REG_ARM64_FAULT_ESR_EC_ISS:
+ /*
+ * The user space may need to know the fault exception class
+ * ESR_ELx.EC and instruction specific syndrome ESR_ELx.ISS
+ */
+ value = kvm_vcpu_get_hsr(vcpu) & (ESR_ELx_EC_MASK | ESR_ELx_ISS_MASK);
+ if (copy_to_user(uaddr, &value, KVM_REG_SIZE(reg->id)) != 0)
+ return -EFAULT;
+ break;
+ case KVM_REG_ARM64_FAULT_FAR:
+ /*
+ * When user space injects synchronized abort, it needs
+ * to inject the faulting virtual address to guest OS, so
+ * needs to get it from kvm.
+ */
+ if (copy_to_user(uaddr, &(vcpu->arch.fault.far_el2),
+ KVM_REG_SIZE(reg->id)) != 0)
+ return -EFAULT;
+ break;
+ default:
+ return -ENOENT;
+ }
+ return 0;
+}
+
int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
{
@@ -243,6 +276,9 @@ int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
if ((reg->id & KVM_REG_ARM_COPROC_MASK) == KVM_REG_ARM_CORE)
return get_core_reg(vcpu, reg);
+ if ((reg->id & KVM_REG_ARM_COPROC_MASK) == KVM_REG_ARM64_FAULT)
+ return get_fault_reg(vcpu, reg);
+
if (is_timer_reg(reg->id))
return get_timer_reg(vcpu, reg);
--
2.14.1
Hi Dongjiu Geng,
On 28/08/17 11:38, Dongjiu Geng wrote:
> In the firmware-first RAS solution, corrupt data is detected in a
> memory location when guest OS application software executing at EL0
> or guest OS kernel El1 software are reading from the memory. The
> memory node records errors in an error record accessible using
> system registers.
>
> Because SCR_EL3.EA is 1, then CPU will trap to El3 firmware, EL3
> firmware records the error to APEI table through reading system
> register.
Strictly speaking these are CPER records in a memory region pointed to by the
HEST->GHES ACPI table.
> Because the error was taken from a lower Exception level, if the
> exception is SEA/SEI and HCR_EL2.TEA/HCR_EL2.AMO is 1, firmware
> sets ESR_EL2/FAR_EL2 to fake a exception trap to EL2, then
> transfers to hypervisor.
What happens if you took an SError from EL2 and EL2 has PSTATE.A set masking
SError? (this is very common today: all kernel code runs like this).
What happens if the hypervisor then executes an ESB with PSTATE.A set? It
expects to see any pending SError deferred and its syndrome written to DISR_EL1,
but this register is RAZ/WI when you set SCR_EL3.EA. '4.4.2' of [0]
> For the synchronous external abort(SEA), Hypervisor calls the
> ghes_handle_memory_failure() to deal with this error,
> ghes_handle_memory_failure() function reads the APEI table and
> callls memory_failure() to decide whether it needs to deliver
> SIGBUS signal to user space, the advantage of using SIGBUS signal
> to notify user space is that it can be compatible with Non-Kvm users.
>
> For the SError Interrupt(SEI),KVM firstly classified the error.
KVM can't parse the CPER records, nor does it know where to look to find them.
KVM should call out to the APEI code so the host kernel can handle the error.
User-space may be signalled by the memory_failure() helper, and user-space may
choose to notify the guest about the memory-failure, but this would be a new error.
> Not call memory_failure() to handle it. Because the error address recorded
> by APEI is not accurated, so can not identify the address to hwpoison
> memory.
This looks like a firmware bug, what address do you get in your CPER records? It
should be a physical address.
To report a memory-error you must have an address.
If the error wasn't detected as a synchronous access then delivering a
synchronous-external-abort is inappropriate (I think we both agree on this), and
SError-interrupt doesn't have a way of specifying an address ... but the CPER
records do.
For firmware-first your SError-interrupt is just a notification, its the CPER
records the OS uses to handle the error.
> If the SError error comes from guest user mode and is not propagated,
> then signal user space to handle it, otherwise, directly injects virtual
> SError, or panic if the error is fatal.
What do you mean by propagated?
I don't think we should ever hand RAS notifications to user-space, the host
kernel should handle them, then describe the symptom (e.g. this region of your
va space is gone) to user-space.
> when user space handles the error,
> it will specify syndrome for the injected virtual SError. This syndrome value
> is set to the VSESR_EL2. VSESR_EL2 is a new ARMv8.2 RAS extensions register
> which provides the syndrome value reported to software on taking a virtual
> SError interrupt exception.
Thanks,
James
[0]
https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf
Hi Dongjiu Geng,
On 28/08/17 11:38, Dongjiu Geng wrote:
> From: Xie XiuQi <[email protected]>
>
> ARM's v8.2 Extentions add support for Reliability, Availability and
> Serviceability (RAS). On CPUs with these extensions system software
> can use additional barriers to isolate errors and determine if faults
> are pending.
>
> Add cpufeature detection and a barrier in the context-switch code.
> There is no need to use alternatives for this as CPUs that don't
> support this feature will treat the instruction as a nop.
>
> Platform level RAS support may require additional firmware support.
>
> Signed-off-by: Xie XiuQi <[email protected]>
> [Rebased, added esb and config option, reworded commit message]
> Signed-off-by: James Morse <[email protected]>
Nit: when re-posting patches from the list you need to add your signed-off-by.
See Documentation/process/submitting-patches.rst 'Developer's Certificate of
Origin 1.1'
This goes for your patch 2 as well.
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index c845c8c04d95..7a17b4a1bd9e 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -370,6 +370,9 @@ __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
> */
> dsb(ish);
>
> + /* Deliver any pending SError from prev */
> + esb();
> +
This patch was sitting on top of the SError rework. As the cover-letter
describes that was all there to make sure SError is unmasked when we execute
this esb(). Without it any pending SError will be deferred, its ESR is written
to DISR_EL1, which this patch doesn't check.
On its own, this patch is actively harmful to systems that don't have
firmware-first handling.
We probably need to produce a combined series...
Thanks,
James
Hi Dongjiu Geng,
On 28/08/17 11:38, Dongjiu Geng wrote:
> In current code logic, the two functions ghes_sea_add() and
> ghes_sea_remove() are only called when CONFIG_ACPI_APEI_SEA
> is defined. If not, it will return errors in the ghes_probe()
> and not contiue. Hence, remove the unnecessary handling when
> CONFIG_ACPI_APEI_SEI is not defined.
This doesn't match what the patch does. I get this feeling this is needed for
some future patch you haven't included.
> change since v5:
> 1. remove the SEI notification type handling, because the SEI is
> asynchronous exception and the address is not accurate. so
> not call memory_failure() to handle it.
Setting NOTIFY_SEI as the GHES entry's notification type means the OS should
check the GHES->ErrorStatusAddress for CPER records when it receives an
SError-Interrupt, as it may be a notification of an error from this error source.
If you aren't handling the notification, why is this is in the HEST at all?
(and if its not: its not firmware-first)
James
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index d661d452b238..c15a08db2c7c 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -813,7 +813,6 @@ static struct notifier_block ghes_notifier_hed = {
> .notifier_call = ghes_notify_hed,
> };
>
> -#ifdef CONFIG_ACPI_APEI_SEA
> static LIST_HEAD(ghes_sea);
>
> /*
> @@ -848,19 +847,6 @@ static void ghes_sea_remove(struct ghes *ghes)
> mutex_unlock(&ghes_list_mutex);
> synchronize_rcu();
> }
> -#else /* CONFIG_ACPI_APEI_SEA */
> -static inline void ghes_sea_add(struct ghes *ghes)
> -{
> - pr_err(GHES_PFX "ID: %d, trying to add SEA notification which is not
supported\n",
> - ghes->generic->header.source_id);
> -}
> -
> -static inline void ghes_sea_remove(struct ghes *ghes)
> -{
> - pr_err(GHES_PFX "ID: %d, trying to remove SEA notification which is not
supported\n",
> - ghes->generic->header.source_id);
> -}
> -#endif /* CONFIG_ACPI_APEI_SEA */
>
> #ifdef CONFIG_HAVE_ACPI_APEI_NMI
> /*
>
Hi Dongjiu Geng,
On 28/08/17 11:38, Dongjiu Geng wrote:
> In ARMV8.2 RAS extension, a virtual SError exception syndrome
> register(VSESR_EL2) is added. This value may be specified from
> userspace.
I agree that the CPU support for injecting SErrors with a specified ESR should
be exposed to KVM's user space...
> Userspace will want to check if the CPU has the RAS
> extension.
... but user-space wants to know if it can inject SErrors with a specified ESR.
What if we gain another way of doing this that isn't via the RAS-extensions, now
user-space has to check for two capabilities.
> If it has, it wil specify the virtual SError syndrome
> value, otherwise it will not be set. This patch adds support for
> querying the availability of this extension.
I'm against telling user-space what features the CPU has unless it can use them
directly. In this case we are talking about a KVM API, so we should describe the
API not the CPU.
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 3256b9228e75..b7313ee028e9 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -77,6 +77,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_ARM_PMU_V3:
> r = kvm_arm_support_pmu_v3();
> break;
> + case KVM_CAP_ARM_RAS_EXTENSION:
This should be called something more like 'KVM_CAP_ARM_INJECT_SERROR_ESR'
> + r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
> + break;
> case KVM_CAP_SET_GUEST_DEBUG:
> case KVM_CAP_VCPU_ATTRIBUTES:
> r = 1;
We can inject SError on systems without the RAS extensions using just the
HCR_EL2.VSE bit. We may want to make the 'ESR' part of the API optional, or
expose '1' for the without-ESR version and '2 for with-ESR, (however we choose
to implement that).
The risk is if we want to add a without-ESR version later, and the name we make
ABI now turned out to be a mistake. Marc or Christoffer probably have the best
view of this. (no-one has needed it so far...)
Thanks,
James
Hi James
On 2017/9/1 1:43, James Morse wrote:
> Hi Dongjiu Geng,
>
> On 28/08/17 11:38, Dongjiu Geng wrote:
>> In the firmware-first RAS solution, corrupt data is detected in a
>> memory location when guest OS application software executing at EL0
>> or guest OS kernel El1 software are reading from the memory. The
>> memory node records errors in an error record accessible using
>> system registers.
>>
>> Because SCR_EL3.EA is 1, then CPU will trap to El3 firmware, EL3
>> firmware records the error to APEI table through reading system
>> register.
>
> Strictly speaking these are CPER records in a memory region pointed to by the
> HEST->GHES ACPI table.
yes, Here I mean EL3 firmware reads the RAS Error record register ERXxxx_EL1, such as
ERXADDR_EL1/ERXMISC0_EL1, to get the detailed error info, then record them to
HEST->GHES ACPI table in a memory region.
>
>
>> Because the error was taken from a lower Exception level, if the
>> exception is SEA/SEI and HCR_EL2.TEA/HCR_EL2.AMO is 1, firmware
>> sets ESR_EL2/FAR_EL2 to fake a exception trap to EL2, then
>> transfers to hypervisor.
>
> What happens if you took an SError from EL2 and EL2 has PSTATE.A set masking
> SError? (this is very common today: all kernel code runs like this).
Firstly, the guest OS usually runs in the El1 or El0, not El2.
if El2 happens an SError, it will trap to EL3 firmware even though the PSTATE.A is set.
Because the PSTATE.A can not mask it if the SError is trapped to EL3.
>
> What happens if the hypervisor then executes an ESB with PSTATE.A set? It
> expects to see any pending SError deferred and its syndrome written to DISR_EL1,
> but this register is RAZ/WI when you set SCR_EL3.EA. '4.4.2' of [0]
>From my understand, if the SCR_EL3.EA is set, the Abort can not mask, it always happen and
take to EL3, DISR_El1 can not record the syndrome. DISR_El1 is only recorded when
the External Abort is masked, but when SCR_EL3.EA is set, the pstate.A can not mask the Error.
>
>
>> For the synchronous external abort(SEA), Hypervisor calls the
>> ghes_handle_memory_failure() to deal with this error,
>> ghes_handle_memory_failure() function reads the APEI table and
>> callls memory_failure() to decide whether it needs to deliver
>> SIGBUS signal to user space, the advantage of using SIGBUS signal
>> to notify user space is that it can be compatible with Non-Kvm users.
>>
>> For the SError Interrupt(SEI),KVM firstly classified the error.
>
> KVM can't parse the CPER records, nor does it know where to look to find them.
> KVM should call out to the APEI code so the host kernel can handle the error.
KVM does not parse the CPER records, I mean KVM classified the error according to the esr_el2.AET.
As shown below:
AET, bits [12:10], when categorized
Asynchronous Error Type. Describes the state of the PE after taking the SError interrupt exception.
Software might use the information in the syndrome registers to determine what recovery might be
possible. See Architecturally consumed errors. The possible values of this field are:
0b000 Uncontainable error (UC).
0b001 Unrecoverable error (UEU).
0b010 Restartable error (UEO).
0b011 Recoverable error (UER).
0b110 Corrected error (CE).
I pasted the code here:
+static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu)
+{
+ unsigned int esr = kvm_vcpu_get_hsr(vcpu);
+ bool impdef_syndrome = esr & ESR_ELx_ISV; /* aka IDS */
+ unsigned int aet = esr & ESR_ELx_AET;
+
+ /*
+ * In below three conditions, it will directly inject the virtual SError.
+ * 1. Not support RAS extension; the Syndrome is IMPLEMENTATION DEFINED;
+ * AET is RES0 if 'the value returned in the DFSC field is not
+ * [ESR_ELx_FSC_SERROR]'
+ */
+ if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN) || impdef_syndrome ||
+ ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) {
+ kvm_inject_vabt(vcpu);
+ return 1;
+ }
+
+ switch (aet) {
+ case ESR_ELx_AET_CE: /* corrected error */
+ case ESR_ELx_AET_UEO: /* restartable error, not yet consumed */
+ return 0; /* continue processing the guest exit */
+ case ESR_ELx_AET_UEU: /* The error has not been propagated */
+ /*
+ * Only handle the guest user mode SEI if the error has not been propagated
+ */
+ if ((!vcpu_mode_priv(vcpu)) && !handle_guest_sei(kvm_vcpu_get_hsr(vcpu)))
+ return 1;
+
+ /* If SError handling is failed, continue run */
+ default:
+ /*
+ * Until now, the CPU supports RAS and SEI is fatal, or user space
+ * does not support to handle the SError.
+ */
+ panic("This Asynchronous SError interrupt is dangerous, panic");
+ }
+}
+
I have called the kernel code to handle the SError.
+/*
+ * Handle SError interrupt that occur in guest OS.
+ *
+ * The return value will be zero if the SEI was successfully handled
+ * and non-zero if handling is failed.
+ */
+int handle_guest_sei(unsigned int esr)
+{
+ int ret = 0;
+
+ struct siginfo info;
+
+ pr_alert("Asynchronous SError interrupt detected on CPU: %d, esr: %x\n",
+ smp_processor_id(), esr);
+
+ info.si_signo = SIGBUS;
+ info.si_errno = 0;
+ info.si_code = BUS_MCEERR_AR;
+ /*
+ * Because the address is not accurate for Asynchronous Aborts, so set
+ * NULL for the fault address
+ */
+ info.si_addr = NULL;
+
+ ret = force_sig_info(SIGBUS, &info, current);
+ if (ret < 0)
+ pr_info("Handle guest SEI: Error sending signal to %s:%d: %d\n",
+ current->comm, current->pid, ret);
+ else
+ ret = 0;
+
+ return ret;
+}
+
>
> User-space may be signalled by the memory_failure() helper, and user-space may
> choose to notify the guest about the memory-failure, but this would be a new error.
For the SError, it is asynchronous abort. so it is not better to call memory_failure() helper, because the error
address is not accurate. memory_failure() will offline or poison the address, but the address is not accurate. so it is
dangerous
>
>
>> Not call memory_failure() to handle it. Because the error address recorded
>> by APEI is not accurated, so can not identify the address to hwpoison
>> memory.
>
> This looks like a firmware bug, what address do you get in your CPER records? It
> should be a physical address.
No, not firmware bug. At least in the armv8.0 CPU and huawei's armv8.2 CPU, the architecture
decided it is not accurate, this abort is asynchronous not synchronous.
>
> To report a memory-error you must have an address.
maybe we can not get the accurate error address, can you get it in your armv8 platform?
>
> If the error wasn't detected as a synchronous access then delivering a
> synchronous-external-abort is inappropriate (I think we both agree on this), and
> SError-interrupt doesn't have a way of specifying an address ... but the CPER
> records do.
In our platform, the physical address in CPER records also does not accurate for SEI. do you have accurate error physical address
for the SEI?
>
> For firmware-first your SError-interrupt is just a notification, its the CPER
> records the OS uses to handle the error.
I can call APEI kernel driver to parse the severity except doing recovery for the address.
May be it is not better handle the physical address that recorded in the CPER, because the error address is not accurate
>
>
>> If the SError error comes from guest user mode and is not propagated,
>> then signal user space to handle it, otherwise, directly injects virtual
>> SError, or panic if the error is fatal.
>
> What do you mean by propagated?
That is, the error was detected but not consumed
>
> I don't think we should ever hand RAS notifications to user-space, the host
> kernel should handle them, then describe the symptom (e.g. this region of your
> va space is gone) to user-space.
For guest OS user space SError, if it is not propagated. let guest OS handles, for example,killing
this APP is better, which can avoid terminate the whole guest OS.
if only host kernel handle them, it can not kill guest os APP.
>
>
>> when user space handles the error,
>> it will specify syndrome for the injected virtual SError. This syndrome value
>> is set to the VSESR_EL2. VSESR_EL2 is a new ARMv8.2 RAS extensions register
>> which provides the syndrome value reported to software on taking a virtual
>> SError interrupt exception.
>
>
> Thanks,
>
> James
>
> [0]
> https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf
>
>
> .
>
James,
On 2017/9/1 1:44, James Morse wrote:
> Hi Dongjiu Geng,
>
> On 28/08/17 11:38, Dongjiu Geng wrote:
>> From: Xie XiuQi <[email protected]>
>>
>> ARM's v8.2 Extentions add support for Reliability, Availability and
>> Serviceability (RAS). On CPUs with these extensions system software
>> can use additional barriers to isolate errors and determine if faults
>> are pending.
>>
>> Add cpufeature detection and a barrier in the context-switch code.
>> There is no need to use alternatives for this as CPUs that don't
>> support this feature will treat the instruction as a nop.
>>
>> Platform level RAS support may require additional firmware support.
>>
>> Signed-off-by: Xie XiuQi <[email protected]>
>> [Rebased, added esb and config option, reworded commit message]
>> Signed-off-by: James Morse <[email protected]>
>
> Nit: when re-posting patches from the list you need to add your signed-off-by.
> See Documentation/process/submitting-patches.rst 'Developer's Certificate of
> Origin 1.1'
Ok, thanks for the your pointing out.
>
> This goes for your patch 2 as well.
>
>
>> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
>> index c845c8c04d95..7a17b4a1bd9e 100644
>> --- a/arch/arm64/kernel/process.c
>> +++ b/arch/arm64/kernel/process.c
>> @@ -370,6 +370,9 @@ __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
>> */
>> dsb(ish);
>>
>> + /* Deliver any pending SError from prev */
>> + esb();
>> +
>
> This patch was sitting on top of the SError rework. As the cover-letter
> describes that was all there to make sure SError is unmasked when we execute
> this esb(). Without it any pending SError will be deferred, its ESR is written
> to DISR_EL1, which this patch doesn't check.
>
> On its own, this patch is actively harmful to systems that don't have
> firmware-first handling.
>
> We probably need to produce a combined series...
OK, thanks for your reminder and detailed explanation.
>
>
> Thanks,
>
> James
>
>
> .
>
Hi James,
On 2017/9/1 1:50, James Morse wrote:
> Hi Dongjiu Geng,
>
> On 28/08/17 11:38, Dongjiu Geng wrote:
>> In current code logic, the two functions ghes_sea_add() and
>> ghes_sea_remove() are only called when CONFIG_ACPI_APEI_SEA
>> is defined. If not, it will return errors in the ghes_probe()
>> and not contiue. Hence, remove the unnecessary handling when
>> CONFIG_ACPI_APEI_SEI is not defined.
>
> This doesn't match what the patch does. I get this feeling this is needed for
> some future patch you haven't included.
James, let check the code, when the ghes_probe, if the CONFIG_ACPI_APEI_SEA is not defined.
it will return -ENOTSUPP and goto error, and the ghes_sea_add has no chance to execute.
similar, if the probe is failed, it should not have chance to execute ghes_sea_remove.
static int ghes_probe(struct platform_device *ghes_dev)
{
struct acpi_hest_generic *generic;
struct ghes *ghes = NULL;
int rc = -EINVAL;
generic = *(struct acpi_hest_generic **)ghes_dev->dev.platform_data;
if (!generic->enabled)
return -ENODEV;
switch (generic->notify.type) {
case ACPI_HEST_NOTIFY_POLLED:
case ACPI_HEST_NOTIFY_EXTERNAL:
case ACPI_HEST_NOTIFY_SCI:
case ACPI_HEST_NOTIFY_GSIV:
case ACPI_HEST_NOTIFY_GPIO:
break;
case ACPI_HEST_NOTIFY_SEA:
if (!IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
pr_warn(GHES_PFX "Generic hardware error source: %d notified via SEA is not supported\n",
generic->header.source_id);
rc = -ENOTSUPP;
goto err;
}
break;
>
>
>> change since v5:
>> 1. remove the SEI notification type handling, because the SEI is
>> asynchronous exception and the address is not accurate. so
>> not call memory_failure() to handle it.
>
> Setting NOTIFY_SEI as the GHES entry's notification type means the OS should
> check the GHES->ErrorStatusAddress for CPER records when it receives an
> SError-Interrupt, as it may be a notification of an error from this error source.
previously I added the NOTIFY_SEI support, but consider the error address in CPER is not accurate and calling memory_failure() may not make sense.
so I remove it.
>
> If you aren't handling the notification, why is this is in the HEST at all?
> (and if its not: its not firmware-first)
For the SEI notification, may be we can parse and handle the CPER record other than the Error physical address
>
>
> James
>
>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index d661d452b238..c15a08db2c7c 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -813,7 +813,6 @@ static struct notifier_block ghes_notifier_hed = {
>> .notifier_call = ghes_notify_hed,
>> };
>>
>> -#ifdef CONFIG_ACPI_APEI_SEA
>> static LIST_HEAD(ghes_sea);
>>
>> /*
>> @@ -848,19 +847,6 @@ static void ghes_sea_remove(struct ghes *ghes)
>> mutex_unlock(&ghes_list_mutex);
>> synchronize_rcu();
>> }
>> -#else /* CONFIG_ACPI_APEI_SEA */
>> -static inline void ghes_sea_add(struct ghes *ghes)
>> -{
>> - pr_err(GHES_PFX "ID: %d, trying to add SEA notification which is not
> supported\n",
>> - ghes->generic->header.source_id);
>> -}
>> -
>> -static inline void ghes_sea_remove(struct ghes *ghes)
>> -{
>> - pr_err(GHES_PFX "ID: %d, trying to remove SEA notification which is not
> supported\n",
>> - ghes->generic->header.source_id);
>> -}
>> -#endif /* CONFIG_ACPI_APEI_SEA */
>>
>> #ifdef CONFIG_HAVE_ACPI_APEI_NMI
>> /*
>>
>
>
> .
>
James,
On 2017/9/1 2:04, James Morse wrote:
> Hi Dongjiu Geng,
>
> On 28/08/17 11:38, Dongjiu Geng wrote:
>> In ARMV8.2 RAS extension, a virtual SError exception syndrome
>> register(VSESR_EL2) is added. This value may be specified from
>> userspace.
>
> I agree that the CPU support for injecting SErrors with a specified ESR should
> be exposed to KVM's user space...
Ok, thanks.
>
>
>> Userspace will want to check if the CPU has the RAS
>> extension.
>
> ... but user-space wants to know if it can inject SErrors with a specified ESR.
>
> What if we gain another way of doing this that isn't via the RAS-extensions, now
> user-space has to check for two capabilities.
>
>
>> If it has, it wil specify the virtual SError syndrome
>> value, otherwise it will not be set. This patch adds support for
>> querying the availability of this extension.
>
> I'm against telling user-space what features the CPU has unless it can use them
> directly. In this case we are talking about a KVM API, so we should describe the
> API not the CPU.
shenglong ([email protected]) who is Qemu maintainer suggested checking the CPU RAS-extensions
to decide whether generate the APEI table and record CPER for the guest OS in the user space.
he means if the host does not support RAS, user space may also not support RAS.
>
>
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index 3256b9228e75..b7313ee028e9 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -77,6 +77,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext)
>> case KVM_CAP_ARM_PMU_V3:
>> r = kvm_arm_support_pmu_v3();
>> break;
>> + case KVM_CAP_ARM_RAS_EXTENSION:
>
> This should be called something more like 'KVM_CAP_ARM_INJECT_SERROR_ESR'
I understand your suggestion.
>
>
>> + r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
>> + break;
>> case KVM_CAP_SET_GUEST_DEBUG:
>> case KVM_CAP_VCPU_ATTRIBUTES:
>> r = 1;
>
>
> We can inject SError on systems without the RAS extensions using just the
> HCR_EL2.VSE bit. We may want to make the 'ESR' part of the API optional, or
> expose '1' for the without-ESR version and '2 for with-ESR, (however we choose
> to implement that).
>
> The risk is if we want to add a without-ESR version later, and the name we make
> ABI now turned out to be a mistake. Marc or Christoffer probably have the best
> view of this. (no-one has needed it so far...)
>
>
> Thanks,
>
> James
>
>
> .
>
On 28 August 2017 at 11:38, Dongjiu Geng <[email protected]> wrote:
> In the firmware-first RAS solution, corrupt data is detected in a
> memory location when guest OS application software executing at EL0
> or guest OS kernel El1 software are reading from the memory. The
> memory node records errors in an error record accessible using
> system registers.
Hi Dongjiu -- it looks like this patch set is extending
the API KVM provides to userspace, but it doesn't update
the documentation in Documentation/virtual/kvm/api.txt.
I appreciate the API is still somewhat under discussion,
but if you can include the docs updates it's helpful to
me for reviewing whether the API makes sense from the
userspace consumer end of it.
thanks
-- PMM
Hi Peter,
On 2017/9/6 19:19, Peter Maydell wrote:
> On 28 August 2017 at 11:38, Dongjiu Geng <[email protected]> wrote:
>> In the firmware-first RAS solution, corrupt data is detected in a
>> memory location when guest OS application software executing at EL0
>> or guest OS kernel El1 software are reading from the memory. The
>> memory node records errors in an error record accessible using
>> system registers.
>
> Hi Dongjiu -- it looks like this patch set is extending
> the API KVM provides to userspace, but it doesn't update
> the documentation in Documentation/virtual/kvm/api.txt.
> I appreciate the API is still somewhat under discussion,
> but if you can include the docs updates it's helpful to
> me for reviewing whether the API makes sense from the
> userspace consumer end of it.
sure, it should. thanks a lot for the reminder. I will update the related docs
in my next patch set version.
>
> thanks
> -- PMM
>
> .
>
Hi Dongjiu Geng,
On 28/08/17 11:38, Dongjiu Geng wrote:
> when userspace gets SIGBUS signal, it does not know whether
> this is a synchronous external abort or SError,
Why would Qemu/kvmtool need to know if the original notification (if there was
one) was synchronous or asynchronous? This is between firmware and the kernel.
I think I can see why you need this: to choose whether to emulate SEA or SEI,
but what if the guest wasn't running? Or the guest was running, but it wasn't
guest-memory that is affected.
What happens if the dram-scrub hardware spots an error in guest memory, but the
guest wasn't running? KVM won't have a relevant ESR value to give you.
What happens if we start swapping a page of guest memory to disk, and discover
the memory is corrupt. This is synchronous, but it wasn't the guest, and KVM
still can't give you an ESR.
What about CPER records discovered through the polled interface? What happens if
I write a PFN into the corrupt-pfn sysfs interface?
I think what you need is some way of knowing if the BUS_MCEERR_A* was directly
caused by a user-space (or guest) access, and if so was it a data or instruction
fetch. These can become SEA notifications.
KVM's user-space shouldn't be a special-case where the kernel behaves
differently: if we tinker with this it needs to make sense for all user space
processes and mean something on all architectures.
I think this information could be useful to other users of these signals, e.g. a
JVM could silently regenerate/reload code/data for a non-direct-access fault
instead of exit-ing (or throwing an exception) for a direct access.
For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by an
access or not. When the mm code gets -EHWPOISON when trying to resolve a
user-space fault we know it was due to a direct-access. (I don't know if/how x86
can know if it was code or data). Faulting guest accesses through KVM are just a
special version of this where KVM fixes-up stage2.
... but for any of this to work we need the address of the corrupt memory.
(-> cover letter)
Thanks,
James
Hi Dongjiu Geng,
On 28/08/17 11:38, Dongjiu Geng wrote:
> ARMv8.2 adds a new bit HCR_EL2.TEA which controls to
> route synchronous external aborts to EL2, and adds a
> trap control bit HCR_EL2.TERR which controls to
> trap all Non-secure EL1&0 error record accesses to EL2.
>
> This patch enables the two bits for the guest OS.
> when an synchronous abort is generated in the guest OS,
> it will trap to EL3 firmware, EL3 firmware will check the
> HCR_EL2.TEA value to decide to jump to hypervisor or host
> OS.
(This is what you are using this for, the patch has nothing to do with EL3.)
> Enabling HCR_EL2.TERR makes error record access
> from guest trap to EL2.
KVM already handles external aborts from lower exception levels, no more work
needs doing for TEA.
What happens when a guest access the RAS-Error-Record registers?
Before we can set HCR_EL2.TERR I think we need to add some minimal emulation for
the registers it traps. Most of them should be RAZ/WI, so it should be
straightforward. (I think KVMs default is to emulate an undef for unknown traps).
Eventually we will want to back this with a page of memory that lets
Qemu/kvmtool configure what the guest can see. (i.e. the emulated machine's
errors for kernel-first handling.)
Thanks,
James
Hi gengdongjiu,
On 05/09/17 08:18, gengdongjiu wrote:
> On 2017/9/1 2:04, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> Userspace will want to check if the CPU has the RAS
>>> extension.
>>
>> ... but user-space wants to know if it can inject SErrors with a specified ESR.
>>
>> What if we gain another way of doing this that isn't via the RAS-extensions, now
>> user-space has to check for two capabilities.
>>
>>
>>> If it has, it wil specify the virtual SError syndrome
>>> value, otherwise it will not be set. This patch adds support for
>>> querying the availability of this extension.
>>
>> I'm against telling user-space what features the CPU has unless it can use them
>> directly. In this case we are talking about a KVM API, so we should describe the
>> API not the CPU.
>
> shenglong ([email protected]) who is Qemu maintainer suggested checking the CPU RAS-extensions
> to decide whether generate the APEI table and record CPER for the guest OS in the user space.
> he means if the host does not support RAS, user space may also not support RAS.
The code to signal memory-failure to user-space doesn't depend on the CPU's
RAS-extensions.
If Qemu supports notifying the guest about RAS errors using CPER records, it
should generate a HEST describing firmware first. It can then choose the
notification methods, some of which may require optional KVM APIs to support.
Seattle has a HEST, it doesn't support the CPU RAS-extensions. The kernel can
notify user-space about memory_failure() on this machine. I would expect Qemu to
be able to receive signals and describe memory errors to a guest (1).
The question should be: 'How can Qemu know it can use SEI as a firmware-first
notification?' It needs a KVM API to trigger an SError in the guest with a
specified ESR. The name of the KVM CAP needs to reflect the API (2).
Just because this is the first KVM API that needs the CPU to have the RAS
extensions doesn't mean we should call it 'has RAS' and be done with it.
We will eventually need another KVM API to configure trapping and emulating
values in the RAS ERR registers so that Qemu can emulate a machine without
firmware-first. (This is likely to be a page of memory that backs the registers,
there will need to be another KVM CAP to describe this support (3)).
Exposing the CPUs support for RAS-extensions to support (2) means having
per-platform support for (1). This is either creating extra work, or not
supporting as many platforms as we could. Both are bad.
Once we have (3) as well, any developer needs to know that 'has RAS' just meant
the first API KVM implemented using RAS, and doesn't mean later APIs also using
RAS are supported by the kernel.
Thanks,
James
Hi gengdongjiu,
(I've re-ordered some of the hunks here:)
On 04/09/17 12:10, gengdongjiu wrote:
> On 2017/9/1 1:43, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> Not call memory_failure() to handle it. Because the error address recorded
>>> by APEI is not accurated, so can not identify the address to hwpoison
>>> memory.
>>
>> This looks like a firmware bug, what address do you get in your CPER
>> records? It should be a physical address.
> No, not firmware bug. At least in the armv8.0 CPU and huawei's armv8.2 CPU,
> the architecture decided it is not accurate, this abort is asynchronous not
> synchronous.
This is going to be a problem. (I'm chasing Achin to find out when this is
allowed to happen and what we're expected to do about it!)
I hope this isn't the default behaviour, but only happens in exceptionally rare
circumstances.
>> To report a memory-error you must have an address.
> maybe we can not get the accurate error address, can you get it in your armv8
> platform?
I only have software-models, they only generate the errors you tell them to.
I think I see why you're taking this approach with the series, the scenario is:
1. Firmware takes an SError due to a bad memory location from guest EL0.
2. The CPU doesn't provide the address of the memory location.
You want to confine this error as much as possible, in particular to the context
it came from (e.g. guest EL0). CPU context isn't something the CPER records can
describe (they describe failures in system components), hence your hybrid
{kernel,firmware}-first code.
I don't think its safe to kill guest-EL0 and hope this confined the error.
If the affected page of guest memory has never been written to by the guest, the
host will map in the global zero-page, (made read-only at stage2). If the
corruption is in this page it affects the host kernel, guests and user-space
processes. Just because the error came from guest-EL0 doesn't mean
kernel/hypervisor memory isn't affected.
This doesn't just affect that one page: KSM may have merged every copy of every
guest user-space's libc, which has subsequently become corrupt. The first
guest-EL0 to step in this triggers the fault, but it affects all the guests.
With the address all the guests can fix this error, and KSM will re-merge the
pages. Without the address every user-space process in every guest will
eventually be killed.
We aren't even guaranteed that the access that caused the fault came from your
guest EL0. The fault may be in the page tables belonging to the guest kernel,
even worse they may belong to they hypervisor's stage2 page tables.
(Thanks to Mark and Robin for these examples)
I think in this scenario your firmware should describe a memory-error with an
unknown address. (i.e. don't set the 'physical address valid' bit in CPER's
'Table 275 Memory Error Record'). When Linux gets one of these, it should
panic(): We know some memory is corrupt, we don't know where it is.
>> User-space may be signalled by the memory_failure() helper, and user-space >>
may choose to notify the guest about the memory-failure, but this would be a
>> new error.
> For the SError, it is asynchronous abort. so it is not better to call
> memory_failure() helper, because the error address is not accurate.
> memory_failure() will offline or poison the address, but the address is not
> accurate. so it is dangerous
By 'not accurate' do you mean the CPU provides an address, and its wrong.
(surely this is a CPU bug), or just no address is provided. (i.e. the
ERR<n>ADDR.AI 'address incorrect' bit is set).
>>> Because the error was taken from a lower Exception level, if the
>>> exception is SEA/SEI and HCR_EL2.TEA/HCR_EL2.AMO is 1, firmware
>>> sets ESR_EL2/FAR_EL2 to fake a exception trap to EL2, then
>>> transfers to hypervisor.
>>
>> What happens if you took an SError from EL2 and EL2 has PSTATE.A set masking
>> SError? (this is very common today: all kernel code runs like this).
> Firstly, the guest OS usually runs in the El1 or El0, not El2.
> if El2 happens an SError, it will trap to EL3 firmware even though the PSTATE.A is set.
> Because the PSTATE.A can not mask it if the SError is trapped to EL3.
Sure, we agree that from the CPU's view when SCR_EL3.EA is set physical-SError
can't be masked when executing any EL below EL3.
My question was about the 'firmware sets ESR_EL2/FAR_EL2 to fake an exception
trap to EL2' step. While EL3 can take the physical-SError at any time the
normal-world is running, it can't always deliver a fake-SError to EL2, because
EL2 believes it has masked physical-SError.
With the SError rework this should only be masked while we are in entry.S
preparing to handle an exception, receiving an unexpected asynchronous exception
at this point would overwrite ELR/ESR, meaning we could never handle the
original exception.
>> What happens if the hypervisor then executes an ESB with PSTATE.A set? It
>> expects to see any pending SError deferred and its syndrome written to DISR_EL1,
>> but this register is RAZ/WI when you set SCR_EL3.EA. '4.4.2' of [0]
> From my understand, if the SCR_EL3.EA is set, the Abort can not mask, it always happen and
> take to EL3, DISR_El1 can not record the syndrome. DISR_El1 is only recorded when
> the External Abort is masked, but when SCR_EL3.EA is set, the pstate.A can not mask the Error.
But once EL3 wants to notify EL2, and EL2 believes it has SError masked, (even
if the CPU knows its going route physical-SError to EL3) what do you do?
(The best I can think of is that if firmware has to deliver a RAS Error as a
fake-SError, and the target exception-level has SError masked, then firmware
should reboot normal-world and deliver the error via the BERT. To support
NOTIFY_SEI the OS should aim to mask SError as little as possible.)
>>> For the synchronous external abort(SEA), Hypervisor calls the
>>> ghes_handle_memory_failure() to deal with this error,
>>> ghes_handle_memory_failure() function reads the APEI table and
>>> callls memory_failure() to decide whether it needs to deliver
>>> SIGBUS signal to user space, the advantage of using SIGBUS signal
>>> to notify user space is that it can be compatible with Non-Kvm users.
>>>
>>> For the SError Interrupt(SEI),KVM firstly classified the error.
>>
>> KVM can't parse the CPER records, nor does it know where to look to find them.
>> KVM should call out to the APEI code so the host kernel can handle the error.
> KVM does not parse the CPER records, I mean KVM classified the error according to the esr_el2.AET.
Decoding the AET bits in KVM is stub code for systems without firmware-first.
This will eventually be a call-out to some arm64 arch code to decode the RAS ERR
records and do kernel-first handling.
All the GHES notification methods mean 'go read the CPER records'. The CPER
records then contain all the information, including severities. The SError ESR
value should be irrelevant if the host supports firmware-first.
Without firmware-first or kernel-first we decode the SError ESR to know whether
or not to panic(). This is the minimum-work to avoid data corruption while Linux
only supports firmware-first and the platform expects kernel-first.
Thanks,
James
Hi gengdongjiu,
On 04/09/17 12:43, gengdongjiu wrote:
> On 2017/9/1 1:50, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> In current code logic, the two functions ghes_sea_add() and
>>> ghes_sea_remove() are only called when CONFIG_ACPI_APEI_SEA
>>> is defined. If not, it will return errors in the ghes_probe()
>>> and not contiue. Hence, remove the unnecessary handling when
>>> CONFIG_ACPI_APEI_SEI is not defined.
>>
>> This doesn't match what the patch does. I get this feeling this is needed for
>> some future patch you haven't included.
>
> James, let check the code, when the ghes_probe, if the CONFIG_ACPI_APEI_SEA is not defined.
> it will return -ENOTSUPP and goto error, and the ghes_sea_add has no chance to execute.
> similar, if the probe is failed, it should not have chance to execute ghes_sea_remove.
It's the 'unnecessary handling when CONFIG_ACPI_APEI_SEI' in the commit message
that confuses me: this patch doesn't reference that Kconfig symbol. I guess that
sentence needs removing for this v6?
Re-reading without that part of the commit-message:
You're relying on the compiler's dead-code elimination to spot unused static
functions and silently drop them. Great!
(there is the small risk that gcc 3.2[0] can't do this, x86 still has to support
this gcc version)
As this is just clean-up patch can you break it out of this series, it isn't
needed to add support for SEI.
(This series adds support for what should be an APEI notification, but the only
code that touches APEI removes some code from a different notification method.)
>> Setting NOTIFY_SEI as the GHES entry's notification type means the OS should
>> check the GHES->ErrorStatusAddress for CPER records when it receives an
>> SError-Interrupt, as it may be a notification of an error from this error source.
> previously I added the NOTIFY_SEI support,
(Yes, I saw that in v5 and expected this series to add some APEI support code )
> but consider the error address in CPER is not accurate and calling memory_failure() may not make sense.
> so I remove it.
'not accurate'... this is going to be a problem, but lets keep that discussion
on the cover-letter.
>> If you aren't handling the notification, why is this is in the HEST at all?
>> (and if its not: its not firmware-first)
> For the SEI notification, may be we can parse and handle the CPER record other than the Error physical address
Sure, but I only see this cleanup patch in this series, where does APEI learn
about NOTIFY_SEI? As this is nothing will ever touch those CPER records, if
you're using GHESv2 firmware will be prevented from delivering subsequent
notifications.
Thanks,
James
[0]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/admin-guide/README.rst#n251
James,
Thanks for the review.
On 2017/9/9 2:17, James Morse wrote:
> Hi gengdongjiu,
>
> On 04/09/17 12:43, gengdongjiu wrote:
>> On 2017/9/1 1:50, James Morse wrote:
>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>>> In current code logic, the two functions ghes_sea_add() and
>>>> ghes_sea_remove() are only called when CONFIG_ACPI_APEI_SEA
>>>> is defined. If not, it will return errors in the ghes_probe()
>>>> and not contiue. Hence, remove the unnecessary handling when
>>>> CONFIG_ACPI_APEI_SEI is not defined.
>>>
>>> This doesn't match what the patch does. I get this feeling this is needed for
>>> some future patch you haven't included.
>>
>> James, let check the code, when the ghes_probe, if the CONFIG_ACPI_APEI_SEA is not defined.
>> it will return -ENOTSUPP and goto error, and the ghes_sea_add has no chance to execute.
>> similar, if the probe is failed, it should not have chance to execute ghes_sea_remove.
>
> It's the 'unnecessary handling when CONFIG_ACPI_APEI_SEI' in the commit message
> that confuses me: this patch doesn't reference that Kconfig symbol. I guess that
> sentence needs removing for this v6?
thanks for the pointing out, That needs to be removed for v6.
>
> Re-reading without that part of the commit-message:
>
> You're relying on the compiler's dead-code elimination to spot unused static
> functions and silently drop them. Great!
> (there is the small risk that gcc 3.2[0] can't do this, x86 still has to support
> this gcc version)
>
> As this is just clean-up patch can you break it out of this series, it isn't
> needed to add support for SEI.
sure, I will.
>
> (This series adds support for what should be an APEI notification, but the only
> code that touches APEI removes some code from a different notification method.)
understand.
>
>
>>> Setting NOTIFY_SEI as the GHES entry's notification type means the OS should
>>> check the GHES->ErrorStatusAddress for CPER records when it receives an
>>> SError-Interrupt, as it may be a notification of an error from this error source.
>
>> previously I added the NOTIFY_SEI support,
>
> (Yes, I saw that in v5 and expected this series to add some APEI support code )
>
>
>> but consider the error address in CPER is not accurate and calling memory_failure() may not make sense.
>> so I remove it.
>
> 'not accurate'... this is going to be a problem, but lets keep that discussion
> on the cover-letter.
Ok.
>
>
>>> If you aren't handling the notification, why is this is in the HEST at all?
>>> (and if its not: its not firmware-first)
>
>> For the SEI notification, may be we can parse and handle the CPER record other than the Error physical address
>
> Sure, but I only see this cleanup patch in this series, where does APEI learn
> about NOTIFY_SEI? As this is nothing will ever touch those CPER records, if
> you're using GHESv2 firmware will be prevented from delivering subsequent
> notifications.
James, whether it is possible you can review the previous v5 patch which adds the support for NOTIFY_SEI? thanks in advancecIn that patch, I share the SEI notification handling with the SEA notification handling to avoid duplicated code.
https://www.spinics.net/lists/arm-kernel/msg601767.html
>
>
> Thanks,
>
> James
>
> [0]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/admin-guide/README.rst#n251
>
> .
>
Hi James,
On 2017/9/8 0:30, James Morse wrote:
> Hi Dongjiu Geng,
>
> On 28/08/17 11:38, Dongjiu Geng wrote:
>> when userspace gets SIGBUS signal, it does not know whether
>> this is a synchronous external abort or SError,
>
> Why would Qemu/kvmtool need to know if the original notification (if there was
> one) was synchronous or asynchronous? This is between firmware and the kernel.
there are two reasons:
1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors.
2. when record the CPER in the user space, it needs to know the error type, because SEA and SEI are different Error source,
so they have different offset in the APEI table, that is to say they will be recorded to different place of the APEI table.
etc/acpi/tables etc/hardware_errors
==================== ==========================================
+ +--------------------------+ +------------------+
| | HEST | | address | +--------------+
| +--------------------------+ | registers | | Error Status |
| | GHES0 | | +----------------+ | Data Block 0 |
| +--------------------------+ +--------->| |status_address0 |------------->| +------------+
| | ................. | | | +----------------+ | | CPER |
| | error_status_address-----+-+ +------->| |status_address1 |----------+ | | CPER |
| | ................. | | | +----------------+ | | | .... |
| | read_ack_register--------+-+ | | ............. | | | | CPER |
| | read_ack_preserve | | | +------------------+ | | +-+------------+
| | read_ack_write | | | +----->| |status_address10|--------+ | | Error Status |
+ +--------------------------+ | | | | +----------------+ | | | Data Block 1 |
| | GHES1 | +-+-+----->| | ack_value0 | | +-->| +------------+
+ +--------------------------+ | | | +----------------+ | | | CPER |
| | ................. | | | +--->| | ack_value1 | | | | CPER |
| | error_status_address-----+---+ | | | +----------------+ | | | .... |
| | ................. | | | | | ............. | | | | CPER |
| | read_ack_register--------+-----+-+ | +----------------+ | +-+------------+
| | read_ack_preserve | | +->| | ack_value10 | | | |.......... |
| | read_ack_write | | | | +----------------+ | | +------------+
+ +--------------------------| | | | | Error Status |
| | ............... | | | | | Data Block 10|
+ +--------------------------+ | | +---->| +------------+
| | GHES10 | | | | | CPER |
+ +--------------------------+ | | | | CPER |
| | ................. | | | | | .... |
| | error_status_address-----+-----+ | | | CPER |
| | ................. | | +-+------------+
| | read_ack_register--------+---------+
| | read_ack_preserve |
| | read_ack_write |
+ +--------------------------+
>
>
> I think I can see why you need this: to choose whether to emulate SEA or SEI,
emulating SEA or SEI is one reason, another reason is that the CPER will be recorded to different place of APEI.
> but what if the guest wasn't running? Or the guest was running, but it wasn't
> guest-memory that is affected.
If the guest was not running, host firmware will directly notify EL1 host kernel to handle the error, not notify hypervisor
only if the guest was running host firmware can notify the Error to hypervisor.
If the user space is Qemu, and the error is from Qemu, and guest-memory is not involve.
I will not handle it, please see the code for arm64.
void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
{
ram_addr_t ram_addr;
hwaddr paddr;
ARMCPU *cpu = ARM_CPU(c);
CPUARMState *env = &cpu->env;
assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
if (addr) {
ram_addr = qemu_ram_addr_from_host(addr);
if (ram_addr != RAM_ADDR_INVALID &&
kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
kvm_cpu_synchronize_state(c);
kvm_hwpoison_page_add(ram_addr);
if (is_abort_sea(env->exception.syndrome)) {
ghes_update_guest(ACPI_HEST_NOTIFY_SEA, paddr);
kvm_inject_arm_sea(c);
} else if (is_abort_sei(env->exception.syndrome)) {
ghes_update_guest(ACPI_HEST_NOTIFY_SEI, paddr);
kvm_inject_arm_sei(c);
}
return;
}
fprintf(stderr, "Hardware memory error for memory used by "
"QEMU itself instead of guest system!\n");
}
if (code == BUS_MCEERR_AR) {
fprintf(stderr, "Hardware memory error!\n");
exit(1);
}
}
For the x86, it also does not handle it, it only print "Hardware memory error for memory used by QEMU itself instead of guest system!"
void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
{
X86CPU *cpu = X86_CPU(c);
CPUX86State *env = &cpu->env;
ram_addr_t ram_addr;
hwaddr paddr;
/* If we get an action required MCE, it has been injected by KVM
* while the VM was running. An action optional MCE instead should
* be coming from the main thread, which qemu_init_sigbus identifies
* as the "early kill" thread.
*/
assert(code == BUS_MCEERR_AR || code == BUS_MCEERR_AO);
if ((env->mcg_cap & MCG_SER_P) && addr) {
ram_addr = qemu_ram_addr_from_host(addr);
if (ram_addr != RAM_ADDR_INVALID &&
kvm_physical_memory_addr_from_host(c->kvm_state, addr, &paddr)) {
kvm_hwpoison_page_add(ram_addr);
kvm_mce_inject(cpu, paddr, code);
return;
}
fprintf(stderr, "Hardware memory error for memory used by "
"QEMU itself instead of guest system!\n");
}
if (code == BUS_MCEERR_AR) {
hardware_memory_error();
}
/* Hope we are lucky for AO MCE */
}
>
> What happens if the dram-scrub hardware spots an error in guest memory, but the
> guest wasn't running? KVM won't have a relevant ESR value to give you.
if the dram-scrub hardware spots an error in guest memory, it will generate
IRQ in DDR controller, not SEA or SEI exception. I still do not consider the GSIV.
For GSIV, may be we can only handle it in the host OS.
>
> What happens if we start swapping a page of guest memory to disk, and discover
> the memory is corrupt. This is synchronous, but it wasn't the guest, and KVM
> still can't give you an ESR.
I think this Error is reported by IRQ(GSIV), GSIV is not SEA/SEI, we should not give the ESR to them.
>
> What about CPER records discovered through the polled interface? What happens if
> I write a PFN into the corrupt-pfn sysfs interface?
I do not understand this question.
I think in the process it should report SEA notification when CPU consume the error page.
>
>
> I think what you need is some way of knowing if the BUS_MCEERR_A* was directly
> caused by a user-space (or guest) access, and if so was it a data or instruction
when user space received the signal, it will judge whether the memory address is user-space (or guest) address
> fetch. These can become SEA notifications.
In fact, it can be SEI, not always SEA, why it will always SEA notifications?
If the memory properties of data is device type, it may become SEI notification.
>
> KVM's user-space shouldn't be a special-case where the kernel behaves
> differently: if we tinker with this it needs to make sense for all user space
> processes and mean something on all architectures.
>
> I think this information could be useful to other users of these signals, e.g. a
> JVM could silently regenerate/reload code/data for a non-direct-access fault
> instead of exit-ing (or throwing an exception) for a direct access.
>
> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by an
> access or not. When the mm code gets -EHWPOISON when trying to resolve a
Because of that, so I allow userspace getting exception information
> user-space fault we know it was due to a direct-access. (I don't know if/how x86
> can know if it was code or data). Faulting guest accesses through KVM are just a
> special version of this where KVM fixes-up stage2.
>
> ... but for any of this to work we need the address of the corrupt memory.
> (-> cover letter)
>
>
> Thanks,
>
> James
>
> .
>
On 2017/9/8 0:31, James Morse wrote:
> Hi Dongjiu Geng,
>
> On 28/08/17 11:38, Dongjiu Geng wrote:
>> ARMv8.2 adds a new bit HCR_EL2.TEA which controls to
>> route synchronous external aborts to EL2, and adds a
>> trap control bit HCR_EL2.TERR which controls to
>> trap all Non-secure EL1&0 error record accesses to EL2.
>>
>> This patch enables the two bits for the guest OS.
>> when an synchronous abort is generated in the guest OS,
>
>> it will trap to EL3 firmware, EL3 firmware will check the
>> HCR_EL2.TEA value to decide to jump to hypervisor or host
>> OS.
>
> (This is what you are using this for, the patch has nothing to do with EL3.)
No, EL3 will check the HCR_EL2.TEA to decide to jump to hypervisor or host kernel.
>
>
>> Enabling HCR_EL2.TERR makes error record access
>> from guest trap to EL2.
>
>
> KVM already handles external aborts from lower exception levels, no more work
> needs doing for TEA.
when SCR_EL3.EA is set, TEA will not workable, El3 only check its value to decide to hypervisor or EL1 host kernel.
>
> What happens when a guest access the RAS-Error-Record registers?
it will trap to EL2 kvm
>
> Before we can set HCR_EL2.TERR I think we need to add some minimal emulation for
> the registers it traps. Most of them should be RAZ/WI, so it should be
> straightforward. (I think KVMs default is to emulate an undef for unknown traps).
if KVM default handling is to emulate an undef for unknown traps, how about we use its default way? because no one access
the ERR RAS register in the guest .
>
> Eventually we will want to back this with a page of memory that lets
> Qemu/kvmtool configure what the guest can see. (i.e. the emulated machine's
> errors for kernel-first handling.)
I think emulate it to an undef for unknown traps can be enough, no one access the ERR register in the guest.
>
>
> Thanks,
>
> James
>
> .
>
James,
On 2017/9/8 0:31, James Morse wrote:
> KVM already handles external aborts from lower exception levels, no more work
> needs doing for TEA.
If it is firmware first solution, that is SCR_EL3.EA=1, all SError interrupt and synchronous External
Abort exceptions are taken to EL3, so EL3 firmware will handle it, KVM no needs to handle it.
HCR_EL3.TEA is only for EL3 to check its value to decide to jump to hypervisor or kernel.
>
> What happens when a guest access the RAS-Error-Record registers?
>
> Before we can set HCR_EL2.TERR I think we need to add some minimal emulation for
> the registers it traps. Most of them should be RAZ/WI, so it should be
> straightforward. (I think KVMs default is to emulate an undef for unknown traps).
Today I added the support to do some minimal emulation for RAS-Error-Record registers, thanks
for the good suggestion.
>
> Eventually we will want to back this with a page of memory that lets
> Qemu/kvmtool configure what the guest can see. (i.e. the emulated machine's
> errors for kernel-first handling.)
Hi gengdongjiu,
On 11/09/17 13:04, gengdongjiu wrote:
> On 2017/9/9 2:17, James Morse wrote:
>> On 04/09/17 12:43, gengdongjiu wrote:
>>> On 2017/9/1 1:50, James Morse wrote:
>>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>>> If you aren't handling the notification, why is this is in the HEST at all?
>>>> (and if its not: its not firmware-first)
>>
>>> For the SEI notification, may be we can parse and handle the CPER record other than the Error physical address
>>
>> Sure, but I only see this cleanup patch in this series, where does APEI learn
>> about NOTIFY_SEI? As this is nothing will ever touch those CPER records, if
>> you're using GHESv2 firmware will be prevented from delivering subsequent
>> notifications.
> James, whether it is possible you can review the previous v5 patch which adds the support for
Spreading 'current discussion' over two versions is a problem for anyone trying
to follow this series.
If you post a newer version its normal for people to delete the older versions.
When you post a new version you should be happy that its the latest and greatest.
> NOTIFY_SEI? thanks in advancecIn that patch, I share the SEI notification
handling with the SEA
> notification handling to avoid duplicated code.
You may be able to share some of the code, but I don't think you should share
the list of GHES between notification methods.
This leads to races between the firmware and OS: If CPU-A has received an SEI
firmware would have to avoid generating an SEA on CPU-B as the SEI-handler
running on CPU-A may find and process the second set of CPER records. CPU-B then
gets a spurious notification.
Why is this a problem? KVM needs to know if APEI handled the error, or whether
the Synchronous-External-Abort/SError-Interrupt was due to something else, in
which case we invoke todays default behaviour, which isn't appropriate for a RAS
event.
Thanks,
James
Hi gengdongjiu,
On 14/09/17 12:12, gengdongjiu wrote:
> On 2017/9/8 0:31, James Morse wrote:
>> KVM already handles external aborts from lower exception levels, no more work
>> needs doing for TEA.
> If it is firmware first solution, that is SCR_EL3.EA=1, all SError interrupt and synchronous External
> Abort exceptions are taken to EL3, so EL3 firmware will handle it, KVM no needs to handle it.
... and presumably your firmware generates a fake-Synchronous-external-abort to
hand to EL2 as an APEI SEA notification? My point: this is fine, KVM already
handles synchronous-external aborts, no more work needed for this trap, (in
contrast to the TERR, which you've fixed)
> HCR_EL3.TEA is only for EL3 to check its value to decide to jump to hypervisor or kernel.
HCR_EL3!?!
>> What happens when a guest access the RAS-Error-Record registers?
>>
>> Before we can set HCR_EL2.TERR I think we need to add some minimal emulation for
>> the registers it traps. Most of them should be RAZ/WI, so it should be
>> straightforward. (I think KVMs default is to emulate an undef for unknown traps).
> Today I added the support to do some minimal emulation for RAS-Error-Record registers, thanks
> for the good suggestion.
Thanks. Software has the bad habit of living much longer than we think, if KVM
traps part of the architecture then we have to emulate it... Some bright spark
might boot a future Linux-v4.42 guest on a Linux-v4.16 host.
I had a run through the RAS spec: if we make ERRIDR_EL1 RAZ/WI then we can do
the same with ERRSELR_EL1. Then following the rules for 'If ERRSELR_EL1.SEL is
[>=] ERRIDR_EL1.NUM' that makes the ERX* registers RAZ/WI too.
>> Eventually we will want to back this with a page of memory that lets
>> Qemu/kvmtool configure what the guest can see. (i.e. the emulated machine's
>> errors for kernel-first handling.)
Thanks,
James
On 2017/9/14 20:35, James Morse wrote:
>> James, whether it is possible you can review the previous v5 patch which adds the support for
> Spreading 'current discussion' over two versions is a problem for anyone trying
> to follow this series.
>
> If you post a newer version its normal for people to delete the older versions.
> When you post a new version you should be happy that its the latest and greatest.
>
>
James, today I paste the new version here, thanks.
https://patchwork.kernel.org/patch/9952495/
https://patchwork.kernel.org/patch/9950955/
Hi gengdongjiu,
(re-ordered hunks)
On 13/09/17 08:32, gengdongjiu wrote:
> On 2017/9/8 0:30, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
>> an access or not.
Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via
some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
x86's kernel-first handling, which nicely matches this 'direct access' problem.
BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc
also triggers these directly, both from what look to be synchronous paths, so I
think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO
to something_else.
I don't think we need anything else.
>> When the mm code gets -EHWPOISON when trying to resolve a
>
> Because of that, so I allow userspace getting exception information
... and there are cases where you can't get the exception information, and other
cases where it wasn't an exception at all.
[...]
>> What happens if the dram-scrub hardware spots an error in guest memory, but
>> the guest wasn't running? KVM won't have a relevant ESR value to give you.
> if the dram-scrub hardware spots an error in guest memory, it will generate
> IRQ in DDR controller, not SEA or SEI exception. I still do not consider the
> GSIV. For GSIV, may be we can only handle it in the host OS.
Great example: this IRQ pulls us out of a guest, we tromp through APEI and then
memory_failure(), the memory happened to belong to the same guest
(coincidence!), we send it some signal and now its user-space's problem.
Your KVM_REG_ARM64_FAULT mechanism is going to return stale data, even though
the notification interrupted the guest, and it was guest memory that was
affected. KVM doesn't have a relevant ESR.
I'm strongly against exposing 'which notification type' this error originally
came from because:
* it doesn't matter once we've got the CPER records,
* there isn't always an answer (there are/will-be other ways of tripping
memory_failure())
* it creates ABI between firwmare, host userspace and guest userspace.
Firmware's choice of notification type shouldn't affect anything other than
the host kernel.
On 13/09/17 08:32, gengdongjiu wrote:
> On 2017/9/8 0:30, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> when userspace gets SIGBUS signal, it does not know whether
>>> this is a synchronous external abort or SError,
>>
>> Why would Qemu/kvmtool need to know if the original notification (if there was
>> one) was synchronous or asynchronous? This is between firmware and the kernel.
> there are two reasons:
>
> 1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors.
> 2. when record the CPER in the user space, it needs to know the error type, because SEA and SEI are different Error source,
> so they have different offset in the APEI table, that is to say they will be recorded to different place of the APEI table.
user-space can choose whether to use SEA or SEI, it doesn't have to choose the
same notification type that firmware used, which in turn doesn't have to be the
same as that used by the CPU to notify firmware.
The choice only matters because these notifications hang on an existing pieces
of the Arm-architecture, so the notification can only add to the architecturally
defined meaning. (i.e. You can only send an SEA for something that can already
be described as a synchronous external abort).
Once we get to user-space, for memory_failure() notifications, (which so far is
all we are talking about here), the only thing that could matter is whether the
guest hit a PG_hwpoison page as a stage2 fault. These can be described as
Synchronous-External-Abort.
The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
because it can't always make an error synchronous. For memory_failure()
notifications to a KVM guest we really can do this, and we already have this
behaviour for free. An example:
A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
put the world back together to make this a synchronous exception, so it reports
it to firmware as an SError-interrupt.
Linux gets an APEI notification and memory_failure() causes the affected page to
be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
action optional, probably asynchronous.
But in our example it wasn't really asynchronous, that was just a property of
the original CPU->firmware notification. What happens? The guest vcpu is re-run,
it re-runs the same instructions (this was a contained error so KVM's ELR points
at/before the instruction that steps in the problem). This time KVM takes a
stage2 fault, which the mm code will refuse to fixup because the relevant page
was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with
SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.
> etc/acpi/tables etc/hardware_errors
> ==================== ==========================================
> + +--------------------------+ +------------------+
> | | HEST | | address | +--------------+
> | +--------------------------+ | registers | | Error Status |
> | | GHES0 | | +----------------+ | Data Block 0 |
> | +--------------------------+ +--------->| |status_address0 |------------->| +------------+
> | | ................. | | | +----------------+ | | CPER |
> | | error_status_address-----+-+ +------->| |status_address1 |----------+ | | CPER |
> | | ................. | | | +----------------+ | | | .... |
> | | read_ack_register--------+-+ | | ............. | | | | CPER |
> | | read_ack_preserve | | | +------------------+ | | +-+------------+
> | | read_ack_write | | | +----->| |status_address10|--------+ | | Error Status |
> + +--------------------------+ | | | | +----------------+ | | | Data Block 1 |
> | | GHES1 | +-+-+----->| | ack_value0 | | +-->| +------------+
> + +--------------------------+ | | | +----------------+ | | | CPER |
> | | ................. | | | +--->| | ack_value1 | | | | CPER |
> | | error_status_address-----+---+ | | | +----------------+ | | | .... |
> | | ................. | | | | | ............. | | | | CPER |
> | | read_ack_register--------+-----+-+ | +----------------+ | +-+------------+
> | | read_ack_preserve | | +->| | ack_value10 | | | |.......... |
> | | read_ack_write | | | | +----------------+ | | +------------+
> + +--------------------------| | | | | Error Status |
> | | ............... | | | | | Data Block 10|
> + +--------------------------+ | | +---->| +------------+
> | | GHES10 | | | | | CPER |
> + +--------------------------+ | | | | CPER |
> | | ................. | | | | | .... |
> | | error_status_address-----+-----+ | | | CPER |
> | | ................. | | +-+------------+
> | | read_ack_register--------+---------+
> | | read_ack_preserve |
> | | read_ack_write |
> + +--------------------------+
>
(nice ascii art!)
>> I think I can see why you need this: to choose whether to emulate SEA or SEI,
> emulating SEA or SEI is one reason, another reason is that the CPER will be recorded to different place of APEI.
(This doesn't matter: Generate the CPER records after you've chosen the
notification and this isn't a problem. Or map your 'Error Status Data Blocks'
to status_address* depending on usage not in a fixed 1:1 way)
>> I think what you need is some way of knowing if the BUS_MCEERR_A* was directly
>> caused by a user-space (or guest) access, and if so was it a data or instruction
> when user space received the signal, it will judge whether the memory address is user-space (or guest) address
>> fetch. These can become SEA notifications.
> In fact, it can be SEI, not always SEA, why it will always SEA notifications?
> If the memory properties of data is device type, it may become SEI notification.
Let's take a step back: in what scenario should we use an emulated-SEA instead
of an emulated-SEI? (forget what the CPU and firmware did, this is up to Qemu to
decide).
It can use SEA if this is a valid Synchronous-external-abort. Stage 2 faults are
synchronous exceptions, if you hit a PG_hwpoision page on this path you can
report this back to the guest as a Synchronous-external-abort/SEA.
How do you know? You get SIGBUS_MCEERR_AR from KVM.
Thanks,
James
James,
Thanks for your comments, hope we can make the solution better.
On 2017/9/14 21:00, James Morse wrote:
> Hi gengdongjiu,
>
> (re-ordered hunks)
>
> On 13/09/17 08:32, gengdongjiu wrote:
>> On 2017/9/8 0:30, James Morse wrote:
>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
>>> an access or not.
>
> Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via
> some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
> x86's kernel-first handling, which nicely matches this 'direct access' problem.
> BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc
> also triggers these directly, both from what look to be synchronous paths, so I
> think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO
> to something_else.
James, thanks for your explanation.
can I understand that your meaning that "BUS_MCEERR_AR" stands for synchronous access and BUS_MCEERR_AO stands for asynchronous access?
Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data access(SError) and PCIE AER error?
In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", we use SEA notification type for the guest;
if it is "BUS_MCEERR_AO", we use SEI notification type for the guest.
Because there are only two values for si_code("BUS_MCEERR_AR" and BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type?
>
> I don't think we need anything else.
>
>
>>> When the mm code gets -EHWPOISON when trying to resolve a
>>
>> Because of that, so I allow userspace getting exception information
>
> ... and there are cases where you can't get the exception information, and other
> cases where it wasn't an exception at all.
>
> [...]
>
>
>>> What happens if the dram-scrub hardware spots an error in guest memory, but
>>> the guest wasn't running? KVM won't have a relevant ESR value to give you.
>
>> if the dram-scrub hardware spots an error in guest memory, it will generate
>> IRQ in DDR controller, not SEA or SEI exception. I still do not consider the
>> GSIV. For GSIV, may be we can only handle it in the host OS.
>
> Great example: this IRQ pulls us out of a guest, we tromp through APEI and then
> memory_failure(), the memory happened to belong to the same guest
> (coincidence!), we send it some signal and now its user-space's problem.
>
> Your KVM_REG_ARM64_FAULT mechanism is going to return stale data, even though
> the notification interrupted the guest, and it was guest memory that was
> affected. KVM doesn't have a relevant ESR.
>
>
> I'm strongly against exposing 'which notification type' this error originally
> came from because:
> * it doesn't matter once we've got the CPER records,
> * there isn't always an answer (there are/will-be other ways of tripping
> memory_failure())
> * it creates ABI between firwmare, host userspace and guest userspace.
> Firmware's choice of notification type shouldn't affect anything other than
> the host kernel.
>
>
> On 13/09/17 08:32, gengdongjiu wrote:
>> On 2017/9/8 0:30, James Morse wrote:
>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>>> when userspace gets SIGBUS signal, it does not know whether
>>>> this is a synchronous external abort or SError,
>>>
>>> Why would Qemu/kvmtool need to know if the original notification (if there was
>>> one) was synchronous or asynchronous? This is between firmware and the kernel.
>
>> there are two reasons:
>>
>> 1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors.
>> 2. when record the CPER in the user space, it needs to know the error type, because SEA and SEI are different Error source,
>> so they have different offset in the APEI table, that is to say they will be recorded to different place of the APEI table.
>
> user-space can choose whether to use SEA or SEI, it doesn't have to choose the
> same notification type that firmware used, which in turn doesn't have to be the
> same as that used by the CPU to notify firmware.
>
> The choice only matters because these notifications hang on an existing pieces
> of the Arm-architecture, so the notification can only add to the architecturally
> defined meaning. (i.e. You can only send an SEA for something that can already
> be described as a synchronous external abort).
>
> Once we get to user-space, for memory_failure() notifications, (which so far is
> all we are talking about here), the only thing that could matter is whether the
> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
> Synchronous-External-Abort.
>
> The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
> because it can't always make an error synchronous. For memory_failure()
> notifications to a KVM guest we really can do this, and we already have this
> behaviour for free. An example:
>
> A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
> put the world back together to make this a synchronous exception, so it reports
> it to firmware as an SError-interrupt.
> Linux gets an APEI notification and memory_failure() causes the affected page to
> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
>
> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
> action optional, probably asynchronous.
If so, in this case, Qemu/kvmtool only got a little information(receive a SIGBUS), for this SIGBUS,
it only include the SIGBUS_MCEERR_AO, error address. not include other information,
only according the SIGBUS_MCEERR_AO and error address, user space does not know whether to use IRQ or POLLed notification.
for example, SIGBUS_MCEERR_AO means asynchronous access, user space can use SEI, IRQ or POLLed notification.
so user space will be confused to use which method.
I think if we use such solution, user space only judging SIGBUS_MCEERR_A* is not enough.
how we provide other extra information to let it choose the proper notification?
>
> But in our example it wasn't really asynchronous, that was just a property of
> the original CPU->firmware notification. What happens? The guest vcpu is re-run,
> it re-runs the same instructions (this was a contained error so KVM's ELR points
> at/before the instruction that steps in the problem). This time KVM takes a
> stage2 fault, which the mm code will refuse to fixup because the relevant page
> was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with
> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.
>
>
>
>
>> etc/acpi/tables etc/hardware_errors
>> ==================== ==========================================
>> + +--------------------------+ +------------------+
>> | | HEST | | address | +--------------+
>> | +--------------------------+ | registers | | Error Status |
>> | | GHES0 | | +----------------+ | Data Block 0 |
>> | +--------------------------+ +--------->| |status_address0 |------------->| +------------+
>> | | ................. | | | +----------------+ | | CPER |
>> | | error_status_address-----+-+ +------->| |status_address1 |----------+ | | CPER |
>> | | ................. | | | +----------------+ | | | .... |
>> | | read_ack_register--------+-+ | | ............. | | | | CPER |
>> | | read_ack_preserve | | | +------------------+ | | +-+------------+
>> | | read_ack_write | | | +----->| |status_address10|--------+ | | Error Status |
>> + +--------------------------+ | | | | +----------------+ | | | Data Block 1 |
>> | | GHES1 | +-+-+----->| | ack_value0 | | +-->| +------------+
>> + +--------------------------+ | | | +----------------+ | | | CPER |
>> | | ................. | | | +--->| | ack_value1 | | | | CPER |
>> | | error_status_address-----+---+ | | | +----------------+ | | | .... |
>> | | ................. | | | | | ............. | | | | CPER |
>> | | read_ack_register--------+-----+-+ | +----------------+ | +-+------------+
>> | | read_ack_preserve | | +->| | ack_value10 | | | |.......... |
>> | | read_ack_write | | | | +----------------+ | | +------------+
>> + +--------------------------| | | | | Error Status |
>> | | ............... | | | | | Data Block 10|
>> + +--------------------------+ | | +---->| +------------+
>> | | GHES10 | | | | | CPER |
>> + +--------------------------+ | | | | CPER |
>> | | ................. | | | | | .... |
>> | | error_status_address-----+-----+ | | | CPER |
>> | | ................. | | +-+------------+
>> | | read_ack_register--------+---------+
>> | | read_ack_preserve |
>> | | read_ack_write |
>> + +--------------------------+
>>
>
> (nice ascii art!)
>
>>> I think I can see why you need this: to choose whether to emulate SEA or SEI,
>
>> emulating SEA or SEI is one reason, another reason is that the CPER will be recorded to different place of APEI.
>
> (This doesn't matter: Generate the CPER records after you've chosen the
> notification and this isn't a problem. Or map your 'Error Status Data Blocks'
> to status_address* depending on usage not in a fixed 1:1 way)
>
>
>>> I think what you need is some way of knowing if the BUS_MCEERR_A* was directly
>>> caused by a user-space (or guest) access, and if so was it a data or instruction
>
>> when user space received the signal, it will judge whether the memory address is user-space (or guest) address
>
>>> fetch. These can become SEA notifications.
>
>> In fact, it can be SEI, not always SEA, why it will always SEA notifications?
>> If the memory properties of data is device type, it may become SEI notification.
>
> Let's take a step back: in what scenario should we use an emulated-SEA instead
> of an emulated-SEI? (forget what the CPU and firmware did, this is up to Qemu to
> decide).
>
> It can use SEA if this is a valid Synchronous-external-abort. Stage 2 faults are
> synchronous exceptions, if you hit a PG_hwpoision page on this path you can
> report this back to the guest as a Synchronous-external-abort/SEA.
> How do you know? You get SIGBUS_MCEERR_AR from KVM.
I understand your idea, I agree we can use SEA notification for guest if it is SIGBUS_MCEERR_AR.
I am worried about the SIGBUS_MCEERR_AO.
if it is SIGBUS_MCEERR_AO, we can choose SEI, IRQ or POLLed notification. so according to what principles to choose that?
In my platform, there is another issue.
for the stage2 fault, we get the IPA from the HPFAR_EL2 register,
but for huawei's CPU, if it is data Error(DFSC[5:0] is 0b010000), not translation error(DFSC[5:0] is 0b0101xx),
the HPFAR_EL2 is NULL, so the IPA is not recorded, in our current KVM code, we get the IPA from the HPFAR_EL2, so
we can not get the right IPA value, because its value is zero.I do not know whether you have same issue.
Although hpfar_el2 does not record IPA, but host firmware can still record the PA
If call memory_failure(), memory_failure can translate the PA to host VA, then deliver host VA to Qemu. Qemu can translate the host VA to IPA.
so we rely on memory_failure() to get the IPA.
>
>
> Thanks,
>
> James
>
> .
>
Hi James
On 2017/9/14 21:00, James Morse wrote:
> Hi gengdongjiu,
> user-space can choose whether to use SEA or SEI, it doesn't have to choose the
> same notification type that firmware used, which in turn doesn't have to be the
> same as that used by the CPU to notify firmware.
>
> The choice only matters because these notifications hang on an existing pieces
> of the Arm-architecture, so the notification can only add to the architecturally
> defined meaning. (i.e. You can only send an SEA for something that can already
> be described as a synchronous external abort).
>
> Once we get to user-space, for memory_failure() notifications, (which so far is
> all we are talking about here), the only thing that could matter is whether the
> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
> Synchronous-External-Abort.
>
> The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
> because it can't always make an error synchronous. For memory_failure()
> notifications to a KVM guest we really can do this, and we already have this
> behaviour for free. An example:
>
> A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
> put the world back together to make this a synchronous exception, so it reports
> it to firmware as an SError-interrupt.
> Linux gets an APEI notification and memory_failure() causes the affected page to
> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
>
> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
> action optional, probably asynchronous.
>
> But in our example it wasn't really asynchronous, that was just a property of
> the original CPU->firmware notification. What happens? The guest vcpu is re-run,
> it re-runs the same instructions (this was a contained error so KVM's ELR points
> at/before the instruction that steps in the problem). This time KVM takes a
> stage2 fault, which the mm code will refuse to fixup because the relevant page
> was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with
> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.
CC Achin
I have some personal opinion, if you think it is not right, hope you can point out.
Synchronous External Abort and SError Interrupt are hardware exception(hardware concept), which is independent of software notification,
in armv8 without RAS, the two concepts already exist. In the APEI spec, in order to better describe the two exceptions, so use SEA and SEI notification to stand for them.
SEA notification stands for Synchronous External Abort, so may be it is not only a notification, it also stands for a hardware error type.
SEI notification stands for SError Interrupt, so may be it is not only a notification, it also stands for a hardware error type.
In the OS, it has different handling flow to the two exception(two notification):
when the guest OS running, if the hardware generates a Synchronous External Abort, we told the guest OS this error is SError Interrupt instead of Synchronous External Abort.
guest OS uses SEI notification handling flow to deal with it, I am not sure whether it will have problem, because the true hardware exception is Synchronous External Abort,
but software treats it as SError interrupt to handle.
In the mainline code, it does not have SEI notification support, the reason I think it is because of the error address record by firmware is not accurate(SError Interrupt is asynchronous exception).
so if treat a hardware Synchronous External Abort as SError interrupt(SEI). The default OS behavior for SEI is PANIC, that is to say, when hardware triggers a Synchronous External Abort(SEA), if guest
treat it as SError interrupt(SEI), the OS will be panic. in fact, it can be recoverable instead of Panic.
I ever added a patch to support the SEI notification, but not sure whether it is can be accepted by open source, until now, not receive response.
Hi gengdongjiu,
On 18/09/17 14:36, gengdongjiu wrote:
> On 2017/9/14 21:00, James Morse wrote:
>> On 13/09/17 08:32, gengdongjiu wrote:
>>> On 2017/9/8 0:30, James Morse wrote:
>>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
>>>> an access or not.
>>
>> Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via
>> some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
>> x86's kernel-first handling, which nicely matches this 'direct access' problem.
>> BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc
>> also triggers these directly, both from what look to be synchronous paths, so I
>> think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO
>> to something_else.
>
> James, thanks for your explanation.
> can I understand that your meaning that "BUS_MCEERR_AR" stands for synchronous access and BUS_MCEERR_AO stands for asynchronous access?
Not 'stands for', as the AR is Action-Required and AO Action-Optional. My point
was I can't find a case where Action-Required is used for an error that isn't
synchronous.
We should run this past the people who maintain the existing BUS_MCEERR_AR
users, in case its just a severity to them.
> Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data access(SError) and PCIE AER error?
How would userspace get one of these memory errors for a PCIe error?
> In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", we use SEA notification type for the guest;
> if it is "BUS_MCEERR_AO", we use SEI notification type for the guest.
> Because there are only two values for si_code("BUS_MCEERR_AR" and BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type?
This is for Qemu/kvmtool to decide, it depends on what sort of machine they are
emulating.
For example, the physical machine's memory-controller may notify the CPU about
memory errors by triggering SError trapped to EL3, or with a dedicated FIQ, also
routed to EL3. By the time this gets to the host kernel the distinction doesn't
matter. The host has handled the error.
For a guest, your memory-controller is effectively the host kernel. It will give
you an BUS_MCEERR_AO signal for any guest memory that is affected, and a
BUS_MCEERR_AR if the guest directly accesses a page of affected memory.
What Qemu/kvmtool do with this is up to them. If they're emulating a machine
with no RAS features, printing an error and exit.
Otherwise BUS_MCEERR_AR could be notified as one of the flavours of IRQ, unless
the affected vcpu has interrupts masked, in which case an SEA notification gives
you some NMI-like behaviour.
For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My choice would
be IRQ, as you can't know if the guest supports SEI and it would be a shame to
kill it with an SError if the affected memory was free. SEA for synchronous
errors is still a good choice even if the guest doesn't support it as that
memory is still gone so its still a valid guest:Synchronous-external-abort.
[...]
>>> 1. Let us firstly discuss the SEA and SEI, there are different workflow for the two different Errors.
>> user-space can choose whether to use SEA or SEI, it doesn't have to choose the
>> same notification type that firmware used, which in turn doesn't have to be the
>> same as that used by the CPU to notify firmware.
>>
>> The choice only matters because these notifications hang on an existing pieces
>> of the Arm-architecture, so the notification can only add to the architecturally
>> defined meaning. (i.e. You can only send an SEA for something that can already
>> be described as a synchronous external abort).
>>
>> Once we get to user-space, for memory_failure() notifications, (which so far is
>> all we are talking about here), the only thing that could matter is whether the
>> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
>> Synchronous-External-Abort.
>>
>> The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
>> because it can't always make an error synchronous. For memory_failure()
>> notifications to a KVM guest we really can do this, and we already have this
>> behaviour for free. An example:
>>
>> A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
>> put the world back together to make this a synchronous exception, so it reports
>> it to firmware as an SError-interrupt.
>
>> Linux gets an APEI notification and memory_failure() causes the affected page to
>> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
>>
>> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
>> action optional, probably asynchronous.
> If so, in this case, Qemu/kvmtool only got a little information(receive a SIGBUS), for this SIGBUS,
> it only include the SIGBUS_MCEERR_AO, error address. not include other information,
> only according the SIGBUS_MCEERR_AO and error address, user space does not know whether to use IRQ or POLLed notification.
The kernel can't tell it which to use: user space has to decide. This has to be
a property of the machine you are emulating, not the machine you happen to be
running on.
What happens if the notification came using a future notification type that user
space doesn't know about.
What if user space does know about this type, but the guest doesn't.
What if you migrate to a machine that uses a new notification type that you
didn't advertise to the guest via the HEST at boot time.
These dependencies have to break somewhere, and the right place is between the
host kernel and host user-space. This way whatever Qemu/kvmtool do will work in
the above 'what-ifs'.
> for example, SIGBUS_MCEERR_AO means asynchronous access, user space can use SEI, IRQ or POLLed notification.
> so user space will be confused to use which method.
There isn't a wrong choice here. I suggest always-use-IRQ. Its faster than
POLLed, but won't kill a guest that doesn't support NOTIFY_SEI.
> I think if we use such solution, user space only judging SIGBUS_MCEERR_A* is not enough.
> how we provide other extra information to let it choose the proper notification?
Forget the original notification. This physical machine's hardware configuration
and how its memory controller is wired up to report errors should not be
relevant to Qemu/kvmtool.
You need to decide how your emulated platform reports errors, you may want to
make it a configuration option which defaults to something safe.
[...]
> In my platform, there is another issue.
> for the stage2 fault, we get the IPA from the HPFAR_EL2 register,
> but for huawei's CPU, if it is data Error(DFSC[5:0] is 0b010000),
'Synchronous External Abort, not on a translation table walk'
> not translation error(DFSC[5:0] is 0b0101xx),
(the set of external abort, parity or ECC errors that we get from the
page-table-walker)
> the HPFAR_EL2 is NULL, so the IPA is not recorded, in our current KVM code, we get the IPA from the HPFAR_EL2, so
> we can not get the right IPA value, because its value is zero.I do not know whether you have same issue.
This is something the ARM-ARM allows, so we have to live with it in software.
For external aborts the ESR has a 'FnV' bit 10 that for your first DSFSC
'Synchronous External Abort, not on a translation table walk' indicates there is
no FAR, (or presumably HPFAR). I assume you have this bit set in the ESR.
This shouldn't be a problem, for firmware-first we should take the address from
the CPER records as this also gives us a range. For kernel-first we'd take
whatever is in the v8.2 RAS ERR records. Its only if this wasn't a RAS error
that we're likely to print out this address as we kill-the-task/panic.
> Although hpfar_el2 does not record IPA, but host firmware can still record the PA
I agree, it can get the PA from the v8.2 RAS ERR registers and hand it to the OS
using CPER.
> If call memory_failure(), memory_failure can translate the PA to host VA, then deliver
> host VA to Qemu.
Yes, this is how it works for any user-space process as two processes sharing
the same page may map it in different locations.
> Qemu can translate the host VA to IPA. so we rely on memory_failure() to get
> the IPA.
Yes. I don't see why this is a problem: The kernel isn't going to pass RAS
events into the guest, so it never needs to know the IPA.
Instead we notify user-space about ranges of memory affected by
memory_failure(), KVM's user-space isn't a special case here.
As you point out, if Qemu wants to notify the guest it can calculate the IPA and
either use CPER for firmware-first, or in the future, update some representation
of the v8.2 ERR records once we can virtualise kernel-first.
(I'm not sure I understand your point here, but I don't think we disagree,)
Thanks,
James
Hi gengdongjiu,
On 21/09/17 08:55, gengdongjiu wrote:
> On 2017/9/14 21:00, James Morse wrote:
>> user-space can choose whether to use SEA or SEI, it doesn't have to choose the
>> same notification type that firmware used, which in turn doesn't have to be the
>> same as that used by the CPU to notify firmware.
>>
>> The choice only matters because these notifications hang on an existing pieces
>> of the Arm-architecture, so the notification can only add to the architecturally
>> defined meaning. (i.e. You can only send an SEA for something that can already
>> be described as a synchronous external abort).
>>
>> Once we get to user-space, for memory_failure() notifications, (which so far is
>> all we are talking about here), the only thing that could matter is whether the
>> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
>> Synchronous-External-Abort.
>>
>> The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
>> because it can't always make an error synchronous. For memory_failure()
>> notifications to a KVM guest we really can do this, and we already have this
>> behaviour for free. An example:
>>
>> A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
>> put the world back together to make this a synchronous exception, so it reports
>> it to firmware as an SError-interrupt.
>> Linux gets an APEI notification and memory_failure() causes the affected page to
>> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
>>
>> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
>> action optional, probably asynchronous.
>>
>> But in our example it wasn't really asynchronous, that was just a property of
>> the original CPU->firmware notification. What happens? The guest vcpu is re-run,
>> it re-runs the same instructions (this was a contained error so KVM's ELR points
>> at/before the instruction that steps in the problem). This time KVM takes a
>> stage2 fault, which the mm code will refuse to fixup because the relevant page
>> was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with
>> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.
>
> CC Achin
>
> I have some personal opinion, if you think it is not right, hope you can point out.
>
> Synchronous External Abort and SError Interrupt are hardware exception(hardware concept),
> which is independent of software notification,
> in armv8 without RAS, the two concepts already exist. In the APEI spec, in order to
> better describe the two exceptions, so use SEA and SEI notification to stand
for them.
> SEA notification stands for Synchronous External Abort, so may be it is not only a
> notification, it also stands for a hardware error type.
> SEI notification stands for SError Interrupt, so may be it is not only a notification,
> it also stands for a hardware error type.
> In the OS, it has different handling flow to the two exception(two notification):
> when the guest OS running, if the hardware generates a Synchronous External Abort, we
> told the guest OS this error is SError Interrupt instead of Synchronous
External Abort.
This should only happen when APEI doesn't claim the external-abort as a RAS
notification. If there were CPER records to process then the error is handled by
the host, and we can return to the guest.
If this wasn't a firmware-first notification, then you're right KVM hands the
guest an asynchronous external abort. This could be considered a bug in KVM. (we
can discuss with Marc and Christoffer what it should do), but:
I'm not sure what scenario you could see this in: surely all your
CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
notifications. So they should always be claimed by APEI.
> guest OS uses SEI notification handling flow to deal with it, I am not sure whether it
> will have problem, because the true hardware exception is Synchronous External
> Abort, but software treats it as SError interrupt to handle.
Once you're into a guest the original 'true hardware exception' shouldn't
matter. In this scenario KVM has handed the guest an SError, our question is 'is
it an SEI notification?':
For firmware first the guest OS should poke around in the CPER buffers, find
nothing to do, and return to the arch code for (future) kernel-first.
For kernel first the guest OS should trawl through the v8.2 ERR registers, find
nothing to do, and continue to the default case:
By default, we should panic on SError, unless its classified as a non-fatal RAS
error. (I'm tempted to pr_warn_once() if we get RAS notifications but there is
no work to do).
What you may be seeing is some awkwardness with the change in the SError ESR
with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
were all impdef so it didn't matter).
With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
means 'classified as a RAS error ... unknown!'.
I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort
code to specify an impdef ESR for this path.
> In the mainline code, it does not have SEI notification support, the reason I
> think it is because of the error address record by firmware is not accurate
> (SError Interrupt is asynchronous exception).
Yes, while we don't expect a FAR with an SError, but we do expect a valid
representation of the RAS error in either the CPER records or the v8.2. ERR
registers (or both). If we have neither of those, its not a RAS error and we
should panic.
> so if treat a hardware Synchronous External Abort as SError interrupt(SEI).
> The default OS behavior for SEI is PANIC, that is to say, when hardware triggers
> a Synchronous External Abort(SEA), if guest treat it as SError interrupt(SEI),
> the OS will be panic. in fact, it can be recoverable instead of Panic.
If its a RAS error APEI (or in the future, the kernel-first handler), should
claim the error, so the guest never sees it. If you are hitting this behaviour
in KVM, then it wasn't a RAS error.
> I ever added a patch to support the SEI notification, but not sure whether
> it is can be accepted by open source, until now, not receive response.
The patch you posted during the merge window made no sense on its own, so must
replace $one_of the other patches in your v5 (or was it v6)... I'll get to it...
Because the SEI notification depends on v8.2 I'd like to get the SError/RAS
series posted (currently re-testing), then I'll pick up enough of the patches
you've posted for a consolidated version of the series, and we can take the
discussion from there.
I'd still like to know what your firmware does if the normal-world believes its
masked physical-SError and you want to hand it an SEI notification.
Thanks,
James
Hi James,
Sorry for my late response, thank you very much for comments.
On 2017/9/23 0:51, James Morse wrote:
[.....]
>>
>> CC Achin
>>
>> I have some personal opinion, if you think it is not right, hope you can point out.
>>
>> Synchronous External Abort and SError Interrupt are hardware exception(hardware concept),
>> which is independent of software notification,
>> in armv8 without RAS, the two concepts already exist. In the APEI spec, in order to
>> better describe the two exceptions, so use SEA and SEI notification to stand
> for them.
>
>> SEA notification stands for Synchronous External Abort, so may be it is not only a
>> notification, it also stands for a hardware error type.
>> SEI notification stands for SError Interrupt, so may be it is not only a notification,
>> it also stands for a hardware error type.
>
>> In the OS, it has different handling flow to the two exception(two notification):
>> when the guest OS running, if the hardware generates a Synchronous External Abort, we
>> told the guest OS this error is SError Interrupt instead of Synchronous
> External Abort.
>
> This should only happen when APEI doesn't claim the external-abort as a RAS
> notification. If there were CPER records to process then the error is handled by
> the host, and we can return to the guest.
consider again. I think you should be right.
In the firmware-first solution, firmware will shield all kinds of errors and record them to the CPER buffer.
>
> If this wasn't a firmware-first notification, then you're right KVM hands the
> guest an asynchronous external abort. This could be considered a bug in KVM. (we
> can discuss with Marc and Christoffer what it should do), but:
>
> I'm not sure what scenario you could see this in: surely all your
> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
> notifications. So they should always be claimed by APEI.
Yes, if it is firmware-first we should not exist such issue.
>
>
>> guest OS uses SEI notification handling flow to deal with it, I am not sure whether it
>> will have problem, because the true hardware exception is Synchronous External
>> Abort, but software treats it as SError interrupt to handle.
>
> Once you're into a guest the original 'true hardware exception' shouldn't
> matter. In this scenario KVM has handed the guest an SError, our question is 'is
> it an SEI notification?':
>
> For firmware first the guest OS should poke around in the CPER buffers, find
> nothing to do, and return to the arch code for (future) kernel-first.
> For kernel first the guest OS should trawl through the v8.2 ERR registers, find
> nothing to do, and continue to the default case:
>
> By default, we should panic on SError, unless its classified as a non-fatal RAS
> error. (I'm tempted to pr_warn_once() if we get RAS notifications but there is
> no work to do).
understand, thanks.
>
>
> What you may be seeing is some awkwardness with the change in the SError ESR
> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
> were all impdef so it didn't matter).
> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
> means 'classified as a RAS error ... unknown!'.
>
> I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort
> code to specify an impdef ESR for this path.
Before I remember Marc and you suggest me specify the an impdef ESR (set the vsesr_el2) in
the userspace,
now do you want to specify an impdef ESR in KVM instead of usrspace?
if setting the vsesr_el2 in the KVM, whether user-space need to specify?
May be we can combine the patches that specify an impdef ESR(set vsesr_el2) patch to one.
>
>
>> In the mainline code, it does not have SEI notification support, the reason I
>> think it is because of the error address record by firmware is not accurate
>> (SError Interrupt is asynchronous exception).
>
> Yes, while we don't expect a FAR with an SError, but we do expect a valid
> representation of the RAS error in either the CPER records or the v8.2. ERR
> registers (or both). If we have neither of those, its not a RAS error and we
> should panic.
>
>
>> so if treat a hardware Synchronous External Abort as SError interrupt(SEI).
>> The default OS behavior for SEI is PANIC, that is to say, when hardware triggers
>> a Synchronous External Abort(SEA), if guest treat it as SError interrupt(SEI),
>> the OS will be panic. in fact, it can be recoverable instead of Panic.
>
> If its a RAS error APEI (or in the future, the kernel-first handler), should
> claim the error, so the guest never sees it. If you are hitting this behaviour
> in KVM, then it wasn't a RAS error.
>
>
>> I ever added a patch to support the SEI notification, but not sure whether
>> it is can be accepted by open source, until now, not receive response.
>
> The patch you posted during the merge window made no sense on its own, so must
> replace $one_of the other patches in your v5 (or was it v6)... I'll get to it...
yes, thanks.
Because SEI notification support does not depend on the RAS virtualization, I break them out of this
series.
they are here(Today Tyler have some comments, I will update it)
https://patchwork.kernel.org/patch/9950953/
https://patchwork.kernel.org/patch/9952493/
>
> Because the SEI notification depends on v8.2 I'd like to get the SError/RAS
> series posted (currently re-testing), then I'll pick up enough of the patches
> you've posted for a consolidated version of the series, and we can take the
> discussion from there.
James, it is great, we can make a consolidated version of the series.
>
> I'd still like to know what your firmware does if the normal-world believes its
> masked physical-SError and you want to hand it an SEI notification.
firstly the physical-SError that happened in the EL2/EL1/EL1 can not be masked if SCR_EL3.EA is set.
when trap to EL3, firmware will record the error to APEI CPER from reading ERR* RAS registers.
(1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows this SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to ESR_EL2.
if the SError exception come from guest EL0 or EL1, set ELR_EL3 with VBAR_EL2 + 0x580(one EL2 SEI entry point),
execute "ERET", then jump to EL2 hypervisor.
(2)if the SError exception come EL2 hypervisor, copy the elr_el3 to elr_el2, copy ESR_El3 to ESR_EL, set ELR_EL3 with VBAR_EL2+0x380(one EL2 SEI entry point),
execute "ERET", then jump to EL2 hypervisor.
(2) if HCR_EL2.TEA is set to 0, SError come from host EL1 ,EL0, copy ELR_EL3 to ELR_EL1,ESR_EL3 to ESR_EL1
if the SError come from host EL0, set ELR_EL3 with VBAR_EL1+0x580
if the SError come from host EL1, set ELR_EL3 with VBAR_EL1+0x380
execute "ERET", then jump to host EL1.
>
> Thanks,
>
> James
>
> .
>
>> What you may be seeing is some awkwardness with the change in the SError ESR
>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
>> were all impdef so it didn't matter).
>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
>> means 'classified as a RAS error ... unknown!'.
>>
>> I have a patch in the upcoming SError/RAS series that changes KVMs virtual-abort
>> code to specify an impdef ESR for this path.
> Before I remember Marc and you suggest me specify the an impdef ESR (set the vsesr_el2) in
> the userspace,
I pasted Marc's propose and your suggestion that set VSESR_EL2(specify
virtual SError syndrome) by the user space.
https://lkml.org/lkml/2017/3/20/441
https://lkml.org/lkml/2017/3/20/516
> now do you want to specify an impdef ESR in KVM instead of usrspace?
> if setting the vsesr_el2 in the KVM, whether user-space need to specify?
> May be we can combine the patches that specify an impdef ESR(set vsesr_el2) patch to one.
>